15. Polars#
In addition to what’s in Anaconda, this lecture will need the following libraries:
!pip install --upgrade polars wbgapi yfinance pyarrow
15.1. Overview#
Polars is a fast data manipulation library for Python written in Rust.
Polars has gained significant popularity in recent years due to its superior performance compared to traditional data analysis tools.
This makes it an excellent choice for modern data science and machine learning workflows.
Polars is designed with performance and memory efficiency in mind, leveraging:
Arrow’s columnar memory format for fast data access
Lazy evaluation to optimize query execution
Parallel processing for enhanced performance
Expressive API similar to pandas but with better performance characteristics
Just as NumPy provides the basic array data type plus core array operations, Polars
defines fundamental structures for working with data and
endows them with methods that facilitate operations such as
reading in data
adjusting indices
working with dates and time series
sorting, grouping, re-ordering and general data munging [1]
dealing with missing values, etc.
More sophisticated statistical functionality is left to other packages, such as statsmodels and scikit-learn, which can work with Polars DataFrames through their interoperability with pandas.
This lecture will provide a basic introduction to Polars.
Tip
Why use Polars over pandas? One reason is performance: as a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A good article comparing Polars and pandas can be found in this JetBrains blog post.
Throughout the lecture, we will assume that the following imports have taken place
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Two important data types defined by Polars are Series and DataFrame.
You can think of a Series as a “column” of data, such as a collection of observations on a single variable.
A DataFrame is a two-dimensional object for storing related columns of data.
15.2. Series#
Let’s start with Series.
We begin by creating a series of four random observations
s = pl.Series(name='daily returns', values=np.random.randn(4))
s
daily returns |
---|
f64 |
1.125966 |
-0.594306 |
0.792461 |
0.452324 |
Note
You may notice the above series has no indices, unlike a pd.Series; this is because Polars is column-centric, and accessing data is predominantly managed through filtering and boolean masks; here is an interesting blog post discussing this in more detail.
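For instance, here is a minimal sketch of the two usual access patterns on the series s defined above (positional indexing and boolean filtering):
s[0]             # the first observation, accessed by position
s.filter(s > 0)  # keep only the positive observations via a boolean mask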
Polars Series are built on top of Apache Arrow arrays and support many operations similar to pandas Series.
(For interested readers, please see this extended reading on Apache Arrow.)
s * 100
daily returns |
---|
f64 |
112.596609 |
-59.430626 |
79.246067 |
45.232421 |
s.abs()
daily returns |
---|
f64 |
1.125966 |
0.594306 |
0.792461 |
0.452324 |
But Series provide more than basic arrays.
For example, they have some additional (statistically oriented) methods
s.describe()
statistic | value |
---|---|
str | f64 |
"count" | 4.0 |
"null_count" | 0.0 |
"mean" | 0.444111 |
"std" | 0.744905 |
"min" | -0.594306 |
"25%" | 0.452324 |
"50%" | 0.792461 |
"75%" | 0.792461 |
"max" | 1.125966 |
However, the pl.Series object cannot be used in the same way as a pd.Series when pairing data with indices.
For example, using a pd.Series you can do the following:
s = pd.Series(np.random.randn(4), name='daily returns')
s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
s
AMZN -1.090070
AAPL 0.484258
MSFT -0.313922
GOOG -1.967324
Name: daily returns, dtype: float64
However, in Polars you will need to use the DataFrame object to do the same task.
This means you will use the DataFrame object more often when using Polars if you are interested in relationships between data.
Let’s create a pl.DataFrame containing the equivalent data in the pd.Series.
df = pl.DataFrame({
'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
'daily returns': s.to_list()
})
df
company | daily returns |
---|---|
str | f64 |
"AMZN" | -1.09007 |
"AAPL" | 0.484258 |
"MSFT" | -0.313922 |
"GOOG" | -1.967324 |
To access specific values by company name, we can filter the DataFrame for the AMZN ticker code and select the daily returns.
df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
-1.0900699038656159
If we want to update the AMZN return to 0, we can use the following chain of methods.
Here with_columns is similar to select but adds columns to the same DataFrame
df = df.with_columns(
pl.when(pl.col('company') == 'AMZN') # filter for AMZN in company column
.then(0) # set values to 0
.otherwise(pl.col('daily returns')) # otherwise keep original value
.alias('daily returns') # assign back to the column
)
df
company | daily returns |
---|---|
str | f64 |
"AMZN" | 0.0 |
"AAPL" | 0.484258 |
"MSFT" | -0.313922 |
"GOOG" | -1.967324 |
You can check if a ticker code is in the company column
'AAPL' in df['company']
True
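If instead we want the full row for a given ticker, one option (a sketch of just one approach) is DataFrame.row with a predicate, which returns the matching row as a Python tuple:
df.row(by_predicate=(pl.col('company') == 'AAPL'))  # the matching row as a tuple; raises unless exactly one row matches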
15.3. DataFrames#
While a Series is a single column of data, a DataFrame is several columns, one for each variable.
In essence, a DataFrame in Polars is analogous to a (highly optimized) Excel spreadsheet.
Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
Let’s look at an example that reads data from the CSV file pandas/data/test_pwt.csv, which is taken from the Penn World Tables.
The dataset contains the following indicators:
Variable Name | Description |
---|---|
POP | Population (in thousands) |
XRAT | Exchange Rate to US Dollar |
tcgdp | Total PPP Converted GDP (in million international dollar) |
cc | Consumption Share of PPP Converted GDP Per Capita (%) |
cg | Government Consumption Share of PPP Converted GDP Per Capita (%) |
We’ll read this in from a URL using the Polars function read_csv.
URL = ('https://raw.githubusercontent.com/QuantEcon/'
'lecture-python-programming/master/source/_static/'
'lecture_specific/pandas/data/test_pwt.csv')
df = pl.read_csv(URL)
type(df)
polars.dataframe.frame.DataFrame
Here is the content of test_pwt.csv
df
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.9995 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 64.436451 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 227242.36949 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 282171.957 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
15.3.1. Select data by position#
In practice, one thing that we do all the time is to find, select and work with a subset of the data that we are interested in.
We can select particular rows using array slicing notation
df[2:5]
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 64.436451 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
To select columns, we can pass a list containing the names of the desired columns
df.select(['country', 'tcgdp'])
country | tcgdp |
---|---|
str | f64 |
"Argentina" | 295072.21869 |
"Australia" | 541804.6521 |
"India" | 1.7281e6 |
"Israel" | 129253.89423 |
"Malawi" | 5026.221784 |
"South Africa" | 227242.36949 |
"United States" | 9.8987e6 |
"Uruguay" | 25255.961693 |
To select both rows and columns using integers, we can combine slicing with column selection
df[2:5].select(df.columns[0:4])
country | country isocode | year | POP |
---|---|---|---|
str | str | i64 | f64 |
"India" | "IND" | 2000 | 1.0063e6 |
"Israel" | "ISR" | 2000 | 6114.57 |
"Malawi" | "MWI" | 2000 | 11801.505 |
To select rows and columns using a mixture of integers and labels, we can use more complex selection
df[2:5].select(['country', 'tcgdp'])
country | tcgdp |
---|---|
str | f64 |
"India" | 1.7281e6 |
"Israel" | 129253.89423 |
"Malawi" | 5026.221784 |
15.3.2. Select data by conditions#
Instead of indexing rows and columns using integers and names, we can also obtain a sub-DataFrame of interest that satisfies certain (potentially complicated) conditions.
This section demonstrates various ways to do that.
The most straightforward way is with the filter method.
df.filter(pl.col('POP') >= 20000)
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.9995 | 295072.21869 | 75.716805 | 5.578804 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 227242.36949 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 282171.957 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
In this case, df.filter() takes a boolean expression and only returns rows with True values.
We can view this boolean mask as a table with the alias meets_criteria
df.select(
pl.col('country'),
(pl.col('POP') >= 20000).alias('meets_criteria')
)
country | meets_criteria |
---|---|
str | bool |
"Argentina" | true |
"Australia" | false |
"India" | true |
"Israel" | false |
"Malawi" | false |
"South Africa" | true |
"United States" | true |
"Uruguay" | false |
Here is another example:
df.filter(
(pl.col('country').is_in(['Argentina', 'India', 'South Africa'])) &
(pl.col('POP') > 40000)
)
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 227242.36949 | 72.71871 | 5.726546 |
We can also use arithmetic operations between different columns in the filter condition.
df.filter(
(pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
)
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
For example, we can use the condition to select the country with the largest household consumption–GDP share cc.
df.filter(pl.col('cc') == pl.col('cc').max())
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
When we only want to look at certain columns of a selected sub-DataFrame, we can combine filter with select.
df.filter(
(pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
).select(['country', 'year', 'POP']
)
country | year | POP |
---|---|---|
str | i64 | f64 |
"Malawi" | 2000 | 11801.505 |
"Uruguay" | 2000 | 3219.793 |
Application: Subsetting DataFrame
Real-world datasets can be very large.
It is sometimes desirable to work with a subset of data to enhance computational efficiency and reduce redundancy.
Let’s imagine that we’re only interested in the population (POP) and total GDP (tcgdp).
One way to strip the data frame df down to only these variables is to overwrite the DataFrame using the selection method described above
df_subset = df.select(['country', 'POP', 'tcgdp'])
df_subset
country | POP | tcgdp |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 295072.21869 |
"Australia" | 19053.186 | 541804.6521 |
"India" | 1.0063e6 | 1.7281e6 |
"Israel" | 6114.57 | 129253.89423 |
"Malawi" | 11801.505 | 5026.221784 |
"South Africa" | 45064.098 | 227242.36949 |
"United States" | 282171.957 | 9.8987e6 |
"Uruguay" | 3219.793 | 25255.961693 |
We can then save the smaller dataset for further analysis.
df_subset.write_csv('pwt_subset.csv')
15.3.3. Apply and map operations#
Polars provides powerful methods for applying functions to data.
Instead of pandas’ apply method, Polars uses expressions within the select, with_columns, or filter methods.
Here is an example using built-in functions to find the max value for each column
df.select([
pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg'])
.max()
.name.suffix('_max')
])
year_max | POP_max | XRAT_max | tcgdp_max | cc_max | cg_max |
---|---|---|---|---|---|
i64 | f64 | f64 | f64 | f64 | f64 |
2000 | 1.0063e6 | 59.543808 | 9.8987e6 | 78.97874 | 14.072206 |
For more complex operations, we can use map_elements (similar to pandas’ apply):
df.select([
pl.col('country'),
pl.col('POP').map_elements(lambda x: x * 2).alias('POP_doubled')
])
/tmp/ipykernel_6418/1063562650.py:3: PolarsInefficientMapWarning:
Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
- pl.col("POP").map_elements(lambda x: ...)
with this one instead:
+ pl.col("POP") * 2
pl.col('POP').map_elements(lambda x: x * 2).alias('POP_doubled')
country | POP_doubled |
---|---|
str | f64 |
"Argentina" | 74671.306 |
"Australia" | 38106.372 |
"India" | 2.0126e6 |
"Israel" | 12229.14 |
"Malawi" | 23603.01 |
"South Africa" | 90128.196 |
"United States" | 564343.914 |
"Uruguay" | 6439.586 |
However, as you can see from the warning issued by Polars, there is often a better way to achieve this using the Polars API.
df.select([
pl.col('country'),
(pl.col('POP') * 2).alias('POP_doubled')
])
country | POP_doubled |
---|---|
str | f64 |
"Argentina" | 74671.306 |
"Australia" | 38106.372 |
"India" | 2.0126e6 |
"Israel" | 12229.14 |
"Malawi" | 23603.01 |
"South Africa" | 90128.196 |
"United States" | 564343.914 |
"Uruguay" | 6439.586 |
We can use complex filtering conditions with boolean logic:
complex_condition = (
pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
.then(pl.col('POP') > 40000)
.otherwise(pl.col('POP') < 20000)
)
df.filter(complex_condition).select([
'country', 'year', 'POP', 'XRAT', 'tcgdp'
])
country | year | POP | XRAT | tcgdp |
---|---|---|---|---|
str | i64 | f64 | f64 | f64 |
"Australia" | 2000 | 19053.186 | 1.72483 | 541804.6521 |
"India" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 |
"Israel" | 2000 | 6114.57 | 4.07733 | 129253.89423 |
"Malawi" | 2000 | 11801.505 | 59.543808 | 5026.221784 |
"South Africa" | 2000 | 45064.098 | 6.93983 | 227242.36949 |
"Uruguay" | 2000 | 3219.793 | 12.099592 | 25255.961693 |
15.3.4. Make changes in DataFrames#
The ability to make changes in DataFrames is important to generate a clean dataset for future analysis.
1. We can use conditional logic to “keep” certain values and replace others
df.with_columns(
pl.when(pl.col('POP') >= 20000) # when population >= 20000
.then(pl.col('POP')) # keep the population value
.otherwise(None) # otherwise set to null
.alias('POP_filtered') # save results in POP_filtered
).select(['country', 'POP', 'POP_filtered']) # select the columns
country | POP | POP_filtered |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 37335.653 |
"Australia" | 19053.186 | null |
"India" | 1.0063e6 | 1.0063e6 |
"Israel" | 6114.57 | null |
"Malawi" | 11801.505 | null |
"South Africa" | 45064.098 | 45064.098 |
"United States" | 282171.957 | 282171.957 |
"Uruguay" | 3219.793 | null |
2. We can modify specific values based on conditions
df_modified = df.with_columns(
pl.when(pl.col('cg') == pl.col('cg').max()) # pick the largest cg value
.then(None) # set to null
.otherwise(pl.col('cg')) # otherwise keep the value
.alias('cg') # update the column
)
df_modified
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.9995 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | null |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 64.436451 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 227242.36949 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 282171.957 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
3. We can use expressions to modify columns as a whole
df.with_columns([
pl.when(pl.col('POP') <= 10000) # when population is <= 10,000
.then(None) # set the value to null
.otherwise(pl.col('POP')) # otherwise keep existing value
.alias('POP'), # update the POP column
(pl.col('XRAT') / 10).alias('XRAT') # replace XRAT with XRAT / 10
])
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.09995 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 0.172483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 4.49416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | null | 0.407733 | 129253.89423 | 64.436451 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 5.954381 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 0.693983 | 227242.36949 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 282171.957 | 0.1 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | null | 1.209959 | 25255.961693 | 78.97874 | 5.108068 |
4. We can use built-in functions to modify all entries in columns of a given data type.
df.with_columns([
pl.col(pl.Float64).round(2) # round all Float64 columns
])
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.65 | 1.0 | 295072.22 | 75.72 | 5.58 |
"Australia" | "AUS" | 2000 | 19053.19 | 1.72 | 541804.65 | 67.76 | 6.72 |
"India" | "IND" | 2000 | 1006300.3 | 44.94 | 1.7281e6 | 64.58 | 14.07 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.08 | 129253.89 | 64.44 | 10.27 |
"Malawi" | "MWI" | 2000 | 11801.5 | 59.54 | 5026.22 | 74.71 | 11.66 |
"South Africa" | "ZAF" | 2000 | 45064.1 | 6.94 | 227242.37 | 72.72 | 5.73 |
"United States" | "USA" | 2000 | 282171.96 | 1.0 | 9.8987e6 | 72.35 | 6.03 |
"Uruguay" | "URY" | 2000 | 3219.79 | 12.1 | 25255.96 | 78.98 | 5.11 |
Application: Missing Value Imputation
Replacing missing values is an important step in data munging.
Let’s insert some null values at a few specific positions
# Create a copy with some null values
df_with_nulls = df.clone()
# Set some specific positions to null
indices_to_null = [(0, 'XRAT'), (3, 'cc'), (5, 'tcgdp'), (6, 'POP')]
for row_idx, col_name in indices_to_null:
df_with_nulls = df_with_nulls.with_columns(
pl.when(pl.int_range(pl.len()) == row_idx)
.then(None)
.otherwise(pl.col(col_name))
.alias(col_name)
)
df_with_nulls
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | null | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | null | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | null | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | null | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
We can replace all missing values with 0
df_with_nulls.fill_null(0)
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.0 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 0.0 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 0.0 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 0.0 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
Polars also provides us with convenient methods to replace missing values.
For example, we can use forward fill, backward fill, or interpolation.
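As a small sketch (these strategies are more natural for ordered data such as time series, but the mechanics are the same here), the alternatives look like this; each call returns a new DataFrame and leaves df_with_nulls unchanged:
# forward fill: propagate the last non-null value down each Float64 column
df_with_nulls.with_columns(pl.col(pl.Float64).fill_null(strategy="forward"))
# backward fill: propagate the next non-null value up each column
df_with_nulls.with_columns(pl.col(pl.Float64).fill_null(strategy="backward"))
# linear interpolation between neighbouring non-null values
df_with_nulls.with_columns(pl.col(pl.Float64).interpolate())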
Here we fill null values with the column means
cols = ["cc", "tcgdp", "POP", "XRAT"]
df_with_nulls.with_columns([
pl.col(cols).fill_null(pl.col(cols).mean())
])
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 18.618141 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 72.400502 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 1.8033e6 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 161269.871714 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
Missing value imputation is a big area in data science involving various machine learning techniques.
There are also more advanced tools in Python to impute missing values.
15.3.5. Standardization and visualization#
Let’s imagine that we’re only interested in the population (POP) and total GDP (tcgdp).
One way to strip the data frame df down to only these variables is to overwrite the DataFrame using the selection method described above
df = df.select(['country', 'POP', 'tcgdp'])
df
country | POP | tcgdp |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 295072.21869 |
"Australia" | 19053.186 | 541804.6521 |
"India" | 1.0063e6 | 1.7281e6 |
"Israel" | 6114.57 | 129253.89423 |
"Malawi" | 11801.505 | 5026.221784 |
"South Africa" | 45064.098 | 227242.36949 |
"United States" | 282171.957 | 9.8987e6 |
"Uruguay" | 3219.793 | 25255.961693 |
While Polars doesn’t have a traditional index like pandas, we can work with country names directly
df
country | POP | tcgdp |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 295072.21869 |
"Australia" | 19053.186 | 541804.6521 |
"India" | 1.0063e6 | 1.7281e6 |
"Israel" | 6114.57 | 129253.89423 |
"Malawi" | 11801.505 | 5026.221784 |
"South Africa" | 45064.098 | 227242.36949 |
"United States" | 282171.957 | 9.8987e6 |
"Uruguay" | 3219.793 | 25255.961693 |
Let’s give the columns slightly better names
df = df.rename({'POP': 'population', 'tcgdp': 'total GDP'})
df
country | population | total GDP |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 295072.21869 |
"Australia" | 19053.186 | 541804.6521 |
"India" | 1.0063e6 | 1.7281e6 |
"Israel" | 6114.57 | 129253.89423 |
"Malawi" | 11801.505 | 5026.221784 |
"South Africa" | 45064.098 | 227242.36949 |
"United States" | 282171.957 | 9.8987e6 |
"Uruguay" | 3219.793 | 25255.961693 |
The population variable is in thousands; let’s revert to single units
df = df.with_columns((pl.col('population') * 1e3).alias('population'))
df
country | population | total GDP |
---|---|---|
str | f64 | f64 |
"Argentina" | 3.7335653e7 | 295072.21869 |
"Australia" | 1.9053186e7 | 541804.6521 |
"India" | 1.0063e9 | 1.7281e6 |
"Israel" | 6.11457e6 | 129253.89423 |
"Malawi" | 1.1801505e7 | 5026.221784 |
"South Africa" | 4.5064098e7 | 227242.36949 |
"United States" | 2.82171957e8 | 9.8987e6 |
"Uruguay" | 3.219793e6 | 25255.961693 |
Next, we’re going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions.
Note
Polars (or pandas) doesn’t have a way of recording units of measurement, such as GDP represented in millions of dollars; it is left to the user to keep track of their own units when undertaking analysis.
df = df.with_columns(
(pl.col('total GDP') * 1e6 / pl.col('population')).alias('GDP percap')
)
df
country | population | total GDP | GDP percap |
---|---|---|---|
str | f64 | f64 | f64 |
"Argentina" | 3.7335653e7 | 295072.21869 | 7903.229085 |
"Australia" | 1.9053186e7 | 541804.6521 | 28436.433261 |
"India" | 1.0063e9 | 1.7281e6 | 1717.324719 |
"Israel" | 6.11457e6 | 129253.89423 | 21138.672749 |
"Malawi" | 1.1801505e7 | 5026.221784 | 425.896679 |
"South Africa" | 4.5064098e7 | 227242.36949 | 5042.647686 |
"United States" | 2.82171957e8 | 9.8987e6 | 35080.381854 |
"Uruguay" | 3.219793e6 | 25255.961693 | 7843.97062 |
One of the nice things about Polars DataFrame and Series objects is that they can be easily converted to pandas for visualization through Matplotlib.
For example, we can easily generate a bar plot of GDP per capita
# Convert to pandas for plotting
df_pandas = df.to_pandas().set_index('country')
ax = df_pandas['GDP percap'].plot(kind='bar')
ax.set_xlabel('country', fontsize=12)
ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()
(Figure: bar plot of GDP per capita by country)
At the moment the data frame is ordered alphabetically on the countries—let’s change it to GDP per capita
df = df.sort('GDP percap', descending=True)
df
country | population | total GDP | GDP percap |
---|---|---|---|
str | f64 | f64 | f64 |
"United States" | 2.82171957e8 | 9.8987e6 | 35080.381854 |
"Australia" | 1.9053186e7 | 541804.6521 | 28436.433261 |
"Israel" | 6.11457e6 | 129253.89423 | 21138.672749 |
"Argentina" | 3.7335653e7 | 295072.21869 | 7903.229085 |
"Uruguay" | 3.219793e6 | 25255.961693 | 7843.97062 |
"South Africa" | 4.5064098e7 | 227242.36949 | 5042.647686 |
"India" | 1.0063e9 | 1.7281e6 | 1717.324719 |
"Malawi" | 1.1801505e7 | 5026.221784 | 425.896679 |
Plotting as before now yields
# Convert to pandas for plotting
df_pandas = df.to_pandas().set_index('country')
ax = df_pandas['GDP percap'].plot(kind='bar')
ax.set_xlabel('country', fontsize=12)
ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()
(Figure: bar plot of GDP per capita, countries sorted from highest to lowest)
15.4. Lazy evaluation#
One of Polars’ most powerful features is lazy evaluation. This allows Polars to optimize your entire query before executing it, leading to significant performance improvements.
15.4.1. Eager vs lazy APIs#
Polars provides two APIs:
Eager API - Operations are executed immediately (like pandas)
Lazy API - Operations are collected and optimized before execution
Let’s see the difference using our dataset:
# First, let's reload our original dataset for this example
URL = ('https://raw.githubusercontent.com/QuantEcon/'
'lecture-python-programming/master/source/_static/'
'lecture_specific/pandas/data/test_pwt.csv')
df_full = pl.read_csv(URL)
# Eager API (executed immediately)
result_eager = (df_full
.filter(pl.col('tcgdp') > 1000)
.select(['country', 'year', 'tcgdp'])
.sort('tcgdp', descending=True)
)
print("Eager result shape:", result_eager.shape)
result_eager.head()
Eager result shape: (8, 3)
country | year | tcgdp |
---|---|---|
str | i64 | f64 |
"United States" | 2000 | 9.8987e6 |
"India" | 2000 | 1.7281e6 |
"Australia" | 2000 | 541804.6521 |
"Argentina" | 2000 | 295072.21869 |
"South Africa" | 2000 | 227242.36949 |
# Lazy API (builds a query plan)
lazy_query = (df_full.lazy() # Convert to lazy frame
.filter(pl.col('tcgdp') > 1000)
.select(['country', 'year', 'tcgdp'])
.sort('tcgdp', descending=True)
)
print("Lazy query:")
print(lazy_query)
Lazy query:
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
SORT BY [col("tcgdp")]
SELECT [col("country"), col("year"), col("tcgdp")]
FILTER [(col("tcgdp")) > (1000.0)]
FROM
DF ["country", "country isocode", "year", "POP", ...]; PROJECT */8 COLUMNS
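As the printed plan suggests, we can also peek at the optimized plan before running anything; a minimal sketch:
print(lazy_query.explain())  # defaults to the optimized plan; pass optimized=False for the naive one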
We can now execute the lazy query using collect:
result_lazy = lazy_query.collect()
print("Lazy result shape:", result_lazy.shape)
result_lazy.head()
Lazy result shape: (8, 3)
country | year | tcgdp |
---|---|---|
str | i64 | f64 |
"United States" | 2000 | 9.8987e6 |
"India" | 2000 | 1.7281e6 |
"Australia" | 2000 | 541804.6521 |
"Argentina" | 2000 | 295072.21869 |
"South Africa" | 2000 | 227242.36949 |
15.4.2. Query optimization#
The lazy API allows Polars to perform several optimizations:
Predicate Pushdown - Filters are applied as early as possible
Projection Pushdown - Only required columns are read
Common Subexpression Elimination - Duplicate calculations are removed
Dead Code Elimination - Unused operations are removed
# Example of optimization - only columns needed are processed
optimized_query = (df_full.lazy()
.select(['country', 'year', 'tcgdp', 'POP']) # Select early
.filter(pl.col('tcgdp') > 500) # Filter pushdown
.with_columns((pl.col('tcgdp') / pl.col('POP')).alias('gdp_per_capita'))
.filter(pl.col('gdp_per_capita') > 10) # Additional filter
.select(['country', 'year', 'gdp_per_capita']) # Final projection
)
print("Optimized query plan:")
print(optimized_query.explain())
Optimized query plan:
simple π 3/3 ["country", "year", ... 1 other column]
FILTER [(col("gdp_per_capita")) > (10.0)]
FROM
WITH_COLUMNS:
[[(col("tcgdp")) / (col("POP"))].alias("gdp_per_capita")]
FILTER [(col("tcgdp")) > (500.0)]
FROM
DF ["country", "country isocode", "year", "POP", ...]; PROJECT["country", "year", "tcgdp", "POP"] 4/8 COLUMNS
# Execute the optimized query
result_optimized = optimized_query.collect()
result_optimized.head()
country | year | gdp_per_capita |
---|---|---|
str | i64 | f64 |
"Australia" | 2000 | 28.436433 |
"Israel" | 2000 | 21.138673 |
"United States" | 2000 | 35.080382 |
15.4.3. When to use lazy vs eager#
Use Lazy API when:
Working with large datasets
Performing complex transformations
Building data pipelines
Performance is critical
Use Eager API when:
Exploring data interactively
Working with small datasets
Need immediate results for debugging
The lazy API is particularly powerful for data processing pipelines where multiple transformations can be optimized together as a single operation.
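As a small sketch of such a pipeline, we can start from a file on disk with scan_csv so that nothing is read until the optimized plan is executed (here we reuse the pwt_subset.csv file written earlier in this lecture):
lazy_pipeline = (
    pl.scan_csv('pwt_subset.csv')      # build a LazyFrame; the file is not read yet
    .filter(pl.col('POP') > 10000)     # the predicate can be pushed towards the scan
    .select(['country', 'tcgdp'])      # only these columns need to be read
)
lazy_pipeline.collect()                # execute the optimized plan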
15.5. Online data sources#
Python makes it straightforward to query online databases programmatically.
An important database for economists is FRED — a vast collection of time series data maintained by the St. Louis Fed.
For example, suppose that we are interested in the unemployment rate.
(To download the data as a CSV, click on Download at the top right and select the CSV (data) option).
Alternatively, we can access the CSV file from within a Python program.
In the pandas lecture, we studied how to use requests and pandas to access API data.
Here Polars’ read_csv function provides the same functionality.
We use try_parse_dates=True so that Polars recognizes our dates column
url = ('https://fred.stlouisfed.org/graph/fredgraph.csv?'
'bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&'
'graph_bgcolor=%23ffffff&height=450&mode=fred&'
'recession_bars=on&txtcolor=%23444444&ts=12&tts=12&'
'width=1318&nt=0&thu=0&trc=0&show_legend=yes&'
'show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&'
'cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&'
'link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&'
'ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&'
'fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&'
'vintage_date=2024-07-29&revision_date=2024-07-29&'
'nd=1948-01-01')
data = pl.read_csv(url, try_parse_dates=True)
The data has been read into a Polars DataFrame called data that we can now manipulate in the usual way
type(data)
polars.dataframe.frame.DataFrame
data.head() # A useful method to get a quick look at a DataFrame
observation_date | UNRATE |
---|---|
date | f64 |
1948-01-01 | 3.4 |
1948-02-01 | 3.8 |
1948-03-01 | 4.0 |
1948-04-01 | 3.9 |
1948-05-01 | 3.5 |
data.describe() # Your output might differ slightly
statistic | observation_date | UNRATE |
---|---|---|
str | str | f64 |
"count" | "918" | 918.0 |
"null_count" | "0" | 0.0 |
"mean" | "1986-03-17 06:30:35.294117" | 5.693464 |
"std" | null | 1.710032 |
"min" | "1948-01-01" | 2.5 |
"25%" | "1967-02-01" | 4.4 |
"50%" | "1986-04-01" | 5.5 |
"75%" | "2005-05-01" | 6.7 |
"max" | "2024-06-01" | 14.8 |
We can also plot the unemployment rate from 2006 to 2012 as follows:
# Filter data for the specified date range and convert to pandas for plotting
filtered_data = data.filter(
(pl.col('observation_date') >= pl.date(2006, 1, 1)) &
(pl.col('observation_date') <= pl.date(2012, 12, 31))
).to_pandas().set_index('observation_date')
ax = filtered_data.plot(title='US Unemployment Rate', legend=False)
ax.set_xlabel('year', fontsize=12)
ax.set_ylabel('%', fontsize=12)
plt.show()
(Figure: US unemployment rate, 2006 to 2012)
Note that Polars offers many other file type alternatives.
Polars has a wide variety of methods that we can use to read Excel, JSON, and Parquet files, or to plug straight into a database server.
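For example, here is a minimal sketch of a Parquet round trip with the data DataFrame from above (the file name is our own choice; Parquet support comes via pyarrow, which we installed at the top of this lecture):
data.write_parquet('unrate.parquet')               # write the unemployment data to a Parquet file
data_parquet = pl.read_parquet('unrate.parquet')   # read it straight back into a DataFrame
data_parquet.head()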
15.6. Exercises#
Exercise 15.1
With these imports:
import datetime as dt
import yfinance as yf
Write a program to calculate the percentage price change over 2021 for the following shares using Polars:
ticker_list = {'INTC': 'Intel',
'MSFT': 'Microsoft',
'IBM': 'IBM',
'BHP': 'BHP',
'TM': 'Toyota',
'AAPL': 'Apple',
'AMZN': 'Amazon',
'C': 'Citigroup',
'QCOM': 'Qualcomm',
'KO': 'Coca-Cola',
'GOOG': 'Google'}
Here’s the first part of the program that reads data into a Polars DataFrame:
def read_data_polars(ticker_list,
start=dt.datetime(2021, 1, 1),
end=dt.datetime(2021, 12, 31)):
"""
This function reads in closing price data from Yahoo
for each tick in the ticker_list and returns a Polars DataFrame.
Different indices may have different trading days, so we use joins
to handle this.
"""
dataframes = []
for tick in ticker_list:
stock = yf.Ticker(tick)
prices = stock.history(start=start, end=end)
# Create a Polars DataFrame from the closing prices
df = pl.DataFrame({
'Date': pd.to_datetime(prices.index.date),
tick: prices['Close'].values
})
dataframes.append(df)
# Start with the first DataFrame
result = dataframes[0]
# Join additional DataFrames, handling mismatched dates with full outer join
for df in dataframes[1:]:
result = result.join(df, on='Date', how='full', coalesce=True)
return result
ticker = read_data_polars(ticker_list)
Complete the program to plot the result as a bar graph using Polars operations and matplotlib visualization.
Solution to Exercise 15.1
Here’s a solution using Polars operations to calculate percentage changes:
price_change_df = ticker.select([
(pl.col(tick).last() / pl.col(tick).first() * 100 - 100).alias(tick)
for tick in ticker_list.keys()
]).transpose(
include_header=True,
header_name='ticker',
column_names=['pct_change']
)
# Add company names and sort
price_change_df = price_change_df.with_columns([
pl.col('ticker')
.replace_strict(ticker_list, default=pl.col('ticker'))
.alias('company')
]).sort('pct_change')
print(price_change_df)
shape: (11, 3)
┌────────┬────────────┬───────────┐
│ ticker ┆ pct_change ┆ company │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str │
╞════════╪════════════╪═══════════╡
│ BHP ┆ -2.249111 ┆ BHP │
│ C ┆ 3.550571 ┆ Citigroup │
│ AMZN ┆ 5.845049 ┆ Amazon │
│ INTC ┆ 6.86852 ┆ Intel │
│ KO ┆ 14.92246 ┆ Coca-Cola │
│ … ┆ … ┆ … │
│ TM ┆ 23.416755 ┆ Toyota │
│ QCOM ┆ 25.318508 ┆ Qualcomm │
│ AAPL ┆ 38.55075 ┆ Apple │
│ MSFT ┆ 57.179601 ┆ Microsoft │
│ GOOG ┆ 68.960919 ┆ Google │
└────────┴────────────┴───────────┘
Now plot the results:
# Convert to pandas for plotting (as demonstrated in the lecture)
df_pandas = price_change_df.to_pandas().set_index('company')
fig, ax = plt.subplots(figsize=(10,8))
ax.set_xlabel('stock', fontsize=12)
ax.set_ylabel('percentage change in price', fontsize=12)
# Create colors: red for negative returns, blue for positive returns
colors = ['red' if x < 0 else 'blue' for x in df_pandas['pct_change']]
df_pandas['pct_change'].plot(kind='bar', ax=ax, color=colors)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
(Figure: bar plot of 2021 percentage price changes by company)
Exercise 15.2
Using the method read_data_polars introduced in Exercise 15.1, write a program to obtain year-on-year percentage change for the following indices using Polars operations:
indices_list = {'^GSPC': 'S&P 500',
'^IXIC': 'NASDAQ',
'^DJI': 'Dow Jones',
'^N225': 'Nikkei'}
Complete the program to show summary statistics and plot the result as a time series graph demonstrating Polars’ data manipulation capabilities.
Solution to Exercise 15.2
Following the work you did in Exercise 15.1, you can query the data using read_data_polars by updating the start and end dates accordingly.
indices_data = read_data_polars(
indices_list,
start=dt.datetime(2000, 1, 1),
end=dt.datetime(2021, 12, 31)
)
# Add year column for grouping
indices_data = indices_data.with_columns(
pl.col('Date').dt.year().alias('year')
)
print("Data shape:", indices_data.shape)
print("\nFirst few rows:")
print(indices_data.head())
print("\nData availability check:")
for index in indices_list.keys():
non_null_count = (indices_data
.select(pl.col(index).is_not_null().sum())
.item())
print(f"{indices_list[index]}: {non_null_count} non-null values")
Data shape: (5716, 6)
First few rows:
shape: (5, 6)
┌─────────────────────┬─────────────┬─────────────┬──────────────┬──────────────┬──────┐
│ Date ┆ ^GSPC ┆ ^IXIC ┆ ^DJI ┆ ^N225 ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i32 │
╞═════════════════════╪═════════════╪═════════════╪══════════════╪══════════════╪══════╡
│ 2000-01-03 00:00:00 ┆ 1455.219971 ┆ 4131.149902 ┆ 11357.509766 ┆ null ┆ 2000 │
│ 2000-01-04 00:00:00 ┆ 1399.420044 ┆ 3901.689941 ┆ 10997.929688 ┆ 19002.859375 ┆ 2000 │
│ 2000-01-05 00:00:00 ┆ 1402.109985 ┆ 3877.540039 ┆ 11122.650391 ┆ 18542.550781 ┆ 2000 │
│ 2000-01-06 00:00:00 ┆ 1403.449951 ┆ 3727.129883 ┆ 11253.259766 ┆ 18168.269531 ┆ 2000 │
│ 2000-01-07 00:00:00 ┆ 1441.469971 ┆ 3882.620117 ┆ 11522.55957 ┆ 18193.410156 ┆ 2000 │
└─────────────────────┴─────────────┴─────────────┴──────────────┴──────────────┴──────┘
Data availability check:
S&P 500: 5535 non-null values
NASDAQ: 5535 non-null values
Dow Jones: 5535 non-null values
Nikkei: 5391 non-null values
Calculate yearly returns using Polars groupby operations:
# Calculate first and last valid price for each year and each index
yearly_returns = indices_data.group_by('year').agg([
*[pl.col(index)
.filter(pl.col(index).is_not_null())
.first()
.alias(f"{index}_first") for index in indices_list.keys()],
*[pl.col(index)
.filter(pl.col(index).is_not_null())
.last()
.alias(f"{index}_last") for index in indices_list.keys()]
])
# Calculate percentage returns for each index, handling null values properly
return_columns = []
for index in indices_list.keys():
company_name = indices_list[index]
return_col = (
(pl.col(f"{index}_last") - pl.col(f"{index}_first")) /
pl.col(f"{index}_first") * 100
).alias(company_name)
return_columns.append(return_col)
yearly_returns = yearly_returns.with_columns(return_columns)
# Select only the year and return columns, filter out years with insufficient data
yearly_returns = yearly_returns.select([
'year',
*list(indices_list.values())
]).filter(
pl.col('year') >= 2001 # Ensure we have complete years of data
).sort('year')
print("Yearly returns shape:", yearly_returns.shape)
print("\nYearly returns:")
print(yearly_returns.head(10))
Yearly returns shape: (21, 5)
Yearly returns:
shape: (10, 5)
┌──────┬───────────┬────────────┬────────────┬────────────┐
│ year ┆ S&P 500 ┆ NASDAQ ┆ Dow Jones ┆ Nikkei │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪═══════════╪════════════╪════════════╪════════════╡
│ 2001 ┆ -10.53481 ┆ -14.898819 ┆ -5.866722 ┆ -21.872491 │
│ 2002 ┆ -23.80334 ┆ -32.524441 ┆ -17.191519 ┆ -5.438447 │
│ 2003 ┆ 22.319396 ┆ 44.663323 ┆ 21.451016 ┆ 22.458114 │
│ 2004 ┆ 9.331703 ┆ 8.409905 ┆ 3.584683 ┆ 6.481657 │
│ 2005 ┆ 3.844177 ┆ 2.470561 ┆ -0.111187 ┆ -2.182891 │
│ 2006 ┆ 11.782786 ┆ 7.645719 ┆ 14.89517 ┆ -5.645009 │
│ 2007 ┆ 3.65382 ┆ 9.455427 ┆ 6.33532 ┆ 1.347957 │
│ 2008 ┆ -37.58465 ┆ -39.568824 ┆ -32.716831 ┆ -12.641608 │
│ 2009 ┆ 19.671602 ┆ 39.023162 ┆ 15.422326 ┆ -14.297612 │
│ 2010 ┆ 11.001865 ┆ 14.92147 ┆ 9.387316 ┆ 1.879812 │
└──────┴───────────┴────────────┴────────────┴────────────┘
Generate summary statistics using Polars:
# Summary statistics for all indices
summary_stats = yearly_returns.select(list(indices_list.values())).describe()
print("Summary Statistics:")
print(summary_stats)
# Check for any null values or data issues
print(f"\nData shape: {yearly_returns.shape}")
print(f"Null counts:")
print(yearly_returns.null_count())
print(f"\nData range (first few years):")
print(yearly_returns.head())
Summary Statistics:
shape: (9, 5)
┌────────────┬───────────┬────────────┬────────────┬────────────┐
│ statistic ┆ S&P 500 ┆ NASDAQ ┆ Dow Jones ┆ Nikkei │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪═══════════╪════════════╪════════════╪════════════╡
│ count ┆ 21.0 ┆ 21.0 ┆ 21.0 ┆ 21.0 │
│ null_count ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ 7.338348 ┆ 11.275998 ┆ 6.548263 ┆ 0.267225 │
│ std ┆ 16.776667 ┆ 22.508323 ┆ 14.241644 ┆ 12.272468 │
│ min ┆ -37.58465 ┆ -39.568824 ┆ -32.716831 ┆ -21.872491 │
│ 25% ┆ -0.692839 ┆ 2.470561 ┆ -0.111187 ┆ -5.645009 │
│ 50% ┆ 11.237369 ┆ 9.790363 ┆ 6.33532 ┆ -0.438448 │
│ 75% ┆ 18.415027 ┆ 27.155799 ┆ 15.422326 ┆ 4.915217 │
│ max ┆ 29.132182 ┆ 44.663323 ┆ 24.331151 ┆ 26.990829 │
└────────────┴───────────┴────────────┴────────────┴────────────┘
Data shape: (21, 5)
Null counts:
shape: (1, 5)
┌──────┬─────────┬────────┬───────────┬────────┐
│ year ┆ S&P 500 ┆ NASDAQ ┆ Dow Jones ┆ Nikkei │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════╪═════════╪════════╪═══════════╪════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└──────┴─────────┴────────┴───────────┴────────┘
Data range (first few years):
shape: (5, 5)
┌──────┬───────────┬────────────┬────────────┬────────────┐
│ year ┆ S&P 500 ┆ NASDAQ ┆ Dow Jones ┆ Nikkei │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪═══════════╪════════════╪════════════╪════════════╡
│ 2001 ┆ -10.53481 ┆ -14.898819 ┆ -5.866722 ┆ -21.872491 │
│ 2002 ┆ -23.80334 ┆ -32.524441 ┆ -17.191519 ┆ -5.438447 │
│ 2003 ┆ 22.319396 ┆ 44.663323 ┆ 21.451016 ┆ 22.458114 │
│ 2004 ┆ 9.331703 ┆ 8.409905 ┆ 3.584683 ┆ 6.481657 │
│ 2005 ┆ 3.844177 ┆ 2.470561 ┆ -0.111187 ┆ -2.182891 │
└──────┴───────────┴────────────┴────────────┴────────────┘
Plot the time series:
# Convert to pandas for plotting
df_pandas = yearly_returns.to_pandas().set_index('year')
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Flatten 2-D array to 1-D array
for iter_, ax in enumerate(axes.flatten()):
if iter_ < len(indices_list):
# Get index name per iteration
index_name = list(indices_list.values())[iter_]
# Plot with markers and lines for better visibility
ax.plot(df_pandas.index, df_pandas[index_name], 'o-',
linewidth=2, markersize=4)
ax.set_ylabel("yearly return", fontsize=12)
ax.set_xlabel("year", fontsize=12)
ax.set_title(index_name, fontsize=12)
ax.grid(True, alpha=0.3)
# Add horizontal line at zero for reference
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
(Figure: 2x2 grid of yearly returns for the S&P 500, NASDAQ, Dow Jones and Nikkei)
Alternative: Create a single plot with all indices:
# Single plot with all indices
fig, ax = plt.subplots(figsize=(12, 8))
for index_name in indices_list.values():
# Only plot if the column has valid data
if (index_name in df_pandas.columns and
not df_pandas[index_name].isna().all()):
ax.plot(df_pandas.index, df_pandas[index_name],
label=index_name, linewidth=2, marker='o', markersize=3)
ax.set_xlabel("year", fontsize=12)
ax.set_ylabel("yearly return (%)", fontsize=12)
ax.set_title("Yearly Returns of Major Stock Indices (2001-2021)", fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='k', linestyle='--', alpha=0.5, label='Zero line')
plt.tight_layout()
plt.show()
(Figure: yearly returns of all four indices on a single plot, 2001-2021)