15. Polars#
In addition to what’s in Anaconda, this lecture will need the following libraries:
!pip install --upgrade polars wbgapi yfinance pyarrow
15.1. Overview#
Polars is a fast data manipulation library for Python written in Rust.
Polars has gained significant popularity in recent years due to its superior performance compared to traditional data analysis tools.
This makes it an excellent choice for modern data science and machine learning workflows.
Polars is designed with performance and memory efficiency in mind, leveraging:
Arrow’s columnar memory format for fast data access
Lazy evaluation to optimize query execution
Parallel processing for enhanced performance
Expressive API similar to pandas but with better performance characteristics
Just as NumPy provides the basic array data type plus core array operations, Polars
defines fundamental structures for working with data and
endows them with methods that facilitate operations such as
reading in data
adjusting indices
working with dates and time series
sorting, grouping, re-ordering and general data munging [1]
dealing with missing values, etc.
More sophisticated statistical functionality is left to other packages, such as statsmodels and scikit-learn, which can work with Polars DataFrames through their interoperability with pandas.
This lecture will provide a basic introduction to Polars.
Tip
Why use Polars over pandas? One reason is performance: as a general rule, it is recommended to have 5 to 10 times as much RAM as the size of the dataset to carry out operations in pandas, compared to 2 to 4 times for Polars. In addition, Polars is between 10 and 100 times as fast as pandas for common operations. A good article comparing Polars and pandas can be found in this JetBrains blog post.
Throughout the lecture, we will assume that the following imports have taken place
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Two important data types defined by Polars are Series and DataFrame.
You can think of a Series as a “column” of data, such as a collection of observations on a single variable.
A DataFrame is a two-dimensional object for storing related columns of data.
15.2. Series#
Let’s start with Series.
We begin by creating a series of four random observations
s = pl.Series(name='daily returns', values=np.random.randn(4))
s
daily returns |
---|
f64 |
1.125966 |
-0.594306 |
0.792461 |
0.452324 |
Note
You may notice the above series has no indices, unlike a pd.Series; this is because Polars is column-centric, and accessing data is predominantly managed through filtering and boolean masks; here is an interesting blog post discussing this in more detail.
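For instance, here is a minimal sketch of the two usual access patterns on the series s defined above (positional indexing and boolean filtering):
s[0]             # the first observation, accessed by position
s.filter(s > 0)  # keep only the positive observations via a boolean mask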
Polars Series are built on top of Apache Arrow arrays and support many operations similar to pandas Series.
(For interested readers, please see this extended reading on Apache Arrow.)
s * 100
daily returns |
---|
f64 |
112.596609 |
-59.430626 |
79.246067 |
45.232421 |
s.abs()
daily returns |
---|
f64 |
1.125966 |
0.594306 |
0.792461 |
0.452324 |
But Series provide more than basic arrays.
For example, they have some additional (statistically oriented) methods
s.describe()
statistic | value |
---|---|
str | f64 |
"count" | 4.0 |
"null_count" | 0.0 |
"mean" | 0.444111 |
"std" | 0.744905 |
"min" | -0.594306 |
"25%" | 0.452324 |
"50%" | 0.792461 |
"75%" | 0.792461 |
"max" | 1.125966 |
However, the pl.Series object cannot be used in the same way as a pd.Series when pairing data with indices.
For example, using a pd.Series you can do the following:
s = pd.Series(np.random.randn(4), name='daily returns')
s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
s
AMZN -1.090070
AAPL 0.484258
MSFT -0.313922
GOOG -1.967324
Name: daily returns, dtype: float64
However, in Polars you will need to use the DataFrame object to do the same task.
This means you will use the DataFrame object more often when using Polars if you are interested in relationships between data.
Let’s create a pl.DataFrame containing the equivalent data in the pd.Series.
df = pl.DataFrame({
'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
'daily returns': s.to_list()
})
df
company | daily returns |
---|---|
str | f64 |
"AMZN" | -1.09007 |
"AAPL" | 0.484258 |
"MSFT" | -0.313922 |
"GOOG" | -1.967324 |
To access specific values by company name, we can filter the DataFrame for the AMZN ticker code and select the daily returns.
df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
-1.0900699038656159
If we want to update the AMZN return to 0, we can use the following chain of methods.
Here with_columns is similar to select but adds columns to the same DataFrame
df = df.with_columns(
pl.when(pl.col('company') == 'AMZN') # filter for AMZN in company column
.then(0) # set values to 0
.otherwise(pl.col('daily returns')) # otherwise keep original value
.alias('daily returns') # assign back to the column
)
df
company | daily returns |
---|---|
str | f64 |
"AMZN" | 0.0 |
"AAPL" | 0.484258 |
"MSFT" | -0.313922 |
"GOOG" | -1.967324 |
You can check if a ticker code is in the company column
'AAPL' in df['company']
True
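If instead we want the full row for a given ticker, one option (a sketch of just one approach) is DataFrame.row with a predicate, which returns the matching row as a Python tuple:
df.row(by_predicate=(pl.col('company') == 'AAPL'))  # the matching row as a tuple; raises unless exactly one row matches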
15.3. DataFrames#
While a Series is a single column of data, a DataFrame is several columns, one for each variable.
In essence, a DataFrame in Polars is analogous to a (highly optimized) Excel spreadsheet.
Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
Let’s look at an example that reads data from the CSV file pandas/data/test_pwt.csv, which is taken from the Penn World Tables.
The dataset contains the following indicators:
Variable Name | Description |
---|---|
POP | Population (in thousands) |
XRAT | Exchange Rate to US Dollar |
tcgdp | Total PPP Converted GDP (in million international dollar) |
cc | Consumption Share of PPP Converted GDP Per Capita (%) |
cg | Government Consumption Share of PPP Converted GDP Per Capita (%) |
We’ll read this in from a URL using the Polars function read_csv.
URL = ('https://raw.githubusercontent.com/QuantEcon/'
'lecture-python-programming/master/source/_static/'
'lecture_specific/pandas/data/test_pwt.csv')
df = pl.read_csv(URL)
type(df)
polars.dataframe.frame.DataFrame
Here is the content of test_pwt.csv
df
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.9995 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 64.436451 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 227242.36949 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 282171.957 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
15.3.1. Select data by position#
In practice, one thing that we do all the time is to find, select and work with a subset of the data that we are interested in.
We can select particular rows using array slicing notation
df[2:5]
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 64.436451 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
To select columns, we can pass a list containing the names of the desired columns
df.select(['country', 'tcgdp'])
country | tcgdp |
---|---|
str | f64 |
"Argentina" | 295072.21869 |
"Australia" | 541804.6521 |
"India" | 1.7281e6 |
"Israel" | 129253.89423 |
"Malawi" | 5026.221784 |
"South Africa" | 227242.36949 |
"United States" | 9.8987e6 |
"Uruguay" | 25255.961693 |
To select both rows and columns using integers, we can combine slicing with column selection
df[2:5].select(df.columns[0:4])
country | country isocode | year | POP |
---|---|---|---|
str | str | i64 | f64 |
"India" | "IND" | 2000 | 1.0063e6 |
"Israel" | "ISR" | 2000 | 6114.57 |
"Malawi" | "MWI" | 2000 | 11801.505 |
To select rows and columns using a mixture of integers and labels, we can use more complex selection
df[2:5].select(['country', 'tcgdp'])
country | tcgdp |
---|---|
str | f64 |
"India" | 1.7281e6 |
"Israel" | 129253.89423 |
"Malawi" | 5026.221784 |
15.3.2. Select data by conditions#
Instead of indexing rows and columns using integers and names, we can also obtain a sub-DataFrame of interest that satisfies certain (potentially complicated) conditions.
This section demonstrates various ways to do that.
The most straightforward way is with the filter method.
df.filter(pl.col('POP') >= 20000)
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.9995 | 295072.21869 | 75.716805 | 5.578804 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 227242.36949 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 282171.957 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
In this case, df.filter() takes a boolean expression and only returns rows with True values.
We can view this boolean mask as a table with the alias meets_criteria
df.select(
pl.col('country'),
(pl.col('POP') >= 20000).alias('meets_criteria')
)
country | meets_criteria |
---|---|
str | bool |
"Argentina" | true |
"Australia" | false |
"India" | true |
"Israel" | false |
"Malawi" | false |
"South Africa" | true |
"United States" | true |
"Uruguay" | false |
Here is another example:
df.filter(
(pl.col('country').is_in(['Argentina', 'India', 'South Africa'])) &
(pl.col('POP') > 40000)
)
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 227242.36949 | 72.71871 | 5.726546 |
We can also use arithmetic operations between different columns in the filter condition.
df.filter(
(pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
)
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
For example, we can use the condition to select the country with the largest household consumption–GDP share cc.
df.filter(pl.col('cc') == pl.col('cc').max())
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
When we only want to look at certain columns of a selected sub-DataFrame, we can combine filter with select.
df.filter(
(pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
).select(['country', 'year', 'POP']
)
country | year | POP |
---|---|---|
str | i64 | f64 |
"Malawi" | 2000 | 11801.505 |
"Uruguay" | 2000 | 3219.793 |
Application: Subsetting DataFrame
Real-world datasets can be very large.
It is sometimes desirable to work with a subset of data to enhance computational efficiency and reduce redundancy.
Let’s imagine that we’re only interested in the population (POP) and total GDP (tcgdp).
One way to strip the data frame df down to only these variables is to overwrite the DataFrame using the selection method described above
df_subset = df.select(['country', 'POP', 'tcgdp'])
df_subset
country | POP | tcgdp |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 295072.21869 |
"Australia" | 19053.186 | 541804.6521 |
"India" | 1.0063e6 | 1.7281e6 |
"Israel" | 6114.57 | 129253.89423 |
"Malawi" | 11801.505 | 5026.221784 |
"South Africa" | 45064.098 | 227242.36949 |
"United States" | 282171.957 | 9.8987e6 |
"Uruguay" | 3219.793 | 25255.961693 |
We can then save the smaller dataset for further analysis.
df_subset.write_csv('pwt_subset.csv')
15.3.3. Apply and map operations#
Polars provides powerful methods for applying functions to data.
Instead of pandas’ apply method, Polars uses expressions within the select, with_columns, or filter methods.
Here is an example using built-in functions to find the max value for each column
df.select([
pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg'])
.max()
.name.suffix('_max')
])
year_max | POP_max | XRAT_max | tcgdp_max | cc_max | cg_max |
---|---|---|---|---|---|
i64 | f64 | f64 | f64 | f64 | f64 |
2000 | 1.0063e6 | 59.543808 | 9.8987e6 | 78.97874 | 14.072206 |
For more complex operations, we can use map_elements (similar to pandas’ apply):
df.select([
pl.col('country'),
pl.col('POP').map_elements(lambda x: x * 2).alias('POP_doubled')
])
/tmp/ipykernel_6418/1063562650.py:3: PolarsInefficientMapWarning:
Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
- pl.col("POP").map_elements(lambda x: ...)
with this one instead:
+ pl.col("POP") * 2
pl.col('POP').map_elements(lambda x: x * 2).alias('POP_doubled')
country | POP_doubled |
---|---|
str | f64 |
"Argentina" | 74671.306 |
"Australia" | 38106.372 |
"India" | 2.0126e6 |
"Israel" | 12229.14 |
"Malawi" | 23603.01 |
"South Africa" | 90128.196 |
"United States" | 564343.914 |
"Uruguay" | 6439.586 |
However, as you can see from the warning issued by Polars, there is often a better way to achieve this using the Polars API.
df.select([
pl.col('country'),
(pl.col('POP') * 2).alias('POP_doubled')
])
country | POP_doubled |
---|---|
str | f64 |
"Argentina" | 74671.306 |
"Australia" | 38106.372 |
"India" | 2.0126e6 |
"Israel" | 12229.14 |
"Malawi" | 23603.01 |
"South Africa" | 90128.196 |
"United States" | 564343.914 |
"Uruguay" | 6439.586 |
We can use complex filtering conditions with boolean logic:
complex_condition = (
pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
.then(pl.col('POP') > 40000)
.otherwise(pl.col('POP') < 20000)
)
df.filter(complex_condition).select([
'country', 'year', 'POP', 'XRAT', 'tcgdp'
])
country | year | POP | XRAT | tcgdp |
---|---|---|---|---|
str | i64 | f64 | f64 | f64 |
"Australia" | 2000 | 19053.186 | 1.72483 | 541804.6521 |
"India" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 |
"Israel" | 2000 | 6114.57 | 4.07733 | 129253.89423 |
"Malawi" | 2000 | 11801.505 | 59.543808 | 5026.221784 |
"South Africa" | 2000 | 45064.098 | 6.93983 | 227242.36949 |
"Uruguay" | 2000 | 3219.793 | 12.099592 | 25255.961693 |
15.3.4. Make changes in DataFrames#
The ability to make changes in DataFrames is important to generate a clean dataset for future analysis.
1. We can use conditional logic to “keep” certain values and replace others
df.with_columns(
pl.when(pl.col('POP') >= 20000) # when population >= 20000
.then(pl.col('POP')) # keep the population value
.otherwise(None) # otherwise set to null
.alias('POP_filtered') # save results in POP_filtered
).select(['country', 'POP', 'POP_filtered']) # select the columns
country | POP | POP_filtered |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 37335.653 |
"Australia" | 19053.186 | null |
"India" | 1.0063e6 | 1.0063e6 |
"Israel" | 6114.57 | null |
"Malawi" | 11801.505 | null |
"South Africa" | 45064.098 | 45064.098 |
"United States" | 282171.957 | 282171.957 |
"Uruguay" | 3219.793 | null |
2. We can modify specific values based on conditions
df_modified = df.with_columns(
pl.when(pl.col('cg') == pl.col('cg').max()) # pick the largest cg value
.then(None) # set to null
.otherwise(pl.col('cg')) # otherwise keep the value
.alias('cg') # update the column
)
df_modified
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.9995 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | null |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 64.436451 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 227242.36949 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 282171.957 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
3. We can use expressions to modify columns as a whole
df.with_columns([
pl.when(pl.col('POP') <= 10000) # when population is <= 10,000
.then(None) # set the value to null
.otherwise(pl.col('POP')) # otherwise keep existing value
.alias('POP'), # update the POP column
(pl.col('XRAT') / 10).alias('XRAT') # replace XRAT with XRAT / 10
])
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.09995 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 0.172483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 4.49416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | null | 0.407733 | 129253.89423 | 64.436451 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 5.954381 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 0.693983 | 227242.36949 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 282171.957 | 0.1 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | null | 1.209959 | 25255.961693 | 78.97874 | 5.108068 |
4. We can use built-in functions to modify all entries in columns of a given data type.
df.with_columns([
pl.col(pl.Float64).round(2) # round all Float64 columns
])
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.65 | 1.0 | 295072.22 | 75.72 | 5.58 |
"Australia" | "AUS" | 2000 | 19053.19 | 1.72 | 541804.65 | 67.76 | 6.72 |
"India" | "IND" | 2000 | 1006300.3 | 44.94 | 1.7281e6 | 64.58 | 14.07 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.08 | 129253.89 | 64.44 | 10.27 |
"Malawi" | "MWI" | 2000 | 11801.5 | 59.54 | 5026.22 | 74.71 | 11.66 |
"South Africa" | "ZAF" | 2000 | 45064.1 | 6.94 | 227242.37 | 72.72 | 5.73 |
"United States" | "USA" | 2000 | 282171.96 | 1.0 | 9.8987e6 | 72.35 | 6.03 |
"Uruguay" | "URY" | 2000 | 3219.79 | 12.1 | 25255.96 | 78.98 | 5.11 |
Application: Missing Value Imputation
Replacing missing values is an important step in data munging.
Let’s insert some null values at a few specific positions
# Create a copy with some null values
df_with_nulls = df.clone()
# Set some specific positions to null
indices_to_null = [(0, 'XRAT'), (3, 'cc'), (5, 'tcgdp'), (6, 'POP')]
for row_idx, col_name in indices_to_null:
df_with_nulls = df_with_nulls.with_columns(
pl.when(pl.int_range(pl.len()) == row_idx)
.then(None)
.otherwise(pl.col(col_name))
.alias(col_name)
)
df_with_nulls
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | null | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | null | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | null | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | null | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
We can replace all missing values with 0
df_with_nulls.fill_null(0)
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 0.0 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 0.0 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 0.0 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 0.0 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
Polars also provides us with convenient methods to replace missing values.
For example, we can use forward fill, backward fill, or interpolation.
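As a small sketch (these strategies are more natural for ordered data such as time series, but the mechanics are the same here), the alternatives look like this; each call returns a new DataFrame and leaves df_with_nulls unchanged:
# forward fill: propagate the last non-null value down each Float64 column
df_with_nulls.with_columns(pl.col(pl.Float64).fill_null(strategy="forward"))
# backward fill: propagate the next non-null value up each column
df_with_nulls.with_columns(pl.col(pl.Float64).fill_null(strategy="backward"))
# linear interpolation between neighbouring non-null values
df_with_nulls.with_columns(pl.col(pl.Float64).interpolate())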
Here we fill null values with the column means
cols = ["cc", "tcgdp", "POP", "XRAT"]
df_with_nulls.with_columns([
pl.col(cols).fill_null(pl.col(cols).mean())
])
country | country isocode | year | POP | XRAT | tcgdp | cc | cg |
---|---|---|---|---|---|---|---|
str | str | i64 | f64 | f64 | f64 | f64 | f64 |
"Argentina" | "ARG" | 2000 | 37335.653 | 18.618141 | 295072.21869 | 75.716805 | 5.578804 |
"Australia" | "AUS" | 2000 | 19053.186 | 1.72483 | 541804.6521 | 67.759026 | 6.720098 |
"India" | "IND" | 2000 | 1.0063e6 | 44.9416 | 1.7281e6 | 64.575551 | 14.072206 |
"Israel" | "ISR" | 2000 | 6114.57 | 4.07733 | 129253.89423 | 72.400502 | 10.266688 |
"Malawi" | "MWI" | 2000 | 11801.505 | 59.543808 | 5026.221784 | 74.707624 | 11.658954 |
"South Africa" | "ZAF" | 2000 | 45064.098 | 6.93983 | 1.8033e6 | 72.71871 | 5.726546 |
"United States" | "USA" | 2000 | 161269.871714 | 1.0 | 9.8987e6 | 72.347054 | 6.032454 |
"Uruguay" | "URY" | 2000 | 3219.793 | 12.099592 | 25255.961693 | 78.97874 | 5.108068 |
Missing value imputation is a big area in data science involving various machine learning techniques.
There are also more advanced tools in Python to impute missing values.
15.3.5. Standardization and visualization#
Let’s imagine that we’re only interested in the population (POP) and total GDP (tcgdp).
One way to strip the data frame df down to only these variables is to overwrite the DataFrame using the selection method described above
df = df.select(['country', 'POP', 'tcgdp'])
df
country | POP | tcgdp |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 295072.21869 |
"Australia" | 19053.186 | 541804.6521 |
"India" | 1.0063e6 | 1.7281e6 |
"Israel" | 6114.57 | 129253.89423 |
"Malawi" | 11801.505 | 5026.221784 |
"South Africa" | 45064.098 | 227242.36949 |
"United States" | 282171.957 | 9.8987e6 |
"Uruguay" | 3219.793 | 25255.961693 |
While Polars doesn’t have a traditional index like pandas, we can work with country names directly
df
country | POP | tcgdp |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 295072.21869 |
"Australia" | 19053.186 | 541804.6521 |
"India" | 1.0063e6 | 1.7281e6 |
"Israel" | 6114.57 | 129253.89423 |
"Malawi" | 11801.505 | 5026.221784 |
"South Africa" | 45064.098 | 227242.36949 |
"United States" | 282171.957 | 9.8987e6 |
"Uruguay" | 3219.793 | 25255.961693 |
Let’s give the columns slightly better names
df = df.rename({'POP': 'population', 'tcgdp': 'total GDP'})
df
country | population | total GDP |
---|---|---|
str | f64 | f64 |
"Argentina" | 37335.653 | 295072.21869 |
"Australia" | 19053.186 | 541804.6521 |
"India" | 1.0063e6 | 1.7281e6 |
"Israel" | 6114.57 | 129253.89423 |
"Malawi" | 11801.505 | 5026.221784 |
"South Africa" | 45064.098 | 227242.36949 |
"United States" | 282171.957 | 9.8987e6 |
"Uruguay" | 3219.793 | 25255.961693 |
The population variable is in thousands; let’s revert to single units
df = df.with_columns((pl.col('population') * 1e3).alias('population'))
df
country | population | total GDP |
---|---|---|
str | f64 | f64 |
"Argentina" | 3.7335653e7 | 295072.21869 |
"Australia" | 1.9053186e7 | 541804.6521 |
"India" | 1.0063e9 | 1.7281e6 |
"Israel" | 6.11457e6 | 129253.89423 |
"Malawi" | 1.1801505e7 | 5026.221784 |
"South Africa" | 4.5064098e7 | 227242.36949 |
"United States" | 2.82171957e8 | 9.8987e6 |
"Uruguay" | 3.219793e6 | 25255.961693 |
Next, we’re going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions.
Note
Polars (or pandas) doesn’t have a way of recording units of measurement, such as GDP represented in millions of dollars; it is left to the user to keep track of their own units when undertaking analysis.
df = df.with_columns(
(pl.col('total GDP') * 1e6 / pl.col('population')).alias('GDP percap')
)
df
country | population | total GDP | GDP percap |
---|---|---|---|
str | f64 | f64 | f64 |
"Argentina" | 3.7335653e7 | 295072.21869 | 7903.229085 |
"Australia" | 1.9053186e7 | 541804.6521 | 28436.433261 |
"India" | 1.0063e9 | 1.7281e6 | 1717.324719 |
"Israel" | 6.11457e6 | 129253.89423 | 21138.672749 |
"Malawi" | 1.1801505e7 | 5026.221784 | 425.896679 |
"South Africa" | 4.5064098e7 | 227242.36949 | 5042.647686 |
"United States" | 2.82171957e8 | 9.8987e6 | 35080.381854 |
"Uruguay" | 3.219793e6 | 25255.961693 | 7843.97062 |
One of the nice things about Polars DataFrame and Series objects is that they can be easily converted to pandas for visualization through Matplotlib.
For example, we can easily generate a bar plot of GDP per capita
# Convert to pandas for plotting
df_pandas = df.to_pandas().set_index('country')
ax = df_pandas['GDP percap'].plot(kind='bar')
ax.set_xlabel('country', fontsize=12)
ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()
(Figure: bar plot of GDP per capita by country)
At the moment the data frame is ordered alphabetically on the countries—let’s change it to GDP per capita
df = df.sort('GDP percap', descending=True)
df
country | population | total GDP | GDP percap |
---|---|---|---|
str | f64 | f64 | f64 |
"United States" | 2.82171957e8 | 9.8987e6 | 35080.381854 |
"Australia" | 1.9053186e7 | 541804.6521 | 28436.433261 |
"Israel" | 6.11457e6 | 129253.89423 | 21138.672749 |
"Argentina" | 3.7335653e7 | 295072.21869 | 7903.229085 |
"Uruguay" | 3.219793e6 | 25255.961693 | 7843.97062 |
"South Africa" | 4.5064098e7 | 227242.36949 | 5042.647686 |
"India" | 1.0063e9 | 1.7281e6 | 1717.324719 |
"Malawi" | 1.1801505e7 | 5026.221784 | 425.896679 |
Plotting as before now yields
# Convert to pandas for plotting
df_pandas = df.to_pandas().set_index('country')
ax = df_pandas['GDP percap'].plot(kind='bar')
ax.set_xlabel('country', fontsize=12)
ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()
(Figure: bar plot of GDP per capita, countries sorted from highest to lowest)
15.4. Lazy evaluation#
One of Polars’ most powerful features is lazy evaluation. This allows Polars to optimize your entire query before executing it, leading to significant performance improvements.
15.4.1. Eager vs lazy APIs#
Polars provides two APIs:
Eager API - Operations are executed immediately (like pandas)
Lazy API - Operations are collected and optimized before execution
Let’s see the difference using our dataset:
# First, let's reload our original dataset for this example
URL = ('https://raw.githubusercontent.com/QuantEcon/'
'lecture-python-programming/master/source/_static/'
'lecture_specific/pandas/data/test_pwt.csv')
df_full = pl.read_csv(URL)
# Eager API (executed immediately)
result_eager = (df_full
.filter(pl.col('tcgdp') > 1000)
.select(['country', 'year', 'tcgdp'])
.sort('tcgdp', descending=True)
)
print("Eager result shape:", result_eager.shape)
result_eager.head()
Eager result shape: (8, 3)
country | year | tcgdp |
---|---|---|
str | i64 | f64 |
"United States" | 2000 | 9.8987e6 |
"India" | 2000 | 1.7281e6 |
"Australia" | 2000 | 541804.6521 |
"Argentina" | 2000 | 295072.21869 |
"South Africa" | 2000 | 227242.36949 |
# Lazy API (builds a query plan)
lazy_query = (df_full.lazy() # Convert to lazy frame
.filter(pl.col('tcgdp') > 1000)
.select(['country', 'year', 'tcgdp'])
.sort('tcgdp', descending=True)
)
print("Lazy query:")
print(lazy_query)
Lazy query:
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
SORT BY [col("tcgdp")]
SELECT [col("country"), col("year"), col("tcgdp")]
FILTER [(col("tcgdp")) > (1000.0)]
FROM
DF ["country", "country isocode", "year", "POP", ...]; PROJECT */8 COLUMNS
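As the printed plan suggests, we can also peek at the optimized plan before running anything; a minimal sketch:
print(lazy_query.explain())  # defaults to the optimized plan; pass optimized=False for the naive one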
We can now execute the lazy query using collect:
result_lazy = lazy_query.collect()
print("Lazy result shape:", result_lazy.shape)
result_lazy.head()
Lazy result shape: (8, 3)
country | year | tcgdp |
---|---|---|
str | i64 | f64 |
"United States" | 2000 | 9.8987e6 |
"India" | 2000 | 1.7281e6 |
"Australia" | 2000 | 541804.6521 |
"Argentina" | 2000 | 295072.21869 |
"South Africa" | 2000 | 227242.36949 |
15.4.2. Query optimization#
The lazy API allows Polars to perform several optimizations:
Predicate Pushdown - Filters are applied as early as possible
Projection Pushdown - Only required columns are read
Common Subexpression Elimination - Duplicate calculations are removed
Dead Code Elimination - Unused operations are removed
# Example of optimization - only columns needed are processed
optimized_query = (df_full.lazy()
.select(['country', 'year', 'tcgdp', 'POP']) # Select early
.filter(pl.col('tcgdp') > 500) # Filter pushdown
.with_columns((pl.col('tcgdp') / pl.col('POP')).alias('gdp_per_capita'))
.filter(pl.col('gdp_per_capita') > 10) # Additional filter
.select(['country', 'year', 'gdp_per_capita']) # Final projection
)
print("Optimized query plan:")
print(optimized_query.explain())
Optimized query plan:
simple π 3/3 ["country", "year", ... 1 other column]
FILTER [(col("gdp_per_capita")) > (10.0)]
FROM
WITH_COLUMNS:
[[(col("tcgdp")) / (col("POP"))].alias("gdp_per_capita")]
FILTER [(col("tcgdp")) > (500.0)]
FROM
DF ["country", "country isocode", "year", "POP", ...]; PROJECT["country", "year", "tcgdp", "POP"] 4/8 COLUMNS
# Execute the optimized query
result_optimized = optimized_query.collect()
result_optimized.head()
country | year | gdp_per_capita |
---|---|---|
str | i64 | f64 |
"Australia" | 2000 | 28.436433 |
"Israel" | 2000 | 21.138673 |
"United States" | 2000 | 35.080382 |
15.4.3. When to use lazy vs eager#
Use Lazy API when:
Working with large datasets
Performing complex transformations
Building data pipelines
Performance is critical
Use Eager API when:
Exploring data interactively
Working with small datasets
Need immediate results for debugging
The lazy API is particularly powerful for data processing pipelines where multiple transformations can be optimized together as a single operation.
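As a small sketch of such a pipeline, we can start from a file on disk with scan_csv so that nothing is read until the optimized plan is executed (here we reuse the pwt_subset.csv file written earlier in this lecture):
lazy_pipeline = (
    pl.scan_csv('pwt_subset.csv')      # build a LazyFrame; the file is not read yet
    .filter(pl.col('POP') > 10000)     # the predicate can be pushed towards the scan
    .select(['country', 'tcgdp'])      # only these columns need to be read
)
lazy_pipeline.collect()                # execute the optimized plan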
15.5. Online data sources#
Python makes it straightforward to query online databases programmatically.
An important database for economists is FRED — a vast collection of time series data maintained by the St. Louis Fed.
For example, suppose that we are interested in the unemployment rate.
(To download the data as a CSV, click on Download at the top right and select the CSV (data) option).
Alternatively, we can access the CSV file from within a Python program.
In the pandas lecture, we studied how to use requests and pandas to access API data.
Here Polars’ read_csv function provides the same functionality.
We use try_parse_dates=True so that Polars recognizes our dates column
url = ('https://fred.stlouisfed.org/graph/fredgraph.csv?'
'bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&'
'graph_bgcolor=%23ffffff&height=450&mode=fred&'
'recession_bars=on&txtcolor=%23444444&ts=12&tts=12&'
'width=1318&nt=0&thu=0&trc=0&show_legend=yes&'
'show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&'
'cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&'
'link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&'
'ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&'
'fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&'
'vintage_date=2024-07-29&revision_date=2024-07-29&'
'nd=1948-01-01')
data = pl.read_csv(url, try_parse_dates=True)
The data has been read into a Polars DataFrame called data that we can now manipulate in the usual way
type(data)
polars.dataframe.frame.DataFrame
data.head() # A useful method to get a quick look at a DataFrame
observation_date | UNRATE |
---|---|
date | f64 |
1948-01-01 | 3.4 |
1948-02-01 | 3.8 |
1948-03-01 | 4.0 |
1948-04-01 | 3.9 |
1948-05-01 | 3.5 |
data.describe() # Your output might differ slightly
statistic | observation_date | UNRATE |
---|---|---|
str | str | f64 |
"count" | "918" | 918.0 |
"null_count" | "0" | 0.0 |
"mean" | "1986-03-17 06:30:35.294117" | 5.693464 |
"std" | null | 1.710032 |
"min" | "1948-01-01" | 2.5 |
"25%" | "1967-02-01" | 4.4 |
"50%" | "1986-04-01" | 5.5 |
"75%" | "2005-05-01" | 6.7 |
"max" | "2024-06-01" | 14.8 |
We can also plot the unemployment rate from 2006 to 2012 as follows:
# Filter data for the specified date range and convert to pandas for plotting
filtered_data = data.filter(
(pl.col('observation_date') >= pl.date(2006, 1, 1)) &
(pl.col('observation_date') <= pl.date(2012, 12, 31))
).to_pandas().set_index('observation_date')
ax = filtered_data.plot(title='US Unemployment Rate', legend=False)
ax.set_xlabel('year', fontsize=12)
ax.set_ylabel('%', fontsize=12)
plt.show()
(Figure: US unemployment rate, 2006 to 2012)
Note that Polars offers many other file type alternatives.
Polars has a wide variety of methods that we can use to read Excel, JSON, and Parquet files, or to plug straight into a database server.
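For example, here is a minimal sketch of a Parquet round trip with the data DataFrame from above (the file name is our own choice; Parquet support comes via pyarrow, which we installed at the top of this lecture):
data.write_parquet('unrate.parquet')               # write the unemployment data to a Parquet file
data_parquet = pl.read_parquet('unrate.parquet')   # read it straight back into a DataFrame
data_parquet.head()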
15.6. Exercises#
Exercise 15.1
With these imports:
import datetime as dt
import yfinance as yf
Write a program to calculate the percentage price change over 2021 for the following shares using Polars:
ticker_list = {'INTC': 'Intel',
'MSFT': 'Microsoft',
'IBM': 'IBM',
'BHP': 'BHP',
'TM': 'Toyota',
'AAPL': 'Apple',
'AMZN': 'Amazon',
'C': 'Citigroup',
'QCOM': 'Qualcomm',
'KO': 'Coca-Cola',
'GOOG': 'Google'}
Here’s the first part of the program that reads data into a Polars DataFrame:
def read_data_polars(ticker_list,
start=dt.datetime(2021, 1, 1),
end=dt.datetime(2021, 12, 31)):
"""
This function reads in closing price data from Yahoo
for each tick in the ticker_list and returns a Polars DataFrame.
Different indices may have different trading days, so we use joins
to handle this.
"""
dataframes = []
for tick in ticker_list:
stock = yf.Ticker(tick)
prices = stock.history(start=start, end=end)
# Create a Polars DataFrame from the closing prices
df = pl.DataFrame({
'Date': pd.to_datetime(prices.index.date),
tick: prices['Close'].values
})
dataframes.append(df)
# Start with the first DataFrame
result = dataframes[0]
# Join additional DataFrames, handling mismatched dates with full outer join
for df in dataframes[1:]:
result = result.join(df, on='Date', how='full', coalesce=True)
return result
ticker = read_data_polars(ticker_list)
Complete the program to plot the result as a bar graph using Polars operations and matplotlib visualization.
Solution to Exercise 15.1
Here’s a solution using Polars operations to calculate percentage changes:
price_change_df = ticker.select([
(pl.col(tick).last() / pl.col(tick).first() * 100 - 100).alias(tick)
for tick in ticker_list.keys()
]).transpose(
include_header=True,
header_name='ticker',
column_names=['pct_change']
)
# Add company names and sort
price_change_df = price_change_df.with_columns([
pl.col('ticker')
.replace_strict(ticker_list, default=pl.col('ticker'))
.alias('company')
]).sort('pct_change')
print(price_change_df)
shape: (11, 3)
┌────────┬────────────┬───────────┐
│ ticker ┆ pct_change ┆ company │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str │
╞════════╪════════════╪═══════════╡
│ BHP ┆ -2.249111 ┆ BHP │
│ C ┆ 3.550571 ┆ Citigroup │
│ AMZN ┆ 5.845049 ┆ Amazon │
│ INTC ┆ 6.86852 ┆ Intel │
│ KO ┆ 14.92246 ┆ Coca-Cola │
│ … ┆ … ┆ … │
│ TM ┆ 23.416755 ┆ Toyota │
│ QCOM ┆ 25.318508 ┆ Qualcomm │
│ AAPL ┆ 38.55075 ┆ Apple │
│ MSFT ┆ 57.179601 ┆ Microsoft │
│ GOOG ┆ 68.960919 ┆ Google │
└────────┴────────────┴───────────┘
Now plot the results:
# Convert to pandas for plotting (as demonstrated in the lecture)
df_pandas = price_change_df.to_pandas().set_index('company')
fig, ax = plt.subplots(figsize=(10,8))
ax.set_xlabel('stock', fontsize=12)
ax.set_ylabel('percentage change in price', fontsize=12)
# Create colors: red for negative returns, blue for positive returns
colors = ['red' if x < 0 else 'blue' for x in df_pandas['pct_change']]
df_pandas['pct_change'].plot(kind='bar', ax=ax, color=colors)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
(Figure: bar plot of 2021 percentage price changes by company)
Exercise 15.2
Using the method read_data_polars introduced in Exercise 15.1, write a program to obtain year-on-year percentage change for the following indices using Polars operations:
indices_list = {'^GSPC': 'S&P 500',
'^IXIC': 'NASDAQ',
'^DJI': 'Dow Jones',
'^N225': 'Nikkei'}
Complete the program to show summary statistics and plot the result as a time series graph demonstrating Polars’ data manipulation capabilities.
Solution to Exercise 15.2
Following the work you did in Exercise 15.1, you can query the data using read_data_polars by updating the start and end dates accordingly.
indices_data = read_data_polars(
indices_list,
start=dt.datetime(2000, 1, 1),
end=dt.datetime(2021, 12, 31)
)
# Add year column for grouping
indices_data = indices_data.with_columns(
pl.col('Date').dt.year().alias('year')
)
print("Data shape:", indices_data.shape)
print("\nFirst few rows:")
print(indices_data.head())
print("\nData availability check:")
for index in indices_list.keys():
non_null_count = (indices_data
.select(pl.col(index).is_not_null().sum())
.item())
print(f"{indices_list[index]}: {non_null_count} non-null values")
Data shape: (5716, 6)
First few rows:
shape: (5, 6)
┌─────────────────────┬─────────────┬─────────────┬──────────────┬──────────────┬──────┐
│ Date ┆ ^GSPC ┆ ^IXIC ┆ ^DJI ┆ ^N225 ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i32 │
╞═════════════════════╪═════════════╪═════════════╪══════════════╪══════════════╪══════╡
│ 2000-01-03 00:00:00 ┆ 1455.219971 ┆ 4131.149902 ┆ 11357.509766 ┆ null ┆ 2000 │
│ 2000-01-04 00:00:00 ┆ 1399.420044 ┆ 3901.689941 ┆ 10997.929688 ┆ 19002.859375 ┆ 2000 │
│ 2000-01-05 00:00:00 ┆ 1402.109985 ┆ 3877.540039 ┆ 11122.650391 ┆ 18542.550781 ┆ 2000 │
│ 2000-01-06 00:00:00 ┆ 1403.449951 ┆ 3727.129883 ┆ 11253.259766 ┆ 18168.269531 ┆ 2000 │
│ 2000-01-07 00:00:00 ┆ 1441.469971 ┆ 3882.620117 ┆ 11522.55957 ┆ 18193.410156 ┆ 2000 │
└─────────────────────┴─────────────┴─────────────┴──────────────┴──────────────┴──────┘
Data availability check:
S&P 500: 5535 non-null values
NASDAQ: 5535 non-null values
Dow Jones: 5535 non-null values
Nikkei: 5391 non-null values
Calculate yearly returns using Polars groupby operations:
# Calculate first and last valid price for each year and each index
yearly_returns = indices_data.group_by('year').agg([
*[pl.col(index)
.filter(pl.col(index).is_not_null())
.first()
.alias(f"{index}_first") for index in indices_list.keys()],
*[pl.col(index)
.filter(pl.col(index).is_not_null())
.last()
.alias(f"{index}_last") for index in indices_list.keys()]
])
# Calculate percentage returns for each index, handling null values properly
return_columns = []
for index in indices_list.keys():
company_name = indices_list[index]
return_col = (
(pl.col(f"{index}_last") - pl.col(f"{index}_first")) /
pl.col(f"{index}_first") * 100
).alias(company_name)
return_columns.append(return_col)
yearly_returns = yearly_returns.with_columns(return_columns)
# Select only the year and return columns, filter out years with insufficient data
yearly_returns = yearly_returns.select([
'year',
*list(indices_list.values())
]).filter(
pl.col('year') >= 2001 # Ensure we have complete years of data
).sort('year')
print("Yearly returns shape:", yearly_returns.shape)
print("\nYearly returns:")
print(yearly_returns.head(10))
Yearly returns shape: (21, 5)
Yearly returns:
shape: (10, 5)
┌──────┬───────────┬────────────┬────────────┬────────────┐
│ year ┆ S&P 500 ┆ NASDAQ ┆ Dow Jones ┆ Nikkei │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪═══════════╪════════════╪════════════╪════════════╡
│ 2001 ┆ -10.53481 ┆ -14.898819 ┆ -5.866722 ┆ -21.872491 │
│ 2002 ┆ -23.80334 ┆ -32.524441 ┆ -17.191519 ┆ -5.438447 │
│ 2003 ┆ 22.319396 ┆ 44.663323 ┆ 21.451016 ┆ 22.458114 │
│ 2004 ┆ 9.331703 ┆ 8.409905 ┆ 3.584683 ┆ 6.481657 │
│ 2005 ┆ 3.844177 ┆ 2.470561 ┆ -0.111187 ┆ -2.182891 │
│ 2006 ┆ 11.782786 ┆ 7.645719 ┆ 14.89517 ┆ -5.645009 │
│ 2007 ┆ 3.65382 ┆ 9.455427 ┆ 6.33532 ┆ 1.347957 │
│ 2008 ┆ -37.58465 ┆ -39.568824 ┆ -32.716831 ┆ -12.641608 │
│ 2009 ┆ 19.671602 ┆ 39.023162 ┆ 15.422326 ┆ -14.297612 │
│ 2010 ┆ 11.001865 ┆ 14.92147 ┆ 9.387316 ┆ 1.879812 │
└──────┴───────────┴────────────┴────────────┴────────────┘
Generate summary statistics using Polars:
# Summary statistics for all indices
summary_stats = yearly_returns.select(list(indices_list.values())).describe()
print("Summary Statistics:")
print(summary_stats)
# Check for any null values or data issues
print(f"\nData shape: {yearly_returns.shape}")
print(f"Null counts:")
print(yearly_returns.null_count())
print(f"\nData range (first few years):")
print(yearly_returns.head())
Summary Statistics:
shape: (9, 5)
┌────────────┬───────────┬────────────┬────────────┬────────────┐
│ statistic ┆ S&P 500 ┆ NASDAQ ┆ Dow Jones ┆ Nikkei │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪═══════════╪════════════╪════════════╪════════════╡
│ count ┆ 21.0 ┆ 21.0 ┆ 21.0 ┆ 21.0 │
│ null_count ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ 7.338348 ┆ 11.275998 ┆ 6.548263 ┆ 0.267225 │
│ std ┆ 16.776667 ┆ 22.508323 ┆ 14.241644 ┆ 12.272468 │
│ min ┆ -37.58465 ┆ -39.568824 ┆ -32.716831 ┆ -21.872491 │
│ 25% ┆ -0.692839 ┆ 2.470561 ┆ -0.111187 ┆ -5.645009 │
│ 50% ┆ 11.237369 ┆ 9.790363 ┆ 6.33532 ┆ -0.438448 │
│ 75% ┆ 18.415027 ┆ 27.155799 ┆ 15.422326 ┆ 4.915217 │
│ max ┆ 29.132182 ┆ 44.663323 ┆ 24.331151 ┆ 26.990829 │
└────────────┴───────────┴────────────┴────────────┴────────────┘
Data shape: (21, 5)
Null counts:
shape: (1, 5)
┌──────┬─────────┬────────┬───────────┬────────┐
│ year ┆ S&P 500 ┆ NASDAQ ┆ Dow Jones ┆ Nikkei │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════╪═════════╪════════╪═══════════╪════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└──────┴─────────┴────────┴───────────┴────────┘
Data range (first few years):
shape: (5, 5)
┌──────┬───────────┬────────────┬────────────┬────────────┐
│ year ┆ S&P 500 ┆ NASDAQ ┆ Dow Jones ┆ Nikkei │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪═══════════╪════════════╪════════════╪════════════╡
│ 2001 ┆ -10.53481 ┆ -14.898819 ┆ -5.866722 ┆ -21.872491 │
│ 2002 ┆ -23.80334 ┆ -32.524441 ┆ -17.191519 ┆ -5.438447 │
│ 2003 ┆ 22.319396 ┆ 44.663323 ┆ 21.451016 ┆ 22.458114 │
│ 2004 ┆ 9.331703 ┆ 8.409905 ┆ 3.584683 ┆ 6.481657 │
│ 2005 ┆ 3.844177 ┆ 2.470561 ┆ -0.111187 ┆ -2.182891 │
└──────┴───────────┴────────────┴────────────┴────────────┘
Plot the time series:
# Convert to pandas for plotting
df_pandas = yearly_returns.to_pandas().set_index('year')
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Flatten 2-D array to 1-D array
for iter_, ax in enumerate(axes.flatten()):
if iter_ < len(indices_list):
# Get index name per iteration
index_name = list(indices_list.values())[iter_]
# Plot with markers and lines for better visibility
ax.plot(df_pandas.index, df_pandas[index_name], 'o-',
linewidth=2, markersize=4)
ax.set_ylabel("yearly return", fontsize=12)
ax.set_xlabel("year", fontsize=12)
ax.set_title(index_name, fontsize=12)
ax.grid(True, alpha=0.3)
# Add horizontal line at zero for reference
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
(Figure: 2x2 grid of yearly returns for the S&P 500, NASDAQ, Dow Jones and Nikkei)
Alternative: Create a single plot with all indices:
# Single plot with all indices
fig, ax = plt.subplots(figsize=(12, 8))
for index_name in indices_list.values():
# Only plot if the column has valid data
if (index_name in df_pandas.columns and
not df_pandas[index_name].isna().all()):
ax.plot(df_pandas.index, df_pandas[index_name],
label=index_name, linewidth=2, marker='o', markersize=3)
ax.set_xlabel("year", fontsize=12)
ax.set_ylabel("yearly return (%)", fontsize=12)
ax.set_title("Yearly Returns of Major Stock Indices (2001-2021)", fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='k', linestyle='--', alpha=0.5, label='Zero line')
plt.tight_layout()
plt.show()
(Figure: yearly returns of all four indices on a single plot, 2001-2021)