15. Polars#

In addition to what’s in Anaconda, this lecture will need the following libraries:

!pip install --upgrade polars wbgapi yfinance pyarrow

Hide code cell output

Successfully installed curl_cffi-0.13.0 frozendict-2.4.6 multitasking-0.0.12 peewee-3.18.2 polars-1.33.1 pyarrow-21.0.0 wbgapi-1.0.12 websockets-15.0.1 yfinance-0.2.66

15.1. Overview#

Polars is a fast data manipulation library for Python written in Rust.

Polars has gained significant popularity in recent years due to its superior performance compared to traditional data analysis tools.

This makes it an excellent choice for modern data science and machine learning workflows.

Polars is designed with performance and memory efficiency in mind, leveraging:

  • Arrow’s columnar memory format for fast data access

  • Lazy evaluation to optimize query execution

  • Parallel processing for enhanced performance

  • Expressive API similar to pandas but with better performance characteristics

Just as NumPy provides the basic array data type plus core array operations, Polars

  1. defines fundamental structures for working with data and

  2. endows them with methods that facilitate operations such as

    • reading in data

    • adjusting indices

    • working with dates and time series

    • sorting, grouping, re-ordering and general data munging [1]

    • dealing with missing values, etc.

More sophisticated statistical functionality is left to other packages, such as statsmodels and scikit-learn, which can work with Polars DataFrames through their interoperability with pandas.

This lecture will provide a basic introduction to Polars.

Tip

Why use Polars over pandas? One reason is memory efficiency: as a general rule, pandas needs roughly 5 to 10 times as much RAM as the size of the dataset to carry out operations, compared to the 2 to 4 times needed by Polars. Another is speed: Polars is between 10 and 100 times as fast as pandas for common operations. A great article comparing Polars and pandas can be found in this JetBrains blog post.

Throughout the lecture, we will assume that the following imports have taken place

import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Two important data types defined by Polars are Series and DataFrame.

You can think of a Series as a “column” of data, such as a collection of observations on a single variable.

A DataFrame is a two-dimensional object for storing related columns of data.

15.2. Series#

Let’s start with Series.

We begin by creating a series of four random observations

s = pl.Series(name='daily returns', values=np.random.randn(4))
s
shape: (4,)
daily returns
f64
1.125966
-0.594306
0.792461
0.452324

Note

You may notice that, unlike a pd.Series, the series above has no index; this is because Polars is column-centric and accessing data is predominantly managed through filtering and boolean masks; here is an interesting blog post discussing this in more detail.

Polars Series are built on top of Apache Arrow arrays and support many of the same operations as pandas Series.

(For interested readers, please see this extended reading on Apache Arrow)

s * 100
shape: (4,)
daily returns
f64
112.596609
-59.430626
79.246067
45.232421
s.abs()
shape: (4,)
daily returns
f64
1.125966
0.594306
0.792461
0.452324

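Because Polars Series are backed by Apache Arrow arrays, converting to neighboring ecosystems is cheap. As a small sketch (to_pandas requires pyarrow, which we installed above):

s.to_numpy()   # convert to a NumPy array (often zero-copy)
s.to_arrow()   # expose the underlying Arrow array
s.to_pandas()  # convert to a pandas Series
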
But Series provide more than basic arrays.

For example, they have some additional (statistically oriented) methods

s.describe()
shape: (9, 2)
statistic     value
str           f64
"count"       4.0
"null_count"  0.0
"mean"        0.444111
"std"         0.744905
"min"         -0.594306
"25%"         0.452324
"50%"         0.792461
"75%"         0.792461
"max"         1.125966

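If we only need a single statistic, Polars Series also expose the usual summary statistics as methods

s.mean()   # similarly s.std(), s.min(), s.max(), ...
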
However, the pl.Series object cannot be used in the same way as a pd.Series when pairing data with indices.

For example, using a pd.Series you can do the following:

s = pd.Series(np.random.randn(4), name='daily returns')
s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
s
AMZN   -1.090070
AAPL    0.484258
MSFT   -0.313922
GOOG   -1.967324
Name: daily returns, dtype: float64

However, in Polars you will need to use the DataFrame object to do the same task.

This means you will use the DataFrame object more often when using Polars if you are interested in relationships between data.

Let’s create a pl.DataFrame containing the equivalent data in the pd.Series.

df = pl.DataFrame({
    'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
    'daily returns': s.to_list()
})
df
shape: (4, 2)
company  daily returns
str      f64
"AMZN"   -1.09007
"AAPL"   0.484258
"MSFT"   -0.313922
"GOOG"   -1.967324

To access specific values by company name, we can filter the DataFrame for the AMZN ticker code and select the daily returns.

df.filter(pl.col('company') == 'AMZN').select('daily returns').item()
-1.0900699038656159

If we want to update the AMZN return to 0, we can use the following chain of methods.

Here with_columns is similar to select, but it adds (or replaces) columns in the existing DataFrame

df = df.with_columns(
    pl.when(pl.col('company') == 'AMZN') # filter for AMZN in company column
    .then(0)                             # set values to 0
    .otherwise(pl.col('daily returns'))  # otherwise keep original value
    .alias('daily returns')              # assign back to the column
)
df
shape: (4, 2)
company  daily returns
str      f64
"AMZN"   0.0
"AAPL"   0.484258
"MSFT"   -0.313922
"GOOG"   -1.967324

You can check if a ticker code is in the company list

'AAPL' in df['company']
True

15.3. DataFrames#

While a Series is a single column of data, a DataFrame is several columns, one for each variable.

In essence, a DataFrame in Polars is analogous to a (highly optimized) Excel spreadsheet.

Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.

Let’s look at an example that reads data from the CSV file pandas/data/test_pwt.csv, which is taken from the Penn World Tables.

The dataset contains the following indicators:

Variable Name    Description
POP              Population (in thousands)
XRAT             Exchange Rate to US Dollar
tcgdp            Total PPP Converted GDP (in million international dollar)
cc               Consumption Share of PPP Converted GDP Per Capita (%)
cg               Government Consumption Share of PPP Converted GDP Per Capita (%)

We’ll read this in from a URL using the Polars function read_csv.

URL = ('https://raw.githubusercontent.com/QuantEcon/'
       'lecture-python-programming/master/source/_static/'
       'lecture_specific/pandas/data/test_pwt.csv')
df = pl.read_csv(URL)
type(df)
polars.dataframe.frame.DataFrame

Here is the content of test_pwt.csv

df
shape: (8, 8)
country          country isocode  year  POP         XRAT       tcgdp         cc         cg
str              str              i64   f64         f64        f64           f64        f64
"Argentina"      "ARG"            2000  37335.653   0.9995     295072.21869  75.716805  5.578804
"Australia"      "AUS"            2000  19053.186   1.72483    541804.6521   67.759026  6.720098
"India"          "IND"            2000  1.0063e6    44.9416    1.7281e6      64.575551  14.072206
"Israel"         "ISR"            2000  6114.57     4.07733    129253.89423  64.436451  10.266688
"Malawi"         "MWI"            2000  11801.505   59.543808  5026.221784   74.707624  11.658954
"South Africa"   "ZAF"            2000  45064.098   6.93983    227242.36949  72.71871   5.726546
"United States"  "USA"            2000  282171.957  1.0        9.8987e6      72.347054  6.032454
"Uruguay"        "URY"            2000  3219.793    12.099592  25255.961693  78.97874   5.108068

15.3.1. Select data by position#

In practice, one thing that we do all the time is to find, select and work with the subset of the data that interests us.

We can select particular rows using array slicing notation

df[2:5]
shape: (3, 8)
country   country isocode  year  POP        XRAT       tcgdp         cc         cg
str       str              i64   f64        f64        f64           f64        f64
"India"   "IND"            2000  1.0063e6   44.9416    1.7281e6      64.575551  14.072206
"Israel"  "ISR"            2000  6114.57    4.07733    129253.89423  64.436451  10.266688
"Malawi"  "MWI"            2000  11801.505  59.543808  5026.221784   74.707624  11.658954

To select columns, we can pass a list containing the names of the desired columns

df.select(['country', 'tcgdp'])
shape: (8, 2)
country          tcgdp
str              f64
"Argentina"      295072.21869
"Australia"      541804.6521
"India"          1.7281e6
"Israel"         129253.89423
"Malawi"         5026.221784
"South Africa"   227242.36949
"United States"  9.8987e6
"Uruguay"        25255.961693

To select both rows and columns using integers, we can combine slicing with column selection

df[2:5].select(df.columns[0:4])
shape: (3, 4)
country   country isocode  year  POP
str       str              i64   f64
"India"   "IND"            2000  1.0063e6
"Israel"  "ISR"            2000  6114.57
"Malawi"  "MWI"            2000  11801.505

To select rows and columns using a mixture of integers and labels, we can use more complex selection

df[2:5].select(['country', 'tcgdp'])
shape: (3, 2)
country   tcgdp
str       f64
"India"   1.7281e6
"Israel"  129253.89423
"Malawi"  5026.221784

15.3.2. Select data by conditions#

Instead of indexing rows and columns using integers and names, we can also obtain a sub-DataFrame of interest that satisfies certain (potentially complicated) conditions.

This section demonstrates various ways to do that.

The most straightforward way is with the filter method.

df.filter(pl.col('POP') >= 20000)
shape: (4, 8)
country          country isocode  year  POP         XRAT     tcgdp         cc         cg
str              str              i64   f64         f64      f64           f64        f64
"Argentina"      "ARG"            2000  37335.653   0.9995   295072.21869  75.716805  5.578804
"India"          "IND"            2000  1.0063e6    44.9416  1.7281e6      64.575551  14.072206
"South Africa"   "ZAF"            2000  45064.098   6.93983  227242.36949  72.71871   5.726546
"United States"  "USA"            2000  282171.957  1.0      9.8987e6      72.347054  6.032454

In this case, df.filter() takes a boolean expression and returns only the rows where the expression evaluates to True.

We can view this boolean mask as a table with the alias meets_criteria

df.select(
    pl.col('country'),
    (pl.col('POP') >= 20000).alias('meets_criteria')
)
shape: (8, 2)
country          meets_criteria
str              bool
"Argentina"      true
"Australia"      false
"India"          true
"Israel"         false
"Malawi"         false
"South Africa"   true
"United States"  true
"Uruguay"        false

Here is another example:

df.filter(
    (pl.col('country').is_in(['Argentina', 'India', 'South Africa'])) &
    (pl.col('POP') > 40000)
)
shape: (2, 8)
country         country isocode  year  POP        XRAT     tcgdp         cc         cg
str             str              i64   f64        f64      f64           f64        f64
"India"         "IND"            2000  1.0063e6   44.9416  1.7281e6      64.575551  14.072206
"South Africa"  "ZAF"            2000  45064.098  6.93983  227242.36949  72.71871   5.726546

The condition can also involve arithmetic operations between different columns.

df.filter(
    (pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
)
shape: (2, 8)
country    country isocode  year  POP        XRAT       tcgdp         cc         cg
str        str              i64   f64        f64        f64           f64        f64
"Malawi"   "MWI"            2000  11801.505  59.543808  5026.221784   74.707624  11.658954
"Uruguay"  "URY"            2000  3219.793   12.099592  25255.961693  78.97874   5.108068

For example, we can use a condition to select the country with the largest household consumption–GDP share cc.

df.filter(pl.col('cc') == pl.col('cc').max())
shape: (1, 8)
country    country isocode  year  POP       XRAT       tcgdp         cc        cg
str        str              i64   f64       f64        f64           f64       f64
"Uruguay"  "URY"            2000  3219.793  12.099592  25255.961693  78.97874  5.108068

When we only want to look at certain columns of a selected sub-DataFrame, we can combine filter with select.

df.filter(
    (pl.col('cc') + pl.col('cg') >= 80) & (pl.col('POP') <= 20000)
).select(['country', 'year', 'POP'])
shape: (2, 3)
country    year  POP
str        i64   f64
"Malawi"   2000  11801.505
"Uruguay"  2000  3219.793

Application: Subsetting DataFrame

Real-world datasets can be very large.

It is sometimes desirable to work with a subset of data to enhance computational efficiency and reduce redundancy.

Let’s imagine that we’re only interested in the population (POP) and total GDP (tcgdp).

One way to strip the data frame df down to only these variables is to overwrite the DataFrame using the selection method described above

df_subset = df.select(['country', 'POP', 'tcgdp'])
df_subset
shape: (8, 3)
country          POP         tcgdp
str              f64         f64
"Argentina"      37335.653   295072.21869
"Australia"      19053.186   541804.6521
"India"          1.0063e6    1.7281e6
"Israel"         6114.57     129253.89423
"Malawi"         11801.505   5026.221784
"South Africa"   45064.098   227242.36949
"United States"  282171.957  9.8987e6
"Uruguay"        3219.793    25255.961693

We can then save the smaller dataset for further analysis.

df_subset.write_csv('pwt_subset.csv')
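
For larger datasets, a binary columnar format such as Parquet is usually a better choice than CSV, since it preserves dtypes and compresses well; a minimal sketch (the filename here is just an example)

df_subset.write_parquet('pwt_subset.parquet')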

15.3.3. Apply and map operations#

Polars provides powerful methods for applying functions to data.

Instead of pandas’ apply method, Polars uses expressions within select, with_columns, or filter methods.

Here is an example using built-in functions to find the max value for each column

df.select([
    pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg'])
    .max()
    .name.suffix('_max')
])
shape: (1, 6)
year_max  POP_max   XRAT_max   tcgdp_max  cc_max    cg_max
i64       f64       f64        f64        f64       f64
2000      1.0063e6  59.543808  9.8987e6   78.97874  14.072206

For more complex operations, we can use map_elements (similar to pandas’ apply):

df.select([
    pl.col('country'),
    pl.col('POP').map_elements(lambda x: x * 2).alias('POP_doubled')
])
/tmp/ipykernel_6418/1063562650.py:3: PolarsInefficientMapWarning: 
Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - pl.col("POP").map_elements(lambda x: ...)
with this one instead:
  + pl.col("POP") * 2

  pl.col('POP').map_elements(lambda x: x * 2).alias('POP_doubled')
shape: (8, 2)
country          POP_doubled
str              f64
"Argentina"      74671.306
"Australia"      38106.372
"India"          2.0126e6
"Israel"         12229.14
"Malawi"         23603.01
"South Africa"   90128.196
"United States"  564343.914
"Uruguay"        6439.586

However, as the warning issued by Polars suggests, there is often a better way to achieve this using the native expressions API.

df.select([
    pl.col('country'),
    (pl.col('POP') * 2).alias('POP_doubled')
])
shape: (8, 2)
country          POP_doubled
str              f64
"Argentina"      74671.306
"Australia"      38106.372
"India"          2.0126e6
"Israel"         12229.14
"Malawi"         23603.01
"South Africa"   90128.196
"United States"  564343.914
"Uruguay"        6439.586

We can use complex filtering conditions with boolean logic:

complex_condition = (
    pl.when(pl.col('country').is_in(['Argentina', 'India', 'South Africa']))
    .then(pl.col('POP') > 40000)
    .otherwise(pl.col('POP') < 20000)
)

df.filter(complex_condition).select([
    'country', 'year', 'POP', 'XRAT', 'tcgdp'
])
shape: (6, 5)
country         year  POP        XRAT       tcgdp
str             i64   f64        f64        f64
"Australia"     2000  19053.186  1.72483    541804.6521
"India"         2000  1.0063e6   44.9416    1.7281e6
"Israel"        2000  6114.57    4.07733    129253.89423
"Malawi"        2000  11801.505  59.543808  5026.221784
"South Africa"  2000  45064.098  6.93983    227242.36949
"Uruguay"       2000  3219.793   12.099592  25255.961693

15.3.4. Make changes in DataFrames#

The ability to make changes in DataFrames is important to generate a clean dataset for future analysis.

1. We can use conditional logic to “keep” certain values and replace others

df.with_columns(
    pl.when(pl.col('POP') >= 20000)          # when population >= 20000
    .then(pl.col('POP'))                     # keep the population value
    .otherwise(None)                         # otherwise set to null
    .alias('POP_filtered')                   # save results in POP_filtered
).select(['country', 'POP', 'POP_filtered']) # select the columns
shape: (8, 3)
country          POP         POP_filtered
str              f64         f64
"Argentina"      37335.653   37335.653
"Australia"      19053.186   null
"India"          1.0063e6    1.0063e6
"Israel"         6114.57     null
"Malawi"         11801.505   null
"South Africa"   45064.098   45064.098
"United States"  282171.957  282171.957
"Uruguay"        3219.793    null

2. We can modify specific values based on conditions

df_modified = df.with_columns(
    pl.when(pl.col('cg') == pl.col('cg').max())    # pick the largest cg value
    .then(None)                                    # set to null
    .otherwise(pl.col('cg'))                       # otherwise keep the value
    .alias('cg')                                   # update the column
)
df_modified
shape: (8, 8)
country          country isocode  year  POP         XRAT       tcgdp         cc         cg
str              str              i64   f64         f64        f64           f64        f64
"Argentina"      "ARG"            2000  37335.653   0.9995     295072.21869  75.716805  5.578804
"Australia"      "AUS"            2000  19053.186   1.72483    541804.6521   67.759026  6.720098
"India"          "IND"            2000  1.0063e6    44.9416    1.7281e6      64.575551  null
"Israel"         "ISR"            2000  6114.57     4.07733    129253.89423  64.436451  10.266688
"Malawi"         "MWI"            2000  11801.505   59.543808  5026.221784   74.707624  11.658954
"South Africa"   "ZAF"            2000  45064.098   6.93983    227242.36949  72.71871   5.726546
"United States"  "USA"            2000  282171.957  1.0        9.8987e6      72.347054  6.032454
"Uruguay"        "URY"            2000  3219.793    12.099592  25255.961693  78.97874   5.108068

3. We can use expressions to modify columns as a whole

df.with_columns([
    pl.when(pl.col('POP') <= 10000)          # when population is <= 10,000
    .then(None)                              # set the value to null
    .otherwise(pl.col('POP'))                # otherwise keep existing value
    .alias('POP'),                           # update the POP column
    (pl.col('XRAT') / 10).alias('XRAT')      # update XRAT in-place
])
shape: (8, 8)
country          country isocode  year  POP         XRAT      tcgdp         cc         cg
str              str              i64   f64         f64       f64           f64        f64
"Argentina"      "ARG"            2000  37335.653   0.09995   295072.21869  75.716805  5.578804
"Australia"      "AUS"            2000  19053.186   0.172483  541804.6521   67.759026  6.720098
"India"          "IND"            2000  1.0063e6    4.49416   1.7281e6      64.575551  14.072206
"Israel"         "ISR"            2000  null        0.407733  129253.89423  64.436451  10.266688
"Malawi"         "MWI"            2000  11801.505   5.954381  5026.221784   74.707624  11.658954
"South Africa"   "ZAF"            2000  45064.098   0.693983  227242.36949  72.71871   5.726546
"United States"  "USA"            2000  282171.957  0.1       9.8987e6      72.347054  6.032454
"Uruguay"        "URY"            2000  null        1.209959  25255.961693  78.97874   5.108068

4. We can use built-in functions to modify all entries in columns of a given data type.

df.with_columns([
    pl.col(pl.Float64).round(2)   # round all Float64 columns
])
shape: (8, 8)
country          country isocode  year  POP        XRAT   tcgdp      cc     cg
str              str              i64   f64        f64    f64        f64    f64
"Argentina"      "ARG"            2000  37335.65   1.0    295072.22  75.72  5.58
"Australia"      "AUS"            2000  19053.19   1.72   541804.65  67.76  6.72
"India"          "IND"            2000  1006300.3  44.94  1.7281e6   64.58  14.07
"Israel"         "ISR"            2000  6114.57    4.08   129253.89  64.44  10.27
"Malawi"         "MWI"            2000  11801.5    59.54  5026.22    74.71  11.66
"South Africa"   "ZAF"            2000  45064.1    6.94   227242.37  72.72  5.73
"United States"  "USA"            2000  282171.96  1.0    9.8987e6   72.35  6.03
"Uruguay"        "URY"            2000  3219.79    12.1   25255.96   78.98  5.11

Application: Missing Value Imputation

Replacing missing values is an important step in data munging.

Let’s randomly insert some null values

# Create a copy with some null values
df_with_nulls = df.clone()

# Set some specific positions to null
indices_to_null = [(0, 'XRAT'), (3, 'cc'), (5, 'tcgdp'), (6, 'POP')]

for row_idx, col_name in indices_to_null:
    df_with_nulls = df_with_nulls.with_columns(
        pl.when(pl.int_range(pl.len()) == row_idx)
        .then(None)
        .otherwise(pl.col(col_name))
        .alias(col_name)
    )

df_with_nulls
shape: (8, 8)
country          country isocode  year  POP         XRAT       tcgdp         cc         cg
str              str              i64   f64         f64        f64           f64        f64
"Argentina"      "ARG"            2000  37335.653   null       295072.21869  75.716805  5.578804
"Australia"      "AUS"            2000  19053.186   1.72483    541804.6521   67.759026  6.720098
"India"          "IND"            2000  1.0063e6    44.9416    1.7281e6      64.575551  14.072206
"Israel"         "ISR"            2000  6114.57     4.07733    129253.89423  null       10.266688
"Malawi"         "MWI"            2000  11801.505   59.543808  5026.221784   74.707624  11.658954
"South Africa"   "ZAF"            2000  45064.098   6.93983    null          72.71871   5.726546
"United States"  "USA"            2000  null        1.0        9.8987e6      72.347054  6.032454
"Uruguay"        "URY"            2000  3219.793    12.099592  25255.961693  78.97874   5.108068

We can replace all missing values with 0

df_with_nulls.fill_null(0)
shape: (8, 8)
country          country isocode  year  POP         XRAT       tcgdp         cc         cg
str              str              i64   f64         f64        f64           f64        f64
"Argentina"      "ARG"            2000  37335.653   0.0        295072.21869  75.716805  5.578804
"Australia"      "AUS"            2000  19053.186   1.72483    541804.6521   67.759026  6.720098
"India"          "IND"            2000  1.0063e6    44.9416    1.7281e6      64.575551  14.072206
"Israel"         "ISR"            2000  6114.57     4.07733    129253.89423  0.0        10.266688
"Malawi"         "MWI"            2000  11801.505   59.543808  5026.221784   74.707624  11.658954
"South Africa"   "ZAF"            2000  45064.098   6.93983    0.0           72.71871   5.726546
"United States"  "USA"            2000  0.0         1.0        9.8987e6      72.347054  6.032454
"Uruguay"        "URY"            2000  3219.793    12.099592  25255.961693  78.97874   5.108068

Polars also provides us with convenient methods to replace missing values.

For example, we can use forward fill, backward fill, or interpolation; a sketch of these follows the next example.

Here we fill null values with the column means

cols = ["cc", "tcgdp", "POP", "XRAT"]
df_with_nulls.with_columns([
    pl.col(cols).fill_null(pl.col(cols).mean())
])
shape: (8, 8)
country          country isocode  year  POP            XRAT       tcgdp         cc         cg
str              str              i64   f64            f64        f64           f64        f64
"Argentina"      "ARG"            2000  37335.653      18.618141  295072.21869  75.716805  5.578804
"Australia"      "AUS"            2000  19053.186      1.72483    541804.6521   67.759026  6.720098
"India"          "IND"            2000  1.0063e6       44.9416    1.7281e6      64.575551  14.072206
"Israel"         "ISR"            2000  6114.57        4.07733    129253.89423  72.400502  10.266688
"Malawi"         "MWI"            2000  11801.505      59.543808  5026.221784   74.707624  11.658954
"South Africa"   "ZAF"            2000  45064.098      6.93983    1.8033e6      72.71871   5.726546
"United States"  "USA"            2000  161269.871714  1.0        9.8987e6      72.347054  6.032454
"Uruguay"        "URY"            2000  3219.793       12.099592  25255.961693  78.97874   5.108068

Missing value imputation is a big area in data science involving various machine learning techniques.

There are also more advanced tools in Python to impute missing values.

15.3.5. Standardization and visualization#

Let’s imagine that we’re only interested in the population (POP) and total GDP (tcgdp).

One way to strip the data frame df down to only these variables is to overwrite the DataFrame using the selection method described above

df = df.select(['country', 'POP', 'tcgdp'])
df
shape: (8, 3)
country          POP         tcgdp
str              f64         f64
"Argentina"      37335.653   295072.21869
"Australia"      19053.186   541804.6521
"India"          1.0063e6    1.7281e6
"Israel"         6114.57     129253.89423
"Malawi"         11801.505   5026.221784
"South Africa"   45064.098   227242.36949
"United States"  282171.957  9.8987e6
"Uruguay"        3219.793    25255.961693

While Polars doesn’t have a traditional index like pandas, we can work with country names directly

df
shape: (8, 3)
country          POP         tcgdp
str              f64         f64
"Argentina"      37335.653   295072.21869
"Australia"      19053.186   541804.6521
"India"          1.0063e6    1.7281e6
"Israel"         6114.57     129253.89423
"Malawi"         11801.505   5026.221784
"South Africa"   45064.098   227242.36949
"United States"  282171.957  9.8987e6
"Uruguay"        3219.793    25255.961693

Let’s give the columns slightly better names

df = df.rename({'POP': 'population', 'tcgdp': 'total GDP'})
df
shape: (8, 3)
country          population  total GDP
str              f64         f64
"Argentina"      37335.653   295072.21869
"Australia"      19053.186   541804.6521
"India"          1.0063e6    1.7281e6
"Israel"         6114.57     129253.89423
"Malawi"         11801.505   5026.221784
"South Africa"   45064.098   227242.36949
"United States"  282171.957  9.8987e6
"Uruguay"        3219.793    25255.961693

The population variable is in thousands; let's revert to single units.

df = df.with_columns((pl.col('population') * 1e3).alias('population'))
df
shape: (8, 3)
country          population    total GDP
str              f64           f64
"Argentina"      3.7335653e7   295072.21869
"Australia"      1.9053186e7   541804.6521
"India"          1.0063e9      1.7281e6
"Israel"         6.11457e6     129253.89423
"Malawi"         1.1801505e7   5026.221784
"South Africa"   4.5064098e7   227242.36949
"United States"  2.82171957e8  9.8987e6
"Uruguay"        3.219793e6    25255.961693

Next, we’re going to add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions.

Note

Neither Polars nor pandas has a way of recording units of measurement, such as GDP represented in millions of dollars. It is left to the user to track their own units when undertaking analysis.

df = df.with_columns(
    (pl.col('total GDP') * 1e6 / pl.col('population')).alias('GDP percap')
)
df
shape: (8, 4)
country          population    total GDP     GDP percap
str              f64           f64           f64
"Argentina"      3.7335653e7   295072.21869  7903.229085
"Australia"      1.9053186e7   541804.6521   28436.433261
"India"          1.0063e9      1.7281e6      1717.324719
"Israel"         6.11457e6     129253.89423  21138.672749
"Malawi"         1.1801505e7   5026.221784   425.896679
"South Africa"   4.5064098e7   227242.36949  5042.647686
"United States"  2.82171957e8  9.8987e6      35080.381854
"Uruguay"        3.219793e6    25255.961693  7843.97062

One of the nice things about Polars DataFrame and Series objects is that they can be easily converted to pandas for visualization through Matplotlib.

For example, we can easily generate a bar plot of GDP per capita

# Convert to pandas for plotting
df_pandas = df.to_pandas().set_index('country')
ax = df_pandas['GDP percap'].plot(kind='bar')
ax.set_xlabel('country', fontsize=12)
ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()
[Figure: bar chart of GDP per capita by country]

At the moment the data frame is ordered alphabetically by country; let's sort it by GDP per capita instead.

df = df.sort('GDP percap', descending=True)
df
shape: (8, 4)
country          population    total GDP     GDP percap
str              f64           f64           f64
"United States"  2.82171957e8  9.8987e6      35080.381854
"Australia"      1.9053186e7   541804.6521   28436.433261
"Israel"         6.11457e6     129253.89423  21138.672749
"Argentina"      3.7335653e7   295072.21869  7903.229085
"Uruguay"        3.219793e6    25255.961693  7843.97062
"South Africa"   4.5064098e7   227242.36949  5042.647686
"India"          1.0063e9      1.7281e6      1717.324719
"Malawi"         1.1801505e7   5026.221784   425.896679

Plotting as before now yields

# Convert to pandas for plotting
df_pandas = df.to_pandas().set_index('country')
ax = df_pandas['GDP percap'].plot(kind='bar')
ax.set_xlabel('country', fontsize=12)
ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()
[Figure: bar chart of GDP per capita by country, sorted in descending order]

15.4. Lazy evaluation#

One of Polars’ most powerful features is lazy evaluation. This allows Polars to optimize your entire query before executing it, leading to significant performance improvements.

15.4.1. Eager vs lazy APIs#

Polars provides two APIs:

  1. Eager API - Operations are executed immediately (like pandas)

  2. Lazy API - Operations are collected and optimized before execution

Let’s see the difference using our dataset:

# First, let's reload our original dataset for this example
URL = ('https://raw.githubusercontent.com/QuantEcon/'
       'lecture-python-programming/master/source/_static/'
       'lecture_specific/pandas/data/test_pwt.csv')
df_full = pl.read_csv(URL)

# Eager API (executed immediately)
result_eager = (df_full
    .filter(pl.col('tcgdp') > 1000)
    .select(['country', 'year', 'tcgdp'])
    .sort('tcgdp', descending=True)
)
print("Eager result shape:", result_eager.shape)
result_eager.head()
Eager result shape: (8, 3)
shape: (5, 3)
country          year  tcgdp
str              i64   f64
"United States"  2000  9.8987e6
"India"          2000  1.7281e6
"Australia"      2000  541804.6521
"Argentina"      2000  295072.21869
"South Africa"   2000  227242.36949
# Lazy API (builds a query plan)
lazy_query = (df_full.lazy()  # Convert to lazy frame
    .filter(pl.col('tcgdp') > 1000)
    .select(['country', 'year', 'tcgdp'])
    .sort('tcgdp', descending=True)
)

print("Lazy query:")
print(lazy_query)
Lazy query:
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

SORT BY [col("tcgdp")]
  SELECT [col("country"), col("year"), col("tcgdp")]
    FILTER [(col("tcgdp")) > (1000.0)]
    FROM
      DF ["country", "country isocode", "year", "POP", ...]; PROJECT */8 COLUMNS

We can now execute the lazy query using collect:

result_lazy = lazy_query.collect()
print("Lazy result shape:", result_lazy.shape)
result_lazy.head()
Lazy result shape: (8, 3)
shape: (5, 3)
country          year  tcgdp
str              i64   f64
"United States"  2000  9.8987e6
"India"          2000  1.7281e6
"Australia"      2000  541804.6521
"Argentina"      2000  295072.21869
"South Africa"   2000  227242.36949

15.4.2. Query optimization#

The lazy API allows Polars to perform several optimizations:

  1. Predicate Pushdown - Filters are applied as early as possible

  2. Projection Pushdown - Only required columns are read

  3. Common Subexpression Elimination - Duplicate calculations are removed

  4. Dead Code Elimination - Unused operations are removed

# Example of optimization - only columns needed are processed
optimized_query = (df_full.lazy()
    .select(['country', 'year', 'tcgdp', 'POP'])  # Select early
    .filter(pl.col('tcgdp') > 500)                # Filter pushdown
    .with_columns((pl.col('tcgdp') / pl.col('POP')).alias('gdp_per_capita'))
    .filter(pl.col('gdp_per_capita') > 10)        # Additional filter
    .select(['country', 'year', 'gdp_per_capita']) # Final projection
)

print("Optimized query plan:")
print(optimized_query.explain())
Optimized query plan:
simple π 3/3 ["country", "year", ... 1 other column]
  FILTER [(col("gdp_per_capita")) > (10.0)]
  FROM
     WITH_COLUMNS:
     [[(col("tcgdp")) / (col("POP"))].alias("gdp_per_capita")] 
      FILTER [(col("tcgdp")) > (500.0)]
      FROM
        DF ["country", "country isocode", "year", "POP", ...]; PROJECT["country", "year", "tcgdp", "POP"] 4/8 COLUMNS
# Execute the optimized query
result_optimized = optimized_query.collect()
result_optimized.head()
shape: (3, 3)
country          year  gdp_per_capita
str              i64   f64
"Australia"      2000  28.436433
"Israel"         2000  21.138673
"United States"  2000  35.080382

15.4.3. When to use lazy vs eager#

Use Lazy API when:

  • Working with large datasets

  • Performing complex transformations

  • Building data pipelines

  • Performance is critical

Use Eager API when:

  • Exploring data interactively

  • Working with small datasets

  • Need immediate results for debugging

The lazy API is particularly powerful for data processing pipelines where multiple transformations can be optimized together as a single operation.
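
When data lives on disk, the lazy API can start right at the source: scan_csv returns a LazyFrame without reading anything, so projection and predicate pushdown are applied during the read itself. Here is a sketch, assuming the PWT data has been saved locally as test_pwt.csv

query = (
    pl.scan_csv('test_pwt.csv')            # build a lazy scan; nothing is read yet
    .filter(pl.col('POP') >= 20000)        # predicate pushed down into the scan
    .select(['country', 'POP', 'tcgdp'])   # only these columns are read
)
result = query.collect()                   # execute the optimized plan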

15.5. Online data sources#

Python makes it straightforward to query online databases programmatically.

An important database for economists is FRED — a vast collection of time series data maintained by the St. Louis Fed.

For example, suppose that we are interested in the unemployment rate.

(To download the data as a csv, click on the top right Download and select the CSV (data) option).

Alternatively, we can access the CSV file from within a Python program.

In the pandas lecture, we studied how to use requests and pandas to access API data.

Here Polars’ read_csv function provides the same functionality.

We use try_parse_dates=True so that Polars recognizes our dates column

url = ('https://fred.stlouisfed.org/graph/fredgraph.csv?'
       'bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&'
       'graph_bgcolor=%23ffffff&height=450&mode=fred&'
       'recession_bars=on&txtcolor=%23444444&ts=12&tts=12&'
       'width=1318&nt=0&thu=0&trc=0&show_legend=yes&'
       'show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&'
       'cosd=1948-01-01&coed=2024-06-01&line_color=%234572a7&'
       'link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&'
       'ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&'
       'fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&'
       'vintage_date=2024-07-29&revision_date=2024-07-29&'
       'nd=1948-01-01')
data = pl.read_csv(url, try_parse_dates=True)

The data has been read into a Polars DataFrame called data that we can now manipulate in the usual way

type(data)
polars.dataframe.frame.DataFrame
data.head()  # A useful method to get a quick look at a DataFrame
shape: (5, 2)
observation_date  UNRATE
date              f64
1948-01-01        3.4
1948-02-01        3.8
1948-03-01        4.0
1948-04-01        3.9
1948-05-01        3.5
data.describe()  # Your output might differ slightly
shape: (9, 3)
statistic     observation_date              UNRATE
str           str                           f64
"count"       "918"                         918.0
"null_count"  "0"                           0.0
"mean"        "1986-03-17 06:30:35.294117"  5.693464
"std"         null                          1.710032
"min"         "1948-01-01"                  2.5
"25%"         "1967-02-01"                  4.4
"50%"         "1986-04-01"                  5.5
"75%"         "2005-05-01"                  6.7
"max"         "2024-06-01"                  14.8

We can also plot the unemployment rate from 2006 to 2012 as follows:

# Filter data for the specified date range and convert to pandas for plotting
filtered_data = data.filter(
    (pl.col('observation_date') >= pl.date(2006, 1, 1)) &
    (pl.col('observation_date') <= pl.date(2012, 12, 31))
).to_pandas().set_index('observation_date')

ax = filtered_data.plot(title='US Unemployment Rate', legend=False)
ax.set_xlabel('year', fontsize=12)
ax.set_ylabel('%', fontsize=12)
plt.show()
[Figure: line plot of the US unemployment rate, 2006–2012]

Note that Polars offers many other file-type alternatives.

Polars has a wide variety of functions for reading Excel, JSON, and Parquet files, or for plugging straight into a database server; a short sketch follows.
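
For example (the file names and connection below are hypothetical)

pl.read_parquet('data.parquet')          # Parquet files
pl.read_json('data.json')                # JSON
pl.read_excel('data.xlsx')               # Excel (requires an extra engine dependency)
pl.read_database(query, connection)      # results of a SQL query against a database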

15.6. Exercises#

Exercise 15.1

With these imports:

import datetime as dt
import yfinance as yf

Write a program to calculate the percentage price change over 2021 for the following shares using Polars:

ticker_list = {'INTC': 'Intel',
               'MSFT': 'Microsoft',
               'IBM': 'IBM',
               'BHP': 'BHP',
               'TM': 'Toyota',
               'AAPL': 'Apple',
               'AMZN': 'Amazon',
               'C': 'Citigroup',
               'QCOM': 'Qualcomm',
               'KO': 'Coca-Cola',
               'GOOG': 'Google'}

Here’s the first part of the program that reads data into a Polars DataFrame:

def read_data_polars(ticker_list,
                    start=dt.datetime(2021, 1, 1),
                    end=dt.datetime(2021, 12, 31)):
    """
    This function reads in closing price data from Yahoo
    for each tick in the ticker_list and returns a Polars DataFrame.
    Different indices may have different trading days, so we use joins
    to handle this.
    """
    dataframes = []

    for tick in ticker_list:
        stock = yf.Ticker(tick)
        prices = stock.history(start=start, end=end)

        # Create a Polars DataFrame from the closing prices
        df = pl.DataFrame({
            'Date': pd.to_datetime(prices.index.date),
            tick: prices['Close'].values
        })
        dataframes.append(df)

    # Start with the first DataFrame
    result = dataframes[0]

    # Join additional DataFrames, handling mismatched dates with full outer join
    for df in dataframes[1:]:
        result = result.join(df, on='Date', how='full', coalesce=True)

    return result

ticker = read_data_polars(ticker_list)

Complete the program to plot the result as a bar graph using Polars operations and matplotlib visualization.
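
Here is one way the program could be completed; this is a sketch rather than the reference solution, and it assumes the ticker DataFrame built above along with the imports used throughout the lecture.

# Percentage price change over 2021 for each ticker;
# drop_nulls() guards against dates missing for some tickers after the joins
price_change = ticker.select([
    ((pl.col(tick).drop_nulls().last() - pl.col(tick).drop_nulls().first())
     / pl.col(tick).drop_nulls().first() * 100).alias(tick)
    for tick in ticker_list
])

# Reshape the single-row result to long format and sort before plotting
pc = price_change.transpose(include_header=True,
                            header_name='ticker',
                            column_names=['pct_change'])
pc = pc.sort('pct_change')

ax = pc.to_pandas().set_index('ticker')['pct_change'].plot(kind='bar')
ax.set_xlabel('ticker', fontsize=12)
ax.set_ylabel('% change in 2021', fontsize=12)
plt.show()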

Exercise 15.2

Using the method read_data_polars introduced in Exercise 15.1, write a program to obtain year-on-year percentage change for the following indices using Polars operations:

indices_list = {'^GSPC': 'S&P 500',
               '^IXIC': 'NASDAQ',
               '^DJI': 'Dow Jones',
               '^N225': 'Nikkei'}

Complete the program to show summary statistics and plot the result as a time series graph demonstrating Polars’ data manipulation capabilities.