Solving Data Science 100 Knocks with Pandas, Polars, and SQL (#21-#30)

Continuing from the previous set of knocks, this section solves questions 21 through 30 using Pandas, Polars, and SQL.
Problems
Before starting the problems, be sure to run the first code cell. If you skip it, the DataFrame variables will be undefined and the solutions below will raise errors during data processing.
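That first cell is not reproduced in this section. As a rough sketch of what it does, it imports the libraries, loads the exercise data, and sets up the SQL connection; the file name and connection string below are placeholder assumptions, not the actual cell.
# Minimal setup sketch -- file name and connection string are assumptions, not the original cell
import pandas as pd
import polars as pl

# Load the receipt details data with whichever library you are using;
# the solutions below refer to it as df_receipt in both cases.
df_receipt = pd.read_csv("receipt.csv")    # Pandas
# df_receipt = pl.read_csv("receipt.csv")  # Polars

# The %%sql cells assume an ipython-sql style connection to a database holding a receipt table, e.g.:
# %load_ext sql
# %sql postgresql://user:password@localhost/dsdojo_db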
Question 21
P-021: For the receipt details data (df_receipt), count the number of records.
# Pandas
len(df_receipt)
# Polars
len(df_receipt)
%%sql
SELECT COUNT(*) FROM receipt
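As an aside, both libraries also expose the row count as an attribute; the following equivalents (a sketch, not part of the original answer) return the same number.
df_receipt.shape[0]   # works for both a Pandas and a Polars DataFrame
# df_receipt.height   # Polars-only attribute for the number of rows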
Question 22
P-022: For the receipt details data (df_receipt), count the unique number of customer IDs (customer_id).
# Pandas
len(df_receipt["customer_id"].unique())
# Polars
len(df_receipt["customer_id"].unique())
%%sql
SELECT COUNT(DISTINCT customer_id) FROM receipt
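Both libraries also offer a dedicated distinct-count method, which is arguably more idiomatic than wrapping unique() in len(); a sketch:
df_receipt["customer_id"].nunique()    # Pandas: Series.nunique()
df_receipt["customer_id"].n_unique()   # Polars: Series.n_unique()
# df_receipt.select(pl.col("customer_id").n_unique())  # Polars expression form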
Question 23
P-023: For the receipt details data (df_receipt), sum the sales amount (amount) and sales quantity (quantity) by store code (store_cd).
# Pandas
df_receipt[["store_cd", "amount", "quantity"]].groupby("store_cd").sum()
# Polars
df_receipt.group_by("store_cd").agg(
    pl.col("amount").sum(),
    pl.col("quantity").sum()
)
%%sql
SELECT
    store_cd,
    SUM(amount),
    SUM(quantity)
FROM
    receipt
GROUP BY
    store_cd
Question 24
P-024: For the receipt details data (df_receipt), find the most recent sales date (sales_ymd) for each customer ID (customer_id), and display 10 records.
# Pandas
df_receipt[["customer_id", "sales_ymd"]].groupby("customer_id").max().head(10)
# Polars
df_receipt.group_by("customer_id").agg(pl.col("sales_ymd").max()).head(10)
%%sql
SELECT
    customer_id,
    MAX(sales_ymd)
FROM
    receipt
GROUP BY
    customer_id
LIMIT
    10
Question 25
P-025: For the receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id), and display 10 records.
# Pandas
df_receipt[["customer_id", "sales_ymd"]].groupby("customer_id").min().head(10)
# Polars
df_receipt.group_by("customer_id").agg(pl.col("sales_ymd").min()).head(10)
%%sql
SELECT
    customer_id,
    MIN(sales_ymd)
FROM
    receipt
GROUP BY
    customer_id
LIMIT
    10
Question 26
P-026: For the receipt details data (df_receipt), for each customer ID (customer_id), find the most recent sales date (sales_ymd) and the oldest sales date, and display 10 records where they differ.
# Pandas
df_tmp = df_receipt.groupby("customer_id").agg({"sales_ymd": ["max", "min"]}).reset_index()
df_tmp.columns = ["customer_id", "sales_ymd_max", "sales_ymd_min"]
df_tmp.query("sales_ymd_max != sales_ymd_min").head(10)
# Polars
df_tmp = df_receipt.group_by("customer_id").agg(
    pl.col("sales_ymd").max().alias("sales_ymd_max"),
    pl.col("sales_ymd").min().alias("sales_ymd_min"),
)
df_tmp.filter(pl.col("sales_ymd_max") != pl.col("sales_ymd_min")).head(10)
%%sql
SELECT
    customer_id,
    MAX(sales_ymd) AS sales_ymd_max,
    MIN(sales_ymd) AS sales_ymd_min
FROM
    receipt
GROUP BY
    customer_id
HAVING
    MAX(sales_ymd) <> MIN(sales_ymd)
LIMIT 10
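In Pandas, named aggregation avoids the dict-style agg and the manual column renaming used above; a sketch of the same answer:
# Named aggregation keeps the output columns flat, so no manual renaming is needed
df_tmp = df_receipt.groupby("customer_id").agg(
    sales_ymd_max=("sales_ymd", "max"),
    sales_ymd_min=("sales_ymd", "min"),
).reset_index()
df_tmp.query("sales_ymd_max != sales_ymd_min").head(10)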
Question 27
P-027: For the receipt details data (df_receipt), calculate the average sales amount (amount) by store code (store_cd), and display the top 5 in descending order.
# Pandas
df_receipt[["store_cd", "amount"]].groupby("store_cd").mean().sort_values("amount", ascending=False).head()
# Polars
df_receipt.group_by("store_cd").agg(pl.col("amount").mean()).sort("amount", descending=True).head()
%%sql
SELECT
    store_cd,
    AVG(amount) AS amount_avg
FROM
    receipt
GROUP BY
    store_cd
ORDER BY
    amount_avg DESC
LIMIT
    5
Question 28
P-028: For the receipt details data (df_receipt), calculate the median sales amount (amount) by store code (store_cd), and display the top 5 in descending order.
# Pandas
df_tmp = df_receipt[["store_cd", "amount"]].groupby("store_cd").median().reset_index()
df_tmp.columns = ["store_cd", "amount_med"]
df_tmp.sort_values("amount_med", ascending=False).head()
# Polars
df_receipt.group_by("store_cd").agg(pl.col("amount").median()).sort("amount", descending=True).head()
%%sql
SELECT
    store_cd,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS amount_med
FROM
    receipt
GROUP BY
    store_cd
ORDER BY
    amount_med DESC
LIMIT
    5
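PERCENTILE_CONT(0.5) is simply the median expressed as a continuous percentile; the same value can be obtained in Pandas with a grouped quantile, as in this sketch:
# Median as the 0.5 quantile -- equivalent to the median() calls above
df_receipt.groupby("store_cd")["amount"].quantile(0.5).sort_values(ascending=False).head()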
Question 29
P-029: For the receipt details data (df_receipt), find the mode of product code (product_cd) by store code (store_cd), and display 10 records.
# Pandas
df_receipt.groupby("store_cd").product_cd.apply(lambda x: x.mode()).reset_index().head(10)
# Polars
df_receipt.group_by("store_cd").agg(
    pl.col("product_cd").mode()
).head(10)
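Note that mode() inside a Polars aggregation yields a list column, because a group can have several modes. If one value per store is enough, the expression can be reduced, for example by taking the first mode (a sketch; which tied value wins is then arbitrary):
df_receipt.group_by("store_cd").agg(
    pl.col("product_cd").mode().first()   # keep only one of the modes per store
).head(10)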
%%sql
SELECT
    store_cd,
    MODE() WITHIN GROUP (ORDER BY product_cd)
FROM
    receipt
GROUP BY
    store_cd
LIMIT
    10
Question 30
P-030: For the receipt details data (df_receipt), calculate the variance of sales amount (amount) by store code (store_cd), and display the top 5 in descending order.
# Pandas
df_receipt \
    .groupby("store_cd") \
    .amount.var(ddof=0) \
    .reset_index() \
    .sort_values("amount", ascending=False) \
    .head(5)
# Polars
df_receipt.group_by("store_cd").agg(
    pl.col("amount").var(ddof=0)
).sort("amount", descending=True).head(5)
%%sql
SELECT
    store_cd,
    VAR_POP(amount) AS amount_variance
FROM
    receipt
GROUP BY
    store_cd
ORDER BY
    amount_variance DESC
LIMIT 5;
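The ddof=0 argument makes Pandas and Polars compute the population variance, which is what VAR_POP returns; leaving the default ddof=1 would give the sample variance, matching SQL's VAR_SAMP. A quick sketch of the contrast:
df_receipt.groupby("store_cd").amount.var(ddof=0)  # population variance (matches VAR_POP)
df_receipt.groupby("store_cd").amount.var()        # sample variance, ddof=1 (matches VAR_SAMP)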
Reflections on Questions up to 30
The first 20 questions focused mainly on string processing with regular expressions, whereas questions 21 through 30 centered on numerical processing with basic statistics. All of them could be implemented concisely with built-in functions or methods in SQL, Pandas, and Polars.
Polars tends to need slightly more code than Pandas for simple operations, but for basic statistics its expression style keeps the code length stable, and in some cases it is even more concise than Pandas. For better or worse it always requires a certain amount of code, and its structure stays much the same even as tasks grow more complex, but in exchange Polars processes data very quickly. Pandas, on the other hand, is the long-established framework that predates Polars, so it has a large user base and abundant resources, which makes it convenient for many users.
When large volumes of data must be processed quickly, Polars is a strong option; for most other scenarios, Pandas remains a solid choice.