Solving Data Science 100 Knocks with Pandas, Polars, and SQL (#31-#40)

Table of Contents
Continuing from the previous knock ([/data-science-knock100-part3]), we solve questions 31 through 40 using Pandas/Polars/SQL.
Problems
Before starting to solve the problems, please make sure to execute the first code cell. If you do not execute it, variables will be undefined, and you will encounter errors during data processing, so please do not forget to run it.
Question 31
P-031: For the receipt details data (df_receipt), calculate the standard deviation of the sales amount (amount) for each store code (store_cd), and display the top 5 in descending order.
# Pandas
tmp = df_receipt[["store_cd", "amount"]].groupby("store_cd").std(ddof=0).reset_index()
tmp.sort_values("amount", ascending=False).head()
# Polars
df_receipt.group_by("store_cd").agg(
pl.col("amount").std(ddof=0)
).sort("amount", descending=True).head(5)
%%sql
SELECT
store_cd,
STDDEV(amount)
FROM
receipt
GROUP BY
store_cd
ORDER BY
stddev DESC
LIMIT
5
Question 32
P-032: For the sales amount (amount) in the receipt details data (df_receipt), calculate the percentile values at 25% intervals.
# Pandas
df_receipt.amount.quantile(q=np.arange(1, 5) / 4)
# Polars
df_receipt.select(
pl.col("amount").quantile(q, interpolation="linear").alias(f"amount_quantile_{int(q * 100)}")
for q in np.arange(1, 5) / 4
)
%%sql
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS percentile_25,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY amount) AS percentile_50,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS percentile_75,
PERCENTILE_CONT(1.00) WITHIN GROUP (ORDER BY amount) AS percentile_100
FROM
receipt
;
Question 33
P-033: For the receipt details data (df_receipt), calculate the average sales amount (amount) for each store code (store_cd), and extract those that are 330 or more.
# Pandas
avg_df = df_receipt.groupby("store_cd").amount.mean().reset_index()
avg_df.query("amount >= 330")df_receipt[["store_cd", "amount", "quantity"]].groupby("store_cd").sum()
# Polars
avg_df = df_receipt.group_by("store_cd").agg(pl.mean("amount").alias("amount"))
avg_df.filter(pl.col("amount") >= 330)
%%sql
SELECT
store_cd,
AVG(amount)
FROM
receipt
GROUP BY
store_cd
HAVING
AVG(amount) >= 330
Question 34
P-034: For the receipt details data (df_receipt), calculate the total sales amount (amount) for each customer ID (customer_id), then find the average across all customers. However, exclude those with customer ID starting with “Z” as they represent non-members.
# Pandas
member_df = df_receipt[~df_receipt["customer_id"].str.startswith("Z")]
member_df.groupby("customer_id").amount.sum().mean()
# Polars
df_receipt.filter(~pl.col("customer_id").str.starts_with("Z")) \
.group_by("customer_id").agg(pl.sum("amount")) \
.select(pl.col("amount").mean())
%%sql
WITH customer_revenue_totals AS (
SELECT
customer_id,
SUM(amount) as revenue
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
)
SELECT AVG(revenue) FROM customer_revenue_totals
Question 35
P-035: For the receipt details data (df_receipt), calculate the total sales amount (amount) for each customer ID (customer_id), then find the average across all customers, and extract customers who have spent more than the average. Display 10 records. However, exclude those with customer ID starting with “Z” as they represent non-members.
# Pandas
df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].\
groupby('customer_id').amount.sum()
amount_mean = df_amount_sum.mean()
df_amount_sum = df_amount_sum.reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)
# Polars
df_amount_sum = (
df_receipt.filter(~pl.col("customer_id").str.starts_with("Z"))
.group_by("customer_id")
.agg(pl.sum("amount").alias("amount"))
)
amount_mean = df_amount_sum.select(pl.col("amount").mean()).item()
df_amount_sum.filter(pl.col("amount") >= amount_mean).head(10)
%%sql
WITH customer_amount_sum AS (
SELECT
customer_id,
SUM(amount) AS amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
)
SELECT
*
FROM
customer_amount_sum
WHERE
amount >= (SELECT AVG(amount) FROM customer_amount_sum)
LIMIT
10;
Question 36
P-036: Inner join the receipt details data (df_receipt) and store data (df_store), and display all items from the receipt details data along with the store name (store_name) for 10 records.
# Pandas
pd.merge(df_receipt, df_store[["store_cd", "store_name"]], how="inner").head(10)
# Polars
df_receipt.join(df_store.select(["store_cd", "store_name"]), on="store_cd", how="inner")
%%sql
SELECT
r.*,
s.store_name
FROM
receipt r INNER JOIN store s
ON r.store_cd = s.store_cd
LIMIT
10
Question 37
P-037: Inner join the product data (df_product) and category data (df_category), and display all items from the product data along with the category small name (category_small_name) for 10 records.
# Pandas
pd.merge(
df_product,
df_category[['category_major_cd', 'category_medium_cd', 'category_small_cd','category_small_name']],
how="inner",
on=['category_major_cd', 'category_medium_cd', 'category_small_cd']
).head(10)
# Polars
df_product.join(
df_category.select(['category_major_cd', 'category_medium_cd', 'category_small_cd','category_small_name']),
on=['category_major_cd', 'category_medium_cd', 'category_small_cd'],
how="inner"
)
%%sql
SELECT
p.*,
c.category_small_name
FROM
product p INNER JOIN category c
ON p.category_major_cd = c.category_major_cd
AND p.category_medium_cd = c.category_medium_cd
AND p.category_small_cd = c.category_small_cd
LIMIT
10
Question 38
P-038: From the customer data (df_customer) and receipt details data (df_receipt), calculate the total sales amount for each customer, and display 10 records. However, treat customers with no sales as 0 for sales amount. Target only customers with gender code (gender_cd) as female (1), and exclude non-members (customer ID starting with “Z”).
# Pandas
df_amount_sum = df_receipt.groupby("customer_id").amount.sum().reset_index()
tar_customer_df = df_customer[(df_customer["gender_cd"] == "1") & (~df_customer["customer_id"].str.startswith("Z"))]
pd.merge(tar_customer_df, df_amount_sum, how="left").fillna(0).head(10)
# Polars
df_amount_sum = df_receipt.group_by("customer_id").agg(pl.sum("amount").alias("amount"))
tar_customer_df = df_customer.filter((pl.col("gender_cd") == "1") & (~pl.col("customer_id").str.starts_with("Z")))
tar_customer_df.join(df_amount_sum, on="customer_id", how="left").fill_null(0).head(10)
%%sql
SELECT
c.customer_id,
COALESCE(SUM(r.amount), 0)
FROM
customer c LEFT OUTER JOIN receipt r
ON r.customer_id = c.customer_id
WHERE
c.gender_cd = '1'
AND NOT c.customer_id LIKE 'Z%'
GROUP BY
c.customer_id
LIMIT
10
Question 39
P-039: From the receipt details data (df_receipt), create data for the top 20 customers by number of sales days and the top 20 customers by total sales amount, then perform a full outer join on these two datasets. However, exclude non-members (customer ID starting with “Z”).
# Pandas
tar_cust_df = df_receipt[~df_receipt["customer_id"].str.startswith("Z")]
t20_cust_df = tar_cust_df[["customer_id", "sales_ymd"]].drop_duplicates()\
.groupby("customer_id").count().reset_index() \
.sort_values("sales_ymd", ascending=False).head(20)
t20_sales_df = tar_cust_df[["customer_id", "amount"]]\
.groupby("customer_id").sum().reset_index() \
.sort_values("amount", ascending=False).head(20)
pd.merge(t20_cust_df, t20_sales_df, how="outer")
# Polars
tar_cust_df = df_receipt.filter(~pl.col("customer_id").str.starts_with("Z"))
t20_cust_df = tar_cust_df.select(["customer_id", "sales_ymd"]) \
.unique().group_by("customer_id").agg(pl.count("sales_ymd").alias("sales_ymd_count")) \
.sort("sales_ymd_count", descending=True).head(20)
t20_sales_df = tar_cust_df.select(["customer_id", "amount"]) \
.group_by("customer_id").agg(pl.sum("amount").alias("amount_sum")) \
.sort("amount_sum", descending=True).head(20)
t20_cust_df.join(t20_sales_df, on="customer_id", how="full")
%%sql
WITH t20_customer as (
SELECT
customer_id,
count(DISTINCT sales_ymd) as cnt
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
ORDER BY
cnt DESC
LIMIT
20
), t20_sales as (
SELECT
customer_id,
SUM(amount) as total_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
ORDER BY
total_amount DESC
LIMIT
20
)
SELECT
COALESCE(c.customer_id, s.customer_id) customer_id,
c.cnt,
s.total_amount
FROM
t20_customer c
FULL OUTER JOIN t20_sales s
ON
c.customer_id = s.customer_id
Question 40
P-040: We want to create data that combines all stores and all products. Perform a Cartesian product of the store data (df_store) and product data (df_product), and calculate the number of records.
# Pandas
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()
df_store_tmp["key"] = 0
df_product_tmp["key"] = 0
len(pd.merge(df_store_tmp, df_product_tmp, how="outer", on="key"))
# Polars
df_store.join(df_product, how="cross").height
%%sql
SELECT
COUNT(*)
FROM
store
CROSS JOIN product
Reflections on Questions up to 40
After calculating basic statistics, we performed more complex processing such as filtering based on those values, saving aggregation results in temporary tables for further processing, and more. Consequently, SQL tends to have more code compared to Pandas or Polars. Pandas can still write processes relatively concisely, and there are no particular performance issues up to this point. Polars may result in longer code than Pandas, but its consistent style of method chaining and expressions makes it an attractive option for pure data processing. Additionally, various joins like outer joins, inner joins, and cross joins have appeared, which are important concepts for appropriately linking different data in data processing. Understanding the differences in join methods, along with the code, is useful for data shaping in data analysis and machine learning, including how to handle data omissions and NaN values.