0.1 Before You Begin: Important Instructions for All Workshops
Welcome to our workshop series! Please read these instructions carefully before starting any activity. Following these guidelines will make your work smoother and ensure that your submissions are graded without issues.
0.1.1 Working Environment
We will use Google Colab for all workshops. Colab runs Python in the cloud — you don’t need to install anything locally.
Sign in with your institutional Google account for access to all features.
Always save a copy of the notebook to your Google Drive:
Go to File → Save a copy in Drive.
0.1.2 Loading Data
You may work with datasets provided by the instructor or public datasets online. You will receive instructions each time to load the data with Python code. However, it is a good idea to store files, like data or your own notes, in a dedicated Google Drive folder:
Create a folder in your Google Drive named econ_workshops (or similar).
Upload your datasets there.
0.1.3 Output and Submission Format
After completing the workshop, export your notebook as PDF:
In Colab: File → Print → Save as PDF.
Only submit the PDF file through Canvas. Do not submit .ipynb or .py files unless explicitly requested.
Include all outputs, tables, and graphs in your PDF — make sure you run all cells before exporting.
0.1.4 Naming Convention
Name your PDF file using the following format: Lastname_Firstname_WorkshopX.pdf
Example:
0.1.5 Deadlines
All assignments must be uploaded to Canvas before the stated deadline. Late submissions are not accepted.
Once you have read and understood these instructions, you are ready to begin the workshop!
1 Introduction
In this lecture, we will review two key statistical tools:
Correlation analysis — measures the strength and direction of the relationship between two continuous variables.
ANOVA (Analysis of Variance) — tests whether the means of different groups are significantly different.
We will work through both theoretical concepts and practical Python examples using simulated datasets.
Our examples will involve three types of companies:
SMEs
Startups
Big Companies
2 Correlation Analysis
2.1 Concept
The Pearson correlation coefficient measures the linear relationship between two variables (X) and (Y):
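In its standard sample form, for n paired observations (x_i, y_i) with sample means \bar{x} and \bar{y}, the coefficient is usually written as follows. The subscript notation r_{XY} is a choice made here for illustration.

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{XY} \le 1
```

Values near +1 or -1 indicate a nearly exact increasing or decreasing linear relationship, and values near 0 indicate no linear relationship.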
This workshop uses the ISLR Carseats dataset (downloaded online via statsmodels.api.datasets.get_rdataset) to practice:
Correlation between business variables (e.g., Sales, Price, Advertising, Income).
Two-way plots (scatter, pair plots) and box plots.
ANOVA (one-way and two-way with interaction) using categorical marketing/retail factors (ShelveLoc, Urban, US).
A short preview of regression models motivated by the ANOVA results, including AIC comparison and robust standard errors.
Business context. Sales is store-level unit sales. We’ll ask two questions: Do mean sales differ across store characteristics (e.g., shelf location quality)? Which continuous predictors co-move with Sales?
This workshop bridges statistical analysis with real business decision-making. The Carseats dataset represents sales and characteristics of retail outlets. It is a strong example for applied econometrics because:
Variables cover pricing, promotion, demographics, and store characteristics.
We can illustrate relationships (correlation), differences between groups (ANOVA), and predictive modeling (regression).
Why ANOVA in business?
ANOVA lets us test if average outcomes (e.g., sales) differ significantly between groups, such as stores with good vs poor shelf location, or urban vs rural markets. This is essential for deciding where to invest, how to segment customers, or which store attributes matter most.
4 Setup and Data
We pull the dataset directly from the R ISLR package using statsmodels.api.datasets.get_rdataset. This works in Colab (internet required).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Display / plotting style
pd.set_option("display.precision", 3)
sns.set_context("notebook")

# Load Carseats from ISLR via statsmodels (downloads from online R datasets repo)
carseats = sm.datasets.get_rdataset("Carseats", package="ISLR", cache=True).data.copy()

# Inspect
carseats.head()
   Sales  CompPrice  Income  Advertising  Population  Price  ShelveLoc  Age  Education  Urban   US
0   9.50        138      73           11         276    120        Bad   42         17    Yes  Yes
1  11.22        111      48           16         260     83       Good   65         10    Yes  Yes
2  10.06        113      35           10         269     80     Medium   59         12    Yes  Yes
3   7.40        117     100            4         466     97     Medium   55         14    Yes  Yes
4   4.15        141      64            3         340    128        Bad   38         13    Yes   No
4.0.1 Variable notes (selected)
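Sales: Unit sales at each location, in thousands (response).
CompPrice: Price charged by competitor at each location.
Income: Community income level, in thousands of dollars.
Advertising: Local advertising budget, in thousands of dollars.
Price: Price the company charges for car seats at each location.
ShelveLoc: Factor with levels Bad/Medium/Good (quality of shelf location).
Age: Average age of the local population.
Education: Education level at each location.
Urban: Yes/No, whether the store is in an urban location.
US: Yes/No, whether the store is in the US.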
We’ll create a log-transformed Sales variable (optional, useful for variance stabilization) and a simple margin proxy to connect with business intuition.
# Small offset avoids log(0); a store with Sales == 0 maps to log(1e-6), about -13.8
carseats["log_Sales"] = np.log(carseats["Sales"] + 1e-6)

# Simple price margin proxy: competitor price minus our price (higher may indicate room to price higher)
carseats["margin_proxy"] = carseats["CompPrice"] - carseats["Price"]

carseats.describe()
         Sales  CompPrice   Income  Advertising  Population    Price      Age  Education  log_Sales  margin_proxy
count  400.000    400.000  400.000      400.000     400.000  400.000  400.000    400.000    400.000       400.000
mean     7.496    124.975   68.657        6.635     264.840  115.795   53.322     13.900      1.887         9.180
std      2.824     15.335   27.986        6.650     147.376   23.677   16.200      2.621      0.927        19.263
min      0.000     77.000   21.000        0.000      10.000   24.000   25.000     10.000    -13.816       -46.000
25%      5.390    115.000   42.750        0.000     139.000  100.000   39.750     12.000      1.685        -4.000
50%      7.490    125.000   69.000        5.000     272.000  117.000   54.500     14.000      2.014         9.000
75%      9.320    135.000   91.000       12.000     398.500  131.000   66.000     16.000      2.232        21.250
max     16.270    175.000  120.000       29.000     509.000  191.000   80.000     18.000      2.789        57.000
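The minimum of log_Sales in the summary above, -13.816, is simply log(1e-6): it comes from a store with Sales = 0 and the tiny offset, and it behaves like an extreme outlier in any later analysis of log_Sales. The sketch below reproduces that value on a few hypothetical sales figures and compares it with log(1 + Sales), a common alternative that is an assumption here and not part of the workshop code.

```python
import math

# Hypothetical unit sales, including one store with zero sales
sales = [0.0, 4.15, 7.40, 9.50, 11.22]

# Offset used in this workshop: log(Sales + 1e-6)
log_offset = [math.log(s + 1e-6) for s in sales]

# Alternative (an assumption, not the workshop's choice): log(1 + Sales)
log1p_values = [math.log1p(s) for s in sales]

for s, a, b in zip(sales, log_offset, log1p_values):
    print(f"Sales={s:5.2f}  log(Sales+1e-6)={a:8.3f}  log1p(Sales)={b:6.3f}")
```

With log1p, the zero-sales store maps to 0 instead of about -13.8, at the cost of a slightly different interpretation of the transformed variable.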
5 Correlation matrices (business metrics)
What correlation tells us:
Correlation measures how two variables move together.
Positive correlation: As one increases, the other tends to increase.
Negative correlation: As one increases, the other tends to decrease.
Zero correlation: No linear relationship.
Pearson vs Spearman:
Pearson is for linear relationships with continuous variables.
Spearman is rank-based, works well when data has outliers or a non-linear but monotonic pattern.
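Sales and Price: Strong negative correlation? That would suggest higher prices are associated with lower sales.
Sales and Advertising: Positive correlation? Higher advertising might be linked to higher sales.
margin_proxy (CompPrice minus Price): Positive values mean the store charges less than its competitor, so its correlation with Sales can hint at competitive positioning.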
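To make the Pearson versus Spearman distinction concrete, here is a minimal sketch in plain Python on hypothetical data where y rises with x along a curve. The helper names (pearson, ranks, spearman) are illustrative only; in the workshop itself you would normally let pandas compute these.

```python
import math

def pearson(x, y):
    """Pearson correlation: co-variation scaled by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; assumes no ties, which holds for the toy data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: monotonic but not linear
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because the ordering of y matches the ordering of x perfectly, Spearman reaches 1, while Pearson stays below 1 since the points do not lie on a straight line.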
5.1 Heatmap
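# corr_p is not defined in the code shown in this handout. The column selection
# below is an assumption: it mirrors the variables used in the pair plot.
metrics = ["Sales", "Price", "CompPrice", "Income", "Advertising", "margin_proxy"]
corr_p = carseats[metrics].corr(method="pearson")

plt.figure(figsize=(7, 5))
sns.heatmap(corr_p, annot=True, vmin=-1, vmax=1, cmap="vlag")
plt.title("Pearson Correlation: Carseats Business Metrics")
plt.tight_layout()
plt.show()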
Interpretation prompt. How do Sales co-move with Price and Advertising? What does margin_proxy suggest?
How to read the heatmap:
Color scale: Blue for negative, red for positive (in vlag palette).
Diagonal: Always 1.0 — variable with itself.
Strength (rule of thumb): |r| > 0.7 is strong, 0.3 ≤ |r| ≤ 0.7 is moderate, and |r| < 0.3 is weak.
6 Two-Way Plots
Two-way plots help diagnose linearity, clusters, and heteroskedasticity.
# Sales vs Price
sns.jointplot(data=carseats, x="Price", y="Sales", kind="reg", height=5)
plt.show()
Scatter with regression line:
The slope shows the direction of the relationship. If the spread of points widens with higher values, that’s heteroskedasticity.
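The visual check for widening spread can be backed by a rough numeric one. The sketch below, on hypothetical data, fits a least-squares line and compares the average squared residual in the lower and upper halves of x. It is an informal diagnostic under stated assumptions, not a formal test such as Breusch-Pagan.

```python
# Hypothetical data: the noise around the trend grows in proportion to x
x = list(range(1, 11))
y = [10 - 0.5 * xi + (-1) ** xi * 0.2 * xi for xi in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line through the points
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Average squared residual in the lower and upper halves of x (x is already sorted)
half = n // 2
spread_low = sum(e ** 2 for e in residuals[:half]) / half
spread_high = sum(e ** 2 for e in residuals[half:]) / (n - half)
ratio = spread_high / spread_low
print(f"slope={slope:.3f}, spread ratio (high x / low x)={ratio:.2f}")
```

A ratio well above 1 says the residuals fan out as x grows, which is the pattern to look for in the Sales vs Price scatter.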
# Pair plot of selected variables
sns.pairplot(
    carseats[["Sales", "Price", "CompPrice", "Income", "Advertising", "margin_proxy"]],
    kind="reg", diag_kind="hist", corner=True
)
plt.show()
Pair plot:
Check if clouds of points form linear patterns or curves.
Look for outliers — extreme points may distort correlation.
Detect clusters — groups of stores that behave differently.
7 Box Plots (Group Comparisons)
We compare Sales across ShelveLoc (Bad/Medium/Good) and across Urban/US segments.
What you’re seeing:
Box: Middle 50% of the data (interquartile range, IQR).
Line inside box: Median value.
Whiskers: Extend to the most extreme observations within 1.5 × IQR of the box (the seaborn default).
Dots beyond whiskers: Outliers.
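The box-plot elements described above can be computed directly. This is a minimal sketch on hypothetical sales figures, assuming the usual 1.5 × IQR rule for the whiskers; the variable names are illustrative.

```python
import statistics

# Hypothetical unit sales for one group of stores (sorted for readability)
sales = [4.1, 5.0, 5.6, 6.2, 6.8, 7.1, 7.5, 8.0, 8.4, 9.3, 15.9]

# Quartiles with linear interpolation between neighbouring observations
q1, median, q3 = statistics.quantiles(sales, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the edges of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside = [s for s in sales if lower_fence <= s <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is drawn as an individual point
outliers = [s for s in sales if s < lower_fence or s > upper_fence]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Here the store with 15.9 falls above the upper fence, so it would appear as a dot beyond the whisker.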
plt.figure(figsize=(7, 4))
sns.boxplot(data=carseats, x="US", y="Sales")
sns.stripplot(data=carseats, x="US", y="Sales", size=3, alpha=0.4, color="k")
plt.title("Sales by US vs non-US")
plt.show()
Business question. Does shelf location quality drive higher average sales? Are there urban or US effects that matter?
Interpretation examples:
ShelveLoc: If “Good” has a clearly higher median than “Bad”, this suggests shelf placement is a key driver of sales.
Urban: If medians are similar, location type might not be a strong differentiator.
US: Differences could imply market-level conditions.
Managerial link:
Insights here can guide store design, merchandising, and market targeting.
8 ANOVA
8.1 One-Way ANOVA: Sales ~ ShelveLoc
Tests if the mean of a continuous variable differs across k groups.
Null hypothesis: All group means are equal.
Alternative: At least one group differs.
F-statistic: Ratio of variation between groups to variation within groups.
Large F and small p-value → reject null.
η² (eta-squared): Proportion of variance explained by the grouping variable.
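The F-statistic and η² can be traced by hand on a small example. The sketch below uses hypothetical sales for three shelf-location groups and computes the between-group and within-group sums of squares directly; it does not compute the p-value, which comes from the F distribution with (k - 1, N - k) degrees of freedom.

```python
# Hypothetical unit sales for three shelf-location groups
groups = {
    "Bad": [4.0, 5.0, 6.0],
    "Medium": [6.0, 7.0, 8.0],
    "Good": [9.0, 10.0, 11.0],
}

def mean(values):
    return sum(values) / len(values)

all_values = [v for vals in groups.values() for v in vals]
n_total = len(all_values)
k = len(groups)
grand_mean = mean(all_values)

# Between-group variation: how far each group mean sits from the grand mean
ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values())

# Within-group variation: how far observations sit from their own group mean
ss_within = sum((v - mean(vals)) ** 2 for vals in groups.values() for v in vals)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within

# Eta-squared: share of total variation attributable to the grouping
eta_squared = ss_between / (ss_between + ss_within)

print(f"F = {f_stat:.2f}, eta^2 = {eta_squared:.3f}")
```

In this toy data the group means are far apart relative to the spread inside each group, so F is large and most of the variance is explained by the grouping.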
# m1 is not defined in the code shown here; this fit is assumed from the section title (Sales ~ ShelveLoc)
m1 = ols("Sales ~ C(ShelveLoc)", data=carseats).fit()

# Normality (QQ plot) and variance homogeneity (Levene)
sm.qqplot(m1.resid, line="45", fit=True)
plt.title("QQ Plot: ANOVA Residuals (Sales ~ ShelveLoc)")
plt.show()

groups = [g["Sales"].values for _, g in carseats.groupby("ShelveLoc")]
W, p = stats.levene(*groups, center="median")
print(f"Levene's test for equal variances: W={W:.3f}, p={p:.3f}")
Levene's test for equal variances: W=0.809, p=0.446
8.1.3 Post-hoc comparisons (Tukey HSD)
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(endog=carseats["Sales"], groups=carseats["ShelveLoc"], alpha=0.05)
print(tukey.summary())
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
   Bad   Good   4.6911   0.0  3.8714  5.5108   True
   Bad Medium   1.7837   0.0    1.11  2.4573   True
  Good Medium  -2.9074   0.0 -3.6107 -2.2041   True
---------------------------------------------------
8.2 Two-Way ANOVA: Sales ~ ShelveLoc * Urban
Two categorical variables + their interaction.
Main effect: Impact of one factor, averaged over the levels of the other.
Interaction: When the effect of one factor changes depending on the level of the other.
Interaction plot:
Parallel lines → no interaction.
Crossing lines → strong interaction.
Business application: If shelf location matters more in urban stores than rural, marketing strategy should be location-specific.
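The "parallel lines" idea can be expressed as a single number: the difference between the shelf-location effect in urban stores and the same effect in rural stores. The sketch below uses hypothetical cell means (not the Carseats values) to show the calculation.

```python
# Hypothetical mean sales for each ShelveLoc x Urban cell
cell_means = {
    ("Bad", "No"): 5.0,
    ("Bad", "Yes"): 5.5,
    ("Good", "No"): 9.0,
    ("Good", "Yes"): 11.5,
}

# Effect of Good vs Bad shelf location within each Urban level
effect_rural = cell_means[("Good", "No")] - cell_means[("Bad", "No")]
effect_urban = cell_means[("Good", "Yes")] - cell_means[("Bad", "Yes")]

# Interaction contrast: zero would mean perfectly parallel lines
interaction = effect_urban - effect_rural

print(f"Good vs Bad in rural stores: {effect_rural:.1f}")
print(f"Good vs Bad in urban stores: {effect_urban:.1f}")
print(f"Interaction contrast:        {interaction:.1f}")
```

A non-zero contrast in sample means is only descriptive. Whether it is larger than chance variation would allow is what the interaction term in the two-way ANOVA tests.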
means = carseats.groupby(["ShelveLoc", "Urban"])["Sales"].mean().reset_index()

# Plot shelf locations in their natural order rather than alphabetically
order = ["Bad", "Medium", "Good"]

plt.figure(figsize=(7, 4))
for u in means["Urban"].unique():
    s = means[means["Urban"] == u].set_index("ShelveLoc").loc[order]
    plt.plot(order, s["Sales"], marker="o", label=f"Urban={u}")
plt.title("Interaction: Sales by ShelveLoc × Urban")
plt.xlabel("ShelveLoc")
plt.ylabel("Mean Sales")
plt.legend()
plt.show()
Interpretation prompt. If the interaction is significant, how would you tailor merchandising or store layout by ShelveLoc × Urban segment?
9 From ANOVA to Regression (Preview)
ANOVA with categorical factors is a special case of linear regression with dummy variables. We now include continuous controls used by retail/marketing teams.
Why regression?
ANOVA is a special case of regression where predictors are only categorical.
Adding continuous predictors lets us:
Control for other factors.
Estimate marginal effects.
Interpreting coefficients:
For categorical variables, coefficients are relative to a reference category.
For continuous variables, coefficients are the expected change in the outcome per one-unit change in the predictor, holding the other predictors constant.
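For a single categorical predictor, "relative to a reference category" has a simple meaning: the intercept is the mean of the reference group and each coefficient is another group's mean minus that reference mean. The sketch below checks this on hypothetical data by verifying the least-squares normal equations, with "Bad" chosen as the reference for illustration.

```python
# Hypothetical unit sales by shelf location; "Bad" is the reference category
sales = {
    "Bad": [4.0, 5.0, 6.0],
    "Medium": [6.0, 7.0, 8.0],
    "Good": [9.0, 10.0, 11.0],
}
reference = "Bad"

def mean(values):
    return sum(values) / len(values)

# Candidate OLS estimates for Sales ~ dummy(Medium) + dummy(Good)
intercept = mean(sales[reference])
coefs = {g: mean(v) - intercept for g, v in sales.items() if g != reference}

# Residuals under these estimates (the reference group gets no dummy term)
residuals = {
    g: [y - (intercept + coefs.get(g, 0.0)) for y in v]
    for g, v in sales.items()
}

# Normal equations for a dummy design: residuals sum to zero overall
# and within every non-reference group
total_resid = sum(sum(r) for r in residuals.values())
dummy_resid = {g: sum(residuals[g]) for g in coefs}

print(f"intercept (mean of {reference}) = {intercept}")
print(f"coefficients vs {reference}: {coefs}")
```

Differences of this kind are also what the meandiff column of a Tukey HSD table reports for each pair of groups.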
AIC:
Lower AIC = better balance of fit and simplicity.
Be cautious: a very low AIC in a complex model may overfit.
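As a rough illustration of how AIC trades fit against complexity, the sketch below compares an intercept-only model with a one-predictor model on hypothetical data. It assumes Gaussian errors and counts k as the number of estimated regression coefficients; conventions for k differ between packages, so only differences between models fitted the same way are meaningful.

```python
import math

def gaussian_aic(rss, n, k):
    """AIC = 2k - 2*logL, with the Gaussian log-likelihood evaluated at the OLS fit."""
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return 2 * k - 2 * log_lik

# Hypothetical prices and unit sales
price = [80, 90, 100, 110, 120, 130]
sales = [11.0, 10.1, 9.2, 7.9, 7.1, 5.9]
n = len(sales)
mean_x = sum(price) / n
mean_y = sum(sales) / n

# Model A: intercept only (predict the mean for every store)
rss_a = sum((y - mean_y) ** 2 for y in sales)

# Model B: Sales ~ Price, fitted by least squares
sxx = sum((x - mean_x) ** 2 for x in price)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(price, sales))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
rss_b = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(price, sales))

aic_a = gaussian_aic(rss_a, n, k=1)
aic_b = gaussian_aic(rss_b, n, k=2)
print(f"AIC intercept-only: {aic_a:.2f}")
print(f"AIC with Price:     {aic_b:.2f}")
```

The extra coefficient costs 2 AIC points, so Model B is preferred only because its residual sum of squares drops by far more than that penalty.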
Reporting tip. Present the preferred model’s coefficients with robust SEs and interpret the magnitude and sign of key effects (e.g., the expected change in Sales for Good vs Bad shelf location, holding other factors constant).
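To show what "robust SEs" change, here is a minimal sketch for a one-predictor regression on hypothetical data. It compares the classical standard error of the slope with the heteroskedasticity-robust HC0 and HC1 versions, computed by hand. In statsmodels the same idea is available through the cov_type argument of fit (for example cov_type="HC1"); which variant the workshop's preferred model uses is not stated here, so treat the choice as an assumption.

```python
import math

# Hypothetical data in which the spread of Sales grows with Price
price = [80, 90, 100, 110, 120, 130, 140, 150]
sales = [11.0, 10.2, 9.0, 8.6, 6.5, 7.9, 4.0, 7.0]
n = len(price)

mean_x = sum(price) / n
mean_y = sum(sales) / n
dx = [x - mean_x for x in price]
sxx = sum(d * d for d in dx)

# Least-squares fit of Sales ~ Price
slope = sum(d * (y - mean_y) for d, y in zip(dx, sales)) / sxx
intercept = mean_y - slope * mean_x
resid = [y - (intercept + slope * x) for x, y in zip(price, sales)]

# Classical SE: assumes one common error variance for all stores
sigma2 = sum(e * e for e in resid) / (n - 2)
se_classical = math.sqrt(sigma2 / sxx)

# HC0 (White) SE: each store contributes its own squared residual
se_hc0 = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx

# HC1: HC0 with a small-sample correction n / (n - k), where k = 2 coefficients
se_hc1 = se_hc0 * math.sqrt(n / (n - 2))

print(f"slope = {slope:.4f}")
print(f"SE classical = {se_classical:.4f}, HC0 = {se_hc0:.4f}, HC1 = {se_hc1:.4f}")
```

The coefficient itself is identical in all three cases; only the uncertainty attached to it changes, which is why robust SEs affect t-statistics and confidence intervals but not the estimated effect.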