Workshop 3 Solution, Stats for AI

Author

Alberto Dorantes

Published

August 19, 2025

Abstract
This is an INDIVIDUAL workshop. In this workshop we practice with exercises related to the Introduction to Linear Regression Models.

Create a new Google Colab Notebook for this Workshop and share it (with Edit privileges) with me (cdorante@tec.mx).

For this Workshop, you have to submit the Jupyter file (.ipynb) instead of the Colab link. You can download it directly from Colab.

The material related to this workshop is covered in Chapters 6 and 7 of my eBook. You can find my eBook at:

https://www.apradie.com/StatsBAbook/

1 CHALLENGE 1

You have to download monthly stock prices (from Yahoo Finance) from Jan 2021 to July 2023 for:

  1. The Mexican IPC market index (ticker = “^MXX”)

  2. The Walmart company (ticker = “WMT.MX”)

  3. The Alfa company (ticker = “ALFAA.MX”)

You have to:

  1. Calculate the descriptive statistics for the 3 variables (find the right data transformation) to show a good overview of these variables

  2. Calculate the covariance and correlation between: a) the IPC return vs the Walmart return, and b) the Walmart return vs the Alfa return. Explain how you calculated your results, and interpret the correlations.

  3. What can you say about the statistical significance of the previous 2 correlations? Calculate the corresponding p-values and explain with YOUR OWN WORDS.

2 SOLUTION CHALLENGE 1

1) Calculate the descriptive statistics for the 3 variables (find the right data transformation) to show a good overview of these variables

I download monthly prices for Walmart, Alfa and the Mexican market index:

#!pip install yfinance

import yfinance as yf
import pandas as pd

tickers = ["^MXX", "WMT.MX", "ALFAA.MX"]
start_date = "2021-01-01"
end_date = "2023-07-31"

data = yf.download(tickers, start=start_date, end=end_date, interval="1mo")
prices = data['Close']
prices.head()
Ticker       ALFAA.MX      WMT.MX          ^MXX
Date                                           
2021-01-01  11.689448  960.190247  42985.730469
2021-02-01  10.321149  906.422424  44592.910156
2021-03-01  10.692672  924.280823  47246.261719
2021-04-01  12.840269  939.580566  48009.718750
2021-05-01  12.690719  940.284851  50885.949219

I calculate descriptive statistics for prices:

prices.describe()
Ticker   ALFAA.MX       WMT.MX          ^MXX
count   31.000000    31.000000     31.000000
mean    12.272249   927.296887  50677.527344
std      1.251091    52.412692   3397.982532
min     10.289640   814.784485  42985.730469
25%     11.117467   893.459778  48304.093750
50%     12.441734   924.280823  51330.851562
75%     13.557646   958.964294  53288.589844
max     13.925297  1036.755127  56536.679688

We can see that the range of prices (max - min) for these 3 instruments is very different. In Finance, the price of a stock in one month is not a measure of performance; each stock can trade at a different price level. What is important is how much the price changes over time.

Then, it might be a good idea to standardize these 3 prices to the same scale, and then calculate descriptive statistics and compare the performance of the 2 stocks and the index.

In this context it might be good to create an index for each price starting at $1.00, and calculate the index values based on how much the price changes with respect to its price in the first month. Let's calculate this index by dividing each stock price by its first price:

indexed_prices = prices / prices.iloc[0]

With this transformation, the index for each instrument represents how much $1.00 invested in the instrument (stock or MXX) grows (or declines) over time. Then, the index is actually a growth factor, as if I had invested in the instrument since the first month.

I can visualize how each stock and the market grows over time by plotting the indexed prices:

import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
indexed_prices.plot(ax=plt.gca())
plt.title('Indexed Prices Over Time')
plt.xlabel('Date')
plt.ylabel('$1.00 invested in Jan 2021')
plt.legend()
plt.grid(True)
plt.show()

We can see that the MXX grew to about 1.3 times its initial value, while Alfa and Walmart declined to about 90% of their initial value.

We can now calculate descriptive statistics for these indexed values:

indexed_prices.describe()
Ticker   ALFAA.MX     WMT.MX       ^MXX
count   31.000000  31.000000  31.000000
mean     1.049857   0.965743   1.178938
std      0.107027   0.054586   0.079049
min      0.880250   0.848566   1.000000
25%      0.951069   0.930503   1.123724
50%      1.064356   0.962602   1.194137
75%      1.159819   0.998723   1.239681
max      1.191271   1.079739   1.315243

We can see that the MXX had the highest mean (1.18) and the highest maximum value (1.31). WMT had the lowest mean and the lowest minimum value. The instrument with the most variability in its index is ALFAA, with a standard deviation of 0.10, almost double that of WMT.

From the point of view of an investor, the price movements are relevant, but the end value of the price compared to the initial value is very important. Then, we can calculate the percentage growth (or decline) of the prices (or indexes) from the first month until the last month. In Finance this is called the Holding-Period Return (HPR). I can calculate the HPR by dividing the last value by the first value and then subtracting 1. Let's calculate the HPR for each instrument:

HPR = indexed_prices.iloc[-1] / indexed_prices.iloc[0] - 1
100*HPR
Ticker
ALFAA.MX   -11.974970
WMT.MX      -7.563955
^MXX        27.528485
dtype: float64

We can see that the instrument with the highest % holding-period return was the MXX, with more than 27% over the 2.5 years, while ALFAA and WMT ended up with negative returns over the whole period (about -12% and -7.6%).

In addition, from the perspective of investors, it is always informative to calculate period returns and look at descriptive statistics to have an idea of the average return per period and the volatility (standard deviation of period returns). Then, let’s calculate the simple period returns and the logarithmic returns for the 3 instruments:

Simple returns:

R = (prices / prices.shift(1) - 1 ).dropna()
R.head()
Ticker      ALFAA.MX    WMT.MX      ^MXX
Date                                    
2021-02-01 -0.117054 -0.055997  0.037389
2021-03-01  0.035996  0.019702  0.059502
2021-04-01  0.200848  0.016553  0.016159
2021-05-01 -0.011647  0.000750  0.059909
2021-06-01  0.070100 -0.004864 -0.011716

Log returns:

import numpy as np
r = (np.log(prices) - np.log(prices.shift(1))).dropna() 
r.head()
Ticker      ALFAA.MX    WMT.MX      ^MXX
Date                                    
2021-02-01 -0.124492 -0.057626  0.036707
2021-03-01  0.035364  0.019510  0.057799
2021-04-01  0.183028  0.016418  0.016030
2021-05-01 -0.011715  0.000749  0.058183
2021-06-01  0.067752 -0.004876 -0.011786

We see that both types of returns are similar, but the log returns are bigger in magnitude for negative returns and smaller in magnitude for positive returns. Also, log returns usually behave closer to a normal distribution than simple returns.

Let’s plot the histograms for the simple returns:

R.hist(figsize=(12, 6))
plt.suptitle('Histograms of Simple Returns', y=1.02)
plt.tight_layout()
plt.show()

From an investor's point of view, these histograms are very informative since we can see the range of percentage losses and percentage gains over time for each instrument, and how frequently each range of percentage gain/loss shows up (how many months). For example, we can see that there was one month in which WMT lost between 15% and 20%. Also, we can see that the MXX gained between 10% and 15% in one month.

I can calculate descriptive statistics for the simple returns and the log returns:

Descriptive statistics for simple returns:

R.describe()
Ticker   ALFAA.MX     WMT.MX       ^MXX
count   30.000000  30.000000  30.000000
mean    -0.002028  -0.001211   0.009489
std      0.068332   0.052737   0.053154
min     -0.119589  -0.186033  -0.090538
25%     -0.040985  -0.032716  -0.032677
50%     -0.001113  -0.003715   0.014010
75%      0.028158   0.026714   0.039589
max      0.200848   0.097968   0.125875

We can see that ALFAA was the most volatile of the 3 since its monthly standard deviation is about 6.8%, while the other 2 have standard deviations around 5.3%. ALFAA had the worst average return with -0.2% monthly. If we assume that ALFAA returns follow a normal distribution, we can say that around 68% of the time ALFAA had monthly returns between -7% and +6.6%, since this range goes from one standard deviation below to one standard deviation above its mean return.

The only instrument with a positive average return was MXX with 0.95% (almost 1% monthly). To have a better idea whether this positive return is good for an investor, it is a good idea to estimate an annualized version of this monthly return. We can do this by multiplying the monthly return by 12, so the estimate of the annual average return is close to 12%, which is a good one considering that the risk-free investment (for example CETES) paid much less than 12% in this time period.
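A minimal sketch of this annualization, assuming the simple-returns data frame R from above (multiplying by 12 is an approximation that ignores compounding):

# Approximate annualized mean return: monthly mean times 12 (ignores compounding)
annualized_mean = 12 * R.mean()
print(100 * annualized_mean)  # in %; MXX should be close to 12% per year
# A compounded alternative would be: 100 * ((1 + R.mean())**12 - 1)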

We can do the same with log returns:

r.describe()
Ticker   ALFAA.MX     WMT.MX       ^MXX
count   30.000000  30.000000  30.000000
mean    -0.004252  -0.002622   0.008106
std      0.067512   0.054670   0.052637
min     -0.127367  -0.205835  -0.094902
25%     -0.041887  -0.033268  -0.033223
50%     -0.001114  -0.003722   0.013913
75%      0.027769   0.026358   0.038824
max      0.183028   0.093461   0.118561

These estimates are similar to the descriptive statistics using simple returns. When the magnitude of the returns is small (less than 0.10 in absolute value), the simple and the log returns are very similar. They start to differ noticeably for larger magnitudes.

It is worth noting that the negative mean log returns are more negative than the corresponding simple returns, and the positive mean log returns are smaller in magnitude than the corresponding simple returns.

Log returns do not add much for descriptive statistics. They are very useful when we construct statistical models with percentages/returns, since log returns behave closer to normal than simple returns.
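To see why both types of returns are so close for small changes, note that the log return is r = ln(1 + R), and ln(1 + R) ≈ R when R is small. A quick numeric check (assuming numpy is imported as np, as above):

# Compare simple vs log returns for small and large magnitudes
for R_simple in [0.01, 0.05, 0.20, -0.20]:
    r_log = np.log(1 + R_simple)
    print(f"simple: {R_simple:+.2f}   log: {r_log:+.4f}")
# For +/-1% and +/-5% both are almost identical; at +/-20% they clearly differ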

2) Calculate the covariance and correlation between: a) the IPC return vs the Walmart return, and b) the Walmart return vs the Alfa return. Explain how you calculated your results, and interpret the correlations.

I calculate covariance and correlation of returns using the cov and corr functions:

cov = r.cov()
cov
Ticker    ALFAA.MX    WMT.MX      ^MXX
Ticker                                
ALFAA.MX  0.004558 -0.000425  0.001235
WMT.MX   -0.000425  0.002989  0.000618
^MXX      0.001235  0.000618  0.002771
corr = r.corr()
corr
Ticker    ALFAA.MX    WMT.MX      ^MXX
Ticker                                
ALFAA.MX  1.000000 -0.115227  0.347503
WMT.MX   -0.115227  1.000000  0.214586
^MXX      0.347503  0.214586  1.000000

The covariance values are difficult to interpret. The main thing we can read from a covariance estimate between 2 variables is its sign: if the covariance is positive (negative), the linear relationship between both variables is positive (negative).

However, the correlation is more informative. The correlation between 2 variables tells us whether the linear relationship is positive or negative, and also how strong that linear relationship is (as a % between -100% and +100%).

In the correlation matrix we will always see 1’s in the diagonal since the correlation of one variable with itself is always 1 (100%).
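In fact, the correlation is just the covariance rescaled by the product of the standard deviations: corr(X,Y) = cov(X,Y) / (sd(X)*sd(Y)). A quick check for the ALFAA-MXX pair, assuming the cov matrix and the log-returns data frame r from above:

# Rebuild the ALFAA-MXX correlation from the covariance and the standard deviations
manual_corr = cov.loc['ALFAA.MX', '^MXX'] / (r['ALFAA.MX'].std() * r['^MXX'].std())
print(manual_corr)  # should match the 0.347503 shown in the correlation matrix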

The correlation between MXX and WMT is about +0.21, indicating that both returns move in the same direction (positive or negative) about 21% of the time. We can also say that they are positively and linearly related about 21% of the time.

The correlation between WMT and ALFAA is negative (-0.11), indicating that both returns move in opposite directions about 11% of the time. However, we must look at the statistical significance of this correlation before concluding that both returns are negatively related. Below I calculate this statistical significance with the corresponding p-value.

We see that the highest correlation among these pairs is the one between ALFAA and MXX, which is about +35%. This means that there is a positive linear relationship between the returns of ALFAA and the returns of MXX, and about 35% of the time both move in the same direction since their correlation is positive.

3) What can you say about the statistical significance of the previous 2 correlations? Calculate the corresponding p-values and explain with YOUR OWN WORDS.

Besides looking at the sign and magnitude of each pair of correlations, we can calculate the p-value of these correlations to see whether the correlation is statistically significant. In other words, we need to know whether the correlation is significantly positive (or negative).

I calculate the correlation and p-values for each pair:

from scipy.stats import pearsonr
import numpy as np
import pandas as pd

def pairwise_corr_pvalues(df):
    """
    Calculates the pairwise correlation p-values for a pandas DataFrame.

    Args:
        df: pandas DataFrame.

    Returns:
        pandas DataFrame of pairwise correlation p-values.
    """
    cols = df.columns
    n_cols = len(cols)
    p_matrix = np.zeros((n_cols, n_cols))
    for i in range(n_cols):
        for j in range(i, n_cols):
            if i == j:
                p_matrix[i, j] = 0.0  # Correlation with itself is 1, p-value is 0
            else:
                # Calculate Pearson correlation and its p-value
                corr, p_value = pearsonr(df[cols[i]], df[cols[j]])
                p_matrix[i, j] = p_value
                p_matrix[j, i] = p_value # Correlation matrix is symmetric

    return pd.DataFrame(p_matrix, columns=cols, index=cols)

# Calculate and display the p-values for the correlation matrix
correlation_pvalues = pairwise_corr_pvalues(r)
print("\nP-values for Correlation Matrix of Log Returns:")

P-values for Correlation Matrix of Log Returns:
correlation_pvalues
Ticker    ALFAA.MX    WMT.MX      ^MXX
Ticker                                
ALFAA.MX  0.000000  0.544294  0.059891
WMT.MX    0.544294  0.000000  0.254819
^MXX      0.059891  0.254819  0.000000

We can see that the p-value of the correlation between ALFAA and MXX is 0.059, so we can say that this correlation is marginally significant since its p-value is less than 0.10 and greater than 0.05. We can also say that about 94% of the time (1 - 0.059), both returns will be positively correlated.

The other 2 p-values are much higher than 0.05, indicating that we cannot say that the corresponding correlations are significant. In other words, we can treat those pairs as not correlated (not linearly related). For example, WMT and ALFAA had a -0.11 correlation with a p-value of 0.54; this means that although the estimated correlation is negative, we cannot say that they are negatively correlated most of the time. In other words, the negative correlation between ALFAA and WMT is NOT significantly NEGATIVE, and we can treat both returns as NOT CORRELATED.
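As a sanity check, the p-value of a Pearson correlation can also be computed by hand from a t statistic with n-2 degrees of freedom, t = r*sqrt(n-2)/sqrt(1-r^2). A sketch for the ALFAA-MXX pair, assuming corr, r, and numpy from above:

from scipy.stats import t as t_dist

n = len(r)                          # 30 monthly return observations
rho = corr.loc['ALFAA.MX', '^MXX']  # about +0.3475
t_stat = rho * np.sqrt(n - 2) / np.sqrt(1 - rho**2)
p_value = 2 * (1 - t_dist.cdf(abs(t_stat), df=n - 2))  # two-sided p-value
print(t_stat, p_value)              # p should match the 0.059891 shown above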

3 CHALLENGE 2

Run a regression model to examine whether the Alfa return is related to the IPC return. You want to see how sensitive the Alfa return is with respect to the IPC return. Decide which should be the dependent variable and which the independent variable, run the model, and interpret the regression coefficients with your OWN WORDS.

4 SOLUTION TO CHALLENGE 2

When running a regression model with percentage variables, it is recommended to use log returns instead of simple returns. The main reason is that log returns behave closer to normal than simple returns, and one of the assumptions of the linear regression model is that the dependent variable should behave like a normally distributed variable.

Then, I run the regression model considering the Alfa return as the Dependent Variable (Y), and the MXX return (IPC) as the Independent Variable (X). I do this since I want to see how much the Alfa return changes with respect to the MXX return:

import statsmodels.formula.api as smf 
r.columns = ['ALFAA','WMT','MXX']
# I estimate the OLS regression model:
regmodel = smf.ols('ALFAA ~ MXX', data=r).fit()
print(regmodel.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  ALFAA   R-squared:                       0.121
Model:                            OLS   Adj. R-squared:                  0.089
Method:                 Least Squares   F-statistic:                     3.846
Date:              lun., 01 sep. 2025   Prob (F-statistic):             0.0599
Time:                        10:07:47   Log-Likelihood:                 40.734
No. Observations:                  30   AIC:                            -77.47
Df Residuals:                      28   BIC:                            -74.67
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0079      0.012     -0.661      0.514      -0.032       0.017
MXX            0.4457      0.227      1.961      0.060      -0.020       0.911
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   2.136
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                2.947
Skew:                           0.527   Prob(JB):                        0.229
Kurtosis:                       4.117   Cond. No.                         19.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

I can see that beta0 = -0.0079 and beta1 = 0.44. This means that the linear regression equation is given by:

E[ALFAr] = -0.0079 + 0.4457 (MXXr)

In other words, it is expected that the monthly return of Alfa is about -0.0079 (beta0) when the MXX return is zero.

The slope of the line is +0.4457 (beta1), indicating a positive relationship. We can say that, on average, when the MXX return moves by +1%, the expected move of the ALFAA return is about +0.4457%. However, we need to look at the level of significance of these estimates of beta0 and beta1. We can look at the corresponding t-statistics, p-values and/or their 95% confidence intervals.
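Before looking at significance, a quick sketch of using the fitted equation to predict an expected return (assuming regmodel from above; predict accepts a data frame with the X variable):

# Expected ALFAA return when the MXX return is +1% (0.01):
pred = regmodel.predict(pd.DataFrame({'MXX': [0.01]}))
print(pred)  # about -0.0079 + 0.4457*0.01 = -0.0034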

When we run a linear regression model, one hypothesis test is automatically performed for each coefficient, and the results of these hypothesis tests are displayed in the coefficient table. The Null Hypothesis for these tests is always that the corresponding coefficient is equal to zero.

Then, looking at the Intercept row (the beta0), we see that its p-value is 0.51, which is much bigger than 0.05, indicating that even though beta0 is negative, it is not significantly negative. In other words, we can say that 95% of the time (in the future), beta0 can move between -0.032 and +0.017 (the 95% CI). This means that the future value of beta0 is uncertain in terms of being negative or positive!

Looking at the MXX row (beta1), we see that its p-value is 0.06, indicating that beta1 is marginally significant, or significantly positive at the 94% confidence level (1 - 0.06). In other words, in the future, 95% of the time beta1 might move between -0.02 and +0.911. Then, we can conclude that the sensitivity of the Alfa return with respect to the MXX return is about +0.44, and 95% of the time it will be less than 1. In this context, beta1 is a measure of the market risk of the stock. Then, we can say that Alfa is significantly less risky than the market since beta1 < 1 95% of the time (according to its 95% C.I.)!
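These test statistics can also be extracted directly from the fitted model instead of reading them off the summary table; a short sketch, assuming regmodel from above (statsmodels computes each t as the coefficient divided by its standard error):

print(regmodel.tvalues)     # t-statistics; for MXX: 0.4457 / 0.227 = 1.96
print(regmodel.pvalues)     # p-values of the two-sided tests (H0: coefficient = 0)
print(regmodel.conf_int())  # 95% confidence intervals for beta0 and beta1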

5 CHALLENGE 3

Download the following data-set:

https://www.apradie.com/datos/SALES1.xlsx

It is an Excel file with monthly historical sales of 2 chocolate products in a Wholesaler with more than 150 stores in all Mexican states. You have to do the following analysis:

  1. Show important descriptive statistics about these 2 products that give you a good overview of sales and price performance over time. You have to decide what type of data wrangling and/or transformations you need to do, and what type of descriptive statistics can be relevant in this context. Think that you are responsible for the sales of these products at a national level, and you have to present your analysis to your boss.

  2. For each product, design a regression model to estimate their direct price elasticity. Price elasticity refers to how sensitive the demand for a consumer product is to price changes. In other words, on average, what is the percentage change in volume sold with respect to a percentage change in price. You have to:

  1. Decide the data wrangling/transformations and decide the dependent and the independent(s) variable(s) for each model,

  2. For each product run a regression model to estimate the price elasticity, and interpret the model WITH YOUR OWN WORDS.

  3. Besides changes in price, what other variables can you consider that might influence changes in sales volume? Just explain your ideas about this with your OWN WORDS.

6 CHALLENGE 3 SOLUTION

Downloading the data set and importing it into a data frame:

import pandas as pd

# The file can also be read directly from the URL: pd.read_excel('https://www.apradie.com/datos/SALES1.xlsx')
salesdf = pd.read_excel('SALES1.xlsx')
salesdf.head()
      FAB     ID       DATE           PRODUCT         VALUE  UNITS
0  FABUNO  97928 2020-01-01  CHOCOTAB 27/18GR  4.614653e+06  40793
1  FABUNO  97928 2020-02-01  CHOCOTAB 27/18GR  5.395821e+06  51813
2  FABUNO  97928 2020-03-01  CHOCOTAB 27/18GR  5.189054e+06  49169
3  FABUNO  97928 2020-04-01  CHOCOTAB 27/18GR  4.073454e+06  35361
4  FABUNO  97928 2020-05-01  CHOCOTAB 27/18GR  4.942860e+06  44344

  1. Show important descriptive statistics about these 2 products that give you a good overview of sales and price performance over time. You have to decide what type of data wrangling and/or transformations you need to do, and what type of descriptive statistics can be relevant in this context. Think that you are responsible for the sales of these products at a national level, and you have to present your analysis to your boss.

We have monthly sales of 2 chocolate products sold in retail stores. The VALUE column is the sales in $MXP, and the UNITS column is the number of units sold per month. If we divide VALUE by UNITS, we get the average unit price per month for each chocolate. Unit price is an important variable in this context to understand how price and sales relate.

I calculate the PRICE column for the average unit price per month:

salesdf['PRICE'] = salesdf['VALUE'] / salesdf['UNITS']

Since we have only 2 products, we can start with a plot to show the performance of sales and unit price over time for both products.

I start plotting units sold over time for both products in one plot:

import matplotlib.pyplot as plt

# Plot UNITS over time for both products using product names
# Loop to create a temporary data frame with one product, and then plot it
plt.figure(figsize=(12, 6))
for product_name in salesdf['PRODUCT'].unique():
    product_df = salesdf[salesdf['PRODUCT'] == product_name]
    plt.plot(product_df['DATE'], product_df['UNITS'], label=f'{product_name}')

plt.xlabel('Date')
plt.ylabel('Units')
plt.title('Units Sold Over Time by Product')
plt.legend()
plt.grid(True)
plt.show()

It seems that ESTUCHE CHOC sells more units than CHOCOTAB. However, we need to calculate sales by year since ESTUCHE CHOC shows a strong seasonal pattern, with very high sales in some months and very low sales in others. We can see a clear seasonal pattern for ESTUCHE CHOC with a peak of sales each December (close to 160,000 units each December). It is difficult to see a clear seasonal pattern for CHOCOTAB.
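One way to check this seasonality numerically is to average the units sold by calendar month; a minimal sketch, assuming salesdf from above:

# Average units sold per calendar month, by product
monthly_avg = salesdf.groupby([salesdf['PRODUCT'], salesdf['DATE'].dt.month])['UNITS'].mean()
print(monthly_avg.unstack(0))  # month 12 (December) should stand out for ESTUCHE CHOC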

Neither product shows a clear pattern of sales growth over time; actually, ESTUCHE CHOC seems to be declining over the last few years.

We need to calculate annual sales and sales growth in percentage by year to have a better understanding. I do the same for unit price:

# Plot PRICE over time for both products using product names
# Loop to create a temporary data frame with one product, and then plot it
plt.figure(figsize=(12, 6))
for product_name in salesdf['PRODUCT'].unique():
    product_df = salesdf[salesdf['PRODUCT'] == product_name]
    plt.plot(product_df['DATE'], product_df['PRICE'], label=f'{product_name}')

plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Unit Price Over Time by Product')
plt.legend()
plt.grid(True)
plt.show()

We clearly see that ESTUCHE CHOC is more expensive than CHOCOTAB. ESTUCHE CHOC shows an increasing trend in its price over the whole period, while CHOCOTAB has a growing price trend only from mid 2022 to 2024.

To better appreciate the price difference between both products, we can do a histogram to show the distribution of prices for both products:

# Create a single histogram with both PRICE distributions
plt.figure(figsize=(12, 6))
plt.hist(salesdf[salesdf['PRODUCT'] == 'CHOCOTAB 27/18GR']['PRICE'], bins=10, alpha=0.7, label='CHOCOTAB 27/18GR', edgecolor='black', color='skyblue')
plt.hist(salesdf[salesdf['PRODUCT'] == 'ESTUCHE CHOC 24 PZ']['PRICE'], bins=10, alpha=0.7, label='ESTUCHE CHOC 24 PZ', edgecolor='black', color='salmon')

plt.xlabel('PRICE')
plt.ylabel('Frequency')
plt.title('Histogram of Unit Price Distribution by Product')
plt.legend()
plt.grid(True)
plt.show()

We can clearly see that ESTUCHE CHOC is much more expensive than CHOCOTAB. The unit price of ESTUCHE CHOC ranges between $140 and $210, while the CHOCOTAB price ranges from $80 to about $160. For both products, looking at the height of each bar, we can see which price ranges appear more often over the months. In the histograms we cannot appreciate the chronology (when each price shows up).

It is also very informative to visualize sales value (in $) to appreciate how much each product generates in sales over time. Sales value is units sold times unit price for each product (the VALUE column). Let's see the sales value over time for both products:

# Plot VALUE over time for both products 
# Loop to create a temporary data frame with one product, and then plot it
plt.figure(figsize=(12, 6))
for product_name in salesdf['PRODUCT'].unique():
    product_df = salesdf[salesdf['PRODUCT'] == product_name]
    plt.plot(product_df['DATE'], product_df['VALUE'], label=f'{product_name}')

plt.xlabel('Date')
plt.ylabel('Sales ($)')
plt.title('Sales ($) Over Time by Product')
plt.legend()
plt.grid(True)
plt.show()

Since the behavior of units sold and price changes over the months of each year, it is a good idea to calculate descriptive statistics per year for each product.

Let’s calculate descriptive statistics for price per year for each product:

# creating a column for year:
salesdf['Year']=salesdf['DATE'].dt.year 
#Grouping by PRODUCT and Year and calculate descriptive statistics for PRICE:
descriptive_stats_yearly = salesdf.groupby(['PRODUCT','Year'])[['PRICE']].describe()
# Display the results, transposing so that the statistics appear as rows and the years as columns
pd.options.display.float_format = "{:,.2f}".format
descriptive_stats_yearly.T
PRODUCT     CHOCOTAB 27/18GR                       ... ESTUCHE CHOC 24 PZ                     
Year                    2020   2021   2022   2023  ...               2021   2022   2023   2024
PRICE count            12.00  12.00  12.00  12.00  ...              12.00  12.00  12.00  12.00
      mean            119.14 123.15 116.46 141.23  ...             158.53 172.25 178.88 192.50
      std              10.52   9.97  17.55  10.82  ...               5.27   1.97  11.78  12.86
      min             104.14 103.63  83.16 126.54  ...             149.66 168.54 156.40 168.62
      25%             112.71 117.67 106.56 131.94  ...             155.15 171.14 169.84 188.00
      50%             115.65 122.76 118.42 141.71  ...             158.81 172.56 183.77 195.84
      75%             129.34 130.99 122.99 147.17  ...             161.21 173.70 187.24 199.84
      max             134.51 137.23 145.99 159.72  ...             166.46 174.80 191.78 208.85

[8 rows x 10 columns]

With the mean and standard deviation of prices over time for each product, we get a first idea of the pricing behavior of each product by year. Overall we see that ESTUCHE CHOC is more expensive than CHOCOTAB, that both products systematically increase prices over the years, and that the CHOCOTAB standard deviation per year is much greater than that of ESTUCHE, indicating that CHOCOTAB (the cheaper product) makes more price changes (discounts) over the months of each year.

We can also see that the mean and median of prices for ESTUCHE CHOC are very similar, indicating that the distribution may be close to normal. Looking at the previous histogram, we confirm that the price distribution for ESTUCHE CHOC is close to normal. However, looking at the mean and median of the CHOCOTAB price, the mean is usually slightly greater than the median, indicating a possible skewed-to-the-right distribution. Looking at its histogram (above) confirms that the price distribution for CHOCOTAB is slightly skewed to the right.

Box plots are very informative to understand the distribution of any variable, especially when the variable does not behave close to a normal distribution. A box plot illustrates how the data is distributed through its quartiles and the inter-quartile range:

import seaborn as sns
# Create a box plot of price by product
plt.figure(figsize=(10, 6))
sns.boxplot(x='PRODUCT', y='PRICE', data=salesdf)
plt.xlabel('Product')
plt.ylabel('Price')
plt.title('Box Plot of Price by Product')
plt.grid(True)
plt.show()

For CHOCOTAB, its price has moved between $120 and about $145 50% of the time (between Quartile 1 and Quartile 3). The inter-quartile range (IQR) is the difference between Q3 and Q1 of any distribution. For ESTUCHE, the IQR starts at about $160 and ends at about $180. The horizontal lines below and above the box (the whiskers) indicate the range where most of the values lie: they extend up to 1.5 times the IQR below Q1 and above Q3. Points outside these fences can be treated as outliers, and they are usually plotted as dots. In this case, it seems that there are no outliers for either price, since there are no dots beyond the whiskers.
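A minimal sketch of computing these quartiles and outlier fences for the CHOCOTAB price, assuming salesdf from above:

# Quartiles, IQR and outlier fences for the CHOCOTAB price
choco_price = salesdf[salesdf['PRODUCT'] == 'CHOCOTAB 27/18GR']['PRICE']
q1, q3 = choco_price.quantile(0.25), choco_price.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = choco_price[(choco_price < lower_fence) | (choco_price > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence, len(outliers))  # 0 outliers expected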

We have to be careful using the arithmetic mean for prices, since price is the result of dividing 2 variables (value and units), and some months concentrate high units sold at a specific price.

With this table we can see the variety of price changes per year for each product.

In the context of historical monthly sales data, the best measure of central tendency for the annual average price is usually the mean weighted by the units sold per month. Below I calculate this annual weighted average price, which is a better measure of the mean price.

Here is an annual summary of sales units, weighted average price and their % growth per year:

# Group by product and year and sum total units and value:
product_yearly_summary = salesdf.groupby(['PRODUCT', 'Year'])[['UNITS', 'VALUE']].sum()

# Calculate weighted average price by product and year dividing total sales value by units sold:
product_yearly_summary['WAvg_Price'] = product_yearly_summary['VALUE'] / product_yearly_summary['UNITS']

# Calculating annual growth for UNITS, VALUE, and Weighted Price:
product_yearly_summary['Volume %Growth'] = 100 * (product_yearly_summary['UNITS'] / product_yearly_summary.groupby(['PRODUCT'])['UNITS'].shift(1) - 1)

product_yearly_summary['Value %Growth'] = 100 * (product_yearly_summary['VALUE'] / product_yearly_summary.groupby(['PRODUCT'])['VALUE'].shift(1) - 1)

product_yearly_summary['Price %Growth'] = 100 * (product_yearly_summary['WAvg_Price'] / product_yearly_summary.groupby(['PRODUCT'])['WAvg_Price'].shift(1) - 1)

# Unstack the 'Year' level to make years columns
product_yearly_summary_unstacked = product_yearly_summary.unstack('Year')

# Display the summary table for UNITS sold and its % Growth
#print(product_yearly_summary_unstacked[['UNITS','Volume %Growth']])

# Display the summary table for Avg Weighted Price and its % Growth
#print(product_yearly_summary_unstacked[['WAvg_Price','Price %Growth']])

# Display the summary table for VALUE Sales and and its % Growth
#print(product_yearly_summary_unstacked[['VALUE','Value %Growth']])

# Display the annual summary table for VALUE Sales, UNIT Sales and Weighted Avg Price:
print(product_yearly_summary[['UNITS','VALUE','WAvg_Price']])
                          UNITS          VALUE  WAvg_Price
PRODUCT            Year                                   
CHOCOTAB 27/18GR   2020  511350  60,817,964.30      118.94
                   2021  649633  79,765,836.61      122.79
                   2022  549128  64,042,836.20      116.63
                   2023  633877  88,814,862.54      140.11
                   2024  581427  93,565,565.04      160.92
ESTUCHE CHOC 24 PZ 2020  454099  69,101,851.17      152.17
                   2021  589939  94,497,661.94      160.18
                   2022  536823  92,529,473.99      172.36
                   2023  588696 102,781,676.44      174.59
                   2024  595364 110,596,983.98      185.76
# Display the annual % Growth for VALUE Sales, UNIT Sales and Weighted Avg Price:
print(product_yearly_summary[['Volume %Growth','Price %Growth','Value %Growth']])
                         Volume %Growth  Price %Growth  Value %Growth
PRODUCT            Year                                              
CHOCOTAB 27/18GR   2020             NaN            NaN            NaN
                   2021           27.04           3.24          31.16
                   2022          -15.47          -5.02         -19.71
                   2023           15.43          20.14          38.68
                   2024           -8.27          14.85           5.35
ESTUCHE CHOC 24 PZ 2020             NaN            NaN            NaN
                   2021           29.91           5.26          36.75
                   2022           -9.00           7.61          -2.08
                   2023            9.66           1.29          11.08
                   2024            1.13           6.40           7.60

In this case, there are some differences between each arithmetic annual average price and the corresponding weighted average annual price. The weighted annual average price is a better representation of the annual average price since it considers the number of units sold per month at different prices.

Both products had positive annual % growth in price for most years, except CHOCOTAB, which had negative annual % price growth in 2022 (-5%).

The best year for CHOCOTAB was 2023 with +38.7% annual growth in Value Sales, and its worst year was 2022 with a decline in Value Sales of -19.7%.

The best year for ESTUCHE was 2021 with +36.7% in Value Sales, and its worst year was 2022 with 2.08% decline in Value Sales.

Important note: when calculating descriptive statistics for variables that are the result of dividing 2 variables (for example unit price, or any other percentage variable or ratio), the best measure of central tendency will always be the weighted mean, not the arithmetic mean.
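A small sketch contrasting the two means for the CHOCOTAB price, assuming salesdf from above (np.average accepts a weights argument):

import numpy as np

choco = salesdf[salesdf['PRODUCT'] == 'CHOCOTAB 27/18GR']
arithmetic_mean = choco['PRICE'].mean()  # unweighted mean of the monthly prices
weighted_mean = np.average(choco['PRICE'], weights=choco['UNITS'])  # weighted by units sold
print(arithmetic_mean, weighted_mean)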

  2. For each product, design a regression model to estimate their direct price elasticity. Price elasticity refers to how sensitive the demand for a consumer product is to price changes. In other words, on average, what is the percentage change in volume sold with respect to a percentage change in price. You have to:

  1. Decide the data wrangling/transformations and decide the dependent and the independent(s) variable(s) for each model,

  2. For each product run a regression model to estimate the price elasticity, and interpret the model WITH YOUR OWN WORDS.

  3. Besides changes in price, what other variables can you consider that might influence changes in sales volume? Just explain your ideas about this with your OWN WORDS.

SOLUTION:

2a) We start with the long-format monthly data frame. To estimate price elasticity, we can transform the monthly UNITS and PRICE with the natural log, so that the direct price elasticity is the slope of the regression line for each product:

import numpy as np
salesdf['logPRICE']= np.log(salesdf['PRICE'])
salesdf['logUNITS'] = np.log(salesdf['UNITS'])

For each product I run an OLS regression with logUNITS as the Y (Dependent Variable) and logPRICE as the X (Independent Variable).

Regression model for both products:

import statsmodels.formula.api as smf

product_names = salesdf['PRODUCT'].unique()
for product_name in product_names:
    print(f"\nLinear Regression Results for {product_name}:")
    # Filter data for the current product:
    product_df = salesdf[salesdf['PRODUCT']==product_name].copy()
    # Run the OLS regression
    model = smf.ols('logUNITS ~ logPRICE', data=product_df).fit()
    print(model.summary())

Linear Regression Results for CHOCOTAB 27/18GR:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               logUNITS   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.017
Method:                 Least Squares   F-statistic:                 1.298e-06
Date:              lun., 01 sep. 2025   Prob (F-statistic):              0.999
Time:                        10:07:52   Log-Likelihood:                -1.8109
No. Observations:                  60   AIC:                             7.622
Df Residuals:                      58   BIC:                             11.81
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     10.7621      1.016     10.597      0.000       8.729      12.795
logPRICE       0.0002      0.208      0.001      0.999      -0.417       0.417
==============================================================================
Omnibus:                        0.151   Durbin-Watson:                   1.525
Prob(Omnibus):                  0.927   Jarque-Bera (JB):                0.122
Skew:                           0.097   Prob(JB):                        0.941
Kurtosis:                       2.893   Cond. No.                         158.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Linear Regression Results for ESTUCHE CHOC 24 PZ:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               logUNITS   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.8992
Date:              lun., 01 sep. 2025   Prob (F-statistic):              0.347
Time:                        10:07:52   Log-Likelihood:                -63.652
No. Observations:                  60   AIC:                             131.3
Df Residuals:                      58   BIC:                             135.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     15.2177      5.021      3.031      0.004       5.167      25.269
logPRICE      -0.9267      0.977     -0.948      0.347      -2.883       1.029
==============================================================================
Omnibus:                        6.421   Durbin-Watson:                   1.206
Prob(Omnibus):                  0.040   Jarque-Bera (JB):                6.520
Skew:                           0.775   Prob(JB):                       0.0384
Kurtosis:                       2.550   Cond. No.                         292.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

INTERPRETATION for the CHOCOTAB regression output:

The R-squared of the model is close to zero. The beta1 coefficient is +0.0002, but it is NOT significantly greater than zero since its p-value is 0.999. This means that a +1% change in unit price DOES NOT significantly influence the % change in UNITS sold. In other words, this product seems to be inelastic, since a change in price does not significantly change sales volume.

We have to take this result with CAUTION since we are NOT considering other factors such as SEASONALITY and a possible growing trend in volume!

INTERPRETATION for the ESTUCHE CHOC regression output:

The R-squared is just 1.5% and the beta1 coefficient is -0.92, but it is NOT significantly negative since its p-value is 0.34. This means that although beta1 is negative, in the near future it can move between -2.883 and +1.029 with 95% probability. In other words, there is no consistency at the 95% confidence level that the direct price elasticity is negative. Then, we can also say that this product is inelastic, since its elasticity is NOT significantly NEGATIVE.

Again, we have to take this result with CAUTION since we are not considering other factors such as SEASONALITY and a possible growing trend in volume!

How can I consider the SEASONALITY pattern for volume sales? I can add binary dummy variables as X variables according to the month of the year.

How can I consider possible growing trend over time? I can add a time variable as another X variable (from 1 to the last month of the sales history).

Let’s do so for both products:

I create the binary dummies to the long-format data frame:

# Extract the month from the 'DATE' column and convert to string
salesdf['Month'] = salesdf['DATE'].dt.month.astype(str)

# Generate dummy variables for the 'Month' column, dropping the first category (January)
month_dummies = pd.get_dummies(salesdf['Month'], prefix='Month', drop_first=True)

# Concatenate the dummy variables with the original DataFrame
salesdf = pd.concat([salesdf, month_dummies], axis=1)

# I add t as a sequence number for the time periods (months)
salesdf['t'] = salesdf.groupby('PRODUCT').cumcount() + 1

salesdf=salesdf.drop('Month',axis=1)
# Display the first few rows with the new dummy variables
print(salesdf.head())
      FAB     ID       DATE           PRODUCT  ...  Month_7  Month_8  Month_9  t
0  FABUNO  97928 2020-01-01  CHOCOTAB 27/18GR  ...    False    False    False  1
1  FABUNO  97928 2020-02-01  CHOCOTAB 27/18GR  ...    False    False    False  2
2  FABUNO  97928 2020-03-01  CHOCOTAB 27/18GR  ...    False    False    False  3
3  FABUNO  97928 2020-04-01  CHOCOTAB 27/18GR  ...    False    False    False  4
4  FABUNO  97928 2020-05-01  CHOCOTAB 27/18GR  ...    False    False    False  5

[5 rows x 22 columns]

I now run a multiple regression model (instead of a simple one) for each product, adding the month dummy variables and the time sequence variable as X variables:

product_names = salesdf['PRODUCT'].unique()
for product_name in product_names:
    print(f"\nLinear Regression Results for {product_name}:")
    # Filter data for the current product:
    product_df = salesdf[salesdf['PRODUCT']==product_name].copy()
    # Run the OLS regression
    model = smf.ols('logUNITS ~ logPRICE+t+Month_2+Month_3+Month_4+Month_5+Month_6+Month_7+Month_8+Month_9+Month_10+Month_11+Month_12', data=product_df).fit()
    print(model.summary())

Linear Regression Results for CHOCOTAB 27/18GR:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               logUNITS   R-squared:                       0.398
Model:                            OLS   Adj. R-squared:                  0.228
Method:                 Least Squares   F-statistic:                     2.342
Date:              lun., 01 sep. 2025   Prob (F-statistic):             0.0172
Time:                        10:07:53   Log-Likelihood:                 13.425
No. Observations:                  60   AIC:                             1.150
Df Residuals:                      46   BIC:                             30.47
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           11.0574      1.322      8.364      0.000       8.396      13.719
Month_2[T.True]      0.3812      0.142      2.689      0.010       0.096       0.667
Month_3[T.True]      0.1189      0.140      0.847      0.401      -0.164       0.401
Month_4[T.True]      0.2343      0.141      1.658      0.104      -0.050       0.519
Month_5[T.True]      0.0887      0.143      0.619      0.539      -0.200       0.377
Month_6[T.True]     -0.1990      0.140     -1.417      0.163      -0.482       0.084
Month_7[T.True]     -0.0180      0.142     -0.127      0.900      -0.305       0.269
Month_8[T.True]      0.1304      0.146      0.895      0.376      -0.163       0.424
Month_9[T.True]      0.0846      0.141      0.601      0.551      -0.199       0.368
Month_10[T.True]     0.0760      0.141      0.541      0.591      -0.207       0.359
Month_11[T.True]     0.3440      0.144      2.387      0.021       0.054       0.634
Month_12[T.True]     0.2148      0.141      1.523      0.135      -0.069       0.499
logPRICE            -0.0990      0.278     -0.356      0.723      -0.658       0.460
t                    0.0022      0.002      0.909      0.368      -0.003       0.007
==============================================================================
Omnibus:                        0.568   Durbin-Watson:                   1.280
Prob(Omnibus):                  0.753   Jarque-Bera (JB):                0.503
Skew:                          -0.214   Prob(JB):                        0.778
Kurtosis:                       2.869   Cond. No.                     1.68e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Linear Regression Results for ESTUCHE CHOC 24 PZ:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               logUNITS   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.952
Method:                 Least Squares   F-statistic:                     90.51
Date:              lun., 01 sep. 2025   Prob (F-statistic):           3.09e-28
Time:                        10:07:53   Log-Likelihood:                 34.291
No. Observations:                  60   AIC:                            -40.58
Df Residuals:                      46   BIC:                            -11.26
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           32.6364      2.603     12.538      0.000      27.397      37.876
Month_2[T.True]      0.3882      0.101      3.825      0.000       0.184       0.593
Month_3[T.True]     -0.9005      0.099     -9.110      0.000      -1.099      -0.702
Month_4[T.True]     -0.8594      0.099     -8.669      0.000      -1.059      -0.660
Month_5[T.True]     -0.1537      0.107     -1.438      0.157      -0.369       0.061
Month_6[T.True]     -0.8022      0.099     -8.065      0.000      -1.002      -0.602
Month_7[T.True]     -0.8504      0.100     -8.517      0.000      -1.051      -0.649
Month_8[T.True]     -0.9268      0.100     -9.304      0.000      -1.127      -0.726
Month_9[T.True]     -0.9067      0.100     -9.089      0.000      -1.108      -0.706
Month_10[T.True]    -0.6669      0.101     -6.635      0.000      -0.869      -0.465
Month_11[T.True]     0.0936      0.111      0.847      0.401      -0.129       0.316
Month_12[T.True]     1.0545      0.100     10.536      0.000       0.853       1.256
logPRICE            -4.3867      0.518     -8.469      0.000      -5.429      -3.344
t                    0.0241      0.003      8.767      0.000       0.019       0.030
==============================================================================
Omnibus:                       11.877   Durbin-Watson:                   0.640
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               12.444
Skew:                          -0.923   Prob(JB):                      0.00199
Kurtosis:                       4.253   Cond. No.                     4.66e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.66e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Now we get interesting results for both products, especially for ESTUCHE CHOC.

The R-squared of both models increased a lot. The R-squared for ESTUCHE CHOC is 96%, while that of CHOCOTAB is about 39.8%.

The beta coefficient that multiplies logPRICE is the measure of direct elasticity after considering the SEASONALITY factor (the dummy variables for the month) and the time trend (the t variable).

We see that for ESTUCHE, the direct elasticity is -4.38 and it is SIGNIFICANTLY negative since its p-value is close to zero (t = -8.47). This means that, AFTER considering the systematic changes of sales volume by month (the dummy variables) and the time trend (t), when the price changes by +1%, sales units are expected to change by about -4.38%! We can also see that ESTUCHE has a significant growing trend over time, since the beta coefficient that multiplies t (b = +0.024) is positive and statistically significant. In other words, each month, sales volume increases by about 2.4%, after accounting for the direct elasticity and the seasonality.
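Since the slope of a log-log regression is an elasticity, the expected effect of a price change is a simple multiplication; a hedged sketch of that arithmetic using the ESTUCHE estimate from the output above:

elasticity = -4.3867          # logPRICE coefficient for ESTUCHE CHOC (from the summary)
price_change_pct = 1.0        # a +1% price increase
expected_volume_change_pct = elasticity * price_change_pct
print(expected_volume_change_pct)  # about -4.39% expected change in units sold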

What about the beta coefficients for the dummy variables? I leave this for your thinking!!

In conclusion, ESTUCHE CHOC is VERY ELASTIC after considering the seasonality pattern and the time trend!

For the case of CHOCOTAB, after considering seasonality and the time trend, the beta coefficient for price elasticity was negative but NOT statistically significant. Then, I keep the conclusion that CHOCOTAB is NOT significantly ELASTIC!

Finally, in order to better understand the price-volume relationship under the factors of seasonality and trend, it is a good idea to visualize how sales volume and unit price change over time:

# Get unique product names
product_names = salesdf['PRODUCT'].unique()

# Create a figure and a set of subplots
fig, axes = plt.subplots(len(product_names), 1, figsize=(12, 6 * len(product_names)), sharex=True)

# If there's only one product, axes will not be a list
if len(product_names) == 1:
    axes = [axes]

for i, product_name in enumerate(product_names):
    # Filter data for the current product
    product_df = salesdf[salesdf['PRODUCT'] == product_name].copy()

    # Create the first y-axis (for UNITS)
    ax1 = axes[i]
    ax1.bar(product_df['DATE'], product_df['UNITS'], color='skyblue', label='Units', width=20) # Increased width to 20
    ax1.set_ylabel('Units', color='skyblue')
    ax1.tick_params(axis='y', labelcolor='skyblue')
    ax1.grid(True)

    # Create the second y-axis (for PRICE)
    ax2 = ax1.twinx()
    ax2.plot(product_df['DATE'], product_df['PRICE'], color='salmon', marker='o', label='Price')
    ax2.set_ylabel('Price', color='salmon')
    ax2.tick_params(axis='y', labelcolor='salmon')

    # Set title and legend
    ax1.set_title(f'{product_name}: Units and Price Over Time')
    fig.legend(loc="upper right", bbox_to_anchor=(1,1), bbox_transform=ax1.transAxes)

# Set common xlabel
axes[-1].set_xlabel('Date')

# Adjust layout to prevent labels overlapping
plt.tight_layout()
plt.show()

With these plots we can visualize the seasonality pattern, and also how changes in price drive changes in units sold. For ESTUCHE CHOC it is clear that each December has significantly higher sales, and in the last 2 years we can see that prices declined in November and December, making sales increase significantly. This is consistent with the estimated direct elasticity of about -4.4.