You have to download monthly stock prices (from Yahoo Finance) from Jan 2021 to July 2023 for:
The Mexican IPC market index (ticker = “^MXX”)
The Walmart company (ticker = “WMT.MX”)
The Alfa company (ticker = “ALFAA.MX”)
You have to:
Calculate the descriptive statistics for the 3 variables (find the right data transformation) to show a good overview of these variables
Calculate the covariance and correlation between: a) the IPC return vs the Walmart return, and b) the Walmart return vs the Alfa return. Explain how you calculated your results, and interpret the correlations.
What can you say about the statistical significance of the previous 2 correlations? Calculate the corresponding p-values and explain with YOUR OWN WORDS.
2 SOLUTION CHALLENGE 1
1) Calculate the descriptive statistics for the 3 variables (find the right data transformation) to show a good overview of these variables
I download monthly prices for Walmart, Alfa and the Mexican market index:
#!pip install yfinance
import yfinance as yf
import pandas as pd

tickers = ["^MXX", "WMT.MX", "ALFAA.MX"]
start_date = "2021-01-01"
end_date = "2023-07-31"
data = yf.download(tickers, start=start_date, end=end_date, interval="1mo")
We can see that the range of prices (max - min) for these 3 instruments is very different. In Finance, the price of a stock in one month is not a measure of performance; each stock can trade at a very different price level. What matters is how much the price changes over time.
Then, it might be a good idea to standardize these 3 prices to the same scale, and then calculate descriptive statistics and compare the performance of the 2 stocks and the index.
In this context it is a good idea to create an index for each price starting at $1.00, with index values based on how much the price changes with respect to its price in the first month. Let's calculate this index by dividing each stock price by its first price:
# Assumed step: keep the monthly closing prices from the downloaded data
prices = data['Close']
# Rebase each series so every instrument starts at $1.00
indexed_prices = prices / prices.iloc[0]
With this transformation, the index for each instrument represents how much $1.00 invested in the instrument (stock or MXX) grows (or declines) over time. In other words, the index is a growth factor, as if I had invested in the instrument since the first month.
I can visualize how each stock and the market grow over time by plotting the indexed prices:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
indexed_prices.plot(ax=plt.gca())
plt.title('Indexed Prices Over Time')
plt.xlabel('Date')
plt.ylabel('$1.00 invested in Jan 2021')
plt.legend()
plt.grid(True)
plt.show()
We can see that the MXX grew about 1.3 times, while Alfa and Walmart declined to about 90% of their initial value.
We can now calculate descriptive statistics for these indexed values:
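A minimal sketch of this step, using pandas describe() on the indexed prices from above:

indexed_prices.describe()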
We can see that the MXX had the highest mean (1.17) and the highest maximum value (1.31). WMT had the lowest mean and the lowest minimum value. The instrument with the most variability in its index is ALFAA, with a standard deviation of 0.10, almost double that of WMT.
From the point of view of an investor, the price movements are relevant, but the end value of the price compared to the initial value is very important. So we can calculate the percentage growth (or decline) of the prices (or indexes) from the first month until the last month. In Finance this is called the Holding-Period Return (HPR). I can calculate the HPR by simply dividing the last value by the first value and then subtracting 1. Let's calculate the HPR for each instrument:
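A minimal sketch of this calculation, assuming the prices frame from above:

# HPR = (last value / first value) - 1, per instrument
HPR = prices.iloc[-1] / prices.iloc[0] - 1
print(HPR)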
We can see that the instrument with the highest % holding-period return was the MXX, with more than 27% over the 2.5 years, while ALFA and WMT ended with negative returns over the whole period (-12% and -7.5%, respectively).
In addition, from the perspective of investors, it is always informative to calculate period returns and look at their descriptive statistics to get an idea of the average return per period and the volatility (standard deviation of period returns). Let's calculate the simple period returns and the logarithmic returns for the 3 instruments:
Simple returns:
R = (prices / prices.shift(1) - 1).dropna()
R.head()
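Log returns (a minimal sketch; the name r matches its use in the later sections):

import numpy as np

# Log (continuously compounded) returns:
r = np.log(prices / prices.shift(1)).dropna()
r.head()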
We see that both types of returns are similar, but the log returns are bigger in magnitude for negative returns and smaller in magnitude for positive returns. Also, log returns are usually closer to normally distributed than simple returns.
# Sketch of the histogram grid (the exact plotting call is assumed):
r.hist(bins=12, figsize=(12, 6))
plt.suptitle('Histograms of Log Returns', y=1.02)
plt.tight_layout()
plt.show()
From an investor's point of view, these histograms are very informative, since we can see the range of percentage losses and gains over time for each instrument, and how frequently each range of gain/loss shows up (in how many months). For example, there was one month in which WMT lost between 15% and 20%, and one month in which the MXX gained between 15% and 20%.
I can calculate descriptive statistics for the simple returns and the log returns:
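A minimal sketch, assuming R (simple returns) and r (log returns) from above:

print(R.describe())
print(r.describe())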
We can see that ALFAA was the most volatile of the 3, since its monthly standard deviation is about 6.8% while the other 2 have standard deviations around 5.2%. ALFAA also had the worst average return, with -0.2% monthly. If we assume that ALFAA's returns follow a normal distribution, we can say that around 68% of the time ALFAA had monthly returns between -7% and +6.6%, since this range is about -1 to +1 standard deviation around its mean return.
The only instrument with a positive average return was MXX, with 0.94% (almost 1% monthly). To get a better idea of whether this return is good for an investor, it helps to estimate an annualized version of this monthly return. We can do this by multiplying the monthly return by 12, so the estimate of the annual average return is close to 12%, which is a good one considering that the risk-free investment (for example, CETES) paid much less than 12% in this period.
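A quick check of this annualization (a sketch; '^MXX' is the ticker column name as downloaded):

# Annualize the mean monthly return by multiplying by 12:
annualized_mean = 12 * r['^MXX'].mean()
print(annualized_mean)  # roughly 12 x 0.94% = 11.3%, close to 12% per year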
These estimates are similar to the descriptive statistics using simple returns. When the magnitude of the returns is small (less than 0.10 in absolute value), the simple and the log returns are very similar; they start to differ noticeably at higher magnitudes.
It is worth noting that negative mean log returns are more negative than the corresponding simple returns, while positive mean log returns are smaller in magnitude than the corresponding simple returns.
Log returns do not add much for descriptive statistics; they are most useful when we construct statistical models with returns, since log returns behave closer to normal than simple returns.
2) Calculate the covariance and correlation between: a) the IPC return vs the Walmart return, and b) the Walmart return vs the Alfa return. Explain how you calculated your results, and interpret the correlations.
I calculate covariance and correlation of returns using the cov and corr functions:
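A minimal sketch, using the pandas cov() and corr() methods on the log returns:

# Covariance matrix of the monthly returns:
print(r.cov())

# Correlation matrix of the monthly returns:
print(r.corr())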
The covariance values are difficult to interpret. The only thing we can read from a covariance estimate between 2 variables is its sign: if the covariance is positive (negative), the linear relationship between both variables is positive (negative).
However, the correlation is more informative. The correlation between 2 variables tells us whether the linear relationship is positive or negative, and how strong that linear relationship is.
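As a reminder of how the two measures relate: corr(X, Y) = cov(X, Y) / (sd(X) * sd(Y)); scaling the covariance by both standard deviations bounds the correlation between -1 and +1.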
In the correlation matrix we will always see 1's on the diagonal, since the correlation of a variable with itself is always 1 (100%).
The correlation between MXX and WMT is about +0.21, indicating a weak positive linear relationship: both returns tend to move in the same direction, but the co-movement is weak.
The correlation between WMT and ALFAA is negative (-0.11), suggesting a weak tendency for both returns to move in opposite directions. However, we must look at the statistical significance of this correlation before we conclude that both returns are negatively related. Below I calculate this statistical significance with the corresponding p-value.
We see that the highest correlation among these pairs is the one between ALFAA and MXX, about +0.35. This indicates a moderate positive linear relationship between the returns of ALFAA and the returns of MXX: both tend to move in the same direction.
3) What can you say about the statistical significance of the previous 2 correlations? Calculate the corresponding p-values and explain with YOUR OWN WORDS.
Besides looking at the sign and magnitude of each correlation, we can calculate its p-value to see whether the correlation is statistically significant; in other words, whether the correlation is significantly positive (or negative).
I calculate the correlation and p-values for each pair:
from scipy.stats import pearsonr
import numpy as np
import pandas as pd

def pairwise_corr_pvalues(df):
    """
    Calculates the pairwise correlation p-values for a pandas DataFrame.

    Args:
        df: pandas DataFrame.

    Returns:
        pandas DataFrame of pairwise correlation p-values.
    """
    cols = df.columns
    n_cols = len(cols)
    p_matrix = np.zeros((n_cols, n_cols))
    for i in range(n_cols):
        for j in range(i, n_cols):
            if i == j:
                p_matrix[i, j] = 0.0  # Correlation with itself is 1, p-value is 0
            else:
                # Calculate Pearson correlation and its p-value
                corr, p_value = pearsonr(df[cols[i]], df[cols[j]])
                p_matrix[i, j] = p_value
                p_matrix[j, i] = p_value  # Correlation matrix is symmetric
    return pd.DataFrame(p_matrix, columns=cols, index=cols)

# Calculate and display the p-values for the correlation matrix
correlation_pvalues = pairwise_corr_pvalues(r)
print("\nP-values for Correlation Matrix of Log Returns:")
print(correlation_pvalues)
We can see that the p-value of the correlation between ALFAA and MXX is 0.059, so this correlation is marginally significant, since its p-value is less than 0.10 but greater than 0.05. Equivalently, we can reject the null hypothesis of zero correlation at roughly the 94% confidence level (1 - 0.059).
The other 2 p-values are much higher than 0.05, indicating that we cannot say those correlations are significant; in other words, we can treat those pairs of returns as not linearly related. For example, WMT and ALFAA had a -0.11 correlation with p-value = 0.54: although the estimated correlation is negative, it is NOT significantly NEGATIVE, and we can treat both returns as NOT CORRELATED (keeping in mind that zero correlation does not strictly imply independence).
3 CHALLENGE 2
Run a regression model to examine whether the Alfa return is related to the IPC return. You want to see how sensitive the Alfa return is with respect to the IPC return. Decide which should be the dependent variable and which the independent variable, run the model, and interpret the regression coefficients with your OWN WORDS.
4 SOLUTION TO CHALLENGE 2
When running a regression model with return variables, it is recommended to use log returns instead of simple returns. The main reason is that log returns behave closer to normal than simple returns, and one of the assumptions of the linear regression model is that the errors (and hence, conditionally, the dependent variable) behave like a normally distributed variable.
Then, I run the regression model with the Alfa return as the Dependent Variable (Y) and the MXX (IPC) return as the Independent Variable (X), since I want to see how much the Alfa return changes with respect to the MXX return:
import statsmodels.formula.api as smf

r.columns = ['ALFAA','WMT','MXX']
# I estimate the OLS regression model:
regmodel = smf.ols('ALFAA ~ MXX', data=r).fit()
print(regmodel.summary())
I can see that beta0 = -0.0079 and beta1 = 0.4457. This means that the linear regression equation is given by:
E[ALFAr] = -0.0079 + 0.4457 (MXXr)
In other words, the expected monthly return of Alfa is about -0.0079 (beta0) when the MXX return is zero.
The slope of the line is +0.4457 (beta1), indicating a positive relationship. On average, when the MXX return moves by +1%, the expected move of the ALFA return is about +0.4457%. However, we need to look at the significance of beta0 and beta1, using the corresponding t-statistics, p-values and/or their 95% confidence intervals.
When we run a linear regression model, one hypothesis test is automatically performed for each coefficient, and the results of these hypothesis tests are displayed in the coefficient table. The null hypothesis of each test is always that the corresponding coefficient is equal to zero.
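For each coefficient b, the reported t-statistic is t = (b - 0) / SE(b), and the p-value is the probability of observing a t-statistic at least that extreme if the true coefficient were zero.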
Then, looking at the Intercept row (beta0), we see that its p-value is 0.51, much bigger than 0.05, indicating that even though the estimate of beta0 is negative, it is not significantly negative. Its 95% confidence interval goes from -0.032 to +0.017, so we cannot even be confident about the sign of beta0!
Looking at the MXX row (beta1), we see that its p-value = 0.06, so beta1 is only marginally significant: it is significantly positive at the 94% confidence level (1 - 0.06), and its 95% confidence interval goes from -0.02 to +0.91. The estimated sensitivity of the Alfa return with respect to the MXX return is +0.44, and the whole 95% confidence interval lies below 1. In this context, beta1 is a measure of the market risk of the stock, so we can say that Alfa appears less risky than the market, since its 95% C.I. for beta1 lies entirely below 1!
5 CHALLENGE 3
The data is an Excel file with monthly historical sales of 2 chocolate products from a wholesaler with more than 150 stores across all Mexican states. You have to do the following analysis:
Show important descriptive statistics about these 2 products that give you a good overview of sales and price performance over time. You have to decide what type of data wrangling and/or transformations you need to do, and what type of descriptive statistics can be relevant in this context. Think that you are responsible for the sales of these products at a national level, and you have to present your analysis to your boss.
For each product, design a regression model to estimate its direct price elasticity. Price elasticity refers to how sensitive the demand for a consumer product is to price changes; in other words, on average, what is the percentage change in volume sold with respect to a percentage change in price. You have to:
Decide the data wrangling/transformations, and decide the dependent and the independent variable(s) for each model,
For each product run a regression model to estimate the price elasticity, and interpret the model WITH YOUR OWN WORDS.
Besides changes in price, what other variables can you consider that might influence changes in sales volume? Just explain your ideas about this with your OWN WORDS.
6 CHALLENGE 3 SOLUTION
Downloading the data set and importing it into a data frame:
import pandas as pd

salesdf = pd.read_excel('SALES1.xlsx')
salesdf.head()
FAB ID DATE PRODUCT VALUE UNITS
0 FABUNO 97928 2020-01-01 CHOCOTAB 27/18GR 4.614653e+06 40793
1 FABUNO 97928 2020-02-01 CHOCOTAB 27/18GR 5.395821e+06 51813
2 FABUNO 97928 2020-03-01 CHOCOTAB 27/18GR 5.189054e+06 49169
3 FABUNO 97928 2020-04-01 CHOCOTAB 27/18GR 4.073454e+06 35361
4 FABUNO 97928 2020-05-01 CHOCOTAB 27/18GR 4.942860e+06 44344
Show important descriptive statistics about these 2 products that give you a good overview of sales and price performance over time. You have to decide what type of data wrangling and/or transformations you need to do, and what type of descriptive statistics can be relevant in this context. Think that you are responsible for the sales of these products at a national level, and you have to present your analysis to your boss.
We have monthly sales of 2 chocolate products sold in retail stores. The VALUE column is the sales in Mexican pesos ($), and UNITS is the number of units sold per month. If we divide VALUE by UNITS, we get the average unit price per month for each chocolate. Unit price is an important variable in this context to understand how price and sales relate.
I calculate the PRICE column for the average unit price per month:
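A minimal sketch of this step:

# Average unit price per month = sales value / units sold
salesdf['PRICE'] = salesdf['VALUE'] / salesdf['UNITS']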
Since we have only 2 products, we can start with a plot to show the performance of sales and unit price over time for both products.
I start plotting units sold over time for both products in one plot:
import matplotlib.pyplot as plt

# Plot UNITS over time for both products using product names
# Loop to create a temporary data frame with one product, and then plot it
plt.figure(figsize=(12, 6))
for product_name in salesdf['PRODUCT'].unique():
    product_df = salesdf[salesdf['PRODUCT'] == product_name]
    plt.plot(product_df['DATE'], product_df['UNITS'], label=f'{product_name}')
plt.xlabel('Date')
plt.ylabel('Units')
plt.title('Units Sold Over Time by Product')
plt.legend()
plt.grid(True)
plt.show()
It seems that ESTUCHE CHOC sells more units than CHOCOTAB. However, we need to look at sales by year, since ESTUCHE CHOC shows a strong seasonal pattern, with very high sales in some months and very low sales in others. We can see a clear seasonal pattern for ESTUCHE CHOC, with a peak of sales each December (close to 160,000 units each December). It is difficult to see a clear seasonal pattern for CHOCOTAB.
Neither product shows a clear pattern of sales growth over time; actually, ESTUCHE CHOC seems to be declining over the last few years.
We need to calculate annual sales and sales growth in percentage by year to have a better understanding (this is done below). For now, I plot the same view for unit price:
# Plot PRICE over time for both products
# Loop to create a temporary data frame with one product, and then plot it
plt.figure(figsize=(12, 6))
for product_name in salesdf['PRODUCT'].unique():
    product_df = salesdf[salesdf['PRODUCT'] == product_name]
    plt.plot(product_df['DATE'], product_df['PRICE'], label=f'{product_name}')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Unit Price Over Time by Product')
plt.legend()
plt.grid(True)
plt.show()
We clearly see that ESTUCHE CHOC is more expensive than CHOCOTAB. ESTUCHE CHOC shows an increasing price trend over the whole period, while CHOCOTAB shows a growing price trend only from mid-2022 to 2024.
To better appreciate the price difference between both products, we can do a histogram to show the distribution of prices for both products:
# Create a single histogram with both PRICE distributions
plt.figure(figsize=(12, 6))
plt.hist(salesdf[salesdf['PRODUCT'] == 'CHOCOTAB 27/18GR']['PRICE'], bins=10,
         alpha=0.7, label='CHOCOTAB 27/18GR', edgecolor='black', color='skyblue')
plt.hist(salesdf[salesdf['PRODUCT'] == 'ESTUCHE CHOC 24 PZ']['PRICE'], bins=10,
         alpha=0.7, label='ESTUCHE CHOC 24 PZ', edgecolor='black', color='salmon')
plt.xlabel('PRICE')
plt.ylabel('Frequency')
plt.title('Histogram of Unit Price Distribution by Product')
plt.legend()
plt.grid(True)
plt.show()
We can clearly see that ESTUCHE CHOC is much more expensive than CHOCOTAB. The unit price of ESTUCHE CHOC ranges from $140 to $210, while the CHOCOTAB price ranges from $80 to about $160. For both products, the height of each bar shows which price ranges appear more often across the months. Note that the histograms do not show the chronology (when each price occurred).
It is also very informative to visualize sales value (in $) to appreciate how much sales value each product generates over time. Sales value equals units sold times unit price. Let's see the sales value over time for both products:
# Plot VALUE over time for both products
# Loop to create a temporary data frame with one product, and then plot it
plt.figure(figsize=(12, 6))
for product_name in salesdf['PRODUCT'].unique():
    product_df = salesdf[salesdf['PRODUCT'] == product_name]
    plt.plot(product_df['DATE'], product_df['VALUE'], label=f'{product_name}')
plt.xlabel('Date')
plt.ylabel('Sales ($)')
plt.title('Sales ($) Over Time by Product')
plt.legend()
plt.grid(True)
plt.show()
Since units sold and price change across the months of each year, it is a good idea to calculate descriptive statistics per year for each product.
Let’s calculate descriptive statistics for price per year for each product:
# Create a column for the year:
salesdf['Year'] = salesdf['DATE'].dt.year

# Group by PRODUCT and Year and calculate descriptive statistics for PRICE:
descriptive_stats_yearly = salesdf.groupby(['PRODUCT','Year'])[['PRICE']].describe()

# Display the results, transposed for readability:
pd.options.display.float_format = "{:,.2f}".format
descriptive_stats_yearly.T
The mean and standard deviation of prices per year give a first idea of the pricing behavior of each product. Overall, ESTUCHE CHOC is more expensive than CHOCOTAB; both products systematically increase prices over the years; and the CHOCOTAB standard deviation per year is much greater than that of ESTUCHE, indicating that CHOCOTAB (the cheaper product) makes more price changes (discounts) across the months of each year.
We can also see that the mean and median of prices for ESTUCHE CHOC are very similar, suggesting that its distribution is close to normal; the previous histogram confirms this. For CHOCOTAB, however, the mean price is usually slightly greater than the median, indicating a possible skewed-to-the-right distribution; its histogram (above) confirms that the price distribution of CHOCOTAB is slightly skewed to the right.
Box plots are very informative for understanding the distribution of any variable, especially when the variable is not close to normally distributed. A box plot illustrates how the data is distributed through its quartiles and the inter-quartile range:
import seaborn as sns

# Create a box plot of price by product
plt.figure(figsize=(10, 6))
sns.boxplot(x='PRODUCT', y='PRICE', data=salesdf)
plt.xlabel('Product')
plt.ylabel('Price')
plt.title('Box Plot of Price by Product')
plt.grid(True)
plt.show()
For CHOCOTAB, the middle 50% of prices (between Quartile 1 and Quartile 3) lies between $120 and about $145. The inter-quartile range (IQR) is the difference between Q3 and Q1 of any distribution. For ESTUCHE, the IQR goes from about $160 to about $180. The whiskers below and above the box indicate the range where most of the values lie; they extend up to 1.5 times the IQR below Q1 and above Q3. Points outside the whiskers can be treated as outliers, and they are usually plotted as dots. In this case, there appear to be no outliers for either price, since there are no dots beyond the whiskers.
We have to be careful using the arithmetic mean for prices, since price is the result of dividing 2 variables (value and units), and some months concentrate high units sold at specific prices.
With the yearly table above we can see the variety of price changes per year for each product.
In the context of historical monthly sales data, the best measure of central tendency for annual average prices is usually the mean weighted by the units sold each month. Below I calculate this annual weighted average price, which is a better measure of the mean price.
Here is an annual summary of sales units, weighted average price and their % growth per year:
# Group by product and year and sum total units and value:
product_yearly_summary = salesdf.groupby(['PRODUCT', 'Year'])[['UNITS', 'VALUE']].sum()

# Calculate weighted average price by product and year, dividing total sales value by units sold:
product_yearly_summary['WAvg_Price'] = product_yearly_summary['VALUE'] / product_yearly_summary['UNITS']

# Calculate annual growth for UNITS, VALUE, and Weighted Price:
product_yearly_summary['Volume %Growth'] = 100 * (product_yearly_summary['UNITS'] / product_yearly_summary.groupby(['PRODUCT'])['UNITS'].shift(1) - 1)
product_yearly_summary['Value %Growth'] = 100 * (product_yearly_summary['VALUE'] / product_yearly_summary.groupby(['PRODUCT'])['VALUE'].shift(1) - 1)
product_yearly_summary['Price %Growth'] = 100 * (product_yearly_summary['WAvg_Price'] / product_yearly_summary.groupby(['PRODUCT'])['WAvg_Price'].shift(1) - 1)

# Unstack the 'Year' level to make years columns
product_yearly_summary_unstacked = product_yearly_summary.unstack('Year')

# Display the summary table for UNITS sold and its % Growth
#print(product_yearly_summary_unstacked[['UNITS','Volume %Growth']])

# Display the summary table for Avg Weighted Price and its % Growth
#print(product_yearly_summary_unstacked[['WAvg_Price','Price %Growth']])

# Display the summary table for VALUE Sales and its % Growth
#print(product_yearly_summary_unstacked[['VALUE','Value %Growth']])

# Display the annual summary table for VALUE Sales, UNIT Sales and Weighted Avg Price:
print(product_yearly_summary[['UNITS','VALUE','WAvg_Price']])
# Display the annual % Growth for VALUE Sales, UNIT Sales and Weighted Avg Price:
print(product_yearly_summary[['Volume %Growth','Price %Growth','Value %Growth']])
Volume %Growth Price %Growth Value %Growth
PRODUCT Year
CHOCOTAB 27/18GR 2020 NaN NaN NaN
2021 27.04 3.24 31.16
2022 -15.47 -5.02 -19.71
2023 15.43 20.14 38.68
2024 -8.27 14.85 5.35
ESTUCHE CHOC 24 PZ 2020 NaN NaN NaN
2021 29.91 5.26 36.75
2022 -9.00 7.61 -2.08
2023 9.66 1.29 11.08
2024 1.13 6.40 7.60
In this case, there are some differences between the arithmetic annual average prices and the corresponding weighted annual average prices. The weighted annual average price is a better representation of the annual average price, since it accounts for how many units were sold each month at different prices.
Both products had positive annual % price growth in most years; the exception is CHOCOTAB, which had a negative annual % price growth in 2022 (-5%).
The best year for CHOCOTAB was 2023, with +38.7% annual growth in Value Sales; its worst year was 2022, with a -19.7% decline in Value Sales.
The best year for ESTUCHE was 2021, with +36.7% growth in Value Sales; its worst year was 2022, with a -2.08% decline in Value Sales.
Important note: when calculating descriptive statistics for variables that are the ratio of 2 other variables (for example, unit price, or any other percentage or ratio), the best measure of central tendency is usually the weighted mean, not the arithmetic mean.
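A tiny hypothetical example of the difference (numbers invented for illustration):

import numpy as np

# Two months: a discounted price sells 100 units, a high price sells only 1 unit
price = np.array([10.0, 20.0])
units = np.array([100, 1])
print(price.mean())                          # arithmetic mean = 15.0
print((price * units).sum() / units.sum())   # weighted mean ~ 10.10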
For each product, design a regression model to estimate its direct price elasticity. Price elasticity refers to how sensitive the demand for a consumer product is to price changes; in other words, on average, what is the percentage change in volume sold with respect to a percentage change in price. You have to:
Decide the data wrangling/transformations, and decide the dependent and the independent variable(s) for each model,
For each product run a regression model to estimate the price elasticity, and interpret the model WITH YOUR OWN WORDS.
Besides changes in price, what other variables can you consider that might influence changes in sales volume? Just explain your ideas about this with your OWN WORDS.
SOLUTION:
2a) We start with the long-format monthly data frame. To estimate the direct price elasticity, we transform the monthly UNITS and PRICE with the natural log, so that for each product the elasticity is the slope b1 of the regression line log(UNITS) = b0 + b1 log(PRICE):
import numpy as np

salesdf['logPRICE'] = np.log(salesdf['PRICE'])
salesdf['logUNITS'] = np.log(salesdf['UNITS'])
For each product I run an OLS regression with logUNITS as the Y (Dependent Variable) and logPRICE as the X (Independent Variable).
Regression model for both products:
import statsmodels.formula.api as smf

product_names = salesdf['PRODUCT'].unique()
for product_name in product_names:
    print(f"\nLinear Regression Results for {product_name}:")
    # Filter data for the current product:
    product_df = salesdf[salesdf['PRODUCT'] == product_name].copy()
    # Run the OLS regression
    model = smf.ols('logUNITS ~ logPRICE', data=product_df).fit()
    print(model.summary())
INTERPRETATION for the CHOCOTAB regression output:
The R-squared of the model is close to zero. The beta1 coefficient is +0.0002, but it is NOT significantly different from zero, since its p-value = 0.999. This means that a +1% change in unit price DOES NOT significantly change the % of UNITS sold. In other words, this product seems to be inelastic, since a change in price does not significantly change sales volume.
We have to take this result with CAUTION, since we are NOT considering other factors such as SEASONALITY and a possible volume trend!
INTERPRETATION for the ESTUCHE CHOC regression output:
The R-squared is just 1.5% and the beta1 coefficient is -0.92, but it is NOT significantly negative, since its p-value = 0.34; its 95% confidence interval goes from -2.883 to +1.029. In other words, we cannot conclude at the 95% confidence level that the direct price elasticity is negative. Then, based on this simple model, this product also looks inelastic, since its elasticity is NOT significantly NEGATIVE.
Again, we have to take this result with CAUTION, since we are not considering other factors such as SEASONALITY and a possible volume trend!
How can I account for the SEASONALITY pattern in sales volume? I can add binary dummy variables as X variables according to the month of the year.
How can I account for a possible growth trend over time? I can add a time variable as another X variable (from 1 to the last month of the sales history).
Let’s do so for both products:
I add the binary dummy variables to the long-format data frame:
# Extract the month from the 'DATE' column and convert to string
salesdf['Month'] = salesdf['DATE'].dt.month.astype(str)

# Generate dummy variables for the 'Month' column, dropping the first category (January)
month_dummies = pd.get_dummies(salesdf['Month'], prefix='Month', drop_first=True)

# Concatenate the dummy variables with the original DataFrame
salesdf = pd.concat([salesdf, month_dummies], axis=1)

# Add t as a sequence number for the time periods (months)
salesdf['t'] = salesdf.groupby('PRODUCT').cumcount() + 1
salesdf = salesdf.drop('Month', axis=1)

# Display the first few rows with the new dummy variables
print(salesdf.head())
I now run a multiple regression model (instead of a simple one) for each product, adding the month dummy variables and the time sequence variable as X variables:
product_names = salesdf['PRODUCT'].unique()
for product_name in product_names:
    print(f"\nLinear Regression Results for {product_name}:")
    # Filter data for the current product:
    product_df = salesdf[salesdf['PRODUCT'] == product_name].copy()
    # Run the OLS regression with the month dummies and the time trend
    model = smf.ols('logUNITS ~ logPRICE + t + Month_2 + Month_3 + Month_4 + Month_5 + Month_6 + Month_7 + Month_8 + Month_9 + Month_10 + Month_11 + Month_12', data=product_df).fit()
    print(model.summary())
Now we get interesting results for both products, especially for ESTUCHE CHOC.
The R-squared of both models increased a lot: the R-squared for ESTUCHE CHOC is 96%, while that of CHOCOTAB is about 39.8%.
The beta coefficient that multiplies logPRICE is the measure of direct elasticity after accounting for the SEASONALITY factor (the month dummy variables) and the trend (the t variable).
We see that for ESTUCHE, the direct elasticity is -4.38 and it is SIGNIFICANTLY negative, since its p-value is close to zero (t = -8.4). This means that, AFTER accounting for the systematic changes of sales volume by month (the dummy variables) and the trend (t), when the price changes by +1%, the expected change in units sold is about -4.38%! We can also see that ESTUCHE has a significant growth trend over time, since the beta coefficient that multiplies t (b = +0.02) is positive and statistically significant: each month, sales volume increases by about 2.4% after accounting for the direct elasticity and the seasonality.
What about the beta coefficients for the dummy variables? I leave this for your thinking!!
In conclusion, ESTUCHE CHOC is VERY ELASTIC after accounting for the seasonality and time patterns!
For CHOCOTAB, after accounting for seasonality and the time trend, the beta coefficient for price elasticity is negative but NOT statistically significant. So I keep the conclusion that CHOCOTAB is NOT significantly ELASTIC!
Finally, to better understand the price-volume relationship together with seasonality and trend, it is a good idea to visualize how sales volume and unit price change over time:
# Get unique product names
product_names = salesdf['PRODUCT'].unique()

# Create a figure and a set of subplots
fig, axes = plt.subplots(len(product_names), 1, figsize=(12, 6*len(product_names)), sharex=True)

# If there's only one product, axes will not be a list
if len(product_names) == 1:
    axes = [axes]

for i, product_name in enumerate(product_names):
    # Filter data for the current product
    product_df = salesdf[salesdf['PRODUCT'] == product_name].copy()

    # Create the first y-axis (for UNITS)
    ax1 = axes[i]
    ax1.bar(product_df['DATE'], product_df['UNITS'], color='skyblue', label='Units', width=20)  # Increased width to 20
    ax1.set_ylabel('Units', color='skyblue')
    ax1.tick_params(axis='y', labelcolor='skyblue')
    ax1.grid(True)

    # Create the second y-axis (for PRICE)
    ax2 = ax1.twinx()
    ax2.plot(product_df['DATE'], product_df['PRICE'], color='salmon', marker='o', label='Price')
    ax2.set_ylabel('Price', color='salmon')
    ax2.tick_params(axis='y', labelcolor='salmon')

    # Set title and legend
    ax1.set_title(f'{product_name}: Units and Price Over Time')
    fig.legend(loc="upper right", bbox_to_anchor=(1,1), bbox_transform=ax1.transAxes)

# Set common xlabel
axes[-1].set_xlabel('Date')

# Adjust layout to prevent labels overlapping
plt.tight_layout()
plt.show()
With these plots we can visualize the seasonality pattern, and also how changes in price drive changes in units sold. For ESTUCHE CHOC it is clear that each December has much higher sales, and for the last 2 years we can see that prices declined in Nov and Dec, driving significant increases in sales. This is consistent with the estimated direct elasticity of -4.38.