This report analyzes sales data from Dominick’s spanning 1990 to 1996, focusing on trends across days, weeks, and months. Additionally, it investigates how demographic variables such as household wealth, household size, income, and education may have influenced store performance. The goal is to derive insights that inform future operational and expansion strategies.
library(haven)
Warning: package 'haven' was built under R version 4.4.3
library(dplyr)
Warning: package 'dplyr' was built under R version 4.4.3
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(lubridate)
Warning: package 'lubridate' was built under R version 4.4.3
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.4.3
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom) ccount <-read_dta("C:/Users/nikol/Downloads/ccount_stata (1)/ccount.dta")demo <-read_dta("C:/Users/nikol/Downloads/ccount_stata (1)/demo.dta")merged <-merge(ccount, demo, by ="store") # Merging the datamerged$date <-iconv(merged$date, from ="",to ="UTF-8", sub ="byte")
#Combining the datasets and turning the numberes un the "date" column into actual date with as.date, aswell as turning the number into %Y%m%d formatmerged$date <-as.character(merged$date)merged <- merged[grepl("^[0-9]{6}$", merged$date), ]merged$date <-as.Date(paste0("19", merged$date), format ="%Y%m%d")
Task 1a: The overall performance of the chain
To start of with we will looking at total sales from 1990 and compare them to 1996, to see whether or not the store has had a positive development. We will also compare their total sales of each weekday to see if the the development in question could have been impacted by consumer behavior.
#Creating two new columns one as weekdays and one as yearSys.setlocale("LC_TIME", "C")
[1] "C"
merged$weekday <-weekdays(merged$date)merged$year <-year(merged$date)# Filtering out only year 1990 and 1996filtered <- merged %>%filter(year %in%c(1990, 1996))filtered$total_sales <- filtered$grocery #Filtering out grocery as i suspect it is the sum of all goods being sold#Summarizing the sales for each day and arranging the correct order of the weekdaysweekday_sales <- filtered %>%group_by(year, weekday, store) %>%summarise(total_sales =sum(total_sales, na.rm =TRUE), .groups ="drop")weekday_order <-c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")weekday_sales$weekday <-factor(weekday_sales$weekday, levels = weekday_order)#Dividing the values by a million, this way the values become more clear and comprehensibleweekday_sales$total_sales <- weekday_sales$total_sales /10**6
#Using ggplot to add a very simple bar plot of total sales each weekggplot(weekday_sales, aes(x = weekday, y = total_sales, fill =factor(year))) +geom_bar(stat ="identity", position ="dodge") +labs(title ="Total weekly sales revenue by weekday (1990 vs 1996)",x ="Weekday", y ="Total sales in millions(USD)", fill ="Year" ) +theme_minimal()
Figure 1a shows the total sales in dollars of each weekday in 1990 in comparison to 1996, which clearly shows that sales have increased from 1990 to 1996 every day of the week. The days in which this is most evident is Monday, Saturday and Sunday, with the biggest increase in total sales. Wednesday and Friday on the other hand has experienced the lowest increase in sales, but is still one of the best performing days. All sales in 1990 and 1996 from every day of the week, range between 1 to almost 5 million sales. The reason as to why sales are so high on Saturday and Sunday is due to it being the weekend, Friday and Monday have high sales due to their proximity to the weekend. Whether the change in sales revenue is due to growth in sales or due to inflation is hard to determine. However it would be fair to assume that the day with the biggest increases in revenue are genuine increases in sales, due to the big growth.
Task 1b; total daily sales percentage change 1990-1996
While in the big picture the chain might have had positive growth in sales, this however is only a surface level look. A better approach might be to see how each individual store has developed from 1990 to 1996.
#using pivot wider to create column for each yearsales_percent <-pivot_wider(weekday_sales, names_from = year, values_from = total_sales,names_prefix ="sales_")
# Creating a column with the percantage change for each day in each store by multiplying, subtracting and dividing the columns.sales_percent <- sales_percent %>%unnest(cols =c(sales_1990, sales_1996)) %>%mutate(change =100* (sales_1996 - sales_1990) / sales_1990 ) %>%filter(change >=-100, change <=100, sales_1990 >0)sales_percent <- sales_percent %>%# Also creating the average percentage change of all the stores each weekdaygroup_by(weekday) %>%mutate(mean_change =weighted.mean(change, w = sales_1990, na.rm =TRUE))
#Using facet wrap and histogram to show the perchentage change in sales between 1990 and 1996, for each day in each store. I have also added a vline that shows the average change in sales of each day.ggplot(sales_percent, aes(x = change)) +geom_histogram(binwidth =5, fill ="steelblue", color ="white") +facet_wrap(~ weekday) +geom_vline(data = sales_percent, aes(xintercept = mean_change),linetype ="dashed", color ="red") +labs(title ="Sales % change from 1990 to 1996 by weekday",x ="Percentage Change", y ="Number of stores") +theme_minimal()
Figure 1b shows the percentage change in sales each store has experienced from 1990 to 1996, in comparison to each other. Despite what figure 1a may imply, figure 1b shows that the percentage change in sales across all stores has on average declined. There are also exceptions to this with a few outliers performing much better than the rest. Wednesday, Friday and Saturday has experienced the largest decrease in average sales across all stores. While Sunday, Monday, Tuesday and Thursday a much smaller decline in sales. Overall total sales might gave increased as figure 1a indicates, due to a small minority of stores performing well and increasing total sales, while the majority of stores are facing a decline in sales.
Task 1c: Demographics and daily sales
Looking at different demographic variables and the degree too which they can be found at stores with different levels of performance, could give insights into how why some stores perform much better than others.
#Finding the percent change in sales from day to day for weekday salesweekday_wide <- weekday_sales %>%pivot_wider(names_from = year, values_from = total_sales, names_prefix ="year_")weekday_wide <- weekday_wide %>%mutate(pct_change =100* (year_1996 - year_1990) / year_1990)
#merging weekday sales with desired demographicsdemoanalasys <- weekday_wide %>%left_join(select(filtered, store, educ, income, hvalmean, hsizeavg), by ="store") %>%distinct(store, weekday, .keep_all =TRUE)
Warning in left_join(., select(filtered, store, educ, income, hvalmean, : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
#Doing regression analisys of many different varibles on after a another to figure out which hads the biggest impact on sales growth.fit <-lm(pct_change ~ educ + income + hvalmean + hsizeavg, data = demoanalasys)summary(fit)
Call:
lm(formula = pct_change ~ educ + income + hvalmean + hsizeavg,
data = demoanalasys)
Residuals:
Min 1Q Median 3Q Max
-131.72 -54.60 -28.11 3.90 2869.94
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -56.306 379.192 -0.148 0.88201
educ -104.654 149.125 -0.702 0.48308
income -25.749 42.870 -0.601 0.54831
hvalmean 1.131 0.379 2.984 0.00296 **
hsizeavg 76.530 37.031 2.067 0.03920 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 182.8 on 597 degrees of freedom
(147 observations deleted due to missingness)
Multiple R-squared: 0.03786, Adjusted R-squared: 0.03142
F-statistic: 5.873 on 4 and 597 DF, p-value: 0.000122
fit_hvalmean <-lm(pct_change ~ hvalmean, data = demoanalasys )summary(fit_hvalmean)
Call:
lm(formula = pct_change ~ hvalmean, data = demoanalasys)
Residuals:
Min 1Q Median 3Q Max
-111.18 -47.94 -27.01 -1.93 2871.22
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.9277 24.8062 -3.303 0.00101 **
hvalmean 0.6848 0.1602 4.274 2.23e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 183.1 on 600 degrees of freedom
(147 observations deleted due to missingness)
Multiple R-squared: 0.02955, Adjusted R-squared: 0.02793
F-statistic: 18.27 on 1 and 600 DF, p-value: 2.23e-05
Out of every demographic variable tested the one with the most significant effect in determining sales growth is household value with a 12% correlation with changes in sales.
demoanalasys <- demoanalasys %>%filter(pct_change <=200)#Simple scatterplot comparing hvalmean and percentage changeggplot(demoanalasys, aes(x = hvalmean, y = pct_change)) +geom_point(alpha =0.5) +geom_smooth(method ="lm", se =TRUE, color ="blue") +labs(title ="Correlation between household value and sales change",x ="Average household value in thousands (USD)",y ="Sales change %") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
This graph seems to further confirm the results of the regression analysis. While there are many well performing stores in areas with low household value, the amount of poorly performing stores decreases significantly the higher the value of the household. The reason as to why could be multiple, households with higher value in general might have a higher income, thus being able to spend more. It could also be that bigger households contain a higher number of people, therefore having a higher level of consumption. Education similar to income could also be a big indicator of wealth an and value of the household, therefore it would be beneficial to see how these variable effect sales changes each day of the week.
#using nest and unnest to make models easier to plot weekday_models <- demoanalasys %>%group_by(weekday) %>%nest() %>%mutate( #Im not including hvalmean has no impact on daily sales changemodel =map(data, ~lm(pct_change ~ educ + income + hsizeavg, data = .)),results =map(model, tidy) ) %>%unnest(results)
Task 2a: Average monthly sales
Looking at the development of sales on a monthly basis could give further insight into consumer behavior. Especially how the average across all stores every year from 1990 to 1996, has developed each month.
monthly_sales <- merged %>%#Loading in the dataset and adding in monthsfilter(!is.na(date)) %>%mutate(year =year(date),month =month(date, label =TRUE, abbr =TRUE),total_sales =rowSums(select(., grocery), na.rm =TRUE) ) %>%filter(year >=1990, year <=1996)# Summarising the sales across store, month and year monthly_totals <- monthly_sales %>%group_by(store, year, month) %>%summarise(monthly_total =sum(total_sales, na.rm =TRUE), .groups ="drop")# Calculating the monthly averagemonthly_avg <- monthly_totals %>%group_by(month) %>%summarise(avg_sales =mean(monthly_total), .groups ="drop")monthly_avg$month <-factor(monthly_avg$month, levels = month.abb) # Arranging the month in correct row, necessary to plott it# Finding the highest and lowest month in terms of average saleshighest <- monthly_avg %>%filter(avg_sales ==max(avg_sales))lowest <- monthly_avg %>%filter(avg_sales ==min(avg_sales))print(monthly_avg)
# A tibble: 12 × 2
month avg_sales
<ord> <dbl>
1 Jan 746254.
2 Feb 667105.
3 Mar 748970.
4 Apr 712158.
5 May 764128.
6 Jun 716582.
7 Jul 734019.
8 Aug 702222.
9 Sep 732520.
10 Oct 722933.
11 Nov 739642.
12 Dec 798223.
print(highest)
# A tibble: 1 × 2
month avg_sales
<ord> <dbl>
1 Dec 798223.
print(lowest)
# A tibble: 1 × 2
month avg_sales
<ord> <dbl>
1 Feb 667105.
The month with the highest sales is December, the lowest month is February. The differences are probably sue to seasonal festivities, where consumers might spend more on groceries than usual.
Task 2b: monthly aggregate sales
While looking at the monthly average of all stores from every year gives insight into patterns of consumer behavior. However comparing them the monthly average of each year also gives us insight into the general performance of the store.
monthly_total2 <- monthly_sales %>%# Creating a new datasett which accounts for sales across year and monthgroup_by(year, month) %>%summarise(monthly_total =sum(total_sales, na.rm =TRUE), .groups ="drop")
# making a very simple plot with year as.factorggplot(monthly_total2, aes(x = month, y = monthly_total /1e6, group = year, color =as.factor(year))) +geom_line(size =1) +labs(title ="Monthly Aggregate Sales by month (1990–1996)",x ="Month",y ="Monthly Sales (Millions)",color ="Year" ) +theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Figure 2b shows the monthly development of sales in each year in comparison to each other from 1990 to 1996. 1991 was the best performing year in sales across very month except January and February. 1994, 1995 and 1996 being the lowest performing years. The biggest peaks in sales for all months were in March, May and December, May and December are month with a lot of holidays and seasonal celebrations, thus reaffirming the theory of seasonal festivities impacting sales. The biggest declines in sales are in February, April, August and October, the reason as to why could be that consumer are saving or cannot afford to spend more due to the high expenditure of the holidays.
Task 2c: Monthly percentage sales change
# Taking the monthly_totals from previously and making it widesales_change <- monthly_totals %>%pivot_wider(names_from = year, values_from = monthly_total, names_prefix ="sales_")# Making the percentage change like in previous taskssales_change <- sales_change %>%filter(!is.na(sales_1990), !is.na(sales_1996), sales_1990 >0, sales_1996 >0) %>%mutate(change_percent =100* (sales_1996 - sales_1990) / sales_1990)
#creating the averagesales_change <- sales_change %>%group_by(month) %>%mutate(mean_change =mean(change_percent))
#plotting a similar plot to task 1bggplot(sales_change, aes(x = change_percent)) +geom_histogram(binwidth =10, fill ="steelblue", color ="white") +facet_wrap(~ month, scales ="fixed") +scale_x_continuous(limits =c(-100, 100)) +labs(title ="Distribution of Sales % Change by Month (1990 vs 1996)",x ="% Change in Sales", y ="Number of Stores" ) +geom_vline(data = sales_change, aes(xintercept = mean_change),linetype ="dashed", color ="red") +theme_minimal()
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_bar()`).
Figure 2c shows the change in monthly sales growth of each store compared to each other from 1990 to 1996. Sales growth has on average declined across every month. The average aside, the majority of stores have experienced a sales decline. October and November have experienced the sharpest decreases, while other months declines were milder. This feeds in to theory of a few well performing stores driving the total sales in figure 1a upwards.
Task 2d: Demographics and monthly sales
Demographic impact on monthly sales could be very different from weekly, especially with wealth indicators such as income, household value and education. Since the customers wealth might have a bigger influence on their monthly consumer pattern, a previously theorized in 2b.
sales_change <- sales_change %>%# removing na, makeing me ablo to run regressionfilter(is.finite(change_percent))regression_month <- sales_change %>%#doing more or less the same as i did in 3a, but now with onths instead of weekdays left_join(select(filtered, store, educ, income, hvalmean, hsizeavg), by ="store")
Warning in left_join(., select(filtered, store, educ, income, hvalmean, : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
model1 <-lm(change_percent ~ hvalmean + educ + income + hsizeavg , data = regression_month)summary(model1)
Call:
lm(formula = change_percent ~ hvalmean + educ + income + hsizeavg,
data = regression_month)
Residuals:
Min 1Q Median 3Q Max
-68.078 -15.705 -2.337 11.488 275.852
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.360535 1.454423 39.44 <2e-16 ***
hvalmean 0.489072 0.001469 332.88 <2e-16 ***
educ -78.941510 0.582741 -135.47 <2e-16 ***
income -13.908413 0.164923 -84.33 <2e-16 ***
hsizeavg 10.512446 0.143368 73.33 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24 on 693977 degrees of freedom
Multiple R-squared: 0.2232, Adjusted R-squared: 0.2232
F-statistic: 4.985e+04 on 4 and 693977 DF, p-value: < 2.2e-16
model_educ <-lm(change_percent ~ educ, data = regression_month)summary(model_educ)
Call:
lm(formula = change_percent ~ educ, data = regression_month)
Residuals:
Min 1Q Median 3Q Max
-94.111 -16.860 -3.548 12.893 275.023
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -23.88716 0.06944 -344.0 <2e-16 ***
educ 75.35543 0.28554 263.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.95 on 693980 degrees of freedom
Multiple R-squared: 0.09121, Adjusted R-squared: 0.0912
F-statistic: 6.965e+04 on 1 and 693980 DF, p-value: < 2.2e-16
model_income <-lm(change_percent ~ income, data = regression_month)summary(model_income)
Call:
lm(formula = change_percent ~ income, data = regression_month)
Residuals:
Min 1Q Median 3Q Max
-91.123 -18.201 -5.219 14.254 292.206
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -188.3046 1.1709 -160.8 <2e-16 ***
income 17.0564 0.1104 154.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.77 on 693980 degrees of freedom
Multiple R-squared: 0.03324, Adjusted R-squared: 0.03324
F-statistic: 2.386e+04 on 1 and 693980 DF, p-value: < 2.2e-16
model_hval_2 <-lm(change_percent ~ hvalmean, data = regression_month)summary(model_hval_2)
Call:
lm(formula = change_percent ~ hvalmean, data = regression_month)
Residuals:
Min 1Q Median 3Q Max
-79.73 -15.69 -2.16 11.81 270.92
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.468e+01 9.760e-02 -457.8 <2e-16 ***
hvalmean 2.540e-01 6.359e-04 399.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24.55 on 693980 degrees of freedom
Multiple R-squared: 0.187, Adjusted R-squared: 0.187
F-statistic: 1.596e+05 on 1 and 693980 DF, p-value: < 2.2e-16
On the monthly basis the impact of wealth and household size increase slightly compared to the weekday sales, with all demographic variables influencing 22% of monthly sales changes form 1990 to 1996. Initially it seems as if income and education has a significant negative affect on sales, but it is only when they are taking each other and household value into account. When they are running on their own, the variables have a positive influence on sales. Since household value has the largest influence on sales growth, any other variable that takes into account similar measures as the average household value are going to be negative to prevent inflating and twisting the results. Household value in general influences 18% of the change on sales on a monthly basis, which is a significant increase from 1c where it was 12%. Therefore wealth does play a bigger role in more long term consumer behavior. Average household size does not seem to to have the same connection to wealth as other aforementioned variables, this pared with it having a very insignificant influence on sales has rendered it outside the scope of consideration for the rest of this report.
Task 3a: Yearly sales
#Summarising the sales of all stores for each year. Then i i take the average for every store for each year.average_aggregate_sales <- monthly_sales %>%group_by(year, store) %>%summarise(store_total =sum(total_sales), .groups ="drop") %>%group_by(year) %>%summarise(avg_sales =mean(store_total), .groups ="drop")# Now find highest and lowest yearhighest_year <- average_aggregate_sales %>%filter(avg_sales ==max(avg_sales))lowest_year <- average_aggregate_sales %>%filter(avg_sales ==min(avg_sales))print(average_aggregate_sales)
# A tibble: 1 × 2
year avg_sales
<dbl> <dbl>
1 1996 8916987.
print(lowest_year)
# A tibble: 1 × 2
year avg_sales
<dbl> <dbl>
1 1994 8305773.
average_aggregate_sales %>%ggplot(aes(x = year, y = avg_sales /10**6))+geom_line()+labs(title ="Yearly average aggregate sales", x ="Years (1990-1996)", y ="Average aggrregate sales(millions)")+theme_minimal()
From 1990 to 1996, the year with the lowest sales is 1994 with a very sharp decline for the previous year. From 1994 to 1996 there was a sharp increase in average sales, much larger the the decline of the previous years with the peak being in 1996. Therefore despite i dip in 1994, average aggregate sales have been on the increase in this time period.
Task 3b: Yearly percentage change in sales
ggplot(sales_change, aes(x = change_percent)) +#Have allready made the data necessary to make the graph. For the "mean" line i don't bother with creating a new dataset, just do the mean in the plot, same results anyway:)geom_histogram(binwidth =10, fill ="black", color ="white") +geom_vline(xintercept =mean(sales_change$change_percent, na.rm =TRUE),color ="red", linetype ="dashed")+labs(title ="Store-level % Change in Total Sales (1990–1996)",x ="% Change", y ="Number of Stores")
Figure 3b which shows the percentage change of all stores compared to each other from 1990 to 1996. The graph further indicates what the data has shown previously, mainly that the average change in sales is on a decline with the majority of stores experiencing a negative change in growth with some outliners having a very big decline in sales. On the opposite end we have some stores that have experienced an increase in sales far larger than any single store has declined, therefore likely contributing to the rising average aggregate sales from figure 3a. Note the outliers in this graph are almost unnoticeable
Task 3c: Best and worst performing stores
Since there seems to be lot of outliers influencing the total sales of the chain, it might be beneficial to look at the best and worst performing stores and how they have developed.
store_growth <- sales_change %>%group_by(store) %>%summarise(change_percent =mean(change_percent, na.rm =TRUE), .groups ="drop")#arranging the sales into the top 20 best and worst performin by arranging the i the desired order and then slicing the "top" of the listtop <- store_growth %>%arrange(desc(change_percent)) %>%slice_head(n =40)bottom <- store_growth %>%arrange(change_percent) %>%slice_head(n =40)top_extreme <- store_growth %>%arrange(desc(change_percent)) %>%slice_head(n =10)bottom_extreme <- store_growth%>%arrange(change_percent) %>%slice_head(n =10)print(top)
# Making 1990 the baseline for mye yearly sales baseline_1990 <- merged %>%mutate(date =as.Date(date)) %>%filter(year(date) ==1990) %>%mutate(total_sales =rowSums(select(., grocery ), na.rm =TRUE)) %>%group_by(store) %>%summarise(base_avg =mean(total_sales, na.rm =TRUE), .groups ="drop")# Merging the rows of the top and bottom togetherselected_stores <-bind_rows(top, bottom) %>%pull(store)# Uising a ifelse to group the stores in their correct spaces, aswell as creating percentage change and the correct month formatdaily_percent_data <- merged %>%mutate(date =as.Date(date)) %>%filter(store %in% selected_stores) %>%mutate(total_sales =rowSums(select(., grocery), na.rm =TRUE),year =year(date),month =floor_date(date, "month") ) %>%filter(year >=1990, year <=1996) %>%left_join(baseline_1990, by ="store") %>%mutate(pct_change =100* (total_sales - base_avg) / base_avg,group =ifelse(store %in% top$store, "Top 40", "Bottom 40") )# 4monthly_avg_percent <- daily_percent_data %>%group_by(month, group) %>%summarise(avg_pct_change =mean(pct_change, na.rm =TRUE), .groups ="drop")
ggplot(monthly_avg_percent, aes(x = month, y = avg_pct_change, color = group)) +geom_line(size =1.2) +geom_hline(yintercept =0, linetype ="dashed") +labs(title ="Monthly % Change in Sales (Top vs Bottom 40 Stores, 1990–1996)",x ="Month",y ="Average % Change from 1990 Baseline",color ="Store Group" ) +theme_minimal()
The change in sales for the 40 best performing stores have predictably had a very positive sales development. Simultaneously the 40 least performing stores have faced an equally sharp recession in sales. This again points toward the best performing stores being the ones driving the increase in sales from.
Task 3d: Regressionanalysis of the 40 best performing stores
Why the top 40 best and worst performing stores could have such a unequal development, would become clearer if demographics wealth disparity is taken into account.
top_growth_data <- monthly_sales %>%filter(store %in% top$store) %>%#Turning monthly sales into a data set for the top 40 stores by yeargroup_by(store, year) %>%summarise(avg_sales =mean(total_sales, na.rm =TRUE), .groups ="drop") %>%pivot_wider(names_from = year, values_from = avg_sales, names_prefix ="sales_") %>%filter(!is.na(sales_1990), !is.na(sales_1996), sales_1990 >0, sales_1996 >0) %>%mutate(change_percent =100* (sales_1996 - sales_1990) / sales_1990)#Making new top and bottom data without a baslinebottom_growth_data <- monthly_sales %>%filter(store %in% bottom$store) %>%group_by(store, year) %>%summarise(avg_sales =mean(total_sales, na.rm =TRUE), .groups ="drop") %>%pivot_wider(names_from = year, values_from = avg_sales, names_prefix ="sales_") %>%filter(!is.na(sales_1990), !is.na(sales_1996), sales_1990 >0, sales_1996 >0) %>%mutate(change_percent =100* (sales_1996 - sales_1990) / sales_1990)
#doing regression analysisgrowth_model_top <-lm(change_percent ~ income + educ + hvalmean, data = top_growth_data)summary(growth_model_top)
Call:
lm(formula = change_percent ~ income + educ + hvalmean, data = top_growth_data)
Residuals:
Min 1Q Median 3Q Max
-45.189 -11.366 -6.388 7.966 55.821
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.99039 127.92694 0.344 0.733
income -4.07590 12.47154 -0.327 0.746
educ -5.85923 57.96973 -0.101 0.920
hvalmean 0.09732 0.14065 0.692 0.493
Residual standard error: 19.91 on 36 degrees of freedom
Multiple R-squared: 0.02809, Adjusted R-squared: -0.05291
F-statistic: 0.3468 on 3 and 36 DF, p-value: 0.7917
growth_model_bottom <-lm(change_percent ~ income + educ + hvalmean, data = bottom_growth_data)summary(growth_model_bottom)
Call:
lm(formula = change_percent ~ income + educ + hvalmean, data = bottom_growth_data)
Residuals:
Min 1Q Median 3Q Max
-25.903 -7.156 2.321 6.385 17.880
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.88909 94.15381 0.158 0.8752
income -5.42888 9.25635 -0.587 0.5612
educ -57.96038 37.91630 -1.529 0.1351
hvalmean 0.21004 0.09367 2.242 0.0312 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.09 on 36 degrees of freedom
Multiple R-squared: 0.1286, Adjusted R-squared: 0.05595
F-statistic: 1.771 on 3 and 36 DF, p-value: 0.1702
For the top 40 best performing stores it average household value, income and education, seems influence a very insignificant 2.8% of sales change, the sales. The effect is more significant on the 40 worst performing stores, but the p-value of the data is very low indicating very scattered and unreliable data to draw confident conclusions. The wealth does not seem to be a factor in why the best performing stores are doing so well, but there could be a certain wealth threshold that the stores need to reach to have a reduced decline in sales. To definitively prove if average household value has any effect on sales change, it would be advantageous to compare the sales development of the top 40 stores with the highest and lowest average household value customers.
# Selceting only hvalmeanstore_hval <- filtered %>%select(store, hvalmean) %>%distinct()# attaching it to sales changesales_with_hval <- sales_change %>%left_join(store_hval, by ="store")# dividing it into highest and lowest groups by the value of their hvalmeanstore_groups <- sales_with_hval %>%group_by(store) %>%summarise(hvalmean =mean(hvalmean, na.rm =TRUE), .groups ="drop") %>%arrange(hvalmean) %>%mutate(group =case_when(row_number() <=40~"Low hvalmean",row_number() >n() -40~"High hvalmean",TRUE~NA_character_ )) %>%filter(!is.na(group))# 3. Merging the data because there ere some things sales_change lackedsales_grouped <- sales_with_hval %>%left_join(store_groups, by ="store")#Join merging the years togheter for plottingsales_long <- sales_grouped %>%pivot_longer(cols =starts_with("sales_"),names_to ="year",names_prefix ="sales_",values_to ="sales") %>%mutate(year =as.integer(year),date =dmy(paste("01", month, year)) ) %>%filter(!is.na(group), year >=1990, year <=1996)# Finding average sales per groupmonthly_avg <- sales_long %>%group_by(date, group) %>%summarise(avg_sales =mean(sales, na.rm =TRUE), .groups ="drop")# Creatin a basline just like previouslybaseline_sales <- sales_long %>%filter(date ==ymd("1990-01-01")) %>%group_by(group) %>%summarise(baseline =mean(sales, na.rm =TRUE), .groups ="drop")#infign the percentage change in sales for each groupmonthly_pct_change <- sales_long %>%filter(!is.na(group)) %>%group_by(group, date) %>%summarise(avg_sales =mean(sales, na.rm =TRUE), .groups ="drop") %>%left_join(baseline_sales, by ="group") %>%mutate(pct_change =100* (avg_sales - baseline) / baseline)
ggplot(monthly_pct_change, aes(x = date, y = pct_change, color = group)) +geom_line(size =1) +geom_hline(yintercept =0, linetype ="dashed") +labs(title ="Monthly % Sales Change Relative to Jan 1990",subtitle ="Comparing High vs Low average household value Store Groups",x ="Date",y ="Percentage Change (%)",color ="Store Group" ) +theme_minimal()
From 1992 and onward the store group with highest average household value experienced higher growth in sales as opposed to the group the the lowest value, by 1996 the disparity between the groups reaches almost 10%. With this and all the analysis made previously, it is proven that wealth indicators such as household value, income and education has maybe not the biggest, but a significant positive correlation with sales growth.
Task 4: What can this data be used for
There are various finding in the data and analyses that could be useful for expanding future operations. First and foremost the sales growth data indicates that the majority of stores are facing a decline. A significant factor that impact the development of sales is wealth in the form om household value. With this information in mind it could be worth it for the company to evaluate the profitability of some of their lowest performing stores and maybe consider downsizing. When it comes to expansion of future operations it would be preferable to expand to wealthier areas with higher income, household value and education.
There is also data of the daily and monthly sales which brings insight into weekly and yearly consumer behavior. Which could be used for planning sales and staffing to optimize revenue and operations.
Appendix:
Chat GPT has has been used in this task to aid with writing / correcting code, analyzing data sets, help interpreting tasks and giving feedback on code and writing. Full documentation of the all chat logs used to make this assignment can be found in a word file uploaded together with this file.