Eksamen-SOK1005

Author

Introduction

This report analyzes sales data from Dominick’s spanning 1990 to 1996, focusing on trends across days, weeks, and months. Additionally, it investigates how demographic variables such as household wealth, household size, income, and education may have influenced store performance. The goal is to derive insights that inform future operational and expansion strategies.

library(haven)

Warning: package 'haven' was built under R version 4.4.3

library(dplyr)

Warning: package 'dplyr' was built under R version 4.4.3


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(lubridate)

Warning: package 'lubridate' was built under R version 4.4.3


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.4.3

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0     ✔ stringr 1.5.1
✔ purrr   1.0.2     ✔ tibble  3.2.1
✔ readr   2.1.5     ✔ tidyr   1.3.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom) 
ccount <- read_dta("C:/Users/nikol/Downloads/ccount_stata (1)/ccount.dta")
demo <- read_dta("C:/Users/nikol/Downloads/ccount_stata (1)/demo.dta")
merged <- merge(ccount, demo, by = "store") # Merging the data
merged$date <- iconv(merged$date, from = "",to = "UTF-8", sub = "byte")

#Combining the datasets and turning the numberes un the "date" column into actual date with as.date, aswell as turning the number into %Y%m%d format
merged$date <- as.character(merged$date)
merged <- merged[grepl("^[0-9]{6}$", merged$date), ]
merged$date <- as.Date(paste0("19", merged$date), format = "%Y%m%d")

Task 1a: The overall performance of the chain

To start of with we will looking at total sales from 1990 and compare them to 1996, to see whether or not the store has had a positive development. We will also compare their total sales of each weekday to see if the the development in question could have been impacted by consumer behavior.

#Creating two new columns one as weekdays and one as year
Sys.setlocale("LC_TIME", "C")

[1] "C"

merged$weekday <- weekdays(merged$date)
merged$year <- year(merged$date)
# Filtering out only year 1990 and 1996
filtered <- merged %>% 
 filter(year %in% c(1990, 1996))
filtered$total_sales <- filtered$grocery #Filtering out grocery as i suspect it is the sum of all goods being sold

#Summarizing the sales for each day and arranging the correct order of the weekdays
weekday_sales <- filtered %>%
 group_by(year, weekday, store) %>%
 summarise(total_sales = sum(total_sales, na.rm = TRUE), .groups = "drop")
weekday_order <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
weekday_sales$weekday <- factor(weekday_sales$weekday, levels = weekday_order)
#Dividing the values by a million, this way the values become more clear and comprehensible
weekday_sales$total_sales <- weekday_sales$total_sales / 10**6

#Using ggplot to add a very simple bar plot of total sales each week
ggplot(weekday_sales, aes(x = weekday, y = total_sales, fill = factor(year))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
  title = "Total weekly sales revenue by weekday (1990 vs 1996)",
  x = "Weekday", y = "Total sales in millions(USD)", fill = "Year"

  ) +
  theme_minimal()

Figure 1a shows the total sales in dollars of each weekday in 1990 in comparison to 1996, which clearly shows that sales have increased from 1990 to 1996 every day of the week. The days in which this is most evident is Monday, Saturday and Sunday, with the biggest increase in total sales. Wednesday and Friday on the other hand has experienced the lowest increase in sales, but is still one of the best performing days. All sales in 1990 and 1996 from every day of the week, range between 1 to almost 5 million sales. The reason as to why sales are so high on Saturday and Sunday is due to it being the weekend, Friday and Monday have high sales due to their proximity to the weekend. Whether the change in sales revenue is due to growth in sales or due to inflation is hard to determine. However it would be fair to assume that the day with the biggest increases in revenue are genuine increases in sales, due to the big growth.

Task 1b; total daily sales percentage change 1990-1996

While in the big picture the chain might have had positive growth in sales, this however is only a surface level look. A better approach might be to see how each individual store has developed from 1990 to 1996.

#using pivot wider to create column for each year
sales_percent <- pivot_wider(weekday_sales, 
                          names_from = year, 
                          values_from = total_sales,
                          names_prefix = "sales_")

# Creating a column with the percantage change for each day in each store by multiplying, subtracting and dividing the columns.
sales_percent <- sales_percent %>%
  unnest(cols = c(sales_1990, sales_1996)) %>%
  mutate(
    change = 100 * (sales_1996 - sales_1990) / sales_1990
  ) %>%
  filter(change >= -100, change <= 100, sales_1990 > 0)
sales_percent <- sales_percent %>% # Also creating the average percentage change of all the stores each weekday
 group_by(weekday) %>%
 mutate(mean_change = weighted.mean(change, w = sales_1990, na.rm = TRUE))

#Using facet wrap and histogram to show the perchentage change in sales between 1990 and 1996, for each day in each store. I have also added a vline that shows the average change in sales of each day.

ggplot(sales_percent, aes(x = change)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  facet_wrap(~ weekday) +
  geom_vline(data = sales_percent, aes(xintercept = mean_change),
             linetype = "dashed", color = "red") +
  labs(title = "Sales % change from 1990 to 1996 by weekday",
       x = "Percentage Change", y = "Number of stores") +
  theme_minimal()

Figure 1b shows the percentage change in sales each store has experienced from 1990 to 1996, in comparison to each other. Despite what figure 1a may imply, figure 1b shows that the percentage change in sales across all stores has on average declined. There are also exceptions to this with a few outliers performing much better than the rest. Wednesday, Friday and Saturday has experienced the largest decrease in average sales across all stores. While Sunday, Monday, Tuesday and Thursday a much smaller decline in sales. Overall total sales might gave increased as figure 1a indicates, due to a small minority of stores performing well and increasing total sales, while the majority of stores are facing a decline in sales.

Task 1c: Demographics and daily sales

Looking at different demographic variables and the degree too which they can be found at stores with different levels of performance, could give insights into how why some stores perform much better than others.

#Finding the percent change in sales from day to day for weekday sales
weekday_wide <- weekday_sales %>%
  pivot_wider(names_from = year, values_from = total_sales, names_prefix = "year_")
weekday_wide <- weekday_wide %>%
  mutate(pct_change = 100 * (year_1996 - year_1990) / year_1990)

#merging weekday sales with desired demographics
demoanalasys <- weekday_wide %>%
  left_join(select(filtered, store, educ, income, hvalmean, hsizeavg), by = "store") %>%
  distinct(store, weekday, .keep_all = TRUE)

Warning in left_join(., select(filtered, store, educ, income, hvalmean, : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

#Doing regression analisys of many different varibles on after a another to figure out which hads the biggest impact on sales growth.

fit <- lm(pct_change ~ educ + income + hvalmean + hsizeavg, data = demoanalasys)
summary(fit)


Call:
lm(formula = pct_change ~ educ + income + hvalmean + hsizeavg, 
    data = demoanalasys)

Residuals:
    Min      1Q  Median      3Q     Max 
-131.72  -54.60  -28.11    3.90 2869.94 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -56.306    379.192  -0.148  0.88201   
educ        -104.654    149.125  -0.702  0.48308   
income       -25.749     42.870  -0.601  0.54831   
hvalmean       1.131      0.379   2.984  0.00296 **
hsizeavg      76.530     37.031   2.067  0.03920 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 182.8 on 597 degrees of freedom
  (147 observations deleted due to missingness)
Multiple R-squared:  0.03786,   Adjusted R-squared:  0.03142 
F-statistic: 5.873 on 4 and 597 DF,  p-value: 0.000122

fit_hvalmean <- lm(pct_change  ~ hvalmean, data = demoanalasys  )
summary(fit_hvalmean)


Call:
lm(formula = pct_change ~ hvalmean, data = demoanalasys)

Residuals:
    Min      1Q  Median      3Q     Max 
-111.18  -47.94  -27.01   -1.93 2871.22 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -81.9277    24.8062  -3.303  0.00101 ** 
hvalmean      0.6848     0.1602   4.274 2.23e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 183.1 on 600 degrees of freedom
  (147 observations deleted due to missingness)
Multiple R-squared:  0.02955,   Adjusted R-squared:  0.02793 
F-statistic: 18.27 on 1 and 600 DF,  p-value: 2.23e-05

Out of every demographic variable tested the one with the most significant effect in determining sales growth is household value with a 12% correlation with changes in sales.

demoanalasys <- demoanalasys %>% 
  filter(pct_change <= 200)
#Simple scatterplot comparing hvalmean and percentage change
ggplot(demoanalasys, aes(x = hvalmean, y = pct_change)) + 
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Correlation between household value and sales change",
       x = "Average household value in thousands (USD)",
       y = "Sales change %") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

This graph seems to further confirm the results of the regression analysis. While there are many well performing stores in areas with low household value, the amount of poorly performing stores decreases significantly the higher the value of the household. The reason as to why could be multiple, households with higher value in general might have a higher income, thus being able to spend more. It could also be that bigger households contain a higher number of people, therefore having a higher level of consumption. Education similar to income could also be a big indicator of wealth an and value of the household, therefore it would be beneficial to see how these variable effect sales changes each day of the week.

#using nest and unnest to make models easier to plot 
weekday_models <- demoanalasys %>%
  group_by(weekday) %>%
  nest() %>%
  mutate( #Im not including hvalmean has no impact on daily sales change
    model = map(data, ~ lm(pct_change ~ educ + income + hsizeavg, data = .)),
    results = map(model, tidy)
  ) %>%
  unnest(results)

Task 2a: Average monthly sales

Looking at the development of sales on a monthly basis could give further insight into consumer behavior. Especially how the average across all stores every year from 1990 to 1996, has developed each month.

monthly_sales <- merged %>% #Loading in the dataset and adding in months
  filter(!is.na(date)) %>%
  mutate(
    year = year(date),
    month = month(date, label = TRUE, abbr = TRUE),
    total_sales = rowSums(select(., grocery), na.rm = TRUE)
  ) %>%
  filter(year >= 1990, year <= 1996)

# Summarising the sales across store, month and year 
monthly_totals <- monthly_sales %>%
  group_by(store, year, month) %>%
  summarise(monthly_total = sum(total_sales, na.rm = TRUE), .groups = "drop")

# Calculating the monthly average
monthly_avg <- monthly_totals %>%
  group_by(month) %>%
  summarise(avg_sales = mean(monthly_total), .groups = "drop")

monthly_avg$month <- factor(monthly_avg$month, levels = month.abb) # Arranging the month in correct row, necessary to plott it

# Finding the highest and lowest month in terms of average sales
highest <- monthly_avg %>% filter(avg_sales == max(avg_sales))
lowest <- monthly_avg %>% filter(avg_sales == min(avg_sales))

print(monthly_avg)

# A tibble: 12 × 2
   month avg_sales
   <ord>     <dbl>
 1 Jan     746254.
 2 Feb     667105.
 3 Mar     748970.
 4 Apr     712158.
 5 May     764128.
 6 Jun     716582.
 7 Jul     734019.
 8 Aug     702222.
 9 Sep     732520.
10 Oct     722933.
11 Nov     739642.
12 Dec     798223.

print(highest)

# A tibble: 1 × 2
  month avg_sales
  <ord>     <dbl>
1 Dec     798223.

print(lowest)

# A tibble: 1 × 2
  month avg_sales
  <ord>     <dbl>
1 Feb     667105.

The month with the highest sales is December, the lowest month is February. The differences are probably sue to seasonal festivities, where consumers might spend more on groceries than usual.

Task 2b: monthly aggregate sales

While looking at the monthly average of all stores from every year gives insight into patterns of consumer behavior. However comparing them the monthly average of each year also gives us insight into the general performance of the store.

monthly_total2 <- monthly_sales %>% # Creating a new datasett which accounts for sales across year and month
  group_by(year, month) %>%
  summarise(monthly_total = sum(total_sales, na.rm = TRUE), .groups = "drop")

# making a very simple plot with year as.factor
ggplot(monthly_total2, aes(x = month, y = monthly_total / 1e6, group = year, color = as.factor(year))) +
  geom_line(size = 1) +
  labs(
    title = "Monthly Aggregate Sales by month (1990–1996)",
    x = "Month",
    y = "Monthly Sales (Millions)",
    color = "Year"
  ) +
  theme_minimal()

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Figure 2b shows the monthly development of sales in each year in comparison to each other from 1990 to 1996. 1991 was the best performing year in sales across very month except January and February. 1994, 1995 and 1996 being the lowest performing years. The biggest peaks in sales for all months were in March, May and December, May and December are month with a lot of holidays and seasonal celebrations, thus reaffirming the theory of seasonal festivities impacting sales. The biggest declines in sales are in February, April, August and October, the reason as to why could be that consumer are saving or cannot afford to spend more due to the high expenditure of the holidays.

Task 2c: Monthly percentage sales change

# Taking the monthly_totals from previously and making it wide
sales_change <- monthly_totals %>%
 pivot_wider(names_from = year, values_from = monthly_total, names_prefix = "sales_")
# Making the percentage change like in previous tasks
sales_change <- sales_change %>%
 filter(!is.na(sales_1990), !is.na(sales_1996), sales_1990 > 0, sales_1996 > 0) %>%
 mutate(change_percent = 100 * (sales_1996 - sales_1990) / sales_1990)

#creating the average
sales_change <- sales_change %>%
  group_by(month) %>%
  mutate(mean_change = mean(change_percent))

#plotting a similar plot to task 1b
ggplot(sales_change, aes(x = change_percent)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white") +
  facet_wrap(~ month, scales = "fixed") +
  scale_x_continuous(limits = c(-100, 100)) +
  labs(
    title = "Distribution of Sales % Change by Month (1990 vs 1996)",
    x = "% Change in Sales", y = "Number of Stores"
  ) +
  geom_vline(data = sales_change, aes(xintercept = mean_change),
             linetype = "dashed", color = "red") +
  theme_minimal()

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

Warning: Removed 24 rows containing missing values or values outside the scale range
(`geom_bar()`).

Figure 2c shows the change in monthly sales growth of each store compared to each other from 1990 to 1996. Sales growth has on average declined across every month. The average aside, the majority of stores have experienced a sales decline. October and November have experienced the sharpest decreases, while other months declines were milder. This feeds in to theory of a few well performing stores driving the total sales in figure 1a upwards.

Task 2d: Demographics and monthly sales

Demographic impact on monthly sales could be very different from weekly, especially with wealth indicators such as income, household value and education. Since the customers wealth might have a bigger influence on their monthly consumer pattern, a previously theorized in 2b.

sales_change <- sales_change %>% # removing na, makeing me ablo to run regression
  filter(is.finite(change_percent))

regression_month <- sales_change %>% #doing more or less the same as i did in 3a, but now with onths instead of weekdays 
  left_join(select(filtered, store, educ, income, hvalmean, hsizeavg), by = "store")

Warning in left_join(., select(filtered, store, educ, income, hvalmean, : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

model1 <- lm(change_percent ~ hvalmean + educ + income + hsizeavg , data = regression_month)
summary(model1)


Call:
lm(formula = change_percent ~ hvalmean + educ + income + hsizeavg, 
    data = regression_month)

Residuals:
    Min      1Q  Median      3Q     Max 
-68.078 -15.705  -2.337  11.488 275.852 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  57.360535   1.454423   39.44   <2e-16 ***
hvalmean      0.489072   0.001469  332.88   <2e-16 ***
educ        -78.941510   0.582741 -135.47   <2e-16 ***
income      -13.908413   0.164923  -84.33   <2e-16 ***
hsizeavg     10.512446   0.143368   73.33   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24 on 693977 degrees of freedom
Multiple R-squared:  0.2232,    Adjusted R-squared:  0.2232 
F-statistic: 4.985e+04 on 4 and 693977 DF,  p-value: < 2.2e-16

model_educ <- lm(change_percent ~  educ, data = regression_month)
summary(model_educ)


Call:
lm(formula = change_percent ~ educ, data = regression_month)

Residuals:
    Min      1Q  Median      3Q     Max 
-94.111 -16.860  -3.548  12.893 275.023 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -23.88716    0.06944  -344.0   <2e-16 ***
educ         75.35543    0.28554   263.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.95 on 693980 degrees of freedom
Multiple R-squared:  0.09121,   Adjusted R-squared:  0.0912 
F-statistic: 6.965e+04 on 1 and 693980 DF,  p-value: < 2.2e-16

model_income <- lm(change_percent ~  income, data = regression_month)
summary(model_income)


Call:
lm(formula = change_percent ~ income, data = regression_month)

Residuals:
    Min      1Q  Median      3Q     Max 
-91.123 -18.201  -5.219  14.254 292.206 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -188.3046     1.1709  -160.8   <2e-16 ***
income        17.0564     0.1104   154.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.77 on 693980 degrees of freedom
Multiple R-squared:  0.03324,   Adjusted R-squared:  0.03324 
F-statistic: 2.386e+04 on 1 and 693980 DF,  p-value: < 2.2e-16

model_hval_2 <- lm(change_percent ~  hvalmean, data = regression_month)
summary(model_hval_2)


Call:
lm(formula = change_percent ~ hvalmean, data = regression_month)

Residuals:
   Min     1Q Median     3Q    Max 
-79.73 -15.69  -2.16  11.81 270.92 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.468e+01  9.760e-02  -457.8   <2e-16 ***
hvalmean     2.540e-01  6.359e-04   399.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.55 on 693980 degrees of freedom
Multiple R-squared:  0.187, Adjusted R-squared:  0.187 
F-statistic: 1.596e+05 on 1 and 693980 DF,  p-value: < 2.2e-16

On the monthly basis the impact of wealth and household size increase slightly compared to the weekday sales, with all demographic variables influencing 22% of monthly sales changes form 1990 to 1996. Initially it seems as if income and education has a significant negative affect on sales, but it is only when they are taking each other and household value into account. When they are running on their own, the variables have a positive influence on sales. Since household value has the largest influence on sales growth, any other variable that takes into account similar measures as the average household value are going to be negative to prevent inflating and twisting the results. Household value in general influences 18% of the change on sales on a monthly basis, which is a significant increase from 1c where it was 12%. Therefore wealth does play a bigger role in more long term consumer behavior. Average household size does not seem to to have the same connection to wealth as other aforementioned variables, this pared with it having a very insignificant influence on sales has rendered it outside the scope of consideration for the rest of this report.

Task 3a: Yearly sales

#Summarising the sales of all stores for each year. Then i i take the average for every store for each year.
average_aggregate_sales <- monthly_sales %>%  
  group_by(year, store) %>%
  summarise(store_total = sum(total_sales), .groups = "drop") %>% 
  group_by(year) %>%
  summarise(avg_sales = mean(store_total), .groups = "drop")

# Now find highest and lowest year
highest_year <- average_aggregate_sales %>%
  filter(avg_sales == max(avg_sales))

lowest_year <- average_aggregate_sales %>%
  filter(avg_sales == min(avg_sales))

print(average_aggregate_sales)

# A tibble: 7 × 2
   year avg_sales
  <dbl>     <dbl>
1  1990  8313093.
2  1991  8560136.
3  1992  8629876.
4  1993  8597120.
5  1994  8305773.
6  1995  8723914.
7  1996  8916987.

print(highest_year)

# A tibble: 1 × 2
   year avg_sales
  <dbl>     <dbl>
1  1996  8916987.

print(lowest_year)

# A tibble: 1 × 2
   year avg_sales
  <dbl>     <dbl>
1  1994  8305773.

average_aggregate_sales %>% 
  ggplot(aes(x = year, y = avg_sales / 10**6))+
  geom_line()+
  labs(title = "Yearly average aggregate sales", x = "Years (1990-1996)", y = "Average aggrregate sales(millions)")+
  theme_minimal()

From 1990 to 1996, the year with the lowest sales is 1994 with a very sharp decline for the previous year. From 1994 to 1996 there was a sharp increase in average sales, much larger the the decline of the previous years with the peak being in 1996. Therefore despite i dip in 1994, average aggregate sales have been on the increase in this time period.

Task 3b: Yearly percentage change in sales

ggplot(sales_change, aes(x = change_percent)) + #Have allready made the data necessary to make the graph. For the "mean" line i don't bother with creating a new dataset, just do the mean in the plot, same results anyway:)
  geom_histogram(binwidth = 10, fill = "black", color = "white") +
  geom_vline(xintercept = mean(sales_change$change_percent, na.rm = TRUE),
           color = "red", linetype = "dashed")+
  labs(title = "Store-level % Change in Total Sales (1990–1996)",
       x = "% Change", y = "Number of Stores")

Figure 3b which shows the percentage change of all stores compared to each other from 1990 to 1996. The graph further indicates what the data has shown previously, mainly that the average change in sales is on a decline with the majority of stores experiencing a negative change in growth with some outliners having a very big decline in sales. On the opposite end we have some stores that have experienced an increase in sales far larger than any single store has declined, therefore likely contributing to the rising average aggregate sales from figure 3a. Note the outliers in this graph are almost unnoticeable

Task 3c: Best and worst performing stores

Since there seems to be lot of outliers influencing the total sales of the chain, it might be beneficial to look at the best and worst performing stores and how they have developed.

store_growth <- sales_change %>%
  group_by(store) %>%
  summarise(change_percent = mean(change_percent, na.rm = TRUE), .groups = "drop")
#arranging the sales into the top 20 best and worst performin by arranging the i the desired order and then slicing the "top" of the list
top <- store_growth %>% arrange(desc(change_percent)) %>% slice_head(n = 40)
bottom <- store_growth %>% arrange(change_percent) %>% slice_head(n = 40)
top_extreme <- store_growth %>% arrange(desc(change_percent)) %>% slice_head(n = 10)
bottom_extreme <- store_growth%>% arrange(change_percent) %>% slice_head(n = 10)
print(top)

# A tibble: 40 × 2
   store change_percent
   <dbl>          <dbl>
 1   137           87.2
 2   307           63.0
 3   304           61.1
 4    62           44.8
 5   306           44.1
 6   126           43.5
 7    86           33.5
 8   309           31.0
 9   305           31.0
10   129           23.6
# ℹ 30 more rows

print(bottom)

# A tibble: 40 × 2
   store change_percent
   <dbl>          <dbl>
 1    92          -59.0
 2   105          -53.5
 3    48          -48.6
 4    40          -43.5
 5   106          -40.4
 6    89          -38.9
 7   110          -37.8
 8    80          -37.6
 9    73          -36.8
10   103          -36.2
# ℹ 30 more rows

print(top_extreme)

# A tibble: 10 × 2
   store change_percent
   <dbl>          <dbl>
 1   137           87.2
 2   307           63.0
 3   304           61.1
 4    62           44.8
 5   306           44.1
 6   126           43.5
 7    86           33.5
 8   309           31.0
 9   305           31.0
10   129           23.6

print(bottom_extreme)

# A tibble: 10 × 2
   store change_percent
   <dbl>          <dbl>
 1    92          -59.0
 2   105          -53.5
 3    48          -48.6
 4    40          -43.5
 5   106          -40.4
 6    89          -38.9
 7   110          -37.8
 8    80          -37.6
 9    73          -36.8
10   103          -36.2

# Making 1990 the baseline for mye yearly sales 
baseline_1990 <- merged %>%
 mutate(date = as.Date(date)) %>%
 filter(year(date) == 1990) %>%
 mutate(total_sales = rowSums(select(., grocery ), na.rm = TRUE)) %>%
 group_by(store) %>%
 summarise(base_avg = mean(total_sales, na.rm = TRUE), .groups = "drop")
# Merging the rows of the top and bottom together
selected_stores <- bind_rows(top, bottom) %>% pull(store)
# Uising a ifelse to group the stores in their correct spaces, aswell as creating percentage change and the correct month format
daily_percent_data <- merged %>%
 mutate(date = as.Date(date)) %>%
 filter(store %in% selected_stores) %>%
 mutate(
 total_sales = rowSums(select(., grocery), na.rm = TRUE),
 year = year(date),
 month = floor_date(date, "month")
 ) %>%
 filter(year >= 1990, year <= 1996) %>%
 left_join(baseline_1990, by = "store") %>%
 mutate(
 pct_change = 100 * (total_sales - base_avg) / base_avg,
 group = ifelse(store %in% top$store, "Top 40", "Bottom 40")
 )
# 4
monthly_avg_percent <- daily_percent_data %>%
 group_by(month, group) %>%
 summarise(avg_pct_change = mean(pct_change, na.rm = TRUE), .groups = "drop")

ggplot(monthly_avg_percent, aes(x = month, y = avg_pct_change, color = group)) +
 geom_line(size = 1.2) +
 geom_hline(yintercept = 0, linetype = "dashed") +
 labs(
 title = "Monthly % Change in Sales (Top vs Bottom 40 Stores, 1990–1996)",
 x = "Month",
 y = "Average % Change from 1990 Baseline",
 color = "Store Group"
 ) +
 theme_minimal()

The change in sales for the 40 best performing stores have predictably had a very positive sales development. Simultaneously the 40 least performing stores have faced an equally sharp recession in sales. This again points toward the best performing stores being the ones driving the increase in sales from.

Task 3d: Regressionanalysis of the 40 best performing stores

Why the top 40 best and worst performing stores could have such a unequal development, would become clearer if demographics wealth disparity is taken into account.

top_growth_data <- monthly_sales %>%
  filter(store %in% top$store) %>%
  #Turning monthly sales into a data set for the top 40 stores by year
  group_by(store, year) %>%
  summarise(avg_sales = mean(total_sales, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = avg_sales, names_prefix = "sales_") %>%
  filter(!is.na(sales_1990), !is.na(sales_1996), sales_1990 > 0, sales_1996 > 0)   %>%
  mutate(change_percent = 100 * (sales_1996 - sales_1990) / sales_1990)
#Making new top and bottom data without a basline
bottom_growth_data <- monthly_sales %>%
  filter(store %in% bottom$store) %>% 
  group_by(store, year) %>%
  summarise(avg_sales = mean(total_sales, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = avg_sales, names_prefix = "sales_") %>%
  filter(!is.na(sales_1990), !is.na(sales_1996), sales_1990 > 0, sales_1996 > 0)   %>%
  mutate(change_percent = 100 * (sales_1996 - sales_1990) / sales_1990)

top_growth_data <- top_growth_data %>%
  left_join(
    monthly_sales %>% 
      select(store, income, educ, hsizeavg, hvalmean) %>%
      distinct(store, .keep_all = TRUE), 
    by = "store")

bottom_growth_data <- bottom_growth_data %>%
  left_join(
    monthly_sales %>% 
      select(store, income, educ, hsizeavg, hvalmean) %>%
      distinct(store, .keep_all = TRUE), 
    by = "store")

#doing regression analysis
growth_model_top <- lm(change_percent ~ income  + educ + hvalmean, data = top_growth_data)
summary(growth_model_top)


Call:
lm(formula = change_percent ~ income + educ + hvalmean, data = top_growth_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.189 -11.366  -6.388   7.966  55.821 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  43.99039  127.92694   0.344    0.733
income       -4.07590   12.47154  -0.327    0.746
educ         -5.85923   57.96973  -0.101    0.920
hvalmean      0.09732    0.14065   0.692    0.493

Residual standard error: 19.91 on 36 degrees of freedom
Multiple R-squared:  0.02809,   Adjusted R-squared:  -0.05291 
F-statistic: 0.3468 on 3 and 36 DF,  p-value: 0.7917

growth_model_bottom <- lm(change_percent ~ income + educ + hvalmean, data = bottom_growth_data)
summary(growth_model_bottom)


Call:
lm(formula = change_percent ~ income + educ + hvalmean, data = bottom_growth_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.903  -7.156   2.321   6.385  17.880 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  14.88909   94.15381   0.158   0.8752  
income       -5.42888    9.25635  -0.587   0.5612  
educ        -57.96038   37.91630  -1.529   0.1351  
hvalmean      0.21004    0.09367   2.242   0.0312 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.09 on 36 degrees of freedom
Multiple R-squared:  0.1286,    Adjusted R-squared:  0.05595 
F-statistic: 1.771 on 3 and 36 DF,  p-value: 0.1702

For the top 40 best performing stores it average household value, income and education, seems influence a very insignificant 2.8% of sales change, the sales. The effect is more significant on the 40 worst performing stores, but the p-value of the data is very low indicating very scattered and unreliable data to draw confident conclusions. The wealth does not seem to be a factor in why the best performing stores are doing so well, but there could be a certain wealth threshold that the stores need to reach to have a reduced decline in sales. To definitively prove if average household value has any effect on sales change, it would be advantageous to compare the sales development of the top 40 stores with the highest and lowest average household value customers.

# Selceting only hvalmean
store_hval <- filtered %>% 
  select(store, hvalmean) %>% 
  distinct()

# attaching it to sales change
sales_with_hval <- sales_change %>%
  left_join(store_hval, by = "store")

# dividing it into highest and lowest groups by the value of their hvalmean
store_groups <- sales_with_hval %>%
  group_by(store) %>%
  summarise(hvalmean = mean(hvalmean, na.rm = TRUE), .groups = "drop") %>%
  arrange(hvalmean) %>%
  mutate(group = case_when(
    row_number() <= 40 ~ "Low hvalmean",
    row_number() > n() - 40 ~ "High hvalmean",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(group))

# 3. Merging the data because there ere some things sales_change lacked
sales_grouped <- sales_with_hval %>%
  left_join(store_groups, by = "store")


 #Join merging the years togheter for plotting
sales_long <- sales_grouped %>%
  pivot_longer(cols = starts_with("sales_"),
               names_to = "year",
               names_prefix = "sales_",
               values_to = "sales") %>%
  mutate(
    year = as.integer(year),
    date = dmy(paste("01", month, year))
  ) %>%
  filter(!is.na(group), year >= 1990, year <= 1996)
# Finding average sales per group
monthly_avg <- sales_long %>%
  group_by(date, group) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE), .groups = "drop")
# Creatin a basline just like previously
baseline_sales <- sales_long %>%
  filter(date == ymd("1990-01-01")) %>%
  group_by(group) %>%
  summarise(baseline = mean(sales, na.rm = TRUE), .groups = "drop")
#infign the percentage change in sales for each group
monthly_pct_change <- sales_long %>%
  filter(!is.na(group)) %>%
  group_by(group, date) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE), .groups = "drop") %>%
  left_join(baseline_sales, by = "group") %>%
  mutate(pct_change = 100 * (avg_sales - baseline) / baseline)

ggplot(monthly_pct_change, aes(x = date, y = pct_change, color = group)) +
  geom_line(size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Monthly % Sales Change Relative to Jan 1990",
    subtitle = "Comparing High vs Low average household value Store Groups",
    x = "Date",
    y = "Percentage Change (%)",
    color = "Store Group"
  ) +
  theme_minimal()

From 1992 and onward the store group with highest average household value experienced higher growth in sales as opposed to the group the the lowest value, by 1996 the disparity between the groups reaches almost 10%. With this and all the analysis made previously, it is proven that wealth indicators such as household value, income and education has maybe not the biggest, but a significant positive correlation with sales growth.

Task 4: What can this data be used for

There are various finding in the data and analyses that could be useful for expanding future operations. First and foremost the sales growth data indicates that the majority of stores are facing a decline. A significant factor that impact the development of sales is wealth in the form om household value. With this information in mind it could be worth it for the company to evaluate the profitability of some of their lowest performing stores and maybe consider downsizing. When it comes to expansion of future operations it would be preferable to expand to wealthier areas with higher income, household value and education.

There is also data of the daily and monthly sales which brings insight into weekly and yearly consumer behavior. Which could be used for planning sales and staffing to optimize revenue and operations.

Appendix:

Chat GPT has has been used in this task to aid with writing / correcting code, analyzing data sets, help interpreting tasks and giving feedback on code and writing. Full documentation of the all chat logs used to make this assignment can be found in a word file uploaded together with this file.