Packages Downland

Data overview

PYT_data <- read_excel("/Users/shenzhuhan/Desktop/PinkYellowTail.xlsx")
head(PYT_data)
## # A tibble: 6 × 8
##   Sales Promotion Advertising Index    q1    q2    q3    q4
##   <dbl>     <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  674.      20.8        40.1   100     1     0     0     0
## 2  543.      29.7        48.1   102     0     1     0     0
## 3  532.       0          60.1   104     0     0     1     0
## 4  785.       0          76.1   104     0     0     0     1
## 5  800.       0          52.1   104     1     0     0     0
## 6  940.      42.5        28.0   100     0     1     0     0
summary(PYT_data)
##      Sales         Promotion      Advertising        Index             q1      
##  Min.   :304.2   Min.   : 0.00   Min.   : 0.00   Min.   : 96.0   Min.   :0.00  
##  1st Qu.:495.7   1st Qu.: 0.00   1st Qu.:11.02   1st Qu.:102.8   1st Qu.:0.00  
##  Median :560.9   Median :16.83   Median :40.07   Median :105.0   Median :0.00  
##  Mean   :608.4   Mean   :21.30   Mean   :35.56   Mean   :105.5   Mean   :0.25  
##  3rd Qu.:788.8   3rd Qu.:32.86   3rd Qu.:52.09   3rd Qu.:108.5   3rd Qu.:0.25  
##  Max.   :939.8   Max.   :66.92   Max.   :76.14   Max.   :114.0   Max.   :1.00  
##        q2             q3             q4      
##  Min.   :0.00   Min.   :0.00   Min.   :0.00  
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.00  
##  Median :0.00   Median :0.00   Median :0.00  
##  Mean   :0.25   Mean   :0.25   Mean   :0.25  
##  3rd Qu.:0.25   3rd Qu.:0.25   3rd Qu.:0.25  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00

Model 1: Model with all variables

Building Model 1

As a preliminary model, we will incorporate all the variables to analyze their impact on sales. A linear regression model is built using the lm function with the formula Sales ~., which means that Sales is the dependent variable and all other variables in the PYT_data data set are independent variables.

## 
## Call:
## lm(formula = Sales ~ ., data = PYT_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -187.59  -66.51  -14.08   61.88  255.28 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1116.093    542.447   2.058 0.055311 .  
## Promotion      5.417      1.190   4.551 0.000283 ***
## Advertising    3.588      1.163   3.086 0.006706 ** 
## Index         -7.227      5.164  -1.400 0.179607    
## q1            74.448     70.268   1.059 0.304199    
## q2           -39.506     73.450  -0.538 0.597640    
## q3            13.475     71.401   0.189 0.852543    
## q4                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 119.9 on 17 degrees of freedom
## Multiple R-squared:  0.657,  Adjusted R-squared:  0.5359 
## F-statistic: 5.426 on 6 and 17 DF,  p-value: 0.002697
## `geom_smooth()` using formula = 'y ~ x'

Model 1 Evaluation

Upon building the linear model, we notice several variables are statistically insignificant (p-value greater than 0.05). The adjusted R-squared value is 0.5359, indicating it may be relatively low, while the residual standard error is 119.9, which could be considered high. When examining a visual representation of the results, it appears that the model does not accurately predict the actual outcomes.

model1 %>%
  autoplot(which = 1:3) + theme_bw()

Model 2: Drop all insignificant independent variables from Model 1

Building Model 2

A new model is built with only the significant variables from Model 1 (in this case, Promotion and Advertising). The lm function is used again with the formula Sales ~ Promotion + Advertising.

model2 <- lm(Sales~ Promotion + Advertising, data = PYT_data)
summary(model2)
## 
## Call:
## lm(formula = Sales ~ Promotion + Advertising, data = PYT_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -252.774  -50.139    0.426   40.532  250.613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  402.775     50.738   7.938 9.32e-08 ***
## Promotion      4.841      1.138   4.255 0.000353 ***
## Advertising    2.883      1.090   2.645 0.015155 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122 on 21 degrees of freedom
## Multiple R-squared:  0.5613, Adjusted R-squared:  0.5196 
## F-statistic: 13.44 on 2 and 21 DF,  p-value: 0.0001748
# Plot the Regression Result for model 2
ggplot(PYT_data, aes(x = Sales, y = predict(model2))) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Regression Analysis Results - Model 2") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Model 2 Evaluation

Although all variables in the model are now significant, the adjusted R squared has now decreased, suggesting that the explanatory power of the model is now lower. Thus, we will try other models.

model2 %>%
  autoplot(which = 1:3) + theme_bw()

Model 3: Manipulate the original variables based on their real-life influence

Adjusting Original Variables

To adjust and optimize the model, feature engineering techniques were employed. These techniques were guided by some instructions from Session 4 and 5.

Lagging Variables

Lagging variables were introduced to account for the time delay in the effects of promotion and advertising on sales. In real-world scenarios, the impact of promotional activities and advertising campaigns may not be immediate. For example, a promotion in one quarter might have an effect on sales in the following quarter. To capture this lag effect, the mutate function was used to create Promotion_lag and Advertising_lag variables. The lag function was applied within the mutate to shift the values of Promotion and Advertising columns by one period. This allows the model to consider the historical values of these variables and better understand their relationship with sales over time.

Adjusting the Economic Index

The economic index provided in the data set was adjusted to make it more useful for the model. Only having the raw index numbers was not sufficient as it did not account for changes over time. A new column called Index_change was created using the mutate function. This column shows the differences in the economic index from one period to the next. By calculating these differences, the model can better capture the impact of economic changes on sales. For instance, an increase or decrease in the economic index between two consecutive quarters might have a different effect on sales compared to just looking at the absolute values of the index.

Quarters and Season Classification

The quarterly data was also manipulated to create a more meaningful variable for the model. The quarters were initially considered as dummy variables, but to simplify the model and avoid multicollinearity issues, only three of the quarter variables were used in a way that one was set as the baseline. Additionally, an attempt was made to engineer a feature called Season. The quarters were classified into seasons based on the months they represent. For example, Q1 (January - March) was classified as a “cold” season, Q2 (April - June) was classified as a “hot” season, Q3 (July - September) was classified as a “hot” season, and Q4 (October - December) was classified as a “cold” season. This Season variable was created using the case_when function within the mutate operation. By classifying the quarters into seasons, the model can potentially capture any seasonal patterns in sales that might be related to factors such as consumer behavior, weather, or holidays during different times of the year.

# Create a column that shows the differences in economic index
PYT_data <- PYT_data %>%
  mutate(Index_change = Index - lag(Index))
head(PYT_data)
## # A tibble: 6 × 9
##   Sales Promotion Advertising Index    q1    q2    q3    q4 Index_change
##   <dbl>     <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1  674.      20.8        40.1   100     1     0     0     0           NA
## 2  543.      29.7        48.1   102     0     1     0     0            2
## 3  532.       0          60.1   104     0     0     1     0            2
## 4  785.       0          76.1   104     0     0     0     1            0
## 5  800.       0          52.1   104     1     0     0     0            0
## 6  940.      42.5        28.0   100     0     1     0     0           -4
# Add the Promotion lag to a new variable, since promotion effect has a lag
PYT_data <- PYT_data %>%
  mutate(Promotion_lag = lag(Promotion, n = 1))
head(PYT_data)
## # A tibble: 6 × 10
##   Sales Promotion Advertising Index    q1    q2    q3    q4 Index_change
##   <dbl>     <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1  674.      20.8        40.1   100     1     0     0     0           NA
## 2  543.      29.7        48.1   102     0     1     0     0            2
## 3  532.       0          60.1   104     0     0     1     0            2
## 4  785.       0          76.1   104     0     0     0     1            0
## 5  800.       0          52.1   104     1     0     0     0            0
## 6  940.      42.5        28.0   100     0     1     0     0           -4
## # ℹ 1 more variable: Promotion_lag <dbl>
# Add the Advertising lag to a new variable,since advertising effect has a lag
PYT_data <- PYT_data %>%
  mutate(Advertising_lag = lag(Advertising, n = 1))
head(PYT_data)
## # A tibble: 6 × 11
##   Sales Promotion Advertising Index    q1    q2    q3    q4 Index_change
##   <dbl>     <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1  674.      20.8        40.1   100     1     0     0     0           NA
## 2  543.      29.7        48.1   102     0     1     0     0            2
## 3  532.       0          60.1   104     0     0     1     0            2
## 4  785.       0          76.1   104     0     0     0     1            0
## 5  800.       0          52.1   104     1     0     0     0            0
## 6  940.      42.5        28.0   100     0     1     0     0           -4
## # ℹ 2 more variables: Promotion_lag <dbl>, Advertising_lag <dbl>
# Engineer one feature called season (e.g., hot/cold):q1 and q4 are classified as cold seasons, and q3 and q2 are classified as hot seasons.
PYT_data <- PYT_data %>%
  mutate(Season = case_when(
    q1 == 1 ~ "cold",  # Q1 (January - March) is winter
    q2 == 1 ~ "hot",   # Q2 (April - June) is spring/summer
    q3 == 1 ~ "hot",   # Q3 (July - September) is summer
    q4 == 1 ~ "cold"   # Q4 (October - December) is winter
  ))

head(PYT_data)
## # A tibble: 6 × 12
##   Sales Promotion Advertising Index    q1    q2    q3    q4 Index_change
##   <dbl>     <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>        <dbl>
## 1  674.      20.8        40.1   100     1     0     0     0           NA
## 2  543.      29.7        48.1   102     0     1     0     0            2
## 3  532.       0          60.1   104     0     0     1     0            2
## 4  785.       0          76.1   104     0     0     0     1            0
## 5  800.       0          52.1   104     1     0     0     0            0
## 6  940.      42.5        28.0   100     0     1     0     0           -4
## # ℹ 3 more variables: Promotion_lag <dbl>, Advertising_lag <dbl>, Season <chr>

Building Model 3

model3 <- lm(Sales ~ Promotion + Promotion_lag + Advertising + Advertising_lag + Season + Index_change, data = PYT_data)
summary(model3)
## 
## Call:
## lm(formula = Sales ~ Promotion + Promotion_lag + Advertising + 
##     Advertising_lag + Season + Index_change, data = PYT_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.630  -28.625    2.715   42.141  118.900 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     410.1245    31.7906  12.901 7.16e-10 ***
## Promotion         6.2082     0.7106   8.737 1.74e-07 ***
## Promotion_lag    -3.7229     0.6675  -5.577 4.17e-05 ***
## Advertising       2.5271     0.5878   4.299 0.000551 ***
## Advertising_lag   2.8774     0.6130   4.694 0.000244 ***
## Seasonhot       -69.0633    28.7109  -2.405 0.028606 *  
## Index_change    -16.8592     6.2155  -2.712 0.015375 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.79 on 16 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.911,  Adjusted R-squared:  0.8776 
## F-statistic: 27.28 on 6 and 16 DF,  p-value: 1.509e-07

Model 3 Evaluation

  • The summary of this model shows that all variables are significant. The adjusted R-squared value is 0.8776, which is much higher than the previous models, indicating a better fit.
  • The residual standard error is 62.79, which is lower than the previous models, indicating more accurate predictions.
  • Visualizations such as residual plots are used to further assess the model’s performance. The variance inflation factor (VIF) is also calculated to check for multicollinearity. A VIF value less than 5 generally indicates no multicollinearity.
# VIF < 5: Generally indicates no multicollinearity.
vif_values <- vif(model3)
model3 %>%
  autoplot(which = 1:3) + theme_bw()

Final Equation of the Model

\[Sales = 410.1245 + 6.2082 \times Promotion -3.7229 \times Promotion_lag + 2.5271 \times Advertising + 2.8774 \times Advertising_lag - 69.0633 \times Seasonhot - 16.8592 \times Index_change\] # Questions 1 - 5:

Question 1:

# Get correlational coefficients for promotion and advertising respectively
PYT <- na.omit(PYT_data) #omit all NAs
(corr_promotion <- cor(PYT$Sales, PYT$Promotion)) # correlation coefficient between the Sales and Promotion
## [1] 0.6467924
(corr_promotion_lag <- cor(PYT$Sales, PYT$Promotion_lag)) # correlation coefficient between the Sales and Promotion_lag
## [1] -0.08224609
(corr_advertising <- cor(PYT$Sales, PYT$Advertising)) # correlation coefficient between the Sales and Advertisement
## [1] 0.4263245
(corr_advertising <- cor(PYT$Sales, PYT$Advertising_lag)) # correlation coefficient between the Sales and Advertisement_lag
## [1] 0.5655463
# Get regression coefficients for promotion and advertising respectively
(coef(model3)[2]) # regression coefficient for Promotion
## Promotion 
##  6.208237
(coef(model3)[3]) # regression coefficient for Promotion_lag
## Promotion_lag 
##     -3.722858
(coef(model3)[4]) # regression coefficient for Advertising
## Advertising 
##    2.527082
(coef(model3)[5]) # regression coefficient for Advertising_lag
## Advertising_lag 
##        2.877414

Question 2:

# Correlation analysis between sales and Index/Index_change
(corr_promotion <- cor(PYT$Sales, PYT$Index)) # correlation coefficient between the Sales and Index
## [1] 0.01527617
(corr_promotion_lag <- cor(PYT$Sales, PYT$Index_change)) # correlation coefficient between the Sales and Index_change
## [1] -0.1158283
# Regression analysis with Index and Index_change as independent variables
model4 <- lm(Sales ~ Index + Index_change, data = PYT_data)
summary(model4)
## 
## Call:
## lm(formula = Sales ~ Index + Index_change, data = PYT_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -310.18 -113.11  -48.22  184.02  296.82 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   404.222    864.499   0.468    0.645
## Index           1.964      8.196   0.240    0.813
## Index_change  -10.590     18.564  -0.570    0.575
## 
## Residual standard error: 186.7 on 20 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01624,    Adjusted R-squared:  -0.08213 
## F-statistic: 0.1651 on 2 and 20 DF,  p-value: 0.849
(coef(model4)[2])
##    Index 
## 1.964279
(coef(model4)[3])
## Index_change 
##    -10.59049

Question 3:

# Identify quarters where both promotion and advertising expenditures are non-zero 
PYT_data$Policy_Violation <- ifelse(PYT_data$Promotion > 0 & PYT_data$Advertising > 0, 1, 0)
# Count the number of quarters where both PROMOTION and ADVERTISING are non-zero
num_violations <- sum(PYT_data$Policy_Violation)
violation_quarters <- which(PYT_data$Policy_Violation == 1)
# Display the results
cat("Number of quarters with policy violations:", num_violations, "\n")
## Number of quarters with policy violations: 15
cat("Policy violations occurred in the following quarters:", violation_quarters, "\n")
## Policy violations occurred in the following quarters: 1 2 6 7 8 10 11 12 13 14 18 21 22 23 24
# Visualization - Time-series plot showing Promotion and Advertising expenditures library(ggplot2)
# Create the time-series plot 
ggplot(data = PYT_data, aes(x = 1:nrow(PYT_data))) +
  geom_line(aes(y = Promotion, color = "Promotion"), size = 1) +
  geom_line(aes(y = Advertising, color = "Advertising"), size = 1) +
  labs(
    title = "Time-Series Plot of Promotion and Advertising Expenditures",
    x = "Quarter",
    y = "Expenditure (thousands of dollars)"
  ) +
  scale_color_manual(values = c("Promotion" = "blue", "Advertising" = "red")) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Check if either Advertising or Promotion has 0, but not both 
(result <- with(PYT_data, (Advertising == 0 | Promotion == 0) & !(Advertising == 0 & Promotion == 0)))
##  [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

Question 4:

# Engineer one feature called season (e.g., hot/cold):q1 and q4 are classified as cold seasons, and q3 and q2 are classified as hot seasons.
PYT_data <- PYT_data %>%
  mutate(Season = case_when(
    q1 == 1 ~ "cold",  # Q1 (January - March) is winter
    q2 == 1 ~ "hot",   # Q2 (April - June) is spring/summer
    q3 == 1 ~ "hot",   # Q3 (July - September) is summer
    q4 == 1 ~ "cold"   # Q4 (October - December) is winter
  ))

# Run regression only using the Season variables and Sales
model5 <- lm(Sales ~ Season, data = PYT_data)
summary(model5)
## 
## Call:
## lm(formula = Sales ~ Season, data = PYT_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -321.75 -113.50  -41.96  183.45  313.86 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   590.87      51.69   11.43 1.01e-10 ***
## Seasonhot      35.09      73.11    0.48    0.636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 179.1 on 22 degrees of freedom
## Multiple R-squared:  0.01036,    Adjusted R-squared:  -0.03462 
## F-statistic: 0.2304 on 1 and 22 DF,  p-value: 0.636