PYT_data <- read_excel("/Users/shenzhuhan/Desktop/PinkYellowTail.xlsx")
head(PYT_data)
## # A tibble: 6 × 8
## Sales Promotion Advertising Index q1 q2 q3 q4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 674. 20.8 40.1 100 1 0 0 0
## 2 543. 29.7 48.1 102 0 1 0 0
## 3 532. 0 60.1 104 0 0 1 0
## 4 785. 0 76.1 104 0 0 0 1
## 5 800. 0 52.1 104 1 0 0 0
## 6 940. 42.5 28.0 100 0 1 0 0
summary(PYT_data)
## Sales Promotion Advertising Index q1
## Min. :304.2 Min. : 0.00 Min. : 0.00 Min. : 96.0 Min. :0.00
## 1st Qu.:495.7 1st Qu.: 0.00 1st Qu.:11.02 1st Qu.:102.8 1st Qu.:0.00
## Median :560.9 Median :16.83 Median :40.07 Median :105.0 Median :0.00
## Mean :608.4 Mean :21.30 Mean :35.56 Mean :105.5 Mean :0.25
## 3rd Qu.:788.8 3rd Qu.:32.86 3rd Qu.:52.09 3rd Qu.:108.5 3rd Qu.:0.25
## Max. :939.8 Max. :66.92 Max. :76.14 Max. :114.0 Max. :1.00
## q2 q3 q4
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :0.00 Median :0.00
## Mean :0.25 Mean :0.25 Mean :0.25
## 3rd Qu.:0.25 3rd Qu.:0.25 3rd Qu.:0.25
## Max. :1.00 Max. :1.00 Max. :1.00
As a preliminary model, we will incorporate all the variables to analyze their impact on sales. A linear regression model is built using the lm function with the formula Sales ~., which means that Sales is the dependent variable and all other variables in the PYT_data data set are independent variables.
##
## Call:
## lm(formula = Sales ~ ., data = PYT_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -187.59 -66.51 -14.08 61.88 255.28
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1116.093 542.447 2.058 0.055311 .
## Promotion 5.417 1.190 4.551 0.000283 ***
## Advertising 3.588 1.163 3.086 0.006706 **
## Index -7.227 5.164 -1.400 0.179607
## q1 74.448 70.268 1.059 0.304199
## q2 -39.506 73.450 -0.538 0.597640
## q3 13.475 71.401 0.189 0.852543
## q4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 119.9 on 17 degrees of freedom
## Multiple R-squared: 0.657, Adjusted R-squared: 0.5359
## F-statistic: 5.426 on 6 and 17 DF, p-value: 0.002697
## `geom_smooth()` using formula = 'y ~ x'
Upon building the linear model, we notice several variables are statistically insignificant (p-value greater than 0.05). The adjusted R-squared value is 0.5359, indicating it may be relatively low, while the residual standard error is 119.9, which could be considered high. When examining a visual representation of the results, it appears that the model does not accurately predict the actual outcomes.
model1 %>%
autoplot(which = 1:3) + theme_bw()
A new model is built with only the significant variables from Model 1 (in this case, Promotion and Advertising). The lm function is used again with the formula Sales ~ Promotion + Advertising.
model2 <- lm(Sales~ Promotion + Advertising, data = PYT_data)
summary(model2)
##
## Call:
## lm(formula = Sales ~ Promotion + Advertising, data = PYT_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -252.774 -50.139 0.426 40.532 250.613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 402.775 50.738 7.938 9.32e-08 ***
## Promotion 4.841 1.138 4.255 0.000353 ***
## Advertising 2.883 1.090 2.645 0.015155 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 122 on 21 degrees of freedom
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.5196
## F-statistic: 13.44 on 2 and 21 DF, p-value: 0.0001748
# Plot the Regression Result for model 2
ggplot(PYT_data, aes(x = Sales, y = predict(model2))) +
geom_point(color = "black") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Regression Analysis Results - Model 2") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Although all variables in the model are now significant, the adjusted R squared has now decreased, suggesting that the explanatory power of the model is now lower. Thus, we will try other models.
model2 %>%
autoplot(which = 1:3) + theme_bw()
To adjust and optimize the model, feature engineering techniques were employed. These techniques were guided by some instructions from Session 4 and 5.
Lagging variables were introduced to account for the time delay in
the effects of promotion and advertising on sales. In real-world
scenarios, the impact of promotional activities and advertising
campaigns may not be immediate. For example, a promotion in one quarter
might have an effect on sales in the following quarter. To capture this
lag effect, the mutate function was used to create
Promotion_lag and Advertising_lag variables.
The lag function was applied within the mutate
to shift the values of Promotion and
Advertising columns by one period. This allows the model to
consider the historical values of these variables and better understand
their relationship with sales over time.
The economic index provided in the data set was adjusted to make it
more useful for the model. Only having the raw index numbers was not
sufficient as it did not account for changes over time. A new column
called Index_change was created using the
mutate function. This column shows the differences in the
economic index from one period to the next. By calculating these
differences, the model can better capture the impact of economic changes
on sales. For instance, an increase or decrease in the economic index
between two consecutive quarters might have a different effect on sales
compared to just looking at the absolute values of the index.
The quarterly data was also manipulated to create a more meaningful
variable for the model. The quarters were initially considered as dummy
variables, but to simplify the model and avoid multicollinearity issues,
only three of the quarter variables were used in a way that one was set
as the baseline. Additionally, an attempt was made to engineer a feature
called Season. The quarters were classified into seasons
based on the months they represent. For example, Q1 (January - March)
was classified as a “cold” season, Q2 (April - June) was classified as a
“hot” season, Q3 (July - September) was classified as a “hot” season,
and Q4 (October - December) was classified as a “cold” season. This
Season variable was created using the
case_when function within the mutate
operation. By classifying the quarters into seasons, the model can
potentially capture any seasonal patterns in sales that might be related
to factors such as consumer behavior, weather, or holidays during
different times of the year.
# Create a column that shows the differences in economic index
PYT_data <- PYT_data %>%
mutate(Index_change = Index - lag(Index))
head(PYT_data)
## # A tibble: 6 × 9
## Sales Promotion Advertising Index q1 q2 q3 q4 Index_change
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 674. 20.8 40.1 100 1 0 0 0 NA
## 2 543. 29.7 48.1 102 0 1 0 0 2
## 3 532. 0 60.1 104 0 0 1 0 2
## 4 785. 0 76.1 104 0 0 0 1 0
## 5 800. 0 52.1 104 1 0 0 0 0
## 6 940. 42.5 28.0 100 0 1 0 0 -4
# Add the Promotion lag to a new variable, since promotion effect has a lag
PYT_data <- PYT_data %>%
mutate(Promotion_lag = lag(Promotion, n = 1))
head(PYT_data)
## # A tibble: 6 × 10
## Sales Promotion Advertising Index q1 q2 q3 q4 Index_change
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 674. 20.8 40.1 100 1 0 0 0 NA
## 2 543. 29.7 48.1 102 0 1 0 0 2
## 3 532. 0 60.1 104 0 0 1 0 2
## 4 785. 0 76.1 104 0 0 0 1 0
## 5 800. 0 52.1 104 1 0 0 0 0
## 6 940. 42.5 28.0 100 0 1 0 0 -4
## # ℹ 1 more variable: Promotion_lag <dbl>
# Add the Advertising lag to a new variable,since advertising effect has a lag
PYT_data <- PYT_data %>%
mutate(Advertising_lag = lag(Advertising, n = 1))
head(PYT_data)
## # A tibble: 6 × 11
## Sales Promotion Advertising Index q1 q2 q3 q4 Index_change
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 674. 20.8 40.1 100 1 0 0 0 NA
## 2 543. 29.7 48.1 102 0 1 0 0 2
## 3 532. 0 60.1 104 0 0 1 0 2
## 4 785. 0 76.1 104 0 0 0 1 0
## 5 800. 0 52.1 104 1 0 0 0 0
## 6 940. 42.5 28.0 100 0 1 0 0 -4
## # ℹ 2 more variables: Promotion_lag <dbl>, Advertising_lag <dbl>
# Engineer one feature called season (e.g., hot/cold):q1 and q4 are classified as cold seasons, and q3 and q2 are classified as hot seasons.
PYT_data <- PYT_data %>%
mutate(Season = case_when(
q1 == 1 ~ "cold", # Q1 (January - March) is winter
q2 == 1 ~ "hot", # Q2 (April - June) is spring/summer
q3 == 1 ~ "hot", # Q3 (July - September) is summer
q4 == 1 ~ "cold" # Q4 (October - December) is winter
))
head(PYT_data)
## # A tibble: 6 × 12
## Sales Promotion Advertising Index q1 q2 q3 q4 Index_change
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 674. 20.8 40.1 100 1 0 0 0 NA
## 2 543. 29.7 48.1 102 0 1 0 0 2
## 3 532. 0 60.1 104 0 0 1 0 2
## 4 785. 0 76.1 104 0 0 0 1 0
## 5 800. 0 52.1 104 1 0 0 0 0
## 6 940. 42.5 28.0 100 0 1 0 0 -4
## # ℹ 3 more variables: Promotion_lag <dbl>, Advertising_lag <dbl>, Season <chr>
model3 <- lm(Sales ~ Promotion + Promotion_lag + Advertising + Advertising_lag + Season + Index_change, data = PYT_data)
summary(model3)
##
## Call:
## lm(formula = Sales ~ Promotion + Promotion_lag + Advertising +
## Advertising_lag + Season + Index_change, data = PYT_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.630 -28.625 2.715 42.141 118.900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 410.1245 31.7906 12.901 7.16e-10 ***
## Promotion 6.2082 0.7106 8.737 1.74e-07 ***
## Promotion_lag -3.7229 0.6675 -5.577 4.17e-05 ***
## Advertising 2.5271 0.5878 4.299 0.000551 ***
## Advertising_lag 2.8774 0.6130 4.694 0.000244 ***
## Seasonhot -69.0633 28.7109 -2.405 0.028606 *
## Index_change -16.8592 6.2155 -2.712 0.015375 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.79 on 16 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.911, Adjusted R-squared: 0.8776
## F-statistic: 27.28 on 6 and 16 DF, p-value: 1.509e-07
# VIF < 5: Generally indicates no multicollinearity.
vif_values <- vif(model3)
model3 %>%
autoplot(which = 1:3) + theme_bw()
\[Sales = 410.1245 + 6.2082 \times Promotion -3.7229 \times Promotion_lag + 2.5271 \times Advertising + 2.8774 \times Advertising_lag - 69.0633 \times Seasonhot - 16.8592 \times Index_change\] # Questions 1 - 5:
# Get correlational coefficients for promotion and advertising respectively
PYT <- na.omit(PYT_data) #omit all NAs
(corr_promotion <- cor(PYT$Sales, PYT$Promotion)) # correlation coefficient between the Sales and Promotion
## [1] 0.6467924
(corr_promotion_lag <- cor(PYT$Sales, PYT$Promotion_lag)) # correlation coefficient between the Sales and Promotion_lag
## [1] -0.08224609
(corr_advertising <- cor(PYT$Sales, PYT$Advertising)) # correlation coefficient between the Sales and Advertisement
## [1] 0.4263245
(corr_advertising <- cor(PYT$Sales, PYT$Advertising_lag)) # correlation coefficient between the Sales and Advertisement_lag
## [1] 0.5655463
# Get regression coefficients for promotion and advertising respectively
(coef(model3)[2]) # regression coefficient for Promotion
## Promotion
## 6.208237
(coef(model3)[3]) # regression coefficient for Promotion_lag
## Promotion_lag
## -3.722858
(coef(model3)[4]) # regression coefficient for Advertising
## Advertising
## 2.527082
(coef(model3)[5]) # regression coefficient for Advertising_lag
## Advertising_lag
## 2.877414
# Correlation analysis between sales and Index/Index_change
(corr_promotion <- cor(PYT$Sales, PYT$Index)) # correlation coefficient between the Sales and Index
## [1] 0.01527617
(corr_promotion_lag <- cor(PYT$Sales, PYT$Index_change)) # correlation coefficient between the Sales and Index_change
## [1] -0.1158283
# Regression analysis with Index and Index_change as independent variables
model4 <- lm(Sales ~ Index + Index_change, data = PYT_data)
summary(model4)
##
## Call:
## lm(formula = Sales ~ Index + Index_change, data = PYT_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -310.18 -113.11 -48.22 184.02 296.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 404.222 864.499 0.468 0.645
## Index 1.964 8.196 0.240 0.813
## Index_change -10.590 18.564 -0.570 0.575
##
## Residual standard error: 186.7 on 20 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.01624, Adjusted R-squared: -0.08213
## F-statistic: 0.1651 on 2 and 20 DF, p-value: 0.849
(coef(model4)[2])
## Index
## 1.964279
(coef(model4)[3])
## Index_change
## -10.59049
# Identify quarters where both promotion and advertising expenditures are non-zero
PYT_data$Policy_Violation <- ifelse(PYT_data$Promotion > 0 & PYT_data$Advertising > 0, 1, 0)
# Count the number of quarters where both PROMOTION and ADVERTISING are non-zero
num_violations <- sum(PYT_data$Policy_Violation)
violation_quarters <- which(PYT_data$Policy_Violation == 1)
# Display the results
cat("Number of quarters with policy violations:", num_violations, "\n")
## Number of quarters with policy violations: 15
cat("Policy violations occurred in the following quarters:", violation_quarters, "\n")
## Policy violations occurred in the following quarters: 1 2 6 7 8 10 11 12 13 14 18 21 22 23 24
# Visualization - Time-series plot showing Promotion and Advertising expenditures library(ggplot2)
# Create the time-series plot
ggplot(data = PYT_data, aes(x = 1:nrow(PYT_data))) +
geom_line(aes(y = Promotion, color = "Promotion"), size = 1) +
geom_line(aes(y = Advertising, color = "Advertising"), size = 1) +
labs(
title = "Time-Series Plot of Promotion and Advertising Expenditures",
x = "Quarter",
y = "Expenditure (thousands of dollars)"
) +
scale_color_manual(values = c("Promotion" = "blue", "Advertising" = "red")) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Check if either Advertising or Promotion has 0, but not both
(result <- with(PYT_data, (Advertising == 0 | Promotion == 0) & !(Advertising == 0 & Promotion == 0)))
## [1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [13] FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
# Engineer one feature called season (e.g., hot/cold):q1 and q4 are classified as cold seasons, and q3 and q2 are classified as hot seasons.
PYT_data <- PYT_data %>%
mutate(Season = case_when(
q1 == 1 ~ "cold", # Q1 (January - March) is winter
q2 == 1 ~ "hot", # Q2 (April - June) is spring/summer
q3 == 1 ~ "hot", # Q3 (July - September) is summer
q4 == 1 ~ "cold" # Q4 (October - December) is winter
))
# Run regression only using the Season variables and Sales
model5 <- lm(Sales ~ Season, data = PYT_data)
summary(model5)
##
## Call:
## lm(formula = Sales ~ Season, data = PYT_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -321.75 -113.50 -41.96 183.45 313.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 590.87 51.69 11.43 1.01e-10 ***
## Seasonhot 35.09 73.11 0.48 0.636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 179.1 on 22 degrees of freedom
## Multiple R-squared: 0.01036, Adjusted R-squared: -0.03462
## F-statistic: 0.2304 on 1 and 22 DF, p-value: 0.636