The goal of this project is to measure the impact of each form of advertising (Facebook, YouTube, newspaper) on sales. In particular, I want to find the answer to the following questions:
library(readr) #read csv file
library(dplyr) #data wrangling & manipulation
library(Hmisc) #describe data
library(ggplot2) #data visualization
library(datarium) #data used in project
library(caret) #splitting training and testing data
data("marketing", package = "datarium")
marketing_plan <- marketing
describe(marketing_plan)
## marketing_plan
##
## 4 Variables 200 Observations
## --------------------------------------------------------------------------------
## youtube
## n missing distinct Info Mean Gmd .05 .10
## 200 0 190 1 176.5 119 15.83 29.86
## .25 .50 .75 .90 .95
## 89.25 179.70 262.59 313.73 336.88
##
## lowest : 0.84 4.92 6.48 8.76 9.36, highest: 347.64 348.84 351.48 352.32 355.68
## --------------------------------------------------------------------------------
## facebook
## n missing distinct Info Mean Gmd .05 .10
## 200 0 167 1 27.92 20.59 2.394 4.080
## .25 .50 .75 .90 .95
## 11.970 27.480 43.830 52.224 56.172
##
## lowest : 0.00 0.36 0.48 0.96 1.56, highest: 57.36 58.68 58.80 59.28 59.52
## --------------------------------------------------------------------------------
## newspaper
## n missing distinct Info Mean Gmd .05 .10
## 200 0 172 1 36.66 28.97 4.320 7.188
## .25 .50 .75 .90 .95
## 15.300 30.900 54.120 70.884 86.190
##
## lowest : 0.36 1.08 1.20 2.04 2.16, highest: 95.04 101.76 107.28 121.08 136.80
## --------------------------------------------------------------------------------
## sales
## n missing distinct Info Mean Gmd .05 .10
## 200 0 121 1 16.83 7.088 7.920 9.552
## .25 .50 .75 .90 .95
## 12.450 15.480 20.880 26.052 28.560
##
## lowest : 1.92 3.84 5.76 6.36 6.60, highest: 29.64 30.48 30.60 31.44 32.40
## --------------------------------------------------------------------------------
From the data above, it can be observed that there are 200 observations in total. The variables in this data set include: 1. YouTube 2. Facebook 3. Newspaper 4. Sales
marketing_plan %>%
ggplot(aes(x = youtube, y = sales)) +
geom_point() +
labs(x = "Spending on YouTube ads",
y = "Sales",
title = "Graph 1: Relationship between YouTube ads and sales") +
stat_smooth(se = FALSE) +
theme(panel.background = element_rect(fill = "white", colour = "grey50"))
marketing_plan %>%
ggplot(aes(x = facebook, y = sales)) +
geom_point() +
labs(x = "Spending on Facebook ads",
y = "Sales",
title = "Graph 2: Relationship between Facebook ads and sales") +
stat_smooth(se = FALSE) +
theme(panel.background = element_rect(fill = "white", colour = "grey50"))
marketing_plan %>%
ggplot(aes(x = newspaper, y = sales)) +
geom_point() +
labs(x = "Spending on newspaper",
y = "Sales",
title = "Graph 3: Relationship between newspaper and sales") +
stat_smooth(se = FALSE) +
theme(panel.background = element_rect(fill = "white", colour = "grey50"))
Graph 1 shows that there is a positive relationship between YouTube ads and sales.
Similarly, graph 2 shows a positive relationship between Facebook ads and sales. However, the higher the spending on Facebook, the greater the dispersion of data in sales.
Graph 3 does not show any notable relationship between newspaper and sales.
set.seed(1)
train_indices <- createDataPartition(y = marketing[["sales"]],
p = 0.8,
list = FALSE)
train_listings <- marketing[train_indices,]
test_listings <- marketing[-train_indices,]
Next, I’m going to run a regression using all three types of ads on sales outcome. The null hypothesis is that there is no relationship between ads (Youtube, Facebook, newspaper) and sales. The alternative hypothesis is that there is some relationship.
model_0 <- lm(sales ~ youtube + facebook + newspaper,
data = train_listings)
summary(model_0)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = train_listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4615 -1.0974 0.2926 1.3863 3.5313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.296118 0.427672 7.707 1.34e-12 ***
## youtube 0.045860 0.001573 29.155 < 2e-16 ***
## facebook 0.189562 0.009513 19.927 < 2e-16 ***
## newspaper 0.003724 0.006854 0.543 0.588
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.057 on 158 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8947
## F-statistic: 457.2 on 3 and 158 DF, p-value: < 2.2e-16
We can see that the p-values for youtube and facebook is extremely small, which means that we reject the null hypothesis that YouTube and Facebook do not impact sales. On the other hand, the p-value for newspaper is greater than 0.05, which means it’s not a significant value. We fail to reject the null hypothesis that there is any significant relationship between newspaper ads and sales. I will create a second model to exclude the variable newspaper.
model_1 <- lm(sales ~ youtube + facebook,
data = train_listings)
summary(model_1)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train_listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5772 -1.0562 0.3144 1.4042 3.5083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.379590 0.398243 8.486 1.41e-14 ***
## youtube 0.045907 0.001567 29.294 < 2e-16 ***
## facebook 0.191058 0.009086 21.029 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.052 on 159 degrees of freedom
## Multiple R-squared: 0.8965, Adjusted R-squared: 0.8952
## F-statistic: 688.6 on 2 and 159 DF, p-value: < 2.2e-16
## [1] "The p-value corresponding to the F-statistic for model_0 is: 1.23656333149779e-77"
## [1] "The value for model_1 is: 4.8806133315084e-79"
From the two models, we can see that the p-value for model_1 that excludes newspaper is smaller, showing clear evidence of the impact of YouTube and Facebook ads on sales.
Next, I’m going to try quadratic and polynomial fits of the model to see if model performance improves.
model_2 <- lm(sales ~ facebook + I(facebook^2) + youtube + I(youtube^2),
data = train_listings)
summary(model_2)
##
## Call:
## lm(formula = sales ~ facebook + I(facebook^2) + youtube + I(youtube^2),
## data = train_listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8778 -0.9270 0.0296 1.2108 3.8213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.095e+00 5.309e-01 3.947 0.000119 ***
## facebook 1.232e-01 3.145e-02 3.917 0.000134 ***
## I(facebook^2) 1.232e-03 5.293e-04 2.328 0.021210 *
## youtube 7.966e-02 5.381e-03 14.805 < 2e-16 ***
## I(youtube^2) -9.883e-05 1.521e-05 -6.499 1.02e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.811 on 157 degrees of freedom
## Multiple R-squared: 0.9204, Adjusted R-squared: 0.9184
## F-statistic: 453.7 on 4 and 157 DF, p-value: < 2.2e-16
Given that the quadratic variable facebook^2 is not significant. I am going to exclude this. At the same time, I will try a polynomial function on youtube up to the 5th order.
model_3 <- lm(sales ~ facebook + poly(youtube, 5),
data = train_listings)
summary(model_3)
##
## Call:
## lm(formula = sales ~ facebook + poly(youtube, 5), data = train_listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0532 -0.9194 0.0635 0.8878 4.5350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.321174 0.253239 44.705 < 2e-16 ***
## facebook 0.196345 0.007659 25.634 < 2e-16 ***
## poly(youtube, 5)1 60.149936 1.721951 34.931 < 2e-16 ***
## poly(youtube, 5)2 -11.740689 1.721672 -6.819 1.93e-10 ***
## poly(youtube, 5)3 7.093487 1.724334 4.114 6.30e-05 ***
## poly(youtube, 5)4 -4.284675 1.719567 -2.492 0.0138 *
## poly(youtube, 5)5 2.558213 1.721365 1.486 0.1393
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.719 on 155 degrees of freedom
## Multiple R-squared: 0.9292, Adjusted R-squared: 0.9265
## F-statistic: 339.1 on 6 and 155 DF, p-value: < 2.2e-16
The p-value for youtube is insignificant after the 4th order. I will exclude the 4th and 5th order where p-values are greater than 0.001 in the following model.
Additionally, in multiple regression, it is assumed that the effect of one predictor (facebook) has no influence on the other predictor (youtube). If this is not true, then there is a synergy effect between the two predictors. We can include an interaction term facebook*youtube to see if this assumption holds
model_4 <- lm(sales ~ facebook + poly(youtube, 3) + facebook*youtube,
data = train_listings)
summary(model_4)
##
## Call:
## lm(formula = sales ~ facebook + poly(youtube, 3) + facebook *
## youtube, data = train_listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9015 -0.2672 0.0348 0.2944 1.4054
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.127e+01 9.303e-02 121.194 < 2e-16 ***
## facebook 4.024e-02 5.576e-03 7.216 2.19e-11 ***
## poly(youtube, 3)1 2.713e+01 1.199e+00 22.637 < 2e-16 ***
## poly(youtube, 3)2 -1.145e+01 6.332e-01 -18.079 < 2e-16 ***
## poly(youtube, 3)3 5.716e+00 6.356e-01 8.993 7.66e-16 ***
## youtube NA NA NA NA
## facebook:youtube 8.763e-04 2.701e-05 32.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6322 on 156 degrees of freedom
## Multiple R-squared: 0.9904, Adjusted R-squared: 0.9901
## F-statistic: 3207 on 5 and 156 DF, p-value: < 2.2e-16
The p-value associated with the interaction term is significant, showing that there is indeed an interaction effect between facebook and youtube.
Now I’m going to apply model_1 and model_4 on test data to see how well they predict sales.
#add columns with prediction and residual values for model_1
test_listings <- test_listings %>%
mutate(prediction_1 = predict(model_1, newdata = test_listings))
test_listings <- test_listings %>%
mutate(residuals_1 = test_listings$sales - test_listings$prediction_1)
#add columns with prediction and residual values for model_4
test_listings <- test_listings %>%
mutate(prediction_4 = predict(model_4, newdata = test_listings))
## Warning in predict.lm(model_4, newdata = test_listings): prediction from a rank-
## deficient fit may be misleading
test_listings <- test_listings %>%
mutate(residuals_4 = test_listings$sales - test_listings$prediction_4)
attach(test_listings)
par(mfrow = c(1, 2))
plot(x = prediction_1, y = sales, main = "graph 4: linear fit")
abline(0, 1, col = "red", lwd = 2)
plot(x = prediction_4, y = sales, main = "graph 5: polynomial fit")
abline(0, 1, col = "red", lwd = 2)
The red line is a 1:1 ratio between predicted vs. observed values, showing the outcome if the model were to predict sales perfectly. model_1 aligns relatively well to the red line, but we can see that model_4 is actually clustered much closer around the red line than model_1. In order to compare their differences more closely, I will graph the the residuals plot (showing difference between predicted and observed).
par(mfrow = c(1, 2)) #plot area into 1 row 2 columns
plot(predict(model_1), residuals(model_1), main = "graph 6: linear fit", ylim = c(-12, 3))
lines(smooth.spline(predict(model_1), residuals(model_1)), col = "red", lwd = 2)
abline(0, 0, lty = 3, lwd = 2)
plot(predict(model_4), residuals(model_4), main = "graph 7: polynomial fit", ylim = c(-12, 3))
lines(smooth.spline(predict(model_4), residuals(model_4)), col = "red", lwd = 2)
abline(0, 0, lty = 3, lwd = 2)
Both graphs show the difference between predicted and observed values. The closer the observations are to the black dashed line at zero, the closer the predicted value is to actual value. Graph 6 shows a pattern in the residuals, which is often what happens when a linear method is used to model data that shows non-linearity. On the other hand, this pattern disappears in graph 7 that includes more flexibility in the model with a polynomial fit.
To quantify the extent the polynomial fit model_4 is superior to the linear fit model_1, we can use the Analysis of Variable Table below:
anova(model_1, model_4)
This is a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. The F-statistic is ~506.5 and the associated p-value is close to zero. This provides evidence that the model containing extra variables youtube^2 and youtube^3 is indeed superior to the simple model. This is not surprising considering the residual plot shows non-linearity in the relationship between advertising and sales. The testing RSS is also lower for model_4 because of the model’s higher flexibility. By being more flexible, it proves that it performs more closely to the real non-linear relationship between advertising and sales.
I’m going to use model_4 to answer the questions set out in the introduction of this project.
summary(model_4)
##
## Call:
## lm(formula = sales ~ facebook + poly(youtube, 3) + facebook *
## youtube, data = train_listings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9015 -0.2672 0.0348 0.2944 1.4054
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.127e+01 9.303e-02 121.194 < 2e-16 ***
## facebook 4.024e-02 5.576e-03 7.216 2.19e-11 ***
## poly(youtube, 3)1 2.713e+01 1.199e+00 22.637 < 2e-16 ***
## poly(youtube, 3)2 -1.145e+01 6.332e-01 -18.079 < 2e-16 ***
## poly(youtube, 3)3 5.716e+00 6.356e-01 8.993 7.66e-16 ***
## youtube NA NA NA NA
## facebook:youtube 8.763e-04 2.701e-05 32.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6322 on 156 degrees of freedom
## Multiple R-squared: 0.9904, Adjusted R-squared: 0.9901
## F-statistic: 3207 on 5 and 156 DF, p-value: < 2.2e-16
pf(3207, 5, 156, lower.tail = FALSE)
## [1] 2.877416e-155
The p-value corresponding to the F-statistic is 2.87*10^-155. This indicates that there is virtually zero chance that the significance in the relationship is due to chance. Hence, there is clear evidence of a relationship between Facebook and YouTube ads on sales. However, as discussed above, there is no evidence that newspaper advertising affect sales.
RSE <- summary(model_4)$sigma
mean <- mean(marketing_plan$sales)
percentage_error <- (RSE / mean)*100
percentage_error
## [1] 3.756985
The percentage error in this model is approximately 3.76%. The R squared value is ~0.99, indicating that 99% of the variability in sales can be explained by YouTube and Facebook ads. This means the relationship is quite strong.
As we have examined from the p-values associated with each predictor, the p-values for YouTube and Facebook are low, while the p-value for newspaper is not. Hence, only YouTube and Facebook contribute to sales.
To assess each medium’s individual effect on sales, it can be helpful to perform 3 separate simple linear regressions:
facebook <- lm(sales ~ facebook, data = marketing_plan)
youtube <- lm(sales ~ youtube, data = marketing_plan)
newspaper <- lm(sales ~ newspaper, data = marketing_plan)
summary(facebook)$coefficients[2]
## [1] 0.2024958
The coefficient for Facebook is 0.2. This means that on average, for every extra dollar spent on Facebook, sales increase by $0.20.
summary(youtube)$coefficients[2]
## [1] 0.04753664
On the other hand, if we look at YouTube, the coefficient is ~0.05. This means that on average, every extra dollar spent on YouTube increases sales by $0.05. The actual increase in sales depends on the amount of spending already invested, since it is not a linear relationship.
summary(newspaper)
##
## Call:
## lm(formula = sales ~ newspaper, data = marketing_plan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.473 -4.065 -1.007 4.207 15.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.82169 0.74570 19.88 < 2e-16 ***
## newspaper 0.05469 0.01658 3.30 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.111 on 198 degrees of freedom
## Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
## F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
Lastly, the coefficient for newspaper is also 0.05. However, we saw from the multiple regression that the relationship between newspaper and sales is insignificant. This is because when there is higher spending on YouTube and Facebook, there tends to be higher spending on newspaper too. So in a simple regression model, newspaper takes into account the increase in sales that is really the result of Facebook and YouTube ads, hence the signficant result summarized above. Once YouTube and Facebook ads are fixed in a multiple regression model, we see that newspaper actually does not affect sales.
Using residual plots, we have identified that the relationship is indeed not linear.
By including an interaction term facebook*youtube, we discovered that there is a synergy effect in the data, which is controlled for in model_4.