1. Introduction

The goal of this project is to measure the impact of each form of advertising (Facebook, YouTube, newspaper) on sales. In particular, I want to find the answer to the following questions:

  1. Is there a relationship between sales and budget?
  2. How strong is this relationship?
  3. Which medium contribute to sales the most?
  4. How large is the effect of each medium on sales?
  5. Is the relationship linear?
  6. Is there synergy among the advertising media?

2. Setting up the environment

library(readr) #read csv file
library(dplyr) #data wrangling & manipulation
library(Hmisc) #describe data
library(ggplot2) #data visualization
library(datarium) #data used in project
library(caret) #splitting training and testing data
data("marketing", package = "datarium")
marketing_plan <- marketing
describe(marketing_plan)
## marketing_plan 
## 
##  4  Variables      200  Observations
## --------------------------------------------------------------------------------
## youtube 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      200        0      190        1    176.5      119    15.83    29.86 
##      .25      .50      .75      .90      .95 
##    89.25   179.70   262.59   313.73   336.88 
## 
## lowest :   0.84   4.92   6.48   8.76   9.36, highest: 347.64 348.84 351.48 352.32 355.68
## --------------------------------------------------------------------------------
## facebook 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      200        0      167        1    27.92    20.59    2.394    4.080 
##      .25      .50      .75      .90      .95 
##   11.970   27.480   43.830   52.224   56.172 
## 
## lowest :  0.00  0.36  0.48  0.96  1.56, highest: 57.36 58.68 58.80 59.28 59.52
## --------------------------------------------------------------------------------
## newspaper 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      200        0      172        1    36.66    28.97    4.320    7.188 
##      .25      .50      .75      .90      .95 
##   15.300   30.900   54.120   70.884   86.190 
## 
## lowest :   0.36   1.08   1.20   2.04   2.16, highest:  95.04 101.76 107.28 121.08 136.80
## --------------------------------------------------------------------------------
## sales 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      200        0      121        1    16.83    7.088    7.920    9.552 
##      .25      .50      .75      .90      .95 
##   12.450   15.480   20.880   26.052   28.560 
## 
## lowest :  1.92  3.84  5.76  6.36  6.60, highest: 29.64 30.48 30.60 31.44 32.40
## --------------------------------------------------------------------------------

From the data above, it can be observed that there are 200 observations in total. The variables in this data set include: 1. YouTube 2. Facebook 3. Newspaper 4. Sales

3. Exploratory data analysis

marketing_plan %>%
  ggplot(aes(x = youtube, y = sales)) +
  geom_point() +
  labs(x = "Spending on YouTube ads",
       y = "Sales",
       title = "Graph 1: Relationship between YouTube ads and sales") +
  stat_smooth(se = FALSE) +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))

marketing_plan %>%
  ggplot(aes(x = facebook, y = sales)) +
  geom_point() +
  labs(x = "Spending on Facebook ads",
       y = "Sales",
       title = "Graph 2: Relationship between Facebook ads and sales") +
  stat_smooth(se = FALSE) +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))

marketing_plan %>%
  ggplot(aes(x = newspaper, y = sales)) +
  geom_point() +
  labs(x = "Spending on newspaper",
       y = "Sales",
       title = "Graph 3: Relationship between newspaper and sales") +
  stat_smooth(se = FALSE) +
  theme(panel.background = element_rect(fill = "white", colour = "grey50"))

Graph 1 shows that there is a positive relationship between YouTube ads and sales.

Similarly, graph 2 shows a positive relationship between Facebook ads and sales. However, the higher the spending on Facebook, the greater the dispersion of data in sales.

Graph 3 does not show any notable relationship between newspaper and sales.

4. Training and splitting data

set.seed(1)
train_indices <- createDataPartition(y = marketing[["sales"]],
                                     p = 0.8,
                                     list = FALSE)

train_listings <- marketing[train_indices,]
test_listings <- marketing[-train_indices,]

5. Modelling

Next, I’m going to run a regression using all three types of ads on sales outcome. The null hypothesis is that there is no relationship between ads (Youtube, Facebook, newspaper) and sales. The alternative hypothesis is that there is some relationship.

model_0 <- lm(sales ~ youtube + facebook + newspaper,
              data = train_listings)
summary(model_0)
## 
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = train_listings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4615  -1.0974   0.2926   1.3863   3.5313 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.296118   0.427672   7.707 1.34e-12 ***
## youtube     0.045860   0.001573  29.155  < 2e-16 ***
## facebook    0.189562   0.009513  19.927  < 2e-16 ***
## newspaper   0.003724   0.006854   0.543    0.588    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.057 on 158 degrees of freedom
## Multiple R-squared:  0.8967, Adjusted R-squared:  0.8947 
## F-statistic: 457.2 on 3 and 158 DF,  p-value: < 2.2e-16

We can see that the p-values for youtube and facebook is extremely small, which means that we reject the null hypothesis that YouTube and Facebook do not impact sales. On the other hand, the p-value for newspaper is greater than 0.05, which means it’s not a significant value. We fail to reject the null hypothesis that there is any significant relationship between newspaper ads and sales. I will create a second model to exclude the variable newspaper.

model_1 <- lm(sales ~ youtube + facebook,
              data = train_listings)
summary(model_1)
## 
## Call:
## lm(formula = sales ~ youtube + facebook, data = train_listings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5772  -1.0562   0.3144   1.4042   3.5083 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.379590   0.398243   8.486 1.41e-14 ***
## youtube     0.045907   0.001567  29.294  < 2e-16 ***
## facebook    0.191058   0.009086  21.029  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.052 on 159 degrees of freedom
## Multiple R-squared:  0.8965, Adjusted R-squared:  0.8952 
## F-statistic: 688.6 on 2 and 159 DF,  p-value: < 2.2e-16
## [1] "The p-value corresponding to the F-statistic for model_0 is: 1.23656333149779e-77"
## [1] "The value for model_1 is: 4.8806133315084e-79"

From the two models, we can see that the p-value for model_1 that excludes newspaper is smaller, showing clear evidence of the impact of YouTube and Facebook ads on sales.

Next, I’m going to try quadratic and polynomial fits of the model to see if model performance improves.

model_2 <- lm(sales ~ facebook + I(facebook^2) + youtube + I(youtube^2),
              data = train_listings)

summary(model_2)
## 
## Call:
## lm(formula = sales ~ facebook + I(facebook^2) + youtube + I(youtube^2), 
##     data = train_listings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8778 -0.9270  0.0296  1.2108  3.8213 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.095e+00  5.309e-01   3.947 0.000119 ***
## facebook       1.232e-01  3.145e-02   3.917 0.000134 ***
## I(facebook^2)  1.232e-03  5.293e-04   2.328 0.021210 *  
## youtube        7.966e-02  5.381e-03  14.805  < 2e-16 ***
## I(youtube^2)  -9.883e-05  1.521e-05  -6.499 1.02e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.811 on 157 degrees of freedom
## Multiple R-squared:  0.9204, Adjusted R-squared:  0.9184 
## F-statistic: 453.7 on 4 and 157 DF,  p-value: < 2.2e-16

Given that the quadratic variable facebook^2 is not significant. I am going to exclude this. At the same time, I will try a polynomial function on youtube up to the 5th order.

model_3 <- lm(sales ~ facebook + poly(youtube, 5),
              data = train_listings)

summary(model_3)
## 
## Call:
## lm(formula = sales ~ facebook + poly(youtube, 5), data = train_listings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0532 -0.9194  0.0635  0.8878  4.5350 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        11.321174   0.253239  44.705  < 2e-16 ***
## facebook            0.196345   0.007659  25.634  < 2e-16 ***
## poly(youtube, 5)1  60.149936   1.721951  34.931  < 2e-16 ***
## poly(youtube, 5)2 -11.740689   1.721672  -6.819 1.93e-10 ***
## poly(youtube, 5)3   7.093487   1.724334   4.114 6.30e-05 ***
## poly(youtube, 5)4  -4.284675   1.719567  -2.492   0.0138 *  
## poly(youtube, 5)5   2.558213   1.721365   1.486   0.1393    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.719 on 155 degrees of freedom
## Multiple R-squared:  0.9292, Adjusted R-squared:  0.9265 
## F-statistic: 339.1 on 6 and 155 DF,  p-value: < 2.2e-16

The p-value for youtube is insignificant after the 4th order. I will exclude the 4th and 5th order where p-values are greater than 0.001 in the following model.

Additionally, in multiple regression, it is assumed that the effect of one predictor (facebook) has no influence on the other predictor (youtube). If this is not true, then there is a synergy effect between the two predictors. We can include an interaction term facebook*youtube to see if this assumption holds

model_4 <- lm(sales ~ facebook + poly(youtube, 3) + facebook*youtube,
              data = train_listings)

summary(model_4)
## 
## Call:
## lm(formula = sales ~ facebook + poly(youtube, 3) + facebook * 
##     youtube, data = train_listings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9015 -0.2672  0.0348  0.2944  1.4054 
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.127e+01  9.303e-02 121.194  < 2e-16 ***
## facebook           4.024e-02  5.576e-03   7.216 2.19e-11 ***
## poly(youtube, 3)1  2.713e+01  1.199e+00  22.637  < 2e-16 ***
## poly(youtube, 3)2 -1.145e+01  6.332e-01 -18.079  < 2e-16 ***
## poly(youtube, 3)3  5.716e+00  6.356e-01   8.993 7.66e-16 ***
## youtube                   NA         NA      NA       NA    
## facebook:youtube   8.763e-04  2.701e-05  32.439  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6322 on 156 degrees of freedom
## Multiple R-squared:  0.9904, Adjusted R-squared:  0.9901 
## F-statistic:  3207 on 5 and 156 DF,  p-value: < 2.2e-16

The p-value associated with the interaction term is significant, showing that there is indeed an interaction effect between facebook and youtube.

6. Evaluating model performance

Now I’m going to apply model_1 and model_4 on test data to see how well they predict sales.

#add columns with prediction and residual values for model_1
test_listings <- test_listings %>%
  mutate(prediction_1 = predict(model_1, newdata = test_listings))

test_listings <- test_listings %>%
  mutate(residuals_1 = test_listings$sales - test_listings$prediction_1)

#add columns with prediction and residual values for model_4
test_listings <- test_listings %>%
  mutate(prediction_4 = predict(model_4, newdata = test_listings))
## Warning in predict.lm(model_4, newdata = test_listings): prediction from a rank-
## deficient fit may be misleading
test_listings <- test_listings %>%
  mutate(residuals_4 = test_listings$sales - test_listings$prediction_4)
attach(test_listings)
par(mfrow = c(1, 2))
plot(x = prediction_1, y = sales, main = "graph 4: linear fit")
abline(0, 1, col = "red", lwd = 2)
plot(x = prediction_4, y = sales, main = "graph 5: polynomial fit")
abline(0, 1, col = "red", lwd = 2)

The red line is a 1:1 ratio between predicted vs. observed values, showing the outcome if the model were to predict sales perfectly. model_1 aligns relatively well to the red line, but we can see that model_4 is actually clustered much closer around the red line than model_1. In order to compare their differences more closely, I will graph the the residuals plot (showing difference between predicted and observed).

par(mfrow = c(1, 2)) #plot area into 1 row 2 columns
plot(predict(model_1), residuals(model_1), main = "graph 6: linear fit", ylim = c(-12, 3))
lines(smooth.spline(predict(model_1), residuals(model_1)), col = "red", lwd = 2)
abline(0, 0, lty = 3, lwd = 2)

plot(predict(model_4), residuals(model_4), main = "graph 7: polynomial fit", ylim = c(-12, 3))
lines(smooth.spline(predict(model_4), residuals(model_4)), col = "red", lwd = 2)
abline(0, 0, lty = 3, lwd = 2)

Both graphs show the difference between predicted and observed values. The closer the observations are to the black dashed line at zero, the closer the predicted value is to actual value. Graph 6 shows a pattern in the residuals, which is often what happens when a linear method is used to model data that shows non-linearity. On the other hand, this pattern disappears in graph 7 that includes more flexibility in the model with a polynomial fit.

To quantify the extent the polynomial fit model_4 is superior to the linear fit model_1, we can use the Analysis of Variable Table below:

anova(model_1, model_4)

This is a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. The F-statistic is ~506.5 and the associated p-value is close to zero. This provides evidence that the model containing extra variables youtube^2 and youtube^3 is indeed superior to the simple model. This is not surprising considering the residual plot shows non-linearity in the relationship between advertising and sales. The testing RSS is also lower for model_4 because of the model’s higher flexibility. By being more flexible, it proves that it performs more closely to the real non-linear relationship between advertising and sales.

7. Conclusion

I’m going to use model_4 to answer the questions set out in the introduction of this project.

summary(model_4)
## 
## Call:
## lm(formula = sales ~ facebook + poly(youtube, 3) + facebook * 
##     youtube, data = train_listings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9015 -0.2672  0.0348  0.2944  1.4054 
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.127e+01  9.303e-02 121.194  < 2e-16 ***
## facebook           4.024e-02  5.576e-03   7.216 2.19e-11 ***
## poly(youtube, 3)1  2.713e+01  1.199e+00  22.637  < 2e-16 ***
## poly(youtube, 3)2 -1.145e+01  6.332e-01 -18.079  < 2e-16 ***
## poly(youtube, 3)3  5.716e+00  6.356e-01   8.993 7.66e-16 ***
## youtube                   NA         NA      NA       NA    
## facebook:youtube   8.763e-04  2.701e-05  32.439  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6322 on 156 degrees of freedom
## Multiple R-squared:  0.9904, Adjusted R-squared:  0.9901 
## F-statistic:  3207 on 5 and 156 DF,  p-value: < 2.2e-16

1. Is there a relationship between advertising and sales?

pf(3207, 5, 156, lower.tail = FALSE)
## [1] 2.877416e-155

The p-value corresponding to the F-statistic is 2.87*10^-155. This indicates that there is virtually zero chance that the significance in the relationship is due to chance. Hence, there is clear evidence of a relationship between Facebook and YouTube ads on sales. However, as discussed above, there is no evidence that newspaper advertising affect sales.

2. How strong is this relationship?

RSE <- summary(model_4)$sigma
mean <- mean(marketing_plan$sales)
percentage_error <- (RSE / mean)*100
percentage_error
## [1] 3.756985

The percentage error in this model is approximately 3.76%. The R squared value is ~0.99, indicating that 99% of the variability in sales can be explained by YouTube and Facebook ads. This means the relationship is quite strong.

3. Which medium contribute to sales the most?

As we have examined from the p-values associated with each predictor, the p-values for YouTube and Facebook are low, while the p-value for newspaper is not. Hence, only YouTube and Facebook contribute to sales.

4. How large is the effect of each medium on sales?

To assess each medium’s individual effect on sales, it can be helpful to perform 3 separate simple linear regressions:

facebook <- lm(sales ~ facebook, data = marketing_plan)
youtube <- lm(sales ~ youtube, data = marketing_plan)
newspaper <- lm(sales ~ newspaper, data = marketing_plan)
summary(facebook)$coefficients[2]
## [1] 0.2024958

The coefficient for Facebook is 0.2. This means that on average, for every extra dollar spent on Facebook, sales increase by $0.20.

summary(youtube)$coefficients[2]
## [1] 0.04753664

On the other hand, if we look at YouTube, the coefficient is ~0.05. This means that on average, every extra dollar spent on YouTube increases sales by $0.05. The actual increase in sales depends on the amount of spending already invested, since it is not a linear relationship.

summary(newspaper)
## 
## Call:
## lm(formula = sales ~ newspaper, data = marketing_plan)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.473  -4.065  -1.007   4.207  15.330 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.82169    0.74570   19.88  < 2e-16 ***
## newspaper    0.05469    0.01658    3.30  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.111 on 198 degrees of freedom
## Multiple R-squared:  0.05212,    Adjusted R-squared:  0.04733 
## F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

Lastly, the coefficient for newspaper is also 0.05. However, we saw from the multiple regression that the relationship between newspaper and sales is insignificant. This is because when there is higher spending on YouTube and Facebook, there tends to be higher spending on newspaper too. So in a simple regression model, newspaper takes into account the increase in sales that is really the result of Facebook and YouTube ads, hence the signficant result summarized above. Once YouTube and Facebook ads are fixed in a multiple regression model, we see that newspaper actually does not affect sales.

6. Is the relationship linear?

Using residual plots, we have identified that the relationship is indeed not linear.

7. Is there synergy among the advertising media?

By including an interaction term facebook*youtube, we discovered that there is a synergy effect in the data, which is controlled for in model_4.