Objective

A company wants to know how well it can predict sales from its TV, radio, and social media marketing budgets and the type of influencer used to advertise its product. This analysis explores which model predicts sales best and which variables contribute most to high sales.

Data Exploration and Visualization
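
The original setup chunk is not shown; based on the functions called throughout, the analysis assumes roughly the following packages are loaded:

library(ggplot2)    # plots
library(dplyr)      # data manipulation and the %>% pipe
library(tidyr)      # gather()
library(purrr)      # keep()
library(ggcorrplot) # correlation heatmap
library(caret)      # data splitting, model training, RMSE/R2
library(rpart)      # decision trees
library(knitr)      # kable() tables
library(pdp)        # partial dependence plots
# scales, rattle, car, and tibble are called with :: and only need to be installed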

Count of Influencer Type

Mega influencers are the most common influencer type in the dataset; however, the counts of the other types are not drastically different.

# Set an explicit order for the influencer levels
marketing_sales$Influencer = factor(marketing_sales$Influencer,
                                    levels = c("Mega", "Micro", "Nano", "Macro"))
ggplot(marketing_sales, aes(x = Influencer)) +
  geom_bar(aes(fill = Influencer)) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = 0, colour = "black") +
  labs(title = "Count of Influencer Type", x = "Influencer Type", y = "Count") +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values = c("#440154", "#31688e", "#35b779", "#fde725")) +
  theme(legend.position = "none")

Comparing Total Budget of TV, Radio, and Social Media Marketing

TV has the largest total budget, followed by radio. Social media has the smallest total budget.

# Total budget by media type (the columns record budgets, not sales)
total.budget.df =
  data.frame(Media_Type = c("TV",
                            "Radio",
                            "Social Media"),
             Total_Budget = c(round(sum(marketing_sales$TV), 0),
                              round(sum(marketing_sales$Radio), 0),
                              round(sum(marketing_sales$Social.Media), 0)))

ggplot(total.budget.df, aes(x = reorder(Media_Type, -Total_Budget), y = Total_Budget, fill = Media_Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Budget in Millions by Media Type", x = "Media Type", y = "Budget in Millions") +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_fill_manual(values = c("#21918c", "#fde725", "#440154")) +
  theme(legend.position = "none")

Distribution of Sales by Influencer Type: Violin Plot

The distribution of sales is similar across all influencer types.

ggplot(marketing_sales, aes(y = Sales, x = Influencer, fill = Influencer)) + 
  geom_violin() +
  labs(title = "Distribution of Sales by Influencer Type", y = "Sales", x = "Influencer Type") +
  scale_fill_manual(values = c("#440154", "#31688e", "#35b779", "#fde725")) +
  theme(legend.position="none")

Distribution of Sales by Influencer Type: Boxplot

ggplot(marketing_sales, aes(y = Sales, x = Influencer, fill = Influencer)) + 
  geom_boxplot() +
  labs(title = "Distribution of Sales by Influencer Type", y = "Sales", x = "Influencer Type") +
  scale_fill_manual(values = c("#440154", "#31688e", "#35b779", "#fde725")) +
  theme(legend.position="none")

Distributions of Numeric Variables

The radio and social media budgets are right-skewed.

marketing_sales %>%
  dplyr::select(c("TV", "Radio", "Social.Media")) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#3399CC") +
  labs(title = "Distribution of Numeric Variables")

Correlations

Sales and radio budget have a high positive correlation (r = .87): as radio budget increases, sales tend to increase.

TV budget and radio budget have a high positive correlation (r = .87): as TV budget increases, radio budget tends to increase.

Sales and social media budget have a moderate positive correlation (r = .53): as social media budget increases, sales tend to increase.

TV budget and social media budget have a moderate positive correlation (r = .53): as TV budget increases, social media budget tends to increase.

Radio budget and social media budget have a moderate positive correlation (r = .61): as radio budget increases, social media budget tends to increase.

cordata <- marketing_sales %>%
  keep(is.numeric)   # keep only the numeric columns

cormatrix <- cor(cordata)
round(cormatrix, 2)
ggcorrplot(cormatrix, hc.order = TRUE, outline.color = "white", lab = TRUE,
           colors = c("#52c569", "white", "#fde725"), lab_size = 2.5) +
  labs(title = "Correlation of Numeric Variables")
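
The pairs described above can also be pulled out programmatically. A small sketch using the cormatrix object from the chunk above:

# Convert the correlation matrix to long format and sort by magnitude
cor.pairs = as.data.frame(as.table(cormatrix))
cor.pairs = subset(cor.pairs, as.integer(Var1) < as.integer(Var2))  # drop self and duplicate pairs
cor.pairs[order(-abs(cor.pairs$Freq)), ]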

Splitting Data into Test and Train

set.seed(123)
# 80/20 split; createDataPartition() stratifies on the outcome
idx.train = createDataPartition(y = marketing_sales$Sales, p = .8, list = FALSE)
train = marketing_sales[idx.train, ]
test = marketing_sales[-idx.train, ]

Linear Regression

Model

The model is statistically significant (p < 2.2e-16) and explains 99.47% of the variation in the training data. TV (p < 2e-16) and radio (p = 2.63e-05) are the statistically significant predictors; social media budget and influencer type are not. For every one-unit increase in TV budget, predicted sales increase by 3.52; for every one-unit increase in radio budget, predicted sales increase by 0.10.

set.seed(123)

lm.model = lm(Sales ~ ., data = train)

summary(lm.model)

# Build the coefficient table displayed below (intercept omitted)
coef.table = summary(lm.model)$coefficients[-1, ]
regression_results = data.frame(Variable = rownames(coef.table), coef.table, row.names = NULL)
kable(regression_results, 
      col.names = c("Variable","Coefficients", "Standard Error", "t-value", "p-value"))
Variable          Coefficients  Standard Error  t-value  p-value
TV                    3.521788        0.008628  408.176  < 2e-16 ***
Radio                 0.104891        0.024922    4.209  2.63e-05 ***
Social Media          0.035006        0.064326    0.544  0.586
Influencer:Micro      0.095273        0.318500    0.299  0.765
Influencer:Nano       0.087135        0.316498    0.275  0.783
Influencer:Macro     -0.151038        0.318976   -0.470  0.636
regression_model_info = 
  data.frame(R_squared = 0.9947,
             Adjusted_R_squared = 0.9947, 
             F_statistic = "1.14e+05 on 6 and 3653 DF", 
             F_statistic_p_value = "< 2.2e-16")
kable(regression_model_info, 
      col.names = c("R-squared","Adjusted R-squared", "F-statistic", "F-statistic p-value"))
R-squared  Adjusted R-squared  F-statistic                F-statistic p-value
0.9947     0.9947              1.14e+05 on 6 and 3653 DF  < 2.2e-16
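
As a quick sanity check on the coefficient interpretation, predicting two otherwise identical observations whose TV budgets differ by one unit should reproduce the TV coefficient (a sketch; the row chosen is arbitrary):

base = test[1, ]                 # any single row will do
bumped = base
bumped$TV = bumped$TV + 1        # increase the TV budget by one unit
predict(lm.model, bumped) - predict(lm.model, base)  # ~3.52, the TV coefficient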

There are no multicollinearity issues in the model: every GVIF^(1/(2*Df)) value is at most 2.15, well below levels that would indicate a problem.

kable(car::vif(lm.model))
              GVIF      Df  GVIF^(1/(2*Df))
TV            4.024735   1  2.006174
Radio         4.609286   1  2.146925
Social.Media  1.594631   1  1.262787
Influencer    1.001755   3  1.000292

Predictions and Model Evaluation

The test r-squared is 98.79% and the test RMSE is 10.13. The model performs well on the test data and makes accurate predictions for sales.

predictions = predict(lm.model, newdata = test)

# Mean squared error on the test set
mse = mean((predictions - test$Sales)^2)

lm.rmse = caret::RMSE(predictions, test$Sales)

lm.r2 = R2(predictions, test$Sales)

lm.results = data.frame(
  R_squared = lm.r2, 
  MSE = mse, 
  RMSE = lm.rmse
)


kable(lm.results, 
      col.names = c("R-squared","MSE","RMSE"))
R-squared  MSE       RMSE
0.987938   102.6211  10.13021
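
Note that RMSE is the square root of MSE, so the two metrics agree by construction:

sqrt(mse)  # 10.13021, identical to the RMSE above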

Decision Tree

Model

set.seed(123)

dt.model = rpart(Sales~., data = train, method = "anova")

rattle::fancyRpartPlot(dt.model, sub = "")

Predictions and Model Evaluation

The test r-squared is 97.22% and the test RMSE is 15.36. This model still predicts well; however, the linear regression model has a higher test r-squared and a lower RMSE.

preds_DT = predict(dt.model, test, type = "vector")


mse.dt = mean((preds_DT - test$Sales)^2)


dt.rmse =caret::RMSE(preds_DT, test$Sales)

dt.r2 = R2(preds_DT, test$Sales)

dt_performance_metrics = 
  data.frame(R_squared = dt.r2, 
             MSE = mse.dt,
             RMSE = dt.rmse)
kable(dt_performance_metrics, 
      col.names = c("R-squared", "MSE", "RMSE"))
R-squared  MSE       RMSE
0.9722847  235.8939  15.35884

Pruning the Decision Tree

printcp(dt.model)
## 
## Regression tree:
## rpart(formula = Sales ~ ., data = train, method = "anova")
## 
## Variables actually used in tree construction:
## [1] TV
## 
## Root node error: 31848791/3660 = 8701.9
## 
## n= 3660 
## 
##         CP nsplit rel error   xerror      xstd
## 1 0.748529      0  1.000000 1.000931 0.0149480
## 2 0.095694      1  0.251471 0.251683 0.0041901
## 3 0.089422      2  0.155777 0.156044 0.0033039
## 4 0.012223      3  0.066355 0.066629 0.0018222
## 5 0.012080      4  0.054132 0.055865 0.0017740
## 6 0.010845      5  0.042052 0.042366 0.0017162
## 7 0.010624      6  0.031207 0.033051 0.0017226
## 8 0.010000      7  0.020583 0.020924 0.0014276
best.cp = dt.model$cptable[which.min(dt.model$cptable[,"xerror"]),"CP"]
plotcp(dt.model)

best.cp
## [1] 0.01

The pruned tree is identical to the original decision tree because the cross-validated best cp (0.01) is also rpart's default cp, so no additional pruning occurs.

set.seed(123)
pruned.tree = prune(dt.model, cp = best.cp)
rattle::fancyRpartPlot(pruned.tree, sub = "")
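
To see pruning actually change the tree, a larger complexity parameter can be supplied. For illustration, using the cp from the two-split row of the cp table above yields a visibly smaller tree:

smaller.tree = prune(dt.model, cp = 0.09)  # illustrative cp, larger than best.cp
rattle::fancyRpartPlot(smaller.tree, sub = "")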

Random Forest

Model

Repeated cross-validation is used to reduce overfitting.

set.seed(123)

modelControl = trainControl(method = "repeatedcv",
                            repeats = 3,
                            number = 10)

rf = train(Sales ~ .,
           method = "rf",
           trControl = modelControl,
           data = train,
           metric = "RMSE",
           verbose = FALSE)

rf
## Random Forest 
## 
## 3660 samples
##    4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3293, 3293, 3293, 3296, 3294, 3294, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE     
##   2     11.120542  0.9892459  7.444073
##   4      5.812441  0.9958476  3.027776
##   6      6.339729  0.9948639  3.087779
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
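
caret tuned mtry automatically over 2, 4, and 6. An explicit search grid could be supplied instead via tuneGrid; a sketch:

rf.grid = train(Sales ~ .,
                method = "rf",
                trControl = modelControl,
                data = train,
                metric = "RMSE",
                tuneGrid = expand.grid(mtry = 2:6))  # search every mtry from 2 to 6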

Variable Importance

TV and radio are the most important variables; they have by far the greatest influence on predicted sales.

rf.varimp = varImp(rf, scale = FALSE)

imp <- data.frame(importance = rf.varimp$importance) %>%
  tibble::rownames_to_column(var = "variable")

imp %>% ggplot(aes(x = reorder(variable, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "#440154", color = "black") +
  coord_flip() +
  labs(x = "Variables", y = "Variable importance")

A log transformation is applied to the importance scores to make the smaller values easier to see.

imp.log <- data.frame(importance = rf.varimp$importance) %>%
  log() %>%
  tibble::rownames_to_column(var = "variable")

imp.log %>% ggplot(aes(x = reorder(variable, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "#440154", color = "black") +
  coord_flip() +
  labs(x = "Variables", y = "Variable importance")

Predictions and Model Evaluation

Random forest has a test r-squared of 99.17% and an RMSE of 8.32. This model has a slightly higher test r-squared than both the linear regression and the decision tree, and it has the lowest RMSE of the three models.

y_hat = predict(rf, test)


mse.rf = mean((y_hat - test$Sales)^2)
rf.rmse = caret::RMSE(y_hat,test$Sales)
rf.r2 = R2(y_hat, test$Sales)
rf_performance_metrics = 
  data.frame(R_squared = rf.r2, 
             MSE = mse.rf,
             RMSE = rf.rmse)
kable(rf_performance_metrics, 
      col.names = c("R-squared", "MSE", "RMSE"))
R-squared  MSE       RMSE
0.9918634  69.17844  8.317357

PDP Plot for TV

Higher TV budget predicts higher sales.

partial(rf, pred.var = "TV", train = train, plot = TRUE, plot.engine = "ggplot2")

PDP Plot for Radio

Higher radio budget predicts higher sales.

partial(rf, pred.var = "Radio", train = train, plot = TRUE, plot.engine = "ggplot2")

PDP Plot for Social Media

Higher social media budget predicts higher sales.

partial(rf, pred.var = "Social.Media", train = train, plot = TRUE, plot.engine = "ggplot2")

PDP Plot for Influencer Type

Macro influencers predict higher sales.

partial(rf, pred.var = "Influencer", train = train, plot = TRUE, plot.engine = "ggplot2")

Conclusions

Random forest outperforms the other two models. The variable importance from the random forest shows that TV budget and radio budget are crucial to the model, and both were also statistically significant in the linear regression. Social media budget and influencer type were neither important in the random forest nor statistically significant in the linear model.
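
A side-by-side comparison, built from the metric objects computed above, summarizes the result:

model.comparison = rbind("Linear Regression" = lm.results,
                         "Decision Tree"     = dt_performance_metrics,
                         "Random Forest"     = rf_performance_metrics)
kable(model.comparison, col.names = c("R-squared", "MSE", "RMSE"))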

To increase sales, invest in the TV and radio budgets; these channels are the strongest drivers of predicted sales.