A company wants to know how well they can predict sales based on TV marketing budget, radio marketing budget, social media marketing budget, and the type of influencer they use to advertise their product. This analysis explores which model predicts sales the best and which variables contribute to high sales and are important in advertising.
There are more Mega Influencers in the dataset, however, there isn’t really a drastic difference in the amount of other types of influencers.
marketing_sales$Influencer = factor(marketing_sales$Influencer, levels = c("Mega",
"Micro",
"Nano",
"Macro"))
ggplot(marketing_sales, aes(x = Influencer))+
geom_bar(aes(fill = Influencer), stat = "count", position = "dodge") +
geom_text(stat='count', aes(label=..count..), vjust= 0, colour = "black")+
labs(title = "Count of Influencer Type", x = "Influencer Type", y = "Count") +
scale_y_continuous(labels = scales::comma) +
scale_fill_manual(values = c("#440154", "#31688e", "#35b779", "#fde725")) +
theme(legend.position = "none")
The distribution of the influencer is similar across all types.
ggplot(marketing_sales, aes(y = Sales, x = Influencer, fill = Influencer)) +
geom_violin() +
labs(title= "Distribution of Influencer Type", y = "Sales", x = "Influncer Type") +
scale_fill_manual(values = c("#440154", "#31688e", "#35b779", "#fde725")) +
theme(legend.position="none")
ggplot(marketing_sales, aes(y = Sales, x = Influencer, fill = Influencer)) +
geom_boxplot() +
labs(title= "Distribution of Influencer Type", y = "Sales", x = "Influncer Type") +
scale_fill_manual(values = c("#440154", "#31688e", "#35b779", "#fde725")) +
theme(legend.position="none")
Radio and social media budget are right-skewed.
marketing_sales %>%
dplyr::select(c("TV", "Radio", "Social.Media"))%>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#3399CC") +
labs(title= "Distribution of Numeric Variables")
Sales and radio have a high positive correlation (r = .87), meaning as the radio budget increases, sales increase.
TV and radio have a high positive correlation (r = .87), meaning as TV budget increases, radio budget increases.
Sales and social media have a moderately positive correlation (r = .53), meaning as social media budget increases, sales increase.
TV and social media have a moderately positive correlation (r = .53), meaning as TV budget increases, social media budget increase.
Radio and social media have a moderately positive correlation (r = .61), meaning as radio budget increases, social media budget. increase.
cordata <- marketing_sales %>%
keep(is.numeric)
cormatrix <- cor(cordata)
round(cormatrix, 2)
ggcorrplot(cormatrix, hc.order = TRUE,outline.color = "white", lab = TRUE, colors = c("#52c569", "white", "#fde725"), lab_size = 2.5) +
labs(title="Correlation of Numeric Variables")
set.seed(123)
idx.train = createDataPartition(y=marketing_sales$Sales, p =.8, list =F)
train = marketing_sales[idx.train,]
test = marketing_sales[-idx.train,]
This model is statistically significant (p = < 2.2e-16). 99.47% of the variation in the data is explained by the model. TV (p = < 2e-16) and Radio (p = 2.63e-05) are statistically significant variables. TV and Radio have the most impact on the model because they are statistically significant. For every 1 unit increase in TV, sales will increase by 3.52, for every 1 unit increase in Radio, sales will increase by 0.10.
set.seed(123)
lm.model = lm(Sales ~., data = train)
summary(lm.model)
kable(regression_results,
col.names = c("Variable","Coefficients", "Standard Error", "t-value", "p-value"))
Variable | Coefficients | Standard Error | t-value | p-value |
---|---|---|---|---|
TV | 3.521788 | 0.008628 | 408.176 | < 2e-16 *** |
Radio | 0.104891 | 0.024922 | 4.209 | 2.63e-05 *** |
Social Media | 0.035006 | 0.064326 | 0.544 | 0.586 |
Influencer:Micro | 0.095273 | 0.318500 | 0.299 | 0.765 |
Influencer:Nano | 0.087135 | 0.316498 | 0.275 | 0.783 |
Influencer:Macro | -0.151038 | 0.318976 | -0.470 | 0.636 |
regression_model_info =
data.frame(R_squared = 0.9947,
Adjusted_R_squared = 0.9947,
F_statistic = "1.14e+05 on 6 and 3653 DF",
F_statistic_p_value = "< 2.2e-16")
kable(regression_model_info,
col.names = c("R-squared","Adjusted R-squared", "F-statistic", "F-statistic p-value"))
R-squared | Adjusted R-squared | F-statistic | F-statistic p-value |
---|---|---|---|
0.9947 | 0.9947 | 1.14e+05 on 6 and 3653 DF | < 2.2e-16 |
There are no multicollinearity issues in the model.
kable(car::vif(lm.model))
GVIF | Df | GVIF^(1/(2*Df)) | |
---|---|---|---|
TV | 4.024735 | 1 | 2.006174 |
Radio | 4.609286 | 1 | 2.146925 |
Social.Media | 1.594631 | 1 | 1.262787 |
Influencer | 1.001755 | 3 | 1.000292 |
The test r-squared is 98.79% and the test RMSE is 10.13. This model does well on the test data and makes strong and accurate predictions for sales.
predictions = predict(lm.model, newdata = test, type = "response")
mse = mean((predictions - test$Sales)^2)
lm.rmse = caret::RMSE(predictions,test$Sales)
lm.r2 = R2(predictions, test$Sales)
lm.results = data.frame(
R_squared = lm.r2,
MSE = mse,
RMSE = lm.rmse
)
kable(lm.results,
col.names = c("R-squared","MSE","RMSE"))
R-squared | MSE | RMSE |
---|---|---|
0.987938 | 102.6211 | 10.13021 |
set.seed(123)
dt.model = rpart(Sales~., data = train, method = "anova")
rattle::fancyRpartPlot(dt.model, sub = "")
The test r-squared is 97.22% and the test RMSE is 15.36. This model still predicts well, however, the linear regression model has a higher test r-squared and lower RMSE.
preds_DT = predict(dt.model, test, type = "vector")
mse.dt = mean((preds_DT - test$Sales)^2)
dt.rmse =caret::RMSE(preds_DT, test$Sales)
dt.r2 = R2(preds_DT, test$Sales)
dt_performance_metrics =
data.frame(R_squared = dt.r2,
MSE = mse.dt,
RMSE = dt.rmse)
kable(dt_performance_metrics,
col.names = c("R-squared", "MSE", "RMSE"))
R-squared | MSE | RMSE |
---|---|---|
0.9722847 | 235.8939 | 15.35884 |
printcp(dt.model)
##
## Regression tree:
## rpart(formula = Sales ~ ., data = train, method = "anova")
##
## Variables actually used in tree construction:
## [1] TV
##
## Root node error: 31848791/3660 = 8701.9
##
## n= 3660
##
## CP nsplit rel error xerror xstd
## 1 0.748529 0 1.000000 1.000931 0.0149480
## 2 0.095694 1 0.251471 0.251683 0.0041901
## 3 0.089422 2 0.155777 0.156044 0.0033039
## 4 0.012223 3 0.066355 0.066629 0.0018222
## 5 0.012080 4 0.054132 0.055865 0.0017740
## 6 0.010845 5 0.042052 0.042366 0.0017162
## 7 0.010624 6 0.031207 0.033051 0.0017226
## 8 0.010000 7 0.020583 0.020924 0.0014276
best.cp = dt.model$cptable[which.min(dt.model$cptable[,"xerror"]),"CP"]
plotcp(dt.model)
best.cp
## [1] 0.01
The results after pruning are the same as the original decision tree.
set.seed(123)
pruned.tree = prune(dt.model, cp = best.cp)
rattle::fancyRpartPlot(pruned.tree, sub = "")
Using cross-validation to reduce overfitting
set.seed(123)
modelControl = trainControl(method = "repeatedcv",
repeats = 3,
number = 10)
rf = train(Sales~.,
method="rf",
trControl=modelControl,
data=train,
metric = "RMSE",
verbose=FALSE)
rf
## Random Forest
##
## 3660 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3293, 3293, 3293, 3296, 3294, 3294, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 11.120542 0.9892459 7.444073
## 4 5.812441 0.9958476 3.027776
## 6 6.339729 0.9948639 3.087779
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
TV and radio are the most important variables. They have a great impact on influencing sales.
RF.varimp.dur1 =varImp(rf, scale = FALSE)
imp <- data.frame(importance = RF.varimp.dur1$importance) %>%
tibble::rownames_to_column(var = "variable")
imp %>% ggplot(aes(x = reorder(variable,Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "#440154", color = "black")+
coord_flip() +
labs(x = "Variables", y = "Variable importance")
Used a log transformation to show the small values easier.
imp.log <- data.frame(importance = RF.varimp.dur1$importance) %>%
log() %>%
tibble::rownames_to_column(var = "variable")
imp.log %>% ggplot(aes(x = reorder(variable,Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "#440154", color = "black")+
coord_flip() +
labs(x = "Variables", y = "Variable importance")
Random forest has a test r-squared of 99.17% and a RMSE of 8.32. This model has a slightly higher test r-squared compared to the linear regression and decision tree. This model has a lower RMSE compared to the decision tree and a slighly higher RMSE compared to the linear regression.
y_hat = predict(rf, test)
mse.rf = mean((y_hat - test$Sales)^2)
rf.rmse = caret::RMSE(y_hat,test$Sales)
rf.r2 = R2(y_hat, test$Sales)
rf_performance_metrics =
data.frame(R_squared = rf.r2,
MSE = mse.rf,
RMSE = rf.rmse)
kable(rf_performance_metrics,
col.names = c("R-squared", "MSE", "RMSE"))
R-squared | MSE | RMSE |
---|---|---|
0.9918634 | 69.17844 | 8.317357 |
Higher TV budget predicts higher sales.
partial(rf, pred.var = "TV", train = train, plot = TRUE, plot.engine = "ggplot2")
Higher radio budget predicts higher sales.
partial(rf, pred.var = "Radio", train = train, plot = TRUE, plot.engine = "ggplot2")
Macro influnecers predict higher sales.
partial(rf, pred.var = "Influencer", train = train, plot = TRUE, plot.engine = "ggplot2")
Random forest outperforms the two other models. The variable importance from the random forest let’s us know that TV budget and radio budget are crucial in the model. Interestingly, in the linear regression model, we saw that TV budget and radio budget were both statistically significant variables. Social media budget and influencer type was not important in the random forest or statistically significant in the linear model.
To increase sales, investing in TV budget and radio budget will bring in more money.