library(readr)
library(dplyr)
library(ggplot2)
library(plotly)  # provides plot_ly() and add_markers() used for the 3D scatterplots
library(tidyr)
library(corrplot)
library(psych)

Introduction

Suppose we are a small consulting company and our client, an organisation called Gen-X, wants to know which advertising channels are worth continuing to invest in. We have been hired to answer that question with data-driven findings.

Business Understanding

Gen-X has handed us confidential data. Before proceeding, our team met and came up with the following questions:

Which channels contribute to sales? Which channel produces the highest sales? How much might sales grow for a given increase in Radio ads?

This dataset is from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Dataset: Advertising.csv
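The import step is not shown here, so the following is only a minimal sketch of how the data could be loaded to produce the output below. The path in `csv_path` is an assumption (adjust it to wherever the file is saved), and newer readr releases may name the unnamed first column `...1` rather than `X1`.

```r
library(readr)

# Assumed location of the file; this path is hypothetical
csv_path <- "Advertising.csv"

if (file.exists(csv_path)) {
  # read_csv() guesses the column types and returns a tibble
  Advertising <- read_csv(csv_path)
  # Inspect the structure: column names, types and first values
  str(Advertising)
}
```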

## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   TV = col_double(),
##   Radio = col_double(),
##   Newspaper = col_double(),
##   Sales = col_double()
## )
## Classes 'tbl_df', 'tbl' and 'data.frame':    200 obs. of  5 variables:
##  $ X1       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ Radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ Newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ Sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 5
##   .. ..$ X1       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ TV       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Radio    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Newspaper: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Sales    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Let’s clean our dataset. We keep only the columns we need, which drops the row-index column X1.

Ads <- Advertising %>% select(TV, Radio, Newspaper, Sales)

Visualising relationships among our channels

pairs(Ads, col = "blue")

M <- cor(Ads)
corrplot(M, method = "number")

# pairs.panels() expects the raw data, not the correlation matrix
pairs.panels(Ads)

Since Sales depends on more than one channel, a 3D scatterplot shows the relationship more clearly than separate 2D plots. First, we plot TV and Radio against Sales to see how sales vary when both channels are used.

plot_ly(data = Ads, z = ~Sales, x = ~TV, y = ~Radio, opacity = 0.6) %>%
  add_markers() 

Then we use Radio and Newspaper.

plot_ly(data = Ads, z = ~Sales, x = ~Radio, y = ~Newspaper, opacity = 0.6) %>%
  add_markers() 

Fit our regression model

Rmodel <- lm(Sales ~ TV + Radio + Newspaper, data = Ads)
summary(Rmodel)
## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = Ads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
Ads$prediction <- predict(Rmodel)
ggplot(Ads, aes(x = prediction, y = Sales)) + 
  geom_point() +
  geom_abline(color = "blue")
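
To read the coefficient table in the summary output more concretely, here is a rough sketch that copies the reported estimates and standard errors by hand and rebuilds approximate 95% confidence intervals. The names `est`, `se`, `extra_radio` and `sales_gain` are illustrative, and the 10-unit Radio increase is a hypothetical scenario. With the fitted model available, `coef(Rmodel)` and `confint(Rmodel)` give the same quantities directly.

```r
# Estimates and standard errors copied by hand from the summary(Rmodel) output
est <- c(TV = 0.045765, Radio = 0.188530, Newspaper = -0.001037)
se  <- c(TV = 0.001395, Radio = 0.008611, Newspaper = 0.005871)

# Approximate 95% confidence intervals, using the t distribution with the
# 196 residual degrees of freedom reported in the summary
t_crit <- qt(0.975, df = 196)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))

# Hypothetical scenario: 10 more units of Radio budget,
# with TV and Newspaper budgets unchanged
extra_radio <- 10
sales_gain <- unname(est["Radio"] * extra_radio)
sales_gain
```

The Newspaper interval spans zero, which matches its large p-value, while the TV and Radio intervals lie clearly above zero.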

Calculate RMSE

Ads$residuals <- Ads$prediction - Ads$Sales
# For convenience put the residuals in the variable res
res <- Ads$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 1.66857
# Calculate the standard deviation of Sales and print it
(sd_Ads <- sd(Ads$Sales))
## [1] 5.217457

Good! An RMSE (about 1.67) much smaller than the outcome’s standard deviation (about 5.22) suggests a model that predicts well, at least on the data it was fitted to.
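
The comparison between RMSE and standard deviation is closely tied to R-squared. As a small worked check, the sketch below plugs in the two values printed above and the 200 observations from the data summary; the variable names are illustrative only.

```r
# Values copied from the output above
rmse     <- 1.66857   # in-sample RMSE of the model
sd_sales <- 5.217457  # standard deviation of Sales
n        <- 200       # number of observations

# RMSE as a fraction of the outcome's spread
rmse / sd_sales

# mean(res^2) divides by n, while sd() divides by n - 1,
# so rebuild the two sums of squares before comparing them
sse <- n * rmse^2            # residual sum of squares
sst <- (n - 1) * sd_sales^2  # total sum of squares
r_squared <- 1 - sse / sst
r_squared
```

The result should come out at roughly 0.897, in line with the Multiple R-squared reported by `summary(Rmodel)`.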

To summarise, multiple linear regression lets us predict sales from the advertising channels used. The fitted model shows that, holding the other channels fixed, each additional unit of Radio spend is associated with about 0.19 more units of Sales, against about 0.05 for TV, and both effects are strongly significant. The Newspaper coefficient (-0.001, p = 0.86) is not distinguishable from zero, so Newspaper adds little once TV and Radio are accounted for. This information can then be forwarded to our client so they can decide which channels to keep investing in over the long run.