library(readr)
library(dplyr)
library(ggplot2)
library(plotly)  # provides plot_ly() and add_markers() used for the 3D scatterplots
library(tidyr)
library(corrplot)
library(psych)
Suppose we are a consulting company and our client, the organisation Gen-X, wants to know which advertising channels it should keep investing in. Our small company has been hired to answer this question with data-driven findings.
Gen-X hands us its confidential data. Before we proceed, our team meets and comes up with the following questions:
Which channel contributes most to sales? Which channel would produce the highest amount of sales? How much might sales grow for a given increase in Radio ads?
This dataset is from An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Datasets-Advertising.csv
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## TV = col_double(),
## Radio = col_double(),
## Newspaper = col_double(),
## Sales = col_double()
## )
## Classes 'tbl_df', 'tbl' and 'data.frame': 200 obs. of 5 variables:
## $ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ TV : num 230.1 44.5 17.2 151.5 180.8 ...
## $ Radio : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
## $ Newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
## $ Sales : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ X1 : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ TV : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Radio : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Newspaper: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Sales : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
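The call that produced the output above is not shown. As a minimal sketch, assuming the data were loaded with readr's read_csv(), the snippet below writes the first five rows visible in the str() output to a temporary file and reads them back. The temporary file and the name Advertising_sample are stand-ins introduced here for illustration, not the author's actual path or object, and the first column is named X1 explicitly to mirror the output above.

```r
library(readr)

# Stand-in for Advertising.csv: the first five rows shown in the str() output above
csv_text <- "X1,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9"

# Write the sample to a temporary file (hypothetical location, not the real data path)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# read_csv() guesses the column types and returns a tibble
Advertising_sample <- read_csv(path)
str(Advertising_sample)
```

With the full file, the same two calls would be pointed at the real Advertising.csv and the result stored as Advertising.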
Let’s clean our dataset. We keep only the columns we need and drop X1, which is just a row index.
Ads <- Advertising %>% select(TV, Radio, Newspaper, Sales)
Visualising relationships among our channels
pairs(Ads, col = "blue")
M <- cor(Ads)
corrplot(M, method = "number")
# pairs.panels() expects the raw data, not the correlation matrix
pairs.panels(Ads)
Since we have three channels, a 3D scatterplot lets us see how two of them relate to sales at once. First we plot TV and Radio against Sales, to see how much sales we can expect when we use both channels.
plot_ly(data = Ads, z = ~Sales, x = ~TV, y = ~Radio, opacity = 0.6) %>%
  add_markers()
Then we use Radio and Newspaper.
plot_ly(data = Ads, z = ~Sales, x = ~Radio, y = ~Newspaper, opacity = 0.6) %>%
  add_markers()
Fit our regression model
Rmodel <- lm(Sales ~ TV + Radio + Newspaper, data = Ads)
summary(Rmodel)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = Ads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## Radio 0.188530 0.008611 21.893 <2e-16 ***
## Newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
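The coefficient table can be read as a prediction formula. As a rough sketch, the snippet below plugs the estimates printed in the summary into that formula by hand. predict_sales is a hypothetical helper introduced here for illustration, and the budgets are those of the first observation shown in the data preview.

```r
# Coefficients copied from the summary() output above (as printed)
coefs <- c(intercept = 2.938889, TV = 0.045765, Radio = 0.188530, Newspaper = -0.001037)

# Hypothetical helper: evaluates the fitted linear formula by hand
predict_sales <- function(tv, radio, newspaper) {
  unname(coefs["intercept"] + coefs["TV"] * tv +
    coefs["Radio"] * radio + coefs["Newspaper"] * newspaper)
}

# First observation in the data: TV = 230.1, Radio = 37.8, Newspaper = 69.2 (actual Sales = 22.1)
first_row <- predict_sales(230.1, 37.8, 69.2)
print(round(first_row, 2))

# Effect of 10 more units of Radio with the other channels held fixed
radio_gain <- predict_sales(230.1, 47.8, 69.2) - first_row
print(round(radio_gain, 3))
```

The hand-computed prediction comes to roughly 20.5 against an actual value of 22.1, and ten extra units of Radio add about 1.89 to predicted sales. In practice predict(Rmodel, newdata = ...) does the same calculation with the unrounded coefficients.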
Ads$prediction <- predict(Rmodel)
ggplot(Ads, aes(x = prediction, y = Sales)) +
  geom_point() +
  geom_abline(color = "blue")
Calculate RMSE
Ads$residuals <- Ads$prediction - Ads$Sales
# For convenience put the residuals in the variable res
res <- Ads$residuals
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 1.66857
# Calculate the standard deviation of Sales and print it
(sd_Ads <- sd(Ads$Sales))
## [1] 5.217457
Good! The RMSE (about 1.67) is much smaller than the standard deviation of Sales (about 5.22), which suggests the model predicts considerably better than simply guessing the mean.
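As a quick consistency check, the RMSE can be tied back to two numbers in the model summary. This is a sketch using only values copied from the outputs above, so small differences from rounding are expected. The variable names are introduced here for illustration.

```r
# Values copied from the outputs above
rse  <- 1.686     # residual standard error from summary(Rmodel)
df   <- 196       # residual degrees of freedom
n    <- 200       # number of observations
rmse <- 1.66857   # RMSE computed above
sd_y <- 5.217457  # standard deviation of Sales

# RMSE divides the squared residuals by n, the residual standard error by df
rmse_from_rse <- rse * sqrt(df / n)
print(round(rmse_from_rse, 3))

# R-squared rebuilt from RMSE and sd: 1 - RSS / TSS
r2 <- 1 - (n * rmse^2) / ((n - 1) * sd_y^2)
print(round(r2, 4))
```

The first value lands at about 1.669, in line with the RMSE above, and the second reproduces the Multiple R-squared of 0.8972. In other words, comparing RMSE with the outcome's standard deviation is another way of reading R-squared.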
To summarise, using multiple linear regression we can predict how much sales to expect depending on which advertising channels we use. From our model, both TV and Radio are strongly associated with sales. Per unit of spend, Radio's coefficient (about 0.19) is larger than TV's (about 0.046), while Newspaper's coefficient is close to zero and not statistically significant (p = 0.86). This information can then be forwarded to our client so they can decide which channels to keep in the long run.