This is a short machine learning project in which we build a predictive model for Sales using the classic advertising dataset.
The dataset contains advertising expenditures on 3 different platforms (TV, Radio and Newspaper) and the corresponding Sales Volume generated.
Sales Volumes were recorded in thousands of units and the expenditures on advertising were recorded in thousands of dollars.
In machine learning, the task of predicting numerical values, such as Sales, is also known as Regression.
Through this machine learning project, we will attempt to answer a very common business question:
“How much Sales can we expect to generate if we spend a given amount of money on each advertising platform?”
With that, let’s get started with the task.
First, let’s load the libraries required.
#load required libraries
library(tidyverse)  #data wrangling; also attaches ggplot2
library(plotly)     #interactive plots via ggplotly()
library(MLmetrics)  #RMSE and MAPE metrics

Next, we will load the dataset and explore the data.
By previewing the data using the head() function, we can see that we have 4 columns of data. Each row represents the Sales Volume generated by the corresponding amounts of advertisement spending on TV, Radio and Newspaper platforms.
#load dataset
df <- read.csv("advertising.csv")

#preview the first 15 rows
head(df, 15)

Let’s also check the dimensions of our dataset.
We can see that there are 4 columns and 200 rows in our dataset.
dim(df)

## [1] 200   4
Next, we will check for missing data in the dataset.
From the output below, we can see that there are no missing values in our data. Perfect.
#Number of NAs in each column
colSums(is.na(df))

##        TV     Radio Newspaper     Sales 
##         0         0         0         0
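As an even quicker check, anyNA() reports in one call whether any value anywhere in the data frame is missing; a small alternative sketch:

#TRUE if any value in df is NA, FALSE otherwise
anyNA(df)  #FALSE here, since every column above has zero NAs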
We can also compute summary statistics for each column of our data.
summary(df)

##        TV             Radio          Newspaper          Sales      
##  Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60  
##  1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:11.00  
##  Median :149.75   Median :22.900   Median : 25.75   Median :16.00  
##  Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :15.13  
##  3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:19.05  
##  Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00
We can also chart the distribution of each variable to get a feel for the data.
#reshape to long format so each variable can be plotted in its own facet
p <- df %>%
  select(where(is.numeric)) %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "variable_value"
  ) %>%
  #plot faceted histograms
  ggplot(aes(x = variable_value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(variable), scales = "free")

ggplotly(p) %>% config(displayModeBar = F)

In machine learning, a commonly used technique for evaluating models is known as the Validation Set Approach.
In general, the machine learning process under this approach will be as follows:
We will split the dataset randomly into 2 parts: the Training Set and the Testing Set. Typically, as a rule of thumb, we will use 70% of our data as the Training Set and the remaining 30% as our Testing Set.
We will then build our model using only the Training Set, keeping the Testing Set out of the model-building process entirely.
After building the model on the Training Set, we will use it to predict the values in the Testing Set.
Finally, we will measure how far off our predictions are compared to the actual values in the Testing Set. This gives us a basis for measuring the prediction accuracy of our model on “unseen” data.
With that, let’s get started on the modelling procedure.
Now, we will split the dataset into training and testing sets.
#set a seed for reproducibility
#otherwise different rows would be assigned to the training/testing sets on every run
set.seed(11)

#these are the row indices of df that form the training set (70% of the data)
train <- sample(1:nrow(df), 0.7*nrow(df))
#negative indices: every row not in the training set becomes the testing set
test <- (-train)

Now that we have our training set, we can start building our machine learning model.
As mentioned previously, we will be using a Regression model for our task. More specifically, we will be using the Multiple Linear Regression model in this instance.
What is Linear Regression exactly?
Well, the intuition behind Linear Regression is simple. Basically, we want to find a “best-fit” line for our dataset: picture a scatter plot of the data points, with a straight line drawn through them so that it “fits” the cloud of points as closely as possible.
You might also wonder, how do we define the “best-fit” line? The criterion is known as the “least-squares method”: we choose the line which minimizes the sum of squared residuals.
Mathematically, the sum of squared residuals (SSR) can be expressed as:

SSR = Σᵢ (Yᵢ − Ŷᵢ)²
It may sound very technical, but it actually isn’t.
A residual is simply the difference between our prediction (Ŷᵢ in the equation above, which is a point on the regression line) and the actual value (Yᵢ). In other words, each residual is the vertical distance between a data point and the regression line.
We “square” the residuals so that negative differences do not cancel out positive ones, and we then sum them up to get the total of the squared vertical distances. The line which minimizes this sum is our “best-fit” line.
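To make the least-squares idea concrete, here is a minimal toy sketch in R; the two vectors are made up purely for illustration, and in practice lm() handles all of this for us.

#toy illustration of the sum of squared residuals
y_actual <- c(3, 5, 7, 9)       #observed values
y_hat <- c(2.8, 5.3, 6.9, 9.4)  #hypothetical predictions from some candidate line

res <- y_actual - y_hat  #residuals: vertical distances to the line
sum(res^2)               #sum of squared residuals, 0.3 for this toy example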
We will then obtain a regression line with an equation of the form:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ

where each “X” variable is a predictor that we are using to predict “Y”. In our case, our “Y” variable is Sales Volume and our “X” variables are the advertisement expenditures on the TV, Radio and Newspaper platforms.
Fortunately, with the functions in R, we do not need to calculate the residuals and fit the regression line manually. R will be able to do this for us.
With that said, let’s fit the regression line to our training dataset, using all variables in our dataset as predictors.
From our model results, we can see that all our variables except Newspaper are significant predictors of Sales, as seen from the P Values of each variable (the column with the header “Pr(>|t|)”) in the table below. Therefore, we will create another Linear Regression model without the Newspaper variable.
#fit Sales against all other variables in the dataset
#subset = train restricts the fit to the training rows only
lm.fit <- lm(Sales ~ ., data = df, subset = train)
summary(lm.fit)

##
## Call:
## lm(formula = Sales ~ ., data = df, subset = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9780 -0.7139  0.0543  0.9176  3.8421
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.449119   0.371189   11.99   <2e-16 ***
## TV           0.055171   0.001633   33.78   <2e-16 ***
## Radio        0.102686   0.010222   10.05   <2e-16 ***
## Newspaper    0.002743   0.006524    0.42    0.675
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.657 on 136 degrees of freedom
## Multiple R-squared: 0.9034, Adjusted R-squared: 0.9013
## F-statistic: 424.2 on 3 and 136 DF, p-value: < 2.2e-16
From the results below, we can see that the Adjusted R-Squared improved slightly after dropping the Newspaper variable.
Adjusted R-Squared is a metric we can use to evaluate a Linear Regression model: it measures how well the model fits the data, with a penalty for each additional predictor. The higher the Adjusted R-Squared, the better the fit. An Adjusted R-Squared of 0.9019 is considered very high and would therefore be deemed satisfactory most of the time.
We can also see that our model is statistically significant in predicting Sales, with a P Value that is < 2.2e-16.
From the code below, we have arrived at our Linear Regression model, which has the equation:
Sales = 4.495461 + 0.055184 × TV + 0.104109 × Radio
We can then use this equation to predict Sales later on.
#exclude the Newspaper variable
#subset = train restricts the fit to the training rows only
lm.fit1 <- lm(Sales ~ TV + Radio, data = df, subset = train)
summary(lm.fit1)

##
## Call:
## lm(formula = Sales ~ TV + Radio, data = df, subset = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0568 -0.7448  0.0615  0.9116  3.8531
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.495461   0.353382   12.72   <2e-16 ***
## TV           0.055184   0.001628   33.89   <2e-16 ***
## Radio        0.104109   0.009616   10.83   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.652 on 137 degrees of freedom
## Multiple R-squared: 0.9033, Adjusted R-squared: 0.9019
## F-statistic: 640 on 2 and 137 DF, p-value: < 2.2e-16
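If we want these fit statistics programmatically rather than reading them off the printout, they can be extracted from the summary object; a small sketch using the lm.fit1 model above:

#extract fit statistics from the summary object
s <- summary(lm.fit1)
s$r.squared      #Multiple R-Squared, ~0.9033
s$adj.r.squared  #Adjusted R-Squared, ~0.9019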
We can also compute the Training Error of our model, using the Root Mean Squared Error (RMSE) metric.
RMSE is defined as:

RMSE = √( (1/n) Σᵢ (Yᵢ − Ŷᵢ)² )

It basically tells us, on average, how far our Sales predictions are from the actual Sales Volumes.
We have computed the RMSE on our Training Set as 1.63. Therefore, we can say that, on average, our predicted Sales Volumes for the Training Set are 1.63 thousand units away from the actual Sales Volumes.
#RMSE on the training set
rmse_train <- RMSE(predict(lm.fit1, df)[train], df$Sales[train])
rmse_train

## [1] 1.634244
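As a sanity check, the same number falls straight out of the RMSE definition; a minimal sketch reusing lm.fit1 and train from above:

#manual RMSE: square root of the mean squared residual
pred_train <- predict(lm.fit1, df)[train]
sqrt(mean((df$Sales[train] - pred_train)^2))  #matches rmse_train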
Now that we have our model (“lm.fit1”), we will use it to predict the Sales numbers in the Testing Set. Because these predictions are made on “unseen” data, they give us a measure of the model’s real prediction accuracy. We will again compute the RMSE, this time on the Testing Set predictions.
Based on the RMSE we have calculated, we can say that, on average, our predicted Sales Volumes for the Testing Set are 1.69 thousand units away from the actual Sales Volumes. That is not too bad, considering that the Median Sales Volume in our dataset is 16 thousand units.
#RMSE on the testing set
rmse_test <- RMSE(predict(lm.fit1, df)[-train], df$Sales[-train])
rmse_test

## [1] 1.685202
We can also compute another metric to assess our model, which is Mean Absolute Percentage Error (MAPE).
The MAPE is calculated from the equation:

MAPE = (1/n) Σᵢ | (Yᵢ − Ŷᵢ) / Yᵢ |

The MAPE tells us, on average, how far off our predictions are, in percentage terms (after multiplying by 100).
With the MAPE we have calculated, we can see that our model produces predictions that are 8.72% off, on average.
#MAPE on the testing set, expressed as a percentage
mape_test <- MAPE(predict(lm.fit1, df)[-train], df$Sales[-train])*100
mape_test

## [1] 8.717138
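Again, the same figure can be reproduced directly from the MAPE definition; a small sketch reusing the objects defined above:

#manual MAPE: mean absolute deviation relative to the actual values
pred_test <- predict(lm.fit1, df)[-train]
mean(abs((df$Sales[-train] - pred_test) / df$Sales[-train])) * 100  #matches mape_test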
We can also use the model to forecast future Sales Volumes, given the advertising amounts we plan to invest.
For example, let’s assume we would like to spend 15 thousand dollars on Radio advertisements and 15 thousand dollars on TV advertisements. Since our model did not include Newspaper advertisement spending as a predictor, we do not need it to predict Sales Volume.
The code below will allow us to predict the Sales Volume generated from the corresponding advertising investments.
We see that the model produces a predicted Sales Volume of 6.88 thousand units.
#predict Sales for TV = 15 and Radio = 15 (both in thousands of dollars)
predicted <- predict(lm.fit1, data.frame(TV=15, Radio=15))
print(paste0("Predicted Sales Volume = ", round(predicted,2), " Thousand Units"))

## [1] "Predicted Sales Volume = 6.88 Thousand Units"
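We can double-check this prediction by plugging the spend values into the regression equation ourselves; a quick sketch using the fitted coefficients from coef():

#manual prediction from the fitted coefficients
b <- coef(lm.fit1)
unname(b["(Intercept)"] + b["TV"]*15 + b["Radio"]*15)
#4.495461 + 0.055184*15 + 0.104109*15, which is approximately 6.88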
In conclusion, this short machine learning example demonstrates how seemingly simple models, such as Linear Regression, can be used to make forecasts and address business questions. With a very simple model, we managed to answer our original question of how much Sales we can expect to generate for a given amount of advertising spend on each platform.