A simple workflow to build to build a predictive regression model is as follow: Randomly split your data into training set (80%) and test set (20%) Build the regression model using the training set Make predictions

Loading Required Packages

library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
theme_set(theme_bw())

Lets load the marketing dataset
The marketing data set [datarium package] contains the impact of three advertising medias (youtube, facebook and newspaper) on sales. It will be used for predicting sales units on the basis of the amount of money spent in the three advertising medias. Data are the advertising budget in thousands of dollars along with the sales. The advertising experiment has been repeated 200 times with different budgets and the observed sales have been recorded.

data("marketing",package="datarium")
head(marketing)
##   youtube facebook newspaper sales
## 1  276.12    45.36     83.04 26.52
## 2   53.40    47.16     54.12 12.48
## 3   20.64    55.08     83.16 11.16
## 4  181.80    49.56     70.20 22.20
## 5  216.96    12.96     70.08 15.48
## 6   10.44    58.68     90.00  8.64

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model). Make sure to set seed for reproducibility.

set.seed(123)
training.samples <- marketing$sales %>%
  createDataPartition(p=0.8,list=FALSE)
train.data <- marketing[training.samples,]
test.data <- marketing[-training.samples,]

Computing Linear Regression

The R function lm() is used to compute linear regression model.

Quick Start R code

Build the model
model <- lm(sales~.,data=train.data)

summarize the model summary(model)

make predictions
predictions <- model %>% predict( test.data)

Model performance (a) Prediction error, RMSE

RMSE( predictions, test.data $ sales)

(b) R-square

R2( predictions, test.data $ sales)

model <- lm( sales ~ youtube, data = train.data) 
summary( model) $ coef
##               Estimate  Std. Error  t value     Pr(>|t|)
## (Intercept) 8.58914961 0.616044182 13.94242 1.987874e-29
## youtube     0.04671639 0.003003398 15.55451 8.019035e-34

Predictions can be easily made using the R function predict(). In the following example, we predict sales units for two youtube advertising budget: 0 and 1000.

newdata <- data.frame( youtube = c( 0, 1000)) 
model %>% predict( newdata)
##        1        2 
##  8.58915 55.30554

Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression for predicting an outcome variable (y) on the basis of multiple distinct predictor variables (x).

You can compute the multiple linear regressin as follow:

model <- lm( sales ~ youtube + facebook + newspaper, data = train.data) 
summary( model) $ coef
##                Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept) 3.594141778 0.420814685  8.5409134 1.054115e-14
## youtube     0.044635905 0.001552128 28.7578721 1.125356e-64
## facebook    0.188823227 0.009528828 19.8159966 1.090250e-44
## newspaper   0.002839757 0.006441963  0.4408217 6.599447e-01

You can predict the values also

newdata <- data.frame( youtube = 2000, facebook = 1000, newspaper = 1000)

model%>% predict(newdata)
##        1 
## 284.5289

Lets see the summary

summary(model)
## 
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7142  -0.9939   0.3684   1.4494   3.3619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.594142   0.420815   8.541 1.05e-14 ***
## youtube     0.044636   0.001552  28.758  < 2e-16 ***
## facebook    0.188823   0.009529  19.816  < 2e-16 ***
## newspaper   0.002840   0.006442   0.441     0.66    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.043 on 158 degrees of freedom
## Multiple R-squared:  0.8955, Adjusted R-squared:  0.8935 
## F-statistic: 451.2 on 3 and 158 DF,  p-value: < 2.2e-16

In the residuals have a look at the mean, it shoudl not be far from 0, in this case it is not.

The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of model summary.

In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.

We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and newspaper advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.

As the newspaper variable is not significant, it is possible to remove it from the model:

model <- lm( sales ~ youtube + facebook, data = train.data) 
summary( model)
## 
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8127  -1.0073   0.3236   1.4643   3.3454 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.658580   0.393609   9.295   <2e-16 ***
## youtube     0.044650   0.001548  28.846   <2e-16 ***
## facebook    0.190165   0.009006  21.114   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.038 on 159 degrees of freedom
## Multiple R-squared:  0.8954, Adjusted R-squared:  0.894 
## F-statistic: 680.2 on 2 and 159 DF,  p-value: < 2.2e-16

Finally, our model equation can be written as follow: sales = 3.43 + 0.045youtube + 0.187facebook.

accuracy Once you identified that, at least, one predictor variable is significantly associated to the outcome, you should continue the diagnostic by checking how well the model fits the data. This process is also referred to as the goodness-of-fit

So Residual standard error, r-squared, f.statistics and p. value

error, represents roughly the average difference between the observed outcome values and the predicted values by the model.

The lower the RSE the best the model fits to our data. Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible.

In this case

2.038/mean(train.data$sales)
## [1] 0.1208938

This is 12% which is very low.

Making Predictions

It has two parts
1. Predict
2. Assess the model performance using RMSE or Adj R sq

# making Predictions

predictions <- model%>% predict(test.data)
predictions
##         2         3         6        14        15        24        32        40 
## 15.011084 15.054460 15.283625 10.616936 22.101982 19.747444 13.678412 24.477896 
##        41        46        51        58        65        66        68        81 
## 19.597351 18.174899 15.071277 15.337583 20.449806  9.477844 14.431149 13.844987 
##        86        91        93       104       107       116       124       129 
## 18.209084 11.972545 22.967589 17.651272  7.508261 15.669379 18.149940 26.643971 
##       134       137       138       139       148       156       168       178 
## 23.080107 13.929963 24.918356 11.872856 27.870953  6.525358 15.925554 14.557842 
##       181       182       183       184       189       192 
## 12.642522 16.598079  7.970506 28.880715 22.154415 10.168411

Compute the model performance

# compute RMSE

RMSE( predictions, test.data $ sales)
## [1] 1.949125
# compute R square
R2(predictions,test.data$sales)
## [1] 0.9069336

Both the measures are good in this case.

Discussion

Note that the LR model assumes a linear relationship between the outcome and predictor variable. This can be checked by creating a scatterplot of the outcome variable vs the predictor variable

eg. sales unit vs. youtube budge

ggplot( marketing, aes( x = youtube, y = sales)) +
  geom_point() + 
  stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

In this case this is linear, we can check for all the other predictor variables.

Potential problems, include: a) the presence of influential observations in the data , non-linearity between the outcome and some predictor variables and the presence of strong correlation between predictor variables.