A simple workflow to build a predictive regression model is as follows: (1) randomly split your data into a training set (80%) and a test set (20%); (2) build the regression model using the training set; (3) make predictions on the test set and evaluate the model.
library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 3.0.0 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
theme_set(theme_bw())
Let's load the marketing dataset.
The marketing data set [datarium package] contains the impact of three advertising media (youtube, facebook, and newspaper) on sales. It will be used to predict sales units on the basis of the amount of money spent on the three media. The data are the advertising budgets, in thousands of dollars, along with the sales. The advertising experiment was repeated 200 times with different budgets, and the observed sales were recorded.
data("marketing",package="datarium")
head(marketing)
## youtube facebook newspaper sales
## 1 276.12 45.36 83.04 26.52
## 2 53.40 47.16 54.12 12.48
## 3 20.64 55.08 83.16 11.16
## 4 181.80 49.56 70.20 22.20
## 5 216.96 12.96 70.08 15.48
## 6 10.44 58.68 90.00 8.64
We'll randomly split the data into a training set (80%, for building the predictive model) and a test set (20%, for evaluating the model). Make sure to set the seed for reproducibility.
set.seed(123)
training.samples <- marketing$sales %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- marketing[training.samples,]
test.data <- marketing[-training.samples,]
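As a quick sanity check (optional), we can confirm the split sizes; createDataPartition() balances the partition on the outcome, so the counts are close to, but not exactly, 80/20:
nrow(train.data) # 162 rows (about 80% of the 200 observations)
nrow(test.data)  # 38 rows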
The R function lm() is used to compute a linear regression model.
Build the model
model <- lm(sales ~ ., data = train.data)
Summarize the model:
summary(model)
Make predictions:
predictions <- model %>% predict(test.data)
Model performance: (a) prediction error (RMSE)
RMSE(predictions, test.data$sales)
(b) R-squared
R2(predictions, test.data$sales)
To start with a simple linear regression, we predict sales from the youtube advertising budget alone:
model <- lm(sales ~ youtube, data = train.data)
summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.58914961 0.616044182 13.94242 1.987874e-29
## youtube 0.04671639 0.003003398 15.55451 8.019035e-34
Predictions can easily be made using the R function predict(). In the following example, we predict sales units for two youtube advertising budgets: 0 and 1000.
newdata <- data.frame(youtube = c(0, 1000))
model %>% predict(newdata)
## 1 2
## 8.58915 55.30554
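These values can be reproduced by hand from the fitted equation sales = 8.59 + 0.047*youtube, using the coefficients shown above (a quick check, nothing more):
coef(model)["(Intercept)"] + coef(model)["youtube"] * c(0, 1000)
# gives 8.58915 and 55.30554, matching predict()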
Multiple linear regression is an extension of simple linear regression for predicting an outcome variable (y) on the basis of multiple distinct predictor variables (x).
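Written out, the model fitted below has the form:
sales = b0 + b1*youtube + b2*facebook + b3*newspaper
where b0 is the intercept and b1, b2, b3 are the regression coefficients estimated from the training data.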
You can compute the multiple linear regression as follows:
model <- lm(sales ~ youtube + facebook + newspaper, data = train.data)
summary(model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.594141778 0.420814685 8.5409134 1.054115e-14
## youtube 0.044635905 0.001552128 28.7578721 1.125356e-64
## facebook 0.188823227 0.009528828 19.8159966 1.090250e-44
## newspaper 0.002839757 0.006441963 0.4408217 6.599447e-01
You can also make predictions for new data:
newdata <- data.frame(youtube = 2000, facebook = 1000, newspaper = 1000)
model %>% predict(newdata)
## 1
## 284.5289
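As a quick check, the same value falls out of the model equation directly: the sum of each coefficient times the corresponding budget (with 1 for the intercept).
sum(coef(model) * c(1, 2000, 1000, 1000))
# 3.594 + 0.0446*2000 + 0.1888*1000 + 0.0028*1000 = 284.5289, matching predict()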
Let's look at the full model summary:
summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook + newspaper, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7142 -0.9939 0.3684 1.4494 3.3619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.594142 0.420815 8.541 1.05e-14 ***
## youtube 0.044636 0.001552 28.758 < 2e-16 ***
## facebook 0.188823 0.009529 19.816 < 2e-16 ***
## newspaper 0.002840 0.006442 0.441 0.66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.043 on 158 degrees of freedom
## Multiple R-squared: 0.8955, Adjusted R-squared: 0.8935
## F-statistic: 451.2 on 3 and 158 DF, p-value: < 2.2e-16
Have a look at the residuals: the median should not be far from 0, and in this case it is not (0.37).
The first step in interpreting the multiple regression analysis is to examine the F-statistic and the associated p-value, at the bottom of the model summary.
In our example, the p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable.
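The overall p-value is not stored directly in the printed summary, but it can be recomputed from the F-statistic if needed (a minimal sketch using base R):
fstat <- summary(model)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
# extremely small, consistent with the "< 2.2e-16" shown in the summary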
We found that newspaper is not significant in the multiple regression model. This means that, for a fixed amount of youtube and facebook advertising budget, changes in the newspaper advertising budget will not significantly affect sales units.
As the newspaper variable is not significant, it is possible to remove it from the model:
model <- lm(sales ~ youtube + facebook, data = train.data)
summary(model)
##
## Call:
## lm(formula = sales ~ youtube + facebook, data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8127 -1.0073 0.3236 1.4643 3.3454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.658580 0.393609 9.295 <2e-16 ***
## youtube 0.044650 0.001548 28.846 <2e-16 ***
## facebook 0.190165 0.009006 21.114 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.038 on 159 degrees of freedom
## Multiple R-squared: 0.8954, Adjusted R-squared: 0.894
## F-statistic: 680.2 on 2 and 159 DF, p-value: < 2.2e-16
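The decision to drop newspaper can also be checked formally with a partial F-test comparing the two nested models (a sketch; full.model and reduced.model are just illustrative names for the fits above):
full.model <- lm(sales ~ youtube + facebook + newspaper, data = train.data)
reduced.model <- lm(sales ~ youtube + facebook, data = train.data)
anova(reduced.model, full.model) # large p-value: newspaper adds nothing significant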
Finally, our model equation can be written as follows: sales = 3.66 + 0.045*youtube + 0.19*facebook.
Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostics by checking how well the model fits the data. This process is also referred to as assessing the goodness-of-fit, using the residual standard error (RSE), the R-squared, and the F-statistic with its p-value.
The RSE, also known as the model sigma or the prediction error, represents roughly the average difference between the observed outcome values and the values predicted by the model. The lower the RSE, the better the model fits our data. Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible.
In this case
2.038/mean(train.data$sales)
## [1] 0.1208938
This is about 12%, which is low.
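Rather than copying the RSE from the printed summary, it can be extracted programmatically with sigma() (available in base R), which avoids rounding:
sigma(model) / mean(train.data$sales)
# about 0.12, i.e. a 12% prediction error rate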
Evaluating the model on the test data has two parts:
1. Make predictions.
2. Assess the model performance using metrics such as RMSE and R-squared.
# make predictions on the test set
predictions <- model %>% predict(test.data)
predictions
## 2 3 6 14 15 24 32 40
## 15.011084 15.054460 15.283625 10.616936 22.101982 19.747444 13.678412 24.477896
## 41 46 51 58 65 66 68 81
## 19.597351 18.174899 15.071277 15.337583 20.449806 9.477844 14.431149 13.844987
## 86 91 93 104 107 116 124 129
## 18.209084 11.972545 22.967589 17.651272 7.508261 15.669379 18.149940 26.643971
## 134 137 138 139 148 156 168 178
## 23.080107 13.929963 24.918356 11.872856 27.870953 6.525358 15.925554 14.557842
## 181 182 183 184 189 192
## 12.642522 16.598079 7.970506 28.880715 22.154415 10.168411
# compute RMSE
RMSE(predictions, test.data$sales)
## [1] 1.949125
# compute R-squared
R2(predictions, test.data$sales)
## [1] 0.9069336
Both measures indicate a good fit on the test data.
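For reference, the caret helpers used above correspond to simple base-R computations (a minimal sketch; caret's R2() uses the squared correlation by default):
sqrt(mean((test.data$sales - predictions)^2)) # RMSE
cor(test.data$sales, predictions)^2           # R-squared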
Note that the linear regression model assumes a linear relationship between the outcome and the predictor variables. This can be checked by creating a scatterplot of the outcome variable vs. each predictor variable, e.g. sales units vs. youtube budget:
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In this case the relationship is approximately linear; the same check can be repeated for the other predictor variables.
Potential problems include: (a) the presence of influential observations in the data, (b) non-linearity between the outcome and some predictor variables, and (c) the presence of strong correlation between predictor variables (multicollinearity).
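These potential problems can be screened quickly with R's built-in diagnostic plots for lm objects (a minimal sketch using base graphics):
par(mfrow = c(2, 2)) # 2 x 2 panel layout
plot(model)          # residuals vs fitted, Q-Q, scale-location, residuals vs leverage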