Linear Regression

Linear regression is a classical technique that is used to predict a quantitative response $y$ that results from one or more predictor variables $x_j$.

Simple linear regression is an approach for predicting a quantitative response $y$ on the basis of a single predictor variable $x$. It assumes that there is approximately a linear relationship between $x$ and $y$. If such a relationship exists, it can be written as \[ y = \beta_0 + \beta_1x + \varepsilon,\] where $y$ represents the response variable, $\beta_0$ represents the $y$-intercept, $\beta_1$ is the coefficient representing the slope of the regression line and $\varepsilon$ represents a random error term with mean 0.

We build a model to attempt to capture the relationship which is given by \[\hat{y} = b_0 + b_1x,\] where $b_0$ predicts $\beta_0$ and $b_1$ predicts $\beta_1$.

The lm command (which stands for linear model) finds the best fit linear relationship to the data using the criteria of minimizing the sum of the squared residuals. The mathematics of linear regression can be found here.

The Example

For this section, we use the marketing data set from the library datarium. It is a dataset containing data on the impact of three advertising medias (youtube, facebook and newspaper) on sales. The first three columns are the advertising budget in thousands of dollars for each of the three medias while the fourth column is sales. The data set has 200 rows.

library(datarium)
dat <- marketing
head(dat,5)

##   youtube facebook newspaper sales
## 1  276.12    45.36     83.04 26.52
## 2   53.40    47.16     54.12 12.48
## 3   20.64    55.08     83.16 11.16
## 4  181.80    49.56     70.20 22.20
## 5  216.96    12.96     70.08 15.48

Variable	Description
youtube	Thousands of dollars spent on YouTube advertising.
facebook	Thousands of dollars spent on Facebook advertising.
newspaper	Thousands of dollars spent on newspaper advertising.
sales	Sales data in units sold.

Exploratory Analysis

In this example, we wish to quantify the effect of advertising on total sales for a company using the marketing data set.

Initially, the best thing to do is to create a diagram of how sales are related to each of the advertising methods.

plot(dat, pch=20)

The bottom row shows a sequence of graphs with sales on the horizontal axis and the various advertising methods on the vertical axis. It appears that there is a linear relationship between sales and YouTube expenditure as well as a positive relationship between sales and Facebook expenditure.

Using ggplot, we can create a similar graphic that also displays distributions of data and correlations. One can see that there is a high degree of correlation between sales and YouTube advertising with decreasing correlation between sales and both Facebook advertising and newspaper advertising.

library(ggplot2)
library(GGally)
ggpairs(dat)

Thus, since the correlation between sales and YouTube spending seems to be positive and linear, we explore further.

ggplot(dat, aes(x=youtube, y=sales)) + 
  geom_point() +
  geom_smooth(method=lm) +
  labs(x="YouTube Advertising Expenditure ($1000)",
       y="Total Sales (Units)",
       title="Sales and YouTube Advertising") +
  ggthemes::theme_tufte()

There appears to be a linear relation between sales and YouTube advertising.

ggplot(dat, aes(x=facebook, y=sales)) + 
  geom_point() +
  geom_smooth(method=lm) +
  labs(x="Facebook Advertising Expenditure ($1000)",
       y="Total Sales (Units)",
       title="Sales and YouTube Advertising") +
  ggthemes::theme_tufte()

There appears to be a positive relationship between Facebook advertising expenditure and sales - though, not nearly as strong as with YouTube.

Data Preparation

For now, we concentrate on the relationship between YouTube spending and sales.

We begin with a training and testing set. Formally, a training set is a set of data used to discover potentially predictive relationships.

A testing set is a set of data used to assess the strength and utility of a predictive relationship. For now, we use a 60% / 40% split where we train our model on 60% of the data and then test the model performance on 40% of the data that is withheld. (At this point, the percentages are not important).

set.seed(123456)
sample <- sample(c(TRUE, FALSE), nrow(dat), replace = T, prob = c(0.6,0.4))
train <- dat[sample, ]
test <- dat[!sample, ]

Our Model

We use a simple linear regression model to estimate sales based on the number of thousands of dollars spent on advertising on YouTube. This relationship can be written as \[ y = \beta_0 + \beta_1x + \varepsilon,\] where $y$ represents the number of units sold by the company, $\beta_0$ represents sales when no money is spent on YouTube advertising, $\beta_1$ represents the increase in sales due to a one-unit ($1000) increase in YouTube advertising ($x$) and $\varepsilon$ represents the error not captured by the model. The model used to predict this relationship is given by \[\hat{y} = b_0 + b_1x,\] where $b_0$ predicts $\beta_0$ and $b_1$ predicts $\beta_1$. To build the model, we use the lm function with the following syntax.

LinearModel <- lm(sales ~ youtube, data = train)
LinearModel

## 
## Call:
## lm(formula = sales ~ youtube, data = train)
## 
## Coefficients:
## (Intercept)      youtube  
##     8.85891      0.04322

From the output, the model found is \[\hat{y} = 8.85891 + 0.04322x\]

Analysis of the Model

Our model is \[\hat{y} = 8.85891 + 0.04322x.\] Thus, if we spend no money on YouTube advertising, we have predicted sales of 8.85891 units. For every additional us unit increase in $x$ (representing a $1000 increase in money spent), we see an increase in sales of 0.04322 units.

Standard Errors & Hypothesis Testing

We can perform hypothesis tests on the regression coefficients.

The null hypothesis ($H_0$): The coefficients are equal to zero i.e., there is no relationship between the predictor and response variables.
The alternate hypothesis ($H_1$): The coefficients are not equal to zero i.e., there is a relationship between the predictor and response variables.

To test the null hypothesis, we determine if our estimate for $\beta_1$ is sufficiently far from zero (implying that is is unlikely to be 0). If the standard error of $b_1$ is sufficiently small, then even small values of our estimate of $b_1$ provide evidence against the null hypothesis. The t-statistic measures the number of standard deviations $b_1$ is away from 0. We can compute the standard error (SE) with: \[SE(b_0) = \sqrt{\frac{1}{n-2}\sum_{i=1}^n(y_i - \hat{y}_i)^2\left[\dfrac{\sum_{i=1}^n \bar{x}^2}{n\sum_{i=1}^n (x_i - \bar{x})^2} \right]}, \;\;\;\;\;\; SE(b_1) = \sqrt{\dfrac{\frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}.\]

The standard error measures the degree to which our coefficient estimates vary from the actual value of our response variable. The standard error can be used to compute the confidence intervals. Confidence intervals show the uncertainty in the regression coefficients. A confidence interval around $b_1$, for example, is \[b_1 \pm t_{\alpha/2} \cdot SE(b_1).\] We can be $(1-\alpha)\times 100$% confident that the true value of $\beta_1$ is on that interval. Using R, we construct 95% confidence intervals for the coefficients using the confint command.

confint(LinearModel, level = 0.95)

##                  2.5 %     97.5 %
## (Intercept) 7.35566312 9.52256140
## dat$youtube 0.04223072 0.05284256

We can be 95% certain that the true intercept is on the interval $[7.53, 10.19]$ and the true slope is on the interval $[0.036, 0.050]$. In particular, neither interval contains 0, so we can be 95% certain that the true value of these coefficients is not 0 (and hence there is a positive relationship between the variables). Therefore, we can reject the null hypothesis in favor of the alternative in the case of both variables.

Other Summary Statistics

We use the summary command in order to view all of the various statistical values computed for the model.

summary(LinearModel)

## 
## Call:
## lm(formula = dat$sales ~ dat$youtube, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0632  -2.3454  -0.2295   2.4805   8.6548 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.439112   0.549412   15.36   <2e-16 ***
## dat$youtube 0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

We examine each of the statistics:

Residuals: Residuals are the difference between the actual and estimated values. The distribution of our residuals should ideally be symmetrical. A 5-number summary of the residuals is provided here.
Coefficients: Our coefficients $b_0$ and $b_1$ represent estimates of the intercept and slope respectively.
The coefficient standard error measures how much our coefficient estimates vary from the actual average value of our response variable. In other words, it measures the accuracy of coefficient estimates. Standard error values near 0 are best.
The coefficient $t$-value measures how many standard deviations our coefficient estimate is from 0, \[t = \dfrac{b_j}{SE(b_j)}.\] A large $t$-value, relative to standard error, would provide evidence against the null hypothesis and indicate that a relationships exists between the predictor and response variables. Predictors with low t-statistics can be dropped. Ideally, the $t$-value should be greater than 1.96 for a $p$-value to be less than 0.05. In this case, they are. That provides evidence against the null hypothesis.
The coefficient — $Pr(>t)$ represents the $p$-value or the probability of observing a value larger than $t$. The smaller the $p$-value, the more likely we are to reject the null hypothesis. A $p$-value of 5% or less is typically a good cut-off point. Note the ‘Signif. Codes’ associated to each estimate. Three asterisks represent a highly significant $p$-value. Since the relationship between sales and YouTube advertising is highly significant, we can reject our null hypothesis.
Residual standard error: This measures the quality of our regression fit. It is the average amount the response variable (sales) will vary from the true regression line.
Multiple $R$-squared: Besides the $t$-statistic and $p$-value, this is our most important metric for measuring regression model fit. $R^2$ measures the linear relationship between our predictor variable (YouTube advertising) and our response / target variable (sales). $R^2$ is always between 0 and 1. A number near 0 represents a regression that does not explain the variance in the response variable well and a number close to 1 does explain the observed variance in the response variable. In our example, the adjusted $R^2$ (which adjusts for degrees of freedom) is 0.5714 indicating that 57.14% of sales can be explained by spending on YouTube advertising. If we perform a multiple regression, we will find that the $R^2$ will increase with an increase in the number of response variables.
$F$-statistic: This is a good indicator of whether there is a relationship between $y$ and $x$. It tests to see if at least one predictor variable is non-zero. Of course, this becomes increasingly important as we add variables to the regression. The further our $F$-statistic is away from 1, the better our regression model. In our example, the $F$-statistic is 158.3, which is relatively larger than 1 given the size of our data set (200 observations). The $F$-statistic is more relevant in a multiple regression model. The small $p$-value associated with it $< 2.2 \times 10^{-16}$ indicates that it is likely that the $x$ variable is associated with the $y$ variable.

The key indicators of model fit are the $t$-statistic, $p$-value, $R^2$ and $F$-statistic. The larger our $t$-statistic, the smaller the $p$-value. The smaller the $p$-value, the greater the odds of a relationship between $x$ and $y$. $R^2$ measures how well a model fits the data — if $R^2$ is close to 1, then this indicates that a large proportion of the variation in $y$ can be explained by $x$. The $F$-statistic shows the overall significance of the model. A large $F$-statistic will correspond to a statistically significant $p$-value ($p < 0.05$).

Visual Assessment of the Model

It is not only important to understand quantitative measures of our model accuracy but we must also understand visual approaches. It is important to visualize our model, when possible. For simple linear regression, we graph it as we did above.

We now take a look at the residuals. If we use the plot command with our LinearModel output, we see a series of four plots:

Residuals vs. Fitted: When a linear model is appropriate, the residuals will have constant variance when plotted against fitted values; and the residuals and fitted values will be uncorrelated. Thus, we expect a random assortment of points with no discernible pattern. If there are trends in the residual plot, or the plot looks like a funnel, these indicate that a linear model is inappropriate. If the blue line is not horizontal, then non-linearity is a concern. If there is a pattern to our points, then heteroskedasticity is a concern and it may not be true that $Var(\varepsilon_i) = \sigma^2$.
Normal QQ plot: You can use a linear model for prediction even if the underlying normality assumptions don’t hold. However, in order for the p-values to be believable, the residuals from the regression must look approximately normally distributed. The closer the points hug the 45-degree line, the more we interpret that the residuals are normally distributed.
Scale-location plot: This is another version of the residuals vs fitted plot. There should be no discernible trends in this plot.
Residuals vs Leverage: Leverage measures how much an observation influenced the model. It summarizes how different the model fit would be if the given observation was excluded, compared to the model fit where the observation is included. Points with high residuals and high leverage are outliers. They skew the model fit.

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(LinearModel)

In our example,

The residual vs fitted plot has no indication of a pattern. So, a linear model seems appropriate here.
The normal QQ plot stays pretty close to the 45 degree line.
The scale-location plot shows a clear increasing trend in residual variance that runs through most of the plot. This is indicated by the upward slope of the red line, which we can interpret as the standard deviation of the residuals at the given level of fitted value.
Residuals vs leverage plot does not show any outliers.

Testing the Model

Now that we are reasonably happy with our model, we can see how it fits the testing data. Since the goal of linear regression is often to make predictions on new data, using our testing data will help us make an assessment of this.

library(modelr)
library(broom)
test <- test %>%
  add_predictions(LinearModel)
head(test)

##    youtube facebook newspaper sales      pred
## 1   276.12    45.36     83.04 26.52 20.792945
## 2    53.40    47.16     54.12 12.48 11.166886
## 9    10.32     2.52      1.20  5.76  9.304949
## 11   79.32     6.96     29.04 10.32 12.287160
## 13   28.56    42.12     79.08 11.04 10.093290
## 14  117.00     9.12      8.64 11.64 13.915707

The primary concern is to assess if the out-of-sample mean squared error (MSE) (the mean squared prediction error), is substantially higher than the in-sample mean square error, as this is a sign of model deficiency. The MSE is \[MSE = \dfrac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2.\]

We compare the test sample MSE to the training sample MSE below and note that the difference is small. This exercise becomes more powerful when you are comparing multiple models as it helps avoid overfitting issues.

test %>% 
  add_predictions(LinearModel) %>%
  summarise(MSE = mean((sales - pred)^2))

##        MSE
## 1 17.73662

train %>% 
  add_predictions(LinearModel) %>%
  summarise(MSE = mean((sales - pred)^2))

##        MSE
## 1 13.89657

We also see that there is a nearly 81% correlation between sales and the predicted sales value.

cor(cbind(test$sales,test$pred))

##           [,1]      [,2]
## [1,] 1.0000000 0.8086526
## [2,] 0.8086526 1.0000000

Another metric we could use is min max accuracy which takes the smallest of the predicted and actual values of each row then divides that number by the larger of the 2 numbers. Finally, this ratio is averaged across all rows. \[MinMaxAccuracy = \dfrac{1}{n}\sum_{i=1}^n\left(\dfrac{\min(y_i,\hat{y}_i)}{\max(y_i,\hat{y_i})} \right).\] The closer this number is to 1, the better as this would indicate that most minimum values are fairly close to most maximum values. Our output, 0.8317156, seems very good for our test data.

 mean(apply(cbind(test$sales,test$pred), 1, min) / apply(cbind(test$sales,test$pred), 1, max))

## [1] 0.8317156

Finally, we calculate MAPE (mean absolute percentage error), \[MAPE = \dfrac{1}{n}\sum_{i=1}^n \dfrac{|\hat{y}_i - y_i|}{y_i}.\]

mean(abs((test$sales - test$pred))/test$sales)

## [1] 0.1944688

Here, we have a small MAPE of 19.45%. Thus, on average the error between the predicted and actual values is around 19% of the sales value.

Citations

Prabhakaran, Selva. (2016 - 2017) r-statistics.co . http://r-statistics.co/Linear-Regression.html

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.

Linear Regression

OC Data Science

Linear Regression

The Example

Exploratory Analysis

Data Preparation

Our Model

Analysis of the Model

Standard Errors & Hypothesis Testing

Other Summary Statistics

Visual Assessment of the Model

Testing the Model

Citations