Linear Regression Analysis

Linear regression is a useful tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches, many of which extend or generalize regression. The R libraries that we will need for this lesson are loaded below.

library(MASS)
library(ISLR2)
library(car)

Model Building

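# Assumes a data frame named advertising is already in the workspace,
# with columns Sales, TV, Radio, and Newspaper
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=advertising)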
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
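
The columns of the coefficient table are linked by simple formulas, and a short sketch can make that concrete. The block below recomputes the t values, two-sided p-values, 95% confidence intervals, and adjusted R-squared from the rounded numbers printed above. It hard-codes those numbers instead of using lm.fit, so small rounding differences are expected, and the variable names (est, se, df_resid, and so on) are illustrative only.

```r
# Values copied from the summary(lm.fit) output above (rounded as printed)
est <- c(TV = 0.045765, Radio = 0.188530, Newspaper = -0.001037)
se  <- c(TV = 0.001395, Radio = 0.008611, Newspaper = 0.005871)
df_resid <- 196            # residual degrees of freedom
n  <- df_resid + 4         # 3 predictors plus the intercept, so n = 200
r2 <- 0.8972               # multiple R-squared

t_val <- est / se                           # t value = estimate / std. error
p_val <- 2 * pt(-abs(t_val), df_resid)      # two-sided p-value
crit  <- qt(0.975, df_resid)                # critical value for a 95% interval
ci    <- cbind(lower = est - crit * se, upper = est + crit * se)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(round(t_val, 2))
print(round(ci, 4))
print(round(adj_r2, 4))
```

The interval for Newspaper spans zero, which agrees with its large p-value. With the fitted model available, confint(lm.fit) should give the same intervals without the rounding.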

Model Validation

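# Fitted values from the first model
predict.sales <- lm.fit$fitted.values
# Fitted vs. observed sales; points near the 45-degree line indicate a good fit
plot(predict.sales, advertising$Sales)

# Standard lm diagnostic plots, arranged in a 2 x 2 grid
par(mfrow = c(2,2))
plot(lm.fit)

# Residuals used by the tests below
residuals <- lm.fit$residuals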

# Testing for Multicollinearity
vif(lm.fit)
##        TV     Radio Newspaper 
##  1.004611  1.144952  1.145187
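
For intuition about what vif() reports: for a model with only numeric predictors, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. Values close to 1, like those above, suggest little collinearity. The sketch below reproduces that calculation in base R on simulated data (not the advertising data); the helper vif_by_hand and the variables x1, x2, x3 are hypothetical names used only for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)    # deliberately correlated with x1
x3 <- rnorm(n)               # independent of the other two
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Regress each predictor on the others and convert R-squared to a VIF
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(dat, c("x1", "x2", "x3"))
print(round(vifs, 3))
```
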
# Testing for Autocorrelation
durbinWatsonTest(lm.fit)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.04687792      2.083648   0.606
##  Alternative hypothesis: rho != 0
# Testing for Normality
shapiro.test(residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.91767, p-value = 3.939e-09
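# Testing for Equality of Variance
# leveneTest() compares variances across groups defined by a factor; for a
# regression with continuous predictors, ncvTest(lm.fit) from car is a closer fit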
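
The equality-of-variance check is named above but not run. One common way to test for non-constant error variance in a regression is a Breusch-Pagan style test: regress the squared residuals on the predictors and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The sketch below does this by hand in base R (the studentized, or Koenker, form) on simulated data whose error spread grows with x. This is a swapped-in illustration, not necessarily the test the lesson intended, and all names here (fit, aux, lm_stat) are hypothetical.

```r
set.seed(42)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)   # error spread grows with x
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic: n * R-squared, chi-squared with df = number of predictors
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat("Statistic:", round(lm_stat, 2), " p-value:", signif(p_value, 3), "\n")
```

A small p-value is evidence against constant variance. The car package loaded earlier provides ncvTest() as a packaged score test for the same question.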

Transformation of Variables

advertising$log.Sales <- log(advertising$Sales)

lm.fit2 <- lm(log.Sales ~ TV + Radio + Newspaper, data=advertising)
summary(lm.fit2)
## 
## Call:
## lm(formula = log.Sales ~ TV + Radio + Newspaper, data = advertising)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.74201 -0.05892  0.04715  0.10553  0.20666 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.7390020  0.0345727  50.300   <2e-16 ***
## TV          0.0036697  0.0001546  23.735   <2e-16 ***
## Radio       0.0118019  0.0009545  12.365   <2e-16 ***
## Newspaper   0.0003544  0.0006508   0.545    0.587    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1868 on 196 degrees of freedom
## Multiple R-squared:  0.7998, Adjusted R-squared:  0.7967 
## F-statistic: 260.9 on 3 and 196 DF,  p-value: < 2.2e-16
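
Because the response is now log(Sales), each coefficient describes an approximate proportional change in Sales instead of an additive one: a one-unit increase in a predictor multiplies the predicted Sales by exp(coefficient). The sketch below converts the printed coefficients to percent changes and back-transforms one prediction to the original scale. The budget values (TV = 100, Radio = 20, Newspaper = 0) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(lm.fit2) output above
intercept <- 1.7390020
coefs_log <- c(TV = 0.0036697, Radio = 0.0118019, Newspaper = 0.0003544)

# Percent change in Sales for a one-unit increase in each predictor
pct_change <- 100 * (exp(coefs_log) - 1)
print(round(pct_change, 3))

# Back-transform a prediction for a hypothetical budget
budget <- c(TV = 100, Radio = 20, Newspaper = 0)
pred_log   <- intercept + sum(coefs_log * budget)
pred_sales <- exp(pred_log)     # prediction on the original Sales scale
print(round(pred_sales, 2))
```

Two cautions apply. Exponentiating a predicted log value estimates something closer to the median than the mean of Sales. Also, the R-squared of this model is computed on the log scale, so it is not directly comparable with the R-squared of the untransformed model.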

Nonlinear Regression

lm.fit3 <- lm(Sales ~ TV + I(TV^2) + I(TV^3), data=advertising)
summary(lm.fit3)
## 
## Call:
## lm(formula = Sales ~ TV + I(TV^2) + I(TV^3), data = advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9734 -1.8900 -0.0897  2.0189  7.3765 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.420e+00  8.641e-01   6.272 2.23e-09 ***
## TV           9.643e-02  2.580e-02   3.738 0.000243 ***
## I(TV^2)     -3.152e-04  2.022e-04  -1.559 0.120559    
## I(TV^3)      5.572e-07  4.494e-07   1.240 0.216519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.6162 
## F-statistic: 107.5 on 3 and 196 DF,  p-value: < 2.2e-16
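
Although the fitted curve is nonlinear in TV, the model is still linear in its coefficients, which is why lm() can fit it. In the output above, neither the squared nor the cubic term is individually significant, so a natural follow-up is an F-test comparing the cubic model with the plain linear one. The sketch below shows that comparison with anova() and also checks that poly() gives the same fitted values as the I() terms. It uses simulated data with a truly linear signal, and every name in it (dat, fit_lin, fit_raw, fit_orth) is hypothetical.

```r
set.seed(7)
n <- 200
x <- runif(n, 0, 300)
y <- 5 + 0.05 * x + rnorm(n, sd = 3)   # simulated, truly linear relationship
dat <- data.frame(x, y)

fit_lin  <- lm(y ~ x, data = dat)
fit_raw  <- lm(y ~ x + I(x^2) + I(x^3), data = dat)   # raw powers
fit_orth <- lm(y ~ poly(x, 3), data = dat)            # orthogonal polynomials

# Both parameterizations span the same column space, so the fits agree
same_fit <- isTRUE(all.equal(unname(fitted(fit_raw)),
                             unname(fitted(fit_orth)),
                             tolerance = 1e-6))
print(same_fit)

# F-test: do the two extra terms improve on the linear model?
cmp <- anova(fit_lin, fit_raw)
print(cmp)
```

Orthogonal polynomials change the individual coefficients and their t-tests but not the fitted values, and they tend to be numerically better behaved when the predictor takes large values.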

Interaction Effects

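# TV*Radio already expands to TV + Radio + TV:Radio, so this is the same
# model as Sales ~ TV*Radio
lm.fit4 <- lm(Sales ~ TV + Radio + TV*Radio, data=advertising)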
summary(lm.fit4)
## 
## Call:
## lm(formula = Sales ~ TV + Radio + TV * Radio, data = advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## Radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:Radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16
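# Fitted values from the interaction model
predict.w.inter <- lm.fit4$fitted.values

# Side by side: fitted vs. observed sales for the first model (left)
# and the interaction model (right)
par(mfrow=c(1,2))
plot(predict.sales, advertising$Sales)
plot(predict.w.inter, advertising$Sales)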
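
With an interaction term in the model, the effect of TV on Sales is no longer a single number: the slope for TV is the TV coefficient plus the TV:Radio coefficient times the Radio value, and the same holds for Radio with the roles swapped. The sketch below works this out from the coefficients printed in the summary(lm.fit4) output. The Radio and TV levels are hypothetical values picked for illustration.

```r
# Coefficients copied from the summary(lm.fit4) output above
b_tv    <- 1.910e-02
b_radio <- 2.886e-02
b_int   <- 1.086e-03

# Slope of Sales with respect to TV at several hypothetical Radio levels
radio_levels <- c(0, 10, 25, 50)
tv_slope <- b_tv + b_int * radio_levels
print(data.frame(Radio = radio_levels, TV_slope = round(tv_slope, 4)))

# Slope of Sales with respect to Radio at several hypothetical TV levels
tv_levels <- c(0, 100, 200)
radio_slope <- b_radio + b_int * tv_levels
print(data.frame(TV = tv_levels, Radio_slope = round(radio_slope, 4)))
```

Under these coefficients, the TV slope at Radio = 50 is several times larger than at Radio = 0, so the main-effect coefficients alone understate how the two predictors work together.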