Linear regression is a simple and widely used tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to more modern statistical learning approaches, it remains useful in practice and serves as a good jumping-off point for newer methods, many of which extend or generalize regression. The R libraries needed for this lesson are loaded below.
library(MASS)
library(ISLR2)
library(car)
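# Point the working directory at the folder that contains Advertising.csv
# (this path is specific to the author's machine)
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
advertising <- read.csv(".\\Advertising.csv")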
names(advertising)
## [1] "X" "TV" "Radio" "Newspaper" "Sales"
head(advertising)
##   X    TV Radio Newspaper Sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2
# Scatterplot matrix of TV, Radio, Newspaper, and Sales (column 1 is the row index X)
plot(advertising[,c(2:5)], col="#69b3a2")
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data=advertising)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = advertising)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.8277 -0.8908  0.2418  1.1893  2.8292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
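The numbers printed by summary() can also be pulled out of the fitted object for further use. The following is a minimal sketch on simulated data, since Advertising.csv is not assumed to be available here; the data frame `sim` and its coefficients are made up for illustration and are not the author's data.

```r
# Simulated stand-in for the advertising data (all values are made up)
set.seed(1)
n <- 200
sim <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50))
sim$Sales <- 3 + 0.045 * sim$TV + 0.19 * sim$Radio + rnorm(n, sd = 1.7)

fit <- lm(Sales ~ TV + Radio, data = sim)
s <- summary(fit)

coef_table <- s$coefficients       # estimates, std. errors, t values, p-values
ci <- confint(fit, level = 0.95)   # 95% confidence intervals for each coefficient
rse <- s$sigma                     # residual standard error
adj_r2 <- s$adj.r.squared          # adjusted R-squared

print(round(coef_table, 4))
print(round(ci, 4))
```

Applied to the lesson's model, the same calls would be `summary(lm.fit)$coefficients` and `confint(lm.fit)`.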
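# Fitted values plotted against observed sales
predict.sales <- lm.fit$fitted.values
plot(predict.sales, advertising$Sales)
# Standard diagnostic plots for the fit, arranged in a 2 x 2 grid
par(mfrow = c(2,2))
plot(lm.fit)
# Residuals, kept for the assumption checks that follow
residuals <- lm.fit$residuals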
# Testing for Multicollinearity
vif(lm.fit)
##        TV     Radio Newspaper
##  1.004611  1.144952  1.145187
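For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. Values near 1, like those above, indicate little collinearity. The sketch below computes it by hand in base R on simulated predictors; the names `x1`, `x2`, `x3` are hypothetical.

```r
# Simulated predictors; x2 is deliberately correlated with x1
set.seed(2)
n <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_by_hand, 3))
```

With numeric predictors only, this should agree with what `vif()` from car reports for a model that uses the same predictors.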
# Testing for Autocorrelation
durbinWatsonTest(lm.fit)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.04687792      2.083648   0.606
##  Alternative hypothesis: rho != 0
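The Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals, and it is approximately 2(1 - r), where r is the lag-1 autocorrelation. In the output above, 2 × (1 + 0.0469) ≈ 2.09, close to the reported 2.08; a value near 2 suggests no first-order autocorrelation. A minimal base-R sketch on simulated data (the variable names are illustrative):

```r
# Simulated regression with independent errors
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
e <- resid(fit)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

print(c(dw = dw, approx = 2 * (1 - r1)))
```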
# Testing for Normality
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.91767, p-value = 3.939e-09
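A p-value this small means the Shapiro-Wilk test rejects normality of the residuals. A normal Q-Q plot usually helps show where the departure comes from. The sketch below contrasts simulated normal residuals with simulated left-skewed ones; both vectors are made up for illustration.

```r
# Two simulated residual vectors: one normal, one left-skewed
set.seed(4)
normal_res <- rnorm(200)
skewed_res <- -rexp(200)

p_normal <- shapiro.test(normal_res)$p.value
p_skewed <- shapiro.test(skewed_res)$p.value
print(c(normal = p_normal, skewed = p_skewed))

# Q-Q plot of the skewed residuals: the points bend away from the reference line
qqnorm(skewed_res)
qqline(skewed_res)
```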
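# Testing for Equality of Variance
# leveneTest() compares variances across groups defined by a factor, so it does not apply directly to this fit;
# for a regression model, ncvTest(lm.fit) from car is one option.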
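One way to check constant variance for a regression fit without extra packages is a Breusch-Pagan style test. Note that this is a swapped-in technique (the studentized, Koenker form), not the test named in the comment above. It regresses the squared residuals on the predictors and compares n × R² to a chi-squared distribution. The data below are simulated so that the error spread grows with `x`.

```r
# Simulated data with non-constant error variance
set.seed(5)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
e2 <- resid(fit)^2
aux <- lm(e2 ~ x)

# LM statistic = n * R^2; degrees of freedom = number of predictors in aux
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
print(c(statistic = lm_stat, p.value = p_value))
```

A small p-value is evidence against equal variance.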
# Refit the model with a log-transformed response
advertising$log.Sales <- log(advertising$Sales)
lm.fit2 <- lm(log.Sales ~ TV + Radio + Newspaper, data=advertising)
summary(lm.fit2)
##
## Call:
## lm(formula = log.Sales ~ TV + Radio + Newspaper, data = advertising)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.74201 -0.05892  0.04715  0.10553  0.20666
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7390020  0.0345727  50.300   <2e-16 ***
## TV          0.0036697  0.0001546  23.735   <2e-16 ***
## Radio       0.0118019  0.0009545  12.365   <2e-16 ***
## Newspaper   0.0003544  0.0006508   0.545    0.587
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1868 on 196 degrees of freedom
## Multiple R-squared: 0.7998, Adjusted R-squared: 0.7967
## F-statistic: 260.9 on 3 and 196 DF, p-value: < 2.2e-16
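With a log-transformed response, a coefficient b corresponds to a change of about 100 × (exp(b) - 1) percent in the response per unit of the predictor. For the TV coefficient above, that is roughly 0.37% per unit. Predictions have to be exponentiated to return to the original scale. The sketch below uses simulated data, and the names `tv` and `sales` are illustrative.

```r
# Simulated data where log(sales) is linear in tv
set.seed(6)
n <- 200
tv <- runif(n, 0, 300)
sales <- exp(1.7 + 0.004 * tv + rnorm(n, sd = 0.2))

fit_log <- lm(log(sales) ~ tv)

# Percent change in sales associated with one extra unit of tv
b_tv <- unname(coef(fit_log)["tv"])
pct_change <- 100 * (exp(b_tv) - 1)

# Back-transform predictions from the log scale to the sales scale
new_data <- data.frame(tv = c(50, 150, 250))
pred_sales <- exp(predict(fit_log, newdata = new_data))

print(pct_change)
print(pred_sales)
```

Exponentiating a log-scale prediction estimates the median rather than the mean of the response, which matters when the residual variance is large.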
# Cubic polynomial in TV; I() keeps ^ as arithmetic inside the formula
lm.fit3 <- lm(Sales ~ TV + I(TV^2) + I(TV^3), data=advertising)
summary(lm.fit3)
##
## Call:
## lm(formula = Sales ~ TV + I(TV^2) + I(TV^3), data = advertising)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.9734 -1.8900 -0.0897  2.0189  7.3765
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.420e+00  8.641e-01   6.272 2.23e-09 ***
## TV           9.643e-02  2.580e-02   3.738 0.000243 ***
## I(TV^2)     -3.152e-04  2.022e-04  -1.559 0.120559
## I(TV^3)      5.572e-07  4.494e-07   1.240 0.216519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 196 degrees of freedom
## Multiple R-squared: 0.622, Adjusted R-squared: 0.6162
## F-statistic: 107.5 on 3 and 196 DF, p-value: < 2.2e-16
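Neither the quadratic nor the cubic term is individually significant above. A partial F-test via anova() is one way to check whether the higher-order terms help jointly. The sketch below uses simulated data where the true relationship is linear; `poly(tv, 3, raw = TRUE)` is an equivalent way to write the same cubic model.

```r
# Simulated data with a purely linear relationship
set.seed(7)
n <- 200
tv <- runif(n, 0, 300)
sales <- 7 + 0.05 * tv + rnorm(n, sd = 3)

fit_lin <- lm(sales ~ tv)
fit_cub <- lm(sales ~ poly(tv, 3, raw = TRUE))  # same fit as tv + I(tv^2) + I(tv^3)

# Partial F-test: do the squared and cubed terms improve on the linear fit?
cmp <- anova(fit_lin, fit_cub)
print(cmp)
p_extra <- cmp[["Pr(>F)"]][2]
```

A large `p_extra` suggests the extra terms are not needed.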
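# TV*Radio already expands to TV + Radio + TV:Radio, so listing the main effects again is redundant but harmless
lm.fit4 <- lm(Sales ~ TV + Radio + TV*Radio, data=advertising)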
summary(lm.fit4)
##
## Call:
## lm(formula = Sales ~ TV + Radio + TV * Radio, data = advertising)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.3366 -0.4028  0.1831  0.5948  1.5246
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## Radio       2.886e-02  8.905e-03   3.241   0.0014 **
## TV:Radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
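With an interaction term, the effect of TV is no longer a single number: in the fit above, the slope for TV is 0.0191 + 0.001086 × Radio. The sketch below uses simulated data with made-up coefficients. It shows that `TV * Radio` and `TV + Radio + TV:Radio` give the same model, and how the TV slope can be evaluated at different Radio levels.

```r
# Simulated data with a TV x Radio interaction (coefficients are made up)
set.seed(8)
n <- 200
d <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50))
d$Sales <- 6.75 + 0.019 * d$TV + 0.029 * d$Radio +
  0.0011 * d$TV * d$Radio + rnorm(n, sd = 0.9)

# The two formulas specify the same model
fit_a <- lm(Sales ~ TV * Radio, data = d)
fit_b <- lm(Sales ~ TV + Radio + TV:Radio, data = d)
same <- isTRUE(all.equal(coef(fit_a), coef(fit_b)))

# Slope of Sales with respect to TV at a given Radio budget
b <- coef(fit_a)
tv_slope <- function(radio) unname(b["TV"]) + unname(b["TV:Radio"]) * radio
slopes <- tv_slope(c(0, 25, 50))

print(same)
print(slopes)
```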
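# Compare fitted vs. observed sales for the additive model (left) and the interaction model (right)
predict.w.inter <- lm.fit4$fitted.values
par(mfrow=c(1,2))
plot(predict.sales, advertising$Sales)
plot(predict.w.inter, advertising$Sales)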
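The visual comparison can be backed with a few summary numbers. The sketch below is one possible approach, on simulated data with made-up coefficients. It tabulates adjusted R², residual standard error, and AIC for an additive and an interaction model, and adds a 45-degree reference line to each fitted-versus-observed plot.

```r
# Simulated data with a TV x Radio interaction (values are made up)
set.seed(9)
n <- 200
d <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50))
d$Sales <- 6.75 + 0.019 * d$TV + 0.029 * d$Radio +
  0.0011 * d$TV * d$Radio + rnorm(n, sd = 0.9)

fit_add <- lm(Sales ~ TV + Radio, data = d)
fit_int <- lm(Sales ~ TV * Radio, data = d)

# Side-by-side fit statistics
comparison <- data.frame(
  model  = c("additive", "interaction"),
  adj_r2 = c(summary(fit_add)$adj.r.squared, summary(fit_int)$adj.r.squared),
  rse    = c(sigma(fit_add), sigma(fit_int)),
  aic    = c(AIC(fit_add), AIC(fit_int))
)
print(comparison)

# Fitted vs. observed; points close to the dashed line are well predicted
par(mfrow = c(1, 2))
plot(fitted(fit_add), d$Sales, xlab = "Fitted", ylab = "Observed", main = "Additive")
abline(0, 1, lty = 2)
plot(fitted(fit_int), d$Sales, xlab = "Fitted", ylab = "Observed", main = "Interaction")
abline(0, 1, lty = 2)
```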