Problem Set #4: Linear Regression

Problem 1: Complete the tables below

# T Value for the Intercept
-17.5791/6.7584
## [1] -2.601074
# T Value for Speed
3.9324/0.4155
## [1] 9.46426
# Pr(>|t|) for Speed
t_speed=(3.9324/0.4155)
2*pt(-abs(t_speed),df=48)
## [1] 1.488495e-12
# Degrees of Freedom
50 - 2 
## [1] 48
# Multiple R-squared = SSreg/SStotal
21186/(21186+11354)
## [1] 0.6510756
# F-Statistic
f=(21186/1)/(11354/48)
f
## [1] 89.56562
# On 1 and 48 DF, P-Value: 
pf(f, 1, 48,  lower.tail=FALSE)
## [1] 1.490228e-12
# Mean Square for Speed
21186/1
## [1] 21186
# Mean Square for Residual
11354/48
## [1] 236.5417
# F Value for Speed
(21186/1)/(11354/48)
## [1] 89.56562
# P for F Value 
pf((21186/1)/(11354/48), 1, 48, lower.tail=FALSE)
## [1] 1.490228e-12
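
The hand calculations above can be cross-checked against a fitted model. The sketch below assumes the table values come from R's built-in `cars` data (50 observations of stopping distance against speed); the problem statement here does not name the data set, so treat that as an assumption. It also illustrates why the t-test and F-test p-values for speed nearly coincide: in simple linear regression F equals t squared, and the small gap between the hand-computed values comes from rounding the inputs.

```r
# Assumption: the Problem 1 table was produced from the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# t value for the slope: estimate divided by its standard error
t_speed <- coef(s)["speed", "Estimate"] / coef(s)["speed", "Std. Error"]

# F statistic from the ANOVA decomposition: (SSreg / 1) / (SSres / df)
a <- anova(fit)
df_resid <- a["Residuals", "Df"]
f_stat <- (a["speed", "Sum Sq"] / 1) / (a["Residuals", "Sum Sq"] / df_resid)

# R-squared as SSreg / SStotal
r2 <- a["speed", "Sum Sq"] / sum(a[["Sum Sq"]])

# In simple linear regression the two tests agree: F = t^2
isTRUE(all.equal(t_speed^2, f_stat))

# Two-sided t-test p-value and upper-tail F-test p-value
p_t <- 2 * pt(-abs(t_speed), df = df_resid)
p_f <- pf(f_stat, 1, df_resid, lower.tail = FALSE)
```

Working from the unrounded fit removes the rounding gap, so `p_t` and `p_f` agree to numerical precision.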

Problem 2: SLR

a

library(tidyverse)
## ── Attaching packages ───── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
Auto=read.table("Auto.data",header=T,na.strings ="?")
attach(Auto)
## The following object is masked from package:ggplot2:
## 
##     mpg
mod=lm(mpg~horsepower)
summary(mod)
## 
## Call:
## lm(formula = mpg ~ horsepower)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
newdata=data.frame(horsepower=c(98))
predict(mod, newdata, interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(mod, newdata, interval="prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

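a(i): Yes. The slope coefficient is -0.157845, and its p-value (< 2e-16) shows that the relationship between horsepower and mpg is statistically significant.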

a(ii): The strength of the relationship is better judged from R^2 than from the slope coefficient. The R^2 of about 0.61 means that horsepower explains about 61% of the variance in mpg, which indicates a moderately strong relationship.

a(iii): Negative.

a(iv): The predicted mpg for a horsepower of 98 is 24.46708. The 95% confidence interval is [23.97308, 24.96108] and the 95% prediction interval is [14.8094, 34.12476].
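
The prediction interval in a(iv) is much wider than the confidence interval because it covers a single new car rather than the mean mpg at that horsepower. The sketch below shows where the extra width comes from by rebuilding both intervals by hand. It uses the built-in `mtcars` data (columns `mpg` and `hp`) as a stand-in for `Auto.data` so that it runs on its own; the numbers therefore differ from those above.

```r
# Assumption: mtcars stands in for Auto.data so the example is self-contained.
fit <- lm(mpg ~ hp, data = mtcars)
new_obs <- data.frame(hp = 98)

conf_int <- predict(fit, new_obs, interval = "confidence")  # mean response at hp = 98
pred_int <- predict(fit, new_obs, interval = "prediction")  # one new car at hp = 98

# Manual versions of both intervals
n <- nrow(mtcars)
x <- mtcars$hp
sigma_hat <- summary(fit)$sigma   # residual standard error
se_mean <- sigma_hat * sqrt(1 / n + (98 - mean(x))^2 / sum((x - mean(x))^2))
se_pred <- sqrt(se_mean^2 + sigma_hat^2)   # adds the noise of the new observation
t_crit <- qt(0.975, df = n - 2)            # 95% level, the predict() default

ci_manual <- conf_int[1, "fit"] + c(-1, 1) * t_crit * se_mean
pi_manual <- conf_int[1, "fit"] + c(-1, 1) * t_crit * se_pred
```

The only difference between the two standard errors is the added `sigma_hat^2` term, so the prediction interval can never be narrower than the confidence interval.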

b

ggplot(Auto, aes(x=horsepower, y=mpg))+
  geom_point()+
  geom_abline(slope=mod$coefficients[2], intercept=mod$coefficients[1],
              color="blue", lty=2, lwd=1)+
  theme_bw()
## Warning: Removed 5 rows containing missing values (geom_point).

c

plot(mod)

The Residuals vs Fitted plot shows a slight U-shaped curve, which suggests that the relationship is not linear. The spread of the residuals also does not look constant across the fitted values.
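
A U-shaped residual pattern can be checked numerically as well as visually. One possible check, sketched below, compares the straight-line fit with a fit that adds a squared term. It uses the built-in `mtcars` data as a stand-in for the Auto data, and the quadratic term is an illustrative choice rather than part of the assignment.

```r
# Assumption: mtcars stands in for the Auto data.
fit_lin  <- lm(mpg ~ hp, data = mtcars)
fit_quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)

# Nested-model F-test: does the squared term reduce the residual sum of
# squares by more than chance alone would explain?
cmp <- anova(fit_lin, fit_quad)
cmp
```

A small p-value in the second row of `cmp` points to curvature that the straight line misses. To look at only the first diagnostic panel, `plot(fit_lin, which = 1)` draws the Residuals vs Fitted plot by itself.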

Problem 3: MLR

auto<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.csv",
header=TRUE,
na.strings = "?")
auto<-auto[,-c(8:9)]
auto=na.omit(auto)
attach(auto)
## The following objects are masked from Auto:
## 
##     acceleration, cylinders, displacement, horsepower, mpg,
##     weight, year
## The following object is masked from package:ggplot2:
## 
##     mpg

(A)

pairs(auto)

(B)

cor(auto)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

(C):

mod=lm(mpg ~ .-name, data=Auto)
summary(mod)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

C(1): Yes. The F-statistic of 252.4 on 7 and 384 degrees of freedom has a p-value below 2.2e-16, so at least one predictor is related to mpg.

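C(2): displacement, weight, year, and origin (p < 0.01 for each). Cylinders, horsepower, and acceleration are not significant at the 0.05 level.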

C(3): It suggests that, holding the other predictors fixed, each additional model year is associated with an expected increase of about 0.75 mpg.
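
The overall F-test in C(1) can be recomputed from the numbers printed by `summary(mod)`. The sketch below copies the F-statistic, degrees of freedom, and R-squared from that output; the variable names are illustrative.

```r
# Values copied from the summary() output in (C)
f_stat <- 252.4
df_model <- 7      # number of predictors
df_resid <- 384    # residual degrees of freedom

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_value < 2.2e-16

# The same statistic follows from R-squared:
# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
r2 <- 0.8215
f_from_r2 <- (r2 / df_model) / ((1 - r2) / df_resid)
f_from_r2
```

`f_from_r2` differs slightly from 252.4 only because the printed R-squared is rounded to four decimals.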

(D):

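# Response vector
Y<-as.matrix(mpg)
#head(Y)
# Design matrix: intercept column plus the six predictors in `auto`
# (origin was dropped above, so these estimates differ from the fit in (C))
n<-dim(Y)[1]
X<-matrix(c(rep(1, n), displacement, cylinders, horsepower, weight, acceleration, year), ncol=7, byrow=FALSE)
#head(X)
# Least squares estimates: betaHat = (X'X)^{-1} X'Y
# Rows follow the column order of X: intercept, displacement, cylinders, ...
betaHat<-solve(t(X)%*%X)%*%t(X)%*%Y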
betaHat
##               [,1]
## [1,] -1.453525e+01
## [2,]  7.678430e-03
## [3,] -3.298591e-01
## [4,] -3.913556e-04
## [5,] -6.794618e-03
## [6,]  8.527325e-02
## [7,]  7.533672e-01
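# Note: sd(betaHat) is the standard deviation of the seven coefficient values,
# not the standard errors of the estimates
sd(betaHat)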
## [1] 5.53554
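
Standard errors of the least squares estimates come from the diagonal of the estimated covariance matrix, sigma^2 (X'X)^{-1}, where sigma^2 is estimated by RSS / (n - p). The sketch below works through that calculation and compares it with `lm()`. It uses the built-in `mtcars` data with `wt` and `hp` as stand-in predictors so that it runs on its own; the same steps should carry over to the `X` and `Y` built above.

```r
# Assumption: mtcars with predictors wt and hp stands in for the Auto variables.
Y <- as.matrix(mtcars$mpg)
X <- cbind(1, mtcars$wt, mtcars$hp)   # intercept column plus two predictors
n <- nrow(X)
p <- ncol(X)                          # number of estimated coefficients

XtX_inv <- solve(crossprod(X))            # (X'X)^{-1}
beta_hat <- XtX_inv %*% crossprod(X, Y)   # (X'X)^{-1} X'Y

# Residual variance estimate: RSS / (n - p)
resid <- Y - X %*% beta_hat
sigma2_hat <- sum(resid^2) / (n - p)

# Standard errors: square roots of the diagonal of sigma^2 (X'X)^{-1}
se_beta <- sqrt(diag(sigma2_hat * XtX_inv))

# Side-by-side comparison with lm()
ref <- coef(summary(lm(mpg ~ wt + hp, data = mtcars)))
cbind(beta_hat, se_beta, ref[, c("Estimate", "Std. Error")])
```

`crossprod(X)` computes the same matrix as `t(X) %*% X`; it is used here only for brevity.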

(E)

plot(mod)

In the Residuals vs Fitted plot, the relationship does not appear linear, and the error variance does not look constant as the fitted values increase. The Scale-Location plot shows possible outliers at the larger fitted values, but the Residuals vs Leverage plot suggests that there are fewer outliers than the Scale-Location plot implies.
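
Whether the points that stand out in the plots are really outliers or high-leverage observations can also be checked numerically. The sketch below uses the built-in `mtcars` data and an arbitrary three-predictor model as stand-ins. The cutoffs are common rules of thumb, not hard thresholds.

```r
# Assumption: mtcars and this model stand in for the Auto fit.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

stud_res <- rstudent(fit)    # externally studentized residuals
lev <- hatvalues(fit)        # leverage of each observation
p <- length(coef(fit))       # number of coefficients, including the intercept
n <- nrow(mtcars)

# Rules of thumb (conventions, not hard cutoffs):
#   |studentized residual| > 3 flags a possible outlier
#   leverage > 2 * p / n flags a high-leverage point
possible_outliers <- names(stud_res)[abs(stud_res) > 3]
high_leverage <- names(lev)[lev > 2 * p / n]

possible_outliers
high_leverage
```

The leverages always sum to the number of coefficients, so `p / n` is the average leverage and the cutoff above is twice that average.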