Homework Set 4

Problem 1: Complete the table

#Summary table:

pf(89.57,1,48, lower.tail = FALSE)
## [1] 1.489077e-12
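
#A minimal sketch, assuming the F statistic above comes from a regression with a single predictor (numerator df = 1). In that case F is the square of the slope's t statistic, so the upper-tail F probability should agree with the two-sided t probability. The names f_stat, df_resid, p_from_f, and p_from_t are illustrative and not part of the assignment.

```r
# Values taken from the pf() call above
f_stat   <- 89.57
df_resid <- 48

# Upper-tail probability of the F(1, 48) distribution
p_from_f <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

# Two-sided p-value from the t(48) distribution, using t = sqrt(F)
p_from_t <- 2 * pt(sqrt(f_stat), df = df_resid, lower.tail = FALSE)

print(c(F_test = p_from_f, t_test = p_from_t))
```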

Problem 2

#SLR

auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data", 
                   header=TRUE,
                   na.strings = "?")

auto <- na.omit(auto)
attach(auto)

#2a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor

mod  <- lm(mpg~horsepower)
summary(mod)
## 
## Call:
## lm(formula = mpg ~ horsepower)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

#2a i) Is there a relationship between the predictor and the response?

#Yes. The output indicates a negative relationship between horsepower and mpg: the estimated slope is -0.158. If there were no relationship between horsepower and mpg, data as or more extreme than our sample would be very unlikely, as indicated by the small p-value (< 2e-16).

#2a ii) How strong is the relationship between the predictor and the response?

#The relationship is fairly strong: R-squared is 0.6059, which means that horsepower explains about 61% of the variance in mpg.

#2a iii) Is the relationship between the predictor and the response positive or negative?

#The relationship is negative, as indicated by the negative slope estimate. For each 1-unit increase in horsepower, predicted mpg decreases by about 0.158.

#2a iv) What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

newauto <- data.frame(horsepower = c(98))
predict(mod, newauto, interval = "prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
predict(mod,newauto, interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108

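#The predicted mpg for a horsepower of 98 is 24.47. The 95% confidence interval for the mean mpg at that horsepower is (23.97, 24.96), and the 95% prediction interval for a single car is (14.81, 34.12).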
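
#A sketch of how the two intervals are built, assuming the built-in mtcars data as a stand-in for Auto (hp in place of horsepower). Both intervals are centered on the same fitted value; the prediction interval is wider because it also includes the variance of a single new observation. The names x0, se_mean, se_pred, conf_int, and pred_int are illustrative.

```r
# Stand-in data: built-in mtcars, with hp playing the role of horsepower
fit <- lm(mpg ~ hp, data = mtcars)
x0  <- 98                                   # predictor value of interest

n     <- nrow(mtcars)
sigma <- summary(fit)$sigma                 # residual standard error
xbar  <- mean(mtcars$hp)
sxx   <- sum((mtcars$hp - xbar)^2)

y0    <- sum(coef(fit) * c(1, x0))          # point prediction b0 + b1 * x0
tcrit <- qt(0.975, df = n - 2)              # 95% two-sided critical value

# Standard error of the estimated mean response at x0
se_mean <- sigma * sqrt(1 / n + (x0 - xbar)^2 / sxx)
# Standard error for a single new observation adds the error variance
se_pred <- sigma * sqrt(1 + 1 / n + (x0 - xbar)^2 / sxx)

conf_int <- c(fit = y0, lwr = y0 - tcrit * se_mean, upr = y0 + tcrit * se_mean)
pred_int <- c(fit = y0, lwr = y0 - tcrit * se_pred, upr = y0 + tcrit * se_pred)
print(rbind(confidence = conf_int, prediction = pred_int))
```

#These hand-computed rows should agree with predict(fit, data.frame(hp = 98), interval = "confidence") and interval = "prediction".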

#2b) Plot the response and the predictor. Use abline() function to display the least squares regression line.

plot(mpg~horsepower)
abline(mod, col = "hotpink", lwd = 2, lty = 2)

#2c) Use the plot function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

plot(mod$residuals~horsepower)
abline(h=0,col="blue",lwd=2,lty=2)

qqnorm(mod$residuals)
qqline(mod$residuals)

#In a residual plot, we want the residuals to be scattered randomly around 0, with no systematic pattern and roughly the same spread across the range of the predictor. A systematic trend in the residuals indicates that the linear model does not capture the shape of the relationship, while a spread that changes with the predictor indicates non-constant variance. This residual plot is problematic because the residuals start above zero, trend down, and then curve back up above 0 again. This curved pattern suggests that the relationship between horsepower and mpg is nonlinear, so a straight line is not an adequate fit.
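
#A sketch of two related checks, assuming the built-in mtcars data as a stand-in for Auto (hp in place of horsepower). Calling plot() on the fitted model gives the standard diagnostic plots, and comparing the straight-line fit with one that adds a squared term is one way to test for the kind of curvature described above. The names fit_lin, fit_quad, and cmp are illustrative.

```r
# Stand-in data: built-in mtcars, with hp in place of horsepower
fit_lin <- lm(mpg ~ hp, data = mtcars)

# Built-in diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit_lin)
par(mfrow = c(1, 1))

# A curved residual pattern points to a missing nonlinear term, so compare
# the straight-line fit against one with an added squared term
fit_quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)
cmp <- anova(fit_lin, fit_quad)
print(cmp)
```

#A small p-value in the second row of the anova table would indicate that the squared term improves the fit.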

Problem 3

auto<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.csv",
               header=TRUE,
               na.strings = "?")
auto<-na.omit(auto)

#take out origin and names column

auto<-auto[,-c(8:9)]

#3a) Produce a scatterplot matrix which includes all of the variables in the data set

pairs(auto)

#3b) Compute the matrix of correlations between the variables using the function cor()

cor(auto)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000
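
#The matrix above shows that cylinders, displacement, horsepower, and weight are all strongly correlated with mpg and with each other (0.84 to 0.95). A sketch of how such patterns can be pulled out of a correlation matrix, assuming a few columns of the built-in mtcars data as a stand-in for Auto. The 0.8 cutoff and the names cors, with_mpg, pred_cors, and high are illustrative.

```r
# Stand-in data: a few numeric columns of the built-in mtcars data set
d <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]

cors <- cor(d)

# Correlation of each predictor with the response, strongest first
with_mpg <- cors[-1, "mpg"]
print(with_mpg[order(abs(with_mpg), decreasing = TRUE)])

# Predictor pairs whose absolute correlation exceeds 0.8 (illustrative cutoff)
pred_cors <- cors[-1, -1]
high <- which(abs(pred_cors) > 0.8 & upper.tri(pred_cors), arr.ind = TRUE)
print(data.frame(var1 = rownames(pred_cors)[high[, "row"]],
                 var2 = colnames(pred_cors)[high[, "col"]],
                 r    = pred_cors[high]))
```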

#3c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables as predictors. Use the summary function to print the results.

mlr_mod <- lm(mpg~cylinders+displacement+horsepower+weight+acceleration+year, data = auto)
summary(mlr_mod)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6927 -2.3864 -0.0801  2.0291 14.3607 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.454e+01  4.764e+00  -3.051  0.00244 ** 
## cylinders    -3.299e-01  3.321e-01  -0.993  0.32122    
## displacement  7.678e-03  7.358e-03   1.044  0.29733    
## horsepower   -3.914e-04  1.384e-02  -0.028  0.97745    
## weight       -6.795e-03  6.700e-04 -10.141  < 2e-16 ***
## acceleration  8.527e-02  1.020e-01   0.836  0.40383    
## year          7.534e-01  5.262e-02  14.318  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 385 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8063 
## F-statistic: 272.2 on 6 and 385 DF,  p-value: < 2.2e-16

#3c1) Is there a relationship between the predictors and the response?

#Yes. The overall F-statistic (272.2 on 6 and 385 DF, p-value < 2.2e-16) indicates that at least one predictor is related to mpg, and together the predictors explain about 81% of the variance in mpg.

#3c2) Which predictors appear to have a statistically significant relationship to the response?

#Only weight and year have statistically significant coefficients (p < 2e-16 for both). The other predictors are not significant once the remaining variables are in the model.

#3c3) What does the coefficient for the year variable suggest?

#The coefficient for year is 0.7534. This suggests that, with all other variables held constant, each 1-unit increase in year is associated with an increase of about 0.75 in mpg.
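
#A sketch of where the overall F-test p-value in the summary output comes from, assuming the built-in mtcars data and three illustrative predictors as a stand-in for the Auto model. The same test can be written as a comparison between the full model and an intercept-only model. The names fstat, p_overall, fit_null, and cmp are illustrative.

```r
# Stand-in data: built-in mtcars with three illustrative predictors
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# summary() stores the F statistic with its numerator and denominator df
fstat <- summary(fit)$fstatistic
p_overall <- pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                lower.tail = FALSE)

# The same test expressed as a comparison with the intercept-only model
fit_null <- lm(mpg ~ 1, data = mtcars)
cmp <- anova(fit_null, fit)

print(c(from_summary = unname(p_overall), from_anova = cmp[["Pr(>F)"]][2]))
```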

#3d) Write code using matrix algebra to produce the summary output.

#response vector

Y <- as.matrix(auto$mpg)
head(Y)
##      [,1]
## [1,]   18
## [2,]   15
## [3,]   18
## [4,]   16
## [5,]   17
## [6,]   15
dim(Y)
## [1] 392   1

#number of observations

n <- dim(Y)[1]

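#design matrix: a column of 1s for the intercept, then the predictors
#with(auto, ...) takes the columns from the data frame loaded in this problem rather than from the copy attached in Problem 2

X <- with(auto,
          matrix(c(rep(1, n),
                   cylinders,
                   displacement,
                   horsepower,
                   weight,
                   acceleration,
                   year),
                 ncol = 7,
                 byrow = FALSE))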

head(X)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    1    8  307  130 3504 12.0   70
## [2,]    1    8  350  165 3693 11.5   70
## [3,]    1    8  318  150 3436 11.0   70
## [4,]    1    8  304  150 3433 12.0   70
## [5,]    1    8  302  140 3449 10.5   70
## [6,]    1    8  429  198 4341 10.0   70
dim(X)
## [1] 392   7
betaHat <- solve(t(X)%*%X)%*%t(X)%*%Y
betaHat
##               [,1]
## [1,] -1.453525e+01
## [2,] -3.298591e-01
## [3,]  7.678430e-03
## [4,] -3.913556e-04
## [5,] -6.794618e-03
## [6,]  8.527325e-02
## [7,]  7.533672e-01
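
#The coefficient estimates above match the Estimate column of the summary output. A sketch of how the remaining summary quantities (standard errors, t values, p-values, residual standard error, R-squared, and F statistic) could be obtained with the same matrix algebra, assuming the built-in mtcars data and two illustrative predictors as a stand-in for Auto. The names XtX_inv, res, rss, sigma2, se, tval, pval, tss, r2, and fstat are illustrative.

```r
# Stand-in data: built-in mtcars with two illustrative predictors
X <- cbind(1, mtcars$wt, mtcars$hp)         # design matrix with intercept column
Y <- as.matrix(mtcars$mpg)                  # response vector
n <- nrow(X)
p <- ncol(X)                                # number of coefficients

XtX_inv <- solve(t(X) %*% X)
betaHat <- XtX_inv %*% t(X) %*% Y           # least squares estimates

res    <- Y - X %*% betaHat                 # residuals
rss    <- sum(res^2)                        # residual sum of squares
sigma2 <- rss / (n - p)                     # estimate of the error variance

se   <- sqrt(diag(sigma2 * XtX_inv))        # coefficient standard errors
tval <- as.vector(betaHat) / se
pval <- 2 * pt(abs(tval), df = n - p, lower.tail = FALSE)

tss   <- sum((Y - mean(Y))^2)               # total sum of squares
r2    <- 1 - rss / tss
fstat <- ((tss - rss) / (p - 1)) / sigma2   # overall F statistic

print(cbind(Estimate = as.vector(betaHat), Std.Error = se,
            t.value = tval, p.value = pval))
print(c(RSE = sqrt(sigma2), R2 = r2, F = fstat))
```

#On the stand-in data, these values should agree with summary(lm(mpg ~ wt + hp, data = mtcars)).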

#3e) Use the plot function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.

plot(mlr_mod$residuals~auto$mpg)
abline(h=0,col="red",lwd=2,lty=2)

qqnorm(mlr_mod$residuals)
qqline(mlr_mod$residuals)

#The fit resembles that of the simple linear regression. As mpg increases, the residuals are positive, then trend negative, and then trend positive again. This curved pattern suggests that the model is missing a nonlinear component of the relationship. The especially large residuals when mpg is high also suggest that the spread of the residuals is not constant.
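
#One caveat when reading a plot of residuals against the response: for a least squares fit with an intercept, the residuals are correlated with the observed response by construction, so some trend appears even when the model is adequate. A sketch of this point, assuming the built-in mtcars data and two illustrative predictors as a stand-in for Auto. Plotting against the fitted values avoids the issue. The names e, cor_with_response, and cor_with_fitted are illustrative.

```r
# Stand-in data: built-in mtcars with two illustrative predictors
fit <- lm(mpg ~ wt + hp, data = mtcars)
e   <- resid(fit)

# Residuals are correlated with the observed response by construction:
# the correlation equals sqrt(1 - R^2)
cor_with_response <- cor(e, mtcars$mpg)
print(c(cor_with_response, sqrt(1 - summary(fit)$r.squared)))

# Residuals are uncorrelated with the fitted values, so a pattern in this
# plot reflects the model rather than the construction above
cor_with_fitted <- cor(e, fitted(fit))
plot(fitted(fit), e, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```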