Problem Set 4

#A:
auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data", 
                   header=TRUE,
                   na.strings = "?")
auto=na.omit(auto)
attach(auto)

mod1<-lm(mpg~horsepower)

summary(mod1)

## 
## Call:
## lm(formula = mpg ~ horsepower)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

#i: is there a relationship between the predictor and the response?
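#yes, there is a relationship between mpg and horsepower: the p-value for the horsepower coefficient is < 2e-16, so we reject the null hypothesis of no relationship.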

#ii: how strong is this relationship?
#the multiple R-squared value is 0.6059, which means that around 60 percent of the variance in a car's miles per gallon (fuel efficiency) can be explained by its horsepower. This indicates a fairly strong relationship between the two.

#iii: is the relationship positive or negative?
#the estimate for horsepower has a negative sign (-0.158), so the relationship between mpg and horsepower is negative: mpg tends to fall as horsepower rises.
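
#a quick arithmetic check on ii and iii, using only numbers printed in this document (the names r and r_squared are just illustrative): in a simple linear regression, R-squared equals the squared correlation between the response and the predictor, and the sign of that correlation gives the direction of the relationship.
r <- -0.7784268   # cor(mpg, horsepower), copied from the correlation matrix in problem 3
r_squared <- r^2  # compare with the Multiple R-squared of 0.6059 reported by summary(mod1)
r_squared
sign(r)           # a negative sign corresponds to a negative relationship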

#iv: what is the predicted mpg associated with a horsepower of 98? What are the associated 95% CI and PI?

datNew<-data.frame(horsepower=c(98))

predict(mod1, datNew, interval= "confidence")

##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108

predict(mod1, datNew, interval= "predict")

##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

#the predicted miles per gallon for a car with 98 horsepower is approximately 24.47. The 95% confidence interval is (23.97, 24.96) and the 95% prediction interval is (14.81, 34.12).
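
#a rough hand check of the numbers from predict(), using only values printed above (b0, b1, fit98, tcrit, se_fit, rse, se_pred and pi98 are illustrative names). It assumes the usual relationship for a prediction interval: its standard error combines the standard error of the fitted mean with the residual standard error. Because the inputs are rounded, the results should agree with the output only to a few decimals.
b0 <- 39.935861                             # intercept from summary(mod1)
b1 <- -0.157845                             # horsepower slope from summary(mod1)
fit98 <- b0 + b1 * 98                       # point prediction at horsepower = 98
tcrit <- qt(0.975, df = 390)                # 95% two-sided critical value, 390 residual df
se_fit <- (24.96108 - 24.46708) / tcrit     # standard error of the fitted mean, backed out of the CI
rse <- 4.906                                # residual standard error from summary(mod1)
se_pred <- sqrt(se_fit^2 + rse^2)           # standard error for a single new observation
pi98 <- 24.46708 + c(-1, 1) * tcrit * se_pred
fit98
pi98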

#b) plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(mpg~horsepower)
abline(mod1)

#c) use the plot function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

plot(mpg, mod1$residuals)
abline(h=0)

#we can see a pattern emerging in the residual plot: the residuals are not scattered evenly around zero across the range of mpg. Because they vary systematically instead of randomly, the straight-line fit appears to be missing some structure in the data.

qqnorm(mod1$residuals)
qqline(mod1$residuals)

hist(mod1$residuals)

#the residuals tend to be negative at low mpg and positive at high mpg. The histogram also shows a mild right skew: the upper tail is longer (max 16.92 versus min -13.57) and the median (-0.34) sits slightly below zero.
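
#a caveat on plotting residuals against the response (mpg), sketched on simulated data (x, y and fit are made-up stand-ins, not the Auto data): with an intercept in the model, least squares residuals are uncorrelated with the fitted values but always positively correlated with the response, with correlation sqrt(1 - R^2). A trend from negative residuals at low y to positive residuals at high y therefore appears even when the linear model is correct, which is why residuals are usually plotted against the fitted values.
set.seed(1)
x <- runif(200, 50, 200)                   # hypothetical predictor
y <- 40 - 0.15 * x + rnorm(200, sd = 5)    # truly linear relationship plus noise
fit <- lm(y ~ x)
r_resp <- cor(resid(fit), y)               # residuals vs response
r_fit <- cor(resid(fit), fitted(fit))      # residuals vs fitted values
r2 <- summary(fit)$r.squared
c(r_resp, sqrt(1 - r2), r_fit)             # the first two agree; the third is numerically zero
#for mod1 the same identity would give sqrt(1 - 0.6059), roughly 0.63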

#problem 3
#produce a scatter plot matrix for all variables in the data set

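auto1 <- auto[, -c(8:9)]  # drop origin and name, keeping the seven numeric columns used below
pairs(auto[, -9])         # name is not numeric, so drop it before plotting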

#b) compute the matrix of correlations using the cor function

cor(auto1)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
##              acceleration       year
## mpg             0.4233285  0.5805410
## cylinders      -0.5046834 -0.3456474
## displacement   -0.5438005 -0.3698552
## horsepower     -0.6891955 -0.4163615
## weight         -0.4168392 -0.3091199
## acceleration    1.0000000  0.2903161
## year            0.2903161  1.0000000

mlr_mod <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year)
summary(mlr_mod)

## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6927 -2.3864 -0.0801  2.0291 14.3607 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.454e+01  4.764e+00  -3.051  0.00244 ** 
## cylinders    -3.299e-01  3.321e-01  -0.993  0.32122    
## displacement  7.678e-03  7.358e-03   1.044  0.29733    
## horsepower   -3.914e-04  1.384e-02  -0.028  0.97745    
## weight       -6.795e-03  6.700e-04 -10.141  < 2e-16 ***
## acceleration  8.527e-02  1.020e-01   0.836  0.40383    
## year          7.534e-01  5.262e-02  14.318  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 385 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8063 
## F-statistic: 272.2 on 6 and 385 DF,  p-value: < 2.2e-16

#1 and 2

#given the estimates and the p-values, we can determine that there are relationships between the predictors and the response, some of which are significant (the overall F-statistic of 272.2 has a p-value < 2.2e-16).
#weight and year are the significant predictors: their p-values are less than 0.05, so we reject the null hypothesis of no relationship and conclude that each is most likely related to mpg.

#3
#the coefficient of 0.75 for the year predictor suggests that, holding the other predictors fixed, each additional model year is associated with about 0.75 more miles per gallon on average. This seems sensible, as newer cars tend to have better technology and therefore better mpg.
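
#to put a range on the year effect, here is a sketch of a 95% confidence interval built by hand from the printed summary (est_year, se_year, df_res and ci_year are illustrative names). It assumes the usual t-based interval, estimate plus or minus the critical value times the standard error, and should come out at roughly 0.65 to 0.86 mpg per model year.
est_year <- 7.534e-01   # year estimate from summary(mlr_mod)
se_year <- 5.262e-02    # its standard error
df_res <- 385           # residual degrees of freedom
ci_year <- est_year + c(-1, 1) * qt(0.975, df_res) * se_year
ci_year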

#D: write the code using matrix algebra to produce the summary output.

Y <- as.matrix(mpg)
n<-dim(Y)[1]

X <- matrix(c(rep(1, n),
              cylinders,
              displacement,
              horsepower,
              weight,
              acceleration,
              year), 
            ncol = 7,
            byrow = FALSE)
betaHat<-solve(t(X)%*%X)%*%t(X)%*%Y
betaHat

##               [,1]
## [1,] -1.453525e+01
## [2,] -3.298591e-01
## [3,]  7.678430e-03
## [4,] -3.913556e-04
## [5,] -6.794618e-03
## [6,]  8.527325e-02
## [7,]  7.533672e-01
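
#part D asks for the summary output, and betaHat covers only the Estimate column. A minimal sketch of the remaining columns follows, assuming the usual OLS formulas. It uses the built-in mtcars data as a stand-in so it runs on its own, and the names ending in _d are hypothetical. Swapping in the X and Y built above should reproduce summary(mlr_mod).
Y_d <- as.matrix(mtcars$mpg)
X_d <- cbind(1, mtcars$wt, mtcars$hp)        # column of ones for the intercept
n_d <- nrow(X_d)
p_d <- ncol(X_d)
XtXinv_d <- solve(t(X_d) %*% X_d)
beta_d <- drop(XtXinv_d %*% t(X_d) %*% Y_d)  # coefficient estimates
res_d <- drop(Y_d - X_d %*% beta_d)          # residuals
sigma2_d <- sum(res_d^2) / (n_d - p_d)       # residual variance estimate
se_d <- sqrt(diag(sigma2_d * XtXinv_d))      # standard errors
t_d <- beta_d / se_d                         # t values
pval_d <- 2 * pt(abs(t_d), df = n_d - p_d, lower.tail = FALSE)
r2_d <- 1 - sum(res_d^2) / sum((Y_d - mean(Y_d))^2)
coef_d <- cbind(Estimate = beta_d, "Std. Error" = se_d, "t value" = t_d, "Pr(>|t|)" = pval_d)
coef_d
sqrt(sigma2_d)  # residual standard error
r2_d            # multiple R-squared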

#use the plot function in base R or use ggplot to produce diagnostic plots of the regression fit.

plot(mpg, mlr_mod$residuals)
abline(h=0)

#again we see patterns in the residual plot, which means the errors are again not random. There is structure in the data that the model has not captured.

hist(mlr_mod$residuals)

#again we see a right skew in the residuals: the upper tail is longer (max 14.36 versus min -8.69). Again there tends to be more positive error for higher mpg and negative error for lower mpg.
#moreover, at very high mpg there are more outliers here than in the SLR plot.

Gian Olsen

10/8/2019