auto<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.csv",
               header=TRUE,
               na.strings = "?")
# OMIT NAs
auto <- na.omit(auto)
# TAKE OUT COLUMNS FOR ORIGIN AND NAME
auto<-auto[,-c(8:9)]

#1A:
names(auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"
modInt1<-lm(mpg~horsepower*weight, data=auto)
summary(modInt1)
## 
## Call:
## lm(formula = mpg ~ horsepower * weight, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7725  -2.2074  -0.2708   1.9973  14.7314 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.356e+01  2.343e+00  27.127  < 2e-16 ***
## horsepower        -2.508e-01  2.728e-02  -9.195  < 2e-16 ***
## weight            -1.077e-02  7.738e-04 -13.921  < 2e-16 ***
## horsepower:weight  5.355e-05  6.649e-06   8.054 9.93e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared:  0.7484, Adjusted R-squared:  0.7465 
## F-statistic: 384.8 on 3 and 388 DF,  p-value: < 2.2e-16
#We can see that weight and horsepower each explain variance in mpg on their own. These main effects are statistically significant, as each has a p-value below the smallest threshold R reports (< 2e-16).
#When we look at them in conjunction - that is, the horsepower:weight interaction - we can see that it also significantly explains variance in mpg; its p-value of 9.93e-15 is very small indeed.
#These findings seem intuitive, as both the weight of a vehicle and its horsepower would separately impact fuel efficiency. Moreover, the two factors are intrinsically connected: the horsepower needed to move a vehicle depends heavily on its weight, and the weight of the vehicle depends in part on the engine, with heavier engines tending to produce more horsepower.
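# As a quick check (a minimal sketch, not part of the original write-up; the
# model name modMain1 is just a helper), a nested-model comparison between the
# main-effects-only fit and modInt1 tests whether the interaction term adds
# explanatory power.
modMain1 <- lm(mpg ~ horsepower + weight, data = auto)
anova(modMain1, modInt1)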


modInt2<-lm(mpg~cylinders*acceleration, data=auto)
summary(modInt2)
## 
## Call:
## lm(formula = mpg ~ cylinders * acceleration, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2257  -3.1788  -0.7045   2.4031  17.4642 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            31.37192    5.27599   5.946 6.13e-09 ***
## cylinders              -1.84692    0.85564  -2.159   0.0315 *  
## acceleration            0.73498    0.33724   2.179   0.0299 *  
## cylinders:acceleration -0.11179    0.05806  -1.926   0.0549 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.895 on 388 degrees of freedom
## Multiple R-squared:  0.6097, Adjusted R-squared:  0.6067 
## F-statistic:   202 on 3 and 388 DF,  p-value: < 2.2e-16
# Looking at cylinders and acceleration separately, we see p-values of 0.032 and 0.030 respectively, suggesting each factor helps explain the variance in a vehicle's mpg.
#However, the interaction term has a p-value of 0.055, which is greater than the 0.05 alpha level, so we would conclude that cylinders and acceleration taken together do not significantly explain additional variance in a vehicle's mpg.
# These results make sense in context for the following reason. Cars with fewer cylinders tend to have better mpg, but the number of cylinders does not always affect a car's acceleration. Acceleration, while somewhat dependent on the power output of the cylinders, is most affected by the car's weight, and a car can offset its weight with more power to equalize its acceleration. So each of these factors may have an individual impact on mpg, but their interaction is not an effective predictor.
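# To visualize the (marginally significant) interaction, a rough sketch, not part
# of the original write-up: predicted mpg across the acceleration range at a few
# fixed cylinder counts (the grid and object names below are illustrative).
acc_grid <- seq(min(auto$acceleration), max(auto$acceleration), length.out = 50)
plot(auto$acceleration, auto$mpg, pch = 16, col = "grey",
     xlab = "acceleration", ylab = "mpg")
cyl_vals <- c(4, 6, 8)
for (i in seq_along(cyl_vals)) {
  pred <- predict(modInt2,
                  newdata = data.frame(cylinders = cyl_vals[i], acceleration = acc_grid))
  lines(acc_grid, pred, lwd = 2, lty = i)
}
legend("topright", legend = paste(cyl_vals, "cylinders"), lty = seq_along(cyl_vals), lwd = 2)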

#1B:

mod_new<-lm(mpg~horsepower +I(horsepower^2), data=auto)
summary(mod_new)
## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
mod_new2 <- lm(mpg ~ horsepower*log10(horsepower), data=auto)
summary(mod_new2)
## 
## Call:
## lm(formula = mpg ~ horsepower * log10(horsepower), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7866  -2.5421  -0.1044   2.2218  15.9611 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                  -12.6458    57.7392  -0.219  0.82675   
## horsepower                    -3.8463     1.3280  -2.896  0.00399 **
## log10(horsepower)             74.3527    49.8270   1.492  0.13645   
## horsepower:log10(horsepower)   1.3555     0.4539   2.986  0.00300 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.37 on 388 degrees of freedom
## Multiple R-squared:  0.6889, Adjusted R-squared:  0.6865 
## F-statistic: 286.4 on 3 and 388 DF,  p-value: < 2.2e-16
plot(mod_new$fitted.values,mod_new$residuals, pch=16)
abline(h=0, col="blue")

plot(mod_new2$fitted.values,mod_new2$residuals, pch=16)
abline(h=0, col="blue")

# The quadratic transformation and the log transformation of the model above have nearly identical residual plots, leading me to believe there is very little difference in how these transformations affect the fit. The fan shape in both suggests heteroskedasticity - non-constant residual variance - meaning the spread of the errors grows with the fitted values and some structure in the relationship remains unexplained.
#Moreover, each of these transformations resulted in a higher r-squared, yet the same fan shape appears in all of the models.
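# As a rough visual comparison (a sketch, not part of the original write-up), the
# two fitted curves can be overlaid on the raw data; the ordering by horsepower
# is only so the lines draw cleanly.
ord <- order(auto$horsepower)
plot(auto$horsepower, auto$mpg, pch = 16, col = "grey",
     xlab = "horsepower", ylab = "mpg")
lines(auto$horsepower[ord], mod_new$fitted.values[ord], col = "blue", lwd = 2)
lines(auto$horsepower[ord], mod_new2$fitted.values[ord], col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("quadratic", "horsepower*log10(horsepower)"),
       col = c("blue", "red"), lty = c(1, 2), lwd = 2)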

#2:

#A:
library(ISLR)
data(Carseats)
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
carMod1<-lm(Sales~Price+Urban+US, data=Carseats)
summary(carMod1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
#The intercept value for sales is 13.04. The coefficient on Price suggests that for every one-unit increase in price there is a 0.05 drop in sales.
#A response of Yes to the Urban categorical variable suggests a 0.02 drop in sales; that is, a store located in an urban area is predicted to sell 0.02 fewer units. Because of the dummy coding, non-urban stores are treated as no change from the intercept. However, given the p-value of 0.936, we would not reject the null, and thus would say that the Urban variable does not significantly explain the variance in sales.
#A response of Yes for a store located in the United States suggests an increase of 1.2 in sales, while a No response for this same category represents no change from the intercept.

# Equation Form:

#y = 13.043469 - 0.054459(PRICE) - 0.021916(URBAN) + 1.200573(US)
#It is important to note that the responses to the categorical variables determine the inputs to this equation: a response of Yes to either US or Urban means an input of 1, while a response of No means an input of 0 (see the small prediction sketch below).
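# To illustrate the dummy coding (a small sketch with hypothetical example values,
# not part of the assignment), predicted sales at a price of 100 for each
# combination of Urban and US:
new_stores <- expand.grid(Price = 100, Urban = c("Yes", "No"), US = c("Yes", "No"))
cbind(new_stores, predicted_sales = predict(carMod1, newdata = new_stores))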

# We can reject the null for all predictors but Urban. Urban has too high a p-value to be considered significant, meaning the difference seen is most likely random. However, for Price and US/non-US we see a significant relationship, meaning a slope value different from zero.

#We refit the model without Urban because it is not a significant predictor.

mod_small <-lm(Sales ~ Price + US, data =Carseats)
summary(mod_small)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
# The multiple r-squared for each of the models is exactly the same value of 0.2393, so both explain 23.93 percent of the variance in sales. However, the adjusted r-squared for the second model is slightly higher than the first because it has one fewer predictor: Urban is not a useful predictor, and eliminating it gives a slightly better model.
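# As a formal check (not part of the assignment output), a partial F-test between
# the nested models confirms that dropping Urban loses essentially nothing:
anova(mod_small, carMod1)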

confint(mod_small)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632