Homework 7

Problem 1

Filling in the anova chart: tvalue_int = -2.6 tvalue_speed = 9.464

RSE: df = 48 R^2=0.6511 F stat: 89.57 on 1 and 48 df pvalue = 1.490127e-12

MS_speed = 21186 MS_Res = 236.54

F value = 89.566

Pr(>F) = 1.490127e-12

#Finding P value from F value
pf(89.566, 1, 48, lower.tail = FALSE)
## [1] 1.490127e-12
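
The entries in the table can be cross-checked against one another. The following is a minimal sketch in base R that uses only the values reported above; the variable names are illustrative and not part of the original solution.

```r
# Values copied from the filled-in table above
ms_speed <- 21186      # mean square for speed (1 df, so also its sum of squares)
ms_res   <- 236.54     # residual mean square
df_res   <- 48         # residual degrees of freedom
t_speed  <- 9.464      # t value for the speed slope

# F is the ratio of the two mean squares
f_value <- ms_speed / ms_res

# With a single predictor, the slope's squared t value equals F (up to rounding)
t_squared <- t_speed^2

# R^2 = SS_speed / (SS_speed + SS_res), where SS_res = MS_res * df_res
r_squared <- ms_speed / (ms_speed + ms_res * df_res)

# Upper-tail probability of the F(1, 48) distribution
p_value <- pf(f_value, 1, df_res, lower.tail = FALSE)

round(c(F = f_value, t_squared = t_squared, R2 = r_squared), 4)
p_value
```

If the table is filled in consistently, F and the squared t value agree to about two decimal places and R^2 rounds to the reported 0.6511.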

Problem 2

library(ISLR)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   1.0.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data("Carseats")

head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

A) Describe the variables Sales, Price, Urban, and US. Are the variables numeric or categorical? If they are categorical describe the levels.

Sales: numeric; Price: numeric; Urban: categorical; US: categorical. Urban and US are both binary, with levels No and Yes.
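
The variable types can also be read directly from the data frame instead of from the printed rows. This is a small sketch that assumes the ISLR package is installed, as it is in the session above.

```r
library(ISLR)
data("Carseats")

# Storage class of each of the four variables in question
sapply(Carseats[, c("Sales", "Price", "Urban", "US")], class)

# Levels of the two categorical variables
levels(Carseats$Urban)
levels(Carseats$US)
```

The first level listed for a factor is the baseline that `lm()` absorbs into the intercept, which is why the fitted model reports coefficients named UrbanYes and USYes.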

B) Fit a multiple regression model to predict Sales using Price, Urban, and US.

car_mlr = lm(Sales~Price+Urban+US, data=Carseats)

C) Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are categorical!

summary(car_mlr)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

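Each coefficient is the expected change in Sales (recorded in thousands of units) associated with that predictor, holding the other predictors fixed. The intercept, 13.04, is the expected Sales for a non-urban store outside the US at a Price of 0. For each one-unit increase in Price, Sales drops by about 0.054. Urban stores sell about 0.022 less than non-urban stores, although this difference is not statistically significant (p = 0.936), and US stores sell about 1.2 more than non-US stores.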
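
Because Sales in the Carseats data is recorded in thousands of units, the coefficients are easier to read once converted to individual car seats. The sketch below uses only the estimates printed in the summary above; the variable names are illustrative.

```r
# Coefficients copied from the summary(car_mlr) output above
b_price <- -0.054459
b_urban <- -0.021916
b_us    <-  1.200573

# Sales is in thousands of units, so multiply by 1000 to get car seats
units_per_10_price <- 10 * b_price * 1000   # effect of a 10-unit price increase
units_urban        <- b_urban * 1000        # urban vs. non-urban store
units_us           <- b_us * 1000           # US vs. non-US store

round(c(price_plus_10 = units_per_10_price,
        urban         = units_urban,
        us            = units_us))
```

On this scale, a 10-unit price increase corresponds to roughly 545 fewer car seats sold, other predictors held fixed.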

D) Write out the model in equation form, being careful to handle the qualitative variables properly.

The slope on Price is -0.054459 for every store; only the intercept shifts according to the Urban and US coefficients. With y as Sales and x as Price, the model for a non-urban store outside the US (the baseline) is:

\[y = 13.043469 - 0.054459x\]

For an urban store outside the US, the equation is:

\[y = (13.043469 + (-0.021916)) - 0.054459x\]

For a non-urban store in the US, the equation is:

\[y = (13.043469 + 1.200573) - 0.054459x\]

For an urban store in the US, the equation is:

\[y = (13.043469 + (-0.021916) +1.200573) - 0.054459x\]
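
The four cases can also be collapsed into a single equation with indicator variables. This is a restatement, not a new model: x denotes Price, and the indicators I_Urban and I_US (notation introduced here) equal 1 for a Yes and 0 for a No.

\[y = 13.043469 - 0.054459x - 0.021916\,I_{\text{Urban}} + 1.200573\,I_{\text{US}}\]

Setting both indicators to 1 gives an intercept of 13.043469 - 0.021916 + 1.200573 = 14.222126, which matches the last case.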

E) For which of the predictors can you reject the null hypothesis? (H0: beta_j = 0)

Urban is the only predictor with a p-value above 0.05 (p = 0.936), so we fail to reject the null hypothesis for Urban. For Price (p < 2e-16) and US (p = 4.86e-06), we reject the null hypothesis.
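
The p-values behind this decision can be reproduced from the t values and the residual degrees of freedom. A minimal sketch in base R, using only numbers from the summary output above (variable names are illustrative):

```r
# t values copied from the summary(car_mlr) output above
t_values <- c(Price = -10.389, UrbanYes = -0.081, USYes = 4.635)
df_res   <- 396   # residual degrees of freedom

# Two-sided p-value: probability of a t statistic at least this extreme under H0
p_values <- 2 * pt(-abs(t_values), df = df_res)

signif(p_values, 3)
p_values < 0.05   # TRUE means H0 is rejected at the 5% level
```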

F) On the basis of your response in the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome (i.e. keep only the predictors that are significant).

small_mod = lm(Sales ~ Price + US, data = Carseats)
anova(small_mod)
## Analysis of Variance Table
## 
## Response: Sales
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## Price       1  630.03  630.03 103.319 < 2.2e-16 ***
## US          1  131.37  131.37  21.543 4.707e-06 ***
## Residuals 397 2420.87    6.10                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(car_mlr)
## Analysis of Variance Table
## 
## Response: Sales
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Price       1  630.03  630.03 103.0603 < 2.2e-16 ***
## Urban       1    0.10    0.10   0.0158    0.9001    
## US          1  131.31  131.31  21.4802  4.86e-06 ***
## Residuals 396 2420.83    6.11                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
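
Since the smaller model is nested in the original one, the two can also be compared directly with a partial F test, which asks whether adding Urban reduces the residual sum of squares by more than chance would. This is a sketch of one possible check, not part of the original solution; it refits both models so that it runs on its own and assumes the ISLR package is installed.

```r
library(ISLR)
data("Carseats")

# Refit the two models from parts B and F
car_mlr   <- lm(Sales ~ Price + Urban + US, data = Carseats)
small_mod <- lm(Sales ~ Price + US, data = Carseats)

# Partial F test: smaller model first, larger model second
cmp <- anova(small_mod, car_mlr)
cmp
```

With only one extra term, this F test is equivalent to the t test on UrbanYes, so its p-value should agree with the 0.936 reported in the summary.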

G) How well do the models in parts B and F fit the data? (Hint: Use the MSE)

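The MSE (residual mean square) is 6.10 for the smaller model and 6.11 for the original model. The smaller model has the slightly lower MSE, but the two are nearly identical, so dropping Urban costs essentially nothing in fit. Neither model fits especially well, though: the original model's R-squared is 0.2393, so Price, Urban, and US together explain only about 24% of the variation in Sales.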
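
The MSE, residual standard error, and R-squared for both models follow from the sums of squares in the two ANOVA tables. A minimal base R sketch using only those reported values (variable names are illustrative):

```r
# Residual sums of squares and degrees of freedom from the two ANOVA tables above
sse <- c(small = 2420.87, full = 2420.83)
df  <- c(small = 397,     full = 396)

# Total sum of squares: all rows of the full-model table added together
sst <- 630.03 + 0.10 + 131.31 + 2420.83

mse <- sse / df          # residual mean square
rse <- sqrt(mse)         # residual standard error, in units of Sales
r2  <- 1 - sse / sst     # proportion of variance explained

round(rbind(MSE = mse, RSE = rse, R2 = r2), 4)
```

The residual standard error is on the same scale as Sales, which makes it easier to judge than the MSE itself.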

H) Using the model from part (B), obtain 95% confidence intervals for the coefficient(s). Discuss what the confidence intervals for the coefficients tell us.

confint(car_mlr, level = .95)
##                   2.5 %      97.5 %
## (Intercept) 11.76359670 14.32334118
## Price       -0.06476419 -0.04415351
## UrbanYes    -0.55597316  0.51214085
## USYes        0.69130419  1.70984121

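We are 95% confident that the true coefficient of Price lies in (-0.065, -0.044) and that the true coefficient of US lies in (0.691, 1.710). Both intervals exclude 0, which agrees with rejecting the null hypothesis for these predictors. The interval for Urban, (-0.556, 0.512), contains 0, so the data are consistent with Urban having no effect on Sales.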
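
Each interval is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds the slope intervals in base R from the estimates and standard errors in the summary output; the variable names are illustrative.

```r
# Estimates and standard errors copied from the summary(car_mlr) output above
est <- c(Price = -0.054459, UrbanYes = -0.021916, USYes = 1.200573)
se  <- c(Price =  0.005242, UrbanYes =  0.271650, USYes = 0.259042)

# Critical value for a two-sided 95% interval with 396 residual df
t_crit <- qt(0.975, df = 396)

# Interval: estimate +/- critical value * standard error
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 4)

# An interval that excludes 0 matches a rejection of H0 at the 5% level
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
excludes_zero
```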