Niko Hellman

Part I

First, the t-values in the lm output table:

# Intercept
(-17.5791/6.7584)
## [1] -2.601074
# speed
testStat<-(3.9324/0.4155)
testStat
## [1] 9.46426

The t values are -2.60 and 9.46, respectively.

Now, the p value Pr(>|t|) and degrees of freedom:

# there are 50 observations
n<-50

# degrees of freedom
n - 2
## [1] 48
# two-sided p value for the speed coefficient (abs() also handles negative t values)
2 * pt(abs(testStat), df = n - 2, lower.tail = FALSE)
## [1] 1.488495e-12

Thus, there are 48 degrees of freedom, and the p value for the speed coefficient is 1.49e-12.
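The same calculation can be wrapped in a small helper so that both rows of the coefficient table are handled the same way. This is only a sketch: `t_test_coef` is a hypothetical name, not a base R function, and the inputs are the estimates and standard errors quoted above.

```r
# Hypothetical helper: t statistic and two-sided p-value for one coefficient,
# testing H0: beta = 0 on the given residual degrees of freedom.
t_test_coef <- function(estimate, std_error, df) {
  t_stat <- estimate / std_error
  # abs() keeps the two-sided p-value correct when the t statistic is negative
  p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  c(t = t_stat, p = p_value)
}

n <- 50                                  # observations, as stated above
intercept <- t_test_coef(-17.5791, 6.7584, df = n - 2)
speed     <- t_test_coef(3.9324, 0.4155, df = n - 2)
print(intercept)
print(speed)
```

Without `abs()`, the upper-tail call would return a value above 1 for the intercept, whose t statistic is negative.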

For multiple R-squared, we will consider the SS(Reg) and SS(Res) from the ANOVA table:

# multiple R-squared
21186/(21186+11354)
## [1] 0.6510756

The value for Multiple R-Squared is 0.651.
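As a cross-check, the same value follows from the complement form of the formula, using only the two sums of squares quoted above (the total sum of squares is taken to be their sum, 32540):

\[ R^{2} = \frac{SS(Reg)}{SS(Reg) + SS(Res)} = 1 - \frac{SS(Res)}{SS(Reg) + SS(Res)} = 1 - \frac{11354}{32540} \approx 0.651 \]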

Finally, the F-statistic:

# MS(Reg)
MS_Reg<- 21186/1

# MS(Res)
MS_Res<- 11354/48
MS_Res
## [1] 236.5417
# F stat
F<- MS_Reg/MS_Res
F
## [1] 89.56562

Thus, the F statistic is 89.57 on 1 and 48 DF. Because the model has a single predictor, its p-value matches the 1.49e-12 found above for the speed coefficient.

Considering the ANOVA table: we have already found MS(Reg) and MS(Res) above: 21186 and 236.54, respectively. We also found the F statistic, 89.57 (the same value for both rows). Finally, we need Pr(>F):

# Pr(>F)
pf(F, df1=1, df2=48, lower.tail = FALSE)
## [1] 1.490228e-12

The p value of our F statistic is 1.49e-12.
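In simple linear regression the overall F test and the slope's t test are the same test, since F equals t squared. The sketch below checks this with the rounded figures quoted above, then against R's built-in `cars` data. Treating `cars` as the source of the table is an assumption: the data set is not named in this write-up, although its 50 observations and `speed` predictor match.

```r
# Figures quoted in Part I
t_speed <- 3.9324 / 0.4155
f_stat  <- (21186 / 1) / (11354 / 48)

# F and t^2 agree up to rounding of the tabulated inputs
print(c(t_squared = t_speed^2, f_stat = f_stat))

# Assumption: the table comes from lm(dist ~ speed, data = cars)
fit   <- lm(dist ~ speed, data = cars)
t_fit <- summary(fit)$coefficients["speed", "t value"]
f_fit <- summary(fit)$fstatistic[["value"]]
print(c(t_squared = t_fit^2, f_stat = f_fit))
```

The small gap between the two hand-computed numbers comes from rounding in the tabulated inputs. On the fitted model the two quantities coincide.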

Part II

Carseats data

library(ISLR)
data(Carseats)
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

(A)

Observe that ‘Sales’ and ‘Price’ are numeric variables, while ‘Urban’ and ‘US’ are factor variables with two levels each, that is, categorical variables with levels “No” and “Yes”. ‘Sales’ is the unit sales at each location in thousands, ‘Price’ is the price charged for car seats at each site, ‘Urban’ indicates whether the store is in an urban or rural location, and ‘US’ indicates whether the store is in the US or not.
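Because `lm` turns each two-level factor into a 0/1 indicator, it helps to see which level becomes the baseline. The sketch below uses a tiny made-up data frame (`toy` is hypothetical, not the Carseats data). It shows that the alphabetically first level, "No", is the reference, which is why the fitted coefficients are named `UrbanYes` and `USYes`.

```r
# Hypothetical toy data with the same factor levels as Urban and US
toy <- data.frame(
  Urban = factor(c("Yes", "No", "Yes", "No")),
  US    = factor(c("Yes", "Yes", "No", "No"))
)

# The first level listed is the baseline used by lm()
print(levels(toy$Urban))

# Design matrix lm() would build: 1 for "Yes", 0 for "No"
X <- model.matrix(~ Urban + US, data = toy)
print(X)
```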

(B)

mod<-lm(Sales~Price+Urban+US, data=Carseats)
summary(mod)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(C)

Intercept coefficient: the intercept, 13.04, is the expected sales (in thousands of units) for a rural, non-US store charging a price of $0. Since, realistically, a car seat will not be sold for $0, we will not make inferences based on this value; it only anchors the fitted line.

‘Price’ coefficient: for each one-unit ($1) increase in price, we expect sales to decrease by 0.054459 thousand units, that is, by about 54 car seats, on average with other variables held constant.

‘UrbanYes’ coefficient: (note: baseline is no/rural) On average, stores in urban locations are expected to sell about 22 fewer car seats (0.021916 thousand) than those in rural locations with other variables held constant. This difference is not statistically significant (p = 0.936).

‘USYes’ coefficient: (note: baseline is no/not in US) On average, stores located in the US are expected to sell about 1,200 more car seats (1.200573 thousand) than those outside of the US with other variables held constant.
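Since `Sales` is recorded in thousands of units, each coefficient can be rescaled to units to make the interpretations above concrete. A minimal sketch using the estimates from the summary table (the vector names are just for illustration):

```r
# Estimates copied from summary(mod); Sales is in thousands of units
coefs <- c(Price = -0.054459, UrbanYes = -0.021916, USYes = 1.200573)

# Expected change in unit sales per one-unit change in each predictor
units_change <- coefs * 1000
print(round(units_change, 1))
```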

(D)

Model in equation form, where \(x_{1}\) is Price, \(x_{2} = 1\) for urban stores (0 for rural), and \(x_{3} = 1\) for US stores (0 otherwise): \(\hat{\beta}_{0} = 13.04, \hat{\beta}_{1} = -0.054, \hat{\beta}_{2}= -0.022, \hat{\beta}_{3} = 1.20\). \[\hat{y}_{i}= \hat{\beta}_{0} + \hat{\beta}_{1}x_{i1}+ \hat{\beta}_{2} x_{i2} + \hat{\beta}_{3} x_{i3}\] \[\hat{y}= 13.04 - 0.054x_{1} - 0.022 x_{2} + 1.20 x_{3} \] Since \(x_{2}\) and \(x_{3}\) are categorical, we can derive the following equations:

In US & Urban (\(x_{3} = 1\), \(x_{2} = 1\)): \(\hat{y}= 13.04 - 0.054x_{1} - 0.022 (1) + 1.20 (1) = 14.22 - 0.054x_{1}\)

Not in US & Urban (\(x_{3} = 0\), \(x_{2} = 1\)): \(\hat{y}= 13.04 - 0.054x_{1} - 0.022 (1) + 1.20 (0) = 13.02 - 0.054x_{1}\)

In US & Rural (\(x_{3} = 1\), \(x_{2} = 0\)): \(\hat{y}= 13.04 - 0.054x_{1} - 0.022 (0) + 1.20 (1) = 14.24 - 0.054x_{1}\)

Not in US & Rural (\(x_{3} = 0\), \(x_{2} = 0\)): \(\hat{y}= 13.04 - 0.054x_{1} - 0.022 (0) + 1.20 (0) = 13.04 - 0.054x_{1}\)
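The four group equations differ only in their intercepts, which can be tabulated directly from the coefficients. This sketch uses the estimates as printed in the summary table; the variable names are illustrative.

```r
# Estimates from summary(mod)
b0 <- 13.043469
b_price <- -0.054459
b_urban <- -0.021916
b_us <- 1.200573

# All four Urban/US combinations (1 = "Yes", 0 = "No")
groups <- expand.grid(urban = c(1, 0), us = c(1, 0))
groups$intercept <- round(b0 + b_urban * groups$urban + b_us * groups$us, 2)
groups$slope <- b_price      # the Price slope is shared by every group
print(groups)
```

Because the model has no interaction terms, the four lines are parallel: location shifts the line up or down but does not change the effect of price.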

(E)

For each predictor, the null hypothesis is that its coefficient equals zero. We can reject the null hypothesis for Price (p-value < 2e-16) and for US (p-value of 4.86e-06) at any conventional significance level, but we cannot reject it for Urban (p-value of 0.936).
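The decision rule can be written out explicitly. The sketch uses a significance level of 0.05, which is a conventional choice and an assumption here. It also uses 2e-16 as an upper bound for the Price p-value, because the summary reports it only as `< 2e-16`.

```r
alpha <- 0.05                # assumed significance level
p_values <- c(Price = 2e-16, UrbanYes = 0.936, USYes = 4.86e-06)

# TRUE means H0: beta_j = 0 is rejected at level alpha
reject <- p_values < alpha
print(reject)
```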

(F)

mod2<-lm(Sales~Price+US, data=Carseats)
summary(mod2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(G)

anova(mod)
## Analysis of Variance Table
## 
## Response: Sales
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Price       1  630.03  630.03 103.0603 < 2.2e-16 ***
## Urban       1    0.10    0.10   0.0158    0.9001    
## US          1  131.31  131.31  21.4802  4.86e-06 ***
## Residuals 396 2420.83    6.11                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod2)
## Analysis of Variance Table
## 
## Response: Sales
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## Price       1  630.03  630.03 103.319 < 2.2e-16 ***
## US          1  131.37  131.37  21.543 4.707e-06 ***
## Residuals 397 2420.87    6.10                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

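Observe that the MSE is 6.11 for our first model and 6.10 for our smaller model. Dropping Urban barely changes the residual sum of squares (2420.83 versus 2420.87) while freeing one degree of freedom, so the smaller model has a slightly smaller MSE and a higher adjusted R-squared (0.2354 versus 0.2335). The smaller model therefore fits essentially as well with one fewer predictor and is the better choice.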
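A more formal way to compare the two nested models is a partial F test on the residual sums of squares. The sketch below uses the rounded values from the two ANOVA tables, so the result is approximate. In R the same comparison would normally be run with `anova(mod2, mod)`.

```r
# Residual sums of squares and degrees of freedom from the tables above
rss_full    <- 2420.83; df_full    <- 396   # Sales ~ Price + Urban + US
rss_reduced <- 2420.87; df_reduced <- 397   # Sales ~ Price + US

# Partial F statistic for dropping Urban (one parameter)
f_partial <- ((rss_reduced - rss_full) / (df_reduced - df_full)) /
  (rss_full / df_full)
p_partial <- pf(f_partial, df1 = df_reduced - df_full, df2 = df_full,
                lower.tail = FALSE)
print(c(f_partial = f_partial, p_partial = p_partial))
```

A p-value this large agrees with the t test for UrbanYes (0.936): the data give no evidence that Urban improves the model.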

(H)

confint(mod)
##                   2.5 %      97.5 %
## (Intercept) 11.76359670 14.32334118
## Price       -0.06476419 -0.04415351
## UrbanYes    -0.55597316  0.51214085
## USYes        0.69130419  1.70984121

The confidence intervals tell us that, for each coefficient, we are 95% confident that the true value lies within the given interval. Observe that 0 falls inside the interval for UrbanYes, reinforcing the notion that Urban is not a statistically significant predictor, while the intervals for Price and USYes exclude 0.
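These intervals can be reproduced by hand as estimate plus or minus a t critical value times the standard error. A minimal sketch for two of the coefficients, using the estimates and standard errors from `summary(mod)` (the object names are illustrative):

```r
# Estimates and standard errors from summary(mod); 396 residual df
est <- c(UrbanYes = -0.021916, USYes = 1.200573)
se  <- c(UrbanYes =  0.271650, USYes = 0.259042)

t_crit <- qt(0.975, df = 396)            # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))

# An interval that contains 0 corresponds to a non-significant t test
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0
print(contains_zero)
```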