This notebook was prepared by Everton Lima.

Introduction to Statistical Learning Solutions (ISLR)

Ch 3 Exercises

Table of contents

Conceptual Exercises

Applied Exercises

1

The null hypothesis is \(H_{0}: \beta_j = 0\) for each coefficient, given the other variables in the model; in words, once the remaining predictors are accounted for, the predictor corresponding to \(\beta_j\) has no relationship with the response.

Furthermore, each p-value in Table 3.4 gives the probability of observing a coefficient estimate at least as extreme as the one reported, under the null hypothesis that the true coefficient is zero. The conclusion drawn is that, given the effects of TV and radio on sales, there is no evidence that newspaper is related to sales.
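
As a sketch (not part of the original solution), the table can be reproduced by fitting the full model. The Advertising data is not in the ISLR package, so the file name below is an assumption; the CSV is available from the book's website.

Advertising <- read.csv("Advertising.csv") # assumed local copy from the book's website
summary(lm(sales ~ TV + radio + newspaper, data = Advertising))
# The newspaper row of the coefficient table carries the large p-value discussed above.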

2

Both the KNN classifier and KNN regression use information about the neighborhood of a data point to draw conclusions; however, the two methods have different goals. With the KNN classifier we are interested in finding the class to which a new data point belongs, so we assign it the most frequent class among the points in its neighborhood. On the other hand, in KNN regression the target is not a class label but a real number, so we assign the new point the mean of the responses in its neighborhood.
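
As a minimal sketch of the difference (a toy example, not from the text), the function below predicts for a single point and switches between a majority vote and a neighborhood mean depending on the type of the response:

knn_predict <- function(x_train, y_train, x0, k) {
  idx <- order(abs(x_train - x0))[1:k]      # the k nearest neighbors of x0
  if (is.factor(y_train)) {
    names(which.max(table(y_train[idx])))   # classification: majority vote
  } else {
    mean(y_train[idx])                      # regression: neighborhood mean
  }
}

set.seed(1)
x <- rnorm(20)
knn_predict(x, factor(x > 0), x0 = 0.1, k = 3) # a class label
knn_predict(x, 2 * x, x0 = 0.1, k = 3)         # a real-valued prediction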

3

3a

\(Female: 85+20\times GPA+0.07\times IQ+0.01\times GPA:IQ-10\times GPA\)

\(Male: 50+20\times GPA+0.07\times IQ+0.01\times GPA:IQ\)

When GPA is above 3.5, females have the lower salary; that is, males earn more on average provided the GPA is high enough. The correct alternative is iii.

3b

\(50+20\times 4+0.07\times 110+35+0.01\times 4\times 110-10\times 4 = 137.1\)
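
The arithmetic can be checked directly in R:

50 + 20*4 + 0.07*110 + 35 + 0.01*4*110 - 10*4 # 137.1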

3c

This may not be the case. While the coefficient is small, it may still have a large t-statistic, and thus a small p-value, implying that an interaction effect exists; the size of a coefficient says little without its standard error.

4

4a

I would expect the training RSS to be lower for the cubic regression, since the additional polynomial terms allow a tighter fit to the training data.

4b

On the test set the RSS would tend to be higher for the cubic model. Since the true relationship is linear, the cubic model's increase in variance is not offset by any reduction in bias, so I expect its test RSS to be larger.
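
A small simulation sketch (my own, assuming a linear truth with standard normal noise) illustrates the point: the cubic fit always has the lower training RSS, but usually the higher test RSS.

set.seed(1)
res <- replicate(200, {
  x <- rnorm(50); y <- 1 + 2*x + rnorm(50)     # training draw, linear truth
  xt <- rnorm(50); yt <- 1 + 2*xt + rnorm(50)  # independent test draw
  f1 <- lm(y ~ x); f3 <- lm(y ~ poly(x, 3))
  c(train = sum(resid(f3)^2) < sum(resid(f1)^2),
    test  = sum((yt - predict(f3, data.frame(x = xt)))^2) >
            sum((yt - predict(f1, data.frame(x = xt)))^2))
})
rowMeans(res) # train: 1 (always); test: a clear majority of replications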

4c

Since the cubic model is more flexible and nests the linear one, I would expect it to have the lower training RSS regardless of the true relationship.

4d

There is not enough information in this case; depending on how non-linear the true relationship is, either model may perform better. If the true relationship is non-linear but closer to linear than to cubic, the linear model will have the lower test RSS; the opposite holds when it is closer to cubic.

5

\(\hat{y}_i = x_i\hat{\beta} = \sum_{i'=1}^{n} a_{i'}y_{i'}\), where \(a_{i'} = x_i x_{i'} / \sum_{i''=1}^{n} x_{i''}^{2}\)
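
A quick numerical check of these weights on simulated data (a sketch, not part of the exercise):

set.seed(1)
x <- rnorm(10); y <- rnorm(10)
beta <- sum(x * y) / sum(x^2)  # no-intercept least squares estimate
i <- 3
a <- x[i] * x / sum(x^2)       # weights a_i' for the i-th fitted value
c(fitted = x[i] * beta, weighted = sum(a * y)) # the two agree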

6

A linear model with one predictor is defined by \(\hat{y} = \hat{\beta_0} + \hat{\beta_1} x\). If \((\bar{x},\bar{y})\) is a point on the line, then \(\bar{y} = \hat{\beta_0} + \hat{\beta_1} \bar{x}\), i.e. \(\bar{y} - \hat{\beta_1}\bar{x} = \hat{\beta_0}\), which is true by the definition of \(\hat{\beta_0}\) in equation 3.4.
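
This can also be verified numerically (a sketch on simulated data): predicting at mean(x) returns mean(y).

set.seed(1)
x <- rnorm(50); y <- 1 + 2*x + rnorm(50)
fit <- lm(y ~ x)
c(predict(fit, data.frame(x = mean(x))), mean(y)) # equal up to floating point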

7

(skipped)

8

library(ISLR)
summary(Auto)
##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name    
##  amc matador       :  5  
##  ford pinto        :  5  
##  toyota corolla    :  5  
##  amc gremlin       :  4  
##  amc hornet        :  4  
##  chevrolet chevette:  4  
##  (Other)           :365

8a

lm.fit <- lm(mpg~horsepower,data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

i.

A relationship between the predictor and the response exists, as can be seen from the F-statistic of the model and the p-value associated with horsepower; the null hypothesis can therefore be rejected.

ii.

The strength of the relationship can be measured by how well the predictor explains the variance in the response. Using the R-squared statistic, we can see that about 60% of the variance in mpg is explained by horsepower.

floor(summary(lm.fit)$r.sq*100)
## [1] 60

Alternatively, the RSE can be used as a measure of the strength of the relationship. The RSE estimates the standard deviation of the response around the population regression line, and can be interpreted as the lack of fit of the model. Relative to the mean of mpg, this corresponds to a percentage error of about 20%.

floor(summary(lm.fit)$sigma/mean(Auto[["mpg"]])*100)
## [1] 20

iii.

The relationship is negative as can be seen by the coefficient of horsepower.

iv.

predict(lm.fit, data.frame(horsepower=c(98)), interval="confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(lm.fit, data.frame(horsepower=c(98)), interval="prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

8b

plot(Auto[["horsepower"]],Auto[["mpg"]],xlab="horsepower",ylab="mpg")
abline(coef(lm.fit),col="red")

8c

plot(lm.fit)

From the plots above it is easy to notice evidence of non-linearity in the residuals. It is also possible to spot potential outliers and points with high leverage.

rs <- rstudent(lm.fit)
plot(predict(lm.fit),rs,col=ifelse(rs>=3,"red","black"))
text(predict(lm.fit),rs,labels=ifelse(rs>=3,names(rs),""),pos=4)

hv <- hatvalues(lm.fit)
plot(hv,col=ifelse(hv>0.028,"red","black"))
text(hv,labels=ifelse(hv>0.028,names(hv),""),cex=0.7,pos=4)

9

9a

pairs(~.,data = Auto)

9b

cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

9c

lm.fit <- lm(mpg~.-name,data=Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
i.

There is a relationship between the predictors and the response, as can be seen from the F-statistic, which is much larger than 1.

ii.

In order of p-value: year, weight, origin, and displacement. The remaining predictors do not show a significant relationship at the 5% level.

iii.

The coefficient of year suggests that, holding the other predictors fixed, mpg improves by about 0.75 per model year; that is, car makers improve fuel efficiency every year.

9d

par(mfrow = c(2,2))
plot(lm.fit)

The diagnostic plots identify a few large outliers: entries 323, 326, and 327. Points with high leverage are also identified, namely 327 and 394. Point 327 therefore has both high leverage and a large residual.

9e

lm.fit <- lm(mpg~.*.-name*.+.-name,data=Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . * . - name * . + . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

When fitting with all predictors plus all possible interaction terms, very few interactions appear statistically significant at the 5% level. The significant ones are displacement:year, acceleration:year, and acceleration:origin.

9f

From the previous plots it is clear that there is a non-linear relationship between mpg and horsepower.

lm.fit1 <- lm(mpg~horsepower+I(horsepower^2),data = Auto)
summary(lm.fit1)
## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
bnd <- seq(0,300,0.1)
counts <- predict(lm.fit1, data.frame(horsepower=bnd))

plot(Auto[["horsepower"]],Auto[["mpg"]],xlab="horsepower",ylab="mpg")
lines(bnd,counts,col="red")

When regressing mpg onto weight it is noticeable that the residuals show a non-linear pattern and some indication of heteroscedasticity (a funnel-like shape in the residual plot). This suggests that transformations such as log(X) or sqrt(X), which shrink large values, may provide a better fit.

You can observe below that the funnel-like shape is significantly reduced when comparing the residual plot of the linear fit with that of the log transform.

lm.fit2 <- lm(mpg~I(weight^(1/2)),data = Auto)
lm.fit3 <- lm(mpg~log(weight),data = Auto)
lm.fit4 <- lm(mpg~weight,data = Auto)

bnd <- seq(0,6000,0.1)
counts1 <- predict(lm.fit2,data.frame(weight=bnd))
counts2 <- predict(lm.fit3,data.frame(weight=bnd))
counts3 <- predict(lm.fit4,data.frame(weight=bnd))


par(mfrow=c(2,2))
plot(lm.fit4)

plot(lm.fit3)

par(mfrow=c(1,1))

plot(Auto[["weight"]],Auto[["mpg"]],xlab="weight",ylab="mpg")
lines(bnd,counts1,col="red")
lines(bnd,counts2,col="blue",lty=2)
lines(bnd,counts3,col="green",lty=3)

10

10a

library(ISLR)
?Carseats
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age       
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00  
##    Education    Urban       US     
##  Min.   :10.0   No :118   No :142  
##  1st Qu.:12.0   Yes:282   Yes:258  
##  Median :14.0                      
##  Mean   :13.9                      
##  3rd Qu.:16.0                      
##  Max.   :18.0
lm.fit <- lm(Sales~Price+Urban+US,data=Carseats)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

10b

There are 3 coefficients in this model besides the intercept: Price, UrbanYes, and USYes. For every one unit increase in Price, average Sales decrease by about 0.05 (roughly 54 carseats, since Sales is measured in thousands). UrbanYes is a dummy variable for a qualitative predictor: its coefficient is the average difference in Sales between stores in urban and rural locations, holding the other predictors constant. However, given its p-value, there is no evidence of such a relationship in the data. Lastly, the USYes coefficient can be interpreted as the average increase in Sales for a store located in the United States: about 1,200 more carseats sold compared to stores elsewhere, holding the other predictors constant.
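
To see how R encodes the two qualitative predictors as 0/1 dummy variables, the design matrix can be inspected (a quick sketch):

head(model.matrix(Sales ~ Price + Urban + US, data = Carseats), 3)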

10c

This model produces several parallel lines for the different qualitative values.

\[\hat{y}=-0.05\times x+ \begin{cases} 13.04 & UrbanYes=0,USYes=0 \\ 13.02 & UrbanYes=1,USYes=0 \\ 14.24 & UrbanYes=0,USYes=1 \\ 14.22 & UrbanYes=1,USYes=1 \end{cases} \]
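
The four intercepts can be recovered from the fitted model by predicting at Price = 0 for each combination of the dummy variables (a sketch):

grid <- expand.grid(Price = 0, Urban = c("No", "Yes"), US = c("No", "Yes"))
cbind(grid, intercept = predict(lm.fit, grid))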

10d

The null hypothesis can be rejected for USYes and Price.

10e

lm.fit <- lm(Sales~Price+US,data=Carseats)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

10f

Both models fit the data about equally well (the R-squared values are identical), with the second model having a slightly lower RSE and a slightly higher adjusted R-squared.

10g

confint(lm.fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

10h

As can be seen in the plots below, there is no evidence of outliers (no studentized residuals beyond ±3), but there are some points with high leverage.

plot(predict(lm.fit),rstudent(lm.fit),col=ifelse(abs(rstudent(lm.fit))>3,"red","black"))

# Flag points whose leverage exceeds three times the average leverage (p+1)/n.
plot(hatvalues(lm.fit),col=ifelse(hatvalues(lm.fit)>3*(2+1)/nrow(Carseats),"red","black"))

11

set.seed(1)
x=rnorm(100)
y=2*x+rnorm(100)

11a

lm.fit1 <- lm(y~x+0)

summary(lm.fit1)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

The null hypothesis can be rejected for this model given both the large F-statistic and the small p-value.

11b

lm.fit2 <- lm(x~y)
summary(lm.fit2)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90848 -0.28101  0.06274  0.24570  0.85736 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03880    0.04266    0.91    0.365    
## y            0.38942    0.02099   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

Here the null hypothesis for the coefficient of y can again be rejected, given its very small p-value; the intercept, by contrast, is not significantly different from zero (its p-value is high).

11d

\(\hat{\beta}=(\sum_{i=1}^{n}x_{i}y_{i})/(\sum_{i'=1}^{n}x_{i'}^{2})\)

\(SE(\hat{\beta}) = \sqrt{ \frac{\sum_{i=1}^{n}(y_{i}-x_{i}\hat{\beta})^{2}}{(n-1)\sum_{i'=1}^{n}x_{i'}^{2}}}\)

\(t = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac{(\sum_{i=1}^{n}x_{i}y_{i})/(\sum_{i'=1}^{n}x_{i'}^{2})}{\sqrt{ \frac{\sum_{i=1}^{n}(y_{i}-x_{i}\hat{\beta})^{2}}{(n-1)\sum_{i'=1}^{n}x_{i'}^{2}}}} = \frac{\sqrt{n-1}\sum_{i=1}^{n}x_{i}y_{i}}{\sqrt{\sum_{i'=1}^{n}x_{i'}^{2}}\sqrt{\sum_{i=1}^{n}(y_{i}-x_{i}\hat{\beta})^{2}}} = \frac{\sqrt{n-1}\sum_{i=1}^{n}x_{i}y_{i}}{\sqrt{(\sum_{i'=1}^{n}x_{i'}^{2})(\sum_{i'=1}^{n}y_{i'}^{2})-(\sum_{i'=1}^{n}x_{i'}y_{i'})^{2}}}\)

The last equality holds because, multiplying the residual sum of squares by \(\sum_{i'=1}^{n}x_{i'}^{2}\) and using \(\hat{\beta}\sum_{i'=1}^{n}x_{i'}^{2} = \sum_{i=1}^{n}x_{i}y_{i}\),

\((\sum_{i'=1}^{n}x_{i'}^{2})\sum_{i=1}^{n}(y_{i}-x_{i}\hat{\beta})^{2} = (\sum_{i'=1}^{n}x_{i'}^{2})(\sum_{i'=1}^{n}y_{i'}^{2}) - 2\hat{\beta}(\sum_{i'=1}^{n}x_{i'}^{2})\sum_{i=1}^{n}x_{i}y_{i} + \hat{\beta}^{2}(\sum_{i'=1}^{n}x_{i'}^{2})^{2} = (\sum_{i'=1}^{n}x_{i'}^{2})(\sum_{i'=1}^{n}y_{i'}^{2}) - (\sum_{i=1}^{n}x_{i}y_{i})^{2}\)
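
The formula can be checked numerically against R's output using the data generated in (a) (a sketch):

n <- length(x)
t_formula <- sqrt(n - 1) * sum(x * y) /
  sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)
c(formula = t_formula, lm = coef(summary(lm(y ~ x + 0)))[1, "t value"]) # both 18.73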

11e

Since no intercept is present, the t-statistic formula above is symmetric in \(x\) and \(y\); swapping the two leaves it unchanged, so both regressions yield the same t-statistic.

11f

As shown below, regressing x onto y and y onto x (both without an intercept) yields the same t-statistic of 18.73.

lm.fit1 <- lm(x~y+0)
summary(lm.fit1)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
lm.fit2 <- lm(y~x+0)
summary(lm.fit2)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

12

12a

The coefficient will be the same when \(\sum_{i}^{n}x_i^2 = \sum_{i}^{n}y_i^2\)

12b

set.seed(47)
x=rnorm(100)
y=x+rnorm(100)

head(data.frame(y,x))
##            y          x
## 1  1.8521818  1.9946963
## 2  0.2456694  0.7111425
## 3 -0.2392240  0.1854053
## 4  0.4655555 -0.2817650
## 5  0.7682045  0.1087755
## 6 -0.8869080 -1.0857375
lm.fit <- lm(y~x+0)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1436 -0.4683  0.1323  0.9264  2.4012 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.9159     0.1033   8.863 3.28e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.012 on 99 degrees of freedom
## Multiple R-squared:  0.4424, Adjusted R-squared:  0.4368 
## F-statistic: 78.56 on 1 and 99 DF,  p-value: 3.281e-14
lm.fit <- lm(x~y+0)
summary(lm.fit)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8412 -0.5199 -0.1615  0.4169  1.7175 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.4831     0.0545   8.863 3.28e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7347 on 99 degrees of freedom
## Multiple R-squared:  0.4424, Adjusted R-squared:  0.4368 
## F-statistic: 78.56 on 1 and 99 DF,  p-value: 3.281e-14

12c

set.seed(38)
x=rnorm(100)
y=abs(x)

head(data.frame(y,x))
##           y          x
## 1 0.2535911 -0.2535911
## 2 1.0556027 -1.0556027
## 3 0.6864966  0.6864966
## 4 0.0251569  0.0251569
## 5 1.6718410 -1.6718410
## 6 1.4835520 -1.4835520
lm.fit <- lm(y~x+0)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## 0.00772 0.13753 0.53052 1.11493 2.28704 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x  0.02509    0.10047    0.25    0.803
## 
## Residual standard error: 0.9345 on 99 degrees of freedom
## Multiple R-squared:  0.0006293,  Adjusted R-squared:  -0.009465 
## F-statistic: 0.06234 on 1 and 99 DF,  p-value: 0.8033
lm.fit <- lm(x~y+0)
summary(lm.fit)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.11590 -0.44338 -0.01365  0.56617  2.28704 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## y  0.02509    0.10047    0.25    0.803
## 
## Residual standard error: 0.9345 on 99 degrees of freedom
## Multiple R-squared:  0.0006293,  Adjusted R-squared:  -0.009465 
## F-statistic: 0.06234 on 1 and 99 DF,  p-value: 0.8033

13

13a

set.seed(1)
x <- rnorm(100,mean = 0,sd = 1)

13b

eps <- rnorm(100,mean = 0,sd = 0.25)

13c

\(\beta_0 = -1\) and \(\beta_1 = 0.5\)

y <- -1+0.5*x+eps

length(y)
## [1] 100

13d

The scatterplot indicates a linear relationship between x and y.

plot(x,y)
abline(coef = c(-1,0.5), col="red")

13e

\(\hat{\beta_0},\hat{\beta_1}\) are very close to \(\beta_0,\beta_1\)

lm.fit <- lm(y~x)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46921 -0.15344 -0.03487  0.13485  0.58654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.00942    0.02425  -41.63   <2e-16 ***
## x            0.49973    0.02693   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

13f

coef(lm.fit)
## (Intercept)           x 
##  -1.0094232   0.4997349
plot(x,y)
abline(coef = c(-1,0.5), col="red")
abline(coef=coef(lm.fit),col="blue")
legend("bottomright",legend=c("Population Regression","Least Squares"),col=c("red","blue"),lty=c(1,1))

13g

There is no evidence that the quadratic term improves the fit: its coefficient is not statistically significant at the 5% level, and the adjusted \(R^{2}\) is essentially unchanged.

lm.fit2 <- lm(y~poly(x,2))
summary(lm.fit2)

13h

# a-f
set.seed(1)
x <- rnorm(100,mean = 0,sd = 1)
eps <- rnorm(100,mean = 0,sd = 0.10)
y <- -1+0.5*x+eps
length(y) # length of y
## [1] 100
plot(x,y) # scatter plot

lm.fit3 <- lm(y~x)
summary(lm.fit3)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18768 -0.06138 -0.01395  0.05394  0.23462 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.003769   0.009699  -103.5   <2e-16 ***
## x            0.499894   0.010773    46.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09628 on 98 degrees of freedom
## Multiple R-squared:  0.9565, Adjusted R-squared:  0.956 
## F-statistic:  2153 on 1 and 98 DF,  p-value: < 2.2e-16
coef(lm.fit3) # coef of model
## (Intercept)           x 
##   -1.003769    0.499894
lm.fit4 <- lm(y~poly(x,2))
summary(lm.fit4)
## 
## Call:
## lm(formula = y ~ poly(x, 2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19650 -0.06254 -0.01288  0.05803  0.22700 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.94934    0.00958 -99.092   <2e-16 ***
## poly(x, 2)1  4.46754    0.09580  46.632   <2e-16 ***
## poly(x, 2)2 -0.13441    0.09580  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0958 on 97 degrees of freedom
## Multiple R-squared:  0.9573, Adjusted R-squared:  0.9565 
## F-statistic:  1088 on 2 and 97 DF,  p-value: < 2.2e-16
coef(lm.fit4)
## (Intercept) poly(x, 2)1 poly(x, 2)2 
##  -0.9493371   4.4675377  -0.1344062
xs <- seq(min(x),max(x),0.1)
plot(x,y)
abline(coef = coef(lm.fit3),col="red")
# Draw the quadratic fit as a curve via predict, not abline.
lines(xs,predict(lm.fit4,data.frame(x=xs)),col="blue")
legend("bottomright",legend=c("Linear","Quadratic"),lty=c(1,1),col=c("red","blue"))

13i

set.seed(1)
x <- rnorm(100,mean = 0,sd = 1)
eps <- rnorm(100,mean = 0,sd = 0.80)
y <- -1+0.5*x+eps

lm.fit5 <- lm(y~x)
summary(lm.fit5)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5015 -0.4910 -0.1116  0.4315  1.8769 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.03015    0.07759 -13.277  < 2e-16 ***
## x            0.49915    0.08618   5.792 8.42e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7702 on 98 degrees of freedom
## Multiple R-squared:  0.255,  Adjusted R-squared:  0.2474 
## F-statistic: 33.55 on 1 and 98 DF,  p-value: 8.421e-08
coef(lm.fit5)
## (Intercept)           x 
##  -1.0301541   0.4991517
lm.fit6 <- lm(y~poly(x,2))
summary(lm.fit6)
## 
## Call:
## lm(formula = y ~ poly(x, 2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5720 -0.5003 -0.1031  0.4642  1.8160 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.97580    0.07664 -12.732  < 2e-16 ***
## poly(x, 2)1  4.46090    0.76643   5.820 7.58e-08 ***
## poly(x, 2)2 -1.07525    0.76643  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7664 on 97 degrees of freedom
## Multiple R-squared:  0.2698, Adjusted R-squared:  0.2548 
## F-statistic: 17.92 on 2 and 97 DF,  p-value: 2.378e-07
coef(lm.fit6)
## (Intercept) poly(x, 2)1 poly(x, 2)2 
##  -0.9758028   4.4609041  -1.0752495
xs <- seq(min(x),max(x),0.1)

plot(x,y)
abline(coef = coef(lm.fit5),col="red")
lines(xs,predict(lm.fit6,data.frame(x=xs)),col="blue")
legend("bottomright",legend=c("Linear","Quadratic"),lty=c(1,1),col=c("red","blue"))

13j

As the noise increases, so does the width of the confidence intervals.

confint(lm.fit3)
##                  2.5 %     97.5 %
## (Intercept) -1.0230161 -0.9845224
## x            0.4785159  0.5212720
confint(lm.fit) # the model from 13e, fit with eps sd = 0.25
confint(lm.fit5)
##                  2.5 %     97.5 %
## (Intercept) -1.1841286 -0.8761796
## x            0.3281271  0.6701763

14

14a

set.seed(1)
x1 = runif(100)
x2 = 0.5*x1+rnorm(100)/10
y  =2+2*x1+0.3*x2+rnorm(100)

The regression coefficients are \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\).

14b

cor(x1,x2)
## [1] 0.8351212
plot(x1,x2)

14c

lm.fit <- lm(y~x1+x2)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

The estimated coefficients are \(\hat{\beta_0} = 2.13\), \(\hat{\beta_1} = 1.44\), and \(\hat{\beta_2} = 1.01\). The null hypothesis \(H_0: \beta_1 = 0\) can be rejected at the 5% significance level (p = 0.0487), but \(H_0: \beta_2 = 0\) cannot (p = 0.375).

14d

lm.fit <- lm(y~x1)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

The null hypothesis can be rejected in this case.

14e

lm.fit <- lm(y~x2)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

The null hypothesis can be rejected in this case.

14f

The results do not contradict each other, since x1 and x2 are highly correlated. One way to interpret this is that x2 adds little new information to a model that already contains x1, and vice versa; the collinearity inflates the standard errors of both coefficients, making it hard to reject either null hypothesis in the joint model.
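
The degree of collinearity can be quantified with the variance inflation factor, computed here by hand as 1/(1 - R^2) from regressing x1 onto x2 (a sketch; for this two-predictor model the car package's vif() would give the same value):

summary(lm(x1 ~ x2))$r.squared           # high R^2 between the two predictors
1 / (1 - summary(lm(x1 ~ x2))$r.squared) # VIF well above 1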

14g

x1=c(x1,0.1)
x2=c(x2,0.8)
y=c(y,6)
lm.fit <- lm(y~x1+x2)
lm.fit1 <- lm(y~x1)
lm.fit2 <- lm(y~x2)

plot(lm.fit)

par(mfrow = c(2,2))
plot(lm.fit1)

par(mfrow = c(2,2))
plot(lm.fit2)

The additional observation is a point of high leverage in all 3 models, with significant influence on the fitted coefficients and p-values, so the earlier conclusions are no longer reached (in both simple regressions the null hypothesis is still rejected). Only in the regression of y onto x1 is there evidence that the additional point is an outlier.
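
This can be made concrete by comparing the leverage and studentized residual of the added observation (number 101) across the three fits (a sketch using the models above):

c(x1 = hatvalues(lm.fit1)[101], x2 = hatvalues(lm.fit2)[101],
  both = hatvalues(lm.fit)[101]) # leverage of the new point in each model
c(x1 = rstudent(lm.fit1)[101], x2 = rstudent(lm.fit2)[101],
  both = rstudent(lm.fit)[101])  # studentized residuals; see the discussion above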

15

library(MASS)
attach(Boston)

chas <- as.factor(chas)

Regressing crim onto zn.

lm.zn <- lm(crim~zn)
summary(lm.zn)
## 
## Call:
## lm(formula = crim ~ zn)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.429 -4.222 -2.620  1.250 84.523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
## zn          -0.07393    0.01609  -4.594 5.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared:  0.04019,    Adjusted R-squared:  0.03828 
## F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06

Regressing crim onto indus.

lm.indus <- lm(crim~indus)
summary(lm.indus)
## 
## Call:
## lm(formula = crim ~ indus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.972  -2.698  -0.736   0.712  81.813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.06374    0.66723  -3.093  0.00209 ** 
## indus        0.50978    0.05102   9.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared:  0.1653, Adjusted R-squared:  0.1637 
## F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16

Regressing crim onto chas.

lm.chas <- lm(crim~chas)
summary(lm.chas)
## 
## Call:
## lm(formula = crim ~ chas)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## chas1        -1.8928     1.5061  -1.257    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094

Regressing crim onto nox.

lm.nox <- lm(crim~nox)
summary(lm.nox)
## 
## Call:
## lm(formula = crim ~ nox)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.371  -2.738  -0.974   0.559  81.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
## nox           31.249      2.999  10.419  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared:  0.1772, Adjusted R-squared:  0.1756 
## F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16

Regressing crim onto rm.

lm.rm <- lm(crim~rm)
summary(lm.rm)
## 
## Call:
## lm(formula = crim ~ rm)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.604 -3.952 -2.654  0.989 87.197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.482      3.365   6.088 2.27e-09 ***
## rm            -2.684      0.532  -5.045 6.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared:  0.04807,    Adjusted R-squared:  0.04618 
## F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07

Regressing crim onto age.

lm.age <- lm(crim~age)
summary(lm.age)
## 
## Call:
## lm(formula = crim ~ age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.789 -4.257 -1.230  1.527 82.849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
## age          0.10779    0.01274   8.463 2.85e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.1227 
## F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16

Regressing crim onto dis.

lm.dis <- lm(crim~dis)
summary(lm.dis)
## 
## Call:
## lm(formula = crim ~ dis)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.708 -4.134 -1.527  1.516 81.674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4993     0.7304  13.006   <2e-16 ***
## dis          -1.5509     0.1683  -9.213   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared:  0.1441, Adjusted R-squared:  0.1425 
## F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16

Regressing crim onto rad.

lm.rad <- lm(crim~rad)
summary(lm.rad)
## 
## Call:
## lm(formula = crim ~ rad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.164  -1.381  -0.141   0.660  76.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
## rad          0.61791    0.03433  17.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16

Regressing crim onto tax.

lm.tax <- lm(crim~tax)
summary(lm.tax)
## 
## Call:
## lm(formula = crim ~ tax)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.513  -2.738  -0.194   1.065  77.696 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
## tax          0.029742   0.001847   16.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared:  0.3396, Adjusted R-squared:  0.3383 
## F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16

Regressing crim onto ptratio.

lm.ptratio <- lm(crim~ptratio)
summary(lm.ptratio)
## 
## Call:
## lm(formula = crim ~ ptratio)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.654 -3.985 -1.912  1.825 83.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
## ptratio       1.1520     0.1694   6.801 2.94e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared:  0.08407,    Adjusted R-squared:  0.08225 
## F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11

Regressing crim onto black.

lm.black <- lm(crim~black)
summary(lm.black)
## 
## Call:
## lm(formula = crim ~ black)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.756  -2.299  -2.095  -1.296  86.822 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.553529   1.425903  11.609   <2e-16 ***
## black       -0.036280   0.003873  -9.367   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared:  0.1483, Adjusted R-squared:  0.1466 
## F-statistic: 87.74 on 1 and 504 DF,  p-value: < 2.2e-16

Regressing crim onto lstat.

lm.lstat <- lm(crim~lstat)
summary(lm.lstat)
## 
## Call:
## lm(formula = crim ~ lstat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.925  -2.822  -0.664   1.079  82.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
## lstat        0.54880    0.04776  11.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.206 
## F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16

Regressing crim onto medv.

lm.medv <- lm(crim~medv)
summary(lm.medv)
## 
## Call:
## lm(formula = crim ~ medv)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.071 -4.022 -2.343  1.298 80.957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.79654    0.93419   12.63   <2e-16 ***
## medv        -0.36316    0.03839   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

It can be noted from the simple linear models above that there is no evidence that chas is associated with crim, while there is evidence of an association between each of the remaining predictors and crim.

15b

lm.all <- lm(crim~.,data=Boston)
summary(lm.all)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

At the 5% significance level, the null hypothesis can be rejected for the predictors zn, dis, rad, black, and medv.

15c

The results differ from the simple regressions: once the other predictors are accounted for, most predictors that appeared significant on their own no longer are. The null hypothesis can now be rejected only for zn, dis, rad, black, and medv.

xs <- c(coef(lm.zn)[2],coef(lm.indus)[2],coef(lm.chas)[2],coef(lm.nox)[2],coef(lm.rm)[2],coef(lm.age)[2],coef(lm.dis)[2],coef(lm.rad)[2],coef(lm.tax)[2],coef(lm.ptratio)[2],coef(lm.black)[2],coef(lm.lstat)[2],coef(lm.medv)[2])
ys <- coef(lm.all)[-1]

data.frame(xs,ys)
##                  xs            ys
## zn      -0.07393498   0.044855215
## indus    0.50977633  -0.063854824
## chas1   -1.89277655  -0.749133611
## nox     31.24853120 -10.313534912
## rm      -2.68405122   0.430130506
## age      0.10778623   0.001451643
## dis     -1.55090168  -0.987175726
## rad      0.61791093   0.588208591
## tax      0.02974225  -0.003780016
## ptratio  1.15198279  -0.271080558
## black   -0.03627964  -0.007537505
## lstat    0.54880478   0.126211376
## medv    -0.36315992  -0.198886821
plot(xs,ys)

15d

lm.zn3 <- lm(crim~poly(zn,3))
summary(lm.zn3) # No evidence for cubic term.
## 
## Call:
## lm(formula = crim ~ poly(zn, 3))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.821 -4.614 -1.294  0.473 84.130 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.6135     0.3722   9.709  < 2e-16 ***
## poly(zn, 3)1 -38.7498     8.3722  -4.628  4.7e-06 ***
## poly(zn, 3)2  23.9398     8.3722   2.859  0.00442 ** 
## poly(zn, 3)3 -10.0719     8.3722  -1.203  0.22954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared:  0.05824,    Adjusted R-squared:  0.05261 
## F-statistic: 10.35 on 3 and 502 DF,  p-value: 1.281e-06
lm.indus3 <- lm(crim~poly(indus,3))
summary(lm.indus3) # Evidence of non linearity.
## 
## Call:
## lm(formula = crim ~ poly(indus, 3))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.278 -2.514  0.054  0.764 79.713 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.614      0.330  10.950  < 2e-16 ***
## poly(indus, 3)1   78.591      7.423  10.587  < 2e-16 ***
## poly(indus, 3)2  -24.395      7.423  -3.286  0.00109 ** 
## poly(indus, 3)3  -54.130      7.423  -7.292  1.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared:  0.2597, Adjusted R-squared:  0.2552 
## F-statistic: 58.69 on 3 and 502 DF,  p-value: < 2.2e-16
# chas is a qualitative predictor, so a polynomial fit does not make sense in this case.