This notebook was prepared by Everton Lima.
For each coefficient \(\beta_j\), the null hypothesis is \(H_{0}: \beta_j = 0\) given the other variables in the model; in other words, that the corresponding predictor has no relationship with the response once the other predictors are accounted for.
Each p-value in Table 3.4 gives the probability of observing a t-statistic at least as extreme as the one reported, assuming the corresponding coefficient is truly zero. The conclusion drawn is that, given the effects of TV and radio on sales, there is no evidence that newspaper is related to sales.
Both the KNN classifier and KNN regression use information about the neighborhood of a data point to make a prediction, but the two methods have different goals. With the KNN classifier we are interested in which class a new data point belongs to, so we assign it to the most frequent class among its k nearest neighbors. In KNN regression, on the other hand, the target is not a class label but a real number, so the predicted value for a new point is the mean of the responses of its k nearest neighbors.
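A minimal sketch of both ideas for a single numeric predictor; knn_predict, x0 and k are hypothetical names, not functions from the book's labs.
knn_predict <- function(x, y, x0, k, type = c("classification", "regression")) {
  type <- match.arg(type)
  nbrs <- order(abs(x - x0))[1:k]            # indices of the k nearest neighbours of x0
  if (type == "classification") {
    names(which.max(table(y[nbrs])))         # most common class among the neighbours
  } else {
    mean(y[nbrs])                            # average response of the neighbours
  }
}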
\(Female: 85+20\times GPA+0.07\times IQ+0.01\times GPA:IQ-10\times GPA\)
\(Male: 50+20\times GPA+0.07\times IQ+0.01\times GPA:IQ\)
When GPA is above 3.5 (the female/male difference is \(35 - 10\times GPA\)), females will have a lower salary than males. The correct alternative is therefore (iii): for a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.
\(50+20\times 4+0.07\times 110+35+0.01\times 4\times 110-10\times 4 = 137.1\)
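A quick arithmetic check of the prediction above (female, IQ = 110, GPA = 4.0), using the coefficient values quoted in this answer.
gpa <- 4; iq <- 110; gender <- 1   # gender = 1 codes female
50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender
## [1] 137.1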
This may not be the case: the magnitude of the coefficient by itself says little. Even a small coefficient can have a small standard error, hence a large t-statistic and a small p-value, providing evidence that an interaction effect exists.
I would expect the training RSS to be lower for the cubic regression, since the additional polynomial terms allow a tighter fit to the training data.
On the test set the RSS would tend to be higher for the cubic model. Since the true relationship is linear, the extra flexibility increases variance without any compensating reduction in bias, so I expect the test RSS to be larger for the cubic model.
Since the cubic model is more flexible, I would again expect it to fit the training set more closely and thus have the lower training RSS.
There is not enough information in this case; depending on how non-linear the true relationship is, either model may perform better on the test set. If the true relationship is non-linear but closer to linear than to cubic, the linear model will tend to perform better; the opposite holds when it is closer to cubic.
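An illustrative sketch (not part of the exercise) of the linear-truth case discussed above: the cubic fit always has the lower training RSS, while its test RSS is usually, though not in every single simulation, at least as large as the linear fit's. All names and the seed below are hypothetical.
set.seed(123)
x_tr <- rnorm(100); y_tr <- 1 + 2 * x_tr + rnorm(100)   # training data, linear truth
x_te <- rnorm(100); y_te <- 1 + 2 * x_te + rnorm(100)   # test data from the same model
fit_lin <- lm(y_tr ~ x_tr)
fit_cub <- lm(y_tr ~ poly(x_tr, 3))
rss <- function(fit, x, y) sum((y - predict(fit, data.frame(x_tr = x)))^2)
c(train_linear = rss(fit_lin, x_tr, y_tr), train_cubic = rss(fit_cub, x_tr, y_tr))
c(test_linear  = rss(fit_lin, x_te, y_te), test_cubic  = rss(fit_cub, x_te, y_te))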
A linear model with one predictor is \(\hat{y} = \hat{\beta_0} + \hat{\beta_1} x\). Substituting \(x = \bar{x}\) gives \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}\bar{x} = (\bar{y} - \hat{\beta_1}\bar{x}) + \hat{\beta_1}\bar{x} = \bar{y}\), using the definition of \(\hat{\beta_0}\) in equation (3.4). Hence the least squares line always passes through \((\bar{x}, \bar{y})\).
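A small numerical illustration of this fact, using hypothetical simulated data: the fitted value at mean(x) equals mean(y).
set.seed(7)
x_demo <- rnorm(50)
y_demo <- 3 + 2 * x_demo + rnorm(50)
fit <- lm(y_demo ~ x_demo)
predict(fit, data.frame(x_demo = mean(x_demo)))  # equals mean(y_demo)
mean(y_demo)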
library(ISLR)
summary(Auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
##
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## amc matador : 5
## ford pinto : 5
## toyota corolla : 5
## amc gremlin : 4
## amc hornet : 4
## chevrolet chevette: 4
## (Other) :365
lm.fit <- lm(mpg~horsepower,data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
A relationship between the predictor and response exists, as can be seen from the F-statistic of the model and the p-value associated with horsepower; the null hypothesis can therefore be rejected in this case.
The strength of the relationship can be measured by how well the predictor explains the variance in the response. Using the R-squared statistic, about 60% of the variance in mpg is explained by horsepower.
floor(summary(lm.fit)$r.sq*100)
## [1] 60
Alternatively, the RSE can be used as a measure of the strength of the relationship. The RSE estimates the standard deviation of the response around the population regression line (a measure of the lack of fit of the model). Relative to the mean of mpg, this corresponds to a percentage error of about 20%.
floor(summary(lm.fit)$sigma/mean(Auto[["mpg"]])*100)
## [1] 20
The relationship is negative, as can be seen from the sign of the horsepower coefficient.
predict(lm.fit, data.frame(horsepower=c(98)), interval="confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm.fit, data.frame(horsepower=c(98)), interval="prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
plot(Auto[["horsepower"]],Auto[["mpg"]],xlab="horsepower",ylab="mpg")+abline(coef(lm.fit),col="red")
## numeric(0)
plot(lm.fit)
From the plots above there is clear evidence of non-linearity. It is also possible to spot potential outliers and points with high leverage.
rs <- rstudent(lm.fit)
plot(predict(lm.fit), rs, col = ifelse(rs >= 3, "red", "black"))
text(predict(lm.fit), rs, labels = ifelse(rs >= 3, names(rs), ""), pos = 4)
hv <- hatvalues(lm.fit)
plot(hv, col = ifelse(hv > 0.028, "red", "black"))
text(hv, labels = ifelse(hv > 0.028, names(hv), ""), cex = 0.7, pos = 4)
pairs(~.,data = Auto)
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.fit <- lm(mpg~.-name,data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
There is a relationship between the predictors and the response, as indicated by an F-statistic much larger than 1 and its essentially zero p-value.
In order of p-value: year, weight, origin, and displacement appear to have a statistically significant relationship with the response. The remaining predictors are not significant at the 5% level.
The coefficient of year (about 0.75) suggests that, holding the other predictors fixed, cars become more fuel efficient by roughly 0.75 mpg per model year.
par(mfrow = c(2,2))
plot(lm.fit)
The diagnostic plots identify a few large outliers, observations 326, 227 and 323. Points with high leverage are also identified, namely 327 and 394; observation 327 is therefore both a high-leverage point and an outlier.
lm.fit <- lm(mpg~.*.-name*.+.-name,data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . * . - name * . + . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
When fitting a model with all predictors plus all pairwise interaction terms, very few interactions appear statistically significant at the 5% level: displacement:year, acceleration:year and acceleration:origin.
From the previous plots and fits it is clear that there is a non-linear relationship between mpg and horsepower.
lm.fit1 <- lm(mpg~horsepower+I(horsepower^2),data = Auto)
summary(lm.fit1)
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
bnd <- seq(0,300,0.1)
counts <- predict(lm.fit1, data.frame(horsepower=bnd))
plot(Auto[["horsepower"]],Auto[["mpg"]],xlab="horsepower",ylab="mpg")
lines(bnd,counts,col="red")
When regressing mpg onto weight it is noticeable that the residuals show a non-linear pattern and some indication of heteroscedasticity (a funnel-like shape in the residual plot). This suggests that transformations such as \(\log X\) or \(\sqrt{X}\) may provide a better fit by shrinking large values.
You can observe below that the funnel-like shape is significantly reduced when comparing the residual plot of the linear fit with that of the log transform.
lm.fit2 <- lm(mpg~I(weight^(1/2)),data = Auto)
lm.fit3 <- lm(mpg~log(weight),data = Auto)
lm.fit4 <- lm(mpg~weight,data = Auto)
bnd <- seq(1,6000,0.1) # start at 1 so that log(weight) is defined
counts1 <- predict(lm.fit2,data.frame(weight=bnd))
counts2 <- predict(lm.fit3,data.frame(weight=bnd))
counts3 <- predict(lm.fit4,data.frame(weight=bnd))
par(mfrow=c(2,2))
plot(lm.fit4)
plot(lm.fit3)
par(mfrow=c(1,1))
plot(Auto[["weight"]],Auto[["mpg"]],xlab="weight",ylab="mpg")+lines(bnd,counts1,col="red")+
lines(bnd,counts1,col="blue",lty=2)+lines(bnd,counts3,col="green",lty=3)
## numeric(0)
library(ISLR)
?Carseats
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75
## Median :272.0 Median :117.0 Medium:219 Median :54.50
## Mean :264.8 Mean :115.8 Mean :53.32
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00
## Max. :509.0 Max. :191.0 Max. :80.00
## Education Urban US
## Min. :10.0 No :118 No :142
## 1st Qu.:12.0 Yes:282 Yes:258
## Median :14.0
## Mean :13.9
## 3rd Qu.:16.0
## Max. :18.0
lm.fit <- lm(Sales~Price+Urban+US,data=Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
There are three coefficients besides the intercept in this model: Price, UrbanYes, and USYes. For every one-unit increase in Price, average Sales decrease by about 0.05 (roughly 50 car seats, since Sales is measured in thousands of units), holding the other predictors constant. Since Urban is a qualitative variable, the UrbanYes coefficient is the average difference in Sales between stores in urban and non-urban locations, holding all other predictors constant; however, its large p-value means there is no evidence of such a relationship given the other predictors. Lastly, the USYes coefficient is the average increase in Sales for a store located in the United States: about 1,200 more units sold on average than for stores elsewhere.
The model produces parallel lines in Price for the different combinations of the qualitative variables:
\[\hat{y}=-0.05\times x+ \begin{cases} 13.04 & UrbanYes=0,USYes=0 \\ 13.02 & UrbanYes=1,USYes=0 \\ 14.24 & UrbanYes=0,USYes=1 \\ 14.22 & UrbanYes=1,USYes=1 \end{cases} \]
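A quick check of the four intercepts above from the fitted coefficients; the slope on Price is shared across all four cases.
b <- coef(lm.fit)
c(base     = unname(b["(Intercept)"]),
  urban    = unname(b["(Intercept)"] + b["UrbanYes"]),
  us       = unname(b["(Intercept)"] + b["USYes"]),
  urban_us = unname(b["(Intercept)"] + b["UrbanYes"] + b["USYes"]))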
The null hypothesis can be rejected for USYes and Price.
lm.fit <- lm(Sales~Price+US,data=Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models fit the data about equally well; the second model has a slightly lower RSE and a slightly higher adjusted R-squared.
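One way to make the comparison formal is a partial F-test, since the smaller model is nested in the larger one. A sketch follows; the earlier lm.fit object was overwritten, so both models are refit here, and fit_full and fit_reduced are hypothetical names.
fit_full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
fit_reduced <- lm(Sales ~ Price + US, data = Carseats)
anova(fit_reduced, fit_full)   # a large p-value means Urban adds essentially nothing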
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
As noted from the plots below, there is no evidence of outliers, but there are a few points with high leverage.
plot(predict(lm.fit),rstudent(lm.fit),col=ifelse(rstudent(lm.fit)>3,"red","black"))
plot(hatvalues(lm.fit),col=ifelse(hatvalues(lm.fit)>(2+1)/dim(Carseats)[1]*3,"red","black"))
set.seed(1)
x=rnorm(100)
y=2*x+rnorm(100)
lm.fit1 <- lm(y~x+0)
summary(lm.fit1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The null hypothesis can be rejected for this model given the large F-statistic and the very small p-value; the coefficient estimate (1.99) is also close to the true value of 2.
lm.fit2 <- lm(x~y)
summary(lm.fit2)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
The null hypothesis cannot be rejected for the intercept, since its p-value is high; the slope on y, on the other hand, is highly significant.
\(\hat{\beta}=(\sum_{i=1}^{n}x_{i}y_{i})/(\sum_{i'=1}^{n}x_{i'}^{2})\)
\(SE(\hat{\beta}) = \sqrt{ \frac{\sum_{i=1}^{n}(y_{i}-x_{i}\hat{\beta})^{2}}{(n-1)\sum_{i'=1}^{n}x_{i'}^{2}}}\)
\(t-statistic = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac{(\sum_{i=1}^{n}x_{i}y_{i})/(\sum_{i'=1}^{n}x_{i'}^{2})}{\sqrt{ \frac{\sum_{i=1}^{n}(y_{i}-x_{i}\hat{\beta})^{2}}{(n-1)\sum_{i'=1}^{n}x_{i'}^{2}}}}\)
\(= \frac{\sqrt{n-1}\,\sum_{i=1}^{n}x_{i}y_{i}}{\sqrt{\left(\sum_{i'=1}^{n}x_{i'}^{2}\right)\sum_{i=1}^{n}(y_{i}-x_{i}\hat{\beta})^{2}}}\)
Expanding the sum of squares in the denominator and substituting \(\hat{\beta}=(\sum_{i=1}^{n}x_{i}y_{i})/(\sum_{i'=1}^{n}x_{i'}^{2})\),
\(\left(\sum_{i'=1}^{n}x_{i'}^{2}\right)\sum_{i=1}^{n}(y_{i}-x_{i}\hat{\beta})^{2} = \left(\sum_{i'=1}^{n}x_{i'}^{2}\right)\sum_{i=1}^{n}y_{i}^{2} - 2\hat{\beta}\left(\sum_{i=1}^{n}x_{i}y_{i}\right)\left(\sum_{i'=1}^{n}x_{i'}^{2}\right) + \hat{\beta}^{2}\left(\sum_{i'=1}^{n}x_{i'}^{2}\right)^{2} = \left(\sum_{i'=1}^{n}x_{i'}^{2}\right)\left(\sum_{i'=1}^{n}y_{i'}^{2}\right) - \left(\sum_{i'=1}^{n}x_{i'}y_{i'}\right)^{2},\)
so that
\(t = \frac{\sqrt{n-1}\,\sum_{i=1}^{n}x_{i}y_{i}}{\sqrt{\left(\sum_{i'=1}^{n}x_{i'}^{2}\right)\left(\sum_{i'=1}^{n}y_{i'}^{2}\right)-\left(\sum_{i'=1}^{n}x_{i'}y_{i'}\right)^{2}}}.\)
This final expression is symmetric in x and y, so when no intercept is present the t-statistic for regressing y onto x is the same as the t-statistic for regressing x onto y.
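A numerical check of the closed-form expression, using the x and y simulated above; the value should match the t statistic reported by summary(lm(y ~ x + 0)).
n <- length(x)
sqrt(n - 1) * sum(x * y) / sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)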
lm.fit1 <- lm(x~y+0)
summary(lm.fit1)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
lm.fit2 <- lm(y~x+0)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
Without an intercept, regressing y onto x gives \(\hat{\beta} = \sum_{i}^{n}x_{i}y_{i}/\sum_{i}^{n}x_{i}^{2}\), while regressing x onto y gives \(\sum_{i}^{n}x_{i}y_{i}/\sum_{i}^{n}y_{i}^{2}\); the two coefficient estimates are therefore the same exactly when \(\sum_{i}^{n}x_i^2 = \sum_{i}^{n}y_i^2\).
set.seed(47)
x=rnorm(100)
y=x+rnorm(100)
head(data.frame(y,x))
## y x
## 1 1.8521818 1.9946963
## 2 0.2456694 0.7111425
## 3 -0.2392240 0.1854053
## 4 0.4655555 -0.2817650
## 5 0.7682045 0.1087755
## 6 -0.8869080 -1.0857375
lm.fit <- lm(y~x+0)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1436 -0.4683 0.1323 0.9264 2.4012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.9159 0.1033 8.863 3.28e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.012 on 99 degrees of freedom
## Multiple R-squared: 0.4424, Adjusted R-squared: 0.4368
## F-statistic: 78.56 on 1 and 99 DF, p-value: 3.281e-14
lm.fit <- lm(x~y+0)
summary(lm.fit)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8412 -0.5199 -0.1615 0.4169 1.7175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.4831 0.0545 8.863 3.28e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7347 on 99 degrees of freedom
## Multiple R-squared: 0.4424, Adjusted R-squared: 0.4368
## F-statistic: 78.56 on 1 and 99 DF, p-value: 3.281e-14
set.seed(38)
x=rnorm(100)
y=abs(x)
head(data.frame(y,x))
## y x
## 1 0.2535911 -0.2535911
## 2 1.0556027 -1.0556027
## 3 0.6864966 0.6864966
## 4 0.0251569 0.0251569
## 5 1.6718410 -1.6718410
## 6 1.4835520 -1.4835520
lm.fit <- lm(y~x+0)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0.00772 0.13753 0.53052 1.11493 2.28704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.02509 0.10047 0.25 0.803
##
## Residual standard error: 0.9345 on 99 degrees of freedom
## Multiple R-squared: 0.0006293, Adjusted R-squared: -0.009465
## F-statistic: 0.06234 on 1 and 99 DF, p-value: 0.8033
lm.fit <- lm(x~y+0)
summary(lm.fit)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11590 -0.44338 -0.01365 0.56617 2.28704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.02509 0.10047 0.25 0.803
##
## Residual standard error: 0.9345 on 99 degrees of freedom
## Multiple R-squared: 0.0006293, Adjusted R-squared: -0.009465
## F-statistic: 0.06234 on 1 and 99 DF, p-value: 0.8033
set.seed(1)
x <- rnorm(100,mean = 0,sd = 1)
eps <- rnorm(100,mean = 0,sd = 0.25)
The true coefficients are \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
y <- -1+0.5*x+eps
length(y)
## [1] 100
The scatterplot indicates a linear relationship between the two values.
plot(x,y)
abline(coef = c(-1,0.5), col="red")
\(\hat{\beta_0},\hat{\beta_1}\) are very close to \(\beta_0,\beta_1\)
lm.fit <- lm(y~x)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
coef(lm.fit)
## (Intercept) x
## -1.0094232 0.4997349
plot(x,y)
abline(coef = c(-1,0.5), col="red")
abline(coef=coef(lm.fit),col="blue")
legend("bottomright",legend=c("Population Regression ","Least Square "),col=c("red","blue"),lty=c(1,1))
There is no evidence that the quadratic term improves the fit: the p-value of the quadratic coefficient is well above 0.05, the F-statistic decreases, and the adjusted \(R^{2}\) is essentially unchanged.
lm.fit2 <- lm(y~poly(x,2))
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
# a-f
set.seed(1)
x <- rnorm(100,mean = 0,sd = 1)
eps <- rnorm(100,mean = 0,sd = 0.10)
y <- -1+0.5*x+eps
length(y) # length of y
## [1] 100
plot(x,y) # scatter plot
lm.fit3 <- lm(y~x)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18768 -0.06138 -0.01395 0.05394 0.23462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.003769 0.009699 -103.5 <2e-16 ***
## x 0.499894 0.010773 46.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09628 on 98 degrees of freedom
## Multiple R-squared: 0.9565, Adjusted R-squared: 0.956
## F-statistic: 2153 on 1 and 98 DF, p-value: < 2.2e-16
coef(lm.fit3) # coef of model
## (Intercept) x
## -1.003769 0.499894
lm.fit4 <- lm(y~poly(x,2))
summary(lm.fit4)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19650 -0.06254 -0.01288 0.05803 0.22700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94934 0.00958 -99.092 <2e-16 ***
## poly(x, 2)1 4.46754 0.09580 46.632 <2e-16 ***
## poly(x, 2)2 -0.13441 0.09580 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0958 on 97 degrees of freedom
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9565
## F-statistic: 1088 on 2 and 97 DF, p-value: < 2.2e-16
coef(lm.fit4)
## (Intercept) poly(x, 2)1 poly(x, 2)2
## -0.9493371 4.4675377 -0.1344062
ys <- predict(lm.fit2,data.frame(x=seq(0,3,0.1)))
plot(x,y)
abline(coef = coef(lm.fit3),col="red")
abline(coef = coef(lm.fit4),col="blue")
legend("bottomright",legend=c("Linear","Quadradic "),lty=c(1,1),col=c("red","blue"))
set.seed(1)
x <- rnorm(100,mean = 0,sd = 1)
eps <- rnorm(100,mean = 0,sd = 0.80)
y <- -1+0.5*x+eps
lm.fit5 <- lm(y~x)
summary(lm.fit5)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5015 -0.4910 -0.1116 0.4315 1.8769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03015 0.07759 -13.277 < 2e-16 ***
## x 0.49915 0.08618 5.792 8.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7702 on 98 degrees of freedom
## Multiple R-squared: 0.255, Adjusted R-squared: 0.2474
## F-statistic: 33.55 on 1 and 98 DF, p-value: 8.421e-08
coef(lm.fit5)
## (Intercept) x
## -1.0301541 0.4991517
lm.fit6 <- lm(y~poly(x,2))
summary(lm.fit6)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5720 -0.5003 -0.1031 0.4642 1.8160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.97580 0.07664 -12.732 < 2e-16 ***
## poly(x, 2)1 4.46090 0.76643 5.820 7.58e-08 ***
## poly(x, 2)2 -1.07525 0.76643 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7664 on 97 degrees of freedom
## Multiple R-squared: 0.2698, Adjusted R-squared: 0.2548
## F-statistic: 17.92 on 2 and 97 DF, p-value: 2.378e-07
coef(lm.fit6)
## (Intercept) poly(x, 2)1 poly(x, 2)2
## -0.9758028 4.4609041 -1.0752495
ys <- predict(lm.fit2,data.frame(x=seq(0,3,0.1)))
plot(x,y)
abline(coef = coef(lm.fit5),col="red")
abline(coef = coef(lm.fit6),col="blue")
legend("bottomright",legend=c("Linear","Quadradic "),lty=c(1,1),col=c("red","blue"))
As the noise level increases, so does the width of the confidence intervals.
confint(lm.fit3)
## 2.5 % 97.5 %
## (Intercept) -1.0230161 -0.9845224
## x 0.4785159 0.5212720
confint(lm.fit) # model fit with the original noise level (sd = 0.25)
confint(lm.fit5)
## 2.5 % 97.5 %
## (Intercept) -1.1841286 -0.8761796
## x 0.3281271 0.6701763
set.seed(1)
x1 = runif(100)
x2 = 0.5*x1+rnorm(100)/10
y =2+2*x1+0.3*x2+rnorm(100)
The true regression coefficients are \(\beta_0 = 2\), \(\beta_1 = 2\) and \(\beta_2 = 0.3\).
#### 14b
cor(x1,x2)
## [1] 0.8351212
plot(x1,x2)
lm.fit <- lm(y~x1+x2)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The model coefficient estimates are \(\hat{\beta_0} = 2.13\), \(\hat{\beta_1} = 1.44\) and \(\hat{\beta_2} = 1.01\); only \(\hat{\beta_0}\) is close to its true value. The null hypothesis \(H_0: \beta_1 = 0\) can (just barely) be rejected at the 5% significance level, but \(H_0: \beta_2 = 0\) cannot.
lm.fit <- lm(y~x1)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The null hypothesis can be rejected in this case.
lm.fit <- lm(y~x2)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
The null hypothesis can be rejected in this case.
The results do not contradict each other, since x1 and x2 are highly correlated. When both are in the model, x2 adds little information beyond what x1 already provides (and vice versa), and the collinearity inflates the standard errors of both coefficients, which is why neither appears clearly significant in the joint fit even though each is significant on its own.
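The inflation of the standard errors can be quantified with variance inflation factors; a brief sketch, assuming the car package is installed.
library(car)
vif(lm(y ~ x1 + x2))   # values well above 1 indicate collinearity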
x1=c(x1,0.1)
x2=c(x2,0.8)
y=c(y,6)
lm.fit <- lm(y~x1+x2)
lm.fit1 <- lm(y~x1)
lm.fit2 <- lm(y~x2)
par(mfrow = c(2,2))
plot(lm.fit)
par(mfrow = c(2,2))
plot(lm.fit1)
par(mfrow = c(2,2))
plot(lm.fit2)
The additional observation is a high-leverage point in all three models and has a substantial influence on the fits, altering the p-values so that the same conclusions are no longer reached (here both null hypotheses are rejected). Only when regressing y onto x1 is there evidence that the additional point is also an outlier.
library(MASS)
attach(Boston)
chas <- as.factor(chas)
Regressing crim onto zn.
lm.zn <- lm(crim~zn)
summary(lm.zn)
##
## Call:
## lm(formula = crim ~ zn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## zn -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
Regressing crim onto indus.
lm.indus <- lm(crim~indus)
summary(lm.indus)
##
## Call:
## lm(formula = crim ~ indus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.972 -2.698 -0.736 0.712 81.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.06374 0.66723 -3.093 0.00209 **
## indus 0.50978 0.05102 9.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared: 0.1653, Adjusted R-squared: 0.1637
## F-statistic: 99.82 on 1 and 504 DF, p-value: < 2.2e-16
Regressing crim onto chas.
lm.chas <- lm(crim~chas)
summary(lm.chas)
##
## Call:
## lm(formula = crim ~ chas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.738 -3.661 -3.435 0.018 85.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7444 0.3961 9.453 <2e-16 ***
## chas1 -1.8928 1.5061 -1.257 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared: 0.003124, Adjusted R-squared: 0.001146
## F-statistic: 1.579 on 1 and 504 DF, p-value: 0.2094
Regressing crim onto nox.
lm.nox <- lm(crim~nox)
summary(lm.nox)
##
## Call:
## lm(formula = crim ~ nox)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.371 -2.738 -0.974 0.559 81.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.720 1.699 -8.073 5.08e-15 ***
## nox 31.249 2.999 10.419 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1756
## F-statistic: 108.6 on 1 and 504 DF, p-value: < 2.2e-16
Regressing crim onto rm.
lm.rm <- lm(crim~rm)
summary(lm.rm)
##
## Call:
## lm(formula = crim ~ rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.604 -3.952 -2.654 0.989 87.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.482 3.365 6.088 2.27e-09 ***
## rm -2.684 0.532 -5.045 6.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.04618
## F-statistic: 25.45 on 1 and 504 DF, p-value: 6.347e-07
Regressing crim onto age.
lm.age <- lm(crim~age)
summary(lm.age)
##
## Call:
## lm(formula = crim ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.789 -4.257 -1.230 1.527 82.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.77791 0.94398 -4.002 7.22e-05 ***
## age 0.10779 0.01274 8.463 2.85e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1227
## F-statistic: 71.62 on 1 and 504 DF, p-value: 2.855e-16
Regressing crim onto dis.
lm.dis <- lm(crim~dis)
summary(lm.dis)
##
## Call:
## lm(formula = crim ~ dis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.708 -4.134 -1.527 1.516 81.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4993 0.7304 13.006 <2e-16 ***
## dis -1.5509 0.1683 -9.213 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared: 0.1441, Adjusted R-squared: 0.1425
## F-statistic: 84.89 on 1 and 504 DF, p-value: < 2.2e-16
Regressing crim onto rad.
lm.rad <- lm(crim~rad)
summary(lm.rad)
##
## Call:
## lm(formula = crim ~ rad)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.164 -1.381 -0.141 0.660 76.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.28716 0.44348 -5.157 3.61e-07 ***
## rad 0.61791 0.03433 17.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared: 0.3913, Adjusted R-squared: 0.39
## F-statistic: 323.9 on 1 and 504 DF, p-value: < 2.2e-16
Regressing crim onto tax.
lm.tax <- lm(crim~tax)
summary(lm.tax)
##
## Call:
## lm(formula = crim ~ tax)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.513 -2.738 -0.194 1.065 77.696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.528369 0.815809 -10.45 <2e-16 ***
## tax 0.029742 0.001847 16.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared: 0.3396, Adjusted R-squared: 0.3383
## F-statistic: 259.2 on 1 and 504 DF, p-value: < 2.2e-16
Regressing crim onto ptratio.
lm.ptratio <- lm(crim~ptratio)
summary(lm.ptratio)
##
## Call:
## lm(formula = crim ~ ptratio)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.654 -3.985 -1.912 1.825 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.6469 3.1473 -5.607 3.40e-08 ***
## ptratio 1.1520 0.1694 6.801 2.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared: 0.08407, Adjusted R-squared: 0.08225
## F-statistic: 46.26 on 1 and 504 DF, p-value: 2.943e-11
Regressing crim onto black.
lm.black <- lm(crim~black)
summary(lm.black)
##
## Call:
## lm(formula = crim ~ black)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## black -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
Regressing crim onto lstat.
lm.lstat <- lm(crim~lstat)
summary(lm.lstat)
##
## Call:
## lm(formula = crim ~ lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.925 -2.822 -0.664 1.079 82.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.33054 0.69376 -4.801 2.09e-06 ***
## lstat 0.54880 0.04776 11.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared: 0.2076, Adjusted R-squared: 0.206
## F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16
Regressing crim onto medv.
lm.medv <- lm(crim~medv)
summary(lm.medv)
##
## Call:
## lm(formula = crim ~ medv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.071 -4.022 -2.343 1.298 80.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.79654 0.93419 12.63 <2e-16 ***
## medv -0.36316 0.03839 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
It can be noted from the simple linear regressions above that there is no evidence of an association between chas and crim (its p-value is large), while every other predictor shows a statistically significant association with crim.
lm.all <- lm(crim~.,data=Boston)
summary(lm.all)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
At the 5% significance level, the null hypothesis can be rejected for the predictors zn, dis, rad, black, and medv.
The results differ from the simple regressions: once all predictors are included, much of the apparent association disappears, and the null hypothesis can be rejected only for zn, dis, rad, black and medv. This is expected, since each simple regression ignores the other predictors, while the multiple regression estimates each effect holding the others fixed.
xs <- c(coef(lm.zn)[2], coef(lm.indus)[2], coef(lm.chas)[2], coef(lm.nox)[2],
        coef(lm.rm)[2], coef(lm.age)[2], coef(lm.dis)[2], coef(lm.rad)[2],
        coef(lm.tax)[2], coef(lm.ptratio)[2], coef(lm.black)[2],
        coef(lm.lstat)[2], coef(lm.medv)[2])
ys <- coef(lm.all)[-1]
data.frame(xs,ys)
## xs ys
## zn -0.07393498 0.044855215
## indus 0.50977633 -0.063854824
## chas1 -1.89277655 -0.749133611
## nox 31.24853120 -10.313534912
## rm -2.68405122 0.430130506
## age 0.10778623 0.001451643
## dis -1.55090168 -0.987175726
## rad 0.61791093 0.588208591
## tax 0.02974225 -0.003780016
## ptratio 1.15198279 -0.271080558
## black -0.03627964 -0.007537505
## lstat 0.54880478 0.126211376
## medv -0.36315992 -0.198886821
plot(xs,ys)
lm.zn3 <- lm(crim~poly(zn,3))
summary(lm.zn3) # No evidence for cubic term.
##
## Call:
## lm(formula = crim ~ poly(zn, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.821 -4.614 -1.294 0.473 84.130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3722 9.709 < 2e-16 ***
## poly(zn, 3)1 -38.7498 8.3722 -4.628 4.7e-06 ***
## poly(zn, 3)2 23.9398 8.3722 2.859 0.00442 **
## poly(zn, 3)3 -10.0719 8.3722 -1.203 0.22954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared: 0.05824, Adjusted R-squared: 0.05261
## F-statistic: 10.35 on 3 and 502 DF, p-value: 1.281e-06
lm.indus3 <- lm(crim~poly(indus,3))
summary(lm.indus3) # Evidence of non linearity.
##
## Call:
## lm(formula = crim ~ poly(indus, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.278 -2.514 0.054 0.764 79.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.330 10.950 < 2e-16 ***
## poly(indus, 3)1 78.591 7.423 10.587 < 2e-16 ***
## poly(indus, 3)2 -24.395 7.423 -3.286 0.00109 **
## poly(indus, 3)3 -54.130 7.423 -7.292 1.2e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared: 0.2597, Adjusted R-squared: 0.2552
## F-statistic: 58.69 on 3 and 502 DF, p-value: < 2.2e-16
# chas is a qualitative predictor, so a polynomial fit does not make sense in this case.