3.7 Exercises

Conceptual

1.

Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

The p-values in Table 3.4 correspond to the null hypotheses that, holding the other advertising budgets fixed, TV, radio and newspaper advertising respectively have no effect on sales. The null hypothesis is rejected for the TV and radio predictors, which have very low \(p\)-values. In contrast, newspaper has a high \(p\)-value, so its null hypothesis is not rejected: there is no evidence of a relationship between newspaper advertising and sales once TV and radio are in the model. Therefore, the \(p\)-values say that TV and radio are highly significant for predicting sales, whereas newspaper, with its high p-value, is not recommended for a future predictive model.

2.

Carefully explain the differences between the KNN classifier and KNN regression methods.

The KNN classifier selects a fixed number of neighbors \(K\) and, for a test observation \(x_{0}\), identifies the \(K\) training observations closest to \(x_{0}\). It then estimates the conditional probability of each class as the fraction of those neighbors belonging to that class and assigns \(x_{0}\) to the most likely class; applying this rule over the whole feature space implicitly draws a decision boundary based on the training data.

The KNN regression method, on the other hand, also looks at the prediction point \(x_{0}\) and, given a value of \(K\), identifies the \(K\) training observations closest to \(x_{0}\), calling this neighborhood \(N_{0}\). It then estimates \(f(x_{0})\) using the average of the training responses in \(N_{0}\). In short, the classifier predicts a qualitative label by majority vote, while the regression predicts a quantitative value by averaging.
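To make the contrast concrete, here is a minimal hand-rolled sketch in base R (the toy data and variable names are invented for illustration only): for a single test point, the classifier takes a majority vote among the \(K\) nearest training responses, while the regression takes their average.

set.seed(1)
x.train <- matrix(rnorm(40), ncol = 2)                       # 20 training points in two dimensions
cl.train <- factor(sample(c("A", "B"), 20, replace = TRUE))  # qualitative response (classification)
y.train <- rnorm(20)                                         # quantitative response (regression)
x0 <- c(0.25, -0.5)                                          # a single test point
K <- 5

# Euclidean distances from x0 to every training observation
d <- sqrt(rowSums((x.train - matrix(x0, nrow = 20, ncol = 2, byrow = TRUE))^2))
N0 <- order(d)[1:K]                                          # indices of the K nearest neighbors

names(which.max(table(cl.train[N0])))                        # KNN classifier: most common class in N0
mean(y.train[N0])                                            # KNN regression: average response in N0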

3.

Suppose we have a data set with five predictors:

  • \(X_{1}\) = GPA
  • \(X_{2}\) = IQ
  • \(X_{3}\) = Gender (1 for Female and 0 for Male)
  • \(X_{4}\) = Interaction between GPA and IQ
  • \(X_{5}\) = Interaction between GPA and Gender.

The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get

  • \(\hat{β}_{0} = 50\),
  • \(\hat{β}_{1} = 20\)
  • \(\hat{β}_{2} = 0.07\)
  • \(\hat{β}_{3} = 35\)
  • \(\hat{β}_{4} = 0.01\)
  • \(\hat{β}_{5} = −10\)

\[salary = β_{0} + β_{1} \times GPA + β_{2} \times IQ + β_{3} \times Gender + β_{4} \times (GPA \times IQ) + β_{5} \times (GPA \times Gender)\]

\[salary = 50 + 20 \times GPA + 0.07 \times IQ + 35 \times Gender + 0.01 \times (GPA \times IQ) -10 \times (GPA \times Gender)\]

a)

Which answer is correct, and why?

  • i. For a fixed value of IQ and GPA, males earn more on average than females. –> X

  • ii. For a fixed value of IQ and GPA, females earn more on average than males. –> X

  • iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough. –> V

  • iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough. –> X

  • Females scenario, \(Gender = 1\)

\[salary_{Female} = 50 + 20 \times GPA + 0.07 \times IQ + 35 + 0.01 \times (GPA \times IQ) -10 \times GPA\]

\[salary_{Female} = 85 + 10 \times GPA + 0.07 \times IQ + 0.01 \times (GPA \times IQ)\]

  • Males scenario, \(Gender = 0\)

\[salary_{Male} = 50 + 20 \times GPA + 0.07 \times IQ + 0.01 \times (GPA \times IQ)\]

For a fixed value of \(GPA\) and \(IQ\), which gender earns more depends on the \(GPA\). The minimum value of \(GPA\) above which the male salary is higher than the female salary can be calculated as follows:

\[salary_{Male} > salary_{Female}\]

\[50 + 20 \times GPA + 0.07 \times IQ + 0.01 \times (GPA \times IQ) > 85 + 10 \times GPA + 0.07 \times IQ + 0.01 \times (GPA \times IQ)\]

\[GPA > 3.5\]

For the male salary to be higher than the female salary, the condition \(GPA > 3.5\) must be satisfied, so alternative iii is correct.
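A quick numerical check of this threshold (a minimal sketch; the salary() helper below simply encodes the fitted equation above and is not part of the original data):

salary <- function(gpa, iq, gender) {
  50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender
}

# male minus female salary equals 10*GPA - 35, so it changes sign at GPA = 3.5
salary(3, 100, gender = 0) - salary(3, 100, gender = 1)   # -5: females earn more
salary(4, 100, gender = 0) - salary(4, 100, gender = 1)   #  5: males earn more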

(b)

Predict the salary of a female with IQ of 110 and a GPA of 4.0.

\[salary_{Female} = 85 + 10 \times GPA + 0.07 \times IQ + 0.01 \times (GPA \times IQ)\]

\[salary_{Female} = 85 + 10 \times 4.0 + 0.07 \times 110 + 0.01 \times (4.0 \times 110)\]

\[salary_{Female} = 137.1\]
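The same calculation in R, using the salary() helper sketched in part (a):

salary(gpa = 4.0, iq = 110, gender = 1)   # 50 + 80 + 7.7 + 35 + 4.4 - 40 = 137.1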

(c)

True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

False: the magnitude of the coefficient by itself says nothing about the evidence for an interaction effect. That evidence is given by the \(p\)-value (equivalently, the t-statistic) of the interaction term: if its \(p\)-value is low, there is evidence of a real interaction between \(GPA\) and \(IQ\), no matter how small the coefficient estimate is.

4.

I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. \(Y = β_{0} + β_{1}X + β_{2}X^2 + β_{3}X^3 + ϵ\).

a)

Suppose that the true relationship between X and Y is linear, i.e. \(Y = β_{0} + β_{1}X + ϵ\). Consider the training residual sum of squares \((RSS)\) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

WRONG ANSWER

As the true \(f\) is linear, we might expect the linear regression to perform better than the cubic regression, with a lower training \(RSS\).

RIGHT ANSWER

The cubic regression will fit the training data more closely, reducing the training \(RSS\), because the cubic model is more flexible than the linear model: least squares chooses the cubic fit from a strictly larger family of curves, so its training \(RSS\) can never be higher than that of the linear fit, even though the extra flexibility only chases the noise.

(b)

Answer (a) using test rather than training \(RSS\).

WRONG ANSWER

In this scenario, as the true \(f\) is linear, the cubic fit is overfitted to the training data, so its test RSS will be higher; its variance is high because the model is too flexible.

RIGHT ANSWER

The same idea as in the previous item: as the true relationship is linear, the linear regression is expected to perform better on test data, since the cubic fit adds variance without any compensating reduction in bias, so we would expect the linear model to have the lower test \(RSS\).

(c)

Suppose that the true relationship between \(X\) and \(Y\) is not linear, but we don’t know how far it is from linear. Consider the training \(RSS\) for the linear regression, and also the training \(RSS\) for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

The cubic regression would have a lower training \(RSS\) than the linear regression in this case, even if a cubic function is still not flexible enough for the true \(f\), because the more flexible model always follows the training data at least as closely.

(d) Answer (c) using test rather than training \(RSS\).

For the test \(RSS\) there is not enough information to tell: it depends on how far from linear the true \(f\) is. The further it is from linear, the more the cubic regression would tend to perform better; if it is only slightly non-linear, the linear fit could still have the lower test \(RSS\).
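The behaviour discussed in (a) and (b) can be illustrated with a small simulation; this is only a sketch under assumed settings (a linear true \(f\), n = 100 and standard normal noise), not part of the original exercise:

set.seed(42)
n <- 100
x.tr <- rnorm(n); y.tr <- 1 + 2*x.tr + rnorm(n)   # training data, true f is linear
x.te <- rnorm(n); y.te <- 1 + 2*x.te + rnorm(n)   # independent test data

fit.lin <- lm(y.tr ~ x.tr)
fit.cub <- lm(y.tr ~ poly(x.tr, 3))

# training RSS: the cubic fit is never worse on the training data
c(linear = sum(resid(fit.lin)^2), cubic = sum(resid(fit.cub)^2))

# test RSS: the extra flexibility tends to hurt when the truth is linear
c(linear = sum((y.te - predict(fit.lin, data.frame(x.tr = x.te)))^2),
  cubic  = sum((y.te - predict(fit.cub, data.frame(x.tr = x.te)))^2))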

5.

Consider the fitted values that result from performing linear regression without an intercept. In this setting, the \(i\)th fitted value takes the form

\[ \hat{y_{i}} = x_{i}\hat{β}, \]

\[\hat{β} = \left(\sum_{i=1}^nx_{i}y_{i}\right) \Biggl/ \left(\sum_{i'=1}^nx_{i'}^{2}\right)\]

Show that we can write \[\hat{y_{i}} = \sum_{i'=1}^na_{i'}y_{i'}\]

What is \(a_{i'}\)?

Note: We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.

Substituting \(\hat{β}\) in the \(\hat{y_{i}}\) expression we get:

\[ \hat{y_{i}} = x_{i}\frac{\sum_{i'=1}^nx_{i'}y_{i'}}{\sum_{i''=1}^nx_{i''}^{2}} \]

Since \(x_{i}\) does not depend on the summation index \(i'\), it can be moved inside the sum:

\[ \hat{y_{i}} = \sum_{i'=1}^n\frac{x_{i}x_{i'}}{\sum_{i''=1}^nx_{i''}^{2}}y_{i'} \]

Defining \(a_{i'} = \frac{x_{i}x_{i'}}{\sum_{i''=1}^nx_{i''}^{2}}\), the result is:

\[ \hat{y_{i}} = \sum_{i'=1}^na_{i'}y_{i'} \]
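A quick numerical check of this identity (a sketch only; the simulated x5 and y5 below are made up for the check):

set.seed(1)
x5 <- rnorm(50)
y5 <- 3*x5 + rnorm(50)

fit5 <- lm(y5 ~ x5 + 0)            # regression through the origin
i <- 7                             # an arbitrary observation
a <- x5[i] * x5 / sum(x5^2)        # the weights a_{i'} derived above

sum(a * y5)                        # linear combination of the responses
unname(fitted(fit5)[i])            # the i-th fitted value from lm(); the two numbers agree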

6.

Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (\(\bar{x}\),\(\bar{y}\)).

A simple linear regression is written as:

\[ \hat{y_{i}} = \hat{β}_{0} + \hat{β}_{1}x_{i} \]

and the coefficient \(\hat{β}_{0}\) is written as:

\[ \hat{β}_{0} = \bar{y} - \hat{β}_{1}\bar{x} \]

Evaluating the least squares line at \(x = \bar{x}\):

\[ \hat{y} = \hat{β}_{0} + \hat{β}_{1}\bar{x} \]

Substituting the expression for \(\hat{β}_{0}\):

\[ \hat{y} = \bar{y} - \hat{β}_{1}\bar{x} + \hat{β}_{1}\bar{x} \]

\[ \hat{y} = \bar{y} \]

So the fitted value at \(\bar{x}\) is exactly \(\bar{y}\), which proves that the least squares line always passes through the point (\(\bar{x}\),\(\bar{y}\)).
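This can also be checked numerically (a minimal sketch with simulated data, assumed only for illustration):

set.seed(1)
x6 <- rnorm(50)
y6 <- 2 + 5*x6 + rnorm(50)

fit6 <- lm(y6 ~ x6)
predict(fit6, data.frame(x6 = mean(x6)))   # the fitted value at x-bar ...
mean(y6)                                   # ... equals y-bar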

7.

It is claimed in the text that in the case of simple linear regression of \(Y\) onto \(X\), the \(R^2\) statistic (3.17) is equal to the square of the correlation between \(X\) and \(Y\) (3.18). Prove that this is the case. For simplicity, you may assume that \(\bar{x}\) = \(\bar{y}\) = 0.

First, the \(Cor(X,Y)^2\) with \(\bar{x}=\bar{y}=0\) condition can be written as:

\[Cor(X,Y) = \frac{\sum_{i=1}^n(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^n(x_{i}-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_{i}-\bar{y})^2}}\]

\[Cor(X,Y) = \frac{\sum_{i=1}^nx_{i}y_{i}}{\sqrt{\sum_{i=1}^nx_{i}^2}\sqrt{\sum_{i=1}^ny_{i}^2}}\]

\[Cor(X,Y)^2 = \frac{\big(\sum_{i=1}^nx_{i}y_{i}\big)^2}{\sum_{i=1}^nx_{i}^2\sum_{i=1}^ny_{i}^2}\]

A key point is to notice that the numerator is the square of a single sum, while the denominator is a product of two sums of squares; they are not the same thing.

Then, the \(R^2\) must to be equal to that expression we just worked with. Now, the \(R^2\) Statistic can be written as:

\[R^2 = \frac{TSS-RSS}{TSS}\]

\(TSS\), the total sum of squares, with the same condition as before can be written as:

\[ TSS = \sum_{i=1}^{n}(y_{i}-\bar{y})^2 \]

\[ TSS = \sum_{i=1}^{n}y_{i}^2 \]

Now, the \(RSS\) is where the math is. The residual sum of squares following the same simplification can be written as:

\[ RSS = \sum_{i=1}^{n}(y_{i}-\hat{y_{i}})^2 \]

The term \(\hat{y_{i}}\) can be expressed as:

\[ \hat{y_{i}} = \hat{β}_{0} + \hat{β}_{1}x_{i} \]

By the simplification, the term \(\hat{β}_{0}\) is expressed as:

\[\hat{β}_{0} = \bar{y} - \hat{β}_{1}\bar{x}\]

\[ \hat{β}_{0} = 0 \]

The \(\hat{β}_{1}\) can be expressed as:

\[\hat{β}_{1} = \frac{\sum_{i=1}^n(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}\]

And with the simplification:

\[\hat{β}_{1} = \frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i=1}^{n}x_{i}^2}\]

Plugging in \(\hat{β}_{0} = 0\) and \(\hat{β}_{1}\) in \(\hat{y_{i}}\) results in:

\[ \hat{y_{i}} = \frac{x_{i}\sum_{i'=1}^nx_{i'}y_{i'}}{\sum_{i'=1}^{n}x_{i'}^2} \]

Now plugging \(\hat{y_{i}}\) in \(RSS\) and working with the expression:

\[ RSS = \sum_{i=1}^{n}(y_{i}-\hat{β}_{1}x_{i})^2 \]

\[ RSS = \sum_{i=1}^{n}\Big(y_{i}^2-2y_{i}\hat{β}_{1}x_{i}+(\hat{β}_{1}x_{i})^2\Big) \]

\[ RSS = \sum_{i=1}^{n}y_{i}^2-\sum_{i=1}^{n}2y_{i}\hat{β}_{1}x_{i}+\sum_{i=1}^{n}\hat{β}_{1}^2x_{i}^2 \]

As \(\hat{β}_{1}\) is a constant, we can move it outside the sums:

\[ RSS = \sum_{i=1}^{n}y_{i}^2-2\hat{β}_{1}\sum_{i=1}^{n}x_{i}y_{i}+\hat{β}_{1}^2\sum_{i=1}^{n}x_{i}^2 \]

Plugging in \(\hat{β}_{1}\) we have:

\[ RSS = \sum_{i=1}^{n}y_{i}^2-2\frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i=1}^{n}x_{i}^2}\sum_{i=1}^{n}x_{i}y_{i}+\Bigg(\frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i=1}^{n}x_{i}^2}\Bigg)^2\sum_{i=1}^{n}x_{i}^2 \]

Simplifying the expression:

\[ RSS = \sum_{i=1}^{n}y_{i}^2 - 2\frac{\big(\sum_{i=1}^nx_{i}y_{i}\big)^2}{\sum_{i=1}^{n}x_{i}^2} + \frac{\big(\sum_{i=1}^nx_{i}y_{i}\big)^2}{\sum_{i=1}^{n}x_{i}^2} \]

Once more:

\[ RSS = \sum_{i=1}^{n}y_{i}^2 - \frac{\big(\sum_{i=1}^nx_{i}y_{i}\big)^2}{\sum_{i=1}^{n}x_{i}^2} \]

Plugging in the \(RSS\) and the \(TSS\) into the \(R^2\) expression:

\[R^2 = \frac{TSS-RSS}{TSS}\]

\[R^2 = \frac{\sum_{i=1}^{n}y_{i}^2 - \sum_{i=1}^{n}y_{i}^2 + \Large\frac{\big(\sum_{i=1}^{n}x_{i}y_{i}\big)^2}{\sum_{i=1}^{n}x_{i}^2}}{\sum_{i=1}^{n}y_{i}^2}\]

Cancelling the first two terms of the numerator and simplifying the fraction:

\[ R^2 = \frac{\big(\sum_{i=1}^{n}x_{i}y_{i}\big)^2}{\sum_{i=1}^{n}x_{i}^2\sum_{i=1}^{n}y_{i}^2} \]

Which is the same expression found in \(Cor(X,Y)^2\)

This is connected with the quote from the chapter 3 slides that says:

“It can be shown that in this simple linear regression setting that \(R^2\) = \(r^2\), where \(r = Cor(X, Y )\) is the correlation between \(X\) and \(Y\). In other words, the squared correlation and the \(R^2\) statistic are identical”
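A quick numerical check of the identity (a sketch with simulated data, assumed only for illustration):

set.seed(1)
x7 <- rnorm(100)
y7 <- 1 - 2*x7 + rnorm(100)

fit7 <- lm(y7 ~ x7)
summary(fit7)$r.squared   # R^2 from the simple linear regression
cor(x7, y7)^2             # squared correlation; the two values coincide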


Applied

8.

This question involves the use of simple linear regression on the Auto data set.

(a)

Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:

library(ISLR)
data("Auto")
attach(Auto)
head(Auto)
fit.8a <- lm(mpg~horsepower)
summary(fit.8a)
## 
## Call:
## lm(formula = mpg ~ horsepower)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictor and the response?

Looking at the \(p\)-value (< 2.2e-16), we could infer that horsepower and mpg have a relationship. The book says that (pg 67):

“a small p-value indicates that it is UNLIKELY to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response. We reject the null hypothesis—that is, we declare a relationship to exist between \(X\) and \(Y\) —if the p-value is small enough. Typical p-value cutoffs for rejecting the null hypothesis are 5 or 1 %.”

As the coefficient \(\beta_{1}\) is negative, when horsepower increases, mpg decreases.

ii. How strong is the relationship between the predictor and the response?

The \(RSE\) (Residual standard error) and the \(R^2\)-statistic can be used to answer.

The \(RSE\) is 4.906, which means that the actual mpg values deviate from the regression line by about 4.906 units on average.

Judging by the mean value of the response mpg, 23.44592, this represents a percentage error of \(4.906\div23.44592 \approx 0.2092\), i.e. about 20.92%, which is quite high.

The \(R^2\) statistic expresses the same idea as a proportion of variance explained, whereas the \(RSE\) is reported in the units of \(Y\).

The value found is 0.6049, which means that 60.49% of the variability in mpg can be explained using horsepower.
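These quantities can also be pulled directly from the fitted model object (a small sketch using the fit.8a object defined above):

summary(fit.8a)$sigma / mean(mpg)   # RSE relative to the mean response, roughly 0.21
summary(fit.8a)$r.squared           # R^2, roughly 0.61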

iii. Is the relationship between the predictor and the response positive or negative?

It is negative. \(\beta_{1}=-0.157845\)

iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?

“How accurately can we predict mpg?”

“The accuracy associated with this estimate depends on whether we wish to predict an individual response (prediction interval), \(Y = f(X) + ϵ\), or the average response, \(f(X)\) (confidence interval).”

confidence intervals: is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. The range is defined in terms of lower and upper limits computed from the sample of data.”

prediction intervals: are always wider than confidence intervals, because they incorporate both the error in the estimate for \(f(X)\) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).”

predict(fit.8a,data.frame(horsepower=c(98)),interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(fit.8a,data.frame(horsepower=c(98)),interval = "prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

(b)

Plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(horsepower,mpg, main = "mpg vs horsepower", col="red" )
abline(fit.8a, col="blue")

(c)

Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

DUBIOUS ANSWER

There might be outliers in the horsepower predictor, at the observations with index values 331, 321 and 328.

MORE CONFIDENT ANSWER

There are outliers that can be seen in the residuals vs fitted plot and in the standardized residuals vs leverage plot.

par(mfrow =c(2,2))
plot(fit.8a)

The residuals plot also shows that there is evidence of non-linearity.

Observations 94 and 116 in the studentized residuals versus leverage plot have both a very high leverage statistic and a high studentized residual; they are outliers as well as high-leverage observations.

par(mfrow =c(1,2))
plot(predict(fit.8a), residuals(fit.8a))
plot(predict(fit.8a), rstudent(fit.8a))

The leverage statistic, accessed with the hatvalues() function, is described in the book as follows:

“\(h_{i}\) increases with the distance of \(x_{i}\) from \(\bar{x}\). The leverage statistic \(h_{i}\) is always between \(1/n\) and 1, and the average leverage for all the observations is always equal to \((p + 1)/n\) (line plotted). If a given observation has a leverage statistic that greatly exceeds \((p+1)/n\), then we may suspect that the corresponding point has high leverage.”

plot(hatvalues (fit.8a))
which.max (hatvalues (fit.8a))
## 116 
## 116
abline(0.0225,0)

9.

This question involves the use of multiple linear regression on the Auto data set.

(a)

Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(Auto)

(b)

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
cor(Auto[-c(9)])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
#cor(Auto[1:8])

(c)

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

fit.9c <- lm(mpg~.,data = Auto[-c(9)])
summary(fit.9c)
## 
## Call:
## lm(formula = mpg ~ ., data = Auto[-c(9)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Analyzing the \(p\)-values we can infer that:

There is a strong relationship for the year and weight predictors.

There is a moderate relationship for the displacement predictor.

There is a weak relationship for the cylinders, horsepower and acceleration predictors.

ii. Which predictors appear to have a statistically significant relationship to the response?

By the \(p\)-values, year and weight are the most significant predictors.

iii. What does the coefficient for the year variable suggest?

That as the year increases, mpg increases as well.

  • INTERESTING OBSERVATION

The average effect of an increase of 1 year is an increase of 0.7507727 in mpg (all other predictors remaining constant).

(d)

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

The analysis is the same as in part (c) of the previous exercise; the same kinds of problems (evidence of non-linearity, outliers and high-leverage observations) can be checked in the diagnostic plots, as sketched below.
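A minimal sketch of the diagnostic plots for the multiple regression fit, using the fit.9c object defined in (c):

par(mfrow = c(2,2))
plot(fit.9c)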

(e)

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

fit.9.e1 <- lm(mpg~year*weight)
summary(fit.9.e1)
## 
## Call:
## lm(formula = mpg ~ year * weight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0397 -1.9956 -0.0983  1.6525 12.9896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.105e+02  1.295e+01  -8.531 3.30e-16 ***
## year         2.040e+00  1.718e-01  11.876  < 2e-16 ***
## weight       2.755e-02  4.413e-03   6.242 1.14e-09 ***
## year:weight -4.579e-04  5.907e-05  -7.752 8.02e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.193 on 388 degrees of freedom
## Multiple R-squared:  0.8339, Adjusted R-squared:  0.8326 
## F-statistic: 649.3 on 3 and 388 DF,  p-value: < 2.2e-16

All the predictors are statistically significant after the introduction of the interaction terms.

Another scenario could be:

fit.9.e2 <- lm(mpg~cylinders * displacement+displacement * weight, data = Auto[, 1:8])
summary(fit.9.e2)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement * 
##     weight, data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2934  -2.5184  -0.3476   1.8399  17.7723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.262e+01  2.237e+00  23.519  < 2e-16 ***
## cylinders               7.606e-01  7.669e-01   0.992    0.322    
## displacement           -7.351e-02  1.669e-02  -4.403 1.38e-05 ***
## weight                 -9.888e-03  1.329e-03  -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03  3.426e-03  -0.872    0.384    
## displacement:weight     2.128e-05  5.002e-06   4.254 2.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared:  0.7272, Adjusted R-squared:  0.7237 
## F-statistic: 205.8 on 5 and 386 DF,  p-value: < 2.2e-16

In this case, some terms are not statistically significant even after introducing the interactions, as one can see in the p-value for the cylinders:displacement term. Note that displacement itself has a very low p-value, but its interaction with cylinders has a high p-value.

Similarly, the displacement:weight interaction term is statistically significant, although its p-value (2.64e-05) is higher than that of weight alone (6.69e-13).

(f)

Try a few different transformations of the variables, such as \(log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.

fit.9f.1 <- lm(mpg~poly(horsepower,2))
fit.9f.2 <- lm(mpg~poly(displacement,2))
fit.9f.3 <- lm(mpg~poly(displacement,1))

poly(horsepower^2)

par(mfrow =c(2,2))
summary(fit.9f.1)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
plot(fit.9f.1)

poly(displacement^2)

par(mfrow =c(2,2))
summary(fit.9f.2)
## 
## Call:
## lm(formula = mpg ~ poly(displacement, 2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2165  -2.2404  -0.2508   2.1094  20.5158 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              23.4459     0.2205 106.343  < 2e-16 ***
## poly(displacement, 2)1 -124.2585     4.3652 -28.466  < 2e-16 ***
## poly(displacement, 2)2   31.0895     4.3652   7.122 5.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.365 on 389 degrees of freedom
## Multiple R-squared:  0.6888, Adjusted R-squared:  0.6872 
## F-statistic: 430.5 on 2 and 389 DF,  p-value: < 2.2e-16
plot(fit.9f.2)

log(horsepower)

par(mfrow = c(2,2))
fit.9f.4 <- lm(mpg~log(horsepower))
summary(fit.9f.4)
## 
## Call:
## lm(formula = mpg ~ log(horsepower))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2299  -2.7818  -0.2322   2.6661  15.4695 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     108.6997     3.0496   35.64   <2e-16 ***
## log(horsepower) -18.5822     0.6629  -28.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.501 on 390 degrees of freedom
## Multiple R-squared:  0.6683, Adjusted R-squared:  0.6675 
## F-statistic: 785.9 on 1 and 390 DF,  p-value: < 2.2e-16
plot(fit.9f.4)

log(displacement)

par(mfrow =c(2,2))
fit.9f.5 <- lm(mpg~log(displacement))
summary(fit.9f.5)
## 
## Call:
## lm(formula = mpg ~ log(displacement))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.1204  -2.5843  -0.4217   2.1979  19.9005 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        85.6906     2.1422   40.00   <2e-16 ***
## log(displacement) -12.1385     0.4155  -29.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.377 on 390 degrees of freedom
## Multiple R-squared:  0.6863, Adjusted R-squared:  0.6855 
## F-statistic: 853.4 on 1 and 390 DF,  p-value: < 2.2e-16
plot(fit.9f.5)

sqrt(horsepower)

par(mfrow =c(2,2))
fit.9f.6 <- lm(mpg~sqrt(horsepower))
summary(fit.9f.6)
## 
## Call:
## lm(formula = mpg ~ sqrt(horsepower))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9768  -3.2239  -0.2252   2.6881  16.1411 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        58.705      1.349   43.52   <2e-16 ***
## sqrt(horsepower)   -3.503      0.132  -26.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.665 on 390 degrees of freedom
## Multiple R-squared:  0.6437, Adjusted R-squared:  0.6428 
## F-statistic: 704.6 on 1 and 390 DF,  p-value: < 2.2e-16
plot(fit.9f.6)

sqrt(displacement)

par(mfrow =c(2,2))
fit.9f.7 <- lm(mpg~sqrt(displacement))
summary(fit.9f.7)
## 
## Call:
## lm(formula = mpg ~ sqrt(displacement))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4034  -2.7367  -0.4956   2.3207  19.3499 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        47.11839    0.86246   54.63   <2e-16 ***
## sqrt(displacement) -1.75878    0.06186  -28.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.458 on 390 degrees of freedom
## Multiple R-squared:  0.6746, Adjusted R-squared:  0.6738 
## F-statistic: 808.5 on 1 and 390 DF,  p-value: < 2.2e-16
plot(fit.9f.7)

FITTED POINTS

sqrt(displacement) vs lm(displacement) vs log(displacement) vs displacement^2

par(mfrow =c(2,2))

plot(displacement,mpg, xlab="displacement")
points(displacement,fitted(fit.9f.3),col="red",pch=20)
legend("topright", legend = c("linear"), col = c("red"), pch = 19)

plot(mpg~displacement, xlab="displacement")
points(displacement,fitted(fit.9f.2),col="red",pch=20)
legend("topright", legend = c("degree 2"), col = c("red"), pch = 19)

plot(mpg~displacement, xlab="displacement")
points(displacement,fitted(fit.9f.7),col="red",pch=20)
legend("topright", legend = c("sqrt"), col = c("red"), pch = 19)

plot(mpg~displacement, xlab="displacement")
points(displacement,fitted(fit.9f.5),col="red",pch=20)
legend("topright", legend = c("log"), col = c("red"), pch = 19)

anova(fit.9f.3, fit.9f.2, fit.9f.7, fit.9f.5)

FITTED POINTS

sqrt(horsepower) vs lm(horsepower) vs log(horsepower) vs horsepower^2

par(mfrow =c(2,2))

plot(horsepower,mpg, xlab="horsepower")
points(horsepower,fitted(fit.8a),col="red",pch=20)
legend("topright", legend = c("linear"), col = c("red"), pch = 19)

plot(mpg~horsepower, xlab="horsepower^2")
points(horsepower,fitted(fit.9f.1),col="red",pch=20)
legend("topright", legend = c("degree 2"), col = c("red"), pch = 19)

plot(mpg~horsepower, xlab="sqrt(horsepower)")
points(horsepower,fitted(fit.9f.6),col="red",pch=20)
legend("topright", legend = c("sqrt"), col = c("red"), pch = 19)

plot(mpg~horsepower, xlab="log(horsepower)")
points(horsepower,fitted(fit.9f.4),col="red",pch=20)
legend("topright", legend = c("log"), col = c("red"), pch = 19)

anova(fit.8a, fit.9f.1, fit.9f.6, fit.9f.4)

FITTED LINE (JUST FOR PRACTICING)

sqrt(displacement) vs lm(displacement) vs log(displacement) vs displacement^2

par(mfrow =c(2,2))
displacement.limits <- range(displacement)
displacement.grid <- seq(from=displacement.limits[1], to=displacement.limits[2])

preds.fit.9f.7 <- predict(fit.9f.7, newdata = list(displacement = displacement.grid), se = TRUE)
preds.fit.9f.5 <- predict(fit.9f.5, newdata = list(displacement = displacement.grid), se = TRUE)
preds.fit.9f.3 <- predict(fit.9f.3, newdata = list(displacement = displacement.grid), se = TRUE)
preds.fit.9f.2 <- predict(fit.9f.2, newdata = list(displacement = displacement.grid), se = TRUE)
  
plot(displacement,mpg,col="darkgrey")
lines(displacement.grid,preds.fit.9f.3$fit,col="blue", lwd = 2)
legend("topright", legend = c("linear"), col = c("blue"), pch = 19)

plot(displacement,mpg,col="darkgrey")
lines(displacement.grid,preds.fit.9f.2$fit,col="blue", lwd = 2)
legend("topright", legend = c("degree 2"), col = c("blue"), pch = 19)

plot(displacement,mpg,col="darkgrey")
lines(displacement.grid, preds.fit.9f.7$fit,col="blue", lwd = 2)
legend("topright", legend = c("sqrt"), col = c("blue"), pch = 19)

plot(displacement,mpg,col="darkgrey")
lines(displacement.grid,preds.fit.9f.5$fit,col="blue", lwd = 2)
legend("topright", legend = c("log"), col = c("blue"), pch = 19)

FITTED LINE (JUST FOR PRACTICING)

sqrt(horsepower) vs lm(horsepower) vs log(horsepower) vs horsepower^2

par(mfrow =c(2,2))
horsepower.limits <- range(horsepower)
horsepower.grid <- seq(from = horsepower.limits[1], to = horsepower.limits[2])

preds.fit.8a <- predict(fit.8a, newdata = list(horsepower = horsepower.grid), se = TRUE)
preds.fit.9f.1 <- predict(fit.9f.1, newdata = list(horsepower = horsepower.grid), se = TRUE)
preds.fit.9f.6 <- predict(fit.9f.6, newdata = list(horsepower = horsepower.grid), se = TRUE)
preds.fit.9f.4 <- predict(fit.9f.4, newdata = list(horsepower = horsepower.grid), se = TRUE)
  
plot(horsepower,mpg,col="darkgrey")
lines(horsepower.grid, preds.fit.8a$fit,col="blue", lwd = 2)
legend("topright", legend = c("linear"), col = c("blue"), pch = 19)

plot(horsepower,mpg,col="darkgrey")
lines(horsepower.grid,preds.fit.9f.1$fit,col="blue", lwd = 2)
legend("topright", legend = c("degree 2"), col = c("blue"), pch = 19)

plot(horsepower,mpg,col="darkgrey")
lines(horsepower.grid, preds.fit.9f.6$fit,col="blue", lwd = 2)
legend("topright", legend = c("sqrt"), col = c("blue"), pch = 19)

plot(horsepower,mpg,col="darkgrey")
lines(horsepower.grid,preds.fit.9f.4$fit,col="blue", lwd = 2)
legend("topright", legend = c("log"), col = c("blue"), pch = 19)

As we can see from the plots and summaries above, the model with a polynomial of degree 2 is the most suitable for both horsepower and displacement; this is also supported by the p-values in the anova() comparison. Among the single transformations, the log fit has a slightly higher \(R^2\) than the square-root fit for both variables. The F-statistic for the degree-2 models is also high, giving strong evidence against the null hypothesis.

Note: the rows for the linear and log models show no F-statistic or p-value because anova() compares each model with the one listed before it, and a sequential F-test is only defined when the comparison adds degrees of freedom; the first model has nothing to be compared with, and the log model has the same number of parameters as the model listed before it.

PLOTS: VARIABLE VS TRANSFORMATION

par(mfrow = c(2, 2))

plot(horsepower,mpg, xlab="horsepower")
points(horsepower,fitted(fit.8a),col="red",pch=20)

plot((horsepower)^2, mpg)
points((horsepower)^2,fitted(fit.9f.1),col="red",pch=20)

plot(sqrt(horsepower), mpg)
points(sqrt(horsepower),fitted(fit.9f.6),col="red",pch=20)

plot(log(horsepower), mpg)
points(log(horsepower),fitted(fit.9f.4),col="red",pch=20)

par(mfrow =c(2,2))

plot(displacement,mpg, xlab="displacement")
points(displacement,fitted(fit.9f.3),col="red",pch=20)

plot(mpg ~ I(displacement ^2), xlab="displacement^2")          
points((displacement)^2,fitted(fit.9f.2),col="red",pch=20)     

plot(mpg~sqrt(displacement), xlab="sqrt(displacement)")
points(sqrt(displacement),fitted(fit.9f.7),col="red",pch=20)

plot(mpg~log(displacement), xlab="log(displacement)")
points(log(displacement),fitted(fit.9f.5),col="red",pch=20)

10.

This question should be answered using the Carseats data set.

(a)

Fit a multiple regression model to predict Sales using Price, Urban, and US.

library(ISLR)
data("Carseats")
head(Carseats)
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
attach(Carseats)
fit.10.a1 <- lm(Sales~Price+Urban+US)
contrasts(US)
##     Yes
## No    0
## Yes   1
summary(fit.10.a1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
plot(Sales)

(b)

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

\(\beta_{1} = -0.054459\) –> Price Coefficient/Quantitative

The negative sign indicates that when Price increases, Sales decrease. The p-value is low, meaning that the null hypothesis is rejected and the predictor has a significant relationship with the response.

Commercial interpretation: when the price increases by 1 dollar, sales decrease by about 0.0544 × 1000 ≈ 54 units (Sales is measured in thousands of units).

\(\beta_{2} = -0.021916\) –> UrbanYes Coefficient//Qualitative

The negative sign has the same effect as for the previous variable, but the interpretation deserves a little more care. The predictor is a factor with two levels, No and Yes, encoded as a single dummy variable that equals 1 for Yes, so the coefficient is relative to the Yes level: a store in an urban location is associated with slightly lower sales, all else held fixed. The p-value, however, is quite high, so the null hypothesis cannot be rejected for this predictor.

Commercial interpretation: if the location is urban, sales decrease by about 0.0219 × 1000 ≈ 22 units (though this effect is not statistically significant).

\(\beta_{3} = 1.200573\) –> USYes Coefficient//Qualitative

Similar to the previous qualitative predictor, but with the opposite sign (it tends to increase Sales) and a low p-value.

Commercial interpretation: if the store is in the US, sales increase by about 1.2 × 1000 = 1200 units.

(c)

Write out the model in equation form, being careful to handle the qualitative variables properly.

We have on the model:

  • One intercept

  • 1 Quantitative response: Sales

  • 1 Quantitative predictor: Price

  • 2 Qualitative predictors: Urban and US

The two qualitative predictors (factors) must be treated carefully. They are encoded as indicator, or dummy, variables with two levels, Yes and No. Because of this, the model takes a different form depending on the values of the dummy variables.

\[ Sales = \beta_{0} + \beta_{1} \times Price + \beta_{2} \times Urban + \beta_{3} \times US = \begin{cases} \beta_{0} + \beta_{1} \times Price + \beta_{2} + \beta_{3} & \text{if both Yes} \\ \beta_{0} + \beta_{1} \times Price & \text{if both No} \\ \beta_{0} + \beta_{1} \times Price + \beta_{2} & \text{if Urban Yes and US No }\\ \beta_{0} + \beta_{1} \times Price + \beta_{3} & \text{if Urban No and US Yes }\\ \end{cases} \]

\[ Sales = 13.043 - 0.0545 \times Price − 0.022 \times Urban + 1.201 \times US = \begin{cases} 13.043 - 0.0545 \times Price − 0.022 + 1.201 & \text{if both Yes} \\ 13.043 - 0.0545 \times Price & \text{if both No} \\ 13.043 - 0.0545 \times Price − 0.022 & \text{if Urban Yes and US No }\\ 13.043 - 0.0545 \times Price + 1.201 & \text{if Urban No and US Yes }\\ \end{cases} \]

\[ Sales = 13.043 - 0.0545 \times Price − 0.022 \times Urban + 1.201 \times US = \begin{cases} 13.043 - 0.0545 \times Price + 1.179 & \text{if both Yes} \\ 13.043 - 0.0545 \times Price & \text{if both No} \\ 13.043 - 0.0545 \times Price − 0.022 & \text{if Urban Yes and US No }\\ 13.043 - 0.0545 \times Price + 1.201 & \text{if Urban No and US Yes }\\ \end{cases} \]
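As a quick check of the four cases, the fitted model can be evaluated at each combination of the dummy variables (the value Price = 100 is an arbitrary choice for illustration):

new.obs <- data.frame(Price = 100,
                      Urban = c("Yes", "No", "Yes", "No"),
                      US    = c("Yes", "No", "No",  "Yes"))
predict(fit.10.a1, new.obs)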

(d)

For which of the predictors can you reject the null hypothesis \(H_{0} : β_{j} = 0\) ?

Price and US. Observe the p-values.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fit.10.e1 <- lm(Sales~Price+US)
summary(fit.10.e1)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f)

How well do the models in (a) and (e) fit the data?

Not very well. The \(R^2\) is quite low in both cases (about 0.24), so the models explain only a small fraction of the variability in Sales. The anova() comparison also gives a high p-value for the smaller model that excludes the dummy variable with the lowest statistical significance, so the two models can be considered equivalent.

anova(fit.10.a1,fit.10.e1)

(g)

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(fit.10.e1)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h)

Is there evidence of outliers or high leverage observations in the model from (e)?

Observing the plots below, one can notice outliers (observations 377 and 69, for example) in the residuals vs fitted plot.

The high-leverage observations can be seen in the residuals vs leverage plot; there is one observation with leverage near 0.04.

par(mfrow=c(2,2))
plot(fit.10.e1)

11.

In this problem we will investigate the t-statistic for the null hypothesis \(H_{0} : β = 0\) in simple linear regression without an intercept. To begin, we generate a predictor x and a response y as follows.

set.seed(1)

x=rnorm (100)

y=2*x+rnorm(100)

set.seed (1)
x=rnorm (100)
y=2*x+rnorm (100)

(a)

Perform a simple linear regression of \(y\) onto \(x\), without an intercept. Report the coefficient estimate \(\hat{β}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_{0} : β = 0\). Comment on these results. (You can perform regression without an intercept using the command lm(y∼x+0).)

\(\hat{β}\) = 1.9939

\(SE(\hat{β})\) = 0.1065

t-statistic = 18.73

p-value = <2e-16

The fitted model has a positive slope and, according to its p-value, the predictor is highly statistically significant.

fit.11.a1 <- lm(y~x+0)
summary(fit.11.a1)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
plot(y~x)
lines(x,fit.11.a1$fitted.values)

(b)

Now perform a simple linear regression of \(x\) onto \(y\) without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-values associated with the null hypothesis \(H_{0} : β = 0\). Comment on these results.

\(\hat{β}\) = 0.39111

\(SE(\hat{β})\) = 0.02089

t-statistic = 18.73

p-value = <2e-16

Again the model has a positive slope and a highly significant p-value. The \(SE(\hat{β})\) is smaller than in the previous fit, but only because the response here, x, is smaller in absolute value than y. Comparing the ranges we have:

range(x) = [-2.214700, 2.401618]

range(y) = [-5.080096, 5.228336]

The book says that:

“if \(SE(\hat{β})\) is large, then \(\hat{β}\) must be large in absolute value in order for us to reject the null hypothesis.”

Which is in agreement with the previous inference.

Lastly, the t-statistic was the same in both cases.

fit.11.b1 <- lm(x~y+0)
summary(fit.11.b1)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
plot(x~y)
lines(y,fit.11.b1$fitted.values)

(c)

What is the relationship between the results obtained in (a) and (b)?

Recalling the concept of the t-statistic from the book:

“measures the number of standard deviations that \(\hat{β_{1}}\) is away from 0.”

It is also worth recalling the formula:

\[ t = \frac{\hat{β_{1}} - 0}{SE(\hat{β_{1}})} \]

This says that if the standard error is high, the coefficient estimate must be large in absolute value for the null hypothesis to be rejected. As the t-statistic does not change from one model to the other, we can infer that the two regressions are equivalent; we have simply swapped (flipped) the axes, in the sense that:

\[ y = 2 \times x + \epsilon \]

\[ x = \frac{y - \epsilon}{2} \]

(d)

For the regression of Y onto X without an intercept, the t-statistic for \(H_{0} : β = 0\) takes the form \(\hat{β}/SE(\hat{β})\), where \(\hat{β}\) is given by (3.38), and where

\[ SE(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^n(y_{i}-x_{i}\hat{\beta})^2}{(n-1)\sum_{i'=1}^nx_{i'}^2}} \]

(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) Show algebraically, and confirm numerically in R, that the t-statistic can be written as:

\[ t = \frac{(\sqrt{n-1})\sum_{i=1}^nx_{i}y_{i}}{\sqrt{(\sum_{i=1}^nx_{i}^2)(\sum_{i'=1}^ny_{i'}^2) - (\sum_{i'=1}^nx_{i'}y_{i'})^2 }} \]

First, the present case without the intercept has the definition of \(\hat{\beta}\) as:

\[\hat{β} = \frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i'=1}^{n}x_{i'}^2}\]

From the t-statistic expression, we have:

\[ t = \frac{\hat{β}}{SE(\hat{β})} \]

Plugging in the expression for \(SE(\hat{\beta})\):

\[ t = \frac{\hat{β}}{\Large\sqrt{\frac{\sum_{i=1}^n(y_{i}-x_{i}\hat{\beta})^2}{(n-1)\sum_{i'=1}^nx_{i'}^2}}} \]

Squaring the terms and reordering the fractions we have:

\[ t^2 = \frac{\hat{β}^2}{\Large\frac{\sum_{i=1}^n(y_{i}-x_{i}\hat{\beta})^2}{(n-1)\sum_{i'=1}^nx_{i'}^2}} \]

\[ t^2 = \frac{\hat{β}^2(n-1)\sum_{i'=1}^nx_{i'}^2}{\sum_{i=1}^n(y_{i}-x_{i}\hat{\beta})^2} \]

Plugging in \(\hat{β}\):

\[ t^2 = \frac{\Large\Big(\frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i'=1}^{n}x_{i'}^2}\Big)^2(n-1)\sum_{i'=1}^nx_{i'}^2}{\sum_{i=1}^n(y_{i}-x_{i}\hat{\beta})^2} \]

\[ t^2 = \frac{\Large\frac{(\sum_{i=1}^nx_{i}y_{i})^2}{(\sum_{i'=1}^{n}x_{i'}^2)^2}(n-1)\sum_{i'=1}^nx_{i'}^2}{\sum_{i=1}^n(y_{i}-x_{i}\hat{\beta})^2} \]

\[ t^2 = \frac{(n-1)\Large\frac{(\sum_{i=1}^nx_{i}y_{i})^2}{\sum_{i'=1}^{n}x_{i'}^2}}{\sum_{i=1}^n(y_{i}-x_{i}\hat{\beta})^2} \]

Now expanding the denominator:

\[ t^2 = \frac{(n-1)\Large\frac{(\sum_{i=1}^nx_{i}y_{i})^2}{\sum_{i'=1}^{n}x_{i'}^2}} {\sum_{i=1}^n(y_{i}^2 - 2x_{i}y_{i}\hat{\beta} + x_{i}^2\hat{\beta}^2)} \]

Working with the denominators:

\[ t^2 = \frac{(n-1)\Large\frac{(\sum_{i=1}^nx_{i}y_{i})^2}{\sum_{i'=1}^{n}x_{i'}^2}} {\sum_{i=1}^ny_{i}^2 - \sum_{i=1}^n2x_{i}y_{i}\hat{\beta} + \sum_{i=1}^nx_{i}^2\hat{\beta}^2} \]

\[ t^2 = \frac{(n-1)\Large\frac{(\sum_{i=1}^nx_{i}y_{i})^2}{\sum_{i'=1}^{n}x_{i'}^2}} {\sum_{i=1}^ny_{i}^2 - 2\hat{\beta}\sum_{i=1}^nx_{i}y_{i} + \hat{\beta}^2\sum_{i=1}^nx_{i}^2} \]

\[ t^2 = (n-1)\frac{ (\sum_{i=1}^nx_{i}y_{i})^2 } { (\sum_{i'=1}^{n}x_{i'}^2) (\sum_{i=1}^ny_{i}^2 - 2\hat{\beta}\sum_{i=1}^nx_{i}y_{i} + \hat{\beta}^2\sum_{i=1}^nx_{i}^2) } \]

\[ t^2 = (n-1)\frac{(\sum_{i=1}^nx_{i}y_{i})^2} {(\sum_{i'=1}^{n}x_{i'}^2) (\sum_{i=1}^ny_{i}^2) - 2\hat{\beta}(\sum_{i'=1}^{n}x_{i'}^2)(\sum_{i=1}^nx_{i}y_{i}) + \hat{\beta}^2(\sum_{i'=1}^{n}x_{i'}^2) (\sum_{i=1}^nx_{i}^2)} \]

Plugging in \(\hat{β}\) once more:

\[ t^2 = (n-1)\frac{(\sum_{i=1}^nx_{i}y_{i})^2} {(\sum_{i'=1}^{n}x_{i'}^2) (\sum_{i=1}^ny_{i}^2) - 2\Large\frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i'=1}^{n}x_{i'}^2}(\sum_{i'=1}^{n}x_{i'}^2)(\sum_{i=1}^nx_{i}y_{i}) + \Big(\frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i'=1}^{n}x_{i'}^2}\Big)^2(\sum_{i'=1}^{n}x_{i'}^2) (\sum_{i=1}^nx_{i}^2)} \]

Working with the sums multiplications:

\[ t^2 = (n-1)\frac{(\sum_{i=1}^nx_{i}y_{i})^2}{(\sum_{i'=1}^{n}x_{i'}^2) (\sum_{i=1}^ny_{i}^2) - 2(\sum_{i=1}^nx_{i}y_{i})^2 + \Large\frac{(\sum_{i=1}^nx_{i}y_{i})^2}{\sum_{i'=1}^{n}x_{i'}^2}(\sum_{i=1}^nx_{i}^2)} \]

Since \(\sum_{i'=1}^nx_{i'}^2\) and \(\sum_{i=1}^nx_{i}^2\) are the same sum (the summation index is just a label), the last term simplifies, and we can relabel the indices to match the target expression:

\[ t^2 = (n-1)\frac{(\sum_{i=1}^nx_{i}y_{i})^2} {(\sum_{i=1}^{n}x_{i}^2) (\sum_{i'=1}^ny_{i'}^2) - 2(\sum_{i'=1}^nx_{i'}y_{i'})^2 + (\sum_{i'=1}^nx_{i'}y_{i'})^2} \]

\[ t^2 = \frac {(n-1)(\sum_{i=1}^nx_{i}y_{i})^2} {(\sum_{i=1}^{n}x_{i}^2) (\sum_{i'=1}^ny_{i'}^2) - (\sum_{i'=1}^nx_{i'}y_{i'})^2} \]

Taking the square root of both sides:

\[ t = \frac {\sqrt{(n-1)}\sum_{i=1}^nx_{i}y_{i}} { \sqrt{(\sum_{i=1}^{n}x_{i}^2) (\sum_{i'=1}^ny_{i'}^2) - (\sum_{i'=1}^nx_{i'}y_{i'})^2}} \]

The numerical verification can be done as follows; the t-statistic computed below is the same as the one reported by the summary() function for both regressions.

x_n <- as.numeric(length(x))
x_sum <- sum(x)
x2_sum <- sum(x^2)

y_n <- as.numeric(length(y))
y_sum <- sum(y)
y2_sum <- sum(y^2)

t_xy_num <- sqrt(100-1)*sum(x*y)
t_xy_denom <- sqrt(x2_sum*y2_sum-(sum(x*y))^2)
t_xy <- t_xy_num/t_xy_denom
t_xy
## [1] 18.72593
summary(fit.11.a1)$coefficients[1,3]
## [1] 18.72593
summary(fit.11.b1)$coefficients[1,3]
## [1] 18.72593

(e)

Using the results from (d), argue that the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y.

The expression from (d) is symmetric in \(x\) and \(y\): every term involves the response and the predictor in the same way, so swapping the axes leaves it unchanged. Therefore the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y.

(f)

In R, show that when regression is performed with an intercept, the t-statistic for \(H_{0} : β_{1} = 0\). is the same for the regression of y onto x as it is for the regression of x onto y.

fit.11.f1 = lm(x~y)
fit.11.f2 = lm(y~x)
summary(fit.11.f1)$coefficients[2,3]
## [1] 18.5556
summary(fit.11.f2)$coefficients[2,3]
## [1] 18.5556

12.

This problem involves simple linear regression without an intercept.

(a)

Recall that the coefficient estimate \(\hat{β}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

  • VAGUE ANSWER

When the sums computed from the observations are the same in both cases.

  • COMPLETE ANSWER

Looking at the equation for the regression of Y onto X:

\[\hat{β}_{YX} = \frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i'=1}^{n}x_{i'}^2}\]

Now for the regression of X onto Y:

\[\hat{β}_{XY} = \frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i'=1}^{n}y_{i'}^2}\]

We could infer that:

\[ \hat{β}_{YX} = \hat{β}_{XY} \] \[\frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i'=1}^{n}x_{i'}^2} = \frac{\sum_{i=1}^nx_{i}y_{i}}{\sum_{i'=1}^{n}y_{i'}^2}\]

\[\frac{1}{\sum_{i'=1}^{n}x_{i'}^2} = \frac{1}{\sum_{i'=1}^{n}y_{i'}^2}\]

\[\sum_{i'=1}^{n}x_{i'}^2 = \sum_{i'=1}^{n}y_{i'}^2\]

In this case the two coefficient estimates are equal exactly when the denominators coincide, that is, when the sum of squares of the observed x values equals the sum of squares of the observed y values.

This could be extended to a further analysis as the total sum of squares is defined as:

\[ TSS = \sum{(y_i - \bar{y})^2} \]

Which, when the response has mean zero (so that \(\bar{y} = 0\)), reduces to:

\[ TSS = \sum{y_i^2} \]

This is exactly the quantity appearing in the condition above. The total sum of squares, according to the book, can be described as:

“TSS measures the total variance in the response Y, and can be thought of as the amount of variability inherent in the response before the regression is performed”

Therefore, when the amount of variability in the two sets of observations is the same, the two regressions give the same coefficient estimate.

(b)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X

set.seed(1)
x_dif=rnorm(100)
y_dif=2*x_dif+rnorm(100)

fit.12.b1 <- lm(y_dif~x_dif+0) # regression of Y onto X
fit.12.b2 <- lm(x_dif~y_dif+0) # regression of X onto Y

summary(fit.12.b1)$coefficient[1,1]
## [1] 1.993876
summary(fit.12.b2)$coefficient[1,1]
## [1] 0.3911145

(c)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

library(ggplot2)
set.seed(2)
x_same <- 1:100
y_same <- x_same[order(rnorm(100))] # a random permutation of x_same, so sum(x_same^2) == sum(y_same^2)
data <- data.frame(x_same, y_same)

fit.12.c1 <- lm(y_same ~ x_same + 0)
fit.12.c2 <- lm(x_same ~ y_same + 0)


ggplot(data, aes(x_same, y_same)) + 
  geom_point(color="red") + 
  geom_abline(intercept = 0, slope = coef(fit.12.c1), size = 1, col = "deepskyblue3") + 
  geom_abline(intercept = 0, slope = 1 / coef(fit.12.c2), size = 1, col = "mediumseagreen") + 
  labs(title = "Regression lines (fit.12.c1 & fit.12.c2)")

  labs(colours = "")
## $colours
## [1] ""
## 
## attr(,"class")
## [1] "labels"

13.

In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

(a)

Using the rnorm() function, create a vector, x, containing 100 observations drawn from a \(N(0, 1)\) distribution. This represents a feature, \(X\).

The function rnorm help says that:

“If mean or sd are not specified they assume the default values of 0 and 1”

rnorm generates random deviates”

set.seed(1)
x_13 <- rnorm(100)
plot(x_13)

(b)

Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a \(N(0, 0.25)\) distribution i.e. a normal distribution with mean zero and variance 0.25.

It is known that:

\[\begin{align*} Var(\epsilon) = \sigma^2 \\ 0.25 = \sigma^2 \\ \sigma = 0.5 \end{align*}\]

eps <- rnorm(100, sd = 0.5)
plot(eps)

(c)

Using x and eps, generate a vector y according to the model

\[ Y = -1+0.5X+\epsilon \]

What is the length of the vector y? What are the values of \(β_0\) and \(β_1\) in this linear model?

Length = 100

\(β_0\) = -1

\(β_1\) = 0.5

y_13 <- -1 + 0.5*x_13 + eps
length(y_13)
## [1] 100

(d)

Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

The scatterplot shows a linear relationship with a positive slope, with the points scattered around the line because of the noise term.

plot.13.d1 <- ggplot(data=NULL, aes(x = x_13, y = y_13))

plot.13.d1 +
  geom_point() +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "13 (d)")

(e)

Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{β}_0\) and \(\hat{β}_1\) compare to \(β_0\) and \(β_1\)?

The estimated coefficients are very close to the true values \(β_0 = -1\) and \(β_1 = 0.5\).

fit.13.e1 <- lm(y_13~x_13)
fit.13.e1$coefficients
## (Intercept)        x_13 
##  -1.0188463   0.4994698
summary(fit.13.e1)
## 
## Call:
## lm(formula = y_13 ~ x_13)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93842 -0.30688 -0.06975  0.26970  1.17309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
## x_13         0.49947    0.05386   9.273 4.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4619 
## F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

(f)

Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

JUST TO PRACTICE, THERE ARE TWO PLOTS, ONE USING ggplot2() AND ONE USING plot()

plot(y_13~x_13)
abline(fit.13.e1, col = "red", lty = 1, lwd = 2)
abline(a = -1, b = 0.5, col = "blue", lty = 1, lwd = 2)
legend("bottomright", legend = c("Least Squares", "Population"), col = c("red", "blue"), lty = 1, lwd = 2)

plot.13.d1 +
  geom_point() +
  geom_abline(aes(intercept=fit.13.e1$coefficients[1],slope=fit.13.e1$coefficients[2], col = "Least Squares"), size = 1) +
  geom_abline(aes(intercept=-1,slope=0.5,col = "Population"), size = 1) + 
  scale_colour_manual(name = "Regression Lines:", values = c("red", "blue")) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "13 (f), sd = 0.5")

(g)

Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.

The p-value for the quadratic term (0.164) is well above 0.05, so there is no evidence that the quadratic term improves the fit; the adjusted \(R^2\) also barely changes relative to the linear model.

fit.13.g1 <- lm(y_13 ~ x_13 + I(x_13^2))
summary(fit.13.g1)
## 
## Call:
## lm(formula = y_13 ~ x_13 + I(x_13^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98252 -0.31270 -0.06441  0.29014  1.13500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
## x_13         0.50858    0.05399   9.420  2.4e-15 ***
## I(x_13^2)   -0.05946    0.04238  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared:  0.4779, Adjusted R-squared:  0.4672 
## F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14
plot.13.g1 <- ggplot(data = NULL, aes(x = x_13, y = y_13))
plot.13.g1 +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2)") +
  theme_bw() +
  labs(title = "polynomial regression degree 2, eps = 0.5")
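An equivalent check (a small sketch) is a nested-model comparison with anova(); for a single added term the F-test reproduces the t-test for the quadratic coefficient above:

# compare the linear fit against the fit with the extra quadratic term
anova(fit.13.e1, fit.13.g1)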

(h)

Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term ϵ in (b). Describe your results.

With a smaller error standard deviation the data lie closer to the population line, so the fit is more accurate: the standard errors shrink and \(R^2\) increases (0.687 versus 0.467 for the original fit), while the coefficient estimates remain very close to the true values. The standard deviation corresponding to a variance of 0.1 is computed below:

\[\begin{align*} Var(\epsilon) = \sigma^2 \\ 0.1 = \sigma^2 \\ \sigma = 0.3162278 \end{align*}\]

set.seed(1)
x_13_lv <- rnorm(100)
eps_lv <- rnorm(100, sd = sqrt(0.1))
y_13_lv <- -1 + 0.5*x_13_lv + eps_lv
fit.13.h1 <- lm(y_13_lv~x_13_lv)
coefficients(fit.13.h1)
## (Intercept)     x_13_lv 
##  -1.0119195   0.4996647
summary(fit.13.h1)
## 
## Call:
## lm(formula = y_13_lv ~ x_13_lv)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59351 -0.19409 -0.04411  0.17057  0.74193 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.01192    0.03067  -32.99   <2e-16 ***
## x_13_lv      0.49966    0.03407   14.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3044 on 98 degrees of freedom
## Multiple R-squared:  0.687,  Adjusted R-squared:  0.6838 
## F-statistic: 215.1 on 1 and 98 DF,  p-value: < 2.2e-16
plot.13.h1 <- ggplot(data = NULL, aes(x = x_13_lv, y = y_13_lv))

plot.13.h1 + 
  geom_point() +
  geom_abline(aes(intercept=fit.13.h1$coefficients[1],slope=fit.13.h1$coefficients[2], col = "Least Squares"), size = 1) +
  geom_abline(aes(intercept=-1,slope=0.5,col = "Population"), size = 1) + 
  scale_colour_manual(name = "Regression Lines:", values = c("red", "blue")) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "13 (h), sd_lv = sqrt(0.1) = 0.31")

(i)

Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term ϵ in (b). Describe your results.

Conversely, increasing the error variance spreads the data further from the population line: the coefficient estimates are still close to the true values, but their standard errors grow and \(R^2\) drops (0.180 versus 0.467 for the original fit). The standard deviation used here is:

\[\begin{align*} Var(\epsilon) = \sigma^2 \\ 1.0 = \sigma^2 \\ \sigma = 1.0 \end{align*}\]

set.seed(1)
x_13_hv <- rnorm(100)
eps_hv <- rnorm(100, sd = 1.0)
y_13_hv <- -1 + 0.5*x_13_hv + eps_hv
fit.13.i1 <- lm(y_13_hv~x_13_hv)
coefficients(fit.13.i1)
## (Intercept)     x_13_hv 
##  -1.0376926   0.4989396
summary(fit.13.i1)
## 
## Call:
## lm(formula = y_13_hv ~ x_13_hv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8768 -0.6138 -0.1395  0.5394  2.3462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.03769    0.09699 -10.699  < 2e-16 ***
## x_13_hv      0.49894    0.10773   4.632 1.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared:  0.1796, Adjusted R-squared:  0.1712 
## F-statistic: 21.45 on 1 and 98 DF,  p-value: 1.117e-05
plot.13.i1 <- ggplot(data = NULL, aes(x = x_13_hv, y = y_13_hv))

plot.13.i1 + 
  geom_point() +
  geom_abline(aes(intercept=fit.13.i1$coefficients[1],slope=fit.13.i1$coefficients[2], col = "Least Squares"), size = 1) +
  geom_abline(aes(intercept=-1,slope=0.5,col = "Population"), size = 1) + 
  scale_colour_manual(name = "Regression Lines:", values = c("red", "blue")) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "13 (i), sd_hv = 1.0")

(j)

What are the confidence intervals for \(β_0\) and \(β_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.

According to the book, the confidence intervals are defined as:

\[ \hat{\beta_i} \pm t_{n-2}(0.975) \cdot SE(\hat{\beta_i}) \]

As a simplification, the factor \(t_{n-2}(0.975)\) could be approximated as:

\[ t_{n-2}(0.975) = 2 \]

Resulting in the equation:

\[ \hat{\beta_i} \pm 2 \cdot SE(\hat{\beta_i}) \]

And the \(SE(\hat{\beta_1})\) and \(SE(\hat{\beta_0})\) can be defined in terms of the standard deviation as:

\[ SE(\hat{\beta_0})^2 = \sigma^2 \Big[ \frac{1}{n} + \frac{\bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2} \Big] \]

\[ SE(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum^n_{i=1}(x_i-\bar{x})^2} \]

As the equations above show, the confidence intervals widen when the standard errors are large, which is a consequence of a large error standard deviation (equivalently, a large error variance).
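For example, the approximate interval for \(\hat{\beta_1}\) in the original fit can be computed directly from the standard error reported by summary() (a sketch; the object name coef.13.e1 is just illustrative, and the confint() calls below give the exact t-based intervals):

coef.13.e1 <- summary(fit.13.e1)$coefficients
# estimate +/- 2 * SE for the slope
coef.13.e1["x_13", "Estimate"] + c(-2, 2) * coef.13.e1["x_13", "Std. Error"]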

For \(\hat{\beta_1}\) the confidence intervals were:

Noisier data set: \(Var(\epsilon) = \sigma^2 = 1^2 = 1\)

\[ \hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1}) \approx [0.2851588, 0.7127204]\]

confint(fit.13.i1)
##                  2.5 %     97.5 %
## (Intercept) -1.2301607 -0.8452245
## x_13_hv      0.2851588  0.7127204

Less noisy data set: \(Var(\epsilon) = \sigma^2 = 0.1\)

\[ \hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1}) \approx [0.4320613, 0.5672681]\]

confint(fit.13.h1)
##                  2.5 %     97.5 %
## (Intercept) -1.0727832 -0.9510557
## x_13_lv      0.4320613  0.5672681

Original data set: \(Var(\epsilon) = \sigma^2 = 0.5^2 = 0.25\)

\[ \hat{\beta_1} \pm 2 \cdot SE(\hat{\beta_1}) \approx [0.3925794, 0.6063602]\]

confint(fit.13.e1)
##                  2.5 %     97.5 %
## (Intercept) -1.1150804 -0.9226122
## x_13         0.3925794  0.6063602
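The interval widths make the comparison explicit (a small sketch): the interval for \(\hat{\beta_1}\) is narrowest for the less noisy data and widest for the noisier data.

# width of the 95% interval for the slope in each setting
sapply(list(less_noisy = fit.13.h1, original = fit.13.e1, noisier = fit.13.i1),
       function(f) diff(confint(f)[2, ]))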

14.

This problem focuses on the collinearity problem.

(a)

Perform the following commands in R:

> set.seed(1)

> x1 = runif(100)

> x2 = 0.5 * x1 + rnorm(100) / 10

> y = 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

\[ y = 2 + 2x_1 + 0.3x_2 + \epsilon \]

The coefficients are:

\(\beta_0=2\)

\(\beta_1=2\)

\(\beta_2=0.3\)

set.seed(1)
x1 = runif(100)
x2 = 0.5 * x1 + rnorm(100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

(b)

What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

The correlation between x1 and x2 is 0.8351212, and the scatterplot below shows a clear linear relationship between them. Both indicate that the two predictors are strongly collinear.

cor(x1,x2)
## [1] 0.8351212
plot(x1~x2)

(c)

Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat{β}_0\), \(\hat{β}_1\), and \(\hat{β}_2\)? How do these relate to the true \(β_0\), \(β_1\), and \(β_2\)? Can you reject the null hypothesis \(H_{0} : β_{1} = 0\)? How about the null hypothesis \(H_{0} : β_{2} = 0\)?

The estimated coefficients are:

\(\hat{β}_0 = 2.130500\) (true \(β_0 = 2\))

\(\hat{β}_1 = 1.439555\) (true \(β_1 = 2\))

\(\hat{β}_2 = 1.009674\) (true \(β_2 = 0.3\))

Only the intercept is estimated well; the slope estimates are far from the true values, and their standard errors are large because x1 and x2 are strongly correlated.

The null hypothesis \(H_{0} : β_{1} = 0\) can be rejected, but only marginally (p-value = 0.0487).

The null hypothesis \(H_{0} : β_{2} = 0\) cannot be rejected, since its p-value (0.3754) is well above 0.05.

par(mfrow=c(2,2))
fit.14.c1 <- lm(y~x1+x2)
summary(fit.14.c1)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
plot(fit.14.c1)

(d)

Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis \(H_{0} : β_{1} = 0\)?

Regressing y on x1 alone gives a much better estimate of the x1 effect: \(\hat{β}_1 = 1.98\) is close to the true value 2, and its p-value (2.66e-06) is far below 0.05, so \(H_{0} : β_{1} = 0\) is clearly rejected.

par(mfrow=c(2,2))
fit.14.d1 <- lm(y~x1)
summary(fit.14.d1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06
plot(fit.14.d1)

(e)

Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis \(H_{0} : β_{1} = 0\)?

Regressing y on x2 alone also produces a highly significant coefficient: the p-value (1.37e-05) is far below 0.05, so the null hypothesis that the coefficient of x2 is zero is rejected. Note, however, that the estimate (2.90) is far from the true value 0.3, because x2 absorbs the effect of x1 through their correlation.

par(mfrow=c(2,2))
fit.14.e1 <- lm(y~x2)
summary(fit.14.e1)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05
plot(fit.14.e1)

(f)

Do the results obtained in (c)–(e) contradict each other? Explain your answer.

No, the results do not contradict each other. In the joint model the two predictors are highly correlated, so they carry largely redundant information about the response; a broad range of coefficient combinations yields nearly the same RSS, which makes the individual coefficient estimates very uncertain.

That uncertainty appears as inflated standard errors, which shrink the t-statistics (\(\hat{β}_j\) divided by its standard error) and inflate the p-values; this is why x2 looks insignificant when x1 is also in the model.

Fitted separately, each predictor has a clear relationship with the response, as shown by the very small p-values in (d) and (e). One common way to quantify the collinearity is sketched below.
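A sketch of that check, assuming the car package is installed: the variance inflation factor reports how much the variance of each coefficient estimate is inflated relative to a model with uncorrelated predictors.

library(car)
# VIF values well above 1 indicate that collinearity is inflating the standard errors
vif(fit.14.c1)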

(g)

Now suppose we obtain one additional observation, which was unfortunately mismeasured.

> x1=c(x1 , 0.1)

> x2=c(x2 , 0.8)

> y=c(y,6)

Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.

x1_101 <- c(x1 , 0.1)
x2_101 <- c(x2 , 0.8)
y_101 <- c(y,6)
cor(x1_101,x2_101)
## [1] 0.7392279
  • Correlation

\(Cor(x_{1,101},x_{2,101}) = 0.7392279\)

\(Cor(x_1,x_2) = 0.8351212\)

Adding the extra observation reduces the correlation between the predictors: the point \((0.1, 0.8)\) lies far from the cloud of the other observations, weakening their linear relationship.

par(mfrow=c(1,2))
plot(x1,x2)
plot(x1_101,x2_101)
points(x1_101[101],x2_101[101], pch = 20, col = "red", cex = 2)

  • Model with both predictors

In this model the new observation is both an outlier and a high-leverage point, as the diagnostic plots below show.

The significance of the two predictors is reversed relative to the 100-observation fit: x1 is no longer significant (p = 0.365), while x2 now rejects the null hypothesis (p = 0.006).

par(mfrow=c(2,2))
fit.14.g1 <- lm(y_101~x1_101+x2_101)
summary(fit.14.g1)
## 
## Call:
## lm(formula = y_101 ~ x1_101 + x2_101)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1_101        0.5394     0.5922   0.911  0.36458    
## x2_101        2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
plot(fit.14.g1)

  • x1_101 model

Here the new observation has a large residual, so it is an outlier; its x1 value (0.1) lies well within the range of the other observations, so it is not a high-leverage point.

Compared with the 101-observation model using both predictors, the p-value for x1 drops sharply (4.29e-05), so the null hypothesis is rejected.

par(mfrow=c(1,2))
plot(x1,y)
plot(x1_101,y_101)
points(x1_101[101],y_101[101], pch = 20, col = "red", cex = 2)

par(mfrow=c(2,2))
fit.14.g2 <- lm(y_101~x1_101)
summary(fit.14.g2)
## 
## Call:
## lm(formula = y_101 ~ x1_101)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1_101        1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
plot(fit.14.g2)

  • x2_101 model

Here the new observation is a high-leverage point (x2 = 0.8 lies far outside the range of the other x2 values) but not an outlier, since its residual is not unusually large.

As with the x1_101 model, the p-value remains very small (1.25e-06), so the null hypothesis is rejected.

par(mfrow=c(1,2))
plot(x2,y)
plot(x2_101,y_101)
points(x2_101[101],y_101[101], pch = 20, col = "red", cex = 2)

par(mfrow=c(2,2))
fit.14.g3 <- lm(y_101~x2_101)
summary(fit.14.g3)
## 
## Call:
## lm(formula = y_101 ~ x2_101)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2_101        3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
plot(fit.14.g3)
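These classifications can also be checked numerically (a sketch, using a small helper whose name is just illustrative): a leverage value far above the average leverage \((p + 1)/n\) flags a high-leverage point, and a studentized residual with absolute value around 3 or more flags an outlier.

# leverage and studentized residual of the added observation (index 101) in each model
check.obs.101 <- function(fit) {
  c(leverage = unname(hatvalues(fit)[101]),
    avg_leverage = mean(hatvalues(fit)),
    rstudent = unname(rstudent(fit)[101]))
}
rbind(both = check.obs.101(fit.14.g1),
      x1_only = check.obs.101(fit.14.g2),
      x2_only = check.obs.101(fit.14.g3))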

15.

This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.

library(MASS)
data("Boston")
head(Boston)
attach(Boston)

(a)

For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

The table below shows, for each predictor, the result of the simple linear regression of crim on that predictor: the \(R^2\), the \(p\)-value, and the estimated coefficients;

The rows are sorted in descending order of \(R^2\), which gives a first indication of which predictors are most strongly associated with crim;

rad and tax are the two predictors most strongly related to crim;

zn and chas are the ones least related to crim;

The \(p\)-values show that every predictor rejects the null hypothesis of no association except chas.

The plots below illustrate the most significant predictors. For rad and tax the association is driven by observations clustered at a few distinct values, so they behave almost like categorical variables, whereas for the next two most significant predictors (lstat and nox) the relationship is spread over a continuous range.

fit.15.a.r2 <- c()
fit.15.a.p <- c()
fit.15.a.coeff <- c()
fit.15.a.beta0 <- c()

for(i in names(Boston)) {
  if(i != "crim") {
    # fit the simple regression once per predictor and reuse its summary
    fit.sum <- summary(lm(crim ~ Boston[, i]))
    fit.15.a.r2[i] <- fit.sum$r.squared
    fit.15.a.p[i] <- fit.sum$coefficients[2, 4]
    fit.15.a.coeff[i] <- fit.sum$coefficients[2, 1]
    fit.15.a.beta0[i] <- fit.sum$coefficients[1, 1]
  }
}

fit.15.a.summary <- data.frame(fit.15.a.r2, fit.15.a.p, fit.15.a.coeff, fit.15.a.beta0)
fit.15.a.summary <- fit.15.a.summary[order(fit.15.a.summary$fit.15.a.r2, decreasing = T), ]
fit.15.a.summary
par(mfrow = c(2,2))
boxplot(crim~rad)
stripchart(crim~rad,data=Boston, vertical=TRUE, method="jitter", pch=21, add=T)
boxplot(crim~tax)
stripchart(crim~tax,data=Boston, vertical=TRUE, method="jitter", pch=21, add=T)
plot(lstat, crim)
abline(fit.15.a.beta0["lstat"], fit.15.a.coeff["lstat"])
plot(nox, crim)
abline(fit.15.a.beta0["nox"], fit.15.a.coeff["nox"])

# keep Boston$rad numeric (it is reused as a numeric predictor in (b) and (d));
# convert it to a factor only inside this plot
p1.9e <- ggplot(Boston, aes(x = factor(rad), y = crim)) +
  geom_boxplot() +
  geom_jitter(shape = 16, position = position_jitter(0.2)) +
  scale_x_discrete() +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "15 (a): crim versus rad", x = "rad")
p1.9e

par(mfrow = c(2,2))
boxplot(crim~chas)
stripchart(crim~chas,data=Boston, vertical=TRUE, method="jitter", pch=21, add=T)
plot(zn, crim)
abline(fit.15.a.beta0["zn"], fit.15.a.coeff["zn"])
plot(rm, crim)
abline(fit.15.a.beta0["rm"], fit.15.a.coeff["rm"])
plot(ptratio, crim)
abline(fit.15.a.beta0["ptratio"], fit.15.a.coeff["ptratio"])

(b)

Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_{0} : β_{j} = 0\)?

According to the \(p\)-values (at the 5% level):

Null hypothesis not rejected for: nox, lstat, ptratio, indus, tax, rm, chas and age;

Null hypothesis rejected for: rad, dis, medv, zn and black.

In (a), every predictor except chas was significant when fitted on its own; in the multiple regression the p-values are generally much higher.

This happens because many of the predictors are correlated with one another: each coefficient now measures the effect of a predictor with all the others held fixed, and the shared information inflates the standard errors, pulling down the t-statistics of predictors that looked significant on their own.

To check whether there is any association at all between the predictors and the response, one can also look at the F-statistic instead of the individual t-statistics;

Unlike a scan of many individual t-statistics, the overall F-statistic of the multiple regression adjusts for the number of predictors, so it is not distorted by including many weakly relevant variables;

Referring to the book about the F-statistic in a multiple linear regression:

“Hence, if we use the individual t-statistics and associated p-values in order to decide whether or not there is any association between the variables and the response, there is a very high chance that we will incorrectly conclude that there is a relationship. However, the F-statistic does not suffer from this problem because it adjusts for the number of predictors. Hence, if \(H_{0}\) is true, there is only a 5 % chance that the F-statistic will result in a p-value below 0.05, regardless of the number of predictors or the number of observations.”

“The approach of using an F-statistic to test for any association between the predictors and the response works when p is relatively small, and certainly small compared to n. However, sometimes we have a very large number of variables. If p \(>\) n then there are more coefficients \(β_j\) to estimate than observations from which to estimate them.”

“The square of each t-statistic is the corresponding F-statistic”

In the table below, ranking the predictors by the per-coefficient F-statistic gives exactly the same order as ranking them by p-value;

This is expected, since each per-coefficient F-statistic is simply the square of the corresponding t-statistic, as verified after the table.

library(magrittr)
fit.15.b <- lm(crim ~., data = Boston)

fit.15.b.p <- summary(fit.15.b)$coefficients[,4] 
fit.15.b.f <- summary(fit.15.b)$coefficients[,3]
fit.15.b.f <- fit.15.b.f^2
fit.15.b.std <- summary(fit.15.b)$coefficients[,2]
fit.15.b.coeff <- summary(fit.15.b)$coefficients[,1] 


fit.15.b.summary <- data.frame(fit.15.b.p, fit.15.b.f, fit.15.b.std, fit.15.b.coeff)
#removing the intercept
fit.15.b.summary <- fit.15.b.summary[-c(1),]
fit.15.b.summary <- fit.15.b.summary[order(fit.15.b.summary$fit.15.b.p, decreasing = F),]

fit.15.b.summary
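A quick sketch of that equivalence: the F column of the table above is exactly the squared t-statistics reported by summary(), so either statistic produces the same ordering of the predictors.

# the squared t-statistics reproduce the F column of the table above
all.equal(fit.15.b.summary$fit.15.b.f,
          unname(summary(fit.15.b)$coefficients[row.names(fit.15.b.summary), "t value"]^2))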

(c)

How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the \(x\)-axis, and the multiple regression coefficients from (b) on the \(y\)-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the \(x\)-axis, and its coefficient estimate in the multiple linear regression model is shown on the \(y\)-axis.

The coefficients from the simple regressions (x-axis) and the multiple regression (y-axis) are plotted below;

The nox coefficient is much larger in magnitude than the rest and differs sharply between the two fits, so it dominates the first plot;

Removing it and re-plotting, as done further below, makes the remaining coefficients easier to compare.

fit.15.a.summary <- fit.15.a.summary[order(row.names(fit.15.a.summary)),]
fit.15.b.summary <- fit.15.b.summary[order(row.names(fit.15.b.summary)),]
coeff.names <- row.names(fit.15.b.summary)

ggplot(data = NULL, mapping = aes(fit.15.a.summary$fit.15.a.coeff, fit.15.b.summary$fit.15.b.coeff, col = coeff.names)) +
  geom_jitter(size = 4) +
    labs(title = "15.(c) - Coefficients comparison", 
       subtitle = "simple linear vs multiple linear", 
       x = "Simple Coefficient", 
       y = "Multiple Coefficient") +
  theme_bw() +
  theme(legend.position = "bottom") 

After removing nox and re-plotting, the comparison is clearer;

Most of the remaining coefficients cluster near the origin, with broadly similar values in the simple and multiple regressions.

#removing only `nox`
fit.15.a.summary.nox <- fit.15.a.summary[row.names(fit.15.a.summary) != "nox",]
fit.15.b.summary.nox <- fit.15.b.summary[row.names(fit.15.b.summary) != "nox",]
coeff.names <- row.names(fit.15.b.summary.nox)

ggplot(data = NULL, mapping = aes(fit.15.a.summary.nox$fit.15.a.coeff, fit.15.b.summary.nox$fit.15.b.coeff, col = coeff.names)) +
  geom_jitter(size = 4) +
    labs(title = "15.(c) - Coefficients comparison", 
       subtitle = "simple linear vs multiple linear - no nox", 
       x = "Simple Coefficient", 
       y = "Multiple Coefficient") +
  theme_bw() +
  theme(legend.position = "bottom") 

(d)

Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor \(X\), fit a model of the form:

\[ Y = β_0 + β_1X + β_2X^2 + β_3X^3 + \epsilon \]

The table below lists the results of the degree-3 polynomial regressions, one per predictor;

The p-values and the \(R^2\) values give a first indication of which predictors show the strongest non-linear association with the response.

fit.15.d.r2 <- c()
fit.15.d.p <- c()
fit.15.d.coeff <- c()
fit.15.d.beta0 <- c()

for(i in names(Boston)) {
  if(i != "crim") {
    # fit the cubic polynomial once per predictor and reuse its summary
    fit.sum <- summary(lm(crim ~ Boston[, i] + I(Boston[, i]^2) + I(Boston[, i]^3)))
    fit.15.d.r2[i] <- fit.sum$r.squared
    fit.15.d.p[i] <- fit.sum$coefficients[2, 4]
    fit.15.d.coeff[i] <- fit.sum$coefficients[2, 1]
    fit.15.d.beta0[i] <- fit.sum$coefficients[1, 1]
  }
}

fit.15.d.summary <- data.frame(fit.15.d.r2, fit.15.d.p, fit.15.d.coeff, fit.15.d.beta0)
fit.15.d.summary <- fit.15.d.summary[order(fit.15.d.summary$fit.15.d.r2, decreasing = T), ]
fit.15.d.summary
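A more direct test of the non-linear terms for a single predictor is a nested-model comparison between the linear and cubic fits (a sketch shown here for medv; the object names are illustrative):

# does adding medv^2 and medv^3 significantly reduce the RSS?
fit.15.d.lin <- lm(crim ~ medv, data = Boston)
fit.15.d.cub <- lm(crim ~ medv + I(medv^2) + I(medv^3), data = Boston)
anova(fit.15.d.lin, fit.15.d.cub)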

As expected, the predictors with low p-values and high \(R^2\) (medv, nox and dis below) show the clearest non-linear association with crim:

# quantitative predictors
fit.15.d1 <- lm(crim ~ medv + I(medv^2) + I(medv^3))
fit.15.d4 <- lm(crim ~ nox + I(nox^2) + I(nox^3))
fit.15.d5 <- lm(crim ~ dis + I(dis^2) + I(dis^3))

plot.15.d1 <- ggplot(data = Boston, aes(x = medv, y = crim))
plot.15.d4 <- ggplot(data = Boston, aes(x = nox, y = crim))
plot.15.d5 <- ggplot(data = Boston, aes(x = dis, y = crim))

plot.15.d1 +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
  theme_bw() +
  labs(title = "crim versus polinomial medv degree 3")

plot.15.d4 +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
  theme_bw() +
  labs(title = "crim versus polinomial nox degree 3")

plot.15.d5 +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
  theme_bw() +
  labs(title = "crim versus polinomial dis degree 3")

On the other hand, rad and tax also have high \(R^2\) and low p-values, but they take only a handful of distinct values, so a cubic polynomial is not a natural model for them;

They would be better treated as qualitative (categorical) predictors.

# qualitative-like predictors (few distinct values)
fit.15.d2 <- lm(crim ~ rad + I(rad^2) + I(rad^3))
fit.15.d3 <- lm(crim ~ tax + I(tax^2) + I(tax^3))

plot.15.d2 <- ggplot(data = Boston, aes(x = rad, y = crim))
plot.15.d3 <- ggplot(data = Boston, aes(x = tax, y = crim))

plot.15.d2 +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
  theme_bw() +
  labs(title = "crim versus polinomial rad degree 3")

plot.15.d3 +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x + I(x^2) + I(x^3)") +
  theme_bw() +
  labs(title = "crim versus polinomial tax degree 3")


  1. The concept of correlation between a single predictor and the response does not extend directly to the multiple linear regression setting, since correlation quantifies the association between a single pair of variables rather than among a larger set of variables. The \(R^2\) statistic fills this role.↩︎