ISL2R ANSWERS ON CHAPTER 3

Author

Albar Ugalde Hernández

Chapter 3

Conceptual exercises

1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

Visualize table 3.4

\colorbox{cyan}{Answer:}

The null hypotheses are H_{0} : \beta_{j} = 0 for each coefficient, where each \beta_{j} corresponds to one of the predictors in the table (TV, radio and newspaper), with the other two predictors held in the model. The p-values for TV and radio are virtually zero, so we can conclude that those two variables are statistically significant. On the other hand, the p-value for newspaper is 0.8599, which is quite large; this provides only weak evidence against its null hypothesis, so we fail to reject it. This suggests that there is no significant relationship between sales and newspaper advertising once TV and radio spending are accounted for.

The null hypotheses for the p-values in the table correspond to the assumption that there is no significant relationship between sales and each of the advertising mediums: TV, radio, and newspaper.

Based on the p-values given in the table:

  1. TV: The p-value for TV is less than 0.0001, which is extremely small. This provides strong evidence to reject the null hypothesis for TV. Hence, we can conclude that there is a significant positive relationship between sales and TV advertising.

  2. Radio: Similarly, the p-value for radio is also less than 0.0001, indicating strong evidence against the null hypothesis for radio. This suggests a significant positive relationship between sales and radio advertising.

  3. Newspaper: On the other hand, the p-value for newspaper is 0.8599, which is quite large. This provides weak evidence against the null hypothesis, and hence, we fail to reject it. This suggests that there is no significant relationship between sales and newspaper advertising.

\colorbox{yellow}{In conclusion, the sales appear to be significantly \\ influenced by advertising on TV and radio, but not on newspaper.} This conclusion is based on the p-values associated with each advertising medium in the given table.
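For reference, the regression behind Table 3.4 can be refitted in R. This is a minimal sketch that assumes Advertising.csv has been downloaded from the book's website into the working directory (the file name and column names are assumptions, since the Advertising data is not included in the ISLR2 package):

# Refit the multiple regression summarized in Table 3.4
# (assumes Advertising.csv from the book's website is in the working directory)
Advertising <- read.csv("Advertising.csv")
summary(lm(sales ~ TV + radio + newspaper, data = Advertising))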

2. Carefully explain the differences between the KNN classifier and KNN regression methods.

\colorbox{cyan}{Answer:}

The K-Nearest Neighbors (KNN) algorithm is a type of instance-based learning method that can be used for both classification and regression tasks. However, the way it makes predictions differs between these two tasks:

  1. KNN Classifier: In a classification task, the KNN algorithm identifies the k data points in the training dataset that are closest to the new, unclassified object. It then assigns the most common class label among these k nearest neighbors to the new object. This is essentially a majority voting system, where each of the k nearest neighbors gets to vote on the class of the new object.

  2. KNN Regression: In a regression task, the KNN algorithm also identifies the k closest neighbors from the training dataset. However, instead of taking a majority vote, it calculates the average (or sometimes median) of the target values of these k nearest neighbors. This average or median value is then used as the predicted value for the new object.

In summary, while both methods rely on the concept of ‘neighborhood’ and ‘distance’, KNN Classification predicts a class label based on the majority vote of the nearest neighbors, whereas KNN Regression predicts a continuous value based on the average or median of the nearest neighbors. It’s important to note that the choice between KNN Classification and KNN Regression depends on whether your target variable is categorical (for classification) or continuous (for regression).
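The difference can be made concrete with a small hand-rolled example in base R (a minimal sketch, not an efficient implementation; the data and variable names are illustrative):

# Toy 1-D training data
set.seed(1)
x_train <- runif(20)
y_class <- factor(ifelse(x_train > 0.5, "A", "B"))  # categorical response
y_num   <- 5 * x_train + rnorm(20, sd = 0.2)        # continuous response

knn_predict <- function(x0, x_train, y, k, type = c("class", "reg")) {
  type <- match.arg(type)
  nn <- order(abs(x_train - x0))[1:k]      # indices of the k nearest neighbors
  if (type == "class") {
    names(which.max(table(y[nn])))         # majority vote among the neighbors
  } else {
    mean(y[nn])                            # average of the neighbors' responses
  }
}

knn_predict(0.62, x_train, y_class, k = 3, type = "class")  # predicted label
knn_predict(0.62, x_train, y_num,   k = 3, type = "reg")    # predicted value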

3. Suppose we have a data set with five predictors, X_{1} = GPA, X_{2} = IQ, X_{3} = Level (1 for College and 0 for High School), X_{4} = Interaction between GPA and IQ, and X_{5} = Interaction between GPA and Level. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \hat{\beta}_{0} = 50, \hat{\beta}_{1} = 20, \hat{\beta}_{2}= 0.07, \hat{\beta}_{3} = 35, \hat{\beta}_{4}= 0.01, \hat{\beta}_{5} = −10.

(a) Which answer is correct, and why?

i. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates.

ii. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates.

iii. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.

iv. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates provided that the GPA is high enough.

\colorbox{cyan}{Answer:}

The correct answer is iii. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.

Here’s why:

The coefficient for Level (X_{3}) is 35, which means that, all else being equal (ceteris paribus, we would say in economics ;), college graduates earn $35,000 more on average than high school graduates. However, the interaction term GPA:Level (X_{5}) has a coefficient of -10. This means that the salary difference between college graduates and high school graduates decreases by $10,000 for each additional GPA point.

So, for fixed IQ and GPA, the difference in expected starting salary between college and high school graduates is 35 - 10 \times \text{GPA} (in thousands of dollars). Since 35/10 = 3.5, this difference is positive when the GPA is below 3.5 and negative when the GPA is above 3.5. In other words, college graduates earn more on average only for GPAs below 3.5; once the GPA is high enough (above 3.5), high school graduates earn more on average, which is exactly statement iii.

Note that this interpretation assumes that the fitted linear model with these interaction terms correctly describes the relationship between the predictors and the response, which might not be the case in reality, as the book explains. It’s always a good idea to check the assumptions of your model and consider other potential factors that could be influencing the response. The 3.5 threshold is verified with a quick calculation below.
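As a quick check of the 3.5 threshold, the salary gap 35 - 10 \times \text{GPA} can be evaluated over a few hypothetical GPA values (the grid is illustrative):

# Difference in expected starting salary (college minus high school),
# in thousands of dollars, as a function of GPA: 35 - 10 * GPA
gpa <- c(3.0, 3.25, 3.5, 3.75, 4.0)
35 - 10 * gpa   # positive below GPA = 3.5, zero at 3.5, negative above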

(b) Predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.

\colorbox{cyan}{Answer:} The predicted salary can be computed from the fitted linear regression model:

\hat{Y} = \hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2} + \hat{\beta}_{3}X_{3} + \hat{\beta}_{4}X_{4} + \hat{\beta}_{5}X_{5}

Substituting GPA = 4.0, IQ = 110 and Level = 1 (College):

\hat{Y} = 50 + 20(4.0) + 0.07(110) + 35(1) + 0.01(4.0)(110) - 10(4.0)(1) = 50 + 80 + 7.7 + 35 + 4.4 - 40 = 137.1

\therefore the predicted salary of a college graduate with an IQ of 110 and a GPA of 4.0 is \colorbox{yellow}{\$137,100.}
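The same arithmetic can be carried out in R, with the given coefficient estimates hard-coded (a small sketch, not part of the original solution):

# Predicted starting salary (in thousands of dollars) for a college graduate
# with IQ = 110 and GPA = 4.0, using the least squares estimates given above
gpa <- 4.0
iq <- 110
level <- 1  # 1 = College, 0 = High School
50 + 20 * gpa + 0.07 * iq + 35 * level + 0.01 * gpa * iq - 10 * gpa * level  # 137.1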

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

\colorbox{cyan}{Answer:}

The statement is false. The size of the coefficient for the GPA/IQ interaction term (X_{4}) alone does not determine the presence or absence of an interaction effect: the coefficient measures the magnitude of the effect on the scale of the variables, not the strength of the evidence for it. In fact, because GPA \times IQ takes large values (roughly 4 \times 110 = 440), even a coefficient as small as 0.01 contributes several thousand dollars to the predicted salary.

The evidence for an interaction effect is typically assessed by the p-value associated with the interaction term. If the p-value is small (typically less than 0.05), then we would conclude that there is strong evidence of an interaction effect, regardless of the size of the coefficient.

In other words, a small coefficient means that the interaction effect is small, but it does not mean that the effect is not statistically significant. Conversely, a large coefficient does not necessarily mean that the effect is statistically significant. The significance of the effect is determined by the p-value, not the size of the coefficient.

It’s also important to note that even small effects can be important in certain contexts, especially when the variables have a large range or when the outcome has high stakes. So, the practical significance of an effect should be considered alongside its statistical significance.

4. I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = \beta_{0} + \beta_{1}X + \beta_{2}X^{2} + \beta_{3}X^{3} + \epsilon.

(a) Suppose that the true relationship between X and Y is linear, i.e. Y = \beta_{0} + \beta_{1}X + \epsilon. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

\colorbox{cyan}{Answer:}

If the true relationship between X and Y is linear, then the linear regression model is the correct model for the data. The cubic regression model is a more complex model that includes higher-order terms, which may lead to overfitting if the true relationship is linear.

We would expect the training residual sum of squares (RSS) for the cubic regression to be lower than the RSS for the linear regression, even if the true relationship between X and Y is linear.

Here’s why:

A cubic regression model is more flexible than a linear regression model: the linear model is a special case of the cubic model (set \beta_2 = \beta_3 = 0). Because least squares minimizes the training RSS over this larger class of fits, the training RSS of the cubic model can never be higher than that of the linear model, and it will typically be lower because the extra terms can also bend the fit to capture random noise in the training data.

However, while the cubic model may have a lower training RSS, this does not necessarily mean it is a better model. Fitting the noise in the training data too closely can lead to overfitting, where the model performs well on the training data but poorly on new, unseen data. This is a trade-off between bias (error that comes from overly restrictive model assumptions) and variance (how much the fitted model changes in response to fluctuations in the training data).

So, while we can say that the cubic model will likely have a lower training RSS, we cannot say for certain that it will provide better predictions on new data without additional information, such as a validation set or test set RSS. It’s always important to validate your model using out-of-sample data to ensure it generalizes well.
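This can be illustrated with a small simulation (a sketch with an assumed truly linear data-generating process; the exact numbers depend on the seed):

# Simulate data whose true relationship is linear and compare training RSS
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit_linear <- lm(y ~ x)
fit_cubic  <- lm(y ~ poly(x, 3))

sum(resid(fit_linear)^2)  # training RSS of the linear fit
sum(resid(fit_cubic)^2)   # training RSS of the cubic fit: never larger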

(b) Answer (a) using test rather than training RSS.

\colorbox{cyan}{Answer:}

For the test RSS, the situation might be different. Although the cubic model has a lower training RSS, it might not generalize well to new data if the true relationship is linear. This is due to overfitting, where the model captures the noise in the training data that doesn’t represent the underlying relationship. In this case, the cubic model’s test RSS could be higher than the linear model’s test RSS. So, if the true relationship is linear, we would expect the test RSS for the linear regression to be lower than the test RSS for the cubic regression. This is an illustration of the bias-variance tradeoff in statistical learning. The linear regression model may have higher bias (underfitting the training data), but lower variance (generalizing better to new data). Conversely, the cubic regression model may have lower bias but higher variance.

(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

\colorbox{cyan}{Answer:}

If the true relationship between X and Y is not linear, the cubic regression, which can model more complex relationships, will likely fit the training data better than the linear regression. This is because the cubic regression has more flexibility due to the additional terms, allowing it to fit a wider range of shapes. Therefore, we would expect the training RSS for the cubic regression to be lower than the training RSS for the linear regression.

(d) Answer (c) using test rather than training RSS.

\colorbox{cyan}{Answer:}

For the test RSS, it’s a bit more complicated. If the true relationship is not linear, the cubic regression might generalize better to new data than the linear regression, leading to a lower test RSS. However, if the true relationship is close to linear or the cubic model overfits the training data, the linear regression might have a lower test RSS. Therefore, without knowing how far the true relationship is from linear, there is not enough information to tell which model would have a lower test RSS. This again illustrates the bias-variance tradeoff in statistical learning. The cubic regression model may have lower bias but potentially higher variance, while the linear regression model may have higher bias but potentially lower variance.

5. Consider the fitted values that result from performing linear regression without an intercept. In this setting, the ith fitted value takes the form \hat{y}_{i} = x_{i}\hat{\beta}, where \hat{\beta} = \sum_{i=1}^{n}(x_{i}y_{i}) / \sum_{i'=1}^{n}(x_{i'}^{2}) . Show that we can write \hat{y}_{i} = \sum_{i'=1}^{n}a_{i'}y_{i'}. What is a_{i'}? Note: We interpret this result by saying that the ftted values from linear regression are linear combinations of the response values.

\colorbox{cyan}{Answer:}

To show that we can write \hat{y}_i = \sum_{i'=1}^{n} a_{i'} y_{i'}, let’s first express \hat{y}_i using the given expression:

\begin{align*} \hat{y}_i &= x_i \hat{\beta} \\ &= x_i \left(\frac{\sum_{i'=1}^{n} x_{i'} y_{i'}}{\sum_{i''=1}^{n} x_{i''}^2}\right) \\ &= \sum_{i'=1}^{n} \left(\frac{x_i x_{i'}}{\sum_{i''=1}^{n} x_{i''}^2}\right) y_{i'} \end{align*}

Defining

a_{i'} = \frac{x_i x_{i'}}{\sum_{i''=1}^{n} x_{i''}^2},

we can write

\hat{y}_i = \sum_{i'=1}^{n} a_{i'} y_{i'}

So a_{i'} = \frac{x_i x_{i'}}{\sum_{i''=1}^{n} x_{i''}^2}. This demonstrates that each fitted value \hat{y}_i is a linear combination of the response values y_{i'}; each a_{i'} is the weight of the corresponding y_{i'} in this linear combination, and it depends only on the predictor values, not on the responses.
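A quick numerical check of this identity (a minimal sketch with simulated data; the variable names are illustrative):

# Verify that the first fitted value from a no-intercept regression equals
# sum(a * y) with weights a_{i'} = x_1 * x_{i'} / sum(x^2)
set.seed(1)
x <- rnorm(20)
y <- 3 * x + rnorm(20)

fit <- lm(y ~ x + 0)
a <- x[1] * x / sum(x^2)

sum(a * y)       # linear combination of the responses
fitted(fit)[1]   # first fitted value from lm(); the two agree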

6. Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point (\bar{x}, \bar{y}).

\colorbox{cyan}{Answer:}

In simple linear regression, we aim to fit a straight line to a set of data points such that it minimizes the sum of the squared differences between the observed and predicted values. The least squares regression line is determined by minimizing the sum of the squared vertical distances between the observed responses (y-values) and the values predicted by the line for corresponding predictor (x) values.

The equation for the least squares regression line is given by:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

where:

-\hat{y} is the predicted response,

-x is the predictor variable,

-\hat{\beta}_0 is the intercept of the regression line,

-\hat{\beta}_1 is the slope of the regression line.

According to the equation (3.4) provided:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

This equation calculates the slope of the least squares regression line \hat{\beta}_1.

Let’s consider the point (\bar{x}, \bar{y}), which represents the mean of the predictor variable (x) and the mean of the response variable (y).

Substituting \bar{x} and \bar{y} into the equation for the least squares regression line, we get:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}

Given that \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} (from equation 3.4), we can substitute this into the equation above:

\hat{y} = (\bar{y} - \hat{\beta}_1 \bar{x}) + \hat{\beta}_1 \bar{x} = \bar{y}

This demonstrates that the predicted response (\hat{y}) at the mean of the predictor variable (\bar{x}) is equal to the mean of the response variable (\bar{y}).

\therefore Therefore, the least squares regression line always passes through the point (\bar{x}, \bar{y}).
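A quick numerical illustration of this fact (a sketch on simulated data; any data set would do):

# The fitted value at x = mean(x) equals mean(y)
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

fit <- lm(y ~ x)

predict(fit, newdata = data.frame(x = mean(x)))  # prediction at x-bar
mean(y)                                          # the same value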

7. It is claimed in the text that in the case of simple linear regression of Y onto X, the R^2 statistic (3.17) is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that \bar{x} = \bar{y} = 0.

R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} \tag{3.17}

\text{Cor}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{3.18}


\colorbox{cyan}{Answer:}

In simple linear regression, the R^2 statistic is defined as the proportion of the variance in the response variable Y that is explained by the predictor variable X. The formula for R^2 is given by:

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

where:

-y_i is the observed response value,

-\hat{y}_i is the predicted response value from the regression line,

-\bar{y} is the mean of the response variable Y.

The correlation between X and Y is given by:

\text{Corr}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

where:

-x_i is the observed predictor value,

-\bar{x} is the mean of the predictor variable X.

Given that \bar{x} = \bar{y} = 0, the formulas simplify to:

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} y_i^2} \qquad \text{Corr}(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}

With zero means the fitted intercept is \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 0, so the fitted values are \hat{y}_i = \hat{\beta}_1 x_i with \hat{\beta}_1 = \sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x_i^2. The residual sum of squares is therefore

\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{\beta}_1 x_i)^2 = \sum_{i=1}^{n} y_i^2 - 2\hat{\beta}_1 \sum_{i=1}^{n} x_i y_i + \hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} x_i y_i\right)^2}{\sum_{i=1}^{n} x_i^2}

Substituting this into the formula for R^2:

R^2 = 1 - \frac{\text{RSS}}{\sum_{i=1}^{n} y_i^2} = \frac{\left(\sum_{i=1}^{n} x_i y_i\right)^2}{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2} = \text{Corr}(X, Y)^2

Therefore, the R^2 statistic is equal to the square of the correlation between X and Y in the case of simple linear regression.

\colorbox{yellow}{\text{Another answer:}}

To prove that in the case of simple linear regression of Y onto X, the R^2 statistic (equation 3.17) is equal to the square of the correlation between X and Y (equation 3.18), let’s start by expressing the formulas for R^2 and the correlation coefficient \rho.

The formula for R^2 (equation 3.17) can be written as: R^2 = \frac{\text{SSR}}{\text{SST}}, since \text{SST} = \text{SSR} + \text{RSS} for a least squares fit, so that \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{RSS}}{\text{SST}}.

where SSR is the regression sum of squares and SST is the total sum of squares. In the case of simple linear regression, SSR can be calculated as: \text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2

where \hat{y}_i are the predicted values of Y and \bar{y} is the mean of Y.

Similarly, SST is calculated as: \text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2

Now, let’s express the correlation coefficient \rho (equation 3.18): \rho = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

Given that \bar{x} = \bar{y} = 0, this simplifies to: \rho = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}

Now, let’s substitute \bar{x} = \bar{y} = 0 into the expressions for SSR and SST: \text{SSR} = \sum_{i=1}^{n} \hat{y}_i^2 \text{SST} = \sum_{i=1}^{n} y_i^2

We can see that when \bar{x} = \bar{y} = 0, SSR is the sum of squared predicted values and SST is the sum of squared observed values.

Now, let’s express R^2 solely in terms of predicted and observed values: R^2 = \frac{\sum_{i=1}^{n} \hat{y}_i^2}{\sum_{i=1}^{n} y_i^2}

Since the fitted values are \hat{y}_i = \hat{\beta} x_i with \hat{\beta} = \sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x_i^2 (the intercept vanishes when \bar{x} = \bar{y} = 0), we have \sum_{i=1}^{n} \hat{y}_i^2 = \hat{\beta}^2 \sum_{i=1}^{n} x_i^2 = \frac{\left(\sum_{i=1}^{n} x_i y_i\right)^2}{\sum_{i=1}^{n} x_i^2}. Substituting this into the expression for R^2 gives: R^2 = \frac{\left(\sum_{i=1}^{n} x_i y_i\right)^2}{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2} = \rho^2

Thus, we have proven that in the case of simple linear regression of Y onto X, the R^2 statistic is equal to the square of the correlation between X and Y.
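The identity is also easy to verify numerically (a sketch with simulated data):

# R^2 from a simple linear regression equals the squared correlation
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

summary(lm(y ~ x))$r.squared
cor(x, y)^2   # identical value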

Applied exercises

8. This question involves the use of simple linear regression on the \textcolor{brown}{Auto} data set.

(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.

For example:

  1. Is there a relationship between the predictor and the response?

  2. How strong is the relationship between the predictor and the response?

  3. Is the relationship between the predictor and the response positive or negative?

  4. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

\colorbox{cyan}{Answer:}

# Load the Auto dataset
library(ISLR2)
Warning: package 'ISLR2' was built under R version 4.3.2
str(Auto)
'data.frame':   392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
  ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
model_1 <- lm(mpg ~ horsepower, data = Auto)

summary(model_1)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  1. Is there a relationship between the predictor and the response? Yes, there is a relationship between horsepower and mpg because the p-value associated with the horsepower coefficient is less than the significance level (typically 0.05).

  2. How strong is the relationship between the predictor and the response? The multiple R-squared value (0.6059) indicates that approximately 60.59% of the variation in mpg can be explained by horsepower.

  3. Is the relationship between the predictor and the response positive or negative? The relationship between horsepower and mpg is negative, as indicated by the negative coefficient estimate (-0.157845).

  4. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals? To get the predicted mpg associated with a horsepower of 98, you can use the predict() function:

# Predict mpg for horsepower = 98
new_data <- data.frame(horsepower = 98)
predicted_mpg <- predict(model_1, newdata = new_data, interval = "confidence", level = 0.95)
predicted_mpg
       fit      lwr      upr
1 24.46708 23.97308 24.96108

The predicted mpg associated with a horsepower of 98 is approximately 24.47. The 95% confidence interval for this prediction is approximately [23.97, 24.96], as shown in the output above.
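The question also asks for the 95% prediction interval, which accounts for the variability of an individual observation and is therefore wider than the confidence interval. It can be obtained with the same predict() call (output not reproduced here):

# 95% prediction interval for mpg at horsepower = 98
predict(model_1, newdata = new_data, interval = "prediction", level = 0.95)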

(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

\colorbox{cyan}{Answer:}

# Plot the response and the predictor
plot(Auto$horsepower, Auto$mpg, xlab = "Horsepower", ylab = "MPG", main = "Simple Linear Regression")
abline(model_1, col = "red")

(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

\colorbox{cyan}{Answer:}

# Diagnostic plots of the least squares regression fit
par(mfrow = c(2, 2))
plot(model_1)

The diagnostic plots provide insights into the assumptions of the linear regression model. Here are some common issues to look for in the diagnostic plots:

  1. Residuals vs. Fitted: Look for a pattern in the residuals, which could indicate non-linearity or heteroscedasticity. A random scatter around the horizontal line at 0 is desirable.

  2. Normal Q-Q: Check if the residuals follow a straight line, which indicates normality. Deviations from the line suggest non-normality.

  3. Scale-Location: Look for a horizontal line with equally spread points, which indicates homoscedasticity. A pattern or trend in the points could suggest heteroscedasticity.

  4. Residuals vs. Leverage: Check for influential points that have a high leverage on the regression model. Points outside the Cook’s distance lines may be influential.

Based on the diagnostic plots, we can assess the fit of the linear regression model and identify any potential issues that may affect the validity of the model.

Here are the potential issues I see with the model fit:

  1. Pattern in the residuals: In the “Residuals vs Fitted” plot, there appears to be a pattern in the residuals. This indicates that the linear model may not be a good fit for the data, as the residuals from a good linear regression model should be random and show no discernible pattern.

  2. Non-constant variance of residuals: The “Scale-Location” plot also shows a pattern, suggesting non-constant variance of residuals (heteroscedasticity). In a good regression model, we expect the variance of the residuals to be constant across all fitted predictions (homoscedasticity).

  3. Influential observations: In the “Residuals vs Leverage” plot, there are some points with high leverage that could potentially be influential observations affecting the model fit. These points can have a large impact on the regression line and hence can affect the interpretation of the model.

    These issues suggest reconsidering the model: you could transform the data, use a non-linear regression model, or identify and properly handle the influential observations.

9. This question involves the use of multiple linear regression on the \textcolor{brown}{Auto} data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

\colorbox{cyan}{Answer:}

# Scatterplot matrix of the Auto dataset
pairs(Auto)

It is notable how a single short function call in R can produce such a complex plot. The scatterplot matrix shows the pairwise relationships between all variables in the Auto dataset (qualitative variables such as name are plotted using their underlying numeric codes). It provides a visual overview of the relationships between the variables and can help identify potential patterns or correlations in the data.

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

\colorbox{cyan}{Answer:}

cor(Auto[, -9])
                    mpg  cylinders displacement horsepower     weight
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
             acceleration       year     origin
mpg             0.4233285  0.5805410  0.5652088
cylinders      -0.5046834 -0.3456474 -0.5689316
displacement   -0.5438005 -0.3698552 -0.6145351
horsepower     -0.6891955 -0.4163615 -0.4551715
weight         -0.4168392 -0.3091199 -0.5850054
acceleration    1.0000000  0.2903161  0.2127458
year            0.2903161  1.0000000  0.1815277
origin          0.2127458  0.1815277  1.0000000

Remember that we can use - to exclude a specific column from the dataset. It goes inside the [], in the column position, which is the second index (after the comma; the first index selects rows).

Also that the correlation matrix provides a numerical summary of the relationships between pairs of variables. The values range from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

For instance:

  1. Is there a relationship between the predictors and the response?

  2. Which predictors appear to have a statistically significant relationship to the response?

  3. What does the coefficient for the year variable suggest?

\colorbox{cyan}{Answer:}

model_2 <- lm(mpg ~ . - name, data = Auto)
summary(model_2)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  1. Is there a relationship between the predictors and the response? Yes, there is a relationship between the predictors and the response, as indicated by the overall significance of the model (p-value < 0.05).

  2. Which predictors appear to have a statistically significant relationship to the response? The predictors with statistically significant relationships to the response are displacement, weight, year, and origin. This is based on the p-values associated with the coefficients of these predictors, where p < 0.05 indicates statistical significance.

  3. What does the coefficient for the year variable suggest? The coefficient for the year variable suggests that for each unit increase in the year, the mpg increases by approximately 0.7508 units, holding other variables constant. This indicates a positive relationship between the year and mpg, suggesting that newer cars tend to have higher fuel efficiency.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

\colorbox{cyan}{Answer:}

# Diagnostic plots of the multiple linear regression fit
par(mfrow = c(2, 2))
plot(model_2)

The diagnostic plots provide insights into the assumptions of the multiple linear regression model. Here are some common issues to look for in the diagnostic plots:

Residuals vs Fitted: This plot is used to check the assumption of homoscedasticity (constant variance of the residuals). Ideally, we would like to see a horizontal line of points at zero without any pattern. However, in this case, there are some outliers labeled with their observation numbers (e.g., 323), which might suggest heteroscedasticity.

Q-Q Plot: This plot is used to check the normality of the residuals. The points should fall along the dashed line if the residuals are normally distributed. In this case, there’s a slight deviation at both ends indicating potential issues with normality.

Scale-Location (or Spread-Location): This plot is another way to check the homoscedasticity. It shows the spread of the residuals (standardized residuals) against the fitted values. Similar to the Residuals vs Fitted plot, some outliers are labeled with their observation numbers.

Residuals vs Leverage: This plot helps us to find influential cases, i.e., data points that have an undue influence on the regression line. Here, two points labeled 327 and 394 appear as influential observations due to their high leverage.

In summary, the diagnostic plots suggest that there might be some problems with the fit, such as potential outliers, non-normality of residuals, and influential observations. Further investigation might be needed to confirm these issues. For example, you could consider removing or adjusting the influential points and outliers, or transforming the response variable to improve the normality of residuals.

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

\colorbox{cyan}{Answer:}

# Fit linear regression models with interaction effects
model_interaction <- lm(mpg ~ . - name + horsepower:weight + weight:year + origin:weight + year:origin, data = Auto)
summary(model_interaction)

Call:
lm(formula = mpg ~ . - name + horsepower:weight + weight:year + 
    origin:weight + year:origin, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5395 -1.6463  0.0166  1.4170 11.6509 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5.036e+01  2.354e+01  -2.139  0.03308 *  
cylinders          1.467e-01  2.914e-01   0.503  0.61502    
displacement       8.221e-03  7.050e-03   1.166  0.24427    
horsepower        -2.169e-01  2.922e-02  -7.423 7.56e-13 ***
weight             7.069e-03  6.274e-03   1.127  0.26059    
acceleration      -2.284e-02  8.897e-02  -0.257  0.79756    
year               1.472e+00  2.955e-01   4.981 9.62e-07 ***
origin            -4.185e-01  5.031e+00  -0.083  0.93375    
horsepower:weight  4.933e-05  6.830e-06   7.222 2.81e-12 ***
weight:year       -2.534e-04  7.941e-05  -3.192  0.00153 ** 
weight:origin      8.713e-04  5.929e-04   1.469  0.14254    
year:origin       -1.002e-02  6.568e-02  -0.153  0.87877    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.882 on 380 degrees of freedom
Multiple R-squared:  0.8675,    Adjusted R-squared:  0.8637 
F-statistic: 226.2 on 11 and 380 DF,  p-value: < 2.2e-16

The interaction terms in the model are horsepower:weight, weight:year, origin:weight, and year:origin. The p-values associated with these interaction terms indicate whether they are statistically significant: if the p-value is less than the significance level (typically 0.05), the interaction term is considered statistically significant. In this case, horsepower:weight (p ≈ 2.8e-12) and weight:year (p ≈ 0.0015) are statistically significant, since their p-values are well below 0.05, whereas weight:origin and year:origin are not.

(f) Try a few different transformations of the variables, such as log(X), \sqrt{X}, X^{2}. Comment on your findings.

\colorbox{cyan}{Answer:}

# Fit linear regression models with transformed variables
model_transformed <- lm(mpg ~ log(horsepower) + sqrt(weight) + I(weight^2), data = Auto)
summary(model_transformed)

Call:
lm(formula = mpg ~ log(horsepower) + sqrt(weight) + I(weight^2), 
    data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9366  -2.2612  -0.4084   1.9111  15.0152 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1.123e+02  6.524e+00  17.211  < 2e-16 ***
log(horsepower) -8.228e+00  1.214e+00  -6.777 4.57e-11 ***
sqrt(weight)    -1.085e+00  1.432e-01  -7.579 2.59e-13 ***
I(weight^2)      7.885e-07  1.927e-07   4.092 5.20e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.956 on 388 degrees of freedom
Multiple R-squared:  0.745, Adjusted R-squared:  0.7431 
F-statistic: 377.9 on 3 and 388 DF,  p-value: < 2.2e-16

The transformed variables in the model are log(horsepower), sqrt(weight), and I(weight^2). All three have very small p-values, so each transformed term has a statistically significant relationship with mpg. With an R^2 of about 0.745, this model fits noticeably better than the simple mpg ~ horsepower regression from exercise 8 (R^2 \approx 0.61), although it does not match the full multiple regression with all predictors (R^2 \approx 0.82). So the transformations do help capture curvature in the relationships, but whether a particular transformation is worthwhile should be judged by comparing models with the same predictors and by checking the diagnostic plots, not only by the significance of the individual coefficients.

10. This question should be answered using the \textcolor{brown}{Carseats} data set.

(a) Fit a multiple regression model to predict \textcolor{brown}{Sales} using \textcolor{brown}{Price}, \textcolor{brown}{Urban} and \textcolor{brown}{US}.

\colorbox{cyan}{Answer:}

# Load the Carseats dataset
library(ISLR2)

str(Carseats)
'data.frame':   400 obs. of  11 variables:
 $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
 $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
 $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
 $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
 $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
 $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
 $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
 $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
 $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
 $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
 $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# Fit a multiple regression model
model_sales <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_sales)

Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

\colorbox{cyan}{Answer:}

summary(model_sales)

Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

The coefficients in the model represent the effect of each predictor variable on the response variable Sales. Here is an interpretation of each coefficient:

  1. Price: For each unit increase in Price, Sales decrease by approximately 0.0545 units, holding other variables constant.

  2. UrbanYes: The coefficient for UrbanYes is -0.0219. This indicates that stores located in urban areas are estimated to sell about 0.0219 units less (Sales is measured in thousands of units) than stores in rural areas, holding other variables constant. However, this coefficient is not statistically significant (p = 0.936), so there is no real evidence of an urban effect.

  3. USYes: The coefficient for USYes is 1.2006. This indicates that when US is Yes (i.e., the store is located in the US), Sales increase by 1.2006 units compared to when US is No (i.e., the store is located outside the US), holding other variables constant.

These interpretations provide insights into how each predictor variable affects the response variable Sales in the multiple regression model.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\colorbox{cyan}{Answer:}

The multiple regression model to predict Sales using Price, Urban, and US can be written in equation form as:

\text{Sales} = \beta_0 + \beta_1 \times \text{Price} + \beta_2 \times \text{UrbanYes} + \beta_3 \times \text{USYes} + \epsilon

where:

-\beta_0 is the intercept term,

-\beta_1 is the coefficient for Price,

-\beta_2 is the coefficient for UrbanYes,

-\beta_3 is the coefficient for USYes,

-\epsilon is the error term.

The qualitative variables Urban and US are represented as binary indicator variables (0 or 1) in the model. The coefficients \beta_2 and \beta_3 represent the effect of being in an urban area (UrbanYes) and being in the US (USYes) on Sales, respectively.
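As a quick check of how R encodes these qualitative variables, the contrasts() function shows the indicator coding used by lm() (not required by the exercise):

# Inspect the dummy (indicator) coding R uses for the factors
contrasts(Carseats$Urban)  # UrbanYes = 1 when Urban == "Yes", 0 otherwise
contrasts(Carseats$US)     # USYes = 1 when US == "Yes", 0 otherwise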

(d) For which of the predictors can you reject the null hypothesis H_{0} : \beta_{j} = 0?

\colorbox{cyan}{Answer:}

The null hypothesis H_0: \beta_j = 0 states that the coefficient of the predictor X_j (where j represents the predictor) is equal to zero, implying that the predictor has no effect on the response variable.

Looking at the output:

  • The predictor “Price” has an estimated coefficient of -0.054459 with a very small standard error and a t-value of -10.389. The p-value associated with “Price” is much less than any conventional significance level (it is essentially zero), indicating strong evidence against the null hypothesis H_0: \beta_{\text{Price}} = 0.

\therefore Therefore, we can reject the null hypothesis for “Price”.

  • For “UrbanYes”, the estimated coefficient is -0.021916 with a standard error of 0.271650 and a t-value of -0.081. The corresponding p-value is 0.936, which is much larger than any conventional significance level (e.g., 0.05).

\therefore Therefore, we fail to reject the null hypothesis H_0: \beta_{\text{UrbanYes}} = 0 for “UrbanYes”.

  • Similarly, for “USYes”, the estimated coefficient is 1.200573 with a standard error of 0.259042 and a t-value of 4.635. The p-value associated with “USYes” is extremely small (4.86 \times 10^{-6}), indicating strong evidence against the null hypothesis H_0: \beta_{\text{USYes}} = 0.

\therefore Therefore, we can reject the null hypothesis for “USYes”.

In summary:

  • We can reject the null hypothesis for “Price” and “USYes”, indicating that these predictors are likely to have a significant effect on the response variable.
  • However, we fail to reject the null hypothesis for “UrbanYes”, suggesting that this predictor may not have a significant effect on the response variable.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

\colorbox{cyan}{Answer:}

# Fit a smaller model with significant predictors
model_sales_small <- lm(Sales ~ Price + US, data = Carseats)
summary(model_sales_small)

Call:
lm(formula = Sales ~ Price + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

The smaller model includes only the predictors “Price” and “US” since these are the predictors for which there is evidence of association with the outcome (Sales). The summary of the smaller model provides insights into the coefficients and significance of these predictors in predicting Sales.

(f) How well do the models in (a) and (e) fit the data?

\colorbox{cyan}{Answer:}

# Compare the models in (a) and (e)
anova(model_sales, model_sales_small)
Analysis of Variance Table

Model 1: Sales ~ Price + Urban + US
Model 2: Sales ~ Price + US
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    396 2420.8                           
2    397 2420.9 -1  -0.03979 0.0065 0.9357

The ANOVA test compares the full model (model_sales) with the reduced model (model_sales_small) to determine if the smaller model fits the data significantly worse than the full model. The p-value associated with the ANOVA test provides information on the significance of the predictors in the full model compared to the reduced model.

Note that the F-statistic (0.0065) and p-value (0.9357) in this table refer to the comparison between the two models, not to each model separately. The large p-value means that dropping Urban does not significantly worsen the fit, so the smaller model in (e) fits the data essentially as well as the full model in (a).

As for how well either model fits the data, both have a multiple R^2 of about 0.239 (adjusted R^2 of 0.2335 and 0.2354, respectively), so Price and US status explain only roughly 24% of the variation in Sales. The fit is therefore modest, but the reduced model is preferable because it achieves essentially the same explanatory power with one fewer predictor.

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

\colorbox{cyan}{Answer:}

# Obtain 95% confidence intervals for the coefficients in the smaller model
confint(model_sales_small)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

The 95% confidence intervals provide a range of values within which we can be 95% confident that the true population coefficient lies. The confidence intervals for the coefficients in the smaller model (model_sales_small) can help us assess the precision of the estimates and provide insights into the uncertainty associated with the coefficients.

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

\colorbox{cyan}{Answer:}

# Check for outliers and high leverage observations in the smaller model
par(mfrow = c(2, 2))
plot(model_sales_small)

Residuals vs Fitted: This plot is used to check the assumption of homoscedasticity (constant variance of the residuals). Ideally, we would like to see a horizontal line of points at zero without any pattern. However, in this case, there doesn’t appear to be any clear pattern, suggesting that the assumption of homoscedasticity might be met.

Q-Q Plot: This plot is used to check the normality of the residuals. The points should fall along the dashed line if the residuals are normally distributed. In this case, the points seem to follow the line quite closely, suggesting that the residuals might be normally distributed.

Scale-Location (or Spread-Location): This plot is another way to check the homoscedasticity. It shows the spread of the residuals (standardized residuals) against fitted values. Similar to the Residuals vs Fitted plot, there doesn’t appear to be any clear pattern, suggesting that the assumption of homoscedasticity might be met.

Residuals vs Leverage: This plot helps us to find influential cases, i.e., data points that have an undue influence on the regression line. Here, there don’t appear to be any points that stand out from the rest, suggesting that there might not be any high leverage points.

In summary, the diagnostic plots suggest that there might not be any outliers or high leverage observations in the model.

11. In this problem we will investigate the t-statistic for the null hypothesis H_{0} : \beta = 0 in simple linear regression without an intercept. To begin, we generate a predictor X and a response Y as follows.

set.seed(1)
x <- rnorm(100)
y <- 2*x + rnorm(100)

(a) Perform a simple linear regression of Y onto X without an intercept. Report the coefficient estimate \hat{\beta}, the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis H_{0} : \beta = 0. Comment on these results. (You can perform regression without intercept using the command lm(y∼x+0).)

\colorbox{cyan}{Answer:}

# Perform simple linear regression without an intercept
model_y <- lm(y ~ x + 0)

summary(model_y)

Call:
lm(formula = y ~ x + 0)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9154 -0.6472 -0.1771  0.5056  2.3109 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x   1.9939     0.1065   18.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9586 on 99 degrees of freedom
Multiple R-squared:  0.7798,    Adjusted R-squared:  0.7776 
F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

The coefficient estimate \hat{\beta} is the estimated slope of the regression line, which represents the effect of the predictor variable X on the response variable Y. The standard error of the coefficient estimate provides a measure of the uncertainty in the estimate. The t-statistic and p-value associated with the null hypothesis H_0: \beta = 0 test whether the coefficient is significantly different from zero.

In this case, the coefficient estimate \hat{\beta} is approximately 2, with a standard error of 0.106. The t-statistic is 18.73, and the p-value is essentially zero. This indicates that the coefficient estimate is significantly different from zero, providing strong evidence against the null hypothesis H_0: \beta = 0.

\therefore Therefore, we reject the null hypothesis and conclude that there is a significant relationship between X and Y.

(b) Now perform a simple linear regression of X onto Y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-values associated with the null hypothesis H_{0} : \beta = 0. Comment on these results.

\colorbox{cyan}{Answer:}

# Perform simple linear regression without an intercept
model_x <- lm(x ~ y + 0)

summary(model_x)

Call:
lm(formula = x ~ y + 0)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8699 -0.2368  0.1030  0.2858  0.8938 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
y  0.39111    0.02089   18.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4246 on 99 degrees of freedom
Multiple R-squared:  0.7798,    Adjusted R-squared:  0.7776 
F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

In this case, the coefficient estimate \hat{\beta} is approximately 0.391, with a standard error of 0.020. The t-statistic is 18.73, and the p-value is essentially zero. This indicates that the coefficient estimate is significantly different from zero, providing strong evidence against the null hypothesis H_0: \beta = 0.

\therefore Therefore, we reject the null hypothesis and conclude that there is a significant relationship between Y and X.

(c) What is the relationship between the results obtained in (a) and (b)?

\colorbox{cyan}{Answer:}

The two regressions describe the same underlying relationship, but they are not identical fits: the coefficient estimates differ (\hat{\beta} \approx 1.99 for y regressed onto x versus \hat{\beta} \approx 0.39 for x regressed onto y). What is identical is the strength of the evidence for a relationship: both regressions report exactly the same t-statistic (18.73), p-value, and R^2, because, as shown in part (d), the t-statistic for the no-intercept regression is symmetric in x and y.

(d) For the regression of Y onto X without an intercept, the t-statistic for H_{0} : \beta = 0 takes the form \hat{\beta} / \text{SE}(\hat{\beta}), where \hat{\beta} is given by (3.38), and where

\mathrm{SE}(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i \hat{\beta})^2}{(n-1) \sum_{i'=1}^{n} x_{i'}^2}}.

(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) Show algebraically, and confirm numerically in R, that the t-statistic can be written as

\frac{\sqrt{n-1} \sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2 - \left(\sum_{i'=1}^{n} x_{i'} y_{i'}\right)^2}}.


\colorbox{cyan}{Answer:}

To show that the t-statistic for the regression of (Y) onto (X) without an intercept can be written in the given form, we start by expressing \hat{\beta}:

\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}

Now, substitute \hat{\beta} and the standard error given above,

SE(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i \hat{\beta})^2}{(n-1) \sum_{i'=1}^{n} x_{i'}^2}},

into the t-statistic \hat{\beta} / SE(\hat{\beta}):

\text{t-statistic} = \frac{\hat{\beta}}{SE(\hat{\beta})} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} \cdot \frac{\sqrt{(n-1) \sum_{i'=1}^{n} x_{i'}^2}}{\sqrt{\sum_{i=1}^{n} (y_i - x_i \hat{\beta})^2}}

The key step is expanding the residual sum of squares and substituting \hat{\beta} = \sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x_i^2:

\sum_{i=1}^{n} (y_i - x_i \hat{\beta})^2 = \sum_{i=1}^{n} y_i^2 - 2\hat{\beta} \sum_{i=1}^{n} x_i y_i + \hat{\beta}^2 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} x_i y_i\right)^2}{\sum_{i=1}^{n} x_i^2}

Substituting this back into the t-statistic and cancelling the common factor \sqrt{\sum_{i=1}^{n} x_i^2} gives the desired expression:

\text{t-statistic} = \frac{\sqrt{n-1} \sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2 - \left(\sum_{i'=1}^{n} x_{i'} y_{i'}\right)^2}}

# Generate sample data
set.seed(123)
n <- 10
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Calculate numerator and denominator of t-statistic
numerator <- sqrt(n - 1) * sum(x * y)
denominator <- sqrt(sum(x^2) * sum(y^2) - (sum(x * y))^2)

# Calculate t-statistic
t_statistic <- numerator / denominator

t_statistic
[1] 8.781141

This computes the t-statistic directly from the closed-form expression. Comparing it with the t-statistic reported by lm() on the same data (see the sketch below) confirms numerically that the t-statistic for the regression of Y onto X without an intercept can be written in the given form.
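A short sketch continuing the code above, comparing the closed-form value with the t-statistic extracted from lm():

# Compare with the t-statistic from lm() on the same simulated data
t_lm <- summary(lm(y ~ x + 0))$coefficients[1, "t value"]
c(t_statistic, t_lm)   # both equal the value computed above (about 8.781)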

(e) Using the results from (d), argue that the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y.

\colorbox{cyan}{Answer:} The t-statistic for the regression of Y onto X without an intercept is given by:

\text{t-statistic} = \frac{\sqrt{n-1} \sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2 - \left(\sum_{i'=1}^{n} x_{i'} y_{i'}\right)^2}}

Similarly, the t-statistic for the regression of X onto Y without an intercept is obtained by swapping the roles of x and y in this expression:

\text{t-statistic} = \frac{\sqrt{n-1} \sum_{i=1}^{n} y_i x_i}{\sqrt{\sum_{i=1}^{n} y_i^2 \sum_{i=1}^{n} x_i^2 - \left(\sum_{i'=1}^{n} y_{i'} x_{i'}\right)^2}}

Since the expression is symmetric in x_i and y_i (the numerator \sum x_i y_i and every term in the denominator are unchanged when x and y are exchanged), the two t-statistics are identical. Therefore, the t-statistic for the regression of Y onto X without an intercept is the same as the t-statistic for the regression of X onto Y without an intercept.

This symmetry in the t-statistic is a result of the mathematical relationship between the regression coefficients and the predictor and response variables in simple linear regression without an intercept. The t-statistic captures the significance of the relationship between the predictor and response variables, and in this case, it is the same for both directions of the regression.

f) In \textcolor{brown}{R}, show that when regression is performed with an intercept, the t-statistic for H_{0} : \beta = 0 is the same for the regression of y onto x as it is for the regression of x onto y.

\colorbox{cyan}{Answer:}

# Perform simple linear regression with an intercept
model_with_intercept <- lm(y ~ x)

# Extract the t-statistic for the null hypothesis beta = 0
t_statistic_with_intercept <- summary(model_with_intercept)$coefficients[2, "t value"]

t_statistic_with_intercept
[1] 8.368485
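
To complete the demonstration, the reverse regression can be fitted as well (a minimal sketch using the same objects); both summaries must report the same t value.

# Fit the reverse regression x onto y and extract its slope t-statistic
model_reverse <- lm(x ~ y)
t_statistic_reverse <- summary(model_reverse)$coefficients[2, "t value"]

# Both directions report the same value (8.368485 in this run)
c(t_statistic_with_intercept, t_statistic_reverse)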

Both regressions report the same t-statistic for H_{0} : \beta = 0 (8.368485 in this run). This is because, with an intercept, the slope t-statistic can be written as r\sqrt{(n-2)/(1-r^2)}, where r = Cor(X, Y); r is symmetric in X and Y, so the t-statistic does not depend on which variable is treated as the response.

12. This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate \hat{\beta} for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

\colorbox{cyan}{Answer:}

The coefficient estimate for the regression of X onto Y is given by:

\hat{\beta}_{X|Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}

The coefficient estimate for the regression of Y onto X is given by:

\hat{\beta}_{Y|X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}

The two estimates share the same numerator, \sum_{i=1}^{n} x_i y_i, so they coincide exactly when their denominators agree, i.e., when \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2.

\therefore under that circumstance (\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2), the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

(b) Generate an example in \textcolor{brown}{R} with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

\colorbox{cyan}{Answer:}

# Generate sample data with n = 100 observations
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Coefficient estimate for the regression of X onto Y
beta_X_Y <- sum(x * y) / sum(y^2)

# Coefficient estimate for the regression of Y onto X
beta_Y_X <- sum(x * y) / sum(x^2)

beta_X_Y
[1] 0.3975655
beta_Y_X
[1] 1.936372

The coefficient estimate for the regression of X onto Y (0.398) differs from the coefficient estimate for the regression of Y onto X (1.936). Both estimates share the same numerator but divide by different sums of squares, and here \sum_{i=1}^{n} x_i^2 \neq \sum_{i=1}^{n} y_i^2 (y has roughly five times the variance of x by construction), so the two estimates differ.

c) Generate an example in \textcolor{brown}{R} with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

\colorbox{cyan}{Answer:}

# Set seed for reproducibility
set.seed(123)

# Generate random data for X
n <- 100
X <- rnorm(n)

# Construct Y so that it has the same sum of squares as X.
# Note: the scale factor sqrt(sum_sq_X / sum(X^2)) equals 1, so Y is simply a copy of X.
sum_sq_X <- sum(X^2)
Y <- X * sqrt(sum_sq_X / sum(X^2))

# Perform regression of X onto Y
model_XY <- lm(X ~ Y)

# Perform regression of Y onto X
model_YX <- lm(Y ~ X)

# Check if the coefficient estimates are the same
coeff_XY <- coef(model_XY)["Y"]
coeff_YX <- coef(model_YX)["X"]

# Output the coefficient estimates
coeff_XY
Y 
1 
coeff_YX
X 
1 

The coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X in this example (both equal 1). Note, however, that the construction above is degenerate: the scale factor is 1, so Y is simply equal to X, which trivially makes the two sums of squares identical. Any Y satisfying \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2 would work.
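
A less degenerate construction, sketched here, keeps the same values of X but shuffles their order: a permutation leaves the sum of squares unchanged while making Y different from X observation by observation.

set.seed(123)
X <- rnorm(100)
Y <- sample(X)                  # the same values in a different order

# The sums of squares agree (up to floating-point rounding) ...
all.equal(sum(X^2), sum(Y^2))

# ... so the two no-intercept coefficient estimates coincide as well
sum(X * Y) / sum(Y^2)           # regression of X onto Y
sum(X * Y) / sum(X^2)           # regression of Y onto X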

13. In this exercise, we will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

(a) Using the \textcolor{brown}{rnorm()} function, create a vector, x, containing 100 observations drawn from a N(0,1) distribution. This represents a feature in our model.

\colorbox{cyan}{Answer:}

# Set seed for reproducibility
set.seed(1)

# Generate vector x with 100 observations from N(0, 1)
x <- rnorm(100)

x
  [1] -0.626453811  0.183643324 -0.835628612  1.595280802  0.329507772
  [6] -0.820468384  0.487429052  0.738324705  0.575781352 -0.305388387
 [11]  1.511781168  0.389843236 -0.621240581 -2.214699887  1.124930918
 [16] -0.044933609 -0.016190263  0.943836211  0.821221195  0.593901321
 [21]  0.918977372  0.782136301  0.074564983 -1.989351696  0.619825748
 [26] -0.056128740 -0.155795507 -1.470752384 -0.478150055  0.417941560
 [31]  1.358679552 -0.102787727  0.387671612 -0.053805041 -1.377059557
 [36] -0.414994563 -0.394289954 -0.059313397  1.100025372  0.763175748
 [41] -0.164523596 -0.253361680  0.696963375  0.556663199 -0.688755695
 [46] -0.707495157  0.364581962  0.768532925 -0.112346212  0.881107726
 [51]  0.398105880 -0.612026393  0.341119691 -1.129363096  1.433023702
 [56]  1.980399899 -0.367221476 -1.044134626  0.569719627 -0.135054604
 [61]  2.401617761 -0.039240003  0.689739362  0.028002159 -0.743273209
 [66]  0.188792300 -1.804958629  1.465554862  0.153253338  2.172611670
 [71]  0.475509529 -0.709946431  0.610726353 -0.934097632 -1.253633400
 [76]  0.291446236 -0.443291873  0.001105352  0.074341324 -0.589520946
 [81] -0.568668733 -0.135178615  1.178086997 -1.523566800  0.593946188
 [86]  0.332950371  1.063099837 -0.304183924  0.370018810  0.267098791
 [91] -0.542520031  1.207867806  1.160402616  0.700213650  1.586833455
 [96]  0.558486426 -1.276592208 -0.573265414 -1.224612615 -0.473400636

(b) Using the \textcolor{brown}{rnorm()} function, create a vector, \epsilon, containing 100 observations drawn from a N(0, 0.25) distribution, i.e., a normal distribution with mean 0 and variance 0.25.

\colorbox{cyan}{Answer:}

# Generate vector epsilon with 100 observations from N(0, 0.25)
epsilon <- rnorm(100, mean = 0, sd = sqrt(0.25))

epsilon
  [1] -0.310183339  0.021057937 -0.455460824  0.079014386 -0.327292322
  [6]  0.883643635  0.358353738  0.455087115  0.192092679  0.841088040
 [11] -0.317868227 -0.230822365  0.716141119 -0.325348177 -0.103690372
 [16] -0.196403965 -0.159996434 -0.139556651  0.247094166 -0.088665241
 [21] -0.252978731  0.671519413 -0.107289704 -0.089778265 -0.050095371
 [26]  0.356333154 -0.036782202 -0.018817086 -0.340830239 -0.162135136
 [31]  0.030080220 -0.294447243  0.265748096 -0.759197041  0.153278930
 [36] -0.768224912 -0.150488063 -0.264139952 -0.326047390 -0.028448389
 [41] -0.957179713  0.588291656 -0.832486218 -0.231765201 -0.557960053
 [46] -0.375409501  1.043583273  0.008697810 -0.643150265 -0.820302767
 [51]  0.225093551 -0.009279916 -0.159034187 -0.464681074 -0.743730155
 [56] -0.537596148  0.500014402 -0.310633347 -0.692213424  0.934645311
 [61]  0.212550189 -0.119323550  0.529241524  0.443211326 -0.309621524
 [66]  1.103051232 -0.127513515 -0.712247325 -0.072199801  0.103769170
 [71]  1.153989200  0.052901184  0.228499403 -0.038576468 -0.167000421
 [76] -0.017363014  0.393819803  1.037622504  0.513696219  0.603954199
 [81] -0.615661711  0.491947785  0.109962402 -0.733625015  0.260511371
 [86] -0.079377302  0.732293656 -0.383041000 -0.215105877 -0.463054749
 [91] -0.088551981  0.201005890 -0.365874087  0.415186584 -0.604041393
 [96] -0.523992206  0.720578853 -0.507923733  0.205987356 -0.190538026

(c) Using x and \epsilon, generate a vector y according to the model Y = -1 + 0.5X + \epsilon. What is the length of the vector y? What are the values of \beta_{0} and \beta_{1} in this linear model?

\colorbox{cyan}{Answer:}

# Generate vector y according to the model Y = -1 + 0.5X + epsilon
y <- -1 + 0.5 * x + epsilon

# Length of vector y
length(y)
[1] 100
# Values of beta0 and beta1 in the linear model
beta0 <- -1
beta1 <- 0.5

y
  [1] -1.62341024 -0.88712040 -1.87327513 -0.12334521 -1.16253844 -0.52659056
  [7] -0.39793174 -0.17575053 -0.52001665 -0.31160615 -0.56197764 -1.03590075
 [13] -0.59447917 -2.43269812 -0.54122491 -1.21887077 -1.16809157 -0.66763855
 [19] -0.34229524 -0.79171458 -0.79349005  0.06258756 -1.07000721 -2.08445411
 [25] -0.74018250 -0.67173122 -1.11467996 -1.75419328 -1.57990527 -0.95316436
 [31] -0.29058000 -1.34584111 -0.54041610 -1.78609956 -1.53525085 -1.97572219
 [37] -1.34763304 -1.29379665 -0.77603470 -0.64686051 -2.03944151 -0.53838918
 [43] -1.48400453 -0.95343360 -1.90233790 -1.72915708  0.22587425 -0.60703573
 [49] -1.69932337 -1.37974890 -0.57585351 -1.31529311 -0.98847434 -2.02936262
 [55] -1.02721830 -0.54739620 -0.68359634 -1.83270066 -1.40735361 -0.13288199
 [61]  0.41335907 -1.13894355 -0.12588879 -0.54278759 -1.68125813  0.19744738
 [67] -2.02999283 -0.97946989 -0.99557313  0.19007500  0.39174396 -1.30207203
 [73] -0.46613742 -1.50562528 -1.79381712 -0.87163990 -0.82782613  0.03817518
 [79] -0.44913312 -0.69080627 -1.89999608 -0.57564152 -0.30099410 -2.49540841
 [85] -0.44251553 -0.91290212  0.26384357 -1.53513296 -1.03009647 -1.32950535
 [91] -1.35981200 -0.19506021 -0.78567278 -0.23470659 -0.81062467 -1.24474899
 [97] -0.91771725 -1.79455644 -1.40631895 -1.42723834
beta0
[1] -1
beta1
[1] 0.5

The length of the vector y is 100. The values of \beta_{0} and \beta_{1} in the linear model Y = -1 + 0.5X + \epsilon are -1 and 0.5, respectively.

(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

\colorbox{cyan}{Answer:}

# Create a scatterplot of x and y
plot(x, y, main = "Scatterplot of x and y", xlab = "x", ylab = "y")

The scatterplot shows a positive, roughly linear relationship between x and y, which is consistent with the model Y = -1 + 0.5X + \epsilon used to generate the data. The points are scattered around the line y = -1 + 0.5x, and the spread around that line reflects the noise term \epsilon.

(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \hat{\beta}_{0} and \hat{\beta}_{1} compare to \beta_{0} and \beta_{1}?

\colorbox{cyan}{Answer:}

# Fit a least squares linear model to predict y using x
model <- lm(y ~ x)

# Summary of the linear model
summary(model)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.93842 -0.30688 -0.06975  0.26970  1.17309 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
x            0.49947    0.05386   9.273 4.58e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4814 on 98 degrees of freedom
Multiple R-squared:  0.4674,    Adjusted R-squared:  0.4619 
F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

The least squares fit gives \hat{\beta}_{0} = -1.019 and \hat{\beta}_{1} = 0.499, both highly significant (p < 2 \times 10^{-16} and p \approx 4.6 \times 10^{-15}, respectively). These estimates are very close to the true values \beta_{0} = -1 and \beta_{1} = 0.5 from the model Y = -1 + 0.5X + \epsilon; in fact both true values lie well within one standard error of the estimates. The fitted model therefore captures the underlying linear relationship between x and y well.
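
To make the comparison explicit, the true and fitted coefficients can be placed side by side (a one-line sketch using the objects defined above):

# True coefficients versus the least squares estimates
rbind(true = c(beta0, beta1), estimate = coef(model))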

(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Create an appropriate legend.

\colorbox{cyan}{Answer:}

# Create a scatterplot of x and y with the least squares line
plot(x, y, main = "Scatterplot of x and y with Least Squares Line", xlab = "x", ylab = "y")
abline(model, col = "red", lwd = 2)  # Least squares line
abline(a = -1, b = 0.5, col = "blue", lwd = 2, lty = 2)  # Population regression line

# Add legend
legend("topleft", legend = c("Least Squares Line", "Population Regression Line"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)

The scatterplot displays the relationship between x and y with the least squares line and the population regression line. The least squares line is the line that best fits the data based on the linear model, while the population regression line represents the true linear relationship specified in the model Y = -1 + 0.5X + \epsilon. The legend indicates the two lines and their corresponding colors and styles.

(g) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.

\colorbox{cyan}{Answer:}

# Fit a polynomial regression model to predict y using x and x^2
model_poly <- lm(y ~ x + I(x^2))

# Summary of the polynomial regression model
summary(model_poly)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.98252 -0.31270 -0.06441  0.29014  1.13500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
x            0.50858    0.05399   9.420  2.4e-15 ***
I(x^2)      -0.05946    0.04238  -1.403    0.164    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.479 on 97 degrees of freedom
Multiple R-squared:  0.4779,    Adjusted R-squared:  0.4672 
F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14

The polynomial regression model adds a quadratic term x^2 to the linear term x. To judge whether the quadratic term improves the fit, we look at the p-value of its coefficient: it is 0.164, well above the usual 0.05 threshold, so there is no evidence that the quadratic term is needed.

Consistent with this, the R^2 increases only slightly (from 0.467 to 0.478) and the residual standard error barely changes, which is what we expect given that the data were generated from a purely linear model.
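
An equivalent way to assess the single added term is a nested-model F-test via anova(); because only one term is added, its p-value should reproduce the 0.164 reported for I(x^2) above. A minimal sketch:

# F-test comparing the linear fit with the quadratic fit
anova(model, model_poly)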

(h) Repeat (d)-(f) after modifying the data generation process in such a way that there is less noise in the data. The underlying model Y = -1 + 0.5X + \epsilon should remain the same; only the variance of the error term \epsilon is decreased. Describe your results.

\colorbox{cyan}{Answer:}

# Generate less noisy data
set.seed(1)
x_less_noise <- rnorm(100)
epsilon_less_noise <- rnorm(100, mean = 0, sd = sqrt(0.1))
y_less_noise <- -1 + 0.5 * x_less_noise + epsilon_less_noise

# Fit a least squares linear model to predict y using x (less noisy data)
model_less_noise <- lm(y_less_noise ~ x_less_noise)

# Summary of the linear model (less noisy data)
summary(model_less_noise)

Call:
lm(formula = y_less_noise ~ x_less_noise)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59351 -0.19409 -0.04411  0.17057  0.74193 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.01192    0.03067  -32.99   <2e-16 ***
x_less_noise  0.49966    0.03407   14.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3044 on 98 degrees of freedom
Multiple R-squared:  0.687, Adjusted R-squared:  0.6838 
F-statistic: 215.1 on 1 and 98 DF,  p-value: < 2.2e-16
# Create a scatterplot of x and y with the least squares line (less noisy data)
plot(x_less_noise, y_less_noise, main = "Scatterplot of x and y with Least Squares Line (Less Noisy Data)", xlab = "x", ylab = "y")
abline(model_less_noise, col = "red", lwd = 2)  # Least squares line
abline(a = -1, b = 0.5, col = "blue", lwd = 2, lty = 2)  # Population regression line

# Add legend
legend("topleft", legend = c("Least Squares Line", "Population Regression Line"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)

With less noise (error variance 0.1 instead of 0.25), the linear model fits the data more tightly: the coefficient estimates (-1.012 and 0.500) are again very close to the true values, their standard errors are smaller, and the R^2 rises to 0.687. In the scatterplot the least squares line and the population regression line are nearly indistinguishable, and the points lie much closer to them.

(i) Repeat (d)-(f) after modifying the data generation process in such a way that there is more noise in the data. The underlying model Y = -1 + 0.5X + \epsilon should remain the same; only the variance of the error term \epsilon is increased. Describe your results.

\colorbox{cyan}{Answer:}

# Generate more noisy data
set.seed(1)
x_more_noise <- rnorm(100)
epsilon_more_noise <- rnorm(100, mean = 0, sd = sqrt(0.5))
y_more_noise <- -1 + 0.5 * x_more_noise + epsilon_more_noise

# Fit a least squares linear model to predict y using x (more noisy data)
model_more_noise <- lm(y_more_noise ~ x_more_noise)

# Summary of the linear model (more noisy data)
summary(model_more_noise)

Call:
lm(formula = y_more_noise ~ x_more_noise)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.32713 -0.43400 -0.09864  0.38141  1.65900 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.02665    0.06858 -14.970  < 2e-16 ***
x_more_noise  0.49925    0.07617   6.554 2.62e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6808 on 98 degrees of freedom
Multiple R-squared:  0.3047,    Adjusted R-squared:  0.2976 
F-statistic: 42.96 on 1 and 98 DF,  p-value: 2.624e-09
# Create a scatterplot of x and y with the least squares line (more noisy data)
plot(x_more_noise, y_more_noise, main = "Scatterplot of x and y with Least Squares Line (More Noisy Data)", xlab = "x", ylab = "y")
abline(model_more_noise, col = "red", lwd = 2)  # Least squares line
abline(a = -1, b = 0.5, col = "blue", lwd = 2, lty = 2)  # Population regression line

# Add legend
legend("topleft", legend = c("Least Squares Line", "Population Regression Line"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)

With more noise (error variance 0.5), the fit is noticeably looser: the coefficient estimates (-1.027 and 0.499) are still close to the true values, but their standard errors are larger and the R^2 drops to 0.305. In the scatterplot the points are much more spread out around the least squares and population regression lines, although the two lines themselves remain close to each other.

j) What are the confidence intervals for \beta_{0} and \beta_{1} based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.

\colorbox{cyan}{Answer:}

# Confidence intervals for beta0 and beta1 based on the original data set
confint(model)
                 2.5 %     97.5 %
(Intercept) -1.1150804 -0.9226122
x            0.3925794  0.6063602
# Confidence intervals for beta0 and beta1 based on the more noisy data set
confint(model_more_noise)
                  2.5 %     97.5 %
(Intercept)  -1.1627482 -0.8905572
x_more_noise  0.3480843  0.6504160
# Confidence intervals for beta0 and beta1 based on the less noisy data set
confint(model_less_noise)
                  2.5 %     97.5 %
(Intercept)  -1.0727832 -0.9510557
x_less_noise  0.4320613  0.5672681

The confidence intervals give ranges of plausible values for \beta_{0} and \beta_{1}. In all three data sets the intervals contain the true values -1 and 0.5. Their widths track the noise level: the intervals are narrowest for the less noisy data, wider for the original data, and widest for the noisier data. More noise in the data therefore translates directly into more uncertainty about the coefficient estimates.
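
To make the comparison concrete, the widths of the three slope intervals can be computed directly (a small sketch using the models fitted above):

# Width of the 95% confidence interval for the slope in each fit
c(original   = unname(diff(confint(model)["x", ])),
  less_noise = unname(diff(confint(model_less_noise)["x_less_noise", ])),
  more_noise = unname(diff(confint(model_more_noise)["x_more_noise", ])))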

14. This problem focuses on the collinearity problem.

(a) Perform the following commands in \textcolor{brown}{R}:

set.seed(1)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100) /10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

\colorbox{cyan}{Answer:}

The linear model can be written as:

y = \beta_{0} + \beta_{1}x1 + \beta_{2}x2 + \epsilon

where \beta_{0} = 2, \beta_{1} = 2, and \beta_{2} = 0.3 are the regression coefficients. The model specifies that y is a linear function of x1 and x2, with the coefficients representing the effect of each predictor on the response variable.

(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

\colorbox{cyan}{Answer:}

# Calculate the correlation between x1 and x2
correlation_x1_x2 <- cor(x1, x2)

correlation_x1_x2
[1] 0.9779927
# Create a scatterplot of x1 and x2
plot(x1, x2, main = "Scatterplot of x1 and x2", xlab = "x1", ylab = "x2")

The correlation between x1 and x2, calculated with the \textcolor{brown}{cor()} function, is about 0.98, so the two predictors are highly collinear. The scatterplot reflects this: the points fall close to a straight line, as expected since x2 was constructed as 0.5 * x1 plus a small amount of noise.

(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \hat{\beta}_{0}, \hat{\beta}_{1}, and \hat{\beta}_{2}? How do these compare to the true \beta_{0}, \beta_{1}, and \beta_{2}? Can you reject the null hypothesis H_{0} : \beta_{1} = 0? How about the null hypothesis H_{0} : \beta_{2} = 0?

\colorbox{cyan}{Answer:}

# Fit a least squares regression to predict y using x1 and x2
model <- lm(y ~ x1 + x2)

# Summary of the linear model
summary(model)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.94359 -0.43645  0.00202  0.63692  2.63941 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.0254     0.1052  19.253  < 2e-16 ***
x1            2.2884     0.5596   4.089 8.93e-05 ***
x2           -0.2347     1.0948  -0.214    0.831    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.043 on 97 degrees of freedom
Multiple R-squared:  0.781, Adjusted R-squared:  0.7765 
F-statistic: 172.9 on 2 and 97 DF,  p-value: < 2.2e-16

Using the provided data, we fit a least squares regression to predict y using x_1 and x_2. The results obtained are as follows:

\begin{align*} \text{Coefficients:} \\ \hat{\beta}_0 & = 2.0254 \\ \hat{\beta}_1 & = 2.2884 \\ \hat{\beta}_2 & = -0.2347 \\ \end{align*}

These coefficient estimates represent the intercept (\hat{\beta}_0) and the slopes (\hat{\beta}_1 and \hat{\beta}_2) of the regression model.

Comparing these estimates to the true coefficients: \hat{\beta}_0 = 2.03 is close to \beta_0 = 2 and \hat{\beta}_1 = 2.29 is reasonably close to \beta_1 = 2, but \hat{\beta}_2 = -0.23 is far from \beta_2 = 0.3 and even has the wrong sign. This poor estimate of \beta_2 is a symptom of the collinearity between x_1 and x_2.

To test the significance of each predictor, we can examine the t-statistic and associated p-values:

  • For x_1, the t-value is 4.089 with a p-value of 8.93 \times 10^{-5}. Since the p-value is less than the significance level (e.g., 0.05), we can reject the null hypothesis H_0 : \beta_1 = 0 and conclude that x_1 is a significant predictor of y.

  • For x_2, the t-value is -0.214 with a p-value of 0.831. Since the p-value is greater than the significance level, we fail to reject the null hypothesis H_0 : \beta_2 = 0, indicating that x_2 may not be a significant predictor of y.

The residual standard error is 1.043, and the multiple R^2 is 0.781 (adjusted 0.777), so the model explains roughly 78% of the variance in the response. The F-statistic of 172.9 with a very small p-value indicates that the overall regression is statistically significant.

(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H_{0} : \beta_{1} = 0?

\colorbox{cyan}{Answer:}

# Fit a least squares regression to predict y using only x1
model_x1 <- lm(y ~ x1)

# Summary of the linear model with x1 only
summary(model_x1)

Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.93065 -0.44538 -0.01945  0.63335  2.64036 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.0262     0.1046   19.37   <2e-16 ***
x1            2.1711     0.1162   18.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.038 on 98 degrees of freedom
Multiple R-squared:  0.7809,    Adjusted R-squared:  0.7786 
F-statistic: 349.2 on 1 and 98 DF,  p-value: < 2.2e-16

Fitting a least squares regression to predict y using only x_1, we obtain the following results:

\begin{align*} \text{Coefficients:} \\ \hat{\beta}_0 & = 2.0262 \\ \hat{\beta}_1 & = 2.1711 \\ \end{align*}

The t-value for x_1 is 18.69 with a p-value below 2 \times 10^{-16}. Since the p-value is far below any reasonable significance level, we reject the null hypothesis H_0 : \beta_1 = 0 and conclude that x_1 is a significant predictor of y.

The residual standard error is 1.038, and the multiple R^2 is 0.781 (adjusted 0.779), so this single-predictor model explains almost as much of the variance as the model with both predictors. The F-statistic of 349.2 with a very small p-value confirms that the regression on x_1 alone is statistically significant.

(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H_{0} : \beta_{2} = 0?

\colorbox{cyan}{Answer:}

# Fit a least squares regression to predict y using only x2
model_x2 <- lm(y ~ x2)

# Summary of the linear model with x2 only
summary(model_x2)

Call:
lm(formula = y ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.86886 -0.72915  0.02024  0.70161  2.55322 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.0527     0.1131   18.15   <2e-16 ***
x2            4.1439     0.2461   16.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.124 on 98 degrees of freedom
Multiple R-squared:  0.7432,    Adjusted R-squared:  0.7406 
F-statistic: 283.6 on 1 and 98 DF,  p-value: < 2.2e-16

Fitting a least squares regression to predict y using only x_2, we obtain the following results:

\begin{align*} \text{Coefficients:} \\ \hat{\beta}_0 & = 2.0527 \\ \hat{\beta}_2 & = 4.1439 \\ \end{align*}

The t-value for x_2 is 16.84 with a p-value that is essentially zero (below 2 \times 10^{-16}). We therefore reject the null hypothesis H_0 : \beta_2 = 0, indicating that, when used on its own, x_2 is a significant predictor of y.

f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.

\colorbox{cyan}{Answer:} The results obtained in (c), (d), and (e) do not contradict each other. The apparent discrepancy in the significance of x_2 is caused by the strong collinearity between x_1 and x_2 (their correlation is about 0.98).

In the model that includes both predictors, x_1 and x_2 carry almost the same information, so their coefficients are estimated with inflated standard errors: compare the standard error of 0.56 for x_1 in the joint model with 0.12 in the model with x_1 alone. As a result, the individual contribution of x_2 cannot be separated from that of x_1, and its coefficient is not significant in the joint fit.

When each predictor is used on its own, it acts as a proxy for the other: x_1 is significant because it truly drives y, and x_2 is significant because it is almost a linear function of x_1 and therefore inherits its predictive power.

These results highlight the importance of accounting for collinearity when interpreting the significance of predictors. Collinearity inflates the standard errors of the coefficient estimates, so predictors that are clearly related to the response when considered individually can appear insignificant when considered together.
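
One standard way to quantify this is the variance inflation factor, VIF = 1/(1 - R^2), where R^2 comes from regressing one predictor on the other(s); values far above 1 signal the inflated standard errors described above. A minimal sketch computed by hand (car::vif() on the joint model should return essentially the same number for each predictor):

# Variance inflation factor for x2 given x1 (with two predictors, the same value applies to x1 given x2)
r2 <- summary(lm(x2 ~ x1))$r.squared
1 / (1 - r2)    # roughly 23 here, given the correlation of about 0.98 reported in (b)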

g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the regression coefficients, the standard errors of the coefficients, and the R^2 values of the models?

\colorbox{cyan}{Answer:}

# Fit a least squares regression to predict y using x1 and x2 with the new data
model_new <- lm(y ~ x1 + x2)

# Summary of the linear model with the new data
summary(model_new)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.85767 -0.65747 -0.03277  0.67035  2.62315 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.0570     0.1087  18.928  < 2e-16 ***
x1            1.2905     0.4627   2.789  0.00635 ** 
x2            1.7611     0.8934   1.971  0.05152 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.084 on 98 degrees of freedom
Multiple R-squared:  0.768, Adjusted R-squared:  0.7633 
F-statistic: 162.2 on 2 and 98 DF,  p-value: < 2.2e-16
# Fit a least squares regression to predict y using only x1 with the new data

model_x1_new <- lm(y ~ x1)

# Summary of the linear model with x1 only and the new data

summary(model_x1_new)

Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9679 -0.4591 -0.0373  0.6150  3.7195 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.0635     0.1102   18.73   <2e-16 ***
x1            2.1707     0.1230   17.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.099 on 99 degrees of freedom
Multiple R-squared:  0.7588,    Adjusted R-squared:  0.7564 
F-statistic: 311.5 on 1 and 99 DF,  p-value: < 2.2e-16
# Fit a least squares regression to predict y using only x2 with the new data

model_x2_new <- lm(y ~ x2)

# Summary of the linear model with x2 only and the new data

summary(model_x2_new)

Call:
lm(formula = y ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.87202 -0.71986  0.01017  0.68516  2.55863 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.0577     0.1123   18.32   <2e-16 ***
x2            4.1658     0.2420   17.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.12 on 99 degrees of freedom
Multiple R-squared:  0.7496,    Adjusted R-squared:  0.7471 
F-statistic: 296.3 on 1 and 99 DF,  p-value: < 2.2e-16

The new observation changes the three fits in different ways:

  • In the model with both x_1 and x_2, the estimates shift substantially: \hat{\beta}_1 drops from 2.29 to 1.29 and \hat{\beta}_2 jumps from -0.23 to 1.76, with x_2 now borderline significant (p = 0.052). The point (x_1, x_2) = (0.1, 0.8) lies far from the line x_2 \approx 0.5 x_1 followed by the rest of the data, so it is a high-leverage point that pulls the fitted plane toward itself; the R^2 falls slightly from 0.781 to 0.768.

  • In the model with x_1 only, the slope is essentially unchanged (2.171 versus 2.171), but the residual standard error rises from 1.038 to 1.099 and the R^2 falls from 0.781 to 0.759. Here x_1 = 0.1 is an ordinary value, so the point has little leverage; instead it behaves as an outlier, producing the largest residual in the fit (about 3.7).

  • In the model with x_2 only, the estimates barely move (4.144 versus 4.166) and the R^2 even rises slightly (0.743 to 0.750). The value x_2 = 0.8 gives the point some leverage, but its response is consistent with the fitted line, so it is neither an outlier nor damaging to the fit.

In short, a single mismeasured observation can act as an outlier, a high-leverage point, or both, depending on the model, and its effect on the coefficients, standard errors, and R^2 varies accordingly.
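
These interpretations can be checked with standard diagnostics (a sketch): rstudent() returns studentized residuals, where values of roughly 3 or more flag outliers, and hatvalues() returns leverages, where values well above the average (p+1)/n, here about 0.02-0.03, flag high-leverage points. The added observation is row 101.

obs <- 101  # index of the added observation
rstudent(model_new)[obs]      # studentized residual, model with x1 and x2
hatvalues(model_new)[obs]     # leverage, model with x1 and x2
rstudent(model_x1_new)[obs]   # studentized residual, model with x1 only
hatvalues(model_x1_new)[obs]  # leverage, model with x1 only
rstudent(model_x2_new)[obs]   # studentized residual, model with x2 only
hatvalues(model_x2_new)[obs]  # leverage, model with x2 only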

15. This problem involves the \textcolor{brown}{Boston} data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.

(a) For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

\colorbox{cyan}{Answer:}

# Load the ISLR2 package, which contains the Boston data set
library(ISLR2)

# Get the names of the predictors
predictors <- setdiff(names(Boston), "crim")

# Fit a simple linear regression model for each predictor
models <- lapply(predictors, function(predictor) {
  formula <- as.formula(paste("crim ~", predictor))
  lm(formula, data = Boston)
})

# Print the summary of each model
lapply(models, summary)
[[1]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-4.429 -4.222 -2.620  1.250 84.523 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
zn          -0.07393    0.01609  -4.594 5.51e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.435 on 504 degrees of freedom
Multiple R-squared:  0.04019,   Adjusted R-squared:  0.03828 
F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06


[[2]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.972  -2.698  -0.736   0.712  81.813 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.06374    0.66723  -3.093  0.00209 ** 
indus        0.50978    0.05102   9.991  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.866 on 504 degrees of freedom
Multiple R-squared:  0.1653,    Adjusted R-squared:  0.1637 
F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16


[[3]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-3.738 -3.661 -3.435  0.018 85.232 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7444     0.3961   9.453   <2e-16 ***
chas         -1.8928     1.5061  -1.257    0.209    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.597 on 504 degrees of freedom
Multiple R-squared:  0.003124,  Adjusted R-squared:  0.001146 
F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094


[[4]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.371  -2.738  -0.974   0.559  81.728 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
nox           31.249      2.999  10.419  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.81 on 504 degrees of freedom
Multiple R-squared:  0.1772,    Adjusted R-squared:  0.1756 
F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16


[[5]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-6.604 -3.952 -2.654  0.989 87.197 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   20.482      3.365   6.088 2.27e-09 ***
rm            -2.684      0.532  -5.045 6.35e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.401 on 504 degrees of freedom
Multiple R-squared:  0.04807,   Adjusted R-squared:  0.04618 
F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07


[[6]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-6.789 -4.257 -1.230  1.527 82.849 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
age          0.10779    0.01274   8.463 2.85e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.057 on 504 degrees of freedom
Multiple R-squared:  0.1244,    Adjusted R-squared:  0.1227 
F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16


[[7]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-6.708 -4.134 -1.527  1.516 81.674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.4993     0.7304  13.006   <2e-16 ***
dis          -1.5509     0.1683  -9.213   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.965 on 504 degrees of freedom
Multiple R-squared:  0.1441,    Adjusted R-squared:  0.1425 
F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16


[[8]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.164  -1.381  -0.141   0.660  76.433 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
rad          0.61791    0.03433  17.998  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.718 on 504 degrees of freedom
Multiple R-squared:  0.3913,    Adjusted R-squared:   0.39 
F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16


[[9]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.513  -2.738  -0.194   1.065  77.696 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
tax          0.029742   0.001847   16.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.997 on 504 degrees of freedom
Multiple R-squared:  0.3396,    Adjusted R-squared:  0.3383 
F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16


[[10]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-7.654 -3.985 -1.912  1.825 83.353 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
ptratio       1.1520     0.1694   6.801 2.94e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.24 on 504 degrees of freedom
Multiple R-squared:  0.08407,   Adjusted R-squared:  0.08225 
F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11


[[11]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.925  -2.822  -0.664   1.079  82.862 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
lstat        0.54880    0.04776  11.491  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.664 on 504 degrees of freedom
Multiple R-squared:  0.2076,    Adjusted R-squared:  0.206 
F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16


[[12]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-9.071 -4.022 -2.343  1.298 80.957 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.79654    0.93419   12.63   <2e-16 ***
medv        -0.36316    0.03839   -9.46   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.934 on 504 degrees of freedom
Multiple R-squared:  0.1508,    Adjusted R-squared:  0.1491 
F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

The results of the simple linear regression models (using the p-values reported in the fitted models above) are as follows:

  • zn: coefficient -0.0739, p-value 5.5e-06; a statistically significant (negative) association with crim.

  • indus: coefficient 0.5098, p-value < 2e-16; a statistically significant association with crim.

  • chas: coefficient -1.8928, p-value 0.209; no statistically significant association with crim.

  • nox: coefficient 31.2485, p-value < 2e-16; a statistically significant association with crim.

  • rm: coefficient -2.6841, p-value 6.3e-07; a statistically significant (negative) association with crim.

  • age: coefficient 0.1078, p-value 2.9e-16; a statistically significant association with crim.

  • dis: coefficient -1.5509, p-value < 2e-16; a statistically significant (negative) association with crim.

  • rad: coefficient 0.6179, p-value < 2e-16; a statistically significant association with crim.

  • tax: coefficient 0.0297, p-value < 2e-16; a statistically significant association with crim.

  • ptratio: coefficient 1.1520, p-value 2.9e-11; a statistically significant association with crim.

  • lstat: coefficient 0.5488, p-value < 2e-16; a statistically significant association with crim.

  • medv: coefficient -0.3632, p-value < 2e-16; a statistically significant (negative) association with crim.

In other words, every predictor except chas shows a statistically significant association with the per capita crime rate when considered on its own.

# Get the names of the predictors
predictors <- setdiff(names(Boston), "crim")

# Fit a simple linear regression model for each predictor
models <- list()
for (predictor in predictors) {
  formula <- as.formula(paste("crim ~", predictor))
  models[[predictor]] <- lm(formula, data = Boston)
}

# Create a scatter plot and a regression line for each model
for(predictor in predictors) {
  # Create a new plot
  plot(Boston[[predictor]], Boston$crim, main = predictor, xlab = predictor, ylab = "crim")
  
  # Add the regression line
  abline(coef(models[[predictor]]), col = "red")
  
  # Pause execution so you can see the plot
  #readline(prompt="Press [enter] to continue")
}

The scatter plots and regression lines for each model provide visual representations of the relationships between the predictors and the response variable. The plots show how the predictors are related to the per capita crime rate and whether there is a linear association between the variables. The regression lines help visualize the direction and strength of the relationship between each predictor and the response.

(b) Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H_{0} : \beta_{j} = 0?

\colorbox{cyan}{Answer:}

# Fit a multiple regression model using all predictors

model_all <- lm(crim ~ ., data = Boston)

# Summary of the multiple regression model
summary_model_all <- summary(model_all)
summary_model_all

Call:
lm(formula = crim ~ ., data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-8.534 -2.248 -0.348  1.087 73.923 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.7783938  7.0818258   1.946 0.052271 .  
zn           0.0457100  0.0187903   2.433 0.015344 *  
indus       -0.0583501  0.0836351  -0.698 0.485709    
chas        -0.8253776  1.1833963  -0.697 0.485841    
nox         -9.9575865  5.2898242  -1.882 0.060370 .  
rm           0.6289107  0.6070924   1.036 0.300738    
age         -0.0008483  0.0179482  -0.047 0.962323    
dis         -1.0122467  0.2824676  -3.584 0.000373 ***
rad          0.6124653  0.0875358   6.997 8.59e-12 ***
tax         -0.0037756  0.0051723  -0.730 0.465757    
ptratio     -0.3040728  0.1863598  -1.632 0.103393    
lstat        0.1388006  0.0757213   1.833 0.067398 .  
medv        -0.2200564  0.0598240  -3.678 0.000261 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.46 on 493 degrees of freedom
Multiple R-squared:  0.4493,    Adjusted R-squared:  0.4359 
F-statistic: 33.52 on 12 and 493 DF,  p-value: < 2.2e-16

The multiple regression model using all of the predictors gives the following results:

  • The multiple R^2 is 0.449 (adjusted R^2 = 0.436), so together the predictors explain roughly 45% of the variance in crim.

  • The F-statistic of 33.52 on 12 and 493 degrees of freedom has a p-value below 2.2e-16, so the overall regression is highly significant.

\therefore at the 5% level we can reject the null hypothesis H_{0} : \beta_{j} = 0 for \textcolor{brown}{zn} (p = 0.015), \textcolor{brown}{dis} (p = 0.0004), \textcolor{brown}{rad} (p = 8.6e-12) and \textcolor{brown}{medv} (p = 0.0003); for the remaining predictors the p-values exceed 0.05. This is far fewer significant predictors than in the simple regressions of part (a).
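
The same list can be extracted programmatically (a small sketch using the summary object stored above):

# Predictors whose coefficients are significant at the 5% level in the joint model
pvals <- summary_model_all$coefficients[-1, "Pr(>|t|)"]
names(pvals)[pvals < 0.05]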

(c) How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple regression model is shown on the y-axis.

\colorbox{cyan}{Answer:}

# Extract the coefficients from the multiple regression model
coefficients_all <- coef(model_all)

# Extract the coefficients from the simple linear regression models
coefficients_simple <- sapply(models, function(model) coef(model)[2])

# Create a data frame with the coefficients
coefficients_df <- data.frame(Simple = coefficients_simple, Multiple = coefficients_all[-1])

# Create a scatter plot of the coefficients
plot(coefficients_df$Simple, coefficients_df$Multiple, xlab = "Simple Regression Coefficients", ylab = "Multiple Regression Coefficients", main = "Comparison of Regression Coefficients")

# Add a reference line
abline(a = 0, b = 1, col = "red")

# Add text labels for each point
text(coefficients_df$Simple, coefficients_df$Multiple, labels = rownames(coefficients_df), pos = 3)

The plot displays the comparison of the univariate regression coefficients from (a) with the multiple regression coefficients from (b). Each predictor is represented as a single point in the plot, with its coefficient in a simple linear regression model shown on the x-axis and its coefficient estimate in the multiple regression model shown on the y-axis. The red reference line indicates where the coefficients would be equal in both models.

The plot helps visualize how the coefficients from the simple regressions compare to those from the multiple regression. Most points lie away from the red line, showing that the two sets of estimates differ; the most striking case is nox, whose coefficient is about +31 in the simple regression but about -10 in the multiple regression. Points above the line have larger coefficients in the multiple regression than in the corresponding simple regression, and points below the line have smaller ones. These discrepancies are another symptom of the correlation among the predictors.

(d) Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form Y = \beta_{0} + \beta_{1}X + \beta_{2}X^{2} + \epsilon.

\colorbox{cyan}{Answer:}

# Fit a quadratic regression model for each predictor
models_quadratic <- lapply(predictors, function(predictor) {
  formula <- as.formula(paste("crim ~", predictor, "+ I(", predictor, "^2)"))
  lm(formula, data = Boston)
})

# Print the summary of each quadratic model
lapply(models_quadratic, summary)
[[1]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-4.760 -4.553 -1.491  0.821 84.191 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.7853138  0.4302223  11.123  < 2e-16 ***
zn          -0.2159497  0.0521941  -4.137 4.12e-05 ***
I(zn^2)      0.0019080  0.0006676   2.858  0.00444 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.376 on 503 degrees of freedom
Multiple R-squared:  0.05553,   Adjusted R-squared:  0.05177 
F-statistic: 14.79 on 2 and 503 DF,  p-value: 5.757e-07


[[2]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-7.530 -3.769 -1.069  1.354 81.462 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.849269   1.109293  -4.371 1.50e-05 ***
indus        1.189715   0.223173   5.331 1.48e-07 ***
I(indus^2)  -0.027993   0.008949  -3.128  0.00186 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.799 on 503 degrees of freedom
Multiple R-squared:  0.1812,    Adjusted R-squared:  0.178 
F-statistic: 55.67 on 2 and 503 DF,  p-value: < 2.2e-16


[[3]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-3.738 -3.661 -3.435  0.018 85.232 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7444     0.3961   9.453   <2e-16 ***
chas         -1.8928     1.5061  -1.257    0.209    
I(chas^2)         NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.597 on 504 degrees of freedom
Multiple R-squared:  0.003124,  Adjusted R-squared:  0.001146 
F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094


[[4]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-7.512 -3.394 -1.467  1.444 80.946 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.325      7.572  -5.457  7.6e-08 ***
nox          127.877     26.016   4.915  1.2e-06 ***
I(nox^2)     -80.958     21.655  -3.738 0.000206 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.711 on 503 degrees of freedom
Multiple R-squared:  0.1995,    Adjusted R-squared:  0.1963 
F-statistic: 62.66 on 2 and 503 DF,  p-value: < 2.2e-16


[[5]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.962  -3.528  -2.142  -0.237  87.470 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  71.3257    16.2718   4.383 1.42e-05 ***
rm          -18.7078     5.0469  -3.707 0.000233 ***
I(rm^2)       1.2468     0.3906   3.192 0.001499 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.325 on 503 degrees of freedom
Multiple R-squared:  0.06697,   Adjusted R-squared:  0.06326 
F-statistic: 18.05 on 2 and 503 DF,  p-value: 2.681e-08


[[6]]

Call:
lm(formula = formula, data = Boston)

Residuals:
   Min     1Q Median     3Q    Max 
-8.650 -3.484 -0.394  0.663 82.476 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.3100066  1.7550195   1.886   0.0599 .  
age         -0.2012862  0.0662371  -3.039   0.0025 ** 
I(age^2)     0.0025680  0.0005405   4.751 2.64e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.89 on 503 degrees of freedom
Multiple R-squared:  0.162, Adjusted R-squared:  0.1587 
F-statistic: 48.63 on 2 and 503 DF,  p-value: < 2.2e-16


[[7]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.896  -3.721  -0.229   1.736  78.724 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.96831    1.33179  13.492  < 2e-16 ***
dis         -6.11277    0.63286  -9.659  < 2e-16 ***
I(dis^2)     0.46971    0.06305   7.450 4.09e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.567 on 503 degrees of freedom
Multiple R-squared:  0.2292,    Adjusted R-squared:  0.2261 
F-statistic: 74.79 on 2 and 503 DF,  p-value: < 2.2e-16


[[8]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.373  -0.408  -0.243   0.039  76.224 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.57387    1.17805   0.487  0.62637   
rad         -0.18796    0.30959  -0.607  0.54404   
I(rad^2)     0.02897    0.01106   2.619  0.00909 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.679 on 503 degrees of freedom
Multiple R-squared:  0.3994,    Adjusted R-squared:  0.3971 
F-statistic: 167.3 on 2 and 503 DF,  p-value: < 2.2e-16


[[9]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.810  -1.085  -0.011   0.256  76.982 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.934e+00  3.192e+00   1.859  0.06361 .  
tax         -4.318e-02  1.569e-02  -2.753  0.00612 ** 
I(tax^2)     7.850e-05  1.677e-05   4.680 3.69e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.856 on 503 degrees of freedom
Multiple R-squared:  0.3672,    Adjusted R-squared:  0.3647 
F-statistic: 145.9 on 2 and 503 DF,  p-value: < 2.2e-16


[[10]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.850  -3.204  -1.225   0.235  83.038 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  51.67955   23.08535   2.239  0.02562 * 
ptratio      -6.87098    2.65239  -2.590  0.00986 **
I(ptratio^2)  0.22805    0.07524   3.031  0.00256 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.174 on 503 degrees of freedom
Multiple R-squared:  0.1005,    Adjusted R-squared:  0.09692 
F-statistic:  28.1 on 2 and 503 DF,  p-value: 2.702e-12


[[11]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.965  -2.365  -0.636   0.523  83.503 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.27531    1.20609  -1.057   0.2908  
lstat        0.20674    0.17122   1.207   0.2278  
I(lstat^2)   0.01077    0.00518   2.080   0.0381 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.639 on 503 degrees of freedom
Multiple R-squared:  0.2143,    Adjusted R-squared:  0.2112 
F-statistic: 68.62 on 2 and 503 DF,  p-value: < 2.2e-16


[[12]]

Call:
lm(formula = formula, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.802  -3.127  -0.593   2.031  75.204 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.969173   1.777610   17.98   <2e-16 ***
medv        -2.071441   0.137980  -15.01   <2e-16 ***
I(medv^2)    0.030938   0.002425   12.76   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.903 on 503 degrees of freedom
Multiple R-squared:  0.3584,    Adjusted R-squared:  0.3559 
F-statistic: 140.5 on 2 and 503 DF,  p-value: < 2.2e-16

The quadratic regression models give direct evidence about non-linear associations. Judging by the p-values of the squared terms, there is evidence of a non-linear relationship with crim for zn, indus, nox, rm, age, dis, rad, tax, ptratio and medv (all squared-term p-values below 0.01), and weaker evidence for lstat (p = 0.038). For chas no quadratic term can be estimated at all, because chas is a binary dummy variable and chas^2 is identical to chas. So for almost every quantitative predictor the relationship with the per capita crime rate appears to be non-linear.
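
A compact way to scan the quadratic terms is to pull the p-value of the squared term out of each fit (a sketch using the objects defined above; it returns NA where the squared term was dropped, as for the binary chas):

# p-value of the squared term in each quadratic model
quad_pvals <- sapply(seq_along(predictors), function(i) {
  ct <- summary(models_quadratic[[i]])$coefficients
  if (nrow(ct) >= 3) ct[3, "Pr(>|t|)"] else NA
})
names(quad_pvals) <- predictors
round(quad_pvals, 4)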

\colorbox{cyan}{"This is the end, beautiful friend} \colorbox{cyan}{This is the end, my only friend} \colorbox{cyan}{The end" -The doors.}

\colorbox{cyan}{You can contact me: albaruhur@gmail.com} \colorbox{cyan}{-Albar.}