Question 8

a)

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
model <- lm(mpg ~ horsepower, data=Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
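
Exercise 8(a) also asks for the predicted mpg when horsepower is 98, together with the associated 95% confidence and prediction intervals; a minimal sketch using predict() on the fitted model:

# Predicted mpg at horsepower = 98, with 95% confidence and prediction intervals
predict(model, data.frame(horsepower = 98), interval = "confidence")
predict(model, data.frame(horsepower = 98), interval = "prediction")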

b)

# Fit the linear model
model <- lm(mpg ~ horsepower, data=Auto)

# Create the scatterplot
plot(Auto$horsepower, Auto$mpg, xlab="Horsepower", ylab="MPG", main="MPG vs Horsepower")

# Add the least squares regression line
abline(model, col="red")

c)

# Fit the linear model
model <- lm(mpg ~ horsepower, data=Auto)

# Produce diagnostic plots
par(mfrow=c(2,2))  # Sets the plotting area into a 2x2 layout
plot(model)  # Produces the diagnostic plots

Question 10

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

# Load the Carseats data
data(Carseats)

# Fit the multiple regression model
model <- lm(Sales ~ Price + Urban + US, data=Carseats)

# Display the summary of the model
summary(model)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

b)

For the model Sales ~ Price + Urban + US:

Price: This is a quantitative variable. Its coefficient represents the expected change in Sales for each one-unit increase in Price (e.g., for each additional dollar in the price of a car seat), holding all other variables in the model constant. A negative coefficient means that higher prices are associated with lower sales, which is intuitive for most goods.

Urban: This is a qualitative variable with two levels, which R encodes as a dummy variable UrbanYes (1 if the store is in an urban location, 0 otherwise). Its coefficient is the difference in Sales between urban and non-urban stores, all else being equal. A negative coefficient indicates that, on average, sales are lower in urban stores than in non-urban stores, controlling for the other factors in the model.

US: This is also a qualitative variable, encoded as USYes (1 if the store is in the US, 0 otherwise). Its coefficient is the difference in Sales between stores in the US and stores outside the US, holding all else equal. A positive coefficient suggests that car seats sell better in the US than elsewhere in this dataset.

The actual numeric values of these coefficients would determine the magnitude and direction (positive or negative) of these effects. It’s important to look at the p-values associated with each coefficient to determine if the effects are statistically significant.
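
The estimates and p-values can be inspected together by extracting the coefficient table from the summary object (a minimal sketch in base R):

# Coefficient table: estimates, standard errors, t-values, and p-values
coefs <- coef(summary(model))
coefs

# Which coefficients are significant at the 5% level?
coefs[, "Pr(>|t|)"] < 0.05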

c)

Multiple Regression Model Interpretation

When writing out the model in equation form, the qualitative (categorical) variables are handled through dummy variables. Assuming 'No' is the reference level for Urban and non-US is the reference level for US, the regression equation for the model Sales ~ Price + Urban + US is:

Sales = β0 + β1 * Price + β2 * UrbanYes + β3 * USYes + ε

Interpretations:

- β0 (intercept) represents the average sales for the baseline case: a non-urban store outside the US, with the price being zero.
- β1 (coefficient for Price) represents the average change in sales for each one-unit increase in the price of the car seat.
- β2 (coefficient for UrbanYes) reflects the difference in average sales between urban and non-urban stores. A negative β2 suggests that being in an urban location is associated with lower sales, on average, compared to non-urban locations.
- β3 (coefficient for USYes) indicates the difference in average sales between stores in the US and stores outside the US. A positive β3 suggests that sales are higher in the US, on average, than outside the US.
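
To see exactly how R constructs these dummy variables, one can inspect the factor contrasts and the design matrix directly (a minimal sketch using base R):

# How R codes the two factors
contrasts(Carseats$Urban)
contrasts(Carseats$US)

# First few rows of the design matrix used by lm()
head(model.matrix(~ Price + Urban + US, data = Carseats))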

d)

Price: The p-value for Price is less than 2e-16, which is far below the common significance level of 0.05. This means you can reject the null hypothesis for Price, indicating a significant relationship between Price and Sales. The negative coefficient (-0.054459) suggests that as Price increases, Sales tend to decrease.

UrbanYes: The p-value for UrbanYes is 0.936, which is much greater than 0.05. Therefore, you cannot reject the null hypothesis for UrbanYes, indicating that being in an urban location does not have a statistically significant effect on Sales, when controlling for other factors in the model.

USYes: The p-value for USYes is 4.86e-06, which is well below the 0.05 threshold. This means you can reject the null hypothesis for USYes, suggesting that the market (US vs. non-US) has a significant effect on Sales. The positive coefficient (1.200573) indicates that Sales are higher for car seats sold in the US compared to those sold outside the US, all else being equal.

e)

Based on the previous analysis, there is evidence of an association between Sales and the predictors Price and US, but not Urban. Therefore, we fit a smaller model that includes only Price and US as predictors.

# Fit the smaller model
smaller_model <- lm(Sales ~ Price + US, data=Carseats)

# Display the summary of the smaller model
summary(smaller_model)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

The results show that both predictors are statistically significant: the p-value for Price is less than 2e-16 and the p-value for USYes is 4.71e-06.

f)

Interpretation of the fit from (e):

The multiple regression model predicts sales (Sales) using two predictors: price (Price) and whether the store is located in the US (US).

Coefficients:

- For every one-unit increase in price, sales decrease by approximately 0.054 units.
- Being located in the US is associated with an increase in sales of approximately 1.20 units compared to not being in the US.

Model Fit:

- The model explains about 24% of the variability in sales.
- Both predictors (Price and US) are statistically significant.
- The model's overall significance is confirmed by a highly significant F-statistic.

Overall, the model reasonably explains the relationship between the predictors and sales, but there may still be other factors influencing sales that are not accounted for in the model.
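
The fit statistics quoted here can be pulled out of the summary object programmatically (a minimal sketch):

# R-squared and adjusted R-squared for the model from (e)
summary(smaller_model)$r.squared
summary(smaller_model)$adj.r.squared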

Interpretation of the fit from (a):

Residuals: The residuals are the differences between the observed sales values and the values predicted by the model. They range from approximately -6.92 to 7.06, with no obvious patterns or trends.

Coefficients:

- For every one-unit increase in price, sales decrease by approximately 0.054 units.
- There is no statistically significant association between sales and whether the store is located in an urban area (UrbanYes).
- Being located in the US is associated with an increase in sales of approximately 1.20 units compared to not being in the US.

Model Fit:

- The model explains about 24% of the variability in sales.
- The predictors Price and US are statistically significant.
- The overall model is significant, as indicated by the highly significant F-statistic.

Overall, the model reasonably explains the relationship between the predictors and sales; however, the urban indicator (UrbanYes) is not statistically significant in predicting sales.
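
Since the model in (e) is nested inside the model in (a), an F-test comparing the two fits confirms that dropping Urban loses essentially nothing (a minimal sketch using the two models fit above):

# Compare the reduced model (e) against the full model (a);
# a large p-value means Urban adds no significant explanatory power
anova(smaller_model, model)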

g)

# Obtain 95% confidence intervals for coefficients
conf_intervals <- confint(smaller_model)

# Print the confidence intervals
conf_intervals
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
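
These intervals are just estimate ± t-quantile × standard error; as a sanity check, the interval for Price can be reproduced by hand (a minimal sketch):

# Reproduce the 95% CI for Price: estimate +/- t * SE
est <- coef(smaller_model)["Price"]
se  <- coef(summary(smaller_model))["Price", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = smaller_model$df.residual) * se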

h)
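
Part (h) asks whether the model from (e) shows evidence of outliers or high-leverage observations; a minimal sketch of the usual diagnostics (studentized residuals beyond roughly ±3 suggest outliers, and leverage far above the average (p + 1)/n = 3/400 suggests high-leverage points):

# Diagnostic plots for the model from (e)
par(mfrow = c(2, 2))
plot(smaller_model)

# Studentized residuals: outlier check
range(rstudent(smaller_model))

# Leverage statistics: high-leverage check
max(hatvalues(smaller_model))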

14(a)

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

Based on the code provided, the linear model has the following form:

y = β0 + β1x1 + β2x2 + ε

Where:

- y is the response variable
- x1 and x2 are the predictor variables
- β0 is the intercept term
- β1 and β2 are the regression coefficients
- ε is the error term

Specifically, the regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3, so the full linear model is:

y = 2 + 2x1 + 0.3x2 + ε

14(b)

x2 <- 0.5 * x1 + rnorm(100) / 10

From this line we can see that x2 is generated from x1: it is set to 0.5 times x1, with random normal noise (rnorm(100) divided by 10) added. So x2 is a direct linear function of x1 plus noise, and the correlation between x1 and x2 is very high (about 0.84 with this seed).

plot(x1, x2, 
     main="Relationship Between x1 and x2",
     xlab="x1", ylab="x2")
abline(lm(x2~x1), col="red")
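
As a quick numerical check of the correlation claimed above (a base R one-liner):

# Sample correlation between the two predictors
cor(x1, x2)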

14(c)

model <- lm(y ~ x1 + x2)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

Based on the model summary:

- Estimated β0 (intercept) = 2.1305
- Estimated β1 (coefficient for x1) = 1.4396
- Estimated β2 (coefficient for x2) = 1.0097

Comparing to the true underlying coefficients (β0 = 2, β1 = 2, β2 = 0.3), we see that the estimated coefficients for x1 and x2 are quite far from the true values, due to the collinearity between x1 and x2.

Regarding hypothesis tests:

- For H0: β1 = 0, we reject the null hypothesis, since the p-value for x1 is 0.0487 (< 0.05); the coefficient for x1 is significantly different from 0.
- For H0: β2 = 0, we fail to reject the null hypothesis, since the p-value for x2 is 0.3754 (much greater than 0.05); the coefficient for x2 is not significantly different from 0.

In summary, the collinearity has inflated the variance of the coefficient estimates, so they deviate from the true values, and only x1 is detected as a significant predictor when both should be. Addressing the collinearity should improve both the coefficient estimates and the hypothesis tests.
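
A standard way to quantify this collinearity is the variance inflation factor; assuming the car package is available, a minimal sketch:

# VIFs well above 5-10 indicate problematic collinearity
library(car)
vif(lm(y ~ x1 + x2))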

14(d)

# Fit least squares regression model using only x1
lm_model <- lm(y ~ x1)

# Print the summary of the regression results
summary(lm_model)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

Since the p-value is much smaller than the typical significance level of 0.05, we have strong evidence to reject the null hypothesis. This indicates that the predictor x1 is statistically significant in predicting the response variable y, and there is a significant linear relationship between x1 and y.

14(e)

# Fit least squares regression model using only x2
lm_model <- lm(y ~ x2)

# Print the summary of the regression results
summary(lm_model)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

Since the p-value is much smaller than the typical significance level of 0.05, we have strong evidence to reject the null hypothesis. This indicates that the predictor x2 is statistically significant in predicting the response variable y, and there is a significant linear relationship between x2 and y.

14(f)

There is no inherent contradiction, but the results do reveal some key insights about the underlying relationships:

- In the model with both x1 and x2, x2 was not a statistically significant predictor (p-value 0.3754).
- However, when y is modeled with only x2 as the predictor, x2 is highly statistically significant (p-value 1.37e-05).

This suggests that in the combined model the collinearity between x1 and x2 inflates the standard errors, making it impossible to detect the significance of x2 once x1 is controlled for.

14(g)

# Add an additional, mismeasured observation
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
summary(lm(y ~ x1 + x2))
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
summary(lm(y ~ x1))
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
summary(lm(y ~ x2))
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
par(mfrow = c(2, 2))
plot(lm(y ~ x1 + x2), cex = 0.2)

par(mfrow = c(2, 2))
plot(lm(y ~ x1), cex = 0.2)

par(mfrow = c(2, 2))
plot(lm(y ~ x2), cex = 0.2)

In the first model (with both predictors), the new point has very high leverage, since it lies far outside the joint distribution of x1 and x2; however, it is not an outlier. In the model using only x1, the point is an outlier but does not have high leverage. In the model using only x2, it has high leverage but is not an outlier. The scatterplot of x1 against x2 makes this clear.
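
These claims can be backed up numerically by looking at the studentized residual and leverage of the added observation (index 101) in each fit (a minimal sketch):

# Studentized residual and leverage of the new point in each model
for (f in list(lm(y ~ x1 + x2), lm(y ~ x1), lm(y ~ x2))) {
  cat(deparse(formula(f)), ": rstudent =", round(rstudent(f)[101], 2),
      ", leverage =", round(hatvalues(f)[101], 3), "\n")
}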

# Scatterplot of the two predictors, highlighting the added observation
plot(x1, x2)
points(0.1, 0.8, col = "red", pch = 19)