Chapter 3

Question 8

8a

library(ISLR) 
library(ggplot2)

# Load the Auto dataset
data(Auto)

# Fit the simple linear regression model
lm_model <- lm(mpg ~ horsepower, data = Auto)

# Display the summary of the regression
summary(lm_model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

(i)

Yes, there is a relationship between horsepower and mpg, as determined by testing the null hypothesis that all slope coefficients are equal to zero (here, just the coefficient on horsepower). Since the F-statistic is far larger than 1 and its p-value is essentially zero, we can reject the null hypothesis and conclude that there is a statistically significant relationship between horsepower and mpg.
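
For reference, the quantities behind this F-test can be pulled directly out of the summary object; a minimal sketch using the lm_model fitted above:

# Extract the F-statistic and compute its p-value directly
fstat <- summary(lm_model)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)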

(ii)

The R-squared value means that about 60.59% of the variation in mpg is explained by horsepower. This suggests a moderate to strong linear relationship.
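
The same quantity can be read programmatically from the summary object; a small sketch:

# Extract R-squared from the fitted model's summary
summary(lm_model)$r.squared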

(iii)

The relationship between mpg and horsepower is negative: the fitted least squares line indicates that automobiles with more horsepower tend to have lower fuel efficiency (mpg).
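
As a quick illustration of the magnitude (a sketch, using the coefficient reported above):

# Expected change in mpg associated with an additional 10 horsepower
coef(lm_model)["horsepower"] * 10   # roughly -1.58 mpg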

(iv)

new_data <- data.frame(horsepower = 98)
predict(lm_model, newdata = new_data, interval = "confidence")  # Confidence interval
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(lm_model, newdata = new_data, interval = "prediction")  # Prediction interval
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

8b

# Fit the simple linear regression model
lm_model <- lm(mpg ~ horsepower, data = Auto)

# Scatterplot of mpg vs horsepower with regression line
plot(Auto$horsepower, Auto$mpg, 
     main = "MPG vs Horsepower", 
     xlab = "Horsepower", 
     ylab = "MPG", 
     pch = 16, col = "blue")  # Plot data points

# Add the regression line
abline(lm_model, col = "red", lwd = 2)  # Add least squares regression line
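
Since ggplot2 is loaded above but otherwise unused, an equivalent plot could also be drawn with it; a sketch:

# Scatterplot with the least squares line, using ggplot2
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(colour = "blue") +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "MPG")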

8c

lm_model <- lm(mpg ~ horsepower, data = Auto)

# Generate diagnostic plots
par(mfrow = c(2, 2))  # Arrange plots in a 2x2 grid
plot(lm_model)  # Produces residual plots

Problems observed in this fit:

  • The Residuals vs. Fitted plot likely shows a curved pattern, indicating that a quadratic term may fit the data better (see the sketch below).
  • The Normal Q-Q plot may show non-normal residuals at the extremes, suggesting potential outliers or a skewed residual distribution.
  • The Scale-Location plot might indicate heteroscedasticity (non-constant variance).
  • The Residuals vs. Leverage plot could reveal high-leverage points, which may need further investigation.
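
If the curvature is real, one possible remedy is to add a quadratic term; a minimal sketch (not part of the original fit):

# Fit a quadratic model and compare it with the linear fit
lm_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
anova(lm_model, lm_quad)   # tests whether the quadratic term improves the fit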


Question 10

10a

library(ISLR) 
data(Carseats)

# Fit the multiple regression model
lm_model <- lm(Sales ~ Price + Urban + US, data = Carseats)

# Print the summary of the model
summary(lm_model)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

10b

  • Intercept: When Price = 0 and the store is neither in an urban area nor in the US, the predicted Sales is 13.043 (Sales is recorded in thousands of units, so roughly 13,043 car seats).
  • Price: The regression suggests a relationship between Price and Sales, given the very small p-value of its t-statistic. The negative coefficient means that as Price increases, Sales decrease: each one-unit increase in Price is associated with a drop of about 0.054 thousand (roughly 54) units sold, holding Urban and US fixed.
  • UrbanYes: The high p-value (0.936) of its t-statistic suggests there is no relationship between whether the store is in an urban location and Sales.
  • USYes: The small p-value suggests a relationship between whether the store is in the US and Sales. The positive coefficient means that, holding the other predictors fixed, a US store sells about 1.2 thousand (roughly 1,201) more units. (The dummy coding behind UrbanYes and USYes is checked in the sketch below.)
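
The interpretation of UrbanYes and USYes depends on how R encodes the factor levels; a quick check (sketch):

# Show the dummy-variable coding R uses for the qualitative predictors
contrasts(Carseats$Urban)
contrasts(Carseats$US)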

10c

Sales = 13.043 − 0.054459 × Price − 0.021916 × UrbanYes + 1.200573 × USYes

where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
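The equation can be reproduced from the fitted object and applied to a new store; a sketch using purely hypothetical input values (Price = 100, Urban = "Yes", US = "Yes"):

# Coefficients used in the equation above
coef(lm_model)

# Predicted Sales (in thousands) for one hypothetical store
predict(lm_model, newdata = data.frame(Price = 100, Urban = "Yes", US = "Yes"))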
 
 

10d

  • Price and USYes: both have very small p-values on their t-statistics (< 2e-16 and 4.86e-06, respectively), so we can reject H0: βj = 0 for these predictors. We cannot reject it for Urban (p = 0.936). (A programmatic check follows below.)
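
A programmatic version of this check (sketch):

# Predictors whose individual t-test p-values fall below 0.05
coefs <- summary(lm_model)$coefficients[-1, ]   # drop the intercept row
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]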

10e

lm_model_smaller <- lm(Sales ~ Price + US, data = Carseats)

# Print the summary of the smaller model
summary(lm_model_smaller)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

10f

  • The Urban variable did not show a statistically significant relationship with Sales, so it was removed in the smaller model.
  • The smaller model fits essentially as well: it has the same R-squared (0.2393), a slightly higher adjusted R-squared (0.2354 vs. 0.2335), and a slightly smaller residual standard error, while dropping an insignificant predictor. Both models explain only about 24% of the variance in Sales, so the fit is modest. (A nested-model comparison is sketched below.)
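
A formal way to compare the two nested fits (sketch, using the models defined above):

# F-test for dropping Urban from the full model
anova(lm_model_smaller, lm_model)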

10g

lm_model_smaller <- lm(Sales ~ Price + US, data = Carseats)

# Obtain 95% confidence intervals for the coefficients
confint(lm_model_smaller, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

10h

# Fit the smaller model using Price and US
lm_model_smaller <- lm(Sales ~ Price + US, data = Carseats)

# Plot studentized residuals vs predicted values
plot(predict(lm_model_smaller), rstudent(lm_model_smaller), 
     xlab = "Predicted Values", ylab = "Studentized Residuals", 
     main = "Studentized Residuals vs Predicted Values")
abline(h = 0, col = "red")  # Add a horizontal line at y = 0

# Fit the smaller model using Price and US
lm_model_smaller <- lm(Sales ~ Price + US, data = Carseats)

# Set up a 2x2 grid for diagnostic plots
par(mfrow=c(2,2))

# Generate the diagnostic plots
plot(lm_model_smaller)

  • A few observations exceed the average leverage (p+1)/n = 3/400 ≈ 0.0075 on the Residuals vs. Leverage plot, which suggests that the corresponding points have high leverage. (A numerical check follows below.)
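
A numerical version of that leverage check (sketch; the cutoff is a common rule of thumb):

# Leverage statistics for the smaller model
lev <- hatvalues(lm_model_smaller)
avg_lev <- length(coef(lm_model_smaller)) / nrow(Carseats)   # (p+1)/n = 3/400 = 0.0075
which(lev > 2 * avg_lev)   # observations with leverage well above average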


Question 14

14a

set.seed(1)
x1 = runif(100)
x2 = 0.5 * x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)
model <- lm(y ~ x1 + x2)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
  • The model has the form Y = 2 + 2*X1 + 0.3*X2 + ϵ.
  • The true coefficients are β0 = 2, β1 = 2, β2 = 0.3.
  • Estimated regression coefficients: intercept = 2.1305, coefficient for x1 = 1.4396, coefficient for x2 = 1.0097.

14b

cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)

14c

lm.fit = lm(y~x1+x2)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
  • The estimated coefficients from this fit are β0 = 2.1305, β1 = 1.4396, β2 = 1.0097 (see the output above).
  • Compared with the true values (β0 = 2, β1 = 2, β2 = 0.3), β1 is underestimated and β2 is substantially overestimated, and both have large standard errors; this reflects the collinearity between x1 and x2. We can (just barely) reject the null hypothesis for β1, since its p-value (0.0487) is below 5%. We cannot reject the null hypothesis for β2, since its p-value (0.3754) is well above the typical 5% cutoff.

14d

lm.fit = lm(y~x1)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06
  • Yes, we can reject the null hypothesis for the regression coefficient given the p-value for its t-statistic is near zero.

14e

lm.fit = lm(y~x2)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05
  • Yes, we can reject the null hypothesis for the regression coefficient given the p-value for its t-statistic is near zero.

14f

  • No, the results do not contradict one another. Because x1 and x2 are highly collinear (correlation ≈ 0.84), it is difficult to separate their individual effects when they appear in the same regression: the standard errors are inflated and x2 looks insignificant. When each predictor is regressed on its own, its marginal linear relationship with y shows up clearly. (A VIF check is sketched below.)
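
Variance inflation factors make the collinearity concrete; a sketch assuming the car package is available (it is not loaded elsewhere in this document):

# VIFs for the two-predictor model; values well above 1 indicate collinearity
library(car)
vif(lm(y ~ x1 + x2))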

14g

x1 = c(x1, 0.1)
x2 = c(x2, 0.8)
y = c(y, 6)
lm.fit1 = lm(y~x1+x2)
summary(lm.fit1)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
lm.fit2 = lm(y~x1)
summary(lm.fit2)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
lm.fit3 = lm(y~x2)
summary(lm.fit3)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
par(mfrow=c(2,2))
plot(lm.fit1)

par(mfrow=c(2,2))
plot(lm.fit2)

par(mfrow=c(2,2))
plot(lm.fit3)

plot(predict(lm.fit1), rstudent(lm.fit1))
plot(predict(lm.fit2), rstudent(lm.fit2))
plot(predict(lm.fit3), rstudent(lm.fit3))
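
To supplement the plots, the added observation (row 101) can be checked numerically; a sketch using common rule-of-thumb cutoffs:

# Is the new point an outlier and/or a high-leverage point in the joint fit?
rstudent(lm.fit1)[101]                              # outlier if |value| is roughly > 3
hatvalues(lm.fit1)[101]                             # compare with average leverage (p+1)/n = 3/101
hatvalues(lm.fit1)[101] > 2 * mean(hatvalues(lm.fit1))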