library(ISLR2) # provides the Auto and Carseats data sets used below
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
a.
i. Yes, there is a relationship between horsepower (the predictor) and mpg (the response): the F-statistic and the coefficient p-value for horsepower are both far below 0.05.
ii. The relationship is fairly strong, since the R-squared is about 0.6, meaning roughly 60% of the variability in mpg is explained by horsepower (a quick numerical check follows).
iii. The relationship is negative, as shown by the negative coefficient on horsepower: as horsepower increases, mpg tends to decrease.
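A quick one-number check of both the direction and the strength is the sample correlation between mpg and horsepower; for a simple linear regression its square equals the R-squared reported above (a minimal sketch, assuming the Auto data are loaded as above).
# Correlation between mpg and horsepower: the sign gives the direction of the
# relationship, and its square equals the R-squared of the simple regression
cor(Auto$mpg, Auto$horsepower)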
predicted_output <- predict(lm_fit, data.frame(horsepower = (c(98))), interval = "confidence")
print(predicted_output)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
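For comparison, the corresponding 95% prediction interval can be obtained the same way; a minimal sketch (it will be wider than the confidence interval, since it also accounts for the variability of an individual car).
# 95% prediction interval for mpg at horsepower = 98
predict(lm_fit, data.frame(horsepower = 98), interval = "prediction")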
b.
plot(Auto$horsepower, Auto$mpg)
abline(lm_fit)
c.
par(mfrow = c(2, 2))
plot(lm_fit)
The residuals versus fitted plot shows a clear U-shaped pattern, which suggests the relationship between mpg and horsepower is non-linear; a few observations also stand out with relatively high leverage.
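To back up the visual impression with numbers, the largest studentized residual and the highest-leverage observation can be checked directly; a minimal sketch:
# Studentized residuals beyond |3| are commonly flagged as outliers, and hat
# values well above the average (p + 1)/n indicate high leverage
max(abs(rstudent(lm_fit)))
which.max(hatvalues(lm_fit))
(1 + 1) / nrow(Auto) # average-leverage benchmark for one predictor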
10. a.
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b. Sales decrease as Price increases. Sales are slightly lower in urban locations, but this effect is not statistically significant. Sales are higher for stores located in the US.
c. Sales = 13.043469 - 0.054459(Price) - 0.021916(Urban) + 1.200573(US), where Urban = 1 for urban stores (0 otherwise) and US = 1 for stores in the US (0 otherwise).
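To confirm how the qualitative predictors enter this equation, the dummy coding R used can be inspected (a quick check; Urban and US are factors in Carseats):
# Dummy-variable coding used by lm(): "Yes" is coded as 1, "No" as 0
contrasts(Carseats$Urban)
contrasts(Carseats$US)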
d. We can reject the null hypothesis for Price and
US as their p-values are below the common alpha level of 0.05,
indicating that these predictors have a statistically significant
association with Sales.
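The same conclusion can be read off the coefficient table programmatically; a small sketch:
# p-values for each coefficient: Price and USYes fall well below 0.05,
# while UrbanYes does not
summary(model)$coefficients[, "Pr(>|t|)"]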
e.
model <- lm(Sales ~ Price + US, data = Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f. The R-squared is quite low for both models (about 0.24), so neither model fits the data very well and their explanatory power is limited. Removing Urban leaves the R-squared essentially unchanged while slightly improving the adjusted R-squared, so the smaller model from (e) is preferable.
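A direct way to compare the two fits is a partial F-test. The sketch below refits both models under new names (fit_full and fit_small are names introduced here, since the answers above reuse the name model for both):
# Test whether dropping Urban significantly worsens the fit
fit_full <- lm(Sales ~ Price + Urban + US, data = Carseats)
fit_small <- lm(Sales ~ Price + US, data = Carseats)
anova(fit_small, fit_full)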
g.
confint(model)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
h. The diagnostic plots below show an observation with relatively high leverage, yet in the smaller model no studentized residuals are large enough to flag any point as an outlier. This indicates that certain points have a disproportionate influence on the model's predictions; a numerical check follows the plots.
par(mfrow = c(2,2))
plot(model)
plot(predict(model), rstudent(model))
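The plots can be supplemented with a quick numerical check using the smaller model from (e); a minimal sketch:
# Outlier check: count studentized residuals beyond the usual |3| cutoff
sum(abs(rstudent(model)) > 3)
# Leverage check: largest hat value versus the average (p + 1)/n benchmark
max(hatvalues(model))
(2 + 1) / nrow(Carseats)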
14. a.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The data are generated from the model \(y = 2 + 2x_1 + 0.3x_2 + \epsilon\), so the true regression coefficients are \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\).
b.
plot(x1, x2)
cor(x1, x2)
## [1] 0.8351212
x2 increases as x1 increases; the two predictors are strongly positively correlated (r ≈ 0.84).
c.
model2 <- lm(y ~ x1 + x2)
summary(model2)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimates are quite different from the true values, except for \(\hat{\beta}_0\). True values: \(\beta_0 = 2\), \(\beta_1 = 2\), \(\beta_2 = 0.3\); estimates: \(\hat{\beta}_0 = 2.13\), \(\hat{\beta}_1 = 1.44\), \(\hat{\beta}_2 = 1.01\). Since the p-value for \(\hat{\beta}_1\) is below 0.05, we can reject the null hypothesis \(H_0: \beta_1 = 0\); however, we cannot do the same for \(\beta_2\).
d.
model3 <- lm(y ~ x1)
summary(model3)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The coefficient estimates are much closer to the true values this time, and we can reject the null hypothesis \(H_0: \beta_1 = 0\).
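One way to see how close the estimate is to the truth is a confidence interval for the x1 coefficient; under the data-generating model (\(\beta_1 = 2\)) we would expect it to cover 2. A minimal sketch:
# 95% confidence interval for the x1 coefficient in the y ~ x1 fit
confint(model3, "x1")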
e.
model4 <- lm(y ~ x2)
summary(model4)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
f. Despite appearing to, the results do not contradict one another. Because x1 and x2 are highly correlated, collinearity masks the importance of x2 when both predictors are included in the model; on its own, x2 is a significant predictor of y.
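The degree of collinearity can be quantified with a variance inflation factor, computed by hand here so no extra package is needed; a minimal sketch:
# VIF for x2: 1 / (1 - R^2) from regressing x2 on x1; with only two
# predictors the same value applies to x1 as well
r2_x2_on_x1 <- summary(lm(x2 ~ x1))$r.squared
1 / (1 - r2_x2_on_x1)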
g.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
model5 <- lm(y ~ x1 + x2)
model6 <- lm(y ~ x1)
model7 <- lm(y ~ x2)
summary(model5)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(model6)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(model7)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
For the first model, which includes both predictors (\(x_1\) and \(x_2\)), the coefficient estimates change substantially compared with the original fit: \(\hat{\beta}_2\) is now the statistically significant coefficient rather than \(\hat{\beta}_1\). In the models that include only one of the predictors, the significance of each variable remains as in the original fits, but the coefficient estimates shift because of the new observation.
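A compact way to see these shifts is to put the old and new coefficient estimates side by side, using the model objects already fitted above; a minimal sketch:
# Coefficients before vs. after adding the mismeasured observation
rbind(before = coef(model2), after = coef(model5)) # y ~ x1 + x2
rbind(before = coef(model3), after = coef(model6)) # y ~ x1
rbind(before = coef(model4), after = coef(model7)) # y ~ x2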
par(mfrow=c(2, 2))
plot(model5)
par(mfrow=c(2,2))
plot(model7)
The mismeasured observation is a high-leverage point in both the first and third models, which means it has a strong influence on the estimated regression coefficients.
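The leverage of the added point can be checked directly; it is observation 101 after the append above. A minimal sketch:
# Hat value of the new observation versus the average-leverage benchmark
hatvalues(model5)[101]
(2 + 1) / length(y) # (p + 1)/n for the model with two predictors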
plot(predict(model5), rstudent(model5))
plot(predict(model6), rstudent(model6))
plot(predict(model7), rstudent(model7))
In the second model, which uses \(x_1\) as the only predictor, the new observation is flagged as an outlier: its studentized residual exceeds the rule-of-thumb cutoff of 3.
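This can be confirmed by extracting the studentized residual of the added observation from each refit; a minimal sketch:
# Studentized residual of observation 101 in each model;
# absolute values above 3 are commonly flagged as outliers
rstudent(model5)[101]
rstudent(model6)[101]
rstudent(model7)[101]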