library(ISLR)  
data(Auto)
model <- lm(mpg ~ horsepower, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
new_data <- data.frame(horsepower = 98)
prediction <- predict(model, newdata = new_data, interval = "confidence")  
prediction_pred <- predict(model, newdata = new_data, interval = "prediction")  
print("Predicted mpg with 95% Confidence Interval:")
## [1] "Predicted mpg with 95% Confidence Interval:"
print(prediction)
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
print("Predicted mpg with 95% Prediction Interval:")
## [1] "Predicted mpg with 95% Prediction Interval:"
print(prediction_pred)
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
i) The p-value for horsepower (< 2e-16) is extremely small, indicating a strong statistical relationship between horsepower and mpg. Since the p-value is far below 0.05, we reject the null hypothesis (H0: there is no relationship between horsepower and mpg).

ii) The Multiple R-squared value is 0.6059, meaning about 60.59% of the variation in mpg is explained by horsepower. This indicates a moderate to strong relationship: horsepower explains a substantial portion of the variation in mpg, but not all of it.

iii) The coefficient for horsepower is -0.157845, meaning that as horsepower increases, mpg decreases. The relationship is negative: cars with higher horsepower tend to have lower fuel efficiency.

iv) Confidence Interval (CI): we are 95% confident that the average mpg for cars with 98 horsepower falls between 23.97 and 24.96. Prediction Interval (PI): a single car with 98 horsepower is likely to have an mpg between 14.81 and 34.12. The prediction interval is much wider because it accounts for the variability of individual cars around the mean, not just the uncertainty in the estimated mean; a by-hand check follows below.
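
As a rough check on where these numbers come from, here is a minimal by-hand sketch of the two intervals for a car with 98 horsepower. It uses only the fitted model above; the helper names (x0_vec, fit0, se_mean, se_pred) are introduced here for illustration. The prediction interval adds the residual variance to the uncertainty of the estimated mean, which is why it is wider.

x0_vec <- c(1, 98)                                   # intercept term and horsepower value
fit0 <- sum(coef(model) * x0_vec)                    # point estimate (~24.47)
X <- model.matrix(model)                             # design matrix of the fitted model
sigma2 <- summary(model)$sigma^2                     # residual variance
se_mean <- sqrt(sigma2 * as.numeric(t(x0_vec) %*% solve(t(X) %*% X) %*% x0_vec))
se_pred <- sqrt(sigma2 + se_mean^2)                  # prediction SE also includes sigma^2
tcrit <- qt(0.975, df = df.residual(model))
fit0 + c(-1, 1) * tcrit * se_mean                    # 95% confidence interval
fit0 + c(-1, 1) * tcrit * se_pred                    # 95% prediction interval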

4 b)

# Scatter plot of mpg vs horsepower
plot(Auto$horsepower, Auto$mpg, main = "MPG vs Horsepower",
     xlab = "Horsepower", ylab = "MPG", pch = 19, col = "blue")

# Add the regression line
abline(model, col = "red", lwd = 2)

4 c)

# Generate diagnostic plots
par(mfrow = c(2, 2))  # Arrange plots in a 2x2 grid
plot(model)

The red regression line (from abline()) shows the negative trend—as horsepower increases, mpg decreases.

#10 a)

data(Carseats)

# multiple regression model
model_a <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_a)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

#b

From summary(model_a): the intercept (13.04) is the expected Sales (in thousands of units) when Price = 0, Urban = No, and US = No. The Price coefficient (-0.0545) means that, holding the other predictors fixed, a one-dollar increase in price is associated with a decrease of about 0.054 thousand units in Sales. The Urban coefficient (-0.022) shows how Sales differ between urban and non-urban locations; its large p-value (0.936) suggests no real difference. The US coefficient (1.20) indicates that US stores sell about 1.2 thousand more units, on average, than stores outside the US.
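
If only the numbers are needed, the estimates can be pulled directly from the fitted object rather than read off the printed summary:

coef(model_a)                        # named vector of the four estimates
coef(summary(model_a))               # full coefficient table (estimates, SEs, t values, p-values)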

#c

Since Urban and US are categorical, R automatically encodes them as dummy variables (UrbanYes and USYes). The fitted model is

Sales = β0 + β1 × Price + β2 × UrbanYes + β3 × USYes

where UrbanYes = 1 if Urban = "Yes" and 0 otherwise, and USYes = 1 if US = "Yes" and 0 otherwise. If Urban = No and US = No, the model simplifies to

Sales = β0 + β1 × Price
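
To confirm how R coded the factors, the design matrix and contrasts can be inspected directly (an optional check, not part of the original analysis):

head(model.matrix(model_a))          # columns UrbanYes and USYes are 0/1 dummies
contrasts(Carseats$Urban)            # treatment coding: No = 0, Yes = 1
contrasts(Carseats$US)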

#d

Based on the p-values in the coefficient table, we can reject the null hypothesis H0: βj = 0 for Price (p < 2e-16) and US (p ≈ 4.9e-06), but not for Urban (p = 0.936).
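
The p-values used above can be extracted programmatically from the summary; pvals is a helper name introduced here for illustration:

pvals <- coef(summary(model_a))[, "Pr(>|t|)"]
pvals
pvals < 0.05                         # TRUE for Price and USYes, FALSE for UrbanYes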

#e

model_e <- lm(Sales ~ Price + US, data = Carseats)
summary(model_e)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Both predictors (Price and US) are statistically significant (p-values < 0.001), meaning they strongly affect Sales. Price has a negative effect: higher prices lead to lower sales. US has a positive effect: US stores sell significantly more than non-US stores. The model explains about 24% of the variance in Sales (modest, with room to improve), and the residual standard error (~2.47) shows that substantial unexplained variation remains.
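
A quick illustration of what the US coefficient means in model_e: at an arbitrary price of $120 (a hypothetical value chosen only for illustration), the predicted Sales for a US and a non-US store differ by exactly the USYes estimate, about 1.2 thousand units.

new_stores <- data.frame(Price = 120,
                         US = factor(c("No", "Yes"), levels = levels(Carseats$US)))
predict(model_e, newdata = new_stores)   # the two fitted values differ by ~1.20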

#f

anova(model_e, model_a)  
## Analysis of Variance Table
## 
## Model 1: Sales ~ Price + US
## Model 2: Sales ~ Price + Urban + US
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    397 2420.9                           
## 2    396 2420.8  1   0.03979 0.0065 0.9357

The p-value is 0.9357, which is much greater than 0.05. This means adding Urban does not significantly improve the model. Since Urban is not significant, we prefer the simpler model (Sales ~ Price + US). Removing unnecessary predictors improves interpretability and avoids overfitting.
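
Consistent with the ANOVA result, the usual fit summaries barely change when Urban is dropped; this small comparison (not part of the original output) shows the adjusted R-squared even improves slightly for the simpler model:

summary(model_a)$adj.r.squared       # with Urban (0.2335)
summary(model_e)$adj.r.squared       # without Urban (0.2354)
AIC(model_a, model_e)                # lower AIC is better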

#g

confint(model_e)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Neither 95% confidence interval contains zero: Price has a strong negative effect on Sales (interval from about -0.065 to -0.044), and being a US store significantly increases Sales (interval from about 0.69 to 1.71). Both predictors are useful, and we should keep them in the model.

#h

par(mfrow = c(2, 2))  # 2x2 grid of plots
plot(model_e)

The model is reasonable: no strong non-linearity, approximately normal residuals, and no severe heteroscedasticity. Residuals vs Fitted: no major issues, though mild heteroscedasticity could be checked further. Normal Q-Q: residuals are approximately normal, with minor deviations at the tails. Scale-Location: no strong signs of heteroscedasticity, though variance increases slightly at higher fitted values. Residuals vs Leverage: no highly influential points, but observation 368 might need further checking.
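
To follow up numerically on points flagged in the leverage plot (such as observation 368), a quick sketch using base R diagnostics; lev and stud are helper names introduced here:

lev <- hatvalues(model_e)
stud <- rstudent(model_e)
head(sort(lev, decreasing = TRUE))   # observations with the largest leverage
which(abs(stud) > 3)                 # any unusually large studentized residuals?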

#14 a

set.seed(1)
x1 <- runif(100)  
x2 <- 0.5 * x1 + rnorm(100) / 10  
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)  

The true data-generating model is

y = 2 + 2 × x1 + 0.3 × x2 + ε

with β0 = 2 (intercept), β1 = 2 (effect of x1), β2 = 0.3 (effect of x2), and ε random noise.

#b

# Compute correlation
cor_x1_x2 <- cor(x1, x2)
print(cor_x1_x2)
## [1] 0.8351212
# Scatterplot of x1 vs x2
plot(x1, x2, main = "Scatterplot of x1 vs x2", xlab = "x1", ylab = "x2", pch = 19, col = "blue")

#c

# Fit multiple linear regression model
model_full <- lm(y ~ x1 + x2)
summary(model_full)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

#d

# Fit simple linear regression with only x1
model_x1 <- lm(y ~ x1)
summary(model_x1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

#e

# Fit simple linear regression with only x2
model_x2 <- lm(y ~ x2)
summary(model_x2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

#f

No, the results of (c) and (e) do not contradict each other. The difference is caused by collinearity: the response depends on two predictors that are themselves highly correlated (correlation ≈ 0.84), so when both are included in the model, the effect of one masks the effect of the other. Collinearity also inflates the standard errors, which is why the standard errors of the coefficients in the x1 + x2 model are larger than in the models with x1 or x2 alone.
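
One way to quantify the collinearity described above is the variance inflation factor (VIF), sketched here by hand so no extra packages are needed (car::vif(model_full) would give the same result); r2_x1_on_x2 is a helper name introduced for illustration.

r2_x1_on_x2 <- summary(lm(x1 ~ x2))$r.squared   # how much of x1 is explained by x2
1 / (1 - r2_x1_on_x2)                           # VIF; well above 1, signalling collinearity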

#g

model_full_new <- lm(y ~ x1 + x2)
summary(model_full_new)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
model_x1_new <- lm(y ~ x1)
summary(model_x1_new)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06
model_x2_new <- lm(y ~ x2)
summary(model_x2_new)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05
par(mfrow = c(2, 2))  # Diagnostic plots
plot(model_full_new)   # For Full Model

par(mfrow = c(2, 2))
plot(model_x1_new)     # For Reduced Model

par(mfrow = c(2, 2))
plot(model_x2_new)     # For Alternative Model