library("ISLR")

##a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.

lm.fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

#i) Is there a relationship between the predictor and the response?

  1. There is a statistically significant relationship between the predictor (horsepower) and the response (mpg): the p-value for the F-statistic (< 2.2e-16) is far below the 0.05 significance level.

  2. The relationship between horsepower and mpg is moderately strong, as indicated by the adjusted R-squared value of 0.6049. This means that approximately 60.49% of the variability in mpg can be explained by horsepower.

  3. The relationship between horsepower and mpg is negative. This is evident from the negative coefficient estimate (-0.157845), which indicates that as horsepower increases, mpg decreases.

#ii) How strong is the relationship between the predictor and the response? The R² value of 0.6059 indicates that about 61% of the variation in the response (mpg) is explained by the predictor (horsepower).
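
These quantities can also be pulled directly from the fitted object rather than read off the printed summary (a small sketch, reusing the lm.fit object from part (a)):

# R-squared and adjusted R-squared stored in the summary object
fit.sum <- summary(lm.fit)
fit.sum$r.squared
fit.sum$adj.r.squared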

#iii) Is the relationship between the predictor and the response positive or negative? The regression coefficient for ‘horsepower’ is negative. Hence, the relationship is negative.

#iv) What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals? The 95% confidence interval (note: the code below evaluates the fit at a horsepower of 85):

predict(lm.fit, data.frame(horsepower = c(85)), interval ="confidence")
##        fit    lwr      upr
## 1 26.51906 25.973 27.06512

And, the 95% prediction interval

predict(lm.fit, data.frame(horsepower = c(85)), interval ="prediction")
##        fit      lwr      upr
## 1 26.51906 16.85857 36.17954

As expected, the prediction interval is wider than the confidence interval, since it also accounts for the variability of an individual observation around the regression line.
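
Note that the exercise asks for a horsepower of 98, while the calls above were evaluated at 85. The corresponding calls for 98 are sketched below (output not shown; from the fitted coefficients the point estimate is roughly 39.94 − 0.158 × 98 ≈ 24.5 mpg):

# Same intervals evaluated at horsepower = 98, as the exercise asks
predict(lm.fit, data.frame(horsepower = c(98)), interval = "confidence")
predict(lm.fit, data.frame(horsepower = c(98)), interval = "prediction")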

##b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

# Scatter plot of mpg against horsepower
plot(Auto$horsepower, Auto$mpg, xlab = "Horsepower", ylab = "MPG", main = "Scatter Plot of MPG vs. Horsepower")

# Add least squares regression line
abline(lm(mpg ~ horsepower, data = Auto), col = "red")

##c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

# Fit the linear regression model
model <- lm(mpg ~ horsepower, data = Auto)

# Produce diagnostic plots
par(mfrow = c(2, 2))  # Set up a 2x2 grid for the plots
plot(model)

The first plot (residuals vs. fitted values) shows a clear U-shaped pattern, which suggests a non-linear relationship between the response and the predictor. The second plot (normal Q-Q) indicates that the residuals are approximately normally distributed. The third plot (scale-location) suggests the variance of the errors is roughly constant, and the fourth plot (residuals vs. leverage) shows no observations with unusually high leverage.
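
Since the residuals-vs-fitted plot suggests curvature, one natural follow-up (a sketch beyond what the exercise asks, using the same Auto data) is to add a quadratic term in horsepower and re-check the diagnostics:

# Quadratic fit to address the apparent non-linearity
quad.fit <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
par(mfrow = c(2, 2))
plot(quad.fit)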

##10) This question should be answered using the Carseats data set.

#(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

library("ISLR")
?Carseats
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
lm.fit = lm(Sales ~ Price+Urban+US, data= Carseats)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

#(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Here are the interpretations of each coefficient in the multiple regression model:

Intercept (13.043469): When all other predictors are zero (i.e., when the Price is zero and the store is not Urban and not in the US), the expected value of Sales is 13.043469 units.

Price (-0.054459): For each one-unit increase in Price, the expected value of Sales decreases by 0.054459 units, holding all other predictors constant.

UrbanYes (-0.021916): This coefficient represents the difference in expected Sales between an Urban store and a non-Urban store, holding all other predictors constant. However, since the p-value is not significant (p = 0.936), we fail to reject the null hypothesis that there is no difference in Sales between Urban and non-Urban stores.

USYes (1.200573): This coefficient represents the difference in expected Sales between a store in the US and a store not in the US, holding all other predictors constant. Since the p-value is significant (p < 0.001), we reject the null hypothesis and conclude that there is a statistically significant difference in Sales between stores in the US and stores not in the US. Specifically, the expected Sales in the US is 1.200573 units higher than in stores not in the US, on average.
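
The dummy coding that R uses for the qualitative predictors can be checked directly; contrasts() shows that "Yes" is mapped to 1 and "No" to 0 (a small sketch against the Carseats factors):

# How R encodes the qualitative predictors Urban and US
contrasts(Carseats$Urban)
contrasts(Carseats$US)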

#(c) Write out the model in equation form, being careful to handle the qualitative variables properly. The multiple regression model in equation form is:

Sales = 13.0434689 + (−0.0544588) × Price + (−0.0219162) × UrbanYes + (1.2005727) × USYes + ε

where Sales is the predicted sales, Price is the price of the product, UrbanYes is a dummy variable indicating whether the store is urban (1 for “Yes”, 0 for “No”), USYes is a dummy variable indicating whether the store is located in the US (1 for “Yes”, 0 for “No”), and ε is the error term.

#(d) For which of the predictors can you reject the null hypothesis H0: βj = 0? We can reject the null hypothesis for the “Price” and “US” variables (both p-values are well below 0.001), but not for “Urban” (p = 0.936).
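
The individual t-test p-values behind this conclusion can be extracted from the coefficient table of the fit in (a) (a small sketch, assuming lm.fit from above):

# p-values for each coefficient in the full model
summary(lm.fit)$coefficients[, "Pr(>|t|)"]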

#(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

# Fit the smaller model
smaller_model <- lm(Sales ~ Price + US, data = Carseats)

# Display the summary of the smaller model
summary(smaller_model)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
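
To confirm that dropping Urban does not hurt the fit, the two nested models can also be compared with an F-test (a sketch, assuming lm.fit from (a) and smaller_model from above):

# Partial F-test comparing the smaller model to the full model from (a)
anova(smaller_model, lm.fit)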

#(h) Is there evidence of outliers or high leverage observations in the model from (e)?

# Load necessary libraries
library(car)
## Loading required package: carData
# Fit the model
model <- lm(Sales ~ Price + US, data = Carseats)

# Get standardized residuals and leverage values
residuals <- rstandard(model)
leverage <- hatvalues(model)

# Plot standardized residuals vs fitted values
plot(fitted(model), residuals, 
     xlab = "Fitted Values", ylab = "Standardized Residuals",
     main = "Standardized Residuals vs Fitted Values")
abline(h = c(-2, 0, 2), col = "red", lty = 2)

# Plot leverage values
plot(leverage, residuals, 
     xlab = "Leverage", ylab = "Standardized Residuals",
     main = "Standardized Residuals vs Leverage")
abline(h = c(-2, 0, 2), col = "red", lty = 2)

All standardized residuals appear to be bounded between −3 and 3, so no clear outliers are suggested by the linear regression.
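
High-leverage observations can also be flagged numerically by comparing each hat value to the average leverage (p + 1)/n = 3/400 (a sketch, reusing the leverage vector computed above; the 3x cutoff is just a common rule of thumb):

# Observations with leverage well above the average (p + 1)/n
avg_leverage <- length(coef(model)) / nrow(Carseats)
which(leverage > 3 * avg_leverage)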

##14)

#(a) Perform the following commands in R. The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

set.seed(1)
x1 = runif(100)
x2 = 0.5*x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)

The model has the form y = β0 + β1·x1 + β2·x2 + ε, with true regression coefficients β0 = 2, β1 = 2, and β2 = 0.3.

#(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)

#(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β̂0, β̂1, and β̂2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?

model = lm(y ~ x1 + x2)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

Rounded to three decimal places, the estimates are β̂0 = 2.131, β̂1 = 1.440, and β̂2 = 1.010.

Note that β0 − β̂0 = −0.131, β1 − β̂1 = 0.560, and β2 − β̂2 = −0.710.

These estimates deviate noticeably from the true parameters, particularly β̂2. Examining the p-values, we can only just reject H0: β1 = 0 at the 5% level (p = 0.0487), and we cannot reject H0: β2 = 0 (p = 0.3754).

#(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?

modelx1 = lm(y ~ x1)
summary(modelx1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

We can confidently reject the null hypothesis H0: β1 = 0 in the simple regression of y on x1 alone (p ≈ 2.66e-06).

#(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?

modelx2 = lm(y ~ x2)
summary(modelx2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

We can confidently reject the null hypothesis H0: β1 = 0 (the coefficient on x2) in the simple regression of y on x2 alone (p ≈ 1.37e-05).

#(f) Do the results obtained in (c)-(e) contradict each other? Explain your answer.

It is perfectly possible to rewrite y as two different linear equations: y = 2 + (43/20)·X1 + (103/100)·ε and y = 2 + (43/10)·X2 + (3/5)·ε. This suggests that either X1 or X2 might adequately model y on its own. We could even inspect the moment generating functions of all three formulations to confirm that their distributions match, if time permitted.

However, we are still left with the question of why the model that uses both X1 and X2 did not show significance where the other two did. Recall from part (b) that we found a strong correlation between X1 and X2 (about 0.84), as we would expect given how x2 was generated from x1. This indicates collinearity between the variables. Collinearity inflates the standard errors of the coefficient estimates, leading to large p-values. Thus, it is not a contradiction for a model that includes two collinear predictors to yield non-significant coefficient p-values while models using each predictor individually yield significant ones.
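
The degree of collinearity can be quantified with variance inflation factors (a sketch using car::vif, since the car package was already loaded earlier):

# Variance inflation factors for the two-predictor model
vif(lm(y ~ x1 + x2))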

#(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.

set.seed(1)
x1 = runif(100)
x2 = 0.5*x1 + rnorm(100)/10
y = 2 + 2*x1 + 0.3*x2 + rnorm(100)

x1 = c(x1, .1)
x2 = c(x2, .8)
y = c(y, 6)

model = lm(y ~ x1 + x2)
modelx1 = lm(y ~ x1)
modelx2 = lm(y ~ x2)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
plot(model)

summary(modelx1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
plot(modelx1)

summary(modelx2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
plot(modelx2)

yBoth = coefficients(model)[1] + coefficients(model)[2]*x1 + coefficients(model)[3]*x2
yx1 = coefficients(modelx1)[1] + coefficients(modelx1)[2]*x1
yx2 = coefficients(modelx2)[1] + coefficients(modelx2)[2]*x2

upperyboth = mean(yBoth) + 2*sd(yBoth)
loweryboth = mean(yBoth) - 2*sd(yBoth)
print(upperyboth)
## [1] 4.29006
print(loweryboth)
## [1] 2.037906
upperyx1 = mean(yx1) + 2*sd(yx1)
loweryx1 = mean(yx1) - 2*sd(yx1)
print(upperyx1)
## [1] 4.115588
print(loweryx1)
## [1] 2.212378
upperyx2 = mean(yx2) + 2*sd(yx2)
loweryx2 = mean(yx2) - 2*sd(yx2)
print(upperyx2)
## [1] 4.272908
print(loweryx2)
## [1] 2.055057

Observation 101 is an outlier for all three models: each model's fitted values stay below roughly 4.3 (the mean + 2·SD bounds printed above), while the new observation has y = 6. The added point also turns x1 into our weakest predictor: although both single-predictor models still give low p-values, the x1-only model's largest residual (about 3.57) is roughly a point larger than that of the other two models. The new observation is clearly a high-leverage point in the models that include x2, since x2 = 0.8 lies well outside the range of the other x2 values; in the x1-only model it acts mainly as an outlier. In short, observation 101 significantly disrupts only the x1-only model, and in the x1 + x2 model it shifts the apparent significance from x1 to x2.
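
The informal mean ± 2·SD comparison above can be cross-checked with the usual influence measures for observation 101 (a sketch, assuming the three refitted models from this part):

# Studentized residual and leverage of observation 101 in each model
rstudent(model)[101]
hatvalues(model)[101]
rstudent(modelx1)[101]
hatvalues(modelx1)[101]
rstudent(modelx2)[101]
hatvalues(modelx2)[101]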