This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
data(Auto)
model <- lm(mpg ~ horsepower, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Is there a relationship between the predictor and the response? Ans: Yes, there is a significant relationship between horsepower and mpg. The p-value for the horsepower coefficient is extremely small (< 2e-16), so the relationship is highly unlikely to be due to random chance.
How strong is the relationship between the predictor and the response? Ans: The Multiple R-squared value is 0.6059, which means that approximately 60.59% of the variability in mpg can be explained by the linear relationship with horsepower.
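As a quick supplementary check (not required by the exercise), in simple linear regression the R-squared equals the squared correlation between the predictor and the response:
# R-squared in simple regression equals the squared correlation between x and y
cor(Auto$horsepower, Auto$mpg)^2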
Is the relationship between the predictor and the response positive or negative? Ans: The coefficient for horsepower is -0.157845. Since it’s negative, there is a negative relationship between horsepower and mpg. As horsepower increases, mpg tends to decrease.
What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals? Ans: We can use the predict() function with a new data frame (here called subset) containing a column named horsepower set to 98:
subset <- data.frame(horsepower = 98)
predicted_value <- predict(model, newdata = subset, interval = "confidence", level = 0.95)
print(predicted_value)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
So the predicted mpg for a car with 98 horsepower is approximately 24.47, and we are 95% confident that the mean mpg for such cars lies between 23.97 and 24.96. This interval reflects the uncertainty in the estimated mean response at horsepower = 98 under the linear regression model.
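The question also asks for the 95% prediction interval, which can be obtained the same way by changing the interval argument (a supplementary sketch; output not reproduced here). The prediction interval is wider because it accounts for the variability of an individual car, not just the uncertainty in the mean:
# 95% prediction interval for an individual car with horsepower = 98
predict(model, newdata = subset, interval = "prediction", level = 0.95)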
# Assuming 'Auto' is your data frame
plot(Auto$horsepower, Auto$mpg, xlab = "Horsepower", ylab = "MPG", main = "Scatterplot of Horsepower vs MPG")
abline(model, col = "red")
#8c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
plot(model)
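The four default diagnostic plots are easier to compare side by side; a minimal layout tweak using base graphics (optional, not part of the original chunk):
# Display all four diagnostic plots in a single 2 x 2 layout
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # reset the plotting layout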
data("Carseats")
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Ans: Interpretation of each coefficient: Intercept (13.043469): The estimated intercept is 13.043469. It represents the estimated average ‘Sales’ when Price is zero and the store is neither urban nor in the US, so it has little direct practical interpretation in this context.
Price (-0.054459): The estimated coefficient for ‘Price’ is -0.054459. It suggests that, on average, for a one-unit increase in ‘Price’, the ‘Sales’ are expected to decrease by 0.054459 units, holding other variables constant.
UrbanYes (-0.021916): The estimated coefficient for ‘UrbanYes’ is -0.021916. Since ‘Urban’ is a dummy variable (1 for ‘Yes’, 0 for ‘No’), this coefficient represents the average difference in ‘Sales’ between urban and non-urban areas. However, with a p-value of 0.936, it is not statistically significant, and we cannot conclude that ‘Urban’ has a significant effect on ‘Sales’ in this model.
USYes (1.200573): The estimated coefficient for ‘USYes’ is 1.200573. It represents the average difference in ‘Sales’ between stores in the US (‘Yes’) and not in the US (‘No’). A positive value suggests that, on average, stores in the US have higher sales compared to those not in the US.
Sales = 13.043469 − 0.054459 × Price − 0.021916 × UrbanYes + 1.200573 × USYes + ε. Here, ‘UrbanYes’ and ‘USYes’ are dummy variables equal to 1 when the store is urban and when it is in the US, respectively, and 0 otherwise.
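To illustrate the dummy-variable coding, a quick check with a hypothetical store (Price = 120, urban, in the US; values chosen only for illustration) shows that predict() reproduces the fitted equation above:
# Hypothetical store used only to illustrate the dummy coding
new_store <- data.frame(Price = 120, Urban = "Yes", US = "Yes")
predict(model, newdata = new_store)
# Manual calculation from the fitted equation; the two should agree (≈ 7.69)
13.043469 - 0.054459 * 120 - 0.021916 * 1 + 1.200573 * 1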
For which of the predictors can we reject the null hypothesis H0: βj = 0? Ans: The p-values associated with each coefficient are given in the “Pr(>|t|)” column. If a p-value is less than the chosen significance level (e.g., 0.05), we can reject the null hypothesis for that predictor. In this case, ‘Price’ and ‘USYes’ have very low p-values (< 0.05), so we can reject the null hypothesis for these predictors. ‘UrbanYes’ has a high p-value (0.936), so we cannot reject the null hypothesis for ‘UrbanYes’; ‘Urban’ does not appear to have a significant effect on ‘Sales’ in this model.
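The p-values can also be pulled out programmatically from the coefficient table (a minor convenience, not required by the exercise):
# Extract the Pr(>|t|) column from the coefficient table
summary(model)$coefficients[, "Pr(>|t|)"]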
Ans: Since there is evidence of association with the outcome for ‘Price’ and ‘USYes’, we fit a smaller model using only these predictors:
# Fitting a smaller model
smaller_model <- lm(Sales ~ Price + US, data = Carseats)
summary(smaller_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
model (Sales ~ Price + Urban + US): Residual standard error: 2.472, Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335, F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
smaller_model (Sales ~ Price + US): Residual standard error: 2.469, Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354, F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Comparison: Residual Standard Error (RSE): The smaller model has a slightly lower RSE (2.469 vs 2.472), so dropping ‘Urban’ does not hurt the fit. The RSE is the average amount by which the response deviates from the fitted values.
R-squared: Both models have essentially the same R-squared (0.2393), meaning they explain the same proportion of the variability in ‘Sales’; the ‘Urban’ term adds nothing.
Adjusted R-squared: The smaller model has a slightly higher adjusted R-squared (0.2354 vs 0.2335), because it achieves the same fit with one fewer predictor.
F-statistic: Both overall F-tests are highly significant (41.52 and 62.43, p < 2.2e-16), so each model fits better than an intercept-only model.
Conclusion: The two models fit the data about equally well, and the smaller model (Price and US only) is preferable since it matches the full model’s fit with one fewer, non-significant predictor. Note, however, that both models explain only about 24% of the variability in ‘Sales’, so neither fit is particularly strong.
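Because the smaller model is nested in the full model, the two can also be compared directly with a partial F-test (a supplementary check; output not shown):
# Partial F-test comparing the nested models
anova(smaller_model, model)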
conf_intervals <- confint(smaller_model, level = 0.95)
# Print the confidence intervals
print(conf_intervals)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(smaller_model)
Based on the diagnostic plots, there appears to be one outlying observation in the smaller model.
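As a supplementary numeric check (not part of the original diagnostics), studentized residuals and hat values can be used to flag outliers and high-leverage points directly:
# Observations with studentized residuals beyond |3| are potential outliers
which(abs(rstudent(smaller_model)) > 3)
# Observations with hat values well above the average hat value have high leverage
which(hatvalues(smaller_model) > 2 * mean(hatvalues(smaller_model)))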
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
#14a)
The form of the linear model can be written as \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\), where \(\beta_0\) is the intercept, \(\beta_1\) is the coefficient for x1, \(\beta_2\) is the coefficient for x2, and \(\varepsilon\) is the error term. In this case the regression coefficients are \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\), so the linear model is \(y = 2 + 2 x_1 + 0.3 x_2 + \varepsilon\).
correlation_x1_x2 <- cor(x1, x2)
print(paste("Correlation between x1 and x2:", correlation_x1_x2))
## [1] "Correlation between x1 and x2: 0.835121242463113"
# Create scatterplot
plot(x1, x2, main = "Scatterplot of x1 and x2", xlab = "x1", ylab = "x2")
# Fit a least squares regression model
model <- lm(y ~ x1 + x2)
# Print the summary of the model
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimated intercept \(\hat{\beta}_0\) is 2.1305. The estimated coefficient for x1, \(\hat{\beta}_1 = 1.4396\), suggests that, on average, a one-unit increase in x1 is associated with an increase of about 1.44 in y, holding x2 constant. The estimated coefficient for x2, \(\hat{\beta}_2 = 1.0097\), suggests that a one-unit increase in x2 is associated with an increase of about 1.01 in y, holding x1 constant. Compared with the true values \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\), the intercept is estimated reasonably well, but \(\hat{\beta}_1\) understates the true \(\beta_1\) and \(\hat{\beta}_2\) greatly overstates the true \(\beta_2\): the collinearity between x1 and x2 inflates the standard errors and makes the individual estimates unstable.
Hypothesis tests: The p-value for x1 is 0.0487, just below 0.05, so we can (barely) reject the null hypothesis H0: β1 = 0 and conclude there is some evidence that x1 is associated with y. The p-value for x2 is 0.3754, which is greater than 0.05, so we cannot reject H0: β2 = 0; there is not enough evidence that x2 is associated with y once x1 is in the model.
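As a supplementary illustration of the collinearity problem (not part of the original answer), the variance inflation factor for x1 can be computed by hand in base R; with only two predictors it equals 1 / (1 − R²) from regressing x1 on x2:
# Hand-computed VIF for x1 (equal to the VIF for x2 when there are only two predictors)
r2_x1_on_x2 <- summary(lm(x1 ~ x2))$r.squared
1 / (1 - r2_x1_on_x2)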
# Fitting a least squares regression model using only x1
model_x1 <- lm(y ~ x1)
summary(model_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
# Interpretation:
The estimated intercept \(\hat{\beta}_0\) is 2.1124, and the estimated coefficient for x1 is \(\hat{\beta}_1 = 1.9759\), close to the true value of 2. This suggests that, on average, a one-unit increase in x1 is associated with an increase of about 1.98 in y.
#Hypothesis Test:
The p-value for x1 is 2.66e-06, which is less than 0.05. Therefore, you can reject the null hypothesis (H0: β1 = 0) and conclude that there is evidence that x1 is associated with y.
# Fitting a least squares regression model using only x2
model_x2 <- lm(y ~ x2)
# Printing the summary of the model
summary(model_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
# Interpretation:
The estimated intercept \(\hat{\beta}_0\) is 2.3899, and the estimated coefficient for x2 is \(\hat{\beta}_1 = 2.8996\). This suggests that, on average, a one-unit increase in x2 is associated with an increase of about 2.90 in y, far above the true coefficient of 0.3, because x2 is standing in for the omitted (and highly correlated) x1.
# Hypothesis Test: The p-value for x2 is 1.37e-05, which is less than 0.05, so we can reject the null hypothesis that the coefficient on x2 is zero and conclude there is evidence that x2 is associated with y.
No, they do not contradict each other. Individually, x1 and x2 are each capable of explaining much of the variation in y; however, because they are highly correlated (r ≈ 0.84), it is very difficult to tease apart their separate contributions when both are included in the same model.
# Add the new observation to the data
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
# Refit the linear models
model_c <- lm(y ~ x1 + x2)
model_d <- lm(y ~ x1)
model_e <- lm(y ~ x2)
summary(model_c)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(model_d)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(model_e)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
Model with both predictors \(x_1\) and \(x_2\):
The new observation changes the coefficient estimates substantially: \(\hat{\beta}_1\) drops from 1.44 to 0.54 and is no longer significant (p = 0.365), while \(\hat{\beta}_2\) rises from 1.01 to 2.51 and becomes significant (p = 0.006). The residual standard error increases slightly (1.056 to 1.075).
Because the pair (x1 = 0.1, x2 = 0.8) lies far from the joint distribution of the other points, the observation is a high-leverage point in this model, although its residual is not unusually large, so it is not an outlier.
Model with only predictor \(x_1\):
The new observation increases the residual standard error (1.055 to 1.111) and weakens the evidence for \(x_1\) (the p-value rises from 2.66e-06 to 4.29e-05).
Here x1 = 0.1 is well within the range of the other x1 values, so the point does not have high leverage, but its response value of 6 is far from the fitted line, making it an outlier.
Model with only predictor \(x_2\):
The new observation barely changes the residual standard error (1.072 to 1.074) and strengthens the evidence for \(x_2\) (the p-value falls from 1.37e-05 to 1.25e-06).
Because x2 = 0.8 lies beyond the range of the other x2 values, the point has high leverage in this model, but its residual is not extreme, so it is not an outlier.
To summarize: in the first model (with both predictors), the new point has very high leverage (it is an outlier with respect to the joint distribution of x1 and x2) but is not an outlier in terms of its residual. In the model that includes only x1, it is an outlier but does not have high leverage. In the model that includes only x2, it has high leverage but is not an outlier. The scatterplot of x1 and x2 makes this pattern easy to see.
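As a supplementary check (not in the original write-up), the leverage and studentized residual of the added point, which is observation 101, can be inspected directly in each refit:
# Leverage and studentized residual of the added observation (row 101) in each model
sapply(list(both = model_c, x1_only = model_d, x2_only = model_e),
       function(m) c(leverage = unname(hatvalues(m)[101]),
                     rstudent = unname(rstudent(m)[101])))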