library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lindia)
## Warning: package 'lindia' was built under R version 4.2.3
library(ggplot2)
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.2.3
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:ISLR2':
##
## Boston
##
## The following object is masked from 'package:dplyr':
##
## select
Initialising the ‘Auto’ dataset as ‘auto’ in case we edit the dataset.
auto<- Auto
Question 8
(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.
model<-lm(mpg~horsepower, data=auto)
model
##
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
##
## Coefficients:
## (Intercept) horsepower
## 39.9359 -0.1578
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Yes, there is a relationship between the predictor (horsepower) and the response (mpg). This is evident from the significant p-value associated with the coefficient of the horsepower variable.
The relationship between horsepower and mpg is moderately strong. The R-squared value of 0.6059 indicates that approximately 60.59% of the variability in mpg is explained by horsepower, and the large t-statistic (-24.49) shows the slope is estimated very precisely.
The relationship between horsepower and mpg is negative. This is evident from the negative coefficient (-0.1578), indicating that as horsepower increases, mpg tends to decrease.
To predict the mpg associated with a horsepower of 98, we can plug into the fitted regression equation:
mpg = Intercept + (Coefficient of horsepower × horsepower)
mpg = 39.9359 + (−0.1578 × 98)
mpg = 24.4715
So, the predicted mpg associated with a horsepower of 98 is approximately 24.47.
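As a quick check, the same number can be computed from the fitted model object rather than by hand (a minimal sketch using the coefficients stored in model):
# Point prediction at horsepower = 98 from the stored coefficients
coef(model)[1] + coef(model)[2] * 98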
95% Confidence Interval
# Confidence interval for the mean mpg at horsepower = 98
predict(model, data.frame(horsepower = 98), interval = "confidence")
This interval is centred on the fitted value of about 24.47 and reflects only the uncertainty in the estimated regression line at horsepower = 98, so it is much narrower than the prediction interval below.
95% Prediction interval
# Approximate prediction bounds: fitted value ± 1.96 × residual standard error,
# ignoring the (small) uncertainty in the fitted line itself
lower_pred<- 24.4715 - (1.96*4.906*sqrt(1+(1/nrow(auto))))
upper_pred<- 24.4715 + (1.96*4.906*sqrt(1+(1/nrow(auto))))
cat("The approximate 95% prediction interval is: (" , lower_pred , ", " , upper_pred , ")")
## The approximate 95% prediction interval is: ( 14.84348 , 34.09952 )
The exact interval from predict(model, data.frame(horsepower = 98), interval = "prediction") is essentially the same, because the leverage term is negligible for a sample of 392 cars. The prediction interval is far wider than the confidence interval because it also covers the irreducible error of an individual car.
(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line
response_variable <- auto$mpg
predictor_variable <- auto$horsepower
# Create a scatter plot
plot(predictor_variable, response_variable,
xlab = "Predictor Variable", ylab = "Response Variable",
main = "Scatter Plot with Least squares Regression Line")
# Fit a linear regression model
lm_model <- lm(response_variable ~ predictor_variable, data = auto)
# Add the regression line
abline(lm_model, col = "red")
The least squares regression line shows that our response variable “mpg” is negatively related to our predictor variable “horsepower”: as horsepower increases, mpg decreases. This makes physical sense because higher horsepower typically means a more powerful engine, and more powerful engines tend to consume more fuel.
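Since ggplot2 is already loaded via the tidyverse, an equivalent figure can be drawn with geom_smooth(); this is just an alternative sketch of the same plot, not required by the exercise:
# ggplot2 version of the scatter plot with the least squares line
ggplot(auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  labs(x = "Horsepower", y = "MPG",
       title = "Scatter plot with least squares regression line")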
(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit
# Plot diagnostic plots
par(mfrow = c(2, 2)) # Set up a 2x2 grid for diagnostic plots
plot(lm_model)
Interpretation: 1. Residuals vs. Fitted: The residuals do not scatter randomly around the horizontal line; they follow a clear U-shaped pattern, which indicates that the relationship between mpg and horsepower is non-linear and that a simple linear fit is not fully adequate.
Normal Q-Q Plot: We notice the points closely follow a diagonal line. This suggests that the residuals are approximately normally distributed, which is a desirable property for many statistical tests. However, there is a slight deviation that results in a curve. This could be caused by outliers or influential points in our data which can be assessed better in the Residuals vs Leverage graph.
Scale-Location: A horizontal line in a scale-location plot indicates relatively constant spread or variance of the residuals across different levels of the predictor variable, which is a desirable property. Since we don’t have a horizontal line, it indicates heteroscedasticity, meaning that the variance of the residuals is not constant, and it may vary systematically with the predictor variable.
Residuals vs. Leverage: We see there are no points outside the dashed line (Cook’s distance), so no single observation is unduly influential on the fit. One point does come close to the boundary, however, so it is worth keeping in mind as a potentially influential high-leverage observation.
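As a side note, the lindia package loaded at the top provides ggplot-based versions of these same diagnostics; a minimal sketch, assuming its gg_diagnose() helper:
# ggplot-style diagnostic plots for the same fit (lindia)
gg_diagnose(lm_model)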
Question 10
Initialising the ‘Carseats’ dataset as ‘car’ in case we edit the dataset.
car<- Carseats
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
model <- lm(Sales ~ Price + Urban + US, data = car)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Intercept: It represents the estimated average Sales (about 13.04 thousand units) when Price = 0 and both qualitative variables are at their baseline levels (Urban = “No”, US = “No”).
Price: For each one-dollar increase in Price, Sales are expected to decrease, on average, by approximately 0.0545 (thousand) units, holding Urban and US fixed, indicating a negative relationship.
UrbanYes: Stores in urban locations are estimated to sell about 0.022 (thousand) units less on average, but the p-value (0.936) is far above 0.05, so Urban is not a significant predictor of Sales in this model.
USYes: Since the coefficient is positive and significant (p < 0.05), it suggests that products sold in the US have, on average, approximately 1.2006 units higher Sales compared to those sold outside the US, after accounting for Price and Urban variables.
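To make the handling of the qualitative predictors concrete, here is a small sketch that predicts Sales for a hypothetical store; the Price of 120 and the “Yes” levels are made-up illustration values, not part of the exercise:
# Hypothetical example: predicted Sales for a store charging 120,
# located in an urban area inside the US
predict(model, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))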
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
The model equation can be written as:
Sales = β0 + β1×Price + β2×UrbanYes + β3×USYes + ε
β0 is the intercept term (13.043469)
β1 is the coefficient for Price (-0.054459)
β2 is the coefficient for UrbanYes (-0.021916)
β3 is the coefficient for USYes (1.200573)
UrbanYes is a binary indicator variable taking the value of 1 if the observation corresponds to an urban location and 0 otherwise.
USYes is a binary indicator variable taking the value of 1 if the observation corresponds to a product sold in the US and 0 otherwise.
ε represents the error term.
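The 0/1 dummy coding can be confirmed directly in R (a sketch; Urban and US are stored as factors, with “No” as the baseline level):
# Dummy coding used by lm(): "No" is the baseline level for both factors
contrasts(car$Urban)
contrasts(car$US)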
(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?
To determine for which predictors we can reject the null hypothesis we need to look at the p-values associated with each coefficient estimate in the model. Typically, if the p-value is less than a chosen significance level (commonly 0.05), we reject the null hypothesis and consider the corresponding predictor to be statistically significant.
We can reject the null hypothesis for Price and USYes because their p-value is much less than 0.05 (p < 0.05). However, the p-value for the coefficient estimate of UrbanYes is greater than 0.05. Therefore, we fail to reject the null hypothesis for UrbanYes, indicating that it is not a statistically significant predictor of Sales in this model.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
Based on the previous question, we found evidence of association with the outcome (Sales) for the predictors Price and USYes. Therefore, we will fit a smaller model that includes only these predictors.
model2<- lm(Sales ~ Price + US, data = car)
summary(model2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = car)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
Comparing models in (a) and (e), we notice that:
(a): Adjusted R-squared = 0.2335; (e): Adjusted R-squared = 0.2354. Both models have similar adjusted R-squared values, indicating that they explain roughly the same amount of variance in the dependent variable (Sales).
(a): Residual standard error = 2.472;(e) Residual standard error = 2.469. Both models have similar values, indicating similar levels of variability not accounted for by the model.
(a): F-statistic = 41.52, p-value < 2.2e-16; (e): F-statistic = 62.43, p-value < 2.2e-16. The F-statistics are not directly comparable because the models contain different numbers of predictors, but both are highly significant.
Overall, the two models fit the data almost identically, explaining only about 24% of the variance in Sales; the smaller model (e) is preferable because it achieves a slightly higher adjusted R-squared with one fewer predictor.
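Because model2 is nested inside the model from (a), the two fits can also be compared directly with an F-test; a sketch (output omitted here):
# Nested-model comparison: does dropping Urban significantly worsen the fit?
anova(model2, model)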
(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
Confidence intervals for the regression coefficients are obtained by applying confint() to the fitted model:
confint(model2)
Equivalently, each interval is estimate ± t(0.975, 397 df) × standard error from the summary above, giving approximately (11.79, 14.27) for the intercept, (−0.0648, −0.0442) for Price, and (0.692, 1.708) for USYes. Neither slope interval contains zero, which is consistent with both Price and US being significant predictors of Sales.
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
# to check for outliers
plot(model2, which = 5)
We see there are no points outside the dashed line (Cook’s distance), so no single observation is unduly influential on the fit. The standardized residuals all lie within roughly ±3 (the largest raw residual, 7.05, is under three residual standard errors), so there is little evidence of outliers. A few points do sit well to the right of the bulk of the data in the leverage direction, above the average leverage of (p + 1)/n = 3/400 ≈ 0.0075, so there is some evidence of high-leverage observations.
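A quick numerical check complements the plot; this sketch counts studentized residuals beyond ±3 (potential outliers) and hat values above twice the average leverage (high-leverage points):
# Potential outliers: studentized residuals larger than 3 in absolute value
sum(abs(rstudent(model2)) > 3)
# High-leverage observations: hat values above twice the mean leverage
sum(hatvalues(model2) > 2 * mean(hatvalues(model2)))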
Question 14
(a) Perform the following commands in R
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The form of the linear model is: y=β0 + β1x1 + β2x2 + ϵ
Where:
y is the dependent variable.
x1 and x2 are the independent variables.
β0, β1 ,and β2 are the regression coefficients for the intercept, x1, and x2 respectively.
ϵ is the error term.
From the code above, the true coefficient values are β0 = 2, β1 = 2, and β2 = 0.3.
model <- lm(y ~ x1 + x2)
# View coefficients
cat("Regression coefficients are:", coefficients(model))
## Regression coefficients are: 2.1305 1.439555 1.009674
(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
Calculate the correlation coefficient.
correlation <- cor(x1, x2)
print(correlation)
## [1] 0.8351212
A correlation coefficient of 0.8351212 indicates a strong positive linear relationship between x1 and x2. This means that as x1 increases, x2 tends to increase as well, and vice versa.
plot(x1, x2, main = "Scatterplot of x1 and x2", xlab = "x1", ylab = "x2")
The scatter plot clearly aligns with the correlation coefficient hence proving the strong positive linear relationship between x1 and x2.
(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are βˆ0, βˆ1, and βˆ2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0 : β1 = 0? How about the null hypothesis H0 : β2 = 0?
We fit the least squares regression model:
model <- lm(y ~ x1 + x2)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
Here βˆ0, βˆ1, and βˆ2 are the estimated coefficients. From the summary of the model, the estimated intercept (βˆ0) is approximately 2.1305, the estimated coefficient for x1 (βˆ1) is approximately 1.4396, and the estimated coefficient for x2 (βˆ2) is approximately 1.0097. The intercept is close to the true β0 = 2, but βˆ1 understates the true β1 = 2 while βˆ2 overstates the true β2 = 0.3. This is not just random noise from rnorm(): because x1 and x2 are strongly correlated, the model cannot cleanly separate their individual effects, and both slope estimates have large standard errors.
The p-value for x1 is 0.0487, which is less than 0.05. Thus, you can reject the null hypothesis H0: β1 = 0 at the 5% significance level, indicating that there is significant evidence to suggest that x1 has a linear relationship with y. The p-value for x2 is 0.3754, which is greater than 0.05. Thus, you cannot reject the null hypothesis H0: β2 = 0 at the 5% significance level, indicating that there is not significant evidence to suggest that x2 has a linear relationship with y.
The multiple R-squared value is 0.2088, indicating that approximately 20.88% of the variance in y is explained by the predictors x1 and x2.
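Collinearity between x1 and x2 is what makes the individual coefficients hard to pin down. With only two predictors, the variance inflation factor can be computed directly from their correlation; a small sketch:
# Variance inflation factor for x1 (identical for x2 with two predictors):
# 1 / (1 - R^2 of x1 regressed on x2) = 1 / (1 - cor(x1, x2)^2)
1 / (1 - cor(x1, x2)^2)
With cor(x1, x2) ≈ 0.835 this is roughly 3.3, meaning the variance of each slope estimate is inflated by about a factor of three relative to uncorrelated predictors.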
(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : β1 = 0?
Fit the least squares regression model with only x1
model_x1 <- lm(y ~ x1)
summary(model_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The estimated intercept (βˆ0) is approximately 2.1124 and the estimated coefficient for x1 (βˆ1) is approximately 1.9759. The p-value associated with x1 is 2.661e-06, which is much smaller than 0.05. Thus, you can reject the null hypothesis H0: β1 = 0 at the 5% significance level.
(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : β1 = 0?
Fit the least squares regression model with only x2
model_x2 <- lm(y ~ x2)
summary(model_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
The estimated intercept (βˆ0) is approximately 2.3899 and the estimated coefficient for x2 (βˆ1) is approximately 2.8996. The p-value associated with x2 is 1.366e-05, which is much smaller than 0.05. Thus, you can reject the null hypothesis H0: β1 = 0 at the 5% significance level.
(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer
No, the results do not contradict each other. Including both x1 and x2 in model (c) showed that x1 is significant while x2 is not, yet each predictor is strongly significant on its own in (d) and (e). The explanation is collinearity: x1 and x2 are highly correlated (r ≈ 0.835), so when both are in the model it is hard to attribute their shared signal to either one, the standard errors of the coefficients are inflated, and x2 (whose true effect is small) loses significance. Used on its own, each predictor simply absorbs the common signal and appears significant.
(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured. Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
# Re-fit the linear models from (c) to (e) using the new data
model_c <- lm(y ~ x1 + x2)
model_d <- lm(y ~ x1)
model_e <- lm(y ~ x2)
# Summary of the models
summary(model_c)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(model_d)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(model_e)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
Model (c) (Using both x1 and x2):
The new observation changes the coefficient of x1 substantially, from approximately 1.4396 to 0.5394, and x1 is no longer statistically significant (p = 0.36458).
The coefficient for x2 (βˆ2) becomes statistically significant, with a p-value of 0.00614, so the roles of the two predictors have effectively swapped.
Regarding the effect of the new observation:
With x1 = 0.1 and x2 = 0.8, the point breaks the usual collinear pattern between x1 and x2, so it is a high-leverage point, and it is influential enough to change which predictor appears significant. Its residual is not extreme relative to the rest of the data, so it is not a clear outlier in this model.
Model (d) (Using only x1):
The new observation changes the coefficient of x1 from approximately 1.9759 to 1.7657, and the largest residual grows from about 2.46 to 3.57.
The coefficient for x1 (βˆ1) remains statistically significant, with a p-value of 4.29e-05.
Regarding the effect of the new observation:
x1 = 0.1 lies well within the range of the existing x1 values, so the point has low leverage, but its response (y = 6) sits far above the fitted line, which is what produces the new maximum residual. In this model the observation is an outlier but not a high-leverage point.
Model (e) (Using only x2):
The new observation changes the coefficient of x2 from approximately 2.8996 to 3.1190.
The coefficient for x2 remains statistically significant, with a p-value of 1.25e-06.
Regarding the effect of the new observation:
x2 = 0.8 lies at the extreme right of the observed x2 values, so the point has high leverage and pulls the fitted line towards itself, which is why the slope increases. Its residual is not unusually large (the residual range barely changes), so in this model the observation is a high-leverage point but not an outlier.
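These conclusions can be checked numerically; this sketch pulls the leverage (hat value) and studentized residual of the added observation, which is the 101st point, out of each refitted model:
# Leverage and studentized residual of the added (101st) observation
# in each of the three refitted models
sapply(list(c = model_c, d = model_d, e = model_e),
       function(m) c(leverage = unname(hatvalues(m)[101]),
                     stud_resid = unname(rstudent(m)[101])))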