dataset <- read.csv("/Users/nasase/Auto_Final.csv")
head(dataset)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
#Checking structure of dataset
str(dataset)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130" "165" "150" "150" ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
#Convert Horsepower to Numeric
dataset$horsepower <- as.numeric(as.character(dataset$horsepower))
## Warning: NAs introduced by coercion
#Remove NAs
dataset <- subset(dataset, !is.na(horsepower))
#Create Model
model <- lm(mpg ~ horsepower, data = dataset)
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i.) Yes, the p-value for the slope is very small, indicating a statistically significant relationship between horsepower and mpg.
ii.) The R-squared of 0.6059 suggests that about 60.6% of the variation in mpg is explained by horsepower alone, which is a reasonably strong relationship for a single predictor.
iii.) The slope for horsepower is -0.1578, which is negative; hence, as horsepower increases, mpg tends to decrease.
iv.)
-Predicted mpg at 98 horsepower: 24.45
-Confidence Interval: Gives the plausible range for the average mpg of all cars with 98 horsepower.
-Prediction Interval: Gives the plausible range for an individual car's mpg at 98 horsepower. The prediction interval is typically wider because it must account for both the uncertainty in the regression line and the variability of individual cars around that line.
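The numbers above can be reproduced with predict(); a minimal sketch, assuming the model object fitted earlier (the new_car data frame name is mine):
#Sketch: prediction plus both intervals for a 98-horsepower car (new_car is a hypothetical name)
new_car <- data.frame(horsepower = 98)
predict(model, newdata = new_car, interval = "confidence")
predict(model, newdata = new_car, interval = "prediction")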
plot(dataset$horsepower, dataset$mpg,
xlab = "Horsepower", ylab = "MPG",
main = "MPG vs. Horsepower")
abline(model, col = "red", lwd = 2)
plot(model)
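The plot() call on an lm object draws the four diagnostic panels one after another; a small sketch (my addition, not part of the original code) shows how they could be viewed in a single 2x2 layout:
#Sketch: show all four diagnostic plots together, then restore the default layout
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))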
Residuals vs. Fitted: Shows some curvature, suggesting non-linearity, and the spread of the residuals appears to grow as fitted values increase.
QQ Plot: The residuals look roughly normal for the most part, though the tails deviate slightly at both ends.
Scale-Location: Echoes the residuals plot, indicating that the spread of the residuals increases with larger fitted values.
Residuals vs. Leverage: Nothing to note besides the presence of some outliers.
10.)
library(ISLR2)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
# Fit the model
model1 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b.)
-Intercept: Serves as a baseline or reference; it represents the average predicted sales when Price = 0, Urban = No, and US = No.
-Price: The coefficient suggests that if Price goes up by one dollar, sales decrease by about 0.054 units, holding the other predictors fixed.
-UrbanYes: Represents the difference in sales between urban and non-urban stores; it is not statistically significant, suggesting this factor has little effect on sales.
-USYes: Represents the difference in sales between US and non-US stores; it is significant, suggesting this factor does affect sales. When all else is equal, stores in the US sell about 1.2 more units than stores outside the US.
-Price: Highly significant, with a very small p-value (< 2e-16).
-UrbanYes: Not significant, since the p-value (0.936) is close to one.
-USYes: Significant (p = 4.86e-06).
We can reject the null hypothesis for Price and USYes, but not for UrbanYes. The fitted equation implied by these coefficients is written out below.
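As a sketch (my own addition, reading the estimates off summary(model1) above), the fitted model in equation form is roughly Sales = 13.04 - 0.054*Price - 0.022*UrbanYes + 1.20*USYes, where UrbanYes and USYes are 1 for "Yes" and 0 for "No". The coefficients can be pulled out and the equation evaluated directly:
#Sketch: coefficients behind the equation, and the equation evaluated for a hypothetical US, urban store with Price = 120
coef(model1)
predict(model1, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))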
#Model with UrbanYes removed
model2 <- lm(Sales ~ Price + US, data = Carseats)
summary(model2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The R-squared values are nearly identical, suggesting that removing the insignificant variable did not hurt the fit. The two models fit the data similarly, but the second model does so with fewer variables, which makes it preferable.
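A formal way to compare the two nested fits (not part of the original output, but a standard check) is an F-test via anova():
#Sketch: F-test comparing the reduced model (model2) to the full model (model1)
anova(model2, model1)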
g.)
confint(model2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
h.)
plot(model2)
-There appear to be some outliers present, as shown in the residual plots.
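To back up the visual impression numerically, a minimal sketch (my addition) uses studentized residuals and hat values; the cutoffs are common rules of thumb, not part of the original analysis:
#Sketch: flag potential outliers (|studentized residual| > 3) and high-leverage points (leverage > 2*(p+1)/n)
which(abs(rstudent(model2)) > 3)
which(hatvalues(model2) > 2 * length(coef(model2)) / nrow(Carseats))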
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
cor(x1,x2)
## [1] 0.8351212
plot(x1, x2,
xlab = "x1",
ylab = "x2",
main = "Scatterplot of x1 vs. x2")
fit_models <- lm(y ~ x1 + x2)
summary(fit_models)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
-The null hypotheses are that the coefficients on x1 and x2 are each zero. x1's slope estimate is 1.44 with a p-value of 0.0487, just below 0.05, so it is (marginally) significant and we reject the null for x1. x2's p-value (0.375) is not low enough, so x2 is not significant and we fail to reject the null for x2.
fit_x1 <- lm(y ~ x1)
summary(fit_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
-We reject the null based on the p-value, suggesting x1 does influence y.
fit_x2 <- lm(y ~ x2)
summary(fit_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
-We reject the null based on the p-value, suggesting x2 does influence y.
-No, the results do not contradict each other. x1 and x2 are strongly correlated, so one of them can lose significance when both are present in the model. That is why x2 can be insignificant when both predictors are included yet significant when it is the only predictor the model considers.
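The collinearity can be quantified directly; a minimal sketch (my addition), using the fact that the variance inflation factor for either predictor in the two-variable model is 1/(1 - R^2) from regressing one predictor on the other:
#Sketch: variance inflation factor implied by cor(x1, x2) = 0.835 (roughly 3.3)
1 / (1 - summary(lm(x2 ~ x1))$r.squared)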
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
# Model with both x1 and x2
model_both_new <- lm(y ~ x1 + x2)
summary(model_both_new)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
# (d) Model with only x1
model_x1_new <- lm(y ~ x1)
summary(model_x1_new)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
# (e) Model with only x2
model_x2_new <- lm(y ~ x2)
summary(model_x2_new)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
-In the model with both predictors, the coefficient estimates shift noticeably: the x1 slope drops (1.44 to 0.54) and loses significance, while the x2 slope rises (1.01 to 2.51) and becomes significant; R-squared rises slightly. The new point pulls the fit in a different direction.
-In the model with only x1, the slope drops (1.98 to 1.77), the residual standard error grows, and R-squared falls, suggesting the new point strongly influences the fitted line.
-In the model with only x2, both the slope and the R-squared value rise.
All three models' parameter estimates and fit statistics change because the new point lies somewhat far from the original data.
-x1 and x2 model:
Outlier: Yes
High Leverage: Yes
-The new point is quite unusual compared to the original data, giving it high leverage, and it has a large residual since its y value sits well above what the original model predicts. The reasoning is similar for the models below; I label them "moderately" because the new point is not as far outside the original range of those predictors as it is for this model. A numeric check is sketched after this list.
-x1 model:
Outlier: Yes
High Leverage: Moderately Yes
-x2 model:
Outlier: Moderately Yes
High Leverage: Moderately Yes
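As referenced above, a minimal sketch (my addition, not in the original output) of a numeric check on the new observation, which is row 101 in each refit; hatvalues() gives leverage and rstudent() gives studentized residuals:
#Sketch: leverage and studentized residual for the added point (observation 101) in each model
hatvalues(model_both_new)[101]; rstudent(model_both_new)[101]
hatvalues(model_x1_new)[101]; rstudent(model_x1_new)[101]
hatvalues(model_x2_new)[101]; rstudent(model_x2_new)[101]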