dataset <- read.csv("/Users/nasase/Auto_Final.csv")
head(dataset)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
#Checking structure of dataset
str(dataset)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130" "165" "150" "150" ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
#Convert Horsepower to Numeric
dataset$horsepower <- as.numeric(as.character(dataset$horsepower))
## Warning: NAs introduced by coercion
#Remove NAs
dataset <- subset(dataset, !is.na(horsepower))
#Create Model
model <- lm(mpg ~ horsepower, data = dataset)
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i.) Yes, the p-value for the slope is very small, indicating a statistically significant relationship between horsepower and mpg.
ii.) The R-squared of 0.6059 suggests that about 60.6% of the variation in mpg is explained by horsepower alone, which is a reasonably strong relationship for a single predictor.
iii.) The slope for horsepower is -0.1578, which is negative; hence, as horsepower increases, mpg tends to decrease.
iv.)
-Predicted mpg at 98 horsepower: 24.45
-Confidence Interval: Gives the plausible range for the average mpg of all cars with 98 horsepower.
-Prediction Interval: Gives the plausible range for an individual car's mpg at 98 horsepower. The prediction interval is typically wider because it must account for both the uncertainty in the regression line and the variability of individual cars around that line.
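The numbers above can be reproduced with predict(); a minimal sketch, assuming the model object fitted earlier (the new_car data frame name is mine):
#Sketch: prediction plus both intervals for a 98-horsepower car (new_car is a hypothetical name)
new_car <- data.frame(horsepower = 98)
predict(model, newdata = new_car, interval = "confidence")
predict(model, newdata = new_car, interval = "prediction")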
plot(dataset$horsepower, dataset$mpg,
xlab = "Horsepower", ylab = "MPG",
main = "MPG vs. Horsepower")
abline(model, col = "red", lwd = 2)
plot(model)
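The plot() call on an lm object draws the four diagnostic panels one after another; a small sketch (my addition, not part of the original code) shows how they could be viewed in a single 2x2 layout:
#Sketch: show all four diagnostic plots together, then restore the default layout
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))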
Residuals vs. Fitted: Shows some curvature, suggesting non-linearity, and the spread of the residuals appears to grow as fitted values increase.
QQ Plot: The residuals look roughly normal for the most part, though the tails deviate slightly at both ends.
Scale-Location: Echoes the residuals plot, indicating that the spread of the residuals increases with larger fitted values.
Residuals vs. Leverage: Nothing to note besides the presence of some outliers.
10.)
library(ISLR2)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
# Fit the model
model1 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b.)
-Intercept: Serves as a baseline or reference; it represents the average predicted sales when Price = 0, Urban = No, and US = No.
-Price: The coefficient suggests that if Price goes up by one dollar, sales decrease by about 0.054 units, holding the other predictors fixed.
-UrbanYes: Represents the difference in sales between urban and non-urban stores; it is not statistically significant, suggesting this factor has little effect on sales.
-USYes: Represents the difference in sales between US and non-US stores; it is significant, suggesting this factor does affect sales. When all else is equal, stores in the US sell about 1.2 more units than stores outside the US.
-Price: Highly significant, with a very small p-value (< 2e-16).
-UrbanYes: Not significant, since the p-value (0.936) is close to one.
-USYes: Significant (p = 4.86e-06).
We can reject the null hypothesis for Price and USYes, but not for UrbanYes. The fitted equation implied by these coefficients is written out below.
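As a sketch (my own addition, reading the estimates off summary(model1) above), the fitted model in equation form is roughly Sales = 13.04 - 0.054*Price - 0.022*UrbanYes + 1.20*USYes, where UrbanYes and USYes are 1 for "Yes" and 0 for "No". The coefficients can be pulled out and the equation evaluated directly:
#Sketch: coefficients behind the equation, and the equation evaluated for a hypothetical US, urban store with Price = 120
coef(model1)
predict(model1, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))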
#Model with UrbanYes removed
model2 <- lm(Sales ~ Price + US, data = Carseats)
summary(model2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The R-squared values are nearly identical, suggesting that removing the insignificant variable did not hurt the fit. The two models fit the data similarly, but the second model does so with fewer variables, which makes it preferable.
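A formal way to compare the two nested fits (not part of the original output, but a standard check) is an F-test via anova():
#Sketch: F-test comparing the reduced model (model2) to the full model (model1)
anova(model2, model1)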
g.)
confint(model2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
h.)
plot(model2)
-There appear to be some outliers present, as shown in the residual plots.
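To back up the visual impression numerically, a minimal sketch (my addition) uses studentized residuals and hat values; the cutoffs are common rules of thumb, not part of the original analysis:
#Sketch: flag potential outliers (|studentized residual| > 3) and high-leverage points (leverage > 2*(p+1)/n)
which(abs(rstudent(model2)) > 3)
which(hatvalues(model2) > 2 * length(coef(model2)) / nrow(Carseats))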
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
cor(x1,x2)
## [1] 0.8351212
plot(x1, x2,
xlab = "x1",
ylab = "x2",
main = "Scatterplot of x1 vs. x2")
fit_models <- lm(y ~ x1 + x2)
summary(fit_models)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
-The null hypotheses are that the coefficients on x1 and x2 are each zero. x1's slope estimate is 1.44 with a p-value of 0.0487, just below 0.05, so it is (marginally) significant and we reject the null for x1. x2's p-value (0.375) is not low enough, so x2 is not significant and we fail to reject the null for x2.
fit_x1 <- lm(y ~ x1)
summary(fit_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
-We reject the null based on the p-value, suggesting x1 does influence y.
fit_x2 <- lm(y ~ x2)
summary(fit_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
-We reject the null based on the p-value, suggesting x2 does influence y.
-No, the results do not contradict each other. x1 and x2 are strongly correlated, so one of them can lose significance when both are present in the model. That is why x2 can be insignificant when both predictors are included yet significant when it is the only predictor the model considers.
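The collinearity can be quantified directly; a minimal sketch (my addition), using the fact that the variance inflation factor for either predictor in the two-variable model is 1/(1 - R^2) from regressing one predictor on the other:
#Sketch: variance inflation factor implied by cor(x1, x2) = 0.835 (roughly 3.3)
1 / (1 - summary(lm(x2 ~ x1))$r.squared)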
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
# Model with both x1 and x2
model_both_new <- lm(y ~ x1 + x2)
summary(model_both_new)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
# (d) Model with only x1
model_x1_new <- lm(y ~ x1)
summary(model_x1_new)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
# (e) Model with only x2
model_x2_new <- lm(y ~ x2)
summary(model_x2_new)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
-In the model with both predictors, the coefficient estimates shift noticeably: the x1 slope drops (1.44 to 0.54) and loses significance, while the x2 slope rises (1.01 to 2.51) and becomes significant; R-squared rises slightly. The new point pulls the fit in a different direction.
-In the model with only x1, the slope drops (1.98 to 1.77), the residual standard error grows, and R-squared falls, suggesting the new point strongly influences the fitted line.
-In the model with only x2, both the slope and the R-squared value rise.
All three models' parameter estimates and fit statistics change because the new point lies somewhat far from the original data.
-x1 and x2 model:
Outlier: Yes
High Leverage: Yes
-The new point is quite unusual compared to the original data, giving it high leverage, and it has a large residual since its y value sits well above what the original model predicts. The reasoning is similar for the models below; I label them "moderately" because the new point is not as far outside the original range of those predictors as it is for this model. A numeric check is sketched after this list.
-x1 model:
Outlier: Yes
High Leverage: Moderately Yes
-x2 model:
Outlier: Moderately Yes
High Leverage: Moderately Yes
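As referenced above, a minimal sketch (my addition, not in the original output) of a numeric check on the new observation, which is row 101 in each refit; hatvalues() gives leverage and rstudent() gives studentized residuals:
#Sketch: leverage and studentized residual for the added point (observation 101) in each model
hatvalues(model_both_new)[101]; rstudent(model_both_new)[101]
hatvalues(model_x1_new)[101]; rstudent(model_x1_new)[101]
hatvalues(model_x2_new)[101]; rstudent(model_x2_new)[101]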