Question 08

library(ISLR2)
data("Auto")
auto <- Auto
auto <- na.omit(auto)
model <- lm(mpg~horsepower,data=auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

(i) The p-value for the horsepower coefficient is essentially zero. This indicates that there is a strong relationship between the response and the predictor (mpg and horsepower).

mean(auto$mpg, na.rm=T)
## [1] 23.44592
4.906/mean(auto$mpg, na.rm=T) * 100.0
## [1] 20.92475
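The same percentage error can be computed without hardcoding the RSE; a small sketch pulling it from the model summary (output omitted):

rse <- summary(model)$sigma   # residual standard error, 4.906
rse / mean(auto$mpg) * 100    # percentage error relative to mean mpg, ~20.9%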
(ii) The R-squared is 0.6059, so about 60% of the variance in mpg is explained by horsepower. The percentage error (RSE relative to the mean of mpg) is 20.9%. Together these indicate a reasonably strong relationship between mpg and horsepower.

(iii) The coefficient of horsepower is negative, so the relationship between mpg and horsepower is negative: an automobile with more horsepower has lower fuel efficiency (mpg).

(iv) The predicted mpg for a horsepower of 98, together with its 95% confidence and prediction intervals:
horsepower_new <- data.frame(horsepower = 98)
predicted_mpg <- predict(model, newdata = horsepower_new)

print(predicted_mpg)
##        1 
## 24.46708
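As a sanity check, the same prediction can be reproduced by hand from the fitted coefficients (a quick sketch):

coef(model)[1] + coef(model)[2] * 98  # 39.9359 - 0.1578 * 98 ≈ 24.47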
intervals <- predict(model, newdata = horsepower_new, interval = "confidence", level = 0.95)
prediction_intervals <- predict(model, newdata = horsepower_new, interval = "prediction", level = 0.95)

print(intervals)
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
print(prediction_intervals)
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
The prediction interval is wider than the confidence interval because it accounts for the variability of an individual observation, not just the uncertainty in the fitted mean.

(b)

plot(auto$horsepower, auto$mpg,
     xlab = "Horsepower", ylab = "MPG",
     main = "Scatter plot of MPG vs Horsepower")

abline(model, col = "red")

(c) The residuals-vs-fitted plot below shows a clear U-shape, suggesting the relationship between mpg and horsepower is non-linear.

plot(model)

## Question 10 (a)

data("Carseats")
multiple_model <- lm(Sales~Price+Urban+US, data=Carseats)
summary(multiple_model)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
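Before interpreting the qualitative predictors, it helps to check how R encodes them; a quick sketch of the dummy coding (output omitted):

contrasts(Carseats$Urban)  # UrbanYes = 1 for urban stores, 0 for rural
contrasts(Carseats$US)     # USYes = 1 for US stores, 0 otherwise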
(b) The coefficient of Price is negative: holding the other predictors fixed, sales fall by about 54 units (the response is in thousands, so 0.0545 × 1000 ≈ 54) for each one-dollar increase in price. The coefficient of UrbanYes suggests that urban stores sell about 22 fewer units than rural stores with all other predictors held fixed, though its large p-value (0.936) means this effect is not significant. The coefficient of USYes indicates that US stores sell about 1,201 more units than non-US stores.

(c) The model in equation form:

Sales = 13.0435 − 0.0545 × Price − 0.0219 × Urban + 1.2006 × US

where Urban = 1 if the store is in an urban location (0 if rural) and US = 1 if the store is in the US (0 otherwise).

(d) We can reject the null hypothesis H0: βj = 0 for the Price and US variables.

(e)

model_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(model_2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
(f) Both models fit the data about equally well in terms of explaining the variance in Sales (the R-squared is 0.2393 in both). Including the Urban predictor adds essentially nothing to the fit.
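A formal way to check this is a partial F-test comparing the two nested models; a one-line sketch (output omitted):

anova(model_2, multiple_model)  # tests whether dropping Urban significantly hurts the fit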
(g)

confint(model_2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
(h)

plot(model_2)
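The outlier and leverage counts described below can also be checked numerically; a small sketch using base R diagnostics (output omitted):

std_res <- rstandard(model_2)          # standardized residuals
lev <- hatvalues(model_2)              # leverage values
sum(abs(std_res) > 2)                  # candidate outliers
sum(lev > (2 + 1) / nrow(Carseats))    # points above the average leverage (p+1)/n, with p = 2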

The plot of standardized residuals vs leverage indicates that there are some outliers (standardized residuals higher than 2 or lower than -2), and some high-leverage points, as a few observations exceed the average leverage (p+1)/n.

## Question 14

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
(a) The model is y = 2 + 2 x1 + 0.3 x2 + ε, so the regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3.

(b)
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)
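With a correlation this high, the variance inflation factor quantifies how much collinearity will inflate the coefficient standard errors; a hand-rolled sketch using VIF = 1/(1 − R²), avoiding any extra packages:

r2 <- summary(lm(x1 ~ x2))$r.squared  # R-squared from regressing one predictor on the other
1 / (1 - r2)                          # VIF ≈ 3.3, given cor(x1, x2) ≈ 0.835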

(c)

model_3 <- lm(y~x1+x2)
summary(model_3)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
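Given the standard errors above, the 95% confidence intervals for these estimates are wide; a quick check that they still cover the true coefficients (output omitted):

confint(model_3)  # the x1 and x2 intervals contain the true values 2 and 0.3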

The estimated coefficients for β0, β1, and β2 are 2.1305, 1.4396, and 1.0097 respectively. The estimate of β0 is close to its true value of 2, but the estimates of β1 and β2 are far from the true values 2 and 0.3, consistent with the wide confidence intervals checked above. We may reject the null hypothesis for β1, as its p-value is just below 0.05; we may not reject the null hypothesis for β2, since its p-value (0.375) is well above 0.05. (d)

model_4 <- lm(y~x1)
summary(model_4)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

The coefficient of x1 differs from the previous case, where y was regressed on both x1 and x2. Here the coefficient of x1 is highly significant and its p-value is very low, so we may reject the null hypothesis β1 = 0. (e)

model_5 <- lm(y~x2)
summary(model_5)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

The coefficient of x2 is also highly significant and its p-value is very low, so we may again reject the null hypothesis that its coefficient is zero.

(f) No, the results of (c) and (e) do not contradict each other; the difference is caused by collinearity. The response depends on two predictors that are highly collinear (correlation ≈ 0.84; the VIF sketch earlier gives ≈ 3.3), so each masks the other's effect on the response. Collinearity also inflates the standard errors: the standard errors of x1 and x2 in the joint model are larger than in either simple regression. (g)

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
model_6 <- lm(y~x1+x2)
model_7 <- lm(y~x1)
model_8 <- lm(y~x2)
summary(model_6)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
summary(model_7)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
summary(model_8)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
plot(model_6)

plot(model_7)

plot(model_8)

In the model with two predictors, point 101 is a high-leverage point. In the model with x1 as the only predictor, point 101 is an outlier. In the model with x2 as the only predictor, point 101 is again a high-leverage point.
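These classifications can be verified numerically instead of being read off the plots; a short sketch using base R diagnostics (output omitted):

hatvalues(model_6)[101]  # high leverage in the two-predictor model
rstudent(model_7)[101]   # large studentized residual (outlier) in the x1-only model
hatvalues(model_8)[101]  # high leverage in the x2-only model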