Nassos Galiopoulos February 18, 2022
We would use a regression method when the response variable is continuous: the prediction is the mean response of the closest observations. We would use classification when the response is categorical: this is a non-parametric approach in which we look at odds ratios and conditional probabilities.
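The contrast can be sketched in a few lines of base R. This sketch assumes the "closest observations" method is k-nearest neighbours with a single predictor; `knn_predict()` is a hypothetical helper written for illustration, not a function from any package, and the data are made up.

```r
# Minimal k-nearest-neighbours sketch (illustrative only).
knn_predict <- function(x_train, y_train, x_new, k = 3) {
  # indices of the k training points closest to x_new
  nearest <- order(abs(x_train - x_new))[seq_len(k)]
  neighbours <- y_train[nearest]
  if (is.numeric(neighbours)) {
    mean(neighbours)                      # regression: average the responses
  } else {
    # classification: table(neighbours) / k estimates the class
    # probabilities; predict the most frequent class
    names(which.max(table(neighbours)))
  }
}

x     <- c(1, 2, 3, 10, 11, 12)
y_num <- c(1.0, 1.2, 0.8, 5.0, 5.2, 4.8)          # continuous response
y_cat <- factor(c("a", "a", "a", "b", "b", "b"))  # categorical response

reg_pred <- knn_predict(x, y_num, x_new = 2.5, k = 3)
cls_pred <- knn_predict(x, y_cat, x_new = 11.5, k = 3)
```

The only difference between the two cases is the last step: averaging the neighbours' responses versus taking the most frequent class among them.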
# clean the environment
rm(list = ls())
# load the dataset
library(ISLR2)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
plot(Auto)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
# compute correlations and exclude qualitative
cor(Auto[ ,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
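Dropping column 9 by position works here, but it depends on the column order. A possible alternative, sketched on a made-up data frame standing in for Auto, is to select the numeric columns by type:

```r
# Toy data frame standing in for Auto (values are made up for illustration)
cars <- data.frame(
  mpg    = c(18, 15, 26, 31, 22),
  weight = c(3504, 3693, 2300, 1950, 2800),
  name   = factor(c("a", "b", "c", "d", "e"))   # qualitative column
)

# Keep only the numeric columns, whatever their position
numeric_cols <- sapply(cars, is.numeric)
cor_mat <- cor(cars[, numeric_cols])
```

Note that this keeps any variable stored as a number, so a numerically coded category such as origin would still be included.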
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?
The p-value is < 2.2e-16, which is very small, and the F-statistic is 252.4, which is large, so we can be confident that there is a relationship between the predictors and the response variable mpg.
Displacement, weight, year and origin have p-values below the assumed significance level of α = 0.05.
The coefficient for year is 0.750773. This means that for every one-unit increase in year, mpg increases by 0.750773 on average, assuming all other predictors are held constant.
# perform multiple linear regression on mpg
MilesPerGallon = lm(mpg~.-name,Auto)
summary(MilesPerGallon)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
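The F-statistic in the last line of the summary can be recovered from R-squared, the number of predictors p and the residual degrees of freedom. The sketch below checks this identity on simulated data (not the Auto data set) and then plugs in the rounded values printed above.

```r
set.seed(42)                       # simulated data, not the Auto data set
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)
s   <- summary(fit)

p  <- 3                            # number of predictors
r2 <- s$r.squared
f_by_hand <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Same formula with the rounded values from the Auto summary above:
# R-squared 0.8215, 7 predictors, 384 residual degrees of freedom
f_auto <- (0.8215 / 7) / ((1 - 0.8215) / 384)
```

With the rounded R-squared, `f_auto` comes out close to the reported 252.4; the small gap is due to rounding.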
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
The Q-Q plot follows a straight line reasonably well up to about 2. However, the residuals vs fitted plot does not show good linearity. In the same plot, the observations with the largest fitted values (towards the right) could influence our model. We do not see any observations with a large Cook's distance.
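Leverage and Cook's distance can also be checked numerically rather than by eye. The sketch below uses simulated data with one deliberately planted influential point; the cut-offs 2(p + 1)/n and 4/n are common rules of thumb assumed here, not thresholds taken from this report.

```r
set.seed(1)                          # simulated data, not the Auto data set
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
x[1] <- 8; y[1] <- -10               # plant one influential point on purpose

fit <- lm(y ~ x)
p   <- length(coef(fit)) - 1         # number of predictors

lev  <- hatvalues(fit)               # leverage statistics
cook <- cooks.distance(fit)          # Cook's distances

# Rules of thumb (conventions, not hard cut-offs)
high_leverage <- which(lev > 2 * (p + 1) / n)
influential   <- which(cook > 4 / n)
```

Both vectors hold the row indices of the flagged observations, so they can be inspected directly or used to label points on the diagnostic plots.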
# produce diagnostics plots
par(mfrow=c(2,2))
plot(MilesPerGallon)
cook.d = cooks.distance(MilesPerGallon)
par(mfrow=c(1,1))
plot(cook.d,col="red",pch=19,cex=1)
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
The displacement:weight interaction appears to be statistically significant (p < 2e-16), as does cylinders:year (p = 4.38e-08).
# fit linear regression models with interaction effects
summary(MilesPerGallon)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Model1 = lm(mpg~.-name + displacement*weight+year:cylinders, Auto)
summary(Model1)
##
## Call:
## lm(formula = mpg ~ . - name + displacement * weight + year:cylinders,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5164 -1.5950 -0.1015 1.3021 12.5609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.632e+01 1.166e+01 -5.686 2.59e-08 ***
## cylinders 1.209e+01 2.161e+00 5.594 4.24e-08 ***
## displacement -6.739e-02 1.063e-02 -6.340 6.50e-10 ***
## horsepower -3.994e-02 1.199e-02 -3.332 0.000948 ***
## weight -1.011e-02 6.936e-04 -14.574 < 2e-16 ***
## acceleration 1.006e-01 8.498e-02 1.184 0.237154
## year 1.561e+00 1.455e-01 10.724 < 2e-16 ***
## origin 5.936e-01 2.525e-01 2.351 0.019249 *
## displacement:weight 2.078e-05 2.200e-06 9.448 < 2e-16 ***
## cylinders:year -1.543e-01 2.762e-02 -5.588 4.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.853 on 382 degrees of freedom
## Multiple R-squared: 0.8694, Adjusted R-squared: 0.8664
## F-statistic: 282.6 on 9 and 382 DF, p-value: < 2.2e-16
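Besides reading individual t-tests, the models with and without interaction terms can be compared with a partial F-test via anova(). The sketch below uses simulated data (not the Auto data set) in which a true interaction is present.

```r
set.seed(7)                           # simulated data, not the Auto data set
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + 0.8 * x1 * x2 + rnorm(n)   # true interaction present

fit_main <- lm(y ~ x1 + x2)           # main effects only
fit_int  <- lm(y ~ x1 * x2)           # adds the x1:x2 interaction

# Partial F-test: does the interaction reduce the residual sum of squares?
cmp     <- anova(fit_main, fit_int)
p_value <- cmp$`Pr(>F)`[2]
```

The same call pattern, anova(MilesPerGallon, Model1), would presumably apply to the two Auto models above, since the first is nested in the second.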
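The log transformation of acceleration showed a reduced p-value (acceleration drops from p = 0.237 to p = 0.0016, and log(acceleration) has p = 0.0026), but the F-statistic is now smaller (235.6, down from 282.6).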
# log transformations
Model2 = lm(mpg~.-name + displacement*log(weight)+year:cylinders+log(acceleration), Auto)
summary(Model2)
##
## Call:
## lm(formula = mpg ~ . - name + displacement * log(weight) + year:cylinders +
## log(acceleration), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0920 -1.5290 0.0143 1.2957 12.9721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.859829 76.160285 1.062 0.289043
## cylinders 10.279768 2.259566 4.549 7.25e-06 ***
## displacement -0.504066 0.196698 -2.563 0.010772 *
## horsepower -0.047762 0.012410 -3.849 0.000139 ***
## weight -0.004112 0.004851 -0.848 0.397142
## acceleration 1.421988 0.448480 3.171 0.001644 **
## year 1.453984 0.149493 9.726 < 2e-16 ***
## origin 0.530919 0.253621 2.093 0.036980 *
## log(weight) -14.176185 10.284091 -1.378 0.168873
## log(acceleration) -22.593994 7.455537 -3.030 0.002609 **
## displacement:log(weight) 0.061791 0.024131 2.561 0.010833 *
## cylinders:year -0.132177 0.028685 -4.608 5.56e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.831 on 380 degrees of freedom
## Multiple R-squared: 0.8721, Adjusted R-squared: 0.8684
## F-statistic: 235.6 on 11 and 380 DF, p-value: < 2.2e-16
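Because `.` already expands to every column, the model above keeps weight and acceleration alongside their logs. If the intent is to replace a predictor with its log rather than add to it, one option is to subtract the original term in the formula. The sketch below uses mtcars, which ships with R, purely as a stand-in for Auto.

```r
# mtcars is used here only as a stand-in for Auto
fit_both    <- lm(mpg ~ . + log(wt), data = mtcars)        # wt and log(wt)
fit_replace <- lm(mpg ~ . - wt + log(wt), data = mtcars)   # log(wt) only

names(coef(fit_both))
names(coef(fit_replace))
```

Keeping both a variable and its log makes the two coefficients hard to interpret separately, which may explain why weight and log(weight) both look insignificant in the summary above.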
# exercise 10
# clean the environment
rm(list = ls())
library(ISLR2)
# load the dataset
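lm.carseats = lm(Sales~Price+Urban+US,data=Carseats)
Sales is recorded in thousands of units, so for every one-unit increase in Price, sales decrease by about 54 units (0.054 thousand), holding the other predictors fixed. A US store sells about 1,200 more units (1.2 thousand) on average than a non-US store.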
summary(lm.carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Sales = 13.043469 - 0.054459*Price - 0.021916*Urban + 1.200573*US, where Urban and US equal 1 for "Yes" and 0 for "No".
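To see how the Yes/No factors enter the equation, the sketch below builds the design matrix for two hypothetical stores (made-up prices and settings) and applies the coefficients copied from the summary output.

```r
# Hypothetical stores, used only to show how the Yes/No factors are coded
stores <- data.frame(
  Price = c(100, 120),
  Urban = factor(c("Yes", "No"), levels = c("No", "Yes")),
  US    = factor(c("No", "Yes"), levels = c("No", "Yes"))
)

# R expands each factor into a 0/1 column named UrbanYes / USYes
X <- model.matrix(~ Price + Urban + US, data = stores)

# Coefficients copied from the summary output above
beta <- c(13.043469, -0.054459, -0.021916, 1.200573)
predicted_sales <- as.vector(X %*% beta)   # in thousands of units
```

With the fitted model object available, predict(lm.carseats, newdata = stores) should give the same values up to rounding of the coefficients.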
We can reject the null hypothesis for the Price and US predictors. We cannot reject it for Urban (p = 0.936), so we should remove Urban from the model.
# fit a new model
lm.carseats2 = lm(Sales~Price+US,data = Carseats)
The F-statistic is now noticeably higher (62.43, up from 41.52), and there is a small increase in adjusted R-squared (0.2354, up from 0.2335); multiple R-squared is unchanged at 0.2393.
summary(lm.carseats2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
# get confidence interval for new model
confint(lm.carseats2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
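The interval for Price can be reproduced by hand as estimate ± t critical value × standard error. The sketch below uses the rounded estimate and standard error from summary(lm.carseats2), so it matches the confint() output only up to rounding.

```r
# Estimate and standard error for Price, copied from summary(lm.carseats2)
est <- -0.05448
se  <- 0.00523
df  <- 397                          # residual degrees of freedom

t_crit <- qt(0.975, df)             # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se  # lower and upper bounds
```

The interval excludes zero, which agrees with the small p-value reported for Price.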
# take a look at the plots
par(mfrow=c(2,2))
plot(lm.carseats2)
# check for outliers with studentized residuals
rstudent(lm.carseats2)[which(rstudent(lm.carseats2)>3)]
## named numeric(0)
No observation has a studentized residual above 3.
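The check above only looks at the positive tail. A two-sided variant, sketched here on simulated data (not Carseats) with one planted negative outlier, compares the absolute studentized residuals against 3.

```r
set.seed(3)                            # simulated data, not Carseats
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[10] <- y[10] - 8                     # plant a large negative outlier

fit   <- lm(y ~ x)
r_stu <- rstudent(fit)

one_sided <- which(r_stu > 3)          # misses large negative residuals
two_sided <- which(abs(r_stu) > 3)     # catches both tails
```

Observation 10 is flagged only by the two-sided check, so abs() is the safer form when screening for outliers.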
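For regression without an intercept, the coefficient estimate for Y onto X is sum(x*y)/sum(x^2) and for X onto Y it is sum(x*y)/sum(y^2). The two estimates will be the same if they have the same denominator, that is, if sum(x^2) equals sum(y^2).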
# clean the environment
rm(list = ls())
# generate an example with the different denominator
set.seed (1)
x1=rnorm(100)
y1=x1*x1
# fit a regression to get coefficients
lm.fit = lm(x1~y1+0)
lm.fit2 = lm(y1~x1+0)
# compare coefficients
summary(lm.fit)$coefficients[1,1]
## [1] 0.1078456
summary(lm.fit2)$coefficients[1,1]
## [1] 0.2598674
# generate an example with the same denominator
set.seed (1)
x1=rnorm(100)
y1=sample(x1)
# fit a regression to get coefficients
lm.fit = lm(x1~y1+0)
lm.fit2 = lm(y1~x1+0)
# compare coefficients
summary(lm.fit)$coefficients[1,1]
## [1] -0.07767695
summary(lm.fit2)$coefficients[1,1]
## [1] -0.07767695
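The reason the second example gives identical coefficients can be checked directly from the no-intercept slope formula. The sketch below recomputes both slopes by hand; the variable names are chosen for illustration and differ from those used above.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)                          # same values, shuffled

# No-intercept slope formulas
b_y_on_x <- sum(x * y) / sum(x^2)       # regress y onto x
b_x_on_y <- sum(x * y) / sum(y^2)       # regress x onto y

# Compare with lm()
coef_y_on_x <- unname(coef(lm(y ~ x + 0)))
coef_x_on_y <- unname(coef(lm(x ~ y + 0)))
```

Since sample() only reorders the values, sum(x^2) and sum(y^2) are equal, the numerators are shared, and so the two slopes coincide.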