Question 1) Let’s deal with nonlinearity first. Create a new dataset
that log-transforms several variables from our original dataset (called
cars in this case):
cars <- read.table("C:/R-language/BACS/auto-data.txt", header=FALSE, na.strings = "?")
names(cars) <- c("mpg", "cylinders", "displacement", "horsepower", "weight",
"acceleration", "model_year", "origin", "car_name")
cars_log <- with(cars, data.frame(log(mpg), log(cylinders), log(displacement),
log(horsepower), log(weight), log(acceleration),
model_year, origin))
1-a) Run a new regression on the cars_log dataset, with log.mpg.
dependent on all other variables
mpglog_lm <- lm(log.mpg. ~ log.cylinders. + log.displacement. + log.horsepower. +
                  log.weight. + log.acceleration. + model_year + factor(origin),
                data = cars_log)
summary(mpglog_lm)
##
## Call:
## lm(formula = log.mpg. ~ log.cylinders. + log.displacement. +
## log.horsepower. + log.weight. + log.acceleration. + model_year +
## factor(origin), data = cars_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39727 -0.06880 0.00450 0.06356 0.38542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.301938 0.361777 20.184 < 2e-16 ***
## log.cylinders. -0.081915 0.061116 -1.340 0.18094
## log.displacement. 0.020387 0.058369 0.349 0.72707
## log.horsepower. -0.284751 0.057945 -4.914 1.32e-06 ***
## log.weight. -0.592955 0.085165 -6.962 1.46e-11 ***
## log.acceleration. -0.169673 0.059649 -2.845 0.00469 **
## model_year 0.030239 0.001771 17.078 < 2e-16 ***
## factor(origin)2 0.050717 0.020920 2.424 0.01580 *
## factor(origin)3 0.047215 0.020622 2.290 0.02259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.113 on 383 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.8919, Adjusted R-squared: 0.8897
## F-statistic: 395 on 8 and 383 DF, p-value: < 2.2e-16
1-a-i) Which log-transformed factors have a significant effect on
log.mpg. at 10% significance?
At 10% significance, log.horsepower., log.weight., and
log.acceleration. have significant effects on log.mpg.
1-a-ii) Do some new factors now have effects on mpg, and why might
this be?
Horsepower and acceleration now have significant effects on mpg.
Taking the log of these variables gives them more symmetric
distributions and makes their relationship with log.mpg. closer to
linear, so the regression can pick up their effects.
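As a quick illustrative check (not part of the required output), the raw and log-transformed horsepower densities can be compared; the log version is visibly more symmetric:
par(mfrow = c(1, 2))
plot(density(cars$horsepower, na.rm = TRUE), main = "horsepower (raw)")
plot(density(log(cars$horsepower), na.rm = TRUE), main = "log(horsepower)")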
1-a-iii) Which factors still have insignificant or opposite (from
correlation) effects on mpg? Why might this be?
Cylinders and displacement still have insignificant effects on
mpg. It might mean that those variables truly have no effect on mpg
once the other predictors are accounted for; alternatively, it might be
a sign of multicollinearity among those variables, which inflates their
standard errors.
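One optional way to check the multicollinearity suspicion is to look at the pairwise correlations among the engine-size variables; values close to 1 would explain the inflated standard errors:
round(cor(cars_log[, c("log.cylinders.", "log.displacement.",
                       "log.horsepower.", "log.weight.")],
          use = "complete.obs"), 2)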
1-b) Let’s take a closer look at weight, because it seems to be a
major explanation of mpg.
1-b-i) Create a regression (call it regr_wt) of mpg over weight from
the original cars dataset.
regr_wt <- lm(mpg ~ weight,cars)
summary(regr_wt)
##
## Call:
## lm(formula = mpg ~ weight, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.012 -2.801 -0.351 2.114 16.480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.3173644 0.7952452 58.24 <2e-16 ***
## weight -0.0076766 0.0002575 -29.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.345 on 396 degrees of freedom
## Multiple R-squared: 0.6918, Adjusted R-squared: 0.691
## F-statistic: 888.9 on 1 and 396 DF, p-value: < 2.2e-16
1-b-ii) Create a regression (call it regr_wt_log) of log.mpg. on
log.weight. from cars_log
regr_wt_log <- lm(log.mpg. ~ log.weight., cars_log)
summary(regr_wt_log)
##
## Call:
## lm(formula = log.mpg. ~ log.weight., data = cars_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52408 -0.10441 -0.00805 0.10165 0.59384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.5219 0.2349 49.06 <2e-16 ***
## log.weight. -1.0583 0.0295 -35.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.165 on 396 degrees of freedom
## Multiple R-squared: 0.7647, Adjusted R-squared: 0.7641
## F-statistic: 1287 on 1 and 396 DF, p-value: < 2.2e-16
1-b-iii) Visualize the residuals of both regression models (raw and
log-transformed):
1-b-iii-1) density plots of residuals
par(mfrow= c(1,2))
plot(density(regr_wt$residuals))
plot(density(regr_wt_log$residuals))

1-b-iii-2) scatterplot of log.weight. vs. residuals
par(mfrow= c(1,1))
plot(cars_log$log.weight., regr_wt_log$residuals)

1-b-iv) Which regression produces better distributed residuals for
the assumptions of regression?
The log-transformed regression produces residuals that better satisfy
the regression assumptions: their distribution is more symmetric and
roughly normal, and they show no obvious pattern against the predictor.
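Normal Q-Q plots of the two sets of residuals are another optional way to back this up visually:
par(mfrow = c(1, 2))
qqnorm(regr_wt$residuals, main = "mpg ~ weight")
qqline(regr_wt$residuals)
qqnorm(regr_wt_log$residuals, main = "log.mpg. ~ log.weight.")
qqline(regr_wt_log$residuals)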
1-b-v) How would you interpret the slope of log.weight. vs log.mpg.
in simple words?
I would interpret the slope as an elasticity: a 1% increase in
weight is associated with roughly a 1.06% decrease in mpg.
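This reading can be sanity-checked numerically from the fitted coefficient (a small illustration; the exact number depends on the estimate above):
slope <- coef(regr_wt_log)["log.weight."]
(1.01^slope - 1) * 100   # about -1.05: a 1% heavier car is predicted to get ~1.05% lower mpg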
1-b-vi) From its standard error, what is the 95% confidence interval
of the slope of log.weight. vs log.mpg.?
confint(regr_wt_log, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.060154 11.983659
## log.weight. -1.116264 -1.000272
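The same interval can be rebuilt by hand from the coefficient and its standard error, which is what the question hints at; a sketch using the t critical value with 396 residual degrees of freedom:
b <- summary(regr_wt_log)$coefficients["log.weight.", ]
b["Estimate"] + c(-1, 1) * qt(0.975, df = 396) * b["Std. Error"]   # should reproduce roughly (-1.116, -1.000)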
Question 2) Let’s tackle multicollinearity next. Consider the
regression model:
regr_log <- lm(log.mpg. ~ log.cylinders. + log.displacement. + log.horsepower. +
log.weight. + log.acceleration. + model_year +
factor(origin), data=cars_log)
2-a) Using regression and R2, compute the VIF of log.weight. using
the approach shown in class
weight_regr <- lm(log.weight. ~ log.cylinders. + log.displacement. + log.horsepower. +
log.acceleration. + model_year +
factor(origin), data=cars_log)
r2_weight <- summary(weight_regr)$r.squared  # R^2 from regressing log.weight. on the other predictors
vif_weight <- 1 / (1 - r2_weight)            # VIF = 1 / (1 - R^2), about 17.58
sqrt(vif_weight)                             # square root of the VIF; matches GVIF^(1/(2*Df)) from car::vif() below
## [1] 4.192269
2-b) Let’s try a procedure called Stepwise VIF Selection to remove
highly collinear predictors.
require(car)
## Loading required package: car
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.2
2-b-i) Use vif(regr_log) to compute the VIF of all the independent
variables
vif(regr_log)
## GVIF Df GVIF^(1/(2*Df))
## log.cylinders. 10.456738 1 3.233688
## log.displacement. 29.625732 1 5.442952
## log.horsepower. 12.132057 1 3.483110
## log.weight. 17.575117 1 4.192269
## log.acceleration. 3.570357 1 1.889539
## model_year 1.303738 1 1.141814
## factor(origin) 2.656795 2 1.276702
2-b-ii) Eliminate from your model the single independent variable
with the largest VIF score that is also greater than 5
regr_log_adj <- lm(log.mpg. ~ log.cylinders. + log.horsepower. +
log.weight.+ log.acceleration. + model_year +
factor(origin), data=cars_log)
vif(regr_log_adj)
## GVIF Df GVIF^(1/(2*Df))
## log.cylinders. 5.433107 1 2.330903
## log.horsepower. 12.114475 1 3.480585
## log.weight. 11.239741 1 3.352572
## log.acceleration. 3.327967 1 1.824272
## model_year 1.291741 1 1.136548
## factor(origin) 1.897608 2 1.173685
2-b-iii) Repeat steps (i) and (ii) until no more independent
variables have VIF scores above 5
regr_log_adj <- lm(log.mpg. ~ log.cylinders. +
log.weight.+ log.acceleration. + model_year +
factor(origin), data=cars_log)
vif(regr_log_adj)
## GVIF Df GVIF^(1/(2*Df))
## log.cylinders. 5.321090 1 2.306749
## log.weight. 4.788498 1 2.188264
## log.acceleration. 1.400111 1 1.183263
## model_year 1.201815 1 1.096273
## factor(origin) 1.792784 2 1.157130
regr_log_adjw <- lm(log.mpg. ~ log.weight.+ log.acceleration. + model_year +
factor(origin), data=cars_log)
vif(regr_log_adjw)
## GVIF Df GVIF^(1/(2*Df))
## log.weight. 1.926377 1 1.387940
## log.acceleration. 1.303005 1 1.141493
## model_year 1.167241 1 1.080389
## factor(origin) 1.692320 2 1.140567
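The manual iterations above can also be wrapped in a small helper that repeats steps (i) and (ii) automatically. This is a hypothetical sketch, not code from class: it refits the model and drops the predictor with the largest GVIF while any GVIF exceeds 5.
stepwise_vif <- function(formula, data, threshold = 5) {
  repeat {
    fit  <- lm(formula, data = data)
    vifs <- car::vif(fit)
    # vif() returns a matrix (GVIF, Df, GVIF^(1/(2*Df))) when factors are present
    gvif <- if (is.matrix(vifs)) vifs[, "GVIF"] else vifs
    if (max(gvif) <= threshold) return(fit)
    worst   <- names(which.max(gvif))                  # predictor with the largest GVIF
    formula <- update(formula, paste(". ~ . -", worst))
  }
}
# stepwise_vif(log.mpg. ~ log.cylinders. + log.displacement. + log.horsepower. +
#                log.weight. + log.acceleration. + model_year + factor(origin),
#              data = cars_log)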
2-b-iv) Report the final regression model and its summary
statistics
summary(regr_log_adjw)
##
## Call:
## lm(formula = log.mpg. ~ log.weight. + log.acceleration. + model_year +
## factor(origin), data = cars_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38275 -0.07032 0.00491 0.06470 0.39913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.431155 0.312248 23.799 < 2e-16 ***
## log.weight. -0.876608 0.028697 -30.547 < 2e-16 ***
## log.acceleration. 0.051508 0.036652 1.405 0.16072
## model_year 0.032734 0.001696 19.306 < 2e-16 ***
## factor(origin)2 0.057991 0.017885 3.242 0.00129 **
## factor(origin)3 0.032333 0.018279 1.769 0.07770 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1156 on 392 degrees of freedom
## Multiple R-squared: 0.8856, Adjusted R-squared: 0.8841
## F-statistic: 606.8 on 5 and 392 DF, p-value: < 2.2e-16
2-c) Using stepwise VIF selection, have we lost any variables that
were previously significant? If so, how much did we hurt our explanation
by dropping those variables? (hint: look at model fit)
Yes: log.horsepower. was significant in the original log-transformed
model but was dropped by the stepwise VIF selection (and
log.acceleration., although kept, is no longer significant).
We can use the R-squared value as a measure of model fit. Compared
to the full log-transformed model, the multiple R-squared fell from
0.8919 to 0.8856 (adjusted: 0.8897 to 0.8841). The drop is less than
0.01, so dropping those variables hurt our explanation only slightly.
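A direct way to quantify this is to compare the fit statistics of the two models already estimated above (note the full model is fit on 6 fewer rows because of missing horsepower values):
summary(mpglog_lm)$adj.r.squared      # full log model, about 0.890
summary(regr_log_adjw)$adj.r.squared  # reduced model, about 0.884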
2-d) From only the formula for VIF, try deducing/deriving the
following:
2-d-i) If an independent variable has no correlation with other
independent variables, what would its VIF score be?
If an independent variable has no correlation with other
independent variables, its VIF score would be 1, indicating no
multicollinearity issue.
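This follows directly from the formula: with no correlation, the R-squared from regressing that variable on the others is 0, so VIF = 1 / (1 - 0) = 1. A tiny simulation (made-up data, just for illustration) shows the same thing:
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)            # generated independently of x1
y  <- x1 + x2 + rnorm(200)
vif(lm(y ~ x1 + x2))        # both VIF scores come out very close to 1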
2-d-ii) Given a regression with only two independent variables (X1
and X2), how correlated would X1 and X2 have to be, to get VIF scores of
5 or higher? To get VIF scores of 10 or higher?
If we want VIF(X1) and VIF(X2) to be at least 5, we set the formula
equal to the threshold and solve for r: 5 = 1 / (1 - r^2), so
r^2 = 1 - 1/5; similarly for a VIF of 10, r^2 = 1 - 1/10.
R_squ_5 <- 1 - (1/5)
sqrt(R_squ_5)
## [1] 0.8944272
R_squ_10 <- 1 - (1/10)
sqrt(R_squ_10)
## [1] 0.9486833
Therefore, X1 and X2 would need to be very highly correlated: a
correlation of at least about 0.894 (in absolute value) to get VIF
scores of 5 or higher, and at least about 0.949 to get VIF scores of
10 or higher.
Question 3) Might the relationship of weight on mpg be different for
cars from different origins?
par(mfrow= c(1,1))
origin_colors = c("blue", "darkgreen", "red")
with(cars_log, plot(log.weight., log.mpg., pch=origin, col=origin_colors[origin]))
legend(7.35,2.75, lty=1, c("origin1", "origin2", "origin3"), col=origin_colors)

3-a) Let’s add three separate regression lines on the scatterplot,
one for each of the origins.
with(cars_log, plot(log.weight., log.mpg., pch=origin, col=origin_colors[origin]))
legend(7.35,2.75, lty=1, c("origin1", "origin2", "origin3"), col=origin_colors)
#origin1 (US)
cars_us <- subset(cars_log, origin==1)
wt_regr_us <- lm(log.mpg. ~ log.weight., data=cars_us)
abline(wt_regr_us, col=origin_colors[1], lwd=2)
#origin2 (Europe)
cars_eu <- subset(cars_log, origin==2)
wt_regr_eu <- lm(log.mpg. ~ log.weight., data=cars_eu)
abline(wt_regr_eu, col=origin_colors[2], lwd=2)
#origin3 (Japan)
cars_jp <- subset(cars_log, origin==3)
wt_regr_jp <- lm(log.mpg. ~ log.weight., data=cars_jp)
abline(wt_regr_jp, col=origin_colors[3], lwd=2)

3-b) [not graded] Do cars from different origins appear to have
different weight vs. mpg relationships?
There does not appear to be a significant difference between the
origins: in every group, mpg decreases as weight increases, and the
slopes look broadly similar.
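An optional, more formal check (not required by the question) is to add a weight-by-origin interaction; non-significant interaction terms would support the reading that the slopes are essentially the same:
interaction_regr <- lm(log.mpg. ~ log.weight. * factor(origin), data=cars_log)
summary(interaction_regr)$coefficients   # look at the log.weight.:factor(origin) rows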