Keep your # comments short in a code chunk.
Write paragraphs above or below the code chunks.
Do not round intermediate calculations. Round your final calculation to 2 significant digits.
Answer the following questions using the appropriate dataset and codebook. For each question, provide (1) your code, (2) the R output, and (3) the answer in complete sentences.
Load the necessary libraries for this lab assignment.
# install.packages("stargazer")
options(scipen=999, digits = 3)
library(ggplot2)
library(ggfortify) # for autoplot
library(car)
## Loading required package: carData
Following the lab example, create a log-transformed variable for the dependent variable ahe and assign a new name. Remember to replace 0 with 1 using the ifelse function before doing the log transformation. (1 point)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 13.0 18.3 21.2 26.4 105.8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 13.0 18.3 21.2 26.4 105.8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.56 2.91 2.91 3.27 4.66
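The three summaries above appear to show ahe before recoding, ahe after replacing 0 with 1, and the logged variable. The code chunk itself is not shown, so the following is only a minimal sketch of the transformation on a small made-up data frame. The names `demo` and `ahe1` are assumptions for illustration; `log_ahe1` matches the variable name used in the regression output.

```r
# Toy stand-in for the cps data (hypothetical values, not the real dataset)
demo <- data.frame(ahe = c(0, 13.0, 18.3, 26.4, 105.8))

# Replace 0 with 1 so that log() returns 0 instead of -Inf
demo$ahe1 <- ifelse(demo$ahe == 0, 1, demo$ahe)

# Natural-log transformation of the recoded earnings variable
demo$log_ahe1 <- log(demo$ahe1)

summary(demo$ahe)
summary(demo$ahe1)
summary(demo$log_ahe1)
```

Because log(1) = 0, the recoded zero earners end up with a logged value of 0, which is consistent with the minimum shown in the third summary.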
For the variable age, create three new variables as follows: (2 points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 27.0 30.0 29.6 32.0 34.0
cps$log_age <- log(cps$age)
# centered age
# center at the exact mean or at the closest whole number (30)
mean(cps$age)
## [1] 29.6
cps$ageC <- cps$age - mean(cps$age)
summary(cps$ageC)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.63 -2.63 0.37 0.00 2.37 4.37
# alternatively
cps$ageC <- cps$age - 30
# squared term of centered age
cps$ageC2 <- cps$ageC * cps$ageC
# or this code:
cps$ageC2 <- cps$ageC^2
summary(cps$ageC2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 4.00 8.41 16.00 25.00
Based on the lab demonstration, reproduce lm3 to lm6 below. Find the summary regression results. (4 points)
##
## Call:
## lm(formula = ahe ~ female + ageC + ageC2 + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.62 -6.66 -1.92 4.28 83.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.1827 0.2524 72.03 <0.0000000000000002 ***
## female -4.1389 0.2659 -15.56 <0.0000000000000002 ***
## ageC 0.5086 0.0479 10.63 <0.0000000000000002 ***
## ageC2 -0.0252 0.0177 -1.43 0.15
## bachelor 9.8488 0.2624 37.53 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7093 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 416 on 4 and 7093 DF, p-value: <0.0000000000000002
##
## Call:
## lm(formula = ahe ~ female + log_age + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.82 -6.66 -1.89 4.30 83.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.240 4.484 -7.86 0.0000000000000045 ***
## female -4.141 0.266 -15.57 < 0.0000000000000002 ***
## log_age 15.669 1.323 11.85 < 0.0000000000000002 ***
## bachelor 9.848 0.262 37.53 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.9 on 7094 degrees of freedom
## Multiple R-squared: 0.19, Adjusted R-squared: 0.189
## F-statistic: 554 on 3 and 7094 DF, p-value: <0.0000000000000002
##
## Call:
## lm(formula = log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7147 -0.2862 0.0129 0.3044 2.0602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.767359 0.011066 250.07 <0.0000000000000002 ***
## female -0.176939 0.011657 -15.18 <0.0000000000000002 ***
## ageC 0.022591 0.002098 10.77 <0.0000000000000002 ***
## ageC2 -0.001875 0.000776 -2.42 0.016 *
## bachelor 0.462164 0.011504 40.18 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 7093 degrees of freedom
## Multiple R-squared: 0.209, Adjusted R-squared: 0.208
## F-statistic: 467 on 4 and 7093 DF, p-value: <0.0000000000000002
##
## Call:
## lm(formula = log_ahe1 ~ female + log_age + bachelor, data = cps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7061 -0.2870 0.0105 0.3029 2.0631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3118 0.1966 1.59 0.11
## female -0.1771 0.0117 -15.19 <0.0000000000000002 ***
## log_age 0.7185 0.0580 12.39 <0.0000000000000002 ***
## bachelor 0.4621 0.0115 40.16 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 7094 degrees of freedom
## Multiple R-squared: 0.208, Adjusted R-squared: 0.208
## F-statistic: 621 on 3 and 7094 DF, p-value: <0.0000000000000002
Check the adjusted R-squared values for lm3 to lm6.
## [1] 0.189
## [1] 0.189
## [1] 0.208
## [1] 0.208
Check the AIC and BIC values for lm3 to lm6. (1 point)
## df AIC
## lm3 6 54083
## lm4 5 54082
## lm5 6 9688
## lm6 5 9690
## df BIC
## lm3 6 54125
## lm4 5 54116
## lm5 6 9729
## lm6 5 9724
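The fit statistics above can be obtained along the following lines. This is a sketch on simulated data, not the original chunk: the data frame `demo` and the numbers used to simulate it are assumptions, while the four model formulas are copied from the `Call:` lines in the regression output.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the cps data (hypothetical values)
demo <- data.frame(
  female   = rbinom(n, 1, 0.5),
  bachelor = rbinom(n, 1, 0.4),
  age      = sample(25:34, n, replace = TRUE)
)
demo$ageC    <- demo$age - 30
demo$ageC2   <- demo$ageC^2
demo$log_age <- log(demo$age)
demo$ahe     <- exp(2.8 - 0.18 * demo$female + 0.46 * demo$bachelor +
                    0.02 * demo$ageC + rnorm(n, sd = 0.5))
demo$log_ahe1 <- log(demo$ahe)

# Model formulas as shown in the regression output
lm3 <- lm(ahe ~ female + ageC + ageC2 + bachelor, data = demo)
lm4 <- lm(ahe ~ female + log_age + bachelor, data = demo)
lm5 <- lm(log_ahe1 ~ female + ageC + ageC2 + bachelor, data = demo)
lm6 <- lm(log_ahe1 ~ female + log_age + bachelor, data = demo)

# Adjusted R-squared for each model
models <- list(lm3 = lm3, lm4 = lm4, lm5 = lm5, lm6 = lm6)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
round(adj_r2, 3)

# AIC and BIC tables (lower values indicate a better trade-off
# between fit and complexity)
aic_tab <- AIC(lm3, lm4, lm5, lm6)
bic_tab <- BIC(lm3, lm4, lm5, lm6)
aic_tab
bic_tab
```

The `df` column counts the estimated parameters, including the intercept and the residual variance, which is why the models with four predictors report 6 and those with three report 5.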
Which model among lm3 to lm6 provides the best fit based on your evaluation of the adjusted R-squared, AIC, and BIC values? Justify your choice. (2 points)
Response:
Models 5 and 6 have the same adjusted R-squared value of 0.208, which implies that each model explains 20.8 percent of the variance in the logged average hourly earnings, the DV.
Model 6 (log-log regression), which has the lowest BIC value, seems to provide the best fit.
However, one can also argue that the quadratic age term in Model 5 is theoretically more meaningful for investigating the nonlinear relationship between age and earnings. You can therefore argue for either Model 5 or Model 6, as long as you justify why it provides the better fit. (This thought process mimics what can happen in reality when an obvious answer doesn't exist!)
Note: BIC penalizes model complexity more heavily than AIC and is therefore the more conservative measure. When AIC and BIC disagree, as they do for lm5 and lm6 (AIC favors lm5, BIC favors lm6), BIC is the better indicator of fit for multiple regression models. Keep in mind that AIC and BIC are directly comparable only between models with the same dependent variable, so lm3 should be compared with lm4 (DV: ahe) and lm5 with lm6 (DV: log_ahe1). Another consideration is theoretical: a particular model might make more sense based on the prior literature and the hypothesized relationships. When two models have a similar fit, you might opt for the one that makes better theoretical sense (lm5 with the quadratic term ageC2 vs. lm6 with logged age).
Q9. Is there any multicollinearity issue for the final model you choose? Find the VIF for lm3 to lm6. (2 points)
Response: No, there is no multicollinearity problem, given that the variance inflation factor (VIF) values for the final model (either Model 5 or Model 6) are low, all less than 2.
(Centered age (ageC) and its quadratic term (ageC2) do not pose a multicollinearity problem: although ageC2 is a function of ageC, their relationship is not linear, and their VIF is only 1.13. Multicollinearity is a concern when two independent variables are highly or perfectly linearly correlated.)
## female ageC ageC2 bachelor
## 1.02 1.13 1.13 1.02
## female log_age bachelor
## 1.02 1.00 1.02
## female ageC ageC2 bachelor
## 1.02 1.13 1.13 1.02
## female log_age bachelor
## 1.02 1.00 1.02
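The VIF values above were presumably produced with `vif()` from the car package, which is loaded at the top of the lab. To show what that number measures, here is a base-R sketch that computes the VIF by hand as 1 / (1 - R-squared), where R-squared comes from regressing each predictor on the others. The simulated data frame `demo`, the helper `vif_manual`, and the uncentered variable `age2` are all assumptions added for illustration.

```r
set.seed(1)
n <- 500

# Simulated predictors (hypothetical stand-in for the cps data)
demo <- data.frame(
  female   = rbinom(n, 1, 0.5),
  bachelor = rbinom(n, 1, 0.4),
  age      = sample(25:34, n, replace = TRUE)
)
demo$age2  <- demo$age^2       # square of raw (uncentered) age
demo$ageC  <- demo$age - 30    # age centered at 30
demo$ageC2 <- demo$ageC^2      # square of centered age

# VIF for each predictor: regress it on the remaining predictors
# and convert the resulting R-squared into 1 / (1 - R^2)
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_centered <- vif_manual(demo, c("female", "ageC", "ageC2", "bachelor"))
vif_raw      <- vif_manual(demo, c("female", "age", "age2", "bachelor"))

round(vif_centered, 2)
round(vif_raw, 2)
```

In this simulated setting, the centered terms should give small VIFs, while raw age and its square are almost perfectly linearly related and give very large ones. That contrast is one practical reason for centering age before squaring it.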
Based on the residual plots for the final model, comment on the residual patterns in terms of randomness, normality and homoscedasticity. Discuss if it might violate any of the OLS assumptions. (3 points)
Response: The residual plots for Model 5 or Model 6 show that the residuals are randomly scattered around 0 (see the Residuals vs Fitted plot). The residuals mostly fall on the straight quantile-quantile (Q-Q) line, except for some in the lower tail (see the Normal Q-Q plot). The residuals appear largely homoscedastic: the residual variance shows no systematic change as the fitted values increase (see the Residuals vs Fitted plot or the Scale-Location plot). Based on these residual diagnostics, the model does not appear to violate any of the OLS assumptions.
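The plots themselves are not reproduced here. The lab loads ggfortify "for autoplot", so the originals were presumably drawn with `autoplot()` on the fitted model. As a hedged alternative that needs only base R, the three plots named in the response can be drawn with `plot()` on an `lm` object. The data frame `demo` below is simulated and `fit` is a hypothetical model name.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the cps data (hypothetical values)
demo <- data.frame(
  female   = rbinom(n, 1, 0.5),
  bachelor = rbinom(n, 1, 0.4),
  ageC     = sample(25:34, n, replace = TRUE) - 30
)
demo$ageC2    <- demo$ageC^2
demo$log_ahe1 <- 2.8 - 0.18 * demo$female + 0.46 * demo$bachelor +
                 0.02 * demo$ageC + rnorm(n, sd = 0.5)

fit <- lm(log_ahe1 ~ female + ageC + ageC2 + bachelor, data = demo)

# which = 1: Residuals vs Fitted (randomness around 0)
# which = 2: Normal Q-Q (normality of residuals)
# which = 3: Scale-Location (constant variance)
op <- par(mfrow = c(1, 3))
plot(fit, which = 1:3)
par(op)
```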
Interpret the significance, direction, and size of the coefficients for the intercept and all the independent variables for the final (best fit) model you choose. Tips: Log transformations and polynomial terms affect the way you interpret certain coefficients. Review the Week 8 lecture. Check whether the DV and/or the IVs are log-transformed.
Response:
The use of logarithmic transformations can affect the interpretation of the estimated coefficients and require some additional care in reporting and discussing the results.
If you choose Model 5: Log-linear regression (logged ahe with ageC and a squared term of ageC)
The regression results of Model 5 showed that sex (female), age (ageC), the quadratic term of age (ageC2), and having a bachelor’s degree (bachelor) were significant predictors of the logged average hourly earnings (log_ahe1) at the 0.05 level.
When only Y is log-transformed: \(ln(Y)\) ~ \(X_1\): A one-unit increase in the IV is associated with an approximate (100*\(\beta\)) percent increase/decrease in the DV, holding all other variables constant.
beta of female = -0.1769: The average hourly earnings of female workers (female = 1) were predicted to be approximately 17.69 percent lower than those of male workers (female = 0), holding other factors constant.
beta of bachelor = 0.4622: The average hourly earnings of workers who had a bachelor’s degree (bachelor = 1) were predicted to be approximately 46.22 percent higher than those of workers who did not earn a bachelor’s degree (bachelor = 0), holding other factors constant.
beta of ageC = 0.0226; beta of ageC2 = -0.00188: Holding other factors constant, the regression results suggest that at age 30 (ageC = 0), a one-year increase in age is associated with approximately a 2.26 percent increase in the average hourly earnings. The negative coefficient for ageC2 indicates that this positive relationship between age and earnings weakens as age increases.
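The 100*\(\beta\) rule is an approximation that works best for small coefficients; the exact percent difference implied by a coefficient on a logged DV is 100*(exp(\(\beta\)) - 1). The sketch below, which is an addition and not part of the lab code, applies both versions to the Model 5 coefficients copied from the output above. It also evaluates the slope for age, \(\beta_{ageC} + 2\beta_{ageC2} \cdot ageC\), at a few ages. The object names are hypothetical.

```r
# Coefficients copied from the lm5 regression output
b <- c(female = -0.176939, ageC = 0.022591,
       ageC2 = -0.001875, bachelor = 0.462164)

# Dummy variables: approximate (100 * beta) vs. exact percent difference
dummies <- b[c("female", "bachelor")]
pct <- cbind(approx = 100 * dummies,
             exact  = 100 * (exp(dummies) - 1))
round(pct, 2)

# Age: slope of log earnings with respect to age at a given centered age
age_slope <- function(ageC) b[["ageC"]] + 2 * b[["ageC2"]] * ageC

# Approximate percent change per additional year at ages 25, 30, and 34
round(100 * age_slope(c(-5, 0, 4)), 2)

# Age at which the fitted quadratic curve peaks (centering point is 30)
peak_ageC <- -b[["ageC"]] / (2 * b[["ageC2"]])
peak_ageC + 30
```

For the larger bachelor coefficient, the exact figure is noticeably higher than the approximation, so it is worth stating which version is being reported. The implied peak of the age curve lies slightly above the oldest age in the sample (34), so within the observed range the age effect stays positive but shrinks.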
If you choose Model 6: Log-log regression (logged ahe with logged age and other IVs)
The regression results of Model 6 showed that sex (female), age (log_age), and having a bachelor’s degree (bachelor) were significant predictors of the logged average hourly earnings (log_ahe1) at the 0.05 level.
When only Y is log-transformed: \(ln(Y)\) ~ \(X_1\): A one-unit increase in the IV is associated with an approximate (100*\(\beta\)) percent increase/decrease in the DV, holding all other variables constant.
beta of female = -0.1771: The average hourly earnings of female workers (female = 1) were predicted to be approximately 17.7 percent lower than those of male workers (female = 0), holding other factors constant. (Note: This is the log-linear interpretation, since female is not log-transformed and only y is logged.)
beta of bachelor = 0.4621: The average hourly earnings of workers who had a bachelor’s degree (bachelor = 1) were predicted to be approximately 46.2 percent higher than those of workers who did not earn a bachelor’s degree (bachelor = 0), holding other factors constant. (Note: This is the log-linear interpretation, since bachelor is not log-transformed and only y is logged.)
When both X and Y are log-transformed: \(ln(Y)\) ~ \(ln(X_1)\): A one percent increase in X is associated with an increase/decrease in Y of (coefficient) percent, holding all other variables constant.
beta of log_age = 0.7185: A one percent increase in age was associated with a 0.72 percent increase in the average hourly earnings, holding other factors constant. (Note: Given log(x) and log(y), the interpretation of a log-log regression coefficient uses percent to describe the changes in both x and y, instead of a unit change.)
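As a quick check on the log-log interpretation, the predicted percent change in Y for a p percent change in X is 100*((1 + p/100)^\(\beta\) - 1), which is close to \(\beta\) * p when p is small. The sketch below is an addition that applies this to the log_age coefficient from the Model 6 output; the function name `pct_change_y` is hypothetical.

```r
# Coefficient for log_age copied from the lm6 regression output
b_log_age <- 0.7185

# Exact percent change in earnings for a given percent change in age
pct_change_y <- function(pct_x) 100 * ((1 + pct_x / 100)^b_log_age - 1)

# Effect of a 1 percent and a 10 percent increase in age
round(pct_change_y(c(1, 10)), 2)
```

For a 1 percent increase in age, the exact figure is essentially the coefficient itself (about 0.72 percent). For a larger change such as 10 percent, it falls slightly below 10 times the coefficient.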