A Reminder

  • Keep # comments inside code chunks short.

  • Write paragraphs above or below the code chunks.

  • Do not round intermediate calculations. Round your final calculation to 2 significant digits.

Instructions

Answer the following questions using the appropriate dataset and codebook. For each question, provide (1) your code, (2) the R output, and (3) the answer in complete sentences.

Lab 6 Assignment

Q1

Load the necessary libraries for this lab assignment.

# install.packages("stargazer")
options(scipen=999, digits = 3)
library(ggplot2)
library(ggfortify) # for autoplot
library(car)
## Loading required package: carData

Q2

Import the dataset named “CPS2015.csv” and assign it a name.

cps <- read.csv("CPS2015.csv")

Q3

Following the lab example, create a log-transformed variable for the dependent variable ahe and assign a new name. Remember to replace 0 with 1 using the ifelse function before doing the log transformation. (1 point)

  • correct code using ifelse to recode ahe (0.5 points)
  • correct code for logging ahe (0.5 points)
summary(cps$ahe)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    13.0    18.3    21.2    26.4   105.8
cps$ahe1 <- ifelse(cps$ahe <= 0, 1, cps$ahe)
summary(cps$ahe1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    13.0    18.3    21.2    26.4   105.8
cps$log_ahe1 <- log(cps$ahe1)
summary(cps$log_ahe1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.56    2.91    2.91    3.27    4.66
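
To see why the zero values are recoded before logging, here is a minimal sketch on a made-up vector (`wage` is hypothetical and not taken from CPS2015.csv). `log(0)` is `-Inf`, which cannot be used as a regression outcome, whereas recoding 0 to 1 gives `log(1) = 0`.

```r
# Hypothetical hourly wages; the 0 mimics the minimum of cps$ahe
wage <- c(0, 13.0, 18.3, 26.4)

log(wage)[1]                          # log(0) is -Inf

wage1 <- ifelse(wage <= 0, 1, wage)   # recode non-positive values to 1
log_wage1 <- log(wage1)               # log(1) = 0, so the first value becomes 0
log_wage1
```

All values of `log_wage1` are finite, so the variable can be used as a dependent variable.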

Q4

For the variable age, create three new variables as follows: (2 points)

  • Correct codes for generating new variables
  • Assign different names to the transformed variables of age
  1. Create a log-transformed variable of age (0.5 points)
  2. Create a centered age variable (+ finding the mean) (1 point)
  3. Create a quadratic term for the centered age variable (0.5 points)
# logged age
summary(cps$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    25.0    27.0    30.0    29.6    32.0    34.0
cps$log_age <- log(cps$age)

# centered age 
  # for mean, we can use the exact value or the closest discrete value = 30
mean(cps$age)
## [1] 29.6
cps$ageC <- cps$age - mean(cps$age)
summary(cps$ageC)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -4.63   -2.63    0.37    0.00    2.37    4.37
# alternatively, center at 30 (overwrites ageC; used in the models below)
cps$ageC <- cps$age - 30

# squared term of centered age
cps$ageC2 <- cps$ageC * cps$ageC 

# or this code:
cps$ageC2 <- cps$ageC^2

summary(cps$ageC2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    4.00    8.41   16.00   25.00

Q5

Based on the lab demonstration, reproduce lm3 to lm6 below. Find the summary regression results. (4 points)

  • correct codes for producing lm3-6 models and summary() (1 point for each model)
# lm3
lm3 <- lm(ahe ~ female + ageC + ageC2 + bachelor, data = cps)
summary(lm3)
## 
## Call:
## lm(formula = ahe ~ female + ageC + ageC2 + bachelor, data = cps)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.62  -6.66  -1.92   4.28  83.86 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  18.1827     0.2524   72.03 <0.0000000000000002 ***
## female       -4.1389     0.2659  -15.56 <0.0000000000000002 ***
## ageC          0.5086     0.0479   10.63 <0.0000000000000002 ***
## ageC2        -0.0252     0.0177   -1.43                0.15    
## bachelor      9.8488     0.2624   37.53 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.9 on 7093 degrees of freedom
## Multiple R-squared:  0.19,   Adjusted R-squared:  0.189 
## F-statistic:  416 on 4 and 7093 DF,  p-value: <0.0000000000000002
# lm4
lm4 <- lm(ahe ~ female + log_age + bachelor, data = cps)
summary(lm4)
## 
## Call:
## lm(formula = ahe ~ female + log_age + bachelor, data = cps)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.82  -6.66  -1.89   4.30  83.89 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  -35.240      4.484   -7.86   0.0000000000000045 ***
## female        -4.141      0.266  -15.57 < 0.0000000000000002 ***
## log_age       15.669      1.323   11.85 < 0.0000000000000002 ***
## bachelor       9.848      0.262   37.53 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.9 on 7094 degrees of freedom
## Multiple R-squared:  0.19,   Adjusted R-squared:  0.189 
## F-statistic:  554 on 3 and 7094 DF,  p-value: <0.0000000000000002
# lm5
lm5 <- lm(log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps)
summary(lm5)
## 
## Call:
## lm(formula = log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7147 -0.2862  0.0129  0.3044  2.0602 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.767359   0.011066  250.07 <0.0000000000000002 ***
## female      -0.176939   0.011657  -15.18 <0.0000000000000002 ***
## ageC         0.022591   0.002098   10.77 <0.0000000000000002 ***
## ageC2       -0.001875   0.000776   -2.42               0.016 *  
## bachelor     0.462164   0.011504   40.18 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 7093 degrees of freedom
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.208 
## F-statistic:  467 on 4 and 7093 DF,  p-value: <0.0000000000000002
# lm6
lm6 <- lm(log_ahe1 ~ female + log_age + bachelor, data = cps)
summary(lm6)
## 
## Call:
## lm(formula = log_ahe1 ~ female + log_age + bachelor, data = cps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7061 -0.2870  0.0105  0.3029  2.0631 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   0.3118     0.1966    1.59                0.11    
## female       -0.1771     0.0117  -15.19 <0.0000000000000002 ***
## log_age       0.7185     0.0580   12.39 <0.0000000000000002 ***
## bachelor      0.4621     0.0115   40.16 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 7094 degrees of freedom
## Multiple R-squared:  0.208,  Adjusted R-squared:  0.208 
## F-statistic:  621 on 3 and 7094 DF,  p-value: <0.0000000000000002

Q6

Check the adjusted R-squared values for lm3 to lm6.

summary(lm3)$adj.r.squared 
## [1] 0.189
summary(lm4)$adj.r.squared 
## [1] 0.189
summary(lm5)$adj.r.squared 
## [1] 0.208
summary(lm6)$adj.r.squared 
## [1] 0.208

Q7

Check the AIC and BIC values for lm3 to lm6. (1 point)

  • correct codes for AIC and BIC (0.5 points each)
AIC(lm3,lm4,lm5,lm6)
##     df   AIC
## lm3  6 54083
## lm4  5 54082
## lm5  6  9688
## lm6  5  9690
BIC(lm3,lm4,lm5,lm6)
##     df   BIC
## lm3  6 54125
## lm4  5 54116
## lm5  6  9729
## lm6  5  9724

Q8

Which model among lm3 to lm6 provides the best fit based on your evaluation of the adjusted R-squared, AIC, and BIC values? Justify your choice. (2 points)

  • The best fit model should be lm5 or lm6 (0.5 points)
  • Justification of your choice (one model) based on (1) adjusted r2, and (2) AIC and BIC (0.5 points each)

Response:

  • Models 5 and 6 have the same adjusted R-squared value of 0.208, which implies that each model explains 20.8 percent of the variance in the logged average hourly earnings, the DV.

  • Model 6 (log-log regression) has the lowest BIC value (9724 vs. 9729 for Model 5), so it seems to have the better fit by that criterion. Model 5 has a slightly lower AIC (9688 vs. 9690), so the two criteria do not agree.

  • However, one can also argue that the quadratic term of age in Model 5 is theoretically more meaningful for investigating the nonlinear relationship between age and earnings. You can therefore argue for either Model 5 or Model 6 as long as you justify which one has the better fit. (This thought process mimics what can happen in reality, when an obvious answer does not exist!)

Note: BIC is a more conservative measure than AIC because it penalizes model complexity more heavily. Hence, BIC is the better indicator of fit for multiple regression models when the AIC and BIC values are inconclusive (as for lm5 and lm6). Keep in mind that AIC and BIC are directly comparable only between models with the same dependent variable, so the comparison that matters here is lm5 vs. lm6 (both use log_ahe1). Another consideration is theoretical: a particular model might make more sense based on the prior literature and the hypothesized relationships. When two models have a similar fit, you might opt for the one (lm5 with the quadratic term ageC2 vs. lm6 with logged age) that makes better theoretical sense.
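
The penalty difference between AIC and BIC can be checked by hand. The sketch below uses the built-in `mtcars` data purely for illustration (it is not the CPS data, and `fit`, `aic_manual`, and `bic_manual` are names made up for this example). It reproduces `AIC()` and `BIC()` from the log-likelihood.

```r
# Built-in mtcars data, used only for illustration (not the CPS data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)       # maximized log-likelihood
k  <- attr(ll, "df")    # estimated parameters: coefficients + error variance
n  <- nobs(fit)         # number of observations

aic_manual <- -2 * as.numeric(ll) + 2 * k        # penalty of 2 per parameter
bic_manual <- -2 * as.numeric(ll) + log(n) * k   # penalty of log(n) per parameter

c(manual = aic_manual, builtin = AIC(fit))
c(manual = bic_manual, builtin = BIC(fit))
```

Here `k` counts the regression coefficients plus the error variance, which is why the `df` column above shows 6 for lm3 and lm5 and 5 for lm4 and lm6. With roughly 7,100 observations, log(n) is about 8.9, so each extra parameter costs far more under BIC than the 2 it costs under AIC.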

Q9

Is there any multicollinearity issue for the final model you choose? Find the VIF for lm3 to lm6. (2 points)

  • correct codes for vif (1 point)
  • no multicollinearity problem for final model (0.5 points)
  • description of the vif value (0.5 points)

Response: No, there is no multicollinearity problem given that the variance inflation factor (vif) values for the final model (either Model 5 or Model 6) are low and less than 2.

(Including both centered age (ageC) and its quadratic term (ageC2) does not create a multicollinearity problem here. ageC2 is computed from ageC, so the two are not independent, but their relationship is nonlinear, and their linear correlation after centering is weak, as the VIF of 1.13 for each shows. Multicollinearity is a concern when two or more independent variables are highly or perfectly linearly correlated.)

vif(lm3)
##   female     ageC    ageC2 bachelor 
##     1.02     1.13     1.13     1.02
vif(lm4)
##   female  log_age bachelor 
##     1.02     1.00     1.02
vif(lm5)
##   female     ageC    ageC2 bachelor 
##     1.02     1.13     1.13     1.02
vif(lm6)
##   female  log_age bachelor 
##     1.02     1.00     1.02
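
The low VIF for ageC and ageC2 is largely a result of centering. A minimal sketch, assuming one observation per year of age from 25 to 34 (an illustrative grid, not the actual CPS age distribution) and a model containing only these two predictors, in which case VIF reduces to 1 / (1 - r^2):

```r
age  <- 25:34        # same range as cps$age, one value per year (illustrative)
ageC <- age - 30     # centered at 30, as in the lab

r_raw      <- cor(age,  age^2)    # raw age vs. its square
r_centered <- cor(ageC, ageC^2)   # centered age vs. its square

# With only two predictors, VIF = 1 / (1 - r^2)
vif_raw      <- 1 / (1 - r_raw^2)
vif_centered <- 1 / (1 - r_centered^2)

round(c(r_raw = r_raw, r_centered = r_centered), 3)
round(c(vif_raw = vif_raw, vif_centered = vif_centered), 2)
```

Uncentered age and its square are almost perfectly correlated, so their VIF is very large, while the centered pair has a VIF close to 1, in line with the values reported by `vif()` above.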

Q10

Based on the residual plots for the final model, comment on the residual patterns in terms of randomness, normality and homoscedasticity. Discuss if it might violate any of the OLS assumptions. (3 points)

  • description of residual plots: (1) randomness, (2) normality, (3) homoscedasticity (0.5 points each)
  • correct codes on showing the residual plot for the final model (1 point)
  • conclusion of no assumption violation (0.5 points)
  • reminder: indicate the corresponding plots for residual diagnostics

Response: The residual plots of Model 5 or Model 6 show that the residuals are randomly scattered around 0 (see the Residuals vs Fitted plot). The residuals mostly stay on the straight Quantile-Quantile (Q-Q) line, except for some in the lower tail (see the Normal Q-Q plot). The residuals seem to largely follow a homoscedastic pattern, as there is no systematic change in the residual variance (i.e., the variance is roughly constant) as the fitted value increases (see the Residuals vs Fitted plot or the Scale-Location plot). Based on the residual diagnostics, the model does not seem to violate any of the OLS assumptions.

# either lm5 or lm6
autoplot(lm5)

autoplot(lm6)

Q11

Interpret the significance, directions and size of the coefficients for the intercept and all the independent variables for the final (best fit) model you choose. Tips: Log-transformation and polynomial terms will affect the way you interpret certain coefficients. Review Week 8 lecture. Check if the DV and/or the IVs are log-transformed.

  • indicate significant variables at the 0.05 level (1 point)
  • interpret the coefficients for significant variables (2 points in total)

Response:

The use of logarithmic transformations can affect the interpretation of the estimated coefficients and require some additional care in reporting and discussing the results.

If you choose Model 5: Log-Linear Regression–logged ahe with ageC and a squared term of ageC

The regression results of Model 5 showed that sex (female), age (ageC), the quadratic term of age (ageC2) and having a bachelor’s degree (bachelor) were significant predictors of the logged average hourly earnings (log_ahe1) at the 0.05 level.

  1. Intercept = 2.7674: The intercept represents the predicted logged average hourly earnings when all independent variables are equal to zero, that is, for a male worker (female = 0) without a bachelor’s degree (bachelor = 0) at age 30 (ageC = 0). Age was centered at 30, the whole year closest to the mean of 29.6. Since the intercept is estimated on the logarithmic scale, we need to exponentiate it to obtain the predicted value on the original scale. In this case, exp(2.7674) gives estimated average hourly earnings of about $15.9 for this reference group.

When only Y is log-transformed: ln(Y) ~ X1 : A one-unit increase in the IV is associated with an approximate (100*\(\beta\)) percent increase/decrease in the DV, while all other variables are held constant. The approximation is close when \(\beta\) is small; the exact change is 100*(exp(\(\beta\)) - 1) percent.

  1. beta of female = -0.1769: The average hourly earnings of female workers (female = 1) were predicted to be approximately 17.69 percent less than those of male workers (female = 0), holding other factors constant.

  2. beta of bachelor = 0.4622: The average hourly earnings of workers who had a bachelor’s degree (bachelor = 1) were predicted to be approximately 46.22 percent more than those of workers who did not earn a bachelor’s degree (bachelor = 0), holding other factors constant. (Because this coefficient is large, the exact figure, 100*(exp(0.4622) - 1), is about 59 percent.)

  3. beta of ageC = 0.0226; beta of ageC2 = -0.00188: Holding other factors constant, at age 30 (ageC = 0) a one-year increase in age is associated with an approximate 2.26 percent increase in the average hourly earnings. The negative coefficient for ageC2 indicates that this positive relationship weakens as age increases: the slope on the log scale is 0.0226 + 2*(-0.00188)*ageC, which shrinks as ageC grows.
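
The joint effect of ageC and ageC2 can be made concrete with a short sketch. It copies the two coefficients from the `summary(lm5)` output and assumes ageC = age - 30; the helper `marginal_pct()` and the name `turning_age` are hypothetical, introduced only for this illustration.

```r
# Coefficients copied from the summary(lm5) output
b_ageC  <- 0.022591
b_ageC2 <- -0.001875

# Approximate percent change in ahe per extra year at a given age:
# derivative of b_ageC * ageC + b_ageC2 * ageC^2, with ageC = age - 30
marginal_pct <- function(age) 100 * (b_ageC + 2 * b_ageC2 * (age - 30))

marginal_pct(c(25, 30, 34))

# Age at which the fitted curve would peak (slope equal to zero)
turning_age <- 30 - b_ageC / (2 * b_ageC2)
turning_age
```

Under these assumptions the slope falls from roughly 4.1 percent per year at age 25 to about 0.8 percent at age 34, and the implied peak is near age 36, which lies outside the sample range of 25 to 34. Within the data, earnings therefore keep rising with age, but at a decreasing rate.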


If you choose Model 6: Log-log Regression–logged ahe with logged age and other IVs

The regression results of Model 6 showed that sex (female), age (log_age) and having a bachelor’s degree (bachelor) were significant predictors of the logged average hourly earnings (log_ahe1) at the 0.05 level.

  1. Intercept = 0.3118: The intercept represents the predicted logged average hourly earnings when all independent variables are equal to zero. Since the intercept is estimated on the logarithmic scale, we need to exponentiate it to obtain the predicted value on the original scale. In this case, exp(0.3118) gives estimated average hourly earnings of $1.37 for a male worker without a bachelor’s degree whose log_age is 0 (age = 1). Because that age lies far outside the sample range of 25 to 34, the value has no practical meaning, and the intercept is not statistically significant at the 0.05 level (p = 0.11).

When only Y is log-transformed: \(ln(Y)\) ~ \(X_1\): A one-unit increase in the IV is associated with an approximate (100*\(\beta\)) percent increase/decrease in the DV, while all other variables are held constant.

  1. beta of female = -0.1771: The average hourly earnings of female workers (female = 1) were predicted to be approximately 17.71 percent less than those of male workers (female = 0), holding other factors constant. (Note: This is the interpretation for log-linear regression since we only have log(y))

  2. beta of bachelor = 0.4621: The average hourly earnings of workers who had a bachelor’s degree (bachelor = 1) were predicted to be approximately 46.21 percent more than those of workers who did not earn a bachelor’s degree (bachelor = 0), holding other factors constant. (Note: This is the interpretation for log-linear regression since we only have log(y))
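
The 100*\(\beta\) rule is an approximation that works best for small coefficients. As a hedged check, the sketch below compares it with the exact percent change, 100*(exp(\(\beta\)) - 1), using the two dummy coefficients copied from the `summary(lm6)` output. The helper name `pct_exact` is hypothetical.

```r
# Exact percent change in Y for a one-unit change in X when only Y is logged
pct_exact <- function(b) 100 * (exp(b) - 1)

# Coefficients copied from the summary(lm6) output
b <- c(female = -0.1771, bachelor = 0.4621)

round(rbind(approx = 100 * b, exact = pct_exact(b)), 1)
```

For female the two versions are close (about 17.7 vs. 16.2 percent lower), but for bachelor the gap is wide (about 46.2 vs. 58.7 percent higher), so it is worth stating which version is being reported.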

When both X & Y are log-transformed: \(ln(Y)\) ~ \(ln(X_1)\): A one percent increase in X is associated with an increase/decrease in Y by (coefficient) percent, holding all other variables constant.

  1. beta of log_age = 0.7185: A one percent increase in age was associated with a 0.72 percent increase in the average hourly earnings, holding other factors constant. (Note: Given log(x) and log(y), the interpretation for a log-log regression coefficient uses percent to describe both the changes in x and y, instead of a unit change)
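
The "one percent in X, \(\beta\) percent in Y" reading is also an approximation that holds for small changes. A minimal sketch, using the log_age coefficient copied from the `summary(lm6)` output and a hypothetical helper `pct_change_y`, shows the exact implied change for a given percent increase in age:

```r
# Coefficient copied from the summary(lm6) output
b_log_age <- 0.7185

# Exact percent change in Y when X increases by pct_x percent in a log-log model
pct_change_y <- function(pct_x) 100 * ((1 + pct_x / 100)^b_log_age - 1)

pct_change_y(c(1, 10))
```

A 1 percent increase in age gives a change very close to 0.72 percent, matching the interpretation above, while a 10 percent increase gives roughly 7.1 percent, slightly less than the 7.2 percent obtained by multiplying the coefficient by 10.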