QUESTION (1)

setwd("D:/econ7310/Research project 1")
library(readr)
EPH=read_csv("cps4_small.csv")
## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): wage, educ, exper, hrswk, married, female, metro, midwest, south, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach (EPH)
summary (cbind(wage,educ))
##       wage            educ     
##  Min.   : 1.97   Min.   : 0.0  
##  1st Qu.:11.25   1st Qu.:12.0  
##  Median :17.30   Median :13.0  
##  Mean   :20.62   Mean   :13.8  
##  3rd Qu.:25.63   3rd Qu.:16.0  
##  Max.   :76.39   Max.   :21.0
sd (wage)
## [1] 12.83472
sd(educ)
## [1] 2.711079
hist(educ, main = "Years of Education", xlab = "Years")

hist(wage, main = "Earnings per Hour", xlab = "Wage")

lnwage=log(wage)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'EPH':
## 
##     midwest
fig1=ggplot(EPH, aes(educ,lnwage))+ 
  geom_point(alpha=.75, size=1.5, color="cyan4") + 
  labs(title="Figure 1: Ln(wage) and Education",
       x="Education", y="Ln(Wage)")  + 
  theme(axis.title=element_text(family="serif"), 
        plot.title=element_text(hjust=0.5, family="serif", face="bold", size=10))  
print(fig1)

Comments: The scatter of points suggests a positive relationship between education and ln(wage).
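
As a quick check on this impression, the sample correlation between education and ln(wage) can be computed directly (a minimal sketch using the attached EPH variables and the lnwage vector defined above):

# sample correlation between years of education and log hourly wage
cor(educ, lnwage)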

QUESTION (2)

library(estimatr)
reg1=lm_robust(lnwage~educ, data=EPH, se_type="stata")
summary (reg1)
## 
## Call:
## lm_robust(formula = lnwage ~ educ, data = EPH, se_type = "stata")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)  1.60944   0.094739   16.99 4.782e-57  1.42353   1.7954 998
## educ         0.09041   0.006794   13.31 2.610e-37  0.07708   0.1037 998
## 
## Multiple R-squared:  0.1782 ,    Adjusted R-squared:  0.1774 
## F-statistic: 177.1 on 1 and 998 DF,  p-value: < 2.2e-16
  1. \[\widehat{{lnwage}}=\underset{(0.095)}{1.60944}+\underset{(0.0068)}{0.09041}*{educ}, R^2=0.1782 \]
fig2=ggplot(EPH, aes(educ,lnwage))+ 
  geom_point(alpha=.75, size=1.5, color="cyan4") + 
  labs(title="Figure 2: Ln(wage) and Education with Regression Line",
       x="Education", y="Ln(Wage)")  + 
  theme(axis.title=element_text(family="serif"), 
        plot.title=element_text(hjust=0.5, family="serif", face="bold", size=10))+
  geom_smooth(method="lm", se=FALSE, size=0.75, color="violetred3")
print(fig2)
## `geom_smooth()` using formula 'y ~ x'

  1. Interpretation: Average ln(wage) increases by 0.09041 for each additional year of education; that is, one more year of education is associated with an hourly wage that is roughly 9.04% higher.

The t-statistic for the slope coefficient is t = 0.0904/0.0068 = 13.31. This is greater in absolute value than the 1% critical value (2.58). Therefore, the null hypothesis that the slope is zero is rejected at the 1% level.
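
The same comparison can be reproduced in R from the estimates reported above (a minimal sketch; qt() gives the two-sided 1% critical value with 998 degrees of freedom):

0.09041 / 0.006794   # t statistic, about 13.3
qt(0.995, df = 998)  # 1% two-sided critical value, about 2.58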

  1. Two conditions generate omitted-variable bias here: first, weekly hours worked (hrswk) is a determinant of ln(wage); second, hrswk is correlated with the included regressor, education.

Intuitively, workers with more weekly hours tend to have higher earnings, partly because they are more likely to be full-time workers. Full-time workers tend both to work more hours and to have spent more years in education than those who work part-time or casually.

As a result, part of the effect of weekly hours on ln(wage) is attributed to educ, so the simple OLS estimate of b1 overstates the effect of education on ln(wage).
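
A rough way to check both conditions in the data is to look at the sample correlations (a sketch using the attached EPH variables):

# condition (1): hrswk should be related to the outcome ln(wage)
cor(hrswk, lnwage)
# condition (2): hrswk should be correlated with the included regressor educ
cor(hrswk, educ)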

reg2=lm_robust(lnwage~educ+hrswk, data=EPH, se_type="stata")
summary (reg2)
## 
## Call:
## lm_robust(formula = lnwage ~ educ + hrswk, data = EPH, se_type = "stata")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
## (Intercept) 1.299871   0.108424  11.989 4.931e-31 1.087106  1.51264 997
## educ        0.086787   0.006826  12.715 1.964e-34 0.073393  0.10018 997
## hrswk       0.008999   0.001697   5.302 1.409e-07 0.005669  0.01233 997
## 
## Multiple R-squared:  0.2036 ,    Adjusted R-squared:  0.202 
## F-statistic: 106.5 on 2 and 997 DF,  p-value: < 2.2e-16

\[\widehat{{lnwage}}=\underset{(0.108)}{1.29987}+\underset{(0.0068)}{0.08679}*{educ}+\underset{(0.0017)}{0.00900}*{hrswk}, R^2=0.2036 \]

(1) Average ln(wage) increases by 0.086787 for each additional year of education, holding hrswk constant; that is, an extra year of education is associated with an hourly wage roughly 8.7% higher. (2) Average ln(wage) increases by 0.008999 for each additional weekly hour worked, holding educ constant, i.e., roughly 0.9% higher hourly wages. The reg2 result fits the data better than reg1, with the R-squared rising from 0.1782 to 0.2036, although R-squared necessarily rises when a regressor such as hrswk is added.
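
Because R-squared never falls when a regressor is added, the adjusted R-squared is the fairer comparison; it can be pulled from the fitted objects directly (a sketch, assuming the adj.r.squared element that estimatr stores in lm_robust output):

c(reg1 = reg1$adj.r.squared, reg2 = reg2$adj.r.squared)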

QUESTION (3)

reg3=lm_robust(lnwage~educ+hrswk+exper+married+female+metro+midwest+south+west+black+asian, data=EPH, se_type="stata")
summary (reg3)
## 
## Call:
## lm_robust(formula = lnwage ~ educ + hrswk + exper + married + 
##     female + metro + midwest + south + west + black + asian, 
##     data = EPH, se_type = "stata")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)  CI Lower  CI Upper  DF
## (Intercept)  1.178749   0.124795  9.4455 2.470e-20  0.933856  1.423642 988
## educ         0.092325   0.006786 13.6044 9.232e-39  0.079008  0.105643 988
## hrswk        0.005954   0.001635  3.6413 2.853e-04  0.002745  0.009162 988
## exper        0.005458   0.001350  4.0425 5.700e-05  0.002809  0.008108 988
## married      0.076673   0.032427  2.3645 1.825e-02  0.013039  0.140306 988
## female      -0.208679   0.032086 -6.5038 1.242e-10 -0.271643 -0.145715 988
## metro        0.175411   0.037296  4.7032 2.926e-06  0.102222  0.248601 988
## midwest     -0.097281   0.043003 -2.2622 2.390e-02 -0.181668 -0.012894 988
## south       -0.057424   0.045254 -1.2689 2.048e-01 -0.146229  0.031380 988
## west         0.009633   0.045410  0.2121 8.320e-01 -0.079477  0.098744 988
## black       -0.104145   0.051316 -2.0295 4.268e-02 -0.204847 -0.003443 988
## asian       -0.063372   0.088579 -0.7154 4.745e-01 -0.237197  0.110454 988
## 
## Multiple R-squared:  0.2802 ,    Adjusted R-squared:  0.2722 
## F-statistic: 32.69 on 11 and 988 DF,  p-value: < 2.2e-16
  1. The 95% confidence interval for the slope coefficient on educ is 0.0923 ± 1.96 × 0.0068, or [0.079, 0.106]. This interval does not include β1 = 0, so the estimated slope is significantly different from 0 at the 5% level.

To test the hypothesis that one additional year of education increases the hourly wage by 12%, we use the reg3 coefficient estimate and its standard error to test H0 : β1 = 0.12 against H1 : β1 ≠ 0.12. Since \[\left|\frac{\hat{\beta}_1-0.12}{SE(\hat{\beta}_1)}\right|=\left|\frac{0.0923-0.12}{0.0068}\right|=4.07>2.576\] the t statistic exceeds the 1% critical value in absolute value, so we reject the null hypothesis: the data are not consistent with the claim that one additional year of education increases the hourly wage by 12%, at the 1% significance level.
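
The same calculation can be reproduced in R from the reg3 estimates (a minimal sketch; qt() gives the two-sided 1% critical value with 988 degrees of freedom):

(0.092325 - 0.12) / 0.006786   # t statistic, about -4.08
qt(0.995, df = 988)            # 1% two-sided critical value, about 2.58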

  1. female: ln(wage) is lower by 0.209 for female workers, holding all other variables constant; that is, women's average hourly wage is roughly 21% lower than that of otherwise comparable men.

Moreover, being female reduces the average hourly wage by roughly 21%, while one additional year of education increases it by roughly 9%. Being female has a significantly negative effect on hourly earnings at the 5% level. The 95% confidence interval for the female coefficient is -0.2087 ± 1.96 × 0.0321, or [-0.2716, -0.1458]; this interval does not include 0, so the estimate is significantly different from 0 at the 5% level. Equivalently, the t-statistic is -0.2087/0.0321 ≈ -6.50, which is greater in absolute value than the 5% critical value of 1.96, and the p-value for this t-statistic is ≈ 0.000, which is smaller than 0.05.
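
Since the approximation 100 × β can be coarse for a coefficient of this size, the exact percentage effect of the female dummy in a log-linear model can also be computed as 100(e^β − 1); a minimal sketch:

100 * (exp(-0.208679) - 1)   # exact percentage change in the wage, about -19%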

  1. H0: β6 = β7 = β8 = β9 = 0; H1: at least one of the geographic coefficients is not equal to 0.
library(car)
## Loading required package: carData
linearHypothesis(reg3, c("metro=0", "midwest=0", "south=0", "west=0"), test=c("F"))
## Linear hypothesis test
## 
## Hypothesis:
## metro = 0
## midwest = 0
## south = 0
## west = 0
## 
## Model 1: restricted model
## Model 2: lnwage ~ educ + hrswk + exper + married + female + metro + midwest + 
##     south + west + black + asian
## 
##   Res.Df Df     F    Pr(>F)    
## 1    992                       
## 2    988  4 9.123 3.077e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To reach a conclusion, the F statistic (9.123) exceeds the 5% critical value (approximately 2.38 for F(4, 988)), and the corresponding p-value ≈ 0.000 < 0.05. Therefore, the null hypothesis that the geographic coefficients are jointly equal to 0 is rejected at the 5% significance level.
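
The critical value quoted above can be confirmed with qf() (a minimal sketch):

qf(0.95, df1 = 4, df2 = 988)   # 5% critical value for the joint F(4, 988) test, about 2.38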

  1. H0: β10 = β11 = 0; H1: at least one of the two race coefficients is not equal to 0.
linearHypothesis(reg3, c("asian=0", "black=0"), test=c("F"))
## Linear hypothesis test
## 
## Hypothesis:
## asian = 0
## black = 0
## 
## Model 1: restricted model
## Model 2: lnwage ~ educ + hrswk + exper + married + female + metro + midwest + 
##     south + west + black + asian
## 
##   Res.Df Df      F Pr(>F)
## 1    990                 
## 2    988  2 2.2237 0.1087

The F statistic (2.2237) is below the 5% critical value (approximately 3.00 for F(2, 988)), and the corresponding p-value = 0.1087 > 0.05. Therefore, we cannot reject the null hypothesis: controlling for the other variables, the earnings of Black and Asian workers are not significantly different from those of the omitted racial group at the 5% level.

  1. \[{lnwage}={\beta_0}+{\beta_1 educ}+{\beta_2 hrswk}+{\beta_3 exper}+{\beta_4 married}+{\beta_5 female}+{\beta_6 metro}+{\beta_7 midwest}+{\beta_8 south}+{\beta_9 west}+{\beta_{10} black}+{\beta_{11} asian}+{\beta_{12}\, educ*female}+u \]
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
EF=read_csv("cps4_small.csv") %>%
  mutate(femEduc=female*educ)
## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): wage, educ, exper, hrswk, married, female, metro, midwest, south, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach (EF)
## The following object is masked from package:ggplot2:
## 
##     midwest
## 
## The following objects are masked from EPH:
## 
##     asian, black, educ, exper, female, hrswk, married, metro, midwest,
##     south, wage, west
reg4=lm_robust(lnwage~educ+hrswk+exper+married+female+metro+midwest+south+west+black+asian+femEduc, data=EPH, se_type="stata")
summary (reg4)
## 
## Call:
## lm_robust(formula = lnwage ~ educ + hrswk + exper + married + 
##     female + metro + midwest + south + west + black + asian + 
##     femEduc, data = EPH, se_type = "stata")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)  CI Lower  CI Upper  DF
## (Intercept)  1.406445   0.143378  9.8094 9.731e-22  1.125085  1.687806 987
## educ         0.074706   0.008615  8.6721 1.721e-17  0.057801  0.091611 987
## hrswk        0.006124   0.001633  3.7508 1.866e-04  0.002920  0.009328 987
## exper        0.005708   0.001344  4.2478 2.363e-05  0.003071  0.008345 987
## married      0.076101   0.032192  2.3640 1.827e-02  0.012928  0.139273 987
## female      -0.764908   0.175368 -4.3617 1.426e-05 -1.109046 -0.420771 987
## metro        0.171931   0.036731  4.6808 3.256e-06  0.099851  0.244010 987
## midwest     -0.095797   0.043044 -2.2256 2.627e-02 -0.180265 -0.011329 987
## south       -0.063432   0.045118 -1.4059 1.601e-01 -0.151971  0.025107 987
## west         0.006049   0.045482  0.1330 8.942e-01 -0.083203  0.095302 987
## black       -0.102046   0.050649 -2.0148 4.420e-02 -0.201438 -0.002653 987
## asian       -0.054879   0.093211 -0.5888 5.562e-01 -0.237792  0.128035 987
## femEduc      0.040260   0.012711  3.1673 1.586e-03  0.015316  0.065204 987
## 
## Multiple R-squared:  0.2887 ,    Adjusted R-squared:   0.28 
## F-statistic: 32.57 on 12 and 987 DF,  p-value: < 2.2e-16

Then, we can use a joint linear hypothesis test on female and femEduc to assess significance:

linearHypothesis(reg4, c("female=0", "femEduc=0"), test=c("F"))
## Linear hypothesis test
## 
## Hypothesis:
## female = 0
## femEduc = 0
## 
## Model 1: restricted model
## Model 2: lnwage ~ educ + hrswk + exper + married + female + metro + midwest + 
##     south + west + black + asian + femEduc
## 
##   Res.Df Df      F    Pr(>F)    
## 1    989                        
## 2    987  2 28.418 1.002e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F statistic is 28.418 and the corresponding p-value ≈ 0.000, so we reject the null hypothesis that female and femEduc are jointly zero: gender matters for earnings, and the significant femEduc coefficient (t ≈ 3.17) indicates that the return to an additional year of education differs between men and women.
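
From the reg4 estimates above, the implied return to an extra year of education can be read off separately for each gender (a sketch using the reported coefficients):

b_educ = 0.074706      # estimated return to education for males (educ coefficient)
b_femEduc = 0.040260   # additional return for females (femEduc coefficient)
c(male = b_educ, female = b_educ + b_femEduc)   # female return about 0.115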

To illustrate, we predict ln(wage) from reg3 for an unmarried Black female worker in a metropolitan area of the base region, with 12 years of education, 5 years of experience, and a 40-hour work week:

predict(reg3, newdata=data.frame(educ=12, hrswk=40, exper=5, married=0, female=1, metro=1, midwest=0, south=0, west=0, asian=0, black=1))
##        1 
## 2.414683
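
The prediction is for ln(wage), so converting back to a wage level requires exponentiating (a rough conversion that ignores the retransformation adjustment):

exp(2.414683)   # predicted hourly wage, roughly 11.2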

QUESTION 4 (a)

EH = read_csv("cps4_small.csv") %>%
  mutate(lt_hs = as.numeric(educ < 12), hs = as.numeric(educ == 12), 
         col = as.numeric(educ >= 16), some_col = 1 - lt_hs - hs - col)
## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): wage, educ, exper, hrswk, married, female, metro, midwest, south, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach(EH)
## The following objects are masked from EF:
## 
##     asian, black, educ, exper, female, hrswk, married, metro, midwest,
##     south, wage, west
## 
## The following object is masked from package:ggplot2:
## 
##     midwest
## 
## The following objects are masked from EPH:
## 
##     asian, black, educ, exper, female, hrswk, married, metro, midwest,
##     south, wage, west
library(psych)
## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:car':
## 
##     logit
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(EH)
##          vars    n  mean    sd median trimmed   mad  min   max range  skew
## wage        1 1000 20.62 12.83   17.3   18.67 10.27 1.97 76.39 74.42  1.58
## educ        2 1000 13.80  2.71   13.0   13.68  1.48 0.00 21.00 21.00 -0.07
## exper       3 1000 26.51 12.85   27.0   26.27 14.83 2.00 65.00 63.00  0.15
## hrswk       4 1000 39.95 10.34   40.0   40.40  0.00 0.00 90.00 90.00 -0.53
## married     5 1000  0.58  0.49    1.0    0.60  0.00 0.00  1.00  1.00 -0.33
## female      6 1000  0.51  0.50    1.0    0.52  0.00 0.00  1.00  1.00 -0.06
## metro       7 1000  0.78  0.41    1.0    0.85  0.00 0.00  1.00  1.00 -1.35
## midwest     8 1000  0.24  0.43    0.0    0.17  0.00 0.00  1.00  1.00  1.22
## south       9 1000  0.30  0.46    0.0    0.24  0.00 0.00  1.00  1.00  0.89
## west       10 1000  0.24  0.43    0.0    0.17  0.00 0.00  1.00  1.00  1.22
## black      11 1000  0.11  0.32    0.0    0.01  0.00 0.00  1.00  1.00  2.46
## asian      12 1000  0.04  0.20    0.0    0.00  0.00 0.00  1.00  1.00  4.50
## lt_hs      13 1000  0.06  0.24    0.0    0.00  0.00 0.00  1.00  1.00  3.66
## hs         14 1000  0.33  0.47    0.0    0.28  0.00 0.00  1.00  1.00  0.73
## col        15 1000  0.33  0.47    0.0    0.29  0.00 0.00  1.00  1.00  0.72
## some_col   16 1000  0.28  0.45    0.0    0.22  0.00 0.00  1.00  1.00  0.98
##          kurtosis   se
## wage         2.91 0.41
## educ         2.06 0.09
## exper       -0.63 0.41
## hrswk        4.39 0.33
## married     -1.89 0.02
## female      -2.00 0.02
## metro       -0.18 0.01
## midwest     -0.52 0.01
## south       -1.20 0.01
## west        -0.52 0.01
## black        4.04 0.01
## asian       18.26 0.01
## lt_hs       11.43 0.01
## hs          -1.47 0.01
## col         -1.49 0.01
## some_col    -1.04 0.01
tapply(wage, hs, describe)
## $`0`
##    vars   n  mean    sd median trimmed   mad  min   max range skew kurtosis
## X1    1 672 22.87 13.84  19.56   20.97 12.29 1.97 76.39 74.42 1.31     1.76
##      se
## X1 0.53
## 
## $`1`
##    vars   n  mean   sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 328 15.99 8.84   14.2   14.72 6.35 2.5 72.13 69.63  2.3     9.01 0.49
tapply(wage, lt_hs, describe)
## $`0`
##    vars   n  mean    sd median trimmed   mad  min   max range skew kurtosis
## X1    1 939 21.12 12.94  17.88   19.19 10.56 1.97 76.39 74.42 1.55     2.79
##      se
## X1 0.42
## 
## $`1`
##    vars  n  mean   sd median trimmed mad  min   max range skew kurtosis se
## X1    1 61 12.89 7.81   9.75   11.47 3.9 4.33 45.65 41.32 2.04     4.85  1
tapply(wage, col, describe)
## $`0`
##    vars   n  mean    sd median trimmed  mad min   max range skew kurtosis   se
## X1    1 669 17.04 10.02  14.55   15.52 7.46 2.5 76.39 73.89 2.09     6.79 0.39
## 
## $`1`
##    vars   n  mean    sd median trimmed   mad  min   max range skew kurtosis
## X1    1 331 27.84 14.73  24.53   26.27 13.57 1.97 72.13 70.16 0.95     0.63
##      se
## X1 0.81
tapply(wage, some_col, describe)
## $`0`
##    vars   n  mean    sd median trimmed   mad  min   max range skew kurtosis  se
## X1    1 720 21.18 13.37   17.5   19.17 10.71 1.97 72.13 70.16 1.48     2.32 0.5
## 
## $`1`
##    vars   n  mean    sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 280 19.17 11.23   16.7   17.52 9.37 2.88 76.39 73.51 1.84        5 0.67
reg5=lm_robust(wage~hs+lt_hs+col+some_col, data=EH, se_type="stata")
## Warning in sqrt(diag(vcov_fit$Vcov_hat)): NaNs produced

In this model, the four education dummies are mutually exclusive and exhaustive, so for every observation lt_hs + hs + col + some_col = 1, which is identical to the intercept column. In other words, there is perfect multicollinearity (any one dummy can be predicted exactly from the intercept and the other three), the dummy-variable trap, so lm_robust cannot estimate all five coefficients and produces the warning above.
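
This can be verified directly: the four dummies sum to one for every observation, duplicating the intercept column (a sketch using the attached EH variables):

table(lt_hs + hs + col + some_col)   # every observation sums to 1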

reg6=lm_robust(wage~lt_hs+col+some_col, data=EH, se_type="stata")
summary(reg6)
## 
## Call:
## lm_robust(formula = wage ~ lt_hs + col + some_col, data = EH, 
##     se_type = "stata")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)   15.993     0.4885  32.737 3.691e-160   15.035  16.9520 996
## lt_hs         -3.108     1.1075  -2.806  5.107e-03   -5.281  -0.9348 996
## col           11.849     0.9460  12.525  1.572e-33    9.992  13.7050 996
## some_col       3.179     0.8304   3.828  1.372e-04    1.549   4.8082 996
## 
## Multiple R-squared:  0.1733 ,    Adjusted R-squared:  0.1708 
## F-statistic: 64.07 on 3 and 996 DF,  p-value: < 2.2e-16

In this model, the estimated intercept is 15.993, which is the average hourly earnings of the omitted category: workers whose highest qualification is a high school diploma (hs = 1). This value is essentially the same as the sample mean of 15.99 for hs = 1 computed above.

Comparing the sample means with the regression results gives very similar answers. For lt_hs, average hourly earnings are 3.108 lower than for the high-school base group; in the sample means, lt_hs workers earn 12.89 on average, about 3.10 below the intercept value.

For col, average hourly earnings are 11.849 higher than for the base group; in the sample means, col workers earn 27.84 on average, about 11.85 above the intercept value.

For some_col, average hourly earnings are 3.179 higher than for the base group; in the sample means, some_col workers earn 19.17 on average, about 3.18 above the intercept value.
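
As a final check, the group means implied by reg6 can be reconstructed from the estimated coefficients and compared with the sample means reported above (a sketch, assuming coef() extracts the lm_robust estimates as it does for estimatr objects):

b = coef(reg6)
c(hs = unname(b["(Intercept)"]),
  lt_hs = unname(b["(Intercept)"] + b["lt_hs"]),
  some_col = unname(b["(Intercept)"] + b["some_col"]),
  col = unname(b["(Intercept)"] + b["col"]))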