QUESTION (1)
setwd("D:/econ7310/Research project 1")
library(readr)
EPH=read_csv("cps4_small.csv")
## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): wage, educ, exper, hrswk, married, female, metro, midwest, south, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach (EPH)
summary (cbind(wage,educ))
## wage educ
## Min. : 1.97 Min. : 0.0
## 1st Qu.:11.25 1st Qu.:12.0
## Median :17.30 Median :13.0
## Mean :20.62 Mean :13.8
## 3rd Qu.:25.63 3rd Qu.:16.0
## Max. :76.39 Max. :21.0
sd (wage)
## [1] 12.83472
sd(educ)
## [1] 2.711079
hist(educ, main = "Years of Education", xlab = "Years")
hist(wage, main = "Earnings per Hour", xlab = "Wage")
lnwage=log(wage)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'EPH':
##
## midwest
fig1=ggplot(EPH, aes(educ,lnwage))+
geom_point(alpha=.75, size=1.5, color="cyan4") +
labs(title="Figure 1: Ln(wage) and Education",
x="Education", y="Ln(Wage)") +
theme(axis.title=element_text(family="serif"),
plot.title=element_text(hjust=0.5, family="serif", face="bold", size=10))
print(fig1)
Comments: The distribution of points suggests a positive relationship between education and ln(wage).
QUESTION (2)
library(estimatr)
reg1=lm_robust(lnwage~educ, data=EPH, se_type="stata")
summary (reg1)
##
## Call:
## lm_robust(formula = lnwage ~ educ, data = EPH, se_type = "stata")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 1.60944 0.094739 16.99 4.782e-57 1.42353 1.7954 998
## educ 0.09041 0.006794 13.31 2.610e-37 0.07708 0.1037 998
##
## Multiple R-squared: 0.1782 , Adjusted R-squared: 0.1774
## F-statistic: 177.1 on 1 and 998 DF, p-value: < 2.2e-16
fig2=ggplot(EPH, aes(educ,lnwage))+
geom_point(alpha=.75, size=1.5, color="cyan4") +
labs(title="Figure 2: Ln(wage) and Education with Regression Line",
x="Education", y="Ln(Wage)") +
theme(axis.title=element_text(family="serif"),
plot.title=element_text(hjust=0.5, family="serif", face="bold", size=10))+
geom_smooth(method="lm", se=FALSE, size=0.75, color="violetred3")
print(fig2)
## `geom_smooth()` using formula 'y ~ x'
The t-statistic for the slope coefficient is t = 0.0904/0.0068 = 13.31, which is greater in absolute value than the 1% two-sided critical value (2.58). Therefore, the null hypothesis that the slope is zero is rejected at the 1% level.
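A small sketch of this check in R, using the reg1 estimates reported above (output not shown):
0.09041 / 0.006794    # t statistic for H0: beta_educ = 0, about 13.3
qt(0.995, df = 998)   # 1% two-sided critical value, about 2.58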
Intuitively, workers who put in more hours per week tend to earn more, plausibly because they are full-time workers. Full-time workers tend both to work longer hours and to have spent more years in education than part-time or casual workers.
As a result, when hrswk is omitted, part of its effect on ln(wage) is absorbed into the estimated effect of educ. Therefore, the OLS estimate of b1 in the simple regression will overestimate the effect of educ on ln(wage).
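A minimal sketch of the omitted-variable-bias decomposition in R, assuming the EPH data loaded above: the bias equals the hrswk coefficient in the long regression times the slope of an auxiliary regression of hrswk on educ (output not shown).
aux   <- lm(hrswk ~ educ, data = EPH)             # auxiliary regression
short <- lm(log(wage) ~ educ, data = EPH)         # educ only
long  <- lm(log(wage) ~ educ + hrswk, data = EPH) # educ and hrswk
coef(long)["hrswk"] * coef(aux)["educ"]           # implied bias in the short regression
coef(short)["educ"] - coef(long)["educ"]          # same number, by the OVB identity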
reg2=lm_robust(lnwage~educ+hrswk, data=EPH, se_type="stata")
summary (reg2)
##
## Call:
## lm_robust(formula = lnwage ~ educ + hrswk, data = EPH, se_type = "stata")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 1.299871 0.108424 11.989 4.931e-31 1.087106 1.51264 997
## educ 0.086787 0.006826 12.715 1.964e-34 0.073393 0.10018 997
## hrswk 0.008999 0.001697 5.302 1.409e-07 0.005669 0.01233 997
##
## Multiple R-squared: 0.2036 , Adjusted R-squared: 0.202
## F-statistic: 106.5 on 2 and 997 DF, p-value: < 2.2e-16
\[\widehat{{lnwage}}=\underset{(0.108)}{1.2999}+\underset{(0.0068)}{0.0868}*{educ}+\underset{(0.0017)}{0.0090}*{hrswk},\quad R^2=0.2036 \] (1) Holding hrswk constant, one additional year of education is associated with an increase of about 0.0868 in average ln(wage), i.e., roughly an 8.7% higher hourly wage. (2) Holding educ constant, one additional hour of weekly work is associated with an increase of about 0.0090 in average ln(wage), i.e., roughly a 0.9% higher hourly wage. The reg2 specification fits better than reg1: including hrswk raises the R-squared from 0.178 to 0.204.
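A small sketch of the exact percentage interpretation of these log-linear coefficients, 100(exp(b) - 1), using the reg2 estimates above (output not shown):
b <- coef(reg2)
100 * (exp(b["educ"])  - 1)   # about 9.1% per additional year of education
100 * (exp(b["hrswk"]) - 1)   # about 0.9% per additional weekly hour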
QUESTION (3)
reg3=lm_robust(lnwage~educ+hrswk+exper+married+female+metro+midwest+south+west+black+asian, data=EPH, se_type="stata")
summary (reg3)
##
## Call:
## lm_robust(formula = lnwage ~ educ + hrswk + exper + married +
## female + metro + midwest + south + west + black + asian,
## data = EPH, se_type = "stata")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 1.178749 0.124795 9.4455 2.470e-20 0.933856 1.423642 988
## educ 0.092325 0.006786 13.6044 9.232e-39 0.079008 0.105643 988
## hrswk 0.005954 0.001635 3.6413 2.853e-04 0.002745 0.009162 988
## exper 0.005458 0.001350 4.0425 5.700e-05 0.002809 0.008108 988
## married 0.076673 0.032427 2.3645 1.825e-02 0.013039 0.140306 988
## female -0.208679 0.032086 -6.5038 1.242e-10 -0.271643 -0.145715 988
## metro 0.175411 0.037296 4.7032 2.926e-06 0.102222 0.248601 988
## midwest -0.097281 0.043003 -2.2622 2.390e-02 -0.181668 -0.012894 988
## south -0.057424 0.045254 -1.2689 2.048e-01 -0.146229 0.031380 988
## west 0.009633 0.045410 0.2121 8.320e-01 -0.079477 0.098744 988
## black -0.104145 0.051316 -2.0295 4.268e-02 -0.204847 -0.003443 988
## asian -0.063372 0.088579 -0.7154 4.745e-01 -0.237197 0.110454 988
##
## Multiple R-squared: 0.2802 , Adjusted R-squared: 0.2722
## F-statistic: 32.69 on 11 and 988 DF, p-value: < 2.2e-16
To test the hypothesis that one additional year of education increases the hourly wage by 12%, we test the null hypothesis H0: βeduc = 0.12 (the alternative is H1: βeduc ≠ 0.12) using the reg3 estimate and its robust standard error: \[ t=\frac{\hat{\beta}_{educ}-0.12}{se(\hat{\beta}_{educ})}=\frac{0.0923-0.12}{0.0068}\approx -4.07 \] Since |t| = 4.07 exceeds the 1% two-sided critical value of 2.576, we reject the null hypothesis at the 1% significance level: the data are not consistent with a 12% return to an additional year of education (the estimated return is about 9.2%).
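A minimal sketch of the same test in R, computed directly from the reg3 output reported above (coefficient 0.0923, robust SE 0.0068, 988 residual degrees of freedom; output not shown):
t_stat <- (0.092325 - 0.12) / 0.006786
t_stat                          # about -4.08
qt(0.995, df = 988)             # 1% two-sided critical value, about 2.58
2 * pt(-abs(t_stat), df = 988)  # p-value, well below 0.01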
Moreover, holding the other regressors constant, being female lowers average ln(wage) by about 0.209, i.e., roughly a 21% lower hourly wage, while one additional year of education raises average ln(wage) by about 0.092, i.e., roughly a 9.2% higher hourly wage. The negative effect of being female is statistically significant at the 5% level. The 95% confidence interval for the female coefficient is -0.2087 ± 1.96 × 0.0321, or approximately [-0.2716, -0.1458]; this interval does not include 0, so the estimated coefficient is significantly different from 0 at the 5% level. Equivalently, the t-statistic is -0.2087/0.0321 ≈ -6.50, which is greater in absolute value than the 5% critical value of 1.96; alternatively, the p-value for the t-statistic is ≈ 0.000, which is smaller than 0.05.
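A small sketch of the exact percentage gap and the confidence interval implied by the female coefficient in reg3, using the estimates above (output not shown):
100 * (exp(-0.208679) - 1)              # about -18.8% in exact percentage terms
-0.208679 + c(-1, 1) * 1.96 * 0.032086  # 95% CI, about [-0.272, -0.146]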
library(car)
## Loading required package: carData
linearHypothesis(reg3, c("metro=0", "midwest=0", "south=0", "west=0"), test=c("F"))
## Linear hypothesis test
##
## Hypothesis:
## metro = 0
## midwest = 0
## south = 0
## west = 0
##
## Model 1: restricted model
## Model 2: lnwage ~ educ + hrswk + exper + married + female + metro + midwest +
## south + west + black + asian
##
## Res.Df Df F Pr(>F)
## 1 992
## 2 988 4 9.123 3.077e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F statistic (9.123) exceeds the 5% critical value of approximately 2.37 for an F(4, 988) distribution, and the corresponding p-value ≈ 0.000 < 0.05. Therefore, the null hypothesis that the coefficients of the geographic dummies are jointly zero is rejected at the 5% significance level.
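The critical value used above can be obtained in R as a quick check (output not shown):
qf(0.95, df1 = 4, df2 = 988)   # about 2.38; tables with infinite denominator df give 2.37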
linearHypothesis(reg3, c("asian=0", "black=0"), test=c("F"))
## Linear hypothesis test
##
## Hypothesis:
## asian = 0
## black = 0
##
## Model 1: restricted model
## Model 2: lnwage ~ educ + hrswk + exper + married + female + metro + midwest +
## south + west + black + asian
##
## Res.Df Df F Pr(>F)
## 1 990
## 2 988 2 2.2237 0.1087
The F statistic (2.2237) is below the 5% critical value of approximately 3.00 for an F(2, 988) distribution, and the corresponding p-value = 0.1087 > 0.05. Therefore, we cannot reject the null hypothesis that the black and asian coefficients are jointly zero: at the 5% level, being African American or Asian American has no statistically significant effect on earnings relative to the omitted racial group.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
EF=read_csv("cps4_small.csv") %>%
mutate(femEduc=female*educ)
## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): wage, educ, exper, hrswk, married, female, metro, midwest, south, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach (EF)
## The following object is masked from package:ggplot2:
##
## midwest
##
## The following objects are masked from EPH:
##
## asian, black, educ, exper, female, hrswk, married, metro, midwest,
## south, wage, west
reg4=lm_robust(lnwage~educ+hrswk+exper+married+female+metro+midwest+south+west+black+asian+femEduc, data=EPH, se_type="stata")
summary (reg4)
##
## Call:
## lm_robust(formula = lnwage ~ educ + hrswk + exper + married +
## female + metro + midwest + south + west + black + asian +
## femEduc, data = EPH, se_type = "stata")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 1.406445 0.143378 9.8094 9.731e-22 1.125085 1.687806 987
## educ 0.074706 0.008615 8.6721 1.721e-17 0.057801 0.091611 987
## hrswk 0.006124 0.001633 3.7508 1.866e-04 0.002920 0.009328 987
## exper 0.005708 0.001344 4.2478 2.363e-05 0.003071 0.008345 987
## married 0.076101 0.032192 2.3640 1.827e-02 0.012928 0.139273 987
## female -0.764908 0.175368 -4.3617 1.426e-05 -1.109046 -0.420771 987
## metro 0.171931 0.036731 4.6808 3.256e-06 0.099851 0.244010 987
## midwest -0.095797 0.043044 -2.2256 2.627e-02 -0.180265 -0.011329 987
## south -0.063432 0.045118 -1.4059 1.601e-01 -0.151971 0.025107 987
## west 0.006049 0.045482 0.1330 8.942e-01 -0.083203 0.095302 987
## black -0.102046 0.050649 -2.0148 4.420e-02 -0.201438 -0.002653 987
## asian -0.054879 0.093211 -0.5888 5.562e-01 -0.237792 0.128035 987
## femEduc 0.040260 0.012711 3.1673 1.586e-03 0.015316 0.065204 987
##
## Multiple R-squared: 0.2887 , Adjusted R-squared: 0.28
## F-statistic: 32.57 on 12 and 987 DF, p-value: < 2.2e-16
Then, we can use a linear hypothesis test to assess the joint significance of the female terms:
linearHypothesis(reg4, c("female=0", "femEduc=0"), test=c("F"))
## Linear hypothesis test
##
## Hypothesis:
## female = 0
## femEduc = 0
##
## Model 1: restricted model
## Model 2: lnwage ~ educ + hrswk + exper + married + female + metro + midwest +
## south + west + black + asian + femEduc
##
## Res.Df Df F Pr(>F)
## 1 989
## 2 987 2 28.418 1.002e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F statistic is 28.418 and the corresponding p-value ≈ 0.000, so we reject the null hypothesis that the female and femEduc coefficients are jointly zero. Gender matters for earnings in this model, and the significant femEduc interaction indicates that the return to an additional year of education differs between men and women.
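A minimal sketch of the implied return to a year of education by gender, from the reg4 estimates above (output not shown):
b <- coef(reg4)
b["educ"]                 # men: about 0.075, roughly 7.5% per year
b["educ"] + b["femEduc"]  # women: about 0.115, roughly 11.5% per year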
predict(reg3, newdata=data.frame(educ=12, hrswk=40, exper=5, married=0, female=1, metro=1, midwest=0, south=0, west=0, asian=0, black=1))
## 1
## 2.414683
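Note that the prediction above is on the log scale; a minimal sketch converting it back to a wage level (ignoring any retransformation adjustment) gives roughly $11.2 per hour (output not shown):
exp(2.414683)   # about 11.19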
QUESTION 4 (a)
EH = read_csv("cps4_small.csv") %>%
mutate(lt_hs = as.numeric(educ < 12), hs = as.numeric(educ == 12),
col = as.numeric(educ >= 16), some_col = 1 - lt_hs - hs - col)
## Rows: 1000 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): wage, educ, exper, hrswk, married, female, metro, midwest, south, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach(EH)
## The following objects are masked from EF:
##
## asian, black, educ, exper, female, hrswk, married, metro, midwest,
## south, wage, west
##
## The following object is masked from package:ggplot2:
##
## midwest
##
## The following objects are masked from EPH:
##
## asian, black, educ, exper, female, hrswk, married, metro, midwest,
## south, wage, west
library(psych)
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:car':
##
## logit
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(EH)
## vars n mean sd median trimmed mad min max range skew
## wage 1 1000 20.62 12.83 17.3 18.67 10.27 1.97 76.39 74.42 1.58
## educ 2 1000 13.80 2.71 13.0 13.68 1.48 0.00 21.00 21.00 -0.07
## exper 3 1000 26.51 12.85 27.0 26.27 14.83 2.00 65.00 63.00 0.15
## hrswk 4 1000 39.95 10.34 40.0 40.40 0.00 0.00 90.00 90.00 -0.53
## married 5 1000 0.58 0.49 1.0 0.60 0.00 0.00 1.00 1.00 -0.33
## female 6 1000 0.51 0.50 1.0 0.52 0.00 0.00 1.00 1.00 -0.06
## metro 7 1000 0.78 0.41 1.0 0.85 0.00 0.00 1.00 1.00 -1.35
## midwest 8 1000 0.24 0.43 0.0 0.17 0.00 0.00 1.00 1.00 1.22
## south 9 1000 0.30 0.46 0.0 0.24 0.00 0.00 1.00 1.00 0.89
## west 10 1000 0.24 0.43 0.0 0.17 0.00 0.00 1.00 1.00 1.22
## black 11 1000 0.11 0.32 0.0 0.01 0.00 0.00 1.00 1.00 2.46
## asian 12 1000 0.04 0.20 0.0 0.00 0.00 0.00 1.00 1.00 4.50
## lt_hs 13 1000 0.06 0.24 0.0 0.00 0.00 0.00 1.00 1.00 3.66
## hs 14 1000 0.33 0.47 0.0 0.28 0.00 0.00 1.00 1.00 0.73
## col 15 1000 0.33 0.47 0.0 0.29 0.00 0.00 1.00 1.00 0.72
## some_col 16 1000 0.28 0.45 0.0 0.22 0.00 0.00 1.00 1.00 0.98
## kurtosis se
## wage 2.91 0.41
## educ 2.06 0.09
## exper -0.63 0.41
## hrswk 4.39 0.33
## married -1.89 0.02
## female -2.00 0.02
## metro -0.18 0.01
## midwest -0.52 0.01
## south -1.20 0.01
## west -0.52 0.01
## black 4.04 0.01
## asian 18.26 0.01
## lt_hs 11.43 0.01
## hs -1.47 0.01
## col -1.49 0.01
## some_col -1.04 0.01
tapply(wage, hs, describe)
## $`0`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 672 22.87 13.84 19.56 20.97 12.29 1.97 76.39 74.42 1.31 1.76
## se
## X1 0.53
##
## $`1`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 328 15.99 8.84 14.2 14.72 6.35 2.5 72.13 69.63 2.3 9.01 0.49
tapply(wage, lt_hs, describe)
## $`0`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 939 21.12 12.94 17.88 19.19 10.56 1.97 76.39 74.42 1.55 2.79
## se
## X1 0.42
##
## $`1`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 61 12.89 7.81 9.75 11.47 3.9 4.33 45.65 41.32 2.04 4.85 1
tapply(wage, col, describe)
## $`0`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 669 17.04 10.02 14.55 15.52 7.46 2.5 76.39 73.89 2.09 6.79 0.39
##
## $`1`
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 331 27.84 14.73 24.53 26.27 13.57 1.97 72.13 70.16 0.95 0.63
## se
## X1 0.81
tapply(wage, some_col, describe)
## $`0`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 720 21.18 13.37 17.5 19.17 10.71 1.97 72.13 70.16 1.48 2.32 0.5
##
## $`1`
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 280 19.17 11.23 16.7 17.52 9.37 2.88 76.39 73.51 1.84 5 0.67
reg5=lm_robust(wage~hs+lt_hs+col+some_col, data=EH, se_type="stata")
## Warning in sqrt(diag(vcov_fit$Vcov_hat)): NaNs produced
In this model, the four education dummies sum to one for every observation, which exactly reproduces the intercept column. In other words, there is perfect multicollinearity (one regressor is a linear combination of the others), which is why lm_robust produces NaN standard errors. Intuitively, one category is redundant: once lt_hs, some_col and col are known, hs is fully determined (if all three are zero, the worker must be a high-school graduate), so including all four dummies together with an intercept adds no information.
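A quick check of the dummy-variable trap, assuming the EH data constructed above (output not shown):
table(with(EH, lt_hs + hs + some_col + col))   # every row sums to 1, duplicating the intercept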
reg6=lm_robust(wage~lt_hs+col+some_col, data=EH, se_type="stata")
summary(reg6)
##
## Call:
## lm_robust(formula = wage ~ lt_hs + col + some_col, data = EH,
## se_type = "stata")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 15.993 0.4885 32.737 3.691e-160 15.035 16.9520 996
## lt_hs -3.108 1.1075 -2.806 5.107e-03 -5.281 -0.9348 996
## col 11.849 0.9460 12.525 1.572e-33 9.992 13.7050 996
## some_col 3.179 0.8304 3.828 1.372e-04 1.549 4.8082 996
##
## Multiple R-squared: 0.1733 , Adjusted R-squared: 0.1708
## F-statistic: 64.07 on 3 and 996 DF, p-value: < 2.2e-16
In this model, the estimated intercept is 15.993, which is the average hourly wage of the omitted category (hs, workers whose highest qualification is a high-school diploma). This matches the sample mean wage for hs = 1 (15.99) computed above.
Comparing the sample means with the regression results gives essentially the same picture.
For lt_hs, the average wage is 3.108 lower than in the omitted hs group; consistent with this, the lt_hs sample mean of 12.89 is about 3.10 below the intercept value.
For col, the average wage is 11.849 higher than in the hs group; the col sample mean of 27.84 is about 11.85 above the intercept value.
For some_col, the average wage is 3.179 higher than in the hs group; the some_col sample mean of 19.17 is about 3.18 above the intercept value.
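A small sketch recovering those group means from the reg6 coefficients (output not shown):
b <- coef(reg6)
b["(Intercept)"]                 # hs group mean, about 15.99
b["(Intercept)"] + b["lt_hs"]    # lt_hs group mean, about 12.89
b["(Intercept)"] + b["col"]      # col group mean, about 27.84
b["(Intercept)"] + b["some_col"] # some_col group mean, about 19.17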