hw5

JOYCE,MEI, CAROL,TAMI 11/1/2020

load("~/Desktop/.RData")
attach(acs2017_ny)
## The following object is masked _by_ .GlobalEnv:
## 
##     OWNCOST
use_varb <- (AGE >= 25) & (AGE <= 55) & (LABFORCE == 2) & (WKSWORK2 > 4) & (UHRSWORK >= 35)&(Asian=1)
dat_use <- subset(acs2017_ny,use_varb) # 
detach()

Jam nonlinear regression into linear regression

## with one dummy
model1<-lm(INCWAGE~AGE+I(AGE^2)+female,data=dat_use)                                   
summary(model1)
## 
## Call:
## lm(formula = INCWAGE ~ AGE + I(AGE^2) + female, data = dat_use)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -90258 -39251 -16422  12852 603267 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -86399.070   8408.739  -10.28   <2e-16 ***
## AGE           7641.924    431.330   17.72   <2e-16 ***
## I(AGE^2)       -82.642      5.327  -15.51   <2e-16 ***
## female      -18264.550    760.754  -24.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82030 on 46967 degrees of freedom
## Multiple R-squared:  0.02899,    Adjusted R-squared:  0.02892 
## F-statistic: 467.3 on 3 and 46967 DF,  p-value: < 2.2e-16
attach(dat_NYC)
## The following objects are masked _by_ .GlobalEnv:
## 
##     Asian, OWNCOST
NNobs <- length(INCWAGE)
set.seed(12345) 
graph_obs <- (runif(NNobs) < 0.1) 
dat_graph <-subset(dat_use,graph_obs) 
a<-plot(INCWAGE ~ jitter(AGE, factor = 2), pch = 16, col = rgb(0.5, 0.7, 0.8, alpha = 0.8), ylim = c(0,150000), data = dat_graph)
to_be_predicted1 <- data.frame(AGE = 25:55, female = 1, AfAm = 0, Asian = 1, Amindian = 0, race_oth = 0, Hispanic =0, educ_hs = 0, educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
to_be_predicted1$yhat <- predict(model1, newdata = to_be_predicted1)
lines(yhat ~ AGE, data = to_be_predicted1)

AGE, AGE^2,female and educ_college are all significant. 1. It’s necessary to add up AGE^2 to adjust the model.

2.INCWAGE= f(AGE)= -82.642AGE^2+7641.924 AGE-86399.070 -18264.550 female The function reaches its peach at f’(AGE)=0, -300.78AGE+13611.11=0 (AGE=46).

Incwage estimation with the variables of age and one dummy D1(female=1)

When D1=0 (male)

E(INCWAGE)= -82.642AGE^2+7641.924 AGE-86399.070

E(INCWAGE)max=f(46)=90248

When D1=1(female) E(INCWAGE)=-150.39AGE^2+ 13611.11AGE-179132.90 -22415.91female=67832.78

The wages of male is$22415.91 higher than female. Adding gender in the model change the intercept of the function.

## Estimating income With more dummies.改
model1a <-lm(INCWAGE~AGE+I(AGE^2)+female+Asian+educ_hs+educ_somecoll+educ_college+educ_advdeg,data=dat_use)  
summary(model1a)
## 
## Call:
## lm(formula = INCWAGE ~ AGE + I(AGE^2) + female + Asian + educ_hs + 
##     educ_somecoll + educ_college + educ_advdeg, data = dat_use)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138710  -32970  -10697   13008  624344 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.145e+05  8.101e+03 -14.135  < 2e-16 ***
## AGE            6.587e+03  4.051e+02  16.258  < 2e-16 ***
## I(AGE^2)      -6.509e+01  5.003e+00 -13.012  < 2e-16 ***
## female        -2.537e+04  7.182e+02 -35.325  < 2e-16 ***
## Asian         -1.077e+03  1.194e+03  -0.902    0.367    
## educ_hs        1.337e+04  1.787e+03   7.481 7.52e-14 ***
## educ_somecoll  2.576e+04  1.822e+03  14.141  < 2e-16 ***
## educ_college   6.149e+04  1.783e+03  34.493  < 2e-16 ***
## educ_advdeg    8.658e+04  1.823e+03  47.483  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76760 on 46962 degrees of freedom
## Multiple R-squared:  0.1499, Adjusted R-squared:  0.1498 
## F-statistic:  1035 on 8 and 46962 DF,  p-value: < 2.2e-16

Q:What is the predicted wage, from your model, for a few relevant cases? Do those seem reasonable? 1.E(INCWAGE)=-1.145e+05+ 6.587e+03AGE+ -6.509e+01AGE^2-2.537e+04female+ 1.337e+04 educ_hs + 2.576e+04 educ_somecoll+ 6.149e+04 educ_college+ 8.658e+04educ_advdeg.

2.Except for Asian,all other variables are significant. In another word, income is not relative to ethnicity, so just drop Asian in the model. By adding more variables, R squares increases by 14.9%, which is an ok result.

3.By using this model, we can estimate the marginal impacts of each variable holding other factors unchanged. For example, female earn $2.537e+04 less than male per year, people have advanced education has income $8.658e+04 higher than those who are in educ_ohs.

PS:My confusion is that, if I drop educ_nohs in the regression, then the coefficients of the rest of the education index are making comparision to educ_nohs?

## Adding higher polynomial terms of age
model2<-lm(INCWAGE~AGE+I(AGE^2)+I(AGE^3)+I(AGE^4)+female,data=dat_use)
summary(model2)
## 
## Call:
## lm(formula = INCWAGE ~ AGE + I(AGE^2) + I(AGE^3) + I(AGE^4) + 
##     female, data = dat_use)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -89585 -39074 -16528  12946 603245 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.076e+05  2.022e+05   1.027   0.3046    
## AGE         -2.577e+04  2.138e+04  -1.205   0.2282    
## I(AGE^2)     1.303e+03  8.303e+02   1.569   0.1167    
## I(AGE^3)    -2.484e+01  1.404e+01  -1.768   0.0770 .  
## I(AGE^4)     1.626e-01  8.741e-02   1.860   0.0628 .  
## female      -1.827e+04  7.607e+02 -24.012   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82030 on 46965 degrees of freedom
## Multiple R-squared:  0.02913,    Adjusted R-squared:  0.02903 
## F-statistic: 281.8 on 5 and 46965 DF,  p-value: < 2.2e-16
NNobs <- length(INCWAGE)
set.seed(12345) 
graph_obs <- (runif(NNobs) < 0.1) 
dat_graph <-subset(dat_use,graph_obs) 
a<-plot(INCWAGE ~ jitter(AGE, factor = 2), pch = 16, col = rgb(0.5, 0.7, 0.8, alpha = 0.8), ylim = c(0,150000), data = dat_graph)
to_be_predicted1 <- data.frame(AGE = 25:55, female = 1, AfAm = 0, Asian = 1, Amindian = 0, race_oth = 0, Hispanic =0, educ_hs = 0, educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
to_be_predicted1$yhat <- predict(model2, newdata = to_be_predicted1)
lines(yhat ~ AGE, data = to_be_predicted1)

1.At the significant level of 0.05, Age and its higher power terms and female are statistically significant.

  1. E(INCWAGE)=f(age)=1.985e+06 -2.344e+05AGE +1.023e+04 AGE^2-1.880e+02 AGE^3+1.244e+00 AGE^4+ -2.236e+04female

3.By adding higher power of age, the model is improved as R squared increases.

Know the Data

model2a<-lm(INCWAGE~log(AGE)+I(log(AGE^2))+I(log(AGE^3))+I(log(AGE^4))+female,data=dat_use)
summary(model2a)
## 
## Call:
## lm(formula = INCWAGE ~ log(AGE) + I(log(AGE^2)) + I(log(AGE^3)) + 
##     I(log(AGE^4)) + female, data = dat_use)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94336 -39311 -16982  12904 594134 
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -68030.3     5995.9  -11.35   <2e-16 ***
## log(AGE)       40517.3     1624.5   24.94   <2e-16 ***
## I(log(AGE^2))       NA         NA      NA       NA    
## I(log(AGE^3))       NA         NA      NA       NA    
## I(log(AGE^4))       NA         NA      NA       NA    
## female        -18524.0      761.9  -24.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82180 on 46968 degrees of freedom
## Multiple R-squared:  0.02548,    Adjusted R-squared:  0.02544 
## F-statistic:   614 on 2 and 46968 DF,  p-value: < 2.2e-16

##why take log for higher power of age dosen’t work??

attach(acs2017_ny)
## The following objects are masked _by_ .GlobalEnv:
## 
##     Asian, OWNCOST

## The following objects are masked from dat_NYC:
## 
##     AfAm, AGE, Amindian, ANCESTR1, ANCESTR1D, ANCESTR2, ANCESTR2D,
##     Asian, below_150poverty, below_200poverty, below_povertyline, BPL,
##     BPLD, BUILTYR2, CITIZEN, CLASSWKR, CLASSWKRD, Commute_bus,
##     Commute_car, Commute_other, Commute_rail, Commute_subway, COSTELEC,
##     COSTFUEL, COSTGAS, COSTWATR, DEGFIELD, DEGFIELD2, DEGFIELD2D,
##     DEGFIELDD, DEPARTS, EDUC, educ_advdeg, educ_college, educ_hs,
##     educ_nohs, educ_somecoll, EDUCD, EMPSTAT, EMPSTATD, FAMSIZE,
##     female, foodstamps, FOODSTMP, FTOTINC, FUELHEAT, GQ,
##     has_AnyHealthIns, has_PvtHealthIns, HCOVANY, HCOVPRIV, HHINCOME,
##     Hisp_Cuban, Hisp_DomR, Hisp_Mex, Hisp_PR, HISPAN, HISPAND,
##     Hispanic, in_Bronx, in_Brooklyn, in_Manhattan, in_Nassau, in_NYC,
##     in_Queens, in_StatenI, in_Westchester, INCTOT, INCWAGE, IND,
##     LABFORCE, LINGISOL, MARST, MIGCOUNTY1, MIGPLAC1, MIGPUMA1,
##     MIGRATE1, MIGRATE1D, MORTGAGE, NCHILD, NCHLT5, OCC, OWNCOST,
##     OWNERSHP, OWNERSHPD, POVERTY, PUMA, PWPUMA00, RACE, race_oth,
##     RACED, RELATE, RELATED, RENT, ROOMS, SCHOOL, SEX, SSMC, TRANTIME,
##     TRANWORK, UHRSWORK, UNITSSTR, unmarried, veteran, VETSTAT,
##     VETSTATD, white, WKSWORK2, YRSUSA1
use_varb <- (AGE >= 25) & (AGE <= 55) & (LABFORCE == 2) & (WKSWORK2 > 4) & (UHRSWORK >= 35) & (Hispanic == 1) & (female == 1) & ((educ_college == 1) | (educ_advdeg == 1))
dat_use1 <- subset(acs2017_ny,use_varb) # 
detach()
## Explore what happens if we denote different properities under the same dummy variable as 1 and use them in the same model.
model2b <-lm(INCWAGE~AGE+I(AGE^2)+female+Asian+educ_hs+educ_somecoll+educ_college+educ_advdeg,data=dat_use1) 
summary(model2b)
## 
## Call:
## lm(formula = INCWAGE ~ AGE + I(AGE^2) + female + Asian + educ_hs + 
##     educ_somecoll + educ_college + educ_advdeg, data = dat_use1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -92859 -25796  -7006  15021 553413 
## 
## Coefficients: (4 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -77174.12   33798.14  -2.283   0.0226 *  
## AGE             7794.92    1761.97   4.424 1.06e-05 ***
## I(AGE^2)         -89.33      22.25  -4.015 6.34e-05 ***
## female               NA         NA      NA       NA    
## Asian          12692.60   14470.50   0.877   0.3806    
## educ_hs              NA         NA      NA       NA    
## educ_somecoll        NA         NA      NA       NA    
## educ_college  -21888.58    3211.16  -6.816 1.52e-11 ***
## educ_advdeg          NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51860 on 1131 degrees of freedom
## Multiple R-squared:  0.0779, Adjusted R-squared:  0.07464 
## F-statistic: 23.89 on 4 and 1131 DF,  p-value: < 2.2e-16

##In the original subset, educ_college and educ_advdeg are all denoted by 1. If a dummy variable has m properties, then we apply m-1 dummies in the model. Because different properties under the same dummy variables are exclusive. People will be classified by either of the property but not both. For example, “(educ_college == 1) | (educ_advdeg == 1)” if a person falls into educ_college or educ_advdge, then we fail to classify him.

Here in this example suppose we have 3 propertities in education index: educ_hs,educ_college, educ_advdge.

person educ_college educ_advdge

1: 0 0
2: 0 1
3: 1 0 4: 1 1

In this case, how should we classify person no.4? So D1(educ_college) and D2(educ_advdge) can’t take 1 at the same time.

##polynomial terms of dummy variables
model2c <-model1a <-lm(INCWAGE~AGE+I(AGE^2)+female+I(female^2)+Asian+I(Asian^2)+educ_hs+I(educ_hs^2)+educ_somecoll+educ_college+educ_advdeg,data=dat_use) 
summary(model2c)
## 
## Call:
## lm(formula = INCWAGE ~ AGE + I(AGE^2) + female + I(female^2) + 
##     Asian + I(Asian^2) + educ_hs + I(educ_hs^2) + educ_somecoll + 
##     educ_college + educ_advdeg, data = dat_use)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138710  -32970  -10697   13008  624344 
## 
## Coefficients: (3 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.145e+05  8.101e+03 -14.135  < 2e-16 ***
## AGE            6.587e+03  4.051e+02  16.258  < 2e-16 ***
## I(AGE^2)      -6.509e+01  5.003e+00 -13.012  < 2e-16 ***
## female        -2.537e+04  7.182e+02 -35.325  < 2e-16 ***
## I(female^2)           NA         NA      NA       NA    
## Asian         -1.077e+03  1.194e+03  -0.902    0.367    
## I(Asian^2)            NA         NA      NA       NA    
## educ_hs        1.337e+04  1.787e+03   7.481 7.52e-14 ***
## I(educ_hs^2)          NA         NA      NA       NA    
## educ_somecoll  2.576e+04  1.822e+03  14.141  < 2e-16 ***
## educ_college   6.149e+04  1.783e+03  34.493  < 2e-16 ***
## educ_advdeg    8.658e+04  1.823e+03  47.483  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76760 on 46962 degrees of freedom
## Multiple R-squared:  0.1499, Adjusted R-squared:  0.1498 
## F-statistic:  1035 on 8 and 46962 DF,  p-value: < 2.2e-16

##Dummy variables are denoted as 0 or 1. If we take log to 0 or 1, it is meaningless. The coefficients of those dummy variables just measure the percentage change when dummies switch from 0 to 1. Same as raising power of dunmmies. If we raise 0 and 1 to higher power, it doesn’t help to explain the model.

log wage as the dependent variable: measuring the percentage change of the variables.

use_varb <- (AGE >= 25) & (AGE <= 55) & (LABFORCE == 2) & (WKSWORK2 > 4) & (UHRSWORK >= 35)&(Asian=1)
dat_use <- subset(acs2017_ny,use_varb) # 
detach()
model2d<-lm(log1p(INCWAGE)~AGE+I(AGE^2)+female+Asian+educ_hs+educ_somecoll+educ_college+educ_advdeg,data=dat_use) 
summary(model2d)
## 
## Call:
## lm(formula = log1p(INCWAGE) ~ AGE + I(AGE^2) + female + Asian + 
##     educ_hs + educ_somecoll + educ_college + educ_advdeg, data = dat_use)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.009  -4.076   1.810   3.267  16.053 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.215e-01  1.043e-01  -5.001 5.71e-07 ***
## AGE            3.212e-01  4.375e-03  73.419  < 2e-16 ***
## I(AGE^2)      -4.029e-03  4.299e-05 -93.713  < 2e-16 ***
## female        -6.694e-01  3.325e-02 -20.133  < 2e-16 ***
## Asian         -2.619e-01  5.902e-02  -4.437 9.12e-06 ***
## educ_hs        1.840e+00  5.605e-02  32.819  < 2e-16 ***
## educ_somecoll  2.584e+00  6.053e-02  42.694  < 2e-16 ***
## educ_college   3.686e+00  6.187e-02  59.573  < 2e-16 ***
## educ_advdeg    4.129e+00  6.619e-02  62.385  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.369 on 69578 degrees of freedom
##   (14237 observations deleted due to missingness)
## Multiple R-squared:  0.2825, Adjusted R-squared:  0.2824 
## F-statistic:  3424 on 8 and 69578 DF,  p-value: < 2.2e-16

Comparision between model1a and model2d

1.If we take log of INCWAGE, we measure those variable in a percentage terms, eg. in model1a, people has some college education earn $2.576e+04 more than those who receive education less than high school(educ_ohs), while in model2d, it express as income of those who are in educ_somecoll is 128.95% higher than those in educ_ohs.

Multiple dummy variables with other variables: adjustment of the estimated coefficients and intercepts of the model

model3a:interaction of gender and age

model3a <-lm(INCWAGE~AGE+I(AGE^2)+female+I(female*AGE)+I(female*(AGE^2))+Asian+educ_hs+educ_somecoll+educ_college+educ_advdeg,data=dat_use)
summary(model3a)
## 
## Call:
## lm(formula = INCWAGE ~ AGE + I(AGE^2) + female + I(female * AGE) + 
##     I(female * (AGE^2)) + Asian + educ_hs + educ_somecoll + educ_college + 
##     educ_advdeg, data = dat_use)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101281  -24550   -6536   11190  675329 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -7.396e+04  2.026e+03 -36.496  < 2e-16 ***
## AGE                  4.863e+03  8.902e+01  54.623  < 2e-16 ***
## I(AGE^2)            -5.117e+01  8.927e-01 -57.326  < 2e-16 ***
## female               4.209e+04  2.748e+03  15.318  < 2e-16 ***
## I(female * AGE)     -2.830e+03  1.190e+02 -23.772  < 2e-16 ***
## I(female * (AGE^2))  2.916e+01  1.175e+00  24.810  < 2e-16 ***
## Asian               -1.193e+03  8.172e+02  -1.460    0.144    
## educ_hs              5.267e+03  7.763e+02   6.785 1.17e-11 ***
## educ_somecoll        1.178e+04  8.382e+02  14.049  < 2e-16 ***
## educ_college         3.652e+04  8.570e+02  42.618  < 2e-16 ***
## educ_advdeg          5.974e+04  9.171e+02  65.138  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60490 on 69576 degrees of freedom
##   (14237 observations deleted due to missingness)
## Multiple R-squared:  0.1776, Adjusted R-squared:  0.1775 
## F-statistic:  1502 on 10 and 69576 DF,  p-value: < 2.2e-16

Asian is insignificant. Income doesn’t has significant relation with ethnicity.

E(INCWAGE:gender*age)= -1.487e+05 +8.085e+03AGE-8.044e+01AGE^2-3.177e+03 female * AGE +3.245e+01 female * AGE^2+ 1.343e+04educ_hs+ 2.587e+04educ_somecoll + 8.641e+04educ_advdeg

2.Interpret model E(INCWAGE:GENDERAGE) interaction term: -3.177e+03 female * AGE +3.245e+01 female * AGE^2= femaleage(-3.177e+03+3.245e+01*age). This means when age increase 1 year, the wage changed will be -3.177e+03+3.245e+01*age.

  1. Application Let’s narrow it down to estimate the income of female, age=35, with advance education:

E(INCWAGE:gender*age)= -1.487e+05 +8.085e+03AGE-8.044e+01AGE^2-3.177e+03 female * AGE+3.245e+01 female * AGE^2+ 8.641e+04educ_advdeg

the excepted value=E(INCWAGE:gender*age)=-1.487e+05 +8.085e+03(35)-8.044e+01(35)^2+3.245e+01 (35)^2+8.641e+04-3.177e+03 * (35)=

D1=gender

when D1=male=0 E(INCWAGE:gender*age)= -1.487e+05 +8.085e+03AGE-8.044e+01AGE^2+ 8.641e+04educ_advdeg

f’(AGE)=8.085e+03-8.044e+01*(2)AGE f’(age)=0 intercept1= -1.487e+05+04educ_advdeg

When D1=1 E(INCWAGE:gender*age)= -1.487e+05 +8.085e+03AGE-8.044e+01AGE^2-3.177e+03 female * AGE+3.245e+01 female * AGE^2+ 8.641e+04educ_advdeg

f’(AGE)=8.085e+03-8.044e+01*(2)AGE-3.177e+03+3.245e+01*(2)AGE+ 8.641e+04educ_advdeg INTERCEPT2= -1.487e+05+ 8.641e+04educ_advdeg

Conclusion: interaction of the shape of income~age of these two group are totally different(different intercepts and different slope).

attach(acs2017_ny)
## The following objects are masked _by_ .GlobalEnv:
## 
##     Asian, OWNCOST
use_varb <- (AGE >= 25) & (AGE <= 55) & (LABFORCE == 2) & (WKSWORK2 > 4) & (UHRSWORK >= 35)&(Asian=1)
dat_use <- subset(acs2017_ny,use_varb)
model3a<-lm(INCWAGE ~ AGE + I(AGE^2) + female + I(female*AGE) + I(female*(AGE^2))+educ_somecoll+educ_college+educ_advdeg,data=dat_use)
NNobs <- length(INCWAGE)
set.seed(66666) 
graph_obs <- (runif(NNobs) < 0.1) 
dat_graph <-subset(dat_use,graph_obs) 
plot(INCWAGE ~ jitter(AGE, factor = 2), pch = 16, col = rgb(0.5, 0.7, 0.8, alpha = 0.8), ylim = c(0,150000), data = dat_graph)
to_be_predicted6 <- data.frame(AGE = 25:55, female = 1, AfAm = 0, Asian =1, Amindian = 0, race_oth = 0, Hispanic =0, educ_hs = 0, educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
to_be_predicted6$yhat <- predict(model3a, newdata = to_be_predicted6)
lines(yhat ~ AGE, data = to_be_predicted6,col="red",lwd=2)
to_be_predicted7 <- data.frame(AGE = 25:55, female = 0, AfAm = 0, Asian =1, Amindian = 0, race_oth = 0, Hispanic =0, educ_hs = 0, educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
to_be_predicted7$yhat <- predict(model3a, newdata = to_be_predicted7)
lines(yhat ~ AGE, data = to_be_predicted7,col="blue",lty=3,lwd=4)
legend("topleft", c("male", "female"), lty = c(2,1), bty = "n")

## gender/marriage statue interaction
model3b <-lm(log1p(INCWAGE)~AGE+I(AGE^2)+female:MARST+Asian+educ_hs+educ_somecoll+educ_college+educ_advdeg,data=dat_use)  
summary(model3b)
## 
## Call:
## lm(formula = log1p(INCWAGE) ~ AGE + I(AGE^2) + female:MARST + 
##     Asian + educ_hs + educ_somecoll + educ_college + educ_advdeg, 
##     data = dat_use)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1731  -0.0265   0.3735   0.7737   3.8979 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.2270851  0.2313890  35.555  < 2e-16 ***
## AGE            0.0615367  0.0115379   5.333 9.68e-08 ***
## I(AGE^2)      -0.0006763  0.0001423  -4.754 2.00e-06 ***
## Asian         -0.0784687  0.0338079  -2.321   0.0203 *  
## educ_hs        0.5416555  0.0505947  10.706  < 2e-16 ***
## educ_somecoll  0.8280733  0.0515311  16.069  < 2e-16 ***
## educ_college   1.2834252  0.0504253  25.452  < 2e-16 ***
## educ_advdeg    1.5463282  0.0514825  30.036  < 2e-16 ***
## female:MARST  -0.0241825  0.0047160  -5.128 2.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.173 on 46962 degrees of freedom
## Multiple R-squared:  0.03919,    Adjusted R-squared:  0.03903 
## F-statistic: 239.4 on 8 and 46962 DF,  p-value: < 2.2e-16

The interaction term tells us that the income ration of female who are married spouse absent,separated ,divorced, widowed,never married earn 2.4% less than those not.