hw4 finial

Carol,Joyce, Yameili, Tamires 10/14/2020

load("~/Desktop/.RData")
attach(acs2017_ny)
## The following object is masked _by_ .GlobalEnv:
## 
##     OWNCOST
use_varb <- (AGE >= 25) & (AGE <= 55) & (LABFORCE == 2) & (WKSWORK2 > 4) & (UHRSWORK >= 35)
dat_use <- subset(acs2017_ny,use_varb) # 
detach()
attach(dat_use)
## The following object is masked _by_ .GlobalEnv:
## 
##     OWNCOST

Here, the age is limited ranging from 25~55, during which it has real effect on wage.

Run on the original regression from lab4 and observe its fitness.
model_temp1 <- lm(INCWAGE ~ AGE + female + AfAm + Asian + Amindian + race_oth + Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)
summary(model_temp1)
## 
## Call:
## lm(formula = INCWAGE ~ AGE + female + AfAm + Asian + Amindian + 
##     race_oth + Hispanic + educ_hs + educ_somecoll + educ_college + 
##     educ_advdeg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -148088  -33205  -10708   13053  625543 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -7096.25    2446.71  -2.900 0.003730 ** 
## AGE             1316.69      39.66  33.199  < 2e-16 ***
## female        -24939.46     720.43 -34.617  < 2e-16 ***
## AfAm          -11934.26    1130.37 -10.558  < 2e-16 ***
## Asian            566.53    1369.83   0.414 0.679188    
## Amindian       -8858.57    6077.71  -1.458 0.144971    
## race_oth       -7526.49    1272.49  -5.915 3.35e-09 ***
## Hispanic       -4224.82    1183.47  -3.570 0.000358 ***
## educ_hs        10592.37    1814.71   5.837 5.35e-09 ***
## educ_somecoll  22461.39    1857.67  12.091  < 2e-16 ***
## educ_college   57155.71    1830.96  31.216  < 2e-16 ***
## educ_advdeg    82766.43    1878.64  44.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76760 on 46959 degrees of freedom
## Multiple R-squared:   0.15,  Adjusted R-squared:  0.1498 
## F-statistic: 753.6 on 11 and 46959 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_temp1,col="green",pch=16,cex=1,lwd=1,lty=2)

library(stargazer)
## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(model_temp1, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               INCWAGE          
## -----------------------------------------------
## AGE                        1,316.691***        
##                              (39.661)          
##                                                
## female                    -24,939.460***       
##                              (720.433)         
##                                                
## AfAm                      -11,934.250***       
##                             (1,130.372)        
##                                                
## Asian                         566.528          
##                             (1,369.834)        
##                                                
## Amindian                    -8,858.569         
##                             (6,077.710)        
##                                                
## race_oth                   -7,526.487***       
##                             (1,272.485)        
##                                                
## Hispanic                   -4,224.816***       
##                             (1,183.469)        
##                                                
## educ_hs                    10,592.370***       
##                             (1,814.709)        
##                                                
## educ_somecoll              22,461.390***       
##                             (1,857.674)        
##                                                
## educ_college               57,155.710***       
##                             (1,830.963)        
##                                                
## educ_advdeg                82,766.430***       
##                             (1,878.638)        
##                                                
## Constant                   -7,096.252***       
##                             (2,446.712)        
##                                                
## -----------------------------------------------
## Observations                  46,971           
## R2                             0.150           
## Adjusted R2                    0.150           
## Residual Std. Error   76,755.980 (df = 46959)  
## F Statistic         753.551*** (df = 11; 46959)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Simply interpret the results. The model has a very low Multiple R^2 value, which means this only has 15% chance to explain the income. From the summaries above, “Asian” and “amindian” are both insignificant. Next run on a step regression test to delete some weak relative variable.

step(model_temp1)
## Start:  AIC=1056708
## INCWAGE ~ AGE + female + AfAm + Asian + Amindian + race_oth + 
##     Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg
## 
##                 Df  Sum of Sq        RSS     AIC
## - Asian          1 1.0077e+09 2.7666e+14 1056706
## <none>                        2.7666e+14 1056708
## - Amindian       1 1.2516e+10 2.7667e+14 1056708
## - Hispanic       1 7.5080e+10 2.7673e+14 1056719
## - educ_hs        1 2.0072e+11 2.7686e+14 1056740
## - race_oth       1 2.0611e+11 2.7686e+14 1056741
## - AfAm           1 6.5671e+11 2.7731e+14 1056817
## - educ_somecoll  1 8.6131e+11 2.7752e+14 1056852
## - educ_college   1 5.7410e+12 2.8240e+14 1057671
## - AGE            1 6.4932e+12 2.8315e+14 1057796
## - female         1 7.0601e+12 2.8372e+14 1057890
## - educ_advdeg    1 1.1435e+13 2.8809e+14 1058608
## 
## Step:  AIC=1056706
## INCWAGE ~ AGE + female + AfAm + Amindian + race_oth + Hispanic + 
##     educ_hs + educ_somecoll + educ_college + educ_advdeg
## 
##                 Df  Sum of Sq        RSS     AIC
## <none>                        2.7666e+14 1056706
## - Amindian       1 1.2451e+10 2.7667e+14 1056706
## - Hispanic       1 8.9401e+10 2.7675e+14 1056719
## - educ_hs        1 1.9974e+11 2.7686e+14 1056738
## - race_oth       1 2.4610e+11 2.7691e+14 1056746
## - AfAm           1 6.6303e+11 2.7732e+14 1056817
## - educ_somecoll  1 8.6128e+11 2.7752e+14 1056850
## - educ_college   1 5.7430e+12 2.8240e+14 1057669
## - AGE            1 6.4922e+12 2.8315e+14 1057794
## - female         1 7.0610e+12 2.8372e+14 1057888
## - educ_advdeg    1 1.1441e+13 2.8810e+14 1058607

## 
## Call:
## lm(formula = INCWAGE ~ AGE + female + AfAm + Amindian + race_oth + 
##     Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)
## 
## Coefficients:
##   (Intercept)            AGE         female           AfAm       Amindian  
##         -7004           1316         -24941         -11965          -8835  
##      race_oth       Hispanic        educ_hs  educ_somecoll   educ_college  
##         -7282          -4378          10547          22409          57128  
##   educ_advdeg  
##         82740

The start AIC=1056708, In the first step regression, when Asian is deleted, the AIC is the smallest, which is 1056706. ##Note: in the first step, when “Admindian” is deleted, the AIC is 1056708. NONE means the step regression dose nothing to the model and AIC remain the same as 1056708. When “Hispanic” is delected, AIC is 1056719 ect.

Based on the previous codding in which “Asian” is deleted, Step() run on second time. when “Amidian” is deleted, the AIC decreases to the smallest 1056706.

So we get a new model denoted by: new<- lm(formula=INCWAGE ~ AGE + female + AfAm + race_oth + Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)

But the R^2 value doesn’t improve much.

##suppressMessages(require(AER))
NNobs <- length(INCWAGE)
set.seed(00000)
graph_obs <- (runif(NNobs) < 0.1) 
dat_graph <-subset(dat_use,graph_obs)
plot(INCWAGE ~ jitter(AGE, factor = 2), pch = 19, col = rgb(0.3, 0.4, 0.5, alpha = 0.8), ylim = c(0,150000), data = dat_graph)
to_be_predicted2 <- data.frame(AGE = 25:55, female = 1, AfAm = 0, Asian = 0, Amindian = 1, race_oth = 1, Hispanic = 1, educ_hs = 0, educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
to_be_predicted2$yhat <- predict(model_temp1, newdata = to_be_predicted2)
lines(yhat ~ AGE, data = to_be_predicted2, col="red",cex=2,lwd=2,lty=2)

## From the step regression, we delete Asian and AfAm and use the “new” regression. Since in the new regression: new<- lm(formula=INCWAGE ~ AGE + female + AfAm + race_oth + Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)

In the second part, we try to break down the multi-variable regression into single-variable and see if R^2 will be improved.

2.1 wage~age

nage<-as.numeric(as.character(dat_use$INCWAGE))
par(mfrow=c(2,2))
Wage_age<-lm(INCWAGE~AGE)
plot(Wage_age,col="blue",pch=20,cex=1,lwd=1,lty=1)

##2.2 Wage~Female

str(female)
##  num [1:46971] 0 1 0 1 1 0 0 0 1 0 ...
wage_female_2.1<-lm(INCWAGE ~ female, data = dat_use)
par(mfrow=c(2,2))
plot(wage_female_2.1, col="grey")

summary(wage_female_2.1)
## 
## Call:
## lm(formula = INCWAGE ~ female, data = dat_use)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80964 -39212 -18212  12788 575788 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80963.9      515.9  156.95   <2e-16 ***
## female      -18752.3      766.8  -24.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82720 on 46969 degrees of freedom
## Multiple R-squared:  0.01257,    Adjusted R-squared:  0.01255 
## F-statistic:   598 on 1 and 46969 DF,  p-value: < 2.2e-16
library(stargazer)
stargazer(wage_female_2.1, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               INCWAGE          
## -----------------------------------------------
## female                    -18,752.280***       
##                              (766.823)         
##                                                
## Constant                   80,963.880***       
##                              (515.871)         
##                                                
## -----------------------------------------------
## Observations                  46,971           
## R2                             0.013           
## Adjusted R2                    0.013           
## Residual Std. Error   82,721.380 (df = 46969)  
## F Statistic         598.023*** (df = 1; 46969) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

## try to improve Wage_female_2.1
``
Wage_female_2.2<-glm(INCWAGE ~ female, data = dat_use)
plot(Wage_female_2.2)

summary(Wage_female_2.2)
## 
## Call:
## glm(formula = INCWAGE ~ female, data = dat_use)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -80964  -39212  -18212   12788  575788  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80963.9      515.9  156.95   <2e-16 ***
## female      -18752.3      766.8  -24.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 6842826694)
## 
##     Null deviance: 3.2549e+14  on 46970  degrees of freedom
## Residual deviance: 3.2140e+14  on 46969  degrees of freedom
## AIC: 1197029
## 
## Number of Fisher Scoring iterations: 2
library(stargazer)
stargazer(Wage_female_2.2, type = "text")
## 
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                             INCWAGE          
## ---------------------------------------------
## female                  -18,752.280***       
##                            (766.823)         
##                                              
## Constant                 80,963.880***       
##                            (515.871)         
##                                              
## ---------------------------------------------
## Observations                46,971           
## Log Likelihood           -598,512.600        
## Akaike Inf. Crit.        1,197,029.000       
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

we get the function of y=1/(1+exp{18752.3X+ 80963.9}). now predict the model. ## for this part,we are not sure whether it is correct.

new<-data.frame(female=1,educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
lm.pred<-predict(Wage_female_2.2,new,interval="prediction",level=0.95)
lm.pred
##       1 
## 62211.6

##part3 model_temp2

model_temp2<-lm(INCWAGE ~ AGE + female + educ_somecoll + educ_college + educ_advdeg)
summary(model_temp2)
## 
## Call:
## lm(formula = INCWAGE ~ AGE + female + educ_somecoll + educ_college + 
##     educ_advdeg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -146856  -33301  -10864   13016  618251 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3048.08    1809.96  -1.684   0.0922 .  
## AGE             1343.87      39.64  33.905   <2e-16 ***
## female        -25581.84     719.29 -35.565   <2e-16 ***
## educ_somecoll  14406.76    1010.57  14.256   <2e-16 ***
## educ_college   49929.34     946.27  52.764   <2e-16 ***
## educ_advdeg    75991.16    1020.90  74.435   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76940 on 46965 degrees of freedom
## Multiple R-squared:  0.1459, Adjusted R-squared:  0.1458 
## F-statistic:  1605 on 5 and 46965 DF,  p-value: < 2.2e-16
new<-data.frame(female=0,AGE = 35,educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
lm.pred<-predict(model_temp2,new,interval="prediction",level=0.95)
lm.pred
##        fit    lwr      upr
## 1 93916.72 -56887 244720.5

we can predict a man who has a college educational level, at age 35, his income interval=(-56696.2,185318.1) at the 95% confidence level.

In conclusion. Age, gender and education background are both factors affecting income, among which the average income of males is greater than that of females. Moreover, in high-income groups, the number of males is greater than that of females. For education background, the higher education background is, the higher income is, and the income of people with a graduate degree is much higher than that of people with a high school education.