Carol, Joyce, Yameili, Tamires 10/14/2020
## The following object is masked _by_ .GlobalEnv:
##
## OWNCOST
use_varb <- (AGE >= 25) & (AGE <= 55) & (LABFORCE == 2) & (WKSWORK2 > 4) & (UHRSWORK >= 35)
dat_use <- subset(acs2017_ny, use_varb)
detach()
attach(dat_use)
## The following object is masked _by_ .GlobalEnv:
##
## OWNCOST
Here, the sample is restricted to ages 25 to 55, the range over which age has a real effect on wages, together with the labor-force, weeks-worked, and full-time-hours conditions in use_varb above.
We first rerun the original regression from Lab 4 and examine how well it fits.
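As a quick sanity check on the restriction, we can look at the size of the restricted sample (a minimal sketch; the regression output below reports 46,971 observations):
# rows remaining after applying use_varb
nrow(dat_use)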
model_temp1 <- lm(INCWAGE ~ AGE + female + AfAm + Asian + Amindian + race_oth + Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)
summary(model_temp1)
##
## Call:
## lm(formula = INCWAGE ~ AGE + female + AfAm + Asian + Amindian +
## race_oth + Hispanic + educ_hs + educ_somecoll + educ_college +
## educ_advdeg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -148088 -33205 -10708 13053 625543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7096.25 2446.71 -2.900 0.003730 **
## AGE 1316.69 39.66 33.199 < 2e-16 ***
## female -24939.46 720.43 -34.617 < 2e-16 ***
## AfAm -11934.26 1130.37 -10.558 < 2e-16 ***
## Asian 566.53 1369.83 0.414 0.679188
## Amindian -8858.57 6077.71 -1.458 0.144971
## race_oth -7526.49 1272.49 -5.915 3.35e-09 ***
## Hispanic -4224.82 1183.47 -3.570 0.000358 ***
## educ_hs 10592.37 1814.71 5.837 5.35e-09 ***
## educ_somecoll 22461.39 1857.67 12.091 < 2e-16 ***
## educ_college 57155.71 1830.96 31.216 < 2e-16 ***
## educ_advdeg 82766.43 1878.64 44.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76760 on 46959 degrees of freedom
## Multiple R-squared: 0.15, Adjusted R-squared: 0.1498
## F-statistic: 753.6 on 11 and 46959 DF, p-value: < 2.2e-16
##
## ===============================================
## Dependent variable:
## ---------------------------
## INCWAGE
## -----------------------------------------------
## AGE 1,316.691***
## (39.661)
##
## female -24,939.460***
## (720.433)
##
## AfAm -11,934.250***
## (1,130.372)
##
## Asian 566.528
## (1,369.834)
##
## Amindian -8,858.569
## (6,077.710)
##
## race_oth -7,526.487***
## (1,272.485)
##
## Hispanic -4,224.816***
## (1,183.469)
##
## educ_hs 10,592.370***
## (1,814.709)
##
## educ_somecoll 22,461.390***
## (1,857.674)
##
## educ_college 57,155.710***
## (1,830.963)
##
## educ_advdeg 82,766.430***
## (1,878.638)
##
## Constant -7,096.252***
## (2,446.712)
##
## -----------------------------------------------
## Observations 46,971
## R2 0.150
## Adjusted R2 0.150
## Residual Std. Error 76,755.980 (df = 46959)
## F Statistic 753.551*** (df = 11; 46959)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Interpreting the results: the model's multiple R^2 is only about 0.15, meaning it explains only about 15% of the variation in income. From the summary above, the coefficients on Asian and Amindian are both statistically insignificant. Next we run a stepwise regression to drop the weakly related variables.
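The stepwise output below was presumably produced by a call along these lines (a sketch; the exact code chunk is not shown in this write-up):
# backward elimination from the full model, choosing at each step the drop that lowers AIC the most
model_step <- step(model_temp1, direction = "backward")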
## Start: AIC=1056708
## INCWAGE ~ AGE + female + AfAm + Asian + Amindian + race_oth +
## Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg
##
## Df Sum of Sq RSS AIC
## - Asian 1 1.0077e+09 2.7666e+14 1056706
## <none> 2.7666e+14 1056708
## - Amindian 1 1.2516e+10 2.7667e+14 1056708
## - Hispanic 1 7.5080e+10 2.7673e+14 1056719
## - educ_hs 1 2.0072e+11 2.7686e+14 1056740
## - race_oth 1 2.0611e+11 2.7686e+14 1056741
## - AfAm 1 6.5671e+11 2.7731e+14 1056817
## - educ_somecoll 1 8.6131e+11 2.7752e+14 1056852
## - educ_college 1 5.7410e+12 2.8240e+14 1057671
## - AGE 1 6.4932e+12 2.8315e+14 1057796
## - female 1 7.0601e+12 2.8372e+14 1057890
## - educ_advdeg 1 1.1435e+13 2.8809e+14 1058608
##
## Step: AIC=1056706
## INCWAGE ~ AGE + female + AfAm + Amindian + race_oth + Hispanic +
## educ_hs + educ_somecoll + educ_college + educ_advdeg
##
## Df Sum of Sq RSS AIC
## <none> 2.7666e+14 1056706
## - Amindian 1 1.2451e+10 2.7667e+14 1056706
## - Hispanic 1 8.9401e+10 2.7675e+14 1056719
## - educ_hs 1 1.9974e+11 2.7686e+14 1056738
## - race_oth 1 2.4610e+11 2.7691e+14 1056746
## - AfAm 1 6.6303e+11 2.7732e+14 1056817
## - educ_somecoll 1 8.6128e+11 2.7752e+14 1056850
## - educ_college 1 5.7430e+12 2.8240e+14 1057669
## - AGE 1 6.4922e+12 2.8315e+14 1057794
## - female 1 7.0610e+12 2.8372e+14 1057888
## - educ_advdeg 1 1.1441e+13 2.8810e+14 1058607
##
## Call:
## lm(formula = INCWAGE ~ AGE + female + AfAm + Amindian + race_oth +
## Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)
##
## Coefficients:
## (Intercept) AGE female AfAm Amindian
## -7004 1316 -24941 -11965 -8835
## race_oth Hispanic educ_hs educ_somecoll educ_college
## -7282 -4378 10547 22409 57128
## educ_advdeg
## 82740
The starting AIC is 1056708. In the first step, deleting Asian yields the smallest AIC, 1056706. (Note: <none> means leaving the model unchanged, so the AIC stays at 1056708; deleting Amindian also gives 1056708, deleting Hispanic gives 1056719, and so on.)
With Asian removed, step() runs a second time. Deleting Amindian leaves the AIC essentially unchanged at 1056706, the same as keeping it, so it can be dropped as well.
So we get a new model:
new <- lm(formula = INCWAGE ~ AGE + female + AfAm + race_oth + Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)
However, the R^2 value does not improve much.
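To check this, a minimal sketch (assuming the reduced model has been fit as new above, with dat_use still attached) compares the adjusted R^2 of the full and reduced models:
# adjusted R^2 of the full model vs. the reduced model
summary(model_temp1)$adj.r.squared
summary(new)$adj.r.squared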
##suppressMessages(require(AER))
NNobs <- length(INCWAGE)
set.seed(00000)
graph_obs <- (runif(NNobs) < 0.1)
dat_graph <-subset(dat_use,graph_obs)
plot(INCWAGE ~ jitter(AGE, factor = 2), pch = 19, col = rgb(0.3, 0.4, 0.5, alpha = 0.8), ylim = c(0,150000), data = dat_graph)
to_be_predicted2 <- data.frame(AGE = 25:55, female = 1, AfAm = 0, Asian = 0, Amindian = 1, race_oth = 1, Hispanic = 1, educ_hs = 0, educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
to_be_predicted2$yhat <- predict(model_temp1, newdata = to_be_predicted2)
lines(yhat ~ AGE, data = to_be_predicted2, col = "red", cex = 2, lwd = 2, lty = 2)
From the step regression we deleted Asian and Amindian, giving the reduced regression: new <- lm(formula = INCWAGE ~ AGE + female + AfAm + race_oth + Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)
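The fitted line above uses model_temp1; a sketch of overlaying the corresponding predictions from the reduced model for comparison (assuming new has been fit as above):
# predictions for the same hypothetical person from the reduced model
to_be_predicted2$yhat_new <- predict(new, newdata = to_be_predicted2)
lines(yhat_new ~ AGE, data = to_be_predicted2, col = "blue", lwd = 2, lty = 3)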
In the second part, we break the multi-variable regression down into single-variable regressions and see whether R^2 improves.
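One quick way to compare the single-variable fits is to pull out their R^2 values directly (a sketch using dat_use; the diagnostic plots for each fit follow below):
# R^2 of each single-variable regression
summary(lm(INCWAGE ~ AGE, data = dat_use))$r.squared
summary(lm(INCWAGE ~ female, data = dat_use))$r.squared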
nage<-as.numeric(as.character(dat_use$INCWAGE))
par(mfrow=c(2,2))
Wage_age<-lm(INCWAGE~AGE)
plot(Wage_age, col = "blue", pch = 20, cex = 1, lwd = 1, lty = 1)
## 2.2 Wage ~ female
## num [1:46971] 0 1 0 1 1 0 0 0 1 0 ...
wage_female_2.1<-lm(INCWAGE ~ female, data = dat_use)
par(mfrow=c(2,2))
plot(wage_female_2.1, col = "grey")
##
## Call:
## lm(formula = INCWAGE ~ female, data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80964 -39212 -18212 12788 575788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80963.9 515.9 156.95 <2e-16 ***
## female -18752.3 766.8 -24.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82720 on 46969 degrees of freedom
## Multiple R-squared: 0.01257, Adjusted R-squared: 0.01255
## F-statistic: 598 on 1 and 46969 DF, p-value: < 2.2e-16
##
## ===============================================
## Dependent variable:
## ---------------------------
## INCWAGE
## -----------------------------------------------
## female -18,752.280***
## (766.823)
##
## Constant 80,963.880***
## (515.871)
##
## -----------------------------------------------
## Observations 46,971
## R2 0.013
## Adjusted R2 0.013
## Residual Std. Error 82,721.380 (df = 46969)
## F Statistic 598.023*** (df = 1; 46969)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## try to improve Wage_female_2.1
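The glm output below was presumably produced by a fit along these lines (a sketch; the call is shown in the output, and the object name Wage_female_2.2 is taken from the predict() step further down):
# same wage-on-female regression refit with glm(); the default family is gaussian
Wage_female_2.2 <- glm(INCWAGE ~ female, data = dat_use)
summary(Wage_female_2.2)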
##
## Call:
## glm(formula = INCWAGE ~ female, data = dat_use)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -80964 -39212 -18212 12788 575788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80963.9 515.9 156.95 <2e-16 ***
## female -18752.3 766.8 -24.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6842826694)
##
## Null deviance: 3.2549e+14 on 46970 degrees of freedom
## Residual deviance: 3.2140e+14 on 46969 degrees of freedom
## AIC: 1197029
##
## Number of Fisher Scoring iterations: 2
##
## =============================================
## Dependent variable:
## ---------------------------
## INCWAGE
## ---------------------------------------------
## female -18,752.280***
## (766.823)
##
## Constant 80,963.880***
## (515.871)
##
## ---------------------------------------------
## Observations 46,971
## Log Likelihood -598,512.600
## Akaike Inf. Crit. 1,197,029.000
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Since this glm() uses the default gaussian family, the fitted model is simply linear: predicted INCWAGE = 80963.9 - 18752.3 * female. Now we predict from the model. (For this part, we are not sure whether it is correct.)
new<-data.frame(female=1,educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
lm.pred<-predict(Wage_female_2.2,new,interval="prediction",level=0.95)
lm.pred
## 1
## 62211.6
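As a check on that number (a small sketch, assuming Wage_female_2.2 as above): predict() for a glm does not return prediction intervals, so only the point prediction is printed, and it equals the intercept plus the female coefficient:
# 80963.9 - 18752.3 = 62211.6, the predicted mean wage for women
coef(Wage_female_2.2)["(Intercept)"] + coef(Wage_female_2.2)["female"]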
## Part 3: model_temp2
model_temp2<-lm(INCWAGE ~ AGE + female + educ_somecoll + educ_college + educ_advdeg)
summary(model_temp2)
##
## Call:
## lm(formula = INCWAGE ~ AGE + female + educ_somecoll + educ_college +
## educ_advdeg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146856 -33301 -10864 13016 618251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3048.08 1809.96 -1.684 0.0922 .
## AGE 1343.87 39.64 33.905 <2e-16 ***
## female -25581.84 719.29 -35.565 <2e-16 ***
## educ_somecoll 14406.76 1010.57 14.256 <2e-16 ***
## educ_college 49929.34 946.27 52.764 <2e-16 ***
## educ_advdeg 75991.16 1020.90 74.435 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76940 on 46965 degrees of freedom
## Multiple R-squared: 0.1459, Adjusted R-squared: 0.1458
## F-statistic: 1605 on 5 and 46965 DF, p-value: < 2.2e-16
new<-data.frame(female=0,AGE = 35,educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
lm.pred<-predict(model_temp2,new,interval="prediction",level=0.95)
lm.pred
## fit lwr upr
## 1 93916.72 -56887 244720.5
We can predict that a man aged 35 with a college degree has an expected income of about 93,916.7, with a 95% prediction interval of roughly (-56,887, 244,720) for his individual income.
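For comparison, the interval for the mean income of all such people (a confidence interval rather than a prediction interval for one individual) is much narrower; a quick sketch with the same model and data frame:
# 95% confidence interval for the mean income at these covariate values
predict(model_temp2, new, interval = "confidence", level = 0.95)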
In conclusion, age, gender, and educational background are all factors that affect income. The average income of males is greater than that of females, and males outnumber females in the high-income groups. As for education, higher educational attainment is associated with higher income, and the income of people with a graduate degree is much higher than that of people with only a high school education.