##Lab 8

###Loading and viewing the data

library(plyr)
inc <- read.csv("~/Desktop/ANT291WD/income.csv")

I have created an object inc that contains the dataset.
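A quick look at the structure of the data is a sensible first step. This is a minimal sketch using base R; the column names match the ones used later in this lab.

str(inc)             # variable names and types
head(inc)            # first few rows of the dataset
summary(inc$income)  # range and skew of the raw income variable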

If my question is whether innate factors influence income more or less than life experience, the first thing I will do is look at the variable “income”.

hist(inc$income)

Log-transforming the data makes them more normal. First, I will create an object loginc that is the log-transformed version of the “income” column in the dataset.

loginc <- log(inc$income)
hist(loginc)
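To see how much closer to normal the transformed values are, one option is a normal quantile-quantile plot (a quick sketch using base R).

qqnorm(loginc)  # compare quantiles of loginc against a normal distribution
qqline(loginc)  # reference line; points near the line suggest approximate normality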

###Innate models

Next, I will look at a few innate properties in the dataset, specifically race and sex. Given how prominently pay gaps between races and genders figure in the national dialogue (although it is important to note here that the data we are dealing with specify sex and not gender), it is reasonable to hypothesize a link between these factors and income. First, we will change the variables to factors and rename the levels. The names for the levels were derived from the metadata.

inc$female <- as.factor(inc$female)
inc$race <- as.factor(inc$race)
inc$female <- revalue(inc$female, c("0"="male", "1"="female"))
inc$race <- revalue(inc$race, c("1"="white","2"="black", "3"="other"))

Here is what the plots look like:

plot(loginc~inc$female, ylab="log transformed income", xlab="sex")

plot(loginc~inc$race, ylab="log transformed income", xlab="race")

It does look like income is lower for females and for individuals who do not identify themselves as “white”.
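A quick numeric summary alongside the plots (a minimal sketch; group means of log income computed with base R's tapply):

tapply(loginc, inc$female, mean, na.rm = TRUE)  # mean log income by sex
tapply(loginc, inc$race, mean, na.rm = TRUE)    # mean log income by race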

If we look at the interaction between the two, we see this:

plot(loginc~interaction(inc$race, inc$female), ylab="log transformed income", xlab="interaction between race and sex")

im1 <- glm(loginc~inc$female)
summary(im1)
## 
## Call:
## glm(formula = loginc ~ inc$female)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3921  -0.4838   0.2126   0.7208   2.3537  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.52335    0.04316 243.826  < 2e-16 ***
## inc$femalefemale -0.47630    0.06079  -7.835 9.34e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.274104)
## 
##     Null deviance: 1832.6  on 1378  degrees of freedom
## Residual deviance: 1754.4  on 1377  degrees of freedom
## AIC: 4251.5
## 
## Number of Fisher Scoring iterations: 2
im2 <- glm(loginc~inc$race)
summary(im2)
## 
## Call:
## glm(formula = loginc ~ inc$race)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2831  -0.4818   0.1836   0.7379   2.5315  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.36989    0.03564 290.980  < 2e-16 ***
## inc$raceblack -0.36435    0.08646  -4.214 2.67e-05 ***
## inc$raceother -0.31041    0.10406  -2.983   0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.309429)
## 
##     Null deviance: 1832.6  on 1378  degrees of freedom
## Residual deviance: 1801.8  on 1376  degrees of freedom
## AIC: 4290.2
## 
## Number of Fisher Scoring iterations: 2
im <- glm(loginc~inc$female*inc$race)
summary(im)
## 
## Call:
## glm(formula = loginc ~ inc$female * inc$race)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4901  -0.4801   0.1985   0.7123   2.5586  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    10.62130    0.04834 219.738  < 2e-16 ***
## inc$femalefemale               -0.52260    0.06969  -7.499 1.15e-13 ***
## inc$raceblack                  -0.64287    0.13866  -4.636 3.89e-06 ***
## inc$raceother                  -0.25901    0.13785  -1.879  0.06046 .  
## inc$femalefemale:inc$raceblack  0.56434    0.17570   3.212  0.00135 ** 
## inc$femalefemale:inc$raceother -0.14652    0.20417  -0.718  0.47308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.249965)
## 
##     Null deviance: 1832.6  on 1378  degrees of freedom
## Residual deviance: 1716.2  on 1373  degrees of freedom
## AIC: 4229.1
## 
## Number of Fisher Scoring iterations: 2

If we model these phenomena, we see that sex is a better predictor of income than race, but the interaction of the two factors is better than either one alone (AIC = 4229 vs. 4252 and 4290, respectively). When looking at the interaction between race and sex, the only terms with a p-value > 0.05 are those involving the “other” race category.
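The same comparison can be read directly from the fitted objects; a minimal sketch, assuming the three models above are still in the workspace.

AIC(im1, im2, im)  # lower AIC indicates the better-fitting model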

###Experience-based models

Before loading the dataset, I created a new category that collapsed education down into “no diploma” (less than a high school education), “high school” (high school graduate), “some college” (less than a four-year degree), “college” (college graduate), and “graduate” (more than a bachelor's-level education). I turned this into a factor and reordered its levels; I also converted veteran status to a labeled factor.

inc$Educ <- as.factor(inc$Educ)
inc$Educ <- factor(inc$Educ, levels=c("Graduate","College", "Some College", "HS","ND"))
inc$veteran <- as.factor(inc$veteran)
inc$veteran <- revalue(inc$veteran, c("0"="non-veteran", "1"="veteran"))
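The collapsing itself happened before the dataset was loaded, but a rough sketch of how it could be reproduced in R, assuming a hypothetical numeric years-of-schooling column called educ_years (not present in the data as loaded), would be:

# Hypothetical sketch only: educ_years does not exist in this dataset
inc$Educ <- cut(inc$educ_years,
                breaks = c(-Inf, 11, 12, 15, 16, Inf),
                labels = c("ND", "HS", "Some College", "College", "Graduate"))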

You can see that education level has an effect on income.

plot(loginc~inc$Educ, xlab="education level", ylab="log transformed income")

Another factor that could affect income is whether an individual served in the military or not. Military service teaches discipline, expands individuals’ experiences and extends social networks.

plot(loginc~inc$veteran, ylab="log transformed income", xlab="military service")

em1 <- glm(loginc~inc$Educ)
summary(em1)
## 
## Call:
## glm(formula = loginc ~ inc$Educ)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6865  -0.4359   0.2077   0.6874   2.4543  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.77329    0.07138 150.926  < 2e-16 ***
## inc$EducCollege      -0.11021    0.09624  -1.145    0.252    
## inc$EducSome College -0.67774    0.09062  -7.479 1.33e-13 ***
## inc$EducHS           -0.78628    0.09252  -8.498  < 2e-16 ***
## inc$EducND           -0.94400    0.12657  -7.458 1.55e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.212676)
## 
##     Null deviance: 1832.6  on 1378  degrees of freedom
## Residual deviance: 1666.2  on 1374  degrees of freedom
## AIC: 4186.3
## 
## Number of Fisher Scoring iterations: 2
em2 <- glm(loginc~inc$veteran)
summary(em2)
## 
## Call:
## glm(formula = loginc ~ inc$veteran)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4123  -0.4809   0.1928   0.7128   2.5032  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        10.25509    0.03277 312.932  < 2e-16 ***
## inc$veteranveteran  0.26643    0.10072   2.645  0.00825 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.324169)
## 
##     Null deviance: 1832.6  on 1378  degrees of freedom
## Residual deviance: 1823.4  on 1377  degrees of freedom
## AIC: 4304.6
## 
## Number of Fisher Scoring iterations: 2
em <- glm(loginc~inc$Educ+inc$veteran)
summary(em)
## 
## Call:
## glm(formula = loginc ~ inc$Educ + inc$veteran)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6581  -0.4330   0.1902   0.6851   2.4787  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.74484    0.07155 150.176  < 2e-16 ***
## inc$EducCollege      -0.10619    0.09585  -1.108 0.268123    
## inc$EducSome College -0.69368    0.09037  -7.676 3.09e-14 ***
## inc$EducHS           -0.80232    0.09225  -8.697  < 2e-16 ***
## inc$EducND           -0.93995    0.12606  -7.456 1.57e-13 ***
## inc$veteranveteran    0.33857    0.09637   3.513 0.000457 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.202748)
## 
##     Null deviance: 1832.6  on 1378  degrees of freedom
## Residual deviance: 1651.4  on 1373  degrees of freedom
## AIC: 4176
## 
## Number of Fisher Scoring iterations: 2

With our experience-based models, we see a significant difference in income, with income decreasing as education level goes down. Military service also raises income, but it is not as good a predictor given its higher AIC. Whereas we looked at the interaction between race and sex, here I simply added the two factors together in the model because there is no reason to believe they interact with one another. If anything, given that military service often occurs during the years when other people attend college, individuals with military service may be less likely to have a college education.
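One quick way to probe that suspicion in these data would be a cross-tabulation of education level by veteran status (a minimal sketch using base R).

table(inc$Educ, inc$veteran)                 # counts of each education level by service
prop.table(table(inc$Educ, inc$veteran), 2)  # proportions within each service group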

###Conclusions

According to the AIC values, the experience-based models are better predictors of income than the innate models. In general, combining two factors produced models with lower AICs than either factor alone, but this may be because I cherry-picked factors whose effects on income were intuitive. Ultimately, a combination of innate and environmental factors may be the best explanation for income. Take, for example, the interaction between race and crack use: the combination of the innate and environmental factors returns a better AIC than either of the component models.

glm(loginc~inc$race)
## 
## Call:  glm(formula = loginc ~ inc$race)
## 
## Coefficients:
##   (Intercept)  inc$raceblack  inc$raceother  
##       10.3699        -0.3644        -0.3104  
## 
## Degrees of Freedom: 1378 Total (i.e. Null);  1376 Residual
## Null Deviance:       1833 
## Residual Deviance: 1802  AIC: 4290
glm(loginc~as.factor(inc$crack))
## 
## Call:  glm(formula = loginc ~ as.factor(inc$crack))
## 
## Coefficients:
##           (Intercept)  as.factor(inc$crack)1  
##               10.3021                -0.4717  
## 
## Degrees of Freedom: 1378 Total (i.e. Null);  1377 Residual
## Null Deviance:       1833 
## Residual Deviance: 1821  AIC: 4303
glm(loginc~inc$race*as.factor(inc$crack))
## 
## Call:  glm(formula = loginc ~ inc$race * as.factor(inc$crack))
## 
## Coefficients:
##                         (Intercept)                        inc$raceblack  
##                             10.3871                              -0.3358  
##                       inc$raceother                as.factor(inc$crack)1  
##                             -0.3357                              -0.4217  
## inc$raceblack:as.factor(inc$crack)1  inc$raceother:as.factor(inc$crack)1  
##                             -0.5426                               0.7901  
## 
## Degrees of Freedom: 1378 Total (i.e. Null);  1373 Residual
## Null Deviance:       1833 
## Residual Deviance: 1785  AIC: 4284
summary(glm(loginc~interaction(inc$race,as.factor(inc$crack))))
## 
## Call:
## glm(formula = loginc ~ interaction(inc$race, as.factor(inc$crack)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3003  -0.4809   0.1780   0.7237   2.4949  
## 
## Coefficients:
##                                                    Estimate Std. Error
## (Intercept)                                        10.38707    0.03626
## interaction(inc$race, as.factor(inc$crack))black.0 -0.33583    0.08823
## interaction(inc$race, as.factor(inc$crack))other.0 -0.33566    0.10497
## interaction(inc$race, as.factor(inc$crack))white.1 -0.42175    0.17965
## interaction(inc$race, as.factor(inc$crack))black.1 -1.30014    0.36242
## interaction(inc$race, as.factor(inc$crack))other.1  0.03271    0.65936
##                                                    t value Pr(>|t|)    
## (Intercept)                                        286.460  < 2e-16 ***
## interaction(inc$race, as.factor(inc$crack))black.0  -3.806 0.000147 ***
## interaction(inc$race, as.factor(inc$crack))other.0  -3.198 0.001417 ** 
## interaction(inc$race, as.factor(inc$crack))white.1  -2.348 0.019038 *  
## interaction(inc$race, as.factor(inc$crack))black.1  -3.587 0.000346 ***
## interaction(inc$race, as.factor(inc$crack))other.1   0.050 0.960438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.300329)
## 
##     Null deviance: 1832.6  on 1378  degrees of freedom
## Residual deviance: 1785.4  on 1373  degrees of freedom
## AIC: 4283.6
## 
## Number of Fisher Scoring iterations: 2
plot(loginc~interaction(inc$race,as.factor(inc$crack)), ylab="log transformed income", xlab="interaction between use of crack cocaine and race")

This brings up the danger of assigning a direction of causation. We know a relationship between these factors exists, but without further investigation it is unclear which factor causes the other, or whether the relationship is causal at all.