##Lab 8
###Loading and and viewing the data
library(plyr)
inc <- read.csv("~/Desktop/ANT291WD/income.csv")
I have created an object inc that represents the dataset
If my question is whether innate factors influence income more or less than life experience, the first thing I will do is look at the variable “income”
hist(inc$income)
log-transforming the data make them more normal. First, I will create an object
loginc that is the log-transformed version of the “income” column in the dataset
loginc <- log(inc$income)
hist(loginc)
###Innate models
Next, I will look at a few innate properties in the dataset, specifically race and sex. Given the ubiquity of talk in the national dialogue concerning pay gaps between races and genders (although it is important to note here that the data we are deling with specify sex and not gender), it is reasonable to hypothesize that there may be a link between these factors and income. First, we will change the variables to factors and rename the levels. The names for the levels were derived from the metadata
inc$female <- as.factor(inc$female)
inc$race <- as.factor(inc$race)
inc$female<- revalue(inc$female, c("0"="male", "1"="female"))
inc$race <- revalue(inc$race, c("1"="white","2"="black", "3"="other"))
Here is what the plots look like
plot(loginc~inc$female, ylab="log transformed income", xlab="sex")
plot(loginc~inc$race,ylab="log transformed income", xlab="race")
It does look like there is a relationship between lower income in females and lower income in individuals who do not identify themselves as “white”
If we look at the interaction between the two, we see this:
plot(loginc~interaction(inc$race, inc$female), ylab="log transformed income", xlab="interaction between race and sex")
im1 <- glm(loginc~inc$female)
summary(im1)
##
## Call:
## glm(formula = loginc ~ inc$female)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3921 -0.4838 0.2126 0.7208 2.3537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.52335 0.04316 243.826 < 2e-16 ***
## inc$femalefemale -0.47630 0.06079 -7.835 9.34e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.274104)
##
## Null deviance: 1832.6 on 1378 degrees of freedom
## Residual deviance: 1754.4 on 1377 degrees of freedom
## AIC: 4251.5
##
## Number of Fisher Scoring iterations: 2
im2 <- glm(loginc~inc$race)
summary(im2)
##
## Call:
## glm(formula = loginc ~ inc$race)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.2831 -0.4818 0.1836 0.7379 2.5315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.36989 0.03564 290.980 < 2e-16 ***
## inc$raceblack -0.36435 0.08646 -4.214 2.67e-05 ***
## inc$raceother -0.31041 0.10406 -2.983 0.0029 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.309429)
##
## Null deviance: 1832.6 on 1378 degrees of freedom
## Residual deviance: 1801.8 on 1376 degrees of freedom
## AIC: 4290.2
##
## Number of Fisher Scoring iterations: 2
im <- glm(loginc~inc$female*inc$race)
summary(im)
##
## Call:
## glm(formula = loginc ~ inc$female * inc$race)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4901 -0.4801 0.1985 0.7123 2.5586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.62130 0.04834 219.738 < 2e-16 ***
## inc$femalefemale -0.52260 0.06969 -7.499 1.15e-13 ***
## inc$raceblack -0.64287 0.13866 -4.636 3.89e-06 ***
## inc$raceother -0.25901 0.13785 -1.879 0.06046 .
## inc$femalefemale:inc$raceblack 0.56434 0.17570 3.212 0.00135 **
## inc$femalefemale:inc$raceother -0.14652 0.20417 -0.718 0.47308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.249965)
##
## Null deviance: 1832.6 on 1378 degrees of freedom
## Residual deviance: 1716.2 on 1373 degrees of freedom
## AIC: 4229.1
##
## Number of Fisher Scoring iterations: 2
If we model these phenomenon we see that when sex is a better predictor of income than race, but the interaction of the two factors is better than either one of them alone (AIC = 4229 vs 4252 and 4290 respectively). When looking at the interaction between race and sex, the only categories with a p-value > 0.5 is where the “other” category in race.
###Experience based madels
Before loading the dataset, I created a new category that collapsed education down into “no diploma,” meaning less than high school education, “high school,” for high school graduate, “some college,” for less than a four year degree, “college,” for college graduate, and “graduate,” for more than a bachelors level education. I turned this into a factor
inc$Educ <- as.factor(inc$Educ)
inc$Educ <- factor(inc$Educ, levels=c("Graduate","College", "Some College", "HS","ND"))
inc$veteran <- as.factor(inc$veteran)
inc$veteran<- revalue(inc$veteran, c("0"="non-veteran", "1"="veteran"))
You can see that this has an effect on income.
plot(loginc~inc$Educ, xlab="education level", ylab="log transformed income")
Another factor that could affect income is whether an individual served in the military or not. Military service teaches discipline, expands individuals’ experiences and extends social networks.
plot(loginc~inc$veteran, ylab="log transformed income", xlab="military service")
em1 <- glm(loginc~inc$Educ)
summary(em1)
##
## Call:
## glm(formula = loginc ~ inc$Educ)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6865 -0.4359 0.2077 0.6874 2.4543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.77329 0.07138 150.926 < 2e-16 ***
## inc$EducCollege -0.11021 0.09624 -1.145 0.252
## inc$EducSome College -0.67774 0.09062 -7.479 1.33e-13 ***
## inc$EducHS -0.78628 0.09252 -8.498 < 2e-16 ***
## inc$EducND -0.94400 0.12657 -7.458 1.55e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.212676)
##
## Null deviance: 1832.6 on 1378 degrees of freedom
## Residual deviance: 1666.2 on 1374 degrees of freedom
## AIC: 4186.3
##
## Number of Fisher Scoring iterations: 2
em2 <- glm(loginc~inc$veteran)
summary(em2)
##
## Call:
## glm(formula = loginc ~ inc$veteran)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4123 -0.4809 0.1928 0.7128 2.5032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.25509 0.03277 312.932 < 2e-16 ***
## inc$veteranveteran 0.26643 0.10072 2.645 0.00825 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.324169)
##
## Null deviance: 1832.6 on 1378 degrees of freedom
## Residual deviance: 1823.4 on 1377 degrees of freedom
## AIC: 4304.6
##
## Number of Fisher Scoring iterations: 2
em <- glm(loginc~inc$Educ+inc$veteran)
summary(em)
##
## Call:
## glm(formula = loginc ~ inc$Educ + inc$veteran)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6581 -0.4330 0.1902 0.6851 2.4787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.74484 0.07155 150.176 < 2e-16 ***
## inc$EducCollege -0.10619 0.09585 -1.108 0.268123
## inc$EducSome College -0.69368 0.09037 -7.676 3.09e-14 ***
## inc$EducHS -0.80232 0.09225 -8.697 < 2e-16 ***
## inc$EducND -0.93995 0.12606 -7.456 1.57e-13 ***
## inc$veteranveteran 0.33857 0.09637 3.513 0.000457 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.202748)
##
## Null deviance: 1832.6 on 1378 degrees of freedom
## Residual deviance: 1651.4 on 1373 degrees of freedom
## AIC: 4176
##
## Number of Fisher Scoring iterations: 2
With our experience based models, we see that there is a significant difference in income, with income levels decreasing as education level goes down. Military service also raises income levels, but it is not as good an indicator given its higher AIC. Whereas we looked at the interaction between race and sex, I simply added these two factors together in the model because there is no reason to believe the factors interact with one another. Actually, givemnn that military service often occurs during the time when other people attend college, individuals with military service may be less likely to have college education.
###Conclusions
According to the AIC values, the experince based models are better predictors of income than the innate models. In general, adding two factors together resulted in models with lower AICs than each of their parts, but this may be because I cherry-picked factors that were intuitive when considering their effect on income. Ultimately, a combination of innate and environmental factors may be the best explanation for income. take for example when we look at the interaction between race and crack use and its effect on income, the combination of the innate and environmental factors return a better AIC than either of the component models.
glm(loginc~inc$race)
##
## Call: glm(formula = loginc ~ inc$race)
##
## Coefficients:
## (Intercept) inc$raceblack inc$raceother
## 10.3699 -0.3644 -0.3104
##
## Degrees of Freedom: 1378 Total (i.e. Null); 1376 Residual
## Null Deviance: 1833
## Residual Deviance: 1802 AIC: 4290
glm(loginc~as.factor(inc$crack))
##
## Call: glm(formula = loginc ~ as.factor(inc$crack))
##
## Coefficients:
## (Intercept) as.factor(inc$crack)1
## 10.3021 -0.4717
##
## Degrees of Freedom: 1378 Total (i.e. Null); 1377 Residual
## Null Deviance: 1833
## Residual Deviance: 1821 AIC: 4303
glm(loginc~inc$race*as.factor(inc$crack))
##
## Call: glm(formula = loginc ~ inc$race * as.factor(inc$crack))
##
## Coefficients:
## (Intercept) inc$raceblack
## 10.3871 -0.3358
## inc$raceother as.factor(inc$crack)1
## -0.3357 -0.4217
## inc$raceblack:as.factor(inc$crack)1 inc$raceother:as.factor(inc$crack)1
## -0.5426 0.7901
##
## Degrees of Freedom: 1378 Total (i.e. Null); 1373 Residual
## Null Deviance: 1833
## Residual Deviance: 1785 AIC: 4284
summary(glm(loginc~interaction(inc$race,as.factor(inc$crack))))
##
## Call:
## glm(formula = loginc ~ interaction(inc$race, as.factor(inc$crack)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3003 -0.4809 0.1780 0.7237 2.4949
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 10.38707 0.03626
## interaction(inc$race, as.factor(inc$crack))black.0 -0.33583 0.08823
## interaction(inc$race, as.factor(inc$crack))other.0 -0.33566 0.10497
## interaction(inc$race, as.factor(inc$crack))white.1 -0.42175 0.17965
## interaction(inc$race, as.factor(inc$crack))black.1 -1.30014 0.36242
## interaction(inc$race, as.factor(inc$crack))other.1 0.03271 0.65936
## t value Pr(>|t|)
## (Intercept) 286.460 < 2e-16 ***
## interaction(inc$race, as.factor(inc$crack))black.0 -3.806 0.000147 ***
## interaction(inc$race, as.factor(inc$crack))other.0 -3.198 0.001417 **
## interaction(inc$race, as.factor(inc$crack))white.1 -2.348 0.019038 *
## interaction(inc$race, as.factor(inc$crack))black.1 -3.587 0.000346 ***
## interaction(inc$race, as.factor(inc$crack))other.1 0.050 0.960438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.300329)
##
## Null deviance: 1832.6 on 1378 degrees of freedom
## Residual deviance: 1785.4 on 1373 degrees of freedom
## AIC: 4283.6
##
## Number of Fisher Scoring iterations: 2
plot(loginc~interaction(inc$race,as.factor(inc$crack)), ylab="log transformed income", xlab="interaction between use of crack cocaine and race")
This brings up the danger of assigning a direction of causation. We know the relationship between these factors exist, but without further investigation it is unclear which factor causes the other, or whether the relationship is a causal one at all.