Many factors can influence the earning capability of a person, including but not limited to age, gender, ethnicity, education year, occupation etc.. However, it’s unclear the differences of the contributions of these factors to the income levels, in other words, the ranking of the importance of these factors to the income levels. In this project, I asked the question: what are the most important factor(s) which influence the annual income? I leveraged the “Adult Income” dataset from UC Irvine Machine Learning Depository (https://archive.ics.uci.edu/dataset/2/adult) to assess the relations between these potential factors and annual income level (>50k or <=50k), and prioritize their contributions accordingly.
library(ggplot2)
library(plotly)
library(dplyr)
#setwd('/home/shann/Documents/Data_Science/DAT301/project1')
dat_df = read.csv(file = 'adult.data',header = FALSE)
The original data contain a lot of dash signs “-” in both the column names and data entries, which is identical to the mathematical minus sign and is confusing, so all these dash signs “-” in the data were converted into underline “_”.
#rename colnames to avoid conflicts
colnames(dat_df) = c('age','workClass','fnlwgt','edu','eduNum','maritalStatus','occupation','relationship','race','sex','capital_gain','capital_loss','hoursPerWeek','nativeCountry','class')
In addition, I also recoded the binary dependent variable, the annual income level, from “>50K” to “more50K” and “<=50K” to “less50K” to avoid conflicting with mathematical signs.
#recode class
dat_df$class = with(dat_df,ifelse(class==' <=50K','less50K','more50K'))
The original data also contains a lot of “?” to annotate the missing values. In this analysis, firstly, I included all the independent variables into the initial model to assess their contributions to annual income. If an independent variable does not contribute, or is not statistically significant, it will be removed in the final model. So, to perform the multivariate logistic regression with all the independent variables in the initial model, I removed any sample in the original data frame contain “?”.
# Remove rows contain "?"
dat_df2 <- dat_df %>% filter(!grepl('\\?', age) & !grepl('\\?', workClass) & !grepl('\\?', fnlwgt) & !grepl('\\?', edu) & !grepl('\\?', eduNum) & !grepl('\\?', maritalStatus) & !grepl('\\?', occupation) & !grepl('\\?', relationship) & !grepl('\\?', race) & !grepl('\\?', sex) & !grepl('\\?', capital_gain) & !grepl('\\?', capital_loss) & !grepl('\\?', hoursPerWeek) & !grepl('\\?', nativeCountry) & !grepl('\\?', class))
For the data entries, dash sign “-” were replaced with underline “_“. Specifically, the data entries of non-numerical columns, such as sex and nativeCountry, start with leading spaces” “. To ease further programming, these leading spaces were removed.
#replace - with _ and remove leading space
dat_df3 <- data.frame(lapply(dat_df2, function(x) {gsub("\\-", "_", x)}))
dat_df3 <- data.frame(lapply(dat_df3, function(x) {gsub(" ", "", x)}))
Then, I used the function as.integer() to make sure those numerical integer variables are integers.
dat_df3$age = as.integer(dat_df3$age)
dat_df3$fnlwgt = as.integer(dat_df3$fnlwgt)
dat_df3$eduNum = as.integer(dat_df3$eduNum)
dat_df3$capital_gain = as.integer(dat_df3$capital_gain)
dat_df3$capital_loss = as.integer(dat_df3$capital_loss)
dat_df3$hoursPerWeek = as.integer(dat_df3$hoursPerWeek)
A successful multivariate logistic regression depends on a cleaned dataset after removing outlier samples.
For numerical variables, such as age and education year, histogram was used to show the data distribution to identify outliers.
#age
ggplot(dat_df3, aes(x=age)) + geom_histogram(aes(y=..density..),colour="black", fill="white")+geom_density(alpha=.2, fill="#FF6666")+theme_bw()
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#final weight
ggplot(dat_df3, aes(x=fnlwgt)) + geom_histogram(aes(y=..density..),colour="black", fill="white")+geom_density(alpha=.2, fill="#FF6666")+xlab('final weight')+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#education number
#The two peak mode in education number histogram comes from the higher proportion of population with high school diploma and bachelor's degree.
ggplot(dat_df3, aes(x=eduNum)) + geom_histogram(aes(y=..density..),colour="black", fill="white")+geom_density(alpha=.2, fill="#FF6666")+xlab('education number')+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#hours per week
ggplot(dat_df3, aes(x=hoursPerWeek)) + geom_histogram(aes(y=..density..),colour="black", fill="white")+geom_density(alpha=.2, fill="#FF6666")+xlab('hours per week')+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The variable capital_gain follows a skewed distribution, in which a majority of the sampled population has 0 capital gain.
ggplot(dat_df3, aes(x=capital_gain)) + geom_histogram(aes(y=..density..),colour="black", fill="white")+geom_density(alpha=.2, fill="#FF6666")+xlab('capital gain')+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The variable capital_loss follows a skewed distribution, in which a majority of the sampled population has 0 capital loss.
ggplot(dat_df3, aes(x=capital_loss)) + geom_histogram(aes(y=..density..),colour="black", fill="white")+geom_density(alpha=.2, fill="#FF6666")+xlab('capital loss')+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#percentage of individuals with capital gain or loss:
sum(dat_df3$capital_loss != 0 | dat_df3$capital_gain != 0)/dim(dat_df3)[1]
## [1] 0.1314568
However, when I separated these samples with capital_gain more than 0 and redid a histogram, I observed some outliers with extremely large capital gain ($99,999). This number is a hard coded cap by the original dataset submitter. These ultra-rich individuals were removed before logistic regression.
#capital gain zero removed
#148 individuals with 99999 capital gain will be removed
capGain_df = dat_df3[dat_df3$capital_gain != 0,]
ggplot(capGain_df, aes(x=capital_gain)) + geom_histogram(aes(y=..density..),colour="black", fill="white")+geom_density(alpha=.2, fill="#FF6666")+xlab('capital gain with zero removed')+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Similarly, I separated the samples with capital_loss more than 0 and redid a histogram. I didn’t observe any outliers with extremely large capital loss.
#capital loss zero removed
capLoss_df = dat_df3[dat_df3$capital_loss != 0,]
ggplot(capLoss_df, aes(x=capital_loss)) + geom_histogram(aes(y=..density..),colour="black", fill="white")+geom_density(alpha=.2, fill="#FF6666")+xlab('capital loss with zero removed')+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It is necessary to set a reference for each given categorical variable to compare with in logistic regression. In this project, the category with the highest proportion of samples was set as the reference for the variable. To do so, I used pie charts to identify the category with the highest proportion.
#pie chart for categorial variables
#part1
workClass_df = as.data.frame(table(dat_df3$workClass))
edu_df = as.data.frame(table(dat_df3$edu))
maritalStatus_df = as.data.frame(table(dat_df3$maritalStatus))
occupation_df = as.data.frame(table(dat_df3$occupation))
ggplot(workClass_df,aes(x=Var1,y=Freq))+geom_bar(stat='identity',fill=rainbow(dim(workClass_df)[1]))+xlab('work class')+ylab('count')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
ggplot(edu_df,aes(x=Var1,y=Freq))+geom_bar(stat='identity',fill=rainbow(dim(edu_df)[1]))+xlab('education')+ylab('count')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
ggplot(maritalStatus_df,aes(x=Var1,y=Freq))+geom_bar(stat='identity',fill=rainbow(dim(maritalStatus_df)[1]))+xlab('marital status')+ylab('count')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
ggplot(occupation_df,aes(x=Var1,y=Freq))+geom_bar(stat='identity',fill=rainbow(dim(occupation_df)[1]))+xlab('occupation')+ylab('count')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
Specifically for the categorical variable “native country”, all other countries except the US make up only a very small percentage. Therefore, to reduce computation burden and biases introduced by the discrepant distribution, I merged the samples originally from all other countries as the level “abroad”.
#part2
relationship_df = as.data.frame(table(dat_df3$relationship))
race_df = as.data.frame(table(dat_df3$race))
sex_df = as.data.frame(table(dat_df3$sex))
nativeCountry_df = as.data.frame(table(dat_df3$nativeCountry))
#par(mfrow=c(2,2))
ggplot(relationship_df,aes(x=Var1,y=Freq))+geom_bar(stat='identity',fill=rainbow(dim(relationship_df)[1]))+xlab('relationship')+ylab('count')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
ggplot(race_df,aes(x=Var1,y=Freq))+geom_bar(stat='identity',fill=rainbow(dim(race_df)[1]))+xlab('race')+ylab('count')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
ggplot(sex_df,aes(x=Var1,y=Freq))+geom_bar(stat='identity',fill=rainbow(dim(sex_df)[1]))+xlab('sex')+ylab('count')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
ggplot(nativeCountry_df,aes(x=Var1,y=Freq))+geom_bar(stat='identity',fill=rainbow(dim(nativeCountry_df)[1]))+xlab('native country')+ylab('count')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
Remove extra rich and set categorical variable reference level.
#remove those extra rich individuals
dat_df4 = dat_df3[dat_df3$capital_gain != 99999,]
#set the category with the largest number of occurrences as the reference group for each categorial variable
dat_df4$workClass = as.factor(dat_df4$workClass)
dat_df4$workClass = relevel(dat_df4$workClass,ref='Private')
dat_df4$edu = as.factor(dat_df4$edu)
dat_df4$edu = relevel(dat_df4$edu,ref='HS_grad')
dat_df4$maritalStatus = as.factor(dat_df4$maritalStatus)
dat_df4$maritalStatus = relevel(dat_df4$maritalStatus,ref='Married_civ_spouse')
dat_df4$occupation = as.factor(dat_df4$occupation)
dat_df4$occupation = relevel(dat_df4$occupation,ref='Prof_specialty')
dat_df4$relationship = as.factor(dat_df4$relationship)
dat_df4$relationship = relevel(dat_df4$relationship,ref='Husband')
dat_df4$race = as.factor(dat_df4$race)
dat_df4$race = relevel(dat_df4$race,ref='White')
dat_df4$sex = as.factor(dat_df4$sex)
dat_df4$sex = relevel(dat_df4$sex,ref='Male')
dat_df4$nativeCountry = as.factor(dat_df4$nativeCountry)
dat_df4$nativeCountry = relevel(dat_df4$nativeCountry,ref='United_States')
dat_df4$nativeCountry2 = with(dat_df4,ifelse(nativeCountry=='United_States','United_States','abroad'))
dat_df4$nativeCountry2 = as.factor(dat_df4$nativeCountry2)
dat_df4$nativeCountry2 = relevel(dat_df4$nativeCountry2,ref='United_States')
dat_df4$nativeCountry = dat_df4$nativeCountry2
Of note, the categorical variable “education” and the numerical variable “education number” describe the same quality. In this regression analysis, only “education number” was included in the model because a quantitative numerical variable can ease the model fitting in computation than a categorical variable. Here is the summary of the multivariate logistic regression:
#perform logistic regression with all the variables
dat_df4$class = as.factor(dat_df4$class)
dat_df4$class = relevel(dat_df4$class,ref='less50K')
full_model = glm(class~age+workClass+fnlwgt+eduNum+maritalStatus+occupation+relationship+race+sex+capital_gain+capital_loss+hoursPerWeek+nativeCountry,family='binomial',data=dat_df4)
summary(full_model)
##
## Call:
## glm(formula = class ~ age + workClass + fnlwgt + eduNum + maritalStatus +
## occupation + relationship + race + sex + capital_gain + capital_loss +
## hoursPerWeek + nativeCountry, family = "binomial", data = dat_df4)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.555e+00 1.805e-01 -30.772 < 2e-16 ***
## age 2.660e-02 1.684e-03 15.797 < 2e-16 ***
## workClassFederal_gov 4.984e-01 9.345e-02 5.333 9.63e-08 ***
## workClassLocal_gov -2.021e-01 7.247e-02 -2.789 0.005284 **
## workClassSelf_emp_inc 1.751e-01 8.571e-02 2.043 0.041085 *
## workClassSelf_emp_not_inc -4.833e-01 6.381e-02 -7.573 3.65e-14 ***
## workClassState_gov -3.141e-01 9.070e-02 -3.463 0.000534 ***
## workClassWithout_pay -1.179e+01 1.194e+02 -0.099 0.921304
## fnlwgt 7.078e-07 1.743e-07 4.060 4.91e-05 ***
## eduNum 2.825e-01 9.578e-03 29.494 < 2e-16 ***
## maritalStatusDivorced -2.064e+00 2.749e-01 -7.507 6.07e-14 ***
## maritalStatusMarried_AF_spouse 6.614e-01 5.109e-01 1.295 0.195468
## maritalStatusMarried_spouse_absent -2.072e+00 3.524e-01 -5.879 4.12e-09 ***
## maritalStatusNever_married -2.530e+00 2.687e-01 -9.418 < 2e-16 ***
## maritalStatusSeparated -2.146e+00 3.070e-01 -6.988 2.79e-12 ***
## maritalStatusWidowed -1.896e+00 3.074e-01 -6.167 6.97e-10 ***
## occupationAdm_clerical -5.709e-01 8.007e-02 -7.130 1.01e-12 ***
## occupationArmed_Forces -1.696e+00 1.527e+00 -1.111 0.266627
## occupationCraft_repair -5.103e-01 7.178e-02 -7.109 1.17e-12 ***
## occupationExec_managerial 2.357e-01 6.338e-02 3.720 0.000199 ***
## occupationFarming_fishing -1.566e+00 1.340e-01 -11.687 < 2e-16 ***
## occupationHandlers_cleaners -1.260e+00 1.403e-01 -8.983 < 2e-16 ***
## occupationMachine_op_inspct -8.356e-01 9.740e-02 -8.580 < 2e-16 ***
## occupationOther_service -1.389e+00 1.141e-01 -12.179 < 2e-16 ***
## occupationPriv_house_serv -4.649e+00 1.682e+00 -2.764 0.005706 **
## occupationProtective_serv 2.271e-02 1.173e-01 0.194 0.846510
## occupationSales -2.844e-01 7.138e-02 -3.985 6.75e-05 ***
## occupationTech_support 7.423e-02 1.035e-01 0.718 0.473039
## occupationTransport_moving -6.551e-01 9.355e-02 -7.002 2.52e-12 ***
## relationshipNot_in_family 4.097e-01 2.717e-01 1.508 0.131608
## relationshipOther_relative -4.426e-01 2.484e-01 -1.782 0.074760 .
## relationshipOwn_child -7.670e-01 2.712e-01 -2.828 0.004687 **
## relationshipUnmarried 2.834e-01 2.872e-01 0.987 0.323677
## relationshipWife 1.319e+00 1.050e-01 12.556 < 2e-16 ***
## raceAmer_Indian_Eskimo -5.845e-01 2.287e-01 -2.556 0.010599 *
## raceAsian_Pac_Islander 6.923e-02 1.186e-01 0.584 0.559520
## raceBlack -1.550e-01 7.715e-02 -2.010 0.044476 *
## raceOther -6.319e-01 2.950e-01 -2.142 0.032212 *
## sexFemale -8.514e-01 8.028e-02 -10.605 < 2e-16 ***
## capital_gain 3.167e-04 1.056e-05 29.984 < 2e-16 ***
## capital_loss 6.405e-04 3.835e-05 16.701 < 2e-16 ***
## hoursPerWeek 2.981e-02 1.690e-03 17.638 < 2e-16 ***
## nativeCountryabroad -2.372e-01 7.924e-02 -2.994 0.002755 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33437 on 30013 degrees of freedom
## Residual deviance: 19607 on 29971 degrees of freedom
## AIC: 19693
##
## Number of Fisher Scoring iterations: 12
To identify categorical variables which statistically significantly contribute to the annual income, likelihood ratio test was performed to compare the models with and without the variable to determine their global effects.
#perform likelihood ratio test to determine the global effect of categorical variables
reduced_model = glm(class~1,family=binomial,data=dat_df4)
workClass_model = glm(class~workClass,family=binomial,data=dat_df4)
anova(reduced_model, workClass_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: class ~ 1
## Model 2: class ~ workClass
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30013 33437
## 2 30007 32767 6 670.17 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
maritalStatus_model = glm(class~maritalStatus,family=binomial,data=dat_df4)
anova(reduced_model, maritalStatus_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: class ~ 1
## Model 2: class ~ maritalStatus
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30013 33437
## 2 30007 26936 6 6500.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
occupation_model = glm(class~occupation,family=binomial,data=dat_df4)
anova(reduced_model, occupation_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: class ~ 1
## Model 2: class ~ occupation
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30013 33437
## 2 30000 29651 13 3785.5 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
relationship_model = glm(class~relationship,family=binomial,data=dat_df4)
anova(reduced_model, relationship_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: class ~ 1
## Model 2: class ~ relationship
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30013 33437
## 2 30008 26582 5 6855.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
race_model = glm(class~race,family=binomial,data=dat_df4)
anova(reduced_model, race_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: class ~ 1
## Model 2: class ~ race
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 30013 33437
## 2 30009 33095 4 341.9 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From likelihood ratio test, all the categorical variables in the initial logistic model can significantly contribute to the annual income.
The next step is to rank the variables based on their significance levels with regards to the correlations with the dependent variable annual income. The x axis is the negative logarithmic transformed P-values with a base of 10. The color of the bar represents the direction of the correlation with annual income. Red is positive correlation and blue is negative correlation.
#generate the final barplot to rank the variables based on their significance levels
full_model_coef_df = as.data.frame(coef(summary(full_model)))
full_model_coef_df = full_model_coef_df[!rownames(full_model_coef_df) %in% '(Intercept)',]
colnames(full_model_coef_df)[4] = 'P'
full_model_coef_df$log10p = -log10(full_model_coef_df$P)
full_model_coef_df$color = ifelse(full_model_coef_df$Estimate>0,'red','blue')
full_model_coef_df = full_model_coef_df[order(full_model_coef_df$log10p,decreasing = FALSE),]
full_model_coef_df$variable = rownames(full_model_coef_df)
full_model_coef_df$variable = factor(full_model_coef_df$variable,levels=full_model_coef_df$variable)
ggplot(full_model_coef_df,aes(x=variable,y=log10p))+geom_bar(stat='identity',fill=full_model_coef_df$color)+xlab('variable')+ylab('-log(P)')+theme(axis.text.y = element_text(angle = 45))+coord_flip()+theme_bw()
The top 5 contributing factors are positively correlated with annual income, including capital investment (gain or loss), years of education, work hours per week and age.