Using Logistic Regression Models to Predict Income Levels

Jacqueline Nosrati

Overview

In this week’s assignment I will use logistic regression models to predict whether an individual makes more than $50,000 or less annually using extracted data from the 1994 US Census Data set. To make predictions about income levels I will look at gender, race, number of years of education,and age.

Load packages

library(Zelig)
library(readr)
library(pander)
library(radiant.data)
library(texreg)
library(lmtest)
library(visreg)

In this step I loaded the required packages to analyze and visualize the data.

Load data set

Income<- read_csv("C:/Users/Papa/Desktop/Soc 712 -R/adult.csv", col_names = TRUE)
head(Income)
## # A tibble: 6 x 15
##     age workclass fnlwgt    education education.num marital.status        occupation  relationship  race    sex capital.gain capital.loss hours.per.week native.country income
##   <int>     <chr>  <int>        <chr>         <int>          <chr>             <chr>         <chr> <chr>  <chr>        <int>        <int>          <int>          <chr>  <chr>
## 1    90         ?  77053      HS-grad             9        Widowed                 ? Not-in-family White Female            0         4356             40  United-States  <=50K
## 2    82   Private 132870      HS-grad             9        Widowed   Exec-managerial Not-in-family White Female            0         4356             18  United-States  <=50K
## 3    66         ? 186061 Some-college            10        Widowed                 ?     Unmarried Black Female            0         4356             40  United-States  <=50K
## 4    54   Private 140359      7th-8th             4       Divorced Machine-op-inspct     Unmarried White Female            0         3900             40  United-States  <=50K
## 5    41   Private 264663 Some-college            10      Separated    Prof-specialty     Own-child White Female            0         3900             40  United-States  <=50K
## 6    34   Private 216864      HS-grad             9       Divorced     Other-service     Unmarried White Female            0         3770             45  United-States  <=50K

In this step, I loaded my data set and saved it as “Income”. Currently, we only see the first 6 observations.

Change my outcome variables

Income2 <- Income%>%
mutate(incomelevel= ifelse(income ==">50K",1,0))
head(Income2)
## # A tibble: 6 x 16
##     age workclass fnlwgt    education education.num marital.status        occupation  relationship  race    sex capital.gain capital.loss hours.per.week native.country income incomelevel
##   <int>     <chr>  <int>        <chr>         <int>          <chr>             <chr>         <chr> <chr>  <chr>        <int>        <int>          <int>          <chr>  <chr>       <dbl>
## 1    90         ?  77053      HS-grad             9        Widowed                 ? Not-in-family White Female            0         4356             40  United-States  <=50K           0
## 2    82   Private 132870      HS-grad             9        Widowed   Exec-managerial Not-in-family White Female            0         4356             18  United-States  <=50K           0
## 3    66         ? 186061 Some-college            10        Widowed                 ?     Unmarried Black Female            0         4356             40  United-States  <=50K           0
## 4    54   Private 140359      7th-8th             4       Divorced Machine-op-inspct     Unmarried White Female            0         3900             40  United-States  <=50K           0
## 5    41   Private 264663 Some-college            10      Separated    Prof-specialty     Own-child White Female            0         3900             40  United-States  <=50K           0
## 6    34   Private 216864      HS-grad             9       Divorced     Other-service     Unmarried White Female            0         3770             45  United-States  <=50K           0

First, I created a new data set called Income2. Using the mutate command and ifelse condition, I created and changed my dependent variable (income) to have an outcome of 1 if the value is “>50K” (greater than 50,000) and else (“<=50k” or less than 50,000) to be 0. This is done to change the dependent variable into binary form. I labeled the new variable incomelevel.

A Simple Model

m0 <- glm(incomelevel ~ sex, family = binomial, data = Income2)
summary(m0)
## 
## Call:
## glm(formula = incomelevel ~ sex, family = binomial, data = Income2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8543  -0.8543  -0.4815  -0.4815   2.1034  
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -2.09626    0.03086  -67.93 <0.0000000000000002 ***
## sexMale      1.27614    0.03418   37.33 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35948  on 32560  degrees of freedom
## Residual deviance: 34270  on 32559  degrees of freedom
## AIC: 34274
## 
## Number of Fisher Scoring iterations: 4

For the first model, I looked at the probability that an individual earns more than 50,000 dollars based on sex. When looking at the result we see that men are more likely to make more than 50,000 dollars (1.27614) than women.We see that the results are statistically significant at 99% confidence level. I chose this variable because there is generally a gender pay gap.

A (Slightly) More Complicated Model

m1 <- glm(incomelevel ~ sex + race + age, family = binomial, data = Income2)
summary(m1)
## 
## Call:
## glm(formula = incomelevel ~ sex + race + age, family = binomial, 
##     data = Income2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7404  -0.7694  -0.5610  -0.2673   2.6975  
## 
## Coefficients:
##                         Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)            -4.424771   0.189257 -23.380 < 0.0000000000000002 ***
## sexMale                 1.206911   0.035105  34.380 < 0.0000000000000002 ***
## raceAsian-Pac-Islander  0.999109   0.196718   5.079           0.00000038 ***
## raceBlack               0.181186   0.190725   0.950                0.342    
## raceOther              -0.114303   0.281879  -0.406                0.685    
## raceWhite               0.874203   0.182807   4.782           0.00000173 ***
## age                     0.038721   0.001005  38.530 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35948  on 32560  degrees of freedom
## Residual deviance: 32473  on 32554  degrees of freedom
## AIC: 32487
## 
## Number of Fisher Scoring iterations: 5

To this model I added the variables race and age. We see that the probability of earning more than 50,000 dollars annually has slightly decreased if an individual is male (from 1.27614 to 1.206911) and as an individual’s age increases the probability of earning more than 50,000 dollars annually only slightly increases (0.038721). In addition, when we look at the probability of earning more than 50,000 dollars annually with respect to race we see that Asian/Pacific Islander and White individuals tend are more likely to earn more than 50,000 dollars annually (0.999109 and 0.874203, respectively) than individuals of other races (0.181186 for Black and -0.114303 for Other). All results are statistically significant at the 99% confidence level with the exception of Black and Other,which are not statistically significant. I added the variables race and age to the model because those variables can have impact on an individual’s predicted annual income. In addition, it is assumed that as age increases income will also increase (until a certain point where age increases and income decreases due to retirement and increased expenses related to aging).

Adding Years of Education Variable To the Model)

m2 <- glm(incomelevel ~ sex + race + age + education.num, family = binomial, data = Income2)
summary(m2)
## 
## Call:
## glm(formula = incomelevel ~ sex + race + age + education.num, 
##     family = binomial, data = Income2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4242  -0.6737  -0.4392  -0.1352   3.2953  
## 
## Coefficients:
##                         Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)            -8.220812   0.211670 -38.838 < 0.0000000000000002 ***
## sexMale                 1.297711   0.037014  35.060 < 0.0000000000000002 ***
## raceAsian-Pac-Islander  0.348052   0.206716   1.684              0.09224 .  
## raceBlack               0.113161   0.199047   0.569              0.56968    
## raceOther              -0.210656   0.298465  -0.706              0.48031    
## raceWhite               0.573206   0.190812   3.004              0.00266 ** 
## age                     0.042145   0.001133  37.199 < 0.0000000000000002 ***
## education.num           0.366387   0.006582  55.666 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35948  on 32560  degrees of freedom
## Residual deviance: 28631  on 32553  degrees of freedom
## AIC: 28647
## 
## Number of Fisher Scoring iterations: 5

In this model, I added the numbers of years of education variable. This time we see that that the probability of male individuals earning more than 50,000 dollars annually has slightly increased to 1.297711. In addition the probability that older inviduals earning more than 50,000 dollars annually has slightly increased to 0.042145. Looking at the newly added variable number of years of education, the probability of earning more than 50,000 dollars annually is slightly higher for relatively older individuals (0.366387). The result is statistically significant at a 99% confidence level. I added numbers of years of education to the model because we assume that as an individual gets more educated, their income will also tend to increase.

Looking at the interaction between race and sex

m3 <- glm(incomelevel ~ sex*race + age + education.num, family = binomial, data = Income2)
summary(m3)
## 
## Call:
## glm(formula = incomelevel ~ sex * race + age + education.num, 
##     family = binomial, data = Income2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4255  -0.6737  -0.4392  -0.1347   3.2191  
## 
## Coefficients:
##                                 Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)                    -7.666066   0.332716 -23.041 <0.0000000000000002 ***
## sexMale                         0.512309   0.396030   1.294              0.1958    
## raceAsian-Pac-Islander          0.041125   0.363302   0.113              0.9099    
## raceBlack                      -0.567747   0.339313  -1.673              0.0943 .  
## raceOther                      -0.359035   0.543871  -0.660              0.5092    
## raceWhite                       0.010174   0.322259   0.032              0.9748    
## age                             0.042181   0.001133  37.221 <0.0000000000000002 ***
## education.num                   0.366546   0.006585  55.664 <0.0000000000000002 ***
## sexMale:raceAsian-Pac-Islander  0.470971   0.440751   1.069              0.2853    
## sexMale:raceBlack               0.959057   0.417399   2.298              0.0216 *  
## sexMale:raceOther               0.251898   0.648687   0.388              0.6978    
## sexMale:raceWhite               0.791635   0.397953   1.989              0.0467 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35948  on 32560  degrees of freedom
## Residual deviance: 28622  on 32549  degrees of freedom
## AIC: 28646
## 
## Number of Fisher Scoring iterations: 5

For this model I looked at the interaction between sex and race with respect to the probability of earning more than 50,000 dollars annually. The results further seemed to further reinforce the relationships between the indepedent variables and income. We see that male individuals are more likely to earn more than 50,000 dollars annually (0.512309). We also see that White individuals are more likely to earn more than 50,000 dollars annually (-0.567747 and -0.359035). Looking at the interactions between race and gender we see that male individuals accross all races are more likely to earn more than 50,000 dollars annually than their female counterparts (0.470971, 0.959057,0.251898, and 0.791635). This is particularly true for Black and White males. The effect of numbers of years of education on the probability of earning more than 50,000 dollars annually remains unchanged (0.366546). I chose the interaction between race and gender to get a closer look at income discrepancies.

Comparing Models

library(texreg)
screenreg(list(m0, m1, m2, m3))
## 
## ==========================================================================================
##                                 Model 1        Model 2        Model 3        Model 4      
## ------------------------------------------------------------------------------------------
## (Intercept)                         -2.10 ***      -4.42 ***      -8.22 ***      -7.67 ***
##                                     (0.03)         (0.19)         (0.21)         (0.33)   
## sexMale                              1.28 ***       1.21 ***       1.30 ***       0.51    
##                                     (0.03)         (0.04)         (0.04)         (0.40)   
## raceAsian-Pac-Islander                              1.00 ***       0.35           0.04    
##                                                    (0.20)         (0.21)         (0.36)   
## raceBlack                                           0.18           0.11          -0.57    
##                                                    (0.19)         (0.20)         (0.34)   
## raceOther                                          -0.11          -0.21          -0.36    
##                                                    (0.28)         (0.30)         (0.54)   
## raceWhite                                           0.87 ***       0.57 **        0.01    
##                                                    (0.18)         (0.19)         (0.32)   
## age                                                 0.04 ***       0.04 ***       0.04 ***
##                                                    (0.00)         (0.00)         (0.00)   
## education.num                                                      0.37 ***       0.37 ***
##                                                                   (0.01)         (0.01)   
## sexMale:raceAsian-Pac-Islander                                                    0.47    
##                                                                                  (0.44)   
## sexMale:raceBlack                                                                 0.96 *  
##                                                                                  (0.42)   
## sexMale:raceOther                                                                 0.25    
##                                                                                  (0.65)   
## sexMale:raceWhite                                                                 0.79 *  
##                                                                                  (0.40)   
## ------------------------------------------------------------------------------------------
## AIC                              34274.20       32487.16       28647.36       28646.35    
## BIC                              34290.98       32545.90       28714.49       28747.04    
## Log Likelihood                  -17135.10      -16236.58      -14315.68      -14311.17    
## Deviance                         34270.20       32473.16       28631.36       28622.35    
## Num. obs.                        32561          32561          32561          32561       
## ==========================================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Using the texreg package we put the results of the 4 moldes into a table to compare the results. Instead of looking for which model has the highest R-squared, it is the best in this case to look at deviance, AIC and BIC and find the model with the lowest 2 values in order to determine which model is the best fit.These 3 measures indicate the lowest possible errors Looking at our results we see that Model 3 and Model 4 are the best ones with an AIC and BIC of 28647.36 and 28714.49 for Model 3 and 28646.35 and 28747.04 for Model 4.The 2 Models seeemed to have very close results, despite the addition of an interaction between race and gender. We were able to reduce the AIC and BIC significantly from the first Model to the last 2 Models.

Likelihood ratio test

lmtest::lrtest(m0, m1, m2, m3)
## Likelihood ratio test
## 
## Model 1: incomelevel ~ sex
## Model 2: incomelevel ~ sex + race + age
## Model 3: incomelevel ~ sex + race + age + education.num
## Model 4: incomelevel ~ sex * race + age + education.num
##   #Df LogLik Df     Chisq           Pr(>Chisq)    
## 1   2 -17135                                      
## 2   7 -16237  5 1797.0359 < 0.0000000000000002 ***
## 3   8 -14316  1 3841.7963 < 0.0000000000000002 ***
## 4  12 -14311  4    9.0148              0.06073 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we do a Likelihood ratio test we see that we were able to decrease error significantly by adding the additional indepedent variables and that the last model fits the best.

Graphs

The first graph shows the likelihood of earning more than 50,000 annually with respect to age.

library(visreg)
visreg(m2, "age", scale = "response")

The second graph shows the likelihood of earning more than 50,000 annually with respect to number of years of education.

library(visreg)
visreg(m2, "education.num", scale = "response")