Introduction

Are work class and income independent of each other? To answer this question, I use the Adult dataset from UC Irvine’s Machine Learning Repository, which contains data from the 1994 Census database. The raw dataset has 32,561 rows and 15 columns (30,704 rows and 2 columns after cleaning). This analysis focuses on only two of those columns: workclass and income. The workclass variable categorizes an individual’s type of employer or employment situation, such as private sector, government, or self-employment. The income variable is a two-level categorical variable stating whether an individual makes >50K a year or <=50K a year. Using these two variables, I perform a chi-square test of independence to determine whether workclass and income are associated. I chose this research question because understanding whether certain workclasses are associated with higher or lower income levels can provide insight into labor market inequality and economic mobility. The dataset is available from the UCI Machine Learning Repository (see References).

Data Analysis

To prepare the Adult dataset for analysis, I first inspected the raw data to understand the variable types and identify how missing values were represented. I converted all “?” entries in character columns to proper NA values and counted the missing values in each column. Because this analysis focuses on the relationship between workclass and income, I kept only these two variables and removed two workclass categories that I considered irrelevant (“Without-pay” and “Never-worked”). I then dropped any rows containing missing values to ensure a complete dataset. Finally, I converted both variables to factors so that they were properly formatted for the chi-square test of independence.

library(tidyverse)
library(ggplot2)
adult <- read.csv("adult.csv")
#brief glance at data
head(adult)
##   age        workclass fnlwgt education education_num     marital_status
## 1  39        State-gov  77516 Bachelors            13      Never-married
## 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3  38          Private 215646   HS-grad             9           Divorced
## 4  53          Private 234721      11th             7 Married-civ-spouse
## 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hours_per_week native_country income
## 1             40  United-States  <=50K
## 2             13  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40           Cuba  <=50K
## 6             40  United-States  <=50K
#inspect data structure
str(adult)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : chr  "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : chr  "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ education_num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status: chr  "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ occupation    : chr  "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ relationship  : chr  "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ race          : chr  "White" "White" "White" "Black" ...
##  $ sex           : chr  "Male" "Male" "Male" "Male" ...
##  $ capital_gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native_country: chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
#replace "?" with NA in all character columns
adult_clean <- adult %>%
  mutate(across(where(is.character), ~na_if(.x, "?")))

#check number of missing values by column
colSums(is.na(adult_clean))
##            age      workclass         fnlwgt      education  education_num 
##              0           1836              0              0              0 
## marital_status     occupation   relationship           race            sex 
##              0           1843              0              0              0 
##   capital_gain   capital_loss hours_per_week native_country         income 
##              0              0              0            583              0
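#keep only workclass and income, then remove the two excluded workclass categories
#(note: filter() also drops rows where workclass is NA)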
adult_clean <- adult_clean %>%
  select(workclass, income) %>%
  filter(workclass != "Without-pay",
         workclass != "Never-worked")

#drop rows with any missing values
adult_clean <- adult_clean %>%
  filter(rowSums(is.na(.)) == 0)
#convert applicable variables to factors
adult_clean <- adult_clean %>%
  mutate(
    workclass = factor(workclass),
    income = factor(income)
  )

#inspect cleaned data
str(adult_clean)
## 'data.frame':    30704 obs. of  2 variables:
##  $ workclass: Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 3 5 3 3 ...
##  $ income   : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
#confirm no remaining missing values
colSums(is.na(adult_clean))
## workclass    income 
##         0         0
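
One detail of the cleaning code is worth spelling out: dplyr's filter() keeps only rows where the condition is TRUE, so rows with a missing workclass are already dropped by the category filter, before the explicit missing-value step runs. The following is a minimal sketch on a made-up four-row data frame; the `toy` data and the names `kept_default` and `kept_with_na` are illustrative assumptions, not part of the analysis.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not rows from the Adult dataset)
toy <- data.frame(
  workclass = c("Private", "Without-pay", NA, "State-gov"),
  income    = c("<=50K", "<=50K", ">50K", ">50K")
)

# NA != "Without-pay" evaluates to NA, and filter() drops rows whose condition is NA
kept_default <- toy %>% filter(workclass != "Without-pay")

# To keep the NA row, the condition has to allow it explicitly
kept_with_na <- toy %>% filter(is.na(workclass) | workclass != "Without-pay")

nrow(kept_default)   # 2 rows: "Private" and "State-gov"
nrow(kept_with_na)   # 3 rows: the NA row is retained
```

This is consistent with the counts above: 32,561 rows minus the 1,836 missing workclass values leaves 30,725, and the cleaned data has 30,704 rows, so the two removed categories account for the remaining 21 rows.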


Exploratory Analysis of Workclass and Income

#checking expected cell count assumptions
observed_dataset <- table(adult_clean$workclass, adult_clean$income)
observed_dataset
##                   
##                    <=50K  >50K
##   Federal-gov        589   371
##   Local-gov         1476   617
##   Private          17733  4963
##   Self-emp-inc       494   622
##   Self-emp-not-inc  1817   724
##   State-gov          945   353
chisq.test(observed_dataset)$expected
##                   
##                         <=50K      >50K
##   Federal-gov        720.8129  239.1871
##   Local-gov         1571.5223  521.4777
##   Private          17041.2189 5654.7811
##   Self-emp-inc       837.9450  278.0550
##   Self-emp-not-inc  1907.9017  633.0983
##   State-gov          974.5991  323.4009

All expected cell counts are well above 5 (the smallest is about 239), so the expected-count assumption is met and we can proceed with the chi-square test.
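
The expected-count check can also be done programmatically rather than by eye. The sketch below rebuilds the contingency table from the counts printed above and computes the expected counts by hand, using base R only; the object names `observed` and `expected` are chosen for illustration.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(  589,  371,
     1476,  617,
    17733, 4963,
      494,  622,
     1817,  724,
      945,  353),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Federal-gov", "Local-gov", "Private",
      "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
    c("<=50K", ">50K")
  )
)

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

min(expected)        # smallest expected count in the table
all(expected >= 5)   # TRUE when every cell meets the rule of thumb
```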

#graph of proportions of income by workclass
ggplot(adult_clean, aes(x = workclass, fill = income)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Income Levels by Workclass",
    x = "Workclass",
    y = "Proportion",
    fill = "Income Level"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This stacked bar plot shows the proportion of individuals earning ≤50K and >50K within each workclass category. In every workclass except Self-emp-inc, most individuals earn ≤50K, but the size of the gap varies across groups; in Self-emp-inc, a majority (622 of 1,116) earn >50K. These visible differences suggest that income may be related to workclass, but that conclusion cannot be drawn from the graph alone. A chi-square test of independence is appropriate to assess whether workclass and income are associated.
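
The proportions shown in the plot can be read off numerically as well. This is a small sketch using base R's prop.table() on the observed counts from the table above; `observed` and `row_props` are illustrative names.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(  589,  371,
     1476,  617,
    17733, 4963,
      494,  622,
     1817,  724,
      945,  353),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Federal-gov", "Local-gov", "Private",
      "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
    c("<=50K", ">50K")
  )
)

# margin = 1 divides each row by its row total, so every row sums to 1
row_props <- prop.table(observed, margin = 1)
round(row_props, 3)

# Workclass with the largest share of individuals earning >50K
names(which.max(row_props[, ">50K"]))
```

These row proportions are the same quantities that geom_bar(position = "fill") draws, so the table and the plot should agree.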

Chi-Square Test of Independence

\(H_0\) : Workclass is not associated with income
\(H_a\) : Workclass is associated with income
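
For reference, the Pearson chi-square statistic compares the observed count in each cell with the count expected under \(H_0\). The notation \(O_{ij}\), \(E_{ij}\), \(R_i\), \(C_j\), \(r\), \(c\), and \(n\) is introduced here for illustration: \(R_i\) and \(C_j\) are the row and column totals, and \(n\) is the grand total.

\[
X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{n}
\]

Under \(H_0\), \(X^2\) approximately follows a chi-square distribution with \((r-1)(c-1)\) degrees of freedom. With \(r = 6\) workclass levels and \(c = 2\) income levels, that gives \((6-1)(2-1) = 5\).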

chi <- chisq.test(observed_dataset)
chi
## 
##  Pearson's Chi-squared test
## 
## data:  observed_dataset
## X-squared = 820.38, df = 5, p-value < 2.2e-16
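
As a sanity check, the statistic reported above can be reproduced by hand from the observed counts. The sketch below uses base R only and sums (observed - expected)^2 / expected over all cells; the names `observed`, `expected`, `x2`, `df`, and `p_value` are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(  589,  371,
     1476,  617,
    17733, 4963,
      494,  622,
     1817,  724,
      945,  353),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Federal-gov", "Local-gov", "Private",
      "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
    c("<=50K", ">50K")
  )
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic and its degrees of freedom
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(x2, 2)   # should match the X-squared value printed by chisq.test()
df
p_value
```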


The p-value (< 2.2e-16) is far below the typical significance level of 0.05, so there is sufficient evidence to reject the null hypothesis. The test statistic, \(X^2\) = 820.38 on five degrees of freedom, reflects large differences between the observed and expected frequencies. Because the sample is large (n = 30,704), the size of the statistic alone does not show how strong the association is; it shows only that the association is very unlikely to be due to chance.

Therefore, we conclude that there is a statistically significant association between workclass and income.
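
A significant result does not say how strong the association is or which cells drive it. One common follow-up, not part of the original analysis, is to compute Cramér's V as an effect size and to inspect the Pearson residuals. The sketch below does both under the assumption that the observed counts are those in the table above; the names `test`, `cramers_v`, `res`, and `top` are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(  589,  371,
     1476,  617,
    17733, 4963,
      494,  622,
     1817,  724,
      945,  353),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Federal-gov", "Local-gov", "Private",
      "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
    c("<=50K", ">50K")
  )
)

test <- chisq.test(observed)

# Cramér's V = sqrt(X^2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1
n <- sum(observed)
k <- min(dim(observed)) - 1
cramers_v <- sqrt(unname(test$statistic) / (n * k))
round(cramers_v, 3)

# Pearson residuals: (observed - expected) / sqrt(expected)
res <- test$residuals
round(res, 2)

# Cell with the largest absolute residual (linear index into the matrix)
top <- which.max(abs(res))
rownames(observed)[row(res)[top]]
colnames(observed)[col(res)[top]]
```

With these counts, Cramér's V comes out at roughly 0.16, and the largest residual belongs to the Self-emp-inc, >50K cell. How to label an effect of that size depends on convention, so the value is better reported alongside the test than used as a verdict on its own.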

Conclusion

This analysis examined whether there is a significant relationship between workclass and income level in the Adult dataset. A chi-square test of independence showed that income level is not independent of workclass (\(X^2\) = 820.38, df = 5, p < 2.2e-16). Certain workclasses, most notably Self-emp-inc and Federal-gov, had higher proportions of individuals earning more than $50K. These results suggest that employment type may be a meaningful indicator of income disparity, which is relevant for labor economists and policymakers interested in inequality. Because the test establishes association rather than causation, future research could extend this analysis by adding variables such as education or hours worked, or by exploring interaction effects. Using data from multiple Census years could show how these patterns have changed over the past few decades. Collecting more balanced or targeted data could also refine the understanding of how earnings differ across work sectors.

References

Becker, B., & Kohavi, R. (n.d.). Adult. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/2/adult
Convert values to NA: na_if. (n.d.). dplyr. https://dplyr.tidyverse.org/reference/na_if.html