Are work class and income independent of each other?
In order to answer this question, I will use a dataset provided by UC
Irvine’s Machine Learning Repository, which contains data from the 1994
Census database. The raw dataset has 32,561 rows and 15 columns
(30,704 rows and 2 columns after cleaning). This analysis
focuses on only two of those columns, workclass and income. The
workclass variable categorizes the type of employer or employment
situation an individual has, such as private sector, government, or
self-employment. The income variable is a two-level categorical
variable stating whether an individual makes >50K a year or <=50K
a year. Using these two variables, I will perform a chi-square test of
independence in order to determine if workclass and income are
associated with each other. I chose this research question because I
believe that understanding whether certain workclasses are associated
with higher or lower income levels can provide valuable insight into
labor market inequality and economic mobility. The dataset is available
at https://archive.ics.uci.edu/dataset/2/adult.
To prepare the Adult dataset for analysis, I first inspected the raw
data to understand the variable types and identify how missing values
were represented. All “?” entries in character columns were converted to
NA values, and I counted the missing values in each column. Because this
analysis focused on the relationship between workclass and income, I
only kept these two variables, and irrelevant workclass categories
(“Without-pay” and “Never-worked”) were removed. I dropped any rows
containing missing values to ensure a complete dataset. Finally, both
variables were converted to factors so that they were properly formatted
for the chi-square test of independence.
library(tidyverse) #attaches dplyr and ggplot2
adult <- read.csv("adult.csv")
#brief glance at data
head(adult)
## age workclass fnlwgt education education_num marital_status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hours_per_week native_country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
#inspect data structure
str(adult)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : chr "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : chr "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ education_num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital_status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ occupation : chr "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ relationship : chr "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ race : chr "White" "White" "White" "Black" ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours_per_week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native_country: chr "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
#replace "?" with NA in all character columns
adult_clean <- adult %>%
  mutate(across(where(is.character), ~na_if(.x, "?")))
#check number of missing values by column
colSums(is.na(adult_clean))
## age workclass fnlwgt education education_num
## 0 1836 0 0 0
## marital_status occupation relationship race sex
## 0 1843 0 0 0
## capital_gain capital_loss hours_per_week native_country income
## 0 0 0 583 0
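#keep only the two variables of interest and drop the excluded workclass categories
#(filter() also drops rows where workclass is NA, because the comparison returns NA)
adult_clean <- adult_clean %>%
  select(workclass, income) %>%
  filter(workclass != "Without-pay",
         workclass != "Never-worked")
#drop any remaining rows with missing values
adult_clean <- adult_clean %>%
  filter(rowSums(is.na(.)) == 0)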
#convert applicable variables to factors
adult_clean <- adult_clean %>%
  mutate(
    workclass = factor(workclass),
    income = factor(income)
  )
#inspect cleaned data
str(adult_clean)
## 'data.frame': 30704 obs. of 2 variables:
## $ workclass: Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 3 5 3 3 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
#confirm no remaining missing values
colSums(is.na(adult_clean))
## workclass income
## 0 0
#checking expected cell count assumptions
observed_dataset <- table(adult_clean$workclass, adult_clean$income)
observed_dataset
##
## <=50K >50K
## Federal-gov 589 371
## Local-gov 1476 617
## Private 17733 4963
## Self-emp-inc 494 622
## Self-emp-not-inc 1817 724
## State-gov 945 353
chisq.test(observed_dataset)$expected
##
## <=50K >50K
## Federal-gov 720.8129 239.1871
## Local-gov 1571.5223 521.4777
## Private 17041.2189 5654.7811
## Self-emp-inc 837.9450 278.0550
## Self-emp-not-inc 1907.9017 633.0983
## State-gov 974.5991 323.4009
All expected cell counts are greater than 5, so the sample-size
assumption is met and we can proceed with the chi-square test.
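The expected counts printed above can be reproduced by hand, which makes the assumption check easier to follow. The sketch below is illustrative only: the matrix `observed` is a hypothetical name, hard-coded from the contingency table shown earlier so that the block runs without `adult.csv`.

```r
# Observed counts copied from the contingency table printed in the report
# (hard-coded here only so the example is self-contained)
observed <- matrix(
  c(589, 371, 1476, 617, 17733, 4963, 494, 622, 1817, 724, 945, 353),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    workclass = c("Federal-gov", "Local-gov", "Private",
                  "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
    income = c("<=50K", ">50K")
  )
)

# Expected count under independence: (row total * column total) / grand total
expected_manual <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected_manual, 4)

# Rule-of-thumb check: every expected count should be at least 5
all(expected_manual >= 5)
```

For example, the Federal-gov / <=50K cell works out to 960 × 23,054 / 30,704, which matches the 720.8129 reported by `chisq.test()`.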
#graph of proportions of income by workclass
ggplot(adult_clean, aes(x = workclass, fill = income)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Income Levels by Workclass",
    x = "Workclass",
    y = "Proportion",
    fill = "Income Level"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
This stacked bar plot shows the proportion of individuals earning
≤50K and >50K within each workclass category. In five of the six
workclasses, a clear majority of individuals earn ≤50K; the exception
is Self-emp-inc, where 622 of 1,116 individuals (about 56%) earn >50K.
The size of the gap also varies across the remaining groups. These
visible differences suggest that income may be related to workclass,
but that conclusion cannot be drawn from the graph alone. A chi-square
test of independence is appropriate for assessing whether workclass and
income are associated.
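The proportions shown in the plot can also be read off numerically. This is a minimal sketch, assuming the counts from the contingency table printed earlier; `observed` and `row_props` are hypothetical names used only for illustration.

```r
# Observed counts copied from the contingency table printed in the report
observed <- matrix(
  c(589, 371, 1476, 617, 17733, 4963, 494, 622, 1817, 724, 945, 353),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    workclass = c("Federal-gov", "Local-gov", "Private",
                  "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
    income = c("<=50K", ">50K")
  )
)

# Row-wise proportions: share of each income level within each workclass
# (margin = 1 makes every row sum to 1, matching position = "fill" in the plot)
row_props <- prop.table(observed, margin = 1)
round(row_props, 3)
```

Each row of `row_props` corresponds to one bar in the plot, so the table and the figure can be checked against each other.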
\(H_0\): Workclass is not associated with income.
\(H_a\): Workclass is associated with income.
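For reference, the statistic used to test these hypotheses can be written out as follows. The symbols \(O_{ij}\), \(E_{ij}\), \(R_i\), \(C_j\), and \(n\) are notation introduced here for illustration (observed count, expected count, row total, column total, and grand total); they do not appear elsewhere in this report.

```latex
X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
df = (r - 1)(c - 1) = (6 - 1)(2 - 1) = 5
```

Here \(r = 6\) is the number of workclass categories kept after cleaning and \(c = 2\) is the number of income levels.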
chi <- chisq.test(observed_dataset)
chi
##
## Pearson's Chi-squared test
##
## data: observed_dataset
## X-squared = 820.38, df = 5, p-value < 2.2e-16
With a p-value of < 2.2e-16, which is far below the typical
significance level of 0.05, there is sufficient evidence to reject the
null hypothesis. The chi-square statistic was \(X^2\) = 820.38, which is very large
relative to the five degrees of freedom. This indicates that the
observed frequencies differ greatly from the frequencies expected under
independence, which is what produces such a small p-value. Note that
with roughly 30,000 observations, a large statistic reflects strong
evidence that an association exists, not necessarily a strong
association.
Therefore, we conclude that there is a statistically significant
association between workclass and income.
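As a cross-check on the reported result, the statistic can be recomputed directly from the observed and expected counts. The sketch below also prints standardized residuals, a follow-up step that was not part of the original analysis, to show which cells depart most from independence. All names (`observed`, `test`, `x2_manual`, and so on) are hypothetical, and the counts are hard-coded from the contingency table printed earlier.

```r
# Observed counts copied from the contingency table printed in the report
observed <- matrix(
  c(589, 371, 1476, 617, 17733, 4963, 494, 622, 1817, 724, 945, 353),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    workclass = c("Federal-gov", "Local-gov", "Private",
                  "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
    income = c("<=50K", ">50K")
  )
)

test <- chisq.test(observed)

# Recompute the statistic by hand: sum of (observed - expected)^2 / expected
x2_manual <- sum((observed - test$expected)^2 / test$expected)
df_manual <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution with df_manual degrees of freedom
p_manual <- pchisq(x2_manual, df = df_manual, lower.tail = FALSE)
c(statistic = x2_manual, df = df_manual, p_value = p_manual)

# Standardized residuals: positive values mean more observations than
# expected under independence, negative values mean fewer
round(test$stdres, 2)
```

Cells with large absolute residuals contribute most to the statistic, which gives a rough sense of where the association comes from without changing the conclusion of the test.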
This analysis examined whether there is a significant relationship
between workclass and income level in the Adult dataset. A chi-square
test of independence revealed that income level is not independent of
workclass. Certain workclasses were associated with higher proportions
of individuals earning more than $50K: about 56% of Self-emp-inc and
39% of Federal-gov individuals, compared with about 22% of Private
individuals. These results suggest that employment type may serve as a
meaningful indicator of income disparity, which is relevant for labor
economists and policymakers interested in inequality. Because the test
establishes association and not causation, future research could expand
on this analysis by adding variables (such as education or hours
worked) or exploring interaction effects that may reveal important
patterns. Comparing data from multiple Census years could show how
these trends have evolved over the past few decades. Collecting more
balanced or targeted data could also help refine the understanding of
how different work sectors relate to earning potential.
Becker, B., & Kohavi, R. (n.d.). Adult. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/2/adult
dplyr. (n.d.). Convert values to NA: na_if(). https://dplyr.tidyverse.org/reference/na_if.html