Adult - UCI Machine Learning Repository

According to https://archive.ics.uci.edu/dataset/2/adult, this data set was extracted by Barry Becker from the 1994 Census database. Its prediction task is to determine whether an individual's annual income exceeds $50K/yr based on census data. It is also known as the “Census Income” data set.

Install and load necessary packages

options(repos = c(CRAN = "https://cran.rstudio.com"))

req_packages <- c("tidyverse","readr","knitr")

for (pkg in req_packages) {
  if (!require(pkg, character.only = TRUE)) {
    message(paste("Installing package:", pkg))
    install.packages(pkg, dependencies = TRUE)
  } else {
    message(paste(pkg, " already installed."))
  }
  library(pkg, character.only = TRUE)
}

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## tidyverse  already installed.
## 
## readr  already installed.
## 
## Loading required package: knitr
## 
## knitr  already installed.

Load and Clean the Data

# Define all column names (the raw file has no header row)
all_columns <- c(
  "age", "workclass", "fnlwgt", "education", "education_num",
  "marital_status", "occupation", "relationship", "race", "sex",
  "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"
)

# Load the dataset from the UCI ML Repository, treating "?" as missing
adult_data <- read_csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
  col_names = all_columns,
  na = c("?", "NA")
)

## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): workclass, education, marital_status, occupation, relationship, rac...
## dbl (6): age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Select specific columns
select_data <- adult_data %>% select(age, education, income, sex, native_country)
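
To check how many values the `na = c("?", "NA")` setting turned into missing values, the NAs can be counted per column. The following is a minimal sketch on a hypothetical three-row sample, not real records from the dataset. The names `sample_text`, `toy`, and `na_counts` are made up for illustration.

```r
library(readr)
library(dplyr)

# Hypothetical sample in a simplified adult.data-like layout (not real records)
sample_text <- I("39,State-gov,Male,United-States,<=50K\n50,?,Male,?,>50K\n28,Private,Female,Cuba,<=50K\n")

toy <- read_csv(
  sample_text,
  col_names = c("age", "workclass", "sex", "native_country", "income"),
  na = c("?", "NA"),
  show_col_types = FALSE
)

# Count the missing values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

print(na_counts)
```

The same `summarise(across(...))` call should work on `select_data` to show which of the selected columns contain missing values.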

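Pivot the data to get male and female counts by income, education, and native country

# Count the rows for every combination of sex, income, education, and native country
selected_data1 <- select_data %>%
  count(sex, income, education, native_country)

# Spread sex into two count columns, Female and Male;
# combinations with no rows for one sex get NA in that column
selected_data <- selected_data1 %>%
  pivot_wider(names_from = sex, values_from = n)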
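
To make the shape of the pivoted table concrete, here is a small sketch of the same `count()` plus `pivot_wider()` pattern on a hypothetical five-row tibble. The rows and the names `toy` and `toy_wide` are invented for illustration. It shows that a combination seen for only one sex ends up with `NA` in the other column.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical rows, not taken from the dataset
toy <- tribble(
  ~sex,     ~income, ~education,
  "Male",   ">50K",  "Bachelors",
  "Male",   ">50K",  "Bachelors",
  "Female", ">50K",  "Bachelors",
  "Male",   ">50K",  "Doctorate",
  "Female", "<=50K", "HS-grad"
)

toy_wide <- toy %>%
  count(sex, income, education) %>%               # long format: one row per combination
  pivot_wider(names_from = sex, values_from = n)  # wide format: Female and Male columns

print(toy_wide)
```

Here the `Doctorate` row has no Female count and the `HS-grad` row has no Male count, so those cells are `NA` rather than 0.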

Plot females earning >50K by education

library(forcats)
selected_data %>%
  filter(income == ">50K", !is.na(Female), !is.na(education)) %>%
  # Add up the counts across native countries so each education level is one bar
  group_by(education) %>%
  summarise(Female = sum(Female)) %>%
  mutate(education = fct_reorder(education, Female)) %>%
  ggplot(aes(x = Female, y = education)) +
  geom_col() +
  labs(title = "Females Earning >50K by Education Level", x = "Count", y = "Education Level") +
  theme_minimal()
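
A note on bar ordering: `fct_reorder()` summarises the ordering variable with the median by default. When a level appears on several rows, as each education level does here (once per native country), the order can differ from the order of the totals. The sketch below uses hypothetical values, with the invented names `edu`, `n`, `by_median`, and `by_sum`, to show the difference.

```r
library(forcats)

# Hypothetical data: level "A" appears on three rows, level "B" on one
edu <- c("A", "A", "A", "B")
n   <- c(1, 1, 1, 2)

# Default: order levels by the median of n within each level (A = 1, B = 2)
by_median <- levels(fct_reorder(edu, n))

# Order levels by the total of n within each level (A = 3, B = 2)
by_sum <- levels(fct_reorder(edu, n, .fun = sum))

print(by_median)
print(by_sum)
```

Summarising to one row per level before reordering, or passing `.fun = sum`, keeps the bar order consistent with the bar lengths.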

Plot males earning >50K by education

selected_data %>%
  filter(income == ">50K", !is.na(Male), !is.na(education)) %>%
  # Add up the counts across native countries so each education level is one bar
  group_by(education) %>%
  summarise(Male = sum(Male)) %>%
  mutate(education = fct_reorder(education, Male)) %>%
  ggplot(aes(x = Male, y = education)) +
  geom_col() +
  labs(title = "Males Earning >50K by Education Level", x = "Count", y = "Education Level") +
  theme_minimal()

Number of people earning >50K by Native Country

selected_data %>%
  filter(income == ">50K", !is.na(native_country)) %>%
  # Treat a missing Female or Male count as zero so rows with only one sex still count
  mutate(total = coalesce(Female, 0L) + coalesce(Male, 0L)) %>%
  group_by(native_country) %>%
  summarise(total = sum(total)) %>%
  mutate(native_country = fct_reorder(native_country, total)) %>%
  ggplot(aes(x = total, y = native_country)) +
  geom_col() +
  labs(title = "Number of People Earning >50K by Native Country", x = "Count", y = "Native Country") +
  theme_minimal()
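
One detail matters when adding the Female and Male columns: in R, `NA` plus a number is `NA`, so a row with a count for only one sex contributes nothing to a sum that later drops NAs. The sketch below uses hypothetical counts and the invented names `toy_wide`, `naive_total`, and `safe_total`. It compares plain addition with `coalesce()`, which substitutes 0 for a missing count.

```r
library(dplyr)
library(tibble)

# Hypothetical pivoted counts; country "B" has no Female rows
toy_wide <- tribble(
  ~native_country, ~Female,     ~Male,
  "A",             2L,          5L,
  "B",             NA_integer_, 3L
)

totals <- toy_wide %>%
  mutate(
    naive_total = Female + Male,                             # NA for country "B"
    safe_total  = coalesce(Female, 0L) + coalesce(Male, 0L)  # 3 for country "B"
  )

print(totals)
```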

Males earning >50K by Native Country

selected_data %>%
  filter(income == ">50K", !is.na(Male), !is.na(native_country)) %>%
  # Add up the counts across education levels so each country is one bar
  group_by(native_country) %>%
  summarise(Male = sum(Male)) %>%
  mutate(native_country = fct_reorder(native_country, Male)) %>%
  ggplot(aes(x = Male, y = native_country)) +
  geom_col() +
  labs(title = "Males Earning >50K by Native Country", x = "Count", y = "Native Country") +
  theme_minimal()

Females earning >50K by Native Country

selected_data %>%
  filter(income == ">50K", !is.na(Female), !is.na(native_country)) %>%
  # Add up the counts across education levels so each country is one bar
  group_by(native_country) %>%
  summarise(Female = sum(Female)) %>%
  mutate(native_country = fct_reorder(native_country, Female)) %>%
  ggplot(aes(x = Female, y = native_country)) +
  geom_col() +
  labs(title = "Females Earning >50K by Native Country", x = "Count", y = "Native Country") +
  theme_minimal()

In this dataset, more males than females earn over 50K. Individuals whose native country is the United States also form by far the largest group earning above 50K, ahead of every other country represented. Because these are raw counts rather than rates, both patterns partly reflect how many males and U.S.-born individuals the sample contains.
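
A count comparison can be complemented by the share of each group that earns over 50K. The following is a minimal sketch on hypothetical rows (the values and the names `toy`, `shares`, `n_people`, `n_high`, and `share_high` are invented and say nothing about the real data). It shows how the group with the larger count need not have the larger share.

```r
library(dplyr)
library(tibble)

# Hypothetical rows, not taken from the dataset
toy <- tribble(
  ~sex,     ~income,
  "Male",   ">50K",
  "Male",   ">50K",
  "Male",   "<=50K",
  "Male",   "<=50K",
  "Male",   "<=50K",
  "Male",   "<=50K",
  "Female", ">50K",
  "Female", "<=50K"
)

shares <- toy %>%
  group_by(sex) %>%
  summarise(
    n_people   = n(),                    # group size
    n_high     = sum(income == ">50K"),  # count earning over 50K
    share_high = mean(income == ">50K")  # proportion earning over 50K
  )

print(shares)
```

Applying the same `group_by()` and `summarise()` to `select_data` would give the corresponding shares for the actual dataset.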