According to https://archive.ics.uci.edu/dataset/2/adult, this data set was extracted by Barry Becker from the 1994 Census database. The prediction task is to determine whether an individual's annual income exceeds $50K/yr based on census data. It is also known as the “Census Income” data set.
options(repos = c(CRAN = "https://cran.rstudio.com"))
req_packages <- c("tidyverse", "readr", "knitr")
for (pkg in req_packages) {
  # require() attaches the package and returns FALSE if it is not installed
  if (!require(pkg, character.only = TRUE)) {
    message(paste("Installing package:", pkg))
    install.packages(pkg, dependencies = TRUE)
  } else {
    message(paste(pkg, "already installed."))
  }
  library(pkg, character.only = TRUE)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## tidyverse already installed.
##
## readr already installed.
##
## Loading required package: knitr
##
## knitr already installed.
# Define all column names
all_columns <- c("age", "workclass", "fnlwgt", "education", "education_num",
                 "marital_status", "occupation", "relationship", "race", "sex",
                 "capital_gain", "capital_loss", "hours_per_week",
                 "native_country", "income")
# Load the dataset from Adult - UCI ML Repository
adult_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                       col_names = all_columns,
                       na = c("?", "NA"))
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): workclass, education, marital_status, occupation, relationship, rac...
## dbl (6): age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
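# Select specific columns
select_data <- adult_data %>%
  select(age, education, income, sex, native_country)
# Count people in each sex/income/education/country combination
selected_data1 <- select_data %>%
  count(sex, income, education, native_country)
# One column of counts per sex; combinations absent for a sex become NA
selected_data <- selected_data1 %>%
  pivot_wider(names_from = sex, values_from = n)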
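A combination that occurs for only one sex has no matching cell after `pivot_wider()`, so it is stored as `NA`, and any sum that touches it is `NA` as well. The sketch below illustrates this on a small made-up table (`toy_counts` and its numbers are invented for illustration, not taken from the Adult data) and shows one possible alternative, the `values_fill` argument, which records the missing cell as a zero count.

```r
library(dplyr)
library(tidyr)

# Toy counts, invented for illustration only
toy_counts <- tibble(
  sex       = c("Female", "Male", "Male"),
  education = c("Bachelors", "Bachelors", "Doctorate"),
  n         = c(3L, 5L, 2L)
)

# Default behaviour: the missing Female/Doctorate cell becomes NA
wide_na <- toy_counts %>%
  pivot_wider(names_from = sex, values_from = n)

# values_fill = 0 stores the missing cell as a zero count instead
wide_zero <- toy_counts %>%
  pivot_wider(names_from = sex, values_from = n, values_fill = 0)

total_na   <- wide_na$Female + wide_na$Male      # second element is NA
total_zero <- wide_zero$Female + wide_zero$Male  # no NA
print(total_na)
print(total_zero)
```

Whether zeros or `NA`s are preferable depends on the later steps: filters such as `!is.na(Female)` keep zero-count rows once the cells are filled.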
library(forcats)
selected_data %>%
  filter(income == ">50K", !is.na(Female), !is.na(education)) %>%
  # Order levels by total count so the order matches the stacked bar lengths
  mutate(education = fct_reorder(education, Female, .fun = sum, .na_rm = TRUE)) %>%
  ggplot(aes(x = Female, y = education)) +
  geom_col() +
  labs(title = "Females Earning >50K by Education Level", x = "Count", y = "Education Level") +
  theme_minimal()
selected_data %>%
  filter(income == ">50K", !is.na(Male), !is.na(education)) %>%
  # Order levels by total count so the order matches the stacked bar lengths
  mutate(education = fct_reorder(education, Male, .fun = sum, .na_rm = TRUE)) %>%
  ggplot(aes(x = Male, y = education)) +
  geom_col() +
  labs(title = "Males Earning >50K by Education Level", x = "Count", y = "Education Level") +
  theme_minimal()
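When a level appears on several rows, `fct_reorder()` summarises its values with the median unless `.fun` says otherwise, while a stacked bar shows their sum. The two orderings can disagree, as this small sketch with invented values suggests (`education` and `count` here are toy vectors, not columns from the data).

```r
library(forcats)

# Toy values, invented for illustration only
education <- c("HS-grad", "HS-grad", "HS-grad", "Masters", "Masters")
count     <- c(1, 1, 10, 3, 3)

# Default summary is the median: HS-grad (1) sorts before Masters (3)
by_median <- levels(fct_reorder(education, count))

# With .fun = sum: Masters (6) sorts before HS-grad (12)
by_sum <- levels(fct_reorder(education, count, .fun = sum))

print(by_median)
print(by_sum)
```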
selected_data %>%
  filter(income == ">50K", !is.na(native_country)) %>%
  # Treat a missing count as zero so rows with only one sex still add to the total
  mutate(total = coalesce(Female, 0L) + coalesce(Male, 0L)) %>%
  group_by(native_country) %>%
  summarise(total = sum(total, na.rm = TRUE)) %>%
  mutate(native_country = fct_reorder(native_country, total)) %>%
  ggplot(aes(x = total, y = native_country)) +
  geom_col() +
  labs(title = "Number of People Earning >50K by Native Country", x = "Count", y = "Native Country") +
  theme_minimal()
selected_data %>%
  filter(income == ">50K", !is.na(Male), !is.na(native_country)) %>%
  # Reorder on ungrouped data so the levels are sorted by each country's total
  mutate(native_country = fct_reorder(native_country, Male, .fun = sum, .na_rm = TRUE)) %>%
  ggplot(aes(x = Male, y = native_country)) +
  geom_col() +
  labs(title = "Males Earning >50K by Native Country", x = "Count", y = "Native Country") +
  theme_minimal()
selected_data %>%
  filter(income == ">50K", !is.na(Female), !is.na(native_country)) %>%
  # Reorder on ungrouped data so the levels are sorted by each country's total
  mutate(native_country = fct_reorder(native_country, Female, .fun = sum, .na_rm = TRUE)) %>%
  ggplot(aes(x = Female, y = native_country)) +
  geom_col() +
  labs(title = "Females Earning >50K by Native Country", x = "Count", y = "Native Country") +
  theme_minimal()
Based on this dataset, more males than females earn over $50K in absolute counts. Individuals whose native country is the United States form the largest group earning above $50K, ahead of every other country represented. Both comparisons are raw counts, so they also reflect how many people of each sex and country appear in the data.
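Counts and rates can tell different stories when the groups differ in size. As a minimal sketch, the share of each sex earning >50K could be computed as below. The table `toy` and its numbers are hypothetical and only show the mechanics; on the real data, `adult_data %>% count(sex, income)` would presumably take its place.

```r
library(dplyr)

# Hypothetical counts, invented for illustration only
toy <- tibble(
  sex    = c("Female", "Female", "Male", "Male"),
  income = c("<=50K", ">50K", "<=50K", ">50K"),
  n      = c(90L, 10L, 140L, 60L)
)

# Share of each income class within each sex, keeping the >50K rows
shares <- toy %>%
  group_by(sex) %>%
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  filter(income == ">50K")

print(shares)
```

In this toy table the >50K share is 10% for females and 30% for males, a comparison that does not depend on how many people of each sex were sampled.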