This Tidyverse EXTEND assignment was performed on the basis of the Tidyverse CREATE assignment by Noori Selina. I am adding more Graphical Representation of census-income data analysis of annotated codes to her existing examples.
# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
This vignette demonstrates how to perform data manipulation and analysis using the dplyr package. We will work with the Census Income dataset, which contains information on individuals’ demographics and income levels.The data set can be found here: https://www.kaggle.com/datasets/tawfikelmetwally/census-income-dataset/data
To load the data, I will be importing the data from a GitHub link.
census_data <- read_csv("https://raw.githubusercontent.com/NooriSelina/Data-607/main/censusincome.csv")
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Workclass, Education, Marital Status, Occupation, Relationship, Rac...
## dbl (6): Age, Final Weight, EducationNum, Capital Gain, capital loss, Hours ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(census_data)
## # A tibble: 6 × 15
## Age Workclass `Final Weight` Education EducationNum `Marital Status`
## <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spou…
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spou…
## 5 28 Private 338409 Bachelors 13 Married-civ-spou…
## 6 37 Private 284582 Masters 14 Married-civ-spou…
## # ℹ 9 more variables: Occupation <chr>, Relationship <chr>, Race <chr>,
## # Gender <chr>, `Capital Gain` <dbl>, `capital loss` <dbl>,
## # `Hours per Week` <dbl>, `Native Country` <chr>, Income <chr>
The dplyr package provides a set of functions for data manipulation. We will use the following functions to explore the dataset.
filter() function to filter the dataset based on specific
criteria. In this case, we are interested in individuals with incomes
greater than $50,000, so we filter the dataset to include only such
individuals.high_income_data <- census_data %>%
filter(Income == ">50K")
head(high_income_data)
## # A tibble: 6 × 15
## Age Workclass `Final Weight` Education EducationNum `Marital Status`
## <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spo…
## 2 31 Private 45781 Masters 14 Never-married
## 3 42 Private 159449 Bachelors 13 Married-civ-spo…
## 4 37 Private 280464 Some-coll… 10 Married-civ-spo…
## 5 30 State-gov 141297 Bachelors 13 Married-civ-spo…
## 6 40 Private 121772 Assoc-voc 11 Married-civ-spo…
## # ℹ 9 more variables: Occupation <chr>, Relationship <chr>, Race <chr>,
## # Gender <chr>, `Capital Gain` <dbl>, `capital loss` <dbl>,
## # `Hours per Week` <dbl>, `Native Country` <chr>, Income <chr>
income_summary <- high_income_data %>%
group_by(Income) %>%
summarize(
mean_age = mean(Age, na.rm = TRUE),
median_hours = median(`Hours per Week`, na.rm = TRUE)
)
print(income_summary)
## # A tibble: 1 × 3
## Income mean_age median_hours
## <chr> <dbl> <dbl>
## 1 >50K 44.2 40
high_income_data <- high_income_data %>%
arrange(desc(Age))
print(head(high_income_data, 10))
## # A tibble: 10 × 15
## Age Workclass `Final Weight` Education EducationNum `Marital Status`
## <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 90 Local-gov 227796 Masters 14 Married-civ-spou…
## 2 90 Private 51744 Masters 14 Never-married
## 3 90 Private 87372 Prof-school 15 Married-civ-spou…
## 4 90 Private 46786 Bachelors 13 Married-civ-spou…
## 5 90 Private 175491 HS-grad 9 Married-civ-spou…
## 6 90 Private 88991 Bachelors 13 Married-civ-spou…
## 7 90 Private 206667 Masters 14 Married-civ-spou…
## 8 90 ? 313986 HS-grad 9 Married-civ-spou…
## 9 84 Self-emp-inc 172907 Some-college 10 Married-civ-spou…
## 10 83 Self-emp-inc 240150 10th 6 Married-civ-spou…
## # ℹ 9 more variables: Occupation <chr>, Relationship <chr>, Race <chr>,
## # Gender <chr>, `Capital Gain` <dbl>, `capital loss` <dbl>,
## # `Hours per Week` <dbl>, `Native Country` <chr>, Income <chr>
In this vignette, the dplyr package from the
tidyverse collection was used to efficiently manage and
analyze the Census Income dataset. Specifically, the data was filtered
using the filter() function to focus on high-income
individuals, grouped by income levels using the group_by()
function, and then sorted by mean age with the arrange()
function.
This approach uncovered valuable insights about the demographics of
high-income individuals. The dplyr package, along with
other tools from the tidyverse, made data manipulation
tasks straightforward and effective.
The first variable Age is a continuous variable. As an initial step, two histograms are plotted.
# histogram of age by income group
ggplot(census_data) + aes(x=as.numeric(Age), group=Income, fill=Income) +
geom_histogram(binwidth=1, color='black')
# histogram of age by gender group
ggplot(census_data) + aes(x=as.numeric(Age), group=Gender, fill=Gender) +
geom_histogram(binwidth=1, color='black')
It is noticed that majority of the female has income less than 50K but
make can make money more than $50,000 a year. For those do make over
$50,000 annually, they are mainly in midcareer. Interestingly, females
are It is noticed that majority of the observations make less than
$50,000 a year. For those do make over $50,000 annually, they are mainly
in midcareer. Interestingly, females are overrepresented. This could be
possibly caused by census bias. The average age of most working group
lies around 42 for both gender who earn maximum annual income and there
are more male than female in the working industries.
The variable Workclass stands for the industry in which the responding unit is employed.
library(ggplot2)
library(dplyr)
# Delete rows with "?" and "Armed-Forces" in Workclass Column
census_data <- census_data %>%
filter(!(Workclass %in% c("?", "Never-worked")))
# combine into Government job
census_data$Workclass<- gsub('^Federal-gov', 'Government', census_data$Workclass)
census_data$Workclass <- gsub('^Local-gov', 'Government', census_data$Workclass)
census_data$Workclass <- gsub('^State-gov', 'Government', census_data$Workclass)
# combine into Sele-Employed job
census_data$Workclass <- gsub('Self-emp-inc','Self-Employed', census_data$Workclass)
census_data$Workclass <- gsub('Self-emp-not-inc','Self-Employed', census_data$Workclass)
census_data$Workclass <- as.factor(census_data$Workclass)
summary(census_data$Workclass)
## Government Private Self-Employed Without-pay
## 4351 22696 3657 14
Notice that there are two small groups – Never-worked and Without-pay. I will combine them with Unknowns into a group called Other/Unknown. Those who work in the government are further break down into federal, state, and local levels. To facilitate the analysis, I group them into one group called Government. While those who are self-employed fall into two groups, incorporated and not incorporated, and are grouped into Self-Employed.
To calculate counts and proportions, and then creates a bar plot showing the relationship between industry and income across different Workclass categories.
# barplot of job type by income group
# get the counts by industry and income group
count <- table(census_data[census_data$Workclass == 'Government',]$Income)["<=50K"]
count <- c(count, table(census_data[census_data$Workclass == 'Government',]$Income)[">50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Private',]$Income)["<=50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Private',]$Income)[">50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Self-Employed',]$Income)["<=50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Self-Employed',]$Income)[">50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Without-pay',]$Income)["<=50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Without-pay',]$Income)[">50K"])
count <- as.numeric(count)
# create a dataframe
industry <- rep(levels(census_data$Workclass), each = 2)
Income <- rep(c('<=50K', '>50K'), 4)
df <- data.frame(industry, Income, count)
df
## industry Income count
## 1 Government <=50K 3010
## 2 Government >50K 1341
## 3 Private <=50K 17733
## 4 Private >50K 4963
## 5 Self-Employed <=50K 2311
## 6 Self-Employed >50K 1346
## 7 Without-pay <=50K 14
## 8 Without-pay >50K NA
# Get the counts by Workclass and Income group
counts_workclass_income <- sapply(unique(census_data$Workclass), function(w) {
sapply(c('<=50K', '>50K'), function(inc) {
sum(census_data$Income[census_data$Workclass == w] == inc)
})
})
# Create a dataframe
df <- data.frame(
Workclass = rep(unique(census_data$Workclass), each = 2),
Income = rep(c('<=50K', '>50K'), times = length(unique(census_data$Workclass))),
Count = as.vector(counts_workclass_income)
)
# Plotting the bar plot
ggplot(df, aes(x = Workclass, y = Count, fill = Income)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Workclass", y = "Count", title = "Counts of Income Group by Workclass") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
Those who are self employed have the highest tendency of making greater
than $50,000 a year.
Since EducationNum is a continuous representation of education, a stacked bar plot is used to visualize the relationship between EducationNum and Income, in-group proportions are calculated as well.
# create a dataframe
df1 <- data.frame(table(census_data$Income, census_data$EducationNum))
names(df1) <- c('Income', 'EducationNum', 'count')
df1
## Income EducationNum count
## 1 <=50K 1 46
## 2 >50K 1 0
## 3 <=50K 2 150
## 4 >50K 2 6
## 5 <=50K 3 289
## 6 >50K 3 14
## 7 <=50K 4 535
## 8 >50K 4 38
## 9 <=50K 5 437
## 10 >50K 5 26
## 11 <=50K 6 771
## 12 >50K 6 60
## 13 <=50K 7 996
## 14 >50K 7 60
## 15 <=50K 8 362
## 16 >50K 8 31
## 17 <=50K 9 8339
## 18 >50K 9 1629
## 19 <=50K 10 5423
## 20 >50K 10 1352
## 21 <=50K 11 973
## 22 >50K 11 348
## 23 <=50K 12 761
## 24 >50K 12 259
## 25 <=50K 13 3006
## 26 >50K 13 2176
## 27 <=50K 14 734
## 28 >50K 14 941
## 29 <=50K 15 143
## 30 >50K 15 415
## 31 <=50K 16 103
## 32 >50K 16 295
# Plotting the bar plot
ggplot(df1, aes(x = as.factor(EducationNum), y = count, fill = Income)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Education Number", y = "Count", title = "Counts of Income by Education Number") +
theme_minimal()
The bars illustrates that the level of education rises, so does the
likelihood of earning more than $50,000 annually. Individuals with less
than or equal to 8 years of education have less than 10% chance of
achieving this income level. Conversely, for those with doctorate
degrees, almost 75% surpass the $50,000 yearly income threshold.
# Delete rows with "?" and "Armed-Forces" in Occupation Column
census_data <- census_data %>%
filter(!(Occupation %in% c("?", "Armed-Forces")))
census_data$Occupation <- gsub('Adm-clerical', 'White-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Craft-repair', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Exec-managerial', 'White-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Farming-fishing', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Handlers-cleaners', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Machine-op-inspct', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Other-service', 'Service', census_data$Occupation)
census_data$Occupation <- gsub('Priv-house-serv', 'Service', census_data$Occupation)
census_data$Occupation <- gsub('Prof-specialty', 'Professional', census_data$Occupation)
census_data$Occupation <- gsub('Protective-serv', 'Service', census_data$Occupation)
census_data$Occupation <- gsub('Tech-support', 'Service', census_data$Occupation)
census_data$Occupation <- gsub('Transport-moving', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- as.factor(census_data$Occupation)
summary(census_data$Occupation)
## Blue-Collar Professional Sales Service White-Collar
## 10062 4140 3650 5021 7836
# create a dataframe
df2 <- data.frame(table(census_data$Income, census_data$Occupation))
names(df2) <- c('Income', 'Occupation', 'count')
df2
## Income Occupation count
## 1 <=50K Blue-Collar 8362
## 2 >50K Blue-Collar 1700
## 3 <=50K Professional 2281
## 4 >50K Professional 1859
## 5 <=50K Sales 2667
## 6 >50K Sales 983
## 7 <=50K Service 4389
## 8 >50K Service 632
## 9 <=50K White-Collar 5361
## 10 >50K White-Collar 2475
# using dplyr functions
df2 <- df2 %>%
group_by(Occupation) %>%
mutate(percent = count/sum(count) * 100,
pos = cumsum(count) - 0.5 * count)
# Plotting the bar plot for df2
ggplot(df2, aes(x = Occupation, y = count, fill = Income)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Occupation", y = "Count", fill = "Income", title = "Count of Income Levels by Occupation") +
geom_text(aes(label = paste0(round(percent), "%")), position = position_dodge(width = 0.9), vjust = -0.5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
There’s a significant income disparity among distinct occupations. Approximately half of those in Professional roles earn more than $50,000 yearly, contrasting with just 14% within the Service occupation achieving this income level.