First we load the library tidyverse that will help us with manipulation of our data:
library (tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errorsSecond we load out Census Income data set in memory
df_census <- read.csv("./censusincome.csv")
df_census
Just like in the previous Data Dive, we filter out observations/records that have missing/unknown working class as this is a critical field:
df_census_cleaned <- df_census |>
filter(workclass != " ?")Here we sample 5 random samples with replacement where each sample is approximately 50% of our census income data set. We’ll store these samples in different data frames.
For reproducibility, we set the seed of randomness to a specific numerical value, although this is really not mandatory.
set.seed(357)Also for the purposes of sampling at least 50% of our census income data set (here we are obviously using the cleaned data set), we’ll evaluate number of rows and then divide that by 2.
total_rows <- nrow(df_census_cleaned)
sample_size <- ceiling(total_rows / 2)
sample_size
## [1] 15363#Random Sample 1
#sample row indices from the data set
sampled_indices_1 <- sample(1:total_rows, sample_size, replace = TRUE)
#use sampled row indices to form our subset data frame
df_1 <- df_census_cleaned[sampled_indices_1, ]
#Random Sample 2
#sample row indices from the data set
sampled_indices_2 <- sample(1:total_rows, sample_size, replace = TRUE)
#use sampled row indices to form our subset data frame
df_2 <- df_census_cleaned[sampled_indices_2, ]
#Random Sample 1
#sample row indices from the data set
sampled_indices_3 <- sample(1:total_rows, sample_size, replace = TRUE)
#use sampled row indices to form our subset data frame
df_3 <- df_census_cleaned[sampled_indices_3, ]
#Random Sample 1
#sample row indices from the data set
sampled_indices_4 <- sample(1:total_rows, sample_size, replace = TRUE)
#use sampled row indices to form our subset data frame
df_4 <- df_census_cleaned[sampled_indices_4, ]
#Random Sample 1
#sample row indices from the data set
sampled_indices_5 <- sample(1:total_rows, sample_size, replace = TRUE)
#use sampled row indices to form our subset data frame
df_5 <- df_census_cleaned[sampled_indices_5, ]
# Grouping by the given categorical variable
summarize_sample <- function(sample, category, sample_id) {
sample |>
group_by({{ category }}) |>
summarize(
avg_age = mean(age, na.rm = TRUE),
avg_hours = mean(hours_per_week),
count = n()
) |>
# we had an id for the sample
mutate(sample = {{sample_id}})
}
#combining the summaries
combined_summaries <- bind_rows(
summarize_sample(df_1, income, "Sample 1"),
summarize_sample(df_2, income, "Sample 2"),
summarize_sample(df_3, income, "Sample 3"),
summarize_sample(df_4, income, "Sample 4"),
summarize_sample(df_5, income, "Sample 5")
)
combined_summaries
Making a visual plot of this using a bar graph for better visualization:
ggplot(combined_summaries, aes(x=income, y=avg_age, fill = sample)) +
geom_bar(stat ="identity", position="dodge") +
labs(title = " average age by sample and income ", x="Income", y="Average Age") +
theme_minimal()
We can have a similar approach plot for income versus hours worked across all the samples.
#Plotting the summaries by average of hours worked
ggplot(combined_summaries, aes(x=income, y=avg_hours, fill = sample)) +
geom_bar(stat ="identity", position="dodge") +
labs(title = " Average Hours by Sample and Income ", x="Income", y="Average Hours") +
theme_minimal()
#We combine the summaries for the categorization by work class
combined_summaries <- bind_rows(
summarize_sample(df_1, workclass, "Sample 1"),
summarize_sample(df_2, workclass, "Sample 2"),
summarize_sample(df_3, workclass, "Sample 3"),
summarize_sample(df_4, workclass, "Sample 4"),
summarize_sample(df_5, workclass, "Sample 5")
)
#ploting the work class versus age
ggplot(combined_summaries, aes(x=workclass, y=avg_age, fill = sample)) +
geom_bar(stat ="identity", position="dodge") +
labs(title = " Average Age by Sample and work class ", x="work class", y="Average Age") +
theme_minimal() +
theme(axis.text.x = element_text(angle=45, hjust=1.0))
#ploting the work class versus average work hours
ggplot(combined_summaries, aes(x=workclass, y=avg_hours, fill = sample)) +
geom_bar(stat ="identity", position="dodge") +
labs(title = " average work hours by sample and work class ", x="Work class", y="Average Work Hours") +
theme_minimal() +
theme(axis.text.x = element_text(angle=45, hjust=1.0))
For the five samples, it is consistent that the average age for persons earning less or equal to 50k is 36 years, while those that earn above that their average age 43 years, meaning that the older group in our data set earn more than younger people.
It is also consistent for all samples that people who work longer hours per week earn more than those who work less hours per week. Those who earn \(> 50k\) work on average 46 hours per week, while those who earn \(\le 50k\) work on average 39 hours per week.
For both work class versus age and work class versus work hours, the respective averages are consistent across all the samples, except for the never worked and without pay categories. This was quite expected because these two work classes have very few records hence the fluctuations of the aggregation metrics.