Week 4 Data Dive

Loading the Data & Cleaning

First we load the library tidyverse that will help us with manipulation of our data:

library (tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Second we load out Census Income data set in memory
```
df_census <- read.csv("./censusincome.csv")
df_census
```
Just like in the previous Data Dive, we filter out observations/records that have missing/unknown working class as this is a critical field:
```
df_census_cleaned <- df_census |>
  filter(workclass != " ?")
```

Sampling the data set

Here we sample 5 random samples with replacement where each sample is approximately 50% of our census income data set. We’ll store these samples in different data frames.
For reproducibility, we set the seed of randomness to a specific numerical value, although this is really not mandatory.
```
set.seed(357)
```
Also for the purposes of sampling at least 50% of our census income data set (here we are obviously using the cleaned data set), we’ll evaluate number of rows and then divide that by 2.
```
total_rows <- nrow(df_census_cleaned)
sample_size <- ceiling(total_rows / 2)
sample_size
```
```
## [1] 15363
```

Sample 1

#Random Sample 1

#sample row indices from the data set
sampled_indices_1 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_1 <- df_census_cleaned[sampled_indices_1, ]

Sample 2

#Random Sample 2

#sample row indices from the data set
sampled_indices_2 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_2 <- df_census_cleaned[sampled_indices_2, ]

Sample 3

#Random Sample 1

#sample row indices from the data set
sampled_indices_3 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_3 <- df_census_cleaned[sampled_indices_3, ]

Sample 4

#Random Sample 1

#sample row indices from the data set
sampled_indices_4 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_4 <- df_census_cleaned[sampled_indices_4, ]

Sample 5

#Random Sample 1

#sample row indices from the data set
sampled_indices_5 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_5 <- df_census_cleaned[sampled_indices_5, ]

Scrutinizing the Samples

To scrutinize the 5 samples, we can create a simple function to apply the summaries we want for all the samples. We could categorize our samples into various groups using the group_by method. Some of the groups we would create include income, marital status, gender, and country.

# Grouping by the given categorical variable
summarize_sample <- function(sample, category, sample_id) {
  sample |>
    group_by({{ category }}) |>
    summarize(
      avg_age = mean(age, na.rm = TRUE),
      avg_hours = mean(hours_per_week),
      count = n()
    ) |>
    
    # we had an id for the sample
    mutate(sample = {{sample_id}})
}

Categorizing by Income

summarizing each sample by income, we can combine the records for better visualization of the records - this will also come in handy when we’re making a visualization plot.

#combining the summaries
combined_summaries <- bind_rows(
  summarize_sample(df_1, income, "Sample 1"),
  summarize_sample(df_2, income, "Sample 2"),
  summarize_sample(df_3, income, "Sample 3"),
  summarize_sample(df_4, income, "Sample 4"),
  summarize_sample(df_5, income, "Sample 5")
)

combined_summaries

Making a visual plot of this using a bar graph for better visualization:

ggplot(combined_summaries, aes(x=income, y=avg_age, fill = sample)) +
  geom_bar(stat ="identity", position="dodge") +
  labs(title = " average age by sample and income ", x="Income", y="Average Age") +
  theme_minimal()

We can have a similar approach plot for income versus hours worked across all the samples.

#Plotting the summaries by average of hours worked
ggplot(combined_summaries, aes(x=income, y=avg_hours, fill = sample)) +
  geom_bar(stat ="identity", position="dodge") +
  labs(title = " Average Hours by Sample and Income ", x="Income", y="Average Hours") +
  theme_minimal()

Categorizing by Work Class

Using the approach similar to the one we have above of grouping the observations by work class, evaluating and combining the summaries we can have it as below - we’re still going to use the helper function we created earlier i.e. summarize_sample method.

#We combine the summaries for the categorization by work class
combined_summaries <- bind_rows(
  summarize_sample(df_1, workclass, "Sample 1"),
  summarize_sample(df_2, workclass, "Sample 2"),
  summarize_sample(df_3, workclass, "Sample 3"),
  summarize_sample(df_4, workclass, "Sample 4"),
  summarize_sample(df_5, workclass, "Sample 5")
)

#ploting the work class versus age 
ggplot(combined_summaries, aes(x=workclass, y=avg_age, fill = sample)) +
  geom_bar(stat ="identity", position="dodge") +
  labs(title = " Average Age by Sample and work class ", x="work class", y="Average Age") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1.0))

#ploting the work class versus average work hours
ggplot(combined_summaries, aes(x=workclass, y=avg_hours, fill = sample)) +
  geom_bar(stat ="identity", position="dodge") +
  labs(title = " average work hours by sample and work class ", x="Work class", y="Average Work Hours") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1.0))

Insights

For the five samples, it is consistent that the average age for persons earning less or equal to 50k is 36 years, while those that earn above that their average age 43 years, meaning that the older group in our data set earn more than younger people.
It is also consistent for all samples that people who work longer hours per week earn more than those who work less hours per week. Those who earn \(> 50k\) work on average 46 hours per week, while those who earn \(\le 50k\) work on average 39 hours per week.
For both work class versus age and work class versus work hours, the respective averages are consistent across all the samples, except for the never worked and without pay categories. This was quite expected because these two work classes have very few records hence the fluctuations of the aggregation metrics.