Loading the Data & Cleaning

Sampling the data set

Sample 1

#Random Sample 1

#sample row indices from the data set
sampled_indices_1 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_1 <- df_census_cleaned[sampled_indices_1, ]

Sample 2

#Random Sample 2

#sample row indices from the data set
sampled_indices_2 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_2 <- df_census_cleaned[sampled_indices_2, ]

Sample 3

#Random Sample 1

#sample row indices from the data set
sampled_indices_3 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_3 <- df_census_cleaned[sampled_indices_3, ]

Sample 4

#Random Sample 1

#sample row indices from the data set
sampled_indices_4 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_4 <- df_census_cleaned[sampled_indices_4, ]

Sample 5

#Random Sample 1

#sample row indices from the data set
sampled_indices_5 <- sample(1:total_rows, sample_size, replace = TRUE)

#use sampled row indices to form our subset data frame
df_5 <- df_census_cleaned[sampled_indices_5, ]

Scrutinizing the Samples

# Grouping by the given categorical variable
summarize_sample <- function(sample, category, sample_id) {
  sample |>
    group_by({{ category }}) |>
    summarize(
      avg_age = mean(age, na.rm = TRUE),
      avg_hours = mean(hours_per_week),
      count = n()
    ) |>
    
    # we had an id for the sample
    mutate(sample = {{sample_id}})
}

Categorizing by Income

  • summarizing each sample by income, we can combine the records for better visualization of the records - this will also come in handy when we’re making a visualization plot.
#combining the summaries
combined_summaries <- bind_rows(
  summarize_sample(df_1, income, "Sample 1"),
  summarize_sample(df_2, income, "Sample 2"),
  summarize_sample(df_3, income, "Sample 3"),
  summarize_sample(df_4, income, "Sample 4"),
  summarize_sample(df_5, income, "Sample 5")
)

combined_summaries

Making a visual plot of this using a bar graph for better visualization:

ggplot(combined_summaries, aes(x=income, y=avg_age, fill = sample)) +
  geom_bar(stat ="identity", position="dodge") +
  labs(title = " average age by sample and income ", x="Income", y="Average Age") +
  theme_minimal()

  • We can have a similar approach plot for income versus hours worked across all the samples.

    #Plotting the summaries by average of hours worked
    ggplot(combined_summaries, aes(x=income, y=avg_hours, fill = sample)) +
      geom_bar(stat ="identity", position="dodge") +
      labs(title = " Average Hours by Sample and Income ", x="Income", y="Average Hours") +
      theme_minimal() 

Categorizing by Work Class

  • Using the approach similar to the one we have above of grouping the observations by work class, evaluating and combining the summaries we can have it as below - we’re still going to use the helper function we created earlier i.e. summarize_sample method.
#We combine the summaries for the categorization by work class
combined_summaries <- bind_rows(
  summarize_sample(df_1, workclass, "Sample 1"),
  summarize_sample(df_2, workclass, "Sample 2"),
  summarize_sample(df_3, workclass, "Sample 3"),
  summarize_sample(df_4, workclass, "Sample 4"),
  summarize_sample(df_5, workclass, "Sample 5")
)

#ploting the work class versus age 
ggplot(combined_summaries, aes(x=workclass, y=avg_age, fill = sample)) +
  geom_bar(stat ="identity", position="dodge") +
  labs(title = " Average Age by Sample and work class ", x="work class", y="Average Age") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1.0)) 

#ploting the work class versus average work hours
ggplot(combined_summaries, aes(x=workclass, y=avg_hours, fill = sample)) +
  geom_bar(stat ="identity", position="dodge") +
  labs(title = " average work hours by sample and work class ", x="Work class", y="Average Work Hours") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1.0)) 

Insights

  1. For the five samples, it is consistent that the average age for persons earning less or equal to 50k is 36 years, while those that earn above that their average age 43 years, meaning that the older group in our data set earn more than younger people.

  2. It is also consistent for all samples that people who work longer hours per week earn more than those who work less hours per week. Those who earn \(> 50k\) work on average 46 hours per week, while those who earn \(\le 50k\) work on average 39 hours per week.

  3. For both work class versus age and work class versus work hours, the respective averages are consistent across all the samples, except for the never worked and without pay categories. This was quite expected because these two work classes have very few records hence the fluctuations of the aggregation metrics.