Code
library(tidyverse)
library(ggplot2)
library(dplyr)This project aims to visually and statistically explore the 2021 household census dataset using Quarto and ggplot2 in R to uncover relationships between key demographic and socioeconomic variables with data source provided by the University of Wolverhampton.
library(tidyverse)
library(ggplot2)
library(dplyr)Householdcensus_df <- read.csv("C:/Users/User/Downloads/data.csv", na.strings = "NA")
view(Householdcensus_df)
summary(Householdcensus_df) ID Person_ID Age Mar_Stat
Min. : 37 Min. : 1.000 Min. : 0.00 Length:27410
1st Qu.: 324129 1st Qu.: 1.000 1st Qu.:16.00 Class :character
Median : 617477 Median : 2.000 Median :35.00 Mode :character
Mean : 624437 Mean : 2.242 Mean :35.67
3rd Qu.: 938244 3rd Qu.: 3.000 3rd Qu.:51.00
Max. :1236779 Max. :16.000 Max. :93.00
INC Female H8 Eth
Min. : 0 Min. :0.0000 Min. :0.0000 Length:27410
1st Qu.: 6000 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
Median : 18000 Median :1.0000 Median :0.0000 Mode :character
Mean : 27767 Mean :0.5138 Mean :0.1294
3rd Qu.: 35900 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :720000 Max. :1.0000 Max. :1.0000
NA's :6173
Highest.Ed
Length:27410
Class :character
Mode :character
The dataset includes 27,410 individuals. Ages range from 0 to 93 years, with a median of 35 years, though the minimum age of 0 appears unrealistic. Income varies widely from £0 to £720,000, with a median of £18,000 and a mean of £27,767, indicating some high-income outliers. About 51% of the sample are female. Marital status, ethnicity, education, and the variable H8 are categorical. Notably, 6,173 income values are missing.
Householdcensus_df <- Householdcensus_df |> rename(Household_ID = `ID`, Individual_ID = `Person_ID`, Marital_Status = `Mar_Stat`,
Annual_Income = `INC`, Is_Female = `Female`, Shared_Accommodation = `H8`, Ethnicity = `Eth`, Highest_Education = `Highest.Ed`)
colnames(Householdcensus_df)[1] "Household_ID" "Individual_ID" "Age"
[4] "Marital_Status" "Annual_Income" "Is_Female"
[7] "Shared_Accommodation" "Ethnicity" "Highest_Education"
sapply(Householdcensus_df, class) Household_ID Individual_ID Age
"integer" "integer" "integer"
Marital_Status Annual_Income Is_Female
"character" "integer" "integer"
Shared_Accommodation Ethnicity Highest_Education
"integer" "character" "character"
Several variables (Marital_Status, Is_Female, Shared_Accommodation, Ethnicity, Highest_Education) were incorrectly stored as integers instead of factors, and Annual_Income as a character instead of numeric. Correcting these ensures accurate analysis and visualization.
Householdcensus_df <- Householdcensus_df %>%
mutate(across(c(Marital_Status, Is_Female, Shared_Accommodation, Ethnicity, Highest_Education), as.factor),Annual_Income = as.numeric(Annual_Income))sapply(Householdcensus_df, class) Household_ID Individual_ID Age
"integer" "integer" "integer"
Marital_Status Annual_Income Is_Female
"factor" "numeric" "factor"
Shared_Accommodation Ethnicity Highest_Education
"factor" "factor" "factor"
colSums(is.na(Householdcensus_df)) Household_ID Individual_ID Age
0 0 0
Marital_Status Annual_Income Is_Female
6144 6173 0
Shared_Accommodation Ethnicity Highest_Education
0 0 1123
Marital_Status, Annual_Income, and Highest_Education have missing values (22.4%, 22.5%, and 4.1%, respectively). To preserve data integrity and avoid bias in visualizations, missing Marital_Status and Highest_Education will be replaced with the mode(most frequent value) since they are factors.
replace_na_with_mode <- function(df, cols) {
for (c in cols) {
df[[c]] <- factor(replace(as.character(df[[c]]),
is.na(df[[c]]),
names(which.max(table(df[[c]])))))
}
df
}
Householdcensus_df <- replace_na_with_mode(Householdcensus_df, c("Marital_Status"))
Householdcensus_df <- replace_na_with_mode(Householdcensus_df, c("Highest_Education"))Annual_Incomehas 6,173 missing values, which could distort visualizations. Since it’s numeric, we’ll use mean imputation to preserve its distribution.
Householdcensus_df$Annual_Income[is.na(Householdcensus_df$Annual_Income)] <-
mean(Householdcensus_df$Annual_Income, na.rm = TRUE)colSums(is.na(Householdcensus_df)) Household_ID Individual_ID Age
0 0 0
Marital_Status Annual_Income Is_Female
0 0 0
Shared_Accommodation Ethnicity Highest_Education
0 0 0
The dataset is now clean, complete, and properly structured with correct data types, ready for visualization and analysis.
Householdcensus_df$Age[Householdcensus_df$Age == 0] <-
mean(Householdcensus_df$Age[Householdcensus_df$Age != 0], na.rm = TRUE)since having 0 age is very unrealistic i decided to replace the 0 with the mean as shown above.
Householdcensus_df <- Householdcensus_df %>% group_by(Household_ID) %>%mutate(Household_Size = n()) %>% ungroup() %>% mutate( Income_Category = case_when(
Annual_Income < 20000 ~ "Low", Annual_Income >= 20000 & Annual_Income <= 50000 ~ "Medium", Annual_Income > 50000 ~ "High", TRUE ~ "Unknown" ))
Householdcensus_df$Income_Category <- factor(Householdcensus_df$Income_Category, levels = c("Low", "Medium", "High", "Unknown"))
view(Householdcensus_df)# First, compute summary data for the plot
summary_df <- Householdcensus_df %>%
group_by(Household_Size, Ethnicity) %>%
summarise(Avg_Income = mean(Annual_Income, na.rm = TRUE), .groups = "drop")
ggplot(summary_df, aes(x = Household_Size, y = Avg_Income, color = Ethnicity)) +
geom_line(size = 1) +
geom_point(size = 2) +
labs(title = "Average Annual Income by Household Size and Ethnicity",
x = "Household Size",
y = "Average Annual Income (£)",
color = "Ethnicity") +
theme_light()Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
From the above visual, we notice that income peaks at 2-person households (£50–58k) for all groups, then declines sharply fastest for Black and Hispanic families. White households maintain a £10–15k advantage at every size.Recommendation: Introduce progressive family tax credits that scale with household size for ethnic minorities to offset the per-capita income penalty in larger homes.
Householdcensus_df %>%
count(Household_Size, Highest_Education) %>%
ggplot(aes(x = Household_Size, y = Highest_Education, fill = n)) +
geom_tile(color = "white") +
scale_fill_viridis_c(option = "magma", labels = scales::comma) +
labs(title = "Heatmap: Larger Households = Lower Education",
x = "Household Size", y = "Highest Education", fill = "Count") +
theme_minimal()From the heatmap, No individual in a household with> 8 people has a Master’s degree; higher education clusters only in 1–3 person homes. A bright yellow band at size 3–4 for “Less than Secondary” signals early dropout risk. Recommendation: Deploy mobile adult learning units and childcare-integrated evening classes in large low-income estates to break the education-size trap.
ggplot(Householdcensus_df, aes(x = Ethnicity, fill = Highest_Education)) +
geom_bar(position = "fill") +
labs(title = "Distribution of Education Levels Across Ethnic Groups",
x = "Ethnicity",
y = "Proportion") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))White respondents lead with 20% Master ’s+, while Black and Hispanic groups have >50% with secondary or less. Asian and “Other” groups show balanced mid-tier attainment. Recommendation: Launch secondary-to-HE bridge programs with guaranteed progression for Black and Hispanic 18-year olds, backed by income-contingent loans
ggplot(Householdcensus_df,
aes(x = Highest_Education, y = Annual_Income, fill = Is_Female)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ Ethnicity, scales = "free_y") +
labs(title = "Income Gaps at Intersection of Education, Gender & Ethnicity",
x = "Highest Education", y = "Annual Income (£)", fill = "Gender (0=M, 1=F)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))From the above box plot, we noticed that Black women with ‘Master’s’ earn less than White men with Bachelor’s; Hispanic females show the widest gender gap at the postgraduate level. Outliers above £500k are nearly all White males with advanced degrees. Recommendation: Mandate pay transparency for roles requiring Master ’s+ and fund ethnic-gender postgraduate scholarships in STEM and finance.
# Assume Marital_Status levels are factors (e.g., 1=Single, 2=Married, etc.); relabel if needed
ggplot(Householdcensus_df, aes(x = Income_Category, fill = Ethnicity)) +
geom_bar(position = "dodge") +
facet_wrap(~ Marital_Status, labeller = labeller(Marital_Status = c("1" = "Single", "2" = "Married", "3" = "Divorced", "4" = "Widowed"))) + # Adjust labels based on your data
labs(title = "Marital Status Distribution by Income Category and Ethnicity",
x = "Income Category", y = "Count", fill = "Ethnicity") +
theme_light() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "right")From above, Married White individuals dominate High income (6,000) ; Black, Hispanic, and single/formerly married groups cluster in Low/Medium. Recommendation: Offer individual tax credits and financial training for non-married minorities to reduce marriage-income dependency.
This project transforms raw 2021 census data into evidence-based policy levers through rigorous cleaning, derived variables (Household_Size, Income_Category), and strategic multivariate visualisation in Quarto. The visuals collectively prove that education, household structure, ethnicity, and gender interact powerfully to shape income: two-person, White, highly educated households dominate the top, while large, minority, low-education families face compounding disadvantage. The recommended interventions targeted scholarships, family-size tax relief, pay transparency, and community learning hubs are precise, feasible, and directly traceable to the patterns uncovered.