Unveiling Socio-Economic Dynamics: A Comprehensive Exploration of Household Census Data from England, 2021

Author

Clara Raphael

Introduction

The 2021 household census conducted in England provides a rich repository of demographic and socio-economic information (Shipsey et al. 2020). Working with a modified snapshot of the census data, my analysis delves into this data set, aiming to uncover compelling insights and discern patterns that illuminate the relationships between demographic variables, income levels, and living conditions.

About Data

The data set under examination is a modified snapshot derived from a comprehensive household census conducted in England in 2021. Comprising a diverse array of variables, this data set encapsulates crucial demographic and socio-economic information collected from households across the region.

Key Variables include:

ID and Person_ID: Identification numbers assigned to households and individuals, facilitating the organization and differentiation of data entries.
Age: Provides insights into the age distribution of individuals within households, aiding in demographic profiling.
Mar_Stat: Indicates the marital status of individuals, enabling the exploration of household compositions.
INC: Represents the annual income in pounds, serving as a pivotal indicator of households’ economic status.
Female: A binary variable signifying the gender of individuals within households.
H8: Binary variable denoting whether all rooms in the accommodation are exclusively used by the household where 0 means shared apartment and 1 means non-shared apartment.
Eth: Captures information on the ethnicity of individuals, contributing to understanding cultural diversity.
Highest Ed: Indicates the highest level of education attained by individuals, offering insights into educational backgrounds.

Explore Data

The primary objective of this report is to provide a comprehensive understanding of the socio-economic dynamics prevalent among households in England, emphasizing the influence of income disparities on living conditions and the differential impact of demographic factors on the socio-economic status of households.

To achieve that, let’s start by importing necessary libraries and load the data.

Importing libraries

Code

library(tidyverse) #for data transformation and wrangling

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(ggplot2) # for creating data visualization
library(ggthemes) #customizing data visualizations
library(dplyr) #data manipulation
library(caret)

Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

Code

library(caTools)

Loading Data

Code

data <- read.csv("C:/Users/Jucheey/Downloads/data-1.csv")

Data Inspection

After loading data, it is ideal to inspect the data to give an idea of how much data we are working with and its structure.

Data set shape

The data contains 27,410 observations and 9 variables

Code

n_rows <- dim.data.frame(data)[1] # number of rows 
n_cols <- dim.data.frame(data)[2] # number of columns

cat("Number of rows is: ", n_rows, "\n")

Number of rows is:  27410

Code

cat("Number of columns: ", n_cols, "\n")

Number of columns:  9

Sample observation

From the data set, the total number of respondents is 27,410 from 10,565 households.

Code

# Print the number of sampled households
cat("Number of sampled households is:", n_distinct(data$ID), "\n")

Number of sampled households is: 10565

Code

cat("Number of sampled persons is:", length(data$ID), "\n")

Number of sampled persons is: 27410

Missing values

The output “Columns with missing values:” highlights specific columns within the data set where missing values are present. This information is critical for data analysis as it indicates columns—‘Mar_Stat’, ‘INC’, and ‘Highest.Ed’—with 6144, 6173, and 1123 missing values, respectively. These missing values might require attention or handling before proceeding with any statistical analysis or modeling. Understanding and addressing missing data is crucial to ensure the accuracy and reliability of any insights or conclusions drawn from the data set (Kang 2013).

Code

# Check for missing values in each column and display the count per column
if (any(colSums(is.na(data)) > 0)) {
  print("Columns with missing values:")
  print(colSums(is.na(data))[colSums(is.na(data)) > 0])
} else {
  print("No missing values found in any column.")
}

[1] "Columns with missing values:"
  Mar_Stat        INC Highest.Ed 
      6144       6173       1123

Descriptive Statistics

The descriptive statistics provide valuable insights into the Age and Income (INC) columns from the dataset. Let’s break down the information provided:

Age:

Minimum Age: The minimum recorded age in the dataset is 0, which could indicate entries for infants or very young children.
1st Quartile (Q1): 25% of the individuals are aged 16 or below.
Median Age: The median age, or the middle value when all ages are ordered, is 35. This means half of the individuals are below 35 and half are above.
Mean Age: The mean age is slightly higher than the median, at 35.67. The distribution might have a slight positive skew (as mean > median), suggesting a few higher age outliers.
3rd Quartile (Q3): 75% of the individuals are aged 51 or below.
Maximum Age: The maximum recorded age is 93.

Income (INC):

Minimum Income: The lowest recorded income is 0, which might indicate missing or invalid entries.
1st Quartile (Q1): 25% of the households have an income of 6000 or lower.
Median Income: The median income is 18000, indicating that half of the households have an income below this value.
Mean Income: The mean income is higher than the median, at 27767. This suggests a possible right-skewed distribution with some higher-income outliers.
3rd Quartile (Q3): 75% of the households have an income of 35900 or lower.
Maximum Income: The maximum recorded income is 720000.
Missing Values: There are 6173 missing values in the Income column (NA’s).

Insights:

Age Distribution: The age distribution seems to be relatively spread out, with a mean and median close together, indicating a somewhat symmetric distribution.
Income Distribution: The income distribution, however, appears to be right-skewed, with a higher mean than the median, suggesting a few high-income outliers affecting the mean.
Potential Issues: There are quite a few missing values in the Income column that need to be addressed for a more comprehensive analysis.
Consider Outliers: The presence of very low or very high values in both Age and Income columns might indicate potential outliers that could significantly affect the analysis and need further investigation.

These insights provide a preliminary understanding of the age and income distributions within the dataset. Further analysis, outlier treatment, and data cleansing might be necessary for a more accurate and robust exploration of these variables’ relationships with other factors in the dataset.

Code

summary(data[c('Age', 'INC')],na.rm = TRUE)

      Age             INC        
 Min.   : 0.00   Min.   :     0  
 1st Qu.:16.00   1st Qu.:  6000  
 Median :35.00   Median : 18000  
 Mean   :35.67   Mean   : 27767  
 3rd Qu.:51.00   3rd Qu.: 35900  
 Max.   :93.00   Max.   :720000  
                 NA's   :6173

Exploratory Data Analysis

The data set has been loaded and inspected. Now, it is time to explore relationships between variables through data visualization. We will work with the ggplot2 library to achieve this.

To start with, lets explore the relationship between Age and Income to see if there’s any observable trend:

Code

# Remove all rows with missing values from the entire data set
cleaned_data <- na.omit(data)

Code

# Creating a scatter plot with a legend based on the 'H8' column
ggplot(cleaned_data, aes(x = Age, y = INC, color = factor(H8))) +
  geom_point(na.rm = TRUE) +
  labs(title = "Age vs. Income with Living Conditions",
       x = "Age",
       y = "Income",
       color = "Living Conditions") +
  scale_color_manual(values = c("0" = "blue", "1" = "red"), 
                     labels = c("Non-Exclusive LC", 
                                "Exclusive LC"))

The relationship between age and income is a linear relationship which means that the older a person gets, the higher the income.This is also true when compared with those living in an exclusive apartment (H8). However, there are categories of persons that still earn low (below 100k) despite the variation in age

To further explore the relationship between age and income, we will categorize them into groups.

Code

# Create intervals for income levels and assign labels
cleaned_data["income_group"] = cut(cleaned_data$INC, c(0, 20000, 40000, 60000,80000, Inf), c("0-20k", "20-40k", "40-60k", "60-80k", "80k+"), include.lowest=TRUE)
# Create intervals for age and assign labels
cleaned_data$Age_Group <- cut(cleaned_data$Age, c(0, 18, 30, 40,50,60, Inf), c("0-18", "19-30", "31-40", "41-50", "51-60", "60+"), include.lowest=TRUE)

Code

# Calculate the mean income for each age group
mean_income_age <- cleaned_data %>%
  group_by(Age_Group) %>%
  summarise(mean_income = mean(INC, na.rm = TRUE))

# Plotting a bar chart of average income by age group
ggplot(mean_income_age, aes(x = Age_Group, y = mean_income, fill = Age_Group)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Income by Age Group",
       x = "Age Group",
       y = "Average Income") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the visual, we can see that the older a person gets,the higher their income. However, there is a drop in income for persons above 60. This can be attributed to the fact that most persons retire at that age and therefore source of income is reduced.

Now, let’s explore the influence of income on living conditions across age groups.

Code

# Boxplot with legend based on H8 for Income by Living Conditions
ggplot(cleaned_data, aes(x = factor(Age_Group), y = INC, fill = factor(H8))) +
  geom_boxplot() +
  labs(title = "Income by Living Conditions among Age group",
       x = "Age Group",
       y = "Income",
       fill = "Living Conditions")

While it is ideal for a household to be able to afford certain level of comfort as a result of higher income, the visual above shows that persons living in a shared apartment (H8 = 0) have slightly higher income compared to their counterparts living in non-shared apartment.The highest earner is between the age of 31-40 years and still lives in a shared apartment and likewise, this trend is same across other categories.

Having explored the relationship between age and income, it is time to explore the differential impact of other demographic factors in the data set on the socio-economic status of households

Code

ggplot(cleaned_data, aes(x = factor(Female), y = INC, fill = factor(H8))) +
  geom_boxplot() +
  labs(title = "Income by Living Conditions among Gender",
       x = "Sex",
       y = "Income",
       fill = "Living Conditions")

Females have generally higher income than males. When compared to their living conditions , both male and females with higher income stay in shared apartments compared to their counterparts.

Code

ggplot(cleaned_data, aes(x = factor(Mar_Stat), y = INC, fill = factor(H8))) +
  geom_boxplot() +
  labs(title = "Income by Living Conditions by Marital Status",
       x = "Marital status",
       y = "Income",
       fill = "Living Conditions") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Taking a lot at the trend, married people have generally higher income. the widowed category have the least income however, those with higher income across all categories stay in shared apartments

Code

ggplot(cleaned_data, aes(x = factor(Eth), y = INC, fill = factor(H8))) +
  geom_boxplot() +
  labs(title = "Income by Living Conditions among Ethnic group",
       x = "Ethnic group",
       y = "Income",
       fill = "Living Conditions") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Based on the ethnic group, the white generally earn higher however, none of them live in a non-exclusive apartment. The Asian ethnic are among the lowest earner but none stays in a shared apartment.

Code

ggplot(cleaned_data, aes(x = factor(Highest.Ed), y = INC, fill = factor(H8))) +
  geom_boxplot() +
  labs(title = "Income by Living Conditions among Education status",
       x = "Education Status",
       y = "Income",
       fill = "Living Conditions") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Based on the ethnic group, individuals with a masters degree or higher earn higher followed by individuals with a Bachelors degree. However, in a categories, those with higher income live in shared apartment and vice versa.

Factors that influences the household living conditions

Having looked at the relationship between age and income as well as the differential impact of other demographic factors on the socio-economic status of households, it’s time to determine which of the variables influence on the socioeconomic status is statistically significant. The Logistic regression algorithm was employed to explain this relationship.

The following steps were taken to prepare the data.

Encoding of categorical variables. The “income_group”, “Age_group”, “Mar_Stat”, “Eth” and “Highest Ed” variables were converted to numbers as a step to prepare the data to be fit into the algorithm.

Code

model_data <- cleaned_data
model_data$income_group <- as.integer(factor(model_data$income_group))
model_data$Age_Group <- as.integer(factor(model_data$Age_Group))
model_data$Mar_Stat <- as.integer(factor(model_data$Mar_Stat))
model_data$Eth <- as.integer(factor(model_data$Eth))
model_data$Highest.Ed <- as.integer(factor(model_data$Highest.Ed))
model_data <- model_data[, c( "Mar_Stat", "Female", "H8", "Eth","Highest.Ed", "income_group","Age_Group")]

Splliting the data into train and test set. 70% of the data was used as the training set and the remaining 30% as testing set

Code

set.seed(1)
#Used 70% of data set as training set and remaining 30% as testing set
sample <- sample.split(model_data, SplitRatio = 0.7)
train  <- subset(model_data, sample == TRUE)
test   <- subset(model_data, sample == FALSE)

Building the model

Code

## fit a logistic regression model with the training dataset
log.model <- glm(H8 ~Mar_Stat+Female+Highest.Ed+income_group+Age_Group, data = train, family = binomial(link = "logit"))

Summarizing the model’s output.

Code

summary(log.model)


Call:
glm(formula = H8 ~ Mar_Stat + Female + Highest.Ed + income_group + 
    Age_Group, family = binomial(link = "logit"), data = train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.51592    0.15195   3.395 0.000686 ***
Mar_Stat     -0.08763    0.03607  -2.430 0.015107 *  
Female       -0.36236    0.05992  -6.048 1.47e-09 ***
Highest.Ed   -0.16158    0.02063  -7.833 4.77e-15 ***
income_group -0.50364    0.03851 -13.077  < 2e-16 ***
Age_Group    -0.24540    0.01923 -12.761  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8626.0  on 12135  degrees of freedom
Residual deviance: 8063.1  on 12130  degrees of freedom
AIC: 8075.1

Number of Fisher Scoring iterations: 5

Findings:

Marital Status (Mar_Stat): Being married or having a different marital status compared to being unmarried reduces the likelihood of living in their own apartment without sharing by approximately 0.09 on the log-odds scale.
Gender (Female): Being female, as opposed to being male, is associated with a decrease in the likelihood of staying in their own apartment without sharing.
Education Level (Highest.Ed): Individuals with a Master’s degree or higher education are less likely to live in their own apartment without sharing, decreasing this likelihood by approximately 0.16 on the log-odds scale.
Income (income_group): Higher income is linked with a decreased likelihood of living in a non-shared apartment. For each unit increase in income, the chance of living in their own apartment decreases by around 0.5 on the log-odds scale.
Age (Age_Group): With each unit increase in age, there’s a reduced likelihood of living in their own apartment without sharing by approximately 0.25 on the log-odds scale.

Conclusion:

All these factors—marital status, gender, education, income, and age—show statistically significant associations with the living arrangements (‘H8’ category), indicating their relevance in determining whether individuals opt for their own apartment or shared living arrangements. The ‘***’ in the ‘Signif. codes’ column denotes high statistical significance (p < 0.001) for all variables, implying strong evidence that these factors influence living arrangements.

Clustering Analysis

In a bid to extract deeper insights from the data set, I made a decision to employ clustering analysis. This strategic approach was chosen to uncover inherent patterns and groupings within the data set, aiming to segment respondents based on shared characteristics or behaviors. By utilizing clustering techniques, the objective was to unveil distinct clusters or subgroups among respondents, facilitating a comprehensive understanding of variations and similarities in their socio-economic status. This segmentation strategy sought to provide a nuanced perspective, allowing for a more targeted and insightful exploration of the data set’s intricacies, ultimately enhancing the understanding of the diverse socio-economic landscapes among households.

Code

cluster_data <- cleaned_data
cluster_data <- cluster_data[, c( "Age","INC","Mar_Stat", "Female", "H8", "Eth","Highest.Ed", "income_group","Age_Group")]

cluster_data$income_group <- as.integer(factor(cluster_data$income_group))
cluster_data$Age_Group <- as.integer(factor(cluster_data$Age_Group))
cluster_data$Mar_Stat <- as.integer(factor(cluster_data$Mar_Stat))
cluster_data$Eth <- as.integer(factor(cluster_data$Eth))
cluster_data$Highest.Ed <- as.integer(factor(cluster_data$Highest.Ed))

set.seed(1)
# K-means clustering with k clusters (adjust 'k' as needed)
k <- 4  # Number of clusters
kmeans_model <- kmeans(cluster_data, centers = k, nstart = 25)

# View the cluster assignments
cluster_assignments <- kmeans_model$cluster
data_with_clusters <- cbind(cluster_data, cluster = cluster_assignments)

Following the clustering analysis, the data set was effectively grouped into four distinct clusters based on respondents’ characteristics. Visualizing this segmentation through a scatter plot portraying the relationship between age and income levels offered illuminating insights. Despite variations in age, the clusters delineated three primary income categories. Firstly, a group comprising high-income earners, notably securing incomes of 200k pounds and above. Secondly, a middle-income cluster, encompassing individuals earning between 80k and 200k pounds. Finally, a lower-income category was unveiled, further subdivided into two sub classes: individuals earning nil to less than 20k pounds, and those earning between 20k and 80k pounds. This clear classification shed light on distinct income brackets, providing an accessible depiction of the socio-economic landscape within the data set.

Code

ggplot(data_with_clusters, aes(x = Age, y = INC, color = factor(cluster))) +
  geom_point() +
  labs(title = "K-means Clustering of Respondents by Age group and income",
       x = "Age",
       y = "Income",
       color = "Cluster") +
  theme_minimal()

Income plays a significant role in determining the socio economic status of an individual. Earning higher income means an individual could afford a certain kind of luxury. Hence, further examination was done to visualize income distribution concerning the ‘H8’ variable across different clusters. For individuals categorized as high-income earners, a compelling trend emerged: a larger proportion of those living in shared apartments fell within the income range of 200k to 320k, in contrast to their counterparts residing in non-shared apartments. Notably, for non-shared apartment dwellers in this bracket, the minimum income surpassed 320k, reaching up to 420k. Similar patterns surfaced across other income brackets, where respondents in shared apartments consistently showcased higher income compared to their counterparts in non-shared apartments. This exploration illuminated how income dynamics, especially within high-income brackets, intertwined with living arrangements, emphasizing a prevalent pattern of higher incomes among shared apartment residents, regardless of income category.

Code

ggplot(data_with_clusters, aes(x = factor(H8), y = INC, fill = factor(cluster))) +
  geom_boxplot() +
  labs(title = "Income by Living Conditions",
       x = "Living Condition",
       y = "Income",
       fill = "Clusters")

Insights

The analysis encapsulates a multifaceted relationship between demographic variables and living arrangements. Notably, variables such as marital status, gender, education, income, and age exhibit statistically significant associations with living arrangements (‘H8’ category). These factors intertwine to influence individuals’ preferences for shared or non-shared living spaces. There’s a consistent trend indicating that higher income is linked to a decreased likelihood of living in non-shared apartments across various income brackets. Moreover, the clustering analysis unveils distinct income categories, revealing prevalent income brackets—high earners above 200k pounds, middle earners between 80k and 200k pounds, and lower earners below 80k pounds—while highlighting a propensity for higher incomes among shared apartment residents across income brackets. This suggests that despite higher incomes, a considerable proportion of affluent individuals opt for shared living arrangements, showcasing intriguing nuances in socio-economic preferences beyond the conventional assumption of income dictating exclusive living conditions.

Conclusion

The comprehensive analysis of various demographic factors and their correlation with living arrangements presents a nuanced understanding of socio-economic dynamics. It unveils a complex interplay between income levels, demographic variables, and living preferences. Factors such as marital status, gender, education, and age, in conjunction with income, significantly influence individuals’ choices regarding shared or non-shared living spaces. Surprisingly, while higher income generally correlates with a decreased tendency to live in non-shared apartments, a substantial segment of affluent individuals opts for shared living arrangements. This challenges the conventional belief that higher income inevitably leads to exclusive living conditions. The clustering analysis further elucidates distinct income categories, emphasizing prevalent income brackets and underscoring a persistent inclination towards shared apartment residency across income brackets.

References

Kang, Hyun. 2013. “The Prevention and Handling of the Missing Data.” Korean Journal of Anesthesiology 64 (5): 402. https://doi.org/10.4097/kjae.2013.64.5.402.

Shipsey, Rachel, Charlie Tomlin, Zoe White, Josie Plachta, Shelley Gammon, and Patricia Dygas. 2020. “2021 Census England and Wales: Developing Record Speed Linkage Methods to Produce Outputs in a Year.” International Journal of Population Data Science 5 (5). https://doi.org/10.23889/ijpds.v5i5.1436.