This report, created using Quarto in RStudio (Bauer and Landesvatter 2023), provides a visual analysis of a snapshot of data from the 2021 household census in England. Using ggplot2, we explore key demographic trends, focusing on age, income, marital status and ethnicity. The main objective is to process the data so that patterns and linear relationships between the variables can be identified through clear visualizations (Hoffmann 2021). The report offers insights that could inform future policies and improve understanding of the correlations between the variables.
2 Data Pre-processing:
After the necessary packages are installed and loaded in RStudio, data pre-processing begins with understanding the variables. This is followed by cleaning and organizing the raw data to make it ready for analysis and visualization (Kandel et al. 2012).
2.1 Data Exploration:
To start the data exploration, we load the tidyverse library and store the path of the data file, which is then read using the read_csv() function.
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
df <- "C:/Users/Dell/OneDrive/Desktop/University/Course/8. Data Science/3. Assignments/data.csv"
We use functions like head() to view the first few rows, dim() to check the size of the data, str() to see the variables and their types, summary() for a quick summary and is.na() to check for any missing values, ensuring the data is clean and ready for analysis.
Code
dr <- read_csv(df)
Rows: 27410 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Mar_Stat, Eth, Highest Ed
dbl (6): ID, Person_ID, Age, INC, Female, H8
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
dim(dr)
[1] 27410 9
The dataset contains 27410 rows, including 6173 null values, and the 9 variables listed above.
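The checks described above can be sketched on a small example. This is a minimal illustration on a hypothetical toy data frame (`toy` and its values are made up, not taken from the census file); it also shows the difference between counting missing cells and counting rows that contain at least one missing value, which matters when reporting a null count.

```r
# Hypothetical toy data mimicking a few of the columns in the report
toy <- data.frame(
  Age      = c(34, 51, NA, 27, 68),
  INC      = c(21000, NA, 18000, NA, 9500),
  Mar_Stat = c("Married", "Divorced", NA, "Never married", "Widowed")
)

head(toy)      # first rows
dim(toy)       # number of rows and columns
str(toy)       # variable names and types
summary(toy)   # per-column summaries

n_missing_cells <- sum(is.na(toy))            # total number of missing cells
missing_by_col  <- colSums(is.na(toy))        # missing cells per column
n_missing_rows  <- sum(!complete.cases(toy))  # rows with at least one missing value

n_missing_cells
missing_by_col
n_missing_rows
```

The same three counts can be computed on the full data by replacing `toy` with `dr`.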
2.2 Tidying the Data:
Cleaning the data includes dealing with missing values, changing categorical data into factors and renaming columns for better clarity. ID and Person_ID are identifiers with minimal feature importance, so they are removed from the data. The data is also filtered to remove irrelevant or unusual entries.
Code
c_dr <- dr %>% select(-ID, -Person_ID)
Next, we remove all null values from the dataset. The aim is to ensure the data is accurate, consistent and in the right format for effective regression analysis and visualization using tools like dplyr and ggplot2.
Code
c_dr <- na.omit(c_dr)
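Because na.omit() drops every row that has a missing value in any column, it is worth checking how many rows are lost. A minimal sketch, assuming a hypothetical toy data frame with the same identifier columns as the report:

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  ID        = 1:5,
  Person_ID = c(1, 2, 1, 1, 3),
  Age       = c(34, 51, NA, 27, 68),
  INC       = c(21000, NA, 18000, 15000, 9500)
)

toy_clean <- toy %>%
  select(-ID, -Person_ID) %>%  # drop identifier columns
  na.omit()                    # drop rows with any missing value

rows_removed <- nrow(toy) - nrow(toy_clean)
rows_removed
```

Comparing nrow() before and after in this way makes the cost of listwise deletion explicit.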
2.3 Refining the Data:
Of the three categorical variables, Mar_Stat and Highest Ed are converted to numeric codes based on their grade, while Eth is left unchanged, as illustrated below. This feature transformation is later used in regression analysis to understand the trends between the variables (Zeileis and Hothorn 2002).
After tidying the data by removing missing values, dropping irrelevant columns and transforming categorical variables into numeric form, as illustrated in Table 2.3, we are now prepared to analyze the relationships between the variables. This refined dataset is ready for further regression analysis and visualization to uncover insights from the data.
Table 2.3. Transforming categorical variables to numeric codes based on their grade.
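One way such a transformation could be written is sketched below. It is not necessarily the code used for this report: the 0 to 4 ordering of marital status is taken from the axis labels used for Fig 4.4, the toy data frame is hypothetical, and the exact spelling of the labels in the raw file is an assumption. Highest Ed would need its own ordered vector of levels, which is not listed here.

```r
library(dplyr)

# Ordering assumed from the Fig 4.4 axis labels (0 = Never married, ..., 4 = Married)
mar_levels <- c("Never married", "Widowed", "Divorced", "Separated", "Married")

# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  Mar_Stat = c("Married", "Never married", "Divorced", "Widowed", "Separated"),
  INC      = c(30000, 12000, 20000, 15000, 18000)
)

toy_num <- toy %>%
  # factor() fixes the level order; as.integer() gives 1..5, so subtract 1 for 0..4
  mutate(Mar_Stat = as.integer(factor(Mar_Stat, levels = mar_levels)) - 1L)

toy_num$Mar_Stat
```

Any label that does not match an entry in `mar_levels` exactly becomes NA, so this step is best run before missing values are removed or followed by a check with is.na().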
3 Relation between the variables:
Grouping the selected variables plays an important role in identifying the strength of relationships in demographic data (Yusuf, Martins, and Swanson 2014). Age is therefore split into two groups: up to 50 (Age <= 50) and above 50 (Age > 50). The following code computes the correlation coefficient between age and average income (INC) within each group.
Code
library(dplyr)

# Average income at each age, then the age/income correlation within each age group
age_income_avg <- c_dr %>%
  group_by(Age) %>%
  summarize(avg_INC = mean(INC, na.rm = TRUE)) %>%
  mutate(Age_Group = ifelse(Age <= 50, "Age <= 50", "Age > 50")) %>%
  group_by(Age_Group) %>%
  summarize(correlation = cor(Age, avg_INC, use = "complete.obs"))

print(paste("Correlation between Age <= 50 and avg_INC: ", round(age_income_avg$correlation[age_income_avg$Age_Group == "Age <= 50"], 2)))
[1] "Correlation between Age <= 50 and avg_INC: 0.96"
Code
print(paste("Correlation between Age > 50 and avg_INC: ", round(age_income_avg$correlation[age_income_avg$Age_Group == "Age > 50"], 2)))
[1] "Correlation between Age > 50 and avg_INC: -0.7"
Interestingly, there is a very strong positive relationship between age and average income up to age 50 for British citizens, with a correlation of 0.96. For those above 50, however, the relationship is strongly negative, with a correlation of -0.7. Note that these coefficients are computed on the average income at each age, not on individual incomes.
4 Data Visualization:
These two opposite linear relationships can be visualized with a scatter plot and the best-fitting regression line for each group, using ggplot2, as illustrated in Fig 4.1.
Code
library(ggplot2)

# Average income per age, coloured by age group, with one linear fit per group
c_dr %>%
  group_by(Age) %>%
  summarize(avg_INC = mean(INC, na.rm = TRUE)) %>%
  mutate(Age_Group = ifelse(Age <= 50, "Up to Age 50", "Above Age 50")) %>%
  ggplot(aes(x = Age, y = avg_INC, color = Age_Group)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Age", y = "Average Income") +
  scale_color_manual(values = c("Up to Age 50" = "blue", "Above Age 50" = "green")) +
  theme_minimal(base_size = 14) +
  theme(
    axis.title = element_text(face = "bold"),
    axis.text = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(face = "bold")
  )
Fig 4.1. Scatter Plot of Average Income by Grouped Age with Separate Regression Lines.
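The regression lines drawn by geom_smooth(method = "lm") are ordinary least-squares fits, one per colour group. As a minimal sketch, the slope of each line can be obtained directly with lm(); the toy values below are hypothetical and chosen only to rise up to age 50 and fall afterwards.

```r
# Hypothetical age / average-income pairs; values are illustrative only
toy <- data.frame(
  Age     = c(20, 30, 40, 50, 60, 70, 80),
  avg_INC = c(10000, 15000, 20000, 25000, 20000, 15000, 10000)
)

young <- subset(toy, Age <= 50)
old   <- subset(toy, Age > 50)

# Slope of the least-squares line in each group (income change per year of age)
slope_young <- coef(lm(avg_INC ~ Age, data = young))[["Age"]]
slope_old   <- coef(lm(avg_INC ~ Age, data = old))[["Age"]]

c(up_to_50 = slope_young, above_50 = slope_old)
```

Unlike the correlation coefficient, the slope is expressed in income units per year of age, so it indicates how steep each trend is and not only how tightly the points follow a line.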
Similarly, average income was plotted against age for each ethnicity. As expected, a similar pattern was observed, as shown below.
Code
# Average income per age and ethnicity, excluding the "Other" category
ggplot(
  c_dr %>%
    filter(Eth != "Other") %>%
    group_by(Age, Eth) %>%
    summarize(avg_INC = mean(INC, na.rm = TRUE), .groups = 'drop'),
  aes(x = Age, y = avg_INC, color = Eth)
) +
  geom_point(size = 3) +
  labs(x = "Age", y = "Average Income", color = "Ethnicity") +
  theme_minimal() +
  theme(
    axis.title = element_text(face = "bold"),
    axis.text = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Fig 4.2. Scatter Plot showing Average Income by Age and Ethnicity.
Again, in Fig 4.3, cross-checking the relationship between age and average income by gender (Female) shows a similar pattern. This suggests that, regardless of gender or ethnicity, the income of British residents increases markedly up to the age of 50 and decreases markedly thereafter, as they reach old age.
Code
# Average income per age and gender, with one linear fit per age-group/gender combination
ggplot(
  c_dr %>%
    group_by(Age, Female) %>%
    summarize(avg_INC = mean(INC, na.rm = TRUE), .groups = 'drop') %>%
    mutate(
      Age_Group = ifelse(Age <= 50, "Up to Age 50", "Above Age 50"),
      Female_Label = ifelse(Female == 0, "Female", "Male")
    ),
  aes(x = Age, y = avg_INC, color = interaction(Age_Group, Female_Label))
) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_color_manual(values = c(
    "Up to Age 50.Female" = "blue", "Above Age 50.Female" = "green",
    "Up to Age 50.Male" = "red", "Above Age 50.Male" = "orange"
  )) +
  labs(x = "Age", y = "Average Income", color = "Group") +
  theme_minimal(base_size = 14) +
  theme(
    axis.title = element_text(face = "bold"),
    axis.text = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(face = "bold")
  )
Fig 4.3. Scatter Plot of Average Income by Age, Gender and Age Group.
On the other hand, when the relationship between the transformed marital status (coded by grade) and average income is explored, a positive linear trend is visible for all ethnic groups, as illustrated in Fig 4.4.
Code
library(ggplot2)

# Average income per ethnicity and (numeric) marital status, with one linear fit per ethnicity
c_dr %>%
  group_by(Eth, Mar_Stat) %>%
  summarise(avg_INC = mean(INC, na.rm = TRUE), .groups = 'drop') %>%
  ggplot(aes(x = Mar_Stat, y = avg_INC, color = Eth)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Marital Status", y = "Average Income (INC)", color = "Ethnicity") +
  scale_x_continuous(
    breaks = 0:4,
    labels = c("Never married", "Widowed", "Divorced", "Separated", "Married")
  ) +
  theme_minimal() +
  theme(
    axis.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Fig 4.4. Average Income by Marital Status and Ethnicity.
This provides an interesting insight: in the UK, individuals in stable relationships or marriage tend to have higher incomes. This suggests that fostering strong, supportive relationships could be beneficial for financial success.
Lastly, regarding income disparities, there is a significant income gap, with the White population earning substantially more than other racial groups, followed by Black, Asian, and Hispanic individuals. Across all groups, women earn more on average than men.
Code
# Total income per ethnicity, stacked by gender
ggplot(
  c_dr %>%
    group_by(Eth, Female) %>%
    summarize(total_INC = sum(INC, na.rm = TRUE), .groups = 'drop'),
  aes(x = Eth, y = total_INC, fill = as.factor(Female))
) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    x = "Ethnicity (Eth)", y = "Total Income (INC)",
    fill = "Female (0 = Female, 1 = Male)"
  ) +
  scale_fill_manual(values = c("0" = "blue", "1" = "red"), labels = c("0" = "Female", "1" = "Male")) +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold"))
Fig 4.5. Total income by ethnicity and gender.
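The code for Fig 4.5 sums INC within each group, so the height of each bar depends on how many people are in the group as well as on how much they earn. A minimal sketch of how totals and means could be compared side by side, using hypothetical groups "A" and "B" and made-up values:

```r
library(dplyr)

# Hypothetical toy data: group A is larger but has lower individual incomes
toy <- data.frame(
  Eth    = c("A", "A", "A", "B"),
  Female = c(0, 0, 0, 1),
  INC    = c(10000, 10000, 10000, 20000)
)

by_group <- toy %>%
  group_by(Eth, Female) %>%
  summarize(
    total_INC = sum(INC, na.rm = TRUE),   # what the stacked bars show
    mean_INC  = mean(INC, na.rm = TRUE),  # average income per person
    n         = n(),                      # group size
    .groups   = "drop"
  )

by_group
```

In this toy example group A has the larger total but the smaller mean, which is why statements about average earnings are safer when based on mean_INC together with the group sizes.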
5 Limitations and Recommendations:
This analysis shows clear patterns between age, marital status, education and income among British citizens, but it has limitations. The data lacks details on regional, industry and socio-economic factors that could affect income differences (Howe et al. 2012). Furthermore, the simplified categories for ethnicity and marital status may overlook complex social influences on income. Future research would benefit from including more socio-economic factors and regional detail. Policies supporting the education of elderly people and relationship stability could help improve financial well-being across demographics.
6 Conclusion:
Up to the age of 50, income shows a strong positive link with age, but after 50, income tends to fall. This suggests that elderly people in the UK are at risk of low income. Marriage and stable relationships appear to support financial success, with married individuals generally earning more. There is also a clear income gap, with White individuals earning more than other ethnic groups, although women tend to earn more than men across all groups. These findings point to areas where future government policies could focus, such as supporting elderly education and creating social programmes to promote financial stability and equality across age, gender and ethnicity.
Howe, L. D., B. Galobardes, A. Matijasevich, D. Gordon, D. Johnston, O. Onwujekwe, R. Patel, E. A. Webb, D. A. Lawlor, and J. R. Hargreaves. 2012. “Measuring Socio-Economic Position for Epidemiological Studies in Low- and Middle-Income Countries: A Methods of Measurement in Epidemiology Paper.” International Journal of Epidemiology 41 (3): 871–86. https://doi.org/10.1093/ije/dys037.
Kandel, Sean, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2012. “Enterprise Data Analysis and Visualization: An Interview Study.” IEEE Transactions on Visualization and Computer Graphics 18 (12): 2917–26. https://doi.org/10.1109/tvcg.2012.219.