This assignment aims to visualise the divorce trends in Singapore across the years. The data is obtained from Singstat (Department of Statistics Singapore). The scope of this research is regarding divorces under Woman’s Charter. Therefore, the visulisation would not reflect on Muslim divorces.
There are a few challenges faced in visualising the divorce dataset. These challenges could potentially hinder the approach to visualising the divorce trends in Singapore. The following table displays the variables in the dataset of divorces under the Woman’s Charter from 1980 to 2019.
| Variables | Description | Year |
|---|---|---|
| Divorce Rates | Includes Total Divorces and Crude Divorce Rate and etc | 1980 - 2019 |
| Gender | Male, Female | - |
| Race | Chinese, Indian, Others | - |
| Age | < 25, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, > 60 | - |
| Median Age | Provides median age for both genders - Male and Female | 1980 - 2019 |
| Duration of Marriage | Total of 5 categories - Under 5 Years, 5-9 Years, 10-14 Years, 15-19 Years, 20-24 Years, 25-29 Years, 30 Years & Over | 1980 - 2019 |
| Previous Martial Status | Single, Divorced, Widowed | 1980 - 2019 |
| Age at Marriage | < 20, 20-24, 25-29, 35-39, > 40 | 1980-2019 |
| Number of Children | Number of Children under 18 years old (0 > x > 4) and Number of Children over 21 years old (0 > x > 4) | 1980 - 2019 |
| Main Reason for Divorce | Adultery, Desertion, Estranged, Unreasonable Behavior | 2004 - 2019 |
Given the large and elaborative dataset, there is a need to reduce the number of variables needed to visualise the divorce trends in Singapore. The chosen variables need to be substantial enough to provide informative insights about divorce trends in Singapore. Hence, the following table is the reasoning on the choice of the variable(s) and a combination of variables to deliver an informative story on divorce in Singapore.
| Variable(s) | Explanation |
|---|---|
| Total Number of Divorces | Displays the overall divorces in Singapore over the years and used it for evaluation for other visualisations |
| Crude Divorce Rate | Displays the overall divorces in Singapore over the years and used it for evaluation for other visualisations |
| Duration of Marriage | Provides a perspective on the average marriage duration for divorces in Singapore |
| Gender, Age group, Race | Shows the proportion of divorces in Singapore based on these 3 variables and discover potential patterns within them |
| Duration of Marriage and Main Reason for Divorce | Discover the common issue of divorces in Singapore – will there be a pattern in reason and marriage duration. |
Since the variables include Main Reason for Divorce¸ we could only focus on data after 2004 due to the absence of data for Main Reason for Divorce.
If a comparison is needed, a gap of 15 years (2004 to 2019) will be too huge to make any analysis. Hence, a gap of 5 years (2014 to 2019) would be ideal for comparison.
For data challenges, the dataset itself is a challenge to convert into readable data. The dataset consists of many worksheets in an excel file. Each worksheet represents one or a combination of two or three variables on divorces. Hence, there is a need to perform manual data transformation to modify the data to be more readable for R to process.
With data transformation, the dataset now consists of 11 rows and 112 columns.
However, the transformed data still requires data wrangling for data visualisation that will be explained in the following sections. Each data visualisation uses a different data wrangling, so there would not be a standardised data wrangling for the dataset.
Proposed Design Sketch
tidyverse contains a set of essential packages for data manipulation and exploration.ggalt that provides ‘geom_dumbbell’ for ‘ggplot2’ graphs.heatmaply to create correlation heatmaps and normal heatmaps.plotly to create interactive web graphics from ‘ggplot2’ graphs.packages = c('tidyverse','ggalt','heatmaply','plotly')
for (p in packages){
if (!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
if (!require(devtools)) {
install.packages('devtools')
}
divorce_df <- read_csv('divorce-data.csv')
To create the bar and line chart, the variables used are Year, Total Divorces and Crude Divorce Rate. So, data wrangling have to be performed on the dataset to filter the data needed to visualise the bar and line chart.
x <- divorce_df[6:11,1:3]
maxcdr = max(x$cdr)
ggplot(x) +
geom_bar(aes(x=as.character(year), y=divorces),stat="identity", fill="#2a5675",colour="#2a5675",width=0.5)
ggplot(x) +
geom_point(aes(x=as.character(year), y=cdr,group=1),stat="identity", colour="#ff8c00",size=2)+
geom_line(aes(x=as.character(year), y=cdr,group=1),stat="identity", colour="#ff8c00",size=0.8)
For both bar and line to be in the same chart, there is a need to include a dual axis object into the chart. This is due to difference in scaling for Total Divorces and Crude Divorce Rate. Hence, on top of the bar and line codes, there is a scale_y_continous(sec.axis) to add a secondary axis into the chart.
Unfortunately, dual axis does not work on plotly therefore data labels are included in the visualisation to provide usability and readability for the reader.
percent <- function(x, digits = 1, format = "f", ...) {
paste0(formatC(x, format = format, digits = digits, ...), "%")
}
ggplot(x) +
geom_bar(aes(x=as.character(year), y=divorces),stat="identity", fill="#2a5675",colour="#2a5675",width=0.5)+
geom_text(aes(x=as.character(year), y=divorces,label=divorces),stat="identity", vjust = 1.5, colour = "white",size=3.5) +
geom_point(aes(x=as.character(year), y=cdr*3000,group=1),stat="identity", colour="#ff8c00",size=2)+
geom_text(aes(x=as.character(year), y=cdr*3000,group=1,label=percent(cdr)),stat="identity", vjust = 2, colour = "#ff8c00",size=3) +
geom_line(aes(x=as.character(year), y=cdr*3000,group=1),stat="identity", colour="#ff8c00",size=0.8) +
scale_y_continuous(sec.axis = sec_axis(~./3000,name="Crude Divorce Rate (per thousand)\n",breaks=seq(0, 3, 0.5))) +
theme_gray(base_family='Tahoma') + theme(
axis.title.x = element_text(size=9),
axis.title.y = element_text(size=9),
plot.title=element_text(size = 16,face="bold"),
plot.title.position = "plot",
plot.subtitle=element_text(face="italic", size=10, margin=margin(b=20)),
plot.caption=element_text(size=8, margin=margin(t=15), color="#7a7d7e")) +
labs(x="\nYear", y="Number of Divorces\n", title="Divorces and Crude Divorce Rate, 2014 to 2019",
subtitle="Crude Divorce Rate remained constant throughout the year except 2018 that has a slight decrease.\nOverall the number of divorces shows an increasing trend with a huge jump from 2018 and 2019.",
caption="Source: Singstat Singapore (Department of Statistics)")
To create the dumbbell chart, the variables used are Duration of Marriage and Year. Due to the nature of the dataset, data wrangling have to be performed on the dataset.
pivot_longer is used to make the dataset longer by increasing the number of rows and decreasing the number of columns. In this case, we are pivoting the duration based on year variable.
pivot_wider is used to make the dataset wider by increasing the number of columns and decreasing the number of rows. In this case, we are combining rows based on number of years to get the number of divorces for each year by marriage duration as shown below.
For the second part, we are creating calculated column based on dura_2014 and dura_2019 that calculates the percentage of total divorces by Marriage duration. This allows us to display percentage values on our visualisation.
library(ggalt)
duration <- divorce_df[c(6,11),c(1,6:12)]
duration <- duration %>%
pivot_longer(!year,names_to = 'Duration',values_to = 'num_years') %>%
pivot_wider(names_from = 'year',values_from = 'num_years') %>%
rename(dura_2014 = '2014', dura_2019 = '2019')
duration <- duration %>%
mutate(pct_2014 = round(100* dura_2014/sum(duration$dura_2014),1)) %>%
mutate(pct_2019 = round(100*dura_2019/sum(duration$dura_2019),1))
duration
duration$Duration <- factor(duration$Duration, levels=as.character(duration$Duration))
Dumbbell visualisation uses geom_segment to provide a segment for geom_dumbbell to display the points on the graph. For dumbbel visulisation, the ggplot requires x and xend - which is the two variables that are being compared to each other. Hence, x = 2014 and x.end=2019.
library(ggplot2)
ggplot(duration, aes(x=pct_2014, xend=pct_2019, y=Duration)) +
geom_segment(aes(x=pct_2014, xend=pct_2019, y=Duration, yend=Duration),
color="#dbdbdb", size=1.5,)+
geom_dumbbell(color="light blue", size_x=2.5, size_xend = 2.5, colour_x="#edae52",
colour_xend = "#9fb059")
Unfortunately plotly does not work on dumbbell charts too. Therefore, a further enhancement is needed. By making use of geom_text, we are able to display data labels on the visualisation to provide a better readability for the reader.
library(ggplot2)
percent <- function(x, digits = 1, format = "f", ...) {
paste0(formatC(x, format = format, digits = digits, ...), "%")
}
ggplot(duration, aes(x=pct_2014, xend=pct_2019, y=Duration)) +
geom_segment(aes(x=pct_2014, xend=pct_2019, y=Duration, yend=Duration),
color="#dbdbdb", size=1.5,)+
geom_dumbbell(color="light blue", size_x=2.5, size_xend = 2.5, colour_x="#edae52",
colour_xend = "#9fb059", show.legend = TRUE)+
geom_text(color="#a67d0c", size=2.5, vjust=1.8,family="sans",
aes(x=pct_2014, label=percent(pct_2014)))+
geom_text(aes(x=pct_2019, label=percent(pct_2019)),
color="#235e33", size=2.5, vjust=-1.5,family="sans") +
geom_text(color="#235e33", size=3, hjust=-0.5,data=filter(duration, Duration=="Under 5 Years"), aes(x=pct_2019, label="2019"),fontface="bold")+
geom_text(color="#d9a002", size=3, hjust=1.5,data=filter(duration, Duration=="Under 5 Years"), aes(x=pct_2014, label="2014"),fontface="bold")+
scale_y_discrete(limits=rev) +
theme_gray(base_family='Tahoma') + theme(
axis.text.x=element_text(size=8),
plot.title=element_text(size = 16, face="bold"),
plot.title.position = "plot",
plot.subtitle=element_text(face="italic", size=10, margin=margin(b=12)),
plot.caption=element_text(size=8, margin=margin(t=12), color="#7a7d7e")) +
labs(x="Percentage of Divorces", y="Marriage Duration", title="Divorces by Marriage Duration in 2014 and 2019",
subtitle="Most marriages end between 5-9 years in 2014 and 2019.\nHowever, there is a slight increase in divorces for marriages of more than 30 years",
caption="Source: Singstat Singapore (Department of Statistics)")
To create the dumbbell chart, the variables used are Race, Gender, Age group, Year. Due to the nature of the dataset, data wrangling have to be performed on the dataset.
In this case, we are pivoting the columns that are related to the variable using on a pattern that was created during manual data transformation.
pivot_longer have parameters that helps to split the columns into rows based on a delimiter. names_sep is the delimiter and names_to is the name of the columns to create from the data stored in the column names of data. An example from the dataset is that the column name is Chinese_Female_<25 and the this will create 3 additional columns to the dataset - Race, Gender and Age Group.
pyra <- divorce_df[11,c(1,13:84)]
pyra <- pyra %>%
pivot_longer(!year, names_to = c("race","gender","agegrp"),names_sep = "_", values_to = "value")
Using the original values to visualise the Pyramid Chart using geom_bar.
pyra <- divorce_df[11,c(1,13:84)]
pyra <- pyra %>%
pivot_longer(!year, names_to = c("race","gender","agegrp"),names_sep = "_", values_to = "value")
# pyra$agegrp <- factor(pyra$agegrp, levels=as.character(pyra$agegrp))
# pyra
pyra$value <- ifelse(pyra$gender == "Male", -1*pyra$value, pyra$value)
ggplot(pyra, aes(x = agegrp, y= value, fill = gender)) +
geom_bar(data = subset(pyra, gender == "Female "), stat = "identity") +
geom_bar(data = subset(pyra, gender == "Male"), stat = "identity")+
#scale_y_continuous(breaks = seq(-2000, 2000, 250)) +
coord_flip()+
facet_wrap(~race) +
theme_minimal()
However, the chart does not provide any insightful information using the total divorces values. Indian and Others chart is significantly smaller to the natural population size in Singapore with a higher Chinese proportion.
Therefore, this version creates a percentage value that calculates the percentage of divorces based on the race rather than the overall that could diminish the information for other races.
With the usage of group_by and mutate, we are able to visualise a pyramid chart using percentage values. plotly is further used to create interactive charts for reader to better understand the proportion of the age group among divorces.
library(ggplot2)
pyra <- divorce_df[11,c(1,13:84)] %>%
pivot_longer(!year, names_to = c("race","gender","agegrp"),names_sep = "_", values_to = "value")
pyra$agegrp <- factor(pyra$agegrp, levels = c("<25", "25-29","30-34","35-39","40-44","45-49","50-54","55-59","> 60"))
pyramid <- group_by(pyra, race) %>%
mutate(pct = 100 * value / sum(value)) %>%
ungroup() %>%
mutate(text = paste0("Age group: ", agegrp, "\n", "Divorces: ",percent(pct))) %>%
ggplot(aes(text=text)) +
geom_col(aes(x = agegrp, y = ifelse(gender == "Male", -pct, pct), fill = gender)) +
coord_flip() +
facet_wrap(~ race) +
scale_y_continuous(labels = function(x) scales::percent(abs(x / 100)))+
theme_minimal(base_family='Tahoma') + theme(
axis.title.x = element_text(size=10),
axis.title.y = element_text(size=10),
legend.title = element_text(size=10),
axis.text.x=element_text(size=10),
plot.title=element_text(size = 12,face="bold"),
plot.title.position = "plot")+
labs(title="Divorces Pyramid by Age group and Race", x="Age group", y="\n Percentage of Divorces by Race",fill="Gender")
py <- ggplotly(pyramid, tooltip = "text")
py[['x']][['layout']][['annotations']][[1]][['y']] <- -0.07
py[['x']][['layout']][['annotations']][[2]][['x']] <- -0.07
py
To create the dumbbell chart, the variables used are Marriage Duration, Main Reason for Divorce and Year. Due to the nature of the dataset, data wrangling have to be performed on the dataset.
Similar to Pyramid chart, we are pivoting the columns that are related to the variable using on a pattern that was created during manual data transformation.
duraxreason = divorce_df[11,c(1,85:112)]
duraxreason <- duraxreason %>%
pivot_longer(!year, names_to = c("duration","reason"),names_sep = "_", values_to = "count") %>%
select("duration","reason","count")
ggplot and geom_tile is used to display the heatmap of two categories.
library(ggplot2)
library(hrbrthemes)
library(heatmaply)
p <- ggplot(duraxreason, aes(reason, duration, fill= count)) +
geom_tile() +
scale_fill_viridis(option = 'cividis',discrete=FALSE)
ggplotly(p)
However, the marriage duration is not categorised according to the dataset therefore additional data wrangling is needed to categorise the y-axis. Furthermore, the tooltip of the visualisation needs to be edited to provide better insights as it is currently displaying the variable name instead.
Besides the design, using divorce count does not provide much insights because it merely shows the divorce counts for that marriage duration and reason. A better visualisation should change the count into the percentage of divorce by marriage duration. This will provide insights into showing the most common reason of divorce among the marriage duration category, rather than overall marriage duration category.
Using factor, the y-axis is ordered correctly with levels that allows to declare the order of the category. mutate is then used to edit the tooltip and rename it accordingly to the correct labels.
group_by is used to group the data using marriage duration and calculate the percentage of divorces by reason.
library(ggplot2)
library(hrbrthemes)
library(heatmaply)
duraxreason$duration <- factor(duraxreason$duration, levels = c("Under 5 Years", "5-9 Years","10-14 Years","15-19 Years","20-24 Years","25-29 Years","30 Years & Over"))
duraxreason <- group_by(duraxreason, duration) %>%
mutate(pct = 100* (count / sum(count))) %>%
ungroup()
duraxreason <- duraxreason %>%
mutate(text = paste0("Marriage Duration: ", duration, "\n", "Reason: ", reason, "\n",percent(pct)))
p <- ggplot(duraxreason, aes(reason, duration, fill= pct,text=text)) +
geom_tile() +
scale_fill_viridis(option = 'cividis',discrete=FALSE)+ scale_x_discrete(limits=rev)+
theme_gray(base_family='Tahoma') + theme(
axis.title.x = element_text(size=10),
axis.title.y = element_text(size=10),
legend.title = element_text(size=8),
axis.text.x=element_text(size=10),
plot.title=element_text(size = 11, face="bold"),
plot.title.position = "plot") +
labs(x="Main Reason for Divorce\n", y="Marriage Duration\n", fill="% of Divorces \nby Marriage Duration\n", title="Relationship between Main Reason for Divorce and Marriage Duration")
ggplotly(p, tooltip="text")
Based on the 4 visualisation generated, the main insights gathered are as follows:
The number of divorces shows an increasing trend over the years, with an increase in 2018 and 2019. The crude Divorce Rate observed a decrease of 1% in 2018 but increased in the following year. This shows that the divorce rates in Singapore are constant with an average of 1.9% per 1000 people from 2014 to 2019.
Most divorcees end their marriage between 5 to 9 years in both 2014 and 2019. There was a decrease of 0.5% in 2019 but a marriage duration of 5 to 9 years still remains as the highest number of divorces with a value of 29%. There is no significant increase or decrease in other categories except marriage duration of 15-19 years and 30 years and over. A decrease in divorce in 15-19 years and an increase in divorce for the marriage of 30 years and over.
The median age of divorce for both genders is 35 – 39 regardless of race. There are more males divorcing after the age of 35, while a higher percentage of female divorce around 20-35. “Other” race has a higher percentage of female divorcing as compared to other races that have an almost identical proportion. This could be due to the proportion of “Other” races in Singapore and that most of them marrying someone of a different race from them. There seems to be an increase in divorce in Chinese Males aged above 60 and this could be the reason why there was an increase in marriages above 30 years in 2019.
Unreasonable behaviour is the main reason for divorce in marriages below 20 years. Marriages above 20 years observed an increase in estranged as the main reason for divorce with a value of 52.5% for 20-24 years, 58.4% for 25-29 years and 50.3% for 30 years and above. By analysing the visualisation, couples should know more about their partner before deciding to get married, if not there is a 79.7% chance that the marriage will end in less than 5 years.