This dataset is retrieved from the Ministry of Health and shows the top 10 Causes of death in Singapore over the period of 14 years, according to the percentage of total death contributed by each cause. This publication aims to see how the top principal causes of death has changed overtime (i.e. any disease becoming more common and increasing in prevalence). The dataset can be retrieved here at Principal Causes of Death, where data for 2006-2011 is recorded in the first sheet and 2012-2019 in the second sheet.
The data is not structured properly for analysis i.e. has empty columns and some principal causes of death are split into two rows (for example, Cerebrovascular Disease and (including stroke) are recorded in separate rows).
We will skip the first 7 rows of the sheet as they contain headings which are not necessary in our analysis. Further, we will not assign the topmost rows to be column names.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
setwd("C:/Users/X Lin/Desktop/R")
library(readxl)
df06 <- read_xlsx("PCoD.xlsx", skip = 7, col_names = FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
head(df06, n = 15)
## # A tibble: 15 x 9
## ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
## <dbl> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Cancer NA 28.5 27.7 29.3 29.3 28.5 30
## 2 NA [ ICD9 : 140-208 ] NA NA NA NA NA NA NA
## 3 NA <NA> NA NA NA NA NA NA NA
## 4 2 Ischaemic Heart Disease NA 18.5 19.8 20.1 19.2 18.7 16.4
## 5 NA [ ICD9 : 410-414 ] NA NA NA NA NA NA NA
## 6 NA <NA> NA NA NA NA NA NA NA
## 7 3 Pneumonia NA 13.7 13.9 13.9 15.3 15.7 16
## 8 NA [ ICD9 : 480-486 ] NA NA NA NA NA NA NA
## 9 NA <NA> NA NA NA NA NA NA NA
## 10 4 Cerebrovascular Disease NA 8.9 8.7 8.3 8 8.4 9
## 11 NA (including stroke) NA NA NA NA NA NA NA
## 12 NA [ ICD9 : 430-438 ] NA NA NA NA NA NA NA
## 13 NA <NA> NA NA NA NA NA NA NA
## 14 5 Accidents, Poisoning & Viole~ NA 6.3 6 5.8 5.7 5.5 5.5
## 15 NA [ ICD9 : E800-E999 ] NA NA NA NA NA NA NA
Remove columns 1 & 3, assign respective headings and rename Causes of death accordingly.
df06 <- df06[,-c(1,3)] #remove empty columns
colnames(df06) <- c("Cause", "2006", "2007","2008","2009","2010","2011") #assign column names
df06$Cause
## [1] "Cancer" "[ ICD9 : 140-208 ]"
## [3] NA "Ischaemic Heart Disease"
## [5] "[ ICD9 : 410-414 ]" NA
## [7] "Pneumonia" "[ ICD9 : 480-486 ]"
## [9] NA "Cerebrovascular Disease"
## [11] "(including stroke)" "[ ICD9 : 430-438 ]"
## [13] NA "Accidents, Poisoning & Violence"
## [15] "[ ICD9 : E800-E999 ]" NA
## [17] "Other Heart Diseases" "[ ICD9 : 393-398, 402, 415-429 ]"
## [19] NA "Urinary Tract Infection"
## [21] "[ ICD9 : 599.0 ]" NA
## [23] "Chronic Obstructive Lung" "Disease"
## [25] "[ ICD9 : 490-493, 496 ]" NA
## [27] "Nephritis, Nephrotic" "Syndrome & Nephrosis"
## [29] "[ ICD9 : 580-589 ]" NA
## [31] "Diabetes Mellitus" "[ ICD9 : 250 ]"
#Cleaning up column 'Cause'
df06[10,1] <- paste(df06[10,1], df06[11,1])
df06[23,1] <- paste(df06[23,1], df06[24,1])
df06[28,1] <- paste(df06[27,1], df06[28,1])
df06$Cause
## [1] "Cancer"
## [2] "[ ICD9 : 140-208 ]"
## [3] NA
## [4] "Ischaemic Heart Disease"
## [5] "[ ICD9 : 410-414 ]"
## [6] NA
## [7] "Pneumonia"
## [8] "[ ICD9 : 480-486 ]"
## [9] NA
## [10] "Cerebrovascular Disease (including stroke)"
## [11] "(including stroke)"
## [12] "[ ICD9 : 430-438 ]"
## [13] NA
## [14] "Accidents, Poisoning & Violence"
## [15] "[ ICD9 : E800-E999 ]"
## [16] NA
## [17] "Other Heart Diseases"
## [18] "[ ICD9 : 393-398, 402, 415-429 ]"
## [19] NA
## [20] "Urinary Tract Infection"
## [21] "[ ICD9 : 599.0 ]"
## [22] NA
## [23] "Chronic Obstructive Lung Disease"
## [24] "Disease"
## [25] "[ ICD9 : 490-493, 496 ]"
## [26] NA
## [27] "Nephritis, Nephrotic"
## [28] "Nephritis, Nephrotic Syndrome & Nephrosis"
## [29] "[ ICD9 : 580-589 ]"
## [30] NA
## [31] "Diabetes Mellitus"
## [32] "[ ICD9 : 250 ]"
#extracting only rows of interest
df06 <- df06[c(1,4,7,10,14,17,20,23,28,31),]
df06
## # A tibble: 10 x 7
## Cause `2006` `2007` `2008` `2009` `2010` `2011`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cancer 28.5 27.7 29.3 29.3 28.5 30
## 2 Ischaemic Heart Disease 18.5 19.8 20.1 19.2 18.7 16.4
## 3 Pneumonia 13.7 13.9 13.9 15.3 15.7 16
## 4 Cerebrovascular Disease (including~ 8.9 8.7 8.3 8 8.4 9
## 5 Accidents, Poisoning & Violence 6.3 6 5.8 5.7 5.5 5.5
## 6 Other Heart Diseases 4.3 4.3 4 4.4 4.8 5
## 7 Urinary Tract Infection 2 2.2 2.1 2.5 2.5 2.5
## 8 Chronic Obstructive Lung Disease 3.3 2.6 2.5 2.4 2.5 2.2
## 9 Nephritis, Nephrotic Syndrome & Ne~ 1.7 2 2.1 2.3 2.2 2
## 10 Diabetes Mellitus 3.3 3.6 2.7 1.7 1 1.7
Here, we do the same cleaning as we did for the first sheet above before binding the two tidied dataframes.
df12 <- read_xlsx("PCoD.xlsx", sheet = 2, skip = 7, col_names = FALSE)
df12_tidy <- df12[,-c(1,3)]
colnames(df12_tidy) <- c("Cause","2012", "2013","2014","2015","2016","2017","2018","2019")
df12_tidy[10,1] <- paste(df12_tidy[10,1], df12_tidy[11,1])
df12_tidy[18,1] <- paste(df12_tidy[17,1], df12_tidy[18,1])
df12_tidy[21,1] <- paste(df12_tidy[21,1], df12_tidy[22,1])
#Convert all causes to lower case
library(stringr)
df06$Cause<- str_to_lower(df06$Cause)
df12_tidy$Cause <- str_to_lower(df12_tidy$Cause)
df12_tidy <- df12_tidy[c(1,4,7,10,14,18,21,25,28,31),]
df12_tidy
## # A tibble: 10 x 9
## Cause `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cancer 30.1 30.5 29.4 29.7 29.6 29.1 28.8 28.4
## 2 pneumonia 16.8 18.5 19 19.4 19.3 20.1 20.6 20.7
## 3 ischaemic heart dise~ 16.1 15.5 16 16.7 17 18.5 18.1 18.8
## 4 cerebrovascular dise~ 9.3 8.9 8.4 6.8 6.6 6.3 6 5.8
## 5 external causes of m~ 5.6 4.9 4.7 4.5 4.4 4 4.3 4
## 6 nephritis, nephrotic~ 2.4 2.4 2 2.3 1.9 2.4 2.97 3.1
## 7 hypertensive disease~ 2.8 3.1 3.6 3.9 4 3.4 2.97 NA
## 8 urinary tract infect~ 2.4 2.6 2.6 2.2 2.3 1.9 2.04 2.3
## 9 other heart diseases 1.9 2 1.9 2.2 1.9 1.9 2.09 2
## 10 chronic obstructive ~ 2.1 1.6 1.8 1.8 1.6 1.5 1.34 1.4
#missing value in Hypertensive diseases, 2019
df12_tidy[7,9] <- df12[22,11]
df12_tidy
## # A tibble: 10 x 9
## Cause `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cancer 30.1 30.5 29.4 29.7 29.6 29.1 28.8 28.4
## 2 pneumonia 16.8 18.5 19 19.4 19.3 20.1 20.6 20.7
## 3 ischaemic heart dise~ 16.1 15.5 16 16.7 17 18.5 18.1 18.8
## 4 cerebrovascular dise~ 9.3 8.9 8.4 6.8 6.6 6.3 6 5.8
## 5 external causes of m~ 5.6 4.9 4.7 4.5 4.4 4 4.3 4
## 6 nephritis, nephrotic~ 2.4 2.4 2 2.3 1.9 2.4 2.97 3.1
## 7 hypertensive disease~ 2.8 3.1 3.6 3.9 4 3.4 2.97 2.6
## 8 urinary tract infect~ 2.4 2.6 2.6 2.2 2.3 1.9 2.04 2.3
## 9 other heart diseases 1.9 2 1.9 2.2 1.9 1.9 2.09 2
## 10 chronic obstructive ~ 2.1 1.6 1.8 1.8 1.6 1.5 1.34 1.4
Cancer is the leading cause of death in Singapore across all years, contributing to approximately one third of total deaths. The incidence of ischaemic heart diseas has fallen whilst pneumonia has contributed to greater deaths in 2011 as compared to 2006.Percentage of total deaths caused by Cerebrovascular diseases, including stroke, was rather similar in both 2006 and 2011. The percentage of deaths caused by the remaining causes remain fairly negligible, each contributing to less than 7% of total deaths in Singapore in both 2006 and 2011.
library(dplyr)
library(tidyr)
library(ggplot2)
df06_cleaned <- df06 %>% gather(key = 'Year', value = 'Percentage_death', -Cause)
df06_cleaned
## # A tibble: 60 x 3
## Cause Year Percentage_death
## <chr> <chr> <dbl>
## 1 cancer 2006 28.5
## 2 ischaemic heart disease 2006 18.5
## 3 pneumonia 2006 13.7
## 4 cerebrovascular disease (including stroke) 2006 8.9
## 5 accidents, poisoning & violence 2006 6.3
## 6 other heart diseases 2006 4.3
## 7 urinary tract infection 2006 2
## 8 chronic obstructive lung disease 2006 3.3
## 9 nephritis, nephrotic syndrome & nephrosis 2006 1.7
## 10 diabetes mellitus 2006 3.3
## # ... with 50 more rows
df06_slope <- df06_cleaned %>% filter(Year == 2006 | Year == 2011)
df06_slope
## # A tibble: 20 x 3
## Cause Year Percentage_death
## <chr> <chr> <dbl>
## 1 cancer 2006 28.5
## 2 ischaemic heart disease 2006 18.5
## 3 pneumonia 2006 13.7
## 4 cerebrovascular disease (including stroke) 2006 8.9
## 5 accidents, poisoning & violence 2006 6.3
## 6 other heart diseases 2006 4.3
## 7 urinary tract infection 2006 2
## 8 chronic obstructive lung disease 2006 3.3
## 9 nephritis, nephrotic syndrome & nephrosis 2006 1.7
## 10 diabetes mellitus 2006 3.3
## 11 cancer 2011 30
## 12 ischaemic heart disease 2011 16.4
## 13 pneumonia 2011 16
## 14 cerebrovascular disease (including stroke) 2011 9
## 15 accidents, poisoning & violence 2011 5.5
## 16 other heart diseases 2011 5
## 17 urinary tract infection 2011 2.5
## 18 chronic obstructive lung disease 2011 2.2
## 19 nephritis, nephrotic syndrome & nephrosis 2011 2
## 20 diabetes mellitus 2011 1.7
plot1 <- ggplot(data = df06_slope, aes(x = Year, y = Percentage_death, group = Cause)) +
geom_line(aes(color = Cause, alpha = 1), size = 2) +
geom_point(aes(color = Cause, alpha = 1), size = 4) +
geom_text(data =df06_slope %>% filter(Year == 2011), aes(label = Cause), hjust = -.05,
fontface = "bold",
size = 4) +
theme(legend.position = "none") +
scale_y_continuous(breaks = seq(0, max(df06_slope$Percentage_death), by = 5)) +
labs(
title = "Percentage of total deaths in Singapore, 2006-2011",
subtitle = "(Principal causes of deaths)",
caption = "Data source: Ministry of Health"
) + ylab("Percentage of total deaths")
plot1
Cancer remains the leading cause of death in Singapore even beyond 2011, although the percentage of deaths caused by Cancer was lower in 2019 compared to 2012.Similar to the 2006-2011 trend, the percentage of deaths caused by pneumonia has risen. Ischaemic heart diseases contribute more to total deaths in 2019 than in 2012, a reverse of what we see between 2006 and 2011. Notably, cerebrovascular diseases contribute less to total deaths in 2019 than in 2012, an improvement from 2006-2011.The remaining causes of death remains negligible, constituting less than 5% of deaths in 2019.
df12_cleaned <- df12_tidy %>% gather(key = 'Year', value = 'Percentage_death', -Cause)
df12_cleaned
## # A tibble: 80 x 3
## Cause Year Percentage_death
## <chr> <chr> <dbl>
## 1 cancer 2012 30.1
## 2 pneumonia 2012 16.8
## 3 ischaemic heart diseases 2012 16.1
## 4 cerebrovascular diseases (including stroke) 2012 9.3
## 5 external causes of morbidity and mortality 2012 5.6
## 6 nephritis, nephrotic syndrome & nephrosis 2012 2.4
## 7 hypertensive diseases (including hypertensive heart d~ 2012 2.8
## 8 urinary tract infection 2012 2.4
## 9 other heart diseases 2012 1.9
## 10 chronic obstructive lung diseases 2012 2.1
## # ... with 70 more rows
df12_slope <- df12_cleaned %>% filter(Year == 2012 | Year == 2019)
df12_slope
## # A tibble: 20 x 3
## Cause Year Percentage_death
## <chr> <chr> <dbl>
## 1 cancer 2012 30.1
## 2 pneumonia 2012 16.8
## 3 ischaemic heart diseases 2012 16.1
## 4 cerebrovascular diseases (including stroke) 2012 9.3
## 5 external causes of morbidity and mortality 2012 5.6
## 6 nephritis, nephrotic syndrome & nephrosis 2012 2.4
## 7 hypertensive diseases (including hypertensive heart d~ 2012 2.8
## 8 urinary tract infection 2012 2.4
## 9 other heart diseases 2012 1.9
## 10 chronic obstructive lung diseases 2012 2.1
## 11 cancer 2019 28.4
## 12 pneumonia 2019 20.7
## 13 ischaemic heart diseases 2019 18.8
## 14 cerebrovascular diseases (including stroke) 2019 5.8
## 15 external causes of morbidity and mortality 2019 4
## 16 nephritis, nephrotic syndrome & nephrosis 2019 3.1
## 17 hypertensive diseases (including hypertensive heart d~ 2019 2.6
## 18 urinary tract infection 2019 2.3
## 19 other heart diseases 2019 2
## 20 chronic obstructive lung diseases 2019 1.4
plot2 <- ggplot(data = df12_slope, aes(x = Year, y = Percentage_death, group = Cause)) +
geom_line(aes(color = Cause, alpha = 1), size = 2) +
geom_point(aes(color = Cause, alpha = 1), size = 4) +
geom_text(data =df12_slope %>% filter(Year == 2019), aes(label = Cause), hjust = -.05,
fontface = "bold",
size = 4) +
theme(legend.position = "none") +
scale_y_continuous(breaks = seq(0, max(df12_slope$Percentage_death), by = 5)) +
labs(
title = "Percentage of total deaths in Singapore, 2012-2019",
subtitle = "(Principal causes of deaths)",
caption = "Data source: Ministry of Health"
) + ylab("Percentage of total deaths")
plot2
Compared to 2006-2011, we see that the following causes of deaths are no longer the top 10 principal causes of death between 2012 and 2019:
1. accidents, poisoning & violence
2. diabetes mellitus
Recent causes that have been added to the top 10 principal causes of deaths not seen between 2006-2011 are:
1. External causes of morbidity and mortality (accidents, poisoning and violence could have been recorded under this category)
2. Hypertensive diseases (including hypertensive heart disease)
Notably, Cancer, pneumonia, ischaemic heart disease, cerebrovascular disease are the leading causes of death in Singapore.
library(gridExtra)
grid.arrange(plot1, plot2, ncol = 2)
To see the top principal causes of death across 2006-2019, an inner join is performed to deduce which diseases are in the top 10 principal causes of death in both datsets (2006-2011 and 2012-2019).
#check spelling of Cause in both datasets - we want to make sure they are the same.
df06$Cause
## [1] "cancer"
## [2] "ischaemic heart disease"
## [3] "pneumonia"
## [4] "cerebrovascular disease (including stroke)"
## [5] "accidents, poisoning & violence"
## [6] "other heart diseases"
## [7] "urinary tract infection"
## [8] "chronic obstructive lung disease"
## [9] "nephritis, nephrotic syndrome & nephrosis"
## [10] "diabetes mellitus"
df12_tidy$Cause #in df12, disease is spelled with an 's' i.e. diseases, unlike df06 (spelled as disease). We will correct this to match df06.
## [1] "cancer"
## [2] "pneumonia"
## [3] "ischaemic heart diseases"
## [4] "cerebrovascular diseases (including stroke)"
## [5] "external causes of morbidity and mortality"
## [6] "nephritis, nephrotic syndrome & nephrosis"
## [7] "hypertensive diseases (including hypertensive heart disease)"
## [8] "urinary tract infection"
## [9] "other heart diseases"
## [10] "chronic obstructive lung diseases"
class(df12_tidy$Cause)
## [1] "character"
df12_tidy$Cause[ grep("cerebrovascular",df12_tidy$Cause, ignore.case = TRUE)] <- "cerebrovascular disease (including stroke)"
df12_tidy$Cause[ grep("ischaemic",df12_tidy$Cause, ignore.case = TRUE)] <- "ischaemic heart disease"
df12_tidy$Cause[ grep("lung",df12_tidy$Cause, ignore.case = TRUE)] <- "chronic obstructive lung disease"
#Perform inner join on 2006-2011 and 2012-2019 dataset
df_full <- inner_join(df06,df12_tidy, by = "Cause")
df_full
## # A tibble: 8 x 15
## Cause `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 canc~ 28.5 27.7 29.3 29.3 28.5 30 30.1 30.5 29.4 29.7
## 2 isch~ 18.5 19.8 20.1 19.2 18.7 16.4 16.1 15.5 16 16.7
## 3 pneu~ 13.7 13.9 13.9 15.3 15.7 16 16.8 18.5 19 19.4
## 4 cere~ 8.9 8.7 8.3 8 8.4 9 9.3 8.9 8.4 6.8
## 5 othe~ 4.3 4.3 4 4.4 4.8 5 1.9 2 1.9 2.2
## 6 urin~ 2 2.2 2.1 2.5 2.5 2.5 2.4 2.6 2.6 2.2
## 7 chro~ 3.3 2.6 2.5 2.4 2.5 2.2 2.1 1.6 1.8 1.8
## 8 neph~ 1.7 2 2.1 2.3 2.2 2 2.4 2.4 2 2.3
## # ... with 4 more variables: `2016` <dbl>, `2017` <dbl>, `2018` <dbl>,
## # `2019` <dbl>
df_full_tidy <- df_full %>% gather(key = 'Year', value = 'Percentage_deaths', -Cause)
df_full_tidy
## # A tibble: 112 x 3
## Cause Year Percentage_deaths
## <chr> <chr> <dbl>
## 1 cancer 2006 28.5
## 2 ischaemic heart disease 2006 18.5
## 3 pneumonia 2006 13.7
## 4 cerebrovascular disease (including stroke) 2006 8.9
## 5 other heart diseases 2006 4.3
## 6 urinary tract infection 2006 2
## 7 chronic obstructive lung disease 2006 3.3
## 8 nephritis, nephrotic syndrome & nephrosis 2006 1.7
## 9 cancer 2007 27.7
## 10 ischaemic heart disease 2007 19.8
## # ... with 102 more rows
A time series graph here shows the percentage of total deaths by each cause of death.Cancer, as we have already seen, is the leading cause of death in Singapore. However, there seems to have been a slight downward trend in the percentage of deaths caused by cancer in recent years. Pneumonia and ischaemic heart diseases, on the other hand, have been contributing to rising numbers of deaths.
library(RColorBrewer)
df_full_tidy %>% filter(Year ==2019) %>% arrange(desc(Percentage_deaths)) #we will focus on the top 5 principal causes of death in 2019 here
## # A tibble: 8 x 3
## Cause Year Percentage_deaths
## <chr> <chr> <dbl>
## 1 cancer 2019 28.4
## 2 pneumonia 2019 20.7
## 3 ischaemic heart disease 2019 18.8
## 4 cerebrovascular disease (including stroke) 2019 5.8
## 5 nephritis, nephrotic syndrome & nephrosis 2019 3.1
## 6 urinary tract infection 2019 2.3
## 7 other heart diseases 2019 2
## 8 chronic obstructive lung disease 2019 1.4
df_full_final <- df_full_tidy %>% filter(Cause != "urinary tract infection" & Cause != "other heart diseases" & Cause != "chronic obstructive lung disease")
unique(df_full_final$Cause) #check that only 5 causes are here
## [1] "cancer"
## [2] "ischaemic heart disease"
## [3] "pneumonia"
## [4] "cerebrovascular disease (including stroke)"
## [5] "nephritis, nephrotic syndrome & nephrosis"
df_full_final %>% ggplot(aes(x = Year, y = Percentage_deaths, group = Cause)) + geom_line(aes(color = Cause)) + geom_point(aes(color = Cause)) +
scale_color_brewer(palette = "Dark2") + ylab("Percentage of total deaths")
A linear model (blue dashed line) with 95% confidence interval (depicted by the grey areas) is fitted into the data points to show the overall trend.
df_full_final %>% ggplot(aes(x = Year, y = Percentage_deaths, group = Cause)) + geom_line(aes(color = Cause)) + geom_point(aes(color = Cause)) +
scale_color_brewer(palette = "Dark2") + ylab("Percentage of total deaths") + geom_smooth(method = "lm", linetype = 2, level = 0.95)