Often data is released into the world in the form of an excel spreadsheet, and often these spreadsheets come formatted in such a way which may be pleasing to the eye for some, however this can make the data difficult to read and analyse programmatically.
In this vignette I am going to walk you through the process of cleaning up one of these spreadsheets, the spreadsheet I will be using in this example is the “3303.0 Causes of Death, Australia, 2017” provided by the Australian Bureau of Statistics, this file can be downloaded from here.
We’ll start by visually inspecting the document to see how the data is structured by opening it up in Excel or LibreOffice if you are running Linux, go straight to the third tab this is where the data we are interested in is located.
Screenshot of tab three in 3303_1 underlying causes of death (australia).xls file.
From visually inspecting this document there are a number of features which stand out, and which present a problem when trying to perform some analysis programmatically.
If we want to do any analysis on this data in R we are going to need to clean this data, remove the blank columns, separate out mixed data and separate the summations from the actual counts.
So lets load this into R and see what it looks like, we will use the ‘read_xls’ function from the ‘readxl’ library and go straight to the third sheet skiping the first five rows.
df <- readxl::read_xls('../data/3303_1 underlying causes of death (australia).xls', sheet = 3, skip = 5)
head(df)
## # A tibble: 6 x 40
## `Cause of death… Males...2 Females...3 Persons...4 ...5 Males...6
## <chr> <dbl> <dbl> <dbl> <lgl> <dbl>
## 1 Total deaths 73548 70398 143946 NA 72320
## 2 <NA> NA NA NA NA NA
## 3 <NA> NA NA NA NA NA
## 4 Causes of death NA NA NA NA NA
## 5 CHAPTER I Certa… 1015 954 1969 NA 974
## 6 Intestinal infe… 23 45 68 NA 29
## # … with 34 more variables: Females...7 <dbl>, Persons...8 <dbl>,
## # ...9 <lgl>, Males...10 <dbl>, Females...11 <dbl>, Persons...12 <dbl>,
## # ...13 <lgl>, Males...14 <dbl>, Females...15 <dbl>, Persons...16 <dbl>,
## # ...17 <lgl>, Males...18 <dbl>, Females...19 <dbl>, Persons...20 <dbl>,
## # ...21 <lgl>, Males...22 <dbl>, Females...23 <dbl>, Persons...24 <dbl>,
## # ...25 <lgl>, Males...26 <dbl>, Females...27 <dbl>, Persons...28 <dbl>,
## # ...29 <lgl>, Males...30 <dbl>, Females...31 <dbl>, Persons...32 <dbl>,
## # ...33 <lgl>, Males...34 <dbl>, Females...35 <dbl>, Persons...36 <dbl>,
## # ...37 <lgl>, Males...38 <dbl>, Females...39 <dbl>, Persons...40 <dbl>
From this output we can see the empty columns, and the three repeating columns for Males, Females and Persons, from visually inspecting the spreadsheet earlier we know that each of these groups of three variables represents one year.
Now lets remove those empty rows, and rename the columns to something sensible, rather then manually writing out all the column names for the rename function we are going to loop through the years which this data set covers and define a ‘dictionary’ type structure which will hold the ‘templates’ of the names we are changing to and from.
# remove the NA columns
df <- df[, colSums(is.na(df)) < nrow(df)]
# rename the cause of death column
df <- df %>%
rename(
"cause_of_death" = "Cause of death and ICD-10 code"
)
# rename the yearly Males, Females and Persons columns
# This is how we define a dictionary structure in R, which will contain the name templates we are changing to and from
col_names <- c('males_', 'females_', 'persons_')
names(col_names) <- c("Males...", "Females...", "Persons...")
# The first of the males, females & persons columns is Males...2, therefore we start the suffix at 2
column_sufix <- 2
for (year in c(2008:2017)){
counter <- 0
for (col_name in names(col_names)){
# this will be something like 'Males...2'
col_name_to_change <- paste(col_name, counter + column_sufix, sep = '')
# this will be something like 'males_2008'
change_to <- paste(col_names[[col_name]], year, sep = '')
# when passing variables into the rename function you need to place a 'bang(!!)' on the left hand side of the
# variable and use the special substitute ':=' assignment operator.
df <- df %>% rename(!!change_to := col_name_to_change)
counter <- counter + 1
}
# we increment the suffix by 4 instead of 3 to account for the empty column we removed at the begining of this code block.
column_sufix <- column_sufix + 4
}
# Check the column names.. all looks good
names(df)
## [1] "cause_of_death" "males_2008" "females_2008" "persons_2008"
## [5] "males_2009" "females_2009" "persons_2009" "males_2010"
## [9] "females_2010" "persons_2010" "males_2011" "females_2011"
## [13] "persons_2011" "males_2012" "females_2012" "persons_2012"
## [17] "males_2013" "females_2013" "persons_2013" "males_2014"
## [21] "females_2014" "persons_2014" "males_2015" "females_2015"
## [25] "persons_2015" "males_2016" "females_2016" "persons_2016"
## [29] "males_2017" "females_2017" "persons_2017"
The newly named ‘cause_of_death’ column actually contains two pieces of information, the description of the cause of death and the ICD-10 code, we will separate this code into its own column using the ‘str_extract’ from the ‘stringr’ which is part of the tidyverse, documentation on this function and more can be found here.
# Extract the cause of death code from the cause_of_death column using a regular expression
code_pattern = "(([A-N][0-9][0-9]-[A-N][0-9][0-9]))|(([A-N][0-9][0-9]))"
df$cause_of_death_code <- str_extract(df$cause_of_death, code_pattern)
df$cause_of_death <- str_replace(df$cause_of_death, code_pattern, '')
df$cause_of_death <- str_replace(df$cause_of_death, "\\(\\)", '') # remove the leftovers
knitr::kable(head(select(df, cause_of_death, cause_of_death_code)))
| cause_of_death | cause_of_death_code |
|---|---|
| Total deaths | NA |
| NA | NA |
| NA | NA |
| Causes of death | NA |
| CHAPTER I Certain infectious and parasitic diseases | A00-B99 |
| Intestinal infectious diseases | A00-A09 |
Now that we have the code separated from the description we can start removing those empty rows which we can easily do by filtering where cause_of_death is not NA and not equal to ‘Causes of death’.
# filter out empty rows
df <- df %>% filter(!is.na(cause_of_death) & cause_of_death != "Causes of death")
knitr::kable(head(select(df, cause_of_death, cause_of_death_code, males_2008, females_2008, persons_2008)))
| cause_of_death | cause_of_death_code | males_2008 | females_2008 | persons_2008 |
|---|---|---|---|---|
| Total deaths | NA | 73548 | 70398 | 143946 |
| CHAPTER I Certain infectious and parasitic diseases | A00-B99 | 1015 | 954 | 1969 |
| Intestinal infectious diseases | A00-A09 | 23 | 45 | 68 |
| Cholera | A00 | 0 | 0 | 0 |
| Typhoid and paratyphoid fevers | A01 | 0 | 0 | 0 |
| Other salmonella infections | A02 | 4 | 2 | 6 |
The data is starting to take shape now but we still have work to do, the totals, categories and subcategories still need to be separated from the main data, luckley these rows are easily identified.
Let’s start separating these summary rows from the main data.
# 1. Total deaths, overall summation
totals_df <- df %>% filter(cause_of_death == "Total deaths")
# 2. The chapter rows
chapters_df <- df %>% filter(grepl("^CHAPTER", cause_of_death))
# 3. The sub groups rows
sub_groups_df <- df %>% filter(grepl("-", cause_of_death_code) & !grepl("^CHAPTER", cause_of_death))
# which leaves us with the main data set
items_df <- df %>% filter(!grepl("-", cause_of_death_code) & !is.na(cause_of_death_code))
knitr::kable(head(select(items_df, cause_of_death, cause_of_death_code, males_2008, females_2008, persons_2008)))
| cause_of_death | cause_of_death_code | males_2008 | females_2008 | persons_2008 |
|---|---|---|---|---|
| Cholera | A00 | 0 | 0 | 0 |
| Typhoid and paratyphoid fevers | A01 | 0 | 0 | 0 |
| Other salmonella infections | A02 | 4 | 2 | 6 |
| Shigellosis | A03 | 0 | 0 | 0 |
| Other bacterial intestinal infections | A04 | 9 | 21 | 30 |
| Other bacterial foodborne intoxications, not elsewhere classified | A05 | 1 | 1 | 2 |
Now that we have the summary rows separated out from the main data we can start thinking about how to transform the data structure, if we think about this from a tidy data perspective and looking at the males, females & persons columns, each of these columns contains 2 variables year and gender, plus the observations, this needs to be separated out so that we have a year column, gender column, and number of deaths, to do this we will use the ‘melt’ function from the ‘reshape2’ library.
n <- names(items_df)
# This is going to give us all the names of the male, female & persons columns
n <- n[!grepl("cause_of_death", n) & !grepl("cause_of_death_code", n) & !grepl("code", n)]
items_df <- data.frame(melt(data = items_df, id.vars = c("cause_of_death", "cause_of_death_code"), measure.vars = n))
# extract the year into its own column
items_df$year <- str_extract(items_df$variable, "([0-9]{4})")
items_df$year <- as.numeric(items_df$year)
# remove the year from the variable
items_df$variable <- str_replace(items_df$variable, "(_[0-9]{4})", '')
# rename the columns
items_df <- items_df %>% rename(
"gender" = "variable"
, "number_of_deaths" = "value"
)
knitr::kable(head(items_df))
| cause_of_death | cause_of_death_code | gender | number_of_deaths | year |
|---|---|---|---|---|
| Cholera | A00 | males | 0 | 2008 |
| Typhoid and paratyphoid fevers | A01 | males | 0 | 2008 |
| Other salmonella infections | A02 | males | 4 | 2008 |
| Shigellosis | A03 | males | 0 | 2008 |
| Other bacterial intestinal infections | A04 | males | 9 | 2008 |
| Other bacterial foodborne intoxications, not elsewhere classified | A05 | males | 1 | 2008 |
So now the data is clean and ‘tidy’, let’s have a look at which causes of death are trending upward and which are trending downward, we will start by filtering down to the ‘persons’ rows as this is males and females combined, then we will calculate the slope of the trend line for each cause of death using linear regression via the ‘lm’ function included in base R.
persons = items_df %>% filter(gender == "persons")
death_codes_with_slope <- persons %>%
group_by(cause_of_death_code) %>%
summarise(
slope = coef(lm(number_of_deaths ~ year))[2]
)
# take the top five and bottom five to plot out
top_five_trending = head(death_codes_with_slope %>% arrange(desc(slope)), 5)
bottom_five_trending = head(death_codes_with_slope %>% arrange(slope), 5)
top_five_trending <- merge(persons, top_five_trending, by.x = c('cause_of_death_code'), by.y = c('cause_of_death_code'))
bottom_five_trending <- merge(persons, bottom_five_trending, by.x = c('cause_of_death_code'), by.y = c('cause_of_death_code'))
top_five_trending$year <- as.Date(paste(top_five_trending$year, '-01-01', sep=''), '%Y-%m-%d')
bottom_five_trending$year <- as.Date(paste(bottom_five_trending$year, '-01-01', sep=''), '%Y-%m-%d')
rising_causes_of_death = ggplot(data = top_five_trending, mapping = aes(x = year, y = number_of_deaths, colour = cause_of_death)) +
geom_line() +
theme(
legend.position = "bottom"
, legend.direction = "vertical"
, plot.title = element_text(hjust = 0.5)
) +
xlab('Year') +
ylab('Number of Deaths') +
ggtitle('Trending Up') +
guides(colour = guide_legend(title = "Causes of Death"))
falling_causes_of_death = ggplot(data = bottom_five_trending, mapping = aes(x = year, y = number_of_deaths, colour = cause_of_death)) +
geom_line() +
theme(
legend.position = "bottom"
, legend.direction = "vertical"
, plot.title = element_text(hjust = 0.5)
) +
xlab('Year') +
ylab('Number of Deaths') +
ggtitle('Trending Down') +
guides(colour = guide_legend(title = "Causes of Death"))
grid.arrange(rising_causes_of_death, falling_causes_of_death, top = 'Top 5 Trending Up & Trending Down', ncol = 2)