Cleaning Formatted Spreadsheet

Often data is released into the world in the form of an excel spreadsheet, and often these spreadsheets come formatted in such a way which may be pleasing to the eye for some, however this can make the data difficult to read and analyse programmatically.

In this vignette I am going to walk you through the process of cleaning up one of these spreadsheets, the spreadsheet I will be using in this example is the “3303.0 Causes of Death, Australia, 2017” provided by the Australian Bureau of Statistics, this file can be downloaded from here.

We’ll start by visually inspecting the document to see how the data is structured by opening it up in Excel or LibreOffice if you are running Linux, go straight to the third tab this is where the data we are interested in is located.

Screenshot of tab three in 3303_1 underlying causes of death (australia).xls file.

From visually inspecting this document there are a number of features which stand out, and which present a problem when trying to perform some analysis programmatically.

There are three regularly repeating columns for Males, Females and Persons, these are repeated for each year starting in 2008 and going through to 2017.
Between each year there is an empty column.
Within the data there are rows which represent categories and subcategories, and the numbers in the rows are summations of the cause of death counts contained within those categories.
The ICD-10 code associated with the cause of death is included in the description.
The first five rows are just header information, not particularly useful.

If we want to do any analysis on this data in R we are going to need to clean this data, remove the blank columns, separate out mixed data and separate the summations from the actual counts.

So lets load this into R and see what it looks like, we will use the ‘read_xls’ function from the ‘readxl’ library and go straight to the third sheet skiping the first five rows.

df <- readxl::read_xls('../data/3303_1 underlying causes of death (australia).xls', sheet = 3, skip = 5)
head(df)

## # A tibble: 6 x 40
##   `Cause of death… Males...2 Females...3 Persons...4 ...5  Males...6
##   <chr>                <dbl>       <dbl>       <dbl> <lgl>     <dbl>
## 1 Total deaths         73548       70398      143946 NA        72320
## 2 <NA>                    NA          NA          NA NA           NA
## 3 <NA>                    NA          NA          NA NA           NA
## 4 Causes of death         NA          NA          NA NA           NA
## 5 CHAPTER I Certa…      1015         954        1969 NA          974
## 6 Intestinal infe…        23          45          68 NA           29
## # … with 34 more variables: Females...7 <dbl>, Persons...8 <dbl>,
## #   ...9 <lgl>, Males...10 <dbl>, Females...11 <dbl>, Persons...12 <dbl>,
## #   ...13 <lgl>, Males...14 <dbl>, Females...15 <dbl>, Persons...16 <dbl>,
## #   ...17 <lgl>, Males...18 <dbl>, Females...19 <dbl>, Persons...20 <dbl>,
## #   ...21 <lgl>, Males...22 <dbl>, Females...23 <dbl>, Persons...24 <dbl>,
## #   ...25 <lgl>, Males...26 <dbl>, Females...27 <dbl>, Persons...28 <dbl>,
## #   ...29 <lgl>, Males...30 <dbl>, Females...31 <dbl>, Persons...32 <dbl>,
## #   ...33 <lgl>, Males...34 <dbl>, Females...35 <dbl>, Persons...36 <dbl>,
## #   ...37 <lgl>, Males...38 <dbl>, Females...39 <dbl>, Persons...40 <dbl>

From this output we can see the empty columns, and the three repeating columns for Males, Females and Persons, from visually inspecting the spreadsheet earlier we know that each of these groups of three variables represents one year.

Now lets remove those empty rows, and rename the columns to something sensible, rather then manually writing out all the column names for the rename function we are going to loop through the years which this data set covers and define a ‘dictionary’ type structure which will hold the ‘templates’ of the names we are changing to and from.

# remove the NA columns
df <- df[, colSums(is.na(df)) < nrow(df)]

# rename the cause of death column
df <- df %>% 
  rename(
    "cause_of_death" = "Cause of death and ICD-10 code"
  )

# rename the yearly Males, Females and Persons columns

# This is how we define a dictionary structure in R, which will contain the name templates we are changing to and from
col_names <- c('males_', 'females_', 'persons_')
names(col_names) <- c("Males...", "Females...", "Persons...")

# The first of the males, females & persons columns is Males...2, therefore we start the suffix at 2
column_sufix <- 2
for (year in c(2008:2017)){
  
  counter <- 0
  for (col_name in names(col_names)){
    
    # this will be something like 'Males...2'
    col_name_to_change <- paste(col_name, counter + column_sufix, sep = '') 
    
    # this will be something like 'males_2008'
    change_to <- paste(col_names[[col_name]], year, sep = '') 
    
    # when passing variables into the rename function you need to place a 'bang(!!)' on the left hand side of the
    # variable and use the special substitute ':=' assignment operator.
    df <- df %>% rename(!!change_to := col_name_to_change)
    
    counter <- counter + 1
  }
  
  # we increment the suffix by 4 instead of 3 to account for the empty column we removed at the begining of this code block.
  column_sufix <- column_sufix + 4
}

# Check the column names.. all looks good
names(df)

##  [1] "cause_of_death" "males_2008"     "females_2008"   "persons_2008"  
##  [5] "males_2009"     "females_2009"   "persons_2009"   "males_2010"    
##  [9] "females_2010"   "persons_2010"   "males_2011"     "females_2011"  
## [13] "persons_2011"   "males_2012"     "females_2012"   "persons_2012"  
## [17] "males_2013"     "females_2013"   "persons_2013"   "males_2014"    
## [21] "females_2014"   "persons_2014"   "males_2015"     "females_2015"  
## [25] "persons_2015"   "males_2016"     "females_2016"   "persons_2016"  
## [29] "males_2017"     "females_2017"   "persons_2017"

The newly named ‘cause_of_death’ column actually contains two pieces of information, the description of the cause of death and the ICD-10 code, we will separate this code into its own column using the ‘str_extract’ from the ‘stringr’ which is part of the tidyverse, documentation on this function and more can be found here.

# Extract the cause of death code from the cause_of_death column using a regular expression
code_pattern = "(([A-N][0-9][0-9]-[A-N][0-9][0-9]))|(([A-N][0-9][0-9]))"
df$cause_of_death_code <- str_extract(df$cause_of_death, code_pattern)
df$cause_of_death <- str_replace(df$cause_of_death, code_pattern, '')
df$cause_of_death <- str_replace(df$cause_of_death, "\\(\\)", '') # remove the leftovers
knitr::kable(head(select(df, cause_of_death, cause_of_death_code)))

cause_of_death	cause_of_death_code
Total deaths	NA
NA	NA
NA	NA
Causes of death	NA
CHAPTER I Certain infectious and parasitic diseases	A00-B99
Intestinal infectious diseases	A00-A09

Now that we have the code separated from the description we can start removing those empty rows which we can easily do by filtering where cause_of_death is not NA and not equal to ‘Causes of death’.

# filter out empty rows
df <- df %>% filter(!is.na(cause_of_death) & cause_of_death != "Causes of death")
knitr::kable(head(select(df, cause_of_death, cause_of_death_code, males_2008, females_2008, persons_2008)))

cause_of_death	cause_of_death_code	males_2008	females_2008	persons_2008
Total deaths	NA	73548	70398	143946
CHAPTER I Certain infectious and parasitic diseases	A00-B99	1015	954	1969
Intestinal infectious diseases	A00-A09	23	45	68
Cholera	A00	0	0	0
Typhoid and paratyphoid fevers	A01	0	0	0
Other salmonella infections	A02	4	2	6

The data is starting to take shape now but we still have work to do, the totals, categories and subcategories still need to be separated from the main data, luckley these rows are easily identified.

The totals row is identified where the ‘cause_of_death’ equals ‘Total deaths’, this is an overall total across all the individual causes.
There are rows where the ‘cause_of_death’ starts with ‘CHAPTER’, these rows represent a high level category.
There are rows where the ‘cause_of_death_code’ contains a dash for example ‘A00-A09’, these rows are a subcategory under the Chapter, and there numbers are a summation of causes of death from A00 through to A009.

Let’s start separating these summary rows from the main data.

# 1. Total deaths, overall summation
totals_df <- df %>% filter(cause_of_death == "Total deaths")

# 2. The chapter rows
chapters_df <- df %>% filter(grepl("^CHAPTER", cause_of_death))

# 3. The sub groups rows
sub_groups_df <- df %>% filter(grepl("-", cause_of_death_code) & !grepl("^CHAPTER", cause_of_death))

# which leaves us with the main data set
items_df <- df %>% filter(!grepl("-", cause_of_death_code) & !is.na(cause_of_death_code))
knitr::kable(head(select(items_df, cause_of_death, cause_of_death_code, males_2008, females_2008, persons_2008)))

cause_of_death	cause_of_death_code	males_2008	females_2008	persons_2008
Cholera	A00	0	0	0
Typhoid and paratyphoid fevers	A01	0	0	0
Other salmonella infections	A02	4	2	6
Shigellosis	A03	0	0	0
Other bacterial intestinal infections	A04	9	21	30
Other bacterial foodborne intoxications, not elsewhere classified	A05	1	1	2

Now that we have the summary rows separated out from the main data we can start thinking about how to transform the data structure, if we think about this from a tidy data perspective and looking at the males, females & persons columns, each of these columns contains 2 variables year and gender, plus the observations, this needs to be separated out so that we have a year column, gender column, and number of deaths, to do this we will use the ‘melt’ function from the ‘reshape2’ library.

n <- names(items_df)
# This is going to give us all the names of the male, female & persons columns
n <- n[!grepl("cause_of_death", n) & !grepl("cause_of_death_code", n) & !grepl("code", n)]

items_df <- data.frame(melt(data = items_df, id.vars = c("cause_of_death", "cause_of_death_code"), measure.vars = n))
# extract the year into its own column
items_df$year <- str_extract(items_df$variable, "([0-9]{4})")
items_df$year <- as.numeric(items_df$year)
# remove the year from the variable
items_df$variable <- str_replace(items_df$variable, "(_[0-9]{4})", '')
# rename the columns
items_df <- items_df %>% rename(
  "gender" = "variable"
  , "number_of_deaths" = "value"
)
knitr::kable(head(items_df))

cause_of_death	cause_of_death_code	gender	number_of_deaths	year
Cholera	A00	males	0	2008
Typhoid and paratyphoid fevers	A01	males	0	2008
Other salmonella infections	A02	males	4	2008
Shigellosis	A03	males	0	2008
Other bacterial intestinal infections	A04	males	9	2008
Other bacterial foodborne intoxications, not elsewhere classified	A05	males	1	2008

So now the data is clean and ‘tidy’, let’s have a look at which causes of death are trending upward and which are trending downward, we will start by filtering down to the ‘persons’ rows as this is males and females combined, then we will calculate the slope of the trend line for each cause of death using linear regression via the ‘lm’ function included in base R.

persons = items_df %>% filter(gender == "persons")

death_codes_with_slope <- persons %>%
  group_by(cause_of_death_code) %>%
  summarise(
    slope = coef(lm(number_of_deaths ~ year))[2]
  )

# take the top five and bottom five to plot out
top_five_trending = head(death_codes_with_slope %>% arrange(desc(slope)), 5)
bottom_five_trending = head(death_codes_with_slope %>% arrange(slope), 5)

top_five_trending <- merge(persons, top_five_trending, by.x = c('cause_of_death_code'), by.y = c('cause_of_death_code'))
bottom_five_trending <- merge(persons, bottom_five_trending, by.x = c('cause_of_death_code'), by.y = c('cause_of_death_code'))

top_five_trending$year <- as.Date(paste(top_five_trending$year, '-01-01', sep=''), '%Y-%m-%d')
bottom_five_trending$year <- as.Date(paste(bottom_five_trending$year, '-01-01', sep=''), '%Y-%m-%d')

rising_causes_of_death = ggplot(data = top_five_trending, mapping = aes(x = year, y = number_of_deaths, colour = cause_of_death)) +
  geom_line() + 
  theme(
    legend.position = "bottom"
    , legend.direction = "vertical"
    , plot.title = element_text(hjust = 0.5)
  ) +
  xlab('Year') +
  ylab('Number of Deaths') +
  ggtitle('Trending Up') +
  guides(colour = guide_legend(title = "Causes of Death"))


falling_causes_of_death = ggplot(data = bottom_five_trending, mapping = aes(x = year, y = number_of_deaths, colour = cause_of_death)) +
  geom_line() + 
  theme(
    legend.position = "bottom"
    , legend.direction = "vertical"
    , plot.title = element_text(hjust = 0.5)
  ) +
  xlab('Year') +
  ylab('Number of Deaths') +
  ggtitle('Trending Down') +
  guides(colour = guide_legend(title = "Causes of Death"))
  

grid.arrange(rising_causes_of_death, falling_causes_of_death, top = 'Top 5 Trending Up & Trending Down', ncol = 2)

Cleaning Formatted Spreadsheet

Michael Gordon

11 August 2019