Assignment 3 - Cancer the Silent Assassin from within

RPubs URL: https://rpubs.com/Chuver/725516

Loom URL: https://www.loom.com/share/cd5044de23c14c48aca2b30b72073cc1

Required Packages

knitr::opts_chunk$set(echo = TRUE)
# Required Packages:
library(dplyr) # Used for data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr) # Used for pipe functions
library(tidyr) # Used for tidying data

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(outliers) # Used for the scores function
library(stats) # Used for the ICQ function
library(ggplot2) # Used to visualise the results

Introduction

With the pervasiveness of COVID-19 the world has almost forgotten another hidden killer amongst us - cancer. We all have heard about it but it only really becomes more “real” when someone close to us has either passed away from it or experienced its symptoms and unpleasant treatment.

Motivation for this assignment came about as a result of 3 recent cancer related deaths amongst family and friends. In addition my mother has also experienced and recovered from breast cancer. My father-in-law passed away from lung cancer whilst an elderly and middle-aged friend passed away from pancreatic and lung cancer respectively. All these experiences within the last year really solidify how close to home cancer is and how nobody is immune.

In essence cancer occurs as a result if a breakdown of one’s own system where cells grow out of control and form tumours. It can literally occur anywhere in the body and is a key contributor of death in Australia . According to the ABS in 2019, cancer of the trachea, bronchus and lung was the 4th cause of death after heart disease, Alzheimers disease and cerebrovascular diseases.

In trying to get a better understanding of what cancer is, this assignment seeks to answer the following question and in do doing answer the Assignment 3 rubric:

What are the top cancer related deaths and which cancers have the highest incident rates amongst children, adolescents, adults and seniors in Australia?

Data

The 2 datasets used in this assignment were sourced using the https://data.gov.au/ website and the search term “cancer”. They are published by the Australian Institute of Health and Welfare (AIHW) and are as follows:

ACIM Combined Counts. This provides the raw counts of all cancer incidence and mortality data by sex, year, age, and type.

ACIM Combined Rates. This represents the rates amongst the Australian population of all cancer incidence and mortality data by sex, year, age, and type.

Both these 2 datasets are in CSV format and have Creative Commons Attribution 3.0 Australia licenses. For the purpose of assignment 3, they will be joined and will follow the required preprocessing steps in order to get dataset in a position for data analytics.

Dataset 1 - Cancer counts

The first dataset comprises of raw counts by cancer type for both incidents and mortalities from 1968 to 2010.

# This represents the counts dataset
d01 <- read.csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/counts01.csv")

str(d01)

## 'data.frame':    9240 obs. of  23 variables:
##  $ Year        : int  1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
##  $ Sex         : chr  "Male" "Male" "Male" "Male" ...
##  $ Type        : chr  "Incidence" "Incidence" "Incidence" "Incidence" ...
##  $ Cancer_Type : chr  "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
##  $ Age_0_to_4  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_5_to_9  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_10_to_14: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_15_to_19: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_20_to_24: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_25_to_29: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_30_to_34: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_35_to_39: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_40_to_44: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_45_to_49: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_50_to_54: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_55_to_59: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_60_to_64: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_65_to_69: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_70_to_74: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_75_to_79: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_80_to_84: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_85.     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Unknown : int  NA NA NA NA NA NA NA NA NA NA ...

As shown above, there are 9240 observations with 23 variables.

Dataset 2 - Cancer rates

The second dataset comprises of rates by cancer type for both incidents and mortalities from 1968 to 2010.

# This represents the rates dataset

d02 <- read.csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/rates01.csv")
str(d02)

## 'data.frame':    9240 obs. of  26 variables:
##  $ Year             : int  1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
##  $ Sex              : chr  "Male" "Male" "Male" "Male" ...
##  $ Type             : chr  "Incidence" "Incidence" "Incidence" "Incidence" ...
##  $ Cancer_Type      : chr  "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
##  $ Age_0_to_4       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_5_to_9       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_10_to_14     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_15_to_19     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_20_to_24     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_25_to_29     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_30_to_34     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_35_to_39     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_40_to_44     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_45_to_49     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_50_to_54     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_55_to_59     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_60_to_64     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_65_to_69     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_70_to_74     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_75_to_79     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_80_to_84     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_85.          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Unknown      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_Aust: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_Segi: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_WHO : num  NA NA NA NA NA NA NA NA NA NA ...

As shown above, there are 9240 observations with 26 variables.

Merged dataset

The merged dataset is an amalgamation of datasets 1 and 2 respectively, essentially combining the cancer rates and counts by year, sex, type and cancer_type.

# Mutating left-join to join datasets 1 and 2
d12 <- left_join(d01, d02, by = c("Year" = "Year", "Sex" = "Sex", "Type" = "Type", 
                                  "Cancer_Type" = "Cancer_Type")) 

str(d12)

## 'data.frame':    9240 obs. of  45 variables:
##  $ Year             : int  1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
##  $ Sex              : chr  "Male" "Male" "Male" "Male" ...
##  $ Type             : chr  "Incidence" "Incidence" "Incidence" "Incidence" ...
##  $ Cancer_Type      : chr  "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
##  $ Age_0_to_4.x     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_5_to_9.x     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_10_to_14.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_15_to_19.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_20_to_24.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_25_to_29.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_30_to_34.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_35_to_39.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_40_to_44.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_45_to_49.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_50_to_54.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_55_to_59.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_60_to_64.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_65_to_69.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_70_to_74.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_75_to_79.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_80_to_84.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_85..x        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Unknown.x    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_0_to_4.y     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_5_to_9.y     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_10_to_14.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_15_to_19.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_20_to_24.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_25_to_29.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_30_to_34.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_35_to_39.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_40_to_44.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_45_to_49.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_50_to_54.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_55_to_59.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_60_to_64.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_65_to_69.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_70_to_74.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_75_to_79.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_80_to_84.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_85..y        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Unknown.y    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_Aust: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_Segi: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_WHO : num  NA NA NA NA NA NA NA NA NA NA ...

As shown above, there are 9240 observations with 45 variables. Note the “.x” and “.y” suffixes at the end of the Age* variables to separate the counts from the rates variables respectively, given that the same names were used in the original data sets.

Understand

In order to obtain a greater understanding of the merged data set, the following was performed:

Determined the types of variables and data structure using str(d12).
Checked the dimensions of the data frame using dim(d12).
Changed Sex variable from character to factor using factor(d12$Sex) and then ordered using levels().
Created a new subset of the data frame using select(-) to only use the required variables.

# i) Determined the types of variables and and data structure
str(d12)

## 'data.frame':    9240 obs. of  45 variables:
##  $ Year             : int  1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
##  $ Sex              : chr  "Male" "Male" "Male" "Male" ...
##  $ Type             : chr  "Incidence" "Incidence" "Incidence" "Incidence" ...
##  $ Cancer_Type      : chr  "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
##  $ Age_0_to_4.x     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_5_to_9.x     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_10_to_14.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_15_to_19.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_20_to_24.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_25_to_29.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_30_to_34.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_35_to_39.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_40_to_44.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_45_to_49.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_50_to_54.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_55_to_59.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_60_to_64.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_65_to_69.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_70_to_74.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_75_to_79.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_80_to_84.x   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_85..x        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Unknown.x    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_0_to_4.y     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_5_to_9.y     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_10_to_14.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_15_to_19.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_20_to_24.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_25_to_29.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_30_to_34.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_35_to_39.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_40_to_44.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_45_to_49.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_50_to_54.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_55_to_59.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_60_to_64.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_65_to_69.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_70_to_74.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_75_to_79.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_80_to_84.y   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_85..y        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Unknown.y    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_Aust: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_Segi: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_Std_Rate_WHO : num  NA NA NA NA NA NA NA NA NA NA ...

As shown above, the merged dataset is a dataframe which is comprised of 9240 observations and 45 variables. This dataset comprises of the following variables:

21 Integer variables:
- Year - represents the year either the cancer incident or mortality is reported.
- Age*.x - represents the actual count of the cancer incident or mortality by age bands.
3 Character variables:
- Type - represents whether the cancer reported is either an incident or a mortality
- Cancer_Type - indicates the actual name of the cancer reported, e.g. Brain or Lung (cancer).
- Sex - represents the sex of the respondent(male, female or person). Note this will be converted to a Categorical variable.
21 Numeric variables:
- Age*.y - represents the actual rate of the cancer incident or mortality by age bands.
- Age_Std_Rate* - additional classifcations of cancer incident/mortality rates.

# ii)  Checked the dimensions of the data frame

dim(d12)

## [1] 9240   45

The output confirms that there are 9240 rows and 45 variables in the merged dataset.

# iii) Changed Sex variable from character to factor
d12$Sex <- factor(d12$Sex ,levels = c("Male", "Female", "Persons"))

str(d12$Sex)

##  Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...

The above output confirms that the Sex variable has been converted to that of a factor.

As not all the variables will be used, the following will be removed from the merged dataset through creation of a subset:

Age_Unknown.x (only know age categories have been considered)
Age_Unknown.y (only know age categories have been considered)
Age_Std_Rate_Aust (not present in dataset 1, so cannot be consolidated by age band)
Age_Std_Rate_Segi (not present in dataset 1, so cannot be consolidated by age band)
Age_Std_Rate_WHO (not present in dataset 1, so cannot be consolidated by age band)

# iv) Created a new subset of the data frame to only use the required variables

d12n <- d12
d12n %<>% select(-c(Age_Unknown.x, Age_Unknown.y, Age_Std_Rate_Aust, Age_Std_Rate_Segi, Age_Std_Rate_WHO))
str(d12n)

## 'data.frame':    9240 obs. of  40 variables:
##  $ Year          : int  1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
##  $ Sex           : Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Type          : chr  "Incidence" "Incidence" "Incidence" "Incidence" ...
##  $ Cancer_Type   : chr  "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
##  $ Age_0_to_4.x  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_5_to_9.x  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_10_to_14.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_15_to_19.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_20_to_24.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_25_to_29.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_30_to_34.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_35_to_39.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_40_to_44.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_45_to_49.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_50_to_54.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_55_to_59.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_60_to_64.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_65_to_69.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_70_to_74.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_75_to_79.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_80_to_84.x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_85..x     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_0_to_4.y  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_5_to_9.y  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_10_to_14.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_15_to_19.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_20_to_24.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_25_to_29.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_30_to_34.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_35_to_39.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_40_to_44.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_45_to_49.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_50_to_54.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_55_to_59.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_60_to_64.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_65_to_69.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_70_to_74.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_75_to_79.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_80_to_84.y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Age_85..y     : num  NA NA NA NA NA NA NA NA NA NA ...

As indicated above there are now 5 less variables, resulting in an amended count of 40 variables.

Tidy and Manipulate Data I and II

From examination of the merged dataset, the data is not 100% tidy, as theAge* variables violate the principle that all variables must have their own column. The Age* columns are not variables but values of cancer count and rate per age group.

In order to make the dataset tidy these variables need to be fixed using the pivot_longer function. But as there are many age bands, these bands will first be consolidated into the following 4 Age Categories using the transmute function:

Child (ages 0 to 9 years old)
Adolescent (ages 10 to 19 years old)
Adult (ages 20 to 64 years old)
Senior (ages 65 years old and above)

Note these will be duplicated into x and y as they represent the same categories but different values(integers for cancer counts versus numeric for cancer rates).

As I need to perform the transmute before the pivot_longer, I have decided to combine the 2 tidy and manipulate sections in the r markdown document together as Tidy and Manipulate Data I & II.

# separate age categories created for simplification
d12nt <- d12n
d12nt %<>% 
transmute(Year, Sex, Type, Cancer_Type, 
            Child_x = Age_0_to_4.x + Age_5_to_9.x, 
            Adolescent_x =  Age_10_to_14.x + Age_15_to_19.x, 
            Adult_x = Age_20_to_24.x + Age_25_to_29.x + 
                      Age_30_to_34.x + Age_35_to_39.x + Age_40_to_44.x + 
                      Age_45_to_49.x + Age_50_to_54.x + Age_55_to_59.x + 
                      Age_60_to_64.x, 
            Senior_x = Age_65_to_69.x + Age_70_to_74.x + Age_75_to_79.x + 
                      Age_80_to_84.x + Age_85..x,
            Child_y = Age_0_to_4.y + Age_5_to_9.y, 
            Adolescent_y =  Age_10_to_14.y + Age_15_to_19.y, 
            Adult_y = Age_20_to_24.y + Age_25_to_29.y + 
                      Age_30_to_34.y + Age_35_to_39.y + Age_40_to_44.y + 
                      Age_45_to_49.y + Age_50_to_54.y + Age_55_to_59.y + 
                      Age_60_to_64.y, 
            Senior_y = Age_65_to_69.y + Age_70_to_74.y + Age_75_to_79.y + 
                      Age_80_to_84.y + Age_85..y)

str(d12nt)

## 'data.frame':    9240 obs. of  12 variables:
##  $ Year        : int  1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
##  $ Sex         : Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Type        : chr  "Incidence" "Incidence" "Incidence" "Incidence" ...
##  $ Cancer_Type : chr  "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
##  $ Child_x     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Adolescent_x: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Adult_x     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Senior_x    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Child_y     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Adolescent_y: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Adult_y     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Senior_y    : num  NA NA NA NA NA NA NA NA NA NA ...

As can be seen above there are consolidated Age Category columns Child_x, Adolescent_x, Adult_x, Senior_x, Child_y, Adolescent_y, Adult_y and Senior_y. These all need to be under a single Age Category column in order to meet tidy data principles. To do so the following will be performed:

Create counts dataset for cancer counts using a subset of merged dataset and renaming common Age Categories and then apply pivot_longer.
Create rates dataset for cancer rates using a subset of merged dataset and renaming common Age Categories and then apply pivot_longer.
Rejoin datasets to form merged dataset which meets tidy principles.

# a) Create counts dataset

counts <- d12nt #created counts table for pivoting_longer
counts %<>% 
  rename(Child = Child_x, Adolescent = Adolescent_x, Adult = Adult_x, Senior = Senior_x) %>%
  select(Year, Sex, Type, Cancer_Type, Child, Adolescent, Adult, Senior) %>% 
  pivot_longer(- c(Year, Sex, Type, Cancer_Type),
               names_to = "Age_Category", values_to = "Count")

# b) Create rates dataset

#created rates table for pivoting_longer and renamed age bands as above for consolidation

rates <- d12nt 
rates %<>% 
  rename(Child = Child_y, Adolescent = Adolescent_y, Adult = Adult_y, Senior = Senior_y) %>% 
  pivot_longer(- c(Year, Sex, Type, Cancer_Type),
               names_to = "Age_Category", values_to = "Rate")


# c) Rejoin counts and rates datasets 
library(magrittr)

d12_final <- left_join(counts, rates, by = c("Year" = "Year", "Sex" = "Sex", "Type" = "Type", 
                                             "Cancer_Type" = "Cancer_Type", 
                                             "Age_Category" = "Age_Category" )) 
# Convert Age Category from character to factor
d12_final$Age_Category <- factor(d12_final$Age_Category, levels = c("Child", "Adolescent", 
                                                                    "Adult", "Senior")) 
str(d12_final)

## tibble [36,960 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Year        : int [1:36960] 1968 1968 1968 1968 1969 1969 1969 1969 1970 1970 ...
##  $ Sex         : Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Type        : chr [1:36960] "Incidence" "Incidence" "Incidence" "Incidence" ...
##  $ Cancer_Type : chr [1:36960] "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
##  $ Age_Category: Factor w/ 4 levels "Child","Adolescent",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Count       : int [1:36960] NA NA NA NA NA NA NA NA NA NA ...
##  $ Rate        : num [1:36960] NA NA NA NA NA NA NA NA NA NA ...

As shown above, the updated merged data set is now tidy with a single Age Category for both cancer counts and rates.

Scan I

This scan is for missing values, inconsistencies and obvious errors.

The approach undertaken to identify missing values is as follows:

Used function sum(is.na()) to get an overall feel of the magnitude of missing values.
Determined which variables have missing values using names(which(colSums(is.na())>0)).
Determined which years have missing values using summarise(count = n()).
Excluded missing values from dataset using completed.cases() function.

# a) Determine how many NA's are in the merged dataframe.

sum(is.na(d12_final)) # Determine how many NA's are in the merged dataframe.

## [1] 20648

As shown above there are 20648 NA values

# b) Determine which variables have missing values and how many records are missing.

colSums(is.na(d12_final))

##         Year          Sex         Type  Cancer_Type Age_Category        Count 
##            0            0            0            0            0        10324 
##         Rate 
##        10324

names(which(colSums(is.na(d12_final))>0))

## [1] "Count" "Rate"

As indicated above the missing values relate to the Count and Rate variables and both result in a total of 10324 missing records as they are part of the same observation.

The original merged filed contained 36690 observations. By subtracting 10324 missing records, there should be 26636 remaining records with no missing data.

# c) Determine which years have missing values.


d12_final %>% 
  filter(is.na(Count) | is.na(Rate)) %>% 
  select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>% 
  group_by(Year) %>% 
  summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 44 x 2
##     Year count
##    <int> <int>
##  1  1968   472
##  2  1969   472
##  3  1970   472
##  4  1971   472
##  5  1972   472
##  6  1973   472
##  7  1974   472
##  8  1975   472
##  9  1976   472
## 10  1977   472
## # ... with 34 more rows

As seen above there are 472 NAs for the years 1968 to 1981 and 2011. These are as a result of no records being available. The remaining NAs are for the known years from 1982 to 2010 in diminishing amounts. These will all be addressed using the completed.cases() function, as this is the best approach rather than physically removing the rows in the dataset.

# d) Treatment of missing values using completed.cases function

d12_final_2 <- d12_final[complete.cases(d12_final),]

nrow(d12_final_2) # Get number of non-missing rows

## [1] 26636

sum(is.na(str(d12_final_2))) # Check that all NA have been excluded

## tibble [26,636 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Year        : int [1:26636] 1982 1982 1982 1982 1983 1983 1983 1983 1984 1984 ...
##  $ Sex         : Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Type        : chr [1:26636] "Incidence" "Incidence" "Incidence" "Incidence" ...
##  $ Cancer_Type : chr [1:26636] "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
##  $ Age_Category: Factor w/ 4 levels "Child","Adolescent",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Count       : int [1:26636] 67 28 24 10 59 15 18 18 67 20 ...
##  $ Rate        : num [1:26636] 11.05 4.15 5.2 7.86 9.74 ...

## [1] 0

As shown above there are now 26,636 observations and none of these contain NA. This reconciles perfectly to what was calculated above.

Scan II

This scan serves to scan the numeric data for outliers.

As both counts and rates are the individual integer and numeric variables, univariate outlier detection will be implemented.

Tukey’s method of outlier detection will be used

library(outliers) # for the scores function
library(stats)

# Generate boxplot - Count variable

d12_final_2$Count %>%
  boxplot(main = "Box Plot of Cancer Counts", ylab = "Count", col = "red")

# Determine Mean and Median values for Count variable
summary(d12_final_2$Count)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0    15.0   282.1   220.0 13041.0

# Generate boxplot - Rate variable

d12_final_2$Rate %>%
  boxplot(main = "Box Plot of Cancer Rates", ylab = "Count", col = "blue")

# Determine Mean and Median values for Rate variable
summary(d12_final_2$Rate)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    2.825   92.435   56.991 6439.857

As seen from the info above there are a significant number of outliers above the mean of: 282.1 for Counts and 92.435 for Rates

library(stats)
# Determine IQR for the purpose of upper and lower fences for Count and Rate variables

count_iqr <- IQR(d12_final_2$Count) # Count variable IQR
count_iqr

## [1] 220

rate_iqr <- IQR(d12_final_2$Rate) # Count variable IQR
rate_iqr

## [1] 56.99056

As seen above the IQR for the Counts is 220, whilst the IQR for Rates is 56.99056. The ext step is to find out the upper and lower fences for the Count and Rate variables.

q1 <- quantile(d12_final_2$Count, probs = 0.25) # Lower Count quantile
q3 <- quantile(d12_final_2$Count, probs = 0.75) # Upper Count quantile


count_low_fence <- q1 - (1.5 * count_iqr) # Lower fence value for Count variable
count_up_fence  <- q3 + (1.5 * count_iqr)  # Upper fence value for Count variable

count_low_fence

##  25% 
## -330

count_up_fence

## 75% 
## 550

q1 <- quantile(d12_final_2$Rate, probs = 0.25) # Lower Rate quantile
q3 <- quantile(d12_final_2$Rate, probs = 0.75) # Upper Rate quantile


rate_low_fence <- q1 - (1.5 * rate_iqr) # Lower fence value for Rate variable
rate_up_fence  <- q3 + (1.5 * rate_iqr)  # Upper fence value for Rate variable

rate_low_fence

##       25% 
## -85.48585

rate_up_fence

##      75% 
## 142.4764

As can be seen above, the fence values for both variables are as follows:

Fence	Count	Rate
Low	-330	-85.48585
Up	550	142.4764

count_up_low_outliers <- which(d12_final_2$Count < count_low_fence) # Count outliers below low fence
count_up_outliers <- which(d12_final_2$Count > count_up_fence)      # Count outliers above up fence


length(count_up_low_outliers) # Number of Count outliers below low fence

## [1] 0

length(count_up_outliers) # Number of Count outliers above the up fence

## [1] 3374

rate_up_low_outliers <- which(d12_final_2$Rate < count_low_fence) # Rate outliers below low fence
rate_up_outliers <- which(d12_final_2$Rate > count_up_fence)      # Rate outliers above up fence


length(rate_up_low_outliers) # Number of Rate outliers below low fence

## [1] 0

length(rate_up_outliers) # Number of Rate outliers above the up fence

## [1] 1057

**Summary of Outliers for both variables**
Fence	Count	Rate
Low	0	0
Up	3374	1057

The final step of Scan II is to impute these outliers with the mean of the Count and Rate values respectively.

#Impute Count outliers with mean of Count

d12_final_2$Count[count_up_outliers] <- mean(d12_final_2$Count)
length(d12_final_2$Count[count_up_outliers])

## [1] 3374

#Impute Rate outliers with mean of Rate

d12_final_2$Rate[rate_up_outliers] <- mean(d12_final_2$Rate)
length(d12_final_2$Rate[rate_up_outliers])

## [1] 1057

Notice the number of records updated for both variables aligns with the initial outlier amount.

d12_final_2$Count %>%
  boxplot(main = "Box Plot of Cancer Counts", ylab = "Count", col = "green")

d12_final_2$Rate %>%
  boxplot(main = "Box Plot of Cancer Rate", ylab = "Rate", col = "yellow")

Note amended boxplots above for Cancer Counts (green) and Cancer Rates (yellow) now have no outliers.

Transform

The variable chosen for the purpose of transformation is the Rates variable, as it better shows the result of the log10 transformation. In order to reduce right skewness of the Rates histogram, log10 transformation was applied.

rate <- d12_final_2$Rate
hist(rate, 
        main = "Histogram of Cancer Rates",
        xlab = "Cancer Rate")

hist(log10(rate),
    main = "Histogram of Log 10 Cancer Rates",
    xlab = "Cancer Rate")

Data Visualisation

In order to simplify the output of the visualisations, only the top 6 in each category have been output. A total of 8 variables as inputs for the ggplot2() function were defined:

Child Count and Child Rates
Adolescent Count and Adolescent Rates
Adult Count and Adult Rates
Senior Count and Senior Rates

# Child selection and visualisation
d12_final_2_child <- d12_final_2

d12_final_2_child %<>%
  select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>% 
  filter(Age_Category == "Child", Cancer_Type == "Acute lymphoblastic leukaemia" | 
           Cancer_Type == "Brain cancer" | Cancer_Type == "Kidney cancer" |
  Cancer_Type == "Non-Hodgkin lymphoma" | Cancer_Type == "Acute myeloid leukaemia" | 
    Cancer_Type == "Liver cancer" ) %>% 
  group_by(Type, Cancer_Type, Age_Category) %>% summarise(Cum_Count = sum(Count), 
                                                          Cum_Rate = sum(Rate)) %>% 
  arrange(desc(Cum_Count, Cum_Rate))

## `summarise()` regrouping output by 'Type', 'Cancer_Type' (override with `.groups` argument)

gchi <- ggplot(d12_final_2_child, aes(Cum_Count, Cancer_Type, colour =Type)) 
gchi + geom_count() + labs(x = "Count", y = "Cancer", 
title = "Child Cancer Counts (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")

rchi <- ggplot(d12_final_2_child, aes(Cum_Rate, Cancer_Type, colour =Type)) + 
  scale_color_brewer(palette = "Paired") 
rchi + geom_count() + labs(x = "Rate", y = "Cancer", 
title = "Child Cancer Rates (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")

# Adolescent selection and visualisation
d12_final_2_ado <- d12_final_2

d12_final_2_ado %<>%
  select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>% 
  filter(Age_Category == "Adolescent", Cancer_Type == "Melanoma of the skin" | 
           Cancer_Type == "Hodgkin lymphoma" | Cancer_Type == "Acute lymphoblastic leukaemia" | 
           Cancer_Type == "Brain cancer"| Cancer_Type == "Non-Hodgkin lymphoma" | 
           Cancer_Type == "Acute myeloid leukaemia")  %>% 
  group_by(Type, Cancer_Type, Age_Category) %>% summarise(Cum_Count = sum(Count), 
                                                          Cum_Rate = sum(Rate)) %>% 
  arrange(desc(Cum_Count, Cum_Rate))

## `summarise()` regrouping output by 'Type', 'Cancer_Type' (override with `.groups` argument)

gado <- ggplot(d12_final_2_ado, aes(Cum_Count, Cancer_Type, colour =Type)) 
gado + geom_count() + labs(x = "Count", y = "Cancer", 
title = "Adolescent Cancer Counts (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")

rado <- ggplot(d12_final_2_ado, aes(Cum_Rate, Cancer_Type, colour =Type)) + 
  scale_color_brewer(palette = "Paired") 
rado + geom_count() + labs(x = "Rate", y = "Cancer", 
title = "Adolescent Cancer Rates (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")

# Adult selection and visualisation
d12_final_2_adult <- d12_final_2

d12_final_2_adult %<>%
  select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>% 
  filter(Age_Category == "Adult", Cancer_Type == "Colon cancer" | 
           Cancer_Type == "Unknown primary site" | Cancer_Type == "Brain cancer" | 
           Cancer_Type == "Pancreatic cancer" | Cancer_Type == "Melanoma of the skin" | 
           Cancer_Type == "Stomach cancer") %>% 
  group_by(Type, Cancer_Type, Age_Category) %>% summarise(Cum_Count = sum(Count), 
                                                          Cum_Rate = sum(Rate)) %>% 
  arrange(desc(Cum_Count, Cum_Rate))

## `summarise()` regrouping output by 'Type', 'Cancer_Type' (override with `.groups` argument)

gadu <- ggplot(d12_final_2_adult, aes(Cum_Count, Cancer_Type, colour =Type)) 
gadu + geom_count() + labs(x = "Count", y = "Cancer", 
title = "Adult Cancer Counts (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")

radu <- ggplot(d12_final_2_adult, aes(Cum_Rate, Cancer_Type, colour =Type)) + 
  scale_color_brewer(palette = "Paired") 
radu + geom_count() + labs(x = "Rate", y = "Cancer", 
title = "Adult Cancer Rates (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")

# Senior selection and visualisation
d12_final_2_sen <- d12_final_2

d12_final_2_sen %<>%
  select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>% 
  filter(Age_Category == "Senior", Cancer_Type == "Rectal cancer" | 
           Cancer_Type == "Non-Hodgkin lymphoma" | 
           Cancer_Type == "Head and neck including lip" | 
           Cancer_Type == "Head and neck excluding lip" | Cancer_Type == "Bladder cancer" | 
           Cancer_Type == "Oesophageal cancer") %>% 
  group_by(Type, Cancer_Type, Age_Category) %>% summarise(Cum_Count = sum(Count), 
                                                          Cum_Rate = sum(Rate)) %>% 
  arrange(desc(Cum_Count, Cum_Rate))

## `summarise()` regrouping output by 'Type', 'Cancer_Type' (override with `.groups` argument)

gsen <- ggplot(d12_final_2_sen, aes(Cum_Count, Cancer_Type, colour =Type)) 
gsen + geom_count() + labs(x = "Count", y = "Cancer", 
title = "Senior Cancer Counts (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")

rsen <- ggplot(d12_final_2_sen, aes(Cum_Rate, Cancer_Type, colour =Type)) + 
  scale_color_brewer(palette = "Paired") 
rsen + geom_count() + labs(x = "Rate", y = "Cancer", 
title = "Senior Cancer Rates (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")

Conclusion

**Summary of Cancer Counts and Rates:**
Age Category	Highest Mortality	Fastest Incident rate
Child	Acute lymphoblastic leukaemia	Acute lymphoblastic leukaemia
Adolescent	Acute lymphoblastic leukaemia	Melanoma of the skin
Adult	Colon cancer	Melanoma of the skin
Senior	Rectal cancer	Non-Hodgkin lymphoma

Cancer related deaths certainly do vary by age category. At the lowest end of the scale, the biggest cancer related cause of death for children and adolescents is Acute lymphoblastic leukaemia. Melanoma of the skin have high rates of incident for Adults and Adolescents given the high exposure due to work and sport but low mortality. Adults are most at risk from colon cancer. Whilst for seniors Rectal cancer is the biggest killer with Non-Hodgkin lymphoma having the highest rates of incidents.

Whilst we cannot prevent and stop the spread of the silent assassin within our bodies, we can certainly reduce the opportunities for it to appear. We can do so by strengthening our bodies by eating foods low in fat and high in fibre, exercising regularly, getting enough daily sleep, minimising our exposure to the sun, refraining from tobacco use and going for annual medical check-ups. All of these little steps will assist in reducing the cancer mortality counts and rates in our society.

References

Cancer prevention: 7 tips to reduce your risk, Mayoclinic.org, viewed 22 February 2021, https://www.mayoclinic.org/healthy-lifestyle/adult-health/in-depth/cancer-prevention/art-20044816
Causes of Death, Australia, Australian Bureau of Statistics, viewed 18 February 2021, https://www.abs.gov.au/statistics/health/causes-death/causes-death-australia/latest-release
Australian Cancer Incidence and Mortality - 2015, Australian Institute of Health and Welfare, viewed 12 February 2021, https://data.gov.au/data/dataset/05696f6f-6ff5-42a2-904f-af5e4d1f56f8
Boehmke, BC, 2016, Data Wrangling with R, Springer International Publishing, Cham, Switzerland.
Wickham, H, & Grolemund, G 2016, R for Data Science : Import, Tidy, Transform, Visualize, and Model Data, O’Reilly Media, Incorporated, Sebastopol, CA.