RPubs URL: https://rpubs.com/Chuver/725516
Loom URL: https://www.loom.com/share/cd5044de23c14c48aca2b30b72073cc1
knitr::opts_chunk$set(echo = TRUE)
# Required Packages:
library(dplyr) # Used for data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr) # Used for pipe functions
library(tidyr) # Used for tidying data
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(outliers) # Used for the scores function
library(stats) # Used for the ICQ function
library(ggplot2) # Used to visualise the results
With the pervasiveness of COVID-19 the world has almost forgotten another hidden killer amongst us - cancer. We all have heard about it but it only really becomes more “real” when someone close to us has either passed away from it or experienced its symptoms and unpleasant treatment.
Motivation for this assignment came about as a result of 3 recent cancer related deaths amongst family and friends. In addition my mother has also experienced and recovered from breast cancer. My father-in-law passed away from lung cancer whilst an elderly and middle-aged friend passed away from pancreatic and lung cancer respectively. All these experiences within the last year really solidify how close to home cancer is and how nobody is immune.
In essence cancer occurs as a result if a breakdown of one’s own system where cells grow out of control and form tumours. It can literally occur anywhere in the body and is a key contributor of death in Australia . According to the ABS in 2019, cancer of the trachea, bronchus and lung was the 4th cause of death after heart disease, Alzheimers disease and cerebrovascular diseases.
In trying to get a better understanding of what cancer is, this assignment seeks to answer the following question and in do doing answer the Assignment 3 rubric:
What are the top cancer related deaths and which cancers have the highest incident rates amongst children, adolescents, adults and seniors in Australia?
The 2 datasets used in this assignment were sourced using the https://data.gov.au/ website and the search term “cancer”. They are published by the Australian Institute of Health and Welfare (AIHW) and are as follows:
ACIM Combined Counts. This provides the raw counts of all cancer incidence and mortality data by sex, year, age, and type.
ACIM Combined Rates. This represents the rates amongst the Australian population of all cancer incidence and mortality data by sex, year, age, and type.
Both these 2 datasets are in CSV format and have Creative Commons Attribution 3.0 Australia licenses. For the purpose of assignment 3, they will be joined and will follow the required preprocessing steps in order to get dataset in a position for data analytics.
The first dataset comprises of raw counts by cancer type for both incidents and mortalities from 1968 to 2010.
# This represents the counts dataset
d01 <- read.csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/counts01.csv")
str(d01)
## 'data.frame': 9240 obs. of 23 variables:
## $ Year : int 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
## $ Sex : chr "Male" "Male" "Male" "Male" ...
## $ Type : chr "Incidence" "Incidence" "Incidence" "Incidence" ...
## $ Cancer_Type : chr "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
## $ Age_0_to_4 : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_5_to_9 : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_10_to_14: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_15_to_19: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_20_to_24: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_25_to_29: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_30_to_34: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_35_to_39: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_40_to_44: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_45_to_49: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_50_to_54: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_55_to_59: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_60_to_64: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_65_to_69: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_70_to_74: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_75_to_79: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_80_to_84: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_85. : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Unknown : int NA NA NA NA NA NA NA NA NA NA ...
As shown above, there are 9240 observations with 23 variables.
The second dataset comprises of rates by cancer type for both incidents and mortalities from 1968 to 2010.
# This represents the rates dataset
d02 <- read.csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/rates01.csv")
str(d02)
## 'data.frame': 9240 obs. of 26 variables:
## $ Year : int 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
## $ Sex : chr "Male" "Male" "Male" "Male" ...
## $ Type : chr "Incidence" "Incidence" "Incidence" "Incidence" ...
## $ Cancer_Type : chr "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
## $ Age_0_to_4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_5_to_9 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_10_to_14 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_15_to_19 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_20_to_24 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_25_to_29 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_30_to_34 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_35_to_39 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_40_to_44 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_45_to_49 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_50_to_54 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_55_to_59 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_60_to_64 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_65_to_69 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_70_to_74 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_75_to_79 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_80_to_84 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_85. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Unknown : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_Aust: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_Segi: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_WHO : num NA NA NA NA NA NA NA NA NA NA ...
As shown above, there are 9240 observations with 26 variables.
The merged dataset is an amalgamation of datasets 1 and 2 respectively, essentially combining the cancer rates and counts by year, sex, type and cancer_type.
# Mutating left-join to join datasets 1 and 2
d12 <- left_join(d01, d02, by = c("Year" = "Year", "Sex" = "Sex", "Type" = "Type",
"Cancer_Type" = "Cancer_Type"))
str(d12)
## 'data.frame': 9240 obs. of 45 variables:
## $ Year : int 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
## $ Sex : chr "Male" "Male" "Male" "Male" ...
## $ Type : chr "Incidence" "Incidence" "Incidence" "Incidence" ...
## $ Cancer_Type : chr "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
## $ Age_0_to_4.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_5_to_9.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_10_to_14.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_15_to_19.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_20_to_24.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_25_to_29.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_30_to_34.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_35_to_39.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_40_to_44.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_45_to_49.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_50_to_54.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_55_to_59.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_60_to_64.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_65_to_69.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_70_to_74.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_75_to_79.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_80_to_84.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_85..x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Unknown.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_0_to_4.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_5_to_9.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_10_to_14.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_15_to_19.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_20_to_24.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_25_to_29.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_30_to_34.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_35_to_39.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_40_to_44.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_45_to_49.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_50_to_54.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_55_to_59.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_60_to_64.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_65_to_69.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_70_to_74.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_75_to_79.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_80_to_84.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_85..y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Unknown.y : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_Aust: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_Segi: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_WHO : num NA NA NA NA NA NA NA NA NA NA ...
As shown above, there are 9240 observations with 45 variables. Note the “.x” and “.y” suffixes at the end of the Age* variables to separate the counts from the rates variables respectively, given that the same names were used in the original data sets.
In order to obtain a greater understanding of the merged data set, the following was performed:
Determined the types of variables and data structure using str(d12).
Checked the dimensions of the data frame using dim(d12).
Changed Sex variable from character to factor using factor(d12$Sex) and then ordered using levels().
Created a new subset of the data frame using select(-) to only use the required variables.
# i) Determined the types of variables and and data structure
str(d12)
## 'data.frame': 9240 obs. of 45 variables:
## $ Year : int 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
## $ Sex : chr "Male" "Male" "Male" "Male" ...
## $ Type : chr "Incidence" "Incidence" "Incidence" "Incidence" ...
## $ Cancer_Type : chr "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
## $ Age_0_to_4.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_5_to_9.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_10_to_14.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_15_to_19.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_20_to_24.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_25_to_29.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_30_to_34.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_35_to_39.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_40_to_44.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_45_to_49.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_50_to_54.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_55_to_59.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_60_to_64.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_65_to_69.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_70_to_74.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_75_to_79.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_80_to_84.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_85..x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Unknown.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_0_to_4.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_5_to_9.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_10_to_14.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_15_to_19.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_20_to_24.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_25_to_29.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_30_to_34.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_35_to_39.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_40_to_44.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_45_to_49.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_50_to_54.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_55_to_59.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_60_to_64.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_65_to_69.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_70_to_74.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_75_to_79.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_80_to_84.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_85..y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Unknown.y : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_Aust: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_Segi: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_Std_Rate_WHO : num NA NA NA NA NA NA NA NA NA NA ...
As shown above, the merged dataset is a dataframe which is comprised of 9240 observations and 45 variables. This dataset comprises of the following variables:
Age*.y - represents the actual rate of the cancer incident or mortality by age bands.
Age_Std_Rate* - additional classifcations of cancer incident/mortality rates.
# ii) Checked the dimensions of the data frame
dim(d12)
## [1] 9240 45
The output confirms that there are 9240 rows and 45 variables in the merged dataset.
# iii) Changed Sex variable from character to factor
d12$Sex <- factor(d12$Sex ,levels = c("Male", "Female", "Persons"))
str(d12$Sex)
## Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
The above output confirms that the Sex variable has been converted to that of a factor.
As not all the variables will be used, the following will be removed from the merged dataset through creation of a subset:
Age_Unknown.x (only know age categories have been considered)
Age_Unknown.y (only know age categories have been considered)
Age_Std_Rate_Aust (not present in dataset 1, so cannot be consolidated by age band)
Age_Std_Rate_Segi (not present in dataset 1, so cannot be consolidated by age band)
Age_Std_Rate_WHO (not present in dataset 1, so cannot be consolidated by age band)
# iv) Created a new subset of the data frame to only use the required variables
d12n <- d12
d12n %<>% select(-c(Age_Unknown.x, Age_Unknown.y, Age_Std_Rate_Aust, Age_Std_Rate_Segi, Age_Std_Rate_WHO))
str(d12n)
## 'data.frame': 9240 obs. of 40 variables:
## $ Year : int 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
## $ Sex : Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : chr "Incidence" "Incidence" "Incidence" "Incidence" ...
## $ Cancer_Type : chr "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
## $ Age_0_to_4.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_5_to_9.x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_10_to_14.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_15_to_19.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_20_to_24.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_25_to_29.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_30_to_34.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_35_to_39.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_40_to_44.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_45_to_49.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_50_to_54.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_55_to_59.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_60_to_64.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_65_to_69.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_70_to_74.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_75_to_79.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_80_to_84.x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_85..x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Age_0_to_4.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_5_to_9.y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_10_to_14.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_15_to_19.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_20_to_24.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_25_to_29.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_30_to_34.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_35_to_39.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_40_to_44.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_45_to_49.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_50_to_54.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_55_to_59.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_60_to_64.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_65_to_69.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_70_to_74.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_75_to_79.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_80_to_84.y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Age_85..y : num NA NA NA NA NA NA NA NA NA NA ...
As indicated above there are now 5 less variables, resulting in an amended count of 40 variables.
From examination of the merged dataset, the data is not 100% tidy, as theAge* variables violate the principle that all variables must have their own column. The Age* columns are not variables but values of cancer count and rate per age group.
In order to make the dataset tidy these variables need to be fixed using the pivot_longer function. But as there are many age bands, these bands will first be consolidated into the following 4 Age Categories using the transmute function:
Child (ages 0 to 9 years old)
Adolescent (ages 10 to 19 years old)
Adult (ages 20 to 64 years old)
Senior (ages 65 years old and above)
Note these will be duplicated into x and y as they represent the same categories but different values(integers for cancer counts versus numeric for cancer rates).
As I need to perform the transmute before the pivot_longer, I have decided to combine the 2 tidy and manipulate sections in the r markdown document together as Tidy and Manipulate Data I & II.
# separate age categories created for simplification
d12nt <- d12n
d12nt %<>%
transmute(Year, Sex, Type, Cancer_Type,
Child_x = Age_0_to_4.x + Age_5_to_9.x,
Adolescent_x = Age_10_to_14.x + Age_15_to_19.x,
Adult_x = Age_20_to_24.x + Age_25_to_29.x +
Age_30_to_34.x + Age_35_to_39.x + Age_40_to_44.x +
Age_45_to_49.x + Age_50_to_54.x + Age_55_to_59.x +
Age_60_to_64.x,
Senior_x = Age_65_to_69.x + Age_70_to_74.x + Age_75_to_79.x +
Age_80_to_84.x + Age_85..x,
Child_y = Age_0_to_4.y + Age_5_to_9.y,
Adolescent_y = Age_10_to_14.y + Age_15_to_19.y,
Adult_y = Age_20_to_24.y + Age_25_to_29.y +
Age_30_to_34.y + Age_35_to_39.y + Age_40_to_44.y +
Age_45_to_49.y + Age_50_to_54.y + Age_55_to_59.y +
Age_60_to_64.y,
Senior_y = Age_65_to_69.y + Age_70_to_74.y + Age_75_to_79.y +
Age_80_to_84.y + Age_85..y)
str(d12nt)
## 'data.frame': 9240 obs. of 12 variables:
## $ Year : int 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
## $ Sex : Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : chr "Incidence" "Incidence" "Incidence" "Incidence" ...
## $ Cancer_Type : chr "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
## $ Child_x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Adolescent_x: int NA NA NA NA NA NA NA NA NA NA ...
## $ Adult_x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Senior_x : int NA NA NA NA NA NA NA NA NA NA ...
## $ Child_y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Adolescent_y: num NA NA NA NA NA NA NA NA NA NA ...
## $ Adult_y : num NA NA NA NA NA NA NA NA NA NA ...
## $ Senior_y : num NA NA NA NA NA NA NA NA NA NA ...
As can be seen above there are consolidated Age Category columns Child_x, Adolescent_x, Adult_x, Senior_x, Child_y, Adolescent_y, Adult_y and Senior_y. These all need to be under a single Age Category column in order to meet tidy data principles. To do so the following will be performed:
Create counts dataset for cancer counts using a subset of merged dataset and renaming common Age Categories and then apply pivot_longer.
Create rates dataset for cancer rates using a subset of merged dataset and renaming common Age Categories and then apply pivot_longer.
Rejoin datasets to form merged dataset which meets tidy principles.
# a) Create counts dataset
counts <- d12nt #created counts table for pivoting_longer
counts %<>%
rename(Child = Child_x, Adolescent = Adolescent_x, Adult = Adult_x, Senior = Senior_x) %>%
select(Year, Sex, Type, Cancer_Type, Child, Adolescent, Adult, Senior) %>%
pivot_longer(- c(Year, Sex, Type, Cancer_Type),
names_to = "Age_Category", values_to = "Count")
# b) Create rates dataset
#created rates table for pivoting_longer and renamed age bands as above for consolidation
rates <- d12nt
rates %<>%
rename(Child = Child_y, Adolescent = Adolescent_y, Adult = Adult_y, Senior = Senior_y) %>%
pivot_longer(- c(Year, Sex, Type, Cancer_Type),
names_to = "Age_Category", values_to = "Rate")
# c) Rejoin counts and rates datasets
library(magrittr)
d12_final <- left_join(counts, rates, by = c("Year" = "Year", "Sex" = "Sex", "Type" = "Type",
"Cancer_Type" = "Cancer_Type",
"Age_Category" = "Age_Category" ))
# Convert Age Category from character to factor
d12_final$Age_Category <- factor(d12_final$Age_Category, levels = c("Child", "Adolescent",
"Adult", "Senior"))
str(d12_final)
## tibble [36,960 x 7] (S3: tbl_df/tbl/data.frame)
## $ Year : int [1:36960] 1968 1968 1968 1968 1969 1969 1969 1969 1970 1970 ...
## $ Sex : Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : chr [1:36960] "Incidence" "Incidence" "Incidence" "Incidence" ...
## $ Cancer_Type : chr [1:36960] "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
## $ Age_Category: Factor w/ 4 levels "Child","Adolescent",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Count : int [1:36960] NA NA NA NA NA NA NA NA NA NA ...
## $ Rate : num [1:36960] NA NA NA NA NA NA NA NA NA NA ...
As shown above, the updated merged data set is now tidy with a single Age Category for both cancer counts and rates.
This scan is for missing values, inconsistencies and obvious errors.
The approach undertaken to identify missing values is as follows:
Used function sum(is.na()) to get an overall feel of the magnitude of missing values.
Determined which variables have missing values using names(which(colSums(is.na())>0)).
Determined which years have missing values using summarise(count = n()).
Excluded missing values from dataset using completed.cases() function.
# a) Determine how many NA's are in the merged dataframe.
sum(is.na(d12_final)) # Determine how many NA's are in the merged dataframe.
## [1] 20648
As shown above there are 20648 NA values
# b) Determine which variables have missing values and how many records are missing.
colSums(is.na(d12_final))
## Year Sex Type Cancer_Type Age_Category Count
## 0 0 0 0 0 10324
## Rate
## 10324
names(which(colSums(is.na(d12_final))>0))
## [1] "Count" "Rate"
As indicated above the missing values relate to the Count and Rate variables and both result in a total of 10324 missing records as they are part of the same observation.
The original merged filed contained 36690 observations. By subtracting 10324 missing records, there should be 26636 remaining records with no missing data.
# c) Determine which years have missing values.
d12_final %>%
filter(is.na(Count) | is.na(Rate)) %>%
select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>%
group_by(Year) %>%
summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 44 x 2
## Year count
## <int> <int>
## 1 1968 472
## 2 1969 472
## 3 1970 472
## 4 1971 472
## 5 1972 472
## 6 1973 472
## 7 1974 472
## 8 1975 472
## 9 1976 472
## 10 1977 472
## # ... with 34 more rows
As seen above there are 472 NAs for the years 1968 to 1981 and 2011. These are as a result of no records being available. The remaining NAs are for the known years from 1982 to 2010 in diminishing amounts. These will all be addressed using the completed.cases() function, as this is the best approach rather than physically removing the rows in the dataset.
# d) Treatment of missing values using completed.cases function
d12_final_2 <- d12_final[complete.cases(d12_final),]
nrow(d12_final_2) # Get number of non-missing rows
## [1] 26636
sum(is.na(str(d12_final_2))) # Check that all NA have been excluded
## tibble [26,636 x 7] (S3: tbl_df/tbl/data.frame)
## $ Year : int [1:26636] 1982 1982 1982 1982 1983 1983 1983 1983 1984 1984 ...
## $ Sex : Factor w/ 3 levels "Male","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : chr [1:26636] "Incidence" "Incidence" "Incidence" "Incidence" ...
## $ Cancer_Type : chr [1:26636] "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" "Acute lymphoblastic leukaemia" ...
## $ Age_Category: Factor w/ 4 levels "Child","Adolescent",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Count : int [1:26636] 67 28 24 10 59 15 18 18 67 20 ...
## $ Rate : num [1:26636] 11.05 4.15 5.2 7.86 9.74 ...
## [1] 0
As shown above there are now 26,636 observations and none of these contain NA. This reconciles perfectly to what was calculated above.
This scan serves to scan the numeric data for outliers.
As both counts and rates are the individual integer and numeric variables, univariate outlier detection will be implemented.
Tukey’s method of outlier detection will be used
library(outliers) # for the scores function
library(stats)
# Generate boxplot - Count variable
d12_final_2$Count %>%
boxplot(main = "Box Plot of Cancer Counts", ylab = "Count", col = "red")
# Determine Mean and Median values for Count variable
summary(d12_final_2$Count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 15.0 282.1 220.0 13041.0
# Generate boxplot - Rate variable
d12_final_2$Rate %>%
boxplot(main = "Box Plot of Cancer Rates", ylab = "Count", col = "blue")
# Determine Mean and Median values for Rate variable
summary(d12_final_2$Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.825 92.435 56.991 6439.857
As seen from the info above there are a significant number of outliers above the mean of: 282.1 for Counts and 92.435 for Rates
library(stats)
# Determine IQR for the purpose of upper and lower fences for Count and Rate variables
count_iqr <- IQR(d12_final_2$Count) # Count variable IQR
count_iqr
## [1] 220
rate_iqr <- IQR(d12_final_2$Rate) # Count variable IQR
rate_iqr
## [1] 56.99056
As seen above the IQR for the Counts is 220, whilst the IQR for Rates is 56.99056. The ext step is to find out the upper and lower fences for the Count and Rate variables.
q1 <- quantile(d12_final_2$Count, probs = 0.25) # Lower Count quantile
q3 <- quantile(d12_final_2$Count, probs = 0.75) # Upper Count quantile
count_low_fence <- q1 - (1.5 * count_iqr) # Lower fence value for Count variable
count_up_fence <- q3 + (1.5 * count_iqr) # Upper fence value for Count variable
count_low_fence
## 25%
## -330
count_up_fence
## 75%
## 550
q1 <- quantile(d12_final_2$Rate, probs = 0.25) # Lower Rate quantile
q3 <- quantile(d12_final_2$Rate, probs = 0.75) # Upper Rate quantile
rate_low_fence <- q1 - (1.5 * rate_iqr) # Lower fence value for Rate variable
rate_up_fence <- q3 + (1.5 * rate_iqr) # Upper fence value for Rate variable
rate_low_fence
## 25%
## -85.48585
rate_up_fence
## 75%
## 142.4764
As can be seen above, the fence values for both variables are as follows:
| Fence | Count | Rate |
|---|---|---|
| Low | -330 | -85.48585 |
| Up | 550 | 142.4764 |
count_up_low_outliers <- which(d12_final_2$Count < count_low_fence) # Count outliers below low fence
count_up_outliers <- which(d12_final_2$Count > count_up_fence) # Count outliers above up fence
length(count_up_low_outliers) # Number of Count outliers below low fence
## [1] 0
length(count_up_outliers) # Number of Count outliers above the up fence
## [1] 3374
rate_up_low_outliers <- which(d12_final_2$Rate < count_low_fence) # Rate outliers below low fence
rate_up_outliers <- which(d12_final_2$Rate > count_up_fence) # Rate outliers above up fence
length(rate_up_low_outliers) # Number of Rate outliers below low fence
## [1] 0
length(rate_up_outliers) # Number of Rate outliers above the up fence
## [1] 1057
| Fence | Count | Rate |
|---|---|---|
| Low | 0 | 0 |
| Up | 3374 | 1057 |
The final step of Scan II is to impute these outliers with the mean of the Count and Rate values respectively.
#Impute Count outliers with mean of Count
d12_final_2$Count[count_up_outliers] <- mean(d12_final_2$Count)
length(d12_final_2$Count[count_up_outliers])
## [1] 3374
#Impute Rate outliers with mean of Rate
d12_final_2$Rate[rate_up_outliers] <- mean(d12_final_2$Rate)
length(d12_final_2$Rate[rate_up_outliers])
## [1] 1057
Notice the number of records updated for both variables aligns with the initial outlier amount.
d12_final_2$Count %>%
boxplot(main = "Box Plot of Cancer Counts", ylab = "Count", col = "green")
d12_final_2$Rate %>%
boxplot(main = "Box Plot of Cancer Rate", ylab = "Rate", col = "yellow")
Note amended boxplots above for Cancer Counts (green) and Cancer Rates (yellow) now have no outliers.
The variable chosen for the purpose of transformation is the Rates variable, as it better shows the result of the log10 transformation. In order to reduce right skewness of the Rates histogram, log10 transformation was applied.
rate <- d12_final_2$Rate
hist(rate,
main = "Histogram of Cancer Rates",
xlab = "Cancer Rate")
hist(log10(rate),
main = "Histogram of Log 10 Cancer Rates",
xlab = "Cancer Rate")
In order to simplify the output of the visualisations, only the top 6 in each category have been output. A total of 8 variables as inputs for the ggplot2() function were defined:
Child Count and Child Rates
Adolescent Count and Adolescent Rates
Adult Count and Adult Rates
Senior Count and Senior Rates
# Child selection and visualisation
d12_final_2_child <- d12_final_2
d12_final_2_child %<>%
select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>%
filter(Age_Category == "Child", Cancer_Type == "Acute lymphoblastic leukaemia" |
Cancer_Type == "Brain cancer" | Cancer_Type == "Kidney cancer" |
Cancer_Type == "Non-Hodgkin lymphoma" | Cancer_Type == "Acute myeloid leukaemia" |
Cancer_Type == "Liver cancer" ) %>%
group_by(Type, Cancer_Type, Age_Category) %>% summarise(Cum_Count = sum(Count),
Cum_Rate = sum(Rate)) %>%
arrange(desc(Cum_Count, Cum_Rate))
## `summarise()` regrouping output by 'Type', 'Cancer_Type' (override with `.groups` argument)
gchi <- ggplot(d12_final_2_child, aes(Cum_Count, Cancer_Type, colour =Type))
gchi + geom_count() + labs(x = "Count", y = "Cancer",
title = "Child Cancer Counts (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")
rchi <- ggplot(d12_final_2_child, aes(Cum_Rate, Cancer_Type, colour =Type)) +
scale_color_brewer(palette = "Paired")
rchi + geom_count() + labs(x = "Rate", y = "Cancer",
title = "Child Cancer Rates (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")
# Adolescent selection and visualisation
d12_final_2_ado <- d12_final_2
d12_final_2_ado %<>%
select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>%
filter(Age_Category == "Adolescent", Cancer_Type == "Melanoma of the skin" |
Cancer_Type == "Hodgkin lymphoma" | Cancer_Type == "Acute lymphoblastic leukaemia" |
Cancer_Type == "Brain cancer"| Cancer_Type == "Non-Hodgkin lymphoma" |
Cancer_Type == "Acute myeloid leukaemia") %>%
group_by(Type, Cancer_Type, Age_Category) %>% summarise(Cum_Count = sum(Count),
Cum_Rate = sum(Rate)) %>%
arrange(desc(Cum_Count, Cum_Rate))
## `summarise()` regrouping output by 'Type', 'Cancer_Type' (override with `.groups` argument)
gado <- ggplot(d12_final_2_ado, aes(Cum_Count, Cancer_Type, colour =Type))
gado + geom_count() + labs(x = "Count", y = "Cancer",
title = "Adolescent Cancer Counts (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")
rado <- ggplot(d12_final_2_ado, aes(Cum_Rate, Cancer_Type, colour =Type)) +
scale_color_brewer(palette = "Paired")
rado + geom_count() + labs(x = "Rate", y = "Cancer",
title = "Adolescent Cancer Rates (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")
# Adult selection and visualisation
d12_final_2_adult <- d12_final_2
d12_final_2_adult %<>%
select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>%
filter(Age_Category == "Adult", Cancer_Type == "Colon cancer" |
Cancer_Type == "Unknown primary site" | Cancer_Type == "Brain cancer" |
Cancer_Type == "Pancreatic cancer" | Cancer_Type == "Melanoma of the skin" |
Cancer_Type == "Stomach cancer") %>%
group_by(Type, Cancer_Type, Age_Category) %>% summarise(Cum_Count = sum(Count),
Cum_Rate = sum(Rate)) %>%
arrange(desc(Cum_Count, Cum_Rate))
## `summarise()` regrouping output by 'Type', 'Cancer_Type' (override with `.groups` argument)
gadu <- ggplot(d12_final_2_adult, aes(Cum_Count, Cancer_Type, colour =Type))
gadu + geom_count() + labs(x = "Count", y = "Cancer",
title = "Adult Cancer Counts (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")
radu <- ggplot(d12_final_2_adult, aes(Cum_Rate, Cancer_Type, colour =Type)) +
scale_color_brewer(palette = "Paired")
radu + geom_count() + labs(x = "Rate", y = "Cancer",
title = "Adult Cancer Rates (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")
# Senior selection and visualisation
d12_final_2_sen <- d12_final_2
d12_final_2_sen %<>%
select(Year, Sex, Type, Cancer_Type, Age_Category, Count, Rate) %>%
filter(Age_Category == "Senior", Cancer_Type == "Rectal cancer" |
Cancer_Type == "Non-Hodgkin lymphoma" |
Cancer_Type == "Head and neck including lip" |
Cancer_Type == "Head and neck excluding lip" | Cancer_Type == "Bladder cancer" |
Cancer_Type == "Oesophageal cancer") %>%
group_by(Type, Cancer_Type, Age_Category) %>% summarise(Cum_Count = sum(Count),
Cum_Rate = sum(Rate)) %>%
arrange(desc(Cum_Count, Cum_Rate))
## `summarise()` regrouping output by 'Type', 'Cancer_Type' (override with `.groups` argument)
gsen <- ggplot(d12_final_2_sen, aes(Cum_Count, Cancer_Type, colour =Type))
gsen + geom_count() + labs(x = "Count", y = "Cancer",
title = "Senior Cancer Counts (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")
rsen <- ggplot(d12_final_2_sen, aes(Cum_Rate, Cancer_Type, colour =Type)) +
scale_color_brewer(palette = "Paired")
rsen + geom_count() + labs(x = "Rate", y = "Cancer",
title = "Senior Cancer Rates (1982 to 2010)",
caption = "Source: Australian Institute of Health and Welfare")
| Age Category | Highest Mortality | Fastest Incident rate |
|---|---|---|
| Child | Acute lymphoblastic leukaemia | Acute lymphoblastic leukaemia |
| Adolescent | Acute lymphoblastic leukaemia | Melanoma of the skin |
| Adult | Colon cancer | Melanoma of the skin |
| Senior | Rectal cancer | Non-Hodgkin lymphoma |
Cancer related deaths certainly do vary by age category. At the lowest end of the scale, the biggest cancer related cause of death for children and adolescents is Acute lymphoblastic leukaemia. Melanoma of the skin have high rates of incident for Adults and Adolescents given the high exposure due to work and sport but low mortality. Adults are most at risk from colon cancer. Whilst for seniors Rectal cancer is the biggest killer with Non-Hodgkin lymphoma having the highest rates of incidents.
Whilst we cannot prevent and stop the spread of the silent assassin within our bodies, we can certainly reduce the opportunities for it to appear. We can do so by strengthening our bodies by eating foods low in fat and high in fibre, exercising regularly, getting enough daily sleep, minimising our exposure to the sun, refraining from tobacco use and going for annual medical check-ups. All of these little steps will assist in reducing the cancer mortality counts and rates in our society.
Cancer prevention: 7 tips to reduce your risk, Mayoclinic.org, viewed 22 February 2021, https://www.mayoclinic.org/healthy-lifestyle/adult-health/in-depth/cancer-prevention/art-20044816
Causes of Death, Australia, Australian Bureau of Statistics, viewed 18 February 2021, https://www.abs.gov.au/statistics/health/causes-death/causes-death-australia/latest-release
Australian Cancer Incidence and Mortality - 2015, Australian Institute of Health and Welfare, viewed 12 February 2021, https://data.gov.au/data/dataset/05696f6f-6ff5-42a2-904f-af5e4d1f56f8
Boehmke, BC, 2016, Data Wrangling with R, Springer International Publishing, Cham, Switzerland.
Wickham, H, & Grolemund, G 2016, R for Data Science : Import, Tidy, Transform, Visualize, and Model Data, O’Reilly Media, Incorporated, Sebastopol, CA.