Voiceover Presentation

The companion voiceover presentation for this assignment is available at:
https://drive.google.com/file/d/1S-KmhFFDcrO1Uij1WLiIfsWK6ZKzykBT/view?usp=sharing


Required packages

The necessary packages are loaded here:

library(magrittr)  # For pipe operators.
library(tidyr)     # For data wrangling.
library(dplyr)     # For data exploration & manipulation.
library(lubridate) # For dates and times.
library(stringr)   # For manipulation of strings.
library(Hmisc)     # For imputation.
library(outliers)  # For detecting univariate outliers.
library(MVN)       # For detecting multivariate outliers. 
library(forecast)  # For Box-Cox transformation.
library(ggplot2)   # For plots & visualisation.


options(warn=-1) # Turn off warnings at completion for clean export to report



Data

Data Description

Two datasets from the NSW Government Data.NSW repository are gathered for this scope. The “NSW COVID-19 Cases” dataset includes all observations of positive COVID cases in NSW by notification date and postcode, local health district, local government area and likely source of infection from 25 January 2020 at the outset of the pandemic. The “NSW COVID-19 Tests” dataset includes all PCR COVID tests undertaken in NSW by date and postcode, local health district, and local government area from 1 January 2020.

Both datasets are updated daily on weekdays to Data.NSW for Creative Commons exploration and inclusion in analytical projects. The sources for the dataset versions retrieved 8 October 2021 for this analysis are detailed below.

Although the two datasets share a number of variables, the original datasets serve slightly different purposes so we will merge them using bind_rows() to essentially append the observations of “tests” to the “cases” dataset aligned with intersecting variables for initial data wrangling, and resolve the convergence issues during tidying to ensure a useful combined dataset.


Dataset Summary: NSW COVID-19 Cases
Variable Name Variable Type Description
notification_date Ordinal Date the positive test result was reported to NSW Health
postcode Categorical Postcode of the patient as registered with Medicare
likely_source_of_infection Categorical Likely source of each COVID infection
lhd_2010_code Categorical NSW Local Health District code (per 2010 index)
lhd_2010_name Categorical NSW Local Health District (per 2010 index)
lga_code19 Categorical Local Government Area code (per 2019 index)
lga_name19 Categorical Local Government Area (per 2019 index)


Dataset Summary: NSW COVID-19 Tests
                Variable Name                 Variable Type Description
test_date Ordinal Date the COVID sample was taken
postcode Categorical Postcode of the patient as registered with Medicare
lhd_2010_code Categorical NSW Local Health District code (per 2010 index)
lhd_2010_name Categorical NSW Local Health District (per 2010 index)
lga_code19 Categorical Local Government Area code (per 2019 index)
lga_name19 Categorical Local Government Area (per 2019 index)
test_count Discrete The number of tests performed each day per LHD/LGA


References:


Approach
  • The datasets are imported with read.csv() to allow manual data manipulation without the smart reading capabilities of read_csv().
  • Head() and str() of both datasets are output to check the import and inspect the structure for data types and variable descriptions
  • Date variables in both datasets are renamed to align. Note the data is already input in the correct format so further date manipulation is not required.
  • The intersection of the two datasets is explored using intersect(names()).
  • The datasets are merged using bind_rows() and the structure of the combined dataset is output to confirm.


# Import datasets and print head
cases <- read.csv("data/confirmed_cases_table4_location_likely_source (2).csv")
tests <- read.csv("data/pcr_testing_table1_location_agg.csv")

head(cases)
head(tests)
# Inspect structure for data types and variable descriptions
str(cases)
## 'data.frame':    67480 obs. of  7 variables:
##  $ notification_date         : chr  "2020-01-25" "2020-01-25" "2020-01-25" "2020-01-27" ...
##  $ postcode                  : chr  "2121" "2071" "2134" "2033" ...
##  $ likely_source_of_infection: chr  "Overseas" "Overseas" "Overseas" "Overseas" ...
##  $ lhd_2010_code             : chr  "X760" "X760" "X700" "X720" ...
##  $ lhd_2010_name             : chr  "Northern Sydney" "Northern Sydney" "Sydney" "South Eastern Sydney" ...
##  $ lga_code19                : chr  "16260" "14500" "11300" "16550" ...
##  $ lga_name19                : chr  "Parramatta (C)" "Ku-ring-gai (A)" "Burwood (A)" "Randwick (C)" ...
str(tests)
## 'data.frame':    298859 obs. of  7 variables:
##  $ test_date    : chr  "2020-01-01" "2020-01-01" "2020-01-01" "2020-01-01" ...
##  $ postcode     : chr  "2038" "2039" "2040" "2041" ...
##  $ lhd_2010_code: chr  "X700" "X700" "X700" "X700" ...
##  $ lhd_2010_name: chr  "Sydney" "Sydney" "Sydney" "Sydney" ...
##  $ lga_code19   : chr  "14170" "14170" "14170" "14170" ...
##  $ lga_name19   : chr  "Inner West (A)" "Inner West (A)" "Inner West (A)" "Inner West (A)" ...
##  $ test_count   : int  1 1 2 1 1 1 1 1 1 2 ...
# Rename date variables in both datasets to align
cases %<>% rename(Date = notification_date)
tests %<>% rename(Date = test_date)

# Explore intersecting variables as options to merge datasets
intersect(names(cases), names(tests))
## [1] "Date"          "postcode"      "lhd_2010_code" "lhd_2010_name"
## [5] "lga_code19"    "lga_name19"
# Merge the datasets into a combined dataframe to be tidied in the next steps
covid <- bind_rows(cases, tests)

str(covid)
## 'data.frame':    366339 obs. of  8 variables:
##  $ Date                      : chr  "2020-01-25" "2020-01-25" "2020-01-25" "2020-01-27" ...
##  $ postcode                  : chr  "2121" "2071" "2134" "2033" ...
##  $ likely_source_of_infection: chr  "Overseas" "Overseas" "Overseas" "Overseas" ...
##  $ lhd_2010_code             : chr  "X760" "X760" "X700" "X720" ...
##  $ lhd_2010_name             : chr  "Northern Sydney" "Northern Sydney" "Sydney" "South Eastern Sydney" ...
##  $ lga_code19                : chr  "16260" "14500" "11300" "16550" ...
##  $ lga_name19                : chr  "Parramatta (C)" "Ku-ring-gai (A)" "Burwood (A)" "Randwick (C)" ...
##  $ test_count                : int  NA NA NA NA NA NA NA NA NA NA ...



Understand

The “covid” dataset is a data frame with 366339 observations of 8 variables.


Variable Types & Data Conversions

The Original data types identified via str() are all character, excepting “test_count” which is numeric (integer). The following conversions are undertaken:

  • The “Date” variable loaded as character is converted to Date using as.Date(). As the original input uses the separator “-” the format does not need to be specified.
  • The remaining variables loaded as character should be factors and are updated using as.factor() functions.
  • The secondary classification " (X) " is removed from the lga_name19 variable with str_sub().
  • The “likely_source_of_infection” factor variable is simplified via labelling in preparation for tidying of the dataset using levels() and list().
  • The dataset is subsetted to exclude unnecessary detail variables including “postcode”, “lhd_2010_code”, “lga_code19” and “lga_name19” as redundant for the purposes of this scope using square brackets.
  • The date range is subsetted to the overlap of both original datasets using subset().
  • Columns of the subsetted dataset are renamed for clarity using names().

The head() of the dataset is output for visual understanding.


# Find the dimensions and class of the dataset
dim(covid)
## [1] 366339      8
class(covid)
## [1] "data.frame"
# Explore the types of variables 
str(covid)
## 'data.frame':    366339 obs. of  8 variables:
##  $ Date                      : chr  "2020-01-25" "2020-01-25" "2020-01-25" "2020-01-27" ...
##  $ postcode                  : chr  "2121" "2071" "2134" "2033" ...
##  $ likely_source_of_infection: chr  "Overseas" "Overseas" "Overseas" "Overseas" ...
##  $ lhd_2010_code             : chr  "X760" "X760" "X700" "X720" ...
##  $ lhd_2010_name             : chr  "Northern Sydney" "Northern Sydney" "Sydney" "South Eastern Sydney" ...
##  $ lga_code19                : chr  "16260" "14500" "11300" "16550" ...
##  $ lga_name19                : chr  "Parramatta (C)" "Ku-ring-gai (A)" "Burwood (A)" "Randwick (C)" ...
##  $ test_count                : int  NA NA NA NA NA NA NA NA NA NA ...
# Convert Date to date variable type
covid$Date <- as.Date(covid$Date)

# Convert character variables to factors
cols_fact <- c("postcode", "likely_source_of_infection", "lhd_2010_code", "lhd_2010_name", "lga_code19", "lga_name19")
covid[cols_fact] <- lapply(covid[cols_fact], as.factor)

# Remove the secondary classification " (X) " from LGA Name variable
covid$lga_name19 <- str_sub(covid$lga_name19, start = 1, end = -5)

# Label the "likely_source_of_infection" variable for simplified tidying
levels(covid$likely_source_of_infection)
## [1] "Interstate"                                          
## [2] "Locally acquired - investigation ongoing"            
## [3] "Locally acquired - linked to known case or cluster"  
## [4] "Locally acquired - no links to known case or cluster"
## [5] "Overseas"
levels(covid$likely_source_of_infection) <- list(
            `Interstate Cases`  = "Interstate", 
            `Unlinked Cases` = "Locally acquired - investigation ongoing", 
            `Linked Cases` = "Locally acquired - linked to known case or cluster", 
            `Unlinked Cases` = "Locally acquired - no links to known case or cluster",
            `Overseas Cases` = "Overseas")

# Subset to remove redundant columns
covid <- covid[ ,c(1,3,5,8)]

# Subset to overlapping date ranges
covid <- subset(covid, Date > "2020-01-24" & Date < "2021-10-07")

# Rename columns for clarity
names(covid) <- c("Date", "Likely Source of Infection", "Local Health District", "Test Count")

head(covid)



Tidy & Manipulate Data I

According to tidy data principles, the following criteria must be met:

  1. Each observation must have a row
  2. Each variable has its own column
  3. Each value has its own cell


The original “NSW COVID-19 Cases” and “NSW COVID-19 Tests” datasets served slightly different purposes, which has created untidy data in the merging of those datasets as noted in the Data Description section above.

The “Test Count” variable in the “tests” dataset logs the number of tests completed for patients in each LHD per day, often in multiple data uploads per day. This essentially creates a “wide” dataset, as compared to the “long” format with individual observations for each case in the “cases” dataset.

It would be incorrect to pivot the “tests” dataset into individual observations by the “Test Count” variable. Instead, we can manipulate the data to find useful information by widening the combined dataset further.

The “Likely Source of Infection” variable is a factor with 4 labelled levels: “Unlinked”, “Linked”, “Interstate”, “Overseas”. To find useful information in the dataset we can convert each of those levels to a new variable, with each observation a unique date and LHD aligned with a condensed “Test Count” variable as the number of each case type in each LHD per day.


Approach
  • The datasets are separated to allow independent manipulation with is.na(), filter() and select() functions to extract the covid_cases observations which have NAs for “Test Count” and the covid_tests observations which have NAs for “Likely Source of Infection”.
  • The covid_cases data is pivoted wider by first creating an ID variable to use as values_from and extract the number of observations in each “Likely Source of Infection” factor level on each date and LHD as a length values_fn.
  • The covid_tests data is summarised with group_by() to create individual observations for each date and LHD.
  • With both datasets now tidied to support indiviudal observations by date and Local Health District, the two are re-merged using a full_join() with covid_tests on the left (as the larger dataset).
  • As a cross check to ensure observations aren’t duplicated or incorrect, we use arithmeitc functions to calculate the possible number of observations as the number of dates in the dataset multiplied by the number of LHDs, and confirm that the number of observations in the dataset does not exceed that maximum possible observations.

The tidied dataset is now 10232 observations of 7 variables, with a unique observation for each value (originally from either dataset) per date and Local Health District.


There are a number of issues with the dataset related to NAs that will be dealt with in subsequent sections of this report.


# Separate the datasets to allow independent manipulation
covid_cases <- covid %>% 
  filter(is.na(`Test Count`)) %>%
  select(-`Test Count`)

covid_tests <- covid %>% 
  filter(is.na(`Likely Source of Infection`)) %>%
  select(-`Likely Source of Infection`)

# Create an ID column then pivot wider using that to count instances
covid_cases <- covid_cases %>% mutate(ID = row_number())
covid_cases <- pivot_wider(covid_cases, names_from = `Likely Source of Infection`, values_from = ID, values_fn = length)

# Condense Test Counts to sum per Date and LHD
covid_tests %<>% group_by(Date, `Local Health District`) %>% 
  summarise(`Test Total` = sum(`Test Count`))
## `summarise()` has grouped output by 'Date'. You can override using the `.groups` argument.
# Rejoin the datasets
covid <- full_join(covid_tests, covid_cases, by = c("Date", "Local Health District"))

# Calculate full number of possible observations in the dataset
poss_days <- as.numeric(difftime(covid$Date[10232], covid$Date[1]))
poss_LHD <- unique(covid$`Local Health District`)

# Cross check that the possible observations is greater than the number in the dataset
poss_obs <- (poss_days*length(poss_LHD))
poss_obs >= length(covid$Date)
## [1] TRUE



Tidy & Manipulate Data II

There are two additional variables that may provide useful information and are added in this code:

To create “Total Cases” we use rowSums(), specifying na.rm = TRUE to prevent NAs from affecting the calculation.

Creating “Positivity Rate” is more complicated as dividing variable 8 by variable 3 creates a list, so unlist() is required and the entire operation occurs within a paste() formula to output the variable as a percentage.

The head() of the relevant variables is output for visual understanding.


# Create a new variable for the sum of reported cases in each observation
covid$`Total Cases` <- rowSums(covid[ , 4:7], na.rm = TRUE)

# Create a new variable for positivity rate for each observation
covid$`Positivity Rate` <- paste(round(100*(unlist(covid[,8]/covid[,3])),2), "%", sep = "")

# Print the head of relevant variables
head(covid[ ,c(1:3,8:9)])



Scan I

The original and merged datasets contain a number of missing values. First we review all the variables with colSums(is.na()) to get an overview of the dataset.


Major Issues

The lack of missing values in “Local Health District” and the number of missing values in “Test Total” indicate an obvious error in the dataset. Noting that the datasets were joined with a full_join() by “Date” and “Local Health District” it is logical that any incompatible observations will be shown at the end of the dataset.

With tail() we can see that there are a number of observations for case numbers in “Correctional settings” that is not correlated to official NSW Local Health Districts. Additionally we note there are a number of Overseas acquired cases that are not correlated to “Test Counts” or Local Health Districts.

As the focus of the analysis is to understand tests and cases by date and Local Health District, these observations are redundant and removed by first checking the levels() of the factor variable to identify the non-conforming observations, and subsetting those levels.


Minor Issues

The remaining NAs are numeric:

  • “Total Tests” may have NAs where there are cases recorded Overseas Cases, but where testing was carried out outside NSW Health remit, so these are imputed with the variable mean with {Hmisc} impute() function.
  • Where the “Total Cases” variable was created above as an arithmetic sum of Overseas, Interstate, Linked and Unliked Cases variables, any NAs in these variables must be zero.


At completion of dealing with NAs we run a check to confirm that the number of variables containing NAs is zero with an ifelse() function.


# Summarise missing values in all variables
colSums(is.na(covid))
##                  Date Local Health District            Test Total 
##                     0                     0                   412 
##        Overseas Cases          Linked Cases        Unlinked Cases 
##                  8780                  8621                  9077 
##      Interstate Cases           Total Cases       Positivity Rate 
##                 10146                     0                     0
# Output the tail of the dataset to overview observations that incomplete in passing the join function 
tail(covid)
# Check levels to identify and then remove non-conforming LHD observations
levels(covid$`Local Health District`)
##  [1] ""                      "Central Coast"         "Correctional settings"
##  [4] "Far West"              "Hunter New England"    "Illawarra Shoalhaven" 
##  [7] "Mid North Coast"       "Murrumbidgee"          "Nepean Blue Mountains"
## [10] "Network with Vic"      "Northern NSW"          "Northern Sydney"      
## [13] "South Eastern Sydney"  "South Western Sydney"  "Southern NSW"         
## [16] "Sydney"                "Western NSW"           "Western Sydney"
covid <- subset(covid,!(`Local Health District` %in% c("", "Correctional settings", "Network with Vic")))

# Impute "Total Tests" NAs with the variable mean
covid$`Test Total`[is.na(covid$`Test Total`)] <- impute(covid$`Test Total`, fun = mean())

# All remaining NAs are arithmetic and converted to zero
covid[is.na(covid)] <- 0


# Check that there are no columns remaining containing missing values
if(length(colnames(covid)[colSums(is.na(covid)) > 0]) == 0){
  "TRUE"
} else {
  "FALSE"
}   
## [1] "TRUE"



Scan II

In this section we will scan Total Tests and Total Cases for univariate outliers, and the Linked and Unlinked Cases variables in the current outbreak for multivariate outliers.

For univariate outliers we will apply different methodologies: manual and Z-score.

In both cases we first create a histogram with hist() to check distribution.


For multivariate outliers we will look at Linked and Unlinked locally acquired cases with the {mvn} package. To extract useful information we will reduce the dataset further to relate to the current outbreak (20 June 2021 - present).

A boxplot of the two variables shows complex distribution and a large number of outliers. A “quan” multivariate outlier method in mvn() calls out 791 outliers. As a relatively large number of outliers, we export a copy of the data with no outliers to a new object for consideration in potential future analyses.


Note the “Overseas Cases” and “Interstate Cases” variables are excluded as having an IQR of 0.


# Create a histogram to visualise the data distribution for Test Total
hist(covid$`Test Total`, breaks = 100, main = "Distribution of Test Total", xlab = "Test Total")

# Manually identify univariate outliers
summary(covid$`Test Total`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     207     598    1877    1542   38458
q1 <- summary(covid$`Test Total`)[[2]]
q3 <- summary(covid$`Test Total`)[[5]]

iqr <- q3-q1

lower_fence <- q1 - (1.5*iqr)
upper_fence <- q3 + (1.5*iqr)

up_outliers <- which(covid$`Test Total` > upper_fence)
low_outliers <- which(covid$`Test Total` < lower_fence)

# Output the number of outliers in either direction
length(covid$`Test Total`[up_outliers])
## [1] 991
length(covid$`Test Total`[low_outliers])
## [1] 0
# Resolve the outliers by imputing to the mean
covid$`Test Total`[up_outliers] <- mean(covid$`Test Total`)


# Create a histogram to visualise the data distribution for Total Cases
hist(covid$`Total Cases`, breaks = 100, main = "Distribution of Total Cases", xlab = "Total Cases")

z.scores <- covid$`Total Cases` %>%
  outliers::scores(type = "z")

# Output the number of outliers in either direction
length(which(abs(z.scores) >3))
## [1] 139
# Remove the observations with outliers by subsetting
covid <- covid[-(covid$`Total Cases`[which(abs(z.scores)>3)]), ]



# Reduce the dataset to current outbreak date range
covid <- subset(covid, Date > "2021-06-20" & Date < "2021-10-07")

# Create a boxplot to visualise the data distribution for Linked and Unlinked cases
boxplot(covid$`Linked Cases` ~ covid$`Unlinked Cases`,
        main = "COVID Linked Cases by Unlinked Cases", ylab = "Linked Cases", xlab = "Unlinked Cases")

# Subset the data to relevant variables and find the multivariate outliers
covid_sub <- covid %>% 
  ungroup() %>%
  select(`Linked Cases`, `Unlinked Cases`)
covid_sub <- covid_sub[complete.cases(covid_sub), ]

results_sub <- covid_sub %>%
  MVN::mvn(multivariateOutlierMethod = "quan",
           showOutliers = TRUE,
           showNewData = TRUE)

cases_outliers <- results_sub$multivariateOutliers

# Send the identified outliers to a clean object
cases_no_outliers <- results_sub$newData



Transform

Statistical inference requires normal distribution of variables and homogeneity of variances. Data transformation may be required:

In the covid dataset we will look at the distribution of the “Total Cases” variable via a number of methodologies to explore the comparative effectiveness in normalising the data distribution.

First we reduce the data to the peak of the current outbreak (month of September 2021) to reduce data redundancy. A histogram hist() to visualise the distribution of the data which is dramatically right-skewed.


For the sake of exploration and demonstrating that other transformation techniques are not appropriate, we also look at a Box-Cox transformation and a normalisation technique.


# Subset for data reduction to reduce redundancy
covid <- subset(covid, Date > "2021-09-01" & Date < "2021-09-30")


# Visualise the distribution
hist(covid$`Total Cases`, main = "Histogram of Total Cases in Sept 2021", xlab = "Total Cases Distribution", breaks = 50)

# Square root transformation
sqrt_covid_cases <- sqrt(covid$`Total Cases`)
hist(sqrt_covid_cases, main = "Histogram of Square Root Transformed Total Cases", xlab = "Transformed Total Cases Distribution", breaks = 50)

# Reciprocal transformation
recip_covid_cases <- 1/(covid$`Total Cases`)
hist(recip_covid_cases, main = "Histogram of Reciprocal Transformed Total Cases", xlab = "Transformed Total Cases Distribution", breaks = 50)

# Logarithmic transformation
log_covid_cases <- log10(covid$`Total Cases`)
hist(log_covid_cases, main = "Histogram of Log10 Transformed Total Cases", xlab = "Transformed Total Cases Distribution", breaks = 50)

# Test results with Box-Cox transformation
boxcox_covid_cases <- BoxCox(covid$`Total Cases`, lambda = "auto")

# Extract the ideal lambda to a new object
lambda_covid_cases <- attr(boxcox_covid_cases, which = "lambda")

# View on a histogram
p1 <- ggplot(boxcox_covid_cases %>% as.data.frame()) 

p1 + 
  geom_histogram(aes(x = .), 
                 bins = 50, colour = "black") + 
  labs(title = "Histogram of Transformed Total Cases, using a Box-Cox Transformation", 
       subtitle = bquote(~ lambda == "0.159795147..."), 
       x = "Transformed Total Cases")

# Test results with min_max_norm normalisation
min_max_norm <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) 
  
}

hist(min_max_norm(covid$`Total Cases`), main = "Histogram of Min-Max Normalised Total Cases", xlab = "Normalised Total Cases Distribution", breaks = 50)