Executive Summary

Three sets of creative commons open source data were selected. Two were from the Australian Bureau of Statistics (ABS) and one from NSW Health. Both sources are reputable. Each data set was tidied, cleaned and subset. The data structure and types of variables were examined. Cleaning included scanning for and removal of special and NA values. Tidy data refers to a standardised structure for data sets (Wickham, 2014). Tidy data has three basic principles. Firstly, each variable forms a column. Secondly, each observation forms a row. Thirdly, each type of observational unit forms a table. Data structure can be untidy in a number of ways including variables stored in both rows and columns and multiple variables stored in one column. The Covid data from NSW Health is a good example of multiple variables stored in one column. The data set was tidied by separating attributes that represented two variables (LGA and class and testing outcome and total tests).

After initial checks the three data sets were merged and checked again.When necessary new variables where mutated to form new variables to take into account proportion of testing per population and positive cases per population. Distribution and outliers were the examined using descriptive statistics, histograms and boxplots. Deletion, capping or imputation of outliers was not appropriate as they were of interest and could relate to hotspots of Covid cases and be useful to analyse. The distribution of most variables were skewed to the right and there was a high zero count. the local government areas with zero case counts were examined and were smaller in size and rural areas. Zeros were removed to enable transformation. A log10 function was applied across numeric variables and improved the distribution. Data types including factors were converted where necessary.

Required Packages

library(dplyr)
library(tidyverse)
library(readr)
library(ggplot2)
library(forecast)
library(MASS)
library(car)

Data Sets

The three data sets include information on population grouped by gender and local government area from the ABS (2019) and Covid (NSW Health, 2020) testing and test results from the New South Wales Health.

ABS data sets - Regional Population, Age, Sex.

The gender population data contained two data sets, one with female data and one with male data. Both sets included 559 observations for all of Australia and 23 variables. The data included estimated population figures for June 2019. As this study focused on NSW, observations from other states were removed when the data was subset. 129 observations remained. Three variables were relevant including:

Local Government Area (LGA): A categorical variable that identifies the name of the local government area.
Local Government ID (LGAID): Categorical variable of an ID number that uniquely identifies LGA.
Total Population by LGA (totalmales/totalfemales): Numeric variable that observes total population for males and females for each LGA depending on the table.

NSW Health - NSW Covid 19 Tests by Location and Result

The Covid data set included 539,050 observations of 7 variables. The data includes information on all Covid tests in NSW up to June 2020. As such, data was collected at the case level and will need to be grouped by LGA. In addition, the data set lacked information regarding the Unincorporated NSW region which was in the ABS population data. Three variables were relevant including:

lga_name19(LGA): Categorical variable including area name. LGA will be used as a key to join the data sets. This variable is untidy as it includes two variables in one cell, LGA name and the class of the LGA (ie Area(A), City (C)). This will be split into two variables (LGA and class) and converted to factor type data.
lga_code19(LGAID): A categorical variable which is an unique id number and also could be used as a key to link the population and Covid data sets.
Results(result): A categorical variable relating to testing results including two possible outcomes tested and excluded and tested and confirmed. This combines testing information with outcome of testing information. This variable is untidy and will be spread into two variables Case - Confirmed and Tested & excluded. A mutation of the two new variables will create a total number of tests variable (totaltests).

#reading in files

agemale<-read.csv("agemales.csv")
agefemale <- read.csv("agefemales.csv")
covid<-read.csv("covid19.csv")

As the variables were similar in both the data sets from the ABS, the structure and size of the data sets were checked. All three data sets had variables that included the keys of local government area name and local government ID code. These keys will allow for merging the three data sets.

#Checking structure of files

str(agemale)

## 'data.frame':    559 obs. of  23 variables:
##  $ Australian.Bureau.of.Statistics: chr  "3235.0 Regional Population by Age and Sex, Australia" "Released at 11.30am (Canberra time) 28 August 2020" "Table 1. Estimated Resident Population by Age, by Local Government Area, Males – 30 June 2019" "" ...
##  $ X                              : chr  "" "" "" "" ...
##  $ X.1                            : chr  "" "" "" "" ...
##  $ X.2                            : chr  "" "" "" "" ...
##  $ X.3                            : chr  "" "" "" "" ...
##  $ X.4                            : chr  "" "" "" "" ...
##  $ X.5                            : chr  "" "" "" "" ...
##  $ X.6                            : chr  "" "" "" "" ...
##  $ X.7                            : chr  "" "" "" "" ...
##  $ X.8                            : chr  "" "" "" "" ...
##  $ X.9                            : chr  "" "" "" "" ...
##  $ X.10                           : chr  "" "" "" "" ...
##  $ X.11                           : chr  "" "" "" "" ...
##  $ X.12                           : chr  "" "" "" "" ...
##  $ X.13                           : chr  "" "" "" "" ...
##  $ X.14                           : chr  "" "" "" "" ...
##  $ X.15                           : chr  "" "" "" "" ...
##  $ X.16                           : chr  "" "" "" "" ...
##  $ X.17                           : chr  "" "" "" "" ...
##  $ X.18                           : chr  "" "" "" "" ...
##  $ X.19                           : chr  "" "" "" "" ...
##  $ X.20                           : chr  "" "" "" "" ...
##  $ X.21                           : chr  "" "" "" "" ...

head(agemale)

##                                                                 Australian.Bureau.of.Statistics
## 1                                          3235.0 Regional Population by Age and Sex, Australia
## 2                                            Released at 11.30am (Canberra time) 28 August 2020
## 3 Table 1. Estimated Resident Population by Age, by Local Government Area, Males – 30 June 2019
## 4                                                                                              
## 5                                                                                              
## 6                                                                                              
##   X X.1 X.2               X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14
## 1                                                                             
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6           Age group (Years)                                                 
##   X.15 X.16 X.17 X.18 X.19 X.20 X.21
## 1                                   
## 2                                   
## 3                                   
## 4                                   
## 5                                   
## 6

summary(agemale)

##  Australian.Bureau.of.Statistics      X                 X.1           
##  Length:559                      Length:559         Length:559        
##  Class :character                Class :character   Class :character  
##  Mode  :character                Mode  :character   Mode  :character  
##      X.2                X.3                X.4                X.5           
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X.6                X.7                X.8                X.9           
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X.10               X.11               X.12               X.13          
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X.14               X.15               X.16               X.17          
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X.18               X.19               X.20               X.21          
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

str(agefemale)

## 'data.frame':    559 obs. of  23 variables:
##  $ Australian.Bureau.of.Statistics: chr  "3235.0 Regional Population by Age and Sex, Australia" "Released at 11.30am (Canberra time) 28 August 2020" "Table 2. Estimated Resident Population by Age, by Local Government Area, Females – 30 June 2019" "" ...
##  $ X                              : chr  "" "" "" "" ...
##  $ X.1                            : chr  "" "" "" "" ...
##  $ X.2                            : chr  "" "" "" "" ...
##  $ X.3                            : chr  "" "" "" "" ...
##  $ X.4                            : chr  "" "" "" "" ...
##  $ X.5                            : chr  "" "" "" "" ...
##  $ X.6                            : chr  "" "" "" "" ...
##  $ X.7                            : chr  "" "" "" "" ...
##  $ X.8                            : chr  "" "" "" "" ...
##  $ X.9                            : chr  "" "" "" "" ...
##  $ X.10                           : chr  "" "" "" "" ...
##  $ X.11                           : chr  "" "" "" "" ...
##  $ X.12                           : chr  "" "" "" "" ...
##  $ X.13                           : chr  "" "" "" "" ...
##  $ X.14                           : chr  "" "" "" "" ...
##  $ X.15                           : chr  "" "" "" "" ...
##  $ X.16                           : chr  "" "" "" "" ...
##  $ X.17                           : chr  "" "" "" "" ...
##  $ X.18                           : chr  "" "" "" "" ...
##  $ X.19                           : chr  "" "" "" "" ...
##  $ X.20                           : chr  "" "" "" "" ...
##  $ X.21                           : chr  "" "" "" "" ...

head(agefemale)

##                                                                   Australian.Bureau.of.Statistics
## 1                                            3235.0 Regional Population by Age and Sex, Australia
## 2                                              Released at 11.30am (Canberra time) 28 August 2020
## 3 Table 2. Estimated Resident Population by Age, by Local Government Area, Females – 30 June 2019
## 4                                                                                                
## 5                                                                                                
## 6                                                                                                
##   X X.1 X.2               X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14
## 1                                                                             
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6           Age group (Years)                                                 
##   X.15 X.16 X.17 X.18 X.19 X.20 X.21
## 1                                   
## 2                                   
## 3                                   
## 4                                   
## 5                                   
## 6

summary(agefemale)

##  Australian.Bureau.of.Statistics      X                 X.1           
##  Length:559                      Length:559         Length:559        
##  Class :character                Class :character   Class :character  
##  Mode  :character                Mode  :character   Mode  :character  
##      X.2                X.3                X.4                X.5           
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X.6                X.7                X.8                X.9           
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X.10               X.11               X.12               X.13          
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X.14               X.15               X.16               X.17          
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      X.18               X.19               X.20               X.21          
##  Length:559         Length:559         Length:559         Length:559        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

ABS Data Cleaning

The raw data from the ABS data sets included 23 variables. 20 variables related to age range, groups and state identification. The tables included superfluous meta data, empty rows and data from other states. As such, a subset of desired variables was created from each data set. This included local government identification number (LGAID), local government area name (LGA) and total population by LGA for both males and females (malepop, femalepop) in NSW. NAs and special characters were then checked.

#Subsetting, scanning and renaming variables
agemales<-agemale[10:138,]
agemales<-data.frame(agemales)
rmarkdown::paged_table(head(agemales, 5))

sum(is.na(agemales))

## [1] 0

sum(is.nan(as.matrix(agemales)))

## [1] 0

sum(is.infinite(as.matrix(agemales)))

## [1] 0

colnames(agemales)

##  [1] "Australian.Bureau.of.Statistics" "X"                              
##  [3] "X.1"                             "X.2"                            
##  [5] "X.3"                             "X.4"                            
##  [7] "X.5"                             "X.6"                            
##  [9] "X.7"                             "X.8"                            
## [11] "X.9"                             "X.10"                           
## [13] "X.11"                            "X.12"                           
## [15] "X.13"                            "X.14"                           
## [17] "X.15"                            "X.16"                           
## [19] "X.17"                            "X.18"                           
## [21] "X.19"                            "X.20"                           
## [23] "X.21"

agemales<-agemales%>%rename(LGAID = "X.1", LGA = "X.2", malepop = "X.21")
male.lga.pop<-data.frame(agemales$LGAID, agemales$LGA, agemales$malepop)
rmarkdown::paged_table(head(male.lga.pop, 5))

agefemales<-agefemale[10:138,]
agefemales<-data.frame(agefemales)
sum(is.na(agefemales))

## [1] 0

sum(is.nan(as.matrix(agefemales)))

## [1] 0

sum(is.infinite(as.matrix(agefemales)))

## [1] 0

colnames(agefemales)

##  [1] "Australian.Bureau.of.Statistics" "X"                              
##  [3] "X.1"                             "X.2"                            
##  [5] "X.3"                             "X.4"                            
##  [7] "X.5"                             "X.6"                            
##  [9] "X.7"                             "X.8"                            
## [11] "X.9"                             "X.10"                           
## [13] "X.11"                            "X.12"                           
## [15] "X.13"                            "X.14"                           
## [17] "X.15"                            "X.16"                           
## [19] "X.17"                            "X.18"                           
## [21] "X.19"                            "X.20"                           
## [23] "X.21"

agefemales<-agefemales%>%rename(LGAID = "X.1", LGA = "X.2", femalepop = "X.21")
female.lga.pop<-data.frame(agefemales$LGAID, agefemales$LGA, agefemales$femalepop)
rmarkdown::paged_table(head(female.lga.pop, 5))

A left join was then used to merge the two tables on the keys of LGA ID code and name. The variables were renamed and the new data frame checked using the head function. Population variables were converted from character type to integer.

#Join and conversion of data type.
dfpop<-left_join(male.lga.pop, female.lga.pop, by = c("agemales.LGAID" = "agefemales.LGAID", "agemales.LGA"= "agefemales.LGA"))

dfpop<-dfpop %>% rename(LGAID = "agemales.LGAID", LGA ="agemales.LGA", male = "agemales.malepop",female = "agefemales.femalepop")

rmarkdown::paged_table(head(dfpop, 5))

dfpop$male<-as.integer(as.character(dfpop$male))
dfpop$female<-as.integer(as.character(dfpop$female))

sapply(dfpop, class)

##       LGAID         LGA        male      female 
## "character" "character"   "integer"   "integer"

A new variable (totalpop) was created combining the female and male population variables.

#Mutation

totalpop<-dfpop%>% mutate(totalpop = female + male)
rmarkdown::paged_table(head(totalpop, 5))

NSW Health Data Cleaning

The Covid 19 data had 7 variables including the two keys LGA ID (lga_code19) and name of LGA (lga_name19) . The data set was an example of untidy data. The results variable included information about testing as well as results of the tests and the LGA variable included information on LGA name and LGA class. As such, multiple variables have been stored in single columns. These variables will be separated into two variables. The data was ordered by case and needed to be collapsed into cases grouped by LGA. The data frame was checked, subsetted and grouped by LGA in preparation for a join with the population data. Missing values and special values were checked. There were a number of rows with results but lacked LGA details. These may have been related to cases from nonresidents/overseas cases without an LGA. NAs were spread across rows. Rows with missing values were omitted as NAs made up less than 5% of the data set.

#Scanning

str(covid)

## 'data.frame':    539050 obs. of  7 variables:
##  $ test_date    : chr  "1/01/2020" "1/01/2020" "1/01/2020" "1/01/2020" ...
##  $ postcode     : int  2039 2040 2069 2110 2190 2071 2485 2151 2134 2207 ...
##  $ lhd_2010_code: chr  "X700" "X700" "X760" "X760" ...
##  $ lhd_2010_name: chr  "Sydney" "Sydney" "Northern Sydney" "Northern Sydney" ...
##  $ lga_code19   : int  14170 14170 14500 14100 11570 14500 17550 16260 11300 10500 ...
##  $ lga_name19   : chr  "Inner West (A)" "Inner West (A)" "Ku-ring-gai (A)" "Hunters Hill (A)" ...
##  $ result       : chr  "Tested & excluded" "Tested & excluded" "Tested & excluded" "Tested & excluded" ...

rmarkdown::paged_table(head(covid, 5))

summary(covid)

##   test_date            postcode     lhd_2010_code      lhd_2010_name     
##  Length:539050      Min.   :1021    Length:539050      Length:539050     
##  Class :character   1st Qu.:2114    Class :character   Class :character  
##  Mode  :character   Median :2232    Mode  :character   Mode  :character  
##                     Mean   :2309                                         
##                     3rd Qu.:2502                                         
##                     Max.   :4385                                         
##                     NA's   :22958                                        
##    lga_code19     lga_name19           result         
##  Min.   :10050   Length:539050      Length:539050     
##  1st Qu.:11720   Class :character   Class :character  
##  Median :14900   Mode  :character   Mode  :character  
##  Mean   :14641                                        
##  3rd Qu.:16700                                        
##  Max.   :18710                                        
##  NA's   :23825

sum(is.na(covid))

## [1] 46783

sum(is.nan(as.matrix(covid)))

## [1] 0

sum(is.infinite(as.matrix(covid)))

## [1] 0

cleancovid<-na.omit(covid)

The data set was grouped by result and LGA. The result variable was then split into two variables, tested and confirmed (Case - Confirmed) and tested and excluded (Tested & excluded). The new data frame was checked. NAs only existed in the case confirmed. It was assumed NAs represented areas without confirmed cases. As such, NA values were converted to 0. A total test by LGA variable was created using the mutate function by adding the Case - Confirmed and Tested & excluded result variables.

#Grouping by LGA and mutation.

groupcovidtest<-cleancovid %>% group_by(lga_name19, result) %>% count()
spreadcovidna<-spread(groupcovidtest, result, n)
spreadcovidna[is.na(spreadcovidna)] <- 0
rmarkdown::paged_table(head(spreadcovidna, 5))

totaltest<-spreadcovidna %>% mutate(totaltests = `Case - Confirmed` + `Tested & excluded`)
rmarkdown::paged_table(head(totaltest, 5))

The Covid dataset was then joined to the ABS population data and checked. The Covid dataframe lacked data concerning the Unincorporated NSW region and the row was removed. There were no other NA results.

#Join and rescanning new data set.

covidjoin<- left_join (totalpop, totaltest, by = c("LGA" = "lga_name19"))
rmarkdown::paged_table(head(covidjoin, 5))

sum(is.na(covidjoin))

## [1] 3

cleancovidjoin<-na.omit(covidjoin)
sum(is.na(cleancovidjoin))

## [1] 0

rmarkdown::paged_table(head(cleancovidjoin, 5))

Three new variables were created using mutate: proportion tests per population number(testbypop), proportion of cases per population (casebypop) and proportion cases by test(casebytest).

#Mutation

covidpoptest<-cleancovidjoin %>% mutate(testbypop = round(totaltests / totalpop*100, 2))
rmarkdown::paged_table(head(covidpoptest, 5))

covidbypop<-covidpoptest %>% mutate(casebypop = round(`Case - Confirmed` / totalpop*100, 2))
rmarkdown::paged_table(head(covidbypop, 5))

covidfinal<-covidbypop %>% mutate(casesbytest = round(`Case - Confirmed` / totaltests*100, 2))
rmarkdown::paged_table(head(covidfinal, 5))

Summary statistics were examine and the means and the median suggest skewed distribution. Three variables related to cases had minimum zero values and this could impact transformation methods.

#Summary statistics

summary(covidfinal)

##     LGAID               LGA                 male            female      
##  Length:128         Length:128         Min.   :   857   Min.   :   754  
##  Class :character   Class :character   1st Qu.:  4419   1st Qu.:  4330  
##  Mode  :character   Mode  :character   Median : 12206   Median : 12235  
##                                        Mean   : 31369   Mean   : 31824  
##                                        3rd Qu.: 37738   3rd Qu.: 39999  
##                                        Max.   :189987   Max.   :187930  
##     totalpop      Case - Confirmed Tested & excluded   totaltests     
##  Min.   :  1611   Min.   :  0.00   Min.   :   24     Min.   :   24.0  
##  1st Qu.:  8784   1st Qu.:  1.00   1st Qu.:  333     1st Qu.:  335.2  
##  Median : 24358   Median :  5.00   Median : 1428     Median : 1436.0  
##  Mean   : 63194   Mean   : 23.23   Mean   : 4002     Mean   : 4025.2  
##  3rd Qu.: 77737   3rd Qu.: 32.00   3rd Qu.: 5109     3rd Qu.: 5141.5  
##  Max.   :377917   Max.   :185.00   Max.   :23596     Max.   :23713.0  
##    testbypop        casebypop        casesbytest    
##  Min.   : 0.200   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 3.888   1st Qu.:0.01000   1st Qu.:0.1575  
##  Median : 4.990   Median :0.02000   Median :0.4250  
##  Mean   : 5.389   Mean   :0.02859   Mean   :0.4619  
##  3rd Qu.: 6.492   3rd Qu.:0.04000   3rd Qu.:0.6525  
##  Max.   :15.990   Max.   :0.25000   Max.   :2.1900

round(apply(covidfinal,2,sd),3)

##             LGAID               LGA              male            female 
##          2521.955                NA         41389.383         41585.820 
##          totalpop  Case - Confirmed Tested & excluded        totaltests 
##         82954.309            36.906          5530.138          5563.156 
##         testbypop         casebypop       casesbytest 
##             2.462             0.033             0.426

created new variable poplation size by quartile and converted it to factor and spread LGA into LGA name and LGA class and converted class to factor. This may assist in dealing with the differences in LGA population sizes.

#Mutation

covidfinalquart<-covidfinal %>% mutate(quartile = ntile(totalpop, 4))

covidfinalquart$quartile<-factor(covidfinalquart$quartile, ordered = TRUE, levels = c(1,2,3,4), labels = c("low", "medium", "high", "very_high"))

covidcomplete<-covidfinalquart %>% separate (LGA, c("LGA", "class"),sep ="[(*)]") 

covidcomplete$class<-factor(covidcomplete$class)

rmarkdown::paged_table(head(covidcomplete, 5))

Distributions and Outliers

The data was then checked for outliers and distributions. Histograms showed the distribution of the data skewed to the right. The study will focus on cases and tests by population size. Gender was not going to be a focus of this study.

There were high zero counts in variables relating to case testing ie confirmed cases, cases by test and cases by population. These issues created challenges for dealing with dealing outliers and the process of transforming the data to normal distributions.

There are several reasons outliers may exist including data entry error, measurement errors, extreme natural variation, preprocessing errors and sampling error. Boxplots were used to check for univariate outliers. Using this technique, outliers are defined as being outside Q1-1.5IQR and Q3+1.5IQR. Outliers existed in the upper boundariea of all numeric variables. Further investigation, suggested the outliers to be values may be connected to virus hot spots. The outlier values were deemed to be important and not due to errors of observation. Outliers observations were checked for obvious error. It was decided cannot be imputed, capped or deleted as these outliers were of interest.

In an attempt to gain some understanding about the process a number of transformation techniques were tested including square roots, cubed roots and Box-Cox transformations. None of these were effective. Due to the high zeo counts, a zero inflated Poisson distribution () was considered, as well as the removal of rows containing zeros.The rows containing zeroes for case confirmed showed low testing numbers (24-587) and appeared to rural and low density populations. Due to the complexity of the zero inflated Poisson distribution and the difference between zero case areas and the remaining sample, the rows with zeros due to no confirmed cases were removed.

#Checking for normality and outliers

covidcomplete %>%  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#boxplot check
covidcomplete %>%  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot()

The plots above demonstrate non normal distributions and a large number of outliers in the upper boundary. Zeros will be removed and data checked again.

#Removal of zeros and rechecking distribution and outliers

coviddrop<-subset(covidcomplete,`Case - Confirmed` != 0)

rmarkdown::paged_table(head(coviddrop, 5))

coviddrop %>%  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot()

coviddrop %>%  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Transformation

Removing zeros allows for an easier transformation of the distribution. Rows with zeros were removed from the data set via subsetting and reduced the data set to 100 observations. Outliers were rechecked. There were still outliers in the upper boundary and the distribution remained skewed.A range of transformation techniques were again tested on on the modified data including, Box-Cox, square root and log10 transformations. The log10 transformation was selected as it is known to effectively deal with right skewed data. log10 function was applied across the numeric attributes. Boxplots showed less skew although outliers remained in the casebypop, casebytest and testbypop variables. These were the main variables of interest.

#Log10 function applied across numeric variables

covidlog<-rapply(coviddrop, f = log10, classes = c("numeric", "integer"), how = "replace")
head(covidlog)

##   LGAID                LGA class     male   female totalpop Case - Confirmed
## 1 10050            Albury      C 4.420748 4.447236 4.735224        0.9030900
## 2 10130 Armidale Regional      A 4.175946 4.198217 4.488255        0.6020600
## 3 10250           Ballina      A 4.333427 4.363217 4.649607        0.9030900
## 5 10470 Bathurst Regional      A 4.341652 4.335598 4.639666        0.9542425
## 6 10500           Bayside      A 4.952894 4.947801 5.251385        1.8260748
## 7 10550       Bega Valley      A 4.228836 4.244005 4.537517        0.9542425
##   Tested & excluded totaltests testbypop casebypop casesbytest  quartile
## 1          3.232234   3.234264 0.4996871 -2.000000  -0.3279021      high
## 2          3.405517   3.406199 0.9180303 -2.000000  -0.7958800      high
## 3          3.469380   3.470557 0.8208580 -1.698970  -0.5686362      high
## 5          3.397419   3.398981 0.7596678 -1.698970  -0.4436975      high
## 6          3.960280   3.963457 0.7118072 -1.397940  -0.1366771 very_high
## 7          3.129045   3.131939 0.5943926 -1.522879  -0.1804561      high

covidlog %>%  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot()

Outliers may be linked to Covid outbreaks and deemed to be of interest. Therefore it was decided not to impute or delete them. Normal distributions are important for a number of parametric statistical tests(). However, there are a number of non-parametric tests that can be applied to data sets that are not normally distributed. The LGAs that with outliers included:

Casebytest: Federation, Murrumbidgee. (High cases per test).
testbypop: Hunters Hill, Lane Cove, Waverley (High tests per pop).
casebypop: Hunters Hill, Mosman, Waverlley, Woollahra (High cases per pop).

#Listing outliers

out <- boxplot.stats(covidlog$casesbytest)$out
out_ind <- which(covidlog$casesbytest %in% c(out))
out_ind

## [1] 30 58

out <- boxplot.stats(covidlog$testbypop)$out
out_ind <- which(covidlog$testbypop %in% c(out))
out_ind

## [1] 41 48 93

out <- boxplot.stats(covidlog$casebypop)$out
out_ind <- which(covidlog$casebypop %in% c(out))
out_ind

## [1] 41 57 93 99

#Converting data types

covidlog$LGAID<-factor(covidlog$LGAID)
covidlog$LGA<-factor(covidlog$LGA)

Conclusion

The final data set included 100 observations and 13 variables including:

LGAID: Factor variable with unique area ID values
LGA: Factor variable of local government area names.
class:Factor variable with local government area class.
male: Numeric variable with male population totals.
female: Numeric variable with female population totals.
totalpop: Numeric variable with total population totals.
Case - Confirmed: Numeric variable with confirmed case totals
Tested & Excluded: Numeric variable with excluded test result totals.
totaltests: Numeric variable with total test numbers.
testbypop: Numeric variable includes ratio of tests by population.
casesbypop: Numeric variable includes ratio of cases by population.
casesbytest: Numeric variable includes ration of cases by tests.
quartile: Ordinal factor variable with four levels outlining population numbers by quartile.

Reference List

Australian Bureau of Statistics. (2019). Regional Population, Age and Sex, retrieved from https://www.abs.gov.au/statistics/people/population/regional-population-age-and-sex/latest-release#data-download

Green J. (2020). A Tutorial on Modelling Health Behaviour as Count Data with Poisson and Negative Binomial Regression, retrieved from https://osf.io/ux9et/

New South Wales Government. (2020). NSW Covid 19 Tests by Location and Result. retrieved from (https://data.nsw.gov.au/data/dataset/nsw-covid-19-tests-by-location-and-result/resource/227f6b65-025c-482c-9f22-a25cf1b8594f)

Wickham H. (2014). Tidy Data, Journal of statistical Software, 59(10) https://www.researchgate.net/publication/215990669_Tidy_data

Data Wrangling Markdown

Louis Tintner SID: s3874433

07/10/2020