Three sets of creative commons open source data were selected. Two were from the Australian Bureau of Statistics (ABS) and one from NSW Health. Both sources are reputable. Each data set was tidied, cleaned and subset. The data structure and types of variables were examined. Cleaning included scanning for and removal of special and NA values. Tidy data refers to a standardised structure for data sets (Wickham, 2014). Tidy data has three basic principles. Firstly, each variable forms a column. Secondly, each observation forms a row. Thirdly, each type of observational unit forms a table. Data structure can be untidy in a number of ways including variables stored in both rows and columns and multiple variables stored in one column. The Covid data from NSW Health is a good example of multiple variables stored in one column. The data set was tidied by separating attributes that represented two variables (LGA and class and testing outcome and total tests).
After initial checks the three data sets were merged and checked again.When necessary new variables where mutated to form new variables to take into account proportion of testing per population and positive cases per population. Distribution and outliers were the examined using descriptive statistics, histograms and boxplots. Deletion, capping or imputation of outliers was not appropriate as they were of interest and could relate to hotspots of Covid cases and be useful to analyse. The distribution of most variables were skewed to the right and there was a high zero count. the local government areas with zero case counts were examined and were smaller in size and rural areas. Zeros were removed to enable transformation. A log10 function was applied across numeric variables and improved the distribution. Data types including factors were converted where necessary.
library(dplyr)
library(tidyverse)
library(readr)
library(ggplot2)
library(forecast)
library(MASS)
library(car)
The three data sets include information on population grouped by gender and local government area from the ABS (2019) and Covid (NSW Health, 2020) testing and test results from the New South Wales Health.
The gender population data contained two data sets, one with female data and one with male data. Both sets included 559 observations for all of Australia and 23 variables. The data included estimated population figures for June 2019. As this study focused on NSW, observations from other states were removed when the data was subset. 129 observations remained. Three variables were relevant including:
The Covid data set included 539,050 observations of 7 variables. The data includes information on all Covid tests in NSW up to June 2020. As such, data was collected at the case level and will need to be grouped by LGA. In addition, the data set lacked information regarding the Unincorporated NSW region which was in the ABS population data. Three variables were relevant including:
#reading in files
agemale<-read.csv("agemales.csv")
agefemale <- read.csv("agefemales.csv")
covid<-read.csv("covid19.csv")
As the variables were similar in both the data sets from the ABS, the structure and size of the data sets were checked. All three data sets had variables that included the keys of local government area name and local government ID code. These keys will allow for merging the three data sets.
#Checking structure of files
str(agemale)
## 'data.frame': 559 obs. of 23 variables:
## $ Australian.Bureau.of.Statistics: chr "3235.0 Regional Population by Age and Sex, Australia" "Released at 11.30am (Canberra time) 28 August 2020" "Table 1. Estimated Resident Population by Age, by Local Government Area, Males – 30 June 2019" "" ...
## $ X : chr "" "" "" "" ...
## $ X.1 : chr "" "" "" "" ...
## $ X.2 : chr "" "" "" "" ...
## $ X.3 : chr "" "" "" "" ...
## $ X.4 : chr "" "" "" "" ...
## $ X.5 : chr "" "" "" "" ...
## $ X.6 : chr "" "" "" "" ...
## $ X.7 : chr "" "" "" "" ...
## $ X.8 : chr "" "" "" "" ...
## $ X.9 : chr "" "" "" "" ...
## $ X.10 : chr "" "" "" "" ...
## $ X.11 : chr "" "" "" "" ...
## $ X.12 : chr "" "" "" "" ...
## $ X.13 : chr "" "" "" "" ...
## $ X.14 : chr "" "" "" "" ...
## $ X.15 : chr "" "" "" "" ...
## $ X.16 : chr "" "" "" "" ...
## $ X.17 : chr "" "" "" "" ...
## $ X.18 : chr "" "" "" "" ...
## $ X.19 : chr "" "" "" "" ...
## $ X.20 : chr "" "" "" "" ...
## $ X.21 : chr "" "" "" "" ...
head(agemale)
## Australian.Bureau.of.Statistics
## 1 3235.0 Regional Population by Age and Sex, Australia
## 2 Released at 11.30am (Canberra time) 28 August 2020
## 3 Table 1. Estimated Resident Population by Age, by Local Government Area, Males – 30 June 2019
## 4
## 5
## 6
## X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14
## 1
## 2
## 3
## 4
## 5
## 6 Age group (Years)
## X.15 X.16 X.17 X.18 X.19 X.20 X.21
## 1
## 2
## 3
## 4
## 5
## 6
summary(agemale)
## Australian.Bureau.of.Statistics X X.1
## Length:559 Length:559 Length:559
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## X.2 X.3 X.4 X.5
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.6 X.7 X.8 X.9
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.10 X.11 X.12 X.13
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.14 X.15 X.16 X.17
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.18 X.19 X.20 X.21
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
str(agefemale)
## 'data.frame': 559 obs. of 23 variables:
## $ Australian.Bureau.of.Statistics: chr "3235.0 Regional Population by Age and Sex, Australia" "Released at 11.30am (Canberra time) 28 August 2020" "Table 2. Estimated Resident Population by Age, by Local Government Area, Females – 30 June 2019" "" ...
## $ X : chr "" "" "" "" ...
## $ X.1 : chr "" "" "" "" ...
## $ X.2 : chr "" "" "" "" ...
## $ X.3 : chr "" "" "" "" ...
## $ X.4 : chr "" "" "" "" ...
## $ X.5 : chr "" "" "" "" ...
## $ X.6 : chr "" "" "" "" ...
## $ X.7 : chr "" "" "" "" ...
## $ X.8 : chr "" "" "" "" ...
## $ X.9 : chr "" "" "" "" ...
## $ X.10 : chr "" "" "" "" ...
## $ X.11 : chr "" "" "" "" ...
## $ X.12 : chr "" "" "" "" ...
## $ X.13 : chr "" "" "" "" ...
## $ X.14 : chr "" "" "" "" ...
## $ X.15 : chr "" "" "" "" ...
## $ X.16 : chr "" "" "" "" ...
## $ X.17 : chr "" "" "" "" ...
## $ X.18 : chr "" "" "" "" ...
## $ X.19 : chr "" "" "" "" ...
## $ X.20 : chr "" "" "" "" ...
## $ X.21 : chr "" "" "" "" ...
head(agefemale)
## Australian.Bureau.of.Statistics
## 1 3235.0 Regional Population by Age and Sex, Australia
## 2 Released at 11.30am (Canberra time) 28 August 2020
## 3 Table 2. Estimated Resident Population by Age, by Local Government Area, Females – 30 June 2019
## 4
## 5
## 6
## X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14
## 1
## 2
## 3
## 4
## 5
## 6 Age group (Years)
## X.15 X.16 X.17 X.18 X.19 X.20 X.21
## 1
## 2
## 3
## 4
## 5
## 6
summary(agefemale)
## Australian.Bureau.of.Statistics X X.1
## Length:559 Length:559 Length:559
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## X.2 X.3 X.4 X.5
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.6 X.7 X.8 X.9
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.10 X.11 X.12 X.13
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.14 X.15 X.16 X.17
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.18 X.19 X.20 X.21
## Length:559 Length:559 Length:559 Length:559
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
The raw data from the ABS data sets included 23 variables. 20 variables related to age range, groups and state identification. The tables included superfluous meta data, empty rows and data from other states. As such, a subset of desired variables was created from each data set. This included local government identification number (LGAID), local government area name (LGA) and total population by LGA for both males and females (malepop, femalepop) in NSW. NAs and special characters were then checked.
#Subsetting, scanning and renaming variables
agemales<-agemale[10:138,]
agemales<-data.frame(agemales)
rmarkdown::paged_table(head(agemales, 5))
sum(is.na(agemales))
## [1] 0
sum(is.nan(as.matrix(agemales)))
## [1] 0
sum(is.infinite(as.matrix(agemales)))
## [1] 0
colnames(agemales)
## [1] "Australian.Bureau.of.Statistics" "X"
## [3] "X.1" "X.2"
## [5] "X.3" "X.4"
## [7] "X.5" "X.6"
## [9] "X.7" "X.8"
## [11] "X.9" "X.10"
## [13] "X.11" "X.12"
## [15] "X.13" "X.14"
## [17] "X.15" "X.16"
## [19] "X.17" "X.18"
## [21] "X.19" "X.20"
## [23] "X.21"
agemales<-agemales%>%rename(LGAID = "X.1", LGA = "X.2", malepop = "X.21")
male.lga.pop<-data.frame(agemales$LGAID, agemales$LGA, agemales$malepop)
rmarkdown::paged_table(head(male.lga.pop, 5))
agefemales<-agefemale[10:138,]
agefemales<-data.frame(agefemales)
sum(is.na(agefemales))
## [1] 0
sum(is.nan(as.matrix(agefemales)))
## [1] 0
sum(is.infinite(as.matrix(agefemales)))
## [1] 0
colnames(agefemales)
## [1] "Australian.Bureau.of.Statistics" "X"
## [3] "X.1" "X.2"
## [5] "X.3" "X.4"
## [7] "X.5" "X.6"
## [9] "X.7" "X.8"
## [11] "X.9" "X.10"
## [13] "X.11" "X.12"
## [15] "X.13" "X.14"
## [17] "X.15" "X.16"
## [19] "X.17" "X.18"
## [21] "X.19" "X.20"
## [23] "X.21"
agefemales<-agefemales%>%rename(LGAID = "X.1", LGA = "X.2", femalepop = "X.21")
female.lga.pop<-data.frame(agefemales$LGAID, agefemales$LGA, agefemales$femalepop)
rmarkdown::paged_table(head(female.lga.pop, 5))
A left join was then used to merge the two tables on the keys of LGA ID code and name. The variables were renamed and the new data frame checked using the head function. Population variables were converted from character type to integer.
#Join and conversion of data type.
dfpop<-left_join(male.lga.pop, female.lga.pop, by = c("agemales.LGAID" = "agefemales.LGAID", "agemales.LGA"= "agefemales.LGA"))
dfpop<-dfpop %>% rename(LGAID = "agemales.LGAID", LGA ="agemales.LGA", male = "agemales.malepop",female = "agefemales.femalepop")
rmarkdown::paged_table(head(dfpop, 5))
dfpop$male<-as.integer(as.character(dfpop$male))
dfpop$female<-as.integer(as.character(dfpop$female))
sapply(dfpop, class)
## LGAID LGA male female
## "character" "character" "integer" "integer"
A new variable (totalpop) was created combining the female and male population variables.
#Mutation
totalpop<-dfpop%>% mutate(totalpop = female + male)
rmarkdown::paged_table(head(totalpop, 5))
The Covid 19 data had 7 variables including the two keys LGA ID (lga_code19) and name of LGA (lga_name19) . The data set was an example of untidy data. The results variable included information about testing as well as results of the tests and the LGA variable included information on LGA name and LGA class. As such, multiple variables have been stored in single columns. These variables will be separated into two variables. The data was ordered by case and needed to be collapsed into cases grouped by LGA. The data frame was checked, subsetted and grouped by LGA in preparation for a join with the population data. Missing values and special values were checked. There were a number of rows with results but lacked LGA details. These may have been related to cases from nonresidents/overseas cases without an LGA. NAs were spread across rows. Rows with missing values were omitted as NAs made up less than 5% of the data set.
#Scanning
str(covid)
## 'data.frame': 539050 obs. of 7 variables:
## $ test_date : chr "1/01/2020" "1/01/2020" "1/01/2020" "1/01/2020" ...
## $ postcode : int 2039 2040 2069 2110 2190 2071 2485 2151 2134 2207 ...
## $ lhd_2010_code: chr "X700" "X700" "X760" "X760" ...
## $ lhd_2010_name: chr "Sydney" "Sydney" "Northern Sydney" "Northern Sydney" ...
## $ lga_code19 : int 14170 14170 14500 14100 11570 14500 17550 16260 11300 10500 ...
## $ lga_name19 : chr "Inner West (A)" "Inner West (A)" "Ku-ring-gai (A)" "Hunters Hill (A)" ...
## $ result : chr "Tested & excluded" "Tested & excluded" "Tested & excluded" "Tested & excluded" ...
rmarkdown::paged_table(head(covid, 5))
summary(covid)
## test_date postcode lhd_2010_code lhd_2010_name
## Length:539050 Min. :1021 Length:539050 Length:539050
## Class :character 1st Qu.:2114 Class :character Class :character
## Mode :character Median :2232 Mode :character Mode :character
## Mean :2309
## 3rd Qu.:2502
## Max. :4385
## NA's :22958
## lga_code19 lga_name19 result
## Min. :10050 Length:539050 Length:539050
## 1st Qu.:11720 Class :character Class :character
## Median :14900 Mode :character Mode :character
## Mean :14641
## 3rd Qu.:16700
## Max. :18710
## NA's :23825
sum(is.na(covid))
## [1] 46783
sum(is.nan(as.matrix(covid)))
## [1] 0
sum(is.infinite(as.matrix(covid)))
## [1] 0
cleancovid<-na.omit(covid)
The data set was grouped by result and LGA. The result variable was then split into two variables, tested and confirmed (Case - Confirmed) and tested and excluded (Tested & excluded). The new data frame was checked. NAs only existed in the case confirmed. It was assumed NAs represented areas without confirmed cases. As such, NA values were converted to 0. A total test by LGA variable was created using the mutate function by adding the Case - Confirmed and Tested & excluded result variables.
#Grouping by LGA and mutation.
groupcovidtest<-cleancovid %>% group_by(lga_name19, result) %>% count()
spreadcovidna<-spread(groupcovidtest, result, n)
spreadcovidna[is.na(spreadcovidna)] <- 0
rmarkdown::paged_table(head(spreadcovidna, 5))
totaltest<-spreadcovidna %>% mutate(totaltests = `Case - Confirmed` + `Tested & excluded`)
rmarkdown::paged_table(head(totaltest, 5))
The Covid dataset was then joined to the ABS population data and checked. The Covid dataframe lacked data concerning the Unincorporated NSW region and the row was removed. There were no other NA results.
#Join and rescanning new data set.
covidjoin<- left_join (totalpop, totaltest, by = c("LGA" = "lga_name19"))
rmarkdown::paged_table(head(covidjoin, 5))
sum(is.na(covidjoin))
## [1] 3
cleancovidjoin<-na.omit(covidjoin)
sum(is.na(cleancovidjoin))
## [1] 0
rmarkdown::paged_table(head(cleancovidjoin, 5))
Three new variables were created using mutate: proportion tests per population number(testbypop), proportion of cases per population (casebypop) and proportion cases by test(casebytest).
#Mutation
covidpoptest<-cleancovidjoin %>% mutate(testbypop = round(totaltests / totalpop*100, 2))
rmarkdown::paged_table(head(covidpoptest, 5))
covidbypop<-covidpoptest %>% mutate(casebypop = round(`Case - Confirmed` / totalpop*100, 2))
rmarkdown::paged_table(head(covidbypop, 5))
covidfinal<-covidbypop %>% mutate(casesbytest = round(`Case - Confirmed` / totaltests*100, 2))
rmarkdown::paged_table(head(covidfinal, 5))
Summary statistics were examine and the means and the median suggest skewed distribution. Three variables related to cases had minimum zero values and this could impact transformation methods.
#Summary statistics
summary(covidfinal)
## LGAID LGA male female
## Length:128 Length:128 Min. : 857 Min. : 754
## Class :character Class :character 1st Qu.: 4419 1st Qu.: 4330
## Mode :character Mode :character Median : 12206 Median : 12235
## Mean : 31369 Mean : 31824
## 3rd Qu.: 37738 3rd Qu.: 39999
## Max. :189987 Max. :187930
## totalpop Case - Confirmed Tested & excluded totaltests
## Min. : 1611 Min. : 0.00 Min. : 24 Min. : 24.0
## 1st Qu.: 8784 1st Qu.: 1.00 1st Qu.: 333 1st Qu.: 335.2
## Median : 24358 Median : 5.00 Median : 1428 Median : 1436.0
## Mean : 63194 Mean : 23.23 Mean : 4002 Mean : 4025.2
## 3rd Qu.: 77737 3rd Qu.: 32.00 3rd Qu.: 5109 3rd Qu.: 5141.5
## Max. :377917 Max. :185.00 Max. :23596 Max. :23713.0
## testbypop casebypop casesbytest
## Min. : 0.200 Min. :0.00000 Min. :0.0000
## 1st Qu.: 3.888 1st Qu.:0.01000 1st Qu.:0.1575
## Median : 4.990 Median :0.02000 Median :0.4250
## Mean : 5.389 Mean :0.02859 Mean :0.4619
## 3rd Qu.: 6.492 3rd Qu.:0.04000 3rd Qu.:0.6525
## Max. :15.990 Max. :0.25000 Max. :2.1900
round(apply(covidfinal,2,sd),3)
## LGAID LGA male female
## 2521.955 NA 41389.383 41585.820
## totalpop Case - Confirmed Tested & excluded totaltests
## 82954.309 36.906 5530.138 5563.156
## testbypop casebypop casesbytest
## 2.462 0.033 0.426
created new variable poplation size by quartile and converted it to factor and spread LGA into LGA name and LGA class and converted class to factor. This may assist in dealing with the differences in LGA population sizes.
#Mutation
covidfinalquart<-covidfinal %>% mutate(quartile = ntile(totalpop, 4))
covidfinalquart$quartile<-factor(covidfinalquart$quartile, ordered = TRUE, levels = c(1,2,3,4), labels = c("low", "medium", "high", "very_high"))
covidcomplete<-covidfinalquart %>% separate (LGA, c("LGA", "class"),sep ="[(*)]")
covidcomplete$class<-factor(covidcomplete$class)
rmarkdown::paged_table(head(covidcomplete, 5))
The data was then checked for outliers and distributions. Histograms showed the distribution of the data skewed to the right. The study will focus on cases and tests by population size. Gender was not going to be a focus of this study.
There were high zero counts in variables relating to case testing ie confirmed cases, cases by test and cases by population. These issues created challenges for dealing with dealing outliers and the process of transforming the data to normal distributions.
There are several reasons outliers may exist including data entry error, measurement errors, extreme natural variation, preprocessing errors and sampling error. Boxplots were used to check for univariate outliers. Using this technique, outliers are defined as being outside Q1-1.5IQR and Q3+1.5IQR. Outliers existed in the upper boundariea of all numeric variables. Further investigation, suggested the outliers to be values may be connected to virus hot spots. The outlier values were deemed to be important and not due to errors of observation. Outliers observations were checked for obvious error. It was decided cannot be imputed, capped or deleted as these outliers were of interest.
In an attempt to gain some understanding about the process a number of transformation techniques were tested including square roots, cubed roots and Box-Cox transformations. None of these were effective. Due to the high zeo counts, a zero inflated Poisson distribution () was considered, as well as the removal of rows containing zeros.The rows containing zeroes for case confirmed showed low testing numbers (24-587) and appeared to rural and low density populations. Due to the complexity of the zero inflated Poisson distribution and the difference between zero case areas and the remaining sample, the rows with zeros due to no confirmed cases were removed.
#Checking for normality and outliers
covidcomplete %>% keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#boxplot check
covidcomplete %>% keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot()
The plots above demonstrate non normal distributions and a large number of outliers in the upper boundary. Zeros will be removed and data checked again.
#Removal of zeros and rechecking distribution and outliers
coviddrop<-subset(covidcomplete,`Case - Confirmed` != 0)
rmarkdown::paged_table(head(coviddrop, 5))
coviddrop %>% keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot()
coviddrop %>% keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Removing zeros allows for an easier transformation of the distribution. Rows with zeros were removed from the data set via subsetting and reduced the data set to 100 observations. Outliers were rechecked. There were still outliers in the upper boundary and the distribution remained skewed.A range of transformation techniques were again tested on on the modified data including, Box-Cox, square root and log10 transformations. The log10 transformation was selected as it is known to effectively deal with right skewed data. log10 function was applied across the numeric attributes. Boxplots showed less skew although outliers remained in the casebypop, casebytest and testbypop variables. These were the main variables of interest.
#Log10 function applied across numeric variables
covidlog<-rapply(coviddrop, f = log10, classes = c("numeric", "integer"), how = "replace")
head(covidlog)
## LGAID LGA class male female totalpop Case - Confirmed
## 1 10050 Albury C 4.420748 4.447236 4.735224 0.9030900
## 2 10130 Armidale Regional A 4.175946 4.198217 4.488255 0.6020600
## 3 10250 Ballina A 4.333427 4.363217 4.649607 0.9030900
## 5 10470 Bathurst Regional A 4.341652 4.335598 4.639666 0.9542425
## 6 10500 Bayside A 4.952894 4.947801 5.251385 1.8260748
## 7 10550 Bega Valley A 4.228836 4.244005 4.537517 0.9542425
## Tested & excluded totaltests testbypop casebypop casesbytest quartile
## 1 3.232234 3.234264 0.4996871 -2.000000 -0.3279021 high
## 2 3.405517 3.406199 0.9180303 -2.000000 -0.7958800 high
## 3 3.469380 3.470557 0.8208580 -1.698970 -0.5686362 high
## 5 3.397419 3.398981 0.7596678 -1.698970 -0.4436975 high
## 6 3.960280 3.963457 0.7118072 -1.397940 -0.1366771 very_high
## 7 3.129045 3.131939 0.5943926 -1.522879 -0.1804561 high
covidlog %>% keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot()
Outliers may be linked to Covid outbreaks and deemed to be of interest. Therefore it was decided not to impute or delete them. Normal distributions are important for a number of parametric statistical tests(). However, there are a number of non-parametric tests that can be applied to data sets that are not normally distributed. The LGAs that with outliers included:
#Listing outliers
out <- boxplot.stats(covidlog$casesbytest)$out
out_ind <- which(covidlog$casesbytest %in% c(out))
out_ind
## [1] 30 58
out <- boxplot.stats(covidlog$testbypop)$out
out_ind <- which(covidlog$testbypop %in% c(out))
out_ind
## [1] 41 48 93
out <- boxplot.stats(covidlog$casebypop)$out
out_ind <- which(covidlog$casebypop %in% c(out))
out_ind
## [1] 41 57 93 99
#Converting data types
covidlog$LGAID<-factor(covidlog$LGAID)
covidlog$LGA<-factor(covidlog$LGA)
The final data set included 100 observations and 13 variables including:
Australian Bureau of Statistics. (2019). Regional Population, Age and Sex, retrieved from https://www.abs.gov.au/statistics/people/population/regional-population-age-and-sex/latest-release#data-download
Green J. (2020). A Tutorial on Modelling Health Behaviour as Count Data with Poisson and Negative Binomial Regression, retrieved from https://osf.io/ux9et/
New South Wales Government. (2020). NSW Covid 19 Tests by Location and Result. retrieved from (https://data.nsw.gov.au/data/dataset/nsw-covid-19-tests-by-location-and-result/resource/227f6b65-025c-482c-9f22-a25cf1b8594f)
Wickham H. (2014). Tidy Data, Journal of statistical Software, 59(10) https://www.researchgate.net/publication/215990669_Tidy_data