library(readxl)
library(readr)
library(dplyr)
library(tidyr)
library(editrules)
## Warning: package 'editrules' was built under R version 4.0.3
## Warning: package 'igraph' was built under R version 4.0.3
library(rmarkdown)
library(knitr)
This report covers the data pre-processing for three tables of 2016 census data sourced from the Australian Bureau of Statistics (ABS). The variables of interest are state, gender, method of transport to work, population counts, and state area (sqkm). The imported data set was filtered to retain only data on the bicycle method of transport to work and the associated population totals, and remove the other methods of transport. The bicycle dataset was untidy due to being in a wide format and some observations were spread across multiple rows. Using the tidyr package, the State variable was gathered and the Method variable spread to conform to tidy data rules. The two smaller data sets were merged in turn to the tidied bicycle data using the dplyr package left join based on the common variable ‘State’. A structure check of the data revealed two character variables which were recoded into function variables, and four numeric variables. Two new variables, a bicycle method by working population proportion and a population density variable, were created using the dplyr mutate function. The bicycle data was scanned for missing values (using the is.na function), special values (using the is.infinite and is.nan functions with sapply) and inconsistencies or errors (using the editrules package editfile and violatedEdits functions). No missing values, special values or obvious errors were found. The distribution of the numerical variables was found to be right skewed when examined using histograms. Consequently, outliers were identified using box plots. A few outliers were found however the only truly anomalous outlier was in the population density variable, which was subsequently normalised using the log transformation method.
Main Data Source
The first data set, called ‘allmethod’, and the second ‘state_pop’ were sourced from the Australian Bureau of Statistics (ABS) utilising the Census TableBuilder, accessed via the following link (https://www.abs.gov.au/websitedbs/d3310114.nsf/home/about+tablebuilder). The Census TableBuilder accesses ABS microdata and allows guest users to create tables displaying counts from board demographic variables collected during Census and is open source under a Creative Commons Licence.
Table Creation Method in Census TableBuilder
The data for the ‘all-method’ data set was created from the 2016 Census of Population and Housing data set using the Employment, Income and Education subset.
Place of Usual Residence (UR) - The State variable is based on the Main ASGS categories; this provides the count of the population by state (i.e. NSW, VIC), the ‘other Territories’ level was not selected in this output. This was added as a column variable.
Selected Person Characteristics - Sex (SEXP: Male and Female) was selected as a row variable and provides the count for gender.
Mode of Travel to Work - The main variable of interest was the Method of Travel to Work (MTW15P: 15 travel modes option, examples include ‘Bus’, ‘Car, as driver’, ‘Walked only’) which reports the method used to travel to work on the day of the Census; the not stated and not applicable counts were not selected in this output. Method of Travel to Work was added as a row variable.
The resulting table “SEXP Sex and MTW15P Method of Travel to Work (15 travel modes) by STATE (UR)” was downloaded as an xlsx file.
Import Data Steps
The ‘allmethod’ dataset was imported into R using read_excel from the readxl package, the first seven rows of the excel spreadsheet were skipped as they contained extraneous information, the max rows option (n_max = 32) was used to select only rows that included the heading and cell counts for the variables. The head function is used throughout this section of the report to show the first six lines of the dataset, displaying the content, format and any changes made. The ‘allmethod’ first row of data and the first column were subsetted and removed from the dataset as they are missing data resulting from the prior formatting of the table in the excel output. This formatting also only included one cell to designate the “Male and”Female" row identifiers, the missing data was assigned using subsetting. The first and second column names generated by R were then renamed as “Gender” and “Method”.
allmethod <- read_excel("all methods working.xlsx", skip = 7, n_max = 32)
## New names:
## * `` -> ...2
## * `` -> ...3
head(allmethod)
allmethod <- allmethod[-1,-1]
allmethod[2:15,1] <- "Male"
allmethod[17:30,1] <- "Female"
allmethod <- allmethod %>% rename(Gender = ...2)
allmethod <- allmethod %>% rename(Method = ...3)
head(allmethod)
The ‘state_pop’ data set was also generated using the Census TableBuilder using a similar method as described above for the ‘allmethod’ data set. For this table, only the State variable and it’s associated count were selected in a column format, the full population per state was included based on Place of Usual Residence (UR), rather than only the working population by method of travel to work seen in the ‘allmethod’ data set above. The resulting xlsx output was imported into R using read_excel, the first nine rows of extraneous data were skipped and the max rows (8) with data were kept. The first and second column names generated by R were renamed as “State” and “Total_pop”.
state_pop <- read_excel("state population.xlsx", col_names = FALSE, skip = 9, n_max = 8)
## New names:
## * `` -> ...1
## * `` -> ...2
head(state_pop)
state_pop <- state_pop %>% rename(State = ...1)
state_pop <- state_pop %>% rename(Total_pop = ...2)
head(state_pop)
Second Data Source
The third data set was also sourced from the ABS website, this time from the Data Cubes available with the Australian Statistical Geography Standard (ASGS): Volume 1 - Main Structure and Greater Capital City Statistical Areas, July 2016 release (URL: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1270.0.55.001July%202016?OpenDocument). The relevant download is the “State and Territory (STE) ASGS Edition 2016 in .csv Format” file.
This data set ‘state_area’ includes the state code, state and state area variables. The state code variable is an ABS designation code for data processing. State is the states and territories of Australia. The state area variable (AREA_ALBERS_SQKM) gives the state’s area in square kilometres calculated using the Albers Equal Area Conic Projection method. This was imported into R using the readr package read_csv function. The first column ‘state_code_2016’ and the ninth observation ‘Other Territories’ were dropped using subsetting as they are extraneous for this investigation.
state_area <- read_csv("state_area_2016_AUST.csv")
## Parsed with column specification:
## cols(
## STATE_CODE_2016 = col_double(),
## STATE_NAME_2016 = col_character(),
## AREA_ALBERS_SQKM = col_double()
## )
head(state_area)
state_area <- state_area[,-1]
state_area <- state_area[-9,]
head(state_pop)
Tidy & Manipulate Data I
The ‘allmethod’ data set needs to be tidied before it can be merged so the first tidy and manipulate data section has been included in the report here. For this report the method of travel to work under investigation is Bicycle. Consequently, the ‘allmethod’ data set was filtered using the dplyr filter function selecting ‘Bicycle’ and ‘Total’ to focus in on the number of people who reported cycling to work and total number of working people (‘not stated’ and ‘not applicable’ counts were not selected during table creation). The filtered data set was assigned the name ‘bybike’ and will be the basis for merging all data sets.
The ‘bybike’ data set is untidy because the variable State is in a wide format, useful for readability when comparing categorical variables visually, but it can be a hindrance for data analysis. According to the tidy data rules each column can only contain one variable, the wide format violates this rule. To address this the tidyr gather function was used to transform the State variable from wide to long format, under the new column headings ‘State’ and ‘Count’. The second tidy data rule is that each row must have only one observation’s values. The ‘Method’ column contains both the bicycle method count and the sum of all methods count (working population total) in the same column, violating this rule. These should be two different variables, the tidyr spread function was used to transform them into two columns, reforming ‘Method’ and ‘Count’, to form ‘Bicycle’ and ‘Total’. The ‘Total’ column was then renamed to ‘Work_pop’ to distinguish from the total population which will be added during the forthcoming data set merges. The third tidy data rule that each cell must contain a single value is observed for this data set.
bybike <- allmethod %>% filter(Method == "Bicycle" | Method == "Total")
head(bybike)
bybike <- bybike %>% gather(`New South Wales`, `Victoria`, `Queensland`, `South Australia`, `Western Australia`, `Tasmania`, `Australian Capital Territory`, `Northern Territory`, key = "State", value = "Count")
head(bybike)
bybike <- bybike %>% spread(key = Method, value = Count)
bybike <- bybike %>% rename(Work_pop = Total)
head(bybike)
Merge Datasets
The common variable for merging the data sets is ‘State’. The ‘Total_pop’ variable from ‘state_pop’ was joined to the ‘bybike’ data set using a left join with the key variable being character vector ‘State’. The left join combines matching variables from two tables (in this case the ‘State’ variable) and adds new variables to the right (the ‘Total_pop’ variable). As both data sets have identical ‘State’ values all observations in the bybike data set were retained and the ‘Total_pop’ variable was added. The state variable for the ‘state_area’ data set is named differently (‘STATE_NAME_2016’), so the data sets were joined using a named character vector which identifies the variables to be matched.
bybike <- bybike %>% left_join(state_pop, by = "State")
head(bybike)
bybike <- bybike %>% left_join(state_area, by = c("State" = "STATE_NAME_2016"))
head(bybike)
The structure of the ‘bybike’ data set was checked revealing a data frame (tibble) with 16 observations and six variables. It has two data types, there are two character class variables (Gender and State) and four numeric class variables (Bicycle, Work_pop, Total_pop and AREA_ALBERS_SQKM). Both Gender and State are categorical variables and should be factorised prior to analysis.
Gender was encoded as a factor using the as.factor function, resulting in two unordered levels “Female” and “Male”. The State variable was encoded as a factor using the factor function, with the additional arguments specifying the eight levels, changing the labels so that the states appear in abbreviated form (e.g. ‘NSW’ instead of ‘New South Wales’) for analysis and outputs, and although it has no numerically meaningful order, the ordered argument was used to maintain the state order usually used in government publications. The structure of ‘bybike’ was rechecked and both Gender and State have been successfully re-coded as factors with the appropriate levels.
str(bybike)
## tibble [16 x 6] (S3: tbl_df/tbl/data.frame)
## $ Gender : chr [1:16] "Female" "Female" "Female" "Female" ...
## $ State : chr [1:16] "Australian Capital Territory" "New South Wales" "Northern Territory" "Queensland" ...
## $ Bicycle : num [1:16] 1645 5213 1025 5227 1801 ...
## $ Work_pop : num [1:16] 100334 1585818 46925 1017823 355457 ...
## $ Total_pop : num [1:16] 397393 7480230 228838 4703192 1676653 ...
## $ AREA_ALBERS_SQKM: num [1:16] 2358 800811 1348094 1730172 984275 ...
bybike$Gender <- bybike$Gender %>% as.factor()
levels(bybike$Gender)
## [1] "Female" "Male"
bybike$State <- bybike$State %>% factor(levels = c("New South Wales", "Victoria", "Queensland", "South Australia", "Western Australia", "Tasmania", "Northern Territory", "Australian Capital Territory"),
labels = c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"),
ordered = TRUE)
levels(bybike$State)
## [1] "NSW" "VIC" "QLD" "SA" "WA" "TAS" "NT" "ACT"
str(bybike)
## tibble [16 x 6] (S3: tbl_df/tbl/data.frame)
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 2 2 ...
## $ State : Ord.factor w/ 8 levels "NSW"<"VIC"<"QLD"<..: 8 1 7 3 4 6 2 5 8 1 ...
## $ Bicycle : num [1:16] 1645 5213 1025 5227 1801 ...
## $ Work_pop : num [1:16] 100334 1585818 46925 1017823 355457 ...
## $ Total_pop : num [1:16] 397393 7480230 228838 4703192 1676653 ...
## $ AREA_ALBERS_SQKM: num [1:16] 2358 800811 1348094 1730172 984275 ...
Reshaping the data into a tidy format was done prior to merging the data sets. In this step, two new variables were added to the data set using the dplyr mutate function. The percentage of people who rode to work compared with the working population was computed by dividing ‘Bicycle’ by ‘Work_pop’ then multiplying by 100, under the column name ‘Bike_wk_perc’. An estimate of population density, the number of people per square kilometre, was computed by dividing the total population count (‘Total_pop’) by the state area (AREA_ALBERS_SQKM), under the column name ‘Pop_density’.
bybike <- bybike %>% mutate(Bike_wk_perc = Bicycle/Work_pop*100)
bybike <- bybike %>% mutate(Pop_density = Total_pop/AREA_ALBERS_SQKM)
head(bybike)
All variables in the data frame were scanned for missing values using the is.na function which gives a logical vector TRUE for missing values, the missing values were totaled by column using the colSums function. No missing values were identified in any of the variables.
All numeric variables were scanned for special values (‘infinity’ (-)Inf or ‘not a number’ NaN) using a custom function and sapply. First the is.special function was created to identify if a value was numeric, and if so, to then check if it was infinite using the is.infinite function or NaN using the is.nan function. The is.special function was then applied to the data frame using the sapply function, which applies any function to a list. Another custom function to sum the columns of the is.special function output was inserted into the sapply function to simplify the final output. No special values were identified in any of the numeric variables.
Potential obvious inconsistencies or errors were checked using the editrules package. Rules (i.e. variable constraints or restrictions) were defined for categorical and numerical variables from the data set in a text file and assigned to ‘Rules’ using the editfile function. The text file contained the following rules. All numeric variables were constrained to be non-negative (greater than or equal to zero), ‘Bike_wk_perc’ was also constrained to be less than or equal to 100. The population count variables were also checked to make sure a subset of the population was less than the whole population (for example ‘Bicycle’ was less than ‘Work_pop’ which in turn was less than ‘Total_pop’). The categorical variables ‘Gender’ and ‘State’ were constrained to only contain the string/levels coded in the factor. For example ‘Gender’ can only contain “Male” or “Female”. The violatedEdits function was applied with ‘Rules’ to ‘bybike’ and assigned to ‘Violated’. A summary of the violations was then produced with no violations of the rules detected.
colSums(is.na(bybike))
## Gender State Bicycle Work_pop
## 0 0 0 0
## Total_pop AREA_ALBERS_SQKM Bike_wk_perc Pop_density
## 0 0 0 0
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(bybike, function(x) sum(is.special(x)))
## Gender State Bicycle Work_pop
## 0 0 0 0
## Total_pop AREA_ALBERS_SQKM Bike_wk_perc Pop_density
## 0 0 0 0
Rules <- editfile("error rules.txt", type = "all")
Rules
##
## Data model:
## dat1 : Gender %in% c('Female', 'Male')
## dat2 : State %in% c('ACT', 'NSW', 'NT', 'QLD', 'SA', 'TAS', 'VIC', 'WA')
##
## Edit set:
## num1 : 0 <= Bicycle
## num2 : Bicycle < Work_pop
## num3 : Work_pop < Total_pop
## num4 : 0 <= AREA_ALBERS_SQKM
## num5 : 0 <= Bike_wk_perc
## num6 : Bike_wk_perc <= 100
## num7 : 0 <= Pop_density
Violated <- violatedEdits(Rules, bybike)
summary(Violated)
## No violations detected, 0 checks evaluated to NA
## NULL
The numeric variables were scanned for outliers using boxplots. First the distribution of each numeric variable was examined using a histogram to determine if it approximately fitted a symmetric or normal distribution. All numeric variables showed right skewed distributions so the non-parametric Tukey’s method of outlier detection used in box plots was applied to detect outliers.
The numeric variable ‘Bicycle’ had a right skewed distribution in the histogram and one upper outlier was seen in the box plot. The outlier was identified as the count for NSW; it is expected to find outliers in this data which compares entities of such differing sizes, New South Wales has the highest population overall (7,480,230 people), while the Northern Territory has the smallest (228,838 people). As the majority of the data in the ‘bybike’ data set is from census counts it will not be analysed using parametric tests, but rather non-parametric tests like chi-square goodness of fit, so numeric ‘outliers’ like this will not impact the analyses to be undertaken. Furthermore, the outlier in the ‘Bicycle’ data is not an anomaly and will be retained. The same right skewed distribution was observed in the ‘Working_pop’ and ‘Total_pop’ histograms, though neither variable returned outliers in their respective box plots. The histogram and box plot of ‘State_area’ follow the same pattern, with a right skewed distribution and no outliers were detected in the box plot.
hist(bybike$Bicycle, xlab = "Number of people who bicycled to work", main = "Histogram of Bicycle Count")
bybike$Bicycle %>% boxplot(main = "Box Plot of Bicycle Count", ylab = "Count")
hist(bybike$Work_pop, xlab = "Number of employed people", main = "Histogram of Working Population")
bybike$Work_pop %>% boxplot(main = "Box Plot of Working Population", ylab = "Count")
hist(bybike$Total_pop, main = "Histogram of Total Population", xlab = "Population Count")
bybike$Total_pop %>% boxplot(main = "Box Plot of Total Population", ylab = "Count")
hist(bybike$AREA_ALBERS_SQKM, main = "Histogram of State Area (sqkm)", xlab = "State Area (sqkm)")
bybike$AREA_ALBERS_SQKM %>% boxplot(main = "Box Plot of State Area", ylab = "State Area (sqkm)")
The ‘Bicycle_wk_perc’ variable also has a tendency to right skewness, though of all the numerical variables in this data set, it is the closest to a normal distribution. The ‘Bicycle to Work Proportion’ box plot shows one outlier, in this instance it is the ACT data points. Similarly to ‘Bicycle’ it is expected that there will be great differences observed between states. The ACT in particular presents a unique problem for the proportion (Bicycle_wk_perc) variable and the last variable, population density (Pop_density). The difficulty in comparing the ACT to the other states for these variables is due to the unique characteristics of the ACT, being essentially a capital city in a small area. This means that there are no smaller cities or towns within the state that people may commute to or from, and the area of the ACT (2,358 sqkm) has only a fraction of the area of even the state next closest in size (Tasmania 68,018 sqkm). This is why the bicycle to work by working population proportion by state is such a useful variable to add to the data set. As it gives us the proportion within the state, it is much easier to identify trends between states and compare them on the same scale. The outlier for ‘Bicycle_wk_perc’ is not an anomaly this data, but rather a point of particular interest and will be retained.
The extent to which the ACT is dissimilar to the other states is apparent in the histogram and box plot for Population Density. The ACT has a population density of almost 170 people per square kilometre, while the rest of the states range between less than one person per square kilometre (NT) and up to 26 people per square kilometre (VIC). The extreme nature of this outlier is evident in the Population Density box plot. It is clearly far beyond the upper outlier fence. This is a true anomaly in the data, reflecting the unique characteristics of the ACT as outlined in the above paragraph. This is also a reflection of the limited data available in the open source census information from the ABS, geographical counts are limited to the state level. To properly compare the ACT to other states, we would need to compare the state’s capital city population with the estimated area of the capital city. As this is not possible with the available data, different methods will need to be explored. One option would be to remove the ACT data for the analysis of population density, however as seen in the output below, filtering out ACT from the data does not correct the skewness of the distribution or the existence of outliers. In this instance the best course of action using the current data set is to apply a transformation to the population density data.
hist(bybike$Bike_wk_perc, main = "Histogram of Bicycle to Work Proportion (Percentage)", xlab = "Percentage of working population who bicycled to work")
bybike$Bike_wk_perc %>% boxplot(main = "Box Plot of Bicycle to Work Proportion (Percentage)", ylab = "Percent")
hist(bybike$Pop_density, main = "Histogram of Population Density (sqkm)", xlab = "Population Density (sqkm)")
bybike$Pop_density %>% boxplot(main = "Box Plot of Population Density (sqkm)", ylab = "Number of people per square kilometre")
NotACT <- bybike %>% filter(State != "ACT")
head(NotACT)
hist(NotACT$Pop_density, main = "Histogram of Population Density (sqkm) excluding the ACT", xlab = "Population Density (sqkm)")
NotACT$Pop_density %>% boxplot(main = "Box Plot of Population Density (sqkm) excluding the ACT", ylab = "Number of people per square kilometre")
In order to use the population density data for further analysis, outside of non-parametric methods, we will need to normalise the data. The population density variable has extremely high data points and extremely low data points and its distribution is heavily right skewed. The Log transformation is the most useful method to apply, as it will express the values in order of magnitude, compressing high values and spreading low values and is appropriate for right-skewed data. As the population data does not include zero or negative values we can apply the log transformation directly. A log transformation was applied to the ‘Pop_density’ variable using the logarithmic transformation function (log10()) and assigned to ‘log_density’. As seen in the histogram of ‘log_density’ below, the transformation was successful in reducing the skewness and improving the symmetry of the population density distribution. Being in an approximately normal distribution, the transformed population density variable can be analysed using parametric methods if needed.
log_density <- log10(bybike$Pop_density)
hist(log_density)