The aim of this pre-processing is to extract and prepare data for analysis of mean income data and rates of mortality and premature death in Local Government Areas (LGA’s) in Australia in the year of 2016. The data was extracted from two datasets before being merged into one table.
The first dataset contains data on mortality statistics in Australia, and was imported and saved as the Mortality data frame. The relevant identifying variables, and variables containing raw numbers of total deaths, premature deaths and population statistics were extracted from the dataset. The structure of the Mortality data frame was reviewed. The variables were converted to the appropriate data type where required, and the identifying variables of LGA and LGA Name required adjustments to the string composition so that they contained values consistent with the second data frame for merging. New variables Mortality_Rate and Premature_Rate were created from the original variables, in preparation for data analysis.
The Income data contains statistics on personal income by Local Government Areas in Australia. The identifying variables for LGA, and the relevant variables containing mean personal income data by year, were extracted from the dataset via sub-setting. This data was untidy, and steps to extract variables and reshape the data were completed to create a tidy data frame. The structure of the Income data was checked and data types were converted where appropriate.
The Mortality and Income data frames were merged using an inner join function, and filtered for data from 2016. A scan for missing values found two NA values which were evaluated and addressed. The variables were scanned for outliers and a transformation was applied to the Mean Income data. The small number of outliers were removed.
Two datasets will be merged during pre-processing. These datasets contain Mortality data and Income data for Local Government Areas (LGA) in Australia.
The first dataset comes from the Australian Institute of Health and Welfare (AIHW), from their reports on Mortality over Regions and Time (MORT). The imported table contains 2,730 observations over 17 variables for mortality observations by sex and year. For the purpose of this data pre-processing, the relevant variables from the “Persons” data (containing combined male and female data) have been selected;
LGA [character]: Local Government Area (LGA) code
LGA NAME [character]: Name of the Local Government Area
Year [factor]:Year of registered death
Total Deaths [numeric]: Number of deaths due to all causes
Population [numeric]: Estimated population of LGA as of 30 June
Premature Deaths [numeric]: Number of deaths where the person was under the age of 75 years old
Source: https://www.aihw.gov.au/reports/life-expectancy-death/mort-books/contents/mort-books
The second dataset is from the Australian Bureau of Statistics (ABS) data on Personal Income in Australia. The dataset contains several personal income data statistics from 2011 to 2017. Data relating to Mean personal income per LGA has been extracted via subsetting from the original dataset.
LGA [character]: Local Government Area (LGA) code
LGA NAME [character]: Name of the Local Government Area
2011-12…2016-17 [numeric]: The year the data was collected
Source: https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/personal-income-australia/latest-release
Mortality <- read_excel("Mortality_by_LGA.xlsx", sheet = "Table 1", range = "AL6:BB2736")
Mortality %<>% select("Reference", "Name", "Year", "Population","Total deaths", "Premature deaths")
colnames(Mortality) <- c("LGA", "LGA NAME", "Year","Population", "Total_deaths", "Premature_Death")
head(Mortality)tibble [2,730 × 6] (S3: tbl_df/tbl/data.frame)
$ LGA : chr [1:2730] "LGA10050" "LGA10050" "LGA10050" "LGA10050" ...
$ LGA NAME : chr [1:2730] "Albury (C)" "Albury (C)" "Albury (C)" "Albury (C)" ...
$ Year : num [1:2730] 2014 2015 2016 2017 2018 ...
$ Population : num [1:2730] 51051 51548 52171 52894 53642 ...
$ Total_deaths : num [1:2730] 389 425 462 436 446 ...
$ Premature_Death: num [1:2730] 138 123 142 142 142 ...
LGA is a character variable and this data type is correct. This variable will form part of the key to merge datasets. The strings will need to be modified to match the Income dataset LGA variable. To do this, “LGA” needs to be removed.
chr [1:2730] "10050" "10050" "10050" "10050" "10050" "10130" "10130" ...
LGA NAME is a character variable and this data type is correct. This variable will also form part of the key to merge datasets. The strings will need to be modified to match the Income dataset LGA NAME variable. To do this, the additional LGA identifiers in brackets eg. (C), (RC), (Vic.) need to be removed.
Mortality$`LGA NAME` %<>%
str_remove_all("\\(" %R% ANY_CHAR %R% "\\)") %>%
str_remove_all("\\(" %R% ANY_CHAR %R% ANY_CHAR %R% "\\)") %>%
str_remove_all("\\(Vic.\\)") %>%
str_trim(side = "right")
str(Mortality$`LGA NAME`) chr [1:2730] "Albury" "Albury" "Albury" "Albury" "Albury" ...
The Year variable is a factor and needs to be converted. In order to align with the Income dataset, factor levels are set to years from 2011 to 2018.
Mortality$Year %<>%
factor(levels = c("2011","2012","2013","2014","2015","2016", "2017", "2018"), ordered = TRUE)
class(Mortality$Year)[1] "ordered" "factor"
[1] "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018"
The variables Total_Deaths, Population and Premature Death are correctly classified as numeric and do not require modification.
All data types for the Mortality variables are correct.
tibble [2,730 × 6] (S3: tbl_df/tbl/data.frame)
$ LGA : chr [1:2730] "10050" "10050" "10050" "10050" ...
$ LGA NAME : chr [1:2730] "Albury" "Albury" "Albury" "Albury" ...
$ Year : Ord.factor w/ 8 levels "2011"<"2012"<..: 4 5 6 7 8 4 5 6 7 8 ...
$ Population : num [1:2730] 51051 51548 52171 52894 53642 ...
$ Total_deaths : num [1:2730] 389 425 462 436 446 ...
$ Premature_Death: num [1:2730] 138 123 142 142 142 ...
Mortality rate and premature death rate are important measures that provide a better insight into the deaths per population than the raw numbers.
These new variables can be created from the existing variables;
Mortality_Rate: Number of deaths per 1,000 people
Premature_Rate: Percentage of total deaths that were under the age of 75.
Once the new variables have been created, the original variables Population, Total_deaths and Premature_Death variables are no longer required, and can be removed with the dplyr select() function.
Mortality %<>%
mutate(Mortality_Rate = Total_deaths/Population*1000,
Premature_Rate = Premature_Death/Total_deaths *100) %>%
select(-Total_deaths, -Population, -Premature_Death)
head(Mortality)The final Mortality data frame contains Mortality rates and Premature Death rates for 2,730 observations for LGA and Year.
tibble [548 × 8] (S3: tbl_df/tbl/data.frame)
$ LGA : chr [1:548] "Australia" "New South Wales" "10050" "10130" ...
$ LGA NAME: chr [1:548] NA NA "Albury" "Armidale Regional" ...
$ 2011-12 : chr [1:548] "55559" "56932" "49398" "44456" ...
$ 2012-13 : chr [1:548] "58061" "59304" "50704" "45799" ...
$ 2013-14 : chr [1:548] "59802" "61340" "52204" "46641" ...
$ 2014-15 : chr [1:548] "61036" "62798" "53775" "49109" ...
$ 2015-16 : chr [1:548] "61975" "64493" "54388" "50284" ...
$ 2016-17 : chr [1:548] "62594" "65196" "55124" "50884" ...
The Income data is untidy.
In order for Income data frame to be considered tidy, the data must meet the following requirements;
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
(Dolgun, 2020a)
The Income data does not meet these criteria. The data for the State or Territory where each LGA is geographically located is found within the LGA column, and needs to be separated to form its own column. Additionally, the column headings of the data frame are values for the year variable and each row contains multiple values for the mean income variable. The data needs to be re-shaped to form these variable columns.
The data contains observations for National and State/Territory summary statistics. These observations can be identified as they do not have a numeric LGA code in the LGA variable.
The original data is formatted so that LGAs are positioned in the data frame below the State/Territory observation for where they are geographically located. The State/Territory names can be extracted from the LGA variable, and down-filled to form a new State.Territory variable. This may useful for data analysis. Once the State.Territory variable is formed, the original State/Territory observations can be removed.
Mean_Income %<>%
mutate(State.Territory = ifelse(str_detect(Mean_Income$LGA,"[:alpha:]"), Mean_Income$LGA, NA)) %>%
fill(State.Territory) %>%
filter(!str_detect(LGA, "[:alpha:]"))
head(Mean_Income)Additionally, to make the Income data tidy, the data frame needs to be re-shaped. This is done using the gather() function from tidyr package, to transform the data frame from wide to long format. The column names will form the Year variable, and the values from the columns form the Income_Mean variable.
The Income data frame is now in a tidy form, where the data is stored in a consistent format. Each variable forms a column, each row contains a single observation, and each cell contains a single value.
tibble [3,234 × 5] (S3: tbl_df/tbl/data.frame)
$ LGA : chr [1:3234] "10050" "10130" "10250" "10300" ...
$ LGA NAME : chr [1:3234] "Albury" "Armidale Regional" "Ballina" "Balranald" ...
$ State.Territory: chr [1:3234] "New South Wales" "New South Wales" "New South Wales" "New South Wales" ...
$ Year : chr [1:3234] "2011-12" "2011-12" "2011-12" "2011-12" ...
$ Income_Mean : chr [1:3234] "49398" "44456" "43656" "44514" ...
LGA and LGA NAME variables are in the appropriate ‘character’ data type, and appropriate string patterns to join as a key with the Mortality data.
chr [1:3234] "10050" "10130" "10250" "10300" "10470" "10500" "10550" ...
chr [1:3234] "Albury" "Armidale Regional" "Ballina" "Balranald" ...
Income Mean refers to the mean personal income. The variable is currently as ‘character’ and needs to be converted to numeric.
[1] "numeric"
Now that a Year variable has been formed, it needs to be converted to a factor. As this variable will be used as a key to join with the Mortality data, the string needs to be adjusted before being converted. The data currently contains financial years which span two calendar year eg. “2016-17”. For simplicity, this will be adjusted to the first year of the financial year. This variable will be converted to a factor, with the levels consistent with the Mortality data.
Mean_Income %<>% mutate(Year = str_sub(Year, 1,4))
Mean_Income$Year %<>% factor(levels = c("2011","2012","2013","2014","2015","2016", "2017", "2018"),ordered = TRUE)
class(Mean_Income$Year)[1] "ordered" "factor"
[1] "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018"
The State.Territory variable is to be converted to a factor. The labels for the States and Territories are changed to the commonly used abbreviated code.
Mean_Income$State.Territory %<>%
factor(levels = c("New South Wales", "Victoria", "Queensland", "South Australia","Western Australia", "Tasmania", "Northern Territory", "Australian Capital Territory"),
labels = c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"))
class(Mean_Income$State.Territory)[1] "factor"
[1] "NSW" "VIC" "QLD" "SA" "WA" "TAS" "NT" "ACT"
tibble [3,234 × 5] (S3: tbl_df/tbl/data.frame)
$ LGA : chr [1:3234] "10050" "10130" "10250" "10300" ...
$ LGA NAME : chr [1:3234] "Albury" "Armidale Regional" "Ballina" "Balranald" ...
$ State.Territory: Factor w/ 8 levels "NSW","VIC","QLD",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Year : Ord.factor w/ 8 levels "2011"<"2012"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ Income_Mean : num [1:3234] 49398 44456 43656 44514 51228 ...
The data is joined using an inner join. This ensures that the resulting dataset only contains observations for LGA/Year that exists in both of the Mortality and Mean_Income datasets. The data are joined by their common attributes of LGA, LGA NAME and Year.
LGA_data <- Mortality %>%
inner_join(Mean_Income, by = c("LGA", "LGA NAME", "Year")) %>%
select(State.Territory, everything())
head(LGA_data)The defined function is.specialOrNA below is used to scan the data for missing values including NA, NaN and inf values in the data.
is.specialOrNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))}
sapply(LGA2016, function(x) sum( is.specialOrNA(x) ))State.Territory LGA LGA NAME Year Mortality_Rate
0 0 0 0 0
Premature_Rate Income_Mean
1 1
Using sapply() to apply the is.specialOrNA() to the dataframe, it is evident that there are missing values in Income_Mean and Premature_Rate. The filter() function can be used with is.specialOrNA to investigate the missing values.
The LGA 44000 Maralinga Tjarutja contains an NA in premature rate variable. This LGA also has a Mortality_Rate of zero which suggests that there were no reported deaths in 2016. Consequently, it makes sense that the premature rate should also be zero. This can be imputed.
Income_Mean data is missing for LGA 56620 Ngaanyatjarraku. There are no values in the Income data for Ngaanyatjarraku in any year that can be imputed. It would not be appropriate to impute summary statistics for Mean_Income as Ngaanyatjarraku has a small population of 1,714 people, compared to the mean LGA Population (44,556) and is unlikely to be an accurate imputation. Removing this value is an appropriate approach as it makes up less than 5% of the data, and is unlikely to affect the overall data.
Outliers in data can have significant impacts on analysis of the data. If outliers are present in the data during analysis, they may increase the error variance, reduce the strength of statistical tests, and can bias the results used in estimates of model parameters. Consequently, it is vital to scan and address outliers in the pre-processing phase (Dolgun, 2020c).
This analysis will look for univariate outliers in the variables Premature_Rate, Mortality_Rate and Income_Mean. Summary statistics will provide some insight into the data. A histogram is useful to look at the distribution of the data and a boxplot can be used to detect univariate outliers in each variable.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 30.77 37.69 41.05 46.15 100.00
par(mfrow = c(1,2),mar = c(2.5, 2, 2, 2), bty = "L", cex.main = "0.8")
hist(LGA2016$Premature_Rate, col = "#00CED1", main = "Distribution of Premature Death Rate", xlab = NULL, breaks = 15)
boxplot(LGA2016$Premature_Rate, col = "#00CED1", main = "Boxplot of Premature Death Rate") The boxplot shows a number of outliers in the premature death rate data. The histogram shows that the data is right skewed. The summary statistics demonstrate a greater mean (41.05) than median (37.69), which supports the notion of right skewness in the data.
As the data is skewed, the Tukey’s Method of Detecting outliers is the most appropriate method to apply to the data. In Tukey’s method, outliers are values that fall beyond the outlier fences (Dolgun, 2020c).
The outlier fences for extreme outliers, which will be investigated in this analysis, are defined as 2nd Quant - 3 *IQR and 3nd Quant + 3 *IQR for lower and upper fences respctively.
PREMupper_fence <- quantile(LGA2016$Premature_Rate)[[4]] + 3*IQR(LGA2016$Premature_Rate)
PREMlower_fence <- quantile(LGA2016$Premature_Rate)[[2]] - 3*IQR(LGA2016$Premature_Rate)
PREMoutliers <- which(LGA2016$Premature_Rate > PREMupper_fence | LGA2016$Premature_Rate < PREMlower_fence)
PREMoutliers[1] 206 239 260 275 333 396 518
Tukey’s Method detected 7 extreme outliers in the data.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 5.630 7.500 7.635 9.413 17.493
par(mfrow = c(1,2),mar = c(2.5, 2, 2, 2), bty = "L", cex.main = "0.8")
hist(LGA2016$Mortality_Rate, col = "#FFDA33", main = "Distribution of Mortality Rate", xlab = NULL, breaks = 15)
boxplot(LGA2016$Mortality_Rate, col = "#FFDA33", main = "Boxplot of Mortality Rate", ylab = "Mortality Rate") The boxplot of Mortality rate shows two outliers with a high mortality rate.As the data appears approximately normal in the histogram, the Z-score approach can be used to detect outliers. In this approach, outliers are defined as values with a z-score greater than 3 (Dolgun, 2020c).
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.81392 -0.73913 -0.04977 0.00000 0.65524 3.63287
[1] 137 141
The two outliers for Mortality Rate are observations 137 and 141.
Min. 1st Qu. Median Mean 3rd Qu. Max.
27519 49368 55677 59259 63757 236678
par(mfrow = c(1,2),mar = c(2.5, 2, 2, 2), bty = "L", cex.main = "0.8")
hist(LGA2016$Income_Mean, col = "#BB0F55", main = "Distribution of Mean Income", xlab = NULL, breaks = 15)
boxplot(LGA2016$Income_Mean, col = "#BB0F55", main = "Boxplot of Mean Income", ylab = "Mean Income")The boxplot of Mean Income shows a number of outliers in the data. The histogram shows that the distribution of the data is right skewed. This is supported by the summary statistics, as the mean value (59259) is greater than the median (55677) which is a more robust statistic in a skewed distribution (Baglin, 2020).
In this situation, the z-score method of detecting outliers is not appropriate.
This data represents the average personal income for local government areas in Australia. The outliers here likely represent the small number of wealthy suburbs/LGAs in Australia, as opposed to errors in measurement and data entry. To remove these observations would not produce an accurate representation of the true data. Instead a transformation can be applied to convert the distribution to a normal distribution, before scanning and removing outliers.
There are many approaches to transformation to convert a skewed distribution to a normal distribution. In this case, a reciprocal transformation is the most effective approach to transforming the Mean Income data to an approximately normal distribution.
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.225e-06 1.568e-05 1.796e-05 1.786e-05 2.026e-05 3.634e-05
par(mfrow = c(1,2),mar = c(2.5, 2, 2, 2), bty = "L", cex.main = "0.8")
hist(LGA2016_transformed$Income_Mean, col = "#EC3E68", main = "Distribution of Mean Income
(Transformed)", xlab = NULL, breaks = 15)
boxplot(LGA2016_transformed$Income_Mean, col = "#EC3E68", main = "Boxplot of Mean Income
(Transformed)", ylab = "Mean Income")
As the data is now approximately normal, the z-score approach to outliers can be applied to detect outliers in the data.
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.62800 -0.57949 0.02603 0.00000 0.63665 4.91502
[1] 75 125 239 260 381 447 514
The z-score approach identified 7 observations that are considered to be outliers.
There are a total of 14 unique outliers in the LGA2016 data, out of the total 529 observations. As the outliers make up less than 5% of the overall data, it is appropriate to remove these values.
LGA2016_trimmed <- LGA2016_transformed[-c(PREMoutliers, MORToutliers, INCoutliers),]
summary(LGA2016_trimmed[,5:7]) Mortality_Rate Premature_Rate Income_Mean
Min. : 0.000 Min. : 0.00 Min. :6.765e-06
1st Qu.: 5.691 1st Qu.:30.79 1st Qu.:1.570e-05
Median : 7.502 Median :37.64 Median :1.796e-05
Mean : 7.659 Mean :40.40 Mean :1.786e-05
3rd Qu.: 9.451 3rd Qu.:45.83 3rd Qu.:2.016e-05
Max. :15.025 Max. :91.67 Max. :2.724e-05
par(mfrow = c(1,3))
hist(LGA2016_trimmed$Income_Mean, col = "#EC3E68", main = "Distribution of
Mean Income", xlab = "Mean Income", breaks = 15, xlim = c(6.765e-06,2.724e-05))
hist(LGA2016_trimmed$Mortality_Rate, col = "#FFDA33", main = "Distribution of
Mortality Rate", xlab = "Mortality Rate", breaks = 15, xlim = c(0,18))
hist(LGA2016_trimmed$Premature_Rate, col = "#00CED1", main = "Distribution of
Premature Death Rate", xlab = "Premature Death Rate", breaks = 15, xlim = c(0,100))A QQ plot is another useful tool for assessing the normality of data.
par(mfrow = c(1,3))
qqPlot(LGA2016_trimmed$Income_Mean, dist = "norm", col = "#EC3E68", ylab = "Income Mean")
qqPlot(LGA2016_trimmed$Mortality_Rate, dist = "norm", col = "#FFDA33", ylab = "Mortality Rate")
qqPlot(LGA2016_trimmed$Premature_Rate, dist = "norm", col = "#00CED1", ylab = "Premature Death Rate")The Income_Mean and Mortality Rate variables in the data now appear to be approximately normal on a histogram and QQ Plot. There is some remaining skewness in the Premature Rate data.Transformations of the Premature Rate variable were not effective in improving normality due to the presence of zero values in the data. This skewness should be taken into consideration when the data is being analysed.
The pre-processing of this data is complete. The original datasets were tidied, and appropriate data type conversions were made. The new variables for mortality rate and premature death rate were created, and the datasets were merged and filtered to extract observations from 2016. The data was scanned for NA values. Two NA values were identified, evaluated and appropriate measures were taken to handle them. The variables were scanned for univariate outliers. Tukey’s method of detecting outliers was applied to the Premature Rate variable due to the right skewness of the distribution. The z score approach to outliers was used to detect outliers in the normally distributed Mortality Rate data. A reciprocal transformation was applied to the Mean Income data before outliers were detected using the z score approach. In total, 14 unique outliers were identified and removed from the data. The data was re-assessed for normality in a histogram and QQ plot. The resulting data is ready for data analysis of mean income and rates of mortality in 2016.
Baglin, J., (2020). Module 2: Descriptive Statistics through Visualisation https://astral-theory-157510.appspot.com/secured/MATH1324_Module_02.html#measures_of_central_tendency
Doglun, A., (2020a). Module 4: Tidy and Manipulate: Tidy Data Principles and Manipulating Data http://rare-phoenix-161610.appspot.com/secured/Module_04.html
Doglun, A., (2020b). Module 5: Scan: Missing Values http://rare-phoenix-161610.appspot.com/secured/Module_05.html
Doglun, A., (2020c). Module 5: Scan: Outliers http://rare-phoenix-161610.appspot.com/secured/Module_06.html