MATH2349: Data Wrangling

Required packages

library(readxl)
library(magrittr)
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
library(rebus)
library(knitr)
library(outliers)
library(forecast)
library(car)

Executive Summary

The aim of this pre-processing is to extract and prepare data for analysis of mean income data and rates of mortality and premature death in Local Government Areas (LGA’s) in Australia in the year of 2016. The data was extracted from two datasets before being merged into one table.

The first dataset contains data on mortality statistics in Australia, and was imported and saved as the Mortality data frame. The relevant identifying variables, and variables containing raw numbers of total deaths, premature deaths and population statistics were extracted from the dataset. The structure of the Mortality data frame was reviewed. The variables were converted to the appropriate data type where required, and the identifying variables of LGA and LGA Name required adjustments to the string composition so that they contained values consistent with the second data frame for merging. New variables Mortality_Rate and Premature_Rate were created from the original variables, in preparation for data analysis.

The Income data contains statistics on personal income by Local Government Areas in Australia. The identifying variables for LGA, and the relevant variables containing mean personal income data by year, were extracted from the dataset via sub-setting. This data was untidy, and steps to extract variables and reshape the data were completed to create a tidy data frame. The structure of the Income data was checked and data types were converted where appropriate.

The Mortality and Income data frames were merged using an inner join function, and filtered for data from 2016. A scan for missing values found two NA values which were evaluated and addressed. The variables were scanned for outliers and a transformation was applied to the Mean Income data. The small number of outliers were removed.

Data

Two datasets will be merged during pre-processing. These datasets contain Mortality data and Income data for Local Government Areas (LGA) in Australia.

Mortality Data

The first dataset comes from the Australian Institute of Health and Welfare (AIHW), from their reports on Mortality over Regions and Time (MORT). The imported table contains 2,730 observations over 17 variables for mortality observations by sex and year. For the purpose of this data pre-processing, the relevant variables from the “Persons” data (containing combined male and female data) have been selected;

LGA [character]: Local Government Area (LGA) code
LGA NAME [character]: Name of the Local Government Area
Year [factor]:Year of registered death
Total Deaths [numeric]: Number of deaths due to all causes
Population [numeric]: Estimated population of LGA as of 30 June
Premature Deaths [numeric]: Number of deaths where the person was under the age of 75 years old

Source: https://www.aihw.gov.au/reports/life-expectancy-death/mort-books/contents/mort-books

Income Data

The second dataset is from the Australian Bureau of Statistics (ABS) data on Personal Income in Australia. The dataset contains several personal income data statistics from 2011 to 2017. Data relating to Mean personal income per LGA has been extracted via subsetting from the original dataset.

LGA [character]: Local Government Area (LGA) code
LGA NAME [character]: Name of the Local Government Area
2011-12…2016-17 [numeric]: The year the data was collected

Source: https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/personal-income-australia/latest-release

Importing & Subsetting

Mortality Data

Mortality <- read_excel("Mortality_by_LGA.xlsx", sheet = "Table 1", range = "AL6:BB2736")
Mortality  %<>% select("Reference", "Name", "Year", "Population","Total deaths", "Premature deaths")
colnames(Mortality) <- c("LGA", "LGA NAME", "Year","Population", "Total_deaths", "Premature_Death")
head(Mortality)

Income Data

Income <- read_excel("Income_by_LGA.xls", sheet = 6, range = "A7:AF555",.name_repair = "minimal")
Mean_Income <- Income %<>% select(1,2, 27:32)

head(Mean_Income)

Understand: Mortality Data

Structure

str(Mortality)

tibble [2,730 × 6] (S3: tbl_df/tbl/data.frame)
 $ LGA            : chr [1:2730] "LGA10050" "LGA10050" "LGA10050" "LGA10050" ...
 $ LGA NAME       : chr [1:2730] "Albury (C)" "Albury (C)" "Albury (C)" "Albury (C)" ...
 $ Year           : num [1:2730] 2014 2015 2016 2017 2018 ...
 $ Population     : num [1:2730] 51051 51548 52171 52894 53642 ...
 $ Total_deaths   : num [1:2730] 389 425 462 436 446 ...
 $ Premature_Death: num [1:2730] 138 123 142 142 142 ...

Modify LGA Variable

LGA is a character variable and this data type is correct. This variable will form part of the key to merge datasets. The strings will need to be modified to match the Income dataset LGA variable. To do this, “LGA” needs to be removed.

Mortality$LGA %<>% str_remove_all("LGA")
str(Mortality$LGA)

 chr [1:2730] "10050" "10050" "10050" "10050" "10050" "10130" "10130" ...

Modify LGA NAME variable

LGA NAME is a character variable and this data type is correct. This variable will also form part of the key to merge datasets. The strings will need to be modified to match the Income dataset LGA NAME variable. To do this, the additional LGA identifiers in brackets eg. (C), (RC), (Vic.) need to be removed.

Mortality$`LGA NAME` %<>% 
  str_remove_all("\\(" %R% ANY_CHAR %R% "\\)") %>% 
  str_remove_all("\\(" %R% ANY_CHAR %R% ANY_CHAR %R% "\\)") %>% 
  str_remove_all("\\(Vic.\\)") %>% 
  str_trim(side = "right")
str(Mortality$`LGA NAME`)

 chr [1:2730] "Albury" "Albury" "Albury" "Albury" "Albury" ...

Modify Year variable

The Year variable is a factor and needs to be converted. In order to align with the Income dataset, factor levels are set to years from 2011 to 2018.

Mortality$Year %<>% 
  factor(levels = c("2011","2012","2013","2014","2015","2016", "2017", "2018"), ordered = TRUE)

class(Mortality$Year)

[1] "ordered" "factor"

levels(Mortality$Year)

[1] "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018"

Statistic variables

The variables Total_Deaths, Population and Premature Death are correctly classified as numeric and do not require modification.
All data types for the Mortality variables are correct.

str(Mortality)

tibble [2,730 × 6] (S3: tbl_df/tbl/data.frame)
 $ LGA            : chr [1:2730] "10050" "10050" "10050" "10050" ...
 $ LGA NAME       : chr [1:2730] "Albury" "Albury" "Albury" "Albury" ...
 $ Year           : Ord.factor w/ 8 levels "2011"<"2012"<..: 4 5 6 7 8 4 5 6 7 8 ...
 $ Population     : num [1:2730] 51051 51548 52171 52894 53642 ...
 $ Total_deaths   : num [1:2730] 389 425 462 436 446 ...
 $ Premature_Death: num [1:2730] 138 123 142 142 142 ...

Create New Variables

Mortality Rate & Premature Death Rate

Mortality rate and premature death rate are important measures that provide a better insight into the deaths per population than the raw numbers.
These new variables can be created from the existing variables;

Mortality_Rate: Number of deaths per 1,000 people
Premature_Rate: Percentage of total deaths that were under the age of 75.

Once the new variables have been created, the original variables Population, Total_deaths and Premature_Death variables are no longer required, and can be removed with the dplyr select() function.

Mortality %<>% 
  mutate(Mortality_Rate = Total_deaths/Population*1000,
         Premature_Rate = Premature_Death/Total_deaths *100) %>% 
  select(-Total_deaths, -Population, -Premature_Death)

head(Mortality)

The final Mortality data frame contains Mortality rates and Premature Death rates for 2,730 observations for LGA and Year.

Understand: Income Data

str(Mean_Income)

tibble [548 × 8] (S3: tbl_df/tbl/data.frame)
 $ LGA     : chr [1:548] "Australia" "New South Wales" "10050" "10130" ...
 $ LGA NAME: chr [1:548] NA NA "Albury" "Armidale Regional" ...
 $ 2011-12 : chr [1:548] "55559" "56932" "49398" "44456" ...
 $ 2012-13 : chr [1:548] "58061" "59304" "50704" "45799" ...
 $ 2013-14 : chr [1:548] "59802" "61340" "52204" "46641" ...
 $ 2014-15 : chr [1:548] "61036" "62798" "53775" "49109" ...
 $ 2015-16 : chr [1:548] "61975" "64493" "54388" "50284" ...
 $ 2016-17 : chr [1:548] "62594" "65196" "55124" "50884" ...

Tidy Income Data

The Income data is untidy.

In order for Income data frame to be considered tidy, the data must meet the following requirements;
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
(Dolgun, 2020a)

The Income data does not meet these criteria. The data for the State or Territory where each LGA is geographically located is found within the LGA column, and needs to be separated to form its own column. Additionally, the column headings of the data frame are values for the year variable and each row contains multiple values for the mean income variable. The data needs to be re-shaped to form these variable columns.

Creating the State.Territory Variable

The data contains observations for National and State/Territory summary statistics. These observations can be identified as they do not have a numeric LGA code in the LGA variable.
The original data is formatted so that LGAs are positioned in the data frame below the State/Territory observation for where they are geographically located. The State/Territory names can be extracted from the LGA variable, and down-filled to form a new State.Territory variable. This may useful for data analysis. Once the State.Territory variable is formed, the original State/Territory observations can be removed.

Mean_Income %<>% 
  mutate(State.Territory = ifelse(str_detect(Mean_Income$LGA,"[:alpha:]"), Mean_Income$LGA, NA)) %>%
  fill(State.Territory) %>% 
  filter(!str_detect(LGA, "[:alpha:]"))
head(Mean_Income)

Re-Shaping

Additionally, to make the Income data tidy, the data frame needs to be re-shaped. This is done using the gather() function from tidyr package, to transform the data frame from wide to long format. The column names will form the Year variable, and the values from the columns form the Income_Mean variable.

Mean_Income %<>% gather("Year", "Income_Mean", 3:8)
head(Mean_Income)

The Income data frame is now in a tidy form, where the data is stored in a consistent format. Each variable forms a column, each row contains a single observation, and each cell contains a single value.

Structure

str(Mean_Income)

tibble [3,234 × 5] (S3: tbl_df/tbl/data.frame)
 $ LGA            : chr [1:3234] "10050" "10130" "10250" "10300" ...
 $ LGA NAME       : chr [1:3234] "Albury" "Armidale Regional" "Ballina" "Balranald" ...
 $ State.Territory: chr [1:3234] "New South Wales" "New South Wales" "New South Wales" "New South Wales" ...
 $ Year           : chr [1:3234] "2011-12" "2011-12" "2011-12" "2011-12" ...
 $ Income_Mean    : chr [1:3234] "49398" "44456" "43656" "44514" ...

Review the LGA and LGA NAME Variables

LGA and LGA NAME variables are in the appropriate ‘character’ data type, and appropriate string patterns to join as a key with the Mortality data.

str(Mean_Income$LGA)

 chr [1:3234] "10050" "10130" "10250" "10300" "10470" "10500" "10550" ...

str(Mean_Income$`LGA NAME`)

 chr [1:3234] "Albury" "Armidale Regional" "Ballina" "Balranald" ...

Modify the Income_Mean variable

Income Mean refers to the mean personal income. The variable is currently as ‘character’ and needs to be converted to numeric.

Mean_Income$Income_Mean %<>% as.numeric()
class(Mean_Income$Income_Mean)

[1] "numeric"

Modify the Year Variable

Now that a Year variable has been formed, it needs to be converted to a factor. As this variable will be used as a key to join with the Mortality data, the string needs to be adjusted before being converted. The data currently contains financial years which span two calendar year eg. “2016-17”. For simplicity, this will be adjusted to the first year of the financial year. This variable will be converted to a factor, with the levels consistent with the Mortality data.

Mean_Income %<>% mutate(Year = str_sub(Year, 1,4))
Mean_Income$Year %<>% factor(levels = c("2011","2012","2013","2014","2015","2016", "2017", "2018"),ordered = TRUE)
class(Mean_Income$Year)

[1] "ordered" "factor"

levels(Mean_Income$Year)

[1] "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018"

Modify the State.Territory Variable

The State.Territory variable is to be converted to a factor. The labels for the States and Territories are changed to the commonly used abbreviated code.

Mean_Income$State.Territory %<>% 
  factor(levels = c("New South Wales", "Victoria", "Queensland", "South Australia","Western Australia", "Tasmania", "Northern Territory", "Australian Capital Territory"),
         labels = c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"))
class(Mean_Income$State.Territory)

[1] "factor"

levels(Mean_Income$State.Territory)

[1] "NSW" "VIC" "QLD" "SA"  "WA"  "TAS" "NT"  "ACT"

str(Mean_Income)

tibble [3,234 × 5] (S3: tbl_df/tbl/data.frame)
 $ LGA            : chr [1:3234] "10050" "10130" "10250" "10300" ...
 $ LGA NAME       : chr [1:3234] "Albury" "Armidale Regional" "Ballina" "Balranald" ...
 $ State.Territory: Factor w/ 8 levels "NSW","VIC","QLD",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Year           : Ord.factor w/ 8 levels "2011"<"2012"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Income_Mean    : num [1:3234] 49398 44456 43656 44514 51228 ...

Join the Data

The data is joined using an inner join. This ensures that the resulting dataset only contains observations for LGA/Year that exists in both of the Mortality and Mean_Income datasets. The data are joined by their common attributes of LGA, LGA NAME and Year.

LGA_data <- Mortality %>% 
  inner_join(Mean_Income, by = c("LGA", "LGA NAME", "Year")) %>% 
  select(State.Territory, everything())

head(LGA_data)

Subsetting 2016 Data

The focus of this data pre-processing is to extract the 2016 data in preparation for analysis. The 2016 data is extracted using the dplyr filter() function.

LGA2016 <- LGA_data %>% filter(Year == 2016)
dim(LGA2016)

[1] 530   7

There are observations for 530 LGAs in 2016.

Scan for Missing Values

The defined function is.specialOrNA below is used to scan the data for missing values including NA, NaN and inf values in the data.

Scan for Na and NaN values

is.specialOrNA <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))}

sapply(LGA2016, function(x) sum( is.specialOrNA(x) ))

State.Territory             LGA        LGA NAME            Year  Mortality_Rate 
              0               0               0               0               0 
 Premature_Rate     Income_Mean 
              1               1

Using sapply() to apply the is.specialOrNA() to the dataframe, it is evident that there are missing values in Income_Mean and Premature_Rate. The filter() function can be used with is.specialOrNA to investigate the missing values.

Investigate Missing Values

LGA2016 %>% 
  filter(is.specialOrNA(Income_Mean) | is.specialOrNA(Premature_Rate))

Evaluating Missing Values

Premature Rate

The LGA 44000 Maralinga Tjarutja contains an NA in premature rate variable. This LGA also has a Mortality_Rate of zero which suggests that there were no reported deaths in 2016. Consequently, it makes sense that the premature rate should also be zero. This can be imputed.

Imputing NA

LGA2016$Premature_Rate[is.specialOrNA(LGA2016$Premature_Rate)] <- 0

Income Mean

Income_Mean data is missing for LGA 56620 Ngaanyatjarraku. There are no values in the Income data for Ngaanyatjarraku in any year that can be imputed. It would not be appropriate to impute summary statistics for Mean_Income as Ngaanyatjarraku has a small population of 1,714 people, compared to the mean LGA Population (44,556) and is unlikely to be an accurate imputation. Removing this value is an appropriate approach as it makes up less than 5% of the data, and is unlikely to affect the overall data.

Remove observation for LGA 56620 Ngaanyatjarraku

LGA2016 %<>% filter(complete.cases(LGA2016$Income_Mean))

Re-Scan for Missing Values

sapply(LGA2016, function(x) sum( is.specialOrNA(x) ))

State.Territory             LGA        LGA NAME            Year  Mortality_Rate 
              0               0               0               0               0 
 Premature_Rate     Income_Mean 
              0               0

There are no remaining missing values in the data.

Scan for Outliers

Outliers in data can have significant impacts on analysis of the data. If outliers are present in the data during analysis, they may increase the error variance, reduce the strength of statistical tests, and can bias the results used in estimates of model parameters. Consequently, it is vital to scan and address outliers in the pre-processing phase (Dolgun, 2020c).

This analysis will look for univariate outliers in the variables Premature_Rate, Mortality_Rate and Income_Mean. Summary statistics will provide some insight into the data. A histogram is useful to look at the distribution of the data and a boxplot can be used to detect univariate outliers in each variable.

Premature Death Rate

summary(LGA2016$Premature_Rate)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   30.77   37.69   41.05   46.15  100.00

par(mfrow = c(1,2),mar = c(2.5, 2, 2, 2), bty = "L", cex.main = "0.8")
hist(LGA2016$Premature_Rate, col = "#00CED1", main = "Distribution of Premature Death Rate", xlab = NULL, breaks = 15)
boxplot(LGA2016$Premature_Rate, col = "#00CED1", main = "Boxplot of Premature Death Rate")

The boxplot shows a number of outliers in the premature death rate data. The histogram shows that the data is right skewed. The summary statistics demonstrate a greater mean (41.05) than median (37.69), which supports the notion of right skewness in the data.

Detecting outliers

As the data is skewed, the Tukey’s Method of Detecting outliers is the most appropriate method to apply to the data. In Tukey’s method, outliers are values that fall beyond the outlier fences (Dolgun, 2020c).
The outlier fences for extreme outliers, which will be investigated in this analysis, are defined as 2nd Quant - 3 *IQR and 3nd Quant + 3 *IQR for lower and upper fences respctively.

PREMupper_fence <- quantile(LGA2016$Premature_Rate)[[4]] + 3*IQR(LGA2016$Premature_Rate)
PREMlower_fence <- quantile(LGA2016$Premature_Rate)[[2]] - 3*IQR(LGA2016$Premature_Rate)

PREMoutliers <- which(LGA2016$Premature_Rate > PREMupper_fence | LGA2016$Premature_Rate < PREMlower_fence) 
PREMoutliers

[1] 206 239 260 275 333 396 518

Tukey’s Method detected 7 extreme outliers in the data.

Mortality Rate

summary(LGA2016$Mortality_Rate)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   5.630   7.500   7.635   9.413  17.493

par(mfrow = c(1,2),mar = c(2.5, 2, 2, 2), bty = "L", cex.main = "0.8")
hist(LGA2016$Mortality_Rate, col = "#FFDA33", main = "Distribution of Mortality Rate", xlab = NULL, breaks = 15)
boxplot(LGA2016$Mortality_Rate, col = "#FFDA33", main = "Boxplot of Mortality Rate", ylab = "Mortality Rate")

The boxplot of Mortality rate shows two outliers with a high mortality rate.As the data appears approximately normal in the histogram, the Z-score approach can be used to detect outliers. In this approach, outliers are defined as values with a z-score greater than 3 (Dolgun, 2020c).

Detecting outliers

z.scores <- LGA2016$Mortality_Rate %>%  scores(type = "z")
z.scores %>% summary()

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.81392 -0.73913 -0.04977  0.00000  0.65524  3.63287

MORToutliers <- which( abs(z.scores) >3 )
MORToutliers

[1] 137 141

The two outliers for Mortality Rate are observations 137 and 141.

Mean Income

summary(LGA2016$Income_Mean)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  27519   49368   55677   59259   63757  236678

par(mfrow = c(1,2),mar = c(2.5, 2, 2, 2), bty = "L", cex.main = "0.8")
hist(LGA2016$Income_Mean, col = "#BB0F55", main = "Distribution of Mean Income", xlab = NULL, breaks = 15)
boxplot(LGA2016$Income_Mean, col = "#BB0F55", main = "Boxplot of Mean Income", ylab = "Mean Income")

The boxplot of Mean Income shows a number of outliers in the data. The histogram shows that the distribution of the data is right skewed. This is supported by the summary statistics, as the mean value (59259) is greater than the median (55677) which is a more robust statistic in a skewed distribution (Baglin, 2020).

In this situation, the z-score method of detecting outliers is not appropriate.
This data represents the average personal income for local government areas in Australia. The outliers here likely represent the small number of wealthy suburbs/LGAs in Australia, as opposed to errors in measurement and data entry. To remove these observations would not produce an accurate representation of the true data. Instead a transformation can be applied to convert the distribution to a normal distribution, before scanning and removing outliers.

Transformation

There are many approaches to transformation to convert a skewed distribution to a normal distribution. In this case, a reciprocal transformation is the most effective approach to transforming the Mean Income data to an approximately normal distribution.

LGA2016_transformed <- LGA2016 %>% mutate(Income_Mean = (LGA2016$Income_Mean)^(-1))

summary(LGA2016_transformed$Income_Mean)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
4.225e-06 1.568e-05 1.796e-05 1.786e-05 2.026e-05 3.634e-05

par(mfrow = c(1,2),mar = c(2.5, 2, 2, 2), bty = "L", cex.main = "0.8")
hist(LGA2016_transformed$Income_Mean, col = "#EC3E68", main = "Distribution of Mean Income 
     (Transformed)", xlab = NULL, breaks = 15)
boxplot(LGA2016_transformed$Income_Mean, col = "#EC3E68", main = "Boxplot of Mean Income 
        (Transformed)", ylab = "Mean Income")

Detecting Outliers

As the data is now approximately normal, the z-score approach to outliers can be applied to detect outliers in the data.

z.scores <- LGA2016_transformed$Income_Mean %>%  scores(type = "z")
z.scores %>% summary()

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.62800 -0.57949  0.02603  0.00000  0.63665  4.91502

INCoutliers <- which( abs(z.scores) >3 )
INCoutliers

[1]  75 125 239 260 381 447 514

The z-score approach identified 7 observations that are considered to be outliers.

Handling Outliers

There are a total of 14 unique outliers in the LGA2016 data, out of the total 529 observations. As the outliers make up less than 5% of the overall data, it is appropriate to remove these values.

LGA2016_trimmed <- LGA2016_transformed[-c(PREMoutliers, MORToutliers, INCoutliers),]
summary(LGA2016_trimmed[,5:7])

 Mortality_Rate   Premature_Rate   Income_Mean       
 Min.   : 0.000   Min.   : 0.00   Min.   :6.765e-06  
 1st Qu.: 5.691   1st Qu.:30.79   1st Qu.:1.570e-05  
 Median : 7.502   Median :37.64   Median :1.796e-05  
 Mean   : 7.659   Mean   :40.40   Mean   :1.786e-05  
 3rd Qu.: 9.451   3rd Qu.:45.83   3rd Qu.:2.016e-05  
 Max.   :15.025   Max.   :91.67   Max.   :2.724e-05

par(mfrow = c(1,3))
hist(LGA2016_trimmed$Income_Mean, col = "#EC3E68", main = "Distribution of 
     Mean Income", xlab = "Mean Income", breaks = 15, xlim = c(6.765e-06,2.724e-05))
hist(LGA2016_trimmed$Mortality_Rate, col = "#FFDA33", main = "Distribution of 
     Mortality Rate", xlab = "Mortality Rate", breaks = 15, xlim = c(0,18))
hist(LGA2016_trimmed$Premature_Rate, col = "#00CED1", main = "Distribution of 
     Premature Death Rate", xlab = "Premature Death Rate", breaks = 15, xlim = c(0,100))

Checking Normality

A QQ plot is another useful tool for assessing the normality of data.

par(mfrow = c(1,3))
qqPlot(LGA2016_trimmed$Income_Mean, dist = "norm", col = "#EC3E68", ylab = "Income Mean")
qqPlot(LGA2016_trimmed$Mortality_Rate, dist = "norm", col = "#FFDA33", ylab = "Mortality Rate")
qqPlot(LGA2016_trimmed$Premature_Rate, dist = "norm", col = "#00CED1", ylab = "Premature Death Rate")

The Income_Mean and Mortality Rate variables in the data now appear to be approximately normal on a histogram and QQ Plot. There is some remaining skewness in the Premature Rate data.Transformations of the Premature Rate variable were not effective in improving normality due to the presence of zero values in the data. This skewness should be taken into consideration when the data is being analysed.

Conclusion

The pre-processing of this data is complete. The original datasets were tidied, and appropriate data type conversions were made. The new variables for mortality rate and premature death rate were created, and the datasets were merged and filtered to extract observations from 2016. The data was scanned for NA values. Two NA values were identified, evaluated and appropriate measures were taken to handle them. The variables were scanned for univariate outliers. Tukey’s method of detecting outliers was applied to the Premature Rate variable due to the right skewness of the distribution. The z score approach to outliers was used to detect outliers in the normally distributed Mortality Rate data. A reciprocal transformation was applied to the Mean Income data before outliers were detected using the z score approach. In total, 14 unique outliers were identified and removed from the data. The data was re-assessed for normality in a histogram and QQ plot. The resulting data is ready for data analysis of mean income and rates of mortality in 2016.

References

Baglin, J., (2020). Module 2: Descriptive Statistics through Visualisation https://astral-theory-157510.appspot.com/secured/MATH1324_Module_02.html#measures_of_central_tendency

Doglun, A., (2020a). Module 4: Tidy and Manipulate: Tidy Data Principles and Manipulating Data http://rare-phoenix-161610.appspot.com/secured/Module_04.html

Doglun, A., (2020b). Module 5: Scan: Missing Values http://rare-phoenix-161610.appspot.com/secured/Module_05.html

Doglun, A., (2020c). Module 5: Scan: Outliers http://rare-phoenix-161610.appspot.com/secured/Module_06.html

MATH2349: Data Wrangling

Assignment 2

Alexandra Louise Walsh (S3873565)

Required packages

Executive Summary

Data

Mortality Data

Income Data

Importing & Subsetting

Mortality Data

Income Data

Understand: Mortality Data

Structure

Modify LGA Variable

Modify LGA NAME variable

Modify Year variable

Statistic variables

Create New Variables

Mortality Rate & Premature Death Rate

Understand: Income Data

Tidy Income Data

Creating the State.Territory Variable

Re-Shaping

Structure

Review the LGA and LGA NAME Variables

Modify the Income_Mean variable

Modify the Year Variable

Modify the State.Territory Variable

Join the Data

Subsetting 2016 Data

Scan for Missing Values

Scan for Na and NaN values

Investigate Missing Values

Evaluating Missing Values

Premature Rate

Imputing NA

Income Mean

Remove observation for LGA 56620 Ngaanyatjarraku

Re-Scan for Missing Values

Scan for Outliers

Premature Death Rate

Detecting outliers

Mortality Rate

Detecting outliers

Mean Income

Transformation

Detecting Outliers

Handling Outliers

Checking Normality

Conclusion

References