library(readxl)
library(magrittr)
library(dplyr)
library(tidyr)
library(outliers)
library(forecast)
For both data sets, they were both pre-processed in the same way but with different variable names indicating which data set they are the from, male or female.
Firstly, the data sets returned NA values in between observations that differentiated between states, territory and country data, this was removed by sub-setting the data.
There were more NA values in the first column under each state or territory as they were left blank for easy understanding in an excel spreadsheet. The corresponding states and territory were manually imputed to remove the NA values while keeping the differentiation between state, territory and country data.
Lastly, the variable names needed to be renamed since two variables did not have names from the initial data set and the other variables names needed to be cleaned for ease of understanding.
In 2017, a postal survey was sent out to households throughout Australia regarding the law of same sex marriage. Individuals responded with whether they support or do not support the legalisation of same sex marriage in Australia. This survey was conducted by the Australia Bureau of Statistics (ABS).
The data sets used in this report are provided by the ABS. It reports the number of individuals who participated in the survey by state and territory, gender, age group and total.
Each data set represents a gender (male or female) and the variables of the data sets include:
Source: ABS
male_p <- read_excel("d:/Users/Vivian/Desktop/australian_marriage_law_postal_survey_2017_-_participation_final.xls",
sheet = "Table 2", skip = 5)
New names:
* `` -> ...1
* `` -> ...2
head(male_p)
female_p <- read_excel("d:/Users/Vivian/Desktop/australian_marriage_law_postal_survey_2017_-_participation_final.xls",
sheet = "Table 3", skip = 5)
New names:
* `` -> ...1
* `` -> ...2
head(female_p)
Codes for pre-processing the data sets for reference further on in the report.
#subset
There were 12 warnings (use warnings() to see them)
male_p <- male_p[c(1:3, 5:7, 9:11, 13:15, 17:19, 21:23, 25:27, 29:31, 33:35), ]
The `i` argument of ``[.tbl_df`()` must lie in [0, rows] if positive, as of tibble 3.0.0.
Use `NA` as row index to obtain a row full of `NA` values.
[90mThis warning is displayed once every 8 hours.[39m
[90mCall `lifecycle::last_warnings()` to see where this warning was generated.[39m
# replace NA in location
male_p$...1 <- c("New South Wales", "New South Wales", "New South Wales", "Victoria", "Victoria", "Victoria",
"Queensland", "Queensland", "Queensland", "South Australia", "South Australia", "South Australia",
"Western Australia", "Western Australia", "Western Australia", "Tasmania", "Tasmania", "Tasmania",
"Northern Territory", "Northern Territory", "Northern Territory", "Australian Capital Territory",
"Australian Capital Territory", "Australian Capital Territory", "Australia", "Australia", "Australia")
# rename columns
colnames(male_p) <- c("Location", "Measure", "Male_18-19 years", "Male_20-24 years", "Male_25-29 years",
"Male_30-34 years", "Male_35-39 years", "Male_40-44 years", "Male_45-49 years",
"Male_50-54 years", "Male_55-59 years", "Male_60-64 years", "Male_65-69 years",
"Male_70-74 years", "Male_75-79 years", "Male_80-84 years", "Male_85 years and over",
"Total Males")
head(male_p)
# subset
female_p <- female_p[c(1:3, 5:7, 9:11, 13:15, 17:19, 21:23, 25:27, 29:31, 33:35), ]
# replace NA in location
female_p$...1 <- c("New South Wales", "New South Wales", "New South Wales", "Victoria", "Victoria", "Victoria",
"Queensland", "Queensland", "Queensland", "South Australia", "South Australia", "South Australia",
"Western Australia", "Western Australia", "Western Australia", "Tasmania", "Tasmania", "Tasmania",
"Northern Territory", "Northern Territory", "Northern Territory", "Australian Capital Territory",
"Australian Capital Territory", "Australian Capital Territory", "Australia", "Australia", "Australia")
# rename columns
colnames(female_p) <- c("Location", "Measure", "Female_18-19 years", "Female_20-24 years", "Female_25-29 years",
"Female_30-34 years", "Female_35-39 years", "Female_40-44 years", "Female_45-49 years",
"Female_50-54 years", "Female_55-59 years", "Female_60-64 years", "Female_65-69 years",
"Female_70-74 years", "Female_75-79 years", "Female_80-84 years", "Female_85 years and over",
"Total Females")
head(female_p)
gender_p <- male_p %>% bind_cols(female_p[ , 3:18])
head(gender_p)
The data sets were merged by binding the columns. This was used to avoid looping over the observations in one data set for every observation in the second data set.
The first two variables in the female data set were excluded from being binded as they would just be a repetition of data.
There are two types of major variables in data sets, qualitative and quantitative variables. With qualitative variables denoting categorical values and can either be nominal (with labels for identification) or ordinal (labels for identification that are ordered) variables. Quantitative variables are measurable values and can either be continuous (measured values) or discrete (counted values within certain frames).
The data structures used in R are mainly vectors, lists, matrices or data frames, all of which contain various data including integer, numeric, character or factor. Vectors are a set of elements, lists are lists of elements, matrices are a collection of data elements with each element being a specific data type, finally, data frames processes characteristics from both lists and matrices with each column its own data type.
str(gender_p)
tibble [27 x 34] (S3: tbl_df/tbl/data.frame)
$ Location : chr [1:27] "New South Wales" "New South Wales" "New South Wales" "Victoria" ...
$ Measure : chr [1:27] "Total participants" "Eligible participants" "Participation rate (%)" "Total participants" ...
$ Male_18-19 years : num [1:27] 52118 69996 74.5 38720 48852 ...
$ Male_20-24 years : num [1:27] 143088 204979 69.8 118897 161658 ...
$ Male_25-29 years : num [1:27] 141349 204397 69.2 121875 167708 ...
$ Male_30-34 years : num [1:27] 146259 210612 69.4 126194 173516 ...
$ Male_35-39 years : num [1:27] 149283 212888 70.1 125877 169904 ...
$ Male_40-44 years : num [1:27] 155857 214989 72.5 128107 168441 ...
$ Male_45-49 years : num [1:27] 167357 221908 75.4 138049 175707 ...
$ Male_50-54 years : num [1:27] 167745 213740 78.5 133085 164846 ...
$ Male_55-59 years : num [1:27] 177169 217769 81.4 137064 164299 ...
$ Male_60-64 years : num [1:27] 165382 195444 84.6 126281 147252 ...
$ Male_65-69 years : num [1:27] 155743 177828 87.6 116846 132643 ...
$ Male_70-74 years : num [1:27] 130540 145915 89.5 96183 107856 ...
$ Male_75-79 years : num [1:27] 89406 99697 89.7 67240 75569 ...
$ Male_80-84 years : num [1:27] 58246 66137 88.1 43863 50406 ...
$ Male_85 years and over : num [1:27] 48003 57894 82.9 35922 43869 ...
$ Total Males : num [1:27] 1.95e+06 2.51e+06 7.75e+01 1.55e+06 1.95e+06 ...
$ Female_18-19 years : num [1:27] 57345 71646 80 44520 52518 ...
$ Female_20-24 years : num [1:27] 153549 201716 76.1 130855 163269 ...
$ Female_25-29 years : num [1:27] 153718 204377 75.2 134506 169260 ...
$ Female_30-34 years : num [1:27] 160855 215120 74.8 139483 176568 ...
$ Female_35-39 years : num [1:27] 162711 216203 75.3 137849 172864 ...
$ Female_40-44 years : num [1:27] 171415 220718 77.7 141735 173926 ...
$ Female_45-49 years : num [1:27] 187089 232767 80.4 156738 188193 ...
$ Female_50-54 years : num [1:27] 181995 220482 82.5 148344 174218 ...
$ Female_55-59 years : num [1:27] 192455 226306 85 150084 173138 ...
$ Female_60-64 years : num [1:27] 179082 205641 87.1 138943 157489 ...
$ Female_65-69 years : num [1:27] 166004 186263 89.1 128102 143372 ...
$ Female_70-74 years : num [1:27] 138350 153681 90 104198 116423 ...
$ Female_75-79 years : num [1:27] 98810 111061 89 74643 84649 ...
$ Female_80-84 years : num [1:27] 70583 81879 86.2 52672 62018 ...
$ Female_85 years and over: num [1:27] 74012 93521 79.1 53923 69272 ...
$ Total Females : num [1:27] 2.15e+06 2.64e+06 8.13e+01 1.74e+06 2.08e+06 ...
The data set has Location and Measure as character data types while all other variables are numeric.
Location will be converted to factor with unordered labels denoting the states, territories and country.
gender_p$Location <- as.factor(gender_p$Location)
is.factor(gender_p$Location)
[1] TRUE
levels(gender_p$Location)
[1] "Australia" "Australian Capital Territory"
[3] "New South Wales" "Northern Territory"
[5] "Queensland" "South Australia"
[7] "Tasmania" "Victoria"
[9] "Western Australia"
The levels of the factor Location do not need to be relabelled as they have already been during pre-processing of data.
Measure has been left as character data type as they do not need to be specifically labelled and all other variables have been left as numeric data type as some observations contain decimal values.
After conversion the data set has character, factor and numeric data types.
Both data set used in the report are untidy. This is because both data sets contains observations of age groups as variables. It will be reshaped into a tidy format by using gather to gather all the age groups into a new variable named Male_Age or Female_Age
# tidy male
male_p2 <- male_p %>% gather("Male_18-19 years", "Male_20-24 years", "Male_25-29 years",
"Male_30-34 years", "Male_35-39 years", "Male_40-44 years", "Male_45-49 years",
"Male_50-54 years", "Male_55-59 years", "Male_60-64 years", "Male_65-69 years",
"Male_70-74 years", "Male_75-79 years", "Male_80-84 years", "Male_85 years and over",
key = "Male_Age", value = "M_Age_Count")
head(male_p2)
# tidy female
female_p2 <- female_p %>% gather("Female_18-19 years", "Female_20-24 years", "Female_25-29 years",
"Female_30-34 years", "Female_35-39 years", "Female_40-44 years", "Female_45-49 years",
"Female_50-54 years", "Female_55-59 years", "Female_60-64 years", "Female_65-69 years",
"Female_70-74 years", "Female_75-79 years", "Female_80-84 years", "Female_85 years and over",
key = "Female_Age", value = "F_Age_Count")
head(female_p2)
# merge 2
gender2_p <- male_p2 %>% bind_cols(female_p2[3:5])
head(gender2_p)
Both data sets were tidied separately before merging again by using column bind. This was done due to the duplication or looping of data when applying two tidy function in one data set. When tested, observations amounted to 6075 which is an incorrect representation of the data.
Two variables will be made by mutating the data set. The first one will be called Total Age Count for the sum of participants in each age group by gender and the second called Total Participants with the sum of participants from all age groups by gender.
gender3_p <- gender2_p %>% mutate(Total_Age_Count = `M_Age_Count` + `F_Age_Count`,
Total_Participants = `Total Males` + `Total Females`)
head(gender3_p)
sum(is.na(gender3_p))
[1] 0
Results show that there are no missing values in the data.
A function called is.special is written to scan for special values (infinite values and NaN). It scans all numeric variables to find any special values.
Source: Module 5 Notes
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(gender3_p, function(x) sum(is.na(x)))
Location Measure Total Males
0 0 0
Male_Age M_Age_Count Total Females
0 0 0
Female_Age F_Age_Count Total_Age_Count
0 0 0
Total_Participants
0
sapply was used in conjunction with the function and the calculation of the sum of missing values for each variable.
The result of this scan reports that there are no special values or NAs.
The scan for any obvious errors will be conducted based on the fact that observations cannot be a negative value as there cannot be a negative amount of participants.
gender_nn <- (gender3_p >= 0)
summary(gender_nn)
Location Measure Total Males Male_Age
Mode:logical Mode:logical Mode:logical Mode:logical
TRUE:405 TRUE:405 TRUE:405 TRUE:405
M_Age_Count Total Females Female_Age F_Age_Count
Mode:logical Mode:logical Mode:logical Mode:logical
TRUE:405 TRUE:405 TRUE:405 TRUE:405
Total_Age_Count Total_Participants
Mode:logical Mode:logical
TRUE:405 TRUE:405
For each variable in the data set, they were scanned for any negative values with any TRUE value being a non-negative number.
The summary of the scan reports that each variable have a total of 405 non-negative observations which is the total amount of observations in the data set, therefore, there are no obvious errors.
The variable Total Age Count will be used to scan for outliers, this is because it contains the observations the number of participants in each age group and state or territory/
To have a visual representation and scan to see if there are any outliers, a boxplot will be used.
boxplot(gender3_p$Total_Age_Count, main = "Total Age Count Boxplot", ylab = "Count", col = "light blue")
As can be seen in the boxplot there are numerous amounts of outliers in the data.
To get an idea of where these outlier values are located in the data set, the z-score method will be used. The data will be visualised in a histogram to check if the data is normally distributed.
hist(gender3_p$Total_Age_Count, main = "Total Age Count Histogram", xlab = "Count", col = "light green")
As can be seen, the plot shows that the data is right skewed. Due to the central limit theorem, n>30, as the data has more than 30 observations (405) it assumes the data is normally distributed regardless of the underlying distribution being right skewed. The z-score method can be used.
z.scores <- gender3_p$Total_Age_Count %>% scores(type = "z")
summary(z.scores)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.52033 -0.52003 -0.43580 0.00000 0.01695 4.73738
which(abs(z.scores) >3)
[1] 53 80 106 107 133 134 160 161 187 188 214 215 241 242 268 269
[17] 295 296
length(which(abs(z.scores) >3))
[1] 18
The summary of the z.score reveals that the minimum is -0.52 and the maximum 4.74. An observation is considered an outlier if its z-score is greater than 3. The which() function shows the location of the outliers in the variable and the length() function shows how many outliers there are.
From the results, there are a total of 18 outliers in the variable Total Age Count.
To handle the outliers, excluding, deleting or imputing values were all not favourable for the reason that the data contains the sum of participants in each state and territory by age group. If they were removed or replaced with another value, it can impact the data in a negative way since the values isn’t raw data and they aren’t ambiguous even though the amount of outliers is a small amount compared to the total amount of observations.
Similarly to imputing values, capping is also not a very favourable approach to handling outliers since it involves replacing the outlier values with its nearest neighbours that are not outliers. Though for the investigation it will be used to view the result it would give.
To input these values, a user-defined function called cap will be needed (source: Module 6 Notes).
The cap function will be applied into Total Age Count and its summary will be compared to Total Age Count’s summary.
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
summary(gender3_p$Total_Age_Count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
92.8 173.7 22954.0 140819.9 145403.0 1422089.0
count_cap <- gender3_p$Total_Age_Count %>% cap()
summary(count_cap)
Min. 1st Qu. Median Mean 3rd Qu. Max.
92.8 173.7 22954.0 147846.7 145403.0 890359.0
In comparing both summaries, the only values that are different are the mean and maximum values. The mean increased by 7026.8 after capping while the maximum value decreased by 531,730. Both are quite a significant difference in values and once again, with the data containing observations of survey participants, it is not reasonable to change the data since it can impact the data negatively.
Apply an appropriate transformation for at least one of the variables
To help prepare the data for modelling, data transformation will be used to pre-process it, specifically on the variables M_Age_Count, F_Age_Count and Total_Age_Count.
gender_sub <- gender3_p %>% dplyr::select(M_Age_Count, F_Age_Count, Total_Age_Count)
par(mfrow=c(3,1))
colnames <- colnames(gender_sub)
for (i in 1:3) {
hist(unlist(gender_sub[,i]), main = colnames[i])
}
For a visual representation of what the data currently looks like, they have been plotted in histograms to see their distribution. As can be seen, all three plots show right skewed distributions. As normal distributions are preferred over skewed, a variety of data transformation will be applied to the variables to compare and determine which one would be best suited to normalise the data.
# log
Warning messages:
1: In readChar(file, size, TRUE) : truncating string with embedded nuls
2: In readChar(file, size, TRUE) : truncating string with embedded nuls
3: In readChar(file, size, TRUE) : truncating string with embedded nuls
log_gender <-sapply(gender_sub, log10)
par(mfrow=c(3,1))
for (i in 1:3) {
hist(unlist(log_gender[,i]), main = colnames[i])
}
# ln
ln_gender <- sapply(gender_sub, log)
par(mfrow=c(3,1))
for (i in 1:3) {
hist(unlist(ln_gender[,i]), main = colnames[i])
}
# sqrt
sqrt_gender <- sapply(gender_sub, sqrt)
par(mfrow=c(3,1))
for (i in 1:3) {
hist(unlist(sqrt_gender[,i]), main = colnames[i])
}
# reciprocal
rec_male <- 1/gender3_p$M_Age_Count
rec_female <- 1/gender3_p$F_Age_Count
rec_total <- 1/gender3_p$Total_Age_Count
par(mfrow=c(3,1))
hist(rec_male, main = "M_Age_Count")
hist(rec_female, main = "F_Age_Count")
hist(rec_total, main = "Total_Age_Count")
# boxcox
box_gender <- apply(gender_sub, 2, function(x) BoxCox(x, lambda = "auto"))
par(mfrow=c(3,1))
for (i in 1:3) {
hist(unlist(box_gender[,i]), main = colnames[i])
}
# normalisation
# mean centering
center_gender <-apply(gender_sub, 2, function(x) scale(x, center = TRUE, scale = FALSE))
par(mfrow=c(3,1))
for (i in 1:3) {
hist(unlist(center_gender[,i]), main = colnames[i])
}
# without centering
scale_gender <-apply(gender_sub, 2, function(x) scale(x, center = FALSE, scale = sd(gender3_p$M_Age_Count)))
par(mfrow=c(3,1))
for (i in 1:3) {
hist(unlist(scale_gender[,i]), main = colnames[i])
}
# z-score
z_gender <-apply(gender_sub, 2, function(x) scale(x, center = TRUE, scale = TRUE))
par(mfrow=c(3,1))
for (i in 1:3) {
hist(unlist(z_gender[,i]), main = colnames[i])
}
# min-max
minmaxnormalise <- function(x){(x- min(x)) /(max(x)-min(x))}
minmax_gender <- sapply(gender_sub, minmaxnormalise)
par(mfrow=c(3,1))
for (i in 1:3) {
hist(unlist(minmax_gender[,i]), main = colnames[i])
}
From comparing the plots of the data transformation, the ln (log) transformation showed the best result. While it did not give complete normalisation to the data, it normalised the data more than other data transformations. For example, sqrt() function continued to produce a right skewed distribution, the same result returned with normalisation methods.
Therefore, if the data needed to be transformed into a normal distribution, the ln(log) function would work best.
In closing, binning was not used as the data was already binned into age groups from the initial data sets.