library(readxl)
library(tidyr)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Attaching package: ‘tidyr’
The following object is masked _by_ ‘.GlobalEnv’:
table1
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(stringr)
library(car)
Loading required package: carData
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘car’
The following object is masked from ‘package:dplyr’:
recode
library(ggplot2)
library(rvest)
Loading required package: xml2
The data in (.xls) files are not in a desirable state to be joined with any other dataset. Different steps are taken than advised in the notebook but, all the aspects of pre-processing are covered with just different order. Steps were taken for preprocessing:- 1-datasets loaded from Australian Marriage Law Postal Survey 2017 - Participation and subsetted for obtaining two parts LHS for federal divisions and RHS for age brackets. Response data is loaded and processed like a participant. 2-response and participant data are joined with electorates table fetched from the web for adding additional info on state and area of the federal division. 3-with trial and error it was observed before merging participant and response table removal of present NA values in the tables makes further processing easier. 4-Final dataset is created which contains merged participant, response and electorate tables. 5-Required data type conversions are done with the final table. 6-Outlier detection and decision on dealing with outliers are made.
The data was taken from Australian Marriage Law Postal Survey, 2017. Source->https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1800.02017?OpenDocument
Australian Marriage Law Postal Survey, 2017 was conducted by the Australian Bureau of Statistics. This survey tries to find out people’s views on legalizing same-sex marriage in Australia.
Two excel sheets are being used in the coming tasks:- 1->Australian Marriage Law Postal Survey 2017 - Participation
Australian Marriage Law Postal Survey 2017 - Participation contains information on participants, Distributed over different age bands. A further division by (State/federal division) and gender(male/female)are provided in different sheets.
In this particular script sheet 6 and sheet 7 are going to be used: sheet 6 contains male participants distributed over different age bands (RHS) and federal division(LHS).
sheet 7 contains female participants distributed over different age bands and federal divisions.
The image is the screenshot of sheet 6. Sheet 6 and 7 are identical in the arrangement.
2->Australian Marriage Law Postal Survey 2017 - Response
Australian Marriage Law Postal Survey 2017 - Response contains information on responses of various participants of the study.
In this analysis sheet, 3 of the data will be used from the Australian Marriage Law Postal Survey 2017 - Response the sheet contains responses of eligible participants and also contains info on non-responding participant and unclear responses over different federal divisions.
Importing Data Australian Marriage Law Postal Survey 2017 - Participation We are going to load sheet 6 and 7
data=read_excel("data.xls",sheet = 6)
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* … and 12 more problems
data_F=read_excel("data.xls",sheet = 7)
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* … and 12 more problems
head(data)
head(data_F)
a trim is going to be peformed to extract usefule table from the sheet preserving all information
#Breaking problem in 2 parts
#partA contains federal divisions
partA = data%>%select(contains("Australian Bureau of Statistics"))%>%slice(7:650)
colnames(partA)="area"
#removing all the bold reference of state and only keeping federal division.
partA=partA%>%filter(str_detect(partA$area,pattern = "Divisions|Australia$",negate = TRUE))
#partB#####################################
#partB contains the age group table with all data points
partB = data%>%select(-c(1))%>%slice(5:649)%>%drop_na()
#specifying column name
cols=c("category",
"18-19",
"20-24",
"25-29",
"30-34",
"35-39",
"40-44",
"45-49",
"50-54",
"55-59",
"60-64",
"65-69",
"70-74",
"75-79",
"80-84",
"> 85",
"Total for Gender")
colnames(partB)=cols
#adding to add a column gender indicating sheet as male after merge with female sheet
#replicating area 3 times to match and fit the category
#as category contains Total participants,Eligible participants,Participation rate (%) for each federal division
partB=partB%>%mutate(area = rep(partA$area, each=3) ,gender="Male")%>%filter(!grepl("Total", area))
head(partA)
head(partB)
NA
A similar operation will be performed on sheet 7
#part A####################################3
partA_F = data_F%>%select(contains("Australian Bureau of Statistics"))%>%slice(7:650)
colnames(partA_F)="area"
partA_F=partA_F%>%filter(str_detect(partA_F$area,pattern = "Divisions|Australia$",negate = TRUE))
#partB#####################################
partB_F = data_F%>%select(-c(1))%>%slice(5:649)%>%drop_na()
#specifying column name
colnames(partB_F)=cols#using cols from previous sheet specification
partB_F=partB_F%>%mutate(area = rep(partA_F$area, each=3) ,gender="Female")%>%filter(!grepl("Total", area))
head(partA_F)
head(partB_F)
An attempt is made to get the table wilth State with their federal divisions and respective area in sq.km
Here rvest package is used to extract the tables present at the following AEC(Australian Election Comission) which will be combined with participant and respondents table.
current_electorates <-
read_html("http://www.aec.gov.au/profiles/") %>%
rvest::html_nodes("table") %>%
rvest::html_table() %>%first()%>%select(`Electoral division`, State, `Area (sq km)`)
old_electorates <-
read_html("http://www.aec.gov.au/Electorates/abolished.htm") %>%
rvest::html_nodes("table") %>%
rvest::html_table() %>%
first() %>%
select(`Electoral division` = Division, State)
electorates <- bind_rows(current_electorates, old_electorates)
head(electorates)
One more import is necessary Now, response table of participant will be imported and similar operations like sheet 6 & 7 will be performed. #change name of col
############################################RESPONSE TABLE
response=read_excel("response.xls",sheet = 3)
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* … and 10 more problems
response=response%>%slice(8:182)
col_name = c( "area",
"Yes",
"Yes pct",
"No",
"No pct",
"Response Total",
"Response Total pct",
"blank",
"Response clear",
"Response clear pct",
"Response not clear(b)",
"Response not clear(b) pct",
"Non-responding",
"Non-responding pct",
"Eligible Total",
"Eligible Total pct")
colnames(response)=col_name
response=response%>%select(-blank)%>%filter(!grepl("Total", area))%>%filter(!grepl("Total", area)) %>%drop_na()
The quesion we are asking here Is data in tidy format?
The data is not in tidy format. Reason:- age brackets are distributed over columns that directly violate tidy data rules. To make data a tidy df, a gather operation is to be performed to create a column with age brackets as different levels of columns.
partB=partB%>%gather(key="age_group", value = "count",-c("category","area","gender"))
partB_F=partB_F%>%gather(key="age_group", value = "count",-c("category","area","gender"))
head(partB)
dim(partB)
[1] 7200 5
head(partB_F)
dim(partB_F)
[1] 7200 5
Combining both df into one ALL_PARTB
ALL_PARTB=rbind(partB,partB_F)
dim(ALL_PARTB)
[1] 14400 5
Forming one final participant df (with state and area) ALL
ALL<-
ALL_PARTB %>%
left_join(electorates, by = c("area" = "Electoral division"))
head(ALL)
dim(ALL)
[1] 14496 7
repeat the same steps for respondant
response_data <-
response%>%
left_join(electorates, by = c("area" = "Electoral division"))
head(response_data)
dim(response_data)
[1] 151 17
There are some inconsistencies present in data which if not handled at this point will engender problems in further analysis. Scanning tables for missing values for a neat join of all tables.
colSums(is.na(ALL))
category area gender age_group count State Area (sq km)
0 0 0 0 0 288 1056
This suggest some error from the electorates table
colSums(is.na(electorates))
Electoral division State Area (sq km)
0 0 64
On close obserevation there are some duplicate values with different case settings.
#INVESTIGATION OF ELECTORAL PUSH POSSIBLE PROBLEMS
#some duplicate values were present
electorates=electorates%>%filter(!duplicated(electorates$`Electoral division`))
#vic and vic. problem
electorates$State=lapply(electorates$State,function(x)str_replace(x,pattern = "\\.",replacement = ""))
#converting to lower for uniform grouping
electorates$State=lapply(electorates$State,function(x)tolower(x))
NA’s are present in areas so to decide what to impute on NA’s for better data distribution of data is observed.
#checking the distribultion to decide replacement
x=electorates%>%filter(electorates$State=="vic")
x$`Area (sq km)`=as.numeric(x$`Area (sq km)`)
NAs introduced by coercion
x=x%>%filter(!is.na(x$`Area (sq km)`))
hist(x$`Area (sq km)`)#mean will under represent groups
boxplot(x$`Area (sq km)`)
The distributin is skewed so using mean to impute NA’s will not be a good decision. Group Median will be used to impute NA in the same group
#will use median
electorates$State=as.character(electorates$State)
electorates$State=as.factor(electorates$State)
electorates$`Area (sq km)`=as.integer(electorates$`Area (sq km)`)
NAs introduced by coercion
electorates=electorates %>% group_by(State) %>%
mutate(`Area (sq km)`=ifelse(is.na(`Area (sq km)`),median(`Area (sq km)`,na.rm=TRUE),`Area (sq km)`))
colSums(is.na(electorates))
Electoral division State Area (sq km)
0 0 0
All NA’s are handled
Some inconsistency was found in area column which was cause of 288 NA values it is handled in next code block.
Join will be attempted again
ALL_PARTB$area[ALL_PARTB$area=="Lingiari(c)"]="Lingiari"
ALL_PARTB$area[ALL_PARTB$area=="Canberra(d)"]="Canberra"
ALL_PARTB$area[ALL_PARTB$area=="Fenner(e)"]="Fenner"
ALL<-
ALL_PARTB %>%
left_join(electorates, by = c("area" = "Electoral division"))
same fore respondant
response$area[response$area=="Lingiari(c)"]="Lingiari"
response$area[response$area=="Canberra(d)"]="Canberra"
response$area[response$area=="Fenner(e)"]="Fenner"
response_data <-
response%>%
left_join(electorates, by = c("area" = "Electoral division"))
colSums(is.na(ALL))
category area gender age_group count State Area (sq km)
0 0 0 0 0 0 0
colSums(is.na(response_data))
area Yes Yes pct No
0 0 0 0
No pct Response Total Response Total pct Response clear
0 0 0 0
Response clear pct Response not clear(b) Response not clear(b) pct Non-responding
0 0 0 0
Non-responding pct Eligible Total Eligible Total pct State
0 0 0 0
Area (sq km)
0
It can be observed NA’s in all columns are eliminated
aa=left_join(ALL,response_data, by = "area")
aa=select(aa,-c(6,7))
Now, obvious incosistencies in the data needed to be checked for 1-Negative values are not possible in any numeric columns in our dataset.As Negative number of peope does not makes sense.
Incos_detect=function(x)
{
re=which(x<0)
re
}
test=aa[,c(5,6,7,8,9,10,11,12,14,15)]
lapply(test, function(x) Incos_detect(x))
$count
integer(0)
$Yes
integer(0)
$`Yes pct`
integer(0)
$No
integer(0)
$`No pct`
integer(0)
$`Response Total`
integer(0)
$`Response Total pct`
integer(0)
$`Response clear`
integer(0)
$`Response not clear(b)`
integer(0)
$`Response not clear(b) pct`
integer(0)
final data
head(aa)
Data types of the final table are explored and appropriate data type conversions are appled wherever is necessary.
glimpse(aa)
Observations: 14,400
Variables: 21
$ category [3m[38;5;246m<chr>[39m[23m "Total participants", "Eligible participants", "Participation rate (%)", …
$ area [3m[38;5;246m<chr>[39m[23m "Banks", "Banks", "Banks", "Barton", "Barton", "Barton", "Bennelong", "Be…
$ gender [3m[38;5;246m<chr>[39m[23m "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
$ age_group [3m[38;5;246m<chr>[39m[23m "18-19", "18-19", "18-19", "18-19", "18-19", "18-19", "18-19", "18-19", "…
$ count [3m[38;5;246m<chr>[39m[23m "1102", "1431", "77", "977", "1278", "76.400000000000006", "1177", "1488"…
$ Yes [3m[38;5;246m<chr>[39m[23m "37736", "37736", "37736", "37153", "37153", "37153", "42943", "42943", "…
$ `Yes pct` [3m[38;5;246m<chr>[39m[23m "44.899999999999999", "44.899999999999999", "44.899999999999999", "43.600…
$ No [3m[38;5;246m<chr>[39m[23m "46343", "46343", "46343", "47984", "47984", "47984", "43215", "43215", "…
$ `No pct` [3m[38;5;246m<chr>[39m[23m "55.100000000000001", "55.100000000000001", "55.100000000000001", "56.399…
$ `Response Total` [3m[38;5;246m<chr>[39m[23m "84079", "84079", "84079", "85137", "85137", "85137", "86158", "86158", "…
$ `Response Total pct` [3m[38;5;246m<chr>[39m[23m "100", "100", "100", "100", "100", "100", "100", "100", "100", "100", "10…
$ `Response clear` [3m[38;5;246m<chr>[39m[23m "84079", "84079", "84079", "85137", "85137", "85137", "86158", "86158", "…
$ `Response clear pct` [3m[38;5;246m<chr>[39m[23m "79.900000000000006", "79.900000000000006", "79.900000000000006", "77.799…
$ `Response not clear(b)` [3m[38;5;246m<chr>[39m[23m "247", "247", "247", "226", "226", "226", "244", "244", "244", "212", "21…
$ `Response not clear(b) pct` [3m[38;5;246m<chr>[39m[23m "0.20000000000000001", "0.20000000000000001", "0.20000000000000001", "0.2…
$ `Non-responding` [3m[38;5;246m<chr>[39m[23m "20928", "20928", "20928", "24008", "24008", "24008", "19973", "19973", "…
$ `Non-responding pct` [3m[38;5;246m<chr>[39m[23m "19.899999999999999", "19.899999999999999", "19.899999999999999", "22", "…
$ `Eligible Total` [3m[38;5;246m<chr>[39m[23m "105254", "105254", "105254", "109371", "109371", "109371", "106375", "10…
$ `Eligible Total pct` [3m[38;5;246m<chr>[39m[23m "100", "100", "100", "100", "100", "100", "100", "100", "100", "100", "10…
$ State.y [3m[38;5;246m<fct>[39m[23m nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw…
$ `Area (sq km).y` [3m[38;5;246m<dbl>[39m[23m 53, 53, 53, 40, 40, 40, 60, 60, 60, 786, 786, 786, 61, 61, 61, 101, 101, …
almost all the numeric values are character type and some factor varibles are also char.
Converting to appropriate types. instead of changing all the char to dbl one by one clubbing of all the columns are performed and converted to dbl.
aa$area=as.factor(aa$area)
aa$age_group=factor(aa$age_group, levels = cols,
ordered = TRUE)
aa$gender=as.factor(aa$gender)
aa$State.y=as.factor(aa$State.y)
aa$`Area (sq km).y`=as.numeric(aa$`Area (sq km).y`)
#clubbing of supposedly dbl which are char
names=colnames(aa)
names=names[5:19]
aa[,names] = apply(aa[,names], 2, function(x) as.numeric(as.character(x)))
glimpse(aa)
Observations: 14,400
Variables: 21
$ category [3m[38;5;246m<chr>[39m[23m "Total participants", "Eligible participants", "Participation rate (%)", …
$ area [3m[38;5;246m<fct>[39m[23m Banks, Banks, Banks, Barton, Barton, Barton, Bennelong, Bennelong, Bennel…
$ gender [3m[38;5;246m<fct>[39m[23m Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, M…
$ age_group [3m[38;5;246m<ord>[39m[23m 18-19, 18-19, 18-19, 18-19, 18-19, 18-19, 18-19, 18-19, 18-19, 18-19, 18-…
$ count [3m[38;5;246m<dbl>[39m[23m 1102.0, 1431.0, 77.0, 977.0, 1278.0, 76.4, 1177.0, 1488.0, 79.1, 1523.0, …
$ Yes [3m[38;5;246m<dbl>[39m[23m 37736, 37736, 37736, 37153, 37153, 37153, 42943, 42943, 42943, 48471, 484…
$ `Yes pct` [3m[38;5;246m<dbl>[39m[23m 44.9, 44.9, 44.9, 43.6, 43.6, 43.6, 49.8, 49.8, 49.8, 54.6, 54.6, 54.6, 2…
$ No [3m[38;5;246m<dbl>[39m[23m 46343, 46343, 46343, 47984, 47984, 47984, 43215, 43215, 43215, 40369, 403…
$ `No pct` [3m[38;5;246m<dbl>[39m[23m 55.1, 55.1, 55.1, 56.4, 56.4, 56.4, 50.2, 50.2, 50.2, 45.4, 45.4, 45.4, 7…
$ `Response Total` [3m[38;5;246m<dbl>[39m[23m 84079, 84079, 84079, 85137, 85137, 85137, 86158, 86158, 86158, 88840, 888…
$ `Response Total pct` [3m[38;5;246m<dbl>[39m[23m 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
$ `Response clear` [3m[38;5;246m<dbl>[39m[23m 84079, 84079, 84079, 85137, 85137, 85137, 86158, 86158, 86158, 88840, 888…
$ `Response clear pct` [3m[38;5;246m<dbl>[39m[23m 79.9, 79.9, 79.9, 77.8, 77.8, 77.8, 81.0, 81.0, 81.0, 84.5, 84.5, 84.5, 7…
$ `Response not clear(b)` [3m[38;5;246m<dbl>[39m[23m 247, 247, 247, 226, 226, 226, 244, 244, 244, 212, 212, 212, 220, 220, 220…
$ `Response not clear(b) pct` [3m[38;5;246m<dbl>[39m[23m 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2…
$ `Non-responding` [3m[38;5;246m<dbl>[39m[23m 20928, 20928, 20928, 24008, 24008, 24008, 19973, 19973, 19973, 16038, 160…
$ `Non-responding pct` [3m[38;5;246m<dbl>[39m[23m 19.9, 19.9, 19.9, 22.0, 22.0, 22.0, 18.8, 18.8, 18.8, 15.3, 15.3, 15.3, 2…
$ `Eligible Total` [3m[38;5;246m<dbl>[39m[23m 105254, 105254, 105254, 109371, 109371, 109371, 106375, 106375, 106375, 1…
$ `Eligible Total pct` [3m[38;5;246m<dbl>[39m[23m 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100…
$ State.y [3m[38;5;246m<fct>[39m[23m nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw, nsw…
$ `Area (sq km).y` [3m[38;5;246m<dbl>[39m[23m 53, 53, 53, 40, 40, 40, 60, 60, 60, 786, 786, 786, 61, 61, 61, 101, 101, …
All the columns are in appropriate data format. With age group as a ordered factor.
In this step, all the unnecessary columns of percentage will be removed as they possess only redundant information and can be calculated at any time using mutate function.
#removing percentages
aa=aa%>%select(-c(7,9,11,13,15,17,19))
only after successful NA removal and fixing the data type it was possible to create a varible.
The fabricated column will contain total elegible participant per sq.km of Area
aa=aa%>%mutate(eligible_part_Per_sq.km=`Eligible Total`/`Area (sq km).y`)
In this section an analysis for outliers on all the numeric columns is performed. As seen in the previous parts the distribution of the data is not normal so boxplots are used for detecting outliers.
m=aa[,5:15]
ggplot(stack(m), aes(x = ind, y = values))+geom_boxplot()
non-vector columns will be ignored
The scale of y axis is not appropriate for comparision of boxplot with other boxplots. To enable us to compare boxplots a scale shift of y axis might help.
ggplot(stack(m), aes(x = ind, y = values)) +scale_y_log10()+geom_boxplot()
non-vector columns will be ignored
A clear presence of outliers can be seen in almost every column. In further steps, an attempt for the treatment of outliers will be made.
The outliers per function are created to compute the percentage of outlier present in the particular column. This step is necessary as a high percentage of outliers >5% will make us consider other options other than elimination.
outliersper <- function(x){
length(which(x > mean(x) + 3 * sd(x) | x < mean(x) - 3 * sd(x)) ) / length(x)
}
out=aa[,c(5,6,7,8,9,10,11,12,14,15)]
out_count=lapply(out, function(x) outliersper(x))
out_count
$count
[1] 0.04083333
$Yes
[1] 0.006666667
$No
[1] 0.006666667
$`Response Total`
[1] 0.02666667
$`Response clear`
[1] 0.02666667
$`Response not clear(b)`
[1] 0
$`Non-responding`
[1] 0.01333333
$`Eligible Total`
[1] 0.04666667
$`Area (sq km).y`
[1] 0.04666667
$eligible_part_Per_sq.km
[1] 0.01333333
It can be observed that all columns are having an outlier percentage <5% with maximum outlier in Area (sq km).yandEligible Total` followed by count.
The fuction outlier_remov when applied to a column keeps only the data which are not outliers according to Tukays Method of outlier detection
outlier_remov= function(x){
aa=aa%>%filter((x < mean(x) + 1.5 * sd(x) & x > mean(x) - 1.5 * sd(x)))
aa
}
#dimention of df before removal of outliers
dim(aa)
[1] 14400 15
after_removal=outlier_remov(aa$count)
dim(after_removal)
[1] 13801 15
all the rows with values outside of the range of Q1-1.5 x IQR to Q3+1.5 x IQR are removed.
boxplot(after_removal$count)
OBSERVATION Even after removing the outliers it can be observed that the outliers of the count column are still present as new outliers are introduced due to the removal of the old one which changed the quartile values.
DECISION FOR OUTLIERS The values of columns are interdependent and removal of outliers is not going to give us any positive effect in cleaning the data.
-One possible reason for many outliers can be outlier observations of the area of the federal district. As a very small or very large area will have extreme stats, for example, a very large federal division will have several participants.
-A decision was made against the removal of outliers or imputation of outliers with mean, median or capping them to the nearest quantile value as any of these will disturb the data and the results from any statistical test done on data will not be reliable.
-If in any case removal of outliers is a must. NA’s can be introduced in the place of outliers. Such a move will keep the original distribution of data.
Discussion Various different analysis is possible from the dataset: -t-test with male and female participant counts. -see if age bracket makes difference, 3 categories (young,mid age,old) can be use to bracket all age groups and then ANOVA test can be use to see if any varience exist across the groups. -regression might be possible area of place and no. of participant as a line(make graph) is there
#TRANSFORM This step is to transform the it could be used for further analysis.
library(forecast)
Registered S3 method overwritten by 'xts':
method from
as.zoo.xts zoo
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Registered S3 methods overwritten by 'forecast':
method from
fitted.fracdiff fracdiff
residuals.fracdiff fracdiff
hist(aa$`Area (sq km).y`)
transformed=BoxCox(aa$`Area (sq km).y`,lambda = "auto")
hist(transformed)