Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
library(readr)
library(xlsx)
library(readxl)
library(forcats)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(validate)
library(Hmisc)
library(stringr)
library(lubridate)
library(ggplot2)
library(knitr)

Executive Summary

Data

A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfil the minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.

The two data sets relate to voting results in the 2019 Australian federal election, and 2019 Australian Bureau of Statistics data on household income, household ownership, and rent and mortgage costs. Both data sets are grouped by federal electorate geographical areas. The election results are on a two party preferred basis, minor parties are not included.

There are a total of 17 variables and 151 observations.

The variables are as follows: electorate - electorate name DivisionID - unique ID number for each electorate (allocated by the Aust. Electorate Commission) StateAB - Australian State abbreviation PartyAB - Political party abbreviation. ALP = Aust. Labor Party; LP = Liberal Party; NP = National Party; LNP = Liberal National Party (Amalgamation of Liberal Party and National Party in QLD) Liberal/National Coalition Votes Liberal/National Coalition Percentage Australian Labor Party Votes Australian Labor Party Percentage TotalVotes - Total number of votes cast in the electorate Swing - Percentage swing for winning party from previous election yes_votes - tally of yes votes yes_percentage - of total legible votes no_votes - tally of no votes no_percentage - of total legible votes total_votes - tally of legible votes difference - percentage difference between yes and no. Positive value is yes, negative value is no.

The election result data set was retrieved from the Australian Electoral Commission website: https://results.aec.gov.au/24310/Website/HouseDownloadsMenu-24310-Csv.htm

The marriage equality plebiscite data set was retrieved from the Australian Bureau of Statistics website: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1800.02017?OpenDocument

# This is the R chunk for the Data Section

tpp1 <- read_csv("HouseTppByDivisionDownload-24310.csv", skip=1)
Parsed with column specification:
cols(
  DivisionNm = col_character(),
  DivisionID = col_double(),
  StateAb = col_character(),
  PartyAb = col_character(),
  `Liberal/National Coalition Votes` = col_double(),
  `Liberal/National Coalition Percentage` = col_double(),
  `Australian Labor Party Votes` = col_double(),
  `Australian Labor Party Percentage` = col_double(),
  TotalVotes = col_double(),
  Swing = col_double()
)
head(tpp1)
marriage1 <- read_excel("australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet=3, skip= 5)
New names:
* `` -> ...1
* `` -> ...3
* `` -> ...5
* Total -> Total...6
* `` -> ...7
* ...
head(marriage1)
marriage2 <- rename(marriage1, electorate =...1, yes_votes = Yes, yes_percentage=...3, no_votes= No, no_percentage = ...5, total_votes = Total...6, total_percentage = ...7)
head(marriage2)
tpp2 <- rename(tpp1,electorate=DivisionNm)
head(tpp2)
marriage3 <- marriage2 %>% select(electorate,yes_votes,yes_percentage,no_votes,no_percentage,total_votes)
voting1 <- tpp2 %>% left_join(marriage3)
Joining, by = "electorate"
head(voting1)

Understand

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

There is a mix of variables, including characters, factors and numerical.

Both StateAb and PartyAb were imported as character variables. These were assigned as factors. The marriage equality plebiscite data was imported as characters, so these were re-assigned as numerical variables.

Tidy & Manipulate Data I

The data base had branched variables, with the ‘no’ header including both vote tally and proportion. To ensure tidy data, the data was imported with the branch headers dropped. The variables then were renamed to ensure accuracy and consistency.

# This is the R chunk for the Tidy & Manipulate Data I 

Tidy & Manipulate Data II

So as to more easily assess the difference in levels of support for marriage equality, a new variable was created - difference. This took the yes_proportion and subtracted the no_proportion. A positive value indicated majority support for marriage equality, while a negative value indicated minority, with the proportional difference between the two being the numerical value. This variable will later aid analysis when looking into the relationship between political party support and marriage equality support, and the level of such support.

Scan I

The data was scanned for missing data. There are 11 observations with all data missing relating to the marriage equality plebiscite. It constitutes 7.28% of the total 151 observations. Furthermore, nine of the eleven were seats help by the ALP, with one each held by the NP and LP. All three electorates in the ACT have plebiscite data missing, one of the NT’s two seats, five of Victoria’s thirty-eight seats, and one of SA’s ten seats.

This is data set is to be used in an analysis to see if there is a relationship between support for marriage equality in the 2017 plebiscite, and voting support for particular political parties in the 2019 federal election. Therefore, more analysis is needed before action is taken. If the missing values were re-coded with median values or other values this could impact analysis results. The decision was made to leave the missing data as NA until more analysis was undertaken. Such analysis might include analysing the support for marriage equality in electorates with comparable political party support, or other metrics.

# This is the R chunk for the Scan I

colSums(is.na(voting2))
                           electorate                            DivisionID 
                                    0                                     0 
                              StateAb                               PartyAb 
                                    0                                     0 
     Liberal/National Coalition Votes Liberal/National Coalition Percentage 
                                    0                                     0 
         Australian Labor Party Votes     Australian Labor Party Percentage 
                                    0                                     0 
                           TotalVotes                                 Swing 
                                    0                                     0 
                            yes_votes                        yes_percentage 
                                   11                                    11 
                             no_votes                         no_percentage 
                                   11                                    11 
                          total_votes                            difference 
                                   11                                    11 

Scan II

Several univariate scans were undertaken to check for outliers. There were no obvious outliers except for one observation of the variable ‘swing’. The Whitlam electorate observation had a swing value of 39%. Considering the other outliers were in the 11-12% range this was a significant outlier that required investigation. Upon researching the previous election result, for which swing is derived from in relation to the most recent election, it was found that in the previous election a Liberal party candidate had run but not in this election, hence the swing figure was not an accurate reflection, as the was no Liberal candidate for voters to swing against. The decision was made to change this observation to NA, and remove it for calculations. It was changed to NA so as to not alter other calculations and analysis.

# This is the R chunk for the Scan II

voting2$difference %>% boxplot(main="Box Plot of support difference for marriage equality", ylab="support difference", col="grey")


voting2$Swing %>% boxplot(main="Box Plot of winning party swing", ylab="swing", col="grey")


voting2$yes_percentage %>% boxplot(main="Box plot of Marriage equality plebiscite yes vote by electorate", ylab= "proportion", col="grey")


voting2$`Liberal/National Coalition Percentage` %>% boxplot(main="Box plot of LNP vote percentage by electorate", ylab= "vote percentage", col="grey")


voting2$`Australian Labor Party Percentage` %>% boxplot(main="Box plot of Labor vote percentage by electorate", ylab= "vote percentage", col="grey")


voting2$Swing[148] <- NA

Transform

Several variables were checked for distribution skew. The yes percentage variable had a mild left skew. The mathematical operation of a square root transformation was used at it was a left skew. The resulting object, trans_yes1, has a much reduced skew.

# This is the R chunk for the Transform Section

voting2$`Australian Labor Party Percentage` %>% hist(xlab="labor percentage", main = "labor support", col = "grey" )


voting2$yes_percentage %>% hist(xlab="yes percentage", main = "yes support by electorate", col="grey")


trans_yes1 <- voting2$yes_percentage^2
hist(trans_yes1)

NA
NA

NOTE: Note that sometimes the order of the tasks may be different than the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. Therefore, for such cases you may have a different ordering of the sections. 

Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head() )



---
title: "MATH2349 Data Wrangling"
author: Gavin Height 3873959
subtitle: Assignment 2
output:
  html_notebook: default
  pdf_document: default
---



## Required packages 


Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

```{r}
# This is the R chunk for the required packages
library(readr)
library(xlsx)
library(readxl)
library(forcats)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(validate)
library(Hmisc)
library(stringr)
library(lubridate)
library(ggplot2)
library(knitr)
```


## Executive Summary 

- This data set was preprocessed for the purpose of undertaking analysis to see if there is a relationship between election results of Australian political parties in the 2019 federal election and voting results for the 2017 marriage equality plebiscite.

- Two data sets were merged, one showing voting results for the 2019 Australian Federal Election of the House of Representatives, the other showing the voting results of the 2017 Australian plebiscite on marriage equality by House of Representative electorates. The two data sets were merged, variables assigned correctly and data was tidied. A new variable was added to more easily see how an electorate voted in the plebiscite.

-  11 observations were found to be missing values for plebiscite data variables. These were left as NA as 9 out of 11 of them were in electorates won by the Labor party and re-coding with median values or otherwise may bias analysis results. More investigation is needed before potential further action is taken.

-  One significant outlier was found in the swing variable. This was investigated and found to be due to no Liberal party candidate running in the 2019 election. This value was re-coded to NA to prevent it impacting analysis results.

-  The values of the variable, 'percentage of yes support', had a mild left skew. A mathematical operation of a square root was used to correct the skew.


## Data 

A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfil the minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.

The two data sets relate to voting results in the 2019 Australian federal election, and 2019 Australian Bureau of Statistics data on household income, household ownership, and rent and mortgage costs. Both data sets are grouped by federal electorate geographical areas.
The election results are on a two party preferred basis, minor parties are not included.

There are a total of 17 variables and 151 observations.

The variables are as follows:
electorate - electorate name
DivisionID - unique ID number for each electorate (allocated by the Aust. Electorate Commission)
StateAB - Australian State abbreviation
PartyAB - Political party abbreviation. ALP = Aust. Labor Party; LP = Liberal Party; NP = National Party; LNP = Liberal National Party (Amalgamation of Liberal Party and National Party in QLD)
Liberal/National Coalition Votes
Liberal/National Coalition Percentage
Australian Labor Party Votes
Australian Labor Party Percentage
TotalVotes - Total number of votes cast in the electorate
Swing - Percentage swing for winning party from previous election
yes_votes - tally of yes votes
yes_percentage - of total legible votes
no_votes - tally of no votes
no_percentage - of total legible votes
total_votes - tally of legible votes
difference - percentage difference between yes and no. Positive value is yes, negative value is no.




The election result data set was retrieved from the Australian Electoral Commission website:
https://results.aec.gov.au/24310/Website/HouseDownloadsMenu-24310-Csv.htm

The marriage equality plebiscite data set was retrieved from the Australian Bureau of Statistics website:
https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1800.02017?OpenDocument


```{r}
# This is the R chunk for the Data Section

tpp1 <- read_csv("HouseTppByDivisionDownload-24310.csv", skip=1)
head(tpp1)
marriage1 <- read_excel("australian_marriage_law_postal_survey_2017_-_response_final.xls", sheet=3, skip= 5)
head(marriage1)
marriage2 <- rename(marriage1, electorate =...1, yes_votes = Yes, yes_percentage=...3, no_votes= No, no_percentage = ...5, total_votes = Total...6, total_percentage = ...7)
head(marriage2)
tpp2 <- rename(tpp1,electorate=DivisionNm)
head(tpp2)
marriage3 <- marriage2 %>% select(electorate,yes_votes,yes_percentage,no_votes,no_percentage,total_votes)
voting1 <- tpp2 %>% left_join(marriage3)
head(voting1)
```

## Understand 

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

There is a mix of variables, including characters, factors and numerical.

Both StateAb and PartyAb were imported as character variables. These were assigned as factors.
The marriage equality plebiscite data was imported as characters, so these were re-assigned as numerical variables.

```{r}
# This is the R chunk for the Understand Section

voting1$StateAb <- voting1$StateAb%>% as.factor()
voting1$PartyAb <-  voting1$PartyAb %>% as.factor()
voting1$yes_votes <- voting1$yes_votes %>% as.double()
voting1$yes_percentage <- voting1$yes_percentage %>% as.double()
voting1$no_votes <- voting1$no_votes %>% as.double()
voting1$no_percentage <- voting1$no_percentage %>% as.double()
voting1$total_votes <- voting1$total_votes %>% as.double()
head(voting1)

```


##	Tidy & Manipulate Data I 


The data base had branched variables, with the 'no' header including both vote tally and proportion. To ensure tidy data, the data was imported with the branch headers dropped. The variables then were renamed to ensure accuracy and consistency.


```{r}
# This is the R chunk for the Tidy & Manipulate Data I 

```

##	Tidy & Manipulate Data II 


So as to more easily assess the difference in levels of support for marriage equality, a new variable was created - difference. This took the yes_proportion and subtracted the no_proportion. A positive value indicated majority support for marriage equality, while a negative value indicated minority, with the proportional difference between the two being the numerical value. This variable will later aid analysis when looking into the relationship between political party support and marriage equality support, and the level of such support.

```{r}
# This is the R chunk for the Tidy & Manipulate Data II 
voting2 <- mutate(voting1, difference = yes_percentage - no_percentage)
head(voting2)

```


##	Scan I 


The data was scanned for missing data. There are 11 observations with all data missing relating to the marriage equality plebiscite. It constitutes 7.28% of the total 151 observations. Furthermore, nine of the eleven were seats help by the ALP, with one each held by the NP and LP. All three electorates in the ACT have plebiscite data missing, one of the NT's two seats, five of Victoria's thirty-eight seats, and one of SA's ten seats.

This is data set is to be used in an analysis to see if there is a relationship between support for marriage equality in the 2017 plebiscite, and voting support for particular political parties in the 2019 federal election. Therefore, more analysis is needed before action is taken. If the missing values were re-coded with median values or other values this could impact analysis results. The decision was made to leave the missing data as NA until more analysis was undertaken. Such analysis might include analysing the support for marriage equality in electorates with comparable political party support, or other metrics. 


```{r}
# This is the R chunk for the Scan I

colSums(is.na(voting2))

```


##	Scan II


Several univariate scans were undertaken to check for outliers. There were no obvious outliers except for one observation of the variable 'swing'. The Whitlam electorate observation had a swing value of 39%. Considering the other outliers were in the 11-12% range this was a significant outlier that required investigation. Upon researching the previous election result, for which swing is derived from in relation to the most recent election, it was found that in the previous election a Liberal party candidate had run but not in this election, hence the swing figure was not an accurate reflection, as the was no Liberal candidate for voters to swing against. The decision was made to change this observation to NA, and remove it for calculations. It was changed to NA so as to not alter other calculations and analysis.

```{r}
# This is the R chunk for the Scan II

voting2$difference %>% boxplot(main="Box Plot of support difference for marriage equality", ylab="support difference", col="grey")

voting2$Swing %>% boxplot(main="Box Plot of winning party swing", ylab="swing", col="grey")

voting2$yes_percentage %>% boxplot(main="Box plot of Marriage equality plebiscite yes vote by electorate", ylab= "proportion", col="grey")

voting2$`Liberal/National Coalition Percentage` %>% boxplot(main="Box plot of LNP vote percentage by electorate", ylab= "vote percentage", col="grey")

voting2$`Australian Labor Party Percentage` %>% boxplot(main="Box plot of Labor vote percentage by electorate", ylab= "vote percentage", col="grey")

voting2$Swing[148] <- NA

```


##	Transform 


Several variables were checked for distribution skew. The yes percentage variable had a mild left skew. The mathematical operation of a square root transformation was used at it was a left skew. The resulting object, trans_yes1, has a much reduced skew.


```{r}
# This is the R chunk for the Transform Section

voting2$`Australian Labor Party Percentage` %>% hist(xlab="labor percentage", main = "labor support", col = "grey" )

voting2$yes_percentage %>% hist(xlab="yes percentage", main = "yes support by electorate", col="grey")

trans_yes1 <- voting2$yes_percentage^2
hist(trans_yes1)


```


NOTE: Note that sometimes the order of the tasks may be different than the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. Therefore, for such cases you may have a different ordering of the sections.\  

Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head() )

<br>
<br>
