Assign 3 Pre Pos

Information is from the site stated below

http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/6523.02015-16?OpenDocument

In particular I want to look at three files from the above site

Household Income and Income Distribution, Victoria

Household Income and Income Distribution, Tasmania

Household Income and Income Distribution, Western Australia

Viewable at ( Published to )

  http://rpubs.com/fast_Eddie/431428

In this Study where we wish to compare Household Incomes between three Australian States.

The States of interest are Victoria, Tasmania and Western Australia.

We will look at household incomes for the years 1995 - 2016

please note : NOT all years were recorded for each state, however the years shown were recorded for each state

Executive Summary:

In this assignment we are specifically interested in assessing the income of three states i.e Victoria, Tasmania and Western Australia from 1996 to 2016 and we have compared the Western Australia and Tasmania states to Victoria. The variables we are looking at are the household income and the year, which are the numeric variables. In order to get the exact data what we wished to analyze, we would have to tidy the data. Tidying helped us to store data in a consistent format following the three interrelated rules like the years in one column and household income in row so that essentially each cell contains its own value. We performed the operations such as importing data, cleaning the data by reshaping data from long to wide format, filtering data, identifying outliers, and transformed the variables as well to make the data ready for the statistical analysis. After the analysis of the above data, for visualization we used ggplot2. The graph is showing the comparison of income for three states from 1995 to 2016. we can see that Victoria is showing steady growth and Tasmania is seems to be similar patterns to Victoria whereas WA showing a steep growth for certain amount of years and does appear to be now declining. The other graph is showing comparison of WA and TAS to Vic. As can be seen TAS tends to fluctuate around the 80% mark of when compared to Victorian household income.

Data Plain text & R code & Output:

For this assignment we chose the data from this site:

http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/6523.02015-16?OpenDocument

from this site we have chosen the following : Household Income and Income Distribution, Victoria Household Income and Income Distribution, Tasmania Household Income and Income Distribution, Western Australia. We have renamed the datasets to

Please NOTE , I had to rename the data sets to :

 TasHouse .xlsx  and  
 VicHouse .xlsx
 WestHouse .xlxs

Because they have the same names and the importing them caused problems due to overwriting. The column names were formatted to meaningful names In order to create a new data frame from selected variables.

Import Section

#library(readxl)

TasHouse <- read_excel("C:/Users/dan/Desktop/Pre Pro 3/TasHouse.xlsx")
#View(TasHouse)
VicHouse <- read_excel("C:/Users/dan/Desktop/Pre Pro 3/VicHouse.xlsx")
#View(VivHouse)
WestHouse <- read_excel("C:/Users/dan/Desktop/Pre Pro 3/WestHouse.xlsx")
#View(WestHouse)

# str(VicHouse)
# str(TasHouse)
# str(WestHouse)

Understand Plain text & R code & Output:

We have selected only those variables which we needed. The variables we intend to look at are household income and the year, which through constant checking and verifying will be coerced to numeric variables.

extract certain row and / or columns

All three Sections below will prepare each of the imported data sets

in readiness for analysis

Section 1 ######## Preparing Data for Victoria

Sort out what I wish to look at in the Victoria data frame

#need rows
vic94a <- VicHouse[,1:16]
vic94b <- VicHouse[ 7:13 , 1:16]  
vic94c <- VicHouse[ 16:24 , 1:16]

vic <- rbind(as.data.frame(vic94a, vic94b, vic94c))

vic1 <- vic[-c(25:117),]
#View(vic1)
vic1 <- vic1[-c(1:3),]
#View(vic1)
vic1 <- vic1[-c(2),]
#View(vic1)
vic1 <- vic1[-c(10),]
#View(vic1)

rownames(vic1) <- NULL  ##  Renumber rows
#View(vic1)
#head(vic1)

##-------------I'm using the row names in this section to 
##-------------Identify which rows are what later in the coding


vic1[1, 1] <- "Years"  ## Change the name of the rows
vic1[1, 2] <- "Dollars"

rownames(vic1) <- NULL##  Now remove row 2

 
vic2 <-  vic1[-c(2),]    ##############           <<<<<===========   Change to vic2 here
rownames(vic2) <- NULL##  Now remove row 2        ##     because this makes it easier to 
#View(vic2)                                       ##     know which works from what dosent 

vic2 <-  vic2[-c(9),]
rownames(vic2) <- NULL##  Now remove row 2
#View(vic2)

## NOTE rows 1 - 8 are Wages , and Rows 9 - 17 are Percentiles <<===== For the Victorian data  ##########


vic2 <- vic2[-c(2:6),]
rownames(vic2) <- NULL##  Now remove row 2
#View(vic2)

vic2 <- vic2[-c(3:12),]     ###                  
rownames(vic2) <- NULL##  Now remove row 

vic2[,2]<- NULL   ## Remove column 2           <<<<<==========    Reduce to 2 columns here
#str(vic2)

##  Rename columns


names(vic2) <- c("Years", "1995", "1996", "1997", "1998",
                          "2000", "2001", "2003", "2004",  
                          "2006", "2008", "2010", "2012",
                          "2014", "2016")

vic2 <- vic2[-1,]   #   c(3:12)
rownames(vic2) <- NULL##  Now remove row 2
#str(vic2)
#View(vic2)

##  Used gather here to Transform wide to long    <<<=============

vic2 <- gather(vic2)
#str(vic2)
colnames(vic2)[1] <- 'Years' 
colnames(vic2)[2] <- 'VicHouseholds'
## Now remove row 1, then change all to numeric
vic2 <- vic2[-1,]
## Now renumber rows 
rownames(vic2) <- NULL##  Now remove row 2

vic2 <- as.data.frame(vic2)   #, dimnames = NULL


vic2$Years <- as.numeric(vic2$Years)
vic2$VicHouseholds <- as.numeric(vic2$VicHouseholds)

#str(vic2)          ## Doing a structure check here check here   <<===============
#nrow(vic2)
## Checks and Winsoring  <<==========   Also imputational if needed

##----------------------------------------------------   KEEP  -------Check functions

##  This looks for NaN, Inf, -Inf , and then tells me if they are 
##  present in this particular column

is.special <- function(vic2){
if (is.numeric(vic2)) !is.finite(vic2) else is.na(vic2)
}

sum (sapply(vic2, is.special))  ## <<===     check here

## [1] 0

# Use Imputational method below ##  Not needed , so no variable changed   <<============

#surveys_combined$hindfoot_length <- impute(surveys_combined$hindfoot_length, fun = mean)

#View(vic2)
#str(vic2)

Section 2 ######## Preparing Data for Tasmania

———————————

Sort out what I wish to look at in the Tasmanian data frame

#need rows
t94a <- TasHouse[,1:16]
t94b <- TasHouse[ 7:13 , 1:16]  
t94c <- TasHouse[ 16:24 , 1:16]

tas <- rbind(as.data.frame(t94a, t94b, t94c))

tas1 <- tas[-c(25:117),]
#View(vic1)
tas1 <- tas1[-c(1:3),]
#View(vic1)
tas1 <- tas1[-c(2),]
#View(vic1)
tas1 <- tas1[-c(10),]
#View(vic1)

rownames(vic1) <- NULL  ##  Renumber rows
#View(vic1)
#head(vic1)

##-------------I'm using the row names in this section to 
##-------------Identify which rows are what later in the coding


tas1[1, 1] <- "Years"  ## Change the name of the rows
tas1[1, 2] <- "Dollars"

rownames(tas1) <- NULL##  Now remove row 2

## NOTE rows 1 - 8 are Wages , and Rows 9 - 17 are Percentiles <<===== For the Victorian data  ##########

tas2 <-  tas1[-c(2),]    ##############           <<<<<===========   Change to vic2 here
rownames(tas2) <- NULL##  Now remove row 2        ##     because this makes it easier to 
#View(vic2)                                       ##     know which works from what dosent 

tas2 <-  tas2[-c(9),]
rownames(tas2) <- NULL##  Now remove row 2
View(tas2)

tas2 <- tas2[-c(2:6),]
rownames(tas2) <- NULL##  Now remove row 2
#View(vic2)

tas2 <- tas2[-c(3:12),]     ###                  
rownames(tas2) <- NULL##  Now remove row 

tas2[,2]<- NULL   ## Remove column 2           <<<<<==========    Reduce to 2 columns here
#str(vic2)

##  Rename columns

names(tas2) <- c("Years", "1995", "1996", "1997", "1998",
                          "2000", "2001", "2003", "2004",  
                          "2006", "2008", "2010", "2012",
                          "2014", "2016")

tas2 <- tas2[-1,]   #   c(3:12)
rownames(tas2) <- NULL##  Now remove row 2
#str(vic2)
#View(vic2)


##  Used gather here to Transform wide to long    <<<=============

tas2 <- gather(tas2)
#str(vic2)
colnames(tas2)[1] <- 'Years' 
colnames(tas2)[2] <- 'TasHouseholds'
## Now remove row 1, then change all to numeric
tas2 <- tas2[-1,]
## Now renumber rows 
rownames(tas) <- NULL##  Now remove row 2
## Change column 2 of this data frame to numeric for analysis and NA check

tas2 <- as.data.frame(tas2)   #, dimnames = NULL


tas2$Years <- as.numeric(tas2$Years)
tas2$TasHouseholds <- as.numeric(tas2$TasHouseholds)

#str(tas2)          ## Doing a structure check here check here   <<===============
#nrow(tas2)
## Checks and Winsoring  <<==========   Also imputational if needed

##----------------------------------------------------   KEEP  -------Check functions

##  This looks for NaN, Inf, -Inf , and then tells me if they are 
##  present in this particular column

is.special <- function(tas2){
if (is.numeric(tas2)) !is.finite(tas2) else is.na(tas2)
}

sum (sapply(tas2, is.special))  ## <<===     check here

## [1] 0

# Use Imputational method below ##  Not needed , so no variable changed   <<============

#surveys_combined$hindfoot_length <- impute(surveys_combined$hindfoot_length, fun = mean)

#View(tas2)

Section 3 ######## Preparing Data for Western Australia

———————————

Sort out what I wish to look at in the Western Australia data frame

#need rows
w94a <- WestHouse[,1:16]
w94b <- WestHouse[ 7:13 , 1:16]  
w94c <- WestHouse[ 16:24 , 1:16]

wes <- rbind(as.data.frame(w94a, w94b, w94c))

wes1 <- wes[-c(25:117),]
#View(vic1)
wes2 <- wes1[-c(1:3),]
#View(vic1)
wes2 <- wes2[-c(2),]
#View(vic1)
wes2 <- wes2[-c(10),]
#View(vic1)

rownames(wes2) <- NULL  ##  Renumber rows
#View(vic1)
#head(vic1)

##-------------I'm using the row names in this section to 
##-------------Identify which rows are what later in the coding


wes2[1, 1] <- "Years"  ## Change the name of the rows
wes2[1, 2] <- "Dollars"

rownames(wes2) <- NULL##  Now remove row 2

 
wes2 <-  wes2[-c(2),]    ##############           <<<<<===========   Change to vic2 here
rownames(wes2) <- NULL##  Now remove row 2        ##     because this makes it easier to 
#View(vic2)                                       ##     know which works from what dosent

## NOTE rows 1 - 8 are Wages , and Rows 9 - 17 are Percentiles <<===== For the Victorian data  ##########

wes2 <-  wes2[-c(9),]
rownames(wes2) <- NULL##  Now remove row 2
View(wes2)

wes2 <- wes2[-c(2:6),]
rownames(tas2) <- NULL##  Now remove row 2
#View(vic2)

wes2 <- wes2[-c(3:12),]     ###                  
rownames(wes2) <- NULL##  Now remove row 

wes2[,2]<- NULL   ## Remove column 2           <<<<<==========    Reduce to 2 columns here
#str(vic2)

##  Rename columns

names(wes2) <- c("Years", "1995", "1996", "1997", "1998",
                          "2000", "2001", "2003", "2004",  
                          "2006", "2008", "2010", "2012",
                          "2014", "2016")

wes2 <- wes2[-1,]   #   c(3:12)
rownames(wes2) <- NULL##  Now remove row 2
#str(vic2)
#View(vic2)

##  Used gather here to Transform wide to long    <<<=============

wes2<- gather(wes2)
#str(vic2)
colnames(wes2)[1] <- 'Years' 
colnames(wes2)[2] <- 'WesHouseholds'
## Now remove row 1, then change all to numeric
wes2 <- wes2[-1,]
## Now renumber rows 
rownames(wes2) <- NULL##  Now remove row 2

## Change column 2 of this data frame to numeric for analysis and NA check

wes2 <- as.data.frame(wes2)   #, dimnames = NULL


wes2$Years <- as.numeric(wes2$Years)
wes2$WesHouseholds <- as.numeric(wes2$WesHouseholds)

#str(wes2)          ## Doing a structure check here check here   <<===============
#nrow(wes2)

## Checks and Winsoring  <<==========   Also imputational if needed

##----------------------------------------------------   KEEP  -------Check functions

##  This looks for NaN, Inf, -Inf , and then tells me if they are 
##  present in this particular column

is.special <- function(wes2){
if (is.numeric(wes2)) !is.finite(wes2) else is.na(wes2)
}

sum (sapply(vic2, is.special))  ## <<===     check here

## [1] 0

#View(wes2)

# Use Imputational method below ##  Not needed , so no variable changed   <<============

#surveys_combined$hindfoot_length <- impute(surveys_combined$hindfoot_length, fun = mean)

Section 4 ######## Check that data elements are identical

Using base r

Tidy & Manipulate Data I ( Plain text & R code & Output) :

In order to get the exact data what we wished to analyze, we would have to tidy the data. Tidying helped us to store data in a consistent format, so that essentially each cell contains its own value. We used the filter function to select the data what we needed. Column name change: the column names were formatted to meaningful names In order to create a new data frame from selected variables rbind : rows were bound together with rbind.

We used gather() to change from long to wide. (line 1), and rename columns in the WA data frame

Use identical test to make sure all elements in the first column match for the 3 data sets

identical(vic2[1], tas2[1])  ## looking at Years to see if they're identical befor we combine

## [1] TRUE

identical(vic2[1], wes2[1])  ## into one data frame

## [1] TRUE

Section 5 ######## Add data columns together to forma a new data frame

Merge all three cleaned data sets into one data set using left_join and inner_join

Combined <- left_join(vic2, tas2 )    ## This is the first join

## Joining, by = "Years"

Combined <- inner_join(Combined, wes2)  ## This is the second join

## Joining, by = "Years"

                                        ## All joined by years            <<=========
#View(Combined)
#head(Combined)
str(Combined)

## 'data.frame':    14 obs. of  4 variables:
##  $ Years        : num  1995 1996 1997 1998 2000 ...
##  $ VicHouseholds: num  1340 1297 1342 1400 1534 ...
##  $ TasHouseholds: num  1146 1138 1122 1107 1134 ...
##  $ WesHouseholds: num  1339 1341 1396 1394 1449 ...

#class(Combined)

Tidy & Manipulate Data II( Plain text & R code & Output):

The mutate function was used three time to add 3 more columns to allow us to perform a comparison of Household Income levels between the states. We performed these final checks,

Structure, head, view

To ensure that the data we now have is consistent and in a suitable state to perform the analysis correctly a structure check is performed

Scan II (Plain text & R code & Output:) (outliers.)

As this study was essentually a comparative Analysis, there weren’t in our case any outliers.

Transform Plain text & R code & Output:

Apply an appropriate transformation for at least one of the variables.

As can be seen below, TasWesDif (the 3rd in list below) will be changed to a Factor, prior to this conversion we also changed the column name (Label) to identify what the column was.

Use Mutate to add % dif columns in Household incomes between the three states

Also change one column to Factor

Combined <- mutate(Combined, VicTasDif =  TasHouseholds / VicHouseholds )
Combined$VicTasDif =  round(Combined$VicTasDif, 3)
            

Combined <- mutate(Combined, VicWesDif =  WesHouseholds / VicHouseholds )
Combined$VicWesDif =  round(Combined$VicWesDif, 3)

Combined <- mutate(Combined, TasWesDif = TasHouseholds / WesHouseholds)
#Combined$TasWesDif =  (as.factor(Combined$TasWesDif))

Combined$TasWesDif <- as.numeric(as.character(Combined$TasWesDif))

Combined$TasWesDif <- log10(Combined$TasWesDif) 

Combined$TasWesDif <- round(Combined$TasWesDif, digits=3)
Combined$TasWesDif =  (as.factor(Combined$TasWesDif))

str(Combined)

## 'data.frame':    14 obs. of  7 variables:
##  $ Years        : num  1995 1996 1997 1998 2000 ...
##  $ VicHouseholds: num  1340 1297 1342 1400 1534 ...
##  $ TasHouseholds: num  1146 1138 1122 1107 1134 ...
##  $ WesHouseholds: num  1339 1341 1396 1394 1449 ...
##  $ VicTasDif    : num  0.855 0.877 0.836 0.791 0.739 0.73 0.752 0.784 0.792 0.749 ...
##  $ VicWesDif    : num  0.999 1.034 1.04 0.996 0.945 ...
##  $ TasWesDif    : Factor w/ 12 levels "-0.225","-0.171",..: 12 11 9 8 6 5 7 10 6 4 ...

#View(Combined)
head(Combined, 3)

##   Years VicHouseholds TasHouseholds WesHouseholds VicTasDif VicWesDif
## 1  1995          1340          1146          1339     0.855     0.999
## 2  1996          1297          1138          1341     0.877     1.034
## 3  1997          1342          1122          1396     0.836     1.040
##   TasWesDif
## 1    -0.068
## 2    -0.071
## 3    -0.095

nrow() was used for all three data sets to determine all row lenghts were equal. Identical was used to compare and ensure all three data sets were identical in the first column to each other before combining, we then merged these 3 data frames into 1 data frame by using the below functions.

left join ,inner join After joining the columns, we have performed the following four checks again. Structure head, class, view (Visual checks)

And then we gave the new file a new name called combined.

ggplots

For visualization we used ggplot2.

ggplot(Combined, aes(Combined$Years)) + 
  geom_line(aes(y = Combined$VicHouseholds, colour = "red")) + 
  geom_line(aes(y = Combined$TasHouseholds, colour = "blue")) +
  geom_line(aes(y = Combined$WesHouseholds, colour = "green")) +
  xlab("Years") + ylab("Household Incomes per State ($)") +
  scale_colour_discrete(name="States", labels=c("Tas","WA","Vic")) +
  theme(panel.border = element_blank())

The graph shows the changes in the household income during the years from 1996 to 2016 for the three states i.e Vic, Tas and WA.

##     Plot of WA & Tas showing % difference in household to Victoria

ggplot(Combined, aes(Combined$Years)) + 
  geom_line(aes(y = Combined$VicTasDif, colour = "red")) + 
  geom_line(aes(y = Combined$VicWesDif, colour = "blue")) +
  #geom_line(aes(y = Combined$TasWesDif, colour = "green")) +
  xlab("Years") + ylab("As  %  to Victoria Household Income") +
  scale_colour_discrete(name="States", labels=c("WA","Tas","Vic")) +
  theme(panel.border = element_blank())

In above graph shows the comparison of the Tas and WA house with the Vic household.

write.csv(Combined, file = "Combined.csv")  ## Create new file and save back to your folder

write.csv was used to save a copy of all of the above work back to our work folder We wanted to save all our work to refer back again if we need to, so we used the write.csv function to save the copy back to the working folder.

Conclusion

As a study we found this to quite interesting, notably in the variance between WA and Victoria. In the years 2005 to 2016 WA household income has had a distinctive increase compared to Victoria and for that matter when also compared to Tasmania. Whereas Tasmania appears to have tracked Victoria while consistently remaining at about 80% of the Victorain household income for the years 1996 to 2016. Its possible to conclude from this study that the WA mining boom over the years 2005 to 2016 has had an impact on the WA household income.

Citations

Please NOTE all information in relation to this study was accessed from the sites as noted above

Assign 3 Pre Pos

s3686502 Dan Enoka and s3350855 Grace Darlene

8 October 2018

Information is from the site stated below

Viewable at ( Published to )

The States of interest are Victoria, Tasmania and Western Australia.

Executive Summary:

Data Plain text & R code & Output:

Please NOTE , I had to rename the data sets to :

Import Section

Understand Plain text & R code & Output:

extract certain row and / or columns

All three Sections below will prepare each of the imported data sets

in readiness for analysis

Sort out what I wish to look at in the Victoria data frame

———————————

Sort out what I wish to look at in the Tasmanian data frame

———————————

Sort out what I wish to look at in the Western Australia data frame

Tidy & Manipulate Data I ( Plain text & R code & Output) :

Use identical test to make sure all elements in the first column match for the 3 data sets

Merge all three cleaned data sets into one data set using left_join and inner_join

Tidy & Manipulate Data II( Plain text & R code & Output):

Use Mutate to add % dif columns in Household incomes between the three states

Also change one column to Factor

ggplots

Conclusion

Citations