The simplest form of merge() finds the intersection between two different sets of data.
Merge() is very similar to a database join in SQL or merge function in Power BI. This is, indeed, the case and the different arguments to merge() allow you to perform natural joins, as well as left, right, and full outer joins.
The merge() function takes quite a large number of arguments. These arguments can look quite intimidating until you realize that they form a smaller number of related arguments:
x: A data frame.
y: A data frame.
by, by.x, by.y: The names of the columns that are common to both x and y. The default is to use the columns with common names between the two data frames.
all, all.x, all.y: Logical values that specify the type of merge. The default value is all=FALSE (meaning that only the matching rows are returned).
That last group of arguments - all, all.x and all.y - deserves some explanation. These arguments determine the type of merge that will happen.
The data sources used to gather the variables for the project were:
Individual’s taxation statistic by Postcode provided by the Australian Taxation Office (Years 2013-2016)
Department of Social Services (DSS) Payment Demographic Data by postcode for the years 2013 to 2016
some caption
Those data sources provide information of the postcode demographic which allow to identify the characteristic of the locations that require major attention for the government in terms of welfare payments.
The Payment Demographic Data by postcode and the Individual’s taxation statistic by Postcode are organised in different tables for each year. Therefore, the first step for the process was gather each table per year and then append all the years in one table per data set.
In this case, we are going to use Left outer join: To include all the rows of the data frame x = (Individual’s taxation statistic by Postcode) and only those from y = (Department of Social Services (DSS) Payment Demographic Data by postcode) that match, specify all.x=TRUE.
Let’s start:
library(readxl)
library(httr)
library(dplyr)
library(tidyr)
#### Table 1 Data for 2013
url2 <- "https://data.gov.au/dataset/e29ef9ca-0d1a-47ec-9e9b-14a79a941511/resource/fa6fda82-a70d-4f4c-bb00-203ffc4d20af/download/taxstats2013individual06selected-items-by-taxable-status-postcode.xlsx"
p2f <- tempfile()
GET(url2, write_disk(tf2 <- tempfile(fileext = ".xlsx")))
## Response [https://data.gov.au/dataset/e29ef9ca-0d1a-47ec-9e9b-14a79a941511/resource/fa6fda82-a70d-4f4c-bb00-203ffc4d20af/download/taxstats2013individual06selected-items-by-taxable-status-postcode.xlsx]
## Date: 2018-10-03 06:45
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 7.46 MB
## <ON DISK> C:\Users\josev\AppData\Local\Temp\Rtmpk3s4Yc\file4e446982310c.xlsx
T2013 <- as.data.frame(read_excel(tf2, sheet = 3, col_names = TRUE, skip = 2))
T2013n <- T2013[, c(1,2,3,4,7,8,17,18,25,26,27,28,33,34,95,96,113,114)]
colnames(T2013n) <- c("State", "Postcode", "No.Population", "No.HealthInsurance","No.Tax", "Tax", "No.Salary", "Salary",
"No.Allowances", "Allowances", "No.Pensions", "Pensions", "No.Interest", "Interest", "NoRent",
"Rent", "No.Superannuation", "Superannuation")
rep2013 <- rep("2013", length(T2013n$State))
T2013n <- T2013n%>%mutate(Year = rep2013)
#### Table 1 Data for 2014
url3 <- "https://data.gov.au/dataset/25e81c18-2083-4abe-81b6-0f530053c63f/resource/942c9a30-baf2-4432-9513-1395fbef7edf/download/taxstats2014individual06taxablestatuspostcode.xlsx"
p3f <- tempfile()
GET(url3, write_disk(tf3 <- tempfile(fileext = ".xlsx")))
## Response [https://data.gov.au/dataset/25e81c18-2083-4abe-81b6-0f530053c63f/resource/942c9a30-baf2-4432-9513-1395fbef7edf/download/taxstats2014individual06taxablestatuspostcode.xlsx]
## Date: 2018-10-03 06:45
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 5.02 MB
## <ON DISK> C:\Users\josev\AppData\Local\Temp\Rtmpk3s4Yc\file4e445f2a443d.xlsx
T2014 <- as.data.frame(read_excel(tf3, sheet = 2, col_names = TRUE, skip = 2))
T2014n <- T2014[ ,c(2,3,4,5,8,9,18,19,26,27,28,29,34,35,96,97,114,115)]
colnames(T2014n) <- c("State", "Postcode", "No.Population", "No.HealthInsurance","No.Tax", "Tax", "No.Salary", "Salary",
"No.Allowances", "Allowances", "No.Pensions", "Pensions", "No.Interest", "Interest", "NoRent",
"Rent", "No.Superannuation", "Superannuation")
T2014n <- T2014n%>%group_by(State, Postcode)%>%
summarize(No.Population = sum(No.Population), No.HealthInsurance = sum(No.HealthInsurance), No.Tax = sum(No.Tax),
Tax = sum(Tax), No.Salary = sum(No.Salary), Salary = sum(Salary), No.Allowances = sum(No.Allowances),
Allowances = sum(Allowances), No.Pensions = sum(No.Pensions), Pensions = sum(Pensions), No.Interest = sum(No.Interest), Interest = sum(Interest),
NoRent = sum(NoRent), Rent = sum(Rent), No.Superannuation = sum(No.Superannuation), Superannuation = sum(Superannuation))
rep2014 <- rep("2014", length(25))
T2014n <- as.data.frame(T2014n%>%mutate(Year = rep2014, Postcode = as.character(Postcode)))
#### Table 1 Data for 2015
url4 <- "https://data.gov.au/dataset/5c99cfed-254d-40a6-af1c-47412b7de6fe/resource/12443334-2273-4e0f-a2e1-c9d0c1d8213b/download/taxstats2015individual06taxablestatusstateterritorypostcode.xlsx"
p4f <- tempfile()
GET(url4, write_disk(tf4 <- tempfile(fileext = ".xlsx")))
## Response [https://data.gov.au/dataset/5c99cfed-254d-40a6-af1c-47412b7de6fe/resource/12443334-2273-4e0f-a2e1-c9d0c1d8213b/download/taxstats2015individual06taxablestatusstateterritorypostcode.xlsx]
## Date: 2018-10-03 06:45
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 7.72 MB
## <ON DISK> C:\Users\josev\AppData\Local\Temp\Rtmpk3s4Yc\file4e44381524f6.xlsx
T2015 <- as.data.frame(read_excel(tf4, sheet = 3, col_names = TRUE, skip = 2))
T2015n <- T2015[ ,c(1,2,3,134,6,7,16,17, 24,25,26,27,32,33,94,95,112,113)]
colnames(T2015n) <- c("State", "Postcode", "No.Population", "No.HealthInsurance","No.Tax", "Tax", "No.Salary", "Salary",
"No.Allowances", "Allowances", "No.Pensions", "Pensions", "No.Interest", "Interest", "NoRent",
"Rent", "No.Superannuation", "Superannuation")
rep2015 <- rep("2015", length(T2015$State))
T2015n <- T2015n%>%mutate(Year = rep2015)
#### Table 1: Data for 2016
url5 <- "https://data.gov.au/dataset/d170213c-4391-4d10-ac24-b0c11768da3f/resource/8c86b6d5-0485-4387-b12b-ad26a9da2033/download/taxstats2016individual06taxablestatusstateterritorypostcodetaxableincome.xlsx"
p5f <- tempfile()
GET(url5, write_disk(tf5 <- tempfile(fileext = ".xlsx")))
## Response [https://data.gov.au/dataset/d170213c-4391-4d10-ac24-b0c11768da3f/resource/8c86b6d5-0485-4387-b12b-ad26a9da2033/download/taxstats2016individual06taxablestatusstateterritorypostcodetaxableincome.xlsx]
## Date: 2018-10-03 06:45
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 8.36 MB
## <ON DISK> C:\Users\josev\AppData\Local\Temp\Rtmpk3s4Yc\file4e44825e4.xlsx
T2016 <- as.data.frame(read_excel(tf5, sheet = 3, col_names = TRUE, skip = 2))
T2016n <- T2016[ ,c(1,2,3,140,6,7,18,19,28,29,30,31,36,37,100,101,118,119)]
colnames(T2016n) <- c("State", "Postcode", "No.Population", "No.HealthInsurance","No.Tax", "Tax", "No.Salary", "Salary",
"No.Allowances", "Allowances", "No.Pensions", "Pensions", "No.Interest", "Interest", "NoRent",
"Rent", "No.Superannuation", "Superannuation")
rep2016 <- rep("2016", length(T2016$State))
T2016n <- T2016n%>%mutate(Year = rep2016)
#### Consolidation Table 1 (Appending)
table1 <- rbind(T2013n, T2014n, T2015n, T2016n)
head(table1)
## State Postcode No.Population No.HealthInsurance No.Tax Tax
## 1 ACT 2600 5755 4455 4960 155433316
## 2 ACT 2601 2650 1360 1800 37896737
## 3 ACT 2602 19110 12325 16355 284355118
## 4 ACT 2603 6180 4725 5370 177086957
## 5 ACT 2604 7165 5070 6285 142567504
## 6 ACT 2605 7520 5565 6475 144414959
## No.Salary Salary No.Allowances Allowances No.Pensions Pensions
## 1 4025 330601387 175 867475 195 1452867
## 2 2185 125396805 195 1192431 45 371161
## 3 15525 1045005856 1075 5975182 940 7201386
## 4 4600 373517913 225 1225973 170 1557289
## 5 5970 491470590 285 1368105 210 1805084
## 6 5570 425204412 225 1162566 475 3396660
## No.Interest Interest NoRent Rent No.Superannuation Superannuation
## 1 4105 14852802 1560 41760840 225 4601346
## 2 1775 3622769 545 14057264 25 516019
## 3 13075 27229224 3430 71800358 235 4049535
## 4 4380 15509702 1530 45263237 225 4903593
## 5 4820 10721436 2000 44753380 100 1606696
## 6 5245 15611915 1610 36426162 140 2672196
## Year
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## Response [https://data.gov.au/dataset/cff2ae8a-55e4-47db-a66d-e177fe0ac6a0/resource/6b2343f4-09d3-4b03-a1fe-41897f710c6c/download/dssdemographicsdecember2013.xlsx]
## Date: 2018-10-03 06:45
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 668 kB
## <ON DISK> C:\Users\josev\AppData\Local\Temp\Rtmpk3s4Yc\file4e444878185c.xlsx
## Response [https://data.gov.au/dataset/cff2ae8a-55e4-47db-a66d-e177fe0ac6a0/resource/f0615bb3-463f-4352-902c-0b6bb0e22e7d/download/june-2014-dss-demographics-suppressed-2014-final-ftb.xlsx]
## Date: 2018-10-03 06:45
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 1.19 MB
## <ON DISK> C:\Users\josev\AppData\Local\Temp\Rtmpk3s4Yc\file4e445c2123cf.xlsx
## Response [https://data.gov.au/dataset/cff2ae8a-55e4-47db-a66d-e177fe0ac6a0/resource/e6457899-378e-406f-8027-a6ee8a19eec6/download/dss-demographics-june-2015.xlsx]
## Date: 2018-10-03 06:45
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 1.72 MB
## <ON DISK> C:\Users\josev\AppData\Local\Temp\Rtmpk3s4Yc\file4e441a6f71ac.xlsx
## Response [https://data.gov.au/dataset/cff2ae8a-55e4-47db-a66d-e177fe0ac6a0/resource/829fc910-e263-4a59-ad33-1cd1159ce0df/download/dss-demographics-june-2016.xlsx]
## Date: 2018-10-03 06:45
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 1.7 MB
## <ON DISK> C:\Users\josev\AppData\Local\Temp\Rtmpk3s4Yc\file4e444003fd7.xlsx
## 'data.frame': 11619 obs. of 29 variables:
## $ Postcode : chr "200" "800" "801" "803" ...
## $ ABSTUDY (Living allowance) : num 10 10 10 10 10 36 10 10 10 10 ...
## $ ABSTUDY (Non-living allowance) : num 10 10 10 10 10 90 10 115 10 10 ...
## $ Age Pension : num 10 171 10 10 10 ...
## $ Austudy : num 10 10 10 10 10 47 10 26 10 10 ...
## $ Carer Allowance : num 10 20 10 10 10 421 10 337 10 10 ...
## $ Carer Allowance (Child Health Care Card only) : num 10 10 10 10 10 10 10 20 10 10 ...
## $ Carer Payment : num 10 10 10 10 10 148 10 121 10 10 ...
## $ Commonwealth Seniors Health Card : num 10 34 10 10 10 212 10 90 10 10 ...
## $ Double Orphan Pension : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Disability Support Pension : num 10 295 21 10 10 954 10 555 10 10 ...
## $ Family Tax Benefit Part A : num 10 135 10 10 10 ...
## $ Family Tax Benefit Part B : num 10 122 10 10 10 ...
## $ Health Care Card : num 10 308 10 10 10 1350 10 935 10 10 ...
## $ Low Income Card : num 10 22 10 10 10 177 10 100 10 10 ...
## $ Newstart Allowance : num 10 224 10 10 10 686 10 388 10 10 ...
## $ Parenting Payment (Partnered) : num 10 10 10 10 10 83 10 68 10 10 ...
## $ Parenting Payment (Single) : num 10 27 10 10 10 324 10 286 10 10 ...
## $ Partner Allowance : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Pensioner Concession Card : num 10 505 37 10 10 ...
## $ Sickness Allowance : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Special Benefit : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Wife Pension (Partner on Age Pension) : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Wife Pension (Partner on Disability Support Pension): num 10 10 10 10 10 10 10 10 10 10 ...
## $ Widow Allowance : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Widow B Pension : num 10 10 10 10 10 10 10 10 10 10 ...
## $ Youth Allowance (other) : num 10 10 10 10 10 94 10 84 10 10 ...
## $ Youth Allowance (student and apprentice) : num 10 10 10 10 10 85 10 59 10 10 ...
## $ Year : num 2013 2013 2013 2013 2013 ...
After the appending, the data from DSS payment demographic was merged to the Individual taxation statistic table using the variables “Postcode” and “Year” as connector of the tables (Primary Key and Foreign Key).
mergetables <- merge(x= table1, y = table2, by.x = c('Postcode','Year'), by.y = c('Postcode','Year'), all.x = TRUE)
mergetables[is.na(mergetables)] <- 0
After merging and joining the different data sources, the resultant consolidation is a data frame with 10082 observation and 46 variables (See below)
## 'data.frame': 10282 obs. of 46 variables:
## $ Postcode : chr "2000" "2000" "2000" "2000" ...
## $ Year : chr "2013" "2014" "2015" "2016" ...
## $ State : chr "NSW" "NSW" "NSW" "NSW" ...
## $ No.Population : num 34525 35372 36185 24207 130 ...
## $ No.HealthInsurance : num 7115 7410 7707 7710 65 ...
## $ No.Tax : num 18525 18937 19471 15806 90 ...
## $ Tax : num 3.86e+08 4.14e+08 4.52e+08 4.78e+08 2.34e+06 ...
## $ No.Salary : num 30310 30942 31293 20465 105 ...
## $ Salary : num 1.09e+09 1.16e+09 1.21e+09 1.08e+09 6.28e+06 ...
## $ No.Allowances : num 400 364 373 331 15 6 4 4 215 213 ...
## $ Allowances : num 2399279 2307947 2295483 2019617 74853 ...
## $ No.Pensions : num 140 133 149 132 0 3 2 2 45 42 ...
## $ Pensions : num 1465139 1499689 1702404 1635906 6521 ...
## $ No.Interest : num 11660 11835 13510 13183 75 ...
## $ Interest : num 32708556 32075103 30352478 32943378 197975 ...
## $ NoRent : num 3075 3247 3576 2586 20 ...
## $ Rent : num 1.17e+08 1.23e+08 1.35e+08 1.05e+08 5.16e+05 ...
## $ No.Superannuation : num 295 321 350 328 0 0 1 0 20 18 ...
## $ Superannuation : num 5917261 7398363 9335644 9442442 25000 ...
## $ ABSTUDY (Living allowance) : num 10 10 10 11 10 0 10 0 10 23 ...
## $ ABSTUDY (Non-living allowance) : num 10 0 0 0 10 0 0 0 10 10 ...
## $ Age Pension : num 468 460 420 405 10 0 0 0 197 204 ...
## $ Austudy : num 39 43 29 24 10 0 0 0 31 46 ...
## $ Carer Allowance : num 105 105 85 85 10 0 0 0 39 48 ...
## $ Carer Allowance (Child Health Care Card only) : num 10 10 10 0 10 0 0 0 10 0 ...
## $ Carer Payment : num 62 64 42 42 10 0 0 0 25 28 ...
## $ Commonwealth Seniors Health Card : num 162 0 182 0 10 0 0 0 22 0 ...
## $ Double Orphan Pension : num 10 422 0 228 10 0 0 0 10 181 ...
## $ Disability Support Pension : num 370 360 267 324 10 0 0 0 117 112 ...
## $ Family Tax Benefit Part A : num 400 422 341 263 10 0 0 0 184 183 ...
## $ Family Tax Benefit Part B : num 336 351 297 493 10 10 0 20 172 165 ...
## $ Health Care Card : num 628 116 561 147 28 10 10 21 389 108 ...
## $ Low Income Card : num 154 382 136 268 27 0 10 0 196 136 ...
## $ Newstart Allowance : num 395 10 344 17 10 0 0 0 148 10 ...
## $ Parenting Payment (Partnered) : num 22 45 24 28 10 0 0 0 10 21 ...
## $ Parenting Payment (Single) : num 38 10 36 0 10 0 0 0 23 10 ...
## $ Partner Allowance : num 10 1014 10 786 10 ...
## $ Pensioner Concession Card : num 1032 168 850 173 10 ...
## $ Sickness Allowance : num 10 10 10 0 10 0 0 0 10 0 ...
## $ Special Benefit : num 10 10 10 7 10 0 0 0 10 10 ...
## $ Wife Pension (Partner on Age Pension) : num 10 10 10 7 10 0 0 0 10 10 ...
## $ Wife Pension (Partner on Disability Support Pension): num 10 0 0 0 10 0 0 0 10 0 ...
## $ Widow Allowance : num 10 10 10 0 10 0 0 0 10 0 ...
## $ Widow B Pension : num 10 10 0 0 10 0 0 0 10 0 ...
## $ Youth Allowance (other) : num 10 23 10 10 10 0 10 0 10 10 ...
## $ Youth Allowance (student and apprentice) : num 109 150 128 131 36 44 37 33 204 242 ...