Constantine Makris - s3498940
Last updated: 24 October, 2019
When property is transferred between individuals (or companies) duty is paid to the state government where that property is situated. The main factors in determining how much duty is payable is the sale price of the property as well it’s intended use, the exact formula being beyond the scope of this investigation.
This investigation aims to determine if there is a correlation between the average duty paid per postcode and the average income of individuals in that particular postcode. This investigation will only look at postcodes in Victoria.
The data you used for this investigation have been sourced from both the Federal and State Governments via their respective open data portals. Data on duty paid was originally compiled the State Revenue Office of Victoria while income date was originally compiled from the Australian Tax Office records. Data used was for the financial year ended June 2017. This was the most UpToDate data available for both data sets.
Is there a correlation between mean income and mean duty paid, or more specifically, a positive correlation, on a postcode basis
This question will aim to answered by determining a correlation between the two sets of data using R as well as hypothesis testing the said correlation as well as the slope. Graphically presenting the data will attempt to illustrate if the is any correlation present. Both variables could be considered ‘predictor’ variables but I have chosen to use mean income as the predictor as you earn your income first then use it to purchase a property.
download.file('https://data.gov.au/data/dataset/5fa69f19-ec44-4c46-88eb-0f6fd5c2f43b/resource/a05364a6-b127-4c08-8edf-e6e96ec9e1be/download/ts17individual06taxablestatusstatepostcode.xlsx',
'ts17individual06taxablestatusstatepostcode.xlsx',mode="wb")
taxData <- read_xlsx("ts17individual06taxablestatusstatepostcode.xlsx",sheet = "Individuals Table 6B")From both datasets ‘postcode’ was used as the key to join both sets of data.
The data set for transfer duty only contained two additional fields:
• Record_Count
• Duty.Paid
Both these fields were used to calculate the mean duty paid for each postcode
The data set for taxation information contained 152 fields of which only three were used for this investigation:
• Postcode (Key)
• Taxable income or loss3 no.
• Taxable income or loss3 $
These fields were used to calculate the mean income for each postcode.
Pre-processing
The were many steps involved in getting the data into a state where it could be used to undertake the required testing. First the date sets where loaded into two dataframes, being taxData and transferData.
Firstly the tax information needed to be cleaned up as there were many fields and rows that were not required.
taxData <- taxData[c(2,4:5)] #Remove columns not required
colnames(taxData) <- c('Postcode','TaxableIncome.No','TaxableIncome.Amount') #Assign userfreindly field names
taxData <- taxData[-c(1:2),] #remove redundant rows at top of dataframeThe two dataframes were then joined. Row 635 removed as it was for data relating to transfers were the postcode was unknown.
The data in both data frames were in char format and needed to be converted to numeric, data also contained commas which needed to be removed prior to conversion.
transferData$Record_Count <- gsub(",","",transferData$Record_Count) %>% as.numeric() #remove commas and convert to numeric
transferData$Duty.Paid <- gsub(",","",transferData$Duty.Paid) %>% as.numeric()
transferData$TaxableIncome.No <- gsub(",","",transferData$TaxableIncome.No) %>% as.numeric()
transferData$TaxableIncome.Amount <- gsub(",","",transferData$TaxableIncome.Amount) %>% as.numeric()Finally, mean values calculated and columns not required for final analysis removed
Attributes of each dataset as follows:
stats <- c('Mean Transfer Duty','Mean Income') %>% as.data.frame()
colnames(stats) <- c('Dataset')
stats$min <- c(min(transferData$meanDuty),min(transferData$meanIncome))
stats$Q1 <- c(quantile(transferData$meanDuty,probs = .25),quantile(transferData$meanIncome,probs = .25))
stats$median <- c(median(transferData$meanDuty),median(transferData$meanIncome))
stats$Q3 <- c(quantile(transferData$meanDuty,probs = .75),quantile(transferData$meanIncome,probs = .75))
stats$max <- c(max(transferData$meanDuty),max(transferData$meanIncome))
stats$mean <- c(mean(transferData$meanDuty),mean(transferData$meanIncome))
stats$sd <- c(sd(transferData$meanDuty),sd(transferData$meanIncome))
n <- transferData$meanDuty %>% length()
knitr::kable(stats)| Dataset | min | Q1 | median | Q3 | max | mean | sd |
|---|---|---|---|---|---|---|---|
| Mean Transfer Duty | 1128.667 | 9022.157 | 15031.75 | 29753.91 | 318991.1 | 22513.96 | 22505.53 |
| Mean Income | 26589.569 | 43417.791 | 49017.35 | 56713.59 | 196386.9 | 52990.95 | 16322.08 |
Correlation stats:
transferDataModel <- lm(meanDuty ~ meanIncome, data = transferData)
transferDataSummary <- transferDataModel %>% summary()
transferDataCoef <- transferDataSummary %>% coef()
transferDataSummary##
## Call:
## lm(formula = meanDuty ~ meanIncome, data = transferData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33759 -7211 -2556 3984 266065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.938e+04 2.141e+03 -13.72 <2e-16 ***
## meanIncome 9.792e-01 3.861e-02 25.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15860 on 632 degrees of freedom
## Multiple R-squared: 0.5043, Adjusted R-squared: 0.5036
## F-statistic: 643.1 on 1 and 632 DF, p-value: < 2.2e-16
From the above summary we can derive the following important statistical information:
\(F\)-statistic = 643.084
\(a\) (intecept) = -29375.58
\(b\) (slope) = 0.979
\(t\)-value (intercept) = -13.721
\(t\)-value (slope) = 25.359
Using the derived statistics we can plot a scatter plot and regression line to visually present the relationship as well as have the information required to conduct our hypothesis testing
a <- transferDataCoef[1,1] %>% round(3) #a
b <- transferDataCoef[2,1] %>% round(3) #b
par(cex= 0.5,cex.axis = 1.4,cex.lab = 1.4, cex.main = 1.8,cex.sub = 1.4,fig=c(0, 1, 0.4, 1),pin=c(5,3))
plot(meanDuty ~ meanIncome,main="Mean Income Earned vs Mean Duty Paid - Vic", data = transferData, xlab = "Mean Income", ylab = "Mean Duty Paid")
abline(a = a, b = b, col= "red",lwd = 3)Using hypothesis we can test the statistical significance of the correlation, the constant and the slope.
Test the overall regression model
Using the \(F\)-Statistic we can test if the data fits the regression model by calculating the \(p\)-value:
\(H_0\) : The data does not fit the regression model
\(H_A\) : The data does fit the regression model
fStat <- transferDataSummary$fstatistic[1]
df2 <- length(transferData$meanDuty)-2
pRegres <- pf(q = fStat,1,df2,lower.tail = FALSE)\(p\) = 2.116985410^{-98}
\(p\)-value returned is very small, <0.0001. As this is less than the of 0.05 level of significance the null hypothesis can be rejected. The result was statistically significant and shows that that the data fits the regression model
Test the slope
Value returned for the slope was 0.979
To test the slope the following test is used:
\(H_0\) : \(\beta\) = 0
\(H_A\) : \(\beta\) \(\neq\) 0
Calculating the \(p\)-value for the slope:
\(p\) = 2.116985410^{-98}
\(p\)-value returned is very small, <0.0001. As this is less than the of 0.05 level of significance the null hypothesis can be rejected. The result was statistically significant and shows that that the data is positively related.
Test the correlation
Using the data we can calculate the Pearson correlation coefficient or \(r\)-value:
\(r\) = 0.7101735
This result is close to +1 indicating a strong positive relationship. To test the \(r\)-value the following test is used:
\(H_0\) : \(r\) = 0
\(H_A\) : \(r\) \(\neq\) 0
First the test statistic, \(t\), needs to be calculated using the following formula:
\(t\) = \(r\sqrt{\frac{n-2}{1-r^2}}\)
\(t\) = 25.3590998
With \(t\) calculated now we can find \(p\):
## [1] 2.116985e-98
\(p\)-value <0.0001. As this is less than the of 0.05 level of significance the null hypothesis can be rejected. The result was statistically significant and shows that that there is a positive relationship between mean duty paid and mean income.
The graphical representation of the data indicated that there was a positive relationship between transfer duty paid with income earned. This was further confirmed when presented with a correlation result of 0.710. The subsequent hypothesis testing result was statistically significant and shows that that there is a positive relationship between mean duty paid and mean income.
One of the weaknesses was that the data makes no distinction between owner occupied properties and investment properties. I’m also fairly confident that the data as presented also included transfers paid on commercial properties, this is an assumption on my part, I had sought confirmation from the State Revenue Office but by the time this assignment ‘went to print’ I had yet to hear from them. I base this assumption on the result for Derrimut, mean duty-$318,991.15 / mean income-$84,049.15. Derimut is a largely industrial area located approx. 17km west of Melbourne, a mean duty figure of ~$319k would lead me to believe that these are duties paid on large industrial block as opposed to residential dwellings. 2016 census data from the Australian Bureau of Statistics indicates median weekly earnings amount of $752 (~$39k annualised). This result also leads me to believe that the ATO data is included income derived from businesses in the area as well as individuals.
The issues as discussed above would impact the results, for example there is a high probability the owner of an investment property will not reside in the same suburb as the investment property, this is even more likely with commercial properties.
If the investigation was to be refined it would need the input/cooperation of both the Australian Tax Office and the State Revenue Office to further refine the data as the data provided on a open level does not allow any further refinement. It would also be interesting to repeat the exercise for other Australian states.
In saying this the results do seem fairly conclusive in regards to the positive relationship between the two sets of data. When people earn more they tend to upgrade their living arrangements, i.e. more income → more expensive property → higher amount of duty paid.
2016 Census QuickStats - Derrimut, viewed 24 October 2019, https://quickstats.censusdata.abs.gov.au/census_services/getproduct/census/2016/quickstat/SSC20737.
Data Sources
2016-17 income year - all individuals, viewed 24 October 2019,
https://data.gov.au/dataset/ds-dga-5fa69f19-ec44-4c46-88eb-0f6fd5c2f43b/distribution/dist-dga-a05364a6-b127-4c08-8edf-e6e96ec9e1be/details?q=taxation%20postcode
Land Transfer Duty by Postcode 2016-17, viewed 24 October 2019,
https://discover.data.vic.gov.au/dataset/land-transfer-duty-by-postcode-2016-17