Intro to Stats - Assginemnt 3

An Investigation Into the Correlation Between Income Earned and Transfer Duty Paid in Victoria

Constantine Makris - s3498940

Last updated: 24 October, 2019

Introduction

When property is transferred between individuals (or companies) duty is paid to the state government where that property is situated. The main factors in determining how much duty is payable is the sale price of the property as well it’s intended use, the exact formula being beyond the scope of this investigation.

This investigation aims to determine if there is a correlation between the average duty paid per postcode and the average income of individuals in that particular postcode. This investigation will only look at postcodes in Victoria.

The data you used for this investigation have been sourced from both the Federal and State Governments via their respective open data portals. Data on duty paid was originally compiled the State Revenue Office of Victoria while income date was originally compiled from the Australian Tax Office records. Data used was for the financial year ended June 2017. This was the most UpToDate data available for both data sets.

Problem Statement

Data

download.file('https://data.gov.au/data/dataset/5fa69f19-ec44-4c46-88eb-0f6fd5c2f43b/resource/a05364a6-b127-4c08-8edf-e6e96ec9e1be/download/ts17individual06taxablestatusstatepostcode.xlsx',
              'ts17individual06taxablestatusstatepostcode.xlsx',mode="wb")

taxData <- read_xlsx("ts17individual06taxablestatusstatepostcode.xlsx",sheet = "Individuals Table 6B")
transferData <- read.csv('https://www.sro.vic.gov.au/ckfinder/userfiles/files/Land%20Transfer%20Duty%20By%20Postcode_2016-17.csv',stringsAsFactors = FALSE)

From both datasets ‘postcode’ was used as the key to join both sets of data.

Data (cont.)

The data set for transfer duty only contained two additional fields:

     • Record_Count
     • Duty.Paid

Both these fields were used to calculate the mean duty paid for each postcode

The data set for taxation information contained 152 fields of which only three were used for this investigation:

     • Postcode (Key)
     • Taxable income or loss3 no.
     • Taxable income or loss3 $

These fields were used to calculate the mean income for each postcode.

Data (cont.)

Pre-processing

The were many steps involved in getting the data into a state where it could be used to undertake the required testing. First the date sets where loaded into two dataframes, being taxData and transferData.

Firstly the tax information needed to be cleaned up as there were many fields and rows that were not required.

taxData <- taxData[c(2,4:5)] #Remove columns not required
colnames(taxData) <- c('Postcode','TaxableIncome.No','TaxableIncome.Amount') #Assign userfreindly field names
taxData <- taxData[-c(1:2),] #remove redundant rows at top of dataframe

The two dataframes were then joined. Row 635 removed as it was for data relating to transfers were the postcode was unknown.

transferData <- inner_join(transferData,taxData,by='Postcode') #join the dataframes
transferData <-  transferData[-c(635),] #remove row 635

Data (cont.)

The data in both data frames were in char format and needed to be converted to numeric, data also contained commas which needed to be removed prior to conversion.

transferData$Record_Count <- gsub(",","",transferData$Record_Count) %>% as.numeric() #remove commas and convert to numeric
transferData$Duty.Paid <- gsub(",","",transferData$Duty.Paid) %>% as.numeric()
transferData$TaxableIncome.No <- gsub(",","",transferData$TaxableIncome.No) %>% as.numeric()
transferData$TaxableIncome.Amount <- gsub(",","",transferData$TaxableIncome.Amount) %>% as.numeric()

Finally, mean values calculated and columns not required for final analysis removed

transferData <- transferData %>% mutate(meanDuty=Duty.Paid/Record_Count,meanIncome=TaxableIncome.Amount/TaxableIncome.No) #calculate mean values
transferData <- transferData[-c(2:5)] #remove redundant columns

Descriptive Statistics and Visualisation

Attributes of each dataset as follows:

stats <- c('Mean Transfer Duty','Mean Income') %>% as.data.frame()
colnames(stats) <- c('Dataset')
stats$min <- c(min(transferData$meanDuty),min(transferData$meanIncome))
stats$Q1 <- c(quantile(transferData$meanDuty,probs = .25),quantile(transferData$meanIncome,probs = .25))
stats$median <- c(median(transferData$meanDuty),median(transferData$meanIncome))
stats$Q3 <- c(quantile(transferData$meanDuty,probs = .75),quantile(transferData$meanIncome,probs = .75))
stats$max <- c(max(transferData$meanDuty),max(transferData$meanIncome))
stats$mean <- c(mean(transferData$meanDuty),mean(transferData$meanIncome))
stats$sd <- c(sd(transferData$meanDuty),sd(transferData$meanIncome))
n <- transferData$meanDuty %>% length()

knitr::kable(stats)
Dataset min Q1 median Q3 max mean sd
Mean Transfer Duty 1128.667 9022.157 15031.75 29753.91 318991.1 22513.96 22505.53
Mean Income 26589.569 43417.791 49017.35 56713.59 196386.9 52990.95 16322.08

Descriptive Statistics and Visualisation (cont.)

Correlation stats:

transferDataModel <- lm(meanDuty ~ meanIncome, data = transferData)
transferDataSummary <- transferDataModel %>% summary()
transferDataCoef <- transferDataSummary %>% coef()
transferDataSummary
## 
## Call:
## lm(formula = meanDuty ~ meanIncome, data = transferData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33759  -7211  -2556   3984 266065 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.938e+04  2.141e+03  -13.72   <2e-16 ***
## meanIncome   9.792e-01  3.861e-02   25.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15860 on 632 degrees of freedom
## Multiple R-squared:  0.5043, Adjusted R-squared:  0.5036 
## F-statistic: 643.1 on 1 and 632 DF,  p-value: < 2.2e-16

Descriptive Statistics and Visualisation (cont.)

From the above summary we can derive the following important statistical information:

\(F\)-statistic = 643.084
\(a\) (intecept) = -29375.58
\(b\) (slope) = 0.979

\(t\)-value (intercept) = -13.721
\(t\)-value (slope) = 25.359

Using the derived statistics we can plot a scatter plot and regression line to visually present the relationship as well as have the information required to conduct our hypothesis testing

Descriptive Statistics and Visualisation (cont.)

a <- transferDataCoef[1,1] %>% round(3) #a
b <- transferDataCoef[2,1] %>% round(3) #b
par(cex= 0.5,cex.axis = 1.4,cex.lab = 1.4, cex.main = 1.8,cex.sub = 1.4,fig=c(0, 1, 0.4, 1),pin=c(5,3))
plot(meanDuty ~ meanIncome,main="Mean Income Earned vs Mean Duty Paid - Vic", data = transferData, xlab = "Mean Income", ylab = "Mean Duty Paid")
abline(a = a, b = b, col= "red",lwd = 3)

Hypothesis Testing

Using hypothesis we can test the statistical significance of the correlation, the constant and the slope.

Test the overall regression model

Using the \(F\)-Statistic we can test if the data fits the regression model by calculating the \(p\)-value:

     \(H_0\) : The data does not fit the regression model
     \(H_A\) : The data does fit the regression model

fStat <- transferDataSummary$fstatistic[1]
df2 <- length(transferData$meanDuty)-2
pRegres <- pf(q = fStat,1,df2,lower.tail = FALSE)

\(p\) = 2.116985410^{-98}

\(p\)-value returned is very small, <0.0001. As this is less than the of 0.05 level of significance the null hypothesis can be rejected. The result was statistically significant and shows that that the data fits the regression model

Hypothesis Testing cont.

Test the slope

Value returned for the slope was 0.979

To test the slope the following test is used:

     \(H_0\) : \(\beta\) = 0
     \(H_A\) : \(\beta\) \(\neq\) 0

Calculating the \(p\)-value for the slope:

PSlope <- 2*pt(q = transferDataCoef[2,3],df = length(transferData$meanDuty)-2, lower.tail=FALSE)

\(p\) = 2.116985410^{-98}

\(p\)-value returned is very small, <0.0001. As this is less than the of 0.05 level of significance the null hypothesis can be rejected. The result was statistically significant and shows that that the data is positively related.

Hypothesis Testing Cont.

Test the correlation

Using the data we can calculate the Pearson correlation coefficient or \(r\)-value:

r <- cor(transferData$meanDuty,transferData$meanIncome)

\(r\) = 0.7101735

This result is close to +1 indicating a strong positive relationship. To test the \(r\)-value the following test is used:

     \(H_0\) : \(r\) = 0
     \(H_A\) : \(r\) \(\neq\) 0

First the test statistic, \(t\), needs to be calculated using the following formula:

\(t\) = \(r\sqrt{\frac{n-2}{1-r^2}}\)

\(t\) = 25.3590998

Hypothesis Testing Cont.

With \(t\) calculated now we can find \(p\):

 2*pt(q = r*sqrt((n-2)/(1-r^2)),df = n - 2,lower.tail=FALSE)
## [1] 2.116985e-98

\(p\)-value <0.0001. As this is less than the of 0.05 level of significance the null hypothesis can be rejected. The result was statistically significant and shows that that there is a positive relationship between mean duty paid and mean income.

Discussion

The graphical representation of the data indicated that there was a positive relationship between transfer duty paid with income earned. This was further confirmed when presented with a correlation result of 0.710. The subsequent hypothesis testing result was statistically significant and shows that that there is a positive relationship between mean duty paid and mean income.

One of the weaknesses was that the data makes no distinction between owner occupied properties and investment properties. I’m also fairly confident that the data as presented also included transfers paid on commercial properties, this is an assumption on my part, I had sought confirmation from the State Revenue Office but by the time this assignment ‘went to print’ I had yet to hear from them. I base this assumption on the result for Derrimut, mean duty-$318,991.15 / mean income-$84,049.15. Derimut is a largely industrial area located approx. 17km west of Melbourne, a mean duty figure of ~$319k would lead me to believe that these are duties paid on large industrial block as opposed to residential dwellings. 2016 census data from the Australian Bureau of Statistics indicates median weekly earnings amount of $752 (~$39k annualised). This result also leads me to believe that the ATO data is included income derived from businesses in the area as well as individuals.

The issues as discussed above would impact the results, for example there is a high probability the owner of an investment property will not reside in the same suburb as the investment property, this is even more likely with commercial properties.

Discussion cont.

If the investigation was to be refined it would need the input/cooperation of both the Australian Tax Office and the State Revenue Office to further refine the data as the data provided on a open level does not allow any further refinement. It would also be interesting to repeat the exercise for other Australian states.

In saying this the results do seem fairly conclusive in regards to the positive relationship between the two sets of data. When people earn more they tend to upgrade their living arrangements, i.e. more income → more expensive property → higher amount of duty paid.

References

2016 Census QuickStats - Derrimut, viewed 24 October 2019, https://quickstats.censusdata.abs.gov.au/census_services/getproduct/census/2016/quickstat/SSC20737.

Data Sources

2016-17 income year - all individuals, viewed 24 October 2019,
https://data.gov.au/dataset/ds-dga-5fa69f19-ec44-4c46-88eb-0f6fd5c2f43b/distribution/dist-dga-a05364a6-b127-4c08-8edf-e6e96ec9e1be/details?q=taxation%20postcode

Land Transfer Duty by Postcode 2016-17, viewed 24 October 2019,
https://discover.data.vic.gov.au/dataset/land-transfer-duty-by-postcode-2016-17