Michelle Yvette Jansen S3870634
25 October 2020 (https://rpubs.com/cybecom/aa-math1324-ass2)
Having lived in two major cities (Perth and now Brisbane), I have been fascinated by the rivalry between communities separated by a river. Liveability, lifestyle, housing and so on all factor into the argument. A search of the Australian media reveals several articles debating this, including ‘North vs south - which area is best?’, ‘A city divided’ and ‘Why north of the Brisbane River is officially more liveable than the south’.
Exactly which parameters define ‘the best location - north or south’ is open to debate, but I have chosen to explore just one, mean salary, to see whether there is evidence of a difference between the mean salaries of those living north and those living south of the river. In this investigation, I focus on Brisbane.
Although the debate is often light-hearted, human psychology underpins it, and it ultimately affects areas such as real estate values and government spending.
To perform this analysis, I used salary information from the Australian Taxation Office for the 2017-18 year, along with postcode breakdowns for suburbs in the Brisbane Local Government Area (LGA) that fall north and south of the river.
The aim is to discover whether there is a statistically significant difference between the salaries of Brisbane income earners living north of the river and those living south of the river.
Two approaches will be used:
1. Review and report summary statistics of the 2017-18 population data
2. Simulate random samples and perform a two-sample independent t-test on the north and south groups
Photo by Johann Walter Bantz on Unsplash
Taxation Statistics 2017-18 - Individuals - Table 25 from the Australian Taxation Office. Full population data, including salary income data for all 2017-18 taxpayers. Information and the dataset are available via data.gov.au (Australian Taxation Office, 2020). License: Creative Commons Attribution 2.5 Australia
Only the following two variables were used from the data set of 34 variables: Postcode and Average salary or wages.
The postcodes of interest relate to the Brisbane Local Government Area (LGA). Google Maps (Google, n.d.) and Epstein’s table (Epstein, 2014) were used to work out the relevant ranges:
4000 - 4078 (North of the river)
4101 - 4179, 4300, 4303, 4306, 4312, 4500, 4501, 4503, 4520 (South of the river)
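As a minimal sketch, the two ranges listed above could be encoded as a small lookup function. The name `classify_postcode` and the choice to return `NA` for postcodes outside both lists are assumptions made for illustration. They are not part of the original analysis, which splits the data with a simpler numeric cut-off in the code that follows.

```r
# Hypothetical helper (not from the original report): label a postcode as
# "North" or "South" using exactly the ranges listed in the text above.
classify_postcode <- function(postcode) {
  # Individual south-side postcodes listed after the 4101-4179 range
  south_extra <- c(4300, 4303, 4306, 4312, 4500, 4501, 4503, 4520)
  is_north <- postcode >= 4000 & postcode <= 4078
  is_south <- (postcode >= 4101 & postcode <= 4179) | postcode %in% south_extra
  # Anything in neither list is left as NA (an assumption of this sketch)
  ifelse(is_north, "North", ifelse(is_south, "South", NA_character_))
}

classify_postcode(c(4013, 4121, 4300, 4344))
```

Under these listed ranges, 4013 is labelled North, 4121 and 4300 South, and 4344 falls in neither list.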
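library(readr) # read_csv
library(dplyr) # select, sample_n, summarise, group_by
library(car)   # qqPlot, leveneTest
earnings <- read_csv("ts18individual25countaveragemedianbypostcode.csv")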
earnings2 <- select(earnings, `Postcode`,`Average salary or wages`)
str(earnings2)
## tibble [2,470 x 2] (S3: tbl_df/tbl/data.frame)
## $ Postcode : num [1:2470] 800 810 812 820 822 828 829 830 832 835 ...
## $ Average salary or wages: num [1:2470] 76428 67931 67487 77029 46538 ...
qld_pc <- subset(earnings2, earnings2$Postcode > 3999 & earnings2$Postcode < 4521) # get QLD codes
# Now split data out to be north or south of the river.
# Note: the cut-off used here is 4072/4073, and every remaining postcode up to 4520 is treated as south.
qld_pc_north <- subset(qld_pc, qld_pc$Postcode > 3999 & qld_pc$Postcode < 4073)
qld_pc_south <- subset(qld_pc, qld_pc$Postcode > 4072)
set.seed(2300) # check random sample of 5 obs - north
anom_check_north <- sample_n(qld_pc_north, 5)
str(anom_check_north)
## tibble [5 x 2] (S3: tbl_df/tbl/data.frame)
## $ Postcode : num [1:5] 4013 4030 4019 4034 4018
## $ Average salary or wages: num [1:5] 65644 70599 56609 60037 56645
set.seed(2301) # check random sample of 5 obs - south
anom_check_south <- sample_n(qld_pc_south, 5)
str(anom_check_south)
## tibble [5 x 2] (S3: tbl_df/tbl/data.frame)
## $ Postcode : num [1:5] 4165 4301 4344 4121 4073
## $ Average salary or wages: num [1:5] 61539 49826 53404 71967 74018
qld_north_stats <- summarise(qld_pc_north,
mean_north_salary = mean(qld_pc_north$`Average salary or wages`, na.rm = TRUE),
sd_north_salary = sd(qld_pc_north$`Average salary or wages`, na.rm = TRUE),
min_north_salary = min(qld_pc_north$`Average salary or wages`, na.rm = TRUE),
max_north_salary = max(qld_pc_north$`Average salary or wages`, na.rm = TRUE))
qld_south_stats <- summarise(qld_pc_south,
mean_south_salary = mean(qld_pc_south$`Average salary or wages`, na.rm = TRUE),
sd_south_salary = sd(qld_pc_south$`Average salary or wages`, na.rm = TRUE),
min_south_salary = min(qld_pc_south$`Average salary or wages`, na.rm = TRUE),
max_south_salary = max(qld_pc_south$`Average salary or wages`, na.rm = TRUE))
knitr::kable(qld_north_stats)

| mean_north_salary | sd_north_salary | min_north_salary | max_north_salary |
|---|---|---|---|
| 66957.38 | 9708.287 | 49014 | 88086 |

knitr::kable(qld_south_stats)

| mean_south_salary | sd_south_salary | min_south_salary | max_south_salary |
|---|---|---|---|
| 54003.04 | 8095.449 | 35689 | 91532 |
## [1] 12954.35
The population data show a true difference in means of $12,954.35: across postcodes north of the Brisbane River the mean of the postcode-level average salaries is $66,957.38, while across postcodes south of the river it is $54,003.04.
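The code that printed the difference is not shown in this report. As a rough check, the subtraction can be repeated from the two means as they appear in the summary tables. The variable names below are illustrative. In the full analysis the calculation would presumably use the unrounded columns of `qld_north_stats` and `qld_south_stats`, which is an assumption.

```r
# Means copied from the summary tables (already rounded to 2 decimal places)
mean_north <- 66957.38
mean_south <- 54003.04

# Difference in means, north minus south
diff_means <- mean_north - mean_south
round(diff_means, 2)
```

Working from the rounded table values gives 12954.34, one cent below the printed 12954.35, which is consistent with the printed figure having been computed from unrounded means.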
par(mai=c(1,1,0,1),mar=c(5,2,1,0)+0.1,oma=c(1,1,1,1),mfrow = c(2, 2))
hist(qld_pc_north$`Average salary or wages`, xlab = "Average Salary North", main ="Histogram of North of River Salaries", col = "blue")
hist(qld_pc_south$`Average salary or wages`, xlab = "Average Salary South", main ="Histogram of South of River Salaries", col = "green")
qqPlot(qld_pc_north$`Average salary or wages`, ylab = "Average Salary North", main ="QQ Plot of North of River Salaries", lwd=1, col="red")
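qqPlot(qld_pc_south$`Average salary or wages`, ylab = "Average Salary South", main ="QQ Plot of South of River Salaries", lwd=1, col="red")
par(mfrow = c(1, 1))
# combined_pop is not defined in the code shown; assumed here to be both groups stacked with a Location label
combined_pop <- bind_rows(mutate(qld_pc_north, Location = "North"), mutate(qld_pc_south, Location = "South"))
boxplot(`Average salary or wages`~Location,data=combined_pop, main="Brisbane LGA Salary Data",
xlab="Side of the River they Dwell", ylab="Salary", col="cyan")
set.seed(2401) # so we get the same random sample each code run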
northsample <- sample_n(qld_pc_north, 31) # dplyr - sample 31 north data points > 30
set.seed(2601) # so we get the same random sample each code run
southsample <- sample_n(qld_pc_south, 31) # dplyr - sample 31 south data points > 30
# Perform a little magic to create sample data with a location column
northsample$Location <- "North"
southsample$Location <- "South"
# Let's merge the two data sets
combined_samp <- rbind(northsample, southsample)
combined_samp$Location <- as.factor(combined_samp$Location)
str(combined_samp)
## tibble [62 x 3] (S3: tbl_df/tbl/data.frame)
## $ Postcode : num [1:62] 4014 4010 4035 4007 4034 ...
## $ Average salary or wages: num [1:62] 65401 66690 69533 82161 60037 ...
## $ Location : Factor w/ 2 levels "North","South": 1 1 1 1 1 1 1 1 1 1 ...
combined_samp %>% group_by(Location) %>% summarise(Min = min(`Average salary or wages`,na.rm = TRUE),
Q1 = quantile(`Average salary or wages`,probs = .25,na.rm = TRUE),
Median = median(`Average salary or wages`, na.rm = TRUE),
Q3 = quantile(`Average salary or wages`,probs = .75,na.rm = TRUE),
Max = max(`Average salary or wages`,na.rm = TRUE),
Mean = mean(`Average salary or wages`, na.rm = TRUE),
SD = sd(`Average salary or wages`, na.rm = TRUE),
n = n(),
Missing = sum(is.na(`Average salary or wages`))) -> samp_summary
knitr::kable(samp_summary)

| Location | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| North | 49014 | 61162.5 | 66690 | 75303.5 | 88086 | 68223.45 | 10085.501 | 31 | 0 |
| South | 35689 | 47731.0 | 52401 | 56379.5 | 74018 | 52459.39 | 8285.363 | 31 | 0 |
par(mai=c(1,1,0,1),mar=c(5,2,1,0)+0.1,oma=c(1,1,1,1),mfrow = c(2, 2))
hist(northsample$`Average salary or wages`, xlab = "Average Salary North", main ="Histogram of North of River Salaries", col = "blue")
hist(southsample$`Average salary or wages`,xlab = "Average Salary South", main ="Histogram of South of River Salaries", col = "green")
qqPlot(northsample$`Average salary or wages`, ylab = "Average Salary North", main ="QQ Plot of North of River Salaries", lwd=1, col="red")
qqPlot(southsample$`Average salary or wages`, ylab = "Average Salary South", main ="QQ Plot of South of River Salaries", lwd=1, col="red")
The sample size in each group (n = 31) is large enough for the central limit theorem to apply, and the graphs also suggest approximately normal distributions.
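The reliance on sample size can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original analysis: the population is an artificial right-skewed exponential distribution with mean 1 (not the ATO data), and the seed and the number of replicates are arbitrary.

```r
set.seed(42)        # arbitrary seed, for reproducibility only
n_per_sample <- 31  # same group size as used in this report
n_reps <- 5000      # number of simulated samples (assumed)

# Draw repeated samples from a skewed population and keep each sample mean
sample_means <- replicate(n_reps, mean(rexp(n_per_sample, rate = 1)))

# The sample means should centre near 1 with spread near 1 / sqrt(31)
c(mean = mean(sample_means),
  sd = sd(sample_means),
  theory_sd = 1 / sqrt(n_per_sample))

# The histogram of the means is far closer to a bell shape than the population
hist(sample_means, main = "Means of 5000 samples of size 31", xlab = "Sample mean")
```

Even though individual draws are strongly skewed, the distribution of the sample means is approximately normal, which is the property the t-test relies on.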
Let’s check for homogeneity of variance. We use the following statistical hypotheses:
\[H_0: \sigma_1^2 = \sigma_2^2 \] \[H_A: \sigma_1^2 \ne \sigma_2^2 \] where σ₁² and σ₂² refer to the population variances of the north of the river and south of the river groups respectively.
To calculate this, we can use the leveneTest function from the car package in R.
leveneTest(combined_samp$`Average salary or wages` ~ combined_samp$Location, data = combined_samp) # Do a Levene test
As the p-value is 0.248, which is greater than 0.05, we fail to reject H0 and can assume equal variances.
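To show what the test computes, here is a minimal base R sketch. It assumes the default setting of car's `leveneTest`, which centres each group on its median (the Brown-Forsythe variant). The data are simulated stand-ins, not the report's samples, so the p-value will not match 0.248.

```r
set.seed(2020)  # arbitrary seed

# Simulated stand-in data (NOT the ATO figures): two groups of 31 values
salary <- c(rnorm(31, mean = 68000, sd = 10000),
            rnorm(31, mean = 52000, sd = 8000))
location <- factor(rep(c("North", "South"), each = 31))

# Absolute deviation of each value from the median of its own group
abs_dev <- abs(salary - ave(salary, location, FUN = median))

# A one-way ANOVA on those deviations; its F test is the Levene statistic
levene_by_hand <- anova(lm(abs_dev ~ location))
levene_by_hand
```

A large p-value in the first row means the typical deviation from the group median is similar in both groups, which is what "equal variances" means in practice here.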
Now that our variances have been considered, we can study the difference in means so our null and alternate hypotheses are as follows:
\[H_0: \mu_1 - \mu_2 = 0 \] \[H_A: \mu_1 - \mu_2 \ne 0\] where μ1 and μ2 refer to the population means of the north of the river and south of the river groups.
In R, we now apply a two-sample t-test to test whether the ‘unknown’ population mean of the north data set is equal to the ‘unknown’ population mean of the south data set. (Remember we are simulating random sampling and are ‘pretending’ we don’t know the true population means.)
t.test(
combined_samp$`Average salary or wages` ~ combined_samp$Location, data = combined_samp, var.equal = TRUE, alternative = "two.sided")
##
## Two Sample t-test
##
## data: combined_samp$`Average salary or wages` by combined_samp$Location
## t = 6.7245, df = 60, p-value = 7.368e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 11074.81 20453.32
## sample estimates:
## mean in group North mean in group South
## 68223.45 52459.39
The p-value is far less than 0.05, so we reject the null hypothesis: there is statistically significant evidence that the mean salary of income earners living north of the Brisbane River is not equal to the mean salary of income earners living south of the Brisbane River.
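The reported t statistic and confidence interval can be cross-checked by hand from the sample summary table, using the standard pooled-variance formulas for an equal-variance two-sample t-test. This is a sketch of that check. The variable names are illustrative, and the inputs are the rounded means and standard deviations from the table, so tiny rounding differences from the `t.test` output are expected.

```r
# Summary statistics copied from the sample summary table in this report
n1 <- 31; mean1 <- 68223.45; sd1 <- 10085.501  # North
n2 <- 31; mean2 <- 52459.39; sd2 <- 8285.363   # South

# Pooled variance and standard error of the difference in means
df <- n1 + n2 - 2
pooled_var <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / df
se_diff <- sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic, two-sided p-value and 95% confidence interval
t_stat <- (mean1 - mean2) / se_diff
p_value <- 2 * pt(-abs(t_stat), df)
ci <- (mean1 - mean2) + c(-1, 1) * qt(0.975, df) * se_diff

round(c(t = t_stat, df = df, ci_lower = ci[1], ci_upper = ci[2]), 2)
p_value
```

The hand calculation agrees with the output above (t of about 6.72 on 60 degrees of freedom, with an interval of roughly 11075 to 20453), which confirms that the test was run with pooled variances.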
With 31 observations in each group (n > 30), the central limit theorem supports applying the t-test even if the underlying salary distributions are not exactly normal.
Levene’s test of homogeneity of variance gave no evidence against equal variances (p = 0.248).
The two-sample t-test assuming equal variances found a statistically significant difference between the mean salary of income earners living north of the Brisbane River and that of income earners living south of it: t(60) = 6.72, p < .001, 95% CI for the difference in means [11074.81, 20453.32].
In reality, because the full population data are available, sampling is not necessary, and the population data can be analysed directly for an exact result. However, for the purpose of this investigation, it was useful to compare the two approaches and confirm that their results agree.
It should be noted that the postcode split produced only 39 postcodes north of the river but 206 south of it. A random sample of 31 postcodes was taken for each location, so the north sample covers 31 of the 39 north postcodes (about 79%), while the south sample covers only about 15% of the south postcodes.
Further analysis could look at other aspects of salary distribution, including the top and bottom postcodes on each side of the river, not just for Brisbane but for other Australian cities too.
The results of the sample analyses suggest that the mean salaries of those living north of the Brisbane River in the Brisbane LGA are higher than the mean salaries of those living south of the river, which corresponds to the result obtained from the population data earlier in this investigation.
Photo by Piqsels. Free for personal and commercial use. Boxing Winner
Australian Taxation Office (2020). Taxation Statistics 2017-18. [online] Data.gov.au. Available at: https://data.gov.au/data/dataset/23b8c299-a85b-4fc0-a07d-5ed14e23a103/resource/343f1d18-067b-44ee-b7b3-1b04c4872b86/download/ts18individual25countaveragemedianbypostcode.csv [Accessed 25 Oct. 2020].
Epstein, J. (2014). Australian LGA to postcode mappings with PostGIS and Intersects | GreenAsh. [online] greenash.net.au. Available at: https://greenash.net.au/thoughts/2014/07/australian-lga-to-postcode-mappings-with-postgis-and-intersects/ [Accessed 22 Oct. 2020].
Google (n.d.). Google Maps. [online] Google Maps. Available at: https://www.google.com/maps/@-27.4197998 [Accessed 23 Oct. 2020].
Jansen, M. (2020). MATH2349 Assignment 2 Data Wrangling. [Assignment] Available at: https://rpubs.com/cybecom/dw-ass2-Oct2020 [Accessed 23 Oct. 2020].
Rodd, J. (1996). Pareto’s law of income distribution, or the 80/20 rule. International Journal of Nonprofit and Voluntary Sector Marketing, 1(1), pp.77–89.
Shubham (2016). r - Add empty columns to a dataframe with specified names from a vector. [online] Stack Overflow. Available at: https://stackoverflow.com/questions/18214395/add-empty-columns-to-a-dataframe-with-specified-names-from-a-vector [Accessed 25 Oct. 2020].