Introduction

Having lived in two major cities (Perth and now Brisbane), I have long been fascinated by the rivalry between communities separated by a river. Liveability, lifestyle, housing and so on all factor into the argument. A search of the Australian media reveals several articles debating this: ‘North vs south - which area is best?’, ‘A city divided’ and ‘Why north of the Brisbane River is officially more liveable than the south’.

Exactly what parameters define ‘the best location - north or south’ is open to debate, but I have chosen to explore just one, mean salary, to see whether there is evidence of a difference between the mean salaries of those living north and those living south of the river. In this investigation, I focus on Brisbane.

Although the analysis may garner amusement, human psychology underpins the debate and ultimately influences areas such as real estate values and government spending.

To perform this analysis, I used salary information from the Australian Taxation Office for the 2017-18 year, along with postcode breakdowns for suburbs in the Brisbane Local Government Area (LGA) that fall north and south of the river.

Problem Statement

To discover if there is a statistically significant difference between the salaries of Brisbane income earners living north of the river compared with those who live south of the river.

Two approaches will be used:

Review and report summary statistics of the 2017-18 population data
Simulate random samples and perform a two-sample independent t-Test on the north and south groups

Photo by Johann Walter Bantz on Unsplash

Data

Taxation Statistics 2017-18 - Individuals - Table 25 from the Australian Taxation Office. Full population data: includes salary income data for all 2017-18 taxpayers. Information and the dataset are available from data.gov.au (Australian Taxation Office, 2020). License: Creative Commons Attribution 2.5 Australia

Collection and sampling

Full data set means hypothesis testing is not strictly necessary, but both population and sample analyses have been done for comparison
Two-sample independent t-Test performed on random samples from the population. Uses the sample_n function (dplyr) in R.
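Drawing a random sample of rows can also be done in base R. The sketch below uses a small invented data frame (`toy`, with made-up values) purely to show the mechanics; it is not the ATO data, and the seed is only there for reproducibility.

```r
# Toy data frame standing in for the postcode table (values are invented).
toy <- data.frame(Postcode = 4000:4009,
                  Salary   = seq(50000, 59000, by = 1000))

set.seed(2300)                                # reproducible draw
sampled_rows <- sample(nrow(toy), size = 5)   # 5 row indices, without replacement
toy_sample   <- toy[sampled_rows, ]
nrow(toy_sample)
```

Sampling without replacement means no postcode can appear twice in the same sample.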

Variables

Only the following two used from the data set of 34 variables:

Postcode Postal area code of income area (four-digit number)
Average salary or wages Amount ($) of the average salary or wages per person living in that postcode.

Postcodes

Relate to Brisbane Local Government Area (LGA). Google Maps (Google, n.d.) and Epstein’s table used to work out relevant ranges (Epstein, 2014):

4000 - 4078 (North of the river)
4101 - 4179, 4300, 4303, 4306, 4312, 4500, 4501, 4503, 4520 (South of the river)
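The ranges listed above can be written as a small classification helper. This is a minimal base-R sketch under the assumption that the ranges are applied exactly as listed; the function name `classify_postcode` is hypothetical and is not part of the analysis code in this investigation.

```r
# Hypothetical helper: label a postcode "North" or "South" using the ranges
# listed above; anything outside those ranges becomes NA.
classify_postcode <- function(postcode) {
  south_extra <- c(4300, 4303, 4306, 4312, 4500, 4501, 4503, 4520)
  is_north <- postcode >= 4000 & postcode <= 4078
  is_south <- (postcode >= 4101 & postcode <= 4179) | postcode %in% south_extra
  ifelse(is_north, "North", ifelse(is_south, "South", NA_character_))
}

classify_postcode(c(4000, 4078, 4101, 4300, 4350))
# "North" "North" "South" "South" NA
```

Returning NA for unlisted postcodes makes it easy to drop them before any summary is calculated.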

Read and tidy data

library(readr)   # read_csv()
library(dplyr)   # select(), sample_n(), summarise()
earnings <- read_csv("ts18individual25countaveragemedianbypostcode.csv")
earnings2 <- select(earnings, `Postcode`, `Average salary or wages`)
str(earnings2)

## tibble [2,470 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Postcode               : num [1:2470] 800 810 812 820 822 828 829 830 832 835 ...
##  $ Average salary or wages: num [1:2470] 76428 67931 67487 77029 46538 ...

# Keep postcodes 4000-4520 only
qld_pc <- subset(earnings2, Postcode > 3999 & Postcode < 4521)
# Now split data out to be north or south of the river.
# Note: a single cut-off is used here (4000-4072 north, 4073-4520 south),
# which is coarser than the postcode ranges listed above.
qld_pc_north <- subset(qld_pc, Postcode < 4073)
qld_pc_south <- subset(qld_pc, Postcode > 4072)

# Scan for missing data
qld_pc_north[!complete.cases(qld_pc_north), ]      # 0 missing

qld_pc_south[!complete.cases(qld_pc_south), ]      # 0 missing

Check random observations

set.seed(2300)   # check random sample of 5 obs - north
anom_check_north <- sample_n(qld_pc_north, 5)
str(anom_check_north)

## tibble [5 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Postcode               : num [1:5] 4013 4030 4019 4034 4018
##  $ Average salary or wages: num [1:5] 65644 70599 56609 60037 56645

set.seed(2301)  # check random sample of 5 obs - south
anom_check_south <- sample_n(qld_pc_south, 5)
str(anom_check_south)

## tibble [5 x 2] (S3: tbl_df/tbl/data.frame)
##  $ Postcode               : num [1:5] 4165 4301 4344 4121 4073
##  $ Average salary or wages: num [1:5] 61539 49826 53404 71967 74018

# Looking OK.

North and south of the river data

library(DT)   # datatable()
datatable(qld_pc_north, class = 'compact', rownames = FALSE, filter = "none", options = list(pageLength = 3, scrollX = TRUE))

datatable(qld_pc_south, class = 'compact', rownames = FALSE, filter = "none", options = list(pageLength = 3, scrollX = TRUE))

Summary statistics of population data

qld_north_stats <- summarise(qld_pc_north,
          mean_north_salary = mean(`Average salary or wages`, na.rm = TRUE),
          sd_north_salary = sd(`Average salary or wages`, na.rm = TRUE),
          min_north_salary = min(`Average salary or wages`, na.rm = TRUE),
          max_north_salary = max(`Average salary or wages`, na.rm = TRUE))
qld_south_stats <- summarise(qld_pc_south,
          mean_south_salary = mean(`Average salary or wages`, na.rm = TRUE),
          sd_south_salary = sd(`Average salary or wages`, na.rm = TRUE),
          min_south_salary = min(`Average salary or wages`, na.rm = TRUE),
          max_south_salary = max(`Average salary or wages`, na.rm = TRUE))
knitr::kable(qld_north_stats)

mean_north_salary	sd_north_salary	min_north_salary	max_north_salary
66957.38	9708.287	49014	88086

knitr::kable(qld_south_stats)

mean_south_salary	sd_south_salary	min_south_salary	max_south_salary
54003.04	8095.449	35689	91532

diff_salaries <- qld_north_stats$mean_north_salary - qld_south_stats$mean_south_salary
diff_salaries  # difference in salaries

## [1] 12954.35

The population data show a true difference in means of $12,954.35, with postcodes north of the Brisbane River having a mean average salary of $66,957.38 and postcodes south of the Brisbane River having a mean average salary of $54,003.04.

Visualisation of population data

library(car)   # qqPlot(), leveneTest()
par(mai=c(1,1,0,1),mar=c(5,2,1,0)+0.1,oma=c(1,1,1,1),mfrow = c(2, 2))
hist(qld_pc_north$`Average salary or wages`, xlab = "Average Salary North", main ="Histogram of North of River Salaries", col = "blue")
hist(qld_pc_south$`Average salary or wages`, xlab = "Average Salary South", main ="Histogram of South of River Salaries", col = "green")
qqPlot(qld_pc_north$`Average salary or wages`, ylab = "Average Salary North", main ="QQ Plot of North of River Salaries", lwd=1, col="red")
qqPlot(qld_pc_south$`Average salary or wages`,ylab = "Average Salary South", main ="QQ Plot of South of River Salaries", lwd=1, col="red")

Visualisation of population data continued

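par(mfrow = c(1, 1))
# Combine both groups and label each row with its side of the river
combined_pop <- rbind(mutate(qld_pc_north, Location = "North"),
                      mutate(qld_pc_south, Location = "South"))
boxplot(`Average salary or wages`~Location,data=combined_pop, main="Brisbane LGA Salary Data",
   xlab="Side of the River they Dwell", ylab="Salary", col="cyan")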

Summary and interpretation of population data

north data look approximately normally distributed; south data look a little right-skewed.
boxplot for south data shows a right skew and some outliers
outliers are not unexpected; “Pareto’s law of income distribution” (Rodd, 1996), the 80/20 rule, says that a small proportion of income earners are responsible for a large share of income. For the purposes of this investigation, it makes no sense to manipulate those outliers, as doing so may bias the findings. (Jansen, 2020)

In the Brisbane LGA, those who live north of the river have a higher mean salary than those who live south of the river.

Extract and work on sample data

set.seed(2401) # so we get the same random sample each code run
northsample <- sample_n(qld_pc_north, 31) # dplyr - sample 31 north data points > 30
set.seed(2601) # so we get the same random sample each code run
southsample <- sample_n(qld_pc_south, 31) # dplyr - sample 31 south data points > 30
# Add a Location column to each sample
northsample$Location <- "North"
southsample$Location <- "South"

# Let's merge the two data sets
combined_samp <- rbind(northsample, southsample)
combined_samp$Location <- as.factor(combined_samp$Location)
str(combined_samp)

## tibble [62 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Postcode               : num [1:62] 4014 4010 4035 4007 4034 ...
##  $ Average salary or wages: num [1:62] 65401 66690 69533 82161 60037 ...
##  $ Location               : Factor w/ 2 levels "North","South": 1 1 1 1 1 1 1 1 1 1 ...

Display sample data (2 x 31 observations = 62)

Descriptive statistics on sample data

samp_summary <- combined_samp %>%
  group_by(Location) %>%
  summarise(Min = min(`Average salary or wages`, na.rm = TRUE),
            Q1 = quantile(`Average salary or wages`, probs = .25, na.rm = TRUE),
            Median = median(`Average salary or wages`, na.rm = TRUE),
            Q3 = quantile(`Average salary or wages`, probs = .75, na.rm = TRUE),
            Max = max(`Average salary or wages`, na.rm = TRUE),
            Mean = mean(`Average salary or wages`, na.rm = TRUE),
            SD = sd(`Average salary or wages`, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(`Average salary or wages`)))
knitr::kable(samp_summary)

Location	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
North	49014	61162.5	66690	75303.5	88086	68223.45	10085.501	31	0
South	35689	47731.0	52401	56379.5	74018	52459.39	8285.363	31	0

Notes

random sample taken >= 30 (31 to be precise)
central limit theorem invoked; can assume the sampling distribution of each sample mean is approximately normal
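The central limit theorem note above can be illustrated with a short simulation. This is a sketch under stated assumptions: the population is a synthetic, right-skewed Exponential(1) distribution (not the salary data), and the seed and the number of repetitions are arbitrary choices.

```r
# Simulate the sampling distribution of the mean for a right-skewed population.
set.seed(1)
n_per_sample <- 31      # same sample size as used in this investigation
n_reps       <- 5000    # number of repeated samples (arbitrary choice)

# Exponential(1) population: strongly right-skewed, with mean 1 and sd 1.
sample_means <- replicate(n_reps, mean(rexp(n_per_sample, rate = 1)))

mean(sample_means)   # should be close to the population mean of 1
sd(sample_means)     # should be close to 1 / sqrt(31)
```

Plotting `hist(sample_means)` should show a roughly bell-shaped histogram even though the underlying population is skewed, which is the property the t-test relies on here.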

Visualisation of Sample Data

par(mai=c(1,1,0,1),mar=c(5,2,1,0)+0.1,oma=c(1,1,1,1),mfrow = c(2, 2))
hist(northsample$`Average salary or wages`, xlab = "Average Salary North", main ="Histogram of North of River Salaries", col = "blue")
hist(southsample$`Average salary or wages`,xlab = "Average Salary South", main ="Histogram of South of River Salaries", col = "green")
qqPlot(northsample$`Average salary or wages`, ylab = "Average Salary North", main ="QQ Plot of North of River Salaries", lwd=1, col="red")
qqPlot(southsample$`Average salary or wages`,ylab = "Average Salary South", main ="QQ Plot of South of River Salaries", lwd=1, col="red")

Although the sample size is large enough to rely on the central limit theorem, the graphs also suggest approximately normal distributions.

Hypothesis Testing

Variance

Let’s check for homogeneity of variance. We use the following statistical hypotheses:

\[H_0: \sigma_1^2 = \sigma_2^2 \] \[H_A: \sigma_1^2 \ne \sigma_2^2 \] where \(\sigma_1^2\) and \(\sigma_2^2\) refer to the population variances of the north of the river and south of the river groups.

To test this, we can use the leveneTest function from the car package in R.

leveneTest(combined_samp$`Average salary or wages` ~ combined_samp$Location, data = combined_samp) # Do a Levene test

As p is 0.248, which is greater than 0.05, we fail to reject \(H_0\) and can assume equal variances.
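By default, leveneTest in the car package centres each group on its median, which amounts to a one-way ANOVA on the absolute deviations from the group medians. The base-R sketch below shows that idea on invented toy values; `levene_sketch` is a hypothetical helper name, and the numbers are not taken from the salary data.

```r
# Hypothetical helper: Levene-type test as an ANOVA on the absolute
# deviations of each observation from its group median.
levene_sketch <- function(values, groups) {
  groups    <- as.factor(groups)
  group_med <- ave(values, groups, FUN = median)   # group median for each row
  abs_dev   <- abs(values - group_med)
  anova(lm(abs_dev ~ groups))
}

# Toy data (invented): group B is far more spread out than group A.
toy_values <- c(48, 50, 52, 49, 51, 53, 20, 50, 80, 35, 65, 95)
toy_groups <- rep(c("A", "B"), each = 6)
result <- levene_sketch(toy_values, toy_groups)
result[["F value"]][1]   # test statistic
result[["Pr(>F)"]][1]    # p-value
```

A p-value below 0.05 would lead to rejecting the null hypothesis of equal variances, in which case a Welch t-test (`var.equal = FALSE`) would be the more appropriate follow-up.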

Two-sample t-Test

Now that the variances have been considered, we can study the difference in means. Our null and alternative hypotheses are as follows:

\[H_0: \mu_1 - \mu_2 = 0 \] \[H_A: \mu_1 - \mu_2 \ne 0\] where μ1 and μ2 refer to the population means of the north of the river and south of the river groups.

In R, we now apply a two-sample t-test to test whether the ‘unknown’ population mean of the north data set is equal to the ‘unknown’ population mean of the south data set. (Remember we are simulating random sampling and are ‘pretending’ we don’t know the true population means.)

t.test(
  combined_samp$`Average salary or wages` ~ combined_samp$Location, data = combined_samp, var.equal = TRUE, alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  combined_samp$`Average salary or wages` by combined_samp$Location
## t = 6.7245, df = 60, p-value = 7.368e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  11074.81 20453.32
## sample estimates:
## mean in group North mean in group South 
##            68223.45            52459.39
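As a cross-check, the test statistic and confidence interval can be recomputed by hand from the sample summary statistics reported earlier (means, standard deviations and n = 31 per group) using the pooled-variance formula. This is a sketch that should agree with the t.test output above up to rounding; the variable names are introduced here for illustration only.

```r
# Summary statistics reported for the two samples.
mean_north <- 68223.45; sd_north <- 10085.501; n_north <- 31
mean_south <- 52459.39; sd_south <- 8285.363;  n_south <- 31

# Pooled variance and standard error of the difference in means.
deg_free   <- n_north + n_south - 2
pooled_var <- ((n_north - 1) * sd_north^2 + (n_south - 1) * sd_south^2) / deg_free
se_diff    <- sqrt(pooled_var * (1 / n_north + 1 / n_south))

# t statistic, two-sided p-value and 95% confidence interval.
mean_diff <- mean_north - mean_south
t_stat    <- mean_diff / se_diff
p_value   <- 2 * pt(-abs(t_stat), deg_free)
ci_95     <- mean_diff + c(-1, 1) * qt(0.975, deg_free) * se_diff

round(t_stat, 4)
signif(p_value, 4)
round(ci_95, 2)
```

Working from the summary statistics makes each ingredient of the test visible: the difference in means, its standard error, and the degrees of freedom.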

Discussion

The p-value is far less than 0.05, so there is statistically significant evidence that the mean salary of income earners living north of the Brisbane River is not equal to the mean salary of income earners living south of the Brisbane River.

Because the sample size in each group was large (31, which is > 30), the central limit theorem justified applying the t-test.

Levene’s test of homogeneity of variance gave no evidence against equal variances.

The two-sample t-test assuming equal variances found a statistically significant difference between the mean salary of income earners living north of the Brisbane River and the mean salary of income earners living south of the Brisbane River; t(df=60)=6.7245, p<.001, 95% CI for the difference in means [11074.81, 20453.32].

In reality, because the full population data are available, sampling is not actually necessary and the population data can easily be analysed to find a more accurate result. However, for the purpose of this investigation, it was useful to compare the results and discover that they did correspond with each other.

It should be noted that when the postcode data were split, there were only 39 postcodes north of the river but 206 postcodes south of the river. Although a random sample of 31 postcodes was taken for each location, this represented a large proportion (31 of 39, about 79%) of the original population data set for the north side.

Further analysis could be undertaken to look at other aspects of salary distribution, including the top and bottom postcodes on each side of the river, not just for Brisbane but for other Australian cities too.

Summary

The results of the sample analysis suggest that the mean salary of those living north of the Brisbane River in the Brisbane LGA is higher than that of those living south of the river, which corresponds to the result from the population data analysed earlier in this investigation.

North of the river wins!

Photo by Piqsels. Free for personal and commercial use. Boxing Winner

References

Australian Taxation Office (2020). Taxation Statistics 2017-18. [online] Data.gov.au. Available at: https://data.gov.au/data/dataset/23b8c299-a85b-4fc0-a07d-5ed14e23a103/resource/343f1d18-067b-44ee-b7b3-1b04c4872b86/download/ts18individual25countaveragemedianbypostcode.csv [Accessed 25 Oct. 2020].

Epstein, J. (2014). Australian LGA to postcode mappings with PostGIS and Intersects | GreenAsh. [online] greenash.net.au. Available at: https://greenash.net.au/thoughts/2014/07/australian-lga-to-postcode-mappings-with-postgis-and-intersects/ [Accessed 22 Oct. 2020].

Google (n.d.). Google Maps. [online] Google Maps. Available at: https://www.google.com/maps/@-27.4197998 [Accessed 23 Oct. 2020].

Jansen, M. (2020). MATH2349 Assignment 2 Data Wrangling. [Assignment] Available at: https://rpubs.com/cybecom/dw-ass2-Oct2020 [Accessed 23 Oct. 2020].

Rodd, J. (1996). Pareto’s law of income distribution, or the 80/20 rule. International Journal of Nonprofit and Voluntary Sector Marketing, 1(1), pp.77–89.

Shubham (2016). r - Add empty columns to a dataframe with specified names from a vector. [online] Stack Overflow. Available at: https://stackoverflow.com/questions/18214395/add-empty-columns-to-a-dataframe-with-specified-names-from-a-vector [Accessed 25 Oct. 2020].

North of the River Versus South of the River in Brisbane

Who has the higher mean salary?