General liquor licences and the median sale price of houses

A Chi-square test of association

April Coulter s3869101

Last updated: 25 October, 2020

Introduction

Every month, the Victorian Commission for Gambling and Liquor Regulation (VCGLR) publishes a stocktake of active liquor licences in Victoria. Licences are granted under certain categories, covering production, pre-retail, BYO, and various service and/or supply types.

In March 2020, the VCGLR recorded 23,276 active licences across Victoria.

Pubs, hotels and taverns

Locations

The locations of the venues are provided in the dataset.

Problem Statement

Method

  1. Choose two sources relevant to the same time period (March 2020), for licence data and house price data.
  2. Convert house price values to a discretised categorical variable to handle extreme outliers.
  3. Take a sample of the preprocessed data, by matching the data by suburb and keeping only matching rows.
  4. Table the data and plot it in a barchart to see the proportions of each kind of general licence across each house price category.
  5. Determine a hypothesis test, where
    1. H0 : There is no association in the population between general licence category and median house price category.
    2. HA: There is an association in the population between general licence category and median house price category.
  6. Apply a chi-square test of association to test for a statistically significant association between general licence category and median house price category.
  7. Compare the observed χ2 value to the χ2crit critical value.
  8. Find the p-value of the sample χ2, assuming there is no association in the population.
  9. Reject, or fail to reject, H0.

Data

Source 1 and selected variables

Source 2 and selected variables

Preprocessing

  1. Read in Source 1 and factor variables
    1. Suburb as an unordered factor with 1573 levels
    2. Category as an unordered factor with 14 levels, representing 14 categories of licence.
  2. Read in Source 2 and factor variables
    1. Suburb as an unordered factor with 754 levels
  3. Filter Source 1 by Category to keep only General Licence and Late night (general) Licence factor levels.
  4. Check variables for standard and non-standard missing values.
  5. Discretise Source 2 variable Q1 2020 by equal frequency, and store as a vector. This is to convert the house price values to a categorical variable and dampen the effect of outliers. This is done prior to merging, to avoid bias. After merging, the same suburb may be represented more than once, or not at all, which would affect the frequency distribution of the values across bins.
  6. Column bind the “bins” vector to Source 2 variables as a new column variable.
  7. Rename Source 2 variables
    1. Q1 2020 renamed as Median House Price
    2. “bins” variable renamed as Price Category

(Preprocessing continued)

  1. Group Source 2 by Price Category and summarise by min and max values of Median House Price. This is to reveal the price ranges of the discretised values.
  2. Factor Price Category as an ordered factor, labeling by price ranges.
  3. Merge Sources 1 and 2 by Suburb using inner_join. Store as Liquor_Houses dataframe (1559 observations over 5 variables).
  4. Re-factor Liquor_Houses variables to remove empty levels. Variables are now:
    1. Licence Num: character variable of the license identification number
    2. Suburb: factor with 456 levels corresponding to suburbs in Victoria
    3. Category: factor with 2 levels; General Licence, and Late night (general) Licence
    4. Median House Price: numerical variable of the median sale price of houses in the first quarter of 2020, in Australian dollars.
    5. Price Category: Ordered factor with 9 levels corresponding to price ranges based on the equal frequency discretised values of Median House Price

Descriptive Statistics and Visualisation

The data can now be arranged in a proportional table of Price Category by licence Category

# table data

table <- table(Liquor_Houses$`Price Category`, Liquor_Houses$Category)
prop_table <- table %>% prop.table(margin = 2) %>% round(3)
knitr::kable(prop_table)
General Licence Late night (general) Licence
<293k 0.132 0.050
293-397k 0.160 0.097
398-490k 0.098 0.039
491-575k 0.057 0.086
576-652k 0.066 0.064
653-789k 0.053 0.081
790-990k 0.069 0.084
991k-1.3m 0.146 0.209
>1.3m 0.218 0.290

Important Variables

boxplot((Houses_02$`Median House Price` / 1000000), main = "Median House Price of 754 suburbs", ylab = "Median Price in Millions AUD", col = "beige")

barplot(table(Liquor_Houses$`Price Category`), horiz = TRUE,  col = "beige", ylab = "Price Category", main = "Frequency Distribution of Licences within Price Categories") 

barplot(table(Liquor_Houses$Category), col = "beige", ylab = "Total Licences", main = "Frequency Distribution of Licences within Licence Categories")

Licence Category by Price Category

barplot(prop_table, main = "General Licences by House Price", beside = TRUE, col = brewer.pal(9, name = "OrRd"), ylab = "Proportion within Group", ylim = c(0,0.4), legend = rownames(prop_table), args.legend = list(x = "top", title = "Median Sale Price", bty = "n", ncol = 3 ))      #                               
grid(NA, NULL)

Hypothesis Testing

Chi-square test of association

# apply the test

chi <- chisq.test(table(Liquor_Houses$`Price Category`, Liquor_Houses$Category))
chi
## 
##  Pearson's Chi-squared test
## 
## data:  table(Liquor_Houses$`Price Category`, Liquor_Houses$Category)
## X-squared = 56.394, df = 8, p-value = 2.364e-09

Check assumptions of the test

# check expected counts

knitr::kable(chi$expected)
General Licence Late night (general) Licence
<293k 136.24118 40.75882
293-397k 174.72739 52.27261
398-490k 101.60359 30.39641
491-575k 76.20269 22.79731
576-652k 78.51187 23.48813
653-789k 71.58435 21.41565
790-990k 86.97883 26.02117
991k-1.3m 192.43105 57.56895
>1.3m 281.71905 84.28095

Critical value

Degrees of freedom for the chi-square test are calcuated as:

\[df = (r - 1)(c - 1)\] where r = rows, and c = columns of the tabled data.

Therefore, the degrees of freedom for this chi-square test are:

\[df = (9 − 1)(2 − 1) = 8\]

Now we can find the one-tailed critical value, χ2crit, associated with α = 0.05 and df = 8:

# calculate X2 critical value

qchisq(p = 0.95,df = 8)
## [1] 15.50731

Discussion

Conclusion

References

Data

Department of Environment, Land, Water and Planning (DELWP) 2020, March 2020 quarter: Houses by Suburb, viewed 10 October 2020, https://www.propertyandlandtitles.vic.gov.au/__data/assets/excel_doc/0035/485936/Houses_1st_Quarter_2020.xls

Victorian Commission for Gambling and Liquor Regulation (VCGLR) 2020, Current Victorian Liquor Licences as of 31/03/2020, viewed 10 October 2020, https://www.vcglr.vic.gov.au/sites/default/files/current_victorian_licences_by_location_geo-coded_march_2020.xlsx

Information

Victorian Commission for Gambling and Liquor Regulation (VCGLR) 2020, Victorian Liquor Licences by Location, viewed 20 October 2020, https://www.vcglr.vic.gov.au/resources/data-and-research/liquor-data/liquor-licences-location

Victorian Commission for Gambling and Liquor Regulation (VCGLR) 2020, Victorian Liquor Licences by Category, viewed 23 October 2020, https://www.vcglr.vic.gov.au/resources/data-and-research/liquor-data/liquor-licences-category

Victorian Commission for Gambling and Liquor Regulation (VCGLR) 2020, Liquor Licences by Licence Category, viewed 23 October 2020, https://www.vcglr.vic.gov.au/sites/default/files/07_-_vcglr_renewable_licences_historic_2.pdf

Discretize Numeric Variables n.d, viewed 23 October 2020, https://recipes.tidymodels.org/reference/discretize.html