Reading time: 24 minutes
This is a personal project built for skills demonstration purposes. Its goal is to help a young Brisbane family buy a property. The parents have a combined annual income of AUD 80k to 100k. They have a kid and two pets, and they work near St Lucia, a suburb close to the Brisbane CBD. They intend to buy a house in September 2020.
This project was created with R Markdown in RStudio. The data set is a public Brisbane real estate sales data set covering 2018 to 2020, downloaded from Kaggle. SQL was used during data cleaning (to demonstrate that skill). To achieve the goal, 6 graphs and 4 interactive maps were created. The data set contains 3 property types to suggest to the family: house, townhouse, and unit. It also records how many rooms each property had when it was sold.
Results reveal that demand for properties has been noticeably affected by Covid-19, declining sharply in July 2020 across all property types. The family should start looking now, while there are fewer competing buyers in the market. Property prices generally increase with the number of rooms (a 20%-40% increment for each additional room), and houses close to the Brisbane CBD tend to be more expensive. The best option for the family is to start looking for an affordable 3-room house. A 3-room townhouse is about $120k cheaper on average, but based on the data there have not been many on the market in the last few years. A standard 3-room house generally costs about AUD 650k, while a similar townhouse may be around AUD 500k. A 3-room house would be more ideal, giving the family rooms for their kid and guests plus more yard space for their pets. Based on the family's circumstances, the best locations to hunt for a 3-room house are Kenmore, Jindalee, Oxley, Moorooka, and Salisbury. If the family does not mind slightly longer commuting to their workplace, Inala and Forest Lake also offer many great options.
(Highlights)
R packages loaded in this project include
library(tidyverse)
library(htmltools)
library(DT)
library(lubridate)
library(kableExtra)
library(skimr)
library(leaflet)
library(mapview)
library(rgdal)
library(ggrepel)
library(snakecase)
library(treemapify)
The purpose of this project is to find a property type suitable for a young Brisbane family. The family have a combined income of AUD 80,000 to 100,000 a year and one kid, making them a three-person household. They prefer a pet-friendly house as they have a cat and a dog, and they hope to have one extra room for guests, since the grandparents love their grandchild and wish to visit frequently. They are first home buyers working near the Brisbane suburb of St Lucia, and they intend to buy in September 2020.
From the introduction, I can see that the family would want a 3-room property, ideally with a yard as outdoor space for their pets.
To aid the family, I will first provide general information about the trend of Brisbane property prices in recent years. I will then explore the prices of different property types, taking a closer look at nearby suburbs and the number of rooms. To guide my exploration, I have several important questions to ask myself, including:
The data set used in this personal project is a public data set downloaded from Kaggle, uploaded by HtAG Holdings and Alex Fedoseev. Kaggle is a professional platform for the data science community, and HtAG Holdings is a web portal that provides actionable market information for real estate professionals. Click this link to visit the relevant Kaggle webpage. The data set was used on May 4, 2020 to analyse the impact of COVID-19 on the property market.
The data has a size of 35.87 MB and was downloaded as a CSV file.
This section imports the data set into R. The following shows the first 10,000 rows of the data set.
prop.price <- read_csv("aus-property-sales-sep2018-april2020.csv")
prop.price
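The printed tibble shows only the first few rows. Since DT is loaded, an interactive preview could be rendered with something like the sketch below; the use of head() and the pageLength option here are illustrative rather than the exact settings used in the rendered document.
# A sketch of an interactive preview table with DT (illustrative options)
datatable(head(prop.price, 10000), options = list(pageLength = 10))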
This table is built based on the original information given on the Kaggle website.
Variable <- c("date_sold", "price", "suburb", "city_name", "state", "lat", "lon", "bedrooms", "property_type", "loc_pid", "lga_pid")
Description <- c("Date of sales",
"Price for which the property was sold for. Null if unknown.",
"Suburb of sale",
"Name of the Australia city",
"State",
"Latitude",
"Longitude",
"Number of bedrooms",
"Type of property i.e. house, unit, townhouse",
"ID of the locality (suburb)",
"ID of the LGA (council area)")
data.frame(Variable, Description) %>%
kbl() %>%
kable_styling(bootstrap_options = c("hover", "bordered", "striped"))
| Variable | Description |
|---|---|
| date_sold | Date of sales |
| price | Price for which the property was sold for. Null if unknown. |
| suburb | Suburb of sale |
| city_name | Name of the Australia city |
| state | State |
| lat | Latitude |
| lon | Longitude |
| bedrooms | Number of bedrooms |
| property_type | Type of property i.e. house, unit, townhouse |
| loc_pid | ID of the locality (suburb) |
| lga_pid | ID of the LGA (council area) |
This data set has 320,334 rows (observations) and 11 columns (variables). The variables include character, numeric, and date types.
skim_without_charts(prop.price)
| Name | prop.price |
| Number of rows | 320334 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 3 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| price | 0 | 1 | 4 | 9 | 0 | 10032 | 0 |
| suburb | 0 | 1 | 3 | 21 | 0 | 2028 | 0 |
| city_name | 0 | 1 | 5 | 9 | 0 | 6 | 0 |
| state | 0 | 1 | 2 | 3 | 0 | 6 | 0 |
| property_type | 0 | 1 | 4 | 9 | 0 | 3 | 0 |
| loc_pid | 0 | 1 | 3 | 11 | 0 | 2122 | 0 |
| lga_pid | 0 | 1 | 3 | 6 | 0 | 115 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| lat | 78 | 1 | -34.63 | 3.19 | -38.48 | -37.79 | -34.74 | -33.72 | 31.92 |
| lon | 78 | 1 | 143.32 | 11.11 | 115.58 | 144.72 | 145.16 | 151.07 | 153.19 |
| bedrooms | 0 | 1 | 3.11 | 0.99 | 0.00 | 2.00 | 3.00 | 4.00 | 5.00 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_sold | 0 | 1 | 2018-09-02 | 2020-07-16 | 2019-08-19 | 684 |
Insights from this table:
A closer look reveals that there are missing values in the price column; however, they have been labelled with the string "NULL", so they are not detected as missing. Handling them will be a task for the cleaning stage.
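A quick way to confirm this is to tally the literal "NULL" strings in the price column; the check below is a sketch rather than output from the original analysis.
# Count and proportion of rows whose price is the literal string "NULL" (sketch)
sum(prop.price$price == "NULL")
mean(prop.price$price == "NULL")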
The following is another way of glimpsing the data. It lists all variables in the data set, along with their current type and their first few observations. I can see "NULL" appears in the price column; these values will need to be removed during data cleaning so that R can recognise price as a numeric vector.
glimpse(prop.price)
## Rows: 320,334
## Columns: 11
## $ date_sold <dttm> 2018-09-18, 2018-09-24, 2018-09-26, 2018-09-26, 2018-10~
## $ price <chr> "NULL", "NULL", "1730000", "1928000", "1475000", "960000~
## $ suburb <chr> "Darling Point", "Darling Point", "Darling Point", "Darl~
## $ city_name <chr> "Sydney", "Sydney", "Sydney", "Sydney", "Sydney", "Sydne~
## $ state <chr> "NSW", "NSW", "NSW", "NSW", "NSW", "NSW", "NSW", "NSW", ~
## $ lat <dbl> -33.86957, -33.87218, -33.86839, -33.87546, -33.87573, -~
## $ lon <dbl> 151.2413, 151.2397, 151.2375, 151.2363, 151.2336, 151.23~
## $ bedrooms <dbl> 3, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 2, 3, 2, 3, 3, 1,~
## $ property_type <chr> "unit", "unit", "unit", "unit", "unit", "unit", "unit", ~
## $ loc_pid <chr> "NSW1221", "NSW1221", "NSW1221", "NSW1221", "NSW1221", "~
## $ lga_pid <chr> "NSW180", "NSW180", "NSW180", "NSW180", "NSW180", "NSW18~
In R,
I will convert some of the variables to other types during data cleaning, such as:
In this section, I apply SQL for preliminary cleaning to demonstrate my ability with that language and with the BigQuery platform. I then import the cleaned data set back into R for further manipulation.
In BigQuery, I create a dataset and store the property price data as a table under it. I then use BigQuery features such as SCHEMA, DETAILS, and PREVIEW to examine the data.
During my data exploration and cleaning using SQL, I found that:
Steps 1 to 6
Steps 7 to 8
I am using SQL to remove the rows where price is NULL and the rows where the number of bedrooms is zero.
After the cleaning query, 19.6% of the rows have been removed; 80.4% of the data remains, which is 257,684 rows.
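The SQL query itself is not reproduced here; for reference, an equivalent filter written in R would look roughly like the sketch below (the object name prop.price.cleaned is illustrative only).
# R equivalent of the BigQuery cleaning step (sketch):
# drop rows where price is the literal string "NULL" or bedrooms is zero,
# then convert price to numeric
prop.price.cleaned <- prop.price %>%
  filter(price != "NULL", bedrooms != 0) %>%
  mutate(price = as.numeric(price))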
I download this cleaned data set from BigQuery into the relevant R project folder on my computer and import it back into R in the next section.
After cleaning the data with SQL, I now import the cleaned data set back into R.
Click the button on the right to show the code; the following table indicates that the data import has been successful.
prop.price2 <- read_csv("aus-property-sales_SQL_Cleaned.csv")
prop.price2
Checking the data set confirms that 257,684 rows and 11 columns have been imported into R.
skim_without_charts(prop.price2)
| Name | prop.price2 |
| Number of rows | 257684 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| date_sold | 0 | 1 | 23 | 23 | 0 | 684 | 0 |
| suburb | 0 | 1 | 3 | 21 | 0 | 2011 | 0 |
| city_name | 0 | 1 | 5 | 9 | 0 | 6 | 0 |
| state | 0 | 1 | 2 | 3 | 0 | 6 | 0 |
| property_type | 0 | 1 | 4 | 9 | 0 | 3 | 0 |
| loc_pid | 0 | 1 | 3 | 11 | 0 | 2104 | 0 |
| lga_pid | 0 | 1 | 3 | 6 | 0 | 115 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| price | 0 | 1 | 810878.66 | 618775.78 | 1000.00 | 486000.00 | 655000.00 | 920000.00 | 1.0000e+08 |
| lat | 53 | 1 | -34.68 | 3.22 | -38.48 | -37.79 | -34.82 | -33.74 | -2.7290e+01 |
| lon | 53 | 1 | 143.92 | 10.40 | 115.58 | 144.76 | 145.18 | 151.06 | 1.5319e+02 |
| bedrooms | 0 | 1 | 3.11 | 0.97 | 1.00 | 2.00 | 3.00 | 4.00 | 5.0000e+00 |
Converting the date
The date column is imported as character; however, it should be in date or date-time format. The following code completes the conversion (click right).
prop.price2 <- prop.price2 %>%
mutate(date_sold = as.POSIXct(date_sold))
Converting “bedrooms” into factor type
The column “bedrooms” records the number of bedrooms a property has, so recognising it as a numerical variable is reasonable. Optionally, I am converting it into a factor so that I can use it to group the data in later analysis sections.
The code on the right performs this conversion.
prop.price2 <- prop.price2 %>%
mutate(bedrooms = factor(bedrooms))
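A quick check that the conversion worked (a sketch):
# The factor now carries explicit levels that later grouping and labelling rely on
levels(prop.price2$bedrooms)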
Converting character variables into factors
Not every character variable needs to be converted into a factor. However, it is useful in this scenario so that I can use them to group the data during analysis. Converting them into factors also lets me inspect their levels with R functions such as summary(). Another benefit is that storing heavily repeated values as factors can make some grouping operations more efficient.
prop.price2 <- prop.price2 %>%
mutate_if(is.character, factor)
Final check
This final check confirms that all variables are now in the data types I selected.
glimpse(prop.price2)
## Rows: 257,684
## Columns: 11
## $ date_sold <dttm> 2019-04-05, 2019-09-11, 2019-09-16, 2019-11-07, 2020-01~
## $ price <dbl> 281000, 235000, 265000, 340000, 310000, 289000, 312000, ~
## $ suburb <fct> Aberfoyle Park, Aberfoyle Park, Aberfoyle Park, Aberfoyl~
## $ city_name <fct> Adelaide, Adelaide, Adelaide, Adelaide, Adelaide, Adelai~
## $ state <fct> SA, SA, SA, SA, SA, SA, SA, SA, SA, SA, SA, SA, SA, SA, ~
## $ lat <dbl> -35.07963, -35.06134, -35.06490, -35.07239, -35.06394, -~
## $ lon <dbl> 138.6070, 138.6114, 138.5911, 138.6044, 138.5952, 138.59~
## $ bedrooms <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,~
## $ property_type <fct> unit, unit, unit, unit, unit, unit, unit, house, house, ~
## $ loc_pid <fct> SA1, SA1, SA1, SA1, SA1, SA1, SA1, SA1, SA1, SA1, SA1, S~
## $ lga_pid <fct> SA93, SA93, SA93, SA93, SA93, SA93, SA93, SA93, SA93, SA~
The following is the summary of the data set. I can see the sample size of each level in each variable, and I notice that the maximum property price is 100 million dollars while the minimum is only 1,000 dollars! This makes me wonder whether there is an error that needs to be fixed.
I also see that the data set contains sales data for states other than Queensland, where Brisbane is located. Since information from other states is irrelevant to this project, it will be removed in section 6.6.
summary(prop.price2)
## date_sold price suburb
## Min. :2018-09-02 00:00:00 Min. : 1000 Pakenham : 1667
## 1st Qu.:2019-02-25 00:00:00 1st Qu.: 486000 Craigieburn: 1327
## Median :2019-08-24 00:00:00 Median : 655000 Frankston : 1219
## Mean :2019-08-04 22:11:29 Mean : 810879 Melbourne : 1079
## 3rd Qu.:2020-01-06 00:00:00 3rd Qu.: 920000 Sunbury : 1048
## Max. :2020-07-16 00:00:00 Max. :100000000 Tarneit : 1031
## (Other) :250313
## city_name state lat lon bedrooms
## Adelaide :27606 ACT: 9396 Min. :-38.48 Min. :115.6 1: 13035
## Brisbane :27339 NSW:68164 1st Qu.:-37.79 1st Qu.:144.8 2: 53051
## Canberra : 9396 QLD:27339 Median :-34.82 Median :145.2 3:101784
## Melbourne:98522 SA :27606 Mean :-34.68 Mean :143.9 4: 72799
## Perth :26657 VIC:98522 3rd Qu.:-33.74 3rd Qu.:151.1 5: 17015
## Sydney :68164 WA :26657 Max. :-27.29 Max. :153.2
## NA's :53 NA's :53
## property_type loc_pid lga_pid
## house :178248 VIC2023: 1667 ACT : 9396
## townhouse: 19442 VIC660 : 1327 BRIS2 : 7471
## unit : 59994 VIC939 : 1219 BRIS3 : 6860
## VIC1634: 1079 VIC127 : 6685
## VIC2394: 1048 NSW283 : 5653
## VIC2453: 1031 NSW329 : 5419
## (Other):250313 (Other):216200
The following graph shows that the 100,000,000 price is an outlier that would distort my analysis. The family in this scenario also cannot afford a 100-million-dollar property, which justifies removing this data point.
ggplot(prop.price2, aes(x = "", y = price)) +
geom_jitter(size = 3, alpha = 0.5)
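Before removing it, the most expensive records can be listed to confirm that the 100-million-dollar sale is an isolated point; this check is a sketch rather than output from the original analysis.
# Inspect the highest prices before choosing a cut-off (sketch)
prop.price2 %>%
  arrange(desc(price)) %>%
  select(date_sold, suburb, price, property_type, bedrooms) %>%
  head(5)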
The outlier is removed using the following code (click the right button).
prop.price2 <- prop.price2 %>%
filter(price < 25000000)
Plotting the data again confirms the point has been removed.
ggplot(prop.price2, aes(x = "", y = price)) +
geom_jitter(size = 3, alpha = 0.5)
This project focuses on real estate sales in Brisbane. The data set also contains sales from other cities in other states, including:
levels(prop.price2$city_name)
## [1] "Adelaide" "Brisbane" "Canberra" "Melbourne" "Perth" "Sydney"
This section removes the other cities from the data set, leaving only Brisbane data. The smaller data set will also speed up the rest of the analysis workflow. The following code completes the job (click the right button).
bris.prop <- prop.price2 %>%
filter(city_name == "Brisbane") %>%
select(-city_name, -state) %>% # Remove variables city_name and state, since they are redundant now
arrange(date_sold)
Nearly 90% of the data (other cities in other states) has been removed, leaving only the roughly 10% that relates to Brisbane.
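That proportion can be verified directly (a sketch):
# Share of rows retained after filtering to Brisbane (sketch)
nrow(bris.prop) / nrow(prop.price2)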
Although the objective of this project focuses on real estate sales in 2020, I will leave the sales from other years in the data set to allow analysis of trends over the last few years. Therefore, this data set has Brisbane sales data for:
years <- bris.prop %>%
mutate(year = factor(year(date_sold)))
levels(years$year)
## [1] "2018" "2019" "2020"
The longitude and latitude columns in the data set allow me to draw maps during visualisation. However, these are only scattered point locations; I also want to draw polygons showing the boundaries of each Brisbane suburb, which cannot be derived from point data alone. For this, I need a shapefile that contains the boundary information of Brisbane suburbs.
I searched online and obtained a shapefile containing Queensland suburb boundaries from an Australian government website, then imported it into R.
Shape file imported.
# Upload shape file of Queensland suburb boundary
myshape <- readOGR("QLD_LOCALITY_POLYGON_shp/QLD_LOCALITY_POLYGON_shp.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\karho\Desktop\R and Stats\R for online presence\20210626_p5_Australia_property\ausproperty\QLD_LOCALITY_POLYGON_shp\QLD_LOCALITY_POLYGON_shp.shp", layer: "QLD_LOCALITY_POLYGON_shp"
## with 3298 features
## It has 12 fields
The file contains boundary information for all suburbs in Queensland; however, I only need the Brisbane ones. The following code filters out the redundant (non-Brisbane) suburbs, leaving only the suburbs that appear in the data set used in this project.
# Change the suburb names in my data set to upper case to match the naming format in the shapefile, as a first step towards combining the two.
bris.prop <- bris.prop %>%
mutate(suburb = toupper(suburb))
# Check whether all Brisbane suburbs in the data set match the suburb names in the shapefile. Conclusion: yes.
a <- is.element(bris.prop$suburb, myshape$QLD_LOCA_2)
a <- factor(a)
# levels(a)   # run this line to check that only "TRUE" remains
# Reduce the shapefile to keep only the suburbs that match those in my data set.
# After subsetting, the shapefile (myshape2) is reduced from 3298 features to 175 features covering only Brisbane suburbs.
myshape2 <- subset(myshape, is.element(myshape$QLD_LOCA_2, bris.prop$suburb))
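To confirm the reduction and get a quick preview of the remaining polygons, something like the sketch below could be used; mapview is already loaded, although the snapshot shown below may have been produced differently.
# Number of suburb polygons remaining after the subset (expected: 175) and a quick preview (sketch)
length(myshape2)
mapview(myshape2)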
The shapefile will look like the map below (a snapshot taken from a later section to aid understanding).
Looking good. I will use this manipulated shapefile object (myshape2) for mapping in a later section. I can see that a few Brisbane regions have no sales data, but not many, so it should be fine.
The following graph summarises the overall property sales (count) in Brisbane from September 2018 to July 2020.
# set up df
df7.1 <- bris.prop %>%
mutate(YearMonth = format(date_sold, "%Y-%m")) %>%
relocate(YearMonth, .after = date_sold) %>%
group_by(YearMonth, property_type, bedrooms) %>%
summarise(count = n()) %>%
mutate(property_type = fct_recode(property_type,
"House" = "house",
"Townhouse" = "townhouse",
"Unit" = "unit")) %>%
arrange(YearMonth)
# plot
ggplot(df7.1, aes(x = YearMonth, y = count, colour = bedrooms, group = bedrooms)) +
geom_point() +
geom_path() +
facet_wrap(~ property_type, ncol = 3, nrow = 1) +
labs(title = "Figure 1. Monthly Total Real Estate Sales (Count) in Brisbane",
subtitle = "First Covid-19 Case in Australia: 25 January 2020",
caption = "Between 2018-09-02 to 2020-07-16",
y = "Number of property sold",
x = "Date (Year-Month)",
colour = "Number of Bedrooms:") +
theme_bw() +
theme(legend.position = "none",
plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
axis.text.x = element_text(angle = 90, vjust = 0.75),
axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
strip.text = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 14)) +
geom_text_repel(aes(label = ifelse(YearMonth == "2019-07", paste0(bedrooms, " rooms"), "")),
size = 4) +
geom_vline(xintercept = "2020-01", color = "red", linetype = 2) +
geom_text(aes(x = "2020-01",
y = 420,
label = "Covid Strike"),
size = 3,
colour = "red",
hjust = 0.5)
Insights:
Now I want to examine the price trend of each property type.
# set up df
df7.2 <- bris.prop %>%
mutate(YearMonth = format(date_sold, "%Y-%m"),
YearMonth = factor(YearMonth)) %>%
relocate(YearMonth, .after = date_sold) %>%
group_by(YearMonth, property_type, bedrooms) %>%
summarise(average = mean(price)) %>%
mutate(property_type = fct_recode(property_type,
"House" = "house",
"Townhouse" = "townhouse",
"Unit" = "unit"))
# plot
ggplot(df7.2, aes(x = YearMonth, y = average, colour = bedrooms, group = bedrooms)) +
geom_point() +
geom_path() +
facet_wrap(~ property_type, ncol = 3, nrow = 1) +
labs(title = "Figure 2. Monthly Average Real Estate Sales (Price) in Brisbane",
subtitle = "First Covid-19 Case in Australia: 25 January 2020",
caption = "Between 2018-09-02 to 2020-07-16",
y = "AUD (million)",
x = "Date (Year-Month)",
colour = "Number of Bedrooms:") +
theme_bw() +
theme(legend.position = "none",
plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
axis.text.x = element_text(angle = 90, vjust = 0.75),
axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
strip.text = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 14)) +
geom_text_repel(aes(label = ifelse(YearMonth == "2019-07", paste0(bedrooms, " rooms"), "")),
size = 4) +
geom_vline(xintercept = "2020-01", color = "red", linetype = 2) +
geom_text(aes(x = "2020-01",
y = 420,
label = "Covid Strike"),
size = 3,
colour = "red",
hjust = 0.5) +
scale_y_continuous(label = function(x)paste0({x/1000000}, "M"))
There is a very high average in 2019-01 masking the overall trends. Removing it allows a closer examination.
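Before filtering, it is worth confirming which group produced that spike; a quick check could look like the sketch below.
# Which property type / bedroom groups have an average above the cut-off used next? (sketch)
df7.2 %>% filter(average >= 4000000)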
ggplot(df7.2 %>% filter(average < 4000000), aes(x = YearMonth, y = average, colour = bedrooms, group = bedrooms)) +
geom_point() +
geom_path() +
facet_wrap(~ property_type, ncol = 3, nrow = 1) +
labs(title = "Figure 3. Monthly Average Real Estate Sales (Price) in Brisbane",
subtitle = "First Covid-19 Case in Australia: 25 January 2020",
caption = "Between 2018-09-02 to 2020-07-16",
y = "AUD (thousand)",
x = "Date (Year-Month)",
colour = "Number of Bedrooms:") +
theme_bw() +
theme(legend.position = "none",
plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
axis.text.x = element_text(angle = 90, vjust = 0.75),
axis.title.y = element_text(margin = margin(0, 10, 0, 0)),
strip.text = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 14)) +
geom_text_repel(aes(label = ifelse(YearMonth == "2019-10", paste0(bedrooms, " rooms"), "")),
size = 4) +
geom_vline(xintercept = "2020-01", color = "red", linetype = 2) +
geom_text(aes(x = "2020-01",
y = 200000,
label = "Covid Strike"),
size = 3,
colour = "red",
hjust = 0.5) +
scale_y_continuous(label = function(x)paste0({x/1000}, "k"),
lim = c(0, 1400000),
breaks = seq(0, 1400000, 200000))
Insights:
There are 3 property types (house, townhouse, and unit) in the data set and 5 room categories (1 room to 5 rooms). The following graph compares the overall average prices of these groups between 2018 and 2020.
df7.3 <- bris.prop %>%
group_by(property_type, bedrooms) %>%
summarise(count = n(),
Average_price = round(mean(price)),
Average_price_lab = paste0("$ ", round(Average_price/1000), "k"))
# set up lab
df7.3 <- df7.3 %>%
mutate(lab = paste(bedrooms, "\n", "(n =", count, ")"))
# plot
ggplot(df7.3, aes(x = lab, y = Average_price, fill = property_type)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~property_type, scale = "free_x") +
scale_y_continuous(label = function(x)paste0({x/1000000}, "M")) +
geom_text(aes(label = Average_price_lab),
vjust = -1,
size = 3) +
labs(title = "Figure 4. Average price of different properties in Brisbane",
subtitle = "n = Sample size",
caption = "Between 2018-09-02 to 2020-07-16",
x = "Number of room",
y = "Average price ($)") +
theme(legend.position = "none",
plot.title = element_text(size = 15, face = "bold"),
strip.text = element_text(size = 12)
)
This should serve only as quick guidance, since locality and property age also affect prices and are not captured in this figure.
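To put a number on the per-room increment quoted in the summary, the percentage change between consecutive bedroom counts can be computed from df7.3; the column name pct_change below is illustrative only.
# Percentage change in average price for each additional bedroom, by property type (sketch)
df7.3 %>%
  group_by(property_type) %>%
  arrange(bedrooms, .by_group = TRUE) %>%
  mutate(pct_change = round((Average_price / lag(Average_price) - 1) * 100, 1)) %>%
  select(property_type, bedrooms, Average_price, pct_change)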
Insights:
This is a multi-layer map; feel free to spend a few minutes exploring each layer.
# create and transform df
df8.1 <- bris.prop %>%
group_by(suburb) %>%
summarise(Average_price = mean(price))
# Aligning data of the shapefile (myshape2) and the dataset (df8.1)
df8.1 <- df8.1[order(match(df8.1$suburb, myshape2$QLD_LOCA_2)), ]
# set up colour function
price_cut <- c(0, 200000, 400000, 600000, 800000, 1000000, 1400000)
pal <- colorBin(palette = "YlOrRd", bins = price_cut)
# set up lab
labs <- paste0(df8.1$suburb, "</p>", "$", prettyNum(df8.1$Average_price, big.mark = ","))
# Set year in factor type in the original data set
bris.prop <- bris.prop %>%
mutate(year = factor(year(date_sold)))
# Create new variables in the bris.prop data set (colour_range, map_lab, map_popup)
bris.prop$colour_range <- cut(bris.prop$price,
breaks = c(0, 300000, 700000, 1000000, 4000000, 10000000),
labels = c("$0 - $300k",
"$300k - $700k",
"$700k - 1M",
"$1M - $4M",
"$4M - $10M"))
bris.prop <- bris.prop %>%
mutate(map_lab = paste0(property_type, " ($", price, ")", ", ", bedrooms, " rooms"),
map_popup = paste0("Property type: ", property_type, "<br>",
"Price: $", prettyNum(price, big.mark = ","), "<br>",
"Number of bedroom: ", bedrooms, "<br>",
"Date sold: ", date_sold, "<br>",
"Suburb: ", suburb))
df8.3_pal <- colorFactor(palette = c("yellow", "green", "orange", "red", "black"), domain = bris.prop$colour_range)
#set up layers
house_1rm <- bris.prop %>% filter(property_type == "house", bedrooms == "1")
house_2rm <- bris.prop %>% filter(property_type == "house", bedrooms == "2")
house_3rm <- bris.prop %>% filter(property_type == "house", bedrooms == "3")
house_4rm <- bris.prop %>% filter(property_type == "house", bedrooms == "4")
house_5rm <- bris.prop %>% filter(property_type == "house", bedrooms == "5")
townhouse_1rm <- bris.prop %>% filter(property_type == "townhouse", bedrooms == "1")
townhouse_2rm <- bris.prop %>% filter(property_type == "townhouse", bedrooms == "2")
townhouse_3rm <- bris.prop %>% filter(property_type == "townhouse", bedrooms == "3")
townhouse_4rm <- bris.prop %>% filter(property_type == "townhouse", bedrooms == "4")
townhouse_5rm <- bris.prop %>% filter(property_type == "townhouse", bedrooms == "5")
unit_1rm <- bris.prop %>% filter(property_type == "unit", bedrooms == "1")
unit_2rm <- bris.prop %>% filter(property_type == "unit", bedrooms == "2")
unit_3rm <- bris.prop %>% filter(property_type == "unit", bedrooms == "3")
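# Note: the repeated filter() calls above could also be generated in one step; the
# sketch below is illustrative only and is not used by the leaflet code, which
# refers to the individually named objects.
layer_list <- split(bris.prop,
                    list(bris.prop$property_type, bris.prop$bedrooms),
                    drop = TRUE)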
# plot
leaflet(width = "100%") %>%
addTiles(group = "Open Street Map") %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Esri World Imagery") %>%
addProviderTiles(providers$Stamen.TonerLite, group = "Stamen Toner Lite") %>%
setView(lat = -27.4705, lng = 153.0260, zoom = 11) %>%
addLayersControl(baseGroups = c("Open Street Map",
"Esri World Imagery",
"Stamen Toner Lite"),
overlayGroups = c("Overall Sales Cluster",
"Suburb Polygon - yellow",
"Suburb Polygon - blue",
"Suburb Polygon - Averaged",
"House - 1 room",
"House - 2 room",
"House - 3 room",
"House - 4 room",
"House - 5 room",
"Townhouse - 1 room",
"Townhouse - 2 room",
"Townhouse - 3 room",
"Townhouse - 4 room",
"Townhouse - 5 room",
"Unit - 1 room",
"Unit - 2 room",
"Unit - 3 room"),
options = layersControlOptions(collapsed = F)) %>%
hideGroup(c("Suburb Polygon - yellow",
"House - 1 room",
"House - 2 room",
"House - 4 room",
"House - 5 room",
"Townhouse - 1 room",
"Townhouse - 2 room",
"Townhouse - 3 room",
"Townhouse - 4 room",
"Townhouse - 5 room",
"Unit - 1 room",
"Unit - 2 room",
"Unit - 3 room",
"Suburb Polygon - Averaged",
"Overall Sales Cluster")) %>%
addLegend(position = "bottomleft",
opacity = 0.5,
pal = pal,
values = price_cut,
title = "Suburb Average property Price (Polygon)") %>%
addLegend(position = "bottomleft",
opacity = 0.5,
values = bris.prop$colour_range,
pal = df8.3_pal,
title = "Property Prices (Points)") %>%
addPolygons(group = "Suburb Polygon - yellow",
data = myshape2,
weight = 1,
fillOpacity = 0.5,
opacity = 1,
smoothFactor = 0.5,
color = ~"yellow",
highlight = highlightOptions(weight = 2,
fillOpacity = 0.3
),
label = ~QLD_LOCA_2,
labelOptions = labelOptions(direction = "auto",
textsize = "15px"),
popup = myshape2$QLD_LOCA_2,
popupOptions = popupOptions(autoClose = FALSE, closeOnClick = FALSE)) %>%
addPolygons(group = "Suburb Polygon - blue",
data = myshape2,
weight = 1,
fillOpacity = 0.4,
opacity = 1,
smoothFactor = 0.5,
color = ~"blue",
highlight = highlightOptions(weight = 2,
fillOpacity = 0.2),
label = ~QLD_LOCA_2,
labelOptions = labelOptions(direction = "auto",
textsize = "15px"),
popup = myshape2$QLD_LOCA_2,
popupOptions = popupOptions(autoClose = FALSE, closeOnClick = FALSE)) %>%
addCircleMarkers(group = "House - 1 room",
data = house_1rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "House - 2 room",
data = house_2rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "House - 3 room",
data = house_3rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "House - 4 room",
data = house_4rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "House - 5 room",
data = house_5rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Townhouse - 1 room",
data = townhouse_1rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Townhouse - 2 room",
data = townhouse_2rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Townhouse - 3 room",
data = townhouse_3rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Townhouse - 4 room",
data = townhouse_4rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Townhouse - 5 room",
data = townhouse_5rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Unit - 1 room",
data = unit_1rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Unit - 2 room",
data = unit_2rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Unit - 3 room",
data = unit_3rm,
lat = ~lat,
lng = ~lon,
radius = ~10,
opacity = 1,
weight = 1,
label = ~map_lab,
popup = ~map_popup,
color = ~df8.3_pal(colour_range)) %>%
addCircleMarkers(group = "Overall Sales Cluster",
data = bris.prop,
lat = ~ lat,
lng = ~ lon,
label = ~ paste0(suburb,
" $", prettyNum(price, big.mark = ","), ", ",
paste0("sold in ", year), ", ",
property_type, ", ",
paste0(bedrooms, " rooms")),
weight = 1, # thickness of the circle outline
radius = 10, # size of circle
clusterOptions = markerClusterOptions(),
color = ~"Blue"
) %>%
addPolygons(group = "Suburb Polygon - Averaged",
data = myshape2,
weight = 1.5,
color = "black",
fillOpacity = 0.7,
smoothFactor = 0.5,
dashArray = 3,
fillColor = pal(df8.1$Average_price),
highlight = highlightOptions(
fillOpacity = 0.6,
bringToFront = F),
label = lapply(labs, HTML),
labelOptions = labelOptions(direction = "auto",
textsize = "15px"))