Module 1 (R Assignment)

Description of Dataset

The dataset chosen for this assignment captures property sales in the city of Melbourne for 2016 and 2017. It is publicly accessible on Kaggle and can be downloaded from the following link: https://www.kaggle.com/datasets/amalab182/property-salesmelbourne-city.

There are 22 features in the dataset that capture information about the property itself, where the property is located, and other details relevant to the sale (e.g. sale price, method of sale, sale date). Before visualizing the data, some basic descriptive statistics were computed. In its unaltered state, the dataset contains 18,396 individual observations, each one a recorded sale. Eleven of the 22 features are missing at least one value, with the year the property was built and the building area having the highest percentages of missing data. The most expensive property sold for 9 million AUD and the cheapest for only 85,000 AUD. Lastly, 305 unique real estate agencies handled the sales in the dataset.
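The descriptive statistics described above could be computed along the following lines. This is only a sketch on a small made-up data frame: the values are invented, and the column names YearBuilt and BuildingArea are assumptions about how the dataset labels those features (Price and SellerG appear in the code later in this report).

```r
# Toy stand-in for the real data; every value here is made up for illustration.
# YearBuilt and BuildingArea are assumed column names.
toy <- data.frame(
  Price        = c(85000, 1200000, 9000000, 640000),
  YearBuilt    = c(1970, NA, NA, 2005),
  BuildingArea = c(NA, 120, NA, NA),
  SellerG      = c("A", "B", "A", "C")
)

n_sales      <- nrow(toy)                             # number of recorded sales
pct_missing  <- round(100 * colMeans(is.na(toy)), 1)  # % of values missing per feature
n_incomplete <- sum(pct_missing > 0)                  # features with at least one gap
price_range  <- range(toy$Price, na.rm = TRUE)        # cheapest and most expensive sale
n_agencies   <- length(unique(toy$SellerG))           # distinct real estate agencies

# Features ordered from most to least missing
sort(pct_missing, decreasing = TRUE)
```

Swapping toy for the full data frame would give the figures quoted above, provided the column names match.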

library(psych)
library(lubridate)
library(ggplot2)
library(scales)
library(dplyr)
library(ggthemes)
library(RColorBrewer)
library(tidyr)
library(stringr)
library(leaflet)

Visualizations and Findings

df <- read.csv("D:\\Users\\1rodg\\Files\\Graduate School\\DS736 Data Visualization\\Module 1\\datasets\\Property Sales of Melbourne City.csv", row.names="X")

ColsToFactor <- c("Suburb", "Type", "Method", "SellerG", "CouncilArea",
                  "Regionname")
df[ColsToFactor] <- lapply(df[ColsToFactor], as.factor)

plot1df <- df[, names(df) %in% c('Date'), drop= FALSE]

x <- dmy(plot1df$Date)
plot1df$salemonthname <- months(x, abbreviate = TRUE)
plot1df$saledow <- weekdays(x, abbreviate = TRUE)

plot1df <- plot1df %>%
  mutate(month = salemonthname, dayoftheweek = saledow) %>%
  group_by(month, dayoftheweek) %>%
  summarise(n = length(Date), .groups='keep') %>%
  data.frame()

plot1df <- plot1df %>%
  complete(month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), 
           dayoftheweek = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), 
           fill = list(n = 0))

# Weekday factor in calendar order, used as the x aesthetic in the multi-line plot
day_order <- factor(plot1df$dayoftheweek, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
plot1df$month <- factor(plot1df$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", 
                                                  "Jun", "Jul", "Aug", "Sep", "Oct", 
                                                  "Nov", "Dec"))

plot2df <- df[, names(df) %in% c('Distance'), drop = FALSE]
plot2df <- plot2df[!is.na(plot2df$Distance), , drop = FALSE]

plot3df <- df[, names(df) %in% c('Type', 'Regionname')]
plot3df <- plot3df[!plot3df$Regionname == "", , drop = FALSE]
plot3df <- plot3df %>%
  group_by(Regionname, Type) %>%
  summarise(n=length(Type), .groups='keep') %>%
  group_by(Regionname) %>%
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()
plot3df <- plot3df %>%
  mutate(Type = recode(Type, "h" = "House", "t" = "Townhouse", "u" = "Apartment"))
plot3df$Regionname <- str_wrap(plot3df$Regionname, width = 10)

plot4df <- df[, names(df) %in% c('Method', 'Regionname')]
plot4df <- plot4df[!plot4df$Regionname == "", , drop = FALSE]
plot4df <- plot4df %>%
  group_by(Regionname, Method) %>%
  summarise(n=length(Method), .groups='keep') %>%
  data.frame()

plot4df <- plot4df %>%
  complete(Regionname = c("Eastern Metropolitan", "Eastern Victoria", "Northern Metropolitan", "Northern Victoria", 
  "South-Eastern Metropolitan", "Southern Metropolitan", "Western Metropolitan", "Western Victoria"), 
           Method = c("PI", "S", "SA", "SP", "VB"), 
           fill = list(n = 0))

plot5df <- df[!is.na(df$Lattitude) & !is.na(df$Longtitude), names(df) %in% c('Lattitude', 'Longtitude', 'Price')]
plot5df_max <- plot5df[order(-plot5df$Price), ][1:25,]
row.names(plot5df_max) <- NULL
plot5df_min <- plot5df[order(plot5df$Price), ][1:25,]

Multi-Line Plot

The plot below shows the number of sales recorded for each month, broken down by day of the week. Some months appear to have more sales than others because they occur in both years of the dataset, while other months occur in only one. To display this more accurately, the counts could have been standardized by averaging the sales for months that appear twice (once in 2016 and once in 2017). Beyond that, it is somewhat surprising that the majority of sales occurred on Saturdays, with hardly any during the week. This may be because the properties are selling at auction and Saturdays are when the auction houses are busiest. This plot was included from the perspective of a potential buyer, who can use it to see when they would face the most competition when looking to purchase a property in Melbourne.

ggplot(plot1df, aes(x = day_order, y = n, group=month)) +
  geom_line(aes(color=month), linewidth=3) + 
  labs(title = "Melbourne Sales by Day and Month", x = "Days of the Week", y = "Sale Counts") +
  theme_light() +
  theme(plot.title = element_text(hjust=0.5),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 14)) +
  geom_point(shape=21, size=5, color="black", fill="white") +
  scale_y_continuous(labels=comma) + 
  scale_color_brewer(palette = "Paired", name="Month")
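The standardization mentioned for the multi-line plot, averaging the counts of months that appear in both years, might look something like the sketch below. The dates are made up, and the column names sale_year, sale_month, sale_dow and avg_n are hypothetical; it assumes the same day/month/year text format that dmy() parses in the code above.

```r
library(dplyr)
library(lubridate)

# Made-up sale dates in day/month/year format, for illustration only.
toy <- data.frame(Date = c("3/09/2016", "10/09/2016", "17/09/2016",
                           "2/09/2017", "4/02/2017"))

avg_sales <- toy %>%
  mutate(d          = dmy(Date),
         sale_year  = year(d),
         sale_month = month(d, label = TRUE, abbr = TRUE),
         sale_dow   = wday(d, label = TRUE, abbr = TRUE, week_start = 1)) %>%
  # Sales per year, month and weekday
  count(sale_year, sale_month, sale_dow, name = "n") %>%
  # Average over however many years each month/weekday pair was observed in
  group_by(sale_month, sale_dow) %>%
  summarise(avg_n = mean(n), .groups = "drop")

avg_sales
```

A month that occurs in only one year keeps its single-year count, while a month that occurs in both years is reduced to its per-year average, which puts all twelve lines on a comparable scale.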

Histogram

The plot below is a histogram of the number of properties sold at a given distance from the Melbourne central business district (CBD). Most of the properties in the dataset sold between 2 and 14 kilometers from the CBD. Beyond 14 kilometers, the counts taper off into a long right tail, meaning fewer and fewer properties are sold at these distances. This plot was again included from the perspective of a potential buyer. From it, a buyer can easily see where others have purchased homes in relation to where they may work, and can tell that it is fairly common for properties to sell at roughly 6, 8, and 12 kilometers from the CBD.

ggplot(plot2df, aes(x=Distance)) +
  geom_histogram(bins = 25, color="darkgreen", fill="lightgreen") +
  # Label each bar with its count, using the same 25 bins as the histogram
  geom_text(stat = 'bin', aes(label = comma(after_stat(count))), 
            bins = 25, 
            vjust = -0.5, 
            color = "darkred",
            size = 4,
            angle = 35,
            position = position_nudge(x = 0.75)) + 
  labs(title = "Histogram of Distance From Property to Melbourne CBD", x = "Distance (KM)",
       y = "Number of Properties") +
  theme_light() +
  theme(plot.title = element_text(hjust=0.5)) +
  # A single y scale carries both the comma labels and the limits; a separate
  # ylim() call would replace this scale and drop the comma formatting
  scale_y_continuous(labels = comma, limits = c(0, 3500)) +
  scale_x_continuous(breaks = seq(0, 50, by = 2))
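The reading of the histogram, that most sales fall between 2 and 14 kilometers from the CBD, can also be checked numerically. The following is a minimal sketch using made-up distances in place of the Distance column; the variable names are hypothetical.

```r
# Made-up distances (km) standing in for the Distance column.
distance_km <- c(1.5, 2.5, 5, 6, 8, 11.2, 12, 13.5, 20, 35)

# Share of properties sold between 2 and 14 km from the CBD
share_2_14 <- mean(distance_km >= 2 & distance_km <= 14)

# Counts per 2 km band, matching the 2 km axis breaks used on the histogram
bands <- cut(distance_km, breaks = seq(0, 50, by = 2), include.lowest = TRUE)
band_counts <- table(bands)

share_2_14
band_counts[band_counts > 0]
```

Run against the real column, the share gives a single number to quote alongside the plot, and the band counts show which 2 km ranges are the most common.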

Pie Charts

The plot below contains one pie chart per region around Melbourne, showing the share of each property type sold there. It was included to help a prospective buyer find the region where they may have the best luck purchasing a property of a specific type. For instance, a buyer who wants to live in a townhouse or apartment should consider one of the Metropolitan regions, with Southern Metropolitan being the best bet, because the Victoria regions have sold few to no properties of these types.

ggplot(data = plot3df, aes(x="", y = n, fill = Type)) +
  geom_bar(stat="identity", position = 'fill') +
  coord_polar(theta = "y", start = 0) +
  labs(fill = "Property Type", x = NULL, y = NULL, title = "% of Property Type by Melbourne Regions",
       caption = "Slices under 5% are not labeled") + 
  theme_light() + 
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  facet_wrap(~Regionname, ncol=3, nrow=3) + 
  scale_fill_brewer(palette = "Dark2") + 
  geom_text(aes(x=1.15, label=ifelse(percent_of_total>5, paste0(percent_of_total,"%"),"")),
            size=3.5,
            position=position_fill(vjust = 0.5))

Heatmap

The plot below is a heatmap of how many times each method of sale occurred in each region. This visualization gives a potential buyer a quick way to see where the majority of people are buying and whether any regions have more sub-optimal methods of sale than others. For instance, Northern and Southern Metropolitan have the highest numbers of properties sold normally at auction, whereas Northern and Western Metropolitan have the most properties sold prior by the auction house. Depending on the buyer, the fact that the auction house has sold the properties before may be seen as a good thing and make them more inclined to bid on properties in those regions.

ggplot(plot4df, aes(x = Method, y= Regionname, fill = n)) + 
  geom_tile(color="black") + 
  geom_text(aes(label=comma(n))) +
  coord_equal(ratio = 1) + 
  labs(title="Sale Methods by Melbourne Regions",
       x = "Method of Sale",
       y = "Region Name",
       fill = "Sale Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_continuous(low="white", high = "darkgreen") + 
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour="black")))

Map

The last visualization shows where the 25 cheapest and 25 most expensive properties in Melbourne were sold. This may be particularly useful for higher-stakes buyers looking to purchase in neighborhoods where others have previously bought, or for property flippers who want to purchase in areas similar to those where properties have sold for more. On the map below, the high-dollar properties cluster just southeast of the Melbourne Central Business District, whereas the majority of the cheaper properties sold to the northwest. Oddly enough, the property that sold for the most overall is the one farthest from the CBD (and from the majority of the other high-end properties), so it may be worth treating it as an outlier and removing it.

m <- leaflet() %>%
  addTiles() %>%
  
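  # Green markers: the 25 most expensive sales
  addCircleMarkers(
    data = plot5df_max,
    lng = ~Longtitude, lat = ~Lattitude,
    # scientific = FALSE and trim = TRUE keep prices like 9,000,000 in plain, unpadded form
    popup = ~paste0("Price: $", format(Price, big.mark = ",", scientific = FALSE, trim = TRUE)),
    radius = 6,
    color = "darkgreen",
    fillOpacity = 0.8,
    label = ~paste0("Price: $", format(Price, big.mark = ",", scientific = FALSE, trim = TRUE))
  ) %>%
  
  # Red markers: the 25 least expensive sales
  addCircleMarkers(
    data = plot5df_min,
    lng = ~Longtitude, lat = ~Lattitude,
    popup = ~paste0("Price: $", format(Price, big.mark = ",", scientific = FALSE, trim = TRUE)),
    radius = 6,
    color = "darkred",
    fillOpacity = 0.8,
    label = ~paste0("Price: $", format(Price, big.mark = ",", scientific = FALSE, trim = TRUE))
  ) %>%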
  
  setView(lng = mean(c(plot5df_max$Longtitude, plot5df_min$Longtitude)),
          lat = mean(c(plot5df_max$Lattitude, plot5df_min$Lattitude)),
          zoom = 11) %>%
  
  addLegend(
    position = "topright",
    colors = c("darkgreen", "darkred"),
    labels = c("Most Expensive Sales", "Least Expensive Sales"),
    title = "Price Categories (Top/Bottom 25)"
  )
m

Wrap up

Overall, all five of these visualizations could provide critical information for someone who is interested in purchasing a property in or around the Melbourne Central Business District.