library(readr)
library(magrittr)
library(dplyr)
library(tidyverse)
library(lubridate)
library(Hmisc)
Data preprocessing, or data wrangling, is very important for analysis because real-world data is almost always messy. Raw data needs to be rearranged, reshaped, cleaned, and filtered, and missing values and outliers must be handled, before the data is ready for statistical analysis.
There are two datasets, covering population and property prices in Australia. The data preprocessing steps are:
1. Import the datasets into RStudio.
2. Check the variable types and convert them to appropriate types.
3. Scan for missing values as well as infinite and NaN values.
4. Create new variables for filtering only the relevant records.
5. Select and rename variables to prepare for merging the datasets.
6. Compute summary values to prepare for merging the datasets.
7. Convert the untidy data to a tidy format.
8. Merge the two datasets.
9. Scan for outliers with box plots and decide how to deal with them.
10. Create histograms and apply a transformation to change the scale and move the data towards normality.
The two datasets were collected from the open data website kaggle.com.
(https://www.kaggle.com/koki25ando/city-list-of-australia)
This dataset contains the population of each city in Australia in 2011 and 2017. In this report, the important variables are:
Rank : Population rank of the city
GCCSA.SUA : City name
June.2017.2. : Population in 2017
(https://www.kaggle.com/htagholdings/aus-real-estate-sales-march-2019-to-april-2020)
This dataset contains property prices for three property types (house, townhouse, unit) in the major capital cities Sydney, Melbourne, Brisbane, Adelaide, Perth, and Canberra, between 2018 and 2020. The important variables are:
date_sold : Date the property was sold
price : Price of property
city_name : City name
property_type : Type of property
ppl <- read.csv("AUS_state.csv")
print.data.frame(head(ppl))
## Rank GCCSA.SUA State.Territory June.2017.2.
## 1 1 Sydney New South Wales 5,131,326
## 2 2 Melbourne Victoria 4,850,740
## 3 3 Brisbane Queensland 2,408,223
## 4 4 Perth Western Australia 2,043,138
## 5 5 Adelaide South Australia 1,333,927
## 6 6 Gold Coast–Tweed Heads Queensland/New South Wales 663,321
## X2011.Census.3..Population Growth
## 1 4,391,674 16.84%
## 2 3,999,982 21.27%
## 3 2,065,996 16.56%
## 4 1,728,867 18.18%
## 5 1,262,940 5.62%
## 6 557,822 18.91%
## Percentage.of.national.population..June.2017.
## 1 20.86%
## 2 19.72%
## 3 9.79%
## 4 8.31%
## 5 5.42%
## 6 2.70%
sale <- read.csv("aus-property-sales.csv")
print.data.frame(head(sale))
## date_sold price suburb city_name state lat
## 1 2018-09-18 00:00:00 NULL Darling Point Sydney NSW -33.869565
## 2 2018-09-24 00:00:00 NULL Darling Point Sydney NSW -33.872179
## 3 2018-09-26 00:00:00 1730000 Darling Point Sydney NSW -33.868386
## 4 2018-09-26 00:00:00 1928000 Darling Point Sydney NSW -33.875465
## 5 2018-10-02 00:00:00 1475000 Darling Point Sydney NSW -33.875734
## 6 2018-10-12 00:00:00 960000 Darling Point Sydney NSW -33.869516
## lon bedrooms property_type loc_pid lga_pid
## 1 151.241317 3 unit NSW1221 NSW180
## 2 151.239726 3 unit NSW1221 NSW180
## 3 151.237471 2 unit NSW1221 NSW180
## 4 151.23628 3 unit NSW1221 NSW180
## 5 151.233575 3 unit NSW1221 NSW180
## 6 151.237239 2 unit NSW1221 NSW180
To check the type of each variable, we use the str() function.
str(ppl)
## 'data.frame': 96 obs. of 7 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GCCSA.SUA : chr "Sydney" "Melbourne" "Brisbane" "Perth" ...
## $ State.Territory : chr "New South Wales" "Victoria" "Queensland" "Western Australia" ...
## $ June.2017.2. : chr "5,131,326" "4,850,740" "2,408,223" "2,043,138" ...
## $ X2011.Census.3..Population : chr "4,391,674" "3,999,982" "2,065,996" "1,728,867" ...
## $ Growth : chr "16.84%" "21.27%" "16.56%" "18.18%" ...
## $ Percentage.of.national.population..June.2017.: chr "20.86%" "19.72%" "9.79%" "8.31%" ...
str(sale)
## 'data.frame': 320334 obs. of 11 variables:
## $ date_sold : chr "2018-09-18 00:00:00" "2018-09-24 00:00:00" "2018-09-26 00:00:00" "2018-09-26 00:00:00" ...
## $ price : chr "NULL" "NULL" "1730000" "1928000" ...
## $ suburb : chr "Darling Point" "Darling Point" "Darling Point" "Darling Point" ...
## $ city_name : chr "Sydney" "Sydney" "Sydney" "Sydney" ...
## $ state : chr "NSW" "NSW" "NSW" "NSW" ...
## $ lat : chr "-33.869565" "-33.872179" "-33.868386" "-33.875465" ...
## $ lon : chr "151.241317" "151.239726" "151.237471" "151.23628" ...
## $ bedrooms : int 3 3 2 3 3 2 3 2 3 3 ...
## $ property_type: chr "unit" "unit" "unit" "unit" ...
## $ loc_pid : chr "NSW1221" "NSW1221" "NSW1221" "NSW1221" ...
## $ lga_pid : chr "NSW180" "NSW180" "NSW180" "NSW180" ...
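Note that price is read as a character column because missing prices are stored in the CSV as the literal string "NULL". As an alternative sketch (assuming the same file, and using read_csv() from readr, which is already loaded), the string "NULL" could be declared as a missing-value marker so that price is parsed as numeric straight away:
# Sketch: treat "NULL" as NA while reading, so price parses as numeric
sale_alt <- read_csv("aus-property-sales.csv",
                     na = c("", "NA", "NULL"))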
In this section, we keep only the top ten cities by population and convert Rank to an ordered factor. (The top ten ranks are enough because the property price dataset covers only six cities, so every city in that dataset can still be matched when the two datasets are merged.)
ppl <- ppl[1:10, ]
ppl$Rank <- factor(ppl$Rank, levels = 1:10, ordered = TRUE)
print.data.frame(head(ppl))
## Rank GCCSA.SUA State.Territory June.2017.2.
## 1 1 Sydney New South Wales 5,131,326
## 2 2 Melbourne Victoria 4,850,740
## 3 3 Brisbane Queensland 2,408,223
## 4 4 Perth Western Australia 2,043,138
## 5 5 Adelaide South Australia 1,333,927
## 6 6 Gold Coast–Tweed Heads Queensland/New South Wales 663,321
## X2011.Census.3..Population Growth
## 1 4,391,674 16.84%
## 2 3,999,982 21.27%
## 3 2,065,996 16.56%
## 4 1,728,867 18.18%
## 5 1,262,940 5.62%
## 6 557,822 18.91%
## Percentage.of.national.population..June.2017.
## 1 20.86%
## 2 19.72%
## 3 9.79%
## 4 8.31%
## 5 5.42%
## 6 2.70%
We convert date_sold from character to a date-time using the ymd_hms() function from lubridate.
sale$date_sold <- ymd_hms(sale$date_sold )
We convert price from character to numeric with the as.numeric() function; the literal "NULL" strings cannot be parsed, so they become NA and trigger the warning below.
sale$price <- as.numeric(sale$price)
## Warning: NAs introduced by coercion
We scan price for missing values, and also define a function named is.special() to check for infinite and NaN values.
sum(is.na(sale$price))
## [1] 62508
# Returns TRUE for entries of a numeric vector that are infinite or NaN
is.special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sum(is.special(sale$price))
## [1] 0
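The same check can be applied to every numeric column at once; a minimal base-R sketch:
# Sketch: count infinite/NaN entries in every numeric column of sale
num_cols <- sapply(sale, is.numeric)
sapply(sale[num_cols], function(x) sum(is.infinite(x) | is.nan(x)))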
There are 62,508 missing prices and no infinite or NaN values. Because there are many sale records for each city and property type and we only need the average price of each property type in each city for the year, the observations with missing prices can simply be excluded.
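If we preferred to drop those rows explicitly rather than rely on na.rm = TRUE in the averaging step later on, a minimal sketch would be:
# Sketch: keep only sales with a recorded price
sale_complete <- sale %>% filter(!is.na(price))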
We use the mutate() function to create a new variable named year, which extracts the year from date_sold, so that only sales from 2018 can be kept with the filter() function.
sale_yr <- sale %>% mutate(year = year(date_sold))
sale_2018 <- sale_yr %>% filter(year == 2018)
print.data.frame(head(sale_2018))
## date_sold price suburb city_name state lat lon
## 1 2018-09-18 NA Darling Point Sydney NSW -33.869565 151.241317
## 2 2018-09-24 NA Darling Point Sydney NSW -33.872179 151.239726
## 3 2018-09-26 1730000 Darling Point Sydney NSW -33.868386 151.237471
## 4 2018-09-26 1928000 Darling Point Sydney NSW -33.875465 151.23628
## 5 2018-10-02 1475000 Darling Point Sydney NSW -33.875734 151.233575
## 6 2018-10-12 960000 Darling Point Sydney NSW -33.869516 151.237239
## bedrooms property_type loc_pid lga_pid year
## 1 3 unit NSW1221 NSW180 2018
## 2 3 unit NSW1221 NSW180 2018
## 3 2 unit NSW1221 NSW180 2018
## 4 3 unit NSW1221 NSW180 2018
## 5 3 unit NSW1221 NSW180 2018
## 6 2 unit NSW1221 NSW180 2018
The two datasets are not tidy, so in this section they are reshaped into a tidy format. First, the important variables are selected. Next, some variables in the population dataset are renamed with the rename() function so that the two datasets can be merged. For the property price dataset, the average price is computed for each city and property type. The spread() function then spreads property_type into separate columns whose values are the average prices. Finally, a new variable named average, the mean property price in each city, is created.
ppl <- ppl %>% select(Rank, GCCSA.SUA, June.2017.2.)
ppl <- ppl %>% rename("city_name"= "GCCSA.SUA" ,
"population" = "June.2017.2.")
sale_2018 <- sale_2018 %>% select(year, city_name, price, property_type)
sale_avg <- sale_2018 %>% group_by(city_name, property_type ) %>%
summarise(avg_price = mean(price, na.rm = TRUE))
print.data.frame(head(sale_avg))
## city_name property_type avg_price
## 1 Adelaide house 528790.7
## 2 Adelaide townhouse 462211.4
## 3 Adelaide unit 328053.4
## 4 Brisbane house 775280.0
## 5 Brisbane townhouse 486994.2
## 6 Brisbane unit 478374.4
sale_type <- sale_avg %>% spread(key = "property_type", value = "avg_price")
print.data.frame(head(sale_type))
## city_name house townhouse unit
## 1 Adelaide 528790.7 462211.4 328053.4
## 2 Brisbane 775280.0 486994.2 478374.4
## 3 Canberra 770609.0 514242.3 436899.0
## 4 Melbourne 922952.6 807831.5 553977.3
## 5 Perth 600768.2 553877.8 411650.1
## 6 Sydney 1327468.2 905157.0 858744.0
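In current versions of tidyr, spread() is superseded by pivot_wider(); an equivalent call (a sketch, assuming tidyr >= 1.0.0 as bundled with the tidyverse) would be:
# Sketch: the same reshaping with the newer pivot_wider() interface
sale_type_pw <- sale_avg %>%
  pivot_wider(names_from = property_type, values_from = avg_price)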
We use the mutate() function to create the average property price for each city, and the left_join() function to combine it with the per-type average prices.
price_of_city <- sale_type %>% group_by(city_name) %>%
mutate(average = (house+ townhouse+ unit)/3)
all_price <- sale_type %>% left_join(price_of_city)
print.data.frame(head(all_price))
## city_name house townhouse unit average
## 1 Adelaide 528790.7 462211.4 328053.4 439685.2
## 2 Brisbane 775280.0 486994.2 478374.4 580216.2
## 3 Canberra 770609.0 514242.3 436899.0 573916.8
## 4 Melbourne 922952.6 807831.5 553977.3 761587.1
## 5 Perth 600768.2 553877.8 411650.1 522098.7
## 6 Sydney 1327468.2 905157.0 858744.0 1030456.4
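Since price_of_city already contains every column of sale_type plus the new average column, the left join mainly gathers everything into one object; as a sketch, the same table could be produced in a single step:
# Sketch: compute the per-city average directly, without a join
all_price_direct <- sale_type %>%
  mutate(average = (house + townhouse + unit) / 3)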
We merge the two datasets using the inner_join() function. An inner join is used because we want to compare population and property prices only for cities that appear in both datasets.
pop_prop <- ppl %>% inner_join(all_price, by="city_name")
print.data.frame(head(pop_prop))
## Rank city_name population house townhouse unit average
## 1 1 Sydney 5,131,326 1327468.2 905157.0 858744.0 1030456.4
## 2 2 Melbourne 4,850,740 922952.6 807831.5 553977.3 761587.1
## 3 3 Brisbane 2,408,223 775280.0 486994.2 478374.4 580216.2
## 4 4 Perth 2,043,138 600768.2 553877.8 411650.1 522098.7
## 5 5 Adelaide 1,333,927 528790.7 462211.4 328053.4 439685.2
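Note that inner_join() keeps a city only when city_name matches exactly in both tables, which is why just five of the six cities with price data appear above: Canberra is probably recorded under a different (combined) name in the population table. Assuming, purely for illustration, that this label were "Canberra–Queanbeyan", it could be harmonised before joining:
# Sketch: recode a hypothetical combined label so Canberra also matches
ppl_fixed <- ppl %>%
  mutate(city_name = recode(city_name, "Canberra–Queanbeyan" = "Canberra"))
pop_prop_all <- ppl_fixed %>% inner_join(all_price, by = "city_name")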
We scan the property prices for outliers using the boxplot() function.
boxplot(pop_prop$house, pop_prop$townhouse, pop_prop$unit,
names = c("House", "townhouse", "unit"),
xlab = "Property type",
ylab = "Price (AUD)",
main = "Price of properties")
The box plots show only one outlier: the maximum unit price. To find that maximum, we use the summary() function before excluding the outlier.
summary(pop_prop$unit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 328053 411650 478374 526160 553977 858744
The maximum unit price is 858,744. We therefore use subset() to isolate that observation and the anti_join() function to remove it from the dataset.
unit_out <- pop_prop %>% subset(unit >= 858743)
no_out <- pop_prop %>% anti_join(unit_out)
no_out$unit %>% boxplot(xlab = "Unit",
                        ylab = "Price (AUD)")
We use the hist() function to create histograms of the property prices and apply an appropriate transformation: the log10() function changes the scale of the variables and should move the distributions towards normality. Unfortunately, the merged dataset contains only a handful of cities, so even the transformed histograms do not look normal, as the code below shows.
Histogram of house price.
hist(pop_prop$house,
main = "Histogram of house price",
xlab = "house")
log_h <- log10(pop_prop$house)
hist(log_h,
main = "Histogram of house price (log10)",
xlab = "house")
Histogram of townhouse price.
hist(pop_prop$townhouse,
main = "Histogram of townhouse price",
xlab = "townhouse")
log_th <- log10(pop_prop$townhouse)
hist(log_th,
main = "Histogram of townhouse price (log10)",
xlab = "townhouse")
Histogram of unit price.
hist(pop_prop$unit,
main = "Histogram of unit price",
xlab = "unit")
log_unit <- log10(pop_prop$unit)
hist(log_unit,
main = "Histogram of unit price (log10)",
xlab = "unit")
For future analysis, larger datasets are needed so that, after merging, there are enough observations to properly assess and achieve normality.
Kaggle, AUS Real Estate Sales September 2018 to June 2020, Kaggle, viewed 1 October 2020, https://www.kaggle.com/htagholdings/aus-real-estate-sales-march-2019-to-april-2020
Kaggle, Cities in Australia, Kaggle, viewed 1 October 2020, https://www.kaggle.com/koki25ando/city-list-of-australia