library(readr)
library(magrittr)
library(dplyr)
library(tidyverse)
library(lubridate)
library(Hmisc)
Data preprocessing, or data wrangling, is very important for analysis because real-world data is almost always messy. Raw data needs to be rearranged, reshaped, cleaned, and filtered, and missing values and outliers must be handled, before the data is ready for statistical analysis.
There are two datasets, covering population and property prices in Australia. The data preprocessing steps are:
1. Import the datasets into RStudio.
2. Check the variable types and convert them to appropriate types.
3. Scan for missing values as well as infinite and NaN values.
4. Create new variables for filtering only the relevant records.
5. Select and rename variables to prepare for merging the datasets.
6. Compute summary values to prepare for merging the datasets.
7. Convert the untidy data to a tidy format.
8. Merge the two datasets.
9. Scan for outliers with box plots and decide how to deal with them.
10. Create histograms and apply a transformation to change the scale and move the data towards normality.
The two datasets were collected from the open data website kaggle.com.
(https://www.kaggle.com/koki25ando/city-list-of-australia)
This dataset contains the population of each city in Australia in 2011 and 2017. In this report, the important variables are:
Rank : Population rank of the city
GCCSA.SUA : City name
June.2017.2. : Population in 2017
(https://www.kaggle.com/htagholdings/aus-real-estate-sales-march-2019-to-april-2020)
This dataset contains property prices for three property types (house, townhouse, unit) in the major capital cities Sydney, Melbourne, Brisbane, Adelaide, Perth, and Canberra, between 2018 and 2020. The important variables are:
date_sold : Date the property was sold
price : Price of property
city_name : City name
property_type : Type of property
ppl <- read.csv("AUS_state.csv")
print.data.frame(head(ppl))
## Rank GCCSA.SUA State.Territory June.2017.2.
## 1 1 Sydney New South Wales 5,131,326
## 2 2 Melbourne Victoria 4,850,740
## 3 3 Brisbane Queensland 2,408,223
## 4 4 Perth Western Australia 2,043,138
## 5 5 Adelaide South Australia 1,333,927
## 6 6 Gold Coast–Tweed Heads Queensland/New South Wales 663,321
## X2011.Census.3..Population Growth
## 1 4,391,674 16.84%
## 2 3,999,982 21.27%
## 3 2,065,996 16.56%
## 4 1,728,867 18.18%
## 5 1,262,940 5.62%
## 6 557,822 18.91%
## Percentage.of.national.population..June.2017.
## 1 20.86%
## 2 19.72%
## 3 9.79%
## 4 8.31%
## 5 5.42%
## 6 2.70%
sale <- read.csv("aus-property-sales.csv")
print.data.frame(head(sale))
## date_sold price suburb city_name state lat
## 1 2018-09-18 00:00:00 NULL Darling Point Sydney NSW -33.869565
## 2 2018-09-24 00:00:00 NULL Darling Point Sydney NSW -33.872179
## 3 2018-09-26 00:00:00 1730000 Darling Point Sydney NSW -33.868386
## 4 2018-09-26 00:00:00 1928000 Darling Point Sydney NSW -33.875465
## 5 2018-10-02 00:00:00 1475000 Darling Point Sydney NSW -33.875734
## 6 2018-10-12 00:00:00 960000 Darling Point Sydney NSW -33.869516
## lon bedrooms property_type loc_pid lga_pid
## 1 151.241317 3 unit NSW1221 NSW180
## 2 151.239726 3 unit NSW1221 NSW180
## 3 151.237471 2 unit NSW1221 NSW180
## 4 151.23628 3 unit NSW1221 NSW180
## 5 151.233575 3 unit NSW1221 NSW180
## 6 151.237239 2 unit NSW1221 NSW180
To check the type of each variable, we use the str() function.
str(ppl)
## 'data.frame': 96 obs. of 7 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GCCSA.SUA : chr "Sydney" "Melbourne" "Brisbane" "Perth" ...
## $ State.Territory : chr "New South Wales" "Victoria" "Queensland" "Western Australia" ...
## $ June.2017.2. : chr "5,131,326" "4,850,740" "2,408,223" "2,043,138" ...
## $ X2011.Census.3..Population : chr "4,391,674" "3,999,982" "2,065,996" "1,728,867" ...
## $ Growth : chr "16.84%" "21.27%" "16.56%" "18.18%" ...
## $ Percentage.of.national.population..June.2017.: chr "20.86%" "19.72%" "9.79%" "8.31%" ...
str(sale)
## 'data.frame': 320334 obs. of 11 variables:
## $ date_sold : chr "2018-09-18 00:00:00" "2018-09-24 00:00:00" "2018-09-26 00:00:00" "2018-09-26 00:00:00" ...
## $ price : chr "NULL" "NULL" "1730000" "1928000" ...
## $ suburb : chr "Darling Point" "Darling Point" "Darling Point" "Darling Point" ...
## $ city_name : chr "Sydney" "Sydney" "Sydney" "Sydney" ...
## $ state : chr "NSW" "NSW" "NSW" "NSW" ...
## $ lat : chr "-33.869565" "-33.872179" "-33.868386" "-33.875465" ...
## $ lon : chr "151.241317" "151.239726" "151.237471" "151.23628" ...
## $ bedrooms : int 3 3 2 3 3 2 3 2 3 3 ...
## $ property_type: chr "unit" "unit" "unit" "unit" ...
## $ loc_pid : chr "NSW1221" "NSW1221" "NSW1221" "NSW1221" ...
## $ lga_pid : chr "NSW180" "NSW180" "NSW180" "NSW180" ...
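Note that price is read as a character column because missing prices are stored in the CSV as the literal string "NULL". As an alternative sketch (assuming the same file, and using read_csv() from readr, which is already loaded), the string "NULL" could be declared as a missing-value marker so that price is parsed as numeric straight away:
# Sketch: treat "NULL" as NA while reading, so price parses as numeric
sale_alt <- read_csv("aus-property-sales.csv",
                     na = c("", "NA", "NULL"))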
In this section, we keep only the top ten cities by population and convert Rank to an ordered factor. (The top ten ranks are enough because the property price dataset covers only six cities, so every city in that dataset can still be matched when the two datasets are merged.)
ppl <- ppl[1:10, ]
ppl$Rank <- factor(ppl$Rank, levels = 1:10, ordered = TRUE)
print.data.frame(head(ppl))
## Rank GCCSA.SUA State.Territory June.2017.2.
## 1 1 Sydney New South Wales 5,131,326
## 2 2 Melbourne Victoria 4,850,740
## 3 3 Brisbane Queensland 2,408,223
## 4 4 Perth Western Australia 2,043,138
## 5 5 Adelaide South Australia 1,333,927
## 6 6 Gold Coast–Tweed Heads Queensland/New South Wales 663,321
## X2011.Census.3..Population Growth
## 1 4,391,674 16.84%
## 2 3,999,982 21.27%
## 3 2,065,996 16.56%
## 4 1,728,867 18.18%
## 5 1,262,940 5.62%
## 6 557,822 18.91%
## Percentage.of.national.population..June.2017.
## 1 20.86%
## 2 19.72%
## 3 9.79%
## 4 8.31%
## 5 5.42%
## 6 2.70%
We convert date_sold from character to a date-time using the ymd_hms() function from lubridate.
sale$date_sold <- ymd_hms(sale$date_sold )
We convert price from character to numeric with the as.numeric() function; the literal "NULL" strings cannot be parsed, so they become NA and trigger the warning below.
sale$price <- as.numeric(sale$price)
## Warning: NAs introduced by coercion
We scan price for missing values, and also define a function named is.special() to check for infinite and NaN values.
sum(is.na(sale$price))
## [1] 62508
# Returns TRUE for entries of a numeric vector that are infinite or NaN
is.special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sum(is.special(sale$price))
## [1] 0
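The same check can be applied to every numeric column at once; a minimal base-R sketch:
# Sketch: count infinite/NaN entries in every numeric column of sale
num_cols <- sapply(sale, is.numeric)
sapply(sale[num_cols], function(x) sum(is.infinite(x) | is.nan(x)))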
There are 62,508 missing prices and no infinite or NaN values. Because there are many sale records for each city and property type and we only need the average price of each property type in each city for the year, the observations with missing prices can simply be excluded.
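If we preferred to drop those rows explicitly rather than rely on na.rm = TRUE in the averaging step later on, a minimal sketch would be:
# Sketch: keep only sales with a recorded price
sale_complete <- sale %>% filter(!is.na(price))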
We use the mutate() function to create a new variable named year, which extracts the year from date_sold, so that only sales from 2018 can be kept with the filter() function.
sale_yr <- sale %>% mutate(year = year(date_sold))
sale_2018 <- sale_yr %>% filter(year == 2018)
print.data.frame(head(sale_2018))
## date_sold price suburb city_name state lat lon
## 1 2018-09-18 NA Darling Point Sydney NSW -33.869565 151.241317
## 2 2018-09-24 NA Darling Point Sydney NSW -33.872179 151.239726
## 3 2018-09-26 1730000 Darling Point Sydney NSW -33.868386 151.237471
## 4 2018-09-26 1928000 Darling Point Sydney NSW -33.875465 151.23628
## 5 2018-10-02 1475000 Darling Point Sydney NSW -33.875734 151.233575
## 6 2018-10-12 960000 Darling Point Sydney NSW -33.869516 151.237239
## bedrooms property_type loc_pid lga_pid year
## 1 3 unit NSW1221 NSW180 2018
## 2 3 unit NSW1221 NSW180 2018
## 3 2 unit NSW1221 NSW180 2018
## 4 3 unit NSW1221 NSW180 2018
## 5 3 unit NSW1221 NSW180 2018
## 6 2 unit NSW1221 NSW180 2018
The two datasets are not tidy, so in this section they are reshaped into a tidy format. First, the important variables are selected. Next, some variables in the population dataset are renamed with the rename() function so that the two datasets can be merged. For the property price dataset, the average price is computed for each city and property type. The spread() function then spreads property_type into separate columns whose values are the average prices. Finally, a new variable named average, the mean property price in each city, is created.
ppl <- ppl %>% select(Rank, GCCSA.SUA, June.2017.2.)
ppl <- ppl %>% rename("city_name"= "GCCSA.SUA" ,
"population" = "June.2017.2.")
sale_2018 <- sale_2018 %>% select(year, city_name, price, property_type)
sale_avg <- sale_2018 %>% group_by(city_name, property_type ) %>%
summarise(avg_price = mean(price, na.rm = TRUE))
print.data.frame(head(sale_avg))
## city_name property_type avg_price
## 1 Adelaide house 528790.7
## 2 Adelaide townhouse 462211.4
## 3 Adelaide unit 328053.4
## 4 Brisbane house 775280.0
## 5 Brisbane townhouse 486994.2
## 6 Brisbane unit 478374.4
sale_type <- sale_avg %>% spread(key = "property_type", value = "avg_price")
print.data.frame(head(sale_type))
## city_name house townhouse unit
## 1 Adelaide 528790.7 462211.4 328053.4
## 2 Brisbane 775280.0 486994.2 478374.4
## 3 Canberra 770609.0 514242.3 436899.0
## 4 Melbourne 922952.6 807831.5 553977.3
## 5 Perth 600768.2 553877.8 411650.1
## 6 Sydney 1327468.2 905157.0 858744.0
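In current versions of tidyr, spread() is superseded by pivot_wider(); an equivalent call (a sketch, assuming tidyr >= 1.0.0 as bundled with the tidyverse) would be:
# Sketch: the same reshaping with the newer pivot_wider() interface
sale_type_pw <- sale_avg %>%
  pivot_wider(names_from = property_type, values_from = avg_price)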
We use the mutate() function to create the average property price for each city, and the left_join() function to combine it with the per-type average prices.
price_of_city <- sale_type %>% group_by(city_name) %>%
mutate(average = (house+ townhouse+ unit)/3)
all_price <- sale_type %>% left_join(price_of_city)
print.data.frame(head(all_price))
## city_name house townhouse unit average
## 1 Adelaide 528790.7 462211.4 328053.4 439685.2
## 2 Brisbane 775280.0 486994.2 478374.4 580216.2
## 3 Canberra 770609.0 514242.3 436899.0 573916.8
## 4 Melbourne 922952.6 807831.5 553977.3 761587.1
## 5 Perth 600768.2 553877.8 411650.1 522098.7
## 6 Sydney 1327468.2 905157.0 858744.0 1030456.4
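Since price_of_city already contains every column of sale_type plus the new average column, the left join mainly gathers everything into one object; as a sketch, the same table could be produced in a single step:
# Sketch: compute the per-city average directly, without a join
all_price_direct <- sale_type %>%
  mutate(average = (house + townhouse + unit) / 3)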
We merge the two datasets using the inner_join() function. An inner join is used because we want to compare population and property prices only for cities that appear in both datasets.
pop_prop <- ppl %>% inner_join(all_price, by="city_name")
print.data.frame(head(pop_prop))
## Rank city_name population house townhouse unit average
## 1 1 Sydney 5,131,326 1327468.2 905157.0 858744.0 1030456.4
## 2 2 Melbourne 4,850,740 922952.6 807831.5 553977.3 761587.1
## 3 3 Brisbane 2,408,223 775280.0 486994.2 478374.4 580216.2
## 4 4 Perth 2,043,138 600768.2 553877.8 411650.1 522098.7
## 5 5 Adelaide 1,333,927 528790.7 462211.4 328053.4 439685.2
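Note that inner_join() keeps a city only when city_name matches exactly in both tables, which is why just five of the six cities with price data appear above: Canberra is probably recorded under a different (combined) name in the population table. Assuming, purely for illustration, that this label were "Canberra–Queanbeyan", it could be harmonised before joining:
# Sketch: recode a hypothetical combined label so Canberra also matches
ppl_fixed <- ppl %>%
  mutate(city_name = recode(city_name, "Canberra–Queanbeyan" = "Canberra"))
pop_prop_all <- ppl_fixed %>% inner_join(all_price, by = "city_name")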
We scan the property prices for outliers using the boxplot() function.
boxplot(pop_prop$house, pop_prop$townhouse, pop_prop$unit,
names = c("House", "townhouse", "unit"),
xlab = "Property type",
ylab = "Price (AUD)",
main = "Price of properties")
The box plots show only one outlier: the maximum unit price. To find that maximum, we use the summary() function before excluding the outlier.
summary(pop_prop$unit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 328053 411650 478374 526160 553977 858744
The maximum unit price is 858,744. We therefore use subset() to isolate that observation and the anti_join() function to remove it from the dataset.
unit_out <- pop_prop %>% subset(unit >= 858743)
no_out <- pop_prop %>% anti_join(unit_out)
no_out$unit %>% boxplot(xlab = "Unit",
                        ylab = "Price (AUD)")
We use the hist() function to create histograms of the property prices and apply an appropriate transformation: the log10() function changes the scale of the variables and should move the distributions towards normality. Unfortunately, the merged dataset contains only a handful of cities, so even the transformed histograms do not look normal, as the code below shows.
Histogram of house price.
hist(pop_prop$house,
main = "Histogram of house price",
xlab = "house")
log_h <- log10(pop_prop$house)
hist(log_h,
main = "Histogram of house price (log10)",
xlab = "house")
Histogram of townhouse price.
hist(pop_prop$townhouse,
main = "Histogram of townhouse price",
xlab = "townhouse")
log_th <- log10(pop_prop$townhouse)
hist(log_th,
main = "Histogram of townhouse price (log10)",
xlab = "townhouse")
Histogram of unit price.
hist(pop_prop$unit,
main = "Histogram of unit price",
xlab = "unit")
log_unit <- log10(pop_prop$unit)
hist(log_unit,
main = "Histogram of unit price (log10)",
xlab = "unit")
For future analysis, larger datasets are needed so that, after merging, there are enough observations to properly assess and achieve normality.
Kaggle, AUS Real Estate Sales September 2018 to June 2020, Kaggle, viewed 1 October 2020, https://www.kaggle.com/htagholdings/aus-real-estate-sales-march-2019-to-april-2020
Kaggle, Cities in Australia, Kaggle, viewed 1 October 2020, https://www.kaggle.com/koki25ando/city-list-of-australia