MATH2349 Data Wrangling

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
library(readr)
library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(Hmisc)
library(ggplot2)

Executive Summary

In your own words, provide a brief summary of the preprocessing. Explain the steps that you have taken to preprocess your data. Write this section last after you have performed all data preprocessing. (Word count Max: 300 words)

The 3 datasets (housing, crime & population) were read into R using a combination of the read_csv and read_excel functions. In the case of the crime’s dataset, I created a subset of the original data using only the columns I required. After reading in all 3 datasets, I reshaped the crime dataset from wide to long format to satisfy the tidy requirements. I then cleaned each of the primary key columns (year & lga) in the 3 datasets to ensure they were matched. This involved removing additional characters and whitespace from the population dataset along with suffix text and whitespace from the housing dataset. I also split the date variable into 3 new variables of year, month and day. Once all keys had been cleaned, I joined datasets using a mutating join from the tidyr package. Using the str() function to investigate the structure of the merged dataset I found a number of variables which required conversion to factors (both ordered and unordered). Using the colSums() function I generated a count of missing values in each column. I was concerned by the high number of missing values in the price column as this was the dependant variable. I chose to impute the missing values of the critical columns and leave the remainder. This involved finding the mean price by suburb and rooms and using this as the imputed value for the missing Price values. The remainder were imputed using a median value. I used a combination of multi and univariate analysis to scan outliers and found the majority to be valuable data. With 3 notable exceptions for the Landsize, YearBuilt and BuildingArea variables, for which I used a combination of imputation methods to handle. Finally, I applied a log transformation to the Price variable for the purpose of reducing skewness.

Data

A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfil the minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.

Melbourne Housing Data

The Melbourne housing dataset is a dataset hosted via Kaggle, which includes a range of property information variables by address for properties sold in Melbourne from 2016 - 2018. The data is in CSV format and was scraped from publically available market results posted on Domain.com.au (Pino 2018).

Variable descriptions as per the source below:

Suburb: Suburb
Address: Address
Rooms: Number of rooms
Price: Price in Australian dollars
Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.
Type: br - bedroom(s); h - house,cottage,villa, semi,terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.
SellerG: Real Estate Agent
Date: Date sold
Distance: Distance from CBD in Kilometres
Regionname: General Region (West, North West, North, North east …etc)
Propertycount: Number of properties that exist in the suburb.
Bedroom2: Scraped # of Bedrooms (from different source)
Bathroom: Number of Bathrooms
Car: Number of carspots
Landsize: Land Size in Metres
BuildingArea: Building Size in Metres
YearBuilt: Year the house was built
CouncilArea: Governing council for the area
Lattitude: Self explanitory
Longtitude: Self explanitory

Crime Statistics Data

The Crime Statistics dataset is a CSV formatted file sourced directly from the Crime Statistics Agency Victoria (CSA 2019), providing a total of the criminal incidents recorded by year and Local Government Area (LGA) in Victoria from 2011-2020. The dataset is sorted by Police Region with a totals row at the bottom of each group. When reading in the file we will select only the LGA name and years 2016-2018 for our analysis as this aligns with the date range of the Melbourne housing dataset.

Variable descriptions below:

Police Region: Regions of Victoria as defined by Victoria Police
Local Government Area: Victorian Local Government Areas (LGA)
Year (2011-2020): A count of the total criminal incedents reported for the given year

ABS Population Data

The ABS Population data is an XLS formatted dataset extracted directly via the Australian Bureau of Statistics open data portal (ABS 2016). The original file provides a range of population metrics (births/deaths/arrivals/departures etc.) for 2018. For the purpose of our analysis, we will only be using the below two variables:

LGA: Victorian Local Government Areas (LGA)
Population: The estimated population of the relevant LGA

Part 1:

# This is the R chunk for the Data Section

# read in the Melbourne housing dataset
housing <- read_csv('datasets\\Melbourne_housing_FULL.csv')

# display the head
head(housing)

# read in the Crime statistics dataset, skip the first row as the column headers are displayed on the second row and select only LGA data for years 2016-2018
crime <- read_csv('datasets\\CI_LGAa.csv',skip=1) %>% select(c('Local Government Area','2016':'2018'))

# display the head
head(crime)

# read in the ABS Population dataset, selecting only the relevant columns and renaming to something more appropriate
population <- read_excel('datasets\\ABS_LGA_pop.xls',range="B7:M87") %>% 
  select(c(1,12)) %>% 
  rename('lga' = 1,'population' = 2)

# display the head
head(population)

# reshape the crime dataset from wide to long format
crime <- crime %>% gather(key="year",value="crimes",'2016':'2018')

# clean the LGA names in the population dataset using the str_replace() function
population$lga <- population$lga %>% str_replace("\\([^()]{0,}\\)", "")

# run it once more for good measure, also to remove the second set of parentheses on the lga names
population$lga <- population$lga %>% str_replace("\\([^()]{0,}\\)", "")

# after removing the pesky parentheses, lets also remove any trailing whitespace which may be left over
population$lga <- population$lga %>% str_trim(side="right")

# remove the 'total' row from the crime dataset
crime <- crime %>% filter(`Local Government Area`!='Total')

# use a mutating join to left join the population dataset to the crime dataset on the lga key
crim_pop <- crime %>% left_join(population,c("Local Government Area"="lga"))

# check the councilArea variable of the housing dataset to see what it looks like
distinct(as.data.frame(housing$CouncilArea))

# remove the city/shire council part of the string so we can join to our crim_pop table
housing$CouncilArea <- housing$CouncilArea %>% str_replace_all(c("City Council"="","Shire Council"=""))

# remove any remaining whitespace
housing$CouncilArea <- housing$CouncilArea %>% str_trim(side="right")

# Split out the year, month and day from the date variable so it can be used in the join
housing <- housing %>% mutate(Date = dmy(Date),
                              year = year(Date),
                              month=month(Date),
                              day=day(Date))

# update the population date to a numeric type so it can be joined to the housing table
crim_pop$year <- as.numeric(crim_pop$year)

# use another mutating join to inner join the housing dataset to the crim_pop dataset
housing_merged <- housing %>% inner_join(crim_pop,c("CouncilArea"="Local Government Area","year"="year"))

# finally, drop the unwanted columns
housing_merged <- housing_merged %>% select(-"Date")

Understand

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

Process:

We will first use the dim() function to identify the dimensions of the newly merged dataset. Then, using the str() function we will generate an output of each variable along with its datatype and value examples. The majority of variables have been correctly assigned, however we will still need to make a number of factor conversions in order to transform our qualitative data to the correct type. This applies to the Rooms, Type, Method, Bedroom2, Bathroom and Car variables. Whilst the Postcode variable is reassigned from numeric type to a character.

# This is the R chunk for the Understand Section

# first lets check the dimensions of the dataset
dim(housing_merged)

## [1] 34854    25

#now lets check the overall structure of our dataset
str(housing_merged)

## tibble [34,854 x 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Suburb       : chr [1:34854] "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
##  $ Address      : chr [1:34854] "68 Studley St" "85 Turner St" "25 Bloomburg St" "18/659 Victoria St" ...
##  $ Rooms        : num [1:34854] 2 2 2 3 3 3 4 4 2 2 ...
##  $ Type         : chr [1:34854] "h" "h" "h" "u" ...
##  $ Price        : num [1:34854] NA 1480000 1035000 NA 1465000 ...
##  $ Method       : chr [1:34854] "SS" "S" "S" "VB" ...
##  $ SellerG      : chr [1:34854] "Jellis" "Biggin" "Biggin" "Rounds" ...
##  $ Distance     : num [1:34854] 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ Postcode     : num [1:34854] 3067 3067 3067 3067 3067 ...
##  $ Bedroom2     : num [1:34854] 2 2 2 3 3 3 3 3 4 3 ...
##  $ Bathroom     : num [1:34854] 1 1 1 2 2 2 1 2 1 2 ...
##  $ Car          : num [1:34854] 1 1 0 1 0 1 2 2 2 1 ...
##  $ Landsize     : num [1:34854] 126 202 156 0 134 94 120 400 201 202 ...
##  $ BuildingArea : num [1:34854] NA NA 79 NA 150 NA 142 220 NA NA ...
##  $ YearBuilt    : num [1:34854] NA NA 1900 NA 1900 ...
##  $ CouncilArea  : chr [1:34854] "Yarra" "Yarra" "Yarra" "Yarra" ...
##  $ Lattitude    : num [1:34854] -37.8 -37.8 -37.8 -37.8 -37.8 ...
##  $ Longtitude   : num [1:34854] 145 145 145 145 145 ...
##  $ Regionname   : chr [1:34854] "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
##  $ Propertycount: num [1:34854] 4019 4019 4019 4019 4019 ...
##  $ year         : num [1:34854] 2016 2016 2016 2016 2017 ...
##  $ month        : num [1:34854] 9 12 2 2 3 3 6 8 8 8 ...
##  $ day          : int [1:34854] 3 3 4 4 4 4 4 6 6 6 ...
##  $ crimes       : num [1:34854] 10727 10727 10727 10727 11130 ...
##  $ population   : num [1:34854] 98521 98521 98521 98521 98521 ...

# now lets set our factor variables and ensure we order those that are relevant
housing_merged$Rooms <- housing_merged$Rooms %>% factor(ordered = TRUE)
housing_merged$Type <- housing_merged$Type %>% factor(levels=c("h","t","u"),ordered = TRUE)
housing_merged$Method <- housing_merged$Method %>% factor()
housing_merged$Bedroom2 <- housing_merged$Bedroom2 %>% factor(ordered = TRUE)
housing_merged$Bathroom <- housing_merged$Bathroom %>% factor(ordered = TRUE)
housing_merged$Car <- housing_merged$Car %>% factor(ordered = TRUE)

# lets update to the correct datatypes
housing_merged$Postcode <- housing_merged$Postcode %>% as.character()

# now lets recheck to make sure it looks good
str(housing_merged)

## tibble [34,854 x 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Suburb       : chr [1:34854] "Abbotsford" "Abbotsford" "Abbotsford" "Abbotsford" ...
##  $ Address      : chr [1:34854] "68 Studley St" "85 Turner St" "25 Bloomburg St" "18/659 Victoria St" ...
##  $ Rooms        : Ord.factor w/ 12 levels "1"<"2"<"3"<"4"<..: 2 2 2 3 3 3 4 4 2 2 ...
##  $ Type         : Ord.factor w/ 3 levels "h"<"t"<"u": 1 1 1 3 1 1 1 1 1 1 ...
##  $ Price        : num [1:34854] NA 1480000 1035000 NA 1465000 ...
##  $ Method       : Factor w/ 9 levels "PI","PN","S",..: 7 3 3 8 6 1 8 5 3 3 ...
##  $ SellerG      : chr [1:34854] "Jellis" "Biggin" "Biggin" "Rounds" ...
##  $ Distance     : num [1:34854] 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ Postcode     : chr [1:34854] "3067" "3067" "3067" "3067" ...
##  $ Bedroom2     : Ord.factor w/ 15 levels "0"<"1"<"2"<"3"<..: 3 3 3 4 4 4 4 4 5 4 ...
##  $ Bathroom     : Ord.factor w/ 11 levels "0"<"1"<"2"<"3"<..: 2 2 2 3 3 3 2 3 2 3 ...
##  $ Car          : Ord.factor w/ 15 levels "0"<"1"<"2"<"3"<..: 2 2 1 2 1 2 3 3 3 2 ...
##  $ Landsize     : num [1:34854] 126 202 156 0 134 94 120 400 201 202 ...
##  $ BuildingArea : num [1:34854] NA NA 79 NA 150 NA 142 220 NA NA ...
##  $ YearBuilt    : num [1:34854] NA NA 1900 NA 1900 ...
##  $ CouncilArea  : chr [1:34854] "Yarra" "Yarra" "Yarra" "Yarra" ...
##  $ Lattitude    : num [1:34854] -37.8 -37.8 -37.8 -37.8 -37.8 ...
##  $ Longtitude   : num [1:34854] 145 145 145 145 145 ...
##  $ Regionname   : chr [1:34854] "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
##  $ Propertycount: num [1:34854] 4019 4019 4019 4019 4019 ...
##  $ year         : num [1:34854] 2016 2016 2016 2016 2017 ...
##  $ month        : num [1:34854] 9 12 2 2 3 3 6 8 8 8 ...
##  $ day          : int [1:34854] 3 3 4 4 4 4 4 6 6 6 ...
##  $ crimes       : num [1:34854] 10727 10727 10727 10727 11130 ...
##  $ population   : num [1:34854] 98521 98521 98521 98521 98521 ...

Tidy & Manipulate Data I

Explain why your data (or one of the data sets) doesn’t conform the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R codes and outputs, explain everything that you do in this step.

Process:

This Crime Statistics dataset is in an untidy (wide) format, with the column headers being values (year) instead of variables. Using the gather() function of the tidyr package, we will reshape the dataframe to tidy (long) format using the year as the key argument and ‘crimes’ as the value argument, with a column range passed in to represent the year variables.

Refer to the Data section above for the r code required to complete the reshape.

Tidy & Manipulate Data II

Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.

I have chosen to create two new variables below using the mutate function.

Process:

crime_rate: This is a measure of the crimes committed per capita by dividing the total crimes by the total population for that local government area.

price_sqm: This is a measure of the property price per square metre, which is calculated by dividing the property price by the land size variable.

# This is the R chunk for the Tidy & Manipulate Data II 

# use the mutate() function to create a new crime_rate variable
housing_merged <- housing_merged %>% mutate(crime_rate=crimes/population)

# use the mutate() function to create a new price/square metre variable
housing_merged <- housing_merged %>% mutate(price_sqm=Price/Landsize)

Scan I

Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

Process:

I chose to first scan the entire dataset for missing data by variable in order to get a sense of which columns were missing values and how many. Next, I decided to complete a check of the factor variables to find any obvious value errors. There was no obvious issues although some interesting results, e.g. the Bedroom2 variable does not align with the Rooms variable, meaning one of the sources must be innacurate. Looking deeper into the Bedroom2 variable we can see there are some interesting results in the high ranges, such as a 30 bedroom house. As a result of this discrepency as well as its 8214 missing values, we will remove the Bedroom2 variable and rely on the Rooms variable for our bedroom count.

We now need to handle the missing values. Some of the missing values will not impact our final analysis and/or we have no means of imputing an accurate value for them, e.g. latitude,longitude and year built. The most important missing values are those for the price variable, due to its importance in any further analysis. We definately do not want to remove all 7610 missing price values as these are a significant proportion of our dataset, and given the remaining fields, I suspect we can provide an accurately imputed value based on a conditional. I will use the group_by function to provide a mean value based on Suburb and Rooms, as these are likely to be the 2 factors most influential in determining a house price, this method should prove highly accurate. For the remaining variables with missing values, there are not as obvious a choice to determine an accurate value, so in each case we will impute with the median.

# This is the R chunk for the Scan I

# using the colSums() function we will count the total null values per variable
colSums(is.na(housing_merged))

##        Suburb       Address         Rooms          Type         Price 
##             0             0             0             0          7610 
##        Method       SellerG      Distance      Postcode      Bedroom2 
##             0             0             0             0          8214 
##      Bathroom           Car      Landsize  BuildingArea     YearBuilt 
##          8223          8725         11807         21112         19303 
##   CouncilArea     Lattitude    Longtitude    Regionname Propertycount 
##             0          7973          7973             0             0 
##          year         month           day        crimes    population 
##             0             0             0             0             0 
##    crime_rate     price_sqm 
##             0         16872

# complete a sense check on all of the factors variables 
housing_merged %>% select_if(is.factor) %>% summary()

##      Rooms       Type          Method         Bedroom2        Bathroom    
##  3      :15083   h:23978   S      :19743   3      :11881   1      :12969  
##  2      : 8330   t: 3580   SP     : 5094   4      : 6348   2      :11064  
##  4      : 7956   u: 7296   PI     : 4850   2      : 5777   3      : 2181  
##  5      : 1737             VB     : 3107   5      : 1427   4      :  269  
##  1      : 1479             SN     : 1317   1      :  966   5      :   77  
##  6      :  204             PN     :  308   (Other):  241   (Other):   71  
##  (Other):   65             (Other):  435   NA's   : 8214   NA's   : 8223  
##       Car       
##  2      :12214  
##  1      : 9164  
##  0      : 1631  
##  3      : 1606  
##  4      : 1161  
##  (Other):  353  
##  NA's   : 8725

# complete a check on individual factor variables
housing_merged$Rooms %>% levels()

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "12" "16"

housing_merged$Type %>% levels()

## [1] "h" "t" "u"

housing_merged$Method %>% levels()

## [1] "PI" "PN" "S"  "SA" "SN" "SP" "SS" "VB" "W"

housing_merged$Bedroom2 %>% levels()

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "12" "16" "20" "30"

housing_merged$Bathroom %>% levels()

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "12"

housing_merged$Car %>% levels()

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "18" "26"

# remove the Bedroom2 variable
housing_merged <- housing_merged %>% select(-"Bedroom2")

# impute the price variable with the mean value based on suburb and rooms
housing_merged <- housing_merged %>% group_by(Suburb,Rooms) %>%
mutate(Price=ifelse(is.na(Price),mean(Price,na.rm=TRUE),Price))

# repeat above step for the remaining relevant variables, using an appropriate aggregate for each
housing_merged$Bathroom <- housing_merged$Bathroom %>% impute(fun=median)
housing_merged$Car <- housing_merged$Car %>% impute(fun=median)
housing_merged$Landsize <- housing_merged$Landsize %>% impute(fun=median)

Scan II

Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

Process:

Due to the nature of the Melbourne housing market there is likely to be many outliers within the dataset which are perfectely valid values. However, we will still perform our outlier analysis to confirm. Due to the size of our dataset we are unable to perform a Mahalanobis Distance analysis so we will use a Box-Plot analysis instead. Again we can observe that despite the large number of outliers in many of the variables, none of these outliers represent impossible values and in fact most likely add value to the data set. For example we can observe that the maximum Price in Melbourne was just over $11 million, which despite being a significant outlier does not negatively impact the integrity of the data. After performing our analysis we can observe 3 outlier values which stick out, one of the values can be seen in the scatter plot showing Land size by distance from the CBD. In the top left corner we can observe a value with a large land size which is very close to the CBD. By selecting the max land size value we see it is a 3 bedroom house in Fitzroy. So it’s probably fair to say its not 430,000 square metres! We will impute the value with the mean land size for 3 bedroom houses in Fitzroy. The second value is in the YearBuilt variable, we can see a value at the far bottom of the boxplot indicating a house built sometime around the 13th Century. It’s likely safe to assume that any aboriginal structure still standing from that time would not be up for sale on Domain so we can safely impute this value with the mean YearBuilt for the suburb. Finally, we can see an outlier in the boxplot for the BuildingArea variable, with one value indicating a house over 40,000 square metres in size. Unless there was an Amazon distribution centre up for sale in New Gisborne, we can again safely impute this value with the mean building area by suburb. We can also see a large number of items with BuildingSize = 0, although given we cannot be surely if these are houses or empty property we will take no further action.

# This is the R chunk for the Scan II

# complete a sense check on each of the numeric variables 
housing_merged %>% select_if(is.numeric) %>% summary()

## 
##  11807 values imputed to 521

##     Suburb              Rooms           Price             Distance    
##  Length:34854       3      :15083   Min.   :   85000   Min.   : 0.00  
##  Class :character   2      : 8330   1st Qu.:  660000   1st Qu.: 6.40  
##  Mode  :character   4      : 7956   Median :  907000   Median :10.30  
##                     5      : 1737   Mean   : 1096680   Mean   :11.19  
##                     1      : 1479   3rd Qu.: 1350000   3rd Qu.:14.00  
##                     6      :  204   Max.   :11200000   Max.   :48.10  
##                     (Other):   65   NA's   :120                       
##     Landsize       BuildingArea       YearBuilt       Lattitude     
##  Min.   :     0   Min.   :    0.0   Min.   :1196    Min.   :-38.19  
##  1st Qu.:   357   1st Qu.:  102.0   1st Qu.:1940    1st Qu.:-37.86  
##  Median :   521   Median :  136.0   Median :1970    Median :-37.81  
##  Mean   :   569   Mean   :  160.3   Mean   :1965    Mean   :-37.81  
##  3rd Qu.:   598   3rd Qu.:  188.0   3rd Qu.:2000    3rd Qu.:-37.75  
##  Max.   :433014   Max.   :44515.0   Max.   :2106    Max.   :-37.39  
##                   NA's   :21112     NA's   :19303   NA's   :7973    
##    Longtitude    Propertycount        year          month       
##  Min.   :144.4   Min.   :   83   Min.   :2016   Min.   : 1.000  
##  1st Qu.:144.9   1st Qu.: 4385   1st Qu.:2016   1st Qu.: 5.000  
##  Median :145.0   Median : 6763   Median :2017   Median : 7.000  
##  Mean   :145.0   Mean   : 7573   Mean   :2017   Mean   : 7.142  
##  3rd Qu.:145.1   3rd Qu.:10412   3rd Qu.:2017   3rd Qu.:10.000  
##  Max.   :145.5   Max.   :21650   Max.   :2018   Max.   :12.000  
##  NA's   :7973                                                   
##       day            crimes        population       crime_rate     
##  Min.   : 1.00   Min.   : 1508   Min.   : 34158   Min.   :0.02658  
##  1st Qu.: 9.00   1st Qu.: 6767   1st Qu.:117498   1st Qu.:0.03791  
##  Median :17.00   Median : 8383   Median :161609   Median :0.05757  
##  Mean   :15.88   Mean   : 9861   Mean   :156061   Mean   :0.06391  
##  3rd Qu.:24.00   3rd Qu.:11807   3rd Qu.:181289   3rd Qu.:0.07465  
##  Max.   :30.00   Max.   :27570   Max.   :340419   Max.   :0.16221  
##                                                                    
##    price_sqm       
##  Min.   :   6.235  
##  1st Qu.:1391.905  
##  Median :2451.537  
##  Mean   :     Inf  
##  3rd Qu.:4811.473  
##  Max.   :     Inf  
##  NA's   :16872

# Checking outliers based on summary
boxplot(housing_merged$Price~housing_merged$Rooms,main = "House Price by Rooms",par(cex.axis=.5))

boxplot(housing_merged$Rooms,main = "Rooms")

boxplot(housing_merged$Distance,main="Distance from CBD (km)")

boxplot(housing_merged$BuildingArea,main="Buidling Area")

boxplot(housing_merged$YearBuilt,main="Year Built")

boxplot(housing_merged$Propertycount,main="Property Count")

boxplot(housing_merged$crimes,main="Crimes")

boxplot(housing_merged$population,main="Population")

# build a scatter plot to show outliers
plot(housing_merged$Landsize~housing_merged$Distance,main="Land Size by Distance from CBD")

# select the landsize outlier value
housing_merged[which.max(housing_merged$Landsize),]

# select the year built outlier value
housing_merged[which.min(housing_merged$YearBuilt),]

# select the Building area outlier value
housing_merged[which.max(housing_merged$BuildingArea),]

# impute the landsize outlier with the mean for 3 bedroom houses in Fitzroy
housing_merged <- housing_merged %>% group_by(Suburb,Rooms) %>% 
mutate(Landsize=ifelse(Landsize>400000,mean(Landsize,na.rm=TRUE),Landsize))

# impute the building area outlier with the mean for 5 bedroom houses in New Gisborne
housing_merged <- housing_merged %>% group_by(Suburb,Rooms) %>% 
mutate(BuildingArea=ifelse(BuildingArea>40000,mean(BuildingArea,na.rm=TRUE),BuildingArea))

# impute the year built outlier with the mean for Mount Waverly
housing_merged <- housing_merged %>% group_by(Suburb) %>% 
mutate(YearBuilt=ifelse(YearBuilt<1800,mean(YearBuilt,na.rm=TRUE),YearBuilt))

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

Process:

I will now apply a transformation to the Price variable, so as to standardise the values for potential future statistical analysis given it is the dependant variable of the dataset. I first examined the distribution shape and found it to be positively skewed. As a result I chose to test both square root and log transformations to see which has the better impact. From the comparison of histogram shapes below we can see that the log transformation has a much better impact in reducing skewness of the variable so we will proceed with this method.

# This is the R chunk for the Transform Section

# we will now apply a transformation on price, first lets check the shape
hist(housing_merged$Price)

# apply a log transformation to the variable
housing_log <- log(housing_merged$Price)

# check the histogram again for comparison
hist(housing_log)

# apply a square root transformation
housing_root <- sqrt(housing_merged$Price)

# check histogram again for comparison
hist(housing_root)

References

Pino, T 2018, Melbourne Housing Market, Kaggle.com, viewed 9 October 2020, https://www.kaggle.com/anthonypino/melbourne-housing-market

ABS 2016, ERP by LGA (ASGS 2016), Australian Bureau of Statistics, viewed 10 October 2020, http://stat.data.abs.gov.au/Index.aspx?DataSetCode=ABS_ERP_LGA2016

CSA 2019, Latest crime data by area, Crime Statistics Agency Victoria, viewed 10 October 2020, https://www.crimestatistics.vic.gov.au/crime-statistics/latest-crime-data-by-area

Dr. Anil Dolgun 2020, Module 8 - Special Operations: Dealing with date/time and character variables, Appspot.com, viewed 12 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_08.html

Dr. Anil Dolgun 2020, Module 7 - Transform: Data Transformation, Standardisation, and Reduction, Appspot.com, viewed 12 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_07.html

Dr. Anil Dolgun 2020, Module 6 - Scan: Outliers, Appspot.com, viewed 12 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_06.html

Dr. Anil Dolgun 2020, Module 5 - Scan: Missing Values, Appspot.com, viewed 12 October 2020, http://rare-phoenix-161610.appspot.com/secured/Module_05.html

MATH2349 Data Wrangling

Assignment 2

Wesley Mayne - s3855554

Required packages

Executive Summary

Data

Melbourne Housing Data

Crime Statistics Data

ABS Population Data

Part 1:

Understand

Process:

Tidy & Manipulate Data I

Process:

Tidy & Manipulate Data II

Process:

Scan I

Process:

Scan II

Process:

Transform

Process:

References