# This is the R chunk for the required packages
# Call packages ---------------------------------------------------------
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
Data pre-processing is one of the stages in the Data Science process. It is preceded by two other stages: framing the problem, which involves identifying the client or audience and the problem they want resolved, and identifying the required data. This data pre-processing exercise relates to preparing data to provide insights that will help a new building maintenance company target its flyers to residents in the City of Melbourne.
The data was sourced from City of Melbourne Open Data and processed in R. Two data sets were used: “Residential_dwellings_2019.csv” as the primary data set and “Development_Activity_Monitor.csv” as the secondary data set. The two data sets were merged and sliced to create a new data frame. The merged data set was then inspected to establish the structure of the data object and to identify the data types and attributes, which determine how the data is processed, analysed and modelled. Any necessary data type conversions were then undertaken.
The data was then tidied by consolidating the various dwelling types under a single column. The data set was also scanned for missing values; those found were dealt with through imputation. No special values (NaN or Inf) were found, and the few impossible values arising in the derived building-age variable were corrected.
The “Dwelling number” variable was scanned for outliers using Tukey’s method on a boxplot. Outliers were found but were deemed consistent with the nature of the data rather than an oddity, hence the decision to keep them.
A histogram of the “Dwelling number” variable revealed data skewed to the right. Three methods for transforming right-skewed data, log(), sqrt() and the reciprocal, were applied. The sqrt() transformation returned the best result, with the least amount of skewness.
# This is the R chunk for the Data Section and includes sub-setting the merged data set
# Import Data set 1 -------------------------------------------------------
DS_1 <- read_csv("~/Desktop/Data Wrangling _Assign2/Residential_dwellings_2019.csv")
## Parsed with column specification:
## cols(
## `Census year` = col_double(),
## `Block ID` = col_double(),
## `Property ID` = col_double(),
## `Base property ID` = col_double(),
## `Street address` = col_character(),
## `CLUE small area` = col_character(),
## `Dwelling type` = col_character(),
## `Dwelling number` = col_double(),
## `x coordinate` = col_double(),
## `y coordinate` = col_double()
## )
# inspect DS_1 data set
head(DS_1)
# Import Data set 2 -------------------------------------------------------
DS_2 <- read_csv("~/Desktop/Data Wrangling _Assign2/Development_Activity_Monitor.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## data_format = col_character(),
## development_key = col_character(),
## status = col_character(),
## clue_small_area = col_character(),
## street_address = col_character(),
## town_planning_application_no = col_character(),
## `Location 1` = col_character()
## )
## See spec(...) for full column specifications.
# inspect DS_2 data set
head(DS_2)
# Merge DS_1 & DS_2 data sets---------------------------------------------------------
Merged_DS <- DS_1 %>% left_join(DS_2, by = c("Property ID"="property_id"))
# subset Merged_DS dataframe by omitting variables not required
Merged_DS <- Merged_DS[,c(1:10, 14,22,24:28)]
# inspect merged data set
head(Merged_DS)
To reveal the data structure of the R object created, the str() function was used. “Merged_DS” is a data frame with 10,322 observations and 17 variables. To further understand the data, the glimpse() function was used to identify the data attributes. The merged data set comprised numerical and character variables. All number values had defaulted to the double numeric type, and some of the character variables were categorical. The double numeric values were converted to either integer numeric or nominal character values, depending on the nature of the variable, and the categorical character variables were converted to factors.
# This is the R chunk for the Understand I Section
str(Merged_DS) # inspect structure of dataframe
glimpse(Merged_DS) # global view of all variables and data attributes
dim(Merged_DS) # inspect dimensions of data set
Merged_DS[,c("CLUE small area","Dwelling type")] <- lapply(Merged_DS[,c("CLUE small area","Dwelling type")],as.factor) # convert variables from character to factor variables
Merged_DS[,c(1,8,11,13:17)] <- lapply(Merged_DS[,c(1,8,11,13:17)],as.integer) # convert discrete variables from double to integer numeric data type variable
Merged_DS[,c(2:4,9:10,12)] <- lapply(Merged_DS[,c(2:4,9:10,12)],as.character) # convert nominal variables from double to character data type variable
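As a quick sanity check (a minimal sketch, not part of the original chunks), the resulting column classes can be listed to confirm the conversions took effect:
# confirm the data type of each column after conversion
sapply(Merged_DS, class)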
According to Wickham’s tidy data principles, “a variable contains all values that measure the same underlying attribute” [4]. The data from the “Development_Activity_Monitor” data set had the various dwelling types as column headings, with the row values representing the count of dwellings under each dwelling type. The count, however, cannot logically be assigned as an attribute of the dwelling type. To correct this, the gather() function was used to consolidate the various dwelling types under a single column.
# This is the R chunk for the Tidy & Manipulate Data I and includes sub-setting to exclude one variable
Merged_DS <- Merged_DS %>% gather(studio_dwe, one_bdrm_dwe, two_bdrm_dwe, three_bdrm_dwe, student_apartments, key = Residential_type, value = resi_count)
Merged_DS <- Merged_DS[,-14] # subset to exclude unrequired variable
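To confirm the reshape (a small check, assuming the gather() call above has run), the new key column can be tabulated:
# inspect the consolidated dwelling-type column
table(Merged_DS$Residential_type)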
The first step was to determine the total number of missing values in the data frame using sum(is.na()). Next, colSums() was used to identify the variables with missing values, which helped in determining the best approach to handling them on the basis of the variables affected. A determination was then made to replace missing values in the “year_completed” column with the numerical value 0, as the variable would be used in calculating the age of buildings and therefore required numerical values.
The missing values in the “floors_above” and “Residential_type” variables were replaced with the character value “unknown”. This was deemed appropriate because the data was not available and there was no logical way to impute the missing values.
The next step was to test how significant the remaining missing values were in the data set. The result was under the 5% threshold recommended for a blanket exclusion of observations with missing values [3], so the decision was made to omit the remaining missing values using na.omit().
The data frame was also scanned for special values using the is.nan() and is.infinite() functions. Since special values do not necessarily conform to a particular data type, the lapply() function was used to apply is.nan() and is.infinite() across the columns, and the resulting list was converted to a vector using unlist(). The scan reported no special values.
# This is the R chunk for the Scan I Part A
# scan for missing values -----------------------------------------------------------------
sum(is.na(Merged_DS))
colSums(is.na(Merged_DS))
# replace missing values
Merged_DS <- Merged_DS %>% replace_na(list(year_completed = 0, floors_above = "unknown", Residential_type = "unknown"))
# determine significance of any remaining missing values after replacing some missing values
mean(is.na(Merged_DS))*100
# omit remaining missing values
Merged_DS <- na.omit(Merged_DS)
# scan for special characters -----------------------------------------------------------------
unlist(lapply(Merged_DS, function(x) sum(is.nan(x))))
unlist(lapply(Merged_DS, function(x) sum(is.infinite(x))))
The original data did not explicitly state the age of buildings at the time of the census, a variable deemed necessary. This could, however, be derived from two existing variables, “Census year” and “year_completed”: the age was calculated by subtracting the values in the “year_completed” column from those in the “Census year” column, with the result mutated into a new column. The unique() function was then used to inspect the composition of the new variable.
The imputed 0 value in the “year_completed” column was necessary for the above computation, but it is neither logical nor useful for analysis, as year 0 is meaningless. To address this, the value 0 was changed to “pre-2002”. The 0 value had also caused the computed ages to be 2019; these were corrected to “over 17 yrs”. An error of a building age of -1 was also picked up, relating to a building included in the 2019 census but with a completion date of 2020. This was corrected to an age of 0.
# This is the R chunk for the Tidy & Manipulate Data II
Merged_DS <- Merged_DS %>% mutate(Bld_Age = `Census year` - year_completed) # create a new variable, Bld_Age
unique(Merged_DS$Bld_Age) # inspect values in created column
# replace specific row values to better communicate observation
Merged_DS$year_completed <- replace(Merged_DS$year_completed,Merged_DS$year_completed == 0,"pre-2002") # replace imputed missing values with a string
Merged_DS$Bld_Age <- replace(Merged_DS$Bld_Age,Merged_DS$Bld_Age == 2019,"over 17 yrs") # replace impossible value with a string
Merged_DS$Bld_Age <- replace(Merged_DS$Bld_Age,Merged_DS$Bld_Age == -1,0) # replace buildings included in 2019 census but recorded as completed in 2020
The data was scanned for errors and inconsistencies which, if present in numerical variables, are likely to interfere with mathematical operations and statistical analysis. The code used to detect errors required the object to be a matrix, so the first step was to convert the data frame to a matrix and assign it to a new object. The apply() function was then used to return the unique values in each column; non-numeric values were coerced to NA. For columns that returned a mix of numeric and NA values, further investigation was carried out with unique() on the specific column to determine what the NAs represented. No obvious errors were detected.
# This is the R chunk for the Scan I Part B
# scan for obvious errors (inconsistencies) -----------------------------------------------------------------
X <- data.matrix(Merged_DS) # convert dataframe to a matrix
## Warning in data.matrix(Merged_DS): NAs introduced by coercion
apply(X,2, unique) # extract unique numerical values in each column
# investigate further results with a mix of NA and numerical values
unique(Merged_DS$floors_above)
unique(Merged_DS$year_completed)
unique(Merged_DS$Bld_Age)
After tidying the merged data set, it was re-inspected using the head() function. The “Bld_Age” variable was initially a categorical character variable consisting of ordinal values, so it was converted to an ordered factor.
The steps followed were to first create a list of the unique values in “Bld_Age”, then order the values using the string manipulation function str_sort() with the numeric argument set to TRUE, and finally use the factor() function to convert “Bld_Age” to an ordered factor.
The “Census year” variable was initially converted to an integer to facilitate calculating the building age and creating the Bld_Age variable. It is, however, not truly an integer, as that would suggest it is subject to mathematical operations; summing the year values would produce a meaningless number. The variable was therefore converted to a nominal character variable to reflect its true nature.
# This is the R chunk for the Understand II Section
head(Merged_DS) # inspect tidied data set
# convert the Bld_Age variable to an ordered factor
lvl <- str_sort(unique(Merged_DS$Bld_Age), numeric = TRUE) # create level list
Merged_DS$Bld_Age <- factor(Merged_DS$Bld_Age,levels = lvl, ordered = TRUE)
# inspect levels of ordered factor variable
levels(Merged_DS$Bld_Age)
# convert Census year variable data type from integer to nominal character
Merged_DS$`Census year` <- as.character(Merged_DS$`Census year`)
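As a final check on the conversions (a minimal sketch, not part of the original chunks):
# confirm the new data types
is.ordered(Merged_DS$Bld_Age)
class(Merged_DS$`Census year`)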
The “Dwelling number” variable indicates the number of dwellings within a building, so the count for a particular building depends on the type of building. Ideally this data would be an ordered factor, as the number of dwellings is unique to a specific building; however, the levels would be too many. The values also had a wide range, from 1 to 806, which would most likely distort the mean (see the quick check below). A better measure is the median: “The center of a skewed distribution can be better measured by the median”, and non-parametric tests are preferable in such instances [5]. Boxplots are well suited as a non-parametric way of detecting outliers.
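As a quick illustration of the distortion (a check added here, not part of the original chunks), the mean can be compared with the median; in right-skewed data the long tail pulls the mean above the median:
# compare mean and median of the Dwelling number variable
mean(Merged_DS$`Dwelling number`)
median(Merged_DS$`Dwelling number`)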
Based on the above, a boxplot using Tukey’s method was used to scan for outliers. Recognising that the data as laid out was not ideal, the boxplot was based on the unique values of the “Dwelling number” variable rather than the individual values.
# This is the R chunk for the Scan II
# unique values for Dwelling number variable sorted in order
Dwe_n <- str_sort(unique(Merged_DS$`Dwelling number`), numeric = TRUE) # sorted unique values, kept for inspection
# scan dataset for outliers
unique(Merged_DS$`Dwelling number`) %>% boxplot(main="Melbourne City Council: Number of Dwellings per Property ID", ylab="Dwellings", col = "greenyellow")
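For transparency, the fences that Tukey’s method uses can be computed explicitly. The sketch below (not part of the original chunks) reproduces the values the boxplot flags as outliers:
# compute Tukey's fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
dwe_vals <- unique(Merged_DS$`Dwelling number`)
q1 <- quantile(dwe_vals, 0.25, names = FALSE)
q3 <- quantile(dwe_vals, 0.75, names = FALSE)
lower <- q1 - 1.5 * (q3 - q1)
upper <- q3 + 1.5 * (q3 - q1)
c(lower = lower, upper = upper)
dwe_vals[dwe_vals < lower | dwe_vals > upper] # values flagged as outliers
boxplot.stats(dwe_vals)$out # cross-check against the boxplot's own calculation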
The shape of the data distribution was determined using a histogram, which revealed that the “Dwelling number” variable is positively skewed (skewed to the right). The log(), sqrt() and reciprocal transformations are methods used to transform right-skewed data. All three methods were applied and, based on the results, the sqrt() method returned the best transformation, with the least skewed data.
# This is the R chunk for the Transform Section
# transform the Dwelling number variable and compare distributions
par(mfrow = c(2,2))
hist(unique(Merged_DS$`Dwelling number`))
log_Mel <- log10(unique(Merged_DS$`Dwelling number`))
hist(log_Mel)
sqrt_Mel <- sqrt(unique(Merged_DS$`Dwelling number`))
hist(sqrt_Mel)
Mel_recip <- 1/unique(Merged_DS$`Dwelling number`)
hist(Mel_recip)
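To back the visual comparison with a number, skewness can be computed for each version of the data. The helper below is a sketch (the standardised third moment, defined here to avoid adding a package dependency) and is not part of the original code; a value closest to 0 indicates the least skew:
# sample skewness as the standardised third moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(original = skewness(unique(Merged_DS$`Dwelling number`)),
  log10 = skewness(log_Mel),
  sqrt = skewness(sqrt_Mel),
  reciprocal = skewness(Mel_recip))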
[1] City of Melbourne 2020, Residential dwellings, City of Melbourne Open Data Team, viewed 26 September 2020, https://data.melbourne.vic.gov.au/Property/Residential-dwellings/44kh-ty54
[2] City of Melbourne 2020, Development Activity Monitor, City of Melbourne Open Data Team, viewed 26 September 2020, https://data.melbourne.vic.gov.au/Property/Development-Activity-Monitor/gh7s-qda8
[3] Missing Values in R n.d., faculty.nps.edu, viewed 8 October 2020, https://faculty.nps.edu/sebuttre/home/R/missings.html
[4] Wickham, H 2014, ‘Tidy Data’, Journal of Statistical Software, viewed 15 October 2020, https://vita.had.co.nz
[5] Ogee et al. 2015, ‘Choosing Between a Nonparametric Test and a Parametric Test’, Minitab.com, viewed 15 October 2020, https://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test