# This is the R chunk for the required packages
library(readr)
suppressMessages(library(dplyr))
library(tidyr)
library(stringr)
library(readxl)
In this RStudio report we are creating clean and tidy data that will be used to analyse society wellbeing in different Local Govemrent Areas in Victoria. These indicators include life satisfaction, life worthwhile and overall community satisfaction, as well as housing prices to give an idea about the standard of living in specific area. We have loaded 4 different dataframes from 3 different datasets. All 4 dataframes has at least one variable that stores the Local Goverment Area and it was used to join all 4 dataframes. Firstly, we have imported the dataset using appropriate function according to file type. We made sure that unnecessary columns and rows are omitted from the beginning. Then, we went through all variables in the dataframes and checked that they have the correct data type and overall correct structure. Next, we mutated our dataframes and create extra variables that gives better insights about our data. After that, and before going into further data manipulation, we dealt with missing values and outliers, because it was easier to do that while data was in wide format. Moreover, data transformation was applied to help gain better understanding about our data and helped us better identify outliers. Next, we made sure that all four dataframes has a consistent common variable that can be used for merging and then we merged that dataset all together. Finally, we made our data more tidy and ready for analysis by changing it from wide to long format.
Note: The order of tasks given in the assignment sheet is different than the order used here, and that is due to necessity to preprocess the data.
Victorian Population Health Survey 2017 dataset is provided by The Department of Health and Human Services. The dataset includes different health indicators about popultation according to local goverment areas in Vicrtoria. The dataset in an Excel file that has different sheets, where each sheet has one indicator. In this report we will use two sheets only, life satisfaction and life worthwhile.
Download link: https://www2.health.vic.gov.au/Api/downloadmedia/%7B1AF5D560-4144-4AFE-BDBA-7CFA431865E5%7D
Dataset 1a variables description:
1- LocalGovArea: the name of the Victorian local goverment council, there are overall 79 local goverments
2- LowAndMedium Female: percentage of female who have low or medium life satisfaction
3- LowAndMedium Male: percentage of male who have low or medium life satisfaction
4- High Female: percentage of female who have high life satisfaction
5- High Male: percentage of male who have high life satisfaction
6- VeryHigh Female: percentage of female who have very high life satisfaction
7- VeryHigh Male: percentage of male who have very high life satisfaction
Let’s import the dataset which is an xlsx file that contains more than one sheet, we will use two sheets only (life satisfaction and life worthwhile). We will import everyting first, and we will exclude the first 7 rows that contains no observations.
# Importing data 1a
suppressMessages(life_satisfaction <- read_excel("VPHS 2017 by LGA and gender - v2.xlsx",
sheet = "Life satisfacation", col_names = FALSE, skip = 7))
# Remove unnecessary columns
life_satisfaction <- life_satisfaction %>% select(2,3,10,17,24,31,38)
# Remove unnecessary rows at the bottom of the sheet
life_satisfaction <- life_satisfaction[-seq(80,94), ]
# Define columns names
colnames(life_satisfaction) <- c("LocalGovArea","LowAndMedium Female","LowAndMedium Male",
"High Female", "High Male", "VeryHigh Female", "VeryHigh Male")
head(life_satisfaction)
Dataset 1b variables description:
1- LocalGovArea: the name of the Victorian local goverment council, there are overall 79 local goverments
2- LowAndMedium Female: percentage of female who have low or medium life worthwhile
3- LowAndMedium Male: percentage of male who have low or medium life worthwhile
4- High Female: percentage of female who have high life worthwhile
5- High Male: percentage of male who have high life worthwhile
6- VeryHigh Female: percentage of female who have very high life worthwhile
7- VeryHigh Male: percentage of male who have very high life worthwhile
Importing the second sheet:
# Importing data 1b
suppressMessages(life_worthwhile <- read_excel("VPHS 2017 by LGA and gender - v2.xlsx",
sheet = "Life worthwhile", col_names = FALSE, skip = 7))
# Remove unnecessary columns
life_worthwhile <- life_worthwhile %>% select(2,3,10,17,24,31,38)
# Remove unnecessary rows at the bottom of the sheet
life_worthwhile <- life_worthwhile[-seq(80,94), ]
# Define columns names
colnames(life_worthwhile) <- c("LocalGovArea","LowAndMedium Female","LowAndMedium Male",
"High Female", "High Male", "VeryHigh Female", "VeryHigh Male")
head(life_worthwhile)
Melbourne Housing Market is a dataset that is published in Kaggle which has over 60,000 obervations. Each observation is a house/unit/townhouse that is for sale and includes information like price, number of bedrooms, address and local goverment area. That dataset is in a CSV file.
Page link: https://www.kaggle.com/anthonypino/melbourne-housing-market
Download link: https://www.kaggle.com/anthonypino/melbourne-housing-market/download
Dataset variables description:
1- Suburb 2- Address 3- Rooms 4- Type
5- Price 6-Method 7- Seller 8- Date
9- Postcode 10- Regoinname
11- Propertycount: number of properties for sale in same council area
12- Distance: distance from Melbourne CBD
13- CouncilArea: Local goverment area
Let’s import that dataset.
suppressMessages(melbourne_house_prices <- read_csv("MELBOURNE_HOUSE_PRICES_LESS.csv"))
head(melbourne_house_prices)
This dataset, which is provided by Know Your Council website, gives different indicators to evaluate the performace of each local goverment area in Victoria. There are over 60 indicators, but we will only use one indicator that has to do with society satisfaction about the council’s services.
Download link: https://knowyourcouncil.vic.gov.au/__data/assets/excel_doc/0010/32887/LGPRF-2014-2019-Full-Council-Data-Set-EXTERNAL.xlsx
Dataset variables description:
1- Council: local government area name
2- Group: goverment group
3- ID
4- Descriptio: indicator’s description
5- Service Provided: T/F if service is provided or not
6- Result: the numerical value used as result for each indicator
Now let
vic_councils_indicators <- read_excel("LGPRF-2014-2019-Full-Council-Data-Set-EXTERNAL.xlsx",
sheet = "Indicators 2017-18", col_types = c("text",
"text", "text", "text", "text", "text",
"numeric", "skip", "skip", "skip",
"skip", "skip", "skip", "skip", "skip",
"skip"))
head(vic_councils_indicators)
Checking the dimension of dataset 1a:
#Dimension of dataset 1a
dim(life_satisfaction)
[1] 79 7
#How many values for Local goverments areas
paste("There are ", dim(life_satisfaction)[1], " local goverment areas in Victoria")
[1] "There are 79 local goverment areas in Victoria"
#Structure of the dataset
str(life_satisfaction)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 79 obs. of 7 variables:
$ LocalGovArea : chr "Alpine (S)" "Ararat (RC)" "Ballarat (C)" "Banyule (C)" ...
$ LowAndMedium Female: num 28.2 29.6 20 18.5 19.9 ...
$ LowAndMedium Male : num 5.26 14.52 20.97 17.52 26.28 ...
$ High Female : num 47.7 51.5 48.2 56.4 51.9 ...
$ High Male : num 74 44.1 55.1 61.4 43.2 ...
$ VeryHigh Female : num 23.1 18.1 31.3 24.2 27.6 ...
$ VeryHigh Male : num 19.4 40.8 23.2 21.1 30.3 ...
# Using a combination of (sapply) and (typeof), let's check the data type of each column
sapply(life_satisfaction,typeof)
LocalGovArea LowAndMedium Female LowAndMedium Male High Female High Male VeryHigh Female
"character" "double" "double" "double" "double" "double"
VeryHigh Male
"double"
All data types of all columns are OK in this dataset.
Checking the dimension of dataset 1b:
#Dimension of dataset 1b
dim(life_worthwhile)
[1] 79 7
#How many values for Local goverments areas
paste("There are ", dim(life_worthwhile)[1], " local goverment areas in Victoria")
[1] "There are 79 local goverment areas in Victoria"
Checking the structure:
#Structure of the dataset
str(life_worthwhile)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 79 obs. of 7 variables:
$ LocalGovArea : chr "Alpine (S)" "Ararat (RC)" "Ballarat (C)" "Banyule (C)" ...
$ LowAndMedium Female: num 17.7 18.5 10.7 16.5 18.5 ...
$ LowAndMedium Male : num 3.16 14.77 26.66 13.35 13.08 ...
$ High Female : num 48.2 46.2 41.9 43.6 46.5 ...
$ High Male : num 52.1 40.3 47.1 59.4 48.4 ...
$ VeryHigh Female : num 33.1 34.8 46.5 37.6 33.7 ...
$ VeryHigh Male : num 41.9 43.3 25.2 27.2 37.5 ...
# Using a combination of (sapply) and (typeof), let's check the data type of each column
sapply(life_worthwhile,typeof)
LocalGovArea LowAndMedium Female LowAndMedium Male High Female High Male VeryHigh Female
"character" "double" "double" "double" "double" "double"
VeryHigh Male
"double"
All data types of all columns are OK in this dataset.
Both dataframes 1a and 1b has Local Goverment Areas column with Group type between brackets. We need to rename it and remove brackets so that they are consistent with other dataframes.
#Before
head(life_satisfaction$LocalGovArea)
[1] "Alpine (S)" "Ararat (RC)" "Ballarat (C)" "Banyule (C)" "Bass Coast (S)" "Baw Baw (S)"
#dataset 1a
life_satisfaction$LocalGovArea <- gsub("\\(.*?)","",life_satisfaction$LocalGovArea)
#dataset 1a
life_worthwhile$LocalGovArea <- gsub("\\(.*?)","",life_worthwhile$LocalGovArea)
# Now remove resulting whitespaces
life_satisfaction$LocalGovArea <- trimws(life_satisfaction$LocalGovArea,"r")
life_worthwhile$LocalGovArea <- trimws(life_worthwhile$LocalGovArea,"r")
#After
head(life_satisfaction$LocalGovArea)
[1] "Alpine" "Ararat" "Ballarat" "Banyule" "Bass Coast" "Baw Baw"
Check the dimensions of the dataset:
dim(melbourne_house_prices)
[1] 63023 13
paste("There are ", dim(melbourne_house_prices)[1], " observations in this dataset")
[1] "There are 63023 observations in this dataset"
Check the structure of the dataframe:
#Structure of the dataset
str(melbourne_house_prices)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 63023 obs. of 13 variables:
$ Suburb : chr "Abbotsford" "Abbotsford" "Abbotsford" "Aberfeldie" ...
$ Address : chr "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
$ Rooms : num 3 3 3 3 2 2 2 3 6 3 ...
$ Type : chr "h" "h" "h" "h" ...
$ Price : num 1490000 1220000 1420000 1515000 670000 ...
$ Method : chr "S" "S" "S" "S" ...
$ SellerG : chr "Jellis" "Marshall" "Nelson" "Barry" ...
$ Date : chr "1/04/2017" "1/04/2017" "1/04/2017" "1/04/2017" ...
$ Postcode : num 3067 3067 3067 3040 3042 ...
$ Regionname : chr "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Western Metropolitan" ...
$ Propertycount: num 4019 4019 4019 1543 3464 ...
$ Distance : num 3 3 3 7.5 10.4 10.4 10.4 10.4 14 3 ...
$ CouncilArea : chr "Yarra City Council" "Yarra City Council" "Yarra City Council" "Moonee Valley City Council" ...
- attr(*, "spec")=
.. cols(
.. Suburb = [31mcol_character()[39m,
.. Address = [31mcol_character()[39m,
.. Rooms = [32mcol_double()[39m,
.. Type = [31mcol_character()[39m,
.. Price = [32mcol_double()[39m,
.. Method = [31mcol_character()[39m,
.. SellerG = [31mcol_character()[39m,
.. Date = [31mcol_character()[39m,
.. Postcode = [32mcol_double()[39m,
.. Regionname = [31mcol_character()[39m,
.. Propertycount = [32mcol_double()[39m,
.. Distance = [32mcol_double()[39m,
.. CouncilArea = [31mcol_character()[39m
.. )
Since we are going to match datasets with local goverment area, we need to remove the pattern “City Council” from Council Area.
# Before
head(melbourne_house_prices$CouncilArea)
[1] "Yarra City Council" "Yarra City Council" "Yarra City Council" "Moonee Valley City Council"
[5] "Moonee Valley City Council" "Moonee Valley City Council"
# Removing pattern
melbourne_house_prices$CouncilArea <- str_remove(melbourne_house_prices$CouncilArea,
" City Council")
# After
head(melbourne_house_prices$CouncilArea)
[1] "Yarra" "Yarra" "Yarra" "Moonee Valley" "Moonee Valley" "Moonee Valley"
Let’s rename CouncilArea column name to LocalGovArea so it matches other dataframes to be merged later
colnames(melbourne_house_prices)[13] <- "LocalGovArea"
Therse are columns that can be changed to factors:
paste("There are ",length(unique(melbourne_house_prices$Rooms))," sizes according to rooms number")
[1] "There are 14 sizes according to rooms number"
paste("There are ",length(unique(melbourne_house_prices$Suburb)), " unique Suburbs")
[1] "There are 380 unique Suburbs"
paste("There are ",length(unique(melbourne_house_prices$Type)), " types of Houses")
[1] "There are 3 types of Houses"
paste("There are ",length(unique(melbourne_house_prices$Regionname)), " unique Regionname")
[1] "There are 8 unique Regionname"
paste("There are ",length(unique(melbourne_house_prices$LocalGovArea)), " unique Local goverment areas")
[1] "There are 34 unique Local goverment areas"
As we can see this dataset only has 34 Local Government areas, while the previous dataset had 79.
These unique values can be converted to factors instead, and some can given better meaningful labels.
melbourne_house_prices$Rooms = factor(melbourne_house_prices$Rooms, levels = unique(melbourne_house_prices$Rooms), ordered = FALSE)
melbourne_house_prices$Suburb = factor(melbourne_house_prices$Suburb, levels = unique(melbourne_house_prices$Suburb), ordered = FALSE)
melbourne_house_prices$Regionname = factor(melbourne_house_prices$Regionname, levels = unique(melbourne_house_prices$Regionname), ordered = FALSE)
melbourne_house_prices$LocalGovArea = factor(melbourne_house_prices$LocalGovArea, levels = unique(melbourne_house_prices$LocalGovArea), ordered = FALSE)
# Let's also convert Type to factor and give levels meaningful labels
melbourne_house_prices$Type <- factor(melbourne_house_prices$Type,levels=c('h','u','t'), labels=c('house','unit','townhouse'), ordered = FALSE)
head(melbourne_house_prices$Type)
[1] house house house house house townhouse
Levels: house unit townhouse
Now convert the date column to date type
melbourne_house_prices$Date <- as.Date(melbourne_house_prices$Date, "%d/%m/%Y")
Let’s check everything is now ok in this dataframe:
#Structure of the dataset
str(melbourne_house_prices)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 63023 obs. of 13 variables:
$ Suburb : Factor w/ 380 levels "Abbotsford","Aberfeldie",..: 1 1 1 2 3 3 3 3 4 5 ...
$ Address : chr "49 Lithgow St" "59A Turner St" "119B Yarra St" "68 Vida St" ...
$ Rooms : Factor w/ 14 levels "3","2","6","4",..: 1 1 1 1 2 2 2 1 3 1 ...
$ Type : Factor w/ 3 levels "house","unit",..: 1 1 1 1 1 3 2 1 1 1 ...
$ Price : num 1490000 1220000 1420000 1515000 670000 ...
$ Method : chr "S" "S" "S" "S" ...
$ SellerG : chr "Jellis" "Marshall" "Nelson" "Barry" ...
$ Date : Date, format: "2017-04-01" "2017-04-01" "2017-04-01" "2017-04-01" ...
$ Postcode : num 3067 3067 3067 3040 3042 ...
$ Regionname : Factor w/ 8 levels "Northern Metropolitan",..: 1 1 1 2 2 2 2 2 2 3 ...
$ Propertycount: num 4019 4019 4019 1543 3464 ...
$ Distance : num 3 3 3 7.5 10.4 10.4 10.4 10.4 14 3 ...
$ LocalGovArea : Factor w/ 34 levels "Yarra","Moonee Valley",..: 1 1 1 2 2 2 2 2 3 4 ...
- attr(*, "spec")=
.. cols(
.. Suburb = [31mcol_character()[39m,
.. Address = [31mcol_character()[39m,
.. Rooms = [32mcol_double()[39m,
.. Type = [31mcol_character()[39m,
.. Price = [32mcol_double()[39m,
.. Method = [31mcol_character()[39m,
.. SellerG = [31mcol_character()[39m,
.. Date = [31mcol_character()[39m,
.. Postcode = [32mcol_double()[39m,
.. Regionname = [31mcol_character()[39m,
.. Propertycount = [32mcol_double()[39m,
.. Distance = [32mcol_double()[39m,
.. CouncilArea = [31mcol_character()[39m
.. )
All good.
Checking dimensions
dim(vic_councils_indicators)
[1] 5214 7
This dataset is too big for our need, we will only take obervations that we need for our analysis. We need ‘Community Satisfaction’ only so we will filter dataframe to only give this indicator:
# Use filter to use Community Satisfaction only
vic_councils_indicators <- vic_councils_indicators %>%
filter(Description == "Community satisfaction rating out 100 with how council has performed in making decisions in the interests of the community")
# Check that we have 79 local goverments
paste("There are ", dim(vic_councils_indicators)[1], " local goverment areas in this dataset")
[1] "There are 79 local goverment areas in this dataset"
Remove unnecessary variables and give a better name for the dataset:
community_satisfaction <- vic_councils_indicators %>% select(1,2,7)
Also, change column names to match other dataset:
colnames(community_satisfaction)[1] <- "LocalGovArea"
colnames(community_satisfaction)[3] <- "CommunitySatisfaction"
Let’s check the structure of the dataset now:
#Structure of the dataset
str(community_satisfaction)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 79 obs. of 3 variables:
$ LocalGovArea : chr "Southern Grampians Shire" "South Gippsland Shire" "Stonnington City" "Ararat Rural City" ...
$ Group : chr "LARGE_SHIRE" "LARGE_SHIRE" "METRO" "SMALL_SHIRE" ...
$ CommunitySatisfaction: num 50 42 58 33 48 52 54 59 58 61 ...
Check Group variable if can be converted to factor:
unique(community_satisfaction$Group)
[1] "LARGE_SHIRE" "METRO" "SMALL_SHIRE" "INTERFACE" "REGIONAL"
Thus, we need to change group to factor:
community_satisfaction$Group = factor(community_satisfaction$Group, levels = unique(community_satisfaction$Group), ordered = FALSE)
One last thing, we need to rename some Local Goverment Areas so the name is aligned with other datasets:
community_satisfaction$LocalGovArea <- str_remove(community_satisfaction$LocalGovArea, " Shire")
community_satisfaction$LocalGovArea <- str_remove(community_satisfaction$LocalGovArea, " City")
community_satisfaction$LocalGovArea <- str_remove(community_satisfaction$LocalGovArea, "Borough of ")
community_satisfaction$LocalGovArea <- str_remove(community_satisfaction$LocalGovArea, " Rural")
In our case, it is easier to work with missing and special values and outliers while data is in wide format, so before going into further transformation, let’s deal with NAs first.
#create function for NAs and special values
is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}
Check if dataset 1a and 1b has NAs
sum(is.na(life_satisfaction))
[1] 0
sum(is.na(life_worthwhile))
[1] 0
sum(is.infinite(life_satisfaction$LifeSatisfaction))
Unknown or uninitialised column: 'LifeSatisfaction'.
[1] 0
sum(is.infinite(life_worthwhile$LifeWorthwhile))
Unknown or uninitialised column: 'LifeWorthwhile'.
[1] 0
sum(is.nan(life_satisfaction$LifeSatisfaction))
Unknown or uninitialised column: 'LifeSatisfaction'.
[1] 0
sum(is.nan(life_worthwhile$LifeWorthwhile))
Unknown or uninitialised column: 'LifeWorthwhile'.
[1] 0
None found.
Check if dataset 2 has NAs
colSums(is.na(melbourne_house_prices))
Suburb Address Rooms Type Price Method SellerG Date Postcode
0 0 0 0 14590 0 0 0 0
Regionname Propertycount Distance LocalGovArea
0 0 0 0
We see that prices in melbourne_house_prices dataset has many missing values, 14590. It is reasonable to find the average price of houses that has the same rooms numbre, same type and same local government area, and then use it to fill those NAs. Therefore, we fill them with the average prices of houses/units/townhouses in same local goverment area taking into account type and number of rooms.
# Group the dataset accorind to localGovArea, type and rooms then mutate price of NAs to
# have mean value of their corresponding group
melbourne_house_prices <-
melbourne_house_prices %>%
group_by(LocalGovArea,Type,Rooms) %>%
mutate(Price = ifelse(is.na(Price), mean(Price, na.rm = T), Price))
There will probably be some NAs left, because their group combination doesn’t exist thus can’t have a mean. Let’s check how many NAs left:
sum(is.na(melbourne_house_prices$Price))
[1] 42
42 is far less than 14590.
We can now safely omit these 42 observations.
melbourne_house_prices <- na.omit(melbourne_house_prices)
sum(is.na(melbourne_house_prices))
[1] 0
Check for NAs:
colSums(is.na(melbourne_house_prices))
Suburb Address Rooms Type Price Method SellerG Date Postcode
0 0 0 0 0 0 0 0 0
Regionname Propertycount Distance LocalGovArea
0 0 0 0
None found.
Working in life statisfaction and worthwhile dataset, we will mutate dataset male/female columns to give average of both
life_satisfaction <- life_satisfaction %>%
mutate(LowAndMedium =(`LowAndMedium Male`+`LowAndMedium Female`)/2,
High = (`High Male`+`High Female`) / 2,
VeryHigh = (`VeryHigh Male`+`VeryHigh Female`) /2)
life_worthwhile <- life_worthwhile %>%
mutate(LowAndMedium = (`LowAndMedium Male`+`LowAndMedium Female`)/2,
High = (`High Male`+`High Female`) / 2,
VeryHigh = (`VeryHigh Male`+`VeryHigh Female`) /2)
Let us also also consider high satistisfaction and very hight satistisfaction to be in same category
life_satisfaction <- life_satisfaction %>% mutate(HighAndVeryHigh = High + VeryHigh)
life_worthwhile <- life_worthwhile %>% mutate(HighAndVeryHigh = High + VeryHigh)
The sum of LowAndMedium and HighAndVeryHigh is always 100, thus we get rid of one, and since we are more interested in the high satisfaction rate will will only use HighAndVeryHigh from both datasets.
life_satisfaction <- life_satisfaction %>% select(LocalGovArea, HighAndVeryHigh)
life_worthwhile <- life_worthwhile %>% select(LocalGovArea, HighAndVeryHigh)
# Rename columns
colnames(life_satisfaction)[2] <- "LifeSatisfaction"
colnames(life_worthwhile)[2] <- "LifeWorthwhile"
Now we can merge the two dataset into one dataset that gives life satisfaction and life worthwhile:
life_sat_worth <- life_satisfaction %>% left_join(life_worthwhile, "LocalGovArea")
head(life_sat_worth)
We remove unnecessory columns:
melbourne_house_prices <- melbourne_house_prices %>% select(-c(1,2,6,7,8,9,10,11))
head(melbourne_house_prices)
Some houses might be very old, thus very cheap when compared to the average. On the other hand, some houses might be like mansion or made out of very good materials, thus very expensive. We can consider these as outliers because of sampling error and remove them. But we need proceed with cautious here. Furthermore, since data still in wide format it easier to proceed with outlier scanning.
Let’s first take a subgroup of this dataset which are houses with 3-5 rooms so that there is no huge different in their prices.
house_prices_3to5_rooms <- melbourne_house_prices %>%
filter((Rooms == 3) | (Rooms == 4) | (Rooms == 5))
dim(house_prices_3to5_rooms)
[1] 47123 5
We are left with 47123 observation which is about 75% of original data.
Let’s take a subset and plot a box plot for a 5 local goverment areas:
tmpHouses <- house_prices_3to5_rooms %>%
filter((LocalGovArea == "Yarra") | (LocalGovArea == "Melbourne") |
(LocalGovArea == "Hume") | (LocalGovArea == "Monash") |
(LocalGovArea == "Glen Eira"))
outliers = boxplot(tmpHouses$Price ~ droplevels(tmpHouses)$LocalGovArea,
main="Houses prices by local goverment", ylab = "Price",
xlab = "Local Goverment", plot = TRUE)
paste('There are ', length(outliers$out), ' outliers in this subset')
[1] "There are 320 outliers in this subset"
There are many outliers at the top side which indicate skewness, removing these outliers will most probably remove the skewness but there is a valid reason why we have this behavior since it applies to all local goverment areas. So, in this case we better transform our data first to remove skewness and then check for outliers.
If we take Yarra area as an example and plot a histogram:
tmpHouses <- house_prices_3to5_rooms %>% filter((LocalGovArea == "Yarra"))
hist(tmpHouses$Price, breaks = 20, main="Yarra's houses' prices", ylab = "Price", xlab = "Price")
Let’s use Log transformation to try and elimnate skewness
house_prices_3to5_rooms <- house_prices_3to5_rooms %>% mutate(logPrice = log(Price))
Check the skewness in Yarra now:
tmpHouses <- house_prices_3to5_rooms %>% filter((LocalGovArea == "Yarra"))
hist(tmpHouses$logPrice, breaks = 20, main="Yarra's houses' prices", ylab = "Price",
xlab = "Log(Price)")
Now let’s go back to our subset of local goverments and check the outliers:
tmpHouses <- house_prices_3to5_rooms %>%
filter((LocalGovArea == "Yarra") | (LocalGovArea == "Melbourne") |
(LocalGovArea == "Hume") | (LocalGovArea == "Monash") | (LocalGovArea == "Glen Eira"))
outliers <- boxplot(tmpHouses$logPrice ~ droplevels(tmpHouses)$LocalGovArea,
main="Houses prices by local goverment", ylab = "logPrice",
xlab = "Local Goverment", plot = TRUE)
paste('There are ', length(outliers$out), ' in this subset')
[1] "There are 238 in this subset"
There are now fewer outliers because of the transformation, which means that we are making use of more data.
Now lets omit outliers from the whole dataset using boxplot to find all outliers per group area.
# Use boxplot to store all outliers
all_outliers <- boxplot(house_prices_3to5_rooms$logPrice ~ house_prices_3to5_rooms$LocalGovArea,
plot = FALSE)
paste('Overall there are ', length(all_outliers$out), ' outliers.')
[1] "Overall there are 1229 outliers."
Now use the outliers object to go through all the data and omit outliers that were defined by the boxplot.
for(i in seq(1, length(all_outliers$out)))
house_prices_3to5_rooms <-
house_prices_3to5_rooms[!((house_prices_3to5_rooms$LocalGovArea==all_outliers$names[all_outliers$group[i]]) & (house_prices_3to5_rooms$logPrice==all_outliers$out[i])),]
The new box plot for the subset
tmpHouses <- house_prices_3to5_rooms %>%
filter((LocalGovArea == "Yarra") | (LocalGovArea == "Melbourne") | (LocalGovArea == "Hume") |
(LocalGovArea == "Monash") | (LocalGovArea == "Glen Eira"))
outliers <- boxplot(tmpHouses$logPrice ~ droplevels(tmpHouses)$LocalGovArea,
main="Houses prices by local goverment", ylab = "logPrice",
xlab = "Local Goverment", plot = TRUE)
As you can see it is better than previous dataset. In this way we can have a good average price for each local goverment.
Life satisfaction and worthwhile dataframes along with community satisfaction dataframe are both giving indicators about wellbeing of people in specific local goverment area. We will join these two datasets into one.
wellbening_indicators <- life_sat_worth %>% left_join(community_satisfaction, "LocalGovArea")
head(wellbening_indicators)
Check if anything went wrong by checking for NAs
sum(is.na(wellbening_indicators))
[1] 0
All good.
Life statisfaction and worthwhile dataset is not tidy because it has two observations per row, we will use gather function to transform it long format. Although it might take more space, long format is more flexiable for analysis and it is easy to slice data by adding a condition to a column.
wellbening_indicators <- wellbening_indicators %>%
gather(LifeSatisfaction, LifeWorthwhile, CommunitySatisfaction, key = "Indicator",
value = "IndicatorPercentage")
head(wellbening_indicators)
First let’s group melboune_houses_prices by LocalGovArea and calculate mean price and mean distance
house_prices_3to5_rooms <- house_prices_3to5_rooms %>% group_by(LocalGovArea) %>%
summarise(meanPrice = mean(Price), meanDistance = mean(Distance))
head(house_prices_3to5_rooms)
We can see that the above dataset is in wide format. We consider it tidy because this is how we want it to be when merging the dataset with wellbing indicators dataset.
After grouping, local Goverment Areas are no longer factors, so we change their type to char:
house_prices_3to5_rooms$LocalGovArea <- as.character(house_prices_3to5_rooms$LocalGovArea)
Rename some of the LocalGovArea so they satisfy the naming in the other dataset
house_prices_3to5_rooms$LocalGovArea[house_prices_3to5_rooms$LocalGovArea ==
"Cardinia Shire Council"] <- "Cardinia"
house_prices_3to5_rooms$LocalGovArea[house_prices_3to5_rooms$LocalGovArea ==
"Macedon Ranges Shire Council"] <- "Macedon Ranges"
house_prices_3to5_rooms$LocalGovArea[house_prices_3to5_rooms$LocalGovArea ==
"Mitchell Shire Council"] <- "Mitchell"
house_prices_3to5_rooms$LocalGovArea[house_prices_3to5_rooms$LocalGovArea ==
"Moorabool Shire Council"] <- "Moorabool"
house_prices_3to5_rooms$LocalGovArea[house_prices_3to5_rooms$LocalGovArea ==
"Murrindindi Shire Council"] <- "Murrindindi"
house_prices_3to5_rooms$LocalGovArea[house_prices_3to5_rooms$LocalGovArea ==
"Nillumbik Shire Council"] <- "Nillumbik"
house_prices_3to5_rooms$LocalGovArea[house_prices_3to5_rooms$LocalGovArea ==
"Yarra Ranges Shire Council"] <- "Yarra Ranges"
Now merge Local Goverments Areas houses information with wellbeing indicators, and since we only have 34 local goverment in house_prices_3to5_rooms dataset and 79 in the other dataset, we will left join house_prices_3to5_rooms with wellbening_indicators so that local goverment area that has no house information will be automatically discarded.
final_dataset <- house_prices_3to5_rooms %>% left_join(wellbening_indicators,"LocalGovArea")
head(final_dataset)
Check for NAs
sum(is.na(final_dataset))
[1] 0
None found.
Now our dataset is clean, tidy and ready for analysis.