In this assignment, we will apply different data manipulation and data handling techniques like merging, handling missing values, handling outliers and data transformation etc. Below are some packages needed to complete the tasks.
We loaded powerful packages like tidyr to tidy the data sets, hmisc package to help in the imputation etc.
library(readr)
library(tidyr)
library(dplyr)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Attaching package: 㤼㸱dplyr㤼㸲
The following objects are masked from 㤼㸱package:stats㤼㸲:
filter, lag
The following objects are masked from 㤼㸱package:base㤼㸲:
intersect, setdiff, setequal, union
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: 㤼㸱Hmisc㤼㸲
The following objects are masked from 㤼㸱package:dplyr㤼㸲:
src, summarize
The following objects are masked from 㤼㸱package:base㤼㸲:
format.pval, units
library(outliers)
library(magrittr)
Attaching package: 㤼㸱magrittr㤼㸲
The following object is masked from 㤼㸱package:tidyr㤼㸲:
extract
library(deductive)
library(validate)
Attaching package: 㤼㸱validate㤼㸲
The following objects are masked from 㤼㸱package:Hmisc㤼㸲:
label, label<-
The following object is masked from 㤼㸱package:ggplot2㤼㸲:
expr
The following object is masked from 㤼㸱package:dplyr㤼㸲:
expr
library(forecast)
Registered S3 method overwritten by 'xts':
method from
as.zoo.xts zoo
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Registered S3 methods overwritten by 'forecast':
method from
fitted.fracdiff fracdiff
residuals.fracdiff fracdiff
We collected 2 open data sets and subsetted one of them and merged with the data set2.
We detected the structure of data set and converted one of the character variable to factor which was further labeled and ordered.
Data set seems to be clean, hence we didnt really have to run through any data tidy methods.
We created one more variable using 2 numeric variables and named it EMI.
We ran through some methods of missing data and found it is less than 5%, hence we will be using data exclude technique as it is a safest approach.
We found outliers using univariate technique and again performed imputation.
Also, performed data transformation on variable with large numbers. There is a column name “Sale price” which deals with the property price. We performed log and reciprocal which helped in compressing high numbers for the amount.
As we move towards normality, we would be using Box-Cox to support it. We loaded package “forecast”.
–Census of Land Use and Employment (CLUE)
Census of Land Use and Employment (CLUE) provides comprehensive information about land use, employment and economic activity across the City of Melbourne, making it a valuable research tool.
CLUE assists the City of Melbourne’s business planning, policy development and strategic decision making. Investors, consultants, students, urban researchers, property analysts and businesses can also benefit from CLUE as it provides a better understanding of customers and the market place.
Source: This data has been pulled from https://data.melbourne.vic.gov.au/Property-Planning
–Data Set 1: This data set consist of the count of dwellings/housing per residential building for Residential Apartment, House/Townhouse.
–Data Set 2: This data set consist of the area, sale price and the total count of transaction decided for the property price.
–Both the data sets have “Type” in common which indicates whether the property is “Residential Apartment” or “House/Townhouse”.
****Variable Description ****
Census Date(Integer): Consists of year only, when property is finalized.
Block ID(Integer): Shared the block number for the residential property.
Property ID(Integer): This is the ID of the proprty.
Base Property ID(Integer): This column is used for record purposes indicates property ID only.
Street Address(Character): Indicate the address of the property.
CLUE Small Area(Character) : Indicates the suburb/locality of the property.
Type(Character): Indicates if property is “House/Townhouse” or “Residential Apartment”. This will be used as common key to join 2 data sets.
Dwelling Number, x.coordinate , y.coordinate, location: These columns are for CLUE data maintainance.
Sale Price(Numeric): This the amount decided for the propertty.
Transaction_count(Numeric): This is the variable with the total count of transaction decided to pay the sale price.
getwd()
[1] "C:/Users/aditi/OneDrive/Documents/Course/Data Preprocessing/Assignment 3"
#Data set 1: Residential_dwelling.csv
res <- read.csv("Residential_dwelling.csv", stringsAsFactors = FALSE)
head(res)
str(res)
'data.frame': 10122 obs. of 11 variables:
$ Census.year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
$ Base_property_ID: int 558820 616049 614213 107818 615136 604118 617929 107684 103766 614925 ...
$ Block_ID : int 438 2524 354 858 2513 1108 2512 608 917 2535 ...
$ Property_ID : int 558820 616049 614213 107818 615136 604118 617929 107684 103766 614925 ...
$ Street_address : chr "54-56 Walsh Street" "170 Bellair Street" "2 Victoria Court" "765 Punt Road" ...
$ Type : chr "House/Townhouse" "residential apartment" "House/Townhouse" "residential apartment" ...
$ CLUE.small.area : chr "West Melbourne (Residential)" "Kensington" "North Melbourne" "South Yarra" ...
$ Dwelling.number : int 1 1 1 7 1 133 1 11 1 1 ...
$ x.coordinate : num 145 145 145 145 145 ...
$ y.coordinate : num -37.8 -37.8 -37.8 -37.8 -37.8 ...
$ Location : chr "(-37.80820209, 144.9527228)" "(-37.79353782, 144.9299154)" "(-37.80477133, 144.9503638)" "(-37.83668585, 144.9865296)" ...
#Data set 2: sale_data.csv
sale <- read.csv("sale_data.csv", stringsAsFactors = FALSE)
head(sale)
str(sale)
'data.frame': 316 obs. of 5 variables:
$ Sale_Year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ Area : chr "Carlton" "Carlton" "Docklands" "East Melbourne" ...
$ Type : chr "House/Townhouse" "Residential Apartment" "Residential Apartment" "House/Townhouse" ...
$ Sale_Price : num 316250 220000 487300 622500 295000 ...
$ Transaction_Count: int 116 309 556 42 139 258 46 1065 57 115 ...
#Subsettings the 1st data set res and naming it as res_sub
res_sub <- res[1:100, c(3:6)]
str(res_sub)
'data.frame': 100 obs. of 4 variables:
$ Block_ID : int 438 2524 354 858 2513 1108 2512 608 917 2535 ...
$ Property_ID : int 558820 616049 614213 107818 615136 604118 617929 107684 103766 614925 ...
$ Street_address: chr "54-56 Walsh Street" "170 Bellair Street" "2 Victoria Court" "765 Punt Road" ...
$ Type : chr "House/Townhouse" "residential apartment" "House/Townhouse" "residential apartment" ...
#Merging the data into data set 2 after subsetting
res_sale <- left_join(res_sub, sale, by="Type")
str(res_sale)
'data.frame': 10468 obs. of 8 variables:
$ Block_ID : int 438 438 438 438 438 438 438 438 438 438 ...
$ Property_ID : int 558820 558820 558820 558820 558820 558820 558820 558820 558820 558820 ...
$ Street_address : chr "54-56 Walsh Street" "54-56 Walsh Street" "54-56 Walsh Street" "54-56 Walsh Street" ...
$ Type : chr "House/Townhouse" "House/Townhouse" "House/Townhouse" "House/Townhouse" ...
$ Sale_Year : int 2000 2000 2000 2000 2000 2000 2000 2001 2001 2001 ...
$ Area : chr "Carlton" "East Melbourne" "Kensington" "North Melbourne" ...
$ Sale_Price : num 316250 622500 215250 280000 516000 ...
$ Transaction_Count: int 116 42 258 115 42 39 34 113 42 315 ...
dim(res_sale)
[1] 10468 8
Below is the description of different variables. We will convert Character variable “Type” to Factor.
****Variable Description ****
Block ID(Integer): Shared the block number for the residential property.
Property ID(Integer): This is the ID of the property.
Base Property ID(Integer): This column is used for record purposes indicates property ID only.
Street Address(Character): Indicate the address of the property.
CLUE Small Area(Character) : Indicates the suburb/locality of the property.
Type(Character): Indicates if property is “House/Townhouse” or “Residential Apartment”. This will be used as common key to join 2 data sets.
Sale_Year(Integer): When property is finalized.
Sale Price(Numeric): This the amount decided for the property.
Transaction_count(Numeric): This is the variable with the total count of transaction decided to pay the sale price.
–Here we have converted one character variable to factor variable and shared the final structure of “res_sale”.
–For the factor variable, we will use labels which are easy to read and share the level after applying new labels.
#converting character to factor.
res_sale$Type <- as.factor(res_sale$Type)
str(res_sale)
'data.frame': 10468 obs. of 8 variables:
$ Block_ID : int 438 438 438 438 438 438 438 438 438 438 ...
$ Property_ID : int 558820 558820 558820 558820 558820 558820 558820 558820 558820 558820 ...
$ Street_address : chr "54-56 Walsh Street" "54-56 Walsh Street" "54-56 Walsh Street" "54-56 Walsh Street" ...
$ Type : Factor w/ 2 levels "House/Townhouse",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Sale_Year : int 2000 2000 2000 2000 2000 2000 2000 2001 2001 2001 ...
$ Area : chr "Carlton" "East Melbourne" "Kensington" "North Melbourne" ...
$ Sale_Price : num 316250 622500 215250 280000 516000 ...
$ Transaction_Count: int 116 42 258 115 42 39 34 113 42 315 ...
levels(res_sale$Type) #Shows default level of Type which is alphabetically sorted.
[1] "House/Townhouse" "residential apartment"
res_sale$Type <- factor(res_sale$Type, levels = c("House/Townhouse","residential apartment"), labels=c("House","Apartment"))
levels(res_sale$Type)
[1] "House" "Apartment"
res_sale
NA
NA
NA
NA
Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format (minimum requirement #5). In addition to the R codes and outputs, explain everything that you do in this step.
#Data is in tidy format, hence proceeding with the next requirement.
Added one more variable from the existing data by “EMI” in final dataset by taking sale price by transaction count and named it res_sale1.
EMI is defined as total sale price divided by transaction of count in which the amount would be paid.
res_sale1 <- mutate(res_sale, EMI = Sale_Price/Transaction_Count)
str(res_sale1)
'data.frame': 10468 obs. of 9 variables:
$ Block_ID : int 438 438 438 438 438 438 438 438 438 438 ...
$ Property_ID : int 558820 558820 558820 558820 558820 558820 558820 558820 558820 558820 ...
$ Street_address : chr "54-56 Walsh Street" "54-56 Walsh Street" "54-56 Walsh Street" "54-56 Walsh Street" ...
$ Type : Factor w/ 2 levels "House","Apartment": 1 1 1 1 1 1 1 1 1 1 ...
$ Sale_Year : int 2000 2000 2000 2000 2000 2000 2000 2001 2001 2001 ...
$ Area : chr "Carlton" "East Melbourne" "Kensington" "North Melbourne" ...
$ Sale_Price : num 316250 622500 215250 280000 516000 ...
$ Transaction_Count: int 116 42 258 115 42 39 34 113 42 315 ...
$ EMI : num 2726 14821 834 2435 12286 ...
res_sale1
NA
NA
NA
Using “colsums”, we will find missing values for each variable.
Using “which”, we can find the location of the outliers.
We have scanned data set and each column that shows total of 95 missing values which is less than 5% of total observations(10468). So, we will simply exclude them in the operations by using “na.rm=TRUE”.
str(res_sale1)
'data.frame': 10468 obs. of 9 variables:
$ Block_ID : int 438 438 438 438 438 438 438 438 438 438 ...
$ Property_ID : int 558820 558820 558820 558820 558820 558820 558820 558820 558820 558820 ...
$ Street_address : chr "54-56 Walsh Street" "54-56 Walsh Street" "54-56 Walsh Street" "54-56 Walsh Street" ...
$ Type : Factor w/ 2 levels "House","Apartment": 1 1 1 1 1 1 1 1 1 1 ...
$ Sale_Year : int 2000 2000 2000 2000 2000 2000 2000 2001 2001 2001 ...
$ Area : chr "Carlton" "East Melbourne" "Kensington" "North Melbourne" ...
$ Sale_Price : num 316250 622500 215250 280000 516000 ...
$ Transaction_Count: int 116 42 258 115 42 39 34 113 42 315 ...
$ EMI : num 2726 14821 834 2435 12286 ...
In this section, we will find the outliers using univariate way for the numeric variables..
A box plot is a graphical display for describing the distribution of the data using the median, the first (Q1) and third quartiles (Q3), and the inter-quartile range (IQR=Q3−Q1).
Since we found 1 outlier for sales price and EMI which is potentially due to high property price based on the locality the properrty is in, hence is an outlier.
We usually use “out” to find the location of outliers and that’s how we can remove them from total set, however we are using imputing to do it.
#We have 2 numerical columns and we will use univariate way to find outliers.
res_sale2 <- res_sale1$Sale_Price %>% boxplot(main="Box plot for Sale price", ylab="Sale_price", col= "green", na.rm=TRUE)
res_sale3 <-res_sale1$EMI %>% boxplot(main = "Box plot for EMI", ylab="EMI", col="blue", na.rm=TRUE)
#Using out to find the location.
res_sale3$out
[1] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[11] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[21] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[31] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[41] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[51] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[61] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[71] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[81] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[91] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[101] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[111] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[121] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[131] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[141] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[151] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[161] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[171] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[181] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[191] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[201] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[211] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[221] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[231] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[241] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[251] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[261] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[271] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[281] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[291] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[301] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[311] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[321] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[331] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[341] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[351] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[361] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[371] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[381] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[391] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[401] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[411] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[421] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[431] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[441] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[451] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[461] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[471] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[481] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[491] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[501] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[511] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[521] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[531] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[541] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[551] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[561] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[571] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[581] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[591] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[601] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33
[611] 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33
[621] 72291.67 74583.33 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33
[631] 72361.11 170909.09 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
[641] 163030.00 433333.33 135411.76 158333.33 72291.67 74583.33 72361.11 170909.09
res_sale2$out
[1] 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000
[12] 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000
[23] 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000
[34] 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000
[45] 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600
[56] 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000
[67] 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000
[78] 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000
[89] 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000
[100] 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000
[111] 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600
[122] 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000
[133] 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000
[144] 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000
[155] 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000
[166] 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000
[177] 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600
[188] 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000
[199] 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000
[210] 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000
[221] 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000
[232] 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000
[243] 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600
[254] 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000
[265] 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000
[276] 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000
[287] 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000
[298] 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000
[309] 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600
[320] 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000
[331] 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000
[342] 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000
[353] 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000
[364] 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000
[375] 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600
[386] 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000
[397] 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000
[408] 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000
[419] 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000
[430] 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000
[441] 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600
[452] 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000 2135000
[463] 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000 2100000
[474] 2135000 3260600 14300000 2302000 2605000 2100000 2135000 3260600 14300000 2302000 2605000
[485] 2100000 2135000
#Using imputing to fix outliers
#For Sales price column
res_sale1$Sale_Price[ which( abs(z.scores) >3 )] <- mean(res_sale1$Sale_Price, na.rm = TRUE)
#For EMI column
res_sale1$EMI[ which( abs(z.scores) >3 )] <- mean(res_sale1$EMI, na.rm = TRUE)
In this section, we will perform Data transformation on “Sale_Price”.
Our reason to perform data transformation is to understand sales prices better. We will perform mathematical ways to tansform the data.
We performed transformation via logarithm and reciprocal methods.
The Box-Cox transformation is a type of power transformation to transform non-normal data into a normal distribution.
#data transformation on sale price as it helps compressing high values.
#1. The Log Transformation
log_saleprice <- log10(res_sale1$Sale_Price)
hist(log_saleprice)
#2. Reciprocal Transformation
recp <- 1/res_sale1$Sale_Price
hist(recp)
NA
NA
NA
NA
NA
NOTE: Follow the order outlined above in the report as possible as you can. Note that sometimes the order of the tasks may be different than the order given here. Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head() )