WQD 7004 Group Project

Semester 1 2020/2021

Instructor: AP Dr Fariza Hanum Md Nasaruddin

Team Members:

  1. Liow Wei Jie (S2016012)
  2. Tang Kam Yin (S2018291)

Summary

This project was completed using a dataset acquired from Kaggle. This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

The goal of this analysis is to predict the price of housing for 2016 in King County based on the variables provided in the dataset. Any observations which are unrelated to the goal of predicting the house pricing will be recorded and summarized at the end of this report.

Objectives

  0. Preparation
  1. Overview of Data
  2. Data Pre-Processing
  3. Exploratory data analysis
  4. Predictive Modeling
  5. Conclusion & Future Works

0. Preparation

Before exploring the data and building the models, we need to load some necessary packages and call the libraries for this analysis.

library(tidyverse) # used for data manipulation and visualization
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.1.2 --
## v broom     0.7.2      v recipes   0.1.15
## v dials     0.0.9      v rsample   0.0.8 
## v infer     0.5.3      v tune      0.1.2 
## v modeldata 0.1.0      v workflows 0.2.1 
## v parsnip   0.1.4      v yardstick 0.0.7
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
library(caret) # used to streamline the model training process for regression and classification problems
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
library(leaflet) #creates an interactive map
library(GGally) #extends ggplot2 by adding several functions to reduce the complexity of combining geoms with transformed data.
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(Amelia)# used for imputing missing data
## Loading required package: Rcpp
## 
## Attaching package: 'Rcpp'
## The following object is masked from 'package:rsample':
## 
##     populate
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2021 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(corrplot) #visualizing a correlation matrix in R
## corrplot 0.84 loaded
library(cluster)
library(caTools)

1. Overview of Data

Importing the dataset

house = read.csv('kc_house_data.csv')

This dataset contains house prices in King County, Washington, based on sales from May 2014 to May 2015. Apart from the house price, it contains 20 other variables such as the date of sale, the sale ID and the house condition. The table below describes each variable in the dataset.

| Variable | Description |
| --- | --- |
| id | Unique ID per house sale |
| date | Date of the house sale |
| price | Sale price of the house in USD |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms, where 0.5 represents a bathroom with a toilet but no shower |
| sqft_living | Square footage of the house's interior living space |
| sqft_lot | Square footage of the land space |
| floors | Number of floors |
| waterfront | Indicator of whether the house overlooks the waterfront: 0 = no waterfront, 1 = waterfront |
| view | Index from 0 to 4 of how good the view of the property is: 0 = no good view, 4 = very good view |
| condition | Index from 1 to 5 of the condition of the house: 1 = poor condition, 5 = superb condition |
| grade | Index from 1 to 13: 1 to 3 fall short of building construction and design standards, 7 is an average level of construction and design, and 11 to 13 are a high level of construction and design |
| sqft_above | Square footage of the interior housing space above ground level |
| sqft_basement | Square footage of the interior housing space below ground level |
| yr_built | Year the house was built |
| yr_renovated | Year of the house's last renovation |
| zipcode | Postal code of the area the house is in |
| lat | Latitude |
| long | Longitude |
| sqft_living15 | Average square footage of interior living space for the nearest 15 neighboring houses |
| sqft_lot15 | Average square footage of land space for the nearest 15 neighboring houses |

2. Data Pre-Processing

First, we display the compact structure of the data and its variables using str().

str(house)
## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : chr  "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...

Then, we display the first six rows of the data using head().

head(house)
##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000  221900        3      1.00        1180     5650
## 2 6414100192 20141209T000000  538000        3      2.25        2570     7242
## 3 5631500400 20150225T000000  180000        2      1.00         770    10000
## 4 2487200875 20141209T000000  604000        4      3.00        1960     5000
## 5 1954400510 20150218T000000  510000        3      2.00        1680     8080
## 6 7237550310 20140512T000000 1225000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930

Next, we examine the summary statistics of each variable using summary().

summary(house)
##        id                date               price            bedrooms     
##  Min.   :1.000e+06   Length:21613       Min.   :  75000   Min.   : 0.000  
##  1st Qu.:2.123e+09   Class :character   1st Qu.: 321950   1st Qu.: 3.000  
##  Median :3.905e+09   Mode  :character   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09                      Mean   : 540088   Mean   : 3.371  
##  3rd Qu.:7.309e+09                      3rd Qu.: 645000   3rd Qu.: 4.000  
##  Max.   :9.900e+09                      Max.   :7700000   Max.   :33.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##    waterfront            view          condition         grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.000   Median : 7.000  
##  Mean   :0.007542   Mean   :0.2343   Mean   :3.409   Mean   : 7.657  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.000  
##    sqft_above   sqft_basement       yr_built     yr_renovated   
##  Min.   : 290   Min.   :   0.0   Min.   :1900   Min.   :   0.0  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.0  
##  Mean   :1788   Mean   : 291.5   Mean   :1971   Mean   :  84.4  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.0  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12768  
##  3rd Qu.: 10083  
##  Max.   :871200

2.1 Data Cleaning

Missing Value Detection: The missmap() function from the Amelia package was used to identify missing data in the dataset. The map below shows that none of the variables contains missing values.

missmap(house)
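
The missingness map is a visual check. As a complement, a minimal sketch of a numeric check is shown here. The data frame `toy` and its values are made up for illustration and are not part of the project data; on the real data the same two calls would be applied to `house`.

```r
# Toy data frame standing in for `house` (illustrative values only)
toy <- data.frame(
  price    = c(221900, 538000, NA, 604000),
  bedrooms = c(3, 3, 2, 4)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when every column is complete
no_missing <- all(na_counts == 0)
print(no_missing)
```

A vector of zeros from `colSums(is.na(...))` would agree with a missingness map that shows no missing cells.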

Outlier Detection: Outliers were detected and analyzed using boxplots. The boxplots show many outliers in the target variable, price. However, these price outliers correspond to outliers in the number of bedrooms, the number of bathrooms and sqft_living. Upon further investigation, we found that they also correspond to high values of condition, view and grade. We therefore concluded that these are legitimate observations and retained them in the data.

boxplot(house$price)

boxplot(house$bedrooms)

boxplot(house$bathrooms)

boxplot(house$sqft_living)
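
To list the points that a boxplot draws as outliers, rather than only viewing them, one option is `boxplot.stats()` from base R. The sketch below uses a hypothetical vector `prices` with made-up values; it is not the analysis performed in this report.

```r
# Hypothetical prices (illustrative values, not taken from the dataset)
prices <- c(200000, 250000, 300000, 320000, 350000, 400000, 450000, 3000000)

# boxplot.stats() returns the values lying beyond the whiskers,
# i.e. the points boxplot() would draw individually
out_vals <- boxplot.stats(prices)$out
print(out_vals)

# Positions of those values, useful for inspecting the matching rows
out_idx <- which(prices %in% out_vals)
print(out_idx)
```

With a data frame, the indices in `out_idx` could be used to look at the other columns of the flagged rows, which is how one might check whether extreme prices coincide with extreme house features.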

Summary of other data inconsistencies: All data clean-up is performed on a copy of the original dataset, named “house_clean”.

house_clean=house
nrow(house_clean)
## [1] 21613

There were three findings:

  1. One observation has 33 bedrooms in 1,620 sq ft and only 1.75 bathrooms. The value was replaced with 3 bedrooms.
max(house_clean$bedrooms)
## [1] 33
house_clean$bedrooms[house_clean$bedrooms==33]=3
nrow(house_clean)
## [1] 21613
max(house_clean$bedrooms)
## [1] 11
  2. 13 observations have 0 bedrooms and 0.75 to 2.5 bathrooms. Since houses without bedrooms are uncommon, we decided to exclude these observations.
min(house_clean$bedrooms)
## [1] 0
house_clean= house_clean[house_clean$bedrooms != 0,]
nrow(house_clean)
## [1] 21600
  3. In total, there are 10 observations with 0 bathrooms. After step #2 above, only 3 of them remain. Since houses without bathrooms are uncommon, we decided to exclude these observations.
min(house_clean$bathrooms)
## [1] 0
house_clean= house_clean[house_clean$bathrooms != 0,]
nrow(house_clean)
## [1] 21597
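
The three cleaning rules above can be sketched on a small data frame to see how many rows each rule touches. The data frame `toy` and its values are hypothetical and chosen only to trigger each rule once.

```r
# Toy data: one 33-bedroom row, one 0-bedroom row, one 0-bathroom row
toy <- data.frame(
  bedrooms  = c(3, 33, 0, 4, 2),
  bathrooms = c(1.00, 1.75, 2.50, 0.00, 1.50)
)

# How many rows each rule would affect before anything is changed
n_33_bed  <- sum(toy$bedrooms == 33)
n_0_bed   <- sum(toy$bedrooms == 0)
n_0_bath  <- sum(toy$bathrooms == 0)

clean_toy <- toy
clean_toy$bedrooms[clean_toy$bedrooms == 33] <- 3   # rule 1: replace the value
clean_toy <- clean_toy[clean_toy$bedrooms != 0, ]   # rule 2: drop 0 bedrooms
clean_toy <- clean_toy[clean_toy$bathrooms != 0, ]  # rule 3: drop 0 bathrooms

print(nrow(clean_toy))
```

Counting the affected rows first makes it easy to confirm that the row count after cleaning drops by exactly the expected amount.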

2.2 Data Transformation

A majority of the variables found in the King County housing dataset were deemed acceptable for performing the analysis. However, while traversing the data we found that some of the columns need to have their data types adjusted in order to meet our goal. Thus we made the decision to retain all 21 original columns along with the transformed data.

The columns transformed are listed below:

  1. Date: Convert the date string to a number for regression
house_clean$date <- substr(house_clean$date, 1, 8)  # keep the YYYYMMDD part
house_clean$date <- ymd(house_clean$date)           # parse into a Date object
house_clean$date <- as.numeric(house_clean$date)    # days since 1970-01-01
head(house_clean$date)
## [1] 16356 16413 16491 16413 16484 16202
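
The numbers printed above are day counts from 1970-01-01, the epoch R uses for Date objects. A small sketch that reproduces the first two values with base R only and converts them back to calendar dates (the vector names `raw`, `num` and `back` are introduced here for illustration):

```r
# First two date strings as they appear in the raw data
raw <- c("20141013T000000", "20141209T000000")

# Parse the YYYYMMDD prefix, then convert to days since 1970-01-01
d   <- as.Date(substr(raw, 1, 8), format = "%Y%m%d")
num <- as.numeric(d)
print(num)

# Round trip: the numeric value maps back to the original calendar date
back <- as.Date(num, origin = "1970-01-01")
print(format(back))
```

The round trip is a quick way to confirm that no information other than the time of day, which is always zero in these strings, is lost by the conversion.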
  2. Age: New column, Data Type as Numeric Continuous
house_clean$age= 2015 - house_clean$yr_built + 1
head(house_clean$age)
## [1] 61 65 83 51 29 15
  3. Renovated: New column, Data Type as Numeric Nominal
    Convert “yr_renovated” into a simpler categorical variable that indicates whether a house has been renovated in the past.

The table below describes the categories of Renovated (Variable: renovated).

| Category | Definition |
| --- | --- |
| 1 | yr_renovated == 0: no renovation has been done. |
| 2 | yr_renovated != 0: a renovation has been done. |
house_clean$renovated= cut(house_clean$yr_renovated, breaks = c(-1,0,3000), labels=c("1","2"))
house_clean$renovated=as.numeric(house_clean$renovated)
head(house_clean$renovated)
## [1] 1 2 1 1 1 1
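
Calling as.numeric() on a factor returns the internal level codes, not the labels. Here the two coincide because the labels are "1" and "2" in level order. The sketch below, on made-up values, shows the case used above and a hypothetical case where codes and labels differ.

```r
# Same recoding as above, on a few illustrative values
yr_renovated <- c(0, 1991, 0, 2005)
renovated <- cut(yr_renovated, breaks = c(-1, 0, 3000), labels = c("1", "2"))

codes      <- as.numeric(renovated)                # level codes
via_labels <- as.numeric(as.character(renovated))  # labels parsed as numbers
print(codes)
print(via_labels)

# Hypothetical factor whose labels are not 1, 2, ...: codes and labels differ
f <- factor(c("10", "20", "10"))
f_codes  <- as.numeric(f)
f_labels <- as.numeric(as.character(f))
print(f_codes)
print(f_labels)
```

Going through as.character() first is the safer habit whenever the numeric meaning of the labels matters.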
  4. Price Category: New column, Data Type as Numeric Nominal
    Convert the “price” variable from a numeric continuous variable to a categorical variable for categorical modeling purposes in this project.

The table below describes the price categories (Variable: price_cat).

| Category | Price Range (USD) |
| --- | --- |
| 1 | 0 to 350,000 |
| 2 | 350,001 to 450,000 |
| 3 | 450,001 to 700,000 |
| 4 | 700,001 and above |
house_clean$price_cat = cut(house_clean$price, breaks = c(0,350000,450000,700000,10000000), labels=c("1","2","3","4"))
house_clean$price[1:10]
##  [1]  221900  538000  180000  604000  510000 1225000  257500  291850  229500
## [10]  323000
house_clean$price_cat[1:10]
##  [1] 1 3 1 3 3 4 1 1 1 1
## Levels: 1 2 3 4
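
By default cut() builds intervals that are open on the left and closed on the right, which is why a price of exactly 350,000 falls in category 1 and 350,001 in category 2. A short sketch on hypothetical boundary prices (the vector `p` is made up for illustration):

```r
# Same breaks and labels as used for price_cat
breaks <- c(0, 350000, 450000, 700000, 10000000)

# Prices sitting exactly on and just above each boundary
p <- c(350000, 350001, 450000, 700000, 700001)

# Default right = TRUE: each interval is (lower, upper]
p_cat <- cut(p, breaks = breaks, labels = c("1", "2", "3", "4"))
print(as.character(p_cat))

# Values outside (0, 10000000] are not covered by any interval and become NA
outside <- cut(c(0, 20000000), breaks = breaks, labels = c("1", "2", "3", "4"))
print(is.na(outside))
```

Since the summary earlier in the report shows prices between 75,000 and 7,700,000, every observation falls inside the outer breaks.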

3. Exploratory data analysis

The objective of data visualization and pattern discovery is to reveal the relationships between the house features and the target variable, price. We want to identify the house features which affect the price variable and could be potential predictors. Through visualization, we gathered the following information about the data.

Correlation Matrix: The correlation matrix gives a summary of correlations between the variables in the dataset. The objective behind analyzing the correlation between the continuous variables in the data was to identify variables that have significant linear relationship with price and those which do not. This matrix can help to identify relationship between potential predictors.

house_clean.cor = cor(house_clean[sapply(house_clean, function(x) !is.factor(x))])
corrplot(house_clean.cor)

From the correlation matrix, these are the findings:

  1. Price has a high positive correlation with the number of bathrooms, sqft_living, grade, sqft_above and sqft_living15.

  2. Price has a low positive correlation with the number of bedrooms, floors, waterfront, view, sqft_basement and latitude.

  3. Price has no significant relationship with sqft_lot, condition, yr_built, yr_renovated, zipcode, longitude, sqft_lot15, age and renovated.

  4. sqft_above, sqft_living15, the number of bathrooms, the number of bedrooms and grade show a high positive correlation with sqft_living and may explain the same variation in price as sqft_living.
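
The findings above were read from the plot. The same ranking can be obtained numerically by taking the price column of the correlation matrix and sorting it. This is a minimal sketch on simulated data: the data frame `toy`, its coefficients and the `noise` column are assumptions made for illustration only.

```r
set.seed(1)
n <- 200
sqft_living <- runif(n, 500, 4000)

# Simulated data: price depends on sqft_living, `noise` is unrelated
toy <- data.frame(
  sqft_living = sqft_living,
  noise       = rnorm(n),
  price       = 200 * sqft_living + rnorm(n, sd = 50000)
)

# Correlation of every variable with price, excluding price itself
cors <- cor(toy)[, "price"]
cors <- cors[names(cors) != "price"]

# Strongest positive correlation first
ranked <- sort(cors, decreasing = TRUE)
print(round(ranked, 2))
```

On the project data the equivalent would be to index the `house_clean.cor` matrix by its "price" column.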

In addition to the correlation matrix, the following charts were created:

  1. Scatterplot Matrix
     a. High positive correlation
pairs(~price+bathrooms+sqft_living+grade+sqft_above+sqft_living15, data=house_clean, main="High Positive Corr. ScatterPlot Matrix")

     b. Low positive correlation
pairs(~price+bedrooms+floors+waterfront+view+sqft_basement+lat, data=house_clean, main="Low Positive Corr. ScatterPlot Matrix")

  2. Latitude vs Longitude, coloured by Price: The map below illustrates the King County region. Areas without coloured points are not incorporated cities. Price increases as we move from the south to the north along the latitude, but varies little as we move along the longitude.

To build the map for this dataset, we will use the leaflet package, which creates an interactive map, and the color of the circle markers on the map varies depending on the price. The higher the price a house is sold for, the bolder the color.

coordinates_data = dplyr::select(house_clean, price, lat, long)
head(coordinates_data)
##     price     lat     long
## 1  221900 47.5112 -122.257
## 2  538000 47.7210 -122.319
## 3  180000 47.7379 -122.233
## 4  604000 47.5208 -122.393
## 5  510000 47.6168 -122.045
## 6 1225000 47.6561 -122.005
pal = colorNumeric("YlOrRd", domain = coordinates_data$price)
int_map <- coordinates_data %>%
  leaflet() %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  # `color` is the full argument name; opacity must lie between 0 and 1
  addCircleMarkers(color = ~pal(price), opacity = 1, radius = 0.3) %>%
  addLegend(pal = pal, values = ~price)
## Assuming "long" and "lat" are longitude and latitude, respectively
int_map
  3. Price vs Square Feet Living, coloured by Number of Bathrooms: The scatterplot below, which uses sqft_living15 on the x-axis, indicates that the house price increases as the living area and the number of bathrooms increase.
plot(house_clean$sqft_living15, house_clean$price, pch=19, col=house_clean$bathrooms, xlab='Square foot living+No.of bathrooms',ylab='House Price')

4. Predictive Modeling

  1. Regression
    Multiple Linear Regression Model

The first model uses the variables that showed a high positive correlation with price in the corrplot.

hr=select(house_clean,price,bathrooms,sqft_living,grade,sqft_above,sqft_living15)
set.seed(123)
split=sample.split(hr$price, SplitRatio = 0.8)
training_set=subset(hr, split==TRUE)
test_set=subset(hr, split==FALSE)
regressor=lm(formula=price~., data=training_set)
summary(regressor)
## 
## Call:
## lm(formula = price ~ ., data = training_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1005385  -137443   -22645   100515  4736642 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.664e+05  1.525e+04 -43.702  < 2e-16 ***
## bathrooms     -3.493e+04  3.859e+03  -9.053  < 2e-16 ***
## sqft_living    2.561e+02  5.070e+00  50.504  < 2e-16 ***
## grade          1.126e+05  2.777e+03  40.542  < 2e-16 ***
## sqft_above    -8.285e+01  5.010e+00 -16.537  < 2e-16 ***
## sqft_living15  1.788e+01  4.506e+00   3.967  7.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 253800 on 17819 degrees of freedom
## Multiple R-squared:  0.5496, Adjusted R-squared:  0.5495 
## F-statistic:  4349 on 5 and 17819 DF,  p-value: < 2.2e-16

As shown, the multiple R-squared is 0.5496 (adjusted R-squared 0.5495), which is not considered strong, even though every predictor has a high positive correlation with price. For comparison, another model is built on all available variables to find the highest multiple R-squared we can obtain.

set.seed(123)
split2=sample.split(house_clean$price, SplitRatio = 0.8)
training_set2=subset(house_clean, split2==TRUE)
test_set2=subset(house_clean, split2==FALSE)
regressor2=lm(formula=price~., data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ ., data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1041221   -78634    -2392    62995  4776282 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.170e+07  3.179e+06   3.679 0.000235 ***
## id            -1.177e-06  5.010e-07  -2.350 0.018801 *  
## date           8.871e+01  1.263e+01   7.025 2.21e-12 ***
## bedrooms      -3.791e+04  2.053e+03 -18.468  < 2e-16 ***
## bathrooms      3.860e+04  3.399e+03  11.358  < 2e-16 ***
## sqft_living    1.282e+02  4.580e+00  27.986  < 2e-16 ***
## sqft_lot       1.446e-02  4.738e-02   0.305 0.760198    
## floors        -8.876e+03  3.782e+03  -2.347 0.018922 *  
## waterfront     5.977e+05  1.760e+04  33.965  < 2e-16 ***
## view           3.810e+04  2.220e+03  17.160  < 2e-16 ***
## condition      1.664e+04  2.482e+03   6.704 2.09e-11 ***
## grade          5.865e+04  2.360e+03  24.847  < 2e-16 ***
## sqft_above     2.905e+01  4.534e+00   6.409 1.50e-10 ***
## sqft_basement         NA         NA      NA       NA    
## yr_built      -1.792e+03  7.785e+01 -23.023  < 2e-16 ***
## yr_renovated   2.451e+03  4.443e+02   5.516 3.53e-08 ***
## zipcode       -4.718e+02  3.421e+01 -13.791  < 2e-16 ***
## lat            3.895e+05  1.326e+04  29.385  < 2e-16 ***
## long          -1.857e+05  1.370e+04 -13.548  < 2e-16 ***
## sqft_living15 -6.247e+00  3.600e+00  -1.736 0.082660 .  
## sqft_lot15    -2.993e-01  7.596e-02  -3.940 8.17e-05 ***
## age                   NA         NA      NA       NA    
## renovated     -4.870e+06  8.868e+05  -5.491 4.06e-08 ***
## price_cat2     2.264e+04  4.613e+03   4.909 9.23e-07 ***
## price_cat3     8.759e+04  4.799e+03  18.251  < 2e-16 ***
## price_cat4     3.323e+05  6.873e+03  48.348  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 189700 on 17801 degrees of freedom
## Multiple R-squared:  0.7488, Adjusted R-squared:  0.7485 
## F-statistic:  2307 on 23 and 17801 DF,  p-value: < 2.2e-16
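
The two NA rows in the coefficient table above are reported "because of singularities", meaning that a predictor is an exact linear combination of others. For age this follows from its definition in section 2.2 (2015 - yr_built + 1). For sqft_basement, the rows printed by head() are consistent with sqft_living = sqft_above + sqft_basement, although that identity is an assumption here and not something verified on the full data. A minimal sketch of the effect on simulated data (all values are made up):

```r
set.seed(42)
n <- 50
yr_built <- sample(1900:2015, n, replace = TRUE)

# `age` is an exact linear function of `yr_built`, as in section 2.2
toy <- data.frame(
  yr_built = yr_built,
  age      = 2015 - yr_built + 1,
  price    = 500000 - 1000 * (yr_built - 1900) + rnorm(n, sd = 10000)
)

# lm() cannot estimate both effects separately and reports NA for the second
fit <- lm(price ~ yr_built + age, data = toy)
print(coef(fit))
```

Dropping either variable of such a pair leaves the fitted values unchanged, so nothing is lost by removing the redundant column.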

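Next, we remove the variables that are not significant or not suitable as predictors: id, date, sqft_lot, floors, sqft_basement, sqft_living15, age, renovated and the price categories (price_cat is derived from price itself, so it leaks the target into the model). We then apply backward elimination to find the best combination of the remaining variables.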
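
Backward elimination can also be automated. The sketch below uses `stats::step()`, which drops terms by AIC and not by p-value, so it is a related technique and not the procedure followed in this report. It is called with its namespace because the package start-up messages in section 0 show `recipes::step()` masking `stats::step()`. The data frame `toy` and its columns are simulated for illustration.

```r
set.seed(7)
n <- 300

# Simulated data: two real predictors and two pure-noise columns
toy <- data.frame(
  sqft_living = runif(n, 500, 4000),
  grade       = sample(5:12, n, replace = TRUE),
  noise1      = rnorm(n),
  noise2      = rnorm(n)
)
toy$price <- 150 * toy$sqft_living + 50000 * toy$grade + rnorm(n, sd = 40000)

# Start from the full model and let step() remove terms one at a time
full    <- lm(price ~ ., data = toy)
reduced <- stats::step(full, direction = "backward", trace = 0)

# Names of the predictors that survived the elimination
kept <- attr(terms(reduced), "term.labels")
print(kept)
```

The manual elimination used in this report continues below.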

regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode+lat+long+sqft_lot15, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade + sqft_above + yr_built + yr_renovated + 
##     zipcode + lat + long + sqft_lot15, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1336593  -100048    -9461    79103  4198469 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.151e+07  3.179e+06   3.620 0.000296 ***
## bedrooms     -4.041e+04  2.227e+03 -18.143  < 2e-16 ***
## bathrooms     4.479e+04  3.547e+03  12.626  < 2e-16 ***
## sqft_living   1.596e+02  4.574e+00  34.883  < 2e-16 ***
## waterfront    6.085e+05  1.909e+04  31.877  < 2e-16 ***
## view          5.435e+04  2.364e+03  22.992  < 2e-16 ***
## condition     2.619e+04  2.665e+03   9.824  < 2e-16 ***
## grade         1.002e+05  2.326e+03  43.066  < 2e-16 ***
## sqft_above    3.857e+01  4.382e+00   8.804  < 2e-16 ***
## yr_built     -2.706e+03  8.030e+01 -33.694  < 2e-16 ***
## yr_renovated  1.989e+01  4.094e+00   4.859 1.19e-06 ***
## zipcode      -6.266e+02  3.671e+01 -17.070  < 2e-16 ***
## lat           6.006e+05  1.204e+04  49.900  < 2e-16 ***
## long         -2.129e+05  1.454e+04 -14.645  < 2e-16 ***
## sqft_lot15   -2.870e-01  6.074e-02  -4.724 2.33e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 206200 on 17810 degrees of freedom
## Multiple R-squared:  0.7031, Adjusted R-squared:  0.7028 
## F-statistic:  3012 on 14 and 17810 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode+lat+long, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade + sqft_above + yr_built + yr_renovated + 
##     zipcode + lat + long, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1340677  -100132    -9086    79373  4215797 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.759e+06  3.159e+06   3.089  0.00201 ** 
## bedrooms     -3.950e+04  2.220e+03 -17.791  < 2e-16 ***
## bathrooms     4.559e+04  3.546e+03  12.859  < 2e-16 ***
## sqft_living   1.576e+02  4.558e+00  34.578  < 2e-16 ***
## waterfront    6.090e+05  1.910e+04  31.887  < 2e-16 ***
## view          5.384e+04  2.363e+03  22.785  < 2e-16 ***
## condition     2.606e+04  2.667e+03   9.773  < 2e-16 ***
## grade         1.005e+05  2.327e+03  43.174  < 2e-16 ***
## sqft_above    3.789e+01  4.382e+00   8.647  < 2e-16 ***
## yr_built     -2.690e+03  8.028e+01 -33.507  < 2e-16 ***
## yr_renovated  1.986e+01  4.097e+00   4.848 1.26e-06 ***
## zipcode      -6.270e+02  3.673e+01 -17.071  < 2e-16 ***
## lat           6.047e+05  1.201e+04  50.341  < 2e-16 ***
## long         -2.257e+05  1.430e+04 -15.785  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 206300 on 17811 degrees of freedom
## Multiple R-squared:  0.7027, Adjusted R-squared:  0.7025 
## F-statistic:  3238 on 13 and 17811 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode+lat, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade + sqft_above + yr_built + yr_renovated + 
##     zipcode + lat, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1302274  -100359   -10258    78455  4254937 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.186e+07  3.178e+06   3.732  0.00019 ***
## bedrooms     -3.843e+04  2.235e+03 -17.198  < 2e-16 ***
## bathrooms     4.967e+04  3.561e+03  13.949  < 2e-16 ***
## sqft_living   1.575e+02  4.590e+00  34.321  < 2e-16 ***
## waterfront    6.183e+05  1.922e+04  32.167  < 2e-16 ***
## view          5.400e+04  2.379e+03  22.696  < 2e-16 ***
## condition     2.576e+04  2.685e+03   9.592  < 2e-16 ***
## grade         1.054e+05  2.322e+03  45.412  < 2e-16 ***
## sqft_above    2.517e+01  4.337e+00   5.803 6.61e-09 ***
## yr_built     -2.969e+03  7.886e+01 -37.646  < 2e-16 ***
## yr_renovated  1.788e+01  4.123e+00   4.337 1.45e-05 ***
## zipcode      -3.571e+02  3.273e+01 -10.910  < 2e-16 ***
## lat           5.948e+05  1.208e+04  49.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207700 on 17812 degrees of freedom
## Multiple R-squared:  0.6985, Adjusted R-squared:  0.6983 
## F-statistic:  3439 on 12 and 17812 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade + sqft_above + yr_built + yr_renovated + 
##     zipcode, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1403285  -112691    -9278    91845  4181640 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.669e+06  3.385e+06   1.675   0.0940 .  
## bedrooms     -4.243e+04  2.380e+03 -17.827  < 2e-16 ***
## bathrooms     5.789e+04  3.791e+03  15.271  < 2e-16 ***
## sqft_living   1.655e+02  4.889e+00  33.845  < 2e-16 ***
## waterfront    6.118e+05  2.049e+04  29.862  < 2e-16 ***
## view          4.285e+04  2.524e+03  16.974  < 2e-16 ***
## condition     1.786e+04  2.857e+03   6.251 4.18e-10 ***
## grade         1.271e+05  2.430e+03  52.319  < 2e-16 ***
## sqft_above    8.381e+00  4.608e+00   1.819   0.0690 .  
## yr_built     -3.614e+03  8.288e+01 -43.598  < 2e-16 ***
## yr_renovated  8.994e+00  4.391e+00   2.048   0.0405 *  
## zipcode       6.179e+00  3.399e+01   0.182   0.8558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221400 on 17813 degrees of freedom
## Multiple R-squared:  0.6575, Adjusted R-squared:  0.6573 
## F-statistic:  3109 on 11 and 17813 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade + sqft_above + yr_built + yr_renovated, 
##     data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1403840  -112578    -9315    91801  4181855 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.284e+06  1.555e+05  40.412  < 2e-16 ***
## bedrooms     -4.246e+04  2.377e+03 -17.861  < 2e-16 ***
## bathrooms     5.794e+04  3.784e+03  15.312  < 2e-16 ***
## sqft_living   1.655e+02  4.889e+00  33.846  < 2e-16 ***
## waterfront    6.118e+05  2.049e+04  29.863  < 2e-16 ***
## view          4.288e+04  2.517e+03  17.037  < 2e-16 ***
## condition     1.778e+04  2.825e+03   6.295 3.14e-10 ***
## grade         1.271e+05  2.427e+03  52.392  < 2e-16 ***
## sqft_above    8.290e+00  4.581e+00   1.810   0.0704 .  
## yr_built     -3.618e+03  7.945e+01 -45.538  < 2e-16 ***
## yr_renovated  8.964e+00  4.388e+00   2.043   0.0410 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221400 on 17814 degrees of freedom
## Multiple R-squared:  0.6575, Adjusted R-squared:  0.6573 
## F-statistic:  3420 on 10 and 17814 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade + sqft_above + yr_built, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1393539  -112727    -9107    91624  4189855 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.389e+06  1.467e+05  43.552  < 2e-16 ***
## bedrooms    -4.266e+04  2.375e+03 -17.959  < 2e-16 ***
## bathrooms    5.915e+04  3.737e+03  15.826  < 2e-16 ***
## sqft_living  1.653e+02  4.889e+00  33.813  < 2e-16 ***
## waterfront   6.141e+05  2.046e+04  30.016  < 2e-16 ***
## view         4.297e+04  2.517e+03  17.075  < 2e-16 ***
## condition    1.676e+04  2.780e+03   6.028 1.69e-09 ***
## grade        1.272e+05  2.427e+03  52.418  < 2e-16 ***
## sqft_above   8.437e+00  4.581e+00   1.842   0.0655 .  
## yr_built    -3.670e+03  7.518e+01 -48.818  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221400 on 17815 degrees of freedom
## Multiple R-squared:  0.6574, Adjusted R-squared:  0.6572 
## F-statistic:  3798 on 9 and 17815 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade + sqft_above, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1210220  -124276   -16798    94965  4579064 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.163e+05  1.955e+04 -36.640  < 2e-16 ***
## bedrooms    -3.658e+04  2.526e+03 -14.483  < 2e-16 ***
## bathrooms   -1.193e+04  3.665e+03  -3.256  0.00113 ** 
## sqft_living  2.245e+02  5.043e+00  44.515  < 2e-16 ***
## waterfront   6.200e+05  2.178e+04  28.465  < 2e-16 ***
## view         5.819e+04  2.659e+03  21.881  < 2e-16 ***
## condition    5.515e+04  2.840e+03  19.421  < 2e-16 ***
## grade        1.038e+05  2.533e+03  40.991  < 2e-16 ***
## sqft_above  -3.474e+01  4.786e+00  -7.259 4.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235700 on 17816 degrees of freedom
## Multiple R-squared:  0.6116, Adjusted R-squared:  0.6114 
## F-statistic:  3506 on 8 and 17816 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1219983  -125466   -16687    95660  4593532 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -706798.90   19533.75 -36.183  < 2e-16 ***
## bedrooms     -36031.59    2528.06 -14.253  < 2e-16 ***
## bathrooms    -11877.00    3670.29  -3.236  0.00121 ** 
## sqft_living     200.54       3.82  52.495  < 2e-16 ***
## waterfront   614383.56   21800.24  28.182  < 2e-16 ***
## view          61992.77    2610.73  23.745  < 2e-16 ***
## condition     58706.58    2800.89  20.960  < 2e-16 ***
## grade         99014.41    2448.14  40.445  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 236100 on 17817 degrees of freedom
## Multiple R-squared:  0.6104, Adjusted R-squared:  0.6103 
## F-statistic:  3988 on 7 and 17817 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1686540  -134120   -17271   102274  4125967 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -83910.959  12554.853  -6.684  2.4e-11 ***
## bedrooms    -55121.532   2595.042 -21.241  < 2e-16 ***
## bathrooms    21872.995   3734.533   5.857  4.8e-09 ***
## sqft_living    284.693      3.347  85.048  < 2e-16 ***
## waterfront  593461.737  22771.958  26.061  < 2e-16 ***
## view         68125.704   2723.265  25.016  < 2e-16 ***
## condition    44690.162   2904.077  15.389  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 246700 on 17818 degrees of freedom
## Multiple R-squared:  0.5747, Adjusted R-squared:  0.5745 
## F-statistic:  4012 on 6 and 17818 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1682353  -135344   -18482   102445  4181603 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71581.68    7500.83   9.543  < 2e-16 ***
## bedrooms    -51232.19    2599.74 -19.707  < 2e-16 ***
## bathrooms    13819.89    3722.07   3.713 0.000205 ***
## sqft_living    284.71       3.37  84.495  < 2e-16 ***
## waterfront  594059.33   22922.11  25.916  < 2e-16 ***
## view         70972.94    2734.89  25.951  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 248300 on 17819 degrees of freedom
## Multiple R-squared:  0.569,  Adjusted R-squared:  0.5689 
## F-statistic:  4705 on 5 and 17819 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront, 
##     data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1600895  -141495   -20351   103847  4215584 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  68725.672   7640.219   8.995  < 2e-16 ***
## bedrooms    -56272.803   2640.933 -21.308  < 2e-16 ***
## bathrooms    11475.873   3790.534   3.028  0.00247 ** 
## sqft_living    303.861      3.349  90.727  < 2e-16 ***
## waterfront  823546.537  21542.947  38.228  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252900 on 17820 degrees of freedom
## Multiple R-squared:  0.5527, Adjusted R-squared:  0.5526 
## F-statistic:  5505 on 4 and 17820 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1740023  -144720   -23212   102883  4090122 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  76100.72    7944.57   9.579   <2e-16 ***
## bedrooms    -64971.75    2736.80 -23.740   <2e-16 ***
## bathrooms    10152.13    3942.62   2.575     0.01 *  
## sqft_living    318.87       3.46  92.168   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 263100 on 17821 degrees of freedom
## Multiple R-squared:  0.516,  Adjusted R-squared:  0.516 
## F-statistic:  6334 on 3 and 17821 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1541685  -187741   -42448   111924  5897103 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -52026       9505  -5.474 4.47e-08 ***
## bedrooms       22009       3122   7.050 1.85e-12 ***
## bathrooms     246084       3644  67.536  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 319700 on 17822 degrees of freedom
## Multiple R-squared:  0.2853, Adjusted R-squared:  0.2853 
## F-statistic:  3558 on 2 and 17822 DF,  p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms, data = training_set2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -846969 -205281  -66844  105669 6804375 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    92937      10376   8.956   <2e-16 ***
## bedrooms      133781       2966  45.103   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 358300 on 17823 degrees of freedom
## Multiple R-squared:  0.1024, Adjusted R-squared:  0.1024 
## F-statistic:  2034 on 1 and 17823 DF,  p-value: < 2.2e-16

As the summaries above show, the multiple R-squared drops as variables are eliminated, down to 0.1024 when only bedrooms remains. The highest multiple R-squared is obtained when bedrooms, bathrooms, sqft_living, waterfront, view, condition, grade, sqft_above, yr_built, yr_renovated, zipcode, lat, long, and sqft_lot15 are all included.
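The sequence of refits can also be generated in a loop instead of being typed out one model at a time. The following is a minimal sketch, assuming the built-in `mtcars` data as a stand-in for `training_set2` and a hypothetical helper named `nested_r2()`; it drops the last predictor at each step and records the R-squared of every fit.

```r
# Minimal sketch: refit a linear model while dropping the last predictor each time.
# `mtcars` and the helper name `nested_r2` are illustrative assumptions,
# not part of the original analysis.
nested_r2 <- function(data, response, predictors) {
  rows <- lapply(rev(seq_along(predictors)), function(k) {
    kept <- predictors[seq_len(k)]                      # first k predictors
    fit  <- lm(reformulate(kept, response = response), data = data)
    s    <- summary(fit)
    data.frame(n_predictors  = k,
               last_kept     = kept[k],
               r_squared     = s$r.squared,
               adj_r_squared = s$adj.r.squared)
  })
  do.call(rbind, rows)
}

r2_table <- nested_r2(mtcars, "mpg", c("wt", "hp", "qsec", "drat"))
print(r2_table)
```

Because the models are nested, the multiple R-squared can only stay the same or fall as predictors are removed, so the adjusted R-squared column is the more informative one when comparing models of different sizes.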

regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode+lat+long+sqft_lot15, data=training_set2)
summary(regressor2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront + 
##     view + condition + grade + sqft_above + yr_built + yr_renovated + 
##     zipcode + lat + long + sqft_lot15, data = training_set2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1336593  -100048    -9461    79103  4198469 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.151e+07  3.179e+06   3.620 0.000296 ***
## bedrooms     -4.041e+04  2.227e+03 -18.143  < 2e-16 ***
## bathrooms     4.479e+04  3.547e+03  12.626  < 2e-16 ***
## sqft_living   1.596e+02  4.574e+00  34.883  < 2e-16 ***
## waterfront    6.085e+05  1.909e+04  31.877  < 2e-16 ***
## view          5.435e+04  2.364e+03  22.992  < 2e-16 ***
## condition     2.619e+04  2.665e+03   9.824  < 2e-16 ***
## grade         1.002e+05  2.326e+03  43.066  < 2e-16 ***
## sqft_above    3.857e+01  4.382e+00   8.804  < 2e-16 ***
## yr_built     -2.706e+03  8.030e+01 -33.694  < 2e-16 ***
## yr_renovated  1.989e+01  4.094e+00   4.859 1.19e-06 ***
## zipcode      -6.266e+02  3.671e+01 -17.070  < 2e-16 ***
## lat           6.006e+05  1.204e+04  49.900  < 2e-16 ***
## long         -2.129e+05  1.454e+04 -14.645  < 2e-16 ***
## sqft_lot15   -2.870e-01  6.074e-02  -4.724 2.33e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 206200 on 17810 degrees of freedom
## Multiple R-squared:  0.7031, Adjusted R-squared:  0.7028 
## F-statistic:  3012 on 14 and 17810 DF,  p-value: < 2.2e-16

The model returns a multiple R-squared of 0.7031 and an adjusted R-squared of 0.7028.
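The two figures are linked by the usual adjustment for the number of predictors. As a quick check, the sketch below recomputes the adjusted value from numbers read off the summary output; the variable names are illustrative only, and because the inputs are already rounded the result agrees with the reported value only up to rounding.

```r
# Values read from the summary() output of the 14-predictor model.
r2 <- 0.7031        # multiple R-squared (rounded as printed)
p  <- 14            # number of predictors
df <- 17810         # residual degrees of freedom
n  <- df + p + 1    # number of rows used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)
```

With more than 17,000 training rows and only 14 predictors, the penalty is tiny, which is why the two values differ only in the fourth decimal place.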

Next, we compare the two models: the first uses the highly correlated variables suggested by corrplot, and the second is the one obtained through backward elimination. Accuracy is defined here as 1 minus the mean absolute percentage error (MAPE). Accuracy check and prediction for the first model:

##Accuracy of the model on the train dataset
pred=regressor$fitted.values
tally_table=data.frame(actual=training_set$price, predicted=pred)
mape=mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy=1-mape
cat("The accuracy on the train data is:",accuracy)
## The accuracy on the train data is: 0.6630708
#Continue with Prediction on the test data
pred_test=predict(newdata=test_set,regressor)
tally_table=data.frame(actual=test_set$price, predicted=pred_test)
mape=mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy=1-mape
cat(" and the accuracy on the test data is:",accuracy) 
##  and the accuracy on the test data is: 0.6604951

Accuracy check and Prediction of the second model:

##Accuracy of the model on the train dataset
pred2=regressor2$fitted.values
tally_table=data.frame(actual=training_set2$price, predicted=pred2)
mape=mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy=1-mape
cat("The accuracy on the train data is:",accuracy)
## The accuracy on the train data is: 0.7402863
#Continue with Prediction on the test data
pred_test2=predict(newdata=test_set2,regressor2)
tally_table=data.frame(actual=test_set2$price, predicted=pred_test2)
mape=mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy=1-mape
cat(" and the accuracy on the test data is:",accuracy)
##  and the accuracy on the test data is: 0.7416078

In sum, the model built through backward elimination, which includes more variables, returns the higher accuracy. It predicts the price with an accuracy (1 - MAPE) of about 74.2% on the test data, compared with about 66.0% for the first model.
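Since the same accuracy calculation is repeated for each model and each data split, it could be wrapped in a small function. This is a minimal sketch; the helper name `mape_accuracy()` and the toy values are assumptions for illustration and are not taken from the dataset.

```r
# Hypothetical helper: "accuracy" as 1 - MAPE, matching the calculation used above.
mape_accuracy <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  1 - mean(abs(actual - predicted) / abs(actual))
}

# Toy values for illustration only.
actual    <- c(200000, 400000, 500000)
predicted <- c(220000, 360000, 500000)
toy_accuracy <- mape_accuracy(actual, predicted)
print(toy_accuracy)
```

In the report's setting, a call such as `mape_accuracy(test_set2$price, pred_test2)` would replace the four repeated lines. Note that MAPE weights errors relative to the actual price, so the same dollar error counts for more on a cheap house than on an expensive one.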

  1. Classification
  1. Hierarchical Clustering
memory.size()
## [1] 306.34
memory.limit()
## [1] 16314
memory.limit(size=56000)
## [1] 56000
x=select(house_clean, price, price_cat)
##Identify the optimal cluster by plotting dendrogram.
dendrogram=hclust(dist(x,method='euclidean'),method='ward.D')
plot(dendrogram, main=paste('Dendrogram'),xlab='Price',ylab='Price Categories')

##Fitting the Hierarchical Clustering into the data
hc = hclust(dist(x, method='euclidean'), method='ward.D')
y_hc=cutree(hc,4)

From the dendrogram, the optimal number of clusters appears to be 4.
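One way to sanity-check a cluster count read from a dendrogram is to cut the tree at that count and inspect the cluster sizes. The sketch below uses simulated one-dimensional prices rather than `house_clean`, so the data and object names are assumptions for illustration only.

```r
# Simulated "prices" in four well-separated groups; illustrative only.
set.seed(1)
toy <- data.frame(price = c(rnorm(25, 2e5, 1e4), rnorm(25, 4e5, 1e4),
                            rnorm(25, 7e5, 1e4), rnorm(25, 1.2e6, 1e4)))

# Same distance and linkage settings as in the report.
hc_toy <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Cut the tree into k = 4 groups and count the members of each cluster.
groups <- cutree(hc_toy, k = 4)
print(table(groups))
```

After `plot(hc_toy)`, calling `rect.hclust(hc_toy, k = 4)` outlines the chosen clusters on the dendrogram, which makes the choice of 4 easier to justify visually.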

  1. K-mean Clustering
kx=select(house_clean,price,price_cat)
##Split the data into Train and Test set
set.seed(123)
split=sample.split(kx$price_cat,SplitRatio = 0.8)
training_set=subset(kx, split==TRUE)
test_set=subset(kx, split==FALSE)
##Identify the optimal cluster by using the elbow method
set.seed(6)
wcss=vector()
for(i in 1:10)wcss[i]=sum(kmeans(kx,i)$withinss)
plot(1:10,wcss,type='b',main=paste('The Elbow Method'),xlab='Number of Cluster',ylab='WCSS')

##Fitting the K-means Clustering to the data
set.seed(29)
km_fit=kmeans(x=kx,centers=5)  # stored as km_fit to avoid reusing the function name kmeans
y_kmeans=km_fit$cluster

The elbow plot does not show a clear bend, so it is difficult to identify the optimal number of clusters from it.
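When the elbow is ambiguous, the average silhouette width is a commonly used alternative criterion; it is not part of the original analysis. The sketch below applies it to simulated data, so the data, the candidate range of k, and all object names are assumptions. The features are scaled first because k-means is sensitive to differences in scale, which would also matter for a pair such as price and price_cat.

```r
library(cluster)  # provides silhouette(); also the source of clusplot()

# Simulated two-feature data with three groups; illustrative only.
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)
toy <- scale(toy)   # put both features on a comparable scale
d   <- dist(toy)

# Average silhouette width for each candidate number of clusters.
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_silhouette = round(avg_sil, 3)))
print(best_k)
```

The k with the largest average silhouette width is the candidate to prefer. On the full dataset, `dist()` on roughly 21,000 rows is memory-hungry, so computing the silhouette on a random sample may be more practical.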

  1. Hierarchical Clustering
clusplot(x,y_hc,lines=0,shade=TRUE,color=TRUE,labels=2,plotchar=FALSE,span=TRUE,main=paste('Cluster of Price'),xlab="Price of House",ylab = "Price Categories")

  1. K-Means Clustering
clusplot(kx,y_kmeans,lines=0,shade=TRUE,color=TRUE,labels=2,plotchar=FALSE,span=TRUE,main=paste('Cluster of Price'),xlab='Price of House',ylab = 'Price Categories')

5. Conclusion & Future Works

From the multiple linear regression, the model obtained through backward elimination outperforms the model that uses only the highly positively correlated variables, reaching a higher accuracy of about 74.2% on the test data.
From the clustering analysis, four clusters were identified from the dendrogram. It is rather difficult to identify the number of clusters with the elbow method because the plot does not show a clear flattening of the curve. From the exploratory data analysis in Section 3, we concluded that the outcome variable has a large number of legitimate outliers, which reflect characteristics of the houses that are also captured in this dataset. For future work, we recommend including further identifying characteristics such as amenities (swimming pool, gym room, etc.), nearby education facilities (reputable schools and universities), and the distance to the nearest public transportation. These characteristics would likely help to determine the house price. Based on them, the houses could be further segmented into two categories, luxury and ordinary, and a separate model could be developed for each category.