This project was completed using a dataset acquired from Kaggle. This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
The goal of this analysis is to predict the price of housing for 2016 in King County based on the variables provided in the dataset. Any observations which are unrelated to the goal of predicting the house pricing will be recorded and summarized at the end of this report.
Before exploring the data and building the models, we load the packages needed for this analysis.
library(tidyverse) # used for data manipulation and visualization
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.1.2 --
## v broom 0.7.2 v recipes 0.1.15
## v dials 0.0.9 v rsample 0.0.8
## v infer 0.5.3 v tune 0.1.2
## v modeldata 0.1.0 v workflows 0.2.1
## v parsnip 0.1.4 v yardstick 0.0.7
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
library(caret) # used to streamline the model training process for regression and classification problems
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
##
## lift
library(leaflet) #creates an interactive map
library(GGally) #extends ggplot2 by adding several functions to reduce the complexity of combining geoms with transformed data.
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(Amelia)# used for imputing missing data
## Loading required package: Rcpp
##
## Attaching package: 'Rcpp'
## The following object is masked from 'package:rsample':
##
## populate
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2021 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(corrplot) #visualizing a correlation matrix in R
## corrplot 0.84 loaded
library(cluster)
library(caTools)
house = read.csv('kc_house_data.csv')
The dataset contains house sale prices for King County, Washington, for sales from May 2014 to May 2015. Apart from the price, it contains 20 other variables, such as the date of sale, the sale ID, and the house condition. The table below describes the variables in the dataset.
| Variable | Description |
|---|---|
| id | Unique ID per house sale |
| date | Date of the house sale |
| price | Price of house sale in currency of USD |
| bedrooms | Number of bedrooms |
| bathrooms | Number of Bathrooms, where 0.5 represents a bathroom with a toilet but with no shower |
| sqft_living | Square footage of the home's interior living space |
| sqft_lot | Square footage of the land space |
| floors | Number of floors |
| waterfront | An index to indicate if the house was overlooking the waterfront or not. 0 represents no waterfront, 1 represents with waterfront. |
| view | An index from 0 to 4 of how good the view of the property was. 0 represents no good view, 4 represents very good view. |
| condition | An index from 1 to 5 on the condition of the house. 1 represents poorer condition, and 5 represents superb condition. |
| grade | An index from 1 to 13. 1 to 3 falls short of building construction and design, 7 has an average level of construction and design, and 11 to 13 have higher quality level of construction and design. |
| sqft_above | The square footage of the interior housing space that is above the ground level |
| sqft_basement | The square footage of the interior housing space that is below the ground level |
| yr_built | The year the house was built |
| yr_renovated | The year of the house’s last renovation |
| zipcode | The zipcode is the postal code to indicate the area the house is in |
| lat | Latitude |
| long | Longitude |
| sqft_living15 | The average square footage of interior housing living space for the nearest 15 neighboring houses |
| sqft_lot15 | The average square footage of land space for the nearest 15 neighboring houses |
First, we display the compact structure of the data and its variables using str().
str(house)
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
Then, we display the first few rows of the data using head().
head(house)
## id date price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000 221900 3 1.00 1180 5650
## 2 6414100192 20141209T000000 538000 3 2.25 2570 7242
## 3 5631500400 20150225T000000 180000 2 1.00 770 10000
## 4 2487200875 20141209T000000 604000 4 3.00 1960 5000
## 5 1954400510 20150218T000000 510000 3 2.00 1680 8080
## 6 7237550310 20140512T000000 1225000 4 4.50 5420 101930
## floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1 1 0 0 3 7 1180 0 1955
## 2 2 0 0 3 7 2170 400 1951
## 3 1 0 0 3 6 770 0 1933
## 4 1 0 0 5 7 1050 910 1965
## 5 1 0 0 3 8 1680 0 1987
## 6 1 0 0 3 11 3890 1530 2001
## yr_renovated zipcode lat long sqft_living15 sqft_lot15
## 1 0 98178 47.5112 -122.257 1340 5650
## 2 1991 98125 47.7210 -122.319 1690 7639
## 3 0 98028 47.7379 -122.233 2720 8062
## 4 0 98136 47.5208 -122.393 1360 5000
## 5 0 98074 47.6168 -122.045 1800 7503
## 6 0 98053 47.6561 -122.005 4760 101930
Next, we review the summary statistics of each variable using summary().
summary(house)
## id date price bedrooms
## Min. :1.000e+06 Length:21613 Min. : 75000 Min. : 0.000
## 1st Qu.:2.123e+09 Class :character 1st Qu.: 321950 1st Qu.: 3.000
## Median :3.905e+09 Mode :character Median : 450000 Median : 3.000
## Mean :4.580e+09 Mean : 540088 Mean : 3.371
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 Max. :7700000 Max. :33.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000
## Mean :0.007542 Mean :0.2343 Mean :3.409 Mean : 7.657
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.000
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 290 Min. : 0.0 Min. :1900 Min. : 0.0
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
## Median :1560 Median : 0.0 Median :1975 Median : 0.0
## Mean :1788 Mean : 291.5 Mean :1971 Mean : 84.4
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12768
## 3rd Qu.: 10083
## Max. :871200
Missing Value Detection: The missmap() function from the Amelia package was used to identify missing data in the dataset. The map below shows that the dataset has no missing data for any of the variables.
missmap(house)
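A numeric check can complement the missingness map. The sketch below counts NA values per column on a small made-up data frame (`toy` and its values are illustrative and not taken from the dataset); the same `colSums(is.na(...))` call could be applied to `house`.

```r
# Toy data frame with one deliberately missing value (illustrative only).
toy <- data.frame(
  price    = c(221900, 538000, NA),
  bedrooms = c(3, 3, 2)
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains a missing value.
no_missing <- all(na_counts == 0)
print(no_missing)
```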
Outlier Detection: Outliers were detected and analyzed using boxplots. The boxplots show many outliers in the target variable, price. However, these price outliers correspond to outliers in the number of bedrooms, the number of bathrooms, and sqft_living. On further investigation, we found that they also correspond to high values of condition, view, and grade. We therefore concluded that these are legitimate observations and retained them in the data.
boxplot(house$price)
boxplot(house$bedrooms)
boxplot(house$bathrooms)
boxplot(house$sqft_living)
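The points that a boxplot draws individually can also be listed directly. The following is a minimal sketch using `boxplot.stats()` on a made-up vector of bedroom counts (the values are illustrative, not the dataset's); it returns the values lying more than 1.5 times the hinge spread beyond the hinges.

```r
# Illustrative bedroom counts, including one extreme value (made-up data).
bedrooms <- c(2, 3, 3, 3, 4, 4, 3, 2, 5, 33)

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges,
# the same points that boxplot() draws individually.
flagged <- boxplot.stats(bedrooms)$out
print(flagged)
```

Applied to a column such as `house$price`, the length of the returned vector gives a quick count of how many observations the boxplot treats as outliers.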
Summary of other data inconsistencies:
All data cleanup is performed on a copy of the original dataset, named house_clean.
house_clean=house
nrow(house_clean)
## [1] 21613
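There were two findings: one record lists 33 bedrooms, which is treated as a data-entry error and recoded to 3, and a small number of records list 0 bedrooms (13 rows) or 0 bathrooms (3 further rows), which are removed.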
max(house_clean$bedrooms)
## [1] 33
house_clean$bedrooms[house_clean$bedrooms==33]=3
nrow(house_clean)
## [1] 21613
max(house_clean$bedrooms)
## [1] 11
min(house_clean$bedrooms)
## [1] 0
house_clean= house_clean[house_clean$bedrooms != 0,]
nrow(house_clean)
## [1] 21600
min(house_clean$bathrooms)
## [1] 0
house_clean= house_clean[house_clean$bathrooms != 0,]
nrow(house_clean)
## [1] 21597
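Before dropping rows, it can help to count how many each rule affects. The sketch below does this on a made-up data frame (`toy` is hypothetical) and then applies the same two rules as the cleanup above: recode 33 bedrooms to 3, and remove rows with zero bedrooms or zero bathrooms.

```r
# Made-up rows covering each case handled by the cleanup rules.
toy <- data.frame(
  bedrooms  = c(3, 33, 0, 4, 2),
  bathrooms = c(1, 1.75, 0, 0, 2.5)
)

# How many rows would each rule touch?
n_extreme   <- sum(toy$bedrooms == 33)
n_zero_bed  <- sum(toy$bedrooms == 0)
n_zero_bath <- sum(toy$bathrooms == 0)
print(c(extreme = n_extreme, zero_bed = n_zero_bed, zero_bath = n_zero_bath))

# Apply the rules: recode the extreme value, then drop the zero rows.
toy_clean <- toy
toy_clean$bedrooms[toy_clean$bedrooms == 33] <- 3
toy_clean <- toy_clean[toy_clean$bedrooms != 0 & toy_clean$bathrooms != 0, ]
print(nrow(toy_clean))
```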
A majority of the variables found in the King County housing dataset were deemed acceptable for performing the analysis. However, while traversing the data we found that some of the columns need to have their data types adjusted in order to meet our goal. Thus we made the decision to retain all 21 original columns along with the transformed data.
The transformed or derived columns are date (converted to a numeric day count), age, renovated, and price_cat:
house_clean$date<-(substr(house_clean$date, 1, 8))
house_clean$date<- ymd(house_clean$date)
house_clean$date <- as.numeric(house_clean$date) # days since 1970-01-01; no origin argument is needed for a Date input
head(house_clean$date)
## [1] 16356 16413 16491 16413 16484 16202
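The numeric values shown above are day counts from 1970-01-01, the epoch of R's Date class. The sketch below reproduces the first two values with base R only (using `as.Date()` with a format string in place of `lubridate::ymd()`, which is a substitution made here for self-containment) and converts them back to calendar dates.

```r
# First two raw date strings from the dataset, as shown by head(house).
raw <- c("20141013T000000", "20141209T000000")

# Keep the YYYYMMDD part and parse it as a Date.
d <- as.Date(substr(raw, 1, 8), format = "%Y%m%d")

# A Date converts to the number of days since 1970-01-01.
days <- as.numeric(d)
print(days)

# The day counts can be turned back into dates with the same origin.
back <- as.Date(days, origin = "1970-01-01")
print(back)
```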
house_clean$age= 2015 - house_clean$yr_built + 1
head(house_clean$age)
## [1] 61 65 83 51 29 15
The table below describes the categories of the renovated variable.
| Category | Definition |
|---|---|
| 1 | If yr_renovated == ‘0’, it means no renovation has been done. |
| 2 | If yr_renovated != ‘0’, it means renovation has been done. |
house_clean$renovated= cut(house_clean$yr_renovated, breaks = c(-1,0,3000), labels=c("1","2"))
house_clean$renovated=as.numeric(house_clean$renovated)
head(house_clean$renovated)
## [1] 1 2 1 1 1 1
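As a quick check of the encoding, the sketch below compares the `cut()` approach with a direct comparison against zero on a few made-up renovation years (the vector is illustrative). Both give 1 for "never renovated" and 2 for "renovated".

```r
# Illustrative renovation years; 0 means the house was never renovated.
yr_renovated <- c(0, 1991, 0, 2015)

# Encoding used in the report: (-1, 0] -> "1", (0, 3000] -> "2".
via_cut <- as.numeric(cut(yr_renovated, breaks = c(-1, 0, 3000),
                          labels = c("1", "2")))

# Equivalent encoding with a direct comparison.
via_compare <- ifelse(yr_renovated == 0, 1, 2)

print(via_cut)
print(via_compare)
```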
The table below describes the price categories (variable: price_cat).
| Category | Price Range (USD) |
|---|---|
| 1 | 0 to 350,000 |
| 2 | 350,001 to 450,000 |
| 3 | 450,001 to 700,000 |
| 4 | 700,001 and above |
house_clean$price_cat = cut(house_clean$price, breaks = c(0,350000,450000,700000,10000000), labels=c("1","2","3","4"))
house_clean$price[1:10]
## [1] 221900 538000 180000 604000 510000 1225000 257500 291850 229500
## [10] 323000
house_clean$price_cat[1:10]
## [1] 1 3 1 3 3 4 1 1 1 1
## Levels: 1 2 3 4
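By default `cut()` builds right-closed intervals, so a price exactly on a break falls into the lower category, which matches the table above. The sketch below checks this on made-up boundary prices (the `prices` vector is illustrative). Note that a price above the last break of 10,000,000 would become `NA`.

```r
# Prices chosen to sit on and just above each break (illustrative values).
prices <- c(350000, 350001, 450000, 700000, 700001)

# Same breaks and labels as in the report.
price_cat <- cut(prices,
                 breaks = c(0, 350000, 450000, 700000, 10000000),
                 labels = c("1", "2", "3", "4"))

print(data.frame(prices, price_cat))
```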
The objective of data visualization and pattern discovery is to reveal the relationships between the house features and the target variable, price. We want to identify the house features which affect the price variable and could be potential predictors. Through visualization, we gathered the following information about the data.
Correlation Matrix: The correlation matrix summarizes the correlations between the variables in the dataset. The objective of analyzing the correlations between the continuous variables was to identify which variables have a significant linear relationship with price and which do not. The matrix can also help to identify relationships between potential predictors.
house_clean.cor = cor(house_clean[sapply(house_clean, function(x) !is.factor(x))])
corrplot(house_clean.cor)
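To read the correlations numerically rather than from the plot, the row for price can be extracted from the matrix and sorted. The sketch below shows this on simulated data (`toy`, its columns, and the coefficients used to generate it are assumptions for illustration); with the report's objects, the same indexing could be applied to `house_clean.cor`.

```r
set.seed(42)
n <- 100

# Simulated data: price depends on sqft_living; noise is unrelated.
sqft_living <- rnorm(n, mean = 2000, sd = 500)
toy <- data.frame(
  price       = 250 * sqft_living + rnorm(n, sd = 50000),
  sqft_living = sqft_living,
  noise       = rnorm(n)
)

# Correlation matrix, then the price row without price itself, sorted.
cor_mat   <- cor(toy)
price_cor <- sort(cor_mat["price", colnames(cor_mat) != "price"],
                  decreasing = TRUE)
print(round(price_cor, 2))
```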
From the correlation matrix, the findings are:
Price has a high positive correlation with bathrooms, sqft_living, grade, sqft_above, and sqft_living15.
Price has a low positive correlation with bedrooms, floors, waterfront, view, sqft_basement, and lat.
Price has no significant relationship with sqft_lot, condition, yr_built, yr_renovated, zipcode, long, sqft_lot15, age, and renovated.
sqft_above, sqft_living15, bathrooms, bedrooms, and grade show a high positive correlation with sqft_living and may explain the same variation in price as sqft_living.
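One way to quantify the overlap between predictors noted in the last finding is the variance inflation factor (VIF), which this report does not compute; it is introduced here only as an illustration. The sketch below calculates VIF by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. The data are simulated and the variable relationships are assumptions.

```r
set.seed(7)
n <- 200

# Simulated predictors: sqft_living is built from sqft_above plus a
# basement-like term, so the two are strongly related; grade is independent.
sqft_above  <- rnorm(n, mean = 1800, sd = 400)
sqft_living <- sqft_above + abs(rnorm(n, mean = 300, sd = 150))
grade       <- rnorm(n, mean = 7, sd = 1)
toy <- data.frame(sqft_living, sqft_above, grade)

# VIF for each predictor: regress it on all the others, then 1 / (1 - R^2).
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

A VIF near 1 suggests a predictor carries information the others do not, while larger values suggest redundancy.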
In addition to the correlation matrix, the following charts were created:
pairs(~price+bathrooms+sqft_living+grade+sqft_above+sqft_living15, data=house_clean, main="High Positive Corr. ScatterPlot Matrix")
pairs(~price+bedrooms+floors+waterfront+view+sqft_basement+lat, data=house_clean, main="Low Positive Corr. ScatterPlot Matrix")
To map this dataset, we use the leaflet package, which creates an interactive map. The color of each circle marker depends on the sale price: the higher the price, the darker the color.
coordinates_data = dplyr::select(house_clean, price, lat, long)
head(coordinates_data)
## price lat long
## 1 221900 47.5112 -122.257
## 2 538000 47.7210 -122.319
## 3 180000 47.7379 -122.233
## 4 604000 47.5208 -122.393
## 5 510000 47.6168 -122.045
## 6 1225000 47.6561 -122.005
pal = colorNumeric("YlOrRd", domain = coordinates_data$price)
int_map <- coordinates_data %>%
leaflet()%>%
addProviderTiles(providers$OpenStreetMap.Mapnik)%>%
addCircleMarkers(color = ~pal(price), opacity = 1, radius = 0.3) %>%
addLegend(pal = pal, values = ~price)
## Assuming "long" and "lat" are longitude and latitude, respectively
int_map
plot(house_clean$sqft_living15, house_clean$price, pch=19, col=house_clean$bathrooms, xlab='sqft_living15 (color = number of bathrooms)', ylab='House Price')
The first model uses the predictors that the corrplot shows to be highly positively correlated with price.
hr=select(house_clean,price,bathrooms,sqft_living,grade,sqft_above,sqft_living15)
set.seed(123)
split=sample.split(hr$price, SplitRatio = 0.8)
training_set=subset(hr, split==TRUE)
test_set=subset(hr, split==FALSE)
regressor=lm(formula=price~., data=training_set)
summary(regressor)
##
## Call:
## lm(formula = price ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1005385 -137443 -22645 100515 4736642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.664e+05 1.525e+04 -43.702 < 2e-16 ***
## bathrooms -3.493e+04 3.859e+03 -9.053 < 2e-16 ***
## sqft_living 2.561e+02 5.070e+00 50.504 < 2e-16 ***
## grade 1.126e+05 2.777e+03 40.542 < 2e-16 ***
## sqft_above -8.285e+01 5.010e+00 -16.537 < 2e-16 ***
## sqft_living15 1.788e+01 4.506e+00 3.967 7.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 253800 on 17819 degrees of freedom
## Multiple R-squared: 0.5496, Adjusted R-squared: 0.5495
## F-statistic: 4349 on 5 and 17819 DF, p-value: < 2.2e-16
As shown, the multiple R-squared is 0.5496 (adjusted R-squared 0.5495), which is not considered strong even though every predictor is highly positively correlated with price. For comparison, another model is built to find the highest multiple R-squared we can obtain.
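The R-squared reported by `summary()` is computed on the training data. A held-out set such as `test_set` can also be scored with `predict()`. The sketch below shows the idea on simulated data; `toy`, the coefficients used to generate it, and the base R split with `sample()` (used here in place of `caTools::sample.split`) are all assumptions for illustration.

```r
set.seed(123)
n <- 500

# Simulated housing-like data (illustrative only).
toy <- data.frame(sqft_living = runif(n, min = 500, max = 4000))
toy$price <- 100000 + 250 * toy$sqft_living + rnorm(n, sd = 80000)

# 80/20 split using base R.
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows, predict on the held-out rows.
fit  <- lm(price ~ sqft_living, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error and R-squared on the held-out rows.
rmse    <- sqrt(mean((test$price - pred)^2))
r2_test <- 1 - sum((test$price - pred)^2) /
               sum((test$price - mean(test$price))^2)
print(c(rmse = rmse, r2_test = r2_test))
```

RMSE is in the same units as price, which makes it easier to judge than R-squared whether the error is acceptable for a price forecast.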
set.seed(123)
split2=sample.split(house_clean$price, SplitRatio = 0.8)
training_set2=subset(house_clean, split2==TRUE)
test_set2=subset(house_clean, split2==FALSE)
regressor2=lm(formula=price~., data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ ., data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1041221 -78634 -2392 62995 4776282
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.170e+07 3.179e+06 3.679 0.000235 ***
## id -1.177e-06 5.010e-07 -2.350 0.018801 *
## date 8.871e+01 1.263e+01 7.025 2.21e-12 ***
## bedrooms -3.791e+04 2.053e+03 -18.468 < 2e-16 ***
## bathrooms 3.860e+04 3.399e+03 11.358 < 2e-16 ***
## sqft_living 1.282e+02 4.580e+00 27.986 < 2e-16 ***
## sqft_lot 1.446e-02 4.738e-02 0.305 0.760198
## floors -8.876e+03 3.782e+03 -2.347 0.018922 *
## waterfront 5.977e+05 1.760e+04 33.965 < 2e-16 ***
## view 3.810e+04 2.220e+03 17.160 < 2e-16 ***
## condition 1.664e+04 2.482e+03 6.704 2.09e-11 ***
## grade 5.865e+04 2.360e+03 24.847 < 2e-16 ***
## sqft_above 2.905e+01 4.534e+00 6.409 1.50e-10 ***
## sqft_basement NA NA NA NA
## yr_built -1.792e+03 7.785e+01 -23.023 < 2e-16 ***
## yr_renovated 2.451e+03 4.443e+02 5.516 3.53e-08 ***
## zipcode -4.718e+02 3.421e+01 -13.791 < 2e-16 ***
## lat 3.895e+05 1.326e+04 29.385 < 2e-16 ***
## long -1.857e+05 1.370e+04 -13.548 < 2e-16 ***
## sqft_living15 -6.247e+00 3.600e+00 -1.736 0.082660 .
## sqft_lot15 -2.993e-01 7.596e-02 -3.940 8.17e-05 ***
## age NA NA NA NA
## renovated -4.870e+06 8.868e+05 -5.491 4.06e-08 ***
## price_cat2 2.264e+04 4.613e+03 4.909 9.23e-07 ***
## price_cat3 8.759e+04 4.799e+03 18.251 < 2e-16 ***
## price_cat4 3.323e+05 6.873e+03 48.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 189700 on 17801 degrees of freedom
## Multiple R-squared: 0.7488, Adjusted R-squared: 0.7485
## F-statistic: 2307 on 23 and 17801 DF, p-value: < 2.2e-16
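The two `NA` rows in the output above ("not defined because of singularities") arise when a predictor is an exact linear combination of others. The sketch below reproduces the effect for age using the first eight yr_built and price values shown earlier in this report; the two-predictor model itself is a simplification chosen for illustration.

```r
# First eight yr_built and price values from the dataset, as shown earlier.
toy <- data.frame(
  yr_built = c(1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963),
  price    = c(221900, 538000, 180000, 604000, 510000, 1225000, 257500, 291850)
)

# Same derivation as in the report: age is 2016 minus yr_built.
toy$age <- 2015 - toy$yr_built + 1

# age carries no information beyond yr_built and the intercept,
# so lm() cannot estimate a separate coefficient for it.
fit <- lm(price ~ yr_built + age, data = toy)
print(coef(fit))
```

The same reasoning applies to sqft_basement, which equals sqft_living minus sqft_above in the rows shown by `head(house)`.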
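Next, we drop id, date, sqft_lot, floors, sqft_basement, sqft_living15, age, renovated, and price_cat from the model. In the output above, sqft_lot and sqft_living15 are not significant at the 0.05 level, and sqft_basement and age have no estimate because they are exact linear combinations of other predictors (sqft_living - sqft_above and 2016 - yr_built). price_cat is derived directly from price, so keeping it would leak the target into the predictors. We then start backward elimination to find the best combination of variables.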
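Backward elimination can also be automated. The sketch below uses `stats::step()`, which drops terms by AIC rather than by the manual procedure this report follows, so it is an alternative technique and not a reproduction of the steps that follow. The data are simulated and the names are hypothetical. The `stats::` prefix matters in this report's session because `recipes::step()` masks `stats::step()`.

```r
set.seed(2024)
n <- 300

# Simulated data: y depends on x1 and x2; x3 is pure noise.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 5 * toy$x1 + 3 * toy$x2 + rnorm(n)

# Start from the full model and let step() remove terms by AIC.
full    <- lm(y ~ x1 + x2 + x3, data = toy)
reduced <- stats::step(full, direction = "backward", trace = 0)

# Predictors that survive the elimination.
kept <- attr(terms(reduced), "term.labels")
print(kept)
```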
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode+lat+long+sqft_lot15, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated +
## zipcode + lat + long + sqft_lot15, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1336593 -100048 -9461 79103 4198469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.151e+07 3.179e+06 3.620 0.000296 ***
## bedrooms -4.041e+04 2.227e+03 -18.143 < 2e-16 ***
## bathrooms 4.479e+04 3.547e+03 12.626 < 2e-16 ***
## sqft_living 1.596e+02 4.574e+00 34.883 < 2e-16 ***
## waterfront 6.085e+05 1.909e+04 31.877 < 2e-16 ***
## view 5.435e+04 2.364e+03 22.992 < 2e-16 ***
## condition 2.619e+04 2.665e+03 9.824 < 2e-16 ***
## grade 1.002e+05 2.326e+03 43.066 < 2e-16 ***
## sqft_above 3.857e+01 4.382e+00 8.804 < 2e-16 ***
## yr_built -2.706e+03 8.030e+01 -33.694 < 2e-16 ***
## yr_renovated 1.989e+01 4.094e+00 4.859 1.19e-06 ***
## zipcode -6.266e+02 3.671e+01 -17.070 < 2e-16 ***
## lat 6.006e+05 1.204e+04 49.900 < 2e-16 ***
## long -2.129e+05 1.454e+04 -14.645 < 2e-16 ***
## sqft_lot15 -2.870e-01 6.074e-02 -4.724 2.33e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 206200 on 17810 degrees of freedom
## Multiple R-squared: 0.7031, Adjusted R-squared: 0.7028
## F-statistic: 3012 on 14 and 17810 DF, p-value: < 2.2e-16
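The refits that follow each drop the last predictor from the formula and print a full summary. A compact way to compare such a sequence is to collect only the adjusted R-squared of each fit. The sketch below does this on simulated data; `toy` and the predictor names are assumptions, and `reformulate()` builds each formula from a character vector.

```r
set.seed(1)
n <- 200

# Simulated data: y depends on x1 and x2; x3 is pure noise.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 2 * toy$x2 + rnorm(n)

predictors <- c("x1", "x2", "x3")
sizes <- seq(length(predictors), 1)

# Refit with the first k predictors for k = 3, 2, 1 and keep adjusted R^2.
adj_r2 <- sapply(sizes, function(k) {
  f <- reformulate(predictors[seq_len(k)], response = "y")
  summary(lm(f, data = toy))$adj.r.squared
})

results <- data.frame(n_predictors = sizes, adj_r2 = adj_r2)
print(results)
```

A sharp fall in adjusted R-squared between two rows indicates that the predictor removed at that step was carrying real explanatory weight.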
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode+lat+long, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated +
## zipcode + lat + long, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1340677 -100132 -9086 79373 4215797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.759e+06 3.159e+06 3.089 0.00201 **
## bedrooms -3.950e+04 2.220e+03 -17.791 < 2e-16 ***
## bathrooms 4.559e+04 3.546e+03 12.859 < 2e-16 ***
## sqft_living 1.576e+02 4.558e+00 34.578 < 2e-16 ***
## waterfront 6.090e+05 1.910e+04 31.887 < 2e-16 ***
## view 5.384e+04 2.363e+03 22.785 < 2e-16 ***
## condition 2.606e+04 2.667e+03 9.773 < 2e-16 ***
## grade 1.005e+05 2.327e+03 43.174 < 2e-16 ***
## sqft_above 3.789e+01 4.382e+00 8.647 < 2e-16 ***
## yr_built -2.690e+03 8.028e+01 -33.507 < 2e-16 ***
## yr_renovated 1.986e+01 4.097e+00 4.848 1.26e-06 ***
## zipcode -6.270e+02 3.673e+01 -17.071 < 2e-16 ***
## lat 6.047e+05 1.201e+04 50.341 < 2e-16 ***
## long -2.257e+05 1.430e+04 -15.785 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 206300 on 17811 degrees of freedom
## Multiple R-squared: 0.7027, Adjusted R-squared: 0.7025
## F-statistic: 3238 on 13 and 17811 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode+lat, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated +
## zipcode + lat, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1302274 -100359 -10258 78455 4254937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.186e+07 3.178e+06 3.732 0.00019 ***
## bedrooms -3.843e+04 2.235e+03 -17.198 < 2e-16 ***
## bathrooms 4.967e+04 3.561e+03 13.949 < 2e-16 ***
## sqft_living 1.575e+02 4.590e+00 34.321 < 2e-16 ***
## waterfront 6.183e+05 1.922e+04 32.167 < 2e-16 ***
## view 5.400e+04 2.379e+03 22.696 < 2e-16 ***
## condition 2.576e+04 2.685e+03 9.592 < 2e-16 ***
## grade 1.054e+05 2.322e+03 45.412 < 2e-16 ***
## sqft_above 2.517e+01 4.337e+00 5.803 6.61e-09 ***
## yr_built -2.969e+03 7.886e+01 -37.646 < 2e-16 ***
## yr_renovated 1.788e+01 4.123e+00 4.337 1.45e-05 ***
## zipcode -3.571e+02 3.273e+01 -10.910 < 2e-16 ***
## lat 5.948e+05 1.208e+04 49.243 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207700 on 17812 degrees of freedom
## Multiple R-squared: 0.6985, Adjusted R-squared: 0.6983
## F-statistic: 3439 on 12 and 17812 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated +
## zipcode, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1403285 -112691 -9278 91845 4181640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.669e+06 3.385e+06 1.675 0.0940 .
## bedrooms -4.243e+04 2.380e+03 -17.827 < 2e-16 ***
## bathrooms 5.789e+04 3.791e+03 15.271 < 2e-16 ***
## sqft_living 1.655e+02 4.889e+00 33.845 < 2e-16 ***
## waterfront 6.118e+05 2.049e+04 29.862 < 2e-16 ***
## view 4.285e+04 2.524e+03 16.974 < 2e-16 ***
## condition 1.786e+04 2.857e+03 6.251 4.18e-10 ***
## grade 1.271e+05 2.430e+03 52.319 < 2e-16 ***
## sqft_above 8.381e+00 4.608e+00 1.819 0.0690 .
## yr_built -3.614e+03 8.288e+01 -43.598 < 2e-16 ***
## yr_renovated 8.994e+00 4.391e+00 2.048 0.0405 *
## zipcode 6.179e+00 3.399e+01 0.182 0.8558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221400 on 17813 degrees of freedom
## Multiple R-squared: 0.6575, Adjusted R-squared: 0.6573
## F-statistic: 3109 on 11 and 17813 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated,
## data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1403840 -112578 -9315 91801 4181855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.284e+06 1.555e+05 40.412 < 2e-16 ***
## bedrooms -4.246e+04 2.377e+03 -17.861 < 2e-16 ***
## bathrooms 5.794e+04 3.784e+03 15.312 < 2e-16 ***
## sqft_living 1.655e+02 4.889e+00 33.846 < 2e-16 ***
## waterfront 6.118e+05 2.049e+04 29.863 < 2e-16 ***
## view 4.288e+04 2.517e+03 17.037 < 2e-16 ***
## condition 1.778e+04 2.825e+03 6.295 3.14e-10 ***
## grade 1.271e+05 2.427e+03 52.392 < 2e-16 ***
## sqft_above 8.290e+00 4.581e+00 1.810 0.0704 .
## yr_built -3.618e+03 7.945e+01 -45.538 < 2e-16 ***
## yr_renovated 8.964e+00 4.388e+00 2.043 0.0410 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221400 on 17814 degrees of freedom
## Multiple R-squared: 0.6575, Adjusted R-squared: 0.6573
## F-statistic: 3420 on 10 and 17814 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + yr_built, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1393539 -112727 -9107 91624 4189855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.389e+06 1.467e+05 43.552 < 2e-16 ***
## bedrooms -4.266e+04 2.375e+03 -17.959 < 2e-16 ***
## bathrooms 5.915e+04 3.737e+03 15.826 < 2e-16 ***
## sqft_living 1.653e+02 4.889e+00 33.813 < 2e-16 ***
## waterfront 6.141e+05 2.046e+04 30.016 < 2e-16 ***
## view 4.297e+04 2.517e+03 17.075 < 2e-16 ***
## condition 1.676e+04 2.780e+03 6.028 1.69e-09 ***
## grade 1.272e+05 2.427e+03 52.418 < 2e-16 ***
## sqft_above 8.437e+00 4.581e+00 1.842 0.0655 .
## yr_built -3.670e+03 7.518e+01 -48.818 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221400 on 17815 degrees of freedom
## Multiple R-squared: 0.6574, Adjusted R-squared: 0.6572
## F-statistic: 3798 on 9 and 17815 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1210220 -124276 -16798 94965 4579064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.163e+05 1.955e+04 -36.640 < 2e-16 ***
## bedrooms -3.658e+04 2.526e+03 -14.483 < 2e-16 ***
## bathrooms -1.193e+04 3.665e+03 -3.256 0.00113 **
## sqft_living 2.245e+02 5.043e+00 44.515 < 2e-16 ***
## waterfront 6.200e+05 2.178e+04 28.465 < 2e-16 ***
## view 5.819e+04 2.659e+03 21.881 < 2e-16 ***
## condition 5.515e+04 2.840e+03 19.421 < 2e-16 ***
## grade 1.038e+05 2.533e+03 40.991 < 2e-16 ***
## sqft_above -3.474e+01 4.786e+00 -7.259 4.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235700 on 17816 degrees of freedom
## Multiple R-squared: 0.6116, Adjusted R-squared: 0.6114
## F-statistic: 3506 on 8 and 17816 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1219983 -125466 -16687 95660 4593532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -706798.90 19533.75 -36.183 < 2e-16 ***
## bedrooms -36031.59 2528.06 -14.253 < 2e-16 ***
## bathrooms -11877.00 3670.29 -3.236 0.00121 **
## sqft_living 200.54 3.82 52.495 < 2e-16 ***
## waterfront 614383.56 21800.24 28.182 < 2e-16 ***
## view 61992.77 2610.73 23.745 < 2e-16 ***
## condition 58706.58 2800.89 20.960 < 2e-16 ***
## grade 99014.41 2448.14 40.445 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 236100 on 17817 degrees of freedom
## Multiple R-squared: 0.6104, Adjusted R-squared: 0.6103
## F-statistic: 3988 on 7 and 17817 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1686540 -134120 -17271 102274 4125967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -83910.959 12554.853 -6.684 2.4e-11 ***
## bedrooms -55121.532 2595.042 -21.241 < 2e-16 ***
## bathrooms 21872.995 3734.533 5.857 4.8e-09 ***
## sqft_living 284.693 3.347 85.048 < 2e-16 ***
## waterfront 593461.737 22771.958 26.061 < 2e-16 ***
## view 68125.704 2723.265 25.016 < 2e-16 ***
## condition 44690.162 2904.077 15.389 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 246700 on 17818 degrees of freedom
## Multiple R-squared: 0.5747, Adjusted R-squared: 0.5745
## F-statistic: 4012 on 6 and 17818 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1682353 -135344 -18482 102445 4181603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71581.68 7500.83 9.543 < 2e-16 ***
## bedrooms -51232.19 2599.74 -19.707 < 2e-16 ***
## bathrooms 13819.89 3722.07 3.713 0.000205 ***
## sqft_living 284.71 3.37 84.495 < 2e-16 ***
## waterfront 594059.33 22922.11 25.916 < 2e-16 ***
## view 70972.94 2734.89 25.951 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 248300 on 17819 degrees of freedom
## Multiple R-squared: 0.569, Adjusted R-squared: 0.5689
## F-statistic: 4705 on 5 and 17819 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront,
## data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1600895 -141495 -20351 103847 4215584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68725.672 7640.219 8.995 < 2e-16 ***
## bedrooms -56272.803 2640.933 -21.308 < 2e-16 ***
## bathrooms 11475.873 3790.534 3.028 0.00247 **
## sqft_living 303.861 3.349 90.727 < 2e-16 ***
## waterfront 823546.537 21542.947 38.228 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252900 on 17820 degrees of freedom
## Multiple R-squared: 0.5527, Adjusted R-squared: 0.5526
## F-statistic: 5505 on 4 and 17820 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1740023 -144720 -23212 102883 4090122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76100.72 7944.57 9.579 <2e-16 ***
## bedrooms -64971.75 2736.80 -23.740 <2e-16 ***
## bathrooms 10152.13 3942.62 2.575 0.01 *
## sqft_living 318.87 3.46 92.168 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263100 on 17821 degrees of freedom
## Multiple R-squared: 0.516, Adjusted R-squared: 0.516
## F-statistic: 6334 on 3 and 17821 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms+bathrooms, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1541685 -187741 -42448 111924 5897103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52026 9505 -5.474 4.47e-08 ***
## bedrooms 22009 3122 7.050 1.85e-12 ***
## bathrooms 246084 3644 67.536 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 319700 on 17822 degrees of freedom
## Multiple R-squared: 0.2853, Adjusted R-squared: 0.2853
## F-statistic: 3558 on 2 and 17822 DF, p-value: < 2.2e-16
regressor2=lm(formula=price~bedrooms, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -846969 -205281 -66844 105669 6804375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92937 10376 8.956 <2e-16 ***
## bedrooms 133781 2966 45.103 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 358300 on 17823 degrees of freedom
## Multiple R-squared: 0.1024, Adjusted R-squared: 0.1024
## F-statistic: 2034 on 1 and 17823 DF, p-value: < 2.2e-16
As the summaries above show, the multiple R-squared value drops as variables are eliminated. The highest multiple R-squared value is obtained when the variables bedrooms, bathrooms, sqft_living, waterfront, view, condition, grade, sqft_above, yr_built, yr_renovated, zipcode, lat, long, and sqft_lot15 are all included.
regressor2=lm(formula=price~bedrooms+bathrooms+sqft_living+waterfront+view+condition+grade+sqft_above+yr_built+yr_renovated+zipcode+lat+long+sqft_lot15, data=training_set2)
summary(regressor2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated +
## zipcode + lat + long + sqft_lot15, data = training_set2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1336593 -100048 -9461 79103 4198469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.151e+07 3.179e+06 3.620 0.000296 ***
## bedrooms -4.041e+04 2.227e+03 -18.143 < 2e-16 ***
## bathrooms 4.479e+04 3.547e+03 12.626 < 2e-16 ***
## sqft_living 1.596e+02 4.574e+00 34.883 < 2e-16 ***
## waterfront 6.085e+05 1.909e+04 31.877 < 2e-16 ***
## view 5.435e+04 2.364e+03 22.992 < 2e-16 ***
## condition 2.619e+04 2.665e+03 9.824 < 2e-16 ***
## grade 1.002e+05 2.326e+03 43.066 < 2e-16 ***
## sqft_above 3.857e+01 4.382e+00 8.804 < 2e-16 ***
## yr_built -2.706e+03 8.030e+01 -33.694 < 2e-16 ***
## yr_renovated 1.989e+01 4.094e+00 4.859 1.19e-06 ***
## zipcode -6.266e+02 3.671e+01 -17.070 < 2e-16 ***
## lat 6.006e+05 1.204e+04 49.900 < 2e-16 ***
## long -2.129e+05 1.454e+04 -14.645 < 2e-16 ***
## sqft_lot15 -2.870e-01 6.074e-02 -4.724 2.33e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 206200 on 17810 degrees of freedom
## Multiple R-squared: 0.7031, Adjusted R-squared: 0.7028
## F-statistic: 3012 on 14 and 17810 DF, p-value: < 2.2e-16
This model returns a multiple R-squared value of 0.7031 (adjusted R-squared of 0.7028).
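The repeated refits above can be condensed into a loop that records R-squared for each nested model. The following is only a minimal sketch: it uses the built-in mtcars data in place of training_set2, and the predictor list and its order are assumptions chosen for illustration. Like the refits above, it adds or drops predictors in a fixed order rather than by p-value.

```r
# Sketch: collect R-squared for a sequence of nested linear models.
# mtcars stands in for training_set2; the predictors and their order are assumptions.
predictors <- c("wt", "hp", "disp", "qsec")
target <- "mpg"

r2_table <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  vars <- predictors[seq_len(k)]              # first k predictors
  fit <- lm(reformulate(vars, response = target), data = mtcars)
  s <- summary(fit)
  data.frame(
    n_predictors = k,
    formula = paste(target, "~", paste(vars, collapse = " + ")),
    r_squared = s$r.squared,
    adj_r_squared = s$adj.r.squared
  )
}))

print(r2_table)
```

Reading the table from bottom to top mirrors the elimination sequence: multiple R-squared can only stay the same or fall as predictors are removed, while adjusted R-squared can move in either direction.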
Next, compare the two models: the first uses the highly correlated variables suggested by corrplot, and the second was obtained through back-elimination. Accuracy check and prediction for the first model:
##Accuracy of the model on the train dataset
pred=regressor$fitted.values
tally_table=data.frame(actual=training_set$price, predicted=pred)
mape=mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy=1-mape
cat("The accuracy on the train data is:",accuracy)
## The accuracy on the train data is: 0.6630708
#Continue with Prediction on the test data
pred_test=predict(regressor, newdata=test_set)
tally_table=data.frame(actual=test_set$price, predicted=pred_test)
mape=mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy=1-mape
cat(" and the accuracy on the test data is:",accuracy)
## and the accuracy on the test data is: 0.6604951
Accuracy check and Prediction of the second model:
##Accuracy of the model on the train dataset
pred2=regressor2$fitted.values
tally_table=data.frame(actual=training_set2$price, predicted=pred2)
mape=mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy=1-mape
cat("The accuracy on the train data is:",accuracy)
## The accuracy on the train data is: 0.7402863
#Continue with Prediction on the test data
pred_test2=predict(regressor2, newdata=test_set2)
tally_table=data.frame(actual=test_set2$price, predicted=pred_test2)
mape=mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy=1-mape
cat(" and the accuracy on the test data is:",accuracy)
## and the accuracy on the test data is: 0.7416078
In sum, the model built with back-elimination, which includes more variables, returns the higher accuracy. With accuracy measured as 1 - MAPE (mean absolute percentage error), this model predicts the price with an accuracy of about 74.2% on the test data.
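The accuracy figure used above is 1 minus the mean absolute percentage error. Since the same three lines are repeated for each model and dataset, they could be wrapped in a small helper. The function name mape_accuracy and the toy values below are assumptions for illustration and are not taken from the King County data.

```r
# Sketch: accuracy defined as 1 - MAPE, as in the checks above.
# mape_accuracy is a hypothetical helper name, not part of any package.
mape_accuracy <- function(actual, predicted) {
  # MAPE is undefined when an actual value is zero
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  mape <- mean(abs(actual - predicted) / abs(actual))
  1 - mape
}

# Toy values for illustration only
actual <- c(100, 200)
predicted <- c(110, 180)
acc <- mape_accuracy(actual, predicted)
cat("Accuracy (1 - MAPE):", acc, "\n")
```

With the objects from this report, the call would look like mape_accuracy(test_set2$price, pred_test2). Note that this measure can become negative when predictions are far off, so it is not bounded like a classification accuracy.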
memory.size()
## [1] 306.34
memory.limit()
## [1] 16314
memory.limit(size=56000)
## [1] 56000
x=select(house_clean, price, price_cat)
##Identify the optimal cluster by plotting dendrogram.
dendrogram=hclust(dist(x,method='euclidean'),method='ward.D')
plot(dendrogram, main=paste('Dendrogram'),xlab='Price',ylab='Price Categories')
##Fitting the Hierarchical Clustering into the data
hc = hclust(dist(x, method='euclidean'), method='ward.D')
y_hc=cutree(hc,4)
From the dendrogram, the optimal number of clusters appears to be 4.
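Once the tree is cut, it can help to check how many observations fall into each cluster and what the typical price in each cluster is. The sketch below does this on synthetic prices; the four group means and the object names toy and hc_toy are made up for illustration and do not come from house_clean.

```r
# Sketch: cut a Ward dendrogram into 4 clusters and summarise them.
# The synthetic prices below are an assumption, not the King County data.
set.seed(1)
toy <- data.frame(price = c(rnorm(20, 2e5, 1e4), rnorm(20, 5e5, 1e4),
                            rnorm(20, 1e6, 1e4), rnorm(20, 2e6, 1e4)))

hc_toy <- hclust(dist(toy, method = "euclidean"), method = "ward.D")
clusters <- cutree(hc_toy, k = 4)      # cluster label for every row

sizes <- table(clusters)               # number of observations per cluster
print(sizes)
print(tapply(toy$price, clusters, mean))  # mean price per cluster
```

Applied to this report, the same two summaries could be produced with table(y_hc) and tapply(x$price, y_hc, mean).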
kx=select(house_clean,price,price_cat)
##Split the data into Train and Test set
set.seed(123)
split=sample.split(kx$price_cat,SplitRatio = 0.8)
training_set=subset(kx, split==TRUE)
test_set=subset(kx, split==FALSE)
##Identify the optimal cluster by using the elbow method
set.seed(6)
wcss=vector()
for(i in 1:10)wcss[i]=sum(kmeans(kx,i)$withinss)
plot(1:10,wcss,type='b',main=paste('The Elbow Method'),xlab='Number of Cluster',ylab='WCSS')
##Fitting the K-mean Clustering into the data
set.seed(29)
km=kmeans(x=kx,centers=5) # stored as km so the result does not share its name with the kmeans() function
y_kmeans=km$cluster
It is difficult to identify the optimal number of clusters from the elbow plot.
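When the elbow is hard to see, one option is to tabulate how much the WCSS falls with each additional cluster instead of relying on the plot alone. The sketch below is a hedged illustration on synthetic data with three made-up groups; the use of nstart = 10 and tot.withinss is a choice made here, not something taken from the analysis above.

```r
# Sketch: elbow method with the WCSS reduction per added cluster.
# The synthetic data (three groups) is an assumption for illustration.
set.seed(6)
toy <- data.frame(x = c(rnorm(50, 0), rnorm(50, 10), rnorm(50, 20)))

ks <- 1:8
# tot.withinss is the total within-cluster sum of squares;
# nstart = 10 reruns k-means from several random starts and keeps the best
wcss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Reduction in WCSS gained by going from k - 1 to k clusters
wcss_drop <- -diff(wcss)
print(data.frame(k = ks[-1], wcss_reduction = round(wcss_drop, 1)))

plot(ks, wcss, type = "b", main = "The Elbow Method",
     xlab = "Number of Clusters", ylab = "WCSS")
```

A candidate for the number of clusters is the last k at which the reduction is still large compared with the ones that follow. If the reductions shrink gradually, as they seem to for the price data, the elbow method alone may not settle the choice.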
clusplot(x,y_hc,lines=0,shade=TRUE,color=TRUE,labels=2,plotchar=FALSE,span=TRUE,main=paste('Cluster of Price'),xlab="Price of House",ylab = "Price Categories")
clusplot(kx,y_kmeans,lines=0,shade=TRUE,color=TRUE,labels=2,plotchar=FALSE,span=TRUE,main=paste('Cluster of Price'),xlab='Price of House',ylab = 'Price Categories')
For multiple linear regression, the model obtained via back-elimination performs better than the model that uses only the highly positively correlated variables, with a higher test accuracy of about 74.2%.
For clustering, 4 clusters were identified from the dendrogram. It is rather difficult to identify the number of clusters with the elbow method, since the curve does not flatten at a clear point.
From the exploratory data analysis in section #3, we concluded that the outcome variable has a high number of legitimate outliers, which result from house characteristics that are also captured in this dataset. As a recommendation for future work, it would be valuable to include further identifying characteristics such as amenities (swimming pool, gym room, etc.), neighboring education facilities (reputable schools and universities), and the distance to the nearest public transportation. These characteristics would likely help to determine the house price. Based on them, the houses could be further segmented into two categories, luxury houses and ordinary houses, and a separate model could be developed for each category.