Introduction

Real estate is a growing industry with decades of property records and customer data, which gives it great potential for incorporating data science into its workflows. With data science, real estate professionals and independent investors can improve the analysis and valuation of properties, which leads to better decisions. This project therefore demonstrates the application of data science techniques to a property sales dataset for house price analysis.

Objective

  1. Classification: To create a model that predicts whether the price of a house is below or above the average (that is, whether the house is cheap or expensive).

  2. Regression: To create a second model that explains the price of a house, expressed in USD.

Questions to Answer

  1. Classification: Is a house expensive?

  2. Regression: How much is a house worth?

Dataset Information

Data Source: House Price Prediction.

The dataset contains variables describing houses located in US cities. The data were collected in February 2014, and the database was published on the Kaggle portal in August 2018. The original database contains information on 4,600 homes, each described by 18 variables.

Description of Variables

The variables are:

  • date (probably the date on which the information about the house was collected)
  • house price (expressed in dollars), number of bedrooms, number of bathrooms
  • living area, plot size
  • number of floors
  • location by the sea (dichotomous variable)
  • view (on a scale from 0 to 4)
  • condition of the house (on a scale from 1 to 5)
  • above-ground area of the house, basement area (if any)
  • year built, year of renovation (if any)
  • street, city
  • state and ZIP code (statezip)
  • country

Area variables (living area, plot size, above-ground area and basement area) are expressed in square feet.

Packages Required

options(warn=-1)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(RColorBrewer)
library(corrplot)

Data Preparation

Import dataset

base <- read.csv("data.csv")
nrow(base)
## [1] 4600
head(base,5)
##                  date   price bedrooms bathrooms sqft_living sqft_lot floors
## 1 2014-05-02 00:00:00  313000        3      1.50        1340     7912    1.5
## 2 2014-05-02 00:00:00 2384000        5      2.50        3650     9050    2.0
## 3 2014-05-02 00:00:00  342000        3      2.00        1930    11947    1.0
## 4 2014-05-02 00:00:00  420000        3      2.25        2000     8030    1.0
## 5 2014-05-02 00:00:00  550000        4      2.50        1940    10500    1.0
##   waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 1          0    0         3       1340             0     1955         2005
## 2          0    4         5       3370           280     1921            0
## 3          0    0         4       1930             0     1966            0
## 4          0    0         4       1000          1000     1963            0
## 5          0    0         4       1140           800     1976         1992
##                     street      city statezip country
## 1     18810 Densmore Ave N Shoreline WA 98133     USA
## 2          709 W Blaine St   Seattle WA 98119     USA
## 3 26206-26214 143rd Ave SE      Kent WA 98042     USA
## 4          857 170th Pl NE  Bellevue WA 98008     USA
## 5        9105 170th Ave NE   Redmond WA 98052     USA

Dimension of dataset

dim(base)
## [1] 4600   18

First 6 observations of the dataset

head(base)
##                  date   price bedrooms bathrooms sqft_living sqft_lot floors
## 1 2014-05-02 00:00:00  313000        3      1.50        1340     7912    1.5
## 2 2014-05-02 00:00:00 2384000        5      2.50        3650     9050    2.0
## 3 2014-05-02 00:00:00  342000        3      2.00        1930    11947    1.0
## 4 2014-05-02 00:00:00  420000        3      2.25        2000     8030    1.0
## 5 2014-05-02 00:00:00  550000        4      2.50        1940    10500    1.0
## 6 2014-05-02 00:00:00  490000        2      1.00         880     6380    1.0
##   waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 1          0    0         3       1340             0     1955         2005
## 2          0    4         5       3370           280     1921            0
## 3          0    0         4       1930             0     1966            0
## 4          0    0         4       1000          1000     1963            0
## 5          0    0         4       1140           800     1976         1992
## 6          0    0         3        880             0     1938         1994
##                     street      city statezip country
## 1     18810 Densmore Ave N Shoreline WA 98133     USA
## 2          709 W Blaine St   Seattle WA 98119     USA
## 3 26206-26214 143rd Ave SE      Kent WA 98042     USA
## 4          857 170th Pl NE  Bellevue WA 98008     USA
## 5        9105 170th Ave NE   Redmond WA 98052     USA
## 6           522 NE 88th St   Seattle WA 98115     USA

Structure of dataset

str(base)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

Summary of dataset

summary(base)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Unused Variables

Not all of the variables are useful for modeling, so the following are excluded during data preparation:

  • Date
  • Address (street, city, statezip, country)
  • No. of floors
  • View
  • No. of bathrooms

prep_data <- base

# Remove unused variables by column position:
# date (1), bathrooms (4), floors (7), view (9), street/city/statezip/country (15:18)
prep_data <- prep_data[,-c(1,4,7,9,15:18)]

# Check remaining variables
colnames(prep_data)
##  [1] "price"         "bedrooms"      "sqft_living"   "sqft_lot"     
##  [5] "waterfront"    "condition"     "sqft_above"    "sqft_basement"
##  [9] "yr_built"      "yr_renovated"
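
Dropping columns by position is fragile if the column order of the input file ever changes. The same step can be sketched by column name instead. The small `toy` data frame and the `drop_cols` vector below are illustrative assumptions, not part of the original analysis:

```r
# Toy stand-in for the raw data: a few columns named as in the original file
toy <- data.frame(
  date = c("2014-05-02 00:00:00", "2014-05-02 00:00:00"),
  price = c(313000, 2384000),
  bedrooms = c(3, 5),
  bathrooms = c(1.5, 2.5),
  floors = c(1.5, 2),
  view = c(0L, 4L),
  city = c("Shoreline", "Seattle"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vector: names of the columns to exclude
drop_cols <- c("date", "bathrooms", "floors", "view",
               "street", "city", "statezip", "country")

# Keep only the columns whose names are not listed in drop_cols
toy_prep <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
colnames(toy_prep)
```

Names that are listed in `drop_cols` but absent from the data (here `street`, `statezip` and `country`) are simply ignored, so the same vector could be reused on the full dataset.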

Dependent Variable: Price of House

Regression: price (numerical variable)

plot(prep_data$price, xlab = "Observation number", ylab = "Price [in $]")

The values that stand out from the rest are houses with very high prices. In this project, we decided to exclude observations with a house price above $1 million, as fewer potential sellers and buyers are interested in such expensive properties. After removing zero values and values above 1 million USD, 4,211 observations are left for analysis.

# Remove observations with a house price of 0 or more than 1 million.
prep_data$price <- ifelse(prep_data$price == 0, NA, prep_data$price)
prep_data$price <- ifelse(prep_data$price > 1000000, NA, prep_data$price)
prep_data <- prep_data[complete.cases(prep_data), ]

# Remaining observations = 4211
nrow(prep_data)
## [1] 4211
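
As a cross-check, the same filter can be written as a single logical subset instead of the NA-then-`complete.cases` route. This is only a sketch on made-up prices, and it assumes there are no missing values in the data:

```r
# Toy prices (made-up values) including a zero and a value above 1 million
toy <- data.frame(price = c(0, 313000, 2384000, 420000, 1000000),
                  bedrooms = c(3, 3, 5, 3, 4))

# Keep rows whose price is non-zero and at most 1 million USD
toy_kept <- toy[toy$price != 0 & toy$price <= 1000000, ]
nrow(toy_kept)
```

A price of exactly 1 million is kept, which matches the `> 1000000` condition used above.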

Classification: price2 (dichotomous variable)

The price variable is converted into a dichotomous variable for classification, with the mean price as the cut-off point:

  • 0: below mean price (cheap)
  • 1: above mean price (expensive)
prep_data$price2 <- ifelse(prep_data$price > mean(prep_data$price), 1, 0)
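
To make the cut-off rule concrete, here is a small worked sketch. The six prices and the mean of 473520.8 are taken from output shown elsewhere in this report; the variable names are illustrative:

```r
# Prices of the first six rows of the processed dataset (from the preview output)
price <- c(313000, 342000, 420000, 550000, 490000, 335000)

# Mean price of the full filtered dataset, as reported in the descriptive analysis
cutoff <- 473520.8

# 1 = above the mean (expensive), 0 = otherwise (cheap)
price2 <- ifelse(price > cutoff, 1, 0)
price2
```

Only the two houses priced at 550,000 and 490,000 USD fall above the mean and are labelled expensive.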

Modification of other variables

To simplify the analysis, some of the numerical variables were converted to binary variables:

  • Renovation Year: yr_renovated to renovated
  • Year Built: yr_built to new
  • Basement Area: sqft_basement to basement

# Renovation Year -> Renovated? (1: yes; 0: no)
prep_data$renovated <- ifelse(prep_data$yr_renovated > 0, 1, 0)
# Year Built -> Is it new (built after 1970)? (1: yes; 0: no)
prep_data$new <- ifelse(prep_data$yr_built > 1970, 1, 0)
# Basement Area -> Is there a basement? (1: yes; 0: no)
prep_data$basement <- ifelse(prep_data$sqft_basement > 0, 1, 0)
colnames(prep_data)
##  [1] "price"         "bedrooms"      "sqft_living"   "sqft_lot"     
##  [5] "waterfront"    "condition"     "sqft_above"    "sqft_basement"
##  [9] "yr_built"      "yr_renovated"  "price2"        "renovated"    
## [13] "new"           "basement"

Data Type Transformation

The following categorical variables are converted to factors:

  • waterfront
  • condition
  • renovated
  • new
  • basement
  • price2

# Categorical variables
prep_data$waterfront <- as.factor(prep_data$waterfront)
prep_data$condition <- as.factor(prep_data$condition)
prep_data$renovated <- as.factor(prep_data$renovated)
prep_data$new <- as.factor(prep_data$new)
prep_data$basement <- as.factor(prep_data$basement)
prep_data$price2 <- as.factor(prep_data$price2)
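
The six conversions above could also be written as one step over a vector of column names. A minimal sketch on a made-up data frame (the `toy` data and `factor_cols` are assumptions for illustration):

```r
# Toy data: two categorical columns stored as numbers, one truly numeric column
toy <- data.frame(waterfront = c(0, 0, 1),
                  condition = c(3, 4, 5),
                  sqft_living = c(1340, 1930, 2000))

# Hypothetical vector naming the columns that should become factors
factor_cols <- c("waterfront", "condition")

# Convert only the listed columns; all other columns are left untouched
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)
str(toy)
```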

Extraction of Variables for Modeling

The variables extracted for modeling are:

  • bedrooms
  • sqft_living
  • sqft_lot
  • waterfront
  • condition
  • sqft_above
  • renovated
  • new
  • basement
  • price
  • price2

# Extract useful variables for modeling
clean_data <- prep_data[,c(2:7,12:14,1,11)]

Preview of Processed Dataset

# Preview of processed dataset 
head(clean_data)
##   bedrooms sqft_living sqft_lot waterfront condition sqft_above renovated new
## 1        3        1340     7912          0         3       1340         1   0
## 3        3        1930    11947          0         4       1930         0   0
## 4        3        2000     8030          0         4       1000         0   0
## 5        4        1940    10500          0         4       1140         1   1
## 6        2         880     6380          0         3        880         1   0
## 7        2        1350     2560          0         3       1350         0   1
##   basement  price price2
## 1        0 313000      0
## 3        0 342000      0
## 4        1 420000      0
## 5        1 550000      1
## 6        0 490000      1
## 7        0 335000      0
str(clean_data)
## 'data.frame':    4211 obs. of  11 variables:
##  $ bedrooms   : num  3 3 3 4 2 2 4 3 4 3 ...
##  $ sqft_living: int  1340 1930 2000 1940 880 1350 2710 2430 1520 1710 ...
##  $ sqft_lot   : int  7912 11947 8030 10500 6380 2560 35868 88426 6200 7320 ...
##  $ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ condition  : Factor w/ 5 levels "1","2","3","4",..: 3 4 4 4 3 3 3 4 3 3 ...
##  $ sqft_above : int  1340 1930 1000 1140 880 1350 2710 1570 1520 1710 ...
##  $ renovated  : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 2 2 ...
##  $ new        : Factor w/ 2 levels "0","1": 1 1 1 2 1 2 2 2 1 1 ...
##  $ basement   : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
##  $ price      : num  313000 342000 420000 550000 490000 ...
##  $ price2     : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 1 2 1 ...
summary(clean_data)
##     bedrooms     sqft_living      sqft_lot       waterfront condition
##  Min.   :1.00   Min.   : 370   Min.   :    638   0:4196     1:   6   
##  1st Qu.:3.00   1st Qu.:1420   1st Qu.:   5000   1:  15     2:  30   
##  Median :3.00   Median :1900   Median :   7528              3:2644   
##  Mean   :3.34   Mean   :1999   Mean   :  14283              4:1160   
##  3rd Qu.:4.00   3rd Qu.:2470   3rd Qu.:  10500              5: 371   
##  Max.   :9.00   Max.   :5960   Max.   :1074218                       
##    sqft_above   renovated new      basement     price         price2  
##  Min.   : 370   0:2491    0:1980   0:2585   Min.   :   7800   0:2326  
##  1st Qu.:1160   1:1720    1:2231   1:1626   1st Qu.: 315322   1:1885  
##  Median :1520                               Median : 442900           
##  Mean   :1720                               Mean   : 473521           
##  3rd Qu.:2150                               3rd Qu.: 600000           
##  Max.   :5190                               Max.   :1000000

Descriptive Analysis

Target Variable

Price Distribution

ggplot(clean_data, aes(price))+
  geom_histogram(binwidth = 80000, fill = "blue", col = "black", alpha = 0.95)+
  geom_vline(xintercept = mean(clean_data$price), linetype = "longdash", size = 0.6)+ 
  annotate("text", x = 820000, y = 550, label = "Mean = $ 473.5k", size = 5)+
  theme_fivethirtyeight()+ theme(axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "blue"))+ 
  labs(x = 'Price [in $]', y = "Frequency")

mean(clean_data$price)
## [1] 473520.8
min(clean_data$price)
## [1] 7800
max(clean_data$price)
## [1] 1e+06

The average house price is 473.5k USD. The distribution of house prices is slightly skewed to the right, indicating that most houses are priced below the average. The lowest house price is 7.8k USD, while the highest is 1 million USD.

House Classification based on Price

temp1 <- clean_data %>%
  group_by(price2) %>%
  summarize(x=n()) %>%
  as.data.frame() 

temp1$price2 <- ifelse(temp1$price2==0, "Below average price", "Above average price")

ggplot(temp1, aes(reorder(as.factor(price2), +x), x, fill = as.factor(price2)))+
  geom_bar(stat = "identity", col = "black")+
  coord_flip()+
  scale_fill_brewer(palette = "Dark2")+
  geom_text(aes(y = x/2, label = x), size = 6)+
  theme_fivethirtyeight()+
  theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
  labs(x = "", y = "Frequency")

2326 houses (55.2%) are categorized as cheap, while the remaining 1885 houses (44.8%) are expensive.
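
The percentages quoted here follow directly from the two counts. A short sketch that rebuilds the labels from the counts in the chart and recomputes the shares:

```r
# Rebuild the class labels from the counts shown in the chart
price2 <- factor(rep(c(0, 1), times = c(2326, 1885)))

# Count each class, then convert the counts to percentages
counts <- table(price2)
shares <- round(100 * prop.table(counts), 1)
shares
```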

Numerical Variables

Number of Bedrooms

mean(clean_data$bedrooms)
## [1] 3.339824
clean_data %>%
  group_by(bedrooms) %>%
  summarize(x=n()) %>%
  as.data.frame() %>%
ggplot(.,aes(as.factor(bedrooms), x))+
  geom_bar(stat = "identity", col = "black", fill = "yellow")+
  geom_text(aes(y = x+50, label = x), size = 5)+
  annotate("text", x = 6.5, y = 1600, label = "Mean = 3.34", size = 6)+
  theme_fivethirtyeight()+
  theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
  labs(x = "Number of bedrooms", y = "Frequency")

The average number of bedrooms is 3.34, so a typical house has about 3 bedrooms.

House Area in Square Feet

mean(clean_data$sqft_living)
## [1] 1998.521
ggplot(clean_data, aes(sqft_living))+
  geom_histogram(binwidth = 500, fill = "green", col = "black", alpha = 0.95)+
  geom_vline(xintercept = mean(clean_data$sqft_living), linetype = "longdash", size = 0.6)+
  annotate("text", x = 4000, y = 800, label = "Mean = 1998.5 sqft", size = 6)+
  theme_fivethirtyeight()+
  theme(axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+ 
  labs(x = "House area [square feet]", y = "Frequency")

The average house area is approximately 2,000 square feet, with most houses below the mean value.

Plot Size in Square Feet

mean(clean_data$sqft_lot)
## [1] 14282.56
ggplot(clean_data, aes(sqft_lot))+
  geom_histogram(binwidth = 5000, fill = "pink", col = "black", alpha = 0.95)+
  geom_vline(xintercept = mean(clean_data$sqft_lot), linetype = "longdash", size = 0.6)+
  scale_x_continuous(limits =c(0,50000))+
  annotate("text", x = 28000, y = 1200, label = "Mean = 14282.56 sqft", size = 6)+
  theme_fivethirtyeight()+
  theme(axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+ 
  labs(x = 'Plot area [square feet]', y = "Frequency")

The plot area distribution is strongly skewed to the right. Most houses have a plot area of less than 10,000 square feet. (The x-axis is limited to 50,000 square feet, so 159 houses with larger plots are excluded from the chart.)
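
The number of houses that fall outside the axis limit can be counted with a logical comparison. The sketch below uses a handful of illustrative plot sizes rather than the full dataset, so it returns a small count rather than the 159 reported above:

```r
# Illustrative plot sizes in square feet; two of them exceed the axis limit
sqft_lot <- c(5000, 7528, 10500, 35868, 88426, 1074218)

# Upper limit used for the x-axis of the histogram
axis_limit <- 50000

# Number of observations that the axis limit removes from the chart
n_excluded <- sum(sqft_lot > axis_limit)
n_excluded
```

On the full data, the same expression applied to `clean_data$sqft_lot` would give the number of houses dropped from the plot.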

Above Ground Area in Square Feet

mean(clean_data$sqft_above)
## [1] 1719.975
ggplot(clean_data, aes(sqft_above))+
  geom_histogram(binwidth = 400, fill = "purple", col = "black", alpha = 0.95)+
  geom_vline(xintercept = mean(clean_data$sqft_above), linetype = "longdash", size = 0.6)+
  annotate("text", x = 3200, y = 900, label = "Mean = 1719.98 sqft", size = 6)+
  theme_fivethirtyeight()+
  theme(axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+ 
  labs(x = 'Above ground area [square feet]', y = "Frequency")

The above-ground area distribution is strongly skewed to the right. Most houses have an above-ground area of less than 2,000 square feet.

Categorical Variables

House Location

clean_data %>%
  group_by(waterfront) %>%
  summarize(x=n()) %>%
  as.data.frame() %>%
ggplot(., aes(reorder(as.factor(waterfront), +x), x, fill = as.factor(waterfront)))+
  geom_bar(stat = "identity", col = "black")+
  scale_fill_brewer(palette = "Dark2")+
  geom_text(aes(y = x+150, label = x), size = 6)+
  theme_fivethirtyeight()+
  theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
  labs(x = "Location by the sea", y = "Frequency")

Only 15 of the houses in the dataset (0.36%) are located by the sea.

House Condition

clean_data %>%
  group_by(condition) %>%
  summarize(x=n()) %>%
  as.data.frame() %>%
ggplot(., aes(condition, x, fill = as.factor(condition)))+
  geom_bar(stat = "identity", col = "black", alpha = .85)+
  scale_fill_brewer(palette = "Dark2")+
  geom_text(aes(y = x+120, label = x), size = 6)+
  theme_fivethirtyeight()+
  theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
  labs(x = "House condition [from 1 to 5]", y = "Frequency")

Most houses are in average condition, rank 3 (2,644 houses), followed by better condition, rank 4 (1,160 houses), and the best condition, rank 5 (371 houses). Houses rarely fall into conditions 1 and 2.

House Renovation

clean_data %>%
  group_by(renovated) %>%
  summarize(x=n()) %>%
  as.data.frame() %>%
ggplot(., aes(as.factor(renovated), x, fill = as.factor(renovated)))+
  geom_bar(stat = "identity", col = "black", alpha = 0.95)+
  scale_fill_brewer(palette = "Dark2")+
  geom_text(aes(y = x-130, label = x), size = 6)+
  theme_fivethirtyeight()+
  theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
  labs(x = "Renovated?", y = "Frequency")

Most houses (2,491, or 59%) were not renovated, while the remaining 1,720 houses (41%) were renovated.

Old or New House (Built after 1970)

clean_data %>%
  group_by(new) %>%
  summarize(x=n()) %>%
  as.data.frame() %>%
ggplot(., aes(new, x, fill = new))+
  geom_bar(stat = "identity", col = "black", alpha = 0.95)+
  scale_fill_brewer(palette = "Dark2")+
  geom_text(aes(y = x-120, label = x), size = 6)+
  theme_fivethirtyeight()+
  theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
  labs(x = "Built after 1970", y = "Frequency")

2,231 houses (53%) were built after 1970 (less than 50 years old) and are thus relatively new. The remaining 1,980 houses were built in or before 1970.

Basement

clean_data %>%
  group_by(basement) %>%
  summarize(x=n()) %>%
  as.data.frame() %>%
ggplot(., aes(as.factor(basement), x, fill = as.factor(basement)))+
  geom_bar(stat = "identity", col = "black", alpha = 0.95)+
  scale_fill_brewer(palette = "Dark2")+
  geom_text(aes(y = x-130, label = x), size = 6)+
  theme_fivethirtyeight()+
  theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
  labs(x = "Basement?", y = "Frequency")

2,585 houses (61.4%) do not have a basement, while the remaining 38.6% have one.

Predictive analysis

Elimination of strongly correlated predictor variables

sqft_living, sqft_lot, and sqft_above were suspected to be highly correlated with one another, which may affect the modeling results. Thus, Pearson’s linear correlation coefficients were checked.

corrplot(cor(clean_data[,c('sqft_living','sqft_lot','sqft_above')]), method="circle", type = "upper",order="hclust", 
         bg="lightblue", tl.col="black",tl.srt=90, 
         addCoef.col = "black", diag = T, number.cex = 1.2)

A strong correlation was found between the living area and the above-ground area, so we removed the sqft_above variable from predictive modeling.
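
Besides reading the correlation plot, highly correlated pairs can be listed programmatically. The sketch below uses the first ten rows shown in the `str(base)` output, and the 0.7 threshold is an arbitrary assumption chosen for illustration:

```r
# First ten rows of the three area variables, as shown in the str(base) output
areas <- data.frame(
  sqft_living = c(1340, 3650, 1930, 2000, 1940, 880, 1350, 2710, 2430, 1520),
  sqft_lot    = c(7912, 9050, 11947, 8030, 10500, 6380, 2560, 35868, 88426, 6200),
  sqft_above  = c(1340, 3370, 1930, 1000, 1140, 880, 1350, 2710, 1570, 1520)
)

# Pearson correlation matrix of the area variables
cor_mat <- cor(areas, method = "pearson")

# Hypothetical cut-off above which a pair is treated as strongly correlated
threshold <- 0.7

# Look only at the upper triangle so each pair is reported once
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

# Table of flagged pairs with their correlation coefficients
flagged <- data.frame(var1 = rownames(cor_mat)[high[, "row"]],
                      var2 = colnames(cor_mat)[high[, "col"]],
                      r = cor_mat[high])
flagged
```

With only ten rows the coefficients will differ from those in the plot; the point of the sketch is the mechanics of flagging pairs, not the values.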

# Remove sqft_above variable (column 6)
clean_data <- clean_data[,-6]
# Preview remaining dataset
head(clean_data)
##   bedrooms sqft_living sqft_lot waterfront condition renovated new basement
## 1        3        1340     7912          0         3         1   0        0
## 3        3        1930    11947          0         4         0   0        0
## 4        3        2000     8030          0         4         0   0        1
## 5        4        1940    10500          0         4         1   1        1
## 6        2         880     6380          0         3         1   0        0
## 7        2        1350     2560          0         3         0   1        0
##    price price2
## 1 313000      0
## 3 342000      0
## 4 420000      0
## 5 550000      1
## 6 490000      1
## 7 335000      0

Predictive Modeling Dataset

Extract Datasets for Classification and Regression Modeling

# Dataset for classification Model
data1 <- clean_data[,c(1:8,10)]
head(data1)
##   bedrooms sqft_living sqft_lot waterfront condition renovated new basement
## 1        3        1340     7912          0         3         1   0        0
## 3        3        1930    11947          0         4         0   0        0
## 4        3        2000     8030          0         4         0   0        1
## 5        4        1940    10500          0         4         1   1        1
## 6        2         880     6380          0         3         1   0        0
## 7        2        1350     2560          0         3         0   1        0
##   price2
## 1      0
## 3      0
## 4      0
## 5      1
## 6      1
## 7      0
# Dataset for Regression Model
data2 <- clean_data[,c(1:9)]
head(data2)
##   bedrooms sqft_living sqft_lot waterfront condition renovated new basement
## 1        3        1340     7912          0         3         1   0        0
## 3        3        1930    11947          0         4         0   0        0
## 4        3        2000     8030          0         4         0   0        1
## 5        4        1940    10500          0         4         1   1        1
## 6        2         880     6380          0         3         1   0        0
## 7        2        1350     2560          0         3         0   1        0
##    price
## 1 313000
## 3 342000
## 4 420000
## 5 550000
## 6 490000
## 7 335000

Classification

Aim: To classify houses based on price category.

Target Variable: price2 - Categorical variable

  • 0: Cheap (Below average price)
  • 1: Expensive (Above average price)

Train-Test Data

library(caret)

# Change Data Type of 'price2' to factor
data1$price2 <- as.factor(data1$price2)

# Create a stratified index covering 80% of the rows, to be used for training
validation_index <- createDataPartition(data1$price2, p=0.80, list=FALSE)

# Hold out the remaining 20% of the data for testing
testData <- data1[-validation_index,]

# Use the selected 80% of the data to train the model
trainData <- data1[validation_index,]
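
Because the partition is drawn at random, the split (and therefore every metric reported later) changes from run to run unless a seed is set before the split is made. The sketch below shows the idea with a stratified 80/20 split written in base R as a stand-in for `createDataPartition`; the toy labels and the seed value are assumptions:

```r
set.seed(10)  # fix the random number generator so the split is repeatable

# Made-up labels: 60 cheap (0) and 40 expensive (1) houses
toy <- data.frame(id = 1:100,
                  price2 = factor(rep(c(0, 1), times = c(60, 40))))

# Sample 80% of the row indices within each class (a stratified split)
rows_by_class <- split(seq_len(nrow(toy)), toy$price2)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}), use.names = FALSE)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Class proportions are preserved in both parts
table(train$price2)
table(test$price2)
```

In the report's own code, the equivalent change would be to call `set.seed()` immediately before `createDataPartition()`.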

Random Forest Model for House Classification

library(randomForest)

set.seed(10)
# Build Random Forest Model for Classification
RFModel <- randomForest(price2 ~ .,
                    data=trainData,
                    importance=TRUE,
                    ntree=2000)
varImpPlot(RFModel)

According to the Random Forest model, the most important variable for predicting the price category of a house is sqft_living.

Prediction and Model Evaluation

RFPrediction <- predict(RFModel, testData)
RFPredictionprob <- predict(RFModel, testData, type="prob")[, 2]

RFConfMat <- confusionMatrix(RFPrediction, testData[,"price2"])
RFConfMat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 385 129
##          1  80 248
##                                           
##                Accuracy : 0.7518          
##                  95% CI : (0.7212, 0.7806)
##     No Information Rate : 0.5523          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4918          
##                                           
##  Mcnemar's Test P-Value : 0.0008994       
##                                           
##             Sensitivity : 0.8280          
##             Specificity : 0.6578          
##          Pos Pred Value : 0.7490          
##          Neg Pred Value : 0.7561          
##              Prevalence : 0.5523          
##          Detection Rate : 0.4572          
##    Detection Prevalence : 0.6105          
##       Balanced Accuracy : 0.7429          
##                                           
##        'Positive' Class : 0               
## 
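
The headline statistics can be recomputed by hand from the four cell counts, which helps when interpreting them. A minimal sketch using the counts printed above, with class 0 as the positive class as in the output:

```r
# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(385, 80, 129, 248), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Share of all predictions that are correct
accuracy <- sum(diag(cm)) / sum(cm)

# Share of truly cheap houses (class 0) predicted as cheap
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Share of truly expensive houses (class 1) predicted as expensive
specificity <- cm["1", "1"] / sum(cm[, "1"])

# McNemar's test compares the two off-diagonal counts (129 vs 80)
mcnemar_p <- mcnemar.test(cm)$p.value

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
mcnemar_p
```

A small McNemar p-value means the two kinds of error are not balanced: here the model calls expensive houses cheap more often than the reverse.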

Regression

Aim: To predict the price of a house expressed in dollars.

Target Variable: price (Numerical variable)

The same predictor variables used for classification are used in this section.

Train-Test Data

# Create an index covering 80% of the rows, to be used for training
validation_index2 <- createDataPartition(data2$price, p=0.80, list=FALSE)
# Hold out the remaining 20% of the data for testing
testData2 <- data2[-validation_index2,]
# Use the selected 80% of the data to train the model
trainData2 <- data2[validation_index2,]

Multiple Regression Model for Price Prediction

#Build Multiple Regression model
MRModel <- lm(price ~ .,data=trainData2)
summary(MRModel)
## 
## Call:
## lm(formula = price ~ ., data = trainData2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -859875 -114665   -5734  102883  538725 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.834e+05  6.916e+04   2.652 0.008035 ** 
## bedrooms    -2.280e+04  3.872e+03  -5.887 4.32e-09 ***
## sqft_living  1.799e+02  4.740e+00  37.951  < 2e-16 ***
## sqft_lot    -1.476e-01  7.492e-02  -1.970 0.048863 *  
## waterfront1  1.541e+05  4.123e+04   3.739 0.000188 ***
## condition2  -1.071e+05  7.480e+04  -1.432 0.152348    
## condition3   2.441e+04  6.885e+04   0.355 0.722904    
## condition4   2.288e+04  6.875e+04   0.333 0.739287    
## condition5   8.846e+04  6.907e+04   1.281 0.200389    
## renovated1  -1.377e+03  6.291e+03  -0.219 0.826731    
## new1        -3.877e+04  6.886e+03  -5.630 1.95e-08 ***
## basement1    4.842e+03  5.725e+03   0.846 0.397750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 153000 on 3359 degrees of freedom
## Multiple R-squared:  0.405,  Adjusted R-squared:  0.403 
## F-statistic: 207.8 on 11 and 3359 DF,  p-value: < 2.2e-16

From the model summary, the p-value of the F-statistic is less than 0.05, indicating that the multiple regression model is statistically significant.

Based on the diagnostics,

  • The most significant predictor variable is sqft_living: it has the largest absolute t-value (37.951) and the smallest p-value (< 2e-16) of all the coefficients.
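
The ranking of predictors by significance can be checked by sorting the coefficients on the absolute value of their t-statistics. The sketch below copies the t-values from the summary above into a small data frame; the `coefs` and `ranked` names are illustrative:

```r
# t-values copied from the regression summary (intercept omitted)
coefs <- data.frame(
  term = c("bedrooms", "sqft_living", "sqft_lot", "waterfront1",
           "condition2", "condition3", "condition4", "condition5",
           "renovated1", "new1", "basement1"),
  t_value = c(-5.887, 37.951, -1.970, 3.739,
              -1.432, 0.355, 0.333, 1.281,
              -0.219, -5.630, 0.846),
  stringsAsFactors = FALSE
)

# Larger |t| means stronger evidence that the coefficient differs from zero
ranked <- coefs[order(abs(coefs$t_value), decreasing = TRUE), ]
head(ranked, 3)
```

With the fitted model at hand, the same t-values are available from `summary(MRModel)$coefficients[, "t value"]`.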

Prediction and Model Evaluation

# Use Model to perform prediction on testData2
MRPrediction <- predict(MRModel,testData2)

#correlation between the actuals and predicted values
actuals_preds <- data.frame(cbind(actuals=testData2$price, predicteds=MRPrediction))
head(actuals_preds)
##    actuals predicteds
## 9   452500   528083.1
## 10  640000   387798.1
## 14  365000   331671.8
## 16  242500   352353.0
## 18  367500   636295.7
## 33  650000   446787.3
# checking correlation accuracy
correlation_accuracy <- cor(actuals_preds) 
correlation_accuracy # about 62.42%
##              actuals predicteds
## actuals    1.0000000  0.6241577
## predicteds 0.6241577  1.0000000
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
paste("min_max_accuracy =",min_max_accuracy)
## [1] "min_max_accuracy = 0.764868107467253"
# Calculate MAE, MSE, RMSE and MAPE for the target variable price
DMwR::regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
##          mae          mse         rmse         mape 
## 1.266602e+05 2.385925e+10 1.544644e+05 3.756781e-01
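
In case the DMwR package is not available, the same error measures are easy to compute in base R. The sketch below uses only the six actual and predicted values shown in the preview above, so its results will differ from the full test-set figures:

```r
# Six actual and predicted prices from the head(actuals_preds) output
actuals    <- c(452500, 640000, 365000, 242500, 367500, 650000)
predicteds <- c(528083.1, 387798.1, 331671.8, 352353.0, 636295.7, 446787.3)

err <- actuals - predicteds

mae  <- mean(abs(err))                 # mean absolute error
mse  <- mean(err^2)                    # mean squared error
rmse <- sqrt(mse)                      # root mean squared error
mape <- mean(abs(err) / abs(actuals))  # mean absolute percentage error (as a fraction)

# Min-max accuracy: row-wise smaller value divided by row-wise larger value
min_max <- mean(pmin(actuals, predicteds) / pmax(actuals, predicteds))

c(mae = mae, mse = mse, rmse = rmse, mape = mape, min_max = min_max)
```

MAE and RMSE are in dollars, while MAPE and min-max accuracy are unitless ratios, which is why they are easier to compare across datasets.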

Discussion and Conclusion

Discussion

Classification model

From the confusion matrix,

  • A total of 842 predictions were made.
  • The classifier predicted that 328 properties are expensive (1) and 514 properties are cheap (0).
  • In reality (the reference in the matrix), 377 properties are expensive (1) and 465 properties are cheap (0).

Hence, the accuracy of the classification model based on the matrix is 0.7518 (75.18%).

For McNemar’s test, the p-value is 0.0008994, which indicates that the two types of misclassification occur at different rates: the model labels expensive houses as cheap (129 cases) more often than it labels cheap houses as expensive (80 cases). The model is therefore somewhat biased towards predicting the cheap category.

Regression model

Based on the regression prediction model, it is shown that:

  • The most significant predictor variable is sqft_living: it has the largest absolute t-value (37.951) and the smallest p-value of all the coefficients.

  • The correlation between the actual and predicted prices is 0.6242 (62.42%), which is considered an average rate.

  • The min_max accuracy of the model is 0.765 (76.5%), which is considered an average rate.

  • The adjusted R-squared of the model is 0.403, which would be problematic for high-precision prediction.

  • The Root Mean Square Error (RMSE) of the model is 154,464.4 dollars.

Conclusion

  • In conclusion, the Random Forest classification model and the multiple regression model each showed some ability to predict, respectively, the price category of a house and its price in dollars. For the classification problem, the prediction accuracy reached 75.18% on the test data. For the regression problem, the RMSE of the model is 154,464.4 dollars, with 62.42% correlation accuracy and 76.5% min_max accuracy, which can be considered average accuracy, but the R-squared of the model is about 0.4, which is considered below-average precision. Future work could therefore try different approaches to house price prediction, such as neural networks.