Introduction
Real estate is a growing industry with decades of property records and customer data, which gives it great potential for incorporating data science into its workflows. With data science, real estate professionals and independent investors can improve the analysis and valuation of properties, leading to better decisions. This project therefore demonstrates the application of data science techniques to a property sales dataset for house price analysis.
Objective
Classification: To create a model that predicts whether the price of the house is below or above average (whether the house is cheap or expensive)
Regression: To create a different model that will explain the price of the house expressed in USD.
Questions to Answer
Classification: Is a house expensive?
Regression: How much is a house worth?
Dataset Information
Data Source: House Price Prediction.
The dataset contains a number of variables describing the parameters of houses located in US cities. The data was collected in February 2014 and the database on the Kaggle portal was published in August 2018. The original database contains information on 4,600 homes that are represented by 18 variables.
Description of Variables
The variables are:
options(warn=-1)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(RColorBrewer)
library(corrplot)
Import Dataset
base <- read.csv("data.csv")
nrow(base)
## [1] 4600
head(base,5)
## date price bedrooms bathrooms sqft_living sqft_lot floors
## 1 2014-05-02 00:00:00 313000 3 1.50 1340 7912 1.5
## 2 2014-05-02 00:00:00 2384000 5 2.50 3650 9050 2.0
## 3 2014-05-02 00:00:00 342000 3 2.00 1930 11947 1.0
## 4 2014-05-02 00:00:00 420000 3 2.25 2000 8030 1.0
## 5 2014-05-02 00:00:00 550000 4 2.50 1940 10500 1.0
## waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 1 0 0 3 1340 0 1955 2005
## 2 0 4 5 3370 280 1921 0
## 3 0 0 4 1930 0 1966 0
## 4 0 0 4 1000 1000 1963 0
## 5 0 0 4 1140 800 1976 1992
## street city statezip country
## 1 18810 Densmore Ave N Shoreline WA 98133 USA
## 2 709 W Blaine St Seattle WA 98119 USA
## 3 26206-26214 143rd Ave SE Kent WA 98042 USA
## 4 857 170th Pl NE Bellevue WA 98008 USA
## 5 9105 170th Ave NE Redmond WA 98052 USA
Dimension of dataset
dim(base)
## [1] 4600 18
First 6 observations of the dataset
head(base)
## date price bedrooms bathrooms sqft_living sqft_lot floors
## 1 2014-05-02 00:00:00 313000 3 1.50 1340 7912 1.5
## 2 2014-05-02 00:00:00 2384000 5 2.50 3650 9050 2.0
## 3 2014-05-02 00:00:00 342000 3 2.00 1930 11947 1.0
## 4 2014-05-02 00:00:00 420000 3 2.25 2000 8030 1.0
## 5 2014-05-02 00:00:00 550000 4 2.50 1940 10500 1.0
## 6 2014-05-02 00:00:00 490000 2 1.00 880 6380 1.0
## waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 1 0 0 3 1340 0 1955 2005
## 2 0 4 5 3370 280 1921 0
## 3 0 0 4 1930 0 1966 0
## 4 0 0 4 1000 1000 1963 0
## 5 0 0 4 1140 800 1976 1992
## 6 0 0 3 880 0 1938 1994
## street city statezip country
## 1 18810 Densmore Ave N Shoreline WA 98133 USA
## 2 709 W Blaine St Seattle WA 98119 USA
## 3 26206-26214 143rd Ave SE Kent WA 98042 USA
## 4 857 170th Pl NE Bellevue WA 98008 USA
## 5 9105 170th Ave NE Redmond WA 98052 USA
## 6 522 NE 88th St Seattle WA 98115 USA
Structure of dataset
str(base)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
Summary of dataset
summary(base)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Since not all of the variables are useful for modeling, some of them are excluded during data preparation:
prep_data <- base
# Remove unused variables
prep_data <- prep_data[,-c(1,4,7,9,15:18)]
# Check remaining variables
colnames(prep_data)
## [1] "price" "bedrooms" "sqft_living" "sqft_lot"
## [5] "waterfront" "condition" "sqft_above" "sqft_basement"
## [9] "yr_built" "yr_renovated"
Regression: price (numerical variable)
plot(prep_data$price, xlab = "Observation number", ylab = "Price [in $]")
The values that stand out from the rest are houses with very high prices. In this project, observations with a house price above 1 million USD are excluded, as fewer potential sellers and buyers are interested in such expensive properties. Observations with a price of zero are removed as well. After both removals, 4,211 observations are left for analysis.
# Remove observations with a house price of 0 or more than 1 million.
prep_data$price <- ifelse(prep_data$price == 0, NA, prep_data$price)
prep_data$price <- ifelse(prep_data$price > 1000000, NA, prep_data$price)
prep_data <- prep_data[complete.cases(prep_data), ]
# Remaining observations = 4211
nrow(prep_data)
## [1] 4211
Classification: price2 (dichotomous variable)
The price variable is converted into a dichotomous variable for classification, with the mean price as the cut-off point:
prep_data$price2 <- ifelse(prep_data$price > mean(prep_data$price), 1, 0)
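The two preparation steps above (dropping prices of 0 or above 1 million, then labelling by the mean) can be illustrated on a toy vector. This is a minimal sketch: `toy_price` and its values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical prices, including a zero and a value above 1 million
toy_price <- c(0, 250000, 400000, 650000, 900000, 2384000)

# Keep only prices greater than 0 and at most 1,000,000, as in the filtering step
kept <- toy_price[toy_price > 0 & toy_price <= 1000000]

# Label each remaining price: 1 if above the mean of the kept prices, else 0
cutoff <- mean(kept)
label <- ifelse(kept > cutoff, 1, 0)

data.frame(price = kept, price2 = label)
```

Because the cut-off is the mean of the filtered prices, the two classes are not forced to be equal in size.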
To simplify the analysis, some of the numerical variables were converted to binary variables:
yr_renovated to renovated
yr_built to new
sqft_basement to basement
# Renovation Year -> Renovated? (1: yes; 0: no)
prep_data$renovated <- ifelse(prep_data$yr_renovated > 0, 1, 0)
# Year Built -> Is it new (Built after 1970)? (1: yes; 0: no)
prep_data$new <- ifelse(prep_data$yr_built> 1970, 1, 0)
# Basement Area -> Is there a basement? (1: yes; 0: no)
prep_data$basement <- ifelse(prep_data$sqft_basement > 0, 1, 0)
colnames(prep_data)
## [1] "price" "bedrooms" "sqft_living" "sqft_lot"
## [5] "waterfront" "condition" "sqft_above" "sqft_basement"
## [9] "yr_built" "yr_renovated" "price2" "renovated"
## [13] "new" "basement"
Transform categorical variables:
# Categorical variables
prep_data$waterfront <- as.factor(prep_data$waterfront )
prep_data$condition <- as.factor(prep_data$condition)
prep_data$renovated <- as.factor(prep_data$renovated)
prep_data$new <- as.factor(prep_data$new)
prep_data$basement <- as.factor(prep_data$basement)
prep_data$price2 <- as.factor(prep_data$price2)
The variables extracted for modeling are:
# Extract useful variables for modeling
clean_data <- prep_data[,c(2:7,12:14,1,11)]
# Preview of processed dataset
head(clean_data)
## bedrooms sqft_living sqft_lot waterfront condition sqft_above renovated new
## 1 3 1340 7912 0 3 1340 1 0
## 3 3 1930 11947 0 4 1930 0 0
## 4 3 2000 8030 0 4 1000 0 0
## 5 4 1940 10500 0 4 1140 1 1
## 6 2 880 6380 0 3 880 1 0
## 7 2 1350 2560 0 3 1350 0 1
## basement price price2
## 1 0 313000 0
## 3 0 342000 0
## 4 1 420000 0
## 5 1 550000 1
## 6 0 490000 1
## 7 0 335000 0
str(clean_data)
## 'data.frame': 4211 obs. of 11 variables:
## $ bedrooms : num 3 3 3 4 2 2 4 3 4 3 ...
## $ sqft_living: int 1340 1930 2000 1940 880 1350 2710 2430 1520 1710 ...
## $ sqft_lot : int 7912 11947 8030 10500 6380 2560 35868 88426 6200 7320 ...
## $ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ condition : Factor w/ 5 levels "1","2","3","4",..: 3 4 4 4 3 3 3 4 3 3 ...
## $ sqft_above : int 1340 1930 1000 1140 880 1350 2710 1570 1520 1710 ...
## $ renovated : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 2 2 ...
## $ new : Factor w/ 2 levels "0","1": 1 1 1 2 1 2 2 2 1 1 ...
## $ basement : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
## $ price : num 313000 342000 420000 550000 490000 ...
## $ price2 : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 1 2 1 ...
summary(clean_data)
## bedrooms sqft_living sqft_lot waterfront condition
## Min. :1.00 Min. : 370 Min. : 638 0:4196 1: 6
## 1st Qu.:3.00 1st Qu.:1420 1st Qu.: 5000 1: 15 2: 30
## Median :3.00 Median :1900 Median : 7528 3:2644
## Mean :3.34 Mean :1999 Mean : 14283 4:1160
## 3rd Qu.:4.00 3rd Qu.:2470 3rd Qu.: 10500 5: 371
## Max. :9.00 Max. :5960 Max. :1074218
## sqft_above renovated new basement price price2
## Min. : 370 0:2491 0:1980 0:2585 Min. : 7800 0:2326
## 1st Qu.:1160 1:1720 1:2231 1:1626 1st Qu.: 315322 1:1885
## Median :1520 Median : 442900
## Mean :1720 Mean : 473521
## 3rd Qu.:2150 3rd Qu.: 600000
## Max. :5190 Max. :1000000
Price Distribution
ggplot(clean_data, aes(price))+
geom_histogram(binwidth = 80000, fill = "blue", col = "black", alpha = 0.95)+
geom_vline(xintercept = mean(clean_data$price), linetype = "longdash", size = 0.6)+
annotate("text", x = 820000, y = 550, label = "Mean = $ 473.5k", size = 5)+
theme_fivethirtyeight()+ theme(axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "blue"))+
labs(x = 'Price [in $]', y = "Frequency")
mean(clean_data$price)
## [1] 473520.8
min(clean_data$price)
## [1] 7800
max(clean_data$price)
## [1] 1e+06
The average price of a house is 473.5k USD. The distribution of house prices is slightly skewed to the right, indicating that most houses are priced below the average. The lowest house price is 7.8k USD, while the highest is 1 million USD.
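Right skew can also be checked numerically: the mean exceeds the median and the moment-based skewness coefficient is positive. The sketch below uses made-up values, and `skewness_coef` is a helper defined here for illustration, not a function from any package loaded above.

```r
# Moment-based sample skewness: positive values indicate a longer right tail
skewness_coef <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical prices with one high value pulling the mean upward
toy_price <- c(200000, 300000, 350000, 400000, 450000, 500000, 950000)

mean(toy_price) > median(toy_price)  # is the mean above the median?
skewness_coef(toy_price) > 0         # is the skewness positive?
```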
House Classification based on Price
temp1 <- clean_data %>%
group_by(price2) %>%
summarize(x=n()) %>%
as.data.frame()
temp1$price2 <- ifelse(temp1$price2==0, "Below average price", "Above average price")
ggplot(temp1, aes(reorder(as.factor(price2), +x), x, fill = as.factor(price2)))+
geom_bar(stat = "identity", col = "black")+
coord_flip()+
scale_fill_brewer(palette = "Dark2")+
geom_text(aes(y = x/2, label = x), size = 6)+
theme_fivethirtyeight()+
theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = "", y = "Frequency")
2326 houses (55.2%) are categorized as cheap, while the remaining 1885 houses (44.8%) are expensive.
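The percentages quoted here follow directly from the class counts. A minimal sketch, assuming only the two counts reported in the summary of clean_data (the vector name `counts` is introduced for illustration):

```r
# Class counts as reported in the summary of clean_data
counts <- c("Below average price" = 2326, "Above average price" = 1885)

# Share of each class in percent, rounded to one decimal place
shares <- round(100 * prop.table(counts), 1)
shares
```

The same pattern, `prop.table()` applied to a vector of counts, reproduces the shares quoted for the other binary variables in this section.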
Number of Bedrooms
mean(clean_data$bedrooms)
## [1] 3.339824
clean_data %>%
group_by(bedrooms) %>%
summarize(x=n()) %>%
as.data.frame() %>%
ggplot(.,aes(as.factor(bedrooms), x))+
geom_bar(stat = "identity", col = "black", fill = "yellow")+
geom_text(aes(y = x+50, label = x), size = 5)+
annotate("text", x = 6.5, y = 1600, label = "Mean = 3.34", size = 6)+
theme_fivethirtyeight()+
theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = "Number of bedrooms", y = "Frequency")
The average number of bedrooms is 3.34, so a typical house has about 3 bedrooms.
House Area in Square Feet
mean(clean_data$sqft_living)
## [1] 1998.521
ggplot(clean_data, aes(sqft_living))+
geom_histogram(binwidth = 500, fill = "green", col = "black", alpha = 0.95)+
geom_vline(xintercept = mean(clean_data$sqft_living), linetype = "longdash", size = 0.6)+
annotate("text", x = 4000, y = 800, label = "Mean = 1998.5 sqft", size = 6)+
theme_fivethirtyeight()+
theme(axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = "House area [square feet]", y = "Frequency")
The average house area is approximately 2,000 square feet, with most houses below the mean value.
Plot Size in Square Feet
mean(clean_data$sqft_lot)
## [1] 14282.56
ggplot(clean_data, aes(sqft_lot))+
geom_histogram(binwidth = 5000, fill = "pink", col = "black", alpha = 0.95)+
geom_vline(xintercept = mean(clean_data$sqft_lot), linetype = "longdash", size = 0.6)+
scale_x_continuous(limits =c(0,50000))+
annotate("text", x = 28000, y = 1200, label = "Mean = 14282.56 sqft", size = 6)+
theme_fivethirtyeight()+
theme(axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = 'Plot area [square feet]', y = "Frequency")
The plot area distribution is strongly skewed to the right. Most houses have a plot area of less than 10,000 square feet. (The x-axis is limited to 50,000 square feet, so 159 houses with larger plots are excluded from the chart.)
Above Ground Area in Square Feet
mean(clean_data$sqft_above)
## [1] 1719.975
ggplot(clean_data, aes(sqft_above))+
geom_histogram(binwidth = 400, fill = "purple", col = "black", alpha = 0.95)+
geom_vline(xintercept = mean(clean_data$sqft_above), linetype = "longdash", size = 0.6)+
annotate("text", x = 3200, y = 900, label = "Mean = 1719.98 sqft", size = 6)+
theme_fivethirtyeight()+
theme(axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = 'Above ground area [square feet]', y = "Frequency")
The above-ground area distribution is skewed to the right. Most houses have an above-ground area of less than 2,000 square feet.
House Location
clean_data %>%
group_by(waterfront) %>%
summarize(x=n()) %>%
as.data.frame() %>%
ggplot(., aes(reorder(as.factor(waterfront), +x), x, fill = as.factor(waterfront)))+
geom_bar(stat = "identity", col = "black")+
scale_fill_brewer(palette = "Dark2")+
geom_text(aes(y = x+150, label = x), size = 6)+
theme_fivethirtyeight()+
theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = "Location by the sea", y = "Frequency")
Only 15 houses (0.36%) of all the houses in the dataset are located by the sea.
House Condition
clean_data %>%
group_by(condition) %>%
summarize(x=n()) %>%
as.data.frame() %>%
ggplot(., aes(condition, x, fill = as.factor(condition)))+
geom_bar(stat = "identity", col = "black", alpha = .85)+
scale_fill_brewer(palette = "Dark2")+
geom_text(aes(y = x+120, label = x), size = 6)+
theme_fivethirtyeight()+
theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = "House condition [from 1 to 5]", y = "Frequency")
Most houses are in average condition, rank 3 (2,644 houses), followed by better condition, rank 4 (1,160 houses), and best condition, rank 5 (371 houses). Houses in condition 1 or 2 are very rare.
House Renovation
clean_data %>%
group_by(renovated) %>%
summarize(x=n()) %>%
as.data.frame() %>%
ggplot(., aes(as.factor(renovated), x, fill = as.factor(renovated)))+
geom_bar(stat = "identity", col = "black", alpha = 0.95)+
scale_fill_brewer(palette = "Dark2")+
geom_text(aes(y = x-130, label = x), size = 6)+
theme_fivethirtyeight()+
theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = "Renovated?", y = "Frequency")
Most houses (2,491, or 59%) were not renovated, while the remaining 1,720 houses (41%) were renovated.
Old or New House (Built after 1970)
clean_data %>%
group_by(new) %>%
summarize(x=n()) %>%
as.data.frame() %>%
ggplot(., aes(new, x, fill = new))+
geom_bar(stat = "identity", col = "black", alpha = 0.95)+
scale_fill_brewer(palette = "Dark2")+
geom_text(aes(y = x-120, label = x), size = 6)+
theme_fivethirtyeight()+
theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = "Built after 1970", y = "Frequency")
2,231 houses (53%) were built after 1970 (less than 50 years old) and are thus relatively new. The remaining 1,980 houses were built in or before 1970.
Basement
clean_data %>%
group_by(basement) %>%
summarize(x=n()) %>%
as.data.frame() %>%
ggplot(., aes(as.factor(basement), x, fill = as.factor(basement)))+
geom_bar(stat = "identity", col = "black", alpha = 0.95)+
scale_fill_brewer(palette = "Dark2")+
geom_text(aes(y = x-130, label = x), size = 6)+
theme_fivethirtyeight()+
theme(legend.position = "none", axis.title = element_text(), axis.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"))+
labs(x = "Basement?", y = "Frequency")
2,585 houses (61.4%) do not have a basement, while the remaining 1,626 houses (38.6%) do.
Extract Datasets for Classification and Regression Modeling
# Dataset for classification Model
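# Columns are selected by name; sqft_above is left out, as in the output below
data1 <- clean_data[, c("bedrooms", "sqft_living", "sqft_lot", "waterfront", "condition", "renovated", "new", "basement", "price2")]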
head(data1)
## bedrooms sqft_living sqft_lot waterfront condition renovated new basement
## 1 3 1340 7912 0 3 1 0 0
## 3 3 1930 11947 0 4 0 0 0
## 4 3 2000 8030 0 4 0 0 1
## 5 4 1940 10500 0 4 1 1 1
## 6 2 880 6380 0 3 1 0 0
## 7 2 1350 2560 0 3 0 1 0
## price2
## 1 0
## 3 0
## 4 0
## 5 1
## 6 1
## 7 0
# Dataset for Regression Model
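# Same predictors as data1, with the numerical price as the target
data2 <- clean_data[, c("bedrooms", "sqft_living", "sqft_lot", "waterfront", "condition", "renovated", "new", "basement", "price")]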
head(data2)
## bedrooms sqft_living sqft_lot waterfront condition renovated new basement
## 1 3 1340 7912 0 3 1 0 0
## 3 3 1930 11947 0 4 0 0 0
## 4 3 2000 8030 0 4 0 0 1
## 5 4 1940 10500 0 4 1 1 1
## 6 2 880 6380 0 3 1 0 0
## 7 2 1350 2560 0 3 0 1 0
## price
## 1 313000
## 3 342000
## 4 420000
## 5 550000
## 6 490000
## 7 335000
Aim: To classify houses based on price category.
Target Variable: price2 - Categorical variable
Train-Test Data
library(caret)
# Make sure 'price2' is a factor
data1$price2 <- as.factor(data1$price2)
# Sample 80% of the rows for training, stratified by 'price2'
# (no seed is set before this call, so the split changes between runs)
validation_index <- createDataPartition(data1$price2, p=0.80, list=FALSE)
# Hold out the remaining 20% of the data for testing
testData <- data1[-validation_index,]
# Use the sampled 80% of the data for training
trainData <- data1[validation_index,]
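Because no seed is set before `createDataPartition()` in the code above, each run yields a different split. The sketch below shows the idea of a reproducible, class-stratified 80/20 split using base R only. The function `stratified_split` and the toy data frame are hypothetical; they stand in for caret's own sampling, which may differ in detail.

```r
# Draw a stratified sample of row indices: a fraction p from each class
stratified_split <- function(y, p = 0.8, seed = 10) {
  set.seed(seed)  # fix the random number generator so the split is repeatable
  idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = floor(p * length(rows)))]
  }))
  sort(unname(idx))
}

# Hypothetical data: 60 cheap (0) and 40 expensive (1) houses
toy <- data.frame(sqft_living = seq(800, by = 20, length.out = 100),
                  price2 = factor(rep(c(0, 1), times = c(60, 40))))

train_rows <- stratified_split(toy$price2)
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

table(toy_train$price2)  # 48 cheap and 32 expensive rows
```

Sampling within each class keeps the share of cheap and expensive houses the same in the training and test sets.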
Random Forest Model for House Classification
library(randomForest)
set.seed(10)
# Build Random Forest Model for Classification
RFModel <- randomForest(price2 ~ .,
data=trainData,
importance=TRUE,
ntree=2000)
varImpPlot(RFModel)
Using the Random Forest model, the most important variable for the prediction of house price is sqft_living.
Prediction and Model Evaluation
RFPrediction <- predict(RFModel, testData)
RFPredictionprob = predict(RFModel,testData,type="prob")[, 2]
RFConfMat <- confusionMatrix(RFPrediction, testData[,"price2"])
RFConfMat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 385 129
## 1 80 248
##
## Accuracy : 0.7518
## 95% CI : (0.7212, 0.7806)
## No Information Rate : 0.5523
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4918
##
## Mcnemar's Test P-Value : 0.0008994
##
## Sensitivity : 0.8280
## Specificity : 0.6578
## Pos Pred Value : 0.7490
## Neg Pred Value : 0.7561
## Prevalence : 0.5523
## Detection Rate : 0.4572
## Detection Prevalence : 0.6105
## Balanced Accuracy : 0.7429
##
## 'Positive' Class : 0
##
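The headline statistics in this output can be recomputed from the four cell counts. The sketch below rebuilds the matrix by hand (the object name `cm` is introduced here for illustration) and uses `mcnemar.test()` from base R's stats package; class 0 is treated as the positive class, as in the output above.

```r
# Counts from the confusion matrix above (rows: prediction, columns: reference)
cm <- matrix(c(385, 80, 129, 248), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # correct predictions / all cases
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # share of actual class 0 found
specificity <- cm["1", "1"] / sum(cm[, "1"])  # share of actual class 1 found
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)

# McNemar's test compares the two off-diagonal counts (129 vs 80)
mcnemar_p <- mcnemar.test(cm)$p.value
mcnemar_p
```

A small McNemar p-value indicates that the two kinds of misclassification do not occur equally often.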
Aim: To predict the price of a house expressed in dollars.
Target Variable: price (Numerical variable)
The same predictor variables used for classification are used in this section.
Train-Test Data
# Sample 80% of the rows for training
# (no seed is set before this call, so the split changes between runs)
validation_index2 <- createDataPartition(data2$price, p=0.80, list=FALSE)
# Hold out the remaining 20% of the data for testing
testData2 <- data2[-validation_index2,]
# Use the sampled 80% of the data for training
trainData2 <- data2[validation_index2,]
Multiple Regression Model for Price Prediction
#Build Multiple Regression model
MRModel <- lm(price ~ .,data=trainData2)
summary(MRModel)
##
## Call:
## lm(formula = price ~ ., data = trainData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -859875 -114665 -5734 102883 538725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.834e+05 6.916e+04 2.652 0.008035 **
## bedrooms -2.280e+04 3.872e+03 -5.887 4.32e-09 ***
## sqft_living 1.799e+02 4.740e+00 37.951 < 2e-16 ***
## sqft_lot -1.476e-01 7.492e-02 -1.970 0.048863 *
## waterfront1 1.541e+05 4.123e+04 3.739 0.000188 ***
## condition2 -1.071e+05 7.480e+04 -1.432 0.152348
## condition3 2.441e+04 6.885e+04 0.355 0.722904
## condition4 2.288e+04 6.875e+04 0.333 0.739287
## condition5 8.846e+04 6.907e+04 1.281 0.200389
## renovated1 -1.377e+03 6.291e+03 -0.219 0.826731
## new1 -3.877e+04 6.886e+03 -5.630 1.95e-08 ***
## basement1 4.842e+03 5.725e+03 0.846 0.397750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 153000 on 3359 degrees of freedom
## Multiple R-squared: 0.405, Adjusted R-squared: 0.403
## F-statistic: 207.8 on 11 and 3359 DF, p-value: < 2.2e-16
From the model summary, the p-value of the overall F-test is less than 0.05, indicating that the multiple regression model is statistically significant.
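To make the coefficients concrete, the fitted equation can be evaluated by hand for one house. The sketch below uses the rounded estimates printed in the summary and a hypothetical house (3 bedrooms, 2,000 sqft living area, 8,000 sqft lot, condition 3, no waterfront, not renovated, built in or before 1970, no basement). It illustrates the arithmetic only and is not a substitute for `predict()`.

```r
# Rounded coefficient estimates copied from summary(MRModel)
coefs <- c(intercept = 1.834e+05, bedrooms = -2.280e+04,
           sqft_living = 1.799e+02, sqft_lot = -1.476e-01,
           condition3 = 2.441e+04)

# Hypothetical house; all dummy variables other than condition3 are 0
house <- c(intercept = 1, bedrooms = 3, sqft_living = 2000,
           sqft_lot = 8000, condition3 = 1)

# Linear predictor: sum of coefficient * value
predicted_price <- sum(coefs * house[names(coefs)])
predicted_price  # about 498,029 USD
```

Read this way, each additional square foot of living area adds about 180 USD to the predicted price when the other variables are held fixed.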
Based on the diagnostics, the most significant predictor variable is sqft_living: it has the largest t-value (37.951) and the smallest p-value (< 2e-16) of all the coefficients.
Prediction and Model Evaluation
# Use Model to perform prediction on testData2
MRPrediction <- predict(MRModel,testData2)
#correlation between the actuals and predicted values
actuals_preds <- data.frame(cbind(actuals=testData2$price, predicteds=MRPrediction))
head(actuals_preds)
## actuals predicteds
## 9 452500 528083.1
## 10 640000 387798.1
## 14 365000 331671.8
## 16 242500 352353.0
## 18 367500 636295.7
## 33 650000 446787.3
# checking correlation accuracy
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy # 62.42%
## actuals predicteds
## actuals 1.0000000 0.6241577
## predicteds 0.6241577 1.0000000
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
paste("min_max_accuracy =",min_max_accuracy)
## [1] "min_max_accuracy = 0.764868107467253"
# Calculate MAE, MSE, RMSE, MAPE in series statistic under target variable price
DMwR::regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
## mae mse rmse mape
## 1.266602e+05 2.385925e+10 1.544644e+05 3.756781e-01
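The error metrics above can also be computed directly from their definitions. The sketch below defines a hypothetical helper, `regression_metrics`, using the usual formulas for MAE, MSE, RMSE and MAPE plus the min-max accuracy used above, and applies it to the six rows printed by `head(actuals_preds)`. It illustrates the formulas and is not the package's implementation.

```r
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(mae  = mean(abs(err)),           # mean absolute error
    mse  = mean(err^2),              # mean squared error
    rmse = sqrt(mean(err^2)),        # root mean squared error
    mape = mean(abs(err / actual)),  # mean absolute percentage error (fraction)
    min_max = mean(pmin(actual, predicted) / pmax(actual, predicted)))
}

# First six test rows shown above
actual    <- c(452500, 640000, 365000, 242500, 367500, 650000)
predicted <- c(528083.1, 387798.1, 331671.8, 352353.0, 636295.7, 446787.3)

round(regression_metrics(actual, predicted), 4)
```

Since only six rows are used, the values differ from the full test-set results reported above.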
Classification model
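From the confusion matrix,
385 cheap houses and 248 expensive houses were classified correctly.
129 expensive houses were misclassified as cheap.
80 cheap houses were misclassified as expensive.
Hence, the accuracy of the classification model based on the matrix is 0.7518 (75.18%).
The Mcnemar’s Test P-Value of the classification model is 0.0008994, which is below 0.05. This indicates that the two types of error are not balanced: the model labels expensive houses as cheap (129 cases) more often than it labels cheap houses as expensive (80 cases).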
Regression model
Based on the regression prediction model, it is shown that:
The most significant predictor variable is sqft_living, which has the largest t-value and the smallest p-value among the coefficients.
The correlation between the actual and predicted prices is 0.6242 (62.42%), which is considered moderate.
The min_max accuracy of the model is 0.765 (76.5%), which is considered moderate.
The adjusted R-squared of the model is 0.403, so the model would be problematic for high-precision prediction.
The Root Mean Square Error (RMSE) of the model is 154,464.4 USD.