Predicting House Sale Price … in

I currently own a business, CPQ Energy, and I decided to raise some capital. The first source of capital I had at hand was the equity in my house, so I decided to run an analysis, based on data provided by our realtor, to set an asking price that could get some traction.

I didn’t want to do this blindly, and given my engineering background I decided to apply some math and statistics using R.

Below is what I think is the price range I should be asking for the house. The model has some weaknesses, since the p-values do not give good support to most of the predictors, but I decided it is the best I could get with the data I have at hand.

The result is a range between $450,063 and $527,486, with a fitted price of $488,774, which is in the ballpark of where I need to be.

It was a good exercise, and I plan to keep improving it.

By the way, selling a house is not a big deal to me; I think of it as just a commodity that requires too much annual maintenance and taxes. No hard feelings if you think differently.

Let’s start with the analysis.

1 - Libraries Required

library("tibble")
library("DT")
library("scales")

2 - Input File Reading - Data Provided by a Realtor

Data Reading from a CSV file

setwd("C:/Users/Dario/Documents/R Programming/myHouseSalePrice")
datah <- as_tibble(read.csv(file = "data.csv", header = TRUE, as.is = TRUE, 
                            colClasses = c(rep("character", 4), 
                                           rep("numeric", 5), 
                                           "character", "integer"), 
                            strip.white = TRUE, stringsAsFactors = FALSE))

# Changing Column Names

names(datah)[5] <- "USD.LP"

# Eliminating commas to convert it to numeric

datah$LOT.SQ.FT <- gsub(",", "", datah$LOT.SQ.FT)
datah$HOUSE.SQ.FT <- gsub(",", "", datah$HOUSE.SQ.FT)
datah$LIST.PRICE <- gsub(",", "", datah$LIST.PRICE)
datah$SALES.PRICE <- gsub(",", "", datah$SALES.PRICE)

# Converting to integer (lapply keeps each column as a plain vector)

datah[c(2, 3, 4, 10)] <- lapply(datah[c(2, 3, 4, 10)], as.integer)
datah[7:9] <- lapply(datah[7:9], as.integer)

# Year built is dropped for this analysis

datah$YEAR.BLT <- NULL

# Convert all column names to lower case and keep only complete rows

names(datah) <- stringr::str_to_lower(names(datah))
datah <- datah[complete.cases(datah),]
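
The comma removal and integer conversion above can be tried on a few values in isolation. This is only a minimal sketch: `to_int` is a hypothetical helper name and the sample strings are made up, not taken from the realtor's file.

```r
# Hypothetical helper: strip thousands separators, then convert to integer.
to_int <- function(x) as.integer(gsub(",", "", x, fixed = TRUE))

# Made-up sample values for illustration only.
sample_values <- c("509,000", "9,600", "3,350")
converted <- to_int(sample_values)
converted
```

Wrapping the two steps in one function would avoid repeating the `gsub()` call for each column.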

3 - Preparing Data Sets — To be used in Later Analysis

# 65% of the sample size

trows <- floor(0.65 * nrow(datah))

# set the seed to make your partition reproducible

set.seed(123)
train_idx <- sample(seq_len(nrow(datah)), size = trows)

# set train & test datasets

train <- datah[train_idx,]
test <- datah[-train_idx,]
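
One way the train and test sets could be used later is to fit the model on `train` only and measure the error on `test`. The sketch below assumes invented data (the `demo` data frame and its values are not the realtor's data) and a single predictor, just to show the mechanics.

```r
set.seed(123)

# Invented stand-in data; column names mirror the post, values are made up.
n <- 40
demo <- data.frame(house.sq.ft = runif(n, 1500, 4000))
demo$sales.price <- 100000 + 110 * demo$house.sq.ft + rnorm(n, sd = 15000)

# Same 65% split as above.
trows <- floor(0.65 * nrow(demo))
train_idx <- sample(seq_len(nrow(demo)), size = trows)
train <- demo[train_idx, ]
test <- demo[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit <- lm(sales.price ~ house.sq.ft, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the test set, in dollars.
rmse <- sqrt(mean((test$sales.price - pred)^2))
rmse
```

With only a handful of rows, as in the realtor's data, a split like this leaves very few test points, so the error estimate would be rough.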

4 - Plotting some graphs for Exploratory Analysis

# Based on results of this plot we observe that only lot.sq.ft, house.sq.ft and
# list.price are the ones showing a linear relationship with sales.price.
# Therefore, we will use these three parameters for creating a linear regression
# model.

plot(datah[,c(2:10)])

5 - Creating Linear Regression Model

# Creating the multiple linear regression model

hlm <- lm(formula = sales.price ~ lot.sq.ft + house.sq.ft + list.price, 
          data = datah)

# Displays Statistics of the Model

summary(hlm)

## 
## Call:
## lm(formula = sales.price ~ lot.sq.ft + house.sq.ft + list.price, 
##     data = datah)
## 
## Residuals:
##     1     2     3     4     5     6     7 
##  4738 -5200  7701 -5239 -7250 -1672  6922 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -5.045e+04  4.931e+04  -1.023  0.38158   
## lot.sq.ft   -1.275e+00  2.332e+00  -0.547  0.62252   
## house.sq.ft  2.524e+01  2.072e+01   1.219  0.31009   
## list.price   9.173e-01  1.255e-01   7.309  0.00529 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8936 on 3 degrees of freedom
## Multiple R-squared:  0.9774, Adjusted R-squared:  0.9548 
## F-statistic: 43.23 on 3 and 3 DF,  p-value: 0.005731

# In the summary report, list.price is the only predictor that is statistically
# significant (p = 0.005, see the ** symbol); lot.sq.ft and house.sq.ft are not
# (p = 0.62 and p = 0.31).

# The coefficient of determination (R-squared) of the multiple linear regression
# model for the data set 'datah' is 97.7% and the adjusted R-squared is 95.5%.
# That looks very good, but with only 7 observations and 3 residual degrees of
# freedom it should be read with caution.

# Later on, we will add more data to the model and re-run it to see how the
# prediction performs.
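
The p-values and R-squared figures read off the printed summary can also be pulled out programmatically, which helps when the model is re-run with more data. A minimal sketch on invented data (the `demo` data frame is an assumption for illustration; the accessors are standard `stats` functions):

```r
set.seed(42)

# Invented data for illustration only; not the realtor's file.
demo <- data.frame(
  house.sq.ft = runif(30, 1500, 4000),
  lot.sq.ft   = runif(30, 5000, 12000)
)
demo$sales.price <- 150000 + 100 * demo$house.sq.ft + rnorm(30, sd = 20000)

fit <- lm(sales.price ~ lot.sq.ft + house.sq.ft, data = demo)
fit.summary <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef.table <- coef(fit.summary)
p.values <- coef.table[, "Pr(>|t|)"]

# Names of the terms whose p-value is below the 5% level.
significant <- names(p.values)[p.values < 0.05]

r.squared <- fit.summary$r.squared
adj.r.squared <- fit.summary$adj.r.squared

# Residual degrees of freedom: observations minus estimated coefficients.
resid.df <- df.residual(fit)
```

For the model in this post, the same calls on `hlm` would return the values shown in the printed summary, including the 3 residual degrees of freedom.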

6 - Creating & Using a Predictive Model of Best Selling Price

# Prediction of the best price, including a confidence interval

# We use our house's dimensions and the list price we are asking for as of the
# writing of this report.

house.price.range <- predict(hlm, data.frame(lot.sq.ft=9600, house.sq.ft=3350, 
                                             list.price=509000), 
                             interval = "confidence")
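
A note on the interval type: `interval = "confidence"` bounds the average sale price of houses with these characteristics, while `predict()` also accepts `interval = "prediction"`, which bounds a single new sale and is wider. The sketch below compares the two on invented data (the `demo` data frame and its values are assumptions, not the realtor's data).

```r
set.seed(1)

# Invented data for illustration only.
demo <- data.frame(list.price = seq(300000, 600000, length.out = 20))
demo$sales.price <- 0.95 * demo$list.price + rnorm(20, sd = 8000)

fit <- lm(sales.price ~ list.price, data = demo)
new.house <- data.frame(list.price = 509000)

# Interval for the mean response vs. interval for one new observation.
conf.int <- predict(fit, new.house, interval = "confidence")
pred.int <- predict(fit, new.house, interval = "prediction")

# Both return a one-row matrix with columns fit, lwr and upr.
conf.width <- unname(conf.int[, "upr"] - conf.int[, "lwr"])
pred.width <- unname(pred.int[, "upr"] - pred.int[, "lwr"])
c(confidence = conf.width, prediction = pred.width)
```

Both calls give the same fitted value; only the bounds differ.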

# Format the fitted price and the lower and upper bounds as dollar amounts,
# then use head() to show them.

fit.price <- dollar(as.numeric(house.price.range[1]))
lower.fit <- dollar(as.numeric(house.price.range[2]))
upper.fit <- dollar(as.numeric(house.price.range[3]))

house.price.range <- cbind.data.frame(lower.fit, fit.price, upper.fit)

head(house.price.range)

##   lower.fit fit.price upper.fit
## 1  $450,063  $488,774  $527,486
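
As a sanity check, the fitted price can be reproduced by hand from the coefficients printed in the model summary. The sketch below uses the coefficients as rounded in that printout, so a small difference from the reported $488,774 is expected; the variable names are just illustrative.

```r
# Coefficients as printed in the summary (rounded to four significant digits).
b.intercept <- -5.045e+04
b.lot       <- -1.275e+00
b.house     <-  2.524e+01
b.list      <-  9.173e-01

# Inputs used in the predict() call above.
lot.sq.ft   <- 9600
house.sq.ft <- 3350
list.price  <- 509000

# Linear predictor: intercept plus each coefficient times its input.
manual.fit <- b.intercept + b.lot * lot.sq.ft +
  b.house * house.sq.ft + b.list * list.price
manual.fit
```

The result lands within a few dollars of the fitted value above, which confirms the prediction is just this linear combination.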