I am currently the owner of a business, CPQ Energy, so I decided to raise some capital. The first capital I had at hand is the equity in my house. So I decided to run an analysis, based on data provided by our realtor, to set a price that could get some traction on the listing.
I didn’t want to do this blind, and given my engineering background I decided to do some math and statistics using R.
Below is what I think is the price range I should keep asking for the house. The model has some weaknesses, since the p-values give little support to most of the predictors, but I decided it is the best I could get from the data I have at hand.
The result is a range between $450,063 and $527,486, with a fitted (predicted) price of $488,774, which is in the ballpark of where I need to be.
It is a good exercise, and I plan to keep improving it.
By the way, selling a house is not a big deal to me; I think a house is just a commodity that requires too much annual maintenance and taxes. No hard feelings if you think differently.
Let’s start with the analysis.
library("tibble")
library("DT")
library("scales")
Reading the Data from a CSV File
setwd("C:/Users/Dario/Documents/R Programming/myHouseSalePrice")
# as_tibble() replaces the deprecated as_data_frame()
datah <- as_tibble(read.csv(file = "data.csv", header = TRUE, as.is = TRUE,
                            colClasses = c(rep("character", 4),
                                           rep("numeric", 5),
                                           "character", "integer"),
                            strip.white = TRUE, stringsAsFactors = FALSE))
# Changing Column Names
names(datah)[5] <- "USD.LP"
# Eliminating commas to convert it to numeric
datah$LOT.SQ.FT <- gsub(",", "", datah$LOT.SQ.FT)
datah$HOUSE.SQ.FT <- gsub(",", "", datah$HOUSE.SQ.FT)
datah$LIST.PRICE <- gsub(",", "", datah$LIST.PRICE)
datah$SALES.PRICE <- gsub(",", "", datah$SALES.PRICE)
# Converting to integer (lapply keeps each column as its own vector)
datah[, c(2, 3, 4, 10)] <- lapply(datah[, c(2, 3, 4, 10)], as.integer)
datah[, 7:9] <- lapply(datah[, 7:9], as.integer)
# The year built is dropped for this analysis
datah$YEAR.BLT <- NULL
# Convert all column names to lower case
names(datah) <- stringr::str_to_lower(names(datah))
# Keep only the rows with no missing values
datah <- datah[complete.cases(datah),]
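The comma stripping and the integer conversion above can also be done in one pass. The following is only a sketch: the helper name `to_int` and the toy values are my own assumptions for illustration, not part of the realtor's data.

```r
# Hypothetical helper (not in the original script): remove thousands
# separators and convert to integer in a single step.
to_int <- function(x) as.integer(gsub(",", "", x, fixed = TRUE))

# Toy data standing in for the CSV file; the values are made up.
toy <- data.frame(LOT.SQ.FT  = c("9,600", "10,250"),
                  LIST.PRICE = c("509,000", "475,500"),
                  stringsAsFactors = FALSE)

# Apply the helper to every column, keeping the data frame structure
toy[] <- lapply(toy, to_int)
str(toy)
```

Assigning into `toy[]` keeps the column names and the data frame class, so the result can be used directly in `lm()`.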
# 65% of the sample size
trows <- floor(0.65 * nrow(datah))
# set the seed to make your partition reproducible
set.seed(123)
train_idx <- sample(seq_len(nrow(datah)), size = trows)
# set train & test datasets
train <- datah[train_idx,]
test <- datah[-train_idx,]
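With only seven complete rows, a 65% split leaves four rows for training, which is too few for a model with four coefficients, so the split cannot really be put to work yet. Once more sales are added, it could be used roughly as sketched below. The data here is synthetic and the column values are invented; only the splitting logic mirrors the code above.

```r
set.seed(123)

# Synthetic stand-in for datah; all numbers are invented for illustration.
n <- 40
toy <- data.frame(house.sq.ft = round(runif(n, 2000, 4000)),
                  list.price  = round(runif(n, 350000, 600000)))
toy$sales.price <- 0.95 * toy$list.price + 5 * toy$house.sq.ft +
  rnorm(n, sd = 8000)

# Same 65% / 35% partition as above
trows     <- floor(0.65 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = trows)
train     <- toy[train_idx, ]
test      <- toy[-train_idx, ]

# Fit on the training rows only, then score the held-out rows
fit  <- lm(sales.price ~ house.sq.ft + list.price, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the test set, in dollars
rmse <- sqrt(mean((test$sales.price - pred)^2))
rmse
```

A test-set RMSE gives a more honest idea of the prediction error than the residuals of the rows the model was fitted on.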
# Based on the scatterplot matrix produced by the plot() call below, only
# lot.sq.ft, house.sq.ft and list.price show a linear relationship with
# sales.price. Therefore, we will use these three variables as predictors in
# a linear regression model.
plot(datah[,c(2:10)])
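# Fitting a multiple linear regression model. Note that it is fitted on the
# full data set (datah); the train/test split above is not used here.
hlm <- lm(formula = sales.price ~ lot.sq.ft + house.sq.ft + list.price,
          data = datah)
# Display the summary statistics of the model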
summary(hlm)
##
## Call:
## lm(formula = sales.price ~ lot.sq.ft + house.sq.ft + list.price,
## data = datah)
##
## Residuals:
##     1     2     3     4     5     6     7
##  4738 -5200  7701 -5239 -7250 -1672  6922
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.045e+04  4.931e+04  -1.023  0.38158
## lot.sq.ft   -1.275e+00  2.332e+00  -0.547  0.62252
## house.sq.ft  2.524e+01  2.072e+01   1.219  0.31009
## list.price   9.173e-01  1.255e-01   7.309  0.00529 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8936 on 3 degrees of freedom
## Multiple R-squared: 0.9774, Adjusted R-squared: 0.9548
## F-statistic: 43.23 on 3 and 3 DF, p-value: 0.005731
# In the summary report, list.price is the only predictor with a statistically
# significant coefficient (p = 0.00529, see the ** symbol). lot.sq.ft and
# house.sq.ft are not significant (p = 0.62 and p = 0.31) in the multiple
# linear regression model of datah.
# The coefficient of determination (R-squared) of the model for the data set
# 'datah' is about 98% and the adjusted R-squared about 95%, which looks very
# good, although with only 7 observations and 3 residual degrees of freedom
# these figures should be read with caution.
# Later on we will add more data to the model and re-run it to see how
# the prediction performs.
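As a sanity check on the summary above, the adjusted R-squared and the t and p values for list.price can be reproduced by hand from the printed numbers. This is only a sketch using the rounded figures from the output, so small differences in the last digits are expected.

```r
# Numbers copied from the summary output above (rounded as printed)
r2 <- 0.9774   # Multiple R-squared
n  <- 7        # observations (seven residuals are listed)
p  <- 3        # predictors: lot.sq.ft, house.sq.ft, list.price

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)

# t value = estimate / standard error, here for list.price
t_list <- 9.173e-01 / 1.255e-01

# Two-sided p-value on n - p - 1 = 3 residual degrees of freedom
p_list <- 2 * pt(-abs(t_list), df = n - p - 1)
c(t = t_list, p = p_list)
```

The same values for a fitted model are available without retyping them through `coef(summary(hlm))` and `summary(hlm)$adj.r.squared`.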
# Prediction of the best price, including a confidence interval.
# We use our house's dimensions and the list price we are asking
# as of the writing of this report.
house.price.range <- predict(hlm, data.frame(lot.sq.ft=9600, house.sq.ft=3350,
                                             list.price=509000),
                             interval = "confidence")
# head() shows the fitted price and the lower and upper bounds of the best
# selling price, formatted as dollars.
fit.price <- dollar(as.numeric(house.price.range[1]))
lower.fit <- dollar(as.numeric(house.price.range[2]))
upper.fit <- dollar(as.numeric(house.price.range[3]))
house.price.range <- cbind.data.frame(lower.fit, fit.price, upper.fit)
head(house.price.range)
##   lower.fit fit.price upper.fit
## 1  $450,063  $488,774  $527,486
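One caveat on this range: `interval = "confidence"` describes the uncertainty of the average sale price for houses with these characteristics. For a single sale, `predict()` also accepts `interval = "prediction"`, which gives a wider range. The sketch below compares the two on synthetic data; the numbers are invented and only meant to show the difference between the two options.

```r
set.seed(123)

# Synthetic stand-in for datah; all values are invented for illustration.
n <- 30
toy <- data.frame(list.price = round(runif(n, 350000, 600000)))
toy$sales.price <- 0.95 * toy$list.price + rnorm(n, sd = 9000)

fit       <- lm(sales.price ~ list.price, data = toy)
new_house <- data.frame(list.price = 509000)

# Interval for the mean sale price at this list price
conf_int <- predict(fit, new_house, interval = "confidence")
# Interval for one individual sale at this list price (always wider)
pred_int <- predict(fit, new_house, interval = "prediction")

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

Both intervals share the same fitted value; only the lower and upper bounds differ.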