This notebook explores which features of homes in Grinnell, Iowa could potentially be used to predict profit or loss. Profit is measured here as the original price of the house (OrigPrice) minus the price the house was sold for (SalePrice).
A predictive model could then be used by individuals who buy and sell homes as a business.
# load in data
library(RCurl)
data <- getURL('https://raw.githubusercontent.com/KevinJpotter/edx_capstone/master/GrinnellHouses.csv')
df <- read.csv(text = data, header = TRUE, row.names = 'X')
# keep only the columns used in the analysis
df <- subset(df, select = c(Bedrooms, Baths, SquareFeet, LotSize, YearBuilt, YearSold, MonthSold, DaySold, OrigPrice, SalePrice))
# create a column named profit
df$profit <- df$OrigPrice - df$SalePrice
# Look at the Data Distributions
summary(df)
## Bedrooms Baths SquareFeet LotSize
## Min. :0.000 Min. :0.000 Min. : 640 Min. : 0.02893
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1150 1st Qu.: 0.23388
## Median :3.000 Median :1.750 Median :1440 Median : 0.28409
## Mean :3.195 Mean :1.779 Mean :1583 Mean : 0.72346
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:1833 3rd Qu.: 0.37018
## Max. :8.000 Max. :6.000 Max. :6815 Max. :55.00000
## NA's :18 NA's :188
## YearBuilt YearSold MonthSold DaySold
## Min. :1870 Min. :2005 Min. : 1.000 Min. : 1.00
## 1st Qu.:1900 1st Qu.:2007 1st Qu.: 5.000 1st Qu.: 8.00
## Median :1956 Median :2009 Median : 7.000 Median :16.00
## Mean :1946 Mean :2009 Mean : 6.831 Mean :16.23
## 3rd Qu.:1973 3rd Qu.:2012 3rd Qu.: 9.000 3rd Qu.:25.00
## Max. :2013 Max. :2015 Max. :12.000 Max. :31.00
##
## OrigPrice SalePrice profit
## Min. : 5990 Min. : 7000 Min. :-94000
## 1st Qu.: 89900 1st Qu.: 83000 1st Qu.: 4000
## Median :129900 Median :119340 Median : 9000
## Mean :146047 Mean :133204 Mean : 12843
## 3rd Qu.:179000 3rd Qu.:162500 3rd Qu.: 16900
## Max. :695000 Max. :606000 Max. :173000
##
# Drop data with null values: remove the sparse LotSize column, then any remaining incomplete rows
library(dplyr)
df <- select(df, -LotSize)
df <- df[complete.cases(df), ]
# Histogram of the engineered feature
hist(df$profit, breaks = 50, xlab="Profit / Loss", main="Net Profit / Loss Distribution", col="lightblue", xlim = c(-20000, 40000))
# Look at the correlation
library(corrplot)
corrplot(cor(df))
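The correlation plot gives a visual overview. To see which features are most strongly related to profit as numbers, the profit column of the correlation matrix can be sorted. The snippet below is a minimal sketch on a made-up data frame (toy and its values are hypothetical and are not taken from the Grinnell data); the same lines could be applied to df.

```r
# Hypothetical toy data, for illustration only (not the Grinnell data).
toy <- data.frame(
  SquareFeet = c(900, 1200, 1500, 1800, 2400),
  OrigPrice  = c(80000, 110000, 150000, 170000, 260000),
  YearSold   = c(2006, 2012, 2008, 2010, 2007),
  profit     = c(3000, 6000, 9000, 10000, 21000)
)

# Pearson correlation of every column with profit.
cors <- cor(toy)[, "profit"]

# Drop the trivial self-correlation of profit with itself.
cors <- cors[names(cors) != "profit"]

# Rank features by the absolute strength of their correlation.
ranked <- sort(abs(cors), decreasing = TRUE)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top, which a plain sort would push to the bottom.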
library(car)
# Scatterplots (with marginal box plots) of the two features most correlated with profit
scatterplot(y = df$profit, x = df$SquareFeet,
main = 'Profit & Square Feet' ,
ylab = 'Profit', xlab = 'Square Feet' ,
regLine=list(method=lm, lty=1, lwd=2, col='red'),
grid = FALSE)
scatterplot(y = df$profit, x = df$OrigPrice,
main = 'Profit & Original Price' ,
ylab = 'Profit', xlab = 'Original Price' ,
regLine=list(method=lm, lty=1, lwd=2, col='red'),
grid = FALSE)
Lot size (LotSize) was dropped from the feature set because of its high number of null values. Additionally, 19 observations with null values were dropped from the data set to allow for evaluation.
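One way to check decisions like these is to count missing values per column before dropping anything, then count how many rows the complete-case filter removes. The following is a minimal sketch on a made-up data frame (toy and its values are hypothetical); it mirrors the column drop and complete.cases() steps used in this notebook.

```r
# Hypothetical toy data with missing values, for illustration only.
toy <- data.frame(
  SquareFeet = c(1200, NA, 1800, 1500),
  LotSize    = c(NA, NA, 0.25, NA),
  OrigPrice  = c(100000, 120000, 150000, 90000)
)

# Count missing values per column to see which columns are sparse.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop the sparse column, then keep only rows with no missing values.
toy_clean <- toy[, names(toy) != "LotSize", drop = FALSE]
toy_clean <- toy_clean[complete.cases(toy_clean), ]

# Number of rows removed by the complete-case filter.
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

Dropping the sparse column first matters: filtering complete cases while LotSize is still present would discard every row with a missing lot size.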
In conclusion, the original price and the square footage of the home appear to have potential for predicting profit / loss. Further testing and analysis are necessary to test this theory. The two variables are highly correlated with profit and appear to have a linear relationship with it.
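As a possible starting point for that further testing, a linear model could be fit on part of the data and evaluated on the rest. The sketch below is not a result from this notebook: it uses simulated data (sim, its coefficients, and the 80/20 split are all assumptions made for illustration) so that it runs on its own. With the real data, df would take the place of sim.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated stand-in for the housing data; all values are made up.
n <- 200
sim <- data.frame(
  SquareFeet = round(runif(n, 640, 4000)),
  OrigPrice  = round(runif(n, 50000, 400000))
)
# Assumed linear relationship plus noise, for illustration only.
sim$profit <- 0.08 * sim$OrigPrice + 2 * sim$SquareFeet + rnorm(n, sd = 5000)

# Hold out 20% of the rows for evaluation.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit a linear model with the two candidate predictors.
fit <- lm(profit ~ OrigPrice + SquareFeet, data = train)
print(summary(fit)$coefficients)

# Root mean squared error on the held-out rows.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$profit - pred)^2))
print(rmse)
```

Because profit is computed from OrigPrice, some correlation between the two is expected by construction, so held-out error is a more informative check than correlation alone.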