Problem

The code in this notebook explores which features of homes in Grinnell, Iowa could potentially be used to predict profit or loss. Profit is measured as the price the house was purchased for minus the price the house was sold for, computed in the code as OrigPrice minus SalePrice.

A predictive model could then be used by individuals who buy and sell homes as a business.
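
To make the profit definition concrete, here is a minimal sketch on three made-up houses. The `toy` data frame and its values are hypothetical and are not taken from GrinnellHouses.csv; the only thing carried over from the notebook is the sign convention, OrigPrice minus SalePrice.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  OrigPrice = c(120000, 95000, 150000),
  SalePrice = c(110000, 99000, 150000)
)

# Same sign convention as the notebook: OrigPrice minus SalePrice.
toy$profit <- toy$OrigPrice - toy$SalePrice

print(toy)
```

Under this convention the value is positive when OrigPrice exceeds SalePrice, negative when SalePrice is higher, and zero when the two are equal.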

# load in data
library(RCurl)


data <- getURL('https://raw.githubusercontent.com/KevinJpotter/edx_capstone/master/GrinnellHouses.csv')

df <- read.csv(text = data, header = TRUE, row.names = 'X')

# keep only the columns used in the analysis
df <- subset(df, select = c(Bedrooms, Baths, SquareFeet, LotSize, YearBuilt, YearSold, MonthSold, DaySold, OrigPrice, SalePrice))

# create column named profit
df$profit <- (df$OrigPrice - df$SalePrice)

# Look at the Data Distributions
summary(df)
##     Bedrooms         Baths         SquareFeet      LotSize        
##  Min.   :0.000   Min.   :0.000   Min.   : 640   Min.   : 0.02893  
##  1st Qu.:3.000   1st Qu.:1.000   1st Qu.:1150   1st Qu.: 0.23388  
##  Median :3.000   Median :1.750   Median :1440   Median : 0.28409  
##  Mean   :3.195   Mean   :1.779   Mean   :1583   Mean   : 0.72346  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:1833   3rd Qu.: 0.37018  
##  Max.   :8.000   Max.   :6.000   Max.   :6815   Max.   :55.00000  
##                                  NA's   :18     NA's   :188       
##    YearBuilt       YearSold      MonthSold         DaySold     
##  Min.   :1870   Min.   :2005   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:1900   1st Qu.:2007   1st Qu.: 5.000   1st Qu.: 8.00  
##  Median :1956   Median :2009   Median : 7.000   Median :16.00  
##  Mean   :1946   Mean   :2009   Mean   : 6.831   Mean   :16.23  
##  3rd Qu.:1973   3rd Qu.:2012   3rd Qu.: 9.000   3rd Qu.:25.00  
##  Max.   :2013   Max.   :2015   Max.   :12.000   Max.   :31.00  
##                                                                
##    OrigPrice        SalePrice          profit      
##  Min.   :  5990   Min.   :  7000   Min.   :-94000  
##  1st Qu.: 89900   1st Qu.: 83000   1st Qu.:  4000  
##  Median :129900   Median :119340   Median :  9000  
##  Mean   :146047   Mean   :133204   Mean   : 12843  
##  3rd Qu.:179000   3rd Qu.:162500   3rd Qu.: 16900  
##  Max.   :695000   Max.   :606000   Max.   :173000  
## 
# Drop LotSize (188 missing values), then drop any remaining rows with null values
library(dplyr)
df <- select(df, -LotSize)
df <- df[complete.cases(df), ]

# Histogram of engineered feature
hist(df$profit, breaks = 50, xlab="Profit / Loss", main="Net Profit / Loss Distribution", col="lightblue", xlim = c(-20000, 40000))

# Look at the correlation 
library(corrplot)
corrplot(cor(df))

library(car)
# Scatterplots (with marginal box plots) of profit against the two most correlated features
scatterplot(y = df$profit, x = df$SquareFeet,
            main = 'Profit & Square Feet',
            ylab = 'Profit', xlab = 'Square Feet',
            regLine = list(method = lm, lty = 1, lwd = 2, col = 'red'),
            grid = FALSE)

scatterplot(y = df$profit, x = df$OrigPrice,
            main = 'Profit & Original Price',
            ylab = 'Profit', xlab = 'Original Price',
            regLine = list(method = lm, lty = 1, lwd = 2, col = 'red'),
            grid = FALSE)

Conclusion

Lot size (LotSize) was dropped from the feature set because of its high number of null values (188 in the summary above). Additionally, the 18 observations with a null SquareFeet value were dropped from the data set so that the remaining rows could be evaluated.
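
One way to check this kind of cleaning step is to count the null values per column before dropping anything, then count how many rows the complete-case filter removes. The sketch below does this on a small made-up data frame; `toy`, `na_counts`, `toy_clean` and `rows_dropped` are hypothetical names, and the values are not from the Grinnell data.

```r
# Hypothetical toy data with missing values; not the Grinnell data.
toy <- data.frame(
  SquareFeet = c(1200, NA, 1800, 950),
  LotSize    = c(NA, NA, 0.30, NA),
  SalePrice  = c(110000, 99000, 150000, 87000)
)

# Count missing values per column to decide which columns to drop.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop the sparse column, then keep only rows with no missing values.
toy_clean <- toy[, names(toy) != "LotSize"]
toy_clean <- toy_clean[complete.cases(toy_clean), ]

# Number of rows removed by the complete-case filter.
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

Dropping the sparse column first matters: applying `complete.cases` while it is still present would discard every row that is missing a value in that column.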

In conclusion, it appears that the original price of the home and its square footage have potential to be used for predicting profit / loss. Further testing and analysis is necessary to test this theory. The two variables are highly correlated with profit and appear to have a linear relationship with it.
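
As a possible next step, the correlations could be read off numerically and a linear model fitted with both candidate predictors. The sketch below is not part of the original analysis: it runs on simulated data (`sim`) whose coefficients and noise level are invented assumptions, and only the column names OrigPrice, SquareFeet and profit follow the notebook.

```r
# Simulated stand-in for the cleaned data frame; all values are invented.
set.seed(1)
n <- 200
sim <- data.frame(
  SquareFeet = round(runif(n, 640, 3000)),
  OrigPrice  = round(runif(n, 50000, 400000))
)
# Assumed linear relationship plus noise, purely for illustration.
sim$profit <- 0.05 * sim$OrigPrice + 2 * sim$SquareFeet + rnorm(n, sd = 5000)

# Numeric correlation of each candidate predictor with profit.
cors <- cor(sim[, c("OrigPrice", "SquareFeet")], sim$profit)
print(cors)

# Linear model using both candidate predictors.
fit <- lm(profit ~ OrigPrice + SquareFeet, data = sim)
print(summary(fit)$coefficients)
```

On the real data, the same two calls would take the cleaned `df` in place of `sim`; the coefficient table and residual plots would then indicate whether the linear relationship seen in the scatterplots holds up.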