Part 1 Your job is to clean the data and make it ready for developing predictive analytics. Look for missing values, outliers and recode the categorical data to make it ready for analysis. Report the descriptive statistics and use charts as it seems appropriate.

Data and package load.

library(Amelia)
## Warning: package 'Amelia' was built under R version 3.6.3
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2021 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(mlr)
## Warning: package 'mlr' was built under R version 3.6.3
## Loading required package: ParamHelpers
## Warning: package 'ParamHelpers' was built under R version 3.6.3
## 'mlr' is in maintenance mode since July 2019. Future development
## efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).
library(caTools)
## Warning: package 'caTools' was built under R version 3.6.3
file = read.csv("C://Users//still//OneDrive//Documents//School//6136 Data Mining//Hollywood Homes.csv")

Lets get a picture of our data and what’s potentially wrong with it.

str(file)
## 'data.frame':    100 obs. of  8 variables:
##  $ Selling.Price...000.: int  741 791 665 765 472 531 650 582 815 402 ...
##  $ Bedrooms            : int  4 4 4 4 3 3 3 4 4 3 ...
##  $ Bathrooms           : num  4 3 2.5 NA 3 3 2 2.5 4 2.5 ...
##  $ Days.on.Market      : int  7 102 39 NA 118 35 145 207 42 48 ...
##  $ Age                 : int  14 10 19 NA 32 25 24 22 15 22 ...
##  $ Square.Feet         : int  3424 2980 2790 NA 1968 1933 1699 2018 3217 1610 ...
##  $ Location            : Factor w/ 3 levels "","Harbor Islands",..: 3 2 3 1 2 2 3 2 3 2 ...
##  $ Foreclosed          : Factor w/ 3 levels "","No","Yes": 2 2 2 1 3 2 2 2 2 2 ...
summary(file)
##  Selling.Price...000.    Bedrooms      Bathrooms     Days.on.Market  
##  Min.   : 189.0       Min.   :1.00   Min.   :1.000   Min.   :   2.0  
##  1st Qu.: 465.8       1st Qu.:3.00   1st Qu.:2.000   1st Qu.:  38.0  
##  Median : 579.5       Median :3.00   Median :3.000   Median : 104.0  
##  Mean   : 641.9       Mean   :3.38   Mean   :2.783   Mean   : 127.8  
##  3rd Qu.: 792.2       3rd Qu.:4.00   3rd Qu.:3.500   3rd Qu.: 157.0  
##  Max.   :1525.0       Max.   :5.00   Max.   :4.000   Max.   :1188.0  
##  NA's   :4                           NA's   :1       NA's   :1       
##       Age         Square.Feet             Location  Foreclosed
##  Min.   : 2.00   Min.   : 520                 : 1      : 1    
##  1st Qu.:16.00   1st Qu.:1765   Harbor Islands:54   No :70    
##  Median :24.00   Median :2248   West Lake     :45   Yes:29    
##  Mean   :22.43   Mean   :2329                                 
##  3rd Qu.:29.00   3rd Qu.:2784                                 
##  Max.   :36.00   Max.   :4979                                 
##  NA's   :1       NA's   :1
boxplot(file)

missmap(file, main = "Missing values vs observed")

Outliers and missing values noted

After deliberation the best solution is to replace outliers with the average of post treated data for SellingPrice, DaysOnMarket, and SquareFeet. One thing I have decided to do before anything however is removing row 4 as it’s missing pretty much every row of data and 1 particular row isn’t worth the effort of creating and running a likelihood assessment and filling it with most similair to based on selling price value. Although it’s the lazier approach nothing says we need to keep all the original data intact and the cleaning is up to the users desire, it would however be cheap of me to remove any outliers or null that inconveniences us so I will refrain from that the removal of data being as we started with only 100 rows and are now down to 99.

file1 = file[-c(4), ]
str(file1)
## 'data.frame':    99 obs. of  8 variables:
##  $ Selling.Price...000.: int  741 791 665 472 531 650 582 815 402 NA ...
##  $ Bedrooms            : int  4 4 4 3 3 3 4 4 3 3 ...
##  $ Bathrooms           : num  4 3 2.5 3 3 2 2.5 4 2.5 2 ...
##  $ Days.on.Market      : int  7 102 39 118 35 145 207 42 48 150 ...
##  $ Age                 : int  14 10 19 32 25 24 22 15 22 30 ...
##  $ Square.Feet         : int  3424 2980 2790 1968 1933 1699 2018 3217 1610 1652 ...
##  $ Location            : Factor w/ 3 levels "","Harbor Islands",..: 3 2 3 2 2 3 2 3 2 2 ...
##  $ Foreclosed          : Factor w/ 3 levels "","No","Yes": 2 2 2 3 2 2 2 2 2 3 ...
summary(file1)
##  Selling.Price...000.    Bedrooms       Bathrooms     Days.on.Market  
##  Min.   : 189.0       Min.   :1.000   Min.   :1.000   Min.   :   2.0  
##  1st Qu.: 465.5       1st Qu.:3.000   1st Qu.:2.000   1st Qu.:  38.0  
##  Median : 579.0       Median :3.000   Median :3.000   Median : 104.0  
##  Mean   : 640.6       Mean   :3.374   Mean   :2.783   Mean   : 127.8  
##  3rd Qu.: 793.5       3rd Qu.:4.000   3rd Qu.:3.500   3rd Qu.: 157.0  
##  Max.   :1525.0       Max.   :5.000   Max.   :4.000   Max.   :1188.0  
##  NA's   :4                                                            
##       Age         Square.Feet             Location  Foreclosed
##  Min.   : 2.00   Min.   : 520                 : 0      : 0    
##  1st Qu.:16.00   1st Qu.:1765   Harbor Islands:54   No :70    
##  Median :24.00   Median :2248   West Lake     :45   Yes:29    
##  Mean   :22.43   Mean   :2329                                 
##  3rd Qu.:29.00   3rd Qu.:2784                                 
##  Max.   :36.00   Max.   :4979                                 
## 
boxplot(file1)

missmap(file1, main = "Missing values vs observed")

Now that row 4 can no longer inconveniences us lets work to find outleirs and create fencing limits

IQR_SellingPrice = 793.5-465.5
UpFen_SellingPrice = 793.5+1.5*IQR_SellingPrice
IQR_DaysOnMarket = 157-38
UpFen_DaysOnMarket = 157+1.5*IQR_DaysOnMarket
IQR_SquareFeet = 2784-1765
UpFen_SquareFeet = 2784+1.5*IQR_SquareFeet
UpFen_SellingPrice
## [1] 1285.5
UpFen_DaysOnMarket
## [1] 335.5
UpFen_SquareFeet
## [1] 4312.5

We should note that box-and-whisker plots only suggest that current outliers exceed the upper bounds thus only an upper fence will be used.

My next step is to create a “Fully cleaned” df to which an unskewed mean could be extracted and later plugged in as replacements for outliers and missing values.

cleanout = subset(file1, Selling.Price...000. <= 1285.5 & Days.on.Market <= 335.5 & Square.Feet <= 4312.5)
boxplot(cleanout)

missmap(cleanout, main = "Missing values vs observed")

summary(cleanout)
##  Selling.Price...000.    Bedrooms      Bathrooms    Days.on.Market 
##  Min.   : 189.0       Min.   :1.00   Min.   :1.00   Min.   :  2.0  
##  1st Qu.: 452.5       1st Qu.:3.00   1st Qu.:2.00   1st Qu.: 39.5  
##  Median : 565.0       Median :3.00   Median :3.00   Median :104.0  
##  Mean   : 605.2       Mean   :3.31   Mean   :2.77   Mean   :104.0  
##  3rd Qu.: 745.5       3rd Qu.:4.00   3rd Qu.:3.50   3rd Qu.:153.0  
##  Max.   :1149.0       Max.   :5.00   Max.   :4.00   Max.   :266.0  
##       Age         Square.Feet             Location  Foreclosed
##  Min.   : 2.00   Min.   : 520                 : 0      : 0    
##  1st Qu.:18.00   1st Qu.:1745   Harbor Islands:49   No :61    
##  Median :24.00   Median :2198   West Lake     :38   Yes:26    
##  Mean   :23.06   Mean   :2216                                 
##  3rd Qu.:29.00   3rd Qu.:2690                                 
##  Max.   :36.00   Max.   :3550

Lets first deal with missing values as those are the quickest to fill in using our new reference to the cleanout df.

file2 <- file1
file2$Selling.Price...000.[which(is.na(file1$Selling.Price...000.))] <- mean(cleanout$Selling.Price...000.)
missmap(file2, main = "Missing values vs observed")

Next we identify the positions of outleirs and run replace functions though the cleanout df.

#Identify outlier positions
print(which(file2$Selling.Price...000. > UpFen_SellingPrice))
## [1] 11 33
print(which(file2$Days.on.Market > UpFen_DaysOnMarket))
## [1] 34 87 94 99
print(which(file2$Square.Feet > UpFen_SquareFeet))
## [1] 11 33 37 73
#Reference said positions
file2$Selling.Price...000.[11] <- mean(cleanout$Selling.Price...000.)
file2$Selling.Price...000.[33] <- mean(cleanout$Selling.Price...000.)
file2$Days.on.Market[34] <- mean(cleanout$Days.on.Market)
file2$Days.on.Market[87] <- mean(cleanout$Days.on.Market)
file2$Days.on.Market[94] <- mean(cleanout$Days.on.Market)
file2$Days.on.Market[99] <- mean(cleanout$Days.on.Market)
file2$Square.Feet[11] <- mean(cleanout$Square.Feet)
file2$Square.Feet[33] <- mean(cleanout$Square.Feet)
file2$Square.Feet[37] <- mean(cleanout$Square.Feet)
file2$Square.Feet[73] <- mean(cleanout$Square.Feet)
#Check again
print(which(file2$Selling.Price...000. > UpFen_SellingPrice))
## integer(0)
print(which(file2$Days.on.Market > UpFen_DaysOnMarket))
## integer(0)
print(which(file2$Square.Feet > UpFen_SquareFeet))
## integer(0)
#Outliers and nulls replaced

The last step to fullfill Part 1 is to “recode the categorical data to make it ready for analysis” which will be accomplished below.

file2$Location = factor(file2$Location,
                                 levels = c('Harbor Islands', 'West Lake'),
                                 labels = c(1, 2))
file2$Foreclosed = factor(file2$Foreclosed,
                         levels = c('No', 'Yes'),
                         labels = c(0, 1))

This is the final df meeting the requirements of Part 1 to be used in all further analysis to be performed in part 2

clean <- file2
View(clean)

Part 2 Your job is to design and develop a predictive model that predicts the selling price of the houses based on a series of attributes. Your goal is to improve the performance of your predictive model. Identify the target variable and select the attributes that help you to predict the target variable. Build and apply a predictive model. Discuss the performance evaluation of the model. Interpret the results and explain your conclusion in detail.

Initiate the training and test set, i’m chosing to go with the 80/20 rule for my split ratio

set.seed(123)
split = sample.split(clean$Selling.Price...000., SplitRatio = 4/5)
train = subset(clean, split == TRUE)
test = subset(clean, split == FALSE)

Some correlation testing to test suspected relationship strengths to better influence testing and models to come

cor.test(clean$Selling.Price...000.,clean$Bedrooms)
## 
##  Pearson's product-moment correlation
## 
## data:  clean$Selling.Price...000. and clean$Bedrooms
## t = 14.162, df = 97, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7441872 0.8763640
## sample estimates:
##       cor 
## 0.8209867
cor.test(clean$Selling.Price...000.,clean$Age)
## 
##  Pearson's product-moment correlation
## 
## data:  clean$Selling.Price...000. and clean$Age
## t = -10.92, df = 97, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8198235 -0.6388350
## sample estimates:
##        cor 
## -0.7425955
cor.test(clean$Square.Feet,clean$Bedrooms)
## 
##  Pearson's product-moment correlation
## 
## data:  clean$Square.Feet and clean$Bedrooms
## t = 10.389, df = 97, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6166509 0.8074503
## sample estimates:
##       cor 
## 0.7257181

Now to create the linear model with Selling Price being my dependent variable

regression <- lm(formula = Selling.Price...000. ~ Bedrooms + Bathrooms + Days.on.Market + Age + Square.Feet + Location+ Foreclosed, data = train)
summary(regression)
## 
## Call:
## lm(formula = Selling.Price...000. ~ Bedrooms + Bathrooms + Days.on.Market + 
##     Age + Square.Feet + Location + Foreclosed, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.13  -66.01    5.66   74.08  338.09 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    301.33808  112.00960   2.690 0.008894 ** 
## Bedrooms       112.42698   26.23379   4.286 5.63e-05 ***
## Bathrooms       -2.75643   28.24661  -0.098 0.922537    
## Days.on.Market   0.07603    0.22859   0.333 0.740402    
## Age             -8.63300    2.46771  -3.498 0.000812 ***
## Square.Feet      0.05043    0.03122   1.615 0.110647    
## Location2       32.00019   29.78725   1.074 0.286330    
## Foreclosed1     36.92179   39.34107   0.939 0.351167    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.7 on 71 degrees of freedom
## Multiple R-squared:  0.7778, Adjusted R-squared:  0.7559 
## F-statistic: 35.51 on 7 and 71 DF,  p-value: < 2.2e-16

Create the predictor and lets recall the original values of of those rows and then compare to see the difference in values generated by the predictor to the original. (A good way to spot spread and notice variability and weaknesses in model)

y_pred = predict(regression, newdata = test)
print(y_pred)
##        5        6        9       12       17       21       22       25 
## 499.2304 514.6639 807.9468 921.2314 951.5779 793.5824 685.3140 519.8429 
##       32       33       35       51       54       60       66       68 
## 757.8894 569.3493 623.5937 435.4622 838.6448 500.2995 425.0080 686.0965 
##       69       88       89       90 
## 562.9023 608.8491 555.0963 335.4321
print(test$Selling.Price...000.)
##  [1]  472.0000  531.0000  815.0000  605.2184 1149.0000  915.0000  525.0000
##  [8]  644.0000  720.0000  510.0000  579.0000  410.0000  887.0000  631.0000
## [15]  417.0000  482.0000  447.0000  679.0000  465.0000  450.0000
diff <- (y_pred - test$Selling.Price...000.)
summary(diff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -197.42  -81.26   16.74   12.94   67.04  316.01

We’ll end it with some plots of the most significant p value functions in according to the regression model that had an r squared of .7559

#Visual of test and training data for selling price as a function of bedrooms
ggplot() +
  geom_point(aes(x = train$Bedrooms, y = train$Selling.Price...000.),
             colour = 'black') +
  geom_line(aes(x = train$Bedrooms, y = predict(regression, newdata = train)),
            colour = 'green') +
  ggtitle('Selling Price vs Bedrooms (Train set)') +
  xlab('Bedrooms') +
  ylab('Selling Price')

ggplot() +
  geom_point(aes(x = test$Bedrooms, y = test$Selling.Price...000.),
             colour = 'black') +
  geom_line(aes(x = train$Bedrooms, y = predict(regression, newdata = train)),
            colour = 'green') +
  ggtitle('Selling Price vs Bedrooms (Test set)') +
  xlab('Bedrooms') +
  ylab('Selling Price')

#Visual of test and training data for selling price as a function of age
ggplot() +
  geom_point(aes(x = train$Age, y = train$Selling.Price...000.),
             colour = 'black') +
  geom_line(aes(x = train$Age, y = predict(regression, newdata = train)),
            colour = 'red') +
  ggtitle('Selling Price vs Age (Train set)') +
  xlab('Age') +
  ylab('Selling Price')

ggplot() +
  geom_point(aes(x = test$Age, y = test$Selling.Price...000.),
             colour = 'black') +
  geom_line(aes(x = train$Age, y = predict(regression, newdata = train)),
            colour = 'red') +
  ggtitle('Selling Price vs Age (Test set)') +
  xlab('Age') +
  ylab('Selling Price')

Part 2 Complete