Part 1 Your job is to clean the data and make it ready for developing predictive analytics. Look for missing values, outliers and recode the categorical data to make it ready for analysis. Report the descriptive statistics and use charts as it seems appropriate.
Data and package load.
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.6.3
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2021 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(mlr)
## Warning: package 'mlr' was built under R version 3.6.3
## Loading required package: ParamHelpers
## Warning: package 'ParamHelpers' was built under R version 3.6.3
## 'mlr' is in maintenance mode since July 2019. Future development
## efforts will go into its successor 'mlr3' (<https://mlr3.mlr-org.com>).
library(caTools)
## Warning: package 'caTools' was built under R version 3.6.3
file = read.csv("C://Users//still//OneDrive//Documents//School//6136 Data Mining//Hollywood Homes.csv")
Lets get a picture of our data and what’s potentially wrong with it.
str(file)
## 'data.frame': 100 obs. of 8 variables:
## $ Selling.Price...000.: int 741 791 665 765 472 531 650 582 815 402 ...
## $ Bedrooms : int 4 4 4 4 3 3 3 4 4 3 ...
## $ Bathrooms : num 4 3 2.5 NA 3 3 2 2.5 4 2.5 ...
## $ Days.on.Market : int 7 102 39 NA 118 35 145 207 42 48 ...
## $ Age : int 14 10 19 NA 32 25 24 22 15 22 ...
## $ Square.Feet : int 3424 2980 2790 NA 1968 1933 1699 2018 3217 1610 ...
## $ Location : Factor w/ 3 levels "","Harbor Islands",..: 3 2 3 1 2 2 3 2 3 2 ...
## $ Foreclosed : Factor w/ 3 levels "","No","Yes": 2 2 2 1 3 2 2 2 2 2 ...
summary(file)
## Selling.Price...000. Bedrooms Bathrooms Days.on.Market
## Min. : 189.0 Min. :1.00 Min. :1.000 Min. : 2.0
## 1st Qu.: 465.8 1st Qu.:3.00 1st Qu.:2.000 1st Qu.: 38.0
## Median : 579.5 Median :3.00 Median :3.000 Median : 104.0
## Mean : 641.9 Mean :3.38 Mean :2.783 Mean : 127.8
## 3rd Qu.: 792.2 3rd Qu.:4.00 3rd Qu.:3.500 3rd Qu.: 157.0
## Max. :1525.0 Max. :5.00 Max. :4.000 Max. :1188.0
## NA's :4 NA's :1 NA's :1
## Age Square.Feet Location Foreclosed
## Min. : 2.00 Min. : 520 : 1 : 1
## 1st Qu.:16.00 1st Qu.:1765 Harbor Islands:54 No :70
## Median :24.00 Median :2248 West Lake :45 Yes:29
## Mean :22.43 Mean :2329
## 3rd Qu.:29.00 3rd Qu.:2784
## Max. :36.00 Max. :4979
## NA's :1 NA's :1
boxplot(file)
missmap(file, main = "Missing values vs observed")
Outliers and missing values noted
After deliberation the best solution is to replace outliers with the average of post treated data for SellingPrice, DaysOnMarket, and SquareFeet. One thing I have decided to do before anything however is removing row 4 as it’s missing pretty much every row of data and 1 particular row isn’t worth the effort of creating and running a likelihood assessment and filling it with most similair to based on selling price value. Although it’s the lazier approach nothing says we need to keep all the original data intact and the cleaning is up to the users desire, it would however be cheap of me to remove any outliers or null that inconveniences us so I will refrain from that the removal of data being as we started with only 100 rows and are now down to 99.
file1 = file[-c(4), ]
str(file1)
## 'data.frame': 99 obs. of 8 variables:
## $ Selling.Price...000.: int 741 791 665 472 531 650 582 815 402 NA ...
## $ Bedrooms : int 4 4 4 3 3 3 4 4 3 3 ...
## $ Bathrooms : num 4 3 2.5 3 3 2 2.5 4 2.5 2 ...
## $ Days.on.Market : int 7 102 39 118 35 145 207 42 48 150 ...
## $ Age : int 14 10 19 32 25 24 22 15 22 30 ...
## $ Square.Feet : int 3424 2980 2790 1968 1933 1699 2018 3217 1610 1652 ...
## $ Location : Factor w/ 3 levels "","Harbor Islands",..: 3 2 3 2 2 3 2 3 2 2 ...
## $ Foreclosed : Factor w/ 3 levels "","No","Yes": 2 2 2 3 2 2 2 2 2 3 ...
summary(file1)
## Selling.Price...000. Bedrooms Bathrooms Days.on.Market
## Min. : 189.0 Min. :1.000 Min. :1.000 Min. : 2.0
## 1st Qu.: 465.5 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 38.0
## Median : 579.0 Median :3.000 Median :3.000 Median : 104.0
## Mean : 640.6 Mean :3.374 Mean :2.783 Mean : 127.8
## 3rd Qu.: 793.5 3rd Qu.:4.000 3rd Qu.:3.500 3rd Qu.: 157.0
## Max. :1525.0 Max. :5.000 Max. :4.000 Max. :1188.0
## NA's :4
## Age Square.Feet Location Foreclosed
## Min. : 2.00 Min. : 520 : 0 : 0
## 1st Qu.:16.00 1st Qu.:1765 Harbor Islands:54 No :70
## Median :24.00 Median :2248 West Lake :45 Yes:29
## Mean :22.43 Mean :2329
## 3rd Qu.:29.00 3rd Qu.:2784
## Max. :36.00 Max. :4979
##
boxplot(file1)
missmap(file1, main = "Missing values vs observed")
Now that row 4 can no longer inconveniences us lets work to find outleirs and create fencing limits
IQR_SellingPrice = 793.5-465.5
UpFen_SellingPrice = 793.5+1.5*IQR_SellingPrice
IQR_DaysOnMarket = 157-38
UpFen_DaysOnMarket = 157+1.5*IQR_DaysOnMarket
IQR_SquareFeet = 2784-1765
UpFen_SquareFeet = 2784+1.5*IQR_SquareFeet
UpFen_SellingPrice
## [1] 1285.5
UpFen_DaysOnMarket
## [1] 335.5
UpFen_SquareFeet
## [1] 4312.5
We should note that box-and-whisker plots only suggest that current outliers exceed the upper bounds thus only an upper fence will be used.
My next step is to create a “Fully cleaned” df to which an unskewed mean could be extracted and later plugged in as replacements for outliers and missing values.
cleanout = subset(file1, Selling.Price...000. <= 1285.5 & Days.on.Market <= 335.5 & Square.Feet <= 4312.5)
boxplot(cleanout)
missmap(cleanout, main = "Missing values vs observed")
summary(cleanout)
## Selling.Price...000. Bedrooms Bathrooms Days.on.Market
## Min. : 189.0 Min. :1.00 Min. :1.00 Min. : 2.0
## 1st Qu.: 452.5 1st Qu.:3.00 1st Qu.:2.00 1st Qu.: 39.5
## Median : 565.0 Median :3.00 Median :3.00 Median :104.0
## Mean : 605.2 Mean :3.31 Mean :2.77 Mean :104.0
## 3rd Qu.: 745.5 3rd Qu.:4.00 3rd Qu.:3.50 3rd Qu.:153.0
## Max. :1149.0 Max. :5.00 Max. :4.00 Max. :266.0
## Age Square.Feet Location Foreclosed
## Min. : 2.00 Min. : 520 : 0 : 0
## 1st Qu.:18.00 1st Qu.:1745 Harbor Islands:49 No :61
## Median :24.00 Median :2198 West Lake :38 Yes:26
## Mean :23.06 Mean :2216
## 3rd Qu.:29.00 3rd Qu.:2690
## Max. :36.00 Max. :3550
Lets first deal with missing values as those are the quickest to fill in using our new reference to the cleanout df.
file2 <- file1
file2$Selling.Price...000.[which(is.na(file1$Selling.Price...000.))] <- mean(cleanout$Selling.Price...000.)
missmap(file2, main = "Missing values vs observed")
Next we identify the positions of outleirs and run replace functions though the cleanout df.
#Identify outlier positions
print(which(file2$Selling.Price...000. > UpFen_SellingPrice))
## [1] 11 33
print(which(file2$Days.on.Market > UpFen_DaysOnMarket))
## [1] 34 87 94 99
print(which(file2$Square.Feet > UpFen_SquareFeet))
## [1] 11 33 37 73
#Reference said positions
file2$Selling.Price...000.[11] <- mean(cleanout$Selling.Price...000.)
file2$Selling.Price...000.[33] <- mean(cleanout$Selling.Price...000.)
file2$Days.on.Market[34] <- mean(cleanout$Days.on.Market)
file2$Days.on.Market[87] <- mean(cleanout$Days.on.Market)
file2$Days.on.Market[94] <- mean(cleanout$Days.on.Market)
file2$Days.on.Market[99] <- mean(cleanout$Days.on.Market)
file2$Square.Feet[11] <- mean(cleanout$Square.Feet)
file2$Square.Feet[33] <- mean(cleanout$Square.Feet)
file2$Square.Feet[37] <- mean(cleanout$Square.Feet)
file2$Square.Feet[73] <- mean(cleanout$Square.Feet)
#Check again
print(which(file2$Selling.Price...000. > UpFen_SellingPrice))
## integer(0)
print(which(file2$Days.on.Market > UpFen_DaysOnMarket))
## integer(0)
print(which(file2$Square.Feet > UpFen_SquareFeet))
## integer(0)
#Outliers and nulls replaced
The last step to fullfill Part 1 is to “recode the categorical data to make it ready for analysis” which will be accomplished below.
file2$Location = factor(file2$Location,
levels = c('Harbor Islands', 'West Lake'),
labels = c(1, 2))
file2$Foreclosed = factor(file2$Foreclosed,
levels = c('No', 'Yes'),
labels = c(0, 1))
This is the final df meeting the requirements of Part 1 to be used in all further analysis to be performed in part 2
clean <- file2
View(clean)
Part 2 Your job is to design and develop a predictive model that predicts the selling price of the houses based on a series of attributes. Your goal is to improve the performance of your predictive model. Identify the target variable and select the attributes that help you to predict the target variable. Build and apply a predictive model. Discuss the performance evaluation of the model. Interpret the results and explain your conclusion in detail.
Initiate the training and test set, i’m chosing to go with the 80/20 rule for my split ratio
set.seed(123)
split = sample.split(clean$Selling.Price...000., SplitRatio = 4/5)
train = subset(clean, split == TRUE)
test = subset(clean, split == FALSE)
Some correlation testing to test suspected relationship strengths to better influence testing and models to come
cor.test(clean$Selling.Price...000.,clean$Bedrooms)
##
## Pearson's product-moment correlation
##
## data: clean$Selling.Price...000. and clean$Bedrooms
## t = 14.162, df = 97, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7441872 0.8763640
## sample estimates:
## cor
## 0.8209867
cor.test(clean$Selling.Price...000.,clean$Age)
##
## Pearson's product-moment correlation
##
## data: clean$Selling.Price...000. and clean$Age
## t = -10.92, df = 97, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8198235 -0.6388350
## sample estimates:
## cor
## -0.7425955
cor.test(clean$Square.Feet,clean$Bedrooms)
##
## Pearson's product-moment correlation
##
## data: clean$Square.Feet and clean$Bedrooms
## t = 10.389, df = 97, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6166509 0.8074503
## sample estimates:
## cor
## 0.7257181
Now to create the linear model with Selling Price being my dependent variable
regression <- lm(formula = Selling.Price...000. ~ Bedrooms + Bathrooms + Days.on.Market + Age + Square.Feet + Location+ Foreclosed, data = train)
summary(regression)
##
## Call:
## lm(formula = Selling.Price...000. ~ Bedrooms + Bathrooms + Days.on.Market +
## Age + Square.Feet + Location + Foreclosed, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -299.13 -66.01 5.66 74.08 338.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 301.33808 112.00960 2.690 0.008894 **
## Bedrooms 112.42698 26.23379 4.286 5.63e-05 ***
## Bathrooms -2.75643 28.24661 -0.098 0.922537
## Days.on.Market 0.07603 0.22859 0.333 0.740402
## Age -8.63300 2.46771 -3.498 0.000812 ***
## Square.Feet 0.05043 0.03122 1.615 0.110647
## Location2 32.00019 29.78725 1.074 0.286330
## Foreclosed1 36.92179 39.34107 0.939 0.351167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106.7 on 71 degrees of freedom
## Multiple R-squared: 0.7778, Adjusted R-squared: 0.7559
## F-statistic: 35.51 on 7 and 71 DF, p-value: < 2.2e-16
Create the predictor and lets recall the original values of of those rows and then compare to see the difference in values generated by the predictor to the original. (A good way to spot spread and notice variability and weaknesses in model)
y_pred = predict(regression, newdata = test)
print(y_pred)
## 5 6 9 12 17 21 22 25
## 499.2304 514.6639 807.9468 921.2314 951.5779 793.5824 685.3140 519.8429
## 32 33 35 51 54 60 66 68
## 757.8894 569.3493 623.5937 435.4622 838.6448 500.2995 425.0080 686.0965
## 69 88 89 90
## 562.9023 608.8491 555.0963 335.4321
print(test$Selling.Price...000.)
## [1] 472.0000 531.0000 815.0000 605.2184 1149.0000 915.0000 525.0000
## [8] 644.0000 720.0000 510.0000 579.0000 410.0000 887.0000 631.0000
## [15] 417.0000 482.0000 447.0000 679.0000 465.0000 450.0000
diff <- (y_pred - test$Selling.Price...000.)
summary(diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -197.42 -81.26 16.74 12.94 67.04 316.01
We’ll end it with some plots of the most significant p value functions in according to the regression model that had an r squared of .7559
#Visual of test and training data for selling price as a function of bedrooms
ggplot() +
geom_point(aes(x = train$Bedrooms, y = train$Selling.Price...000.),
colour = 'black') +
geom_line(aes(x = train$Bedrooms, y = predict(regression, newdata = train)),
colour = 'green') +
ggtitle('Selling Price vs Bedrooms (Train set)') +
xlab('Bedrooms') +
ylab('Selling Price')
ggplot() +
geom_point(aes(x = test$Bedrooms, y = test$Selling.Price...000.),
colour = 'black') +
geom_line(aes(x = train$Bedrooms, y = predict(regression, newdata = train)),
colour = 'green') +
ggtitle('Selling Price vs Bedrooms (Test set)') +
xlab('Bedrooms') +
ylab('Selling Price')
#Visual of test and training data for selling price as a function of age
ggplot() +
geom_point(aes(x = train$Age, y = train$Selling.Price...000.),
colour = 'black') +
geom_line(aes(x = train$Age, y = predict(regression, newdata = train)),
colour = 'red') +
ggtitle('Selling Price vs Age (Train set)') +
xlab('Age') +
ylab('Selling Price')
ggplot() +
geom_point(aes(x = test$Age, y = test$Selling.Price...000.),
colour = 'black') +
geom_line(aes(x = train$Age, y = predict(regression, newdata = train)),
colour = 'red') +
ggtitle('Selling Price vs Age (Test set)') +
xlab('Age') +
ylab('Selling Price')
Part 2 Complete