This project uses a dataset of housing prices to predict house prices from the given inputs, even when the data are incomplete.
The data used in this project were downloaded from GeeksForGeeks. The project is a straightforward application of linear regression and a machine learning method called Random Forest to predict house prices from incomplete data. The results are then evaluated with ANOVA to determine whether the imputation methods introduce significant bias into the data. Author comments on specific parts of the document have been relocated to the Comments section at the end, but have been preserved to give context to the logical flow of the document. Assistance and inspiration were taken from several other similar projects, which are listed in the Cited Sources section. The central question of this project is: “Can machine learning techniques implemented in R be used to predict house prices?”
# Load the library I need to handle the CSV. readr is one of the core tidyverse packages, along with ggplot2, tibble, tidyr, dplyr, stringr, purrr, and forcats. The tidyverse can be installed as a whole, but since this is a learning exercise and I am trying not to rely too heavily on libraries, I am loading only what I need.
library(readr)
# Load the dataset
house_prices_original <- read_csv("HousePrices.csv")
## Rows: 2919 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): MSZoning, LotConfig, BldgType, Exterior1st
## dbl (9): Id, MSSubClass, LotArea, OverallCond, YearBuilt, YearRemodAdd, Bsmt...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
house_prices_lm <- read_csv("HousePrices.csv")
## Rows: 2919 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): MSZoning, LotConfig, BldgType, Exterior1st
## dbl (9): Id, MSSubClass, LotArea, OverallCond, YearBuilt, YearRemodAdd, Bsmt...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
house_prices <- read_csv("HousePrices.csv")
## Rows: 2919 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): MSZoning, LotConfig, BldgType, Exterior1st
## dbl (9): Id, MSSubClass, LotArea, OverallCond, YearBuilt, YearRemodAdd, Bsmt...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first few rows of the dataset
head(house_prices)
## # A tibble: 6 × 13
## Id MSSubClass MSZoning LotArea LotConfig BldgType OverallCond YearBuilt
## <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 0 60 RL 8450 Inside 1Fam 5 2003
## 2 1 20 RL 9600 FR2 1Fam 8 1976
## 3 2 60 RL 11250 Inside 1Fam 5 2001
## 4 3 70 RL 9550 Corner 1Fam 5 1915
## 5 4 60 RL 14260 FR2 1Fam 5 2000
## 6 5 50 RL 14115 Inside 1Fam 5 1993
## # ℹ 5 more variables: YearRemodAdd <dbl>, Exterior1st <chr>, BsmtFinSF2 <dbl>,
## # TotalBsmtSF <dbl>, SalePrice <dbl>
# Summary of the dataset
summary(house_prices)
## Id MSSubClass MSZoning LotArea
## Min. : 0.0 Min. : 20.00 Length:2919 Min. : 1300
## 1st Qu.: 729.5 1st Qu.: 20.00 Class :character 1st Qu.: 7478
## Median :1459.0 Median : 50.00 Mode :character Median : 9453
## Mean :1459.0 Mean : 57.14 Mean : 10168
## 3rd Qu.:2188.5 3rd Qu.: 70.00 3rd Qu.: 11570
## Max. :2918.0 Max. :190.00 Max. :215245
##
## LotConfig BldgType OverallCond YearBuilt
## Length:2919 Length:2919 Min. :1.000 Min. :1872
## Class :character Class :character 1st Qu.:5.000 1st Qu.:1954
## Mode :character Mode :character Median :5.000 Median :1973
## Mean :5.565 Mean :1971
## 3rd Qu.:6.000 3rd Qu.:2001
## Max. :9.000 Max. :2010
##
## YearRemodAdd Exterior1st BsmtFinSF2 TotalBsmtSF
## Min. :1950 Length:2919 Min. : 0.00 Min. : 0.0
## 1st Qu.:1965 Class :character 1st Qu.: 0.00 1st Qu.: 793.0
## Median :1993 Mode :character Median : 0.00 Median : 989.5
## Mean :1984 Mean : 49.58 Mean :1051.8
## 3rd Qu.:2004 3rd Qu.: 0.00 3rd Qu.:1302.0
## Max. :2010 Max. :1526.00 Max. :6110.0
## NA's :1 NA's :1
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
## NA's :1459
This is our dataset, “HousePrices.csv”, which was downloaded from GeeksForGeeks and is listed in the sources section. It has 2,919 observations of 13 variables. Nearly half of the SalePrice entries are missing, and a few other columns are missing between one and four entries.
Most of the data is numerical and easy to interpret or self-explanatory. The entries can be interpreted as follows:
| # | Name | Description |
|---|---|---|
| 1 | Id | To count the records. |
| 2 | MSSubClass | Identifies the type of dwelling involved in the sale. |
| 3 | MSZoning | Identifies the general zoning classification of the sale. |
| 4 | LotArea | Lot size in square feet. |
| 5 | LotConfig | Configuration of the lot. |
| 6 | BldgType | Type of dwelling. |
| 7 | OverallCond | Rates the overall condition of the house. |
| 8 | YearBuilt | Original construction year. |
| 9 | YearRemodAdd | Remodel date (same as construction date if no remodeling or additions). |
| 10 | Exterior1st | Exterior covering on house. |
| 11 | BsmtFinSF2 | Type 2 finished square feet. |
| 12 | TotalBsmtSF | Total square feet of basement area. |
| 13 | SalePrice | To be predicted. |
Entries within the “Exterior1st” categorical data are pseudo-ordinal, in that they carry some market preference and pricing information but are not truly orderable. Some siding and insulation types are preferred by the market, some are more expensive during construction, and some are more aesthetically pleasing to different groups of buyers, all of which confounds true ordinality. These are listed alphabetically below for ease of interpretation, and a short sketch after the table shows how R treats such a column as an unordered factor.
| Code | Description |
|---|---|
| AsbShng | Asbestos shingle |
| AsphShn | Asphalt shingle |
| BrkComm | Common brick |
| BrkFace | Brick face |
| CBlock | Cinderblock |
| CemntBd | Cement board |
| HdBoard | Hardboard |
| ImStucc | Stucco imitation |
| MetalSd | Metal siding |
| Plywood | Plywood siding |
| Stone | Stone built |
| Stucco | Stucco built |
| VinylSd | Vinyl siding |
| WdSdng | Wood siding |
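Because the siding codes have no agreed-upon ordering, lm() simply treats the column as an unordered factor. As a minimal illustration (not part of the original analysis, using a few made-up values), model.matrix() shows the 0/1 dummy coding that lm() builds internally, with one level absorbed into the intercept:
# Illustration only: how an unordered character/factor column is dummy-coded
ext <- factor(c("VinylSd", "HdBoard", "VinylSd", "Stucco"))
model.matrix(~ ext)  # each level except the baseline becomes its own 0/1 column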
# Summarize the number of missing values per column
sum_na <- sapply(house_prices, function(x) sum(is.na(x)))
sum_na[sum_na > 0]
## MSZoning Exterior1st BsmtFinSF2 TotalBsmtSF SalePrice
## 4 1 1 1 1459
There is a small amount of incomplete information in most columns of the dataset and a large number of incomplete sale prices (1,459 out of 2,919, or 49.98%!). Therefore, a model must be trained on the existing complete data and then used to fill in the missing values, a data-cleaning process known as imputation.
# Identifying complete cases and preparing datasets separately
complete_cases <- house_prices[complete.cases(house_prices), ]
data_to_impute <- house_prices[is.na(house_prices$SalePrice), ]
# Build a linear regression model using only complete cases
model <- lm(SalePrice ~ ., data = complete_cases)
# Predict the missing SalePrice values in the dataset needing imputation
predicted_values <- predict(model, newdata = data_to_impute)
# Impute the missing SalePrice values in the original dataset
house_prices_lm$SalePrice[is.na(house_prices$SalePrice)] <- predicted_values
# Display the dataset to verify imputation
print(head(house_prices))
## # A tibble: 6 × 13
## Id MSSubClass MSZoning LotArea LotConfig BldgType OverallCond YearBuilt
## <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 0 60 RL 8450 Inside 1Fam 5 2003
## 2 1 20 RL 9600 FR2 1Fam 8 1976
## 3 2 60 RL 11250 Inside 1Fam 5 2001
## 4 3 70 RL 9550 Corner 1Fam 5 1915
## 5 4 60 RL 14260 FR2 1Fam 5 2000
## 6 5 50 RL 14115 Inside 1Fam 5 1993
## # ℹ 5 more variables: YearRemodAdd <dbl>, Exterior1st <chr>, BsmtFinSF2 <dbl>,
## # TotalBsmtSF <dbl>, SalePrice <dbl>
# The lm() imputed values are now saved in house_prices_lm$SalePrice, while the original sale prices are still retained in the other datasets.
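A quick sanity check at this point (a sketch, not part of the original code) is to count how many SalePrice values are still missing after the lm() imputation; as the summaries later in the document show, 6 remain because some of their predictor columns also contain NAs, so lm() could not produce a prediction for them.
# Sketch: how many SalePrice entries are still NA after the lm() imputation
sum(is.na(house_prices_lm$SalePrice))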
(Comment 1)
# Install and load the randomForest package
options(repos = c(CRAN = "https://cloud.r-project.org/"))
install.packages("randomForest")
## Installing package into 'C:/Users/Colin/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'randomForest' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'randomForest'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\Colin\AppData\Local\R\win-library\4.3\00LOCK\randomForest\libs\x64\randomForest.dll
## to
## C:\Users\Colin\AppData\Local\R\win-library\4.3\randomForest\libs\x64\randomForest.dll:
## Permission denied
## Warning: restored 'randomForest'
##
## The downloaded binary packages are in
## C:\Users\Colin\AppData\Local\Temp\Rtmp25Ov9B\downloaded_packages
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
# Remove all rows containing missing values so the Random Forest can be trained on complete cases
house_prices <- na.omit(house_prices)
# Splitting the dataset into training and testing
set.seed(8675309) # for reproducibility, though I might change the seed to something different in the future.
training_rows <- sample(1:nrow(house_prices), 0.8 * nrow(house_prices))
train_data <- house_prices[training_rows, ]
test_data <- house_prices[-training_rows, ]
# Train the Random Forest model
rf_model <- randomForest(SalePrice ~ ., data = train_data, ntree = 500, mtry = 3)
# Evaluate the model
predicted_prices <- predict(rf_model, test_data)
mse <- mean((predicted_prices - test_data$SalePrice)^2)
print(paste("Mean Squared Error: ", mse))
## [1] "Mean Squared Error: 1311930349.81587"
# Importance of each variable
importance(rf_model)
## IncNodePurity
## Id 2.916981e+11
## MSSubClass 4.114516e+11
## MSZoning 1.638102e+11
## LotArea 1.163829e+12
## LotConfig 8.742121e+10
## BldgType 6.164416e+10
## OverallCond 2.599377e+11
## YearBuilt 1.508881e+12
## YearRemodAdd 9.547679e+11
## Exterior1st 2.327944e+11
## BsmtFinSF2 4.826619e+10
## TotalBsmtSF 2.261925e+12
# Taking data_to_impute as the subset with missing SalePrice
predicted_values <- predict(rf_model, newdata = data_to_impute)
house_prices$SalePrice[is.na(house_prices$SalePrice)] <- predicted_values
The outputs of this randomForest() model can be inspected to see whether the filled-in data are accurate.
# Compare SalePrice summaries before and after imputation
# Summary from before imputation
summary(house_prices_original$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 34900 129975 163000 180921 214000 755000 1459
# Summary from after lm() imputation
summary(house_prices_lm$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -11474 130000 168093 178085 216734 755000 6
# Summary from after RF imputation
summary(house_prices$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
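To look at individual records rather than summaries, one could also pull out a few of the rows whose SalePrice was originally missing (a sketch, not run in the original document):
# Sketch: show a few rows whose SalePrice was missing in the original data
head(house_prices_lm[is.na(house_prices_original$SalePrice), ])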
(Comment 2)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
ggplot(house_prices_original, aes(x=SalePrice)) +
geom_histogram(bins=30, fill="#AA1111") +
ggtitle("Figure 1: Distribution of Sale Prices Before Imputation")
## Warning: Removed 1459 rows containing non-finite values (`stat_bin()`).
ggplot(house_prices_lm, aes(x=SalePrice)) +
geom_histogram(bins=30, fill="#11AA11") +
labs(x = "Sale Price (USD)") +
ggtitle("Figure 2: Distribution of Sale Prices After lm() Imputation")
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
ggplot(house_prices, aes(x=SalePrice)) +
geom_histogram(bins=30, fill="#11AAAA") +
labs(x = "Sale Price (USD)") +
ggtitle("Figure 3: Distribution of Sale Prices After randomForest Imputation")
# predicted_prices are the randomForest predictions and test_data contains the real prices
actual_prices <- test_data$SalePrice
# Calculate mean squared error
mse <- mean((predicted_prices - actual_prices)^2)
# Calculate root mean squared error
rmse <- sqrt(mse)
# Calculate mean absolute error
mae <- mean(abs(predicted_prices - actual_prices))
# Calculate r-squared
rss <- sum((predicted_prices - actual_prices)^2)
tss <- sum((actual_prices - mean(actual_prices))^2)
r_squared <- 1 - rss/tss
# Print the metrics
print(paste("MSE:", mse))
## [1] "MSE: 1311930349.81587"
print(paste("RMSE:", rmse))
## [1] "RMSE: 36220.5790927736"
print(paste("MAE:", mae))
## [1] "MAE: 26553.869735087"
print(paste("R-squared:", r_squared))
## [1] "R-squared: 0.726953922890782"
# Using ggplot2 to visualize the outputs because it is prettier than the base R plotting we have used all semester; I still have not needed to load the entire tidyverse, so I will continue using individual libraries. Colors are chosen intuitively, using HEX codes that are reversed for contrast.
library(ggplot2)
ggplot() +
geom_point(aes(x = actual_prices, y = predicted_prices), colour = "#AA1111") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#11AAAA") +
labs(x = "Actual Prices (USD)", y = "Predicted Prices (USD)", title = "Figure 4: Comparison of Actual and Predicted Prices") +
theme_minimal()
# Add a 'dataset' label to each dataset
house_prices_original$dataset <- '1. Original'
house_prices_lm$dataset <- '2. LM Imputed'
house_prices$dataset <- '3. RF Imputed'
# Check if there are any NAs in the new dataset label column (this could be important if I decide to expand the project to include new datasets)
sum(is.na(house_prices_original$dataset))
## [1] 0
sum(is.na(house_prices_lm$dataset))
## [1] 0
sum(is.na(house_prices$dataset))
## [1] 0
# Combine the datasets into a single dataframe while omitting NAs in case a new dataset is introduced that contains NAs
house_prices_original_nona <- na.omit(house_prices_original)
house_prices_lm_nona <- na.omit(house_prices_lm)
house_prices_nona <- na.omit(house_prices)
combined_data <- rbind(house_prices_original_nona, house_prices_lm_nona, house_prices_nona)
# Perform ANOVA on SalePrice across the different datasets
anova_result <- aov(SalePrice ~ dataset, data = combined_data)
# Display the ANOVA summary
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## dataset 2 1.173e+10 5.866e+09 1.026 0.358
## Residuals 5830 3.332e+13 5.716e+09
ggplot(combined_data, aes(x = dataset, y = SalePrice, fill = dataset)) +
geom_boxplot() +
labs(title = "Figure 5: Comparison of Sale Prices Across Datasets",
x = "Dataset Type",
y = "Sale Price (USD)") +
theme_minimal()
# A quick if-then to run Tukey's post-hoc test (a click-button option in Prism) only if the ANOVA result is significant
if (summary(anova_result)[[1]][["Pr(>F)"]][1] < 0.05) {
post_hoc_result <- TukeyHSD(anova_result)
print(post_hoc_result)
}
This project evaluated the impact of different statistical techniques for predicting missing house prices in an incomplete dataset. Linear regression (LM) and Random Forest (RF) models were used to impute missing values, and the imputed sale prices were evaluated using analysis of variance (ANOVA) with a conditional Tukey test. The ANOVA showed no significant differences in mean sale price among the original, LM-imputed, and RF-imputed datasets (p = 0.358), suggesting that these methods reproduce the central tendency of the original data without introducing significant bias. The imputation of negative values for some house prices remains a challenge; this may be a consequence of the statistical methods used or of the limited size of the dataset used to train the models. Further experimentation with neural networks, and with a larger dataset, would be advisable, as industry leaders have demonstrated that these can model complex datasets while handling outliers and skewed data.
This dataset was small (around 3,000 observations, only about half of them complete on sale price), with just 13 variables per entry, which falls far short of a comprehensive description of a house. A well-trained machine learning model could also be used to recommend modifications to make to a house before listing it for sale. One well-known example, from a Zillow survey of home values, is that painting the front door a desirable color can potentially increase the sale price by several thousand dollars; Zillow publishes succinct results like this fairly often.
This was a good start, but a much larger and more complete dataset could be used to train a neural network. There are millions of home sales per year in the continental United States, with thousands of unique data points that could be collected to build a more accurate training set for a machine learning system that predicts house prices. The Zestimate system employed by Zillow uses neural networks fine-tuned on millions of entries to do exactly this, and reportedly differs from real sale prices by as little as 1.8%.
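As a loose sketch of what a small neural-network baseline might look like in R (an illustration only, using the nnet package on the complete cases from this project; this was not part of the original analysis):
# Sketch only: a small neural network on the numeric predictors (not run in this project)
library(nnet)
num_cols <- c("LotArea", "OverallCond", "YearBuilt", "YearRemodAdd",
              "BsmtFinSF2", "TotalBsmtSF")
# Scale the inputs and express the target in thousands of dollars, since small
# networks are sensitive to the scale of their inputs and outputs
nn_data <- data.frame(scale(complete_cases[num_cols]),
                      SalePriceK = complete_cases$SalePrice / 1000)
set.seed(8675309)
nn_model <- nnet(SalePriceK ~ ., data = nn_data, size = 5,
                 linout = TRUE, decay = 0.01, maxit = 500)
nn_pred_dollars <- predict(nn_model, nn_data) * 1000  # back to dollars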
Possible next steps to fine-tune the model:

- Use the linear regression method to evaluate the imputed dataset that was created with randomForest, then compare that data to the original LM data.
- Preprocess the data by hand, removing outliers, in order to train the RF model on a more predictive set of data.
- Tune the RF hyperparameters, paying attention to how deep the “forest” must be, finding the point at which the model overfits and then reducing complexity until a robust model is found (GeeksForGeeks has a decent article on this); a minimal tuning sketch follows this list.
- Select the features that may have a stronger influence over the sale price, including interacting variables that may have additive or subtractive effects on the final price (for example, a finished basement adds taxable living space but also allows for rental subletting to offset mortgage payments, so its value may depend on the neighborhood and zoning).
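As a minimal sketch of what that hyperparameter search could look like (an illustration only, not run in the original project), the out-of-bag error reported by randomForest() can be compared across a few values of mtry:
# Compare out-of-bag MSE for several values of mtry (sketch)
for (m in 2:6) {
  rf_try <- randomForest(SalePrice ~ ., data = train_data, ntree = 500, mtry = m)
  # rf_try$mse holds the OOB mean squared error after each tree; take the final value
  cat("mtry =", m, "OOB MSE =", tail(rf_try$mse, 1), "\n")
}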
Far-flung future exploration could involve the following techniques:

- Scaling up the training dataset by scraping data from public-facing assessors’ databases online.
- Automating data restructuring and cleaning to combine data from a variety of sources into one consistent set.
- Locating a source of funding to use an MTurk approach to clean the data using human workers.
- Using a combination of MTurk workers or a volunteer data-cleaning team to clean a sample of the largest scraped dataset, which would then be used to train a model that can restructure and clean the full dataset.
- Including proprietary sources of housing data, such as the MLS database or the Zillow database, to test the model.
- Running multimodal machine learning model imputations to evaluate the accuracy of different models.
- Training a neural network on these multimodal results to predict house prices given incomplete information.
- Training a separate neural network to generate incomplete datasets, which could potentially be used to create synthetic data to further refine the predictive model.
Cited Sources

Geeks for Geeks. (n.d.). House price prediction using machine learning in Python. Retrieved from https://www.geeksforgeeks.org/house-price-prediction-using-machine-learning-in-python/
Akkio. (n.d.). House price prediction using machine learning. Retrieved from https://www.akkio.com/post/house-price-prediction-using-machine-learning
Artificialis. (n.d.). Can’t decide between a Linear Regression or a Random Forest? Here, let me help. Medium. Retrieved from https://medium.com/artificialis/cant-decide-between-a-linear-regression-or-a-random-forest-here-let-me-help-ab941b94da4c
Torlay, L. (n.d.). Creating a Zillow Zestimate using machine learning. Medium. Retrieved from https://medium.com/@ltorlay/creating-a-zillow-zestimate-using-machine-learning-a4785b9e541
Towards Data Science. (n.d.). Feature scaling: Effectively choose input variables based on distributions. Retrieved from https://towardsdatascience.com/feature-scaling-effectively-choose-input-variables-based-on-distributions-3032207c921f
This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College. The course was led by Professor Billy Jackson.

Student Name: Colin L. Stark
Semester: Spring of 2024
Comments
The following comments were removed from the body of this document for clarity and have been preserved for reference in this section.
Comment 1:
I am using the lm() function here, but since lm() has its disadvantages in accurately modeling highly complex real-world relationships, I have decided to use the randomForest library to improve the data cleaning. This decision was made using input from a Medium article listed in the sources section (“Can’t Decide Between…”, Medium).
Return to continue reading.
Comment 2:
The means and medians appear to be very similar: the median difference is around +$5,093 (about 3.12%) and the mean difference is around -$2,836 (about 1.57%). These values have varied a bit as I adjusted the models, but these are the most recent values.
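These differences can be checked directly from the two SalePrice columns; a quick sketch (not part of the original code chunks) would be:
# Sketch: median and mean shifts introduced by the lm() imputation
median(house_prices_lm$SalePrice, na.rm = TRUE) - median(house_prices_original$SalePrice, na.rm = TRUE)
mean(house_prices_lm$SalePrice, na.rm = TRUE) - mean(house_prices_original$SalePrice, na.rm = TRUE)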
As any prudent observer would notice by looking at the Minimum column for the lm() set, the minimum of these predicted house prices is clearly wrong. It flies in the face of logic: how can a house have a negative sale price? Let’s explore this issue by using ggplot2 to visualize our original and our modeled SalePrices.
Return to continue reading.
Comment 3:
Very interesting! On several attempts, one entry crosses the Y-axis here. Looking into the imputed house_prices dataset, the predicted sale price for one house (entry number 1822), built in 1900 and remodeled in 1950, seems to imply that the house is worth less than worthless. Though it is not reflected in this project, I have tried a few dozen variations at this point and still get a negative number for at least one house using linear regression. Some of those variations take much longer to “train” than randomForest, so I have excluded them from the comparison. For values such as this, it would be possible to manually exclude any results below a certain threshold, or to favor higher values by applying a coefficient to the LM predictions after they are produced. That might betray some data ethics, however, so I have opted to keep the results intact. We also have a large number of NA values in the original dataset, and 6 rows in the LM dataset were automatically excluded from the plot because their SalePrice remained NA: linear regression could not produce a prediction for them since some of their predictor columns were also NA. This was handled automatically by ggplot2 and is kept open for review.
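For illustration only, the post-hoc thresholding described above (which was deliberately not applied in this project) might look something like the following, assuming a floor equal to the lowest observed sale price:
# Hypothetical floor on the lm() imputations (NOT applied in this analysis)
price_floor <- min(house_prices_original$SalePrice, na.rm = TRUE)
house_prices_lm_floored <- house_prices_lm
house_prices_lm_floored$SalePrice <- pmax(house_prices_lm_floored$SalePrice, price_floor)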
Despite that very low value for one house in the linear regression model, both the LM and the RF distributions appear, for the most part, to bear a strong resemblance to the complete set of data. This may mean the model is still predictive, but it does mean we must adjust our assumptions about how accurate it is near the limits of the original data’s range, or outside that range. To see how well it performs within the range, let’s look at how the model compares with the sale prices of houses in our original dataset.
Return to continue reading.
Comment 4:
Reading this scatterplot, we can see that it is consistent with the histograms showing the distribution of sales before and after prediction: the modeled and the real sale prices have a positive relationship, with an R-squared value around 0.73. This means that around 73% of the variance in sale price is explained by the RF model.
Evaluating this plot visually, there is an obvious trend: at the lower end of the range, real-world sale prices tend to fall below the predicted prices, while at the upper end they exceed the predictions. This suggests a more complex relationship that might be better explained by training a neural network on a much larger dataset, which can be done in further exploration.
Return to continue reading.
Comment 5:
Let’s take a scalpel to the observations we’ve made on these models. We have three sets of data: the original house prices, the lm() imputed house prices, and the randomForest imputed house prices. We want to compare the mean sale prices across the original dataset and the final imputed randomForest dataset to see whether the method has significantly altered the central tendency. I was given the option to use one of the analysis techniques listed in the Final Project Primer, and I chose ANOVA because it looks like the best way to evaluate these data, though I should note it is not entirely new to me: I ran ANOVA analyses in a GUI-focused statistics platform (GraphPad Prism) at a previous job.
Return to continue reading.