Project Proposal - Predicting House Prices Using Different Machine Learning Techniques

“Before you start trying to work out which direction the property market is headed, you should be aware that there are markets within markets.” - Paul Clitheroe. This quote is simple yet telling: it suggests how much can be learned by studying the smaller markets that together make up the real estate world. Data and analytics from real estate can help optimize the buyer's selection process, analyze and monitor market trends, reveal patterns and, finally, produce predictions that benefit both the buyer and the seller. With greater insight into market trends and more precise data on metrics such as pricing and time on the market from thousands of comparable properties, realtors can better determine the value of a property. Buyers and investors can use these insights to make better offers.

In this project, we will use the “Houses Sold” data from Madame Danielle Henderson to help her better understand the decisions she has made, increase her profit and gain the knowledge she needs about the future success or potential downfall of the properties she holds. From the computations made by us (The Research Body), Madame Henderson will be able to view the predicted sale prices of the houses.

To give the research project a clear direction, we have established a few goals and outcomes that we seek to achieve:

Efficiently clean up the data set so that analysis and the implementation of machine learning techniques are straightforward.
Discover the various features that might influence the price of houses.
Use predictive models to forecast the selling prices of the houses.
Explore the relationships and groupings among the houses so we can check for similarity or dissimilarity.

The data set provided appears to be sufficient for adequate training and testing. After cleaning the data, we may be left with about 2,000 observations, which is still enough for the current project analysis.

The data set contains 2930 rows/observations and 82 columns/features.
The data set contains different data types that we can leverage to perform more extensive computations.
Out of the 237,330 fields, 13,997 fields currently show up as “NA”.
A few features have been captured and listed below to give an insight into what the data looks like.
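
Separately from the feature listing, the missing-value figures quoted above can be put in proportion with a short calculation. The sketch below is written in Python purely for illustration (the project itself plans to use R) and uses only the numbers stated in this proposal. One observation, offered as an assumption rather than a correction: 2,930 rows times 82 columns gives 240,260 fields, while the quoted 237,330 equals 2,930 times 81, so the field count may exclude one column such as an identifier.

```python
# Figures quoted in the proposal.
n_rows = 2930
n_cols = 82
total_fields = 237_330
missing_fields = 13_997

# Share of all fields that are currently NA.
missing_share = missing_fields / total_fields
print(f"Missing share: {missing_share:.1%}")  # Missing share: 5.9%

# Consistency check between the row/column counts and the field count.
print(n_rows * n_cols)          # 240260
print(total_fields // n_rows)   # 81 (assumption: one column was excluded)
```

Roughly 6% of the fields are missing, which is the share that the imputation step will need to address.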

To attain our objectives, we will clean the data thoroughly so that we can work with it seamlessly and produce various types of results. Cleaning the data includes but is not limited to:

Imputation of the NAs.
Scaling the data.
Converting categorical variables to factors where needed, among other steps.
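
The three cleaning steps listed above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's actual pipeline (which is planned in R): median imputation and z-score scaling are only one possible choice for each step, and the column names and values (`lot_area`, `street`) are made up for the example.

```python
from statistics import mean, median, pstdev

def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

def encode_factor(labels):
    """Map category labels to integer codes, similar in spirit to an R factor."""
    levels = sorted(set(labels))
    codes = {level: i for i, level in enumerate(levels)}
    return [codes[label] for label in labels], levels

# Hypothetical columns, invented for illustration only.
lot_area = [8450, None, 11250, 9550, None, 14115]
street = ["Paved", "Gravel", "Paved", "Paved", "Gravel", "Paved"]

filled = impute_median(lot_area)       # the two None entries become 10400.0
scaled = standardize(filled)
street_codes, street_levels = encode_factor(street)
```

Which imputation rule suits each feature (median, mode, or a dedicated "none" level) is a decision to be made during the actual data exploration.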

We will also look for patterns and relationships using visuals in our data analysis to gain deeper insight into what the data may be telling us. The research body will go through various processes to understand the narrative and behaviors behind the data, such as correlations between features. We shall also import libraries and utilities in R that will enable us to examine the numerous features in the “Houses” data, their relation to sale prices and how the housing price is affected by them.
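
One simple way to examine how a feature relates to sale price is the Pearson correlation coefficient. The sketch below, in Python for illustration only, computes it by hand and ranks features by the strength of their correlation with price. The feature names and all values are invented for the example and do not come from the “Houses” data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: prices in thousands, features invented for illustration.
sale_price = [110, 125, 130, 240, 255, 260]
features = {
    "living_area": [900, 1100, 1300, 2100, 2300, 2500],
    "year_built": [1995, 1960, 2005, 1970, 2010, 1980],
}

# Rank features by the absolute strength of their correlation with price.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], sale_price)),
    reverse=True,
)
for name in ranked:
    print(name, round(pearson(features[name], sale_price), 3))
```

Correlation only captures linear association, so it serves as a first screening step rather than a final judgment on which features matter.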

It is imperative that we go through these processes, because only then can we understand which features are significant to this research. That understanding will guide model selection and improve accuracy, which in turn leads to a truer forecast of the sale price. Supervised learning algorithms will be implemented in this project to model the houses in the data (since sale price is a continuous value, the prediction itself is a regression task). Decision trees will be another algorithm used for the analysis. Additionally, gradient boosting and XGBoost will be used to provide a more robust prediction.

Decision trees require less pre-processing, even with categorical variables, and are typically not susceptible to outliers.
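
One reason for this robustness is that a tree split depends only on the ordering of a feature's values, not on their magnitude. The sketch below, in Python for illustration, finds the best single split of a regression tree by minimizing the sum of squared errors and shows that an extreme feature value leaves the chosen split unchanged. The data are invented, and this is a simplified stump rather than a full tree implementation. Outliers in the target (the price itself) can still shift leaf predictions.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, cost) of the single split minimizing total SSE."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical data: prices in thousands, areas in square feet.
price = [110, 125, 130, 240, 255, 260]
area = [900, 1100, 1300, 2100, 2300, 2500]
area_with_outlier = [900, 1100, 1300, 2100, 2300, 25000]

print(best_split(area, price)[0])               # 1700.0
print(best_split(area_with_outlier, price)[0])  # 1700.0
```

Both calls choose the same threshold because the outlier does not change which houses fall on each side of the split.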

XGBoost often outperforms other methods on tabular data, and during training it learns which direction to send an observation at each split when a value is missing.
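
The idea of learning a direction for missing values can be illustrated with a simplified sketch: at a given split, try sending the rows with a missing feature value to the left branch, then to the right branch, and keep whichever gives the lower error. The Python code below uses the sum of squared errors as the criterion. This is an assumption made for readability and is not XGBoost's actual implementation, which scores splits with gradient statistics. The feature name and values are invented.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def default_direction(x, y, threshold):
    """Choose the branch for rows whose feature value is missing (None)."""
    left = [t for v, t in zip(x, y) if v is not None and v < threshold]
    right = [t for v, t in zip(x, y) if v is not None and v >= threshold]
    missing = [t for v, t in zip(x, y) if v is None]
    cost_if_left = sse(left + missing) + sse(right)
    cost_if_right = sse(left) + sse(right + missing)
    return "left" if cost_if_left <= cost_if_right else "right"

# Hypothetical data: two houses have no recorded garage area.
garage_area = [200, 250, None, 600, 650, None]
price = [120, 130, 125, 250, 260, 128]

# The missing rows have prices close to the small-garage group,
# so sending them left gives the lower error.
direction = default_direction(garage_area, price, threshold=400)
print(direction)  # left
```

Because the direction is chosen from the training data at every split, rows with missing values can be used without a separate imputation step for that model.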