Introduction

Deep learning regression using TensorFlow for house price prediction.

House Prices: Advanced Regression Techniques

Link to the Kaggle competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Datasets: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Repository with the code for this notebook and the TensorFlow model: https://github.com/dimitreOliveira/HousePrices

Overview

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Acknowledgments

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

Exploratory Data Analysis

Null occurrence

First let’s take a look at how many null values we have on the train set.

As we can see, there are many null values spread across the columns. To make our work easier, we’ll set the affected columns aside for now and decide later how to deal with them.
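The null count can be sketched in a few lines of plain Python. This is only an illustration, not the notebook's original code: the rows are toy values, `null_counts` and `columns_without_nulls` are hypothetical helper names, and we assume that "taking them out" means setting aside every column that has at least one null.

```python
# Toy rows standing in for the train set; None marks a missing value.
# (Illustrative values only, not the real data.)
rows = [
    {"LotFrontage": 65.0, "Alley": None, "SalePrice": 208500},
    {"LotFrontage": None, "Alley": None, "SalePrice": 181500},
    {"LotFrontage": 68.0, "Alley": "Pave", "SalePrice": 223500},
]


def null_counts(rows):
    """Return {column: number of missing values}, in first-seen column order."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def columns_without_nulls(rows):
    """Return the columns that have no missing values at all."""
    return [column for column, n in null_counts(rows).items() if n == 0]


counts = null_counts(rows)
complete = columns_without_nulls(rows)
print(counts)
print(complete)
```

Keeping the counts around, instead of just dropping the columns, makes it easy to come back later and decide which features are worth filling in.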

Label analysis

Now that our train set has no null values, let’s look at the distribution of our label (“SalePrice”) with a histogram, along with its summary statistics.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34900  129975  163000  180921  214000  755000 

We can see some interesting properties: our label peaks around 160000, then declines and forms a long right tail ending at 755000, as the summary shows.

Next we will apply a logarithmic transformation to make the distribution easier to work with. Note that it now looks closer to a normal distribution and loses its long right tail.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.46   11.78   12.00   12.02   12.27   13.53 
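As a quick check of the transformation, here is a small sketch that applies the natural logarithm to the minimum, median and maximum from the first summary. The choice of the natural log (rather than, say, base 10) is our assumption, but it is consistent with the second summary.

```python
import math

# Min, median and max of "SalePrice", taken from the first summary above.
prices = [34900, 163000, 755000]

# Natural log of each value (assumed to be the transformation used).
log_prices = [math.log(p) for p in prices]
rounded = [round(v, 2) for v in log_prices]
print(rounded)

# The transformation is reversible: exp() recovers the original prices,
# which is needed to turn model predictions back into sale prices.
recovered = [round(math.exp(v)) for v in log_prices]
print(recovered)
```

Because the logarithm preserves order, the minimum, median and maximum of the original prices map directly onto the minimum, median and maximum of the transformed label.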

Numerical features correlation

Next, let’s look at how the remaining 26 numeric features correlate with the target “SalePrice” using a correlation matrix.
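For reference, the column of the correlation matrix we care about (each feature against “SalePrice”) can be sketched as below. We assume the Pearson coefficient here, since the text does not say which correlation was used; the data is a toy example and `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy columns (illustrative values only, not the real data).
features = {
    "GrLivArea": [1000, 1500, 2000, 2500],
    "MoSold": [6, 2, 6, 2],
}
sale_price = [100000, 150000, 200000, 250000]

# Correlation of every feature with the target, strongest first.
correlations = {name: pearson(values, sale_price) for name, values in features.items()}
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

Sorting by absolute value matters because a strong negative correlation is just as informative for the model as a strong positive one.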

Categorical features correlation

For the remaining 20 categorical features, let’s look at some box plots to get a feel for how each one behaves against “SalePrice”.

As we can see, many features have very low correlation (numerical) or low variance (categorical) with respect to “SalePrice”. Features like these can disturb the training of our model. Later we may feature engineer them into something more useful, but for now we’ll set them aside to keep the model simpler.

Data pre-processing

Now that we have more information about the features of our dataset, we can filter out all the unwanted features and work with a cleaner dataset.

Features behavior

After filtering out the unwanted features, let’s look at how the remaining features behave against the target and against each other, using scatter plot matrices.

Data inference

Now we can go back to the features with null values. We will drop the ones with a large amount of missing data (more than 15%), since inferring their values would take considerable effort and could still add bias to the training. For the remaining ones, we will try to infer the missing values.

As the remaining features have few missing values, we will use a simple technique to infer the data: replacing the missing values with the median or mode of the feature.
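The two rules above (drop above 15% missing, otherwise fill in) can be sketched as follows. This is a minimal illustration, not the notebook's original code: we assume the median is used for numeric features and the mode for categorical ones, the columns hold toy values, and `missing_ratio`, `impute` and `clean` are hypothetical helpers.

```python
from statistics import median, mode

MISSING_THRESHOLD = 0.15  # the 15% cut-off mentioned in the text


def missing_ratio(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute(values):
    """Fill missing entries with the median (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    is_numeric = all(isinstance(v, (int, float)) for v in present)
    fill = median(present) if is_numeric else mode(present)
    return [fill if v is None else v for v in values]


def clean(columns):
    """Drop columns above the missing threshold and impute the rest."""
    cleaned = {}
    for name, values in columns.items():
        if missing_ratio(values) > MISSING_THRESHOLD:
            continue  # too much missing data: drop the feature
        cleaned[name] = impute(values)
    return cleaned


# Toy columns (illustrative values only): 10%, 10% and 90% missing.
columns = {
    "MasVnrArea": [0, 196, None, 162, 0, 350, 0, 186, 240, 0],
    "Electrical": ["SBrkr"] * 8 + [None, "FuseA"],
    "PoolQC": [None] * 9 + ["Gd"],
}

cleaned = clean(columns)
print(sorted(cleaned))
print(cleaned["MasVnrArea"][2], cleaned["Electrical"][8])
```

The median is preferred over the mean for numeric features because it is not pulled around by the extreme values we saw in the long tail of the data.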

Inferred data correlation

Now let’s look at how the data we just filled in behaves with the target feature, the same way we did before.

First the numerical features.

Then the categorical features.

As you can see, we still have a number of irrelevant features, so we will remove them as well.

Then we will apply the same process to our test set.

Export the data

After all the data cleaning and processing, we can write the resulting data frames to two CSV files (train and test) and use them in our model.
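A minimal sketch of the export step is shown below, using only Python's standard `csv` module. The rows are toy values, `write_csv` is a hypothetical helper, and the file names `train_clean.csv` and `test_clean.csv` are placeholders we made up, not the names used in the repository.

```python
import csv
import os
import tempfile


def write_csv(rows, path):
    """Write a list of dict rows to `path`, with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Toy cleaned data (illustrative values only). The test set has no label.
train_rows = [
    {"GrLivArea": 1710, "SalePrice": 12.25},
    {"GrLivArea": 1262, "SalePrice": 12.11},
]
test_rows = [{"GrLivArea": 896}, {"GrLivArea": 1329}]

# Placeholder output location and file names.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train_clean.csv")
test_path = os.path.join(out_dir, "test_clean.csv")

write_csv(train_rows, train_path)
write_csv(test_rows, test_path)
print(train_path, test_path)
```

Writing train and test through the same function keeps the column order and formatting identical, which avoids surprises when the model reads both files.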

Reminder: the TensorFlow code is available at https://github.com/dimitreOliveira/HousePrices

