Regression Exercise

Predicting the selling price of a residential property depends on a number of factors, including the property age, availability of local amenities, and location.

You will use a dataset of real estate sales transactions to predict the price-per-unit of a property based on its features. The price-per-unit in this data is based on a unit measurement of 3.3 square meters.

Citation: The data used in this exercise originates from the following study:

Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.

It was obtained from the UCI dataset repository (Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science).

Review the data

Run the following cell to load the data and view the first few rows.

In [2]:
import pandas as pd

# load the training dataset
data = pd.read_csv('data/real_estate.csv')
data.head()
Out[2]:
   transaction_date  house_age  transit_distance  local_convenience_stores  latitude  longitude  price_per_unit
0          2012.917       32.0          84.87882                        10  24.98298  121.54024            37.9
1          2012.917       19.5         306.59470                         9  24.98034  121.53951            42.2
2          2013.583       13.3         561.98450                         5  24.98746  121.54391            47.3
3          2013.500       13.3         561.98450                         5  24.98746  121.54391            54.8
4          2012.833        5.0         390.56840                         5  24.97937  121.54245            43.1

The data consists of the following variables:

  • transaction_date - the transaction date (for example, 2013.250 = March 2013, 2013.500 = June 2013; see the decoding sketch after this list)
  • house_age - the house age (in years)
  • transit_distance - the distance to the nearest light rail station (in meters)
  • local_convenience_stores - the number of convenience stores within walking distance
  • latitude - the geographic coordinate, latitude
  • longitude - the geographic coordinate, longitude
  • price_per_unit - house price per unit area (one unit = 3.3 square meters)
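
The fractional-year date can be decoded to a calendar month if needed. A minimal sketch, assuming the pattern in the examples above (month = fractional part × 12), which is an inference rather than a documented rule:

# Decode transaction_date (e.g. 2013.250 -> year 2013, month 3).
# Assumption: month = fractional part * 12, inferred from the examples above;
# values ending in .000 are ambiguous under this reading.
year = data['transaction_date'].astype(int)
month = ((data['transaction_date'] - year) * 12).round().astype(int)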

Train a Regression Model

Your challenge is to explore and prepare the data, identify the features that best predict the price_per_unit label, and train a regression model with the lowest Root Mean Square Error (RMSE) you can achieve (which must be less than 7) when evaluated against a test subset of the data.

Add markdown and code cells as required to create your solution.
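
For reference, the target metric can be computed with scikit-learn's mean_squared_error. A minimal helper sketch (the rmse name is illustrative, not part of the exercise):

import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # Root Mean Square Error; the goal is a value below 7 on the test set
    return np.sqrt(mean_squared_error(y_true, y_pred))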

Data Exploration

In [3]:
# Check for missing (NA) values
data.isnull().sum()
Out[3]:
transaction_date            0
house_age                   0
transit_distance            0
local_convenience_stores    0
latitude                    0
longitude                   0
price_per_unit              0
dtype: int64
In [4]:
# shape
data.shape
Out[4]:
(414, 7)
In [5]:
# Check the distribution of each variable
data.describe()
Out[5]:
       transaction_date   house_age  transit_distance  local_convenience_stores    latitude   longitude  price_per_unit
count        414.000000  414.000000        414.000000                414.000000  414.000000  414.000000      414.000000
mean        2013.148971   17.712560       1083.885689                  4.094203   24.969030  121.533361       37.980193
std            0.281967   11.392485       1262.109595                  2.945562    0.012410    0.015347       13.606488
min         2012.667000    0.000000         23.382840                  0.000000   24.932070  121.473530        7.600000
25%         2012.917000    9.025000        289.324800                  1.000000   24.963000  121.528085       27.700000
50%         2013.167000   16.100000        492.231300                  4.000000   24.971100  121.538630       38.450000
75%         2013.417000   28.150000       1454.279000                  6.000000   24.977455  121.543305       46.600000
max         2013.583000   43.800000       6488.021000                 10.000000   25.014590  121.566270      117.500000
In [6]:
# Price distribution

import pandas as pd
import matplotlib.pyplot as plt

# Plots displayed inline
%matplotlib inline

# Get the label column
label = data['price_per_unit']


# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))

# Plot the histogram   
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')

# Add lines for the mean and median
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)

# Plot the boxplot   
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel('Price')

# Add a title to the Figure
fig.suptitle('Price Distribution')

# Show the figure
plt.show()

The plots show that the price per unit ranges from under 10 to just over 117. The mean (and median) price is close to 38, with most of the data between about 28 and 47. The few values above this range appear in the box plot as small circles, indicating that they are outliers.

In [7]:
# Remove outliers: keep only rows below the 90th percentile of price
q90 = data.price_per_unit.quantile(0.90)
data = data[data['price_per_unit'] < q90]

# Check distribution again
# Get the label column
label = data['price_per_unit']


# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))

# Plot the histogram   
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')

# Add lines for the mean and median
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)

# Plot the boxplot   
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel('Price')

# Add a title to the Figure
fig.suptitle('Price Distribution')

# Show the figure
plt.show()
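
Filtering at the 90th percentile also discards some legitimately expensive houses. A common alternative, sketched here for comparison only (the notebook keeps the percentile filter), is the conventional 1.5 × IQR upper fence:

# Sketch: upper outlier fence using the 1.5 * IQR rule
q1, q3 = data['price_per_unit'].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
# data = data[data['price_per_unit'] < upper_fence]  # alternative filter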
In [8]:
# Plot a histogram for each numeric feature
for col in data.columns[0:-1]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = data[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

Half of the houses were sold by March 2013. The mean house age is close to 18 years, and half of the houses are roughly 16 years old or less. The mean transit distance is close to 1,000 meters, but half of the houses are within about 500 meters of a station. The mean number of local convenience stores is 4, which matches the median. None of these variables is normally distributed.
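
The visual impression of skew can be quantified with pandas; a one-line sketch:

# Skewness per column (values near 0 indicate a roughly symmetric distribution)
data.skew()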

In [9]:
# Check the relationship between each variable and the label

for col in data.columns[0:-1]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = data[col]
    label = data['price_per_unit']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('House Price')
    ax.set_title('Price vs ' + col + ' - correlation: ' + str(correlation))
plt.show()

There is a negative correlation between transit distance and house price, and a weaker positive correlation between the number of local convenience stores and price, though neither is conclusive on its own. Latitude and longitude also show positive correlations with price.
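
The same relationships can be summarized numerically; a short sketch:

# Correlation of each variable with the label, strongest first
data.corr()['price_per_unit'].sort_values(ascending=False)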

In [10]:
# Box plots of price by transaction date and by number of local convenience stores
for col in ['transaction_date', 'local_convenience_stores']:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    data.boxplot(column = 'price_per_unit', by = col, ax = ax)
    ax.set_title('Label by ' + col)
    ax.set_ylabel("House Price")
    plt.xticks(rotation=90)
plt.show()

The plots show only small variation in the relationship between transaction date and house price.
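
As a numeric complement to the box plots, a sketch of the median price per transaction period:

# Median price for each transaction period
data.groupby('transaction_date')['price_per_unit'].median()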

Creating the Model

Preparing datasets

In [11]:
# Separate the features (X) from the label (y)
X, y = data[['transaction_date', 'house_age', 'transit_distance', 'local_convenience_stores', 'latitude',
             'longitude']].values, data['price_per_unit'].values
print('Features:',X[:7], '\nLabels:', y[:7], sep='\n')
Features:
[[2012.917     32.        84.87882   10.        24.98298  121.54024]
 [2012.917     19.5      306.5947     9.        24.98034  121.53951]
 [2013.583     13.3      561.9845     5.        24.98746  121.54391]
 [2013.5       13.3      561.9845     5.        24.98746  121.54391]
 [2012.833      5.       390.5684     5.        24.97937  121.54245]
 [2012.667      7.1     2175.03       3.        24.96305  121.51254]
 [2012.667     34.5      623.4731     7.        24.97933  121.53642]]

Labels:
[37.9 42.2 47.3 54.8 43.1 32.1 40.3]

Split train and test

In [12]:
from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
Training Set: 260 rows
Test Set: 112 rows

Gradient Boosting with Scaled Features

In [27]:
# Train the model with scaled features
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

import numpy as np

# Define preprocessing: scale a subset of the numeric columns
# (0=transaction_date, 1=house_age, 3=local_convenience_stores, 4=latitude)
numeric_features = [0,1,3,4]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Combine preprocessing steps. Note that ColumnTransformer's default
# remainder='drop' excludes the other columns (2=transit_distance,
# 5=longitude) from the model entirely.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor())])


# Fit the pipeline to train a gradient boosting model on the training set
model = pipeline.fit(X_train, y_train)
print (model)
Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True))],
                                                           verbose=False),
                                                  [0, 1, 3, 4])],
                                   verbose=False)),
                ('regressor',
                 GradientBoostingRegressor(alpha=0.9, ccp_a...
                                           learning_rate=0.1, loss='ls',
                                           max_depth=3, max_features=None,
                                           max_leaf_nodes=None,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
                                           min_samples_leaf=1,
                                           min_samples_split=2,
                                           min_weight_fraction_leaf=0.0,
                                           n_estimators=100,
                                           n_iter_no_change=None,
                                           presort='deprecated',
                                           random_state=None, subsample=1.0,
                                           tol=0.0001, validation_fraction=0.1,
                                           verbose=0, warm_start=False))],
         verbose=False)

Evaluating the Model with the Test Dataset

In [28]:
# Evaluate the model with the test data
from sklearn.metrics import mean_squared_error, r2_score

# Get predictions
predictions = model.predict(X_test)

# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('House Price Predictions - Preprocessed, GradientBoosting')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
MSE: 28.790505916042967
RMSE: 5.365678514041162
R2: 0.7329337534387831
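
A single train/test split can be lucky or unlucky. As a sanity check (not in the original notebook), the same pipeline can be scored with cross-validation, assuming scikit-learn 0.22+ for the neg_root_mean_squared_error scorer:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE on the training data (sketch)
scores = -cross_val_score(pipeline, X_train, y_train,
                          scoring='neg_root_mean_squared_error', cv=5)
print('CV RMSE: %.2f +/- %.2f' % (scores.mean(), scores.std()))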

Training a Random Forest Model with Scaled Features

In [29]:
# Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

# Define preprocessing for numeric columns (scale the same subset)
numeric_features = [0,1,3,4]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Combine preprocessing steps (remainder='drop' again excludes
# transit_distance and longitude)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
    ])

# Use a different estimator in the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor())])


# Fit the pipeline to train a random forest model on the training set
model = pipeline.fit(X_train, y_train)
print (model, "\n")

# Get predictions
predictions = model.predict(X_test)

# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('House Price Predictions - Preprocessed, Random Forest')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean=True,
                                                                                  with_std=True))],
                                                           verbose=False),
                                                  [0, 1, 3, 4])],
                                   verbose=False)),
                ('regressor',
                 RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                       criterion='mse', max_depth=None,
                                       max_features='auto', max_leaf_nodes=None,
                                       max_samples=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       n_estimators=100, n_jobs=None,
                                       oob_score=False, random_state=None,
                                       verbose=0, warm_start=False))],
         verbose=False) 

MSE: 31.447752124429492
RMSE: 5.607829537747157
R2: 0.7082846287192563
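
Either model could likely be improved with a small hyperparameter search. A sketch using GridSearchCV over the gradient boosting pipeline (the grid values are illustrative, and gb_pipeline is a hypothetical name):

from sklearn.model_selection import GridSearchCV

# Sketch: tune the gradient boosting step inside the pipeline
gb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', GradientBoostingRegressor())])
param_grid = {'regressor__n_estimators': [100, 200],
              'regressor__learning_rate': [0.05, 0.1],
              'regressor__max_depth': [2, 3]}
search = GridSearchCV(gb_pipeline, param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)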

Use the Trained Model

Save your trained model, and then use it to predict the price-per-unit for the following real estate transactions:

transaction_date  house_age  transit_distance  local_convenience_stores  latitude  longitude
        2013.167       16.2          289.3248                         5  24.98203  121.54348
        2013.000       13.6          4082.015                         0  24.94155  121.50381

Gradient Boosting showed better performance, so that model will be used. Note that the most recently fitted pipeline is the random forest from the previous cell, so the gradient boosting pipeline must be re-fitted before saving, as shown below.
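
A minimal sketch of that re-fit, reusing the preprocessor defined above:

# Re-fit the gradient boosting pipeline so `model` holds it before saving
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor())])
model = pipeline.fit(X_train, y_train)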

In [34]:
import joblib

# Save the model as a pickle file
filename = './housepricemodel.pkl'
joblib.dump(model, filename)


# Load the model from the file
loaded_model = joblib.load(filename)

# Create a numpy array containing a new observation
X_new = np.array([[2013.167,16.2,289.3248,5,24.98203,121.54348],
                  [2013.000,13.6,4082.015,0,24.94155,121.50381]])
#print ('New sample: {}'.format(list(X_new[0])))

# Use the model to predict house price
results = loaded_model.predict(X_new)
print('Predictions:')
for prediction in results:
    print(round(prediction,2))
Predictions:
47.25
15.85