Predicting the selling price of a residential property depends on a number of factors, including the property's age, the availability of local amenities, and its location.
In this challenge, you will use a dataset of real estate sales transactions to predict the price-per-unit of a property based on its features. The price-per-unit in this data is based on a unit measurement of 3.3 square meters.
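Since the label is expressed per 3.3-square-meter unit, converting it to a per-square-meter price is a simple division. A minimal arithmetic sketch (the 42.5 here is an arbitrary example value, not taken from the dataset):
# Convert a price expressed per 3.3-square-meter unit to a per-square-meter price
price_per_unit = 42.5
price_per_sqm = price_per_unit / 3.3
print(f'{price_per_unit} per unit is about {price_per_sqm:.2f} per square meter')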
Citation: The data used in this exercise originates from the following study:
Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
It was obtained from the UCI dataset repository (Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science).
Run the following cell to load the data and view the first few rows.
import pandas as pd
# load the training dataset
data = pd.read_csv('data/real_estate.csv')
data.head()
The data consists of the following variables:

- transaction_date - the transaction date (for example, 2013.250 = March 2013, 2013.500 = June 2013, and so on)
- house_age - the house age (in years)
- transit_distance - the distance to the nearest light rail station (in meters)
- local_convenience_stores - the number of convenience stores within easy walking distance
- latitude - the geographic coordinate, latitude
- longitude - the geographic coordinate, longitude
- price_per_unit - the house price per unit area (this is the label to predict)
Your challenge is to explore and prepare the data, identify the predictive features that will help predict the price_per_unit label, and train a regression model that achieves the lowest possible Root Mean Square Error (RMSE), which must be less than 7, when evaluated against a test subset of the data.
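For reference, RMSE is the square root of the mean squared difference between predicted and actual values:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Lower values are better, and the error is expressed in the same units as the label.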
Add markdown and code cells as required to create your solution.
# Check for missing values
data.isnull().sum()
# Check the number of rows and columns
data.shape
# Summary statistics for the numeric variables
data.describe()
# Price distribution
import matplotlib.pyplot as plt
# Display plots inline
%matplotlib inline
# Get the label column
label = data['price_per_unit']
# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))
# Plot the histogram
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')
# Add lines for the mean and median
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)
# Plot the boxplot
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel('Price')
# Add a title to the Figure
fig.suptitle('Price Distribution')
# Show the figure
plt.show()
The plots show that the price per unit ranges from under 10 to just over 115. However, the mean (and median) price is near 38, with most of the data between roughly 27 and 47. The few values above this range are shown in the box plot as small circles, indicating that they are outliers.
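As a quick numeric check of these observations (a minimal sketch; it only assumes the `data` frame loaded above):
# Inspect quantiles of the label to quantify the central range and the outlier tail
data['price_per_unit'].quantile([0.25, 0.5, 0.75, 0.9, 0.99])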
# Remove outliers: discard the top 10% of prices
q90 = data.price_per_unit.quantile(0.90)
# Keep only rows below the 90th percentile
data = data[data['price_per_unit'] < q90]
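Trimming at a fixed percentile is one choice; a common alternative is the 1.5×IQR rule. A minimal sketch for comparison only (not applied to the data here):
# Alternative outlier rule (illustrative only): keep values within 1.5 * IQR of the quartiles
q1, q3 = data['price_per_unit'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = data['price_per_unit'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print('Rows kept by the IQR rule:', mask.sum(), 'of', len(data))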
# Check distribution again
# Get the label column
label = data['price_per_unit']
# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))
# Plot the histogram
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')
# Add lines for the mean and median
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)
# Plot the boxplot
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel('Price')
# Add a title to the Figure
fig.suptitle('Price Distribution')
# Show the figure
plt.show()
# Plot a histogram for each numeric feature
for col in data.columns[0:-1]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = data[col]
    feature.hist(bins=100, ax=ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
    plt.show()
Half of the houses were sold by March 2013. The mean house age is near 18 years, and half of the houses are 0-15 years old. The mean transit distance is near 1,000 meters, while half of the houses are within about 500 meters of transit. The mean and median number of local convenience stores are both 4. These variables are not normally distributed.
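To back up the non-normality claim numerically, skewness is a quick check (a minimal sketch; values well away from 0 indicate skewed distributions):
# Compute skewness for each feature column
data[data.columns[0:-1]].skew()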
# Check relationships between the features and the label
for col in data.columns[0:-1]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = data[col]
    label = data['price_per_unit']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('House Price')
    ax.set_title('Price vs ' + col + ' - correlation: ' + str(correlation))
    plt.show()
There is a negative correlation between transit distance and house price, and a weaker positive correlation between the number of local convenience stores and the price, though neither is conclusive on its own. Latitude and longitude also have positive correlations with the price.
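The same correlations can be read off in a single table (a minimal sketch, assuming all columns are numeric as above):
# Correlation of every column with the label, sorted for easy reading
data.corr()['price_per_unit'].sort_values(ascending=False)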
# Relationship between transaction date / convenience stores and price
for col in data[['transaction_date', 'local_convenience_stores']]:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    data.boxplot(column='price_per_unit', by=col, ax=ax)
    ax.set_title('Label by ' + col)
    ax.set_ylabel('House Price Distribution')
    plt.xticks(rotation=90)
    plt.show()
The plots show some small variance in the relationship between transaction date and house price.
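A quick way to quantify that variance (a minimal sketch):
# Median price per transaction date; a narrow spread supports the observation above
data.groupby('transaction_date')['price_per_unit'].median()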
# Train the regression model
# Separate features and labels
X, y = data[['transaction_date', 'house_age', 'transit_distance', 'local_convenience_stores',
             'latitude', 'longitude']].values, data['price_per_unit'].values
print('Features:',X[:7], '\nLabels:', y[:7], sep='\n')
from sklearn.model_selection import train_test_split
# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
# Train the model with scaled features
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
# Define preprocessing for the selected numeric columns (scale them)
# Note: ColumnTransformer drops unlisted columns by default, so transit_distance (2)
# and longitude (5) are excluded from the model
numeric_features = [0,1,3,4]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])
# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor())])
# Fit the pipeline to train a gradient boosting model on the training set
model = pipeline.fit(X_train, y_train)
print (model)
# Test the model with the held-out data
from sklearn.metrics import mean_squared_error, r2_score
# Get predictions
predictions = model.predict(X_test)
# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('House Price Predictions - Preprocessed, GradientBoosting')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
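If the RMSE target of 7 is not met, hyperparameter tuning is a natural next step. A hedged sketch using scikit-learn's GridSearchCV; the grid values below are illustrative assumptions, not tuned settings:
from sklearn.model_selection import GridSearchCV
# Illustrative grid over common GradientBoostingRegressor settings
param_grid = {
    'regressor__n_estimators': [100, 200],
    'regressor__learning_rate': [0.05, 0.1],
    'regressor__max_depth': [2, 3]
}
# neg_mean_squared_error is maximized, so the lowest MSE wins
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)
print('Best CV RMSE:', np.sqrt(-search.best_score_))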
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
# Define preprocessing for the selected numeric columns (scale them)
# As above, unlisted columns (transit_distance and longitude) are dropped by default
numeric_features = [0,1,3,4]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])
# Use a different estimator in the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor())])
# Fit the pipeline to train a random forest model on the training set
model = pipeline.fit(X_train, y_train)
print (model, "\n")
# Get predictions
predictions = model.predict(X_test)
# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('House Price Predictions - Preprocessed, Random Forest')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
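A single 70/30 split can be noisy; cross-validation gives a steadier estimate for the pipeline defined above (a minimal sketch reusing X and y from earlier):
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated RMSE for the random forest pipeline
scores = cross_val_score(pipeline, X, y, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print('Cross-validated RMSE: %.3f (+/- %.3f)' % (rmse_scores.mean(), rmse_scores.std()))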
Save your trained model, and then use it to predict the price-per-unit for the following real estate transactions:
| transaction_date | house_age | transit_distance | local_convenience_stores | latitude | longitude |
|---|---|---|---|---|---|
| 2013.167 | 16.2 | 289.3248 | 5 | 24.98203 | 121.54348 |
| 2013.000 | 13.6 | 4082.015 | 0 | 24.94155 | 121.50381 |
import joblib
# Save the model as a pickle file
filename = './housepricemodel.pkl'
joblib.dump(model, filename)
# Load the model from the file
loaded_model = joblib.load(filename)
# Create a numpy array containing the new observations
X_new = np.array([[2013.167, 16.2, 289.3248, 5, 24.98203, 121.54348],
                  [2013.000, 13.6, 4082.015, 0, 24.94155, 121.50381]])
# Use the model to predict house price
results = loaded_model.predict(X_new)
print('Predictions:')
for prediction in results:
    print(round(prediction, 2))
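As a final sanity check (a minimal sketch), the reloaded model should reproduce the test-set RMSE calculated before saving:
# The reloaded pipeline should score identically to the in-memory model
reload_rmse = np.sqrt(mean_squared_error(y_test, loaded_model.predict(X_test)))
print('Reloaded model RMSE:', reload_rmse)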