Machine Learning Models for Forecasting California's Housing Prices

Loading and summarizing data

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
housing = pd.read_csv("/Users/nnthieu/Downloads/housing.csv")
housing.head()
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
In [ ]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
In [ ]:
housing["ocean_proximity"].value_counts()
Out[ ]:
ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
In [ ]:
housing.isnull().sum()
Out[ ]:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
In [ ]:
housing.describe()
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

Look at the histogram of the features to check their units and any abnormalities in their frequencies.

In [ ]:
# How many districts sit at the maximum (capped) housing_median_age of 52?
housing_age_max = housing[housing["housing_median_age"] == max(housing["housing_median_age"])]
housing_age_max.shape
Out[ ]:
(1273, 10)
In [ ]:
import matplotlib.pyplot as plt

# Assuming housing is a DataFrame containing the data
housing.hist(bins=50, figsize=(20,15))
plt.show()
[Figure: histograms of all numerical attributes, 50 bins each]

Notes: median_income is not expressed in dollars; its unit is roughly $10,000 (e.g. 3.0 means about $30,000). The housing median age and the median house value were also capped. The latter may be a serious problem since it is the target attribute (the labels): the machine learning algorithms may learn that prices never go beyond that limit ($500,001 in this data). You need to check with the client team (the team that will use your system's output) to see whether this is a problem. If they need precise predictions even beyond $500,000, there are mainly two options: (a) collect proper labels for the districts whose labels were capped, or (b) remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $500,000). A sketch of option (b) follows.
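
As an illustration of option (b) only, here is a minimal sketch, assuming the cap is the $500,001 maximum seen in describe() above; the variable housing_uncapped is mine and this filtering is not applied in the rest of the notebook.

In [ ]:
# Sketch of option (b), not run in this notebook: drop districts whose target was capped.
capped = housing["median_house_value"] >= 500000
housing_uncapped = housing[~capped].copy()
print(f"Dropped {capped.sum()} capped districts; {len(housing_uncapped)} remain")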

Many histograms are tail-heavy: they extend much farther to the right of the median than to the left, i.e. they are right-skewed.
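
One common way to reduce that right skew, sketched here only for illustration (it is not applied in the rest of this notebook), is a log transform of the heavily skewed count-like columns:

In [ ]:
# Illustrative sketch, not used below: log1p compresses the long right tails.
skewed_cols = ["total_rooms", "total_bedrooms", "population", "households", "median_income"]
housing_log = housing.copy()
housing_log[skewed_cols] = np.log1p(housing_log[skewed_cols])
housing_log[skewed_cols].hist(bins=50, figsize=(12, 8))
plt.show()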

In [ ]:
housing.plot(kind="scatter", x="longitude", y="latitude")
Out[ ]:
<Axes: xlabel='longitude', ylabel='latitude'>
[Figure: scatter plot of district locations, longitude vs. latitude]
In [ ]:
housing_num = housing.drop("ocean_proximity", axis=1)
corr_matrix = housing_num.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[ ]:
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049457
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
In [ ]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
                 alpha=0.1)
Out[ ]:
<Axes: xlabel='median_income', ylabel='median_house_value'>
[Figure: scatter plot of median_house_value vs. median_income, alpha=0.1]

Cleaning data

Replacing missing data with medians

In [ ]:
from sklearn.impute import SimpleImputer

# Assuming housing is your DataFrame and housing_num contains only numerical columns
housing_num = housing.drop("ocean_proximity", axis=1)

# Replace any missing values (NaN) with the median value of each column
imputer = SimpleImputer(strategy='median')
housing_num_imputed = imputer.fit_transform(housing_num)

# Convert the imputed NumPy array back to a DataFrame with the original column names
housing_num_imputed = pd.DataFrame(housing_num_imputed, columns=housing_num.columns)

# Replace the numerical columns in the original DataFrame with the imputed values
housing[housing_num.columns] = housing_num_imputed

# Check if there are any missing values after imputation
housing.isnull().sum()
Out[ ]:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
In [ ]:
# housing["total_bedrooms"].median()
# replace missing values in "total_bedrooms" with its median (435).
# housing["total_bedrooms"] = housing["total_bedrooms"].fillna(435)
# housing.isnull().sum()

Split the data into training and validation sets

In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
# Separate target from predictors
y = housing.median_house_value
X = housing.drop(['median_house_value'], axis=1)
In [ ]:
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)
print(len(X_train_full), "X_train_full +", len(X_valid_full), "X_valid_full")
16512 X_train_full + 4128 X_valid_full
In [ ]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "category"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
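
If you wanted to keep ocean_proximity as a feature (it would then be one-hot encoded by the pipeline defined below), the usual variant of this filter checks for dtype "object" instead. This is a sketch of that alternative, not what the rest of this notebook runs:

In [ ]:
# Alternative sketch (not used below): treat object-dtype columns as categorical,
# so ocean_proximity would be kept and later one-hot encoded by the preprocessor.
categorical_cols_alt = [cname for cname in X_train_full.columns
                        if X_train_full[cname].nunique() < 10
                        and X_train_full[cname].dtype == "object"]
print(categorical_cols_alt)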
In [ ]:
X_train.head()
Out[ ]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
15610 -116.87 32.72 13.0 3268.0 491.0 1431.0 503.0 5.7652
12478 -121.48 38.59 52.0 1186.0 341.0 1038.0 320.0 1.6116
5996 -117.73 34.09 36.0 1543.0 297.0 1355.0 303.0 3.5313
11827 -121.02 39.01 17.0 4786.0 799.0 2066.0 770.0 3.9734
5183 -118.26 33.94 41.0 1510.0 410.0 1408.0 389.0 1.6500
In [ ]:
y_train.head()
Out[ ]:
15610    259900.0
12478     70500.0
5996     117800.0
11827    185400.0
5183      94200.0
Name: median_house_value, dtype: float64

Preparing data before building models

In [ ]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

Building a random forest model

In [ ]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)
In [ ]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
MAE: 32050.43134932171

Validating the random forest model

Using 5-fold cross-validation: the per-fold MAEs do not differ that much from one another, so the model is fairly stable across validation splits. Note that these cross-validation MAEs are noticeably higher than the single 80/20 holdout MAE above; one likely reason is that cross_val_score does not shuffle its folds by default, and the rows of this file are not in random order, so each fold can cover a somewhat different mix of districts (a shuffled-fold check is sketched after the results below).

In [ ]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)
MAE scores:
 [58470.89921027 50076.79370155 47183.21111192 47945.04634205
 52563.55166667]
In [ ]:
print("Average MAE score (across experiments):")
print(scores.mean())
Average MAE score (across experiments):
51247.90040649225
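
One way to check that explanation, given here only as a sketch and not something run above, is to repeat the cross-validation with explicitly shuffled folds:

In [ ]:
# Sketch: cross-validation with shuffled folds (not run above).
from sklearn.model_selection import KFold, cross_val_score

shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = -1 * cross_val_score(my_pipeline, X, y,
                                       cv=shuffled_cv,
                                       scoring='neg_mean_absolute_error')
print("Shuffled-fold MAE scores:", shuffled_scores)
print("Average MAE:", shuffled_scores.mean())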

XGBoost models

Now I try a gradient boosting model (XGBoost), fitting it directly on the training set; since only numerical columns were kept in X_train, no preprocessing pipeline is needed here.

In [ ]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)
Out[ ]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=None, ...)
In [ ]:
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 31466.55344455926

Tuning XGBoost models to find the model with the smallest MAE

In [ ]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 31422.358263355818
In [ ]:
my_model = XGBRegressor(n_estimators=500, early_stopping_rounds=5)
my_model.fit(X_train, y_train, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 32654.97896961784
In [ ]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
my_model.fit(X_train, y_train, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 30812.942529486132
In [ ]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4, early_stopping_rounds=5)
my_model.fit(X_train, y_train, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 30812.942529486132
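
Beyond this manual trial-and-error, a more systematic search could look like the following sketch, using scikit-learn's RandomizedSearchCV over a small, assumed parameter grid (not run in this notebook):

In [ ]:
# Sketch only (not run above): randomized hyperparameter search for XGBRegressor.
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [200, 500, 1000],
    "learning_rate": [0.03, 0.05, 0.1],
    "max_depth": [4, 6, 8],
    "subsample": [0.8, 1.0],
}
search = RandomizedSearchCV(
    XGBRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV MAE:", -search.best_score_)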

Conclusions

Comparing the validation MAEs above, the tuned XGBoost model (n_estimators=1000, learning_rate=0.05, with early stopping) performed best with an MAE of about 30,813, ahead of the other XGBoost variants (roughly 31,422 to 32,655) and the random forest pipeline (about 32,050). I would therefore choose the tuned XGBoost model for forecasting California housing prices.