Machine Learning Models for Forecasting California's Housing Prices¶
Loading and summarizing data¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
housing = pd.read_csv("/Users/nnthieu/Downloads/housing.csv")
housing.head()
  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity
---|---|---|---|---|---|---|---|---|---|---
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
housing["ocean_proximity"].value_counts()
ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
housing.isnull().sum()
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
housing.describe()
  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value
---|---|---|---|---|---|---|---|---|---
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
Look at the histogram of the features to check their units and any abnormalities in their frequencies.
housing_age_max = housing[housing["housing_median_age"]== max(housing["housing_median_age"])]
housing_age_max.shape
(1273, 10)
import matplotlib.pyplot as plt
# Assuming housing is a DataFrame containing the data
housing.hist(bins=50, figsize=(20,15))
plt.show()
Notes: 'median_income' is expressed in tens of thousands of dollars (e.g., 3 means roughly $30,000). The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels): your machine learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system's output) to see whether this is a problem or not. If they tell you that they need precise predictions even beyond $500,000, then you have mainly two options:

a. Collect proper labels for the districts whose labels were capped.

b. Remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $500,000); a sketch of this option is shown below.
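If option (b) were ever chosen, the sketch below shows how the capped districts could be counted and dropped. This is illustrative only and is not applied in the rest of this notebook; the cap value of 500001 comes from the describe() output above.

# Sketch only (not applied below): count and drop districts whose label is capped.
cap_value = 500001.0  # maximum median_house_value reported by describe()
n_capped = (housing["median_house_value"] == cap_value).sum()
print(n_capped, "districts are capped at", cap_value)
housing_uncapped = housing[housing["median_house_value"] < cap_value].copy()
print(housing_uncapped.shape)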
Many of the histograms are tail-heavy: they extend much farther to the right of the median than to the left, i.e., they are right-skewed.
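To back this up with a number (a small check, not computed elsewhere in this notebook), pandas' skew() can be applied to the numeric columns; large positive values confirm a long right tail.

# Positive skewness indicates a long right tail (tail-heavy distribution).
housing.select_dtypes(include="number").skew().sort_values(ascending=False)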
housing.plot(kind="scatter", x="longitude", y="latitude")
<Axes: xlabel='longitude', ylabel='latitude'>
housing_num = housing.drop("ocean_proximity", axis=1)
corr_matrix = housing_num.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049457
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1)
<Axes: xlabel='median_income', ylabel='median_house_value'>
Cleaning data¶
Replacing missing data with medians
from sklearn.impute import SimpleImputer
# Assuming housing is your DataFrame and housing_num contains only numerical columns
housing_num = housing.drop("ocean_proximity", axis=1)
# Replace any missing values (NaN) with the median value of each column
imputer = SimpleImputer(strategy='median')
housing_num_imputed = imputer.fit_transform(housing_num)
# Convert the imputed NumPy array back to a DataFrame with the original column names
housing_num_imputed = pd.DataFrame(housing_num_imputed, columns=housing_num.columns)
# Replace the numerical columns in the original DataFrame with the imputed values
housing[housing_num.columns] = housing_num_imputed
# Check if there are any missing values after imputation
housing.isnull().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
# housing["total_bedrooms"].median()
# replace missing in "total_bedrooms" with it's median (435).
# housing["total_bedrooms"] = housing["total_bedrooms"].fillna(435)
# housing.isnull().sum()
Splitting the data into training and validation sets
import pandas as pd
from sklearn.model_selection import train_test_split
# Separate target from predictors
y = housing.median_house_value
X = housing.drop(['median_house_value'], axis=1)
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
random_state=0)
print(len(X_train_full), "X_train_full +", len(X_valid_full), "X_valid_full")
16512 X_train_full + 4128 X_valid_full
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
X_train_full[cname].dtype == "category"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_train.head()
  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income
---|---|---|---|---|---|---|---|---
15610 | -116.87 | 32.72 | 13.0 | 3268.0 | 491.0 | 1431.0 | 503.0 | 5.7652 |
12478 | -121.48 | 38.59 | 52.0 | 1186.0 | 341.0 | 1038.0 | 320.0 | 1.6116 |
5996 | -117.73 | 34.09 | 36.0 | 1543.0 | 297.0 | 1355.0 | 303.0 | 3.5313 |
11827 | -121.02 | 39.01 | 17.0 | 4786.0 | 799.0 | 2066.0 | 770.0 | 3.9734 |
5183 | -118.26 | 33.94 | 41.0 | 1510.0 | 410.0 | 1408.0 | 389.0 | 1.6500 |
y_train.head()
15610    259900.0
12478     70500.0
5996     117800.0
11827    185400.0
5183      94200.0
Name: median_house_value, dtype: float64
Preparing data before building models
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
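As a quick sanity check (a sketch, not part of the original workflow), the preprocessor can be fitted on its own to confirm how many feature columns it produces; because categorical_cols is empty here, only the eight numerical columns come out.

# Fit the preprocessor alone to inspect the transformed feature matrix.
X_train_prepared = preprocessor.fit_transform(X_train)
print(X_train_prepared.shape)  # expected: (16512, 8) -- numerical columns only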
Building a random forest model¶
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)
from sklearn.metrics import mean_absolute_error
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('model', model)
])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
MAE: 32050.43134932171
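Since the last step of the pipeline is the fitted RandomForestRegressor, its feature importances can be inspected (a sketch added for illustration; the feature order matches numerical_cols because categorical_cols is empty).

# Which features drive the random forest's predictions?
rf = my_pipeline.named_steps["model"]
importances = pd.Series(rf.feature_importances_, index=numerical_cols)
print(importances.sort_values(ascending=False))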
Validating the random forest model¶
Using 5-fold cross-validation (cv=5): the MAE scores do not differ much from fold to fold, so the model is fairly stable when the validation data changes.
from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
cv=5,
scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
MAE scores:
 [58470.89921027 50076.79370155 47183.21111192 47945.04634205 52563.55166667]
print("Average MAE score (across experiments):")
print(scores.mean())
Average MAE score (across experiments):
51247.90040649225
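To put a number on the fold-to-fold variability (a small addition, not part of the original run), the standard deviation of the fold scores can be printed alongside the mean.

# Spread of the per-fold MAEs around their mean.
print("Std of MAE across folds:", scores.std())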
XGBoost models¶
Now I try a gradient boosting model.
from xgboost import XGBRegressor
my_model = XGBRegressor()
my_model.fit(X_train, y_train)
XGBRegressor(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, random_state=None, ...)
from sklearn.metrics import mean_absolute_error
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 31466.55344455926
Tuning XGBoost models to find the model with the smallest MAE¶
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 31422.358263355818
my_model = XGBRegressor(n_estimators=500, early_stopping_rounds=5)
my_model.fit(X_train, y_train,
eval_set=[(X_valid, y_valid)],
verbose=False)
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 32654.97896961784
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
my_model.fit(X_train, y_train,
eval_set=[(X_valid, y_valid)],
verbose=False)
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 30812.942529486132
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4, early_stopping_rounds=5)
my_model.fit(X_train, y_train,
eval_set=[(X_valid, y_valid)],
verbose=False)
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 30812.942529486132
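Instead of varying n_estimators and learning_rate by hand, a grid search could cover the same settings more systematically. The sketch below is illustrative only: the grid values are chosen around the settings tried above, early stopping is omitted because it needs a separate eval_set, and the search takes considerably longer than a single fit.

# Sketch: systematic search over the two parameters varied manually above.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [500, 1000],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)
print("Best CV MAE:", -search.best_score_)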
Conclusions¶
Among these models, the tuned XGBoost model (n_estimators=1000, learning_rate=0.05, with early stopping) gives the smallest validation MAE, about 30,813, compared with about 31,467 for the default XGBoost model and about 32,050 for the random forest pipeline.