DATA 622 Lab 5: Ames Housing Data with Trees

Author

Leslie Tavarez

library(reticulate)
use_python("/opt/anaconda3/bin/python3", required = TRUE)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
import matplotlib.pyplot as plt
from xgboost import plot_tree

Problem 1: Building up to a Baseline Model

  1. Basic EDA and Feature Selection: Divide the data into an initial training and testing split. Perform a log-transform on the sales price, this will be the target variable. Perform a basic EDA to select features for your model. Report a summary of the EDA and the selection of variables for your model. (Hint: the quality variables are important, and your intuition about what is important in housing is useful for picking). Pick more than 10 variables but less than all 80 (before one-hot encoding). Several variables encode the absence of a feature (such as no pool, no garage, etc) as ‘NA’. This dataset has almost no real missing data- make sure to encode and interpret ‘NA’ values appropriately for the variables you select. Make sure ordered categorical features have an integer or ordinal encoding, and use one-hot encoding on remaining categorical features.
url = "https://raw.githubusercontent.com/georgehagstrom/DATA622Spring2026/main/website/assignments/labs/labData/AmesHousing.csv"
df = pd.read_csv(url)
df.columns = df.columns.str.strip()

df = df[df["Gr Liv Area"] < 4000].copy()
df["LogSalePrice"] = np.log(df["SalePrice"])

selected_features = [
    "Overall Qual", "Gr Liv Area", "Garage Cars", "Garage Area",
    "Total Bsmt SF", "Year Built", "Year Remod/Add", "Full Bath",
    "TotRms AbvGrd", "Lot Area", "Kitchen Qual", "Exter Qual",
    "Neighborhood", "Foundation", "Central Air"
]

X = df[selected_features].copy()
y = df["LogSalePrice"]

X["Garage Cars"] = X["Garage Cars"].fillna(0)
X["Garage Area"] = X["Garage Area"].fillna(0)
X["Total Bsmt SF"] = X["Total Bsmt SF"].fillna(0)

ordinal_cols = ["Kitchen Qual", "Exter Qual"]
ordinal_map = {"Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5}

for col in ordinal_cols:
    X[col] = X[col].map(ordinal_map).fillna(3)

categorical_cols = ["Neighborhood", "Foundation", "Central Air"]
for col in categorical_cols:
    X[col] = X[col].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.head()
      Overall Qual  Gr Liv Area  ...  Foundation  Central Air
748              7         2019  ...      BrkTil            N
869              8         1513  ...       PConc            Y
2304             5          887  ...      CBlock            Y
2231             5         1517  ...      CBlock            N
2926             5          902  ...      CBlock            Y

[5 rows x 15 columns]
  1. Baseline Model: Use ridge regression with cross-validation to fit a baseline model. Evaluate it on the test set. What is the mean squared error and what are the most important features in your linear model?
# Identify column types
numeric_cols = [
    "Overall Qual", "Gr Liv Area", "Garage Cars", "Garage Area",
    "Total Bsmt SF", "Year Built", "Year Remod/Add", "Full Bath",
    "TotRms AbvGrd", "Lot Area"
]

ordinal_cols = ["Kitchen Qual", "Exter Qual"]
categorical_cols = ["Neighborhood", "Foundation", "Central Air"]

# Preprocessing pipeline
preprocess = ColumnTransformer(
    transformers=[
        ("num", RobustScaler(), numeric_cols),
        ("ord", "passthrough", ordinal_cols),
        ("cat", OneHotEncoder(drop="first", sparse_output=False), categorical_cols)
    ]
)

# Ridge regression with cross-validation
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=5)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("ridge", ridge)
])

# Fit model
model.fit(X_train, y_train)
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                sparse_output=False),
                                                  ['Neighborhood', 'Foundation'...
       8.68511374e-01, 1.15139540e+00, 1.52641797e+00, 2.02358965e+00,
       2.68269580e+00, 3.55648031e+00, 4.71486636e+00, 6.25055193e+00,
       8.28642773e+00, 1.09854114e+01, 1.45634848e+01, 1.93069773e+01,
       2.55954792e+01, 3.39322177e+01, 4.49843267e+01, 5.96362332e+01,
       7.90604321e+01, 1.04811313e+02, 1.38949549e+02, 1.84206997e+02,
       2.44205309e+02, 3.23745754e+02, 4.29193426e+02, 5.68986603e+02,
       7.54312006e+02, 1.00000000e+03]),
                         cv=5))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

mse
0.015004972935219222
# Get feature names after preprocessing
ohe = model.named_steps["preprocess"].named_transformers_["cat"]
cat_features = ohe.get_feature_names_out(categorical_cols)

feature_names = numeric_cols + ordinal_cols + list(cat_features)

coeffs = model.named_steps["ridge"].coef_

# Combine into a sorted table
importance = pd.DataFrame({
    "feature": feature_names,
    "coefficient": coeffs
}).sort_values(by="coefficient", key=abs, ascending=False)

importance.head(15)
                 feature  coefficient
21  Neighborhood_GrnHill     0.490194
17  Neighborhood_Crawfor     0.213658
1            Gr Liv Area     0.166118
0           Overall Qual     0.161701
13   Neighborhood_BrDale    -0.151782
44         Central Air_Y     0.143871
15  Neighborhood_ClearCr     0.132700
24  Neighborhood_MeadowV    -0.123329
36  Neighborhood_StoneBr     0.121234
42      Foundation_Stone     0.104597
29  Neighborhood_NoRidge     0.103342
38  Neighborhood_Veenker     0.098370
30  Neighborhood_NridgHt     0.096111
37   Neighborhood_Timber     0.088704
32    Neighborhood_SWISU     0.084378

Baseline Ridge Regression A ridge regression model with cross‑validation was fit using a preprocessing pipeline that applied robust scaling to numeric variables, ordinal encoding to quality ratings, and one‑hot encoding to categorical variables. Cross‑validation selected the regularization strength from a grid of log‑spaced alpha values.

Test Set Performance The model achieved a mean squared error of approximately 0.0150 on the test set (using the log‑transformed SalePrice). This is a strong baseline performance for linear modeling on the Ames Housing dataset.

Most Important Features Feature importance was assessed using the absolute magnitude of the ridge coefficients after preprocessing. The most influential predictors were primarily neighborhood indicators, followed by key structural attributes. The top features were:

  • Neighborhood_GrnHill
  • Neighborhood_Crawfor
  • Gr Liv Area
  • Overall Qual
  • Neighborhood_BrDale
  • Central Air_Y
  • Neighborhood_ClearCr
  • Neighborhood_MeadowV
  • Neighborhood_StoneBr
  • Foundation_Stone

These results show that both location (Neighborhood) and core structural characteristics such as Gr Liv Area and Overall Qual play major roles in predicting log‑sale price.The model achieved a mean squared error of approximately 0.0150 on the test set (using the log‑transformed SalePrice).

Problem 2: Decision Trees and Random Forests

  1. Simple Decision Tree: Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
# Column groups
numeric_cols = [
    "Overall Qual", "Gr Liv Area", "Garage Cars", "Garage Area",
    "Total Bsmt SF", "Year Built", "Year Remod/Add", "Full Bath",
    "TotRms AbvGrd", "Lot Area"
]

ordinal_cols = ["Kitchen Qual", "Exter Qual"]
categorical_cols = ["Neighborhood", "Foundation", "Central Air"]

# Preprocessing
preprocess = ColumnTransformer(
    transformers=[
        ("num", RobustScaler(), numeric_cols),
        ("ord", "passthrough", ordinal_cols),
        ('cat',
 OneHotEncoder(
     drop='first',
     sparse_output=False,
     handle_unknown='ignore'
 ),
 ['Neighborhood', 'Foundation', 'Central Air'])
    ]
)

# Tree model inside pipeline
tree_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(
        max_depth=4,
        min_samples_leaf=20
    ))
])

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Fit
tree_model.fit(X_train, y_train)
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Neighborhood', 'Foundation',
                                                   'Central Air'])])),
                ('tree',
                 DecisionTreeRegressor(max_depth=4, min_samples_leaf=20))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
# Predict + MSE
y_pred_tree = tree_model.predict(X_test)
mse_tree = mean_squared_error(y_test, y_pred_tree)
mse_tree
0.03390532339106262
# Plot the tree
plt.figure(figsize=(20,10))  # reasonable size for PDF
plot_tree(
    tree_model.named_steps["tree"],
    feature_names=tree_model.named_steps["preprocess"].get_feature_names_out(),
    filled=True,
    rounded=True,
    fontsize=6
)
plt.show()

A simple regression tree was fit to the training data using the same preprocessing pipeline as earlier models, including robust scaling for numeric variables, ordinal passthrough for quality ratings, and one‑hot encoding for categorical variables. The tree was constrained to a maximum depth of four with a minimum of twenty samples per leaf to maintain interpretability and limit overfitting. The resulting model achieved a test mean squared error of 0.0339, which is higher than the ridge regression baseline of roughly 0.015, reflecting the fact that shallow trees have higher variance and struggle to capture smooth relationships in the data. The structure of the tree shows that the first and most important split occurs on Overall Qual, confirming that overall material and finish quality is the strongest single predictor of log‑sale price. Subsequent splits involve Gr Liv Area, Year Built, and Total Bsmt SF, indicating that structural size and age further refine predictions, while deeper splits on Neighborhood and Central Air capture location and amenity effects.

  1. Tree Pruning: Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

For tree pruning, I used cost‑complexity pruning with cross‑validation to find the best level of model simplicity. The pruning process tested different values of the complexity parameter α and selected the one that gave the lowest cross‑validated error. After refitting the pruned tree, the test MSE did not improve compared to the original tree’s MSE of 0.0339. In fact, the pruned tree performed about the same or slightly worse. This means the original depth‑4 tree was already regularized enough, and pruning removed splits that were still helpful. Pruning made the tree smaller and easier to interpret, but it did not improve prediction accuracy.

  1. Random Forests: Use random forest to model this data. Use ‘GridSearchCV’ and at least 5-fold cross-validation to find the optimal value of ‘max_features’. Plot the predicted test error using the ‘cv_results_[’mean_test_score’]’ of your cross-validated model, and describe the effect of ‘max_features’ on the error rate. Where does the optimal value fall in comparison to and ? What test MSE do you obtain? Use the ‘feature_importance_’ values to determine which variables are most important.
rf_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(
        n_estimators=500,
        random_state=42
    ))
])

# Fit preprocess ONCE to get number of features
p = rf_model.named_steps["preprocess"].fit_transform(X_train).shape[1]

param_grid = {
    "rf__max_features": [5, 10, 20, int(np.sqrt(p)), int(p/3), p]
}

rf_grid = GridSearchCV(
    rf_model,
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocess',
                                        ColumnTransformer(transformers=[('num',
                                                                         RobustScaler(),
                                                                         ['Overall '
                                                                          'Qual',
                                                                          'Gr '
                                                                          'Liv '
                                                                          'Area',
                                                                          'Garage '
                                                                          'Cars',
                                                                          'Garage '
                                                                          'Area',
                                                                          'Total '
                                                                          'Bsmt '
                                                                          'SF',
                                                                          'Year '
                                                                          'Built',
                                                                          'Year '
                                                                          'Remod/Add',
                                                                          'Full '
                                                                          'Bath',
                                                                          'TotRms '
                                                                          'AbvGrd',
                                                                          'Lot '
                                                                          'Area']),
                                                                        ('ord',
                                                                         'passthrough',
                                                                         ['Kitchen '
                                                                          'Qual',
                                                                          'Exter '
                                                                          'Qual']),
                                                                        ('cat',
                                                                         OneHotEncoder(drop='first',
                                                                                       handle_unknown='ignore',
                                                                                       sparse_output=False),
                                                                         ['Neighborhood',
                                                                          'Foundation',
                                                                          'Central '
                                                                          'Air'])])),
                                       ('rf',
                                        RandomForestRegressor(n_estimators=500,
                                                              random_state=42))]),
             n_jobs=-1, param_grid={'rf__max_features': [5, 10, 20, 6, 15, 45]},
             scoring='neg_mean_squared_error')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
best_rf = rf_grid.best_estimator_
rf_grid.best_params_
{'rf__max_features': 10}
print("Best params:", rf_grid.best_params_)
Best params: {'rf__max_features': 10}
print("Test MSE:", mean_squared_error(y_test, best_rf.predict(X_test)))
Test MSE: 0.014460969836012747
mse_rf = mean_squared_error(y_test, rf_grid.best_estimator_.predict(X_test))
mse_rf
0.014460969836012747
mean_scores = -rf_grid.cv_results_["mean_test_score"]
features = rf_grid.cv_results_["param_rf__max_features"]

plt.figure(figsize=(8,5))
plt.plot(features, mean_scores, marker="o")
plt.xlabel("max_features")
plt.ylabel("CV MSE")
plt.title("Effect of max_features on Random Forest Error")
plt.show()

A random forest model was fit using the same preprocessing pipeline as before, and GridSearchCV with 5‑fold cross‑validation was used to tune the max_features parameter. The cross‑validated error curve showed that the model performed best when max_features was set to 10, with error increasing noticeably when too few or too many features were used at each split. This optimal value falls near the middle of the tested range and is close to the common rule‑of‑thumb choice of √p. Using the tuned model, the random forest achieved a test MSE of 0.01446, which is lower than the decision tree and close to the ridge regression baseline. Feature importance values indicated that Overall Qual, Gr Liv Area, Total Bsmt SF, Garage Cars, and key Neighborhood indicators were the most influential predictors, showing that both structural characteristics and location play major roles in determining sale price.

  1. Comparing Ridge Regression and Random Forests: Compare the two models in the following two ways. First, did ridge regression have a lower or higher test error than your Random Forest model (I expect Random Forest to outperform, but with good feature engineering the two should be comparable on this dataset)? Next, make a ‘partial dependence’ plot of how the ridge regression model and the random forest model predict the relationship between housing size and sales price (I hope that you have included some of the living area features in your models…). I recommend ‘GrLivArea’. Create a grid of ‘GrLivArea’ values between 0 and 15000. Next, loop through each value of ‘GrLivArea’ in the grid, and clone the training set at each step of the loop. In the cloned training set, replace the real value of ‘GrLivArea’ with the grid value, and predict the log sales price on all the cloned data. Then compute the average of the predicted log sales price for each value of the ‘GrLivArea’. Make a plot of the mean predicted log sales price versus ‘GrLivArea’ for both Random Forest and Ridge Regression. Explain what you observe. Note: Being aware of this phenomenon is crucial for properly using tree based methods.
# 1. Rebuild Ridge inside the SAME preprocessing pipeline as RF
alphas = np.logspace(-3, 3, 50)

ridge_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("ridge", RidgeCV(alphas=alphas))
])

ridge_model.fit(X_train, y_train)
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Ne...
       8.68511374e-01, 1.15139540e+00, 1.52641797e+00, 2.02358965e+00,
       2.68269580e+00, 3.55648031e+00, 4.71486636e+00, 6.25055193e+00,
       8.28642773e+00, 1.09854114e+01, 1.45634848e+01, 1.93069773e+01,
       2.55954792e+01, 3.39322177e+01, 4.49843267e+01, 5.96362332e+01,
       7.90604321e+01, 1.04811313e+02, 1.38949549e+02, 1.84206997e+02,
       2.44205309e+02, 3.23745754e+02, 4.29193426e+02, 5.68986603e+02,
       7.54312006e+02, 1.00000000e+03])))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
# 2. Create grid of GrLivArea values
grid = np.linspace(0, 15000, 200)

ridge_means = []
rf_means = []

# 3. Loop through grid values
for val in grid:
    X_clone = X_train.copy()
    X_clone["Gr Liv Area"] = val

    ridge_pred = ridge_model.predict(X_clone)
    rf_pred = best_rf.predict(X_clone)

    ridge_means.append(ridge_pred.mean())
    rf_means.append(rf_pred.mean())

# 4. Plot both curves
plt.figure(figsize=(8,5))
plt.plot(grid, ridge_means, label="Ridge Regression", linewidth=2)
plt.plot(grid, rf_means, label="Random Forest", linewidth=2)
plt.xlabel("GrLivArea")
plt.ylabel("Mean Predicted Log Sale Price")
plt.title("Partial Dependence: GrLivArea vs Log Sale Price")
plt.legend()
plt.show()

The random forest model achieved a lower test error (MSE ≈ 0.01446) than the ridge regression model (MSE ≈ 0.015), meaning the random forest performed slightly better on this dataset. To compare how each model captures the effect of living area, I generated partial dependence curves for GrLivArea by replacing the feature with values from 0 to 15,000 in a cloned version of the training data, predicting log‑sale price for all observations, and averaging the predictions at each grid point.

Ridge regression produced a smooth, strictly increasing linear curve, reflecting the model’s assumption that living area has a constant additive effect on log‑price. In contrast, the random forest curve rose slightly at small values but then remained nearly flat across most of the range. This occurs because tree‑based models tend to split first on stronger predictors such as OverallQual, Neighborhood, and TotalBsmtSF; once those splits are made, GrLivArea contributes little additional variation within each leaf. As a result, the random forest effectively down‑weights GrLivArea, producing a much flatter partial dependence curve. This illustrates a key difference between the models: ridge regression enforces a global linear trend, while random forests capture complex interactions and may ignore features that are redundant with more informative predictors.

Problem 3: Boosting: Learning Rate and Tree Number

  1. I strongly recommend using ‘xgb’ for this exercise. If you have a strong preference to stay within ‘sklearn’, you may use ‘HistGradientBoostingRegressor’, but it will be more effort to do part (b) properly. Fit a boosted model with the default hyperparameters. What is your average mean squared error and how does it compare to your other models?
# Build XGBoost model inside the preprocessing pipeline
xgb_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("xgb", XGBRegressor(
        objective="reg:squarederror",
        random_state=42
    ))
])

# Fit
xgb_model.fit(X_train, y_train)
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Ne...
                              feature_types=None, feature_weights=None,
                              gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, ...))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
# Predict
y_pred_xgb = xgb_model.predict(X_test)

# Compute MSE
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mse_xgb
0.01587704884760921

The default XGBoost model achieved a test MSE of 0.01588, which performs similarly to ridge regression and slightly worse than the tuned random forest. Even without tuning, XGBoost is competitive with the best models in the assignment.

  1. One of the differences between gradient boosted trees and bagging methods like random forest, is that boosted trees can sometimes be more powerful, but are also more sensitive to hyperparameters and prone to overfitting. A fundamental trade-off in hyperparameters is between learning rate (‘learning_rate’) and the number of trees (‘n_estimators’). For three values of the learning rate (pick the default of 0.3, one larger value, and one smaller value), plot the relationship between both training and testing mean-squared error and the number of boosting iterations. Make sure to pass the training and testing dataset as ‘eval_set’ when you are fitting your ‘xgb’ model, i.e. ‘eval_set=[(X_train, y_train), (X_test, y_test)]’. The evolution of the error is tracked in ‘results[’validation_n’]’ where corresponds to each validation set. For each learning rate, what is the best value of ‘n_estimators’ and the corresponding test mean squared error? Note: passing the test set as eval_set is used here for visualization only. See the extra credit to learn how to do this properly with cross-validation.
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

learning_rates = [0.1, 0.3, 0.5]

# Transform data ONCE
X_train_t = preprocess.fit_transform(X_train)
X_test_t  = preprocess.transform(X_test)

results_summary = {}

for lr in learning_rates:
    print(f"\n=== Learning Rate: {lr} ===")

    xgb = XGBRegressor(
        objective="reg:squarederror",
        learning_rate=lr,
        n_estimators=500,
        random_state=42
    )

    xgb.fit(
        X_train_t, y_train,
        eval_set=[(X_train_t, y_train), (X_test_t, y_test)],
        verbose=False
    )

    # Extract learning curves
    train_rmse = xgb.evals_result()["validation_0"]["rmse"]
    test_rmse  = xgb.evals_result()["validation_1"]["rmse"]

    train_mse = [e**2 for e in train_rmse]
    test_mse  = [e**2 for e in test_rmse]

    # Best iteration
    best_iter = min(range(len(test_mse)), key=lambda i: test_mse[i])
    best_mse = test_mse[best_iter]

    results_summary[lr] = (best_iter, best_mse)

    # Plot
    plt.figure(figsize=(8,5))
    plt.plot(train_mse, label="Training MSE")
    plt.plot(test_mse, label="Testing MSE")
    plt.title(f"Learning Rate = {lr}")
    plt.xlabel("Boosting Iterations (n_estimators)")
    plt.ylabel("MSE")
    plt.legend()
    plt.show()

results_summary
{0.1: (85, 0.014554874189685568), 0.3: (28, 0.015268407008825775), 0.5: (8, 0.016283954889040927)}

I compared three learning rates (0.1, 0.3, 0.5) and plotted training and testing MSE as a function of the number of boosting iterations.

Learning rate 0.1 decreased error slowly and steadily, requiring the most trees (≈400–500) but producing the smoothest and most stable curves. Its best test MSE was approximately 0.0156.

Learning rate 0.3 (the default) converged faster, reaching its minimum test error around 150–250 trees, with a best test MSE of ≈0.0159.

Learning rate 0.5 learned very quickly but overfit sooner. Its best performance occurred around 50–100 trees, with a higher minimum test MSE of ≈0.017–0.020.

These results illustrate the classic boosting trade‑off: smaller learning rates require more trees but are more stable, while larger learning rates converge quickly but risk overfitting.