DATA 622 Lab 5: Ames Housing Data with Trees

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                sparse_output=False),
                                                  ['Neighborhood', 'Foundation'...
       8.68511374e-01, 1.15139540e+00, 1.52641797e+00, 2.02358965e+00,
       2.68269580e+00, 3.55648031e+00, 4.71486636e+00, 6.25055193e+00,
       8.28642773e+00, 1.09854114e+01, 1.45634848e+01, 1.93069773e+01,
       2.55954792e+01, 3.39322177e+01, 4.49843267e+01, 5.96362332e+01,
       7.90604321e+01, 1.04811313e+02, 1.38949549e+02, 1.84206997e+02,
       2.44205309e+02, 3.23745754e+02, 4.29193426e+02, 5.68986603e+02,
       7.54312006e+02, 1.00000000e+03]),
                         cv=5))])

preprocess: ColumnTransformer

ColumnTransformer(transformers=[('num', RobustScaler(),
                                 ['Overall Qual', 'Gr Liv Area', 'Garage Cars',
                                  'Garage Area', 'Total Bsmt SF', 'Year Built',
                                  'Year Remod/Add', 'Full Bath',
                                  'TotRms AbvGrd', 'Lot Area']),
                                ('ord', 'passthrough',
                                 ['Kitchen Qual', 'Exter Qual']),
                                ('cat',
                                 OneHotEncoder(drop='first',
                                               sparse_output=False),
                                 ['Neighborhood', 'Foundation',
                                  'Central Air'])])

num

['Overall Qual', 'Gr Liv Area', 'Garage Cars', 'Garage Area', 'Total Bsmt SF', 'Year Built', 'Year Remod/Add', 'Full Bath', 'TotRms AbvGrd', 'Lot Area']

RobustScaler

RobustScaler()

ord

['Kitchen Qual', 'Exter Qual']

passthrough

passthrough

cat

['Neighborhood', 'Foundation', 'Central Air']

OneHotEncoder

?Documentation for RidgeCV

OneHotEncoder(drop='first', sparse_output=False)

RidgeCV

RidgeCV(alphas=array([1.00000000e-03, 1.32571137e-03, 1.75751062e-03, 2.32995181e-03,
       3.08884360e-03, 4.09491506e-03, 5.42867544e-03, 7.19685673e-03,
       9.54095476e-03, 1.26485522e-02, 1.67683294e-02, 2.22299648e-02,
       2.94705170e-02, 3.90693994e-02, 5.17947468e-02, 6.86648845e-02,
       9.10298178e-02, 1.20679264e-01, 1.59985872e-01, 2.12095089e-01,
       2.81176870e-01, 3.72759372e-0...
       8.68511374e-01, 1.15139540e+00, 1.52641797e+00, 2.02358965e+00,
       2.68269580e+00, 3.55648031e+00, 4.71486636e+00, 6.25055193e+00,
       8.28642773e+00, 1.09854114e+01, 1.45634848e+01, 1.93069773e+01,
       2.55954792e+01, 3.39322177e+01, 4.49843267e+01, 5.96362332e+01,
       7.90604321e+01, 1.04811313e+02, 1.38949549e+02, 1.84206997e+02,
       2.44205309e+02, 3.23745754e+02, 4.29193426e+02, 5.68986603e+02,
       7.54312006e+02, 1.00000000e+03]),
        cv=5)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

mse

0.015004972935219222

# Get feature names after preprocessing
ohe = model.named_steps["preprocess"].named_transformers_["cat"]
cat_features = ohe.get_feature_names_out(categorical_cols)

feature_names = numeric_cols + ordinal_cols + list(cat_features)

coeffs = model.named_steps["ridge"].coef_

# Combine into a sorted table
importance = pd.DataFrame({
    "feature": feature_names,
    "coefficient": coeffs
}).sort_values(by="coefficient", key=abs, ascending=False)

importance.head(15)

                 feature  coefficient
21  Neighborhood_GrnHill     0.490194
17  Neighborhood_Crawfor     0.213658
1            Gr Liv Area     0.166118
0           Overall Qual     0.161701
13   Neighborhood_BrDale    -0.151782
44         Central Air_Y     0.143871
15  Neighborhood_ClearCr     0.132700
24  Neighborhood_MeadowV    -0.123329
36  Neighborhood_StoneBr     0.121234
42      Foundation_Stone     0.104597
29  Neighborhood_NoRidge     0.103342
38  Neighborhood_Veenker     0.098370
30  Neighborhood_NridgHt     0.096111
37   Neighborhood_Timber     0.088704
32    Neighborhood_SWISU     0.084378

Baseline Ridge Regression A ridge regression model with cross‑validation was fit using a preprocessing pipeline that applied robust scaling to numeric variables, ordinal encoding to quality ratings, and one‑hot encoding to categorical variables. Cross‑validation selected the regularization strength from a grid of log‑spaced alpha values.

Test Set Performance The model achieved a mean squared error of approximately 0.0150 on the test set (using the log‑transformed SalePrice). This is a strong baseline performance for linear modeling on the Ames Housing dataset.

Most Important Features Feature importance was assessed using the absolute magnitude of the ridge coefficients after preprocessing. The most influential predictors were primarily neighborhood indicators, followed by key structural attributes. The top features were:

Neighborhood_GrnHill
Neighborhood_Crawfor
Gr Liv Area
Overall Qual
Neighborhood_BrDale
Central Air_Y
Neighborhood_ClearCr
Neighborhood_MeadowV
Neighborhood_StoneBr
Foundation_Stone

These results show that both location (Neighborhood) and core structural characteristics such as Gr Liv Area and Overall Qual play major roles in predicting log‑sale price.The model achieved a mean squared error of approximately 0.0150 on the test set (using the log‑transformed SalePrice).

Problem 2: Decision Trees and Random Forests

Simple Decision Tree: Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

# Column groups
numeric_cols = [
    "Overall Qual", "Gr Liv Area", "Garage Cars", "Garage Area",
    "Total Bsmt SF", "Year Built", "Year Remod/Add", "Full Bath",
    "TotRms AbvGrd", "Lot Area"
]

ordinal_cols = ["Kitchen Qual", "Exter Qual"]
categorical_cols = ["Neighborhood", "Foundation", "Central Air"]

# Preprocessing
preprocess = ColumnTransformer(
    transformers=[
        ("num", RobustScaler(), numeric_cols),
        ("ord", "passthrough", ordinal_cols),
        ('cat',
 OneHotEncoder(
     drop='first',
     sparse_output=False,
     handle_unknown='ignore'
 ),
 ['Neighborhood', 'Foundation', 'Central Air'])
    ]
)

# Tree model inside pipeline
tree_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(
        max_depth=4,
        min_samples_leaf=20
    ))
])

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Fit
tree_model.fit(X_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Neighborhood', 'Foundation',
                                                   'Central Air'])])),
                ('tree',
                 DecisionTreeRegressor(max_depth=4, min_samples_leaf=20))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Neighborhood', 'Foundation',
                                                   'Central Air'])])),
                ('tree',
                 DecisionTreeRegressor(max_depth=4, min_samples_leaf=20))])

preprocess: ColumnTransformer

ColumnTransformer(transformers=[('num', RobustScaler(),
                                 ['Overall Qual', 'Gr Liv Area', 'Garage Cars',
                                  'Garage Area', 'Total Bsmt SF', 'Year Built',
                                  'Year Remod/Add', 'Full Bath',
                                  'TotRms AbvGrd', 'Lot Area']),
                                ('ord', 'passthrough',
                                 ['Kitchen Qual', 'Exter Qual']),
                                ('cat',
                                 OneHotEncoder(drop='first',
                                               handle_unknown='ignore',
                                               sparse_output=False),
                                 ['Neighborhood', 'Foundation',
                                  'Central Air'])])

num

['Overall Qual', 'Gr Liv Area', 'Garage Cars', 'Garage Area', 'Total Bsmt SF', 'Year Built', 'Year Remod/Add', 'Full Bath', 'TotRms AbvGrd', 'Lot Area']

RobustScaler

RobustScaler()

ord

['Kitchen Qual', 'Exter Qual']

passthrough

passthrough

cat

['Neighborhood', 'Foundation', 'Central Air']

OneHotEncoder

?Documentation for DecisionTreeRegressor

OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False)

DecisionTreeRegressor

DecisionTreeRegressor(max_depth=4, min_samples_leaf=20)

# Predict + MSE
y_pred_tree = tree_model.predict(X_test)
mse_tree = mean_squared_error(y_test, y_pred_tree)
mse_tree

0.03390532339106262

# Plot the tree
plt.figure(figsize=(20,10))  # reasonable size for PDF
plot_tree(
    tree_model.named_steps["tree"],
    feature_names=tree_model.named_steps["preprocess"].get_feature_names_out(),
    filled=True,
    rounded=True,
    fontsize=6
)
plt.show()

A simple regression tree was fit to the training data using the same preprocessing pipeline as earlier models, including robust scaling for numeric variables, ordinal passthrough for quality ratings, and one‑hot encoding for categorical variables. The tree was constrained to a maximum depth of four with a minimum of twenty samples per leaf to maintain interpretability and limit overfitting. The resulting model achieved a test mean squared error of 0.0339, which is higher than the ridge regression baseline of roughly 0.015, reflecting the fact that shallow trees have higher variance and struggle to capture smooth relationships in the data. The structure of the tree shows that the first and most important split occurs on Overall Qual, confirming that overall material and finish quality is the strongest single predictor of log‑sale price. Subsequent splits involve Gr Liv Area, Year Built, and Total Bsmt SF, indicating that structural size and age further refine predictions, while deeper splits on Neighborhood and Central Air capture location and amenity effects.

Tree Pruning: Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

For tree pruning, I used cost‑complexity pruning with cross‑validation to find the best level of model simplicity. The pruning process tested different values of the complexity parameter α and selected the one that gave the lowest cross‑validated error. After refitting the pruned tree, the test MSE did not improve compared to the original tree’s MSE of 0.0339. In fact, the pruned tree performed about the same or slightly worse. This means the original depth‑4 tree was already regularized enough, and pruning removed splits that were still helpful. Pruning made the tree smaller and easier to interpret, but it did not improve prediction accuracy.

Random Forests: Use random forest to model this data. Use ‘GridSearchCV’ and at least 5-fold cross-validation to find the optimal value of ‘max_features’. Plot the predicted test error using the ‘cv_results_[’mean_test_score’]’ of your cross-validated model, and describe the effect of ‘max_features’ on the error rate. Where does the optimal value fall in comparison to and ? What test MSE do you obtain? Use the ‘feature_importance_’ values to determine which variables are most important.

rf_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(
        n_estimators=500,
        random_state=42
    ))
])

# Fit preprocess ONCE to get number of features
p = rf_model.named_steps["preprocess"].fit_transform(X_train).shape[1]

param_grid = {
    "rf__max_features": [5, 10, 20, int(np.sqrt(p)), int(p/3), p]
}

rf_grid = GridSearchCV(
    rf_model,
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

best_rf = rf_grid.best_estimator_
rf_grid.best_params_

{'rf__max_features': 10}

print("Best params:", rf_grid.best_params_)

Best params: {'rf__max_features': 10}

print("Test MSE:", mean_squared_error(y_test, best_rf.predict(X_test)))

Test MSE: 0.014460969836012747

mse_rf = mean_squared_error(y_test, rf_grid.best_estimator_.predict(X_test))
mse_rf

0.014460969836012747

mean_scores = -rf_grid.cv_results_["mean_test_score"]
features = rf_grid.cv_results_["param_rf__max_features"]

plt.figure(figsize=(8,5))
plt.plot(features, mean_scores, marker="o")
plt.xlabel("max_features")
plt.ylabel("CV MSE")
plt.title("Effect of max_features on Random Forest Error")
plt.show()

A random forest model was fit using the same preprocessing pipeline as before, and GridSearchCV with 5‑fold cross‑validation was used to tune the max_features parameter. The cross‑validated error curve showed that the model performed best when max_features was set to 10, with error increasing noticeably when too few or too many features were used at each split. This optimal value falls near the middle of the tested range and is close to the common rule‑of‑thumb choice of √p. Using the tuned model, the random forest achieved a test MSE of 0.01446, which is lower than the decision tree and close to the ridge regression baseline. Feature importance values indicated that Overall Qual, Gr Liv Area, Total Bsmt SF, Garage Cars, and key Neighborhood indicators were the most influential predictors, showing that both structural characteristics and location play major roles in determining sale price.

Comparing Ridge Regression and Random Forests: Compare the two models in the following two ways. First, did ridge regression have a lower or higher test error than your Random Forest model (I expect Random Forest to outperform, but with good feature engineering the two should be comparable on this dataset)? Next, make a ‘partial dependence’ plot of how the ridge regression model and the random forest model predict the relationship between housing size and sales price (I hope that you have included some of the living area features in your models…). I recommend ‘GrLivArea’. Create a grid of ‘GrLivArea’ values between 0 and 15000. Next, loop through each value of ‘GrLivArea’ in the grid, and clone the training set at each step of the loop. In the cloned training set, replace the real value of ‘GrLivArea’ with the grid value, and predict the log sales price on all the cloned data. Then compute the average of the predicted log sales price for each value of the ‘GrLivArea’. Make a plot of the mean predicted log sales price versus ‘GrLivArea’ for both Random Forest and Ridge Regression. Explain what you observe. Note: Being aware of this phenomenon is crucial for properly using tree based methods.

# 1. Rebuild Ridge inside the SAME preprocessing pipeline as RF
alphas = np.logspace(-3, 3, 50)

ridge_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("ridge", RidgeCV(alphas=alphas))
])

ridge_model.fit(X_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Ne...
       8.68511374e-01, 1.15139540e+00, 1.52641797e+00, 2.02358965e+00,
       2.68269580e+00, 3.55648031e+00, 4.71486636e+00, 6.25055193e+00,
       8.28642773e+00, 1.09854114e+01, 1.45634848e+01, 1.93069773e+01,
       2.55954792e+01, 3.39322177e+01, 4.49843267e+01, 5.96362332e+01,
       7.90604321e+01, 1.04811313e+02, 1.38949549e+02, 1.84206997e+02,
       2.44205309e+02, 3.23745754e+02, 4.29193426e+02, 5.68986603e+02,
       7.54312006e+02, 1.00000000e+03])))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Ne...
       8.68511374e-01, 1.15139540e+00, 1.52641797e+00, 2.02358965e+00,
       2.68269580e+00, 3.55648031e+00, 4.71486636e+00, 6.25055193e+00,
       8.28642773e+00, 1.09854114e+01, 1.45634848e+01, 1.93069773e+01,
       2.55954792e+01, 3.39322177e+01, 4.49843267e+01, 5.96362332e+01,
       7.90604321e+01, 1.04811313e+02, 1.38949549e+02, 1.84206997e+02,
       2.44205309e+02, 3.23745754e+02, 4.29193426e+02, 5.68986603e+02,
       7.54312006e+02, 1.00000000e+03])))])

preprocess: ColumnTransformer

ColumnTransformer(transformers=[('num', RobustScaler(),
                                 ['Overall Qual', 'Gr Liv Area', 'Garage Cars',
                                  'Garage Area', 'Total Bsmt SF', 'Year Built',
                                  'Year Remod/Add', 'Full Bath',
                                  'TotRms AbvGrd', 'Lot Area']),
                                ('ord', 'passthrough',
                                 ['Kitchen Qual', 'Exter Qual']),
                                ('cat',
                                 OneHotEncoder(drop='first',
                                               handle_unknown='ignore',
                                               sparse_output=False),
                                 ['Neighborhood', 'Foundation',
                                  'Central Air'])])

num

['Overall Qual', 'Gr Liv Area', 'Garage Cars', 'Garage Area', 'Total Bsmt SF', 'Year Built', 'Year Remod/Add', 'Full Bath', 'TotRms AbvGrd', 'Lot Area']

RobustScaler

RobustScaler()

ord

['Kitchen Qual', 'Exter Qual']

passthrough

passthrough

cat

['Neighborhood', 'Foundation', 'Central Air']

OneHotEncoder

?Documentation for RidgeCV

OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False)

RidgeCV

RidgeCV(alphas=array([1.00000000e-03, 1.32571137e-03, 1.75751062e-03, 2.32995181e-03,
       3.08884360e-03, 4.09491506e-03, 5.42867544e-03, 7.19685673e-03,
       9.54095476e-03, 1.26485522e-02, 1.67683294e-02, 2.22299648e-02,
       2.94705170e-02, 3.90693994e-02, 5.17947468e-02, 6.86648845e-02,
       9.10298178e-02, 1.20679264e-01, 1.59985872e-01, 2.12095089e-01,
       2.81176870e-01, 3.72759372e-0...
       8.68511374e-01, 1.15139540e+00, 1.52641797e+00, 2.02358965e+00,
       2.68269580e+00, 3.55648031e+00, 4.71486636e+00, 6.25055193e+00,
       8.28642773e+00, 1.09854114e+01, 1.45634848e+01, 1.93069773e+01,
       2.55954792e+01, 3.39322177e+01, 4.49843267e+01, 5.96362332e+01,
       7.90604321e+01, 1.04811313e+02, 1.38949549e+02, 1.84206997e+02,
       2.44205309e+02, 3.23745754e+02, 4.29193426e+02, 5.68986603e+02,
       7.54312006e+02, 1.00000000e+03]))

# 2. Create grid of GrLivArea values
grid = np.linspace(0, 15000, 200)

ridge_means = []
rf_means = []

# 3. Loop through grid values
for val in grid:
    X_clone = X_train.copy()
    X_clone["Gr Liv Area"] = val

    ridge_pred = ridge_model.predict(X_clone)
    rf_pred = best_rf.predict(X_clone)

    ridge_means.append(ridge_pred.mean())
    rf_means.append(rf_pred.mean())

# 4. Plot both curves
plt.figure(figsize=(8,5))
plt.plot(grid, ridge_means, label="Ridge Regression", linewidth=2)
plt.plot(grid, rf_means, label="Random Forest", linewidth=2)
plt.xlabel("GrLivArea")
plt.ylabel("Mean Predicted Log Sale Price")
plt.title("Partial Dependence: GrLivArea vs Log Sale Price")
plt.legend()
plt.show()

The random forest model achieved a lower test error (MSE ≈ 0.01446) than the ridge regression model (MSE ≈ 0.015), meaning the random forest performed slightly better on this dataset. To compare how each model captures the effect of living area, I generated partial dependence curves for GrLivArea by replacing the feature with values from 0 to 15,000 in a cloned version of the training data, predicting log‑sale price for all observations, and averaging the predictions at each grid point.

Ridge regression produced a smooth, strictly increasing linear curve, reflecting the model’s assumption that living area has a constant additive effect on log‑price. In contrast, the random forest curve rose slightly at small values but then remained nearly flat across most of the range. This occurs because tree‑based models tend to split first on stronger predictors such as OverallQual, Neighborhood, and TotalBsmtSF; once those splits are made, GrLivArea contributes little additional variation within each leaf. As a result, the random forest effectively down‑weights GrLivArea, producing a much flatter partial dependence curve. This illustrates a key difference between the models: ridge regression enforces a global linear trend, while random forests capture complex interactions and may ignore features that are redundant with more informative predictors.

Problem 3: Boosting: Learning Rate and Tree Number

I strongly recommend using ‘xgb’ for this exercise. If you have a strong preference to stay within ‘sklearn’, you may use ‘HistGradientBoostingRegressor’, but it will be more effort to do part (b) properly. Fit a boosted model with the default hyperparameters. What is your average mean squared error and how does it compare to your other models?

# Build XGBoost model inside the preprocessing pipeline
xgb_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("xgb", XGBRegressor(
        objective="reg:squarederror",
        random_state=42
    ))
])

# Fit
xgb_model.fit(X_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Ne...
                              feature_types=None, feature_weights=None,
                              gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, ...))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num', RobustScaler(),
                                                  ['Overall Qual',
                                                   'Gr Liv Area', 'Garage Cars',
                                                   'Garage Area',
                                                   'Total Bsmt SF',
                                                   'Year Built',
                                                   'Year Remod/Add',
                                                   'Full Bath', 'TotRms AbvGrd',
                                                   'Lot Area']),
                                                 ('ord', 'passthrough',
                                                  ['Kitchen Qual',
                                                   'Exter Qual']),
                                                 ('cat',
                                                  OneHotEncoder(drop='first',
                                                                handle_unknown='ignore',
                                                                sparse_output=False),
                                                  ['Ne...
                              feature_types=None, feature_weights=None,
                              gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, ...))])

preprocess: ColumnTransformer

ColumnTransformer(transformers=[('num', RobustScaler(),
                                 ['Overall Qual', 'Gr Liv Area', 'Garage Cars',
                                  'Garage Area', 'Total Bsmt SF', 'Year Built',
                                  'Year Remod/Add', 'Full Bath',
                                  'TotRms AbvGrd', 'Lot Area']),
                                ('ord', 'passthrough',
                                 ['Kitchen Qual', 'Exter Qual']),
                                ('cat',
                                 OneHotEncoder(drop='first',
                                               handle_unknown='ignore',
                                               sparse_output=False),
                                 ['Neighborhood', 'Foundation',
                                  'Central Air'])])

num

['Overall Qual', 'Gr Liv Area', 'Garage Cars', 'Garage Area', 'Total Bsmt SF', 'Year Built', 'Year Remod/Add', 'Full Bath', 'TotRms AbvGrd', 'Lot Area']

RobustScaler

RobustScaler()

ord

['Kitchen Qual', 'Exter Qual']

passthrough

passthrough

cat

['Neighborhood', 'Foundation', 'Central Air']

OneHotEncoder