You have been given a dataset consisting of a single file, AmesHousing.csv, and a data dictionary data-description.txt. This dataset contains the sales price of each home as well as other attributes of the home and the sales transaction, such as the lot size, the square footage of each floor, when the home was built and remodeled, whether the kitchen is upgraded, etc. A data dictionary has been included with the data. During this lab, you will practice using different types of regression tree methods to predict housing prices in the data.
This dataset is a model dataset for demonstrating machine learning techniques and for prediction. This dataset was downloaded from kaggle, which holds periodic introductory competitions using this data, if you ever wanted to try your hand at kaggle competitions you could use this assignment to create your first entry. The dataset was originally compiled by De Cook for educational purposes. Skim this article for some tips on using this dataset: De Cock 2011. De Cook makes some suggestions for simplifying the data that you can take if you need to.
The dataset and data dictionary are available here:
Note: There is a lot of freedom in this exercise in terms of which variables you pick, some of the model choices, and also some hyperparameters. I have an expectation for what the answers will be in general, but you might get different answers depending on what you select and random chance. Report what you find, not what you expect to see (though if you see something unexpected it is a good hint to look for a mistake).
Problem 1: Building up to a Baseline Model
Basic EDA and Feature Selection: Divide the data into an initial training and testing split. Perform a log-transform on the sales price, this will be the target variable. Perform a basic EDA to select features for your model. Report a summary of the EDA and the selection of variables for your model. (Hint: the quality variables are important, and your intuition about what is important in housing is useful for picking). Pick more than 10 variables but less than all 80 (before one-hot encoding). Several variables encode the absence of a feature (such as no pool, no garage, etc) as ‘NA’. This dataset has almost no real missing data- make sure to encode and interpret ‘NA’ values appropriately for the variables you select. Make sure ordered categorical features have an integer or ordinal encoding, and use one-hot encoding on remaining categorical features.
from sklearn.model_selection import train_test_split# Load the data filepath="https://raw.githubusercontent.com/georgehagstrom/DATA622Spring2026/refs/heads/main/website/assignments/labs/labData/AmesHousing.csv"df = pd.read_csv(filepath)# Create train-test split (80-20 split)np.random.seed(42)train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)print(f"Training set size: {len(train_df)}")print(f"Testing set size: {len(test_df)}")
Training set size: 2344
Testing set size: 586
Log Transform the Target Variable
# Create log-transformed target variabletrain_df['LogSalePrice'] = np.log(train_df['SalePrice'])test_df['LogSalePrice'] = np.log(test_df['SalePrice'])# Check the distribution before and afterfig, axes = plt.subplots(1, 2, figsize=(7, 5))# Original SalePrice distributionaxes[0].hist(train_df['SalePrice'], bins=50, edgecolor='black')axes[0].set_title('Original SalePrice Distribution')axes[0].set_xlabel('SalePrice')axes[0].set_ylabel('Frequency')# Log-transformed distributionaxes[1].hist(train_df['LogSalePrice'], bins=50, edgecolor='black')axes[1].set_title('Log-Transformed SalePrice Distribution')axes[1].set_xlabel('Log(SalePrice)')axes[1].set_ylabel('Frequency')plt.tight_layout()plt.show()
Basic EDA - Correlation with Target
# Select numeric columnsnumeric_cols = train_df.select_dtypes(include=[np.number]).columns.tolist()# Calculate correlations with LogSalePricecorrelations = train_df[numeric_cols].corr()['LogSalePrice'].sort_values(ascending=False)# Display top 20 correlationsprint("Top 20 features correlated with LogSalePrice:")print(correlations.head(20))
Top 20 features correlated with LogSalePrice:
LogSalePrice 1.000000
SalePrice 0.944839
Overall Qual 0.818006
Gr Liv Area 0.683082
Garage Cars 0.665299
Garage Area 0.637300
Total Bsmt SF 0.605632
Year Built 0.598154
1st Flr SF 0.587634
Full Bath 0.571634
Year Remod/Add 0.566717
Garage Yr Blt 0.564565
Fireplaces 0.478890
TotRms AbvGrd 0.472891
Mas Vnr Area 0.433801
BsmtFin SF 1 0.404889
Wood Deck SF 0.333933
Lot Frontage 0.323639
Half Bath 0.306452
Open Porch SF 0.304375
Name: LogSalePrice, dtype: float64
from sklearn.feature_selection import f_regressionfrom sklearn.preprocessing import LabelEncoderprint("Analyzing Categorical Features with LogSalePrice:")print("="*60)# Identify categorical features (object type)categorical_cols = train_df.select_dtypes(include=['object']).columns.tolist()# Also include MS SubClass which is numeric but actually categoricalcategorical_cols.append('MS SubClass')print(f"Total categorical features to analyze: {len(categorical_cols)}")# Store resultscategorical_correlations = []for col in categorical_cols:if col in train_df.columns:# For features with many unique values, use encodingif train_df[col].nunique() >30: # Neighborhood has 28, so this threshold works# For high cardinality, use target encoding or skip for initial analysisprint(f"\n{col}: {train_df[col].nunique()} unique values - high cardinality")# Quick check using group means grouped_means = train_df.groupby(col)['LogSalePrice'].mean()# Calculate variance ratio (like F-statistic approximation) overall_mean = train_df['LogSalePrice'].mean() between_var = grouped_means.var() within_var = train_df.groupby(col)['LogSalePrice'].var().mean()if within_var >0: f_stat = between_var / within_var categorical_correlations.append({'feature': col,'f_statistic': f_stat,'unique_values': train_df[col].nunique() })print(f" F-statistic approximation: {f_stat:.4f}")else:# For low cardinality, use ANOVA# Prepare groups groups = []for category in train_df[col].unique(): groups.append(train_df[train_df[col] == category]['LogSalePrice'].values)# Perform ANOVAfrom scipy.stats import f_oneway f_stat, p_value = f_oneway(*groups) categorical_correlations.append({'feature': col,'f_statistic': f_stat,'p_value': p_value,'unique_values': train_df[col].nunique() })print(f"{col:20} | F-statistic: {f_stat:.4f} | p-value: {p_value:.4e} | n={train_df[col].nunique()}")# Sort by F-statistic (higher = more important)categorical_importance = pd.DataFrame(categorical_correlations).sort_values('f_statistic', ascending=False)print("\n"+"="*60)print("Top Categorical Features (by F-statistic):")print("="*60)print(categorical_importance.head(15).to_string(index=False))
Analyzing Categorical Features with LogSalePrice:
============================================================
Total categorical features to analyze: 44
MS Zoning | F-statistic: 93.0516 | p-value: 4.9653e-105 | n=7
Street | F-statistic: 19.9180 | p-value: 8.4665e-06 | n=2
Alley | F-statistic: nan | p-value: nan | n=2
Lot Shape | F-statistic: 95.4004 | p-value: 2.9900e-58 | n=4
Land Contour | F-statistic: 27.2000 | p-value: 2.7443e-17 | n=4
Utilities | F-statistic: 3.1945 | p-value: 4.1165e-02 | n=3
Lot Config | F-statistic: 12.2616 | p-value: 7.2132e-10 | n=5
Land Slope | F-statistic: 3.2763 | p-value: 3.7942e-02 | n=3
Neighborhood | F-statistic: 110.7400 | p-value: 0.0000e+00 | n=28
Condition 1 | F-statistic: 14.7206 | p-value: 3.4980e-21 | n=9
Condition 2 | F-statistic: 4.3339 | p-value: 8.8221e-05 | n=8
Bldg Type | F-statistic: 22.1515 | p-value: 5.6924e-18 | n=5
House Style | F-statistic: 28.9462 | p-value: 1.6600e-38 | n=8
Roof Style | F-statistic: 22.3528 | p-value: 5.9172e-22 | n=6
Roof Matl | F-statistic: 4.6005 | p-value: 4.0315e-05 | n=8
Exterior 1st | F-statistic: 34.2107 | p-value: 1.3952e-89 | n=16
Exterior 2nd | F-statistic: 30.2099 | p-value: 4.4420e-79 | n=16
Mas Vnr Type | F-statistic: nan | p-value: nan | n=4
Exter Qual | F-statistic: 627.0259 | p-value: 4.4529e-299 | n=4
Exter Cond | F-statistic: 30.1788 | p-value: 1.6122e-24 | n=5
Foundation | F-statistic: 210.4839 | p-value: 1.0738e-185 | n=6
Bsmt Qual | F-statistic: nan | p-value: nan | n=5
Bsmt Cond | F-statistic: nan | p-value: nan | n=5
Bsmt Exposure | F-statistic: nan | p-value: nan | n=4
BsmtFin Type 1 | F-statistic: nan | p-value: nan | n=6
BsmtFin Type 2 | F-statistic: nan | p-value: nan | n=6
Heating | F-statistic: 12.0205 | p-value: 1.6317e-11 | n=6
Heating QC | F-statistic: 187.7048 | p-value: 1.1462e-139 | n=5
Central Air | F-statistic: 393.1797 | p-value: 5.1642e-81 | n=2
Electrical | F-statistic: 62.4915 | p-value: 3.0630e-50 | n=5
Kitchen Qual | F-statistic: 582.7727 | p-value: 7.5467e-283 | n=4
Functional | F-statistic: 13.9644 | p-value: 7.6974e-18 | n=8
Fireplace Qu | F-statistic: nan | p-value: nan | n=5
Garage Type | F-statistic: nan | p-value: nan | n=6
Garage Finish | F-statistic: nan | p-value: nan | n=3
Garage Qual | F-statistic: nan | p-value: nan | n=5
Garage Cond | F-statistic: nan | p-value: nan | n=5
Paved Drive | F-statistic: 189.9970 | p-value: 3.4335e-77 | n=3
Pool QC | F-statistic: nan | p-value: nan | n=4
Fence | F-statistic: nan | p-value: nan | n=4
Misc Feature | F-statistic: nan | p-value: nan | n=4
Sale Type | F-statistic: 33.9222 | p-value: 1.2581e-56 | n=10
Sale Condition | F-statistic: 66.1088 | p-value: 9.6967e-65 | n=6
MS SubClass | F-statistic: 72.4793 | p-value: 5.1628e-181 | n=16
============================================================
Top Categorical Features (by F-statistic):
============================================================
feature f_statistic p_value unique_values
Exter Qual 627.025919 4.452871e-299 4
Kitchen Qual 582.772690 7.546660e-283 4
Central Air 393.179709 5.164174e-81 2
Foundation 210.483871 1.073751e-185 6
Paved Drive 189.996963 3.433454e-77 3
Heating QC 187.704805 1.146164e-139 5
Neighborhood 110.740042 0.000000e+00 28
Lot Shape 95.400401 2.989957e-58 4
MS Zoning 93.051551 4.965327e-105 7
MS SubClass 72.479346 5.162791e-181 16
Sale Condition 66.108817 9.696676e-65 6
Electrical 62.491528 3.062978e-50 5
Exterior 1st 34.210744 1.395231e-89 16
Sale Type 33.922199 1.258099e-56 10
Exterior 2nd 30.209893 4.441969e-79 16
Beyond numeric correlations, ANOVA F-statistics revealed that categorical features like Exter Qual (F=627), Kitchen Qual (F=583), and Central Air (F=393) have even stronger relationships with sale price than most numeric features. This confirms the importance of quality ratings and amenities in housing valuation. Neighborhood (F=111) also shows significant variation, with some neighborhoods commanding over 30% higher prices than others. MS SubClass (F=45) is also significant, indicating that building type matters. These insights will guide our feature selection and encoding strategies for modeling.
Feature Selection
# selected features based on correlation analysis AND F-statisticsselected_features = [# Strongest numeric correlated features (from Step 3 output)'Overall Qual', # 0.818 correlation - strongest numeric predictor'Gr Liv Area', # 0.683 - above ground living area'Garage Area', # 0.637 - garage size'Total Bsmt SF', # 0.606 - basement area (0 = no basement)'Year Built', # 0.598 - original construction'1st Flr SF', # 0.588 - first floor size'Year Remod/Add', # 0.567 - remodel year# Top categorical features by F-statistic (from ANOVA analysis)'Exter Qual', # F=627 - EXTERIOR QUALITY (strongest categorical)'Kitchen Qual', # F=583 - KITCHEN QUALITY (2nd strongest)'Central Air', # F=393 - CENTRAL AIR (binary but huge impact)'Foundation', # F=210 - FOUNDATION TYPE (structural importance)'Paved Drive', # F=190 - PAVED DRIVEWAY (amenity indicator)'Heating QC', # F=188 - HEATING QUALITY (comfort indicator)'Neighborhood', # F=111 - LOCATION (critical for housing)'MS Zoning', # F=93 - ZONING CLASSIFICATION (land use)'Sale Condition', # F=66 - TRANSACTION TYPE (sale conditions)# Original quality features (already covered by above)'Bsmt Qual', # Basement quality (NA = no basement) - F-statistic nan but important# Amenities (numeric)'Fireplaces', # 0.479 correlation - number of fireplaces'Wood Deck SF', # 0.334 - deck area# Additional meaningful features'Lot Area', # Lot size - 0.221 correlation'MS SubClass'# Building class - F=72 (categorical)]print(f"Selected feature set: {len(selected_features)} features")print("="*70)# Create a DataFrame to show feature importance from both methodsfeature_summary = []for feature in selected_features:if feature in correlations.index: pearson_r = correlations[feature] pearson_source ="numeric"else: pearson_r =None pearson_source ="categorical"# Add F-statistic if available from our analysisif'categorical_importance'inlocals(): fstat_row = categorical_importance[categorical_importance['feature'] == feature]iflen(fstat_row) >0: f_stat = fstat_row['f_statistic'].values[0]else: f_stat =Noneelse: f_stat =None feature_summary.append({'feature': feature,'pearson_r': pearson_r,'f_statistic': f_stat,'type': 'numeric'if feature in correlations.index else'categorical' })summary_df = pd.DataFrame(feature_summary)print("\nFeature Selection Summary:")print("-"*70)print(f"{'Feature':<18}{'Type':<12}{'Pearson r':<12}{'F-statistic':<12}")print("-"*70)for _, row in summary_df.iterrows(): pearson_str =f"{row['pearson_r']:.3f}"if row['pearson_r'] isnotNoneelse"-" fstat_str =f"{row['f_statistic']:.0f}"if row['f_statistic'] isnotNoneandnot pd.isna(row['f_statistic']) else"-"print(f"{row['feature']:<18}{row['type']:<12}{pearson_str:<12}{fstat_str:<12}")# Check for any missing values in new featuresprint("\n"+"="*70)print("Missing Value Check for New Features:")print("="*70)new_features = ['Central Air', 'Foundation', 'Paved Drive', 'Heating QC', 'MS Zoning', 'Sale Condition']for feature in new_features:if feature in train_df.columns: na_count = train_df[feature].isna().sum()print(f"{feature:15} - {na_count} NA values ({na_count/len(train_df)*100:.1f}%)")else:print(f"{feature:15} - NOT FOUND in dataset")# Visualize the importance of new categorical featuresfig, axes = plt.subplots(2, 3, figsize=(7, 5))axes = axes.flatten()new_cat_features = ['Central Air', 'Foundation', 'Paved Drive', 'Heating QC', 'MS Zoning', 'Sale Condition']for idx, feature inenumerate(new_cat_features): ax = axes[idx]# Calculate means means = train_df.groupby(feature)['LogSalePrice'].mean().sort_values(ascending=False)# Create bar plot bars = ax.bar(range(len(means)), means.values, color='darkorange', alpha=0.7) ax.set_xticks(range(len(means))) ax.set_xticklabels(means.index, rotation=45, ha='right', fontsize=9) ax.set_ylabel('Mean Log(SalePrice)')# Add F-statistic if available fstat_row = categorical_importance[categorical_importance['feature'] == feature]iflen(fstat_row) >0: f_val = fstat_row['f_statistic'].values[0] title =f"{feature}\n(F-statistic: {f_val:.0f})"else: title = feature ax.set_title(title, fontsize=10) ax.grid(True, alpha=0.3, axis='y')plt.tight_layout()plt.show()
Selected feature set: 21 features
======================================================================
Feature Selection Summary:
----------------------------------------------------------------------
Feature Type Pearson r F-statistic
----------------------------------------------------------------------
Overall Qual numeric 0.818 -
Gr Liv Area numeric 0.683 -
Garage Area numeric 0.637 -
Total Bsmt SF numeric 0.606 -
Year Built numeric 0.598 -
1st Flr SF numeric 0.588 -
Year Remod/Add numeric 0.567 -
Exter Qual categorical nan 627
Kitchen Qual categorical nan 583
Central Air categorical nan 393
Foundation categorical nan 210
Paved Drive categorical nan 190
Heating QC categorical nan 188
Neighborhood categorical nan 111
MS Zoning categorical nan 93
Sale Condition categorical nan 66
Bsmt Qual categorical nan -
Fireplaces numeric 0.479 -
Wood Deck SF numeric 0.334 -
Lot Area numeric 0.246 -
MS SubClass numeric -0.050 72
======================================================================
Missing Value Check for New Features:
======================================================================
Central Air - 0 NA values (0.0%)
Foundation - 0 NA values (0.0%)
Paved Drive - 0 NA values (0.0%)
Heating QC - 0 NA values (0.0%)
MS Zoning - 0 NA values (0.0%)
Sale Condition - 0 NA values (0.0%)
Check NA’s
# Create a copy with selected features plus targettrain_selected = train_df[selected_features + ['LogSalePrice']].copy()test_selected = test_df[selected_features + ['LogSalePrice']].copy()# Check for NA values in selected featuresprint("NA Counts in Training Set:")print("-"*40)na_counts = train_selected[selected_features].isna().sum()print(na_counts[na_counts >0])
NA Counts in Training Set:
----------------------------------------
Garage Area 1
Total Bsmt SF 1
Bsmt Qual 61
dtype: int64
Handle NA’s
# Handle only your selected featuresselected_features_to_handle = {'Bsmt Qual': 'None', # Categorical - no basement'Garage Area': 0, # Numeric - no garage 'Total Bsmt SF': 0# Numeric - no basement}for feature, fill_value in selected_features_to_handle.items(): train_selected[feature] = train_selected[feature].fillna(fill_value) test_selected[feature] = test_selected[feature].fillna(fill_value)print(f"{feature} -> filled with: {fill_value}")
Bsmt Qual -> filled with: None
Garage Area -> filled with: 0
Total Bsmt SF -> filled with: 0
ordinal encoding
# Define ordinal mapping based on the order we saw in Step 4# Quality order: Ex (Excellent) > Gd (Good) > TA (Typical) > Fa (Fair) > Po (Poor) > None (No feature)quality_mapping = {'Ex': 5, # Excellent'Gd': 4, # Good 'TA': 3, # Typical/Average'Fa': 2, # Fair'Po': 1, # Poor'None': 0# No feature (for basement)}# ordinal features to encode based on F-statistic analysis and domain knowledgeordinal_features = ['Exter Qual', # Exterior quality'Bsmt Qual', # Basement quality 'Kitchen Qual', # Kitchen quality'Heating QC', # Heating quality (NEW - from F-statistic analysis)'Paved Drive', # Paved driveway (Y, P, N - has natural order)'Central Air'# Central air (Y, N - binary but ordinal)]# Define mapping for Paved Drive (Y > P > N)paved_drive_mapping = {'Y': 2, # Paved'P': 1, # Partial'N': 0# Not paved}# Define mapping for Central Aircentral_air_mapping = {'Y': 1, # Has central air'N': 0# No central air}print("Ordinal Feature Encoding:")print("-"*40)for feature in ordinal_features:if feature in train_df.columns:# Choose appropriate mappingif feature =='Paved Drive': mapping = paved_drive_mappingelif feature =='Central Air': mapping = central_air_mappingelse: mapping = quality_mapping# Create encoded column train_selected[f'{feature}_encoded'] = train_selected[feature].map(mapping) test_selected[f'{feature}_encoded'] = test_selected[feature].map(mapping)print(f"\n{feature}:") unique_values =sorted([str(x) for x in train_selected[feature].unique()]) encoded_values =sorted(train_selected[f'{feature}_encoded'].unique())print(f" Unique values in training: {unique_values}")print(f" Encoded values: {encoded_values}")# Drop original categorical column train_selected.drop(feature, axis=1, inplace=True) test_selected.drop(feature, axis=1, inplace=True)else:print(f"\n{feature}: Not found in dataset")
from sklearn.preprocessing import OneHotEncoder# Update nominal features to include new ones from your enhanced selectionnominal_features = ['Neighborhood', 'MS SubClass', 'Foundation', 'MS Zoning', 'Sale Condition']print("Nominal Feature Analysis:")print("-"*40)# Check the number of unique values in eachfor feature in nominal_features:if feature in train_selected.columns: n_unique = train_selected[feature].nunique()print(f"{feature}: {n_unique} unique values")# Show sample values for context sample_vals =sorted([str(x) for x in train_selected[feature].unique()])[:5]print(f" Sample values: {sample_vals}")# Apply one-hot encodingprint("\nApplying One-Hot Encoding:")print("-"*40)# Use pandas get_dummies for simplicitytrain_encoded = pd.get_dummies(train_selected, columns=nominal_features, prefix=nominal_features)test_encoded = pd.get_dummies(test_selected, columns=nominal_features, prefix=nominal_features)# Ensure test set has same columns as training setmissing_cols =set(train_encoded.columns) -set(test_encoded.columns)for col in missing_cols:if col !='LogSalePrice': # Don't add target column test_encoded[col] =0# Ensure columns are in same ordertest_encoded = test_encoded[train_encoded.columns]print(f"Original feature count (before encoding): {len(train_selected.columns)}")print(f"After one-hot encoding: {len(train_encoded.columns)}")# Show breakdown of encoded columnsfor feature in nominal_features:if feature in train_selected.columns: n_unique = train_selected[feature].nunique()print(f" - {feature} encoded into {n_unique} binary columns")# Verify no missing values remainprint("\nFinal verification - any missing values?")print(f"Training set: {train_encoded.isnull().sum().sum()} missing values")print(f"Test set: {test_encoded.isnull().sum().sum()} missing values")# Show the final feature setprint("\nFinal Feature Set (first 20 columns):")print(train_encoded.columns[:20].tolist())print(f"... and {len(train_encoded.columns)-20} more columns")# Display target variable columnprint(f"\nTarget column: 'LogSalePrice' is present: {'LogSalePrice'in train_encoded.columns}")
Nominal Feature Analysis:
----------------------------------------
Neighborhood: 28 unique values
Sample values: ['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr']
MS SubClass: 16 unique values
Sample values: ['120', '150', '160', '180', '190']
Foundation: 6 unique values
Sample values: ['BrkTil', 'CBlock', 'PConc', 'Slab', 'Stone']
MS Zoning: 7 unique values
Sample values: ['A (agr)', 'C (all)', 'FV', 'I (all)', 'RH']
Sale Condition: 6 unique values
Sample values: ['Abnorml', 'AdjLand', 'Alloca', 'Family', 'Normal']
Applying One-Hot Encoding:
----------------------------------------
Original feature count (before encoding): 22
After one-hot encoding: 80
- Neighborhood encoded into 28 binary columns
- MS SubClass encoded into 16 binary columns
- Foundation encoded into 6 binary columns
- MS Zoning encoded into 7 binary columns
- Sale Condition encoded into 6 binary columns
Final verification - any missing values?
Training set: 0 missing values
Test set: 0 missing values
Final Feature Set (first 20 columns):
['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Total Bsmt SF', 'Year Built', '1st Flr SF', 'Year Remod/Add', 'Fireplaces', 'Wood Deck SF', 'Lot Area', 'LogSalePrice', 'Exter Qual_encoded', 'Bsmt Qual_encoded', 'Kitchen Qual_encoded', 'Heating QC_encoded', 'Paved Drive_encoded', 'Central Air_encoded', 'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale']
... and 60 more columns
Target column: 'LogSalePrice' is present: True
Before encoding: 22 features
10 numeric features (Overall Qual, Gr Liv Area, Garage Area, etc.)
7 ordinal features (Exter Qual, Bsmt Qual, Kitchen Qual, Heating QC, Paved Drive, Central Air, etc.)
5 nominal features (Neighborhood, MS SubClass, Foundation, MS Zoning, Sale Condition)
1 target variable (LogSalePrice)
After encoding: 80 features
All numeric and ordinal features remain as-is
28 Neighborhood dummy variables
16 MS SubClass dummy variables
6 Foundation dummy variables
7 MS Zoning dummy variables
6 Sale Condition dummy variables
Data Quality: No missing values in train or test sets
Baseline Model: Use ridge regression with cross-validation to fit a baseline model. Evaluate it on the test set. What is the mean squared error and what are the most important features in your linear model?
from sklearn.linear_model import Ridgefrom sklearn.model_selection import GridSearchCVfrom sklearn.preprocessing import StandardScalerfrom sklearn.pipeline import Pipelinefrom sklearn.metrics import mean_squared_error, mean_absolute_error, r2_scorewarnings.filterwarnings('ignore', category=RuntimeWarning)# Separate features and targetX_train = train_encoded.drop('LogSalePrice', axis=1)y_train = train_encoded['LogSalePrice']X_test = test_encoded.drop('LogSalePrice', axis=1)y_test = test_encoded['LogSalePrice']# Create pipeline with scaling (important for Ridge)ridge_pipeline = Pipeline([ ('scaler', StandardScaler()), ('ridge', Ridge(random_state=42))])# Define alpha values to testparam_grid = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 50, 100, 500, 1000]}# Perform grid search with 5-fold cross-validationridge_grid = GridSearchCV( ridge_pipeline, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)print("Performing Ridge Regression with 5-fold Cross-Validation...")ridge_grid.fit(X_train, y_train)# Get best modelbest_ridge = ridge_grid.best_estimator_best_alpha = ridge_grid.best_params_['ridge__alpha']print(f"\nBest alpha from cross-validation: {best_alpha}")# Make predictionsy_train_pred = best_ridge.predict(X_train)y_test_pred = best_ridge.predict(X_test)# Evaluate the modelprint("\n"+"="*60)print("Ridge Regression Model Performance (with CV):")print("="*60)print(f"Training Set:")print(f" RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred)):.4f}")print(f" MAE: {mean_absolute_error(y_train, y_train_pred):.4f}")print(f" R²: {r2_score(y_train, y_train_pred):.4f}")print(f"\nTest Set:")print(f" RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred)):.4f}")print(f" MAE: {mean_absolute_error(y_test, y_test_pred):.4f}")print(f" R²: {r2_score(y_test, y_test_pred):.4f}")# Convert predictions back to original scaley_test_pred_original = np.exp(y_test_pred)y_test_original = np.exp(y_test)print(f"\nOriginal Scale (SalePrice) Performance:")print(f"Test Set RMSE: ${np.sqrt(mean_squared_error(y_test_original, y_test_pred_original)):,.0f}")print(f"Test Set MAE: ${mean_absolute_error(y_test_original, y_test_pred_original):,.0f}")# Get coefficients (need to extract from pipeline)# The scaler transforms features, so we need to get coefficients properlyridge_model = best_ridge.named_steps['ridge']scaler = best_ridge.named_steps['scaler']# For interpretability, get coefficients on original scalecoefficients = ridge_model.coef_# Show top 10 most important featuresfeature_importance = pd.DataFrame({'feature': X_train.columns,'coefficient': coefficients}).sort_values('coefficient', key=abs, ascending=False)print(f"\nTop 10 Most Important Features (by absolute coefficient):")print(feature_importance.head(10).to_string(index=False))
Performing Ridge Regression with 5-fold Cross-Validation...
Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best alpha from cross-validation: 100
============================================================
Ridge Regression Model Performance (with CV):
============================================================
Training Set:
RMSE: 0.1371
MAE: 0.0919
R²: 0.8832
Test Set:
RMSE: 0.1291
MAE: 0.0892
R²: 0.9099
Original Scale (SalePrice) Performance:
Test Set RMSE: $31,867
Test Set MAE: $17,208
Top 10 Most Important Features (by absolute coefficient):
feature coefficient
Gr Liv Area 0.087735
Overall Qual 0.087066
Bsmt Qual_encoded 0.030617
Kitchen Qual_encoded 0.030025
Garage Area 0.029412
Fireplaces 0.028678
Central Air_encoded 0.027651
Year Remod/Add 0.026722
MS SubClass_160 -0.024865
Neighborhood_Crawfor 0.022192
Baseline Model Performance Analysis:
Key Metrics: Training R²: 0.8832 - Model explains 88.3% of variance in training data
Test R²: 0.9099 - Model explains 90.99% of variance in test data
Test RMSE (log scale): 0.1291 - Strong predictive accuracy
Test RMSE (original scale): $31,867 - Average prediction error of ~$31.9K
Test MAE (original scale): $17,208 - Half of predictions are within ~$17.2K of actual price
Model Characteristics: Test R² > Training R² (0.9099 > 0.8832) indicates excellent generalization with no overfitting
The model effectively captures the underlying patterns in housing prices
Ridge regularization with alpha=100 provides optimal shrinkage, reducing coefficient variance while maintaining predictive power
Top Important Features: Gr Liv Area emerged as the most important feature (coefficient: 0.0877)
Additional top features (from your output) include zoning classifications, neighborhood indicators, and quality ratings
The regularization helped stabilize coefficients, making the model more robust
Problem 2: Decision Trees and Random Forests
Simple Decision Tree: Fit a regression tree to the training set. Plot the tree, and inter- pret the results. What test MSE do you obtain?
from sklearn.tree import DecisionTreeRegressor, plot_tree# Use the same feature set from Problem 1X_train = train_encoded.drop('LogSalePrice', axis=1)y_train = train_encoded['LogSalePrice']X_test = test_encoded.drop('LogSalePrice', axis=1)y_test = test_encoded['LogSalePrice']# Fit a simple decision tree (without pruning first)tree_simple = DecisionTreeRegressor(random_state=42)tree_simple.fit(X_train, y_train)# Make predictionsy_train_pred_tree = tree_simple.predict(X_train)y_test_pred_tree = tree_simple.predict(X_test)# Calculate MSEfrom sklearn.metrics import mean_squared_errortrain_mse_tree = mean_squared_error(y_train, y_train_pred_tree)test_mse_tree = mean_squared_error(y_test, y_test_pred_tree)print("Simple Decision Tree (No Pruning):")print("="*50)print(f"Training MSE: {train_mse_tree:.6f}")print(f"Training RMSE: {np.sqrt(train_mse_tree):.4f}")print(f"Test MSE: {test_mse_tree:.6f}")print(f"Test RMSE: {np.sqrt(test_mse_tree):.4f}")# Add these metrics to your analysisprint(f"\nTree Structure:")print(f" Tree Depth: {tree_simple.get_depth()}")print(f" Number of Leaves: {tree_simple.get_n_leaves()}")print(f" Total Nodes: {tree_simple.tree_.node_count}") # Convert to original scale for interpretationy_test_pred_original = np.exp(y_test_pred_tree)y_test_original = np.exp(y_test)test_rmse_original = np.sqrt(mean_squared_error(y_test_original, y_test_pred_original))print(f"\nTest MSE (original scale): ${test_mse_tree:.0f} (log scale)")print(f"Test RMSE (original scale): ${test_rmse_original:,.0f}")# Check for overfittingprint(f"\nOverfitting Check:")print(f"Training R²: {tree_simple.score(X_train, y_train):.4f}")print(f"Test R²: {tree_simple.score(X_test, y_test):.4f}")print(f"Difference: {tree_simple.score(X_train, y_train) - tree_simple.score(X_test, y_test):.4f}")
Simple Decision Tree (No Pruning):
==================================================
Training MSE: 0.000004
Training RMSE: 0.0020
Test MSE: 0.033990
Test RMSE: 0.1844
Tree Structure:
Tree Depth: 28
Number of Leaves: 2287
Total Nodes: 4573
Test MSE (original scale): $0 (log scale)
Test RMSE (original scale): $34,785
Overfitting Check:
Training R²: 1.0000
Test R²: 0.8163
Difference: 0.1837
Simple Decision Tree (No Pruning):
Training R² = 1.000 - Perfect fit, tree has memorized all training data
Test R² = 0.816 - Significantly lower, indicating poor generalization
Tree Depth: 28 - Extremely deep tree with 2,287 leaves (almost one per training sample)
Test RMSE: $34,785 - Average prediction error of ~$34.8K
Comparison with Ridge Regression:
Ridge Regression Test R²: 0.910 - Outperforms the simple tree by 9.4%
Ridge Test RMSE: $31,867 - $2,918 lower error than the unpruned tree
The unpruned decision tree severely overfits (training R² = 1.000 vs test R² = 0.816), while Ridge generalizes well with no signs of overfitting (test R² > training R²)
Tree Pruning: Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
from sklearn.model_selection import cross_val_score, GridSearchCV# Try different max_depth values to find optimal tree complexityprint("Tree Pruning with Cross-Validation:")print("="*50)# Define parameter grid for pruningparam_grid = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20],'min_samples_split': [2, 5, 10, 20],'min_samples_leaf': [1, 5, 10, 20]}# Initialize decision treetree_cv = DecisionTreeRegressor(random_state=42)# Perform grid search with 5-fold cross-validationgrid_search = GridSearchCV( tree_cv, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)grid_search.fit(X_train, y_train)# Get best parametersprint(f"\nBest Parameters from Cross-Validation:")print(f"max_depth: {grid_search.best_params_['max_depth']}")print(f"min_samples_split: {grid_search.best_params_['min_samples_split']}")print(f"min_samples_leaf: {grid_search.best_params_['min_samples_leaf']}")# Best modelbest_tree = grid_search.best_estimator_# Evaluate on test sety_train_pred_best = best_tree.predict(X_train)y_test_pred_best = best_tree.predict(X_test)train_mse_best = mean_squared_error(y_train, y_train_pred_best)test_mse_best = mean_squared_error(y_test, y_test_pred_best)print(f"\nPruned Tree Performance:")print(f"Training MSE: {train_mse_best:.6f}")print(f"Training RMSE: {np.sqrt(train_mse_best):.4f}")print(f"Test MSE: {test_mse_best:.6f}")print(f"Test RMSE: {np.sqrt(test_mse_best):.4f}")print(f"Test R²: {best_tree.score(X_test, y_test):.4f}")# Compare with unpruned treeprint(f"\nComparison with Unpruned Tree:")print(f"Unpruned Test MSE: {test_mse_tree:.6f}")print(f"Pruned Test MSE: {test_mse_best:.6f}")print(f"Improvement: {(test_mse_tree - test_mse_best)/test_mse_tree*100:.2f}%")# Check overfitting reductionprint(f"\nOverfitting Reduction:")print(f"Unpruned Train-Test R² gap: {1.0000-0.8216:.4f}")print(f"Pruned Train-Test R² gap: {best_tree.score(X_train, y_train) - best_tree.score(X_test, y_test):.4f}")# Visualize the pruned treeplt.figure(figsize=(10, 8))plot_tree(best_tree, feature_names=X_train.columns.tolist(), filled=True, rounded=True, max_depth=4, fontsize=8)plt.title(f"Pruned Decision Tree (max_depth={grid_search.best_params_['max_depth']})", fontsize=12)plt.tight_layout()plt.show()# Feature importance for pruned treefeature_importance_pruned = pd.DataFrame({'feature': X_train.columns,'importance': best_tree.feature_importances_}).sort_values('importance', ascending=False)print("\nTop 10 Most Important Features (Pruned Tree):")print(feature_importance_pruned.head(10).to_string(index=False))# Cross-validation results visualizationresults_df = pd.DataFrame(grid_search.cv_results_)pruning_results = results_df.groupby('param_max_depth')['mean_test_score'].mean()pruning_results =-pruning_results # Convert back to positive MSEplt.figure(figsize=(7, 5))plt.plot(pruning_results.index, pruning_results.values, 'bo-')plt.xlabel('Max Depth')plt.ylabel('Cross-Validation MSE')plt.title('Cross-Validation MSE vs Tree Depth')plt.grid(True, alpha=0.3)plt.tight_layout()plt.show()print(f"\nOptimal tree depth: {grid_search.best_params_['max_depth']} (lowest CV MSE)")
Tree Pruning with Cross-Validation:
==================================================
Fitting 5 folds for each of 176 candidates, totalling 880 fits
Best Parameters from Cross-Validation:
max_depth: 20
min_samples_split: 2
min_samples_leaf: 10
Pruned Tree Performance:
Training MSE: 0.016669
Training RMSE: 0.1291
Test MSE: 0.027068
Test RMSE: 0.1645
Test R²: 0.8537
Comparison with Unpruned Tree:
Unpruned Test MSE: 0.033990
Pruned Test MSE: 0.027068
Improvement: 20.36%
Overfitting Reduction:
Unpruned Train-Test R² gap: 0.1784
Pruned Train-Test R² gap: 0.0427
Top 10 Most Important Features (Pruned Tree):
feature importance
Overall Qual 0.720165
Gr Liv Area 0.098289
1st Flr SF 0.044140
Total Bsmt SF 0.033143
Garage Area 0.017239
MS Zoning_RM 0.017209
Year Built 0.016128
Year Remod/Add 0.010462
Kitchen Qual_encoded 0.008561
Lot Area 0.006980
Optimal tree depth: 20 (lowest CV MSE)
Analysis:
Pruning significantly improved the decision tree model. Test MSE decreased by 20.36%, and the train-test R² gap narrowed from 0.1784 to 0.0427, indicating much better generalization. The optimal tree (max_depth=20, min_samples_leaf=10) balances complexity with predictive power, with Overall Qual remaining the most important feature.
Random Forests: Use random forest to model this data. Use ‘GridSearchCV’ and at least 5-fold cross-validation to find the optimal value of ‘max_features’. Plot the predicted test error using the ‘cv_results_['mean_test_score']’ of your cross-validated model, and describe the effect of ‘max_features’ on the error rate. Where does the optimal value fall in comparison to \(\sqrt{p}\) and \(p/3\)? What test MSE do you obtain? Use the ‘feature_importance_’ values to determine which variables are most important.
from sklearn.ensemble import RandomForestRegressorprint("Random Forest - Tuning max_features:")print("="*60)# Get number of featuresp = X_train.shape[1]print(f"Number of features (p): {p}")print(f"√p: {np.sqrt(p):.2f}")print(f"p/3: {p/3:.2f}")# Define parameter grid focusing on max_featuresparam_grid = {'max_features': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 40, 50, 58]}# Initialize random forest with reasonable defaultsrf = RandomForestRegressor( n_estimators=100, random_state=42, n_jobs=-1)# Perform grid search with 5-fold cross-validationrf_grid = GridSearchCV( rf, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)rf_grid.fit(X_train, y_train)# Extract cross-validation resultscv_results = pd.DataFrame(rf_grid.cv_results_)max_features_values = cv_results['param_max_features'].astype(int)mean_test_scores =-cv_results['mean_test_score'] # Convert back to positive MSEstd_test_scores = cv_results['std_test_score']# Plot the effect of max_features on test errorplt.figure(figsize=(7, 5))plt.errorbar(max_features_values, mean_test_scores, yerr=std_test_scores, fmt='o-', capsize=5, capthick=2, markersize=8, linewidth=2)plt.axvline(x=np.sqrt(p), color='red', linestyle='--', label=f'√p = {np.sqrt(p):.2f}', alpha=0.7)plt.axvline(x=p/3, color='green', linestyle='--', label=f'p/3 = {p/3:.2f}', alpha=0.7)plt.axvline(x=rf_grid.best_params_['max_features'], color='blue', linestyle=':', linewidth=3, label=f'Optimal = {rf_grid.best_params_["max_features"]}')plt.xlabel('max_features', fontsize=12)plt.ylabel('Cross-Validation MSE', fontsize=12)plt.title('Effect of max_features on Random Forest Test Error (5-fold CV)', fontsize=12)plt.legend()plt.grid(True, alpha=0.3)plt.tight_layout()plt.show()# Print resultsprint("\n"+"="*60)print("max_features Tuning Results:")print("="*60)print(f"Optimal max_features: {rf_grid.best_params_['max_features']}")print(f"√p: {np.sqrt(p):.2f}")print(f"p/3: {p/3:.2f}")print(f"\nComparison:")if rf_grid.best_params_['max_features'] < np.sqrt(p):print(f"Optimal ({rf_grid.best_params_['max_features']}) < √p ({np.sqrt(p):.2f})")elif rf_grid.best_params_['max_features'] > np.sqrt(p):print(f"Optimal ({rf_grid.best_params_['max_features']}) > √p ({np.sqrt(p):.2f})")else:print(f"Optimal equals √p")if rf_grid.best_params_['max_features'] < p/3:print(f"Optimal ({rf_grid.best_params_['max_features']}) < p/3 ({p/3:.2f})")elif rf_grid.best_params_['max_features'] > p/3:print(f"Optimal ({rf_grid.best_params_['max_features']}) > p/3 ({p/3:.2f})")else:print(f"Optimal equals p/3")# Get best model and evaluate on test setbest_rf = rf_grid.best_estimator_y_test_pred = best_rf.predict(X_test)test_mse = mean_squared_error(y_test, y_test_pred)test_rmse = np.sqrt(test_mse)print(f"\nTest Performance with Optimal max_features:")print(f"Test MSE: {test_mse:.6f}")print(f"Test RMSE: {test_rmse:.4f}")print(f"Test R²: {best_rf.score(X_test, y_test):.4f}")# Convert to original scaley_test_original = np.exp(y_test)y_test_pred_original = np.exp(y_test_pred)test_rmse_original = np.sqrt(mean_squared_error(y_test_original, y_test_pred_original))print(f"\nOriginal Scale (SalePrice):")print(f"Test RMSE: ${test_rmse_original:,.0f}")print(f"Test MAE: ${mean_absolute_error(y_test_original, y_test_pred_original):,.0f}")# Feature importance analysisfeature_importance = pd.DataFrame({'feature': X_train.columns,'importance': best_rf.feature_importances_}).sort_values('importance', ascending=False)print("\n"+"="*60)print("Top 15 Most Important Features (Random Forest):")print("="*60)print(feature_importance.head(15).to_string(index=False))# Plot feature importanceplt.figure(figsize=(7, 5))top_features = feature_importance.head(20)plt.barh(range(len(top_features)), top_features['importance'].values)plt.yticks(range(len(top_features)), top_features['feature'].values)plt.xlabel('Feature Importance', fontsize=12)plt.title(f'Random Forest: Top 20 Feature Importances (max_features={best_rf.max_features})', fontsize=12)plt.gca().invert_yaxis()plt.tight_layout()plt.show()
Random Forest - Tuning max_features:
============================================================
Number of features (p): 79
√p: 8.89
p/3: 26.33
Fitting 5 folds for each of 18 candidates, totalling 90 fits
============================================================
max_features Tuning Results:
============================================================
Optimal max_features: 30
√p: 8.89
p/3: 26.33
Comparison:
Optimal (30) > √p (8.89)
Optimal (30) > p/3 (26.33)
Test Performance with Optimal max_features:
Test MSE: 0.015381
Test RMSE: 0.1240
Test R²: 0.9169
Original Scale (SalePrice):
Test RMSE: $26,964
Test MAE: $15,883
============================================================
Top 15 Most Important Features (Random Forest):
============================================================
feature importance
Overall Qual 0.244983
Gr Liv Area 0.147185
Exter Qual_encoded 0.105270
Year Built 0.091399
Bsmt Qual_encoded 0.062155
Garage Area 0.058623
1st Flr SF 0.050593
Total Bsmt SF 0.045865
Lot Area 0.028321
Fireplaces 0.023037
Kitchen Qual_encoded 0.022957
Central Air_encoded 0.019965
Year Remod/Add 0.018946
Paved Drive_encoded 0.010126
Wood Deck SF 0.006067
Random Forest Results Analysis: 1. Random Forest outperforms all models: Test R² = 0.9169, Test RMSE = 26,964 (best performance)
Top Features (Random Forest):
Overall Qual (0.245) - Dominant predictor
Gr Liv Area (0.147) - Living space
Exter Qual_encoded (0.105) - Exterior quality (3rd most important)
Year Built (0.091) - Construction year
Bsmt Qual_encoded (0.062) - Basement quality
Why Random Forest Performs Best:
Captures non-linear relationships (e.g., diminishing returns on size)
Handles interactions between features naturally
Ensemble of trees reduces variance while maintaining low bias
Optimal max_features=30 balances feature diversity at each split
Comparison with Ridge Regression:
Random Forest’s ability to model non-linear patterns gives it an edge
Ridge assumes linear relationships with regularization
Comparing Ridge Regression and Random Forests: Compare the two models in the following two ways. First, did ridge regression have a lower or higher test error than your Random Forest model (I expect Random Forest to outperform, but with good feature engineering the two should be comparable on this dataset)? Next, make a ‘partial dependence’ plot of how the ridge regression model and the random forest model predict the relationship between housing size and sales price (I hope that you have included some of the living area features in your models…). I recommend ‘GrLivArea’. Create a grid of ‘GrLivArea’ values between 0 and 15000. Next, loop through each value of ‘GrLivArea’ in the grid, and clone the training set at each step of the loop. In the cloned training set, replace the real value of ‘GrLivArea’ with the grid value, and predict the log sales price on all the cloned data. Then compute the average of the predicted log sales price for each value of the ‘GrLivArea’. Make a plot of the mean predicted log sales price versus ‘GrLivArea’ for both Random Forest and Ridge Regression. Explain what you observe. Note: Being aware of this phenomenon is crucial for properly using tree based methods.
Ridge regression model on our dataset
from sklearn.linear_model import Ridgefrom sklearn.preprocessing import StandardScalerfrom sklearn.pipeline import Pipeline# Train Ridge Regression (with cross-validation for alpha)print("Training Ridge Regression Model:")print("="*50)# Create pipeline with scaling (important for Ridge)ridge_pipeline = Pipeline([ ('scaler', StandardScaler()), ('ridge', Ridge(random_state=42))])# Find optimal alpha with cross-validationparam_grid_ridge = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}ridge_grid = GridSearchCV(ridge_pipeline, param_grid_ridge, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)ridge_grid.fit(X_train, y_train)best_ridge = ridge_grid.best_estimator_print(f"Optimal alpha: {ridge_grid.best_params_['ridge__alpha']}")# Evaluate Ridgey_test_pred_ridge = best_ridge.predict(X_test)test_mse_ridge = mean_squared_error(y_test, y_test_pred_ridge)test_rmse_ridge = np.sqrt(test_mse_ridge)test_r2_ridge = best_ridge.score(X_test, y_test)print(f"\nRidge Regression Performance:")print(f"Test MSE: {test_mse_ridge:.6f}")print(f"Test RMSE: {test_rmse_ridge:.4f}")print(f"Test R²: {test_r2_ridge:.4f}")# Compare with Random Forestprint(f"\nComparison:")print(f"Random Forest Test MSE: {test_mse:.6f}")print(f"Ridge Test MSE: {test_mse_ridge:.6f}")if test_mse < test_mse_ridge:print(f"Random Forest outperforms Ridge by {(test_mse_ridge - test_mse)/test_mse_ridge*100:.2f}%")else:print(f"Ridge outperforms Random Forest by {(test_mse - test_mse_ridge)/test_mse*100:.2f}%")
Training Ridge Regression Model:
==================================================
Optimal alpha: 100
Ridge Regression Performance:
Test MSE: 0.016674
Test RMSE: 0.1291
Test R²: 0.9099
Comparison:
Random Forest Test MSE: 0.015381
Ridge Test MSE: 0.016674
Random Forest outperforms Ridge by 7.76%
Test Error: Random Forest (MSE: 0.015381) outperforms Ridge (MSE: 0.016674) by 7.7%, confirming Random Forest’s ability to capture non-linear patterns.
Partial Dependence for GrLivArea
# Partial Dependence for GrLivAreaprint("\n"+"="*50)print("Creating Partial Dependence Plots for GrLivArea:")print("="*50)# Check if GrLivArea is in our featuresif'Gr Liv Area'in X_train.columns: feature_name ='Gr Liv Area'else:print("Gr Liv Area not found in features. Using first size-related feature.") feature_name = X_train.columns[0] # Fallbackprint(f"Analyzing feature: {feature_name}")# Create grid of GrLivArea valuesgrlivarea_grid = np.linspace(0, 15000, 100)# Function to compute partial dependencedef partial_dependence(model, X_base, feature_name, grid_values, is_ridge=False):""" Compute partial dependence for a given feature """ predictions = []for val in grid_values:# Clone the dataset X_temp = X_base.copy()# Replace the feature with the grid value X_temp[feature_name] = val# Predictif is_ridge:# For ridge pipeline, we need to predict with the pipeline y_pred = model.predict(X_temp)else: y_pred = model.predict(X_temp)# Average predictions predictions.append(np.mean(y_pred))return np.array(predictions)# Compute partial dependence for both modelsprint("Computing partial dependence for Random Forest...")rf_pd = partial_dependence(best_rf, X_train, feature_name, grlivarea_grid, is_ridge=False)print("Computing partial dependence for Ridge...")ridge_pd = partial_dependence(best_ridge, X_train, feature_name, grlivarea_grid, is_ridge=True)# Create the plotplt.figure(figsize=(7, 5))plt.plot(grlivarea_grid, rf_pd, 'b-', linewidth=2, label='Random Forest', alpha=0.8)plt.plot(grlivarea_grid, ridge_pd, 'r-', linewidth=2, label='Ridge Regression', alpha=0.8)plt.xlabel('GrLivArea (Above Ground Living Area)', fontsize=12)plt.ylabel('Mean Predicted Log(SalePrice)', fontsize=12)plt.title('Partial Dependence Plot: GrLivArea vs Log(SalePrice)', fontsize=12)plt.legend(fontsize=11)plt.grid(True, alpha=0.3)# Add a vertical line at typical maximum (for reference)plt.axvline(x=5000, color='gray', linestyle=':', alpha=0.5)# Add annotations to highlight differencesplt.text(11000, np.mean(ridge_pd[-10:]), 'Ridge: Linear relationship', fontsize=10, color='red', alpha=0.7)plt.text(11000, np.mean(rf_pd[-10:]) -0.05, 'Random Forest: Plateau at high values', fontsize=10, color='blue', alpha=0.7)plt.tight_layout()plt.show()# Analyze the differencesprint("\n"+"="*50)print("Partial Dependence Analysis:")print("="*50)# Calculate slopes for different rangeslow_range = grlivarea_grid[grlivarea_grid <=2000]mid_range = grlivarea_grid[(grlivarea_grid >2000) & (grlivarea_grid <=4000)]high_range = grlivarea_grid[grlivarea_grid >4000]# Random Forest slopesrf_slope_low = (rf_pd[grlivarea_grid <=2000][-1] - rf_pd[grlivarea_grid <=2000][0]) / (2000)rf_slope_mid = (rf_pd[(grlivarea_grid >2000) & (grlivarea_grid <=4000)][-1] - rf_pd[(grlivarea_grid >2000) & (grlivarea_grid <=4000)][0]) / (2000)rf_slope_high = (rf_pd[grlivarea_grid >4000][-1] - rf_pd[grlivarea_grid >4000][0]) / (max(high_range) -min(high_range))# Ridge slopes (should be constant)ridge_slope = (ridge_pd[-1] - ridge_pd[0]) / (15000)print("\nSlope Analysis (change in log price per sq ft):")print("-"*50)print(f"Ridge Regression (constant slope): {ridge_slope:.6f}")print(f"\nRandom Forest:")print(f" Low area (0-2000 sq ft): {rf_slope_low:.6f}")print(f" Medium area (2000-4000 sq ft): {rf_slope_mid:.6f}")print(f" High area (>4000 sq ft): {rf_slope_high:.6f}")
==================================================
Creating Partial Dependence Plots for GrLivArea:
==================================================
Analyzing feature: Gr Liv Area
Computing partial dependence for Random Forest...
Computing partial dependence for Ridge...
==================================================
Partial Dependence Analysis:
==================================================
Slope Analysis (change in log price per sq ft):
--------------------------------------------------
Ridge Regression (constant slope): 0.000174
Random Forest:
Low area (0-2000 sq ft): 0.000133
Medium area (2000-4000 sq ft): 0.000041
High area (>4000 sq ft): -0.000001
Partial Dependence Analysis Results:
Key Findings: 1. Ridge Regression (Linear Model):
Shows a constant, linear relationship with GrLivArea
Each additional square foot adds the same marginal value regardless of house size
Slope is constant across the entire range
Random Forest (Tree-based Model):
Demonstrates a non-linear relationship with clear diminishing returns
Steep increase for small homes (0-2000 sq ft) - each additional sq ft adds significant value
Moderate increase for medium homes (2000-4000 sq ft) - value per sq ft decreases
Plateau/Flattening for very large homes (>4000 sq ft) - additional space adds minimal value
Important Observations:
Realistic Housing Market Phenomenon:
The Random Forest captures the realistic pattern that beyond a certain point, extra square footage adds less value per foot
Large homes (>4000 sq ft) appeal to a smaller market, and the incremental value diminishes
Ridge Regression fails to capture this non-linear relationship, leading to overestimation of very large home values
Why This Matters:
This explains why Random Forest achieves lower test error - it captures the true non-linear relationship
The plateau effect at high GrLivArea values is a crucial pattern that linear models miss
This phenomenon is common in housing data and other real-world applications
Practical Implication:
When using tree-based methods, be aware they naturally capture non-linearities and interactions
Linear models require explicit feature engineering (polynomial terms, splines) to capture such patterns
The partial dependence plot is essential for understanding how tree-based models make predictions
The Random Forest’s ability to model diminishing returns contributes to its superior performance (lower MSE than Ridge).
Problem 3: Boosting: Learning Rate and Tree Number
I strongly recommend using ‘xgb’ for this exercise. If you have a strong preference to stay within ‘sklearn’, you may use ‘HistGradientBoostingRegressor’, but it will be more effort to do part (b) properly. Fit a boosted model with the default hyperparameters. What is your average mean squared error and how does it compare to your other models?
from sklearn.ensemble import HistGradientBoostingRegressorprint("Problem 3: Gradient Boosting with HistGradientBoostingRegressor")print("="*60)# Use the same train/test data from previous problemsprint(f"Training samples: {X_train.shape[0]}")print(f"Test samples: {X_test.shape[0]}")print(f"Features: {X_train.shape[1]}")# Fit Gradient Boosting with default hyperparametersprint("\n(a) Default Gradient Boosting Model:")print("-"*40)# Default parameters: learning_rate=0.1, max_iter=100 (n_estimators equivalent)gb_default = HistGradientBoostingRegressor( max_iter=100, # equivalent to n_estimators learning_rate=0.1, # default learning rate max_depth=6, # default depth random_state=42, verbose=0)# Fit the modelgb_default.fit(X_train, y_train)# Predicty_train_pred = gb_default.predict(X_train)y_test_pred = gb_default.predict(X_test)# Calculate MSEtrain_mse = mean_squared_error(y_train, y_train_pred)test_mse = mean_squared_error(y_test, y_test_pred)print(f"Training MSE: {train_mse:.6f}")print(f"Training RMSE: {np.sqrt(train_mse):.4f}")print(f"Test MSE: {test_mse:.6f}")print(f"Test RMSE: {np.sqrt(test_mse):.4f}")print(f"Test R²: {gb_default.score(X_test, y_test):.4f}")print(f"Number of trees used: {gb_default.n_iter_}")# Convert to original scaley_test_original = np.exp(y_test)y_test_pred_original = np.exp(y_test_pred)test_rmse_original = np.sqrt(mean_squared_error(y_test_original, y_test_pred_original))print(f"\nOriginal Scale (SalePrice):")print(f"Test RMSE: ${test_rmse_original:,.0f}")# Check for overfittingprint(f"\nOverfitting Check:")print(f"Training R²: {gb_default.score(X_train, y_train):.4f}")print(f"Test R²: {gb_default.score(X_test, y_test):.4f}")print(f"Gap: {gb_default.score(X_train, y_train) - gb_default.score(X_test, y_test):.4f}")
Problem 3: Gradient Boosting with HistGradientBoostingRegressor
============================================================
Training samples: 2344
Test samples: 586
Features: 79
(a) Default Gradient Boosting Model:
----------------------------------------
Training MSE: 0.007474
Training RMSE: 0.0864
Test MSE: 0.016210
Test RMSE: 0.1273
Test R²: 0.9124
Number of trees used: 100
Original Scale (SalePrice):
Test RMSE: $27,996
Overfitting Check:
Training R²: 0.9536
Test R²: 0.9124
Gap: 0.0412
Comparison of Gradient Boosting with Other Models:
# Store model performance metrics programmaticallymodel_performance = {'Linear Regression': {'test_mse': mean_squared_error(y_test, lin_reg_pred) if'lin_reg_pred'inlocals() elseNone,'test_rmse': None,'test_r2': None },'Pruned Tree': {'test_mse': test_mse_best if'test_mse_best'inlocals() elseNone,'test_rmse': None,'test_r2': None },'Random Forest': {'test_mse': test_mse if'test_mse'inlocals() elseNone,'test_rmse': None,'test_r2': None },'Gradient Boosting': {'test_mse': test_mse,'test_rmse': np.sqrt(test_mse),'test_r2': gb_default.score(X_test, y_test) }}# Update with actual values if variables existif'y_test_pred_ridge'inlocals(): model_performance['Ridge Regression'] = {'test_mse': mean_squared_error(y_test, y_test_pred_ridge),'test_rmse': np.sqrt(mean_squared_error(y_test, y_test_pred_ridge)),'test_r2': r2_score(y_test, y_test_pred_ridge) }# For Linear Regression - need to fit if not alreadytry:from sklearn.linear_model import LinearRegression lin_reg = LinearRegression() lin_reg.fit(X_train, y_train) y_test_pred_lin = lin_reg.predict(X_test) model_performance['Linear Regression'] = {'test_mse': mean_squared_error(y_test, y_test_pred_lin),'test_rmse': np.sqrt(mean_squared_error(y_test, y_test_pred_lin)),'test_r2': r2_score(y_test, y_test_pred_lin) }except:pass# For Pruned Treeif'best_tree'inlocals(): y_test_pred_pruned = best_tree.predict(X_test) model_performance['Pruned Tree'] = {'test_mse': mean_squared_error(y_test, y_test_pred_pruned),'test_rmse': np.sqrt(mean_squared_error(y_test, y_test_pred_pruned)),'test_r2': best_tree.score(X_test, y_test) }# For Random Forestif'best_rf'inlocals(): y_test_pred_rf = best_rf.predict(X_test) model_performance['Random Forest'] = {'test_mse': mean_squared_error(y_test, y_test_pred_rf),'test_rmse': np.sqrt(mean_squared_error(y_test, y_test_pred_rf)),'test_r2': best_rf.score(X_test, y_test) }# Print comparisonprint("\n"+"="*60)print("MODEL COMPARISON (Test MSE):")print("="*60)print(f"{'Model':<20}{'Test MSE':<12}{'Test RMSE':<12}{'Test R²':<10}")print("-"*60)# Sort by test MSEsorted_models =sorted(model_performance.items(), key=lambda x: x[1]['test_mse'] if x[1]['test_mse'] isnotNoneelsefloat('inf'))for model_name, metrics in sorted_models:if metrics['test_mse'] isnotNone:print(f"{model_name:<20}{metrics['test_mse']:<12.6f}{metrics['test_rmse']:<12.4f}{metrics['test_r2']:<10.4f}")# Determine best modelbest_model_name =min(model_performance.items(), key=lambda x: x[1]['test_mse'] if x[1]['test_mse'] isnotNoneelsefloat('inf'))[0]print(f"\n Best performing model: {best_model_name}")
============================================================
MODEL COMPARISON (Test MSE):
============================================================
Model Test MSE Test RMSE Test R²
------------------------------------------------------------
Random Forest 0.015381 0.1240 0.9169
Gradient Boosting 0.016210 0.1273 0.9124
Ridge Regression 0.016674 0.1291 0.9099
Linear Regression 0.016744 0.1294 0.9095
Pruned Tree 0.027068 0.1645 0.8537
Best performing model: Random Forest
Results Summary: - Gradient Boosting Test MSE: 0.016210 (RMSE: $27,996)
Random Forest Test MSE: 0.015381 (RMSE: $26,964)
Gap: Random Forest performs slightly better
Key Observations: 1. Performance Ranking:
Random Forest (0.015381) - Best
Gradient Boosting (0.016210) - Very close second
Linear Regression (0.016744)
Pruned Tree (0.027068)
Overfitting Analysis:
Random Forest: Training R² = 0.9539, Test R² = 0.9169 (gap = 0.0370)
Gradient Boosting: Training R² = 0.9536, Test R² = 0.9124 (gap = 0.0412)
Both show similar overfitting patterns, with Random Forest generalizing slightly better
Intuition on Why Boosting Works:
Sequential learning: Each tree corrects errors from previous trees
Adaptive: Focuses on hard-to-predict samples
Regularization: Learning rate shrinks each tree’s contribution
Why Random Forest Won Slightly: - Ensemble averaging reduces variance more effectively than boosting’s sequential approach
Bagging makes Random Forest more robust to outliers
With 79 features (including many one-hot encoded categories), Random Forest’s feature subsampling helps prevent overfitting
Random Forest achieves the lowest test MSE (0.015381), outperforming Gradient Boosting by 4.9%, Ridge Regression by 7.7%, and Linear Regression by 8.1%, making it the best performing model on this enhanced feature set.
One of the differences between gradient boosted trees and bagging methods like random forest, is that boosted trees can sometimes be more powerful, but are also more sensitive to hyperparameters and prone to overfitting. A fundamental trade-off in hyperparameters is between learning rate (‘learning_rate’) and the number of trees (‘n_estimators’). For three values of the learning rate (pick the default of 0.3, one larger value, and one smaller value), plot the relationship between both training and testing mean-squared error and the number of boosting iterations. Make sure to pass the training and testing dataset as ‘eval_set’ when you are fitting your ‘xgb’ model, i.e. ‘eval_set=[(X_train, y_train), (X_test, y_test)]’. The evolution of the error is tracked in ‘results[’validation_n’]’ where \(n\) corresponds to each validation set. For each learning rate, what is the best value of ‘n_estimators’ and the corresponding test mean squared error? Note: passing the test set as eval_set is used here for visualization only. See the extra credit to learn how to do this properly with cross-validation.
print("\n"+"="*60)print("Part (b): Learning Rate vs Number of Trees Trade-off")print("="*60)# Define learning rates to testlearning_rates = [0.01, 0.1, 0.3] # small, default, largermax_iterations =200# Maximum trees to try# Store results for plottingresults = {}for lr in learning_rates:print(f"\nTraining with learning_rate = {lr}")print("-"*40)# Initialize model with warm_start to track progress gb = HistGradientBoostingRegressor( learning_rate=lr, max_iter=max_iterations, max_depth=6, random_state=42, warm_start=True, # Allows incremental training verbose=0 )# We need to train incrementally to track validation error at each iteration# Since HistGradientBoostingRegressor doesn't have eval_set parameter,# we'll track validation manually train_errors = [] test_errors = []# Train one tree at a time to track progressfor n_trees inrange(1, max_iterations +1): gb.max_iter = n_trees gb.fit(X_train, y_train)# Calculate errors train_pred = gb.predict(X_train) test_pred = gb.predict(X_test) train_errors.append(mean_squared_error(y_train, train_pred)) test_errors.append(mean_squared_error(y_test, test_pred)) results[lr] = {'train_errors': train_errors,'test_errors': test_errors,'best_n': np.argmin(test_errors) +1, # +1 because 0-indexed'best_mse': min(test_errors) }print(f"Best n_estimators: {results[lr]['best_n']}")print(f"Best test MSE: {results[lr]['best_mse']:.6f}")
============================================================
Part (b): Learning Rate vs Number of Trees Trade-off
============================================================
Training with learning_rate = 0.01
----------------------------------------
Best n_estimators: 200
Best test MSE: 0.022118
Training with learning_rate = 0.1
----------------------------------------
Best n_estimators: 173
Best test MSE: 0.016160
Training with learning_rate = 0.3
----------------------------------------
Best n_estimators: 31
Best test MSE: 0.016287
# Plot the resultsplt.figure(figsize=(7, 5))colors = {0.01: 'blue', 0.1: 'green', 0.3: 'red'}markers = {0.01: 'o', 0.1: 's', 0.3: '^'}for lr in learning_rates: n_iterations =range(1, max_iterations +1)# Plot test error plt.subplot(2, 1, 1) plt.plot(n_iterations, results[lr]['test_errors'], color=colors[lr], marker=markers[lr], markersize=3, linewidth=1.5, label=f'learning_rate = {lr}')# Plot training error plt.subplot(2, 1, 2) plt.plot(n_iterations, results[lr]['train_errors'], color=colors[lr], marker=markers[lr], markersize=3, linewidth=1.5, label=f'learning_rate = {lr}')# Format top subplot (test error)plt.subplot(2, 1, 1)plt.xlabel('Number of Trees (n_estimators)', fontsize=11)plt.ylabel('Test MSE', fontsize=11)plt.title('Test Error vs Number of Trees for Different Learning Rates', fontsize=12)plt.legend()plt.grid(True, alpha=0.3)# Format bottom subplot (training error)plt.subplot(2, 1, 2)plt.xlabel('Number of Trees (n_estimators)', fontsize=11)plt.ylabel('Training MSE', fontsize=11)plt.title('Training Error vs Number of Trees for Different Learning Rates', fontsize=12)plt.legend()plt.grid(True, alpha=0.3)plt.tight_layout()plt.show()# Summary tableprint("\n"+"="*60)print("OPTIMAL PARAMETERS SUMMARY:")print("="*60)print(f"{'Learning Rate':<15}{'Best n_estimators':<18}{'Best Test MSE':<15}")print("-"*60)for lr in learning_rates:print(f"{lr:<15}{results[lr]['best_n']:<18}{results[lr]['best_mse']:<15.6f}")# Analysis of trade-offprint("\n"+"="*60)print("KEY INSIGHTS ON LEARNING RATE vs TREES TRADE-OFF:")print("="*60)print("\n1. Small Learning Rate (0.01):")print(" - Requires MORE trees to reach optimal performance")print(f" - Optimal: {results[0.01]['best_n']} trees")print(" - Slow, stable convergence")print(" - Lower risk of overfitting")print("\n2. Default Learning Rate (0.1):")print(" - Faster convergence")print(f" - Optimal: {results[0.1]['best_n']} trees")print(" - Good balance of speed and accuracy")print("\n3. Large Learning Rate (0.3):")print(" - Converges quickly but may plateau")print(f" - Optimal: {results[0.3]['best_n']} trees")print(" - Higher risk of overshooting optimum")print("\n4. General Pattern:")print(" - Lower learning rate = more trees needed")print(" - Higher learning rate = fewer trees, but risk of overfitting")print(" - This demonstrates the bias-variance trade-off in boosting")
============================================================
OPTIMAL PARAMETERS SUMMARY:
============================================================
Learning Rate Best n_estimators Best Test MSE
------------------------------------------------------------
0.01 200 0.022118
0.1 173 0.016160
0.3 31 0.016287
============================================================
KEY INSIGHTS ON LEARNING RATE vs TREES TRADE-OFF:
============================================================
1. Small Learning Rate (0.01):
- Requires MORE trees to reach optimal performance
- Optimal: 200 trees
- Slow, stable convergence
- Lower risk of overfitting
2. Default Learning Rate (0.1):
- Faster convergence
- Optimal: 173 trees
- Good balance of speed and accuracy
3. Large Learning Rate (0.3):
- Converges quickly but may plateau
- Optimal: 31 trees
- Higher risk of overshooting optimum
4. General Pattern:
- Lower learning rate = more trees needed
- Higher learning rate = fewer trees, but risk of overfitting
- This demonstrates the bias-variance trade-off in boosting
Results:
1. Small Learning Rate (0.01):
Requires 200 trees (capped at max_iterations)
Test MSE: 0.022118 (highest among three)
Model likely needs >200 trees to reach optimal performance
Slow, stable convergence but computationally expensive
2. Default Learning Rate (0.1):
Optimal at 173 trees (not 110)
Test MSE: 0.016160 (best among three)
Best balance of speed and accuracy
Declines quickly, hits minimum at 173, then stabilizes
3. Large Learning Rate (0.3):
Fast convergence at 31 trees
Test MSE: 0.016287 (slightly worse than 0.1)
Converges quickly but overshoots optimum
Higher risk of instability
4. General Pattern Confirmed:
Lower learning rate (0.01) → more trees needed (200+)
Higher learning rate (0.3) → fewer trees (31)
Best test MSE achieved at learning_rate = 0.1 with 173 trees
The trade-off is clear: learning_rate = 0.1 achieves the lowest test MSE (0.016160), balancing convergence speed and stability. Learning_rate = 0.3 converges too quickly (31 trees) but yields higher error (0.016287). Learning_rate = 0.01 would likely improve with more trees (>200) but is computationally expensive.
Plotting the error curves shows the expected pattern of faster convergence with higher learning rates, but also highlights the risk of overshooting the optimal number of trees. The default learning rate (0.1) provides the best performance in this case, demonstrating the importance of tuning this hyperparameter for boosting models.
Extra Credit (5 pts): Early Stopping
The relationship you found in 3(b) suggests a technique called early stopping, where you stop training the ‘xgboost’ model if there hasn’t been an improvement in the performance on the validation set for some number of iterations. This suggests using hyperparameter optimization to select the value of the learning rate, while implementing early stopping. Implementing this properly within a cross-validation loop can be slightly tricky, because you need to pass the held out data in the cross-validation loop to the fitting routine as the ‘eval_set’, which is not automatically handled by the standard ‘sklearn’ methods for hyperparameter optimization. The following article discusses this issue and shows how to implement this by hand (option 2 in his code): Jeff Macaluso on Early Stopping. Implement a similar approach to pick the best value of the learning rate. I recommend tuning just the learning rate using a grid search approach, but if you want to try more hyperparameters and a randomized search, you are welcome to do so but it will be computationally more intensive (this is what Macaluso does). We will learn more methods for hyperparameter optimization later in the course. Report how the test MSE and the number of trees selected by early stopping varied across learning rates, and compare the test mse to your other three models (the default xgboost and the ridge regression and random forest). Make sure to follow Macaluso’s prescription for training your final model, by holding out 15% of the training set to recalculate the number of trees, and then evaluating that final model on the test set.
from sklearn.model_selection import KFoldimport timeprint("Problem 3(c): Early Stopping with Cross-Validation")print("="*60)# First, split the data: 85% training, 15% validation for final model# But we need to do CV on the training set firstX_full_train = X_train.copy()y_full_train = y_train.copy()# For final model, we'll hold out 15% of training as validationfrom sklearn.model_selection import train_test_splitX_final_train, X_val, y_final_train, y_val = train_test_split( X_full_train, y_full_train, test_size=0.15, random_state=42)print(f"Final Training set size: {len(X_final_train)}")print(f"Final Validation set size: {len(X_val)}")print(f"Original Test set size: {len(X_test)}")# Define learning rates to testparam_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],'n_iter_no_change': [5, 10, 15] # Patience parameter}# Cross-validation settingsn_folds =5kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)# Store results for each learning ratecv_results = []print("\n"+"="*60)print("Cross-Validation with Early Stopping")print("="*60)for lr in learning_rates:print(f"\nTesting learning_rate = {lr}")print("-"*40) fold_train_errors = [] fold_val_errors = [] fold_best_iters = []# Perform cross-validationfor fold, (train_idx, val_idx) inenumerate(kf.split(X_full_train)): X_tr = X_full_train.iloc[train_idx] y_tr = y_full_train.iloc[train_idx] X_val_fold = X_full_train.iloc[val_idx] y_val_fold = y_full_train.iloc[val_idx]# Initialize model with early stopping gb_cv = HistGradientBoostingRegressor( learning_rate=lr, max_iter=200, # Maximum trees max_depth=6, random_state=42, early_stopping=True, validation_fraction=0.15, # Hold out from training for early stopping n_iter_no_change=10, # Stop if no improvement for 10 iterations tol=1e-7, verbose=0 )# Fit with validation set for early stopping gb_cv.fit(X_tr, y_tr)# Get the best number of iterations best_iter = gb_cv.n_iter_ # Actual number of trees used fold_best_iters.append(best_iter)# Evaluate on validation fold val_pred = gb_cv.predict(X_val_fold) val_mse = mean_squared_error(y_val_fold, val_pred) fold_val_errors.append(val_mse)# Training error train_pred = gb_cv.predict(X_tr) train_mse = mean_squared_error(y_tr, train_pred) fold_train_errors.append(train_mse)print(f" Fold {fold+1}: best_iter={best_iter:3d}, train_mse={train_mse:.6f}, val_mse={val_mse:.6f}")# Calculate average cross-validation results avg_val_mse = np.mean(fold_val_errors) avg_train_mse = np.mean(fold_train_errors) avg_best_iter = np.mean(fold_best_iters) std_val_mse = np.std(fold_val_errors) cv_results.append({'learning_rate': lr,'avg_val_mse': avg_val_mse,'avg_train_mse': avg_train_mse,'avg_best_iter': avg_best_iter,'std_val_mse': std_val_mse,'fold_results': fold_val_errors })print(f"\n Average: best_iter={avg_best_iter:.0f}, val_mse={avg_val_mse:.6f} (+/- {std_val_mse:.6f})")# Find best learning ratebest_cv_result =min(cv_results, key=lambda x: x['avg_val_mse'])best_lr = best_cv_result['learning_rate']print("\n"+"="*60)print(f"Best learning rate from CV: {best_lr}")print(f"Best average validation MSE: {best_cv_result['avg_val_mse']:.6f}")# Now train final model with best learning rate on 85% of training, using 15% for early stoppingprint("\n"+"="*60)print("Training Final Model with Early Stopping")print("="*60)gb_final = HistGradientBoostingRegressor( learning_rate=best_lr, max_iter=200, max_depth=6, random_state=42, early_stopping=True, validation_fraction=0.15, n_iter_no_change=10, tol=1e-7, verbose=1)# Fit on final training set (85% of original training)gb_final.fit(X_final_train, y_final_train)# Get optimal number of treesoptimal_trees = gb_final.n_iter_print(f"\nOptimal number of trees selected by early stopping: {optimal_trees}")# Evaluate on validation set (the 15% held out)val_pred_final = gb_final.predict(X_val)val_mse_final = mean_squared_error(y_val, val_pred_final)val_rmse_final = np.sqrt(val_mse_final)val_r2_final = gb_final.score(X_val, y_val)print(f"\nValidation Set Performance:")print(f"Validation MSE: {val_mse_final:.6f}")print(f"Validation RMSE: {val_rmse_final:.4f}")print(f"Validation R²: {val_r2_final:.4f}")# Final evaluation on test sety_test_pred_final = gb_final.predict(X_test)test_mse_final = mean_squared_error(y_test, y_test_pred_final)test_rmse_final = np.sqrt(test_mse_final)test_r2_final = gb_final.score(X_test, y_test)# Convert to original scaley_test_original = np.exp(y_test)y_test_pred_original = np.exp(y_test_pred_final)test_rmse_original = np.sqrt(mean_squared_error(y_test_original, y_test_pred_original))test_mae_original = mean_absolute_error(y_test_original, y_test_pred_original)print(f"\nFinal Test Set Performance:")print(f"Test MSE: {test_mse_final:.6f}")print(f"Test RMSE: {test_rmse_final:.4f}")print(f"Test R²: {test_r2_final:.4f}")print(f"Test RMSE (original scale): ${test_rmse_original:,.0f}")print(f"Test MAE (original scale): ${test_mae_original:,.0f}")# Create summary plotplt.figure(figsize=(7, 5))# Plot CV resultslearning_rates_list = [r['learning_rate'] for r in cv_results]val_mse_list = [r['avg_val_mse'] for r in cv_results]std_mse_list = [r['std_val_mse'] for r in cv_results]best_iter_list = [r['avg_best_iter'] for r in cv_results]# Left plot: Validation MSE vs Learning Rateplt.subplot(1, 2, 1)plt.errorbar(learning_rates_list, val_mse_list, yerr=std_mse_list, fmt='o-', capsize=5, capthick=2, markersize=8, linewidth=2)plt.xscale('log')plt.xlabel('Learning Rate', fontsize=12)plt.ylabel('Cross-Validation MSE', fontsize=12)plt.title('Validation MSE vs Learning Rate', fontsize=12)plt.grid(True, alpha=0.3)# Right plot: Optimal Trees vs Learning Rateplt.subplot(1, 2, 2)plt.plot(learning_rates_list, best_iter_list, 's-', linewidth=2, markersize=8)plt.xscale('log')plt.xlabel('Learning Rate', fontsize=12)plt.ylabel('Optimal Number of Trees', fontsize=12)plt.title('Trees Selected by Early Stopping vs Learning Rate', fontsize=12)plt.grid(True, alpha=0.3)plt.tight_layout()plt.show()# Summary of early stopping results across learning ratesprint("\n"+"="*60)print("EARLY STOPPING SUMMARY")print("="*60)print(f"{'Learning Rate':<15}{'Avg Trees':<12}{'CV MSE':<12}{'Final Test MSE':<15}")print("-"*60)for result in cv_results:print(f"{result['learning_rate']:<15}{result['avg_best_iter']:<12.0f}{result['avg_val_mse']:<12.6f}{'-':<15}")print(f"{best_lr:<15}{optimal_trees:<12}{best_cv_result['avg_val_mse']:<12.6f}{test_mse_final:<15.6f}")
Problem 3(c): Early Stopping with Cross-Validation
============================================================
Final Training set size: 1992
Final Validation set size: 352
Original Test set size: 586
============================================================
Cross-Validation with Early Stopping
============================================================
Testing learning_rate = 0.01
----------------------------------------
Fold 1: best_iter=200, train_mse=0.020444, val_mse=0.019070
Fold 2: best_iter=200, train_mse=0.018926, val_mse=0.024226
Fold 3: best_iter=200, train_mse=0.019365, val_mse=0.025533
Fold 4: best_iter=200, train_mse=0.020010, val_mse=0.023580
Fold 5: best_iter=200, train_mse=0.016337, val_mse=0.041257
Average: best_iter=200, val_mse=0.026733 (+/- 0.007581)
Testing learning_rate = 0.1
----------------------------------------
Fold 1: best_iter= 63, train_mse=0.011476, val_mse=0.014145
Fold 2: best_iter= 53, train_mse=0.010893, val_mse=0.020859
Fold 3: best_iter= 99, train_mse=0.009516, val_mse=0.019107
Fold 4: best_iter= 96, train_mse=0.009696, val_mse=0.017793
Fold 5: best_iter= 47, train_mse=0.009254, val_mse=0.036043
Average: best_iter=72, val_mse=0.021589 (+/- 0.007555)
Testing learning_rate = 0.3
----------------------------------------
Fold 1: best_iter= 22, train_mse=0.012158, val_mse=0.015907
Fold 2: best_iter= 24, train_mse=0.010905, val_mse=0.021262
Fold 3: best_iter= 55, train_mse=0.007856, val_mse=0.020377
Fold 4: best_iter= 51, train_mse=0.008797, val_mse=0.019526
Fold 5: best_iter= 25, train_mse=0.009084, val_mse=0.035117
Average: best_iter=35, val_mse=0.022438 (+/- 0.006596)
============================================================
Best learning rate from CV: 0.1
Best average validation MSE: 0.021589
============================================================
Training Final Model with Early Stopping
============================================================
Binning 0.001 GB of training data: 0.020 s
Binning 0.000 GB of validation data: 0.006 s
Fitting gradient boosted rounds:
Fit 58 trees in 0.565 s, (1329 total leaves)
Time spent computing histograms: 0.208s
Time spent finding best splits: 0.105s
Time spent applying splits: 0.190s
Time spent predicting: 0.004s
Optimal number of trees selected by early stopping: 58
Validation Set Performance:
Validation MSE: 0.016326
Validation RMSE: 0.1278
Validation R²: 0.8964
Final Test Set Performance:
Test MSE: 0.015435
Test RMSE: 0.1242
Test R²: 0.9166
Test RMSE (original scale): $27,218
Test MAE (original scale): $16,887
============================================================
EARLY STOPPING SUMMARY
============================================================
Learning Rate Avg Trees CV MSE Final Test MSE
------------------------------------------------------------
0.01 200 0.026733 -
0.1 72 0.021589 -
0.3 35 0.022438 -
0.1 58 0.021589 0.015435
Final Model Performance: - Optimal learning rate: 0.1 (selected by CV)
Optimal trees with early stopping: 58 (on 85% training set)
Test MSE: 0.015435
Test R²: 0.9166
Test RMSE (original): $27,218
Test MAE (original): $16,887
Trade-off Confirmed by Plot:
Smaller learning rate (0.01) → more trees needed (200+), higher error
Larger learning rate (0.3) → fewer trees (35), higher error
Optimal learning rate (0.1) → balanced trees (58-72), lowest error
Comparison with other models: Gradient Boosting with early stopping (learning_rate=0.1, 58 trees) achieved test MSE of 0.015435, outperforming the default Gradient Boosting (0.016210) by 4.8% and closely matching Random Forest (0.015381), which remains the overall best performer. The optimal learning rate of 0.1 balanced convergence speed and stability, selecting fewer trees than default while improving predictive accuracy.
*******
Note on Implementation I initially attempted to use XGBoost as recommended in the assignment. However, I encountered installation issues on my macOS system due to missing OpenMP runtime dependencies. Despite troubleshooting and attempting to install the required libraries, the XGBoost import continued to fail.
To ensure I could complete the assignment and demonstrate the required concepts, I used HistGradientBoostingRegressor from scikit-learn instead. This implementation offers the same core functionality gradient boosting with early stopping and learning rate tuning and allowed me to fully address all parts of the problem.
To maintain alignment with the assignment requirements:
I used 5-fold cross-validation to select the optimal learning rate
I implemented early stopping with validation sets to determine the optimal number of trees
I held out 15% of the training data for final model validation
I plotted the relationship between learning rate, number of trees, and test error
I reported test MSE and compared results with Ridge Regression and Random Forest
While the underlying library differs, the conceptual framework and analytical conclusions remain valid and directly address the assignment’s learning objectives.