Below is the first section of the study guide on Boosting and a Boosting Walk-through. I drew from The Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I am writing in first person and keeping the material copy/paste friendly. I include references to the text (ESL, The Elements of Statistical Learning) and to the video transcript (Module 8 Asynch). I also provide example code in R, SAS, and Python so that you can see how the method is implemented across different environments.

SECTION 1: BOOSTING / BOOSTING WALK-THROUGH

1. Overview and Intuition of Boosting
I see boosting as a method for leveraging many weak learners (each only slightly better than random guessing) and combining them to form a single strong learner. The main idea is that each new model in the “boosting sequence” focuses on the errors made by the previous models. This means boosting is a sequential (not parallel) procedure that iteratively refines its predictions. (Module 8 Asynch)

• Contrasting with Bagging:
  – Bagging uses bootstrap samples in parallel and then averages all the resulting models.
  – Boosting proceeds in a forward-stagewise manner, reweighting or re-focusing on the hard-to-predict samples each round. (Module 8 Asynch)

• Key Advantages:
  – Often yields lower bias than a single model like a single tree, since boosting iteratively “corrects” itself.
  – Flexible: can work with decision trees, linear models, or other base learners.
  – Can handle a variety of loss functions (classification, regression, etc.). (ESL)

• Key Disadvantages:
  – Can be more computationally expensive (because of its sequential nature).
  – Potential to overfit if not properly regularized or if too many iterations are used. (Module 8 Asynch)

2. General Boosting Algorithm Steps
1) Fit a weak learner to the data (for example, a very shallow tree).
2) Evaluate its predictions and compute the residuals or errors.
3) Make those residuals (or a transformed version) the new “target” in the next iteration.
4) Fit a new weak learner to these residuals.
5) Repeat until a stopping criterion is reached (e.g., a maximum number of iterations or minimal improvement). (Module 8 Asynch)

Mathematically (high-level), the procedure in a regression setting can be thought of like this:

• Let F(x) denote the current model (initialized to something simple, such as a constant).
• At iteration m:
  – Compute the residuals rᵢ = yᵢ – F(xᵢ).
  – Fit a weak learner hₘ(x) to these residuals.
  – Update F(x) ← F(x) + ν ⋅ hₘ(x), where ν is a learning rate.

In practice, variations of this formula exist, especially for different loss functions (logistic loss, etc.). (ESL)

3. Boosting Walk-Through Example (Conceptual)
• Step 1: Suppose I have a simple dataset (x, y) where y is somewhat nonlinear in x. I start by fitting a “weak” model—a stump (decision tree of depth=1).
• Step 2: Calculate residuals: errors = y – prediction. These errors still show clear structure (not random).
• Step 3: Fit another stump to these errors.
• Step 4: Add that stump’s predictions (scaled by a small learning rate) to the overall model’s prediction.
• Step 5: Repeat, each time focusing on what the last model didn’t catch.

By iteration 30 or so, the overall model can capture quite complicated patterns. That is essentially the “magic” of boosting. It is a simple forward-stagewise procedure that can produce powerful results even if each learner is fairly weak. (Module 8 Asynch)
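As a concrete companion to those five steps, here is a minimal from-scratch sketch (my own illustration, not taken from ESL or the transcripts; it assumes squared-error loss, a toy sine-wave dataset, and scikit-learn stumps):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy nonlinear data: y is a noisy sine of x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=200)

nu = 0.1                       # learning rate
F = np.full_like(y, y.mean())  # initialize with a constant model
stumps = []

for m in range(30):                             # 30 boosting rounds
    residuals = y - F                           # Step 2: what we missed
    stump = DecisionTreeRegressor(max_depth=1)  # Steps 1/3: a weak learner
    stump.fit(x, residuals)                     # fit it to the residuals
    F += nu * stump.predict(x)                  # Step 4: shrunken update
    stumps.append(stump)

print("Training MSE after 30 rounds:", np.mean((y - F) ** 2))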

4. Handwritten Formula (Simplified)
Below is a simple version of the boosting update for regression, using a generic loss function L(y, F(x)):

1) Initialize:
   F₀(x) = arg minᵧ ∑ᵢ L(yᵢ, γ).

2) For m = 1 to M:
   a) Compute pseudo-residuals:
      rᵢₘ = – [∂/∂F(xᵢ)] L(yᵢ, F(xᵢ)) evaluated at F = Fₘ₋₁(xᵢ).
   b) Fit a weak learner hₘ(x) to the {rᵢₘ}.
   c) Compute multiplier γₘ = arg minᵧ ∑ᵢ L(yᵢ, Fₘ₋₁(xᵢ) + γ hₘ(xᵢ)).
   d) Update model:
      Fₘ(x) = Fₘ₋₁(x) + ν ⋅ γₘ hₘ(x).

3) Final model: Fₘ(x) with m = M.

Here, ν is a learning rate (0 < ν ≤ 1) and M is the maximum number of iterations. This is the formula you’ll often see in references, including ESL Chapter 10. (ESL)
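One quick sanity check on this generic formula: with squared-error loss L(y, F) = ½ (y – F)², the pseudo-residual in step a) is rᵢₘ = – ∂L/∂F = yᵢ – Fₘ₋₁(xᵢ), which is exactly the ordinary residual. So step 2 of the walk-through above is just the special case of this gradient view under squared-error loss.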

5. Example Code in Python, R, and SAS

––––––––––––––––––––––––––––––––––––––––––––––––––––
A) Python Example (Using scikit-learn’s AdaBoost as a Basic Boosting)

from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Suppose X, y are your features and targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# A "weak learner": a shallow decision tree
weak_learner = DecisionTreeRegressor(max_depth=1)

# Build the AdaBoost ensemble
boost_model = AdaBoostRegressor(
    estimator=weak_learner,  # called base_estimator in scikit-learn < 1.2
    n_estimators=30,         # Number of iterations
    learning_rate=0.1,       # Shrinkage or step size
    loss='linear'            # For regression
)

boost_model.fit(X_train, y_train)
preds = boost_model.predict(X_test)

mse = mean_squared_error(y_test, preds)
print("Boosted Model MSE:", mse)

––––––––––––––––––––––––––––––––––––––––––––––––––––
B) R Example (Using the gbm Package)

# install.packages("gbm")
library(gbm)

# Suppose we have a data frame df with columns for features and a column "y"
# We'll do a simple split:
set.seed(123)
n <- nrow(df)
train_idx <- sample(1:n, size = floor(0.8*n))
train_data <- df[train_idx, ]
test_data  <- df[-train_idx, ]

# Fit a boosted model with Gaussian loss (regression)
boost_fit <- gbm(
  formula = y ~ .,
  distribution = "gaussian",
  data = train_data,
  n.trees = 30,
  interaction.depth = 1,  # max tree depth
  shrinkage = 0.1,        # learning rate
  bag.fraction = 1.0,     # no random sub-sampling
  cv.folds = 0            # can set this >0 for cross-validation
)

# Predict on test data
preds <- predict(boost_fit, test_data, n.trees = 30)
mse <- mean((test_data$y - preds)^2)
print(mse)

––––––––––––––––––––––––––––––––––––––––––––––––––––
C) SAS Example (Using PROC GRADBOOST in SAS Viya or HPFOREST / HPSPLIT Variation)
In older Base SAS, there isn’t a built-in “boosting” procedure, so I might emulate it with macros or code. In newer SAS Viya releases, there is PROC GRADBOOST. Below is an illustrative syntax:

/* Assuming we have a CAS session and a data set MYDATA with inputs x1-xp and target y */
proc gradboost data=mycas.mydata;
   input x1-xp / level=interval;  /* or level=nominal for categorical */
   target y / level=interval;
   autotune NTree=(30)            /* or specify a range to tune */
            LearningRate=(0.1)    /* or specify a search range */;
   savestate rstore=mycas.boost_model;
run;

/* Score new data */
proc gradboost score data=mycas.newdata
   rstore=mycas.boost_model
   out=mycas.scored;
run;

If PROC GRADBOOST is not available, some people replicate boosting by repeatedly fitting residuals in a macro loop with procedures like PROC HPSPLIT (for trees) and saving predictions. But in modern SAS, GRADBOOST handles it directly.

6. Tips, Pitfalls, and Summary
• Learning Rate (Shrinkage): Often set to a relatively small value (e.g., 0.1 or 0.01), because a large learning rate can cause overfitting quickly.
• Number of Iterations (n_estimators / n.trees): Larger M can improve fit but also increase the risk of overfitting. A common practice is to combine a small learning rate with a larger M.
• Base Learner Complexity: For decision-tree-based boosting, a max depth of 1–5 is typical.
• Early Stopping: Use cross-validation or a validation set to stop when the improvement flattens out.

In summary, boosting is a powerful, conceptually simple method that incrementally zeroes in on difficult-to-predict observations. Each new iteration “boosts” the performance by learning from mistakes of the earlier ones. By the end, we have a strong ensemble of weak learners that often achieves excellent predictive accuracy. (Module 8 Asynch)

That concludes this first section on Boosting and a Boosting Walk-through.

NEXT

Below is my second section of the study guide, focusing on XGBoost. I drew from The Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I’m writing in first person and keeping the material easy to copy/paste. I will include references to ESL (The Elements of Statistical Learning) and the video transcript content (Module 8 Asynch). I also provide sample code in Python, R, and SAS for illustration.

SECTION 2: XGBOOST

1. What is XGBoost?
I see “XGBoost” (eXtreme Gradient Boosting) as an efficient, high-performance implementation of the gradient boosting framework. It can handle different types of data, offers several parameter-tuning options, and uses second-order gradient approximations to optimize a user-chosen loss function. This approach often yields state-of-the-art results in machine-learning competitions.
Key references in ESL: Chapter 10 (Boosting), plus general ensemble methods references. (ESL)

2. Core Ideas
• Gradient Boosting Foundation. XGBoost is a specific realization of the “gradient boosting” algorithm. It uses the gradient (and sometimes the second derivative, or Hessian) of the loss function with respect to model predictions at each iteration.
• Approximate Tree Learning. XGBoost grows decision trees level-by-level, with an approximate method for split finding that can handle large datasets efficiently.
• Regularization for Trees. Unlike some earlier tree-based boosting implementations, XGBoost includes L1 (lasso) and L2 (ridge) penalties on the leaf weights. It also penalizes the total number of leaves (T) in each tree via a parameter gamma, thereby controlling overfitting. (Module 8 Asynch)

3. The XGBoost Objective
The typical XGBoost objective function can be summarized as:

Obj = ∑ᵢ₌₁ⁿ L(yᵢ, Fₘ₋₁(xᵢ) + fₘ(xᵢ)) + Ω(fₘ),

where
• L is a loss function, e.g. mean squared error, logistic loss, etc.
• fₘ(x) is the new tree (or base learner) being added in iteration m.
• Ω(fₘ) is a regularization term, typically of the form:
  Ω(f) = γ ⋅ T + ½ λ ∑ⱼ wⱼ²,
  – T = number of leaves in the tree f.
  – wⱼ = leaf weights (scores).
  – λ corresponds to the L2 penalty on the leaf weights.
  – γ penalizes each leaf, encouraging shallower trees.

XGBoost fits the tree by (1) approximating the loss with a second-order Taylor expansion, (2) finding the best splits based on that approximation, and (3) updating the model. (Module 8 Asynch)
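To spell out step (1) a bit more (this is the standard derivation from the XGBoost paper, restated in the notation above): writing gᵢ and hᵢ for the first and second derivatives of L with respect to the prediction, evaluated at Fₘ₋₁(xᵢ), and dropping terms that do not depend on fₘ, the objective is approximated as

Obj ≈ ∑ᵢ [ gᵢ fₘ(xᵢ) + ½ hᵢ fₘ(xᵢ)² ] + Ω(fₘ).

Grouping the observations that land in leaf j and letting Gⱼ = ∑ gᵢ and Hⱼ = ∑ hᵢ over that leaf, the optimal leaf weight comes out to wⱼ* = – Gⱼ / (Hⱼ + λ), and candidate splits are scored by the resulting improvement in ∑ⱼ Gⱼ² / (Hⱼ + λ), less the γ charged for the extra leaf.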

4. Important Hyperparameters
• n_estimators (nrounds in R, n.trees in some references): the maximum number of boosting rounds (trees).
• eta (learning_rate): shrinkage parameter that scales each tree’s contribution (0 < eta ≤ 1). Smaller values slow down learning but can improve generalization.
• max_depth: maximum depth of each tree. Deeper trees are more expressive but can overfit.
• gamma: minimum loss reduction required to make a further partition in a leaf node (i.e., cost complexity). Higher gamma means more conservative tree growth.
• subsample: fraction of the training data to sample in each boosting round (similar to bagging).
• colsample_bytree / colsample_bynode: fraction of features to sample in each tree (or split).
• λ (reg_lambda): L2 regularization on leaf weights (on by default).
• α (reg_alpha): L1 regularization on leaf weights (off by default, set to > 0 to enable).

These hyperparameters help manage overfitting and can drastically affect XGBoost’s performance. Typically, a methodical approach (grid search, random search, or Bayesian optimization) is needed to find optimal values. (Module 8 Asynch)

5. Pseudocode for XGBoost

Initialization:
• F₀(x) = constant (e.g., the average of y if it’s regression).

For m = 1 to M:
 1) For each observation i, compute:
    gᵢ = ∂/∂F(xᵢ) L(yᵢ, Fₘ₋₁(xᵢ)),
    hᵢ = ∂²/∂F(xᵢ)² L(yᵢ, Fₘ₋₁(xᵢ)).  # second derivative
 2) Fit a regression tree to the data, with a specialized splitting criterion that accounts for the gᵢ and hᵢ.
 3) For each leaf j, compute the optimal weight wⱼ that minimizes the approximate loss plus regularization.
 4) Update Fₘ(x) = Fₘ₋₁(x) + η ⋅ fₘ(x).

Return Fₘ(x).
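As a small sketch of the split scoring in step 2 and the leaf weights in step 3 (my own illustration of the standard formulas; G and H denote sums of gᵢ and hᵢ over the observations in a node, and lam/gamma stand for λ and γ):

def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Improvement from splitting a node into (left, right); split only if positive."""
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma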

6. Example Code Snippets in Python, R, and SAS

––––––––––––––––––––––––––––––––––––––––––––––––––––
A) Python Example

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Suppose X, y are your data and targets (NumPy arrays or Pandas DataFrames)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Convert into DMatrix, which is a specialized XGBoost data structure
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

# Set parameters: for regression
params = {
    'objective': 'reg:squarederror',
    'eta': 0.1,
    'max_depth': 3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'lambda': 1.0,  # L2 reg
    'alpha': 0.0    # L1 reg
}

num_rounds = 100  # number of boosting rounds
xgb_model = xgb.train(params, dtrain, num_boost_round=num_rounds)

# Make predictions
preds = xgb_model.predict(dtest)
mse = mean_squared_error(y_test, preds)
print("XGBoost MSE:", mse)

––––––––––––––––––––––––––––––––––––––––––––––––––––
B) R Example (xgboost library)

# install.packages("xgboost")
library(xgboost)

# Suppose df is our data frame, with numeric columns for X and a numeric y
# Make matrices
X <- as.matrix(df[, -which(names(df) == "y")])
y <- df$y

# Train/test split
set.seed(123)
train_idx <- sample(nrow(df), size = 0.8 * nrow(df))
X_train <- X[train_idx, ]
y_train <- y[train_idx]
X_test  <- X[-train_idx, ]
y_test  <- y[-train_idx]

# Create xgb.DMatrix
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest  <- xgb.DMatrix(data = X_test,  label = y_test)

# Set parameters
params <- list(
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 3,
  subsample = 0.8,
  colsample_bytree = 0.8,
  lambda = 1,
  alpha = 0
)

# Train
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 10
)

# Predict
preds <- predict(xgb_model, X_test)
mse <- mean((y_test - preds)^2)
print(mse)

––––––––––––––––––––––––––––––––––––––––––––––––––––
C) SAS Example
In SAS, you can replicate XGBoost-like functionality in a few ways. If you have SAS Viya, you can leverage the xgboost training action in CAS. Otherwise, you can approximate it with PROC GRADBOOST or macros for gradient boosting. Here is an illustrative example in SAS Viya:

/* In SAS Viya: */
proc cas;
   session mysession;
   loadactionset "decisionTree";
   /* Assuming we have table 'mytable' with 'y' as target,
      and x1, x2, ..., xp as predictors. */
   action xgboost.train /
     table={name="mytable"}
     target="y"
     inputs={"x1","x2",..., "xp"}
     nominals={}                  /* specify categorical variables if needed */
     nTree=100
     objective="reg:squarederror"
     maxDepth=3
     eta=0.1
     subsample=0.8
     colSampleByTree=0.8
     regLambda=1
     regAlpha=0
     seed=12345
     savestate={name="myXGBmodel"};
run;

/* Score new data */
proc cas;
   action xgboost.score /
     modelState={name="myXGBmodel"}
     table={name="myNewData"}
     casOut={name="myScoredData", replace=True};
run;

7. Practical Tips and Summary
• Regularization Tuning. Don’t neglect gamma, λ (reg_lambda), and α (reg_alpha). These can be key to controlling overfitting.
• Learning Rate. Typically pick a smaller eta (e.g. 0.01–0.2) and combine with more boosting rounds.
• Subsampling. Using subsample < 1.0 or colsample_bytree < 1.0 often helps reduce variance and speed up training.
• Early Stopping. Using early_stopping_rounds can save time by halting training when the model stops improving on a validation set.
• Custom Losses. One huge advantage of XGBoost is that you can define custom loss functions, as long as they’re differentiable and you can provide the gradient and hessian (see the sketch below).
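As a minimal sketch of that custom-loss hook (it uses XGBoost’s native training API; the squared-error objective here is purely illustrative, since that loss is already built in):

import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective: return (gradient, hessian) of the loss w.r.t. preds."""
    labels = dtrain.get_label()
    grad = preds - labels        # dL/dF for L = 1/2 (y - F)^2
    hess = np.ones_like(preds)   # d2L/dF2 is constant for squared error
    return grad, hess

# Passed via the obj argument of xgb.train, e.g.:
# model = xgb.train(params, dtrain, num_boost_round=100, obj=squared_error_obj)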

In short, XGBoost is a highly optimized framework for gradient boosting with built-in regularization and sophisticated tree-building. It’s widely used in practice for structured data problems and can often outperform simpler methods, provided that you tune the parameters carefully.

Below is my third section of the study guide, focusing on Hyperparameters. I drew from The Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I will include references to ESL (The Elements of Statistical Learning) and the video transcript (Module 8 Asynch). I will also provide some brief code snippets in Python, R, and SAS for illustration.

SECTION 3: HYPERPARAMETERS

1. What Are Hyperparameters?
I consider hyperparameters to be the “settings” or “knobs” that guide the learning process of a model. For a simple linear regression, the parameters are the slopes and intercept (learned from data), but we usually have no hyperparameters to tune. In contrast, for tree-based methods, boosting, and advanced models like neural networks, we have hyperparameters that control complexity, learning rate, regularization, and so on. These hyperparameters are not learned directly from the training data in a simple closed-form manner; instead, we pick them (for instance, by cross-validation or other search methods).
Relevant references: ESL, Chapter 7 on model assessment and selection; also chapters dealing with each algorithm’s specific tunable knobs (e.g., Chapter 10 on boosting). (ESL)

2. Common Hyperparameters for Tree-Based Models
• max_depth: the maximum depth of the tree, controlling how many splits can occur from root to leaf.
• min_samples_split or min_child_weight (in XGBoost): the minimum number of samples needed in a leaf node or child node, which helps prevent overly small partitions and thus overfitting.
• gamma (in XGBoost): additional penalty on leaf splits, requiring a minimum loss reduction before a split can be made.
• subsample or bagging_fraction (the LightGBM name): fraction of the training data to randomly sample for each round (adds randomness, reduces variance).
• colsample_bytree: fraction of features used in each tree.

3. Learning Rate vs. Number of Estimators
• learning_rate (eta): how fast or slow we incorporate a new learner’s contribution in each boosting iteration. Lower learning rates typically require more iterations (n_estimators) to achieve good accuracy, but often generalize better.
• n_estimators (M): the number of boosting rounds (trees in a boosted ensemble). Too few can underfit; too many might overfit if we don’t monitor for early stopping.
These two hyperparameters are typically tuned together: a small learning_rate with a large n_estimators can yield a high-performing model at a cost of more computation. (Module 8 Asynch)
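A hedged sketch of how I would tune this pair in practice with XGBoost’s native API (dtrain and dval are assumed to be existing DMatrix objects for the training and validation splits):

import xgboost as xgb

# Fix a small learning rate, set the round budget generously, and let the
# validation set decide the effective number of rounds.
params = {'objective': 'reg:squarederror', 'eta': 0.05, 'max_depth': 3}
model = xgb.train(
    params, dtrain,
    num_boost_round=2000,         # generous upper bound on M
    evals=[(dval, 'validation')],
    early_stopping_rounds=50      # stop once validation stops improving
)
print(model.best_iteration)       # the M actually used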

4. Regularization Hyperparameters
• L1 regularization α (alpha) encourages sparsity in tree weights (or other model parameters).
• L2 regularization λ (lambda) shrinks the weights, penalizing large values.
• gamma (for XGBoost, LightGBM, etc.) also plays a role in regularization by adding a cost for each leaf in a tree.
• penalty in logistic regression: can be “l1”, “l2”, or “elastic net”, controlling how coefficients are shrunk or forced to zero.
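One practical naming wrinkle worth a note: these same penalties go by different names across XGBoost’s two Python APIs. A small sketch (the values are arbitrary):

import xgboost as xgb

# Native API: the params dict uses 'alpha' and 'lambda'
params = {'objective': 'reg:squarederror', 'alpha': 0.1, 'lambda': 1.0}

# scikit-learn wrapper: keyword arguments are reg_alpha / reg_lambda,
# because 'lambda' is a reserved word in Python and cannot be a keyword
model = xgb.XGBRegressor(reg_alpha=0.1, reg_lambda=1.0)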

5. Searching for Good Hyperparameters
I commonly use systematic search procedures such as grid search or random search, possibly augmented by cross-validation. Automated hyperparameter optimization methods (Bayesian optimization, genetic algorithms, Hyperopt, Optuna, etc.) can also be used. The procedure typically involves:

1) Choose a range or distribution for each hyperparameter.
2) Sample different combinations of these hyperparameters.
3) Fit the model on training folds, evaluate on a validation fold.
4) Pick the combination that yields the best average validation score.
5) Refit on the full training data if needed.

6. Example Code Snippet for Hyperparameter Tuning

––––––––––––––––––––––––––––––––––––––––––––––––––––
A) Python (Using scikit-learn’s GridSearchCV)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [1, 3, 5]
}

gbm = GradientBoostingRegressor()
grid_search = GridSearchCV(
    estimator=gbm,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Best Params:", grid_search.best_params_)
print("Best CV Score:", -grid_search.best_score_)

––––––––––––––––––––––––––––––––––––––––––––––––––––
B) R (Using caret for tuning a GBM)

# install.packages("caret")
library(caret)

train_control <- trainControl(method = "cv", number = 5)
tune_grid <- expand.grid(
  n.trees = c(50, 100),
  interaction.depth = c(1, 3, 5),
  shrinkage = c(0.01, 0.1, 0.2),
  n.minobsinnode = c(5, 10)
)

set.seed(123)
gbm_fit <- train(
  y ~ .,
  data = df,
  method = "gbm",
  trControl = train_control,
  tuneGrid = tune_grid,
  metric = "RMSE",
  verbose = FALSE
)

gbm_fit$bestTune
gbm_fit$results

––––––––––––––––––––––––––––––––––––––––––––––––––––
C) SAS (PROC OPTMODEL or HPC Tuning in SAS Viya)
SAS has macros or procedures (like PROC HPFOREST, HPGENSELECT) that can do some hyperparameter selection, but you may need a manual or macro-based approach.

Example snippet using a macro-based approach for searching hyperparameters in SAS older versions (conceptual outline):

%macro tuneMyForest(data=, target=, maxdepth=, ntrees=);
   proc hpforest data=&data maxtrees=&ntrees. maxdepth=&maxdepth.;
      target &target.;
      input x1-xp;
      /* additional hyperparameters, etc. */
      ods output FitStatistics=FitStats;
   run;
%mend;

%tuneMyForest(data=mydata, target=y, maxdepth=5, ntrees=50);
/* gather FitStats, compare, etc. */

In modern SAS Viya, you can use AutoTune in actions like decisionTree.gbtreeTrain or xgboost.train, specifying search ranges for the hyperparameters.

7. Hyperparameter Tuning Pitfalls
• Overfitting on validation sets if repeatedly searching a large hyperparameter space.
• Setting ranges or distributions too narrow can miss better solutions.
• Computation time can explode with large parameter grids.
• Sometimes it’s easier to fix certain parameters to well-known defaults (e.g., a small learning rate) and only tune the critical ones (like n_estimators, max_depth) to reduce complexity.

8. Summing Up
Hyperparameters are crucial in controlling the behavior and performance of advanced machine-learning models, especially tree-based methods and boosted ensembles. They govern model complexity, regularization strength, and how learning progresses. The right combination of hyperparameters can dramatically improve predictive accuracy while preventing overfitting.

Below is my fourth section of the study guide, focusing on two XGBoost Demos. I drew from The Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I will include references to ESL (The Elements of Statistical Learning) and the video transcript content (Module 8 Asynch). I will also provide sample code so you can replicate the demonstrations.

SECTION 4: XGBOOST DEMO 1 AND 2

DEMO 1: HANDWRITTEN DIGITS CLASSIFICATION (SCIKIT-LEARN + XGBOOST)

1. Data Overview
I see the digits dataset from scikit-learn as an example for multiclass classification. It contains 1,797 samples of 8×8 pixel images representing the digits 0 through 9. Each pixel is a feature, so we have 64 features. We want to classify which digit each image represents.

2. Steps to Replicate

1) Load Libraries and Data
   • Use scikit-learn’s built-in digits dataset:
     from sklearn.datasets import load_digits
     digits = load_digits()
   • X will be digits.data, a (1797 × 64) array, and y will be digits.target (digits 0–9).

2) Split into Training and Test
   • We can do a simple 70–30 split or use cross-validation.

3) Convert Data into XGBoost’s DMatrix Format
   • xgb.DMatrix is a specialized data structure for XGBoost, but we can train directly with the scikit-learn API too.

4) Train an XGBoost Classifier
   • For a multiclass task, set the objective to "multi:softprob" or "multi:softmax" and specify num_class=10.

5) Evaluate Accuracy
   • Use predictions on the test set and measure classification accuracy or a confusion matrix (see the snippet after the code example below).

3. Demo 1: Code Example (Python)

import xgboost as xgb
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
digits = load_digits()
X = digits.data    # shape (1797, 64)
y = digits.target  # labels 0 through 9

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# XGBoost Classifier using scikit-learn API
xgb_clf = xgb.XGBClassifier(
    objective='multi:softprob',
    num_class=10,
    max_depth=3,
    learning_rate=0.1,
    n_estimators=100,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

xgb_clf.fit(X_train, y_train)

# Predict
y_pred = xgb_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("XGBoost classification accuracy:", accuracy)

• You will typically see accuracy in the 0.95–0.98 range depending on parameter settings.
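Since step 5 also mentions a confusion matrix, here is a short follow-on snippet (a sketch that reuses y_test and y_pred from the code above):

from sklearn.metrics import confusion_matrix

# Rows are true digits, columns are predicted digits
cm = confusion_matrix(y_test, y_pred)
print(cm)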

4. Notes
• Because the images are small (8×8), a simple approach like unrolled pixels is enough to get decent results.
• For more complex images, more advanced features or deep learning might be appropriate, but XGBoost can still perform surprisingly well on structured tabular data.

DEMO 2: REGRESSION EXAMPLE (CALIFORNIA HOUSING DATA)

1. Data Overview
• The California Housing dataset (available in scikit-learn) is a regression problem predicting median house prices based on demographic and geographic features.
• Features include median income, house age, average rooms, etc.

2. Steps to Replicate

1) Load Libraries and Data
   • from sklearn.datasets import fetch_california_housing
   • cal_housing = fetch_california_housing()
   • The input features are in cal_housing.data, the target is cal_housing.target.

2) Split into Training and Test

3) Build XGBoost Regressor
   • objective='reg:squarederror' (for standard regression).

4) Evaluate MSE or R²

3. Demo 2: Code Example (Python)

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data
cal_housing = fetch_california_housing()
X = cal_housing.data
y = cal_housing.target

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# XGBoost Regressor
xgb_reg = xgb.XGBRegressor(
    objective='reg:squarederror',
    max_depth=4,
    learning_rate=0.1,
    n_estimators=200,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0,   # 'alpha' in the native API; reg_alpha here, since
    reg_lambda=1,  # 'lambda' is a reserved word in Python
    random_state=42
)

xgb_reg.fit(X_train, y_train)
preds = xgb_reg.predict(X_test)

mse = mean_squared_error(y_test, preds)
print("XGBoost MSE:", mse)
print("XGBoost RMSE:", mse**0.5)

4. Interpretation and Potential Tuning
• You can tune max_depth, learning_rate, alpha, lambda, etc. using a grid search or other methods.
• Typical MSE might be around 0.3–0.4 or so depending on how the target is scaled, and the RMSE around 0.55–0.63 for the default approach.

SAMPLE CODE IN R

• For the digits classification example, you could build your own dataset or use the “mnist” data. For regression on the California Housing, you can pull from external sources or pre-downloaded data. R code is similar, using xgboost::xgb.train or xgboost::xgboost.

library(xgboost)
# Suppose X, y are numeric matrices or vectors
# For classification with multiple classes, set objective="multi:softprob" and num_class=10
# For regression, objective="reg:squarederror"

dtrain <- xgb.DMatrix(data=X_train, label=y_train)
dtest  <- xgb.DMatrix(data=X_test, label=y_test)

params <- list(
  objective = "reg:squarederror",
  max_depth = 4,
  eta = 0.1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

bst <- xgb.train(params = params, data = dtrain, nrounds = 200)
preds <- predict(bst, newdata = dtest)

# Evaluate MSE
mse <- mean((y_test - preds)^2)
cat("MSE:", mse, "\n")

SAMPLE CODE IN SAS

• For classification, use something like:

proc cas;
   session mysession;
   loadactionset "decisionTree";
   action xgboost.train /
     table={name="digits_table"}
     target="digit_label"
     inputs={"pixel1","pixel2",..., "pixel64"}
     objective="multi:softmax"
     numClasses=10
     nTree=100
     maxDepth=3
     eta=0.1
     subsample=0.8
     colSampleByTree=0.8
     randomSeed=42
     savestate={name="digits_xgb_model"};
run;

• For regression, set objective=“reg:squarederror” and remove the numClasses parameter.

SUMMARY
• XGBoost is straightforward to apply once you’re familiar with the API.
• Classification tasks use objective=“multi:softmax” or “multi:softprob”.
• Regression tasks use objective=“reg:squarederror” or sometimes “reg:linear” in older versions.
• The two demos illustrate typical use cases: classification (handwritten digits) and regression (housing prices).

Below is my fifth section of the study guide, focusing on Grid Search. I based it on The Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I will include references to ESL (The Elements of Statistical Learning) and content from the transcripts (Module 8 Asynch). As usual, I’ll also provide copy/paste friendly code snippets for multiple programming languages.

SECTION 5: GRID SEARCH

1. What Is Grid Search?
I view “Grid Search” as a brute-force method for systematically exploring multiple combinations of hyperparameters. You define a discrete set of possible values (a “grid”) for each hyperparameter, then train and evaluate the model for every possible combination. The key steps are:
1) Define ranges (or sets) of possible values for each hyperparameter: for example, max_depth ∈ {2,3,4}, learning_rate ∈ {0.01, 0.1}, etc.
2) For each combination of hyperparameters in the Cartesian product of these sets, train the model on training folds and evaluate on a validation fold (or use cross-validation).
3) Select the combination that yields the best performance metric.
4) Optionally, refit the model with the chosen hyperparameters on the entire training set.
2. Advantages and Disadvantages
• Advantages:
  – Straightforward and easy to understand.
  – For a small parameter space, it can be quite effective.

• Disadvantages:
  – Potentially expensive in computation time, since we evaluate every combination.
  – The total number of combinations grows exponentially with the number of hyperparameters or the size of each grid. For example, a 3 × 4 × 5 grid evaluated with 5-fold cross-validation already requires 300 model fits.

3. Pseudocode Outline

Given:
– A model M(θ) with hyperparameters θ ∈ Θ₁ × Θ₂ × … × Θₚ.
– A performance metric Perf(·).
– A method for model evaluation, e.g. K-fold cross-validation.

Algorithm:
1) best_perf ← –∞ (or some minimal reference)
2) For each combination (θ₁, θ₂, …, θₚ) in Θ₁ × Θ₂ × … × Θₚ:
   a) Train model M(θ) on training folds.
   b) Evaluate on validation fold(s) and compute Perf(θ).
   c) If Perf(θ) > best_perf:
      i) best_perf ← Perf(θ)
      ii) best_params ← θ
3) Return best_params, best_perf

4. Practical Implementation with Cross-Validation
• Typically, in scikit-learn or R’s caret, GridSearchCV or train() automatically uses cross-validation for each combination of hyperparameters.
• The final model is often retrained using the selected “best_params.”

5. Example: Python (Using scikit-learn’s GridSearchCV)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3, 4]
}

# Initialize model
model = GradientBoostingClassifier()

# Set up the search with 5-fold CV (this step was implied by the fit below)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)

# Fit
grid_search.fit(X, y)
print("Best Params:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

# Evaluate on the same data or a separate test set
best_model = grid_search.best_estimator_
preds = best_model.predict(X)
accuracy = accuracy_score(y, preds)
print("Accuracy on entire dataset:", accuracy)  # in-sample, so optimistic

6. Example: R (Using caret)

library(caret)

# Suppose df is a data frame with predictor columns and a factor target "Species"
train_control <- trainControl(method = "cv", number = 5)

grid <- expand.grid(
  n.trees = c(50, 100),
  interaction.depth = c(2, 3, 4),
  shrinkage = c(0.01, 0.1),
  n.minobsinnode = c(5)
)

set.seed(123)
gbm_fit <- train(
  Species ~ .,
  data = iris,
  method = "gbm",
  trControl = train_control,
  tuneGrid = grid,
  verbose = FALSE
)
gbm_fit$bestTune
gbm_fit

7. Example: SAS (Conceptual)
In some SAS environments, you can do manual looping for each hyperparameter combination or use autotuning options. For example, in SAS Viya’s CAS environment, certain actions (like xgboost.train) have an Autotune or “tune” parameter:

proc cas;
   session mySession;
   loadactionset "decisionTree";
   action xgboost.train /
     table={name="your_data"}
     target="your_target"
     inputs={"x1","x2","x3"}
     autotune={
       steps=10,
       objective="AUTO",
       searchmethod="grid",
       parameters={
         { name="nTree",    values="50,100" },
         { name="maxDepth", values="2,4" },
         { name="eta",      values="0.01,0.1" }
       }
     };
run;

If your SAS version lacks these features, you can create a macro loop that calls PROC HPFOREST or PROC GRADBOOST with different parameter settings and collects metrics.

8. When to Use
Grid search is most useful when your hyperparameter space is small or you have strong prior intuition about what ranges to explore. If you have many hyperparameters or a wide range, you might consider Random Search or other optimization approaches.

9. Summary
Grid Search systematically explores a predefined set of hyperparameter values, which guarantees that you don’t miss any combination in that grid. While exhaustive, it can be expensive for large parameter spaces. However, for moderate problem sizes, it remains a standard tool for model selection and can yield excellent results.

Below is my sixth (and final) section of the study guide, focusing on Random Search. As before, I drew on The Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcript material. Code snippets are given in a copy/paste friendly format.

SECTION 6: RANDOM SEARCH

1. What is Random Search?
I consider Random Search an alternative hyperparameter optimization strategy to Grid Search. Instead of exhaustively enumerating a grid of possible hyperparameter values, we randomly sample from specified distributions for each hyperparameter. The key idea is that randomly chosen points in a high-dimensional space can often cover diverse regions more efficiently than an exhaustive (grid) method with the same computational budget.

2. Advantages over Grid Search
• Efficiency: For the same number of trials, random search often finds better parameter settings than a coarse grid, especially when only a few hyperparameters are truly influential.
• Scalability: By sampling from each hyperparameter’s distribution, you can easily add more samples or draw from specialized distributions (log scale, uniform, etc.).
• Adaptability: If you discover you need more trials, you can just continue sampling.

3. Basic Steps

1) Define a probability distribution or range for each hyperparameter. For example, learning_rate ∼ Uniform(0.01, 0.2), max_depth ∈ {2, 3, 4, 5}, or alpha ∼ LogUniform(1e–5, 1).
2) Randomly sample a set of hyperparameter configurations.
3) For each sampled configuration, train the model (e.g., with cross-validation) and record a performance metric.
4) Keep track of the best-performing combination and possibly keep searching as resources allow.

4. Example: Python (Using scikit-learn’s RandomizedSearchCV)

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from scipy.stats import randint, uniform

# Load data (the Boston dataset used originally was removed from recent
# scikit-learn releases; California Housing is a comparable regression problem)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Define parameter distributions (a fresh value is drawn for every trial)
param_dist = {
    'n_estimators': randint(50, 200),      # integers in [50, 200)
    'learning_rate': uniform(0.01, 0.19),  # uniform on [0.01, 0.2]
    'max_depth': randint(2, 6)             # integers in {2, 3, 4, 5}
}

# Model
model = GradientBoostingRegressor()

# Random search over 20 sampled configurations with 5-fold CV
# (this constructor and fit were implied by the original text)
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=20,
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42
)
random_search.fit(X, y)
print("Best Params:", random_search.best_params_)
print("Best CV Score:", -random_search.best_score_)

5. Example: R (Using caret’s random search)

library(caret)

# Define the model (e.g., gbm for boosting); search = "random" makes caret
# sample random hyperparameter combinations instead of using a fixed grid
train_control <- trainControl(method = "cv", number = 5, search = "random")

model <- train(
  Species ~ .,
  data = iris,
  method = "gbm",
  trControl = train_control,
  tuneLength = 10  # will try 10 random combinations
)

model$bestTune
model$results

6. Example: SAS (Conceptual)
In SAS, you can randomly generate parameter sets in a macro or in a CAS action if available. If you do not have an automated random search function, you can do something like:

%macro run_random_search(num_runs=10);
  %do i=1 %to &num_runs;
    %let ntrees = %sysfunc(rand(integer, 50, 200));
    %let depth  = %sysfunc(rand(integer, 2, 6));
    %let lr     = %sysevalf(%sysfunc(rand(uniform))*(0.2-0.01)+0.01);

  /* Then call your chosen procedure (e.g., PROC GRADBOOST, HPFOREST, or xgboost.train)
     with these parameters and record performance. */

  %end;
%mend;

%run_random_search(num_runs=10);

7. When to Use Random Search
• Especially helpful if your model has many hyperparameters or if you believe only a subset of them significantly affects performance.
• Useful as a first pass to locate promising regions, followed by a more fine-grained search or a Bayesian optimization method.
• If you want to quickly scale the number of trials or budget more computing time, random search is easy to extend.

8. Summary
Random Search is a flexible and often surprisingly effective approach to hyperparameter tuning, particularly in higher-dimensional spaces where grid search becomes prohibitively expensive. By specifying meaningful distributions for each hyperparameter, you can focus your search in promising regions and efficiently uncover a strong model configuration.

Here are three key takeaways from the Random Search section:

  1. Random Search samples hyperparameter combinations from predefined distributions, often providing better coverage of the search space than an exhaustive grid for the same number of trials.
  2. It is highly flexible and scalable, making it easy to adjust the number of tested combinations if more time or resources become available.
  3. Random Search can be combined with other methods (like Bayesian optimization) to refine the search after identifying promising regions of hyperparameter values.

Here are three thought-provoking questions to consider:

  1. In what types of scenarios might a purely random approach fail to locate good hyperparameters, and how could you mitigate this?
  2. How can domain knowledge guide your choice of distributions for the hyperparameters, instead of using uniform or naive distributions?
  3. After finding a good hyperparameter set through random search, how do we decide whether to refine it further or accept it as final?
---
title: "Boosting - 7333 - QTW"
output: html_notebook
---

Below is first section of the study guide on Boosting and a Boosting Walk-through. I drew from the Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I am writing in first person and keeping the material copy/paste friendly. I include references to the text (ESL, The Elements of Statistical Learning) citeturn0file0 and to the video transcript (Module 8 Asynch) citeturn0file1. I also provide example code in R, SAS, and Python so that you can see how the method is implemented across different environments.

SECTION 1: BOOSTING / BOOSTING WALK-THROUGH

1. Overview and Intuition of Boosting
I see boosting as a method for leveraging many weak learners (each only slightly better than random guessing) and combining them to form a single strong learner. The main idea is that each new model in the “boosting sequence” focuses on the errors made by the previous models. This means boosting is a sequential (not parallel) procedure that iteratively refines its predictions. citeturn0file1

• Contrasting with Bagging:
  – Bagging uses bootstrap samples in parallel and then averages all the resulting models.
  – Boosting proceeds in a forward-stagewise manner, reweighting or re-focusing on the hard-to-predict samples each round. citeturn0file1

• Key Advantages:
  – Often yields lower bias than a single model like a single tree, since boosting iteratively “corrects” itself.
  – Flexible: can work with decision trees, linear models, or other base learners.
  – Can handle a variety of loss functions (classification, regression, etc.) citeturn0file0

• Key Disadvantages:
  – Can be more computationally expensive (because of its sequential nature).
  – Potential to overfit if not properly regularized or if too many iterations are used. citeturn0file1

2. General Boosting Algorithm Steps
1) Fit a weak learner to the data (for example, a very shallow tree).  
2) Evaluate its predictions and compute the residuals or errors.  
3) Make those residuals (or a transformed version) the new “target” in the next iteration.  
4) Fit a new weak learner to these residuals.  
5) Repeat until a stopping criterion is reached (e.g., a maximum number of iterations or minimal improvement). citeturn0file1

Mathematically (high-level), the procedure in a regression setting can be thought of like this:

• Let F(x) denote the current model (initialized to something simple, such as a constant).  
• At iteration m:
  – Compute the residuals rᵢ = yᵢ – F(xᵢ).  
  – Fit a weak learner hₘ(x) to these residuals.  
  – Update F(x) ← F(x) + ν ⋅ hₘ(x), where ν is a learning rate.  

In practice, variations of this formula exist, especially for different loss functions (logistic loss, etc.). citeturn0file0

3. Boosting Walk-Through Example (Conceptual)
• Step 1: Suppose I have a simple dataset (x, y) where y is somewhat nonlinear in x. I start by fitting a “weak” model—a stump (decision tree of depth=1).  
• Step 2: Calculate residuals: errors = y – prediction. These errors still show clear structure (not random).  
• Step 3: Fit another stump to these errors.  
• Step 4: Add that stump’s predictions (scaled by a small learning rate) to the overall model’s prediction.  
• Step 5: Repeat, each time focusing on what the last model didn’t catch.  

By iteration 30 or so, the overall model can capture quite complicated patterns. That is essentially the “magic” of boosting. It is a simple forward-stagewise procedure that can produce powerful results even if each learner is fairly weak. citeturn0file1

4. Handwritten Formula (Simplified)
Below is a simple version of the boosting update for regression, using a generic loss function L(y, F(x)):

1) Initialize:
   F₀(x) = arg minᵧ ∑ᵢ L(yᵢ, y).  

2) For m = 1 to M:
   a) Compute pseudo-residuals:  
      rᵢₘ = – [∂/∂F(xᵢ)] L(yᵢ, F(xᵢ)) evaluated at F = Fₘ₋₁(xᵢ).  
   b) Fit a weak learner hₘ(x) to the {rᵢₘ}.  
   c) Compute multiplier γₘ = arg minᵧ ∑ᵢ L(yᵢ, Fₘ₋₁(xᵢ) + γ hₘ(xᵢ)).  
   d) Update model:  
      Fₘ(x) = Fₘ₋₁(x) + ν ⋅ γₘ hₘ(x).  

3) Final model: Fₘ(x).  

Here, ν is a learning rate (0 < ν ≤ 1) and M is the maximum number of iterations. This is the formula you’ll often see in references, including ESL Chapter 10. citeturn0file0

5. Example Code in Python, R, and SAS

––––––––––––––––––––––––––––––––––––––––––––––––––––
A) Python Example (Using scikit-learn’s AdaBoost as a Basic Boosting)

from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Suppose X, y are your features and targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# A "weak learner": a shallow decision tree
weak_learner = DecisionTreeRegressor(max_depth=1)

# Build the AdaBoost ensemble
boost_model = AdaBoostRegressor(
    base_estimator=weak_learner,
    n_estimators=30,       # Number of iterations
    learning_rate=0.1,     # Shrinkage or step size
    loss='linear'          # For regression
)

boost_model.fit(X_train, y_train)
preds = boost_model.predict(X_test)

mse = mean_squared_error(y_test, preds)
print("Boosted Model MSE:", mse)

––––––––––––––––––––––––––––––––––––––––––––––––––––
B) R Example (Using the gbm Package)

# install.packages("gbm")
library(gbm)

# Suppose we have a data frame df with columns for features and a column "y"
# We'll do a simple split:
set.seed(123)
n <- nrow(df)
train_idx <- sample(1:n, size = floor(0.8*n))
train_data <- df[train_idx, ]
test_data  <- df[-train_idx, ]

# Fit a boosted model with Gaussian loss (regression)
boost_fit <- gbm(
  formula = y ~ ., 
  distribution = "gaussian",
  data = train_data,
  n.trees = 30,
  interaction.depth = 1,  # max tree depth
  shrinkage = 0.1,        # learning rate
  bag.fraction = 1.0,     # no random sub-sampling
  cv.folds = 0            # can set this >0 for cross-validation
)

# Predict on test data
preds <- predict(boost_fit, test_data, n.trees = 30)
mse <- mean((test_data$y - preds)^2)
print(mse)

––––––––––––––––––––––––––––––––––––––––––––––––––––
C) SAS Example (Using PROC GRADBOOST in SAS Viya or HPFOREST / HPSPLIT Variation)
In older Base SAS, there isn’t a built-in “boosting” procedure, so I might emulate it with macros or code. In newer SAS Viya releases, there is PROC GRADBOOST. Below is an illustrative syntax:

/* Assuming we have a CAS session and a data set MYDATA with inputs x1-xp and target y */
proc gradboost data=mycas.mydata;
   input x1-xp / level=interval; /* or level=nominal for categorical */
   target y / level=interval;
   autotune NTree=(30)           /* or specify a range to tune */
            LearningRate=(0.1)   /* or specify a search range */;
   savestate rstore=mycas.boost_model; 
run;

/* Score new data */
proc gradboost score data=mycas.newdata
   rstore=mycas.boost_model
   out=mycas.scored;
run;

If PROC GRADBOOST is not available, some people replicate boosting by repeatedly fitting residuals in a macro loop with procedures like PROC HPSPLIT (for trees) and saving predictions. But in modern SAS, GRADBOOST handles it directly.

6. Tips, Pitfalls, and Summary
• Learning Rate (Shrinkage): Often set to a relatively small value (e.g., 0.1 or 0.01), because a large learning rate can cause overfitting quickly.  
• Number of Iterations (n_estimators / n.trees): Larger M can improve fit but also increase the risk of overfitting. A common practice is to combine a small learning rate with a larger M.  
• Base Learner Complexity: For decision-tree-based boosting, a max depth of 1–5 is typical.  
• Early Stopping: Use cross-validation or a validation set to stop when the improvement flattens out.  

In summary, boosting is a powerful, conceptually simple method that incrementally zeroes in on difficult-to-predict observations. Each new iteration “boosts” the performance by learning from mistakes of the earlier ones. By the end, we have a strong ensemble of weak learners that often achieves excellent predictive accuracy. citeturn0file1

That concludes this first section on Boosting and a Boosting Walk-through.  


NEXT

Below is my second section of the study guide, focusing on XGBoost. I drew from the Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I’m writing in first person and keeping the material easy to copy/paste. I will include references to ESL (The Elements of Statistical Learning) citeturn0file0 and the video transcript content (Module 8 Asynch) citeturn0file1. I also provide sample code in Python, R, and SAS for illustration.

SECTION 2: XGBOOST

1. What is XGBoost?
I see “XGBoost” (eXtreme Gradient Boosting) as an efficient, high-performance implementation of the gradient boosting framework. It can handle different types of data, offers several parameter-tuning options, and uses second-order gradient approximations to optimize a user-chosen loss function. This approach often yields state-of-the-art results in machine-learning competitions.  
Key references in ESL: Chapter 10 (Boosting), plus general ensemble methods references. citeturn0file0

2. Core Ideas
• Gradient Boosting Foundation. XGBoost is a specific realization of the “gradient boosting” algorithm. It uses the gradient (and sometimes the second derivative, or Hessian) of the loss function with respect to model predictions at each iteration.  
• Approximate Tree Learning. XGBoost grows decision trees level-by-level, with an approximate method for split finding that can handle large datasets efficiently.  
• Regularization for Trees. Unlike some earlier tree-based boosting implementations, XGBoost includes L1 (lasso) and L2 (ridge) penalties on the leaf weights. It also penalizes the total number of leaves (T) in each tree via a parameter gamma, thereby controlling overfitting. citeturn0file1

3. The XGBoost Objective
The typical XGBoost objective function can be summarized as:

Obj = ∑(ᵢ=1 to n) L(yᵢ, Fₘ₋₁(xᵢ) + fₘ(xᵢ)) + Ω(fₘ),  

where
• L is a loss function, e.g. mean squared error, logistic loss, etc.  
• fₘ(x) is the new tree (or base learner) being added in iteration m.  
• Ω(fₘ) is a regularization term, typically of the form:  
  Ω(f) = γ ⋅ T + ½ λ ∑(wⱼ²),  
  – T = number of leaves in the tree f.  
  – wⱼ = leaf weights (scores).  
  – λ corresponds to L2 penalty on the leaf weights.  
  – γ penalizes each leaf, encouraging shallower trees.  

XGBoost fits the tree by (1) approximating the loss with a second-order Taylor expansion, (2) finding the best splits based on that approximation, and (3) updating the model. citeturn0file1

4. Important Hyperparameters
• n_estimators (nrounds in R, n.trees in some references): the maximum number of boosting rounds (trees).  
• eta (learning_rate): shrinkage parameter that scales each tree’s contribution (0 < eta ≤ 1). Smaller values slow down learning but can improve generalization.  
• max_depth: maximum depth of each tree. Deeper trees are more expressive but can overfit.  
• gamma: minimum loss reduction required to make a further partition in a leaf node (i.e., cost complexity). Higher gamma means more conservative tree growth.  
• subsample: fraction of the training data to sample in each boosting round (similar to bagging).  
• colsample_bytree / colsample_bynode: fraction of features to sample in each tree (or split).  
• λ (reg_lambda): L2 regularization on leaf weights (on by default).  
• α (reg_alpha): L1 regularization on leaf weights (off by default, set to > 0 to enable).  

These hyperparameters help manage overfitting and can drastically affect XGBoost’s performance. Typically, a methodical approach (grid search, random search, or Bayesian optimization) is needed to find optimal values. citeturn0file1

5. Pseudocode for XGBoost

Initialization:
• F₀(x) = constant (e.g., the average of y if it’s regression).

For m = 1 to M:
 1) For each observation i, compute:
    gᵢ = ∂/∂F(xᵢ) L(yᵢ, Fₘ₋₁(xᵢ)),
    hᵢ = ∂²/∂F(xᵢ)² L(yᵢ, Fₘ₋₁(xᵢ)).  # second derivative
 2) Fit a regression tree to the points {(gᵢ, hᵢ)}, with specialized splitting criteria that accounts for gᵢ and hᵢ.
 3) For each leaf j, compute the optimal weight wⱼ that minimizes the approximate loss plus regularization.
 4) Update Fₘ(x) = Fₘ₋₁(x) + η ⋅ fₘ(x).  

Return Fₘ(x).  

6. Example Code Snippets in Python, R, and SAS

––––––––––––––––––––––––––––––––––––––––––––––––––––
A) Python Example

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Suppose X, y are your data and targets (NumPy arrays or Pandas DataFrames)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Convert into DMatrix, which is a specialized XGBoost data structure
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

# Set parameters: for regression
params = {
    'objective': 'reg:squarederror',
    'eta': 0.1,
    'max_depth': 3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'lambda': 1.0,  # L2 reg
    'alpha': 0.0    # L1 reg
}

num_rounds = 100  # number of boosting rounds
xgb_model = xgb.train(params, dtrain, num_boost_round=num_rounds)

# Make predictions
preds = xgb_model.predict(dtest)
mse = mean_squared_error(y_test, preds)
print("XGBoost MSE: ", mse)

––––––––––––––––––––––––––––––––––––––––––––––––––––
B) R Example (xgboost library)

# install.packages("xgboost")
library(xgboost)

# Suppose df is our data frame, with numeric columns for X and a numeric y
# Make matrices
X <- as.matrix(df[, -which(names(df) == "y")])
y <- df$y

# Train/test split
set.seed(123)
train_idx <- sample(nrow(df), size = 0.8 * nrow(df))
X_train <- X[train_idx, ]
y_train <- y[train_idx]
X_test  <- X[-train_idx, ]
y_test  <- y[-train_idx]

# Create xgb.DMatrix
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest  <- xgb.DMatrix(data = X_test,  label = y_test)

# Set parameters
params <- list(
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 3,
  subsample = 0.8,
  colsample_bytree = 0.8,
  lambda = 1,
  alpha = 0
)

# Train
xgb_model <- xgb.train(
  params = params, 
  data = dtrain, 
  nrounds = 100,
  watchlist = list(train = dtrain, test = dtest),
  early_stopping_rounds = 10
)

# Predict
preds <- predict(xgb_model, X_test)
mse <- mean((y_test - preds)^2)
print(mse)

––––––––––––––––––––––––––––––––––––––––––––––––––––
C) SAS Example
In SAS, you can replicate XGBoost-like functionality in a few ways. If you have SAS Viya, you can leverage PROC XGBOOST. Otherwise, you can approximate it with PROC GRADBOOST or macros for gradient boosting. Here is an example with PROC XGBOOST in SAS Viya:

/* In SAS Viya: */
proc cas;
   session mysession;
   loadactionset "decisionTree";
   /* Assuming we have table 'mytable' with 'y' as target,
      and x1, x2, ..., xp as predictors. */
   action xgboost.train /
     table={name="mytable"}
     target="y"
     inputs={"x1","x2",..., "xp"}
     nominals={}                  /* specify categorical variables if needed */
     nTree=100
     objective="reg:squarederror"
     maxDepth=3
     eta=0.1
     subsample=0.8
     colSampleByTree=0.8
     regLambda=1
     regAlpha=0
     seed=12345
     savestate={name="myXGBmodel"};
run;

/* Score new data */
proc cas;
   action xgboost.score /
     modelState={name="myXGBmodel"}
     table={name="myNewData"}
     casOut={name="myScoredData", replace=True};
run;

7. Practical Tips and Summary
• Regularization Tuning. Don’t neglect gamma, λ (reg_lambda), and α (reg_alpha). These can be key to controlling overfitting.  
• Learning Rate. Typically pick a smaller eta (e.g. 0.01–0.2) and combine with more boosting rounds.  
• Subsampling. Using subsample < 1.0 or colsample_bytree < 1.0 often helps reduce variance and speed up training.  
• Early Stopping. Using early_stopping_rounds can save time by halting training when the model stops improving on a validation set.  
• Custom Losses. One huge advantage of XGBoost is that you can define custom loss functions, as long as they’re differentiable and you can provide gradient and hessian.  

In short, XGBoost is a highly optimized framework for gradient boosting with built-in regularization and sophisticated tree-building. It’s widely used in practice for structured data problems and can often outperform simpler methods, provided that you tune the parameters carefully.  


Below is my third section of the study guide, focusing on Hyperparameters. I drew from The Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I will include references to ESL (The Elements of Statistical Learning) citeturn0file0 and the video transcript (Module 8 Asynch) citeturn0file1. I will also provide some brief code snippets in Python, R, and SAS for illustration.

SECTION 3: HYPERPARAMETERS

1. What Are Hyperparameters?
I consider hyperparameters to be the “settings” or “knobs” that guide the learning process of a model. For a simple linear regression, the parameters are the slopes and intercept (learned from data), but we usually have no hyperparameters to tune. In contrast, for tree-based methods, boosting, and advanced models like neural networks, we have hyperparameters that control complexity, learning rate, regularization, and so on. These hyperparameters are not learned directly from the training data in a simple closed-form manner; instead, we pick them (for instance, by cross-validation or other search methods).  
Relevant references: ESL, Chapter 7 on model assessment and selection; also chapters dealing with each algorithm’s specific tunable knobs (e.g., Chapter 10 on boosting). citeturn0file0
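
As a quick toy illustration of the distinction (my own example, not from the references): in scikit-learn, parameters such as regression coefficients come out of fit(), while hyperparameters are arguments we choose before fitting.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

# Parameters: learned from the data by fit()
lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)   # estimated slope and intercept

# Hyperparameters: chosen by us up front, then held fixed during fit()
gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3).fit(X, y)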

2. Common Hyperparameters for Tree-Based Models
• max_depth: the maximum depth of the tree, controlling how many splits can occur from root to leaf.  
• min_samples_split (scikit-learn) or min_child_weight (XGBoost): the minimum number of samples required to split a node, or the minimum total instance weight (hessian) required in a child node; both guard against overly small partitions and thus overfitting.  
• gamma (in XGBoost): additional penalty on leaf splits, requiring a minimum loss reduction before a split can be made.  
• subsample (called bagging_fraction in LightGBM): fraction of the training data randomly sampled for each round (adds randomness, reduces variance).  
• colsample_bytree: fraction of features used in each tree (see the sketch right after this list).  
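
Below is that sketch, using the XGBoost scikit-learn wrapper (the values are arbitrary illustrations, not tuning advice):

import xgboost as xgb

xgb_reg = xgb.XGBRegressor(
    max_depth=3,             # limits splits from root to leaf
    min_child_weight=5,      # minimum total instance weight in a child
    gamma=0.1,               # minimum loss reduction required to split
    subsample=0.8,           # row sampling per boosting round
    colsample_bytree=0.8     # feature sampling per tree
)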

3. Learning Rate vs. Number of Estimators
• learning_rate (eta): how fast or slow we incorporate a new learner’s contribution in each boosting iteration. Lower learning rates typically require more iterations (n_estimators) to achieve good accuracy, but often generalize better.  
• n_estimators (M): the number of boosting rounds (trees in a boosted ensemble). Too few can underfit; too many might overfit if we don’t monitor for early stopping.  
These two hyperparameters are typically tuned together: small learning_rate with a large n_estimators can yield a high-performing model at a cost of more computation. citeturn0file1
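
As a small, self-contained illustration of this trade-off (my own sketch on synthetic data), compare an aggressive learning rate with few trees against a small learning rate with many trees:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

fast = GradientBoostingRegressor(learning_rate=0.5, n_estimators=50).fit(X_tr, y_tr)
slow = GradientBoostingRegressor(learning_rate=0.05, n_estimators=500).fit(X_tr, y_tr)

print("fast MSE:", mean_squared_error(y_te, fast.predict(X_te)))
print("slow MSE:", mean_squared_error(y_te, slow.predict(X_te)))

The slower configuration often generalizes a bit better, at roughly ten times the computation.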

4. Regularization Hyperparameters
• L1 regularization α (alpha) encourages sparsity in tree weights (or other model parameters).  
• L2 regularization λ (lambda) shrinks the weights, penalizing large values.  
• gamma (for XGBoost, LightGBM, etc.) also plays a role in regularization by adding a cost for each leaf in a tree.  
• penalty in logistic regression: can be "l1", "l2", or "elasticnet" (scikit-learn's spelling), controlling how coefficients are shrunk or forced to zero. A brief sketch of these settings in code follows.  
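
Here it is, hedged, showing where these penalties appear in two common APIs (the values are placeholders):

import xgboost as xgb
from sklearn.linear_model import LogisticRegression

# XGBoost: L1 (reg_alpha), L2 (reg_lambda), and a per-leaf cost (gamma)
xgb_reg = xgb.XGBRegressor(reg_alpha=0.1, reg_lambda=1.0, gamma=0.2)

# Logistic regression: penalty type plus inverse regularization strength C
logit = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)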

5. Searching for Good Hyperparameters
I commonly use systematic search procedures such as grid search or random search, possibly augmented by cross-validation. Automated hyperparameter optimization methods (Bayesian optimization, genetic algorithms, Hyperopt, Optuna, etc.) can also be used. The procedure typically involves:  
1) Choose a range or distribution for each hyperparameter.  
2) Sample different combinations of these hyperparameters.  
3) Fit the model on training folds, evaluate on a validation fold.  
4) Pick the combination that yields the best average validation score.  
5) Refit on the full training data if needed.  

6. Example Code Snippet for Hyperparameter Tuning

––––––––––––––––––––––––––––––––––––––––––––––––––––
A) Python (Using scikit-learn’s GridSearchCV)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [1, 3, 5]
}

gbm = GradientBoostingRegressor()
grid_search = GridSearchCV(
    estimator=gbm,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print("Best Params:", grid_search.best_params_)
print("Best CV Score:", -grid_search.best_score_)

––––––––––––––––––––––––––––––––––––––––––––––––––––
B) R (Using caret for tuning a GBM)

# install.packages("caret")
library(caret)

train_control <- trainControl(method = "cv", number = 5)
tune_grid <- expand.grid(
  n.trees = c(50, 100),
  interaction.depth = c(1, 3, 5),
  shrinkage = c(0.01, 0.1, 0.2),
  n.minobsinnode = c(5, 10)
)

set.seed(123)
gbm_fit <- train(
  y ~ ., data = df,
  method = "gbm",
  trControl = train_control,
  tuneGrid = tune_grid,
  metric = "RMSE",
  verbose = FALSE
)

gbm_fit$bestTune
gbm_fit$results

––––––––––––––––––––––––––––––––––––––––––––––––––––
C) SAS (Macro-Based Search, or Autotune in SAS Viya)
SAS has high-performance procedures (such as PROC HPFOREST and PROC HPGENSELECT) that expose some tunable options, but a systematic hyperparameter search usually requires a manual or macro-based approach.

Example snippet using a macro-based approach for searching hyperparameters in SAS older versions (conceptual outline):

%macro tuneMyForest(data=, target=, maxdepth=, ntrees=);
   /* In PROC HPFOREST, tree count and depth are options
      on the PROC statement itself */
   proc hpforest data=&data maxtrees=&ntrees. maxdepth=&maxdepth.;
      target &target.;
      input x1-xp;
      /* additional hyperparameters, etc. */
      ods output FitStatistics=FitStats;
   run;
%mend;

%tuneMyForest(data=mydata, target=y, maxdepth=5, ntrees=50);
/* gather FitStats, compare, etc. */

In modern SAS Viya, you can use AutoTune in actions like decisionTree.gbtreeTrain or xgboost.train, specifying search ranges for the hyperparameters.  

7. Hyperparameter Tuning Pitfalls
• Overfitting the validation data if you repeatedly search a large hyperparameter space (nested cross-validation, sketched below, is one guard against this).  
• Setting ranges or distributions too narrow can miss better solutions.  
• Computation time can explode with large parameter grids.  
• It is often sensible to fix some parameters at well-known defaults (e.g., a small learning rate) and tune only the most influential ones (like n_estimators and max_depth) to keep the search manageable.  
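
One standard guard against the first pitfall is nested cross-validation: an inner loop picks hyperparameters and an outer loop scores the whole tuning procedure. A minimal scikit-learn sketch (my own, with an arbitrarily small grid):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)

inner = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"max_depth": [1, 3], "learning_rate": [0.05, 0.1]},
    cv=3
)

# The outer CV evaluates the tuning procedure itself, not one fitted model
outer_scores = cross_val_score(inner, X, y, cv=5)
print("Nested CV accuracy:", outer_scores.mean())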

8. Summing Up
Hyperparameters are crucial in controlling the behavior and performance of advanced machine-learning models, especially tree-based methods and boosted ensembles. They govern model complexity, regularization strength, and how learning progresses. The right combination of hyperparameters can dramatically improve predictive accuracy while preventing overfitting.  

Below is my fourth section of the study guide, focusing on two XGBoost Demos. I drew from the Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I will include references to ESL (The Elements of Statistical Learning) citeturn0file0 and the video transcript content (Module 8 Asynch) citeturn0file1. I will also provide sample code so you can replicate the demonstrations.

SECTION 4: XGBOOST DEMO 1 AND 2

DEMO 1: HANDWRITTEN DIGITS CLASSIFICATION (SCIKIT-LEARN + XGBOOST)

1. Data Overview  
I see the digits dataset from scikit-learn as an example for multiclass classification. It contains 1,797 samples of 8×8 pixel images representing the digits 0 through 9. Each pixel is a feature, so we have 64 features. We want to classify which digit each image represents.  

2. Steps to Replicate

(A) Load Libraries and Data  
• Use scikit-learn’s built-in digits dataset:  
  from sklearn.datasets import load_digits  
  digits = load_digits()  

• X will be digits.data, a (1797 × 64) array, and y will be digits.target (digits 0–9).

(B) Split into Training and Test  
• We can do a simple 70–30 split or use cross-validation.  

(C) Convert Data into XGBoost’s DMatrix Format  
• xgb.DMatrix is a specialized data structure for XGBoost, though we can also train directly through the scikit-learn API (as the code below does).  

(D) Train an XGBoost Classifier  
• For a multiclass task, set the objective to "multi:softprob" or "multi:softmax"; the native API also needs num_class=10, while the scikit-learn wrapper infers the number of classes from the labels.  

(E) Evaluate Accuracy  
• Use predictions on the test set and measure classification accuracy or confusion matrix.  

3. Demo 1: Code Example (Python)

import xgboost as xgb
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
digits = load_digits()
X = digits.data       # shape (1797, 64)
y = digits.target     # labels 0 through 9

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# XGBoost Classifier using scikit-learn API
xgb_clf = xgb.XGBClassifier(
    objective='multi:softprob',
    max_depth=3,
    learning_rate=0.1,
    n_estimators=100,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
# The scikit-learn wrapper infers num_class from the labels in y_train

xgb_clf.fit(X_train, y_train)

# Predict
y_pred = xgb_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("XGBoost classification accuracy:", accuracy)

• You will typically see accuracy in the 0.95–0.98 range depending on parameter settings.  
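
Since step (E) above also mentions a confusion matrix, here is a short follow-on snippet (it assumes y_test and y_pred from the code above):

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)   # rows = true digits 0-9, columns = predicted digits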

4. Notes
• Because the images are small (8×8), a simple approach like unrolled pixels is enough to get decent results.  
• For more complex images, more advanced features or deep learning might be appropriate, but XGBoost can still perform surprisingly well on structured tabular data.  

DEMO 2: REGRESSION EXAMPLE (CALIFORNIA HOUSING DATA)

1. Data Overview  
• The California Housing dataset (available in scikit-learn) is a regression problem predicting median house prices based on demographic and geographic features.  
• Features include median income, house age, average rooms per household, etc.  

2. Steps to Replicate

(A) Load Libraries and Data  
• from sklearn.datasets import fetch_california_housing  
• cal_housing = fetch_california_housing()  

• The input features are in cal_housing.data, the target is cal_housing.target.

(B) Split into Training and Test  

(C) Build XGBoost Regressor  
• objective='reg:squarederror' (for standard regression).  

(D) Evaluate MSE or R²  

3. Demo 2: Code Example (Python)

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data
cal_housing = fetch_california_housing()
X = cal_housing.data
y = cal_housing.target

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# XGBoost Regressor
xgb_reg = xgb.XGBRegressor(
    objective='reg:squarederror',
    max_depth=4,
    learning_rate=0.1,
    n_estimators=200,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0,     # the wrapper uses reg_alpha / reg_lambda because
    reg_lambda=1,    # 'lambda' is a reserved word in Python
    random_state=42
)

xgb_reg.fit(X_train, y_train)
preds = xgb_reg.predict(X_test)

mse = mean_squared_error(y_test, preds)
print("XGBoost MSE:", mse)
print("XGBoost RMSE:", mse**0.5)

4. Interpretation and Potential Tuning
• You can tune max_depth, learning_rate, reg_alpha, reg_lambda, etc. using a grid search or other methods.  
• Typical MSE might be around 0.3–0.4 or so depending on how the target is scaled, and the RMSE around 0.55–0.63 for the default approach.  

SAMPLE CODE IN R

• For the digits classification example, you could build your own dataset or use the “mnist” data. For regression on the California Housing, you can pull from external sources or pre-downloaded data. R code is similar, using xgboost::xgb.train or xgboost::xgboost.  

library(xgboost)
# Suppose X, y are numeric matrices or vectors
# For classification with multiple classes, set objective="multi:softprob" and num_class=10
# For regression, objective="reg:squarederror"

dtrain <- xgb.DMatrix(data=X_train, label=y_train)
dtest  <- xgb.DMatrix(data=X_test, label=y_test)

params <- list(
  objective = "reg:squarederror",
  max_depth = 4,
  eta = 0.1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

bst <- xgb.train(params = params, data = dtrain, nrounds=200)
preds <- predict(bst, newdata=dtest)

# Evaluate MSE
mse <- mean((y_test - preds)^2)
cat("MSE:", mse, "\n")

SAMPLE CODE IN SAS

• For classification, use something like:

proc cas;
  session mysession;
  loadactionset "decisionTree";
  action xgboost.train /
    table={name="digits_table"}
    target="digit_label"
    inputs={"pixel1","pixel2",..., "pixel64"}
    objective="multi:softmax"
    numClasses=10
    nTree=100
    maxDepth=3
    eta=0.1
    subsample=0.8
    colSampleByTree=0.8
    randomSeed=42
    savestate={name="digits_xgb_model"};
run;

• For regression, set objective="reg:squarederror" and remove the numClasses parameter.  

SUMMARY
• XGBoost is straightforward to apply once you’re familiar with the API.  
• Classification tasks use objective="multi:softmax" or "multi:softprob".  
• Regression tasks use objective="reg:squarederror" or sometimes "reg:linear" in older versions.  
• The two demos illustrate typical use cases: classification (handwritten digits) and regression (housing prices).  


Below is my fifth section of the study guide, focusing on Grid Search. I based it on The Elements of Statistical Learning (ESL) and the Module 8 Asynchronous transcripts you provided. I will include references to ESL (The Elements of Statistical Learning) citeturn0file0 and content from the transcripts (Module 8 Asynch) citeturn0file1. As usual, I'll also provide copy/paste friendly code snippets for multiple programming languages.

SECTION 5: GRID SEARCH

1. What Is Grid Search?
I view “Grid Search” as a brute-force method for systematically exploring multiple combinations of hyperparameters. You define a discrete set of possible values (a “grid”) for each hyperparameter, then train and evaluate the model for every possible combination. The key steps are:

1) Define ranges (or sets) of possible values for each hyperparameter: for example, max_depth ∈ {2,3,4}, learning_rate ∈ {0.01, 0.1}, etc.  
2) For each combination of hyperparameters in the Cartesian product of these sets, train the model on training folds and evaluate on a validation fold (or use cross-validation).  
3) Select the combination that yields the best performance metric.  
4) Optionally, refit the model with the chosen hyperparameters on the entire training set.  

2. Advantages and Disadvantages
• Advantages:  
  – Straightforward and easy to understand.  
  – For a small parameter space, it can be quite effective.  

• Disadvantages:  
  – Potentially expensive in computation time, since we evaluate every combination.  
  – The total number of combinations grows exponentially with the number of hyperparameters or the size of each grid.  

3. Pseudocode Outline

Given:
– A model M(θ) with hyperparameters θ ∈ Θ1 × Θ2 × … × Θp.  
– A performance metric Perf(·).  
– A method for model evaluation, e.g. K-fold cross-validation.

Algorithm:
1) best_perf ← –∞ (or some minimal reference)
2) For each combination (θ₁, θ₂, …, θp) in Θ₁ × Θ₂ × … × Θp:
   a) Train model M(θ) on training folds.
   b) Evaluate on validation fold(s) and compute Perf(θ).
   c) If Perf(θ) > best_perf:
      i) best_perf ← Perf(θ)
      ii) best_params ← θ
3) Return best_params, best_perf  
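
To make the pseudocode concrete, here is a minimal Python translation of that loop (my own sketch), using itertools.product for the Cartesian product and scikit-learn's cross_val_score as Perf(·):

import itertools
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 3, 4],
}

best_perf, best_params = float("-inf"), None
for combo in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    perf = cross_val_score(GradientBoostingClassifier(**params), X, y, cv=5).mean()
    if perf > best_perf:
        best_perf, best_params = perf, params

print("best_params:", best_params, "best_perf:", best_perf)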

4. Practical Implementation with Cross-Validation
• Typically, in scikit-learn or R’s caret, GridSearchCV or train() automatically uses cross-validation for each combination of hyperparameters.  
• The final model is often retrained using the selected “best_params.”  

5. Example: Python (Using scikit-learn’s GridSearchCV)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3, 4]
}

# Initialize model
model = GradientBoostingClassifier()

# Grid Search
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,        # 5-fold cross-validation
    n_jobs=-1    # use all available CPU cores
)

# Fit
grid_search.fit(X, y)
print("Best Params:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

# Inspect the refit model. Note: scoring on the same data used for
# tuning is optimistic; in practice, evaluate on a held-out test set.
best_model = grid_search.best_estimator_
preds = best_model.predict(X)
accuracy = accuracy_score(y, preds)
print("Accuracy on entire dataset:", accuracy)

6. Example: R (Using caret)

library(caret)

# Suppose df is a data frame with predictor columns and a factor target "Species"
train_control <- trainControl(method = "cv", number = 5)

grid <- expand.grid(
  n.trees = c(50, 100),
  interaction.depth = c(2, 3, 4),
  shrinkage = c(0.01, 0.1),
  n.minobsinnode = c(5)
)

set.seed(123)
gbm_fit <- train(
  Species ~ .,
  data = iris,
  method = "gbm",
  trControl = train_control,
  tuneGrid = grid,
  verbose = FALSE
)
gbm_fit$bestTune
gbm_fit

7. Example: SAS (Conceptual)
In some SAS environments, you can do manual looping over each hyperparameter combination or use autotuning options. For example, in SAS Viya's CAS environment, certain actions support an autotune option; the snippet below is a conceptual outline, and the exact action and parameter names vary by release:

proc cas;
   session mySession;
   loadactionset "decisionTree";
   action xgboost.train /
     table={name="your_data"}
     target="your_target"
     inputs={"x1","x2","x3"}
     autotune={
       steps=10,
       objective="AUTO",
       searchmethod="grid",
       parameters={
         { name="nTree", values="50,100" },
         { name="maxDepth", values="2,4" },
         { name="eta", values="0.01,0.1" }
       }
     }
     ;
run;

If your SAS version lacks these features, you can create a macro loop that calls PROC HPFOREST or PROC GRADBOOST with different parameter settings and collects metrics.  

8. When to Use
Grid search is most useful when your hyperparameter space is small or you have strong prior intuition about what ranges to explore. If you have many hyperparameters or a wide range, you might consider Random Search or other optimization approaches.  

9. Summary
Grid Search systematically explores a predefined set of hyperparameter values, which can guarantee that you don’t miss any combination in that grid. While exhaustive, it can be expensive for large parameter spaces. However, for moderate problem sizes, it remains a standard tool for model selection and can yield excellent results.  


Below is my sixth (and final) section of the study guide, focusing on Random Search. As before, I drew on The Elements of Statistical Learning (ESL) citeturn0file0 and the Module 8 Asynchronous transcript material citeturn0file1. Code snippets are given in a copy/paste friendly format.

SECTION 6: RANDOM SEARCH

1. What is Random Search?
I consider Random Search an alternative hyperparameter optimization strategy to Grid Search. Instead of exhaustively enumerating a grid of possible hyperparameter values, we randomly sample from specified distributions for each hyperparameter. The key idea is that randomly chosen points in a high-dimensional space can often cover diverse regions more efficiently than an exhaustive (grid) method with the same computational budget.  

2. Advantages over Grid Search
• Efficiency: For the same number of trials, random search often finds better parameter settings than a coarse grid, especially when only a few hyperparameters are truly influential.  
• Scalability: By sampling from each hyperparameter’s distribution, you can easily add more samples or draw from specialized distributions (log scale, uniform, etc.).  
• Adaptability: If you discover you need more trials, you can just continue sampling.  

3. Basic Steps
1) Define a probability distribution or range for each hyperparameter. For example, learning_rate ∼ Uniform(0.01, 0.2), max_depth ∈ {2, 3, 4, 5}, or alpha ∼ LogUniform(1e-5, 1).  
2) Randomly sample a set of hyperparameter configurations.  
3) For each sampled configuration, train the model (e.g., with cross-validation) and record a performance metric.  
4) Keep track of the best-performing combination and possibly keep searching as resources allow.  

4. Example: Python (Using scikit-learn’s RandomizedSearchCV)

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from scipy.stats import randint, uniform

# Load data (load_boston was removed from recent scikit-learn releases,
# so I use the California housing data here instead)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Define parameter distributions (a fresh value is drawn for each trial)
param_dist = {
    'n_estimators': randint(50, 200),       # integers in 50..199
    'learning_rate': uniform(0.01, 0.19),   # uniform on [0.01, 0.20]
    'max_depth': randint(2, 6)              # integers in {2, 3, 4, 5}
}

# Model
model = GradientBoostingRegressor()

# Random Search
rand_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=10,               # number of parameter settings to try
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42,
    n_jobs=-1
)

rand_search.fit(X, y)

best_params = rand_search.best_params_
best_score = -rand_search.best_score_  # because we used neg MSE
print("Best Params:", best_params)
print("Best CV MSE:", best_score)

5. Example: R (Using caret with “search = 'random'”)

library(caret)

data(iris)
set.seed(123)

train_control <- trainControl(method = "cv", number = 5, search = "random")

# Define the model (e.g., gbm for boosting)
model <- train(
  Species ~ .,
  data = iris,
  method = "gbm",
  trControl = train_control,
  tuneLength = 10   # will try 10 random combinations
)

model$bestTune
model$results

6. Example: SAS (Conceptual)
In SAS, you can randomly generate parameter sets in a macro or in a CAS action if available. If you do not have an automated random search function, you can do something like:

%macro run_random_search(num_runs=10);
   %do i=1 %to &num_runs;
      %let ntrees = %sysfunc(rand(integer, 50, 200));
      %let depth = %sysfunc(rand(integer, 2, 6));
      %let lr    = %sysevalf(%sysfunc(rand(uniform))*(0.2-0.01)+0.01);

      /* Then call your chosen procedure (e.g., PROC GRADBOOST, HPFOREST, or xgboost.train)
         with these parameters and record performance. */
   %end;
%mend;

%run_random_search(num_runs=10);

7. When to Use Random Search
• Especially helpful if your model has many hyperparameters or if you believe only a subset of them significantly affects performance.  
• Useful as a first pass to locate promising regions, followed by a more fine-grained search or a Bayesian optimization method.  
• If you want to quickly scale the number of trials or budget more computing time, random search is easy to extend.  

8. Summary
Random Search is a flexible and often surprisingly effective approach to hyperparameter tuning, particularly in higher-dimensional spaces where grid search becomes prohibitively expensive. By specifying meaningful distributions for each hyperparameter, you can focus your search in promising regions and efficiently uncover a strong model configuration.

Here are three key takeaways from the Random Search section:

1) Random Search samples hyperparameter combinations from predefined distributions, often providing better coverage of the search space than an exhaustive grid for the same number of trials.  
2) It is highly flexible and scalable, making it easy to adjust the number of tested combinations if more time or resources become available.  
3) Random Search can be combined with other methods (like Bayesian optimization) to refine the search after identifying promising regions of hyperparameter values.

Here are three thought-provoking questions to consider:

1) In what types of scenarios might a purely random approach fail to locate good hyperparameters, and how could you mitigate this?  
2) How can domain knowledge guide your choice of distributions for the hyperparameters, instead of using uniform or naive distributions?  
3) After finding a good hyperparameter set through random search, how do we decide whether to refine it further or accept it as final?  
