Primary Questions
- How accurately can an optimized XGBoost machine learning model predict winners and losers across global equity markets?
- What are the most significant features driving stock price movements?
This project explores the development of a classification-based predictive model to identify stock movements across global equity markets - specifically, predicting whether equities increased or decreased in value. Leveraging a robust dataset of financial indicators and performance metrics, the analysis focuses on applying the eXtreme Gradient Boosting (XGBoost) machine learning algorithm to build an efficient and interpretable classification framework.
The primary goal of this project is to evaluate the effectiveness of the model in predicting stock movements based on historical data from diverse equity markets and to enhance its performance through hyperparameter tuning and advanced techniques, such as handling imbalanced data and analyzing feature importance. By employing various predictive strategies, the project aims to showcase the potential of machine learning in facilitating data-driven decision-making and investment analysis within global equity markets.
The dataset for this analysis is stored in financial_data_pred.rda.
suppressPackageStartupMessages(library(xgboost))
## Warning: package 'xgboost' was built under R version 4.4.2
suppressPackageStartupMessages(library(caret))
## Warning: package 'caret' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(xgboostExplainer))
suppressPackageStartupMessages(library(pROC))
suppressPackageStartupMessages(library(SHAPforxgboost))
## Warning: package 'SHAPforxgboost' was built under R version 4.4.2
suppressPackageStartupMessages(library(data.table))
load("financial_data_pred.rda")
The dataset used in this project comprises 221 variables that capture various financial indicators and performance metrics for publicly traded stocks across global equity markets. It has been pre-split into two parts: the training dataset, consisting of 8,000 samples, is used to build and refine the predictive model, while the test dataset, containing 2,000 samples, is used to evaluate the model’s performance. The primary response variable, class, is a binary indicator representing stock movements, where a value of 1 signifies an increase in stock value and 0 indicates a decrease. This classification facilitates a focused analysis of price dynamics in global equity markets, offering insights into the factors influencing stock performance.
str(train_data)
## 'data.frame': 8000 obs. of 221 variables:
## $ Revenue : num 7.44e+10 3.73e+09 9.84e+10 2.55e+10 1.79e+10 ...
## $ Revenue.Growth : num -0.0713 1.1737 0.0182 0.0053 0.0076 ...
## $ Cost.of.Revenue : num 3.90e+10 2.81e+09 7.81e+10 1.82e+10 1.15e+10 ...
## $ Gross.Profit : num 3.54e+10 9.29e+08 2.02e+10 7.32e+09 6.37e+09 ...
## $ R.D.Expenses : num 0.00 1.08e+08 0.00 0.00 0.00 ...
## $ SG.A.Expense : num 2.15e+10 3.44e+08 1.52e+10 6.56e+09 3.47e+09 ...
## $ Operating.Expenses : num 2.15e+10 7.94e+08 1.75e+10 6.59e+09 3.41e+09 ...
## $ Operating.Income : num 1.39e+10 1.35e+08 2.72e+09 7.37e+08 2.96e+09 ...
## $ Interest.Expense : num 7.09e+08 1.21e+07 4.43e+08 4.25e+08 3.02e+08 ...
## $ Earnings.before.Tax : num 1.45e+10 1.75e+08 2.27e+09 2.50e+08 2.71e+09 ...
## $ Income.Tax.Expense : num 2.85e+09 3.96e+07 7.51e+08 8.04e+05 8.83e+08 ...
## $ Net.Income...Non.Controlling.int : num 142000000 -14319180 12000000 0 36900000 ...
## $ Net.Income...Discontinued.ops : num -1.13e+09 0.00 0.00 0.00 0.00 ...
## $ Net.Income : num 1.16e+10 1.36e+08 1.52e+09 2.49e+08 1.82e+09 ...
## $ Preferred.Dividends : num 0.0 0.0 0.0 3.4e+07 0.0 ...
## $ Net.Income.Com : num 1.16e+10 1.36e+08 1.52e+09 2.15e+08 1.82e+09 ...
## $ EPS : num 4.19 0.24 1.47 4.6 2.9 ...
## $ EPS.Diluted : num 4.01 0.226 1.45 4.6 2.83 4.76 1.6 4.88 1.28 2.56 ...
## $ Weighted.Average.Shs.Out : num 2.71e+09 2.99e+08 1.03e+09 4.83e+07 6.14e+08 ...
## $ Weighted.Average.Shs.Out..Dil. : num 2.71e+09 1.83e+07 1.03e+09 4.83e+07 6.14e+08 ...
## $ Dividend.per.Share : num 2.448 0 0.307 0 1.55 ...
## $ Gross.Margin : num 0.475 0.249 0.206 0.287 0.356 ...
## $ EBITDA.Margin : num 0.247 0.0107 0.045 0.042 0.201 0.156 0.256 0.074 0.125 0.358 ...
## $ EBIT.Margin : num 0.2043 0.0502 0.0276 0.0264 0.1681 ...
## $ Profit.Margin : num 0.156 0.0058 0.015 0.008 0.102 0.094 0.154 0.034 0.064 0.207 ...
## $ Free.Cash.Flow.margin : num 0.1359 0.0704 0.0126 0.0144 0.1052 ...
## $ EBITDA : num 1.83e+10 2.46e+08 4.42e+09 1.08e+09 3.60e+09 ...
## $ EBIT : num 1.52e+10 1.88e+08 2.71e+09 6.75e+08 3.01e+09 ...
## $ Consolidated.Income : num 1.18e+10 1.22e+08 1.53e+09 2.49e+08 1.86e+09 ...
## $ Earnings.Before.Tax.Margin : num 0.1948 0.047 0.0231 0.0098 0.1512 ...
## $ Net.Profit.Margin : num 0.1565 0.0364 0.0154 0.0098 0.1019 ...
## $ Cash.and.cash.equivalents : num 8.56e+09 7.74e+08 4.01e+08 1.46e+08 8.67e+08 ...
## $ Short.term.investments : num 2.13e+09 6.08e+08 0.00 0.00 0.00 ...
## $ Cash.and.short.term.investments : num 1.07e+10 1.38e+09 4.01e+08 1.46e+08 8.67e+08 ...
## $ Receivables : num 6.39e+09 1.17e+07 1.12e+09 9.49e+08 1.48e+09 ...
## $ Inventories : num 6.76e+09 5.79e+08 5.65e+09 2.99e+09 1.56e+09 ...
## $ Total.current.assets : num 3.16e+10 2.13e+09 8.83e+09 4.29e+09 4.39e+09 ...
## $ Property..Plant...Equipment.Net : num 2.23e+10 3.09e+08 1.69e+10 1.96e+09 3.94e+09 ...
## $ Goodwill.and.Intangible.Assets : num 8.45e+10 1.91e+08 2.84e+09 4.31e+08 1.37e+10 ...
## $ Long.term.investments : num 0.0 6.3e+07 0.0 0.0 0.0 ...
## $ Tax.assets : num 1.09e+09 3.76e+07 0.00 0.00 7.41e+07 ...
## $ Total.non.current.assets : num 1.13e+11 6.02e+08 2.05e+10 2.66e+09 1.88e+10 ...
## $ Total.assets : num 1.44e+11 2.74e+09 2.93e+10 6.94e+09 2.31e+10 ...
## $ Payables : num 8.46e+09 1.00e+09 4.88e+09 1.29e+09 1.61e+09 ...
## $ Short.term.debt : num 1.56e+10 0.00 1.66e+09 4.92e+07 2.36e+09 ...
## $ Total.current.liabilities : num 3.37e+10 1.64e+09 1.07e+10 2.51e+09 5.42e+09 ...
## $ Long.term.debt : num 1.98e+10 6.22e+08 9.65e+09 5.71e+09 6.42e+09 ...
## $ Total.debt : num 3.54e+10 6.22e+08 1.13e+10 5.76e+09 8.79e+09 ...
## $ Deferred.revenue : num 0 31411043 0 0 0 ...
## $ Tax.Liabilities : num 1.02e+10 0.00 1.63e+09 0.00 1.67e+09 ...
## $ Deposit.Liabilities : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Total.non.current.liabilities : num 4.06e+10 6.62e+08 1.32e+10 6.55e+09 9.73e+09 ...
## $ Total.liabilities : num 7.43e+10 2.30e+09 2.39e+10 9.06e+09 1.52e+10 ...
## $ Other.comprehensive.income : num -7.66e+09 -1.73e+06 -4.64e+08 -3.73e+07 -1.34e+09 ...
## $ Retained.earnings..deficit. : num 8.50e+10 4.29e+06 1.10e+10 -7.52e+09 1.18e+10 ...
## $ Total.shareholders.equity : num 7.00e+10 4.12e+08 5.38e+09 -2.11e+09 6.53e+09 ...
## $ Investments : num 2.13e+09 6.71e+08 0.00 0.00 0.00 ...
## $ Net.Debt : num 2.47e+10 1.16e+09 1.09e+10 5.61e+09 7.92e+09 ...
## $ Other.Assets : num 7.79e+09 2.36e+08 1.66e+09 1.96e+08 4.83e+08 ...
## $ Other.Liabilities : num 9.66e+09 6.39e+08 4.17e+09 1.17e+09 1.45e+09 ...
## $ Depreciation...Amortization : num 3.14e+09 5.82e+07 1.70e+09 4.04e+08 5.85e+08 ...
## $ Stock.based.compensation : num 360000000 36405231 107000000 -10471000 39200000 ...
## $ Operating.Cash.Flow : num 1.40e+10 5.27e+08 3.57e+09 7.02e+08 2.54e+09 ...
## $ Capital.Expenditure : num -3.85e+09 -2.64e+08 -2.33e+09 -3.34e+08 -6.57e+08 ...
## $ Acquisitions.and.disposals : num -2.40e+07 -1.92e+08 -2.34e+09 0.00 6.67e+07 ...
## $ Investment.purchases.and.sales : num -8.05e+08 -2.13e+08 0.00 4.12e+07 2.93e+07 ...
## $ Investing.Cash.flow : num -4.10e+09 -6.87e+08 -4.77e+09 -3.65e+08 -5.62e+08 ...
## $ Issuance..repayment..of.debt : num 3.54e+09 6.19e+08 2.09e+09 -2.95e+08 8.01e+08 ...
## $ Issuance..buybacks..of.shares : num -3.91e+09 1.77e+06 -4.13e+08 3.32e+07 -1.64e+09 ...
## $ Dividend.payments : num -6.91e+09 0.00 -3.19e+08 0.00 -9.83e+08 ...
## $ Financing.Cash.Flow : num -7.28e+09 6.22e+08 1.36e+09 -3.20e+08 -1.82e+09 ...
## $ Effect.of.forex.changes.on.cash : num 39000000 -15648692 0 0 -29200000 ...
## $ Net.cash.flow...Change.in.cash : num 2.62e+09 4.46e+08 1.63e+08 1.70e+07 1.26e+08 ...
## $ Free.Cash.Flow : num 1.01e+10 2.63e+08 1.24e+09 3.68e+08 1.88e+09 ...
## $ Net.Cash.Marketcap : num -0.1163 0.0956 -0.5969 -0.8822 -0.2376 ...
## $ priceBookValueRatio : num 3.04 4.73 3.36 0 5.1 ...
## $ priceToBookRatio : num 0 8.04 7.1 0 0 ...
## $ priceToSalesRatio : num 2.8583 0.0443 0.1858 0.2491 1.861 ...
## $ priceEarningsRatio : num 18.8 81.6 12 28.6 18.7 ...
## $ priceToFreeCashFlowsRatio : num 21.03 1.36 14.63 17.27 17.69 ...
## $ priceToOperatingCashFlowsRatio : num 15.24 21.07 5.11 9.06 13.12 ...
## $ priceCashFlowRatio : num 12.978 0.679 0 0 10.952 ...
## $ priceEarningsToGrowthRatio : num 15.56 2.63 0 0 15.25 ...
## $ priceSalesRatio : num 2.4346 0.0957 0 0 1.5539 ...
## $ dividendYield : num 0.0382 0 1.1588 0.0182 0.0353 ...
## $ enterpriseValueMultiple : num 11.42 7.14 2.28 5.43 9.77 ...
## $ priceFairValue : num 2.589 0.867 0 6.741 4.259 ...
## $ ebitperRevenue : num 0.2043 0.0502 0.0276 0.0264 0.1681 ...
## $ ebtperEBIT : num 0.953 0.935 0.837 0.371 0.9 ...
## $ niperEBT : num 0.803 0.774 0.669 0.997 0.674 ...
## $ grossProfitMargin : num 0.475 0.249 0.206 0.287 0.356 ...
## $ pretaxProfitMargin : num 0.187 0.036 0.0277 0.0289 0.1651 ...
## $ netProfitMargin : num 0.15649 0.03637 0.01544 0.00977 0.10187 ...
## $ effectiveTaxRate : num 0.1967 0.22556 0.33084 0.00321 0.32622 ...
## $ returnOnAssets : num 0.5765 0.0403 0.1011 0.0668 0.6265 ...
## $ returnOnEquity : num 0.166 0.329 0.282 -0.118 0.279 ...
## $ returnOnCapitalEmployed : num 0.0753 0 0.0859 0.1062 0.1041 ...
## $ nIperEBT : num 0.803 0.774 0.669 0.997 0.674 ...
## $ eBTperEBIT : num 0.953 0.935 0.837 0.371 0.9 ...
## [list output truncated]
Initial checks for missing values indicated that both the training and test datasets contained no missing entries.
sum(is.na(train_data))
## [1] 0
sum(is.na(test_data))
## [1] 0
Next, the response variable was examined, revealing that the training dataset contains 3,428 stocks whose value decreased over the year and 4,572 stocks whose value increased. Similarly, the test dataset includes 742 stocks with a decrease in value and 1,258 stocks with an increase.
summary(as.factor(train_data$class))
## 0 1
## 3428 4572
summary(as.factor(test_data$class))
## 0 1
## 742 1258
A bar plot was generated to illustrate the distribution of the response variable in the training dataset. The visualization shows a mild class imbalance: stocks that increased in value account for about 57% of the training samples (4,572 of 8,000) and about 63% of the test samples (1,258 of 2,000). While this imbalance is not extreme, it is worth considering during model development. XGBoost does not resolve class imbalance on its own, but it provides the scale_pos_weight parameter, which scales the weight of the positive class (class = 1) relative to the negative class. Because the positive class is the majority here, a value below 1 would shift emphasis toward the minority class of decliners. Threshold-independent metrics such as AUC or precision-recall curves are also important for evaluating the model under these conditions.
library(ggplot2)
ggplot(train_data, aes(x = as.factor(class))) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Distribution of Stock Movement Classes",
    x = "Class (1 = Increase, 0 = Decrease)",
    y = "Count"
  )
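The scale_pos_weight adjustment discussed above can be sketched as follows. This is a minimal illustration, not part of the original analysis: it uses the training class counts reported in this document and the common negative-to-positive ratio heuristic, and the name `params_balanced` is hypothetical.

```r
# Class counts reported for the training set in this report.
n_neg <- 3428  # class 0 (decrease)
n_pos <- 4572  # class 1 (increase)

# Common heuristic: weight positives by the negative-to-positive ratio.
scale_pos_weight <- n_neg / n_pos
round(scale_pos_weight, 3)

# Hypothetical parameter list (an assumption, not the report's setup);
# it could be supplied as `params = params_balanced` to xgb.train() or xgb.cv().
params_balanced <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  scale_pos_weight = scale_pos_weight
)
```

Because class 1 is the majority here, the ratio comes out below 1, which down-weights gainers rather than up-weighting them. Whether this helps would need to be checked with cross-validation.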
Several key financial metrics were summarized to understand their central tendency and spread. The summary statistics reveal extreme values in metrics such as gross profit margin, price-to-operating-cash-flows ratio, EBIT margin, and free cash flow margin; for example, EBIT margin ranges from -9,626 to 97.59 against a median of 0.088. These observations highlight the variability inherent in financial data, often driven by differences in company size, industry, or performance. While these outliers are notable, they are not expected to harm model performance much: tree splits depend only on the ordering of feature values, so an extreme value influences where a split falls no more than any other observation on the same side of it. The information in these data points can therefore be used without rescaling or trimming.
summary(train_data$grossProfitMargin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -223.4828 0.2800 0.4824 0.4529 0.7602 34.8107
summary(train_data$priceToOperatingCashFlowsRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -108768.68 2.26 9.80 22.23 16.77 48212.90
summary(train_data$EBIT.Margin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9626.000 0.000 0.088 -7.512 0.219 97.590
summary(train_data$Free.Cash.Flow.margin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9373.333 -0.047 0.042 -6.092 0.144 235.927
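The claim that tree splits are insensitive to extreme feature values can be illustrated with a small base R sketch. The vector `x` is a toy example that borrows a few values from the EBIT margin summary above, and `signed_log` is a hypothetical helper. The sketch shows that a strictly increasing transformation of a feature leaves the partition induced by a split unchanged.

```r
# Toy feature with extreme values, loosely mimicking a skewed margin ratio.
x <- c(-9626, -3.2, 0, 0.088, 0.219, 1.5, 97.59)

# A signed log transform is strictly increasing, so it preserves ordering.
signed_log <- function(v) sign(v) * log1p(abs(v))

# A candidate split threshold on the raw scale and its transformed counterpart.
threshold <- 0.15
left_raw <- x <= threshold
left_transformed <- signed_log(x) <= signed_log(threshold)

# Both versions send the same rows to the left child.
identical(left_raw, left_transformed)
```

Since only the ordering matters, compressing the tails with a monotone transform would not change which observations a given split separates.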
A correlation matrix was computed to evaluate relationships among the numeric features in the dataset. The heatmap, which keeps only pairs with an absolute correlation of at least 0.5, shows several variables with values approaching ±1, indicating strong positive or negative relationships. These correlations suggest feature redundancy. Redundant features rarely hurt the predictive accuracy of a tree ensemble such as XGBoost, so no features were removed, but they can split importance scores across near-duplicate variables, which should be kept in mind when interpreting feature importance.
numeric_cols <- sapply(train_data, is.numeric)
numeric_data <- train_data[, numeric_cols]
correlation_matrix <- cor(numeric_data, use = "complete.obs")
library(ggplot2)
suppressPackageStartupMessages(library(reshape2))
correlation_matrix[abs(correlation_matrix) < 0.5] <- NA
correlation_matrix_melt <- melt(correlation_matrix, na.rm = TRUE)
ggplot(data = correlation_matrix_melt, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(
    low = "blue",
    high = "red",
    mid = "white",
    midpoint = 0,
    limit = c(-1, 1),
    space = "Lab",
    name = "Correlation"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  ) +
  labs(title = "Heatmap of Strong Correlations")
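With more than 200 variables, a heatmap can be hard to read, so a tabular list of the strongest pairs may be a useful complement. The sketch below is an illustration on simulated data, not the report's dataset. The column names and the 0.9 cutoff are assumptions chosen for the example.

```r
set.seed(1)

# Simulated stand-in for a few financial columns (not the report's data).
n <- 200
revenue <- rnorm(n, mean = 100, sd = 20)
toy <- data.frame(
  revenue = revenue,
  gross_profit = 0.4 * revenue + rnorm(n, sd = 1),  # nearly collinear
  unrelated = rnorm(n)
)

cor_mat <- cor(toy)

# Keep each pair once by looking only at the upper triangle.
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.9, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r = cor_mat[idx]
)
strong_pairs
```

Applied to the real correlation matrix, the same indexing would list candidate redundant pairs (for instance, duplicated ratio columns) that could be reviewed before interpreting feature importance.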
The box plots for Return on Assets and Net Profit Margin illustrate that these ratios alone offer limited predictive power for distinguishing stocks that rise (class 1) versus those that fall (class 0). In many cases, companies hover around break-even levels, with extreme negative values signifying severe unprofitability or high leverage, and extreme positives reflecting outliers that may stem from exceptional growth or profitability. Despite this pronounced skew and overlap between classes, XGBoost’s tree-based method can naturally accommodate wide financial variations and outliers, allowing the model to capture signals without requiring specialized preprocessing. This approach remains consistent with the project’s methodology, capitalizing on XGBoost’s robustness in handling the inherent variability of financial ratios.
library(ggplot2)
# Box Plot for Return on Assets
ggplot(train_data, aes(
  x = as.factor(class),
  y = returnOnAssets,
  fill = as.factor(class),
  color = as.factor(class)
)) +
  geom_boxplot(outlier.shape = 16, alpha = 0.7) +
  labs(
    title = "Class-Wise Box Plot of Return on Assets",
    x = "Class (0 = Decrease, 1 = Increase)",
    y = "Return on Assets",
    fill = "Class",
    color = "Class"
  ) +
  theme_minimal()
# Box Plot for Net Profit Margin
ggplot(train_data, aes(
  x = as.factor(class),
  y = netProfitMargin,
  fill = as.factor(class),
  color = as.factor(class)
)) +
  geom_boxplot(outlier.shape = 16, alpha = 0.7) +
  labs(
    title = "Class-Wise Box Plot of Net Profit Margin",
    x = "Class (0 = Decrease, 1 = Increase)",
    y = "Net Profit Margin",
    fill = "Class",
    color = "Class"
  ) +
  theme_minimal()
XGBoost Model:
In this project, XGBoost (eXtreme Gradient Boosting) was used as the primary machine learning algorithm for classifying stocks in global equity markets into gainers (class = 1) and decliners (class = 0). Known for its speed and accuracy on tabular data, XGBoost builds an ensemble of decision trees sequentially, fitting each new tree to the gradient of the loss with respect to the current ensemble's predictions so that it corrects the errors that remain. This approach accommodates the skewed and sometimes extreme nature of financial data without requiring extensive preprocessing. The final prediction is the sum of the contributions of all trees in the ensemble, converted to a probability that a stock increases in value.
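To make the boosting step concrete, the sketch below works through a single round for logistic loss in base R. It is a simplified illustration rather than XGBoost's actual implementation: the tree is reduced to a single leaf, and the starting probability of 0.5, the L2 penalty lambda = 1, and the learning rate of 0.1 are assumptions made for the example. The labels reproduce the training class counts reported earlier.

```r
# Labels matching the training class counts in this report.
y <- c(rep(1, 4572), rep(0, 3428))

# Assumed starting point: every stock gets probability 0.5 (margin 0).
margin <- rep(0, length(y))
p <- 1 / (1 + exp(-margin))

# First and second derivatives of the logistic loss w.r.t. the margin.
grad <- p - y
hess <- p * (1 - p)

# Optimal weight for a tree consisting of a single leaf, with an assumed
# L2 penalty lambda = 1 and learning rate eta = 0.1.
lambda <- 1
eta <- 0.1
leaf_weight <- -sum(grad) / (sum(hess) + lambda)

# Updated margin and probability after this one boosting step.
new_margin <- margin + eta * leaf_weight
new_p <- 1 / (1 + exp(-new_margin[1]))
c(leaf_weight = leaf_weight, new_p = new_p)
```

The leaf weight is positive because gainers outnumber decliners, so the predicted probability moves from 0.5 toward the observed share of gainers. A real tree applies the same formula separately within each leaf.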
Preparing data for XGBoost:
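# as.character() first so the labels are 0/1 whether class is stored as a factor or as a number
dtrain <- xgb.DMatrix(data = as.matrix(train_data[, 1:220]), label = as.numeric(as.character(train_data$class)))
dtest <- xgb.DMatrix(data = as.matrix(test_data[, 1:220]), label = as.numeric(as.character(test_data$class)))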
For this binary classification task (class = 1 for increase, 0 for decrease), I specified the objective as ‘binary:logistic’ and chose an arbitrary starting value of 100 training rounds (iterations/trees), leaving all other hyperparameters at their defaults.
set.seed(111111)
bst_initial <- xgboost(
  data = dtrain,
  nrounds = 100,
  verbose = 1,
  print_every_n = 20,
  objective = "binary:logistic",
  eval_metric = c("auc", "error")
)
## [1] train-auc:0.737380
## [21] train-auc:0.954374
## [41] train-auc:0.988477
## [61] train-auc:0.998311
## [81] train-auc:0.999815
## [100] train-auc:0.999983
Predictions were then generated on the test set with the initial model, converted to classes at a probability cutoff of 0.5, and assessed with a confusion matrix.
preds_init <- predict(bst_initial, dtest)
pred_classes_init <- ifelse(preds_init >= 0.5, 1, 0)
conf_mat_init <- table(pred_classes_init, test_data$class)
confusionMatrix(conf_mat_init, positive = "1")
## Confusion Matrix and Statistics
##
##
## pred_classes_init 0 1
## 0 385 462
## 1 357 796
##
## Accuracy : 0.5905
## 95% CI : (0.5686, 0.6122)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : 0.999819
##
## Kappa : 0.1473
##
## Mcnemar's Test P-Value : 0.000279
##
## Sensitivity : 0.6328
## Specificity : 0.5189
## Pos Pred Value : 0.6904
## Neg Pred Value : 0.4545
## Prevalence : 0.6290
## Detection Rate : 0.3980
## Detection Prevalence : 0.5765
## Balanced Accuracy : 0.5758
##
## 'Positive' Class : 1
##
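From the initial model above, we observe an accuracy of 0.5905 and a positive predictive value of 0.6904. The accuracy falls below the no-information rate of 0.629, so the model does not yet outperform always predicting an increase, and the training AUC of nearly 1.0 after 100 rounds points to substantial overfitting. Both observations motivate further model tuning.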
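The headline statistics can be recomputed by hand from the four cell counts, which helps clarify what each one measures. The sketch below copies the counts from the confusion matrix printed above and uses base R only. The variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows = predicted, cols = actual).
cm <- matrix(c(385, 357, 462, 796), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tp <- cm["1", "1"]; tn <- cm["0", "0"]
fp <- cm["1", "0"]; fn <- cm["0", "1"]

accuracy <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of actual gainers predicted as gainers
specificity <- tn / (tn + fp)   # share of actual decliners predicted as decliners
ppv <- tp / (tp + fp)           # share of predicted gainers that actually gained

# Accuracy of always predicting the more frequent actual class.
no_info_rate <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, no_info_rate = no_info_rate,
        sensitivity = sensitivity, specificity = specificity, ppv = ppv,
        balanced_accuracy = (sensitivity + specificity) / 2), 4)
```

Comparing accuracy with the no-information rate is the quickest check of whether a classifier adds value beyond the class frequencies.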
Next, the model was cross-validated with a learning rate of 0.1 and a generous cap of 1,000 boosting rounds, using early stopping to determine an appropriate number of trees.
set.seed(111111)
bst <- xgb.cv(data = dtrain,
              nfold = 5,
              eta = 0.1, # learning rate
              nrounds = 1000, # number of trees or boosting rounds
              early_stopping_rounds = 50,
              verbose = 1,
              nthread = 1,
              print_every_n = 20,
              objective = "binary:logistic",
              eval_metric = "auc",
              eval_metric = "error")
## [1] train-auc:0.746227+0.002708 train-error:0.322593+0.003906 test-auc:0.645216+0.014222 test-error:0.392997+0.012293
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 50 rounds.
##
## [21] train-auc:0.912924+0.004937 train-error:0.176375+0.003672 test-auc:0.695715+0.016241 test-error:0.359498+0.009578
## [41] train-auc:0.956783+0.002136 train-error:0.121000+0.003201 test-auc:0.696592+0.013426 test-error:0.359249+0.004079
## [61] train-auc:0.975606+0.002642 train-error:0.086937+0.005920 test-auc:0.696160+0.013675 test-error:0.361375+0.003403
## Stopping. Best iteration:
## [18] train-auc:0.900810+0.004299 train-error:0.189563+0.001978 test-auc:0.694857+0.015229 test-error:0.356873+0.008844
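The stopping behaviour reported in the log can be sketched in base R as a patience rule. This is a simplified illustration, not XGBoost's internal code: `best_round_with_patience` is a hypothetical helper, and the error curve is made up for the example.

```r
# Returns the best round under a "stop after `patience` rounds without
# improvement" rule, for a metric where lower is better.
best_round_with_patience <- function(metric, patience) {
  best_value <- Inf
  best_round <- 0
  stopped_at <- length(metric)
  for (i in seq_along(metric)) {
    if (metric[i] < best_value) {
      best_value <- metric[i]
      best_round <- i
    } else if (i - best_round >= patience) {
      stopped_at <- i  # no improvement for `patience` rounds
      break
    }
  }
  list(best_round = best_round, best_value = best_value, stopped_at = stopped_at)
}

# Illustrative (made-up) error curve: improves, then drifts upward.
toy_error <- c(0.40, 0.38, 0.37, 0.36, 0.365, 0.37, 0.362, 0.37, 0.38)
res <- best_round_with_patience(toy_error, patience = 3)
res
```

In the toy curve the best round is 4 and the rule halts at round 7, three rounds later. The same logic with a patience of 50 is consistent with a best iteration of 18 in the log above.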
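In the cross-validation run above, iteration 18 gave the lowest mean test error (0.3569), and training stopped once 50 further rounds failed to improve on it. Because two evaluation metrics were supplied, early stopping used the last one, test error, as the log notes. This provides a sufficient baseline for subsequent tuning. For the next stage, the number of iterations is capped at 100 with an early stopping window of 20. The next steps tune max.depth and min_child_weight, which typically have the largest impact on model performance.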
max_depth_vals <- c(3, 5, 7, 10, 15)
min_child_weight <- c(1, 3, 5, 7, 10, 15)
cv_params <- expand.grid(max_depth_vals, min_child_weight)
names(cv_params) <- c("max_depth", "min_child_weight")
auc_vec <- error_vec <- rep(NA, nrow(cv_params))
for (i in seq_len(nrow(cv_params))) {
  set.seed(111111)
  bst_tune <- xgb.cv(data = dtrain,
                     nfold = 5,
                     eta = 0.1,
                     max.depth = cv_params$max_depth[i],
                     min_child_weight = cv_params$min_child_weight[i],
                     nrounds = 100,
                     early_stopping_rounds = 20,
                     verbose = 1,
                     nthread = 1,
                     print_every_n = 20,
                     objective = "binary:logistic",
                     eval_metric = "auc",
                     eval_metric = "error")
  auc_vec[i] <- bst_tune$evaluation_log$test_auc_mean[bst_tune$best_ntreelimit]
  error_vec[i] <- bst_tune$evaluation_log$test_error_mean[bst_tune$best_ntreelimit]
}
## [1] train-auc:0.652956+0.002660 train-error:0.373437+0.000972 test-auc:0.640723+0.014849 test-error:0.381248+0.014233
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.736980+0.001836 train-error:0.341437+0.002522 test-auc:0.677436+0.016225 test-error:0.370625+0.011096
## [41] train-auc:0.777790+0.002392 train-error:0.309625+0.001425 test-auc:0.685836+0.015073 test-error:0.366624+0.011965
## [61] train-auc:0.804272+0.001919 train-error:0.283250+0.002880 test-auc:0.687766+0.014101 test-error:0.365625+0.007485
## Stopping. Best iteration:
## [46] train-auc:0.785359+0.002293 train-error:0.302094+0.002899 test-auc:0.687426+0.015158 test-error:0.363375+0.010105
##
## [1] train-auc:0.712134+0.002593 train-error:0.339312+0.002605 test-auc:0.652390+0.014032 test-error:0.383498+0.012182
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.852394+0.003752 train-error:0.241000+0.002911 test-auc:0.691536+0.013650 test-error:0.361498+0.007749
## [41] train-auc:0.902112+0.003349 train-error:0.186438+0.004757 test-auc:0.694164+0.012418 test-error:0.359373+0.008768
## Stopping. Best iteration:
## [35] train-auc:0.891340+0.004581 train-error:0.198594+0.007132 test-auc:0.695340+0.012326 test-error:0.356623+0.010650
##
## [1] train-auc:0.783361+0.003343 train-error:0.291219+0.005114 test-auc:0.645312+0.016291 test-error:0.388748+0.012370
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.959862+0.002204 train-error:0.115031+0.003006 test-auc:0.693102+0.009093 test-error:0.360751+0.007705
## [41] train-auc:0.986717+0.001455 train-error:0.063906+0.004579 test-auc:0.695660+0.011037 test-error:0.359500+0.009464
## [61] train-auc:0.995065+0.001091 train-error:0.037093+0.005376 test-auc:0.695226+0.011485 test-error:0.359250+0.010669
## Stopping. Best iteration:
## [49] train-auc:0.990913+0.001678 train-error:0.051156+0.005734 test-auc:0.695209+0.012002 test-error:0.355750+0.009336
##
## [1] train-auc:0.880904+0.009266 train-error:0.200313+0.012852 test-auc:0.634988+0.017780 test-error:0.398998+0.015855
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.999018+0.000326 train-error:0.020406+0.003484 test-auc:0.690295+0.017691 test-error:0.360250+0.015769
## [41] train-auc:0.999996+0.000002 train-error:0.002094+0.000674 test-auc:0.693155+0.014842 test-error:0.362251+0.014288
## Stopping. Best iteration:
## [24] train-auc:0.999536+0.000228 train-error:0.014812+0.003667 test-auc:0.690545+0.017424 test-error:0.359000+0.013848
##
## [1] train-auc:0.956896+0.008074 train-error:0.110469+0.013032 test-auc:0.626653+0.017942 test-error:0.408876+0.013655
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:1.000000+0.000000 train-error:0.000313+0.000262 test-auc:0.685587+0.020754 test-error:0.360126+0.014569
## [41] train-auc:1.000000+0.000000 train-error:0.000000+0.000000 test-auc:0.689063+0.017793 test-error:0.358125+0.013390
## [61] train-auc:1.000000+0.000000 train-error:0.000000+0.000000 test-auc:0.691988+0.015781 test-error:0.356125+0.010319
## Stopping. Best iteration:
## [48] train-auc:1.000000+0.000000 train-error:0.000000+0.000000 test-auc:0.690441+0.017301 test-error:0.355625+0.011062
##
## [1] train-auc:0.652956+0.002660 train-error:0.373437+0.000972 test-auc:0.640723+0.014849 test-error:0.381248+0.014233
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.737328+0.001451 train-error:0.341875+0.002109 test-auc:0.677530+0.016462 test-error:0.372749+0.010993
## [41] train-auc:0.776419+0.001677 train-error:0.309844+0.002161 test-auc:0.688175+0.014057 test-error:0.367001+0.013192
## [61] train-auc:0.802940+0.002216 train-error:0.284219+0.003555 test-auc:0.689822+0.013593 test-error:0.363624+0.013233
## [81] train-auc:0.823496+0.002349 train-error:0.265969+0.003208 test-auc:0.692609+0.016020 test-error:0.359125+0.014668
## [100] train-auc:0.840070+0.002411 train-error:0.249656+0.002952 test-auc:0.693373+0.014335 test-error:0.358127+0.014985
## [1] train-auc:0.710155+0.002034 train-error:0.340812+0.002267 test-auc:0.652305+0.012876 test-error:0.383747+0.011082
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.847405+0.003525 train-error:0.245781+0.004883 test-auc:0.693116+0.013138 test-error:0.362124+0.007922
## [41] train-auc:0.895859+0.003762 train-error:0.195250+0.003683 test-auc:0.696583+0.010715 test-error:0.363000+0.004497
## [61] train-auc:0.923744+0.005732 train-error:0.160374+0.007318 test-auc:0.698290+0.009440 test-error:0.357750+0.004754
## [81] train-auc:0.944698+0.004879 train-error:0.133187+0.006867 test-auc:0.696767+0.010624 test-error:0.360500+0.009562
## Stopping. Best iteration:
## [74] train-auc:0.937530+0.004963 train-error:0.143031+0.005581 test-auc:0.696956+0.010804 test-error:0.354125+0.006801
##
## [1] train-auc:0.770438+0.004524 train-error:0.302094+0.005815 test-auc:0.646624+0.017310 test-error:0.387622+0.013655
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.947266+0.003262 train-error:0.133500+0.005081 test-auc:0.690695+0.013499 test-error:0.356498+0.011605
## [41] train-auc:0.978445+0.002933 train-error:0.081843+0.007581 test-auc:0.693949+0.013004 test-error:0.359123+0.012836
## Stopping. Best iteration:
## [24] train-auc:0.955118+0.004530 train-error:0.121156+0.006549 test-auc:0.692994+0.014093 test-error:0.354748+0.009545
##
## [1] train-auc:0.846397+0.008217 train-error:0.236782+0.010207 test-auc:0.634156+0.014361 test-error:0.407623+0.015105
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.996236+0.000450 train-error:0.033344+0.002016 test-auc:0.688504+0.014770 test-error:0.364751+0.015530
## [41] train-auc:0.999732+0.000084 train-error:0.009344+0.002665 test-auc:0.690605+0.013981 test-error:0.363376+0.011355
## [61] train-auc:0.999976+0.000015 train-error:0.003219+0.001085 test-auc:0.691355+0.012071 test-error:0.359001+0.013581
## [81] train-auc:0.999998+0.000003 train-error:0.000875+0.000830 test-auc:0.691508+0.011834 test-error:0.357876+0.010957
## [100] train-auc:1.000000+0.000000 train-error:0.000125+0.000153 test-auc:0.691665+0.011685 test-error:0.361252+0.014921
## [1] train-auc:0.900308+0.005442 train-error:0.187718+0.004849 test-auc:0.630759+0.013396 test-error:0.407999+0.010939
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.999657+0.000054 train-error:0.004000+0.000743 test-auc:0.682397+0.017847 test-error:0.371248+0.011923
## [41] train-auc:0.999999+0.000001 train-error:0.000469+0.000198 test-auc:0.686864+0.016514 test-error:0.363498+0.015830
## [61] train-auc:1.000000+0.000000 train-error:0.000031+0.000063 test-auc:0.688904+0.014374 test-error:0.366374+0.011052
## Stopping. Best iteration:
## [48] train-auc:1.000000+0.000000 train-error:0.000312+0.000099 test-auc:0.686961+0.014564 test-error:0.361248+0.014196
##
## [1] train-auc:0.652956+0.002660 train-error:0.373437+0.000972 test-auc:0.640723+0.014849 test-error:0.381248+0.014233
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.737411+0.001756 train-error:0.341531+0.004129 test-auc:0.679093+0.015675 test-error:0.371124+0.010863
## [41] train-auc:0.775909+0.003349 train-error:0.310344+0.002671 test-auc:0.688758+0.014547 test-error:0.365000+0.011112
## [61] train-auc:0.802650+0.003312 train-error:0.284062+0.003223 test-auc:0.690978+0.014733 test-error:0.362250+0.008443
## [81] train-auc:0.821807+0.002604 train-error:0.265531+0.003823 test-auc:0.692768+0.015158 test-error:0.360250+0.009891
## [100] train-auc:0.837739+0.003178 train-error:0.250125+0.003140 test-auc:0.694201+0.015169 test-error:0.359999+0.012539
## [1] train-auc:0.706929+0.002813 train-error:0.342937+0.002658 test-auc:0.652187+0.015796 test-error:0.383372+0.012989
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.844212+0.004330 train-error:0.249375+0.003426 test-auc:0.691855+0.014939 test-error:0.364250+0.010761
## [41] train-auc:0.892237+0.001581 train-error:0.198187+0.001829 test-auc:0.693361+0.014583 test-error:0.363124+0.010175
## [61] train-auc:0.921573+0.003471 train-error:0.164906+0.002385 test-auc:0.694518+0.012805 test-error:0.360497+0.008802
## [81] train-auc:0.942337+0.002873 train-error:0.137844+0.001966 test-auc:0.696428+0.013029 test-error:0.362124+0.007845
## Stopping. Best iteration:
## [63] train-auc:0.923996+0.003183 train-error:0.160781+0.002101 test-auc:0.694498+0.013011 test-error:0.359748+0.008718
##
## [1] train-auc:0.760015+0.006612 train-error:0.309875+0.007992 test-auc:0.645864+0.017834 test-error:0.388622+0.013294
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.939246+0.003818 train-error:0.144687+0.005994 test-auc:0.695666+0.010664 test-error:0.359749+0.004949
## [41] train-auc:0.970678+0.003309 train-error:0.093250+0.006296 test-auc:0.697902+0.011202 test-error:0.354502+0.009401
## [61] train-auc:0.984611+0.002494 train-error:0.065031+0.005848 test-auc:0.696815+0.010638 test-error:0.357626+0.010170
## Stopping. Best iteration:
## [41] train-auc:0.970678+0.003309 train-error:0.093250+0.006296 test-auc:0.697902+0.011202 test-error:0.354502+0.009401
##
## [1] train-auc:0.821050+0.010371 train-error:0.259000+0.011825 test-auc:0.634478+0.015816 test-error:0.408248+0.011533
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.990082+0.000634 train-error:0.050813+0.003494 test-auc:0.687899+0.015027 test-error:0.365501+0.008451
## [41] train-auc:0.998229+0.000214 train-error:0.019813+0.001570 test-auc:0.690532+0.014048 test-error:0.364501+0.010512
## Stopping. Best iteration:
## [36] train-auc:0.997256+0.000195 train-error:0.024906+0.002289 test-auc:0.691843+0.014242 test-error:0.362376+0.008376
##
## [1] train-auc:0.859257+0.006842 train-error:0.228468+0.008601 test-auc:0.628103+0.014897 test-error:0.409996+0.011450
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.997998+0.000538 train-error:0.016187+0.002441 test-auc:0.684853+0.014380 test-error:0.364003+0.012717
## [41] train-auc:0.999890+0.000076 train-error:0.002750+0.000675 test-auc:0.689288+0.012620 test-error:0.367003+0.009730
## Stopping. Best iteration:
## [21] train-auc:0.997998+0.000538 train-error:0.016187+0.002441 test-auc:0.684853+0.014380 test-error:0.364003+0.012717
##
## [1] train-auc:0.652956+0.002660 train-error:0.373437+0.000972 test-auc:0.640723+0.014849 test-error:0.381248+0.014233
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.736975+0.002810 train-error:0.341781+0.003062 test-auc:0.679618+0.014611 test-error:0.370374+0.009697
## [41] train-auc:0.774224+0.004013 train-error:0.311031+0.003238 test-auc:0.687556+0.014523 test-error:0.366751+0.010774
## [61] train-auc:0.800990+0.003356 train-error:0.285688+0.003696 test-auc:0.689827+0.014235 test-error:0.366625+0.013120
## [81] train-auc:0.820959+0.003682 train-error:0.267813+0.003038 test-auc:0.692282+0.014375 test-error:0.363751+0.013952
## Stopping. Best iteration:
## [76] train-auc:0.816390+0.003779 train-error:0.271937+0.002850 test-auc:0.691006+0.015520 test-error:0.363000+0.013656
##
## [1] train-auc:0.705878+0.002710 train-error:0.343625+0.002538 test-auc:0.653949+0.015923 test-error:0.382872+0.014643
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.838598+0.001791 train-error:0.254031+0.001630 test-auc:0.688727+0.015080 test-error:0.367498+0.009656
## [41] train-auc:0.885729+0.001835 train-error:0.206344+0.004396 test-auc:0.694261+0.014919 test-error:0.363249+0.010058
## [61] train-auc:0.914615+0.002477 train-error:0.173219+0.004042 test-auc:0.694294+0.014220 test-error:0.358748+0.010845
## [81] train-auc:0.935184+0.004045 train-error:0.146719+0.005646 test-auc:0.694538+0.012934 test-error:0.355749+0.006231
## [100] train-auc:0.949934+0.003882 train-error:0.125250+0.005537 test-auc:0.694702+0.013742 test-error:0.355750+0.007276
## Stopping. Best iteration:
## [80] train-auc:0.934697+0.004205 train-error:0.147688+0.005541 test-auc:0.694692+0.012968 test-error:0.354874+0.007710
##
## [1] train-auc:0.751769+0.005503 train-error:0.316125+0.005755 test-auc:0.647325+0.020100 test-error:0.387996+0.015242
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.928111+0.002895 train-error:0.157094+0.004766 test-auc:0.697608+0.012366 test-error:0.364250+0.007607
## [41] train-auc:0.964582+0.003058 train-error:0.105625+0.007029 test-auc:0.700232+0.011019 test-error:0.359875+0.008015
## Stopping. Best iteration:
## [32] train-auc:0.953051+0.004662 train-error:0.124438+0.008439 test-auc:0.698838+0.012818 test-error:0.354749+0.007520
##
## [1] train-auc:0.802786+0.010259 train-error:0.274906+0.009873 test-auc:0.635072+0.021948 test-error:0.406248+0.016332
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.980402+0.001051 train-error:0.075562+0.003480 test-auc:0.693066+0.011536 test-error:0.360376+0.010564
## [41] train-auc:0.995052+0.000899 train-error:0.034250+0.002822 test-auc:0.691448+0.011253 test-error:0.362126+0.007281
## Stopping. Best iteration:
## [30] train-auc:0.990531+0.000505 train-error:0.049719+0.001762 test-auc:0.693169+0.010354 test-error:0.358501+0.010996
##
## [1] train-auc:0.833547+0.006810 train-error:0.251031+0.007672 test-auc:0.629899+0.022344 test-error:0.408121+0.018052
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.993628+0.000488 train-error:0.034156+0.001449 test-auc:0.688427+0.011835 test-error:0.365124+0.008558
## Stopping. Best iteration:
## [19] train-auc:0.991751+0.000621 train-error:0.040688+0.000654 test-auc:0.687461+0.009931 test-error:0.361751+0.004622
##
## [1] train-auc:0.652929+0.002988 train-error:0.372812+0.002023 test-auc:0.641224+0.013208 test-error:0.379748+0.011379
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.735304+0.002866 train-error:0.345062+0.002835 test-auc:0.679448+0.015486 test-error:0.370875+0.011468
## [41] train-auc:0.772399+0.003081 train-error:0.312843+0.003766 test-auc:0.688194+0.015234 test-error:0.368000+0.013224
## [61] train-auc:0.797875+0.003428 train-error:0.289250+0.001435 test-auc:0.690410+0.014218 test-error:0.363873+0.010860
## [81] train-auc:0.817036+0.003560 train-error:0.270594+0.002392 test-auc:0.691861+0.014607 test-error:0.362126+0.010603
## [100] train-auc:0.832540+0.003125 train-error:0.255562+0.003983 test-auc:0.691567+0.014876 test-error:0.360251+0.013298
## [1] train-auc:0.703729+0.003789 train-error:0.345906+0.003257 test-auc:0.652757+0.012680 test-error:0.383122+0.013873
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.834393+0.001685 train-error:0.255781+0.001370 test-auc:0.694041+0.015174 test-error:0.359623+0.012951
## [41] train-auc:0.878610+0.002053 train-error:0.211187+0.003560 test-auc:0.696616+0.012792 test-error:0.356498+0.008967
## [61] train-auc:0.908213+0.002281 train-error:0.178031+0.004300 test-auc:0.697274+0.011683 test-error:0.358873+0.010702
## Stopping. Best iteration:
## [49] train-auc:0.891998+0.003403 train-error:0.196688+0.005129 test-auc:0.696909+0.012858 test-error:0.353873+0.009758
##
## [1] train-auc:0.742368+0.006465 train-error:0.323094+0.006829 test-auc:0.652638+0.015795 test-error:0.388872+0.013689
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.908302+0.003286 train-error:0.182781+0.001966 test-auc:0.696665+0.013393 test-error:0.351624+0.009887
## [41] train-auc:0.950800+0.002261 train-error:0.128438+0.002877 test-auc:0.697881+0.011775 test-error:0.360625+0.008816
## Stopping. Best iteration:
## [21] train-auc:0.908302+0.003286 train-error:0.182781+0.001966 test-auc:0.696665+0.013393 test-error:0.351624+0.009887
##
## [1] train-auc:0.782631+0.009468 train-error:0.289156+0.008983 test-auc:0.642474+0.017606 test-error:0.408247+0.011392
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.965321+0.002340 train-error:0.100376+0.006587 test-auc:0.694444+0.011380 test-error:0.363000+0.007673
## [41] train-auc:0.987673+0.001020 train-error:0.055125+0.004282 test-auc:0.694771+0.009806 test-error:0.357749+0.009387
## [61] train-auc:0.995007+0.000622 train-error:0.031750+0.003085 test-auc:0.694259+0.009232 test-error:0.357875+0.007173
## Stopping. Best iteration:
## [54] train-auc:0.993268+0.000841 train-error:0.037906+0.002707 test-auc:0.695243+0.010125 test-error:0.355875+0.007686
##
## [1] train-auc:0.803240+0.005530 train-error:0.273844+0.005162 test-auc:0.638064+0.017152 test-error:0.404747+0.011133
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.981614+0.000782 train-error:0.067375+0.001315 test-auc:0.688622+0.016334 test-error:0.363498+0.010912
## [41] train-auc:0.996048+0.000445 train-error:0.024781+0.001581 test-auc:0.689824+0.015916 test-error:0.362249+0.011918
## [61] train-auc:0.999019+0.000233 train-error:0.009063+0.001122 test-auc:0.691759+0.015345 test-error:0.363497+0.013688
## Stopping. Best iteration:
## [41] train-auc:0.996048+0.000445 train-error:0.024781+0.001581 test-auc:0.689824+0.015916 test-error:0.362249+0.011918
##
## [1] train-auc:0.652904+0.003022 train-error:0.372969+0.002125 test-auc:0.641302+0.013312 test-error:0.379623+0.011472
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.734565+0.003915 train-error:0.344250+0.003956 test-auc:0.680836+0.015202 test-error:0.368749+0.012074
## [41] train-auc:0.770961+0.003843 train-error:0.313531+0.001655 test-auc:0.688281+0.015604 test-error:0.365250+0.013702
## [61] train-auc:0.794610+0.004016 train-error:0.292688+0.003307 test-auc:0.692129+0.014448 test-error:0.363874+0.013744
## [81] train-auc:0.812759+0.003164 train-error:0.275219+0.002083 test-auc:0.693483+0.014758 test-error:0.360124+0.013991
## [100] train-auc:0.827495+0.003337 train-error:0.260656+0.001898 test-auc:0.694688+0.012921 test-error:0.358749+0.012498
## [1] train-auc:0.699914+0.004573 train-error:0.349031+0.004299 test-auc:0.658444+0.013354 test-error:0.378373+0.013196
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.826929+0.003353 train-error:0.263156+0.002586 test-auc:0.690504+0.016909 test-error:0.362624+0.010468
## [41] train-auc:0.870079+0.002428 train-error:0.219687+0.004406 test-auc:0.692857+0.013417 test-error:0.360499+0.004422
## Stopping. Best iteration:
## [25] train-auc:0.838602+0.003131 train-error:0.252625+0.004087 test-auc:0.692437+0.016521 test-error:0.358124+0.009077
##
## [1] train-auc:0.730715+0.007386 train-error:0.332125+0.007465 test-auc:0.661912+0.019722 test-error:0.377997+0.014029
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.891066+0.004400 train-error:0.200375+0.007036 test-auc:0.696986+0.013979 test-error:0.354873+0.009539
## [41] train-auc:0.932405+0.002304 train-error:0.150594+0.003859 test-auc:0.700496+0.011094 test-error:0.357751+0.007690
## Stopping. Best iteration:
## [38] train-auc:0.927952+0.002170 train-error:0.155500+0.003845 test-auc:0.700660+0.011697 test-error:0.354750+0.008330
##
## [1] train-auc:0.763276+0.010059 train-error:0.305500+0.008508 test-auc:0.653228+0.023448 test-error:0.389246+0.017071
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.937078+0.001574 train-error:0.143531+0.001971 test-auc:0.696782+0.011976 test-error:0.357500+0.006440
## [41] train-auc:0.971413+0.001742 train-error:0.087437+0.001950 test-auc:0.698626+0.010098 test-error:0.355624+0.009186
## Stopping. Best iteration:
## [24] train-auc:0.945742+0.001187 train-error:0.130750+0.001975 test-auc:0.697875+0.012004 test-error:0.353501+0.009043
##
## [1] train-auc:0.775279+0.005473 train-error:0.297688+0.005728 test-auc:0.648225+0.020356 test-error:0.388245+0.016301
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.953022+0.002268 train-error:0.118437+0.002552 test-auc:0.691461+0.009621 test-error:0.362251+0.008435
## [41] train-auc:0.984482+0.001522 train-error:0.057469+0.003788 test-auc:0.693763+0.010303 test-error:0.361126+0.009035
## Stopping. Best iteration:
## [28] train-auc:0.969361+0.001570 train-error:0.090219+0.001292 test-auc:0.692778+0.009645 test-error:0.357876+0.006152
# Joining results in dataset
res_db <- cbind.data.frame(cv_params, auc_vec, error_vec)
names(res_db)[3:4] <- c("auc", "error")
res_db$max_depth <- as.factor(res_db$max_depth)
res_db$min_child_weight <- as.factor(res_db$min_child_weight)
# Printing AUC heatmap
g_2 <- ggplot(res_db, aes(y = max_depth, x = min_child_weight, fill = auc)) +
  geom_tile() +
  theme_bw() +
  scale_fill_gradient2(low = "blue",
                       mid = "white",
                       high = "red",
                       midpoint = mean(res_db$auc),
                       space = "Lab",
                       na.value = "grey",
                       guide = "colourbar",
                       aesthetics = "fill") +
  labs(x = "Minimum Child Weight", y = "Max Depth", fill = "AUC")
g_2
# Printing error heatmap
g_3 <- ggplot(res_db, aes(y = max_depth, x = min_child_weight, fill = error)) + # set aesthetics
  geom_tile() +
  theme_bw() +
  scale_fill_gradient2(low = "blue",
                       mid = "white",
                       high = "red",
                       midpoint = mean(res_db$error),
                       space = "Lab",
                       na.value = "grey",
                       guide = "colourbar",
                       aesthetics = "fill") +
  labs(x = "Minimum Child Weight", y = "Max Depth", fill = "Error")
g_3
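The heatmaps suggest that max_depth between 5 and 10 performs best, with the strongest results at max_depth = 7 and min_child_weight = 10 or 15. A depth of 3 appears too shallow to capture interactions among several variables, while a depth of 15 generally has the lowest AUC, consistent with overfitting. Lower min_child_weight values also tend to score slightly worse, plausibly because small leaves let the model fit noise in a handful of samples. The differences are modest throughout: every configuration has a cross-validated AUC between roughly 0.685 and 0.701.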
# Printing results
res_db
## max_depth min_child_weight auc error
## 1 3 1 0.6874265 0.3633746
## 2 5 1 0.6953400 0.3566235
## 3 7 1 0.6952085 0.3557497
## 4 10 1 0.6905455 0.3590002
## 5 15 1 0.6904405 0.3556251
## 6 3 3 0.6933726 0.3581267
## 7 5 3 0.6969564 0.3541247
## 8 7 3 0.6929936 0.3547485
## 9 10 3 0.6915842 0.3563763
## 10 15 3 0.6869610 0.3612483
## 11 3 5 0.6941444 0.3578751
## 12 5 5 0.6944977 0.3597476
## 13 7 5 0.6979020 0.3545019
## 14 10 5 0.6918435 0.3623763
## 15 15 5 0.6848532 0.3640029
## 16 3 7 0.6910062 0.3630003
## 17 5 7 0.6946923 0.3548744
## 18 7 7 0.6988377 0.3547494
## 19 10 7 0.6931687 0.3585012
## 20 15 7 0.6874606 0.3617508
## 21 3 10 0.6914981 0.3581245
## 22 5 10 0.6969087 0.3538729
## 23 7 10 0.6966653 0.3516244
## 24 10 10 0.6952434 0.3558754
## 25 15 10 0.6898242 0.3622487
## 26 3 15 0.6944949 0.3584994
## 27 5 15 0.6924374 0.3581242
## 28 7 15 0.7006604 0.3547501
## 29 10 15 0.6978746 0.3535014
## 30 15 15 0.6927777 0.3578758
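The grid printed above can also be summarised programmatically rather than read off the heatmaps. The following is a minimal base R sketch, assuming the values are re-entered by hand (rounded to four decimals) into a hypothetical data frame called res; in the actual session the same calls could be applied to res_db directly.

```r
# Grid results copied from the res_db printout above, rounded to 4 decimals.
# expand.grid() varies max_depth fastest, matching the row order of the table.
res <- expand.grid(max_depth = c(3, 5, 7, 10, 15),
                   min_child_weight = c(1, 3, 5, 7, 10, 15))
res$auc <- c(0.6874, 0.6953, 0.6952, 0.6905, 0.6904,
             0.6934, 0.6970, 0.6930, 0.6916, 0.6870,
             0.6941, 0.6945, 0.6979, 0.6918, 0.6849,
             0.6910, 0.6947, 0.6988, 0.6932, 0.6875,
             0.6915, 0.6969, 0.6967, 0.6952, 0.6898,
             0.6945, 0.6924, 0.7007, 0.6979, 0.6928)
res$error <- c(0.3634, 0.3566, 0.3557, 0.3590, 0.3556,
               0.3581, 0.3541, 0.3547, 0.3564, 0.3612,
               0.3579, 0.3597, 0.3545, 0.3624, 0.3640,
               0.3630, 0.3549, 0.3547, 0.3585, 0.3618,
               0.3581, 0.3539, 0.3516, 0.3559, 0.3622,
               0.3585, 0.3581, 0.3548, 0.3535, 0.3579)

# Best single configuration under each metric
best_auc   <- res[which.max(res$auc), ]
best_error <- res[which.min(res$error), ]
print(best_auc)
print(best_error)

# Average AUC for each level of one parameter, pooled over the other
depth_auc <- tapply(res$auc, res$max_depth, mean)
mcw_auc   <- tapply(res$auc, res$min_child_weight, mean)
print(round(depth_auc, 4))
print(round(mcw_auc, 4))
```

Averaging over the other parameter is only a rough guide, since it ignores interactions between the two, but it gives a quick check on whether a trend seen in the heatmap holds across the whole grid.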
From this, max_depth = 7 with min_child_weight = 10 gives the lowest error in the grid (0.3516), while max_depth = 7 with min_child_weight = 15 gives the highest AUC (0.7007). Given that the AUC gap between the two is small (about 0.004) and the first configuration has the lowest error of the grid, max_depth = 7 with min_child_weight = 10 is selected. The next phase involves tuning gamma, which sets the minimum loss reduction required to split a node, an important consideration for refining model precision in the context of stock movement predictions.
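To make the roles of gamma and min_child_weight concrete, the sketch below evaluates the standard XGBoost split-gain expression for a single candidate split. It is a simplified illustration, not the library's internal code: the helper names (leaf_score, split_gain, keep_split) are hypothetical, the gradient and hessian sums are made-up numbers, and lambda = 1 is assumed as the L2 regularisation term.

```r
# Structure score of a leaf with gradient sum G and hessian sum H
leaf_score <- function(G, H, lambda = 1) G^2 / (H + lambda)

# Loss reduction of a candidate split, before the gamma penalty is applied
split_gain <- function(G_L, H_L, G_R, H_R, lambda = 1) {
  0.5 * (leaf_score(G_L, H_L, lambda) +
         leaf_score(G_R, H_R, lambda) -
         leaf_score(G_L + G_R, H_L + H_R, lambda))
}

# Hypothetical decision rule: both children must carry at least
# min_child_weight of hessian mass, and the gain must exceed gamma
keep_split <- function(G_L, H_L, G_R, H_R,
                       gamma = 0, min_child_weight = 1, lambda = 1) {
  big_enough <- H_L >= min_child_weight && H_R >= min_child_weight
  big_enough && split_gain(G_L, H_L, G_R, H_R, lambda) > gamma
}

# Made-up statistics for one candidate split
gain <- split_gain(G_L = -4, H_L = 12, G_R = 3, H_R = 11)
print(gain)

# The same split under different settings
print(keep_split(-4, 12, 3, 11, gamma = 0.2, min_child_weight = 10))
print(keep_split(-4, 12, 3, 11, gamma = 1.0, min_child_weight = 10))
print(keep_split(-4, 12, 3, 11, gamma = 0.2, min_child_weight = 15))
```

Under these assumptions, raising gamma rejects splits whose loss reduction is small, while raising min_child_weight rejects splits that would leave a child with too little data, so both parameters act as complexity controls.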
gamma_vals <- c(0, 0.05, 0.1, 0.15, 0.2)
set.seed(111111)
auc_vec <- error_vec <- rep(NA, length(gamma_vals))
for (i in seq_along(gamma_vals)) {
  bst_tune <- xgb.cv(data = dtrain,
                     nfold = 5,
                     eta = 0.1,
                     max.depth = 7,
                     min_child_weight = 10,
                     gamma = gamma_vals[i], # minimum loss reduction for split
                     nrounds = 100,
                     early_stopping_rounds = 20,
                     verbose = 1,
                     nthread = 1,
                     print_every_n = 20,
                     objective = "binary:logistic",
                     eval_metric = "auc",
                     eval_metric = "error")
  auc_vec[i] <- bst_tune$evaluation_log$test_auc_mean[bst_tune$best_ntreelimit]
  error_vec[i] <- bst_tune$evaluation_log$test_error_mean[bst_tune$best_ntreelimit]
}
## [1] train-auc:0.742368+0.006465 train-error:0.323094+0.006829 test-auc:0.652638+0.015795 test-error:0.388872+0.013689
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.908302+0.003286 train-error:0.182781+0.001966 test-auc:0.696665+0.013393 test-error:0.351624+0.009887
## [41] train-auc:0.950800+0.002261 train-error:0.128438+0.002877 test-auc:0.697881+0.011775 test-error:0.360625+0.008816
## Stopping. Best iteration:
## [21] train-auc:0.908302+0.003286 train-error:0.182781+0.001966 test-auc:0.696665+0.013393 test-error:0.351624+0.009887
##
## [1] train-auc:0.745782+0.005174 train-error:0.319312+0.004090 test-auc:0.646977+0.012481 test-error:0.395755+0.017786
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.910756+0.004309 train-error:0.178687+0.005867 test-auc:0.690792+0.008033 test-error:0.361750+0.006514
## [41] train-auc:0.952927+0.003420 train-error:0.122719+0.003627 test-auc:0.695220+0.007887 test-error:0.360249+0.008352
## Stopping. Best iteration:
## [29] train-auc:0.931618+0.003306 train-error:0.152219+0.005067 test-auc:0.692083+0.009767 test-error:0.358253+0.008628
##
## [1] train-auc:0.749177+0.004300 train-error:0.314031+0.006195 test-auc:0.636940+0.013784 test-error:0.394248+0.007537
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.910072+0.006175 train-error:0.179375+0.009530 test-auc:0.688646+0.010576 test-error:0.366625+0.006543
## [41] train-auc:0.950937+0.002338 train-error:0.127250+0.004232 test-auc:0.693229+0.012973 test-error:0.364998+0.009645
## [61] train-auc:0.970673+0.001342 train-error:0.094156+0.004103 test-auc:0.693419+0.011889 test-error:0.361749+0.005084
## [81] train-auc:0.982247+0.001342 train-error:0.068469+0.004031 test-auc:0.693893+0.011305 test-error:0.361625+0.006405
## Stopping. Best iteration:
## [69] train-auc:0.975925+0.001408 train-error:0.082219+0.003299 test-auc:0.693857+0.010384 test-error:0.358000+0.006625
##
## [1] train-auc:0.747430+0.011485 train-error:0.312719+0.005596 test-auc:0.642320+0.011291 test-error:0.395501+0.008215
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.911972+0.002089 train-error:0.177687+0.004425 test-auc:0.693889+0.008386 test-error:0.359125+0.007605
## [41] train-auc:0.953457+0.001451 train-error:0.123625+0.004296 test-auc:0.695534+0.010289 test-error:0.359750+0.009517
## Stopping. Best iteration:
## [28] train-auc:0.929895+0.001913 train-error:0.156625+0.005439 test-auc:0.696246+0.009611 test-error:0.355126+0.011080
##
## [1] train-auc:0.742815+0.006023 train-error:0.322594+0.007596 test-auc:0.644012+0.008060 test-error:0.391250+0.008049
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.911775+0.003272 train-error:0.176812+0.003397 test-auc:0.697687+0.008414 test-error:0.357750+0.010387
## Stopping. Best iteration:
## [17] train-auc:0.897486+0.002770 train-error:0.190531+0.002161 test-auc:0.695924+0.006096 test-error:0.357250+0.007928
The results for each gamma value are joined below to identify which one to use.
# Joining gamma to values
cbind.data.frame(gamma_vals, auc_vec, error_vec)
## gamma_vals auc_vec error_vec
## 1 0.00 0.6966653 0.3516244
## 2 0.05 0.6920833 0.3582525
## 3 0.10 0.6938566 0.3579995
## 4 0.15 0.6962457 0.3551260
## 5 0.20 0.6959242 0.3572495
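One way to judge how decisive these differences are is to compare each gap in error with the fold-to-fold standard deviation that xgb.cv prints after the `+` sign. The sketch below is a rough check in base R, assuming the table values and the standard deviations from the "Best iteration" lines of the log above are re-entered by hand; gamma_res and its derived columns are hypothetical names.

```r
# Values copied from the gamma table above
gamma_res <- data.frame(
  gamma = c(0, 0.05, 0.10, 0.15, 0.20),
  auc   = c(0.6966653, 0.6920833, 0.6938566, 0.6962457, 0.6959242),
  error = c(0.3516244, 0.3582525, 0.3579995, 0.3551260, 0.3572495)
)

# Standard deviation of test error across the 5 folds at each run's best
# iteration, read from the "Best iteration" lines of the log above
gamma_res$error_sd <- c(0.009887, 0.008628, 0.006625, 0.011080, 0.007928)

# Gap between each gamma value and the lowest-error one
best <- which.min(gamma_res$error)
gamma_res$gap_to_best <- gamma_res$error - gamma_res$error[best]

# Flag values whose gap is no larger than one fold standard deviation
gamma_res$within_one_sd <- gamma_res$gap_to_best <= gamma_res$error_sd[best]
print(gamma_res)
```

This is an informal heuristic rather than a significance test, but it indicates whether the ranking of gamma values is likely to be stable under a different fold split.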
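The table shows that gamma = 0 yields both the highest AUC and the lowest error, so gamma = 0 is retained. The margins are small (at most about 0.007 in error and 0.005 in AUC), and because the seed is set once before the loop, each gamma value is cross-validated on a different random fold split, so part of the difference may reflect fold variation. The next step is to re-evaluate the optimal number of rounds under the selected parameters.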
set.seed(111111)
bst <- xgb.cv(data = dtrain,
              nfold = 5,
              eta = 0.1,
              max.depth = 7,
              min_child_weight = 10,
              gamma = 0,
              nrounds = 1000,
              early_stopping_rounds = 50,
              verbose = 1,
              nthread = 1,
              print_every_n = 20,
              objective = "binary:logistic",
              eval_metric = "auc",
              eval_metric = "error")
## [1] train-auc:0.742368+0.006465 train-error:0.323094+0.006829 test-auc:0.652638+0.015795 test-error:0.388872+0.013689
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 50 rounds.
##
## [21] train-auc:0.908302+0.003286 train-error:0.182781+0.001966 test-auc:0.696665+0.013393 test-error:0.351624+0.009887
## [41] train-auc:0.950800+0.002261 train-error:0.128438+0.002877 test-auc:0.697881+0.011775 test-error:0.360625+0.008816
## [61] train-auc:0.971092+0.001987 train-error:0.092875+0.003477 test-auc:0.697865+0.009435 test-error:0.360875+0.008302
## Stopping. Best iteration:
## [21] train-auc:0.908302+0.003286 train-error:0.182781+0.001966 test-auc:0.696665+0.013393 test-error:0.351624+0.009887
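The stopping behaviour visible in the log above (training continues until test error has not improved for a fixed number of rounds, then the best round is reported) can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not xgboost's implementation: early_stop is a hypothetical helper, and the error vector is made up for the example.

```r
# Hypothetical helper: patience-based early stopping on a vector of
# per-round validation errors (lower is better)
early_stop <- function(errors, patience) {
  best_iter <- 1
  for (i in seq_along(errors)) {
    # Record a new best round whenever the error strictly improves
    if (errors[i] < errors[best_iter]) best_iter <- i
    # Stop once `patience` rounds have passed without improvement
    if (i - best_iter >= patience) {
      return(list(best_iteration = best_iter, stopped_at = i))
    }
  }
  # Round budget exhausted before the patience limit was reached
  list(best_iteration = best_iter, stopped_at = length(errors))
}

# Made-up validation errors for eight boosting rounds
toy_errors <- c(0.400, 0.370, 0.360, 0.355, 0.358, 0.357, 0.359, 0.360)
stop_res <- early_stop(toy_errors, patience = 3)
print(stop_res)
```

With a patience of 50 and a best round of 21, the same rule would halt at round 71, which is consistent with the log above printing round 61 as its last progress line before stopping.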
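With a budget of 1,000 rounds and a patience of 50, early stopping on test error again selects 21 rounds under the current parameters, although test AUC is marginally higher at later rounds (0.6979 at round 41 versus 0.6967 at round 21). The next step involves tuning subsample and colsample_bytree, which control the fraction of training rows and of feature columns sampled for each tree.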
subsample <- c(0.6, 0.7, 0.8, 0.9, 1)
colsample_by_tree <- c(0.6, 0.7, 0.8, 0.9, 1)
cv_params <- expand.grid(subsample, colsample_by_tree)
names(cv_params) <- c("subsample", "colsample_by_tree")
auc_vec <- error_vec <- rep(NA, nrow(cv_params))
for (i in seq_len(nrow(cv_params))) {
  set.seed(111111)
  bst_tune <- xgb.cv(data = dtrain,
                     nfold = 5,
                     eta = 0.1,
                     max.depth = 7,
                     min_child_weight = 10,
                     gamma = 0,
                     subsample = cv_params$subsample[i],
                     colsample_bytree = cv_params$colsample_by_tree[i],
                     nrounds = 150,
                     early_stopping_rounds = 20,
                     verbose = 1,
                     nthread = 1,
                     print_every_n = 20,
                     objective = "binary:logistic",
                     eval_metric = "auc",
                     eval_metric = "error")
  auc_vec[i] <- bst_tune$evaluation_log$test_auc_mean[bst_tune$best_ntreelimit]
  error_vec[i] <- bst_tune$evaluation_log$test_error_mean[bst_tune$best_ntreelimit]
}
## [1] train-auc:0.700607+0.006214 train-error:0.345282+0.005724 test-auc:0.621300+0.013880 test-error:0.403123+0.014981
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.868040+0.002348 train-error:0.221219+0.002378 test-auc:0.688889+0.015416 test-error:0.366624+0.010939
## [41] train-auc:0.915315+0.001002 train-error:0.170063+0.004435 test-auc:0.692914+0.014483 test-error:0.360872+0.013744
## Stopping. Best iteration:
## [27] train-auc:0.884809+0.003460 train-error:0.203406+0.004422 test-auc:0.693888+0.015994 test-error:0.354874+0.012937
##
## [1] train-auc:0.707640+0.007244 train-error:0.338563+0.004742 test-auc:0.625758+0.008830 test-error:0.401252+0.010169
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.880495+0.005766 train-error:0.210094+0.009993 test-auc:0.684572+0.012623 test-error:0.370124+0.005529
## [41] train-auc:0.927108+0.005538 train-error:0.155001+0.009190 test-auc:0.688598+0.013523 test-error:0.367251+0.011289
## Stopping. Best iteration:
## [28] train-auc:0.899591+0.005388 train-error:0.188126+0.010278 test-auc:0.688222+0.011174 test-error:0.362750+0.003825
##
## [1] train-auc:0.718486+0.005675 train-error:0.328969+0.004779 test-auc:0.632741+0.017467 test-error:0.390874+0.014772
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.891781+0.002261 train-error:0.199563+0.005331 test-auc:0.692024+0.019545 test-error:0.361498+0.011849
## [41] train-auc:0.935773+0.003364 train-error:0.146406+0.003125 test-auc:0.694778+0.014930 test-error:0.359624+0.009134
## [61] train-auc:0.959994+0.001852 train-error:0.108000+0.002766 test-auc:0.695091+0.014650 test-error:0.362500+0.007891
## Stopping. Best iteration:
## [50] train-auc:0.947712+0.003082 train-error:0.129156+0.003539 test-auc:0.695213+0.013497 test-error:0.358625+0.008767
##
## [1] train-auc:0.727135+0.007376 train-error:0.322469+0.005349 test-auc:0.633231+0.017630 test-error:0.394125+0.015887
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.904763+0.004464 train-error:0.186032+0.004522 test-auc:0.690712+0.011959 test-error:0.362999+0.005510
## [41] train-auc:0.945244+0.003753 train-error:0.134531+0.004002 test-auc:0.697882+0.009329 test-error:0.357126+0.007478
## Stopping. Best iteration:
## [38] train-auc:0.941220+0.003506 train-error:0.139625+0.004539 test-auc:0.696544+0.008965 test-error:0.355876+0.005457
##
## [1] train-auc:0.742067+0.006667 train-error:0.320438+0.007911 test-auc:0.640527+0.015007 test-error:0.392624+0.018065
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.905610+0.003291 train-error:0.182531+0.005681 test-auc:0.690086+0.012914 test-error:0.363625+0.010372
## Stopping. Best iteration:
## [20] train-auc:0.903665+0.002987 train-error:0.185125+0.006650 test-auc:0.689674+0.012496 test-error:0.360625+0.006473
##
## [1] train-auc:0.700369+0.002857 train-error:0.348000+0.004289 test-auc:0.632759+0.005836 test-error:0.393375+0.009795
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.867633+0.002328 train-error:0.223656+0.002039 test-auc:0.688271+0.015628 test-error:0.358497+0.015531
## Stopping. Best iteration:
## [19] train-auc:0.860515+0.003052 train-error:0.231937+0.002454 test-auc:0.688109+0.015226 test-error:0.357747+0.016380
##
## [1] train-auc:0.708836+0.005671 train-error:0.335719+0.003679 test-auc:0.633038+0.015496 test-error:0.394250+0.012567
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.879114+0.003317 train-error:0.212844+0.003249 test-auc:0.691139+0.016627 test-error:0.367750+0.008824
## [41] train-auc:0.927328+0.002361 train-error:0.155031+0.003244 test-auc:0.690679+0.013682 test-error:0.366249+0.008249
## Stopping. Best iteration:
## [37] train-auc:0.920187+0.002556 train-error:0.165313+0.004740 test-auc:0.691671+0.015352 test-error:0.363499+0.009353
##
## [1] train-auc:0.721536+0.006900 train-error:0.328906+0.005951 test-auc:0.637653+0.017637 test-error:0.392001+0.018124
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.894546+0.001809 train-error:0.195625+0.002735 test-auc:0.696740+0.015186 test-error:0.354250+0.008595
## [41] train-auc:0.936718+0.002838 train-error:0.142812+0.002245 test-auc:0.699871+0.014585 test-error:0.351873+0.012216
## [61] train-auc:0.961568+0.001954 train-error:0.105406+0.001730 test-auc:0.699880+0.016441 test-error:0.353499+0.013782
## Stopping. Best iteration:
## [52] train-auc:0.951346+0.002791 train-error:0.121219+0.001341 test-auc:0.699363+0.015702 test-error:0.349623+0.011985
##
## [1] train-auc:0.731046+0.008087 train-error:0.320531+0.004056 test-auc:0.632319+0.015169 test-error:0.398497+0.015711
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.901637+0.002603 train-error:0.189563+0.003127 test-auc:0.697207+0.012358 test-error:0.362874+0.007223
## [41] train-auc:0.944748+0.003327 train-error:0.132187+0.006277 test-auc:0.699472+0.014650 test-error:0.358249+0.009172
## [61] train-auc:0.966715+0.001992 train-error:0.099281+0.004541 test-auc:0.697974+0.013041 test-error:0.358999+0.008046
## Stopping. Best iteration:
## [56] train-auc:0.962131+0.002397 train-error:0.106781+0.004943 test-auc:0.698539+0.013636 test-error:0.354500+0.010740
##
## [1] train-auc:0.743788+0.005583 train-error:0.318000+0.007554 test-auc:0.643600+0.010826 test-error:0.387749+0.009297
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.906599+0.004695 train-error:0.182594+0.005786 test-auc:0.692161+0.016719 test-error:0.360497+0.016837
## [41] train-auc:0.947428+0.003993 train-error:0.130969+0.005743 test-auc:0.694965+0.013601 test-error:0.361500+0.009982
## [61] train-auc:0.966895+0.001762 train-error:0.097344+0.002826 test-auc:0.695442+0.014196 test-error:0.363749+0.013082
## Stopping. Best iteration:
## [47] train-auc:0.954120+0.003402 train-error:0.119750+0.004527 test-auc:0.695189+0.015047 test-error:0.357623+0.009778
##
## [1] train-auc:0.706102+0.006311 train-error:0.344406+0.005688 test-auc:0.625481+0.013202 test-error:0.400873+0.015393
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.868716+0.003189 train-error:0.219156+0.003714 test-auc:0.690431+0.012613 test-error:0.355748+0.008001
## [41] train-auc:0.916768+0.003742 train-error:0.167219+0.003077 test-auc:0.694484+0.010576 test-error:0.358000+0.005652
## Stopping. Best iteration:
## [33] train-auc:0.900760+0.004495 train-error:0.185969+0.004633 test-auc:0.692839+0.012375 test-error:0.355124+0.006920
##
## [1] train-auc:0.709207+0.004595 train-error:0.340344+0.003100 test-auc:0.631447+0.018189 test-error:0.394374+0.014137
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.878002+0.003050 train-error:0.210313+0.005199 test-auc:0.689645+0.016974 test-error:0.365749+0.010103
## [41] train-auc:0.927707+0.001772 train-error:0.154406+0.005146 test-auc:0.692938+0.015885 test-error:0.359498+0.011229
## Stopping. Best iteration:
## [39] train-auc:0.924573+0.001561 train-error:0.159063+0.006009 test-auc:0.692810+0.015545 test-error:0.358499+0.009367
##
## [1] train-auc:0.725509+0.004925 train-error:0.328938+0.003619 test-auc:0.637998+0.014445 test-error:0.393750+0.008143
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.894958+0.003438 train-error:0.196687+0.005042 test-auc:0.693219+0.013338 test-error:0.360000+0.008236
## Stopping. Best iteration:
## [20] train-auc:0.891578+0.004163 train-error:0.200406+0.006979 test-auc:0.692186+0.013534 test-error:0.357626+0.007367
##
## [1] train-auc:0.732100+0.005751 train-error:0.321125+0.002806 test-auc:0.637744+0.017181 test-error:0.395372+0.017989
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.902751+0.001531 train-error:0.186969+0.003558 test-auc:0.695048+0.015238 test-error:0.358251+0.016488
## [41] train-auc:0.945631+0.002369 train-error:0.133875+0.004632 test-auc:0.699311+0.012764 test-error:0.353498+0.011641
## Stopping. Best iteration:
## [40] train-auc:0.944720+0.002442 train-error:0.135469+0.005059 test-auc:0.699125+0.012848 test-error:0.352748+0.010303
##
## [1] train-auc:0.742532+0.003486 train-error:0.320563+0.006016 test-auc:0.649271+0.014077 test-error:0.383373+0.013147
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.910429+0.001702 train-error:0.178281+0.003732 test-auc:0.695664+0.011941 test-error:0.361997+0.008493
## [41] train-auc:0.951804+0.002365 train-error:0.123594+0.004971 test-auc:0.697241+0.011852 test-error:0.361248+0.005946
## [61] train-auc:0.969353+0.002947 train-error:0.094406+0.005512 test-auc:0.697681+0.011653 test-error:0.360374+0.007411
## [81] train-auc:0.981975+0.002435 train-error:0.067625+0.004524 test-auc:0.696503+0.012264 test-error:0.359246+0.014410
## Stopping. Best iteration:
## [65] train-auc:0.973044+0.002558 train-error:0.086719+0.005893 test-auc:0.697376+0.012027 test-error:0.355623+0.008749
##
## [1] train-auc:0.705877+0.005722 train-error:0.342938+0.007519 test-auc:0.623031+0.018512 test-error:0.408247+0.016216
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.868810+0.003298 train-error:0.221750+0.004013 test-auc:0.687785+0.015168 test-error:0.362875+0.010757
## Stopping. Best iteration:
## [11] train-auc:0.829696+0.006947 train-error:0.255469+0.006798 test-auc:0.685581+0.013245 test-error:0.358000+0.011703
##
## [1] train-auc:0.712580+0.007205 train-error:0.333969+0.002623 test-auc:0.639972+0.017298 test-error:0.383249+0.011816
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.882004+0.002328 train-error:0.209531+0.004187 test-auc:0.691266+0.013495 test-error:0.362999+0.008938
## [41] train-auc:0.930820+0.003129 train-error:0.149937+0.003584 test-auc:0.694677+0.014903 test-error:0.360000+0.011389
## [61] train-auc:0.958909+0.002427 train-error:0.110343+0.004630 test-auc:0.696303+0.013894 test-error:0.357626+0.011729
## Stopping. Best iteration:
## [45] train-auc:0.936691+0.001737 train-error:0.141281+0.003129 test-auc:0.694762+0.014456 test-error:0.355750+0.008535
##
## [1] train-auc:0.727469+0.004937 train-error:0.330907+0.004356 test-auc:0.643126+0.013947 test-error:0.391876+0.008349
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.893383+0.004061 train-error:0.196469+0.005460 test-auc:0.698737+0.014155 test-error:0.356999+0.011702
## [41] train-auc:0.937179+0.003186 train-error:0.142687+0.004764 test-auc:0.698113+0.011338 test-error:0.357249+0.007355
## [61] train-auc:0.963269+0.002587 train-error:0.103937+0.004858 test-auc:0.696604+0.010404 test-error:0.357125+0.008316
## [81] train-auc:0.976955+0.002812 train-error:0.077031+0.005678 test-auc:0.696482+0.011124 test-error:0.353500+0.009357
## Stopping. Best iteration:
## [68] train-auc:0.968124+0.002783 train-error:0.093906+0.005188 test-auc:0.696993+0.011870 test-error:0.353499+0.010400
##
## [1] train-auc:0.735957+0.007026 train-error:0.322844+0.004881 test-auc:0.645129+0.013372 test-error:0.387876+0.017787
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.906617+0.004706 train-error:0.183656+0.007864 test-auc:0.695535+0.015677 test-error:0.360500+0.012594
## [41] train-auc:0.947592+0.004187 train-error:0.130094+0.006442 test-auc:0.696579+0.014793 test-error:0.357998+0.012828
## Stopping. Best iteration:
## [35] train-auc:0.939021+0.004412 train-error:0.142313+0.006350 test-auc:0.696047+0.013378 test-error:0.355373+0.012566
##
## [1] train-auc:0.742880+0.007264 train-error:0.321563+0.008090 test-auc:0.653351+0.018379 test-error:0.387371+0.019220
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.912121+0.003903 train-error:0.176000+0.006037 test-auc:0.695903+0.009409 test-error:0.357501+0.007974
## [41] train-auc:0.951456+0.002111 train-error:0.125406+0.003824 test-auc:0.698000+0.007981 test-error:0.353877+0.009690
## Stopping. Best iteration:
## [39] train-auc:0.948699+0.001485 train-error:0.129375+0.002330 test-auc:0.698141+0.007469 test-error:0.351627+0.007845
##
## [1] train-auc:0.704128+0.011769 train-error:0.348438+0.008734 test-auc:0.624634+0.013731 test-error:0.407625+0.008426
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.871549+0.003700 train-error:0.218406+0.003224 test-auc:0.689942+0.013497 test-error:0.361501+0.009442
## Stopping. Best iteration:
## [12] train-auc:0.838423+0.004528 train-error:0.248594+0.003394 test-auc:0.688409+0.015464 test-error:0.356874+0.011713
##
## [1] train-auc:0.717126+0.006812 train-error:0.336750+0.004827 test-auc:0.641283+0.013117 test-error:0.395001+0.011106
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.883238+0.002830 train-error:0.204844+0.002664 test-auc:0.690684+0.016015 test-error:0.362873+0.012297
## [41] train-auc:0.929920+0.002515 train-error:0.151219+0.003189 test-auc:0.691269+0.011318 test-error:0.364250+0.009331
## Stopping. Best iteration:
## [27] train-auc:0.899234+0.002940 train-error:0.187688+0.002748 test-auc:0.692741+0.016163 test-error:0.359748+0.012177
##
## [1] train-auc:0.725850+0.004683 train-error:0.329188+0.002509 test-auc:0.639705+0.014831 test-error:0.391126+0.011380
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.896011+0.002534 train-error:0.194625+0.003828 test-auc:0.694261+0.018214 test-error:0.363374+0.017588
## [41] train-auc:0.939964+0.002439 train-error:0.139000+0.003412 test-auc:0.697694+0.015358 test-error:0.357875+0.015490
## Stopping. Best iteration:
## [29] train-auc:0.917495+0.001376 train-error:0.167812+0.002189 test-auc:0.698003+0.017517 test-error:0.356374+0.015608
##
## [1] train-auc:0.737905+0.003156 train-error:0.324875+0.001093 test-auc:0.646794+0.009673 test-error:0.389877+0.013357
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.905344+0.002948 train-error:0.183563+0.001802 test-auc:0.695841+0.013530 test-error:0.360500+0.007397
## Stopping. Best iteration:
## [19] train-auc:0.898853+0.002839 train-error:0.190656+0.003344 test-auc:0.695220+0.013095 test-error:0.357501+0.007275
##
## [1] train-auc:0.742368+0.006465 train-error:0.323094+0.006829 test-auc:0.652638+0.015795 test-error:0.388872+0.013689
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.908302+0.003286 train-error:0.182781+0.001966 test-auc:0.696665+0.013393 test-error:0.351624+0.009887
## [41] train-auc:0.950800+0.002261 train-error:0.128438+0.002877 test-auc:0.697881+0.011775 test-error:0.360625+0.008816
## Stopping. Best iteration:
## [21] train-auc:0.908302+0.003286 train-error:0.182781+0.001966 test-auc:0.696665+0.013393 test-error:0.351624+0.009887
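The stopping rule reported in the logs above ("Will train until test_error hasn't improved in 20 rounds") can be illustrated with a small base-R sketch. The helper `early_stop()` and the toy error values are hypothetical and are not part of xgboost. The sketch assumes that only a strict decrease in error counts as an improvement.

```r
# Hypothetical helper: replay an early-stopping rule over a vector of
# per-round validation errors (assumes strict improvement is required).
early_stop <- function(errors, patience) {
  best_iter <- 1
  best_err <- errors[1]
  for (i in seq_along(errors)) {
    if (errors[i] < best_err) {
      best_err <- errors[i]
      best_iter <- i
    }
    # Stop once `patience` rounds have passed without a new best
    if (i - best_iter >= patience) {
      return(list(best_iter = best_iter, best_error = best_err, stopped_at = i))
    }
  }
  list(best_iter = best_iter, best_error = best_err, stopped_at = length(errors))
}

# Toy error curve (made-up values), with a patience of 3 rounds
toy_errors <- c(0.40, 0.38, 0.37, 0.36, 0.365, 0.37, 0.366, 0.361)
res <- early_stop(toy_errors, patience = 3)
print(res)
```

With a patience of 20, a run whose best iteration is 21 would stop at round 41 under this rule, which matches the last grid run in the logs.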
Visualizing the result of tuning these parameters:
res_db <- cbind.data.frame(cv_params, auc_vec, error_vec)
names(res_db)[3:4] <- c("auc", "error")
res_db$subsample <- as.factor(res_db$subsample)
res_db$colsample_by_tree <- as.factor(res_db$colsample_by_tree)
g_4 <- ggplot(res_db, aes(y = colsample_by_tree, x = subsample, fill = auc)) +
geom_tile() +
theme_bw() +
scale_fill_gradient2(low = "blue",
mid = "white",
high = "red",
midpoint =mean(res_db$auc),
space = "Lab",
na.value ="grey",
guide = "colourbar",
aesthetics = "fill") +
labs(x = "Subsample", y = "Column Sample by Tree", fill = "AUC")
g_4
g_5 <- ggplot(res_db, aes(y = colsample_by_tree, x = subsample, fill = error)) +
geom_tile() +
theme_bw() +
scale_fill_gradient2(low = "blue",
mid = "white",
high = "red",
midpoint =mean(res_db$error),
space = "Lab",
na.value ="grey",
guide = "colourbar",
aesthetics = "fill") +
labs(x = "Subsample", y = "Column Sample by Tree", fill = "Error")
g_5
res_db
## subsample colsample_by_tree auc error
## 1 0.6 0.6 0.6938884 0.3548740
## 2 0.7 0.6 0.6882221 0.3627498
## 3 0.8 0.6 0.6952132 0.3586254
## 4 0.9 0.6 0.6965435 0.3558759
## 5 1 0.6 0.6896740 0.3606251
## 6 0.6 0.7 0.6881092 0.3577465
## 7 0.7 0.7 0.6916709 0.3634990
## 8 0.8 0.7 0.6993630 0.3496233
## 9 0.9 0.7 0.6985390 0.3544995
## 10 1 0.7 0.6951888 0.3576233
## 11 0.6 0.8 0.6928390 0.3551244
## 12 0.7 0.8 0.6928099 0.3584986
## 13 0.8 0.8 0.6921859 0.3576257
## 14 0.9 0.8 0.6991249 0.3527479
## 15 1 0.8 0.6973765 0.3556232
## 16 0.6 0.9 0.6855814 0.3579998
## 17 0.7 0.9 0.6947620 0.3557500
## 18 0.8 0.9 0.6969930 0.3534990
## 19 0.9 0.9 0.6960472 0.3553732
## 20 1 0.9 0.6981413 0.3516269
## 21 0.6 1 0.6884087 0.3568744
## 22 0.7 1 0.6927407 0.3597485
## 23 0.8 1 0.6980034 0.3563736
## 24 0.9 1 0.6952199 0.3575013
## 25 1 1 0.6966653 0.3516244
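From this, the best-performing region is roughly a subsample of 0.8-1 combined with a colsample_by_tree of 0.7-1; the single best cell is subsample = 0.8 and colsample_by_tree = 0.7 (AUC 0.6994, error 0.3496). The model generally performs slightly better when more of the rows are sampled for each tree, with subsample values of 0.6-0.7 tending to give the lowest AUC. The differences are small, however: AUC ranges only from 0.686 to 0.699 across the grid, which is comparable to the fold-to-fold standard deviation of roughly 0.01 reported in the cross-validation logs. The weaker results at low sampling rates may indicate that many of the variables in the dataset add little predictive power, so trees built from small random subsets of them are poor predictors. To strike a balance, subsample = 0.9 and colsample_by_tree = 0.9 are selected for the final configuration.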
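As a cross-check on reading the heatmaps by eye, the grid can also be ranked numerically. The sketch below is a minimal illustration that copies six rows from the res_db printout above into a hypothetical data frame named `grid`. It keeps the sampling rates numeric, whereas res_db stores them as factors.

```r
# Six rows copied from the res_db printout above (not the full grid)
grid <- data.frame(
  subsample         = c(0.6, 0.8, 0.9, 0.9, 1.0, 1.0),
  colsample_by_tree = c(0.9, 0.7, 0.8, 0.9, 0.9, 1.0),
  auc   = c(0.6855814, 0.6993630, 0.6991249, 0.6960472, 0.6981413, 0.6966653),
  error = c(0.3579998, 0.3496233, 0.3527479, 0.3553732, 0.3516269, 0.3516244)
)

# Rank by highest AUC, breaking ties by lowest error
ranked <- grid[order(-grid$auc, grid$error), ]
best <- ranked[1, ]
print(best)

# Gap in AUC between the best cell and the selected (0.9, 0.9) cell
chosen <- grid[grid$subsample == 0.9 & grid$colsample_by_tree == 0.9, ]
auc_gap <- best$auc - chosen$auc
print(auc_gap)
```

The same `order()` call applied to the full res_db would rank all 25 combinations.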
The final step in the tuning process is to lower the learning rate eta and add more trees.
set.seed(111111)
bst_mod_1 <- xgb.cv(data = dtrain,
nfold = 5,
eta = 0.3,
max.depth = 7,
min_child_weight = 10,
gamma = 0,
subsample = 0.9,
colsample_bytree = 0.9,
nrounds = 1000,
early_stopping_rounds = 20,
verbose = 1,
nthread = 1,
print_every_n = 20,
objective = "binary:logistic",
eval_metric = "auc",
eval_metric = "error")
## [1] train-auc:0.735957+0.007026 train-error:0.322844+0.004881 test-auc:0.645129+0.013372 test-error:0.387876+0.017787
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.948347+0.004737 train-error:0.126062+0.006405 test-auc:0.677766+0.009002 test-error:0.372376+0.010593
## Stopping. Best iteration:
## [13] train-auc:0.913612+0.002789 train-error:0.168000+0.001889 test-auc:0.680579+0.009516 test-error:0.364501+0.009260
set.seed(111111)
bst_mod_2 <- xgb.cv(data = dtrain,
nfold = 5,
eta = 0.1,
max.depth = 7,
min_child_weight = 10,
gamma = 0,
subsample = 0.9 ,
colsample_bytree = 0.9,
nrounds = 1000,
early_stopping_rounds = 20,
verbose = 1,
nthread = 1,
print_every_n = 20,
objective = "binary:logistic",
eval_metric = "auc",
eval_metric = "error")
## [1] train-auc:0.735957+0.007026 train-error:0.322844+0.004881 test-auc:0.645129+0.013372 test-error:0.387876+0.017787
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.906617+0.004706 train-error:0.183656+0.007864 test-auc:0.695535+0.015677 test-error:0.360500+0.012594
## [41] train-auc:0.947592+0.004187 train-error:0.130094+0.006442 test-auc:0.696579+0.014793 test-error:0.357998+0.012828
## Stopping. Best iteration:
## [35] train-auc:0.939021+0.004412 train-error:0.142313+0.006350 test-auc:0.696047+0.013378 test-error:0.355373+0.012566
set.seed(111111)
bst_mod_3 <- xgb.cv(data = dtrain,
nfold = 5,
eta = 0.05,
max.depth = 7,
min_child_weight = 10 ,
gamma = 0,
subsample = 0.9 ,
colsample_bytree = 0.9,
nrounds = 1000,
early_stopping_rounds = 20,
verbose = 1,
nthread = 1,
print_every_n = 20,
objective = "binary:logistic",
eval_metric = "auc",
eval_metric = "error")
## [1] train-auc:0.735957+0.007026 train-error:0.322844+0.004881 test-auc:0.645129+0.013372 test-error:0.387876+0.017787
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.873873+0.003908 train-error:0.218532+0.003911 test-auc:0.696378+0.015666 test-error:0.361750+0.008731
## Stopping. Best iteration:
## [10] train-auc:0.837416+0.004060 train-error:0.249688+0.006012 test-auc:0.690709+0.014356 test-error:0.360875+0.013102
set.seed(111111)
bst_mod_4 <- xgb.cv(data = dtrain,
nfold = 5,
eta = 0.01,
max.depth = 7,
min_child_weight = 10,
gamma = 0.1, # Note: the other eta runs and the final model use gamma = 0
subsample = 0.9 ,
colsample_bytree = 0.9,
nrounds = 1000,
early_stopping_rounds = 20,
verbose = 1,
nthread = 1,
print_every_n = 20,
objective = "binary:logistic",
eval_metric = "auc",
eval_metric = "error")
## [1] train-auc:0.735957+0.007026 train-error:0.322844+0.004881 test-auc:0.645129+0.013372 test-error:0.387876+0.017787
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.833964+0.005216 train-error:0.250219+0.006714 test-auc:0.690345+0.013145 test-error:0.363875+0.009502
## [41] train-auc:0.850657+0.005537 train-error:0.239219+0.007633 test-auc:0.693963+0.013818 test-error:0.361250+0.009143
## Stopping. Best iteration:
## [38] train-auc:0.848044+0.005805 train-error:0.240251+0.007557 test-auc:0.693209+0.013924 test-error:0.359624+0.008887
set.seed(111111)
bst_mod_5 <- xgb.cv(data = dtrain,
nfold = 5,
eta = 0.005,
max.depth = 7,
min_child_weight = 10,
gamma = 0,
subsample = 0.9 ,
colsample_bytree = 0.9,
nrounds = 1000,
early_stopping_rounds = 20,
verbose = 1,
nthread = 1,
print_every_n = 20,
objective = "binary:logistic",
eval_metric = "auc",
eval_metric = "error")
## [1] train-auc:0.735957+0.007026 train-error:0.322844+0.004881 test-auc:0.645129+0.013372 test-error:0.387876+0.017787
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.826208+0.006111 train-error:0.254532+0.006796 test-auc:0.690346+0.013330 test-error:0.364374+0.009648
## Stopping. Best iteration:
## [19] train-auc:0.824125+0.007231 train-error:0.255813+0.006034 test-auc:0.689620+0.013659 test-error:0.362248+0.010626
The following step involves plotting the error rate across the different learning rates.
# Extract results for model with eta = 0.3
pd1 <- cbind.data.frame(bst_mod_1$evaluation_log[,c("iter", "test_error_mean")], rep(0.3, nrow(bst_mod_1$evaluation_log)))
names(pd1)[3] <- "eta"
# Extract results for model with eta = 0.1
pd2 <- cbind.data.frame(bst_mod_2$evaluation_log[,c("iter", "test_error_mean")], rep(0.1, nrow(bst_mod_2$evaluation_log)))
names(pd2)[3] <- "eta"
# Extract results for model with eta = 0.05
pd3 <- cbind.data.frame(bst_mod_3$evaluation_log[,c("iter", "test_error_mean")], rep(0.05, nrow(bst_mod_3$evaluation_log)))
names(pd3)[3] <- "eta"
# Extract results for model with eta = 0.01
pd4 <- cbind.data.frame(bst_mod_4$evaluation_log[,c("iter", "test_error_mean")], rep(0.01, nrow(bst_mod_4$evaluation_log)))
names(pd4)[3] <- "eta"
# Extract results for model with eta = 0.005
pd5 <- cbind.data.frame(bst_mod_5$evaluation_log[,c("iter", "test_error_mean")], rep(0.005, nrow(bst_mod_5$evaluation_log)))
names(pd5)[3] <- "eta"
# Join datasets
plot_data <- rbind.data.frame(pd1, pd2, pd3, pd4, pd5)
# Convert eta to factor
plot_data$eta <- as.factor(plot_data$eta)
# Plot points
g_6 <- ggplot(plot_data, aes(x = iter, y = test_error_mean, color = eta))+
geom_point(alpha = 0.5) +
theme_bw() + # Set theme
theme(panel.grid.major = element_blank(), # Remove grid
panel.grid.minor = element_blank(), # Remove grid
panel.border = element_blank(), # Remove border
panel.background = element_blank()) + # Remove background
labs(x = "Number of Trees", title = "Error Rate v Number of Trees",
y = "Error Rate", color = "Learning \n Rate") # Set labels
g_6
# Plot lines
g_7 <- ggplot(plot_data, aes(x = iter, y = test_error_mean, color = eta))+
geom_smooth(alpha = 0.5) +
theme_bw() + # Set theme
theme(panel.grid.major = element_blank(), # Remove grid
panel.grid.minor = element_blank(), # Remove grid
panel.border = element_blank(), # Remove border
panel.background = element_blank()) + # Remove background
labs(x = "Number of Trees", title = "Error Rate v Number of Trees",
y = "Error Rate", color = "Learning \n Rate") # Set labels
g_7
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
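The plots can be supplemented with a numeric comparison of the best iterations reported in the cross-validation logs above. The sketch below copies those five "Best iteration" lines into a hypothetical data frame named `best_runs`. It then flags which runs fall within one standard deviation of the lowest-error run. The one-SD band is a rough heuristic chosen for illustration, not a formal test.

```r
# Best-iteration summaries copied from the xgb.cv logs above
best_runs <- data.frame(
  eta           = c(0.3, 0.1, 0.05, 0.01, 0.005),
  best_iter     = c(13, 35, 10, 38, 19),
  test_error    = c(0.364501, 0.355373, 0.360875, 0.359624, 0.362248),
  test_error_sd = c(0.009260, 0.012566, 0.013102, 0.008887, 0.010626)
)

# Sort so the lowest cross-validated error comes first
best_runs <- best_runs[order(best_runs$test_error), ]
lowest <- best_runs[1, ]

# Flag runs whose mean error lies within one SD of the lowest-error run
best_runs$within_one_sd <-
  best_runs$test_error <= lowest$test_error + lowest$test_error_sd
print(best_runs)
```

On the real objects, the equivalent row for one run can be pulled from `evaluation_log` by locating the minimum of `test_error_mean`.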
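In the cross-validation logs, eta = 0.1 reaches the lowest test error at its best iteration (0.3554 at 35 trees), followed by eta = 0.01 (0.3596 at 38 trees). These gaps are within the fold-to-fold standard deviation of roughly 0.01, so the learning rates are hard to separate on this evidence. The slower learning rate eta = 0.01 is selected here. The final step is to fit the model using these tuned hyperparameters.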
set.seed(111111)
bst_final <- xgboost(data = dtrain,
eta = 0.01,
max.depth = 7,
min_child_weight = 10,
gamma = 0,
subsample = 0.9,
colsample_bytree = 0.9,
nrounds = 100,
early_stopping_rounds = 20, # No validation set is supplied, so this monitors train_error
verbose = 1,
nthread = 1,
print_every_n = 20,
objective = "binary:logistic",
eval_metric = "auc",
eval_metric = "error")
## [1] train-auc:0.732945 train-error:0.327375
## Multiple eval metrics are present. Will use train_error for early stopping.
## Will train until train_error hasn't improved in 20 rounds.
##
## [21] train-auc:0.807511 train-error:0.275500
## [41] train-auc:0.831842 train-error:0.257625
## [61] train-auc:0.848296 train-error:0.243750
## [81] train-auc:0.861668 train-error:0.228000
## [100] train-auc:0.869846 train-error:0.218625
With the final hyperparameters established, the next step is to evaluate the tuned model on the test dataset to verify its predictive performance.
boost_preds <- predict(bst_final, dtest) # Create predictions for XGBoost model
pred_dat <- cbind.data.frame(boost_preds, test_data$class) # Pair predicted probabilities with true classes
# Convert predictions to classes, using a 0.5 cut-off
boost_pred_class <- rep(0, length(boost_preds))
boost_pred_class[boost_preds >= 0.5] <- 1
conf_tab <- table(boost_pred_class, test_data$class) # Create table (rows: predicted, columns: actual)
confusionMatrix(conf_tab, positive = "1")
## Confusion Matrix and Statistics
##
##
## boost_pred_class 0 1
## 0 344 338
## 1 398 920
##
## Accuracy : 0.632
## 95% CI : (0.6104, 0.6532)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : 0.40024
##
## Kappa : 0.1982
##
## Mcnemar's Test P-Value : 0.02965
##
## Sensitivity : 0.7313
## Specificity : 0.4636
## Pos Pred Value : 0.6980
## Neg Pred Value : 0.5044
## Prevalence : 0.6290
## Detection Rate : 0.4600
## Detection Prevalence : 0.6590
## Balanced Accuracy : 0.5975
##
## 'Positive' Class : 1
##
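The headline statistics in the output above can be reproduced from the four cell counts in base R, which makes it easier to see how accuracy relates to the no-information rate. This is a minimal sketch; the object names are hypothetical. The one-sided binomial test is used here on the assumption that it corresponds to the reported P-Value [Acc > NIR].

```r
# Counts copied from the confusion matrix above
# (rows: predicted class, columns: actual class)
cm <- matrix(c(344, 398, 338, 920), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n                 # correct predictions / total
sensitivity <- cm["1", "1"] / sum(cm[, "1"])     # true positives / actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])     # true negatives / actual negatives
balanced_accuracy <- (sensitivity + specificity) / 2
nir <- max(colSums(cm)) / n                      # share of the majority class

# One-sided test of whether accuracy exceeds the no-information rate
p_value <- binom.test(sum(diag(cm)), n, p = nir,
                      alternative = "greater")$p.value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy, nir = nir), 4))
print(p_value)
```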
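The final model reaches a test accuracy of 0.632, up from the initial benchmark of 0.5905. The margin over a naive baseline is narrow, however: the no-information rate is 0.629 (P-Value [Acc > NIR] = 0.40), so this accuracy is not statistically distinguishable from always predicting class 1, and specificity (0.4636) is well below sensitivity (0.7313). The next step involves extracting variable importance measures from XGBoost to determine which financial indicators most strongly influence the model's predictions.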
imp_mat <- xgb.importance(model = bst_final)
xgb.plot.importance(imp_mat, top_n = 10)
Through iterative hyperparameter tuning, the XGBoost model's test accuracy in predicting gainers and losers in global equity markets increased from 0.5905 to 0.632, although this remains close to the no-information rate of 0.629. Feature importance analysis highlighted liquidity and growth metrics, such as shortTermCoverageRatios, Deposit.Liabilities, and Revenue.Growth, as the most influential features in the model's predictions. These findings suggest that optimized tree-based models can uncover relationships among financial indicators and can support equity performance prediction, even though the margin over the majority-class baseline is narrow here. Future work could explore feature engineering or the integration of interpretability frameworks to further enhance the model's predictive accuracy and explainability.