library(tidyverse)
library(skimr)
library(DataExplorer)
library(corrplot)
library(ggfortify)
library(caret)
set.seed(123)
We will explore, analyze, and model two sample sales datasets from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/, containing 100 and 1,000,000 records respectively.
Here we load the data from the CSVs we have downloaded.
First we load the sales dataset with 100 observations and see that it has 14 columns.
hund_df <- read_csv("100 Sales Records.csv", show_col_types = TRUE)
## Rows: 100 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(hund_df)
## Rows: 100
## Columns: 14
## $ Region <chr> "Australia and Oceania", "Central America and the Car…
## $ Country <chr> "Tuvalu", "Grenada", "Russia", "Sao Tome and Principe…
## $ `Item Type` <chr> "Baby Food", "Cereal", "Office Supplies", "Fruits", "…
## $ `Sales Channel` <chr> "Offline", "Online", "Offline", "Online", "Offline", …
## $ `Order Priority` <chr> "H", "C", "L", "C", "L", "C", "M", "H", "M", "H", "H"…
## $ `Order Date` <chr> "5/28/2010", "8/22/2012", "5/2/2014", "6/20/2014", "2…
## $ `Order ID` <dbl> 669165933, 963881480, 341417157, 514321792, 115456712…
## $ `Ship Date` <chr> "6/27/2010", "9/15/2012", "5/8/2014", "7/5/2014", "2/…
## $ `Units Sold` <dbl> 9925, 2804, 1779, 8102, 5062, 2974, 4187, 8082, 6070,…
## $ `Unit Price` <dbl> 255.28, 205.70, 651.21, 9.33, 651.21, 255.28, 668.27,…
## $ `Unit Cost` <dbl> 159.42, 117.11, 524.96, 6.92, 524.96, 159.42, 502.54,…
## $ `Total Revenue` <dbl> 2533654.00, 576782.80, 1158502.59, 75591.66, 3296425.…
## $ `Total Cost` <dbl> 1582243.50, 328376.44, 933903.84, 56065.84, 2657347.5…
## $ `Total Profit` <dbl> 951410.50, 248406.36, 224598.75, 19525.82, 639077.50,…
Next we load the sales dataset with 1,000,000 rows and see the same 14 columns.
mil_df <- read_csv("1000000 Sales Records.csv", show_col_types = FALSE)
glimpse(mil_df)
## Rows: 1,000,000
## Columns: 14
## $ Region <chr> "Sub-Saharan Africa", "Middle East and North Africa",…
## $ Country <chr> "South Africa", "Morocco", "Papua New Guinea", "Djibo…
## $ `Item Type` <chr> "Fruits", "Clothes", "Meat", "Clothes", "Beverages", …
## $ `Sales Channel` <chr> "Offline", "Online", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "M", "M", "H", "L", "L", "M", "L", "L", "L", "M"…
## $ `Order Date` <chr> "7/27/2012", "9/14/2013", "5/15/2015", "5/17/2017", "…
## $ `Order ID` <dbl> 443368995, 667593514, 940995585, 880811536, 174590194…
## $ `Ship Date` <chr> "7/28/2012", "10/19/2013", "6/4/2015", "7/2/2017", "1…
## $ `Units Sold` <dbl> 1593, 4611, 360, 562, 3973, 1379, 597, 1476, 896, 776…
## $ `Unit Price` <dbl> 9.33, 109.28, 421.89, 109.28, 47.45, 9.33, 47.45, 47.…
## $ `Unit Cost` <dbl> 6.92, 35.84, 364.69, 35.84, 31.79, 6.92, 31.79, 31.79…
## $ `Total Revenue` <dbl> 14862.69, 503890.08, 151880.40, 61415.36, 188518.85, …
## $ `Total Cost` <dbl> 11023.56, 165258.24, 131288.40, 20142.08, 126301.67, …
## $ `Total Profit` <dbl> 3839.13, 338631.84, 20592.00, 41273.28, 62217.18, 332…
First, we must consider which columns we have and whether they are suitable targets for a prediction model. We do have labels such as order priority, item type, and region that could plausibly be predicted from the other columns. However, I believe the most important column for a company to predict here is total profit, as money is ultimately the end goal of every business. Total profit is a continuous quantity, so models suited only to classification, such as logistic regression, are not suitable choices; we need a model capable of quantitative regression.
In this case we actually have exactly the information needed to reconstruct profit if we consider all of the columns: total profit is simply total revenue minus total cost, total cost is units sold multiplied by unit cost, and total revenue is units sold multiplied by unit price. Thus there is both strong correlation and multicollinearity between many of the columns in our data. This can be seen below in both datasets, particularly in how highly total revenue correlates with total cost. Interestingly, the incidental correlations between columns that should have no relationship at all, such as order ID and any other column, disappear once we look at the million-row data; only the real correlations remain.
par(mfrow = c(1, 2))
corrplot(
  hund_df |>
    select(where(is.numeric)) |>
    cor(),
  type = 'lower', na.label = "square", na.label.col = "lightgrey"
)
title("Hundred Sales Correlation")
corrplot(
  mil_df |>
    select(where(is.numeric)) |>
    cor(),
  type = 'lower', na.label = "square", na.label.col = "lightgrey"
)
title("Million Sales Correlation")
As long as we have units sold, unit price, and unit cost as predictors, we can perfectly predict total profit by simply recreating the equation Total Profit = Units Sold * (Unit Price - Unit Cost). No further modeling is needed to predict total profit, since the equation can be derived from a cursory look at the data. Still, we can subset the data down to these three predictors plus the response variable, total profit, and fit a multiple linear regression model. In theory, the fitted coefficients should end up matching the equation.
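Before fitting anything, we can sanity-check that the derived equation holds row by row. A minimal sketch (assuming hund_df as loaded above, and up to any rounding in the stored totals); the derived_profit column is introduced only for this check:
# Recompute profit from the equation and report the largest absolute
# discrepancy against the stored Total Profit column (expected to be ~0).
hund_df |>
  mutate(derived_profit = `Units Sold` * (`Unit Price` - `Unit Cost`)) |>
  summarise(max_abs_diff = max(abs(derived_profit - `Total Profit`)))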
hund_df_simp <- hund_df |>
  select(`Total Profit`, `Units Sold`, `Unit Price`, `Unit Cost`)
mil_df_simp <- mil_df |>
  select(`Total Profit`, `Units Sold`, `Unit Price`, `Unit Cost`)
head(hund_df_simp)
## # A tibble: 6 × 4
## `Total Profit` `Units Sold` `Unit Price` `Unit Cost`
## <dbl> <dbl> <dbl> <dbl>
## 1 951410. 9925 255. 159.
## 2 248406. 2804 206. 117.
## 3 224599. 1779 651. 525.
## 4 19526. 8102 9.33 6.92
## 5 639078. 5062 651. 525.
## 6 285088. 2974 255. 159.
Now we have a subset of data that works well for a linear regression model. However, if we wanted to apply a model that is not equation-based, such as a decision tree, to predict profit then, to my understanding, these three predictors alone would not give us the best predictions. Tree-based models essentially group observations by predictor values rather than feeding the predictors into an equation that can explain a prediction with only a few terms. That is why we may need to come back to the columns we dropped if our decision tree accuracy turns out to be far off.
Since the basic data preparation we have done applies to the whole dataset and does not bias it, we will now split each dataset into training and test sets.
inTrain <- createDataPartition(
  y = hund_df_simp$`Total Profit`,
  p = .75,
  list = FALSE
)
hund_df_simp_train <- hund_df_simp[inTrain, ]
hund_df_simp_test <- hund_df_simp[-inTrain, ]
cat("Number of test rows:", nrow(hund_df_simp_test), "Number of train rows:", nrow(hund_df_simp_train))
## Number of test rows: 24 Number of train rows: 76
inTrain <- createDataPartition(
  y = mil_df_simp$`Total Profit`,
  p = .75,
  list = FALSE
)
mil_df_simp_train <- mil_df_simp[inTrain, ]
mil_df_simp_test <- mil_df_simp[-inTrain, ]
cat("Number of test rows:", nrow(mil_df_simp_test), "Number of train rows:", nrow(mil_df_simp_train))
## Number of test rows: 249999 Number of train rows: 750001
We can see that we have much more data to train on in the larger dataset, and also much more to test on.
Our data exploration here consists of skimming the training data for immediate summary statistics, data types, and a visualization of the distributions with mini histograms.
skim(mil_df_simp_train)
| Name | mil_df_simp_train |
| Number of rows | 750001 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Total Profit | 0 | 1 | 392342.96 | 378952.63 | 2.41 | 95104.80 | 281054.88 | 565307.60 | 1738700.00 | ▇▃▂▁▁ |
| Units Sold | 0 | 1 | 4999.76 | 2885.84 | 1.00 | 2499.00 | 5000.00 | 7498.00 | 10000.00 | ▇▇▇▇▇ |
| Unit Price | 0 | 1 | 265.99 | 216.92 | 9.33 | 81.73 | 154.06 | 421.89 | 668.27 | ▇▇▁▅▃ |
| Unit Cost | 0 | 1 | 187.48 | 175.59 | 6.92 | 35.84 | 97.44 | 263.33 | 524.96 | ▇▃▂▁▂ |
skim(hund_df_simp_train)
| Name | hund_df_simp_train |
| Number of rows | 76 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Total Profit | 0 | 1 | 440975.02 | 428747.68 | 1258.02 | 121443.58 | 300330.84 | 635828.80 | 1719922.04 | ▇▃▂▁▁ |
| Units Sold | 0 | 1 | 5299.33 | 2790.01 | 171.00 | 2961.50 | 5464.00 | 7688.25 | 9925.00 | ▆▅▇▇▇ |
| Unit Price | 0 | 1 | 275.52 | 236.29 | 9.33 | 81.73 | 205.70 | 437.20 | 668.27 | ▇▅▁▃▅ |
| Unit Cost | 0 | 1 | 191.23 | 190.00 | 6.92 | 35.84 | 117.11 | 263.33 | 524.96 | ▇▂▂▁▃ |
Note that none of these columns are normally distributed, which is a natural consequence of units sold, cost, and price having a hard lower limit of zero. Total profit, unit price, and unit cost are right-skewed, although in the hundred-observation dataset unit price looks almost bimodal, with mass at each end, rather than simply skewed. Likewise, the roughly uniform distribution of units sold is much less clean with fewer observations to consider. Both effects are natural consequences of sample size, the law of large numbers at work: the million-observation dataset's empirical distributions sit much closer to the population distributions. This means that, all else equal, fitting a model to the larger dataset should reflect the population better than fitting the same model to a similar but smaller dataset.
Normally we would transform the skewed variables to try to shift their distributions toward normality. In this dataset, however, doing so would not benefit the regression, since the untransformed variables already produce the true equation.
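For reference, if we did want to reduce the skew, a log transform of the skewed columns would be the usual first attempt. A minimal sketch (assuming the hund_df_simp_train split from above; profit, price, and cost are strictly positive here, otherwise log1p() would be the safer choice):
# Log-transform the right-skewed columns; Units Sold is left alone since it
# is roughly uniform rather than skewed.
hund_df_simp_train |>
  mutate(across(c(`Total Profit`, `Unit Price`, `Unit Cost`), log))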
Now that we have a better sense of our datasets and have split them into training and testing sets, we will build our two models.
The linear regression model for the hundred-observation dataset fits almost instantly, even with 30-fold cross-validation repeated three times. Looking at the generated coefficients and intercept, we get extremely close to the true equation despite the relatively small amount of data we are using.
ctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 30)
hund_lm <- train(
`Total Profit` ~ `Units Sold`:`Unit Price` + `Units Sold`:`Unit Cost`,
data = hund_df_simp_train,
method = "lm",
trControl = ctrl
)
summary(hund_lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.553e-09 -1.262e-10 -7.280e-11 2.560e-11 5.864e-09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.273e-10 1.204e-10 3.549e+00 0.00068
## `\\`Units Sold\\`:\\`Unit Price\\`` 1.000e+00 3.230e-16 3.096e+15 < 2e-16
## `\\`Units Sold\\`:\\`Unit Cost\\`` -1.000e+00 4.304e-16 -2.324e+15 < 2e-16
##
## (Intercept) ***
## `\\`Units Sold\\`:\\`Unit Price\\`` ***
## `\\`Units Sold\\`:\\`Unit Cost\\`` ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.283e-10 on 73 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.3e+31 on 2 and 73 DF, p-value: < 2.2e-16
Meanwhile, it takes almost 3 minutes to run the same 30-fold cross-validation repeated three times on our million-observation dataset. We actually drift slightly further from the true equation: the intercept grows in magnitude, shifting away from its true value of 0, and the residual standard error increases accordingly. **In this case, our model building and analysis is actually worse off for using the larger dataset.** Using too much data can increase processing time without any significant gain (or, in this case, any gain at all) in model accuracy.
mil_lm <- train(
  `Total Profit` ~ `Units Sold`:`Unit Price` + `Units Sold`:`Unit Cost`,
  data = mil_df_simp_train,
  method = "lm",
  trControl = ctrl
)
summary(mil_lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.814e-05 0.000e+00 0.000e+00 1.000e-10 7.045e-06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.973e-08 5.750e-11 -5.171e+02 <2e-16
## `\\`Units Sold\\`:\\`Unit Price\\`` 1.000e+00 1.741e-16 5.745e+15 <2e-16
## `\\`Units Sold\\`:\\`Unit Cost\\`` -1.000e+00 2.225e-16 -4.494e+15 <2e-16
##
## (Intercept) ***
## `\\`Units Sold\\`:\\`Unit Price\\`` ***
## `\\`Units Sold\\`:\\`Unit Cost\\`` ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.459e-08 on 749998 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.5e+31 on 2 and 749998 DF, p-value: < 2.2e-16
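As a side note, caret records the elapsed fit time inside each train object, so the timing comparison above does not have to rely on a stopwatch. A minimal sketch (assuming the hund_lm and mil_lm objects fitted above):
# proc.time-style timings (user, system, elapsed seconds) for each full train() call.
hund_lm$times$everything
mil_lm$times$everything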
Utilizing XGBoost here with a non-default tuning grid and the same repeated cross-validation scheme, training takes about 30 minutes on just the 76-row training set. We do end up with an R-squared of 0.958 and an RMSE of $65,683.93 on our predictions, which, while worse than our nearly perfect linear model, is still fairly good all things considered. However, it is not sustainable to extrapolate this kind of model building to the dataset with a million observations, which is why we leave the code below in as a record of the attempt but do not evaluate it.
ctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 30, allowParallel = TRUE)
grid <- expand.grid(nrounds = c(100,200),
max_depth = c(10, 15, 20, 25),
colsample_bytree = seq(0.5, 0.9, length.out = 5),
eta = 0.1,
gamma=0,
min_child_weight = 1,
subsample = 1
)
hund_xgb <- train(
`Total Profit` ~ .,
data = hund_df_simp_train,
method = "xgbTree",
trControl = ctrl,
tuneGrid = grid,
verbosity = 0
)
hund_xgb
Next we train an xgboost model with only 10-fold cross-validation and the default tuning grid to ease the performance demands. This takes only about a minute, which suggests that the million-observation xgboost will not take an eternity to finish training. Even without tuning the hyperparameters ourselves, the best model reaches an R-squared of 0.975 with an RMSE of $68,055.32. Not bad for a model that took roughly one-thirtieth of the time to train.
ctrl <- trainControl(method = "cv", number = 10)
hund_xgb <- train(
`Total Profit` ~ .,
data = hund_df_simp_train,
method = "xgbTree",
trControl = ctrl,
verbosity = 0
)
saveRDS(hund_xgb,file="hund_xgb.rds")
We also save and reload the model so that later renders of this document do not waste computing power retraining it.
hund_xgb <- readRDS("hund_xgb.rds")
hund_xgb
## eXtreme Gradient Boosting
##
## 76 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 68, 69, 69, 68, 69, 68, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds RMSE Rsquared
## 0.3 1 0.6 0.50 50 190478.68 0.8100255
## 0.3 1 0.6 0.50 100 187553.71 0.8195162
## 0.3 1 0.6 0.50 150 171726.71 0.8545125
## 0.3 1 0.6 0.75 50 184838.14 0.8440144
## 0.3 1 0.6 0.75 100 177576.74 0.8521272
## 0.3 1 0.6 0.75 150 173733.64 0.8571455
## 0.3 1 0.6 1.00 50 175369.37 0.8490767
## 0.3 1 0.6 1.00 100 172545.62 0.8481900
## 0.3 1 0.6 1.00 150 170534.73 0.8516665
## 0.3 1 0.8 0.50 50 186009.78 0.8264810
## 0.3 1 0.8 0.50 100 174460.83 0.8445695
## 0.3 1 0.8 0.50 150 170033.15 0.8492663
## 0.3 1 0.8 0.75 50 177782.48 0.8449394
## 0.3 1 0.8 0.75 100 172477.32 0.8597442
## 0.3 1 0.8 0.75 150 168576.08 0.8595173
## 0.3 1 0.8 1.00 50 173509.40 0.8604187
## 0.3 1 0.8 1.00 100 167166.36 0.8669755
## 0.3 1 0.8 1.00 150 164541.48 0.8716111
## 0.3 2 0.6 0.50 50 170536.58 0.8703065
## 0.3 2 0.6 0.50 100 173605.87 0.8608106
## 0.3 2 0.6 0.50 150 176282.69 0.8566465
## 0.3 2 0.6 0.75 50 166766.88 0.8727290
## 0.3 2 0.6 0.75 100 166381.49 0.8753040
## 0.3 2 0.6 0.75 150 166881.01 0.8766310
## 0.3 2 0.6 1.00 50 168843.00 0.8583780
## 0.3 2 0.6 1.00 100 164706.14 0.8685833
## 0.3 2 0.6 1.00 150 168744.91 0.8657856
## 0.3 2 0.8 0.50 50 84669.48 0.9572917
## 0.3 2 0.8 0.50 100 81925.19 0.9622697
## 0.3 2 0.8 0.50 150 80964.88 0.9647203
## 0.3 2 0.8 0.75 50 94500.43 0.9463521
## 0.3 2 0.8 0.75 100 85772.83 0.9539263
## 0.3 2 0.8 0.75 150 81821.37 0.9583087
## 0.3 2 0.8 1.00 50 97230.75 0.9472568
## 0.3 2 0.8 1.00 100 92571.22 0.9527964
## 0.3 2 0.8 1.00 150 89884.71 0.9554306
## 0.3 3 0.6 0.50 50 168260.61 0.8534544
## 0.3 3 0.6 0.50 100 167465.12 0.8617106
## 0.3 3 0.6 0.50 150 174948.86 0.8501624
## 0.3 3 0.6 0.75 50 165474.15 0.8762212
## 0.3 3 0.6 0.75 100 166822.61 0.8672734
## 0.3 3 0.6 0.75 150 168337.88 0.8623150
## 0.3 3 0.6 1.00 50 167654.42 0.8583315
## 0.3 3 0.6 1.00 100 166276.16 0.8594649
## 0.3 3 0.6 1.00 150 166815.54 0.8591319
## 0.3 3 0.8 0.50 50 82267.08 0.9546139
## 0.3 3 0.8 0.50 100 76171.91 0.9611495
## 0.3 3 0.8 0.50 150 74489.88 0.9631513
## 0.3 3 0.8 0.75 50 86271.82 0.9566361
## 0.3 3 0.8 0.75 100 80544.08 0.9627679
## 0.3 3 0.8 0.75 150 79868.77 0.9633359
## 0.3 3 0.8 1.00 50 72903.50 0.9739812
## 0.3 3 0.8 1.00 100 70759.04 0.9757218
## 0.3 3 0.8 1.00 150 69873.59 0.9759898
## 0.4 1 0.6 0.50 50 189278.17 0.8298432
## 0.4 1 0.6 0.50 100 189154.87 0.8307545
## 0.4 1 0.6 0.50 150 183939.32 0.8499637
## 0.4 1 0.6 0.75 50 176922.76 0.8445547
## 0.4 1 0.6 0.75 100 171689.89 0.8521060
## 0.4 1 0.6 0.75 150 165568.53 0.8576681
## 0.4 1 0.6 1.00 50 180985.64 0.8549658
## 0.4 1 0.6 1.00 100 171232.12 0.8590124
## 0.4 1 0.6 1.00 150 168813.59 0.8614381
## 0.4 1 0.8 0.50 50 184733.43 0.8141596
## 0.4 1 0.8 0.50 100 174369.50 0.8367859
## 0.4 1 0.8 0.50 150 169231.63 0.8521522
## 0.4 1 0.8 0.75 50 179959.08 0.8451395
## 0.4 1 0.8 0.75 100 168762.80 0.8578743
## 0.4 1 0.8 0.75 150 167333.20 0.8566604
## 0.4 1 0.8 1.00 50 172777.04 0.8534115
## 0.4 1 0.8 1.00 100 167217.26 0.8642572
## 0.4 1 0.8 1.00 150 162980.03 0.8693081
## 0.4 2 0.6 0.50 50 174627.70 0.8802220
## 0.4 2 0.6 0.50 100 176253.86 0.8721524
## 0.4 2 0.6 0.50 150 179467.21 0.8597507
## 0.4 2 0.6 0.75 50 178845.03 0.8422412
## 0.4 2 0.6 0.75 100 168710.63 0.8660429
## 0.4 2 0.6 0.75 150 169017.64 0.8632997
## 0.4 2 0.6 1.00 50 169218.92 0.8636912
## 0.4 2 0.6 1.00 100 170564.32 0.8645496
## 0.4 2 0.6 1.00 150 171980.95 0.8606534
## 0.4 2 0.8 0.50 50 86915.74 0.9516489
## 0.4 2 0.8 0.50 100 84380.99 0.9572951
## 0.4 2 0.8 0.50 150 76555.69 0.9647028
## 0.4 2 0.8 0.75 50 93737.17 0.9529033
## 0.4 2 0.8 0.75 100 87229.44 0.9623614
## 0.4 2 0.8 0.75 150 86214.98 0.9625043
## 0.4 2 0.8 1.00 50 93857.12 0.9506581
## 0.4 2 0.8 1.00 100 91735.39 0.9531499
## 0.4 2 0.8 1.00 150 91076.04 0.9548029
## 0.4 3 0.6 0.50 50 174641.26 0.8426442
## 0.4 3 0.6 0.50 100 173266.48 0.8525706
## 0.4 3 0.6 0.50 150 176430.49 0.8500481
## 0.4 3 0.6 0.75 50 169813.90 0.8498983
## 0.4 3 0.6 0.75 100 171972.96 0.8466636
## 0.4 3 0.6 0.75 150 176768.54 0.8379962
## 0.4 3 0.6 1.00 50 158059.25 0.8740518
## 0.4 3 0.6 1.00 100 160392.18 0.8656562
## 0.4 3 0.6 1.00 150 165804.30 0.8576795
## 0.4 3 0.8 0.50 50 75249.30 0.9668407
## 0.4 3 0.8 0.50 100 69699.44 0.9746695
## 0.4 3 0.8 0.50 150 68055.32 0.9749971
## 0.4 3 0.8 0.75 50 91550.95 0.9533954
## 0.4 3 0.8 0.75 100 92169.80 0.9520725
## 0.4 3 0.8 0.75 150 92216.57 0.9517995
## 0.4 3 0.8 1.00 50 74743.63 0.9679472
## 0.4 3 0.8 1.00 100 73534.66 0.9694484
## 0.4 3 0.8 1.00 150 73262.12 0.9694769
## MAE
## 149798.81
## 151402.46
## 137091.86
## 146788.01
## 143638.93
## 139281.70
## 144248.78
## 143575.85
## 139580.54
## 152362.45
## 140237.17
## 135719.03
## 148130.75
## 141993.31
## 135397.57
## 140431.49
## 137393.62
## 132927.77
## 133835.00
## 134764.11
## 135733.70
## 131819.53
## 127877.30
## 128870.87
## 135289.57
## 128495.58
## 128866.52
## 64610.36
## 64690.80
## 62530.25
## 76657.50
## 70869.80
## 67071.86
## 75091.96
## 70113.13
## 67171.96
## 135033.77
## 132430.20
## 136511.15
## 129156.63
## 130225.63
## 131028.89
## 128764.55
## 126030.39
## 124931.83
## 63331.42
## 56612.16
## 56226.43
## 64789.68
## 60109.09
## 60102.73
## 55545.38
## 53830.05
## 53344.36
## 150537.40
## 146827.23
## 145350.37
## 145031.60
## 139634.29
## 132195.28
## 149437.61
## 140619.17
## 136859.96
## 143993.37
## 132286.61
## 126050.49
## 147593.87
## 134803.59
## 133323.39
## 143562.08
## 137157.98
## 131022.21
## 131496.45
## 140197.89
## 139050.90
## 144786.15
## 133858.56
## 130161.97
## 133828.10
## 130982.87
## 132031.07
## 69112.99
## 66960.90
## 60253.21
## 72057.48
## 65201.69
## 65158.40
## 72739.76
## 69765.62
## 68744.26
## 128654.06
## 133152.56
## 137038.63
## 129851.43
## 130135.48
## 132176.07
## 120990.47
## 119300.22
## 121762.14
## 61821.42
## 57601.50
## 56118.28
## 68301.98
## 67638.53
## 67916.97
## 58936.71
## 56943.00
## 56564.70
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
## = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
## = 0.5.
varImp(hund_xgb)
## xgbTree variable importance
##
## Overall
## `Unit Price` 100.00
## `Units Sold` 45.95
## `Unit Cost` 0.00
Even with this reduced-cost training configuration, fitting the xgboost model on the million-observation dataset took almost an hour. That is not a sustainable training time for any use case requiring frequent retraining; it might be better here to use only a portion of the training data just to save time. That said, the best fit found by cross-validation had an RMSE of $12,287.79 and an R-squared of 0.9989, which is extremely impressive compared to the previous xgboost model trained on the smaller dataset.
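A minimal sketch of the subsampling idea just mentioned (not evaluated here): train the same specification on a random subset of the training rows to cut fit time, at some cost in accuracy. The 100,000-row figure and the mil_xgb_sub name are illustrative only, not tuned or used elsewhere.
# Fit the same xgboost specification on a random 100,000-row subsample of the
# training data; reuses the 10-fold CV control object defined earlier.
mil_xgb_sub <- train(
  `Total Profit` ~ .,
  data = slice_sample(mil_df_simp_train, n = 100000),
  method = "xgbTree",
  trControl = ctrl,
  verbosity = 0
)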
mil_xgb <- train(
  `Total Profit` ~ .,
  data = mil_df_simp_train,
  method = "xgbTree",
  trControl = ctrl,
  verbosity = 0
)
saveRDS(mil_xgb, file = "mil_xgb.rds")
mil_xgb <- readRDS("mil_xgb.rds")
mil_xgb
## eXtreme Gradient Boosting
##
## 750001 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 675000, 675000, 675000, 675001, 675002, 675002, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds RMSE Rsquared
## 0.3 1 0.6 0.50 50 159553.58 0.8244027
## 0.3 1 0.6 0.50 100 154552.88 0.8339084
## 0.3 1 0.6 0.50 150 153411.62 0.8361797
## 0.3 1 0.6 0.75 50 159156.03 0.8261044
## 0.3 1 0.6 0.75 100 154487.70 0.8340351
## 0.3 1 0.6 0.75 150 153495.19 0.8360109
## 0.3 1 0.6 1.00 50 160204.42 0.8234345
## 0.3 1 0.6 1.00 100 154679.85 0.8336228
## 0.3 1 0.6 1.00 150 153568.34 0.8358731
## 0.3 1 0.8 0.50 50 157482.06 0.8292239
## 0.3 1 0.8 0.50 100 154239.82 0.8345122
## 0.3 1 0.8 0.50 150 153480.26 0.8360150
## 0.3 1 0.8 0.75 50 157505.88 0.8291765
## 0.3 1 0.8 0.75 100 154213.02 0.8345767
## 0.3 1 0.8 0.75 150 153480.17 0.8360236
## 0.3 1 0.8 1.00 50 157799.28 0.8285336
## 0.3 1 0.8 1.00 100 154219.01 0.8345593
## 0.3 1 0.8 1.00 150 153473.53 0.8360390
## 0.3 2 0.6 0.50 50 152513.98 0.8381036
## 0.3 2 0.6 0.50 100 152170.31 0.8387586
## 0.3 2 0.6 0.50 150 152127.33 0.8388498
## 0.3 2 0.6 0.75 50 152490.06 0.8381414
## 0.3 2 0.6 0.75 100 152132.51 0.8388386
## 0.3 2 0.6 0.75 150 152083.21 0.8389427
## 0.3 2 0.6 1.00 50 152422.22 0.8382928
## 0.3 2 0.6 1.00 100 152139.41 0.8388247
## 0.3 2 0.6 1.00 150 152055.62 0.8390012
## 0.3 2 0.8 0.50 50 53308.73 0.9803040
## 0.3 2 0.8 0.50 100 37753.95 0.9900718
## 0.3 2 0.8 0.50 150 29892.11 0.9937758
## 0.3 2 0.8 0.75 50 50849.93 0.9819616
## 0.3 2 0.8 0.75 100 36153.80 0.9908461
## 0.3 2 0.8 0.75 150 29430.02 0.9939225
## 0.3 2 0.8 1.00 50 54383.31 0.9793692
## 0.3 2 0.8 1.00 100 38855.72 0.9894899
## 0.3 2 0.8 1.00 150 31627.53 0.9930357
## 0.3 3 0.6 0.50 50 152007.86 0.8391032
## 0.3 3 0.6 0.50 100 152013.81 0.8390892
## 0.3 3 0.6 0.50 150 152024.91 0.8390667
## 0.3 3 0.6 0.75 50 152021.37 0.8390789
## 0.3 3 0.6 0.75 100 152012.29 0.8390922
## 0.3 3 0.6 0.75 150 152020.77 0.8390744
## 0.3 3 0.6 1.00 50 152039.08 0.8390570
## 0.3 3 0.6 1.00 100 151991.44 0.8391366
## 0.3 3 0.6 1.00 150 151984.84 0.8391508
## 0.3 3 0.8 0.50 50 23371.86 0.9961925
## 0.3 3 0.8 0.50 100 15141.27 0.9983997
## 0.3 3 0.8 0.50 150 12296.15 0.9989398
## 0.3 3 0.8 0.75 50 25178.25 0.9955550
## 0.3 3 0.8 0.75 100 15924.57 0.9982215
## 0.3 3 0.8 0.75 150 12691.04 0.9988714
## 0.3 3 0.8 1.00 50 24715.21 0.9957201
## 0.3 3 0.8 1.00 100 15801.85 0.9982527
## 0.3 3 0.8 1.00 150 12637.03 0.9988828
## 0.4 1 0.6 0.50 50 157237.45 0.8284182
## 0.4 1 0.6 0.50 100 154371.22 0.8341327
## 0.4 1 0.6 0.50 150 153614.12 0.8357044
## 0.4 1 0.6 0.75 50 157032.74 0.8288501
## 0.4 1 0.6 0.75 100 154339.83 0.8341985
## 0.4 1 0.6 0.75 150 153631.55 0.8356626
## 0.4 1 0.6 1.00 50 157536.76 0.8277530
## 0.4 1 0.6 1.00 100 154312.42 0.8342599
## 0.4 1 0.6 1.00 150 153576.23 0.8357854
## 0.4 1 0.8 0.50 50 156453.39 0.8298948
## 0.4 1 0.8 0.50 100 154356.67 0.8341305
## 0.4 1 0.8 0.50 150 153610.48 0.8357106
## 0.4 1 0.8 0.75 50 156357.27 0.8300337
## 0.4 1 0.8 0.75 100 154339.29 0.8341936
## 0.4 1 0.8 0.75 150 153596.56 0.8357496
## 0.4 1 0.8 1.00 50 156306.18 0.8301741
## 0.4 1 0.8 1.00 100 154277.49 0.8343451
## 0.4 1 0.8 1.00 150 153546.23 0.8358597
## 0.4 2 0.6 0.50 50 152494.53 0.8380741
## 0.4 2 0.6 0.50 100 152207.73 0.8386790
## 0.4 2 0.6 0.50 150 152143.49 0.8388149
## 0.4 2 0.6 0.75 50 152414.30 0.8382492
## 0.4 2 0.6 0.75 100 152142.66 0.8388164
## 0.4 2 0.6 0.75 150 152071.89 0.8389670
## 0.4 2 0.6 1.00 50 152321.20 0.8384497
## 0.4 2 0.6 1.00 100 152113.66 0.8388786
## 0.4 2 0.6 1.00 150 152036.62 0.8390415
## 0.4 2 0.8 0.50 50 51307.15 0.9815509
## 0.4 2 0.8 0.50 100 35564.30 0.9911626
## 0.4 2 0.8 0.50 150 29069.84 0.9941052
## 0.4 2 0.8 0.75 50 49212.33 0.9830793
## 0.4 2 0.8 0.75 100 34361.32 0.9917584
## 0.4 2 0.8 0.75 150 28055.20 0.9945049
## 0.4 2 0.8 1.00 50 49824.02 0.9826677
## 0.4 2 0.8 1.00 100 35422.38 0.9911832
## 0.4 2 0.8 1.00 150 28953.84 0.9941019
## 0.4 3 0.6 0.50 50 152074.19 0.8389611
## 0.4 3 0.6 0.50 100 152056.69 0.8389985
## 0.4 3 0.6 0.50 150 152051.67 0.8390093
## 0.4 3 0.6 0.75 50 152053.64 0.8390052
## 0.4 3 0.6 0.75 100 152028.96 0.8390572
## 0.4 3 0.6 0.75 150 152035.94 0.8390431
## 0.4 3 0.6 1.00 50 152048.94 0.8390156
## 0.4 3 0.6 1.00 100 152001.29 0.8391160
## 0.4 3 0.6 1.00 150 151993.63 0.8391322
## 0.4 3 0.8 0.50 50 23163.53 0.9962581
## 0.4 3 0.8 0.50 100 15470.04 0.9983260
## 0.4 3 0.8 0.50 150 12334.39 0.9989355
## 0.4 3 0.8 0.75 50 23056.21 0.9962942
## 0.4 3 0.8 0.75 100 15716.43 0.9982675
## 0.4 3 0.8 0.75 150 12287.79 0.9989383
## 0.4 3 0.8 1.00 50 22657.05 0.9963823
## 0.4 3 0.8 1.00 100 15953.28 0.9982170
## 0.4 3 0.8 1.00 150 12996.92 0.9988163
## MAE
## 117919.669
## 112664.033
## 111009.578
## 117173.306
## 112876.317
## 111111.264
## 118738.136
## 113053.769
## 111259.803
## 115289.119
## 112661.844
## 111296.683
## 115319.222
## 112655.615
## 111322.467
## 115683.191
## 112596.565
## 111294.167
## 108311.453
## 108008.549
## 107933.803
## 108517.213
## 107914.470
## 107746.407
## 108140.391
## 107911.557
## 107637.340
## 40889.187
## 29179.004
## 23000.667
## 39160.566
## 27685.337
## 22625.927
## 41279.109
## 29421.594
## 23863.117
## 107493.925
## 107551.089
## 107546.062
## 107405.869
## 107514.831
## 107522.433
## 107292.734
## 107481.965
## 107448.135
## 18338.768
## 11598.451
## 9252.593
## 19615.376
## 12302.204
## 9628.602
## 19465.103
## 12319.778
## 9683.377
## 116983.014
## 112911.191
## 111291.097
## 116466.045
## 112865.295
## 111370.952
## 117264.760
## 112752.601
## 111194.755
## 116516.248
## 113049.678
## 111265.683
## 116545.027
## 112971.162
## 111163.446
## 116513.809
## 112837.246
## 111112.066
## 108613.860
## 108037.881
## 107942.917
## 108525.794
## 107922.723
## 107733.030
## 108237.557
## 107771.426
## 107559.027
## 40206.518
## 27851.660
## 22730.379
## 37833.557
## 26506.625
## 21683.626
## 38317.510
## 27407.961
## 22455.460
## 107604.258
## 107619.527
## 107612.020
## 107602.120
## 107580.791
## 107553.140
## 107532.066
## 107461.616
## 107420.658
## 18092.348
## 11705.510
## 9143.696
## 17942.834
## 11839.543
## 9151.708
## 17663.277
## 12038.833
## 9643.420
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
## = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
## = 0.75.
varImp(mil_xgb)
## xgbTree variable importance
##
## Overall
## `Unit Price` 100.00
## `Units Sold` 89.87
## `Unit Cost` 0.00
Finally, we determine which of the models we have created works best on our test data.
cat(
  "The hundred observation dataset with a linear model gives us an RMSE of:",
  hund_lm |>
    predict(hund_df_simp_test) |>
    RMSE(hund_df_simp_test$`Total Profit`),
  "\nThe hundred observation dataset with an xgboost model gives us an RMSE of:",
  hund_xgb |>
    predict(hund_df_simp_test) |>
    RMSE(hund_df_simp_test$`Total Profit`),
  "\nThe million observation dataset with a linear model gives us an RMSE of:",
  mil_lm |>
    predict(mil_df_simp_test) |>
    RMSE(mil_df_simp_test$`Total Profit`),
  "\nThe million observation dataset with an xgboost model gives us an RMSE of:",
  mil_xgb |>
    predict(mil_df_simp_test) |>
    RMSE(mil_df_simp_test$`Total Profit`)
)
## The hundred observation dataset with a linear model gives us an RMSE of: 5.053595e-10
## The hundred observation dataset with an xgboost model gives us an RMSE of: 60215.74
## The million observation dataset with a linear model gives us an RMSE of: 3.446863e-08
## The million observation dataset with an xgboost model gives us an RMSE of: 12275.47
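For a more compact comparison, caret's postResample() reports test-set RMSE, R-squared, and MAE together. A minimal sketch (assuming the four fitted models and test sets from above):
# One call per model: returns a named vector of RMSE, Rsquared, and MAE
# computed on the held-out test data.
postResample(predict(hund_lm, hund_df_simp_test), hund_df_simp_test$`Total Profit`)
postResample(predict(hund_xgb, hund_df_simp_test), hund_df_simp_test$`Total Profit`)
postResample(predict(mil_lm, mil_df_simp_test), mil_df_simp_test$`Total Profit`)
postResample(predict(mil_xgb, mil_df_simp_test), mil_df_simp_test$`Total Profit`)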
Comparing our predictions on the test data by RMSE, the lowest error comes from the linear model trained on the hundred-observation dataset. Normally we would expect the best model for predicting the total profit of a sales order to be the one trained on more data. Yet, because we know the true equation for total profit, we can see that the million-observation linear model is in fact slightly further from that true equation than the hundred-observation one, which is why the RMSE is lower for the model trained on less data. This goes to show that more data does not always equate to a more accurate model. We likely hit the ceiling on how much data is useful for building a good linear model very early here, especially considering there are only four variables whose relationship it needs to represent properly.
In most cases, however, more data does lead to a more accurate model. That effect is observed very strongly in the xgboost models, where test RMSE improves roughly five-fold with the larger dataset. Xgboost, and boosted tree models in general, are better at avoiding bias than a linear model but require more data to get a good result, which is why using the million-observation dataset was the better idea here. Still, we must consider whether the time taken to train on all that data (almost an hour) was worth the better result. There is likely a sweet spot between a hundred observations and a million that would maximize model accuracy relative to time spent fitting.
**Overall, for this dataset I would trust the linear model generated from the hundred-observation dataset.** The main reason is that when the quantity we are predicting has a direct arithmetic relationship to the predictors, a linear model can fit that relationship exactly. However, you must be able to construct the proper relationship between the factors yourself and determine which factors actually contribute rather than merely being multicollinear, and real-world modeling is never this easy. That is why examining correlations and having domain knowledge are so important when building linear models. Xgboost, by contrast, gives us a good predictive model when we do not know which variables matter or when there is no direct arithmetic relationship between them.
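To illustrate the point about constructing the relationship yourself, a minimal sketch (assuming hund_df_simp_train from above; profit_lm is a hypothetical name): engineering the single feature Units Sold * (Unit Price - Unit Cost) lets a one-term linear model recover the true equation, with a slope of 1 and an intercept of 0.
# Fit profit against the hand-constructed feature; the I() wrapper keeps the
# arithmetic inside the formula rather than being parsed as model operators.
profit_lm <- lm(
  `Total Profit` ~ I(`Units Sold` * (`Unit Price` - `Unit Cost`)),
  data = hund_df_simp_train
)
coef(profit_lm)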