library(tidyverse)
library(skimr)
library(DataExplorer)
library(corrplot)
library(ggfortify)
library(caret)
set.seed(123)
We will explore, analyze, and model two sample sales datasets from https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/, containing 100 and 1,000,000 records respectively.
Here we load the data from the CSVs we have downloaded.
First we load the sales dataset with 100 observations and see that it has 14 columns.
hund_df <- read_csv("100 Sales Records.csv", show_col_types = TRUE)
## Rows: 100 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Region, Country, Item Type, Sales Channel, Order Priority, Order Da...
## dbl (7): Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(hund_df)
## Rows: 100
## Columns: 14
## $ Region <chr> "Australia and Oceania", "Central America and the Car…
## $ Country <chr> "Tuvalu", "Grenada", "Russia", "Sao Tome and Principe…
## $ `Item Type` <chr> "Baby Food", "Cereal", "Office Supplies", "Fruits", "…
## $ `Sales Channel` <chr> "Offline", "Online", "Offline", "Online", "Offline", …
## $ `Order Priority` <chr> "H", "C", "L", "C", "L", "C", "M", "H", "M", "H", "H"…
## $ `Order Date` <chr> "5/28/2010", "8/22/2012", "5/2/2014", "6/20/2014", "2…
## $ `Order ID` <dbl> 669165933, 963881480, 341417157, 514321792, 115456712…
## $ `Ship Date` <chr> "6/27/2010", "9/15/2012", "5/8/2014", "7/5/2014", "2/…
## $ `Units Sold` <dbl> 9925, 2804, 1779, 8102, 5062, 2974, 4187, 8082, 6070,…
## $ `Unit Price` <dbl> 255.28, 205.70, 651.21, 9.33, 651.21, 255.28, 668.27,…
## $ `Unit Cost` <dbl> 159.42, 117.11, 524.96, 6.92, 524.96, 159.42, 502.54,…
## $ `Total Revenue` <dbl> 2533654.00, 576782.80, 1158502.59, 75591.66, 3296425.…
## $ `Total Cost` <dbl> 1582243.50, 328376.44, 933903.84, 56065.84, 2657347.5…
## $ `Total Profit` <dbl> 951410.50, 248406.36, 224598.75, 19525.82, 639077.50,…
Next we load the sales dataset with 1,000,000 rows and see the same 14 columns.
mil_df <- read_csv("1000000 Sales Records.csv", show_col_types = FALSE)
glimpse(mil_df)
## Rows: 1,000,000
## Columns: 14
## $ Region <chr> "Sub-Saharan Africa", "Middle East and North Africa",…
## $ Country <chr> "South Africa", "Morocco", "Papua New Guinea", "Djibo…
## $ `Item Type` <chr> "Fruits", "Clothes", "Meat", "Clothes", "Beverages", …
## $ `Sales Channel` <chr> "Offline", "Online", "Offline", "Offline", "Offline",…
## $ `Order Priority` <chr> "M", "M", "M", "H", "L", "L", "M", "L", "L", "L", "M"…
## $ `Order Date` <chr> "7/27/2012", "9/14/2013", "5/15/2015", "5/17/2017", "…
## $ `Order ID` <dbl> 443368995, 667593514, 940995585, 880811536, 174590194…
## $ `Ship Date` <chr> "7/28/2012", "10/19/2013", "6/4/2015", "7/2/2017", "1…
## $ `Units Sold` <dbl> 1593, 4611, 360, 562, 3973, 1379, 597, 1476, 896, 776…
## $ `Unit Price` <dbl> 9.33, 109.28, 421.89, 109.28, 47.45, 9.33, 47.45, 47.…
## $ `Unit Cost` <dbl> 6.92, 35.84, 364.69, 35.84, 31.79, 6.92, 31.79, 31.79…
## $ `Total Revenue` <dbl> 14862.69, 503890.08, 151880.40, 61415.36, 188518.85, …
## $ `Total Cost` <dbl> 11023.56, 165258.24, 131288.40, 20142.08, 126301.67, …
## $ `Total Profit` <dbl> 3839.13, 338631.84, 20592.00, 41273.28, 62217.18, 332…
First, we must consider which columns we have and whether they are suitable targets for a prediction model. We do have labels such as order priority, item type, and region that could plausibly be predicted from the other columns. However, I believe the most important column for a company to predict here is total profit, as money is ultimately the end goal of every business. Total profit is a continuous quantity, so models suited only to classification, such as logistic regression, are not suitable choices; we need a model capable of quantitative regression.
In this case we actually have exactly the information needed to reconstruct profit if we consider all of the columns: total profit is simply total revenue minus total cost, total cost is units sold multiplied by unit cost, and total revenue is units sold multiplied by unit price. Thus there is both strong correlation and multicollinearity between many of the columns in our data. This can be seen below in both datasets, particularly in how highly total revenue correlates with total cost. Interestingly, the incidental correlations between columns that should have no relationship at all, such as order ID and any other column, disappear once we look at the million-row data; only the real correlations remain.
par(mfrow = c(1, 2))
corrplot(
  hund_df |>
    select(where(is.numeric)) |>
    cor(),
  type = 'lower', na.label = "square", na.label.col = "lightgrey"
)
title("Hundred Sales Correlation")
corrplot(
  mil_df |>
    select(where(is.numeric)) |>
    cor(),
  type = 'lower', na.label = "square", na.label.col = "lightgrey"
)
title("Million Sales Correlation")
As long as we have units sold, unit price, and unit cost as predictors, we can perfectly predict total profit by simply recreating the equation Total Profit = Units Sold * (Unit Price - Unit Cost). No further modeling is needed to predict total profit, since the equation can be derived from a cursory look at the data. Still, we can subset the data down to these three predictors plus the response variable, total profit, and fit a multiple linear regression model. In theory, the fitted coefficients should end up matching the equation.
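Before fitting anything, we can sanity-check that the derived equation holds row by row. A minimal sketch (assuming hund_df as loaded above, and up to any rounding in the stored totals); the derived_profit column is introduced only for this check:
# Recompute profit from the equation and report the largest absolute
# discrepancy against the stored Total Profit column (expected to be ~0).
hund_df |>
  mutate(derived_profit = `Units Sold` * (`Unit Price` - `Unit Cost`)) |>
  summarise(max_abs_diff = max(abs(derived_profit - `Total Profit`)))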
hund_df_simp <- hund_df |>
  select(`Total Profit`, `Units Sold`, `Unit Price`, `Unit Cost`)
mil_df_simp <- mil_df |>
  select(`Total Profit`, `Units Sold`, `Unit Price`, `Unit Cost`)
head(hund_df_simp)
## # A tibble: 6 × 4
## `Total Profit` `Units Sold` `Unit Price` `Unit Cost`
## <dbl> <dbl> <dbl> <dbl>
## 1 951410. 9925 255. 159.
## 2 248406. 2804 206. 117.
## 3 224599. 1779 651. 525.
## 4 19526. 8102 9.33 6.92
## 5 639078. 5062 651. 525.
## 6 285088. 2974 255. 159.
Now we have a subset of data that works well for a linear regression model. However, if we wanted to apply a model that is not equation-based, such as a decision tree, to predict profit then, to my understanding, these three predictors alone would not give us the best predictions. Tree-based models essentially group observations by predictor values rather than feeding the predictors into an equation that can explain a prediction with only a few terms. That is why we may need to come back to the columns we dropped if our decision tree accuracy turns out to be far off.
Since the basic data preparation we have done applies to the whole dataset and does not bias it, we will now split each dataset into training and test sets.
inTrain <- createDataPartition(
  y = hund_df_simp$`Total Profit`,
  p = .75,
  list = FALSE
)
hund_df_simp_train <- hund_df_simp[inTrain, ]
hund_df_simp_test <- hund_df_simp[-inTrain, ]
cat("Number of test rows:", nrow(hund_df_simp_test), "Number of train rows:", nrow(hund_df_simp_train))
## Number of test rows: 24 Number of train rows: 76
inTrain <- createDataPartition(
  y = mil_df_simp$`Total Profit`,
  p = .75,
  list = FALSE
)
mil_df_simp_train <- mil_df_simp[inTrain, ]
mil_df_simp_test <- mil_df_simp[-inTrain, ]
cat("Number of test rows:", nrow(mil_df_simp_test), "Number of train rows:", nrow(mil_df_simp_train))
## Number of test rows: 249999 Number of train rows: 750001
We can see that we have much more data to train on in the larger dataset, and also much more to test on.
Our data exploration here consists of skimming the training data for immediate summary statistics, data types, and a visualization of the distributions with mini histograms.
skim(mil_df_simp_train)
| Name | mil_df_simp_train |
| Number of rows | 750001 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Total Profit | 0 | 1 | 392342.96 | 378952.63 | 2.41 | 95104.80 | 281054.88 | 565307.60 | 1738700.00 | ▇▃▂▁▁ |
| Units Sold | 0 | 1 | 4999.76 | 2885.84 | 1.00 | 2499.00 | 5000.00 | 7498.00 | 10000.00 | ▇▇▇▇▇ |
| Unit Price | 0 | 1 | 265.99 | 216.92 | 9.33 | 81.73 | 154.06 | 421.89 | 668.27 | ▇▇▁▅▃ |
| Unit Cost | 0 | 1 | 187.48 | 175.59 | 6.92 | 35.84 | 97.44 | 263.33 | 524.96 | ▇▃▂▁▂ |
skim(hund_df_simp_train)
| Name | hund_df_simp_train |
| Number of rows | 76 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Total Profit | 0 | 1 | 440975.02 | 428747.68 | 1258.02 | 121443.58 | 300330.84 | 635828.80 | 1719922.04 | ▇▃▂▁▁ |
| Units Sold | 0 | 1 | 5299.33 | 2790.01 | 171.00 | 2961.50 | 5464.00 | 7688.25 | 9925.00 | ▆▅▇▇▇ |
| Unit Price | 0 | 1 | 275.52 | 236.29 | 9.33 | 81.73 | 205.70 | 437.20 | 668.27 | ▇▅▁▃▅ |
| Unit Cost | 0 | 1 | 191.23 | 190.00 | 6.92 | 35.84 | 117.11 | 263.33 | 524.96 | ▇▂▂▁▃ |
Note that none of these columns are normally distributed, which is a natural consequence of units sold, cost, and price having a hard lower limit of zero. Total profit, unit price, and unit cost are right-skewed, although in the hundred-observation dataset unit price looks almost bimodal, with mass at each end, rather than simply skewed. Likewise, the roughly uniform distribution of units sold is much less clean with fewer observations to consider. Both effects are natural consequences of sample size, the law of large numbers at work: the million-observation dataset's empirical distributions sit much closer to the population distributions. This means that, all else equal, fitting a model to the larger dataset should reflect the population better than fitting the same model to a similar but smaller dataset.
Normally we would transform the skewed variables to try to shift their distributions toward normality. In this dataset, however, doing so would not benefit the regression, since the untransformed variables already produce the true equation.
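For reference, if we did want to reduce the skew, a log transform of the skewed columns would be the usual first attempt. A minimal sketch (assuming the hund_df_simp_train split from above; profit, price, and cost are strictly positive here, otherwise log1p() would be the safer choice):
# Log-transform the right-skewed columns; Units Sold is left alone since it
# is roughly uniform rather than skewed.
hund_df_simp_train |>
  mutate(across(c(`Total Profit`, `Unit Price`, `Unit Cost`), log))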
Now that we have a better sense of our datasets and have split them into training and testing sets, we will build our two models.
The linear regression model for the hundred-observation dataset fits almost instantly, even with 30-fold cross-validation repeated three times. Looking at the generated coefficients and intercept, we get extremely close to the true equation despite the relatively small amount of data we are using.
ctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 30)
hund_lm <- train(
`Total Profit` ~ `Units Sold`:`Unit Price` + `Units Sold`:`Unit Cost`,
data = hund_df_simp_train,
method = "lm",
trControl = ctrl
)
summary(hund_lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.553e-09 -1.262e-10 -7.280e-11 2.560e-11 5.864e-09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.273e-10 1.204e-10 3.549e+00 0.00068
## `\\`Units Sold\\`:\\`Unit Price\\`` 1.000e+00 3.230e-16 3.096e+15 < 2e-16
## `\\`Units Sold\\`:\\`Unit Cost\\`` -1.000e+00 4.304e-16 -2.324e+15 < 2e-16
##
## (Intercept) ***
## `\\`Units Sold\\`:\\`Unit Price\\`` ***
## `\\`Units Sold\\`:\\`Unit Cost\\`` ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.283e-10 on 73 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.3e+31 on 2 and 73 DF, p-value: < 2.2e-16
Meanwhile, it takes almost 3 minutes to run the same 30-fold cross-validation repeated three times on our million-observation dataset. We actually drift slightly further from the true equation: the intercept grows in magnitude, shifting away from its true value of 0, and the residual standard error increases accordingly. **In this case, our model building and analysis is actually worse off for using the larger dataset.** Using too much data can increase processing time without any significant gain (or, in this case, any gain at all) in model accuracy.
mil_lm <- train(
  `Total Profit` ~ `Units Sold`:`Unit Price` + `Units Sold`:`Unit Cost`,
  data = mil_df_simp_train,
  method = "lm",
  trControl = ctrl
)
summary(mil_lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.814e-05 0.000e+00 0.000e+00 1.000e-10 7.045e-06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.973e-08 5.750e-11 -5.171e+02 <2e-16
## `\\`Units Sold\\`:\\`Unit Price\\`` 1.000e+00 1.741e-16 5.745e+15 <2e-16
## `\\`Units Sold\\`:\\`Unit Cost\\`` -1.000e+00 2.225e-16 -4.494e+15 <2e-16
##
## (Intercept) ***
## `\\`Units Sold\\`:\\`Unit Price\\`` ***
## `\\`Units Sold\\`:\\`Unit Cost\\`` ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.459e-08 on 749998 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.5e+31 on 2 and 749998 DF, p-value: < 2.2e-16
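As a side note, caret records the elapsed fit time inside each train object, so the timing comparison above does not have to rely on a stopwatch. A minimal sketch (assuming the hund_lm and mil_lm objects fitted above):
# proc.time-style timings (user, system, elapsed seconds) for each full train() call.
hund_lm$times$everything
mil_lm$times$everything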
Utilizing XGBoost here with a non-default tuning grid and the same repeated cross-validation scheme, training takes about 30 minutes on just the 76-row training set. We do end up with an R-squared of 0.958 and an RMSE of $65,683.93 on our predictions, which, while worse than our nearly perfect linear model, is still fairly good all things considered. However, it is not sustainable to extrapolate this kind of model building to the dataset with a million observations, which is why we leave the code below in as a record of the attempt but do not evaluate it.
ctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 30, allowParallel = TRUE)
grid <- expand.grid(nrounds = c(100,200),
max_depth = c(10, 15, 20, 25),
colsample_bytree = seq(0.5, 0.9, length.out = 5),
eta = 0.1,
gamma=0,
min_child_weight = 1,
subsample = 1
)
hund_xgb <- train(
`Total Profit` ~ .,
data = hund_df_simp_train,
method = "xgbTree",
trControl = ctrl,
tuneGrid = grid,
verbosity = 0
)
hund_xgb
Next we train an xgboost model with only 10-fold cross-validation and the default tuning grid to ease the performance demands. This takes only about a minute, which suggests that the million-observation xgboost will not take an eternity to finish training. Even without tuning the hyperparameters ourselves, the best model reaches an R-squared of 0.975 with an RMSE of $68,055.32. Not bad for a model that took roughly one-thirtieth of the time to train.
ctrl <- trainControl(method = "cv", number = 10)
hund_xgb <- train(
`Total Profit` ~ .,
data = hund_df_simp_train,
method = "xgbTree",
trControl = ctrl,
verbosity = 0
)
saveRDS(hund_xgb,file="hund_xgb.rds")
We also save and reload the model so that later renders of this document do not waste computing power retraining it.
hund_xgb <- readRDS("hund_xgb.rds")
hund_xgb
## eXtreme Gradient Boosting
##
## 76 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 68, 69, 69, 68, 69, 68, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds RMSE Rsquared
## 0.3 1 0.6 0.50 50 190478.68 0.8100255
## 0.3 1 0.6 0.50 100 187553.71 0.8195162
## 0.3 1 0.6 0.50 150 171726.71 0.8545125
## 0.3 1 0.6 0.75 50 184838.14 0.8440144
## 0.3 1 0.6 0.75 100 177576.74 0.8521272
## 0.3 1 0.6 0.75 150 173733.64 0.8571455
## 0.3 1 0.6 1.00 50 175369.37 0.8490767
## 0.3 1 0.6 1.00 100 172545.62 0.8481900
## 0.3 1 0.6 1.00 150 170534.73 0.8516665
## 0.3 1 0.8 0.50 50 186009.78 0.8264810
## 0.3 1 0.8 0.50 100 174460.83 0.8445695
## 0.3 1 0.8 0.50 150 170033.15 0.8492663
## 0.3 1 0.8 0.75 50 177782.48 0.8449394
## 0.3 1 0.8 0.75 100 172477.32 0.8597442
## 0.3 1 0.8 0.75 150 168576.08 0.8595173
## 0.3 1 0.8 1.00 50 173509.40 0.8604187
## 0.3 1 0.8 1.00 100 167166.36 0.8669755
## 0.3 1 0.8 1.00 150 164541.48 0.8716111
## 0.3 2 0.6 0.50 50 170536.58 0.8703065
## 0.3 2 0.6 0.50 100 173605.87 0.8608106
## 0.3 2 0.6 0.50 150 176282.69 0.8566465
## 0.3 2 0.6 0.75 50 166766.88 0.8727290
## 0.3 2 0.6 0.75 100 166381.49 0.8753040
## 0.3 2 0.6 0.75 150 166881.01 0.8766310
## 0.3 2 0.6 1.00 50 168843.00 0.8583780
## 0.3 2 0.6 1.00 100 164706.14 0.8685833
## 0.3 2 0.6 1.00 150 168744.91 0.8657856
## 0.3 2 0.8 0.50 50 84669.48 0.9572917
## 0.3 2 0.8 0.50 100 81925.19 0.9622697
## 0.3 2 0.8 0.50 150 80964.88 0.9647203
## 0.3 2 0.8 0.75 50 94500.43 0.9463521
## 0.3 2 0.8 0.75 100 85772.83 0.9539263
## 0.3 2 0.8 0.75 150 81821.37 0.9583087
## 0.3 2 0.8 1.00 50 97230.75 0.9472568
## 0.3 2 0.8 1.00 100 92571.22 0.9527964
## 0.3 2 0.8 1.00 150 89884.71 0.9554306
## 0.3 3 0.6 0.50 50 168260.61 0.8534544
## 0.3 3 0.6 0.50 100 167465.12 0.8617106
## 0.3 3 0.6 0.50 150 174948.86 0.8501624
## 0.3 3 0.6 0.75 50 165474.15 0.8762212
## 0.3 3 0.6 0.75 100 166822.61 0.8672734
## 0.3 3 0.6 0.75 150 168337.88 0.8623150
## 0.3 3 0.6 1.00 50 167654.42 0.8583315
## 0.3 3 0.6 1.00 100 166276.16 0.8594649
## 0.3 3 0.6 1.00 150 166815.54 0.8591319
## 0.3 3 0.8 0.50 50 82267.08 0.9546139
## 0.3 3 0.8 0.50 100 76171.91 0.9611495
## 0.3 3 0.8 0.50 150 74489.88 0.9631513
## 0.3 3 0.8 0.75 50 86271.82 0.9566361
## 0.3 3 0.8 0.75 100 80544.08 0.9627679
## 0.3 3 0.8 0.75 150 79868.77 0.9633359
## 0.3 3 0.8 1.00 50 72903.50 0.9739812
## 0.3 3 0.8 1.00 100 70759.04 0.9757218
## 0.3 3 0.8 1.00 150 69873.59 0.9759898
## 0.4 1 0.6 0.50 50 189278.17 0.8298432
## 0.4 1 0.6 0.50 100 189154.87 0.8307545
## 0.4 1 0.6 0.50 150 183939.32 0.8499637
## 0.4 1 0.6 0.75 50 176922.76 0.8445547
## 0.4 1 0.6 0.75 100 171689.89 0.8521060
## 0.4 1 0.6 0.75 150 165568.53 0.8576681
## 0.4 1 0.6 1.00 50 180985.64 0.8549658
## 0.4 1 0.6 1.00 100 171232.12 0.8590124
## 0.4 1 0.6 1.00 150 168813.59 0.8614381
## 0.4 1 0.8 0.50 50 184733.43 0.8141596
## 0.4 1 0.8 0.50 100 174369.50 0.8367859
## 0.4 1 0.8 0.50 150 169231.63 0.8521522
## 0.4 1 0.8 0.75 50 179959.08 0.8451395
## 0.4 1 0.8 0.75 100 168762.80 0.8578743
## 0.4 1 0.8 0.75 150 167333.20 0.8566604
## 0.4 1 0.8 1.00 50 172777.04 0.8534115
## 0.4 1 0.8 1.00 100 167217.26 0.8642572
## 0.4 1 0.8 1.00 150 162980.03 0.8693081
## 0.4 2 0.6 0.50 50 174627.70 0.8802220
## 0.4 2 0.6 0.50 100 176253.86 0.8721524
## 0.4 2 0.6 0.50 150 179467.21 0.8597507
## 0.4 2 0.6 0.75 50 178845.03 0.8422412
## 0.4 2 0.6 0.75 100 168710.63 0.8660429
## 0.4 2 0.6 0.75 150 169017.64 0.8632997
## 0.4 2 0.6 1.00 50 169218.92 0.8636912
## 0.4 2 0.6 1.00 100 170564.32 0.8645496
## 0.4 2 0.6 1.00 150 171980.95 0.8606534
## 0.4 2 0.8 0.50 50 86915.74 0.9516489
## 0.4 2 0.8 0.50 100 84380.99 0.9572951
## 0.4 2 0.8 0.50 150 76555.69 0.9647028
## 0.4 2 0.8 0.75 50 93737.17 0.9529033
## 0.4 2 0.8 0.75 100 87229.44 0.9623614
## 0.4 2 0.8 0.75 150 86214.98 0.9625043
## 0.4 2 0.8 1.00 50 93857.12 0.9506581
## 0.4 2 0.8 1.00 100 91735.39 0.9531499
## 0.4 2 0.8 1.00 150 91076.04 0.9548029
## 0.4 3 0.6 0.50 50 174641.26 0.8426442
## 0.4 3 0.6 0.50 100 173266.48 0.8525706
## 0.4 3 0.6 0.50 150 176430.49 0.8500481
## 0.4 3 0.6 0.75 50 169813.90 0.8498983
## 0.4 3 0.6 0.75 100 171972.96 0.8466636
## 0.4 3 0.6 0.75 150 176768.54 0.8379962
## 0.4 3 0.6 1.00 50 158059.25 0.8740518
## 0.4 3 0.6 1.00 100 160392.18 0.8656562
## 0.4 3 0.6 1.00 150 165804.30 0.8576795
## 0.4 3 0.8 0.50 50 75249.30 0.9668407
## 0.4 3 0.8 0.50 100 69699.44 0.9746695
## 0.4 3 0.8 0.50 150 68055.32 0.9749971
## 0.4 3 0.8 0.75 50 91550.95 0.9533954
## 0.4 3 0.8 0.75 100 92169.80 0.9520725
## 0.4 3 0.8 0.75 150 92216.57 0.9517995
## 0.4 3 0.8 1.00 50 74743.63 0.9679472
## 0.4 3 0.8 1.00 100 73534.66 0.9694484
## 0.4 3 0.8 1.00 150 73262.12 0.9694769
## MAE
## 149798.81
## 151402.46
## 137091.86
## 146788.01
## 143638.93
## 139281.70
## 144248.78
## 143575.85
## 139580.54
## 152362.45
## 140237.17
## 135719.03
## 148130.75
## 141993.31
## 135397.57
## 140431.49
## 137393.62
## 132927.77
## 133835.00
## 134764.11
## 135733.70
## 131819.53
## 127877.30
## 128870.87
## 135289.57
## 128495.58
## 128866.52
## 64610.36
## 64690.80
## 62530.25
## 76657.50
## 70869.80
## 67071.86
## 75091.96
## 70113.13
## 67171.96
## 135033.77
## 132430.20
## 136511.15
## 129156.63
## 130225.63
## 131028.89
## 128764.55
## 126030.39
## 124931.83
## 63331.42
## 56612.16
## 56226.43
## 64789.68
## 60109.09
## 60102.73
## 55545.38
## 53830.05
## 53344.36
## 150537.40
## 146827.23
## 145350.37
## 145031.60
## 139634.29
## 132195.28
## 149437.61
## 140619.17
## 136859.96
## 143993.37
## 132286.61
## 126050.49
## 147593.87
## 134803.59
## 133323.39
## 143562.08
## 137157.98
## 131022.21
## 131496.45
## 140197.89
## 139050.90
## 144786.15
## 133858.56
## 130161.97
## 133828.10
## 130982.87
## 132031.07
## 69112.99
## 66960.90
## 60253.21
## 72057.48
## 65201.69
## 65158.40
## 72739.76
## 69765.62
## 68744.26
## 128654.06
## 133152.56
## 137038.63
## 129851.43
## 130135.48
## 132176.07
## 120990.47
## 119300.22
## 121762.14
## 61821.42
## 57601.50
## 56118.28
## 68301.98
## 67638.53
## 67916.97
## 58936.71
## 56943.00
## 56564.70
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
## = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
## = 0.5.
varImp(hund_xgb)
## xgbTree variable importance
##
## Overall
## `Unit Price` 100.00
## `Units Sold` 45.95
## `Unit Cost` 0.00
Even with this reduced-cost training configuration, fitting the xgboost model on the million-observation dataset took almost an hour. That is not a sustainable training time for any use case requiring frequent retraining; it might be better here to use only a portion of the training data just to save time. That said, the best fit found by cross-validation had an RMSE of $12,287.79 and an R-squared of 0.9989, which is extremely impressive compared to the previous xgboost model trained on the smaller dataset.
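A minimal sketch of the subsampling idea just mentioned (not evaluated here): train the same specification on a random subset of the training rows to cut fit time, at some cost in accuracy. The 100,000-row figure and the mil_xgb_sub name are illustrative only, not tuned or used elsewhere.
# Fit the same xgboost specification on a random 100,000-row subsample of the
# training data; reuses the 10-fold CV control object defined earlier.
mil_xgb_sub <- train(
  `Total Profit` ~ .,
  data = slice_sample(mil_df_simp_train, n = 100000),
  method = "xgbTree",
  trControl = ctrl,
  verbosity = 0
)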
mil_xgb <- train(
  `Total Profit` ~ .,
  data = mil_df_simp_train,
  method = "xgbTree",
  trControl = ctrl,
  verbosity = 0
)
saveRDS(mil_xgb, file = "mil_xgb.rds")
mil_xgb <- readRDS("mil_xgb.rds")
mil_xgb
## eXtreme Gradient Boosting
##
## 750001 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 675000, 675000, 675000, 675001, 675002, 675002, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds RMSE Rsquared
## 0.3 1 0.6 0.50 50 159553.58 0.8244027
## 0.3 1 0.6 0.50 100 154552.88 0.8339084
## 0.3 1 0.6 0.50 150 153411.62 0.8361797
## 0.3 1 0.6 0.75 50 159156.03 0.8261044
## 0.3 1 0.6 0.75 100 154487.70 0.8340351
## 0.3 1 0.6 0.75 150 153495.19 0.8360109
## 0.3 1 0.6 1.00 50 160204.42 0.8234345
## 0.3 1 0.6 1.00 100 154679.85 0.8336228
## 0.3 1 0.6 1.00 150 153568.34 0.8358731
## 0.3 1 0.8 0.50 50 157482.06 0.8292239
## 0.3 1 0.8 0.50 100 154239.82 0.8345122
## 0.3 1 0.8 0.50 150 153480.26 0.8360150
## 0.3 1 0.8 0.75 50 157505.88 0.8291765
## 0.3 1 0.8 0.75 100 154213.02 0.8345767
## 0.3 1 0.8 0.75 150 153480.17 0.8360236
## 0.3 1 0.8 1.00 50 157799.28 0.8285336
## 0.3 1 0.8 1.00 100 154219.01 0.8345593
## 0.3 1 0.8 1.00 150 153473.53 0.8360390
## 0.3 2 0.6 0.50 50 152513.98 0.8381036
## 0.3 2 0.6 0.50 100 152170.31 0.8387586
## 0.3 2 0.6 0.50 150 152127.33 0.8388498
## 0.3 2 0.6 0.75 50 152490.06 0.8381414
## 0.3 2 0.6 0.75 100 152132.51 0.8388386
## 0.3 2 0.6 0.75 150 152083.21 0.8389427
## 0.3 2 0.6 1.00 50 152422.22 0.8382928
## 0.3 2 0.6 1.00 100 152139.41 0.8388247
## 0.3 2 0.6 1.00 150 152055.62 0.8390012
## 0.3 2 0.8 0.50 50 53308.73 0.9803040
## 0.3 2 0.8 0.50 100 37753.95 0.9900718
## 0.3 2 0.8 0.50 150 29892.11 0.9937758
## 0.3 2 0.8 0.75 50 50849.93 0.9819616
## 0.3 2 0.8 0.75 100 36153.80 0.9908461
## 0.3 2 0.8 0.75 150 29430.02 0.9939225
## 0.3 2 0.8 1.00 50 54383.31 0.9793692
## 0.3 2 0.8 1.00 100 38855.72 0.9894899
## 0.3 2 0.8 1.00 150 31627.53 0.9930357
## 0.3 3 0.6 0.50 50 152007.86 0.8391032
## 0.3 3 0.6 0.50 100 152013.81 0.8390892
## 0.3 3 0.6 0.50 150 152024.91 0.8390667
## 0.3 3 0.6 0.75 50 152021.37 0.8390789
## 0.3 3 0.6 0.75 100 152012.29 0.8390922
## 0.3 3 0.6 0.75 150 152020.77 0.8390744
## 0.3 3 0.6 1.00 50 152039.08 0.8390570
## 0.3 3 0.6 1.00 100 151991.44 0.8391366
## 0.3 3 0.6 1.00 150 151984.84 0.8391508
## 0.3 3 0.8 0.50 50 23371.86 0.9961925
## 0.3 3 0.8 0.50 100 15141.27 0.9983997
## 0.3 3 0.8 0.50 150 12296.15 0.9989398
## 0.3 3 0.8 0.75 50 25178.25 0.9955550
## 0.3 3 0.8 0.75 100 15924.57 0.9982215
## 0.3 3 0.8 0.75 150 12691.04 0.9988714
## 0.3 3 0.8 1.00 50 24715.21 0.9957201
## 0.3 3 0.8 1.00 100 15801.85 0.9982527
## 0.3 3 0.8 1.00 150 12637.03 0.9988828
## 0.4 1 0.6 0.50 50 157237.45 0.8284182
## 0.4 1 0.6 0.50 100 154371.22 0.8341327
## 0.4 1 0.6 0.50 150 153614.12 0.8357044
## 0.4 1 0.6 0.75 50 157032.74 0.8288501
## 0.4 1 0.6 0.75 100 154339.83 0.8341985
## 0.4 1 0.6 0.75 150 153631.55 0.8356626
## 0.4 1 0.6 1.00 50 157536.76 0.8277530
## 0.4 1 0.6 1.00 100 154312.42 0.8342599
## 0.4 1 0.6 1.00 150 153576.23 0.8357854
## 0.4 1 0.8 0.50 50 156453.39 0.8298948
## 0.4 1 0.8 0.50 100 154356.67 0.8341305
## 0.4 1 0.8 0.50 150 153610.48 0.8357106
## 0.4 1 0.8 0.75 50 156357.27 0.8300337
## 0.4 1 0.8 0.75 100 154339.29 0.8341936
## 0.4 1 0.8 0.75 150 153596.56 0.8357496
## 0.4 1 0.8 1.00 50 156306.18 0.8301741
## 0.4 1 0.8 1.00 100 154277.49 0.8343451
## 0.4 1 0.8 1.00 150 153546.23 0.8358597
## 0.4 2 0.6 0.50 50 152494.53 0.8380741
## 0.4 2 0.6 0.50 100 152207.73 0.8386790
## 0.4 2 0.6 0.50 150 152143.49 0.8388149
## 0.4 2 0.6 0.75 50 152414.30 0.8382492
## 0.4 2 0.6 0.75 100 152142.66 0.8388164
## 0.4 2 0.6 0.75 150 152071.89 0.8389670
## 0.4 2 0.6 1.00 50 152321.20 0.8384497
## 0.4 2 0.6 1.00 100 152113.66 0.8388786
## 0.4 2 0.6 1.00 150 152036.62 0.8390415
## 0.4 2 0.8 0.50 50 51307.15 0.9815509
## 0.4 2 0.8 0.50 100 35564.30 0.9911626
## 0.4 2 0.8 0.50 150 29069.84 0.9941052
## 0.4 2 0.8 0.75 50 49212.33 0.9830793
## 0.4 2 0.8 0.75 100 34361.32 0.9917584
## 0.4 2 0.8 0.75 150 28055.20 0.9945049
## 0.4 2 0.8 1.00 50 49824.02 0.9826677
## 0.4 2 0.8 1.00 100 35422.38 0.9911832
## 0.4 2 0.8 1.00 150 28953.84 0.9941019
## 0.4 3 0.6 0.50 50 152074.19 0.8389611
## 0.4 3 0.6 0.50 100 152056.69 0.8389985
## 0.4 3 0.6 0.50 150 152051.67 0.8390093
## 0.4 3 0.6 0.75 50 152053.64 0.8390052
## 0.4 3 0.6 0.75 100 152028.96 0.8390572
## 0.4 3 0.6 0.75 150 152035.94 0.8390431
## 0.4 3 0.6 1.00 50 152048.94 0.8390156
## 0.4 3 0.6 1.00 100 152001.29 0.8391160
## 0.4 3 0.6 1.00 150 151993.63 0.8391322
## 0.4 3 0.8 0.50 50 23163.53 0.9962581
## 0.4 3 0.8 0.50 100 15470.04 0.9983260
## 0.4 3 0.8 0.50 150 12334.39 0.9989355
## 0.4 3 0.8 0.75 50 23056.21 0.9962942
## 0.4 3 0.8 0.75 100 15716.43 0.9982675
## 0.4 3 0.8 0.75 150 12287.79 0.9989383
## 0.4 3 0.8 1.00 50 22657.05 0.9963823
## 0.4 3 0.8 1.00 100 15953.28 0.9982170
## 0.4 3 0.8 1.00 150 12996.92 0.9988163
## MAE
## 117919.669
## 112664.033
## 111009.578
## 117173.306
## 112876.317
## 111111.264
## 118738.136
## 113053.769
## 111259.803
## 115289.119
## 112661.844
## 111296.683
## 115319.222
## 112655.615
## 111322.467
## 115683.191
## 112596.565
## 111294.167
## 108311.453
## 108008.549
## 107933.803
## 108517.213
## 107914.470
## 107746.407
## 108140.391
## 107911.557
## 107637.340
## 40889.187
## 29179.004
## 23000.667
## 39160.566
## 27685.337
## 22625.927
## 41279.109
## 29421.594
## 23863.117
## 107493.925
## 107551.089
## 107546.062
## 107405.869
## 107514.831
## 107522.433
## 107292.734
## 107481.965
## 107448.135
## 18338.768
## 11598.451
## 9252.593
## 19615.376
## 12302.204
## 9628.602
## 19465.103
## 12319.778
## 9683.377
## 116983.014
## 112911.191
## 111291.097
## 116466.045
## 112865.295
## 111370.952
## 117264.760
## 112752.601
## 111194.755
## 116516.248
## 113049.678
## 111265.683
## 116545.027
## 112971.162
## 111163.446
## 116513.809
## 112837.246
## 111112.066
## 108613.860
## 108037.881
## 107942.917
## 108525.794
## 107922.723
## 107733.030
## 108237.557
## 107771.426
## 107559.027
## 40206.518
## 27851.660
## 22730.379
## 37833.557
## 26506.625
## 21683.626
## 38317.510
## 27407.961
## 22455.460
## 107604.258
## 107619.527
## 107612.020
## 107602.120
## 107580.791
## 107553.140
## 107532.066
## 107461.616
## 107420.658
## 18092.348
## 11705.510
## 9143.696
## 17942.834
## 11839.543
## 9151.708
## 17663.277
## 12038.833
## 9643.420
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
## = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
## = 0.75.
varImp(mil_xgb)
## xgbTree variable importance
##
## Overall
## `Unit Price` 100.00
## `Units Sold` 89.87
## `Unit Cost` 0.00
Finally, we determine which of the models we have created works best on our test data.
cat(
  "The hundred observation dataset with a linear model gives us an RMSE of:",
  hund_lm |>
    predict(hund_df_simp_test) |>
    RMSE(hund_df_simp_test$`Total Profit`),
  "\nThe hundred observation dataset with an xgboost model gives us an RMSE of:",
  hund_xgb |>
    predict(hund_df_simp_test) |>
    RMSE(hund_df_simp_test$`Total Profit`),
  "\nThe million observation dataset with a linear model gives us an RMSE of:",
  mil_lm |>
    predict(mil_df_simp_test) |>
    RMSE(mil_df_simp_test$`Total Profit`),
  "\nThe million observation dataset with an xgboost model gives us an RMSE of:",
  mil_xgb |>
    predict(mil_df_simp_test) |>
    RMSE(mil_df_simp_test$`Total Profit`)
)
## The hundred observation dataset with a linear model gives us an RMSE of: 5.053595e-10
## The hundred observation dataset with an xgboost model gives us an RMSE of: 60215.74
## The million observation dataset with a linear model gives us an RMSE of: 3.446863e-08
## The million observation dataset with an xgboost model gives us an RMSE of: 12275.47
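For a more compact comparison, caret's postResample() reports test-set RMSE, R-squared, and MAE together. A minimal sketch (assuming the four fitted models and test sets from above):
# One call per model: returns a named vector of RMSE, Rsquared, and MAE
# computed on the held-out test data.
postResample(predict(hund_lm, hund_df_simp_test), hund_df_simp_test$`Total Profit`)
postResample(predict(hund_xgb, hund_df_simp_test), hund_df_simp_test$`Total Profit`)
postResample(predict(mil_lm, mil_df_simp_test), mil_df_simp_test$`Total Profit`)
postResample(predict(mil_xgb, mil_df_simp_test), mil_df_simp_test$`Total Profit`)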
Comparing our predictions on the test data by RMSE, the lowest error comes from the linear model trained on the hundred-observation dataset. Normally we would expect the best model for predicting the total profit of a sales order to be the one trained on more data. Yet, because we know the true equation for total profit, we can see that the million-observation linear model is in fact slightly further from that true equation than the hundred-observation one, which is why the RMSE is lower for the model trained on less data. This goes to show that more data does not always equate to a more accurate model. We likely hit the ceiling on how much data is useful for building a good linear model very early here, especially considering there are only four variables whose relationship it needs to represent properly.
In most cases, however, more data does lead to a more accurate model. That effect is observed very strongly in the xgboost models, where test RMSE improves roughly five-fold with the larger dataset. Xgboost, and boosted tree models in general, are better at avoiding bias than a linear model but require more data to get a good result, which is why using the million-observation dataset was the better idea here. Still, we must consider whether the time taken to train on all that data (almost an hour) was worth the better result. There is likely a sweet spot between a hundred observations and a million that would maximize model accuracy relative to time spent fitting.
**Overall, for this dataset I would trust the linear model generated from the hundred-observation dataset.** The main reason is that when the quantity we are predicting has a direct arithmetic relationship to the predictors, a linear model can fit that relationship exactly. However, you must be able to construct the proper relationship between the factors yourself and determine which factors actually contribute rather than merely being multicollinear, and real-world modeling is never this easy. That is why examining correlations and having domain knowledge are so important when building linear models. Xgboost, by contrast, gives us a good predictive model when we do not know which variables matter or when there is no direct arithmetic relationship between them.
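To illustrate the point about constructing the relationship yourself, a minimal sketch (assuming hund_df_simp_train from above; profit_lm is a hypothetical name): engineering the single feature Units Sold * (Unit Price - Unit Cost) lets a one-term linear model recover the true equation, with a slope of 1 and an intercept of 0.
# Fit profit against the hand-constructed feature; the I() wrapper keeps the
# arithmetic inside the formula rather than being parsed as model operators.
profit_lm <- lm(
  `Total Profit` ~ I(`Units Sold` * (`Unit Price` - `Unit Cost`)),
  data = hund_df_simp_train
)
coef(profit_lm)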