1. Table of Contents


This document presents a non-exhaustive list of data quality assessment, preprocessing and exploration methods for a classification modelling problem, using several R packages.

1.1 Sample Data


The schedulingData dataset from the AppliedPredictiveModeling package is used for this illustrative example.

Preliminary dataset assessment:

[A] 4331 rows (observations)

[B] 8 columns (variables)
     [B.1] 1/8 response = Class variable (factor)
     [B.2] 7/8 predictors = All remaining variables (2/7 factor + 5/7 numeric)

## [1] 4331    8
## 'data.frame':    4331 obs. of  8 variables:
##  $ Protocol   : Factor w/ 14 levels "A","C","D","E",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Compounds  : num  997 97 101 93 100 100 105 98 101 95 ...
##  $ InputFields: num  137 103 75 76 82 82 88 95 91 92 ...
##  $ Iterations : num  20 20 10 20 20 20 20 20 20 20 ...
##  $ NumPending : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hour       : num  14 13.8 13.8 10.1 10.4 ...
##  $ Day        : Factor w/ 7 levels "Mon","Tue","Wed",..: 2 2 4 5 5 3 5 5 5 3 ...
##  $ Class      : Factor w/ 4 levels "VF","F","M","L": 2 1 1 1 1 1 1 1 1 1 ...
##     Protocol      Compounds        InputFields      Iterations    
##  J      : 989   Min.   :   20.0   Min.   :   10   Min.   : 10.00  
##  O      : 581   1st Qu.:   98.0   1st Qu.:  134   1st Qu.: 20.00  
##  N      : 536   Median :  226.0   Median :  426   Median : 20.00  
##  M      : 451   Mean   :  497.7   Mean   : 1537   Mean   : 29.24  
##  I      : 381   3rd Qu.:  448.0   3rd Qu.:  991   3rd Qu.: 20.00  
##  H      : 321   Max.   :14103.0   Max.   :56671   Max.   :200.00  
##  (Other):1072                                                     
##    NumPending           Hour           Day      Class    
##  Min.   :   0.00   Min.   : 0.01667   Mon:692   VF:2211  
##  1st Qu.:   0.00   1st Qu.:10.90000   Tue:900   F :1347  
##  Median :   0.00   Median :14.01667   Wed:903   M : 514  
##  Mean   :  53.39   Mean   :13.73376   Thu:720   L : 259  
##  3rd Qu.:   0.00   3rd Qu.:16.60000   Fri:923            
##  Max.   :5605.00   Max.   :23.98333   Sat: 32            
##                                       Sun:161
##   Column.Index Column.Name Column.Type
## 1            1    Protocol      factor
## 2            2   Compounds     numeric
## 3            3 InputFields     numeric
## 4            4  Iterations     numeric
## 5            5  NumPending     numeric
## 6            6        Hour     numeric
## 7            7         Day      factor
## 8            8       Class      factor
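
The preliminary assessment above could be reproduced along the following lines. This is a minimal sketch rather than the report's original code: it assumes the AppliedPredictiveModeling package is installed, falls back to a small made-up data frame otherwise, and the name Column.Summary is illustrative.

```r
# Load the real data when the package is available; otherwise use a tiny
# made-up stand-in so the sketch still runs (values are illustrative only).
if (requireNamespace("AppliedPredictiveModeling", quietly = TRUE)) {
  data(schedulingData, package = "AppliedPredictiveModeling")
  DPA <- schedulingData
} else {
  DPA <- data.frame(
    Protocol  = factor(c("A", "C", "A")),
    Compounds = c(997, 97, 101),
    Class     = factor(c("F", "VF", "VF"))
  )
}

dim(DPA)      # number of rows and columns
str(DPA)      # column types and first values
summary(DPA)  # per-column descriptive statistics

# One row per column: position, name and (first) class
Column.Summary <- data.frame(
  Column.Index = seq_along(DPA),
  Column.Name  = names(DPA),
  Column.Type  = vapply(DPA, function(x) class(x)[1], character(1)),
  row.names    = NULL
)
Column.Summary
```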

1.2 Data Quality Assessment


Data quality assessment:

[A] No missing observations noted for any variable.

[B] Low variance observed for 2 variables with First.Second.Mode.Ratio>5.
     [B.1] Iterations variable (numeric)
     [B.2] NumPending variable (numeric)

[C] Low variance observed for 1 variable with Unique.Count.Ratio<0.01.
     [C.1] Iterations variable (numeric)

[D] High skewness observed for 4 variables with Skewness>3 or Skewness<(-3).
     [D.1] Compounds variable (numeric)
     [D.2] InputFields variable (numeric)
     [D.3] Iterations variable (numeric)
     [D.4] NumPending variable (numeric)
##   Column.Index Column.Name Column.Type Row.Count NA.Count Fill.Rate
## 1            1    Protocol      factor      4331        0     1.000
## 2            2   Compounds     numeric      4331        0     1.000
## 3            3 InputFields     numeric      4331        0     1.000
## 4            4  Iterations     numeric      4331        0     1.000
## 5            5  NumPending     numeric      4331        0     1.000
## 6            6        Hour     numeric      4331        0     1.000
## 7            7         Day      factor      4331        0     1.000
## 8            8       Class      factor      4331        0     1.000
## [1] "There are 5 numeric predictor variable(s)."
## [1] "There are 2 factor predictor variable(s)."
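
The fill-rate table and the predictor counts above could be derived roughly as follows. This is a sketch on a made-up data frame with one missing value; the object names (toy, DQA.Summary, predictors.numeric, predictors.factor) are illustrative and not taken from the report's code.

```r
# Small made-up data frame with one missing value (illustrative only)
toy <- data.frame(
  Day   = factor(c("Mon", "Tue", NA, "Mon")),
  Hour  = c(14.0, 13.8, 10.1, 10.4),
  Class = factor(c("F", "VF", "VF", "VF"))
)

# Per-column row count, missing count and fill rate
DQA.Summary <- data.frame(
  Column.Name = names(toy),
  Row.Count   = nrow(toy),
  NA.Count    = vapply(toy, function(x) sum(is.na(x)), integer(1)),
  row.names   = NULL
)
DQA.Summary$Fill.Rate <- round(1 - DQA.Summary$NA.Count / DQA.Summary$Row.Count, 3)
DQA.Summary

# Split the predictors (everything except the response) by type
predictors <- toy[, names(toy) != "Class", drop = FALSE]
predictors.numeric <- predictors[, vapply(predictors, is.numeric, logical(1)), drop = FALSE]
predictors.factor  <- predictors[, vapply(predictors, is.factor,  logical(1)), drop = FALSE]

print(paste0("There are ", ncol(predictors.numeric), " numeric predictor variable(s)."))
print(paste0("There are ", ncol(predictors.factor), " factor predictor variable(s)."))
```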
##################################
# Formulating a data quality assessment summary for factor predictors
##################################
if (length(names(DQA.Predictors.Factor))>0) {
  
  ##################################
  # Formulating a function to determine the first mode
  ##################################
  FirstModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    ux[tab == max(tab)]
  }

  ##################################
  # Formulating a function to determine the second mode
  ##################################
  SecondModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    fm = ux[tab == max(tab)]
    sm = na.omit(x)[!(na.omit(x) %in% fm)]
    usm <- unique(sm)
    tabsm <- tabulate(match(sm, usm))
    usm[tabsm == max(tabsm)]
  }
  
  (DQA.Predictors.Factor.Summary <- data.frame(
  Column.Name= names(DQA.Predictors.Factor), 
  Column.Type=sapply(DQA.Predictors.Factor, function(x) class(x)), 
  Unique.Count=sapply(DQA.Predictors.Factor, function(x) length(unique(x))),
  First.Mode.Value=sapply(DQA.Predictors.Factor, function(x) as.character(FirstModes(x)[1])),
  Second.Mode.Value=sapply(DQA.Predictors.Factor, function(x) as.character(SecondModes(x)[1])),
  First.Mode.Count=sapply(DQA.Predictors.Factor, function(x) sum(na.omit(x) == FirstModes(x)[1])),
  Second.Mode.Count=sapply(DQA.Predictors.Factor, function(x) sum(na.omit(x) == SecondModes(x)[1])),
  Unique.Count.Ratio=sapply(DQA.Predictors.Factor, function(x) format(round((length(unique(x))/nrow(DQA.Predictors.Factor)),3), nsmall=3)),
  First.Second.Mode.Ratio=sapply(DQA.Predictors.Factor, function(x) format(round((sum(na.omit(x) == FirstModes(x)[1])/sum(na.omit(x) == SecondModes(x)[1])),3), nsmall=3)),
  row.names=NULL)
  )
  
} 
##   Column.Name Column.Type Unique.Count First.Mode.Value Second.Mode.Value
## 1    Protocol      factor           14                J                 O
## 2         Day      factor            7              Fri               Wed
##   First.Mode.Count Second.Mode.Count Unique.Count.Ratio First.Second.Mode.Ratio
## 1              989               581              0.003                   1.702
## 2              923               903              0.002                   1.022
##################################
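# Formulating a data quality assessment summary for numeric predictors
# (skewness() and kurtosis() are not base R functions; a package that
# provides them, such as moments or e1071, must be loaded beforehand)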
##################################
if (length(names(DQA.Predictors.Numeric))>0) {
  
  ##################################
  # Formulating a function to determine the first mode
  ##################################
  FirstModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    ux[tab == max(tab)]
  }

  ##################################
  # Formulating a function to determine the second mode
  ##################################
  SecondModes <- function(x) {
    ux <- unique(na.omit(x))
    tab <- tabulate(match(x, ux))
    fm = ux[tab == max(tab)]
    sm = na.omit(x)[!(na.omit(x) %in% fm)]
    usm <- unique(sm)
    tabsm <- tabulate(match(sm, usm))
    usm[tabsm == max(tabsm)]
  }
  
  (DQA.Predictors.Numeric.Summary <- data.frame(
  Column.Name= names(DQA.Predictors.Numeric), 
  Column.Type=sapply(DQA.Predictors.Numeric, function(x) class(x)), 
  Unique.Count=sapply(DQA.Predictors.Numeric, function(x) length(unique(x))),
  Unique.Count.Ratio=sapply(DQA.Predictors.Numeric, function(x) format(round((length(unique(x))/nrow(DQA.Predictors.Numeric)),3), nsmall=3)),
  First.Mode.Value=sapply(DQA.Predictors.Numeric, function(x) format(round((FirstModes(x)[1]),3),nsmall=3)),
  Second.Mode.Value=sapply(DQA.Predictors.Numeric, function(x) format(round((SecondModes(x)[1]),3),nsmall=3)),
  First.Mode.Count=sapply(DQA.Predictors.Numeric, function(x) sum(na.omit(x) == FirstModes(x)[1])),
  Second.Mode.Count=sapply(DQA.Predictors.Numeric, function(x) sum(na.omit(x) == SecondModes(x)[1])),
  First.Second.Mode.Ratio=sapply(DQA.Predictors.Numeric, function(x) format(round((sum(na.omit(x) == FirstModes(x)[1])/sum(na.omit(x) == SecondModes(x)[1])),3), nsmall=3)),
  Minimum=sapply(DQA.Predictors.Numeric, function(x) format(round(min(x,na.rm = TRUE),3), nsmall=3)),
  Mean=sapply(DQA.Predictors.Numeric, function(x) format(round(mean(x,na.rm = TRUE),3), nsmall=3)),
  Median=sapply(DQA.Predictors.Numeric, function(x) format(round(median(x,na.rm = TRUE),3), nsmall=3)),
  Maximum=sapply(DQA.Predictors.Numeric, function(x) format(round(max(x,na.rm = TRUE),3), nsmall=3)),
  Skewness=sapply(DQA.Predictors.Numeric, function(x) format(round(skewness(x,na.rm = TRUE),3), nsmall=3)),
  Kurtosis=sapply(DQA.Predictors.Numeric, function(x) format(round(kurtosis(x,na.rm = TRUE),3), nsmall=3)),
  Percentile25th=sapply(DQA.Predictors.Numeric, function(x) format(round(quantile(x,probs=0.25,na.rm = TRUE),3), nsmall=3)),
  Percentile75th=sapply(DQA.Predictors.Numeric, function(x) format(round(quantile(x,probs=0.75,na.rm = TRUE),3), nsmall=3)),
  row.names=NULL)
  )  
  
}
##   Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 1   Compounds     numeric          858              0.198           20.000
## 2 InputFields     numeric         1730              0.399           10.000
## 3  Iterations     numeric           11              0.003           20.000
## 4  NumPending     numeric          303              0.070            0.000
## 5        Hour     numeric          924              0.213           13.083
##   Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 1            31.000               96                29                   3.310
## 2           466.000               82                27                   3.037
## 3            10.000             3568               272                  13.118
## 4             1.000             3275               165                  19.848
## 5            21.067               28                25                   1.120
##   Minimum     Mean  Median   Maximum Skewness Kurtosis Percentile25th
## 1  20.000  497.742 226.000 14103.000    6.568   69.486         98.000
## 2  10.000 1537.055 426.000 56671.000    5.870   54.919        134.000
## 3  10.000   29.244  20.000   200.000    3.937   18.510         20.000
## 4   0.000   53.389   0.000  5605.000    9.718  105.594          0.000
## 5   0.017   13.734  14.017    23.983   -0.546    3.747         10.900
##   Percentile75th
## 1        448.000
## 2        991.000
## 3         20.000
## 4          0.000
## 5         16.600
## [1] "No missing observations noted."
## [1] "No low variance factor predictors due to high first-second mode ratio noted."
## [1] "Low variance observed for 2 numeric variable(s) with First.Second.Mode.Ratio>5."
##   Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 3  Iterations     numeric           11              0.003           20.000
## 4  NumPending     numeric          303              0.070            0.000
##   Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 3            10.000             3568               272                  13.118
## 4             1.000             3275               165                  19.848
##   Minimum   Mean Median  Maximum Skewness Kurtosis Percentile25th
## 3  10.000 29.244 20.000  200.000    3.937   18.510         20.000
## 4   0.000 53.389  0.000 5605.000    9.718  105.594          0.000
##   Percentile75th
## 3         20.000
## 4          0.000
## [1] "Low variance observed for 1 numeric variable(s) with Unique.Count.Ratio<0.01."
##   Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 3  Iterations     numeric           11              0.003           20.000
##   Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 3            10.000             3568               272                  13.118
##   Minimum   Mean Median Maximum Skewness Kurtosis Percentile25th Percentile75th
## 3  10.000 29.244 20.000 200.000    3.937   18.510         20.000         20.000
## [1] "High skewness observed for 4 numeric variable(s) with Skewness>3 or Skewness<(-3)."
##   Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 1   Compounds     numeric          858              0.198           20.000
## 2 InputFields     numeric         1730              0.399           10.000
## 3  Iterations     numeric           11              0.003           20.000
## 4  NumPending     numeric          303              0.070            0.000
##   Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 1            31.000               96                29                   3.310
## 2           466.000               82                27                   3.037
## 3            10.000             3568               272                  13.118
## 4             1.000             3275               165                  19.848
##   Minimum     Mean  Median   Maximum Skewness Kurtosis Percentile25th
## 1  20.000  497.742 226.000 14103.000    6.568   69.486         98.000
## 2  10.000 1537.055 426.000 56671.000    5.870   54.919        134.000
## 3  10.000   29.244  20.000   200.000    3.937   18.510         20.000
## 4   0.000   53.389   0.000  5605.000    9.718  105.594          0.000
##   Percentile75th
## 1        448.000
## 2        991.000
## 3         20.000
## 4          0.000
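
The three flagging rules used above (First.Second.Mode.Ratio>5, Unique.Count.Ratio<0.01, and Skewness>3 or Skewness<(-3)) can be sketched as simple filters. The values below are copied from the numeric summary table in this section; the data frame name dqa and the result names are illustrative.

```r
# Values copied from the numeric predictor summary shown above
dqa <- data.frame(
  Column.Name             = c("Compounds", "InputFields", "Iterations", "NumPending", "Hour"),
  Unique.Count.Ratio      = c(0.198, 0.399, 0.003, 0.070, 0.213),
  First.Second.Mode.Ratio = c(3.310, 3.037, 13.118, 19.848, 1.120),
  Skewness                = c(6.568, 5.870, 3.937, 9.718, -0.546)
)

# Low variance: the most frequent value dominates the second most frequent
high_mode_ratio <- dqa$Column.Name[dqa$First.Second.Mode.Ratio > 5]

# Low variance: very few distinct values relative to the number of rows
low_unique <- dqa$Column.Name[dqa$Unique.Count.Ratio < 0.01]

# High skewness in either direction
high_skew <- dqa$Column.Name[abs(dqa$Skewness) > 3]

high_mode_ratio
low_unique
high_skew
```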

1.3 Data Preprocessing

1.3.1 Missing Data Imputation


Missing data assessment:

[A] 100% fill rate with no missing data identified from the previous data quality assessment.

[B] 100% fill rate with no missing data confirmed using a descriptive statistics summary.

[C] The caret package allows three imputation methods:
     [C.1] The knnImpute method fills a missing value using the k closest samples (Euclidean distance) in the training set.
     [C.2] The bagImpute method fits a bagged tree model for each predictor (as a function of all the others).
     [C.3] The medianImpute method takes the median of each predictor in the training set and uses it to fill missing values.
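
Since the dataset has no missing values, imputation is not exercised in this report. As a hedged illustration of the median approach only, the sketch below uses a made-up data frame; median_impute is a hypothetical helper, and the caret call is run only if the package is installed.

```r
# Made-up numeric predictors with missing values (illustrative only)
toy <- data.frame(
  x1 = c(1, 2, NA, 4, 100),
  x2 = c(10, NA, 30, 40, 50)
)

# Median imputation by hand: replace each NA with the column median
median_impute <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
toy.imputed <- as.data.frame(lapply(toy, median_impute))
toy.imputed

# With caret installed, preProcess() estimates the medians on one data set
# (normally the training set) and predict() applies them to new data.
if (requireNamespace("caret", quietly = TRUE)) {
  pp <- caret::preProcess(toy, method = "medianImpute")
  toy.caret <- predict(pp, toy)
}
```

In practice the medians would be estimated on the training set and then applied to both training and test data.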
Data summary
Name DPA
Number of rows 4331
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Protocol 0 1 FALSE 14 J: 989, O: 581, N: 536, M: 451
Day 0 1 FALSE 7 Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class 0 1 FALSE 4 VF: 2211, F: 1347, M: 514, L: 259

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
## # A tibble: 0 x 15
## # ... with 15 variables: skim_type <chr>, skim_variable <chr>, n_missing <int>,
## #   complete_rate <dbl>, factor.ordered <lgl>, factor.n_unique <int>,
## #   factor.top_counts <chr>, numeric.mean <dbl>, numeric.sd <dbl>,
## #   numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## #   numeric.p100 <dbl>, numeric.hist <chr>

1.3.2 Outlier Treatment


Outlier data assessment:

[A] Outliers noted for 5 variables. Outlier treatment for numerical stability remains optional depending on potential model requirements for the subsequent steps.

[B] Numeric data can be visualized through a boxplot including observations classified as suspected outliers using the IQR criterion. The IQR criterion means that all observations above the (75th percentile + 1.5 x IQR) or below the (25th percentile - 1.5 x IQR) are suspected outliers, where IQR is the difference between the third quartile (75th percentile) and first quartile (25th percentile).

[C] The caret package includes one method for outlier treatment:
     [C.1] The spatialSign method from the caret package projects each observation onto the unit sphere in p dimensions by dividing its vector of predictor values by its norm, where p is the number of predictors.

[D] The spatialSign method was applied on the dataset:
     [D.1] While the data distribution generally improved and the number of outliers was reduced, 4 variables are still noted with outliers using the IQR criterion.
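
Both ideas can be sketched in base R. The helpers iqr_outliers and spatial_sign below are hypothetical, written for illustration on made-up data; centering and scaling before the projection is an assumption suggested by the object name DPA_CenteredScaledSpatial… in the output that follows.

```r
# Flag suspected outliers with the IQR criterion
iqr_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < (q[1] - 1.5 * iqr) | x > (q[2] + 1.5 * iqr)
}

# Spatial sign by hand: center and scale each column, then divide every
# row (observation) by its Euclidean norm so it lies on the unit sphere.
# Rows that are exactly zero after scaling would give NaN.
spatial_sign <- function(x) {
  z <- scale(as.matrix(x))
  z / sqrt(rowSums(z^2))
}

# Made-up data (illustrative only)
a <- c(1, 2, 3, 4, 5, 100)
m <- cbind(a = a, b = c(2, 1, 4, 3, 6, 5))

which(iqr_outliers(a))   # position of the suspected outlier in a
ss <- spatial_sign(m)
sqrt(rowSums(ss^2))      # every row now has norm 1

# caret::spatialSign(scale(m)) is the packaged counterpart of spatial_sign(m)
```

Because every observation is rescaled to unit length, all transformed values fall within [-1, 1], which matches the p0 and p100 columns of the transformed summary.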

## [1] "5 numeric variable(s) were noted with outlier(s)."
Data summary
Name DPA.Predictors.Numeric
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
Data summary
Name DPA_CenteredScaledSpatial…
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 -0.14 0.37 -0.82 -0.39 -0.21 -0.04 1.00 ▂▇▃▁▁
InputFields 0 1 -0.16 0.41 -0.81 -0.41 -0.25 -0.07 1.00 ▃▇▂▁▂
Iterations 0 1 -0.18 0.37 -0.92 -0.38 -0.25 -0.14 1.00 ▁▇▂▁▁
NumPending 0 1 -0.09 0.21 -0.46 -0.18 -0.13 -0.06 1.00 ▃▇▁▁▁
Hour 0 1 0.04 0.65 -0.99 -0.63 0.08 0.70 0.99 ▇▃▅▃▇

## [1] "4 numeric variable(s) were noted with outlier(s)."

1.3.3 Zero and Near-Zero Variance


Zero and near-zero variance data assessment:

[A] Low variance noted from the previous data quality assessment for 2 variables using First.Second.Mode.Ratio and for 1 variable using Unique.Count.Ratio.

[B] Low variance for 2 variables confirmed using a preprocessing summary from the caret package.

[C] The nearZeroVar method from the caret package uses two criteria for detecting low variance variables:
     [C.1] The freqCut criterion, with default setting at 95/5, is a cutoff on the ratio of the frequency of the most prevalent value to that of the second most frequent value (called the "frequency ratio"), which would be near one for well-behaved predictors and very large for highly unbalanced data.
     [C.2] The uniqueCut criterion, with default setting at 10, is a cutoff on the percent of unique values, that is, the number of unique values divided by the total number of samples (times 100), which approaches zero as the granularity of the data increases.

[D] The nearZeroVar method with the freqCut and uniqueCut criteria set at 80/20 and 10, respectively, was applied on the dataset:
     [D.1] 2 variables may be optionally removed from the dataset for the subsequent analysis.
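
The two quantities behind nearZeroVar can be computed by hand. The helper nzv_metrics below is a hypothetical base R sketch on a made-up vector; it assumes at least two distinct values and flags a variable when the frequency ratio exceeds freqCut and the percent of unique values does not exceed uniqueCut, which is how the rule is understood here.

```r
# Frequency ratio and percent of unique values for one variable
nzv_metrics <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  freqRatio <- as.numeric(counts[1] / counts[2])       # most vs. second most frequent
  percentUnique <- 100 * length(counts) / length(x)    # distinct values per 100 rows
  data.frame(
    freqRatio     = freqRatio,
    percentUnique = percentUnique,
    nzv           = freqRatio > freqCut & percentUnique <= uniqueCut
  )
}

# Made-up vector: 90 zeros, 5 ones and 5 singletons (illustrative only)
x <- c(rep(0, 90), rep(1, 5), 2:6)

default_cut <- nzv_metrics(x)                     # freqCut = 95/5 = 19
relaxed_cut <- nzv_metrics(x, freqCut = 80 / 20)  # freqCut = 80/20 = 4
default_cut
relaxed_cut
```

The same arithmetic applied to the counts reported earlier for Iterations (3568 and 272 occurrences of the first and second modes, 11 unique values in 4331 rows) gives 3568/272 ≈ 13.118 and 100 × 11/4331 ≈ 0.254, in line with the freqRatio and percentUnique values in the output below.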
Data summary
Name DPA
Number of rows 4331
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Protocol 0 1 FALSE 14 J: 989, O: 581, N: 536, M: 451
Day 0 1 FALSE 7 Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class 0 1 FALSE 4 VF: 2211, F: 1347, M: 514, L: 259

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
##            freqRatio percentUnique zeroVar  nzv
## Iterations  13.11765     0.2539829   FALSE TRUE
## NumPending  19.84848     6.9960748   FALSE TRUE
## [1] "Low variance observed for 2 numeric variable(s) with First.Second.Mode.Ratio>4 and Unique.Count.Ratio<0.10."
## [1] "Low variance can be resolved by removing 2 numeric variable(s)."
## [1] "Variable 1 for removal: Iterations"
## [1] "Variable 2 for removal: NumPending"
Data summary
Name DPA_ExcludedLowVariance
Number of rows 4331
Number of columns 6
_______________________
Column type frequency:
factor 3
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Protocol 0 1 FALSE 14 J: 989, O: 581, N: 536, M: 451
Day 0 1 FALSE 7 Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class 0 1 FALSE 4 VF: 2211, F: 1347, M: 514, L: 259

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁

1.3.4 Collinearity


High collinearity data assessment:

[A] No high correlation noted for any variable pair confirmed using the preprocessing summaries from the caret and lares packages.

[B] The caret and lares packages include methods for detecting highly correlated variables:
     [B.1] The findCorrelation method using the cutoff criteria with default setting at 0.90 from the caret package searches through a correlation matrix and returns a vector of integers corresponding to columns to remove to reduce pair-wise correlations.
     [B.2] The corr_cross method using the top criteria from the lares package lists out and ranks the variable-pairs with the highest correlation coefficients which were statistically significant.

[C] The findCorrelation method using the cutoff criteria set at 0.95 and the corr_cross method using the top criteria set at 10 were applied on the dataset:
     [C.1] No highly correlated variables were detected, so none need to be removed from the dataset for the subsequent analysis.
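
The pairwise screening idea can be sketched in base R. The helper find_high_pairs is hypothetical and only lists variable pairs whose absolute correlation exceeds a cutoff; it does not reproduce the column-selection logic of findCorrelation. The data are made up.

```r
# List variable pairs whose absolute correlation exceeds the cutoff
find_high_pairs <- function(x, cutoff = 0.90) {
  r <- cor(x)
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    row.names = NULL
  )
}

# Made-up predictors: b is almost a multiple of a, c is unrelated
toy <- data.frame(
  a = 1:10,
  b = 2 * (1:10) + rep(c(0.1, -0.1), 5),
  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

high_pairs <- find_high_pairs(toy, cutoff = 0.95)
high_pairs

# With caret installed, caret::findCorrelation(cor(toy), cutoff = 0.95)
# returns the column indices suggested for removal instead of the pairs.
```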

## [1] 0
## [1] "No highly correlated predictors noted."

1.3.5 Linear Dependencies


Linear dependency data assessment:

[A] No linear dependency noted for any subset of variables using the preprocessing summary from the caret package.

[B] The caret package includes one method for detecting linearly dependent variables:
     [B.1] The findLinearCombos method from the caret package uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist).

[C] The findLinearCombos method was applied on the dataset:
     [C.1] No linearly dependent variables were identified, so none need to be removed from the dataset for the subsequent analysis.
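
The QR-based check can be illustrated with base R's qr(). This is a sketch on a made-up matrix in which the third column is the sum of the first two; it only detects rank deficiency and names the columns pivoted out, and is not a reimplementation of findLinearCombos.

```r
# Made-up design matrix: x3 = x1 + x2, so the columns are linearly dependent
x <- cbind(
  x1 = c(1, 2, 3, 4),
  x2 = c(1, 0, 1, 0),
  x3 = c(2, 2, 4, 4)
)

q <- qr(x)
q$rank              # numerical rank of the matrix
ncol(x)             # number of columns; rank < ncol means a dependency exists

# Columns moved past the rank by the pivoting are candidates for removal
dependent <- colnames(x)[q$pivot[-seq_len(q$rank)]]
dependent

# With caret installed, caret::findLinearCombos(x) returns a list with the
# dependent sets ($linearCombos) and the suggested removals ($remove).
```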
## [1] 0
## [1] "No linearly dependent predictors noted."

1.3.6 Centering and Scaling


Centering and scaling data assessment:

[A] Centering and scaling transformation for numerical stability remains optional depending on potential model requirements for the subsequent steps.

[B] The caret package includes three methods for centering and scaling variables:
     [B.1] The center method from the caret package subtracts the average value of a numeric variable from all of its values. As a result of centering, the variable has a zero mean.
     [B.2] The scale method from the caret package divides each value of the variable by its standard deviation. Scaling the data coerces the values to have a common standard deviation of one.
     [B.3] The range method from the caret package rescales the data to lie within a defined numeric range, which is [0, 1] by default.

[C] The center, scale and range methods were tried on the dataset.
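
The three transformations reduce to simple arithmetic, sketched here in base R on a made-up vector. The variable names are illustrative; with caret, the corresponding calls would be preProcess() with method "center", "scale" or "range" followed by predict().

```r
# Made-up numeric variable (illustrative only)
x <- c(2, 4, 6, 8)

# Centering: subtract the mean, so the result has mean zero
centered <- x - mean(x)

# Centering and scaling: additionally divide by the standard deviation,
# so the result has standard deviation one
centered_scaled <- (x - mean(x)) / sd(x)

# Range scaling: map the minimum to 0 and the maximum to 1
ranged <- (x - min(x)) / (max(x) - min(x))

centered
centered_scaled
ranged
```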
Data summary
Name DPA.Predictors.Numeric
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
Data summary
Name DPA_CenteredTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 0 1020.17 -477.74 -399.74 -271.74 -49.74 13605.26 ▇▁▁▁▁
InputFields 0 1 0 3650.08 -1527.06 -1403.06 -1111.06 -546.06 55133.94 ▇▁▁▁▁
Iterations 0 1 0 34.42 -19.24 -9.24 -9.24 -9.24 170.76 ▇▁▁▁▁
NumPending 0 1 0 355.96 -53.39 -53.39 -53.39 -53.39 5551.61 ▇▁▁▁▁
Hour 0 1 0 3.98 -13.72 -2.83 0.28 2.87 10.25 ▁▂▇▇▁
Data summary
Name DPA_CenteredScaledTransfo…
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 0 1 -0.47 -0.39 -0.27 -0.05 13.34 ▇▁▁▁▁
InputFields 0 1 0 1 -0.42 -0.38 -0.30 -0.15 15.10 ▇▁▁▁▁
Iterations 0 1 0 1 -0.56 -0.27 -0.27 -0.27 4.96 ▇▁▁▁▁
NumPending 0 1 0 1 -0.15 -0.15 -0.15 -0.15 15.60 ▇▁▁▁▁
Hour 0 1 0 1 -3.45 -0.71 0.07 0.72 2.57 ▁▂▇▇▁
Data summary
Name DPA_RangedTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 0.03 0.07 0 0.01 0.01 0.03 1 ▇▁▁▁▁
InputFields 0 1 0.03 0.06 0 0.00 0.01 0.02 1 ▇▁▁▁▁
Iterations 0 1 0.10 0.18 0 0.05 0.05 0.05 1 ▇▁▁▁▁
NumPending 0 1 0.01 0.06 0 0.00 0.00 0.00 1 ▇▁▁▁▁
Hour 0 1 0.57 0.17 0 0.45 0.58 0.69 1 ▁▂▇▇▁

1.3.7 Shape Transformation


Data transformation assessment:

[A] Shape transformation to remove skewness for data distribution stability remains optional depending on potential model requirements for the subsequent steps.

[B] The caret package includes three methods for transforming variables:
     [B.1] The BoxCox method from the caret package transforms the distributional shape for variables with strictly positive values.
     [B.2] The YeoJohnson method from the caret package transforms the distributional shape for variables with zero and/or negative values.
     [B.3] The expoTrans method from the caret package applies the exponential transformation of Manly (1976), which can also be used for variables with zero and/or negative values.

[C] The BoxCox, YeoJohnson and expoTrans methods were tried on the dataset.
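
To make the first two transformations concrete, the sketch below writes out the standard Box-Cox and Yeo-Johnson formulas for a given lambda. The helpers box_cox and yeo_johnson are hypothetical; caret estimates lambda from the data, whereas the lambda values here are simply chosen for illustration.

```r
# Box-Cox: defined only for strictly positive values
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Yeo-Johnson: extends the idea to zero and negative values
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  out[pos] <- if (lambda == 0) {
    log1p(x[pos])
  } else {
    ((x[pos] + 1)^lambda - 1) / lambda
  }
  out[!pos] <- if (lambda == 2) {
    -log1p(-x[!pos])
  } else {
    -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

box_cox(c(1, exp(1)), lambda = 0)      # log transform
box_cox(4, lambda = 0.5)               # square-root type transform
yeo_johnson(c(-1, 0, 1), lambda = 1)   # lambda = 1 leaves values unchanged
yeo_johnson(c(0, 9), lambda = 0)       # log(x + 1), zeros are allowed
```

The strict positivity requirement is consistent with the Box-Cox summary that follows, where NumPending, which contains zeros, is left unchanged.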
Data summary
Name DPA.Predictors.Numeric
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
Data summary
Name DPA_BoxCoxTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 5.36 1.25 3.00 4.58 5.42 6.10 9.55 ▅▇▇▂▁
InputFields 0 1 5.98 1.65 2.30 4.90 6.05 6.90 10.95 ▂▆▇▃▁
Iterations 0 1 0.95 0.02 0.90 0.95 0.95 0.95 1.00 ▁▁▇▁▁
NumPending 0 1 53.39 355.96 0.00 0.00 0.00 0.00 5605.00 ▇▁▁▁▁
Hour 0 1 22.83 8.41 -0.77 16.40 23.04 28.89 47.09 ▁▆▇▆▁
Data summary
Name DPA_YeoJohnsonTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 4.32 0.80 2.70 3.84 4.40 4.83 6.67 ▃▅▇▂▁
InputFields 0 1 5.40 1.35 2.31 4.53 5.49 6.17 9.19 ▂▅▇▃▁
Iterations 0 1 0.92 0.01 0.88 0.92 0.92 0.92 0.95 ▁▁▇▁▁
NumPending 0 1 0.19 0.35 0.00 0.00 0.00 0.00 0.91 ▇▁▁▁▂
Hour 0 1 33.41 12.40 0.02 23.80 33.53 42.30 70.46 ▁▆▇▅▁
Data summary
Name DPA_ExpoTransTransformed
Number of rows 4331
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Compounds 0 1 497.74 1020.17 20.00 98.00 226.00 448.00 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.00 426.00 991.00 56671.00 ▇▁▁▁▁
Iterations 0 1 12.31 2.08 7.64 12.00 12.00 12.00 17.75 ▁▁▇▁▁
NumPending 0 1 53.39 355.96 0.00 0.00 0.00 0.00 5605.00 ▇▁▁▁▁
Hour 0 1 19.04 6.86 0.02 13.76 18.98 23.83 40.95 ▁▆▇▃▁

1.3.8 Dummy Variables


Dummy variable creation assessment:

[A] Dummy variable creation (or one-hot encoding) for factor variables remains optional depending on potential model requirements for the subsequent steps.

[B] The caret package includes one method for creating dummy variables:
     [B.1] The dummyVars method from the caret package generates a complete (less than full rank parameterized) set of dummy variables from one or more factors.

[C] The dummyVars method was tried on the dataset.
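
A complete set of indicator columns for a single factor can be sketched with base R's model.matrix(); removing the intercept (the "- 1" in the formula) keeps one column per level. The data frame is made up, and the caret call is run only if the package is installed.

```r
# Made-up data with one factor and one numeric predictor (illustrative only)
toy <- data.frame(
  Day  = factor(c("Mon", "Tue", "Mon", "Wed"), levels = c("Mon", "Tue", "Wed")),
  Hour = c(14.0, 13.8, 10.1, 10.4)
)

# One 0/1 column per level of Day; each row has exactly one 1
dummies <- model.matrix(~ Day - 1, data = toy)
dummies

# Combine the indicator columns with the numeric predictor
toy.dummy <- cbind(as.data.frame(dummies), Hour = toy$Hour)

# With caret installed, dummyVars() handles all factors in one call
if (requireNamespace("caret", quietly = TRUE)) {
  dv <- caret::dummyVars(~ ., data = toy)
  toy.caret <- predict(dv, newdata = toy)
}
```

Applied to the 14 levels of Protocol and the 7 levels of Day, this kind of encoding together with the 5 numeric predictors accounts for the 26 columns reported in the summary below.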
## [1] "There are 2 factor variables for dummy variable creation."
Data summary
Name DPA_DummyVariablesCreated
Number of rows 4331
Number of columns 26
_______________________
Column type frequency:
numeric 26
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Protocol.A 0 1 0.02 0.15 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.C 0 1 0.04 0.19 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.D 0 1 0.03 0.18 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.E 0 1 0.02 0.15 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.F 0 1 0.04 0.19 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.G 0 1 0.04 0.19 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.H 0 1 0.07 0.26 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.I 0 1 0.09 0.28 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.J 0 1 0.23 0.42 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Protocol.K 0 1 0.00 0.04 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.L 0 1 0.06 0.23 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.M 0 1 0.10 0.31 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.N 0 1 0.12 0.33 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Protocol.O 0 1 0.13 0.34 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Compounds 0 1 497.74 1020.17 20.00 98.0 226.00 448.0 14103.00 ▇▁▁▁▁
InputFields 0 1 1537.06 3650.08 10.00 134.0 426.00 991.0 56671.00 ▇▁▁▁▁
Iterations 0 1 29.24 34.42 10.00 20.0 20.00 20.0 200.00 ▇▁▁▁▁
NumPending 0 1 53.39 355.96 0.00 0.0 0.00 0.0 5605.00 ▇▁▁▁▁
Hour 0 1 13.73 3.98 0.02 10.9 14.02 16.6 23.98 ▁▂▇▇▁
Day.Mon 0 1 0.16 0.37 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Tue 0 1 0.21 0.41 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Wed 0 1 0.21 0.41 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Thu 0 1 0.17 0.37 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Fri 0 1 0.21 0.41 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▂
Day.Sat 0 1 0.01 0.09 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁
Day.Sun 0 1 0.04 0.19 0.00 0.0 0.00 0.0 1.00 ▇▁▁▁▁

1.4 Data Exploration


Data exploration assessment:

[A] Data exploration for classification modelling problems involves bivariate analysis between the factor response variable and the numeric predictor variables.

[B] The caret package includes one method for performing data exploration:
     [B.1] The featurePlot method from the caret package generates various graphs (box, strip, density, correlation pairs and correlation ellipse plots) for exploring and visualizing the potential relationships between the response and predictor variables.

[C] The featurePlot method was tried on the dataset.
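
As a rough base R counterpart to a box-type feature plot, the sketch below summarises one numeric predictor by response class on made-up data. The object names are illustrative, and boxplot() is called with plot = FALSE so that only the underlying statistics are returned.

```r
# Made-up response and predictor (illustrative only)
toy <- data.frame(
  Class     = factor(c("VF", "VF", "VF", "L", "L", "L"), levels = c("VF", "L")),
  Compounds = c(90, 100, 110, 900, 1000, 1100)
)

# Median of the predictor within each class
class_medians <- tapply(toy$Compounds, toy$Class, median)
class_medians

# Box plot statistics per class (whisker ends, hinges and median)
bp <- boxplot(Compounds ~ Class, data = toy, plot = FALSE)
bp$names
bp$stats

# With caret installed, a call along the lines of
#   caret::featurePlot(x = toy["Compounds"], y = toy$Class, plot = "box")
# draws the corresponding lattice plot.
```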
## [1] 5
## [1] 4

1.5 References


[Book] Applied Predictive Modeling by Max Kuhn and Kjell Johnson
[Article] Caret Package – A Practical Guide to Machine Learning in R by Selva Prabhakaran
[Article] Outlier Detection in R by Antoine Soetewey
[R Package] Caret by Max Kuhn
[R Package] lares by Bernardo Lares