1.1 Sample Data
The schedulingData dataset from the AppliedPredictiveModeling package was used for this illustrative example.
Preliminary dataset assessment:
[A] 4331 rows (observations)
[B] 8 columns (variables)
[B.1] 1/8 response = Class variable (factor)
[B.2] 7/8 predictors = All remaining variables (2/7 factor + 5/7 numeric)
## [1] 4331 8
## 'data.frame': 4331 obs. of 8 variables:
## $ Protocol : Factor w/ 14 levels "A","C","D","E",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Compounds : num 997 97 101 93 100 100 105 98 101 95 ...
## $ InputFields: num 137 103 75 76 82 82 88 95 91 92 ...
## $ Iterations : num 20 20 10 20 20 20 20 20 20 20 ...
## $ NumPending : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hour : num 14 13.8 13.8 10.1 10.4 ...
## $ Day : Factor w/ 7 levels "Mon","Tue","Wed",..: 2 2 4 5 5 3 5 5 5 3 ...
## $ Class : Factor w/ 4 levels "VF","F","M","L": 2 1 1 1 1 1 1 1 1 1 ...
## Protocol Compounds InputFields Iterations
## J : 989 Min. : 20.0 Min. : 10 Min. : 10.00
## O : 581 1st Qu.: 98.0 1st Qu.: 134 1st Qu.: 20.00
## N : 536 Median : 226.0 Median : 426 Median : 20.00
## M : 451 Mean : 497.7 Mean : 1537 Mean : 29.24
## I : 381 3rd Qu.: 448.0 3rd Qu.: 991 3rd Qu.: 20.00
## H : 321 Max. :14103.0 Max. :56671 Max. :200.00
## (Other):1072
## NumPending Hour Day Class
## Min. : 0.00 Min. : 0.01667 Mon:692 VF:2211
## 1st Qu.: 0.00 1st Qu.:10.90000 Tue:900 F :1347
## Median : 0.00 Median :14.01667 Wed:903 M : 514
## Mean : 53.39 Mean :13.73376 Thu:720 L : 259
## 3rd Qu.: 0.00 3rd Qu.:16.60000 Fri:923
## Max. :5605.00 Max. :23.98333 Sat: 32
## Sun:161
## Column.Index Column.Name Column.Type
## 1 1 Protocol factor
## 2 2 Compounds numeric
## 3 3 InputFields numeric
## 4 4 Iterations numeric
## 5 5 NumPending numeric
## 6 6 Hour numeric
## 7 7 Day factor
## 8 8 Class factor
1.2 Data Quality Assessment
Data quality assessment:
[A] No missing observations noted for any variable.
[B] Low variance observed for 2 variables with First.Second.Mode.Ratio>5.
[B.1] Iterations variable (numeric)
[B.2] NumPending variable (numeric)
[C] Low variance observed for 1 variable with Unique.Count.Ratio<0.01.
[C.1] Iterations variable (numeric)
[D] High skewness observed for 4 variables with Skewness>3 or Skewness<(-3).
[D.1] Compounds variable (numeric)
[D.2] InputFields variable (numeric)
[D.3] Iterations variable (numeric)
[D.4] NumPending variable (numeric)
##################################
# Loading dataset
##################################
DQA <- schedulingData
##################################
# Listing all predictors
##################################
DQA.Predictors <- DQA[,!names(DQA) %in% c("Class")]
##################################
# Formulating an overall data quality assessment summary
##################################
(DQA.Summary <- data.frame(
Column.Index=c(1:length(names(DQA))),
Column.Name= names(DQA),
Column.Type=sapply(DQA, function(x) class(x)),
Row.Count=sapply(DQA, function(x) nrow(DQA)),
NA.Count=sapply(DQA,function(x)sum(is.na(x))),
Fill.Rate=sapply(DQA,function(x)format(round((sum(!is.na(x))/nrow(DQA)),3),nsmall=3)),
row.names=NULL)
)
## Column.Index Column.Name Column.Type Row.Count NA.Count Fill.Rate
## 1 1 Protocol factor 4331 0 1.000
## 2 2 Compounds numeric 4331 0 1.000
## 3 3 InputFields numeric 4331 0 1.000
## 4 4 Iterations numeric 4331 0 1.000
## 5 5 NumPending numeric 4331 0 1.000
## 6 6 Hour numeric 4331 0 1.000
## 7 7 Day factor 4331 0 1.000
## 8 8 Class factor 4331 0 1.000
## [1] "There are 5 numeric predictor variable(s)."
## [1] "There are 2 factor predictor variable(s)."
##################################
# Formulating a data quality assessment summary for factor predictors
##################################
if (length(names(DQA.Predictors.Factor))>0) {
##################################
# Formulating a function to determine the first mode
##################################
FirstModes <- function(x) {
ux <- unique(na.omit(x))
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
##################################
# Formulating a function to determine the second mode
##################################
SecondModes <- function(x) {
ux <- unique(na.omit(x))
tab <- tabulate(match(x, ux))
fm = ux[tab == max(tab)]
sm = na.omit(x)[!(na.omit(x) %in% fm)]
usm <- unique(sm)
tabsm <- tabulate(match(sm, usm))
usm[tabsm == max(tabsm)]
}
(DQA.Predictors.Factor.Summary <- data.frame(
Column.Name= names(DQA.Predictors.Factor),
Column.Type=sapply(DQA.Predictors.Factor, function(x) class(x)),
Unique.Count=sapply(DQA.Predictors.Factor, function(x) length(unique(x))),
First.Mode.Value=sapply(DQA.Predictors.Factor, function(x) as.character(FirstModes(x)[1])),
Second.Mode.Value=sapply(DQA.Predictors.Factor, function(x) as.character(SecondModes(x)[1])),
First.Mode.Count=sapply(DQA.Predictors.Factor, function(x) sum(na.omit(x) == FirstModes(x)[1])),
Second.Mode.Count=sapply(DQA.Predictors.Factor, function(x) sum(na.omit(x) == SecondModes(x)[1])),
Unique.Count.Ratio=sapply(DQA.Predictors.Factor, function(x) format(round((length(unique(x))/nrow(DQA.Predictors.Factor)),3), nsmall=3)),
First.Second.Mode.Ratio=sapply(DQA.Predictors.Factor, function(x) format(round((sum(na.omit(x) == FirstModes(x)[1])/sum(na.omit(x) == SecondModes(x)[1])),3), nsmall=3)),
row.names=NULL)
)
}
## Column.Name Column.Type Unique.Count First.Mode.Value Second.Mode.Value
## 1 Protocol factor 14 J O
## 2 Day factor 7 Fri Wed
## First.Mode.Count Second.Mode.Count Unique.Count.Ratio First.Second.Mode.Ratio
## 1 989 581 0.003 1.702
## 2 923 903 0.002 1.022
##################################
# Formulating a data quality assessment summary for numeric predictors
##################################
if (length(names(DQA.Predictors.Numeric))>0) {
##################################
# Formulating a function to determine the first mode
##################################
FirstModes <- function(x) {
ux <- unique(na.omit(x))
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
##################################
# Formulating a function to determine the second mode
##################################
SecondModes <- function(x) {
ux <- unique(na.omit(x))
tab <- tabulate(match(x, ux))
fm = ux[tab == max(tab)]
sm = na.omit(x)[!(na.omit(x) %in% fm)]
usm <- unique(sm)
tabsm <- tabulate(match(sm, usm))
usm[tabsm == max(tabsm)]
}
(DQA.Predictors.Numeric.Summary <- data.frame(
Column.Name= names(DQA.Predictors.Numeric),
Column.Type=sapply(DQA.Predictors.Numeric, function(x) class(x)),
Unique.Count=sapply(DQA.Predictors.Numeric, function(x) length(unique(x))),
Unique.Count.Ratio=sapply(DQA.Predictors.Numeric, function(x) format(round((length(unique(x))/nrow(DQA.Predictors.Numeric)),3), nsmall=3)),
First.Mode.Value=sapply(DQA.Predictors.Numeric, function(x) format(round((FirstModes(x)[1]),3),nsmall=3)),
Second.Mode.Value=sapply(DQA.Predictors.Numeric, function(x) format(round((SecondModes(x)[1]),3),nsmall=3)),
First.Mode.Count=sapply(DQA.Predictors.Numeric, function(x) sum(na.omit(x) == FirstModes(x)[1])),
Second.Mode.Count=sapply(DQA.Predictors.Numeric, function(x) sum(na.omit(x) == SecondModes(x)[1])),
First.Second.Mode.Ratio=sapply(DQA.Predictors.Numeric, function(x) format(round((sum(na.omit(x) == FirstModes(x)[1])/sum(na.omit(x) == SecondModes(x)[1])),3), nsmall=3)),
Minimum=sapply(DQA.Predictors.Numeric, function(x) format(round(min(x,na.rm = TRUE),3), nsmall=3)),
Mean=sapply(DQA.Predictors.Numeric, function(x) format(round(mean(x,na.rm = TRUE),3), nsmall=3)),
Median=sapply(DQA.Predictors.Numeric, function(x) format(round(median(x,na.rm = TRUE),3), nsmall=3)),
Maximum=sapply(DQA.Predictors.Numeric, function(x) format(round(max(x,na.rm = TRUE),3), nsmall=3)),
Skewness=sapply(DQA.Predictors.Numeric, function(x) format(round(skewness(x,na.rm = TRUE),3), nsmall=3)),
Kurtosis=sapply(DQA.Predictors.Numeric, function(x) format(round(kurtosis(x,na.rm = TRUE),3), nsmall=3)),
Percentile25th=sapply(DQA.Predictors.Numeric, function(x) format(round(quantile(x,probs=0.25,na.rm = TRUE),3), nsmall=3)),
Percentile75th=sapply(DQA.Predictors.Numeric, function(x) format(round(quantile(x,probs=0.75,na.rm = TRUE),3), nsmall=3)),
row.names=NULL)
)
}
## Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 1 Compounds numeric 858 0.198 20.000
## 2 InputFields numeric 1730 0.399 10.000
## 3 Iterations numeric 11 0.003 20.000
## 4 NumPending numeric 303 0.070 0.000
## 5 Hour numeric 924 0.213 13.083
## Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 1 31.000 96 29 3.310
## 2 466.000 82 27 3.037
## 3 10.000 3568 272 13.118
## 4 1.000 3275 165 19.848
## 5 21.067 28 25 1.120
## Minimum Mean Median Maximum Skewness Kurtosis Percentile25th
## 1 20.000 497.742 226.000 14103.000 6.568 69.486 98.000
## 2 10.000 1537.055 426.000 56671.000 5.870 54.919 134.000
## 3 10.000 29.244 20.000 200.000 3.937 18.510 20.000
## 4 0.000 53.389 0.000 5605.000 9.718 105.594 0.000
## 5 0.017 13.734 14.017 23.983 -0.546 3.747 10.900
## Percentile75th
## 1 448.000
## 2 991.000
## 3 20.000
## 4 0.000
## 5 16.600
## [1] "No missing observations noted."
## [1] "No low variance factor predictors due to high first-second mode ratio noted."
## [1] "Low variance observed for 2 numeric variable(s) with First.Second.Mode.Ratio>5."
## Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 3 Iterations numeric 11 0.003 20.000
## 4 NumPending numeric 303 0.070 0.000
## Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 3 10.000 3568 272 13.118
## 4 1.000 3275 165 19.848
## Minimum Mean Median Maximum Skewness Kurtosis Percentile25th
## 3 10.000 29.244 20.000 200.000 3.937 18.510 20.000
## 4 0.000 53.389 0.000 5605.000 9.718 105.594 0.000
## Percentile75th
## 3 20.000
## 4 0.000
## [1] "Low variance observed for 1 numeric variable(s) with Unique.Count.Ratio<0.01."
## Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 3 Iterations numeric 11 0.003 20.000
## Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 3 10.000 3568 272 13.118
## Minimum Mean Median Maximum Skewness Kurtosis Percentile25th Percentile75th
## 3 10.000 29.244 20.000 200.000 3.937 18.510 20.000 20.000
## [1] "High skewness observed for 4 numeric variable(s) with Skewness>3 or Skewness<(-3)."
## Column.Name Column.Type Unique.Count Unique.Count.Ratio First.Mode.Value
## 1 Compounds numeric 858 0.198 20.000
## 2 InputFields numeric 1730 0.399 10.000
## 3 Iterations numeric 11 0.003 20.000
## 4 NumPending numeric 303 0.070 0.000
## Second.Mode.Value First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio
## 1 31.000 96 29 3.310
## 2 466.000 82 27 3.037
## 3 10.000 3568 272 13.118
## 4 1.000 3275 165 19.848
## Minimum Mean Median Maximum Skewness Kurtosis Percentile25th
## 1 20.000 497.742 226.000 14103.000 6.568 69.486 98.000
## 2 10.000 1537.055 426.000 56671.000 5.870 54.919 134.000
## 3 10.000 29.244 20.000 200.000 3.937 18.510 20.000
## 4 0.000 53.389 0.000 5605.000 9.718 105.594 0.000
## Percentile75th
## 1 448.000
## 2 991.000
## 3 20.000
## 4 0.000
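The summary code above calls skewness() without showing which package supplies it. As an assumption only, a moment-based definition of skewness can be sketched in base R as follows; the function name skew and both vectors are hypothetical, and the package function actually used in the report may apply a different estimator.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5. This is an assumed
# definition, not necessarily the one used by the report's skewness().
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical vectors: one symmetric, one with a long right tail.
symmetric    <- c(-2, -1, 0, 1, 2)
right_tailed <- c(rep(0, 95), rep(100, 5))

print(skew(symmetric))
print(skew(right_tailed))

# Apply the report's flagging rule (Skewness>3 or Skewness<(-3)).
flagged <- abs(skew(right_tailed)) > 3
print(flagged)
```

A predictor dominated by one value with a few large observations, like NumPending in the table above, produces a large positive value under this definition.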
1.3 Data Preprocessing
1.3.1 Missing Data Imputation
Missing data assessment:
[A] 100% fill rate with no missing data identified from the previous data quality assessment.
[B] 100% fill rate with no missing data confirmed using a descriptive statistics summary.
[C] The caret package allows three imputation methods:
[C.1] The knnImpute method fills each missing value using the k closest samples (Euclidean distance) in the training set.
[C.2] The bagImpute method fits a bagged tree model for each predictor (as a function of all the others).
[C.3] The medianImpute method takes the median of each predictor in the training set and uses it to fill that predictor's missing values.
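As a rough illustration of the median approach in [C.3], the base-R sketch below fills missing values column by column. The toy data frame and the median_impute helper are hypothetical (schedulingData itself has no missing values). In caret the equivalent step would go through preProcess(method = "medianImpute") followed by predict(), which learns the medians on the training set only.

```r
# Hypothetical toy data with missing values.
toy <- data.frame(
  Compounds   = c(100, NA, 300, 500),
  InputFields = c(10, 20, NA, 40)
)

# Replace each NA in a numeric column with the median of the
# observed values in that same column.
median_impute <- function(df) {
  df[] <- lapply(df, function(x) {
    if (is.numeric(x)) x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  })
  df
}

imputed <- median_impute(toy)
print(imputed)
```

Unlike this sketch, a train/test workflow should compute the medians once on the training data and reuse them for new data.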
Data summary
Name | DPA
Number of rows | 4331
Number of columns | 8
Column type frequency:
factor | 3
numeric | 5
Group variables | None
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts
Protocol | 0 | 1 | FALSE | 14 | J: 989, O: 581, N: 536, M: 451
Day | 0 | 1 | FALSE | 7 | Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class | 0 | 1 | FALSE | 4 | VF: 2211, F: 1347, M: 514, L: 259
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | 497.74 | 1020.17 | 20.00 | 98.0 | 226.00 | 448.0 | 14103.00 | ▇▁▁▁▁
InputFields | 0 | 1 | 1537.06 | 3650.08 | 10.00 | 134.0 | 426.00 | 991.0 | 56671.00 | ▇▁▁▁▁
Iterations | 0 | 1 | 29.24 | 34.42 | 10.00 | 20.0 | 20.00 | 20.0 | 200.00 | ▇▁▁▁▁
NumPending | 0 | 1 | 53.39 | 355.96 | 0.00 | 0.0 | 0.00 | 0.0 | 5605.00 | ▇▁▁▁▁
Hour | 0 | 1 | 13.73 | 3.98 | 0.02 | 10.9 | 14.02 | 16.6 | 23.98 | ▁▂▇▇▁
## # A tibble: 0 x 15
## # ... with 15 variables: skim_type <chr>, skim_variable <chr>, n_missing <int>,
## # complete_rate <dbl>, factor.ordered <lgl>, factor.n_unique <int>,
## # factor.top_counts <chr>, numeric.mean <dbl>, numeric.sd <dbl>,
## # numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## # numeric.p100 <dbl>, numeric.hist <chr>
1.3.2 Outlier Treatment
Outlier data assessment:
[A] Outliers noted for 5 variables. Outlier treatment for numerical stability remains optional depending on potential model requirements for the subsequent steps.
[B] Numeric data can be visualized through a boxplot including observations classified as suspected outliers using the IQR criterion. The IQR criterion means that all observations above the (75th percentile + 1.5 x IQR) or below the (25th percentile - 1.5 x IQR) are suspected outliers, where IQR is the difference between the third quartile (75th percentile) and first quartile (25th percentile).
[C] The caret package includes one method for outlier treatment:
[C.1] The spatialSign method from the caret package projects each observation onto the unit sphere in p dimensions by dividing its vector of predictor values by that vector's norm, where p is the number of predictors.
[D] The spatialSign method was applied on the centered and scaled dataset:
[D.1] While data distribution generally improved with the number of remaining outliers reduced, there are still 4 variables noted with outliers using the IQR criterion.
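The spatial sign transformation described in [C.1] can be sketched in base R as below. This is a minimal illustration on a hypothetical two-column matrix, not the caret implementation; it assumes the data are centered and scaled first, and it does not handle rows whose norm is zero.

```r
# Hypothetical data: the last row is an extreme observation.
toy <- cbind(
  a = c(1, 2, 3, 4, 50),
  b = c(2, 1, 4, 3, -40)
)

# Center and scale each column first so no single predictor dominates the norm.
scaled <- scale(toy)

# Euclidean norm of each row (one value per observation).
norms <- sqrt(rowSums(scaled^2))

# Divide every row by its own norm; the vector recycles down the columns,
# so element [i, j] is divided by norms[i].
spatial <- scaled / norms

print(round(spatial, 3))
print(rowSums(spatial^2))
```

After the transformation every observation has unit length, so each individual value lies between -1 and 1. Extreme observations are pulled in towards the rest of the data, although the IQR criterion can still flag points within this compressed range.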
##################################
# Loading dataset
##################################
DPA <- schedulingData
##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]
##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]
##################################
# Identifying outliers for the numeric predictors
##################################
OutlierCountList <- c()
for (i in 1:ncol(DPA.Predictors.Numeric)) {
Outliers <- boxplot.stats(DPA.Predictors.Numeric[,i])$out
OutlierCount <- length(Outliers)
OutlierCountList <- append(OutlierCountList,OutlierCount)
OutlierIndices <- which(DPA.Predictors.Numeric[,i] %in% c(Outliers))
boxplot(DPA.Predictors.Numeric[,i],
ylab = names(DPA.Predictors.Numeric)[i],
main = names(DPA.Predictors.Numeric)[i],
horizontal=TRUE)
mtext(paste0(OutlierCount, " Outlier(s) Detected"))
}





## [1] "5 numeric variable(s) were noted with outlier(s)."
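The IQR criterion behind these counts can also be written out explicitly. The sketch below uses a hypothetical vector and quantile() with its default settings. Note that boxplot.stats(), used in the loop above, derives its bounds from Tukey's hinges, so the two approaches can disagree slightly on borderline values.

```r
# Hypothetical values; the last one is deliberately extreme.
x <- c(10, 12, 13, 14, 15, 16, 18, 95)

# First and third quartiles and the interquartile range.
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences at 1.5 x IQR beyond the quartiles.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Observations outside the fences are suspected outliers.
suspected <- x[x < lower | x > upper]
print(c(lower = lower, upper = upper))
print(suspected)
```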
Data summary
Name | DPA.Predictors.Numeric
Number of rows | 4331
Number of columns | 5
Column type frequency:
numeric | 5
Group variables | None
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | 497.74 | 1020.17 | 20.00 | 98.0 | 226.00 | 448.0 | 14103.00 | ▇▁▁▁▁
InputFields | 0 | 1 | 1537.06 | 3650.08 | 10.00 | 134.0 | 426.00 | 991.0 | 56671.00 | ▇▁▁▁▁
Iterations | 0 | 1 | 29.24 | 34.42 | 10.00 | 20.0 | 20.00 | 20.0 | 200.00 | ▇▁▁▁▁
NumPending | 0 | 1 | 53.39 | 355.96 | 0.00 | 0.0 | 0.00 | 0.0 | 5605.00 | ▇▁▁▁▁
Hour | 0 | 1 | 13.73 | 3.98 | 0.02 | 10.9 | 14.02 | 16.6 | 23.98 | ▁▂▇▇▁
Data summary
Name | DPA_CenteredScaledSpatial…
Number of rows | 4331
Number of columns | 5
Column type frequency:
numeric | 5
Group variables | None
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | -0.14 | 0.37 | -0.82 | -0.39 | -0.21 | -0.04 | 1.00 | ▂▇▃▁▁
InputFields | 0 | 1 | -0.16 | 0.41 | -0.81 | -0.41 | -0.25 | -0.07 | 1.00 | ▃▇▂▁▂
Iterations | 0 | 1 | -0.18 | 0.37 | -0.92 | -0.38 | -0.25 | -0.14 | 1.00 | ▁▇▂▁▁
NumPending | 0 | 1 | -0.09 | 0.21 | -0.46 | -0.18 | -0.13 | -0.06 | 1.00 | ▃▇▁▁▁
Hour | 0 | 1 | 0.04 | 0.65 | -0.99 | -0.63 | 0.08 | 0.70 | 0.99 | ▇▃▅▃▇





## [1] "4 numeric variable(s) were noted with outlier(s)."
1.3.3 Zero and Near-Zero Variance
Zero and near-zero variance data assessment:
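[A] Low variance noted in the previous data quality assessment for 2 variables (Iterations and NumPending) by First.Second.Mode.Ratio and for 1 variable (Iterations) by Unique.Count.Ratio.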
[B] Low variance noted for 2 variables confirmed using a preprocessing summary from the caret package.
[C] The caret package includes two methods for detecting low variance variables:
[C.1] The nearZeroVar method using the freqCut criterion (default setting 95/5) computes the ratio of the frequency of the most prevalent value to that of the second most frequent value (called the “frequency ratio”), which would be near one for well-behaved predictors and very large for highly unbalanced data.
[C.2] The nearZeroVar method using the uniqueCut criterion (default setting 10) computes the percentage of unique values, that is, the number of unique values divided by the total number of samples (times 100), which approaches zero as the granularity of the data increases.
[D] The nearZeroVar method, with the freqCut and uniqueCut criteria set at 80/20 and 10, respectively, was applied on the dataset:
[D.1] 2 variables may be optionally removed from the dataset for the subsequent analysis.
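The two statistics in [C.1] and [C.2] are simple to reproduce by hand. The base-R sketch below is an approximation of what nearZeroVar reports, not its actual source; the helper names and the toy vector are hypothetical, and the strict inequalities follow the wording of this report's own messages (caret's exact boundary handling may differ). For reference, the Iterations predictor has counts of 3568 and 272 for its two most frequent values, giving a ratio of about 13.118.

```r
# Ratio of the most frequent value's count to the second most frequent's.
freq_ratio <- function(x) {
  counts <- as.vector(sort(table(x), decreasing = TRUE))
  if (length(counts) < 2) return(NA_real_)  # constant column: ratio undefined here
  counts[1] / counts[2]
}

# Number of distinct values as a percentage of the number of samples.
percent_unique <- function(x) 100 * length(unique(x)) / length(x)

# Hypothetical vector: 90 twenties, 6 tens and 4 fifties.
x <- c(rep(20, 90), rep(10, 6), rep(50, 4))

fr <- freq_ratio(x)
pu <- percent_unique(x)

# Thresholds used in this report: freqCut = 80/20, uniqueCut = 10.
flagged <- fr > 80 / 20 && pu < 10
print(c(freqRatio = fr, percentUnique = pu))
print(flagged)
```

A predictor is only flagged when both conditions hold, which is why a skewed but fine-grained variable can pass this filter.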
Data summary
Name | DPA
Number of rows | 4331
Number of columns | 8
Column type frequency:
factor | 3
numeric | 5
Group variables | None
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts
Protocol | 0 | 1 | FALSE | 14 | J: 989, O: 581, N: 536, M: 451
Day | 0 | 1 | FALSE | 7 | Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class | 0 | 1 | FALSE | 4 | VF: 2211, F: 1347, M: 514, L: 259
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | 497.74 | 1020.17 | 20.00 | 98.0 | 226.00 | 448.0 | 14103.00 | ▇▁▁▁▁
InputFields | 0 | 1 | 1537.06 | 3650.08 | 10.00 | 134.0 | 426.00 | 991.0 | 56671.00 | ▇▁▁▁▁
Iterations | 0 | 1 | 29.24 | 34.42 | 10.00 | 20.0 | 20.00 | 20.0 | 200.00 | ▇▁▁▁▁
NumPending | 0 | 1 | 53.39 | 355.96 | 0.00 | 0.0 | 0.00 | 0.0 | 5605.00 | ▇▁▁▁▁
Hour | 0 | 1 | 13.73 | 3.98 | 0.02 | 10.9 | 14.02 | 16.6 | 23.98 | ▁▂▇▇▁
## freqRatio percentUnique zeroVar nzv
## Iterations 13.11765 0.2539829 FALSE TRUE
## NumPending 19.84848 6.9960748 FALSE TRUE
if ((nrow(DPA_LowVariance[DPA_LowVariance$nzv,]))==0){
print("No low variance predictors noted.")
} else {
print(paste0("Low variance observed for ",
(nrow(DPA_LowVariance[DPA_LowVariance$nzv,])),
" numeric variable(s) with First.Second.Mode.Ratio>4 and Unique.Count.Ratio<0.10."))
DPA_LowVarianceForRemoval <- (nrow(DPA_LowVariance[DPA_LowVariance$nzv,]))
print(paste0("Low variance can be resolved by removing ",
(nrow(DPA_LowVariance[DPA_LowVariance$nzv,])),
" numeric variable(s)."))
for (j in 1:DPA_LowVarianceForRemoval) {
DPA_LowVarianceRemovedVariable <- rownames(DPA_LowVariance[DPA_LowVariance$nzv,])[j]
print(paste0("Variable ",
j,
" for removal: ",
DPA_LowVarianceRemovedVariable))
}
DPA %>%
skim() %>%
dplyr::filter(skim_variable %in% rownames(DPA_LowVariance[DPA_LowVariance$nzv,]))
##################################
# Filtering out columns with low variance
#################################
DPA_ExcludedLowVariance <- DPA[,!names(DPA) %in% rownames(DPA_LowVariance[DPA_LowVariance$nzv,])]
##################################
# Gathering descriptive statistics
##################################
(DPA_ExcludedLowVariance_Skimmed <- skim(DPA_ExcludedLowVariance))
}
## [1] "Low variance observed for 2 numeric variable(s) with First.Second.Mode.Ratio>4 and Unique.Count.Ratio<0.10."
## [1] "Low variance can be resolved by removing 2 numeric variable(s)."
## [1] "Variable 1 for removal: Iterations"
## [1] "Variable 2 for removal: NumPending"
Data summary
Name | DPA_ExcludedLowVariance
Number of rows | 4331
Number of columns | 6
Column type frequency:
factor | 3
numeric | 3
Group variables | None
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts
Protocol | 0 | 1 | FALSE | 14 | J: 989, O: 581, N: 536, M: 451
Day | 0 | 1 | FALSE | 7 | Fri: 923, Wed: 903, Tue: 900, Thu: 720
Class | 0 | 1 | FALSE | 4 | VF: 2211, F: 1347, M: 514, L: 259
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | 497.74 | 1020.17 | 20.00 | 98.0 | 226.00 | 448.0 | 14103.00 | ▇▁▁▁▁
InputFields | 0 | 1 | 1537.06 | 3650.08 | 10.00 | 134.0 | 426.00 | 991.0 | 56671.00 | ▇▁▁▁▁
Hour | 0 | 1 | 13.73 | 3.98 | 0.02 | 10.9 | 14.02 | 16.6 | 23.98 | ▁▂▇▇▁
1.3.4 Collinearity
High collinearity data assessment:
[A] No high correlation noted for any variable pair confirmed using the preprocessing summaries from the caret and lares packages.
[B] The caret and lares packages include methods for detecting highly correlated variables:
[B.1] The findCorrelation method using the cutoff criteria with default setting at 0.90 from the caret package searches through a correlation matrix and returns a vector of integers corresponding to columns to remove to reduce pair-wise correlations.
[B.2] The corr_cross method using the top criteria from the lares package lists out and ranks the variable-pairs with the highest correlation coefficients which were statistically significant.
[C] The findCorrelation method using cutoff criteria set at 0.95 and the corr_cross method using top criteria set at 10 were applied on the dataset:
[C.1] No highly correlated variables were detected, so none need to be considered for optional removal in the subsequent analysis.
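The cutoff idea can be illustrated with base R alone. The sketch below lists every predictor pair whose absolute Pearson correlation exceeds 0.95 in a hypothetical data frame. It is a simplification: findCorrelation goes further and chooses which member of each pair to drop, which this sketch does not attempt.

```r
# Hypothetical predictors: b is roughly 2 * a, c is unrelated to both.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
  c = c(5, 1, 4, 2, 6, 3)
)

# Pairwise Pearson correlations.
cm <- cor(toy, method = "pearson")

# Blank out the diagonal and lower triangle so each pair appears once.
cm[lower.tri(cm, diag = TRUE)] <- NA

# Positions of pairs above the cutoff (NA entries are ignored by which()).
high <- which(abs(cm) > 0.95, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(high_pairs)
```

An empty result from this kind of check corresponds to the "No highly correlated predictors noted." message reported for schedulingData.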
##################################
# Loading dataset
##################################
DPA <- schedulingData
##################################
# Listing all predictors
##################################
DPA.Predictors <- DPA[,!names(DPA) %in% c("Class")]
##################################
# Listing all numeric predictors
##################################
DPA.Predictors.Numeric <- DPA.Predictors[,sapply(DPA.Predictors, is.numeric)]
##################################
# Visualizing pairwise correlation between predictors
##################################
DPA_CorrelationTest <- cor.mtest(DPA.Predictors.Numeric,
method = "pearson",
conf.level = .95)
corrplot(cor(DPA.Predictors.Numeric,
method = "pearson",
use="pairwise.complete.obs"),
method = "circle",
type = "upper",
order = "original",
tl.col = "black",
tl.cex = 0.75,
tl.srt = 90,
sig.level = 0.05,
p.mat = DPA_CorrelationTest$p,
insig = "blank")

## [1] 0
## [1] "No highly correlated predictors noted."
1.3.5 Linear Dependencies
Linear dependency data assessment:
[A] No linear dependency noted for any subset of variables using the preprocessing summary from the caret package.
[B] The caret package includes one method for detecting linearly dependent variables:
[B.1] The findLinearCombos method from the caret package uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist).
[C] The findLinearCombos method was applied on the dataset:
[C.1] No linearly dependent variables were identified which may be optionally removed from the dataset for the subsequent analysis.
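To show the QR-based idea in [B.1] on a small scale, the base-R sketch below checks a hypothetical matrix whose third column is the sum of the first two. It relies only on qr() and is not the findLinearCombos implementation, which additionally enumerates the combinations involved.

```r
# Hypothetical predictors: x3 is exactly x1 + x2.
toy <- cbind(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 1, 0, 1, 2),
  x3 = c(3, 3, 3, 5, 7)
)

# QR decomposition; qr() reports the numerical rank and moves
# columns it finds redundant to the end of the pivot order.
decomp <- qr(toy)
rk <- decomp$rank

# Columns beyond the rank are linear combinations of earlier ones.
dependent <- if (rk < ncol(toy)) {
  colnames(toy)[decomp$pivot[(rk + 1):ncol(toy)]]
} else {
  character(0)
}

print(rk)
print(dependent)
```

A full-rank predictor matrix gives an empty result, matching the "No linearly dependent predictors noted." message above.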
## [1] 0
## [1] "No linearly dependent predictors noted."
1.3.6 Centering and Scaling
Centering and scaling data assessment:
[A] Centering and scaling transformation for numerical stability remains optional depending on potential model requirements for the subsequent steps.
[B] The caret package includes three methods for centering and scaling variables:
[B.1] The center method from the caret package subtracts the average value of a numeric variable from all of its values. As a result of centering, the variable has a zero mean.
[B.2] The scale method from the caret package divides each value of the variable by its standard deviation. Scaling the data coerces the values to have a common standard deviation of one.
[B.3] The range method from the caret package scales the data to be within the defined numeric range bound (0 to 1 by default).
[C] The center, scale and range methods were tried on the dataset.
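The arithmetic behind the three methods is short enough to write out directly. The sketch below uses a hypothetical vector and plain base R rather than preProcess, and assumes the default 0 to 1 bounds for the range method.

```r
# Hypothetical numeric predictor.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# center: subtract the mean, giving a zero mean and an unchanged spread.
centered <- x - mean(x)

# scale: divide by the standard deviation, giving a unit standard deviation.
scaled <- x / sd(x)

# center followed by scale: zero mean and unit standard deviation.
standardized <- (x - mean(x)) / sd(x)

# range: map the minimum to 0 and the maximum to 1.
ranged <- (x - min(x)) / (max(x) - min(x))

print(c(mean = mean(centered), sd = sd(centered)))
print(c(mean = mean(standardized), sd = sd(standardized)))
print(range(ranged))
```

None of these transformations changes the shape of the distribution, which is why the histograms in the summaries below look the same before and after.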
Data summary
Name | DPA.Predictors.Numeric
Number of rows | 4331
Number of columns | 5
Column type frequency:
numeric | 5
Group variables | None
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | 497.74 | 1020.17 | 20.00 | 98.0 | 226.00 | 448.0 | 14103.00 | ▇▁▁▁▁
InputFields | 0 | 1 | 1537.06 | 3650.08 | 10.00 | 134.0 | 426.00 | 991.0 | 56671.00 | ▇▁▁▁▁
Iterations | 0 | 1 | 29.24 | 34.42 | 10.00 | 20.0 | 20.00 | 20.0 | 200.00 | ▇▁▁▁▁
NumPending | 0 | 1 | 53.39 | 355.96 | 0.00 | 0.0 | 0.00 | 0.0 | 5605.00 | ▇▁▁▁▁
Hour | 0 | 1 | 13.73 | 3.98 | 0.02 | 10.9 | 14.02 | 16.6 | 23.98 | ▁▂▇▇▁
Data summary
Name | DPA_CenteredTransformed
Number of rows | 4331
Number of columns | 5
Column type frequency:
numeric | 5
Group variables | None
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | 0 | 1020.17 | -477.74 | -399.74 | -271.74 | -49.74 | 13605.26 | ▇▁▁▁▁
InputFields | 0 | 1 | 0 | 3650.08 | -1527.06 | -1403.06 | -1111.06 | -546.06 | 55133.94 | ▇▁▁▁▁
Iterations | 0 | 1 | 0 | 34.42 | -19.24 | -9.24 | -9.24 | -9.24 | 170.76 | ▇▁▁▁▁
NumPending | 0 | 1 | 0 | 355.96 | -53.39 | -53.39 | -53.39 | -53.39 | 5551.61 | ▇▁▁▁▁
Hour | 0 | 1 | 0 | 3.98 | -13.72 | -2.83 | 0.28 | 2.87 | 10.25 | ▁▂▇▇▁
Data summary
Name | DPA_CenteredScaledTransfo…
Number of rows | 4331
Number of columns | 5
Column type frequency:
numeric | 5
Group variables | None
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | 0 | 1 | -0.47 | -0.39 | -0.27 | -0.05 | 13.34 | ▇▁▁▁▁
InputFields | 0 | 1 | 0 | 1 | -0.42 | -0.38 | -0.30 | -0.15 | 15.10 | ▇▁▁▁▁
Iterations | 0 | 1 | 0 | 1 | -0.56 | -0.27 | -0.27 | -0.27 | 4.96 | ▇▁▁▁▁
NumPending | 0 | 1 | 0 | 1 | -0.15 | -0.15 | -0.15 | -0.15 | 15.60 | ▇▁▁▁▁
Hour | 0 | 1 | 0 | 1 | -3.45 | -0.71 | 0.07 | 0.72 | 2.57 | ▁▂▇▇▁
Data summary
Name | DPA_RangedTransformed
Number of rows | 4331
Number of columns | 5
Column type frequency:
numeric | 5
Group variables | None
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Compounds | 0 | 1 | 0.03 | 0.07 | 0 | 0.01 | 0.01 | 0.03 | 1 | ▇▁▁▁▁
InputFields | 0 | 1 | 0.03 | 0.06 | 0 | 0.00 | 0.01 | 0.02 | 1 | ▇▁▁▁▁
Iterations | 0 | 1 | 0.10 | 0.18 | 0 | 0.05 | 0.05 | 0.05 | 1 | ▇▁▁▁▁
NumPending | 0 | 1 | 0.01 | 0.06 | 0 | 0.00 | 0.00 | 0.00 | 1 | ▇▁▁▁▁
Hour | 0 | 1 | 0.57 | 0.17 | 0 | 0.45 | 0.58 | 0.69 | 1 | ▁▂▇▇▁
1.3.8 Dummy Variables
Dummy variable creation assessment:
[A] Dummy variable creation (or one-hot encoding) for factor variables remains optional depending on potential model requirements for the subsequent steps.
[B] The caret package includes one method for creating dummy variables:
[B.1] The dummyVars method from the caret package generates a complete (less than full rank parameterized) set of dummy variables from one or more factors.
[C] The dummyVars method was tried on the dataset.
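A complete set of indicator columns for a single factor can be produced in base R with model.matrix, as sketched below on a hypothetical data frame. This is only an approximation of what dummyVars returns: the column renaming to the Day.Mon style is done by hand here, and handling several factors at once would need extra arguments that are not shown.

```r
# Hypothetical data: one factor and one numeric predictor.
toy <- data.frame(
  Day  = factor(c("Mon", "Tue", "Mon", "Wed"), levels = c("Mon", "Tue", "Wed")),
  Hour = c(9.5, 13, 14.25, 16)
)

# Dropping the intercept (- 1) keeps one indicator column per level,
# i.e. the complete, less than full rank, parameterization.
indicators <- model.matrix(~ Day - 1, data = toy)

# Rename DayMon -> Day.Mon to mirror the naming seen in the report's output.
colnames(indicators) <- sub("^Day", "Day.", colnames(indicators))

# Combine the indicators with the untouched numeric predictor.
encoded <- cbind(indicators, Hour = toy$Hour)
print(encoded)
```

Because every row has exactly one indicator set to 1, the indicator columns sum to a constant. Models that require full rank inputs would need one level dropped.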
## [1] "There are 2 factor variables for dummy variable creation."
Data summary
Name | DPA_DummyVariablesCreated
Number of rows | 4331
Number of columns | 26
Column type frequency:
numeric | 26
Group variables | None
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
Protocol.A | 0 | 1 | 0.02 | 0.15 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.C | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.D | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.E | 0 | 1 | 0.02 | 0.15 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.F | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.G | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.H | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.I | 0 | 1 | 0.09 | 0.28 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.J | 0 | 1 | 0.23 | 0.42 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▂
Protocol.K | 0 | 1 | 0.00 | 0.04 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.L | 0 | 1 | 0.06 | 0.23 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.M | 0 | 1 | 0.10 | 0.31 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.N | 0 | 1 | 0.12 | 0.33 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Protocol.O | 0 | 1 | 0.13 | 0.34 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Compounds | 0 | 1 | 497.74 | 1020.17 | 20.00 | 98.0 | 226.00 | 448.0 | 14103.00 | ▇▁▁▁▁
InputFields | 0 | 1 | 1537.06 | 3650.08 | 10.00 | 134.0 | 426.00 | 991.0 | 56671.00 | ▇▁▁▁▁
Iterations | 0 | 1 | 29.24 | 34.42 | 10.00 | 20.0 | 20.00 | 20.0 | 200.00 | ▇▁▁▁▁
NumPending | 0 | 1 | 53.39 | 355.96 | 0.00 | 0.0 | 0.00 | 0.0 | 5605.00 | ▇▁▁▁▁
Hour | 0 | 1 | 13.73 | 3.98 | 0.02 | 10.9 | 14.02 | 16.6 | 23.98 | ▁▂▇▇▁
Day.Mon | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▂
Day.Tue | 0 | 1 | 0.21 | 0.41 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▂
Day.Wed | 0 | 1 | 0.21 | 0.41 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▂
Day.Thu | 0 | 1 | 0.17 | 0.37 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▂
Day.Fri | 0 | 1 | 0.21 | 0.41 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▂
Day.Sat | 0 | 1 | 0.01 | 0.09 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
Day.Sun | 0 | 1 | 0.04 | 0.19 | 0.00 | 0.0 | 0.00 | 0.0 | 1.00 | ▇▁▁▁▁
1.4 Data Exploration
Data exploration assessment:
[A] Data exploration for classification modelling problems involves bivariate analysis between the factor response variable and the numeric predictor variables.
[B] The caret package includes one method for performing data exploration:
[B.1] The featurePlot method from the caret package generates various graphs (box, strip, density, correlation pairs and correlation ellipse plots) for exploring and visualizing the potential relationships between the response and predictor variables.
[C] The featurePlot method was tried on the dataset.
## [1] 5
## [1] 4

##################################
# Formulating the strip plots
##################################
featurePlot(x = DPA.Predictors.Numeric,
y = DPA$Class,
plot = "strip",
jitter = TRUE,
scales = list(x = list(relation="free", rot = 90),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(1, (ncol(DPA.Predictors.Numeric))))

##################################
# Formulating the density plots
##################################
featurePlot(x = DPA.Predictors.Numeric,
y = DPA$Class,
plot = "density",
scales = list(x = list(relation="free", rot = 90),
y = list(relation="free")),
adjust = 1.5,
pch = "|",
layout = c(1, (ncol(DPA.Predictors.Numeric))),
auto.key = list(columns = (length(levels(DPA$Class)))))
