This package is a collection of utility functions that facilitate general predictive modeling work. Uses include, but are not limited to, diagnostic visualization, model metrics, and data quality checks. If you have feedback, or a function you would like to see in the package, please reach out to chengjun.hou@gmail.com or connect via GitHub.

To install the package, use the following command in R:

devtools::install_github("chengjunhou/pmut")


Diagnostic Visualization

pmut.edap.disc

This function draws a line plot of one discrete feature against the response, stacked above a distribution histogram of that feature. In the line plot, the discrete feature is on the x-axis and the response is on the y-axis, shown as the Actual line. NA is treated as its own level. Additional Prediction lines can be added by supplying a prediction data.frame.

pmut.edap.disc(datatable, varstring, targetstring, pred.df=NULL)

  • param datatable: Object of class data.frame or data.table
  • param varstring: Single character string indicating the column name inside datatable for the discrete feature
  • param targetstring: Single character string indicating the column name inside datatable for the response
  • param pred.df: Object of class data.frame (optional), with one column of predictions per model
  • return: A view of line plot stacked above the histogram

We use the diamonds dataset from ggplot2 for the demo:

df = data.frame(ggplot2::diamonds)
pmut.edap.disc(df, "color", "price", pred.df=data.frame(GLM=rnorm(dim(df)[1],4000,5000)))

pmut.edap.cont

This function draws a line plot of one continuous feature against the response, stacked above a distribution histogram of that feature. In the line plot, the continuous feature is cut into bins and placed on the x-axis. The response is on the y-axis, shown as the Actual line. Binning is controlled by meta and qbin. NA is treated as its own bin. Additional Prediction lines can be added by supplying a prediction data.frame.

pmut.edap.cont(datatable, varstring, targetstring, meta=c(50,4,0.01,0.99), qbin=FALSE, pred.df=NULL)

  • param datatable: Object of class data.frame or data.table
  • param varstring: Single character string indicating the column name inside datatable for the continuous feature
  • param targetstring: Single character string indicating the column name inside datatable for the response
  • param meta: Numeric vector of length 4 (default is c(50,4,0.01,0.99)): 1st is the number of bins, 2nd is the number of rounding digits for bin boundaries, 3rd and 4th are the lower and upper outlier percentiles
  • param qbin: Logical (default is FALSE), FALSE indicates equal length bins, TRUE indicates equal weight bins (quantile view)
  • param pred.df: Object of class data.frame (optional), with one column of predictions per model
  • return: A view of line plot stacked above the histogram

Note that the first bin in the following view ranges from the minimum of “carat” to its 1st percentile (meta[3]=0.01), while the last bin ranges from its 99th percentile to the maximum of “carat” (meta[4]=0.99).

pmut.edap.cont(df, "carat", "price", pred.df=data.frame(GLM1=rnorm(dim(df)[1],4000,5000),
                                                        GLM2=rnorm(dim(df)[1],2000,5000)))

Note that in the following quantile view, because the outlier percentiles are set to 0% (meta[3]=0) and 100% (meta[4]=1), meta[1] must be set to 12 to obtain 10 bins in the view. The counts within each bin are not perfectly equal because of rounding and the nature of the data.

pmut.edap.cont(df, "carat", "price", meta=c(12,2,0,1), qbin=TRUE)
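
The difference between equal length bins and equal weight bins can be sketched in base R. This is only an illustration on simulated data, not the package's code: the toy variables, the use of cut() and quantile(), and the choice of the per-bin mean as the Actual line are all assumptions.

```r
set.seed(1)
x <- rexp(1000)                    # skewed toy feature (assumption, not diamonds)
y <- 2 * x + rnorm(1000)           # toy response
x <- c(x, NA); y <- c(y, 0)        # one missing feature value

nbin <- 10
# Equal length bins: breaks evenly spaced over the range of x
breaks_len <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = nbin + 1)
# Equal weight bins: breaks at evenly spaced quantiles of x
breaks_qtl <- quantile(x, probs = seq(0, 1, length.out = nbin + 1),
                       na.rm = TRUE, names = FALSE)

# addNA() keeps missing values as their own bin
bin_len <- addNA(cut(x, breaks_len, include.lowest = TRUE), ifany = TRUE)
bin_qtl <- addNA(cut(x, breaks_qtl, include.lowest = TRUE), ifany = TRUE)

table(bin_len)                     # counts vary widely across bins
table(bin_qtl)                     # counts are roughly equal, plus one NA bin
tapply(y, bin_qtl, mean)           # per-bin mean response
```

With a skewed feature, the equal length view concentrates most rows in the first few bins, which is the situation where qbin=TRUE tends to be more readable.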

pmut.edap

This function creates visualizations for a vector of features, using either pmut.edap.disc() or pmut.edap.cont() depending on the feature class. Columns of class factor, character, and logical use pmut.edap.disc(); columns of class numeric use pmut.edap.cont(); columns of class integer use pmut.edap.disc() when the number of unique values is smaller than the number of bins specified by meta, and pmut.edap.cont() otherwise. Progress information is printed to the console.

Same arguments as pmut.edap.cont(), except that varvec takes the place of varstring.

pmut.edap(datatable, varvec, targetstring, meta=c(50,4,0.01,0.99), qbin=FALSE, pred.df=NULL)

  • param varvec: Character vector of column names inside datatable for which to produce the visualization
# output the plots into a pdf file
pdf("EDA_Diamonds.pdf", width=12, height=10)
pmut.edap(df, names(df)[-7], "price")
dev.off()
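
The class-based dispatch described above can be sketched as follows. The helper edap_type_sketch() is hypothetical and not exported by the package; it only mirrors the rules as stated in this README, with nbin standing in for meta[1].

```r
# Hypothetical helper: which plot type a column would get under the stated rules
edap_type_sketch <- function(col, nbin = 50) {
  if (is.factor(col) || is.character(col) || is.logical(col)) {
    "disc"
  } else if (is.integer(col)) {
    # integers with few distinct values are treated like categories
    if (length(unique(col)) < nbin) "disc" else "cont"
  } else if (is.numeric(col)) {
    "cont"
  } else {
    NA_character_   # other classes (e.g. Date) are not covered by the description
  }
}

toy <- data.frame(grade  = c("a", "b", "a"),
                  flag   = c(TRUE, FALSE, NA),
                  count  = c(1L, 2L, 2L),
                  weight = c(0.5, 1.2, 3.3),
                  stringsAsFactors = FALSE)
sapply(toy, edap_type_sketch)
```

The integer check comes before the numeric check because is.numeric() is also TRUE for integer columns.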

Model Metric

pmut.auc

This function calculates area under the ROC curve for prediction against actual, without any package dependency.

pmut.auc(aa, pp, plot=FALSE)

  • param aa: Vector of actuals, may be non-binary, but every non-zero value is treated as TRUE
  • param pp: Vector of predictions, may be any numeric value, although probabilities are ideal
  • param plot: Logical (default is FALSE), TRUE indicates plotting the ROC curve
  • return: A single numeric value for auc
actuals = c(1,1,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0)
predicts = rev(seq_along(actuals)); predicts[9:10] = mean(predicts[9:10])
pmut.auc(actuals, predicts, plot=TRUE)

## [1] 0.825
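
For reference, AUC can also be computed from ranks (the Mann-Whitney formulation) in a few lines of base R. This sketch is not necessarily how pmut.auc() is implemented; auc_sketch() is a hypothetical name, and tied predictions are given their average rank.

```r
# Rank-based AUC: probability that a random positive outranks a random negative
auc_sketch <- function(aa, pp) {
  pos <- aa != 0                      # any non-zero actual counts as a positive
  n1 <- sum(pos)
  n0 <- sum(!pos)
  r <- rank(pp)                       # ties receive the average rank
  (sum(r[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

actuals <- c(1,1,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0)
predicts <- rev(seq_along(actuals)); predicts[9:10] <- mean(predicts[9:10])
auc_sketch(actuals, predicts)         # 0.825
```

On this example the sketch agrees with the value printed by pmut.auc().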

pmut.gini

This function calculates the standardized gini coefficient for prediction against actual.

pmut.gini(aa, pp, print=FALSE)

  • param aa: Vector of actuals, could be any value
  • param pp: Vector of predictions, could be any value
  • param print: Logical (default is FALSE), TRUE indicates printing the original gini before standardization
  • return: A single numeric value for standardized gini
pmut.gini(actuals, predicts, print=TRUE)
## actual-prediction-gini=3.3; actual-actual-gini=5
## [1] 0.66
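
One common way to compute a standardized gini is to sum the gap between the cumulative share of actuals (ordered by prediction) and the diagonal, then divide by the same quantity for a perfect ordering. The sketch below follows that formulation; it is an assumption that pmut.gini() works this way, and the helper names and the tie-breaking rule (ties keep their input order) are illustrative.

```r
# Unstandardized gini: cumulative share of actuals minus the random baseline
gini_sum_sketch <- function(aa, pp) {
  ord <- order(-pp, seq_along(pp))    # descending prediction, ties keep input order
  lorentz <- cumsum(aa[ord]) / sum(aa)
  random <- seq_along(aa) / length(aa)
  sum(lorentz - random)
}

# Standardized gini: model ordering relative to the perfect ordering
gini_sketch <- function(aa, pp) {
  gini_sum_sketch(aa, pp) / gini_sum_sketch(aa, aa)
}

actuals <- c(1,1,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0)
predicts <- rev(seq_along(actuals)); predicts[9:10] <- mean(predicts[9:10])
gini_sum_sketch(actuals, predicts)    # 3.3
gini_sum_sketch(actuals, actuals)     # 5
gini_sketch(actuals, predicts)        # 0.66
```

On this example the three values match the numbers printed by pmut.gini().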

Data Preparation for Scoring

pmut.base.find

This function collects meta information for each column of the training data, which is later used to process new data so that it can be scored without error; see pmut.base.prep() for the preparation step. Meta information for columns of class factor, character, and logical forms one list, where each element contains three slots: $VarString is the column name, $LvlVec is the vector of unique levels, and $LvlBase is the base level, i.e. the level with the highest count. Meta information for columns of class integer and numeric forms a second list, where each element contains two slots: $VarString is the column name and $ValueMean is the column mean.

pmut.base.find(DATA)

  • param DATA: Object of class data.frame or data.table
  • return: A list of two elements, 1st being meta information for categorical columns, 2nd for numeric columns
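
To make the shape of the returned meta concrete, the following sketch builds one categorical element and one numeric element by hand. The slot names come from the description above; the toy data and the choice to ignore NA when counting levels and averaging are assumptions.

```r
train <- data.frame(color = c("E", "E", "G", NA),
                    carat = c(0.2, 0.3, NA, 0.7),
                    stringsAsFactors = FALSE)

# Categorical meta: unique levels and the most frequent level as the base
lvl_count <- table(train$color)       # table() drops NA by default
cat_meta <- list(VarString = "color",
                 LvlVec    = names(lvl_count),
                 LvlBase   = names(lvl_count)[which.max(lvl_count)])

# Numeric meta: column mean, ignoring missing values (assumption)
num_meta <- list(VarString = "carat",
                 ValueMean = mean(train$carat, na.rm = TRUE))

# Two-element list: categorical meta first, numeric meta second
meta_sketch <- list(list(cat_meta), list(num_meta))
str(meta_sketch)
```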

pmut.base.prep

This function takes the meta information generated by pmut.base.find() and prepares new data so that it can be scored without error. It performs the following steps:

  • imputes missing values with the base level (categorical) or the mean value (numeric)
  • assigns levels observed in the new data but not found in meta to the base level
  • handles levels found in meta but not observed in the new data by converting the column to a factor with the full set of levels
  • creates any column found in meta but missing from the new data, filled entirely with the base level or the mean value
  • prefixes every base level with the symbol “!”
  • orders the columns alphabetically

Note that data processed by this function will only have two classes: factor for categorical, numeric for numeric. model.matrix() will then produce a data matrix in exactly the same format, ready to be scored by a glmnet or xgboost model.

pmut.base.prep(DATA, CatMeta, NumMeta)

  • param DATA: Object of class data.frame or data.table
  • param CatMeta: List of meta information for categorical features generated by pmut.base.find
  • param NumMeta: List of meta information for numeric features generated by pmut.base.find
  • return: A data.frame or data.table ready to be scored
temp = pmut.base.find(data.frame(ggplot2::diamonds))
## ======  10  Runs ====== 
## Loop 1 carat : numeric 
## Loop 2 cut : ordered factor 
## Loop 3 color : ordered factor 
## Loop 4 clarity : ordered factor 
## Loop 5 depth : numeric 
## Loop 6 table : numeric 
## Loop 7 price : integer 
## Loop 8 x : numeric 
## Loop 9 y : numeric 
## Loop 10 z : numeric
# remove two columns
newdata = data.frame(ggplot2::diamonds)[,-c(2,6)]
# assign new color
newdata$color = "NEW"
# temp[[1]] categorical meta, temp[[2]] numeric meta 
newdata = pmut.base.prep(newdata, temp[[1]], temp[[2]])
## ====== Cat: 3 Runs ====== 
## Warn 1 cut : entire feature generated 
## Loop 2 color : success 
## Loop 3 clarity : success 
## ====== Num: 7 Runs ====== 
## Loop 1 carat : success 
## Loop 2 depth : success 
## Warn 3 table : entire feature generated 
## Loop 4 price : success 
## Loop 5 x : success 
## Loop 6 y : success 
## Loop 7 z : success
head(newdata)
##   carat clarity color    cut depth price    table    x    y    z
## 1  0.23     SI2    !G !Ideal  61.5   326 57.45718 3.95 3.98 2.43
## 2  0.21    !SI1    !G !Ideal  59.8   326 57.45718 3.89 3.84 2.31
## 3  0.23     VS1    !G !Ideal  56.9   327 57.45718 4.05 4.07 2.31
## 4  0.29     VS2    !G !Ideal  62.4   334 57.45718 4.20 4.23 2.63
## 5  0.31     SI2    !G !Ideal  63.3   335 57.45718 4.34 4.35 2.75
## 6  0.24    VVS2    !G !Ideal  62.8   336 57.45718 3.94 3.96 2.48
sapply(newdata, class)
##     carat   clarity     color       cut     depth     price     table 
## "numeric"  "factor"  "factor"  "factor" "numeric" "integer" "numeric" 
##         x         y         z 
## "numeric" "numeric" "numeric"

Note that the “!” prefix ensures that model.matrix() drops the same level every time it dummy-encodes a categorical feature. Therefore, after obtaining the meta list from the training data with pmut.base.find(), the training data also needs to be processed by pmut.base.prep() before model fitting.
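
The effect of the “!” prefix on dummy encoding can be sketched as follows. This is an illustration rather than the package's code: it assumes levels are ordered as in the C locale, where “!” sorts ahead of letters and digits, and it relies on the default treatment contrasts, under which model.matrix() drops the first level of an unordered factor.

```r
color <- c("G", "D", "E", "G", "G")
base <- "G"                                   # most frequent level

# Without the prefix, the first level alphabetically ("D") is dropped
plain <- factor(color)
colnames(model.matrix(~ color, data.frame(color = plain)))
# "(Intercept)" "colorE" "colorG"

# With the prefix, the base level sorts first and is the one dropped
marked <- ifelse(color == base, paste0("!", color), color)
lv <- sort(unique(marked), method = "radix")  # radix sort uses C-locale ordering
f <- factor(marked, levels = lv)
levels(f)
# "!G" "D" "E"
colnames(model.matrix(~ color, data.frame(color = f)))
# "(Intercept)" "colorD" "colorE"
```

Because the dropped level no longer depends on which levels happen to appear in a given dataset, training and scoring matrices end up with the same columns.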

Simple Quality Check

pmut.data.pmis

This function checks the percentage of NA values (including empty strings for character columns) in each column of the data.

pmut.data.pmis(DATA)

  • param DATA: Object of class data.frame or data.table
  • return: A named vector giving the percentage of missing values for each column
pmut.data.pmis(data.frame(ggplot2::diamonds))
##   carat     cut   color clarity   depth   table   price       x       y 
##       0       0       0       0       0       0       0       0       0 
##       z 
##       0
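
Since diamonds has no missing values, a toy example shows the check more clearly. pmis_sketch() below is a hypothetical stand-in, not the package function; in particular, this README does not state whether the result is on a 0-1 or a 0-100 scale, and the sketch assumes 0-100.

```r
# Hypothetical stand-in: share of NA (and "" for character columns) per column
pmis_sketch <- function(DATA) {
  sapply(DATA, function(col) {
    miss <- is.na(col)
    if (is.character(col)) {
      miss <- miss | (!is.na(col) & col == "")   # empty strings count as missing
    }
    100 * mean(miss)                             # 0-100 scale (assumption)
  })
}

toy <- data.frame(a = c("x", "", NA, "y"),
                  b = c(1, NA, 3, 4),
                  stringsAsFactors = FALSE)
pmis_sketch(toy)
# a: 50, b: 25
```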

pmut.data.same

This function checks if there is any duplicated column inside the data.

pmut.data.same(DATA)

  • param DATA: Object of class data.frame or data.table
  • return: A named logical vector indicating whether each column is duplicated
pmut.data.same(data.frame(ggplot2::diamonds))
##   carat     cut   color clarity   depth   table   price       x       y 
##   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE 
##       z 
##   FALSE
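
A duplicated-column check can be sketched with duplicated() applied to the list of columns. same_sketch() is hypothetical, and two details are assumptions: only the later copy is flagged, and columns must match in both values and type.

```r
# Hypothetical stand-in: flag columns that repeat an earlier column
same_sketch <- function(DATA) {
  dup <- duplicated(as.list(DATA))   # compares whole columns element-wise
  names(dup) <- names(DATA)
  dup
}

toy <- data.frame(a = 1:3, b = c(2, 4, 6), c = 1:3)
same_sketch(toy)
# a: FALSE, b: FALSE, c: TRUE
```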

pmut.data.scal

This function standardizes every numeric column in the data; non-numeric columns are returned unchanged.

pmut.data.scal(DATA)

  • param DATA: Object of class data.frame or data.table
  • return: A data.frame or data.table after standardization
head(pmut.data.scal(data.frame(ggplot2::diamonds)))
##       carat       cut color clarity      depth      table      price
## 1 -1.198157     Ideal     E     SI2 -0.1740899 -1.0996618 -0.9040868
## 2 -1.240350   Premium     E     SI1 -1.3607259  1.5855140 -0.9040868
## 3 -1.198157      Good     E     VS1 -3.3849872  3.3756312 -0.9038361
## 4 -1.071577   Premium     I     VS2  0.4541292  0.2429261 -0.9020815
## 5 -1.029384      Good     J     SI2  1.0823482  0.2429261 -0.9018308
## 6 -1.177060 Very Good     J    VVS2  0.7333376 -0.2046032 -0.9015802
##           x         y         z
## 1 -1.587823 -1.536181 -1.571115
## 2 -1.641310 -1.658759 -1.741159
## 3 -1.498677 -1.457382 -1.741159
## 4 -1.364959 -1.317293 -1.287708
## 5 -1.240155 -1.212227 -1.117663
## 6 -1.596737 -1.553692 -1.500263
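
The output above is consistent with a z-score transform (subtract the column mean, divide by the column standard deviation). A minimal sketch for a plain data.frame follows; scal_sketch() is a hypothetical name, and ignoring NA when computing the mean and standard deviation is an assumption.

```r
# Hypothetical stand-in: z-score every numeric column of a data.frame
scal_sketch <- function(DATA) {
  num <- vapply(DATA, is.numeric, logical(1))
  DATA[num] <- lapply(DATA[num], function(x) {
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  })
  DATA
}

toy <- data.frame(g = c("a", "b", "c"), v = c(1, 2, 3), stringsAsFactors = FALSE)
scal_sketch(toy)
# g is unchanged; v becomes -1, 0, 1
```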