This package is a collection of utility functions that facilitate general predictive modeling work. Function usages include, but are not limited to, diagnostic visualization, model metrics, and data quality checks. If you have any feedback, or would like a function added to the package, please reach out to chengjun.hou@gmail.com or connect via GitHub.
To install the package, use the following command in R:
devtools::install_github("chengjunhou/pmut")
This function plots one discrete feature against the response as a line plot, together with a distribution histogram for that feature. In the line plot, the discrete feature is on the x-axis and the response, labeled Actual, is on the y-axis. NA is treated as its own level. Additional Prediction lines can be added by supplying a prediction data.frame.
pmut.edap.disc(datatable, varstring, targetstring, pred.df=NULL)
datatable: data.frame or data.table
varstring: name of the column in datatable for the discrete feature
targetstring: name of the column in datatable for the response
pred.df: data.frame (optional), with each column being the prediction from one model
We use the diamonds dataset from ggplot2 for the demo:
df = data.frame(ggplot2::diamonds)
pmut.edap.disc(df, "color", "price", pred.df=data.frame(GLM=rnorm(dim(df)[1],4000,5000)))
This function plots one continuous feature against the response as a line plot, together with a distribution histogram for that feature. In the line plot, the continuous feature is cut into bins and placed on the x-axis, and the response, labeled Actual, is on the y-axis. Binning is controlled by meta and qbin. NA is treated as its own bin. Additional Prediction lines can be added by supplying a prediction data.frame.
pmut.edap.cont(datatable, varstring, targetstring, meta=c(50,4,0.01,0.99), qbin=FALSE, pred.df=NULL)
datatable: data.frame or data.table
varstring: name of the column in datatable for the continuous feature
targetstring: name of the column in datatable for the response
pred.df: data.frame (optional), with each column being the prediction from one model
Note that the first bin in the following view ranges from the minimum of “carat” to its 1st percentile (meta[3]=0.01), while the last bin ranges from the 99th percentile to the maximum of “carat” (meta[4]=0.99).
pmut.edap.cont(df, "carat", "price", pred.df=data.frame(GLM1=rnorm(dim(df)[1],4000,5000),
GLM2=rnorm(dim(df)[1],2000,5000)))
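The binning described above can be approximated in base R. This is only a sketch of one plausible rule (one outlier bin below the lower percentile, one above the upper percentile, equal-width bins in between). The names n_bins, p_lo, and p_hi are hypothetical stand-ins for meta[1], meta[3], and meta[4], the data are simulated, and the package's actual cut points may differ.

```r
# Sketch only: NOT the package's implementation.
set.seed(1)
x <- rexp(1000)                      # hypothetical skewed continuous feature
n_bins <- 10                         # assumed to include the two outlier bins
p_lo <- 0.01
p_hi <- 0.99

# Percentile cut-offs that separate the outlier bins from the interior
q <- quantile(x, c(p_lo, p_hi), names = FALSE)

# min -> p_lo percentile, (n_bins - 2) equal-width interior bins, p_hi percentile -> max
breaks <- unique(c(min(x), seq(q[1], q[2], length.out = n_bins - 1), max(x)))
inner <- cut(x, breaks = breaks, include.lowest = TRUE)

# Give NA its own level, mirroring the "NA as its own bin" behaviour described above
bins <- addNA(inner)
table(bins)
```

The first and last bins hold roughly 1% of the rows each, which keeps extreme values from stretching the interior bins.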
Note that in the following quantile view, because the outlier percentiles are set to 0% (meta[3]=0) and 100% (meta[4]=1), meta[1] must be set to 12 to get 10 bins in the view. The counts within each bin are not perfectly equal because of rounding and the nature of the data.
pmut.edap.cont(df, "carat", "price", meta=c(12,2,0,1), qbin=TRUE)
This function creates visualizations for a vector of features, using either pmut.edap.disc() or pmut.edap.cont() depending on the feature class. Columns of class factor, character, and logical use pmut.edap.disc(). Columns of class numeric use pmut.edap.cont(). Columns of class integer use pmut.edap.disc() when they have fewer unique values than the number of bins specified by meta, and pmut.edap.cont() otherwise. Progress information is printed to the console.
Same arguments as pmut.edap.cont() except varvec.
pmut.edap(datatable, varvec, targetstring, meta=c(50,4,0.01,0.99), qbin=FALSE, pred.df=NULL)
varvec: names of the columns in datatable to productionalize the visualization
# output the plots into a pdf file
pdf("EDA_Diamonds.pdf", width=12, height=10)
pmut.edap(df, names(df)[-7], "price")
dev.off()
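The dispatch rule described for pmut.edap() can be sketched in base R as follows. The helper choose_plot() and the toy data are hypothetical and exist only for illustration; n_bins stands in for the number of bins given by meta, and the package's own checks may differ in detail.

```r
# Hypothetical helper: reports which plotting function a column would be sent to.
choose_plot <- function(x, n_bins = 50) {
  # Categorical-like classes always go to the discrete plot
  if (is.factor(x) || is.character(x) || is.logical(x)) return("pmut.edap.disc")
  # Integers with few distinct values are treated as discrete
  if (is.integer(x) && length(unique(x)) < n_bins) return("pmut.edap.disc")
  # Remaining numeric columns (doubles, high-cardinality integers) are continuous
  if (is.numeric(x)) return("pmut.edap.cont")
  stop("unsupported column class: ", class(x)[1])
}

toy <- data.frame(
  a = seq(0, 1, length.out = 100),   # double
  b = rep(1:4, 25),                  # integer, 4 unique values
  e = 1:100,                         # integer, 100 unique values
  c = rep(c("x", "y"), 50),          # character
  d = rep(c(TRUE, FALSE), 50),       # logical
  stringsAsFactors = FALSE
)
res <- sapply(toy, choose_plot)
res
```

The integer check has to come before the generic numeric check because is.numeric() is also TRUE for integer vectors.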
This function calculates the area under the ROC curve for predictions against actuals, without any package dependency.
pmut.auc(aa, pp, plot=FALSE)
actuals = c(1,1,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0)
predicts = rev(seq_along(actuals)); predicts[9:10] = mean(predicts[9:10])
pmut.auc(actuals, predicts, plot=TRUE)
## [1] 0.825
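For reference, a dependency-free AUC can be computed from ranks (the Mann-Whitney formulation). The sketch below is an assumption about one way to do it, not necessarily how pmut.auc() is implemented; auc_rank() is a hypothetical name. On the example data above it gives the same value of 0.825.

```r
# Hypothetical helper: rank-based (Mann-Whitney) AUC in base R.
auc_rank <- function(actual, predicted) {
  r <- rank(predicted)               # ties receive average ranks
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  # U statistic of the positives, scaled by the number of positive/negative pairs
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

actuals <- c(1,1,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0)
predicts <- rev(seq_along(actuals)); predicts[9:10] <- mean(predicts[9:10])
auc_rank(actuals, predicts)
```

Average ranks count a tied positive/negative pair as half a correct ordering, which is the usual convention for ties in AUC.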
This function calculates the standardized Gini coefficient for predictions against actuals.
pmut.gini(aa, pp, print=FALSE)
pmut.gini(actuals, predicts, print=TRUE)
## actual-prediction-gini=3.3; actual-actual-gini=5
## [1] 0.66
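One common formulation of the normalized Gini coefficient is sketched below in base R. Whether pmut.gini() uses exactly this formulation is an assumption, and gini_sum() and norm_gini() are hypothetical names. With ties kept in their original row order, the sketch yields the same three figures printed above (3.3, 5, and 0.66) on the example data.

```r
# Hypothetical helpers: cumulative-gain style Gini, normalized by the perfect model.
gini_sum <- function(actual, score) {
  ord <- order(-score)                            # descending; ties keep original order
  lorentz <- cumsum(actual[ord]) / sum(actual)    # share of positives captured so far
  random <- seq_along(actual) / length(actual)    # share captured by random ordering
  sum(lorentz - random)
}

norm_gini <- function(actual, predicted) {
  # Ratio of the model's Gini to the Gini of a perfect ordering
  gini_sum(actual, predicted) / gini_sum(actual, actual)
}

actuals <- c(1,1,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,0)
predicts <- rev(seq_along(actuals)); predicts[9:10] <- mean(predicts[9:10])
norm_gini(actuals, predicts)
```

Because ties are resolved by row order here, the result can differ slightly from 2 * AUC - 1, which treats a tied pair as half correct.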
This function finds the meta information for each column within the training data. The meta information is later used to process new data so that it can be scored without error; see pmut.base.prep() for the preparation part. Meta information for columns of class factor, character, and logical forms one list. Each element of that list contains three slots: $VarString is the column name, $LvlVec is the vector of unique levels, and $LvlBase is the base level name, which is the level with the highest count. Meta information for columns of class integer and numeric forms another list. Each element of that list contains two slots: $VarString is the column name and $ValueMean is the column mean.
pmut.base.find(DATA)
DATA: data.frame or data.table
This function takes the meta information generated by pmut.base.find() and prepares new data so that it can be scored without error. It does the following: it imputes missing values with the base level (categorical) or the mean value (numeric); it assigns levels observed in the new data but not found in meta to the base level; it handles levels found in meta but not observed in the new data by treating the column as a factor and specifying the levels explicitly; it handles an entire column found in meta but missing from the new data by filling the whole column with the base level or mean value; it prefixes every base level with the symbol “!”; lastly, it orders the columns alphabetically. Note that data processed by this function will have only two classes: factor for categorical columns and numeric for numeric columns. model.matrix() will then produce a data matrix in exactly the same format, ready to be scored by a glmnet or xgboost model.
pmut.base.prep(DATA, CatMeta, NumMeta)
DATA: data.frame or data.table
CatMeta: categorical meta list from pmut.base.find
NumMeta: numeric meta list from pmut.base.find
Value: data.frame or data.table ready to be scored
temp = pmut.base.find(data.frame(ggplot2::diamonds))
## ====== 10 Runs ======
## Loop 1 carat : numeric
## Loop 2 cut : ordered factor
## Loop 3 color : ordered factor
## Loop 4 clarity : ordered factor
## Loop 5 depth : numeric
## Loop 6 table : numeric
## Loop 7 price : integer
## Loop 8 x : numeric
## Loop 9 y : numeric
## Loop 10 z : numeric
# remove two columns
newdata = data.frame(ggplot2::diamonds)[,-c(2,6)]
# assign new color
newdata$color = "NEW"
# temp[[1]] categorical meta, temp[[2]] numeric meta
newdata = pmut.base.prep(newdata, temp[[1]], temp[[2]])
## ====== Cat: 3 Runs ======
## Warn 1 cut : entire feature generated
## Loop 2 color : success
## Loop 3 clarity : success
## ====== Num: 7 Runs ======
## Loop 1 carat : success
## Loop 2 depth : success
## Warn 3 table : entire feature generated
## Loop 4 price : success
## Loop 5 x : success
## Loop 6 y : success
## Loop 7 z : success
head(newdata)
## carat clarity color cut depth price table x y z
## 1 0.23 SI2 !G !Ideal 61.5 326 57.45718 3.95 3.98 2.43
## 2 0.21 !SI1 !G !Ideal 59.8 326 57.45718 3.89 3.84 2.31
## 3 0.23 VS1 !G !Ideal 56.9 327 57.45718 4.05 4.07 2.31
## 4 0.29 VS2 !G !Ideal 62.4 334 57.45718 4.20 4.23 2.63
## 5 0.31 SI2 !G !Ideal 63.3 335 57.45718 4.34 4.35 2.75
## 6 0.24 VVS2 !G !Ideal 62.8 336 57.45718 3.94 3.96 2.48
sapply(newdata, class)
## carat clarity color cut depth price table
## "numeric" "factor" "factor" "factor" "numeric" "integer" "numeric"
## x y z
## "numeric" "numeric" "numeric"
Note that the symbol “!” is attached so that model.matrix() drops the same level, the base level, when dummy-encoding a categorical feature. Therefore, after obtaining the meta lists from the training data with pmut.base.find(), the training data must also be processed by pmut.base.prep() before model fitting.
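The idea can be illustrated for a single categorical column in base R. The helpers find_base() and prep_cat() below are hypothetical simplifications, not the package's code, and they assume the marked base level is placed first among the factor levels. With default treatment contrasts, model.matrix() drops the first level, so training and new data end up with identical dummy columns.

```r
# Hypothetical helpers for ONE categorical column; a sketch, not the package's code.
find_base <- function(x) {
  tab <- table(as.character(x))                  # NA is excluded from the counts
  list(LvlVec = names(tab), LvlBase = names(tab)[which.max(tab)])
}

prep_cat <- function(x, meta) {
  x <- as.character(x)
  # Missing values and levels unseen in training fall back to the base level
  x[is.na(x) | !(x %in% meta$LvlVec)] <- meta$LvlBase
  # Mark the base level with "!" and put it first so it becomes the reference
  marked <- paste0("!", meta$LvlBase)
  x[x == meta$LvlBase] <- marked
  factor(x, levels = c(marked, setdiff(meta$LvlVec, meta$LvlBase)))
}

train_color <- c("G", "E", "G", "J", "G")        # toy training column
meta <- find_base(train_color)
train_prep <- prep_cat(train_color, meta)
new_prep <- prep_cat(c("E", NA, "NEW", "G"), meta)

# Same dummy columns for both, even though "J" never appears in the new data
cols_train <- colnames(model.matrix(~ color, data.frame(color = train_prep)))
cols_new <- colnames(model.matrix(~ color, data.frame(color = new_prep)))
cols_train
cols_new
```

Fixing the factor levels from the training meta is what keeps the column for “J” present (as all zeros) when that level is absent from the new data.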
This function checks the percentage of NA values (including empty strings for character columns) inside each column of the data.
pmut.data.pmis(DATA)
DATA: data.frame or data.table
pmut.data.pmis(data.frame(ggplot2::diamonds))
## carat cut color clarity depth table price x y
## 0 0 0 0 0 0 0 0 0
## z
## 0
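A missing-value check of this kind can be sketched in base R as follows. The helper pct_missing() and the toy data are hypothetical, and the sketch reports fractions between 0 and 1; the scale used by pmut.data.pmis() is not stated here and may differ.

```r
# Hypothetical helper: fraction of missing entries per column.
pct_missing <- function(df) {
  sapply(df, function(col) {
    miss <- is.na(col)
    # For character columns, an empty string also counts as missing
    if (is.character(col)) miss <- miss | (!is.na(col) & col == "")
    mean(miss)
  })
}

toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "", NA, "y"),
  stringsAsFactors = FALSE
)
res <- pct_missing(toy)
res
```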
This function checks whether any column inside the data duplicates another column.
pmut.data.same(DATA)
DATA: data.frame or data.table
pmut.data.same(data.frame(ggplot2::diamonds))
## carat cut color clarity depth table price x y
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## z
## FALSE
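In base R, duplicated columns can be flagged by treating the data.frame as a list of columns. This is a sketch under the assumption that a column counts as duplicated when an earlier column holds exactly the same values and type; dup_cols() is a hypothetical name and pmut.data.same() may define duplication differently.

```r
# Hypothetical helper: TRUE for every column that repeats an earlier column.
dup_cols <- function(df) {
  # duplicated() on a list compares whole elements, i.e. entire columns
  setNames(duplicated(as.list(df)), names(df))
}

toy <- data.frame(a = 1:3, b = c(2, 2, 2), c = 1:3)
res <- dup_cols(toy)
res
```

Only the later copy is flagged, so dropping the flagged columns keeps one instance of each.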
This function standardizes the numeric columns inside the data.
pmut.data.scal(DATA)
DATA: data.frame or data.table
Value: data.frame or data.table after standardization
head(pmut.data.scal(data.frame(ggplot2::diamonds)))
## carat cut color clarity depth table price
## 1 -1.198157 Ideal E SI2 -0.1740899 -1.0996618 -0.9040868
## 2 -1.240350 Premium E SI1 -1.3607259 1.5855140 -0.9040868
## 3 -1.198157 Good E VS1 -3.3849872 3.3756312 -0.9038361
## 4 -1.071577 Premium I VS2 0.4541292 0.2429261 -0.9020815
## 5 -1.029384 Good J SI2 1.0823482 0.2429261 -0.9018308
## 6 -1.177060 Very Good J VVS2 0.7333376 -0.2046032 -0.9015802
## x y z
## 1 -1.587823 -1.536181 -1.571115
## 2 -1.641310 -1.658759 -1.741159
## 3 -1.498677 -1.457382 -1.741159
## 4 -1.364959 -1.317293 -1.287708
## 5 -1.240155 -1.212227 -1.117663
## 6 -1.596737 -1.553692 -1.500263
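The output above is consistent with a z-score transformation (subtract the column mean, divide by the sample standard deviation) applied to every numeric column, with other columns left unchanged. A base R sketch under that assumption is shown below; scale_numeric() is a hypothetical name, and the built-in iris data is used only to keep the example self-contained.

```r
# Hypothetical helper: z-score every numeric column, leave other columns as they are.
scale_numeric <- function(df) {
  num <- sapply(df, is.numeric)                  # integer columns count as numeric too
  df[num] <- lapply(df[num], function(col) {
    (col - mean(col, na.rm = TRUE)) / sd(col, na.rm = TRUE)
  })
  df
}

out <- scale_numeric(iris)
head(out)
```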