Applied Predictive Modeling.

Instructions

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

URL: http://appliedpredictivemodeling.com/

Exercises

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(lattice)
histogram(~ values | ind, data = stack(Glass[, 1:9]), scales = list(x = list(relation = "free")))

First, I will visualize the distribution of each predictor variable. In this case, we will look at the underlying distributions produced with the histogram function from the lattice library, one panel per predictor.

From the above histograms, we can see that some predictors appear approximately normally distributed, while others are left- or right-skewed; Mg is the exception, showing a bimodal (“saddle”-like) distribution.

library(e1071)
sapply(Glass[, 1:9], skewness)

Let’s see some values related to skewness; for this I will make use of the skewness function from the e1071 library.

Predictor   Skewness   Interpretation
RI            1.60     Right-skewed
Na            0.45     Right-skewed
Mg           -1.14     Left-skewed
Al            0.89     Right-skewed
Si           -0.72     Left-skewed
K             6.46     Right-skewed
Ca            2.02     Right-skewed
Ba            3.37     Right-skewed
Fe            1.73     Right-skewed

Intuitively, skewness is a measure of asymmetry. As a rule, negative skewness indicates that the mean of the data values is less than the median and the distribution is left-skewed, while positive skewness indicates that the mean is larger than the median and the distribution is right-skewed, Chi Yau (2013). A skewness value near zero suggests an approximately symmetric distribution, with the mean close to the median, although symmetry by itself does not imply normality.
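
As a quick illustration of this rule, here is a simulated right-skewed sample (a demonstration only, unrelated to the Glass data):

library(e1071)
set.seed(1)
x <- rexp(1000)  # right-skewed: positive skewness, mean greater than median
c(skewness = skewness(x), mean = mean(x), median = median(x))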

library(corrplot)
corrplot(cor(Glass[, 1:9]), order = "hclust")

Now, let’s visualize the underlying relationships between predictors. For this, I will make use of the corrplot function from the corrplot library.

From the above graph, it is easy to identify some strong correlations among the predictors; that is, some pairs are strongly positive (dark blue) while others are strongly negative (dark brown).

plot(RI ~ Ca, data = Glass)

For example, Ca is strongly positively correlated with RI. Let’s visualize this relationship to get a better idea.

Indeed, there appears to be a clear positive relationship between these two predictors.

plot(RI ~ Si, data = Glass)

Let’s visualize a strongly negative correlation, that is, RI and Si.

Indeed, there appears to be a negative relationship: RI tends to decrease as Si increases.

b)

Do there appear to be any outliers in the data? Are any predictors skewed?

From the above graphs, we can infer that there are some outliers present in the data; also, as shown previously, several predictors are moderately to strongly skewed.

c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

I believe that some relevant transformations might improve the classification model.

For this approach, I will apply a series of transformations to the predictors. The caret function preProcess can transform, center, scale, or impute values, as well as apply the spatial sign transformation and feature extraction. The function calculates the required quantities for the transformation; after calling preProcess, the predict method applies the results to a set of data, Kuhn, M. & Johnson, K. (2018).

library(caret)
trans <- preProcess(Glass[, 1:9], method = c("BoxCox", "center", "scale", "pca"))
## Created from 214 samples and 9 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (0)
##   - principal component signal extraction (9)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance

The primary advantage of PCA, and the reason that it has retained its popularity as a data reduction method, is that it creates components that are uncorrelated.

The subsequent PCs are derived such that these linear combinations capture the most remaining variability while also being uncorrelated with all previous PCs.

Let’s see which transformation was applied to each predictor.
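
The listing below can be obtained by inspecting the method element of the trans object created above (a minimal sketch):

trans$method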

## $BoxCox
## [1] "RI" "Na" "Al" "Si" "Ca"
## 
## $center
## [1] "RI" "Na" "Mg" "Al" "Si" "K"  "Ca" "Ba" "Fe"
## 
## $scale
## [1] "RI" "Na" "Mg" "Al" "Si" "K"  "Ca" "Ba" "Fe"
## 
## $pca
## [1] "RI" "Na" "Mg" "Al" "Si" "K"  "Ca" "Ba" "Fe"
## 
## $ignore
## character(0)

Let’s apply the transformations and look at the results.

transformed <- predict(trans, Glass[, 1:9])

Now, let’s visualize the correlations between the resulting principal components. For this, I will again make use of the corrplot function from the corrplot library. The idea: they should be essentially uncorrelated.
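
A minimal sketch of the chunk assumed here, correlating the principal component scores returned by predict():

library(corrplot)
corrplot(cor(transformed), order = "hclust")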

Indeed, after the above transformations, the correlations between the PCs are essentially zero.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Let’s see what ?Soybean says. The results below come from R’s help() function.

Description

There are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value “dna” means does not apply. The values for attributes are encoded numerically, with the first value encoded as “0,” the second as “1,” and so forth.

Usage

data(Soybean)

Format

A data frame with 683 observations on 36 variables. There are 35 categorical attributes, all numerical and a nominal denoting the class.

[,1]    Class   the 19 classes
[,2]    date    apr(0),may(1),june(2),july(3),aug(4),sept(5),oct(6).
[,3]    plant.stand normal(0),lt-normal(1).
[,4]    precip  lt-norm(0),norm(1),gt-norm(2).
[,5]    temp    lt-norm(0),norm(1),gt-norm(2).
[,6]    hail    yes(0),no(1).
[,7]    crop.hist   dif-lst-yr(0),s-l-y(1),s-l-2-y(2), s-l-7-y(3).
[,8]    area.dam    scatter(0),low-area(1),upper-ar(2),whole-field(3).
[,9]    sever   minor(0),pot-severe(1),severe(2).
[,10]   seed.tmt    none(0),fungicide(1),other(2).
[,11]   germ    90-100%(0),80-89%(1),lt-80%(2).
[,12]   plant.growth    norm(0),abnorm(1).
[,13]   leaves  norm(0),abnorm(1).
[,14]   leaf.halo   absent(0),yellow-halos(1),no-yellow-halos(2).
[,15]   leaf.marg   w-s-marg(0),no-w-s-marg(1),dna(2).
[,16]   leaf.size   lt-1/8(0),gt-1/8(1),dna(2).
[,17]   leaf.shread absent(0),present(1).
[,18]   leaf.malf   absent(0),present(1).
[,19]   leaf.mild   absent(0),upper-surf(1),lower-surf(2).
[,20]   stem    norm(0),abnorm(1).
[,21]   lodging yes(0),no(1).
[,22]   stem.cankers    absent(0),below-soil(1),above-s(2),ab-sec-nde(3).
[,23]   canker.lesion   dna(0),brown(1),dk-brown-blk(2),tan(3).
[,24]   fruiting.bodies absent(0),present(1).
[,25]   ext.decay   absent(0),firm-and-dry(1),watery(2).
[,26]   mycelium    absent(0),present(1).
[,27]   int.discolor    none(0),brown(1),black(2).
[,28]   sclerotia   absent(0),present(1).
[,29]   fruit.pods  norm(0),diseased(1),few-present(2),dna(3).
[,30]   fruit.spots absent(0),col(1),br-w/blk-speck(2),distort(3),dna(4).
[,31]   seed    norm(0),abnorm(1).
[,32]   mold.growth absent(0),present(1).
[,33]   seed.discolor   absent(0),present(1).
[,34]   seed.size   norm(0),lt-norm(1).
[,35]   shriveling  absent(0),present(1).
[,36]   roots   norm(0),rotted(1),galls-cysts(2).

Preamble: A rule of thumb for detecting near-zero variance predictors is:

  • The fraction of unique values over the sample size is low (say 10 %).
  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).

If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model.
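
As a quick check of this rule of thumb, caret’s nearZeroVar function computes both quantities (a sketch; its default cutoffs are freqCut = 95/5 and uniqueCut = 10):

library(caret)
# freqRatio and percentUnique correspond to the two criteria above;
# nzv flags predictors that satisfy both cutoffs
nearZeroVar(Soybean, saveMetrics = TRUE)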

Now, from frequency plots of each predictor (and the summary above), we can quickly identify some distributions that fit this rule of thumb; candidates include leaf.malf, leaf.mild, lodging, mycelium, int.discolor and sclerotia.

Let’s look at the frequency calculations for those predictors.
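
The frequency tables below can be produced with table(); a minimal sketch for the candidate predictors:

lapply(Soybean[, c("leaf.malf", "leaf.mild", "lodging", "mycelium", "int.discolor", "sclerotia")], table)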

leaf.malf

Var1 Freq
0 554
1 45
* Data: leaf.malf

From the above, the frequency ratio is 554 / 45 ≈ 12, which is below the ~20 cutoff, so this predictor can be kept.

leaf.mild

Var1 Freq
0 535
1 20
2 20
* Data: leaf.mild

From the above, the frequency ratio is 535 / 20 ≈ 27, which exceeds the ~20 cutoff, so this predictor is a candidate for removal.

lodging

Var1 Freq
0 520
1 42
* Data: lodging

From the above, the frequency ratio is 520 / 42 ≈ 12, which is below the ~20 cutoff, so this predictor can be kept.

mycelium

Var1 Freq
0 639
1 6
* Data: mycelium

From the above, the frequency ratio is 639 / 6 ≈ 106, which far exceeds the ~20 cutoff, so this predictor is a candidate for removal.

int.discolor

Var1 Freq
0 581
1 44
2 20
* Data: int.discolor

From the above, the frequency ratio is 581 / 44 ≈ 13, which is below the ~20 cutoff, so this predictor can be kept.

sclerotia

Var1 Freq
0 625
1 20
* Data: sclerotia

From the above, the frequency ratio is 625 / 20 ≈ 31, which exceeds the ~20 cutoff, so this predictor is a candidate for removal.

In summary, several of these distributions are indeed degenerate: leaf.mild, mycelium and sclerotia.

b)

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

In order to answer, I would like to have a visualization of the missing data. For this purpose, I will make use of the function vis_miss from the library naniar.
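
A minimal sketch of that call:

library(naniar)
vis_miss(Soybean)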

As we can see, the missing values occur in blocks of observations. The overall proportion of missing cells is only about 9.5%; the 18% figure in the question refers to observations: 562 of the 683 observations are complete cases, so roughly 18% of the rows contain at least one missing value.

Also, from the above visualization we can quickly see that some predictors are more prone to missing values than others.

Let’s examine whether the missing values are related to the classes.

First, let’s isolate the observations that have missing data: there are 121 records with at least one missing value.
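
A quick check of these counts with complete.cases():

miss_rows <- !complete.cases(Soybean)
sum(miss_rows)         # 121 incomplete observations
mean(miss_rows) * 100  # roughly 18% of the 683 observations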

Now, let’s see the frequency of classes in the missing data.
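
One way to build the comparison table below (a sketch using base R):

table(Soybean$Class, ifelse(complete.cases(Soybean), "Complete", "Missing"))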

Let’s read the respective frequency distribution values:

Class                          Complete Freq   Missing Freq
2-4-d-injury                          0             16
alternarialeaf-spot                  91              0
anthracnose                          44              0
bacterial-blight                     20              0
bacterial-pustule                    20              0
brown-spot                           92              0
brown-stem-rot                       44              0
charcoal-rot                         20              0
cyst-nematode                         0             14
diaporthe-pod-&-stem-blight           0             15
diaporthe-stem-canker                20              0
downy-mildew                         20              0
frog-eye-leaf-spot                   91              0
herbicide-injury                      0              8
phyllosticta-leaf-spot               20              0
phytophthora-rot                     20             68
powdery-mildew                       20              0
purple-seed-stain                    20              0
rhizoctonia-root-rot                 20              0
* Comparing Data: Class

From the above table, we can confirm that the pattern of missing data is indeed related to the classes: the missing values are concentrated in just five classes (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury and phytophthora-rot).

c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

From a), we can already conclude that it is better to eliminate some of the predictors, since they carry very little information (near-zero variance): leaf.mild, mycelium and sclerotia.

Now, since some classes consist entirely of observations with missing data, it would not be wise to remove those observations; otherwise we would run the risk of never being able to predict those classes at all.

Let’s see how much data is missing for each predictor.
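
A sketch of the per-predictor missing percentages that inform the table below:

sort(round(colMeans(is.na(Soybean)) * 100, 1), decreasing = TRUE)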

Now that we have these summaries, it is important to note that informative missingness can induce significant bias in the model; hence, from my perspective, it is reasonable to proceed as follows:

Index   Predictor         Action                           Reason
[,1]    Class             Keep
[,2]    date              Keep
[,3]    plant.stand       Impute random values (0, 1)      % missing is small
[,4]    precip            Impute via K-nearest neighbors   % missing is small
[,5]    temp              Impute via K-nearest neighbors   % missing is small
[,6]    hail              Impute random values (0, 1)      % missing is moderate (<20%)
[,7]    crop.hist         Impute via K-nearest neighbors   % missing is moderate (<20%)
[,8]    area.dam          Impute via K-nearest neighbors   % missing is small
[,9]    sever             Impute via K-nearest neighbors   % missing is moderate (<20%)
[,10]   seed.tmt          Impute via K-nearest neighbors   % missing is moderate (<20%)
[,11]   germ              Impute via K-nearest neighbors   % missing is small
[,12]   plant.growth      Impute random values (0, 1)      % missing is small
[,13]   leaves            Impute random values (0, 1)      % missing is small
[,14]   leaf.halo         Impute via K-nearest neighbors   % missing is moderate (<20%)
[,15]   leaf.marg         Impute via K-nearest neighbors   % missing is moderate (<20%)
[,16]   leaf.size         Impute via K-nearest neighbors   % missing is moderate (<20%)
[,17]   leaf.shread       Impute random values (0, 1)      % missing is moderate (<20%)
[,18]   leaf.malf         Impute random values (0, 1)      % missing is moderate (<20%)
[,19]   leaf.mild         Remove                           Distribution is degenerate
[,20]   stem              Impute random values (0, 1)      % missing is small
[,21]   lodging           Impute random values (0, 1)      % missing is moderate (<20%)
[,22]   stem.cankers      Impute via K-nearest neighbors   % missing is small
[,23]   canker.lesion     Impute via K-nearest neighbors   % missing is small
[,24]   fruiting.bodies   Impute random values (0, 1)      % missing is small
[,25]   ext.decay         Impute via K-nearest neighbors   % missing is small
[,26]   mycelium          Remove                           Distribution is degenerate
[,27]   int.discolor      Impute via K-nearest neighbors   % missing is small
[,28]   sclerotia         Remove                           Distribution is degenerate
[,29]   fruit.pods        Impute via K-nearest neighbors   % missing is small
[,30]   fruit.spots       Impute via K-nearest neighbors   % missing is small
[,31]   seed              Impute random values (0, 1)      % missing is moderate (<20%)
[,32]   mold.growth       Impute random values (0, 1)      % missing is moderate (<20%)
[,33]   seed.discolor     Impute random values (0, 1)      % missing is moderate (<20%)
[,34]   seed.size         Impute random values (0, 1)      % missing is moderate (<20%)
[,35]   shriveling        Impute random values (0, 1)      % missing is moderate (<20%)
[,36]   roots             Impute via K-nearest neighbors   % missing is small
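
A minimal sketch of how the removal plus a KNN-based imputation could be carried out with caret; the numeric recoding of the factor levels and the use of knnImpute for every remaining predictor are simplifying assumptions, and the random-value imputation for the binary predictors is omitted for brevity:

library(caret)
# Drop the degenerate predictors identified in part a)
soy <- Soybean[, !(names(Soybean) %in% c("leaf.mild", "mycelium", "sclerotia"))]
# The factor levels are the strings "0", "1", ...; recode them as numbers so that
# caret's knnImpute (which operates on numeric data) can be applied
soy_num <- data.frame(lapply(soy[, -1], function(x) as.numeric(as.character(x))))
pp <- preProcess(soy_num, method = "knnImpute")
soy_imputed <- predict(pp, soy_num)  # centered, scaled and imputed predictors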

References

Chi Yau. 2013. R Tutorial with Bayesian Statistics Using Openbugs. USA: R-Tutor.com. http://www.r-tutor.com.

Kuhn, M. & Johnson, K. 2018. Applied Predictive Modeling. USA: Pfizer Global R&D. http://appliedpredictivemodeling.com/.

R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.