Question 1: Glass Identification

Previewing the Glass Dataset:

Head of Glass Data
RI	Na	Mg	Al	Si	K	Ca	Type
1.52101	13.64	4.49	1.10	71.78	0.06	8.75	1
1.51761	13.89	3.60	1.36	72.73	0.48	7.83	1
1.51618	13.53	3.55	1.54	72.99	0.39	7.78	1
1.51766	13.21	3.69	1.29	72.61	0.57	8.22	1
1.51742	13.27	3.62	1.24	73.08	0.55	8.07	1

## 
## The dimension of the Glass identification dataset is [ 214 10 ]
## 
## Below is the structure of the Glass identification dataset

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Q1a: Using visualizations to explore the predictor variables:

Head of Melted Glass Data
Type	variable	value
1	RI	1.52101
1	RI	1.51761
1	RI	1.51618
1	RI	1.51766
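The long ("melted") layout previewed above can be illustrated with a minimal Python sketch. This is not the code that produced the report (its output comes from R); the helper name `melt` and the dict-based row layout are assumptions made for illustration, and only the first two rows and two predictors of the Glass preview are used.

```python
# Illustrative sketch: reshape wide rows into (Type, variable, value) records.
# `melt` is a hypothetical helper, not the function used to build the report.

def melt(rows, id_key, value_keys):
    """Return one record per (row, predictor), ordered predictor by predictor."""
    long_rows = []
    for key in value_keys:      # variable-major order, as in the melted preview
        for row in rows:
            long_rows.append((row[id_key], key, row[key]))
    return long_rows


# First two rows of the Glass preview, restricted to RI and Na.
glass_head = [
    {"RI": 1.52101, "Na": 13.64, "Type": "1"},
    {"RI": 1.51761, "Na": 13.89, "Type": "1"},
]

melted = melt(glass_head, id_key="Type", value_keys=["RI", "Na"])
for record in melted:
    print(record)
```

A long layout like this is what makes it easy to draw one density, histogram or box plot panel per predictor.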

## 
## Density Plot of all the feature variables

## 
## Histogram Plot of all the feature variables

## The histograms were plotted to give a clearer view of the data

## 
## Box Plot of all the feature variables per type

The predictors Ca, Si, Na and RI show signs of long-tailed distributions.
K and Mg have multiple peaks, which might indicate a mixture of different distributions.
Very few predictors appear to be correlated; the most obvious pair is RI and Ca.

Q1b: Do there appear to be any outliers in the data? Are any predictors skewed?

Skewness Level
	skewValue	skewLevel
RI	1.6027151	Heavily Skewed
Na	0.4478343	Symmetric
Mg	-1.1364523	Heavily Skewed
Al	0.8946104	Moderately Skewed
Si	-0.7202392	Moderately Skewed
K	6.4600889	Heavily Skewed
Ca	2.0184463	Heavily Skewed
Ba	3.3686800	Heavily Skewed
Fe	1.7298107	Heavily Skewed

Judging by the shape of the box plots, there may be outliers in the data set; it is unclear whether the extreme point in predictor K is an outlier.
The table above shows the skewness value and skewness level of each predictor.
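How a skewness value can be turned into a level such as those in the table can be sketched as follows. The report does not state which skewness estimator or cut-offs it used, so both are assumptions here: the plain moment estimator, and cut-offs of 0.5 and 1 on the absolute value, which are consistent with the labels in the table. The function names are hypothetical.

```python
# Illustrative sketch: moment-based skewness and a three-level label.
# Estimator and cut-offs are assumptions, not taken from the report's code.

def skewness(values):
    """Moment estimator of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skew_level(skew):
    """Label a skewness value using assumed cut-offs of 0.5 and 1."""
    size = abs(skew)
    if size < 0.5:
        return "Symmetric"
    if size < 1.0:
        return "Moderately Skewed"
    return "Heavily Skewed"


# Skewness values copied from the table above.
for name, value in [("Na", 0.4478343), ("Si", -0.7202392), ("K", 6.4600889)]:
    print(name, skew_level(value))
```

The sign of the value gives the direction of the tail, so Mg and Si are skewed to the left while the other heavily skewed predictors are skewed to the right.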

Q1c: Are there any relevant transformations of one or more predictors that might improve the classification model?

Having explored the data visually, some transformations may help to deal with the skewness and outliers in the data set. The following transformations could be useful:

Spatial sign transformation, to constrain the outliers and extreme values in the predictors.
Yeo-Johnson transformation, to treat the skewness, because it can handle zero or negative values.
Log and Box-Cox transformations, however, cannot be applied directly here, because they require strictly positive values and several predictors (e.g. Ba and Fe) contain zeros.
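To show why the Yeo-Johnson transformation copes with zeros and negative values, here is a minimal Python sketch of the transformation for a single value. In practice the parameter (called `lam` here) is estimated from the data; in this sketch it is simply passed in, and the function name is an assumption made for illustration.

```python
import math

# Illustrative sketch of the Yeo-Johnson transformation for one value.
# `lam` would normally be estimated from the data; here it is an input.

def yeo_johnson(y, lam):
    """Apply the Yeo-Johnson transformation with parameter lam to y."""
    if y >= 0:
        if lam == 0:
            return math.log(y + 1.0)            # defined at y = 0
        return ((y + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log(1.0 - y)
    return -((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam)


# Zero and negative inputs are handled without any shifting of the data.
for value in (0.0, 0.26, -1.5):
    print(value, yeo_johnson(value, 0.5))
```

With `lam` equal to 1 the transformation leaves every value unchanged, which is a quick way to check an implementation.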

## 
## Correlation Plot of the transformed data: Center & Scale Transformation

The spatial sign transformation has helped to constrain the outliers. It has also shifted some of the zero values in the data, e.g. in Fe and Ba. As a result, the correlation between the predictors seems to have improved.
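The way the spatial sign transformation constrains extreme values can be sketched in a few lines of Python. Each predictor is first centered and scaled, then every sample is divided by its Euclidean norm so that it lies on the unit sphere. The numbers below are made-up toy values, not Glass data, and the function names are hypothetical.

```python
import math

# Illustrative sketch: center and scale each predictor, then project every
# sample onto the unit sphere. Toy values only, not the Glass data.

def center_scale(column):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def spatial_sign(rows):
    """Divide each row (one sample) by its Euclidean norm."""
    projected = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        projected.append([v / norm if norm > 0 else 0.0 for v in row])
    return projected


x = [0.0, 1.0, 2.0, 3.0, 50.0]   # made-up predictor with one extreme value
y = [1.0, 0.0, 2.0, 1.0, 3.0]    # made-up second predictor

scaled_rows = [list(pair) for pair in zip(center_scale(x), center_scale(y))]
signed_rows = spatial_sign(scaled_rows)
for row in signed_rows:
    print([round(v, 3) for v in row])
```

After the projection no sample is further than 1 from the origin, so the extreme value no longer dominates the scale of the data.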

## 
## Density Plot of the transformed data: Yeo Johnson Transformation

The Yeo-Johnson transformation does not appear to have reduced the skewness in the data.

Question 2: Soybean Data

Previewing the Soybean Dataset:

## 
## The dimension of the Soybean dataset is [ 683 36 ]
## 
## Below is the structure of the Soybean dataset

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Q2a: Investigate the frequency distributions for the categorical predictors:

## Temp

## 
##    0    1    2 <NA> 
##   80  374  199   30

## 
##     low    norm    high missing 
##      80     374     199      30

## Date

## 
##    0    1    2    3    4    5    6 <NA> 
##   26   75   93  118  131  149   90    1

## 
##     apr     may    june    july     aug    sept     oct missing 
##      26      75      93     118     131     149      90       1

## Precip

## 
##    0    1    2 <NA> 
##   74  112  459   38

## 
##     low    norm    high missing 
##      74     112     459      38

Looking at the output above, it is clear that the factor levels of some predictors are not informative. The predictor temp, for example, consists of integer codes that stand for below average, average and above average. It would therefore be more informative to replace these integer codes with the values they represent.
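The recoding described above amounts to a lookup from integer code to label. A minimal Python sketch, using the temp counts printed above and the labels low, norm, high and missing from the recoded table; the names `TEMP_LABELS` and `recode` are assumptions made for illustration.

```python
from collections import Counter

# Illustrative sketch: replace integer codes with descriptive labels.
# None stands for a missing value (NA in the report output).
TEMP_LABELS = {"0": "low", "1": "norm", "2": "high", None: "missing"}


def recode(values, labels):
    """Map every code to its label."""
    return [labels[value] for value in values]


# Rebuild the temp column from the counts printed above; the order of the
# values does not matter for a frequency table.
temp_codes = ["0"] * 80 + ["1"] * 374 + ["2"] * 199 + [None] * 30
temp_counts = Counter(recode(temp_codes, TEMP_LABELS))
print(temp_counts)
```

Recoding changes only the labels, so the frequencies must match the original table exactly; comparing the two tables is a simple check that no level was dropped or mislabelled.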

## Barchart of the distribution of date and temp

Q2b: Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?:

## These are the predictors with missing values

##  [1] "date"            "plant.stand"     "precip"         
##  [4] "temp"            "hail"            "crop.hist"      
##  [7] "area.dam"        "sever"           "seed.tmt"       
## [10] "germ"            "plant.growth"    "leaf.halo"      
## [13] "leaf.marg"       "leaf.size"       "leaf.shread"    
## [16] "leaf.malf"       "leaf.mild"       "stem"           
## [19] "lodging"         "stem.cankers"    "canker.lesion"  
## [22] "fruiting.bodies" "ext.decay"       "mycelium"       
## [25] "int.discolor"    "sclerotia"       "fruit.pods"     
## [28] "fruit.spots"     "seed"            "mold.growth"    
## [31] "seed.discolor"   "seed.size"       "shriveling"     
## [34] "roots"

##     2-4-d-injury cyst-nematode diaporthe-pod-&-stem-blight
##  1:       0.0625             0                         0.0
##  2:       1.0000             1                         0.4
##  3:       1.0000             1                         0.0
##  4:       1.0000             1                         0.0
##  5:       1.0000             1                         1.0
##  6:       1.0000             0                         0.0
##  7:       0.0625             0                         0.0
##  8:       1.0000             1                         1.0
##  9:       1.0000             1                         1.0
## 10:       1.0000             1                         0.4
## 11:       1.0000             0                         0.0
## 12:       0.0000             1                         1.0
## 13:       0.0000             1                         1.0
## 14:       0.0000             1                         1.0
## 15:       1.0000             1                         1.0
## 16:       0.0000             1                         1.0
## 17:       1.0000             1                         1.0
## 18:       1.0000             0                         0.0
## 19:       1.0000             1                         1.0
## 20:       1.0000             1                         0.0
## 21:       1.0000             1                         0.0
## 22:       1.0000             1                         0.0
## 23:       1.0000             1                         0.0
## 24:       1.0000             1                         0.0
## 25:       1.0000             1                         0.0
## 26:       1.0000             1                         0.0
## 27:       1.0000             0                         0.0
## 28:       1.0000             1                         0.0
## 29:       1.0000             0                         0.0
## 30:       1.0000             0                         0.0
## 31:       1.0000             1                         0.0
## 32:       1.0000             0                         0.0
## 33:       1.0000             1                         0.0
## 34:       1.0000             0                         1.0
##     2-4-d-injury cyst-nematode diaporthe-pod-&-stem-blight
##     herbicide-injury phytophthora-rot
##  1:                0        0.0000000
##  2:                0        0.0000000
##  3:                1        0.0000000
##  4:                0        0.0000000
##  5:                1        0.7727273
##  6:                0        0.0000000
##  7:                0        0.0000000
##  8:                1        0.7727273
##  9:                1        0.7727273
## 10:                1        0.7727273
## 11:                0        0.0000000
## 12:                0        0.6250000
## 13:                0        0.6250000
## 14:                0        0.6250000
## 15:                0        0.6250000
## 16:                0        0.6250000
## 17:                1        0.6250000
## 18:                0        0.0000000
## 19:                1        0.7727273
## 20:                1        0.0000000
## 21:                1        0.0000000
## 22:                1        0.7727273
## 23:                1        0.0000000
## 24:                1        0.0000000
## 25:                1        0.0000000
## 26:                1        0.0000000
## 27:                0        0.7727273
## 28:                1        0.7727273
## 29:                1        0.7727273
## 30:                1        0.7727273
## 31:                1        0.7727273
## 32:                1        0.7727273
## 33:                1        0.7727273
## 34:                0        0.0000000
##     herbicide-injury phytophthora-rot

The phytophthora-rot class has a high rate of missing data: the affected predictors are missing in about 63% to 77% of its samples.
The diaporthe-pod-&-stem-blight class has a more moderate pattern of missing data.
Most predictors are missing for every sample in the 2-4-d-injury, cyst-nematode and herbicide-injury classes.
The pattern of missing data is therefore strongly related to the classes.
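A table of missing-value proportions per class, like the one above, can be computed as sketched below. The rows are made-up toy records with the placeholder classes "A" and "B", not Soybean data, and the function name is hypothetical.

```python
from collections import defaultdict

# Illustrative sketch: share of missing values of one predictor per class.
# Toy records only; None stands for a missing value.

def missing_rate_by_class(rows, class_key, predictor):
    """Return {class: fraction of rows in which `predictor` is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[class_key]] += 1
        if row[predictor] is None:
            missing[row[class_key]] += 1
    return {cls: missing[cls] / totals[cls] for cls in totals}


toy_rows = [
    {"Class": "A", "hail": None},
    {"Class": "A", "hail": None},
    {"Class": "B", "hail": "0"},
    {"Class": "B", "hail": None},
    {"Class": "B", "hail": "1"},
    {"Class": "B", "hail": "0"},
]

rates = missing_rate_by_class(toy_rows, class_key="Class", predictor="hail")
print(rates)
```

A rate of 1 for a class means the predictor carries no information at all for that class, which matters when deciding between imputation and removal.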

Q2c: Develop a strategy for handling missing data, either by eliminating predictors or imputation:

To handle the missing data, let us convert the factors to a set of dummy variables:

##        freqRatio percentUnique zeroVar   nzv
## date.0 30.500000     0.3174603   FALSE  TRUE
## date.1  8.264706     0.3174603   FALSE FALSE
## date.2  6.325581     0.3174603   FALSE FALSE
## date.3  4.727273     0.3174603   FALSE FALSE
## date.4  4.080645     0.3174603   FALSE FALSE
## date.5  3.500000     0.3174603   FALSE FALSE

## The number of predictors to remove:

## [1] 16

## The percentage of predictors to remove:

## [1] 0.1616162

Hence, eliminating 16 of the 99 dummy variables (about 16%) would help to remove unbalanced and sparse predictors.
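The conversion of a factor into dummy variables can be sketched as follows. The column naming (`date.0`, `date.1`, and so on) follows the output above; the helper name, the toy values and the choice to keep a missing value as missing in every dummy column are assumptions made for illustration.

```python
# Illustrative sketch: expand one factor into one 0/1 column per level.
# None stands for a missing value and stays missing in every dummy column.

def dummy_columns(values, levels, prefix):
    """Return {"prefix.level": [0/1/None, ...]} for each level."""
    columns = {}
    for level in levels:
        name = f"{prefix}.{level}"
        columns[name] = [None if v is None else int(v == level) for v in values]
    return columns


date_values = ["6", "4", "3", None, "0"]        # made-up toy values
date_levels = ["0", "1", "2", "3", "4", "5", "6"]

dummies = dummy_columns(date_values, date_levels, prefix="date")
for name, column in dummies.items():
    print(name, column)
```

A level that occurs rarely produces a dummy column that is almost always 0, which is why some of these columns are flagged as near-zero variance.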

Question 3: Blood Brain

Question 3a:

## Number of columns:

## [1] 134

##           freqRatio percentUnique zeroVar   nzv
## tpsa       2.142857    61.5384615   FALSE FALSE
## nbasic     1.736842     0.9615385   FALSE FALSE
## negative 207.000000     0.9615385   FALSE  TRUE
## vsa_hyd    1.000000    93.2692308   FALSE FALSE
## a_aro      1.188679     5.7692308   FALSE FALSE
## weight     1.000000    91.8269231   FALSE FALSE

Question 3b:

These are some of the predictors with degenerate distributions:

## These are the near-zero variance predictors

## [1] "negative"     "peoe_vsa.2.1" "peoe_vsa.3.1" "a_acid"      
## [5] "vsa_acid"     "frac.anion7." "alert"

## These are tables for some of them:

## alert

## 
##   0   1 
## 206   2

## a_acid

## 
##   0   2   3 
## 201   6   1
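The freqRatio and percentUnique columns shown earlier can be reproduced from frequency tables like the two above. In the sketch below, freqRatio is the count of the most common value divided by the count of the second most common, and percentUnique is the number of distinct values as a percentage of the sample size. The cut-offs (95/5 and 10) are assumed to match the defaults of caret's nearZeroVar function; the Python function name is hypothetical.

```python
from collections import Counter

# Illustrative sketch of a near-zero variance check for one predictor.
# Cut-offs are assumptions meant to mirror caret's nearZeroVar defaults.

def near_zero_var(values, freq_cut=95 / 5, unique_cut=10):
    """Return freqRatio, percentUnique and the near-zero variance flag."""
    counts = Counter(values).most_common()
    zero_var = len(counts) == 1
    freq_ratio = 0.0 if zero_var else counts[0][1] / counts[1][1]
    percent_unique = 100 * len(counts) / len(values)
    nzv = zero_var or (freq_ratio > freq_cut and percent_unique <= unique_cut)
    return {"freqRatio": freq_ratio, "percentUnique": percent_unique, "nzv": nzv}


# Columns rebuilt from the two frequency tables above.
alert = [0] * 206 + [1] * 2
a_acid = [0] * 201 + [2] * 6 + [3] * 1

print("alert", near_zero_var(alert))
print("a_acid", near_zero_var(a_acid))
```

Both predictors are dominated by a single value, so they are flagged under these cut-offs.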

We might want to remove them, which leaves the following number of predictors:

## [1] 127

Head of Skewness Level
	skewValues
tpsa	0.8570900
nbasic	0.5550790
negative	14.2148592
vsa_hyd	0.4071053
a_aro	-0.2460511
weight	0.5086773

Some predictors are heavily skewed, e.g. negative (skewness 14.2), while others are roughly symmetric, e.g. vsa_hyd and a_aro, or only mildly skewed, e.g. nbasic and weight.

Question 3c:

Looking at the correlation between the predictors

## 
## Correlation Matrices for the raw data

## 
## Correlation Matrices for the Spatial Sign Transformation

## 
## Correlation Matrices for the raw data

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.16173  0.06434  0.07068  0.28643  1.00000

## 
## Correlation Matrices for the Spatial Sign Transformation

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.16062  0.03862  0.05561  0.25545  1.00000

The plots and summaries above show that:

there are strong relationships between the predictors, as seen in the correlation matrix before transformation;
these strong relationships can be reduced through transformation;
the level of correlation drops slightly after transformation (the mean correlation falls from 0.071 to 0.056);
a better option, however, is to reduce the number of predictors, which can be done with the findCorrelation function in the caret package once a correlation cutoff has been chosen;
this should not have a dramatic effect on the number of predictors available for modeling.

Below are the number and the column indices of the highly correlated predictors flagged at a cutoff of 0.85

## [1] 56

##  [1]  22  32  40  50  51  52  61  65  67  68  77  78  80  87  88  89  90
## [18]  91  92  93  94  95  96 101 103 104 106 107 108 109 110 112 116 117
## [35] 118 119 120 121   3  28   1   5  44  43   4  58  60  62  71  82  83
## [52]  76 113  81 105  20
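The idea behind this kind of filtering can be sketched in Python. The version below is a simplified greedy procedure and not caret's findCorrelation algorithm itself: for each pair of predictors whose absolute correlation exceeds the cutoff, it drops the one with the larger mean absolute correlation to all other predictors. The data are made-up toy columns and all names are hypothetical.

```python
import math

# Illustrative sketch: a simplified greedy filter for highly correlated
# predictors. Not the exact algorithm used by caret's findCorrelation.

def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def find_correlated(columns, cutoff):
    """Return names of predictors to drop (needs at least two columns)."""
    names = list(columns)
    abs_corr = {
        a: {b: abs(pearson(columns[a], columns[b])) for b in names if b != a}
        for a in names
    }
    mean_abs = {a: sum(abs_corr[a].values()) / len(abs_corr[a]) for a in names}
    pairs = sorted(
        ((abs_corr[a][b], a, b) for i, a in enumerate(names) for b in names[i + 1:]),
        reverse=True,
    )
    dropped = []
    for corr, a, b in pairs:
        if corr <= cutoff:
            break                       # remaining pairs are below the cutoff
        if a in dropped or b in dropped:
            continue                    # pair already resolved
        dropped.append(a if mean_abs[a] > mean_abs[b] else b)
    return dropped


toy = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],          # nearly a multiple of "a"
    "c": [3, 1, 4, 1, 5, 9],
}
to_drop = find_correlated(toy, cutoff=0.85)
print(to_drop)
```

Lowering the cutoff removes more predictors, so the cutoff should be chosen with the number of predictors left for modeling in mind.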

Predictive Modeling

Ibrahim Odumas Odufowora

2017-10-04