| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|
| 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0 | 1 |
| 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0 | 1 |
| 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0 | 1 |
| 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0 | 1 |
| 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0 | 1 |
##
## The dimension of the Glass identification dataset is [ 214 10 ]
##
## Below is the structure of the Glass identification dataset
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
| Type | variable | value |
|---|---|---|
| 1 | RI | 1.52101 |
| 1 | RI | 1.51761 |
| 1 | RI | 1.51618 |
| 1 | RI | 1.51766 |
##
## Density Plot of all the feature variables
##
## Histogram Plot of all the feature variables
## The histogram plots were drawn to give a clearer picture of the data
##
## Box Plot of all the feature variables per type
The predictors Ca, Si, Na and RI show signs of a long-tailed distribution.
K and Mg have multiple peaks, which might indicate a mixture of different distributions.
Very few predictors seem to be correlated; an obvious instance is RI and Ca. However, most of the predictors
| Predictor | skewValue | skewLevel |
|---|---|---|
| RI | 1.6027151 | Heavily Skewed |
| Na | 0.4478343 | Symmetric |
| Mg | -1.1364523 | Heavily Skewed |
| Al | 0.8946104 | Moderately Skewed |
| Si | -0.7202392 | Moderately Skewed |
| K | 6.4600889 | Heavily Skewed |
| Ca | 2.0184463 | Heavily Skewed |
| Ba | 3.3686800 | Heavily Skewed |
| Fe | 1.7298107 | Heavily Skewed |
The shape of the box plots suggests that the data set may contain outliers; it is unclear whether the extreme point in predictor K is an outlier.
The table above shows the skewness value and skewness level of each predictor.
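As an illustration of how a table like the one above can be produced, the sketch below computes a moment-based skewness and attaches a label to it. The function names are hypothetical, and the 0.5 and 1.0 cut points are an assumption that is merely consistent with the labels in the table. The report's own code is not shown, and its skewness estimator may differ slightly from the simple `m3 / m2^1.5` form used here.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (one common definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skew_level(skew):
    """Label a skewness value; the 0.5 and 1.0 cut points are assumed."""
    size = abs(skew)
    if size < 0.5:
        return "Symmetric"
    if size < 1.0:
        return "Moderately Skewed"
    return "Heavily Skewed"


# Skewness values copied from the table above.
reported = {"RI": 1.6027151, "Na": 0.4478343, "Mg": -1.1364523,
            "Al": 0.8946104, "Si": -0.7202392, "K": 6.4600889}
for name, value in reported.items():
    print(name, skew_level(value))

# A small right-skewed toy sample (illustrative, not Glass data).
print(skewness([0, 0, 0, 0, 10]))
```

The sign of the value gives the direction of the tail, which is why Mg and Si (negative values) are left-skewed while K is strongly right-skewed.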
Having visually explored the data, it may help to apply transformations that deal with the skewness and outliers in the data set. The following transformations could be useful:
Spatial sign transformation, to constrain the outliers and extreme values in the predictors.
Yeo-Johnson transformation, to treat the skewness, because it can handle zero and negative values.
However, log and Box-Cox transformations cannot be applied directly here, because both require strictly positive values and some of the predictors (e.g. Ba and Fe) contain zeros.
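To make the contrast between these transformations concrete, here is a minimal sketch of the Yeo-Johnson transform for a single value. The function name and the choice of `lam` values are illustrative assumptions; in practice the parameter is estimated from the data, which this sketch does not attempt.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of one value y for a given lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log(y + 1.0)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log(1.0 - y)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# A zero value (as found in Ba and Fe) is handled without error.
print(yeo_johnson(0.0, 0.5))

# The plain log transform is undefined at zero.
try:
    math.log(0.0)
except ValueError:
    print("log(0) is undefined")
```

Because the transform shifts non-negative values by one before applying the power, zeros map to zero for any `lam`, whereas log and Box-Cox have no defined value there.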
##
## Correlation Plot of the transformed data: Center & Scale Transformation
The spatial sign transformation has helped to constrain the outliers. It has also changed the direction of some zero values in the data, e.g. in Fe and Ba. As a result, the correlation between the predictors seems to have improved.
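The behaviour described above follows from how the spatial sign is defined: each predictor is centered and scaled, and every sample is then divided by its Euclidean norm so that it lies on the unit sphere. The sketch below shows this on toy data; the function names and the data are hypothetical and are not taken from the report.

```python
import math


def center_scale(rows):
    """Center each column to mean 0 and scale it to sample sd 1."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def spatial_sign(rows):
    """Divide each (already centered and scaled) row by its Euclidean norm."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm > 0 else 0.0 for v in row])
    return out


# Toy data with one extreme sample in the first column (illustrative only).
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [50.0, 4.0]]
for row in spatial_sign(center_scale(rows)):
    print([round(v, 3) for v in row])
```

After the projection every sample has length 1, so the extreme sample can no longer dominate. A raw value of zero becomes non-zero once the column is centered, which is one way zero-heavy predictors end up pointing in a different direction after the transformation.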
##
## Density Plot of the transformed data: Yeo Johnson Transformation
The Yeo-Johnson transformation does not appear to have reduced the skewness in the data.
##
## The dimension of the Soybean dataset is [ 683 36 ]
##
## Below is the structure of the Soybean dataset
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## NULL
## Temp
##
## 0 1 2 <NA>
## 80 374 199 30
##
## low norm high missing
## 80 374 199 30
## Date
##
## 0 1 2 3 4 5 6 <NA>
## 26 75 93 118 131 149 90 1
##
## apr may june july aug sept missing
## 26 75 93 118 131 149 1
## Precip
##
## 0 1 2 <NA>
## 74 112 459 38
##
## low norm high missing
## 74 112 459 38
The output above shows that the factor levels of some predictors are not informative. The predictor temp, for example, consists of integer codes that stand for below average, average and above average. It would therefore be more informative to replace these integer codes with descriptive labels.
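The recoding described above amounts to a lookup from each code to a label, with missing values given a label of their own. A minimal sketch follows, using the temp counts printed above; the names `TEMP_LABELS` and `recode` are hypothetical.

```python
from collections import Counter

# Mapping from the integer codes to descriptive labels; None marks a missing value.
TEMP_LABELS = {"0": "low", "1": "norm", "2": "high", None: "missing"}


def recode(values, labels):
    """Replace every code with its descriptive label."""
    return [labels[v] for v in values]


# Rebuild a temp column with the counts shown in the table above.
temp = ["0"] * 80 + ["1"] * 374 + ["2"] * 199 + [None] * 30
counts = Counter(recode(temp, TEMP_LABELS))
print(counts["low"], counts["norm"], counts["high"], counts["missing"])
```

Keeping "missing" as an explicit level means those rows stay in the data and can later be represented by their own indicator column.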
## Barchart of the distribution of date and temp
To handle the missing data, let us convert the factors to a set of dummy variables:
## freqRatio percentUnique zeroVar nzv
## date.0 30.500000 0.3174603 FALSE TRUE
## date.1 8.264706 0.3174603 FALSE FALSE
## date.2 6.325581 0.3174603 FALSE FALSE
## date.3 4.727273 0.3174603 FALSE FALSE
## date.4 4.080645 0.3174603 FALSE FALSE
## date.5 3.500000 0.3174603 FALSE FALSE
## The number of predictors to remove:
## [1] 16
## The percentage of predictors to remove:
## [1] 0.1616162
Hence, eliminating about 16% of the dummy variables would help to remove unbalanced and sparse predictors.
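The dummy-variable step can be sketched as follows: each factor level becomes its own 0/1 column, and missing values are given an indicator column as well. The function name, the column naming scheme and the toy input are assumptions for illustration; the report's own encoding code is not shown. The total of 99 dummy variables is not printed in the output and is inferred here from the two numbers that are (16 and 0.1616162).

```python
def dummy_columns(name, values, levels):
    """One 0/1 indicator column per level, named like 'temp.0'."""
    cleaned = ["missing" if v is None else v for v in values]
    return {f"{name}.{lv}": [int(v == lv) for v in cleaned] for lv in levels}


# Toy input: three samples, one of them missing (illustrative only).
cols = dummy_columns("temp", ["0", None, "2"], ["0", "1", "2", "missing"])
for col_name, col in cols.items():
    print(col_name, col)

# Share of dummy variables flagged for removal, assuming 99 columns in total.
print(round(16 / 99, 7))
```

In this toy input the `temp.1` column is all zeros, which is the kind of sparse, unbalanced predictor that the near-zero variance filter is meant to catch.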
## Number of columns:
## [1] 134
## freqRatio percentUnique zeroVar nzv
## tpsa 2.142857 61.5384615 FALSE FALSE
## nbasic 1.736842 0.9615385 FALSE FALSE
## negative 207.000000 0.9615385 FALSE TRUE
## vsa_hyd 1.000000 93.2692308 FALSE FALSE
## a_aro 1.188679 5.7692308 FALSE FALSE
## weight 1.000000 91.8269231 FALSE FALSE
These are some of the predictors with degenerate distributions:
## These are the near-zero variance predictors
## [1] "negative" "peoe_vsa.2.1" "peoe_vsa.3.1" "a_acid"
## [5] "vsa_acid" "frac.anion7." "alert"
## These are tables for some of them:
## alert
##
## 0 1
## 206 2
## a_acid
##
## 0 2 3
## 201 6 1
We might want to remove them:
## [1] 127
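For reference, the two metrics behind the `nzv` flag can be reproduced by hand. The sketch below is modeled on the documented defaults of caret's `nearZeroVar` (`freqCut = 95/5`, `uniqueCut = 10`) but is a simplified stand-in, not caret's implementation; the function name is hypothetical. The inputs are rebuilt from the `alert` and `a_acid` tables above.

```python
from collections import Counter


def nzv_metrics(values, freq_cut=95 / 5, unique_cut=10):
    """Frequency ratio, percent unique, and the resulting near-zero variance flag."""
    counts = sorted(Counter(values).values(), reverse=True)
    zero_var = len(counts) == 1
    # Ratio of the most common value's count to the second most common one.
    freq_ratio = counts[0] / counts[1] if not zero_var else 0.0
    percent_unique = 100.0 * len(counts) / len(values)
    nzv = zero_var or (freq_ratio > freq_cut and percent_unique <= unique_cut)
    return {"freq_ratio": freq_ratio, "percent_unique": percent_unique,
            "zero_var": zero_var, "nzv": nzv}


alert = [0] * 206 + [1] * 2              # counts from the alert table
a_acid = [0] * 201 + [2] * 6 + [3] * 1   # counts from the a_acid table
print(nzv_metrics(alert))
print(nzv_metrics(a_acid))
```

Both columns are dominated by a single value and have very few distinct values, which is why they appear in the near-zero variance list.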
| Predictor | skewValues |
|---|---|
| tpsa | 0.8570900 |
| nbasic | 0.5550790 |
| negative | 14.2148592 |
| vsa_hyd | 0.4071053 |
| a_aro | -0.2460511 |
| weight | 0.5086773 |
Some of the predictors are clearly skewed, e.g. negative, while others are roughly symmetric, e.g. nbasic, vsa_hyd and weight.
Looking at the correlation between the predictors:
##
## Correlation Matrices for the raw data
##
## Correlation Matrices for the Spatial Sign Transformation
##
## Correlation Matrices for the raw data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00000 -0.16173 0.06434 0.07068 0.28643 1.00000
##
## Correlation Matrices for the Spatial Sign Transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00000 -0.16062 0.03862 0.05561 0.25545 1.00000
The plots above show that:
there are strong relationships between the predictors, as seen in the correlation matrix before transformation;
these strong relationships can be reduced through transformation;
the level of correlation is slightly lower after the spatial sign transformation (the mean pairwise correlation falls from 0.07068 to 0.05561);
however, a better approach seems to be to remove predictors, which can be done with the findCorrelation function in the caret package once a correlation cutoff has been chosen;
this should not have a dramatic effect on the number of predictors available for modeling.
Below are the number and the column indices of the highly correlated predictors at a cutoff of 0.85:
## [1] 56
## [1] 22 32 40 50 51 52 61 65 67 68 77 78 80 87 88 89 90
## [18] 91 92 93 94 95 96 101 103 104 106 107 108 109 110 112 116 117
## [35] 118 119 120 121 3 28 1 5 44 43 4 58 60 62 71 82 83
## [52] 76 113 81 105 20
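The idea behind this kind of correlation filter can be sketched as a greedy pass over the correlation matrix: whenever a pair exceeds the cutoff, the member with the larger mean absolute correlation to the other predictors is flagged. The code below is a simplified illustration with a hypothetical function name and a made-up 3 x 3 matrix. It is not caret's `findCorrelation` implementation, and it returns 0-based indices, whereas the output above uses R's 1-based column positions.

```python
def find_correlated(corr, cutoff=0.85):
    """Return indices of predictors to drop so no remaining pair exceeds the cutoff."""
    p = len(corr)
    # Mean absolute correlation of each predictor with all the others.
    mean_abs = [sum(abs(corr[i][j]) for j in range(p) if j != i) / (p - 1)
                for i in range(p)]
    removed = set()
    for i in range(p):
        for j in range(i + 1, p):
            if i in removed or j in removed:
                continue
            if abs(corr[i][j]) > cutoff:
                # Drop the member of the pair that is more correlated overall.
                removed.add(i if mean_abs[i] >= mean_abs[j] else j)
    return sorted(removed)


# Made-up correlation matrix for three predictors (illustrative only).
corr = [[1.0, 0.9, 0.2],
        [0.9, 1.0, 0.6],
        [0.2, 0.6, 1.0]]
print(find_correlated(corr, cutoff=0.85))
```

Lowering the cutoff flags more predictors, so the choice of 0.85 directly controls how many of the 127 remaining columns are dropped.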