Predictive Modeling

Ibrahim Odumas Odufowora

2017-10-04

Question 1: Glass Identification

Previewing the Glass Dataset:

Head of Glass Data
     RI     Na    Mg    Al     Si     K    Ca  Ba  Fe  Type
1.52101  13.64  4.49  1.10  71.78  0.06  8.75   0   0     1
1.51761  13.89  3.60  1.36  72.73  0.48  7.83   0   0     1
1.51618  13.53  3.55  1.54  72.99  0.39  7.78   0   0     1
1.51766  13.21  3.69  1.29  72.61  0.57  8.22   0   0     1
1.51742  13.27  3.62  1.24  73.08  0.55  8.07   0   0     1
## 
## The dimension of the Glass identification dataset is [ 214 10 ]
## 
## Below is the structure of the Glass identification dataset
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Q1a: Using visualizations to explore the predictor variables:

Head of Melted Glass Data
Type  variable    value
   1  RI        1.52101
   1  RI        1.51761
   1  RI        1.51618
   1  RI        1.51766
## 
## Density Plot of all the feature variables

## 
## Histogram Plot of all the feature variables

## The histograms were plotted to give a clearer view of the data
## 
## Box Plot of all the feature variables per type

  1. The predictors Ca, Si, Na, and RI show signs of heavy-tailed distributions.

  2. K and Mg have multiple peaks; this might indicate a mixture of different distributions.

  3. Very few predictors seem to be correlated; an obvious instance is RI and Ca. However, most of the predictors


Q1b: Do there appear to be any outliers in the data? Are any predictors skewed?

Skewness Level
     skewValue   skewLevel
RI   1.6027151   Heavily Skewed
Na   0.4478343   Symmetric
Mg  -1.1364523   Heavily Skewed
Al   0.8946104   Moderately Skewed
Si  -0.7202392   Moderately Skewed
K    6.4600889   Heavily Skewed
Ca   2.0184463   Heavily Skewed
Ba   3.3686800   Heavily Skewed
Fe   1.7298107   Heavily Skewed
  1. Judging by the box plots, there may be outliers in the data set; it is unclear whether the extreme point in predictor K is an outlier.

  2. The table above shows the skewness level of each of the predictors.
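
The skewness labels in the table can be illustrated with a short sketch. It is written in Python, not the R used for this report, and it is not the code that produced the table. The function names are hypothetical, and the cutoffs of 0.5 and 1.0 are an assumption that happens to agree with all nine rows above; the report does not state them. The skewness function uses the plain moment formula, so it may differ slightly from whatever estimator the report used.

```python
def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5, with no small-sample correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skew_level(skew, moderate=0.5, heavy=1.0):
    """Label a skewness value; the 0.5 and 1.0 cutoffs are assumed."""
    size = abs(skew)
    if size < moderate:
        return "Symmetric"
    if size < heavy:
        return "Moderately Skewed"
    return "Heavily Skewed"


# Skewness values copied from the table above.
reported = {"RI": 1.6027151, "Na": 0.4478343, "Mg": -1.1364523, "Al": 0.8946104}
labels = {name: skew_level(value) for name, value in reported.items()}
print(labels)
```

The sign of the value gives the direction of the tail (Mg and Si are left-skewed), while the label depends only on the magnitude.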


Q1c: Are there any relevant transformations of one or more predictors that might improve the classification model?

Having visually explored the data, it might be helpful to apply some transformations to deal with the skewness and outliers in the data set. The following transformations could be useful:

  1. Spatial sign transformation to constrain the outliers and extreme values in the predictors.

  2. Yeo-Johnson transformation to treat the skewness, because it can deal with zero or negative values.

  3. However, the log and Box-Cox transformations cannot be used directly in this case, because they require strictly positive values and several predictors (e.g. Ba and Fe) contain zeros.
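
To make the point about zero values concrete, here is a minimal Python sketch of the standard Yeo-Johnson formula (Python, not the R used for this report). The function name and the choice of lambda = 0.5 are illustrative assumptions; in practice lambda is estimated from the data. The Fe values are the ten shown in the structure output above.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of a single value y for a given lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)                  # lambda = 0 branch
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)                    # lambda = 2 branch
    return -(((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam))


# First ten Fe values from the str() output; lambda = 0.5 is an arbitrary example.
fe_values = [0.0, 0.0, 0.0, 0.0, 0.0, 0.26, 0.0, 0.0, 0.0, 0.11]
fe_transformed = [yeo_johnson(v, 0.5) for v in fe_values]
print(fe_transformed)
```

A zero input maps to zero for any lambda, which is why the transform is defined here where a plain logarithm is not. It also means a predictor that is mostly zeros keeps its spike at zero after the transform.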

## 
## Correlation Plot of the transformed data: Center & Scale Transformation

The spatial sign transformation has helped to constrain the outliers. Because the data are centered and scaled first, it has also shifted the zero values of some predictors, e.g. Fe and Ba, away from zero. As a result, the correlations between the predictors seem to have improved.
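
The way the spatial sign transformation constrains extreme values can be sketched in a few lines of Python (not the R used for this report, and not the code that produced the plot). The function names are hypothetical. The input is the (K, Ca) pair from each of the five previewed Glass rows.

```python
import math
import statistics


def center_scale(rows):
    """Standardise each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def spatial_sign(rows):
    """Divide each centered and scaled row by its Euclidean norm."""
    projected = []
    for row in center_scale(rows):
        norm = math.sqrt(sum(v * v for v in row))
        # A row that sits exactly at the centre has norm 0 and is left there.
        projected.append([v / norm if norm > 0 else 0.0 for v in row])
    return projected


# (K, Ca) values from the five rows shown in "Head of Glass Data".
glass_head = [(0.06, 8.75), (0.48, 7.83), (0.39, 7.78), (0.57, 8.22), (0.55, 8.07)]
signed = spatial_sign(glass_head)
print(signed)
```

Every sample ends up at distance 1 from the centre, so no point can be more extreme than any other; only its direction is kept.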


## 
## Density Plot of the transformed data: Yeo Johnson Transformation

The Yeo-Johnson transformation does not appear to have reduced the skewness in the data.


Question 2: Soybean Data

Previewing the Soybean Dataset:

## 
## The dimension of the Soybean dataset is [ 683 36 ]
## 
## Below is the structure of the Soybean dataset
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...


Q2a: Investigate the frequency distributions for the categorical predictors:

## Temp
## 
##    0    1    2 <NA> 
##   80  374  199   30
## 
##     low    norm    high missing 
##      80     374     199      30
## Date
## 
##    0    1    2    3    4    5    6 <NA> 
##   26   75   93  118  131  149   90    1
## 
##     apr     may    june    july     aug    sept missing 
##      26      75      93     118     131     149       1
## Precip
## 
##    0    1    2 <NA> 
##   74  112  459   38
## 
##     low    norm    high missing 
##      74     112     459      38

Looking at the output above, it is clear that the factor levels of some predictors are not informative. For example, the predictor temp is coded with the integers 0, 1 and 2, which stand for below average, average and above average. It would therefore be more informative to replace such integer codes with descriptive labels.
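
The relabelling step can be sketched as a simple lookup. This is a Python illustration, not the R code behind the output above; the names `TEMP_LABELS` and `relabel` are hypothetical. The labels and counts are the ones printed for temp.

```python
from collections import Counter

# Code-to-label mapping for temp, using the labels shown in the output above.
TEMP_LABELS = {"0": "low", "1": "norm", "2": "high", None: "missing"}


def relabel(codes, labels):
    """Replace each integer code (or None for a missing value) with its label."""
    return [labels[code] for code in codes]


# Rebuild the temp column from the printed counts: 80, 374, 199 and 30 missing.
temp_codes = ["0"] * 80 + ["1"] * 374 + ["2"] * 199 + [None] * 30
temp_counts = Counter(relabel(temp_codes, TEMP_LABELS))
print(temp_counts)
```

Giving missing values their own label keeps them visible in the frequency table and in the bar charts, and the relabelled counts should always add up to the number of rows in the data set.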

## Barchart of the distribution of date and temp


Q2c: Develop a strategy for handling missing data, either by eliminating predictors or imputation:

To handle the missing data, let us convert the factors to a set of dummy variables:

##        freqRatio percentUnique zeroVar   nzv
## date.0 30.500000     0.3174603   FALSE  TRUE
## date.1  8.264706     0.3174603   FALSE FALSE
## date.2  6.325581     0.3174603   FALSE FALSE
## date.3  4.727273     0.3174603   FALSE FALSE
## date.4  4.080645     0.3174603   FALSE FALSE
## date.5  3.500000     0.3174603   FALSE FALSE
## The number of predictors to remove:
## [1] 16
## The percentage of predictors to remove:
## [1] 0.1616162

Hence, eliminating about 16% of the dummy variables would help to remove unbalanced and sparse predictors.
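
As an illustration of the dummy-variable step, the following Python sketch (not the R used for this report) expands one factor into 0/1 columns named in the same `name.level` style as the output above. The function name and the toy `hail` values are hypothetical, and carrying a missing value through as `None` in every dummy column is an assumption about how missing data are treated at this stage.

```python
def dummy_encode(name, values, levels):
    """Return one 0/1 column per level; a missing value stays None in every column."""
    encoded = {}
    for level in levels:
        encoded[f"{name}.{level}"] = [
            None if v is None else int(v == level) for v in values
        ]
    return encoded


# Toy column with one missing value (illustrative, not taken from the data).
hail = ["0", "0", "1", None, "0"]
dummies = dummy_encode("hail", hail, ["0", "1"])
print(dummies)

# The reported share 0.1616162 corresponds to 16 flagged columns out of 99.
share_removed = 16 / 99
print(round(share_removed, 7))
```

Once every factor is expanded this way, each dummy column can be screened separately, so a single rare level can be dropped without discarding the whole predictor.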


Question 3: Blood Brain

Question 3a:

## Number of columns:
## [1] 134
##           freqRatio percentUnique zeroVar   nzv
## tpsa       2.142857    61.5384615   FALSE FALSE
## nbasic     1.736842     0.9615385   FALSE FALSE
## negative 207.000000     0.9615385   FALSE  TRUE
## vsa_hyd    1.000000    93.2692308   FALSE FALSE
## a_aro      1.188679     5.7692308   FALSE FALSE
## weight     1.000000    91.8269231   FALSE FALSE


Question 3b:

These are some of the predictors with degenerate distributions:

## These are the near-zero variance predictors
## [1] "negative"     "peoe_vsa.2.1" "peoe_vsa.3.1" "a_acid"      
## [5] "vsa_acid"     "frac.anion7." "alert"
## These are tables for some of them:
## alert
## 
##   0   1 
## 206   2
## a_acid
## 
##   0   2   3 
## 201   6   1
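
The freqRatio and percentUnique columns shown earlier can be reproduced from tables like these. The sketch below is a Python illustration, not the caret code itself; the function name is hypothetical. The default cutoffs of 95/5 and 10 follow caret's documented defaults for nearZeroVar, and returning an infinite ratio for a constant column is a choice made here for simplicity.

```python
from collections import Counter


def near_zero_var(values, freq_cut=95 / 5, unique_cut=10.0):
    """Compute near-zero-variance diagnostics for one predictor."""
    counts = [count for _, count in Counter(values).most_common()]
    zero_var = len(counts) == 1
    # Ratio of the most common value's count to the second most common.
    freq_ratio = float("inf") if zero_var else counts[0] / counts[1]
    percent_unique = 100.0 * len(counts) / len(values)
    nzv = zero_var or (freq_ratio > freq_cut and percent_unique <= unique_cut)
    return {
        "freqRatio": freq_ratio,
        "percentUnique": percent_unique,
        "zeroVar": zero_var,
        "nzv": nzv,
    }


# Columns rebuilt from the two tables above.
alert = [0] * 206 + [1] * 2
a_acid = [0] * 201 + [2] * 6 + [3] * 1
alert_stats = near_zero_var(alert)
a_acid_stats = near_zero_var(a_acid)
print(alert_stats)
print(a_acid_stats)
```

Both predictors are dominated by a single value, so both rules fire: the dominant value is far more than 19 times as frequent as the runner-up, and fewer than 10% of the values are distinct.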

We might want to remove them:

## [1] 127
Head of Skewness Level
          skewValues
tpsa       0.8570900
nbasic     0.5550790
negative  14.2148592
vsa_hyd    0.4071053
a_aro     -0.2460511
weight     0.5086773

Some of the predictors are heavily skewed, e.g. negative (skewness of about 14.2), while others are roughly symmetric, e.g. nbasic, vsa_hyd and weight (absolute skewness of about 0.56 or less).


Question 3c:

Looking at the correlation between the predictors:


## 
## Correlation Matrices for the raw data

## 
## Correlation Matrices for the Spatial Sign Transformation

## 
## Correlation Matrices for the raw data
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.16173  0.06434  0.07068  0.28643  1.00000
## 
## Correlation Matrices for the Spatial Sign Transformation
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.16062  0.03862  0.05561  0.25545  1.00000

The plots and summaries above show that:

  1. there are strong relationships between the predictors, as seen in the correlation matrix before transformation

  2. these strong relationships can be reduced via transformation

  3. there is a slight reduction in the level of correlation after transformation (the mean correlation falls from about 0.071 to about 0.056)

  4. however, a better idea seems to be to reduce the number of predictors; this can be done with the findCorrelation function in the caret package, which requires a correlation cutoff to be set

  5. this should not have a dramatic effect on the number of predictors available for modeling.

Below are the number and the column indices of the highly correlated predictors flagged with a cutoff of 0.85:

## [1] 56
##  [1]  22  32  40  50  51  52  61  65  67  68  77  78  80  87  88  89  90
## [18]  91  92  93  94  95  96 101 103 104 106 107 108 109 110 112 116 117
## [35] 118 119 120 121   3  28   1   5  44  43   4  58  60  62  71  82  83
## [52]  76 113  81 105  20
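
The idea behind this kind of correlation filter can be sketched as follows. This is a simplified Python illustration and not caret's exact findCorrelation algorithm: for each pair of predictors whose absolute correlation exceeds the cutoff, it flags the one with the larger mean absolute correlation to all other predictors. The function name and the toy correlation matrix are hypothetical, and the returned indices are zero-based, unlike the one-based R indices printed above.

```python
def find_correlated(corr, cutoff=0.85):
    """Return indices of predictors to drop so no remaining pair exceeds the cutoff."""
    p = len(corr)
    # Mean absolute correlation of each predictor with all the others.
    mean_abs = [
        sum(abs(corr[i][j]) for j in range(p) if j != i) / (p - 1)
        for i in range(p)
    ]
    removed = set()
    for i in range(p):
        for j in range(i + 1, p):
            if i in removed or j in removed:
                continue
            if abs(corr[i][j]) > cutoff:
                # Drop whichever member of the pair is more correlated overall.
                removed.add(i if mean_abs[i] >= mean_abs[j] else j)
    return sorted(removed)


# Toy 3 x 3 correlation matrix (illustrative values, not from the report).
toy_corr = [
    [1.00, 0.90, 0.20],
    [0.90, 1.00, 0.60],
    [0.20, 0.60, 1.00],
]
flagged = find_correlated(toy_corr, cutoff=0.85)
print(flagged)
```

Lowering the cutoff flags more predictors, so the value of 0.85 used above is a trade-off between removing redundancy and keeping enough predictors for modeling.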