Question 1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
library(mlbench)
library(e1071)
library(caret) 
## Loading required package: lattice
## Loading required package: ggplot2
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(missMDA)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

We will assess the distribution of each predictor variable.

hist(Glass$RI)

We can see that the refractive index is skewed to the right.

hist(Glass$Na)

Na is approximately normally distributed.

hist(Glass$Mg)

Mg is not normally distributed. It looks left-skewed, with outlying values at the left.

hist(Glass$Al)

Al is approximately normally distributed, with a slight right skew.

hist(Glass$Si)

Si is approximately normally distributed, with a slight left skew.

hist(Glass$K)

K is not normally distributed; it looks right-skewed.

hist(Glass$Ca)

Ca is right-skewed.

hist(Glass$Ba)

Ba is not normally distributed. Most of its values are 0 (the median and third quartile are both 0), with a long right tail reaching 3.15, so it is strongly right-skewed rather than uniform.

hist(Glass$Fe)

Fe is not normally distributed. It is right-skewed (sample skewness 1.73), with most values at or near 0.
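The nine histograms above can also be drawn on a single page, which makes the shapes easier to compare side by side. This is a minimal sketch that assumes the `mlbench` package is installed, as earlier in this report; the object name `predictors` is only illustrative.

```r
library(mlbench)  # provides the Glass data, as loaded earlier

data(Glass)
predictors <- Glass[, 1:9]  # the nine numeric predictors; column 10 is Type

# Draw all nine histograms in a 3 x 3 grid so their shapes can be compared.
op <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
for (nm in names(predictors)) {
  hist(predictors[[nm]], main = nm, xlab = nm)
}
par(op)  # restore the previous graphics settings
```

Looping over the column names keeps each panel labeled with its predictor and avoids repeating the `hist()` call by hand.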

Let's plot the correlations among all the predictors.

#head(Glass[,c(1:9)])
Glass_m<-Glass[,c(1:9)]
M<- cor(Glass_m)
M
##               RI          Na           Mg          Al          Si
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073
##               K         Ca            Ba           Fe
## RI -0.289832711  0.8104027 -0.0003860189  0.143009609
## Na -0.266086504 -0.2754425  0.3266028795 -0.241346411
## Mg  0.005395667 -0.4437500 -0.4922621178  0.083059529
## Al  0.325958446 -0.2595920  0.4794039017 -0.074402151
## Si -0.193330854 -0.2087322 -0.1021513105 -0.094200731
## K   1.000000000 -0.3178362 -0.0426180594 -0.007719049
## Ca -0.317836155  1.0000000 -0.1128409671  0.124968219
## Ba -0.042618059 -0.1128410  1.0000000000 -0.058691755
## Fe -0.007719049  0.1249682 -0.0586917554  1.000000000
corrplot(M, method="circle")

We see that Ca is highly positively correlated with the refractive index (0.81), while Si is negatively correlated with RI (-0.54). Ba is negatively correlated with Mg (-0.49).
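The strongest pairs can also be pulled out of the correlation matrix programmatically rather than read off the plot. The sketch below uses a hand-typed excerpt of the printed matrix (five predictors, rounded to two decimals) so that it runs on its own. The cutoff of 0.45 is a hypothetical value chosen only for illustration.

```r
# Excerpt of the correlation matrix printed above, rounded to two decimals.
vars <- c("RI", "Mg", "Si", "Ca", "Ba")
M_sub <- matrix(c(
   1.00, -0.12, -0.54,  0.81,  0.00,
  -0.12,  1.00, -0.17, -0.44, -0.49,
  -0.54, -0.17,  1.00, -0.21, -0.10,
   0.81, -0.44, -0.21,  1.00, -0.11,
   0.00, -0.49, -0.10, -0.11,  1.00
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

cutoff <- 0.45  # hypothetical threshold, for illustration only

# Look only at the upper triangle so that each pair is reported once.
idx <- which(abs(M_sub) > cutoff & upper.tri(M_sub), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(M_sub)[idx[, "row"]],
  var2 = colnames(M_sub)[idx[, "col"]],
  r    = M_sub[idx]
)

# Sort by absolute correlation, strongest first.
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
print(strong_pairs)
```

With the full matrix `M`, the same two steps apply unchanged. caret's `findCorrelation()` is a related tool that suggests columns to drop for a given cutoff.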

  1. Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, there appear to be outliers in the data, as noted in the histogram comments above. In summary: Mg is not normally distributed and looks left-skewed, with outlying values at the left. K is strongly right-skewed. Fe is right-skewed, with most values at or near 0. Ba is mostly 0, with a long right tail.
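One rough way to back up the visual impression is to count, for each predictor, the points that fall outside the boxplot whiskers (more than 1.5 times the interquartile range beyond the hinges). This is only a heuristic sketch, not a formal outlier test, and it assumes `mlbench` is installed. For a predictor such as Ba, whose quartiles are both 0, the rule flags every nonzero value.

```r
library(mlbench)

data(Glass)

# boxplot.stats() returns the points beyond the whiskers in its $out element.
outlier_counts <- sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))

# Predictors with the most flagged points first.
sort(outlier_counts, decreasing = TRUE)
```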

  2. Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes, we can try a Box-Cox transformation for the predictors that are skewed.

First, let's see how skewed the variables are.

skewValues <- apply(Glass_m, 2, skewness) 
skewValues
##         RI         Na         Mg         Al         Si          K 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889 
##         Ca         Ba         Fe 
##  2.0184463  3.3686800  1.7298107
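To make the skewness statistic concrete, here is a small base R sketch that computes it by hand as the third central moment divided by the cubed sample standard deviation. To my understanding this matches the default (`type = 3`) definition in `e1071::skewness()`, but treat that as an assumption. The helper name and the toy vectors are hypothetical.

```r
# Sample skewness: third central moment over the cubed standard deviation.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)  # third central moment
  m3 / sd(x)^3                 # sd() uses the n - 1 denominator
}

symmetric    <- c(1, 2, 3, 4, 5)         # symmetric around its mean
right_tailed <- c(0, 0, 0, 0.1, 0.2, 6)  # mostly small, one large value

sample_skewness(symmetric)     # zero for symmetric data
sample_skewness(right_tailed)  # positive for a long right tail
```

Positive values indicate a long right tail and negative values a long left tail, which is how the output above should be read.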

Let's try to transform K, Ba, Ca, Fe, and RI, the predictors with the largest positive skewness.

K_Trans <- BoxCoxTrans(Glass_m$K)
K_Trans
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Lambda could not be estimated; no transformation is applied
Ba_Trans <- BoxCoxTrans(Glass_m$Ba)
Ba_Trans
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.175   0.000   3.150 
## 
## Lambda could not be estimated; no transformation is applied
Ca_Trans <- BoxCoxTrans(Glass_m$Ca)
Ca_Trans
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190 
## 
## Largest/Smallest: 2.98 
## Sample Skewness: 2.02 
## 
## Estimated Lambda: -1.1
Fe_Trans <- BoxCoxTrans(Glass_m$Fe)
Fe_Trans
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000 
## 
## Lambda could not be estimated; no transformation is applied
RI_Trans <- BoxCoxTrans(Glass_m$RI)
RI_Trans
## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.511   1.517   1.518   1.518   1.519   1.534 
## 
## Largest/Smallest: 1.02 
## Sample Skewness: 1.6 
## 
## Estimated Lambda: -2

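The transformed values and resulting histograms are shown below for the variables that were eligible for transformation (RI and Ca). Lambda could not be estimated for K, Ba, and Fe because they contain zeros, and the Box-Cox transformation requires strictly positive values.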

RI_Trans_B <- predict(RI_Trans, Glass_m$RI) 
head(RI_Trans_B)
## [1] 0.2838746 0.2829051 0.2824954 0.2829194 0.2828507 0.2824323
hist(RI_Trans_B)

Ca_Trans_B <- predict(Ca_Trans, Glass_m$Ca) 
head(Ca_Trans_B)
## [1] 0.8254539 0.8145827 0.8139144 0.8195032 0.8176698 0.8176698
hist(Ca_Trans_B)
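For reference, the transformation applied by `predict()` above can be written out by hand. The sketch below implements the standard Box-Cox formula, (x^lambda - 1) / lambda, with log(x) used when lambda is 0. The helper name `box_cox` is hypothetical. Applied to the first Ca value (8.75) with the estimated lambda of -1.1, it should reproduce the first transformed value shown above, about 0.825.

```r
# Box-Cox transformation for a fixed lambda (x must be strictly positive).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) {
    log(x)                    # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

ca_first <- box_cox(8.75, lambda = -1.1)  # first Ca value, estimated lambda
ca_first
```

The `stopifnot()` guard mirrors why no lambda was estimated for K, Ba, and Fe: each of them has a minimum of 0.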

Question 2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
#?Soybean 
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# count() resolves to dplyr::count() here (dplyr was attached after plyr), which treats a
# quoted name such as 'Class' as a constant and returns a single row with n = 683.
# Tabulating each column directly gives the per-level frequencies, with NA included.
lapply(Soybean[, 1:26], table, useNA = "ifany")  # Class through mycelium

# Per-level frequencies for the remaining predictors (int.discolor through roots).
# table() is used because count() resolves to dplyr::count() here, which treats a
# quoted column name as a constant and only returns the row count.
lapply(Soybean[, 27:36], table, useNA = "ifany")

For predictor ‘mycelium’, both conditions for a degenerate distribution hold:

  1. The fraction of unique values over the sample size is low (say, below 10%): 2/683 = 0.29%.
  2. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say, around 20): 639/6 = 106.5, well above 20.

So this predictor looks degenerate.

For predictor ‘leaf.mild’, the frequency ratio is 535/20 = 26.75, so it could be a candidate degenerate predictor if we ignore missing values.

For predictor ‘sclerotia’, the frequency ratio is 625/20 = 31.25, so it could also be a candidate degenerate predictor if we ignore missing values.
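The two criteria can be checked in a few lines of base R. The sketch below types in the non-missing level counts for the three predictors discussed above and computes the frequency ratio and the percentage of unique values. The object names are illustrative. For comparison, caret's `nearZeroVar()` flags a predictor by default when the frequency ratio exceeds 95/5 = 19 and the percentage of unique values is below 10.

```r
# Non-missing level counts for the three predictors discussed above.
counts <- list(
  mycelium  = c("0" = 639, "1" = 6),
  leaf.mild = c("0" = 535, "1" = 20, "2" = 20),
  sclerotia = c("0" = 625, "1" = 20)
)
n_total <- 683  # number of samples

degeneracy <- t(sapply(counts, function(tab) {
  sorted <- sort(tab, decreasing = TRUE)
  c(freq_ratio = unname(sorted[1] / sorted[2]),      # most common / second most common
    pct_unique = 100 * length(tab) / n_total)        # unique values as % of samples
}))

round(degeneracy, 2)
```

All three ratios fall above 19 and all three percentages fall well below 10, which is consistent with these predictors being flagged.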

nearZeroVar(Soybean) 
## [1] 19 26 28

These integers are the indices of the columns flagged as having near-zero variance, which makes them candidates for removal.

Soybean[1,c(19,26,28)]
##   leaf.mild mycelium sclerotia
## 1         0        0         0
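If the flagged columns are to be dropped, the indices returned by `nearZeroVar()` can be used directly. This is a minimal sketch assuming `mlbench` and `caret` are installed; `Soybean_filtered` is a hypothetical name. The length check guards against the case where nothing is flagged, because negative indexing with an empty vector would otherwise drop every column.

```r
library(mlbench)
library(caret)

data(Soybean)

nzv_idx <- nearZeroVar(Soybean)  # column indices with near-zero variance
names(Soybean)[nzv_idx]          # show the flagged column names

# Drop the flagged columns only if there are any.
Soybean_filtered <- if (length(nzv_idx) > 0) Soybean[, -nzv_idx] else Soybean
dim(Soybean_filtered)
```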
  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
## 

Most variables in this dataset have missing values. The predictors with the most missing values (84 or more each) include hail, sever, seed.tmt, germ, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, fruiting.bodies, fruit.spots, seed, mold.growth, seed.discolor, seed.size, and shriveling.
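The missing counts can also be computed and ranked directly instead of being read from the `summary()` output. The sketch below assumes `mlbench` is installed. It tabulates the number and percentage of missing values per column and also reports the share of rows with at least one missing value. The name `na_table` is illustrative.

```r
library(mlbench)

data(Soybean)

# Missing values per column, as a count and as a percentage of rows.
na_count <- colSums(is.na(Soybean))
na_table <- data.frame(
  n_missing   = na_count,
  pct_missing = round(100 * na_count / nrow(Soybean), 1)
)
na_table <- na_table[order(-na_table$n_missing), ]
head(na_table, 10)

# Share of rows that contain at least one missing value.
mean(!complete.cases(Soybean))
```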

Is the pattern of missing data related to the classes?

Let's count the missing values for each class and predictor.

Class_grp <- group_by(Soybean, Class)
summarize(Class_grp, hail = sum(is.na(hail))
,sever = sum(is.na(sever))
,seed.tmt = sum(is.na(seed.tmt))
,germ = sum(is.na(germ))
,leaf.halo = sum(is.na(leaf.halo))
,leaf.marg = sum(is.na(leaf.marg))
,leaf.size = sum(is.na(leaf.size))
,leaf.shread = sum(is.na(leaf.shread))
,leaf.malf  = sum(is.na(leaf.malf ))
,leaf.mild = sum(is.na(leaf.mild))
,lodging = sum(is.na(lodging))
,fruiting.bodies = sum(is.na(fruiting.bodies))
,fruit.spots = sum(is.na(fruit.spots))
,seed = sum(is.na(seed))
,mold.growth = sum(is.na(mold.growth))
,seed.discolor = sum(is.na(seed.discolor))
,seed.size = sum(is.na(seed.size))
,shriveling = sum(is.na(shriveling)))
## # A tibble: 19 x 19
##    Class           hail sever seed.tmt  germ leaf.halo leaf.marg leaf.size
##    <fct>          <int> <int>    <int> <int>     <int>     <int>     <int>
##  1 2-4-d-injury      16    16       16    16         0         0         0
##  2 alternarialea~     0     0        0     0         0         0         0
##  3 anthracnose        0     0        0     0         0         0         0
##  4 bacterial-bli~     0     0        0     0         0         0         0
##  5 bacterial-pus~     0     0        0     0         0         0         0
##  6 brown-spot         0     0        0     0         0         0         0
##  7 brown-stem-rot     0     0        0     0         0         0         0
##  8 charcoal-rot       0     0        0     0         0         0         0
##  9 cyst-nematode     14    14       14    14        14        14        14
## 10 diaporthe-pod~    15    15       15     6        15        15        15
## 11 diaporthe-ste~     0     0        0     0         0         0         0
## 12 downy-mildew       0     0        0     0         0         0         0
## 13 frog-eye-leaf~     0     0        0     0         0         0         0
## 14 herbicide-inj~     8     8        8     8         0         0         0
## 15 phyllosticta-~     0     0        0     0         0         0         0
## 16 phytophthora-~    68    68       68    68        55        55        55
## 17 powdery-mildew     0     0        0     0         0         0         0
## 18 purple-seed-s~     0     0        0     0         0         0         0
## 19 rhizoctonia-r~     0     0        0     0         0         0         0
## # ... with 11 more variables: leaf.shread <int>, leaf.malf <int>,
## #   leaf.mild <int>, lodging <int>, fruiting.bodies <int>,
## #   fruit.spots <int>, seed <int>, mold.growth <int>, seed.discolor <int>,
## #   seed.size <int>, shriveling <int>

The missing values are concentrated in certain classes. In the table above they occur in 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot, while every other class shows no missing values for these predictors.
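The same per-class tally can be produced for every predictor at once, without naming each column, by summing a 0/1 missingness matrix within classes. This is a base R sketch that assumes `mlbench` is installed; `na_by_class` and `affected` are illustrative names.

```r
library(mlbench)

data(Soybean)

# 0/1 matrix marking missing cells, excluding the Class column itself.
na_flags <- +is.na(Soybean[, -1])

# Sum the flags within each class: one row per class, one column per predictor.
na_by_class <- rowsum(na_flags, group = Soybean$Class)

# Keep only the classes that have at least one missing value.
affected <- na_by_class[rowSums(na_by_class) > 0, , drop = FALSE]
rownames(affected)
```

`rowsum()` is used here because it aggregates a whole matrix by group in one call, so new predictors are picked up automatically.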

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

In this case I will use the MIMCA function from the missMDA package to generate multiple imputed datasets for the missing values.

#nb <- estim_ncpMCA(Soybean,ncp.max=5) ## Time-consuming, nb = 4
res <- MIMCA(Soybean, ncp=1,nboot=2)
str(res)
## List of 3
##  $ res.MI       :List of 2
##   ..$ nboot=1:'data.frame':  683 obs. of  36 variables:
##   .. ..$ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##   .. ..$ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##   .. ..$ plant.stand    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ precip         : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ temp           : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##   .. ..$ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##   .. ..$ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##   .. ..$ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##   .. ..$ germ           : Factor w/ 3 levels "0","1","2": 1 2 3 2 3 2 1 3 2 3 ...
##   .. ..$ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ leaf.size      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##   .. ..$ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##   .. ..$ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##   .. ..$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##   .. ..$ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ nboot=2:'data.frame':  683 obs. of  36 variables:
##   .. ..$ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##   .. ..$ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##   .. ..$ plant.stand    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ precip         : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ temp           : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##   .. ..$ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##   .. ..$ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##   .. ..$ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##   .. ..$ germ           : Factor w/ 3 levels "0","1","2": 1 2 3 2 3 2 1 3 2 3 ...
##   .. ..$ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ leaf.size      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##   .. ..$ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##   .. ..$ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##   .. ..$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##   .. ..$ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ res.imputeMCA: num [1:683, 1:118] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:683] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:118] "2-4-d-injury" "alternarialeaf-spot" "anthracnose" "bacterial-blight" ...
##  $ call         :List of 8
##   ..$ X          :'data.frame':  683 obs. of  36 variables:
##   .. ..$ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##   .. ..$ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##   .. ..$ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##   .. ..$ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##   .. ..$ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##   .. ..$ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##   .. ..$ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##   .. ..$ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##   .. ..$ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##   .. ..$ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##   .. ..$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##   .. ..$ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ nboot      : num 2
##   ..$ ncp        : num 1
##   ..$ coeff.ridge: num 1
##   ..$ threshold  : num 1e-06
##   ..$ seed       : NULL
##   ..$ maxiter    : num 1000
##   ..$ tab.disj   : num [1:683, 1:118, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "class")= chr [1:2] "MIMCA" "list"
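As a simpler baseline to compare against the MIMCA result, each categorical predictor could instead be filled with its most frequent level. The sketch below is not the method used above. It is a plain mode imputation shown on a hypothetical toy data frame, and `impute_mode` is an illustrative helper name. Because the missingness here is tied to the class, a baseline like this would likely distort the affected classes, so it is best treated as a reference point only.

```r
# Replace missing values in a factor with its most frequent observed level.
impute_mode <- function(x) {
  if (!anyNA(x)) return(x)
  tab <- table(x)                            # table() ignores NA by default
  x[is.na(x)] <- names(tab)[which.max(tab)]  # first level wins on ties
  x
}

# Hypothetical toy data with the same kind of columns as Soybean.
toy <- data.frame(
  hail  = factor(c("0", "0", "1", NA, "0")),
  sever = factor(c("1", NA, "1", "2", NA))
)

toy_imputed <- as.data.frame(lapply(toy, impute_mode))
toy_imputed
```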