The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
library(mlbench)
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(missMDA)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
We will assess the distribution of each predictor variable.
hist(Glass$RI)
The refractive index (RI) is skewed to the right.
hist(Glass$Na)
Na is approximately normally distributed.
hist(Glass$Mg)
Mg is not normally distributed. It looks left skewed, with outlying values at the far left.
hist(Glass$Al)
Al is approximately normally distributed, with a mild right skew.
hist(Glass$Si)
Si is approximately normally distributed, with a mild left skew.
hist(Glass$K)
K is not normally distributed. It looks strongly right skewed.
hist(Glass$Ca)
Ca is right skewed.
hist(Glass$Ba)
Ba is not normally distributed. Most values sit at or near zero, with a long tail to the right, so it is right skewed rather than uniform.
hist(Glass$Fe)
Fe is not normally distributed. It looks right skewed, with most values at or near zero.
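The nine separate hist() calls can also be drawn on a single page, which makes the shapes easier to compare. This is only a minimal base R sketch; the helper name plot_predictor_hists is hypothetical and not part of any package.

```r
# Hypothetical helper: draw one histogram per numeric column on a single page.
plot_predictor_hists <- function(df, ncol = 3) {
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  stopifnot(length(num_cols) > 0)
  nrow <- ceiling(length(num_cols) / ncol)
  old_par <- par(mfrow = c(nrow, ncol))   # switch to a grid layout
  on.exit(par(old_par))                   # restore the previous layout on exit
  for (col in num_cols) {
    hist(df[[col]], main = col, xlab = col)
  }
  invisible(num_cols)
}

# Usage with the data loaded above:
# plot_predictor_hists(Glass[, 1:9])
```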
Let's plot the correlations among all the predictors.
#head(Glass[,c(1:9)])
Glass_m<-Glass[,c(1:9)]
M<- cor(Glass_m)
M
## RI Na Mg Al Si
## RI 1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220
## Na -0.1918853790 1.00000000 -0.273731961 0.15679367 -0.06980881
## Mg -0.1222740393 -0.27373196 1.000000000 -0.48179851 -0.16592672
## Al -0.4073260341 0.15679367 -0.481798509 1.00000000 -0.00552372
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372 1.00000000
## K -0.2898327111 -0.26608650 0.005395667 0.32595845 -0.19333085
## Ca 0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215
## Ba -0.0003860189 0.32660288 -0.492262118 0.47940390 -0.10215131
## Fe 0.1430096093 -0.24134641 0.083059529 -0.07440215 -0.09420073
## K Ca Ba Fe
## RI -0.289832711 0.8104027 -0.0003860189 0.143009609
## Na -0.266086504 -0.2754425 0.3266028795 -0.241346411
## Mg 0.005395667 -0.4437500 -0.4922621178 0.083059529
## Al 0.325958446 -0.2595920 0.4794039017 -0.074402151
## Si -0.193330854 -0.2087322 -0.1021513105 -0.094200731
## K 1.000000000 -0.3178362 -0.0426180594 -0.007719049
## Ca -0.317836155 1.0000000 -0.1128409671 0.124968219
## Ba -0.042618059 -0.1128410 1.0000000000 -0.058691755
## Fe -0.007719049 0.1249682 -0.0586917554 1.000000000
corrplot(M, method="circle")
We see that Ca is highly positively correlated with the refractive index (r = 0.81), while Si is negatively correlated with RI (r = -0.54). Ba is negatively correlated with Mg (r = -0.49).
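Reading the strongest relationships off a 9 x 9 matrix by eye is error-prone, so it can help to list the pairs above a cutoff. The following is a minimal base R sketch; the helper name strong_pairs and the cutoff of 0.4 are assumptions chosen for illustration.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation
# meets a cutoff, strongest first.
strong_pairs <- function(cor_mat, cutoff = 0.4) {
  # Use only the upper triangle so each pair appears once
  idx <- which(upper.tri(cor_mat) & abs(cor_mat) >= cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  out <- out[order(-abs(out$r)), ]
  rownames(out) <- NULL
  out
}

# Usage with the matrix computed above:
# strong_pairs(M, cutoff = 0.4)
```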
Yes, there appear to be outliers in the data, as noted above. In summary: Mg is not normally distributed and looks left skewed, with outlying values at the far left. K is strongly right skewed. Fe is right skewed, with most values at or near zero. Ba is right skewed as well, with most values at zero and a few large values in the right tail.
First, let's see how skewed the variables are.
skewValues <- apply(Glass_m, 2, skewness)
skewValues
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
## Ca Ba Fe
## 2.0184463 3.3686800 1.7298107
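The sign of each value gives the direction of the skew: positive means a longer right tail, negative a longer left tail. As a rough illustration of what is being measured, here is a plain moment-based estimator in base R. It is a sketch, not the e1071 implementation; e1071::skewness offers several closely related estimators, so its values can differ slightly from this one.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Symmetric data gives 0; a long right tail gives a positive value.
moment_skewness(c(1, 2, 3, 4, 5))    # 0
moment_skewness(c(1, 1, 1, 2, 10))   # positive
```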
Let’s try to transform the most skewed predictors: K, Ba, Ca, Fe, and RI.
K_Trans <- BoxCoxTrans(Glass_m$K)
K_Trans
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1225 0.5550 0.4971 0.6100 6.2100
##
## Lambda could not be estimated; no transformation is applied
Ba_Trans <- BoxCoxTrans(Glass_m$Ba)
Ba_Trans
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.175 0.000 3.150
##
## Lambda could not be estimated; no transformation is applied
Ca_Trans <- BoxCoxTrans(Glass_m$Ca)
Ca_Trans
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.430 8.240 8.600 8.957 9.172 16.190
##
## Largest/Smallest: 2.98
## Sample Skewness: 2.02
##
## Estimated Lambda: -1.1
Fe_Trans <- BoxCoxTrans(Glass_m$Fe)
Fe_Trans
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000
##
## Lambda could not be estimated; no transformation is applied
RI_Trans <- BoxCoxTrans(Glass_m$RI)
RI_Trans
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.511 1.517 1.518 1.518 1.519 1.534
##
## Largest/Smallest: 1.02
## Sample Skewness: 1.6
##
## Estimated Lambda: -2
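K, Ba, and Fe each have a minimum of 0 in the summaries above, and the Box-Cox family is only defined for strictly positive values. That is the most likely reason no lambda could be estimated for them. A minimal base R sketch of that precondition check follows; the helper name boxcox_eligible is hypothetical and this is not caret's internal code.

```r
# Hypothetical helper: TRUE for numeric columns whose values are all
# strictly positive, i.e. columns on which a Box-Cox transform is defined.
boxcox_eligible <- function(df) {
  vapply(df, function(x) is.numeric(x) && all(x > 0, na.rm = TRUE), logical(1))
}

# Usage with the predictors above:
# boxcox_eligible(Glass_m)
```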
Below are the transformed values and resulting histograms for the variables that were eligible for transformation.
RI_Trans_B <- predict(RI_Trans, Glass_m$RI)
head(RI_Trans_B)
## [1] 0.2838746 0.2829051 0.2824954 0.2829194 0.2828507 0.2824323
hist(RI_Trans_B)
Ca_Trans_B <- predict(Ca_Trans, Glass_m$Ca)
head(Ca_Trans_B)
## [1] 0.8254539 0.8145827 0.8139144 0.8195032 0.8176698 0.8176698
hist(Ca_Trans_B)
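The transformed values can be checked by hand against the standard Box-Cox formula. The sketch below is a base R illustration under that assumption; box_cox is a hypothetical helper, and the inputs are the first two Ca values shown by str(Glass) with the estimated lambda of -1.1.

```r
# Box-Cox transform for strictly positive x:
#   (x^lambda - 1) / lambda   if lambda != 0
#   log(x)                    if lambda == 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# First two Ca values with the estimated lambda
box_cox(c(8.75, 7.83), lambda = -1.1)
```

These evaluate to roughly 0.825 and 0.815, in line with the first two entries of head(Ca_Trans_B).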
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
#?Soybean
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# With both plyr and dplyr attached, count() resolves to dplyr::count(), which
# treats a quoted name such as 'Class' as a constant instead of a column, so
# each call returns a single row with n = 683.
# table() tabulates each column directly; useNA = "ifany" also counts missing values.
lapply(Soybean, table, useNA = "ifany")
For the predictor mycelium, both conditions for a degenerate distribution are met:
1) The fraction of unique values over the sample size is low (say 10%): 2/683 = 0.29%.
2) The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20): 639/6 = 106.5, well above 20.
So this predictor looks degenerate.
For the predictor leaf.mild, the ratio of the most prevalent value to the second most prevalent value is 535/20 = 26.75, so it could be a candidate degenerate predictor if we ignore missing values.
For the predictor sclerotia, the same ratio is 625/20 = 31.25, so it could also be a candidate degenerate predictor if we ignore missing values.
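The two rules of thumb can be computed directly from a column. The following base R sketch uses a hypothetical helper, degenerate_stats, and rebuilds a mycelium-like factor from the counts quoted above (639 zeros, 6 ones, and the remaining 38 of 683 rows missing). It illustrates the arithmetic only and is not a replacement for nearZeroVar().

```r
# Hypothetical helper: the two degenerate-distribution diagnostics described above.
degenerate_stats <- function(x) {
  counts <- table(x)                            # NAs are excluded by table()
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Unique observed values as a percentage of the sample size
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

mycelium_like <- factor(c(rep("0", 639), rep("1", 6), rep(NA, 38)))
degenerate_stats(mycelium_like)
```

For this input the ratio is 106.5 and the percentage of unique values is about 0.29, matching the hand calculation.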
nearZeroVar(Soybean)
## [1] 19 26 28
These integers are the column indices of the predictors flagged as having near-zero variance. They are candidates for removal.
Soybean[1,c(19,26,28)]
## leaf.mild mycelium sclerotia
## 1 0 0 0
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
In this dataset most variables have missing values. Those with the most include hail, sever, seed.tmt, germ, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, fruiting.bodies, fruit.spots, seed, mold.growth, seed.discolor, seed.size, and shriveling.
Is the pattern of missing data related to the classes?
Let’s count the missing values for each class and predictor.
Class_grp <- group_by(Soybean, Class)
summarize(Class_grp, hail = sum(is.na(hail))
,sever = sum(is.na(sever))
,seed.tmt = sum(is.na(seed.tmt))
,germ = sum(is.na(germ))
,leaf.halo = sum(is.na(leaf.halo))
,leaf.marg = sum(is.na(leaf.marg))
,leaf.size = sum(is.na(leaf.size))
,leaf.shread = sum(is.na(leaf.shread))
,leaf.malf = sum(is.na(leaf.malf ))
,leaf.mild = sum(is.na(leaf.mild))
,lodging = sum(is.na(lodging))
,fruiting.bodies = sum(is.na(fruiting.bodies))
,fruit.spots = sum(is.na(fruit.spots))
,seed = sum(is.na(seed))
,mold.growth = sum(is.na(mold.growth))
,seed.discolor = sum(is.na(seed.discolor))
,seed.size = sum(is.na(seed.size))
,shriveling = sum(is.na(shriveling)))
## # A tibble: 19 x 19
## Class hail sever seed.tmt germ leaf.halo leaf.marg leaf.size
## <fct> <int> <int> <int> <int> <int> <int> <int>
## 1 2-4-d-injury 16 16 16 16 0 0 0
## 2 alternarialea~ 0 0 0 0 0 0 0
## 3 anthracnose 0 0 0 0 0 0 0
## 4 bacterial-bli~ 0 0 0 0 0 0 0
## 5 bacterial-pus~ 0 0 0 0 0 0 0
## 6 brown-spot 0 0 0 0 0 0 0
## 7 brown-stem-rot 0 0 0 0 0 0 0
## 8 charcoal-rot 0 0 0 0 0 0 0
## 9 cyst-nematode 14 14 14 14 14 14 14
## 10 diaporthe-pod~ 15 15 15 6 15 15 15
## 11 diaporthe-ste~ 0 0 0 0 0 0 0
## 12 downy-mildew 0 0 0 0 0 0 0
## 13 frog-eye-leaf~ 0 0 0 0 0 0 0
## 14 herbicide-inj~ 8 8 8 8 0 0 0
## 15 phyllosticta-~ 0 0 0 0 0 0 0
## 16 phytophthora-~ 68 68 68 68 55 55 55
## 17 powdery-mildew 0 0 0 0 0 0 0
## 18 purple-seed-s~ 0 0 0 0 0 0 0
## 19 rhizoctonia-r~ 0 0 0 0 0 0 0
## # ... with 11 more variables: leaf.shread <int>, leaf.malf <int>,
## # leaf.mild <int>, lodging <int>, fruiting.bodies <int>,
## # fruit.spots <int>, seed <int>, mold.growth <int>, seed.discolor <int>,
## # seed.size <int>, shriveling <int>
It looks like the missing values occur only in certain classes. In the table above they are confined to five classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot. All other classes have no missing values for these predictors.
In this case I will use the MIMCA function from the missMDA package to generate multiple imputed datasets for the missing values.
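A single number per class makes the pattern easier to see than a wide table. This base R sketch computes the share of rows in each class that have at least one missing value; missing_share_by_class is a hypothetical helper name, and the default column name "Class" is taken from the str() output above.

```r
# Hypothetical helper: share of rows with at least one missing value, per class.
missing_share_by_class <- function(df, class_col = "Class") {
  has_na <- !complete.cases(df)          # TRUE when any column in the row is NA
  tapply(has_na, df[[class_col]], mean)  # proportion of such rows in each class
}

# Usage with the data above:
# round(missing_share_by_class(Soybean), 2)
```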
#nb <- estim_ncpMCA(Soybean,ncp.max=5) ## Time-consuming, nb = 4
res <- MIMCA(Soybean, ncp=1,nboot=2)
str(res)
## List of 3
## $ res.MI :List of 2
## ..$ nboot=1:'data.frame': 683 obs. of 36 variables:
## .. ..$ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## .. ..$ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## .. ..$ plant.stand : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ precip : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ temp : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## .. ..$ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## .. ..$ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## .. ..$ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## .. ..$ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## .. ..$ germ : Factor w/ 3 levels "0","1","2": 1 2 3 2 3 2 1 3 2 3 ...
## .. ..$ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ leaf.size : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## .. ..$ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## .. ..$ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## .. ..$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## .. ..$ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ nboot=2:'data.frame': 683 obs. of 36 variables:
## .. ..$ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## .. ..$ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## .. ..$ plant.stand : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ precip : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ temp : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## .. ..$ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## .. ..$ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## .. ..$ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## .. ..$ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## .. ..$ germ : Factor w/ 3 levels "0","1","2": 1 2 3 2 3 2 1 3 2 3 ...
## .. ..$ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ leaf.size : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## .. ..$ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## .. ..$ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## .. ..$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## .. ..$ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ res.imputeMCA: num [1:683, 1:118] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:683] "1" "2" "3" "4" ...
## .. ..$ : chr [1:118] "2-4-d-injury" "alternarialeaf-spot" "anthracnose" "bacterial-blight" ...
## $ call :List of 8
## ..$ X :'data.frame': 683 obs. of 36 variables:
## .. ..$ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## .. ..$ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## .. ..$ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## .. ..$ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## .. ..$ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## .. ..$ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## .. ..$ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## .. ..$ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## .. ..$ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## .. ..$ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## .. ..$ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## .. ..$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## .. ..$ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ nboot : num 2
## ..$ ncp : num 1
## ..$ coeff.ridge: num 1
## ..$ threshold : num 1e-06
## ..$ seed : NULL
## ..$ maxiter : num 1000
## ..$ tab.disj : num [1:683, 1:118, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "class")= chr [1:2] "MIMCA" "list"
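According to the str() output, res$res.MI is a list with one imputed data frame per bootstrap replicate. A quick sanity check is to confirm that no missing values remain in any of them. This is a minimal base R sketch; remaining_na is a hypothetical helper name.

```r
# Hypothetical helper: number of NA cells left in each imputed data frame.
remaining_na <- function(imputed_list) {
  vapply(imputed_list, function(d) sum(is.na(d)), numeric(1))
}

# Usage with the MIMCA result above:
# remaining_na(res$res.MI)
# first_imputed <- res$res.MI[[1]]
```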