The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
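library(mlbench)  # provides the Soybean data set
library(dplyr)    # provides %>%, mutate(), group_by()
library(ggplot2)  # used for the bar charts below
data(Soybean)
# summary(Soybean)
df.soy = Soybean %>% data.frame(stringsAsFactors = FALSE)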
dim(df.soy)
## [1] 683 36
head(df.soy, 2)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
# Class holds text labels, so as.numeric() turns that column into NA (hence the warning)
df.soy.new = apply(df.soy, 2, function(x) as.numeric(as.character(x))) %>%
data.frame()
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
df.soy.long<- stack(df.soy.new)
dim(df.soy.long)
## [1] 24588 2
head(df.soy.long)
## values ind
## 1 NA Class
## 2 NA Class
## 3 NA Class
## 4 NA Class
## 5 NA Class
## 6 NA Class
tail(df.soy.long)
## values ind
## 24583 NA roots
## 24584 NA roots
## 24585 1 roots
## 24586 1 roots
## 24587 1 roots
## 24588 1 roots
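To visualize the categorical predictors, every column is first converted to its numeric code. The Class column holds disease names rather than codes, so it cannot be converted and becomes NA; this is the source of the coercion warning. In the plot the codes are converted back to factors. The codes start at 0 and run up to at most 6, depending on the predictor, and missing values appear as their own NA category.
The stack function then reshapes the data into long format. The result has 24588 rows (683 observations times 36 columns) and two variables: values, which holds the code, and ind, which holds the name of the original column. The ind variable is used to facet the plot, giving one panel per variable.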
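The reshaping step can also be used to tabulate the frequencies directly, as a numeric complement to the bar charts. The following is a minimal sketch on a hypothetical two-column data frame (`toy`, `toy_num`, `toy_long`, and `freq` are illustrative names, not part of the analysis above); it converts each column separately with `lapply`, which avoids the matrix coercion that `apply` performs.

```r
# Hypothetical toy data: two coded categorical predictors with missing values
toy <- data.frame(
  hail   = factor(c("0", "1", NA, "0")),
  precip = factor(c("2", "2", "1", NA))
)

# Convert factor labels to numeric codes, one column at a time
toy_num <- as.data.frame(lapply(toy, function(x) as.numeric(as.character(x))))

# stack() yields one row per cell: 'values' holds the code, 'ind' the column name
toy_long <- stack(toy_num)
print(dim(toy_long))

# Frequency of each code per variable, keeping NA as its own category
freq <- table(toy_long$ind, toy_long$values, useNA = "ifany")
print(freq)
```

A table like this makes a degenerate distribution easy to spot: a row in which one code holds nearly all of the counts.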
ggplot(data = df.soy.long,
aes(x=as.factor(values),
fill=as.factor(values) )) +
geom_bar(stat='count', width=1) +
facet_wrap(~ind)
library(caret)
## Loading required package: lattice
nearZeroVar(df.soy, names=TRUE, saveMetrics = T)
## freqRatio percentUnique zeroVar nzv
## Class 1.010989 2.7818448 FALSE FALSE
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
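The nearZeroVar function flags predictors with near-zero variance. Three predictors are flagged here (nzv = TRUE): leaf.mild, mycelium, and sclerotia, with frequency ratios of 26.75, 106.5, and 31.25 respectively. No predictor has exactly zero variance. Several others, such as shriveling (14.18), int.discolor (13.20), lodging (12.38), and leaf.malf (12.31), are strongly unbalanced but fall below the cutoff.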
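To make the two metrics concrete, here is a simplified base R sketch of how they can be computed. It assumes caret's default cutoffs (freqCut = 95/5 and uniqueCut = 10) and ignores the zero-variance special case; `nzv_metrics` is a hypothetical helper, not a caret function. The input vector is a made-up stand-in with 639 zeros, 6 ones, and 38 missing values, chosen so that it reproduces the freqRatio and percentUnique reported for mycelium.

```r
# Hypothetical stand-in for a sparse predictor: 639 zeros, 6 ones, 38 missing
x <- c(rep(0, 639), rep(1, 6), rep(NA, 38))

# Simplified version of the near-zero-variance metrics (not caret's own code)
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  # Ratio of the most common value's count to the second most common value's count
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  # Distinct non-missing values as a percentage of all rows
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    nzv           = freq_ratio > freq_cut & percent_unique <= unique_cut
  )
}

m <- nzv_metrics(x)
print(m)
```

A predictor is flagged only when both conditions hold: one value dominates the next most frequent one, and the number of distinct values is small relative to the sample size.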
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
library(VIM)  # provides aggr() and kNN()
ggr_plot <- aggr(df.soy.new,
col=c('lightblue','yellow'), numbers=TRUE, sortVars=TRUE,
labels=names(df.soy), cex.axis=.7, gap=3,
ylab=c("Histogram, Missing data", "Pattern, Missing Data"))
##
## Variables sorted by number of missings:
## Variable Count
## Class 1.000000000
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## leaves 0.000000000
aggr is a function in the VIM package (it draws with base graphics, not ggplot2) that summarizes missing values. The left panel is a bar chart of the proportion of missing values in each variable. The right panel shows which combinations of variables are missing together and how often each combination occurs. Together they give an overview of how much is missing per variable and in what patterns. Class appears as 100% missing only as an artifact of the numeric conversion above, which turned its text labels into NA. Among the predictors, hail, sever, seed.tmt, and lodging are tied for the most missing values (17.7%, or 121 of 683 rows), while leaves has none.
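The per-variable proportions that aggr prints can be reproduced in base R with `is.na` and `colMeans`. The sketch below uses a hypothetical three-column data frame (`toy` and `miss_prop` are illustrative names), so the numbers are not those of the soybean data.

```r
# Hypothetical toy data with different amounts of missingness per column
toy <- data.frame(
  hail   = c(0, NA, NA, 1),
  leaves = c(1, 1, 0, 1),
  temp   = c(NA, 1, 2, 0)
)

# is.na() gives a logical matrix; the column means are the missing proportions
miss_prop <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(miss_prop)
```

Applying the same two calls to the predictors only (leaving Class out) avoids the spurious 100% entry for the outcome.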
missing.total.long<-
df.soy %>%
mutate (total=n()) %>%
group_by(Class) %>%
mutate (missingcnt.total=n() )%>%
select (Class,missingcnt.total) %>%
unique() %>%
arrange(-missingcnt.total)
head(missing.total.long, 8)
## # A tibble: 8 x 2
## # Groups: Class [8]
## Class missingcnt.total
## <fct> <int>
## 1 brown-spot 92
## 2 alternarialeaf-spot 91
## 3 frog-eye-leaf-spot 91
## 4 phytophthora-rot 88
## 5 brown-stem-rot 44
## 6 anthracnose 44
## 7 diaporthe-stem-canker 20
## 8 charcoal-rot 20
This code groups the data by class, counts the rows in each group with n(), keeps one row per class, and sorts the result in descending order. Note that n() counts every observation in a class, so despite its name missingcnt.total holds the class size, not the number of missing values.
The largest classes are brown-spot (92), alternarialeaf-spot (91), frog-eye-leaf-spot (91), and phytophthora-rot (88), followed by brown-stem-rot and anthracnose (44 each) and diaporthe-stem-canker and charcoal-rot (20 each). To answer whether missingness is related to class, the count would have to be based on is.na() applied to the predictors rather than on n().
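One way to count missing values by class, sketched here in base R on hypothetical data (`toy`, `predictors`, and `result` are illustrative names): for each class, sum the missing cells and count the rows that have at least one missing predictor, and report both next to the class size.

```r
# Hypothetical toy data: class "a" is complete, class "b" has missing values
toy <- data.frame(
  Class = c("a", "a", "b", "b", "b"),
  hail  = c(0, 1, NA, NA, 1),
  temp  = c(1, 2, NA, 0, 1)
)

# Work on the predictors only, so the outcome column is never counted
predictors <- toy[, setdiff(names(toy), "Class"), drop = FALSE]

class_size      <- table(toy$Class)
na_cells        <- tapply(rowSums(is.na(predictors)), toy$Class, sum)
incomplete_rows <- tapply(!complete.cases(predictors), toy$Class, sum)

result <- data.frame(
  Class           = names(class_size),
  n               = as.vector(class_size),
  na_cells        = as.vector(na_cells),
  incomplete_rows = as.vector(incomplete_rows)
)
print(result)
```

Dividing incomplete_rows by n gives the share of each class affected, which separates large classes from classes that are actually prone to missing data.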
cntplot<- ## no print after the assignment
ggplot(data = missing.total.long,
aes(x=reorder(Class, missingcnt.total),
y =missingcnt.total,
fill = Class)) +
geom_bar(stat='identity') +
theme ( axis.title.x = element_text(size=10),
axis.text.x = element_text(size=8, angle=45, hjust=1, vjust=1)
) +
ggtitle ('Number of Observations, by Class')
missing.proport<-
df.soy %>% mutate (total=n()) %>%
group_by(Class) %>%
mutate (missing.cnt=n(), Proportion=missing.cnt/total) %>%
select (Class, Proportion) %>%
unique() %>%
arrange(-Proportion)
The count plot above is stored in cntplot without being printed, so that it can be displayed alongside the proportion plot later.
propplot<-
ggplot(data = missing.proport,
aes(x=reorder(Class, Proportion),
y =Proportion,
fill = Class)) +
geom_bar(stat='identity') +
theme ( axis.title.x = element_text(size=10),
axis.text.x = element_text(size=8, angle=45, hjust=1, vjust=1) ) +
ggtitle ('Proportion of Observations, by Class')
The proportion for each class (its count divided by the 683 total rows) is calculated in the same way and stored as a plot object, propplot, which is displayed later.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
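library(grid)  # provides textGrob()
# par(mfrow) has no effect on grid-based ggplot2 output; ncol = 2 places the plots side by side
grid.arrange(cntplot, propplot, ncol = 2, top = textGrob("Histogram of Class"))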
The grid.arrange function displays the per-class counts alongside the corresponding proportions of the full data set.
soy_complete <- kNN(df.soy)
head(soy_complete)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
## Class_imp date_imp plant.stand_imp precip_imp temp_imp hail_imp crop.hist_imp
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## area.dam_imp sever_imp seed.tmt_imp germ_imp plant.growth_imp leaves_imp
## 1 FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE
## leaf.halo_imp leaf.marg_imp leaf.size_imp leaf.shread_imp leaf.malf_imp
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE
## leaf.mild_imp stem_imp lodging_imp stem.cankers_imp canker.lesion_imp
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE
## fruiting.bodies_imp ext.decay_imp mycelium_imp int.discolor_imp sclerotia_imp
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE
## fruit.pods_imp fruit.spots_imp seed_imp mold.growth_imp seed.discolor_imp
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE
## seed.size_imp shriveling_imp roots_imp
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
dim(soy_complete)
## [1] 683 72
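The missing values were imputed with the kNN function from the VIM package. The result has 72 columns: the 36 original variables followed by 36 logical _imp columns that flag which cells were imputed. The six rows printed by head were already complete, which is why all of their flags are FALSE. The imputation is expected to fill every missing cell, and this can be confirmed with a check such as sum(is.na(soy_complete)).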
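The layout of the imputed table (original columns plus one _imp flag per column) and the final check can be illustrated without VIM. The sketch below deliberately swaps in simple mode imputation for kNN so that it runs in base R; `impute_mode` is a hypothetical helper and the data are made up, so it shows only the bookkeeping, not the neighbor-based method used above.

```r
# Hypothetical toy data with one missing value per column
toy <- data.frame(
  hail = c(0, 0, NA, 1),
  temp = c(1, NA, 1, 2)
)

# Mode imputation as a stand-in for kNN: fill each NA with the most frequent value
impute_mode <- function(df) {
  flags <- as.data.frame(is.na(df))          # TRUE where a cell will be imputed
  names(flags) <- paste0(names(df), "_imp")
  for (col in names(df)) {
    counts <- table(df[[col]])
    mode_value <- as.numeric(names(counts)[which.max(counts)])
    df[[col]][is.na(df[[col]])] <- mode_value
  }
  cbind(df, flags)                           # values first, then indicator columns
}

toy_complete <- impute_mode(toy)
print(toy_complete)
print(sum(is.na(toy_complete)))              # number of missing cells left
```

The column count doubles for the same reason that soy_complete has 72 columns, and the _imp flags make it possible to inspect the imputed rows specifically instead of only the first few rows.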