library(mlbench) # the Soybean data set ships with the mlbench package
data(Soybean)
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
The summary shows a data frame with 683 observations on 36 variables: the class label, a nominal factor with 19 levels, and 35 categorical predictors, some nominal and some ordered. The predictor values are encoded numerically, with the first value encoded as "0", the second as "1", and so forth.
a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(sqldf) # runs SQL queries directly against data frames such as Soybean
f <- colnames(Soybean)
for (i in seq_along(f)) {
  # Two largest groups per column; GROUP BY counts NA as a group of its own
  sSQL <- paste0("SELECT [", f[i], "], COUNT(1) Occurrences FROM Soybean GROUP BY [", f[i], "] ORDER BY COUNT(1) DESC LIMIT 2")
  a <- sqldf(sSQL)
  ratio <- round(a[1, 2] / a[2, 2], 1)
  # Report columns whose largest group is more than 10 times the runner-up
  if (ratio > 10) {
    print(paste0(f[i], " Most Frequent: ", a[1, 2], " Second Most: ", a[2, 2], " Ratio: ", ratio))
  }
}
## [1] "mycelium Most Frequent: 639 Second Most: 38 Ratio: 16.8"
## [1] "int.discolor Most Frequent: 581 Second Most: 44 Ratio: 13.2"
## [1] "sclerotia Most Frequent: 625 Second Most: 38 Ratio: 16.4"
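The loop flags mycelium, int.discolor and sclerotia, with reported ratios between the most frequent and second most frequent groups all below twenty. However, the query counts NA as a group, so for mycelium and sclerotia the "second most frequent" count of 38 is the number of missing values. Using the observed levels from the summary above, the ratios are 639/6 ≈ 106.5 for mycelium and 625/20 ≈ 31.3 for sclerotia; leaf.mild (535/20 ≈ 26.8) is not flagged only because its 108 missing values rank second. With a cutoff of twenty, mycelium, sclerotia and leaf.mild are therefore the predictors at risk of being degenerate, while int.discolor (581/44 ≈ 13.2) is not.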
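Because the SQL `GROUP BY` treats `NA` as a group, the ratio can end up comparing the top level with the number of missing values. The following is a minimal base-R sketch of a frequency ratio computed over observed levels only. The helper name `freq_ratio` and the toy data frame are assumptions made for illustration; the toy columns are built from the counts printed by `summary(Soybean)` above.

```r
# Ratio of the most frequent to the second most frequent observed level.
# freq_ratio is a hypothetical helper, not part of any package.
freq_ratio <- function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE) # table() drops NA by default
  if (length(counts) < 2) return(Inf)                     # a single level is fully degenerate
  counts[1] / counts[2]
}

# Toy stand-ins rebuilt from the counts in summary(Soybean)
toy <- data.frame(
  mycelium  = factor(c(rep("0", 639), rep("1", 6), rep(NA, 38))),
  leaf.mild = factor(c(rep("0", 535), rep("1", 20), rep("2", 20), rep(NA, 108)))
)

ratios <- sapply(toy, freq_ratio)
print(round(ratios, 1))
```

Running `sapply(Soybean[, -1], freq_ratio)` would apply the same check to every predictor in the exercise data. The `nearZeroVar()` function in the caret package performs a related check and could be used instead.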
b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing?
df <- Soybean # copy to a new data frame so it can be edited without changing the original
na_count <- sapply(df, function(y) sum(is.na(y))) # number of missing values per column
na_count <- data.frame(na_count) # convert the named integer vector to a data frame
library(data.table)
invisible(setDT(na_count, keep.rownames = TRUE)[]) # convert the data frame index to a data column
na_count <- na_count[order(-na_count),] # sort by missing values count, descending
library(knitr)
kable(na_count[1:15, ]) # show top 15 missing values
| rn | na_count |
|---|---|
| hail | 121 |
| sever | 121 |
| seed.tmt | 121 |
| lodging | 121 |
| germ | 112 |
| leaf.mild | 108 |
| fruiting.bodies | 106 |
| fruit.spots | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.shread | 100 |
| seed | 92 |
| mold.growth | 92 |
| seed.size | 92 |
| leaf.halo | 84 |
The table above lists the 15 predictors with the most missing values. hail, sever, seed.tmt and lodging top the list with 121 missing values each, about 18% of the 683 observations. The identical counts suggest that these values are missing together in the same observations, possibly within particular classes. fruiting.bodies, fruit.spots, seed.discolor and shriveling each have 106 missing values.
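The counts are easier to compare as proportions. Below is a small base-R sketch that reports the share of missing values per predictor and the share of rows with at least one missing value. The helper name `missing_share` and the toy data frame are assumptions for illustration only.

```r
# Share of missing values per column, keeping only columns that have any.
# missing_share is a hypothetical helper, not part of any package.
missing_share <- function(df) {
  share <- colMeans(is.na(df))              # fraction of NA in each column
  sort(share[share > 0], decreasing = TRUE)
}

# Toy data with the same kind of columns as Soybean
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b")),
  hail  = factor(c("0", NA, NA, "1")),
  roots = factor(c("0", "0", NA, "1"))
)

share <- missing_share(toy)
print(share)

# Share of rows with at least one missing value
row_share <- mean(!complete.cases(toy))
print(row_share)
```

Calling `missing_share(Soybean)` and `mean(!complete.cases(Soybean))` would give the corresponding figures for the exercise data.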
library(dplyr)
df <- select(Soybean, Class, hail, sever, seed.tmt, lodging, fruiting.bodies, fruit.spots, seed.discolor, shriveling)
DT <- data.table(df)
DT[, lapply(.SD, function(x) sum(is.na(x))) , by = list(Class)]
## Class hail sever seed.tmt lodging fruiting.bodies
## 1: diaporthe-stem-canker 0 0 0 0 0
## 2: charcoal-rot 0 0 0 0 0
## 3: rhizoctonia-root-rot 0 0 0 0 0
## 4: phytophthora-rot 68 68 68 68 68
## 5: brown-stem-rot 0 0 0 0 0
## 6: powdery-mildew 0 0 0 0 0
## 7: downy-mildew 0 0 0 0 0
## 8: brown-spot 0 0 0 0 0
## 9: bacterial-blight 0 0 0 0 0
## 10: bacterial-pustule 0 0 0 0 0
## 11: purple-seed-stain 0 0 0 0 0
## 12: anthracnose 0 0 0 0 0
## 13: phyllosticta-leaf-spot 0 0 0 0 0
## 14: alternarialeaf-spot 0 0 0 0 0
## 15: frog-eye-leaf-spot 0 0 0 0 0
## 16: diaporthe-pod-&-stem-blight 15 15 15 15 0
## 17: cyst-nematode 14 14 14 14 14
## 18: 2-4-d-injury 16 16 16 16 16
## 19: herbicide-injury 8 8 8 8 8
## fruit.spots seed.discolor shriveling
## 1: 0 0 0
## 2: 0 0 0
## 3: 0 0 0
## 4: 68 68 68
## 5: 0 0 0
## 6: 0 0 0
## 7: 0 0 0
## 8: 0 0 0
## 9: 0 0 0
## 10: 0 0 0
## 11: 0 0 0
## 12: 0 0 0
## 13: 0 0 0
## 14: 0 0 0
## 15: 0 0 0
## 16: 0 0 0
## 17: 14 14 14
## 18: 16 16 16
## 19: 8 8 8
number.of.na <- apply(Soybean, 1, function(x){sum(is.na(x))})
class.soybean <- Soybean$Class
soybean.na.df <- data.frame(class.soybean, number.of.na)
kable(head(soybean.na.df,10))
| class.soybean | number.of.na |
|---|---|
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
| diaporthe-stem-canker | 0 |
results <- aggregate(soybean.na.df$number.of.na, by=list(class.soybean=soybean.na.df$class.soybean), FUN=sum)
kable(results[order(results[,"x"]),])
| | class.soybean | x |
|---|---|---|
| 2 | alternarialeaf-spot | 0 |
| 3 | anthracnose | 0 |
| 4 | bacterial-blight | 0 |
| 5 | bacterial-pustule | 0 |
| 6 | brown-spot | 0 |
| 7 | brown-stem-rot | 0 |
| 8 | charcoal-rot | 0 |
| 11 | diaporthe-stem-canker | 0 |
| 12 | downy-mildew | 0 |
| 13 | frog-eye-leaf-spot | 0 |
| 15 | phyllosticta-leaf-spot | 0 |
| 17 | powdery-mildew | 0 |
| 18 | purple-seed-stain | 0 |
| 19 | rhizoctonia-root-rot | 0 |
| 14 | herbicide-injury | 160 |
| 10 | diaporthe-pod-&-stem-blight | 177 |
| 9 | cyst-nematode | 336 |
| 1 | 2-4-d-injury | 450 |
| 16 | phytophthora-rot | 1214 |
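Total counts per class depend on class size, so it can help to look at the share of incomplete rows in each class instead. For example, the tables above show 68 incomplete rows among the 88 phytophthora-rot observations. The sketch below is one way to compute this in base R; the helper name `incomplete_by_class` and the toy data are assumptions for illustration.

```r
# Number and share of rows with at least one NA, per class.
# incomplete_by_class is a hypothetical helper, not part of any package.
incomplete_by_class <- function(df, class_col = "Class") {
  incomplete   <- !complete.cases(df)       # TRUE if the row has any NA
  n_rows       <- tapply(incomplete, df[[class_col]], length)
  n_incomplete <- tapply(incomplete, df[[class_col]], sum)
  out <- data.frame(
    class        = names(n_rows),
    n_rows       = as.vector(n_rows),
    n_incomplete = as.vector(n_incomplete),
    share        = as.vector(n_incomplete / n_rows)
  )
  out[order(-out$share), ]                  # classes with most missingness first
}

# Toy data: class "b" has two incomplete rows out of three
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "b")),
  hail  = factor(c("0", "1", NA, NA, "1")),
  roots = factor(c("0", "0", NA, "1", "1"))
)

by_class <- incomplete_by_class(toy)
print(by_class)
```

`incomplete_by_class(Soybean)` would produce the same breakdown for the 19 soybean classes.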
c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
head(Soybean[Soybean$Class=='phytophthora-rot',],10)
## Class date plant.stand precip temp hail crop.hist area.dam sever
## 31 phytophthora-rot 0 1 2 1 1 1 1 1
## 32 phytophthora-rot 1 1 2 1 <NA> 3 1 <NA>
## 33 phytophthora-rot 2 1 2 2 <NA> 2 1 <NA>
## 34 phytophthora-rot 1 1 2 0 0 2 1 2
## 35 phytophthora-rot 2 1 2 2 <NA> 2 1 <NA>
## 36 phytophthora-rot 3 1 2 1 <NA> 2 1 <NA>
## 37 phytophthora-rot 0 1 1 1 0 1 1 1
## 38 phytophthora-rot 3 1 2 0 0 2 1 2
## 39 phytophthora-rot 2 1 1 1 <NA> 0 1 <NA>
## 40 phytophthora-rot 2 1 2 0 0 1 1 2
## seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 31 0 0 1 1 0 2 2 0
## 32 <NA> <NA> 1 1 0 2 2 0
## 33 <NA> <NA> 1 1 <NA> <NA> <NA> <NA>
## 34 1 1 1 1 0 2 2 0
## 35 <NA> <NA> 1 1 <NA> <NA> <NA> <NA>
## 36 <NA> <NA> 1 1 <NA> <NA> <NA> <NA>
## 37 0 0 1 1 0 2 2 0
## 38 1 1 1 1 0 2 2 0
## 39 <NA> <NA> 1 1 0 2 2 0
## 40 0 1 1 1 0 2 2 0
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies
## 31 0 0 1 0 1 2 0
## 32 0 0 1 <NA> 2 2 <NA>
## 33 <NA> <NA> 1 <NA> 3 2 <NA>
## 34 0 0 1 0 2 2 0
## 35 <NA> <NA> 1 <NA> 2 2 <NA>
## 36 <NA> <NA> 1 <NA> 3 2 <NA>
## 37 0 0 1 0 1 2 0
## 38 0 0 1 0 2 2 0
## 39 0 0 1 <NA> 2 2 <NA>
## 40 0 0 1 0 1 2 0
## ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 31 1 0 0 0 3 4 0
## 32 0 0 0 0 <NA> <NA> <NA>
## 33 0 0 0 0 <NA> <NA> <NA>
## 34 0 0 0 0 3 4 0
## 35 0 0 0 0 <NA> <NA> <NA>
## 36 0 0 0 0 <NA> <NA> <NA>
## 37 0 0 0 0 3 4 0
## 38 0 0 0 0 3 4 0
## 39 0 0 0 0 <NA> <NA> <NA>
## 40 0 0 0 0 3 4 0
## mold.growth seed.discolor seed.size shriveling roots
## 31 0 0 0 0 0
## 32 <NA> <NA> <NA> <NA> 1
## 33 <NA> <NA> <NA> <NA> 1
## 34 0 0 0 0 0
## 35 <NA> <NA> <NA> <NA> 1
## 36 <NA> <NA> <NA> <NA> 1
## 37 0 0 0 0 0
## 38 0 0 0 0 0
## 39 <NA> <NA> <NA> <NA> 1
## 40 0 0 0 0 0
The tables above show that the missing values are confined to five classes: phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight and herbicide-injury. phytophthora-rot alone accounts for 1214 missing values, and its incomplete rows are missing many of the leaf, fruit and seed predictors at once. It is possible that these attributes were not fully recorded when the data was collected. The missing data is therefore related to the class.
The preferred strategy would be to collect the missing data so that the analysis can be performed properly.
If that is not possible, a classification model should be built both with and without imputation (for example, using the mice package) to check whether imputation improves the predictive results.
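Before turning to model-based imputation, two simple baselines can be sketched in base R: filling each predictor with its most frequent level, and keeping missingness as an explicit level, which may be reasonable here because the missing values are tied to the class. This is only a sketch under those assumptions and does not use the mice package. The helper names `impute_mode` and `na_as_level` and the toy data are hypothetical.

```r
# Baseline 1: replace NA with the most frequent observed level.
impute_mode <- function(x) {
  counts <- table(x)                               # table() drops NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Baseline 2: keep missingness as its own level.
# Note: the result is an unordered factor, so any level ordering is dropped.
na_as_level <- function(x, label = "missing") {
  lv <- levels(x)
  x  <- as.character(x)
  x[is.na(x)] <- label
  factor(x, levels = c(lv, label))
}

# Toy data with one unordered and one ordered factor
toy <- data.frame(
  hail  = factor(c("0", "0", "1", NA, NA)),
  sever = factor(c("1", "1", "2", NA, "1"), levels = c("0", "1", "2"), ordered = TRUE)
)

toy_mode  <- as.data.frame(lapply(toy, impute_mode))
toy_level <- as.data.frame(lapply(toy, na_as_level))
print(toy_mode)
print(toy_level)
```

Either version can be passed to the same classifier as the unimputed data to compare predictive performance, which is the comparison the strategy above calls for.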