library(mlbench)
library(tidyverse) # dplyr, tidyr, purrr, tibble, ggplot2
library(skimr) # skim()
data(Soybean)
## See ?Soybean for details
A degenerate distribution is one in which a random variable takes a single possible value. We first remove rows with missing values so that the min and max functions return a value for every column. No predictor has a minimum equal to its maximum, so no column in this data set is purely degenerate. However, the frequency distribution plots suggest that a few predictors are heavily concentrated in one value and could be removed: leaves, shriveling, leaf.malf, mold.growth and lodging. We should definitely remove mycelium, since nearly all of its values are the same, which makes it close to degenerate.
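# Drop rows with any missing value so min and max are defined for every column
S <- drop_na(Soybean)
# apply() coerces the factors to character, so min and max compare level labels as strings
mi <- as.data.frame(apply(S, 2, min))
ma <- as.data.frame(apply(S, 2, max))
mi <- tibble::rownames_to_column(mi, "predictor")
ma <- tibble::rownames_to_column(ma, "predictor")
m <- as.data.frame(merge(mi, ma, by = "predictor"))
m <- m %>%
  rename(
    "min" = "apply(S, 2, min)",
    "max" = "apply(S, 2, max)"
  )
m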
## predictor min max
## 1 area.dam 0 3
## 2 canker.lesion 0 3
## 3 Class alternarialeaf-spot rhizoctonia-root-rot
## 4 crop.hist 0 3
## 5 date 0 6
## 6 ext.decay 0 1
## 7 fruit.pods 0 3
## 8 fruit.spots 0 4
## 9 fruiting.bodies 0 1
## 10 germ 0 2
## 11 hail 0 1
## 12 int.discolor 0 2
## 13 leaf.halo 0 2
## 14 leaf.malf 0 1
## 15 leaf.marg 0 2
## 16 leaf.mild 0 2
## 17 leaf.shread 0 1
## 18 leaf.size 0 2
## 19 leaves 0 1
## 20 lodging 0 1
## 21 mold.growth 0 1
## 22 mycelium 0 1
## 23 plant.growth 0 1
## 24 plant.stand 0 1
## 25 precip 0 2
## 26 roots 0 2
## 27 sclerotia 0 1
## 28 seed 0 1
## 29 seed.discolor 0 1
## 30 seed.size 0 1
## 31 seed.tmt 0 2
## 32 sever 0 2
## 33 shriveling 0 1
## 34 stem 0 1
## 35 stem.cankers 0 3
## 36 temp 0 2
m %>% filter(min == max)
## [1] predictor min max
## <0 rows> (or 0-length row.names)
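A min/max comparison only catches predictors that are exactly constant. A rough way to spot near-degenerate predictors such as mycelium is the ratio between the counts of the most common and second most common level. The sketch below is illustrative only: `freq_ratio` and the cutoff of 19 (a 95/5 split) are assumptions, not part of the original analysis, and the toy columns are rebuilt from the level counts that `skim()` reports in part (b).

```r
# Ratio of the most common level's count to the second most common level's count.
# NA values are ignored; a column with a single observed level returns Inf.
freq_ratio <- function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  counts[1] / counts[2]
}

# Toy columns rebuilt from the level counts in the skim() output
toy <- data.frame(
  mycelium = factor(c(rep(0, 639), rep(1, 6), rep(NA, 38))),
  leaves   = factor(c(rep(1, 606), rep(0, 77))),
  area.dam = factor(c(rep(1, 227), rep(3, 187), rep(2, 145), rep(0, 123), NA))
)

ratios <- sapply(toy, freq_ratio)
round(ratios, 1)

# Hypothetical cutoff: flag predictors whose ratio exceeds 19 (a 95/5 split)
flagged <- names(ratios)[ratios > 19]
flagged
```

With these counts the ratio is 106.5 for mycelium, about 7.9 for leaves and about 1.2 for area.dam, so only mycelium is flagged. On the real data the same function could be applied with `sapply(Soybean[, -1], freq_ratio)`.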
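# Convert a factor whose levels are digits to the corresponding numbers
asNumeric <- function(x) as.numeric(as.character(x))
# Apply asNumeric to every factor column of d, leaving other columns unchanged
factorsNumeric <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
                                                   asNumeric))
# Class has text labels that cannot be converted to numbers, so exclude it
So <- factorsNumeric(Soybean[, names(Soybean) != "Class"])
So %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()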
### (b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Using the skim function from the skimr library, we can see that some predictors are much more likely to be missing than others. The least complete are hail, sever, seed.tmt and lodging, each missing 121 records for a completion rate of 82.3%. Only 5 of the 19 classes have any NA fields: in 4 of them every record contains a missing field, and in the fifth (phytophthora-rot) roughly 3/4 of the records (68 of 88) do.
skim(Soybean)
| | |
|---|---|
| Name | Soybean |
| Number of rows | 683 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| factor | 36 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |
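The missing counts and completion rates in the table above can also be computed directly in base R, which makes it easy to sort predictors by how incomplete they are. This is a minimal sketch: `missing_summary` is a hypothetical helper and the toy data frame is invented purely for illustration.

```r
# Count and share of missing values per column, most incomplete first
missing_summary <- function(d) {
  n_missing <- colSums(is.na(d))
  out <- data.frame(
    predictor     = names(n_missing),
    n_missing     = as.integer(n_missing),
    complete_rate = round(1 - as.numeric(n_missing) / nrow(d), 3)
  )
  out <- out[order(-out$n_missing), ]
  rownames(out) <- NULL
  out
}

# Hypothetical toy data frame standing in for Soybean
toy <- data.frame(
  hail   = c(NA, 0, 1, NA),
  leaves = c(1, 1, 0, 1),
  germ   = c(2, NA, 1, 0)
)
res <- missing_summary(toy)
res
```

On the real data, `head(missing_summary(Soybean))` would list the least complete predictors first.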
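# Rows with at least one missing value
sNA <- Soybean[rowSums(is.na(Soybean)) > 0, ]
library(sqldf)
# Number of incomplete rows per class
sm <- sqldf("select count(Class) null_count, Class from sNA group by Class")
# Total number of rows per class
soy <- sqldf("select count(Class) full_count, Class from Soybean group by Class")
# The inner join keeps only the classes that have incomplete rows
s <- sqldf("select a.class, null_count, full_count from sm a join soy b on a.class = b.class")
s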
## Class null_count full_count
## 1 2-4-d-injury 16 16
## 2 cyst-nematode 14 14
## 3 diaporthe-pod-&-stem-blight 15 15
## 4 herbicide-injury 8 8
## 5 phytophthora-rot 68 88
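The same per-class comparison can be sketched without SQL by using `complete.cases()` and `tapply()`, which also makes it easy to add the share of incomplete rows. The toy data frame and the name `s2` are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical toy data: a class label and two predictors with gaps
toy <- data.frame(
  Class = c("a", "a", "b", "b", "b", "c"),
  hail  = c(NA, NA, 0, NA, 1, 1),
  germ  = c(1, NA, 2, 0, 1, 2)
)

# TRUE for rows that contain at least one NA
incomplete <- !complete.cases(toy)

# Incomplete rows and total rows per class
null_count <- tapply(incomplete, toy$Class, sum)
full_count <- tapply(incomplete, toy$Class, length)

s2 <- data.frame(
  Class      = names(null_count),
  null_count = as.integer(null_count),
  full_count = as.integer(full_count),
  share      = as.numeric(null_count / full_count)
)

# Keep only the classes that have incomplete rows
s2[s2$null_count > 0, ]
```

For the Soybean table above, the share column would be 1 for the four fully affected classes and 68/88, about 0.77, for phytophthora-rot.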
For predictors that are entirely NA for a whole class (e.g. hail is NA across 5 different classes), I would either create a dummy variable indicating whether the predictor was filled in, or remove the predictor entirely. Relying on a filled-in indicator may be a problem, because the missingness likely reflects how the data were collected and may not hold up over time. For predictors that have some data within a class, I would impute an average value of that predictor for the given class.
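The strategy above could be sketched as follows. Because the predictors are categorical, this sketch swaps in the most common level within each class (the mode) as the "average", which is an assumption rather than the author's stated method. `mode_value`, `impute_by_class` and the toy data are hypothetical names and values used only for illustration.

```r
# Most frequent non-missing level (first one wins on ties); NA if nothing is observed
mode_value <- function(x) {
  obs <- x[!is.na(x)]
  if (length(obs) == 0) return(NA)
  names(which.max(table(obs)))
}

# Fill NAs in x with the mode of x inside each class.
# Classes with no observed values are left as NA.
impute_by_class <- function(x, class) {
  for (cl in unique(class)) {
    idx <- class == cl
    fill <- mode_value(x[idx])
    if (!is.na(fill)) x[idx & is.na(x)] <- fill
  }
  x
}

# Hypothetical toy data: hail is partly missing in class a, fully missing in class b
toy <- data.frame(
  Class = c("a", "a", "a", "b", "b"),
  hail  = factor(c("0", NA, "0", NA, NA))
)

# Dummy variable recording whether hail was originally missing
toy$hail_missing <- as.integer(is.na(toy$hail))
toy$hail <- impute_by_class(toy$hail, toy$Class)
toy
```

In the toy result, the gap in class a is filled with "0", while class b stays NA because it has no observed values to impute from; only the indicator column carries information for that class.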