The soybean data is wide: 36 columns (the Class outcome plus 35 predictors) and 683 observations.
## load the soybean data (shipped with the mlbench package)
library(mlbench)
data("Soybean")
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
We can summarise every column with skim() and then draw a bar chart for each predictor.
library(skimr)
skim(Soybean)
| Name | Soybean |
| Number of rows | 683 |
| Number of columns | 36 |
| _______________________ | |
| Column type frequency: | |
| factor | 36 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |
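## Draw one horizontal bar chart per column.
## The .data pronoun lets aes() look a column up by a name stored in a string.
library(ggplot2)
columns <- colnames(Soybean)
p <- lapply(columns,
            function(col) {
              ggplot(Soybean, aes(.data[[col]])) +
                geom_bar() +
                coord_flip() +
                ggtitle(col)
            })
print(p)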
[Plot output: 36 horizontal bar charts, one per column of Soybean, each titled with its column name.]
Some of the distributions are close to degenerate: one level accounts for almost all of the observed values. For example, leaf.mild is 0 in 535 of its 575 non-missing values, lodging is 0 in 520 of 562, mycelium is 0 in 639 of 645 and sclerotia is 0 in 625 of 645. A variety of other predictors show the same imbalance.
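One way to put a number on this imbalance is to compute, for each predictor, the share of non-missing values that fall in its most frequent level. The sketch below uses base R only; the helper name dominant_share and the 0.90 cutoff are illustrative choices and not part of the original analysis.

```r
library(mlbench)   # assumed source of the Soybean data
data("Soybean")

# Share of the non-missing values taken by the most frequent level of a factor.
dominant_share <- function(x) {
  counts <- table(x)            # table() drops NA by default
  max(counts) / sum(counts)
}

predictors <- setdiff(colnames(Soybean), "Class")
shares <- sort(sapply(Soybean[predictors], dominant_share), decreasing = TRUE)

# Predictors where one level covers more than 90% of the observed values
# (the 0.90 cutoff is an illustrative choice, not a standard threshold).
round(shares[shares > 0.90], 3)
```

A predictor near the top of this list carries little information on its own, so it is a candidate for closer inspection before modelling.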
## count the missing values per column and plot them in order
Skem <- skim(Soybean)
ggplot(data = Skem, aes(x = reorder(skim_variable, n_missing), y = n_missing)) +
  geom_col() + coord_flip() +
  labs(title = "Missing Observations per Column (Ordered)",
       y = "# of Missing Observations", x = "Variable")
From the plot, we can see that there are quite a lot of missing values, especially for hail, sever, seed.tmt and lodging (121 each), which I would assume are not recorded for certain plant classes. Glancing at the data frame, some of the missing values make sense. For instance, the hail column codes yes as 0 and no as 1, so an NA may mean that the area where the plants were measured does not get hail at all. The pattern of missing data is related to the class of the plants: the predictors that measure fruit spots, fruit pods and seed characteristics may not pertain to every class, e.g. phytophthora-rot has many missing values in those predictors.
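The link between missingness and class can be checked directly instead of by glancing at the data frame. The following is a minimal sketch in base R, assuming the Soybean data comes from mlbench; the object names has_na, na_by_class and na_rate are introduced here for illustration.

```r
library(mlbench)   # assumed source of the Soybean data
data("Soybean")

# TRUE for every row that has at least one missing predictor value.
has_na <- !complete.cases(Soybean)

# Number of rows with and without missing values, per class;
# keep only the classes that have at least one incomplete row.
na_by_class <- table(Class = Soybean$Class, has_na = has_na)
na_by_class[na_by_class[, "TRUE"] > 0, , drop = FALSE]

# Proportion of missing values per predictor within each class
# (rows are classes, columns are predictors).
predictors <- setdiff(colnames(Soybean), "Class")
na_rate <- sapply(Soybean[predictors],
                  function(x) tapply(is.na(x), Soybean$Class, mean))
round(na_rate["phytophthora-rot", c("fruit.pods", "fruit.spots", "seed")], 2)
```

If the missing values are concentrated in a handful of classes, the first table shows it at a glance, and na_rate shows which predictors are affected in each of them.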
To handle the missing data, I would first try to get a better understanding of all the predictors and see which predictors are missing for which classes, since in this case each plant class has a different pattern of missing values. As a first attempt I might use a simple imputation; because every predictor is a factor, that would be the most frequent level (the mode) rather than the mean. Another option is to use a model to impute the missing data. In this case I might use k-nearest neighbors, where the imputed values are determined by the closest neighbors. We can use the recorded measurements for each plant class to impute the NA values, and I would check that each imputed value is similar to the observed values for its plant class. This is just the thought process I felt would be appropriate for this dataset.
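As a rough illustration of the simplest version of this idea, the sketch below fills each missing value with the most frequent level observed in the same plant class. This is class-wise mode imputation, not k-nearest neighbors, and the helpers mode_level and impute_by_class are hypothetical names written for this example. When a predictor is missing for every row of a class, the sketch falls back to the overall mode, which is an assumption that may not be appropriate if the value is structurally absent for that class.

```r
library(mlbench)   # assumed source of the Soybean data
data("Soybean")

# Most frequent non-missing level of a factor, or NA if nothing was observed.
mode_level <- function(x) {
  counts <- table(x)
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Fill the NAs of one factor with the mode of the same class; fall back to the
# overall mode when the predictor is missing for every row of that class.
impute_by_class <- function(x, cls) {
  overall <- mode_level(x)
  for (cl in levels(cls)) {
    in_class <- cls == cl
    fill <- mode_level(x[in_class])
    if (is.na(fill)) fill <- overall
    x[in_class & is.na(x)] <- fill
  }
  x
}

predictors <- setdiff(colnames(Soybean), "Class")
imputed <- Soybean
imputed[predictors] <- lapply(Soybean[predictors], impute_by_class,
                              cls = Soybean$Class)

sum(is.na(Soybean))  # missing values before imputation
sum(is.na(imputed))  # missing values after imputation
```

Comparing the bar charts of imputed against those of Soybean, class by class, is one way to check that the filled-in values look like the observed values for their plant class.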