(a)
skim(Soybean)
Data summary

|                        |         |
|:-----------------------|:--------|
| Name                   | Soybean |
| Number of rows         | 683     |
| Number of columns      | 36      |
| Column type frequency: |         |
| factor                 | 36      |
| Group variables        | None    |
Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|:--|--:|--:|:--|--:|:--|
| Class | 0 | 1.00 | FALSE | 19 | bro: 92, alt: 91, fro: 91, phy: 88 |
| date | 1 | 1.00 | FALSE | 7 | 5: 149, 4: 131, 3: 118, 2: 93 |
| plant.stand | 36 | 0.95 | TRUE | 2 | 0: 354, 1: 293 |
| precip | 38 | 0.94 | TRUE | 3 | 2: 459, 1: 112, 0: 74 |
| temp | 30 | 0.96 | TRUE | 3 | 1: 374, 2: 199, 0: 80 |
| hail | 121 | 0.82 | FALSE | 2 | 0: 435, 1: 127 |
| crop.hist | 16 | 0.98 | FALSE | 4 | 2: 219, 3: 218, 1: 165, 0: 65 |
| area.dam | 1 | 1.00 | FALSE | 4 | 1: 227, 3: 187, 2: 145, 0: 123 |
| sever | 121 | 0.82 | FALSE | 3 | 1: 322, 0: 195, 2: 45 |
| seed.tmt | 121 | 0.82 | FALSE | 3 | 0: 305, 1: 222, 2: 35 |
| germ | 112 | 0.84 | TRUE | 3 | 1: 213, 2: 193, 0: 165 |
| plant.growth | 16 | 0.98 | FALSE | 2 | 0: 441, 1: 226 |
| leaves | 0 | 1.00 | FALSE | 2 | 1: 606, 0: 77 |
| leaf.halo | 84 | 0.88 | FALSE | 3 | 2: 342, 0: 221, 1: 36 |
| leaf.marg | 84 | 0.88 | FALSE | 3 | 0: 357, 2: 221, 1: 21 |
| leaf.size | 84 | 0.88 | TRUE | 3 | 1: 327, 2: 221, 0: 51 |
| leaf.shread | 100 | 0.85 | FALSE | 2 | 0: 487, 1: 96 |
| leaf.malf | 84 | 0.88 | FALSE | 2 | 0: 554, 1: 45 |
| leaf.mild | 108 | 0.84 | FALSE | 3 | 0: 535, 1: 20, 2: 20 |
| stem | 16 | 0.98 | FALSE | 2 | 1: 371, 0: 296 |
| lodging | 121 | 0.82 | FALSE | 2 | 0: 520, 1: 42 |
| stem.cankers | 38 | 0.94 | FALSE | 4 | 0: 379, 3: 191, 1: 39, 2: 36 |
| canker.lesion | 38 | 0.94 | FALSE | 4 | 0: 320, 2: 177, 1: 83, 3: 65 |
| fruiting.bodies | 106 | 0.84 | FALSE | 2 | 0: 473, 1: 104 |
| ext.decay | 38 | 0.94 | FALSE | 3 | 0: 497, 1: 135, 2: 13 |
| mycelium | 38 | 0.94 | FALSE | 2 | 0: 639, 1: 6 |
| int.discolor | 38 | 0.94 | FALSE | 3 | 0: 581, 1: 44, 2: 20 |
| sclerotia | 38 | 0.94 | FALSE | 2 | 0: 625, 1: 20 |
| fruit.pods | 84 | 0.88 | FALSE | 4 | 0: 407, 1: 130, 3: 48, 2: 14 |
| fruit.spots | 106 | 0.84 | FALSE | 4 | 0: 345, 4: 100, 1: 75, 2: 57 |
| seed | 92 | 0.87 | FALSE | 2 | 0: 476, 1: 115 |
| mold.growth | 92 | 0.87 | FALSE | 2 | 0: 524, 1: 67 |
| seed.discolor | 106 | 0.84 | FALSE | 2 | 0: 513, 1: 64 |
| seed.size | 92 | 0.87 | FALSE | 2 | 0: 532, 1: 59 |
| shriveling | 106 | 0.84 | FALSE | 2 | 0: 539, 1: 38 |
| roots | 31 | 0.95 | FALSE | 3 | 0: 551, 1: 86, 2: 15 |
Several variables look degenerate in the bar plots below. Many of the leaf-related variables have heavily unbalanced level frequencies, and the main variable itself, leaves, consists mostly of 1 values (606 of 683). Other variables, such as mycelium, fruiting.bodies, int.discolor, sclerotia, seed, mold.growth, seed.discolor, seed.size, shriveling, and roots, are also potentially degenerate and need further investigation.
plot_bar(Soybean)
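One way to put a number on "degenerate" is the frequency ratio: the count of the most common level divided by the count of the second most common. The sketch below is an illustration in base R, not part of the original analysis. The helper `freq_ratio` and the `toy` data frame are assumptions introduced here; the toy columns are rebuilt from the level counts reported by skim() above rather than loaded from the real data.

```r
# Frequency ratio: count of the most common level divided by the count of
# the second most common level. table() ignores missing values by default.
freq_ratio <- function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # a single level is fully degenerate
  counts[1] / counts[2]
}

# Toy columns rebuilt from the skim() level counts (assumption: these stand
# in for the real Soybean columns of the same name).
toy <- data.frame(
  leaves    = factor(rep(c("1", "0"), times = c(606, 77))),
  mycelium  = factor(rep(c("0", "1", NA), times = c(639, 6, 38))),
  sclerotia = factor(rep(c("0", "1", NA), times = c(625, 20, 38)))
)

ratios <- sapply(toy, freq_ratio)
print(round(ratios, 2))
```

With the real data loaded, `sapply(Soybean, freq_ratio)` would give the same measure for every column. For comparison, `caret::nearZeroVar()` flags a predictor when this ratio exceeds its default `freqCut` of 95/5 = 19 and the share of unique values is also small; on that scale mycelium and sclerotia stand out far more than leaves does.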




(b)
plot_missing(Soybean)

On closer inspection, most of the variables with the largest share of missing values describe random, damaging events, such as hail, damage severity (sever), and lodging. Other variables, such as mold.growth, seed.size, and several of the leaf variables, were likely recorded poorly for certain soybeans. From this plot alone it is hard to tell whether the missingness follows a pattern.
profile_missing(Soybean) %>% arrange(desc(pct_missing))
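For readers without DataExplorer installed, roughly the same per-column summary can be computed in base R. This is a minimal sketch, not the package's implementation: the column names of `missing_profile` are chosen here to mirror the `profile_missing()` output, and the `toy` data frame is an assumption rebuilt from the counts in the skim() table above.

```r
# Toy columns rebuilt from the skim() counts (assumption: stand-ins for the
# real Soybean columns of the same name, 683 rows each).
toy <- data.frame(
  leaves    = factor(rep(c("1", "0"), times = c(606, 77))),
  hail      = factor(rep(c("0", "1", NA), times = c(435, 127, 121))),
  leaf.mild = factor(rep(c("0", "1", "2", NA), times = c(535, 20, 20, 108)))
)

# Count and proportion of missing values per column.
na_matrix <- is.na(toy)
missing_profile <- data.frame(
  feature     = colnames(na_matrix),
  num_missing = colSums(na_matrix),
  pct_missing = colMeans(na_matrix),
  row.names   = NULL
)

# Sort from most to least missing, as arrange(desc(pct_missing)) does.
missing_profile <- missing_profile[order(-missing_profile$pct_missing), ]
print(missing_profile)
```

Replacing `toy` with `Soybean` should reproduce the ranking produced by the call above.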
Below we break down the missing values by class. Of the 19 classes, only five have missing values, and diaporthe-pod-&-stem-blight, 2-4-d-injury, and especially phytophthora-rot account for a large share of them.
Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ as.factor(sum(is.na(.))))) %>%
  plot_bar(by = "Class", title = "Missing Values Per Class")




Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ sum(is.na(.))))
Removing just three of those five classes reduces the amount of missing data to a negligible level.
Soybean %>%
  filter(!Class %in% c('phytophthora-rot', '2-4-d-injury', 'diaporthe-pod-&-stem-blight')) %>%
  plot_missing()
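To check how much of the missing data each class accounts for, the share of all missing cells per class can be computed directly. The sketch below is illustrative only: `missing_share_by_class` is a hypothetical helper, and `toy` is made-up data, not the real Soybean values.

```r
# Share of all missing predictor cells contributed by each class.
# Assumes the data contains at least one missing value.
missing_share_by_class <- function(df, class_col = "Class") {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  na_per_row <- rowSums(is.na(predictors))
  na_per_class <- vapply(split(na_per_row, df[[class_col]]), sum, numeric(1))
  sort(na_per_class / sum(na_per_class), decreasing = TRUE)
}

# Hypothetical toy data (not the real Soybean values).
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "c", "c")),
  x1    = c(1, NA, 2, 3, NA, NA),
  x2    = c(NA, NA, 1, 1, 1, NA)
)

shares <- missing_share_by_class(toy)
print(shares)
```

Running `missing_share_by_class(Soybean)` on the real data would show what fraction of the missing cells falls in phytophthora-rot, 2-4-d-injury, and diaporthe-pod-&-stem-blight.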

(c)
With the above investigation in mind, we may not need to impute or eliminate any of these predictor variables. Most of our classes contain few or no missing values. It is unclear whether the missing values are caused by a sampling issue. What is clear is that if this pattern of missing values repeats, we could potentially use it to identify which class a soybean belongs to. It is also possible that this sampling error could lead us to misclassify an observation, so we would need to pay special attention to this area when working with future data.
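If the missingness pattern is to be used as a signal, one option is to turn NA into an explicit factor level so a model can see it instead of dropping or imputing those rows. This is a sketch of that idea only, not a step taken in the analysis above; `add_missing_level` and the label "missing" are names introduced here for illustration.

```r
# Recode NA as an explicit factor level so missingness is kept as information.
add_missing_level <- function(x, label = "missing") {
  x <- addNA(x)                              # append NA as a level
  levels(x)[is.na(levels(x))] <- label       # give that level a readable name
  x
}

# Hypothetical toy column (not the real Soybean values).
hail <- factor(c("0", "1", NA, "0", NA))
hail_explicit <- add_missing_level(hail)

print(table(hail_explicit))
```

Applied to the real data, `Soybean[-1] <- lapply(Soybean[-1], add_missing_level)` would recode every predictor while leaving Class untouched. Whether this helps depends on whether the same missingness pattern appears in future data, which is the open question raised above.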