DATA 624 HW4

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

library(mlbench)
library(tidyverse)  # drop_na, rename, keep, gather, ggplot, %>%
library(skimr)      # skim
data(Soybean)
## See ?Soybean for details

A degenerate distribution, in the strictest sense, is one in which a variable takes only a single value. To check for this, we first drop rows with missing values so that min and max return a value for every column. No predictor has a minimum equal to its maximum, so no column is strictly degenerate. However, the frequency distributions show several heavily imbalanced predictors, such as leaves, shriveling, leaf.malf, mold.growth, and lodging, that could be candidates for removal. The predictor mycelium should definitely be removed: nearly all of its records share one value (639 records at 0 versus 6 at 1 in the summary in part (b)), so it is effectively degenerate.

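# Drop rows with any missing value so min/max are defined for every column
S <- drop_na(Soybean)

# Per-column minimum and maximum (apply() coerces the factors to character)
mi <- as.data.frame(apply(S, 2, min))
ma <- as.data.frame(apply(S, 2, max))
mi <- tibble::rownames_to_column(mi, "predictor")
ma <- tibble::rownames_to_column(ma, "predictor")

# Combine into one table with readable column names
m <- as.data.frame(merge(mi, ma, by = "predictor"))
m <- m %>%
  rename(
    "min" = "apply(S, 2, min)",
    "max" = "apply(S, 2, max)"
  )
m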

##          predictor                 min                  max
## 1         area.dam                   0                    3
## 2    canker.lesion                   0                    3
## 3            Class alternarialeaf-spot rhizoctonia-root-rot
## 4        crop.hist                   0                    3
## 5             date                   0                    6
## 6        ext.decay                   0                    1
## 7       fruit.pods                   0                    3
## 8      fruit.spots                   0                    4
## 9  fruiting.bodies                   0                    1
## 10            germ                   0                    2
## 11            hail                   0                    1
## 12    int.discolor                   0                    2
## 13       leaf.halo                   0                    2
## 14       leaf.malf                   0                    1
## 15       leaf.marg                   0                    2
## 16       leaf.mild                   0                    2
## 17     leaf.shread                   0                    1
## 18       leaf.size                   0                    2
## 19          leaves                   0                    1
## 20         lodging                   0                    1
## 21     mold.growth                   0                    1
## 22        mycelium                   0                    1
## 23    plant.growth                   0                    1
## 24     plant.stand                   0                    1
## 25          precip                   0                    2
## 26           roots                   0                    2
## 27       sclerotia                   0                    1
## 28            seed                   0                    1
## 29   seed.discolor                   0                    1
## 30       seed.size                   0                    1
## 31        seed.tmt                   0                    2
## 32           sever                   0                    2
## 33      shriveling                   0                    1
## 34            stem                   0                    1
## 35    stem.cankers                   0                    3
## 36            temp                   0                    2

# A strictly degenerate predictor would have min equal to max
m %>% filter(min == max)

## [1] predictor min       max      
## <0 rows> (or 0-length row.names)

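# Convert factor columns to their numeric codes so they can be plotted together
asNumeric <- function(x) as.numeric(as.character(x))
factorsNumeric <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
                                                   asNumeric))

# Class holds text labels, which would all become NA, so it is left out here
So <- factorsNumeric(Soybean[, names(Soybean) != "Class"])

# One frequency panel per predictor
So %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()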
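
The chapter's looser notion of a degenerate (near-zero-variance) predictor uses two rules of thumb: the share of unique values is low (around 10% or less), and the ratio of the most common value's frequency to the second most common's is large (around 20 or more). The following is a minimal base R sketch of that screen, not the code used for the answer above. The function name near_degenerate and its threshold arguments are my own illustrative choices, and the two example vectors are rebuilt from the counts reported by skim() in part (b).

```r
# Frequency-ratio and percent-unique screen in base R.
# near_degenerate() and its cutoffs are illustrative assumptions, not fixed rules.
near_degenerate <- function(x, freq_ratio_cut = 20, pct_unique_cut = 10) {
  x <- as.character(x[!is.na(x)])                      # ignore missing values
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (length(counts) > 1) counts[1] / counts[2] else Inf
  # Distinct values as a percentage of the non-missing sample size
  pct_unique <- 100 * length(counts) / length(x)
  list(
    freq_ratio = freq_ratio,
    pct_unique = pct_unique,
    flagged    = freq_ratio > freq_ratio_cut && pct_unique < pct_unique_cut
  )
}

# Counts copied from the skim() summary: mycelium 0: 639, 1: 6; leaves 1: 606, 0: 77
mycelium <- factor(rep(c("0", "1"), times = c(639, 6)))
leaves   <- factor(rep(c("0", "1"), times = c(77, 606)))

myc_check    <- near_degenerate(mycelium)
leaves_check <- near_degenerate(leaves)
print(myc_check$flagged)
print(leaves_check$flagged)
```

With these cutoffs mycelium (639 to 6, a ratio of 106.5) is flagged, while leaves (606 to 77, a ratio near 7.9) is not. caret::nearZeroVar() applies a similar screen to a whole data frame.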

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The skim function from the skimr library shows that some predictors are much more likely to be missing than others. The least complete are hail, sever, seed.tmt, and lodging, each missing 121 records for a completion rate of 82.3%. The missingness is also strongly tied to class: only 5 of the 19 classes have records with missing fields. In four of them every record has at least one missing field, and in the fifth (phytophthora-rot) 68 of 88 records, roughly three quarters, do.

skim(Soybean)

Data summary
Name	Soybean
Number of rows	683
Number of columns	36
_______________________
Column type frequency:
factor	36
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Class	0	1.00	FALSE	19	bro: 92, alt: 91, fro: 91, phy: 88
date	1	1.00	FALSE	7	5: 149, 4: 131, 3: 118, 2: 93
plant.stand	36	0.95	TRUE	2	0: 354, 1: 293
precip	38	0.94	TRUE	3	2: 459, 1: 112, 0: 74
temp	30	0.96	TRUE	3	1: 374, 2: 199, 0: 80
hail	121	0.82	FALSE	2	0: 435, 1: 127
crop.hist	16	0.98	FALSE	4	2: 219, 3: 218, 1: 165, 0: 65
area.dam	1	1.00	FALSE	4	1: 227, 3: 187, 2: 145, 0: 123
sever	121	0.82	FALSE	3	1: 322, 0: 195, 2: 45
seed.tmt	121	0.82	FALSE	3	0: 305, 1: 222, 2: 35
germ	112	0.84	TRUE	3	1: 213, 2: 193, 0: 165
plant.growth	16	0.98	FALSE	2	0: 441, 1: 226
leaves	0	1.00	FALSE	2	1: 606, 0: 77
leaf.halo	84	0.88	FALSE	3	2: 342, 0: 221, 1: 36
leaf.marg	84	0.88	FALSE	3	0: 357, 2: 221, 1: 21
leaf.size	84	0.88	TRUE	3	1: 327, 2: 221, 0: 51
leaf.shread	100	0.85	FALSE	2	0: 487, 1: 96
leaf.malf	84	0.88	FALSE	2	0: 554, 1: 45
leaf.mild	108	0.84	FALSE	3	0: 535, 1: 20, 2: 20
stem	16	0.98	FALSE	2	1: 371, 0: 296
lodging	121	0.82	FALSE	2	0: 520, 1: 42
stem.cankers	38	0.94	FALSE	4	0: 379, 3: 191, 1: 39, 2: 36
canker.lesion	38	0.94	FALSE	4	0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies	106	0.84	FALSE	2	0: 473, 1: 104
ext.decay	38	0.94	FALSE	3	0: 497, 1: 135, 2: 13
mycelium	38	0.94	FALSE	2	0: 639, 1: 6
int.discolor	38	0.94	FALSE	3	0: 581, 1: 44, 2: 20
sclerotia	38	0.94	FALSE	2	0: 625, 1: 20
fruit.pods	84	0.88	FALSE	4	0: 407, 1: 130, 3: 48, 2: 14
fruit.spots	106	0.84	FALSE	4	0: 345, 4: 100, 1: 75, 2: 57
seed	92	0.87	FALSE	2	0: 476, 1: 115
mold.growth	92	0.87	FALSE	2	0: 524, 1: 67
seed.discolor	106	0.84	FALSE	2	0: 513, 1: 64
seed.size	92	0.87	FALSE	2	0: 532, 1: 59
shriveling	106	0.84	FALSE	2	0: 539, 1: 38
roots	31	0.95	FALSE	3	0: 551, 1: 86, 2: 15

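library(sqldf)

# Rows with at least one missing value
sNA <- Soybean[rowSums(is.na(Soybean)) > 0, ]

# Incomplete-row count and total row count per class
sm  <- sqldf("select count(Class) null_count, Class from sNA group by Class")
soy <- sqldf("select count(Class) full_count, Class from Soybean group by Class")

# The join keeps only the classes that have incomplete rows
s <- sqldf("select a.class, null_count, full_count from sm a join soy b on a.class = b.class")
s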

##                         Class null_count full_count
## 1                2-4-d-injury         16         16
## 2               cyst-nematode         14         14
## 3 diaporthe-pod-&-stem-blight         15         15
## 4            herbicide-injury          8          8
## 5            phytophthora-rot         68         88
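
The same two summaries (missing values per predictor and incomplete rows per class) can also be sketched in base R without sqldf. This is an illustrative alternative under my own assumptions, not the code that produced the table above: the helper name missing_by_class is hypothetical, and the toy data frame only stands in for Soybean so the example runs on its own.

```r
# Incomplete-row counts per class, base R only.
# missing_by_class() is a hypothetical helper name used for illustration.
missing_by_class <- function(data, class_col = "Class") {
  incomplete <- !complete.cases(data)            # TRUE if the row has any NA
  cls <- as.character(data[[class_col]])
  null_count <- tapply(incomplete, cls, sum)     # incomplete rows per class
  full_count <- tapply(incomplete, cls, length)  # all rows per class
  out <- data.frame(
    class      = names(null_count),
    null_count = as.vector(null_count),
    full_count = as.vector(full_count),
    share      = as.vector(null_count / full_count)
  )
  out[out$null_count > 0, ]                      # keep classes with any NA
}

# Toy stand-in for Soybean (made-up values, for illustration only)
toy <- data.frame(
  Class = c("a", "a", "b", "b", "c"),
  hail  = c("0", NA, NA, NA, "1"),
  temp  = c("1", "1", NA, "2", "0")
)

per_predictor <- colSums(is.na(toy))   # missing values per column
per_class     <- missing_by_class(toy) # incomplete rows per class
print(per_predictor)
print(per_class)
```

On the real data the equivalent calls would be colSums(is.na(Soybean)) and missing_by_class(Soybean).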

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

For predictors that are entirely NA for a whole class (e.g., hail is NA for every record in four classes), I would either create a dummy variable indicating whether the predictor was recorded or remove the predictor entirely. Relying on the indicator may be an issue, because the missingness likely reflects how the data were collected and may not hold up over time. For predictors that have some data within a class, I would impute a typical value for that predictor within the class (the most common level, since the predictors are categorical).
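
A minimal sketch of this strategy in base R follows. It is one possible reading of the plan rather than a tested pipeline, and it assumes the most common level is used as the imputed value. The helper names mode_level and impute_by_class, the "_missing" suffix, and the toy data frame are all my own illustrative choices.

```r
# Most frequent non-missing level; NA if the whole group is missing
mode_level <- function(x) {
  counts <- table(as.character(x[!is.na(x)]))
  if (length(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Add a <col>_missing indicator, then fill NAs with the within-class mode.
# Classes where the predictor is entirely NA are left as NA.
impute_by_class <- function(data, col, class_col = "Class") {
  original <- data[[col]]
  x   <- as.character(original)
  cls <- as.character(data[[class_col]])
  data[[paste0(col, "_missing")]] <- as.integer(is.na(x))
  modes <- tapply(x, cls, mode_level)   # one mode per class
  fill  <- as.vector(modes[cls])        # mode matched to each row
  x[is.na(x)] <- fill[is.na(x)]
  if (is.factor(original)) {
    # Preserve the original levels and ordering
    x <- factor(x, levels = levels(original), ordered = is.ordered(original))
  }
  data[[col]] <- x
  data
}

# Toy stand-in for Soybean (made-up values, for illustration only)
toy <- data.frame(
  Class = c("a", "a", "a", "a", "b", "b"),
  hail  = factor(c("0", "0", "1", NA, NA, NA))
)
imputed <- impute_by_class(toy, "hail")
print(imputed)
```

In the toy data the single missing value in class a is filled with that class's most common level, while class b, where hail is never recorded, stays NA and is identified only by the indicator column.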

Adam Gersowitz

9/30/2021
