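library(mlbench)    # provides the Glass and Soybean data sets
library(tidyverse)  # dplyr, tidyr and ggplot2 are used below
data(Glass)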
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
Normally distributed: Al, Ca. Right skewed: Fe, Ba, RI, Na, K. Left skewed: Si, Mg.
variables = Glass%>%
select(-Type)%>%
names()
Glass%>%
pivot_longer(all_of(variables))%>%
ggplot(aes(x=value))+
geom_histogram(bins=30)+
facet_wrap(~name, scales = 'free')
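The visual classification of the histograms can be cross-checked with a numeric skewness estimate. The sketch below uses a small hand-written helper, `skewness_simple` (a hypothetical name, not part of the original analysis or of any package), that computes the moment-based sample skewness. Positive values suggest right skew, negative values suggest left skew, and values near zero suggest a roughly symmetric distribution.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# `skewness_simple` is an illustrative helper, not a package function.
skewness_simple <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Apply the helper to every Glass predictor (only if mlbench is installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[, names(Glass) != "Type"]
  print(round(sort(sapply(predictors, skewness_simple)), 2))
}
```

Sorting the result puts the most left-skewed predictors first and the most right-skewed last, which makes it easy to compare against the grouping read off the histograms.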
# scatter plots show correlation between predictor variables
Glass%>%
select(-Type)%>%
pairs()
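To put numbers on what the scatter plot matrix shows, the pairwise correlations can be tabulated. This is a minimal sketch; the helper name `strong_pairs` and the 0.5 cutoff are illustrative choices, not part of the original analysis.

```r
# List predictor pairs whose absolute Pearson correlation exceeds a cutoff.
# `strong_pairs` and the default cutoff of 0.5 are illustrative assumptions.
strong_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Keep each pair once by looking only at the upper triangle.
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx]
  )
  out[order(-abs(out$r)), ]
}

# Apply to the Glass predictors (only if mlbench is installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(strong_pairs(Glass[, names(Glass) != "Type"]))
}
```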
Other than Al and Ca, the predictors are noticeably skewed.
Based on the boxplots below, outliers exist for every variable except Mg.
Glass%>%
pivot_longer(all_of(variables))%>%
ggplot(aes(x=value))+
geom_boxplot()+
facet_wrap(~name, scales = 'free')
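The outliers seen in the boxplots can also be counted per variable. The sketch below relies on base R's `boxplot.stats()`, which applies the usual 1.5 * IQR whisker rule; `count_outliers` is a hypothetical helper name. ggplot2 computes its quartiles slightly differently from `boxplot.stats()`, so the counts may differ marginally from the points drawn in the plots.

```r
# Number of observations beyond the boxplot whiskers
# (more than 1.5 * IQR away from the hinges).
# `count_outliers` is an illustrative helper, not a package function.
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Count outliers for every Glass predictor (only if mlbench is installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(sapply(Glass[, names(Glass) != "Type"], count_outliers))
}
```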
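Since most of the variables are skewed, we could apply transformations to bring the distributions closer to normal. For example, we can apply a log transformation to Si, one of the variables identified as left skewed above. Note that a log transformation compresses the right tail, so it is mainly suited to right-skewed variables. For a left-skewed variable such as Si, whose values span a narrow range, the histogram should change only slightly, and a Box-Cox or Yeo-Johnson transformation may be more appropriate.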
Glass%>%
ggplot(aes(x = log(Si)))+geom_histogram(bins=30)
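Because a log transformation pulls in the right tail, one common option for a left-skewed variable is to reflect it before taking logs. The sketch below illustrates this on toy data rather than on Si itself; `skewness_simple` and `reflect_log` are hypothetical helper names, and whether the reflection actually helps for Si would need to be checked on the data.

```r
# Moment-based sample skewness (illustrative helper).
skewness_simple <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Reflect around the maximum, then take logs. The "+ 1" keeps the
# argument of log() strictly positive. Note that the reflection
# reverses the ordering of the values.
reflect_log <- function(x) log(max(x) + 1 - x)

x <- c(1, 8, 9, 9, 10, 10, 10)  # toy left-skewed data, not from Glass

# Compare skewness of the raw, logged and reflect-logged values.
print(c(
  raw         = skewness_simple(x),
  log         = skewness_simple(log(x)),
  reflect_log = skewness_simple(reflect_log(x))
))
```

On this toy vector the plain log makes the left skew stronger, while the reflected log reduces its magnitude. For variables that contain zeros, such as Fe and Ba, `log()` returns `-Inf`, so `log1p()` would be a safer choice there.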
data(Soybean)
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
From the summary below, we can see that many variables contain missing (NA) values. The predictor mycelium stands out as having a ‘degenerate distribution’, as it takes almost exclusively one value (639 0’s vs. 6 1’s).
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
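One way to quantify how degenerate a predictor is: take the ratio of the count of its most common value to the count of its second most common value. For mycelium this is 639 / 6 = 106.5. The sketch below computes that ratio with a hypothetical helper, `freq_ratio`; for comparison, `caret::nearZeroVar()` uses a default frequency-ratio cutoff of 95/5 = 19, combined with a check on the percentage of unique values.

```r
# Ratio of the most frequent value's count to the second most frequent's.
# `freq_ratio` is an illustrative helper, not a package function.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1]) / as.numeric(counts[2])
}

# Show the predictors with the largest ratios (only if mlbench is installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  ratios <- sapply(Soybean[, names(Soybean) != "Class"], freq_ratio)
  print(round(head(sort(ratios, decreasing = TRUE)), 1))
}
```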
From the summary in question (a), we see that many predictors have a large number of NA values. The predictors with the most NA values are sever, hail, lodging and seed.tmt, with 121 each.
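Rather than reading the NA counts off the summary, they can be computed directly. This is a minimal sketch; `na_counts` is a hypothetical helper name.

```r
# Number of missing values per column, largest first.
# `na_counts` is an illustrative helper, not a package function.
na_counts <- function(df) sort(colSums(is.na(df)), decreasing = TRUE)

# Show the predictors with the most NA values (only if mlbench is installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(head(na_counts(Soybean)))
}
```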
Looking at the classes, we see that all of the NA values come from the 5 classes listed first in the table below. In the first 4 classes every row has at least one NA value, while in the 5th class (phytophthora-rot) 77% of the rows contain an NA value.
non_null = Soybean%>%
na.omit()%>%
group_by(Class)%>%
summarize(non_null_count = n())
all = Soybean%>%
group_by(Class)%>%
summarize(count = n())
all%>%
left_join(non_null, by = 'Class')%>%
mutate(ratio_na = (count - replace_na(non_null_count,0))/count)%>%
arrange(-ratio_na)
## # A tibble: 19 x 4
## Class count non_null_count ratio_na
## <fct> <int> <int> <dbl>
## 1 2-4-d-injury 16 NA 1
## 2 cyst-nematode 14 NA 1
## 3 diaporthe-pod-&-stem-blight 15 NA 1
## 4 herbicide-injury 8 NA 1
## 5 phytophthora-rot 88 20 0.773
## 6 alternarialeaf-spot 91 91 0
## 7 anthracnose 44 44 0
## 8 bacterial-blight 20 20 0
## 9 bacterial-pustule 20 20 0
## 10 brown-spot 92 92 0
## 11 brown-stem-rot 44 44 0
## 12 charcoal-rot 20 20 0
## 13 diaporthe-stem-canker 20 20 0
## 14 downy-mildew 20 20 0
## 15 frog-eye-leaf-spot 91 91 0
## 16 phyllosticta-leaf-spot 20 20 0
## 17 powdery-mildew 20 20 0
## 18 purple-seed-stain 20 20 0
## 19 rhizoctonia-root-rot 20 20 0
Since all of the NA values come from 5 classes, dropping incomplete rows would remove 4 classes entirely and most of the 5th. Imputing the missing values is therefore a better option than dropping them.
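The text above does not commit to a particular imputation method. As one simple possibility for categorical predictors, the sketch below replaces each NA with the most frequent observed level of that predictor. The helper name `impute_mode` is hypothetical, and the mode is taken over all classes because some predictors have no observed values at all within the affected classes.

```r
# Replace NA in a factor with its most frequent observed level.
# `impute_mode` is an illustrative helper; it assumes the factor has
# at least one level.
impute_mode <- function(x) {
  counts <- table(x)  # NA values are excluded by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Impute every predictor, leaving Class untouched (only if mlbench is installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  Soybean_imp <- Soybean
  preds <- names(Soybean) != "Class"
  Soybean_imp[preds] <- lapply(Soybean[preds], impute_mode)
  print(sum(is.na(Soybean_imp)))
}
```

Mode imputation ignores relationships between predictors, so a model-based approach such as K-nearest neighbours or tree-based imputation may be worth comparing against it.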
Something interesting I noticed about the 4 classes in which every row has at least one NA: predictors such as sever, seed.tmt, leaf.mild and lodging are NA in every row. For predictors like these, we could perhaps replace NA with 0 instead of imputing, since no values were recorded for those classes at all.
Soybean%>%
filter(Class %in% c('2-4-d-injury','cyst-nematode',
'diaporthe-pod-&-stem-blight','herbicide-injury'))%>%
summary()
## Class date plant.stand precip temp
## 2-4-d-injury :16 5 : 9 0 : 7 0 : 0 0 : 8
## diaporthe-pod-&-stem-blight:15 3 : 8 1 :10 1 : 2 1 : 0
## cyst-nematode :14 6 : 8 NA's:36 2 :13 2 :15
## herbicide-injury : 8 1 : 7 NA's:38 NA's:30
## alternarialeaf-spot : 0 2 : 7
## anthracnose : 0 (Other):13
## (Other) : 0 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 : 0 0 : 6 0 :10 0 : 0 0 : 0 0 : 5 0 :15
## 1 : 0 1 : 9 1 :12 1 : 0 1 : 0 1 : 2 1 :22
## NA's:53 2 :11 2 :10 2 : 0 2 : 0 2 : 2 NA's:16
## 3 :11 3 :20 NA's:53 NA's:53 NA's:44
## NA's:16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem
## 0:15 0 :20 0 : 0 0 : 0 0 : 8 0 : 0 0 : 0 0 :14
## 1:38 1 : 0 1 : 4 1 : 4 1 : 0 1 :24 1 : 0 1 :23
## 2 : 4 2 :20 2 :20 NA's:45 NA's:29 2 : 0 NA's:16
## NA's:29 NA's:29 NA's:29 NA's:53
##
##
##
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium
## 0 : 0 0 :15 0 :15 0 : 0 0 :15 0 :15
## 1 : 0 1 : 0 1 : 0 1 :15 1 : 0 1 : 0
## NA's:53 2 : 0 2 : 0 NA's:38 2 : 0 NA's:38
## 3 : 0 3 : 0 NA's:38
## NA's:38 NA's:38
##
##
## int.discolor sclerotia fruit.pods fruit.spots seed mold.growth
## 0 :15 0 :15 0 : 0 0 : 0 0 : 3 0 :14
## 1 : 0 1 : 0 1 :15 1 : 0 1 :26 1 :15
## 2 : 0 NA's:38 2 :14 2 :15 NA's:24 NA's:24
## NA's:38 3 : 8 4 : 0
## NA's:16 NA's:38
##
##
## seed.discolor seed.size shriveling roots
## 0 : 0 0 : 0 0 : 0 0 : 0
## 1 :15 1 :29 1 :15 1 : 8
## NA's:38 NA's:24 NA's:38 2 :14
## NA's:31
##
##
##
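The idea of recoding these NA values as 0 could be sketched as follows. The helper name `na_to_zero` is hypothetical. Because the Soybean predictors are factors, the replacement has to be the level "0" rather than the number 0.

```r
# Replace NA in a factor with the level "0".
# `na_to_zero` is an illustrative helper. If "0" is not yet a level it is
# appended last, which would be the wrong position for an ordered factor;
# the Soybean predictors already have a "0" level.
na_to_zero <- function(x) {
  if (!"0" %in% levels(x)) levels(x) <- c(levels(x), "0")
  x[is.na(x)] <- "0"
  x
}

# Recode the predictors that are NA in every row of the four classes.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  cols <- c("hail", "sever", "seed.tmt", "leaf.mild", "lodging")
  Soybean_zero <- Soybean
  Soybean_zero[cols] <- lapply(Soybean[cols], na_to_zero)
  print(colSums(is.na(Soybean_zero[cols])))
}
```

Recoding to "0" treats "not recorded" the same as an observed 0. Adding a separate "missing" level instead would keep the two cases distinguishable.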