## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble 3.1.8 ✔ tsibble 1.1.3
## ✔ dplyr 1.1.0 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.0 ✔ feasts 0.3.0
## ✔ lubridate 1.9.2 ✔ fable 0.3.2
## ✔ ggplot2 3.4.1 ✔ fabletools 0.3.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## corrplot 0.92 loaded
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Q: Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library('mlbench')
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
# Histogram of each numeric predictor
Glass %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 15) +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")
# Box plot of each numeric predictor
Glass %>%
  select_if(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")
# Pairwise correlations between the numeric predictors
Glass %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot()
# Number of samples per glass type
Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")
A: The histograms and box plots show that few of the predictors are normally distributed. Al, Ba, Ca, Fe, K, Na, and RI are right skewed, while Mg and Si are left skewed. In the correlation plot, the strongest correlation is between Ca and RI, while Al and Ba, Al and K, and Ba and Na have the next strongest positive correlations. The most common types of glass are 1 and 2.
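The skewness and correlation readings above come from eyeballing plots, so it can help to check them numerically. The following is a minimal sketch in base R, assuming only the `mlbench` package already used in this document. The `skew()` helper and the `cor_pairs` table are names introduced here for illustration and are not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Keep only the nine numeric predictors (drops the Type factor)
predictors <- Glass[, sapply(Glass, is.numeric)]

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# Positive values indicate right skew, negative values left skew.
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skew_values <- sort(sapply(predictors, skew), decreasing = TRUE)
print(round(skew_values, 2))

# List every pair of predictors once and rank by absolute correlation
cors <- cor(predictors)
upper <- which(upper.tri(cors), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cors)[upper[, "row"]],
  var2 = colnames(cors)[upper[, "col"]],
  r = cors[upper]
)
print(head(cor_pairs[order(-abs(cor_pairs$r)), ], 5))
```

Ranking by absolute value also surfaces strong negative correlations, which are easy to overlook in the correlation plot.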
Q: Do there appear to be any outliers in the data? Are any predictors skewed?
A: Yes. As noted in the answer above, several predictors are skewed. The box plots show outliers in nearly every predictor, with the most in Ca, Ba, Al, Fe, and RI.
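To put rough numbers on the outliers, we can count the points that fall outside the box plot whiskers for each predictor. This is a sketch using base R's `boxplot.stats()`, which applies the 1.5 × IQR rule that box plot whiskers conventionally use. The name `outlier_counts` is introduced here for illustration.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, sapply(Glass, is.numeric)]

# boxplot.stats()$out holds the values beyond 1.5 * IQR from the hinges
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
print(sort(outlier_counts, decreasing = TRUE))
```

A predictor whose values are mostly zero, such as Ba, has an interquartile range of zero, so every nonzero value is flagged. Those points may reflect a real subgroup rather than measurement error.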
Q: Are there any relevant transformations of one or more predictors that might improve the classification model?
A: Yes. For the skewed predictors we could apply either a Box-Cox or a log transformation, and for the predictors with outliers we could apply a spatial sign transformation. Then, if we wanted to cut out the noise, we could run PCA. While PCA is not exactly a transformation of the individual predictors, it reduces the dimensionality of the data, allowing for clearer interpretations and better results.
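The transformations mentioned above can be sketched in base R as follows. This is an illustration under stated assumptions rather than a tuned pipeline: `log1p()` stands in for the log or Box-Cox step because Mg, K, Ba, and Fe contain zeros, the `right_skewed` vector is an illustrative choice of columns, and the spatial sign is computed by hand. The `caret` package offers the same steps through `preProcess()` with the methods `"BoxCox"`, `"spatialSign"`, and `"pca"`.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, sapply(Glass, is.numeric)]

# Illustrative subset of right-skewed predictors; log1p(x) = log(1 + x)
# is defined at zero, unlike log(x)
right_skewed <- c("K", "Ba", "Ca", "Fe")
transformed <- predictors
transformed[right_skewed] <- lapply(transformed[right_skewed], log1p)

# Spatial sign: center and scale, then divide each sample (row) by its
# Euclidean norm so that every sample lies on the unit sphere
scaled <- scale(transformed)
spatial_sign <- scaled / sqrt(rowSums(scaled^2))

# PCA on the centered and scaled predictors, with the cumulative
# proportion of variance explained by the components
pca <- prcomp(transformed, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(round(cum_var, 3))
```

Centering and scaling before the spatial sign step matters, because otherwise the predictor with the largest magnitude (Si here) would dominate the row norms.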
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Q: Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
data(Soybean)
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
# Bar chart of level frequencies for each predictor, complete rows only
Soybean %>%
  select(-Class) %>%
  drop_na() %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~ key)
## Warning: attributes are not identical across measure variables; they will be
## dropped
A: Bar charts help identify categorical predictors with degenerate
distributions. In this plot I removed all rows containing NAs, since
plotting NA as its own bar made some predictors look like they had more
levels than they actually do. Degenerate distributions are ones with a
“handful of unique values that occur with very low frequencies.”
mycelium appears to be degenerate, so we should remove it.
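The visual check above can be backed by the two summary numbers commonly used to flag near-zero-variance predictors: the ratio of the most frequent level to the second most frequent, and the percentage of unique values. The sketch below computes both in base R. The `degeneracy` table and its column names are introduced here for illustration, and `caret::nearZeroVar()` performs a comparable check with default cutoffs of a frequency ratio above 95/5 and fewer than 10% unique values.

```r
library(mlbench)
data(Soybean)

predictors <- Soybean[, names(Soybean) != "Class"]

# For each predictor: frequency ratio of the two most common levels
# and the percentage of distinct observed levels relative to sample size
degeneracy <- t(sapply(predictors, function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  c(
    freq_ratio = if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf,
    pct_unique = 100 * sum(counts > 0) / length(x)
  )
}))
degeneracy <- as.data.frame(degeneracy)

# Predictors with the most lopsided level frequencies come first
print(head(degeneracy[order(-degeneracy$freq_ratio), ], 5))
```

Unlike the plot, this version keeps rows with missing values, because `table()` simply ignores NA when counting levels.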
Q: Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Proportion of missing values in each column, largest first
sort(colMeans(is.na(Soybean)), decreasing = TRUE)
## hail sever seed.tmt lodging germ
## 0.177159590 0.177159590 0.177159590 0.177159590 0.163982430
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 0.158125915 0.155197657 0.155197657 0.155197657 0.155197657
## leaf.shread seed mold.growth seed.size leaf.halo
## 0.146412884 0.134699854 0.134699854 0.134699854 0.122986823
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 0.122986823 0.122986823 0.122986823 0.122986823 0.055636896
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 0.055636896 0.055636896 0.055636896 0.055636896 0.055636896
## sclerotia plant.stand roots temp crop.hist
## 0.055636896 0.052708638 0.045387994 0.043923865 0.023426061
## plant.growth stem date area.dam Class
## 0.023426061 0.023426061 0.001464129 0.001464129 0.000000000
## leaves
## 0.000000000
# Count the rows with at least one missing value, by class
Soybean %>%
  filter(if_any(everything(), is.na)) %>%
  group_by(Class) %>%
  summarise(count = n())
## # A tibble: 5 × 2
## Class count
## <fct> <int>
## 1 2-4-d-injury 16
## 2 cyst-nematode 14
## 3 diaporthe-pod-&-stem-blight 15
## 4 herbicide-injury 8
## 5 phytophthora-rot 68
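A: The predictors most likely to be missing are hail, sever, seed.tmt, and lodging (about 17.7% missing each), followed by germ and leaf.mild. The pattern is strongly related to the classes: every row with a missing value belongs to one of the five classes above, and phytophthora-rot accounts for the most, with 68 of the 121 incomplete rows.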
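The raw counts above do not show how large each class is, so it is also useful to look at the share of incomplete rows within each class. This is a small base R sketch; the `by_class` table is a name introduced here for illustration.

```r
library(mlbench)
data(Soybean)

# TRUE for rows that have at least one missing value
incomplete <- !complete.cases(Soybean)

# Total and incomplete row counts for all 19 classes
by_class <- data.frame(
  total = as.vector(table(Soybean$Class)),
  incomplete = as.vector(table(Soybean$Class[incomplete])),
  row.names = levels(Soybean$Class)
)
by_class$share <- by_class$incomplete / by_class$total

# Only the classes that have any incomplete rows
print(by_class[by_class$incomplete > 0, ])
```

A share of 1 means every sample of that class is incomplete, which is a different situation from a class where only some samples are affected.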
Q: Develop a strategy for handling missing data, either by eliminating predictors or imputation.
# Classes that contain at least one row with a missing value
soy_name <- Soybean %>%
  filter(if_any(everything(), is.na)) %>%
  pull(Class) %>%
  unique() %>%
  as.character()
# For each of those classes, print the proportion missing per column
for (i in soy_name) {
  class_rows <- Soybean %>%
    filter(Class == i)
  print(i)
  print(sort(colMeans(is.na(class_rows)), decreasing = TRUE))
}
## [1] "phytophthora-rot"
## hail sever seed.tmt germ lodging
## 0.7727273 0.7727273 0.7727273 0.7727273 0.7727273
## fruiting.bodies fruit.pods fruit.spots seed mold.growth
## 0.7727273 0.7727273 0.7727273 0.7727273 0.7727273
## seed.discolor seed.size shriveling leaf.halo leaf.marg
## 0.7727273 0.7727273 0.7727273 0.6250000 0.6250000
## leaf.size leaf.shread leaf.malf leaf.mild Class
## 0.6250000 0.6250000 0.6250000 0.6250000 0.0000000
## date plant.stand precip temp crop.hist
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## area.dam plant.growth leaves stem stem.cankers
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## canker.lesion ext.decay mycelium int.discolor sclerotia
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## roots
## 0.0000000
## [1] "diaporthe-pod-&-stem-blight"
## hail sever seed.tmt leaf.halo leaf.marg
## 1.0 1.0 1.0 1.0 1.0
## leaf.size leaf.shread leaf.malf leaf.mild lodging
## 1.0 1.0 1.0 1.0 1.0
## roots plant.stand germ Class date
## 1.0 0.4 0.4 0.0 0.0
## precip temp crop.hist area.dam plant.growth
## 0.0 0.0 0.0 0.0 0.0
## leaves stem stem.cankers canker.lesion fruiting.bodies
## 0.0 0.0 0.0 0.0 0.0
## ext.decay mycelium int.discolor sclerotia fruit.pods
## 0.0 0.0 0.0 0.0 0.0
## fruit.spots seed mold.growth seed.discolor seed.size
## 0.0 0.0 0.0 0.0 0.0
## shriveling
## 0.0
## [1] "cyst-nematode"
## plant.stand precip temp hail sever
## 1 1 1 1 1
## seed.tmt germ leaf.halo leaf.marg leaf.size
## 1 1 1 1 1
## leaf.shread leaf.malf leaf.mild lodging stem.cankers
## 1 1 1 1 1
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 1 1 1 1 1
## sclerotia fruit.spots seed.discolor shriveling Class
## 1 1 1 1 0
## date crop.hist area.dam plant.growth leaves
## 0 0 0 0 0
## stem fruit.pods seed mold.growth seed.size
## 0 0 0 0 0
## roots
## 0
## [1] "2-4-d-injury"
## plant.stand precip temp hail crop.hist
## 1.0000 1.0000 1.0000 1.0000 1.0000
## sever seed.tmt germ plant.growth leaf.shread
## 1.0000 1.0000 1.0000 1.0000 1.0000
## leaf.mild stem lodging stem.cankers canker.lesion
## 1.0000 1.0000 1.0000 1.0000 1.0000
## fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1.0000 1.0000 1.0000 1.0000 1.0000
## fruit.pods fruit.spots seed mold.growth seed.discolor
## 1.0000 1.0000 1.0000 1.0000 1.0000
## seed.size shriveling roots date area.dam
## 1.0000 1.0000 1.0000 0.0625 0.0625
## Class leaves leaf.halo leaf.marg leaf.size
## 0.0000 0.0000 0.0000 0.0000 0.0000
## leaf.malf
## 0.0000
## [1] "herbicide-injury"
## precip hail sever seed.tmt germ
## 1 1 1 1 1
## leaf.mild lodging stem.cankers canker.lesion fruiting.bodies
## 1 1 1 1 1
## ext.decay mycelium int.discolor sclerotia fruit.spots
## 1 1 1 1 1
## seed mold.growth seed.discolor seed.size shriveling
## 1 1 1 1 1
## Class date plant.stand temp crop.hist
## 0 0 0 0 0
## area.dam plant.growth leaves leaf.halo leaf.marg
## 0 0 0 0 0
## leaf.size leaf.shread leaf.malf stem fruit.pods
## 0 0 0 0 0
## roots
## 0
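A: Where a predictor is missing for 100% of a class's samples, as happens in four of the five classes above, there are no observed values within the class to impute from, so we should remove the affected predictors or rows. For “phytophthora-rot”, the affected predictors are missing in only 62.5% to 77.3% of the samples, so we could impute those values from the samples where they were recorded.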
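One way to act on this strategy is class-wise mode imputation: fill each missing value with the most frequent level observed for that predictor within the same class, and leave it missing when the class has no observed values at all. The sketch below is one possible implementation in base R, not the only reasonable choice. The `mode_level()` helper and the `imputed` and `remaining` objects are names introduced here for illustration.

```r
library(mlbench)
data(Soybean)

# Most frequent observed level of a factor, or NA if nothing is observed
mode_level <- function(x) {
  counts <- table(x)
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

imputed <- Soybean
for (cls in levels(Soybean$Class)) {
  rows <- which(Soybean$Class == cls)
  for (col in setdiff(names(Soybean), "Class")) {
    missing <- rows[is.na(Soybean[rows, col])]
    if (length(missing) > 0) {
      # Stays NA when the predictor is missing for the whole class
      imputed[missing, col] <- mode_level(Soybean[rows, col])
    }
  }
}

# Number of rows per class that still contain missing values
remaining <- tapply(rowSums(is.na(imputed)) > 0, imputed$Class, sum)
print(remaining[remaining > 0])
```

Classes that still show incomplete rows after this step are the ones where a predictor is missing for every sample, so those are the candidates for dropping predictors or rows. Because the imputation uses the class label, it would only be applicable to training data, where the label is known.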