3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
library(tidyverse)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
glass_longer <- Glass %>%
  select(!(Type)) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Value")

glass_longer %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 18) +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Histogram of Predictors")
glass_longer %>%
  ggplot(aes(x = Variable, y = Value, fill = Variable)) +
  geom_boxplot() +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Boxplots of Predictors")
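Part (a) also asks about relationships between predictors, which the histograms and boxplots do not show. One option is a correlation matrix of the nine numeric predictors; a minimal sketch, assuming the corrplot package is available:
library(corrplot)
# Pairwise correlations among the numeric predictors, shown as a circle plot
corrplot(cor(Glass %>% select(-Type)), method = "circle")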
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
From the boxplots, we see outliers in all of the predictors except Mg. The histograms show that most predictors are skewed: Al, Ca, and RI are slightly right-skewed; Ba, Fe, and K are strongly right-skewed; Mg is bimodal and left-skewed; Si is slightly left-skewed; and Na is very slightly right-skewed, close to normal.
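To back the visual impression with a number for each predictor, skewness statistics could be computed; a minimal sketch, assuming the e1071 package is available:
library(e1071)
# Skewness of each numeric predictor: positive = right-skewed, negative = left-skewed
sapply(Glass %>% select(-Type), skewness)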
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Since some of the predictors in Glass are heavily skewed, a log, square root, inverse, or Box-Cox transformation of one or more predictors might improve the classification model. As an example, I applied a log transformation: Ba, Fe, and K improve a little, although their many zero values become non-finite under the log and are dropped, which is what the warning below reports.
glass_longer %>%
  ggplot(aes(x = log(Value))) +
  geom_histogram(bins = 18) +
  facet_wrap(~ Variable, scales = "free")
## Warning: Removed 392 rows containing non-finite outside the scale range
## (`stat_bin()`).
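Note that a plain Box-Cox transformation cannot be applied to Ba, Fe, and K directly because of those zero values; one workaround is the Yeo-Johnson transformation. A minimal sketch using caret's preProcess (the object names are illustrative):
library(caret)
# Estimate Yeo-Johnson transformations (plus centering and scaling) for the numeric predictors
glass_pp <- preProcess(Glass %>% select(-Type), method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(glass_pp, Glass %>% select(-Type))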
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Since all of the predictors are categorical, I created bar charts of each predictor's frequency distribution. The predictor closest to a degenerate distribution is mycelium, since it takes a single value for nearly every sample. Others that are close to degenerate are sclerotia, shriveling, and leaf.malf, since their second level occurs only a little more often than mycelium's.
soy_cate <- sapply(Soybean, is.factor)
soy_cate
## Class date plant.stand precip temp
## TRUE TRUE TRUE TRUE TRUE
## hail crop.hist area.dam sever seed.tmt
## TRUE TRUE TRUE TRUE TRUE
## germ plant.growth leaves leaf.halo leaf.marg
## TRUE TRUE TRUE TRUE TRUE
## leaf.size leaf.shread leaf.malf leaf.mild stem
## TRUE TRUE TRUE TRUE TRUE
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## TRUE TRUE TRUE TRUE TRUE
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## TRUE TRUE TRUE TRUE TRUE
## seed mold.growth seed.discolor seed.size shriveling
## TRUE TRUE TRUE TRUE TRUE
## roots
## TRUE
Soybean %>%
  select(!(c(Class, date))) %>%
  drop_na() %>%
  gather(key = "key", value = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(~ key) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: attributes are not identical across measure variables; they will be
## dropped
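A numeric check of the same idea is caret's near-zero-variance filter, which flags predictors whose distributions are degenerate or nearly so. A minimal sketch, assuming the caret package is available:
library(caret)
# Names of predictors flagged as near-zero-variance (degenerate or nearly degenerate)
nearZeroVar(Soybean, names = TRUE)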
(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The predictors with the largest amount of missing data are hail, sever, seed.tmt, and lodging, each with 121 missing values, or about 17.7% of the 683 samples. The missingness also appears related to class: most of it comes from phytophthora-rot, where we see up to 68 missing values for some predictors. Cyst-nematode, 2-4-d-injury, and diaporthe-pod-&-stem-blight show similar missingness, with around 14-16 missing values at most per predictor, and herbicide-injury is on the lower side with 8 missing values. The remaining classes have no missing data at all, so the missingness is tied to particular classes, which suggests informative missingness.
na_total <- sapply(Soybean, function(x) sum(is.na(x)))
soy_na <- data.frame(Column = names(na_total), NAs = na_total)
soy_na %>%
  arrange(desc(NAs))
## Column NAs
## hail hail 121
## sever sever 121
## seed.tmt seed.tmt 121
## lodging lodging 121
## germ germ 112
## leaf.mild leaf.mild 108
## fruiting.bodies fruiting.bodies 106
## fruit.spots fruit.spots 106
## seed.discolor seed.discolor 106
## shriveling shriveling 106
## leaf.shread leaf.shread 100
## seed seed 92
## mold.growth mold.growth 92
## seed.size seed.size 92
## leaf.halo leaf.halo 84
## leaf.marg leaf.marg 84
## leaf.size leaf.size 84
## leaf.malf leaf.malf 84
## fruit.pods fruit.pods 84
## precip precip 38
## stem.cankers stem.cankers 38
## canker.lesion canker.lesion 38
## ext.decay ext.decay 38
## mycelium mycelium 38
## int.discolor int.discolor 38
## sclerotia sclerotia 38
## plant.stand plant.stand 36
## roots roots 31
## temp temp 30
## crop.hist crop.hist 16
## plant.growth plant.growth 16
## stem stem 16
## date date 1
## area.dam area.dam 1
## Class Class 0
## leaves leaves 0
ggplot(soy_na, aes(y = reorder(Column, NAs), x = NAs)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Number of Missing Values per Predictor in Soybean",
       x = "Number of NAs", y = "Predictors") +
  theme_minimal()
missing_data <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = -Class, names_to = "Predictor", values_to = "Missing_Count") %>%
  filter(Missing_Count > 0)
ggplot(missing_data, aes(x = reorder(Predictor, Missing_Count), y = Missing_Count, fill = Predictor)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  facet_wrap(~ Class) +
  theme_minimal() +
  labs(title = "Missingness by Predictor and Class in Soybean",
       x = "Predictor",
       y = "Missing Count",
       fill = "Predictor") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
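As a quick numeric cross-check of this pattern, the incomplete rows can be counted overall and per class with base R:
# Fraction of rows with at least one missing value -- the "roughly 18%" figure
mean(!complete.cases(Soybean))
# Incomplete rows per class; only a few classes account for all of the missingness
table(droplevels(Soybean$Class[!complete.cases(Soybean)]))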
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
One way to handle missing data is to eliminate it. In this case, we probably would not eliminate predictors, since there are missing values in almost every predictor. We could instead remove the five classes (shown in the plot above) that contain missing data, which would leave us with no missing values; see the sketch below.
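A minimal sketch of that elimination approach, filtering out the five classes identified in part (b):
# Drop the classes that contain all of the missing values (per the missingness-by-class plot)
soy_complete <- Soybean %>%
  filter(!Class %in% c("phytophthora-rot", "cyst-nematode", "2-4-d-injury",
                       "diaporthe-pod-&-stem-blight", "herbicide-injury")) %>%
  droplevels()
sum(is.na(soy_complete))  # expected to be 0 if those classes hold all of the NAs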
However, if the missingness is related to class and we do not want to lose data, it may be better to impute. This can be done with model-based imputation, such as K-nearest neighbors (KNN), which finds similar samples and uses their values to fill in the missing ones.
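caret's knnImpute expects numeric predictors, so it is not a direct fit for the mostly-categorical Soybean data; one alternative is the kNN() function from the VIM package, which handles factors. A minimal sketch, assuming VIM is installed:
library(VIM)
# Impute every predictor (but not the Class label) from its 5 nearest neighbours;
# imp_var = FALSE drops the extra indicator columns VIM normally adds
soy_imputed <- kNN(Soybean, variable = setdiff(names(Soybean), "Class"), k = 5, imp_var = FALSE)
sum(is.na(soy_imputed))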