The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(mlbench)
library(dplyr)
library(ggplot2)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
ggplot(data = Glass, aes(x = Na)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Mg)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Al)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Si)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = K)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Ca)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Ba)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = Glass, aes(x = Fe)) +
  geom_histogram(binwidth = 0.05)
ggplot(data = Glass, aes(x = RI)) +
  geom_histogram(binwidth = 0.001)
I computed the pairwise correlations among the nine predictors to further
explore the relationships between them.
cor_table <- round(cor(Glass[,1:9]), 2)
print(cor_table)
## RI Na Mg Al Si K Ca Ba Fe
## RI 1.00 -0.19 -0.12 -0.41 -0.54 -0.29 0.81 0.00 0.14
## Na -0.19 1.00 -0.27 0.16 -0.07 -0.27 -0.28 0.33 -0.24
## Mg -0.12 -0.27 1.00 -0.48 -0.17 0.01 -0.44 -0.49 0.08
## Al -0.41 0.16 -0.48 1.00 -0.01 0.33 -0.26 0.48 -0.07
## Si -0.54 -0.07 -0.17 -0.01 1.00 -0.19 -0.21 -0.10 -0.09
## K -0.29 -0.27 0.01 0.33 -0.19 1.00 -0.32 -0.04 -0.01
## Ca 0.81 -0.28 -0.44 -0.26 -0.21 -0.32 1.00 -0.11 0.12
## Ba 0.00 0.33 -0.49 0.48 -0.10 -0.04 -0.11 1.00 -0.06
## Fe 0.14 -0.24 0.08 -0.07 -0.09 -0.01 0.12 -0.06 1.00
The strongest correlation is between RI and Ca (0.81). The next strongest correlations are RI and Si (-0.54), Mg and Ba (-0.49), Mg and Al (-0.48), and Al and Ba (0.48).
The scatterplot of RI against Ca is shown below.
ggplot(Glass, aes(x = RI, y = Ca)) + geom_point()
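Reading the strongest pairs off a 9 x 9 table by eye is error-prone, so the ranking can also be produced programmatically. The following is a minimal sketch; the helper name `top_correlations` is hypothetical and not part of the original analysis, and the Glass call only runs if `mlbench` is installed.

```r
# Rank predictor pairs by absolute Pearson correlation.
# `top_correlations` is an illustrative helper, not a library function.
top_correlations <- function(df, n = 5) {
  cm <- cor(df)
  # Keep each pair once by taking only the upper triangle.
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out <- out[order(abs(out$r), decreasing = TRUE), ]
  rownames(out) <- NULL
  head(out, n)
}

# Apply to the nine Glass predictors when the data set is available.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(top_correlations(Glass[, 1:9], n = 6))
}
```

Sorting on the absolute value keeps strong negative correlations, such as RI and Si, next to the strong positive ones.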
Most of the predictors are skewed. RI, Na, Al, K, and Ca are right-skewed, while Mg and Si are left-skewed.
Ba and Fe are mostly zeros with a small number of positive values.
The outliers can be seen in the following boxplots.
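The direction of skew can be checked numerically rather than judged from the histograms alone. The sketch below uses the simple moment-based skewness estimator (other estimators differ by small-sample correction factors); the function name `sample_skewness` is an assumption made for illustration, and the Glass call only runs if `mlbench` is installed.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Toy check: a long right tail gives a positive value,
# a long left tail gives a negative value.
sample_skewness(c(1, 2, 3, 4, 20))
sample_skewness(c(-20, 1, 2, 3, 4))

# Skewness of each Glass predictor, sorted from most left-skewed
# to most right-skewed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(round(sort(sapply(Glass[, 1:9], sample_skewness)), 2))
}
```

Positive values indicate a right skew and negative values a left skew, so the sign of each entry can be compared directly with the histograms.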
ggplot(data = Glass, aes(x = RI)) +
  geom_boxplot()
ggplot(data = Glass, aes(x = Na)) +
  geom_boxplot()
ggplot(data = Glass, aes(x = Mg)) +
  geom_boxplot()
ggplot(data = Glass, aes(x = Al)) +
  geom_boxplot()
ggplot(data = Glass, aes(x = Si)) +
  geom_boxplot()
ggplot(data = Glass, aes(x = K)) +
  geom_boxplot()
ggplot(data = Glass, aes(x = Ca)) +
  geom_boxplot()
ggplot(data = Glass, aes(x = Ba)) +
  geom_boxplot()
ggplot(data = Glass, aes(x = Fe)) +
  geom_boxplot()
All of the boxplots show some outliers except the one for Mg. RI, Na, Al, Si, and Ca have both high and low outliers, while K, Ba, and Fe have only high outliers. Interestingly, the box for Ba sits entirely at 0, so every nonzero value is flagged as an outlier.
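The outlier counts behind these observations can be tabulated with the same 1.5 x IQR rule that boxplots conventionally use. This is a sketch under that assumption; `count_outliers` is a hypothetical helper, and the counts may differ slightly from the plotted points if the plotting code computes its quartiles differently. The Glass call only runs if `mlbench` is installed.

```r
# Count points beyond the boxplot fences on each side.
# Fences: Q1 - coef * IQR and Q3 + coef * IQR (coef = 1.5 by convention).
count_outliers <- function(x, coef = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- coef * (q[2] - q[1])
  c(low  = sum(x < q[1] - fence, na.rm = TRUE),
    high = sum(x > q[2] + fence, na.rm = TRUE))
}

# Toy check: one extreme high value and no low outliers.
count_outliers(c(1, 2, 3, 4, 5, 100))

# One row per Glass predictor, with low and high outlier counts.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(t(sapply(Glass[, 1:9], count_outliers)))
}
```

For a predictor such as Ba, where both quartiles are 0, the IQR is 0 and both fences collapse to 0, which is why every nonzero value counts as an outlier.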
It may be helpful to remove RI or Ca to account for collinearity, since they are highly correlated. The remaining variables should be centered and scaled. The spatial sign transformation could be used to address outliers, especially for the variables with both high and low outliers, including RI, Na, and Al. For the variables that show the most skew, a Box-Cox transformation could also be used to improve symmetry. Box-Cox requires strictly positive values, however, so it cannot be applied directly to predictors that contain zeros, such as Ba and Fe.
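The centering, scaling, and spatial sign steps can be sketched in base R as follows. This is an illustration only: `spatial_sign` is a hypothetical helper, dropping Ca rather than RI is an arbitrary choice made for the example, and the Box-Cox step is omitted. In practice, `caret::preProcess()` provides `"center"`, `"scale"`, `"BoxCox"`, and `"spatialSign"` methods that cover the same ground.

```r
# Spatial sign: center and scale each column, then divide every row
# by its Euclidean norm so each sample lies on the unit sphere.
spatial_sign <- function(X) {
  Z <- scale(as.matrix(X))   # column-wise centering and scaling
  Z / sqrt(rowSums(Z^2))     # row-wise projection onto the unit sphere
}

# Toy check: every transformed row has squared norm 1.
set.seed(1)
toy <- matrix(rnorm(40), ncol = 4)
range(rowSums(spatial_sign(toy)^2))

# Apply to the Glass predictors after dropping Ca
# (an assumed choice to reduce collinearity with RI).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[, setdiff(names(Glass)[1:9], "Ca")]
  glass_ss <- spatial_sign(predictors)
  print(summary(rowSums(glass_ss^2)))
}
```

Centering and scaling come first because the spatial sign depends on the relative magnitudes of the predictors; without them, the predictor with the largest raw scale (Si here) would dominate each row's norm.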
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
library(mlbench)
data(Soybean)
The criteria for degenerate predictors discussed in the chapter are:
a. The fraction of unique values over the sample size is low (say 10%).
b. The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).
print(dim(Soybean)[1])
## [1] 683
To meet the first criterion, a variable must have 68 or fewer unique values (10% of the 683 samples).
sapply(Soybean, function(x) length(unique(x)))
## Class date plant.stand precip temp
## 19 8 3 4 4
## hail crop.hist area.dam sever seed.tmt
## 3 5 5 4 4
## germ plant.growth leaves leaf.halo leaf.marg
## 4 3 2 4 4
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 4 3 3 4 3
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 3 5 5 3 4
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 3 4 3 5 5
## seed mold.growth seed.discolor seed.size shriveling
## 3 3 3 3 3
## roots
## 4
The largest number of unique values is 19, so every column meets the first criterion. For the second criterion, we need the frequencies of the two most prevalent values in each column.
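soy_ratio <- sapply(Soybean, function(x) {
  # Count values (table() drops NA by default)
  counts <- table(x)
  # Sort from most to least frequent
  counts <- sort(counts, decreasing = TRUE)
  # Ratio of the most frequent to the second most frequent value
  counts[1] / counts[2]
})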
# Arrange in descending order
sort(soy_ratio, decreasing = TRUE)
## mycelium.0 sclerotia.0 leaf.mild.0 shriveling.0
## 106.500000 31.250000 26.750000 14.184211
## int.discolor.0 lodging.0 leaf.malf.0 seed.size.0
## 13.204545 12.380952 12.311111 9.016949
## seed.discolor.0 leaves.1 mold.growth.0 roots.0
## 8.015625 7.870130 7.820896 6.406977
## leaf.shread.0 fruiting.bodies.0 seed.0 precip.2
## 5.072917 4.548077 4.139130 4.098214
## ext.decay.0 fruit.spots.0 hail.0 fruit.pods.0
## 3.681481 3.450000 3.425197 3.130769
## stem.cankers.0 plant.growth.0 temp.1 canker.lesion.0
## 1.984293 1.951327 1.879397 1.807910
## sever.1 leaf.marg.0 leaf.halo.2 leaf.size.1
## 1.651282 1.615385 1.547511 1.479638
## seed.tmt.0 stem.1 area.dam.1 plant.stand.0
## 1.373874 1.253378 1.213904 1.208191
## date.5 germ.1 Class.brown-spot crop.hist.2
## 1.137405 1.103627 1.010989 1.004587
The frequency ratios for mycelium, sclerotia, and leaf.mild are above 20, so these three predictors are degenerate.
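Both criteria can be evaluated together in one pass. The sketch below is a base R illustration; `degeneracy_metrics` and its arguments are hypothetical names, and the cutoffs (ratio above 20, fewer than 10% unique values) are the ones quoted above. Unlike the earlier `unique()` count, this version ignores `NA` when counting distinct values. `caret::nearZeroVar()` implements a similar check and may be preferable in practice.

```r
# Flag predictors that meet both degeneracy criteria.
degeneracy_metrics <- function(df, freq_cut = 20, unique_cut = 10) {
  metrics <- t(sapply(df, function(x) {
    # Frequencies of observed values, most frequent first (NA excluded).
    counts <- sort(as.vector(table(x)), decreasing = TRUE)
    counts <- counts[counts > 0]
    freq_ratio <- if (length(counts) < 2) Inf else counts[1] / counts[2]
    pct_unique <- 100 * length(counts) / length(x)
    c(freq_ratio = freq_ratio, pct_unique = pct_unique)
  }))
  metrics <- as.data.frame(metrics)
  metrics$degenerate <- metrics$freq_ratio > freq_cut &
    metrics$pct_unique < unique_cut
  metrics
}

# Toy check: `rare` is dominated by one value, `balanced` is not.
toy <- data.frame(rare = c(rep(0, 99), 1),
                  balanced = rep(c("a", "b"), 50))
degeneracy_metrics(toy)

# List the flagged Soybean predictors when the data set is available.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  soy_metrics <- degeneracy_metrics(Soybean)
  print(soy_metrics[soy_metrics$degenerate, ])
}
```

Requiring both conditions matters: a predictor with a large frequency ratio but many distinct values is skewed rather than degenerate.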
The variables identified as degenerate (mycelium, sclerotia, and leaf.mild) can be eliminated. After removing them, I would use imputation to fill in any remaining missing data. Since missing data is associated with 5 classes, it may be helpful to introduce a new variable that records the number of missing values in each observation. It may also be worthwhile to remove variables with a high percentage of missing data (over 15%) to reduce noise and see whether that improves the outcome.
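The missing-data summaries that this strategy relies on can be computed as in the following sketch. The helper names are hypothetical, the 15% cutoff is the one suggested above, and the Soybean call only runs if `mlbench` is installed.

```r
# Percentage of missing values in each column, largest first.
missing_by_predictor <- function(df) {
  sort(round(100 * colMeans(is.na(df)), 1), decreasing = TRUE)
}

# Number of rows per class with at least one missing predictor value.
missing_by_class <- function(df, class_col = "Class") {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  has_na <- rowSums(is.na(predictors)) > 0
  tapply(has_na, df[[class_col]], sum)
}

# Toy check: class "a" has two incomplete rows, class "b" has none.
toy <- data.frame(Class = c("a", "a", "b", "b"),
                  x = c(1, NA, 3, 4),
                  y = c(NA, NA, 1, 2))
missing_by_predictor(toy)
missing_by_class(toy)

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  pct_missing <- missing_by_predictor(Soybean)
  # Predictors above the suggested 15% missingness cutoff.
  print(pct_missing[pct_missing > 15])
  # Classes that contain incomplete rows.
  by_class <- missing_by_class(Soybean)
  print(by_class[by_class > 0])
  # Candidate new variable: number of missing values per observation.
  Soybean$n_missing <- rowSums(is.na(Soybean[, names(Soybean) != "Class"]))
}
```

Because the per-class counts show where the missingness is concentrated, they help decide whether a missing-count variable would carry class information or only add noise.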