The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(purrr)
library(corrplot)
## corrplot 0.92 loaded
library(mlbench)
library(broom)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
names(Glass)
## [1] "RI" "Na" "Mg" "Al" "Si" "K" "Ca" "Ba" "Fe" "Type"
To get a first sense of the data, I drew a density plot for each predictor. Al, Na and Si look close to normally distributed, while the other six predictors are skewed in one direction or the other. K, Mg and RI all have multiple peaks. The predictors are also on different scales.
Glass %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_density(fill='blue') +
facet_wrap(~key, scales = 'free') +
ggtitle("Density Plots of Each Element")
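The visual impression of skewness can be backed up with a number. The following is a minimal sketch, not part of the original analysis: `skewness_simple` is a hypothetical helper that computes the moment-based sample skewness in base R (values near 0 suggest symmetry, positive values a right tail, negative values a left tail).

```r
library(mlbench)
data(Glass)

# Hypothetical helper: moment-based sample skewness (third standardized moment).
skewness_simple <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Skewness of each numeric predictor, sorted from most left- to most right-skewed.
numeric_cols <- Glass[sapply(Glass, is.numeric)]
skews <- sort(sapply(numeric_cols, skewness_simple))
round(skews, 2)
```

Predictors whose skewness is far from zero are the natural candidates for the transformations discussed later.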
The glass types are not evenly represented. Types 1 and 2 each have at least 70 occurrences, while types 3, 5, 6 and 7 each have fewer than 30. This imbalance will have to be dealt with.
Glass %>%
ggplot() +
geom_bar(aes(x = Type)) +
ggtitle("Frequency of Types of Glass")
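The same information can be read off as exact counts rather than bar heights. A small sketch using only base R, where the object names `type_counts` and `type_share` are illustrative:

```r
library(mlbench)
data(Glass)

# Count of samples per glass type and each type's share of the data set.
type_counts <- table(Glass$Type)
type_share <- round(prop.table(type_counts), 3)

data.frame(
  Type  = names(type_counts),
  n     = as.vector(type_counts),
  share = as.vector(type_share)
)
```

Having the shares at hand makes it easier to decide later whether stratified resampling or class weights are worth considering.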
Each type of glass has a general chemical makeup. To get a general idea of this relationship, we can see the means of each element by its type.
Group 6 is the most distinctive because its means for three elements, K, Ba and Fe, are zero. Any sample that contains no K, Ba or Fe will therefore most likely be classified as type 6.
Another group that stands out is group 5. Its means for Na, Mg, K and Ca all differ noticeably from those of the other types.
Group 7 also catches the eye: its Mg mean is considerably lower than all the others, and its Al and Ba means are the highest of all the types. Group 7 also has the lowest Ca mean and, apart from group 6 (which contains no Fe at all), the lowest Fe mean.
These distinguishing characteristics may make groups 5 and 7 easier for a model to identify.
type_means <- Glass %>%
group_by(Type) %>%
summarise(across(everything(),list(mean)))
type_means
## # A tibble: 6 × 10
## Type RI_1 Na_1 Mg_1 Al_1 Si_1 K_1 Ca_1 Ba_1 Fe_1
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.52 13.2 3.55 1.16 72.6 0.447 8.80 0.0127 0.057
## 2 2 1.52 13.1 3.00 1.41 72.6 0.521 9.07 0.0503 0.0797
## 3 3 1.52 13.4 3.54 1.20 72.4 0.406 8.78 0.00882 0.0571
## 4 5 1.52 12.8 0.774 2.03 72.4 1.47 10.1 0.188 0.0608
## 5 6 1.52 14.6 1.31 1.37 73.2 0 9.36 0 0
## 6 7 1.52 14.4 0.538 2.12 73.0 0.325 8.49 1.04 0.0134
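Reading the extremes off the table by eye is error-prone, so the comparison can also be done programmatically. The sketch below recomputes the per-type means with base R's `aggregate()` and reports, for each predictor, which type has the lowest and which the highest mean. The `extremes` data frame is an illustrative name, not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Mean of every predictor within each glass type.
type_means <- aggregate(. ~ Type, data = Glass, FUN = mean)
predictors <- setdiff(names(type_means), "Type")

# For each predictor, the type with the smallest and the largest mean.
extremes <- data.frame(
  predictor = predictors,
  lowest = sapply(type_means[predictors], function(m) {
    as.character(type_means$Type[which.min(m)])
  }),
  highest = sapply(type_means[predictors], function(m) {
    as.character(type_means$Type[which.max(m)])
  }),
  row.names = NULL
)
extremes
```

Note that `which.min()` and `which.max()` return the first match, so ties (if any) are resolved in favor of the lower-numbered type.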
The standard deviations are a little alarming, since they show how spread out each element's values are within a type. Types 5 and 6 have high standard deviations for multiple elements.
type_std <- Glass %>%
group_by(Type) %>%
summarise(across(everything(),list(sd)))
type_std
## # A tibble: 6 × 10
## Type RI_1 Na_1 Mg_1 Al_1 Si_1 K_1 Ca_1 Ba_1 Fe_1
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.00227 0.499 0.247 0.273 0.569 0.215 0.575 0.0838 0.0891
## 2 2 0.00380 0.664 1.22 0.318 0.725 0.214 1.92 0.362 0.106
## 3 3 0.00192 0.507 0.163 0.347 0.512 0.230 0.380 0.0364 0.108
## 4 5 0.00335 0.777 0.999 0.694 1.28 2.14 2.18 0.608 0.156
## 5 6 0.00312 1.08 1.10 0.572 1.08 0 1.45 0 0
## 6 7 0.00255 0.686 1.12 0.443 0.940 0.668 0.974 0.665 0.0298
The per-type density plots of each element put the tables of means and standard deviations above into context. They show how the data are spread for each type of glass.
Each type's chart for each element looks unique in terms of mean and distribution of values. So even where the means and standard deviations may be similar, the distributions of the values differ.
glass_modified <- Glass %>% group_by(Type)
x <- sapply(glass_modified, is.factor)
glass_modified[ , x] <- as.data.frame(apply(glass_modified[ , x], 2, as.numeric))
glass_1 <- glass_modified[glass_modified$Type == 1,]
glass_1[1:9]%>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_density(fill='blue') +
facet_wrap(~key, scales = 'free') +
ggtitle("Relationship between Element and Glass Type 1")
glass_2 <- glass_modified[glass_modified$Type == 2,]
glass_2[1:9]%>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_density(fill='blue') +
facet_wrap(~key, scales = 'free') +
ggtitle("Relationship between Element and Glass Type 2")
glass_3 <- glass_modified[glass_modified$Type == 3,]
glass_3[1:9]%>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_density(fill='blue') +
facet_wrap(~key, scales = 'free') +
ggtitle("Relationship between Element and Glass Type 3")
glass_5 <- glass_modified[glass_modified$Type == 5,]
glass_5[1:9]%>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_density(fill='blue') +
facet_wrap(~key, scales = 'free') +
ggtitle("Relationship between Element and Glass Type 5")
glass_6 <- glass_modified[glass_modified$Type == 6,]
glass_6[1:9]%>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_density(fill='blue') +
facet_wrap(~key, scales = 'free') +
ggtitle("Relationship between Element and Glass Type 6")
glass_7 <- glass_modified[glass_modified$Type == 7,]
glass_7[1:9]%>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_density(fill='blue') +
facet_wrap(~key, scales = 'free') +
ggtitle("Relationship between Element and Glass Type 7")
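The six blocks above differ only in the glass type they filter on. As an alternative sketch (not the approach used above), the types can be overlaid in a single faceted figure, which makes them directly comparable within each panel. It assumes a tidyr version that provides `pivot_longer()`. The names `glass_long` and `p` are illustrative.

```r
library(ggplot2)
library(tidyr)
library(mlbench)
data(Glass)

# Long format: one row per (sample, predictor) pair, keeping Type as a column.
glass_long <- pivot_longer(Glass, cols = -Type,
                           names_to = "key", values_to = "value")

# One panel per predictor, one semi-transparent density per glass type.
p <- ggplot(glass_long, aes(value, fill = Type)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~key, scales = "free") +
  ggtitle("Density of Each Element by Glass Type")
p
```

The trade-off is that six overlapping densities can get crowded, so the separate per-type figures may still be easier to read for the noisier elements.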
The correlation plot shows only one highly correlated pair of predictors, RI and Ca. Since they carry very similar information, one of the two can ultimately be removed.
Glass %>%
keep(is.numeric) %>%
cor() %>%
corrplot(method='number')
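To avoid reading the pairs off the plot by eye, the correlation matrix can be filtered directly. This is a sketch in base R. The cutoff of 0.75 is an assumption chosen for illustration, not a value taken from the analysis above.

```r
library(mlbench)
data(Glass)

cor_mat <- cor(Glass[sapply(Glass, is.numeric)])
cutoff <- 0.75  # assumed threshold for "highly correlated"

# Look only at the upper triangle so each pair is reported once.
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

Which member of a flagged pair to drop is a judgment call. One common heuristic is to drop the one with the larger average absolute correlation with the remaining predictors.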
When looking at the overall density plots of the data that is not separated by groups, it looks like there are a lot of possible outliers. However, when the data is broken down into groups those outliers look to be somewhat less prevalent.
Take the Ba element as an example. The Ba density is highly skewed for every glass type except type 7. For type 5 the density is right-skewed, but it rises again toward the right tail. Since there are fewer than 30 observations, this might not be an anomaly but rather a normal amount of the element for that type. The spread of the Ba distribution for type 5 is also larger than for most of the other groups.
Glass type 2's Ba density tells a different story from type 5's. It suggests outliers: the spread is larger than for all the other types, but the density thins out as values move further to the right.
For Al, a square root transformation brings the density plot closest to normal. For Na, Ca, RI and Si, a log transformation helps the density plots look more normal. Ba, Fe, K and Mg all contain zeros, so a log transformation would produce infinite values (log(0) is -Inf). A Box-Cox transformation won't work for those predictors either, because it requires strictly positive values.
Al_sqrt <- sqrt(Glass$Al)
Al_sqrt <- Al_sqrt %>% scale(center=TRUE,scale=TRUE) %>% as.vector()
plot(density(Al_sqrt))
ca_log <- log(Glass$Ca)
plot(density(ca_log))
na_log <- log(Glass$Na)
na_log <- na_log %>% scale(center=TRUE,scale=TRUE) %>% as.vector()
plot(density(na_log))
ri_log <- log(Glass$RI)
plot(density(ri_log))
si_log <- log(Glass$Si)
plot(density(si_log))
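The claim about zeros is easy to check before choosing a transformation. The sketch below flags every predictor that contains at least one zero and shows what a log transformation does to such a value. `has_zero` is an illustrative name.

```r
library(mlbench)
data(Glass)

numeric_cols <- Glass[sapply(Glass, is.numeric)]

# TRUE for predictors containing at least one exact zero.
has_zero <- sapply(numeric_cols, function(x) any(x == 0))
names(which(has_zero))

# The log of zero is -Inf, which is why these predictors cannot be
# log-transformed (or Box-Cox-transformed) without modification.
log(0)

# Predictors that are strictly positive and therefore eligible.
names(which(!has_zero))
```

If the zero-containing predictors also need to be made more symmetric, a transformation defined at zero would have to be chosen instead, and that choice is outside what is shown here.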
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
unique(Soybean$Class)
## [1] diaporthe-stem-canker charcoal-rot
## [3] rhizoctonia-root-rot phytophthora-rot
## [5] brown-stem-rot powdery-mildew
## [7] downy-mildew brown-spot
## [9] bacterial-blight bacterial-pustule
## [11] purple-seed-stain anthracnose
## [13] phyllosticta-leaf-spot alternarialeaf-spot
## [15] frog-eye-leaf-spot diaporthe-pod-&-stem-blight
## [17] cyst-nematode 2-4-d-injury
## [19] herbicide-injury
## 19 Levels: 2-4-d-injury alternarialeaf-spot anthracnose ... rhizoctonia-root-rot
nrow(Soybean)
## [1] 683
This data set differs from the Glass data set: its predictors are categorical rather than numeric. It also has missing values in almost every field.
ggplot(melt(Soybean, id.vars=c('Class')), aes(x=value)) +
geom_bar() +
facet_wrap(~variable, scales="free")
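The bar charts show that missing values exist, but not how they are distributed. A short sketch in base R that tabulates missingness per predictor and per class (the object names are illustrative):

```r
library(mlbench)
data(Soybean)

# Number of missing values in each column, largest first.
missing_by_predictor <- sort(colSums(is.na(Soybean)), decreasing = TRUE)
head(missing_by_predictor, 10)

# Rows with at least one missing value, cross-tabulated against the class.
incomplete <- !complete.cases(Soybean)
missing_by_class <- table(Soybean$Class, incomplete)

# Keep only the classes that have at least one incomplete row.
missing_by_class[missing_by_class[, "TRUE"] > 0, , drop = FALSE]
```

Seeing whether the incomplete rows cluster in a few classes is what determines whether dropping rows, dropping columns, or imputing is the more sensible choice.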
Four of the classes appear to be recorded on different fields than the rest and are more specific in nature, so I think the best way to deal with their missing data is to remove the observations belonging to 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight and herbicide-injury.
For the phytophthora-rot class, imputation would be best, because that class seems to be measured on the same traits as the other 14 classes.
This strategy assumes that the prediction target has nothing to do with the specific nature of those four classes. Otherwise the strategy changes, possibly to the opposite approach or to an entirely different one, such as removing the specific fields that have missing values.
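The strategy described above could be sketched as follows. The text does not specify an imputation method, so filling each predictor with its most frequent level (mode imputation) is an assumption made purely for illustration, and `impute_mode` is a hypothetical helper rather than a function from any package.

```r
library(mlbench)
data(Soybean)

drop_classes <- c("2-4-d-injury", "cyst-nematode",
                  "diaporthe-pod-&-stem-blight", "herbicide-injury")

# Remove the observations of the four classes and drop their unused levels.
soy_kept <- droplevels(Soybean[!Soybean$Class %in% drop_classes, ])

# Hypothetical helper: replace NAs in a factor with its most frequent level.
# table() ignores NAs by default, so the mode is taken over observed values.
impute_mode <- function(x) {
  if (anyNA(x)) {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

# Impute every predictor, leaving the Class column untouched.
predictors <- setdiff(names(soy_kept), "Class")
soy_imputed <- soy_kept
soy_imputed[predictors] <- lapply(soy_kept[predictors], impute_mode)

anyNA(soy_imputed)
```

Mode imputation is the simplest option and ignores relationships between predictors. A model-based or nearest-neighbor imputation may be preferable if the imputed fields turn out to matter for prediction.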