The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
library(tidyr)
library(dplyr)
library(ggplot2)
library(corrplot)
library(e1071)
library(caret)
library(naniar)
library(mice)
library(VIM)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# Converting into long format and creating Predictor as factor
melted <- Glass %>% pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na=TRUE) %>%
mutate(Predictor = as.factor(Predictor)) %>% arrange(Predictor)
# Explore the distribution of each factor in Predictor group
melted %>%
ggplot(., aes(Value, fill=Predictor))+geom_histogram(bins=20)+ facet_wrap(~Predictor,scales='free') + labs(title="Distribution of Predictors")+theme_minimal()
melted %>%
ggplot(., aes(Value, fill=Predictor))+geom_density()+ facet_wrap(~Predictor,scales='free') + labs(title="Distribution of Predictors")+theme_minimal()
# Correlation matrix
Glass %>% select(-Type) %>% cor() %>% corrplot(., method='color', type="upper", order="hclust",
addCoef.col = "black", tl.col="black", tl.srt=45, diag=FALSE)
According to the above graphs, Si, Al, Na and RI are closer to normally distributed than the other predictors. Ca, Na and Si also make up the largest percentages of the glass. Most pairs of predictors are not strongly correlated. There is some correlation between RI and Si, Mg and Al, Mg and Ba, and Mg and Ca. The strongest correlation is between Ca and RI.
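To back up the reading of the correlation plot with numbers, the pairs can be ranked by absolute correlation. This is a small sketch using base R only; the names cor_mat and cor_pairs are my own and not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Pairwise Pearson correlations between the nine numeric predictors
cor_mat <- cor(Glass[, names(Glass) != "Type"])

# Keep each pair once (upper triangle), then sort by absolute correlation
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx],
  stringsAsFactors = FALSE
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]

# Five strongest relationships, positive or negative
head(cor_pairs, 5)
```

Sorting on the absolute value keeps strong negative relationships in view, which are easy to overlook in a colour-coded plot.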
Do there appear to be any outliers in the data? Are any predictors skewed?
# Plotting boxplot for all elements
melted %>% ggplot(aes(Predictor, Value, fill= Predictor))+geom_boxplot()+facet_wrap(~Predictor, scale='free')+coord_flip()+labs(title="Boxplot for Multiple Elements in Glass") + theme_classic() + theme(axis.text.x = element_text(angle=30, hjust=1))
# Summary statistics
summary(Glass[-10])
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.05701
## 3rd Qu.:0.10000
## Max. :0.51000
# Skewness
Glass[-10] %>% apply(2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
According to the boxplots, every predictor except Mg shows some outliers, although most of the outlying values are not extreme. The summary statistics show that the mean and median are broadly similar for most predictors; the clearest exception is Mg, with a mean of 2.69 and a median of 3.48. The skewness function gives a numerical check on how asymmetric each distribution is.
The skewness output shows that K is the most skewed (6.46), followed by Ba (3.37) and Ca (2.02). Fe, RI and Mg are moderately skewed (absolute skewness between 1 and 2), while Na, Al and Si are the closest to symmetric.
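The boxplot impression of outliers can also be counted directly. The sketch below applies the same 1.5 * IQR whisker rule that boxplots use, through base R's boxplot.stats; the name n_outliers is just illustrative.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Number of points beyond the boxplot whiskers (1.5 * IQR rule) per predictor
n_outliers <- sapply(predictors, function(x) length(boxplot.stats(x)$out))

# Predictors with the most flagged points first
sort(n_outliers, decreasing = TRUE)
```

This rule should be read with care for Ba: the summary shows its third quartile is 0, so the IQR is 0 and every nonzero value is flagged, even though those values may be perfectly legitimate measurements.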
Are there any relevant transformations of one or more predictors that might improve the classification model?
Let’s transform the predictors with a Box-Cox transformation using the preProcess function from the caret package.
# Transformation, scaling and centering the data
trans <- preProcess(Glass, method=c("BoxCox", "center","scale"))
trans2 <- predict(trans, Glass)
# Plot the transformed data
melted_bx <- trans2 %>% pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na=TRUE) %>%
mutate(Predictor = as.factor(Predictor)) %>% arrange(Predictor)
melted_bx %>% ggplot(aes(Predictor, Value, fill= Predictor))+geom_boxplot()+facet_wrap(~Predictor, scale='free')+coord_flip()+labs(title="Boxplot for Multiple Elements in Glass") + theme_classic() + theme(axis.text.x = element_text(angle=30, hjust=1))
trans2[-10] %>% apply(2, skewness)
## RI Na Mg Al Si K
## 1.56566039 0.03384644 -1.13645228 0.09105899 -0.65090568 6.46008890
## Ca Ba Fe
## -0.19395573 3.36867997 1.72981071
# Skewness
Glass[-10] %>% apply(2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
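Centering and scaling brought each predictor's mean to about 0 and its standard deviation to about 1. The Box-Cox step helped only some predictors: skewness dropped noticeably for Na, Al and Ca, changed little for RI and Si, and did not change at all for Mg, K, Ba and Fe. Those four predictors contain zeros, and Box-Cox is only defined for strictly positive values, so they were left untransformed. I am not sure about the details of the data set, but if the outliers are data-entry errors they could be replaced with the median or imputed with knn. A log transformation is another option here, although predictors containing zeros would need an offset such as log(x + 1).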
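As a rough sketch of the log idea, the code below first checks which predictors contain zeros (and so cannot go through Box-Cox) and then applies log(1 + x) to just those columns. This is an illustration rather than a recommended final preprocessing step, and the object names are my own.

```r
library(mlbench)
library(e1071)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Box-Cox requires strictly positive values; find predictors that contain zeros
has_nonpositive <- sapply(predictors, function(x) any(x <= 0))
names(which(has_nonpositive))

# Apply log(1 + x) only to those predictors (defined at zero, unlike log(x))
predictors_log <- predictors
predictors_log[has_nonpositive] <- lapply(predictors[has_nonpositive], log1p)

# Skewness before and after, for the affected predictors only
round(cbind(
  original = sapply(predictors[has_nonpositive], skewness),
  log1p    = sapply(predictors_log[has_nonpositive], skewness)
), 2)
```

Whether this helps the classifier is a separate question; tree-based models, for example, are unaffected by monotone transformations of a predictor.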
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# Summary
summary(Soybean[,2:36])
## date plant.stand precip temp hail crop.hist
## 5 :149 0 :354 0 : 74 0 : 80 0 :435 0 : 65
## 4 :131 1 :293 1 :112 1 :374 1 :127 1 :165
## 3 :118 NA's: 36 2 :459 2 :199 NA's:121 2 :219
## 2 : 93 NA's: 38 NA's: 30 3 :218
## 6 : 90 NA's: 16
## (Other):101
## NA's : 1
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo
## 0 :123 0 :195 0 :305 0 :165 0 :441 0: 77 0 :221
## 1 :227 1 :322 1 :222 1 :213 1 :226 1:606 1 : 36
## 2 :145 2 : 45 2 : 35 2 :193 NA's: 16 2 :342
## 3 :187 NA's:121 NA's:121 NA's:112 NA's: 84
## NA's: 1
##
##
## leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging
## 0 :357 0 : 51 0 :487 0 :554 0 :535 0 :296 0 :520
## 1 : 21 1 :327 1 : 96 1 : 45 1 : 20 1 :371 1 : 42
## 2 :221 2 :221 NA's:100 NA's: 84 2 : 20 NA's: 16 NA's:121
## NA's: 84 NA's: 84 NA's:108
##
##
##
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 0 :379 0 :320 0 :473 0 :497 0 :639 0 :581
## 1 : 39 1 : 83 1 :104 1 :135 1 : 6 1 : 44
## 2 : 36 2 :177 NA's:106 2 : 13 NA's: 38 2 : 20
## 3 :191 3 : 65 NA's: 38 NA's: 38
## NA's: 38 NA's: 38
##
##
## sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor
## 0 :625 0 :407 0 :345 0 :476 0 :524 0 :513
## 1 : 20 1 :130 1 : 75 1 :115 1 : 67 1 : 64
## NA's: 38 2 : 14 2 : 57 NA's: 92 NA's: 92 NA's:106
## 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## seed.size shriveling roots
## 0 :532 0 :539 0 :551
## 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
# Plotting categorical variables
Soybean %>% mutate(across(everything(), as.character)) %>% pivot_longer(everything(), names_to = "key", values_to = "value") %>% ggplot(aes(value))+facet_wrap(~key, scales = "free")+geom_bar()
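The plot above shows the frequency distribution of each categorical variable. A few predictors are close to degenerate: in the summary, mycelium (639 vs. 6), sclerotia (625 vs. 20) and leaf.mild (535 vs. 20 and 20) are dominated by a single level. The plot also shows that missing values exist in almost all of the variables. Let’s plot and print out the number of missing values in the next section.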
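One way to check for degenerate distributions without eyeballing 35 panels is caret's nearZeroVar function. The sketch below uses its default cutoffs, which are a convention rather than a hard rule, and the object name nzv is my own.

```r
library(mlbench)
library(caret)
data(Soybean)

# freqRatio     = count of most common level / count of second most common level
# percentUnique = number of distinct values as a percentage of the sample size
nzv <- nearZeroVar(Soybean[, names(Soybean) != "Class"], saveMetrics = TRUE)

# Predictors flagged under caret's default cutoffs (freqCut = 95/5, uniqueCut = 10)
nzv[nzv$nzv, c("freqRatio", "percentUnique")]
```

A flagged predictor is not automatically useless: a rare level can still be informative if it occurs mostly in one disease class, so it may be worth cross-tabulating against Class before dropping anything.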
Roughly 18% of the data are missing. Are there particular predictors that are most likely to be missing? Is the pattern of missing data related to the classes?
# Calculate the missing values
colSums(is.na(Soybean))
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
# Visualize
vis_miss(Soybean) + labs(title="Summarized visualization of missing values")
gg_miss_upset(Soybean)
There are a lot of missing values in the data set, as we can see in the column counts and the first plot. The second plot shows whether values tend to be missing together. It shows that germ, hail, sever, seed.tmt and lodging are often missing in the same rows, so the missingness is not scattered at random across predictors.
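To see whether the missingness is related to the classes, the incomplete rows can be tabulated by Class. This is a minimal base R sketch; has_na and by_class are illustrative names.

```r
library(mlbench)
data(Soybean)

# Flag rows with at least one missing value
has_na <- rowSums(is.na(Soybean)) > 0

# Per class: number of samples, number with any NA, and the resulting share
by_class <- data.frame(
  n         = as.vector(table(Soybean$Class)),
  n_any_na  = as.vector(tapply(has_na, Soybean$Class, sum)),
  row.names = levels(Soybean$Class)
)
by_class$share_any_na <- by_class$n_any_na / by_class$n

# Show only the classes that have incomplete rows
by_class[by_class$n_any_na > 0, ]
```

If the incomplete rows turn out to be concentrated in a few classes, then dropping incomplete cases would remove most or all of those classes, which matters for choosing between deletion and imputation.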
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
There are various strategies for dealing with missing values, and the right one depends on the data set and on domain knowledge. In some areas, especially medical science, it is safer to remove the incomplete data than to alter it. In social science, you can substitute missing values with the mean or median and then compare model accuracy to select the strategy that leads to the better model. Other tools such as knn and mice can also be very effective.
mice_method <- mice(Soybean, method="pmm", printFlag=F, seed=200)
## Warning: Number of logged events: 1669
aggr(complete(mice_method), prop=c(TRUE,TRUE), bars=TRUE, numbers=TRUE, sortVars=TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Class 0
## date 0
## plant.stand 0
## precip 0
## temp 0
## hail 0
## crop.hist 0
## area.dam 0
## sever 0
## seed.tmt 0
## germ 0
## plant.growth 0
## leaves 0
## leaf.halo 0
## leaf.marg 0
## leaf.size 0
## leaf.shread 0
## leaf.malf 0
## leaf.mild 0
## stem 0
## lodging 0
## stem.cankers 0
## canker.lesion 0
## fruiting.bodies 0
## ext.decay 0
## mycelium 0
## int.discolor 0
## sclerotia 0
## fruit.pods 0
## fruit.spots 0
## seed 0
## mold.growth 0
## seed.discolor 0
## seed.size 0
## shriveling 0
## roots 0
This method assumes values are missing at random, but we saw previously that some variables are missing together in a pattern, so it may not be very effective in this case. We might need another tool. I’ll use the knn method to impute the missing values again.
# knn method (kNN from the VIM package); uses all columns, including Class and date
knn_method <- kNN(Soybean, k=5)
colSums(is.na(knn_method))
## Class date plant.stand precip
## 0 0 0 0
## temp hail crop.hist area.dam
## 0 0 0 0
## sever seed.tmt germ plant.growth
## 0 0 0 0
## leaves leaf.halo leaf.marg leaf.size
## 0 0 0 0
## leaf.shread leaf.malf leaf.mild stem
## 0 0 0 0
## lodging stem.cankers canker.lesion fruiting.bodies
## 0 0 0 0
## ext.decay mycelium int.discolor sclerotia
## 0 0 0 0
## fruit.pods fruit.spots seed mold.growth
## 0 0 0 0
## seed.discolor seed.size shriveling roots
## 0 0 0 0
## Class_imp date_imp plant.stand_imp precip_imp
## 0 0 0 0
## temp_imp hail_imp crop.hist_imp area.dam_imp
## 0 0 0 0
## sever_imp seed.tmt_imp germ_imp plant.growth_imp
## 0 0 0 0
## leaves_imp leaf.halo_imp leaf.marg_imp leaf.size_imp
## 0 0 0 0
## leaf.shread_imp leaf.malf_imp leaf.mild_imp stem_imp
## 0 0 0 0
## lodging_imp stem.cankers_imp canker.lesion_imp fruiting.bodies_imp
## 0 0 0 0
## ext.decay_imp mycelium_imp int.discolor_imp sclerotia_imp
## 0 0 0 0
## fruit.pods_imp fruit.spots_imp seed_imp mold.growth_imp
## 0 0 0 0
## seed.discolor_imp seed.size_imp shriveling_imp roots_imp
## 0 0 0 0
I replaced the missing values using the knn method, and the output shows that no missing values remain (the _imp columns are indicator columns that kNN adds to flag imputed cells). However, I cannot say whether this method was effective until I build a model, split the data and test its accuracy. To sum up, there are various strategies for dealing with missing values, and the choice depends on the data set and the domain. Many data scientists simply replace missing values with the median.
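Since the soybean predictors are categorical, the closest analogue to median imputation is filling each predictor with its most frequent level. The sketch below is only a simple baseline to compare against knn or mice, not a recommendation; mode_level and soy_mode are hypothetical names of my own.

```r
library(mlbench)
data(Soybean)

# Most frequent non-missing level of a factor (table() ignores NA by default)
mode_level <- function(x) names(which.max(table(x)))

# Replace the NAs in every predictor with that predictor's modal level
soy_mode <- Soybean
for (col in setdiff(names(soy_mode), "Class")) {
  x <- soy_mode[[col]]
  x[is.na(x)] <- mode_level(x)
  soy_mode[[col]] <- x
}

# Confirm that nothing is missing any more
sum(is.na(soy_mode))
```

Mode imputation makes dominant levels even more dominant, so comparing model accuracy across this baseline, knn and mice on a held-out split would be the fairer test of which strategy works here.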