The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven types of glass (only six of the seven types actually occur in this sample, as the str() output below shows). There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
First, let's examine the overall distribution of the glass type variable.
library(ggplot2)
ggplot(Glass, aes(x = Type)) +
  geom_histogram(fill = "deepskyblue3", colour = "black", stat = "count") +
  ggtitle("Distribution by Type of Glass") +
  xlab("Glass Type") +
  ylab("Count") +
  coord_flip() +
  labs(caption = "UCI Machine Learning Repository")
We can see that glass types 1 and 2 account for most of the cases in the data set. What about the distributions of the predictors?
library(DataExplorer)
Glass2<-subset(Glass, select=c(-Type))
plot_histogram(Glass2)
RI, Na, Al, and Si have fairly close to normal distributions. Mg shows a left skew and possibly an outlier, while K, Ba, and Fe are right skewed.
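To put numbers behind these visual impressions, we can compute the sample skewness of each predictor (a small sketch assuming the e1071 package is installed; positive values indicate right skew, negative values left skew):
# Sample skewness per predictor
library(e1071)
round(sapply(Glass2, skewness), 2)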
Are we missing data?
plot_missing(Glass)
How can we examine the relationships between the predictors?
# correlation matrix and visualization
correlation_matrix <- round(cor(Glass2), 2)
# Get lower triangle of the correlation matrix
get_lower_tri <- function(correlation_matrix){
  correlation_matrix[upper.tri(correlation_matrix)] <- NA
  return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
  correlation_matrix[lower.tri(correlation_matrix)] <- NA
  return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix into long format
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 15, hjust = 1)) +
  coord_fixed()
# add value labels and polish the theme
ggheatmap +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(size = rel(0.8), angle = 90),
    axis.text.y = element_text(size = rel(0.8)),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank(),
    legend.justification = c(1, 0),
    legend.position = c(0.6, 0.7),
    legend.direction = "horizontal") +
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                               title.position = "top", title.hjust = 0.5))
rm(Glass2)
The correlation heatmap does not indicate many instances where predictors are highly correlated with one another. The main exception is Ca, which is highly correlated with RI (coefficient of 0.81), followed by a mild correlation between Ba and Al. The strongest negative relationship is between Si and RI.
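As a cross-check on the heatmap, caret's findCorrelation() can flag predictors whose pairwise correlations exceed a chosen cutoff (a quick sketch; the 0.75 cutoff here is an arbitrary illustrative choice):
# Predictors that findCorrelation() would suggest removing at |r| > 0.75
library(caret)
high_corr <- findCorrelation(cor(Glass[, 1:9]), cutoff = 0.75)
names(Glass[, 1:9])[high_corr]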
Recall the following plot from part a.
Glass2<-subset(Glass, select=c(-Type))
plot_histogram(Glass2)
We will restate our findings from part a. We can see that the following predictors have a close to normal distribution:
Na
Al
Si
The following predictors have right skews:
Ba
Ca
K
RI
Fe
The following predictors have left skews:
Mg
The following predictors appear to have outliers:
Ba
Fe
K
Mg
We can certainly drill down further into whether outliers are actually present in the data, starting with a simple numeric check and then some visual tests.
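As a quick numeric check before running the full script below, we can count how many points in each predictor fall outside the whiskers of a standard boxplot (the 1.5 * IQR rule); this is only a rough sketch:
# Number of boxplot-rule outliers per predictor
sapply(Glass2, function(x) length(boxplot.stats(x)$out))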
We can use a handy script to identify and visualize the outliers in each variable. The original code is found here: https://www.r-bloggers.com/identify-describe-plot-and-remove-the-outliers-from-the-dataset/
outlierKD <- function(dt, var) {
  var_name <- eval(substitute(var), eval(dt))
  na1 <- sum(is.na(var_name))
  m1 <- mean(var_name, na.rm = TRUE)
  par(mfrow = c(2, 2), oma = c(0, 0, 3, 0))
  boxplot(var_name, main = "With outliers")
  hist(var_name, main = "With outliers", xlab = NA, ylab = NA)
  outlier <- boxplot.stats(var_name)$out
  mo <- mean(outlier)
  var_name <- ifelse(var_name %in% outlier, NA, var_name)
  boxplot(var_name, main = "Without outliers")
  hist(var_name, main = "Without outliers", xlab = NA, ylab = NA)
  title("Outlier Check", outer = TRUE)
  na2 <- sum(is.na(var_name))
  cat("Outliers identified:", na2 - na1, "\n")
  cat("Proportion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name)) * 100, 1), "\n")
  cat("Mean of the outliers:", round(mo, 2), "\n")
  m2 <- mean(var_name, na.rm = TRUE)
  cat("Mean without removing outliers:", round(m1, 2), "\n")
  cat("Mean if we remove outliers:", round(m2, 2), "\n")
  response <- readline(prompt = "Do you want to remove outliers and to replace with NA? [yes/no]: ")
  if (response == "y" | response == "yes") {
    dt[as.character(substitute(var))] <- invisible(var_name)
    assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
    cat("Outliers successfully removed", "\n")
    return(invisible(dt))
  } else {
    cat("Nothing changed", "\n")
    return(invisible(var_name))
  }
}
outlierKD(Glass, RI)
## Outliers identified: 17
## Proportion (%) of outliers: 8.6
## Mean of the outliers: 1.52
## Mean without removing outliers: 1.52
## Mean if we remove outliers: 1.52
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
outlierKD(Glass, Na)
## Outliers identified: 7
## Proportion (%) of outliers: 3.4
## Mean of the outliers: 12.66
## Mean without removing outliers: 13.41
## Mean if we remove outliers: 13.43
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
outlierKD(Glass, Mg)
## Outliers identified: 0
## Proportion (%) of outliers: 0
## Mean of the outliers: NaN
## Mean without removing outliers: 2.68
## Mean if we remove outliers: 2.68
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
outlierKD(Glass, Al)
## Outliers identified: 18
## Proportion (%) of outliers: 9.2
## Mean of the outliers: 2.09
## Mean without removing outliers: 1.44
## Mean if we remove outliers: 1.39
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
outlierKD(Glass, K)
## Outliers identified: 7
## Proportion (%) of outliers: 3.4
## Mean of the outliers: 3.06
## Mean without removing outliers: 0.5
## Mean if we remove outliers: 0.41
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
outlierKD(Glass, Ca)
## Outliers identified: 26
## Proportion (%) of outliers: 13.8
## Mean of the outliers: 11.17
## Mean without removing outliers: 8.96
## Mean if we remove outliers: 8.65
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
outlierKD(Glass, Ba)
## Outliers identified: 38
## Proportion (%) of outliers: 21.6
## Mean of the outliers: 0.99
## Mean without removing outliers: 0.18
## Mean if we remove outliers: 0
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
outlierKD(Glass, Fe)
## Outliers identified: 12
## Proportion (%) of outliers: 5.9
## Mean of the outliers: 0.32
## Mean without removing outliers: 0.06
## Mean if we remove outliers: 0.04
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
The function reveals that there are outliers in almost every variable, with the exception of Mg.
Since our predictors are all numeric, we can apply min-max normalization, rescaling each predictor to the [0, 1] interval. Putting the predictors on a common scale is helpful for many classification models.
glass_norm<-as.data.frame(apply(Glass[,1:9], 2,
function(x)(x-min(x))/(max(x)-min(x))
))
glass_norm$Type<-Glass$Type
#head(glass_norm)
plot_histogram(glass_norm)
We can also try a Box-Cox transformation on the variables, following this tutorial: https://setscholars.net/2019/05/25/how-to-preprocess-data-in-r-using-box-cox-transformation/
# load libraries
library(mlbench)
library(caret)
preprocessParams <- preProcess(Glass[,1:9], method=c("BoxCox"))
# summarize transform parameters
print(preprocessParams)
## Created from 214 samples and 5 variables
##
## Pre-processing:
## - Box-Cox transformation (5)
## - ignored (0)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
# transform the dataset using the parameters
transformed <- predict(preprocessParams, Glass[,1:9])
# summarize the transformed dataset
#summary(transformed);
plot_histogram(transformed)
Summary Before Box-Cox
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
Summary After Box-Cox
summary(transformed)
## RI Na Mg Al
## Min. :0.2810 Min. :2.373 Min. :0.000 Min. :-0.9230
## 1st Qu.:0.2826 1st Qu.:2.558 1st Qu.:2.115 1st Qu.: 0.1817
## Median :0.2829 Median :2.588 Median :3.480 Median : 0.3324
## Mean :0.2831 Mean :2.594 Mean :2.685 Mean : 0.3685
## 3rd Qu.:0.2833 3rd Qu.:2.626 3rd Qu.:3.600 3rd Qu.: 0.5534
## Max. :0.2875 Max. :2.855 Max. :4.490 Max. : 1.7417
## Si K Ca Ba
## Min. :2436 Min. :0.0000 Min. :0.7677 Min. :0.000
## 1st Qu.:2612 1st Qu.:0.1225 1st Qu.:0.8197 1st Qu.:0.000
## Median :2649 Median :0.5550 Median :0.8238 Median :0.000
## Mean :2639 Mean :0.4971 Mean :0.8256 Mean :0.175
## 3rd Qu.:2670 3rd Qu.:0.6100 3rd Qu.:0.8297 3rd Qu.:0.000
## Max. :2843 Max. :6.2100 Max. :0.8666 Max. :3.150
## Fe
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.05701
## 3rd Qu.:0.10000
## Max. :0.51000
We can see that the transformation made a difference for some of the predictors but not all: Box-Cox requires strictly positive values, so only RI, Na, Al, Si, and Ca were transformed, while Mg, K, Ba, and Fe, which contain zeros, were left unchanged. A next step would be outlier treatment, but that is out of scope for this problem.
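Because Box-Cox skips predictors with zeros, one alternative worth noting is the Yeo-Johnson transformation, which caret's preProcess also supports and which accepts zero values, so it could be applied to Mg, K, Ba, and Fe as well (a sketch, not something the exercise asks for):
# Yeo-Johnson handles the zero-valued predictors that Box-Cox must skip
yj_params <- preProcess(Glass[, 1:9], method = "YeoJohnson")
transformed_yj <- predict(yj_params, Glass[, 1:9])
plot_histogram(transformed_yj)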
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
library(mlbench)
data(Soybean)
#str(Soybean)
plot_missing(Soybean)
We can see that there are several variables that contain missing entries. Perhaps we can deal with this later on in the problem.
As we have seen in the literature, a predictor has a degenerate distribution when it takes only a single unique value (zero variance), or when a small handful of values dominate and the remaining values occur with very low frequency (near-zero variance).
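caret summarizes this with two statistics: the frequency ratio (count of the most common value divided by the count of the second most common) and the percentage of unique values. As a small hand check (using mycelium purely as an illustration; table() drops missing values), the frequency ratio can be computed directly and should match the nearZeroVar() output shown further below:
# Hand computation of the frequency ratio for one predictor (mycelium):
# count of the most common level divided by count of the second most common
tab <- sort(table(Soybean$mycelium), decreasing = TRUE)
tab[1] / tab[2]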
Let's examine the class variable before looking at our predictors.
ggplot(Soybean, aes(x = Class)) +
  geom_histogram(fill = "deepskyblue3", colour = "black", stat = "count") +
  ggtitle("Distribution by Type of Soybean") +
  xlab("Soybean Type") +
  ylab("Count") +
  coord_flip() +
  labs(caption = "UCI Machine Learning Repository")
We can see that certain classes appear much more frequently than others, such as phytophthora-rot.
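For the exact counts behind the plot, we can simply tabulate the class variable:
# Class frequencies, from most to least common
sort(table(Soybean$Class), decreasing = TRUE)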
Let's examine the predictors visually, adapting the plotting approach from https://www.r-bloggers.com/smoothscatter-with-ggplot2/
soy_predictors <- Soybean[, 2:36]
par(mfrow = c(3, 6))
for (i in 1:ncol(soy_predictors)) {
  # plot() on a factor column produces a bar plot of its level frequencies
  plot(soy_predictors[, i], ylab = names(soy_predictors)[i])
}
library(caret)
nearZeroVar(soy_predictors, names = TRUE, saveMetrics=T)
## freqRatio percentUnique zeroVar nzv
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
Our test reveals that we do not have any zero-variance predictors, i.e., no variable takes a single unique value. We do, however, see several variables with rare values, such as leaf.mild, mycelium, and sclerotia, which are flagged as near-zero variance; these are degenerate variables by the definition above.
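If we decided to act on this, a minimal sketch of dropping the flagged near-zero-variance columns with caret (not something the exercise requires) would be:
# Indices of the near-zero-variance predictors flagged above
nzv_cols <- nearZeroVar(soy_predictors)
names(soy_predictors)[nzv_cols]   # leaf.mild, mycelium, sclerotia
soy_filtered <- soy_predictors[, -nzv_cols]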
Let's recall our missing-data plot before investigating how the missingness relates to the classes:
plot_missing(Soybean)
As a whole, we can see that four predictors stand out with the highest percentage of missing values, and no predictor exceeds 20% missing.
library(tidyverse)
Soybean %>%
mutate(Total = n()) %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n(), Proportion=Missing/Total) %>%
select(Class, Missing, Proportion) %>%
unique()
## # A tibble: 5 x 3
## # Groups: Class [5]
## Class Missing Proportion
## <fct> <int> <dbl>
## 1 phytophthora-rot 68 0.0996
## 2 diaporthe-pod-&-stem-blight 15 0.0220
## 3 cyst-nematode 14 0.0205
## 4 2-4-d-injury 16 0.0234
## 5 herbicide-injury 8 0.0117
If we partition the data by class and drill down into the proportion of incomplete rows per class, we can identify which classes account for the missing data: phytophthora-rot accounts for the most by far. We could also have done these calculations with the sqldf package, or with the short base R check below.
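For reference, the per-class counts of incomplete rows can be reproduced with a short base R check (an alternative to the dplyr pipeline above):
# Rows with at least one missing value, tabulated by class
missing_by_class <- table(Soybean$Class[!complete.cases(Soybean)])
missing_by_class[missing_by_class > 0]
round(missing_by_class[missing_by_class > 0] / nrow(Soybean), 4)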
We can now address the missing data itself. Rather than dropping the affected rows or predictors, we will impute the missing values with the mice package:
https://datascienceplus.com/imputing-missing-data-with-r-mice-package/
After reading the method definitions, it makes the most sense to use method = "pmm" (predictive mean matching): https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
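Before imputing, it can also help to inspect the missing-data patterns directly; mice's md.pattern() tabulates which combinations of variables tend to be missing together (an optional check, not part of the tutorials above):
library(mice)
# One row per distinct missing-data pattern; the last column counts
# how many variables are missing in that pattern
md.pattern(Soybean)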
library(rcompanion)
library(psych)
library(DescTools)
library(mice)
the_soy <- mice(Soybean, method="pmm", printFlag=F, seed=112)
the_soy <- complete(the_soy)
the_soy <- as.data.frame(the_soy)
plot_missing(the_soy)
We can see that we no longer have missing data; however, it is important to check that imputation has not noticeably changed the distributions of the predictors.
soy_predictors2 <- the_soy[, 2:36]
par(mfrow = c(3, 6))
for (i in 1:ncol(soy_predictors2)) {
  # Compare the imputed predictor distributions with the originals above
  plot(soy_predictors2[, i], ylab = names(soy_predictors2)[i])
}