Helpful links: http://rismyhammer.com/ml/Pre-Processing.html#pre-processing https://www.rdocumentation.org/packages/caret/versions/6.0-92/topics/preProcess
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library(tidyverse)  # provides gather(), ggplot() and the dplyr verbs
Glass |>
gather(element, value, -Type) |>
ggplot(aes(value)) +
geom_histogram(fill='skyblue3', bins=25) +
theme(panel.grid = element_blank())+
facet_wrap(~element, scales = "free")
We use histograms to visualize the distribution of each predictor. From the above we see:
library(corrplot)  # provides corrplot.mixed()
Glass |>
select(-Type) |>
cor() |>
corrplot.mixed(tl.col = 'black')
A correlation plot shows the pairwise linear correlation between the predictors.
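To read the strongest relationships off the plot more precisely, the same correlation matrix can be ranked numerically. This is a minimal sketch in base R; the names `cors` and `pairs_df` are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the nine numeric predictors (Type is excluded)
cors <- cor(Glass[, names(Glass) != "Type"])

# Keep each pair once by blanking the diagonal and the upper triangle
cors[upper.tri(cors, diag = TRUE)] <- NA

# Long format: one row per predictor pair, sorted by absolute correlation
pairs_df <- as.data.frame(as.table(cors))
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[!is.na(pairs_df$r), ]
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df, 5)
```

The top rows of `pairs_df` should correspond to the darkest cells in the plot.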
Do there appear to be any outliers in the data? Are any predictors skewed?
Glass |>
gather(element, value, -Type) |>
ggplot(aes(value)) +
geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) +
theme(panel.grid = element_blank())+
facet_wrap(~element, scales = "free")
Box plots make it easy to visualize outliers, especially when they are drawn in a distinctive color. Here we also set an alpha value for the outliers so that overlapping values show up as a deeper color (see Ba).
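The outliers can also be counted per predictor. The sketch below uses `boxplot.stats()` from base R, which follows the same 1.5 × IQR convention as the box plots; it computes the box from hinges rather than quartiles, so the counts may differ slightly from what ggplot2 draws. The name `outlier_counts` is illustrative.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Number of points beyond 1.5 * IQR from the box, per predictor
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
sort(outlier_counts, decreasing = TRUE)
```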
Although we mentioned skewness in the review of the histograms, we can also see it in the box plots by looking at the median and whiskers. For example:
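Skewness can also be quantified rather than judged by eye. The following sketch uses a simple moment-based estimate; `sample_skewness` is a hypothetical helper written for illustration, and packaged estimators (for example in e1071) may use a slightly different normalization.

```r
library(mlbench)
data(Glass)

# Hypothetical helper: third central moment divided by variance^(3/2)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

predictors <- Glass[, names(Glass) != "Type"]
skew <- sapply(predictors, sample_skewness)

# Positive values indicate a long right tail, negative values a long left tail
round(sort(skew, decreasing = TRUE), 2)
```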
Are there any relevant transformations of one or more predictors that might improve the classification model?
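First we will try a Box-Cox transformation on Fe. Since Box-Cox requires strictly positive values and Fe contains zeros, we add 1 to it first.
library(caret)  # provides BoxCoxTrans() and preProcess()
BoxCoxTrans(Glass$Fe+1)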
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.057 1.100 1.510
##
## Largest/Smallest: 1.51
## Sample Skewness: 1.73
##
## Estimated Lambda: -2
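For reference, the Box-Cox transformation for a nonzero lambda is (x^lambda - 1) / lambda, and the fitted object can be applied with `predict()`. The sketch below compares the two; the names `x`, `bc`, `x_bc` and `manual` are illustrative. Note that plotting the raw power (Fe+1)^-2 gives the same shape mirrored, because dividing by a negative lambda reverses the order of the values.

```r
library(caret)
library(mlbench)
data(Glass)

x <- Glass$Fe + 1        # shift so that every value is strictly positive
bc <- BoxCoxTrans(x)     # estimates lambda from the data
x_bc <- predict(bc, x)   # applies the transformation with the estimated lambda

# Same transformation written out by hand (valid when lambda is not 0)
manual <- (x^bc$lambda - 1) / bc$lambda

# Check that the two versions agree
all.equal(x_bc, manual)
```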
# Before
p1<-Glass |>
ggplot(aes(Fe)) +
geom_histogram( color = "black", fill = "skyblue3", bins=25) +
ggtitle("Histogram of Fe before ")
#after
p2<-Glass |>
ggplot(aes((Fe+1)**-2)) +
geom_histogram(color = "black", fill = "skyblue3", bins=25) +
ggtitle("Histogram of Fe after Box-Cox")
p3<-Glass |>
ggplot(aes(Fe)) +
geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) +
theme(panel.grid = element_blank())
p4<-Glass |>
ggplot(aes((Fe+1)**-2)) +
geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) +
theme(panel.grid = element_blank())
library(ggpubr)  # ggarrange() is assumed to come from ggpubr
ggarrange(p1,p2,p3,p4)
That did not have the desired effect. Looking at other possibilities in caret, we find preProcess(df, method = c("center", "scale")). Let’s try that.
pP_V<-
preProcess(Glass, method=c("center", "scale"))
pP_Glass <- predict(pP_V, Glass)
pP_Glass |>
gather(element, value, -Type) |>
ggplot(aes(value)) +
geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) +
theme(panel.grid = element_blank())+
facet_wrap(~element, scales = "free")
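Nope, that did not really do it either. Centering and scaling are linear, so they change each predictor's location and spread but not its skewness or its outliers. Maybe we should focus on the outliers instead, using spatialSign.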
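As background, the spatial sign transformation divides each sample (row) by its Euclidean norm, which projects every sample onto the unit sphere and pulls extreme samples in toward the rest. It acts on all predictors jointly, so the per-predictor plots only show its effect indirectly. Below is a minimal base R sketch; it assumes the predictors are centered and scaled first, and the names `X` and `ss` are illustrative.

```r
library(mlbench)
data(Glass)

# Center and scale the nine numeric predictors (assumed preprocessing step)
X <- scale(Glass[, names(Glass) != "Type"])

# Divide every row by its Euclidean norm
ss <- X / sqrt(rowSums(X^2))

# After the transformation each row has length 1
range(rowSums(ss^2))
```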
pP_V2<-
preProcess(Glass, method="spatialSign")
pP_Glass2 <- predict(pP_V2, Glass)
pP_Glass2 |>
gather(element, value, -Type) |>
ggplot(aes(value)) +
geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) +
theme(panel.grid = element_blank())+
facet_wrap(~element, scales = "free")
Ah, that seemed to help for most of the predictors. Taking a closer look at Fe:
# Before
p5<-Glass |>
ggplot(aes(Fe)) +
geom_histogram( color = "black", fill = "skyblue3", bins=25) +
ggtitle("Histogram of Fe before ")
#after
p6<-pP_Glass2 |>
ggplot(aes(Fe)) +
geom_histogram(color = "black", fill = "skyblue3", bins=25) +
ggtitle("Histogram of Fe after spatial sign")
p7<-Glass |>
ggplot(aes(Fe)) +
geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) +
theme(panel.grid = element_blank())
p8<-pP_Glass2 |>
ggplot(aes(Fe)) +
geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) +
theme(panel.grid = element_blank())
ggarrange(p5,p6,p7,p8)
Yup, that did the trick!
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Soybean |>
gather(element, value, -Class) |>
ggplot(aes(value)) +
geom_bar(fill='aquamarine3') +
facet_wrap(~element)
Note that using facet_wrap without scales = "free" puts every panel on the same axes, so bars with small counts can be hard to read. However, this is a simple way to take a quick look.
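Degenerate (near-zero variance) distributions can also be flagged programmatically with caret's `nearZeroVar()`. This is a sketch using the function's default thresholds, which may or may not match the cutoffs one would choose by eye; the name `nzv` is illustrative.

```r
library(caret)
library(mlbench)
data(Soybean)

# saveMetrics = TRUE returns the frequency ratio and percent of unique values
# for every column, together with the zeroVar and nzv flags
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Columns flagged as near-zero variance under the default thresholds
nzv[nzv$nzv, ]
```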
Observations:
Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
missing<-as.data.frame(sapply(Soybean, function(x) sum(is.na(x))))
missing<-missing |>
rename_at(1,~"Missing") |>
mutate(Predictors=row.names(missing))
ggplot(missing,aes(Missing,Predictors)) +
geom_col(fill='aquamarine3')
As noted in (a), many predictors have missing values. sever, seed.tmt, lodging and hail have the most, but many others also have substantial numbers of missing values.
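The same information can be expressed as a share of rows rather than a raw count, which makes it easier to compare against a cutoff. This is a small base R sketch; `missing_share` is an illustrative name.

```r
library(mlbench)
data(Soybean)

# Fraction of rows that are NA in each column, largest first
missing_share <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
round(head(missing_share, 10), 3)
```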
Now we will look by Class.
mByClass <- Soybean |>
group_by(Class) |>
summarise_all(~sum(is.na(.)))
mByClass$totalNA <- rowSums(mByClass[, -1])
mByClass |>
filter(totalNA != 0) |>
ggplot(aes(totalNA,Class)) +
geom_col(fill='aquamarine3')
First we totaled the NAs across all predictors, by class. Then we filtered to only those classes with a nonzero totalNA. We can see that 5 classes have NAs and phytophthora-rot has the most.
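A complementary view is to count, per class, how many rows have at least one missing value, since a large NA total can come from either many incomplete rows or a few rows with many gaps. The sketch below uses `complete.cases()` from base R; `incomplete` and `by_class` are illustrative names.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing value
incomplete <- !complete.cases(Soybean)

# Rows per class, split into complete (FALSE) and incomplete (TRUE)
by_class <- table(Soybean$Class, incomplete)

# Show only the classes that contain incomplete rows
by_class[by_class[, "TRUE"] > 0, ]
```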
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
First, we need to understand the data better. The ?Soybean help page provides some explanation (“folklore seems to be that the last four classes are unjustified by the data since they have so few examples”), but that still leaves it unclear how best to proceed.
Even so, we could: