DPlunkett HW4

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

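library(tidyr)    # gather()
library(ggplot2)  # plotting

# Reshape to long format so each predictor gets its own histogram panel
Glass |>
  gather(element, value, -Type) |>
  ggplot(aes(value)) +
  geom_histogram(fill = 'skyblue3', bins = 25) +
  theme(panel.grid = element_blank()) +
  facet_wrap(~element, scales = "free")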

Histograms show the distribution of each predictor. From the plots above we see:

Al seems bimodal and possibly right skewed.
Ba and Fe have many zeros (is that an issue?), making it hard to see the rest of their distributions.
Ca seems right skewed, with possible outliers on the far right.
K and Mg both seem bimodal and have quite a few zeros.
K appears to have an outlier on the far right.
Na, RI and Si seem somewhat skewed and have long tails on both sides (possible outliers).
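The visual impressions above can be cross-checked numerically. The sketch below is not part of the original analysis: it computes a moment-based sample skewness and the number of exact zeros for each predictor. The helper name `skew` is hypothetical and written in base R to avoid an extra package.

```r
library(mlbench)
data(Glass)

# Moment-based sample skewness (hypothetical helper, not from a package)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

predictors <- Glass[, names(Glass) != "Type"]

# Positive values suggest a right skew, negative values a left skew
skewness_by_predictor <- sapply(predictors, skew)
round(sort(skewness_by_predictor), 2)

# Number of exact zeros per predictor
zero_counts <- colSums(predictors == 0)
zero_counts
```

Values near zero suggest a roughly symmetric predictor. Skewness does not detect bimodality, so the histograms are still needed for Al, K and Mg.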

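library(dplyr)     # select()
library(corrplot)  # corrplot.mixed()

# Correlation matrix of the nine numeric predictors
Glass |>
  select(-Type) |>
  cor() |>
  corrplot.mixed(tl.col = 'black')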

A correlation plot shows the pairwise relationships between the predictors.

The strongest positive correlation is between RI and Ca.
The next strongest positive correlation is between Ba and Al.
Al and K, as well as Ba and Na, are also positively correlated.
There are also some weaker positive correlations.
Among the negative correlations, Si and RI is the strongest, followed by Ba and Mg, Al and Mg, and then Ca and Mg.
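To read the ranking off the numbers rather than the plot, the correlation matrix can be flattened into a table of pairs and sorted by absolute correlation. This is a minimal sketch in base R; the names `pairs` and `ranked_pairs` are illustrative only.

```r
library(mlbench)
data(Glass)

cm <- cor(Glass[, names(Glass) != "Type"])

# Keep each pair once by blanking the diagonal and the upper triangle
cm[upper.tri(cm, diag = TRUE)] <- NA
pairs <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(pairs) <- c("var1", "var2", "r")
pairs <- pairs[!is.na(pairs$r), ]

# Rank by absolute correlation, strongest first
ranked_pairs <- pairs[order(-abs(pairs$r)), ]
head(ranked_pairs, 8)
```

The sign of `r` separates the positive pairs from the negative ones, so one table covers both lists above.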

(b)

Do there appear to be any outliers in the data? Are any predictors skewed?

Glass |>
  gather(element, value, -Type) |>
  ggplot(aes(value)) +
  geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) + 
  theme(panel.grid = element_blank())+
  facet_wrap(~element, scales = "free")

Box plots make it easy to visualize outliers, especially when they are drawn in a distinctive color. Here we also set an alpha value for outliers to see where many values overlap on the plot (see the deeper outlier color in Ba).

Although we mentioned skewness in the review of the histograms, we can also see it in box plots by looking at the median and whiskers. For example:

Fe is right skewed.
Mg is left skewed.
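The points drawn as outliers in a box plot follow the usual 1.5 × IQR rule, so they can also be counted directly. The sketch below uses `boxplot.stats()` from base R (grDevices); it is an added cross-check, not part of the original analysis.

```r
library(mlbench)
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# boxplot.stats()$out holds the points beyond 1.5 * IQR from the hinges
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
sort(outlier_counts, decreasing = TRUE)
```

For a predictor that is mostly zeros, the box collapses to a very narrow range, so many of its nonzero values get flagged. That is a property of the rule rather than proof that those samples are erroneous.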

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

First we will try Box-Cox on Fe. Since Box-Cox requires strictly positive values and Fe contains zeros, we add 1 to it first.

library(caret)  # BoxCoxTrans(), preProcess()
BoxCoxTrans(Glass$Fe + 1)

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.057   1.100   1.510 
## 
## Largest/Smallest: 1.51 
## Sample Skewness: 1.73 
## 
## Estimated Lambda: -2
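For reference, the Box-Cox family is (x^lambda - 1) / lambda for nonzero lambda and log(x) for lambda = 0. The sketch below applies it by hand with the lambda of -2 estimated above; `box_cox` is a hypothetical helper, not a caret function. It also checks what happens to the zeros in Fe.

```r
library(mlbench)
data(Glass)

# Box-Cox transformation written out by hand (hypothetical helper)
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # Box-Cox is only defined for positive values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

fe_shifted <- Glass$Fe + 1          # shift so that zeros become 1
fe_bc <- box_cox(fe_shifted, lambda = -2)

# Share of samples with Fe exactly zero
share_zero <- mean(Glass$Fe == 0)
share_zero

# All of those samples map to one single transformed value
n_values_for_zeros <- length(unique(fe_bc[Glass$Fe == 0]))
n_values_for_zeros
```

Because the transformation is monotone, every tied value stays tied. A large spike at zero therefore survives any choice of lambda.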

library(ggpubr)  # ggarrange()

# Histogram before the transformation
p1 <- Glass |>
  ggplot(aes(Fe)) +
  geom_histogram(color = "black", fill = "skyblue3", bins = 25) +
  ggtitle("Histogram of Fe before")
# Histogram after; (Fe + 1)^-2 is the power part of Box-Cox with lambda = -2
p2 <- Glass |>
  ggplot(aes((Fe + 1)^-2)) +
  geom_histogram(color = "black", fill = "skyblue3", bins = 25) +
  ggtitle("Histogram of Fe after Box-Cox")

# Box plot before the transformation
p3 <- Glass |>
  ggplot(aes(Fe)) +
  geom_boxplot(fill = 'skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45) +
  theme(panel.grid = element_blank())

# Box plot after the transformation
p4 <- Glass |>
  ggplot(aes((Fe + 1)^-2)) +
  geom_boxplot(fill = 'skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45) +
  theme(panel.grid = element_blank())

ggarrange(p1, p2, p3, p4)

That did not have the desired effect. Looking at other possibilities in caret, we find:
preProcess(df, method=c("center", "scale")). Let’s try that.

pP_V<- 
  preProcess(Glass, method=c("center", "scale"))
pP_Glass <- predict(pP_V, Glass)


pP_Glass |>
  gather(element, value, -Type) |>
  ggplot(aes(value)) +
  geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) + 
  theme(panel.grid = element_blank())+
  facet_wrap(~element, scales = "free")
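Centering and scaling are linear operations: they shift and stretch each predictor but cannot change the shape of its distribution. A quick check, assuming the same moment-based skewness as a measure of shape (`skew` is a hypothetical helper), shows the skewness of Fe is unchanged:

```r
library(mlbench)
data(Glass)

# Moment-based sample skewness (hypothetical helper)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# scale() centers to mean 0 and scales to standard deviation 1
fe_scaled <- as.numeric(scale(Glass$Fe))

skew_before <- skew(Glass$Fe)
skew_after <- skew(fe_scaled)
all.equal(skew_before, skew_after)
```

The same holds for the box plot outliers: a point beyond 1.5 × IQR before centering and scaling is still beyond it afterwards.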

Nope, that did not really do it either. Maybe focusing on the outliers will help, using spatialSign.
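As background, the spatial sign transformation divides each centered and scaled sample by its Euclidean norm, which projects every sample onto the unit sphere and pulls extreme samples toward the rest. The sketch below does this by hand in base R to show the idea. It is an illustration only and is not guaranteed to reproduce the exact values caret returns.

```r
library(mlbench)
data(Glass)

# Center and scale the predictors first, as the transformation assumes
x <- scale(as.matrix(Glass[, names(Glass) != "Type"]))

# Divide each row (sample) by its Euclidean norm
row_norm <- sqrt(rowSums(x^2))
x_ss <- x / row_norm  # vector recycling divides row i by row_norm[i]

# After the transformation every sample has norm 1
new_norms <- sqrt(rowSums(x_ss^2))
range(new_norms)
```

Because the transformation works on whole samples rather than single columns, the predictors must be treated as a group: removing a predictor afterwards changes the geometry.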

pP_V2<- 
  preProcess(Glass, method="spatialSign")
pP_Glass2 <- predict(pP_V2, Glass)


pP_Glass2 |>
  gather(element, value, -Type) |>
  ggplot(aes(value)) +
  geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) + 
  theme(panel.grid = element_blank())+
  facet_wrap(~element, scales = "free")

Ah, that seemed to help for most of the predictors. Taking a closer look at Fe:

# Before
p5<-Glass |>
  ggplot(aes(Fe)) +
  geom_histogram( color = "black", fill = "skyblue3", bins=25) +
  ggtitle("Histogram of Fe before ")
#after
p6<-pP_Glass2 |>
  ggplot(aes(Fe)) +
  geom_histogram(color = "black", fill = "skyblue3", bins=25) +
  ggtitle("Histogram of Fe after spatial sign")

p7<-Glass |>
ggplot(aes(Fe)) +
  geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) + 
  theme(panel.grid = element_blank())

p8<-pP_Glass2 |>
ggplot(aes(Fe)) +
  geom_boxplot(fill='skyblue3', outlier.color = 'salmon1', outlier.shape = 1, outlier.alpha = .45 ) + 
  theme(panel.grid = element_blank())

ggarrange(p5,p6,p7,p8)

Yup, that did the trick!

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Soybean |>
  gather(element, value, -Class) |>
  ggplot(aes(value)) +
  geom_bar(fill='aquamarine3') + 
  facet_wrap(~element)

Note that using facet_wrap without scales='free' means all panels share the same axes, so some small bars can be hard to read. However, this is a simple way to take a quick look.

Observations:

The first thing that stands out is the number of NAs: only 3 of the 35 predictors show no visible NA bar (area.dam, date and leaves).
There are also quite a few fields with a large number of zeros: leaf.malf, lodging and roots, just in the first column of plots.
And many fields are concentrated in a single (non-zero) value, for example: precip, leaves.
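A degenerate (near-zero variance) predictor can also be flagged numerically with two statistics: the ratio of the most common value's count to the second most common, and the percentage of unique values. The sketch below computes both in base R. The helper `freq_stats` and the cutoffs of 20 and 10% are assumptions based on the usual rule of thumb, not values taken from this analysis.

```r
library(mlbench)
data(Soybean)

# Frequency ratio and percent unique for one predictor (hypothetical helper)
freq_stats <- function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)  # table() drops NAs
  counts <- counts[counts > 0]                            # ignore unused levels
  ratio <- if (length(counts) > 1) counts[1] / counts[2] else Inf
  c(freq_ratio = ratio, pct_unique = 100 * length(counts) / length(x))
}

predictors <- Soybean[, names(Soybean) != "Class"]
stats <- t(sapply(predictors, freq_stats))

# Assumed cutoffs: frequency ratio above 20 and under 10% unique values
flagged <- rownames(stats)[stats[, "freq_ratio"] > 20 & stats[, "pct_unique"] < 10]
flagged
round(stats[order(-stats[, "freq_ratio"]), ][1:5, ], 2)
```

Since all of these predictors are categorical with few levels, the percent unique is small everywhere and the frequency ratio does most of the work.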

(b)

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

missing<-as.data.frame(sapply(Soybean, function(x) sum(is.na(x)))) 

missing<-missing |> 
  rename_at(1,~"Missing") |>
  mutate(Predictors=row.names(missing))

ggplot(missing,aes(Missing,Predictors)) +
  geom_col(fill='aquamarine3')

As noted in (a), many predictors have missing values. sever, seed.tmt, lodging and hail have the most, but many others also have a significant number of missing values.
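The same counts can be expressed as percentages, which are easier to compare with the figure quoted in the prompt. This sketch is an added check in base R: it computes the percentage missing per column and the percentage of rows with at least one NA. If the prompt's "roughly 18%" refers to incomplete rows, the second number should be close to it.

```r
library(mlbench)
data(Soybean)

# Percentage of missing values per column, largest first
pct_missing <- sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)
head(round(pct_missing, 1), 10)

# Percentage of rows (samples) with at least one missing value
incomplete_rows <- mean(!complete.cases(Soybean)) * 100
round(incomplete_rows, 1)
```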

Now we will look by Class.

mByClass <- Soybean |>
  group_by(Class) |>
  summarise_all(~sum(is.na(.)))

mByClass$totalNA = rowSums(mByClass[,-c(1)])
  

mByClass|>
  filter_at(vars(totalNA), all_vars((.) != 0))|>
ggplot(aes(totalNA,Class)) +
  geom_col(fill='aquamarine3')

First we totaled the NAs across all predictors, by class. Then we filtered to only the classes with a nonzero totalNA. We can see that 5 classes have NAs and that phytophthora-rot has the most.
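Raw NA totals depend on how many samples each class has, so it can help to look at the share of incomplete rows per class as well. The following is a minimal sketch in base R, not part of the original analysis; the name `by_class` is illustrative.

```r
library(mlbench)
data(Soybean)

# TRUE for every row that has at least one missing predictor
incomplete <- !complete.cases(Soybean)

# Class size and percentage of incomplete rows, both ordered by factor level
by_class <- data.frame(
  n = as.vector(table(Soybean$Class)),
  pct_incomplete = as.vector(100 * tapply(incomplete, Soybean$Class, mean)),
  row.names = levels(Soybean$Class)
)

# Only the classes that have any missing data
affected <- by_class[by_class$pct_incomplete > 0, ]
affected
```

A class with a small `n` but a high `pct_incomplete` points to missingness that is tied to the class itself rather than spread at random.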

(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

First, we need to understand the data better. ?Soybean provides some explanation (“folklore seems to be that the last four classes are unjustified by the data since they have so few examples”), but the right approach is still not obvious.

Even so, we could:

Listen to the ‘folklore’ and eliminate the rows belonging to the last 4 classes.
Try to impute the data, but only area.dam and date seem to have enough data to impute.
Look for correlation among the predictors and use it to reduce their number, especially those with NAs.
Use a modeling method that is tolerant of NAs, such as tree-based techniques.
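One way to combine two of the options above is to drop the predictors with the most missing data and fill the remaining gaps with each predictor's most frequent level. The sketch below is one possible illustration, not a recommended final strategy. The 15% cutoff and the helper `mode_impute` are assumptions made for this example.

```r
library(mlbench)
data(Soybean)

# Assumed cutoff: drop predictors with more than 15% missing values
max_missing <- 0.15
keep <- colMeans(is.na(Soybean)) <= max_missing
reduced <- Soybean[, keep]

# Replace NAs in a factor with its most frequent level (hypothetical helper)
mode_impute <- function(x) {
  if (!anyNA(x)) return(x)
  x[is.na(x)] <- names(which.max(table(x)))
  x
}

imputed <- reduced
imputed[] <- lapply(reduced, mode_impute)

# Predictors removed by the cutoff, and a check that no NAs remain
setdiff(names(Soybean), names(reduced))
anyNA(imputed)
```

Mode imputation ignores the class label. Since the missing values are concentrated in a few classes, as seen in (b), this could blur exactly the classes that are hardest to predict, so a class-aware or model-based imputation may be worth comparing.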

D Plunkett

2024-09-22
