DATA 624 Homework 4

Author

Henock Montcho

Published

May 9, 2025

I-) The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(ggplot2)
library(mlbench) 
data(Glass) 
str(Glass) 
'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Glass  |>
  ggplot(aes(x = Type)) +
  geom_bar(fill = "pink", colour = "black") +
  ggtitle("Distribution by Type of Glass") +
  xlab("Glass Type") +
  ylab("Count") +
  coord_flip() +
  labs(caption = "UCI Machine Learning Repository")
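  
To cross-check the bar chart numerically, the class counts can also be tabulated directly (a minimal sketch):

# Counts of each glass type (numeric counterpart of the bar chart above)
table(Glass$Type)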

Comment: Glass Types 1 and 2 are the dominant types in the data set. Let's now explore the distributions of the predictors.

library(DataExplorer)

Glass2 <- subset(Glass, select = c(-Type))

plot_histogram(Glass2)

Comment: RI, Na, Al, and Si have distributions that are fairly close to normal. Mg shows a left skew and possibly an outlier, while K, Ba, and Fe are right-skewed.

Let’s explore for missing data.

plot_missing(Glass)

Comment: No missing values are observed. However, the many values of exactly 0.00 could be missing values that were recorded as 0.00 (an assumption).
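  
To probe this assumption, the number of exact zeros in each predictor can be counted (a minimal sketch using the Glass2 data frame defined above):

# Count of exact-zero values per predictor
colSums(Glass2 == 0)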

Let’s explore the relationships between the predictors

library(corrplot)
Glass2  |>
  cor()  |>
  corrplot()

Comment: The correlation plot does not indicate many instances where variables are highly correlated with each other. The main exceptions are Ca, which is highly correlated with RI (coefficient greater than 0.81), followed by Ba, which has a mild correlation with Al. The strongest negative relationship is between Si and RI.
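  
The coefficients behind these observations can be extracted numerically. The sketch below lists each pairwise correlation with absolute value above 0.5 (the 0.5 cutoff is an arbitrary choice for illustration):

# Flatten the correlation matrix and keep the strongest pairs
corr_pairs <- as.data.frame(as.table(cor(Glass2)))
names(corr_pairs) <- c("Var1", "Var2", "Correlation")
corr_pairs <- subset(corr_pairs,
                     as.character(Var1) < as.character(Var2) &
                       abs(Correlation) > 0.5)
corr_pairs[order(-abs(corr_pairs$Correlation)), ]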

  (b) Do there appear to be any outliers in the data? Are any predictors skewed?
#Checking for outliers

boxplot(Glass2$Al, main = "Boxplot of Al", ylab = "Percentage")

boxplot(Glass2$Ba, main = "Boxplot of Ba", ylab = "Percentage")

boxplot(Glass2$Ca, main = "Boxplot of Ca", ylab = "Percentage")

boxplot(Glass2$Fe, main = "Boxplot of Fe", ylab = "Percentage")

boxplot(Glass2$K, main = "Boxplot of K", ylab = "Percentage")

boxplot(Glass2$Mg, main = "Boxplot of Mg", ylab = "Percentage")

boxplot(Glass2$Na, main = "Boxplot of Na", ylab = "Percentage")

boxplot(Glass2$RI, main = "Boxplot of RI", ylab = "Percentage")

boxplot(Glass2$Si, main = "Boxplot of Si", ylab = "Percentage")
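  
For reference, the nine calls above can be written more compactly as a loop over the columns of Glass2 (same plots, same labels):

# One boxplot per predictor
for (col in colnames(Glass2)) {
  boxplot(Glass2[[col]], main = paste("Boxplot of", col), ylab = "Percentage")
}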

#Checking for skewness
library(e1071)
skewness(Glass2$Al)
[1] 0.8946104
skewness(Glass2$Ba)
[1] 3.36868
skewness(Glass2$Ca)
[1] 2.018446
skewness(Glass2$Fe)
[1] 1.729811
skewness(Glass2$K)
[1] 6.460089
skewness(Glass2$Mg)
[1] -1.136452
skewness(Glass2$Na)
[1] 0.4478343
skewness(Glass2$RI)
[1] 1.602715
skewness(Glass2$Si)
[1] -0.7202392

Comment:

- The boxplots show outliers in all of the predictors except Mg.

- The histograms and skewness values show left skew (negative values: Mg, Si) and right skew (positive values substantially greater than 0: Ba, Fe, K).
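  
For reference, all nine skewness values can also be computed in a single call and sorted (a minimal sketch):

# Skewness of every predictor, most right-skewed first
sort(sapply(Glass2, e1071::skewness), decreasing = TRUE)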

  (c) Are there any relevant transformations of one or more predictors that might improve the classification model?
library(caret)
Glass_Al_Trans  <- BoxCoxTrans(Glass2$Al)
Glass_Al_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.290   1.190   1.360   1.445   1.630   3.500 

Largest/Smallest: 12.1 
Sample Skewness: 0.895 

Estimated Lambda: 0.5 
Glass_Ba_Trans  <- BoxCoxTrans(Glass2$Ba)
Glass_Ba_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   0.175   0.000   3.150 

Lambda could not be estimated; no transformation is applied
Glass_Ca_Trans  <- BoxCoxTrans(Glass2$Ca)
Glass_Ca_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.430   8.240   8.600   8.957   9.172  16.190 

Largest/Smallest: 2.98 
Sample Skewness: 2.02 

Estimated Lambda: -1.1 
Glass_Fe_Trans  <- BoxCoxTrans(Glass2$Fe)
Glass_Fe_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.00000 0.00000 0.05701 0.10000 0.51000 

Lambda could not be estimated; no transformation is applied
Glass_K_Trans  <- BoxCoxTrans(Glass2$K)
Glass_K_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 

Lambda could not be estimated; no transformation is applied
Glass_Mg_Trans  <- BoxCoxTrans(Glass2$Mg)
Glass_Mg_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.115   3.480   2.685   3.600   4.490 

Lambda could not be estimated; no transformation is applied
Glass_Na_Trans  <- BoxCoxTrans(Glass2$Na)
Glass_Na_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.73   12.91   13.30   13.41   13.82   17.38 

Largest/Smallest: 1.62 
Sample Skewness: 0.448 

Estimated Lambda: -0.1 
With fudge factor, Lambda = 0 will be used for transformations
Glass_RI_Trans  <- BoxCoxTrans(Glass2$RI)
Glass_RI_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.511   1.517   1.518   1.518   1.519   1.534 

Largest/Smallest: 1.02 
Sample Skewness: 1.6 

Estimated Lambda: -2 
Glass_Si_Trans  <- BoxCoxTrans(Glass2$Si)
Glass_Si_Trans
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  69.81   72.28   72.79   72.65   73.09   75.41 

Largest/Smallest: 1.08 
Sample Skewness: -0.72 

Estimated Lambda: 2 

Comments:

- Lambda for Al is 0.5, so a square-root transformation would be appropriate.

- Lambda for Ba could not be estimated (Ba contains zeros, and the Box-Cox procedure requires strictly positive values); no transformation is applied.

- Lambda for Ca is -1.1, so a reciprocal (1/x) transformation would be appropriate.

- Lambda for Fe could not be estimated (zeros present); no transformation is applied.

- Lambda for K could not be estimated (zeros present); no transformation is applied.

- Lambda for Mg could not be estimated (zeros present); no transformation is applied.

- Lambda for Na is -0.1; with the fudge factor, Lambda = 0 is used, so a log transformation would be appropriate.

- Lambda for RI is -2, so an inverse-square (1/x²) transformation would be appropriate.

- Lambda for Si is 2, so a squaring transformation would be appropriate.
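  
The stored BoxCoxTrans objects can be applied to the data with predict(), or the Box-Cox transformations for all predictors can be estimated and applied in one step with caret::preProcess. A minimal sketch (it simply re-applies the same estimation performed above, with centering and scaling added):

# Apply a single stored transformation
Glass_Al_transformed <- predict(Glass_Al_Trans, Glass2$Al)

# Estimate and apply Box-Cox, centering, and scaling for all predictors at once
glass_pp <- preProcess(Glass2, method = c("BoxCox", "center", "scale"))
Glass2_transformed <- predict(glass_pp, Glass2)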

II-) The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
  (a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Soybean  |>
  ggplot(aes(x = Class)) +
  geom_bar(fill = "pink", colour = "black") +
  ggtitle("Distribution by Type of Soybean") +
  xlab("Soybean Type") +
  ylab("Count") +
  coord_flip() +
  labs(caption = "UCI Machine Learning Repository")

Soybean2 <- subset(Soybean, select = c(-Class))

columns <- colnames(Soybean2)

# One bar chart per categorical predictor
lapply(columns, function(col) {
  ggplot(Soybean, aes(.data[[col]])) +
    geom_bar() +
    coord_flip() +
    ggtitle(col)
})
(A bar chart is displayed for each of the 35 categorical predictors.)

A degenerate variable is one with essentially no variability: it takes a single value (or nearly so) for all observations, so it behaves as a constant. From the plots of the Soybean predictors, no variable takes only a single value. However, several variables, such as mycelium and sclerotia, are dominated by one value, with the remaining levels occurring only rarely. These could be considered degenerate (near-zero-variance) predictors once the NAs are set aside. A quick check is sketched below.
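  
caret's nearZeroVar() function flags zero- and near-zero-variance predictors directly; a minimal sketch applied to the Soybean2 data frame created above:

# Flag predictors with zero or near-zero variance
nzv <- nearZeroVar(Soybean2, saveMetrics = TRUE)
nzv[nzv$nzv, ]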

  (b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing?
plot_missing(Soybean)

Overall, four predictors (lodging, seed.tmt, sever, and hail) have the highest percentage of missing data.
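  
The percentages behind the plot can also be computed directly; the sketch below lists the proportion of missing values per predictor, highest first:

# Proportion of missing values per column, highest first
sort(colMeans(is.na(Soybean)), decreasing = TRUE)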

Is the pattern of missing data related to the classes?

library(dplyr)

Soybean_Missing <- Soybean  |>
 group_by(Class)  |>
 mutate(Missing = n())  |>   # n() counts the observations in each class
 ungroup()  |>
 mutate(Total = n(), Proportion = Missing / Total)  |>
 distinct(Class, Missing, Proportion)  |>
 arrange(desc(Missing))
Soybean_Missing
# A tibble: 19 × 3
   Class                       Missing Proportion
   <fct>                         <int>      <dbl>
 1 brown-spot                       92     0.135 
 2 alternarialeaf-spot              91     0.133 
 3 frog-eye-leaf-spot               91     0.133 
 4 phytophthora-rot                 88     0.129 
 5 brown-stem-rot                   44     0.0644
 6 anthracnose                      44     0.0644
 7 diaporthe-stem-canker            20     0.0293
 8 charcoal-rot                     20     0.0293
 9 rhizoctonia-root-rot             20     0.0293
10 powdery-mildew                   20     0.0293
11 downy-mildew                     20     0.0293
12 bacterial-blight                 20     0.0293
13 bacterial-pustule                20     0.0293
14 purple-seed-stain                20     0.0293
15 phyllosticta-leaf-spot           20     0.0293
16 2-4-d-injury                     16     0.0234
17 diaporthe-pod-&-stem-blight      15     0.0220
18 cyst-nematode                    14     0.0205
19 herbicide-injury                  8     0.0117

Comment: brown-spot, alternarialeaf-spot, frog-eye-leaf-spot, and phytophthora-rot are the largest classes in the table above. Note, however, that the pipeline counts observations per class rather than missing cells, so a direct count of NA values within each class (sketched below) answers the question more precisely.
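  
A minimal sketch that counts the missing cells within each class directly:

# Number of missing cells per class, highest first
data.frame(Class = Soybean$Class,
           n_missing = rowSums(is.na(Soybean)))  |>
  group_by(Class)  |>
  summarise(Missing_cells = sum(n_missing))  |>
  arrange(desc(Missing_cells))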

  (c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Two techniques can be used to handle missing data: eliminating predictors or imputation. In our case, the Soybean data set is not particularly large, and the proportion of missing data per predictor (seen earlier) is not high enough to justify dropping predictors outright. Imputation, more precisely k-nearest-neighbors (KNN) imputation, is therefore the more appropriate choice.

library(VIM)

Soybean_Cleaned <- kNN(Soybean)
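  
Note that, by default, VIM::kNN() appends a logical indicator column for every imputed variable. These can be suppressed with imp_var = FALSE (or dropped afterwards) so that the cleaned data keeps only the original columns; a minimal sketch:

# Impute without the extra indicator columns
Soybean_Cleaned <- kNN(Soybean, imp_var = FALSE)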

Verification of the missing data handling:

plot_missing(Soybean_Cleaned)

Comment: No missing data remain.