Exercise 3.1

The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. Excersize 1

library(mlbench)
library(ggplot2)
library(reshape2)
library(dplyr)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

par(mfrow = c(3, 3)) 
for(i in 1:9) {
hist(Glass[, i], 
main = paste("Histogram of", names(Glass)[i]), 
xlab = names(Glass)[i], 
col = "lightblue")
}

The histograms reveal different patterns of concentration distributions for each element, with some elements showing symmetrical distributions and others exhibiting skewness.

# Boxplots for each variable
par(mfrow = c(3, 3))  # Reset plotting layout
for(i in 1:9) {
boxplot(Glass[, i], 
main = paste("Boxplot of", names(Glass)[i]), 
xlab = names(Glass)[i], 
col = "lightgreen")
}

pairs(Glass[, 1:9],
main = "Scatterplot Matrix of Predictors",
col = as.factor(Glass$type))

This matrix illustrates the relationships among different variables, which are represented along the axes: Fe, Ba, Ca, K, Si, Al, Mg, Na, and RI. Each cell in the matrix features a scatterplot that demonstrates the correlation between two variables. The values along the axes correspond to the measurements of these variables. This matrix is utilized to visualize correlations among multiple variables simultaneously. For instance, the cell at the intersection of “Fe” and “Ba” displays a scatterplot of Fe values plotted against Ba values, allowing for a visual assessment of their correlation. The empty cells indicate that the relevant scatterplots are not included in this matrix. Overall, the matrix is a significant tool for exploratory

cor_mat <- cor(Glass[, 1:9])
melted_cor <- melt(cor_mat)
ggplot(melted_cor, aes(Var1, Var2, fill = value)) + 
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
midpoint = 0, limit = c(-1, 1), space = "Lab", 
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, 
size = 12, hjust = 1)) +
coord_fixed() +
ggtitle("Correlation Heatmap of Predictors")

#Enhanced pairwise plot using GGally
library(GGally)
ggpairs(Glass, columns = 1:9, 
aes(color = type), data = Glass) + 
ggtitle("GGpairs: Pairwise Plot of Predictors")

The pairwise plot visualizes the relationships between all pairs of predictors in the glass dataset, showing correlations and distributions to aid in understanding data structure and potential multicollinearity.

b. Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, several predictors are skewed. RI, Mg, Al, Si, K, Ca, Ba, and Fe show significant skewness. Na shows mild skewness.

#Check skewness
Glass %>%
summarize(across(where(is.numeric), 
list(skewness = ~ moments::skewness(.x, na.rm = TRUE))))
##   RI_skewness Na_skewness Mg_skewness Al_skewness Si_skewness K_skewness
## 1    1.614015   0.4509917   -1.144465   0.9009179  -0.7253173   6.505636
##   Ca_skewness Ba_skewness Fe_skewness
## 1    2.032677    3.392431    1.742007

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

There are several transformations that can be applied to the predictors to improve the classification model, especially when dealing with skewed data. For positively skewed RI, K, Ca, Ba, and Fe logarithmic transformation log(X) or the square root sqrt(X) can be helpful. For negatively skewed Mg the inverse power transformation 1/X can help.

par(mfrow = c(3, 2), mar = c(4, 3, 3, 1) + 0.1)
trans_vars <- c("RI", "K", "Ca", "Ba", "Fe")
for(var in trans_vars) {
Glass[[paste0("log_", var)]] <- log(Glass[[var]])
Glass[[paste0("sqrt_", var)]] <- sqrt(Glass[[var]])
}


# Visualize the transformed variables 

for(var in trans_vars) {
 hist(Glass[[paste0("log_", var)]],
 main = paste("Histogram of log(", var, ")", sep = ""),
xlab = paste("log(", var, ")", sep = ""),
col = "lightgreen")

  # Histogram for square root transformation

hist(Glass[[paste0("sqrt_", var)]],
main = paste("Histogram of sqrt(", var, ")", sep = ""),
xlab = paste("sqrt(", var, ")", sep = ""),
col = "lightcoral")

}

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
predictor_names <- setdiff(names(Soybean), "Class")

for(name in predictor_names) {

  cat("\nFrequency for", name, ":\n")

  print(table(Soybean[[name]], useNA = "ifany"))

}
## 
## Frequency for date :
## 
##    0    1    2    3    4    5    6 <NA> 
##   26   75   93  118  131  149   90    1 
## 
## Frequency for plant.stand :
## 
##    0    1 <NA> 
##  354  293   36 
## 
## Frequency for precip :
## 
##    0    1    2 <NA> 
##   74  112  459   38 
## 
## Frequency for temp :
## 
##    0    1    2 <NA> 
##   80  374  199   30 
## 
## Frequency for hail :
## 
##    0    1 <NA> 
##  435  127  121 
## 
## Frequency for crop.hist :
## 
##    0    1    2    3 <NA> 
##   65  165  219  218   16 
## 
## Frequency for area.dam :
## 
##    0    1    2    3 <NA> 
##  123  227  145  187    1 
## 
## Frequency for sever :
## 
##    0    1    2 <NA> 
##  195  322   45  121 
## 
## Frequency for seed.tmt :
## 
##    0    1    2 <NA> 
##  305  222   35  121 
## 
## Frequency for germ :
## 
##    0    1    2 <NA> 
##  165  213  193  112 
## 
## Frequency for plant.growth :
## 
##    0    1 <NA> 
##  441  226   16 
## 
## Frequency for leaves :
## 
##   0   1 
##  77 606 
## 
## Frequency for leaf.halo :
## 
##    0    1    2 <NA> 
##  221   36  342   84 
## 
## Frequency for leaf.marg :
## 
##    0    1    2 <NA> 
##  357   21  221   84 
## 
## Frequency for leaf.size :
## 
##    0    1    2 <NA> 
##   51  327  221   84 
## 
## Frequency for leaf.shread :
## 
##    0    1 <NA> 
##  487   96  100 
## 
## Frequency for leaf.malf :
## 
##    0    1 <NA> 
##  554   45   84 
## 
## Frequency for leaf.mild :
## 
##    0    1    2 <NA> 
##  535   20   20  108 
## 
## Frequency for stem :
## 
##    0    1 <NA> 
##  296  371   16 
## 
## Frequency for lodging :
## 
##    0    1 <NA> 
##  520   42  121 
## 
## Frequency for stem.cankers :
## 
##    0    1    2    3 <NA> 
##  379   39   36  191   38 
## 
## Frequency for canker.lesion :
## 
##    0    1    2    3 <NA> 
##  320   83  177   65   38 
## 
## Frequency for fruiting.bodies :
## 
##    0    1 <NA> 
##  473  104  106 
## 
## Frequency for ext.decay :
## 
##    0    1    2 <NA> 
##  497  135   13   38 
## 
## Frequency for mycelium :
## 
##    0    1 <NA> 
##  639    6   38 
## 
## Frequency for int.discolor :
## 
##    0    1    2 <NA> 
##  581   44   20   38 
## 
## Frequency for sclerotia :
## 
##    0    1 <NA> 
##  625   20   38 
## 
## Frequency for fruit.pods :
## 
##    0    1    2    3 <NA> 
##  407  130   14   48   84 
## 
## Frequency for fruit.spots :
## 
##    0    1    2    4 <NA> 
##  345   75   57  100  106 
## 
## Frequency for seed :
## 
##    0    1 <NA> 
##  476  115   92 
## 
## Frequency for mold.growth :
## 
##    0    1 <NA> 
##  524   67   92 
## 
## Frequency for seed.discolor :
## 
##    0    1 <NA> 
##  513   64  106 
## 
## Frequency for seed.size :
## 
##    0    1 <NA> 
##  532   59   92 
## 
## Frequency for shriveling :
## 
##    0    1 <NA> 
##  539   38  106 
## 
## Frequency for roots :
## 
##    0    1    2 <NA> 
##  551   86   15   31
par(mfrow=c(2,2))

for(name in predictor_names) {  # Plotting the first 4 predictors for demonstration

  counts <- table(Soybean[[name]])

  barplot(counts, main=paste("Distribution of", name), col="skyblue")

}

b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

#Check the missing data 
sum(is.na(Soybean))
## [1] 2337
# Calculate missing data per variable

missing_counts <- sapply(Soybean, function(x) sum(is.na(x)))

missing_percentages <- missing_counts / nrow(Soybean)*100

print(missing_percentages)
##           Class            date     plant.stand          precip            temp 
##       0.0000000       0.1464129       5.2708638       5.5636896       4.3923865 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##      17.7159590       2.3426061       0.1464129      17.7159590      17.7159590 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##      16.3982430       2.3426061       0.0000000      12.2986823      12.2986823 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##      12.2986823      14.6412884      12.2986823      15.8125915       2.3426061 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##      17.7159590       5.5636896       5.5636896      15.5197657       5.5636896 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##       5.5636896       5.5636896       5.5636896      12.2986823      15.5197657 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##      13.4699854      13.4699854      15.5197657      13.4699854      15.5197657 
##           roots 
##       4.5387994

The highest likelihood of missing data - with over 17% of the data missing is in hail, sever, seed.tmt and lodging.

#Investigate missing data by class for 4 predictors (as an example)
example_predictor <- predictor_names[4]

print(table(Soybean$Class, is.na(Soybean[[example_predictor]])))
##                              
##                               FALSE TRUE
##   2-4-d-injury                    0   16
##   alternarialeaf-spot            91    0
##   anthracnose                    44    0
##   bacterial-blight               20    0
##   bacterial-pustule              20    0
##   brown-spot                     92    0
##   brown-stem-rot                 44    0
##   charcoal-rot                   20    0
##   cyst-nematode                   0   14
##   diaporthe-pod-&-stem-blight    15    0
##   diaporthe-stem-canker          20    0
##   downy-mildew                   20    0
##   frog-eye-leaf-spot             91    0
##   herbicide-injury                8    0
##   phyllosticta-leaf-spot         20    0
##   phytophthora-rot               88    0
##   powdery-mildew                 20    0
##   purple-seed-stain              20    0
##   rhizoctonia-root-rot           20    0

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

We can use simple approach of imputation using the mode

# Function to impute mode

impute_mode <- function(x) {
if(any(is.na(x))) {
 mode_val <- names(sort(table(x), decreasing = TRUE))[1]
x[is.na(x)] <- mode_val
}
return(x)

}
Soybean_imputed <- Soybean
for(name in predictor_names) {
 Soybean_imputed[[name]] <- impute_mode(Soybean_imputed[[name]])

}
#Check if missing data remains after imputation
sum(is.na(Soybean_imputed))
## [1] 0
par(mfrow=c(1,2))

barplot(table(Soybean[[example_predictor]], useNA = "ifany"), main = "Before Imputation", col="tomato")

barplot(table(Soybean_imputed[[example_predictor]], useNA = "ifany"), main = "After Imputation", col="seagreen")