library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Chapter 3 Exercise 1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

# This code provided by the textbook
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

ggplot(data = Glass, aes(x = RI)) + geom_histogram(binwidth = 0.0005)

ggplot(data = Glass, aes(x = Na)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = Mg)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = Al)) + geom_histogram(binwidth = 0.05)

ggplot(data = Glass, aes(x = Si)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = K)) + geom_histogram(binwidth = 0.01)

ggplot(data = Glass, aes(x = Ca)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = Ba)) + geom_histogram(binwidth = 0.1)

ggplot(data = Glass, aes(x = Fe)) + geom_histogram(binwidth = 0.01)
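
The nine histograms above can also be drawn as a single faceted plot by reshaping the predictors to long format. A minimal sketch using the already-loaded tidyr and ggplot2 (the bin count of 30 is an arbitrary choice):

# Pivot the nine numeric predictors to long format and facet by predictor
Glass %>%
  pivot_longer(RI:Fe, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")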

The relationships between the predictors can be explored using scatterplots. To determine the most interesting relationships to investigate, I’ll start by looking for correlation coefficients close to 1 or -1 between pairs of variables.

cor_table <- round(cor(Glass[,1:9]), 2)
print(cor_table)
##       RI    Na    Mg    Al    Si     K    Ca    Ba    Fe
## RI  1.00 -0.19 -0.12 -0.41 -0.54 -0.29  0.81  0.00  0.14
## Na -0.19  1.00 -0.27  0.16 -0.07 -0.27 -0.28  0.33 -0.24
## Mg -0.12 -0.27  1.00 -0.48 -0.17  0.01 -0.44 -0.49  0.08
## Al -0.41  0.16 -0.48  1.00 -0.01  0.33 -0.26  0.48 -0.07
## Si -0.54 -0.07 -0.17 -0.01  1.00 -0.19 -0.21 -0.10 -0.09
## K  -0.29 -0.27  0.01  0.33 -0.19  1.00 -0.32 -0.04 -0.01
## Ca  0.81 -0.28 -0.44 -0.26 -0.21 -0.32  1.00 -0.11  0.12
## Ba  0.00  0.33 -0.49  0.48 -0.10 -0.04 -0.11  1.00 -0.06
## Fe  0.14 -0.24  0.08 -0.07 -0.09 -0.01  0.12 -0.06  1.00

The only strong (|cor| > 0.7) correlation in the table is between RI and Ca (0.81). The scatterplot of these variables is shown below.

ggplot(Glass, aes(x = RI, y = Ca)) + geom_point()

The strongest remaining correlations are between RI and Si (-0.54), Mg and Al (-0.48), Mg and Ba (-0.49), and Al and Ba (0.48). It’s interesting that Mg is correlated with Al and Ba, which are also correlated with one another.

The scatterplots of these relationships are shown below.

ggplot(Glass, aes(x = RI, y = Si)) + geom_point()

ggplot(Glass, aes(x = Al, y = Mg)) + geom_point()

ggplot(Glass, aes(x = Ba, y = Mg)) + geom_point()

ggplot(Glass, aes(x = Al, y = Ba)) + geom_point()

It’s clear from these scatterplots that the moderate correlations among Al, Mg, and Ba are driven largely by the many 0 values in Mg and Ba, rather than by a strong relationship among the non-zero values. On the other hand, the scatterplot of RI and Si does show a noticeable inverse relationship.
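
One way to confirm this is to recompute a correlation using only the observations where both variables are non-zero; a quick sketch for the Mg and Ba pair:

# Correlation between Mg and Ba restricted to the non-zero observations
with(subset(Glass, Mg > 0 & Ba > 0), cor(Mg, Ba))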

(b)

Do there appear to be any outliers in the data? Are any predictors skewed?

Skew

RI, Na, Al, and Ca are right-skewed.

Si is left-skewed.

Mg is left-skewed and K is right-skewed, with the additional complication that each of these variables has quite a few 0 values.

Ba and Fe are mostly zeros with a small number of positive non-zero values.
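
These visual impressions can be checked numerically with a skewness statistic. A minimal sketch using the e1071 package (assumed to be installed); positive values indicate right skew, negative values left skew:

library(e1071)

# Sample skewness for each of the nine numeric predictors
apply(Glass[, 1:9], 2, skewness)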

Outliers

Outliers are visible on the box plots of the data.

ggplot(data = Glass, aes(x = RI)) + geom_boxplot()

ggplot(data = Glass, aes(x = Na)) + geom_boxplot()

ggplot(data = Glass, aes(x = Mg)) + geom_boxplot()

ggplot(data = Glass, aes(x = Al)) + geom_boxplot()

ggplot(data = Glass, aes(x = Si)) + geom_boxplot()

ggplot(data = Glass, aes(x = K)) + geom_boxplot()

ggplot(data = Glass, aes(x = Ca)) + geom_boxplot()

ggplot(data = Glass, aes(x = Ba)) + geom_boxplot()

ggplot(data = Glass, aes(x = Fe)) + geom_boxplot()

The box plots of every variable except Mg show at least some outliers. Some noteworthy takeaways (the numeric cutoffs are the boxplots’ Tukey fences, computed as sketched after this list):

  • RI, Na, Al, Si, and Ca have high and low outliers.
  • K, Ba, and Fe only have high outliers.
  • Any Si value that is not between 71 and 74 is labeled as an outlier.
  • Any K value greater than 1.25 is labeled as an outlier.
  • All non-zero Ba values are labeled as outliers.
  • All Fe values greater than 0.25 are labeled as outliers.
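
A sketch of how those fences (1.5 times the IQR beyond the quartiles, geom_boxplot()'s default) can be computed, using Si as an example:

# Lower and upper fences for Si: Q1 - 1.5*IQR and Q3 + 1.5*IQR
quantile(Glass$Si, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(Glass$Si)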

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

It might be appropriate to remove either RI or Ca to reduce collinearity. The remaining variables should be centered and scaled: subtract the mean of each variable from its values, then divide by that variable’s standard deviation. I would then consider applying the spatial sign transformation to the variables with outliers, especially those with both high and low outliers (RI, Na, Al, Si, and Ca). For the most skewed variables, a Box-Cox transformation could also help, although it must be applied before centering (it requires strictly positive values) and cannot be used on zero-heavy predictors such as Mg, K, Ba, and Fe; the Yeo-Johnson transformation is an alternative that accepts zeros.
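
These steps could be chained with caret’s preProcess(), which applies its transformations in a fixed internal order (transformation, then centering and scaling, then the spatial sign). A sketch, assuming the caret package is installed; Yeo-Johnson is substituted for Box-Cox here because several predictors contain zeros:

library(caret)

# Estimate the transformations from the nine predictors...
glass_pp <- preProcess(Glass[, 1:9],
                       method = c("YeoJohnson", "center", "scale", "spatialSign"))

# ...and apply them to produce the transformed predictor set
glass_transformed <- predict(glass_pp, Glass[, 1:9])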

Chapter 3 Exercise 2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

# This code provided by the textbook
library(mlbench)
data(Soybean)

(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

From page 45 in the textbook, the criteria for degenerate predictors are:

  • The fraction of unique values over the sample size is low (say 10 %).
  • The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).
# Get the number of observations (rows) in the data frame
print(nrow(Soybean))
## [1] 683

There are 683 observations in the dataset. Therefore, for a variable to meet the first criterion for a degenerate predictor, it must have 68 or fewer unique values.

# Create a list of columns and an empty list for the number of unique values in the column
col_list <- colnames(Soybean)
col_list <- col_list[-1]
unique_values <- c()

# Find the number of unique values in each column and append it to the list
for (i in 2:ncol(Soybean)) {
  unique_values <- append(unique_values, length(unique(Soybean[,i])))
}

# Create a data frame of the results and print them in descending order of the number of unique values
degeneracy_df <- data.frame(col_list, unique_values)
print(head(degeneracy_df %>% arrange(desc(unique_values))))
##        col_list unique_values
## 1          date             8
## 2     crop.hist             5
## 3      area.dam             5
## 4  stem.cankers             5
## 5 canker.lesion             5
## 6    fruit.pods             5

No column has more than 8 unique values, so every predictor satisfies the first criterion and has the potential to qualify as degenerate. To check the second criterion for degeneracy, it is necessary to determine the frequency of the two most prevalent values for each variable.

# Create empty lists for the frequency of the two most prevalent values for each variable
prev_1 <- c()
prev_2 <- c()

# Find the frequency of the two most prevalent values for each column and append them to the lists
for (i in 2:ncol(Soybean)) {
  sorted_prev <- sort(table(Soybean[,i]), decreasing = TRUE)
  prev_1 <- append(prev_1, sorted_prev[1])
  prev_2 <- append(prev_2, sorted_prev[2])
}

# Add the columns to the DataFrame and calculate the ratio of the frequencies of the two most prevalent values
degeneracy_df$prev_1 <- prev_1
degeneracy_df$prev_2 <- prev_2
degeneracy_df <- degeneracy_df %>% mutate(ratio = prev_1 / prev_2)

# Filter the data for ratios above 20 and print the results
print(degeneracy_df %>% filter(ratio >= 20) %>% arrange(desc(ratio)))
##    col_list unique_values prev_1 prev_2  ratio
## 1  mycelium             3    639      6 106.50
## 2 sclerotia             3    625     20  31.25
## 3 leaf.mild             4    535     20  26.75

The mycelium, sclerotia, and leaf.mild variables are degenerate according to the criteria provided.
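
As a cross-check, caret’s nearZeroVar() applies essentially the same two criteria. A sketch, assuming the caret package is installed (freqCut = 20 mirrors the textbook’s ratio threshold; caret’s own default is 95/5):

library(caret)

# saveMetrics = TRUE returns the frequency ratio, percent of unique values,
# and a near-zero-variance flag for every predictor
nearZeroVar(Soybean[, -1], freqCut = 20, uniqueCut = 10, saveMetrics = TRUE)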

(b)

(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The three variables previously identified as degenerate (mycelium, sclerotia, and leaf.mild) can be eliminated. From there, I would use imputation to fill in the remaining missing values, but since missing data are concentrated in only 5 of the classes, I would first introduce a new variable that counts the number of missing predictors for each observation. That indicator could prove as informative as the existing predictors. I’d also consider removing the variables with more than 15% of their values missing (shown below) to see whether that improves the model.
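
The missing_df data frame used below is assumed to have been built in part (b); a minimal reconstruction consistent with the printed columns:

# Count and proportion of missing values for each predictor (Class excluded)
missing_df <- data.frame(
  col_list = colnames(Soybean)[-1],
  num_missing_values = colSums(is.na(Soybean[, -1]))
) %>%
  mutate(prob_missing = num_missing_values / nrow(Soybean))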

print(missing_df %>% filter(prob_missing >= 0.15) %>% arrange(desc(prob_missing)))
##           col_list num_missing_values prob_missing
## 1             hail                121    0.1771596
## 2            sever                121    0.1771596
## 3         seed.tmt                121    0.1771596
## 4          lodging                121    0.1771596
## 5             germ                112    0.1639824
## 6        leaf.mild                108    0.1581259
## 7  fruiting.bodies                106    0.1551977
## 8      fruit.spots                106    0.1551977
## 9    seed.discolor                106    0.1551977
## 10      shriveling                106    0.1551977