The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
sum(rowSums(is.na(Glass)) > 0)
## [1] 0
There are no missing values, so we can go straight to plotting histograms for each numeric column.
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
plot_histograms <- function(df, bins = 30, ncol = 2) {
  # Build a histogram for each numeric predictor column (the response, Type, is a factor and is skipped)
  plots <- lapply(names(df), function(col) {
    if (is.numeric(df[[col]])) {
      ggplot(df, aes(x = .data[[col]])) +
        geom_histogram(fill = "blue", color = "black", bins = bins) +
        labs(title = paste("Histogram of", col), x = col, y = "Count") +
        theme_minimal()
    }
  })
  # Drop the NULL entries produced by non-numeric columns before arranging the grid
  plots <- Filter(Negate(is.null), plots)
  do.call(grid.arrange, c(plots, ncol = ncol))
}
plot_histograms(Glass, bins = 25, ncol = 3)
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
Glass %>%
select(where(is.numeric)) %>%
apply(., 2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(Glass)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Do there appear to be any outliers in the data? Are any predictors skewed?
Looking at the histograms, there do appear to be some outliers in the data. The RI column has a couple of values out near the 1.53 mark, the Na column has one value much higher than the rest, and the K column has an obvious outlier near 6. The Ba and Fe columns also contain a handful of extreme values, so the data set does include a few outliers.
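As a supplementary check (a minimal sketch using ggplot2 and tidyr, which are already loaded), per-predictor boxplots make these extreme values easier to see than the histograms alone:
# Reshape the numeric predictors to long format and draw one boxplot per predictor
Glass %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = "", y = value)) +
  geom_boxplot(outlier.colour = "red") +
  facet_wrap(~ predictor, scales = "free_y") +
  labs(x = NULL, y = "Value") +
  theme_minimal()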
Several columns also appear to be skewed. The RI, Al, Ca, Ba, Fe, K, and Na columns are all right skewed (most values are low, with a long tail of higher values), while the Mg and Si columns appear left skewed (most values are high, with a tail of lower values). The skewness statistics above confirm this: K (6.46), Ba (3.37), and Ca (2.02) are the most strongly right skewed.
Looking at the scatter plots, RI and Ca are strongly positively correlated (about 0.810), and the next strongest relationship is the negative correlation between Si and RI (about -0.542). A few other predictor pairs show moderate correlations in the 0.4 range, but those two pairs stand out. Since RI and Ca carry much of the same information, we could consider keeping only one of them in a model.
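To quantify this beyond the ggpairs panel, the correlation matrix can be computed directly (a quick sketch; the caret::findCorrelation call assumes the caret package is installed and simply suggests which member of each highly correlated pair could be dropped at a chosen cutoff):
# Correlation matrix of the numeric predictors
cor_mat <- Glass %>%
  select(where(is.numeric)) %>%
  cor()
round(cor_mat["RI", ], 3)
# Suggest predictors to drop when pairwise |correlation| exceeds 0.75
caret::findCorrelation(cor_mat, cutoff = 0.75)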
Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes, there are transformations that may improve the model. We could apply Box-Cox (or Yeo-Johnson) transformations to all of the predictors and let the data choose the optimal lambda values. Based on the distributions alone, I would apply a log-type transformation to the strongly right-skewed predictors (RI, Ba, K, Fe, Ca); note that Ba, K, and Fe contain zeros, so a plain log is undefined there and something like log(x + 1) or Yeo-Johnson is needed. For the strongly left-skewed Mg column an exponential or squaring transformation is more appropriate, and Si and Na are close enough to normal to leave untransformed.
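As a concrete sketch (assuming the caret package is installed; Yeo-Johnson is used here because Ba, K, and Fe contain zeros, which rules out a plain Box-Cox/log transform for those columns):
library(caret)
# Estimate Yeo-Johnson transformations for the nine predictors (column 10 is Type)
pp <- preProcess(Glass[, -10], method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, Glass[, -10])
# Skewness should move closer to zero after the transformation
apply(glass_trans, 2, skewness)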
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(mlbench)
data(Soybean)
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
Get info about the Soybean data set.
?Soybean
Function to plot frequency distributions for categorical variables
plot_cat_dists <- function(df, ncol = 6) {
  # Build a bar chart of level frequencies for each categorical predictor (Class is excluded)
  plots <- lapply(names(df), function(col) {
    if (is.factor(df[[col]]) && col != "Class") {
      ggplot(df, aes(x = .data[[col]])) +
        geom_bar() +
        labs(title = col, x = col, y = "Count") +
        theme_minimal()
    }
  })
  # Drop the NULL entry produced by the Class column before arranging the grid
  plots <- Filter(Negate(is.null), plots)
  do.call(grid.arrange, c(plots, ncol = ncol))
}
Plot the frequency distributions for categorical variables.
plot_cat_dists(Soybean)
Looking at the output above, a few variables take on primarily one value, making them degenerate. The variables “mycelium”, “sclerotia”, and “shriveling” are clearly degenerate, and “leaf.mild”, “leaf.malf”, “leaves”, “lodging”, “seed.size”, “seed.discolor”, and “mold.growth” are somewhat degenerate, with most samples falling in a single level; this can be checked with caret::nearZeroVar, as sketched below.
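A sketch of that check (assuming the caret package is installed); nearZeroVar flags predictors whose dominant-level frequency ratio and percentage of unique values cross its default thresholds:
library(caret)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
# Show only the predictors flagged as near-zero variance
nzv[nzv$nzv, ]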
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Get the percentage of missing data for each column.
missing_data <- colSums(is.na(Soybean)) / nrow(Soybean) * 100
missing_data_sorted <- sort(missing_data[missing_data > 0], decreasing = TRUE)
missing_data_sorted
## hail sever seed.tmt lodging germ
## 17.7159590 17.7159590 17.7159590 17.7159590 16.3982430
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 15.8125915 15.5197657 15.5197657 15.5197657 15.5197657
## leaf.shread seed mold.growth seed.size leaf.halo
## 14.6412884 13.4699854 13.4699854 13.4699854 12.2986823
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 12.2986823 12.2986823 12.2986823 12.2986823 5.5636896
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## sclerotia plant.stand roots temp crop.hist
## 5.5636896 5.2708638 4.5387994 4.3923865 2.3426061
## plant.growth stem date area.dam
## 2.3426061 2.3426061 0.1464129 0.1464129
From the output above we can see that the four most incomplete predictors (hail, sever, seed.tmt, and lodging) are missing at exactly the same rate, about 17.7%. It is very unlikely that four columns would share the exact same amount of missing data by chance, which suggests the missing values occur together in the same rows and points to an underlying pattern, most plausibly one tied to particular classes rather than random missingness.
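One way to check that directly (a sketch using the already-loaded dplyr; no results are assumed here) is to count, per class, how many rows contain any missing value:
# For each class, count rows and how many of them have at least one NA
Soybean %>%
  mutate(n_missing = rowSums(is.na(across(-Class)))) %>%
  group_by(Class) %>%
  summarise(rows = n(),
            rows_with_na = sum(n_missing > 0)) %>%
  arrange(desc(rows_with_na))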
My strategy for handling the missing data would be to remove the four predictors with the same ~17.7% missing rate; too much of each is missing for imputation to be reliable. For the remaining predictors I would impute the missing values, for example with a KNN-style imputer, and I would also check whether relationships between predictors could support a simple model-based imputation wherever KNN performs poorly. I would be cautious with predictors that are more than about 15% missing, since imputed values there are less trustworthy, but it is still worth attempting to fill them in with one of those two approaches.
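A rough sketch of that plan (assuming the mice package is installed; a factor-aware imputer is used here because KNN imputation in caret operates on numeric data, while the Soybean predictors are factors):
library(mice)
# Drop the four predictors that are ~18% missing, then impute the rest;
# mice picks an imputation method per column (e.g., logistic/polytomous regression for factors)
soy_reduced <- Soybean %>% select(-hail, -sever, -seed.tmt, -lodging)
imp <- mice(soy_reduced, m = 1, maxit = 5, seed = 123, printFlag = FALSE)
soy_complete <- complete(imp)
sum(is.na(soy_complete))  # expect this to be 0 (or close to it) after imputation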