2.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
par(mfrow = c(3, 3))
for (col in names(Glass)) {
  if (is.numeric(Glass[[col]])) {  # plot only the numeric predictors
    hist(Glass[[col]], main = col, xlab = "Value", col = "skyblue", border = "white")
  }
}
From the histograms above, some predictors appear approximately normally distributed, while others are left- or right-skewed; the exception is Mg, which looks bimodal (a "saddle"-shaped distribution). Let's look at the correlation among the variables.
library(corrplot)
val <- cor(Glass[,c(1:9)])
corrplot(val)
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
library(magrittr)
library(e1071)
library(dplyr)
Glass %>%
  select_if(is.numeric) %>%
  sapply(skewness) %>%
  round(4)
## RI Na Mg Al Si K Ca Ba Fe
## 1.6027 0.4478 -1.1365 0.8946 -0.7202 6.4601 2.0184 3.3687 1.7298
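The skewness values confirm that several predictors (notably K, Ba, Ca, and RI) are strongly right-skewed, while Mg and Si are left-skewed. For the outlier part of the question, boxplots give a quick visual check; a minimal sketch using the same base-graphics approach as the histograms above:
# Boxplots of each numeric predictor to look for extreme values
par(mfrow = c(3, 3))
for (col in names(Glass)[1:9]) {
  boxplot(Glass[[col]], main = col, horizontal = TRUE, col = "skyblue")
}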
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Ba, Fe, and K are strongly right-skewed, with most observations concentrated at low values, so they may benefit from a log transformation (with an offset such as log(x + 1), since all three contain zeros). Mg is left-skewed and could also be transformed. Box-Cox can suggest an optimal lambda for each predictor; the table below shows these values, and a sketch applying transformations follows the table. RI can be inverse-squared (lambda = -2), Si squared (lambda = 2), and Al square-rooted (lambda = 0.5), while Ca's lambda of about -1 corresponds to an inverse. BoxCoxTrans() returns NA for Mg, K, Ba, and Fe because they contain zeros, so a Yeo-Johnson transformation is an alternative for those predictors. It would also be interesting to see how the model performs without Ca, since it is correlated with several other predictors.
library(caret)
## Loading required package: ggplot2
Glass %>%
  select_if(is.numeric) %>%
  summarise_all(~ BoxCoxTrans(.)$lambda)
## RI Na Mg Al Si K Ca Ba Fe
## 1 -2 -0.1 NA 0.5 2 NA -1.1 NA NA
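Since Mg, K, Ba, and Fe contain zeros, Box-Cox cannot be applied to them directly. As a rough sketch (an assumption, not part of the original analysis), caret's preProcess() with the Yeo-Johnson family can transform all nine predictors in one step:
# Yeo-Johnson handles zero values, unlike Box-Cox; centering/scaling added for modeling
pp <- preProcess(Glass[, 1:9], method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, Glass[, 1:9])
round(sapply(glass_trans, skewness), 4)  # skewness after transformation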
3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
par(mfrow = c(3, 3))
# Loop over every column of Soybean
for (col in names(Soybean)) {
  if (is.factor(Soybean[[col]])) {
    # Categorical predictors: frequency distribution as a barplot
    barplot(table(Soybean[[col]]),
            main = paste("Frequency Distribution of", col),
            xlab = "Categories", ylab = "Frequency",
            col = "skyblue", border = "white")
  } else if (is.numeric(Soybean[[col]])) {
    # Numeric columns (none in Soybean, but kept for completeness): histogram
    hist(Soybean[[col]], main = paste("Histogram of", col),
         xlab = col, ylab = "Frequency")
  }
}
Degenerate distributions are ones that take on essentially a single value (near-zero variance). mycelium and sclerotia appear to be degenerate, and leaf.mild and leaf.malf are also nearly one-sided once the missing values are discounted.
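This can be checked with the near-zero-variance diagnostic discussed in the chapter; a minimal sketch using caret::nearZeroVar() (caret was loaded above):
# Flag predictors whose distributions are (near-)degenerate
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]  # predictors with near-zero variance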
(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
library(tidyr)
Soybean %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values",
       x = "Proportion") +
  scale_fill_manual(values = c("lightgray", "skyblue"))
Soybean %>%
  group_by(Class) %>%
  mutate(class_Total = n()) %>%
  ungroup() %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  mutate(Missing = n(),
         Proportion = Missing / class_Total) %>%
  ungroup() %>%
  distinct(Class, Proportion)
## # A tibble: 5 × 2
## Class Proportion
## <fct> <dbl>
## 1 phytophthora-rot 0.773
## 2 diaporthe-pod-&-stem-blight 1
## 3 cyst-nematode 1
## 4 2-4-d-injury 1
## 5 herbicide-injury 1
Soybean %>%
  filter(!Class %in% c("phytophthora-rot", "diaporthe-pod-&-stem-blight",
                       "cyst-nematode", "2-4-d-injury", "herbicide-injury")) %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values with Missing Classes Removed",
       x = "Proportion") +
  scale_fill_manual(values = c("skyblue", "red"))
There does appear to be a pattern: the missing values are concentrated in a handful of classes (all of the records for diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, and herbicide-injury are incomplete, as are about 77% of the phytophthora-rot records). Once those five classes are removed, no missing data remain.
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
One strategy would be to remove those five classes from the data entirely. Alternatively, the data can be split by class, handling those five classes separately, and the predictors with missing values can be imputed using k-nearest neighbors (KNN). Any predictor that is entirely missing within one of those class subsets can simply be dropped from that subset. A sketch of both options follows.
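A minimal sketch of the two options, assuming the VIM package is available (caret's knnImpute expects numeric predictors, whereas VIM::kNN() accepts factors directly, so it is used here as a substitute):
library(VIM)
# Option 1: drop the five classes whose records carry almost all of the missingness
missing_classes <- c("phytophthora-rot", "diaporthe-pod-&-stem-blight",
                     "cyst-nematode", "2-4-d-injury", "herbicide-injury")
soy_reduced <- Soybean %>% filter(!Class %in% missing_classes)
# Option 2: keep every class and impute the missing factor levels with KNN
# (imp_var = FALSE suppresses the extra indicator columns that kNN() adds)
soy_imputed <- kNN(Soybean, k = 5, imp_var = FALSE)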