Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with a .pdf of your run code.
Exercise 3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via library(mlbench) and data(Glass); the structure of the data frame is shown below.
# Load the packages used throughout this exercise
library(mlbench)   # Glass data set
library(ggplot2)   # plotting
library(reshape2)  # melt() for reshaping to long format

data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
?Glass
# Density plot of the refractive index (RI) by glass type
ggplot(Glass, aes(x = RI, fill = Type)) +
geom_density(alpha = 0.6) +
theme_minimal() +
labs(title = "Density Plot of Refractive Index (RI) by Glass Type", x = "Refractive Index (RI)", y = "Density") +
scale_fill_brewer(palette = "Dark2")
# Correlation matrix heatmap
cor_matrix <- cor(Glass[,1:9]) # Correlation of numeric variables
melted_cor <- melt(cor_matrix)
ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
theme_minimal() +
labs(title = "Heatmap of Correlation Matrix - Glass Dataset", x = "", y = "") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Change Glass data to long format for easier plotting
glass_melted <- melt(Glass, id.vars = "Type")
# Create boxplot
ggplot(glass_melted, aes(x = variable, y = value)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
theme_minimal() +
labs(title = "Boxplot of Glass Dataset Variables",
x = "Variables", y = "Values") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The density and box plots show that almost all of the predictors are skewed, some to the right and some to the left. The box plots also show that every predictor except Mg has outliers, with Na, Ca, K, and Si having the most.
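As a quick numeric check of the skewness visible in the plots, the sample skewness of each predictor can be computed (a minimal sketch, assuming the e1071 package is installed):

# Sample skewness of each numeric predictor
# (positive = right-skewed, negative = left-skewed)
library(e1071)
apply(Glass[, 1:9], 2, skewness)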
Yes, since there are outliers and skewness in the data, transformations could improve a classification model. A spatial sign transformation (applied after centering and scaling) is appropriate for dampening the effect of outliers, and a Box-Cox transformation can help correct the skewness.
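A minimal sketch of how these transformations could be applied with caret's preProcess() (assuming the caret package is installed; Box-Cox is only estimated for strictly positive predictors, so the elements containing zeros are skipped by that step):

library(caret)
# Estimate Box-Cox, centering/scaling, and spatial sign transformations
pp <- preProcess(Glass[, 1:9],
                 method = c("BoxCox", "center", "scale", "spatialSign"))
glass_trans <- predict(pp, Glass[, 1:9])
summary(glass_trans)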
Exercise 3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via library(mlbench) and data(Soybean); see ?Soybean for details.
# Load the packages used in this exercise
library(mlbench)   # Soybean data set
library(ggplot2)   # plotting
library(ggmosaic)  # geom_mosaic()
library(dplyr)     # group_by()/summarise()

data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
?Soybean
# Stacked bar plot showing Class vs Precipitation
ggplot(Soybean, aes(x = Class, fill = precip)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(title = "Stacked Bar Plot of Class vs Precipitation", x = "Soybean Class", y = "Proportion") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_brewer(palette = "Set2")
# Boxplot of Germination by Plant Growth
ggplot(Soybean, aes(x = plant.growth, y = as.numeric(germ), fill = plant.growth)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot of Germination Rate by Plant Growth", x = "Plant Growth", y = "Germination Rate") +
scale_fill_brewer(palette = "Pastel1")
## Warning: Removed 112 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Mosaic plot between plant.stand and precipitation
ggplot(data = Soybean) +
geom_mosaic(aes(weight = 1, x = product(plant.stand), fill = precip)) +
theme_minimal() +
labs(title = "Mosaic Plot of Plant Stand vs Precipitation", x = "Plant Stand", y = "Precipitation")
# Bar plot for the 'Class' variable in Soybean dataset
ggplot(Soybean, aes(x = Class)) +
geom_bar(fill = "steelblue") +
theme_minimal() +
labs(title = "Distribution of Soybean Disease Classes", x = "Class", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Stacked bar plot for crop.hist and area.dam
ggplot(Soybean, aes(x = crop.hist, fill = area.dam)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set3") +
labs(title = "Area Damage by Crop History", x = "Crop History", y = "Proportion") +
theme_minimal()
# Summarize the count of each combination of seed.tmt and sever
heatmap_data <- Soybean %>%
group_by(seed.tmt, sever) %>%
summarise(count = n(), .groups = "drop")
# Create heatmap
ggplot(heatmap_data, aes(x = seed.tmt, y = sever, fill = count)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Heatmap of Seed Treatment vs Disease Severity",
x = "Seed Treatment", y = "Disease Severity", fill = "Count") +
theme_minimal()
From these exploratory visualizations we can see that many predictors are entirely missing for the 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury classes, and that the phytophthora-rot class has a high rate of missing data across many predictors. The missingness therefore appears to be related to the class rather than occurring at random.
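To back this pattern up numerically, the overall proportion of missing predictor cells can be tabulated by class (a base-R sketch):

# Proportion of missing predictor cells within each class
miss_by_class <- sapply(split(Soybean[, -1], Soybean$Class),
                        function(d) mean(is.na(d)))
sort(round(miss_by_class, 3), decreasing = TRUE)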
# Check the proportion of missing values in each predictor
missing_data_summary <- colSums(is.na(Soybean)) / nrow(Soybean)
missing_data_summary <- data.frame(Predictor = names(missing_data_summary),
Missing_Proportion = missing_data_summary)
# Filter predictors with missing values
missing_data_summary <- missing_data_summary[missing_data_summary$Missing_Proportion > 0, ]
# Visualize the missing data proportions
ggplot(missing_data_summary, aes(x = reorder(Predictor, -Missing_Proportion),
y = Missing_Proportion)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = "Proportion of Missing Values for Predictors", x = "Predictor",
y = "Proportion of Missing Values") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Investigate the relationship between missing data and Class
missing_class_summary <- Soybean %>%
group_by(Class) %>%
summarise(Missing_Count = sum(is.na(germ)), .groups = "drop")
# Visualize missing counts by class
ggplot(missing_class_summary, aes(x = Class, y = Missing_Count)) +
geom_bar(stat = "identity", fill = "orange") +
theme_minimal() +
labs(title = "Missing Germination Counts by Soybean Class", x = "Class",
y = "Count of Missing Germination Data") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
One approach for handling the missing data is imputation, for example replacing each missing value with the most frequent level of that predictor. However, I am unsure how much this would help, given that for some classes nearly 100% of a predictor's values would need to be imputed. Alternatively, we can eliminate the classes (or predictors) associated with the high rates of missing values and see how that impacts the model.
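A minimal sketch of the mode (most-frequent-level) imputation described above, using base R only, with dropping the heavily affected classes shown as an alternative:

# Replace NAs in a factor with its most frequent observed level
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
soy_imputed <- Soybean
soy_imputed[, -1] <- lapply(Soybean[, -1], impute_mode)
colSums(is.na(soy_imputed))  # should now be all zero

# Alternative: drop the classes with near-total missingness before modeling
drop_classes <- c("2-4-d-injury", "cyst-nematode",
                  "diaporthe-pod-&-stem-blight", "herbicide-injury")
soy_subset <- droplevels(subset(Soybean, !Class %in% drop_classes))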