(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

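# Load the data and the packages used in this exercise
library(mlbench)
library(tidyverse)
library(corrplot)
data(Glass)

# Create individual histograms for each predictor with density line
# (after_stat() and linewidth replace the deprecated ..density.. and size arguments)
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "black", alpha = 0.7) + 
  geom_density(color = "red", linewidth = 1) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors") +
  theme_minimal()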

# Compute the correlation matrix for numerical variables 
cor_matrix <- cor(Glass[, sapply(Glass, is.numeric)])

# Generate the correlation plot
corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.cex = 0.8, 
         addCoef.col = "black", number.cex = 0.7, 
         col = colorRampPalette(c("blue", "white", "red"))(200),
         title = "Correlation Plot of Numerical Predictors", 
         mar = c(0, 0, 2, 0))

# Create the boxplot for numerical variables 
Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Observations:

Histogram: Na and Ca are fairly normally distributed with a slight skew. Ba and Fe have most values concentrated near zero, suggesting these elements are absent or present only in trace amounts in most samples.

Correlation: Ca and RI show a strong positive correlation: as calcium content increases, the refractive index tends to increase. Mg and Na show only a weak negative correlation, suggesting little linear association between these two elements in this dataset.
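To read the strongest relationships off the correlation matrix without scanning the plot, the pairs can be ranked by absolute correlation. This is a minimal sketch in base R; the names idx and cor_pairs are my own and not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Correlation matrix of the numeric predictors (same calculation as cor_matrix above)
cor_matrix <- cor(Glass[, sapply(Glass, is.numeric)])

# Keep each pair once by taking the upper triangle of the matrix
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(abs(cor_pairs$r), decreasing = TRUE), ]
print(head(cor_pairs, 5), row.names = FALSE)
```

Sorting on the absolute value keeps strong negative correlations near the top as well as strong positive ones.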

Boxplot: These boxplots pool all glass types, so they show the overall spread and outliers of each predictor rather than differences between types. Ba has a compact box near zero, reflecting that most samples contain very little of it. Checking whether predictors such as RI and Ca differ between glass types requires grouping the boxplots by Type.
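The boxplot code above drops the Type column, so differences between glass types are not visible in it. A hedged sketch of boxplots grouped by type follows; it assumes the tidyr and ggplot2 packages are installed, and glass_long and p_by_type are names I introduce here for illustration.

```r
library(mlbench)
library(tidyr)
library(ggplot2)
data(Glass)

# Reshape to long format, keeping Type as the grouping column
glass_long <- pivot_longer(Glass, cols = -Type,
                           names_to = "key", values_to = "value")

# One panel per predictor, one box per glass type
p_by_type <- ggplot(glass_long, aes(x = Type, y = value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = "free_y") +
  ggtitle("Boxplots of Numerical Predictors by Glass Type")

print(p_by_type)
```

Free y-axis scales are used because the predictors are measured on very different ranges.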

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Yes. Outliers are visible in 8 of the 9 predictors: RI, Na, Al, Si, K, Ca, Ba, and Fe. The only predictor without visible outliers is Mg.
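The visual impression from the boxplots can be cross-checked by counting the points a boxplot would flag. This sketch uses base R's boxplot.stats(), which marks values more than 1.5 times the interquartile range beyond the hinges; ggplot2 computes its quartiles slightly differently, so the counts may not match the plotted points exactly. The name outlier_counts is my own.

```r
library(mlbench)
data(Glass)

glass_numeric <- Glass[, sapply(Glass, is.numeric)]

# Number of points flagged by the 1.5 * IQR rule for each predictor
outlier_counts <- sapply(glass_numeric,
                         function(x) length(boxplot.stats(x)$out))
print(outlier_counts)
```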

Skewness:

Refractive Index (RI): Slight right-skewness (most values around a central point, slight extension to the right).

Sodium (Na): Right-skewed (values concentrated around 13-15, tail toward higher values).

Magnesium (Mg): Left-skewed (most values are higher, with a tail extending toward lower values, indicating that a few samples have lower magnesium content).

Aluminum (Al): Right-skewed (values between 1 and 2, with a long tail toward higher values).

Silicon (Si): Left-skewed (majority of values are high, with a tail extending to lower values).

Potassium (K): Strong right-skewness (values near zero, with a long tail toward higher values).

Calcium (Ca): Slight right-skewness (fairly uniform, slight skew toward higher values).

Barium (Ba): Strong right-skewness (most values near zero, long tail toward higher values).

Iron (Fe): Strong right-skewness (majority of values near zero, with a tail toward higher values).

Left-skewed predictors: Si (Silicon) and Mg (Magnesium) are left-skewed, meaning their distributions have longer tails toward lower values, while most of the data is concentrated on the higher end.

Right-skewed predictors: Na, Al, K, Ba, and Fe show right-skewness, with a long tail extending toward higher values.

Slight right-skewness: RI and Ca show only slight skewness, with a relatively balanced distribution.
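The direction and rough size of the skewness can also be checked numerically instead of by eye. The sketch below uses a simple moment-based skewness statistic written in base R; skewness_simple is a helper I define here, not a function from any package, and other packages use slightly different formulas. Negative values indicate left skew and positive values indicate right skew.

```r
library(mlbench)
data(Glass)

# Sample skewness: third central moment divided by the
# second central moment raised to the power 3/2
skewness_simple <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

glass_numeric <- Glass[, sapply(Glass, is.numeric)]

# Skewness of every predictor, sorted from most left-skewed to most right-skewed
skew_values <- sort(sapply(glass_numeric, skewness_simple))
print(round(skew_values, 2))
```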

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes. Transforming the skewed predictors can make their distributions more symmetric and lessen the influence of outliers, which may improve a classification model. The relevant transformations for the Glass dataset, based on the skewness and outliers noted above, are summarized here.

Right-skewed distributions: Log, square root, or Box-Cox transformations are commonly used to reduce right skewness. Variables that contain zeros need an offset, for example log(x + 1), because log(0) is undefined.

Left-skewed distributions: Reflect the variable first and then apply a log or square root, for example log(max(x) + 1 - x). Simply negating the log, -log(x), does not reduce left skewness, and it produces infinite values wherever x is zero.

Outliers: Transformations can reduce the impact of outliers by compressing the extreme values.

Log transformations are recommended for right-skewed variables (Na, Al, K, Ba, Fe), with an offset of 1 for K, Ba, and Fe because they contain zeros. Reflected log transformations can help with left-skewed variables like Mg and Si. For slightly skewed variables like Ca and RI, a square root transformation is a milder option.

Example of Applying the Transformations:

# Step 1: Identify skewness (we assume variables are already identified as skewed)

# Step 2: Apply transformations based on the skewness of the predictors
Glass_transformed <- Glass %>%
  mutate(
    # Log transformations for right-skewed variables
    Na_log = log(Na),
    Al_log = log(Al),
    K_log = log(K + 1),    # Adding 1 to avoid log(0); K contains zeros
    Ba_log = log(Ba + 1),  # Adding 1 to avoid log(0)
    Fe_log = log(Fe + 1),  # Adding 1 to avoid log(0)
    
    # Reflected log transformations for left-skewed variables:
    # reflect around the maximum so the long tail points right, then take the log
    # (note that this reverses the ordering of the values)
    Mg_rlog = log(max(Mg) + 1 - Mg),
    Si_rlog = log(max(Si) + 1 - Si),
    
    # Square root transformations for slightly skewed variables
    Ca_sqrt = sqrt(Ca),
    RI_sqrt = sqrt(RI)
  )

# Step 3: Check histograms before and after transformation

# Gather data for visualization before and after transformations
Glass_long <- Glass_transformed %>%
  select(Na, Na_log, Al, Al_log, K, K_log, Ba, Ba_log, Fe, Fe_log, Mg, Mg_rlog, Si, Si_rlog, Ca, Ca_sqrt, RI, RI_sqrt) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Plot histograms of the original and transformed variables
ggplot(Glass_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  facet_wrap(~Variable, scales = "free", ncol = 4) +
  labs(title = "Histograms of Original and Transformed Variables", x = "Value", y = "Frequency") +
  theme_minimal()

Al_log, Na_log, K_log, Ba_log, Fe_log: The log transformations reduce the skewness in these right-skewed variables, though they work better for some variables (such as Na_log and Al_log) than for others (such as Fe_log, where most values sit at or near zero).

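Mg_rlog, Si_rlog: The reverse log transformations are intended to make the left-skewed distributions more symmetric. Mg is the harder case, because a group of samples has values at or near zero and no single monotone transformation can spread that group out.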

Ca_sqrt, RI_sqrt: The square root transformations have made slight adjustments to the skewness of these variables, although they were only slightly skewed to begin with.
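Instead of choosing a transformation by hand for each predictor, the power can be estimated from the data. Box-Cox requires strictly positive values and several Glass predictors contain zeros, so this sketch swaps in the related Yeo-Johnson transformation, which caret's preProcess() offers through method = "YeoJohnson". It is only an illustration of the idea and assumes the caret package is installed; yj_fit and glass_yj are names of my own.

```r
library(mlbench)
library(caret)
data(Glass)

glass_numeric <- Glass[, sapply(Glass, is.numeric)]

# Estimate a Yeo-Johnson lambda for the predictors, then apply the transformations
yj_fit <- preProcess(glass_numeric, method = "YeoJohnson")
glass_yj <- predict(yj_fit, glass_numeric)

# Compare one strongly skewed predictor before and after the transformation
summary(glass_numeric$K)
summary(glass_yj$K)
```

The transformed data frame keeps the same columns, so the histogram code above can be reused on glass_yj to judge whether the skewness was reduced.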

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details
## ?Soybean

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# Check the structure to identify categorical variables
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Identify categorical columns
categorical_columns <- names(Soybean)[sapply(Soybean, is.factor)]

# Create bar plots for each categorical variable
for (col in categorical_columns) {
  # Create the bar plot
  p <- ggplot(Soybean, aes(x = .data[[col]])) +  # .data[[col]] replaces the deprecated aes_string()
    geom_bar(fill = "skyblue", color = "black") +
    theme_minimal() +
    labs(title = paste("Bar Plot of", col), x = col, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  # Print the plot
  print(p)
}

# Loop through each categorical variable and calculate the proportion of each category
for (col in categorical_columns) {
  cat("Distribution of", col, ":\n")
  print(prop.table(table(Soybean[[col]])))
  cat("\n")
}
## Distribution of Class :
## 
##                2-4-d-injury         alternarialeaf-spot 
##                  0.02342606                  0.13323572 
##                 anthracnose            bacterial-blight 
##                  0.06442167                  0.02928258 
##           bacterial-pustule                  brown-spot 
##                  0.02928258                  0.13469985 
##              brown-stem-rot                charcoal-rot 
##                  0.06442167                  0.02928258 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                  0.02049780                  0.02196193 
##       diaporthe-stem-canker                downy-mildew 
##                  0.02928258                  0.02928258 
##          frog-eye-leaf-spot            herbicide-injury 
##                  0.13323572                  0.01171303 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                  0.02928258                  0.12884334 
##              powdery-mildew           purple-seed-stain 
##                  0.02928258                  0.02928258 
##        rhizoctonia-root-rot 
##                  0.02928258 
## 
## Distribution of date :
## 
##          0          1          2          3          4          5          6 
## 0.03812317 0.10997067 0.13636364 0.17302053 0.19208211 0.21847507 0.13196481 
## 
## Distribution of plant.stand :
## 
##         0         1 
## 0.5471406 0.4528594 
## 
## Distribution of precip :
## 
##         0         1         2 
## 0.1147287 0.1736434 0.7116279 
## 
## Distribution of temp :
## 
##         0         1         2 
## 0.1225115 0.5727412 0.3047473 
## 
## Distribution of hail :
## 
##         0         1 
## 0.7740214 0.2259786 
## 
## Distribution of crop.hist :
## 
##          0          1          2          3 
## 0.09745127 0.24737631 0.32833583 0.32683658 
## 
## Distribution of area.dam :
## 
##         0         1         2         3 
## 0.1803519 0.3328446 0.2126100 0.2741935 
## 
## Distribution of sever :
## 
##          0          1          2 
## 0.34697509 0.57295374 0.08007117 
## 
## Distribution of seed.tmt :
## 
##          0          1          2 
## 0.54270463 0.39501779 0.06227758 
## 
## Distribution of germ :
## 
##         0         1         2 
## 0.2889667 0.3730298 0.3380035 
## 
## Distribution of plant.growth :
## 
##         0         1 
## 0.6611694 0.3388306 
## 
## Distribution of leaves :
## 
##         0         1 
## 0.1127379 0.8872621 
## 
## Distribution of leaf.halo :
## 
##          0          1          2 
## 0.36894825 0.06010017 0.57095159 
## 
## Distribution of leaf.marg :
## 
##          0          1          2 
## 0.59599332 0.03505843 0.36894825 
## 
## Distribution of leaf.size :
## 
##         0         1         2 
## 0.0851419 0.5459098 0.3689482 
## 
## Distribution of leaf.shread :
## 
##         0         1 
## 0.8353345 0.1646655 
## 
## Distribution of leaf.malf :
## 
##          0          1 
## 0.92487479 0.07512521 
## 
## Distribution of leaf.mild :
## 
##          0          1          2 
## 0.93043478 0.03478261 0.03478261 
## 
## Distribution of stem :
## 
##         0         1 
## 0.4437781 0.5562219 
## 
## Distribution of lodging :
## 
##         0         1 
## 0.9252669 0.0747331 
## 
## Distribution of stem.cankers :
## 
##          0          1          2          3 
## 0.58759690 0.06046512 0.05581395 0.29612403 
## 
## Distribution of canker.lesion :
## 
##         0         1         2         3 
## 0.4961240 0.1286822 0.2744186 0.1007752 
## 
## Distribution of fruiting.bodies :
## 
##         0         1 
## 0.8197574 0.1802426 
## 
## Distribution of ext.decay :
## 
##          0          1          2 
## 0.77054264 0.20930233 0.02015504 
## 
## Distribution of mycelium :
## 
##           0           1 
## 0.990697674 0.009302326 
## 
## Distribution of int.discolor :
## 
##          0          1          2 
## 0.90077519 0.06821705 0.03100775 
## 
## Distribution of sclerotia :
## 
##          0          1 
## 0.96899225 0.03100775 
## 
## Distribution of fruit.pods :
## 
##          0          1          2          3 
## 0.67946578 0.21702838 0.02337229 0.08013356 
## 
## Distribution of fruit.spots :
## 
##          0          1          2          4 
## 0.59792028 0.12998267 0.09878683 0.17331023 
## 
## Distribution of seed :
## 
##         0         1 
## 0.8054146 0.1945854 
## 
## Distribution of mold.growth :
## 
##         0         1 
## 0.8866328 0.1133672 
## 
## Distribution of seed.discolor :
## 
##         0         1 
## 0.8890815 0.1109185 
## 
## Distribution of seed.size :
## 
##         0         1 
## 0.9001692 0.0998308 
## 
## Distribution of shriveling :
## 
##          0          1 
## 0.93414211 0.06585789 
## 
## Distribution of roots :
## 
##          0          1          2 
## 0.84509202 0.13190184 0.02300613

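Yes, several predictors are degenerate or close to it. Mycelium (about 99% of observed values in one level), Sclerotia (about 97%), and Leaf Mild (about 93%, with the rest split evenly between two rare levels) are the most extreme cases. Shriveling, Leaf Malf, Lodging, Int Discolor, Seed Size, Seed Discolor, and Mold Growth are also strongly imbalanced, with roughly 89% to 93% of observed values in a single level. Canker Lesion is not degenerate: its most common level covers only about half of the observations.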
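One way to make this judgment less subjective is the frequency ratio: the count of the most common level divided by the count of the second most common level. A large ratio signals a near-degenerate predictor. The sketch below computes it in base R; freq_ratio is a helper I define here, and the cutoff of 20 is a rule of thumb that I am assuming, not a fixed standard.

```r
library(mlbench)
data(Soybean)

# Ratio of the most common level's count to the second most common level's count
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Apply to every predictor (everything except the outcome, Class)
predictors <- Soybean[, names(Soybean) != "Class"]
ratios <- sort(sapply(predictors, freq_ratio), decreasing = TRUE)
print(round(head(ratios, 10), 1))

# Predictors above the assumed rule-of-thumb cutoff
names(ratios)[ratios > 20]
```

The caret package provides nearZeroVar() for the same purpose, which combines a frequency ratio with the percentage of unique values.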

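Better-distributed variables: Date, Plant Stand, Germ, Area Dam, and Crop Hist are spread fairly evenly across their levels and provide more variability for modeling. Precip is less balanced, with about 71% of observed values in one level. The outcome, Class, is also uneven: four classes each account for about 13% of the samples, while most of the others account for 3% or less.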

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

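I would first eliminate predictors with heavy missingness or near-degenerate distributions: Hail, Sever, Seed Treatment, and Shriveling (each more than 15% missing), Mold Growth (about 13% missing), and Sclerotia (about 97% of its observed values in a single level).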

Then, I would impute missing values for the remaining predictors, such as Precipitation, Temperature, the leaf-related attributes (Halo, Marg, Size, Shread, Malf), Plant Stand, Roots, and Stem, using mode imputation (suitable for categorical variables) or KNN imputation, which fills in values from similar samples. Lastly, I would optionally remove rows from classes such as 2-4-D-Injury and Cyst-Nematode if their missingness is very high, or use class-specific imputation so that missing values are filled based on each class.

By following this strategy of elimination for highly missing predictors and imputation for important predictors with moderate missing data, I can preserve the integrity of the Soybean dataset while minimizing the impact of missing values on model performance. Depending on my analysis, I can also consider handling class-specific missing data to ensure accurate classification.
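Before applying the strategy, the amount of missing data per predictor and per class can be measured directly. This is a minimal sketch in base R; missing_pct, incomplete, and missing_by_class are names I introduce for illustration.

```r
library(mlbench)
data(Soybean)

# Percentage of missing values in each column, largest first
missing_pct <- sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)
print(round(missing_pct[missing_pct > 0], 1))

# Proportion of rows with at least one missing value, by class
incomplete <- !complete.cases(Soybean)
missing_by_class <- tapply(incomplete, Soybean$Class, mean)
print(round(missing_by_class[missing_by_class > 0], 2))
```

The first table shows which predictors cross a chosen threshold such as 15%, and the second shows whether the missing values are concentrated in particular classes.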

# Step 1: Eliminate predictors with heavy missingness or near-degenerate distributions
library(dplyr)
columns_to_remove <- c("hail", "sever", "seed.tmt", "shriveling", "mold.growth", "sclerotia")

# Remove these columns from the dataset
Soybean_cleaned <- Soybean %>%
  select(-all_of(columns_to_remove))

# Step 2: Impute missing values for important predictors
# Use mode imputation for categorical variables

# Define a function for mode imputation
mode_impute <- function(x) {
  x[is.na(x)] <- names(sort(table(x), decreasing = TRUE))[1]
  return(x)
}

# Apply mode imputation to categorical variables
Soybean_imputed <- Soybean_cleaned %>%
  mutate(across(where(is.factor), ~ mode_impute(.)))

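# Step 3: (Optional) Apply KNN imputation for any remaining missing values
# Note: caret's knnImpute only processes numeric columns. Every column here is a factor
# that was already filled by mode imputation, so this step leaves the data unchanged.
library(caret)
pre_process <- preProcess(Soybean_imputed, method = "knnImpute")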

# Apply KNN imputation
Soybean_imputed_knn <- predict(pre_process, Soybean_imputed)

# Check the structure of the final dataset to ensure data types are preserved
str(Soybean_imputed_knn)
## 'data.frame':    683 obs. of  30 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Step 4: Optionally handle class-specific imputation or row removal
# If you want to remove classes with high missingness like 2-4-D-Injury, filter them out
Soybean_final <- Soybean_imputed_knn %>%
  filter(Class != "2-4-d-injury" & Class != "cyst-nematode")

# View the final cleaned dataset
head(Soybean_final)
##                   Class date plant.stand precip temp crop.hist area.dam germ
## 1 diaporthe-stem-canker    6           0      2    1         1        1    0
## 2 diaporthe-stem-canker    4           0      2    1         2        0    1
## 3 diaporthe-stem-canker    3           0      2    1         1        0    2
## 4 diaporthe-stem-canker    3           0      2    1         1        0    1
## 5 diaporthe-stem-canker    6           0      2    1         2        0    2
## 6 diaporthe-stem-canker    5           0      2    1         3        0    1
##   plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf
## 1            1      1         0         2         2           0         0
## 2            1      1         0         2         2           0         0
## 3            1      1         0         2         2           0         0
## 4            1      1         0         2         2           0         0
## 5            1      1         0         2         2           0         0
## 6            1      1         0         2         2           0         0
##   leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 1         0    1       1            3             1               1         1
## 2         0    1       0            3             1               1         1
## 3         0    1       0            3             0               1         1
## 4         0    1       0            3             0               1         1
## 5         0    1       0            3             1               1         1
## 6         0    1       0            3             0               1         1
##   mycelium int.discolor fruit.pods fruit.spots seed seed.discolor seed.size
## 1        0            0          0           4    0             0         0
## 2        0            0          0           4    0             0         0
## 3        0            0          0           4    0             0         0
## 4        0            0          0           4    0             0         0
## 5        0            0          0           4    0             0         0
## 6        0            0          0           4    0             0         0
##   roots
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0
# Make sure there are no remaining missing values in the dataset:
sum(is.na(Soybean_final))  # Should return 0 if all missing values are handled
## [1] 0

The check returned 0, so all missing values have been handled.

# Confirm that the data types are preserved and that categorical, ordinal, and numeric variables are still in the correct format (factors, ordinals, etc.)
str(Soybean_final)
## 'data.frame':    653 obs. of  30 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

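I eliminated six predictors with heavy missingness or near-degenerate distributions (hail, sever, seed.tmt, shriveling, mold.growth, and sclerotia) and filled the remaining missing values with mode imputation. Because every remaining predictor is a factor, the optional KNN step had nothing left to impute. I also filtered out the classes 2-4-d-injury and cyst-nematode, which reduced the data from 683 to 653 rows. The Class factor still lists all 19 levels, so droplevels() can be applied if the unused levels should be removed.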
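The class-specific imputation mentioned in the strategy could look like the sketch below. It fills each missing value with the most common level within the same class and falls back to the overall most common level when a class has no observed values for that predictor. The helpers most_common and impute_by_class are my own illustrations, not functions from a package.

```r
library(mlbench)
data(Soybean)

# Most frequent level among the non-missing values (NA if nothing is observed)
most_common <- function(x) {
  counts <- table(x)
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Fill missing values with the within-class mode, falling back to the overall mode
impute_by_class <- function(x, cls) {
  overall <- most_common(x)
  for (cl in as.character(unique(cls[is.na(x)]))) {
    in_class <- cls == cl
    fill <- most_common(x[in_class])
    if (is.na(fill)) fill <- overall
    x[in_class & is.na(x)] <- fill
  }
  x
}

# Apply to every predictor, leaving the outcome untouched
predictor_names <- setdiff(names(Soybean), "Class")
Soybean_class_imputed <- Soybean
Soybean_class_imputed[predictor_names] <- lapply(
  Soybean[predictor_names], impute_by_class, cls = Soybean$Class
)

sum(is.na(Soybean_class_imputed))
```

Because the fill values depend on the outcome, this kind of imputation should be fitted on the training data only when the model is evaluated with resampling or a test set.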

Visualize the Cleaned Data:

# List of categorical columns
categorical_columns <- names(Soybean_final)[sapply(Soybean_final, is.factor)]

# Plot bar plots for each categorical variable
for (col in categorical_columns) {
  p <- ggplot(Soybean_final, aes(x = .data[[col]])) +  # .data[[col]] replaces the deprecated aes_string()
    geom_bar(fill = "lightblue", color = "black") +
    theme_minimal() +
    labs(title = paste("Bar Plot of", col), x = col, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  print(p)  # Print each plot
}