hw 4

###3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(mlbench) 
library(purrr)
library(corrplot)

## corrplot 0.94 loaded

data(Glass) 
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Histograms show the distributions, boxplots highlight the outliers, and the corrplot shows the strength of the correlations between predictors.

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Glass Types")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms" )

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots")

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot()

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

The boxplots show outliers in every variable except Mg.
The histograms show skewness in many of the variables; Na looks like it may be the least skewed.
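The visual impression of skewness can be cross-checked with a number. The following is a minimal sketch, assuming the moment-based definition of sample skewness (other packages use slightly different corrections); `sample_skewness` is a helper name introduced here for illustration, not part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Values near 0 suggest symmetry; positive values mean a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Apply to the numeric Glass predictors when mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  numeric_cols <- Glass[sapply(Glass, is.numeric)]
  skew <- sapply(numeric_cols, sample_skewness)
  print(round(sort(skew, decreasing = TRUE), 2))
}
```

Sorting the values puts the most right-skewed predictors first, which makes it easy to compare against what the histograms suggest.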

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

We can apply a Box-Cox transformation (estimating a lambda for each predictor), take logs of the positively skewed variables, and possibly standardize, since the variables are on different scales.
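As a rough illustration of these transformations, the sketch below writes out the Box-Cox formula for a fixed lambda and a plain standardization in base R. The helper names `box_cox` and `standardize` are introduced here, and `lambda = 0` (the log case) is an arbitrary illustrative choice rather than an estimated value. One caveat: Box-Cox is only defined for strictly positive data, so predictors that contain zeros (Ba and Fe do, per the `str()` output above) are left untransformed in this sketch.

```r
# Box-Cox transform for a fixed lambda; only defined for strictly positive x.
# lambda = 0 is the log transform; otherwise (x^lambda - 1) / lambda.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Center to mean 0 and scale to standard deviation 1.
standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[sapply(Glass, is.numeric)]

  # Only strictly positive predictors are eligible for Box-Cox.
  positive <- sapply(predictors, function(col) all(col > 0))
  transformed <- predictors
  transformed[positive] <- lapply(predictors[positive], box_cox, lambda = 0)

  # Standardize every predictor so they share a common scale.
  transformed[] <- lapply(transformed, standardize)

  # Predictors skipped by Box-Cox because they contain zeros.
  print(names(positive)[!positive])
}
```

In practice lambda would be estimated per predictor rather than fixed, and a transformation that tolerates zeros may be preferable for the skipped columns.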

###3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)  ## See ?Soybean for details

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The reading suggests that degenerate distributions can cripple a model.
To find them, I looked for predictors with few unique values where one value dominates. ‘nearZeroVar’ flags such near-zero-variance predictors using a frequency ratio (the frequency of the most common value divided by the frequency of the second most common value) together with the percentage of unique values.
“leaf.mild”, “mycelium”, and “sclerotia” were identified as having degenerate distributions.

# facet wrapping geombar makes it hard to see; trying another approach 
library(tidyr)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

index <- nearZeroVar(Soybean)
colnames(Soybean)[index]

## [1] "leaf.mild" "mycelium"  "sclerotia"
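To make the two criteria concrete, here is a minimal base-R sketch of the frequency ratio and the percentage of unique values. The helper names are introduced here for illustration; the cutoffs mirror caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`), but this is an approximation of the idea, not caret's implementation, so edge cases may be handled differently.

```r
# Ratio of the most common value's count to the second most common value's count.
freq_ratio <- function(x) {
  counts <- sort(table(x[!is.na(x)]), decreasing = TRUE)
  counts <- counts[counts > 0]          # ignore unused factor levels
  if (length(counts) < 2) return(Inf)   # convention of this sketch for a single value
  as.numeric(counts[1] / counts[2])
}

# Distinct non-missing values as a percentage of all rows.
percent_unique <- function(x) 100 * length(unique(x[!is.na(x)])) / length(x)

# Flag a column when one value dominates AND there are few distinct values.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  freq_ratio(x) > freq_cut && percent_unique(x) < unique_cut
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  flags <- sapply(Soybean, is_near_zero_var)
  print(names(flags)[flags])
}
```

Comparing the printed names with the `nearZeroVar` output above is a quick way to check that the sketch captures the same rule.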

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The plot shows that the four predictors with the most missing data are sever, seed.tmt, lodging, and hail.
Yes, the pattern of missing data is related to the classes. Every row with a missing value belongs to one of five classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot.

missing_proportions <- sort(apply(Soybean, 2, function(col) sum(is.na(col)) / length(col)), decreasing = TRUE)

missing_df <- data.frame(variable = names(missing_proportions), missing_proportion = missing_proportions)

ggplot(missing_df, aes(x = reorder(variable, missing_proportion), y = missing_proportion)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Proportion of Missing Values",
       x = "Variable",
       y = "Proportion") +
  theme_minimal()

# Count the rows with at least one missing value in each class
classes_w_missing <- Soybean %>%
  filter(if_any(everything(), is.na)) %>%
  select(Class) %>%
  group_by(Class) %>%
  summarise(count = n())
classes_w_missing

## # A tibble: 5 × 2
##   Class                       count
##   <fct>                       <int>
## 1 2-4-d-injury                   16
## 2 cyst-nematode                  14
## 3 diaporthe-pod-&-stem-blight    15
## 4 herbicide-injury                8
## 5 phytophthora-rot               68
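The counts above show how many incomplete rows each class has, but not what share of the class that is. A small sketch of that comparison, using only base R, could look like the following; `incomplete_share_by_class` is a helper name introduced here, and it assumes the class column is called `Class` as in `Soybean`.

```r
# Share of rows with at least one missing value, computed per class.
incomplete_share_by_class <- function(df, class_col = "Class") {
  incomplete <- !complete.cases(df)
  tapply(incomplete, df[[class_col]], mean)
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  shares <- incomplete_share_by_class(Soybean)
  # Show only the classes that have any incomplete rows.
  print(round(shares[shares > 0], 2))
}
```

A share of 1 would mean every sample of that class is incomplete, which matters when deciding between dropping predictors and imputing.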

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Strategies to consider:
Eliminating predictors: this would not be ideal as an immediate step, since it would remove more than half of the variables. However, because the missing data are confined to five classes, I think targeting those classes is a better strategy. The next step is to look more closely at them and remove the variables that are 100% missing (listed below) in these classes: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.
Imputation: this would be a good strategy, using KNN to estimate the missing data for variables that are less than 100% missing in the classes above, and for the phytophthora-rot class, where no variable is 100% missing.

# For each class with missing data, list the variables that are missing in every row
for (cls in unique(classes_w_missing$Class)) {
  class_rows <- Soybean %>%
    filter(Class == cls)

  missing_proportions <- sort(apply(class_rows, 2, function(col) sum(is.na(col)) / length(col)), decreasing = TRUE)

  missing_100 <- names(missing_proportions[missing_proportions == 1])

  print(paste("Class:", cls))
  if (length(missing_100) > 0) {
    print(paste("Variables with 100% missing:", paste(missing_100, collapse = ", ")))
  } else {
    print("No variables with 100% missing values.")
  }
}

## [1] "Class: 2-4-d-injury"
## [1] "Variables with 100% missing: plant.stand, precip, temp, hail, crop.hist, sever, seed.tmt, germ, plant.growth, leaf.shread, leaf.mild, stem, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.pods, fruit.spots, seed, mold.growth, seed.discolor, seed.size, shriveling, roots"
## [1] "Class: cyst-nematode"
## [1] "Variables with 100% missing: plant.stand, precip, temp, hail, sever, seed.tmt, germ, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.spots, seed.discolor, shriveling"
## [1] "Class: diaporthe-pod-&-stem-blight"
## [1] "Variables with 100% missing: hail, sever, seed.tmt, leaf.halo, leaf.marg, leaf.size, leaf.shread, leaf.malf, leaf.mild, lodging, roots"
## [1] "Class: herbicide-injury"
## [1] "Variables with 100% missing: precip, hail, sever, seed.tmt, germ, leaf.mild, lodging, stem.cankers, canker.lesion, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, fruit.spots, seed, mold.growth, seed.discolor, seed.size, shriveling"
## [1] "Class: phytophthora-rot"
## [1] "No variables with 100% missing values."
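The two-step idea (drop the sparsest predictors, then impute what remains) can be sketched in base R. Note that this sketch swaps in simple mode imputation as a stand-in, since KNN imputation on categorical predictors needs more machinery; it is not the KNN approach described above. The helper names and the 0.15 threshold are assumptions made for illustration only.

```r
# Drop predictors whose overall share of missing values exceeds max_missing.
drop_sparse_predictors <- function(df, max_missing = 0.5) {
  df[, colMeans(is.na(df)) <= max_missing, drop = FALSE]
}

# Replace NAs in a vector with its most frequent observed value,
# preserving the vector's type (numeric, factor, ordered factor).
impute_mode <- function(x) {
  observed <- unique(x[!is.na(x)])
  if (length(observed) == 0) return(x)   # nothing to impute from
  mode_value <- observed[which.max(tabulate(match(x, observed)))]
  x[is.na(x)] <- mode_value
  x
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  # 0.15 is an arbitrary illustrative threshold, not a recommendation.
  reduced <- drop_sparse_predictors(Soybean, max_missing = 0.15)
  imputed <- reduced
  imputed[] <- lapply(reduced, impute_mode)
  print(dim(imputed))
  print(sum(is.na(imputed)))
}
```

Because the missingness is tied to class, a global mode may misrepresent the affected classes; imputing within class, or a neighbor-based method, would respect that structure better.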

Marjete Vucinaj

2024-09-29