R Markdown

3.1. The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
#Visual

Glass |> #histo to exam distribution
  keep(is.numeric) |>
  gather()|>
  ggplot(aes(value)) + 
  geom_histogram(bins=30) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms")

Glass |> #Boxplot for outliers
  keep(is.numeric) |>
  gather() |>
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Glass |>
  keep(is.numeric) |>
  cor() |>
  corrplot()

cor_matrix <- cor(Glass[, 1:9])

# Heatmap of the correlation matrix

cor_matrix_melted <- melt(cor_matrix)
ggplot(cor_matrix_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2( midpoint = 0, limit = c(-1,1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Correlation Heatmap of Predictors")

a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Ri seems to have a strong positive correlation with Ca, and a negative correlation with Si, and Ba has a positive correlation with Al.

  1. Do there appear to be any outliers in the data? Are any predictors skewed? Na seem somewhat normally distributed, with some outliers to the right. Mg seems skewed to the left. Al slightly skewed to the right. Si somewhat normally distributed, with a slight skewed to the left. K Skewed to the right. Ca Skewed to the right. Ba Skewed to the right. Fe skewed to the right. RI Skewed to the right. All elements seem to have outliers but Mg.
numeric_glass <- Glass[, sapply(Glass, is.numeric)] #use only numeric columns
# Apply the skewness function
skewness_values <- apply(numeric_glass, 2, skewness)
print(skewness_values)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
  1. Are there any relevant transformations of one or more predictors that might improve the classification model? Fe, K, and Be have a strong skew to the right therefore the log transformation would be beneficial, MG is also strongly skewed to the left the log transformation can also be beneficial.
# log transformation
Glass$log_Fe <- log(Glass$Fe + 1)  # Adding 1 to avoid log(0)
Glass$log_K <- log(Glass$K + 1)

# Check new distributions with histograms
ggplot(Glass, aes(x = log_Fe)) + geom_histogram(bins = 20) + ggtitle("Log-transformed Fe")# Still skewed to the right but improved the frame.

ggplot(Glass, aes(x = log_K)) + geom_histogram(bins = 20) + ggtitle("Log-transformed K") #Still skewed to the right.

#BoxCox
Glass_transformed <- Glass|>
  select(where(is.numeric)) |>  # Keep only numeric columns
  mutate(across(everything(), ~ BoxCoxTrans(.)$lambda))|>
  head(1)
head(Glass_transformed)
##   RI   Na Mg  Al Si  K   Ca Ba Fe log_Fe log_K
## 1 -2 -0.1 NA 0.5  2 NA -1.1 NA NA     NA    NA

The table from using the BoxCox, I could see that RI would need the inverse squared to improve the data quality, Si would need to be squared, Al would need the square root, and Ca would need inverse.

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
#See ?Soybean for details
?Soybean
## Help on topic 'Soybean' was found in the following packages:
## 
##   Package               Library
##   mlbench               /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
##   nlme                  /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
## 
## 
## Using the first match ...
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter? If the the missing values were removed Mycelium and Sclerotia may seem degenerate where there seems to mostly be one value with all the outcomes, but they still have another value with a tiny bit of outcome.
Categorical<-sapply(Soybean, is.factor) #Check all columns that are factors
CategoricalP<-names(Soybean)[Categorical] #store categorical predictor

CategoricalP
##  [1] "Class"           "date"            "plant.stand"     "precip"         
##  [5] "temp"            "hail"            "crop.hist"       "area.dam"       
##  [9] "sever"           "seed.tmt"        "germ"            "plant.growth"   
## [13] "leaves"          "leaf.halo"       "leaf.marg"       "leaf.size"      
## [17] "leaf.shread"     "leaf.malf"       "leaf.mild"       "stem"           
## [21] "lodging"         "stem.cankers"    "canker.lesion"   "fruiting.bodies"
## [25] "ext.decay"       "mycelium"        "int.discolor"    "sclerotia"      
## [29] "fruit.pods"      "fruit.spots"     "seed"            "mold.growth"    
## [33] "seed.discolor"   "seed.size"       "shriveling"      "roots"
lapply(CategoricalP,
  function(col) {
    ggplot(Soybean, 
           aes_string(col)) + geom_bar() + coord_flip() + ggtitle("Frequency Distribution of", col)})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

## 
## [[10]]

## 
## [[11]]

## 
## [[12]]

## 
## [[13]]

## 
## [[14]]

## 
## [[15]]

## 
## [[16]]

## 
## [[17]]

## 
## [[18]]

## 
## [[19]]

## 
## [[20]]

## 
## [[21]]

## 
## [[22]]

## 
## [[23]]

## 
## [[24]]

## 
## [[25]]

## 
## [[26]]

## 
## [[27]]

## 
## [[28]]

## 
## [[29]]

## 
## [[30]]

## 
## [[31]]

## 
## [[32]]

## 
## [[33]]

## 
## [[34]]

## 
## [[35]]

## 
## [[36]]

  1. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
NAData<-Soybean|> #count of missing values per column
  summarize_all(funs(sum(is.na(.))))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
library(VIM) #visual of missing data Red=missing data blue=present data
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
aggr(Soybean, numbers=TRUE, sortVars=TRUE, labels=names(CategoricalP))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000
NaClass<-Soybean|>
  gather(key=CategoricalP,value=value, -Class)|>
  group_by(Class)|>
  summarise(n=sum(is.na(value)))|>
  arrange(desc(n))
## Warning: attributes are not identical across measure variables; they will be
## dropped
NaClass
## # A tibble: 19 × 2
##    Class                           n
##    <fct>                       <int>
##  1 phytophthora-rot             1214
##  2 2-4-d-injury                  450
##  3 cyst-nematode                 336
##  4 diaporthe-pod-&-stem-blight   177
##  5 herbicide-injury              160
##  6 alternarialeaf-spot             0
##  7 anthracnose                     0
##  8 bacterial-blight                0
##  9 bacterial-pustule               0
## 10 brown-spot                      0
## 11 brown-stem-rot                  0
## 12 charcoal-rot                    0
## 13 diaporthe-stem-canker           0
## 14 downy-mildew                    0
## 15 frog-eye-leaf-spot              0
## 16 phyllosticta-leaf-spot          0
## 17 powdery-mildew                  0
## 18 purple-seed-stain               0
## 19 rhizoctonia-root-rot            0

Hail, Sever, Seed.tmt, and Lodging seem to have the highest missing values. The classes (rows) with high numbers of missing data are phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

I believe the MICE function would be using to handle the missing data, which will handle missing data by imputation using predictive mean matching.

DataImp<-mice(Soybean, m=1, method= "pmm", print=F) #predictive mean matching
## Warning: Number of logged events: 339
DataImp<-complete(DataImp) #Extract imputed data

result<-DataImp|>
  summarise((across(everything(),~sum(is.na(.)))))

result # shows zero missing values confirming imputation was successful
##   Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt
## 1     0    0           0      0    0    0         0        0     0        0
##   germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf
## 1    0            0      0         0         0         0           0         0
##   leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 1         0    0       0            0             0               0         0
##   mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth
## 1        0            0         0          0           0    0           0
##   seed.discolor seed.size shriveling roots
## 1             0         0          0     0