library(ggplot2)
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.94 loaded

library(gridExtra)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks gridExtra::combine()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyr)


3.1. The UC Irvine Machine Learning Repository contains a dataset related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Predictors")

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Predictors")

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot()

Several predictors, including Al, Ba, Ca, Fe, K, and the refractive index (RI), show right-skewed distributions. Fe and Ba have values concentrated near zero, indicating they are present only in trace amounts in most samples. Mg is left-skewed and bimodal, Na is nearly normally distributed with a slight right tail, and Si is left-skewed.
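The visual impressions above can be backed by a number. The following is a minimal sketch of one common sample skewness statistic, written in base R; the function name `sample_skewness` and the toy vectors are illustrative assumptions, not part of the original analysis. Values near zero suggest symmetry, positive values a right tail, and negative values a left tail.

```r
# One common sample skewness statistic:
# sum((x - mean)^3) / ((n - 1) * v^(3/2)), where v is the sample variance.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  centered <- x - mean(x)
  v <- sum(centered^2) / (n - 1)
  sum(centered^3) / ((n - 1) * v^1.5)
}

# Illustrative toy vectors (not actual Glass measurements)
symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(0, 0, 0, 0, 0.1, 0.2, 3)
left_tail  <- -right_tail

skew_values <- c(
  symmetric  = sample_skewness(symmetric),
  right_tail = sample_skewness(right_tail),
  left_tail  = sample_skewness(left_tail)
)
print(round(skew_values, 2))
```

On the data used here, `sapply(Glass[, 1:9], sample_skewness)` would return one value per predictor.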

Positive correlations: A strong positive relationship exists between RI and Ca (as calcium increases, the refractive index tends to increase), and Ba and Al are also positively correlated.

Negative correlations: Notable negative correlations include RI and Si, Al and Mg, Ca and Mg, and Ba and Mg, meaning that within each pair one value tends to decrease as the other increases.
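A correlation plot is easier to read alongside a ranked list of pairs. The sketch below, in base R, flattens a correlation matrix into one row per predictor pair and sorts by absolute correlation. The helper name `ranked_correlations` and the toy columns `p1` to `p4` are hypothetical and stand in for the Glass predictors.

```r
# Return every predictor pair once, ordered by absolute correlation.
ranked_correlations <- function(df) {
  cm <- cor(df)
  # Upper triangle only, so each pair appears once and the diagonal is skipped
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]
  rownames(pairs) <- NULL
  pairs
}

# Illustrative toy data (not actual Glass measurements)
toy <- data.frame(
  p1 = c(1, 2, 3, 4, 5, 6),
  p2 = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),  # rises with p1
  p3 = c(6, 5, 4, 3, 2, 1.5),             # falls as p1 rises
  p4 = c(3, 1, 4, 1, 5, 9)
)
ranked <- ranked_correlations(toy)
print(ranked)
```

For the data in this exercise, the call would be `ranked_correlations(Glass[, 1:9])`.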

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

There appear to be outliers in almost all of the predictors, with the exception of Mg. Several of the predictors are skewed; see the discussion in 3.1(a).
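To make the outlier claim checkable, the points that a boxplot would draw individually can be counted. This is a minimal sketch that relies on base R's `boxplot.stats()`, which flags values lying more than `coef` times the hinge spread beyond the hinges. The wrapper name `count_outliers` and the toy vectors are assumptions for illustration.

```r
# Count the points a boxplot would flag as outliers.
count_outliers <- function(x, coef = 1.5) {
  length(boxplot.stats(x, coef = coef)$out)
}

# Illustrative toy vectors (not actual Glass measurements)
with_outlier    <- c(1, 2, 3, 4, 5, 100)
without_outlier <- c(1, 2, 3, 4, 5, 6)

outlier_counts <- c(
  with_outlier    = count_outliers(with_outlier),
  without_outlier = count_outliers(without_outlier)
)
print(outlier_counts)
```

Applied to this exercise, `sapply(Glass[, 1:9], count_outliers)` would give a per-predictor count that can be compared with the boxplots.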

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

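Right-skewed variables (e.g., Na, Al, K, Ba, Fe) may benefit from log, square root, or Box-Cox transformations to reduce skewness. Because Ba and Fe contain many zeros, the log and Box-Cox transformations, which require strictly positive values, can only be applied to them after adding a small offset; a square root needs no such adjustment.

Left-skewed variables (e.g., Mg, Si) can be made more symmetric with a reflected log or square root transformation, that is, by transforming max(x) + 1 - x instead of x.

For mildly skewed variables like Ca and RI, a square root transformation is a gentler option.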
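The transformations discussed above can be sketched in base R as follows. The toy vectors and the helper name `reflect_log` are illustrative assumptions, not values or functions from the original analysis. The sketch shows `log1p()` as one way to handle zero-heavy predictors and a reflected log for left-skewed ones.

```r
# Illustrative toy vectors (not actual Glass measurements)
zero_heavy  <- c(0, 0, 0, 0.1, 0.5, 3)     # right-skewed with exact zeros
left_skewed <- c(0.5, 3.2, 3.5, 3.6, 3.7)  # long tail toward small values

# log1p(x) = log(1 + x) is defined at zero, unlike log(x)
log_zero_heavy <- log1p(zero_heavy)

# Reflect the values so the long tail points right, then take logs.
# Note: this reverses the ordering of the values.
reflect_log <- function(x) log(max(x) + 1 - x)
reflected <- reflect_log(left_skewed)

print(round(log_zero_heavy, 3))
print(round(reflected, 3))
```

If the caret package is available, `preProcess()` with `method = "YeoJohnson"` is another option that accepts zeros and chooses the transformation parameter from the data; whether it actually helps the classifier would still need to be checked by resampling.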

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

columns <- colnames(Soybean)
# Draw one horizontal bar chart per column. .data[[col]] looks the column up
# by name and replaces the deprecated aes_string().
lapply(columns,
  function(col) {
    ggplot(Soybean, 
           aes(.data[[col]])) + geom_bar() + coord_flip() + ggtitle(col)})

Degenerate distributions occur when a variable takes on only a single value or almost always takes one value. The mycelium and sclerotia predictors appear to be degenerate, and leaf.mild and leaf.malf are also almost entirely one-sided.
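One way to make "almost always takes one value" concrete is the frequency ratio: the count of the most common value divided by the count of the second most common. The sketch below is a base R version; the function name `freq_ratio` and the toy factors are assumptions for illustration, and any cutoff (a ratio of around 20 is a commonly used rule of thumb) is a judgment call rather than a fixed rule.

```r
# Ratio of the most frequent value's count to the second most frequent one.
# Missing values are ignored, as table() drops NA by default.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # only one distinct value observed
  as.numeric(counts[1]) / as.numeric(counts[2])
}

# Illustrative toy factors (not actual Soybean columns)
nearly_constant <- factor(c(rep("0", 96), rep("1", 4)))
balanced        <- factor(c(rep("0", 50), rep("1", 50)))

ratios <- c(
  nearly_constant = freq_ratio(nearly_constant),
  balanced        = freq_ratio(balanced)
)
print(ratios)
```

For this exercise, `sapply(Soybean[, -1], freq_ratio)` would give one ratio per predictor. If caret is installed, `nearZeroVar(Soybean, saveMetrics = TRUE)` reports a frequency ratio together with the percentage of unique values.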

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

missing_percentage <- as.data.frame(t(Soybean %>%
  select(-Class) %>%  
  summarise(across(everything(), ~ mean(is.na(.)) * 100))))
colnames(missing_percentage) <- "Percentage Missing"
missing_percentage$Variable <- rownames(missing_percentage)
head(missing_percentage)

##             Percentage Missing    Variable
## date                 0.1464129        date
## plant.stand          5.2708638 plant.stand
## precip               5.5636896      precip
## temp                 4.3923865        temp
## hail                17.7159590        hail
## crop.hist            2.3426061   crop.hist

ggplot(missing_percentage, aes(x = reorder(Variable, -`Percentage Missing`), y = `Percentage Missing`)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  coord_flip() +  
  labs(title = "Percentage of Missing Values in Soybean Dataset",
       x = "Predictors",
       y = "Percentage Missing") +
  theme_minimal()

missing_by_class <- Soybean %>%
  mutate(MissingCount = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>%
  summarise(AvgMissing = mean(MissingCount))

Soybean %>%
  filter(!Class %in% c("2-4-d-injury", "cyst-nematode", "diaporthe-pod-&-stem-blight","herbicide-injury", 
                       "phytophthora-rot" )) %>%
  # Flag every cell as missing or not; mutate(across()) keeps one row per
  # sample and avoids the deprecated multi-row summarise_all()
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to="missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x=n, fill = missing))+
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values with Missing Classes Removed",
       x = "Proportion") +
  scale_fill_manual(values=c("grey","red"))

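The predictors hail, sever, seed.tmt, germ, leaf.halo, leaf.shread, and leaf.malf all show comparatively high percentages of missing data (hail, for example, is missing in about 17.7% of samples). These predictors may require different strategies, such as imputation or elimination, depending on their relevance to the analysis. The pattern of missing data also appears to be related to the classes: the missing values are concentrated in five classes (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot), which are the ones filtered out in the last plot. Removing those classes is one way to deal with the missing values.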
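The link between missingness and class can be checked directly by counting incomplete rows per class. This is a minimal base R sketch on a made-up data frame; the object names `toy`, `incomplete`, and `rows_with_na` and the values are illustrative only.

```r
# Illustrative toy data (not actual Soybean rows)
toy <- data.frame(
  Class = c("a", "a", "b", "b", "c"),
  x     = c(1, NA, 2, 3, NA),
  y     = c("u", NA, "v", "w", "z")
)

# TRUE for every row that has at least one missing value
incomplete <- !complete.cases(toy)

# Number of incomplete rows within each class
rows_with_na <- tapply(incomplete, toy$Class, sum)
print(rows_with_na)
```

On the real data, replacing `toy` with `Soybean` would show which classes contain incomplete samples, and comparing the result with `table(Soybean$Class)` would show whether entire classes are affected.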

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

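I would start by eliminating predictors with a high proportion of missing data. The remaining missing values can then be imputed, for example with a K-nearest-neighbors approach, which fills each gap using the most similar complete samples. Another option is to remove the classes that account for most of the missing values, at the cost of no longer being able to predict them.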
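The two-step strategy can be sketched as follows. Note that this sketch swaps in simple mode imputation (filling each gap with the column's most frequent level) as a stand-in for KNN, because it needs only base R. The helper names `drop_sparse` and `impute_mode`, the 50% threshold, and the toy data are all assumptions chosen for illustration.

```r
# Step 1: drop predictors whose share of missing values exceeds a threshold.
drop_sparse <- function(df, max_missing) {
  keep <- colMeans(is.na(df)) <= max_missing
  df[, keep, drop = FALSE]
}

# Step 2: fill gaps with the most frequent level.
# Intended for factor or character columns only.
impute_mode <- function(x) {
  if (!anyNA(x)) return(x)
  counts <- table(x)
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Illustrative toy data (not actual Soybean rows)
toy <- data.frame(
  f1 = factor(c("a", "a", "b", NA, "a")),  # 20% missing
  f2 = factor(c("x", NA, NA, NA, "y")),    # 60% missing
  f3 = factor(c("p", "q", "q", "q", NA))   # 20% missing
)

reduced <- drop_sparse(toy, max_missing = 0.5)  # hypothetical cutoff
imputed <- reduced
imputed[] <- lapply(reduced, impute_mode)
print(imputed)
```

Mode imputation ignores relationships between predictors, which is the main reason a neighbor-based method may give better results; in either case, the imputation should be fit inside the resampling loop so that the evaluation is not overly optimistic.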

Data-624-Homework-4

Adriana Medina

2024-09-28