library(mlbench)
library(tidyverse)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)  # loads the Glass data frame into the environment; data() returns only the dataset's name, so assigning its result is not useful
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
# Histograms of each predictor to check for skewness
Glass %>%
  select(!Type) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~key, scales = 'free')
# Boxplots of each predictor to check for outliers
Glass %>%
  select(!Type) %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_boxplot() +
  facet_wrap(~key, scales = 'free')
The histograms show that several predictors are skewed, and the boxplots show that all predictors except ‘Mg’ have outliers. A Box-Cox transformation is applicable for the skewed predictors, and a spatial sign transformation can mitigate the outliers.
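As a sketch of how these transformations could be applied (assuming the caret and e1071 packages are installed; neither is loaded above), we can quantify the skewness and then let caret's preProcess() chain the transformations:
library(caret)
library(e1071)
# Skewness statistic for each predictor (values far from 0 indicate skew)
apply(Glass %>% select(!Type), 2, skewness)
# Box-Cox lambdas are estimated only for strictly positive predictors, so
# columns containing zeros (e.g., Ba, Fe) are skipped by that step; centering
# and scaling are applied before the spatial sign projection, as it requires
pp <- preProcess(Glass %>% select(!Type),
                 method = c("BoxCox", "center", "scale", "spatialSign"))
glass_trans <- predict(pp, Glass %>% select(!Type))
head(glass_trans)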
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
# See ?Soybean for details
# Bar charts of each categorical predictor's level frequencies (complete cases only)
Soybean %>%
  select(!Class) %>%
  drop_na() %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~key)
Degenerate distributions are those where the variable takes essentially a single value, with all other values occurring at a very low rate. Here ‘mycelium’, ‘sclerotia’, and ‘roots’ seem to be degenerate.
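As a more formal check (a sketch, assuming caret is available, as loaded above), caret's nearZeroVar() flags predictors with near-zero variance:
# Returns the names of the near-degenerate predictors
nearZeroVar(Soybean, names = TRUE)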
# Count missing vs. non-missing values per variable
Soybean %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(),
               names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col()
# Which classes contain the rows with at least one missing value?
Soybean %>%
  filter(if_any(everything(), is.na)) %>%
  count(Class, name = "count")
## # A tibble: 5 x 2
## Class count
## <fct> <int>
## 1 2-4-d-injury 16
## 2 cyst-nematode 14
## 3 diaporthe-pod-&-stem-blight 15
## 4 herbicide-injury 8
## 5 phytophthora-rot 68
All of the missing values fall within the five classes shown above. Among the predictors, ‘hail’, ‘lodging’, ‘sever’, and ‘seed.tmt’ appear to have the most missing values.
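To make that ranking precise, we can sort the predictors by their number of missing values using the same tidyverse tools as above:
Soybean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing)) %>%
  head(10)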
Imputation is one way to account for missing data without discarding it altogether. Candidate fill-in values include the minimum, maximum, mean, or median of a column; since the Soybean predictors are categorical, the analogous choice here is to fill each column with its most frequent level (the mode).
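A minimal sketch of that strategy (the impute_mode() helper is my own illustration, not a function from any package):
impute_mode <- function(x) {
  # Replace NAs in a factor with its most frequent observed level
  if (anyNA(x)) {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
soybean_imputed <- Soybean %>% mutate(across(!Class, impute_mode))
sum(is.na(soybean_imputed))  # should now be 0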