Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(ggplot2)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(purrr)
library(corrplot)
## corrplot 0.92 loaded
library(mlbench)
library(broom)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
names(Glass)
##  [1] "RI"   "Na"   "Mg"   "Al"   "Si"   "K"    "Ca"   "Ba"   "Fe"   "Type"

a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Elemental Density Plots

To get a first look at the data, I drew a density plot for each predictor. Al, Na and Si look close to normally distributed, while the other six predictors are skewed in one direction or the other. K, Mg and RI all have multiple peaks. The predictors are also on different scales.

Glass %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_density(fill='blue') + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Density Plots of Each Element")

Frequency Count of each Type of Glass

The glass types are not evenly represented. Types 1 and 2 each have at least 70 occurrences, while types 3, 5, 6 and 7 each have fewer than 30. This imbalance will have to be dealt with.

Glass %>%
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Frequency of Types of Glass")
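
The imbalance can also be checked numerically rather than read off the bar chart. The following is a minimal sketch using base R only; the object names `type_counts` and `type_share` are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

# Absolute frequency of each glass type
type_counts <- table(Glass$Type)

# Relative frequency of each glass type, rounded for readability
type_share <- round(prop.table(type_counts), 3)

type_counts
type_share
```

The shares make it easier to judge how strongly types 1 and 2 dominate the sample.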

Relationship of Element to Glass Type, Means

Each type of glass has a general chemical makeup. To get a general idea of this relationship, we can look at the mean of each predictor by glass type.

Type 6 is the most distinctive because three of its means are exactly zero: K, Ba and Fe. Any glass sample that contains no K, Ba or Fe will therefore most likely be classified as type 6.

Another type that stands out is type 5. Its means for Na, Mg, K and Ca all differ noticeably from the other types' means.

Type 7 also catches the eye: its Mg mean is the lowest of all types, and its Al and Ba means are the highest. It also has the lowest Ca mean and, apart from type 6 (whose Fe mean is zero), the lowest Fe mean.

These distinguishing characteristics may make types 5 and 7 easier for a model to identify.

type_means <- Glass %>%
  group_by(Type) %>%
  summarise(across(everything(),list(mean)))
type_means
## # A tibble: 6 × 10
##   Type   RI_1  Na_1  Mg_1  Al_1  Si_1   K_1  Ca_1    Ba_1   Fe_1
##   <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>
## 1 1      1.52  13.2 3.55   1.16  72.6 0.447  8.80 0.0127  0.057 
## 2 2      1.52  13.1 3.00   1.41  72.6 0.521  9.07 0.0503  0.0797
## 3 3      1.52  13.4 3.54   1.20  72.4 0.406  8.78 0.00882 0.0571
## 4 5      1.52  12.8 0.774  2.03  72.4 1.47  10.1  0.188   0.0608
## 5 6      1.52  14.6 1.31   1.37  73.2 0      9.36 0       0     
## 6 7      1.52  14.4 0.538  2.12  73.0 0.325  8.49 1.04    0.0134

Relationship of Element to Glass Type, Standard Deviations

The standard deviations are a little alarming, as they show how spread out each element's values are within a type. Types 5 and 6 have high standard deviations for several elements.

type_std <- Glass %>%
  group_by(Type) %>%
  summarise(across(everything(),list(sd)))
type_std
## # A tibble: 6 × 10
##   Type     RI_1  Na_1  Mg_1  Al_1  Si_1   K_1  Ca_1   Ba_1   Fe_1
##   <fct>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 1     0.00227 0.499 0.247 0.273 0.569 0.215 0.575 0.0838 0.0891
## 2 2     0.00380 0.664 1.22  0.318 0.725 0.214 1.92  0.362  0.106 
## 3 3     0.00192 0.507 0.163 0.347 0.512 0.230 0.380 0.0364 0.108 
## 4 5     0.00335 0.777 0.999 0.694 1.28  2.14  2.18  0.608  0.156 
## 5 6     0.00312 1.08  1.10  0.572 1.08  0     1.45  0      0     
## 6 7     0.00255 0.686 1.12  0.443 0.940 0.668 0.974 0.665  0.0298

Relationship of Element to Glass Type Density Plots

The density plots of each element within each glass type give a better picture than the tables of means and standard deviations above. From the plots we can see how the data are spread for each type of glass.

Each type's plot for each element looks unique in terms of mean and distribution of values. So even though the means and standard deviations may be similar, the distributions of the values differ.

# Convert the Type factor to its numeric labels (1, 2, 3, 5, 6, 7)
# so the subsets below can be selected with ==.
glass_modified <- Glass %>%
  mutate(Type = as.numeric(as.character(Type)))

glass_1 <- glass_modified[glass_modified$Type == 1,]
glass_1[1:9]%>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_density(fill='blue') + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Relationship between Element and Glass Type 1")

glass_2 <- glass_modified[glass_modified$Type == 2,]
glass_2[1:9]%>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_density(fill='blue') + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Relationship between Element and Glass Type 2")

glass_3 <- glass_modified[glass_modified$Type == 3,]
glass_3[1:9]%>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_density(fill='blue') + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Relationship between Element and Glass Type 3")

glass_5 <- glass_modified[glass_modified$Type == 5,]
glass_5[1:9]%>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_density(fill='blue') + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Relationship between Element and Glass Type 5")

glass_6 <- glass_modified[glass_modified$Type == 6,]
glass_6[1:9]%>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_density(fill='blue') + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Relationship between Element and Glass Type 6")

glass_7 <- glass_modified[glass_modified$Type == 7,]
glass_7[1:9]%>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_density(fill='blue') + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Relationship between Element and Glass Type 7")
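
The six per-type blocks above share the same structure, so they could be generated by one helper. This is a sketch under a few assumptions: `plot_type_density` and `type_plots` are hypothetical names, and `pivot_longer()` from tidyr is swapped in for `gather()`. It is not the code that produced the plots in this report.

```r
library(ggplot2)
library(tidyr)
library(mlbench)
data(Glass)

# Hypothetical helper: density plot of the nine predictors for one glass type
plot_type_density <- function(data, type) {
  # Keep the rows of this type and the nine predictor columns, in long format
  subset_long <- pivot_longer(
    data[data$Type == type, 1:9],
    cols = everything(),
    names_to = "key",
    values_to = "value"
  )
  ggplot(subset_long, aes(value)) +
    geom_density(fill = "blue") +
    facet_wrap(~key, scales = "free") +
    ggtitle(paste("Relationship between Element and Glass Type", type))
}

# One plot object per glass type, named by the type label
type_plots <- lapply(levels(Glass$Type), function(t) plot_type_density(Glass, t))
names(type_plots) <- levels(Glass$Type)
```

Printing an element such as `type_plots[["5"]]` draws the plot for that type.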

Correlations of Elements

The correlation plot shows only one highly correlated pair of predictors: RI and Ca. Since the two carry very similar information, one of them could be removed.

Glass %>%
  keep(is.numeric) %>%
  cor() %>%
  corrplot(method='number') 
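
To list the highly correlated pairs instead of reading them off the plot, the correlation matrix can be filtered directly. This is a minimal sketch; the cutoff of 0.75 is an assumption made for illustration, and `high_pairs` is a hypothetical name.

```r
library(mlbench)
data(Glass)

cor_mat <- cor(Glass[sapply(Glass, is.numeric)])

# Look only at the upper triangle so each pair is reported once.
# The 0.75 cutoff is an assumed threshold, not a fixed rule.
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.75, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

Lowering the cutoff shows which pairs come next in strength.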

b) Do there appear to be any outliers in the data? Are any predictors skewed?

Looking at the overall density plots, where the data are not separated by glass type, there appear to be many possible outliers. When the data are broken down by type, however, those outliers look somewhat less prevalent.

Take Ba as an example. Its density is highly skewed for every glass type except type 7. For type 5 the distribution is right-skewed, but the density rises again toward the right. Since there are fewer than 30 observations, those larger values might not be anomalies but rather a normal amount of the element for this type. The spread of Ba for type 5 is also larger than for most of the other types.

The Ba density for type 2 tells a different story when compared to type 5. The spread is larger than for the other types, but the density thins out as values move further to the right, which suggests outliers.
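
The visual impression of skewness and outliers can be backed up with numbers. The sketch below computes a moment-based sample skewness and counts the points that fall beyond the boxplot whiskers for each predictor. The `skewness` function is hand-written here for illustration and is not taken from a package, and the whisker rule is only one possible definition of an outlier.

```r
library(mlbench)
data(Glass)

predictors <- Glass[sapply(Glass, is.numeric)]

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Positive values indicate right skew, negative values left skew
skew_values <- sort(sapply(predictors, skewness), decreasing = TRUE)
skew_values

# Number of points beyond the boxplot whiskers, per predictor
outlier_counts <- sapply(predictors, function(x) length(boxplot.stats(x)$out))
outlier_counts
```

The same two summaries could be computed within each glass type to check whether the outliers become less prevalent after grouping.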

c) Are there any relevant transformations of one or more predictors that might improve the classification model?

For Al, a square root transformation brings the density plot closest to normal. For Na, Ca, RI and Si, a log transformation helps the density plots look more normal. Ba, Fe, K and Mg all contain zeros, so a log transformation would produce infinite values. A Box-Cox transformation won't work for those predictors either, because it requires strictly positive values.

Al_sqrt <- sqrt(Glass$Al)
Al_sqrt <- Al_sqrt %>% scale(center=TRUE,scale=TRUE) %>% as.vector()
plot(density(Al_sqrt))

ca_log <- log(Glass$Ca)
plot(density(ca_log))

na_log <- log(Glass$Na)
na_log <- na_log %>% scale(center=TRUE,scale=TRUE) %>% as.vector()
plot(density(na_log))

ri_log <- log(Glass$RI)
plot(density(ri_log))

si_log <- log(Glass$Si)
plot(density(si_log))
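
The claim that some predictors contain zeros can be checked directly. This is a small sketch; `zero_counts` and `has_zeros` are illustrative names.

```r
library(mlbench)
data(Glass)

predictors <- Glass[sapply(Glass, is.numeric)]

# Number of exact zeros in each predictor
zero_counts <- sapply(predictors, function(x) sum(x == 0))
zero_counts

# Predictors that contain at least one zero
has_zeros <- names(zero_counts)[zero_counts > 0]
has_zeros

# log(0) evaluates to -Inf, so these predictors cannot be log-transformed as they are
log(0)
```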

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
unique(Soybean$Class)
##  [1] diaporthe-stem-canker       charcoal-rot               
##  [3] rhizoctonia-root-rot        phytophthora-rot           
##  [5] brown-stem-rot              powdery-mildew             
##  [7] downy-mildew                brown-spot                 
##  [9] bacterial-blight            bacterial-pustule          
## [11] purple-seed-stain           anthracnose                
## [13] phyllosticta-leaf-spot      alternarialeaf-spot        
## [15] frog-eye-leaf-spot          diaporthe-pod-&-stem-blight
## [17] cyst-nematode               2-4-d-injury               
## [19] herbicide-injury           
## 19 Levels: 2-4-d-injury alternarialeaf-spot anthracnose ... rhizoctonia-root-rot
nrow(Soybean)
## [1] 683

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

This data set differs from the Glass data: the predictors are categorical rather than continuous. It also has missing values in almost every field.

# One bar chart of level counts per predictor
ggplot(melt(Soybean, id.vars = "Class"), aes(x = value)) +
  geom_bar() +
  facet_wrap(~variable, scales = "free")
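
The bar charts can be complemented with a numeric screen for degenerate predictors. The sketch below flags predictors where the most common level is far more frequent than the second most common one and where there are few distinct values relative to the sample size. The cutoffs (a frequency ratio above 95/5 and fewer than 10% unique values) are assumptions chosen to resemble common rules of thumb, and the calculation is written in base R rather than with a dedicated package function.

```r
library(mlbench)
data(Soybean)

predictors <- Soybean[names(Soybean) != "Class"]

# Ratio of the most common level's count to the second most common level's count
freq_ratio <- sapply(predictors, function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1]) / as.numeric(counts[2])
})

# Distinct observed values as a percentage of the number of rows
pct_unique <- sapply(predictors, function(x) {
  100 * length(unique(na.omit(x))) / length(x)
})

# Assumed cutoffs: ratio above 95/5 and fewer than 10% unique values
degenerate <- names(predictors)[freq_ratio > 95 / 5 & pct_unique < 10]
degenerate
```

Predictors flagged this way are candidates for removal, since almost all samples share a single level.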

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Four of the classes appear to be measured on different fields and are more specific in nature, so I think the best way to deal with their missing data is to remove the observations belonging to 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight and herbicide-injury.

For the phytophthora-rot class, imputation would be best because that class seems to be measured on the same traits as the other 14 classes.

This strategy assumes that the prediction target has nothing to do with those four classes. Otherwise the strategy changes, possibly to the opposite approach or to an entirely different one, such as removing the specific predictors that have missing values.
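
The strategy can be sketched in code as follows. The first part shows the share of incomplete rows per class, which is what motivates treating the classes differently. The second part drops the four classes and fills the remaining gaps. The imputation method is an assumption: the text does not name one, so the most frequent level of each predictor is used here purely for illustration, and `impute_mode` is a hypothetical helper.

```r
library(mlbench)
data(Soybean)

# Share of rows in each class that have at least one missing value
incomplete <- !complete.cases(Soybean)
missing_by_class <- tapply(incomplete, Soybean$Class, mean)
round(missing_by_class[missing_by_class > 0], 2)

# Drop the four classes named in the strategy
drop_classes <- c("2-4-d-injury", "cyst-nematode",
                  "diaporthe-pod-&-stem-blight", "herbicide-injury")
kept <- droplevels(Soybean[!Soybean$Class %in% drop_classes, ])

# Assumed method: replace missing values with the predictor's most frequent level
impute_mode <- function(x) {
  if (anyNA(x) && !all(is.na(x))) {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

imputed <- kept
imputed[] <- lapply(kept, impute_mode)

# Number of missing values left after imputation
sum(is.na(imputed))
```

A model-based imputation that uses the other predictors would likely preserve more structure than the mode, at the cost of more setup.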