Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

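library(mlbench)     # Glass and Soybean data sets
library(caret)       # preProcess(), nearZeroVar()
library(corrplot)    # corrplot.mixed()
library(tidyverse)   # %>%, pivot_longer(), mutate(), ggplot()
library(naniar)      # gg_miss_fct()
library(data.table)  # fwrite(), fread()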
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

head(Glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
Glass_sub <- subset(Glass, select = -Type)
predictors <- colnames(Glass_sub)

par(mfrow = c(3,3))
for(i in 1:9) {
  hist(Glass_sub[,i],main = predictors[i])
}

  • The shapes of the distributions vary considerably across the predictors.
  • Aluminum, Silicon, and Sodium are the closest to normal.
  • Iron, Barium, and Potassium appear to be the farthest from normal.
  • Magnesium is the only bimodal distribution.
corrplot.mixed(cor(Glass_sub[,1:9]), lower.col = "black", number.cex = .7)

  • There do seem to be relationships between some of the variables.
    • Calcium and the Refractive Index (RI) have the strongest positive relationship.
    • Magnesium & Barium, Magnesium & Calcium, and Refractive Index & Silicon have the strongest negative relationships.
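The strongest pairs can also be read off numerically rather than from the plot. The following is a minimal sketch in base R; `top_correlations` is a hypothetical helper name, and the toy data frame is made up so that the block runs without the Glass data.

```r
# Return the n variable pairs with the largest absolute correlation.
# (Hypothetical helper, not part of corrplot or caret.)
top_correlations <- function(df, n = 5) {
  cm  <- cor(df)
  idx <- which(upper.tri(cm), arr.ind = TRUE)   # visit each pair once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out <- out[order(-abs(out$r)), ]              # strongest first, either sign
  head(out, n)
}

# Toy data (made up): b tracks a closely, c is loosely opposed to a
set.seed(42)
a   <- rnorm(100)
toy <- data.frame(
  a = a,
  b = a + rnorm(100, sd = 0.1),
  c = -a + rnorm(100, sd = 1),
  d = rnorm(100)
)

top_correlations(toy, 3)
```

With the objects defined earlier, `top_correlations(Glass_sub)` should list the same pairs that stand out in the correlation plot.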

(b)

Do there appear to be any outliers in the data? Are any predictors skewed?

Glass_gather <- Glass %>%
  pivot_longer(-Type, names_to = "predictors", values_to = "measurement", values_drop_na = TRUE) %>%
  mutate(predictors = as.factor(predictors))

par(mfrow = c(3,3))
for(i in 1:9) {
  boxplot(Glass_sub[,i],main = predictors[i], horizontal = TRUE)
}

Glass_gather %>%
  filter(predictors == 'Na'|predictors == 'K'|predictors == 'Ba'|predictors == 'Ca')%>%
  ggplot(aes(x=Type, y=measurement, color=predictors))+
  geom_jitter()+
  scale_color_brewer(palette = "Set1") +
  theme_light()  

  • Silicon was left out of the jitter plot so that the other elements are easier to see.
  • There do seem to be outliers in most of the elements, with the heaviest in Calcium, Potassium, Sodium, and Barium.
  • Most of the outliers appear to be in Type 5, except for Calcium, where Type 2 holds many of the outliers.
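To put a number on what the boxplots show, the points drawn beyond the whiskers can be counted with `boxplot.stats()`, which applies the same 1.5 × IQR rule that `boxplot()` uses by default. This is a minimal sketch; `count_outliers` is a hypothetical helper name and the toy values are made up.

```r
# Count the points that boxplot() would draw beyond the whiskers.
# (Hypothetical helper; boxplot.stats() is part of base R's grDevices.)
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Toy data (made up): identical except for one extreme value
toy <- data.frame(
  clean  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  spiked = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
)

sapply(toy, count_outliers)
# clean: 0, spiked: 1
```

Applied to the glass data, `sapply(Glass_sub, count_outliers)` would give one count per predictor. Note that a point flagged by this rule is only unusual relative to the bulk of the data; it is not necessarily an error.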
par(mfrow = c(3,3))
for(i in 1:9) {
  hist(Glass_sub[,i],main = predictors[i])
}

  • Most elements have a long right tail (right skew); Barium, Iron, and Potassium are the strongest.
  • Magnesium is bimodal.
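Skewness can be checked numerically as well as visually. The sketch below uses the simple moment-based formula in base R; packages such as e1071 offer other variants, and the toy columns are simulated stand-ins rather than the glass data. A positive value indicates a long right tail and a negative value a long left tail.

```r
# Moment-based sample skewness: third central moment divided by
# the 1.5 power of the second central moment.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy data (simulated, not the glass measurements)
set.seed(1)
toy <- data.frame(
  symmetric    = rnorm(200),
  right_skewed = rexp(200)            # mass near zero, long right tail
)
toy$left_skewed <- -toy$right_skewed  # mirror image: long left tail

round(sapply(toy, skewness), 2)
```

For the glass predictors, `sapply(Glass_sub, skewness)` would give one value per column to compare against the histograms.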

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

glass_process <- preProcess(Glass, method=c('BoxCox', 'center', 'scale'))
glass_update<- predict(glass_process, Glass)

par(mfrow = c(3,3))
for(i in 1:9) {
  hist(glass_update[,i],main = predictors[i])
}

par(mfrow = c(3,3))
for(i in 1:9) {
  boxplot(glass_update[,i],main = predictors[i], horizontal = TRUE)
}

  • BoxCox had the most effect on elements that were not too far from normal.
  • For items such as Iron, the right skew is too strong for the transform to have a major effect.
    • Looking at the data for Iron, there are a lot of zero values. Box-Cox requires strictly positive values, so a predictor that contains zeros is likely left untransformed apart from centering and scaling.
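Because Box-Cox is only defined for strictly positive values, it can help to check which predictors qualify before interpreting the transformed histograms. This is a minimal base R sketch; `boxcox_eligible` is a hypothetical helper name, and the toy rows reuse a few values visible in the `str(Glass)` output above.

```r
# TRUE for numeric columns whose values are all strictly positive,
# i.e. columns where a Box-Cox transform is defined.
# (Hypothetical helper, not a caret function.)
boxcox_eligible <- function(df) {
  num <- df[sapply(df, is.numeric)]   # drop factors such as Type
  sapply(num, function(x) all(x > 0, na.rm = TRUE))
}

# Toy rows built from values shown in the str() output
toy <- data.frame(
  Si   = c(71.8, 72.7, 73.0),
  Fe   = c(0, 0, 0.26),
  Type = factor(c(1, 1, 1))
)

boxcox_eligible(toy)
# Si: TRUE, Fe: FALSE
```

For the columns that come back `FALSE`, one option that might be worth trying is the Yeo-Johnson transformation, which accepts zeros and which caret's `preProcess()` exposes as `method = "YeoJohnson"`.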

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data("Soybean")
nearZeroVar(Soybean)
## [1] 19 26 28
names(Soybean[,nearZeroVar(Soybean)])
## [1] "leaf.mild" "mycelium"  "sclerotia"
summary(Soybean[19])
##  leaf.mild 
##  0   :535  
##  1   : 20  
##  2   : 20  
##  NA's:108
summary(Soybean[26])
##  mycelium  
##  0   :639  
##  1   :  6  
##  NA's: 38
summary(Soybean[28])
##  sclerotia 
##  0   :625  
##  1   : 20  
##  NA's: 38
  • For each of these predictors, the ratio of the most prevalent value to the second most prevalent value exceeds the default cutoff of 95/5 (19), and the percentage of unique values is under 10%.
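The two criteria can be worked through by hand from the counts printed above. This sketch assumes the default `nearZeroVar()` cutoffs (`freqCut = 95/5`, `uniqueCut = 10`) and assumes that the percentage of unique values is computed as distinct non-missing values over the total number of rows.

```r
# Counts copied from the summary() output above (NAs excluded)
counts <- list(
  leaf.mild = c(`0` = 535, `1` = 20, `2` = 20),
  mycelium  = c(`0` = 639, `1` = 6),
  sclerotia = c(`0` = 625, `1` = 20)
)
n_rows <- 683

# Ratio of the most common value to the second most common value
freq_ratio <- sapply(counts, function(x) {
  s <- sort(x, decreasing = TRUE)
  unname(s[1] / s[2])
})

# Distinct values as a percentage of all rows (assumed definition)
pct_unique <- sapply(counts, function(x) 100 * length(x) / n_rows)

data.frame(
  freq_ratio,
  pct_unique = round(pct_unique, 2),
  flagged    = freq_ratio > 95 / 5 & pct_unique < 10
)
# freq_ratio: leaf.mild 26.75, mycelium 106.5, sclerotia 31.25
```

All three ratios are well above 19, which is consistent with these columns being the ones flagged.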

(b)

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Soybean %>%
  group_by(Class)%>%
  gg_miss_fct(Class)

  • The missing values do seem to be concentrated in specific predictors and classes rather than spread evenly.
    • The 2-4-d-injury class has the most missing values.
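The same pattern can be summarized as numbers by computing the share of missing values per predictor and per class. This is a minimal base R sketch on a made-up toy data frame; the column names are borrowed from the soybean data purely for illustration.

```r
# Toy data (made up) with missing values clustered in class "b"
toy <- data.frame(
  Class = c("a", "a", "b", "b", "b"),
  hail  = c(0, 1, NA, NA, 1),
  sever = c(1, NA, NA, NA, 0)
)
preds <- toy[names(toy) != "Class"]

# Percentage of missing values in each predictor
pct_missing_by_var <- colMeans(is.na(preds)) * 100
pct_missing_by_var
# hail: 40, sever: 60

# Percentage of missing cells within each class
pct_missing_by_class <- sapply(
  split(preds, toy$Class),
  function(d) mean(is.na(d)) * 100
)
round(pct_missing_by_class, 1)
# a: 25, b: 66.7
```

Replacing `toy` with `Soybean` should give the per-predictor and per-class figures behind the plot.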

(c)

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

#copy dataframe
Soybean_update <- Soybean

#rewrite the data frame to disk and read it back so I control the formatting
fwrite(Soybean,"soybean.temp")
Soybean_update <- fread("soybean.temp",colClasses = "character")

#converting all columns except Class to numeric
Soybean_update <- Soybean_update %>%
  mutate_at(vars("date","plant.stand","precip","temp","hail","crop.hist","area.dam","sever","seed.tmt","germ","plant.growth","leaves","leaf.halo","leaf.marg","leaf.size","leaf.shread","leaf.malf","leaf.mild","stem","lodging","stem.cankers","canker.lesion","fruiting.bodies","ext.decay","mycelium","int.discolor","sclerotia","fruit.pods","fruit.spots","seed","mold.growth","seed.discolor","seed.size","shriveling","roots" ),as.numeric)

#setting all NA's to mean of column
Soybean_update <-Soybean_update %>%
  mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))

summary(Soybean_update)
##     Class                date        plant.stand         precip     
##  Length:683         Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:1.000  
##  Mode  :character   Median :4.000   Median :0.0000   Median :2.000  
##                     Mean   :3.554   Mean   :0.4529   Mean   :1.597  
##                     3rd Qu.:5.000   3rd Qu.:1.0000   3rd Qu.:2.000  
##                     Max.   :6.000   Max.   :1.0000   Max.   :2.000  
##       temp            hail         crop.hist        area.dam    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :0.000   Median :2.000   Median :1.000  
##  Mean   :1.182   Mean   :0.226   Mean   :1.885   Mean   :1.581  
##  3rd Qu.:2.000   3rd Qu.:0.226   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :2.000   Max.   :1.000   Max.   :3.000   Max.   :3.000  
##      sever           seed.tmt           germ        plant.growth   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.5196   Median :1.000   Median :0.0000  
##  Mean   :0.7331   Mean   :0.5196   Mean   :1.049   Mean   :0.3388  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :2.0000   Max.   :2.0000   Max.   :2.000   Max.   :1.0000  
##      leaves         leaf.halo       leaf.marg       leaf.size    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:1.000  
##  Median :1.0000   Median :2.000   Median :0.000   Median :1.000  
##  Mean   :0.8873   Mean   :1.202   Mean   :0.773   Mean   :1.284  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :1.0000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##   leaf.shread       leaf.malf         leaf.mild           stem       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :1.0000  
##  Mean   :0.1647   Mean   :0.07513   Mean   :0.1043   Mean   :0.5562  
##  3rd Qu.:0.1647   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :2.0000   Max.   :1.0000  
##     lodging         stem.cankers  canker.lesion    fruiting.bodies 
##  Min.   :0.00000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00   Median :0.9798   Median :0.0000  
##  Mean   :0.07473   Mean   :1.06   Mean   :0.9798   Mean   :0.1802  
##  3rd Qu.:0.00000   3rd Qu.:3.00   3rd Qu.:2.0000   3rd Qu.:0.1802  
##  Max.   :1.00000   Max.   :3.00   Max.   :3.0000   Max.   :1.0000  
##    ext.decay         mycelium         int.discolor      sclerotia      
##  Min.   :0.0000   Min.   :0.000000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.000000   Median :0.0000   Median :0.00000  
##  Mean   :0.2496   Mean   :0.009302   Mean   :0.1302   Mean   :0.03101  
##  3rd Qu.:0.2496   3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :2.0000   Max.   :1.000000   Max.   :2.0000   Max.   :1.00000  
##    fruit.pods      fruit.spots         seed         mold.growth    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.000   Median :0.0000   Median :0.0000  
##  Mean   :0.5042   Mean   :1.021   Mean   :0.1946   Mean   :0.1134  
##  3rd Qu.:1.0000   3rd Qu.:1.021   3rd Qu.:0.1946   3rd Qu.:0.0000  
##  Max.   :3.0000   Max.   :4.000   Max.   :1.0000   Max.   :1.0000  
##  seed.discolor      seed.size         shriveling          roots       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.0000  
##  Mean   :0.1109   Mean   :0.09983   Mean   :0.06586   Mean   :0.1779  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :2.0000
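One side effect of mean imputation is visible in the summary: quartiles such as 0.226 for hail are column means, so the imputed rows hold values that fall between the category codes. A possible alternative, sketched here and not the approach used above, is to fill each missing value with the most frequent level so that every value remains a valid category. `impute_mode` is a hypothetical helper name and the toy factor is made up.

```r
# Replace NAs in a factor (or character vector) with its most common level.
# Ties go to the first level in table() order. Not intended for numeric
# vectors, which would be coerced to character.
# (Hypothetical helper, not from caret or naniar.)
impute_mode <- function(x) {
  tab <- table(x)                        # table() drops NA by default
  mode_value <- names(tab)[which.max(tab)]
  x[is.na(x)] <- mode_value
  x
}

# Toy factor (made up) coded like the soybean predictors
hail <- factor(c("0", "0", "1", NA, "0", NA))

# The mean of the observed codes is not a valid category
mean(as.numeric(as.character(hail)), na.rm = TRUE)
# 0.25

impute_mode(hail)
```

Assuming `Class` is the first column, `lapply(Soybean[-1], impute_mode)` would apply this to every predictor. Whether mode imputation is appropriate depends on why the values are missing, so it is worth comparing against simply dropping the heavily missing predictors.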