knitr::opts_chunk$set(fig.width = 11, fig.height = 11)
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(fpp2)
## Loading required package: fpp2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
require(corrplot)
## Loading required package: corrplot
## corrplot 0.84 loaded

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

require(mlbench)
## Loading required package: mlbench
data(Soybean)

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

So here are the frequency distributions for all 35 predictors.

table(Soybean$date, dnn = "Date")
## Date
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90
table(Soybean$plant.stand, dnn = "Plant Stand")
## Plant Stand
##   0   1 
## 354 293
table(Soybean$precip, dnn = "Precip")
## Precip
##   0   1   2 
##  74 112 459
table(Soybean$temp, dnn = "Temp")
## Temp
##   0   1   2 
##  80 374 199
table(Soybean$hail, dnn = "Hail")
## Hail
##   0   1 
## 435 127
table(Soybean$crop.hist, dnn = "Crop Hist")
## Crop Hist
##   0   1   2   3 
##  65 165 219 218
table(Soybean$area.dam, dnn = "Area Dam")
## Area Dam
##   0   1   2   3 
## 123 227 145 187
table(Soybean$sever, dnn = "Sever")
## Sever
##   0   1   2 
## 195 322  45
table(Soybean$seed.tmt, dnn = "Seed Tmt")
## Seed Tmt
##   0   1   2 
## 305 222  35
table(Soybean$germ, dnn = "Germ")
## Germ
##   0   1   2 
## 165 213 193
table(Soybean$plant.growth, dnn = "Plant Growth")
## Plant Growth
##   0   1 
## 441 226
table(Soybean$leaves, dnn = "Leaves")
## Leaves
##   0   1 
##  77 606
table(Soybean$leaf.halo, dnn = "Leaf Halo")
## Leaf Halo
##   0   1   2 
## 221  36 342
table(Soybean$leaf.marg, dnn = "Leaf Marg")
## Leaf Marg
##   0   1   2 
## 357  21 221
table(Soybean$leaf.size, dnn = "Leaf Size")
## Leaf Size
##   0   1   2 
##  51 327 221
table(Soybean$leaf.shread, dnn = "Leaf Shread")
## Leaf Shread
##   0   1 
## 487  96
table(Soybean$leaf.malf, dnn = "Leaf Malf")
## Leaf Malf
##   0   1 
## 554  45
table(Soybean$leaf.mild, dnn = "Leaf Mild")
## Leaf Mild
##   0   1   2 
## 535  20  20
table(Soybean$stem, dnn = "Stem")
## Stem
##   0   1 
## 296 371
table(Soybean$lodging, dnn = "Lodging")
## Lodging
##   0   1 
## 520  42
table(Soybean$stem.cankers, dnn = "Stem Cankers")
## Stem Cankers
##   0   1   2   3 
## 379  39  36 191
table(Soybean$canker.lesion, dnn = "Canker Lesion")
## Canker Lesion
##   0   1   2   3 
## 320  83 177  65
table(Soybean$fruiting.bodies, dnn = "Fruiting Bodies")
## Fruiting Bodies
##   0   1 
## 473 104
table(Soybean$ext.decay, dnn = "Ext Decay")
## Ext Decay
##   0   1   2 
## 497 135  13
table(Soybean$mycelium, dnn = "Mycelium")
## Mycelium
##   0   1 
## 639   6
table(Soybean$int.discolor, dnn = "Int Discolor")
## Int Discolor
##   0   1   2 
## 581  44  20
table(Soybean$sclerotia, dnn = "Sclerotia")
## Sclerotia
##   0   1 
## 625  20
table(Soybean$fruit.pods, dnn = "Fruit Pods")
## Fruit Pods
##   0   1   2   3 
## 407 130  14  48
table(Soybean$fruit.spots, dnn = "Fruit Spots")
## Fruit Spots
##   0   1   2   4 
## 345  75  57 100
table(Soybean$seed, dnn = "Seed")
## Seed
##   0   1 
## 476 115
table(Soybean$mold.growth, dnn = "Mold Growth")
## Mold Growth
##   0   1 
## 524  67
table(Soybean$seed.discolor, dnn = "Seed Discolor")
## Seed Discolor
##   0   1 
## 513  64
table(Soybean$seed.size, dnn = "Seed Size")
## Seed Size
##   0   1 
## 532  59
table(Soybean$shriveling, dnn = "Shriveling")
## Shriveling
##   0   1 
## 539  38
table(Soybean$roots, dnn = "Roots")
## Roots
##   0   1   2 
## 551  86  15
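
For reference, the same 35 tables can be produced in one pass, skipping column 1 (the Class outcome); table() drops NAs by default, so the counts match the per-column calls above.

lapply(Soybean[, -1], table)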

At first blush, a few of them seem degenerate: leaf.malf is split 554 to 45, and leaf.mild is split 535 to 20 and 20. However, we won't know for sure until we apply the degeneracy criteria from the chapter, which the helper functions below check.

# Return the column indices of degenerate predictors. Column 1 (Class,
# the outcome) is skipped; note that table() drops NAs by default.
degenerate <- function(df){
  ans <- numeric(0)
  
  for(i in 2:ncol(df)){
    # Level counts, sorted from most to least frequent.
    t <- sort(table(df[, i]), decreasing = TRUE)
    
    if(t.ratio(t) && uv(t)){
      ans <- c(ans, i)
    }
  }
  
  ans
}

# TRUE when the most frequent level outnumbers the runner-up by ~20 to 1.
t.ratio <- function(t){
  t[1] / t[2] > 19
}

# TRUE when the most frequent level accounts for ~90% or more of the samples.
uv <- function(t){
  t <- t / sum(t)
  t[1] > 0.89
}

l <- degenerate(Soybean)

head(Soybean[, l])
##   leaf.mild mycelium sclerotia
## 1         0        0         0
## 2         0        0         0
## 3         0        0         0
## 4         0        0         0
## 5         0        0         0
## 6         0        0         0

It appears, by our calculations, that at least three predictors are degenerate: leaf.mild, mycelium, and sclerotia. That is to say, for each of them the most frequent level accounts for roughly 90% or more of the samples, and the ratio of the most frequent level to the second most frequent is about 20 or greater. In practice this might not matter, because we could simply choose a model that is not harmed by near-zero-variance predictors.
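
As a cross-check, caret (already loaded above) provides nearZeroVar(), whose defaults (freqCut = 95/5, uniqueCut = 10) encode essentially the same two heuristics; it should flag the same three predictors.

nearZeroVar(Soybean, names = TRUE)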

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

For the rows in which the class is NA, it would probably be best to delete them outright. It is possible to predict those labels, but since predicting the class is exactly what the model will be built to do in the first place, that seems redundant.

For the missing data that follow a pattern, such as those outlined above, it might be possible to predict the missing values from the group. Since we have grouped some of them into the data frames eight.four, one.two.one, and three.eight, we could do as the book suggests and run a k-nearest-neighbors algorithm on each one to impute the missing values. One potential issue is that the training/test set would have to consist of only the complete entries, which may be too small a sample to learn anything meaningful from.

Still, this is better than the alternative, which would be to replace each NA with the most common level of the given predictor. If these were numeric values we could take the mean or median, but they are unfortunately all factor variables, so we would be working with straight replacements. If we are going to replace NAs with the values that make the most sense given the other entries, it makes more sense to let a kNN algorithm do it than to fiddle with it manually; a sketch of that approach follows.
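
A minimal sketch, assuming the VIM package is installed: its kNN() imputes factor columns directly (via Gower distance), unlike caret's knnImpute, which only handles numeric data.

require(VIM)

# Drop any rows whose Class is missing, then impute the remaining NAs
# from the k = 5 nearest neighbors.
soy <- Soybean[!is.na(Soybean$Class), ]
soy.imputed <- kNN(soy, k = 5, imp_var = FALSE)  # imp_var = FALSE drops the indicator columns

sum(is.na(soy.imputed))  # 0 if every NA was imputed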