Data 624 HW 4

Maryluz Cruz

2021-03-08

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
library(DataExplorer)
library(psych)
library(ggplot2)
library(GGally)
library(tidyverse)
library(mice)
library(caret)
library(e1071)

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Density Plot

plot_density(Glass)

Histogram

plot_histogram(Glass)

Pairs Panels

pairs.panels(Glass)

GGPairs

ggpairs(Glass)

Boxplot
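
One way to draw the boxplots is sketched here. The reshaping step and the names glass_long, p_box, and outlier_counts are illustrative assumptions, not taken from the original analysis. The last step counts the points that fall beyond the 1.5 * IQR whiskers, which is the rule a boxplot uses to mark outliers.

```r
library(mlbench)
library(ggplot2)
library(tidyr)

data(Glass)

# Reshape the nine numeric predictors to long form (Type is kept as a column)
glass_long <- pivot_longer(Glass, cols = -Type,
                           names_to = "predictor", values_to = "value")

# One boxplot per predictor, each panel on its own scale
p_box <- ggplot(glass_long, aes(x = predictor, y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free")
print(p_box)

# Count the points beyond the 1.5 * IQR whiskers for each predictor
outlier_counts <- sapply(Glass[, names(Glass) != "Type"],
                         function(x) length(boxplot.stats(x)$out))
outlier_counts
```

Free scales are used because the predictors differ widely in magnitude (Si is around 70 while Ba and Fe are near 0), so a shared axis would flatten most panels.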

Correlation

plot_correlation(Glass)
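
To read the strongest relationship off as a number rather than from the heat map, the correlation matrix can be searched directly. This is a minimal sketch in base R; the names glass_cor, top_pair, and top_value are illustrative assumptions.

```r
library(mlbench)

data(Glass)

# Pearson correlations among the nine numeric predictors (Type excluded)
glass_cor <- cor(Glass[, names(Glass) != "Type"])

# Blank out the diagonal and lower triangle so each pair appears once
glass_cor[lower.tri(glass_cor, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation
top_idx <- which(abs(glass_cor) == max(abs(glass_cor), na.rm = TRUE),
                 arr.ind = TRUE)
top_pair <- c(rownames(glass_cor)[top_idx[1, "row"]],
              colnames(glass_cor)[top_idx[1, "col"]])
top_value <- glass_cor[top_idx[1, "row"], top_idx[1, "col"]]

top_pair
top_value
```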

  1. Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, there seem to be outliers. K, Ca, Ba, and Fe are right-skewed: most of their values sit near the low end, with a long tail toward high values. Si and Al seem to be more normally distributed.
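
The visual impression of skewness can be checked numerically. The sketch below uses skewness() from the e1071 package loaded earlier; the name glass_skew is an illustrative assumption. A positive value means a longer right tail, a negative value a longer left tail.

```r
library(mlbench)
library(e1071)

data(Glass)

# Sample skewness of each numeric predictor: > 0 means a longer right tail,
# < 0 a longer left tail, values near 0 a roughly symmetric distribution
glass_skew <- sapply(Glass[, names(Glass) != "Type"], skewness)
round(sort(glass_skew, decreasing = TRUE), 2)
```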

  1. Are there any relevant transformations of one or more predictors that might improve the classification model?

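Yes, there are relevant transformations that can be done. K, Ca, Ba, and Fe can be transformed using a log. Ba and Fe contain zeros (visible in the str() output above), and log(0) is -Inf, so log1p(), which computes log(1 + x), is used for them. It is used for K as well, since it is safe whether or not zeros are present. Ca is strictly positive and can use log() directly.

Glass$K  <- log1p(Glass$K)   ## log(1 + x), safe if K contains zeros
Glass$Ca <- log(Glass$Ca)    ## Ca has no zeros, so a plain log works
Glass$Ba <- log1p(Glass$Ba)  ## Ba contains zeros
Glass$Fe <- log1p(Glass$Fe)  ## Fe contains zeros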
plot_histogram(Glass)

plot_density(Glass)
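
A log-type transformation is one option. Another, sketched here as an assumption rather than as the method used in this report, is to let caret estimate a Yeo-Johnson transformation for each predictor. Unlike a plain log, Yeo-Johnson is defined for zero values, so no predictor needs special handling. The names predictors, pp, and glass_yj are illustrative.

```r
library(mlbench)
library(caret)

data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Estimate a Yeo-Johnson transformation for each predictor
pp <- preProcess(predictors, method = "YeoJohnson")

# Apply the estimated transformations to the predictors
glass_yj <- predict(pp, predictors)

# Every transformed value should be finite (no -Inf from zeros)
all(is.finite(as.matrix(glass_yj)))
```

Printing pp reports which predictors were transformed, and the transformed columns can be passed to plot_histogram() to compare them with the originals.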

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
summary(Soybean)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Soybean %>% gather() %>% ggplot(aes(value)) + facet_wrap(~key, scales = "free") + geom_bar()

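Every predictor except leaves has missing values. The summary above also shows a few predictors that are close to degenerate, with nearly all samples in a single level: mycelium (639 vs. 6), sclerotia (625 vs. 20), and leaf.mild (535 vs. 20 and 20).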
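
Degenerate predictors can also be flagged programmatically. The sketch below uses nearZeroVar() from the caret package loaded earlier, with its default cutoffs (a frequency ratio above 95/5 between the two most common values, and under 10% unique values). The names nzv_metrics and nzv_names are illustrative assumptions.

```r
library(mlbench)
library(caret)

data(Soybean)

# Compute the frequency ratio and percent of unique values for every column
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Names of the columns flagged as near-zero variance
nzv_names <- rownames(nzv_metrics)[nzv_metrics$nzv]

nzv_names
nzv_metrics[nzv_names, c("freqRatio", "percentUnique")]
```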

  1. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

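While many values are missing, they are not spread evenly across the predictors. In the summary above, hail, sever, seed.tmt, and lodging each have 121 missing values (about 18% of the 683 samples), followed by germ (112), leaf.mild (108), and fruiting.bodies, fruit.spots, seed.discolor, and shriveling (106 each); date and area.dam have only 1 each, and leaves has none. The plot below shows the share of missing values for each predictor.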

plot_missing(Soybean)
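
Whether the missingness is related to the classes can be checked by counting incomplete rows within each class. This is a minimal base R sketch; the names na_share, has_na, and na_by_class are illustrative assumptions.

```r
library(mlbench)

data(Soybean)

# Share of missing values per column, largest first
na_share <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
round(head(na_share, 10), 3)

# For each class: number of rows and how many have at least one NA
has_na <- !complete.cases(Soybean)
na_by_class <- data.frame(
  n_rows       = as.vector(table(Soybean$Class)),
  n_incomplete = as.vector(tapply(has_na, Soybean$Class, sum)),
  row.names    = levels(Soybean$Class)
)

# Keep only the classes that contain at least one incomplete row
na_by_class[na_by_class$n_incomplete > 0, ]
```

If the incomplete rows fall in only a handful of classes, the missingness is informative about the outcome, which matters when choosing between dropping predictors and imputing.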

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

There are many ways to handle the missing data. A simple approach is to fill each gap with a single summary value, such as the mean or median for a numeric predictor or the most frequent level for a categorical one, but with this many predictors affected it would not be the best option. A better approach when many variables have missing values is multiple imputation with the MICE method, using its random forest ("rf") imputation method.
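
For comparison, the simple single-value approach can be sketched in a few lines of base R. Since the soybean predictors are categorical, the most frequent level stands in for the mean or median. The helper most_frequent and the name Soybean_mode are hypothetical, introduced only for illustration.

```r
library(mlbench)

data(Soybean)

# Return the most frequent non-missing level of a factor
most_frequent <- function(x) names(which.max(table(x)))

# Replace each NA with the most frequent level of its own column
Soybean_mode <- Soybean
for (col in names(Soybean_mode)) {
  x <- Soybean_mode[[col]]
  x[is.na(x)] <- most_frequent(x)
  Soybean_mode[[col]] <- x
}

sum(is.na(Soybean))       # missing cells before
sum(is.na(Soybean_mode))  # missing cells after
```

This baseline ignores the relationships between predictors, so it pushes every imputed value toward the dominant level. That is the weakness a model-based method is meant to address.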

This is the code that can be used to impute the missing values with mice.

Soybean_trans<-Soybean %>% ## the missing values are imputed 
  mice(method = "rf")
## 
##  iter imp variable
##   1   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   1   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   1   3  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   1   4  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   1   5  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   2   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   2   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   2   3  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   2   4  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   2   5  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   3   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   3   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   3   3  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   3   4  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   3   5  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   4   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   4   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   4   3  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   4   4  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   4   5  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   5   1  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   5   2  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   5   3  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   5   4  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
##   5   5  date  plant.stand  precip  temp  hail  crop.hist  area.dam  sever  seed.tmt  germ  plant.growth  leaf.halo  leaf.marg  leaf.size  leaf.shread  leaf.malf  leaf.mild  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies  ext.decay  mycelium  int.discolor  sclerotia  fruit.pods  fruit.spots  seed  mold.growth  seed.discolor  seed.size  shriveling  roots
Soybean_trans <- mice::complete(Soybean_trans) ## extract the first of the imputed data sets
plot_missing(Soybean_trans)

As you can see, after imputing the data with mice there are no more missing values.

Soybean_trans %>% gather() %>% ggplot(aes(value)) + facet_wrap(~key, scales = "free") + geom_bar()