Data 624 HW4

# Install any missing packages, then load them all
pkgs <- c("Hmisc", "PerformanceAnalytics", "mlbench", "car", "missForest",
          "Amelia", "kableExtra", "naniar", "tidyverse", "caret")
for (p in pkgs) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
    library(p, character.only = TRUE)
  }
}

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via data(Glass) from the mlbench package.

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

glass <- subset(Glass, select = -Type)
chart.Correlation(glass)

The correlation chart shows that RI and Ca are strongly positively correlated (0.81), while RI and Si are moderately negatively correlated (-0.54).
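The pairwise correlations read off the chart can also be tabulated directly. The following is a minimal sketch in base R, run on a small simulated data frame rather than the glass data; the helper name strong_pairs and the 0.5 cutoff are assumptions made for illustration.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation
# exceeds a cutoff, sorted from strongest to weakest.
strong_pairs <- function(df, cutoff = 0.5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  # Look only at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), , drop = FALSE]
}

# Simulated stand-in data (not the Glass measurements)
set.seed(1)
x <- rnorm(200)
toy <- data.frame(a = x, b = x + rnorm(200, sd = 0.5), c = rnorm(200))
pairs_found <- strong_pairs(toy)
print(pairs_found)
```

Applied to the glass data frame above, strong_pairs(glass) should list the RI and Ca pair first, in line with the chart.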

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

hist.data.frame(glass)

par(mfrow=c(3,3))
for(var in names(glass)){
  boxplot(glass[var], main=paste('Boxplot of', var), horizontal = T)
}

From the histograms, K and Mg appear to have possible second modes around zero, and several predictors (Ca, Ba, Fe, and RI) show signs of skewness. There may be one or two outliers in K, but they could simply be due to natural skewness. Also, Ca, RI, Na, and Si have most samples concentrated in the middle of the scale, with a small number of data points at the edges of the distribution. So yes, the boxplots indicate that there are outliers in the data.
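The visual impressions of skewness and outliers can be backed up with numbers. This is a rough sketch in base R on simulated data, not on the glass predictors; the helper names sample_skewness and n_outliers are made up for this example, and the outlier count uses the same 1.5 * IQR whisker rule as boxplot().

```r
# Sample skewness (third standardized moment); values far from 0 indicate asymmetry
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Number of points beyond the boxplot whiskers (1.5 * IQR rule)
n_outliers <- function(x) length(boxplot.stats(x)$out)

# Simulated stand-ins: one symmetric and one right-skewed variable
set.seed(42)
toy <- data.frame(symmetric = rnorm(500), right_skewed = rexp(500))
summary_tab <- data.frame(
  skewness = sapply(toy, sample_skewness),
  outliers = sapply(toy, n_outliers)
)
print(round(summary_tab, 2))
```

With the data from this exercise, sapply(glass, sample_skewness) would give one skewness value per predictor to compare against the histograms.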

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

#library(caret)
Trans <- preProcess(glass, method = "YeoJohnson")
TransData <- predict(Trans, newdata= glass)

hist.data.frame(TransData)

par(mfrow=c(3,3))
for(var in names(TransData)){
  boxplot(TransData[var], main=paste('Boxplot of', var), horizontal = T)
}

The main change relative to the original distributions is that a second mode was induced for Ba and Fe. Given these results, the transformation did not seem to improve the data in terms of skewness. Thus, the Yeo-Johnson transformation was unable to resolve the skewness in these data, although it did reduce the number of unusual observations.
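One reason a power transformation struggles with predictors like Ba and Fe is that it is monotone: a large block of values tied at zero stays tied after the transformation. The sketch below implements the standard Yeo-Johnson formula for a fixed lambda in base R and applies it to a simulated zero-heavy variable. The function name yeo_johnson, the simulated data, and the lambda values are assumptions for illustration (caret estimates lambda from the data), and the function assumes no missing values.

```r
# Yeo-Johnson transformation for a fixed lambda (input must not contain NAs)
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  # Negative values
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log(1 - y[!pos])
  }
  out
}

# Simulated stand-in for a zero-heavy predictor: 80 zeros plus 20 positive values
set.seed(7)
x <- c(rep(0, 80), rexp(20))
for (lambda in c(-1, 0, 0.5)) {
  z <- yeo_johnson(x, lambda)
  cat("lambda =", lambda, " share of values still tied at 0:", mean(z == 0), "\n")
}
```

Whatever lambda is chosen, the zeros map to zero, so the spike remains and the rest of the distribution is only stretched or compressed around it.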

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

When we look closely at this output, we see that the factor levels of some predictors are not informative. For example, the temp column contains integer values. These values correspond to the relative temperature: below average, average, and above average.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# Predictors only (column 1 is Class)
Soybean2 <- Soybean[, 2:36]
par(mfrow = c(3, 6))
for (i in 1:ncol(Soybean2)) {
  smoothScatter(Soybean2[, i], ylab = names(Soybean2)[i])
}

nearZeroVar(Soybean2, names = TRUE, saveMetrics=T)

A few predictors are degenerate because one level dominates and the other levels have very low frequencies. The most important ones are mycelium and sclerotia, whose smoothed density scatterplots show essentially a single band of color across the chart. leaf.mild also shows near-zero variance; int.discolor is imbalanced too (581 of its 645 observed values are 0), though less extremely.
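The degeneracy check can be reproduced by hand from the level counts in the summary(Soybean) output above. The sketch below computes the frequency ratio (count of the most common level divided by the count of the second most common) and compares it with 95/5 = 19, which is the documented default of the freqCut argument of caret's nearZeroVar(). nearZeroVar() additionally requires a low percentage of unique values, which holds for all four predictors here since each has only two or three levels.

```r
# Level counts copied from the summary(Soybean) output above (NAs excluded)
counts <- list(
  mycelium     = c(`0` = 639, `1` = 6),
  sclerotia    = c(`0` = 625, `1` = 20),
  leaf.mild    = c(`0` = 535, `1` = 20, `2` = 20),
  int.discolor = c(`0` = 581, `1` = 44, `2` = 20)
)

# Frequency ratio: most common level count / second most common level count
freq_ratio <- sapply(counts, function(n) {
  s <- sort(n, decreasing = TRUE)
  unname(s[1] / s[2])
})

# Compare against the default frequency cutoff of nearZeroVar()
flagged <- freq_ratio > 95 / 5
print(round(freq_ratio, 2))
print(flagged)
```

The ratios are 106.5 for mycelium, 31.25 for sclerotia, 26.75 for leaf.mild, and about 13.2 for int.discolor, so only the first three exceed the cutoff.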

(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Soybean %>%
  arrange(Class) %>%
  missmap(main = "Missing vs Observed")

The heatmap shows that data are missing, but it does not give a clear picture of how the missing values are distributed across classes, so I will use the naniar package.

gg_miss_var(Soybean, facet = Class)

table(Soybean$Class, complete.cases(Soybean))

##                              
##                               FALSE TRUE
##   2-4-d-injury                   16    0
##   alternarialeaf-spot             0   91
##   anthracnose                     0   44
##   bacterial-blight                0   20
##   bacterial-pustule               0   20
##   brown-spot                      0   92
##   brown-stem-rot                  0   44
##   charcoal-rot                    0   20
##   cyst-nematode                  14    0
##   diaporthe-pod-&-stem-blight    15    0
##   diaporthe-stem-canker           0   20
##   downy-mildew                    0   20
##   frog-eye-leaf-spot              0   91
##   herbicide-injury                8    0
##   phyllosticta-leaf-spot          0   20
##   phytophthora-rot               68   20
##   powdery-mildew                  0   20
##   purple-seed-stain               0   20
##   rhizoctonia-root-rot            0   20

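Missing values are concentrated in five classes. Every sample in 2-4-d-injury (16), cyst-nematode (14), diaporthe-pod-&-stem-blight (15), and herbicide-injury (8) has at least one missing value, and phytophthora-rot has the most incomplete samples in absolute terms (68 of 88). The remaining classes have no missing values, so the pattern of missing data is clearly related to the class.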
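The shares behind this reading can be worked out from the table above. This is a small arithmetic sketch in base R using only the counts printed in that table; the variable names are made up for illustration.

```r
# Incomplete row counts copied from the table above (classes with any incomplete rows)
incomplete <- c(`2-4-d-injury` = 16, `cyst-nematode` = 14,
                `diaporthe-pod-&-stem-blight` = 15,
                `herbicide-injury` = 8, `phytophthora-rot` = 68)
# Complete row counts for the same classes, in the same order
complete_in_those <- c(0, 0, 0, 0, 20)
n_rows <- 683

# Share of rows with at least one missing value, per class and overall
share_incomplete_by_class <- incomplete / (incomplete + complete_in_those)
share_incomplete_overall  <- sum(incomplete) / n_rows

print(round(share_incomplete_by_class, 3))
print(round(share_incomplete_overall, 3))
```

Note that this is the share of rows with at least one missing value, which is not the same quantity as the share of missing cells. With the data loaded, colMeans(is.na(Soybean)) gives the share of missing values per predictor.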

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

# Remove near-zero-variance predictors
Soybean <- Soybean %>%
  select(-leaf.mild, -mycelium, -sclerotia)

# Introduce an additional 10% of missing values at random (on top of the existing NAs)
Soybean.mis <- prodNA(Soybean, noNA = 0.1)
summary(Soybean.mis)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 89   5      :137   0   :313    0   : 65   0   : 73  
##  frog-eye-leaf-spot : 84   4      :122   1   :266    1   : 96   1   :344  
##  alternarialeaf-spot: 83   3      :109   NA's:104    2   :413   2   :177  
##  phytophthora-rot   : 79   6      : 81               NA's:109   NA's: 89  
##  anthracnose        : 40   2      : 72                                    
##  (Other)            :250   (Other): 88                                    
##  NA's               : 58   NA's   : 74                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :393   0   : 55   0   :111   0   :172   0   :270   0   :149   0   :394    
##  1   :114   1   :150   1   :207   1   :295   1   :200   1   :193   1   :206    
##  NA's:176   2   :204   2   :130   2   : 38   2   : 33   2   :186   NA's: 83    
##             3   :184   3   :160   NA's:178   NA's:180   NA's:155               
##             NA's: 90   NA's: 75                                                
##                                                                                
##                                                                                
##   leaves    leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf    stem    
##  0   : 68   0   :201   0   :322   0   : 49   0   :428    0   :504   0   :266  
##  1   :553   1   : 32   1   : 19   1   :307   1   : 86    1   : 43   1   :339  
##  NA's: 62   2   :308   2   :200   2   :200   NA's:169    NA's:136   NA's: 78  
##             NA's:142   NA's:142   NA's:127                                    
##                                                                               
##                                                                               
##                                                                               
##  lodging    stem.cankers canker.lesion fruiting.bodies ext.decay  int.discolor
##  0   :466   0   :343     0   :276      0   :428        0   :445   0   :524    
##  1   : 36   1   : 36     1   : 72      1   : 92        1   :117   1   : 37    
##  NA's:181   2   : 32     2   :158      NA's:163        2   : 10   2   : 19    
##             3   :172     3   : 58                      NA's:111   NA's:103    
##             NA's:100     NA's:119                                             
##                                                                               
##                                                                               
##  fruit.pods fruit.spots   seed     mold.growth seed.discolor seed.size 
##  0   :363   0   :317    0   :427   0   :462    0   :463      0   :476  
##  1   :118   1   : 66    1   :109   1   : 59    1   : 58      1   : 58  
##  2   : 12   2   : 53    NA's:147   NA's:162    NA's:162      NA's:149  
##  3   : 44   4   : 86                                                   
##  NA's:146   NA's:161                                                   
##                                                                        
##                                                                        
##  shriveling  roots    
##  0   :493   0   :496  
##  1   : 36   1   : 78  
##  NA's:154   2   : 15  
##             NA's: 94  
##                       
##                       
##

#impute missing values, using all parameters as default values
Soybean.imp <- missForest(Soybean.mis)

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!

# Extract the imputed data and check that no missing values remain
Soybean2 <- as.data.frame(Soybean.imp$ximp)


Soybean2 %>%
  arrange(Class) %>%
  missmap(main = "Missing vs Observed")
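Because the extra missing values were introduced artificially, the imputed values in those cells can be compared against the known originals. The sketch below computes the proportion of falsely classified entries on a toy categorical data set; the helper name pfc, the toy data, and the "imputed" result are all hypothetical and are not output from missForest.

```r
# Proportion of falsely classified entries among the cells that were
# artificially blanked out (lower is better). Cells that were already
# missing in the original data are skipped because their truth is unknown.
pfc <- function(truth, with_na, imputed) {
  truth_m <- as.matrix(truth)
  blanked <- is.na(as.matrix(with_na)) & !is.na(truth_m)
  mean(truth_m[blanked] != as.matrix(imputed)[blanked])
}

# Toy categorical data (not the Soybean data)
truth <- data.frame(a = factor(c("0", "1", "1", "0")),
                    b = factor(c("2", "2", "1", "1")))
with_na <- truth
with_na$a[2] <- NA
with_na$b[c(1, 4)] <- NA

# Hypothetical imputation result: two of the three blanked cells recovered
imputed <- truth
imputed$b[4] <- "2"

toy_pfc <- pfc(truth, with_na, imputed)
print(toy_pfc)
```

In the workflow above, the analogous call would be pfc(Soybean, Soybean.mis, Soybean.imp$ximp). missForest also reports its own out-of-bag error estimate in Soybean.imp$OOBerror.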