DATA 624 - Homework 4

3.1 The UC Irvine Machine Learning Repository contains a dataset related to glass identification. The data consist of 214 glass samples labeled as one of several categories. There are nine predictors, including the refractive index and percentage of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

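library(mlbench); data(Glass)  # Glass ships with mlbench; later chunks also assume tidyverse, scales, corrplot, and forecast are loaded
head(Glass)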

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

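Looking at the data summary, a few things stand out. First, RI spans a very narrow range (1.511 to 1.534), so on its raw scale it shows little spread. A narrow range alone does not rule out predictive value, since what matters is how RI differs between types relative to its spread within each type. The boxplot below suggests the types overlap considerably on RI, so it may not discriminate well by itself. It may still be useful in conjunction with other variables.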

Glass %>% ggplot(aes(x=Type, y=RI)) + geom_boxplot() +
  theme_bw() +
  labs(title="Refractive Index by Type", y="Refractive Index",
       x="Glass Type", caption="Data courtesy of UCI ML Repository \n https://archive.ics.uci.edu/ml/index.php")

Second, the dependent variable, Type, is imbalanced: types 3, 5, and 6 have only 17, 13, and 9 samples, compared with 70, 76, and 29 for types 1, 2, and 7.

Glass %>% ggplot() + geom_bar(aes(x=Type, y=..prop.., group=2)) +
  theme_bw() +
  scale_y_continuous(labels=percent) +
  labs(title="Glass Types", y="Percent",
       x="Glass Type", caption="Data courtesy of UCI ML Repository \n https://archive.ics.uci.edu/ml/index.php")
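
As a rough numeric companion to the bar chart, the class shares can be computed directly from the counts printed by summary(Glass). This is a minimal sketch; the names `type_counts`, `type_pct`, and `imbalance_ratio` are illustrative and not part of the original analysis.

```r
# Class counts copied from the summary(Glass) output above
type_counts <- c("1" = 70, "2" = 76, "3" = 17, "5" = 13, "6" = 9, "7" = 29)

# Share of each type, in percent of all samples
type_pct <- round(100 * type_counts / sum(type_counts), 1)
print(type_pct)

# Imbalance ratio: largest class relative to smallest class
imbalance_ratio <- max(type_counts) / min(type_counts)
print(round(imbalance_ratio, 1))
```

With the data loaded, `prop.table(table(Glass$Type))` should give the same shares without hard-coding the counts.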

Next we look at the individual element predictors to see how they are distributed by the glass types:

# Boxplots of Element Compositions

lapply(names(Glass)[2:9], function(i){
  ggplot(Glass, aes_string(x="Type",y=i)) + geom_boxplot() +
    theme_bw() +
    labs(title=paste0("Element ",i," by Glass Type"), y="Value",
         x="Glass Type", caption="Data courtesy of UCI ML Repository \n https://archive.ics.uci.edu/ml/index.php")
})

[Figures: eight boxplots, one per element (Na, Mg, Al, Si, K, Ca, Ba, Fe), each showing the element's value by glass type]

The boxplots show us a few things:

Na, Si, K, and Ca do not vary much between glass types.
Mg and Fe tend to be higher in types 1-3 than in types 5-7.
Al values overlap across types, so Al may also be poor at distinguishing between types of glass.
Ba appears to be a clear indicator of glass type 7.

In short, no single variable appears able to distinguish all of the glass types on its own.

Next we’ll investigate correlation between the predictor variables:

corrplot(cor(Glass[,1:9]), method="number", diag=F)

Here we see that the strongest correlations are between Ca and RI (positive) and between Si and RI (negative). The remaining pairs are correlated to some degree, but none exceeds 0.5 in absolute value.
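
Reading the strongest pairs off a correlation plot by eye is error-prone, so a ranked table can help. The following is a sketch only: `top_correlations` is a hypothetical helper written for illustration (it is not part of corrplot), and the toy data frame stands in for the real predictors.

```r
# Rank predictor pairs by absolute Pearson correlation.
# `top_correlations` is a hypothetical helper, not a corrplot function.
top_correlations <- function(df, n = 5) {
  cm <- cor(df, use = "pairwise.complete.obs")
  idx <- which(upper.tri(cm), arr.ind = TRUE)   # each pair once, no diagonal
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]        # strongest pairs first
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Toy data with a known structure: b tracks a, c is unrelated noise
set.seed(1)
toy <- data.frame(a = 1:20)
toy$b <- 2 * toy$a + rnorm(20, sd = 0.5)
toy$c <- rnorm(20)
top <- top_correlations(toy, n = 3)
print(top)
```

Applied to the homework data, the call would be `top_correlations(Glass[, 1:9])`.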

b) Do there appear to be any outliers in the data? Are any predictors skewed?

The histograms below show the distribution of each variable, including how the variables are skewed and what sort of outliers may exist:

lapply(names(Glass)[1:9], function(i){
  Glass %>% ggplot(aes_string(x=i)) + geom_histogram(bins=30) +
    labs(title=paste0("Distribution of ",i," variable"),
         x="Value",y="Count")
}
)

[Figures: nine histograms, one per predictor (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe)]


The RI, Na, Al, and Si variables bear some resemblance to a normal distribution, with some (such as RI) exhibiting a bit of skew.
The remaining variables are far from normal. Mg in particular appears bimodal, while K, Ba, and Fe are concentrated at the low end with long right tails (the medians of Ba and Fe are both 0).
Regarding outliers, K has at least one value (its maximum of 6.21) far above the bulk of the data, whose third quartile is only 0.61. Ba also seems to have a few possible outliers.
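
Skew can also be quantified rather than judged from histograms alone. Below is a minimal base-R sketch of one common sample skewness formula (third central moment scaled by the variance to the power 3/2). The helper name `sample_skew` and the example vectors are made up for illustration.

```r
# Sample skewness as a base-R sketch (hypothetical helper `sample_skew`).
sample_skew <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  v <- sum(d^2) / (n - 1)          # sample variance
  sum(d^3) / ((n - 1) * v^1.5)     # third moment scaled by variance^(3/2)
}

symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(0, 0, 0, 0.1, 0.2, 3)   # made-up values: mostly zeros, one large
print(sample_skew(symmetric))
print(sample_skew(right_tail))
```

A value near zero indicates a roughly symmetric distribution, and a positive value indicates a longer right tail. On the homework data, `sapply(Glass[, 1:9], sample_skew)` would give one value per predictor.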

c) Are there any relevant transformations of one or more predictors that might improve the classification model?

We can try Box-Cox transforms on the predictors, letting BoxCox() estimate lambda for each one via lambda = "auto".

lapply(names(Glass)[1:9], function(i){
  Glass %>% ggplot(aes(x=BoxCox(get(i),lambda="auto"))) +
    geom_histogram(bins=30) +
    labs(title=paste0("Distribution of ",i," variable"),
         subtitle = "Box-Cox Transformed",
         x="Value",y="Count")
}
)

[Figures: nine histograms of the Box-Cox transformed predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe)]


The output histograms show, however, that the transforms did little for the variables that were badly skewed and far from normal. One likely reason is that Box-Cox is defined only for strictly positive values, and Mg, K, Ba, and Fe all contain zeros (each has a minimum of 0 in the summary above).
For the variables that were already similar to a normal distribution, the transformation had little impact.
I would conclude that Box-Cox transformations are not appropriate for these variables as they stand.
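
One possible alternative, not used in the analysis above, is the Yeo-Johnson transform, which extends the Box-Cox idea to zero and negative values. The sketch below implements both for a fixed lambda to show the difference at zero. The function names are illustrative, lambda = 0 is an arbitrary choice rather than an estimated value, and the input mixes a zero with the six K values printed by head(Glass).

```r
# Box-Cox for a fixed lambda; defined only for strictly positive x
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Yeo-Johnson for a fixed lambda; defined for any real x
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  out[pos] <- if (lambda == 0) log1p(x[pos]) else ((x[pos] + 1)^lambda - 1) / lambda
  out[!pos] <- if (lambda == 2) -log1p(-x[!pos]) else
    -(((-x[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  out
}

# A zero (as found in Ba and Fe) followed by the six K values from head(Glass)
x <- c(0, 0.06, 0.48, 0.39, 0.57, 0.55, 0.64)
print(box_cox(x, lambda = 0))       # the zero maps to -Inf
print(yeo_johnson(x, lambda = 0))   # finite everywhere, equals log(1 + x)
```

In practice lambda would be estimated from the data; caret's `preProcess()` offers `method = "YeoJohnson"` for that. Even so, a monotone transform cannot remove a spike of identical zeros, so Ba and Fe would likely stay far from normal.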

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g. temperature, precipitation) and plant conditions (e.g. leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)  # Soybean also ships with the mlbench package
summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

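An inspection of the summary above shows that no predictor is strictly constant. Several are severely imbalanced, though, and come close to the near-zero-variance case: in mycelium (639 vs. 6), sclerotia (625 vs. 20), and leaf.mild (535 vs. 20 and 20), the most common level is more than 20 times as frequent as the next.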
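
The imbalance can be checked numerically with a frequency ratio: the count of the most common level divided by the count of the second most common. The sketch below hard-codes a handful of level counts from the summary(Soybean) output; the cutoff of 20 is an assumed rule of thumb, and the variable names are illustrative.

```r
# Level counts copied from the summary(Soybean) output above (NAs excluded)
level_counts <- list(
  mycelium  = c("0" = 639, "1" = 6),
  sclerotia = c("0" = 625, "1" = 20),
  leaf.mild = c("0" = 535, "1" = 20, "2" = 20),
  leaves    = c("0" = 77,  "1" = 606),
  hail      = c("0" = 435, "1" = 127)
)

# Ratio of the most common level to the second most common level
freq_ratio <- sapply(level_counts, function(cnt) {
  s <- sort(cnt, decreasing = TRUE)
  unname(s[1] / s[2])
})
print(round(freq_ratio, 2))

# Flag predictors above an assumed cutoff of 20
print(names(freq_ratio)[freq_ratio > 20])
```

With the caret package installed, `nearZeroVar(Soybean, saveMetrics = TRUE)` runs a similar check across all predictors at once.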

b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# Check NAs for predictors by class
temp <- map_dfr(Soybean, fct_explicit_na) %>%
  pivot_longer(-Class, names_to="predictor",
               values_to="val",
               values_ptypes = list(val=character())) %>%
  group_by(Class, predictor) %>%
  mutate(classTotal = n()) %>%
  group_by(classTotal, value = as.factor(val), .add=T, .drop=F) %>%
  summarize(cnt = n())

temp %>%
  filter(value == "(Missing)") %>%
  ggplot(aes(x=predictor, y=Class)) +
  geom_tile(aes(fill=cnt/classTotal)) +
  scale_fill_gradient2(labels=percent,
                       low="darkgreen",
                       mid="yellow",
                       high="firebrick1",
                       midpoint=.5) +
  theme(axis.text.x = element_text(angle=90, vjust=0.5),
        panel.background = element_blank()) +
  labs(title="Missing predictor values by Class",
       x="Predictor Variable",y="Class", fill="% Missing")

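The above visual shows missingness by class. The summary output, however, shows that predictors are not equally affected: hail, sever, seed.tmt, and lodging are each missing 121 of 683 values (about 18%), while leaves is never missing and date and area.dam are each missing only once.
The missing values are also concentrated in a few outcome classes, namely 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and, to a lesser degree, phytophthora-rot.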
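
The two questions here (which predictors are missing most, and which classes carry the missing rows) can also be answered with simple counts in base R. This is a sketch on a made-up toy data frame standing in for Soybean; the object names are illustrative.

```r
# Toy stand-in for Soybean: a Class column plus two factor predictors
toy <- data.frame(
  Class  = factor(c("a", "a", "a", "b", "b", "b")),
  hail   = factor(c("0", "1", "0", NA, NA, NA)),
  leaves = factor(c("1", "1", "0", "1", "0", "1"))
)

# Number of missing values per predictor
na_per_predictor <- colSums(is.na(toy[, names(toy) != "Class"]))
print(na_per_predictor)

# Share of rows with at least one missing value, by class
incomplete_by_class <- tapply(!complete.cases(toy), toy$Class, mean)
print(incomplete_by_class)
```

Replacing `toy` with `Soybean` gives the per-predictor NA counts and the fraction of incomplete rows in each disease class.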

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Looking at the visual above, if we used only predictors that have at least some values for every class, we'd be limited to just 3 predictors.
On the other hand, if we used only complete cases (rows where every predictor has a value), we'd lose 4 classes completely.
This leaves imputation as our best bet, but it would have to be applied with great care. For example, are some NAs present because the measurement does not apply to that kind of outcome, or is the measurement simply missing?
Alternatively, we could consider a classification method that is more robust to missing data.
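
Two simple ways of handling NAs in a categorical predictor are sketched below, under the assumption that every predictor is stored as a factor. Neither is the strategy chosen above; the helper names `na_as_level` and `impute_mode` and the example values are hypothetical.

```r
# Option 1: keep missingness as information by making NA its own level
na_as_level <- function(f, label = "(Missing)") {
  f <- addNA(f)                              # appends NA as a factor level
  levels(f)[is.na(levels(f))] <- label       # give that level a readable name
  f
}

# Option 2: replace NA with the most frequent observed level
impute_mode <- function(f) {
  mode_level <- names(which.max(table(f)))   # table() drops NA by default
  f[is.na(f)] <- mode_level
  f
}

precip <- factor(c("2", "2", "1", NA, "0", "2", NA))  # made-up values
print(table(na_as_level(precip)))
print(table(impute_mode(precip)))
```

Option 1 suits the case where an NA means the measurement does not apply to that disease, since the model can then use the missingness itself. Option 2 assumes the value exists but was not recorded, and it could blur classes whose rows are almost entirely missing.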

Adam Douglas

9/27/2020
