DS 624: Homwork 4 (Data Processesing & Overfitting)

Source Code: https://github.com/djlofland/DATA624_PredictiveAnalytics/tree/master/Homework_4

Problem 3.1

The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

> library(mlbench)
> data(Glass)
> str(Glass)

'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

data(Glass)

# Display summary statistics
summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

# Prepare data for ggplot (remove the Target 'Type' column)
feature_df <- Glass %>% 
  select(-Type)

feature_gather_df <- feature_df %>% 
  gather(key = 'variable', value = 'value')

# Histogram plots of each variable
ggplot(feature_gather_df) + 
  geom_histogram(aes(x=value, y = ..density..), bins=30) + 
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=4)

# Skewness for each Predictor
(skewValues <- apply(feature_df, 2, skewness))

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107

# Boxplots for each variable
ggplot(feature_gather_df, aes(variable, value)) + 
  geom_boxplot() + 
  facet_wrap(. ~variable, scales='free', ncol=6)

# Identify missing data by Feature and display percent breakout
missing <- colSums(Glass %>% sapply(is.na))
missing_pct <- round(missing / nrow(Glass) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))

##    values  ind
## 1       0   RI
## 2       0   Na
## 3       0   Mg
## 4       0   Al
## 5       0   Si
## 6       0    K
## 7       0   Ca
## 8       0   Ba
## 9       0   Fe
## 10      0 Type

# Calculate and plot the Multicolinearity
correlation <- cor(Glass[,1:ncol(Glass) - 1], use = 'pairwise.complete.obs')

corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
         col=brewer.pal(n=8, name="RdYlBu"))

# Plot scatter plots of each variable versus the target variable
featurePlot(Glass[,1:ncol(Glass)-1], Glass[,ncol(Glass)], pch = 20)

See discussion below

Do there appear to be any outliers in the data? Are any predictors skewed?

There appear to be outliers in essentailly all the features and all are skewed to some level. Several predictors have a large number of 0 values, which I will assume isn’t measurement error but rather lack of that atom in the glass type. There is a question whether to treat 0 values as “outliers” since those are probably real measurements with intrinsic value. Mg is the only element without statistical outliers. From the correlation plots, we see that Si and Rl have a somewhat strong negative correlation and Ca nad RL have a more significant positive correlation. Mg has a somewhat negative correlation with several elements: Al, Ba, Ba and Ca, but it is a weak correlation.

Are there any relevant transformations of one or more predictors that might improve the classification model?

Interestingly, several predictor show bimodal distributions (see Mg, Na, Rl, K) - in a model advanced model, we might leverage mixtools package to see if we can tease out the underlying pattern and explore whether adding a classification feature along with 2 more normal distributions rather than a combined bimodal would give better model performance. As a strategy, I would probably run all the features through an exploratory BoxCox transformation to see what suggestions it provides and compare the transformed data against raw (visually and in a model - check both model performance and quality of residuals). Of all the features, Ca, Na and Si are closest to a normal shape. Ba ans Fe are interesting as they almost appear more like discrete values at specific step and less continuous than other elements. This could be an artifact of measurement or maybe an intended aspect of glass making that a domain expert could shed light on. Once “normalized” (or as close as we can get), we could also attempt scaling and centering and assess whether that gives any/all of the predictors more resolving power when used in a model, i.e. convert the predictor to a z-score. Ultimately we are looking for the most well behaved data in our model, not perfect.

While the chapter discusses PCA, we need to carefully consider whether predictive power or model explanatory value is more important. PCA can often handle problems with multicolinearity and reduce dimensionality in our feature space, however, at the cost of explainable features. I generally avoid PCA if I expect someone to ask me how features are related to predictions. This is certainly true during early model exploration. Once all stakeholders have confidence with a model, then PCA can be layered in to tweak out further performance. That said, the moment I’m reaching for PCA, I’ll probably more likely to explore other modeling approaches that are insensitive to multicolinearity and/or larger feature sets. For example, Neural Networks can often handle these more robustly.

Problem 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

> library(mlbench)
> data(Soybean)
> ## See ?Soybean for details

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Note that according to http://search.r-project.org/library/mlbench/html/Soybean.html, all the features are categorical.

data(Soybean)

# Display summary statistics
summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

Soybean %>%
  ggplot(aes(x=Class)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90))

# Make sure all features are categorical
feature_df <- as_tibble(Soybean) %>%
  convert(fct())

# Prepare data for ggplot (remove the Target 'Class' column)
feature_gather_df <- feature_df %>% 
  select(-Class) %>%
  gather(key = 'variable', value = 'value')

## Warning: attributes are not identical across measure variables;
## they will be dropped

# Histogram plots of each variable
ggplot(data=feature_gather_df, aes(x=value)) + 
  geom_bar() +
  facet_wrap(variable~., ncol=4)

Since all the features are by definition categorical, we don’t expect outliers per se as those have been coded. However, with this dataset, we do see a significant number of missing values which can be problematic, especially if we have smaller data sets. The question is whether missing values are meaningful - were these points miss-coded or missing, or does the fact they are missing hold meaning such that we should code those in a meaningful way. This comes down to domain expertise. Common approaches are to drop a feature if it has too many missing values as it probably offers less explanatory values. Another approaches, esp with continuous variables, is imputing with a class mean or median. However, with categorical features, this can be more challenging. If we have an ordered categorical, we might imput with the middle, but that may or may not be the correct solution. The book suggests impute.knn() which will look for other observations with similar non-NA features and impute the missing value based on the other rows. This is probably a reasonable approach with this dataset. We should also consider any features that have very little variance - these may offer less resolution for a model. We ideally want features with higher variance so we get better resolution power from that variable.

Alternatively, if a specific row (observation) has too many NAs, we might drop that row. The aggressive approach is to drop any row or column with NAs; however, we could be throwing away valuable information a model could use.

Looking at specific features, we also see class imbalances (e.g., seed.discolor has 513 zeros and 64 ones). Class imbalances might possible reduce resolution of our model by inflating the importance of our dominant class. Different models are more/less sensitive to class imbalances, but it’s certainly a consideration.

Looking at the count of observations for each class, we have some clear imbalances present in the data.

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

vis_miss(Soybean)

Soybean %>% 
  select(-Class) %>% 
  gg_miss_var()

gg_miss_upset(Soybean, 
              nsets = 15,
              nintersects = NA)

missing <- Soybean %>%
  group_by(Class) %>%
  miss_var_summary()

ggplot(missing, aes(Class, variable, fill=pct_miss)) + 
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90))

Using the naniar package we can quickly see patterns with the missing data. Most of the missing data are associated with the 5 classes: 2-4-d-injury, cyst-nematode, diaporthe-pod, herbicide-injury and phytophthora-rot. From the interaction plot (see above), we can see patterns with missing data across the features (note, I truncated this plot to only show more common interactions). Comparing with the bar chart showing observations by class, the missing data is more prevalent in our less represented classes.

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Before imputing, I would explore bootstrapping (up-sample or down-sample) to deal with class imbalances. The problem is that imputing based on imbalanced data will likely lead to incorrect class assignment by a trained model and lowered resolution.
Since we are working with categorical data, the only features where we could consider mean or median imputation or ordinal categorical; however, we need to assess any imbalances within each of those features before blindly using a mean or median. The safer approach with categorical is to use KNN which will look for similar observations based on other features to choose a value. date for example is the month and we could possible replace a missing date with the median date, though mean might be ok.
I would then do an exploratory model with knn imputed values and step-wise remove the features with the most missing data (e.g. sever, seed.tmt, etc), as imputed values are inherently less trustworthy, and see how model performance changes. As long as performance remains the same or improves, I would leave out those features.
With exploratory models, I would next assess predictive accuracy for those classes that started with the most missing data (e.g. 2-4-d-injury). These may be problematic as they are also the classes with fewer observations.

DS 624: Homwork 4 (Data Processesing & Overfitting)

Kuhn: 3.1, 3.2

Donny Lofland

Problem 3.1

Problem 3.2