Source Code: https://github.com/djlofland/DATA624_PredictiveAnalytics/tree/master/Homework_4
The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
> library(mlbench)
> data(Glass)
> str(Glass)
'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
# Prepare data for ggplot (remove the Target 'Type' column)
feature_df <- Glass %>%
select(-Type)
feature_gather_df <- feature_df %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(feature_gather_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=4)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
# Boxplots for each variable
ggplot(feature_gather_df, aes(variable, value)) +
geom_boxplot() +
facet_wrap(. ~variable, scales='free', ncol=6)
# Identify missing data by Feature and display percent breakout
missing <- colSums(Glass %>% sapply(is.na))
missing_pct <- round(missing / nrow(Glass) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))
## values ind
## 1 0 RI
## 2 0 Na
## 3 0 Mg
## 4 0 Al
## 5 0 Si
## 6 0 K
## 7 0 Ca
## 8 0 Ba
## 9 0 Fe
## 10 0 Type
# Calculate and plot the Multicolinearity
correlation <- cor(Glass[,1:ncol(Glass) - 1], use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
col=brewer.pal(n=8, name="RdYlBu"))
# Plot scatter plots of each variable versus the target variable
featurePlot(Glass[,1:ncol(Glass)-1], Glass[,ncol(Glass)], pch = 20)
See discussion below
There appear to be outliers in essentailly all the features and all are skewed to some level. Several predictors have a large number of 0 values, which I will assume isn’t measurement error but rather lack of that atom in the glass type. There is a question whether to treat 0 values as “outliers” since those are probably real measurements with intrinsic value. Mg is the only element without statistical outliers. From the correlation plots, we see that Si and Rl have a somewhat strong negative correlation and Ca nad RL have a more significant positive correlation. Mg has a somewhat negative correlation with several elements: Al, Ba, Ba and Ca, but it is a weak correlation.
Interestingly, several predictor show bimodal distributions (see Mg, Na, Rl, K) - in a model advanced model, we might leverage mixtools
package to see if we can tease out the underlying pattern and explore whether adding a classification feature along with 2 more normal distributions rather than a combined bimodal would give better model performance. As a strategy, I would probably run all the features through an exploratory BoxCox transformation to see what suggestions it provides and compare the transformed data against raw (visually and in a model - check both model performance and quality of residuals). Of all the features, Ca, Na and Si are closest to a normal shape. Ba ans Fe are interesting as they almost appear more like discrete values at specific step and less continuous than other elements. This could be an artifact of measurement or maybe an intended aspect of glass making that a domain expert could shed light on. Once “normalized” (or as close as we can get), we could also attempt scaling and centering and assess whether that gives any/all of the predictors more resolving power when used in a model, i.e. convert the predictor to a z-score. Ultimately we are looking for the most well behaved data in our model, not perfect.
While the chapter discusses PCA, we need to carefully consider whether predictive power or model explanatory value is more important. PCA can often handle problems with multicolinearity and reduce dimensionality in our feature space, however, at the cost of explainable features. I generally avoid PCA if I expect someone to ask me how features are related to predictions. This is certainly true during early model exploration. Once all stakeholders have confidence with a model, then PCA can be layered in to tweak out further performance. That said, the moment I’m reaching for PCA, I’ll probably more likely to explore other modeling approaches that are insensitive to multicolinearity and/or larger feature sets. For example, Neural Networks can often handle these more robustly.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
> library(mlbench)
> data(Soybean)
> ## See ?Soybean for details
Note that according to http://search.r-project.org/library/mlbench/html/Soybean.html, all the features are categorical.
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
# Make sure all features are categorical
feature_df <- as_tibble(Soybean) %>%
convert(fct())
# Prepare data for ggplot (remove the Target 'Class' column)
feature_gather_df <- feature_df %>%
select(-Class) %>%
gather(key = 'variable', value = 'value')
## Warning: attributes are not identical across measure variables;
## they will be dropped
# Histogram plots of each variable
ggplot(data=feature_gather_df, aes(x=value)) +
geom_bar() +
facet_wrap(variable~., ncol=4)
Since all the features are by definition categorical, we don’t expect outliers per se as those have been coded. However, with this dataset, we do see a significant number of missing values which can be problematic, especially if we have smaller data sets. The question is whether missing values are meaningful - were these points miss-coded or missing, or does the fact they are missing hold meaning such that we should code those in a meaningful way. This comes down to domain expertise. Common approaches are to drop a feature if it has too many missing values as it probably offers less explanatory values. Another approaches, esp with continuous variables, is imputing with a class mean or median. However, with categorical features, this can be more challenging. If we have an ordered categorical, we might imput with the middle, but that may or may not be the correct solution. The book suggests impute.knn() which will look for other observations with similar non-NA features and impute the missing value based on the other rows. This is probably a reasonable approach with this dataset. We should also consider any features that have very little variance - these may offer less resolution for a model. We ideally want features with higher variance so we get better resolution power from that variable.
Alternatively, if a specific row (observation) has too many NAs, we might drop that row. The aggressive approach is to drop any row or column with NAs; however, we could be throwing away valuable information a model could use.
Looking at specific features, we also see class imbalances (e.g., seed.discolor has 513 zeros and 64 ones). Class imbalances might possible reduce resolution of our model by inflating the importance of our dominant class. Different models are more/less sensitive to class imbalances, but it’s certainly a consideration.
Looking at the count of observations for each class, we have some clear imbalances present in the data.
missing <- Soybean %>%
group_by(Class) %>%
miss_var_summary()
ggplot(missing, aes(Class, variable, fill=pct_miss)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90))
Using the naniar
package we can quickly see patterns with the missing data. Most of the missing data are associated with the 5 classes: 2-4-d-injury, cyst-nematode, diaporthe-pod, herbicide-injury and phytophthora-rot. From the interaction plot (see above), we can see patterns with missing data across the features (note, I truncated this plot to only show more common interactions). Comparing with the bar chart showing observations by class, the missing data is more prevalent in our less represented classes.
date
for example is the month and we could possible replace a missing date
with the median date
, though mean might be ok.