Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# Required libraries
library(ggplot2)
library(tidyverse)
Glass %>% select(-c(Type)) %>%
gather() %>%
ggplot(aes(x = value)) + geom_histogram(bins=30) + facet_wrap(~key, scales = 'free')
The histograms above show roughly normal distributions for Al, Na, and Si. The histograms for Ca and RI aren't quite normal, as both have a bit of right skew. The histograms for Ba, Fe, and K are clearly right skewed, and Mg appears bimodal.
library(corrplot)
corr <- Glass %>% select(-c(Type)) %>% cor()
corrplot(corr, method="number")
The correlation plot above shows a high positive correlation between Ca and RI, indicating the two predictors carry almost the same information. The only other pair of predictors with a correlation above 0.5 or below -0.5 is Si and RI, at -0.54, meaning the two have an inverse relationship.
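As a programmatic check, caret's findCorrelation function flags predictors recommended for removal at a chosen correlation cutoff; a minimal sketch (the 0.75 cutoff is my own choice here, matching the cutoff used in the book's example):
library(caret)
# Flag predictors whose pairwise correlations exceed the cutoff; names = TRUE returns column names to drop
highCorr <- findCorrelation(corr, cutoff = 0.75, names = TRUE)
highCorr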
Do there appear to be any outliers in the data? Are any predictors skewed?
Glass %>% select(-c(Type)) %>%
gather() %>%
ggplot(aes(x = value)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
facet_wrap(~key, scales = 'free')
Yes. Based on the histograms from section (a) and the boxplots above, outliers appear as the black dots in the boxplots beyond the whiskers (1.5 IQR from the hinge). Thus, every predictor variable except Mg appears to have outliers by that definition. As mentioned in section (a), most of the predictors are also skewed: the histograms for Ba, Ca, and K clearly show right skew, while Fe and RI show a milder right skew.
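To put numbers behind that visual impression, here is a small sketch counting, for each predictor, the values falling more than 1.5 IQR beyond the quartiles (approximately the whisker rule geom_boxplot uses):
# Count values beyond 1.5 * IQR from the first/third quartile for each predictor
Glass %>% select(-c(Type)) %>%
gather() %>%
group_by(key) %>%
summarise(Outliers = sum(value < quantile(value, 0.25) - 1.5 * IQR(value) |
value > quantile(value, 0.75) + 1.5 * IQR(value)))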
library(e1071)
# Compute the skewness of each of the nine predictors
skewValues <- apply(Glass[1:9], 2, skewness)
(skewValues)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
Based on the skewness function, all the predictors are right skewed except for Mg and Si. The predictors with skew values furthest from zero are K, Ba, and Ca, which matches the results of the visual assessment from the histograms.
Are there any relevant transformations of one or more predictors that might improve the classification model?
The first transformation would be the removal of either Ca or RI, given their high positive correlation. For the predictors with clear right skew, a Box-Cox transformation can be applied: it identifies the appropriate log, square, square-root, or inverse transformation to resolve the skewness and bring the data closer to a normal distribution for use in a predictive model. For the outliers, a spatial sign transformation can be applied, which projects the predictor values onto a multidimensional sphere, in essence making all the sample values the same distance from the center of the sphere.
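As an illustration of the lambda estimation, caret's BoxCoxTrans function can be applied to a single predictor; a sketch on Ca, which is strictly positive as Box-Cox requires:
library(caret)
# Estimate the Box-Cox lambda for Ca; printing the object reports the chosen lambda
caTrans <- BoxCoxTrans(Glass$Ca)
caTrans
# Apply the estimated transformation to the original values
caTransformed <- predict(caTrans, Glass$Ca)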
library(caret)
# Box Cox, center, scale transformations
trans <- preProcess(Glass, method = c("BoxCox", "center", "scale"))
transformed <- predict(trans, Glass)
transformed %>% select(-c(Type)) %>%
gather() %>%
ggplot(aes(x = value)) + geom_histogram(bins=30) + facet_wrap(~key, scales = 'free')
Applying the Box-Cox transformation above with centering and scaling, I'll be honest: the shape of the near-normal distributions did not improve much, and the amount of skew does not appear to have improved either.
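One likely explanation: Box-Cox is only defined for strictly positive data, and Ba, Fe, K, and Mg all contain zeros, so preProcess leaves them untransformed. A quick sketch to verify which predictors contain non-positive values and to summarize what was actually applied:
# Predictors containing zeros (or negatives) cannot receive a Box-Cox transformation
sapply(Glass[1:9], function(x) any(x <= 0))
# Printing the preProcess object summarizes how many predictors each method touched
print(trans)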
library(caret)
# Spatial sign, center, scale transformations
trans <- preProcess(Glass, method = c("spatialSign", "center", "scale"))
transformed <- predict(trans, Glass)
transformed %>% select(-c(Type)) %>%
gather() %>%
ggplot(aes(x = value)) + geom_histogram(bins=30) + facet_wrap(~key, scales = 'free')
Applying the spatial sign transformation above with centering and scaling, the non-normal distributions for Ba, Fe, K, and Mg look less extreme than in the initial histograms.
library(caret)
# Box-Cox, center, scale, PCA transformations
trans <- preProcess(Glass, method = c("BoxCox", "center", "scale", "pca"))
transformed <- predict(trans, Glass)
transformed %>% select(-c(Type)) %>%
gather() %>%
ggplot(aes(x = value)) + geom_histogram(bins=30) + facet_wrap(~key, scales = 'free')
And finally, following the book's example, I've applied Box-Cox, centering, scaling, and PCA. Based on the above histograms of the principal components, the visual assessment looks good with regard to normal distributions.
At this point, I'd probably go with PCA, despite my initial paragraph describing Box-Cox and spatial sign.
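To see how many components were retained (preProcess keeps enough components to capture 95% of the variance by default), the preProcess object can be inspected; a sketch, assuming caret stores the PCA loadings in the object's rotation element:
# The printed summary reports the number of components needed for 95% of the variance
print(trans)
# The loadings matrix has one column per retained component
ncol(trans$rotation)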
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
data(Soybean)
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
## See ?Soybean for details
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
First, I generate a barplot for each predictor variable showing its frequency distribution, in order to provide an initial visual assessment.
Soybean %>% select(-c(Class)) %>%
gather() %>%
drop_na() %>%
ggplot(aes(x = value)) + geom_bar() + facet_wrap(~key, ncol=3, scales = 'free')
The barplots above show that many predictors may be susceptible to severely disproportionate frequencies. The nearZeroVar function below calculates and identifies the predictor variables with near-zero variance, and thus degenerate frequency distributions.
library(caret)
# saveMetrics = TRUE returns the frequency ratio, percent unique, and nzv flag for each variable
cols <- nearZeroVar(Soybean, saveMetrics = TRUE, names = TRUE)
(cols)
## freqRatio percentUnique zeroVar nzv
## Class 1.010989 2.7818448 FALSE FALSE
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
The output from the nearZeroVar function identifies leaf.mild, mycelium, and sclerotia as having near zero variance, and thus degenerate distributions. None of the predictor variables have zero variance based on the above calculations. Given these three variables have near-zero variance, they would be good candidates to remove from the predictive model.
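A short sketch of how those degenerate predictors could be dropped before modeling:
# Without saveMetrics, nearZeroVar returns the column indices of the offending predictors
nzv_cols <- nearZeroVar(Soybean)
Soybean_filtered <- Soybean[, -nzv_cols]
dim(Soybean_filtered)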
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# Count the number of observations with a missing value by predictor variable
colSums_Missing_Count <- data.frame(colSums(is.na(Soybean)))
# Name the column for the NA count
colnames(colSums_Missing_Count) <- "NA.Count"
# Convert the index column into a named column to keep the variable names
colSums_Missing_Count <- cbind(Variable = rownames(colSums_Missing_Count), colSums_Missing_Count)
rownames(colSums_Missing_Count) <- 1:nrow(colSums_Missing_Count)
# Sort by the missing count in descending order
colSums_Missing_Count <- colSums_Missing_Count[order(-colSums_Missing_Count$NA.Count),]
# Output the results
(colSums_Missing_Count)
## Variable NA.Count
## 6 hail 121
## 9 sever 121
## 10 seed.tmt 121
## 21 lodging 121
## 11 germ 112
## 19 leaf.mild 108
## 24 fruiting.bodies 106
## 30 fruit.spots 106
## 33 seed.discolor 106
## 35 shriveling 106
## 17 leaf.shread 100
## 31 seed 92
## 32 mold.growth 92
## 34 seed.size 92
## 14 leaf.halo 84
## 15 leaf.marg 84
## 16 leaf.size 84
## 18 leaf.malf 84
## 29 fruit.pods 84
## 4 precip 38
## 22 stem.cankers 38
## 23 canker.lesion 38
## 25 ext.decay 38
## 26 mycelium 38
## 27 int.discolor 38
## 28 sclerotia 38
## 3 plant.stand 36
## 36 roots 31
## 5 temp 30
## 7 crop.hist 16
## 12 plant.growth 16
## 20 stem 16
## 2 date 1
## 8 area.dam 1
## 1 Class 0
## 13 leaves 0
library(naniar)
vis_miss(Soybean)
The 5 predictor variables with the highest missing-value counts are hail, sever, seed.tmt, lodging, and germ. Overall, 11 predictor variables have over 100 missing values.
The vis_miss function from the naniar package shows that the missing values tend to occur in the same observations. The next step is to identify whether those observations align with specific class values.
# Rows containing at least one missing value, grouped by class
Soybean %>%
filter(!complete.cases(.)) %>%
group_by(Class)
## # A tibble: 121 × 36
## # Groups: Class [5]
## Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt
## <fct> <fct> <ord> <ord> <ord> <fct> <fct> <fct> <fct> <fct>
## 1 phyto… 1 1 2 1 <NA> 3 1 <NA> <NA>
## 2 phyto… 2 1 2 2 <NA> 2 1 <NA> <NA>
## 3 phyto… 2 1 2 2 <NA> 2 1 <NA> <NA>
## 4 phyto… 3 1 2 1 <NA> 2 1 <NA> <NA>
## 5 phyto… 2 1 1 1 <NA> 0 1 <NA> <NA>
## 6 phyto… 2 1 2 1 <NA> 1 1 <NA> <NA>
## 7 phyto… 1 1 2 1 <NA> 1 1 <NA> <NA>
## 8 phyto… 2 1 2 2 <NA> 3 1 <NA> <NA>
## 9 phyto… 2 1 1 2 <NA> 2 1 <NA> <NA>
## 10 phyto… 1 1 2 1 <NA> 0 1 <NA> <NA>
## # … with 111 more rows, and 26 more variables: germ <ord>, plant.growth <fct>,
## # leaves <fct>, leaf.halo <fct>, leaf.marg <fct>, leaf.size <ord>,
## # leaf.shread <fct>, leaf.malf <fct>, leaf.mild <fct>, stem <fct>,
## # lodging <fct>, stem.cankers <fct>, canker.lesion <fct>,
## # fruiting.bodies <fct>, ext.decay <fct>, mycelium <fct>, int.discolor <fct>,
## # sclerotia <fct>, fruit.pods <fct>, fruit.spots <fct>, seed <fct>,
## # mold.growth <fct>, seed.discolor <fct>, seed.size <fct>, …
The above output indicates that 121 rows are missing at least one value.
# Count the rows containing missing values, by class
Soybean %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
summarize(Count = n())
## # A tibble: 5 × 2
## Class Count
## <fct> <int>
## 1 2-4-d-injury 16
## 2 cyst-nematode 14
## 3 diaporthe-pod-&-stem-blight 15
## 4 herbicide-injury 8
## 5 phytophthora-rot 68
The above output indicates those 121 rows occur across only 5 classes. The class value phytophthora-rot contains the most rows with missing data, at 68.
# Average fraction of missing values per row (35 predictors), by class
Soybean %>%
mutate(MissingValues = rowSums(is.na(Soybean))) %>%
mutate(MissingPct = MissingValues / 35) %>%
group_by(Class) %>%
summarise(MissingPctAvg = mean(MissingPct)) %>%
filter(MissingPctAvg > 0)
## # A tibble: 5 × 2
## Class MissingPctAvg
## <fct> <dbl>
## 1 2-4-d-injury 0.804
## 2 cyst-nematode 0.686
## 3 diaporthe-pod-&-stem-blight 0.337
## 4 herbicide-injury 0.571
## 5 phytophthora-rot 0.394
The above output calculates the average percentage of missing values for each of those 5 classes: 2-4-d-injury rows are missing roughly 80% of their values, while cyst-nematode and herbicide-injury rows are also missing over half of their predictor values.
Soybean %>%
count(Class, sort=TRUE)
## Class n
## 1 brown-spot 92
## 2 alternarialeaf-spot 91
## 3 frog-eye-leaf-spot 91
## 4 phytophthora-rot 88
## 5 anthracnose 44
## 6 brown-stem-rot 44
## 7 bacterial-blight 20
## 8 bacterial-pustule 20
## 9 charcoal-rot 20
## 10 diaporthe-stem-canker 20
## 11 downy-mildew 20
## 12 phyllosticta-leaf-spot 20
## 13 powdery-mildew 20
## 14 purple-seed-stain 20
## 15 rhizoctonia-root-rot 20
## 16 2-4-d-injury 16
## 17 diaporthe-pod-&-stem-blight 15
## 18 cyst-nematode 14
## 19 herbicide-injury 8
The above provides the number of rows by class value. Combined with the earlier count of rows with missing data by class, the assessment shows that every row for 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury contains some missing data. And for phytophthora-rot, only 20 of the 88 rows are free of missing values.
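To double-check that claim, counting complete cases per class shows which classes have fully observed rows (classes absent from the output have none):
# Count fully observed rows per class
Soybean %>%
filter(complete.cases(.)) %>%
count(Class, sort = TRUE)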
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
# Convert the missing counts to percentages of the 683 total rows
colSums_Missing_Count %>%
mutate(NA.Count.Pct = NA.Count/683)
## Variable NA.Count NA.Count.Pct
## 6 hail 121 0.177159590
## 9 sever 121 0.177159590
## 10 seed.tmt 121 0.177159590
## 21 lodging 121 0.177159590
## 11 germ 112 0.163982430
## 19 leaf.mild 108 0.158125915
## 24 fruiting.bodies 106 0.155197657
## 30 fruit.spots 106 0.155197657
## 33 seed.discolor 106 0.155197657
## 35 shriveling 106 0.155197657
## 17 leaf.shread 100 0.146412884
## 31 seed 92 0.134699854
## 32 mold.growth 92 0.134699854
## 34 seed.size 92 0.134699854
## 14 leaf.halo 84 0.122986823
## 15 leaf.marg 84 0.122986823
## 16 leaf.size 84 0.122986823
## 18 leaf.malf 84 0.122986823
## 29 fruit.pods 84 0.122986823
## 4 precip 38 0.055636896
## 22 stem.cankers 38 0.055636896
## 23 canker.lesion 38 0.055636896
## 25 ext.decay 38 0.055636896
## 26 mycelium 38 0.055636896
## 27 int.discolor 38 0.055636896
## 28 sclerotia 38 0.055636896
## 3 plant.stand 36 0.052708638
## 36 roots 31 0.045387994
## 5 temp 30 0.043923865
## 7 crop.hist 16 0.023426061
## 12 plant.growth 16 0.023426061
## 20 stem 16 0.023426061
## 2 date 1 0.001464129
## 8 area.dam 1 0.001464129
## 1 Class 0 0.000000000
## 13 leaves 0 0.000000000
The above output indicates the percentage of missing values by column. No column has more than 18% missing data.
First, before defining the strategy, I want to address the findings from section (b). Given the missing values are concentrated in only 5 of the 19 class values, I believe additional research into those missing values is warranted. Potentially, an "informative missingness" exists that we could uncover with more knowledge of the subject or the observational method of data collection.
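If the missingness does turn out to be informative, one common option for categorical predictors is to encode "missing" as its own factor level rather than impute it away. A sketch using forcats (loaded with the tidyverse) and its fct_explicit_na function (superseded by fct_na_value_to_level in newer forcats); the "missing" label and the use of across (dplyr >= 1.0) are my own choices:
# Recode NA as an explicit "missing" level for every predictor
Soybean_na_level <- Soybean %>%
mutate(across(-Class, ~ fct_explicit_na(.x, na_level = "missing")))
summary(Soybean_na_level$hail)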
For handling the missing data, I believe a two-step approach may be valid. First, eliminate the predictors with near-zero variance: leaf.mild, mycelium, and sclerotia. As the textbook indicates, little predictive value will come from these three predictors. Second, considering the missing values in the remaining predictors, I don't believe imputation via PCA is a good approach, since the missing data span almost all the predictors; typically, PCA-based imputation works best when missing data are confined to a small number of predictors. To impute the remaining predictors, I'd suggest a K-nearest neighbors (KNN) model; a 5-nearest-neighbor model would be a reasonable parameter selection to start.
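Since caret's knnImpute method only handles numeric predictors, KNN imputation for these categorical predictors would need a different tool. A hedged sketch using the VIM package's kNN function (my choice of package, not the textbook's):
library(VIM)
# Drop the near-zero-variance predictors first, then impute with 5 nearest neighbors
soy <- Soybean %>% select(-c(leaf.mild, mycelium, sclerotia))
# imp_var = FALSE suppresses the extra TRUE/FALSE indicator columns kNN normally appends
soy_knn <- kNN(soy, k = 5, imp_var = FALSE)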
Another consideration would be a tree-based prediction model, which Chapter 3 of the textbook alludes to several times. If the missing values do contain "informative missingness", then a tree-based approach may perform well in predicting the correct class compared to a regression model.
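As a quick illustration, rpart fits a classification tree directly on the data with missing values, handling NAs internally through surrogate splits:
library(rpart)
# rpart tolerates missing predictor values via surrogate splits, so no imputation is required
tree_fit <- rpart(Class ~ ., data = Soybean)
# Cross-validated error by tree complexity
printcp(tree_fit)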
library(mice)
# from: https://datascienceplus.com/imputing-missing-data-with-r-mice-package/
# Impute using predictive mean matching: 5 imputed data sets, up to 50 iterations
tempData <- mice(Soybean, m = 5, maxit = 50, meth = 'pmm', seed = 500)
summary(tempData)
# Extract the first of the five completed data sets
completedData <- complete(tempData, 1)
I did naively attempt the imputation with the mice package, but the processing took quite a while, and the imputed values didn't prove much to me beyond the fact that the function worked as advertised.