packages <- c("tidyverse", "forecast", "kableExtra", "broom", "ggplot2", "caret", "e1071", "knitr", "GGally", "VIM", "mlbench", "car", "corrplot", "mice", "seasonal", "fma", "latex2exp","gridExtra")
pacman::p_load(char = packages)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
Load the dataset Glass
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
As we see above, the Glass data frame has 214 observations and 10 variables.
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Answer
Let's plot histograms to understand the distributions of the predictor variables. We use histograms to look at the frequencies of each predictor for the following reasons:
A histogram bins the data by value and plots the frequency of each bin against the ordered values.
The height of each bin represents the frequency of values falling in that bin.
long_glass <- Glass %>%
pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na = TRUE) %>%
mutate(Predictor = as.factor(Predictor))
long_glass %>%
ggplot(aes(Value, color = Predictor, fill = Predictor)) +
geom_histogram(bins = 20) +
facet_wrap(~ Predictor, ncol = 3, scales = "free") +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
theme_light() +
theme(legend.position = "none") +
ggtitle("Distribution of Predictor Variables")
Glass is primarily made of silica (Si), sodium (Na), lime/calcium (Ca), and aluminium (Al), and these predictors have relatively normal (symmetric) distributions. The remaining predictor variables appear to have non-normal (asymmetric) distributions.
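As a rough check of that composition claim, the average value of each predictor can be inspected (a small sketch; note that RI is an index rather than a percentage):
# Mean of each numeric predictor; the element columns are weight percentages
round(colMeans(Glass[, -10]), 2)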
Now we will examine how the predictors are related to each other. We will analyse that with a correlation plot.
#ColorBrewer's 5 class spectral color palette
col <- colorRampPalette(c("#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"))
Glass %>%
select(-Type) %>%
cor() %>%
round(., 2) %>%
corrplot(., method="color", col=col(200), type="upper", order="hclust", addCoef.col = "black", tl.col="black", tl.srt=45, diag=FALSE )
As we see above, most of the predictors are negatively correlated, which makes sense. They are measuring chemical concentrations on a percentage basis. As one element increases we would expect a decrease in the others.
Most of the correlations are not very strong. The exception is the correlation between calcium (Ca) and the refractive index (RI), which is strongly positive.
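As a quick numerical check of that pairwise relationship (a one-line sketch):
# Correlation between the refractive index and calcium
cor(Glass$RI, Glass$Ca)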
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
Answer
To start with, let's analyze how the predictors are distributed by glass type using a scatter (jitter) plot:
long_glass %>%
ggplot(aes(x = Type, y = Value, color = Predictor)) +
geom_jitter() +
ylim(0, 20) +
scale_color_brewer(palette = "Set1") +
theme_light()
From the above plot, it looks like glass types 1, 2, and 3 are very similar in chemical composition. A few observations appear to be outliers. For example, there are a couple of potassium (K) observations in type 5 glass that are unusually high. There is a barium (Ba) observation in type 2 glass that appears to be an outlier, along with some calcium (Ca) observations in type 2 glass.
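To list those unusual observations explicitly, a small sketch (the K > 2 and Ba > 2 cutoffs are assumed purely for illustration, not part of the original analysis):
# Observations with unusually high potassium or barium; cutoffs are illustrative
subset(Glass, K > 2 | Ba > 2, select = c(Type, K, Ba))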
Now let's also analyse the outliers using a box plot approach:
The boxplot displays the data in quartiles. Inside the box lie the data falling between the 25th and 75th percentiles, also called the 1st and 3rd quartiles. The line inside the box represents the median, also known as the 50th percentile or 2nd quartile.
Extending from the box are whiskers. Any values outside the whiskers are considered outliers. Symmetric distributions have a box with equally sized whiskers; skewed distributions have one long and one short whisker.
par(mfrow=c(3,3))
for(var in names(Glass)[-10]){
boxplot(Glass[var], main=paste('Boxplot of', var), horizontal = T, col="steelblue")
}
The Boxplots show outliers in every variable except for Mg. The most extreme outliers appear in the K and Ba variables.
Looking at the box plots above, magnesium is bimodal and left skewed. Iron, potassium, and barium are right skewed. The other predictors are roughly normal.
The skewness value can be calculated to confirm:
Glass[-10] %>% apply(2, skewness) %>% sort(decreasing=T)
## K Ba Ca Fe RI Al Na
## 6.4600889 3.3686800 2.0184463 1.7298107 1.6027151 0.8946104 0.4478343
## Si Mg
## -0.7202392 -1.1364523
The calculated skewness values above confirm our conclusions.
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Answer
A transformation method such as the Box-Cox transformation might improve the classification model's performance.
trans <- preProcess(Glass[-10], method=c('BoxCox', 'center', 'scale'))
transformed <- predict(trans, Glass[-10])
par(mfrow=c(3,3))
for(var in names(transformed)){
boxplot(transformed[var], main=paste('Boxplot of', var), horizontal = T, col="steelblue")
}
transformed %>% apply(2, skewness) %>% sort(decreasing=T)
## K Ba Fe RI Al Na
## 6.46008890 3.36867997 1.72981071 1.56566039 0.09105899 0.03384644
## Ca Si Mg
## -0.19395573 -0.65090568 -1.13645228
The centering and scaling did the job of bringing the mean to 0 and standard deviation to 1.
It appears that the Box-Cox transformation has improved the skewness of Ca, Al, and Na. It was not effective for the more heavily skewed predictors: Box-Cox requires strictly positive values, and K, Ba, Fe, and Mg contain zeros, so they were left untransformed.
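Because Box-Cox cannot handle zeros, an alternative worth trying is the Yeo-Johnson transformation, which is defined for zero and negative values as well. A sketch (not part of the original analysis; whether it actually reduces the skewness of these predictors would need to be checked):
# Yeo-Johnson transformation via caret's preProcess; handles zero-containing predictors
yj_trans <- preProcess(Glass[-10], method = c('YeoJohnson', 'center', 'scale'))
yj_transformed <- predict(yj_trans, Glass[-10])
round(sort(apply(yj_transformed, 2, skewness), decreasing = TRUE), 2)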
It may also be beneficial to remove one of the highly correlated predictors (e.g., Ca or RI) to improve the stability of some linear models.
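As a sketch of how that could be done, caret's findCorrelation() flags predictors with high pairwise correlations (the 0.75 cutoff here is an assumed choice, not part of the original analysis):
# Indices of predictors whose pairwise correlation exceeds the assumed cutoff
high_cor_idx <- findCorrelation(cor(Glass[, -10]), cutoff = 0.75)
names(Glass[, -10])[high_cor_idx]
# These columns could be dropped before fitting a linear model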
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
#View(Soybean)
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Answer
To answer the first part of the question, we will use summary function to analyze frequency distribution of predictors:
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
Moving to the second part of the question: a variable with a degenerate distribution has "zero variance" (a single unique value) or "near-zero variance" (a handful of unique values that occur with very low frequencies), satisfying both of the following conditions:
The fraction of unique values over the sample size is low (say 10%).
The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).
The nearZeroVar function can be used to find the degenerate variables:
paste('The degenerate variables are:', paste(names(Soybean[,nearZeroVar(Soybean)]), collapse = ', '))
## [1] "The degenerate variables are: leaf.mild, mycelium, sclerotia"
Let’s explore the following variables: leaf.mild, mycelium, sclerotia
summary(Soybean[19])
## leaf.mild
## 0 :535
## 1 : 20
## 2 : 20
## NA's:108
For the leaf.mild variable, the fraction of unique values over the sample size is 3/683 ≈ 0.4% < 10%, and the ratio of the frequency of the most prevalent value to that of the second most prevalent value is 535/20 = 26.75 > 20, which confirms that leaf.mild is a degenerate variable.
summary(Soybean[26])
## mycelium
## 0 :639
## 1 : 6
## NA's: 38
For the mycelium variable, the fraction of unique values over the sample size is 2/683 ≈ 0.3% < 10%, and the ratio of the frequency of the most prevalent value to that of the second most prevalent value is 639/6 = 106.5 > 20, which confirms that mycelium is a degenerate variable.
summary(Soybean[28])
## sclerotia
## 0 :625
## 1 : 20
## NA's: 38
For the sclerotia variable, the fraction of unique values over the sample size is 2/683 ≈ 0.3% < 10%, and the ratio of the frequency of the most prevalent value to that of the second most prevalent value is 625/20 = 31.25 > 20, which confirms that sclerotia is a degenerate variable.
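The two quantities above can also be read off in a single call, as a cross-check (a sketch using caret's saveMetrics argument):
# freqRatio and percentUnique for every predictor; the nzv column flags degenerate ones
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]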
(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Answer
The counts of missing values in each variable are shown below:
nas <- Soybean %>% apply(2, is.na) %>% apply(2, sum, na.rm=T)
nas <- sort(nas, decreasing=T)
nas
## hail sever seed.tmt lodging germ
## 121 121 121 121 112
## leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
## 108 106 106 106 106
## leaf.shread seed mold.growth seed.size leaf.halo
## 100 92 92 92 84
## leaf.marg leaf.size leaf.malf fruit.pods precip
## 84 84 84 84 38
## stem.cankers canker.lesion ext.decay mycelium int.discolor
## 38 38 38 38 38
## sclerotia plant.stand roots temp crop.hist
## 38 36 31 30 16
## plant.growth stem date area.dam Class
## 16 16 1 1 0
## leaves
## 0
hail, sever, seed.tmt, and lodging are the four predictors with the most missing values.
Let’s plot our missing values in predictors:
# A function that plots missingness
# requires `reshape2`
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
ggplot_missing <- function(x){
x %>%
is.na %>%
melt %>%
ggplot(data = .,
aes(x = Var2,
y = Var1)) +
geom_raster(aes(fill = value)) +
scale_fill_grey(name = "",
labels = c("Present","Missing")) +
theme_minimal() +
theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
labs(x = "Variables in Dataset",
y = "Rows or observations")
}
ggplot_missing(Soybean[-1])
The above plot confirms our count of missing values in predictors.
The missing values occur in blocks of rows rather than at random; since the observations are grouped by class, this suggests that the pattern of missing data is related to the classes.
Let's use aggr() to analyze the pattern of missing data. The aggr function in the VIM package plots and calculates the amount of missing values in each variable:
library(VIM)
aggr(Soybean[-1], prop = c(T, T), bars=T, numbers=T, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## leaves 0.000000000
The non-graphical output of the function shows the exact proportion of missing values per variable.
The visualizations produced by the aggr function in the VIM package show a bar chart with the proportion of missing data per variable as well as a grid with the proportion of missing data for variable combinations. The bar chart shows that several predictor variables have over 15% of their values missing.
The remainder of the grid shows missing data for variable combinations with each row highlighting the missing values for the group of variables detailed in the x-axis.
dplyr is useful for wrangling data into aggregate summaries and is used to find the pattern of missing data related to the classes.
library(dplyr)
Soybean %>%
dplyr::mutate(Total = n()) %>%
dplyr::filter(!complete.cases(.)) %>%
dplyr::group_by(Class) %>%
dplyr::mutate(Missing = n(), Proportion=Missing/Total) %>%
dplyr::select(Class, Missing, Proportion) %>%
unique()
## # A tibble: 5 x 3
## # Groups: Class [5]
## Class Missing Proportion
## <fct> <int> <dbl>
## 1 phytophthora-rot 68 0.0996
## 2 diaporthe-pod-&-stem-blight 15 0.0220
## 3 cyst-nematode 14 0.0205
## 4 2-4-d-injury 16 0.0234
## 5 herbicide-injury 8 0.0117
In the above step, we checked whether a pattern of missing data related to the classes exists by checking whether a few classes hold most of the incomplete cases. This is accomplished by filtering, grouping, and mutating the data with dplyr.
The majority of the missing values are in the phytophthora-rot class, which has nearly 10% incomplete cases. Only four other classes, out of the eighteen remaining, have any incomplete cases at all.
The pattern of missing data is therefore related to the classes, driven mostly by the phytophthora-rot class, since the other four classes each have only between 1% and 2% incomplete cases.
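As a rough cross-check of that conclusion (a small sketch that counts missing cells rather than incomplete rows):
# Total number of missing cells per class, largest first
na_by_class <- tapply(rowSums(is.na(Soybean)), Soybean$Class, sum)
head(sort(na_by_class, decreasing = TRUE))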
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Answer
Multiple imputation and KNN imputation are both widely used; multiple imputation, being simpler to apply here, is generally preferred.
Method 1
The mice() function in the mice package conducts Multivariate Imputation by Chained Equations (MICE) on multivariate datasets with missing values. The function offers over 20 imputation methods that can be applied to the data.
The one used with these data is the predictive mean matching (pmm) method, which is currently the most popular in online forums. After the imputations are made, a complete dataset is created using the complete() function.
The aggr function from the VIM package used in the previous example (which plots and calculates the amount of missing values in each variable) is then rerun for comparison.
library(mice)
MICE <- mice(Soybean, method="pmm", printFlag=F, seed=624)
## Warning: Number of logged events: 1666
aggr(complete(MICE), prop = c(T, T), bars=T, numbers=T, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## Class 0
## date 0
## plant.stand 0
## precip 0
## temp 0
## hail 0
## crop.hist 0
## area.dam 0
## sever 0
## seed.tmt 0
## germ 0
## plant.growth 0
## leaves 0
## leaf.halo 0
## leaf.marg 0
## leaf.size 0
## leaf.shread 0
## leaf.malf 0
## leaf.mild 0
## stem 0
## lodging 0
## stem.cankers 0
## canker.lesion 0
## fruiting.bodies 0
## ext.decay 0
## mycelium 0
## int.discolor 0
## sclerotia 0
## fruit.pods 0
## fruit.spots 0
## seed 0
## mold.growth 0
## seed.discolor 0
## seed.size 0
## shriveling 0
## roots 0
Method 2
In this method, k neighbors are chosen based on some distance measure and their average is used as an imputation estimate. The method requires the selection of the number of nearest neighbors and a distance metric. KNN can predict both discrete attributes (the most frequent value among the k nearest neighbors) and continuous attributes (the mean among the k nearest neighbors).
knnImputation is a function that fills in all NA values using the k nearest neighbours of each case with NA values. By default it uses the values of the neighbours and obtains a weighted (by the distance to the case) average of their values to fill in the unknowns.
library(DMwR)
knnOutput <- knnImputation(Soybean)
library(rmarkdown)
paged_table(knnOutput)
ggplot_missing(knnOutput)
One of the obvious drawbacks of the KNN algorithm is that it becomes time-consuming when analyzing large datasets because it searches for similar instances through the entire dataset. Furthermore, the accuracy of KNN can be severely degraded with high-dimensional data because there is little difference between the nearest and farthest neighbor.