Exercises: 3.1 and 3.2 from the Kuhn and Johnson book.
3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification.The data consist of 214 glass samples labeled as one of seven class categories.There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench) data(Glass) str(Glass)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
tbl <- rbind(table(Glass$Type), round(table(Glass$Type)/(nrow(Glass))*100.0,2))
rownames(tbl) <- c("Count of Glass type: ", "% of Dataset from Glass type: ")
tbl
## 1 2 3 5 6 7
## Count of Glass type: 70.00 76.00 17.00 13.00 9.00 29.00
## % of Dataset from Glass type: 32.71 35.51 7.94 6.07 4.21 13.55
Clearly, types 1 and 2 make up an outsized share of the glass types in the data set, which I’ll want to keep in mind. The predictor vlaues that correspond to those types will also make up an outsized share.
Glass %>%
pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_density(alpha = 0.3) +
facet_wrap(~ Predictor, scales = "free") +
theme_minimal() +
labs(
title = "Density Plots of Numeric Predictors by Glass Type",
x = "Value",
y = "Density"
)
Glass %>%
pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Value, fill = Type, color = Type)) +
geom_density(alpha = 0.3) +
facet_wrap(~ Predictor, scales = "free") +
theme_minimal() +
labs(
title = "Density Plots of Numeric Predictors by Glass Type",
x = "Value",
y = "Density"
)
This gave me an idea of the overall distributions for each predictor. I can also see how somewhat bi-modal distributions in some areas appear to be related to the distributions when controlled by glass Type. For example for Al, the distributions look more normal for Types 1 and 2. But the distributions look less normal and have different centers for types 3 and 7. I’m interested in exploring transformations at this point, but I see that this is a latter part of the question, so I’ll hold off until then.
I want to get a better idea of the relationship between predictors both visually + in correlation terms.
# Using GGally for this first part
Glass %>%
select(-Type) %>%
ggpairs(
lower = list(continuous = wrap("points", alpha = 0.4, size = 0.5)),
title = "Scatterplot Matrix of Glass Predictors"
) +
theme_minimal()
## Doing correlation again for the colors:
glass_cor <- Glass %>%
select(-Type) %>%
cor()
corrplot(glass_cor,
method = "color",
type = "lower",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black"
)
I wanted to see the outliers via boxplot, but the shaped of the distribution via violin plot, so I ended up layering both.
Glass %>%
pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Value, y = "")) +
geom_violin(
fill = "steelblue",
alpha = 0.5,
outlier.color = "red", # making outliers easier to see
outlier.size = 2,
outlier.shape = 16
) +
geom_boxplot(
width = 0.15, # Keeps the boxplot skinny so it fits inside the violin
fill = "white", # Makes the boxplot stand out against the colored violin
outlier.color = "red", # Keeps the outliers bright red
outlier.size = 1.5,
outlier.shape = 16,
alpha = 0.7
) +
facet_wrap(~ Predictor, scales = "free", ncol = 3) +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
) +
labs(
title = "Violin of Glass Predictors (Outliers and Skewness)",
x = "Predictor Value",
y = ""
)
## Warning in geom_violin(fill = "steelblue", alpha = 0.5, outlier.color = "red", : Ignoring unknown parameters: `outlier.colour`, `outlier.size`, and
## `outlier.shape`
At first glance, there are definitely outliers. But I’m going to take a second look while controlling for glass Type, since I know the distributions for each predicotr vary by glass type.
Glass %>%
pivot_longer(cols = -Type, names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Value, y = Type, fill = Type)) +
geom_violin(
alpha = 0.5,
outlier.color = "red",
outlier.size = 1.5,
outlier.shape = 16
) +
geom_boxplot(
width = 0.15, # Keeps the boxplot skinny so it fits inside the violin
fill = "white", # Makes the boxplot stand out against the colored violin
outlier.color = "red", # Keeps the outliers bright red
outlier.size = 1.5,
outlier.shape = 16,
alpha = 0.7
) +
facet_wrap(~ Predictor, scales = "free", ncol = 3) +
theme_minimal() +
theme(
legend.position = "none"
) +
labs(
title = "Violin plots of Glass Predictors Controlled by Glass Type",
x = "Predictor Value",
y = "Glass Type"
)
## Warning in geom_violin(alpha = 0.5, outlier.color = "red", outlier.size = 1.5, : Ignoring unknown parameters: `outlier.colour`, `outlier.size`, and
## `outlier.shape`
Type 5 and Type 6 glass have some weird shapes that might imply heavier skew and outliers across almost all predictors, but there is also so little data for these two types, that this makes sense how a few outliers are holding a more powerful sway on the shape of the data.
With that noted:
# Finding the appropriate transformations
yj_model <- preProcess(Glass %>% select(-Type), method = "YeoJohnson")
# Extract the lambda values
lambdas <- yj_model$yj
lambda_table <- data.frame(
Predictor = names(lambdas),
Lambda = round(lambdas, 3)
)
print(lambda_table, row.names = FALSE)
## Predictor Lambda
## Na -0.187
## Mg 2.237
## Al 0.003
## K -0.984
## Ca -1.357
I tested this out, and actually this was better fit for Na, Mg, Al, K and Ca according to the YeoJohnson method. Probably due to the number of 0s in some of the other variables. These transformations would help the classification model. That said, the clearest transformations of these variables would of course be Mg, K and maybe Ca, becuase their lamda valeus can be more easily rounded to a number that is more interpretable as well.
3.2 The soy bean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g.,temperature, precipitation) and plant conditions (e.g.,left spots,mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via: > library(mlbench) > data(Soybean) > ## See ?Soybean for details
data(Soybean)
head(Soybean)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
?Soybean
## starting httpd help server ... done
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
# splitting this into 3 chunks because I wanted it to look right in the knitted document
Soybean %>%
select(2:13) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_bar(fill = "steelblue", color = "white") +
facet_wrap(~ Predictor, scales = "free_x", ncol = 4) +
theme_minimal() +
labs(title = "Soybean Predictors (Group 1 of 3)", x = "Level", y = "Count")
Soybean %>%
select(14:25) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_bar(fill = "darkslateblue", color = "white") +
facet_wrap(~ Predictor, scales = "free_x", ncol = 4) +
theme_minimal() +
labs(title = "Soybean Predictors (Group 2 of 3)", x = "Level", y = "Count")
Soybean %>%
select(26:36) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols = everything(), names_to = "Predictor", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_bar(fill = "cadetblue", color = "white") +
facet_wrap(~ Predictor, scales = "free_x", ncol = 4) +
theme_minimal() +
labs(title = "Soybean Predictors (Group 3 of 3)", x = "Level", y = "Count")
# Use the function from the package the book author made
nzv_analysis <- caret::nearZeroVar(Soybean[, -1], saveMetrics = TRUE)
# Filtering on 'nzv == TRUE' should meet the book requirements: FreqRatio > 19 AND PercentUnique < 10
degenerate_vars <- nzv_analysis[nzv_analysis$nzv == TRUE, ]
# Output table
degenerate_output <- data.frame(
Predictor = rownames(degenerate_vars),
FreqRatio = round(degenerate_vars$freqRatio, 2),
PercentUnique = round(degenerate_vars$percentUnique, 2),
Is_Degenerate = degenerate_vars$nzv
)
print(degenerate_output, row.names = FALSE)
## Predictor FreqRatio PercentUnique Is_Degenerate
## leaf.mild 26.75 0.44 TRUE
## mycelium 106.50 0.29 TRUE
## sclerotia 31.25 0.29 TRUE
Yes, the leaf.mild, mycelium and sclerotia are degenerate with one or more unque categories that are only present for a very small amount of the data.
par(mar = c(10, 4, 4, 2))
# Using VIM to see patterns of missingness
VIM::aggr(Soybean,
col = c('navyblue','red'),
numbers = TRUE,
sortVars = TRUE,
# las = 2 rotates the x-axis labels to be vertical
# cex.axis = .5 makes the font smaller so they all fit
labels = names(Soybean),
cex.axis = 0.5,
las = 2,
gap = 3,
ylab = c("Proportion of Missingness","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
# Creating a heatmap for percentage of missingness grouped by Class
na_by_class <- Soybean %>%
group_by(Class) %>%
summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
pivot_longer(cols = -Class, names_to = "Predictor", values_to = "Percent_NA")
ggplot(na_by_class, aes(x = Predictor, y = Class, fill = Percent_NA)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "firebrick") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 8)) +
labs(
title = "Heatmap of Missing Data by Disease Class",
subtitle = "Darker red = higher concentration of NAs",
x = "Predictors",
y = "Disease Class",
fill = "% Missing"
)
From the first graph
Many of these variables have over 15% missingness (I won’t re-write the names because there are so many). Lukcily, of our degenerate distirbutions, the missingness is only around 5% for 2 of the 3. It’s over 15% for leaf.mild though, which is troublesome.
I can also see from the right-hand side plot that many of these variables are missing together, alongside other values in a certain pattern. This means that the missingness isn’t just random among predictors.
From the heatmap
Yes, I can see that certain predictor data is essentially missing for certain classes.
I thought through a strategy for this:
I’d start by eliminating the three variables that were near-zero variance. They cannot add much information, can add problems, and one of them has a 15% missingness rate on top of it.
For the data that I am able to impute (more on this on the next bullet point), I’d use KNN imputation. This is pretty suitable for the type of data we have here, when the missingness clumps together among similar cases. Plus, some other imputation methods simply wouldn’t be suited to this. Mice, for example, assume missing at random (MAR) data, and it looks like that is not what we have here.
Lastly there is data that I should be certain not to impute. There were classes + predictors in my last plot that were 100% missing. I’d either need to remove this data from the data set (since we should never be trying to predict those values), or I’d need to add a feature flag for that missingness.