data("Glass")
|> select(-Type) |> ggpairs(progress = FALSE) Glass
hw4DATA624
Assignment
We are required to complete questions 3.1 and 3.2 from chapter 3 of “Applied Predictive Modeling” by Max Kuhn and Kjell Johnson.
3.1 - UC Irvine Machine Learning - Glass Data
A - exploring the data
As described in the book, the data consist of 214 glass samples labeled as one of seven categories. There are nine predictors in the data set: RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe. We are tasked with visualizing the data to explore the predictor variables, understand their distributions, and examine the relationships between variables.
First, I use the ggpairs function to visualize the distribution of each variable and the correlations between variables.
library(mlbench)   # provides the Glass (and later Soybean) data sets
library(GGally)    # provides ggpairs()
library(tidyverse) # dplyr, tidyr, and ggplot2
data("Glass")
Glass |> select(-Type) |> ggpairs(progress = FALSE)
Additionally, I use the describe function from the psych package to generate a table of summary statistics for each predictor variable and its distribution.
psych::describe(Glass[1:9])
   vars   n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
RI    1 214  1.52 0.00   1.52    1.52 0.00  1.51  1.53  0.02  1.60     4.72 0.00
Na    2 214 13.41 0.82  13.30   13.38 0.64 10.73 17.38  6.65  0.45     2.90 0.06
Mg    3 214  2.68 1.44   3.48    2.87 0.30  0.00  4.49  4.49 -1.14    -0.45 0.10
Al    4 214  1.44 0.50   1.36    1.41 0.31  0.29  3.50  3.21  0.89     1.94 0.03
Si    5 214 72.65 0.77  72.79   72.71 0.57 69.81 75.41  5.60 -0.72     2.82 0.05
K     6 214  0.50 0.65   0.56    0.43 0.17  0.00  6.21  6.21  6.46    52.87 0.04
Ca    7 214  8.96 1.42   8.60    8.74 0.66  5.43 16.19 10.76  2.02     6.41 0.10
Ba    8 214  0.18 0.50   0.00    0.03 0.00  0.00  3.15  3.15  3.37    12.08 0.03
Fe    9 214  0.06 0.10   0.00    0.04 0.00  0.00  0.51  0.51  1.73     2.52 0.01
B - Outliers and skewness
Based on the above plot and summary table, all of the predictor variables are skewed: RI, Na, Al, K, Ca, Ba, and Fe have positive skew, while Mg and Si have negative skew. All of the predictors except Mg also have positive kurtosis, some extremely so (K has a kurtosis of 52.87). This suggests that there are outliers in most of the predictor variables.
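The chapter computes skewness with the e1071 package; as a quick cross-check of the psych output above, a minimal sketch:
library(e1071)
# skewness of each predictor; values far from zero indicate asymmetry
apply(Glass[, 1:9], 2, skewness)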
Additionally, the box plots for each predictor are visible below, with red points signifying outliers based on the IQR. As is evident below, all predictors besides Mg have outliers.
Glass_longer <- Glass |> select(-Type) |> pivot_longer(cols = everything(), names_to = 'Predictor', values_to = 'Value')

Glass_longer |> ggplot(aes(x = Predictor, y = Value)) + geom_boxplot(outlier.color = 'red') + facet_wrap(~Predictor, scales = 'free_y')
C - Transformations of predictors
First, I try taking the log of the predictors, except for K, Ba, and Fe; those three have observations equal to zero, which would produce a log value of -Inf.
For K, Ba, and Fe, I instead take the square root.
Glass_Transformed <- Glass
Glass_Transformed[1:5] <- log(Glass_Transformed[1:5])  # log of RI, Na, Mg, Al, Si
Glass_Transformed[7] <- log(Glass_Transformed[7])      # log of Ca
Glass_Transformed[6] <- sqrt(Glass_Transformed[6])     # square root of K
Glass_Transformed[8:9] <- sqrt(Glass_Transformed[8:9]) # square root of Ba and Fe
Glass_Transformed |> select(-Type) |> ggpairs(progress = FALSE)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_text()`). (This warning repeats for each affected panel.)
Warning: Removed 42 rows containing non-finite outside the scale range
(`stat_density()`).
This approach produces missing values (Mg also contains zeros, so its log is -Inf, which accounts for the 42 non-finite rows removed in the warnings above) and does not meaningfully improve the skewness of the data.
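One workaround for the zero-valued predictors, shown here only as a sketch and not carried through the rest of the analysis, is log1p(), i.e. log(1 + x), which is defined at zero:
# log1p() tolerates the zero observations in Mg, K, Ba, and Fe without producing -Inf
Glass_log1p <- Glass
Glass_log1p[1:9] <- lapply(Glass[1:9], log1p)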
Next, I try the Box-Cox transformation for each predictor.
library(caret)
Glass_box_cox <- Glass

RI_predictor <- BoxCoxTrans(Glass_box_cox$RI)
RI_transformed <- predict(RI_predictor, Glass_box_cox$RI)

Na_predictor <- BoxCoxTrans(Glass_box_cox$Na)
Na_transformed <- predict(Na_predictor, Glass_box_cox$Na)

Mg_predictor <- BoxCoxTrans(Glass_box_cox$Mg)
Mg_transformed <- predict(Mg_predictor, Glass_box_cox$Mg)

Al_predictor <- BoxCoxTrans(Glass_box_cox$Al)
Al_transformed <- predict(Al_predictor, Glass_box_cox$Al)

Si_predictor <- BoxCoxTrans(Glass_box_cox$Si)
Si_transformed <- predict(Si_predictor, Glass_box_cox$Si)

K_predictor <- BoxCoxTrans(Glass_box_cox$K)
K_transformed <- predict(K_predictor, Glass_box_cox$K)

Ca_predictor <- BoxCoxTrans(Glass_box_cox$Ca)
Ca_transformed <- predict(Ca_predictor, Glass_box_cox$Ca)

Ba_predictor <- BoxCoxTrans(Glass_box_cox$Ba)
Ba_transformed <- predict(Ba_predictor, Glass_box_cox$Ba)

Fe_predictor <- BoxCoxTrans(Glass_box_cox$Fe)
Fe_transformed <- predict(Fe_predictor, Glass_box_cox$Fe)
transformed_predictors <- data.frame(RI_transformed, Na_transformed, Mg_transformed,
                                     Al_transformed, Si_transformed, K_transformed,
                                     Ca_transformed, Ba_transformed, Fe_transformed)

transformed_predictors |> ggpairs(progress = FALSE)
transformed_predictors_longer <- transformed_predictors |> pivot_longer(cols = everything(), names_to = 'Predictor', values_to = 'Value')

transformed_predictors_longer |> ggplot(aes(x = Predictor, y = Value)) + geom_boxplot(outlier.color = 'red') + facet_wrap(~Predictor, scales = 'free_y')
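The nine BoxCoxTrans/predict pairs above can also be written more compactly; a sketch assuming, as above, that the predictors sit in columns 1 through 9:
# fit a Box-Cox transformation per column and apply it in a single pass
transformed_list <- lapply(Glass[1:9], function(x) predict(BoxCoxTrans(x), x))
transformed_predictors_compact <- as.data.frame(transformed_list)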
The Box-Cox transformations do not make a meaningful difference to the skewness of the predictors or to the presence of outliers.
The predictors in this data set are difficult to transform: neither the log, square root, nor Box-Cox transformations meaningfully reduced skewness or the number of outliers. Further transformations would need to be considered before modeling; alternatively, one could choose a model that is less sensitive to skewness and outliers in the predictors, such as a tree-based model or a support vector machine.
3.2 - UC Irvine Machine Learning - Soybean Data
data("Soybean")
head(Soybean)
Class date plant.stand precip temp hail crop.hist area.dam
1 diaporthe-stem-canker 6 0 2 1 0 1 1
2 diaporthe-stem-canker 4 0 2 1 0 2 0
3 diaporthe-stem-canker 3 0 2 1 0 1 0
4 diaporthe-stem-canker 3 0 2 1 0 1 0
5 diaporthe-stem-canker 6 0 2 1 0 2 0
6 diaporthe-stem-canker 5 0 2 1 0 3 0
sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
1 1 0 0 1 1 0 2 2
2 2 1 1 1 1 0 2 2
3 2 1 2 1 1 0 2 2
4 2 0 1 1 1 0 2 2
5 1 0 2 1 1 0 2 2
6 1 0 1 1 1 0 2 2
leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
1 0 0 0 1 1 3 1
2 0 0 0 1 0 3 1
3 0 0 0 1 0 3 0
4 0 0 0 1 0 3 0
5 0 0 0 1 0 3 1
6 0 0 0 1 0 3 0
fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
1 1 1 0 0 0 0
2 1 1 0 0 0 0
3 1 1 0 0 0 0
4 1 1 0 0 0 0
5 1 1 0 0 0 0
6 1 1 0 0 0 0
fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
1 4 0 0 0 0 0 0
2 4 0 0 0 0 0 0
3 4 0 0 0 0 0 0
4 4 0 0 0 0 0 0
5 4 0 0 0 0 0 0
6 4 0 0 0 0 0 0
A - categorical variables
Below is a summary of the variables in the Soybean data set:
summary(Soybean)
Class date plant.stand precip temp
brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
anthracnose : 44 6 : 90
brown-stem-rot : 44 (Other):101
(Other) :233 NA's : 1
hail crop.hist area.dam sever seed.tmt germ plant.growth
0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
3 :218 3 :187 NA's:121 NA's:121 NA's:112
NA's: 16 NA's: 1
leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
NA's: 84 NA's: 84 NA's: 84 NA's:108
stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
3 :191 3 : 65 NA's: 38
NA's: 38 NA's: 38
mycelium int.discolor sclerotia fruit.pods fruit.spots seed
0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
NA's: 38 3 : 48 4 :100
NA's: 84 NA's:106
mold.growth seed.discolor seed.size shriveling roots
0 :524 0 :513 0 :532 0 :539 0 :551
1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
NA's: 31
The chapter discusses degenerate distributions, where a variable has zero variance (i.e., it is constant) or where the vast majority of observations are concentrated in a single value with only a small number of other unique values (near-zero variance).
A rule of thumb for detecting near-zero-variance predictors is:
- the fraction of unique values over the sample size is low (say, 10%), and
- the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (around 20).
Since these variables are categorical with only a few levels each, the fraction of unique values is low for every predictor, so the deciding criterion is the frequency ratio.
To compute it, I create a function that tabulates the frequency of each unique value and takes the ratio of the most common value's frequency to the second most common.
frequency_ratio <- function(columnToTest) {
  value_counts <- sort(table(columnToTest))  # frequency of each unique value, ascending
  length_of_summary <- length(value_counts)
  # ratio of the most prevalent value's frequency to the second most prevalent
  ratio <- value_counts[length_of_summary] / value_counts[length_of_summary - 1]
  return(ratio)
}
Next, I apply this function to each column of the Soybean data set and filter for ratios that are >= 20.
predictor_Frequency_Ratio <- apply(Soybean, 2, frequency_ratio)

as.data.frame(predictor_Frequency_Ratio) |> filter(predictor_Frequency_Ratio >= 20)
predictor_Frequency_Ratio
leaf.mild 26.75
mycelium 106.50
sclerotia 31.25
The distributions for leaf.mild, mycelium, and sclerotia are degenerate: their frequency ratios far exceed the rule-of-thumb cutoff of 20, so these predictors have near-zero variance.
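For reference, caret (loaded above) automates both rule-of-thumb checks with nearZeroVar(); with saveMetrics = TRUE it returns the frequency-ratio and percent-unique diagnostics for every column:
nzv_metrics <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE) # drop the Class column
nzv_metrics[nzv_metrics$nzv, ] # rows flagged as near-zero variance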
B - Missing data
We are asked to check whether some predictors are more likely than others to have missing data.
Below I create a function that returns the percentage of observations that are NA for a given variable, then apply it to each column of the Soybean data set.
na_percentage <- function(columnToTest) {
  return(sum(is.na(columnToTest)) / length(columnToTest))  # share of observations that are NA
}
soybean_predictor_na <- apply(Soybean, 2, na_percentage)

sort(soybean_predictor_na, decreasing = TRUE)
hail sever seed.tmt lodging germ
0.177159590 0.177159590 0.177159590 0.177159590 0.163982430
leaf.mild fruiting.bodies fruit.spots seed.discolor shriveling
0.158125915 0.155197657 0.155197657 0.155197657 0.155197657
leaf.shread seed mold.growth seed.size leaf.halo
0.146412884 0.134699854 0.134699854 0.134699854 0.122986823
leaf.marg leaf.size leaf.malf fruit.pods precip
0.122986823 0.122986823 0.122986823 0.122986823 0.055636896
stem.cankers canker.lesion ext.decay mycelium int.discolor
0.055636896 0.055636896 0.055636896 0.055636896 0.055636896
sclerotia plant.stand roots temp crop.hist
0.055636896 0.052708638 0.045387994 0.043923865 0.023426061
plant.growth stem date area.dam Class
0.023426061 0.023426061 0.001464129 0.001464129 0.000000000
leaves
0.000000000
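As an aside, base R computes the same proportions in one call, which would make the helper function unnecessary:
# is.na() returns a logical matrix; colMeans() gives the NA share per column
sort(colMeans(is.na(Soybean)), decreasing = TRUE)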
Based on the vector above, hail, sever, seed.tmt, and lodging are tied for the highest percentage of NA values at 17.7%. The variables from hail down through fruit.pods all have double-digit NA percentages (fruit.pods is at 12.3%); below that, the NA percentage drops to about 5.6%. The variables from hail through fruit.pods are therefore the most likely to have NA values.
We are then asked whether certain classes tend to have more NAs than others. Below, I count the number of NAs by class and variable (e.g., for the class 2-4-d-injury, how many times the date column is NA), then sum the counts of NAs across each row. The five classes listed below account for all of the NAs; the remaining classes have no NAs for any variable.
class_Na_summary <- Soybean |> group_by(Class) |> summarise(across(everything(), ~ sum(is.na(.x))))

class_Na_summary$Na_count <- rowSums(class_Na_summary[, 2:36])

class_Na_summary |> select(Class, Na_count) |> filter(Na_count > 0)
# A tibble: 5 × 2
Class Na_count
<fct> <dbl>
1 2-4-d-injury 450
2 cyst-nematode 336
3 diaporthe-pod-&-stem-blight 177
4 herbicide-injury 160
5 phytophthora-rot 1214
C - handling missing data
Based on the above, I would eliminate the leaf.mild, mycelium, and sclerotia variables as they are degenerate.
soybean_updated <- Soybean |> select(-c("leaf.mild", "mycelium", "sclerotia"))
summary(soybean_updated)
Class date plant.stand precip temp
brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
anthracnose : 44 6 : 90
brown-stem-rot : 44 (Other):101
(Other) :233 NA's : 1
hail crop.hist area.dam sever seed.tmt germ plant.growth
0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
3 :218 3 :187 NA's:121 NA's:121 NA's:112
NA's: 16 NA's: 1
leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem
0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :296
1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 :371
2 :342 2 :221 2 :221 NA's:100 NA's: 84 NA's: 16
NA's: 84 NA's: 84 NA's: 84
lodging stem.cankers canker.lesion fruiting.bodies ext.decay int.discolor
0 :520 0 :379 0 :320 0 :473 0 :497 0 :581
1 : 42 1 : 39 1 : 83 1 :104 1 :135 1 : 44
NA's:121 2 : 36 2 :177 NA's:106 2 : 13 2 : 20
3 :191 3 : 65 NA's: 38 NA's: 38
NA's: 38 NA's: 38
fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
0 :407 0 :345 0 :476 0 :524 0 :513 0 :532
1 :130 1 : 75 1 :115 1 : 67 1 : 64 1 : 59
2 : 14 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92
3 : 48 4 :100
NA's: 84 NA's:106
shriveling roots
0 :539 0 :551
1 : 38 1 : 86
NA's:106 2 : 15
NA's: 31
From the remaining variables, we can consider which ones have the highest percentage of NAs:
soybean_predictor_na_updated <- apply(soybean_updated, 2, na_percentage)

soybean_predictor_na_updated |> as.data.frame() |> filter(soybean_predictor_na_updated > 0.1)
soybean_predictor_na_updated
hail 0.1771596
sever 0.1771596
seed.tmt 0.1771596
germ 0.1639824
leaf.halo 0.1229868
leaf.marg 0.1229868
leaf.size 0.1229868
leaf.shread 0.1464129
leaf.malf 0.1229868
lodging 0.1771596
fruiting.bodies 0.1551977
fruit.pods 0.1229868
fruit.spots 0.1551977
seed 0.1346999
mold.growth 0.1346999
seed.discolor 0.1551977
seed.size 0.1346999
shriveling 0.1551977
These variables have a high percentage of NA values (other variables also have NAs, but at lower percentages). Deciding whether to eliminate these variables or to impute values requires more understanding of the data: while the NA percentages seem high, it is unclear whether the missingness is informative, since the meanings of these categorical variables are not provided. Given the prevalence of NAs, I would suggest a model that can handle missing data directly, such as a tree-based model. The alternatives are to drop observations with missing values or to impute them with an approach such as K-nearest neighbors.
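To illustrate the drop-observations option (imputing categorical predictors would require a factor-aware method, which I have not run here), a minimal sketch:
# keep only fully observed rows and check what fraction of the data survives
soybean_complete <- soybean_updated[complete.cases(soybean_updated), ]
nrow(soybean_complete) / nrow(soybean_updated)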