3.1. The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
#Visual
Glass |> #histo to exam distribution
keep(is.numeric) |>
gather()|>
ggplot(aes(value)) +
geom_histogram(bins=30) +
facet_wrap(~key, scales = 'free') +
ggtitle("Histograms")
Glass |> #Boxplot for outliers
keep(is.numeric) |>
gather() |>
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots of Numerical Predictors")
Glass |>
keep(is.numeric) |>
cor() |>
corrplot()
cor_matrix <- cor(Glass[, 1:9])
# Heatmap of the correlation matrix
cor_matrix_melted <- melt(cor_matrix)
ggplot(cor_matrix_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2( midpoint = 0, limit = c(-1,1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Correlation Heatmap of Predictors")
a) Using visualizations, explore the predictor variables to understand
their distributions as well as the relationships between predictors.
Ri seems to have a strong positive correlation with Ca, and a negative correlation with Si, and Ba has a positive correlation with Al.
numeric_glass <- Glass[, sapply(Glass, is.numeric)] #use only numeric columns
# Apply the skewness function
skewness_values <- apply(numeric_glass, 2, skewness)
print(skewness_values)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
# log transformation
Glass$log_Fe <- log(Glass$Fe + 1) # Adding 1 to avoid log(0)
Glass$log_K <- log(Glass$K + 1)
# Check new distributions with histograms
ggplot(Glass, aes(x = log_Fe)) + geom_histogram(bins = 20) + ggtitle("Log-transformed Fe")# Still skewed to the right but improved the frame.
ggplot(Glass, aes(x = log_K)) + geom_histogram(bins = 20) + ggtitle("Log-transformed K") #Still skewed to the right.
#BoxCox
Glass_transformed <- Glass|>
select(where(is.numeric)) |> # Keep only numeric columns
mutate(across(everything(), ~ BoxCoxTrans(.)$lambda))|>
head(1)
head(Glass_transformed)
## RI Na Mg Al Si K Ca Ba Fe log_Fe log_K
## 1 -2 -0.1 NA 0.5 2 NA -1.1 NA NA NA NA
The table from using the BoxCox, I could see that RI would need the inverse squared to improve the data quality, Si would need to be squared, Al would need the square root, and Ca would need inverse.
3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
#See ?Soybean for details
?Soybean
## Help on topic 'Soybean' was found in the following packages:
##
## Package Library
## mlbench /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
## nlme /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
##
##
## Using the first match ...
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Categorical<-sapply(Soybean, is.factor) #Check all columns that are factors
CategoricalP<-names(Soybean)[Categorical] #store categorical predictor
CategoricalP
## [1] "Class" "date" "plant.stand" "precip"
## [5] "temp" "hail" "crop.hist" "area.dam"
## [9] "sever" "seed.tmt" "germ" "plant.growth"
## [13] "leaves" "leaf.halo" "leaf.marg" "leaf.size"
## [17] "leaf.shread" "leaf.malf" "leaf.mild" "stem"
## [21] "lodging" "stem.cankers" "canker.lesion" "fruiting.bodies"
## [25] "ext.decay" "mycelium" "int.discolor" "sclerotia"
## [29] "fruit.pods" "fruit.spots" "seed" "mold.growth"
## [33] "seed.discolor" "seed.size" "shriveling" "roots"
lapply(CategoricalP,
function(col) {
ggplot(Soybean,
aes_string(col)) + geom_bar() + coord_flip() + ggtitle("Frequency Distribution of", col)})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
##
## [[12]]
##
## [[13]]
##
## [[14]]
##
## [[15]]
##
## [[16]]
##
## [[17]]
##
## [[18]]
##
## [[19]]
##
## [[20]]
##
## [[21]]
##
## [[22]]
##
## [[23]]
##
## [[24]]
##
## [[25]]
##
## [[26]]
##
## [[27]]
##
## [[28]]
##
## [[29]]
##
## [[30]]
##
## [[31]]
##
## [[32]]
##
## [[33]]
##
## [[34]]
##
## [[35]]
##
## [[36]]
NAData<-Soybean|> #count of missing values per column
summarize_all(funs(sum(is.na(.))))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
library(VIM) #visual of missing data Red=missing data blue=present data
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
aggr(Soybean, numbers=TRUE, sortVars=TRUE, labels=names(CategoricalP))
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
NaClass<-Soybean|>
gather(key=CategoricalP,value=value, -Class)|>
group_by(Class)|>
summarise(n=sum(is.na(value)))|>
arrange(desc(n))
## Warning: attributes are not identical across measure variables; they will be
## dropped
NaClass
## # A tibble: 19 × 2
## Class n
## <fct> <int>
## 1 phytophthora-rot 1214
## 2 2-4-d-injury 450
## 3 cyst-nematode 336
## 4 diaporthe-pod-&-stem-blight 177
## 5 herbicide-injury 160
## 6 alternarialeaf-spot 0
## 7 anthracnose 0
## 8 bacterial-blight 0
## 9 bacterial-pustule 0
## 10 brown-spot 0
## 11 brown-stem-rot 0
## 12 charcoal-rot 0
## 13 diaporthe-stem-canker 0
## 14 downy-mildew 0
## 15 frog-eye-leaf-spot 0
## 16 phyllosticta-leaf-spot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
Hail, Sever, Seed.tmt, and Lodging seem to have the highest missing values. The classes (rows) with high numbers of missing data are phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.
I believe the MICE function would be using to handle the missing data, which will handle missing data by imputation using predictive mean matching.
DataImp<-mice(Soybean, m=1, method= "pmm", print=F) #predictive mean matching
## Warning: Number of logged events: 339
DataImp<-complete(DataImp) #Extract imputed data
result<-DataImp|>
summarise((across(everything(),~sum(is.na(.)))))
result # shows zero missing values confirming imputation was successful
## Class date plant.stand precip temp hail crop.hist area.dam sever seed.tmt
## 1 0 0 0 0 0 0 0 0 0 0
## germ plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf
## 1 0 0 0 0 0 0 0 0
## leaf.mild stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 1 0 0 0 0 0 0 0
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth
## 1 0 0 0 0 0 0 0
## seed.discolor seed.size shriveling roots
## 1 0 0 0 0