In this homework assignment I submit exercises 3.1 and 3.2 from Kuhn and Johnson's Applied Predictive Modeling.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
library(GGally)
library(tidyverse)
ggpairs(Glass, upper = list(continuous = wrap("cor", size = 2)))
Using the ggpairs function we can see the distributions of the predictors along the diagonal, the correlation coefficients for each pair of predictors in the upper triangle, and each pair's scatterplot in the lower triangle.
Some of the distribution plots show that certain elements have a strong tendency to register as 0: Ba, Fe, and potentially K. Because most samples sit at 0, the relatively few non-zero values for these elements may register as outliers. RI, Na, Al, Si, and Ca appear relatively normally distributed, while K, Ba, and Fe are very strongly right-skewed.
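These skew claims can also be checked numerically. As a quick sketch, the skewness function from the e1071 package (the same function the chapter uses) can be applied to each predictor; column 10 is the Type outcome and is excluded:

library(e1071)
# Sample skewness of each numeric predictor; large positive values
# confirm the strong right skews seen for K, Ba, and Fe
sapply(Glass[, -10], skewness)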
A Box-Cox transformation may help make the distributions of certain predictors, such as RI, Mg, Na, and Ca, more normal (noting that Box-Cox applies only to strictly positive values). Another transformation to consider is principal component analysis, which can reduce the number of collinear predictors such as Ca and RI.
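As a minimal sketch of both ideas, caret's preProcess can chain the two steps; it estimates Box-Cox only for predictors where the transformation is defined, then replaces the correlated predictors with orthogonal components. The object names pp and Glass_pca are illustrative:

library(caret)
# Box-Cox (where applicable), then center/scale, then PCA;
# column 10 is the Type outcome, so it is excluded
pp <- preProcess(Glass[, -10],
                 method = c("BoxCox", "center", "scale", "pca"))
Glass_pca <- predict(pp, Glass[, -10])
head(Glass_pca)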
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
data("Soybean")
#help(Soybean)
par(mar = c(2, 2, 2, 2)) # Adjust margin size (bottom, left, top, right)
# Coerce each factor to its numeric level codes and plot a histogram of
# level frequencies; column 1 is the Class outcome, so it is skipped
for (col in 2:ncol(Soybean)) {
  hist(as.numeric(Soybean[, col]),
       main = colnames(Soybean)[col],
       xlab = colnames(Soybean)[col])
}
## Using the nearZeroVar function from the chapter to return the names of predictors that might be degenerate
library(caret)
nearZeroVar(Soybean, names = TRUE)
## [1] "leaf.mild" "mycelium" "sclerotia"
Degenerate distributions, in the sense discussed in the chapter, are distributions of predictors with zero variance (a predictor variable with a single unique value) or near-zero variance. There are a few near-zero-variance predictors here that can be considered degenerate: leaf.mild, mycelium, and sclerotia.
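To see why these three are flagged, nearZeroVar's saveMetrics argument (part of caret's API) returns the frequency-ratio and percent-unique metrics behind the flag; a quick sketch:

# Inspect the metrics that drive the near-zero-variance decision
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ] # only the flagged predictors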
I would not eliminate predictors, as there are NA values across almost all of the predictors, and no single predictor appears to be a large outlier in terms of its NA count.
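A quick way to verify that missingness is spread across the predictors (a sketch):

# Count NAs per column, most-missing first
sort(colSums(is.na(Soybean)), decreasing = TRUE)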
For handling this data set's missing values we first have to remember that the predictor variables are categorical, not continuous, so a numeric mean or median is not meaningful. I would instead impute the most frequent (modal) value within similar groups; for example, if a row is missing its fruit.pods value, we could fill it with the most common fruit.pods value among other rows with the same Class, date, etc. A sketch of this idea follows.
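The following is a minimal sketch of mode imputation, assuming (as a simplification of the "same Class, date, etc." idea) that grouping by Class alone is enough to define similar rows; impute_mode and Soybean_imputed are hypothetical names:

library(tidyverse)

# Replace NAs in a factor with the group's most frequent non-NA level
impute_mode <- function(x) {
  tab <- table(x)                 # table() drops NAs for factors
  if (length(tab) == 0) return(x) # every value missing in this group
  replace(x, is.na(x), names(which.max(tab)))
}

Soybean_imputed <- Soybean |>
  group_by(Class) |>
  mutate(across(everything(), impute_mode)) |>
  ungroup()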