library(readxl, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(fpp2, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(ggplot2)
library(gridExtra)
library(mlbench)
library(caret)
library(corrplot)
library(dplyr)
library(kableExtra)
library(e1071)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Based on the plots below, the predictors Na, Al, Si, and Ca look approximately normally distributed, while the other predictors are skewed. The correlation plot shows the relationships between predictors: Ca and RI are highly correlated, and there are both positive and negative correlations among the other variables.
library(mlbench)
data(Glass)

# Drop the class label (column 10) so only the nine predictors remain
glassData <- Glass[-10]

# Histogram of each predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
  hist(glassData[, col], main = colnames(glassData)[col],
       xlab = colnames(glassData)[col])
}
par(mfrow = c(1, 1))

# Correlation matrix of the predictors
corGlass <- cor(glassData)
corrplot(corGlass, method = 'number')
Do there appear to be any outliers in the data? Are any predictors skewed?
Based on the plots below, the predictors Na, Al, Si, and Ca look approximately normally distributed; the other predictors are skewed.
The table below displays the skewness of each variable.
The boxplots show that Mg is the only predictor without outliers.
# Histograms (skewness) for every predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
  hist(glassData[, col], main = colnames(glassData)[col],
       xlab = colnames(glassData)[col])
}

# Boxplots (outliers) for every predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
  boxplot(glassData[, col], main = colnames(glassData)[col],
          xlab = colnames(glassData)[col])
}
kable(apply(Glass[-10], 2, skewness)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Predictor | Skewness |
|---|---|
| RI | 1.6027151 |
| Na | 0.4478343 |
| Mg | -1.1364523 |
| Al | 0.8946104 |
| Si | -0.7202392 |
| K | 6.4600889 |
| Ca | 2.0184463 |
| Ba | 3.3686800 |
| Fe | 1.7298107 |
The histograms show the skewness of each predictor, and the boxplots show that outliers are present in most of them; a quick numeric check of the per-predictor outlier counts follows.
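As a cross-check on the boxplots, here is a minimal sketch (not part of the original analysis) that counts, for each predictor, how many points fall outside the default 1.5 × IQR whiskers:

# Number of points flagged as outliers by the standard boxplot rule
sapply(glassData, function(x) length(boxplot.stats(x)$out))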
Are there any relevant transformations of one or more predictors that might improve the classification model?
The plots below display the original data next to the Box-Cox transformed data. Based on these plots, we do not see a large reduction in skewness; the transformations are mainly useful for putting the variables on a similar scale, for example before PCA (a short PCA sketch follows the transformation code).
# Original vs. Box-Cox transformed histograms for each predictor.
# Box-Cox requires strictly positive data, so predictors containing zeros
# (e.g., Mg, K, Ba, Fe) are left untransformed by BoxCoxTrans().
par(mfrow = c(4, 2))
for (col in 1:ncol(glassData)) {
  hist(glassData[, col], main = colnames(glassData)[col],
       xlab = colnames(glassData)[col])
  t <- BoxCoxTrans(as.numeric(glassData[, col]))
  transformed <- predict(t, glassData[, col])
  hist(transformed, main = paste("after transform", colnames(glassData)[col]),
       xlab = colnames(glassData)[col])
}
# Center, scale, and Box-Cox transform all predictors at once
trans <- preProcess(Glass[-10], method = c("BoxCox", "center", "scale"))
transformedGlass <- predict(trans, Glass[-10])
head(transformedGlass)
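Following up on the note above about PCA, a minimal sketch (an illustration, not part of the original exercise) that runs PCA on the centered, scaled, Box-Cox transformed predictors produced above:

# PCA on the transformed predictors; the data are already centered and scaled
pcaGlass <- prcomp(transformedGlass)
summary(pcaGlass)   # proportion of variance explained by each component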
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data("Soybean")
Soybeandata <- Soybean[-1]
par(mfrow=c(4,5))
for (col in 2:ncol(Soybean)) {
hist( as.numeric(Soybean[,col]),main = colnames(Soybean)[col], xlab = colnames(Soybean)[col])
}
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Degenerate distributions are those where the predictor variable has a single unique value, or only a handful of unique values that occur with very low frequencies.
The "leaf.mild", "mycelium", and "sclerotia" predictors have near-zero variance; these could be dropped before modeling (see the sketch after the plots below).
# Checking for degenerate distributions
nearZeroVar(Soybeandata, names = TRUE)
## [1] "leaf.mild" "mycelium"  "sclerotia"
hist(as.numeric(Soybeandata[, "leaf.mild"]), main = "leaf.mild", xlab = "leaf.mild")
hist(as.numeric(Soybeandata[, "mycelium"]), main = "mycelium", xlab = "mycelium")
hist(as.numeric(Soybeandata[, "sclerotia"]), main = "sclerotia", xlab = "sclerotia")
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The hail, sever, seed.tmt, and lodging variables have the most missing values, while leaves, date, and area.dam have the fewest.
The table below shows the number of missing values for each variable.
# Count missing values per column, sorted in ascending order
missingCols <- sort(colSums(is.na(Soybean)))
kable(missingCols) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Variable | Missing values |
|---|---|
| Class | 0 |
| leaves | 0 |
| date | 1 |
| area.dam | 1 |
| crop.hist | 16 |
| plant.growth | 16 |
| stem | 16 |
| temp | 30 |
| roots | 31 |
| plant.stand | 36 |
| precip | 38 |
| stem.cankers | 38 |
| canker.lesion | 38 |
| ext.decay | 38 |
| mycelium | 38 |
| int.discolor | 38 |
| sclerotia | 38 |
| leaf.halo | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| leaf.malf | 84 |
| fruit.pods | 84 |
| seed | 92 |
| mold.growth | 92 |
| seed.size | 92 |
| leaf.shread | 100 |
| fruiting.bodies | 106 |
| fruit.spots | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.mild | 108 |
| germ | 112 |
| hail | 121 |
| sever | 121 |
| seed.tmt | 121 |
| lodging | 121 |
The class phytophthora-rot has the most missing data. The code below counts the number of incomplete samples in each class; a sketch that normalizes these counts by class size follows it.
Soybean %>%
mutate(Total = n()) %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n() ) %>%
select(Class, Missing ) %>%
unique()
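To check whether the pattern is really class-related rather than simply a reflection of class size, a minimal sketch (an extra check, not part of the original code) that computes the share of incomplete samples within each class:

# Share of incomplete samples within each class
Soybean %>%
  mutate(incomplete = !complete.cases(.)) %>%   # flag rows with any NA
  group_by(Class) %>%
  summarise(Missing = sum(incomplete),
            Total   = n(),
            Share   = round(Missing / Total, 2)) %>%
  arrange(desc(Share))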
Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The following methods could be used for imputation (a sketch of the first approach, adapted to categorical predictors, follows the list):

- Mean value imputation: each missing value is replaced with an imputed value equal to the mean of the observed data.
- Non-stochastic regression imputation: missing data are again replaced with imputed values, but the imputed values are predictions from a regression model fit to the observed data.
- Stochastic regression imputation: this extends regression imputation by adding a random component to the predictions so that the imputed values have the same variance as the observed values.
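Since the Soybean predictors are categorical, mean imputation translates into imputing the most frequent level (the mode). A minimal sketch, using a hypothetical imputeMode() helper introduced here for illustration:

# Replace NAs in a factor with its most frequent observed level
imputeMode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() drops NAs by default
  x[is.na(x)] <- mode_level
  x
}

SoybeanImputed <- Soybean
SoybeanImputed[-1] <- lapply(Soybean[-1], imputeMode)

sum(is.na(SoybeanImputed))   # should be 0 after imputation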