The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
data(Glass)
glasspredictors <- Glass[,1:9]
glasspredictors %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()+
ggtitle("Glass Predictor Variables - Histograms")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
glasspredictors %>%
gather() %>%
ggplot(aes(value)) +
geom_density() +
facet_wrap(~key, scales = 'free')+
ggtitle("Glass Predictor Variables - Histograms")pairs(glasspredictors, main="Glass Predictor Variables - Pairs Plot") ### Correlation plot
r <-cor(glasspredictors)
corrplot.mixed(r,
lower.col = "black",
number.cex = .7,
title="Glass Predictor Variables - Correlation Plot",
mar=c(0,0,1,0)) From the above plots, we can see that some of the vatiables are reasonably well centered (Al, Na), some of them are skewed (Mg) and there are also a few of them that seems to have a high proportion of zero or near-zero weights (Fe, Ba) Relationships: The pairs plot doesn’t show any non-linear relationships In the correlation plot tells us that Ri seems to have the highest degree of relation. Ri & CA sharing the highest positive correlation and Ri and Mg having the lowest correlation.
df <- Glass #copy to new data frame so can edit without changing original
df[is.na(df)] <- 0 # replace missing values with zero
indx <- sapply(df, is.factor)
df[indx] <- lapply(df[indx], function(x) as.numeric(as.character(x))) f <-colnames(df)
par(mfrow=c(2,5)) # set box plots to group in 2 rows of 5 columns each
for (i in 1:length(f)){boxplot(df[,i],main = f[i])} #plot box plotsBox plots indicate presence of outliers in all elements except Mg. I would expect outliers due to impurities introduced in the glass manufacturing process.
As shown in the density plots above, other than Si most of the element distributions are right skewed.
Are there any relevant transformations of one or more predictors that might improve the classification model? Since Be, Fe, and K have a strong right skewness with a concentrations of points with low values, they may benefit from a log transformation. Mg may also be log transformed since it is left skewed. Box-cox could be used to address the skew.
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes ### A Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
data(Soybean)
#number of unique values per col
incl.nas <- sapply(sapply(Soybean,unique),length)
no.nas <- sapply(sapply(Soybean[complete.cases(Soybean),],unique),length)
r <- t(rbind(incl.nas,no.nas))
row.names(r) <- colnames(Soybean)
kable(r)| incl.nas | no.nas | |
|---|---|---|
| Class | 19 | 15 |
| date | 8 | 7 |
| plant.stand | 3 | 2 |
| precip | 4 | 3 |
| temp | 4 | 3 |
| hail | 3 | 2 |
| crop.hist | 5 | 4 |
| area.dam | 5 | 4 |
| sever | 4 | 3 |
| seed.tmt | 4 | 3 |
| germ | 4 | 3 |
| plant.growth | 3 | 2 |
| leaves | 2 | 2 |
| leaf.halo | 4 | 3 |
| leaf.marg | 4 | 3 |
| leaf.size | 4 | 3 |
| leaf.shread | 3 | 2 |
| leaf.malf | 3 | 2 |
| leaf.mild | 4 | 3 |
| stem | 3 | 2 |
| lodging | 3 | 2 |
| stem.cankers | 5 | 4 |
| canker.lesion | 5 | 4 |
| fruiting.bodies | 3 | 2 |
| ext.decay | 4 | 2 |
| mycelium | 3 | 2 |
| int.discolor | 4 | 3 |
| sclerotia | 3 | 2 |
| fruit.pods | 5 | 3 |
| fruit.spots | 5 | 4 |
| seed | 3 | 2 |
| mold.growth | 3 | 2 |
| seed.discolor | 3 | 2 |
| seed.size | 3 | 2 |
| shriveling | 3 | 2 |
| roots | 4 | 3 |
The table above shows the unique-value-count by variable. Based on this table it does not appear as though there are any variables with degenerate distributions of 2 values present in the data.
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Soybean.incomplete <- Soybean[!complete.cases(Soybean),]
missing.cols <- Soybean.incomplete %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.)))) ## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
missing.cols <- t(missing.cols/nrow(Soybean))
missing.cols <- missing.cols[order(-missing.cols),]
kable(missing.cols)| x | |
|---|---|
| hail | 0.1771596 |
| sever | 0.1771596 |
| seed.tmt | 0.1771596 |
| lodging | 0.1771596 |
| germ | 0.1639824 |
| leaf.mild | 0.1581259 |
| fruiting.bodies | 0.1551977 |
| fruit.spots | 0.1551977 |
| seed.discolor | 0.1551977 |
| shriveling | 0.1551977 |
| leaf.shread | 0.1464129 |
| seed | 0.1346999 |
| mold.growth | 0.1346999 |
| seed.size | 0.1346999 |
| leaf.halo | 0.1229868 |
| leaf.marg | 0.1229868 |
| leaf.size | 0.1229868 |
| leaf.malf | 0.1229868 |
| fruit.pods | 0.1229868 |
| precip | 0.0556369 |
| stem.cankers | 0.0556369 |
| canker.lesion | 0.0556369 |
| ext.decay | 0.0556369 |
| mycelium | 0.0556369 |
| int.discolor | 0.0556369 |
| sclerotia | 0.0556369 |
| plant.stand | 0.0527086 |
| roots | 0.0453880 |
| temp | 0.0439239 |
| crop.hist | 0.0234261 |
| plant.growth | 0.0234261 |
| stem | 0.0234261 |
| date | 0.0014641 |
| area.dam | 0.0014641 |
| Class | 0.0000000 |
| leaves | 0.0000000 |
case.count <- Soybean.incomplete %>%
group_by(Class) %>%
tally()
na.count <- aggregate(Soybean.incomplete, list(Soybean.incomplete$Class), function(x) sum(is.na(x)))
case.count$NAs <- data.frame(rowSums(na.count[2:ncol(na.count)]))
colnames(case.count) <- c("Class","Incomeplete.Cases", "NA.Values")
case.count$NA.Per.Case <- case.count$NA.Values / case.count$Incomeplete.Cases
kable(case.count)| Class | Incomeplete.Cases | NA.Values | NA.Per.Case |
|---|---|---|---|
| 2-4-d-injury | 16 | 450 | 28.12500 |
| cyst-nematode | 14 | 336 | 24.00000 |
| diaporthe-pod-&-stem-blight | 15 | 177 | 11.80000 |
| herbicide-injury | 8 | 160 | 20.00000 |
| phytophthora-rot | 68 | 1214 | 17.85294 |
Variables There does seem to be a pattern in some of the variables which are missing.
crop damage (hail, lodging, severe weather) appear to be among the most common missing variables (~18%)
Classes Of the 19 classes, the missing data appear to be confined to only 5 distinct classes. We see that of these, the majority of the missing data are in phytophthora-rot.
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Impute Zeros
Soybean$hail[is.na(Soybean$hail)] <- 0
Soybean$sever[is.na(Soybean$hail)] <- 0Knn Impute For the remaining data we’ll use KNN (k=10) to impute.
df <- data.frame(Soybean)
Soybean.impute <- knnImputation(df, k = 10, scale = T, meth = "mode",
distData = NULL)
nrow(Soybean.impute[!complete.cases(Soybean.impute),])## [1] 0
The number of incomplete cases is now 0.