Exercise 3.1: The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
ggplot(gather(Glass[,1:9]), aes(value)) +
geom_histogram(bins = 9) +
facet_wrap(~key, scales = 'free_x')
x <- melt(Glass[,c(2:9)])
## No id variables; using all as measure variables
plt <- ggplot(data = x, aes(x = variable, y = value))
plt + geom_boxplot() + theme_minimal() + labs(x = "Element", y = "Percent (%)")
boxplot(Glass[,1], main = 'Boxplot of Refractive Index', ylab='Refrective Index (RI)')
corr <- cor(Glass[,1:9])
corrplot(corr, method = 'circle', order = 'AOE')
The Refractive Index and Calcium content seem to have an especially strong positive correlation.
A look at the predictors shows us that some are more evenly distributed within the sample than others.
Aluminum (Al), Calcium (Ca), Sodium (Na), and Silicon (Si) seem to be fairly normally distributed within the 214 observations here, though Al, Ca, and Na show slight right-skewedness; the Refractive Index (RI) also follows a fairly normal distribution with a right tail.
Barium (Ba), Iron (Fe), and Potassium (K) are strongly right-skewed, with a majority of the measurements falling on the low end of the presented values due to a large portion containing none of those elements. Since the percentage can’t be below 0, the distribution bunches up there.
Finally, Magnesium (Mg) presents a fairly unique distribution where most of the readings fall on the right end of the spectrum, but there is a healthy amount at readings of zero (0), presenting a multimodal distribution with peaks at 0 and approximately 3.5.
None of the predictors appear to have outliers, unless some of the zero values are due to a lack of data as opposed to actual lack of the element altogether. One way to test this is to create a column with the total percentages from these 8 elements and see if the zeroes in some columns correspond with numbers noticeably lower than 100% for the total:
Glass['Total'] = Glass[2]+ Glass[3]+ Glass[4]+ Glass[5]+ Glass[6]+ Glass[7]+ Glass[8]+ Glass[9]
boxplot(Glass[,11], main = 'Boxplot of Total Calculated Percentages', ylab='Percentage (%)')
It appears that the lowest total falls at around 99%, meaning 1% is unaccounted for in the 8 provided values. It is certainly possible that this is data missing from our 8 elements, though also likely that it is simply data accounted for by elements not listed in our table.
We can take the logarithm of the skewed predictors like Ba, Fe, and K. This could improve the accuracy of our classification model.
Exercise 3.2: The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
ggplot(gather(Soybean[,2:36], na.rm = TRUE), aes(value)) +
geom_histogram(stat='count') +
facet_wrap(~key, scales = 'free_x')
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Ignoring unknown parameters: binwidth, bins, pad
In order to determine potential degenerate distributions, we can use the nearZeroVar
function from the caret
package. It provides an evaluation of whether the variance of a predictor is near zero, suggesting that it may be degenerate.
nearZeroVar(Soybean[,2:36], names = TRUE)
## [1] "leaf.mild" "mycelium" "sclerotia"
We can see from thsi function that the three variables \(leaf.mild\), \(mycelium\), and \(sclerotia\) may be degenerate distributions.
sapply(Soybean, function(x) sum(is.na(x)))
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
From the table above we can actually see that there is missing data in every column except for Class
and leaves
, with date
, and area.dam
having 1 missing datapoint. We can see that the highest number of missing data is from hail
, sever
, lodging
, and seed.tmt
each of which is missing 121 datapoints in 683 observations, or 17.7%.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
missing <- Soybean %>%
select(everything()) %>%
group_by(Class) %>%
summarise_all(funs(sum(is.na(.)))) %>%
mutate(Total = select(.,date:roots) %>% rowSums())
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
missing[,c('Class','Total')] %>% arrange(-Total)
## # A tibble: 19 x 2
## Class Total
## <fct> <dbl>
## 1 phytophthora-rot 1214
## 2 2-4-d-injury 450
## 3 cyst-nematode 336
## 4 diaporthe-pod-&-stem-blight 177
## 5 herbicide-injury 160
## 6 alternarialeaf-spot 0
## 7 anthracnose 0
## 8 bacterial-blight 0
## 9 bacterial-pustule 0
## 10 brown-spot 0
## 11 brown-stem-rot 0
## 12 charcoal-rot 0
## 13 diaporthe-stem-canker 0
## 14 downy-mildew 0
## 15 frog-eye-leaf-spot 0
## 16 phyllosticta-leaf-spot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
From the table above we can see that the \(phytophthora-rot\) class has far an away the most missing data (though obviously multiple missing fields for the same observation exacerbates these numbers), followed by \(2-d-4-injury\) and \(cyst-nematode\). \(diaporthe-pod-&-stem-blight\) and \(herbicide-injury\) round out the list of classes with missing data, meaning that 14 of the classes don’t have any missing data.
One option to deal with missing data is to remove all observations with any missing data:
Soybean_clean <- Soybean[complete.cases(Soybean), ]
dim(Soybean_clean)
## [1] 562 36
This is a simple and quick solution, but it removes 18% of our observations, meaning any predictive model we develop will have nearly a fifth less data to work with. Let’s look at a more complicated (but les destructive) method of dealing with missing data: imputation.
A very popular package to deal with missing data imputation is Mice
:
md.pattern(Soybean, rotate.names = TRUE)
## Class leaves date area.dam crop.hist plant.growth stem temp roots
## 562 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1
## 8 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 1 0
## 6 1 1 1 1 1 1 1 1 0
## 14 1 1 1 1 1 1 1 0 1
## 15 1 1 1 1 0 0 0 0 0
## 1 1 1 0 0 0 0 0 0 0
## 0 0 1 1 16 16 16 30 31
## plant.stand precip stem.cankers canker.lesion ext.decay mycelium
## 562 1 1 1 1 1 1
## 13 1 1 1 1 1 1
## 55 1 1 1 1 1 1
## 8 1 0 0 0 0 0
## 9 1 1 1 1 1 1
## 6 0 1 1 1 1 1
## 14 0 0 0 0 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 36 38 38 38 38 38
## int.discolor sclerotia leaf.halo leaf.marg leaf.size leaf.malf fruit.pods
## 562 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 0
## 55 1 1 0 0 0 0 0
## 8 0 0 1 1 1 1 1
## 9 1 1 0 0 0 0 1
## 6 1 1 0 0 0 0 1
## 14 0 0 0 0 0 0 1
## 15 0 0 1 1 1 1 0
## 1 0 0 1 1 1 1 0
## 38 38 84 84 84 84 84
## seed mold.growth seed.size leaf.shread fruiting.bodies fruit.spots
## 562 1 1 1 1 1 1
## 13 0 0 0 1 0 0
## 55 0 0 0 0 0 0
## 8 0 0 0 1 0 0
## 9 1 1 1 0 1 1
## 6 1 1 1 0 1 1
## 14 1 1 1 0 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 92 92 92 100 106 106
## seed.discolor shriveling leaf.mild germ hail sever seed.tmt lodging
## 562 1 1 1 1 1 1 1 1 0
## 13 0 0 1 0 0 0 0 0 13
## 55 0 0 0 0 0 0 0 0 19
## 8 0 0 0 0 0 0 0 0 20
## 9 1 1 0 1 0 0 0 0 11
## 6 1 1 0 0 0 0 0 0 13
## 14 0 0 0 0 0 0 0 0 24
## 15 0 0 0 0 0 0 0 0 28
## 1 0 0 0 0 0 0 0 0 30
## 106 106 108 112 121 121 121 121 2337
We will impute using the Predictive Mean Matching (pmm) method, which picks a value from similar observations to assign to the missing datapoint.
imputed_Soy <- mice(Soybean, method = 'pmm', seed = 44, printFlag = FALSE)
## Warning: Number of logged events: 1675
summary(imputed_Soy)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## Class date plant.stand precip temp
## "" "pmm" "pmm" "pmm" "pmm"
## hail crop.hist area.dam sever seed.tmt
## "pmm" "pmm" "pmm" "pmm" "pmm"
## germ plant.growth leaves leaf.halo leaf.marg
## "pmm" "pmm" "" "pmm" "pmm"
## leaf.size leaf.shread leaf.malf leaf.mild stem
## "pmm" "pmm" "pmm" "pmm" "pmm"
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## "pmm" "pmm" "pmm" "pmm" "pmm"
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## "pmm" "pmm" "pmm" "pmm" "pmm"
## seed mold.growth seed.discolor seed.size shriveling
## "pmm" "pmm" "pmm" "pmm" "pmm"
## roots
## "pmm"
## PredictorMatrix:
## Class date plant.stand precip temp hail crop.hist area.dam sever
## Class 0 1 1 1 1 1 1 1 1
## date 1 0 1 1 1 1 1 1 1
## plant.stand 1 1 0 1 1 1 1 1 1
## precip 1 1 1 0 1 1 1 1 1
## temp 1 1 1 1 0 1 1 1 1
## hail 1 1 1 1 1 0 1 1 1
## seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## Class 1 1 1 1 1 1 1
## date 1 1 1 1 1 1 1
## plant.stand 1 1 1 1 1 1 1
## precip 1 1 1 1 1 1 1
## temp 1 1 1 1 1 1 1
## hail 1 1 1 1 1 1 1
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## Class 1 1 1 1 1 1
## date 1 1 1 1 1 1
## plant.stand 1 1 1 1 1 1
## precip 1 1 1 1 1 1
## temp 1 1 1 1 1 1
## hail 1 1 1 1 1 1
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## Class 1 1 1 1 1
## date 1 1 1 1 1
## plant.stand 1 1 1 1 1
## precip 1 1 1 1 1
## temp 1 1 1 1 1
## hail 1 1 1 1 1
## sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor
## Class 1 1 1 1 1 1
## date 1 1 1 1 1 1
## plant.stand 1 1 1 1 1 1
## precip 1 1 1 1 1 1
## temp 1 1 1 1 1 1
## hail 1 1 1 1 1 1
## seed.size shriveling roots
## Class 1 1 1
## date 1 1 1
## plant.stand 1 1 1
## precip 1 1 1
## temp 1 1 1
## hail 1 1 1
## Number of logged events: 1675
## it im dep meth
## 1 1 1 date pmm
## 2 1 1 plant.stand pmm
## 3 1 1 plant.stand pmm
## 4 1 1 precip pmm
## 5 1 1 precip pmm
## 6 1 1 temp pmm
## out
## 1 int.discolor2
## 2 Classcyst-nematode, Classdowny-mildew, leaf.mild2, sclerotia1, fruit.pods1
## 3 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
## 4 Classbrown-spot, Classcharcoal-rot, Classcyst-nematode, Classherbicide-injury, stem1, int.discolor2, fruit.pods1
## 5 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
## 6 Classbrown-spot, Classcyst-nematode, Classdiaporthe-pod-&-stem-blight, Classfrog-eye-leaf-spot, fruit.pods2
final_soy <- complete(imputed_Soy,4)
Now that we have imputed the data let’s confirm nothing is missing:
missing2 <- final_soy %>%
select(everything()) %>% # replace to your needs
group_by(Class) %>%
summarise_all(funs(sum(is.na(.)))) %>%
mutate(Total = select(.,date:roots) %>% rowSums())
missing2[,c('Class','Total')] %>% arrange(-Total)
## # A tibble: 19 x 2
## Class Total
## <fct> <dbl>
## 1 2-4-d-injury 0
## 2 alternarialeaf-spot 0
## 3 anthracnose 0
## 4 bacterial-blight 0
## 5 bacterial-pustule 0
## 6 brown-spot 0
## 7 brown-stem-rot 0
## 8 charcoal-rot 0
## 9 cyst-nematode 0
## 10 diaporthe-pod-&-stem-blight 0
## 11 diaporthe-stem-canker 0
## 12 downy-mildew 0
## 13 frog-eye-leaf-spot 0
## 14 herbicide-injury 0
## 15 phyllosticta-leaf-spot 0
## 16 phytophthora-rot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0