DATA 624 - HOMEWORK 4
library(tidyverse)
library(corrplot)
library(missForest)
library(ggthemes)
library(psych)
library(naniar)
library(DMwR)
library(mlbench) # provides the Glass and Soybean data sets
1 Question - 3.1
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
(a.) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
(b.) Do there appear to be any outliers in the data? Are any predictors skewed?
(c.) Are there any relevant transformations of one or more predictors that might improve the classification model?
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
1.1 (a)
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Answer:
data <- Glass %>% select(-Type)
data %>%
gather(key = 'Predictor', value = 'Value') %>%
ggplot(aes(x=Value)) +
geom_histogram(bins=30) +
facet_wrap(~Predictor,scales = "free") +
theme_hc()+
ggtitle('Histogram: Glass')
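The histograms cover the distributions; the relationships between predictors can be explored with a correlation matrix. The following is a minimal sketch, assuming the `Glass` data comes from the `mlbench` package and using `corrplot`, which is already loaded above. The object names (`glass_predictors`, `glass_cor`, `top_pair`) are illustrative only.

```r
# Sketch: pairwise correlations among the nine Glass predictors.
library(mlbench)   # assumed source of the Glass data
library(corrplot)

data(Glass)
glass_predictors <- Glass[, names(Glass) != "Type"]

# Pearson correlation matrix of the numeric predictors
glass_cor <- cor(glass_predictors)

# Upper-triangle plot with the coefficient printed in each cell
corrplot(glass_cor, method = "number", type = "upper")

# Most strongly correlated pair (by absolute value), ignoring the diagonal
off_diag <- glass_cor
diag(off_diag) <- 0
top_pair <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]
rownames(glass_cor)[top_pair]
```

The pair returned at the end is the natural candidate for the collinearity discussion in part (c).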
1.2 (b)
Do there appear to be any outliers in the data? Are any predictors skewed?
Answer:
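The boxplots below show points beyond the whiskers for every predictor except Mg, so all predictors except Mg appear to have outliers.
From the histograms in (a), every predictor is skewed to some degree. The skew is strongest where the summary statistics show a long tail: for example, K has a third quartile of 0.61 but a maximum of 6.21, and Ba has a median of 0 but a maximum of 3.15 (right skew), while Mg has a mean (2.685) well below its median (3.480) (left skew).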
data %>%
gather(key = 'Predictor', value = 'Value') %>%
ggplot(aes(x=Value, y = Predictor)) +
geom_boxplot()+
facet_wrap(~Predictor, scales = 'free')+
theme_hc()
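The visual impression can be backed by numbers. The sketch below computes a moment-based sample skewness and counts the points that the standard boxplot rule (1.5 times the interquartile range beyond the hinges) would flag. It assumes `Glass` comes from `mlbench`; the helper functions `sample_skewness` and `outlier_count` are hypothetical names introduced here, not part of any package.

```r
# Sketch: per-predictor skewness and boxplot-rule outlier counts for Glass.
library(mlbench)   # assumed source of the Glass data
data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Moment-based skewness: third central moment over variance^(3/2)
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Number of points beyond the boxplot whiskers (same rule as geom_boxplot's default)
outlier_count <- function(x) length(boxplot.stats(x)$out)

diagnostics <- data.frame(
  Predictor = names(predictors),
  Skewness  = sapply(predictors, sample_skewness),
  Outliers  = sapply(predictors, outlier_count),
  row.names = NULL
)

# Most skewed predictors first
diagnostics[order(-abs(diagnostics$Skewness)), ]
```

A positive skewness indicates a long right tail and a negative value a long left tail; values near zero indicate a roughly symmetric distribution.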
1.3 (c)
Are there any relevant transformations of one or more predictors that might improve the classification model?
Answer:
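- Targeting skewness, apply a Box-Cox transformation to make the skewed predictors more symmetric. Box-Cox requires strictly positive values, so the predictors that contain zeros (Mg, K, Ba, and Fe, each with a minimum of 0) need a different treatment first, such as a Yeo-Johnson transformation.
- Targeting collinearity, since RI and Ca have the highest correlation (0.81), reduce the predictors by either:
  - performing PCA after preprocessing the data (Box-Cox, centering, scaling, etc.); or
  - removing either RI or Ca, whichever has the higher mean correlation with the other predictors.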
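The transformation steps above can be sketched as follows. This is only an illustration under stated assumptions: `Glass` is taken from `mlbench`, the Box-Cox parameter is estimated with `MASS::boxcox` on an intercept-only model, and only Ca is transformed as an example. The names `has_zero`, `lambda_ca`, `ca_bc`, and `glass_pca` are hypothetical.

```r
# Sketch: Box-Cox for one strictly positive predictor, then PCA on all predictors.
library(mlbench)   # assumed source of the Glass data
data(Glass)
predictors <- Glass[, names(Glass) != "Type"]

# Box-Cox is only defined for strictly positive data, so check for zeros first
has_zero <- sapply(predictors, function(x) any(x <= 0))
names(has_zero)[has_zero]

# Profile the Box-Cox log-likelihood for Ca over a grid of lambda values
bc <- MASS::boxcox(Ca ~ 1, data = Glass,
                   lambda = seq(-3, 3, by = 0.1), plotit = FALSE)
lambda_ca <- bc$x[which.max(bc$y)]

# Apply the transformation (log when lambda is effectively zero)
ca_bc <- if (abs(lambda_ca) < 1e-8) {
  log(Glass$Ca)
} else {
  (Glass$Ca^lambda_ca - 1) / lambda_ca
}

# PCA on centered and scaled predictors as one way to address collinearity
glass_pca <- prcomp(predictors, center = TRUE, scale. = TRUE)
summary(glass_pca)$importance["Cumulative Proportion", ]
```

In a full workflow each eligible predictor would get its own lambda, and the PCA would be run on the transformed data rather than the raw predictors as done here for brevity.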
2 Question - 3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
(a.) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
(b.) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
(c.) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## Class date plant.stand precip temp
## 2-4-d-injury :16 0 : 26 0 :354 0 : 74 0 : 80
## alternarialeaf-spot :91 1 : 75 1 :293 1 :112 1 :374
## anthracnose :44 2 : 93 NA's: 36 2 :459 2 :199
## bacterial-blight :20 3 :118 NA's: 38 NA's: 30
## bacterial-pustule :20 4 :131
## brown-spot :92 5 :149
## brown-stem-rot :44 6 : 90
## charcoal-rot :20 NA's: 1
## cyst-nematode :14
## diaporthe-pod-&-stem-blight:15
## diaporthe-stem-canker :20
## downy-mildew :20
## frog-eye-leaf-spot :91
## herbicide-injury : 8
## phyllosticta-leaf-spot :20
## phytophthora-rot :88
## powdery-mildew :20
## purple-seed-stain :20
## rhizoctonia-root-rot :20
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
2.1 (a)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Answer:
According to this chapter, some models can be crippled by predictors with degenerate distributions, such as near-zero variance predictors. A rule of thumb for detecting near-zero variance predictors is:
- The fraction of unique values over the sample size is low (say 10%).
- The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20).
If both of these criteria are true and the model in question is susceptible to this type of predictor, it may be advantageous to remove the variable from the model.
In this dataset, three predictors meet both criteria: leaf.mild, mycelium, and sclerotia.
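Both criteria of the rule of thumb can be checked in a few lines of base R. This is a minimal sketch, assuming `Soybean` comes from `mlbench` and using the thresholds quoted above (10% and 20); `nzv_stats`, `nzv_table`, and `flagged` are hypothetical names introduced for illustration.

```r
# Sketch: near-zero variance check for every Soybean predictor.
library(mlbench)   # assumed source of the Soybean data
data(Soybean)

nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)   # table() ignores NA by default
  counts <- counts[counts > 0]                  # drop unused factor levels
  c(
    # Criterion 1: unique values as a percentage of the sample size
    percent_unique = 100 * length(counts) / length(x),
    # Criterion 2: most frequent value count over second most frequent
    freq_ratio = if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  )
}

# One row per predictor (column 1 is the outcome, Class)
nzv_table <- t(sapply(Soybean[, -1], nzv_stats))

flagged <- rownames(nzv_table)[
  nzv_table[, "percent_unique"] < 10 & nzv_table[, "freq_ratio"] > 20
]
flagged
```

Because every predictor is categorical with at most seven levels, the first criterion is met by all of them, so the frequency ratio is what separates the flagged predictors from the rest.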
Soybean %>%
select(1:18) %>%
gather(key = 'Predictor', value = 'Value', -Class) %>%
ggplot(aes(x=Value))+
geom_bar()+
facet_wrap(~Predictor, scales = 'free')+
ggtitle('Histogram:Soybean - 1')+
theme_hc()
Soybean %>%
select(1,19:36) %>%
gather(key = 'Predictor', value = 'Value', -Class) %>%
ggplot(aes(x=Value))+
geom_bar()+
facet_wrap(~Predictor, scales = 'free')+
ggtitle('Histogram:Soybean - 2')+
theme_hc()
# Fraction of unique (non-missing) values over the sample size, per predictor;
# keep the predictors where this fraction is below 10%
Soybean %>%
gather(key = 'Predictor', value = 'Value', -Class, na.rm = TRUE) %>%
group_by(Predictor) %>%
summarise(Uniq_Val_Frac = n_distinct(Value) / nrow(Soybean)) %>%
filter(Uniq_Val_Frac < 0.1)
# Predictors where the ratio of the frequency of the most prevalent value to that of the second most prevalent value is large (at least 20)
Soybean %>%
gather(key = 'Predictor', value = 'Value', -Class, na.rm = TRUE) %>%
group_by(Predictor, Value) %>%
tally(name='Cnt') %>%
arrange(Predictor, desc(Cnt)) %>%
mutate(id = row_number()) %>%
filter(id %in% c(1,2)) %>%
select(-Value) %>%
spread(key = 'id', value = 'Cnt') %>%
mutate(Ratio_1to2 = `1`/`2`) %>%
filter(Ratio_1to2 >=20) %>%
select(-`1`,-`2`)
## Warning: attributes are not identical across measure variables;
## they will be dropped
2.2 (b)
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Answer:
All predictors except leaves have missing values, and about half of them (19 of 35) have more than 75 missing values each. The predictors with the most missing values are sever, seed.tmt, lodging, and hail, with 121 each. The missing data is strongly related to the classes: only five classes contain missing values, namely phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.
Soybean %>%
gather(key = 'Predictor', value = 'Value', - Class) %>%
group_by(Class) %>%
summarise(NA_Cnt = sum(is.na(Value))) %>%
ggplot(aes(x=reorder(Class, NA_Cnt), y=NA_Cnt))+
geom_bar(stat='identity')+
coord_flip()+
theme_hc()+
ggtitle('Soybean: Missing Value Count by Class')+
ylab('NA Count')+
xlab('Class')
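The same pattern can be tabulated directly, which makes it easier to read exact counts than from the bar chart. A minimal base R sketch, assuming `Soybean` comes from `mlbench`; `na_by_predictor` and `na_by_class` are hypothetical names.

```r
# Sketch: missing-value counts by predictor and by class for Soybean.
library(mlbench)   # assumed source of the Soybean data
data(Soybean)

# Missing values per predictor, largest first (column 1 is Class)
na_by_predictor <- sort(colSums(is.na(Soybean[, -1])), decreasing = TRUE)
head(na_by_predictor, 4)

# Missing values per class: sum the per-row NA counts within each class
na_by_class <- tapply(rowSums(is.na(Soybean[, -1])), Soybean$Class, sum)
na_by_class[na_by_class > 0]

# Share of rows with at least one missing value
mean(!complete.cases(Soybean))
```

Classes that do not appear in the filtered `na_by_class` output are fully observed, so any imputation affects only the listed classes.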
2.3 (c)
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Answer:
1. Remove the near-zero variance predictors: leaf.mild, mycelium, and sclerotia.
2. Impute the remaining missing values with KNN, or alternatively with missForest.
Soybean %>%
select(-leaf.mild, -mycelium, -sclerotia) %>%
DMwR::knnImputation(k=5) %>%
gg_miss_var()
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
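The iteration log above comes from a `missForest` run whose call is not shown. The following is a hedged sketch of what such a call could look like, not a reconstruction of the original: it assumes `Soybean` comes from `mlbench`, imputes the predictors only (leaving `Class` out so the outcome does not inform the imputation), and uses `ntree = 50` and an arbitrary seed to keep the run short and repeatable. The names `soy_predictors`, `imputed`, and `soy_imputed` are illustrative.

```r
# Sketch: random-forest imputation of the Soybean predictors with missForest.
library(mlbench)      # assumed source of the Soybean data
library(missForest)
data(Soybean)

# Drop the outcome and the three near-zero variance predictors
drop_cols <- c("Class", "leaf.mild", "mycelium", "sclerotia")
soy_predictors <- Soybean[, !(names(Soybean) %in% drop_cols)]

set.seed(624)  # arbitrary seed; the forests are random
imputed <- missForest(soy_predictors, ntree = 50, verbose = FALSE)

# Completed data and the out-of-bag estimate of the imputation error
soy_imputed <- imputed$ximp
sum(is.na(soy_imputed))
imputed$OOBerror

# Re-attach the outcome for modeling
soy_model_data <- cbind(Class = Soybean$Class, soy_imputed)
```

For categorical data, `OOBerror` reports the proportion of falsely classified entries (PFC), which gives a rough sense of how reliable the imputed values are.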