DATA624: Homework 4
library(tidyverse)
library(GGally)
library(corrplot)
library(ggpubr)
library(naniar)
library(mice)
Task
Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.
3.1
The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Numeric Data
- Al - right skewed
- Ba - right skewed, outlier
- Ca - right skewed, outlier
- Fe - right skewed, outlier
- K - right skewed, outlier, bimodal
- Mg - left skewed, bimodal
- Na - Close to near normal
- RI - right skewed
- Si - left skewed
%>%
Glass gather(-c(Type), key = variable, value = value) %>%
ggplot(., aes(x = value)) +
geom_histogram(aes(x=value, y = ..density..), bins = 30, fill="#69b3a2", color="#e9ecef") +
geom_density(aes(x=value), color='red', lwd = 1.25) +
facet_wrap(~variable, scales ="free", ncol = 3) +
ggtitle("Distribution of Numeric Data")
Correlation Matrix
Most of the variables have an inverse relationship.
- Strong Negative Relationships
- Ca / Mg
- RI / Si
- RI / Al
- Mg / Al
- Mg / Ba
- Strong Positive Relationships
- Ca / RI
- K / Al
- Al / Ba
- Na / Ba
<- Glass[,-c(10)]
num.data corrplot(cor(num.data), method = 'shade', order = 'AOE',col= colorRampPalette(c("red","tan", "blue"))(10) , type = 'lower', diag = FALSE)
Categorical Data
%>%
Glass ggplot(aes(x = Type)) + geom_histogram(stat="count", fill="#69b3a2", color="#e9ecef") +
ggtitle("Distribution of Categorical Variable - Type")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
There does appear to be ouliers in some of the variables in the
dataset. Ba
, Ca
, Fe
,
K
, Mg
, and Na
appear to have
observations that are outliers to the rest of the variable. There are
also predictors that have a skewed data distribution. Ca
,
Ba
, Na
, and RI
are right skewed.
Mg
and Si
are left skewed.
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Transformations like Box-Cox, log, square root, and inverse can
improve the data distributions of heavily skewed data. Variables like
Ba
and Fe
may benefit from a log
transformation as they are heavy right skews. Lesser skews can maybe
benefit from a square root transformation. Mg
can maybe be
improved via a square root transformation. The transformation below
shows that it improved slightly but still suffers from the bimodal
nature of the variable. Another technique that could be used is
Centering and Scaling to normalize the inputs. Certain models are
sensitive to input scale and require center scaling to successfully
train the model. Removing variables that may have multicollinarity can
prove beneficial to the model. RI
appears to be highly
correlated with multiple predictor variables that could indicate that
multicollinarity exists. It may be best to remove RI
.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Looking at the data, mycelium
appears to have a
degenerate distribution. mycelium
only contains 6
observations with the value 1 and 639 with the value 0, providing very
little information about how mycelium
is changes the target
variable when it equals 0. For this reason it is best to remove the
variable.
= 0
count for(i in 1:length(names(Soybean))){
if(is.factor(Soybean[,i])){
= count + 1
count
}
}
paste0("Of the 36 variables in Soybean, ", count, " are categorical predictors.")
## [1] "Of the 36 variables in Soybean, 36 are categorical predictors."
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Dropping rows with missing data is not always the best solution. In many cases imputing is effective, as deleting data could cause the loss of important information that is needed to generalize to an unknown population. To handle the missing values I will us the MICE imputation algorithm to generate data.
<- mice(Soybean, m=2, maxit = 10, method = 'pmm', seed = 15) imputed_Data
## Warning: Number of logged events: 1344
<- complete(imputed_Data, 2) completed.data
summary(completed.data)
## Class date plant.stand precip temp hail
## brown-spot : 92 0: 26 0:363 0:105 0: 94 0:464
## alternarialeaf-spot: 91 1: 75 1:320 1:112 1:379 1:219
## frog-eye-leaf-spot : 91 2: 93 2:466 2:210
## phytophthora-rot : 88 3:118
## anthracnose : 44 4:131
## brown-stem-rot : 44 5:150
## (Other) :233 6: 90
## crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo
## 0: 71 0:124 0:222 0:389 0:174 0:449 0: 77 0:250
## 1:166 1:227 1:385 1:236 1:213 1:234 1:606 1: 36
## 2:222 2:145 2: 76 2: 58 2:296 2:397
## 3:224 3:187
##
##
##
## leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging
## 0:440 0:135 0:566 0:561 0:643 0:305 0:641
## 1: 22 1:327 1:117 1:122 1: 20 1:378 1: 42
## 2:221 2:221 2: 20
##
##
##
##
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 0:403 0:340 0:578 0:509 0:677 0:619
## 1: 39 1: 83 1:105 1:139 1: 6 1: 44
## 2: 36 2:177 2: 35 2: 20
## 3:205 3: 83
##
##
##
## sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 0:625 0:407 0:396 0:491 0:524 0:610 0:615
## 1: 58 1:130 1: 75 1:192 1:159 1: 73 1: 68
## 2: 38 2:102
## 3:108 4:110
##
##
##
## shriveling roots
## 0:643 0:582
## 1: 40 1: 86
## 2: 15
##
##
##
##
<- completed.data %>%
missing.values gather(key = "key", value = "val") %>%
mutate(isna = is.na(val)) %>%
group_by(key) %>%
mutate(total = n()) %>%
group_by(key, total, isna) %>%
summarise(num.isna = n()) %>%
mutate(pct = num.isna / total * 100)
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'key', 'total'. You can override using the
## `.groups` argument.
<-
levels %>% arrange(desc(pct)))$key
(missing.values
<- missing.values %>%
percentage.plot ggplot() +
geom_bar(aes(x = reorder(key, desc(pct)),
y = pct, fill=isna),
stat = 'identity', alpha=0.8) +
scale_x_discrete(limits = levels) +
scale_fill_manual(name = "",
values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
coord_flip() +
labs(title = "Percentage of missing values", x =
'Variable', y = "% of missing values")
percentage.plot