Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .pdf for your run code.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
I have used histograms and the correlation plots to show relationships between predictors.For instance RI has a strong positive correlation with Ca. Multicollinearity may have an impact during model build and we want to reduce where possible to appropriate features so that we don’t have redundant features in the model. Here’s we can use GGlally , pairsplot to view the correlation between variables.This is a good tool to understand multicollinearity.
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
df <- Glass %>%
dplyr::select(-Type)
df1 <- df %>%
gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(df1) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=4)# Select numeric columns only
M<-cor(df)
ggcorrplot(M, type = "upper", outline.color = "white",
ggtheme = theme_classic,
#colors = c("#6D9EC1", "white", "#E46726"),
lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)Do there appear to be any outliers in the data? Are any predictors skewed?
Yes, several variables have outliers see boxplot (Ca, FE, K all have outliers), see histogram of RI below, shows outliers on the right. Additionally, right skew variables have large values on the left side of the distribution. As the distribution become right skewed, the skewness becomes larger. Here we can see from the skewness calculation that MG and SI have negative skewness, whereas the rest of the variables have a positive skewness.
From the book - “A general rule of thumb to consider is that skewed data whose ratio of the highest value to the lowest value is greater than 20 have significant skewness. Also, the skewness statistic can be used as a diagnostic. If the predictor distribution is roughly symmetric, the skewness values will be close to zero. As the distribution becomes more right skewed, the skewness statistic becomes larger. Similarly, as the distribution becomes more left skewed, the value becomes negative.”
#3.1 UC Irvine Glass Analysis
# Identify missing data by Feature and display percent breakout
missing <- colSums(Glass %>% sapply(is.na))
missing_pct <- round(missing / nrow(Glass) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))## $RI
## Skewness | SE
## ----------------
## 1.625 | 0.165
##
## $Na
## Skewness | SE
## ----------------
## 0.454 | 0.165
##
## $Mg
## Skewness | SE
## ----------------
## -1.153 | 0.165
##
## $Al
## Skewness | SE
## ----------------
## 0.907 | 0.165
##
## $Si
## Skewness | SE
## ----------------
## -0.730 | 0.165
##
## $K
## Skewness | SE
## ----------------
## 6.552 | 0.165
##
## $Ca
## Skewness | SE
## ----------------
## 2.047 | 0.165
##
## $Ba
## Skewness | SE
## ----------------
## 3.416 | 0.165
##
## $Fe
## Skewness | SE
## ----------------
## 1.754 | 0.165
# Boxplots for each variable
ggplot(df1, aes(variable, value)) +
geom_boxplot() +
facet_wrap(. ~variable, scales='free', ncol=6)Are there any relevant transformations of one or more predictors that might improve the classification model?
Yes, to reduce skewness, we can apply centering and scaling. We can subtract values from the mean to center. Also divide by SD to normalize the data. Here I have tried boxcox and log transformations and to check and see if we can minimize the skewness where possible for few variables on the histogram.
#Transform using boxcox.
Glass$GlassMgt_t <- BoxCox(Glass$Mg, BoxCox.lambda(Glass$Mg))
Glass$GlassRI_t <- BoxCox(Glass$RI, BoxCox.lambda(Glass$RI))
Glass$GlassSI_t <- BoxCox(Glass$Si, BoxCox.lambda(Glass$Si))
Glass$GlassMgt_lt <- log(Glass$Mg)
Glass$Glasssi_lt <- log(Glass$Si)
hist(Glass$GlassMgt_t)Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
MYCELIUM , ROOTS, SCLEROTIA, INT.DISCOLOR seem to take on 1 unique value. These are zero variance predictor therefore, these can cause imbalance and affect our model have examples of degenerate distribution. We could potentially remove these variables as one option if they are not informational and value added when building our model. There are many variables with NA’S and missing which could be structurally missing or missing from certain classes that contribute to informative missingness.
Degenerate distributions :We can see from the textbook the following.A predictor variable that has a single unique value can be a zero variance predictor. We can remove a variable here as these may be uninformative and not impact calculations.Similarly, some predictors might have only a handful of unique values that occur with very low frequencies. These “near-zero variance.The fraction of unique values over the sample size is low (say 10%). The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20). If both of these criteria are true and the model in question is susceptible to a predictor that can cause imbalance and degenerate distribution.
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
columns <- colnames(Soybean)
#Example histograms
lapply('sever',
function(col) {
ggplot(Soybean,
aes_string(col)) + geom_bar() + coord_flip() + ggtitle(col)})## [[1]]
lapply('mycelium',
function(col) {
ggplot(Soybean,
aes_string(col)) + geom_bar() + coord_flip() + ggtitle(col)})## [[1]]
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
There are many variables with NA’S and missing which could be structurally missing.Missing can also be related to the outcome. For instance, customer rating values can be called “informative missingness” since only strong opinions push them to rate a product or movie for instance. Such type of missing info can bias a bodel.
We can see that few classes represent all of the missing data. These classes are 2-4-d injury, cyst-nematode,diaporthe-pod-&-stem-blight, herbicide-injury,phytopythora-rot. Looking at the count of the classes and the class frequency, these class type of soybean plant are less frequent types of classes that have missingness in data.
Soybean %>%
gather(variable, value) %>%
filter(is.na(value)) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(df) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
rename(`Variable Missing Data` = variable,
`Number of Records` = n,
`Share of Total` = percent) %>%
kable() %>%
kable_styling()| Variable Missing Data | Number of Records | Share of Total |
|---|---|---|
| hail | 121 | 57% |
| lodging | 121 | 57% |
| seed.tmt | 121 | 57% |
| sever | 121 | 57% |
| germ | 112 | 52% |
| leaf.mild | 108 | 50% |
| fruit.spots | 106 | 50% |
| fruiting.bodies | 106 | 50% |
| seed.discolor | 106 | 50% |
| shriveling | 106 | 50% |
| leaf.shread | 100 | 47% |
| mold.growth | 92 | 43% |
| seed | 92 | 43% |
| seed.size | 92 | 43% |
| fruit.pods | 84 | 39% |
| leaf.halo | 84 | 39% |
| leaf.malf | 84 | 39% |
| leaf.marg | 84 | 39% |
| leaf.size | 84 | 39% |
| canker.lesion | 38 | 18% |
| ext.decay | 38 | 18% |
| int.discolor | 38 | 18% |
| mycelium | 38 | 18% |
| precip | 38 | 18% |
| sclerotia | 38 | 18% |
| stem.cankers | 38 | 18% |
| plant.stand | 36 | 17% |
| roots | 31 | 14% |
| temp | 30 | 14% |
| crop.hist | 16 | 7.5% |
| plant.growth | 16 | 7.5% |
| stem | 16 | 7.5% |
| area.dam | 1 | 0.5% |
| date | 1 | 0.5% |
missing <- Soybean %>%
group_by(Class) %>%
miss_var_summary()
ggplot(missing, aes(Class, variable,fill=n_miss)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) + coord_flip()Develop a strategy for handling missing data, either by eliminating predictors or imputation.
The book talks about using KNN imputation. Here, I’ve used mice imputation and repeated the visualizations after imputations to show some of the density plots.”MICE stands for Multivariate Imputation by Chained Equations. It is a statistical method used to handle missing data in a dataset. The process uses multiple imputation techniques to fill in the missing data and then combines the results from multiple imputations to produce a final imputed dataset. The goal of MICE is to preserve the relationship between variables in the original data and to reduce the bias introduced by imputing missing values.” It looks like after imputation the NA’s have been transferred to 0 or 1 for Hail for instance.
Imputation can also be performed by looking at closely correlated values and imputed similar values for the correlated variables as mentioned in the book. However, we would have to be careful with correlated values, as these can cause unstable models, numerical errors and inaccurate predictive performance due to multicollinearity.
We can remove predictors as well and reduce predictors. If we opted to go down this path, we could remove variables such as MYCELIUM, Roots etc.. that may take on only 1 unique value, additionally we could look at the 5 classes mentioned above that have a large number of missing info which may be meaningful in its missing data. We can look at that pattern to see any predictors can be dropped.
set.seed(12345)
# Perform Multiple Imputation
imputed_data_df <- mice(Soybean, m=5, method='pmm', print=FALSE)
completedData <- complete(imputed_data_df,1)
summary(completedData) %>%
kable() %>%
kable_styling()| Class | date | plant.stand | precip | temp | hail | crop.hist | area.dam | sever | seed.tmt | germ | plant.growth | leaves | leaf.halo | leaf.marg | leaf.size | leaf.shread | leaf.malf | leaf.mild | stem | lodging | stem.cankers | canker.lesion | fruiting.bodies | ext.decay | mycelium | int.discolor | sclerotia | fruit.pods | fruit.spots | seed | mold.growth | seed.discolor | seed.size | shriveling | roots | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| brown-spot : 92 | 0: 26 | 0:390 | 0: 82 | 0: 80 | 0:444 | 0: 65 | 0:124 | 0:315 | 0:410 | 0:169 | 0:450 | 0: 77 | 0:276 | 0:386 | 0: 51 | 0:562 | 0:638 | 0:643 | 0:312 | 0:520 | 0:409 | 0:350 | 0:489 | 0:508 | 0:674 | 0:619 | 0:625 | 0:423 | 0:348 | 0:499 | 0:616 | 0:535 | 0:556 | 0:559 | 0:551 | |
| alternarialeaf-spot: 91 | 1: 76 | 1:293 | 1:112 | 1:382 | 1:239 | 1:166 | 1:227 | 1:323 | 1:228 | 1:219 | 1:233 | 1:606 | 1: 36 | 1: 21 | 1:351 | 1:121 | 1: 45 | 1: 20 | 1:371 | 1:163 | 1: 39 | 1: 83 | 1:194 | 1:161 | 1: 9 | 1: 44 | 1: 58 | 1:130 | 1: 86 | 1:184 | 1: 67 | 1:148 | 1:127 | 1:124 | 1: 86 | |
| frog-eye-leaf-spot : 91 | 2: 93 | NA | 2:489 | 2:221 | NA | 2:219 | 2:145 | 2: 45 | 2: 45 | 2:295 | NA | NA | 2:371 | 2:276 | 2:281 | NA | NA | 2: 20 | NA | NA | 2: 36 | 2:177 | NA | 2: 14 | NA | 2: 20 | NA | 2: 14 | 2: 58 | NA | NA | NA | NA | NA | 2: 46 | |
| phytophthora-rot : 88 | 3:118 | NA | NA | NA | NA | 3:233 | 3:187 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3:199 | 3: 73 | NA | NA | NA | NA | NA | 3:116 | 4:191 | NA | NA | NA | NA | NA | NA | |
| anthracnose : 44 | 4:131 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| brown-stem-rot : 44 | 5:149 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | |
| (Other) :233 | 6: 90 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
#Analysis of Missing values on the imputed dataset
missing2 <- colSums(completedData %>% sapply(is.na))
missing_pct2 <- round(missing2 / nrow(completedData) * 100, 2)
stack(sort(missing_pct2, decreasing = TRUE))## Class date plant.stand precip temp hail
## brown-spot : 92 0: 26 0:390 0: 82 0: 80 0:444
## alternarialeaf-spot: 91 1: 76 1:293 1:112 1:382 1:239
## frog-eye-leaf-spot : 91 2: 93 2:489 2:221
## phytophthora-rot : 88 3:118
## anthracnose : 44 4:131
## brown-stem-rot : 44 5:149
## (Other) :233 6: 90
## crop.hist area.dam sever seed.tmt germ plant.growth leaves leaf.halo
## 0: 65 0:124 0:315 0:410 0:169 0:450 0: 77 0:276
## 1:166 1:227 1:323 1:228 1:219 1:233 1:606 1: 36
## 2:219 2:145 2: 45 2: 45 2:295 2:371
## 3:233 3:187
##
##
##
## leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem lodging
## 0:386 0: 51 0:562 0:638 0:643 0:312 0:520
## 1: 21 1:351 1:121 1: 45 1: 20 1:371 1:163
## 2:276 2:281 2: 20
##
##
##
##
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 0:409 0:350 0:489 0:508 0:674 0:619
## 1: 39 1: 83 1:194 1:161 1: 9 1: 44
## 2: 36 2:177 2: 14 2: 20
## 3:199 3: 73
##
##
##
## sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 0:625 0:423 0:348 0:499 0:616 0:535 0:556
## 1: 58 1:130 1: 86 1:184 1: 67 1:148 1:127
## 2: 14 2: 58
## 3:116 4:191
##
##
##
## shriveling roots
## 0:559 0:551
## 1:124 1: 86
## 2: 46
##
##
##
##
# Density plots
ggplot(Soybean, aes(x=sever, fill="Original")) +
geom_density(alpha=0.5) +
geom_density(data=completedData, aes(x=sever, fill="imputed"), alpha=0.5) +
labs(title="Density Plot of Sever: Original vs. Imputed")# Density plots
ggplot(Soybean, aes(x=fruit.spots, fill="Original")) +
geom_density(alpha=0.5) +
geom_density(data=completedData, aes(x=fruit.spots, fill="imputed"), alpha=0.5) +
labs(title="Density Plot of leaf.shread: Original vs. Imputed")columns <- colnames(Soybean)
#After MICE IMPUTATION let's take a look at the histograms
lapply('sever',
function(col) {
ggplot(completedData,
aes_string(col)) + geom_bar() + coord_flip() + ggtitle(col)})## [[1]]
lapply('mycelium',
function(col) {
ggplot(completedData,
aes_string(col)) + geom_bar() + coord_flip() + ggtitle(col)})## [[1]]