Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.
3.1. The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
a- Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.0.5
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:car':
##
## some
## The following object is masked from 'package:caret':
##
## lift
data(Glass)
#str(Glass)
#view(Glass)
sum(is.na(Glass))
## [1] 0
sapply(Glass, class)
## RI Na Mg Al Si K Ca Ba
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## Fe Type
## "numeric" "factor"
Glass1 <- dplyr::select(Glass, -Type)
Glass %>%
head(05)%>%
kable()
RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
---|---|---|---|---|---|---|---|---|---|
1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0 | 1 |
1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0 | 1 |
1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0 | 1 |
1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0 | 1 |
1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0 | 1 |
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
Glass1 %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_density() + # as density
theme_dark()
cat("\nAnother way of looking at the data distribution for all 8 variables (Na, Mg, Al, Si, K, Ca, Ba, and Fe).")
##
## Another way of looking at the data distribution for all 8 variables (Na, Mg, Al, Si, K, Ca, Ba, and Fe).
df1 <- c(1:1, 2:9)
df2 <- Glass1[ , -10]
par(mfrow = c(3,5))
for (i in df1) {
#hist(X[ ,i], xlab = names(X[i]), main = names(X[i]))
d <- density(df2[,i])
plot(d, main = names(df2[i]))
polygon(d, col="blue")
}
# Not sure if I should convert the variable "Type" in numeric datatype
#hist(Glass$Type,main = " Distribution of predicted Elements Type ",xlab = "Probability of Element", col = 'blue')
out <- boxplot.stats(Glass1$Na)$out
boxplot(Glass1$Na,
ylab = "Value of element",
main = "Boxplot of 8 variables (Na, Mg, Al, Si, K, Ca, Ba, and Fe)"
)
mtext(paste("Outliers: ", paste(out, collapse = ", ")))
Outlier = boxplot(stack(Glass1), plot=TRUE)$out
library(outliers)
# df1a <- c(1:9)
# df2a <- Glass1[ , -10]
# for (i in df1a) {
# outlier(df2a$i)
# }
#
outlier(Glass1$Na)
## [1] 17.38
outlier(Glass1$Mg)
## [1] 0
outlier(Glass1$Al)
## [1] 3.5
outlier(Glass1$Al)
## [1] 3.5
outlier(Glass1$Si)
## [1] 69.81
outlier(Glass$K)
## [1] 6.21
outlier(Glass1$Ca)
## [1] 16.19
outlier(Glass1$Ba)
## [1] 3.15
outlier(Glass1$Fe)
## [1] 0.51
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
## starting httpd help server ... done
Class | date | plant.stand | precip | temp | hail | crop.hist | area.dam | sever | seed.tmt | germ | plant.growth | leaves | leaf.halo | leaf.marg | leaf.size | leaf.shread | leaf.malf | leaf.mild | stem | lodging | stem.cankers | canker.lesion | fruiting.bodies | ext.decay | mycelium | int.discolor | sclerotia | fruit.pods | fruit.spots | seed | mold.growth | seed.discolor | seed.size | shriveling | roots |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 4 | 0 | 2 | 1 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## Warning: Removed 2336 rows containing non-finite values (stat_density).
## [1] 17 24 26
## Warning: package 'corrplot' was built under R version 4.0.5
## corrplot 0.84 loaded
## [1] 0
Based on the frequency plots, we don’t see a degenarated distribution(As there is no spread of variables around the mean, the variance for the degenerate distribution is zero (Var(X) = 0))…because there is no constant value among variables. If we observed little variation, we could say we have a degenarated distribution. We could also use the correlation function to find if there are variable with high correlation(meaning explaning the same underlying response). However, there many missing values rendering the plot difficult. the findCorrelation() with cutoff at 75% on Soybean dataframe output zero. This means we should not delete a variable.
## [1] 683 36
## Total missing values from Soybean data is : 2337
## The percentage of missing values from Soybean data is: 9.504636
## [1] 9.5
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
variables | number.missing |
---|---|
hail | 121 |
lodging | 121 |
seed.tmt | 121 |
sever | 121 |
germ | 112 |
leaf.mild | 108 |
fruit.spots | 106 |
fruiting.bodies | 106 |
seed.discolor | 106 |
shriveling | 106 |
leaf.shread | 100 |
mold.growth | 92 |
seed | 92 |
seed.size | 92 |
fruit.pods | 84 |
leaf.halo | 84 |
leaf.malf | 84 |
leaf.marg | 84 |
leaf.size | 84 |
canker.lesion | 38 |
ext.decay | 38 |
int.discolor | 38 |
mycelium | 38 |
precip | 38 |
sclerotia | 38 |
stem.cankers | 38 |
plant.stand | 36 |
roots | 31 |
temp | 30 |
crop.hist | 16 |
plant.growth | 16 |
stem | 16 |
area.dam | 1 |
date | 1 |
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
There are many variables with missing values. We cannot tell whether there is a pattern on missing values but the plot of missing values show were to look at when dealing with removing NA.