if (!require("Hmisc")) install.packages("Hmisc")
if (!require("PerformanceAnalytics")) install.packages("PerformanceAnalytics")
if (!require("mlbench")) install.packages("mlbench")
if (!require("car")) install.packages("car")
if (!require("missForest")) install.packages("missForest")
if (!require("Amelia")) install.packages("Amelia")
if (!require("kableExtra")) install.packages("kableExtra")
if (!require("naniar")) install.packages("naniar")
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("caret")) install.packages("caret")

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
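The data can be accessed via the mlbench package (data(Glass)); the code below refers to the data frame as glass. Its structure is: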
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
From the correlation matrix, we can see that RI and Ca are strongly positively correlated (0.81), while RI and Si are negatively correlated (-0.54).
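The pairwise correlations quoted above can be checked with a short sketch along these lines. It assumes the data frame is the `Glass` object shipped with mlbench; `strongest_pairs` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper: list variable pairs ordered by absolute correlation.
strongest_pairs <- function(df, n = 5) {
  num <- df[sapply(df, is.numeric)]              # keep numeric columns only
  cm  <- cor(num, use = "pairwise.complete.obs") # correlation matrix
  idx <- which(upper.tri(cm), arr.ind = TRUE)    # visit each pair once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out <- out[order(-abs(out$r)), ]               # strongest first
  rownames(out) <- NULL
  head(out, n)
}

# Usage on the glass data; skipped when mlbench is not installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(strongest_pairs(Glass, n = 3))
}
```

Sorting by absolute value keeps strong negative relationships, such as the one between RI and Si, from being buried below weak positive ones.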
# Boxplots of the numeric predictors on a 3 x 3 grid (non-numeric columns are skipped)
par(mfrow = c(3, 3))
for (var in names(glass)[sapply(glass, is.numeric)]) {
  boxplot(glass[[var]], main = paste("Boxplot of", var), horizontal = TRUE)
}

From the figure, we can see that K and Mg appear to have possible second modes around zero and that several predictors (Ca, Ba, Fe and RI) show signs of skewness. There may be one or two outliers in K, but they could simply be due to natural skewness. Also, predictors Ca, RI, Na and Si have concentrations of samples in the middle of the scale and a small number of data points at the edges of the distribution. The boxplots confirm that there are outliers in the data.
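The visual impression of skewness can be backed by a number. The sketch below uses one common moment-based definition of sample skewness; `sample_skewness` is a hypothetical helper, and the usage again assumes the `Glass` data from mlbench.

```r
# Moment-based sample skewness: positive for a long right tail,
# negative for a long left tail, near zero for a symmetric sample.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Usage on the glass predictors; skipped when mlbench is not installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[sapply(Glass, is.numeric)]
  print(round(sort(sapply(predictors, sample_skewness), decreasing = TRUE), 2))
}
```

Predictors at the top of the sorted output are the most right-skewed and are the natural candidates for a transformation.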
# preProcess() comes from caret; hist.data.frame() comes from Hmisc
Trans <- preProcess(glass, method = "YeoJohnson")
TransData <- predict(Trans, newdata = glass)
hist.data.frame(TransData)
par(mfrow = c(3, 3))
for (var in names(TransData)[sapply(TransData, is.numeric)]) {
  boxplot(TransData[[var]], main = paste("Boxplot of", var), horizontal = TRUE)
}

The one change relative to the original distributions is that the transformation induced a second mode for predictors Ba and Fe. Given these results, the transformation did not seem to improve the data in terms of skewness. Thus, the skewness in these data could not be resolved through transformation, although the number of unusual observations was reduced.
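One way to see why predictors such as Ba and Fe resist this kind of fix: Yeo-Johnson is a monotone map applied value by value, so a block of identical values (the exact zeros visible in the str() output) is still a block of identical values afterwards. The sketch below writes out the standard Yeo-Johnson formula in base R for a fixed lambda. The function name and the toy vector are illustrative only, and caret estimates lambda from the data rather than fixing it.

```r
# Yeo-Johnson transform for a fixed lambda (base R sketch).
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(y[pos] + 1)
  }
  # Negative values
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(-y[!pos] + 1)
  }
  out
}

# Toy predictor: mostly exact zeros plus a few positive values (not real data).
x <- c(rep(0, 8), 0.1, 0.3, 1.5, 3)
for (lambda in c(-1, 0, 0.5)) {
  z <- yeo_johnson(x, lambda)
  cat("lambda =", lambda, " share of values tied with the first:", mean(z == z[1]), "\n")
}
```

Whatever lambda is chosen, the zeros all map to the same value, so the spike survives and the distribution cannot become symmetric.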
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
When we look closely at this output, we see that the factor levels of some predictors are not informative. For example, the temp column contains integer values. These values correspond to the relative temperature: below average, average, and above average.
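Relabeling such a column makes summaries easier to read. The sketch below uses a small toy factor in place of the real temp column and assumes the codes map in order (0 = below average, 1 = average, 2 = above average); that ordering is inferred from the sentence above and should be checked against the data documentation.

```r
# Toy stand-in for Soybean$temp (levels "0", "1", "2"); not the real column.
temp <- factor(c("0", "1", "2", "1", NA, "2"), levels = c("0", "1", "2"))

# Assumed mapping: 0 = below average, 1 = average, 2 = above average.
temp_labeled <- factor(
  temp,
  levels  = c("0", "1", "2"),
  labels  = c("below average", "average", "above average"),
  ordered = TRUE
)

# Missing values stay missing and get their own row in the table.
print(table(temp_labeled, useNA = "ifany"))
```

Keeping the factor ordered preserves the natural ranking of the levels for any later model or plot that can use it.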
# Drop the Class column and plot each of the 35 predictors
Soybean2 <- Soybean[, 2:36]
par(mfrow = c(3, 6))
for (i in 1:ncol(Soybean2)) {
  smoothScatter(Soybean2[, i], ylab = names(Soybean2)[i])
}

A few predictors have degenerate distributions because some of their levels occur with very low frequency. The most important ones are mycelium and sclerotia, whose smoothed density scatterplots show a single color across the chart. The variables leaf.mild and int.discolor also appear to show near-zero variance.
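The degeneracy can be quantified from the level counts printed in the summary above. The sketch below computes, in base R, the two diagnostics that caret's nearZeroVar is documented to use (frequency ratio and percentage of unique values), with cutoffs that mirror its defaults (95/5 and 10). `freq_ratio` is a hypothetical helper, and the counts are copied by hand from the output rather than read from the data.

```r
# Level counts copied from the summary() output above (NAs excluded).
level_counts <- list(
  mycelium     = c(`0` = 639, `1` = 6),
  sclerotia    = c(`0` = 625, `1` = 20),
  leaf.mild    = c(`0` = 535, `1` = 20, `2` = 20),
  int.discolor = c(`0` = 581, `1` = 44, `2` = 20)
)

# Frequency ratio: count of the most common level over the second most common.
freq_ratio <- function(counts) {
  s <- sort(counts, decreasing = TRUE)
  unname(s[1] / s[2])
}

n_rows <- 683  # number of soybean samples
diagnostics <- data.frame(
  freq_ratio     = sapply(level_counts, freq_ratio),
  percent_unique = sapply(level_counts, function(cnt) 100 * length(cnt) / n_rows)
)

# Flag a predictor when one level dominates and there are few distinct values.
diagnostics$flagged <- diagnostics$freq_ratio > 95 / 5 &
  diagnostics$percent_unique <= 10
print(diagnostics)
```

Under these cutoffs mycelium (ratio 106.5), sclerotia (31.25) and leaf.mild (26.75) are flagged, while int.discolor (about 13.2) falls below the frequency-ratio threshold.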
# Remove near-zero-variance predictors
Soybean <- Soybean %>%
  select(-leaf.mild, -mycelium, -sclerotia)
# Introduce 10% missing values at random
Soybean.mis <- prodNA(Soybean, noNA = 0.1)
summary(Soybean.mis)
## Class date plant.stand precip temp
## brown-spot : 89 5 :137 0 :313 0 : 65 0 : 73
## frog-eye-leaf-spot : 84 4 :122 1 :266 1 : 96 1 :344
## alternarialeaf-spot: 83 3 :109 NA's:104 2 :413 2 :177
## phytophthora-rot : 79 6 : 81 NA's:109 NA's: 89
## anthracnose : 40 2 : 72
## (Other) :250 (Other): 88
## NA's : 58 NA's : 74
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :393 0 : 55 0 :111 0 :172 0 :270 0 :149 0 :394
## 1 :114 1 :150 1 :207 1 :295 1 :200 1 :193 1 :206
## NA's:176 2 :204 2 :130 2 : 38 2 : 33 2 :186 NA's: 83
## 3 :184 3 :160 NA's:178 NA's:180 NA's:155
## NA's: 90 NA's: 75
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf stem
## 0 : 68 0 :201 0 :322 0 : 49 0 :428 0 :504 0 :266
## 1 :553 1 : 32 1 : 19 1 :307 1 : 86 1 : 43 1 :339
## NA's: 62 2 :308 2 :200 2 :200 NA's:169 NA's:136 NA's: 78
## NA's:142 NA's:142 NA's:127
##
##
##
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay int.discolor
## 0 :466 0 :343 0 :276 0 :428 0 :445 0 :524
## 1 : 36 1 : 36 1 : 72 1 : 92 1 :117 1 : 37
## NA's:181 2 : 32 2 :158 NA's:163 2 : 10 2 : 19
## 3 :172 3 : 58 NA's:111 NA's:103
## NA's:100 NA's:119
##
##
## fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 0 :363 0 :317 0 :427 0 :462 0 :463 0 :476
## 1 :118 1 : 66 1 :109 1 : 59 1 : 58 1 : 58
## 2 : 12 2 : 53 NA's:147 NA's:162 NA's:162 NA's:149
## 3 : 44 4 : 86
## NA's:146 NA's:161
##
##
## shriveling roots
## 0 :493 0 :496
## 1 : 36 1 : 78
## NA's:154 2 : 15
## NA's: 94
##
##
##
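The NA counts in the summary are easier to compare as per-column shares. A minimal base R sketch follows; `missing_share` is a hypothetical helper and the toy data frame is made up for illustration.

```r
# Share of missing values per column, largest first.
missing_share <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}

# Toy data frame (not the soybean data).
toy <- data.frame(
  x = c(1, NA, 3, NA),
  y = c("a", "b", NA, "d"),
  z = 1:4
)
print(round(100 * missing_share(toy), 1))  # percentages per column
```

Applied to the objects above, `missing_share(Soybean.mis)` would combine the values removed by prodNA with those already missing in the original data.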
# Impute missing values using missForest's default parameters
Soybean.imp <- missForest(Soybean.mis)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
# Check that no missing values remain after imputation
Soybean2 <- as.data.frame(Soybean.imp$ximp)
Soybean2 %>%
arrange(Class) %>%
missmap(main = "Missing vs Observed")
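A missingness map of the imputed data only confirms that no NA values remain; it does not show how accurate the imputed values are. Because part of the missingness was introduced artificially with prodNA, the imputed entries at those positions can be compared with the original values. The sketch below does this for categorical data in base R; `pfc` is a hypothetical helper (proportion of falsely classified entries), and the toy data frames stand in for the soybean objects.

```r
# Proportion of falsely classified entries among cells that were
# blanked on purpose (missing in xmis but known in xtrue).
pfc <- function(ximp, xmis, xtrue) {
  masked <- is.na(xmis) & !is.na(xtrue)
  imp  <- as.matrix(ximp)    # factors become character for comparison
  true <- as.matrix(xtrue)
  mean(imp[masked] != true[masked])
}

# Toy example: two cells masked, one imputed correctly and one not.
truth   <- data.frame(a = factor(c("0", "1", "1", "0")),
                      b = factor(c("x", "y", "y", "x")))
masked  <- truth
masked$a[2] <- NA
masked$b[4] <- NA
imputed <- masked
imputed$a[2] <- "1"   # matches the true value
imputed$b[4] <- "y"   # true value is "x"
print(pfc(imputed, masked, truth))
```

With the objects above the call would be `pfc(Soybean.imp$ximp, Soybean.mis, Soybean)`, provided Soybean is still the version from before prodNA was applied. missForest also returns an out-of-bag error estimate in `Soybean.imp$OOBerror`.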