The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and the percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
#install.packages('mlbench')
library(mlbench)
data(Glass)
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
library('ggcorrplot')
## Loading required package: ggplot2
cols <- colnames(Glass)
par(mfrow = c(3, 3))
for (i in 1:9) {
  # histogram of the i-th predictor
  hist(Glass[, i],
       breaks = seq(min(Glass[, i]), max(Glass[, i]), length = 22),
       prob = TRUE, col = "lightgray", main = cols[i])
  # overlay a smoothed density estimate
  lines(density(Glass[, i], adjust = 3), col = "blue")
}
corr <- cor(Glass[, 1:9])
ggcorrplot(corr, type = "lower", lab = TRUE)
Above are the histograms of each predictor variable (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe). Looking at these visuals, we see that not all of the predictors are normally distributed.
RI, Na, Al, and Si show slight skewness, while Ca has a heavy right skew. The remaining variables (Mg, K, Ba, Fe) are clearly not normally distributed.
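To quantify the skewness seen in the histograms, here is a minimal sketch (assuming the e1071 package is installed) that computes the sample skewness of each predictor:

library(e1071)
# sample skewness of each predictor; values far from 0 indicate skewed distributions
apply(Glass[, 1:9], 2, skewness)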
par(mfrow = c(3, 3))
for (i in 1:9) {
  boxplot(Glass[, i], main = cols[i])
}
The box plots show that there are outliers in all predictors except Mg; the counts below back this up.
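A quick way to confirm this is to count, per predictor, the points falling outside the 1.5 * IQR whiskers that boxplot() uses (a minimal sketch with base R's boxplot.stats):

# number of points flagged as outliers by the 1.5 * IQR rule
sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))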
We will apply the Box-Cox transformation and see whether the skewness is removed from any of the distributions.
library(caret)
## Loading required package: lattice
# estimate Box-Cox, centering and scaling transformations on the nine predictors (column 10 is the class)
bxCx <- preProcess(Glass[-10], method = c('BoxCox', 'center', 'scale'))
# apply the estimated transformations
Glass_bxCx <- predict(bxCx, Glass[-10])
par(mfrow = c(3, 3))
for (i in 1:9) {
  boxplot(Glass_bxCx[, i], main = cols[i])
}
par(mfrow = c(3, 3))
for (i in 1:9) {
  # histogram of the transformed predictor
  hist(Glass_bxCx[, i],
       breaks = seq(min(Glass_bxCx[, i]), max(Glass_bxCx[, i]), length = 22),
       prob = TRUE, col = "lightgray", main = cols[i])
  # overlay a smoothed density estimate
  lines(density(Glass_bxCx[, i], adjust = 3), col = "blue")
}
We can see that the skewness is reduced for Na, Si, and Ca, and their distributions are now approximately normal.
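As a sanity check, the skewness values can be recomputed on the transformed predictors and set beside the originals (this reuses the skewness function from e1071 loaded above):

# skewness before and after the Box-Cox transformation
round(data.frame(before = apply(Glass[, 1:9], 2, skewness),
                 after  = apply(Glass_bxCx, 2, skewness)), 2)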
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
library(mlbench)
data(Soybean)
cols <- colnames(Soybean)
par(mfrow = c(3, 3))
# columns 2 to 36 hold the 35 categorical predictors (column 1 is the class)
for (i in 2:36) {
  # frequency distribution of each categorical predictor
  hist(as.numeric(Soybean[, i]), main = cols[i])
}
Yes, there are many predictors that take on only a few distinct values and have low variance. Among them are mycelium, sclerotia, lodging, stem, and leaf.malf.
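This can be confirmed with caret's nearZeroVar function; a minimal sketch (caret is already loaded from the previous exercise):

# list the predictors flagged as having near-zero variance
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
rownames(nzv[nzv$nzv, ])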
The easiest solution is to drop the records with NA values from the dataset. This is effective only when very few records have missing values; if many do, dropping them degrades the data and hurts model performance.
If a particular column has a large percentage of missing values, it is reasonable to drop that column from the dataset. Sometimes we cannot drop rows or columns; in those scenarios we use imputers. The per-column missing percentages are sketched after the code below.
library(knitr)
data_rs1 <- na.omit(Soybean)  # drop every row that contains a missing value
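Before choosing between dropping and imputing, it helps to see how much each predictor is actually missing; a minimal sketch of the per-column missing percentages:

# percentage of missing values per column, highest first
miss_pct <- sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)
round(head(miss_pct, 10), 1)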
Some time back, I created an R notebook explaining how to handle missing values with different types of imputers. It covers mean imputation and regression imputation (both deterministic and stochastic regression imputation). Of these, stochastic regression imputation does the better job of imputing missing values.
Please refer to https://rpubs.com/charlsjoseph/missing_values
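For completeness, here is a minimal imputation sketch using the mice package (an assumption; mice is not used elsewhere in this notebook). By default mice picks a suitable model for each factor predictor and imputes stochastically, so it covers the stochastic case discussed above; it may take a little while on this data.

# install.packages('mice')
library(mice)
imp <- mice(Soybean, m = 1, seed = 42, printFlag = FALSE)  # single stochastic imputation
Soybean_imputed <- complete(imp, 1)
sum(is.na(Soybean_imputed))  # 0 if every missing value was filled in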