library(mlbench)  # provides the Glass and Soybean data sets
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
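library(dplyr)                 # %>% and select()
library(PerformanceAnalytics)  # chart.Correlation()
library(corrplot)              # corrplot()
# Scatterplot matrix with pairwise correlations for the nine predictors
glass <- Glass %>% select(-Type)
chart.Correlation(glass, bg=c("blue","red","yellow"), pch=21)
# Correlation matrix shown as numbers
x <- cor(Glass[1:9])
corrplot(x, method="number")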
The predictors RI and Ca are strongly correlated, so they carry largely the same information. Removing one of them avoids redundancy in the model. All other pairs of predictors show moderate to weak correlation, so those predictors can be kept in the model.
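One way to make the redundancy check explicit is to list every predictor pair whose absolute correlation exceeds a cutoff. The sketch below uses base R only. The helper name `high_cor_pairs`, the 0.75 cutoff, and the simulated data frame are illustrative assumptions and not part of the original analysis.

```r
# Hypothetical helper: list predictor pairs whose |correlation| exceeds a cutoff.
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm <- cor(df)
  # Keep each pair once by looking only at the upper triangle.
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Simulated stand-in data (assumption): b is built from a, c is independent.
set.seed(42)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.2), c = rnorm(200))
pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

On the glass data the same helper would be called as `high_cor_pairs(Glass[1:9])`.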
par(mfrow=c(2, 2))
colnames <- dimnames(Glass)[[2]]
for (i in 1:9) {
d <- density(Glass[,i])
plot(d, type="n", main=colnames[i])
polygon(d, col="red", border="gray")
}
The density plots show that some of the variables are close to normally distributed. Na, Al, Si, and Ba (apart from a right skew) can be considered approximately normal.
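A numeric skewness value can back up what the density plots suggest visually. The sketch below is a minimal moment-based estimator in base R. The function name `sample_skewness` and the two example vectors are assumptions for illustration, and packaged implementations may use a slightly different bias correction.

```r
# Moment-based sample skewness (one common definition; packages differ slightly).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative vectors (assumptions, not the glass data).
symmetric_x  <- c(-2, -1, 0, 1, 2)
right_tail_x <- c(0, 0, 0, 0, 0.1, 0.3, 2.5)

skew_sym   <- sample_skewness(symmetric_x)
skew_right <- sample_skewness(right_tail_x)
print(c(symmetric = skew_sym, right_tail = skew_right))
```

Values near zero indicate a roughly symmetric distribution, positive values a right skew, and negative values a left skew. For the glass predictors this could be applied as `sapply(Glass[1:9], sample_skewness)`.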
df.m <- melt(Glass)
## Using Type as id variables
p1 <- ggplot(data = df.m, aes(x = variable, y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4, aes(fill = variable)) +
  scale_y_continuous(name = "Predictors Glass", breaks = seq(0, 2, 0.5), limits = c(0, 2)) +
  coord_flip()
p2 <- ggplot(data = df.m, aes(x = variable, y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4, aes(fill = variable)) +
  scale_y_continuous(name = "Predictors Glass", breaks = seq(5, 10, 2), limits = c(5, 10)) +
  coord_flip()
p3 <- ggplot(data = df.m, aes(x = variable, y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4, aes(fill = variable)) +
  scale_y_continuous(name = "Predictors Glass", breaks = seq(70, 76, 1), limits = c(70, 76)) +
  coord_flip()
grid.arrange(p1, p2, p3, nrow = 2)
## Warning: Removed 833 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1737 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1714 rows containing non-finite values (stat_boxplot).
There are outliers in all the predictor variables; they are highlighted in red in the plots. The box plots also show that some of the predictor variables are skewed: Fe is skewed right, K is skewed left, Al has a slight positive skew, and Si has a slight left skew.
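The outlier claim can also be checked numerically by counting the points that fall beyond the boxplot whiskers. The sketch below relies on base R's `boxplot.stats()`, which applies a 1.5 × IQR rule similar to the one used for the box plots. The helper name `count_outliers` and the toy data frame are assumptions for illustration.

```r
# Count points beyond the boxplot whiskers (1.5 * IQR rule used by boxplot.stats).
count_outliers <- function(df) {
  sapply(df, function(col) length(boxplot.stats(col)$out))
}

# Toy data frame (assumption): column "y" has one extreme value, "x" has none.
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(1, 2, 3, 4, 100))
outlier_counts <- count_outliers(toy)
print(outlier_counts)
```

For the glass predictors the call would be `count_outliers(Glass[1:9])`, giving one count per predictor.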
Transforming variables is often worthwhile when building models. Many transformations are available, so we have to choose the most suitable one for each variable. In our case, we can first remove one of the highly correlated variables (RI and Ca; we can remove Ca), then remove variables with a high percentage of missing values and variables with zero variance. We checked the variables for normality in part (a): only Na, Al, Si, and Ba were approximately normally distributed. The remaining variables (RI or Ca, Mg, K, Fe) can be transformed using Box-Cox transformations.
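To show what a Box-Cox transformation does, the sketch below implements the transform and a maximum-likelihood estimate of lambda in base R. The function names, the search interval, and the simulated log-normal data are assumptions for illustration; it is not the implementation of any particular package. Box-Cox requires strictly positive values, and the `str()` output above shows zeros in Ba and Fe, so those predictors would need a shift or a different transformation (for example Yeo-Johnson).

```r
# Box-Cox transform for strictly positive x; lambda = 0 is the log transform.
boxcox_transform <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model for the transformed data.
boxcox_loglik <- function(lambda, x) {
  y <- boxcox_transform(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

# Pick lambda by maximising the profile log-likelihood over a search interval.
estimate_lambda <- function(x, interval = c(-2, 2)) {
  stopifnot(all(x > 0))
  optimize(boxcox_loglik, interval = interval, x = x, maximum = TRUE)$maximum
}

# Simulated right-skewed data (assumption): log-normal, so lambda should be near 0.
set.seed(1)
skewed <- exp(rnorm(500))
lambda_hat <- estimate_lambda(skewed)
transformed <- boxcox_transform(skewed, lambda_hat)
print(lambda_hat)
```

An estimated lambda close to 0 corresponds to a log transform, 0.5 to a square root, and 1 to leaving the variable unchanged.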
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
A degenerate distribution is one in which the random variable is not actually random because it takes only a single value. More loosely, a predictor is near-degenerate when one level dominates and the remaining levels occur so rarely that their effect is negligible. Such predictors are said to have zero variance or near-zero variance. We will use a function from the caret package to detect them. When building predictive models, zero-variance predictors are usually removed because they can bias the results; in this case, we will only detect them. We first generate a data frame of frequency distributions for all columns and check a few variables at random.
for (i in 2:ncol(Soybean)){
temp <-as.data.frame(table(Soybean[[i]]))
temp$col = colnames(Soybean[i])
assign(paste0("freq_",colnames(Soybean[i])),temp)
rm(temp)
}
freq_all <- do.call("rbind",mget(ls(pattern = "^freq_.*")))
head(freq_all,4)
## Var1 Freq col
## freq_area.dam.1 0 123 area.dam
## freq_area.dam.2 1 227 area.dam
## freq_area.dam.3 2 145 area.dam
## freq_area.dam.4 3 187 area.dam
freq_date
## Var1 Freq col
## 1 0 26 date
## 2 1 75 date
## 3 2 93 date
## 4 3 118 date
## 5 4 131 date
## 6 5 149 date
## 7 6 90 date
freq_temp
## Var1 Freq col
## 1 0 80 temp
## 2 1 374 temp
## 3 2 199 temp
freq_stem
## Var1 Freq col
## 1 0 296 stem
## 2 1 371 stem
freq_leaves
## Var1 Freq col
## 1 0 77 leaves
## 2 1 606 leaves
par(mfrow = c(3,3))
for(i in 2:ncol(Soybean)) {
plot(Soybean[i], main = colnames(Soybean[i]))
}
To detect zero-variance (or near-zero-variance) predictors, we use the function nearZeroVar() and then look at the frequencies of the flagged variables.
library(caret)  # provides nearZeroVar()
zeroVarCols <- nearZeroVar(Soybean)
# Columns 19, 26 and 28 are degenerate
freq_leaf.mild
## Var1 Freq col
## 1 0 535 leaf.mild
## 2 1 20 leaf.mild
## 3 2 20 leaf.mild
freq_mycelium
## Var1 Freq col
## 1 0 639 mycelium
## 2 1 6 mycelium
freq_sclerotia
## Var1 Freq col
## 1 0 625 sclerotia
## 2 1 20 sclerotia
par(mfrow = c(2,2))
for(i in zeroVarCols) {
plot(Soybean[i], main = colnames(Soybean[i]))
}
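The near-zero-variance flags above can be checked by hand from the frequency tables. The sketch below computes the two quantities such a check is typically based on: the ratio of the most common level to the second most common, and the percentage of distinct values. The helper name `nzv_metrics` is hypothetical, the cutoffs of 95/5 and 10 are assumed to match caret's defaults, and the code only approximates what `nearZeroVar()` does internally.

```r
# Two metrics behind a near-zero-variance check (thresholds assumed: 95/5 and 10).
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- table(x)                                # table() drops NA by default
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  percent_unique <- 100 * length(counts) / length(x)
  data.frame(
    freq_ratio     = freq_ratio,
    percent_unique = percent_unique,
    nzv            = freq_ratio > freq_cut && percent_unique < unique_cut
  )
}

# Rebuild two columns from the frequency tables printed above (NAs omitted).
mycelium <- rep(c("0", "1"), times = c(639, 6))
temp     <- rep(c("0", "1", "2"), times = c(80, 374, 199))

mycelium_metrics <- nzv_metrics(mycelium)
temp_metrics     <- nzv_metrics(temp)
print(rbind(mycelium = mycelium_metrics, temp = temp_metrics))
```

With these counts, mycelium has a frequency ratio of 639/6 = 106.5 and is flagged, while temp has a ratio of 374/199 and is not.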
colSums(is.na(Soybean))
## Class date plant.stand precip
## 0 1 36 38
## temp hail crop.hist area.dam
## 30 121 16 1
## sever seed.tmt germ plant.growth
## 121 121 112 16
## leaves leaf.halo leaf.marg leaf.size
## 0 84 84 84
## leaf.shread leaf.malf leaf.mild stem
## 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies
## 121 38 38 106
## ext.decay mycelium int.discolor sclerotia
## 38 38 38 38
## fruit.pods fruit.spots seed mold.growth
## 84 106 92 92
## seed.discolor seed.size shriveling roots
## 106 92 106 31
plot_missing(Soybean)
sb_class <- Soybean %>%
  mutate(nul = rowSums(is.na(Soybean))) %>%
  group_by(Class) %>%
  summarize(miss = sum(nul)) %>% filter(miss != 0)
aggr_plot <- aggr(Soybean, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(Soybean), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## hail 0.177159590
## sever 0.177159590
## seed.tmt 0.177159590
## lodging 0.177159590
## germ 0.163982430
## leaf.mild 0.158125915
## fruiting.bodies 0.155197657
## fruit.spots 0.155197657
## seed.discolor 0.155197657
## shriveling 0.155197657
## leaf.shread 0.146412884
## seed 0.134699854
## mold.growth 0.134699854
## seed.size 0.134699854
## leaf.halo 0.122986823
## leaf.marg 0.122986823
## leaf.size 0.122986823
## leaf.malf 0.122986823
## fruit.pods 0.122986823
## precip 0.055636896
## stem.cankers 0.055636896
## canker.lesion 0.055636896
## ext.decay 0.055636896
## mycelium 0.055636896
## int.discolor 0.055636896
## sclerotia 0.055636896
## plant.stand 0.052708638
## roots 0.045387994
## temp 0.043923865
## crop.hist 0.023426061
## plant.growth 0.023426061
## stem 0.023426061
## date 0.001464129
## area.dam 0.001464129
## Class 0.000000000
## leaves 0.000000000
sb_class
## # A tibble: 5 x 2
## Class miss
## <fct> <dbl>
## 1 2-4-d-injury 450
## 2 cyst-nematode 336
## 3 diaporthe-pod-&-stem-blight 177
## 4 herbicide-injury 160
## 5 phytophthora-rot 1214
The missing-data plots show the percentage of missing values for each variable. We also summarized the missing values by class: out of the 19 classes, only 5 contain missing values, with phytophthora-rot having the most. This indicates that the missing values follow a pattern by class.
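To see how strongly missingness depends on the class, it can help to report, per class, how many rows contain at least one missing value alongside the total number of missing cells. The sketch below does this in base R. The helper name `missing_by_group` and the toy data frame are assumptions for illustration.

```r
# Summarise missingness per group: rows with any NA and total NA cells.
missing_by_group <- function(df, group_col) {
  predictors <- df[, setdiff(names(df), group_col), drop = FALSE]
  na_per_row <- rowSums(is.na(predictors))
  g <- factor(as.character(df[[group_col]]))  # drop unused levels
  data.frame(
    group        = levels(g),
    rows         = as.vector(tapply(na_per_row, g, length)),
    rows_with_na = as.vector(tapply(na_per_row > 0, g, sum)),
    na_cells     = as.vector(tapply(na_per_row, g, sum)),
    stringsAsFactors = FALSE
  )
}

# Toy data (assumption): every row of class "a" has missing values, class "b" has none.
toy <- data.frame(
  Class = c("a", "a", "b", "b", "b"),
  p1    = c(1, NA, 3, 4, 5),
  p2    = c(NA, NA, 1, 1, 1)
)
missing_summary <- missing_by_group(toy, "Class")
print(missing_summary)
```

On the soybean data the call would be `missing_by_group(Soybean, "Class")`. Classes where `rows_with_na` equals `rows` would indicate that missingness is tied to the class rather than scattered at random.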
sb_imp <- knnImputation(Soybean[,-1])
colSums(is.na(sb_imp))
## date plant.stand precip temp
## 0 0 0 0
## hail crop.hist area.dam sever
## 0 0 0 0
## seed.tmt germ plant.growth leaves
## 0 0 0 0
## leaf.halo leaf.marg leaf.size leaf.shread
## 0 0 0 0
## leaf.malf leaf.mild stem lodging
## 0 0 0 0
## stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 0 0 0
## mycelium int.discolor sclerotia fruit.pods
## 0 0 0 0
## fruit.spots seed mold.growth seed.discolor
## 0 0 0 0
## seed.size shriveling roots
## 0 0 0
The best strategy is to start by checking the correlation between pairs of variables. Because of the high percentage of missing values, we were not able to obtain reliable correlations between the variables. Had two predictors been strongly correlated, we would have removed the one with the higher percentage of missing values. In general, it is suggested to drop predictors with more than 5% missing values, since a predictor with many missing values might not provide correct information to the model. We used k-nearest neighbours to impute the missing values in our dataset.
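The 5% rule of thumb can be applied directly to the missing-value counts. The sketch below copies a subset of the counts from the `colSums(is.na(Soybean))` output above and flags the predictors that exceed the cutoff. The variable names are illustrative, and the cutoff is the rule of thumb from the text, not a fixed standard.

```r
# NA counts copied from the colSums(is.na(Soybean)) output above (subset only).
na_counts <- c(hail = 121, germ = 112, precip = 38, plant.stand = 36,
               roots = 31, temp = 30, crop.hist = 16, date = 1)
n_rows <- 683  # number of soybean samples stated in the exercise

pct_missing <- 100 * na_counts / n_rows
over_cutoff <- names(pct_missing)[pct_missing > 5]  # 5% rule of thumb from the text

print(round(pct_missing, 2))
print(over_cutoff)
```

On the full data frame the same percentages come from `100 * colMeans(is.na(Soybean))`.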
References:
https://rdrr.io/cran/EnvStats/man/boxcoxTransform.html
https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/preprocess.html?revision=845&root=caret