The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories (the Type factor in the copy loaded below has six levels, so one category has no samples). There are nine predictors: the refractive index and the percentages of eight elements (Na, Mg, Al, Si, K, Ca, Ba, and Fe).
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
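The class labels can be tabulated as a quick check on how the 214 samples are spread over the glass types. This is a minimal sketch that assumes the mlbench copy of the data; the name class_counts is introduced here for illustration only.

```r
library(mlbench)
data(Glass)

# Number of samples per glass type (only levels present in the factor appear)
class_counts <- table(Glass$Type)
class_counts

# Number of distinct class levels in this copy of the data
nlevels(Glass$Type)
```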
Refractive Index:
library(ggplot2)
ggplot(data = Glass, aes(x = RI)) +
  geom_histogram(bins = 18, col = 'blue', fill = 'lightgreen') +
  labs(x = 'Refractive Index',
       y = 'Count',
       title = 'Histogram of refractive index')
The distribution of the refractive index is roughly bell-shaped with a slight right skew.
Sodium:
ggplot(data = Glass, aes(x = Na)) +
  geom_histogram(bins = 20, col = 'black', fill = 'lightpink') +
  labs(x = 'Percentage of Na',
       y = 'Count',
       title = 'Histogram of percentage of sodium')
The distribution of the sodium percentage appears slightly right skewed.
Magnesium:
ggplot(data = Glass, aes(x = Mg)) +
  geom_histogram(bins = 18, col = 'blue', fill = 'yellow') +
  labs(x = 'Percentage of Magnesium',
       y = 'Count',
       title = 'Histogram of percentage of magnesium')
The distribution of magnesium is neither normal nor uniform; it appears bimodal, with one group of samples near 0% and a larger group around 3.5%.
Aluminium:
ggplot(data = Glass, aes(x = Al)) +
  geom_histogram(bins = 20, col = 'black', fill = 'lightpink') +
  labs(x = 'Percentage of Al',
       y = 'Count',
       title = 'Histogram of percentage of Aluminium')
The distribution of aluminium is approximately normal with a slight right skew.
Silicon:
ggplot(data = Glass, aes(x = Si)) +
  geom_histogram(bins = 20, col = 'blue', fill = 'lightgreen') +
  labs(x = 'Percentage of Silicon',
       y = 'Count',
       title = 'Histogram of percentage of silicon')
The distribution of Si is roughly bell-shaped with a slight left skew.
Histograms of K, Ca, Ba and Fe:
variables <- c('K', 'Ca', 'Ba', 'Fe')
# Build one histogram per variable; .data[[var]] replaces the deprecated aes_string()
histograms <- lapply(variables, function(var) {
  ggplot(Glass, aes(x = .data[[var]])) +
    geom_histogram(binwidth = 0.1, fill = "lightgreen", color = "blue") +
    ggtitle(paste("Histogram of", var)) +
    xlab(var) +
    ylab("Frequency")
})
gridExtra::grid.arrange(grobs = histograms, ncol = 2)
The distribution of Ca is approximately normal, whereas the distributions of K, Ba, and Fe are neither normal nor uniform: most of their values sit at the low end of the range, with long right tails. The distribution of Fe appears roughly exponential.
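The visual impressions from the histograms can be cross-checked with a numerical skewness statistic. The sketch below uses a hand-written moment-based skewness function (an assumption made here to avoid an extra package; it is not part of the original analysis). The names skewness, predictors, and skew_values are illustrative.

```r
library(mlbench)
data(Glass)

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5 (both dividing by n)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

predictors <- setdiff(names(Glass), "Type")
skew_values <- sapply(Glass[, predictors], skewness)

# Positive values indicate a right skew, negative values a left skew
round(sort(skew_values), 2)
```

Values close to zero suggest a roughly symmetric distribution; large positive values flag the long right tails seen in the histograms.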
Correlation plot:
library(corrplot)
## corrplot 0.92 loaded
variables <- c('RI', 'Na','Mg','Al','Si', 'K', 'Ca', 'Ba', 'Fe')
cor_matrix <- cor(Glass[, variables])
# Draw the correlation plot; corrplot() returns its result invisibly, so nothing extra is printed
corrplot(cor_matrix, method = "shade")
# Print the correlation matrix behind the plot
print(cor_matrix)
##               RI          Na           Mg          Al          Si            K
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220 -0.289832711
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881 -0.266086504
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672  0.005395667
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372  0.325958446
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000 -0.193330854
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085  1.000000000
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215 -0.317836155
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131 -0.042618059
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073 -0.007719049
##            Ca            Ba           Fe
## RI  0.8104027 -0.0003860189  0.143009609
## Na -0.2754425  0.3266028795 -0.241346411
## Mg -0.4437500 -0.4922621178  0.083059529
## Al -0.2595920  0.4794039017 -0.074402151
## Si -0.2087322 -0.1021513105 -0.094200731
## K  -0.3178362 -0.0426180594 -0.007719049
## Ca  1.0000000 -0.1128409671  0.124968219
## Ba -0.1128410  1.0000000000 -0.058691755
## Fe  0.1249682 -0.0586917554  1.000000000
The refractive index has a strong positive correlation with the percentage of calcium (r = 0.81) and negative correlations with the percentages of Si (r = -0.54), Al (r = -0.41), and K (r = -0.29).
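The strongest relationships can also be pulled out of the correlation matrix programmatically rather than read off the plot. The following is a sketch only: the cutoff of 0.4 is an arbitrary illustrative threshold, and the names cor_pairs and strong are introduced here.

```r
library(mlbench)
data(Glass)

predictors <- setdiff(names(Glass), "Type")
cor_matrix <- cor(Glass[, predictors])

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Pairs whose absolute correlation exceeds the illustrative cutoff, strongest first
strong <- cor_pairs[abs(cor_pairs$r) > 0.4, ]
strong <- strong[order(-abs(strong$r)), ]
strong
```

Sorting by absolute value places the RI and Ca pair first, which matches the most heavily shaded off-diagonal cell in the plot.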
variables <- c('RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe')
# Build one box plot per predictor; .data[[var]] replaces the deprecated aes_string()
boxplots <- lapply(variables, function(var) {
  ggplot(Glass, aes(x = .data[[var]])) +
    geom_boxplot(fill = "lightgreen", color = "blue") +
    ggtitle(paste("Box plot of", var)) +
    xlab(var)
})
gridExtra::grid.arrange(grobs = boxplots, ncol = 3)
The box plots show outliers in RI, Na, Al, Si, K, Ca, Ba, and Fe, so the outliers need to be handled before applying any data analysis technique. The distributions of RI, Na, Al, and Ca are slightly right skewed, and the distribution of Si is left skewed. The distributions of Mg, Ba, K, and Fe are clearly non-normal. The distribution of Fe appears roughly exponential, with counts decaying as the percentage increases.
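The outliers seen in the box plots can be counted with base R. This sketch relies on boxplot.stats(), which flags points lying more than 1.5 times the interquartile range beyond the hinges; ggplot2 computes its quartiles slightly differently, so the counts may differ marginally from the points drawn in the plots. The name outlier_counts is illustrative.

```r
library(mlbench)
data(Glass)

predictors <- setdiff(names(Glass), "Type")

# boxplot.stats() returns, in $out, the points lying beyond the whiskers
# (more than 1.5 * IQR from the hinges by default)
outlier_counts <- sapply(Glass[, predictors],
                         function(x) length(boxplot.stats(x)$out))
outlier_counts
```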
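Answer. Yes. To reduce the effect of outliers, the spatial sign transformation can be applied to the predictors that show outliers (RI, Na, Al, Si, K, Ca, Ba, and Fe) to improve the classification model. Because the transformation projects the predictors onto a unit sphere as a group, they should be centered and scaled first. After the outliers have been handled, other transformations can be considered to reduce the remaining noise and skewness in the data.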
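The spatial sign transformation divides each centered and scaled sample by its Euclidean norm, so every sample ends up on the unit sphere. Below is a minimal base R sketch under that definition; spatial_sign and glass_ss are hypothetical names introduced here. The caret package offers the same transformation through preProcess() with method = c("center", "scale", "spatialSign"), which may be preferable in a modeling pipeline.

```r
library(mlbench)
data(Glass)

predictors <- setdiff(names(Glass), "Type")

# Center and scale each predictor, then divide every row by its Euclidean norm
spatial_sign <- function(x) {
  z <- scale(as.matrix(x))        # column-wise centering and scaling
  norms <- sqrt(rowSums(z^2))     # length of each sample vector
  norms[norms == 0] <- 1          # leave an all-zero row unchanged
  z / norms                       # row i is divided by norms[i]
}

glass_ss <- spatial_sign(Glass[, predictors])

# Every transformed sample should now have unit length
range(sqrt(rowSums(glass_ss^2)))
```

Because all nine predictors are transformed together, dropping or adding a predictor afterwards changes the geometry, so the predictor set should be fixed before applying the transformation.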
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(data = Soybean, mapping = aes(x = temp)) +
  geom_bar(fill = 'lightgreen', col = 'black') +
  labs(title = 'Frequency plot of temp',
       y = 'Frequency')
ggplot(data = Soybean, mapping = aes(x = leaf.size)) +
  geom_bar(fill = 'yellow', col = 'black') +
  labs(title = 'Frequency plot of leaf size',
       y = 'Frequency',
       x = 'Leaf size')
Answer.
Missing data in a categorical variable can be imputed with the most frequent level (the mode) of that predictor. The following code chunk imputes the missing values of the ‘temp’ variable with its mode.
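# Most frequent non-missing value of a vector
Mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}
# Assign in place so that temp stays an ordered factor
# (ifelse() would convert it to integer codes)
Soybean$temp[is.na(Soybean$temp)] <- Mode(Soybean$temp)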
sum(is.na(Soybean$temp))
## [1] 0
The ‘temp’ variable now has no missing values. Missing values in the other variables could be handled in the same way.
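Extending the same idea to every predictor could look like the sketch below. It first counts the missing values per column and then fills each predictor with its own mode. The names missing_counts and soy_imputed are introduced here for illustration, and mode imputation is only one of several possible strategies.

```r
library(mlbench)
data(Soybean)

# Number of missing values per column, largest first
missing_counts <- sort(colSums(is.na(Soybean)), decreasing = TRUE)
head(missing_counts)

# Most frequent non-missing value of a vector
Mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Fill the gaps in every predictor with that predictor's mode;
# in-place assignment keeps the factor (or ordered factor) class intact
soy_imputed <- Soybean
for (col in setdiff(names(soy_imputed), "Class")) {
  gaps <- is.na(soy_imputed[[col]])
  soy_imputed[[col]][gaps] <- Mode(soy_imputed[[col]])
}

# Total number of missing values left after imputation
sum(is.na(soy_imputed))
```

Working on a copy (soy_imputed) keeps the original Soybean data available for comparing class frequencies before and after imputation.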