The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
From an initial look at the data we see that there is one target variable (Type) with 6 observed levels (one of the seven glass categories has no samples in the data) and 9 numeric predictor variables.
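For example, a quick frequency table of the target confirms the six observed levels and their counts:
# Class frequencies for the target variable
table(Glass$Type)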
library(dplyr)                  # for %>% and select()
library(PerformanceAnalytics)   # for chart.Correlation()
glass <- Glass %>% select(-Type)  # keep only the nine numeric predictors
chart.Correlation(glass, bg = c("blue", "red", "yellow"), pch = 21)
library(tidyr); library(ggplot2)
# Convert to long format and encode the predictor name as a factor
m <- Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na = TRUE) %>%
  mutate(Predictor = as.factor(Predictor)) %>% arrange(Predictor)
# Explore the distribution of each predictor
m %>%
  ggplot(aes(Value, fill = Predictor)) + geom_histogram(bins = 20) +
  facet_wrap(~Predictor, scales = "free") +
  labs(title = "The Distribution of the Predictors") + theme_minimal()
Of all the predictors, Si appears to have the most normal-looking distribution, in addition to being present at the highest concentration of any element. This seems rational, since silicon is the fundamental ingredient of glass. RI, Na, Al, and Ca also look roughly symmetric, although apart from Si I would not call them truly normal. In addition to the quantity of silicon, we can assume that the refractive index (RI) should be an important factor. In general, most of the element distributions are right skewed, meaning that only very small trace amounts of the element are usually present, with notable exceptions. In the correlation chart above, the color indicates the strength and polarity of each pairwise correlation; for example, Mg and Al have a strong negative correlation.
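To put a number on this skewness, one option is to compute the sample skewness of each predictor, for example with the skewness() function from the e1071 package (a quick sketch, assuming that package is installed):
library(e1071)
# Sample skewness per predictor; values well above zero indicate right skew
sapply(glass, skewness)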
library(corrplot)
corrplot(cor(Glass[, 1:9]), method = 'square')  # correlation matrix of the nine predictors
We can see that the variables differ quite a bit. Some are closer to normally distributed, such as Na and Al, while others do not look normal at all, such as Ba, Fe, and K. The correlation plot summarizes the relationship between each pair of variables: there are some clear positive relationships, such as RI with Ca and Al with Ba, as well as some clear negative ones, for example RI with Si, RI with Al, and Mg with Ba.
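As a rough way to confirm which pairs stand out, a small sketch like the following lists the pairwise correlations sorted by absolute value:
cors <- cor(Glass[, 1:9])
cors[lower.tri(cors, diag = TRUE)] <- NA       # keep each pair only once
cor_pairs <- as.data.frame(as.table(cors))     # long format: Var1, Var2, Freq
cor_pairs <- cor_pairs[!is.na(cor_pairs$Freq), ]
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 5)  # five strongest correlations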
We conclude that there are outliers in the data. K, Fe, and Ba contain a lot of zeros, which leaves their distributions heavily skewed to the right: the left side is bounded at 0 while outliers sit far out on the right tail. K has a very obvious outlier, Ba has outliers above 2.0, and Fe has an outlier above 0.5. Most of the other variables, including RI, Na, Al, Si, and Ca, have peaks in the center of their distributions and appear closer to normal, although RI, Al, Ca, Ba, and Fe still show plenty of outliers. The correlation table also suggests that most of the variables are only weakly related to each other. I would expect outliers here because of impurities introduced in the glass manufacturing process. In addition, other than Si (silicon, the main ingredient in glass), most of the element distributions are right skewed, meaning very small trace amounts of the element are normally present.
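A simple sketch for counting potential outliers per predictor, using the usual 1.5 × IQR boxplot rule:
# Count values falling outside 1.5 * IQR beyond the quartiles for each predictor
sapply(glass, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})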
m %>%
  ggplot(aes(x = Type, y = Value, color = Predictor)) +
  geom_jitter() +
  ylim(0, 20) +                          # zooms in on the lower range; Si values (~70-75) are dropped
  scale_color_brewer(palette = "Set3") + # Set3 has enough colors for all nine predictors
  theme_dark()
In my opinion, the relevant transformations to consider are the Box-Cox or log transformations, which might improve a classification model by reducing the skewness seen above. Removing outliers is another choice that can improve performance, and centering and scaling the predictors can be important for any model. Finally, it is worth checking whether any columns contain missing values, since those would need to be addressed by removal, imputation, or other means.
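As a sketch of how these steps could be combined, assuming the caret package is available (a Yeo-Johnson transformation is used here rather than Box-Cox, because several predictors contain zeros):
library(caret)
# Yeo-Johnson (a Box-Cox-style transformation that tolerates zeros), then center and scale
pp <- preProcess(glass, method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, glass)
summary(glass_trans)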
The Soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
?Soybean
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
Soybean %>% gather() %>%   # reshape to long format: one row per (variable, value) pair
  ggplot(aes(value)) + facet_wrap(~key, scales = "free") + geom_bar()
The plot above shows the distribution of data points across the categorical variables. Keep in mind it also shows that missing values exist in almost all of the variables. mycelium, sclerotia, and leaf.mild are strongly imbalanced, with nearly all observations falling in a single level, so it might be favorable to remove these variables from the model. int.discolor is a borderline case, with arguments both for keeping and removing it; given that, the variable is kept unless there is another indication that it is affecting the model.
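One way to flag such degenerate predictors is caret's nearZeroVar(); a minimal sketch, assuming caret is installed:
library(caret)
# Flag predictors with near-zero variance (one dominant level, few unique values)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]   # the columns identified as near-zero variance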
# Bar plot of the level counts for each predictor (column 1 is the Class label, so start at 2)
par(mfrow = c(3, 3))
for (i in 2:ncol(Soybean)) {
  plot(Soybean[i], main = colnames(Soybean[i]))
}
library(knitr)  # for kable()
# Count missing values per column, sorted from most to least
sorted <- order(-colSums(is.na(Soybean)))
kable(colSums(is.na(Soybean))[sorted])
| Variable | NA count |
|---|---|
| hail | 121 |
| sever | 121 |
| seed.tmt | 121 |
| lodging | 121 |
| germ | 112 |
| leaf.mild | 108 |
| fruiting.bodies | 106 |
| fruit.spots | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.shread | 100 |
| seed | 92 |
| mold.growth | 92 |
| seed.size | 92 |
| leaf.halo | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| leaf.malf | 84 |
| fruit.pods | 84 |
| precip | 38 |
| stem.cankers | 38 |
| canker.lesion | 38 |
| ext.decay | 38 |
| mycelium | 38 |
| int.discolor | 38 |
| sclerotia | 38 |
| plant.stand | 36 |
| roots | 31 |
| temp | 30 |
| crop.hist | 16 |
| plant.growth | 16 |
| stem | 16 |
| date | 1 |
| area.dam | 1 |
| Class | 0 |
| leaves | 0 |
We can see that there are a lot of missing values in the dataset, as shown in the summary and the first plot. The second plot shows whether there is any pattern in the missing values: germ, hail, sever, seed.tmt, and lodging tend to be missing together, so there is a clear pattern of joint missingness among these variables.
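A possible way to visualize that joint-missingness pattern is the aggr() function from the VIM package (the same package used below for kNN imputation); for example:
library(VIM)
# Left panel: proportion missing per variable; right panel: which variables are missing together
aggr(Soybean, numbers = TRUE, sortVars = TRUE, cex.axis = 0.6)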
# Tabulate the number of NAs per column as a data frame for plotting
soybean_missing_counts <- sapply(Soybean, function(x) sum(is.na(x))) %>%
  sort(decreasing = TRUE) %>%
  as.data.frame() %>%
  rename('NA_Count' = '.')
soybean_missing_counts <- soybean_missing_counts %>%
  mutate('Feature' = rownames(soybean_missing_counts))
ggplot(soybean_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) +
geom_bar(stat = 'identity', fill = 'blue') +
labs(title = 'Soybean Missing Counts') +
theme(plot.title = element_text(hjust = 0.5))
As you can see, the graphs above are very helpful in indicating how much missing data the Soybean set contains. The first plot highlights that lodging, hail, sever, and seed.tmt are each missing for nearly 18% of the observations. The second plot shows the pattern of the missing data as it relates to the other variables: about 82% of the rows are complete, and the Class and leaves variables are never missing. There are quite a few distinct missingness patterns, but their overall proportion is not extreme. In addition, the first set of variables, from hail to fruit.pods, accounts for about 8% of the missing data in the cases where the other variables are complete (note this does not indicate within-variable missingness). In conclusion, for some imputation methods, such as certain types of multiple imputation, having fewer missingness patterns is helpful, as it requires fitting fewer models.
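The share of complete rows can be checked directly, for example:
mean(complete.cases(Soybean))  # proportion of rows with no missing values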
library(VIM)  # for kNN()
Soybean2 <- Soybean[3:36]  # predictors only (Class and date removed), kept for reference
# kNN imputation on the full data set; the *_imp columns flag which values were imputed
knn_method <- kNN(Soybean, k = 5)
colSums(is.na(knn_method))
## Class date plant.stand precip
## 0 0 0 0
## temp hail crop.hist area.dam
## 0 0 0 0
## sever seed.tmt germ plant.growth
## 0 0 0 0
## leaves leaf.halo leaf.marg leaf.size
## 0 0 0 0
## leaf.shread leaf.malf leaf.mild stem
## 0 0 0 0
## lodging stem.cankers canker.lesion fruiting.bodies
## 0 0 0 0
## ext.decay mycelium int.discolor sclerotia
## 0 0 0 0
## fruit.pods fruit.spots seed mold.growth
## 0 0 0 0
## seed.discolor seed.size shriveling roots
## 0 0 0 0
## Class_imp date_imp plant.stand_imp precip_imp
## 0 0 0 0
## temp_imp hail_imp crop.hist_imp area.dam_imp
## 0 0 0 0
## sever_imp seed.tmt_imp germ_imp plant.growth_imp
## 0 0 0 0
## leaves_imp leaf.halo_imp leaf.marg_imp leaf.size_imp
## 0 0 0 0
## leaf.shread_imp leaf.malf_imp leaf.mild_imp stem_imp
## 0 0 0 0
## lodging_imp stem.cankers_imp canker.lesion_imp fruiting.bodies_imp
## 0 0 0 0
## ext.decay_imp mycelium_imp int.discolor_imp sclerotia_imp
## 0 0 0 0
## fruit.pods_imp fruit.spots_imp seed_imp mold.growth_imp
## 0 0 0 0
## seed.discolor_imp seed.size_imp shriveling_imp roots_imp
## 0 0 0 0
One option to consider for predictors that are entirely NA within a whole class is to create a dummy variable indicating whether the predictor was filled in, or to remove the predictor entirely. Simply filling in such values may be an issue, because the missingness is likely related to how the data were collected and may not hold up over time. On the other hand, for predictors that have some data within a class, I would impute the most common value of that predictor for the given class. Many sources suggest that the wisest strategy is to start by checking the correlation between variables; here, due to the high percentage of missing values, we were not able to get reliable correlations between the variables. If there were a strong correlation between two predictors, we would remove the one with the higher percentage of missing values. As a rule of thumb, predictors with more than about 5% missing values are often suggested for removal, since with more missing values the predictor might not provide reliable information to the model. In the end we used k-nearest neighbours (kNN) to impute the missing values in our dataset.
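As a sketch of how the class-related missingness could be checked (the object name na_by_class is illustrative), the proportion of NAs per predictor can be computed within each class:
library(dplyr)
# Proportion of missing values for each predictor, broken out by Class
na_by_class <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.x))))
# Classes in which at least one predictor is 100% missing
na_by_class %>% filter(if_any(-Class, ~ .x == 1))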