Joel Park
3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
From ?Glass, this is "a data frame with 214 observation containing examples of the chemical analysis of 7 different types of glass. The problem is to forecast the type of class on basis of the chemical analysis. The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence (if it is correctly identified!)."
The 10 variables are:
RI: refractive index
Na: Sodium
Mg: Magnesium
Al: Aluminum
Si: Silicon
K: Potassium
Ca: Calcium
Ba: Barium
Fe: Iron
Type: Type of glass (class attribute)
We assume that the dependent variable, Type, is categorical and has no ordinal component.
A. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
For exploratory data analysis, we will approach this in two ways: (1) descriptively and (2) visually.
Let’s look at the head, columns, and summary of the data.
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
colnames(Glass)
## [1] "RI" "Na" "Mg" "Al" "Si" "K" "Ca" "Ba" "Fe" "Type"
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
From the above, we note that all predictor (independent) variables are numeric, and the dependent variable, Type, is a categorical factor.
The psych package has a function, describe(), that reports skewness, kurtosis, and other summary statistics for each predictor variable.
library(psych)
describe(Glass)
## vars n mean sd median trimmed mad min max range skew
## RI 1 214 1.52 0.00 1.52 1.52 0.00 1.51 1.53 0.02 1.60
## Na 2 214 13.41 0.82 13.30 13.38 0.64 10.73 17.38 6.65 0.45
## Mg 3 214 2.68 1.44 3.48 2.87 0.30 0.00 4.49 4.49 -1.14
## Al 4 214 1.44 0.50 1.36 1.41 0.31 0.29 3.50 3.21 0.89
## Si 5 214 72.65 0.77 72.79 72.71 0.57 69.81 75.41 5.60 -0.72
## K 6 214 0.50 0.65 0.56 0.43 0.17 0.00 6.21 6.21 6.46
## Ca 7 214 8.96 1.42 8.60 8.74 0.66 5.43 16.19 10.76 2.02
## Ba 8 214 0.18 0.50 0.00 0.03 0.00 0.00 3.15 3.15 3.37
## Fe 9 214 0.06 0.10 0.00 0.04 0.00 0.00 0.51 0.51 1.73
## Type* 10 214 2.54 1.71 2.00 2.31 1.48 1.00 6.00 5.00 1.04
## kurtosis se
## RI 4.72 0.00
## Na 2.90 0.06
## Mg -0.45 0.10
## Al 1.94 0.03
## Si 2.82 0.05
## K 52.87 0.04
## Ca 6.41 0.10
## Ba 12.08 0.03
## Fe 2.52 0.01
## Type* -0.29 0.12
anyNA(Glass)
## [1] FALSE
Judging from the information above, there are no missing data, so no imputation will be needed for this exercise.
Though the descriptive statistics provide a significant amount of information, they can be difficult to interpret with so many variables. Visualization helps with "eyeballing" the data and determining which models or assumptions are appropriate.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(reshape2)
ggplot(melt(Glass, id.vars=c('Type')), aes(x=value)) + geom_histogram(bins=50) + facet_wrap(~variable, scale="free")
ggplot(melt(Glass, id.vars=c('Type')), aes(x = variable, y = value)) + geom_boxplot() + facet_wrap(~variable, scale="free") + coord_flip()
As you can see, many of the predictors (in fact, likely all of them) are not normally distributed. Because normality matters for many models, we must take the skewed predictors into consideration. For example, Si is left-skewed, whereas Ca is right-skewed.
Let’s take a closer look at normality via the QQ plot.
# Reference: http://www.sthda.com/english/wiki/ggplot2-qq-plot-quantile-quantile-graph-quick-start-guide-r-software-and-data-visualization
Glass_columns <- colnames(Glass)
ggplot(melt(Glass, id.vars=c('Type')), aes(sample=value)) + stat_qq() + facet_wrap(~variable, scale="free")
The Q-Q plots confirm that these variables (some more than others) deviate from normality, since their points do not fall along a straight line.
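As a complementary numeric check (not part of the original analysis; a sketch using base R's shapiro.test), we can look at Shapiro-Wilk p-values for each predictor. Very small p-values agree with the Q-Q plots in rejecting normality.
# Shapiro-Wilk test p-value per predictor; tiny p-values indicate departure from normality.
sapply(Glass[, -10], function(x) shapiro.test(x)$p.value)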
Are there any correlations in the data?
correlations <- cor(Glass[,-10])
correlations
## RI Na Mg Al Si
## RI 1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220
## Na -0.1918853790 1.00000000 -0.273731961 0.15679367 -0.06980881
## Mg -0.1222740393 -0.27373196 1.000000000 -0.48179851 -0.16592672
## Al -0.4073260341 0.15679367 -0.481798509 1.00000000 -0.00552372
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372 1.00000000
## K -0.2898327111 -0.26608650 0.005395667 0.32595845 -0.19333085
## Ca 0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215
## Ba -0.0003860189 0.32660288 -0.492262118 0.47940390 -0.10215131
## Fe 0.1430096093 -0.24134641 0.083059529 -0.07440215 -0.09420073
## K Ca Ba Fe
## RI -0.289832711 0.8104027 -0.0003860189 0.143009609
## Na -0.266086504 -0.2754425 0.3266028795 -0.241346411
## Mg 0.005395667 -0.4437500 -0.4922621178 0.083059529
## Al 0.325958446 -0.2595920 0.4794039017 -0.074402151
## Si -0.193330854 -0.2087322 -0.1021513105 -0.094200731
## K 1.000000000 -0.3178362 -0.0426180594 -0.007719049
## Ca -0.317836155 1.0000000 -0.1128409671 0.124968219
## Ba -0.042618059 -0.1128410 1.0000000000 -0.058691755
## Fe -0.007719049 0.1249682 -0.0586917554 1.000000000
Along the diagonal of this matrix, every value equals 1, since each variable is perfectly correlated with itself. Values close to 0 indicate little or no correlation. Let's visualize this.
library(corrplot)
corrplot(correlations, order = "hclust")
From the plot, the strongest correlation appears to be between Ca and RI. Let's see what their correlation value is:
cor(Glass[,c('Ca','RI')])
## Ca RI
## Ca 1.0000000 0.8104027
## RI 0.8104027 1.0000000
The correlation here is 0.81, which is fairly high.
Suppose we use a cutoff of 0.75 (as in the textbook exercise); let's find out which predictor variables might be dropped due to high correlation. (High correlation between predictors leads to collinearity, which can ultimately produce an unstable model.)
library(caret)
## Loading required package: lattice
highCorr <- findCorrelation(correlations, cutoff = .75)
print(paste0("Number of Predictor Variables with Pearson Correlation > 0.75: ",length(highCorr)))
## [1] "Number of Predictor Variables with Pearson Correlation > 0.75: 1"
print(paste0("Caret recommends dropping predictor variable: ", Glass_columns[highCorr[1]]))
## [1] "Caret recommends dropping predictor variable: Ca"
The above gives a general picture of the structure of our data.
B. Do there appear to be any outliers in the data? Are any predictors skewed?
As noted in section A, the box plots do a good job of showing the outliers for each predictor variable. By convention, any value more than 1.5 times the interquartile range below the lower quartile or above the upper quartile shows up as a "dot" on the boxplot. There are many ways to handle outliers (including deleting them), but domain knowledge about why these outliers exist is important, and these values should be weighed carefully before deciding to transform or delete them.
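To make the 1.5 x IQR rule concrete, here is a minimal sketch for a single predictor (Ca, chosen only as an example); the values outside the fences closely match the points drawn as dots on its boxplot (boxplot.stats uses hinges, which can differ slightly from quantile()).
# Compute the boxplot fences for Ca and list the values falling outside them.
q <- quantile(Glass$Ca, probs = c(0.25, 0.75))
iqr_ca <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr_ca
upper_fence <- q[2] + 1.5 * iqr_ca
Glass$Ca[Glass$Ca < lower_fence | Glass$Ca > upper_fence]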
Another method of showing outliers:
# Reference: https://stackoverflow.com/questions/44089894/identifying-the-outliers-in-a-data-set-in-r
for(i in 1:9) {
print(paste0("Predictor Variable (outlier values): ", colnames(Glass[i])))
print(paste0(boxplot(Glass[i],plot=FALSE)$out))
}
## [1] "Predictor Variable (outlier values): RI"
## [1] "1.52667" "1.5232" "1.51215" "1.52725" "1.5241" "1.52475" "1.53125"
## [8] "1.53393" "1.52664" "1.52739" "1.52777" "1.52614" "1.52369" "1.51115"
## [15] "1.51131" "1.52315" "1.52365"
## [1] "Predictor Variable (outlier values): Na"
## [1] "11.45" "10.73" "11.23" "11.02" "11.03" "17.38" "15.79"
## [1] "Predictor Variable (outlier values): Mg"
## character(0)
## [1] "Predictor Variable (outlier values): Al"
## [1] "0.29" "0.47" "0.47" "0.51" "3.5" "3.04" "3.02" "0.34" "2.38" "2.79"
## [11] "2.68" "2.54" "2.34" "2.66" "2.51" "2.42" "2.74" "2.88"
## [1] "Predictor Variable (outlier values): Si"
## [1] "70.57" "69.81" "70.16" "74.45" "69.89" "70.48" "70.7" "74.55"
## [9] "75.41" "70.26" "70.43" "75.18"
## [1] "Predictor Variable (outlier values): K"
## [1] "1.68" "6.21" "6.21" "1.76" "1.46" "2.7" "1.41"
## [1] "Predictor Variable (outlier values): Ca"
## [1] "11.64" "10.79" "13.24" "13.3" "16.19" "11.52" "10.99" "14.68"
## [9] "14.96" "14.4" "11.14" "13.44" "5.87" "11.41" "11.62" "11.53"
## [17] "11.32" "12.24" "12.5" "11.27" "10.88" "11.22" "6.65" "5.43"
## [25] "5.79" "6.47"
## [1] "Predictor Variable (outlier values): Ba"
## [1] "0.09" "0.11" "0.69" "0.14" "0.11" "3.15" "0.27" "0.09" "0.06" "0.15"
## [11] "2.2" "0.24" "1.19" "1.63" "1.68" "0.76" "0.64" "0.4" "1.59" "1.57"
## [21] "0.61" "0.81" "0.66" "0.64" "0.53" "0.63" "0.56" "1.71" "0.67" "1.55"
## [31] "1.38" "2.88" "0.54" "1.06" "1.59" "1.64" "1.57" "1.67"
## [1] "Predictor Variable (outlier values): Fe"
## [1] "0.26" "0.3" "0.31" "0.32" "0.34" "0.28" "0.29" "0.28" "0.35" "0.37"
## [11] "0.51" "0.28"
These are the boxplot outlier values for each predictor variable. So the answer is yes, there do appear to be outliers in every predictor variable except Mg.
Let us take a look at the skewness of the predictor variables. (We have already touched on this in section A.)
# Reference: Textbook Chapter 3
library(e1071)
apply(Glass[,-10], 2, skewness)
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
## Ca Ba Fe
## 2.0184463 3.3686800 1.7298107
According to https://brownmath.com/stat/shape.htm#SkewnessInterpret,
“Classic rule of thumb for skewness: If skewness is less than −1 or greater than +1, the distribution is highly skewed. If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed. If skewness is between −½ and +½, the distribution is approximately symmetric.”
By this rule, Na is approximately symmetric and Al and Si are moderately skewed, while the remaining predictors are highly skewed. Not surprisingly, many of these predictors are candidates for transformation in the hope of achieving a more normal distribution.
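Applying the quoted rule of thumb programmatically (a small helper, not from the original writeup):
# Classify each predictor's sample skewness using the quoted thresholds.
skews <- apply(Glass[, -10], 2, skewness)
data.frame(skew = round(skews, 2),
           shape = cut(abs(skews), breaks = c(0, 0.5, 1, Inf),
                       labels = c("approx. symmetric", "moderately skewed", "highly skewed")))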
C. Are there any relevant transformations of one or more predictors that might improve the classification model?
As noted in section B, several predictor variables will likely require a transformation. A popular approach is the Box-Cox transformation.
transformed_glass <- apply(Glass[,-10], 2, BoxCoxTrans)
transformed_glass
## $RI
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.511 1.517 1.518 1.518 1.519 1.534
##
## Largest/Smallest: 1.02
## Sample Skewness: 1.6
##
## Estimated Lambda: -2
##
##
## $Na
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.73 12.91 13.30 13.41 13.82 17.38
##
## Largest/Smallest: 1.62
## Sample Skewness: 0.448
##
## Estimated Lambda: -0.1
## With fudge factor, Lambda = 0 will be used for transformations
##
##
## $Mg
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.115 3.480 2.685 3.600 4.490
##
## Lambda could not be estimated; no transformation is applied
##
##
## $Al
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 1.190 1.360 1.445 1.630 3.500
##
## Largest/Smallest: 12.1
## Sample Skewness: 0.895
##
## Estimated Lambda: 0.5
##
##
## $Si
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 69.81 72.28 72.79 72.65 73.09 75.41
##
## Largest/Smallest: 1.08
## Sample Skewness: -0.72
##
## Estimated Lambda: 2
##
##
## $K
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1225 0.5550 0.4971 0.6100 6.2100
##
## Lambda could not be estimated; no transformation is applied
##
##
## $Ca
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.430 8.240 8.600 8.957 9.172 16.190
##
## Largest/Smallest: 2.98
## Sample Skewness: 2.02
##
## Estimated Lambda: -1.1
##
##
## $Ba
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.175 0.000 3.150
##
## Lambda could not be estimated; no transformation is applied
##
##
## $Fe
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000
##
## Lambda could not be estimated; no transformation is applied
As you can see above, a Box-Cox transformation could be estimated for some predictors but not for others. The Box-Cox transformation requires strictly positive values, so it cannot be applied to variables that contain zero or negative values (Mg, K, Ba, and Fe all contain zeros).
For predictor variables such as Fe or Ba, we should accept that their distributions may never look normal even after attempting a Box-Cox transformation. For these cases, we should consider other distributions, such as the Poisson or the negative binomial, which may ultimately be a better fit.
Reference: https://www.researchgate.net/post/How_to_transform_count_data_with_0s_to_get_a_normal_distribution
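One option worth sketching (an assumption on my part, not part of the original writeup) is the Yeo-Johnson transformation available through caret's preProcess(), which, unlike Box-Cox, accommodates zero values:
# Estimate Yeo-Johnson transformations for all nine predictors (zeros are allowed)
# and compare the skewness after transforming.
pp_yj <- preProcess(Glass[, -10], method = "YeoJohnson")
glass_yj <- predict(pp_yj, Glass[, -10])
apply(glass_yj, 2, skewness)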
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:
library(mlbench)
data(Soybean)
## See ?Soybean for details
According to ?Soybean, "there are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value "dna" means does not apply. The values for attributes are encoded numerically, with the first value encoded as "0," the second as "1," and so forth."
The documentation also describes the data as "a data frame with 683 observations on 36 variables. There are 35 categorical attributes, all numerical and a nominal denoting the class."
A. Investigate the frequency distributions in the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Let’s take a look at the head of the dataset.
head(Soybean)
## Class date plant.stand precip temp hail crop.hist
## 1 diaporthe-stem-canker 6 0 2 1 0 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2
## 3 diaporthe-stem-canker 3 0 2 1 0 1
## 4 diaporthe-stem-canker 3 0 2 1 0 1
## 5 diaporthe-stem-canker 6 0 2 1 0 2
## 6 diaporthe-stem-canker 5 0 2 1 0 3
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1 1 1 0 0 1 1 0 2
## 2 0 2 1 1 1 1 0 2
## 3 0 2 1 2 1 1 0 2
## 4 0 2 0 1 1 1 0 2
## 5 0 1 0 2 1 1 0 2
## 6 0 1 0 1 1 1 0 2
## leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1 2 0 0 0 1 1 3
## 2 2 0 0 0 1 0 3
## 3 2 0 0 0 1 0 3
## 4 2 0 0 0 1 0 3
## 5 2 0 0 0 1 0 3
## 6 2 0 0 0 1 0 3
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1 1 1 1 0 0 0
## 2 1 1 1 0 0 0
## 3 0 1 1 0 0 0
## 4 0 1 1 0 0 0
## 5 1 1 1 0 0 0
## 6 0 1 1 0 0 0
## fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1 0 4 0 0 0 0
## 2 0 4 0 0 0 0
## 3 0 4 0 0 0 0
## 4 0 4 0 0 0 0
## 5 0 4 0 0 0 0
## 6 0 4 0 0 0 0
## shriveling roots
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Now, let’s perform a visualization.
ggplot(melt(Soybean, id.vars=c('Class')), aes(x=value)) + geom_histogram(stat="count") + facet_wrap(~variable, scale="free")
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: Ignoring unknown parameters: binwidth, bins, pad
We must remember that these frequency distributions can be misleading. The predictor values are categorical codes, and many of them are nominal rather than ordinal (the documentation notes that some attributes are ordered, but not all). For a nominal predictor, a value of 2 does not mean it is greater than a value of 1; the numbers are simply labels, and we must be careful not to treat them as ordinal values.
To answer the question, I refer to section 3.5 “Removing Predictors.”
“There are potential advantages to removing predictors prior to modeling. First, few predictors means decreased computational time and complexity. Second, if two predictors are highly correlated, this implies that they are measuring the same underlying information. Removing one should not compromise the performance of the model and might lead to a more parsimonious and interpretable model. Third, some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables.”
What is a degenerate distribution? Let’s refer to: https://www.statisticshowto.datasciencecentral.com/degenerate-distribution/.
A degenerate distribution is one in which a single constant c occurs with probability 1. An example would be a weighted die: if the die always lands on 6, so that \(P(X = 6) = 1\), it follows a degenerate distribution.
A predictor whose values follow a degenerate distribution is a zero-variance predictor. If the distribution is only nearly degenerate (for instance, the weighted die is thrown 100 times, landing on 6 in 99% of throws and on 5 in the remaining 1%), we call it a near-zero-variance predictor. Zero- and near-zero-variance predictors typically add no value to a model and can detract from it.
A good rule of thumb from the textbook for detecting near-zero-variance predictors: the fraction of unique values over the sample size is low (say, under 10%), and the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say, around 20). We can filter out near-zero-variance predictors using the nearZeroVar() function from the caret package.
nearZeroVar(Soybean)
## [1] 19 26 28
This returns a vector of integer column indices that should be removed. In this case, the near-zero-variance (degenerate) predictors are columns 19, 26, and 28, which correspond to leaf.mild, mycelium, and sclerotia.
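To see the diagnostics behind the rule of thumb (a sketch using caret's saveMetrics argument), we can ask nearZeroVar() for the frequency ratio and percent of unique values for each column:
# freqRatio and percentUnique drive the nzv flag; show only the flagged columns.
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]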
B. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Let’s determine which predictor variables contain the missing data points.
Soybean_na <- apply(Soybean, 2, function(x){sum(is.na(x))})
Soybean_na
## Class date plant.stand precip
## 0 1 36 38
## temp hail crop.hist area.dam
## 30 121 16 1
## sever seed.tmt germ plant.growth
## 121 121 112 16
## leaves leaf.halo leaf.marg leaf.size
## 0 84 84 84
## leaf.shread leaf.malf leaf.mild stem
## 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies
## 121 38 38 106
## ext.decay mycelium int.discolor sclerotia
## 38 38 38 38
## fruit.pods fruit.spots seed mold.growth
## 84 106 92 92
## seed.discolor seed.size shriveling roots
## 106 92 106 31
Soybean_na[order(Soybean_na, decreasing=TRUE)]
## hail sever seed.tmt lodging
## 121 121 121 121
## germ leaf.mild fruiting.bodies fruit.spots
## 112 108 106 106
## seed.discolor shriveling leaf.shread seed
## 106 106 100 92
## mold.growth seed.size leaf.halo leaf.marg
## 92 92 84 84
## leaf.size leaf.malf fruit.pods precip
## 84 84 84 38
## stem.cankers canker.lesion ext.decay mycelium
## 38 38 38 38
## int.discolor sclerotia plant.stand roots
## 38 38 36 31
## temp crop.hist plant.growth stem
## 30 16 16 16
## date area.dam Class leaves
## 1 1 0 0
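As a quick check on the "roughly 18%" figure, we can count rows (rather than individual cells) that contain at least one NA; this row-level proportion should come out close to 18%.
# Proportion of incomplete rows; complete.cases() is FALSE for any row with at least one NA.
mean(!complete.cases(Soybean))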
Looking at the totals, there is quite a bit of missing data spread throughout the dataset. Is there a particular pattern as to why these data are missing? The documentation states that four of the classes are considered unjustified by some because they have so few examples; perhaps the missing data are related to those classes.
Num_of_NAs <- apply(Soybean, 1, function(x){sum(is.na(x))})
Class_Soybean <- Soybean$Class
Soybean_na_df <- data.frame(Class_Soybean, Num_of_NAs)
head(Soybean_na_df,10)
## Class_Soybean Num_of_NAs
## 1 diaporthe-stem-canker 0
## 2 diaporthe-stem-canker 0
## 3 diaporthe-stem-canker 0
## 4 diaporthe-stem-canker 0
## 5 diaporthe-stem-canker 0
## 6 diaporthe-stem-canker 0
## 7 diaporthe-stem-canker 0
## 8 diaporthe-stem-canker 0
## 9 diaporthe-stem-canker 0
## 10 diaporthe-stem-canker 0
Now that we have created a data frame pairing each row's class with its number of missing values, let's sum the NAs for each class.
# Reference: https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group
# Reference: how to reorder R dataframe decreasing column
results <- aggregate(Soybean_na_df$Num_of_NAs, by=list(Class_Soybean=Soybean_na_df$Class_Soybean), FUN=sum)
results[order(results[,"x"]),]
## Class_Soybean x
## 2 alternarialeaf-spot 0
## 3 anthracnose 0
## 4 bacterial-blight 0
## 5 bacterial-pustule 0
## 6 brown-spot 0
## 7 brown-stem-rot 0
## 8 charcoal-rot 0
## 11 diaporthe-stem-canker 0
## 12 downy-mildew 0
## 13 frog-eye-leaf-spot 0
## 15 phyllosticta-leaf-spot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
## 14 herbicide-injury 160
## 10 diaporthe-pod-&-stem-blight 177
## 9 cyst-nematode 336
## 1 2-4-d-injury 450
## 16 phytophthora-rot 1214
Interestingly, all of the missing data comes from five classes: herbicide-injury, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, and phytophthora-rot. Given that the documentation describes the last four classes as having very few examples, it is not surprising that they account for much of the missing data, although phytophthora-rot contributes the most NAs of all.
C. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
According to the textbook, "Applied Predictive Modeling", we can either remove the rows with missing values (reasonable when the dataset is large and the missingness is sparse) or impute the missing data.
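To make the first option concrete (a minimal sketch; we ultimately impute instead), dropping incomplete rows would look like this, at the cost of losing every row from the sparsely represented classes:
# Keep only complete rows; this discards every row containing at least one NA.
Soybean_complete <- Soybean[complete.cases(Soybean), ]
dim(Soybean_complete)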
Let's see how the null values are distributed throughout the Soybean dataset, starting with the class that has the most missing data points, phytophthora-rot.
Reference: https://towardsdatascience.com/the-use-of-knn-for-missing-values-cf33d935c637
head(Soybean[Soybean$Class=='phytophthora-rot',],20)
## Class date plant.stand precip temp hail crop.hist area.dam
## 31 phytophthora-rot 0 1 2 1 1 1 1
## 32 phytophthora-rot 1 1 2 1 <NA> 3 1
## 33 phytophthora-rot 2 1 2 2 <NA> 2 1
## 34 phytophthora-rot 1 1 2 0 0 2 1
## 35 phytophthora-rot 2 1 2 2 <NA> 2 1
## 36 phytophthora-rot 3 1 2 1 <NA> 2 1
## 37 phytophthora-rot 0 1 1 1 0 1 1
## 38 phytophthora-rot 3 1 2 0 0 2 1
## 39 phytophthora-rot 2 1 1 1 <NA> 0 1
## 40 phytophthora-rot 2 1 2 0 0 1 1
## 41 phytophthora-rot 2 1 2 1 <NA> 1 1
## 42 phytophthora-rot 1 1 2 1 <NA> 1 1
## 43 phytophthora-rot 0 1 2 1 0 3 1
## 44 phytophthora-rot 0 1 1 1 1 2 1
## 45 phytophthora-rot 3 1 2 0 0 1 1
## 46 phytophthora-rot 2 1 2 2 <NA> 3 1
## 47 phytophthora-rot 0 1 2 1 0 2 1
## 48 phytophthora-rot 2 1 1 2 <NA> 2 1
## 49 phytophthora-rot 2 1 2 1 1 1 1
## 50 phytophthora-rot 0 1 2 1 0 3 1
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 31 1 0 0 1 1 0 2 2
## 32 <NA> <NA> <NA> 1 1 0 2 2
## 33 <NA> <NA> <NA> 1 1 <NA> <NA> <NA>
## 34 2 1 1 1 1 0 2 2
## 35 <NA> <NA> <NA> 1 1 <NA> <NA> <NA>
## 36 <NA> <NA> <NA> 1 1 <NA> <NA> <NA>
## 37 1 0 0 1 1 0 2 2
## 38 2 1 1 1 1 0 2 2
## 39 <NA> <NA> <NA> 1 1 0 2 2
## 40 2 0 1 1 1 0 2 2
## 41 <NA> <NA> <NA> 1 1 0 2 2
## 42 <NA> <NA> <NA> 1 1 <NA> <NA> <NA>
## 43 1 0 0 1 1 0 2 2
## 44 2 1 0 1 1 0 2 2
## 45 2 1 0 1 1 0 2 2
## 46 <NA> <NA> <NA> 1 1 <NA> <NA> <NA>
## 47 1 0 1 1 1 0 2 2
## 48 <NA> <NA> <NA> 1 1 0 2 2
## 49 2 0 2 1 1 0 2 2
## 50 1 0 2 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 31 0 0 0 1 0 1 2
## 32 0 0 0 1 <NA> 2 2
## 33 <NA> <NA> <NA> 1 <NA> 3 2
## 34 0 0 0 1 0 2 2
## 35 <NA> <NA> <NA> 1 <NA> 2 2
## 36 <NA> <NA> <NA> 1 <NA> 3 2
## 37 0 0 0 1 0 1 2
## 38 0 0 0 1 0 2 2
## 39 0 0 0 1 <NA> 2 2
## 40 0 0 0 1 0 1 2
## 41 0 0 0 1 <NA> 2 2
## 42 <NA> <NA> <NA> 1 <NA> 1 2
## 43 0 0 0 1 0 1 2
## 44 0 0 0 1 1 2 2
## 45 0 0 0 1 0 2 2
## 46 <NA> <NA> <NA> 1 <NA> 3 2
## 47 0 0 0 1 0 1 2
## 48 0 0 0 1 <NA> 2 2
## 49 0 0 0 1 0 1 2
## 50 0 0 0 1 0 1 2
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 31 0 1 0 0 0 3
## 32 <NA> 0 0 0 0 <NA>
## 33 <NA> 0 0 0 0 <NA>
## 34 0 0 0 0 0 3
## 35 <NA> 0 0 0 0 <NA>
## 36 <NA> 0 0 0 0 <NA>
## 37 0 0 0 0 0 3
## 38 0 0 0 0 0 3
## 39 <NA> 0 0 0 0 <NA>
## 40 0 0 0 0 0 3
## 41 <NA> 0 0 0 0 <NA>
## 42 <NA> 0 0 0 0 <NA>
## 43 0 0 0 0 0 3
## 44 0 1 0 0 0 3
## 45 0 0 0 0 0 3
## 46 <NA> 0 0 0 0 <NA>
## 47 0 0 0 0 0 3
## 48 <NA> 0 0 0 0 <NA>
## 49 0 1 0 0 0 3
## 50 0 0 0 0 0 3
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 31 4 0 0 0 0 0 0
## 32 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 33 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 34 4 0 0 0 0 0 0
## 35 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 36 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 37 4 0 0 0 0 0 0
## 38 4 0 0 0 0 0 0
## 39 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 40 4 0 0 0 0 0 0
## 41 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 42 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 43 4 0 0 0 0 0 0
## 44 4 0 0 0 0 0 0
## 45 4 0 0 0 0 0 0
## 46 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 47 4 0 0 0 0 0 0
## 48 <NA> <NA> <NA> <NA> <NA> <NA> 1
## 49 4 0 0 0 0 0 0
## 50 4 0 0 0 0 0 0
We try to avoid discarding data points, even those with many missing values, because the more data we throw away, the more potentially valuable information we lose. Therefore, we will attempt to impute the missing values instead. Recall from the sections above that all of these predictors are categorical. In the textbook, the authors use K-nearest neighbors to impute data. (https://www.researchgate.net/post/What_is_the_proper_imputation_method_for_categorical_missing_value is another site with information on imputing categorical missing data, but we will focus on KNN.)
KNN imputation treats each observation as a point in a multidimensional feature space, where similar observations sit near one another. For an observation with a missing value, the algorithm finds its K nearest neighbors (K being a parameter we choose) and imputes the missing value from those neighbors; for a categorical variable, this amounts to taking the mode of the neighbors' values.
# Reference: https://stats.stackexchange.com/questions/61110/knn-imputation-r-packages
library(DMwR)
## Loading required package: grid
imputed_data <- knnImputation(Soybean,k=5)
head(imputed_data,20)
## Class date plant.stand precip temp hail crop.hist
## 1 diaporthe-stem-canker 6 0 2 1 0 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2
## 3 diaporthe-stem-canker 3 0 2 1 0 1
## 4 diaporthe-stem-canker 3 0 2 1 0 1
## 5 diaporthe-stem-canker 6 0 2 1 0 2
## 6 diaporthe-stem-canker 5 0 2 1 0 3
## 7 diaporthe-stem-canker 5 0 2 1 0 2
## 8 diaporthe-stem-canker 4 0 2 1 1 1
## 9 diaporthe-stem-canker 6 0 2 1 0 3
## 10 diaporthe-stem-canker 4 0 2 1 0 2
## 11 charcoal-rot 6 0 0 2 0 1
## 12 charcoal-rot 4 0 0 1 1 1
## 13 charcoal-rot 3 0 0 1 0 1
## 14 charcoal-rot 6 0 0 1 1 3
## 15 charcoal-rot 6 0 0 2 0 1
## 16 charcoal-rot 5 0 0 2 1 3
## 17 charcoal-rot 6 0 0 2 1 0
## 18 charcoal-rot 4 0 0 1 0 2
## 19 charcoal-rot 3 0 0 2 0 2
## 20 charcoal-rot 5 0 0 2 1 2
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1 1 1 0 0 1 1 0 2
## 2 0 2 1 1 1 1 0 2
## 3 0 2 1 2 1 1 0 2
## 4 0 2 0 1 1 1 0 2
## 5 0 1 0 2 1 1 0 2
## 6 0 1 0 1 1 1 0 2
## 7 0 1 1 0 1 1 0 2
## 8 0 1 0 2 1 1 0 2
## 9 0 1 1 1 1 1 0 2
## 10 0 2 0 2 1 1 0 2
## 11 3 1 1 0 1 1 0 2
## 12 3 1 1 1 1 1 0 2
## 13 2 1 0 0 1 1 0 2
## 14 3 1 1 0 1 1 0 2
## 15 3 1 1 1 1 1 0 2
## 16 3 1 1 2 1 1 0 2
## 17 2 1 0 0 1 1 0 2
## 18 2 1 0 1 1 1 0 2
## 19 2 1 0 2 1 1 0 2
## 20 2 1 0 2 1 1 0 2
## leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1 2 0 0 0 1 1 3
## 2 2 0 0 0 1 0 3
## 3 2 0 0 0 1 0 3
## 4 2 0 0 0 1 0 3
## 5 2 0 0 0 1 0 3
## 6 2 0 0 0 1 0 3
## 7 2 0 0 0 1 1 3
## 8 2 0 0 0 1 0 3
## 9 2 0 0 0 1 0 3
## 10 2 0 0 0 1 0 3
## 11 2 0 0 0 1 0 0
## 12 2 0 0 0 1 1 0
## 13 2 0 0 0 1 0 0
## 14 2 0 0 0 1 0 0
## 15 2 0 0 0 1 0 0
## 16 2 0 0 0 1 0 0
## 17 2 0 0 0 1 1 0
## 18 2 0 0 0 1 0 0
## 19 2 0 0 0 1 0 0
## 20 2 0 0 0 1 0 0
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1 1 1 1 0 0 0
## 2 1 1 1 0 0 0
## 3 0 1 1 0 0 0
## 4 0 1 1 0 0 0
## 5 1 1 1 0 0 0
## 6 0 1 1 0 0 0
## 7 1 1 1 0 0 0
## 8 1 1 1 0 0 0
## 9 1 1 1 0 0 0
## 10 1 1 1 0 0 0
## 11 3 0 0 0 2 1
## 12 3 0 0 0 2 1
## 13 3 0 0 0 2 1
## 14 3 0 0 0 2 1
## 15 3 0 0 0 2 1
## 16 3 0 0 0 2 1
## 17 3 0 0 0 2 1
## 18 3 0 0 0 2 1
## 19 3 0 0 0 2 1
## 20 3 0 0 0 2 1
## fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1 0 4 0 0 0 0
## 2 0 4 0 0 0 0
## 3 0 4 0 0 0 0
## 4 0 4 0 0 0 0
## 5 0 4 0 0 0 0
## 6 0 4 0 0 0 0
## 7 0 4 0 0 0 0
## 8 0 4 0 0 0 0
## 9 0 4 0 0 0 0
## 10 0 4 0 0 0 0
## 11 0 4 0 0 0 0
## 12 0 4 0 0 0 0
## 13 0 4 0 0 0 0
## 14 0 4 0 0 0 0
## 15 0 4 0 0 0 0
## 16 0 4 0 0 0 0
## 17 0 4 0 0 0 0
## 18 0 4 0 0 0 0
## 19 0 4 0 0 0 0
## 20 0 4 0 0 0 0
## shriveling roots
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 0 0
## 10 0 0
## 11 0 0
## 12 0 0
## 13 0 0
## 14 0 0
## 15 0 0
## 16 0 0
## 17 0 0
## 18 0 0
## 19 0 0
## 20 0 0
anyNA(imputed_data)
## [1] FALSE
We have successfully imputed all of the missing values using K-nearest neighbors.
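As a final sanity check (an added step, assuming we simply want to compare distributions), we can look at how the level counts of a heavily missing predictor such as hail changed once its NAs were filled in:
# Compare level counts before and after imputation for one predictor.
table(Soybean$hail, useNA = "ifany")
table(imputed_data$hail)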