I-) The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
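library(mlbench)
data(Glass)
str(Glass)
The comment below refers to per-predictor histograms; a minimal sketch that would produce them, assuming the DataExplorer package (also used for plot_missing() later):
library(DataExplorer)
plot_histogram(Glass)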
Comment: RI, Na, Al, and Si have fairly close to normal distributions. Mg shows a left skew and possibly an outlier. K is right-skewed, along with Ba and Fe.
Let’s check for missing data.
plot_missing(Glass)
Comment: No missing values are observed; however, the “0.00” values could be missing values that were recorded as zero (an assumption).
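If zeros do stand in for missing values, a quick count per predictor shows how widespread they are (a sketch; the nine numeric predictors are the first nine columns):
# Number of exact-zero entries in each numeric predictor
colSums(Glass[, 1:9] == 0)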
Let’s explore the relationships between the predictors.
library(corrplot)
# Glass2 holds the nine numeric predictors; this definition is assumed,
# since the original code does not show how Glass2 was created
Glass2 <- Glass[, 1:9]
Glass2 |> cor() |> corrplot()
Comment: The correlation plot does not indicate many instances of highly correlated variables. The main exception is Ca, which is highly correlated with RI (coefficient greater than 0.81), followed by a mild correlation between Ba and Al. The strongest negative relationship is between Si and RI.
Do there appear to be any outliers in the data? Are any predictors skewed?
# Checking for outliers
boxplot(Glass2$Al, main = "Boxplot of Al", ylab = "Percentage")
boxplot(Glass2$Ba, main = "Boxplot of Ba", ylab = "Percentage")
boxplot(Glass2$Ca, main = "Boxplot of Ca", ylab = "Percentage")
boxplot(Glass2$Fe, main = "Boxplot of Fe", ylab = "Percentage")
boxplot(Glass2$K, main = "Boxplot of K", ylab = "Percentage")
boxplot(Glass2$Mg, main = "Boxplot of Mg", ylab = "Percentage")
boxplot(Glass2$Na, main = "Boxplot of Na", ylab = "Percentage")
boxplot(Glass2$RI, main = "Boxplot of RI", ylab = "Refractive Index")
boxplot(Glass2$Si, main = "Boxplot of Si", ylab = "Percentage")
# Checking for skewness
library(e1071)
skewness(Glass2$Al)
[1] 0.8946104
skewness(Glass2$Ba)
[1] 3.36868
skewness(Glass2$Ca)
[1] 2.018446
skewness(Glass2$Fe)
[1] 1.729811
skewness(Glass2$K)
[1] 6.460089
skewness(Glass2$Mg)
[1] -1.136452
skewness(Glass2$Na)
[1] 0.4478343
skewness(Glass2$RI)
[1] 1.602715
skewness(Glass2$Si)
[1] -0.7202392
Comment:
- The boxplots show outliers in every predictor except Mg.
- The histograms and skewness values show left skewness for Mg and Si (negative values) and right skewness for Ba, Ca, Fe, K, and RI (values substantially greater than 0).
Are there any relevant transformations of one or more predictors that might improve the classification model?
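The Box-Cox output below was presumably produced by applying caret's BoxCoxTrans() to each predictor; a minimal sketch that would reproduce it (assuming the caret package), with one call per predictor in the order of the output:
library(caret)
BoxCoxTrans(Glass2$Al)
BoxCoxTrans(Glass2$Ba)
BoxCoxTrans(Glass2$Ca)
BoxCoxTrans(Glass2$Fe)
BoxCoxTrans(Glass2$K)
BoxCoxTrans(Glass2$Mg)
BoxCoxTrans(Glass2$Na)
BoxCoxTrans(Glass2$RI)
BoxCoxTrans(Glass2$Si)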
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.290 1.190 1.360 1.445 1.630 3.500
Largest/Smallest: 12.1
Sample Skewness: 0.895
Estimated Lambda: 0.5
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 0.175 0.000 3.150
Lambda could not be estimated; no transformation is applied
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.430 8.240 8.600 8.957 9.172 16.190
Largest/Smallest: 2.98
Sample Skewness: 2.02
Estimated Lambda: -1.1
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.00000 0.00000 0.05701 0.10000 0.51000
Lambda could not be estimated; no transformation is applied
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.1225 0.5550 0.4971 0.6100 6.2100
Lambda could not be estimated; no transformation is applied
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.115 3.480 2.685 3.600 4.490
Lambda could not be estimated; no transformation is applied
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.73 12.91 13.30 13.41 13.82 17.38
Largest/Smallest: 1.62
Sample Skewness: 0.448
Estimated Lambda: -0.1
With fudge factor, Lambda = 0 will be used for transformations
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.511 1.517 1.518 1.518 1.519 1.534
Largest/Smallest: 1.02
Sample Skewness: 1.6
Estimated Lambda: -2
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
69.81 72.28 72.79 72.65 73.09 75.41
Largest/Smallest: 1.08
Sample Skewness: -0.72
Estimated Lambda: 2
Comments:
- Lambda for Al is 0.5; a square-root transformation is appropriate.
- Lambda could not be estimated for Ba, Fe, K, and Mg, because these predictors contain zero values and the Box-Cox transformation requires strictly positive data; no transformation is applied to them.
- Lambda for Ca is -1.1; a reciprocal transformation is appropriate.
- Lambda for Na is -0.1; with the fudge factor, Lambda = 0 is used, which corresponds to a log transformation.
- Lambda for RI is -2; an inverse-square transformation (1/y^2) is appropriate.
- Lambda for Si is 2; a squaring transformation is appropriate.
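As a follow-up, caret's preProcess() can apply such transformations to all predictors at once; the Yeo-Johnson variant also handles the zero-valued predictors (Ba, Fe, K, Mg) that Box-Cox could not. A minimal sketch (the method choices here are an assumption, not part of the original analysis):
library(caret)
# Yeo-Johnson accommodates zeros; center/scale standardizes the predictors
pp <- preProcess(Glass2, method = c("YeoJohnson", "center", "scale"))
Glass2_trans <- predict(pp, Glass2)
summary(Glass2_trans)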
II-) The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
library(mlbench)
data(Soybean)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
library(ggplot2)
Soybean |>
  ggplot(aes(x = Class)) +
  geom_bar(fill = "pink", colour = "black") +
  ggtitle("Distribution by Type of Soybean") +
  xlab("Soybean Type") +
  ylab("Count") +
  coord_flip() +
  labs(caption = "UCI Machine Learning Repository")
A degenerate variable is a variable that has no variability; it takes a single value for all observations, i.e., it is essentially a constant. From the plots of the Soybean predictors, no variable takes a single unique value. However, several variables, such as mycelium and sclerotia, are dominated by one level, with only rare occurrences of the others; these could be considered degenerate once the NAs are excluded. A quick programmatic check is sketched below.
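A minimal sketch using caret's nearZeroVar() to flag such near-degenerate predictors (an assumed approach, not the original code):
library(caret)
# saveMetrics = TRUE returns the frequency-ratio and percent-unique
# diagnostics for every column; the nzv flag marks near-zero variance
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]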
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing?
plot_missing(Soybean)
Overall, four predictors (lodging, seed.tmt, sever, and hail) have the highest percentages of missing data.
Is the pattern of missing data related to the classes?
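One way to check is to total the missing cells within each class; a minimal base-R sketch (an assumption about how the comment below was derived):
# Sum the NA counts per row, then aggregate by class
na_by_class <- tapply(rowSums(is.na(Soybean)), Soybean$Class, sum)
sort(na_by_class, decreasing = TRUE)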
Comment: phytophthora-rot, alternarialeaf-spot, frog-eye-leaf-spot, and brown-spot are the Soybean classes with the most missing data.
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Two techniques can be used to handle missing data: eliminating predictors or imputation. The Soybean dataset is not especially large, and (as seen earlier) no predictor has a proportion of missing data high enough to justify elimination. Therefore imputation, specifically the KNN method, is more appropriate, as sketched below.
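A minimal sketch of KNN imputation using the VIM package (an assumed implementation; VIM's kNN() uses Gower distance, so it can impute the categorical predictors directly):
library(VIM)
# k = 5 neighbors (the default); imp_var = FALSE suppresses the extra
# TRUE/FALSE indicator columns that kNN() would otherwise append
Soybean_imputed <- kNN(Soybean, k = 5, imp_var = FALSE)
anyNA(Soybean_imputed)  # FALSE once every cell has been imputed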