library(tidyverse)
library(caret)
library(corrplot)
library(e1071)
library(lattice)
library(car)
library(RANN)

Assignment 4: Data Preprocessing/Overfitting
Assignment 4: Do exercises 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Link to Applied Predictive Modeling for reference.
3.1
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)

'data.frame': 214 obs. of 10 variables:
$ RI : num 1.52 1.52 1.52 1.52 1.52 ...
$ Na : num 13.6 13.9 13.5 13.2 13.3 ...
$ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
$ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
$ Si : num 71.8 72.7 73 72.6 73.1 ...
$ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
$ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
$ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
$ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
$ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Based on the above, we can see that ‘Type’ is not a predictor: the exercise lists nine predictors (the refractive index and the percentages of the eight elements Na, Mg, Al, Si, K, Ca, Ba, and Fe), and Type is the class label. So let’s create a subset excluding ‘Type’ that we can use for the following questions. We won’t create a ‘training’ dataset (and a ‘test’ dataset) as we might in other circumstances, because this set of exercises is for practicing data pre-processing.
glass_predictors <- Glass |> select(-Type)

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
glass_predictors |>
pivot_longer(everything(), names_to = "variable", values_to = "value") |>
ggplot(aes(x = value)) +
geom_histogram(bins = 40) +
facet_wrap(~ variable, scales = "free", ncol = 3) +
labs(title = "Distributions of Glass Predictor Variables",
x = "value",
y = "Count")Here’s what we can tell with the above histograms of the glass predictor variables, starting from the top row:
Percentage of Al: The distribution of Al is roughly symmetric, slightly right skewed. We can see what appear to be some outliers on the right side of the histogram which contribute to its slight right skew. Most of the samples are between 1% and 2% Al. The data is centered around 1.3-1.4%. There are no apparent observations recorded at 0% so we can say that this element is in all samples in the dataset. The highest value recorded for Al appears just above 3.5.
Percentage of Ba: The distribution of Ba is extremely right skewed in this histogram. The majority of values are at 0% and there is a long right tail with values extending past 3.
Percentage of Ca: The distribution of Ca is relatively bell-shaped and centered between 8 and 9. There are more outliers on the right side of the histogram so we can call this slightly right-skewed. There are no values observable below 5.
Percentage of Fe: The distribution of Fe, like Ba, is extremely right skewed in this histogram. Also like Ba, the majority of values are at 0 (0%) and there is a long right tail. For Fe, the values extend up just past 0.5.
Percentage of K: This distribution has most of its values near or at 0. While there are some values at 0, it is not the majority of values - the tallest bin we see appears around 0.7. The right tail trails just past 6: this distribution is right skewed.
Percentage of Mg: This distribution is bimodal: it has a significant amount of data points at 0 and a large amount of data between 3.4 and 3.8. The largest visible value appears at 4.5%.
Percentage of Na: The distribution of Na is slightly right skewed but mostly centered around 13 where we see the largest number of values. There are visible tails in both directions and it is somewhat symmetric.
RI (Refractive Index): This distribution appears somewhat bimodal and somewhat symmetric, though when we called for fewer bins it looked more bell-shaped rather than bimodal. The two ‘peaks’ we see are both between 1.515 and 1.520, so it’s a narrow distribution band. Outliers appear in the tails in both directions.
Percentage of Si: This distribution is somewhat bell shaped with tails visible in both directions, and values centered around 73. This is the highest percentage of any of the elements, which makes sense knowing that Si is silicon. Values extend from just below 70 to just over 75. (The numeric summary sketched below confirms these centers and ranges.)
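To double-check the centers, ranges, and zero-heavy predictors described above, we can run a quick numeric summary on the same subset (a small sketch using the glass_predictors object created earlier).

# Five-number summary plus mean for each predictor, to cross-check the histograms
summary(glass_predictors)

# How many samples record exactly 0% of an element (relevant for Ba, Fe, K, and Mg)
colSums(glass_predictors == 0)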
We can also use the skewness function from the e1071 package to compute the skewness across columns.
skewness_values <- apply(glass_predictors, 2, skewness)
skewness_values

         RI          Na          Mg          Al          Si           K          Ca 
  1.6027151   0.4478343  -1.1364523   0.8946104  -0.7202392   6.4600889   2.0184463 
         Ba          Fe 
  3.3686800   1.7298107 
corrplot(cor(glass_predictors),
method = "color",
type = "lower",
addCoef.col = "black",
diag = FALSE)

The strongest positive correlation is between Ca and RI (the refractive index), at 0.81. The most negative correlation is between Si and RI, at -0.54, noticeably weaker in magnitude than the Ca-RI relationship. Other positive correlations include Fe and Ca (0.12), Al and K (0.33), Al and Ba (0.48), Na and Al (0.16), Na and Ba (0.33), and Fe and RI (0.14). The remaining correlations range from roughly 0 (Ba and RI) to the aforementioned -0.54 (Si and RI).
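If we wanted to act on these correlations, caret’s findCorrelation function can suggest which predictors to drop at a given pairwise-correlation cutoff. The 0.75 threshold below is just an illustrative choice, not something required by the exercise.

# Flag predictors that caret would suggest removing due to high pairwise correlation;
# the 0.75 cutoff is an arbitrary illustrative choice
high_corr <- findCorrelation(cor(glass_predictors), cutoff = 0.75)
names(glass_predictors)[high_corr]

Given the 0.81 correlation between Ca and RI, we would expect one of that pair to be flagged.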
b. Do there appear to be any outliers in the data? Are any predictors skewed?
glass_predictors |>
pivot_longer(everything(), names_to = "variable", values_to = "value") |>
ggplot(aes(x = value)) +
geom_boxplot() +
facet_wrap(~ variable, scales = "free", ncol = 3) +
labs(title = "Distributions of Glass Predictor Variables",
x = "value",
y = "Count")Yes, there appear to be outliers in the data for several of the predictors. These outliers are visible in the histograms and in the boxplots above. The ones that stand out to me as having notable outliers is for K, where there are many outliers and a significant gap between the second-highest bin at ~2.4 and the highest bin at over 6. K and Ba, Ca, and Fe are all skewed as noted in the overview of distributions above. The only predictor variable without outliers is Mg.
c. Are there any relevant transformations of one or more predictors that might improve the classification model?
Based on our understanding of the data, it seems that a Box-Cox transformation might improve the classification model. Box-Cox transformations can be applied to the predictor variables to resolve skewness and stabilize variance, as noted in Kuhn & Johnson on pages 32-33, and reducing skewness also lessens the influence of the extreme values we observed.
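As a small illustration before transforming everything at once, caret’s BoxCoxTrans function can estimate a lambda for a single predictor and apply it. Ca is used here purely as an example: Box-Cox requires strictly positive values, so it would not apply directly to Ba, Fe, K, or Mg, which contain zeros.

# Estimate and apply a Box-Cox transformation for Ca alone (Ca is strictly positive)
ca_bc <- BoxCoxTrans(glass_predictors$Ca)
ca_transformed <- predict(ca_bc, glass_predictors$Ca)
skewness(ca_transformed)  # expected to be much closer to 0 than the original 2.02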
The preProcess function from caret can apply the appropriate Box-Cox transformation to the set of predictor variables from the Glass Identification Database, while also centering, scaling, and imputing values, if that’s the path we wanted to continue down.
transformed_glass_predictors <-
preProcess(glass_predictors, method = c("BoxCox", "center", "scale"))
head(transformed_glass_predictors)

$dim
[1] 214 9
$bc
$bc$RI
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.511 1.517 1.518 1.518 1.519 1.534
Largest/Smallest: 1.02
Sample Skewness: 1.6
Estimated Lambda: -2
$bc$Na
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.73 12.91 13.30 13.41 13.82 17.38
Largest/Smallest: 1.62
Sample Skewness: 0.448
Estimated Lambda: -0.1
With fudge factor, Lambda = 0 will be used for transformations
$bc$Al
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.290 1.190 1.360 1.445 1.630 3.500
Largest/Smallest: 12.1
Sample Skewness: 0.895
Estimated Lambda: 0.5
$bc$Si
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
69.81 72.28 72.79 72.65 73.09 75.41
Largest/Smallest: 1.08
Sample Skewness: -0.72
Estimated Lambda: 2
$bc$Ca
Box-Cox Transformation
214 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.430 8.240 8.600 8.957 9.172 16.190
Largest/Smallest: 2.98
Sample Skewness: 2.02
Estimated Lambda: -1.1
$yj
NULL
$et
NULL
$invHyperbolicSine
NULL
$mean
RI Na Mg Al Si K
2.831185e-01 2.594009e+00 2.684533e+00 3.684509e-01 2.638878e+03 4.970561e-01
Ca Ba Fe
8.256036e-01 1.750467e-01 5.700935e-02
transformed_glass_predictors

Created from 214 samples and 9 variables
Pre-processing:
- Box-Cox transformation (5)
- centered (9)
- ignored (0)
- scaled (9)
Lambda estimates for Box-Cox transformation:
-2, -0.1, 0.5, 2, -1.1
We can see that the Box-Cox transformation is recommended for 5 of the 9 predictor variables: RI, Na, Al, Si, and Ca. This transformation is used to reduce skewness and stabilize variance, making the data more normally distributed, which would ideally improve predictive performance. The estimated lambdas for the five transformed predictors are listed above: RI (refractive index) -2; Na -0.1 (with the note that “with fudge factor, Lambda = 0 will be used for transformations”, i.e., a natural log transformation); Al 0.5 (a square-root transformation); Si 2; and Ca -1.1.
Additionally, the data has been centered (the mean subtracted so that the mean of each predictor variable is 0) and scaled (each predictor variable divided by its standard deviation to standardize the variance).
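One thing worth noting: preProcess only estimates the transformation parameters; it does not modify the data itself. To obtain the transformed predictors we would call predict on the preProcess object, roughly as sketched here.

# Apply the estimated Box-Cox, centering, and scaling steps to the predictors
glass_transformed <- predict(transformed_glass_predictors, newdata = glass_predictors)

# Skewness of the transformed predictors; RI, Na, Al, Si, and Ca should move closer to 0
apply(glass_transformed, 2, skewness)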
3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
## See ?Soybean for details

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
str(Soybean)

'data.frame': 683 obs. of 36 variables:
$ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
$ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
$ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
$ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
$ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
$ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
$ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
$ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
$ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
$ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
$ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
$ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
$ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
$ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
$ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
$ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
$ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
$ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
We’ll create a function to build bar plots for the categorical variables so we can look at their frequency distributions. We’ll remove the Class variable since it’s not a predictor (Class is the outcome variable), and we’ll color the bars gold.
categorical_variables <- setdiff(names(Filter(is.factor, Soybean)), "Class")
plot_categorical_distribution <- function(data, variable) {
ggplot(data, aes(x = .data[[variable]])) +
geom_bar(fill = "gold") +
theme_minimal() +
labs(title = paste("Frequency Distribution of", variable),
x = variable, y = "Count") +
theme(axis.text.x = element_text(hjust = 1))
}
for (var in categorical_variables) {
print(plot_categorical_distribution(Soybean, var))
}
To identify distributions that are problematic for modeling, we’ll check for near-zero-variance predictors (categories that have very little variation).
near_zero_variance_Soybean <- nearZeroVar(Soybean, saveMetrics = TRUE)
near_zero_variance_Soybean[near_zero_variance_Soybean$nzv, ]

Filtering on the nzv flag returns the predictors whose distributions are degenerate in the sense discussed in the chapter, i.e., dominated almost entirely by a single category. With the default thresholds this flags leaf.mild, mycelium, and sclerotia, each of which has one level that accounts for the vast majority of observations.
Let’s check for constant predictors (single-level predictors):
constant_predictors_soybean <-
sapply(Soybean, function(x) length(unique(x)) == 1)
names(Soybean)[constant_predictors_soybean]

character(0)
This means that no constant predictors were found: all the categorical variables in Soybean have at least 2 distinct levels.
Let’s look more closely at the missing values, which we noticed earlier in the frequency plots.
colSums(is.na(Soybean))

Class date plant.stand precip temp
0 1 36 38 30
hail crop.hist area.dam sever seed.tmt
121 16 1 121 121
germ plant.growth leaves leaf.halo leaf.marg
112 16 0 84 84
leaf.size leaf.shread leaf.malf leaf.mild stem
84 100 84 108 16
lodging stem.cankers canker.lesion fruiting.bodies ext.decay
121 38 38 106 38
mycelium int.discolor sclerotia fruit.pods fruit.spots
38 38 38 84 106
seed mold.growth seed.discolor seed.size shriveling
92 92 106 92 106
roots
31
This tells us how many values are missing in each column. We can see there are a number of columns where 100 or more observations are missing, including leaf.mild, lodging, seed.discolor, sever, seed.tmt, germ, leaf.shread, shriveling, fruiting.bodies, hail, and fruit.spots.
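To make the worst-affected predictors easier to see, the same counts can be sorted and expressed as a share of the 683 rows (a quick sketch).

# Missing-value counts per predictor, sorted, as a percentage of the 683 rows
missing_share <- sort(colSums(is.na(Soybean)), decreasing = TRUE) / nrow(Soybean)
round(head(missing_share, 12) * 100, 1)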
How many rows have at least one missing value?
sum(complete.cases(Soybean) == FALSE)

[1] 121
Out of 683 observations (rows), 121 rows have at least one missing value.
total_missing <- sum(is.na(Soybean))
total_values <- prod(dim(Soybean))
missing_percentage <- (total_missing / total_values) * 100
print(paste("Overall missing data:", round(missing_percentage, 2), "%"))

[1] "Overall missing data: 9.5 %"
The reason we see 9.5% here rather than the roughly 18% quoted in the exercise is that we are measuring the total number of missing cells divided by the total number of cells in the data table. The 18% figure comes from the proportion of rows that have at least one missing value (121, noted above) out of all 683 rows: 121/683 ≈ 17.7%.
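Both figures can be reproduced directly, which supports that reading.

# Share of individual cells that are missing (~9.5%)
mean(is.na(Soybean)) * 100

# Share of rows with at least one missing value (~17.7%)
mean(!complete.cases(Soybean)) * 100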
c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
We won’t eliminate predictors; instead we’ll use the K-Nearest Neighbors (KNN) imputation approach discussed in the book. First we’ll set up the KNN imputation, then apply it.
To do that, we first have to convert the categorical variables to dummy variables so that we have numeric versions of them, since KNN imputation operates on numeric data.
But before that, we need to separate Class from the predictor variables. As advised in the book, we’ll store it separately.
Soybean_class <- Soybean$Class
Soybean_no_class <- Soybean |> select(-Class)

Now that Class is removed we can safely apply dummy encoding.
dummies <- dummyVars(~ ., data = Soybean_no_class, fullRank = TRUE)
Soybean_numeric <- predict(dummies, newdata = Soybean_no_class) |>
as.data.frame()

Then we’ll apply KNN imputation on the numeric dataset.
Soybean_imputed <- preProcess(Soybean_numeric, method = "knnImpute")
Soybean_filled <- predict(Soybean_imputed, newdata = Soybean_numeric)

Now that we’ve imputed the predictors, we’ll add Class back.
Soybean_final <- Soybean_filled |>
mutate(Class = Soybean_class)

Let’s confirm this worked - are there any missing values?
colSums(is.na(Soybean_final))

date.1 date.2 date.3 date.4
0 0 0 0
date.5 date.6 plant.stand.L precip.L
0 0 0 0
precip.Q temp.L temp.Q hail.1
0 0 0 0
crop.hist.1 crop.hist.2 crop.hist.3 area.dam.1
0 0 0 0
area.dam.2 area.dam.3 sever.1 sever.2
0 0 0 0
seed.tmt.1 seed.tmt.2 germ.L germ.Q
0 0 0 0
plant.growth.1 leaves.1 leaf.halo.1 leaf.halo.2
0 0 0 0
leaf.marg.1 leaf.marg.2 leaf.size.L leaf.size.Q
0 0 0 0
leaf.shread.1 leaf.malf.1 leaf.mild.1 leaf.mild.2
0 0 0 0
stem.1 lodging.1 stem.cankers.1 stem.cankers.2
0 0 0 0
stem.cankers.3 canker.lesion.1 canker.lesion.2 canker.lesion.3
0 0 0 0
fruiting.bodies.1 ext.decay.1 ext.decay.2 mycelium.1
0 0 0 0
int.discolor.1 int.discolor.2 sclerotia.1 fruit.pods.1
0 0 0 0
fruit.pods.2 fruit.pods.3 fruit.spots.1 fruit.spots.2
0 0 0 0
fruit.spots.4 seed.1 mold.growth.1 seed.discolor.1
0 0 0 0
seed.size.1 shriveling.1 roots.1 roots.2
0 0 0 0
Class
0
No more missing values! Excellent. Note that the imputed predictors are now on a standardized scale: preProcess centers and scales the data as part of knnImpute, which is why the dummy variables in the structure below are no longer simple 0/1 values.

str(Soybean_final)

'data.frame': 683 obs. of 65 variables:
$ date.1 : num -0.351 -0.351 -0.351 -0.351 -0.351 ...
$ date.2 : num -0.397 -0.397 -0.397 -0.397 -0.397 ...
$ date.3 : num -0.457 -0.457 2.185 2.185 -0.457 ...
$ date.4 : num -0.487 2.049 -0.487 -0.487 -0.487 ...
$ date.5 : num -0.528 -0.528 -0.528 -0.528 -0.528 ...
$ date.6 : num 2.56 -0.39 -0.39 -0.39 2.56 ...
$ plant.stand.L : num -0.909 -0.909 -0.909 -0.909 -0.909 ...
$ precip.L : num 0.587 0.587 0.587 0.587 0.587 ...
$ precip.Q : num 0.458 0.458 0.458 0.458 0.458 ...
$ temp.L : num -0.29 -0.29 -0.29 -0.29 -0.29 ...
$ temp.Q : num -0.863 -0.863 -0.863 -0.863 -0.863 ...
$ hail.1 : num -0.54 -0.54 -0.54 -0.54 -0.54 ...
$ crop.hist.1 : num 1.743 -0.573 1.743 1.743 -0.573 ...
$ crop.hist.2 : num -0.699 1.429 -0.699 -0.699 1.429 ...
$ crop.hist.3 : num -0.696 -0.696 -0.696 -0.696 -0.696 ...
$ area.dam.1 : num 1.415 -0.706 -0.706 -0.706 -0.706 ...
$ area.dam.2 : num -0.519 -0.519 -0.519 -0.519 -0.519 ...
$ area.dam.3 : num -0.614 -0.614 -0.614 -0.614 -0.614 ...
$ sever.1 : num 0.863 -1.157 -1.157 -1.157 0.863 ...
$ sever.2 : num -0.295 3.387 3.387 3.387 -0.295 ...
$ seed.tmt.1 : num -0.807 1.236 1.236 -0.807 -0.807 ...
$ seed.tmt.2 : num -0.257 -0.257 -0.257 -0.257 -0.257 ...
$ germ.L : num -1.326 -0.062 1.202 -0.062 1.202 ...
$ germ.Q : num 0.771 -1.295 0.771 -1.295 0.771 ...
$ plant.growth.1 : num 1.4 1.4 1.4 1.4 1.4 ...
$ leaves.1 : num 0.356 0.356 0.356 0.356 0.356 ...
$ leaf.halo.1 : num -0.253 -0.253 -0.253 -0.253 -0.253 ...
$ leaf.halo.2 : num -1.15 -1.15 -1.15 -1.15 -1.15 ...
$ leaf.marg.1 : num -0.19 -0.19 -0.19 -0.19 -0.19 ...
$ leaf.marg.2 : num 1.31 1.31 1.31 1.31 1.31 ...
$ leaf.size.L : num 1.17 1.17 1.17 1.17 1.17 ...
$ leaf.size.Q : num 1.1 1.1 1.1 1.1 1.1 ...
$ leaf.shread.1 : num -0.444 -0.444 -0.444 -0.444 -0.444 ...
$ leaf.malf.1 : num -0.285 -0.285 -0.285 -0.285 -0.285 ...
$ leaf.mild.1 : num -0.19 -0.19 -0.19 -0.19 -0.19 ...
$ leaf.mild.2 : num -0.19 -0.19 -0.19 -0.19 -0.19 ...
$ stem.1 : num 0.893 0.893 0.893 0.893 0.893 ...
$ lodging.1 : num 3.516 -0.284 -0.284 -0.284 -0.284 ...
$ stem.cankers.1 : num -0.253 -0.253 -0.253 -0.253 -0.253 ...
$ stem.cankers.2 : num -0.243 -0.243 -0.243 -0.243 -0.243 ...
$ stem.cankers.3 : num 1.54 1.54 1.54 1.54 1.54 ...
$ canker.lesion.1 : num 2.6 2.6 -0.384 -0.384 2.6 ...
$ canker.lesion.2 : num -0.615 -0.615 -0.615 -0.615 -0.615 ...
$ canker.lesion.3 : num -0.335 -0.335 -0.335 -0.335 -0.335 ...
$ fruiting.bodies.1: num 2.13 2.13 2.13 2.13 2.13 ...
$ ext.decay.1 : num 1.94 1.94 1.94 1.94 1.94 ...
$ ext.decay.2 : num -0.143 -0.143 -0.143 -0.143 -0.143 ...
$ mycelium.1 : num -0.0968 -0.0968 -0.0968 -0.0968 -0.0968 ...
$ int.discolor.1 : num -0.27 -0.27 -0.27 -0.27 -0.27 ...
$ int.discolor.2 : num -0.179 -0.179 -0.179 -0.179 -0.179 ...
$ sclerotia.1 : num -0.179 -0.179 -0.179 -0.179 -0.179 ...
$ fruit.pods.1 : num -0.526 -0.526 -0.526 -0.526 -0.526 ...
$ fruit.pods.2 : num -0.155 -0.155 -0.155 -0.155 -0.155 ...
$ fruit.pods.3 : num -0.295 -0.295 -0.295 -0.295 -0.295 ...
$ fruit.spots.1 : num -0.386 -0.386 -0.386 -0.386 -0.386 ...
$ fruit.spots.2 : num -0.331 -0.331 -0.331 -0.331 -0.331 ...
$ fruit.spots.4 : num 2.18 2.18 2.18 2.18 2.18 ...
$ seed.1 : num -0.491 -0.491 -0.491 -0.491 -0.491 ...
$ mold.growth.1 : num -0.357 -0.357 -0.357 -0.357 -0.357 ...
$ seed.discolor.1 : num -0.353 -0.353 -0.353 -0.353 -0.353 ...
$ seed.size.1 : num -0.333 -0.333 -0.333 -0.333 -0.333 ...
$ shriveling.1 : num -0.265 -0.265 -0.265 -0.265 -0.265 ...
$ roots.1 : num -0.39 -0.39 -0.39 -0.39 -0.39 ...
$ roots.2 : num -0.153 -0.153 -0.153 -0.153 -0.153 ...
$ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...