Assignment 4: Data Preprocessing/Overfitting

Author

Amanda Rose Knudsen

Published

March 1, 2025

Assignment 4: Do exercises 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Link to Applied Predictive Modeling for reference.

library(tidyverse)
library(caret)
library(corrplot)
library(e1071)
library(lattice)
library(car)
library(RANN)

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)
'data.frame':   214 obs. of  10 variables:
 $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
 $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
 $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
 $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
 $ Si  : num  71.8 72.7 73 72.6 73.1 ...
 $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
 $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
 $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
 $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Based on the above, we can see that Type is not a predictor: the exercise lists nine predictors (the refractive index and the percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe), and Type is the class label. So let’s create a subset excluding Type to use for the following questions. We won’t create separate training and test datasets, as we might in other circumstances, because this set of exercises is for practicing data pre-processing.

glass_predictors <- Glass |> select(-Type) 

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

glass_predictors |> 
  pivot_longer(everything(), names_to = "variable", values_to = "value") |> 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  labs(title = "Distributions of Glass Predictor Variables",
       x = "value",
       y = "Count")

Here’s what we can tell with the above histograms of the glass predictor variables, starting from the top row:

Percentage of Al: The distribution of Al is roughly symmetric with a slight right skew. We can see what appear to be some outliers on the right side of the histogram, which contribute to that skew. Most of the samples are between 1% and 2% Al, and the data are centered around 1.3-1.4%. There are no observations recorded at 0%, so this element is present in every sample in the dataset. The highest value recorded for Al appears just above 3.5%.

Percentage of Ba: The distribution of Ba is extremely right skewed in this histogram. The majority of values are at 0% and there is a long right tail with values extending past 3.

Percentage of Ca: The distribution of Ca is relatively bell-shaped and centered between 8 and 9. There are more outliers on the right side of the histogram so we can call this slightly right-skewed. There are no values observable below 5.

Percentage of Fe: The distribution of Fe, like Ba, is extremely right skewed in this histogram. Also like Ba, the majority of values are at 0%, and there is a long right tail. For Fe, the values extend just past 0.5.

Percentage of K: This distribution has most of its values near or at 0. While there are some values at exactly 0, they are not the majority; the tallest bin appears around 0.7. The right tail trails just past 6, so this distribution is right skewed.

Percentage of Mg: This distribution is bimodal: it has a substantial number of data points at 0 and a large cluster between 3.4 and 3.8. The largest visible value appears at about 4.5%.

Percentage of Na: The distribution of Na is slightly right skewed but mostly centered around 13 where we see the largest number of values. There are visible tails in both directions and it is somewhat symmetric.

RI (Refractive Index): This distribution appears somewhat bimodal and somewhat symmetric, though when we called for fewer bins it looked more bell-shaped rather than bimodal. The two ‘peaks’ we see are both between 1.515 and 1.520, so it’s a narrow distribution band. Outliers appear in the tails in both directions.

Percentage of Si: This distribution is somewhat bell shaped with tails visible in both directions, and values centered around 73. This is the highest percentage of any of the elements, which makes sense knowing that Si is silicon. Values extend from just below 70 to just over 75.
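Several of these distributions spike at zero. As a quick numeric check of those zero-heavy predictors (a small sketch using the glass_predictors subset created above), we can compute the proportion of exactly-zero values per column:

# Proportion of samples recorded as exactly 0% for each predictor
round(colMeans(glass_predictors == 0), 3)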

We can also use the skewness function from the e1071 package to compute the skewness across columns.

skewness_values <- apply(glass_predictors, 2, skewness)

skewness_values
        RI         Na         Mg         Al         Si          K         Ca 
 1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
        Ba         Fe 
 3.3686800  1.7298107 
corrplot(cor(glass_predictors),
  method = "color",
  type = "lower",
  addCoef.col = "black",
  diag = FALSE) 

The strongest positive correlation is between Ca and RI (the refractive index), at 0.81. The most negative correlation is between Si and RI, at -0.54, considerably weaker in magnitude than the Ca-RI correlation. Other positive correlations include Fe and Ca (0.12), Al and K (0.33), Al and Ba (0.48), Na and Al (0.16), Na and Ba (0.33), and Fe and RI (0.14). The remaining correlations range from roughly 0 (Ba and RI) down to the aforementioned -0.54 (Si and RI).
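If we wanted to flag highly correlated predictor pairs programmatically rather than by reading the plot, caret's findCorrelation function can do so; a minimal sketch (the 0.75 cutoff is an arbitrary illustrative choice):

# Columns recommended for removal because of high pairwise correlation
high_corr <- findCorrelation(cor(glass_predictors), cutoff = 0.75)
names(glass_predictors)[high_corr]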

b. Do there appear to be any outliers in the data? Are any predictors skewed?

glass_predictors |> 
  pivot_longer(everything(), names_to = "variable", values_to = "value") |> 
  ggplot(aes(x = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  labs(title = "Boxplots of Glass Predictor Variables",
       x = "value",
       y = NULL)

Yes, there appear to be outliers in the data for several of the predictors. These outliers are visible in the histograms and in the boxplots above. The predictor that stands out most is K, which has many outliers and a sizeable gap between the second-highest values around 2.4 and the maximum above 6. Ba, Ca, Fe, and K are all skewed, as noted in the overview of distributions above. The only predictor variable without outliers is Mg.
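To back up that visual impression with numbers, here is a rough sketch that counts values outside the 1.5 * IQR whiskers used by the boxplots (the count_outliers helper is just illustrative, and the 1.5 * IQR rule is only one convention for flagging outliers):

count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  # Values beyond the boxplot whiskers on either side
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(glass_predictors, count_outliers)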

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

Based on our understanding of the data, a Box-Cox transformation might improve the classification model. Box-Cox transformations can be applied to the skewed predictor variables to reduce skewness and stabilize variance, lessening the influence of extreme values, as discussed in Kuhn & Johnson on pages 32-33.
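As an illustration before transforming everything at once, caret's BoxCoxTrans function estimates the transformation for a single predictor; a minimal sketch for Ca (the ca_bc object name is just for illustration):

# Estimate a Box-Cox transformation for Ca alone and compare skewness before/after
ca_bc <- BoxCoxTrans(glass_predictors$Ca)
ca_bc
skewness(glass_predictors$Ca)
skewness(predict(ca_bc, glass_predictors$Ca))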

The preProcess function from caret can apply the appropriate Box-Cox transformation to the set of predictor variables from the Glass Identification Database, while also centering, scaling, and imputing values if we wanted to continue down that path.

transformed_glass_predictors <- 
  preProcess(glass_predictors, method = c("BoxCox", "center", "scale"))

head(transformed_glass_predictors)
$dim
[1] 214   9

$bc
$bc$RI
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.511   1.517   1.518   1.518   1.519   1.534 

Largest/Smallest: 1.02 
Sample Skewness: 1.6 

Estimated Lambda: -2 


$bc$Na
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.73   12.91   13.30   13.41   13.82   17.38 

Largest/Smallest: 1.62 
Sample Skewness: 0.448 

Estimated Lambda: -0.1 
With fudge factor, Lambda = 0 will be used for transformations


$bc$Al
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.290   1.190   1.360   1.445   1.630   3.500 

Largest/Smallest: 12.1 
Sample Skewness: 0.895 

Estimated Lambda: 0.5 


$bc$Si
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  69.81   72.28   72.79   72.65   73.09   75.41 

Largest/Smallest: 1.08 
Sample Skewness: -0.72 

Estimated Lambda: 2 


$bc$Ca
Box-Cox Transformation

214 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.430   8.240   8.600   8.957   9.172  16.190 

Largest/Smallest: 2.98 
Sample Skewness: 2.02 

Estimated Lambda: -1.1 



$yj
NULL

$et
NULL

$invHyperbolicSine
NULL

$mean
          RI           Na           Mg           Al           Si            K 
2.831185e-01 2.594009e+00 2.684533e+00 3.684509e-01 2.638878e+03 4.970561e-01 
          Ca           Ba           Fe 
8.256036e-01 1.750467e-01 5.700935e-02 
transformed_glass_predictors
Created from 214 samples and 9 variables

Pre-processing:
  - Box-Cox transformation (5)
  - centered (9)
  - ignored (0)
  - scaled (9)

Lambda estimates for Box-Cox transformation:
-2, -0.1, 0.5, 2, -1.1

We can see that the Box-Cox transformation is recommended for 5 of the 9 predictor variables: RI, Na, Al, Si, and Ca. This transformation is used to reduce skewness and stabilize variance, making the data more normally distributed, which would ideally improve predictive performance. The estimated lambdas for the 5 transformed predictors are shown above: RI (refractive index) has an estimated lambda of -2; Na has an estimated lambda of -0.1 (with the note that “with fudge factor, Lambda = 0 will be used for transformations”, i.e. a natural log transformation); Al has an estimated lambda of 0.5 (a square-root transformation); Si has an estimated lambda of 2; and Ca has an estimated lambda of -1.1.

Additionally, the data have been centered (the mean subtracted so that each predictor variable has mean 0) and scaled (each predictor variable divided by its standard deviation to standardize the variance).
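Note that preProcess only estimates these transformations; to obtain the transformed data we would call predict on the preProcess object. A minimal sketch (the glass_transformed name is just illustrative):

# Apply the Box-Cox / center / scale transformations and re-check skewness
glass_transformed <- predict(transformed_glass_predictors, glass_predictors)
apply(glass_transformed, 2, skewness)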

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
## See ?Soybean for details

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

str(Soybean)
'data.frame':   683 obs. of  36 variables:
 $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
 $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
 $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
 $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
 $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
 $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
 $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
 $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
 $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
 $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
 $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
 $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
 $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
 $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
 $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
 $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

We’ll create a function that draws a bar plot for each categorical variable so we can look at their frequency distributions. We’ll exclude the Class variable since it’s not a predictor: Class is the outcome variable. We’ll also fill the bars in gold.

categorical_variables <- setdiff(names(Filter(is.factor, Soybean)), "Class")

plot_categorical_distribution <- function(data, variable) {
  ggplot(data, aes(x = .data[[variable]])) +
    geom_bar(fill = "gold") +
    theme_minimal() +
    labs(title = paste("Frequency Distribution of", variable), 
         x = variable, y = "Count") +
    theme(axis.text.x = element_text(hjust = 1))
}

for (var in categorical_variables) {
  print(plot_categorical_distribution(Soybean, var))
}

To identify distributions that are problematic for modeling, we’ll check for near-zero-variance predictors (predictors whose categories show very little variation).

near_zero_variance_Soybean <- nearZeroVar(Soybean, saveMetrics =TRUE)
near_zero_variance_Soybean[near_zero_variance_Soybean$nzv, ]

Filtering on the nzv column lists the predictors caret flags as near-zero variance. In the Soybean data, predictors such as leaf.mild, mycelium, and sclerotia are dominated by a single level, so their distributions can be considered degenerate in the sense discussed in the chapter.
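To see why predictors like these get flagged, we can tabulate their levels directly (a quick check; useNA = "ifany" keeps the missing values visible):

table(Soybean$mycelium, useNA = "ifany")
table(Soybean$sclerotia, useNA = "ifany")
table(Soybean$leaf.mild, useNA = "ifany")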

Let’s check for constant predictors (single-level predictors):

constant_predictors_soybean <- 
  sapply(Soybean, function(x) length(unique(x)) == 1)
names(Soybean)[constant_predictors_soybean]
character(0)

This means that no constant predictors were found: all the categorical variables in Soybean have at least 2 distinct levels.

b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Let’s check to see more about those missing values, which we noticed earlier in the frequency plots.

colSums(is.na(Soybean))
          Class            date     plant.stand          precip            temp 
              0               1              36              38              30 
           hail       crop.hist        area.dam           sever        seed.tmt 
            121              16               1             121             121 
           germ    plant.growth          leaves       leaf.halo       leaf.marg 
            112              16               0              84              84 
      leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
             84             100              84             108              16 
        lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
            121              38              38             106              38 
       mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
             38              38              38              84             106 
           seed     mold.growth   seed.discolor       seed.size      shriveling 
             92              92             106              92             106 
          roots 
             31 

This tells us how many values are missing in each column. We can see there are a number of columns with 100 or more missing values, including leaf.mild, lodging, seed.discolor, sever, seed.tmt, germ, leaf.shread, shriveling, fruiting.bodies, hail, and fruit.spots.

How many rows have at least one missing value?

sum(complete.cases(Soybean) == FALSE)
[1] 121

Out of 683 observations (rows), 121 rows have at least one missing value.
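Part of the question is whether the pattern of missing data is related to the classes. A quick check (base R, counting the incomplete rows by Class) shows how those 121 rows are distributed across the outcome labels:

# Which classes do the incomplete rows belong to?
incomplete_rows <- !complete.cases(Soybean)
table(Soybean$Class[incomplete_rows])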

total_missing <- sum(is.na(Soybean))
total_values <- prod(dim(Soybean))
missing_percentage <- (total_missing / total_values) * 100

print(paste("Overall missing data:", round(missing_percentage, 2), "%"))
[1] "Overall missing data: 9.5 %"

The reason we see 9.5% here rather than the roughly 18% mentioned in the question is likely that we are measuring the total number of missing cells divided by the total number of cells in the data table. The ~18% figure instead comes from the proportion of rows that have at least one missing value out of all 683 rows: 121/683 ≈ 17.7%.
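A quick check of that arithmetic:

# Proportion of rows with at least one missing value
mean(!complete.cases(Soybean))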

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

We won’t eliminate predictors; instead we’ll use K-Nearest Neighbors (KNN) imputation. First we’ll set up the imputation with preProcess() and then apply it with predict().

To do that, we first have to convert the categorical variables to dummy variables, since KNN imputation needs numeric inputs to compute distances between samples.

But before that, we need to separate the Class from the predictor variables. As advised in the book we’ll store it separately.

Soybean_class <- Soybean$Class  

Soybean_no_class <- Soybean |> select(-Class)

Now that Class is removed we can safely apply dummy encoding.

dummies <- dummyVars(~ ., data = Soybean_no_class, fullRank = TRUE)
Soybean_numeric <- predict(dummies, newdata = Soybean_no_class) |> 
  as.data.frame()

Then we’ll apply KNN imputation on the numeric dataset.

Soybean_imputed <- preProcess(Soybean_numeric, method = "knnImpute")

Soybean_filled <- predict(Soybean_imputed, newdata = Soybean_numeric)

Now that we’ve imputed the predictors, we’ll add Class back. (Note that caret’s knnImpute also centers and scales the predictors, which is why the values shown below are standardized rather than 0/1.)

Soybean_final <- Soybean_filled |> 
  mutate(Class = Soybean_class)

Let’s confirm this worked - are there any missing values?

colSums(is.na(Soybean_final))
           date.1            date.2            date.3            date.4 
                0                 0                 0                 0 
           date.5            date.6     plant.stand.L          precip.L 
                0                 0                 0                 0 
         precip.Q            temp.L            temp.Q            hail.1 
                0                 0                 0                 0 
      crop.hist.1       crop.hist.2       crop.hist.3        area.dam.1 
                0                 0                 0                 0 
       area.dam.2        area.dam.3           sever.1           sever.2 
                0                 0                 0                 0 
       seed.tmt.1        seed.tmt.2            germ.L            germ.Q 
                0                 0                 0                 0 
   plant.growth.1          leaves.1       leaf.halo.1       leaf.halo.2 
                0                 0                 0                 0 
      leaf.marg.1       leaf.marg.2       leaf.size.L       leaf.size.Q 
                0                 0                 0                 0 
    leaf.shread.1       leaf.malf.1       leaf.mild.1       leaf.mild.2 
                0                 0                 0                 0 
           stem.1         lodging.1    stem.cankers.1    stem.cankers.2 
                0                 0                 0                 0 
   stem.cankers.3   canker.lesion.1   canker.lesion.2   canker.lesion.3 
                0                 0                 0                 0 
fruiting.bodies.1       ext.decay.1       ext.decay.2        mycelium.1 
                0                 0                 0                 0 
   int.discolor.1    int.discolor.2       sclerotia.1      fruit.pods.1 
                0                 0                 0                 0 
     fruit.pods.2      fruit.pods.3     fruit.spots.1     fruit.spots.2 
                0                 0                 0                 0 
    fruit.spots.4            seed.1     mold.growth.1   seed.discolor.1 
                0                 0                 0                 0 
      seed.size.1      shriveling.1           roots.1           roots.2 
                0                 0                 0                 0 
            Class 
                0 

No more missing values! Excellent.

str(Soybean_final)
'data.frame':   683 obs. of  65 variables:
 $ date.1           : num  -0.351 -0.351 -0.351 -0.351 -0.351 ...
 $ date.2           : num  -0.397 -0.397 -0.397 -0.397 -0.397 ...
 $ date.3           : num  -0.457 -0.457 2.185 2.185 -0.457 ...
 $ date.4           : num  -0.487 2.049 -0.487 -0.487 -0.487 ...
 $ date.5           : num  -0.528 -0.528 -0.528 -0.528 -0.528 ...
 $ date.6           : num  2.56 -0.39 -0.39 -0.39 2.56 ...
 $ plant.stand.L    : num  -0.909 -0.909 -0.909 -0.909 -0.909 ...
 $ precip.L         : num  0.587 0.587 0.587 0.587 0.587 ...
 $ precip.Q         : num  0.458 0.458 0.458 0.458 0.458 ...
 $ temp.L           : num  -0.29 -0.29 -0.29 -0.29 -0.29 ...
 $ temp.Q           : num  -0.863 -0.863 -0.863 -0.863 -0.863 ...
 $ hail.1           : num  -0.54 -0.54 -0.54 -0.54 -0.54 ...
 $ crop.hist.1      : num  1.743 -0.573 1.743 1.743 -0.573 ...
 $ crop.hist.2      : num  -0.699 1.429 -0.699 -0.699 1.429 ...
 $ crop.hist.3      : num  -0.696 -0.696 -0.696 -0.696 -0.696 ...
 $ area.dam.1       : num  1.415 -0.706 -0.706 -0.706 -0.706 ...
 $ area.dam.2       : num  -0.519 -0.519 -0.519 -0.519 -0.519 ...
 $ area.dam.3       : num  -0.614 -0.614 -0.614 -0.614 -0.614 ...
 $ sever.1          : num  0.863 -1.157 -1.157 -1.157 0.863 ...
 $ sever.2          : num  -0.295 3.387 3.387 3.387 -0.295 ...
 $ seed.tmt.1       : num  -0.807 1.236 1.236 -0.807 -0.807 ...
 $ seed.tmt.2       : num  -0.257 -0.257 -0.257 -0.257 -0.257 ...
 $ germ.L           : num  -1.326 -0.062 1.202 -0.062 1.202 ...
 $ germ.Q           : num  0.771 -1.295 0.771 -1.295 0.771 ...
 $ plant.growth.1   : num  1.4 1.4 1.4 1.4 1.4 ...
 $ leaves.1         : num  0.356 0.356 0.356 0.356 0.356 ...
 $ leaf.halo.1      : num  -0.253 -0.253 -0.253 -0.253 -0.253 ...
 $ leaf.halo.2      : num  -1.15 -1.15 -1.15 -1.15 -1.15 ...
 $ leaf.marg.1      : num  -0.19 -0.19 -0.19 -0.19 -0.19 ...
 $ leaf.marg.2      : num  1.31 1.31 1.31 1.31 1.31 ...
 $ leaf.size.L      : num  1.17 1.17 1.17 1.17 1.17 ...
 $ leaf.size.Q      : num  1.1 1.1 1.1 1.1 1.1 ...
 $ leaf.shread.1    : num  -0.444 -0.444 -0.444 -0.444 -0.444 ...
 $ leaf.malf.1      : num  -0.285 -0.285 -0.285 -0.285 -0.285 ...
 $ leaf.mild.1      : num  -0.19 -0.19 -0.19 -0.19 -0.19 ...
 $ leaf.mild.2      : num  -0.19 -0.19 -0.19 -0.19 -0.19 ...
 $ stem.1           : num  0.893 0.893 0.893 0.893 0.893 ...
 $ lodging.1        : num  3.516 -0.284 -0.284 -0.284 -0.284 ...
 $ stem.cankers.1   : num  -0.253 -0.253 -0.253 -0.253 -0.253 ...
 $ stem.cankers.2   : num  -0.243 -0.243 -0.243 -0.243 -0.243 ...
 $ stem.cankers.3   : num  1.54 1.54 1.54 1.54 1.54 ...
 $ canker.lesion.1  : num  2.6 2.6 -0.384 -0.384 2.6 ...
 $ canker.lesion.2  : num  -0.615 -0.615 -0.615 -0.615 -0.615 ...
 $ canker.lesion.3  : num  -0.335 -0.335 -0.335 -0.335 -0.335 ...
 $ fruiting.bodies.1: num  2.13 2.13 2.13 2.13 2.13 ...
 $ ext.decay.1      : num  1.94 1.94 1.94 1.94 1.94 ...
 $ ext.decay.2      : num  -0.143 -0.143 -0.143 -0.143 -0.143 ...
 $ mycelium.1       : num  -0.0968 -0.0968 -0.0968 -0.0968 -0.0968 ...
 $ int.discolor.1   : num  -0.27 -0.27 -0.27 -0.27 -0.27 ...
 $ int.discolor.2   : num  -0.179 -0.179 -0.179 -0.179 -0.179 ...
 $ sclerotia.1      : num  -0.179 -0.179 -0.179 -0.179 -0.179 ...
 $ fruit.pods.1     : num  -0.526 -0.526 -0.526 -0.526 -0.526 ...
 $ fruit.pods.2     : num  -0.155 -0.155 -0.155 -0.155 -0.155 ...
 $ fruit.pods.3     : num  -0.295 -0.295 -0.295 -0.295 -0.295 ...
 $ fruit.spots.1    : num  -0.386 -0.386 -0.386 -0.386 -0.386 ...
 $ fruit.spots.2    : num  -0.331 -0.331 -0.331 -0.331 -0.331 ...
 $ fruit.spots.4    : num  2.18 2.18 2.18 2.18 2.18 ...
 $ seed.1           : num  -0.491 -0.491 -0.491 -0.491 -0.491 ...
 $ mold.growth.1    : num  -0.357 -0.357 -0.357 -0.357 -0.357 ...
 $ seed.discolor.1  : num  -0.353 -0.353 -0.353 -0.353 -0.353 ...
 $ seed.size.1      : num  -0.333 -0.333 -0.333 -0.333 -0.333 ...
 $ shriveling.1     : num  -0.265 -0.265 -0.265 -0.265 -0.265 ...
 $ roots.1          : num  -0.39 -0.39 -0.39 -0.39 -0.39 ...
 $ roots.2          : num  -0.153 -0.153 -0.153 -0.153 -0.153 ...
 $ Class            : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...