Problem 3.1 – Glass Identification Data

Question:
The Glass data consist of 214 glass samples labeled as one of seven categories, of which six occur in the data (Type has 6 levels). There are nine predictors: refractive index (RI) and percentages of eight elements (Na, Mg, Al, Si, K, Ca, Ba, Fe).

Explore predictor distributions and relationships.
Check for outliers and skewness.
Suggest transformations.

Step 1: Load Data

library(mlbench)
library(ggplot2)
library(reshape2)

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Step 2: Visualize Distributions

# Melt to long format for plotting
glass_long <- melt(Glass, id.vars = "Type")

ggplot(glass_long, aes(x=value)) +
  geom_histogram(bins=30, fill="skyblue", color="black") +
  facet_wrap(~variable, scales="free") +
  theme_minimal()

Interpretation:
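- Fe, Ba, and K are strongly right-skewed; Fe and Ba are zero for most samples.
- RI, Na, and Ca are closer to bell-shaped, though not perfectly symmetric.
- Mg is bimodal rather than bell-shaped, with one cluster of samples at zero and another near 3.5.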
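
The visual impression of skewness can be backed by a number. The following is a minimal sketch, not part of the original analysis: `sample_skewness` and `skewness_by_column` are helper names introduced here for illustration, and the toy data frame is made up. The Glass call runs only if mlbench is installed.

```r
# Moment-based sample skewness: third central moment over the 1.5 power of
# the second central moment. Values near 0 suggest symmetry; large positive
# values indicate a long right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Skewness of every numeric column, most right-skewed first
skewness_by_column <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  sort(vapply(numeric_cols, sample_skewness, numeric(1)), decreasing = TRUE)
}

# Made-up toy data: one symmetric column, one with a single large value
toy <- data.frame(symmetric = c(1, 2, 3, 4, 5),
                  right_skewed = c(0, 0, 0, 0, 10))
toy_skew <- skewness_by_column(toy)
print(toy_skew)

# Same summary for the Glass predictors, if mlbench is available
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(round(skewness_by_column(Glass), 2))
}
```

Other estimators apply small-sample corrections (for example, the `type` argument of `e1071::skewness`), so exact values may differ slightly between packages.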

Step 3: Outliers and Skewness

ggplot(glass_long, aes(x="", y=value)) +
  geom_boxplot(fill="lightgreen") +
  facet_wrap(~variable, scales="free") +
  theme_minimal()

Interpretation:
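- Points beyond the boxplot whiskers (1.5 × IQR) appear in most predictors, most prominently in K, Ba, and Fe. For Ba, most values are zero, so every nonzero value falls beyond the whiskers.
- Mg behaves differently: its zeros form a separate cluster but fall within the lower whisker, so the boxplot does not flag them.
- Several predictors are skewed (especially Fe, Ba, K).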
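
To turn the boxplots into counts, the whisker rule can be applied directly. This is a sketch under assumptions: `count_outliers` and `outliers_by_column` are hypothetical helper names, and the toy data are made up.

```r
# Count values outside the whiskers, using base R's boxplot.stats().
# Note: boxplot.stats() uses Tukey's hinges, while ggplot2 uses quartiles,
# so counts can differ slightly from what the plots show.
count_outliers <- function(x, coef = 1.5) {
  length(boxplot.stats(x, coef = coef)$out)
}

# Apply the rule to every numeric column of a data frame
outliers_by_column <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  vapply(numeric_cols, count_outliers, integer(1))
}

# Made-up toy data: one column without extremes, one with a single extreme
toy <- data.frame(clean = c(1, 2, 3, 4, 5, 6),
                  one_extreme = c(1, 2, 3, 4, 5, 100))
toy_outliers <- outliers_by_column(toy)
print(toy_outliers)
```

Calling `outliers_by_column(Glass)` would give the per-predictor counts for the report's data. A flagged point is not necessarily an error; in zero-inflated columns the rule mostly reflects the mass at zero.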

Step 4: Suggested Transformations

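Transform skewed variables (e.g., Fe, Ba, K). These predictors contain zeros, so a plain log would produce -Inf; use log(x + 1) or a Yeo-Johnson transformation instead. Box-Cox requires strictly positive values, so it applies only to predictors without zeros.
Centering and scaling may help classifiers that are sensitive to predictor scale, such as kNN or SVM.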
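
The transformation step might look like the sketch below. The toy values are made up to mimic zero-inflated element percentages and are not taken from Glass. The choice of `log1p` is one option among several; caret's `preProcess()` offers Yeo-Johnson, centering, and scaling as an alternative.

```r
# Made-up values that mimic zero-inflated, right-skewed element percentages
toy <- data.frame(K  = c(0, 0.06, 0.48, 0.57, 6.21),
                  Ba = c(0, 0, 0, 0.09, 3.15))

# A plain log breaks on zeros
print(log(toy$Ba))          # first three entries are -Inf

# log1p(x) = log(1 + x) is defined at zero and maps 0 to 0
toy_log <- as.data.frame(lapply(toy, log1p))

# Center and scale each column (mean 0, standard deviation 1)
toy_scaled <- as.data.frame(scale(toy_log))
print(round(toy_scaled, 3))
```

A column that is mostly zeros keeps its spike at the minimum after any monotone transformation, so Ba and Fe will stay asymmetric even after this step.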

Problem 3.2 – Soybean Disease Data

Question:
The Soybean dataset has 683 observations, 35 mostly categorical predictors, and 19 disease classes.

Investigate categorical predictor distributions.
Explore missing data patterns.
Propose a strategy for handling missing data.

Step 1: Load Data

data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Step 2: Frequency Distributions

# Show frequency tables for the first five columns
# (Class, the outcome, followed by date, plant.stand, precip, and temp)
for (col in names(Soybean)[1:5]) {
  print(table(Soybean[[col]]))
}

## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20 
## 
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90 
## 
##   0   1 
## 354 293 
## 
##   0   1   2 
##  74 112 459 
## 
##   0   1   2 
##  80 374 199

Interpretation:
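- The outcome is imbalanced: class sizes range from 8 (herbicide-injury) to 92 (brown-spot).
- Some predictors have imbalanced categories; for example, level 2 of precip accounts for 459 of its 645 non-missing values.
- A few predictors appear near-degenerate (almost all observations in one level). The tables above cover only the first five columns, so this needs to be checked across all 35 predictors.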
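
A check for near-degenerate predictors across all columns could be sketched as follows. The helper names (`frequency_ratio`, `percent_unique`, `near_degenerate`) are introduced here for illustration. The cutoffs mirror the defaults of caret's `nearZeroVar()` (frequency ratio above 95/5 and fewer than 10% unique values), but this is a base R approximation, not that function.

```r
# Ratio of the most common level's count to the second most common
frequency_ratio <- function(x) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)  # table() ignores NA
  if (length(counts) < 2) return(Inf)
  counts[1] / counts[2]
}

# Percentage of distinct non-missing values relative to the number of rows
percent_unique <- function(x) {
  100 * length(unique(x[!is.na(x)])) / length(x)
}

# Flag columns where one level dominates
near_degenerate <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  ratio <- vapply(df, frequency_ratio, numeric(1))
  uniq  <- vapply(df, percent_unique, numeric(1))
  names(df)[ratio > freq_cut & uniq < unique_cut]
}

# Made-up toy data: one dominated factor, one balanced factor
toy <- data.frame(
  dominated = factor(c(rep("0", 98), rep("1", 2))),
  balanced  = factor(rep(c("0", "1"), times = 50))
)
flagged <- near_degenerate(toy)
print(flagged)
```

On the report's data this would be called as `near_degenerate(Soybean[, -1])` to leave out the Class column.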

Step 3: Missing Data

# Count missing values
missing_counts <- colSums(is.na(Soybean))
missing_counts[missing_counts > 0]

##            date     plant.stand          precip            temp            hail 
##               1              36              38              30             121 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##              16               1             121             121             112 
##    plant.growth       leaf.halo       leaf.marg       leaf.size     leaf.shread 
##              16              84              84              84             100 
##       leaf.malf       leaf.mild            stem         lodging    stem.cankers 
##              84             108              16             121              38 
##   canker.lesion fruiting.bodies       ext.decay        mycelium    int.discolor 
##              38             106              38              38              38 
##       sclerotia      fruit.pods     fruit.spots            seed     mold.growth 
##              38              84             106              92              92 
##   seed.discolor       seed.size      shriveling           roots 
##             106              92             106              31

# Proportion missing overall
mean(is.na(Soybean))

## [1] 0.09504636

Interpretation:
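- About 9.5% of all cells are missing (0.095 in the output above).
- Missingness varies widely by predictor: from 1 value (date, area.dam) to 121 values, or about 18% of the 683 rows (hail, sever, seed.tmt, lodging).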
- Missingness may depend on disease class.
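
Whether missingness depends on disease class can be checked by computing, for each class, the share of rows with at least one missing predictor. This is a minimal sketch: `missing_by_class` is a hypothetical helper, the toy data are made up, and the Soybean call runs only if mlbench is installed.

```r
# Share of rows per class that have at least one missing predictor
missing_by_class <- function(df, class_col = "Class") {
  predictors <- df[setdiff(names(df), class_col)]
  incomplete <- !complete.cases(predictors)
  tapply(incomplete, df[[class_col]], mean)
}

# Made-up toy data: all missing values fall in class "a"
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b")),
  x1    = factor(c("0", NA, "1", "1")),
  x2    = factor(c(NA, NA, "0", "1"))
)
toy_missing <- missing_by_class(toy)
print(toy_missing)

# Same summary for Soybean, showing only classes with any incomplete rows
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  soy_missing <- missing_by_class(Soybean)
  print(round(soy_missing[soy_missing > 0], 2))
}
```

If the incomplete rows turn out to sit in only a few classes, the values are unlikely to be missing completely at random, which matters for the choice of imputation method.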

Step 4: Pattern of Missing Data

library(VIM)

## Loading required package: colorspace

## Loading required package: grid

## VIM is ready to use.

## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

aggr(Soybean, numbers=TRUE, sortVars=TRUE, cex.axis=.7)

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000

Interpretation:
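- The column labeled "Count" in the aggr() output holds the proportion of rows missing, not a count.
- Groups of predictors share identical missing counts, which suggests they are missing together: hail, sever, seed.tmt, and lodging (121 each); leaf.halo, leaf.marg, leaf.size, leaf.malf, and fruit.pods (84 each).
- Missingness could be related to disease classes.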
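
The joint missingness suggested by the matching counts can be tabulated without a plotting package. The sketch below labels each row by the set of variables it is missing; `missing_patterns` is a hypothetical helper and the toy data are made up.

```r
# Tabulate which combinations of variables are missing together
missing_patterns <- function(df) {
  na_flags <- is.na(df)   # logical matrix, one column per variable
  pattern <- apply(na_flags, 1, function(row_flags) {
    paste(colnames(na_flags)[row_flags], collapse = "+")
  })
  pattern[pattern == ""] <- "(complete)"
  sort(table(pattern), decreasing = TRUE)
}

# Made-up toy data: x1 and x2 are missing together in two rows
toy <- data.frame(x1 = c(1, NA, NA, 4, 5),
                  x2 = c(1, NA, NA, 4, NA),
                  x3 = 1:5)
toy_patterns <- missing_patterns(toy)
print(toy_patterns)
```

For Soybean, `head(missing_patterns(Soybean))` would list the most frequent patterns; the labels get long when many predictors are missing in the same row.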

Step 5: Strategy

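Drop predictors with very high missingness or degenerate distributions. Here no predictor is missing in more than about 18% of rows, so this mainly concerns near-degenerate predictors.
Impute categorical variables with their most frequent level (mode imputation) as a simple baseline.
Consider model-based imputation (e.g., the mice package), which can use the other predictors to fill in each value.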
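
Mode imputation for factors can be sketched in base R as below. `impute_mode` is a hypothetical helper, not a function from any package, and the toy data are made up (the column names only echo two Soybean predictors).

```r
# Replace missing values in a factor with its most frequent level
impute_mode <- function(x) {
  stopifnot(is.factor(x))
  counts <- table(x)                               # NA is excluded by default
  mode_level <- names(counts)[which.max(counts)]   # ties: first level wins
  x[is.na(x)] <- mode_level
  x
}

# Made-up toy data: one unordered and one ordered factor
toy <- data.frame(
  hail = factor(c("0", "0", "1", NA)),
  germ = factor(c("2", NA, "2", "1"), levels = c("0", "1", "2"), ordered = TRUE)
)
toy_imputed <- toy
toy_imputed[] <- lapply(toy, impute_mode)
print(toy_imputed)
```

Assigning into the existing factor keeps its levels and ordering intact. Mode imputation ignores relationships between predictors and inflates the dominant level, which is why it serves here only as a baseline.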

Conclusion

Problem 3.1: Glass data show skewed predictors and outliers; transformations that tolerate zeros (e.g., log(x + 1)) followed by centering and scaling could improve classification.
Problem 3.2: Soybean data contain about 9.5% missing values overall, with up to about 18% in individual predictors; best handled with careful imputation or predictor elimination.

Applied Predictive Modeling - Chapter 3 (Problems 3.1 & 3.2)

Sabina Baraili

2025-09-25
