suppressMessages(library(ggplot2))
suppressMessages(library(GGally))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(e1071))
suppressMessages(library(caret))
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories (one category has no samples, so the Type factor below has six levels). There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between the predictors.
glass <- as.data.frame(Glass)
summary(glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
glass_long <- glass %>%
pivot_longer(cols = c(1:9), names_to = "predictor", values_to = "value")
ggplot(glass_long, aes(x = value)) +
geom_boxplot(fill = "cornflowerblue", color = "black") +
facet_wrap(~ predictor, scales = "free_x") +
theme_minimal()
ggpairs(glass, columns = 1:9, progress = FALSE)
Calcium and the refractive index are highly positively correlated. RI is also negatively correlated with both silicon and aluminum. Magnesium is negatively correlated with barium, calcium, and aluminum. Aluminum is positively correlated with barium.
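These relationships can be checked numerically; a quick sketch that prints the pairwise correlation matrix of the predictors:
round(cor(glass[, 1:9]), 2)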
Do there appear to be any outliers in the data? Are any predictors skewed?
skewValues <- apply(glass[,1:9], 2, skewness)
skewValues
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
Yes. Based on the skew values and the box plots, potassium, barium, and calcium are heavily right-skewed, and iron and the refractive index are also right-skewed. Sodium and aluminum are roughly symmetric, while silicon and magnesium are left-skewed. Based on the box plots, there also appear to be outliers in every predictor except magnesium.
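To back up the box plots, a small sketch that counts values falling outside the usual 1.5 × IQR whiskers for each predictor:
sapply(glass[, 1:9], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  # Count points beyond the standard box-plot whisker limits
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})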
Are there any relevant transformations of one or more predictors that might improve the classification model?
First, I would apply a log transformation to Ca and RI, since they are strongly right-skewed and all of their values are greater than 1, so the log is well defined. Then I would center and scale the data.
glass$RI <- log(glass$RI)
glass$Ca <- log(glass$Ca)
# Center and scale all nine predictors (including Na and Al)
glass[, 1:9] <- scale(glass[, 1:9])
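As an alternative to these manual steps, caret's preProcess() can estimate a Box-Cox transformation (applied only to strictly positive predictors) and then center and scale in one call; a sketch on the untransformed data:
pp <- preProcess(as.data.frame(Glass)[, 1:9], method = c("BoxCox", "center", "scale"))
glass_trans <- predict(pp, as.data.frame(Glass)[, 1:9])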
Finally, feature extraction through principal component analysis would likely help to account for redundancy among the several predictors that are correlated with each other.
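For example, using the predictors already centered and scaled above:
glass_pca <- prcomp(glass[, 1:9])
summary(glass_pca)  # proportion of variance explained by each component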
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
soybean <- as.data.frame(Soybean)
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
summary(soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
nearZeroVar(soybean)
## [1] 19 26 28
For all of the categorical variables, the fraction of unique values over the sample size is low (< 10%), since no variable has anywhere near 68 (10% of 683) distinct levels. Therefore, to see whether any distributions are degenerate as described in the text, we need only examine the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value. By this ratio, three categorical variables stand out as degenerate: leaf.mild (535/20 = 26.75), mycelium (639/6 = 106.5), and sclerotia (625/20 = 31.25). These are exactly the columns (19, 26, and 28) flagged by nearZeroVar() above, and the model would likely be improved by eliminating them.
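The frequency ratios can also be read directly from caret's near-zero-variance diagnostics, which flag the same three variables:
nzv <- nearZeroVar(soybean, saveMetrics = TRUE)
nzv[nzv$nzv, c("freqRatio", "percentUnique")]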
soybean <- soybean %>%
select(-leaf.mild, -mycelium, -sclerotia)
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
percent_na <- sort(sapply(soybean, function(x) mean(is.na(x)) * 100))
print(percent_na)
## Class leaves date area.dam crop.hist
## 0.0000000 0.0000000 0.1464129 0.1464129 2.3426061
## plant.growth stem temp roots plant.stand
## 2.3426061 2.3426061 4.3923865 4.5387994 5.2708638
## precip stem.cankers canker.lesion ext.decay int.discolor
## 5.5636896 5.5636896 5.5636896 5.5636896 5.5636896
## leaf.halo leaf.marg leaf.size leaf.malf fruit.pods
## 12.2986823 12.2986823 12.2986823 12.2986823 12.2986823
## seed mold.growth seed.size leaf.shread fruiting.bodies
## 13.4699854 13.4699854 13.4699854 14.6412884 15.5197657
## fruit.spots seed.discolor shriveling germ hail
## 15.5197657 15.5197657 15.5197657 16.3982430 17.7159590
## sever seed.tmt lodging
## 17.7159590 17.7159590 17.7159590
Yes, some predictors are more likely to be missing than others. Four predictors (hail, sever, seed.tmt, and lodging) have over 17% missing values, while three (area.dam, date, and leaves) have less than 1%; leaves has no missing values at all.
soybean %>%
  group_by(Class) %>%
  summarize(total_na = sum(is.na(pick(everything())))) %>%
  arrange(desc(total_na))
## # A tibble: 19 × 2
## Class total_na
## <fct> <dbl>
## 1 phytophthora-rot 1159
## 2 2-4-d-injury 402
## 3 cyst-nematode 294
## 4 diaporthe-pod-&-stem-blight 162
## 5 herbicide-injury 136
## 6 alternarialeaf-spot 0
## 7 anthracnose 0
## 8 bacterial-blight 0
## 9 bacterial-pustule 0
## 10 brown-spot 0
## 11 brown-stem-rot 0
## 12 charcoal-rot 0
## 13 diaporthe-stem-canker 0
## 14 downy-mildew 0
## 15 frog-eye-leaf-spot 0
## 16 phyllosticta-leaf-spot 0
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
The pattern of missing data is clearly related to the classes: just 5 of the 19 classes account for all of the missing values.
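To see where those missing values sit, a quick sketch listing the predictors with any NAs for the class with the most missing data:
soybean %>%
  filter(Class == "phytophthora-rot") %>%
  summarize(across(everything(), ~ sum(is.na(.x)))) %>%
  select(where(~ .x > 0))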
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Based on the context of the data, I would handle missing values with imputation. First, I would consider the levels of each predictor. For predictors where a 0 indicates that the feature was absent, normal, or did not apply, I would replace missing values with 0: researchers may not have checked for those features given the plant's disease classification, suggesting the feature is simply not characteristic of that disease. For the remaining missing values, I would use k-nearest-neighbors imputation to fill them in based on similar samples.
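A minimal sketch of the k-nearest-neighbors step, assuming the VIM package (not loaded above), whose kNN() handles factor predictors directly; the zero-replacement step would be applied first per the reasoning above:
library(VIM)
# imp_var = FALSE suppresses the extra TRUE/FALSE indicator columns kNN adds
soybean_imputed <- kNN(soybean, k = 5, imp_var = FALSE)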