Data 624 Homework 4

Objective

Assignment 4 involves answering Exercises 3.1 and 3.2 from the book Applied Predictive Modeling by Kuhn and Johnson.

Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Part A + B

The density plots show that some variables, such as K, Fe, and Ba, are right skewed, while Mg is left skewed. RI has a bimodal distribution. These variables are candidates for transformation.

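library(mlbench)    # Glass and Soybean data sets
library(ggplot2)
library(gridExtra)  # grid.arrange()
library(MASS)       # boxcox()
library(dplyr)
library(corrplot)

data(Glass)

p1 <- ggplot(Glass, aes(x = RI)) + 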
  geom_density(fill = "blue", color = "black", alpha = 0.5) + 
  ggtitle("RI")

p2 <- ggplot(Glass, aes(x = Na)) + 
  geom_density(fill = "green", color = "black", alpha = 0.5) + 
  ggtitle("Na")

p3 <- ggplot(Glass, aes(x = Mg)) + 
  geom_density(fill = "red", color = "black", alpha = 0.5) + 
  ggtitle("Mg")

p4 <- ggplot(Glass, aes(x = Al)) + 
  geom_density(fill = "orange", color = "black", alpha = 0.5) + 
  ggtitle("Al")

p5 <- ggplot(Glass, aes(x = Si)) + 
  geom_density(fill = "grey", color = "black", alpha = 0.5) + 
  ggtitle("Si")

p6 <- ggplot(Glass, aes(x = K)) + 
  geom_density(fill = "yellow", color = "black", alpha = 0.5) + 
  ggtitle("K")

p7 <- ggplot(Glass, aes(x = Ca)) + 
  geom_density(fill = "purple", color = "black", alpha = 0.5) + 
  ggtitle("Ca")

p8 <- ggplot(Glass, aes(x = Ba)) + 
  geom_density(fill = "pink", color = "black", alpha = 0.5) + 
  ggtitle("Ba")

p9 <- ggplot(Glass, aes(x = Fe)) + 
  geom_density(fill = "brown", color = "black", alpha = 0.5) + 
  ggtitle("Fe")

# Arrange the plots side by side
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, nrow = 3, ncol = 3)
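
The skew seen in the density plots can also be put into numbers, and the same pass can count candidate outliers, which the exercise asks about. The sketch below is one possible way to do this in base R. The helper names are hypothetical, and the 1.5 * IQR rule is the usual boxplot convention, an assumption rather than something taken from the book.

```r
# Sample skewness: third central moment divided by the cubed standard
# deviation (moments computed by dividing by n).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Number of points lying more than 1.5 * IQR outside the quartiles,
# i.e. the points a default boxplot would draw individually.
count_iqr_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Hypothetical helper: one row per numeric column of a data frame.
describe_predictors <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    predictor = names(num),
    skewness  = vapply(num, sample_skewness, numeric(1)),
    outliers  = vapply(num, count_iqr_outliers, integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Usage on the glass data (requires the mlbench package):
# data(Glass, package = "mlbench")
# describe_predictors(Glass)
```

A positive skewness value indicates a right tail and a negative value a left tail, so the table can be read next to the density plots.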

The correlations between the variables show the following trends:

There is a strong positive correlation between Ca and RI, with a value of 0.81. As calcium content increases, so does the refractive index.

There is a negative correlation between Si and RI (-0.54). Mg and Al also have a negative relationship (-0.48).

num_glass <- Glass %>% 
  select_if(is.numeric)

M <- cor(num_glass)


corrplot(M, type = 'lower', method = 'circle', addCoef.col = 'black', number.cex = 0.7)
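
Reading the largest values off a correlation plot is error prone, so the pairs can also be listed in order of strength. This is a minimal base R sketch. `rank_correlations` is a hypothetical helper name, and it assumes a square correlation matrix with row and column names, such as `M` above.

```r
# Hypothetical helper: list each pair of variables once (lower triangle)
# and sort the pairs by absolute correlation, strongest first.
rank_correlations <- function(cor_mat) {
  idx <- which(lower.tri(cor_mat), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]
  rownames(pairs) <- NULL
  pairs
}

# Usage with the matrix computed above:
# head(rank_correlations(M), 5)
```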

Part C

I chose the Box-Cox method to normalize Al and Mg. Because the Box-Cox method estimates the transformation parameter from the data, it gives more flexibility than picking a fixed transformation. Both variables showed a complex distribution, so this method appeared most appropriate.
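
For reference, the Box-Cox family for a strictly positive predictor $x$ and parameter $\lambda$ is usually written as follows. The first case is the expression applied in the code for Al and Mg.

```latex
x^{(\lambda)} =
\begin{cases}
  \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
  \log x, & \lambda = 0
\end{cases}
\qquad x > 0
```

The requirement $x > 0$ is the reason Mg is shifted before it is transformed.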

For RI and K, which have more than one peak, I opted for a square-root transformation to better center the data without heavily changing the shape of the distribution.

Because Ca, Ba, and Fe are right skewed, I opted for a stronger transformation, the log. However, it did not noticeably change the distributions of Ba and Fe.

# Box-Cox transformation on Al (boxcox() comes from the MASS package)
lamb.Al <- boxcox((Glass$Al) ~ 1)

# Lambda that maximizes the profile log-likelihood for Al
lambda_Al <- lamb.Al$x[which.max(lamb.Al$y)] # approximately 0.50

Glass <- Glass %>%
  mutate(Al_transformed = (Al^lambda_Al - 1) / lambda_Al)

p4.2 <- ggplot(Glass, aes(x = Al_transformed)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  ggtitle("Box-Cox Transformed Al (Aluminum)") 

#################

# Box-Cox needs strictly positive values and Mg contains zeros.
# Shift every observation by the same constant so their order is preserved.
Glass <- Glass %>%
  mutate(Mg_shifted = Mg + 1)

# Box-Cox transformation for Mg
bc.Mg <- boxcox((Glass$Mg_shifted) ~ 1)

# Lambda that maximizes the profile log-likelihood for Mg
lambda_Mg <- bc.Mg$x[which.max(bc.Mg$y)]

# Apply the Box-Cox transformation to the shifted Mg
Glass <- Glass %>%
  mutate(Mg_transformed = (Mg_shifted^lambda_Mg - 1) / lambda_Mg)

p3.2 <- ggplot(Glass, aes(x = Mg_transformed)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  ggtitle("Box-Cox Transformed Mg (Magnesium)")

# Perform square root transformation on RI
Glass$Ri_sqrt <- sqrt(Glass$RI)

# Perform square root transformation on K
Glass$K_sqrt <- sqrt(Glass$K)

# Perform log transformation on Ca (Calcium)
# Add 1 before taking the log so that zero values do not produce log(0)
Glass$Ca_log <- log(Glass$Ca + 1)

Glass$Ba_log <- log(Glass$Ba + 1)

Glass$Fe_log <- log(Glass$Fe + 1)

# Plot for transformed RI
p1.2 <- ggplot(Glass, aes(x = Ri_sqrt)) +
  geom_density(fill = "purple", alpha = 0.5) +
  ggtitle("Ri_sqrt")

# Plot for transformed K
p6.2 <- ggplot(Glass, aes(x = K_sqrt)) +
  geom_density(fill = "purple", alpha = 0.5) +
  ggtitle("K_sqrt") 

# Plot for transformed Ca 
p7.2 <- ggplot(Glass, aes(x = Ca_log)) +
  geom_density(fill = "purple", alpha = 0.5) +
  ggtitle("Ca_log") 

# Plot for transformed Ba 
p8.2 <- ggplot(Glass, aes(x = Ba_log)) +
  geom_density(fill = "purple", alpha = 0.5) +
  ggtitle("Ba_log") 

# Plot for transformed Fe
p9.2 <- ggplot(Glass, aes(x = Fe_log)) +
  geom_density(fill = "purple", alpha = 0.5) +
  ggtitle("Fe_log") 


grid.arrange(p1.2, p2, p3.2, p4.2, p5, p6.2, p7.2, p8.2, p9.2, nrow = 3, ncol = 3)
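
To make the lambda selection used for Al and Mg less of a black box, the sketch below shows one way such a value can be estimated: evaluate the Box-Cox profile log-likelihood on a grid of candidate lambdas and keep the maximizer. It is a base R illustration under stated assumptions, not the internals of `MASS::boxcox()`. The function names and the grid from -2 to 2 in steps of 0.1 are my own choices.

```r
# Box-Cox transform of a strictly positive vector for a given lambda.
boxcox_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood (up to an additive constant) of an
# intercept-only normal model fitted to the transformed values.
boxcox_loglik <- function(y, lambda) {
  z <- boxcox_transform(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

# Hypothetical helper: grid search for the lambda with the highest likelihood.
estimate_lambda <- function(y, grid = seq(-2, 2, by = 0.1)) {
  stopifnot(all(y > 0))
  ll <- vapply(grid, function(l) boxcox_loglik(y, l), numeric(1))
  grid[which.max(ll)]
}

# Usage on a strictly positive predictor from the glass data:
# estimate_lambda(Glass$Al)
```

Because the grid is coarse, the result is only accurate to the grid step. A finer grid or a one-dimensional optimizer would refine it.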

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Part A + B

Some nulls are present in the data. In addition, several variables, such as hail, contain a large number of zeros. However, a zero is not the same as a null or an uninformative result. For hail, a 0 means that the sample was recorded with no hail, so the zero does tell us something.

One problem, however, is that some of the variables are heavily imbalanced. Using hail again as an example, most samples (435 of the 562 with a recorded value) had no hail. It makes one question whether the variable is useful at all.

Because many of the variables are driven by the environment, I expect variables like temp and precip to be heavily influenced by climate and ecological patterns. Some classes are therefore likely to have missing data or zeros.

Regarding the nulls, I assume there might have been problems with the device recording the data.
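
Whether missingness is concentrated in particular predictors or classes can be checked directly instead of assumed. The sketch below is one way to tabulate it in base R. The helper names are hypothetical, and `class_col = "Class"` assumes the outcome column is named as in the mlbench version of the data.

```r
# Share of missing values per column, largest first.
missing_by_predictor <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}

# Per class: number of rows and share of rows with at least one missing value.
missing_by_class <- function(df, class_col = "Class") {
  cls <- factor(df[[class_col]])
  incomplete <- !complete.cases(df)
  data.frame(
    class = levels(cls),
    n = as.vector(table(cls)),
    pct_incomplete = as.vector(tapply(incomplete, cls, mean)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Usage on the soybean data (requires the mlbench package):
# data(Soybean, package = "mlbench")
# head(missing_by_predictor(Soybean))
# missing_by_class(Soybean)
```

If the incomplete rows turn out to sit in only a few classes, the missingness is related to the outcome, and dropping those rows would remove most of the evidence for those classes.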

# Investigate frequency distributions of categorical predictors
data(Soybean)                            # Soybean data from the mlbench package
cat_cols <- sapply(Soybean, is.factor)  # Identify categorical variables
cat_predictors <- Soybean[, cat_cols]    # Subset categorical predictors


# Display frequency distributions
lapply(cat_predictors, table)

## $Class
## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20 
## 
## $date
## 
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90 
## 
## $plant.stand
## 
##   0   1 
## 354 293 
## 
## $precip
## 
##   0   1   2 
##  74 112 459 
## 
## $temp
## 
##   0   1   2 
##  80 374 199 
## 
## $hail
## 
##   0   1 
## 435 127 
## 
## $crop.hist
## 
##   0   1   2   3 
##  65 165 219 218 
## 
## $area.dam
## 
##   0   1   2   3 
## 123 227 145 187 
## 
## $sever
## 
##   0   1   2 
## 195 322  45 
## 
## $seed.tmt
## 
##   0   1   2 
## 305 222  35 
## 
## $germ
## 
##   0   1   2 
## 165 213 193 
## 
## $plant.growth
## 
##   0   1 
## 441 226 
## 
## $leaves
## 
##   0   1 
##  77 606 
## 
## $leaf.halo
## 
##   0   1   2 
## 221  36 342 
## 
## $leaf.marg
## 
##   0   1   2 
## 357  21 221 
## 
## $leaf.size
## 
##   0   1   2 
##  51 327 221 
## 
## $leaf.shread
## 
##   0   1 
## 487  96 
## 
## $leaf.malf
## 
##   0   1 
## 554  45 
## 
## $leaf.mild
## 
##   0   1   2 
## 535  20  20 
## 
## $stem
## 
##   0   1 
## 296 371 
## 
## $lodging
## 
##   0   1 
## 520  42 
## 
## $stem.cankers
## 
##   0   1   2   3 
## 379  39  36 191 
## 
## $canker.lesion
## 
##   0   1   2   3 
## 320  83 177  65 
## 
## $fruiting.bodies
## 
##   0   1 
## 473 104 
## 
## $ext.decay
## 
##   0   1   2 
## 497 135  13 
## 
## $mycelium
## 
##   0   1 
## 639   6 
## 
## $int.discolor
## 
##   0   1   2 
## 581  44  20 
## 
## $sclerotia
## 
##   0   1 
## 625  20 
## 
## $fruit.pods
## 
##   0   1   2   3 
## 407 130  14  48 
## 
## $fruit.spots
## 
##   0   1   2   4 
## 345  75  57 100 
## 
## $seed
## 
##   0   1 
## 476 115 
## 
## $mold.growth
## 
##   0   1 
## 524  67 
## 
## $seed.discolor
## 
##   0   1 
## 513  64 
## 
## $seed.size
## 
##   0   1 
## 532  59 
## 
## $shriveling
## 
##   0   1 
## 539  38 
## 
## $roots
## 
##   0   1   2 
## 551  86  15
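
The frequency tables above can be screened for degenerate distributions with the rule of thumb from the chapter: few unique values relative to the sample size, and a large ratio between the counts of the most common and second most common value. The sketch below is my own base R version. The function name is hypothetical, and the thresholds (a ratio above 20, unique values under 10% of the sample) are approximate cutoffs I am assuming.

```r
# Hypothetical helper: flag predictors whose distribution is near-degenerate.
near_zero_variance <- function(df, freq_cut = 20, unique_cut = 0.10) {
  stats <- lapply(df, function(x) {
    counts <- table(x)                      # table() ignores NAs by default
    counts <- sort(counts[counts > 0], decreasing = TRUE)
    ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    c(freq_ratio = ratio, pct_unique = length(counts) / length(x))
  })
  out <- data.frame(predictor = names(df), do.call(rbind, stats),
                    row.names = NULL, stringsAsFactors = FALSE)
  out$degenerate <- out$freq_ratio > freq_cut & out$pct_unique < unique_cut
  out
}

# Usage with the predictors tabulated above:
# near_zero_variance(cat_predictors)
```

Working from the counts printed above, mycelium (639 vs. 6), sclerotia (625 vs. 20), and leaf.mild (535 vs. 20) have frequency ratios above 20, so they are the predictors this rule would single out.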

Part C

Given the small amount of missing values, we could remove them. However, the number of missing values differs across variables: lodging is missing 121 values, while precip is missing 38. Removing every row that contains a null may bias the model toward the samples with more complete data. Instead, I would impute, using a technique like k-nearest neighbors to fill in the null values.

summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##
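
The k-nearest-neighbors idea from Part C can be sketched for categorical data as follows. This is a simplified base R illustration, not the API of an imputation package. The helper name is hypothetical, the distance is the share of mismatching categories over the columns both rows have observed, and each gap is filled by majority vote among the k closest rows that have a value for that column.

```r
# Hypothetical helper: nearest-neighbor imputation for a data frame of factors.
knn_impute_factors <- function(df, k = 5) {
  stopifnot(all(vapply(df, is.factor, logical(1))))
  m <- do.call(cbind, lapply(df, as.character))   # character matrix
  filled <- m
  for (i in which(rowSums(is.na(m)) > 0)) {
    # Distance from row i to every other row; Inf when nothing is comparable.
    d <- vapply(seq_len(nrow(m)), function(j) {
      shared <- !is.na(m[i, ]) & !is.na(m[j, ])
      if (j == i || !any(shared)) return(Inf)
      mean(m[i, shared] != m[j, shared])
    }, numeric(1))
    for (col in which(is.na(m[i, ]))) {
      donors <- which(!is.na(m[, col]) & is.finite(d))
      if (length(donors) == 0) next                # leave the gap as NA
      nearest <- donors[order(d[donors])][seq_len(min(k, length(donors)))]
      votes <- table(m[nearest, col])
      filled[i, col] <- names(votes)[which.max(votes)]
    }
  }
  out <- df
  for (col in seq_along(df)) {
    gaps <- is.na(df[[col]])
    out[[col]][gaps] <- filled[gaps, col]          # observed values stay untouched
  }
  out
}

# Usage on the soybean data (requires the mlbench package):
# data(Soybean, package = "mlbench")
# soy_imputed <- knn_impute_factors(Soybean, k = 5)
```

Including Class among the columns lets the disease label inform the neighbors, which is convenient for exploration but would leak the outcome into the predictors if done before a train/test split.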

Christian Uriostegui

2024-09-25
