DATA 624: Homework 4

Nakesha Fray

2025-02-25

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your RPubs link along with a .pdf of your run code.

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)   # Glass is in the mlbench package
library(tidyverse) # used for the wrangling and plotting below

data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

glass_longer <- Glass %>%
  select(-Type) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Value")

glass_longer %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 18) +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Histograms of Predictors")

glass_longer %>%
  ggplot(aes(x = Variable, y = Value, fill = Variable)) +
  geom_boxplot() +
  facet_wrap(~ Variable, scales = "free") +
  labs(title = "Boxplots of Predictors")

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

From the boxplots, we see outliers in all of the predictors except Mg. The histograms show that most predictors are skewed: Al, Ca, and RI are slightly right-skewed; Ba, Fe, and K are heavily right-skewed; Mg is bimodal and left-skewed; Si is slightly left-skewed; and Na is close to normal with a very slight right skew.
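
As a numeric check on the visual impression, the skewness of each predictor can be computed. This is a minimal sketch, assuming the e1071 package is available for skewness().

library(e1071)

# Skewness per predictor; large positive values match the heavy right
# skew of Ba, Fe, and K, negative values the left skew of Mg and Si
sapply(Glass[, -10], skewness)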

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Since some of the predictors in Glass are heavily skewed, a log, square root, inverse, or Box-Cox transformation of one or more predictors might improve the classification model. As an example, I applied a log transformation, and the distributions of Ba, Fe, and K improved somewhat. Note that log maps the zero values in Ba, Fe, and K to -Inf, which is why ggplot reports removed non-finite rows in the warning below.

glass_longer %>%
  ggplot(aes(x = log(Value))) +
  geom_histogram(bins = 18) +
  facet_wrap(~ Variable, scales = "free")
## Warning: Removed 392 rows containing non-finite outside the scale range
## (`stat_bin()`).
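
Because Ba, Fe, and K contain exact zeros (which log maps to -Inf and which Box-Cox, requiring strictly positive data, cannot handle), a Yeo-Johnson transformation is a safer choice. A minimal sketch using caret's preProcess(), assuming caret is loaded:

library(caret)

# Yeo-Johnson accommodates zero values, unlike log or Box-Cox
pp <- preProcess(Glass[, -10], method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, Glass[, -10])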

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Since all of the predictors are categorical, I created bar charts of all of the predictors. The predictor closest to a degenerate distribution is mycelium, since it almost always takes a single value. Others that are close to degenerate are sclerotia, shriveling, and leaf.malf, since their second-most-common values are only slightly more frequent than mycelium's.

# Confirm that every column (including the outcome) is a factor
soy_cate <- sapply(Soybean, is.factor)
soy_cate
##           Class            date     plant.stand          precip            temp 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##            TRUE            TRUE            TRUE            TRUE            TRUE 
##           roots 
##            TRUE
Soybean %>%
  select(-Class, -date) %>%
  drop_na() %>%
  gather(key = "key", value = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(~ key) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: attributes are not identical across measure variables; they will be
## dropped
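
caret also offers a programmatic check for degenerate distributions; a quick sketch, assuming caret is loaded:

library(caret)

# Predictors with near-zero variance are candidates for degenerate
# distributions; this is expected to flag mycelium and sclerotia
nearZeroVar(Soybean, names = TRUE)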

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The predictors with the largest amount of missing data are hail, sever, seed.tmt, and lodging, each with 121 missing values, which equates to about 17.7% of observations. The missing data also appear to be related to class: most of the missing values come from phytophthora-rot, where we see up to 68 missing values for some predictors. cyst-nematode, 2-4-d-injury, and diaporthe-pod-&-stem-blight show similar missingness, with roughly 16-41 missing values at most, and herbicide-injury is on the lower side with 8 missing values. The remaining classes have no missing data at all. Because the missingness is concentrated in particular classes, this is a case of potentially informative missingness.
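
One way to reproduce the "roughly 18%" figure from the question is the proportion of rows with at least one missing value; a quick check:

# Proportion of incomplete rows
mean(!complete.cases(Soybean))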

# Count missing values per predictor
na_total <- sapply(Soybean, function(x) sum(is.na(x)))

soy_na <- data.frame(Column = names(na_total), NAs = na_total)

soy_na %>%
  arrange(desc(NAs))
##                          Column NAs
## hail                       hail 121
## sever                     sever 121
## seed.tmt               seed.tmt 121
## lodging                 lodging 121
## germ                       germ 112
## leaf.mild             leaf.mild 108
## fruiting.bodies fruiting.bodies 106
## fruit.spots         fruit.spots 106
## seed.discolor     seed.discolor 106
## shriveling           shriveling 106
## leaf.shread         leaf.shread 100
## seed                       seed  92
## mold.growth         mold.growth  92
## seed.size             seed.size  92
## leaf.halo             leaf.halo  84
## leaf.marg             leaf.marg  84
## leaf.size             leaf.size  84
## leaf.malf             leaf.malf  84
## fruit.pods           fruit.pods  84
## precip                   precip  38
## stem.cankers       stem.cankers  38
## canker.lesion     canker.lesion  38
## ext.decay             ext.decay  38
## mycelium               mycelium  38
## int.discolor       int.discolor  38
## sclerotia             sclerotia  38
## plant.stand         plant.stand  36
## roots                     roots  31
## temp                       temp  30
## crop.hist             crop.hist  16
## plant.growth       plant.growth  16
## stem                       stem  16
## date                       date   1
## area.dam               area.dam   1
## Class                     Class   0
## leaves                   leaves   0
ggplot(soy_na, aes(y = reorder(Column, NAs), x = NAs)) +
  geom_col(fill = "steelblue") +
  labs(title = "Number of Missing Values per Predictor in Soybean",
       x = "Number of NAs", y = "Predictor") +
  theme_minimal()

missing_data <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~sum(is.na(.))))%>%
  pivot_longer(cols = -Class, names_to = "Predictor", values_to = "Missing_Count") %>%
  filter(Missing_Count > 0)

ggplot(missing_data, aes(x = reorder(Predictor, Missing_Count), y = Missing_Count, fill = Predictor)) +
  geom_col() +
  coord_flip() + 
  facet_wrap(~ Class) + 
  theme_minimal() +
  labs(title = "Missingness by Predictor and Class in Soybean",
       x = "Predictor",
       y = "Missing Count",
       fill = "Predictor") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

One way to handle missing data is elimination. In this case, however, we probably would not eliminate predictors, since almost every predictor contains missing values. We could instead remove the five classes (shown in the plot above) that have missing data, which would leave us with no missing values at all, as sketched below.
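
A minimal sketch of that elimination approach, dropping every class that contains any missing values:

# Classes that contain at least one missing value
classes_with_na <- unique(Soybean$Class[!complete.cases(Soybean)])

# Keep only the fully observed classes
soy_complete <- droplevels(Soybean[!(Soybean$Class %in% classes_with_na), ])
sum(is.na(soy_complete))  # should be 0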

However, if the missing data are related to the classes they fall in and we do not want to lose observations, imputation may be the better choice. This could be done with a model-based imputation such as k-nearest neighbors (KNN), which finds the most similar complete observations in the sample and uses their values to fill in the missing entries.
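
A sketch of KNN imputation using the VIM package (one option among several; mice would also work, and caret's knnImpute expects numeric predictors, so it is less suitable here):

library(VIM)  # assumption: the VIM package is installed

# kNN() handles factor predictors directly, filling each missing entry
# from the k most similar complete observations
soy_imputed <- kNN(Soybean, k = 5, imp_var = FALSE)
sum(is.na(soy_imputed))  # should be 0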