# Import needed libraries

library(tsibbledata) # time series data sets used in the exercises
library(tsibble) # as_tsibble() and tsibble data structures
library(tibble) # view()
library(ggplot2) # plotting
library(feasts) # time series graphics such as autoplot()

library(readr) # read_csv()
library(dplyr) # filter(), mutate(), arrange(), etc.
library(tidyr) # pivot_longer() and gather()

library(USgas) # us_total data

library(fpp3) # us_gasoline data set
library(seasonal) # X-13ARIMA-SEATS decomposition

library(mlbench) # Glass and Soybean data

library(corrplot) # correlation plots

Exercises:

3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Answer:

As shown in the histograms, Na is roughly symmetric, Mg and Si are left skewed, and Al, K, Ca, Ba, and Fe are right skewed.

As shown in the box plots, the percentage of Mg has the most variation, as its interquartile range is spread out. Ba has the least variation, as its values are concentrated at 0.

As shown in the correlation matrix, Ca and RI have the strongest positive correlation (0.81), followed by Al and Ba (0.48). Si and RI have the strongest negative correlation (-0.54), followed by Ba and Mg (-0.49).

Most of the glass samples are of Type 1, 2, or 7.

data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Glass|>
  select(where(is.numeric))|>
  gather()|>
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

Glass|>
  select(where(is.numeric))|>
  gather()|>
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

Glass|>
  select(where(is.numeric))|>
  cor()

##               RI          Na           Mg          Al          Si            K
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220 -0.289832711
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881 -0.266086504
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672  0.005395667
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372  0.325958446
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000 -0.193330854
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085  1.000000000
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215 -0.317836155
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131 -0.042618059
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073 -0.007719049
##            Ca            Ba           Fe
## RI  0.8104027 -0.0003860189  0.143009609
## Na -0.2754425  0.3266028795 -0.241346411
## Mg -0.4437500 -0.4922621178  0.083059529
## Al -0.2595920  0.4794039017 -0.074402151
## Si -0.2087322 -0.1021513105 -0.094200731
## K  -0.3178362 -0.0426180594 -0.007719049
## Ca  1.0000000 -0.1128409671  0.124968219
## Ba -0.1128410  1.0000000000 -0.058691755
## Fe  0.1249682 -0.0586917554  1.000000000
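
The pairwise relationships in the matrix above can also be ranked programmatically. The following is a minimal sketch in base R; `top_cor_pairs` is a hypothetical helper introduced here for illustration, and the small matrix is a rounded excerpt of the output above rather than the full result.

```r
# Rank predictor pairs by absolute correlation.
# `top_cor_pairs` is a hypothetical helper, not part of any package.
top_cor_pairs <- function(cor_mat, n = 3) {
  # Keep each pair once by looking only at the upper triangle
  idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(abs(pairs$r), decreasing = TRUE), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# A 4 x 4 excerpt of the matrix printed above (values rounded to 2 decimals)
vars <- c("RI", "Si", "Ca", "Ba")
cor_excerpt <- matrix(
  c( 1.00, -0.54,  0.81,  0.00,
    -0.54,  1.00, -0.21, -0.10,
     0.81, -0.21,  1.00, -0.11,
     0.00, -0.10, -0.11,  1.00),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

top_pairs <- top_cor_pairs(cor_excerpt, n = 2)
print(top_pairs)
```

On the full data, the same helper could presumably be applied to the complete correlation matrix of the numeric columns of Glass.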

Glass|>
  ggplot() +
  geom_bar(aes(x = Type)) +
  ggtitle("Distribution of Types of Glass")

Do there appear to be any outliers in the data? Are any predictors skewed?

Answer:

Based on the box plots, I observe many points outside the whiskers for every predictor except Mg, which does not appear to have any outliers.
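
The visual check of points outside the whiskers can be backed up with a count. This is a rough sketch that uses base R's `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule as a default box plot; `count_outliers` is a hypothetical helper, and the data frame holds made-up toy values rather than the Glass data.

```r
# Count points beyond the box plot whiskers for every numeric column.
# `count_outliers` is a hypothetical helper, not part of any package.
count_outliers <- function(df) {
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  vapply(numeric_cols,
         function(x) length(boxplot.stats(x)$out),  # $out holds the flagged points
         integer(1))
}

# Toy data (made-up values): column `a` has one extreme point, `b` has none
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 100),
  b = c(1, 2, 3, 4, 5, 6),
  g = factor(c("x", "x", "y", "y", "z", "z"))
)

outlier_counts <- count_outliers(toy)
print(outlier_counts)
```

Calling `count_outliers(Glass)` after `data(Glass)` should give one count per predictor, with the factor column Type skipped automatically.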

As mentioned above, most of the predictors have a skewed distribution. To quantify the level of skewness, I used the skewness function.

Rule of thumb: 0.5 < |skewness| < 1 indicates moderate skewness, |skewness| > 1 indicates substantial skewness.

Hence K, Ca, Ba, Fe, Mg, and RI show substantial skewness, whereas Si and Al show moderate skewness.

library(e1071)  # For skewness function
skew_values <- Glass |>
  select(where(is.numeric)) |>
  sapply(skewness, na.rm = TRUE)

# Print skewness values
print(skew_values)

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
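
The rule of thumb can be applied to these values mechanically. The sketch below copies the printed skewness values into a vector (named `skew_printed` here to keep it separate from the computed object) and bins their absolute values with `cut()`; the label "approximately symmetric" for |skewness| <= 0.5 is an assumption, since the rule as stated only names the two upper bands.

```r
# Skewness values copied from the output above
skew_printed <- c(RI = 1.6027151, Na = 0.4478343, Mg = -1.1364523,
                  Al = 0.8946104, Si = -0.7202392, K = 6.4600889,
                  Ca = 2.0184463, Ba = 3.3686800, Fe = 1.7298107)

# Bin |skewness| into the rule-of-thumb bands
skew_level <- cut(abs(skew_printed),
                  breaks = c(0, 0.5, 1, Inf),
                  labels = c("approximately symmetric", "moderate", "substantial"),
                  include.lowest = TRUE)

# Group predictor names by band
skew_groups <- split(names(skew_printed), skew_level)
print(skew_groups)
```

The sign of each value still matters for choosing a transformation: negative values (Mg, Si) indicate a left skew, positive values a right skew.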

Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer:

Based on the histograms and the skewness values above, here are my transformation recommendations:

For right-skewed variables (K, Ba, Ca, Fe, RI): a log transformation would be most appropriate for the extremely skewed K, and a log or square root transformation for Ba, Ca, Fe, and RI. Variables that contain zeros (Ba and Fe, for example) cannot be log-transformed directly, so the code below applies the log only to strictly positive variables.

For left-skewed variables (Mg, Si): a squared or cubed transformation, or a negative log transformation of the reversed (reflected) variable.

For Al: a square root transformation should be sufficient.

For Na: minimal skewness, no transformation needed.

Glass_transformed <- Glass
# Log transformation for right-skewed variables (skewness > 0.5),
# applied only when every value is strictly positive
skewed_vars <- names(skew_values[skew_values > 0.5])
for(var in skewed_vars) {
  if(min(Glass[[var]], na.rm = TRUE) > 0) {
    Glass_transformed[[paste0(var, "_log")]] <- log(Glass[[var]])
  }
}


Glass_transformed |>
  select(where(is.numeric))|>
  gather()|>
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

# Square root transformation for right-skewed variables with non-negative values
for(var in skewed_vars) {
  if(min(Glass[[var]], na.rm = TRUE) >= 0) {
    Glass_transformed[[paste0(var, "_sqrt")]] <- sqrt(Glass[[var]])
  }
}
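
The log loop above skips any variable that contains zeros. One common workaround, shown here only as a sketch, is `log1p()`, which computes log(1 + x) and is defined at zero. The same sketch shows the reflect-then-log idea for a left-skewed variable. The `skew` helper and both toy vectors are assumptions made for illustration (made-up values, not the Glass data), and the helper uses a plain moment-based formula that may differ slightly from `e1071::skewness`.

```r
# Moment-based sample skewness (hypothetical helper so the sketch needs no packages)
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy right-skewed vector with many zeros (made-up values)
x <- c(rep(0, 10), 0.1, 0.2, 0.5, 1, 2, 5)

# log1p(x) = log(1 + x) stays finite at zero, unlike log(x)
x_log1p <- log1p(x)

# Toy left-skewed vector: the mirror image of x
y <- max(x) - x

# Reflect so the long tail points right, then apply log1p.
# Note that reflection reverses the ordering of the original values.
y_reflected_log <- log1p(max(y) - y)

skew_before <- skew(x)
skew_after  <- skew(x_log1p)
print(c(before = skew_before, after = skew_after))
```

Whether such a transformation actually helps depends on the classifier; tree-based models, for instance, are largely insensitive to monotone transformations of the predictors.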

3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Answer:

When looking for “degenerate” distributions in categorical predictors, the following are considered:

Near-zero variance predictors: variables where almost all observations have the same value (one dominant level)
Zero variance predictors: variables with only one unique value (completely constant)
Highly imbalanced categories: one level appears much more frequently than the others
Missing values: categories with many NA values

Based on the frequency distribution plots, I observe the following:

Near-zero variance predictors: mycelium
Zero variance predictors: none
Highly imbalanced: int.discolor, leaf.malf, sclerotia
Missing values: many categorical predictors have missing values

data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Soybean %>%
  select(-Class)%>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_bar()+
  facet_wrap(~ key) +
  labs(title = "Distribution of Soybean Categorical Predictor Variables")
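
The visual judgment about degenerate predictors can be supplemented with two numeric rules of thumb: the ratio of the most frequent value's count to the second most frequent value's count, and the percentage of unique values. The sketch below is a base R approximation of that check. `nzv_summary` is a hypothetical helper, the cut-offs of 20 and 10% are assumed defaults, and the data frame holds made-up toy values rather than the Soybean data.

```r
# Flag zero-variance and near-zero-variance predictors.
# `nzv_summary` is a hypothetical helper; the cut-offs are assumptions.
nzv_summary <- function(df, freq_cut = 20, unique_cut = 10) {
  stats <- lapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)   # table() drops NA by default
    freq_ratio <- if (length(counts) > 1 && counts[2] > 0) {
      as.numeric(counts[1] / counts[2])
    } else {
      Inf                                         # only one observed value
    }
    pct_unique <- 100 * sum(counts > 0) / length(x)
    data.frame(freq_ratio = freq_ratio, pct_unique = pct_unique)
  })
  out <- do.call(rbind, stats)
  out$zero_var <- is.infinite(out$freq_ratio)
  out$near_zero_var <- out$freq_ratio > freq_cut & out$pct_unique < unique_cut
  out
}

# Toy data (made-up values) with one balanced, one rare-level, one constant column
toy <- data.frame(
  balanced = factor(rep(c("0", "1"), each = 50)),
  rare     = factor(c(rep("0", 98), rep("1", 2))),
  constant = factor(rep("0", 100))
)

nzv_result <- nzv_summary(toy)
print(nzv_result)
```

The caret package provides `nearZeroVar()` for the same purpose; the helper here is only meant to make the two criteria explicit.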

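Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Answer:

hail, sever, seed.tmt, and lodging have the highest share of missing values (about 18%), followed by germ (about 16%). Only Class and leaves have no missing values; the other 34 variables all have at least some, although date and area.dam are each missing only a single value. Yes, the pattern of missing data is related to the classes, as shown in the charts below: only 5 classes have missing data.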

# Overall missing data by predictor
missing_by_predictor <- Soybean %>%
  summarise(across(everything(), ~sum(is.na(.))/n()*100)) %>%
  pivot_longer(cols = everything(), 
               names_to = "predictor", 
               values_to = "percent_missing") %>%
  arrange(desc(percent_missing))

# Print predictors with highest missing rates
print(missing_by_predictor %>% filter(percent_missing > 0))

## # A tibble: 34 × 2
##    predictor       percent_missing
##    <chr>                     <dbl>
##  1 hail                       17.7
##  2 sever                      17.7
##  3 seed.tmt                   17.7
##  4 lodging                    17.7
##  5 germ                       16.4
##  6 leaf.mild                  15.8
##  7 fruiting.bodies            15.5
##  8 fruit.spots                15.5
##  9 seed.discolor              15.5
## 10 shriveling                 15.5
## # ℹ 24 more rows

# Missing data by class
# across() comes first so that it does not pick up and overwrite n_samples
missing_by_class <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~sum(is.na(.))/n()*100),
            n_samples = n()) %>%
  pivot_longer(cols = -c(Class, n_samples), 
               names_to = "predictor", 
               values_to = "percent_missing") %>%
  filter(percent_missing > 0)

# Find predictors with class-dependent missingness
class_dependent_missing <- missing_by_class %>%
  group_by(predictor) %>%
  summarise(min_missing = min(percent_missing),
            max_missing = max(percent_missing),
            range = max_missing - min_missing) %>%
  filter(range > 10) %>%  # Arbitrary threshold - variables with >10% difference between classes
  arrange(desc(range))

# Visualize missing data by class for top variables with class-dependent missingness
if(nrow(class_dependent_missing) > 0) {
  top_vars <- head(class_dependent_missing$predictor, 5)
  
  missing_by_class %>%
    filter(predictor %in% top_vars) %>%
    ggplot(aes(x = Class, y = percent_missing, fill = predictor)) +
    geom_bar(stat = "identity", position = "dodge") +
    coord_flip() +
    labs(title = "Missing Data by Class for Key Predictors",
         y = "Percent Missing", 
         x = "Class") +
    theme_minimal()
}

# Heatmap of missing data patterns by class
missing_heatmap <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~sum(is.na(.))/n()*100)) %>%
  pivot_longer(cols = -Class, 
               names_to = "predictor", 
               values_to = "percent_missing") %>%
  ggplot(aes(x = predictor, y = Class, fill = percent_missing)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Heatmap of Missing Data by Class and Predictor",
       fill = "% Missing")

# Display the heatmap (assigning a plot to a variable does not draw it)
missing_heatmap
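
A compact numeric check of the class dependence is the share of incomplete rows within each class. The following base R sketch illustrates the idea; `incomplete_by_class` is a hypothetical helper and the data frame contains made-up toy values, not the Soybean data.

```r
# Percentage of rows with at least one missing predictor, per class.
# `incomplete_by_class` is a hypothetical helper, not part of any package.
incomplete_by_class <- function(df, class_col) {
  predictors <- df[setdiff(names(df), class_col)]
  has_na <- !complete.cases(predictors)
  round(100 * tapply(has_na, df[[class_col]], mean), 1)
}

# Toy data (made-up values): only class "b" has missing predictors
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b", "c", "c")),
  x1    = factor(c("0", "1", NA, NA, "1", "0")),
  x2    = factor(c("1", "1", "0", NA, "0", "0"))
)

pct_incomplete <- incomplete_by_class(toy, "Class")
print(pct_incomplete)
```

Applied as `incomplete_by_class(Soybean, "Class")`, classes with a value of 0 contain only complete rows, which makes it easy to list the classes that carry all of the missing data.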

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Answer:

Each predictor has less than 18% of its values missing, and the data set is small, so eliminating predictors is hard to justify. We first need to understand why only 5 classes have missing data. One approach is to exclude these 5 classes and use only the remaining classes for classification.

If important predictors must be imputed, missForest is a good option because it handles categorical data well and preserves relationships between variables. I may also handle the class-specific missingness explicitly to ensure accurate classification.
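
As a point of comparison for any model-based imputation, a very simple baseline is to replace each missing value with the most frequent level of its column. The sketch below shows that baseline in base R. It is a stand-in technique and not missForest; `impute_mode` is a hypothetical helper and the data frame holds made-up toy values.

```r
# Replace NA in each factor column with that column's most frequent level.
# `impute_mode` is a hypothetical helper and a simple baseline, not missForest.
impute_mode <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    # Skip non-factors, complete columns, and columns with no observed values
    if (is.factor(x) && anyNA(x) && !all(is.na(x))) {
      mode_level <- names(which.max(table(x)))   # table() ignores NA
      x[is.na(x)] <- mode_level
      df[[col]] <- x
    }
  }
  df
}

# Toy data (made-up values)
toy <- data.frame(
  temp = factor(c("0", "1", "1", NA)),
  hail = factor(c("0", "0", NA, "1"))
)

imputed <- impute_mode(toy)
print(imputed)
```

Mode imputation ignores relationships between predictors and the class-dependent missingness discussed above, so it is best treated as a reference point rather than a final strategy.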

Homework 4 - Data Transformations

Dhanya Nair

2025-02-27
