Homework # 4 (Data 624)

Problem 3.1 Dataset related to Glass Identification.

library(mlbench)

data(Glass)

str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Data Exploration

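At a glance, the Ba column looks as if it holds a single unique value (all zeros in the rows shown), which would make it a zero-variance predictor that could be removed. That is not quite the case: the warning for log(Ba) later in this problem reports 176 non-finite values out of 214 rows, so Ba is mostly zero but not constant.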

head(Glass)

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
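Because head() shows only six rows, a quick count helps confirm whether Ba is really constant. The following is a minimal sketch, assuming the mlbench package is installed; `zero_summary` is a hypothetical helper name introduced here for illustration, not part of any package.

```r
# Hypothetical helper: summarise how dominated a numeric column is by zeros.
zero_summary <- function(x) {
  c(n        = length(x),
    n_zero   = sum(x == 0, na.rm = TRUE),
    n_unique = length(unique(x)),
    variance = var(x, na.rm = TRUE))
}

# Apply it to Ba when mlbench (the package used above) is available.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(zero_summary(Glass$Ba))
}
```

A predictor is zero-variance only when n_unique is 1. A column with many zeros and a handful of other values is sparse, which is a different situation and may still carry signal for some glass types.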

Let’s make a histogram of every predictor in the data frame by pivoting the data into long format and then using facet_wrap by name, so we can see all of the distributions at once.

## Reshape the predictors to long format for a faceted histogram
## Remove the response (Type) first
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

NotypeGlass <- Glass[,1:9]

NotypeLong <- NotypeGlass %>%
  pivot_longer(colnames(NotypeGlass)) %>%
  as.data.frame()

ggplot(NotypeLong,aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name,scales = "free")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Next, let’s create a correlation matrix plot and see whether correlations exist between the predictors. It appears we have quite a few correlations in both directions.

library(corrplot)

## corrplot 0.92 loaded

Glass_matrix <- Glass %>%
  select_if(is.numeric) %>%
  cor(method="pearson", use="pairwise.complete.obs")

corrplot(Glass_matrix,order = 'hclust',type="upper",method = "number")
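To read the strongest relationships off the matrix without scanning the plot, the pairs can be listed directly. This is a minimal base R sketch; `strong_pairs` and the 0.5 cutoff are assumptions made for illustration, not part of corrplot or the original analysis.

```r
# Hypothetical helper: list predictor pairs with |r| at or above a cutoff.
strong_pairs <- function(df, cutoff = 0.5) {
  r <- cor(df, use = "pairwise.complete.obs")
  # Keep each pair once by looking only at the upper triangle.
  idx <- which(upper.tri(r) & abs(r) >= cutoff, arr.ind = TRUE)
  out <- data.frame(var1 = rownames(r)[idx[, "row"]],
                    var2 = colnames(r)[idx[, "col"]],
                    r    = r[idx])
  out[order(-abs(out$r)), ]   # strongest pairs first
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(strong_pairs(Glass[, 1:9], cutoff = 0.5))
}
```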

## Create a boxplot 
ggplot(NotypeLong,aes(x = name,y = value)) +
  geom_boxplot() + labs(title = "Boxplot of Glass")

Are there any outliers within the data? Are any predictors skewed?

Based on the plots I have produced, most of the predictors appear to have outliers. Some of the predictors, such as Fe, K, and Ba, are strongly right-skewed (most values sit at or near zero, with a long right tail), while others are only slightly skewed. We can also see correlations between the predictors: RI and Si have a negative correlation of -0.54 and, most noticeably, RI and Ca have a positive correlation of 0.81.
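The direction and size of the skew, and the number of outlying points, can be checked numerically rather than read off the plots. The sketch below uses base R only; `sample_skewness` and `n_outliers` are hypothetical helper names, and the 1.5 * IQR rule is one common convention for flagging outliers, not the only one.

```r
# Sample skewness (third standardised moment).
# Positive values mean a longer right tail, negative a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Number of points outside Tukey's 1.5 * IQR fences.
n_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[, 1:9]
  print(round(sapply(predictors, sample_skewness), 2))
  print(sapply(predictors, n_outliers))
}
```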

Are there any relevant transformations that can improve the classification models?

We can try a log transformation on some of the predictors to reduce their skewness, for example on Ba and K, although some of the other predictors are very sharply skewed.

ggplot(NotypeGlass,aes(x = log(K))) +
         geom_histogram() + labs(title = "Log Transformation of K")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 30 rows containing non-finite values (`stat_bin()`).

ggplot(NotypeGlass,aes(x = log(Ba))) +
         geom_histogram() + labs(title = "Log Transformation of Ba")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 176 rows containing non-finite values (`stat_bin()`).

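Performing a log transformation on the skewed predictors seems to have helped reduce the skewness of the nonzero values. However, as the warnings show, 30 rows of K and 176 rows of Ba were dropped as non-finite because log(0) is -Inf, so these histograms describe only part of the data.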
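The rows reported as removed in the warnings above are the zeros, since log(0) is -Inf. One common alternative, sketched here as an assumption rather than the approach used in this write-up, is log1p(x) = log(1 + x), which is defined at zero. `n_non_finite` is a hypothetical helper name.

```r
# Count values that a transformation turns into -Inf, Inf or NaN.
n_non_finite <- function(x, f) sum(!is.finite(f(x)))

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  # log() loses every zero; log1p() maps 0 to 0 and keeps the row.
  print(c(K_log    = n_non_finite(Glass$K,  log),
          K_log1p  = n_non_finite(Glass$K,  log1p),
          Ba_log   = n_non_finite(Glass$Ba, log),
          Ba_log1p = n_non_finite(Glass$Ba, log1p)))
}
```

To plot the transformed predictor on all rows, `aes(x = log(K))` could be replaced by `aes(x = log1p(K))`. A spike at zero will remain, because no monotone transformation can spread out tied values.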

Problem 3.2 The Soybean data

The Soybean data has 683 observations of 36 variables: the Class response and 35 predictors.

## load the soybean data
data("Soybean")

str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Investigate The Frequency Distributions For The Categorical predictors

We can make a bar chart for each predictor.

library(skimr)

skim(Soybean)

Data summary
Name	Soybean
Number of rows	683
Number of columns	36
_______________________
Column type frequency:
factor	36
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Class	0	1.00	FALSE	19	bro: 92, alt: 91, fro: 91, phy: 88
date	1	1.00	FALSE	7	5: 149, 4: 131, 3: 118, 2: 93
plant.stand	36	0.95	TRUE	2	0: 354, 1: 293
precip	38	0.94	TRUE	3	2: 459, 1: 112, 0: 74
temp	30	0.96	TRUE	3	1: 374, 2: 199, 0: 80
hail	121	0.82	FALSE	2	0: 435, 1: 127
crop.hist	16	0.98	FALSE	4	2: 219, 3: 218, 1: 165, 0: 65
area.dam	1	1.00	FALSE	4	1: 227, 3: 187, 2: 145, 0: 123
sever	121	0.82	FALSE	3	1: 322, 0: 195, 2: 45
seed.tmt	121	0.82	FALSE	3	0: 305, 1: 222, 2: 35
germ	112	0.84	TRUE	3	1: 213, 2: 193, 0: 165
plant.growth	16	0.98	FALSE	2	0: 441, 1: 226
leaves	0	1.00	FALSE	2	1: 606, 0: 77
leaf.halo	84	0.88	FALSE	3	2: 342, 0: 221, 1: 36
leaf.marg	84	0.88	FALSE	3	0: 357, 2: 221, 1: 21
leaf.size	84	0.88	TRUE	3	1: 327, 2: 221, 0: 51
leaf.shread	100	0.85	FALSE	2	0: 487, 1: 96
leaf.malf	84	0.88	FALSE	2	0: 554, 1: 45
leaf.mild	108	0.84	FALSE	3	0: 535, 1: 20, 2: 20
stem	16	0.98	FALSE	2	1: 371, 0: 296
lodging	121	0.82	FALSE	2	0: 520, 1: 42
stem.cankers	38	0.94	FALSE	4	0: 379, 3: 191, 1: 39, 2: 36
canker.lesion	38	0.94	FALSE	4	0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies	106	0.84	FALSE	2	0: 473, 1: 104
ext.decay	38	0.94	FALSE	3	0: 497, 1: 135, 2: 13
mycelium	38	0.94	FALSE	2	0: 639, 1: 6
int.discolor	38	0.94	FALSE	3	0: 581, 1: 44, 2: 20
sclerotia	38	0.94	FALSE	2	0: 625, 1: 20
fruit.pods	84	0.88	FALSE	4	0: 407, 1: 130, 3: 48, 2: 14
fruit.spots	106	0.84	FALSE	4	0: 345, 4: 100, 1: 75, 2: 57
seed	92	0.87	FALSE	2	0: 476, 1: 115
mold.growth	92	0.87	FALSE	2	0: 524, 1: 67
seed.discolor	106	0.84	FALSE	2	0: 513, 1: 64
seed.size	92	0.87	FALSE	2	0: 532, 1: 59
shriveling	106	0.84	FALSE	2	0: 539, 1: 38
roots	31	0.95	FALSE	3	0: 551, 1: 86, 2: 15

## Use the .data pronoun to refer to each column by name
columns <- colnames(Soybean)
p <- lapply(columns, function(col) {
  ggplot(Soybean, aes(.data[[col]])) +
    geom_bar() +
    coord_flip() +
    ggtitle(col)
})
print(p)

## [[1]] to [[36]]: one bar chart per column of Soybean, each titled with the column name (plots not reproduced here)

Some of the distributions are degenerate, or close to it. For example, leaf.mild is 0 in 535 of its 575 non-missing observations, mycelium is 0 in 639 of 645, and sclerotia is 0 in 625 of 645. Lodging (520 versus 42) and a variety of other predictors show a similar imbalance.
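The imbalance can be ranked across all predictors by computing the share of observations that fall in the most frequent level. This is a minimal base R sketch; `top_level_share` is a hypothetical helper, and what share counts as "degenerate" is a judgment call that this sketch does not settle.

```r
# Share of non-missing observations taken by the most frequent level.
top_level_share <- function(x) {
  tab <- table(x)                      # table() drops NA by default
  if (sum(tab) == 0) return(NA_real_)  # column with no observed values
  max(tab) / sum(tab)
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  # Column 1 is the Class response, so it is left out.
  shares <- sort(sapply(Soybean[, -1], top_level_share), decreasing = TRUE)
  print(round(head(shares, 10), 3))
}
```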

Are there any particular predictors that are more likely to be missing?

Skem <- skim(Soybean)

ggplot(data = Skem, aes(x = reorder(skim_variable,n_missing), y = n_missing)) +
  geom_col() + coord_flip() + labs(title = "Missing Observations Per Columns Ordered", y = "# of Missing Observation", x = "Variable")

From the plot, we can see that there are quite a lot of missing values, especially for hail, sever, seed.tmt and lodging (121 missing observations each), which I would assume are not recorded for certain plant classes. Glancing at the data frame, some of the missing values make sense. For instance, the hail column is coded 0 for yes and 1 for no, so an NA may mean that hail was not recorded at all in the area where the plants were measured. The pattern of missing data is related to the class of the plants: predictors describing fruit spots, fruit pods and seeds may not pertain to every class, and phytophthora-rot, for example, has many missing values in those predictors.
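Whether the missingness follows the plant class can be checked by counting incomplete rows per class. The sketch below is base R only; `missing_by_group` is a hypothetical helper introduced for illustration.

```r
# Count rows with at least one NA, split by a grouping factor.
missing_by_group <- function(df, group) {
  incomplete <- !complete.cases(df)
  data.frame(n_rows       = as.vector(table(group)),
             n_incomplete = as.vector(tapply(incomplete, group, sum)),
             row.names    = levels(group))
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  res <- missing_by_group(Soybean[, -1], Soybean$Class)
  # Show only the classes that contain incomplete rows.
  print(res[res$n_incomplete > 0, ])
}
```

If only a few classes appear in the result, the values are not missing at random with respect to Class, which matters for the choice of imputation strategy.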

Develop A Strategy For Handling Missing Data

To handle the missing data, I would first try to get a better understanding of all the predictors and see which predictors the missing values fall in, since in this case the missingness differs by plant class. As a simple baseline I might impute each predictor with its most frequent level; because every predictor here is a factor, this mode imputation takes the place of mean imputation. Another option is to use a model, such as k-nearest neighbors, to impute the data, so that each imputed value is determined by the closest complete observations. We can also use the recorded measurements within each plant class to impute the NA values, so that each imputed value is consistent with its plant class. This is the thought process I felt would be appropriate for this dataset.
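The idea of keeping imputed values consistent with the plant class can be sketched as follows. This is a minimal illustration, not a k-nearest-neighbors implementation: it fills each missing value with the most frequent level within the same class (the categorical counterpart of a mean), and falls back to the overall most frequent level when a class has no observed values for that predictor. `mode_level` and `impute_mode_by_group` are hypothetical helper names.

```r
# Most frequent non-missing level of a factor (NA if all values are missing).
mode_level <- function(x) {
  tab <- table(x)
  if (sum(tab) == 0) return(NA_character_)
  names(tab)[which.max(tab)]
}

# Fill NAs in one factor with the mode of its group, falling back to the
# overall mode when a group has no observed values for that column.
impute_mode_by_group <- function(x, group) {
  overall <- mode_level(x)
  for (g in levels(factor(group))) {
    in_g <- !is.na(group) & group == g
    fill <- mode_level(x[in_g])
    if (is.na(fill)) fill <- overall
    x[in_g & is.na(x)] <- fill
  }
  x
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  imputed <- Soybean
  # Column 1 is Class; impute every other column within its class.
  imputed[-1] <- lapply(Soybean[-1], impute_mode_by_group,
                        group = Soybean$Class)
  print(c(before = sum(is.na(Soybean)), after = sum(is.na(imputed))))
}
```

Because this uses the response to choose the fill value, it should be fitted on training data only when the imputed predictors feed a classifier, otherwise information about Class leaks into the predictors.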


Al Haque

2024-02-25
