3.1

library(mlbench)
library(dplyr)
library(DataExplorer)
library(skimr)
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

skim(Glass)
Data summary
Name Glass
Number of rows 214
Number of columns 10
_______________________
Column type frequency:
factor 1
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Type                   0              1  FALSE           6  2: 76, 1: 70, 7: 29, 3: 17

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100  hist
RI                     0              1   1.52  0.00   1.51   1.52   1.52   1.52   1.53  ▁▇▂▁▁
Na                     0              1  13.41  0.82  10.73  12.91  13.30  13.83  17.38  ▁▇▆▁▁
Mg                     0              1   2.68  1.44   0.00   2.11   3.48   3.60   4.49  ▃▁▁▇▅
Al                     0              1   1.44  0.50   0.29   1.19   1.36   1.63   3.50  ▂▇▃▁▁
Si                     0              1  72.65  0.77  69.81  72.28  72.79  73.09  75.41  ▁▂▇▂▁
K                      0              1   0.50  0.65   0.00   0.12   0.56   0.61   6.21  ▇▁▁▁▁
Ca                     0              1   8.96  1.42   5.43   8.24   8.60   9.17  16.19  ▁▇▁▁▁
Ba                     0              1   0.18  0.50   0.00   0.00   0.00   0.00   3.15  ▇▁▁▁▁
Fe                     0              1   0.06  0.10   0.00   0.00   0.00   0.10   0.51  ▇▁▁▁▁
plot_histogram(Glass)

plot_boxplot(Glass , by="Type")

plot_correlation(Glass)

(b)

  Many of the predictors in this data are skewed. Ba, Fe, K, and Mg are all significantly skewed. Only one type of glass appears to contain a meaningful amount of Ba. Several types of glass have little or no Fe in them. All glass types appear to have little K, with one type having a few outliers (the maximum of 6.21 is far above the 75th percentile of 0.61).
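The skewness and outlier observations above can also be checked numerically. The following is a minimal sketch in base R, not part of the original analysis: `sample_skewness()` and `count_outliers()` are hypothetical helper names, the skewness is the simple moment-based estimate, and the outlier count uses the usual 1.5 × IQR boxplot rule.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Count points lying more than 1.5 * IQR beyond the quartiles.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  sum(x < q[[1]] - fence | x > q[[2]] + fence, na.rm = TRUE)
}

# Apply both helpers to the Glass predictors (skipped if mlbench is absent).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  predictors <- Glass[, names(Glass) != "Type"]
  print(round(sapply(predictors, sample_skewness), 2))
  print(sapply(predictors, count_outliers))
}
```

Positive values indicate a long right tail and negative values a long left tail; values near zero suggest a roughly symmetric distribution.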

(c)

  Depending on the modeling technique we use, we may want to center and scale the data, since most predictors vary over a very small range (RI only spans 1.51 to 1.53) while sitting at very different levels (Si averages 72.65, Fe 0.06). Putting them on a common scale may make it easier for the model to pick up differences between glass types. We may also consider Box-Cox or log transformations to reduce the skewness seen in some of the predictors, keeping in mind that both require strictly positive values, while Mg, K, Ba, and Fe all have a minimum of 0.00 in the summary above.
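As an illustration of the transformations discussed above, here is a minimal sketch in base R. It is an assumption-laden example rather than the analysis itself: `transform_predictors()` is a hypothetical helper, and `log1p()` (that is, log(1 + x)) is swapped in for Box-Cox because several Glass predictors have a minimum of zero, where a plain log or Box-Cox transformation is undefined.

```r
# Apply log1p to the named skewed columns, then center and scale every
# numeric column to mean 0 and standard deviation 1. Non-numeric columns
# (such as the Type factor) are left untouched.
transform_predictors <- function(df, skewed = character(0)) {
  for (col in skewed) {
    df[[col]] <- log1p(df[[col]])
  }
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  for (col in numeric_cols) {
    df[[col]] <- (df[[col]] - mean(df[[col]])) / sd(df[[col]])
  }
  df
}

# Example on Glass; the choice of K, Ba and Fe is an assumption based on
# the right-skewed histograms (skipped if mlbench is not installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  glass_tr <- transform_predictors(Glass, skewed = c("K", "Ba", "Fe"))
  print(summary(glass_tr[c("K", "Ba", "Fe")]))
}
```

A log-type transformation only helps with right-skewed predictors, so it would not be expected to fix a left-skewed or bimodal one.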

3.2

data(Soybean)

(a)

skim(Soybean)
Data summary
Name Soybean
Number of rows 683
Number of columns 36
_______________________
Column type frequency:
factor 36
________________________
Group variables None

Variable type: factor

skim_variable    n_missing  complete_rate  ordered  n_unique  top_counts
Class                    0           1.00  FALSE          19  bro: 92, alt: 91, fro: 91, phy: 88
date                     1           1.00  FALSE           7  5: 149, 4: 131, 3: 118, 2: 93
plant.stand             36           0.95  TRUE            2  0: 354, 1: 293
precip                  38           0.94  TRUE            3  2: 459, 1: 112, 0: 74
temp                    30           0.96  TRUE            3  1: 374, 2: 199, 0: 80
hail                   121           0.82  FALSE           2  0: 435, 1: 127
crop.hist               16           0.98  FALSE           4  2: 219, 3: 218, 1: 165, 0: 65
area.dam                 1           1.00  FALSE           4  1: 227, 3: 187, 2: 145, 0: 123
sever                  121           0.82  FALSE           3  1: 322, 0: 195, 2: 45
seed.tmt               121           0.82  FALSE           3  0: 305, 1: 222, 2: 35
germ                   112           0.84  TRUE            3  1: 213, 2: 193, 0: 165
plant.growth            16           0.98  FALSE           2  0: 441, 1: 226
leaves                   0           1.00  FALSE           2  1: 606, 0: 77
leaf.halo               84           0.88  FALSE           3  2: 342, 0: 221, 1: 36
leaf.marg               84           0.88  FALSE           3  0: 357, 2: 221, 1: 21
leaf.size               84           0.88  TRUE            3  1: 327, 2: 221, 0: 51
leaf.shread            100           0.85  FALSE           2  0: 487, 1: 96
leaf.malf               84           0.88  FALSE           2  0: 554, 1: 45
leaf.mild              108           0.84  FALSE           3  0: 535, 1: 20, 2: 20
stem                    16           0.98  FALSE           2  1: 371, 0: 296
lodging                121           0.82  FALSE           2  0: 520, 1: 42
stem.cankers            38           0.94  FALSE           4  0: 379, 3: 191, 1: 39, 2: 36
canker.lesion           38           0.94  FALSE           4  0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies        106           0.84  FALSE           2  0: 473, 1: 104
ext.decay               38           0.94  FALSE           3  0: 497, 1: 135, 2: 13
mycelium                38           0.94  FALSE           2  0: 639, 1: 6
int.discolor            38           0.94  FALSE           3  0: 581, 1: 44, 2: 20
sclerotia               38           0.94  FALSE           2  0: 625, 1: 20
fruit.pods              84           0.88  FALSE           4  0: 407, 1: 130, 3: 48, 2: 14
fruit.spots            106           0.84  FALSE           4  0: 345, 4: 100, 1: 75, 2: 57
seed                    92           0.87  FALSE           2  0: 476, 1: 115
mold.growth             92           0.87  FALSE           2  0: 524, 1: 67
seed.discolor          106           0.84  FALSE           2  0: 513, 1: 64
seed.size               92           0.87  FALSE           2  0: 532, 1: 59
shriveling             106           0.84  FALSE           2  0: 539, 1: 38
roots                   31           0.95  FALSE           3  0: 551, 1: 86, 2: 15
  Several predictors look degenerate in the bar plots below. Many of the leaf-related variables have heavily unbalanced level frequencies, and the main variable, leaves, consists almost entirely of 1 values. Other variables, such as mycelium, fruiting.bodies, int.discolor, sclerotia, seed, mold.growth, seed.discolor, seed.size, shriveling, and roots, are also potentially degenerate and need further investigation.
plot_bar(Soybean)
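The visual impression of degenerate predictors can be backed by a simple numeric rule. The sketch below is an illustration in base R, not the method used in this write-up: `freq_ratio()` and `near_zero_var()` are hypothetical helpers that flag a predictor when its most common level outnumbers the second most common by more than 95/5 and fewer than 10% of its values are distinct. These are the default cutoffs of `caret::nearZeroVar()`, but the implementation here is only an approximation of that function.

```r
# Ratio of the most frequent level to the second most frequent one.
# table() ignores NA values by default.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Flag predictors that are both dominated by one level and have few
# distinct values relative to the number of rows.
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  ratio <- vapply(df, freq_ratio, numeric(1))
  pct_unique <- vapply(
    df,
    function(x) 100 * length(unique(x[!is.na(x)])) / length(x),
    numeric(1)
  )
  data.frame(
    variable = names(df),
    freq_ratio = ratio,
    pct_unique = pct_unique,
    flagged = ratio > freq_cut & pct_unique < unique_cut,
    row.names = NULL
  )
}

# Example on Soybean (skipped if mlbench is not installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  result <- near_zero_var(Soybean)
  print(result[result$flagged, ])
}
```

Judging from the top_counts column of the summary above, only mycelium (639 to 6), sclerotia (625 to 20), and leaf.mild (535 to 20) exceed the 19:1 ratio, so under these particular cutoffs the other unbalanced predictors, including leaves (606 to 77), would not be flagged.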

(b)

plot_missing(Soybean)

  Upon further investigation, we see that the predictors with the most missing values mostly describe weather events and damage to the plant, such as hail, damage severity (sever), and lodging. Other predictors, such as mold.growth, seed.size, and several of the leaf variables, were likely poorly recorded for certain soybeans. From this view alone it is hard to determine whether there is a pattern.
profile_missing(Soybean) %>% arrange(desc(pct_missing))
  Below we look at the breakdown of missing values by class. Of the 19 classes, only five have missing values. diaporthe-pod-&-stem-blight, 2-4-d-injury, and especially phytophthora-rot account for a large portion of them.
Soybean %>% filter(!complete.cases(.)) %>% group_by(Class) %>%
  summarise(across(everything(), ~ as.factor(sum(is.na(.))))) %>%
  plot_bar(by = "Class", title = "Missing Values Per Class")

Soybean %>% filter(!complete.cases(.)) %>% group_by(Class) %>%
  summarise(across(everything(), ~ sum(is.na(.))))
  Removing just three of those five classes reduces the missing values to a negligible amount.
Soybean %>%  filter(!Class %in% c('phytophthora-rot','2-4-d-injury','diaporthe-pod-&-stem-blight')) %>% plot_missing()
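To put a number on how concentrated the missing values are, the share of all missing cells that falls in each class can be computed directly. This is a minimal base R sketch; `missing_share_by_class()` is a hypothetical helper and is not part of the original analysis.

```r
# For each class, compute the fraction of all missing predictor cells
# that belong to observations of that class.
missing_share_by_class <- function(df, class_col = "Class") {
  predictors <- df[setdiff(names(df), class_col)]
  na_per_row <- rowSums(is.na(predictors))
  na_per_class <- vapply(split(na_per_row, df[[class_col]]), sum, numeric(1))
  sort(na_per_class / sum(na_per_row), decreasing = TRUE)
}

# Example on Soybean (skipped if mlbench is not installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(round(head(missing_share_by_class(Soybean), 5), 3))
}
```

The shares sum to 1 across classes, so the first few entries show directly how much of the missingness the largest contributors account for.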

(c)

  With the above investigation in mind, we may not need to impute or eliminate any of these predictor variables. Most of our classes contain few or no missing values. It is unclear whether these missing values are caused by a sampling issue. What is clear is that if this pattern of missing values repeats, we could potentially use it to identify the class of a soybean. It is also possible that we could misclassify an observation because of this sampling issue, so we would need to pay special attention to this area when working with future data.
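One way to let a model use the missingness pattern, rather than imputing it away, is to encode NA as its own factor level. The sketch below shows this in base R under stated assumptions: `na_as_level()` and the label "missing" are hypothetical choices, and the result is an unordered factor, so the ordering of predictors such as precip or temp is dropped.

```r
# Recode NA values of a factor as an explicit level so that models
# which cannot handle NA can still use the missingness as information.
na_as_level <- function(x, label = "missing") {
  x <- as.factor(x)
  values <- as.character(x)
  values[is.na(values)] <- label
  # The extra level is only added when the variable actually has NAs.
  factor(values, levels = c(levels(x), if (anyNA(x)) label))
}

# Example on Soybean (skipped if mlbench is not installed).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  soy <- Soybean
  predictor_names <- setdiff(names(soy), "Class")
  for (col in predictor_names) {
    soy[[col]] <- na_as_level(soy[[col]])
  }
  print(sum(is.na(soy)))
  print(table(soy$hail))
}
```

Whether this is appropriate depends on whether the same missingness pattern can be expected in future data; if it stems from a one-off recording problem, a model could learn a shortcut that does not generalize.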