3.1

library(mlbench)
library(dplyr)
library(DataExplorer)
library(skimr)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a)

skim(Glass)

Data summary
Name	Glass
Number of rows	214
Number of columns	10
_______________________
Column type frequency:
factor	1
numeric	9
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Type	0	1	FALSE	6	2: 76, 1: 70, 7: 29, 3: 17

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
RI	1	1.52	0.00	1.51	1.52	1.52	1.52	1.53	▁▇▂▁▁
Na	1	13.41	0.82	10.73	12.91	13.30	13.83	17.38	▁▇▆▁▁
Mg	1	2.68	1.44	0.00	2.11	3.48	3.60	4.49	▃▁▁▇▅
Al	1	1.44	0.50	0.29	1.19	1.36	1.63	3.50	▂▇▃▁▁
Si	1	72.65	0.77	69.81	72.28	72.79	73.09	75.41	▁▂▇▂▁
K	1	0.50	0.65	0.00	0.12	0.56	0.61	6.21	▇▁▁▁▁
Ca	1	8.96	1.42	5.43	8.24	8.60	9.17	16.19	▁▇▁▁▁
Ba	1	0.18	0.50	0.00	0.00	0.00	0.00	3.15	▇▁▁▁▁
Fe	1	0.06	0.10	0.00	0.00	0.00	0.10	0.51	▇▁▁▁▁

plot_histogram(Glass)

plot_boxplot(Glass, by = "Type")

plot_correlation(Glass)

(b)

Many of the predictors in this data are skewed. Ba, Fe, K and Mg are all strongly skewed. Only one type of glass appears to contain a meaningful amount of Ba, and several types contain almost no Fe. K is low in every glass type, with one type showing a few extreme outliers.
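The visual impression above can be backed with numbers. The following is a minimal sketch in base R, not part of the original analysis: the helper names `skewness` and `n_outliers` are made up for illustration, skewness is computed as the standardized third central moment, and outliers are counted with the usual 1.5 * IQR boxplot rule.

```r
library(mlbench)
data(Glass)

# Keep the nine numeric predictors, drop the outcome
predictors <- Glass[, names(Glass) != "Type"]

# Skewness as the third central moment divided by the cubed standard
# deviation (population form); values near 0 suggest a symmetric distribution
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Number of points beyond the boxplot whiskers (1.5 * IQR rule)
n_outliers <- function(x) length(boxplot.stats(x)$out)

skew_summary <- data.frame(
  predictor  = names(predictors),
  skewness   = round(sapply(predictors, skewness), 2),
  n_outliers = sapply(predictors, n_outliers),
  row.names  = NULL
)

# Most skewed predictors first
skew_summary[order(-abs(skew_summary$skewness)), ]
```

A positive value indicates a long right tail and a negative value a long left tail, so the sign also shows the direction of the skew that the histograms only hint at.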

(c)

Depending on the modeling technique, we may want to center and scale the predictors. They sit on very different scales, and some vary over a very narrow range (RI runs only from 1.51 to 1.53), so standardizing may make it easier for a model to pick up differences between glass types. We may also consider Box-Cox or log transformations to reduce the skew seen in some predictors. Because Mg, K, Ba and Fe all contain zeros, those transformations would need an offset or a variant that accepts zero values.
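One way to try the transformations discussed above is sketched here with `preProcess()` from the caret package, which the original analysis does not load, so treat it as an assumption. The Yeo-Johnson method is swapped in for Box-Cox because Box-Cox is defined only for strictly positive values and several predictors contain zeros.

```r
library(mlbench)
library(caret)  # assumption: caret is installed; it is not used elsewhere in this report
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Estimate the transformation parameters from the data:
# Yeo-Johnson to reduce skew, then centering and scaling
pp <- preProcess(predictors, method = c("YeoJohnson", "center", "scale"))

# Apply the estimated transformations to the predictors
glass_trans <- predict(pp, predictors)

summary(glass_trans)
```

Whether this helps depends on the model. Tree-based methods are largely insensitive to monotone transformations, while distance-based and linear methods are more likely to benefit.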

3.2

data(Soybean)

(a)

skim(Soybean)

Data summary
Name	Soybean
Number of rows	683
Number of columns	36
_______________________
Column type frequency:
factor	36
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Class	0	1.00	FALSE	19	bro: 92, alt: 91, fro: 91, phy: 88
date	1	1.00	FALSE	7	5: 149, 4: 131, 3: 118, 2: 93
plant.stand	36	0.95	TRUE	2	0: 354, 1: 293
precip	38	0.94	TRUE	3	2: 459, 1: 112, 0: 74
temp	30	0.96	TRUE	3	1: 374, 2: 199, 0: 80
hail	121	0.82	FALSE	2	0: 435, 1: 127
crop.hist	16	0.98	FALSE	4	2: 219, 3: 218, 1: 165, 0: 65
area.dam	1	1.00	FALSE	4	1: 227, 3: 187, 2: 145, 0: 123
sever	121	0.82	FALSE	3	1: 322, 0: 195, 2: 45
seed.tmt	121	0.82	FALSE	3	0: 305, 1: 222, 2: 35
germ	112	0.84	TRUE	3	1: 213, 2: 193, 0: 165
plant.growth	16	0.98	FALSE	2	0: 441, 1: 226
leaves	0	1.00	FALSE	2	1: 606, 0: 77
leaf.halo	84	0.88	FALSE	3	2: 342, 0: 221, 1: 36
leaf.marg	84	0.88	FALSE	3	0: 357, 2: 221, 1: 21
leaf.size	84	0.88	TRUE	3	1: 327, 2: 221, 0: 51
leaf.shread	100	0.85	FALSE	2	0: 487, 1: 96
leaf.malf	84	0.88	FALSE	2	0: 554, 1: 45
leaf.mild	108	0.84	FALSE	3	0: 535, 1: 20, 2: 20
stem	16	0.98	FALSE	2	1: 371, 0: 296
lodging	121	0.82	FALSE	2	0: 520, 1: 42
stem.cankers	38	0.94	FALSE	4	0: 379, 3: 191, 1: 39, 2: 36
canker.lesion	38	0.94	FALSE	4	0: 320, 2: 177, 1: 83, 3: 65
fruiting.bodies	106	0.84	FALSE	2	0: 473, 1: 104
ext.decay	38	0.94	FALSE	3	0: 497, 1: 135, 2: 13
mycelium	38	0.94	FALSE	2	0: 639, 1: 6
int.discolor	38	0.94	FALSE	3	0: 581, 1: 44, 2: 20
sclerotia	38	0.94	FALSE	2	0: 625, 1: 20
fruit.pods	84	0.88	FALSE	4	0: 407, 1: 130, 3: 48, 2: 14
fruit.spots	106	0.84	FALSE	4	0: 345, 4: 100, 1: 75, 2: 57
seed	92	0.87	FALSE	2	0: 476, 1: 115
mold.growth	92	0.87	FALSE	2	0: 524, 1: 67
seed.discolor	106	0.84	FALSE	2	0: 513, 1: 64
seed.size	92	0.87	FALSE	2	0: 532, 1: 59
shriveling	106	0.84	FALSE	2	0: 539, 1: 38
roots	31	0.95	FALSE	3	0: 551, 1: 86, 2: 15

We can see several degenerate-looking variables in the summary above and in the bar plots below. Many of the leaf-related variables have heavily unbalanced level frequencies, and the main variable itself, leaves, consists almost entirely of 1 values (606 of 683). Other variables such as mycelium, fruiting.bodies, int.discolor, sclerotia, seed, mold.growth, seed.discolor, seed.size, shriveling, and roots are also potentially degenerate and need further investigation.

plot_bar(Soybean)
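The variables flagged by eye above can be screened more systematically. This is a hedged sketch using `nearZeroVar()` from the caret package (an assumption, since the original analysis does not load caret) with its default cutoffs: a predictor is flagged when the ratio of its most common to second most common value exceeds 95/5 and fewer than 10% of its values are unique.

```r
library(mlbench)
library(caret)  # assumption: caret is installed; it is not used elsewhere in this report
data(Soybean)

# Class is the outcome, so only the 35 predictors are screened
predictors <- Soybean[, names(Soybean) != "Class"]

# saveMetrics = TRUE returns freqRatio, percentUnique, zeroVar and nzv per predictor
nzv <- nearZeroVar(predictors, saveMetrics = TRUE)

# Predictors flagged as near-zero variance under the default cutoffs
nzv[nzv$nzv, ]

# Remaining predictors, ordered by how unbalanced they are
head(nzv[order(-nzv$freqRatio), ], 10)
```

With the counts shown in the `skim()` output, mycelium (639 versus 6) has a frequency ratio of about 106, well above the default cutoff of 19, while a variable such as seed (476 versus 115) is unbalanced but nowhere near that threshold.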

(b)

plot_missing(Soybean)

Upon further investigation, we see that the variables with the most missing values largely describe irregular field events and conditions, such as hail, damage severity (sever) and lodging. We also see variables such as mold growth, seed size, and various leaf measurements that were likely poorly recorded for certain soybeans. From this view alone, it is hard to determine whether there is a pattern.

profile_missing(Soybean) %>% arrange(desc(pct_missing))

Below we look at the breakdown of missing values by class. Of the 19 soybean classes, only five have any missing values. Three of them, diaporthe-pod-&-stem-blight, 2-4-d-injury and especially phytophthora-rot, account for a large share of the missing values.

Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ as.factor(sum(is.na(.))))) %>%
  plot_bar(by = "Class", title = "Missing Values Per Class")

Soybean %>%
  filter(!complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ sum(is.na(.))))

Removing just three of those five classes reduces the amount of missing data to a negligible level.

Soybean %>%
  filter(!Class %in% c("phytophthora-rot", "2-4-d-injury", "diaporthe-pod-&-stem-blight")) %>%
  plot_missing()
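The effect of dropping those three classes can also be checked numerically rather than read off the plot. This is a small base R sketch, not part of the original analysis; `pct_missing` is a helper name introduced here for illustration.

```r
library(mlbench)
data(Soybean)

# Rows with at least one missing value, tabulated by class
incomplete <- !complete.cases(Soybean)
incomplete_by_class <- table(Soybean$Class[incomplete])
incomplete_by_class[incomplete_by_class > 0]

# Overall percentage of missing cells in a data frame
pct_missing <- function(df) 100 * mean(is.na(df))

dropped <- c("phytophthora-rot", "2-4-d-injury", "diaporthe-pod-&-stem-blight")
remaining <- Soybean[!Soybean$Class %in% dropped, ]

pct_before <- pct_missing(Soybean)
pct_after  <- pct_missing(remaining)
round(c(before = pct_before, after = pct_after), 2)
```

Comparing the two percentages gives a single figure for how much of the missingness is concentrated in those classes.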

(c)

With the above investigation in mind, we may not need to impute or eliminate any of these predictor variables. Most of our classes contain few or no missing values. It is unclear whether the missing values are caused by a sampling issue. What is clear is that if this pattern of missing values repeats, we could potentially use it to help identify the class of a soybean sample. It is also possible that we could misclassify a sample because of this sampling error, so we would need to pay special attention to this area when working with future data.
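If the missingness itself is to be used as a signal, one option is to keep it as an explicit factor level instead of imputing. The following base R sketch is only one possible approach and is not taken from the original analysis; the helper `na_to_level` and the label "missing" are arbitrary choices made for illustration.

```r
library(mlbench)
data(Soybean)

# Turn NA into its own factor level called "missing"
na_to_level <- function(x) {
  x <- addNA(x, ifany = TRUE)               # adds NA as a level only if NAs occur
  levels(x)[is.na(levels(x))] <- "missing"  # give the NA level a readable name
  x
}

soy_explicit <- Soybean
predictor_cols <- names(soy_explicit) != "Class"
soy_explicit[predictor_cols] <- lapply(soy_explicit[predictor_cols], na_to_level)

# No NA values remain; missingness is now an ordinary category
sum(is.na(soy_explicit))
table(soy_explicit$hail)
```

One caveat is that for the ordered factors (for example precip or temp) the new level is appended as the highest level, which has no real meaning, so those columns may be better treated as unordered after this step.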

Data 624: Homework Four

Zachary Safir

2/28/2022
