Data624HW4

3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

data(Glass)
glasspredictors <- Glass[,1:9]
glasspredictors %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()+
    ggtitle("Glass Predictor Variables - Histograms")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

glasspredictors %>%
  gather() %>% 
  ggplot(aes(value)) +
  geom_density() +
  facet_wrap(~key, scales = 'free')+
  ggtitle("Glass Predictor Variables - Histograms")

pairs(glasspredictors, main="Glass Predictor Variables - Pairs Plot")

### Correlation plot

r <-cor(glasspredictors)
corrplot.mixed(r, 
               lower.col = "black",
               number.cex = .7,
               title="Glass Predictor Variables - Correlation Plot",
               mar=c(0,0,1,0))

From the above plots, we can see that some of the vatiables are reasonably well centered (Al, Na), some of them are skewed (Mg) and there are also a few of them that seems to have a high proportion of zero or near-zero weights (Fe, Ba) Relationships: The pairs plot doesn’t show any non-linear relationships In the correlation plot tells us that Ri seems to have the highest degree of relation. Ri & CA sharing the highest positive correlation and Ri and Mg having the lowest correlation.

B - Do there appear to be any outliers in the data? Are any predictors skewed?

df <- Glass #copy to new data frame so can edit without changing original
df[is.na(df)] <- 0  # replace missing values with zero
indx <- sapply(df, is.factor) 
df[indx] <- lapply(df[indx], function(x) as.numeric(as.character(x)))

Box Plots

f <-colnames(df) 
par(mfrow=c(2,5)) # set box plots to group in 2 rows of 5 columns each 
for (i in 1:length(f)){boxplot(df[,i],main = f[i])} #plot box plots

Box plots indicate presence of outliers in all elements except Mg. I would expect outliers due to impurities introduced in the glass manufacturing process.

As shown in the density plots above, other than Si most of the element distributions are right skewed.

c

Are there any relevant transformations of one or more predictors that might improve the classification model? Since Be, Fe, and K have a strong right skewness with a concentrations of points with low values, they may benefit from a log transformation. Mg may also be log transformed since it is left skewed. Box-cox could be used to address the skew.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes ### A Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)
#number of unique values per col
incl.nas <- sapply(sapply(Soybean,unique),length)
no.nas <- sapply(sapply(Soybean[complete.cases(Soybean),],unique),length)
  
r <- t(rbind(incl.nas,no.nas))
row.names(r) <- colnames(Soybean)
kable(r)

	incl.nas	no.nas
Class	19	15
date	8	7
plant.stand	3	2
precip	4	3
temp	4	3
hail	3	2
crop.hist	5	4
area.dam	5	4
sever	4	3
seed.tmt	4	3
germ	4	3
plant.growth	3	2
leaves	2	2
leaf.halo	4	3
leaf.marg	4	3
leaf.size	4	3
leaf.shread	3	2
leaf.malf	3	2
leaf.mild	4	3
stem	3	2
lodging	3	2
stem.cankers	5	4
canker.lesion	5	4
fruiting.bodies	3	2
ext.decay	4	2
mycelium	3	2
int.discolor	4	3
sclerotia	3	2
fruit.pods	5	3
fruit.spots	5	4
seed	3	2
mold.growth	3	2
seed.discolor	3	2
seed.size	3	2
shriveling	3	2
roots	4	3

The table above shows the unique-value-count by variable. Based on this table it does not appear as though there are any variables with degenerate distributions of 2 values present in the data.

B

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Soybean.incomplete <- Soybean[!complete.cases(Soybean),]
missing.cols <- Soybean.incomplete %>%
  select(everything()) %>%  
  summarise_all(funs(sum(is.na(.))))

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

missing.cols <- t(missing.cols/nrow(Soybean))
missing.cols <- missing.cols[order(-missing.cols),]
kable(missing.cols)

	x
hail	0.1771596
sever	0.1771596
seed.tmt	0.1771596
lodging	0.1771596
germ	0.1639824
leaf.mild	0.1581259
fruiting.bodies	0.1551977
fruit.spots	0.1551977
seed.discolor	0.1551977
shriveling	0.1551977
leaf.shread	0.1464129
seed	0.1346999
mold.growth	0.1346999
seed.size	0.1346999
leaf.halo	0.1229868
leaf.marg	0.1229868
leaf.size	0.1229868
leaf.malf	0.1229868
fruit.pods	0.1229868
precip	0.0556369
stem.cankers	0.0556369
canker.lesion	0.0556369
ext.decay	0.0556369
mycelium	0.0556369
int.discolor	0.0556369
sclerotia	0.0556369
plant.stand	0.0527086
roots	0.0453880
temp	0.0439239
crop.hist	0.0234261
plant.growth	0.0234261
stem	0.0234261
date	0.0014641
area.dam	0.0014641
Class	0.0000000
leaves	0.0000000

case.count <- Soybean.incomplete %>% 
  group_by(Class) %>%
  tally()
na.count <- aggregate(Soybean.incomplete, list(Soybean.incomplete$Class), function(x) sum(is.na(x)))
case.count$NAs <- data.frame(rowSums(na.count[2:ncol(na.count)]))
colnames(case.count) <- c("Class","Incomeplete.Cases", "NA.Values")
case.count$NA.Per.Case <- case.count$NA.Values / case.count$Incomeplete.Cases
kable(case.count)

Class	Incomeplete.Cases	NA.Values	NA.Per.Case
2-4-d-injury	16	450	28.12500
cyst-nematode	14	336	24.00000
diaporthe-pod-&-stem-blight	15	177	11.80000
herbicide-injury	8	160	20.00000
phytophthora-rot	68	1214	17.85294

Variables There does seem to be a pattern in some of the variables which are missing.

crop damage (hail, lodging, severe weather) appear to be among the most common missing variables (~18%)

Classes Of the 19 classes, the missing data appear to be confined to only 5 distinct classes. We see that of these, the majority of the missing data are in phytophthora-rot.

C

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Impute Zeros

Soybean$hail[is.na(Soybean$hail)] <- 0
Soybean$sever[is.na(Soybean$hail)] <- 0

Knn Impute For the remaining data we’ll use KNN (k=10) to impute.

df <- data.frame(Soybean)
Soybean.impute <- knnImputation(df, k = 10, scale = T, meth = "mode",
              distData = NULL)
nrow(Soybean.impute[!complete.cases(Soybean.impute),])

## [1] 0

The number of incomplete cases is now 0.

Data624HW4

Jaya Veluri

2/25/2024

3.1

B - Do there appear to be any outliers in the data? Are any predictors skewed?

Box Plots

c

3.2

B

C