Chapter 3

Exercise 3.1: The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

suppressMessages(suppressWarnings(library(mlbench)))
suppressMessages(suppressWarnings(library(car)))
suppressMessages(suppressWarnings(library(caret)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(corrgram)))
suppressMessages(suppressWarnings(library(psych)))
suppressMessages(suppressWarnings(library(moments)))
suppressMessages(suppressWarnings(library(mice)))
suppressMessages(suppressWarnings(library(Amelia)))
suppressMessages(suppressWarnings(library(kableExtra)))
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Let's plot histograms to see the distributions.

a <- Glass[,1:9]
par(mfrow = c(3, 3))
for (i in 1:ncol(a)) {
  hist(a[ ,i], xlab = names(a[i]), main = paste(names(a[i]), "Histogram"), col="blue")  
}

Finding correlations: The correlation plot below shows how variables in the dataset are related to each other.

names(Glass)

##  [1] "RI"   "Na"   "Mg"   "Al"   "Si"   "K"    "Ca"   "Ba"   "Fe"   "Type"

cor(drop_na(Glass[,1:9]))

##               RI          Na           Mg          Al          Si
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073
##               K         Ca            Ba           Fe
## RI -0.289832711  0.8104027 -0.0003860189  0.143009609
## Na -0.266086504 -0.2754425  0.3266028795 -0.241346411
## Mg  0.005395667 -0.4437500 -0.4922621178  0.083059529
## Al  0.325958446 -0.2595920  0.4794039017 -0.074402151
## Si -0.193330854 -0.2087322 -0.1021513105 -0.094200731
## K   1.000000000 -0.3178362 -0.0426180594 -0.007719049
## Ca -0.317836155  1.0000000 -0.1128409671  0.124968219
## Ba -0.042618059 -0.1128410  1.0000000000 -0.058691755
## Fe -0.007719049  0.1249682 -0.0586917554  1.000000000

pairs.panels(Glass[1:9])

From the plots above, RI, Na, Al, and Si have approximately normal distributions, while the other predictors do not. RI and Ca are strongly positively correlated (0.81). The remaining pairs show only weak to moderate correlations, the largest in magnitude being RI and Si (-0.54).
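Reading a 9 x 9 correlation matrix by eye is error-prone, so the strongest pairs can also be pulled out programmatically. The following is a minimal base R sketch. The helper name `strong_pairs` and the 0.75 cutoff are assumptions made for illustration, and the data are synthetic.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation
# exceeds a cutoff (0.75 is an assumed threshold, not taken from the text).
strong_pairs <- function(df, cutoff = 0.75) {
  cm <- cor(df, use = "complete.obs")
  # Look only at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Small synthetic example: y tracks x closely, z is unrelated noise
set.seed(1)
x <- rnorm(100)
toy <- data.frame(x = x, y = x + rnorm(100, sd = 0.2), z = rnorm(100))
strong_pairs(toy)
```

Applied to `Glass[, 1:9]`, this should report only the RI and Ca pair, since 0.81 is the only off-diagonal entry in the matrix above that exceeds 0.75 in magnitude.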

b. Do there appear to be any outliers in the data? Are any predictors skewed?

Let's use boxplots to find the outliers and density plots to assess the skewness of the predictors.

a <- Glass[,1:9]
par(mfrow = c(3, 3))
for (i in 1:ncol(a)) {
  boxplot(a[ ,i], ylab = names(a[i]), horizontal=T,
          main = paste(names(a[i]), "Boxplot"), col="blue")
}

for (i in 1:ncol(a)) {
  d <- density(a[,i], na.rm = TRUE)
  plot(d, main = paste(names(a[i]), "Density"))
  polygon(d, col="blue")
}

In terms of outliers, Mg is the only predictor without any. RI, Na, Al, Si, K, and Fe all show outliers, and Ca and Ba show the most.
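The outlier comparison above is based on the boxplots. The same 1.5 x IQR whisker rule can be turned into a count per predictor. The sketch below uses `boxplot.stats()` from base R on a toy data frame. The helper name `count_outliers` is an assumption made for illustration.

```r
# Count points beyond the boxplot whiskers (1.5 * IQR rule) in each column
count_outliers <- function(df) {
  sapply(df, function(col) length(boxplot.stats(col)$out))
}

# Toy data: the second column has one extreme value
toy <- data.frame(
  clean  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  spiked = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
)
count_outliers(toy)
```

Calling `count_outliers(Glass[, 1:9])` would give a numeric counterpart to the visual ranking, under the same whisker rule the boxplots use by default.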

Skewness:

RI: right skewed

Na: right skewed

Mg: left skewed

Al: looks approximately normal

Si: looks approximately normal

K: right skewed

Ca: right skewed

Ba: right skewed

Fe: right skewed
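Judging the direction of skew from density plots can be ambiguous, so a numeric check helps. The sketch below computes the moment-based sample skewness in base R on synthetic data. A positive value indicates a long right tail and a negative value a long left tail. The function name `sample_skewness` is an assumption. The moments package loaded earlier provides a `skewness()` function that could be used instead.

```r
# Sample skewness: third central moment divided by the second central
# moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
right_tailed <- rexp(500)     # long right tail
left_tailed  <- -rexp(500)    # mirror image, long left tail
symmetric    <- rnorm(500)    # no systematic skew

round(c(right = sample_skewness(right_tailed),
        left  = sample_skewness(left_tailed),
        sym   = sample_skewness(symmetric)), 2)
```

Running `sapply(Glass[, 1:9], sample_skewness)` would give one number per predictor to compare against the list above.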

c. Are there any relevant transformations of one or more predictors that might improve the classification model?

We can use a Box-Cox transformation to reduce skewness. caret's preProcess estimates a power (lambda) for each predictor, and here we also center and scale the data.

bx <- preProcess(Glass[-10], method=c('BoxCox', 'center', 'scale'))
Glass1 <- predict(bx, Glass[-10])

par(mfrow = c(3, 3))
for (i in 1:ncol(Glass1)) {
  boxplot(Glass1[ ,i], ylab = names(Glass1[i]), horizontal=T,
          main = paste(names(Glass1[i]), "Boxplot"), col="blue")
}

for (i in 1:ncol(Glass1)) {
  d <- density(Glass1[,i], na.rm = TRUE)
  plot(d, main = paste(names(Glass1[i]), "Density"))
  polygon(d, col="blue")
}

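After the transformation, the skewness of Na and Ca has improved. Box-Cox requires strictly positive values, so predictors that contain zeros, such as Ba and Fe, cannot benefit from it.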
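To make the Box-Cox step less of a black box, here is a minimal base R sketch of the transformation and of choosing lambda by maximizing the profile log-likelihood. It is not caret's implementation. The function names, the search interval of -2 to 2, and the synthetic log-normal data are all assumptions made for illustration.

```r
# Box-Cox transform for a strictly positive vector
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda, up to an additive constant
bc_loglik <- function(lambda, x) {
  y <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

set.seed(7)
x <- rlnorm(300)   # log-normal data, so lambda near 0 (a log) is expected

# Assumed search range for lambda
lambda_hat <- optimize(bc_loglik, interval = c(-2, 2), x = x,
                       maximum = TRUE)$maximum
transformed <- box_cox(x, lambda_hat)
lambda_hat
```

The `stopifnot(all(x > 0))` guard makes the positivity requirement explicit: a column with zeros fails immediately instead of silently producing infinite values.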

Exercise 3.2: The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes. The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

a. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Degenerate distributions are those where the predictor variable has a single unique value or a handful of unique values that occur with very low frequencies.

S1 <- Soybean[,2:36]
par(mfrow = c(3, 6))
for (i in 1:ncol(S1)) {
  smoothScatter(S1[ ,i], ylab = names(S1[i]))
}

nearZeroVar(S1, names = TRUE, saveMetrics=T)

##                  freqRatio percentUnique zeroVar   nzv
## date              1.137405     1.0248902   FALSE FALSE
## plant.stand       1.208191     0.2928258   FALSE FALSE
## precip            4.098214     0.4392387   FALSE FALSE
## temp              1.879397     0.4392387   FALSE FALSE
## hail              3.425197     0.2928258   FALSE FALSE
## crop.hist         1.004587     0.5856515   FALSE FALSE
## area.dam          1.213904     0.5856515   FALSE FALSE
## sever             1.651282     0.4392387   FALSE FALSE
## seed.tmt          1.373874     0.4392387   FALSE FALSE
## germ              1.103627     0.4392387   FALSE FALSE
## plant.growth      1.951327     0.2928258   FALSE FALSE
## leaves            7.870130     0.2928258   FALSE FALSE
## leaf.halo         1.547511     0.4392387   FALSE FALSE
## leaf.marg         1.615385     0.4392387   FALSE FALSE
## leaf.size         1.479638     0.4392387   FALSE FALSE
## leaf.shread       5.072917     0.2928258   FALSE FALSE
## leaf.malf        12.311111     0.2928258   FALSE FALSE
## leaf.mild        26.750000     0.4392387   FALSE  TRUE
## stem              1.253378     0.2928258   FALSE FALSE
## lodging          12.380952     0.2928258   FALSE FALSE
## stem.cankers      1.984293     0.5856515   FALSE FALSE
## canker.lesion     1.807910     0.5856515   FALSE FALSE
## fruiting.bodies   4.548077     0.2928258   FALSE FALSE
## ext.decay         3.681481     0.4392387   FALSE FALSE
## mycelium        106.500000     0.2928258   FALSE  TRUE
## int.discolor     13.204545     0.4392387   FALSE FALSE
## sclerotia        31.250000     0.2928258   FALSE  TRUE
## fruit.pods        3.130769     0.5856515   FALSE FALSE
## fruit.spots       3.450000     0.5856515   FALSE FALSE
## seed              4.139130     0.2928258   FALSE FALSE
## mold.growth       7.820896     0.2928258   FALSE FALSE
## seed.discolor     8.015625     0.2928258   FALSE FALSE
## seed.size         9.016949     0.2928258   FALSE FALSE
## shriveling       14.184211     0.2928258   FALSE FALSE
## roots             6.406977     0.4392387   FALSE FALSE

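A few predictors are degenerate because one value dominates and the others occur with very low frequency. nearZeroVar flags leaf.mild, mycelium, and sclerotia as near-zero variance, with mycelium (frequency ratio 106.5) and sclerotia (31.25) the most extreme. Their smoothed density scatterplots show essentially one color across the chart. int.discolor has a fairly high frequency ratio (13.2) but is not flagged.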
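The two metrics in the nearZeroVar output can be reproduced by hand, which clarifies why some predictors are flagged and others are not. By default caret flags a predictor when freqRatio exceeds 95/5 = 19 and percentUnique is below 10. The base R sketch below computes both metrics for toy factors. The helper name `degeneracy_metrics` is an assumption, and this is not caret's own code.

```r
# freqRatio: count of the most common value / count of the second most common.
# percentUnique: number of distinct observed values as a percentage of rows.
degeneracy_metrics <- function(x) {
  # factor() drops unused levels; table() ignores NA by default
  counts <- sort(table(factor(x)), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq_ratio, percentUnique = pct_unique)
}

balanced <- factor(rep(c("0", "1"), times = c(50, 50)))
lopsided <- factor(rep(c("0", "1"), times = c(97, 3)))

rbind(balanced = degeneracy_metrics(balanced),
      lopsided = degeneracy_metrics(lopsided))
```

With the default cutoffs, `lopsided` (ratio of about 32.3) would be flagged and `balanced` (ratio 1) would not, which mirrors why leaf.mild at 26.75 is flagged while int.discolor at 13.2 is not.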

b. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The missing values heat map and the counts are given below.

Non_NAs <- sapply(Soybean, function(y) sum(!is.na(y)))
NAs <- sapply(Soybean, function(y) sum(is.na(y)))
NA_Percent <- NAs / (NAs + Non_NAs)
NA_SUMMARY <- data.frame(Non_NAs,NAs,NA_Percent)
missmap(Soybean, main = "Missing Values")

kable(NA_SUMMARY)

	Non_NAs	NAs	NA_Percent
Class	683	0	0.0000000
date	682	1	0.0014641
plant.stand	647	36	0.0527086
precip	645	38	0.0556369
temp	653	30	0.0439239
hail	562	121	0.1771596
crop.hist	667	16	0.0234261
area.dam	682	1	0.0014641
sever	562	121	0.1771596
seed.tmt	562	121	0.1771596
germ	571	112	0.1639824
plant.growth	667	16	0.0234261
leaves	683	0	0.0000000
leaf.halo	599	84	0.1229868
leaf.marg	599	84	0.1229868
leaf.size	599	84	0.1229868
leaf.shread	583	100	0.1464129
leaf.malf	599	84	0.1229868
leaf.mild	575	108	0.1581259
stem	667	16	0.0234261
lodging	562	121	0.1771596
stem.cankers	645	38	0.0556369
canker.lesion	645	38	0.0556369
fruiting.bodies	577	106	0.1551977
ext.decay	645	38	0.0556369
mycelium	645	38	0.0556369
int.discolor	645	38	0.0556369
sclerotia	645	38	0.0556369
fruit.pods	599	84	0.1229868
fruit.spots	577	106	0.1551977
seed	591	92	0.1346999
mold.growth	591	92	0.1346999
seed.discolor	577	106	0.1551977
seed.size	591	92	0.1346999
shriveling	577	106	0.1551977
roots	652	31	0.0453880

Soybean %>%
mutate(Total = n()) %>% 
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n(), Proportion=Missing/Total) %>%
select(Class, Missing, Proportion) %>%
unique()

## # A tibble: 5 x 3
## # Groups:   Class [5]
##   Class                       Missing Proportion
##   <fct>                         <int>      <dbl>
## 1 phytophthora-rot                 68     0.0996
## 2 diaporthe-pod-&-stem-blight      15     0.0220
## 3 cyst-nematode                    14     0.0205
## 4 2-4-d-injury                     16     0.0234
## 5 herbicide-injury                  8     0.0117

The table above shows the number of missing values for each variable. hail, sever, seed.tmt, and lodging are missing most often (121 of 683 rows, about 17.7% each), while Class and leaves are never missing. To check whether the pattern of missing data is related to the classes, we filter the incomplete rows, group them by class, and count them with dplyr. Only five classes contain incomplete rows, so the pattern is clearly related to the classes. The majority are in the phytophthora-rot class, whose incomplete rows make up nearly 10% of the whole dataset. Each of the other four classes accounts for only about 1% to 2%.
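The proportions in the tibble above are relative to all 683 rows, so a large class can look worse than a small one even if a smaller share of its own rows is incomplete. A complementary view is the share of incomplete rows within each class. The base R sketch below shows the idea on a toy data frame. The helper name `incomplete_by_class` and the toy columns are assumptions made for illustration.

```r
# Share of rows with at least one missing value, computed within each class
incomplete_by_class <- function(df, class_col) {
  incomplete <- !complete.cases(df)
  tapply(incomplete, df[[class_col]], mean)
}

toy <- data.frame(
  Class = c("a", "a", "a", "a", "b", "b"),
  p1    = c(1, NA, 3, 4, NA, NA),
  p2    = c("x", "y", NA, "z", "w", "v")
)
incomplete_by_class(toy, "Class")
```

A value of 1 for a class means every row in that class has at least one missing predictor. `incomplete_by_class(Soybean, "Class")` would show whether that is the case for any of the five affected classes.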

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Missing values can be handled in several ways. The simplest is to delete the affected rows. Alternatively, missing values can be replaced with the median when the data are skewed, the mean when the data are approximately normal, or the mode when the data are non-numeric. More elaborate approaches predict the missing values with a regression model. One such approach is MICE: the mice() function in the mice package performs Multivariate Imputation by Chained Equations on multivariate datasets with missing values. It offers many imputation methods; here we use predictive mean matching (PMM).

Soybean1 <- mice(Soybean, method="pmm", printFlag=F, seed=112)

## Warning: Number of logged events: 1668

Soybean1 <- complete(Soybean1)
Soybean1 <- as.data.frame(Soybean1)
missmap(Soybean1, main = "Missing Values")

The map confirms that there are no missing values left in the imputed dataset.
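For comparison, the simpler mode-replacement strategy mentioned above for non-numeric data can be sketched in a few lines of base R. This is a baseline only, not the PMM method used in this answer. The helper name `impute_mode` and the toy data are assumptions made for illustration.

```r
# Replace NAs in each factor column with that column's most frequent level.
# A simple baseline; it ignores relationships between predictors.
impute_mode <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col) && anyNA(col)) {
      mode_level <- names(which.max(table(col)))  # ties go to the first level
      col[is.na(col)] <- mode_level
    }
    col
  })
  df
}

toy <- data.frame(
  hail  = factor(c("0", "0", NA, "1", "0")),
  sever = factor(c("1", NA, "1", "2", NA))
)
filled <- impute_mode(toy)
colSums(is.na(filled))
```

Mode imputation pushes even more weight onto the dominant level, so it can make already lopsided predictors more degenerate. That is one reason to prefer a model-based method such as MICE here.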

Data624 Data PreProcessing Assignment4

Ritesh Lohiya

February 27, 2019
