DATA 624 Homework #4

Exercise 3.1: `Glass` dataset

The Glass dataset includes 214 observations of 10 variables. The variables include:

9 predictor variables (RI through Fe), all of which are numeric
1 target variable (Type) which is categorical and takes the values (1, 2, 3, 5, 6, 7).

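# load the packages used throughout; e1071 is assumed to be the source of skewness()
pkgs <- c("mlbench", "ggplot2", "GGally", "corrplot", "e1071", "caret", "dplyr", "knitr")
invisible(lapply(pkgs, library, character.only = TRUE))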
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

attach(Glass)

a. Distributions and variable relationships

To start, we can use the ggpairs function from the GGally package to visualize the distributions of the 9 predictor variables as well as their bivariate scatterplots and correlations. In this chart, the data points and the correlations are conditioned on the target Type variable.

# pairs plot of the 9 predictors, colored by Type (Type itself is not plotted)
ggpairs(Glass, columns = 1:9, mapping = aes(colour = Type), 
        title = "Pairs plot of predictor variables")

From these plots, we can see that:

Some predictors appear to be approximately symmetric: Na, Al, and Si
Other predictors appear to be highly skewed: K and Ba
Some predictor variables appear highly correlated: RI and Ca.

Close-up examples of these variables are shown below:

Distribution of Na: example of approximately symmetric distribution
Distribution of K: example of highly skewed distribution
Scatterplot of RI vs. Ca: highly correlated relationship.

ggplot(Glass, aes(x = Na, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of Na: example of approximately symmetric distribution")

ggplot(Glass, aes(x = K, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of K: example of highly skewed distribution")

ggplot(Glass, aes(x = Ca, y = RI, col = Type)) + 
  geom_point() + 
  geom_smooth(aes(x = Ca, y = RI), inherit.aes = FALSE, method = "lm", se = FALSE) + 
  labs(title = "RI vs. Ca: example of correlated variables")

We can take an alternative look at relationships between the predictors by viewing the correlation matrix, using the corrplot function from the corrplot package. It is apparent that the largest correlations in absolute value are:

RI and Ca: positive (0.81)
Al and Ba: positive (0.48)
RI and Si: negative (-0.54)
Mg and Ba: negative (-0.49)
Mg and Al: negative (-0.48)

# correlation plot excluding the target variable
corrs <- cor(Glass[ , -10]) 
corrplot(corrs)
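
The ranking of correlations listed above can also be read off programmatically instead of from the plot. The following is a minimal base-R sketch; `top_correlations` is a hypothetical helper introduced here for illustration (it is not part of corrplot), and the toy data frame is made up purely to show the mechanics.

```r
# Hypothetical helper: list the n largest pairwise correlations by absolute value
top_correlations <- function(df, n = 5) {
  corrs <- cor(df)
  # keep each pair once by using the upper triangle only
  idx <- which(upper.tri(corrs), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(corrs)[idx[, "row"]],
    var2 = colnames(corrs)[idx[, "col"]],
    corr = corrs[idx],
    stringsAsFactors = FALSE
  )
  # sort by absolute correlation, largest first
  out <- out[order(abs(out$corr), decreasing = TRUE), ]
  rownames(out) <- NULL
  head(out, n)
}

# Toy data for illustration only (not the Glass measurements)
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 6, 8, 11),
                  c = c(5, 3, 4, 1, 2))
top_correlations(toy, n = 2)
```

Applied to this section's data, the call would be `top_correlations(Glass[ , -10])`.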

b. Outliers and skew

We can quantify skewness in the predictors by computing the skewness statistic and the ratio of the highest to the lowest value. These statistics confirm our observations from the density plots below that K, Ba, and Ca all have highly skewed distributions.

# skew statistics
skewValues <- apply(Glass[ , -10], 2, skewness)

# high-to-low ratios; 0.1 is added to the minimum to prevent division by 0
# (the offset also deflates ratios for small-valued variables, e.g. RI falls below 1)
hiloRatios <- apply(Glass[ , -10], 2, function(x) max(x) / min(x + 0.1))

cbind(Skew = skewValues, Hilo = hiloRatios) %>%
  kable(digits = 2, 
        col.names=c("Skew statistic", "High-to-Low ratio"), 
        caption = "Predictors with Skewed Distributions")

Predictors with Skewed Distributions
	Skew statistic	High-to-Low ratio
RI	1.60	0.95
Na	0.45	1.60
Mg	-1.14	44.90
Al	0.89	8.97
Si	-0.72	1.08
K	6.46	62.10
Ca	2.02	2.93
Ba	3.37	31.50
Fe	1.73	5.10
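
To make the two statistics in the table concrete, here is a minimal base-R sketch of how they can be computed. `skew_stat` and `hilo_ratio` are hypothetical helpers written for this illustration. The skewness formula shown is one common definition; `skewness` in e1071 offers several small-sample variants through its `type` argument, so its values may differ slightly. The toy vectors are made up.

```r
# Hypothetical helper: third central moment divided by the cubed sample
# standard deviation (one common definition of the skewness statistic)
skew_stat <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Hypothetical helper: high-to-low ratio with an offset guarding against a zero minimum
hilo_ratio <- function(x, offset = 0.1) {
  max(x) / (min(x) + offset)
}

# Toy vectors for illustration only (not the Glass measurements)
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(0, 0, 0, 0, 1, 6)

skew_stat(symmetric)       # zero for a symmetric sample
skew_stat(right_skewed)    # positive for a long right tail
hilo_ratio(right_skewed)   # large, because the minimum is 0
```

A positive statistic indicates a long right tail (as for K and Ba), and a negative one a long left tail (as for Mg).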

ggplot(Glass, aes(x = K, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of K: highly skewed distribution")

ggplot(Glass, aes(x = Ba, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of Ba: highly skewed distribution")

ggplot(Glass, aes(x = Ca, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Density of Ca: highly skewed distribution")

From the boxplots below, outliers in these variable distributions are apparent:

K: outliers in Types 5 and 7
Ba: outliers in Types 2 and 5
Ca: outliers in Types 2, 6, and 7

ggplot(Glass) + 
  geom_boxplot(aes(x = Type, y = K, fill = Type)) + 
  labs(title = "Distribution of K by Type")

ggplot(Glass) + 
  geom_boxplot(aes(x = Type, y = Ba, fill = Type)) + 
  labs(title = "Distribution of Ba by Type")

ggplot(Glass) + 
  geom_boxplot(aes(x = Type, y = Ca, fill = Type)) + 
  labs(title = "Distribution of Ca by Type")
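
The outliers read off the boxplots can also be counted per Type. The sketch below uses `boxplot.stats` from base R, which flags points more than 1.5 times the interquartile range beyond the hinges. `count_outliers` is a hypothetical helper, the toy data are made up, and the counts may differ slightly from what ggplot2 draws because the two may compute quartiles differently.

```r
# Hypothetical helper: number of points beyond the boxplot whiskers in each group
count_outliers <- function(x, group) {
  vapply(split(x, group),
         function(v) length(boxplot.stats(v)$out),
         integer(1))
}

# Toy data for illustration only (not the Glass measurements)
x     <- c(1, 2, 3, 4, 100,  1, 2, 3, 4, 5)
group <- rep(c("A", "B"), each = 5)

count_outliers(x, group)
```

For the data in this section, the equivalent call would be `count_outliers(Glass$K, Glass$Type)`, and likewise for Ba and Ca.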

c. Transformations

Some transformations of the predictors that might improve a classification model include:

Box-Cox transformations of the highly skewed variables
Normalizing the variables by centering and scaling.
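
As a small illustration of centering and scaling, the sketch below applies base R's `scale` to the first four K and Ca values shown in the `str(Glass)` output. In caret, `preProcess` with `method = c("center", "scale")` is the usual route; the base-R version is used here only to keep the example self-contained.

```r
# First four K and Ca values, copied from the str(Glass) output above
toy <- data.frame(K  = c(0.06, 0.48, 0.39, 0.57),
                  Ca = c(8.75, 7.83, 7.78, 8.22))

# scale() subtracts each column mean and divides by each column standard deviation
scaled <- scale(toy, center = TRUE, scale = TRUE)

round(colMeans(scaled), 10)   # each column now has mean 0
apply(scaled, 2, sd)          # and standard deviation 1
```

After this step the predictors are on a common scale, which matters for models that are sensitive to the units of the inputs.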

We can visualize the distributions of the highly skewed variables (K, Ba, and Ca) before and after the Box-Cox transformation in the density plots below. Generally, the Box-Cox transformation reduces the skewness of the overall variable distributions, although some of the distributions conditioned on Type may remain skewed.

# box-cox transformation; 0.1 is added because Box-Cox requires strictly
# positive values and K and Ba contain zeros
K_bc <- BoxCoxTrans(Glass$K + 0.1)
K_trans <- predict(K_bc, Glass$K + 0.1)
Ba_bc <- BoxCoxTrans(Glass$Ba + 0.1)
Ba_trans <- predict(Ba_bc, Glass$Ba + 0.1)
Ca_bc <- BoxCoxTrans(Glass$Ca + 0.1)
Ca_trans <- predict(Ca_bc, Glass$Ca + 0.1)

# plot distributions before & after box-cox
ggplot(Glass, aes(x = K, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of K before Box-Cox")

ggplot(Glass, aes(x = K_trans, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of K after Box-Cox")

ggplot(Glass, aes(x = Ba, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of Ba before Box-Cox")

ggplot(Glass, aes(x = Ba_trans, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of Ba after Box-Cox")

ggplot(Glass, aes(x = Ca, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of Ca before Box-Cox")

ggplot(Glass, aes(x = Ca_trans, fill = Type, col = Type)) + 
  geom_density() + 
  geom_rug() + 
  labs(title = "Distribution of Ca after Box-Cox")
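
For reference, the Box-Cox family itself is simple to write down: (x^lambda - 1) / lambda for a nonzero lambda, and log(x) when lambda is 0. The sketch below takes lambda as given. `box_cox` is a hypothetical helper and does not reproduce how `BoxCoxTrans` estimates lambda from the data.

```r
# Hypothetical helper: Box-Cox transform of strictly positive values for a given lambda
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# First four K values from the str(Glass) output, shifted by 0.1 as in the code above
k_shifted <- c(0.06, 0.48, 0.39, 0.57) + 0.1

box_cox(k_shifted, lambda = 0)     # log transform
box_cox(k_shifted, lambda = 0.5)   # square-root-like transform
box_cox(k_shifted, lambda = 1)     # only shifts the values by -1
```

Values of lambda below 1 compress the right tail, which is why the transformation helps with right-skewed predictors such as K and Ba.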

detach(Glass)

Exercise 3.2: `Soybean` dataset

The Soybean dataset includes 683 observations of 36 variables. The variables include:

35 predictor variables (date through roots), all of which are categorical
1 target variable (Class) which is categorical and specifies 19 distinct classes.

data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

attach(Soybean)

a. Frequency distributions

We start by reviewing the frequency distributions of the predictors using the summary function. It is apparent that:

Many of the distributions are highly skewed or imbalanced; for instance, ~94% of the observations for the mycelium variable fall into the same category.
Missing values are scattered throughout the dataset, affecting almost all variables to differing degrees. Missing values comprise ~18% (121 cases) of the data for several variables (hail, sever, seed.tmt, and lodging), which may indicate a common source of the problem.

summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ    
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165  
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213  
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193  
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112  
##             NA's: 16   NA's:  1                                   
##                                                                   
##                                                                   
##  plant.growth leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread
##  0   :441     0: 77   0   :221   0   :357   0   : 51   0   :487   
##  1   :226     1:606   1   : 36   1   : 21   1   :327   1   : 96   
##  NA's: 16             2   :342   2   :221   2   :221   NA's:100   
##                       NA's: 84   NA's: 84   NA's: 84              
##                                                                   
##                                                                   
##                                                                   
##  leaf.malf  leaf.mild    stem     lodging    stem.cankers canker.lesion
##  0   :554   0   :535   0   :296   0   :520   0   :379     0   :320     
##  1   : 45   1   : 20   1   :371   1   : 42   1   : 39     1   : 83     
##  NA's: 84   2   : 20   NA's: 16   NA's:121   2   : 36     2   :177     
##             NA's:108                         3   :191     3   : 65     
##                                              NA's: 38     NA's: 38     
##                                                                        
##                                                                        
##  fruiting.bodies ext.decay  mycelium   int.discolor sclerotia  fruit.pods
##  0   :473        0   :497   0   :639   0   :581     0   :625   0   :407  
##  1   :104        1   :135   1   :  6   1   : 44     1   : 20   1   :130  
##  NA's:106        2   : 13   NA's: 38   2   : 20     NA's: 38   2   : 14  
##                  NA's: 38              NA's: 38                3   : 48  
##                                                                NA's: 84  
##                                                                          
##                                                                          
##  fruit.spots   seed     mold.growth seed.discolor seed.size  shriveling
##  0   :345    0   :476   0   :524    0   :513      0   :532   0   :539  
##  1   : 75    1   :115   1   : 67    1   : 64      1   : 59   1   : 38  
##  2   : 57    NA's: 92   NA's: 92    NA's:106      NA's: 92   NA's:106  
##  4   :100                                                              
##  NA's:106                                                              
##                                                                        
##                                                                        
##   roots    
##  0   :551  
##  1   : 86  
##  2   : 15  
##  NA's: 31  
##            
##            
##

From the text, degenerate distributions that have near-zero variance can be identified as satisfying two conditions:

The frequency ratio of the most prevalent value to the second-most prevalent value is > 20
The proportion of unique values as a percentage of the sample size is < 10%.

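We can use the nearZeroVar function in the caret package to identify the predictors with degenerate distributions. Note that its default cutoffs (freqCut = 95/5 = 19 and uniqueCut = 10) are close to, but not identical to, the thresholds above. Based on this definition, the variables with near-zero variance include: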

leaf.mild: 78% of the data in the most prevalent value
mycelium: 94% in the most prevalent value
sclerotia: 92% in the most prevalent value.

# get near-zero variance variables
nzv.cols <- nearZeroVar(Soybean)
nzv.out <- nearZeroVar(Soybean, saveMetrics=TRUE)[nzv.cols, ]
nzv.prop.high <- apply(Soybean, 2, function(x) max(table(x)) / length(x))[nzv.cols]

summary(Soybean[ , nzv.cols])

##  leaf.mild  mycelium   sclerotia 
##  0   :535   0   :639   0   :625  
##  1   : 20   1   :  6   1   : 20  
##  2   : 20   NA's: 38   NA's: 38  
##  NA's:108

df <- nzv.out %>% 
  mutate(prophigh = nzv.prop.high * 100) %>%
  select(freqRatio, percentUnique, prophigh)
rownames(df) <- rownames(nzv.out)
  
kable(df, 
      digits = 2, 
      col.names = c("Frequency ratio", "Pct. unique values", "Pct. most prevalent value"), 
      caption = "Near-Zero Variance Predictors")

Near-Zero Variance Predictors
	Frequency ratio	Pct. unique values	Pct. most prevalent value
leaf.mild	26.75	0.44	78.33
mycelium	106.50	0.29	93.56
sclerotia	31.25	0.29	91.51
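
The metrics in this table can be reproduced by hand from the counts in the summary. The sketch below rebuilds the mycelium column from those counts (639 zeros, 6 ones, 38 missing values) and computes the three quantities in base R. The variable names are illustrative, and missing values are excluded from the counts but kept in the sample size, which is what the table values imply.

```r
# Rebuild the mycelium column from the counts in the summary above:
# 639 zeros, 6 ones, 38 missing values (683 cases in total)
mycelium <- factor(c(rep("0", 639), rep("1", 6), rep(NA, 38)))

# table() ignores NA by default; sort so the most prevalent value comes first
counts <- sort(table(mycelium), decreasing = TRUE)

freq_ratio <- counts[[1]] / counts[[2]]                # most vs. second-most prevalent
pct_unique <- 100 * length(counts) / length(mycelium)  # distinct values / sample size
pct_top    <- 100 * counts[[1]] / length(mycelium)     # share in the most prevalent value

round(c(freq_ratio = freq_ratio, pct_unique = pct_unique, pct_top = pct_top), 2)
```

The rounded results (106.50, 0.29, and 93.56) agree with the mycelium row of the table.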

b. Missing data

Measured by the number of complete cases, roughly 18% of the cases in the dataset are missing values for one or more variables. Measured by the number of missing values across all cases and variables, roughly 10% of the dataset is missing data.

# proportion of incomplete cases
1 - sum(complete.cases(Soybean)) / nrow(Soybean)

## [1] 0.1771596

# proportion of missing values across all rows & cols
sum(is.na(Soybean)) / ncol(Soybean) / nrow(Soybean)

## [1] 0.09504636

We can detect patterns in the missing data by viewing the distribution of NA values across variables and cases. From the output below, it is evident that:

Nearly all variables have missing data (34 out of 35 predictors)
Missing data is concentrated in certain groups of variables such as:
- 121 missing values: hail, sever, seed.tmt, lodging
- 106 missing values: fruiting.bodies, fruit.spots, seed.discolor, shriveling
- 84 missing values: leaf.halo, leaf.marg, leaf.size, leaf.malf, fruit.pods
Missing data is found in only 121 cases (out of 683)

# exclude target variable
isna <- is.na(Soybean[ , -1])

# missing data by cols
nabyCol <- colSums(isna)
kable(nabyCol, 
      col.names="# Cases with missing data", 
      caption="Predictors with Missing Data")

Predictors with Missing Data
	# Cases with missing data
date	1
plant.stand	36
precip	38
temp	30
hail	121
crop.hist	16
area.dam	1
sever	121
seed.tmt	121
germ	112
plant.growth	16
leaves	0
leaf.halo	84
leaf.marg	84
leaf.size	84
leaf.shread	100
leaf.malf	84
leaf.mild	108
stem	16
lodging	121
stem.cankers	38
canker.lesion	38
fruiting.bodies	106
ext.decay	38
mycelium	38
int.discolor	38
sclerotia	38
fruit.pods	84
fruit.spots	106
seed	92
mold.growth	92
seed.discolor	106
seed.size	92
shriveling	106
roots	31

kable(addmargins(table(nabyCol)), 
      col.names=c("# Missing values", "# Predictors"),
      caption = "Predictors by Number of Missing Values")

Predictors by Number of Missing Values
# Missing values	# Predictors
0	1
1	2
16	3
30	1
31	1
36	1
38	7
84	5
92	3
100	1
106	4
108	1
112	1
121	4
Sum	35
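
The groups of predictors that share the same number of missing values can be listed directly with `split`. The sketch below uses a subset of the counts copied from the "Predictors with Missing Data" table; `na_by_col` is an illustrative stand-in for the `nabyCol` vector computed above.

```r
# Subset of the per-predictor missing counts, copied from the table above
na_by_col <- c(hail = 121, sever = 121, seed.tmt = 121, lodging = 121,
               leaf.halo = 84, leaf.marg = 84, leaf.size = 84,
               leaf.malf = 84, fruit.pods = 84,
               leaves = 0, date = 1, area.dam = 1)

# group predictor names by their number of missing values
groups <- split(names(na_by_col), na_by_col)
groups
```

With the full vector, `split(names(nabyCol), nabyCol)` would list every group in the table, which makes blocks of variables that are missing together easy to spot.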

# missing data by rows
nabyRow <- rowSums(isna)
kable(addmargins(table(nabyRow)), 
      col.names=c("# Missing values", "# Cases"),
      caption = "Cases by Number of Missing Values")

Cases by Number of Missing Values
# Missing values	# Cases
0	562
11	9
13	19
19	55
20	8
24	14
28	15
30	1
Sum	683
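
As a consistency check, the row-wise table above can be used to recover both missing-data proportions reported earlier. The sketch below copies the counts from the "Cases by Number of Missing Values" table; the variable names are illustrative.

```r
# Counts copied from the "Cases by Number of Missing Values" table
n_missing <- c(0, 11, 13, 19, 20, 24, 28, 30)
n_cases   <- c(562, 9, 19, 55, 8, 14, 15, 1)

# total number of missing cells implied by the table
total_na <- sum(n_missing * n_cases)

# share of missing cells over all 683 rows and 36 columns
share_missing_cells <- total_na / (683 * 36)

# share of cases with at least one missing value
share_incomplete <- sum(n_cases[n_missing > 0]) / sum(n_cases)

c(total_na = total_na,
  share_missing_cells = share_missing_cells,
  share_incomplete = share_incomplete)
```

The table implies 2337 missing cells in total, and the two shares reproduce the 0.09504636 and 0.1771596 printed earlier in this section.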

Further, we can visualize the pattern of missing data in the image below, where the horizontal axis represents the variables and the vertical axis represents the cases in the dataset (note that the cases on the vertical axis are in reverse order compared to the dataframe). There is a clear pattern of missing values in the dataframe, with concentrations in selected cases and variables.

# image of missing data
par(mfrow = c(1,1))
image(t(isna), 
      xlab = "Variables", 
      ylab = "Cases (note reverse order)")
title("Image of Missing Values in Soybean Dataframe")

Finally, we can cross-tabulate by Class to see if there is a pattern of missing data related to the target variable. It is clear that the missing data is highly structured and dependent on the target:

2-4-d-injury: all cases are missing some data
cyst-nematode: all cases are missing some data
diaporthe-pod-&-stem-blight: all cases are missing some data
herbicide-injury: all cases are missing some data
phytophthora-rot: 68 out of 88 cases are missing some data.

# this time include target variable
isna <- is.na(Soybean)
nabyCol <- colSums(isna)
nabyRow <- rowSums(isna)

dfwna <- Soybean %>% mutate(nas = nabyRow, anyna = nabyRow > 0)

# cross tab by any missing data
addmargins(xtabs(~ Class + anyna, dfwna)) %>% 
  kable(col.names = c("Complete Cases", "Incomplete Cases", "Total Cases"), 
        caption = "Class by Complete / Incomplete Cases")

Class by Complete / Incomplete Cases
	Complete Cases	Incomplete Cases	Total Cases
2-4-d-injury	0	16	16
alternarialeaf-spot	91	0	91
anthracnose	44	0	44
bacterial-blight	20	0	20
bacterial-pustule	20	0	20
brown-spot	92	0	92
brown-stem-rot	44	0	44
charcoal-rot	20	0	20
cyst-nematode	0	14	14
diaporthe-pod-&-stem-blight	0	15	15
diaporthe-stem-canker	20	0	20
downy-mildew	20	0	20
frog-eye-leaf-spot	91	0	91
herbicide-injury	0	8	8
phyllosticta-leaf-spot	20	0	20
phytophthora-rot	20	68	88
powdery-mildew	20	0	20
purple-seed-stain	20	0	20
rhizoctonia-root-rot	20	0	20
Sum	562	121	683
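
The same cross-tabulation can be summarized as the share of incomplete cases within each class. This is a minimal base-R sketch; `incomplete_share` is a hypothetical helper and the toy data frame is made up for illustration.

```r
# Hypothetical helper: share of incomplete rows within each level of a grouping column
incomplete_share <- function(df, group) {
  incomplete <- !complete.cases(df)
  vapply(split(incomplete, df[[group]]), mean, numeric(1))
}

# Toy data for illustration only (not the Soybean data)
toy <- data.frame(Class = c("a", "a", "b", "b"),
                  x = c(1, NA, 3, 4),
                  y = c(NA, 2, 3, 4))

incomplete_share(toy, "Class")
```

On the real data, `incomplete_share(Soybean, "Class")` would return 1 for the four classes in which every case is incomplete and 68/88 (about 0.77) for phytophthora-rot, per the table above.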

Conditioning on the incomplete cases, we can see that the missing values arise from a consistent set of predictors, which depends on the target Class:

2-4-d-injury: incomplete cases due to missing data in 28 or 30 predictors
cyst-nematode: incomplete cases due to missing data in 24 predictors
diaporthe-pod-&-stem-blight: incomplete cases due to missing data in 11 or 13 predictors
herbicide-injury: incomplete cases due to missing data in 20 predictors
phytophthora-rot: incomplete cases due to missing data in 13 or 19 predictors.

# cross tab by number of missing data
addmargins(xtabs(~ Class + nas, dfwna, subset = (anyna == TRUE))) %>%
  kable(caption = "Classes with Incomplete Cases by Number of Missing Variables")

Classes with Incomplete Cases by Number of Missing Variables
	11	13	19	20	24	28	30	Sum
2-4-d-injury	0	0	0	0	0	15	1	16
alternarialeaf-spot	0	0	0	0	0	0	0	0
anthracnose	0	0	0	0	0	0	0	0
bacterial-blight	0	0	0	0	0	0	0	0
bacterial-pustule	0	0	0	0	0	0	0	0
brown-spot	0	0	0	0	0	0	0	0
brown-stem-rot	0	0	0	0	0	0	0	0
charcoal-rot	0	0	0	0	0	0	0	0
cyst-nematode	0	0	0	0	14	0	0	14
diaporthe-pod-&-stem-blight	9	6	0	0	0	0	0	15
diaporthe-stem-canker	0	0	0	0	0	0	0	0
downy-mildew	0	0	0	0	0	0	0	0
frog-eye-leaf-spot	0	0	0	0	0	0	0	0
herbicide-injury	0	0	0	8	0	0	0	8
phyllosticta-leaf-spot	0	0	0	0	0	0	0	0
phytophthora-rot	0	13	55	0	0	0	0	68
powdery-mildew	0	0	0	0	0	0	0	0
purple-seed-stain	0	0	0	0	0	0	0	0
rhizoctonia-root-rot	0	0	0	0	0	0	0	0
Sum	9	19	55	8	14	15	1	121

c. Strategy for missing data

One strategy for handling the missing data is to remove the predictors that account for the bulk of the systematic missing data, namely the 19 variables that have more than 80 missing values:

Variables with 121 missing values: hail, sever, seed.tmt, lodging
Variables with 112 missing values: germ
Variables with 108 missing values: leaf.mild
Variables with 106 missing values: fruiting.bodies, fruit.spots, seed.discolor, shriveling
Variables with 100 missing values: leaf.shread
Variables with 92 missing values: seed, mold.growth, seed.size
Variables with 84 missing values: leaf.halo, leaf.marg, leaf.size, leaf.malf, fruit.pods.

Dropping these variables leaves 16 predictors in the dataset (17 columns including Class). The remaining missing data is isolated to certain cases, which can then be removed as incomplete cases.

# drop predictors with 80 or more missing values (Class has none, so it is kept)
dropna <- Soybean[ , nabyCol < 80]
# dimensions of the remaining dataframe
dim(dropna)

## [1] 683  17

# visualize missing data
image(t(is.na(dropna)), 
      xlab = "Variables", 
      ylab = "Cases (note reverse order)")
title("Image of Missing Values in Reduced Dataframe")

After dropping the remaining incomplete cases, we are left with 630 complete observations of 16 predictor variables. Starting from the raw Soybean dataset (683 cases with 35 predictors), we have dropped 19 predictors and 53 incomplete cases. Finally, we review the summary statistics of the reduced dataset.

# drop remaining incomplete cases
final_df <- dropna[complete.cases(dropna), ]
dim(final_df)

## [1] 630  17

summary(final_df)

##                  Class     date    plant.stand precip  temp    crop.hist
##  brown-spot         : 92   0: 20   0:347       0: 74   0: 72   0: 59    
##  alternarialeaf-spot: 91   1: 68   1:283       1:110   1:374   1:156    
##  frog-eye-leaf-spot : 91   2: 86               2:446   2:184   2:208    
##  phytophthora-rot   : 88   3:110                               3:207    
##  anthracnose        : 44   4:124                                        
##  brown-stem-rot     : 44   5:140                                        
##  (Other)            :180   6: 82                                        
##  area.dam plant.growth leaves  stem    stem.cankers canker.lesion
##  0:113    0:426        0: 62   0:282   0:364        0:305        
##  1:215    1:204        1:568   1:348   1: 39        1: 83        
##  2:135                                 2: 36        2:177        
##  3:167                                 3:191        3: 65        
##                                                                  
##                                                                  
##                                                                  
##  ext.decay mycelium int.discolor sclerotia roots  
##  0:482     0:624    0:566        0:610     0:551  
##  1:135     1:  6    1: 44        1: 20     1: 78  
##  2: 13              2: 20                  2:  1  
##                                                   
##                                                   
##                                                   
##

detach(Soybean)

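It is critical to compare these summary statistics against the initial ones, particularly with respect to the variable distributions. In particular, we would want to inspect the distribution of the target Class variable, and the predictor distributions conditioned on Class, to determine the degree to which dropping the missing data has introduced bias into the final dataset. The Class counts already point to one such effect: the (Other) group shrinks from 233 to 180 cases, and this loss of 53 cases equals the combined size of the four classes that consist entirely of incomplete cases (2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury: 16 + 14 + 15 + 8 = 53), while phytophthora-rot keeps all 88 cases. These four classes therefore appear to be absent from the final dataset.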
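
The comparison of the Class distribution before and after filtering can be sketched as follows. `compare_class_counts` is a hypothetical helper and the toy vectors are made up for illustration.

```r
# Hypothetical helper: class counts before and after filtering, with the number dropped
compare_class_counts <- function(before, after) {
  lv <- levels(factor(before))
  b <- table(factor(before, levels = lv))
  a <- table(factor(after, levels = lv))
  data.frame(class   = lv,
             before  = as.integer(b),
             after   = as.integer(a),
             dropped = as.integer(b - a),
             stringsAsFactors = FALSE)
}

# Toy labels for illustration only (not the Soybean classes)
before <- c("x", "x", "y", "z")
after  <- c("x", "x", "z")

compare_class_counts(before, after)
```

With the objects from this section, the call would be `compare_class_counts(Soybean$Class, final_df$Class)`. Any class whose `after` count is 0 has been removed entirely by the filtering.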

Kevin Benson

3/1/2020