Chapters 3-4, Kuhn and Johnson, Applied Predictive Modeling

Q 3.1

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench)
data(Glass)

Examining the structure of the data frame shows 9 predictor variables and 1 target variable (Type), which is a factor with 6 levels.

str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Q 3.1.a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Setting aside the target variable, Magnesium is the only predictor with a bimodal distribution, although one of its peaks is at 0; if those values were removed or transformed, the distribution would be left skewed. Only Sodium and Aluminum look roughly normal, and both have upper outliers. In fact, most of the predictors appear to have outliers in their upper ranges. Potassium, Barium and Iron all have multiple values recorded at 0. Note that the histograms do not share a common x-axis scale.

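library(ggplot2)    # histograms and scatterplots
library(gridExtra)  # grid.arrange
library(corrplot)   # corrplot
library(e1071)      # skewness (assumed source of the function used below)
library(caret)      # BoxCoxTrans, nearZeroVar
library(dplyr)      # %>%, mutate, group_by
library(plotly)     # ggplotly

GlassPreds <- Glass[, 1:9]  # the nine predictors, without Type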

RI <- ggplot(GlassPreds, aes(x = RI)) + geom_histogram(binwidth = 0.001) + labs(title = "Refractive Index")
Na <- ggplot(GlassPreds, aes(x = Na)) + geom_histogram(binwidth = 0.2) + labs(title = "Sodium")
Mg <- ggplot(GlassPreds, aes(x = Mg)) + geom_histogram(binwidth = 0.2) + labs(title = "Magnesium")
Al <- ggplot(GlassPreds, aes(x = Al)) + geom_histogram(binwidth = 0.1) + labs(title = "Aluminum")
Si <- ggplot(GlassPreds, aes(x = Si)) + geom_histogram(binwidth = 0.15) + labs(title = "Silicon")
K <- ggplot(GlassPreds, aes(x = K)) + geom_histogram(binwidth = 0.2) + labs(title = "Potassium")
Ca <- ggplot(GlassPreds, aes(x = Ca)) + geom_histogram(binwidth = 0.2) + labs(title = "Calcium")
Ba <- ggplot(GlassPreds, aes(x = Ba)) + geom_histogram(binwidth = 0.1) + labs(title = "Barium")
Fe <- ggplot(GlassPreds, aes(x = Fe)) + geom_histogram(binwidth = 0.02) + labs(title = "Iron")

grid.arrange(RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,ncol=3)

The relationships between the predictors can be quantified by calculating the pairwise correlations. Most of the off-diagonal correlations in the matrix below are negative, although a few are positive; the strongest is between Ca and RI (0.81).

corrSkew <- cor(GlassPreds)
corrSkew[1:9,1:9]

##               RI          Na           Mg          Al          Si            K
## RI  1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220 -0.289832711
## Na -0.1918853790  1.00000000 -0.273731961  0.15679367 -0.06980881 -0.266086504
## Mg -0.1222740393 -0.27373196  1.000000000 -0.48179851 -0.16592672  0.005395667
## Al -0.4073260341  0.15679367 -0.481798509  1.00000000 -0.00552372  0.325958446
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372  1.00000000 -0.193330854
## K  -0.2898327111 -0.26608650  0.005395667  0.32595845 -0.19333085  1.000000000
## Ca  0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215 -0.317836155
## Ba -0.0003860189  0.32660288 -0.492262118  0.47940390 -0.10215131 -0.042618059
## Fe  0.1430096093 -0.24134641  0.083059529 -0.07440215 -0.09420073 -0.007719049
##            Ca            Ba           Fe
## RI  0.8104027 -0.0003860189  0.143009609
## Na -0.2754425  0.3266028795 -0.241346411
## Mg -0.4437500 -0.4922621178  0.083059529
## Al -0.2595920  0.4794039017 -0.074402151
## Si -0.2087322 -0.1021513105 -0.094200731
## K  -0.3178362 -0.0426180594 -0.007719049
## Ca  1.0000000 -0.1128409671  0.124968219
## Ba -0.1128410  1.0000000000 -0.058691755
## Fe  0.1249682 -0.0586917554  1.000000000

It is possible to examine the correlations visually using a correlation plot, which shows negative correlations in red, such as between Al and Mg, and positive correlations in blue, such as between Ca and RI. Darker shading represents a stronger correlation in either direction, and fainter shading a weaker one, such as between Ba and Ca. Nearly white boxes indicate a correlation close to zero, such as between Na and Si (-0.07).

corrplot(corrSkew, order = "hclust")
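
To read the strongest relationships off the matrix without scanning it by eye, the pairs can be ranked by absolute correlation. This is a minimal sketch in base R; the names corr_mat and pairs_df are illustrative and not part of the original analysis.

```r
library(mlbench)
data(Glass)

corr_mat <- cor(Glass[, 1:9])

# Row/column positions of the upper triangle, so each pair appears once
idx <- which(upper.tri(corr_mat), arr.ind = TRUE)

pairs_df <- data.frame(
  var1 = rownames(corr_mat)[idx[, "row"]],
  var2 = colnames(corr_mat)[idx[, "col"]],
  r = corr_mat[idx]
)

# Strongest relationships first, regardless of sign
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df, 5)
```

Sorting on the absolute value keeps strong negative pairs, such as Si and RI, near the top alongside the positive ones.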

Q 3.1.b

Do there appear to be any outliers in the data? Are any predictors skewed?

Although the outcome to be predicted is the glass Type, plotting each element against the Refractive Index (RI) gives a quick view of unusual observations.

Scatterplots are a useful way to spot outliers in the data. Visually, it appears that Potassium has an obvious outlier, and Barium and Iron have a few outliers as well.

Na_p <- ggplot(GlassPreds, aes(x = Na, y = RI)) + geom_point() + labs(title = "Sodium")
Mg_p <- ggplot(GlassPreds, aes(x = Mg, y = RI)) + geom_point() + labs(title = "Magnesium")
Al_p <- ggplot(GlassPreds, aes(x = Al, y = RI)) + geom_point() + labs(title = "Aluminum")
Si_p <- ggplot(GlassPreds, aes(x = Si, y = RI)) + geom_point() + labs(title = "Silicon")
K_p <- ggplot(GlassPreds, aes(x = K, y = RI)) + geom_point() + labs(title = "Potassium")
Ca_p <- ggplot(GlassPreds, aes(x = Ca, y = RI)) + geom_point() + labs(title = "Calcium")
Ba_p <- ggplot(GlassPreds, aes(x = Ba, y = RI)) + geom_point() + labs(title = "Barium")
Fe_p <- ggplot(GlassPreds, aes(x = Fe, y = RI)) + geom_point() + labs(title = "Iron")

grid.arrange(Na_p,Mg_p,Al_p,Si_p,K_p,Ca_p,Ba_p,Fe_p,ncol=3)

Based on the histograms in (a), five of the predictor variables appear right skewed: RI, K, Ca, Ba and Fe. Two appear left skewed: Mg and Si. Two are somewhat normal with upper outliers: Na and Al.

This is confirmed when skewness is calculated. Positive values indicate right skew and negative values indicate left skew; the farther a value is from 0, the stronger the skew.

skewValues <- apply(GlassPreds, 2, skewness)
skewValues

##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
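
For reference, the statistic can be reproduced by hand. The sketch below assumes the skewness function used above comes from the e1071 package with its default type = 3, which divides the third central moment by the cube of the sample standard deviation; skew_by_hand is a hypothetical helper.

```r
library(mlbench)
data(Glass)

# Hypothetical helper: third central moment over the cubed sample sd
skew_by_hand <- function(x) {
  n <- length(x)
  m3 <- sum((x - mean(x))^3) / n
  m3 / sd(x)^3
}

sapply(Glass[, 1:9], skew_by_hand)
```

If the assumption about the package holds, these values should agree with the skewValues printed above.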

Q 3.1.c

Are there any relevant transformations of one or more predictors that might improve the classification model?

In terms of relevant transformations, Box-Cox can suggest and even apply appropriate transformations using the ‘BoxCoxTrans’ function from the ‘caret’ package.

Fitting each of the 9 predictor variables individually, lambda cannot be estimated for four of them: Magnesium, Potassium, Barium and Iron. As the scatterplots and histograms above show, these are the predictors with many values at 0, and Box-Cox requires strictly positive data. Perhaps using only the values > 0 in those predictors would help achieve a better result for classification. Box-Cox is tried below both with and without the 0 values for those predictors.

# Refractive Index
(RI_Trans <- BoxCoxTrans(GlassPreds$RI))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.511   1.517   1.518   1.518   1.519   1.534 
## 
## Largest/Smallest: 1.02 
## Sample Skewness: 1.6 
## 
## Estimated Lambda: -2
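
To make the estimate concrete: for a lambda other than 0, Box-Cox maps x to (x^lambda - 1) / lambda. The brief sketch below assumes caret's predict method for BoxCoxTrans objects and the lambda element of the fitted object, and compares that formula with the package's output for RI; the variable names are illustrative.

```r
library(mlbench)
library(caret)
data(Glass)

ri_fit <- BoxCoxTrans(Glass$RI)      # same fit as shown above
ri_bc <- predict(ri_fit, Glass$RI)   # transformation applied by caret

# The same transformation written out by hand
lambda <- ri_fit$lambda
ri_manual <- (Glass$RI^lambda - 1) / lambda

all.equal(ri_bc, ri_manual)
```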

# Sodium
(Na_Trans <- BoxCoxTrans(GlassPreds$Na))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.73   12.91   13.30   13.41   13.82   17.38 
## 
## Largest/Smallest: 1.62 
## Sample Skewness: 0.448 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations

# Magnesium with 0 values
(Mg_Trans <- BoxCoxTrans(GlassPreds$Mg))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.115   3.480   2.685   3.600   4.490 
## 
## Lambda could not be estimated; no transformation is applied

Changing the 0 values to NA and calling BoxCoxTrans with na.rm = TRUE results in an estimated lambda.

# Magnesium without 0 values
GlassPreds$MgNo_0 <- ifelse(GlassPreds$Mg == 0,NA,GlassPreds$Mg)  # add a column with NA instead of 0 values

(MgNo_0Trans <- BoxCoxTrans(GlassPreds$MgNo_0, na.rm = TRUE))

## Box-Cox Transformation
## 
## 172 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.330   3.385   3.535   3.340   3.632   4.490 
## 
## Largest/Smallest: 13.6 
## Sample Skewness: -2.31 
## 
## Estimated Lambda: 2

# Aluminum
(Al_Trans <- BoxCoxTrans(GlassPreds$Al))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   1.190   1.360   1.445   1.630   3.500 
## 
## Largest/Smallest: 12.1 
## Sample Skewness: 0.895 
## 
## Estimated Lambda: 0.5

# Silicon
(Si_Trans <- BoxCoxTrans(GlassPreds$Si))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   69.81   72.28   72.79   72.65   73.09   75.41 
## 
## Largest/Smallest: 1.08 
## Sample Skewness: -0.72 
## 
## Estimated Lambda: 2

# Potassium
(K_Trans <- BoxCoxTrans(GlassPreds$K))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Lambda could not be estimated; no transformation is applied

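Changing the 0 values to NA and calling BoxCoxTrans with na.rm = TRUE would be the corresponding next step for Potassium. The block below applies that step to Calcium instead, which has no 0 values, so all 214 data points are retained and the estimate matches the plain Calcium fit.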

# Calcium without 0 values
GlassPreds$CaNo_0 <- ifelse(GlassPreds$Ca == 0,NA,GlassPreds$Ca)  # add a column with NA instead of 0 values

(CaNo_0Trans <- BoxCoxTrans(GlassPreds$CaNo_0, na.rm = TRUE))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190 
## 
## Largest/Smallest: 2.98 
## Sample Skewness: 2.02 
## 
## Estimated Lambda: -1.1
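
The Potassium refit without 0 values is not shown in the output. A sketch of what it could look like, following the same NA-replacement pattern used for Magnesium, is given here; K_no0 and K_no0_trans are hypothetical names, and no result is quoted because this block was not part of the original run.

```r
library(mlbench)
library(caret)
data(Glass)

# Replace the 0 values of Potassium with NA, as was done for Magnesium
K_no0 <- ifelse(Glass$K == 0, NA, Glass$K)

# Estimate lambda from the remaining strictly positive values
(K_no0_trans <- BoxCoxTrans(K_no0, na.rm = TRUE))
```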

# Calcium
(Ca_Trans <- BoxCoxTrans(GlassPreds$Ca))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190 
## 
## Largest/Smallest: 2.98 
## Sample Skewness: 2.02 
## 
## Estimated Lambda: -1.1

# Barium
(Ba_Trans <- BoxCoxTrans(GlassPreds$Ba))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.175   0.000   3.150 
## 
## Lambda could not be estimated; no transformation is applied

Changing the 0 values to NA and calling BoxCoxTrans with na.rm = TRUE results in an estimated lambda.

# Barium without 0 values
GlassPreds$BaNo_0 <- ifelse(GlassPreds$Ba == 0,NA,GlassPreds$Ba)  # add a column with NA instead of 0 values

(BaNo_0Trans <- BoxCoxTrans(GlassPreds$BaNo_0, na.rm = TRUE))

## Box-Cox Transformation
## 
## 38 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0600  0.4325  0.6800  0.9858  1.5850  3.1500 
## 
## Largest/Smallest: 52.5 
## Sample Skewness: 0.85 
## 
## Estimated Lambda: 0.4

However, only 38 of the original 214 data points remain after the NA values are removed. Given that only 17.8% of the original records have non-zero values, it may be best to remove this predictor variable entirely; for now, the zero-free helper column BaNo_0 is dropped.

GlassPreds <- GlassPreds %>%
  mutate(BaNo_0 = NULL)

# Iron
(Fe_Trans <- BoxCoxTrans(GlassPreds$Fe))

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000 
## 
## Lambda could not be estimated; no transformation is applied

Changing the 0 values to NA and calling BoxCoxTrans with na.rm = TRUE results in an estimated lambda.

# Iron without 0 values
GlassPreds$FeNo_0 <- ifelse(GlassPreds$Fe == 0,NA,GlassPreds$Fe)  # add a column with NA instead of 0 values

(FeNo_0Trans <- BoxCoxTrans(GlassPreds$FeNo_0, na.rm = TRUE))

## Box-Cox Transformation
## 
## 70 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1000  0.1650  0.1743  0.2400  0.5100 
## 
## Largest/Smallest: 51 
## Sample Skewness: 0.866 
## 
## Estimated Lambda: 0.5

We could use Principal Component Analysis to center, scale and transform the data and extract principal components in one step using the caret package’s preProcess function. However, that approach cannot be used while the data contain NA values, so the helper variables with NA values must be removed, leaving only the original variables with their 0 values.

So we remove the three remaining columns with NAs.

GlassPreds <- GlassPreds %>%
  mutate(MgNo_0 = NULL, CaNo_0 = NULL, FeNo_0 = NULL)

Then use the prcomp function to center and scale the data and extract the principal components.

(pcaObject <- prcomp(GlassPreds, center = TRUE, scale. = TRUE))

## Standard deviations (1, .., p=9):
## [1] 1.58466518 1.43180731 1.18526115 1.07604017 0.95603465 0.72638502 0.60741950
## [8] 0.25269141 0.04011007
## 
## Rotation (n x k) = (9 x 9):
##           PC1         PC2           PC3         PC4          PC5         PC6
## RI -0.5451766  0.28568318 -0.0869108293 -0.14738099  0.073542700 -0.11528772
## Na  0.2581256  0.27035007  0.3849196197 -0.49124204 -0.153683304  0.55811757
## Mg -0.1108810 -0.59355826 -0.0084179590 -0.37878577 -0.123509124 -0.30818598
## Al  0.4287086  0.29521154 -0.3292371183  0.13750592 -0.014108879  0.01885731
## Si  0.2288364 -0.15509891  0.4587088382  0.65253771 -0.008500117 -0.08609797
## K   0.2193440 -0.15397013 -0.6625741197  0.03853544  0.307039842  0.24363237
## Ca -0.4923061  0.34537980  0.0009847321  0.27644322  0.188187742  0.14866937
## Ba  0.2503751  0.48470218 -0.0740547309 -0.13317545 -0.251334261 -0.65721884
## Fe -0.1858415 -0.06203879 -0.2844505524  0.23049202 -0.873264047  0.24304431
##            PC7         PC8         PC9
## RI -0.08186724 -0.75221590 -0.02573194
## Na -0.14858006 -0.12769315  0.31193718
## Mg  0.20604537 -0.07689061  0.57727335
## Al  0.69923557 -0.27444105  0.19222686
## Si -0.21606658 -0.37992298  0.29807321
## K  -0.50412141 -0.10981168  0.26050863
## Ca  0.09913463  0.39870468  0.57932321
## Ba -0.35178255  0.14493235  0.19822820
## Fe -0.07372136 -0.01627141  0.01466944
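
Two follow-ups may help here. The standard deviations printed above can be turned into the share of variance explained by each component, and the one-step caret route mentioned earlier can be sketched with preProcess. This is a hedged sketch rather than the original analysis: the method vector and the reliance on preProcess's default 95% variance threshold are assumptions about how one might configure it, and the object names are illustrative.

```r
library(mlbench)
library(caret)
data(Glass)

preds <- Glass[, 1:9]

# Proportion of variance explained by each principal component
pca_fit <- prcomp(preds, center = TRUE, scale. = TRUE)
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(cumsum(var_explained), 3)

# One-step alternative: Box-Cox where lambda can be estimated, then
# centering, scaling and PCA (components kept up to 95% of the variance)
pp <- preProcess(preds, method = c("BoxCox", "center", "scale", "pca"))
glass_pca <- predict(pp, preds)
dim(glass_pca)
```

Predictors containing 0 values are left untransformed by the Box-Cox step, which mirrors the individual fits above.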

Q 3.2

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Q 3.2.a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Visualizing them with a count for each level is a good way to investigate the categorical predictors.

Assuming that “degenerate” distributions are those of predictors with zero or near-zero variance, the plots below suggest that the candidates, judged visually, include: leaves, leaf.shread, leaf.malf, leaf.mild, lodging, stem.cankers, fruiting.bodies, ext.decay, mycelium, int.discolor, sclerotia, mold.growth, seed.discolor, seed.size, shriveling, and roots.

ggplotly((Class_F <- ggplot(Soybean, aes(x=Class))+ geom_bar() + labs(title = "Class distribution") + coord_flip()))
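
The grid.arrange calls that follow use plot objects such as date_F and hail_F, whose definitions are not shown. A minimal sketch of how they might be built, assuming one bar chart per predictor stored under the name <predictor>_F; the helper bar_for is hypothetical.

```r
library(mlbench)
library(ggplot2)
data(Soybean)

# Hypothetical helper: bar chart of level counts for one predictor
bar_for <- function(v) {
  force(v)  # fix the column name now, not when the plot is drawn
  ggplot(Soybean, aes(x = .data[[v]])) +
    geom_bar() +
    labs(title = v, x = NULL)
}

# Create date_F, plant.stand_F, ... in the current environment
for (v in setdiff(names(Soybean), "Class")) {
  assign(paste0(v, "_F"), bar_for(v))
}
```

Missing values show up as their own NA bar in each chart, which is useful context for part (b).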

grid.arrange(date_F,plant.stand_F,precip_F,temp_F,hail_F,crop.hist_F, ncol = 2)

grid.arrange(area.dam_F,sever_F,seed.tmt_F,germ_F,plant.growth_F,leaves_F, ncol = 2)

grid.arrange(leaf.halo_F,leaf.marg_F,leaf.size_F,leaf.shread_F,leaf.malf_F,leaf.mild_F, ncol = 2)

grid.arrange(stem_F,lodging_F,stem.cankers_F,canker.lesion_F,fruiting.bodies_F,ext.decay_F, ncol = 2)

grid.arrange(mycelium_F,int.discolor_F,sclerotia_F,fruit.pods_F,fruit.spots_F,mold.growth_F, ncol = 2)

grid.arrange(seed.discolor_F,seed.size_F,shriveling_F,roots_F, ncol = 2)
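
The visual impression can be checked numerically. The sketch below uses caret's nearZeroVar with saveMetrics = TRUE, which reports for each column the ratio of the most common to the second most common value (freqRatio) and the percentage of unique values (percentUnique). The default cutoffs are used, and nzv_metrics is an illustrative name.

```r
library(mlbench)
library(caret)
data(Soybean)

nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)

# Columns flagged as near-zero variance under the default cutoffs
nzv_metrics[nzv_metrics$nzv, ]

# The ten most imbalanced columns, flagged or not
head(nzv_metrics[order(-nzv_metrics$freqRatio), ], 10)
```

Ranking by freqRatio shows how far each visually imbalanced predictor sits from the default cutoff, which is a stricter criterion than judging the bar charts by eye.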

Q 3.2.b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Let’s visualize the missing data first, from which we can see that sever, seed.tmt, lodging and hail have the highest counts of missing values.

# Count the missing values in each column
soybean_miss <- data.frame(
  Predictor = names(Soybean),
  Missing = as.integer(colSums(is.na(Soybean)))
)

ggplotly(ggplot(soybean_miss, aes(x = Predictor, y = Missing)) + 
  geom_col(fill = "lightgreen") +
  labs(title = "Soybean Missing Counts by Predictor", x = "Predictor") +
    coord_flip() +
  theme(plot.title = element_text(hjust = 0.5)))

In terms of missing data being related to the classes, the visualization indicates that only five classes have missing values: herbicide-injury, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury and phytophthora-rot. This suggests that missingness is indeed related to class, and is confined to these five.

missingbyclass <- Soybean %>%
  group_by(Class) %>%
    do(as.data.frame(sum(is.na(.))))

names(missingbyclass) <- c("Class","Missing")

ggplotly(ggplot(missingbyclass, aes(x = reorder(Class, -Missing), Missing)) + 
  geom_col(fill = "pink") +
  labs(title = "Soybean Missing Counts by Class", x = "Class") +
    coord_flip() +
  theme(plot.title = element_text(hjust = 0.5)))
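
The same question can be asked at the row level: what share of samples in each class has at least one missing value? A minimal sketch using base R's complete.cases; the object names are illustrative.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing predictor
has_missing <- !complete.cases(Soybean)

# Overall share of incomplete rows
mean(has_missing)

# Share of incomplete rows within each class
incomplete_by_class <- tapply(has_missing, Soybean$Class, mean)
incomplete_by_class[incomplete_by_class > 0]
```

A class whose share is 1 has no complete rows at all, which matters if rows with missing values were ever dropped.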

Q 3.2.c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

First and foremost, I would eliminate the predictors that have very low or no variance, specifically leaf.mild, mycelium and sclerotia.

(Low_or_zero_var <- names(Soybean[nearZeroVar(Soybean)]))

## [1] "leaf.mild" "mycelium"  "sclerotia"

I would leave in the predictors with missing values, lest I degrade the predictive power of the dataset. It might be helpful to impute the missing values using KNN, but before doing that I would scale any numerical data so that all values fall between 0 and 1. For categorical data I would need to create numerical dummy variables that could then be scaled between 0 and 1, but I would also keep the original factor variables for easier interpretation later. At that point I could use the knn function from the class package to run a K-nearest neighbor algorithm to impute the missing values.
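
The encoding step of this plan can be sketched in base R. The block below only illustrates the dummy-variable step, not the imputation itself: one_hot is a hypothetical helper, ordered factors are treated as unordered, and missing values are deliberately kept as NA so that a later KNN step would know what to fill.

```r
library(mlbench)
data(Soybean)

# Predictors kept after dropping the near-zero-variance columns named above
drop_cols <- c("Class", "leaf.mild", "mycelium", "sclerotia")
soy <- Soybean[, setdiff(names(Soybean), drop_cols)]

# Hypothetical helper: one 0/1 column per factor level, NA stays NA
one_hot <- function(f, name) {
  lv <- levels(f)
  m <- sapply(lv, function(l) as.integer(f == l))
  colnames(m) <- paste(name, lv, sep = ".")
  m
}

soy_dummies <- do.call(cbind, Map(one_hot, soy, names(soy)))
dim(soy_dummies)
range(soy_dummies, na.rm = TRUE)  # indicator columns are already on the 0-1 scale
```

Because every column is a 0/1 indicator, no further rescaling is needed before computing distances between samples.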

Data 624 Predictive Analytics

Douglas Barley

2/27/2022
