Assignment: Exercises 3.1 and 3.2 from the Kuhn & Johnson (KJ) textbook, Applied Predictive Modeling
library(mlbench)
library(corrr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
## Loading required package: ggplot2
library(e1071)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(caret)
## Loading required package: lattice
library(ggplot2)
library(finalfit)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(DMwR)
## Loading required package: grid
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:plyr':
##
## join
3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
A boxplot of the predictors shows that the Si values are much larger than those of the other elements, while Na and Ca have the widest ranges among the remaining predictors. Ca appears to have the most outliers, and K is right-skewed.
df <- Glass[,1:9]
boxplot(df)
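Because all nine predictors share a single axis in this boxplot, Si's much larger magnitude compresses the other boxes. As a supplementary view, the sketch below (using the already-loaded tidyr and ggplot2; gather() and free facet scales are illustrative choices, not part of the original analysis) redraws each predictor's boxplot on its own scale.
# Sketch: one boxplot per predictor, each on its own scale
df %>%
  gather(key = "predictor", value = "value") %>%
  ggplot(aes(x = predictor, y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free") +
  xlab(NULL)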
A correlation matrix gives the correlation coefficients between each pair of the nine predictors. We can also use it to build a correlation network that visualizes the relationships among the predictors. Ca and Si are the most strongly correlated with the refractive index, with Ca positively correlated and Si negatively correlated.
correlation <- correlate(df)
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
correlation
## # A tibble: 9 x 10
## rowname RI Na Mg Al Si K
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RI NA -0.192 -0.122 -0.407 -0.542 -0.290
## 2 Na -1.92e-1 NA -0.274 0.157 -0.0698 -0.266
## 3 Mg -1.22e-1 -0.274 NA -0.482 -0.166 0.00540
## 4 Al -4.07e-1 0.157 -0.482 NA -0.00552 0.326
## 5 Si -5.42e-1 -0.0698 -0.166 -0.00552 NA -0.193
## 6 K -2.90e-1 -0.266 0.00540 0.326 -0.193 NA
## 7 Ca 8.10e-1 -0.275 -0.444 -0.260 -0.209 -0.318
## 8 Ba -3.86e-4 0.327 -0.492 0.479 -0.102 -0.0426
## 9 Fe 1.43e-1 -0.241 0.0831 -0.0744 -0.0942 -0.00772
## # … with 3 more variables: Ca <dbl>, Ba <dbl>, Fe <dbl>
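As a quick numeric check of the claim above, the sketch below uses corrr's focus() to keep only each predictor's correlation with RI and sorts by magnitude. dplyr functions are namespaced explicitly because plyr (loaded last) masks several of them.
# Sketch: correlations with RI only, ordered by absolute value
correlation %>%
  focus(RI) %>%
  dplyr::arrange(dplyr::desc(abs(RI)))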
#Correlation network for variables
tidy_cors <- df %>%
  correlate() %>%
  stretch()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
tidy_cors
## # A tibble: 81 x 3
## x y r
## <chr> <chr> <dbl>
## 1 RI RI NA
## 2 RI Na -0.192
## 3 RI Mg -0.122
## 4 RI Al -0.407
## 5 RI Si -0.542
## 6 RI K -0.290
## 7 RI Ca 0.810
## 8 RI Ba -0.000386
## 9 RI Fe 0.143
## 10 Na RI -0.192
## # … with 71 more rows
graph_cors <- tidy_cors %>%
  filter(abs(r) > .3) %>%
  graph_from_data_frame(directed = FALSE)
ggraph(graph_cors) +
  geom_edge_link(aes(edge_alpha = abs(r), edge_width = abs(r), color = r)) +
  guides(edge_alpha = "none", edge_width = "none") +
  scale_edge_colour_gradientn(limits = c(-1, 1), colors = c("firebrick2", "dodgerblue2")) +
  geom_node_point(color = "grey", size = 2) +
  geom_node_text(aes(label = name), repel = FALSE) +
  theme_graph()
## Using `stress` as default layout
Ca appears to have the most outliers, though K, Na, and Ba also show a considerable number. K and Ba are strongly right-skewed, and Ca is right-skewed as well.
Computing the skewness statistic for each predictor confirms that K is by far the most skewed; Ba, Ca, Fe, and RI also have skewness values above 1, and Mg is left-skewed with a value below -1.
skewValues <- apply(df, 2, skewness)
skewValues
## RI Na Mg Al Si K
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889
## Ca Ba Fe
## 2.0184463 3.3686800 1.7298107
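To illustrate how a transformation changes these statistics, the short sketch below recomputes skewness for K after a log transform. log1p() (i.e., log(1 + x)) is used here as a precaution against zero values in K.
# Sketch: skewness of K before and after a log transform
skewness(df$K)          # original K, heavily right-skewed
skewness(log1p(df$K))   # after log(1 + K)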
We can use the powerTransform function (with the Yeo-Johnson family) to estimate lambda values and decide which transformations would be best for these predictors. Based on the rounded powers, we would apply a square-root transformation to Ca, a log transformation to K, no transformation to Al or Na, and a square transformation to Mg. The remaining variables may benefit from other transformations as well.
summary(powerTransform(df[,2:9], family="yjPower")) #removed the RI variable as it was throwing errors
## yjPower Transformations to Multinormality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Na 1.3489 1.00 0.5645 2.1334
## Mg 1.7747 2.00 1.5398 2.0096
## Al 0.8491 1.00 0.4986 1.1996
## Si 13.4963 13.50 8.1223 18.8703
## K -0.1622 0.00 -0.3639 0.0395
## Ca 0.6185 0.50 0.3660 0.8710
## Ba -6.9783 -6.98 -8.1385 -5.8181
## Fe -14.4378 -14.44 -17.2836 -11.5919
##
## Likelihood ratio test that all transformation parameters are equal to 0
## LRT df pval
## LR test, lambda = (0 0 0 0 0 0 0 0) 863.7414 8 < 2.22e-16
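Rather than applying each rounded power by hand, caret's preProcess() can estimate and apply Yeo-Johnson transformations across all predictors at once. The sketch below shows this approach and recomputes the skewness values; the estimated lambdas need not match the rounded powers above exactly.
# Sketch: estimate and apply Yeo-Johnson transformations with caret, then
# recheck skewness of the transformed predictors
yj <- preProcess(df, method = "YeoJohnson")
df_trans <- predict(yj, newdata = df)
apply(df_trans, 2, skewness)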
3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
?Soybean
## Help on topic 'Soybean' was found in the following packages:
##
## Package Library
## nlme /Library/Frameworks/R.framework/Versions/3.5/Resources/library
## mlbench /Library/Frameworks/R.framework/Versions/3.5/Resources/library
##
##
## Using the first match ...
A predictor has a degenerate distribution if it takes on only a single value, or only a handful of distinct values that occur with very low frequencies. These are also called near-zero-variance predictors, and we can identify them with caret's nearZeroVar function. Doing so flags leaf.mild, mycelium, and sclerotia for possible removal from the model.
nearZeroVar(Soybean, names = TRUE)
## [1] "leaf.mild" "mycelium" "sclerotia"
barplot(table(Soybean$leaf.mild))
barplot(table(Soybean$mycelium))
barplot(table(Soybean$sclerotia))
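The diagnostics behind these flags can be inspected directly: with saveMetrics = TRUE, nearZeroVar() reports the frequency ratio and percentage of unique values for every predictor. A short sketch:
# Sketch: show the metrics only for the predictors flagged as near-zero-variance
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]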
The plot below shows the percentage of missing data by predictor. The hail, sever, seed.tmt, and lodging predictors are tied for the most missing values (121 each, about 18% of observations).
results <- sapply(Soybean, function(x) sum(is.na(x)))
results <- results/683*100
results <- as.data.frame(results)
results <- cbind(predictor = rownames(results), results)
rownames(results) <- 1:nrow(results)
ggplot(results, aes(x = reorder(predictor, results), y = results)) +
  geom_bar(stat = "identity", fill = "#0073C2FF", width = 1) +
  ylab("% of missing values") +
  xlab("Predictor") +
  ggtitle("Missing values for predictors") +
  theme(text = element_text(size = 6), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), panel.background = element_blank(),
        axis.line = element_line(colour = "black")) +
  coord_flip()
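A quick tabular check of the same information (a base-R sketch) lists the predictors with the highest percentage of missing values:
# Sketch: top predictors by percentage of missing values
round(head(sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)), 1)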
The following missingness-pattern summary (and its accompanying plot) shows which predictors tend to be missing together and how the patterns relate to the Class variable. Each row is a distinct pattern: the left-hand number is how many observations share that pattern, 1 (blue in the plot) marks observed values, 0 (red) marks missing values, and the bottom row totals the missing values per predictor.
explanatory <- c(colnames(Soybean[,-1]))
dependent <- "Class"
Soybean %>% missing_pattern(dependent, explanatory)
## Class leaves date area.dam crop.hist plant.growth stem temp roots
## 562 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1
## 8 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 1 0
## 6 1 1 1 1 1 1 1 1 0
## 14 1 1 1 1 1 1 1 0 1
## 15 1 1 1 1 0 0 0 0 0
## 1 1 1 0 0 0 0 0 0 0
## 0 0 1 1 16 16 16 30 31
## plant.stand precip stem.cankers canker.lesion ext.decay mycelium
## 562 1 1 1 1 1 1
## 13 1 1 1 1 1 1
## 55 1 1 1 1 1 1
## 8 1 0 0 0 0 0
## 9 1 1 1 1 1 1
## 6 0 1 1 1 1 1
## 14 0 0 0 0 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 36 38 38 38 38 38
## int.discolor sclerotia leaf.halo leaf.marg leaf.size leaf.malf
## 562 1 1 1 1 1 1
## 13 1 1 1 1 1 1
## 55 1 1 0 0 0 0
## 8 0 0 1 1 1 1
## 9 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## 14 0 0 0 0 0 0
## 15 0 0 1 1 1 1
## 1 0 0 1 1 1 1
## 38 38 84 84 84 84
## fruit.pods seed mold.growth seed.size leaf.shread fruiting.bodies
## 562 1 1 1 1 1 1
## 13 0 0 0 0 1 0
## 55 0 0 0 0 0 0
## 8 1 0 0 0 1 0
## 9 1 1 1 1 0 1
## 6 1 1 1 1 0 1
## 14 1 1 1 1 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 84 92 92 92 100 106
## fruit.spots seed.discolor shriveling leaf.mild germ hail sever
## 562 1 1 1 1 1 1 1
## 13 0 0 0 1 0 0 0
## 55 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 1 1 1 0 1 0 0
## 6 1 1 1 0 0 0 0
## 14 0 0 0 0 0 0 0
## 15 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0
## 106 106 106 108 112 121 121
## seed.tmt lodging
## 562 1 1 0
## 13 0 0 13
## 55 0 0 19
## 8 0 0 20
## 9 0 0 11
## 6 0 0 13
## 14 0 0 24
## 15 0 0 28
## 1 0 0 30
## 121 121 2337
Grouping the missing values by class shows that the phytophthora-rot class has the most missing values; the only other classes with any missing data are 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury.
ddply(Soybean, .(Class), colwise(.fun = function(x) sum(is.na(x))))
## Class date plant.stand precip temp hail crop.hist
## 1 2-4-d-injury 1 16 16 16 16 16
## 2 alternarialeaf-spot 0 0 0 0 0 0
## 3 anthracnose 0 0 0 0 0 0
## 4 bacterial-blight 0 0 0 0 0 0
## 5 bacterial-pustule 0 0 0 0 0 0
## 6 brown-spot 0 0 0 0 0 0
## 7 brown-stem-rot 0 0 0 0 0 0
## 8 charcoal-rot 0 0 0 0 0 0
## 9 cyst-nematode 0 14 14 14 14 0
## 10 diaporthe-pod-&-stem-blight 0 6 0 0 15 0
## 11 diaporthe-stem-canker 0 0 0 0 0 0
## 12 downy-mildew 0 0 0 0 0 0
## 13 frog-eye-leaf-spot 0 0 0 0 0 0
## 14 herbicide-injury 0 0 8 0 8 0
## 15 phyllosticta-leaf-spot 0 0 0 0 0 0
## 16 phytophthora-rot 0 0 0 0 68 0
## 17 powdery-mildew 0 0 0 0 0 0
## 18 purple-seed-stain 0 0 0 0 0 0
## 19 rhizoctonia-root-rot 0 0 0 0 0 0
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1 1 16 16 16 16 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 14 14 14 0 0 14 14
## 10 0 15 15 6 0 0 15 15
## 11 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0 0
## 14 0 8 8 8 0 0 0 0
## 15 0 0 0 0 0 0 0 0
## 16 0 68 68 68 0 0 55 55
## 17 0 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0 0
## leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1 0 16 0 16 16 16 16
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 14 14 14 14 0 14 14
## 10 15 15 15 15 0 15 0
## 11 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0
## 14 0 0 0 8 0 8 8
## 15 0 0 0 0 0 0 0
## 16 55 55 55 55 0 68 0
## 17 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1 16 16 16 16 16 16
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## 8 0 0 0 0 0 0
## 9 14 14 14 14 14 14
## 10 0 0 0 0 0 0
## 11 0 0 0 0 0 0
## 12 0 0 0 0 0 0
## 13 0 0 0 0 0 0
## 14 8 8 8 8 8 8
## 15 0 0 0 0 0 0
## 16 0 68 0 0 0 0
## 17 0 0 0 0 0 0
## 18 0 0 0 0 0 0
## 19 0 0 0 0 0 0
## fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1 16 16 16 16 16 16
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## 8 0 0 0 0 0 0
## 9 0 14 0 0 14 0
## 10 0 0 0 0 0 0
## 11 0 0 0 0 0 0
## 12 0 0 0 0 0 0
## 13 0 0 0 0 0 0
## 14 0 8 8 8 8 8
## 15 0 0 0 0 0 0
## 16 68 68 68 68 68 68
## 17 0 0 0 0 0 0
## 18 0 0 0 0 0 0
## 19 0 0 0 0 0 0
## shriveling roots
## 1 16 16
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 14 0
## 10 0 15
## 11 0 0
## 12 0 0
## 13 0 0
## 14 8 0
## 15 0 0
## 16 68 0
## 17 0 0
## 18 0 0
## 19 0 0
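A compact way to total the missing cells per class (a base-R sketch that sidesteps the plyr/dplyr masking noted earlier) is to sum the row-wise NA counts within each class; phytophthora-rot should show the largest total.
# Sketch: total number of missing cells per class, largest first
na_per_row <- rowSums(is.na(Soybean[, -1]))
sort(tapply(na_per_row, Soybean$Class, sum), decreasing = TRUE)[1:5]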
I’ll use k-nearest-neighbors (kNN) imputation, since I have not used it before and would like to learn more about it. kNN imputation replaces a missing value with a weighted average of the values from the most similar (nearest-neighbor) observations. We perform the imputation on all variables except the response.
knnOutput <- knnImputation(Soybean[, !names(Soybean) %in% "Class"])
anyNA(knnOutput)
## [1] FALSE
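Since the missingness is concentrated in particular classes, it may itself be informative. As a complement to imputation, one option is to keep 0/1 indicators recording which values were originally missing. A sketch (the "_missing" suffix and the Soybean_imputed name are illustrative choices, not part of the original analysis):
# Sketch: append a 0/1 missingness indicator for each predictor to the imputed data
missing_flags <- as.data.frame(lapply(Soybean[, -1], function(x) as.integer(is.na(x))))
names(missing_flags) <- paste0(names(missing_flags), "_missing")
Soybean_imputed <- cbind(Class = Soybean$Class, knnOutput, missing_flags)
dim(Soybean_imputed)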