3.1

library(mlbench)  # provides the Glass and Soybean data sets
data(Glass)
glass_df <- as.data.frame(Glass)

summary(glass_df)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

No missing values, so there is no need to remove observations or impute any data.

a.

glass_predictors <- subset(glass_df, select=-Type)

ggplot(gather(glass_predictors), aes(value)) + 
    geom_boxplot() + 
    facet_wrap(~key, scales = 'free')

skewValues <- apply(glass_predictors, 2, skewness)
skewValues
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
ggplot(gather(glass_predictors), aes(value)) + 
    geom_histogram(bins=15) + 
    facet_wrap(~key, scales = 'free')

glass_predictors %>%
  cor() %>%
  corrplot()

b.

From the boxplots, most of the variables appear to have outliers, notably Al, Ca, Na, RI and Si. For Ba and Fe, however, the apparent outliers are largely an artifact of distributions dominated by zeros: both have a median of zero, so almost every nonzero value is drawn as an outlier. K is similar in that most of its values lie below about 0.6, with a long tail reaching 6.21. The histograms support this, showing a very tall bar at or near zero for all three.

Al, Na, and Si are only mildly skewed (absolute skewness below 1), while K (6.46) and Ba (3.37) are strongly right-skewed. The calculated skewness table supports this, and it also shows noticeable skew in Ca, Fe, RI and Mg.
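
The skewness values can also be computed by hand. The sketch below is a minimal base-R version of one common moment-based definition (third central moment divided by the second central moment raised to the power 3/2). The helper name `sample_skewness` and the toy vectors are illustrative assumptions, and the result may differ slightly from `skewness()` depending on which estimator that function uses.

```r
# Moment-based sample skewness: m3 / m2^(3/2), where m_k is the k-th
# central moment computed with denominator n.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^(3/2)
}

# Hypothetical toy vectors, not the Glass data.
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(0, 0, 0, 0, 1, 2, 10)

sample_skewness(symmetric)     # zero for a symmetric sample
sample_skewness(right_skewed)  # positive: long right tail
```

A positive value indicates a long right tail and a negative value a long left tail, which is how the signs in the skewness table should be read.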

Ca and RI are positively correlated, as are Ba and Al to a lesser extent. Si-RI, Al-RI, Ba-Mg, Ca-Mg, and Al-Mg are negatively correlated.
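
Reading pairs off a correlation plot can be backed up with a ranked list. The following is a rough sketch in base R that flattens a correlation matrix into one row per pair, sorted by absolute correlation. The toy data frame and the names `toy` and `cor_pairs` are assumptions for illustration; with the real data, `toy` would be replaced by `glass_predictors`.

```r
# Hypothetical toy predictors, not the Glass data.
set.seed(1)
a <- rnorm(50)
toy <- data.frame(
  a = a,
  b = a + rnorm(50, sd = 0.1),   # built to correlate strongly with a
  c = rnorm(50)                  # independent noise
)

cor_mat <- cor(toy)

# Keep each pair once by taking only the upper triangle.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Strongest relationships first, regardless of sign.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
cor_pairs
```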

c.

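Ba and K could benefit from a log, square root, or inverse transformation to reduce skew, or from the Box-Cox method to identify an appropriate transformation. Both variables contain zeros, so the log, inverse, and Box-Cox options would need a small positive offset first (or a method such as Yeo-Johnson that allows zeros).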
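
To illustrate how such transformations behave on a variable with many zeros, here is a small sketch in base R. The vector `x` is made up for the example (it is not Ba or K), and `skew` is a hypothetical helper using a moment-based definition, so the exact numbers will not match the skewness table.

```r
# Hypothetical right-skewed variable with many exact zeros.
x <- c(rep(0, 20), 0.1, 0.2, 0.5, 1, 2, 3, 6)

# Moment-based skewness, used only to compare before and after.
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^(3/2)
}

log(0)           # -Inf, so a plain log transform fails on zeros

skew(x)          # raw values
skew(sqrt(x))    # square root is defined at zero
skew(log1p(x))   # log(1 + x) is also defined at zero
```

In this toy case both transformations reduce the skewness but neither removes it, because the spike at zero is still there after transforming.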

I would likely remove Al and Ca as predictors, since they show strong correlations with the largest number of other variables. Transformations that add predictors, such as dummy-variable encoding, do not apply here because all of the predictors are numeric.

3.2

data(Soybean)

soy_df <- Soybean
summary(soy_df)
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
soy_predictors <- subset(soy_df, select=-Class)

a.

par(mfrow=c(2,5))
for (i in 1:ncol(soy_predictors)) {
  barplot(table(soy_predictors[i]), ylab='Frequency', xlab=colnames(soy_predictors[i]))
}

Mycelium and sclerotia appear to be degenerate in the bar plots. We can confirm this with the nearZeroVar function, which returns the column indices of predictors with near-zero variance.

nearZeroVar(soy_predictors)
## [1] 18 25 27
colnames(soy_predictors[18])
## [1] "leaf.mild"
colnames(soy_predictors[25])
## [1] "mycelium"
colnames(soy_predictors[27])
## [1] "sclerotia"
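
To see why these three columns are flagged, it helps to compute the frequency ratio by hand. The sketch below uses the level counts from the summary() output earlier in this exercise (NAs left out) and compares them with 95/5, which is the default `freqCut` in `caret::nearZeroVar`. The helper `freq_ratio` is a hypothetical name, and this is only an approximation of what nearZeroVar does: the real function also requires the percentage of unique values to fall below `uniqueCut` (10 by default), which holds here since each column has only two or three distinct values.

```r
# Frequency ratio: count of the most common value divided by the count of
# the second most common value. Large ratios flag near-constant predictors.
freq_ratio <- function(counts) {
  sorted <- sort(counts, decreasing = TRUE)
  unname(sorted[1] / sorted[2])
}

# Level counts copied from the summary() output (NAs excluded).
mycelium  <- c("0" = 639, "1" = 6)
sclerotia <- c("0" = 625, "1" = 20)
leaf_mild <- c("0" = 535, "1" = 20, "2" = 20)
hail      <- c("0" = 435, "1" = 127)   # included as a contrast

ratios <- sapply(
  list(mycelium = mycelium, sclerotia = sclerotia,
       leaf.mild = leaf_mild, hail = hail),
  freq_ratio
)
ratios

# TRUE where the ratio exceeds the default cutoff of 95/5 = 19.
ratios > 95 / 5
```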

b.

mice_plot <- aggr(soy_predictors, col=c('navyblue','yellow'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(soy_predictors), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##           leaves 0.000000000
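
The proportions in this table are the share of NA values in each column, so they can be reproduced without a plotting package. A minimal base-R sketch follows; the data frame `toy` is a made-up stand-in for `soy_predictors`.

```r
# Hypothetical toy data frame standing in for soy_predictors.
toy <- data.frame(
  hail   = c(NA, "0", "1", NA),
  date   = c("5", "4", NA, "3"),
  leaves = c("1", "1", "0", "1")
)

# is.na() gives a logical matrix; column means are the missing proportions.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
missing_share
```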

Variables hail, sever, seed.tmt, and lodging have the most missing data (about 17.7% each), while date and area.dam are missing only about 0.1% and leaves has none. Let’s look at hail to see if there’s any pattern of missingness by Class. We will have to use our soy_df object since it includes the Class column.

soy_df %>%
  group_by(Class) %>%
  mutate(na_count = if_else(is.na(hail), 1,0)) %>%
  summarize(sum(na_count))
## # A tibble: 19 × 2
##    Class                       `sum(na_count)`
##    <fct>                                 <dbl>
##  1 2-4-d-injury                             16
##  2 alternarialeaf-spot                       0
##  3 anthracnose                               0
##  4 bacterial-blight                          0
##  5 bacterial-pustule                         0
##  6 brown-spot                                0
##  7 brown-stem-rot                            0
##  8 charcoal-rot                              0
##  9 cyst-nematode                            14
## 10 diaporthe-pod-&-stem-blight              15
## 11 diaporthe-stem-canker                     0
## 12 downy-mildew                              0
## 13 frog-eye-leaf-spot                        0
## 14 herbicide-injury                          8
## 15 phyllosticta-leaf-spot                    0
## 16 phytophthora-rot                         68
## 17 powdery-mildew                            0
## 18 purple-seed-stain                         0
## 19 rhizoctonia-root-rot                      0

We can see that the number of missing “hail” values varies by Class, and that all of them fall in just five classes. Phytophthora-rot accounts for the majority (68 of 121). Let’s check two more of the variables with the most missing values.

soy_df %>%
  group_by(Class) %>%
  mutate(na_count = if_else(is.na(seed.tmt), 1,0)) %>%
  summarize(sum(na_count))
## # A tibble: 19 × 2
##    Class                       `sum(na_count)`
##    <fct>                                 <dbl>
##  1 2-4-d-injury                             16
##  2 alternarialeaf-spot                       0
##  3 anthracnose                               0
##  4 bacterial-blight                          0
##  5 bacterial-pustule                         0
##  6 brown-spot                                0
##  7 brown-stem-rot                            0
##  8 charcoal-rot                              0
##  9 cyst-nematode                            14
## 10 diaporthe-pod-&-stem-blight              15
## 11 diaporthe-stem-canker                     0
## 12 downy-mildew                              0
## 13 frog-eye-leaf-spot                        0
## 14 herbicide-injury                          8
## 15 phyllosticta-leaf-spot                    0
## 16 phytophthora-rot                         68
## 17 powdery-mildew                            0
## 18 purple-seed-stain                         0
## 19 rhizoctonia-root-rot                      0
soy_df %>%
  group_by(Class) %>%
  mutate(na_count = if_else(is.na(germ), 1,0)) %>%
  summarize(sum(na_count))
## # A tibble: 19 × 2
##    Class                       `sum(na_count)`
##    <fct>                                 <dbl>
##  1 2-4-d-injury                             16
##  2 alternarialeaf-spot                       0
##  3 anthracnose                               0
##  4 bacterial-blight                          0
##  5 bacterial-pustule                         0
##  6 brown-spot                                0
##  7 brown-stem-rot                            0
##  8 charcoal-rot                              0
##  9 cyst-nematode                            14
## 10 diaporthe-pod-&-stem-blight               6
## 11 diaporthe-stem-canker                     0
## 12 downy-mildew                              0
## 13 frog-eye-leaf-spot                        0
## 14 herbicide-injury                          8
## 15 phyllosticta-leaf-spot                    0
## 16 phytophthora-rot                         68
## 17 powdery-mildew                            0
## 18 purple-seed-stain                         0
## 19 rhizoctonia-root-rot                      0

Again, the missing values are confined to the same five classes, and phytophthora-rot has most of them. This suggests an issue with measuring these values for these particular classes rather than data missing at random.
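
Rather than repeating the pipeline for one variable at a time, the missing counts for several predictors can be tabulated by class in one step. The following is a sketch in base R using `rowsum()`; the toy data frame and its class labels are assumptions for illustration, standing in for `soy_df`.

```r
# Hypothetical toy data standing in for soy_df.
toy <- data.frame(
  Class = c("rot", "rot", "spot", "spot", "injury"),
  hail  = c(NA, NA, "0", "1", NA),
  germ  = c(NA, "1", "0", "2", NA)
)

# is.na() on the predictor columns gives a logical matrix; adding 0 makes
# it numeric, and rowsum() adds up its rows within each class.
na_by_class <- rowsum(is.na(toy[, c("hail", "germ")]) + 0, group = toy$Class)
na_by_class
```

Each row of the result is a class and each column a predictor, so classes that account for the missing data stand out as rows with large counts.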

c.

Observations that are missing values for a few predictors could be dropped, or a new predictor could be added that indicates whether or not an observation is missing those values. This could be a “yes/no” type of predictor called “measurement_error,” for example.
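
A minimal sketch of the indicator idea in base R is shown here. The toy data frame, the choice of `hail` and `sever` as the flagged columns, and the column name `hail_explicit` are assumptions for illustration; `addNA()` is shown as a related option that keeps the rows by treating NA as its own factor level.

```r
# Hypothetical toy data standing in for soy_df.
toy <- data.frame(
  Class = c("rot", "rot", "spot", "spot"),
  hail  = factor(c(NA, "0", "1", NA)),
  sever = factor(c(NA, "1", "0", "2"))
)

# Flag rows where any of the chosen predictors is missing.
flag_cols <- c("hail", "sever")
toy$measurement_error <- factor(
  ifelse(rowSums(is.na(toy[flag_cols])) > 0, "yes", "no")
)

# Alternative: keep the rows by turning NA into its own factor level.
toy$hail_explicit <- addNA(toy$hail)

toy
```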

Classes that have a lot of missing values, such as phytophthora-rot, could be eliminated. Predictors with only a few missing values could potentially be imputed, as long as the values are missing at random rather than concentrated in particular classes. Let’s try to find one.

soy_df %>%
  group_by(Class) %>%
  mutate(na_count = if_else(is.na(plant.growth), 1,0)) %>%
  summarize(sum(na_count))
## # A tibble: 19 × 2
##    Class                       `sum(na_count)`
##    <fct>                                 <dbl>
##  1 2-4-d-injury                             16
##  2 alternarialeaf-spot                       0
##  3 anthracnose                               0
##  4 bacterial-blight                          0
##  5 bacterial-pustule                         0
##  6 brown-spot                                0
##  7 brown-stem-rot                            0
##  8 charcoal-rot                              0
##  9 cyst-nematode                             0
## 10 diaporthe-pod-&-stem-blight               0
## 11 diaporthe-stem-canker                     0
## 12 downy-mildew                              0
## 13 frog-eye-leaf-spot                        0
## 14 herbicide-injury                          0
## 15 phyllosticta-leaf-spot                    0
## 16 phytophthora-rot                          0
## 17 powdery-mildew                            0
## 18 purple-seed-stain                         0
## 19 rhizoctonia-root-rot                      0

Plant growth is a bad candidate for imputation because all of its missing values are in one class: 2-4-d-injury. Perhaps plant growth cannot be measured in the same way for this class, in which case it would not make sense to impute the missing values.

soy_df %>%
  group_by(Class) %>%
  mutate(na_count = if_else(is.na(ext.decay), 1,0)) %>%
  summarize(sum(na_count))
## # A tibble: 19 × 2
##    Class                       `sum(na_count)`
##    <fct>                                 <dbl>
##  1 2-4-d-injury                             16
##  2 alternarialeaf-spot                       0
##  3 anthracnose                               0
##  4 bacterial-blight                          0
##  5 bacterial-pustule                         0
##  6 brown-spot                                0
##  7 brown-stem-rot                            0
##  8 charcoal-rot                              0
##  9 cyst-nematode                            14
## 10 diaporthe-pod-&-stem-blight               0
## 11 diaporthe-stem-canker                     0
## 12 downy-mildew                              0
## 13 frog-eye-leaf-spot                        0
## 14 herbicide-injury                          8
## 15 phyllosticta-leaf-spot                    0
## 16 phytophthora-rot                          0
## 17 powdery-mildew                            0
## 18 purple-seed-stain                         0
## 19 rhizoctonia-root-rot                      0

It would appear that the missing values in this data set are generally related to class. If there were a variable appropriate for imputation, I would use the “mice” package with code similar to the following:

# mice_data <- mice(data)
# pred_mat <- mice_data$predictorMatrix
# pred_mat[, c("TARGET_VARIABLE")] <- 0
# impute <- mice(data, method = 'rf', predictorMatrix=pred_mat)
# imputed <- complete(impute)
# summary(imputed)

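This approach imputes each incomplete predictor with random forests fitted on the other predictors. Setting the target variable’s column of the predictor matrix to zero keeps the target from being used as a predictor during imputation.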