Variable Inspection
library(mlbench)
library(tidyverse)
library(gridExtra)
data(Glass)
glass_pred <- Glass[1:9]
plot_it <- function(col){
ggplot(Glass) + geom_histogram(aes_string(x = col), color = 'black', fill = 'gray')
}
plots <- lapply(colnames(glass_pred), plot_it)
grid.arrange(grobs = plots, ncol = 3, nrow = 3)
The notable features of the histograms are the heavy skew in several predictors and the visible clusters of exact zeros in variables such as Mg, K, Ba, and Fe.
library(GGally)
ggpairs(glass_pred, upper ='blank')
There are noticeable outliers and unusual patterns in the pairwise plots, most prevalent when at least one of the variables has a high proportion of zeros. In the Mg-Ba plot, for instance, there are three distinct patterns: a line where Mg is 0, a line where Ba is 0, and an inverted-U shape where both are non-zero.
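To make that Mg-Ba pattern easier to see, here is a quick sketch (not part of the original plots) of just that panel:
# zoom in on the Mg-Ba relationship discussed above
ggplot(Glass, aes(x = Mg, y = Ba)) + geom_point(alpha = 0.5)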
PCA
library(caret)
# preProcess with no method specified defaults to centering and scaling
trans <- preProcess(x = glass_pred)
glass_cs <- predict(trans, glass_pred)
pca_obj <- prcomp(glass_cs)
pca_obj$rotation
## PC1 PC2 PC3 PC4 PC5
## RI -0.5451766 0.28568318 -0.0869108293 -0.14738099 0.073542700
## Na 0.2581256 0.27035007 0.3849196197 -0.49124204 -0.153683304
## Mg -0.1108810 -0.59355826 -0.0084179590 -0.37878577 -0.123509124
## Al 0.4287086 0.29521154 -0.3292371183 0.13750592 -0.014108879
## Si 0.2288364 -0.15509891 0.4587088382 0.65253771 -0.008500117
## K 0.2193440 -0.15397013 -0.6625741197 0.03853544 0.307039842
## Ca -0.4923061 0.34537980 0.0009847321 0.27644322 0.188187742
## Ba 0.2503751 0.48470218 -0.0740547309 -0.13317545 -0.251334261
## Fe -0.1858415 -0.06203879 -0.2844505524 0.23049202 -0.873264047
## PC6 PC7 PC8 PC9
## RI -0.11528772 -0.08186724 -0.75221590 -0.02573194
## Na 0.55811757 -0.14858006 -0.12769315 0.31193718
## Mg -0.30818598 0.20604537 -0.07689061 0.57727335
## Al 0.01885731 0.69923557 -0.27444105 0.19222686
## Si -0.08609797 -0.21606658 -0.37992298 0.29807321
## K 0.24363237 -0.50412141 -0.10981168 0.26050863
## Ca 0.14866937 0.09913463 0.39870468 0.57932321
## Ba -0.65721884 -0.35178255 0.14493235 0.19822820
## Fe 0.24304431 -0.07372136 -0.01627141 0.01466944
rot <- as.data.frame(round(pca_obj$rotation, 3)) %>%
rownames_to_column(var = 'Var') %>%
gather(key = 'PC', value = 'Val', -Var)
ggplot(rot) + geom_tile(aes(x = PC, y = Var, fill = Val))
screeplot(pca_obj)
# proportion of variance explained by each component
pca_obj$sdev^2/sum(pca_obj$sdev^2)
## [1] 0.2790181918 0.2277857983 0.1560937771 0.1286513829 0.1015558052
## [6] 0.0586261325 0.0409953826 0.0070947720 0.0001787575
Only PC5 is dominated by a single variable (Fe); the other components mix several predictors, which points to some collinearity. The scree plot doesn’t have an obvious elbow; the largest dropoffs are between PC2 and PC3 and between PC5 and PC6.
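The cumulative proportion of variance makes the same point; a quick sketch using the object computed above:
# cumulative proportion of variance explained, to go with the scree plot
cumsum(pca_obj$sdev^2) / sum(pca_obj$sdev^2)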
Correlation
library(corrplot)
cor_mat <- cor(glass_pred)
corrplot(cor_mat, order = 'hclust')
Ca and RI, the variables that dominate PC1, are the most highly correlated pair. There is a pocket of collinearity involving RI, Ca, Mg, and K, but apart from the Ca-RI pair these correlations are mild, none exceeding .6. I would only consider PCA to deal with the Ca-RI relationship, and only after seeing a high VIF in the modeling step.
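As a sketch of the kind of VIF check I have in mind (an auxiliary regression among the predictors, not the eventual classification model; it uses the car package, which is loaded later for powerTransform):
# regress RI on the remaining predictors; VIFs much above 5-10, especially for
# Ca, would support using PCA or dropping one of the pair
car::vif(lm(RI ~ ., data = glass_pred))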
library(psych)
library(kableExtra)
# Summary-statistics table (mean, sd, median, min, max, skew, kurtosis, se)
# for the numeric columns of a data frame.
profile <- function(df){
  ins_desc <- df %>%
    keep(is.numeric) %>%
    describe() %>%
    dplyr::select(-c(vars, n, range, mad, trimmed))
  rns <- rownames(ins_desc)
  round_2 <- function(col) round(col, 2)
  ins_desc2 <- ins_desc %>%
    mutate_all(round_2)
  rownames(ins_desc2) <- rns
  ins_desc2 %>%
    kable() %>%
    kable_styling()
}
profile(glass_pred)
| | mean | sd | median | min | max | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|
| RI | 1.52 | 0.00 | 1.52 | 1.51 | 1.53 | 1.60 | 4.72 | 0.00 |
| Na | 13.41 | 0.82 | 13.30 | 10.73 | 17.38 | 0.45 | 2.90 | 0.06 |
| Mg | 2.68 | 1.44 | 3.48 | 0.00 | 4.49 | -1.14 | -0.45 | 0.10 |
| Al | 1.44 | 0.50 | 1.36 | 0.29 | 3.50 | 0.89 | 1.94 | 0.03 |
| Si | 72.65 | 0.77 | 72.79 | 69.81 | 75.41 | -0.72 | 2.82 | 0.05 |
| K | 0.50 | 0.65 | 0.56 | 0.00 | 6.21 | 6.46 | 52.87 | 0.04 |
| Ca | 8.96 | 1.42 | 8.60 | 5.43 | 16.19 | 2.02 | 6.41 | 0.10 |
| Ba | 0.18 | 0.50 | 0.00 | 0.00 | 3.15 | 3.37 | 12.08 | 0.03 |
| Fe | 0.06 | 0.10 | 0.00 | 0.00 | 0.51 | 1.73 | 2.52 | 0.01 |
Skew and kurtosis are a problem in many of these variables, as we saw in the histograms. The Box-Cox transformation also requires strictly positive values, so the variables with a high proportion of zeros have to be shifted before it can be applied, and even then the estimated lambdas can be extreme or numerically unstable. One should proceed with caution when applying Box-Cox to distributions like these.
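To see which predictors are affected, a quick sketch of the proportion of exact zeros per variable:
# proportion of exact zeros in each predictor
sapply(glass_pred, function(x) mean(x == 0))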
We first try the multivariate version of the Box-Cox, which estimates a set of power transformations that makes all of the variables jointly as close to multivariate normal as possible.
library(car)
library(caret)
# shift everything by 1 so every predictor is strictly positive, as Box-Cox requires
new_glass <- glass_pred + 1
bc_summary <- summary(bc <- powerTransform(as.matrix(new_glass)))
bc_summary$result
## Est Power Rounded Pwr Wald Lwr bnd Wald Upr Bnd
## RI -25.0853114 -25.09 -25.1140249 -25.05659793
## Na 1.3755562 1.00 0.6042300 2.14688241
## Mg 1.7699080 2.00 1.5354127 2.00440341
## Al 0.9773267 1.00 0.6451343 1.30951920
## Si 10.9452696 10.95 5.9995589 15.89098032
## K -0.1441078 0.00 -0.3431581 0.05494246
## Ca 0.6774333 0.50 0.4470012 0.90786544
## Ba -6.8620464 -6.86 -8.0320821 -5.69201071
## Fe -14.9245600 -14.92 -17.8279259 -12.02119414
As expected, some of these transformations are extreme. If we were insistent on using the multivariate version, we’d cap the lambdas at an absolute value of 2. The univariate version is preferable here because we don’t want the extreme distributions driving the lambda estimates of the better-behaved predictors.
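For completeness, if we did go the multivariate route, the capping could look like this sketch (not run as part of this analysis; it assumes the bc object fit above and car::bcPower):
# cap the multivariate lambda estimates at +/- 2 and apply them with bcPower
lambda_capped <- pmin(pmax(coef(bc), -2), 2)
glass_mv_bc <- as.data.frame(mapply(car::bcPower, new_glass, lambda_capped))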
trans <- preProcess(x = new_glass, method = c( 'BoxCox'))
trans
## Created from 214 samples and 9 variables
##
## Pre-processing:
## - Box-Cox transformation (9)
## - ignored (0)
##
## Lambda estimates for Box-Cox transformation:
## -2, -0.2, 2, 0, 2, -1, -1.4, -2, -2
This procedure apparently caps the lambdas at an absolute value of 2. With that ceiling in place, the transformations are still quite similar to the multivariate version.
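For reference, the per-variable lambdas can be matched back to their names; this sketch assumes the usual internal structure of a caret preProcess object, where the Box-Cox transforms live in trans$bc:
# per-variable Box-Cox lambda estimates from the caret preProcess object
sapply(trans$bc, function(x) x$lambda)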
Now we’ll have a look at the distributions after the Box-Cox transformation.
glass_trans <- predict(trans, new_glass)
plot_it2 <- function(col){
ggplot(glass_trans) + geom_histogram(aes_string(x = col), color = 'black', fill = 'gray')
}
plots <- lapply(colnames(glass_trans), plot_it2)
grid.arrange(grobs = plots, ncol = 3, nrow = 3)
The distribution of Al looks nearly normal, as do Na and Si apart from a couple of outliers. The zero-heavy variables can’t really be helped.
profile(glass_trans)
| | mean | sd | median | min | max | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|
| RI | 0.42 | 0.00 | 0.42 | 0.42 | 0.42 | 1.58 | 4.62 | 0.00 |
| Na | 2.67 | 0.06 | 2.66 | 2.46 | 2.91 | 0.06 | 2.50 | 0.00 |
| Mg | 7.32 | 4.16 | 9.54 | 0.00 | 14.57 | -0.93 | -0.79 | 0.28 |
| Al | 0.87 | 0.20 | 0.86 | 0.25 | 1.50 | 0.00 | 1.00 | 0.01 |
| Si | 2712.03 | 56.83 | 2721.98 | 2506.53 | 2918.74 | -0.65 | 2.78 | 3.88 |
| K | 0.28 | 0.17 | 0.36 | 0.00 | 0.86 | -0.09 | 0.13 | 0.01 |
| Ca | 0.68 | 0.01 | 0.68 | 0.66 | 0.70 | -0.25 | 4.33 | 0.00 |
| Ba | 0.06 | 0.13 | 0.00 | 0.00 | 0.47 | 2.14 | 2.92 | 0.01 |
| Fe | 0.04 | 0.07 | 0.00 | 0.00 | 0.28 | 1.33 | 0.45 | 0.00 |
Al now has zero skew, and the skew of about half of the variables has moved closer to zero. The only variables that still have high skew either had many zeros (Mg, Ba, Fe) or high kurtosis (RI). In the modeling step, I would compare no Box-Cox, Box-Cox applied to all variables, and Box-Cox applied only to the five variables it helped the most. In practice, I would also round the lambdas to the nearest 0.5.
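A sketch of those last two ideas, rounding the lambdas and transforming only a subset; it reuses the trans$bc structure assumed above, and the subset named here is a hypothetical placeholder rather than the actual five variables chosen:
# round the caret lambdas to the nearest 0.5 and apply Box-Cox to a subset only
lambdas <- sapply(trans$bc, function(x) x$lambda)
lambdas_rounded <- round(lambdas * 2) / 2
subset_vars <- c('Na', 'Al', 'Si', 'K', 'Ca')  # hypothetical choice of variables
glass_selective <- new_glass
glass_selective[subset_vars] <- mapply(car::bcPower, new_glass[subset_vars],
                                       lambdas_rounded[subset_vars])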
data("Soybean")
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## plant.growth leaves leaf.halo leaf.marg leaf.size leaf.shread
## 0 :441 0: 77 0 :221 0 :357 0 : 51 0 :487
## 1 :226 1:606 1 : 36 1 : 21 1 :327 1 : 96
## NA's: 16 2 :342 2 :221 2 :221 NA's:100
## NA's: 84 NA's: 84 NA's: 84
##
##
##
## leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 0 :554 0 :535 0 :296 0 :520 0 :379 0 :320
## 1 : 45 1 : 20 1 :371 1 : 42 1 : 39 1 : 83
## NA's: 84 2 : 20 NA's: 16 NA's:121 2 : 36 2 :177
## NA's:108 3 :191 3 : 65
## NA's: 38 NA's: 38
##
##
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 0 :473 0 :497 0 :639 0 :581 0 :625 0 :407
## 1 :104 1 :135 1 : 6 1 : 44 1 : 20 1 :130
## NA's:106 2 : 13 NA's: 38 2 : 20 NA's: 38 2 : 14
## NA's: 38 NA's: 38 3 : 48
## NA's: 84
##
##
## fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 0 :345 0 :476 0 :524 0 :513 0 :532 0 :539
## 1 : 75 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
## 2 : 57 NA's: 92 NA's: 92 NA's:106 NA's: 92 NA's:106
## 4 :100
## NA's:106
##
##
## roots
## 0 :551
## 1 : 86
## 2 : 15
## NA's: 31
##
##
##
With the NAs counted as their own level, none of these categoricals meet the first degenerate-variable heuristic from the Kuhn and Johnson book, a ratio greater than 20 between the frequencies of the most and second most prevalent values. Some, however, are dominated by a single value; below we flag any variable whose most frequent value (again treating NA as ‘unknown’) accounts for at least 90% of the samples.
df <- Soybean %>% gather('var', 'value') %>%
mutate(value = as.character(value),
value = ifelse(is.na(value), 'unknown', value),
rows = nrow(Soybean)
) %>%
group_by(var, rows, value) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(proportion = count/rows) %>%
filter(proportion >= .9)
kable(df) %>% kable_styling()
| var | rows | value | count | proportion |
|---|---|---|---|---|
| mycelium | 683 | 0 | 639 | 0.9355783 |
| sclerotia | 683 | 0 | 625 | 0.9150805 |
Based on that criterion, we would likely drop the variables mycelium and sclerotia.
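As a side check on the first heuristic (a sketch, not part of the table above), the frequency ratios can be computed directly, counting the NAs as their own level:
# ratio of the most prevalent value to the second most prevalent value per variable
freq_ratio <- sapply(dplyr::select(Soybean, -Class), function(x) {
  tab <- sort(table(addNA(x)), decreasing = TRUE)
  as.numeric(tab[1] / tab[2])
})
sort(freq_ratio, decreasing = TRUE)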
library(naniar)
gg_miss_case(as.data.frame(Soybean))
The missing values are clearly not random: there are roughly 100 observations in which close to half of the variables are missing.
To key in on particular patterns, we limit the upset plot to the five variables with the most missing values and the five largest intersections:
gg_miss_upset(Soybean, nsets = 5, nintersects = 5)
I don’t have the domain knowledge to explain the group of 121 rows with missing values. However, given the size of the intersections between missing variables, the missingness appears to be non-random. All of the missing values for hail, sever, seed.tmt, and lodging occur on the same rows.
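That last claim can be checked directly with a small sketch:
# do hail, sever, seed.tmt, and lodging share exactly the same missing rows?
with(Soybean, all(is.na(hail) == is.na(sever),
                  is.na(hail) == is.na(seed.tmt),
                  is.na(hail) == is.na(lodging)))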
My first-line strategy for missing values in categorical data is to replace the NAs with ‘unknown’. This extends the feature space and can improve predictive power when the missingness is non-random. For example, if the missing data were caused by temperatures too low to collect data, the ‘unknown’ level would clearly add explanatory power.
I would also drop the variables identified in Part A as degenerate.
The code below produces the model matrix using this strategy.
replace_it <- function(x) ifelse(is.na(x), '_unknown', x)
soybean_2 <- dplyr::select(Soybean, - c(mycelium, sclerotia, Class)) %>%
mutate_all(as.character) %>%
mutate_all(replace_it)
model_mat <- dummyVars(~ ., data = soybean_2)
new_soy <- predict(model_mat, soybean_2)
If the values are missing at random, they could instead be imputed with a package like mice, which imputes categoricals with logistic regression (polytomous regression for factors with more than two levels). One rough way to judge whether the data are missing at random is to check whether the ‘unknown’ indicator variables created above come out non-significant in the model.
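As a simpler proxy than refitting the model (a sketch, not part of the original analysis), one could test whether the missingness of a variable is associated with the response:
# is missingness in hail related to Class? a strong association argues
# against the values being missing completely at random
miss_hail <- is.na(Soybean$hail)
chisq.test(table(miss_hail, Soybean$Class))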
The code below shows how to create the imputed dataset with mice. In practice, you’d increase the maxit and m arguments, assuming execution time is not an issue.
library(mice)
Soybean_3 <- dplyr::select(Soybean, -c(mycelium, sclerotia, Class))
# impute the categorical predictors with polytomous regression
pred <- mice(data = Soybean_3, m = 3, method = "polyreg", maxit = 2)
# fill in the NAs using the first set of imputations
imputed_soy <- complete(pred)