R Markdown

1. Would we expect the performance of a flexible statistical learning method to be better or worse than that of an inflexible one?

a) The sample size n is extremely large, and the number of predictors p is small.

Flexible would be better. An extremely large n allows a flexible model to learn patterns without excessive over-fitting, achieving better fit to the true data distribution. Moreover, with a small p, interpretability should not be an issue.

b) The number of predictors p is extremely large, and the number of observations n is small.

Flexible would be worse. In this scenario, you have a high-dimensional relationship but not enough data. Both flexible and inflexible models would struggle to estimate the relationship reliably with so few observations, but flexible models are especially vulnerable to overfitting the noise.
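The overfitting risk can be sketched with a small simulation. Everything here is an assumption made for illustration: the data are simulated, only the first predictor truly matters, and the number of predictors in a linear model stands in for flexibility.

```r
set.seed(1)
n <- 40; p <- 30                       # few observations, many predictors

# Simulated data (assumption): only x1 affects y, the rest are pure noise
make_data <- function(n) {
  X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
  data.frame(y = 2 * X[, 1] + rnorm(n), X)
}
train <- make_data(n)
test  <- make_data(2000)

fit_small <- lm(y ~ x1, data = train)  # inflexible: one predictor
fit_full  <- lm(y ~ ., data = train)   # flexible: all 30 predictors

mse <- function(fit, d) mean((d$y - predict(fit, d))^2)
results <- rbind(
  train = c(small = mse(fit_small, train), full = mse(fit_full, train)),
  test  = c(small = mse(fit_small, test),  full = mse(fit_full, test))
)
print(round(results, 2))
```

The full model always fits the training data at least as well because it contains the small model, but with n this close to p its test error should be noticeably worse.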

c) The relationship between the predictors and response is highly non-linear.

Flexible would be better. Inflexible models struggle to fit non-linear relationships. However, although linear models are relatively “inflexible,” I think it’s important to remember that linear regression can also include polynomial terms and hence fit many scenarios (just not this case).
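To make the point about polynomial terms concrete, here is a minimal sketch on simulated data. The sine-shaped truth, the noise level, and the degree-5 polynomial are all assumptions chosen for illustration. Both fits use lm, so both are linear in their coefficients, but the polynomial one is the more flexible of the two.

```r
set.seed(2)
f <- function(x) sin(2 * pi * x)               # assumed non-linear truth

train <- data.frame(x = runif(200))
train$y <- f(train$x) + rnorm(200, sd = 0.3)
test <- data.frame(x = runif(2000))
test$y <- f(test$x) + rnorm(2000, sd = 0.3)

fit_line <- lm(y ~ x, data = train)            # straight line
fit_poly <- lm(y ~ poly(x, 5), data = train)   # degree-5 polynomial terms

test_mse <- c(
  line = mean((test$y - predict(fit_line, test))^2),
  poly = mean((test$y - predict(fit_poly, test))^2)
)
print(round(test_mse, 3))
```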

d) The variance of the error terms, \(\sigma^2 = \mathrm{Var}(\varepsilon)\), is extremely high.

Flexible would be worse. In general, you want both bias and variance to be small. In this scenario, however, the variance of the irreducible error is high, so the training data are dominated by noise. Flexible models fit the data more closely, noise included, which gives them higher variance and leads to overfitting. An inflexible method adds less variance of its own on top of the noise that cannot be removed, so it is the safer choice here.
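A rough simulation of this trade-off, under assumptions of my own choosing: a sine-shaped truth, 50 fixed training inputs, a straight line as the inflexible method, and a degree-5 polynomial as the flexible one. The error reported is the average squared distance between the fitted curve and the true function, so it excludes the irreducible part.

```r
set.seed(3)
f <- function(x) sin(2 * pi * x)                 # assumed true function
x_train <- seq(0, 1, length.out = 50)            # fixed training inputs
x_grid  <- seq(0.05, 0.95, length.out = 50)      # evaluation points

# Average squared distance between the fitted curve and f over many noisy samples
fit_error <- function(degree, sigma, reps = 200) {
  mean(replicate(reps, {
    d <- data.frame(x = x_train, y = f(x_train) + rnorm(50, sd = sigma))
    fit <- lm(y ~ poly(x, degree), data = d)
    mean((predict(fit, data.frame(x = x_grid)) - f(x_grid))^2)
  }))
}

errors <- sapply(c(low_noise = 0.1, high_noise = 5), function(s)
  c(inflexible = fit_error(1, s), flexible = fit_error(5, s)))
print(round(errors, 3))
```

With little noise the flexible fit should win because the straight line is biased. With a lot of noise the ordering should flip, because the flexible fit's extra variance outweighs its lower bias.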

2. Bias-variance decomposition

a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be 5 curves.

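# Illustrative curve shapes only; these are not estimated from data
f1 <- function(x) {(x-1)^2}            # (squared) bias
f2 <- function(x) {x^2}                # variance
f3 <- function(x) {(x-1)^2+0.1}        # training error
f4 <- function(x) {(2*(x-0.5))^2+0.5}  # test error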

plot(f1, 0, 1, ylab = "value", xlab = "flexibility", col = "red", main = "Bias-Variance Decomposition", ylim = c(0,1.5))
curve(f2, add = TRUE, col = "blue")
curve(f3, add = TRUE, col = "green")
curve(f4, add = TRUE, col = "purple")
abline(h=0.4, col = "orange")

legend("topleft", legend = c("(squared) bias", "variance", "training error", "test error", "Bayes (irreducible) error"), col = c("red", "blue", "green", "purple", "orange"), lty = 1, cex = 0.5)

b) Explain why each curve has the shape displayed in (a).

Bias decreases as flexibility increases, because a more flexible model can track the true relationship more closely. Variance increases as flexibility increases, because the fit becomes more sensitive to the particular training sample. Training error decreases steadily as flexibility increases. Test error is U-shaped: it falls at first as bias drops, but past a certain point the model overfits, and the rise in variance outweighs the further drop in bias, so test error increases even while training error keeps falling. Irreducible error is inherent to the data and not affected by the model, so it stays constant as flexibility increases.
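These shapes can also be estimated rather than sketched. The following is a minimal simulation under assumptions of my own: a sine-shaped truth, noise standard deviation 0.3, 50 fixed training inputs, and polynomial degree as the measure of flexibility. It refits each model on 200 noisy training sets and estimates squared bias and variance at fixed test inputs.

```r
set.seed(4)
f <- function(x) sin(2 * pi * x)                 # assumed true function
sigma <- 0.3                                     # assumed noise sd
x_train <- seq(0, 1, length.out = 50)            # fixed training inputs
x0 <- seq(0.05, 0.95, length.out = 50)           # fixed test inputs
degrees <- c(1, 3, 10)                           # increasing flexibility

decomp <- sapply(degrees, function(deg) {
  # Each column of preds holds one training set's predictions at x0
  preds <- replicate(200, {
    d <- data.frame(x = x_train, y = f(x_train) + rnorm(50, sd = sigma))
    predict(lm(y ~ poly(x, deg), data = d), data.frame(x = x0))
  })
  c(bias2    = mean((rowMeans(preds) - f(x0))^2),
    variance = mean(apply(preds, 1, var)))
})
colnames(decomp) <- paste0("degree_", degrees)

# Expected test error = squared bias + variance + irreducible error
decomp <- rbind(decomp, expected_test = colSums(decomp) + sigma^2)
print(round(decomp, 4))
```

The last row adds the irreducible error sigma^2 to the other two terms, which is why the test error curve can never drop below the Bayes error line.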

3. Boston housing data set

a) Load the Boston data set, part of ISLR2 library. How many rows, columns, and what do they mean?

library(ISLR2)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
dim(Boston)
## [1] 506  13

The Boston data set from ISLR2 contains housing values in 506 suburbs of Boston. It is a data frame with 506 rows, one per suburb, and 13 columns, one per housing variable: crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq. ft.), indus (proportion of non-retail business acres per town), chas (Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)), nox (nitrogen oxides concentration (parts per 10 million)), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted mean of distances to five Boston employment centers), rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), ptratio (pupil-teacher ratio by town), lstat (lower status of the population (percent)), and medv (median value of owner-occupied homes in $1000s).

b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings:

pairs(Boston)

lstat is positively correlated with nox and age and negatively correlated with rm, dis, and medv. This makes sense since lower-income families are more likely to be exposed to air pollution. Their homes are also likely to be lower in value, closer to employment centers (downtown/factories), and smaller, with fewer rooms. nox is positively correlated with indus and age and negatively correlated with dis. This may reflect nitrogen oxide emissions from factories and other industry: the closer a tract is to employment centers (downtown/factories), the more likely it is to be exposed to air pollution.

c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

round(cor(Boston),2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         lstat  medv
## crim     0.46 -0.39
## zn      -0.41  0.36
## indus    0.60 -0.48
## chas    -0.05  0.18
## nox      0.59 -0.43
## rm      -0.61  0.70
## age      0.60 -0.38
## dis     -0.50  0.25
## rad      0.49 -0.38
## tax      0.54 -0.47
## ptratio  0.37 -0.51
## lstat    1.00 -0.74
## medv    -0.74  1.00
cor.test(Boston$crim, Boston$rad)
## 
##  Pearson's product-moment correlation
## 
## data:  Boston$crim and Boston$rad
## t = 17.998, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5693817 0.6758248
## sample estimates:
##       cor 
## 0.6255051
cor.test(Boston$crim, Boston$tax)
## 
##  Pearson's product-moment correlation
## 
## data:  Boston$crim and Boston$tax
## t = 16.099, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5221186 0.6375464
## sample estimates:
##       cor 
## 0.5827643

Based on the correlation coefficients, per capita crime rate has a positive correlation with rad (index of accessibility to radial highways) of r=0.63 and a positive correlation with tax (full-value property-tax rate per $10,000) of r=0.58. Both correlations are statistically significant (p-value < 2.2e-16) but only moderate in strength, falling below the r=0.75 often used as a rule-of-thumb cutoff for a strong correlation.
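Rather than scanning the full matrix by eye, the crim column can be pulled out and ranked by absolute correlation. This is a small sketch; the fallback to MASS::Boston (the same data with one extra column, black) is my assumption for machines where ISLR2 is not installed.

```r
# Assumption: use MASS::Boston (same columns plus `black`) if ISLR2 is unavailable
boston <- if (requireNamespace("ISLR2", quietly = TRUE)) ISLR2::Boston else MASS::Boston

crim_cor <- cor(boston)[, "crim"]
crim_cor <- crim_cor[names(crim_cor) != "crim"]       # drop the self-correlation
ranked <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]
print(round(ranked, 2))
```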

d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

plot(Boston$crim)

summary(Boston$crim)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620
plot(Boston$tax)

summary(Boston$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0
plot(Boston$ptratio)

summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00
print(which(Boston$crim > 3.62 & Boston$tax > 330 & Boston$ptratio > 19.05))
##   [1] 357 358 359 360 361 362 363 364 366 367 368 369 370 371 372 373 374 375
##  [19] 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393
##  [37] 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411
##  [55] 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429
##  [73] 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447
##  [91] 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465
## [109] 467 468 469 470 471 472 474 475 476 477 478 479 480 481 482 483 486 487
## [127] 488

Yes, some census tracts have particularly high crime rates, tax rates, and pupil-teacher ratios. It is interesting to note that census tracts 357 to 488 (with a few exceptions) combine an above-average crime rate with above-median tax rates and pupil-teacher ratios. Ordered by width of range, the predictors are tax rate (187 to 711) > crime rate (0.0063 to 88.98) > pupil-teacher ratio (12.6 to 22). However, the crime rate has a 3rd quartile of 3.68 against a maximum of 88.98. In other words, it has by far the most extreme outliers relative to its typical values.
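One way to put a number on "relative outliers" is the 1.5 x IQR rule used for boxplot whiskers. This rule is my substitution for illustration, not something the exercise prescribes, and the MASS::Boston fallback is likewise an assumption.

```r
# Assumption: use MASS::Boston (same columns plus `black`) if ISLR2 is unavailable
boston <- if (requireNamespace("ISLR2", quietly = TRUE)) ISLR2::Boston else MASS::Boston

# Count tracts above the upper fence Q3 + 1.5 * IQR for each predictor
upper_outliers <- sapply(boston[c("crim", "tax", "ptratio")], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
  sum(x > fence)
})
print(upper_outliers)
```

Only a predictor whose largest values sit far above its upper quartile gets a non-zero count, which separates the long right tail of crim from the bounded ranges of tax and ptratio.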

e) How many of the census tracts in this data set bound the Charles river?

hist(Boston$chas)$count

##  [1] 471   0   0   0   0   0   0   0   0  35

35 census tracts bound the Charles River.
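Because chas is a 0/1 dummy variable, the same count can be read off more directly with table() or sum(). A short sketch, with the MASS::Boston fallback as an assumption:

```r
# Assumption: use MASS::Boston (same columns plus `black`) if ISLR2 is unavailable
boston <- if (requireNamespace("ISLR2", quietly = TRUE)) ISLR2::Boston else MASS::Boston

print(table(boston$chas))               # tracts off (0) and on (1) the river
river_tracts <- sum(boston$chas == 1)   # tracts that bound the river
print(river_tracts)
```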

f) What is the median pupil-teacher ratio among the towns in this data set?

summary(Boston$ptratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

The median pupil-teacher ratio is 19.05.

g) Which census tract of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

head(Boston[order(Boston$medv),])
##         crim zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv
## 399 38.35180  0 18.10    0 0.693 5.453 100.0 1.4896  24 666    20.2 30.59  5.0
## 406 67.92080  0 18.10    0 0.693 5.683 100.0 1.4254  24 666    20.2 22.98  5.0
## 401 25.04610  0 18.10    0 0.693 5.987 100.0 1.5888  24 666    20.2 26.77  5.6
## 400  9.91655  0 18.10    0 0.693 5.852  77.8 1.5004  24 666    20.2 29.97  6.3
## 415 45.74610  0 18.10    0 0.693 4.519 100.0 1.6582  24 666    20.2 36.98  7.0
## 490  0.18337  0 27.74    0 0.609 5.414  98.3 1.7554   4 711    20.1 23.97  7.0
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

Census tracts #399 and #406 are tied for the lowest median home value, $5,000 (medv = 5.0). Both have a crime rate above the 3rd quartile, a proportion of non-retail business acres at the 3rd quartile, no border on the Charles River (like most tracts), a nitrogen oxides concentration above the 3rd quartile, an average number of rooms below the 1st quartile, all units built before 1940 (age = 100), a distance to Boston employment centers below the 1st quartile, the highest index of accessibility to highways, a tax rate at the 3rd quartile, a pupil-teacher ratio at the 3rd quartile, and a lower-status percentage above the 3rd quartile. Given the relatively high crime rate, nitrogen oxides concentration, older units, few rooms, and lower socioeconomic status, these 2 census tracts are likely less desirable to higher-income families. This is consistent with their having the lowest median value of owner-occupied homes in the data set.
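The comparison against the overall ranges can be made systematic by converting each value to a percentile rank within its column. This is a sketch using ecdf(); the MASS::Boston fallback is an assumption.

```r
# Assumption: use MASS::Boston (same columns plus `black`) if ISLR2 is unavailable
boston <- if (requireNamespace("ISLR2", quietly = TRUE)) ISLR2::Boston else MASS::Boston

lowest <- which(boston$medv == min(boston$medv))   # every tract tied for the minimum
print(lowest)

# Percentile rank (0 to 100) of each of those tracts within every column
pct_rank <- sapply(boston, function(col) round(100 * ecdf(col)(col[lowest])))
rownames(pct_rank) <- lowest
print(pct_rank)
```

A rank near 100 means the tract is at the top of that predictor's distribution, and a rank near 0 means it is at the bottom.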

h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

dim(subset(Boston, rm>7))
## [1] 64 13
dim(subset(Boston, rm>8))
## [1] 13 13
summary(subset(Boston, rm>8))
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          lstat           medv     
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
##  Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0

64 census tracts average more than 7 rooms per dwelling, and 13 average more than 8. What stands out most about the tracts with more than 8 rooms per dwelling is that their median percentage of lower-status population is much lower than the overall median (4.14 vs. 11.36). Moreover, their median home value is much higher than the overall median (48.3 vs. 21.2, in $1000s).
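The medians quoted above can be placed side by side in one small table. A sketch, with the choice of three columns and the MASS::Boston fallback both being my assumptions:

```r
# Assumption: use MASS::Boston (same columns plus `black`) if ISLR2 is unavailable
boston <- if (requireNamespace("ISLR2", quietly = TRUE)) ISLR2::Boston else MASS::Boston

big <- subset(boston, rm > 8)            # tracts averaging more than 8 rooms
cols <- c("lstat", "medv", "crim")
comparison <- rbind(
  over_8_rooms = sapply(big[cols], median),
  all_tracts   = sapply(boston[cols], median)
)
print(round(comparison, 2))
```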

5. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)  # the Soybean data set ships with the mlbench package
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

table(Soybean$Class)
## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20
barplot(table(Soybean$Class), main="Frequency Distribution of Class", xlab="Class", ylab="Frequency")

library(caret)
nearZeroVar(Soybean)
## [1] 19 26 28
colnames(Soybean)[nearZeroVar(Soybean)]
## [1] "leaf.mild" "mycelium"  "sclerotia"
Soybean1 <- Soybean[,-nearZeroVar(Soybean)]
Soybean1[is.na(Soybean1)] <- 0
Soybean1 <- Soybean1[,-1]
for (col in names(Soybean1)){
  Soybean1[[col]] <- as.numeric(as.character(Soybean1[[col]]))
}

segCorr <- cor(Soybean1)
library(corrplot)
## corrplot 0.94 loaded
corrplot(segCorr, order="hclust", tl.cex=.35)

highCorr <- findCorrelation(segCorr, .75)
colnames(segCorr)[highCorr]
## [1] "leaf.marg"

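Yes. Columns 19, 26, and 28 (leaf.mild, mycelium, and sclerotia) have near-zero variance, meaning that almost all observations share a single value, so they are unlikely to contribute much to a model’s predictive power. Moreover, leaf.marg has a pairwise correlation above 0.75 with at least one other predictor, which makes it a candidate for removal.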
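For intuition about what a near-zero-variance flag means, the rule can be written out by hand. This is a sketch and not caret's actual code: is_near_zero_var is a hypothetical helper, the toy data are made up, and the two cutoffs are meant to mirror the defaults documented for nearZeroVar (freqCut = 95/5, uniqueCut = 10).

```r
# Hypothetical helper: flag a column when its most common value dominates
# (frequency ratio above freq_cut) and it has few distinct values
# (percent unique below unique_cut).
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- table(x)                                  # table() ignores NAs
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)                # one value only: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]             # most common vs. runner-up
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Made-up columns for illustration
toy <- data.frame(
  balanced = factor(rep(c("0", "1"), each = 50)),
  lopsided = factor(c(rep("0", 97), rep("1", 3))),
  constant = factor(rep("0", 100))
)
flags <- sapply(toy, is_near_zero_var)
print(flags)
```

Requiring both conditions keeps the rule from flagging a column that is merely skewed but still takes many distinct values.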

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

#tempSoybean <- Soybean[, c("Class", "roots", "temp", "crop.hist", "plant.growth", "stem", "date", "area.dam", "leaves")]
#tempData <- mice(tempSoybean, m = 5, maxit = 50, meth = 'pmm', seed = 500)
#completedData <- complete(tempData,1)
#head(completedData)

library(mice)  # provides quickpred(), mice(), and complete()
tempSoybean <- Soybean[, !(names(Soybean) %in% c("leaf.mild", "mycelium", "sclerotia", "leaf.marg"))]
#tempSoybean <- Soybean[, !(names(Soybean) %in% c("leaf.mild", "mycelium", "sclerotia", "leaf.marg", "hail","sever", "seed.tmt", "lodging"))]
new.pred <- quickpred(tempSoybean)
tempData <- mice(tempSoybean, method="pmm", seed=1000, pred=new.pred)
## 
##  iter imp variable
##   1   1  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.size  leaf.shread*  leaf.malf  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   1   2  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.size  leaf.shread*  leaf.malf  stem  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   1   3  date  plant.stand*  precip  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.size  leaf.shread  leaf.malf  stem  lodging  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   1   4  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.size  leaf.shread  leaf.malf  stem*  lodging  stem.cankers*  canker.lesion  fruiting.bodies*  ext.decay  int.discolor  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   1   5  date  plant.stand*  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.size  leaf.shread  leaf.malf  stem  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   1  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   2  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.size  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   3  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   4  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   5  date  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.size  leaf.shread*  leaf.malf  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   3   1  date*  plant.stand*  precip  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion  fruiting.bodies  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   3   2  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread  leaf.malf*  stem*  lodging  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling  roots*
##   3   3  date  plant.stand*  precip  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   3   4  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling  roots*
##   3   5  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots
##   4   1  date*  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   4   2  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   4   3  date  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.size*  leaf.shread  leaf.malf  stem  lodging  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   4   4  date  plant.stand  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   4   5  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots
##   5   1  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   5   2  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   5   3  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   5   4  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   5   5  date*  plant.stand  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
## Warning: Number of logged events: 1319
completedData <- complete(tempData,1)
head(completedData)
##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.size leaf.shread
## 1     1        0    0            1      1         0         2           0
## 2     2        1    1            1      1         0         2           0
## 3     2        1    2            1      1         0         2           0
## 4     2        0    1            1      1         0         2           0
## 5     1        0    2            1      1         0         2           0
## 6     1        0    1            1      1         0         2           0
##   leaf.malf stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 1         0    1       1            3             1               1         1
## 2         0    1       0            3             1               1         1
## 3         0    1       0            3             0               1         1
## 4         0    1       0            3             0               1         1
## 5         0    1       0            3             1               1         1
## 6         0    1       0            3             0               1         1
##   int.discolor fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1            0          0           4    0           0             0         0
## 2            0          0           4    0           0             0         0
## 3            0          0           4    0           0             0         0
## 4            0          0           4    0           0             0         0
## 5            0          0           4    0           0             0         0
## 6            0          0           4    0           0             0         0
##   shriveling roots
## 1          0     0
## 2          0     0
## 3          0     0
## 4          0     0
## 5          0     0
## 6          0     0
completedData <- as.data.frame(completedData)
library(VIM)  # provides aggr() for visualizing missingness
aggr_plot <- aggr(completedData, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(completedData), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable Count
##            Class     0
##             date     0
##      plant.stand     0
##           precip     0
##             temp     0
##             hail     0
##        crop.hist     0
##         area.dam     0
##            sever     0
##         seed.tmt     0
##             germ     0
##     plant.growth     0
##           leaves     0
##        leaf.halo     0
##        leaf.size     0
##      leaf.shread     0
##        leaf.malf     0
##             stem     0
##          lodging     0
##     stem.cankers     0
##    canker.lesion     0
##  fruiting.bodies     0
##        ext.decay     0
##     int.discolor     0
##       fruit.pods     0
##      fruit.spots     0
##             seed     0
##      mold.growth     0
##    seed.discolor     0
##        seed.size     0
##       shriveling     0
##            roots     0

In part a), I show how to eliminate predictors with degenerate distributions, as well as the result of removing leaf.marg, a predictor that is too highly correlated with another. In part b), I identified that only 8 of the 35 predictors have an acceptable level of missing data (<5%). My code includes commented-out experiments that remove additional predictors with a large amount of missing data. However, since I do not have domain expertise, I concluded it was best to impute all of them. The quickpred function allows the mice package’s imputation to finish within a reasonable time frame. In further work, I would experiment with the other imputation methods listed by methods(mice).
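The percent-missing check behind the 5% threshold, together with a much simpler baseline imputation, can be sketched on toy data. The toy data frame and the impute_mode helper are hypothetical, and mode imputation is a stand-in for illustration only; it is not the predictive mean matching that mice performs above.

```r
# Made-up data standing in for Soybean; NA marks a missing value
toy <- data.frame(
  temp  = factor(c("0", "1", "1", NA, "1", "2", "1", "1", "0", "1")),
  hail  = factor(c("0", NA, NA, "0", "1", NA, "0", "0", NA, "0")),
  roots = factor(c("0", "0", "0", "0", "1", "0", "0", "0", "0", "0"))
)

pct_missing <- 100 * colMeans(is.na(toy))    # percent missing per column
print(pct_missing)
print(names(pct_missing)[pct_missing < 5])   # columns under the 5% threshold

# Hypothetical baseline: replace each NA with the column's most frequent level
impute_mode <- function(x) {
  x[is.na(x)] <- names(which.max(table(x)))
  x
}
toy_imputed <- as.data.frame(lapply(toy, impute_mode))
print(colSums(is.na(toy_imputed)))
```

Mode imputation ignores the relationships between predictors, which is the main reason to prefer a model-based approach such as mice when the run time is acceptable.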