STA 6923 Fall 2024 Homework 1


1. Would we expect performance of a flexible statistical learning method to be better/worse than inflexible one?

a) The sample size n is extremely large, and the number of predictors p is small.

Flexible would be better. An extremely large n allows a flexible model to learn patterns without excessive over-fitting, achieving better fit to the true data distribution. Moreover, with a small p, interpretability should not be an issue.

b) The number of predictors p is extremely large, and the number of observations n is small.

Flexible would be worse. In this scenario, the relationship is high-dimensional but there is not enough data to estimate it. Both flexible and inflexible models would struggle to produce reliable estimates given the limited n, but flexible models are especially vulnerable to overfitting the noise.

c) The relationship between the predictors and response is highly non-linear.

Flexible would be better. Inflexible models struggle to fit non-linear relationships. Although linear models are relatively “inflexible,” it is worth remembering that linear regression can include polynomial terms and so fit many curved relationships (just not a highly non-linear one like this case).

d) The variance of the error terms, $\sigma^2=var(\varepsilon)$, is extremely high.

Flexible would be worse. Expected test error is the sum of the squared bias, the variance of the fitted model, and the irreducible error $\sigma^2$. In this scenario $\sigma^2$ is already high and cannot be reduced, so it is important to choose a method that adds little variance of its own. Flexible methods fit the training data more closely, noise included, so with very noisy data they tend to overfit. An inflexible method is the safer choice.
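
The answers above can be checked with a small simulation. The following is a minimal sketch rather than part of the assignment: the helper `avg_test_mse`, the polynomial degrees, the sample sizes, and the noise levels are all illustrative assumptions. It compares a straight-line fit (inflexible) with a degree-8 polynomial (flexible) in two settings.

```r
set.seed(1)

# Hypothetical helper: average test MSE of a degree-`degree` polynomial
# regression, over `reps` simulated training sets of size n.
avg_test_mse <- function(n, degree, sigma, f, reps = 50) {
  mean(replicate(reps, {
    train <- data.frame(x = runif(n))
    train$y <- f(train$x) + rnorm(n, sd = sigma)
    test <- data.frame(x = runif(1000))
    test$y <- f(test$x) + rnorm(1000, sd = sigma)
    fit <- lm(y ~ poly(x, degree), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  }))
}

wavy <- function(x) sin(2 * pi * x)   # highly non-linear truth (assumed)
line <- function(x) 2 * x             # linear truth (assumed)

# Settings like (a) and (c): large n, non-linear truth, low noise
large_n <- c(inflexible = avg_test_mse(2000, 1, 0.3, wavy),
             flexible   = avg_test_mse(2000, 8, 0.3, wavy))

# Setting like (d): small n, linear truth, very noisy response
noisy <- c(inflexible = avg_test_mse(20, 1, 2, line),
           flexible   = avg_test_mse(20, 8, 2, line))

print(round(large_n, 3))
print(round(noisy, 3))
```

Under these assumptions the flexible fit should have the lower test error in the first setting and the higher test error in the second, which matches the reasoning in (a), (c), and (d).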

2. Bias-variance decomposition

a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be 5 curves.

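# Stylized curves on a flexibility scale from 0 (rigid) to 1 (flexible)
bayes <- 0.4                             # irreducible error, constant
f1 <- function(x) {(x-1)^2}              # squared bias: decreasing
f2 <- function(x) {x^2}                  # variance: increasing
f3 <- function(x) {(x-1)^2+0.1}          # training error: decreasing
f4 <- function(x) {f1(x)+f2(x)+bayes}    # test error = bias^2 + variance + irreducible

# ylim is widened so the U-shaped test error curve is not clipped
plot(f1, 0, 1, ylab = "value", xlab = "flexibility", col = "red", main = "Bias-Variance Decomposition", ylim = c(0,1.5))
curve(f2, add = TRUE, col = "blue")
curve(f3, add = TRUE, col = "green")
curve(f4, add = TRUE, col = "purple")
abline(h=bayes, col = "orange")

legend("topleft", legend = c("(squared) bias", "variance", "training error", "test error", "Bayes (irreducible) error"), col = c("red", "blue", "green", "purple", "orange"), lty = 1, cex = 0.5)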

b) Explain why each curve has the shape displayed in (a).

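Squared bias decreases as flexibility increases, because a more flexible method can follow the true relationship more closely. Variance increases as flexibility increases, because a more flexible fit changes more from one training set to another. Training error decreases as flexibility increases, since extra flexibility lets the model fit the training data ever more closely. Test error is U-shaped: it decreases up to a certain point as bias falls, but beyond that point the model overfits, and the increase in variance outweighs the further decrease in bias, so test error rises again even though training error keeps falling. Irreducible error is inherent to the data and not affected by the model, so it stays constant as flexibility increases.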
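
The shapes follow from the standard decomposition of expected test error at a point $x_0$, stated here for reference. The notation $\hat{f}$ for the fitted model and $y_0$ for a new response at $x_0$ is an assumption, since the homework does not define it.

$$
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)
$$

All three terms are non-negative, so expected test error cannot fall below the irreducible error $\sigma^2 = \mathrm{Var}(\varepsilon)$, and it is the sum of an increasing term (variance) and a decreasing term (squared bias), which produces the U shape.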

3. Boston housing data set

a) Load the Boston data set, part of ISLR2 library. How many rows, columns, and what do they mean?

library(ISLR2)
head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7

dim(Boston)

## [1] 506  13

The Boston data set from ISLR2 contains housing values in 506 suburbs of Boston. It is a data frame with 506 rows, one per suburb (census tract), and 13 columns, one per variable: crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft), indus (proportion of non-retail business acres per town), chas (Charles River dummy variable: 1 if the tract bounds the river, 0 otherwise), nox (nitrogen oxides concentration, in parts per 10 million), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted mean of distances to five Boston employment centers), rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), ptratio (pupil-teacher ratio by town), lstat (lower status of the population, in percent), and medv (median value of owner-occupied homes in $1000s).

b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings:

pairs(Boston)

lstat is positively correlated with nox and age and negatively correlated with rm, dis, and medv. This makes sense, since lower-income families are more likely to be exposed to air pollution; their homes are also likely to be lower in value, closer to employment centers (downtown/factories), and smaller, with fewer rooms. nox is positively correlated with indus and age and negatively correlated with dis. This may reflect nitrogen oxides emitted by factories and other industrial activity: the closer a tract is to employment centers (downtown/factories), the more likely it is to be exposed to air pollution.

c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

round(cor(Boston),2)

##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         lstat  medv
## crim     0.46 -0.39
## zn      -0.41  0.36
## indus    0.60 -0.48
## chas    -0.05  0.18
## nox      0.59 -0.43
## rm      -0.61  0.70
## age      0.60 -0.38
## dis     -0.50  0.25
## rad      0.49 -0.38
## tax      0.54 -0.47
## ptratio  0.37 -0.51
## lstat    1.00 -0.74
## medv    -0.74  1.00

cor.test(Boston$crim, Boston$rad)

## 
##  Pearson's product-moment correlation
## 
## data:  Boston$crim and Boston$rad
## t = 17.998, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5693817 0.6758248
## sample estimates:
##       cor 
## 0.6255051

cor.test(Boston$crim, Boston$tax)

## 
##  Pearson's product-moment correlation
## 
## data:  Boston$crim and Boston$tax
## t = 16.099, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5221186 0.6375464
## sample estimates:
##       cor 
## 0.5827643

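Based on the correlation coefficients, per capita crime rate is most strongly associated with rad (index of accessibility to radial highways), r = 0.63, and tax (full-value property-tax rate per $10,000), r = 0.58. Both correlations are positive and statistically significant (p < 2.2e-16), but they are moderate rather than strong if |r| ≥ 0.75 is used as a rule-of-thumb cutoff for a strong correlation. The remaining predictors are more weakly associated with crim (|r| ≤ 0.46).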
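
The strength of a correlation and its statistical significance are separate questions. As a sketch of how the t statistics reported by `cor.test` relate to r, the snippet below recomputes them from the printed correlations and degrees of freedom using the usual formula t = r * sqrt(df) / sqrt(1 - r^2). The helper name `t_from_r` is made up for illustration.

```r
# Hypothetical helper: t statistic for testing H0: rho = 0, given a Pearson
# correlation r and df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Values copied from the cor.test() output above (n = 506, so df = 504)
t_rad <- t_from_r(0.6255051, 504)
t_tax <- t_from_r(0.5827643, 504)

# Two-sided p-values from the t distribution
p_rad <- 2 * pt(abs(t_rad), df = 504, lower.tail = FALSE)
p_tax <- 2 * pt(abs(t_tax), df = 504, lower.tail = FALSE)

print(round(c(rad = t_rad, tax = t_tax), 3))
print(c(rad = p_rad, tax = p_tax))
```

The results should agree, up to rounding, with the t values of 17.998 and 16.099 printed above. With n = 506, even a moderate r gives a very small p-value, which is why significance alone says little about how strong the relationship is.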

d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

plot(Boston$crim)

summary(Boston$crim)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620

plot(Boston$tax)

summary(Boston$tax)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0

plot(Boston$ptratio)

summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

print(which(Boston$crim > 3.62 & Boston$tax > 330 & Boston$ptratio > 19.05))

##   [1] 357 358 359 360 361 362 363 364 366 367 368 369 370 371 372 373 374 375
##  [19] 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393
##  [37] 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411
##  [55] 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429
##  [73] 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447
##  [91] 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465
## [109] 467 468 469 470 471 472 474 475 476 477 478 479 480 481 482 483 486 487
## [127] 488

Yes, some census tracts of Boston have particularly high crime rates, tax rates, and pupil-teacher ratios. The filter above flags 127 tracts, all with row indices between 357 and 488, whose crime rate is above the mean (about 3.6) and whose tax rate and pupil-teacher ratio are both above the median (330 and 19.05). In absolute terms, tax has the widest range (187 to 711), followed by crim (0.0063 to 88.98) and ptratio (12.6 to 22). Relative to its spread, however, crim is the most extreme: its 3rd quartile is 3.68 while its maximum is 88.98. In other words, it has the largest outliers relative to the bulk of its values.

e) How many of the census tracts in this data set bound the Charles river?

hist(Boston$chas)$count

##  [1] 471   0   0   0   0   0   0   0   0  35

There are 35 census tracts that bound the Charles River.
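
`hist(Boston$chas)$count` gives the right answer only because `$` partially matches the `counts` component of the histogram object and the last bin happens to hold the 1s. A more direct way to count a 0/1 indicator is sketched below on a stand-in vector rebuilt from the counts shown above (471 zeros and 35 ones), so that it runs without the ISLR2 package. With the real data one would use `Boston$chas` instead.

```r
# Stand-in for Boston$chas, rebuilt from the counts shown above
chas <- c(rep(0, 471), rep(1, 35))

n_bound <- sum(chas == 1)   # number of tracts with chas equal to 1
counts  <- table(chas)      # full frequency table of the indicator

print(n_bound)
print(counts)
```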

f) What is the median pupil-teacher ratio among the towns in this data set?

summary(Boston$ptratio)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00

The median pupil-teacher ratio is 19.05.

g) Which census tract of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

head(Boston[order(Boston$medv),])

##         crim zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv
## 399 38.35180  0 18.10    0 0.693 5.453 100.0 1.4896  24 666    20.2 30.59  5.0
## 406 67.92080  0 18.10    0 0.693 5.683 100.0 1.4254  24 666    20.2 22.98  5.0
## 401 25.04610  0 18.10    0 0.693 5.987 100.0 1.5888  24 666    20.2 26.77  5.6
## 400  9.91655  0 18.10    0 0.693 5.852  77.8 1.5004  24 666    20.2 29.97  6.3
## 415 45.74610  0 18.10    0 0.693 4.519 100.0 1.6582  24 666    20.2 36.98  7.0
## 490  0.18337  0 27.74    0 0.609 5.414  98.3 1.7554   4 711    20.1 23.97  7.0

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

Census tracts #399 and #406 are tied for the lowest median value, $5,000 (medv = 5.0). Compared with the overall distributions, both have a crime rate above the 3rd quartile, a proportion of non-retail business acres equal to the 3rd quartile, no border on the Charles River (like most tracts), a nitrogen oxides concentration above the 3rd quartile, fewer rooms than the 1st quartile, all units built before 1940 (age = 100, the maximum), a distance to Boston employment centers below the 1st quartile, the highest index of accessibility to highways, a tax rate equal to the 3rd quartile, a pupil-teacher ratio equal to the 3rd quartile, and a lower-status percentage above the 3rd quartile. Given the relatively high crime rate and nitrogen oxides concentration, the older units, the small number of rooms, and the lower socioeconomic status, these two census tracts are likely less desirable for higher-income families. This is not unexpected, as they have the absolute lowest median value of owner-occupied homes.
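
To make the comparison with the overall ranges less manual, each value in the lowest-medv tract can be converted to a percentile rank within its column. The sketch below is only an illustration: `percentile_ranks` is a made-up helper, and it is demonstrated on a tiny hand-built data frame with invented values (not the Boston data) so that it runs on its own.

```r
# Hypothetical helper: for one row of a numeric data frame, the share of
# rows whose value in each column is less than or equal to that row's value.
percentile_ranks <- function(df, row) {
  sapply(names(df), function(v) mean(df[[v]] <= df[row, v]))
}

# Toy data standing in for Boston (values invented for illustration)
toy <- data.frame(
  crim  = c(0.1, 0.3, 2.5, 40.0, 0.05),
  rm    = c(6.5, 6.2, 5.9, 5.4, 7.1),
  lstat = c(5, 9, 15, 30, 4),
  medv  = c(24, 21, 15, 5, 35)
)

worst <- which.min(toy$medv)           # row with the lowest median value
ranks <- percentile_ranks(toy, worst)  # 1 = column maximum, 1/n = column minimum

print(worst)
print(ranks)
```

Applied to the real data, a call such as `percentile_ranks(Boston, 399)` would summarize in one vector where that tract sits within each predictor's distribution.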

h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

dim(subset(Boston, rm>7))

## [1] 64 13

dim(subset(Boston, rm>8))

## [1] 13 13

summary(subset(Boston, rm>8))

##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          lstat           medv     
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :2.47   Min.   :21.9  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:3.32   1st Qu.:41.7  
##  Median : 7.000   Median :307.0   Median :17.40   Median :4.14   Median :48.3  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :4.31   Mean   :44.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :7.44   Max.   :50.0

64 census tracts average more than 7 rooms per dwelling, and 13 average more than 8. What stands out most about the tracts with more than 8 rooms per dwelling is that their median percentage of lower-status population is much lower than the overall median (4.14 vs. 11.36). Moreover, their median home value is much higher than the overall median (48.3 vs. 21.2, in $1000s).

4. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

par(mfrow=c(3,3))
for (i in 1:9){
  hist(Glass[,i], main=names(Glass)[i])
}

for (i in 1:9){
  boxplot(Glass[,i], main=names(Glass)[i])
}

pairs(Glass[,1:9])

Per the histograms, some predictors look roughly normally distributed while others are clearly skewed. Mg appears bimodal, with one mode near its minimum and another near its maximum. Per the boxplots, Si is the main component of glass, at roughly 70 to 75 percent. Per the pairwise plots, there appears to be a positive correlation between refractive index (RI) and Ca.

b) Do there appear to be any outliers in the data? Are any predictors skewed?

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

library(e1071)
for (i in 1:9) {
  cat("Predictor:", names(Glass)[i], "- Outliers:", length(boxplot(Glass[,i],plot=FALSE)$out), "- ?>20 max/min?:", max(Glass[,i])/min(Glass[,i]), "- Skewness:", skewness(Glass[,i]), "\n")
}

## Predictor: RI - Outliers: 17 - ?>20 max/min?: 1.015075 - Skewness: 1.602715 
## Predictor: Na - Outliers: 7 - ?>20 max/min?: 1.619758 - Skewness: 0.4478343 
## Predictor: Mg - Outliers: 0 - ?>20 max/min?: Inf - Skewness: -1.136452 
## Predictor: Al - Outliers: 18 - ?>20 max/min?: 12.06897 - Skewness: 0.8946104 
## Predictor: Si - Outliers: 12 - ?>20 max/min?: 1.080218 - Skewness: -0.7202392 
## Predictor: K - Outliers: 7 - ?>20 max/min?: Inf - Skewness: 6.460089 
## Predictor: Ca - Outliers: 26 - ?>20 max/min?: 2.981584 - Skewness: 2.018446 
## Predictor: Ba - Outliers: 38 - ?>20 max/min?: Inf - Skewness: 3.36868 
## Predictor: Fe - Outliers: 12 - ?>20 max/min?: Inf - Skewness: 1.729811

Yes. If we define outliers as points below the 1st quartile - 1.5*IQR or above the 3rd quartile + 1.5*IQR, there appear to be outliers in every predictor except Mg. However, Mg has a bimodal distribution, which the boxplot does not reveal but the histogram does.

The predictors are all skewed to some degree. There are several ways to judge this: the mean straying from the median, a large skewness value, or the visual shape of the histogram. The one that stands out the most is Ba, with a mean above the 3rd quartile, a skewness of 3.37, and a visibly right-skewed histogram. K is interesting: its skewness is 6.46, yet its mean is below its median. On closer inspection, it looks left-skewed within the range 0 to 1 but has a few extreme values far to the right; these are enough to produce a large skewness value but not enough to push the mean above the median. There is also a rule of thumb that data whose ratio of highest to lowest value exceeds 20 may be significantly skewed. For Ba and K the minimum is 0, so the ratio is infinite (Inf above), which is consistent with treating these two as skewed predictors, although the ratio is not very informative when the minimum is zero.
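
For reference, both diagnostics used above can be computed in base R. This is a minimal sketch on made-up vectors: `skew` and `n_outliers` are hypothetical helper names, and the skewness formula (third central moment over the cubed sample standard deviation) is assumed to correspond to the default used by `e1071::skewness`, which has not been verified here.

```r
# Sample skewness: third central moment divided by the cubed standard
# deviation (sd() uses the n - 1 denominator).
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Number of points beyond the boxplot whiskers (1.5 * IQR rule on the hinges)
n_outliers <- function(x) length(boxplot.stats(x)$out)

symmetric  <- c(1, 2, 3, 4, 5)        # invented, perfectly symmetric
right_tail <- c(1, 2, 3, 4, 5, 100)   # invented, one extreme value on the right

print(skew(symmetric))
print(skew(right_tail))
print(n_outliers(symmetric))
print(n_outliers(right_tail))
```

A single extreme value is enough to make the skewness clearly positive and to register as one boxplot outlier, which is the same pattern seen for K above.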

c) Are there any relevant transformations of one or more predictors that might improve the classification model?

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

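for (i in 1:9) {
  # Box-Cox needs strictly positive values; the tiny offset makes exact zeros positive
  original_data <- Glass[,i]+1.e-100 
  
  # Estimate lambda, print the summary, then apply the transformation
  transformed_data <- BoxCoxTrans(original_data)
  print(transformed_data)
  transformed_data <- predict(transformed_data, original_data) 
  
  # Side-by-side histograms before and after the transformation
  par(mfrow = c(1, 2))
  hist(original_data, main = paste("Original:", colnames(Glass)[i]), xlab = "")
  hist(transformed_data, main = paste("Transformed:", colnames(Glass)[i]), xlab = "")
}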

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.511   1.517   1.518   1.518   1.519   1.534 
## 
## Largest/Smallest: 1.02 
## Sample Skewness: 1.6 
## 
## Estimated Lambda: -2

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.73   12.91   13.30   13.41   13.82   17.38 
## 
## Largest/Smallest: 1.62 
## Sample Skewness: 0.448 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.115   3.480   2.685   3.600   4.490 
## 
## Largest/Smallest: 4.49e+100 
## Sample Skewness: -1.14 
## 
## Estimated Lambda: 0 
## With fudge factor, Lambda = 0 will be used for transformations

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.290   1.190   1.360   1.445   1.630   3.500 
## 
## Largest/Smallest: 12.1 
## Sample Skewness: 0.895 
## 
## Estimated Lambda: 0.5

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   69.81   72.28   72.79   72.65   73.09   75.41 
## 
## Largest/Smallest: 1.08 
## Sample Skewness: -0.72 
## 
## Estimated Lambda: 2

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1225  0.5550  0.4971  0.6100  6.2100 
## 
## Largest/Smallest: 6.21e+100 
## Sample Skewness: 6.46 
## 
## Estimated Lambda: 0 
## With fudge factor, Lambda = 0 will be used for transformations

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.430   8.240   8.600   8.957   9.172  16.190 
## 
## Largest/Smallest: 2.98 
## Sample Skewness: 2.02 
## 
## Estimated Lambda: -1.1

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.175   0.000   3.150 
## 
## Largest/Smallest: 3.15e+100 
## Sample Skewness: 3.37 
## 
## Estimated Lambda: 0 
## With fudge factor, Lambda = 0 will be used for transformations

## Box-Cox Transformation
## 
## 214 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000 
## 
## Largest/Smallest: 5.1e+99 
## Sample Skewness: 1.73 
## 
## Estimated Lambda: 0 
## With fudge factor, Lambda = 0 will be used for transformations

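A Box-Cox transformation can be used, for Ba and K in particular. Box-Cox requires strictly positive data, so in my code I add 1.e-100 to each predictor so that a lambda can be obtained for those that contain zeros (Mg, K, Ba, and Fe). One caveat is that for these four the lambda used is 0, i.e. a log transform, which sends the shifted zeros to about -230, far from the remaining values, so the transformed histograms should be checked; a transformation that accepts zeros, such as Yeo-Johnson, would be an alternative. We could also handle multiple predictors at once by preprocessing all of them with Box-Cox and then conducting principal component analysis. Furthermore, we could use the spatial sign transformation to reduce the influence of outliers.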
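
The effect of the 1.e-100 offset can be seen with a hand-written Box-Cox transform. This is a sketch under stated assumptions: `box_cox` is an illustrative helper, not caret's implementation, and the input values are invented.

```r
# Box-Cox transform for strictly positive x; lambda = 0 is the log transform
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# For non-zero values the tiny offset is absorbed by floating point ...
unchanged <- (1.5 + 1e-100) == 1.5

# ... so it only changes exact zeros, which the log then sends very far away
x <- c(0, 0.5, 1, 3) + 1e-100
logged <- box_cox(x, 0)

# One alternative that keeps zeros at zero (log(1 + x), not a Box-Cox transform)
shifted <- log1p(c(0, 0.5, 1, 3))

print(unchanged)
print(round(logged, 2))
print(round(shifted, 2))
```

The first element of `logged` is log(1e-100), about -230, while the other values stay within a few units of zero, so a predictor with many zeros ends up with a large isolated spike after this transformation.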

5. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data(Soybean)
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

table(Soybean$Class)

## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20

barplot(table(Soybean$Class), main="Frequency Distribution of Class", xlab="Class", ylab="Frequency")

library(caret)
nearZeroVar(Soybean)

## [1] 19 26 28

colnames(Soybean)[nearZeroVar(Soybean)]

## [1] "leaf.mild" "mycelium"  "sclerotia"

Soybean1 <- Soybean[,-nearZeroVar(Soybean)]
Soybean1[is.na(Soybean1)] <- 0
Soybean1 <- Soybean1[,-1]
for (col in names(Soybean1)){
  Soybean1[[col]] <- as.numeric(as.character(Soybean1[[col]]))
}

segCorr <- cor(Soybean1)
library(corrplot)

## corrplot 0.94 loaded

corrplot(segCorr, order="hclust", tl.cex=.35)

highCorr <- findCorrelation(segCorr, .75)
colnames(segCorr)[highCorr]

## [1] "leaf.marg"

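Yes. Columns 19, 26, and 28 (leaf.mild, mycelium, and sclerotia) have near-zero variance: almost all observations share a single value, so these predictors are degenerate and might not contribute much to the model’s predictive power. Moreover, after the remaining predictors are recoded as numeric (with missing values set to 0), findCorrelation flags leaf.marg as having an absolute pairwise correlation greater than 0.75 with another predictor, which makes it a candidate for removal.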
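
The reason these three columns are flagged can be reproduced by hand. The sketch below assumes the two criteria and default cutoffs documented for `nearZeroVar` (a frequency ratio above 95/5 and fewer than 10 percent distinct values). The helper `nzv_stats` is hypothetical, it drops missing values (an assumption about how they are handled), and the vectors are rebuilt from the non-missing counts printed by `summary(Soybean)` in part (b).

```r
# Hypothetical helper mirroring the two near-zero-variance criteria:
# ratio of the most common to the second most common value, and the
# percentage of distinct values among the non-missing observations.
nzv_stats <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique,
    flagged = freq_ratio > freq_cut && pct_unique < unique_cut)
}

# Vectors rebuilt from the non-missing counts in summary(Soybean)
leaf_mild <- rep(c("0", "1", "2"), times = c(535, 20, 20))
mycelium  <- rep(c("0", "1"), times = c(639, 6))
sclerotia <- rep(c("0", "1"), times = c(625, 20))

result <- rbind(leaf.mild = nzv_stats(leaf_mild),
                mycelium  = nzv_stats(mycelium),
                sclerotia = nzv_stats(sclerotia))
print(round(result, 2))
```

All three frequency ratios are well above the cutoff of 19, with mycelium the most extreme at 639 to 6.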

b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(Soybean,2,pMiss)

##           Class            date     plant.stand          precip            temp 
##       0.0000000       0.1464129       5.2708638       5.5636896       4.3923865 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##      17.7159590       2.3426061       0.1464129      17.7159590      17.7159590 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##      16.3982430       2.3426061       0.0000000      12.2986823      12.2986823 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##      12.2986823      14.6412884      12.2986823      15.8125915       2.3426061 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##      17.7159590       5.5636896       5.5636896      15.5197657       5.5636896 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##       5.5636896       5.5636896       5.5636896      12.2986823      15.5197657 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##      13.4699854      13.4699854      15.5197657      13.4699854      15.5197657 
##           roots 
##       4.5387994

apply(Soybean,1,pMiss)

##        1        2        3        4        5        6        7        8 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##        9       10       11       12       13       14       15       16 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##       17       18       19       20       21       22       23       24 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##       25       26       27       28       29       30       31       32 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 36.11111 
##       33       34       35       36       37       38       39       40 
## 52.77778  0.00000 52.77778 52.77778  0.00000  0.00000 36.11111  0.00000 
##       41       42       43       44       45       46       47       48 
## 36.11111 52.77778  0.00000  0.00000  0.00000 52.77778  0.00000 36.11111 
##       49       50       51       52       53       54       55       56 
##  0.00000  0.00000  0.00000 36.11111 52.77778 36.11111 52.77778  0.00000 
##       57       58       59       60       61       62       63       64 
## 36.11111 52.77778 36.11111 52.77778 36.11111 36.11111  0.00000 36.11111 
##       65       66       67       68       69       70       71       72 
## 52.77778  0.00000 52.77778 52.77778  0.00000 52.77778  0.00000  0.00000 
##       73       74       75       76       77       78       79       80 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##       81       82       83       84       85       86       87       88 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##       89       90       91       92       93       94       95       96 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##       97       98       99      100      101      102      103      104 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      105      106      107      108      109      110      111      112 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      113      114      115      116      117      118      119      120 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      121      122      123      124      125      126      127      128 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      129      130      131      132      133      134      135      136 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      137      138      139      140      141      142      143      144 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      145      146      147      148      149      150      151      152 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      153      154      155      156      157      158      159      160 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      161      162      163      164      165      166      167      168 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      169      170      171      172      173      174      175      176 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      177      178      179      180      181      182      183      184 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      185      186      187      188      189      190      191      192 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      193      194      195      196      197      198      199      200 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      201      202      203      204      205      206      207      208 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      209      210      211      212      213      214      215      216 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      217      218      219      220      221      222      223      224 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      225      226      227      228      229      230      231      232 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      233      234      235      236      237      238      239      240 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      241      242      243      244      245      246      247      248 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      249      250      251      252      253      254      255      256 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      257      258      259      260      261      262      263      264 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      265      266      267      268      269      270      271      272 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      273      274      275      276      277      278      279      280 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      281      282      283      284      285      286      287      288 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      289      290      291      292      293      294      295      296 
##  0.00000  0.00000 30.55556 30.55556 30.55556 30.55556 36.11111 30.55556 
##      297      298      299      300      301      302      303      304 
## 66.66667 66.66667 66.66667 66.66667 66.66667 66.66667 83.33333 55.55556 
##      305      306      307      308      309      310      311      312 
## 55.55556 55.55556 55.55556  0.00000  0.00000  0.00000  0.00000  0.00000 
##      313      314      315      316      317      318      319      320 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      321      322      323      324      325      326      327      328 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      329      330      331      332      333      334      335      336 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      337      338      339      340      341      342      343      344 
##  0.00000  0.00000  0.00000  0.00000  0.00000 36.11111 36.11111 52.77778 
##      345      346      347      348      349      350      351      352 
## 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 
##      353      354      355      356      357      358      359      360 
## 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 
##      361      362      363      364      365      366      367      368 
## 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 
##      369      370      371      372      373      374      375      376 
## 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 
##      377      378      379      380      381      382      383      384 
## 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 52.77778 
##      385      386      387      388      389      390      391      392 
## 52.77778  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      393      394      395      396      397      398      399      400 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      401      402      403      404      405      406      407      408 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      409      410      411      412      413      414      415      416 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      417      418      419      420      421      422      423      424 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      425      426      427      428      429      430      431      432 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      433      434      435      436      437      438      439      440 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      441      442      443      444      445      446      447      448 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      449      450      451      452      453      454      455      456 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      457      458      459      460      461      462      463      464 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      465      466      467      468      469      470      471      472 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      473      474      475      476      477      478      479      480 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      481      482      483      484      485      486      487      488 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      489      490      491      492      493      494      495      496 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      497      498      499      500      501      502      503      504 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      505      506      507      508      509      510      511      512 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      513      514      515      516      517      518      519      520 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      521      522      523      524      525      526      527      528 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      529      530      531      532      533      534      535      536 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      537      538      539      540      541      542      543      544 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      545      546      547      548      549      550      551      552 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      553      554      555      556      557      558      559      560 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      561      562      563      564      565      566      567      568 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      569      570      571      572      573      574      575      576 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      577      578      579      580      581      582      583      584 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      585      586      587      588      589      590      591      592 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      593      594      595      596      597      598      599      600 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      601      602      603      604      605      606      607      608 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      609      610      611      612      613      614      615      616 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      617      618      619      620      621      622      623      624 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      625      626      627      628      629      630      631      632 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      633      634      635      636      637      638      639      640 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 
##      641      642      643      644      645      646      647      648 
##  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000 36.11111 
##      649      650      651      652      653      654      655      656 
## 36.11111 36.11111 36.11111 36.11111 30.55556 30.55556 30.55556 30.55556 
##      657      658      659      660      661      662      663      664 
## 66.66667 66.66667 66.66667 66.66667 66.66667 66.66667 66.66667 66.66667 
##      665      666      667      668      669      670      671      672 
## 77.77778 77.77778 77.77778 77.77778 77.77778 77.77778 77.77778 77.77778 
##      673      674      675      676      677      678      679      680 
## 77.77778 77.77778 77.77778 77.77778 77.77778 77.77778 77.77778 55.55556 
##      681      682      683 
## 55.55556 55.55556 55.55556
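The row-wise percentages above are easier to read when tabulated. The sketch below shows one way to do that in base R on a small toy data frame; `toy` and its values are made up for illustration, and rowMeans(is.na(...)) computes the same quantity as applying pMiss over rows.

```r
# Toy data frame standing in for Soybean (hypothetical values).
toy <- data.frame(
  a = c(1, NA, 3, NA, 5, 6),
  b = c(NA, NA, 1, 2, 3, 4),
  c = c(1, 2, 3, NA, 5, 6),
  d = c(1, 2, 3, 4, 5, 6)
)

# Percent of missing cells per row (same quantity as apply(toy, 1, pMiss)).
row_pct <- rowMeans(is.na(toy)) * 100

# How many rows share each missingness level.
pattern_counts <- table(row_pct)
print(pattern_counts)

# Share of rows with at least one missing value.
incomplete_share <- mean(row_pct > 0) * 100
print(incomplete_share)
```

Applied to Soybean, the same two lines would replace several hundred printed values with a short frequency table.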

library(mice)
md.pattern(Soybean)

##     Class leaves date area.dam crop.hist plant.growth stem temp roots
## 562     1      1    1        1         1            1    1    1     1
## 13      1      1    1        1         1            1    1    1     1
## 55      1      1    1        1         1            1    1    1     1
## 8       1      1    1        1         1            1    1    1     1
## 9       1      1    1        1         1            1    1    1     0
## 6       1      1    1        1         1            1    1    1     0
## 14      1      1    1        1         1            1    1    0     1
## 15      1      1    1        1         0            0    0    0     0
## 1       1      1    0        0         0            0    0    0     0
##         0      0    1        1        16           16   16   30    31
##     plant.stand precip stem.cankers canker.lesion ext.decay mycelium
## 562           1      1            1             1         1        1
## 13            1      1            1             1         1        1
## 55            1      1            1             1         1        1
## 8             1      0            0             0         0        0
## 9             1      1            1             1         1        1
## 6             0      1            1             1         1        1
## 14            0      0            0             0         0        0
## 15            0      0            0             0         0        0
## 1             0      0            0             0         0        0
##              36     38           38            38        38       38
##     int.discolor sclerotia leaf.halo leaf.marg leaf.size leaf.malf fruit.pods
## 562            1         1         1         1         1         1          1
## 13             1         1         1         1         1         1          0
## 55             1         1         0         0         0         0          0
## 8              0         0         1         1         1         1          1
## 9              1         1         0         0         0         0          1
## 6              1         1         0         0         0         0          1
## 14             0         0         0         0         0         0          1
## 15             0         0         1         1         1         1          0
## 1              0         0         1         1         1         1          0
##               38        38        84        84        84        84         84
##     seed mold.growth seed.size leaf.shread fruiting.bodies fruit.spots
## 562    1           1         1           1               1           1
## 13     0           0         0           1               0           0
## 55     0           0         0           0               0           0
## 8      0           0         0           1               0           0
## 9      1           1         1           0               1           1
## 6      1           1         1           0               1           1
## 14     1           1         1           0               0           0
## 15     0           0         0           0               0           0
## 1      0           0         0           0               0           0
##       92          92        92         100             106         106
##     seed.discolor shriveling leaf.mild germ hail sever seed.tmt lodging     
## 562             1          1         1    1    1     1        1       1    0
## 13              0          0         1    0    0     0        0       0   13
## 55              0          0         0    0    0     0        0       0   19
## 8               0          0         0    0    0     0        0       0   20
## 9               1          1         0    1    0     0        0       0   11
## 6               1          1         0    0    0     0        0       0   13
## 14              0          0         0    0    0     0        0       0   24
## 15              0          0         0    0    0     0        0       0   28
## 1               0          0         0    0    0     0        0       0   30
##               106        106       108  112  121   121      121     121 2337

library(VIM)
aggr_plot <- aggr(Soybean, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(Soybean), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##         Variable       Count
##             hail 0.177159590
##            sever 0.177159590
##         seed.tmt 0.177159590
##          lodging 0.177159590
##             germ 0.163982430
##        leaf.mild 0.158125915
##  fruiting.bodies 0.155197657
##      fruit.spots 0.155197657
##    seed.discolor 0.155197657
##       shriveling 0.155197657
##      leaf.shread 0.146412884
##             seed 0.134699854
##      mold.growth 0.134699854
##        seed.size 0.134699854
##        leaf.halo 0.122986823
##        leaf.marg 0.122986823
##        leaf.size 0.122986823
##        leaf.malf 0.122986823
##       fruit.pods 0.122986823
##           precip 0.055636896
##     stem.cankers 0.055636896
##    canker.lesion 0.055636896
##        ext.decay 0.055636896
##         mycelium 0.055636896
##     int.discolor 0.055636896
##        sclerotia 0.055636896
##      plant.stand 0.052708638
##            roots 0.045387994
##             temp 0.043923865
##        crop.hist 0.023426061
##     plant.growth 0.023426061
##             stem 0.023426061
##             date 0.001464129
##         area.dam 0.001464129
##            Class 0.000000000
##           leaves 0.000000000

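# Count the missing values of every variable within each class
missing_by_class <- aggregate(is.na(Soybean), by = list(Soybean$Class), FUN = sum)
missing_df <- as.data.frame(missing_by_class)
# Total the per-variable counts (every column except the class label in column 1)
missing_df$Sum <- rowSums(missing_df[, -1])
# Keep the class label and its total, sorted from most to least missing
missing_df <- missing_df[, c("Group.1", "Sum")]
missing_df <- missing_df[order(-missing_df$Sum), ]
print(missing_df)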

##                        Group.1  Sum
## 16            phytophthora-rot 1214
## 1                 2-4-d-injury  450
## 9                cyst-nematode  336
## 10 diaporthe-pod-&-stem-blight  177
## 14            herbicide-injury  160
## 2          alternarialeaf-spot    0
## 3                  anthracnose    0
## 4             bacterial-blight    0
## 5            bacterial-pustule    0
## 6                   brown-spot    0
## 7               brown-stem-rot    0
## 8                 charcoal-rot    0
## 11       diaporthe-stem-canker    0
## 12                downy-mildew    0
## 13          frog-eye-leaf-spot    0
## 15      phyllosticta-leaf-spot    0
## 17              powdery-mildew    0
## 18           purple-seed-stain    0
## 19        rhizoctonia-root-rot    0

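Yes, some predictors are far more likely to be missing than others; the sorted table above lists the missing fraction for every predictor. hail, sever, seed.tmt, and lodging are the most affected, each missing in 121 of 683 rows (17.7%). germ, leaf.mild, and the remaining leaf, fruit, and seed predictors follow at roughly 12% to 16%, while Class and leaves are never missing. The pattern is also strongly related to the classes: all 2,337 missing values fall in just 5 of the 19 classes (phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury).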
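Raw counts of missing values depend on class size (phytophthora-rot, for instance, has 88 rows in the summary above), so a per-class share of incomplete rows can be a useful complement. A minimal base-R sketch on made-up data follows; `toy`, its classes, and its values are hypothetical.

```r
# Toy data with a class label and two predictors (hypothetical values).
toy <- data.frame(
  Class = factor(c("A", "A", "B", "B", "B", "C")),
  x = c(1, 2, NA, NA, 3, 4),
  y = c(1, 2, NA, 5, NA, 6)
)

# TRUE for rows that have at least one missing predictor.
incomplete <- !complete.cases(toy[, names(toy) != "Class"])

# Percent of incomplete rows within each class.
share_by_class <- tapply(incomplete, toy$Class, mean) * 100
print(share_by_class)
```

A class whose share is close to 100% is missing data systematically, which suggests the missingness is informative about the class rather than random.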

c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

#tempSoybean <- Soybean[, c("Class", "roots", "temp", "crop.hist", "plant.growth", "stem", "date", "area.dam", "leaves")]
#tempData <- mice(tempSoybean, m = 5, maxit = 50, meth = 'pmm', seed = 500)
#completedData <- complete(tempData,1)
#head(completedData)

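# Drop the near-zero-variance and highly correlated predictors identified in part (a)
tempSoybean <- Soybean[, !(names(Soybean) %in% c("leaf.mild", "mycelium", "sclerotia", "leaf.marg"))]
#tempSoybean <- Soybean[, !(names(Soybean) %in% c("leaf.mild", "mycelium", "sclerotia", "leaf.marg", "hail","sever", "seed.tmt", "lodging"))]
# Choose predictors for each imputation model, then impute with predictive mean matching
new.pred <- quickpred(tempSoybean)
tempData <- mice(tempSoybean, method = "pmm", seed = 1000, predictorMatrix = new.pred)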

## 
##  iter imp variable
##   1   1  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.size  leaf.shread*  leaf.malf  stem  lodging  stem.cankers  canker.lesion  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   1   2  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.size  leaf.shread*  leaf.malf  stem  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   1   3  date  plant.stand*  precip  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.size  leaf.shread  leaf.malf  stem  lodging  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   1   4  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.size  leaf.shread  leaf.malf  stem*  lodging  stem.cankers*  canker.lesion  fruiting.bodies*  ext.decay  int.discolor  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   1   5  date  plant.stand*  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth  leaf.halo  leaf.size  leaf.shread  leaf.malf  stem  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   1  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   2  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.size  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   3  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   4  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   2   5  date  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.size  leaf.shread*  leaf.malf  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   3   1  date*  plant.stand*  precip  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion  fruiting.bodies  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   3   2  date  plant.stand  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread  leaf.malf*  stem*  lodging  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling  roots*
##   3   3  date  plant.stand*  precip  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   3   4  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling  roots*
##   3   5  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots
##   4   1  date*  plant.stand  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   4   2  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   4   3  date  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo  leaf.size*  leaf.shread  leaf.malf  stem  lodging  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   4   4  date  plant.stand  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   4   5  date*  plant.stand*  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots
##   5   1  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   5   2  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   5   3  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   5   4  date*  plant.stand*  precip*  temp  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor*  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*
##   5   5  date*  plant.stand  precip*  temp*  hail*  crop.hist  area.dam*  sever*  seed.tmt*  germ*  plant.growth*  leaf.halo*  leaf.size*  leaf.shread*  leaf.malf*  stem*  lodging  stem.cankers*  canker.lesion*  fruiting.bodies*  ext.decay*  int.discolor  fruit.pods*  fruit.spots*  seed*  mold.growth*  seed.discolor*  seed.size*  shriveling*  roots*

## Warning: Number of logged events: 1319

# Extract the first imputed dataset from the mice result and preview it
completedData <- complete(tempData, 1)
head(completedData)

##                   Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker    6           0      2    1    0         1        1
## 2 diaporthe-stem-canker    4           0      2    1    0         2        0
## 3 diaporthe-stem-canker    3           0      2    1    0         1        0
## 4 diaporthe-stem-canker    3           0      2    1    0         1        0
## 5 diaporthe-stem-canker    6           0      2    1    0         2        0
## 6 diaporthe-stem-canker    5           0      2    1    0         3        0
##   sever seed.tmt germ plant.growth leaves leaf.halo leaf.size leaf.shread
## 1     1        0    0            1      1         0         2           0
## 2     2        1    1            1      1         0         2           0
## 3     2        1    2            1      1         0         2           0
## 4     2        0    1            1      1         0         2           0
## 5     1        0    2            1      1         0         2           0
## 6     1        0    1            1      1         0         2           0
##   leaf.malf stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 1         0    1       1            3             1               1         1
## 2         0    1       0            3             1               1         1
## 3         0    1       0            3             0               1         1
## 4         0    1       0            3             0               1         1
## 5         0    1       0            3             1               1         1
## 6         0    1       0            3             0               1         1
##   int.discolor fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1            0          0           4    0           0             0         0
## 2            0          0           4    0           0             0         0
## 3            0          0           4    0           0             0         0
## 4            0          0           4    0           0             0         0
## 5            0          0           4    0           0             0         0
## 6            0          0           4    0           0             0         0
##   shriveling roots
## 1          0     0
## 2          0     0
## 3          0     0
## 4          0     0
## 5          0     0
## 6          0     0

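completedData <- as.data.frame(completedData)
# Re-draw the missingness plot (aggr() from VIM) to confirm no missing values remain after imputation
aggr_plot <- aggr(completedData, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE,
                  labels = names(completedData), cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))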

## 
##  Variables sorted by number of missings: 
##         Variable Count
##            Class     0
##             date     0
##      plant.stand     0
##           precip     0
##             temp     0
##             hail     0
##        crop.hist     0
##         area.dam     0
##            sever     0
##         seed.tmt     0
##             germ     0
##     plant.growth     0
##           leaves     0
##        leaf.halo     0
##        leaf.size     0
##      leaf.shread     0
##        leaf.malf     0
##             stem     0
##          lodging     0
##     stem.cankers     0
##    canker.lesion     0
##  fruiting.bodies     0
##        ext.decay     0
##     int.discolor     0
##       fruit.pods     0
##      fruit.spots     0
##             seed     0
##      mold.growth     0
##    seed.discolor     0
##        seed.size     0
##       shriveling     0
##            roots     0
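
The all-zero counts above can also be confirmed numerically, without a plot. The following is a minimal base-R sketch; `completed_toy` is a small hypothetical stand-in for `completedData`, not the actual soybean data.

```r
# Hypothetical stand-in for the imputed data frame (values are made up).
completed_toy <- data.frame(
  Class = c("class-a", "class-b", "class-a"),
  date  = c(6, 4, 3),
  roots = c(0, 0, 0)
)

# Count the missing values in each column, largest first,
# mirroring the "Variables sorted by number of missings" table.
na_counts <- sort(colSums(is.na(completed_toy)), decreasing = TRUE)
print(na_counts)

# TRUE only if no column contains a missing value.
fully_imputed <- !anyNA(completed_toy)
print(fully_imputed)
```

Running the same two calls on `completedData` should give zero for every column if the imputation covered all predictors.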

In part a), I show how to eliminate predictors with degenerate distributions. I also show the result of removing leaf.marg, a predictor that was too highly correlated with other predictors. In part b), I find that of the 35 predictors, only 8 have an acceptable level of missing data (<5%). My code includes commented-out experiments with removing additional predictors that have a large amount of missing data. However, since I lack domain expertise, I concluded it was best to impute all of them. The quickpred function restricts the set of predictors used for each variable, which lets the MICE package's imputation finish within a reasonable time frame. In further work, I would experiment with the additional imputation methods listed by methods(mice).
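
The workflow summarized above can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame `toy`, its columns `x1` to `x3`, the missingness rates, and the `mincor = 0.1` threshold are all hypothetical and are not the settings used in the homework. The imputation step runs only if the mice package is installed.

```r
# Simulated stand-in for the soybean predictors (hypothetical data).
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- x1 + rnorm(n)
toy <- data.frame(x1, x2, x3)
toy$x2[sample(n, 40)] <- NA  # 20% missing
toy$x3[sample(n, 6)]  <- NA  # 3% missing

# Fraction of missing values per predictor, and which ones fall under 5%.
miss_frac <- colMeans(is.na(toy))
under_5pct <- names(miss_frac)[miss_frac < 0.05]
print(miss_frac)
print(under_5pct)

# Impute with mice, using quickpred() to keep only predictors whose
# correlation with the target (or its missingness) is at least mincor.
completed <- NULL
if (requireNamespace("mice", quietly = TRUE)) {
  pred <- mice::quickpred(toy, mincor = 0.1)  # assumed threshold
  imp  <- mice::mice(toy, m = 1, predictorMatrix = pred,
                     seed = 1, printFlag = FALSE)
  completed <- mice::complete(imp, 1)         # first imputed dataset
  print(colSums(is.na(completed)))
}
```

A sparser predictor matrix means each imputation model has fewer terms to fit, which is where the time savings come from when there are many predictors.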