Flexible would be better. An extremely large n allows a flexible model to learn patterns without excessive over-fitting, achieving better fit to the true data distribution. Moreover, with a small p, interpretability should not be an issue.
Flexible would be worse. In this scenario, you have a high dimensional relation but not enough data. Both flexible models and inflexible models would struggle to reliably estimate given the limited n. However, flexible models are vulnerable to overfitting to noise.
Flexible would be better. Inflexible models struggle to fit non-linear relationships. However, although linear models are relatively “inflexible,” I think it’s important to remember that linear regression can also include polynomial terms and hence fit many scenarios (just not this case).
Flexible would be worse. In general, you want both bias and variance to be low. In this scenario, however, the variance of the error terms (the irreducible error) is high, so much of the variation in the response is noise. Flexible models fit the data more closely, noise included, which raises the variance of the fit and leads to overfitting. An inflexible method adds less variance of its own, so it is the safer choice here.
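The answers above can be sanity-checked with a small simulation. This is a minimal sketch under assumed settings, not part of the original exercise: the true function `sin(2 * pi * x)`, the sample sizes, the noise levels, and the helper name `avg_test_mse` are all illustrative choices. It compares a straight-line fit (inflexible) with a degree-10 polynomial (flexible), averaging test error over repeated training sets.

```r
# Illustrative simulation; every setting here is an assumption, not from the exercise.
set.seed(1)

true_f <- function(x) sin(2 * pi * x)   # assumed non-linear truth

# Average test MSE of a degree-`degree` polynomial fit over `reps` training sets.
avg_test_mse <- function(n, sigma, degree, reps = 50, n_test = 1000) {
  mse <- replicate(reps, {
    train <- data.frame(x = runif(n))
    train$y <- true_f(train$x) + rnorm(n, sd = sigma)
    test <- data.frame(x = runif(n_test))
    test$y <- true_f(test$x) + rnorm(n_test, sd = sigma)
    fit <- lm(y ~ poly(x, degree), data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(mse)
}

# Large n, modest noise: the flexible fit can follow the curve.
large_n <- c(inflexible = avg_test_mse(2000, 0.3, degree = 1),
             flexible   = avg_test_mse(2000, 0.3, degree = 10))

# Small n, high noise: the flexible fit chases the noise.
small_n_noisy <- c(inflexible = avg_test_mse(20, 1, degree = 1),
                   flexible   = avg_test_mse(20, 1, degree = 10))

print(round(rbind(large_n, small_n_noisy), 3))
```

Under these assumed settings the flexible fit should have the lower test error in the first row and the higher one in the second, which is the pattern the answers describe.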
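# Stylized curves as functions of flexibility (x in [0, 1]); the shapes are illustrative only
f1 <- function(x) {(x-1)^2}            # squared bias
f2 <- function(x) {x^2}                # variance
f3 <- function(x) {(x-1)^2+0.1}        # training error
f4 <- function(x) {(2*(x-0.5))^2+0.5}  # test error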
plot(f1, 0, 1, ylab = "value", xlab = "flexibility", col = "red", main = "Bias-Variance Decomposition", ylim = c(0,1))
curve(f2, add = TRUE, col = "blue")
curve(f3, add = TRUE, col = "green")
curve(f4, add = TRUE, col = "purple")
abline(h=0.4, col = "orange")
legend("topleft", legend = c("(squared) bias", "variance", "training error", "test error", "Bayes (irreducible) error"), col = c("red", "blue", "green", "purple", "orange"), lty = 1, cex = 0.5)
Bias decreases as flexibility increases. Variance increases as flexibility increases. Training error decreases as flexibility increases. Test error decreases up to a certain point; beyond it, the growing tendency to overfit keeps lowering bias and training error but raises variance, so test error increases. Irreducible error is inherent to the data and is not affected by the model, so it stays constant as flexibility increases.
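For reference, the behavior described above follows from the standard decomposition of expected test error at a point. The notation here ($x_0$, $y_0$, $\hat{f}$, $\varepsilon$) is introduced for illustration and does not appear in the original answer:

```latex
\mathbb{E}\!\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\!\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\!\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\varepsilon)
```

Because the first two terms are non-negative, expected test error can never fall below $\operatorname{Var}(\varepsilon)$, the irreducible error, whereas training error can.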
library(ISLR2)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
dim(Boston)
## [1] 506 13
The Boston data set from ISLR2 contains housing values for 506 suburbs of Boston. It is a data frame with 506 rows, one per suburb, and 13 columns, one per variable: crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft), indus (proportion of non-retail business acres per town), chas (Charles River dummy variable: 1 if the tract bounds the river, 0 otherwise), nox (nitrogen oxides concentration, in parts per 10 million), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted mean of distances to five Boston employment centers), rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), ptratio (pupil-teacher ratio by town), lstat (lower status of the population, in percent), and medv (median value of owner-occupied homes in $1000s).
pairs(Boston)
The variable lstat is positively correlated with nox and age and negatively correlated with rm, dis, and medv. This makes sense: lower-income families are more likely to be exposed to air pollution, and their homes are likely to be lower in value, closer to employment centers (downtown or factories), and smaller, with fewer rooms. The variable nox is positively correlated with indus and age and negatively correlated with dis. A plausible explanation is that industrial activity emits nitrogen oxides, so the closer a tract is to employment centers (downtown or factories), the more it is exposed to air pollution.
round(cor(Boston),2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## lstat medv
## crim 0.46 -0.39
## zn -0.41 0.36
## indus 0.60 -0.48
## chas -0.05 0.18
## nox 0.59 -0.43
## rm -0.61 0.70
## age 0.60 -0.38
## dis -0.50 0.25
## rad 0.49 -0.38
## tax 0.54 -0.47
## ptratio 0.37 -0.51
## lstat 1.00 -0.74
## medv -0.74 1.00
cor.test(Boston$crim, Boston$rad)
##
## Pearson's product-moment correlation
##
## data: Boston$crim and Boston$rad
## t = 17.998, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5693817 0.6758248
## sample estimates:
## cor
## 0.6255051
cor.test(Boston$crim, Boston$tax)
##
## Pearson's product-moment correlation
##
## data: Boston$crim and Boston$tax
## t = 16.099, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5221186 0.6375464
## sample estimates:
## cor
## 0.5827643
Based on the correlation coefficients, per capita crime rate is positively correlated with rad (index of accessibility to radial highways), r = 0.63, and with tax (full-value property-tax rate per $10,000), r = 0.58. Both correlations are statistically significant (p < 2.2e-16 in each test), although neither reaches the r = 0.75 cutoff that is often used as a rule of thumb for a strong correlation.
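Reading the strongest associations off a 13 x 13 matrix by eye is error-prone, so a small helper can rank them instead. This is a sketch only: `top_correlates` is a hypothetical helper name, and the toy data frame stands in for Boston so the example runs without ISLR2 installed.

```r
# Rank the other columns of a numeric data frame by |correlation| with `target`.
# `top_correlates` is a hypothetical helper, not a function from any package.
top_correlates <- function(df, target, n = 3) {
  r <- cor(df)[, target]
  r <- r[names(r) != target]                 # drop the trivial self-correlation
  head(r[order(abs(r), decreasing = TRUE)], n)
}

# Toy stand-in data (assumed, for illustration only).
set.seed(42)
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.5)        # strongly related to a
toy$c <- -toy$a + rnorm(100, sd = 2)         # weakly, negatively related to a
toy$d <- rnorm(100)                          # unrelated noise

top_correlates(toy, "a")
```

With the real data, `top_correlates(Boston, "crim", n = 2)` would be expected to return rad and tax, matching the correlation matrix above.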
plot(Boston$crim)
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
plot(Boston$tax)
summary(Boston$tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 187.0 279.0 330.0 408.2 666.0 711.0
plot(Boston$ptratio)
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
print(which(Boston$crim > 3.62 & Boston$tax > 330 & Boston$ptratio > 19.05))
## [1] 357 358 359 360 361 362 363 364 366 367 368 369 370 371 372 373 374 375
## [19] 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393
## [37] 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411
## [55] 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429
## [73] 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447
## [91] 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465
## [109] 467 468 469 470 471 472 474 475 476 477 478 479 480 481 482 483 486 487
## [127] 488
Some census tracts of Boston have particularly high crime rates, tax rates, and pupil-teacher ratios. The filter above flags 127 tracts, all between rows 357 and 488, that are simultaneously above roughly the mean crime rate (cutoff 3.62), the median tax rate (330), and the median pupil-teacher ratio (19.05). By absolute range, tax rate (187 to 711) is the widest, followed by crime rate (0.0063 to 88.98) and pupil-teacher ratio (12.6 to 22). Relative to its own distribution, however, crime rate is the most extreme: its 3rd quartile is 3.68 while its maximum is 88.98, so it has the largest outliers.
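The cutoffs in the filter above were typed in by hand from the summaries. A variation is to compute them and pass them to one helper; the sketch below assumes a hypothetical function name, `above_all`, and uses a small made-up data frame rather than Boston so it is self-contained.

```r
# Return the row indices that exceed the cutoff on every named column.
# `above_all` is a hypothetical helper; `cutoffs` is a named numeric vector.
above_all <- function(df, cutoffs) {
  hits <- Map(function(col, cut) df[[col]] > cut, names(cutoffs), cutoffs)
  which(Reduce(`&`, hits))
}

# Toy stand-in data (assumed values, for illustration only).
toy <- data.frame(crim    = c(0.1, 0.2, 5, 30, 40),
                  tax     = c(200, 300, 250, 666, 666),
                  ptratio = c(15, 18, 19, 20.2, 21))

# Cutoffs computed from the data instead of copied from printed summaries.
cutoffs <- c(crim    = mean(toy$crim),
             tax     = median(toy$tax),
             ptratio = median(toy$ptratio))

above_all(toy, cutoffs)
```

On the real data the same call with `Boston` in place of `toy` should give a result close to the one shown above, though not necessarily identical, since the original filter uses the rounded cutoff 3.62 rather than the exact mean.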
hist(Boston$chas)$count
## [1] 471 0 0 0 0 0 0 0 0 35
There are 35 census tracts that bound the Charles River.
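Histogram bin counts work here, but for a 0/1 dummy variable a direct count is easier to read. A minimal sketch follows, using a stand-in vector rebuilt from the counts reported above (471 zeros and 35 ones) instead of `Boston$chas` so that it runs without ISLR2:

```r
# Stand-in for Boston$chas, rebuilt from the counts printed above.
chas <- rep(c(0, 1), times = c(471, 35))

table(chas)            # counts per level
n_river <- sum(chas == 1)
n_river                # number of tracts bounding the river
```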
summary(Boston$ptratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.60 17.40 19.05 18.46 20.20 22.00
The median pupil-teacher ratio is 19.05.
head(Boston[order(Boston$medv),])
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 399 38.35180 0 18.10 0 0.693 5.453 100.0 1.4896 24 666 20.2 30.59 5.0
## 406 67.92080 0 18.10 0 0.693 5.683 100.0 1.4254 24 666 20.2 22.98 5.0
## 401 25.04610 0 18.10 0 0.693 5.987 100.0 1.5888 24 666 20.2 26.77 5.6
## 400 9.91655 0 18.10 0 0.693 5.852 77.8 1.5004 24 666 20.2 29.97 6.3
## 415 45.74610 0 18.10 0 0.693 4.519 100.0 1.6582 24 666 20.2 36.98 7.0
## 490 0.18337 0 27.74 0 0.609 5.414 98.3 1.7554 4 711 20.1 23.97 7.0
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Census tracts #399 and #406 share the lowest median home value, $5,000. Both have a crime rate above the 3rd quartile, a proportion of non-retail business acres at the 3rd quartile, no Charles River boundary (like most tracts), a nitrogen oxides concentration above the 3rd quartile, fewer rooms than the 1st quartile, all units built before 1940, a distance to Boston employment centers below the 1st quartile, the highest index of accessibility to highways, a tax rate at the 3rd quartile, a pupil-teacher ratio at the 3rd quartile, and a lower-status percentage above the 3rd quartile. Given the relatively high crime rate, high nitrogen oxides concentration, older units, fewer rooms, and lower socioeconomic status, these 2 census tracts are likely less desirable for higher-income families. This is not unexpected, as they have the absolute lowest median value of owner-occupied homes.
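Comparing individual tracts against quartiles one variable at a time can be replaced by percentile ranks computed with `ecdf()`. The helper name `percentile_ranks` and the toy data below are assumptions made for illustration; they are not part of the original analysis.

```r
# Percentile rank (0-100) of the chosen rows within each numeric column.
# `percentile_ranks` is a hypothetical helper name.
percentile_ranks <- function(df, rows) {
  sapply(df, function(col) round(100 * ecdf(col)(col[rows])))
}

# Toy stand-in data (assumed values, for illustration only).
toy <- data.frame(crim = c(0.1, 0.3, 0.5, 4, 38),
                  rm   = c(6.5, 6.2, 7.1, 5.9, 5.4),
                  medv = c(24, 21, 35, 15, 5))

lowest <- which.min(toy$medv)        # row with the lowest median value
ranks <- percentile_ranks(toy, lowest)
ranks
```

A rank near 100 means the tract is among the highest on that variable and a rank near 0 among the lowest; on the real data the call would be `percentile_ranks(Boston, c(399, 406))`.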
dim(subset(Boston, rm>7))
## [1] 64 13
dim(subset(Boston, rm>8))
## [1] 13 13
summary(subset(Boston, rm>8))
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio lstat medv
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :2.47 Min. :21.9
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:3.32 1st Qu.:41.7
## Median : 7.000 Median :307.0 Median :17.40 Median :4.14 Median :48.3
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :4.31 Mean :44.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :7.44 Max. :50.0
64 census tracts average more than 7 rooms per dwelling, and 13 average more than 8. What stands out most about the tracts with more than 8 rooms per dwelling is that their median lstat is far lower than the overall median (4.14 vs. 11.36) and their median home value is far higher (48.3 vs. 21.2, in $1000s).
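Putting the subset medians next to the overall medians in one table makes this kind of comparison harder to get wrong than reading two separate `summary()` printouts. The sketch below uses a hypothetical helper, `compare_medians`, and made-up data in place of Boston.

```r
# Medians for all rows vs. a logical subset, side by side.
# `compare_medians` is a hypothetical helper name.
compare_medians <- function(df, keep) {
  rbind(all    = sapply(df, median),
        subset = sapply(df[keep, , drop = FALSE], median))
}

# Toy stand-in data (assumed values, for illustration only).
toy <- data.frame(rm    = c(5.5, 6.0, 6.2, 6.4, 8.1, 8.3, 8.7),
                  lstat = c(20, 14, 11, 9, 4, 3, 5),
                  medv  = c(12, 19, 21, 23, 45, 50, 48))

med <- compare_medians(toy, toy$rm > 8)
med
```

On the real data the call would be `compare_medians(Boston, Boston$rm > 8)`.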
library(mlbench)  # provides the Soybean data set
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
table(Soybean$Class)
##
## 2-4-d-injury alternarialeaf-spot
## 16 91
## anthracnose bacterial-blight
## 44 20
## bacterial-pustule brown-spot
## 20 92
## brown-stem-rot charcoal-rot
## 44 20
## cyst-nematode diaporthe-pod-&-stem-blight
## 14 15
## diaporthe-stem-canker downy-mildew
## 20 20
## frog-eye-leaf-spot herbicide-injury
## 91 8
## phyllosticta-leaf-spot phytophthora-rot
## 20 88
## powdery-mildew purple-seed-stain
## 20 20
## rhizoctonia-root-rot
## 20
barplot(table(Soybean$Class), main="Frequency Distribution of Class", xlab="Class", ylab="Frequency")
library(caret)
nearZeroVar(Soybean)
## [1] 19 26 28
colnames(Soybean)[nearZeroVar(Soybean)]
## [1] "leaf.mild" "mycelium" "sclerotia"
Soybean1 <- Soybean[,-nearZeroVar(Soybean)]
Soybean1[is.na(Soybean1)] <- 0
Soybean1 <- Soybean1[,-1]
for (col in names(Soybean1)){
Soybean1[[col]] <- as.numeric(as.character(Soybean1[[col]]))
}
segCorr <- cor(Soybean1)
library(corrplot)
## corrplot 0.94 loaded
corrplot(segCorr, order="hclust", tl.cex=.35)
highCorr <- findCorrelation(segCorr, .75)
colnames(segCorr)[highCorr]
## [1] "leaf.marg"
Yes. Columns 19, 26, and 28 (leaf.mild, mycelium, and sclerotia) have near-zero variance, so they are unlikely to contribute much to a model’s predictive power. In addition, findCorrelation flags leaf.marg for removal because it is involved in a pairwise correlation whose absolute value exceeds 0.75.
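To make "near zero variance" concrete, the two statistics behind the check can be computed by hand. The following is a base-R approximation of what `caret::nearZeroVar()` evaluates, assuming its documented defaults (`freqCut = 95/5`, `uniqueCut = 10`); the function name `nzv_stats` and the example vectors are hypothetical.

```r
# Base-R approximation of the near-zero-variance statistics (not caret's own code).
nzv_stats <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- table(x)                                   # NAs are ignored by table()
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  # Ratio of the most common value's count to the second most common value's count.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations.
  pct_unique <- 100 * length(counts) / length(x)
  data.frame(freq_ratio = freq_ratio,
             pct_unique = pct_unique,
             flagged    = freq_ratio > freq_cut && pct_unique < unique_cut)
}

# Made-up example predictors.
balanced <- rep(c("0", "1"), times = c(60, 40))
lopsided <- rep(c("0", "1"), times = c(98, 2))

rbind(balanced = nzv_stats(balanced), lopsided = nzv_stats(lopsided))
```

A predictor is flagged only when one value dominates (high frequency ratio) and there are few distinct values relative to the sample size.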
#tempSoybean <- Soybean[, c("Class", "roots", "temp", "crop.hist", "plant.growth", "stem", "date", "area.dam", "leaves")]
#tempData <- mice(tempSoybean, m = 5, maxit = 50, meth = 'pmm', seed = 500)
#completedData <- complete(tempData,1)
#head(completedData)
tempSoybean <- Soybean[, !(names(Soybean) %in% c("leaf.mild", "mycelium", "sclerotia", "leaf.marg"))]
#tempSoybean <- Soybean[, !(names(Soybean) %in% c("leaf.mild", "mycelium", "sclerotia", "leaf.marg", "hail","sever", "seed.tmt", "lodging"))]
library(mice)  # provides quickpred(), mice(), and complete()
new.pred <- quickpred(tempSoybean)
tempData <- mice(tempSoybean, method="pmm", seed=1000, pred=new.pred)
##
## iter imp variable
## 1 1 date plant.stand precip* temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth* leaf.halo leaf.size leaf.shread* leaf.malf stem lodging stem.cankers canker.lesion fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 1 2 date plant.stand precip* temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.size leaf.shread* leaf.malf stem lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 1 3 date plant.stand* precip temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.size leaf.shread leaf.malf stem lodging stem.cankers canker.lesion* fruiting.bodies* ext.decay int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 1 4 date plant.stand precip* temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.size leaf.shread leaf.malf stem* lodging stem.cankers* canker.lesion fruiting.bodies* ext.decay int.discolor fruit.pods fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 1 5 date plant.stand* precip* temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth leaf.halo leaf.size leaf.shread leaf.malf stem lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 2 1 date* plant.stand* precip* temp hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 2 2 date* plant.stand* precip* temp hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo leaf.size leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 2 3 date* plant.stand* precip* temp* hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 2 4 date* plant.stand* precip* temp* hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 2 5 date plant.stand* precip* temp* hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth* leaf.halo leaf.size leaf.shread* leaf.malf stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 3 1 date* plant.stand* precip temp hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion fruiting.bodies ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 3 2 date plant.stand precip* temp hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread leaf.malf* stem* lodging stem.cankers canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling roots*
## 3 3 date plant.stand* precip temp* hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 3 4 date* plant.stand* precip* temp hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling roots*
## 3 5 date* plant.stand* precip* temp* hail* crop.hist area.dam* sever* seed.tmt* germ plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots
## 4 1 date* plant.stand precip* temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 4 2 date* plant.stand* precip* temp* hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 4 3 date plant.stand* precip* temp* hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo leaf.size* leaf.shread leaf.malf stem lodging stem.cankers canker.lesion* fruiting.bodies* ext.decay int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 4 4 date plant.stand precip* temp* hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 4 5 date* plant.stand* precip* temp* hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots
## 5 1 date* plant.stand* precip* temp hail* crop.hist area.dam sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 5 2 date* plant.stand* precip* temp hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 5 3 date* plant.stand* precip* temp hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 5 4 date* plant.stand* precip* temp hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor* fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## 5 5 date* plant.stand precip* temp* hail* crop.hist area.dam* sever* seed.tmt* germ* plant.growth* leaf.halo* leaf.size* leaf.shread* leaf.malf* stem* lodging stem.cankers* canker.lesion* fruiting.bodies* ext.decay* int.discolor fruit.pods* fruit.spots* seed* mold.growth* seed.discolor* seed.size* shriveling* roots*
## Warning: Number of logged events: 1319
completedData <- complete(tempData,1)
head(completedData)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.size leaf.shread
## 1 1 0 0 1 1 0 2 0
## 2 2 1 1 1 1 0 2 0
## 3 2 1 2 1 1 0 2 0
## 4 2 0 1 1 1 0 2 0
## 5 1 0 2 1 1 0 2 0
## 6 1 0 1 1 1 0 2 0
## leaf.malf stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 1 0 1 1 3 1 1 1
## 2 0 1 0 3 1 1 1
## 3 0 1 0 3 0 1 1
## 4 0 1 0 3 0 1 1
## 5 0 1 0 3 1 1 1
## 6 0 1 0 3 0 1 1
## int.discolor fruit.pods fruit.spots seed mold.growth seed.discolor seed.size
## 1 0 0 4 0 0 0 0
## 2 0 0 4 0 0 0 0
## 3 0 0 4 0 0 0 0
## 4 0 0 4 0 0 0 0
## 5 0 0 4 0 0 0 0
## 6 0 0 4 0 0 0 0
## shriveling roots
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
completedData <- as.data.frame(completedData)
library(VIM)  # provides aggr()
aggr_plot <- aggr(completedData, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(completedData), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Class 0
## date 0
## plant.stand 0
## precip 0
## temp 0
## hail 0
## crop.hist 0
## area.dam 0
## sever 0
## seed.tmt 0
## germ 0
## plant.growth 0
## leaves 0
## leaf.halo 0
## leaf.size 0
## leaf.shread 0
## leaf.malf 0
## stem 0
## lodging 0
## stem.cankers 0
## canker.lesion 0
## fruiting.bodies 0
## ext.decay 0
## int.discolor 0
## fruit.pods 0
## fruit.spots 0
## seed 0
## mold.growth 0
## seed.discolor 0
## seed.size 0
## shriveling 0
## roots 0
In part a), I show how to eliminate predictors with degenerate distributions, and I also show the result of removing leaf.marg, a predictor that is highly correlated with another. In part b), I find that only 8 of the 35 predictors have an acceptable level of missing data (under 5%). The commented-out code shows my experiments with removing additional predictors that have a large share of missing values. However, since I do not have domain expertise, I concluded it was best to impute all of them. Using quickpred to select the imputation predictors allows the MICE package’s imputation to finish within a reasonable time frame. In further work, I would experiment with the additional methods listed by methods(mice).
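The share of missing values per predictor, which the 5% guideline refers to, can be tabulated in a few lines. This is a sketch rather than the code used for the write-up: `missing_summary` is a hypothetical helper and the toy data frame stands in for Soybean.

```r
# Proportion of missing values per column, sorted, with a flag for the threshold.
# `missing_summary` is a hypothetical helper name.
missing_summary <- function(df, threshold = 0.05) {
  prop <- sort(colMeans(is.na(df)), decreasing = TRUE)
  data.frame(variable     = names(prop),
             prop_missing = unname(prop),
             acceptable   = unname(prop) < threshold)
}

# Toy stand-in data (assumed values, for illustration only).
toy <- data.frame(class = c("a", "b", "a", "b"),
                  hail  = c(0, NA, NA, 1),
                  roots = c(0, 0, 1, NA))

miss <- missing_summary(toy)
miss
```

On the real data, `missing_summary(Soybean)` would list each predictor's missing share, and `sum(missing_summary(Soybean)$acceptable)` would count the columns under the threshold (that count includes Class, which is not a predictor).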