Problem 1

a) How many rows are in this Boston data set? How many columns? What do the rows and columns represent?

# install.packages('MASS')
library(MASS)
# Boston
# ?Boston

The Boston dataset contains 506 rows and 14 columns. Each row represents a suburb in Boston, and each column is a descriptor of that suburb as outlined in the dataset description under “Format”.
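
A quick dimension check confirms this:

dim(Boston)
## [1] 506  14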

b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

plot(Boston$medv ~ Boston$rm, xlab = "Average number of rooms per dwelling", ylab = "Median value in $1000s")

plot(Boston$medv ~ Boston$lstat, xlab = "Lower status of the population (%)", ylab = "Median value in $1000s")

plot(Boston$medv ~ Boston$nox, xlab = "Nitrogen oxides concentration (parts per 10 million)", ylab = "Median value in $1000s")

There is a clear positive relationship between the average number of rooms per dwelling (RM) and the median value of owner-occupied homes (MEDV). This suggests that suburbs with larger homes tend to have higher housing prices.

There is a strong negative relationship between the percentage of the population classified as lower socioeconomic status (LSTAT) and MEDV. Suburbs with higher proportions of lower-status residents tend to have substantially lower home values.

Finally, nitrogen oxide concentration (NOX), a measure of air pollution, shows a mild negative relationship with MEDV, indicating that higher pollution levels are generally associated with slightly lower housing prices.
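
As a supplementary view, the pairs() function draws every pairwise combination in one figure; a minimal example covering the variables discussed above:

# All pairwise scatterplots among the four variables of interest
pairs(Boston[, c("rm", "lstat", "nox", "medv")])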

c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

plot(Boston$medv ~ Boston$crim, xlab = "Per capita crime rate by town", ylab = "Median value in $1000s")

Higher per capita crime rates are associated with lower median home values. This indicates a negative relationship between crime and housing prices, suggesting that suburbs with higher crime levels tend to have less valuable homes, while wealthier suburbs generally experience lower crime rates.
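
Since every column of Boston is numeric, the correlations of crim with all other variables can also be checked directly (a quick sketch; rad and tax typically show the strongest positive correlations with crime):

# Correlation of each variable with per capita crime rate, sorted
sort(cor(Boston)[, "crim"])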

d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Comment on the range of each predictor.

range(Boston$crim)
## [1]  0.00632 88.97620
range(Boston$tax)
## [1] 187 711
boxplot(Boston$crim)

boxplot(Boston$tax)

Crime rates in the Boston dataset range from 0.00632 to 88.97620, indicating a very wide spread in per capita crime across census tracts. The boxplot shows that most suburbs have crime rates below 10, with a small number of tracts exhibiting much higher crime levels.

Property tax rates range from 187 to 711 (full-value rate per $10,000). The boxplot shows no extreme outliers, but the spread is wide and the distribution is bimodal: the median rate is approximately 330, while a large group of tracts shares the high rate of 666.
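
The tails can also be counted directly (the crime cutoff of 10 is illustrative):

# How many tracts fall in the upper tails?
sum(Boston$crim > 10)   # tracts with notably high crime rates
sum(Boston$tax == 666)  # tracts sharing the high 666 tax rate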

e) How many of the census tracts in this data set bound the Charles river?

sum(Boston$chas)
## [1] 35

There are 35 tracts that bound the Charles River.

f) What is the median pupil-teacher ratio among the towns in this data set?

median(Boston$ptratio)
## [1] 19.05

The median pupil-teacher ratio is 19.05.

Problem 2

library(mlbench)
data(Soybean)
# ?Soybean

a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
colnames(Soybean)[nearZeroVar(Soybean)]
## [1] "leaf.mild" "mycelium"  "sclerotia"
# Frequency distributions for all 35 categorical predictors, ten per page,
# plotted in the same order as the data frame columns
par(mfrow = c(2, 5))
for (pred in setdiff(names(Soybean), "Class")) {
  barplot(table(Soybean[[pred]]), main = pred)
}

Using the nearZeroVar function, three predictors were identified as having degenerate distributions: leaf.mild, mycelium, and sclerotia. Inspection of the barplots suggested that several additional predictors, including leaf.malf, lodging, int.discolor, and shriveling, also exhibit near-degenerate behavior. These variables show little to no variability, with one category dominating the majority of observations. As a result, they provide minimal predictive information and would typically be removed during preprocessing.
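
The saveMetrics argument of nearZeroVar makes this assessment concrete, reporting the frequency ratio and percent of unique values for every predictor so that borderline cases can be ranked rather than judged by eye:

# Diagnostic metrics behind the near-zero-variance flag
nzvMetrics = nearZeroVar(Soybean, saveMetrics = TRUE)
nzvMetrics[nzvMetrics$nzv, ]                      # the three flagged predictors
head(nzvMetrics[order(-nzvMetrics$freqRatio), ])  # most imbalanced predictors overall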

Problem 3

a) Start R and use these commands to load the data

library(caret)
data(BloodBrain)
# ?BloodBrain
str(logBBB)
##  num [1:208] 1.08 -0.4 0.22 0.14 0.69 0.44 -0.43 1.38 0.75 0.88 ...
str(bbbDescr)
## 'data.frame':    208 obs. of  134 variables:
##  $ tpsa                : num  12 49.3 50.5 37.4 37.4 ...
##  $ nbasic              : int  1 0 1 0 1 1 1 1 1 1 ...
##  $ negative            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ vsa_hyd             : num  167.1 92.6 295.2 319.1 299.7 ...
##  $ a_aro               : int  0 6 15 15 12 11 6 12 12 6 ...
##  $ weight              : num  156 151 366 383 326 ...
##  $ peoe_vsa.0          : num  76.9 38.2 58.1 62.2 74.8 ...
##  $ peoe_vsa.1          : num  43.4 25.5 124.7 124.7 118 ...
##  $ peoe_vsa.2          : num  0 0 21.7 13.2 33 ...
##  $ peoe_vsa.3          : num  0 8.62 8.62 21.79 0 ...
##  $ peoe_vsa.4          : num  0 23.3 17.4 0 0 ...
##  $ peoe_vsa.5          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ peoe_vsa.6          : num  17.24 0 8.62 8.62 8.62 ...
##  $ peoe_vsa.0.1        : num  18.7 49 83.8 83.8 83.8 ...
##  $ peoe_vsa.1.1        : num  43.5 0 49 68.8 36.8 ...
##  $ peoe_vsa.2.1        : num  0 0 0 0 0 ...
##  $ peoe_vsa.3.1        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ peoe_vsa.4.1        : num  0 0 5.68 5.68 5.68 ...
##  $ peoe_vsa.5.1        : num  0 13.567 2.504 0 0.137 ...
##  $ peoe_vsa.6.1        : num  0 7.9 2.64 2.64 2.5 ...
##  $ a_acc               : int  0 2 2 2 2 2 2 2 0 2 ...
##  $ a_acid              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ a_base              : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ vsa_acc             : num  0 13.57 8.19 8.19 8.19 ...
##  $ vsa_acid            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ vsa_base            : num  5.68 0 0 0 0 ...
##  $ vsa_don             : num  5.68 5.68 5.68 5.68 5.68 ...
##  $ vsa_other           : num  0 28.1 43.6 28.3 19.6 ...
##  $ vsa_pol             : num  0 13.6 0 0 0 ...
##  $ slogp_vsa0          : num  18 25.4 14.1 14.1 14.1 ...
##  $ slogp_vsa1          : num  0 23.3 34.8 34.8 34.8 ...
##  $ slogp_vsa2          : num  3.98 23.86 0 0 0 ...
##  $ slogp_vsa3          : num  0 0 76.2 76.2 76.2 ...
##  $ slogp_vsa4          : num  4.41 0 3.19 3.19 3.19 ...
##  $ slogp_vsa5          : num  32.9 0 9.51 0 0 ...
##  $ slogp_vsa6          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ slogp_vsa7          : num  0 70.6 148.1 144 140.7 ...
##  $ slogp_vsa8          : num  113.2 0 75.5 75.5 75.5 ...
##  $ slogp_vsa9          : num  33.3 41.3 28.3 55.5 26 ...
##  $ smr_vsa0            : num  0 23.86 12.63 3.12 3.12 ...
##  $ smr_vsa1            : num  18 25.4 27.8 27.8 27.8 ...
##  $ smr_vsa2            : num  4.41 0 0 0 0 ...
##  $ smr_vsa3            : num  3.98 5.24 8.43 8.43 8.43 ...
##  $ smr_vsa4            : num  0 20.8 29.6 21.4 20.3 ...
##  $ smr_vsa5            : num  113.2 70.6 235.1 235.1 234.6 ...
##  $ smr_vsa6            : num  0 5.26 76.25 76.25 76.25 ...
##  $ smr_vsa7            : num  66.2 33.3 0 31.3 0 ...
##  $ tpsa.1              : num  16.6 49.3 51.7 38.6 38.6 ...
##  $ logp.o.w.           : num  2.948 0.889 4.439 5.254 3.8 ...
##  $ frac.anion7.        : num  0 0.001 0 0 0 0 0.001 0 0 0 ...
##  $ frac.cation7.       : num  0.999 0 0.986 0.986 0.986 0.986 0.996 0.946 0.999 0.976 ...
##  $ andrewbind          : num  3.4 -3.3 12.8 12.8 10.3 10 10.4 15.9 12.9 9.5 ...
##  $ rotatablebonds      : int  3 2 8 8 8 8 8 7 4 5 ...
##  $ mlogp               : num  2.5 1.06 4.66 3.82 3.27 ...
##  $ clogp               : num  2.97 0.494 5.137 5.878 4.367 ...
##  $ mw                  : num  155 151 365 382 325 ...
##  $ nocount             : int  1 3 5 4 4 4 4 3 2 4 ...
##  $ hbdnr               : int  1 2 1 1 1 1 2 1 1 0 ...
##  $ rule.of.5violations : int  0 0 1 1 0 0 0 0 1 0 ...
##  $ alert               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ prx                 : int  0 1 6 2 2 2 1 0 0 4 ...
##  $ ub                  : num  0 3 5.3 5.3 4.2 3.6 3 4.7 4.2 3 ...
##  $ pol                 : int  0 2 3 3 2 2 2 3 4 1 ...
##  $ inthb               : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ adistm              : num  0 395 1365 703 746 ...
##  $ adistd              : num  0 10.9 25.7 10 10.6 ...
##  $ polar_area          : num  21.1 117.4 82.1 65.1 66.2 ...
##  $ nonpolar_area       : num  379 248 638 668 602 ...
##  $ psa_npsa            : num  0.0557 0.4743 0.1287 0.0974 0.11 ...
##  $ tcsa                : num  0.0097 0.0134 0.0111 0.0108 0.0118 0.0111 0.0123 0.0099 0.0106 0.0115 ...
##  $ tcpa                : num  0.1842 0.0417 0.0972 0.1218 0.1186 ...
##  $ tcnp                : num  0.0103 0.0198 0.0125 0.0119 0.013 0.0125 0.0162 0.011 0.0109 0.0122 ...
##  $ ovality             : num  1.1 1.12 1.3 1.3 1.27 ...
##  $ surface_area        : num  400 365 720 733 668 ...
##  $ volume              : num  656 555 1224 1257 1133 ...
##  $ most_negative_charge: num  -0.617 -0.84 -0.801 -0.761 -0.857 ...
##  $ most_positive_charge: num  0.307 0.497 0.541 0.48 0.455 ...
##  $ sum_absolute_charge : num  3.89 4.89 7.98 7.93 7.85 ...
##  $ dipole_moment       : num  1.19 4.21 3.52 3.15 3.27 ...
##  $ homo                : num  -9.67 -8.96 -8.63 -8.56 -8.67 ...
##  $ lumo                : num  3.4038 0.1942 0.0589 -0.2651 0.3149 ...
##  $ hardness            : num  6.54 4.58 4.34 4.15 4.49 ...
##  $ ppsa1               : num  349 223 518 508 509 ...
##  $ ppsa2               : num  679 546 2066 2013 1999 ...
##  $ ppsa3               : num  31 42.3 64 61.7 61.6 ...
##  $ pnsa1               : num  51.1 141.8 202 225.4 158.8 ...
##  $ pnsa2               : num  -99.3 -346.9 -805.9 -894 -623.3 ...
##  $ pnsa3               : num  -10.5 -44 -43.8 -42 -39.8 ...
##  $ fpsa1               : num  0.872 0.611 0.719 0.693 0.762 ...
##  $ fpsa2               : num  1.7 1.5 2.87 2.75 2.99 ...
##  $ fpsa3               : num  0.0774 0.1159 0.0888 0.0842 0.0922 ...
##  $ fnsa1               : num  0.128 0.389 0.281 0.307 0.238 ...
##  $ fnsa2               : num  -0.248 -0.951 -1.12 -1.22 -0.933 ...
##  $ fnsa3               : num  -0.0262 -0.1207 -0.0608 -0.0573 -0.0596 ...
##  $ wpsa1               : num  139.7 81.4 372.7 372.1 340.1 ...
##  $ wpsa2               : num  272 199 1487 1476 1335 ...
##  $ wpsa3               : num  12.4 15.4 46 45.2 41.1 ...
##  $ wnsa1               : num  20.4 51.8 145.4 165.3 106 ...
##  $ wnsa2               : num  -39.8 -126.6 -580.1 -655.3 -416.3 ...
##   [list output truncated]

b) Do any of the individual predictors have degenerate distributions?

colnames(bbbDescr)[nearZeroVar(bbbDescr)]
## [1] "negative"     "peoe_vsa.2.1" "peoe_vsa.3.1" "a_acid"       "vsa_acid"    
## [6] "frac.anion7." "alert"

Yes: the variables “negative”, “peoe_vsa.2.1”, “peoe_vsa.3.1”, “a_acid”, “vsa_acid”, “frac.anion7.”, and “alert” were identified as having near-zero variance, and their distributions are therefore degenerate.

c) Generally speaking, are there strong relationships between the predictor data? If so, how could correlations in the predictor set be reduced? Does this have a dramatic effect on the number of predictors available for modeling?

# Removing degenerate distributions
bbbDescr1 = bbbDescr[,-nearZeroVar(bbbDescr)]
nearZeroVar(bbbDescr1)
## integer(0)
# Correlation matrix
bbbCorr = cor(bbbDescr1)
library(corrplot)
## corrplot 0.95 loaded
corrplot(bbbCorr, order = "hclust", tl.cex = 0.35)

# Identify highly correlated predictors
highCorr = findCorrelation(bbbCorr, 0.75)
highCorr
##  [1]  22  27  32  35  40  50  51  52  61  65  66  67  68  76  77  78  80  81  83
## [20]  86  87  88  89  90  91  92  93  94  95  96 101 103 104 106 107 108 109 110
## [39] 112 116 117 118 119 120 121   3   1  19  20  28   5  44  43   4  18  58  60
## [58]  62  73  71  82  84 102 113 105
# Remove highly correlated predictors
bbbCorr1 = cor(bbbDescr1[,-highCorr])
corrplot(bbbCorr1, order = "hclust", tl.cex = 0.35)

The correlation matrix shows strong relationships among the predictors. Predictors with near-zero variance were removed first; then, using a correlation cutoff of 0.75 with findCorrelation(), 65 of the remaining 127 predictors were identified as highly correlated and removed. This has a dramatic effect on the number of predictors available for modeling, cutting them from 127 to 62 (a 51% decrease).
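
An alternative to dropping columns is to transform the predictors into uncorrelated components with PCA; a minimal sketch using caret's preProcess (the 95% variance threshold is its default, used here for illustration):

# PCA replaces the correlated predictors with orthogonal components
pcaObj = preProcess(bbbDescr1, method = c("center", "scale", "pca"), thresh = 0.95)
pcaObj$numComp  # number of components needed to capture 95% of the variance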

Problem 4

library(caret)
data(oil)
head(fattyAcids)
##   Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic
## 1      9.7     5.2  31.0     52.7       0.4        0.4        0.1
## 2     11.1     5.0  32.9     49.8       0.3        0.4        0.1
## 3     11.5     5.2  35.0     47.2       0.2        0.4        0.1
## 4     10.0     4.8  30.4     53.5       0.3        0.4        0.1
## 5     12.2     5.0  31.1     50.5       0.3        0.4        0.1
## 6      9.8     4.2  43.0     39.2       2.4        0.4        0.5
table(oilType)
## oilType
##  A  B  C  D  E  F  G 
## 37 26  3  7 11 10  2

a) Use the sample function in base R to create a completely random sample of 60 oils. How closely do the frequencies of the random sample match the original samples? Repeat this procedure several times to understand the variation in the sampling process.

samp = sample(oilType, 60)

par(mfrow = c(1,2))
plot(oilType)
plot(samp)

The sample function produces a random sample of 60 oils whose class frequencies approximate the original distribution reasonably well. Repeated sampling shows that the rare types (C, D, and G) fluctuate the most, but across many repetitions the mean proportions closely match the original distribution.
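
A short simulation makes this claim checkable (20 repetitions is an arbitrary choice):

# Average class proportions over repeated random draws of 60 oils
props = replicate(20, table(sample(oilType, 60)) / 60)
round(rowMeans(props), 3)                   # mean proportions across draws
round(table(oilType) / length(oilType), 3)  # original proportions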

b) Use the caret package function createDataPartition to create a stratified random sample. How does this compare to completely random samples?

part = createDataPartition(oilType, p = 60/96, list = FALSE)
partSamp = oilType[part]

par(mfrow = c(1,2))
plot(oilType)
plot(partSamp)

The stratified sampling produces a distribution that is much closer to the original proportions of each oil type compared to completely random sampling.
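
The same comparison in numbers, using the partition drawn above:

# Class proportions: original data vs. stratified sample
round(rbind(original   = table(oilType) / length(oilType),
            stratified = table(partSamp) / length(partSamp)), 3)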

c) With such a small sample size, what are the options for determining performance of the model? Should a test set be used?

Using a dedicated test set is not ideal here: with only 96 samples, holding data out for testing leaves too few samples for training, and the resulting performance estimates are highly variable. Better options are resampling methods that reuse all of the data, such as k-fold cross-validation, leave-one-out cross-validation, or the bootstrap.
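
As a sketch of what this looks like in caret (the k-nearest-neighbors model is chosen purely for illustration; the exercise does not prescribe one):

# Leave-one-out cross-validation in place of a held-out test set
ctrl = trainControl(method = "LOOCV")
fit = train(fattyAcids, oilType, method = "knn", trControl = ctrl)
fit$results  # accuracy for each candidate k, estimated from all 96 samples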

d) Try different sample sizes and accuracy rates to understand the trade-off between the uncertainty in the results, the model performance, and the test set size.

binom.test(16, 20)
## 
##  Exact binomial test
## 
## data:  16 and 20
## number of successes = 16, number of trials = 20, p-value = 0.01182
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.563386 0.942666
## sample estimates:
## probability of success 
##                    0.8
binom.test(10, 20)
## 
##  Exact binomial test
## 
## data:  10 and 20
## number of successes = 10, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2719578 0.7280422
## sample estimates:
## probability of success 
##                    0.5
binom.test(49, 50)
## 
##  Exact binomial test
## 
## data:  49 and 50
## number of successes = 49, number of trials = 50, p-value = 9.059e-14
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.8935305 0.9994938
## sample estimates:
## probability of success 
##                   0.98
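
These intervals illustrate the trade-off. With a test set of 20 samples, even 80% observed accuracy (16 of 20) leaves a wide 95% confidence interval of roughly 0.56 to 0.94, and 50% accuracy is statistically indistinguishable from guessing. A test set of 50 samples with 49 correct narrows the interval to about 0.89 to 1.00. Larger test sets therefore reduce uncertainty in the accuracy estimate, but with only 96 oils in total, every sample reserved for testing is one fewer for training, which reinforces the resampling recommendation from part (c).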

Problem 5

The bias–variance tradeoff describes the relationship between model complexity and prediction error. Bias is the error that comes from a model too simple to capture the true signal (underfitting); variance is the error that comes from a model flexible enough to fit the noise in the training data (overfitting). As model complexity increases, bias decreases while variance increases, hence the “tradeoff”; total prediction error is therefore minimized at an intermediate level of complexity. The hand-drawn plot is included on Canvas.
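
For reference, the standard decomposition makes this precise: assuming $y = f(x_0) + \varepsilon$ with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, the expected squared prediction error at a point $x_0$ is

$$
E\big[(y - \hat{f}(x_0))^2\big] = \big(E[\hat{f}(x_0)] - f(x_0)\big)^2 + \mathrm{Var}\big[\hat{f}(x_0)\big] + \sigma^2,
$$

that is, squared bias plus variance plus irreducible noise. Increasing flexibility drives the first term down and the second term up, so the sum is minimized in between; the $\sigma^2$ term is a floor no model can remove.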