# install.packages('MASS')
library(MASS)
# Boston
# ?Boston
The Boston dataset contains 506 rows and 14 columns. Each row represents a suburb in Boston, and each column is a descriptor of that suburb as outlined in the dataset description under “Format”.
plot(Boston$medv ~ Boston$rm, xlab = "Average number of rooms per dwelling", ylab = "Median value in $1000s")
plot(Boston$medv ~ Boston$lstat, xlab = "Lower status of the population (%)", ylab = "Median value in $1000s")
plot(Boston$medv ~ Boston$nox, xlab = "Nitrogen oxides concentration (parts per 10 million)", ylab = "Median value in $1000s")
There is a clear positive relationship between the average number of rooms per dwelling (RM) and the median value of owner-occupied homes (MEDV). This suggests that suburbs with larger homes tend to have higher housing prices.
There is a strong negative relationship between the percentage of the population classified as lower socioeconomic status (LSTAT) and MEDV. Suburbs with higher proportions of lower-status residents tend to have substantially lower home values.
Finally, nitrogen oxide concentration (NOX), a measure of air pollution, shows a mild negative relationship with MEDV, indicating that higher pollution levels are generally associated with slightly lower housing prices.
plot(Boston$medv ~ Boston$crim, xlab = "Per capita crime rate by town", ylab = "Median value in $1000s")
Higher per capita crime rates are associated with lower median home values. This indicates a negative relationship between crime and housing prices, suggesting that suburbs with higher crime levels tend to have less valuable homes, while wealthier suburbs generally experience lower crime rates.
range(Boston$crim)
## [1] 0.00632 88.97620
range(Boston$tax)
## [1] 187 711
boxplot(Boston$crim)
boxplot(Boston$tax)
Crime rates in the Boston dataset range from 0.00632 to 88.97620, indicating a very wide spread in per capita crime across census tracts. The boxplot shows that most suburbs have crime rates below 10, with a small number of tracts exhibiting much higher crime levels.
Property tax rates range from 187 to 711. While the boxplot does not show clear outliers for tax rates, the median tax rate is approximately 350, suggesting moderate variation across suburbs.
sum(Boston$chas)
## [1] 35
There are 35 tracts that bound the Charles River.
median(Boston$ptratio)
## [1] 19.05
The median pupil-teacher ratio is 19.05.
library(mlbench)
data(Soybean)
# ?Soybean
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
colnames(Soybean)[nearZeroVar(Soybean)]
## [1] "leaf.mild" "mycelium" "sclerotia"
par(mfrow = c(2,5))
barplot(table(Soybean$date), main = "date")
barplot(table(Soybean$plant.stand), main = "plant.stand")
barplot(table(Soybean$precip), main = "precip")
barplot(table(Soybean$temp), main = "temp")
barplot(table(Soybean$hail), main = "hail")
barplot(table(Soybean$crop.hist), main = "crop.hist")
barplot(table(Soybean$area.dam), main = "area.dam")
barplot(table(Soybean$sever), main = "sever")
barplot(table(Soybean$seed.tmt), main = "seed.tmt")
barplot(table(Soybean$germ), main = "germ")
par(mfrow = c(2,5))
barplot(table(Soybean$plant.growth), main = "plant.growth")
barplot(table(Soybean$leaves), main = "leaves")
barplot(table(Soybean$leaf.halo), main = "leaf.halo")
barplot(table(Soybean$leaf.marg), main = "leaf.marg")
barplot(table(Soybean$leaf.size), main = "leaf.size")
barplot(table(Soybean$leaf.shread), main = "leaf.shread")
barplot(table(Soybean$leaf.malf), main = "leaf.malf")
barplot(table(Soybean$leaf.mild), main = "leaf.mild")
barplot(table(Soybean$stem), main = "stem")
barplot(table(Soybean$lodging), main = "lodging")
par(mfrow = c(2,5))
barplot(table(Soybean$stem.cankers), main = "stem.cankers")
barplot(table(Soybean$canker.lesion), main = "canker.lesion")
barplot(table(Soybean$fruiting.bodies), main = "fruiting.bodies")
barplot(table(Soybean$ext.decay), main = "ext.decay")
barplot(table(Soybean$mycelium), main = "mycelium")
barplot(table(Soybean$int.discolor), main = "int.discolor")
barplot(table(Soybean$sclerotia), main = "sclerotia")
barplot(table(Soybean$fruit.pods), main = "fruit.pods")
barplot(table(Soybean$fruit.spots), main = "fruit.spots")
barplot(table(Soybean$seed), main = "seed")
par(mfrow = c(2,5))
barplot(table(Soybean$mold.growth), main = "mold.growth")
barplot(table(Soybean$seed.discolor), main = "seed.discolor")
barplot(table(Soybean$seed.size), main = "seed.size")
barplot(table(Soybean$shriveling), main = "shriveling")
barplot(table(Soybean$roots), main = "roots")
Using the nearZeroVar function, three predictors were identified as having degenerate distributions: leaf.mild, mycelium, and sclerotia. Inspection of the barplots suggested that several additional predictors, including leaf.malf, lodging, int.discolor, and shriveling, also exhibit near-degenerate behavior. These variables show little to no variability, with one category dominating the majority of observations. As a result, they provide minimal predictive information and would typically be removed during preprocessing.
library(caret)
data(BloodBrain)
?BloodBrain
str(logBBB)
## num [1:208] 1.08 -0.4 0.22 0.14 0.69 0.44 -0.43 1.38 0.75 0.88 ...
str(bbbDescr)
## 'data.frame': 208 obs. of 134 variables:
## $ tpsa : num 12 49.3 50.5 37.4 37.4 ...
## $ nbasic : int 1 0 1 0 1 1 1 1 1 1 ...
## $ negative : int 0 0 0 0 0 0 0 0 0 0 ...
## $ vsa_hyd : num 167.1 92.6 295.2 319.1 299.7 ...
## $ a_aro : int 0 6 15 15 12 11 6 12 12 6 ...
## $ weight : num 156 151 366 383 326 ...
## $ peoe_vsa.0 : num 76.9 38.2 58.1 62.2 74.8 ...
## $ peoe_vsa.1 : num 43.4 25.5 124.7 124.7 118 ...
## $ peoe_vsa.2 : num 0 0 21.7 13.2 33 ...
## $ peoe_vsa.3 : num 0 8.62 8.62 21.79 0 ...
## $ peoe_vsa.4 : num 0 23.3 17.4 0 0 ...
## $ peoe_vsa.5 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ peoe_vsa.6 : num 17.24 0 8.62 8.62 8.62 ...
## $ peoe_vsa.0.1 : num 18.7 49 83.8 83.8 83.8 ...
## $ peoe_vsa.1.1 : num 43.5 0 49 68.8 36.8 ...
## $ peoe_vsa.2.1 : num 0 0 0 0 0 ...
## $ peoe_vsa.3.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ peoe_vsa.4.1 : num 0 0 5.68 5.68 5.68 ...
## $ peoe_vsa.5.1 : num 0 13.567 2.504 0 0.137 ...
## $ peoe_vsa.6.1 : num 0 7.9 2.64 2.64 2.5 ...
## $ a_acc : int 0 2 2 2 2 2 2 2 0 2 ...
## $ a_acid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ a_base : int 1 0 1 1 1 1 1 1 1 1 ...
## $ vsa_acc : num 0 13.57 8.19 8.19 8.19 ...
## $ vsa_acid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ vsa_base : num 5.68 0 0 0 0 ...
## $ vsa_don : num 5.68 5.68 5.68 5.68 5.68 ...
## $ vsa_other : num 0 28.1 43.6 28.3 19.6 ...
## $ vsa_pol : num 0 13.6 0 0 0 ...
## $ slogp_vsa0 : num 18 25.4 14.1 14.1 14.1 ...
## $ slogp_vsa1 : num 0 23.3 34.8 34.8 34.8 ...
## $ slogp_vsa2 : num 3.98 23.86 0 0 0 ...
## $ slogp_vsa3 : num 0 0 76.2 76.2 76.2 ...
## $ slogp_vsa4 : num 4.41 0 3.19 3.19 3.19 ...
## $ slogp_vsa5 : num 32.9 0 9.51 0 0 ...
## $ slogp_vsa6 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ slogp_vsa7 : num 0 70.6 148.1 144 140.7 ...
## $ slogp_vsa8 : num 113.2 0 75.5 75.5 75.5 ...
## $ slogp_vsa9 : num 33.3 41.3 28.3 55.5 26 ...
## $ smr_vsa0 : num 0 23.86 12.63 3.12 3.12 ...
## $ smr_vsa1 : num 18 25.4 27.8 27.8 27.8 ...
## $ smr_vsa2 : num 4.41 0 0 0 0 ...
## $ smr_vsa3 : num 3.98 5.24 8.43 8.43 8.43 ...
## $ smr_vsa4 : num 0 20.8 29.6 21.4 20.3 ...
## $ smr_vsa5 : num 113.2 70.6 235.1 235.1 234.6 ...
## $ smr_vsa6 : num 0 5.26 76.25 76.25 76.25 ...
## $ smr_vsa7 : num 66.2 33.3 0 31.3 0 ...
## $ tpsa.1 : num 16.6 49.3 51.7 38.6 38.6 ...
## $ logp.o.w. : num 2.948 0.889 4.439 5.254 3.8 ...
## $ frac.anion7. : num 0 0.001 0 0 0 0 0.001 0 0 0 ...
## $ frac.cation7. : num 0.999 0 0.986 0.986 0.986 0.986 0.996 0.946 0.999 0.976 ...
## $ andrewbind : num 3.4 -3.3 12.8 12.8 10.3 10 10.4 15.9 12.9 9.5 ...
## $ rotatablebonds : int 3 2 8 8 8 8 8 7 4 5 ...
## $ mlogp : num 2.5 1.06 4.66 3.82 3.27 ...
## $ clogp : num 2.97 0.494 5.137 5.878 4.367 ...
## $ mw : num 155 151 365 382 325 ...
## $ nocount : int 1 3 5 4 4 4 4 3 2 4 ...
## $ hbdnr : int 1 2 1 1 1 1 2 1 1 0 ...
## $ rule.of.5violations : int 0 0 1 1 0 0 0 0 1 0 ...
## $ alert : int 0 0 0 0 0 0 0 0 0 0 ...
## $ prx : int 0 1 6 2 2 2 1 0 0 4 ...
## $ ub : num 0 3 5.3 5.3 4.2 3.6 3 4.7 4.2 3 ...
## $ pol : int 0 2 3 3 2 2 2 3 4 1 ...
## $ inthb : int 0 0 0 0 0 0 1 0 0 0 ...
## $ adistm : num 0 395 1365 703 746 ...
## $ adistd : num 0 10.9 25.7 10 10.6 ...
## $ polar_area : num 21.1 117.4 82.1 65.1 66.2 ...
## $ nonpolar_area : num 379 248 638 668 602 ...
## $ psa_npsa : num 0.0557 0.4743 0.1287 0.0974 0.11 ...
## $ tcsa : num 0.0097 0.0134 0.0111 0.0108 0.0118 0.0111 0.0123 0.0099 0.0106 0.0115 ...
## $ tcpa : num 0.1842 0.0417 0.0972 0.1218 0.1186 ...
## $ tcnp : num 0.0103 0.0198 0.0125 0.0119 0.013 0.0125 0.0162 0.011 0.0109 0.0122 ...
## $ ovality : num 1.1 1.12 1.3 1.3 1.27 ...
## $ surface_area : num 400 365 720 733 668 ...
## $ volume : num 656 555 1224 1257 1133 ...
## $ most_negative_charge: num -0.617 -0.84 -0.801 -0.761 -0.857 ...
## $ most_positive_charge: num 0.307 0.497 0.541 0.48 0.455 ...
## $ sum_absolute_charge : num 3.89 4.89 7.98 7.93 7.85 ...
## $ dipole_moment : num 1.19 4.21 3.52 3.15 3.27 ...
## $ homo : num -9.67 -8.96 -8.63 -8.56 -8.67 ...
## $ lumo : num 3.4038 0.1942 0.0589 -0.2651 0.3149 ...
## $ hardness : num 6.54 4.58 4.34 4.15 4.49 ...
## $ ppsa1 : num 349 223 518 508 509 ...
## $ ppsa2 : num 679 546 2066 2013 1999 ...
## $ ppsa3 : num 31 42.3 64 61.7 61.6 ...
## $ pnsa1 : num 51.1 141.8 202 225.4 158.8 ...
## $ pnsa2 : num -99.3 -346.9 -805.9 -894 -623.3 ...
## $ pnsa3 : num -10.5 -44 -43.8 -42 -39.8 ...
## $ fpsa1 : num 0.872 0.611 0.719 0.693 0.762 ...
## $ fpsa2 : num 1.7 1.5 2.87 2.75 2.99 ...
## $ fpsa3 : num 0.0774 0.1159 0.0888 0.0842 0.0922 ...
## $ fnsa1 : num 0.128 0.389 0.281 0.307 0.238 ...
## $ fnsa2 : num -0.248 -0.951 -1.12 -1.22 -0.933 ...
## $ fnsa3 : num -0.0262 -0.1207 -0.0608 -0.0573 -0.0596 ...
## $ wpsa1 : num 139.7 81.4 372.7 372.1 340.1 ...
## $ wpsa2 : num 272 199 1487 1476 1335 ...
## $ wpsa3 : num 12.4 15.4 46 45.2 41.1 ...
## $ wnsa1 : num 20.4 51.8 145.4 165.3 106 ...
## $ wnsa2 : num -39.8 -126.6 -580.1 -655.3 -416.3 ...
## [list output truncated]
colnames(bbbDescr)[nearZeroVar(bbbDescr)]
## [1] "negative" "peoe_vsa.2.1" "peoe_vsa.3.1" "a_acid" "vsa_acid"
## [6] "frac.anion7." "alert"
Yes - variables “negative”, “peoe_vsa.2.1”, “peoe_vsa.3.1”, “a_acid”, “vsa_acid”, “frac.anion7.”, and “alert” were identified to have near-zero variances and therefore are degenerate distributions.
# Removing degenerate distributions
bbbDescr1 = bbbDescr[,-nearZeroVar(bbbDescr)]
nearZeroVar(bbbDescr1)
## integer(0)
# Correlation matrix
bbbCorr = cor(bbbDescr1)
library(corrplot)
## corrplot 0.95 loaded
corrplot(bbbCorr, order = "hclust", tl.cex = 0.35)
# Identify highly correlated predictors
highCorr = findCorrelation(bbbCorr, 0.75)
highCorr
## [1] 22 27 32 35 40 50 51 52 61 65 66 67 68 76 77 78 80 81 83
## [20] 86 87 88 89 90 91 92 93 94 95 96 101 103 104 106 107 108 109 110
## [39] 112 116 117 118 119 120 121 3 1 19 20 28 5 44 43 4 18 58 60
## [58] 62 73 71 82 84 102 113 105
# Remove highly correlated predictors
bbbCorr1 = cor(bbbDescr1[,-highCorr])
corrplot(bbbCorr1, order = "hclust", tl.cex = 0.35)
The correlation matrix showed that there were strong relationships among the predictor data. Predictors with near-zero variance were removed first. Then, using a correlation cutoff of 0.75 with findCorrelation(), more than half of the remaining predictors were identified as highly correlated and were subsequently removed. The reduction had a dramatic effect on the number of predictors available for modeling, from 127 to 62 (51% decrease).
library(caret)
data(oil)
head(fattyAcids)
## Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic
## 1 9.7 5.2 31.0 52.7 0.4 0.4 0.1
## 2 11.1 5.0 32.9 49.8 0.3 0.4 0.1
## 3 11.5 5.2 35.0 47.2 0.2 0.4 0.1
## 4 10.0 4.8 30.4 53.5 0.3 0.4 0.1
## 5 12.2 5.0 31.1 50.5 0.3 0.4 0.1
## 6 9.8 4.2 43.0 39.2 2.4 0.4 0.5
table(oilType)
## oilType
## A B C D E F G
## 37 26 3 7 11 10 2
samp = sample(oilType, 60)
par(mfrow = c(1,2))
plot(oilType)
plot(samp)
The sample function produces a random sample of 60 oils whose frequencies approximate the original distribution quite well. Repeating the sampling shows that smaller oil types fluctuate more, but across many repetitions the mean proportions closely match the original distribution.
part = createDataPartition(oilType, p = 60/96, list = FALSE)
partSamp = oilType[part]
par(mfrow = c(1,2))
plot(oilType)
plot(partSamp)
The stratified sampling produces a distribution that is much closer to the original proportions of each oil type compared to completely random sampling.
Using a test set is not ideal because removing data for testing makes the training set size too small, which then makes the performance estimates highly variable. Better options include k-fold cross-validation, leave-one-out cross-validation, or bootstrap.
binom.test(16, 20)
##
## Exact binomial test
##
## data: 16 and 20
## number of successes = 16, number of trials = 20, p-value = 0.01182
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.563386 0.942666
## sample estimates:
## probability of success
## 0.8
binom.test(10, 20)
##
## Exact binomial test
##
## data: 10 and 20
## number of successes = 10, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.2719578 0.7280422
## sample estimates:
## probability of success
## 0.5
binom.test(49, 50)
##
## Exact binomial test
##
## data: 49 and 50
## number of successes = 49, number of trials = 50, p-value = 9.059e-14
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.8935305 0.9994938
## sample estimates:
## probability of success
## 0.98
The bias–variance tradeoff refers to the relationship between model complexity and prediction error. Bias refers to error caused by underfitting, while variance refers to error caused by overfitting. As the model complexity increases, bias decreases but variance increases, hence the “tradeoff”. The total prediction error is therefore minimized at an intermediate level of complexity. The drawn plot is included on Canvas.