For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) The sample size n is extremely large, and the number of predictors p is small.
We would generally expect the performance of a flexible statistical learning method to be better with a large sample size, n and small number of predictors, p because flexible methods can capture more patterns without over-fitting. The larger n, there is sufficient data to avoid high variance.
(b) The number of predictors p is extremely large, and the number of observations n is small.
We would generally expect the performance of a inflexible statistical learning method to be better because a flexible model would over fit the small number of observations and cause a error, or noise in the model. A inflexible method can add more structure and avoids over-fitting given more predictors.
(c) The relationship between the predictors and response is highly non-linear.
We would generally expect the performance of a flexible statistical learning method to be better because it can capture more complex, non-linear relationships.
(d) The variance of the error terms, i.e. σ2 = Var(ε), is extremely high.
We would generally expect the performance of a inflexible statistical learning method to be better because the flexible method would fit the variance the noise and increase its variance.
(a) Bias-Variance Decomposition Sketch
(b) Explain why each of the five curves has the shape displayed in part
\[ More flexibility \Leftrightarrow Higher Variance \Leftrightarrow Lower Bias \] The Bias/Variance Trade-off is balance need to be found when building models. Bias comes from building the model too simple with not enough complexities. Variance is when the model is complex and is sensitive to noise in the training data. The goal is to find the “sweet spot” and minimize the total prediction error.
Starting with the squared bias, it monotonically decreases with an increase of flexibility due to the decrease in bias and ability to be described in a relationship. The bias in the trade-off estimator is referred to the error(distance) between an estimated model to the true model.
Variance, on the hand, increases with flexibility which represents how disperse the estimated model is spread. The more flexible the model, the more variance it would have, as well as increasing the risk to over-estimation.
One criterion, which measures the model’s accuracy, or its performance by optimizing (by minimizing or maximizing) is the Mean Square Error, MSE. \[ MSE = Var(\hat{\theta})+Bias(\hat{\theta})^{2} \] The more flexible method has lower \(MSE_{Tr}\). MSE is always non-negative. The MSE will be small if the predictor responses are close to the true responses. The MSE is computed by using training data fitted for the model, or \(MSE_{Tr}\). The accuracy of the prediction when our models is applied to new data, or test data is more important than measuring the quality of fit.
The testing MSE, \(MSE_{Te}\) The average training MSE as a function of flexibility, or degrees of freedom. \(MSE_{Te}\) declines monotonically as flexibility increasing, giving a “U”-shape.
The horizontal line is the Var(\(\epsilon\)), irreducible error, corresponding to the lowest achieveble test MSE among all methods. Thus, the smoothing spline, or area close to Var(\(\epsilon\)) is optimal. Variability associated within \(\epsilon\), is affected by our predictions.
This exercise involves the Boston housing data set.
(a) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. Now the data set is contained in the object Boston. Read about the data set: How many rows are in this data set? How many columns? What do the rows and columns represent?
506 observations and 13 variables in the Boston data set contain housing values in 506 suburbs of Boston.
library(ISLR2)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 13
str(Boston)
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston$chas <- as.numeric(Boston$chas)
Boston$rad <- as.numeric(Boston$rad)
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
str(Boston)
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : num 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
pairs(Boston[,1:10])
(c) Are any of the predictors associated with per capita
crime rate? If so, explain the relationship. crim
vs age: the more per capita rate by town, the higher the
proportion of owner-occupied units prior to 1940
crim vsdis`: more per capita crime rae is followed by an
closer mean of distance of the Boston employment centres
(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
plot(Boston$age, Boston$crim) #more crimes for older homes
plot(Boston$rm, Boston$crim) #around 5-8 rooms per dwelling has the most crimes
plot(Boston$tax, Boston$crim) # full-value taxes near 200-400 have the most crime, and a spike around 650
plot(Boston$ptratio, Boston$crim) # the higher the crime, the higher the pupil:teacher ratio
(e) How many of the census tracts in this data set bound the Charles river? There are 35 suburbs set bound the Charles river.
sum(Boston$chas)
## [1] 35
(f) What is the median pupil-teacher ratio among the towns in this data set?
median(Boston$ptratio)
## [1] 19.05
(g) Which census tract of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
lowest_median <- t(subset(Boston, medv == min(Boston$medv)))
lowest_median
## 399 406
## crim 38.3518 67.9208
## zn 0.0000 0.0000
## indus 18.1000 18.1000
## chas 0.0000 0.0000
## nox 0.6930 0.6930
## rm 5.4530 5.6830
## age 100.0000 100.0000
## dis 1.4896 1.4254
## rad 24.0000 24.0000
## tax 666.0000 666.0000
## ptratio 20.2000 20.2000
## lstat 30.5900 22.9800
## medv 5.0000 5.0000
(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
nrow(filter(Boston, rm>=7))
## [1] 64
nrow(filter(Boston, rm>=8))
## [1] 13
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
## corrplot 0.92 loaded
## Loading required package: lattice
## Type variable value
## 1 1 RI 1.52101
## 2 1 RI 1.51761
## 3 1 RI 1.51618
## 4 1 RI 1.51766
## 5 1 RI 1.51742
## 6 1 RI 1.51596
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.05701
## 3rd Qu.:0.10000
## Max. :0.51000
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
Al, Ba, Cam Fe, K RI are all right skewed. Most plots within the boxplots have outliers especially in RI, Ca, Ba, K, Fe. RI, Na, Al, K, Ca, Ba and Fe are all positively skewed, and Si and Mg are negatively skewed, or left skewed. Potassium is the highest right-skewed. Within the boxplots, you can also see K has the largest difference from the outlier.
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
Based off principal component accounts, PCA, the variables Fe, B, K, Al, Mg all had transformations approximate to beforehand. After the transformation, Ca and Si are less skewed and are more normally distributed. Whereas Na transformed from slighted skewed to the left to skewed to the right.
#skewness
apply(Glass[, -10], 2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
par(mfrow=c(4,2), mar = c(2, 2, 2, 0.5))
for (col in 2:ncol(glassData)) {
hist(Glass[,col],main = colnames(glassData)[col], xlab = colnames(glassData)[col])
t = BoxCoxTrans( as.numeric( Glass[,col]))
transformed = predict(t, Glass[,col])
hist(transformed, main = paste("After transformation ", colnames(glassData)[col]), xlab = colnames(glassData)[col] )
}
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
library(mlbench)
data("Soybean")
str("Soybean")
## chr "Soybean"
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
The histogram helps determine the frequency distributions for the soybean’s categorical predictors. Most of the data set has very low frequencies causing the histogram to be one sided. Mycelium, sclerotia, seed.size, shriveling, lodging, and leaf.malf all have near zero variance.
data(Soybean)
SoybeanData <- Soybean[-1]
par(mfrow=c(4,2), mar = c(2, 2, 2, 0.5)) #adjusted margins to fit
for (col in 2:ncol(Soybean)) {hist( as.numeric(Soybean[,col]),
main = colnames(Soybean)[col],xlab = colnames(Soybean)[col])}
(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
Date, area.dam, and leaves all have less than 1% of their data missing. Factors from the plant or environment are factors in the missing information. Phytophthora-rot, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, and herbicide-injury are all missing data by class.
##missingdata
segTrainTrans1 <- model.matrix(~ . - 1, data = SoybeanData)
missing_percentages <- colMeans(is.na(SoybeanData)) * 100
missing_percentages
## date plant.stand precip temp hail
## 0.1464129 5.2708638 5.5636896 4.3923865 17.7159590
## crop.hist area.dam sever seed.tmt germ
## 2.3426061 0.1464129 17.7159590 17.7159590 16.3982430
## plant.growth leaves leaf.halo leaf.marg leaf.size
## 2.3426061 0.0000000 12.2986823 12.2986823 12.2986823
## leaf.shread leaf.malf leaf.mild stem lodging
## 14.6412884 12.2986823 15.8125915 2.3426061 17.7159590
## stem.cankers canker.lesion fruiting.bodies ext.decay mycelium
## 5.5636896 5.5636896 15.5197657 5.5636896 5.5636896
## int.discolor sclerotia fruit.pods fruit.spots seed
## 5.5636896 5.5636896 12.2986823 15.5197657 13.4699854
## mold.growth seed.discolor seed.size shriveling roots
## 13.4699854 15.5197657 13.4699854 15.5197657 4.5387994
##missingdata
segTrainTrans1 <- model.matrix(~ . - 1, data = SoybeanData)
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
In the above model, we can notice the significance of dark red colors appearing more often to show a strong negative correlation in the first model. After removing the predictors, we can observe a decrease. The cell segmentation correlation matrix is one strategy of handling the missing data by removing highly correlated predictors. It will allow the performance of the model to be more parsimonious and interpretable model. It also stabilizes the data without the problematic variables. The strategy, or algorithm to remove the highly correlated predictors include:
Calculate the correlation matrix of the predictors. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B). Determine the average correlation between A and the other variables. Do the same for predictor B. If A has a larger average correlation, remove it; otherwise, remove predictor B. Repeat Steps 2–4 until no absolute correlations are above the threshold.
Another strategy are using extraction methods (e.g., principal components) which migrates the effect of strong correlations between predictors.
# Remove near zero variance predictors and get correlation matrix
nzvCols <- nearZeroVar(segTrainTrans1)
segTrainTrans1 <- segTrainTrans1[, -nzvCols]
segCorr <- cor(segTrainTrans1)
corrplot(segCorr, order = "hclust", tl.cex = 0.35)
# Identify columns to remove based on high correlation
highCorr <- findCorrelation(segCorr, cutoff = 0.75)
highCorr
## [1] 47 29 44 28
segCorr1 <-cor(segTrainTrans1[,-highCorr])
corrplot(segCorr1, order = "hclust", tl.cex= .35)