Question 1

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.

We would generally expect the performance of a flexible statistical learning method to be better with a large sample size, n and small number of predictors, p because flexible methods can capture more patterns without over-fitting. The larger n, there is sufficient data to avoid high variance.

(b) The number of predictors p is extremely large, and the number of observations n is small.

We would generally expect the performance of a inflexible statistical learning method to be better because a flexible model would over fit the small number of observations and cause a error, or noise in the model. A inflexible method can add more structure and avoids over-fitting given more predictors.

(c) The relationship between the predictors and response is highly non-linear.

We would generally expect the performance of a flexible statistical learning method to be better because it can capture more complex, non-linear relationships.

(d) The variance of the error terms, i.e. σ2 = Var(ε), is extremely high.

We would generally expect the performance of a inflexible statistical learning method to be better because the flexible method would fit the variance the noise and increase its variance.

Question 2

(a) Bias-Variance Decomposition Sketch

(b) Explain why each of the five curves has the shape displayed in part

\[ More flexibility \Leftrightarrow Higher Variance \Leftrightarrow Lower Bias \] The Bias/Variance Trade-off is balance need to be found when building models. Bias comes from building the model too simple with not enough complexities. Variance is when the model is complex and is sensitive to noise in the training data. The goal is to find the “sweet spot” and minimize the total prediction error.

Starting with the squared bias, it monotonically decreases with an increase of flexibility due to the decrease in bias and ability to be described in a relationship. The bias in the trade-off estimator is referred to the error(distance) between an estimated model to the true model.

Variance, on the hand, increases with flexibility which represents how disperse the estimated model is spread. The more flexible the model, the more variance it would have, as well as increasing the risk to over-estimation.

One criterion, which measures the model’s accuracy, or its performance by optimizing (by minimizing or maximizing) is the Mean Square Error, MSE. \[ MSE = Var(\hat{\theta})+Bias(\hat{\theta})^{2} \] The more flexible method has lower \(MSE_{Tr}\). MSE is always non-negative. The MSE will be small if the predictor responses are close to the true responses. The MSE is computed by using training data fitted for the model, or \(MSE_{Tr}\). The accuracy of the prediction when our models is applied to new data, or test data is more important than measuring the quality of fit.

The testing MSE, \(MSE_{Te}\) The average training MSE as a function of flexibility, or degrees of freedom. \(MSE_{Te}\) declines monotonically as flexibility increasing, giving a “U”-shape.

The horizontal line is the Var(\(\epsilon\)), irreducible error, corresponding to the lowest achieveble test MSE among all methods. Thus, the smoothing spline, or area close to Var(\(\epsilon\)) is optimal. Variability associated within \(\epsilon\), is affected by our predictions.

Question 3

This exercise involves the Boston housing data set.

(a) To begin, load in the Boston data set. The Boston data set is part of the ISLR2 library. Now the data set is contained in the object Boston. Read about the data set: How many rows are in this data set? How many columns? What do the rows and columns represent?

506 observations and 13 variables in the Boston data set contain housing values in 506 suburbs of Boston.

library(ISLR2)
nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 13
str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston$chas <- as.numeric(Boston$chas)
Boston$rad <- as.numeric(Boston$rad)

(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : num  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
pairs(Boston[,1:10])

(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship. crim vs age: the more per capita rate by town, the higher the proportion of owner-occupied units prior to 1940 crim vsdis`: more per capita crime rae is followed by an closer mean of distance of the Boston employment centres

(d) Do any of the census tracts of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

plot(Boston$age, Boston$crim) #more crimes for older homes

plot(Boston$rm, Boston$crim) #around 5-8 rooms per dwelling has the most crimes 

plot(Boston$tax, Boston$crim) # full-value taxes near 200-400 have the most crime, and a spike around 650

plot(Boston$ptratio, Boston$crim) # the higher the crime, the higher the pupil:teacher ratio

(e) How many of the census tracts in this data set bound the Charles river? There are 35 suburbs set bound the Charles river.

sum(Boston$chas)
## [1] 35

(f) What is the median pupil-teacher ratio among the towns in this data set?

median(Boston$ptratio)
## [1] 19.05

(g) Which census tract of Boston has lowest median value of owner occupied homes? What are the values of the other predictors for that census tract, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

lowest_median <- t(subset(Boston, medv == min(Boston$medv)))
lowest_median
##              399      406
## crim     38.3518  67.9208
## zn        0.0000   0.0000
## indus    18.1000  18.1000
## chas      0.0000   0.0000
## nox       0.6930   0.6930
## rm        5.4530   5.6830
## age     100.0000 100.0000
## dis       1.4896   1.4254
## rad      24.0000  24.0000
## tax     666.0000 666.0000
## ptratio  20.2000  20.2000
## lstat    30.5900  22.9800
## medv      5.0000   5.0000

(h) In this data set, how many of the census tracts average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the census tracts that average more than eight rooms per dwelling.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
nrow(filter(Boston, rm>=7)) 
## [1] 64
nrow(filter(Boston, rm>=8)) 
## [1] 13

Question 4

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

## corrplot 0.92 loaded
## Loading required package: lattice
##   Type variable   value
## 1    1       RI 1.52101
## 2    1       RI 1.51761
## 3    1       RI 1.51618
## 4    1       RI 1.51766
## 5    1       RI 1.51742
## 6    1       RI 1.51596
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe         
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.05701  
##  3rd Qu.:0.10000  
##  Max.   :0.51000

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Al, Ba, Cam Fe, K RI are all right skewed. Most plots within the boxplots have outliers especially in RI, Ca, Ba, K, Fe. RI, Na, Al, K, Ca, Ba and Fe are all positively skewed, and Si and Mg are negatively skewed, or left skewed. Potassium is the highest right-skewed. Within the boxplots, you can also see K has the largest difference from the outlier.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Based off principal component accounts, PCA, the variables Fe, B, K, Al, Mg all had transformations approximate to beforehand. After the transformation, Ca and Si are less skewed and are more normally distributed. Whereas Na transformed from slighted skewed to the left to skewed to the right.

#skewness
apply(Glass[, -10], 2, skewness)
##         RI         Na         Mg         Al         Si          K         Ca 
##  1.6027151  0.4478343 -1.1364523  0.8946104 -0.7202392  6.4600889  2.0184463 
##         Ba         Fe 
##  3.3686800  1.7298107
par(mfrow=c(4,2), mar = c(2, 2, 2, 0.5))
for (col in 2:ncol(glassData)) {
  hist(Glass[,col],main =   colnames(glassData)[col], xlab = colnames(glassData)[col])
  t = BoxCoxTrans( as.numeric( Glass[,col]))
  transformed  = predict(t, Glass[,col])
  hist(transformed, main =  paste("After transformation ", colnames(glassData)[col]), xlab = colnames(glassData)[col] )
}

Question 5

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data("Soybean")
str("Soybean")
##  chr "Soybean"

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

The histogram helps determine the frequency distributions for the soybean’s categorical predictors. Most of the data set has very low frequencies causing the histogram to be one sided. Mycelium, sclerotia, seed.size, shriveling, lodging, and leaf.malf all have near zero variance.

data(Soybean)
SoybeanData <- Soybean[-1]
par(mfrow=c(4,2), mar = c(2, 2, 2, 0.5)) #adjusted margins to fit
for (col in 2:ncol(Soybean)) {hist( as.numeric(Soybean[,col]),
      main = colnames(Soybean)[col],xlab = colnames(Soybean)[col])}

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Date, area.dam, and leaves all have less than 1% of their data missing. Factors from the plant or environment are factors in the missing information. Phytophthora-rot, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, and herbicide-injury are all missing data by class.

##missingdata
segTrainTrans1 <- model.matrix(~ . - 1, data = SoybeanData)
missing_percentages <- colMeans(is.na(SoybeanData)) * 100
missing_percentages
##            date     plant.stand          precip            temp            hail 
##       0.1464129       5.2708638       5.5636896       4.3923865      17.7159590 
##       crop.hist        area.dam           sever        seed.tmt            germ 
##       2.3426061       0.1464129      17.7159590      17.7159590      16.3982430 
##    plant.growth          leaves       leaf.halo       leaf.marg       leaf.size 
##       2.3426061       0.0000000      12.2986823      12.2986823      12.2986823 
##     leaf.shread       leaf.malf       leaf.mild            stem         lodging 
##      14.6412884      12.2986823      15.8125915       2.3426061      17.7159590 
##    stem.cankers   canker.lesion fruiting.bodies       ext.decay        mycelium 
##       5.5636896       5.5636896      15.5197657       5.5636896       5.5636896 
##    int.discolor       sclerotia      fruit.pods     fruit.spots            seed 
##       5.5636896       5.5636896      12.2986823      15.5197657      13.4699854 
##     mold.growth   seed.discolor       seed.size      shriveling           roots 
##      13.4699854      15.5197657      13.4699854      15.5197657       4.5387994
##missingdata
segTrainTrans1 <- model.matrix(~ . - 1, data = SoybeanData)

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

In the above model, we can notice the significance of dark red colors appearing more often to show a strong negative correlation in the first model. After removing the predictors, we can observe a decrease. The cell segmentation correlation matrix is one strategy of handling the missing data by removing highly correlated predictors. It will allow the performance of the model to be more parsimonious and interpretable model. It also stabilizes the data without the problematic variables. The strategy, or algorithm to remove the highly correlated predictors include:

Calculate the correlation matrix of the predictors. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B). Determine the average correlation between A and the other variables. Do the same for predictor B. If A has a larger average correlation, remove it; otherwise, remove predictor B. Repeat Steps 2–4 until no absolute correlations are above the threshold.

Another strategy are using extraction methods (e.g., principal components) which migrates the effect of strong correlations between predictors.

# Remove near zero variance predictors and get correlation matrix
nzvCols <- nearZeroVar(segTrainTrans1)
segTrainTrans1 <- segTrainTrans1[, -nzvCols]
segCorr <- cor(segTrainTrans1)
corrplot(segCorr, order = "hclust", tl.cex = 0.35)

# Identify columns to remove based on high correlation
highCorr <- findCorrelation(segCorr, cutoff = 0.75)
highCorr
## [1] 47 29 44 28
segCorr1 <-cor(segTrainTrans1[,-highCorr]) 
corrplot(segCorr1, order = "hclust", tl.cex= .35)