Homework Assignment: Statistical Pattern Recognition

Author

Aidan Perkins

Instructions

Modify this quarto notebook, convert it to PDF using typst or to HTML and submit. Total is 70 points

Rubric:

Textual answers will be graded holistically. 5 points for an accurate and thoughtful answer, 4 points for minor errors, 3 points for a reasonable attempt with minor conceptual issues, 2 or less otherwise
Coding answers: 5 points for correct results, 4 points for correct approach, wrong answer, 3 points for an attempt which is mostly complete but got stuck in the end, 2 or less otherwise

Question 1: Conceptual – Supervised vs Unsupervised Learning (10 points)

(a)Explain the difference between supervised and unsupervised learning in the context of statistical pattern recognition.

For supervised learning you use data that is already labeled, meaning you know what the true values of your observations are. If you input two variables, says home square footage, and number of bedrooms, the model will try and predict the value of the home. You “train” your model on labelled data so it understands the pattern at play which will influence home price. After training, you then use part of your original data set which the model has NOT seen in order to test the accuracy of it.

Unsupervised learning requires training the model on a data set where no labels exist. The goal of unsupervised learning is more so about uncovering hidden patterns / clusters in the data and assigning them labels. The goal is to find these patterns and not to predict predefined labels, as they do not exist.

(b) Provide one real-world data example from environmental or political science where supervised methods would be more appropriate, and one where unsupervised methods would be more useful. Justify your choices. Give examples not given in class 😃.

A problem I worked on this past summer at the British Columbia CDC was on harmful algal blooms and if we could get better at providing advanced warning to places before they got too severe. This would be a great use case for supervised learning as we had a whole bunch of historical, labelled data. Our input would include: water temperature, chlorophyll-a levels, nutrient availability etc. and the output would be “bloom” or “no bloom”. Since we had a large amount of labelled data we can learn from what has happened previously in order to better predict what the outcome will be in the future, given specific lake characteristics.

An unsupervised example would be coming up with new biomes or ecoregions as the climate continues to change. We can use species data, vegetation type, rainfall, temperature, and many more variables to create clusters of new ecoregions. This would be unsupervised as there is no labelled data saying “this area belongs to this new ecoregion”, we let the model guide the clustering based on patterns it identifies.

Sidenote: Our last problem of this homework was unsupervised and we hid the species labels. How would we test this model on data that is not labelled?

Question 2: Conceptual – Bias–Variance Tradeoff (15 points)

Consider the bias–variance tradeoff in the performance of pattern recognition models.

(a) Define bias and variance in this context. (5 points)

Bias would be how far a model’s average predictions are from true value. High bias models would miss important patterns and may be too simple for the task (underfitting)

Variance would be how much the model’s predictions change when it sees new data. A high variance model is very sensitive to small changes in the new data it is seeing (overfitting). This model is likely to perform very well on training data but fall apart when new data is introduced (too sensitive to change)

(b) Explain how increasing the complexity of a model (e.g., adding more predictors or using a flexible method like random forests) may affect bias and variance. (5 points)

If we increase the complexity of a model we would expect for bias to decrease, but variance would increase. These complex models are very good at picking up on patterns in the data so their predictions are close to the real relationship. The issue is they are also very sensitive to noise in the data. This means you predictions may be vastly different when testing the model on never before seen data.

(c) Simple models tend to be high-bias models, but have the advantage of not overfitting the training data. Give one scenario where a high-bias model might actually be preferable. (5 points)

A high bias model might be preferred if you are working with a small or very noisy dataset. In this instance a simpler model (say linear regression) might be well suited for the task as it will identify the trend and work better on new data, even if it does not capture complex patterns in the data.

Question 3: Data-Driven (Supervised Classification) (20 points)

You are given a dataset of air quality measurements (airquality in R). Suppose you want to build a supervised model to predict whether ozone concentration is above or below the median.

Split the data into training (70%) and test (30%) sets. (2 points)
Create a new binary variable HighOzone = 1 if Ozone > median(Ozone, na.rm=TRUE), else 0. (3 points)
Fit a logistic regression using Solar.R, Wind, and Temp as predictors. (5 points)
Report the accuracy on the test set. (5 points)
Explain why creating a dichotomous outcome variable with the median might not be desirable. What might be a possble method that might be preferred? (Extra credit, 5 points)

Some sample code to get you started:

# Install and load necessary libraries. Make sure you install the packages separately from this document, 
# so that you can run this
library(tidyverse)
library(tidymodels)
library(janitor)
library(datasets)

# Load data and remove missing values
data("airquality")
aq <- drop_na(airquality) # complete cases only

# Create binary outcome variable

median_ozone <- median(aq$Ozone)
aq$HighOzone <- factor(ifelse(aq$Ozone > median_ozone , 1, 0))

# Split into training and test sets
set.seed(123)
splits<- initial_split(aq, prop = 0.70)
train <- analysis(splits) # the training data
test <- assessment(splits) # the test/evaluation data

# Fit logistic regression model
model <- glm(HighOzone ~ Solar.R + Wind + Temp,  data = train, family = "binomial")

# Predict and compute accuracy
test_performance <- augment(model, newdata = test, type.predict = "response")

## Create a new variable for predicted class based on a threshold of 0.5
# Look at names(test_performance) before to find the right variable name

test_performance <- mutate(test_performance, 
                           pred_class = factor(ifelse(.fitted > 0.5, 1, 0)),
                           HighOzone = as.factor(HighOzone))

# Look at the documentation for accuracy (?yardstick::accuracy) to figure out how to calculate accuracy

accuracy <- accuracy(test_performance, truth = HighOzone, estimate = pred_class)

# Print accuracy
print(accuracy)

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.735

Explain why creating a dichotomous outcome variable with the median might not be desirable. What might be a possible method that might be preferred? (Extra credit, 5 points)

Creating a binary above or below the median value is relatively arbitrary. The median split ignores the actual distribution of the data and their health impacts. Two observations, can be on either side of the median (30,31) and will be treated the same as if they were on complete opposite ends of the spectrum. Instead we could use a meaningful value, such as the level of ozone a group like the EPA would deem as unhealthy.

For binary classification using logistic regression, you actually get a probability score (between 0 and 1) for each observation. We will often try to figure out the best threshold for converting these probabilities into class labels (0 or 1). A common choice is 0.5, but this may not always be optimal depending on the context and costs of misclassification. We will often use ROC curves and AUC to evaluate classification models.

Implementing ROC curves (5 points)

You can read more about ROC curves and AUC here: https://www.statology.org/interpret-roc-curve

# Compute the ROC curve

roc <- roc_curve(test_performance, truth = HighOzone, .fitted, event_level = 'second')

head(roc) # this is a data.frame

# A tibble: 6 × 3
   .threshold specificity sensitivity
        <dbl>       <dbl>       <dbl>
1 -Inf             0            1    
2    0.000193      0            1    
3    0.000523      0.0588       1    
4    0.000768      0.118        1    
5    0.0142        0.118        0.941
6    0.0184        0.176        0.941

# small type said autplot
autoplot(roc)

Now, reproduce this graph using ggplot2

library(cowplot)
ggplot(roc, aes(x = 1-specificity, y = sensitivity))+
  geom_step()+
  geom_abline(slope = 1, linetype = "dashed")+
  labs(
    x = "False Positive Rate (1 - Specificity)",
    y = "True Positive Rate (Sensitivity)"
  ) +
  theme_minimal()

Question 4: Unsupervised Clustering (25 points)

Using the built-in iris dataset:

Perform a k-means clustering with k = 3 using the numeric features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width). (5 points)

data(iris)

dat <- iris |> select(-Species) # remove the Species column
model <- kmeans(dat, centers = 3)

Compare the resulting cluster assignments with the actual species labels. (5 points)

model_performance <- augment(model, data = iris)
table(model_performance$.cluster, model_performance$Species)

   
    setosa versicolor virginica
  1      0         48        14
  2     50          0         0
  3      0          2        36

Assign each cluster assignment to the species for which that cluster assignment is most common (5 points)

model_performance <- mutate(model_performance, .cluster = case_when(
    .cluster == 2 ~ "setosa",
    .cluster == 1 ~ "versicolor",
    .cluster == 3 ~ "virginica"
))

Calculate and report the misclassification rate, i.e. the proportion of observations where the true species and the cluster assignment don’t match (5 points)

summarise(model_performance, missclass_rate = mean(.cluster != Species))

# A tibble: 1 × 1
  missclass_rate
           <dbl>
1          0.107

Briefly interpret whether k-means was effective in uncovering the underlying patterns without supervision. (5 points)

Yes, k-means was fairly effective at uncovering underlying patters without supervision. Our model had a 10% misclassification rate which is not horrible, but could use improvement. It was highly accurate with setosa, with zero misclassifications. For versicolor it misclassificed two observations. The model struggled the most with virginica which had 14 misclassifications.