STAT270 - Final Project

1 Instructions

Complete Task 1 (required, 60 pts) and one of Task 2 or Task 3 (your choice, 40 pts each).

Task	Points	Topic
Task 1	60 pts	Supervised classification (k-NN, Naive Bayes, Logistic Regression)
Task 2	40 pts	Hierarchical clustering — vehicle fuel economy
Task 3	40 pts	K-means clustering — geospatial world cities

Extra Credit (5 pts): If your .Rmd file knits successfully without errors, you will receive 5 bonus points. Make sure all your code runs from top to bottom before submitting.

Submission: Upload this .Rmd file and the knitted .html file to the course portal by the deadline.

2 Setup

Install any packages listed below that you do not already have using install.packages("package_name") in the Console before knitting.

# --- Core data wrangling and visualisation ---
library(tidyverse)    # dplyr, ggplot2, tidyr, purrr, etc.

# --- Modelling framework ---
library(tidymodels)   # unified interface: recipes, workflows, yardstick

# --- Classification engines ---
library(discrim)      # naive_Bayes() classifier
library(klaR)         # engine underlying discrim's Naive Bayes
library(kknn)         # engine for k-nearest neighbours

# --- Clustering and spatial ---
library(mclust)       # k-means support functions
library(ape)          # phylogenetic tree plots for dendrograms
library(sp)           # spatial objects and coordinate transforms
library(sf)           # modern simple features spatial data

# --- Data and utilities ---
library(mdsr)         # course datasets, including world_cities
library(readxl)       # read Excel (.xlsx) files
library(janitor)      # clean_names() for tidy column names
library(broom)        # tidy() for model coefficient summaries

3 Task 1 (Total 60 pts): Supervised classification — high-earners in the 1994 US Census

A marketing analyst might be interested in finding factors that can be used to predict whether a potential customer is a high-earner. The 1994 United States Census provides information that can inform such a model, with records from 32,561 adults that include a binary variable indicating whether each person makes greater or less than $50,000 (more than $80,000 today after accounting for inflation). This is our response variable.

3.1 Data preparation

The code below loads and prepares the data. Run it as-is — you do not need to modify it.

Note: If the data is not accessible via the url address, load via downloading it from the RData file provided in the project page: load("census.RData") in the same folder where you saved the project template .Rmd. The line is commented line 96 in the code chunk below. You can uncomment it if you have the RData file.

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

census <- read_csv(
  url,
  col_names = c(
    "age", "workclass", "fnlwgt", "education",
    "education_1", "marital_status", "occupation", "relationship",
    "race", "sex", "capital_gain", "capital_loss", "hours_per_week",
    "native_country", "income"
  )
) %>%
  mutate(
    income     = factor(income),
    # income_ind: binary indicator — 1 = high earner (>50K), 0 = low earner
    income_ind = as.numeric(income == ">50K")
  )


#load("census.RData") # load the data if you have saved it in RData format

# inspect variable names, types, and first few values
glimpse(census)

## Rows: 32,561
## Columns: 16
## $ age            <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,…
## $ workclass      <chr> "State-gov", "Self-emp-not-inc", "Private", "Private", …
## $ fnlwgt         <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2…
## $ education      <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors…
## $ education_1    <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,…
## $ marital_status <chr> "Never-married", "Married-civ-spouse", "Divorced", "Mar…
## $ occupation     <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners",…
## $ relationship   <chr> "Not-in-family", "Husband", "Not-in-family", "Husband",…
## $ race           <chr> "White", "White", "White", "Black", "Black", "White", "…
## $ sex            <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Fe…
## $ capital_gain   <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, …
## $ capital_loss   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hours_per_week <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,…
## $ native_country <chr> "United-States", "United-States", "United-States", "Uni…
## $ income         <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >50K, …
## $ income_ind     <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…

3.2 Part a) (10 pts) — Train / test split

Split the dataset into two pieces at random:

70% of the rows → train (training set)
30% of the rows → test (testing / hold-out set)

Use set.seed(364) to ensure your results are reproducible.

Report: How many records are in the testing set?

Hint: Use initial_split(prop = 0.7), then extract with training() and testing().

set.seed(364)

census_parts <- initial_split(census, prop = 0.7, strata = income)
train <- training(census_parts)
test  <- testing(census_parts)


train <- train %>%
  mutate(across(everything(), ~ {
    if (is.list(.x)) unlist(.x) else .x
  })) %>%
  as_tibble()

test <- test %>%
  mutate(across(everything(), ~ {
    if (is.list(.x)) unlist(.x) else .x
  })) %>%
  as_tibble()

nrow(test)

## [1] 9769

Note: Around 24% of those in the sample make more than $50k. This means the null model — always predicting ≤$50k — achieves about 76% accuracy. Your models should beat this baseline.

The code below computes and prints the proportion of high earners in the training set. Run it after creating train.

pi_bar <- train %>%
  dplyr::count(income) %>%
  mutate(
    n = as.numeric(n),                    # force numeric
    pct = n / sum(n)
  ) %>%
  filter(income == ">50K") %>%
  pull(pct)

print(c("Proportion earning >$50K in training set:", round(pi_bar, 4)))

## [1] "Proportion earning >$50K in training set:"
## [2] "0.2408"

Pro Tip: Always benchmark your predictive models against a reasonable null model.

3.3 Part b) (10 pts) — k-Nearest Neighbours classifier

Use the k-NN algorithm to classify observations as low earner (<=50K) or high earner (>50K).

Requirements:

Select only the quantitative predictors: age, education_1, capital_gain, capital_loss, hours_per_week (k-NN requires numeric inputs).
Use mode = "classification" and neighbors = 1.
Set scale = TRUE so that no single variable dominates due to its units.
Print the confusion matrix and report the accuracy.

Hint: See Programming Exercises Week 13 for the nearest_neighbor() / kknn workflow.

train_q <- train %>%
  dplyr::select(income, age, education_1, capital_gain, capital_loss, hours_per_week) %>%
  mutate(across(where(is.numeric), as.numeric))

library(kknn)

set.seed(364)
knn_model <- kknn(income ~ ., 
                  train = train_q, 
                  test = train_q, 
                  k = 1, 
                  scale = TRUE)

knn_pred <- fitted(knn_model)

# Confusion matrix
conf_mat_knn <- table(Predicted = knn_pred, Actual = train_q$income)
print(conf_mat_knn)

##          Actual
## Predicted <=50K  >50K
##     <=50K 17151  2603
##     >50K    153  2885

# Accuracy
accuracy_knn <- mean(knn_pred == train_q$income)
cat("k-NN Training Accuracy:", round(accuracy_knn, 4), "\n")

## k-NN Training Accuracy: 0.8791

knn_train_pred <- tibble(
  .pred_class = knn_pred,
  income = train_q$income
)

knn_train_pred %>%
  accuracy(truth = income, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.879

3.4 Model formula (shared across parts c and d)

The following formula object is provided for you. It will be used in parts c) and d). Run this chunk as-is.

# reusable formula — includes all categorical and numeric predictors
form <- as.formula(
  "income ~ age + workclass + education + marital_status +
   occupation + relationship + race + sex +
   capital_gain + capital_loss + hours_per_week"
)

form

## income ~ age + workclass + education + marital_status + occupation + 
##     relationship + race + sex + capital_gain + capital_loss + 
##     hours_per_week

3.5 Part c) (10 pts) — Naive Bayes classifier

Build a Naive Bayes classifier using the formula object form defined above. Print the confusion matrix and report the accuracy on the training set.

Hint: See Programming Exercises Week 13 for the naive_Bayes() / klaR workflow.

nb_spec <- naive_Bayes(mode = "classification") %>%
  set_engine("klaR")

nb_wf <- workflow() %>%
  add_model(nb_spec) %>%
  add_formula(form)

nb_fit <- nb_wf %>% fit(data = train)

nb_train_pred <- predict(nb_fit, train) %>%
  bind_cols(train)

nb_conf_mat <- nb_train_pred %>%
  conf_mat(truth = income, estimate = .pred_class)

nb_conf_mat

##           Truth
## Prediction <=50K  >50K
##      <=50K 17001  3748
##      >50K    303  1740

# Accuracy
nb_train_pred %>%
  accuracy(truth = income, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.822

3.6 Part d) (10 pts) — Logistic regression classifier

Use logistic regression to model the probability of being a high earner (income_ind = 1). The binary response variable income_ind was created during data preparation.

Step 1: Inspect the four exploratory plots below to assess which predictors are useful. The code is provided — run it as-is.

# base plot: estimated probability of high income vs age
log_plot <- ggplot(data = census, aes(x = age, y = income_ind)) +
  geom_jitter(alpha = 0.1, height = 0.05) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  ylab("P(income > $50K)")

log_plot + xlab("Age (years)")

log_plot + aes(x = education_1) +
  xlab("Education level (1 = lowest, 16 = highest)")

log_plot + aes(x = sex) +
  xlab("Sex")

log_plot + aes(x = marital_status) +
  xlab("Marital status")

Q: Which variables appear to be important predictors of high income? Briefly explain what each plot tells you.

Your answer here (2–4 sentences):

Step 2: Fit a logistic regression model using glm() with family = "binomial". Use at minimum age, education_1, and sex as predictors. You may optionally include marital_status or other variables. Print a tidy coefficient table with tidy().

Hint: The response variable is income_ind (0/1), not the factor income.

# Logistic model
logit_model <- glm(income_ind ~ age + education_1 + sex + marital_status,
                   family = binomial, data = train)

tidy(logit_model, conf.int = TRUE, exponentiate = TRUE)

## # A tibble: 10 × 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)         0.000451   0.144     -53.6   0         0.000339  0.000596
##  2 age                 1.02       0.00165    14.2   1.14e- 45 1.02      1.03    
##  3 education_1         1.50       0.00874    46.3   0         1.47      1.53    
##  4 sexMale             1.37       0.0527      6.00  1.95e-  9 1.24      1.52    
##  5 marital_statusMarr… 8.96       0.528       4.15  3.28e-  5 2.98     24.5     
##  6 marital_statusMarr… 7.13       0.0705     27.9   6.53e-171 6.22      8.20    
##  7 marital_statusMarr… 0.878      0.223      -0.584 5.59e-  1 0.556     1.34    
##  8 marital_statusNeve… 0.490      0.0880     -8.11  5.06e- 16 0.412     0.582   
##  9 marital_statusSepa… 0.771      0.169      -1.54  1.24e-  1 0.547     1.06    
## 10 marital_statusWido… 0.814      0.160      -1.29  1.97e-  1 0.591     1.11

Step 3: Generate predicted probabilities on the training set, apply a 0.5 cutoff to classify each observation, and report the confusion matrix and accuracy.

# Training evaluation
train_pred_prob <- predict(logit_model, train, type = "response")
train_pred_class <- ifelse(train_pred_prob > 0.5, ">50K", "<=50K")

conf_mat_train <- table(Predicted = train_pred_class, Actual = train$income)
conf_mat_train

##          Actual
## Predicted <=50K  >50K
##     <=50K 16064  2847
##     >50K   1240  2641

# Accuracy
mean(train_pred_class == train$income)

## [1] 0.8206827

3.7 Part e) (10 pts) — Evaluate the logistic model on the test set

Apply the trained logistic regression model to the held-out test set. Report the confusion matrix and test-set accuracy.

Note: The test-set accuracy is the honest estimate of how well the model generalises to new data.

test_pred_prob <- predict(logit_model, test, type = "response")
test_pred_class <- ifelse(test_pred_prob > 0.5, ">50K", "<=50K")

conf_mat_test <- table(Predicted = test_pred_class, Actual = test$income)
conf_mat_test

##          Actual
## Predicted <=50K >50K
##     <=50K  6880 1292
##     >50K    536 1061

# Test accuracy
mean(test_pred_class == test$income)

## [1] 0.8128775

3.8 Part f) (10 pts) — Model comparison

Which of the three classifiers (k-NN, Naive Bayes, logistic regression) achieved the highest test-set accuracy? Why is test-set accuracy a better comparison criterion than training-set accuracy?

Your answer here (3–5 sentences): Test-set accuracy is the honest performance metric because it evaluates on unseen data, revealing generalization ability and avoiding overfitting bias that inflates training accuracy. Logistic regression balances interpretability and performance best here

4 Task 3 (Total 40 pts): K-means clustering — geospatial world cities

K-means assigns each observation to one of K groups by minimising within-cluster distances to centroids. Unlike hierarchical clustering, K must be specified in advance and no dendrogram is produced.

Geospatial example: The world_cities dataset (from mdsr) contains the latitude and longitude of thousands of cities — but no continent or country labels. The question is: can k-means recover continent-level geographic structure from raw coordinates alone?

4.1 Provided example: Cartesian coordinates (run as-is)

# select the 4,000 most populous cities; keep only coordinates
BigCities <- world_cities %>%
  arrange(desc(population)) %>%
  head(4000) %>%
  dplyr::select(longitude, latitude)

glimpse(BigCities)

## Rows: 4,000
## Columns: 2
## $ longitude <dbl> 121.45806, 28.94966, -58.37723, 72.88261, -99.12766, 116.397…
## $ latitude  <dbl> 31.22222, 41.01384, -34.61315, 19.07283, 19.42847, 39.90750,…

Note that the data contain no city names, country labels, or continent information — only coordinates.

set.seed(15)

# assign cluster labels (k = 6)
city_clusts <- BigCities %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# more stable run: try 10 random starts and keep the best solution
km <- kmeans(BigCities, centers = 6, nstart = 10)

# number of cities in each cluster
km$size

## [1] 726 554 984 392 356 988

# mean longitude and latitude of each cluster centre
km$centers

##   longitude  latitude
## 1  75.14407  27.43226
## 2 -94.47442  31.07927
## 3 120.38618  23.72534
## 4 -57.10048 -15.21344
## 5  18.91076  -1.12440
## 6  18.77544  45.37853

# attach cluster labels for plotting
BigCities <- BigCities %>%
  mutate(cluster = city_clusts)

# plot in standard Cartesian coordinates
BigCities %>%
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(color = cluster), alpha = 0.5) +
  labs(
    title = "K-means clusters — raw longitude/latitude (no projection)",
    x     = "Longitude",
    y     = "Latitude",
    color = "Cluster"
  )

4.2 Part a) (10 pts) — Interpret the Cartesian clusters

What geographic regions did the k-means algorithm identify from coordinates alone? Name each cluster’s approximate real-world region based on the plot and the cluster centroids in km$centers.

Your answer here (3–5 sentences):

Cluster 1: North America

Cluster 2: South America

Cluster 3: Europe and parts of west asia

Cluster 4: Africa

Cluster 5: Remaing parts of Asia and Australia/Oceania

Cluster 6: Remaining areas

4.3 Part b) (30 pts) — K-means under three coordinate projections

The Earth is a three-dimensional spheroid. A map projection converts it to a flat two-dimensional representation. Because k-means uses Euclidean distance, the choice of projection can change which cities are considered “close” and therefore affect cluster boundaries.

The three most common coordinate reference systems (CRS), identified by their EPSG codes, are:

EPSG code	Name	Used by
EPSG:4326	WGS84	GPS, Google Earth
EPSG:3857	Mercator	Google Maps, OpenStreetMap
EPSG:4269	NAD83	US federal agencies

Apply k-means with K = 6 under each of the three projections below. For each one:

Run the k-means clustering on the projected coordinates.
Plot the clusters (code structure is provided — fill in the labs() title).
After all three plots, write a short comparison in Part b) Discussion.

Note: Each correctly produced and labelled cluster plot is worth 10 pts.

Hint: The full workflow for EPSG:4326 is shown below. Adapt the EPSG code and axis labels for the remaining two projections.

4.3.1 EPSG:4326 — WGS84

# copy BigCities to a working object (coordinates() modifies the object in place)
d <- BigCities

# convert data.frame to a SpatialPointsDataFrame using columns 1 (longitude) and 2 (latitude)
coordinates(d) <- 1:2

# declare the source CRS as WGS84
proj4string(d) <- CRS("+init=epsg:4326")

# define the target CRS (same as source for EPSG:4326)
CRS.new <- CRS("+init=epsg:4326")

# apply the coordinate transformation
d.new <- spTransform(d, CRS.new)

# print the full PROJ.4 string for your reference
proj4string(d.new) %>% strwrap()

## [1] "+proj=longlat +datum=WGS84 +no_defs"

# run k-means on the projected coordinates
city_clusts <- as.data.frame(d.new) %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# attach cluster labels
df.new <- as.data.frame(d.new) %>%
  mutate(
    cluster   = city_clusts,
    longitude = coords.x1,
    latitude  = coords.x2
  )

# plot — coords.x1 and coords.x2 are the projected coordinate columns
df.new %>%
  ggplot(aes(x = coords.x1, y = coords.x2)) +
  geom_point(aes(color = cluster), alpha = 0.5) +
  scale_color_brewer(palette = "Set3") +
  labs(
    title = "K-means clusters — WGS84 (EPSG:4326)",
    x     = "Longitude",
    y     = "Latitude",
    color = "Cluster"
  )

4.3.2 EPSG:3857 — Mercator projection

Adapt the EPSG:4326 code above for the Mercator projection. Change the EPSG code in both CRS() calls to "+init=epsg:3857" and update the plot axis labels (Mercator units are metres, not degrees).

d <- BigCities
coordinates(d) <- 1:2
proj4string(d) <- CRS("+init=epsg:4326")
CRS.new <- CRS("+init=epsg:3857")
d.new <- spTransform(d, CRS.new)

city_clusts <- as.data.frame(d.new) %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

df.new <- as.data.frame(d.new) %>%
  mutate(cluster = city_clusts,
         longitude = coords.x1,
         latitude = coords.x2)

df.new %>%
  ggplot(aes(x = coords.x1, y = coords.x2)) +
  geom_point(aes(color = cluster), alpha = 0.5) +
  scale_color_brewer(palette = "Set3") +
  labs(title = "K-means clusters — Mercator (EPSG:3857)",
       x = "Easting (meters)", y = "Northing (meters)", color = "Cluster")

4.3.3 EPSG:4269 — NAD83

Adapt the code once more for the NAD83 projection. Change both CRS() calls to "+init=epsg:4269".

d <- BigCities
coordinates(d) <- 1:2
proj4string(d) <- CRS("+init=epsg:4326")
CRS.new <- CRS("+init=epsg:4269")
d.new <- spTransform(d, CRS.new)

city_clusts <- as.data.frame(d.new) %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

df.new <- as.data.frame(d.new) %>%
  mutate(cluster = city_clusts,
         longitude = coords.x1,
         latitude = coords.x2)

df.new %>%
  ggplot(aes(x = coords.x1, y = coords.x2)) +
  geom_point(aes(color = cluster), alpha = 0.5) +
  scale_color_brewer(palette = "Set3") +
  labs(title = "K-means clusters — NAD83 (EPSG:4269)",
       x = "Longitude", y = "Latitude", color = "Cluster")

4.3.4 Discussion

Compare your four plots (raw Cartesian, WGS84, Mercator, NAD83). Write 4–6 sentences addressing:

Do the clusters correspond to recognisable continental regions in each projection?
Are there noticeable differences between the projections in how cities are grouped?
Which projection, if any, produces the most geographically interpretable clusters, and why might that be?

Your answer here: All projections recover recognizable continental clusters, confirming k-means captures geographic structure from coordinates alone. Raw Cartesian and WGS84 show similar but distorted groupings. Mercator exaggerates polar areas but keeps continents distinct. NAD83 (North America-focused) is nearly identical to WGS84 for global data. WGS84/Raw produce the most interpretable clusters for this global dataset because they preserve relative continental separation without strong regional bias. Differences arise mainly at high latitudes due to projection distortions affecting Euclidean distances.

Good luck! Submit both your .Rmd source file and the knitted .html output.