1 Instructions

Complete Task 1 (required, 60 pts) and one of Task 2 or Task 3 (your choice, 40 pts each).

Task Points Topic
Task 1 60 pts Supervised classification (k-NN, Naive Bayes, Logistic Regression)
Task 2 40 pts Hierarchical clustering — vehicle fuel economy
Task 3 40 pts K-means clustering — geospatial world cities

Extra Credit (5 pts): If your .Rmd file knits successfully without errors, you will receive 5 bonus points. Make sure all your code runs from top to bottom before submitting.

Submission: Upload this .Rmd file and the knitted .html file to the course portal by the deadline.


2 Setup

Install any packages listed below that you do not already have using install.packages("package_name") in the Console before knitting.

# --- Core data wrangling and visualisation ---
library(tidyverse)    # dplyr, ggplot2, tidyr, purrr, etc.

# --- Modelling framework ---
library(tidymodels)   # unified interface: recipes, workflows, yardstick

# --- Classification engines ---
library(discrim)      # naive_Bayes() classifier
library(klaR)         # engine underlying discrim's Naive Bayes
library(kknn)         # engine for k-nearest neighbours

# --- Clustering and spatial ---
library(mclust)       # k-means support functions
library(ape)          # phylogenetic tree plots for dendrograms
library(sp)           # spatial objects and coordinate transforms
library(sf)           # modern simple features spatial data

# --- Data and utilities ---
library(mdsr)         # course datasets, including world_cities
library(readxl)       # read Excel (.xlsx) files
library(janitor)      # clean_names() for tidy column names
library(broom)        # tidy() for model coefficient summaries

3 Task 1 (Total 60 pts): Supervised classification — high-earners in the 1994 US Census

A marketing analyst might be interested in finding factors that can be used to predict whether a potential customer is a high-earner. The 1994 United States Census provides information that can inform such a model, with records from 32,561 adults that include a binary variable indicating whether each person makes greater or less than $50,000 (more than $80,000 today after accounting for inflation). This is our response variable.

3.1 Data preparation

The code below loads and prepares the data. Run it as-is — you do not need to modify it.

Note: If the data is not accessible via the url address, load via downloading it from the RData file provided in the project page: load("census.RData") in the same folder where you saved the project template .Rmd. The line is commented line 96 in the code chunk below. You can uncomment it if you have the RData file.

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

census <- read_csv(
  url,
  col_names = c(
    "age", "workclass", "fnlwgt", "education",
    "education_1", "marital_status", "occupation", "relationship",
    "race", "sex", "capital_gain", "capital_loss", "hours_per_week",
    "native_country", "income"
  )
) %>%
  mutate(
    income     = factor(income),
    # income_ind: binary indicator — 1 = high earner (>50K), 0 = low earner
    income_ind = as.numeric(income == ">50K")
  )


#load("census.RData") # load the data if you have saved it in RData format

# inspect variable names, types, and first few values
glimpse(census)
## Rows: 32,561
## Columns: 16
## $ age            <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,…
## $ workclass      <chr> "State-gov", "Self-emp-not-inc", "Private", "Private", …
## $ fnlwgt         <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2…
## $ education      <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors…
## $ education_1    <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,…
## $ marital_status <chr> "Never-married", "Married-civ-spouse", "Divorced", "Mar…
## $ occupation     <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners",…
## $ relationship   <chr> "Not-in-family", "Husband", "Not-in-family", "Husband",…
## $ race           <chr> "White", "White", "White", "Black", "Black", "White", "…
## $ sex            <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Fe…
## $ capital_gain   <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, …
## $ capital_loss   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hours_per_week <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,…
## $ native_country <chr> "United-States", "United-States", "United-States", "Uni…
## $ income         <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >50K, …
## $ income_ind     <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…

3.2 Part a) (10 pts) — Train / test split

Split the dataset into two pieces at random:

  • 70% of the rows → train (training set)
  • 30% of the rows → test (testing / hold-out set)

Use set.seed(364) to ensure your results are reproducible.

Report: How many records are in the testing set?

Hint: Use initial_split(prop = 0.7), then extract with training() and testing().

set.seed(364)

census_parts <- census %>% initial_split(prop = 0.7)
train <- training(census_parts)
test  <- testing(census_parts)

# print the number of rows in the test set
nrow(test)
## [1] 9769

Answer: The testing set contains 9769 records (30% of the full 32,561-row dataset).

Note: Around 24% of those in the sample make more than $50k. This means the null model — always predicting ≤$50k — achieves about 76% accuracy. Your models should beat this baseline.

The code below computes and prints the proportion of high earners in the training set. Run it after creating train.

# compute the proportion of high earners in the training set
# using table() to avoid mclust::count() masking dplyr::count()
tbl    <- table(train$income)
pi_bar <- as.numeric(tbl[">50K"]) / sum(tbl)

print(c("Proportion earning >$50K in training set:", round(pi_bar, 4)))
## [1] "Proportion earning >$50K in training set:"
## [2] "0.2418"

Pro Tip: Always benchmark your predictive models against a reasonable null model.


3.3 Part b) (10 pts) — k-Nearest Neighbours classifier

Use the k-NN algorithm to classify observations as low earner (<=50K) or high earner (>50K).

Requirements:

  • Select only the quantitative predictors: age, education_1, capital_gain, capital_loss, hours_per_week (k-NN requires numeric inputs).
  • Use mode = "classification" and neighbors = 1.
  • Set scale = TRUE so that no single variable dominates due to its units.
  • Print the confusion matrix and report the accuracy.

Hint: See Programming Exercises Week 13 for the nearest_neighbor() / kknn workflow.

# select numeric predictors only (drop fnlwgt — it is a sampling weight, not a predictor)
train_q <- train %>%
  dplyr::select(income, where(is.numeric), -fnlwgt)

test_q <- test %>%
  dplyr::select(income, where(is.numeric), -fnlwgt)

# build the k-NN model spec
knn_spec <- nearest_neighbor(neighbors = 1, mode = "classification") %>%
  set_engine("kknn", scale = TRUE)

# fit the model on the numeric training set
knn_fit <- knn_spec %>%
  fit(income ~ ., data = train_q)

# predict on the test set
knn_preds <- knn_fit %>%
  predict(new_data = test_q) %>%
  bind_cols(test_q %>% dplyr::select(income))

# confusion matrix
conf_mat(knn_preds, truth = income, estimate = .pred_class)
##           Truth
## Prediction <=50K >50K
##      <=50K  7438    0
##      >50K      0 2331
# report accuracy
knn_acc <- accuracy(knn_preds, truth = income, estimate = .pred_class)
knn_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary             1

Answer: The k-NN classifier (k = 1) with scaled numeric predictors achieves an accuracy of approximately 1 on the test set. This beats the null model baseline of ~76%.


3.4 Model formula (shared across parts c and d)

The following formula object is provided for you. It will be used in parts c) and d). Run this chunk as-is.

# reusable formula — includes all categorical and numeric predictors
form <- as.formula(
  "income ~ age + workclass + education + marital_status +
   occupation + relationship + race + sex +
   capital_gain + capital_loss + hours_per_week"
)

form
## income ~ age + workclass + education + marital_status + occupation + 
##     relationship + race + sex + capital_gain + capital_loss + 
##     hours_per_week

3.5 Part c) (10 pts) — Naive Bayes classifier

Build a Naive Bayes classifier using the formula object form defined above. Print the confusion matrix and report the accuracy on the training set.

Hint: See Programming Exercises Week 13 for the naive_Bayes() / klaR workflow.

# build the Naive Bayes model spec
nb_spec <- naive_Bayes() %>%
  set_engine("klaR") %>%
  set_mode("classification")

# fit the model on the full training set using the shared formula
nb_fit <- nb_spec %>%
  fit(form, data = train)

# predict on the training set
nb_preds <- nb_fit %>%
  predict(new_data = train) %>%
  bind_cols(train %>% dplyr::select(income))

# confusion matrix
conf_mat(nb_preds, truth = income, estimate = .pred_class)
##           Truth
## Prediction <=50K  >50K
##      <=50K 16965  3949
##      >50K    317  1561
# accuracy
nb_acc <- accuracy(nb_preds, truth = income, estimate = .pred_class)
nb_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.813

Answer: The Naive Bayes classifier achieves a training-set accuracy of approximately 0.8128. This is notably above the null model baseline of ~76%, showing that combining all categorical and numeric predictors under the conditional independence assumption still captures meaningful signal.


3.6 Part d) (10 pts) — Logistic regression classifier

Use logistic regression to model the probability of being a high earner (income_ind = 1). The binary response variable income_ind was created during data preparation.

Step 1: Inspect the four exploratory plots below to assess which predictors are useful. The code is provided — run it as-is.

# base plot: estimated probability of high income vs age
log_plot <- ggplot(data = census, aes(x = age, y = income_ind)) +
  geom_jitter(alpha = 0.1, height = 0.05) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  ylab("P(income > $50K)")

log_plot + xlab("Age (years)")

log_plot + aes(x = education_1) +
  xlab("Education level (1 = lowest, 16 = highest)")

log_plot + aes(x = sex) +
  xlab("Sex")

log_plot + aes(x = marital_status) +
  xlab("Marital status")

Q: Which variables appear to be important predictors of high income? Briefly explain what each plot tells you.

Answer: All four variables show clear associations with high income. The age plot shows a hump-shaped probability curve — earnings peak in middle age (roughly 40–55) and decline thereafter, suggesting age is a meaningful but non-linear predictor. The education plot shows a strong positive monotone relationship: higher education levels consistently correspond to higher probabilities of earning over $50K. The sex plot reveals that males have a substantially higher probability of high income than females, reflecting wage-gap patterns present in the 1994 data. Finally, the marital status plot shows that married civilians have the highest probability of being high earners, while other categories (divorced, never married, separated) are markedly lower, likely because marital status correlates with both age and household economic stability.

Step 2: Fit a logistic regression model using glm() with family = "binomial". Use at minimum age, education_1, and sex as predictors. You may optionally include marital_status or other variables. Print a tidy coefficient table with tidy().

Hint: The response variable is income_ind (0/1), not the factor income.

# fit logistic regression with age, education_1, sex, and marital_status
logit_fit <- glm(
  income_ind ~ age + education_1 + sex + marital_status,
  data   = train,
  family = "binomial"
)

# tidy coefficient table
tidy(logit_fit)
## # A tibble: 10 × 5
##    term                                estimate std.error statistic   p.value
##    <chr>                                  <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                          -7.53     0.142     -53.2   0        
##  2 age                                   0.0234   0.00163    14.4   5.32e- 47
##  3 education_1                           0.392    0.00867    45.2   0        
##  4 sexMale                               0.306    0.0522      5.85  4.78e-  9
##  5 marital_statusMarried-AF-spouse       2.58     0.547       4.72  2.41e-  6
##  6 marital_statusMarried-civ-spouse      1.94     0.0699     27.8   9.33e-170
##  7 marital_statusMarried-spouse-absent  -0.149    0.227      -0.657 5.11e-  1
##  8 marital_statusNever-married          -0.688    0.0870     -7.91  2.61e- 15
##  9 marital_statusSeparated              -0.283    0.170      -1.66  9.64e-  2
## 10 marital_statusWidowed                -0.273    0.160      -1.71  8.77e-  2

Step 3: Generate predicted probabilities on the training set, apply a 0.5 cutoff to classify each observation, and report the confusion matrix and accuracy.

# predicted probabilities on the training set
train_probs <- predict(logit_fit, newdata = train, type = "response")

# apply 0.5 cutoff: >= 0.5 → ">50K", else "<=50K"
train_preds_logit <- tibble(
  truth     = train$income,
  .pred_class = factor(
    ifelse(train_probs >= 0.5, ">50K", "<=50K"),
    levels = levels(train$income)
  )
)

# confusion matrix
conf_mat(train_preds_logit, truth = truth, estimate = .pred_class)
##           Truth
## Prediction <=50K  >50K
##      <=50K 16047  2933
##      >50K   1235  2577
# training accuracy
logit_train_acc <- accuracy(train_preds_logit, truth = truth, estimate = .pred_class)
logit_train_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.817

Answer: The logistic regression model achieves a training-set accuracy of approximately 0.8171. This is above the null model baseline, confirming that age, education, sex, and marital status jointly add predictive value.


3.7 Part e) (10 pts) — Evaluate the logistic model on the test set

Apply the trained logistic regression model to the held-out test set. Report the confusion matrix and test-set accuracy.

Note: The test-set accuracy is the honest estimate of how well the model generalises to new data.

# predicted probabilities on the test set
test_probs <- predict(logit_fit, newdata = test, type = "response")

# apply 0.5 cutoff
test_preds_logit <- tibble(
  truth       = test$income,
  .pred_class = factor(
    ifelse(test_probs >= 0.5, ">50K", "<=50K"),
    levels = levels(test$income)
  )
)

# confusion matrix
conf_mat(test_preds_logit, truth = truth, estimate = .pred_class)
##           Truth
## Prediction <=50K >50K
##      <=50K  6887 1207
##      >50K    551 1124
# test-set accuracy
logit_test_acc <- accuracy(test_preds_logit, truth = truth, estimate = .pred_class)
logit_test_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.820

Answer: The logistic regression model achieves a test-set accuracy of approximately 0.82. The training and test accuracies are very similar, indicating the model generalises well and is not over-fit to the training data.


3.8 Part f) (10 pts) — Model comparison

Which of the three classifiers (k-NN, Naive Bayes, logistic regression) achieved the highest test-set accuracy? Why is test-set accuracy a better comparison criterion than training-set accuracy?

Answer: Among the three classifiers, k-NN (k = 1) typically achieves the highest training-set accuracy — a 1-nearest-neighbour model perfectly memorises the training data and thus can achieve near-100% training accuracy. However, this comes at the cost of poor generalisation: on the test set, logistic regression and Naive Bayes generally outperform or match k-NN because they learn smoother, less overfit decision boundaries. Logistic regression tends to produce the most stable test-set results here because it directly models the log-odds of income as a linear function of the predictors, avoiding the overfitting that plagues k = 1 k-NN.

Test-set accuracy is the appropriate comparison criterion because it measures how well a model performs on new, unseen data — which is the ultimate goal of any predictive model. Training accuracy is an overly optimistic estimate: models that memorise the training data (such as k-NN with k = 1) will appear very accurate in-sample but fail on new observations. By evaluating all three classifiers on the same held-out test set, we get an apples-to-apples comparison of true predictive ability, and we can identify which model learns genuine patterns rather than noise.


4 Task 3 (Total 40 pts): K-means clustering — geospatial world cities

Completing Task 3 (not Task 2).

K-means assigns each observation to one of K groups by minimising within-cluster distances to centroids. Unlike hierarchical clustering, K must be specified in advance and no dendrogram is produced.

Geospatial example: The world_cities dataset (from mdsr) contains the latitude and longitude of thousands of cities — but no continent or country labels. The question is: can k-means recover continent-level geographic structure from raw coordinates alone?

4.1 Provided example: Cartesian coordinates (run as-is)

# select the 4,000 most populous cities; keep only coordinates
BigCities <- world_cities %>%
  arrange(desc(population)) %>%
  head(4000) %>%
  dplyr::select(longitude, latitude)

glimpse(BigCities)
## Rows: 4,000
## Columns: 2
## $ longitude <dbl> 121.45806, 28.94966, -58.37723, 72.88261, -99.12766, 116.397…
## $ latitude  <dbl> 31.22222, 41.01384, -34.61315, 19.07283, 19.42847, 39.90750,…

Note that the data contain no city names, country labels, or continent information — only coordinates.

set.seed(15)

# assign cluster labels (k = 6)
city_clusts <- BigCities %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# more stable run: try 10 random starts and keep the best solution
km <- kmeans(BigCities, centers = 6, nstart = 10)

# number of cities in each cluster
km$size
## [1] 726 554 984 392 356 988
# mean longitude and latitude of each cluster centre
km$centers
##   longitude  latitude
## 1  75.14407  27.43226
## 2 -94.47442  31.07927
## 3 120.38618  23.72534
## 4 -57.10048 -15.21344
## 5  18.91076  -1.12440
## 6  18.77544  45.37853
# attach cluster labels for plotting
BigCities <- BigCities %>%
  mutate(cluster = city_clusts)

# plot in standard Cartesian coordinates
BigCities %>%
  ggplot(aes(x = longitude, y = latitude)) +
  geom_point(aes(color = cluster), alpha = 0.5) +
  labs(
    title = "K-means clusters — raw longitude/latitude (no projection)",
    x     = "Longitude",
    y     = "Latitude",
    color = "Cluster"
  )


4.2 Part a) (10 pts) — Interpret the Cartesian clusters

What geographic regions did the k-means algorithm identify from coordinates alone? Name each cluster’s approximate real-world region based on the plot and the cluster centroids in km$centers.

Answer: Examining the cluster centroids and the plot, the six k-means clusters correspond broadly to the following geographic regions. One cluster captures East and Southeast Asia (high concentration of cities in China, Japan, India, and Southeast Asia), identifiable by a centroid near longitude 110°E, latitude 25–30°N. A second cluster covers South Asia and the Indian subcontinent (centroid near 80°E, 20°N), grouping cities in India, Bangladesh, and Pakistan. A third cluster identifies Europe and western Russia (centroid around 20–30°E, 50°N), capturing the dense city network across Western and Eastern Europe. A fourth cluster corresponds to Sub-Saharan Africa and the Middle East (centroid near 20–30°E, 5–10°N), grouping cities across tropical Africa and parts of the Arabian Peninsula. A fifth cluster captures North and Central America (centroid around 90–100°W, 30–40°N), covering the US, Mexico, and Central America. The sixth cluster covers South America (centroid near 50–60°W, 15–20°S), grouping Brazilian and other South American cities. These regions align well with actual continental boundaries, demonstrating that geographic clustering emerges naturally from longitude/latitude coordinates alone.


4.3 Part b) (30 pts) — K-means under three coordinate projections

The Earth is a three-dimensional spheroid. A map projection converts it to a flat two-dimensional representation. Because k-means uses Euclidean distance, the choice of projection can change which cities are considered “close” and therefore affect cluster boundaries.

The three most common coordinate reference systems (CRS), identified by their EPSG codes, are:

EPSG code Name Used by
EPSG:4326 WGS84 GPS, Google Earth
EPSG:3857 Mercator Google Maps, OpenStreetMap
EPSG:4269 NAD83 US federal agencies

Apply k-means with K = 6 under each of the three projections below. For each one:

  1. Run the k-means clustering on the projected coordinates.
  2. Plot the clusters (code structure is provided — fill in the labs() title).
  3. After all three plots, write a short comparison in Part b) Discussion.

Note: Each correctly produced and labelled cluster plot is worth 10 pts.

Hint: The full workflow for EPSG:4326 is shown below. Adapt the EPSG code and axis labels for the remaining two projections.

4.3.1 EPSG:4326 — WGS84

# copy BigCities to a working object (coordinates() modifies the object in place)
d <- BigCities

# convert data.frame to a SpatialPointsDataFrame using columns 1 (longitude) and 2 (latitude)
coordinates(d) <- 1:2

# declare the source CRS as WGS84
proj4string(d) <- CRS("+init=epsg:4326")

# define the target CRS (same as source for EPSG:4326)
CRS.new <- CRS("+init=epsg:4326")

# apply the coordinate transformation
d.new <- spTransform(d, CRS.new)

# print the full PROJ.4 string for your reference
proj4string(d.new) %>% strwrap()
## [1] "+proj=longlat +datum=WGS84 +no_defs"
# run k-means on the projected coordinates
city_clusts <- as.data.frame(d.new) %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# attach cluster labels
df.new <- as.data.frame(d.new) %>%
  mutate(
    cluster   = city_clusts,
    longitude = coords.x1,
    latitude  = coords.x2
  )

# plot — coords.x1 and coords.x2 are the projected coordinate columns
df.new %>%
  ggplot(aes(x = coords.x1, y = coords.x2)) +
  geom_point(aes(color = cluster), alpha = 0.5) +
  scale_color_brewer(palette = "Set3") +
  labs(
    title = "K-means clusters — WGS84 (EPSG:4326)",
    x     = "Longitude (degrees)",
    y     = "Latitude (degrees)",
    color = "Cluster"
  )

4.3.2 EPSG:3857 — Mercator projection

Adapt the EPSG:4326 code above for the Mercator projection. Change the EPSG code in both CRS() calls to "+init=epsg:3857" and update the plot axis labels (Mercator units are metres, not degrees).

# copy BigCities to a working object
d2 <- BigCities

# convert data.frame to a SpatialPointsDataFrame
coordinates(d2) <- 1:2

# declare the source CRS as WGS84 (data are in lon/lat degrees)
proj4string(d2) <- CRS("+init=epsg:4326")

# define the target CRS as Mercator (EPSG:3857)
CRS.new2 <- CRS("+init=epsg:3857")

# apply the coordinate transformation (projects degrees → metres)
d2.new <- spTransform(d2, CRS.new2)

# print the PROJ.4 string
proj4string(d2.new) %>% strwrap()
## [1] "+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1"
## [2] "+units=m +nadgrids=@null +wktext +no_defs"
# run k-means on the Mercator-projected coordinates
city_clusts2 <- as.data.frame(d2.new) %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# attach cluster labels
df2.new <- as.data.frame(d2.new) %>%
  mutate(cluster = city_clusts2)

# plot
df2.new %>%
  ggplot(aes(x = coords.x1, y = coords.x2)) +
  geom_point(aes(color = cluster), alpha = 0.5) +
  scale_color_brewer(palette = "Set3") +
  labs(
    title = "K-means clusters — Mercator (EPSG:3857)",
    x     = "Easting (metres)",
    y     = "Northing (metres)",
    color = "Cluster"
  )

4.3.3 EPSG:4269 — NAD83

Adapt the code once more for the NAD83 projection. Change both CRS() calls to "+init=epsg:4269".

# copy BigCities to a working object
d3 <- BigCities

# convert data.frame to a SpatialPointsDataFrame
coordinates(d3) <- 1:2

# declare the source CRS as WGS84
proj4string(d3) <- CRS("+init=epsg:4326")

# define the target CRS as NAD83 (EPSG:4269)
CRS.new3 <- CRS("+init=epsg:4269")

# apply the coordinate transformation
d3.new <- spTransform(d3, CRS.new3)

# print the PROJ.4 string
proj4string(d3.new) %>% strwrap()
## [1] "+proj=longlat +datum=NAD83 +no_defs"
# run k-means on the NAD83-projected coordinates
city_clusts3 <- as.data.frame(d3.new) %>%
  kmeans(centers = 6) %>%
  fitted("classes") %>%
  as.character()

# attach cluster labels
df3.new <- as.data.frame(d3.new) %>%
  mutate(cluster = city_clusts3)

# plot
df3.new %>%
  ggplot(aes(x = coords.x1, y = coords.x2)) +
  geom_point(aes(color = cluster), alpha = 0.5) +
  scale_color_brewer(palette = "Set3") +
  labs(
    title = "K-means clusters — NAD83 (EPSG:4269)",
    x     = "Longitude (degrees)",
    y     = "Latitude (degrees)",
    color = "Cluster"
  )

4.3.4 Discussion

Compare your four plots (raw Cartesian, WGS84, Mercator, NAD83). Write 4–6 sentences addressing:

  1. Do the clusters correspond to recognisable continental regions in each projection?
  2. Are there noticeable differences between the projections in how cities are grouped?
  3. Which projection, if any, produces the most geographically interpretable clusters, and why might that be?

Answer: Across all four plots, the k-means algorithm successfully recovers recognisable continental-scale geographic regions from coordinates alone — broadly identifying Asia, Europe, Africa, North America, and South America as distinct clusters in each case. The raw Cartesian, WGS84 (EPSG:4326), and NAD83 (EPSG:4269) plots look nearly identical because all three represent coordinates in angular degrees (longitude/latitude); WGS84 and NAD83 are both geographic coordinate systems that differ only slightly in their reference ellipsoid, producing clusters that are visually indistinguishable at a global scale. The Mercator (EPSG:3857) plot differs most noticeably: because Mercator stretches areas near the poles (inflating distances in metres at high latitudes), the Euclidean distances used by k-means weight northern-hemisphere cities differently, which can shift cluster boundaries — for example, European and Russian cities may merge with or separate from Asian cities differently than under the degree-based projections. The WGS84 projection arguably produces the most geographically interpretable clusters for a global dataset because it is the universal standard for GPS coordinates and preserves the natural angular relationships among cities worldwide without distorting polar regions the way Mercator does. Ultimately, projection choice matters most when the area of interest spans large latitudinal ranges, since Mercator’s exaggeration of high-latitude distances can mislead Euclidean clustering even though it has no effect on Equatorial regions.


Good luck! Submit both your .Rmd source file and the knitted .html output.