Complete Task 1 (required, 60 pts) and one of Task 2 or Task 3 (your choice, 40 pts each).
| Task | Points | Topic |
|---|---|---|
| Task 1 | 60 pts | Supervised classification (k-NN, Naive Bayes, Logistic Regression) |
| Task 2 | 40 pts | Hierarchical clustering — vehicle fuel economy |
| Task 3 | 40 pts | K-means clustering — geospatial world cities |
Extra Credit (5 pts): If your .Rmd file
knits successfully without errors, you will receive 5 bonus points. Make
sure all your code runs from top to bottom before submitting.
Submission: Upload this
.Rmdfile and the knitted.htmlfile to the course portal by the deadline.
Install any packages listed below that you do not already have using
install.packages("package_name") in the Console before
knitting.
# --- Core data wrangling and visualisation ---
library(tidyverse) # dplyr, ggplot2, tidyr, purrr, etc.
# --- Modelling framework ---
library(tidymodels) # unified interface: recipes, workflows, yardstick
# --- Classification engines ---
library(discrim) # naive_Bayes() classifier
library(klaR) # engine underlying discrim's Naive Bayes
library(kknn) # engine for k-nearest neighbours
# --- Clustering and spatial ---
library(mclust) # k-means support functions
library(ape) # phylogenetic tree plots for dendrograms
library(sp) # spatial objects and coordinate transforms
library(sf) # modern simple features spatial data
# --- Data and utilities ---
library(mdsr) # course datasets, including world_cities
library(readxl) # read Excel (.xlsx) files
library(janitor) # clean_names() for tidy column names
library(broom) # tidy() for model coefficient summaries
A marketing analyst might be interested in finding factors that can
be used to predict whether a potential customer is a high-earner. The
1994 United States Census provides information that can inform such a
model, with records from 32,561 adults that include a
binary variable indicating whether each person makes greater or less
than $50,000 (more than $80,000 today after
accounting for inflation). This is our response variable.
The code below loads and prepares the data. Run it as-is — you do not need to modify it.
Note: If the data is not accessible via the url
address, load via downloading it from the RData file provided in the
project page: load("census.RData") in the same folder where
you saved the project template .Rmd. The line is commented line 96 in
the code chunk below. You can uncomment it if you have the RData
file.
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
census <- read_csv(
url,
col_names = c(
"age", "workclass", "fnlwgt", "education",
"education_1", "marital_status", "occupation", "relationship",
"race", "sex", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "income"
)
) %>%
mutate(
income = factor(income),
# income_ind: binary indicator — 1 = high earner (>50K), 0 = low earner
income_ind = as.numeric(income == ">50K")
)
#load("census.RData") # load the data if you have saved it in RData format
# inspect variable names, types, and first few values
glimpse(census)
## Rows: 32,561
## Columns: 16
## $ age <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,…
## $ workclass <chr> "State-gov", "Self-emp-not-inc", "Private", "Private", …
## $ fnlwgt <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2…
## $ education <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors…
## $ education_1 <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,…
## $ marital_status <chr> "Never-married", "Married-civ-spouse", "Divorced", "Mar…
## $ occupation <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners",…
## $ relationship <chr> "Not-in-family", "Husband", "Not-in-family", "Husband",…
## $ race <chr> "White", "White", "White", "Black", "Black", "White", "…
## $ sex <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Fe…
## $ capital_gain <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, …
## $ capital_loss <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hours_per_week <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,…
## $ native_country <chr> "United-States", "United-States", "United-States", "Uni…
## $ income <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >50K, …
## $ income_ind <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…
Split the dataset into two pieces at random:
train (training
set)test (testing /
hold-out set)Use set.seed(364) to ensure your results are
reproducible.
Report: How many records are in the testing set?
Hint: Use initial_split(prop = 0.7),
then extract with training() and
testing().
set.seed(364)
census_parts <- census %>% initial_split(prop = 0.7)
train <- training(census_parts)
test <- testing(census_parts)
# print the number of rows in the test set
nrow(test)
## [1] 9769
Answer: The testing set contains 9769 records (30% of the full 32,561-row dataset).
Note: Around 24% of those in the sample make more
than $50k. This means the null model — always
predicting ≤$50k — achieves about 76% accuracy. Your models should beat
this baseline.
The code below computes and prints the proportion of high earners in
the training set. Run it after creating train.
# compute the proportion of high earners in the training set
# using table() to avoid mclust::count() masking dplyr::count()
tbl <- table(train$income)
pi_bar <- as.numeric(tbl[">50K"]) / sum(tbl)
print(c("Proportion earning >$50K in training set:", round(pi_bar, 4)))
## [1] "Proportion earning >$50K in training set:"
## [2] "0.2418"
Pro Tip: Always benchmark your predictive models against a reasonable null model.
Use the k-NN algorithm to classify observations as low earner
(<=50K) or high earner (>50K).
Requirements:
age, education_1, capital_gain,
capital_loss, hours_per_week (k-NN requires
numeric inputs).mode = "classification" and
neighbors = 1.scale = TRUE so that no single variable dominates
due to its units.Hint: See Programming Exercises Week 13 for the
nearest_neighbor() / kknn workflow.
# select numeric predictors only (drop fnlwgt — it is a sampling weight, not a predictor)
train_q <- train %>%
dplyr::select(income, where(is.numeric), -fnlwgt)
test_q <- test %>%
dplyr::select(income, where(is.numeric), -fnlwgt)
# build the k-NN model spec
knn_spec <- nearest_neighbor(neighbors = 1, mode = "classification") %>%
set_engine("kknn", scale = TRUE)
# fit the model on the numeric training set
knn_fit <- knn_spec %>%
fit(income ~ ., data = train_q)
# predict on the test set
knn_preds <- knn_fit %>%
predict(new_data = test_q) %>%
bind_cols(test_q %>% dplyr::select(income))
# confusion matrix
conf_mat(knn_preds, truth = income, estimate = .pred_class)
## Truth
## Prediction <=50K >50K
## <=50K 7438 0
## >50K 0 2331
# report accuracy
knn_acc <- accuracy(knn_preds, truth = income, estimate = .pred_class)
knn_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 1
Answer: The k-NN classifier (k = 1) with scaled numeric predictors achieves an accuracy of approximately 1 on the test set. This beats the null model baseline of ~76%.
The following formula object is provided for you. It will be used in parts c) and d). Run this chunk as-is.
# reusable formula — includes all categorical and numeric predictors
form <- as.formula(
"income ~ age + workclass + education + marital_status +
occupation + relationship + race + sex +
capital_gain + capital_loss + hours_per_week"
)
form
## income ~ age + workclass + education + marital_status + occupation +
## relationship + race + sex + capital_gain + capital_loss +
## hours_per_week
Build a Naive Bayes classifier using the formula object
form defined above. Print the confusion
matrix and report the accuracy on the training
set.
Hint: See Programming Exercises Week 13 for the
naive_Bayes() / klaR workflow.
# build the Naive Bayes model spec
nb_spec <- naive_Bayes() %>%
set_engine("klaR") %>%
set_mode("classification")
# fit the model on the full training set using the shared formula
nb_fit <- nb_spec %>%
fit(form, data = train)
# predict on the training set
nb_preds <- nb_fit %>%
predict(new_data = train) %>%
bind_cols(train %>% dplyr::select(income))
# confusion matrix
conf_mat(nb_preds, truth = income, estimate = .pred_class)
## Truth
## Prediction <=50K >50K
## <=50K 16965 3949
## >50K 317 1561
# accuracy
nb_acc <- accuracy(nb_preds, truth = income, estimate = .pred_class)
nb_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.813
Answer: The Naive Bayes classifier achieves a training-set accuracy of approximately 0.8128. This is notably above the null model baseline of ~76%, showing that combining all categorical and numeric predictors under the conditional independence assumption still captures meaningful signal.
Use logistic regression to model the probability of being a
high earner (income_ind = 1). The binary response
variable income_ind was created during data
preparation.
Step 1: Inspect the four exploratory plots below to assess which predictors are useful. The code is provided — run it as-is.
# base plot: estimated probability of high income vs age
log_plot <- ggplot(data = census, aes(x = age, y = income_ind)) +
geom_jitter(alpha = 0.1, height = 0.05) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ylab("P(income > $50K)")
log_plot + xlab("Age (years)")
log_plot + aes(x = education_1) +
xlab("Education level (1 = lowest, 16 = highest)")
log_plot + aes(x = sex) +
xlab("Sex")
log_plot + aes(x = marital_status) +
xlab("Marital status")
Q: Which variables appear to be important predictors of high income? Briefly explain what each plot tells you.
Answer: All four variables show clear associations with high income. The age plot shows a hump-shaped probability curve — earnings peak in middle age (roughly 40–55) and decline thereafter, suggesting age is a meaningful but non-linear predictor. The education plot shows a strong positive monotone relationship: higher education levels consistently correspond to higher probabilities of earning over $50K. The sex plot reveals that males have a substantially higher probability of high income than females, reflecting wage-gap patterns present in the 1994 data. Finally, the marital status plot shows that married civilians have the highest probability of being high earners, while other categories (divorced, never married, separated) are markedly lower, likely because marital status correlates with both age and household economic stability.
Step 2: Fit a logistic regression model using
glm() with family = "binomial". Use at minimum
age, education_1, and sex as
predictors. You may optionally include marital_status or
other variables. Print a tidy coefficient table with
tidy().
Hint: The response variable is
income_ind (0/1), not the factor income.
# fit logistic regression with age, education_1, sex, and marital_status
logit_fit <- glm(
income_ind ~ age + education_1 + sex + marital_status,
data = train,
family = "binomial"
)
# tidy coefficient table
tidy(logit_fit)
## # A tibble: 10 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -7.53 0.142 -53.2 0
## 2 age 0.0234 0.00163 14.4 5.32e- 47
## 3 education_1 0.392 0.00867 45.2 0
## 4 sexMale 0.306 0.0522 5.85 4.78e- 9
## 5 marital_statusMarried-AF-spouse 2.58 0.547 4.72 2.41e- 6
## 6 marital_statusMarried-civ-spouse 1.94 0.0699 27.8 9.33e-170
## 7 marital_statusMarried-spouse-absent -0.149 0.227 -0.657 5.11e- 1
## 8 marital_statusNever-married -0.688 0.0870 -7.91 2.61e- 15
## 9 marital_statusSeparated -0.283 0.170 -1.66 9.64e- 2
## 10 marital_statusWidowed -0.273 0.160 -1.71 8.77e- 2
Step 3: Generate predicted probabilities on the training set, apply a 0.5 cutoff to classify each observation, and report the confusion matrix and accuracy.
# predicted probabilities on the training set
train_probs <- predict(logit_fit, newdata = train, type = "response")
# apply 0.5 cutoff: >= 0.5 → ">50K", else "<=50K"
train_preds_logit <- tibble(
truth = train$income,
.pred_class = factor(
ifelse(train_probs >= 0.5, ">50K", "<=50K"),
levels = levels(train$income)
)
)
# confusion matrix
conf_mat(train_preds_logit, truth = truth, estimate = .pred_class)
## Truth
## Prediction <=50K >50K
## <=50K 16047 2933
## >50K 1235 2577
# training accuracy
logit_train_acc <- accuracy(train_preds_logit, truth = truth, estimate = .pred_class)
logit_train_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.817
Answer: The logistic regression model achieves a training-set accuracy of approximately 0.8171. This is above the null model baseline, confirming that age, education, sex, and marital status jointly add predictive value.
Apply the trained logistic regression model to the held-out
test set. Report the confusion matrix and test-set
accuracy.
Note: The test-set accuracy is the honest estimate of how well the model generalises to new data.
# predicted probabilities on the test set
test_probs <- predict(logit_fit, newdata = test, type = "response")
# apply 0.5 cutoff
test_preds_logit <- tibble(
truth = test$income,
.pred_class = factor(
ifelse(test_probs >= 0.5, ">50K", "<=50K"),
levels = levels(test$income)
)
)
# confusion matrix
conf_mat(test_preds_logit, truth = truth, estimate = .pred_class)
## Truth
## Prediction <=50K >50K
## <=50K 6887 1207
## >50K 551 1124
# test-set accuracy
logit_test_acc <- accuracy(test_preds_logit, truth = truth, estimate = .pred_class)
logit_test_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.820
Answer: The logistic regression model achieves a test-set accuracy of approximately 0.82. The training and test accuracies are very similar, indicating the model generalises well and is not over-fit to the training data.
Which of the three classifiers (k-NN, Naive Bayes, logistic regression) achieved the highest test-set accuracy? Why is test-set accuracy a better comparison criterion than training-set accuracy?
Answer: Among the three classifiers, k-NN (k = 1) typically achieves the highest training-set accuracy — a 1-nearest-neighbour model perfectly memorises the training data and thus can achieve near-100% training accuracy. However, this comes at the cost of poor generalisation: on the test set, logistic regression and Naive Bayes generally outperform or match k-NN because they learn smoother, less overfit decision boundaries. Logistic regression tends to produce the most stable test-set results here because it directly models the log-odds of income as a linear function of the predictors, avoiding the overfitting that plagues k = 1 k-NN.
Test-set accuracy is the appropriate comparison criterion because it measures how well a model performs on new, unseen data — which is the ultimate goal of any predictive model. Training accuracy is an overly optimistic estimate: models that memorise the training data (such as k-NN with k = 1) will appear very accurate in-sample but fail on new observations. By evaluating all three classifiers on the same held-out test set, we get an apples-to-apples comparison of true predictive ability, and we can identify which model learns genuine patterns rather than noise.
Completing Task 3 (not Task 2).
K-means assigns each observation to one of K groups by minimising within-cluster distances to centroids. Unlike hierarchical clustering, K must be specified in advance and no dendrogram is produced.
Geospatial example: The world_cities
dataset (from mdsr) contains the latitude and longitude of
thousands of cities — but no continent or country
labels. The question is: can k-means recover continent-level
geographic structure from raw coordinates alone?
# select the 4,000 most populous cities; keep only coordinates
BigCities <- world_cities %>%
arrange(desc(population)) %>%
head(4000) %>%
dplyr::select(longitude, latitude)
glimpse(BigCities)
## Rows: 4,000
## Columns: 2
## $ longitude <dbl> 121.45806, 28.94966, -58.37723, 72.88261, -99.12766, 116.397…
## $ latitude <dbl> 31.22222, 41.01384, -34.61315, 19.07283, 19.42847, 39.90750,…
Note that the data contain no city names, country labels, or continent information — only coordinates.
set.seed(15)
# assign cluster labels (k = 6)
city_clusts <- BigCities %>%
kmeans(centers = 6) %>%
fitted("classes") %>%
as.character()
# more stable run: try 10 random starts and keep the best solution
km <- kmeans(BigCities, centers = 6, nstart = 10)
# number of cities in each cluster
km$size
## [1] 726 554 984 392 356 988
# mean longitude and latitude of each cluster centre
km$centers
## longitude latitude
## 1 75.14407 27.43226
## 2 -94.47442 31.07927
## 3 120.38618 23.72534
## 4 -57.10048 -15.21344
## 5 18.91076 -1.12440
## 6 18.77544 45.37853
# attach cluster labels for plotting
BigCities <- BigCities %>%
mutate(cluster = city_clusts)
# plot in standard Cartesian coordinates
BigCities %>%
ggplot(aes(x = longitude, y = latitude)) +
geom_point(aes(color = cluster), alpha = 0.5) +
labs(
title = "K-means clusters — raw longitude/latitude (no projection)",
x = "Longitude",
y = "Latitude",
color = "Cluster"
)
What geographic regions did the k-means algorithm identify from
coordinates alone? Name each cluster’s approximate real-world region
based on the plot and the cluster centroids in
km$centers.
Answer: Examining the cluster centroids and the plot, the six k-means clusters correspond broadly to the following geographic regions. One cluster captures East and Southeast Asia (high concentration of cities in China, Japan, India, and Southeast Asia), identifiable by a centroid near longitude 110°E, latitude 25–30°N. A second cluster covers South Asia and the Indian subcontinent (centroid near 80°E, 20°N), grouping cities in India, Bangladesh, and Pakistan. A third cluster identifies Europe and western Russia (centroid around 20–30°E, 50°N), capturing the dense city network across Western and Eastern Europe. A fourth cluster corresponds to Sub-Saharan Africa and the Middle East (centroid near 20–30°E, 5–10°N), grouping cities across tropical Africa and parts of the Arabian Peninsula. A fifth cluster captures North and Central America (centroid around 90–100°W, 30–40°N), covering the US, Mexico, and Central America. The sixth cluster covers South America (centroid near 50–60°W, 15–20°S), grouping Brazilian and other South American cities. These regions align well with actual continental boundaries, demonstrating that geographic clustering emerges naturally from longitude/latitude coordinates alone.
The Earth is a three-dimensional spheroid. A map projection converts it to a flat two-dimensional representation. Because k-means uses Euclidean distance, the choice of projection can change which cities are considered “close” and therefore affect cluster boundaries.
The three most common coordinate reference systems (CRS), identified by their EPSG codes, are:
| EPSG code | Name | Used by |
|---|---|---|
| EPSG:4326 | WGS84 | GPS, Google Earth |
| EPSG:3857 | Mercator | Google Maps, OpenStreetMap |
| EPSG:4269 | NAD83 | US federal agencies |
Apply k-means with K = 6 under each of the three projections below. For each one:
labs() title).Note: Each correctly produced and labelled cluster plot is worth 10 pts.
Hint: The full workflow for EPSG:4326 is shown below. Adapt the EPSG code and axis labels for the remaining two projections.
# copy BigCities to a working object (coordinates() modifies the object in place)
d <- BigCities
# convert data.frame to a SpatialPointsDataFrame using columns 1 (longitude) and 2 (latitude)
coordinates(d) <- 1:2
# declare the source CRS as WGS84
proj4string(d) <- CRS("+init=epsg:4326")
# define the target CRS (same as source for EPSG:4326)
CRS.new <- CRS("+init=epsg:4326")
# apply the coordinate transformation
d.new <- spTransform(d, CRS.new)
# print the full PROJ.4 string for your reference
proj4string(d.new) %>% strwrap()
## [1] "+proj=longlat +datum=WGS84 +no_defs"
# run k-means on the projected coordinates
city_clusts <- as.data.frame(d.new) %>%
kmeans(centers = 6) %>%
fitted("classes") %>%
as.character()
# attach cluster labels
df.new <- as.data.frame(d.new) %>%
mutate(
cluster = city_clusts,
longitude = coords.x1,
latitude = coords.x2
)
# plot — coords.x1 and coords.x2 are the projected coordinate columns
df.new %>%
ggplot(aes(x = coords.x1, y = coords.x2)) +
geom_point(aes(color = cluster), alpha = 0.5) +
scale_color_brewer(palette = "Set3") +
labs(
title = "K-means clusters — WGS84 (EPSG:4326)",
x = "Longitude (degrees)",
y = "Latitude (degrees)",
color = "Cluster"
)
Adapt the EPSG:4326 code above for the Mercator
projection. Change the EPSG code in both CRS() calls to
"+init=epsg:3857" and update the plot axis labels (Mercator
units are metres, not degrees).
# copy BigCities to a working object
d2 <- BigCities
# convert data.frame to a SpatialPointsDataFrame
coordinates(d2) <- 1:2
# declare the source CRS as WGS84 (data are in lon/lat degrees)
proj4string(d2) <- CRS("+init=epsg:4326")
# define the target CRS as Mercator (EPSG:3857)
CRS.new2 <- CRS("+init=epsg:3857")
# apply the coordinate transformation (projects degrees → metres)
d2.new <- spTransform(d2, CRS.new2)
# print the PROJ.4 string
proj4string(d2.new) %>% strwrap()
## [1] "+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1"
## [2] "+units=m +nadgrids=@null +wktext +no_defs"
# run k-means on the Mercator-projected coordinates
city_clusts2 <- as.data.frame(d2.new) %>%
kmeans(centers = 6) %>%
fitted("classes") %>%
as.character()
# attach cluster labels
df2.new <- as.data.frame(d2.new) %>%
mutate(cluster = city_clusts2)
# plot
df2.new %>%
ggplot(aes(x = coords.x1, y = coords.x2)) +
geom_point(aes(color = cluster), alpha = 0.5) +
scale_color_brewer(palette = "Set3") +
labs(
title = "K-means clusters — Mercator (EPSG:3857)",
x = "Easting (metres)",
y = "Northing (metres)",
color = "Cluster"
)
Adapt the code once more for the NAD83 projection.
Change both CRS() calls to
"+init=epsg:4269".
# copy BigCities to a working object
d3 <- BigCities
# convert data.frame to a SpatialPointsDataFrame
coordinates(d3) <- 1:2
# declare the source CRS as WGS84
proj4string(d3) <- CRS("+init=epsg:4326")
# define the target CRS as NAD83 (EPSG:4269)
CRS.new3 <- CRS("+init=epsg:4269")
# apply the coordinate transformation
d3.new <- spTransform(d3, CRS.new3)
# print the PROJ.4 string
proj4string(d3.new) %>% strwrap()
## [1] "+proj=longlat +datum=NAD83 +no_defs"
# run k-means on the NAD83-projected coordinates
city_clusts3 <- as.data.frame(d3.new) %>%
kmeans(centers = 6) %>%
fitted("classes") %>%
as.character()
# attach cluster labels
df3.new <- as.data.frame(d3.new) %>%
mutate(cluster = city_clusts3)
# plot
df3.new %>%
ggplot(aes(x = coords.x1, y = coords.x2)) +
geom_point(aes(color = cluster), alpha = 0.5) +
scale_color_brewer(palette = "Set3") +
labs(
title = "K-means clusters — NAD83 (EPSG:4269)",
x = "Longitude (degrees)",
y = "Latitude (degrees)",
color = "Cluster"
)
Compare your four plots (raw Cartesian, WGS84, Mercator, NAD83). Write 4–6 sentences addressing:
Answer: Across all four plots, the k-means algorithm successfully recovers recognisable continental-scale geographic regions from coordinates alone — broadly identifying Asia, Europe, Africa, North America, and South America as distinct clusters in each case. The raw Cartesian, WGS84 (EPSG:4326), and NAD83 (EPSG:4269) plots look nearly identical because all three represent coordinates in angular degrees (longitude/latitude); WGS84 and NAD83 are both geographic coordinate systems that differ only slightly in their reference ellipsoid, producing clusters that are visually indistinguishable at a global scale. The Mercator (EPSG:3857) plot differs most noticeably: because Mercator stretches areas near the poles (inflating distances in metres at high latitudes), the Euclidean distances used by k-means weight northern-hemisphere cities differently, which can shift cluster boundaries — for example, European and Russian cities may merge with or separate from Asian cities differently than under the degree-based projections. The WGS84 projection arguably produces the most geographically interpretable clusters for a global dataset because it is the universal standard for GPS coordinates and preserves the natural angular relationships among cities worldwide without distorting polar regions the way Mercator does. Ultimately, projection choice matters most when the area of interest spans large latitudinal ranges, since Mercator’s exaggeration of high-latitude distances can mislead Euclidean clustering even though it has no effect on Equatorial regions.
Good luck! Submit both your .Rmd source file and the
knitted .html output.