Complete Task 1 (required, 60 pts) and one of Task 2 or Task 3 (your choice, 40 pts each).
| Task | Points | Topic |
|---|---|---|
| Task 1 | 60 pts | Supervised classification (k-NN, Naive Bayes, Logistic Regression) |
| Task 2 | 40 pts | Hierarchical clustering — vehicle fuel economy |
| Task 3 | 40 pts | K-means clustering — geospatial world cities |
Extra Credit (5 pts): If your .Rmd file
knits successfully without errors, you will receive 5 bonus points. Make
sure all your code runs from top to bottom before submitting.
Submission: Upload this
.Rmdfile and the knitted.htmlfile to the course portal by the deadline.
Install any packages listed below that you do not already have using
install.packages("package_name") in the Console before
knitting.
# --- Core data wrangling and visualisation ---
library(tidyverse) # dplyr, ggplot2, tidyr, purrr, etc.
# --- Modelling framework ---
library(tidymodels) # unified interface: recipes, workflows, yardstick
# --- Classification engines ---
library(discrim) # naive_Bayes() classifier
library(klaR) # engine underlying discrim's Naive Bayes
library(kknn) # engine for k-nearest neighbours
# --- Clustering and spatial ---
library(mclust) # k-means support functions
library(ape) # phylogenetic tree plots for dendrograms
library(sp) # spatial objects and coordinate transforms
library(sf) # modern simple features spatial data
# --- Data and utilities ---
library(mdsr) # course datasets, including world_cities
library(readxl) # read Excel (.xlsx) files
library(janitor) # clean_names() for tidy column names
library(broom) # tidy() for model coefficient summaries
A marketing analyst might be interested in finding factors that can
be used to predict whether a potential customer is a high-earner. The
1994 United States Census provides information that can inform such a
model, with records from 32,561 adults that include a
binary variable indicating whether each person makes greater or less
than $50,000 (more than $80,000 today after
accounting for inflation). This is our response variable.
The code below loads and prepares the data. Run it as-is — you do not need to modify it.
Note: If the data is not accessible via the url
address, load via downloading it from the RData file provided in the
project page: load("census.RData") in the same folder where
you saved the project template .Rmd. The line is commented line 96 in
the code chunk below. You can uncomment it if you have the RData
file.
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
census <- read_csv(
url,
col_names = c(
"age", "workclass", "fnlwgt", "education",
"education_1", "marital_status", "occupation", "relationship",
"race", "sex", "capital_gain", "capital_loss", "hours_per_week",
"native_country", "income"
)
) %>%
mutate(
income = factor(income),
# income_ind: binary indicator — 1 = high earner (>50K), 0 = low earner
income_ind = as.numeric(income == ">50K")
)
#load("census.RData") # load the data if you have saved it in RData format
# inspect variable names, types, and first few values
glimpse(census)
## Rows: 32,561
## Columns: 16
## $ age <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,…
## $ workclass <chr> "State-gov", "Self-emp-not-inc", "Private", "Private", …
## $ fnlwgt <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2…
## $ education <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors…
## $ education_1 <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,…
## $ marital_status <chr> "Never-married", "Married-civ-spouse", "Divorced", "Mar…
## $ occupation <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners",…
## $ relationship <chr> "Not-in-family", "Husband", "Not-in-family", "Husband",…
## $ race <chr> "White", "White", "White", "Black", "Black", "White", "…
## $ sex <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Fe…
## $ capital_gain <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, …
## $ capital_loss <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hours_per_week <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,…
## $ native_country <chr> "United-States", "United-States", "United-States", "Uni…
## $ income <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >50K, …
## $ income_ind <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…
Split the dataset into two pieces at random:
train (training
set)test (testing /
hold-out set)Use set.seed(364) to ensure your results are
reproducible.
Report: How many records are in the testing set?
Hint: Use initial_split(prop = 0.7),
then extract with training() and
testing().
set.seed(364)
census_parts <- initial_split(census, prop = 0.7, strata = income)
train <- training(census_parts)
test <- testing(census_parts)
train <- train %>%
mutate(across(everything(), ~ {
if (is.list(.x)) unlist(.x) else .x
})) %>%
as_tibble()
test <- test %>%
mutate(across(everything(), ~ {
if (is.list(.x)) unlist(.x) else .x
})) %>%
as_tibble()
nrow(test)
## [1] 9769
Note: Around 24% of those in the sample make more
than $50k. This means the null model — always
predicting ≤$50k — achieves about 76% accuracy. Your models should beat
this baseline.
The code below computes and prints the proportion of high earners in
the training set. Run it after creating train.
pi_bar <- train %>%
dplyr::count(income) %>%
mutate(
n = as.numeric(n), # force numeric
pct = n / sum(n)
) %>%
filter(income == ">50K") %>%
pull(pct)
print(c("Proportion earning >$50K in training set:", round(pi_bar, 4)))
## [1] "Proportion earning >$50K in training set:"
## [2] "0.2408"
Pro Tip: Always benchmark your predictive models against a reasonable null model.
Use the k-NN algorithm to classify observations as low earner
(<=50K) or high earner (>50K).
Requirements:
age, education_1, capital_gain,
capital_loss, hours_per_week (k-NN requires
numeric inputs).mode = "classification" and
neighbors = 1.scale = TRUE so that no single variable dominates
due to its units.Hint: See Programming Exercises Week 13 for the
nearest_neighbor() / kknn workflow.
train_q <- train %>%
dplyr::select(income, age, education_1, capital_gain, capital_loss, hours_per_week) %>%
mutate(across(where(is.numeric), as.numeric))
library(kknn)
set.seed(364)
knn_model <- kknn(income ~ .,
train = train_q,
test = train_q,
k = 1,
scale = TRUE)
knn_pred <- fitted(knn_model)
# Confusion matrix
conf_mat_knn <- table(Predicted = knn_pred, Actual = train_q$income)
print(conf_mat_knn)
## Actual
## Predicted <=50K >50K
## <=50K 17151 2603
## >50K 153 2885
# Accuracy
accuracy_knn <- mean(knn_pred == train_q$income)
cat("k-NN Training Accuracy:", round(accuracy_knn, 4), "\n")
## k-NN Training Accuracy: 0.8791
knn_train_pred <- tibble(
.pred_class = knn_pred,
income = train_q$income
)
knn_train_pred %>%
accuracy(truth = income, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.879
The following formula object is provided for you. It will be used in parts c) and d). Run this chunk as-is.
# reusable formula — includes all categorical and numeric predictors
form <- as.formula(
"income ~ age + workclass + education + marital_status +
occupation + relationship + race + sex +
capital_gain + capital_loss + hours_per_week"
)
form
## income ~ age + workclass + education + marital_status + occupation +
## relationship + race + sex + capital_gain + capital_loss +
## hours_per_week
Build a Naive Bayes classifier using the formula object
form defined above. Print the confusion
matrix and report the accuracy on the training
set.
Hint: See Programming Exercises Week 13 for the
naive_Bayes() / klaR workflow.
nb_spec <- naive_Bayes(mode = "classification") %>%
set_engine("klaR")
nb_wf <- workflow() %>%
add_model(nb_spec) %>%
add_formula(form)
nb_fit <- nb_wf %>% fit(data = train)
nb_train_pred <- predict(nb_fit, train) %>%
bind_cols(train)
nb_conf_mat <- nb_train_pred %>%
conf_mat(truth = income, estimate = .pred_class)
nb_conf_mat
## Truth
## Prediction <=50K >50K
## <=50K 17001 3748
## >50K 303 1740
# Accuracy
nb_train_pred %>%
accuracy(truth = income, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.822
Use logistic regression to model the probability of being a
high earner (income_ind = 1). The binary response
variable income_ind was created during data
preparation.
Step 1: Inspect the four exploratory plots below to assess which predictors are useful. The code is provided — run it as-is.
# base plot: estimated probability of high income vs age
log_plot <- ggplot(data = census, aes(x = age, y = income_ind)) +
geom_jitter(alpha = 0.1, height = 0.05) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ylab("P(income > $50K)")
log_plot + xlab("Age (years)")
log_plot + aes(x = education_1) +
xlab("Education level (1 = lowest, 16 = highest)")
log_plot + aes(x = sex) +
xlab("Sex")
log_plot + aes(x = marital_status) +
xlab("Marital status")
Q: Which variables appear to be important predictors of high income? Briefly explain what each plot tells you.
Your answer here (2–4 sentences):
Step 2: Fit a logistic regression model using
glm() with family = "binomial". Use at minimum
age, education_1, and sex as
predictors. You may optionally include marital_status or
other variables. Print a tidy coefficient table with
tidy().
Hint: The response variable is
income_ind (0/1), not the factor income.
# Logistic model
logit_model <- glm(income_ind ~ age + education_1 + sex + marital_status,
family = binomial, data = train)
tidy(logit_model, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 10 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.000451 0.144 -53.6 0 0.000339 0.000596
## 2 age 1.02 0.00165 14.2 1.14e- 45 1.02 1.03
## 3 education_1 1.50 0.00874 46.3 0 1.47 1.53
## 4 sexMale 1.37 0.0527 6.00 1.95e- 9 1.24 1.52
## 5 marital_statusMarr… 8.96 0.528 4.15 3.28e- 5 2.98 24.5
## 6 marital_statusMarr… 7.13 0.0705 27.9 6.53e-171 6.22 8.20
## 7 marital_statusMarr… 0.878 0.223 -0.584 5.59e- 1 0.556 1.34
## 8 marital_statusNeve… 0.490 0.0880 -8.11 5.06e- 16 0.412 0.582
## 9 marital_statusSepa… 0.771 0.169 -1.54 1.24e- 1 0.547 1.06
## 10 marital_statusWido… 0.814 0.160 -1.29 1.97e- 1 0.591 1.11
Step 3: Generate predicted probabilities on the training set, apply a 0.5 cutoff to classify each observation, and report the confusion matrix and accuracy.
# Training evaluation
train_pred_prob <- predict(logit_model, train, type = "response")
train_pred_class <- ifelse(train_pred_prob > 0.5, ">50K", "<=50K")
conf_mat_train <- table(Predicted = train_pred_class, Actual = train$income)
conf_mat_train
## Actual
## Predicted <=50K >50K
## <=50K 16064 2847
## >50K 1240 2641
# Accuracy
mean(train_pred_class == train$income)
## [1] 0.8206827
Apply the trained logistic regression model to the held-out
test set. Report the confusion matrix and test-set
accuracy.
Note: The test-set accuracy is the honest estimate of how well the model generalises to new data.
test_pred_prob <- predict(logit_model, test, type = "response")
test_pred_class <- ifelse(test_pred_prob > 0.5, ">50K", "<=50K")
conf_mat_test <- table(Predicted = test_pred_class, Actual = test$income)
conf_mat_test
## Actual
## Predicted <=50K >50K
## <=50K 6880 1292
## >50K 536 1061
# Test accuracy
mean(test_pred_class == test$income)
## [1] 0.8128775
Which of the three classifiers (k-NN, Naive Bayes, logistic regression) achieved the highest test-set accuracy? Why is test-set accuracy a better comparison criterion than training-set accuracy?
Your answer here (3–5 sentences): Test-set accuracy is the honest performance metric because it evaluates on unseen data, revealing generalization ability and avoiding overfitting bias that inflates training accuracy. Logistic regression balances interpretability and performance best here
K-means assigns each observation to one of K groups by minimising within-cluster distances to centroids. Unlike hierarchical clustering, K must be specified in advance and no dendrogram is produced.
Geospatial example: The world_cities
dataset (from mdsr) contains the latitude and longitude of
thousands of cities — but no continent or country
labels. The question is: can k-means recover continent-level
geographic structure from raw coordinates alone?
# select the 4,000 most populous cities; keep only coordinates
BigCities <- world_cities %>%
arrange(desc(population)) %>%
head(4000) %>%
dplyr::select(longitude, latitude)
glimpse(BigCities)
## Rows: 4,000
## Columns: 2
## $ longitude <dbl> 121.45806, 28.94966, -58.37723, 72.88261, -99.12766, 116.397…
## $ latitude <dbl> 31.22222, 41.01384, -34.61315, 19.07283, 19.42847, 39.90750,…
Note that the data contain no city names, country labels, or continent information — only coordinates.
set.seed(15)
# assign cluster labels (k = 6)
city_clusts <- BigCities %>%
kmeans(centers = 6) %>%
fitted("classes") %>%
as.character()
# more stable run: try 10 random starts and keep the best solution
km <- kmeans(BigCities, centers = 6, nstart = 10)
# number of cities in each cluster
km$size
## [1] 726 554 984 392 356 988
# mean longitude and latitude of each cluster centre
km$centers
## longitude latitude
## 1 75.14407 27.43226
## 2 -94.47442 31.07927
## 3 120.38618 23.72534
## 4 -57.10048 -15.21344
## 5 18.91076 -1.12440
## 6 18.77544 45.37853
# attach cluster labels for plotting
BigCities <- BigCities %>%
mutate(cluster = city_clusts)
# plot in standard Cartesian coordinates
BigCities %>%
ggplot(aes(x = longitude, y = latitude)) +
geom_point(aes(color = cluster), alpha = 0.5) +
labs(
title = "K-means clusters — raw longitude/latitude (no projection)",
x = "Longitude",
y = "Latitude",
color = "Cluster"
)
What geographic regions did the k-means algorithm identify from
coordinates alone? Name each cluster’s approximate real-world region
based on the plot and the cluster centroids in
km$centers.
Your answer here (3–5 sentences):
Cluster 1: North America
Cluster 2: South America
Cluster 3: Europe and parts of west asia
Cluster 4: Africa
Cluster 5: Remaing parts of Asia and Australia/Oceania
Cluster 6: Remaining areas
The Earth is a three-dimensional spheroid. A map projection converts it to a flat two-dimensional representation. Because k-means uses Euclidean distance, the choice of projection can change which cities are considered “close” and therefore affect cluster boundaries.
The three most common coordinate reference systems (CRS), identified by their EPSG codes, are:
| EPSG code | Name | Used by |
|---|---|---|
| EPSG:4326 | WGS84 | GPS, Google Earth |
| EPSG:3857 | Mercator | Google Maps, OpenStreetMap |
| EPSG:4269 | NAD83 | US federal agencies |
Apply k-means with K = 6 under each of the three projections below. For each one:
labs() title).Note: Each correctly produced and labelled cluster plot is worth 10 pts.
Hint: The full workflow for EPSG:4326 is shown below. Adapt the EPSG code and axis labels for the remaining two projections.
# copy BigCities to a working object (coordinates() modifies the object in place)
d <- BigCities
# convert data.frame to a SpatialPointsDataFrame using columns 1 (longitude) and 2 (latitude)
coordinates(d) <- 1:2
# declare the source CRS as WGS84
proj4string(d) <- CRS("+init=epsg:4326")
# define the target CRS (same as source for EPSG:4326)
CRS.new <- CRS("+init=epsg:4326")
# apply the coordinate transformation
d.new <- spTransform(d, CRS.new)
# print the full PROJ.4 string for your reference
proj4string(d.new) %>% strwrap()
## [1] "+proj=longlat +datum=WGS84 +no_defs"
# run k-means on the projected coordinates
city_clusts <- as.data.frame(d.new) %>%
kmeans(centers = 6) %>%
fitted("classes") %>%
as.character()
# attach cluster labels
df.new <- as.data.frame(d.new) %>%
mutate(
cluster = city_clusts,
longitude = coords.x1,
latitude = coords.x2
)
# plot — coords.x1 and coords.x2 are the projected coordinate columns
df.new %>%
ggplot(aes(x = coords.x1, y = coords.x2)) +
geom_point(aes(color = cluster), alpha = 0.5) +
scale_color_brewer(palette = "Set3") +
labs(
title = "K-means clusters — WGS84 (EPSG:4326)",
x = "Longitude",
y = "Latitude",
color = "Cluster"
)
Adapt the EPSG:4326 code above for the Mercator
projection. Change the EPSG code in both CRS() calls to
"+init=epsg:3857" and update the plot axis labels (Mercator
units are metres, not degrees).
d <- BigCities
coordinates(d) <- 1:2
proj4string(d) <- CRS("+init=epsg:4326")
CRS.new <- CRS("+init=epsg:3857")
d.new <- spTransform(d, CRS.new)
city_clusts <- as.data.frame(d.new) %>%
kmeans(centers = 6) %>%
fitted("classes") %>%
as.character()
df.new <- as.data.frame(d.new) %>%
mutate(cluster = city_clusts,
longitude = coords.x1,
latitude = coords.x2)
df.new %>%
ggplot(aes(x = coords.x1, y = coords.x2)) +
geom_point(aes(color = cluster), alpha = 0.5) +
scale_color_brewer(palette = "Set3") +
labs(title = "K-means clusters — Mercator (EPSG:3857)",
x = "Easting (meters)", y = "Northing (meters)", color = "Cluster")
Adapt the code once more for the NAD83 projection.
Change both CRS() calls to
"+init=epsg:4269".
d <- BigCities
coordinates(d) <- 1:2
proj4string(d) <- CRS("+init=epsg:4326")
CRS.new <- CRS("+init=epsg:4269")
d.new <- spTransform(d, CRS.new)
city_clusts <- as.data.frame(d.new) %>%
kmeans(centers = 6) %>%
fitted("classes") %>%
as.character()
df.new <- as.data.frame(d.new) %>%
mutate(cluster = city_clusts,
longitude = coords.x1,
latitude = coords.x2)
df.new %>%
ggplot(aes(x = coords.x1, y = coords.x2)) +
geom_point(aes(color = cluster), alpha = 0.5) +
scale_color_brewer(palette = "Set3") +
labs(title = "K-means clusters — NAD83 (EPSG:4269)",
x = "Longitude", y = "Latitude", color = "Cluster")
Compare your four plots (raw Cartesian, WGS84, Mercator, NAD83). Write 4–6 sentences addressing:
Your answer here: All projections recover recognizable continental clusters, confirming k-means captures geographic structure from coordinates alone. Raw Cartesian and WGS84 show similar but distorted groupings. Mercator exaggerates polar areas but keeps continents distinct. NAD83 (North America-focused) is nearly identical to WGS84 for global data. WGS84/Raw produce the most interpretable clusters for this global dataset because they preserve relative continental separation without strong regional bias. Differences arise mainly at high latitudes due to projection distortions affecting Euclidean distances.
Good luck! Submit both your .Rmd source file and the
knitted .html output.