Part 1

For this part of the lab, work in groups of 3-5. This will mainly be an in-class activity, but your group may be asked to share your thoughts with the rest of the class.

• Identify four real-life applications of supervised and unsupervised problems. Think about activities you do on a regular basis (e.g., shopping on Amazon, watching shows on Netflix, using Google Maps for navigation) and how supervised and/or unsupervised learning may be applied.

Spam Email Detection: Classifies emails as spam or not spam based on past labeled examples (supervised learning).

Handwriting Recognition: Identifies handwritten characters from images, often used in digitizing documents (supervised learning).

Grouping News Articles: Clusters similar news articles together based on content, without predefined labels (unsupervised learning).

Market Basket Analysis: Identifies patterns in shopping carts to suggest items that are frequently bought together (unsupervised learning).

• What benefits does machine learning bring to these problems/activities? How does machine learning improve your experience with these activities, or how would it improve the organization's capabilities?

Machine learning improves efficiency, accuracy, and personalization across these activities by automating processes and identifying patterns in data. It enhances the user experience by providing relevant recommendations, filtering unwanted content, and streamlining everyday tasks. For organizations, it boosts productivity, improves decision-making, and optimizes operations.

• Explain what makes these problems supervised versus unsupervised.

Supervised Learning: Uses labeled data, where the model learns from examples with known answers. For example, spam email detection uses labeled emails to teach the model to identify spam.

Unsupervised Learning: Uses data without labels, letting the model find patterns on its own. For example, market basket analysis finds items that are often bought together without knowing the relationships in advance.

• For each problem identify the target variable (if applicable) and potential feature variables that could be used. How do you think this data gets collected?

Spam Email Detection. Target: spam or not spam. Features: email content, sender, subject, keywords. Data collection: emails are labeled as spam or not spam based on user feedback or existing filters.

Handwriting Recognition. Target: the handwritten character or word. Features: image pixels, stroke patterns, pen pressure. Data collection: data is collected from scanned documents or handwriting datasets with labeled characters.

Grouping News Articles. Target: none (unsupervised). Features: article text, keywords, source, topic tags. Data collection: articles are collected from news sites, APIs, and aggregators.

Market Basket Analysis. Target: none (unsupervised). Features: items bought, purchase time, customer data. Data collection: transaction data is collected from point-of-sale systems or online shopping platforms.

• For each of these applications could you foresee any ethical concerns in using machine learning? Could machine learning (or maybe the data collection process) be misused in any way?

One ethical concern with spam email detection is that it might mistakenly flag important emails as spam, causing users to miss crucial information. The data collection process can also be misused if email providers track user behavior or read personal messages to improve their models without consent. Additionally, using machine learning to target specific users with excessive personalized ads could invade their privacy and introduce unwanted bias into the email filtering process.
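
The supervised versus unsupervised distinction discussed in this part can be illustrated with a minimal base R sketch. It uses the built-in iris data as a stand-in, which is an assumption made only for illustration; none of the four applications above is modeled here. The supervised model is handed a known outcome column, while the clustering step sees only the features.

```r
# Supervised: the outcome (Petal.Length) is known and is given to the model.
supervised_fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
head(predict(supervised_fit, newdata = iris))

# Unsupervised: only the feature columns are supplied; no label is used.
set.seed(123)  # k-means starts from random centers
features <- scale(iris[, c("Sepal.Length", "Sepal.Width",
                           "Petal.Length", "Petal.Width")])
clusters <- kmeans(features, centers = 3, nstart = 10)
table(clusters$cluster)  # how many rows fell into each discovered group
```

The choice of three clusters is itself an assumption; in an unsupervised problem the number of groups is usually not known in advance.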

Part 2

Other variables used to help make predictions of cmedv include:

• lon: longitude of census tract
• lat: latitude of census tract
• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000 sq.ft
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• nox: nitric oxides concentration (parts per 10 million), i.e., air pollution
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted distances to five Boston employment centers
• rad: index of accessibility to radial highways

chooseCRANmirror(graphics = FALSE, ind = 1)
install.packages("tidymodels")

## Installing package into 'C:/Users/Senge/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## package 'tidymodels' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Senge\AppData\Local\Temp\RtmpYJn00W\downloaded_packages

install.packages("rlang")

## Installing package into 'C:/Users/Senge/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## package 'rlang' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'rlang'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\Senge\AppData\Local\R\win-library\4.4\00LOCK\rlang\libs\x64\rlang.dll
## to C:\Users\Senge\AppData\Local\R\win-library\4.4\rlang\libs\x64\rlang.dll:
## Permission denied

## Warning: restored 'rlang'

## 
## The downloaded binary packages are in
##  C:\Users\Senge\AppData\Local\Temp\RtmpYJn00W\downloaded_packages

packageVersion("rlang")

## [1] '1.1.5'

install.packages("kknn")

## Installing package into 'C:/Users/Senge/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)

## package 'kknn' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'kknn'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\Senge\AppData\Local\R\win-library\4.4\00LOCK\kknn\libs\x64\kknn.dll to
## C:\Users\Senge\AppData\Local\R\win-library\4.4\kknn\libs\x64\kknn.dll:
## Permission denied

## Warning: restored 'kknn'

## 
## The downloaded binary packages are in
##  C:\Users\Senge\AppData\Local\Temp\RtmpYJn00W\downloaded_packages

# Rtools (the Windows build toolchain) is not a CRAN package, so it cannot be
# installed with install.packages(); if source builds are needed, install it
# with the standalone installer from the CRAN website.
library(tidyverse)

## Warning: package 'purrr' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.4.3

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.0     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1

## Warning: package 'dials' was built under R version 4.4.3

## Warning: package 'infer' was built under R version 4.4.3

## Warning: package 'modeldata' was built under R version 4.4.3

## Warning: package 'parsnip' was built under R version 4.4.3

## Warning: package 'recipes' was built under R version 4.4.3

## Warning: package 'rsample' was built under R version 4.4.3

## Warning: package 'tune' was built under R version 4.4.3

## Warning: package 'workflows' was built under R version 4.4.3

## Warning: package 'workflowsets' was built under R version 4.4.3

## Warning: package 'yardstick' was built under R version 4.4.3

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

Question 1

Is this a supervised or unsupervised learning problem? Why?

Question 2

There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)?

Question 3

Given the type of variable cmedv is, is this a regression or classification problem?

Question 4

Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What are the minimum and maximum values of cmedv? What is the average cmedv value?

boston <- readr::read_csv("boston.csv")

## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sum(is.na(boston))

## [1] 0
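
A single total of zero answers the question here, but when the total is not zero it helps to see which columns are affected. The sketch below uses a small hypothetical data frame (`toy`, not the Boston data) to show the total count, the per-column count, and summaries that skip missing entries.

```r
# Hypothetical toy data with two missing entries (not the Boston data).
toy <- data.frame(
  rm    = c(6.5, NA, 5.9, 7.1),
  cmedv = c(24.0, 21.6, NA, 33.4)
)

sum(is.na(toy))                 # total number of missing cells
colSums(is.na(toy))             # missing cells per column
range(toy$cmedv, na.rm = TRUE)  # minimum and maximum, ignoring NA
mean(toy$cmedv, na.rm = TRUE)   # average, ignoring NA
```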

summary(boston$cmedv)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

There are no missing values. cmedv ranges from a minimum of 5.00 to a maximum of 50.00, with an average of 22.53.

Question 5

set.seed(123)
boston_split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(boston_split)
test <- testing(boston_split)

Question 6

boston_split

## <Training/Testing/Total>
## <352/154/506>
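
The split puts 352 of the 506 rows (about 69.6%) in the training set, slightly under the requested 70%. A likely reason is that with strata = cmedv the rows are sampled within bins of the response rather than from the data as a whole, so the per-bin counts are rounded separately. The base R sketch below illustrates that idea on the built-in mtcars data as a stand-in. The quartile bins and the rounding rule are assumptions for illustration, not a description of how rsample implements it.

```r
set.seed(123)
y <- mtcars$mpg  # stand-in response (assumption: not the Boston data)

# Bin the response into quartiles so each bin can be sampled separately.
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Within each bin, draw roughly 70% of the row indices for training.
train_idx <- unlist(
  lapply(split(seq_along(y), bins), function(idx) {
    idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
  }),
  use.names = FALSE
)
test_idx <- setdiff(seq_along(y), train_idx)

length(train_idx) / length(y)  # realized training proportion
summary(y[train_idx])          # the two sets should cover a similar range
summary(y[test_idx])
```

Sampling within bins keeps the distribution of the response similar in the training and test sets, which is the point of stratifying.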

Question 7

ggplot(mapping = aes(x = cmedv)) +
  geom_histogram(data = train, binwidth = 1, fill = "blue", alpha = 0.5) +
  geom_histogram(data = test, binwidth = 1, fill = "red", alpha = 0.5)

Question 8

lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)

lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.83
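
The RMSE reported by rmse() is the square root of the average squared prediction error, so it is in the same units as cmedv. As a small worked example, the calculation can be done by hand in base R. The `truth` and `pred` vectors below are hypothetical values chosen for illustration, not results from the Boston data.

```r
# Hypothetical observed and predicted values (not from the Boston data).
truth <- c(3, 5, 7)
pred  <- c(2, 5, 9)

# Root mean squared error: square the errors, average them, take the root.
errors      <- truth - pred          # 1, 0, -2
rmse_manual <- sqrt(mean(errors^2))  # sqrt(5 / 3)
rmse_manual
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.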

Question 9

lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)

lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83

Question 10

knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)

## Warning: package 'kknn' was built under R version 4.4.3

knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        3.37
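
To make the idea behind the nearest-neighbor model concrete, here is a minimal base R sketch of k-nearest-neighbor regression: a new point is predicted by averaging the responses of the k closest training rows. The helper `knn_predict` is hypothetical and written only for illustration, and mtcars is used as stand-in data. The kknn engine used above may weight neighbors by distance, so this unweighted version should not be expected to reproduce its predictions.

```r
# Predict one new point by averaging the responses of its k nearest rows.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  diffs   <- sweep(train_x, 2, new_x)  # subtract new_x from every row
  dists   <- sqrt(rowSums(diffs^2))    # Euclidean distance to each row
  nearest <- order(dists)[seq_len(k)]  # indices of the k closest rows
  mean(train_y[nearest])
}

# Stand-in data (mtcars), with predictors standardized so that no single
# variable dominates the distance.
x <- scale(as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg

# Hold out the first car, predict it from the other 31, and compare.
knn_predict(x[-1, ], y[-1], x[1, ], k = 5)
y[1]
```

Smaller values of k follow the training data more closely, while larger values average over more neighbors and give smoother predictions.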

Module 8 Lab

Seth Engelhardt

2025-03-04