Part 1

Four real-life applications of supervised and unsupervised learning:

  1. Supervised Learning: Spam detection
  1. Supervised Learning: Facial recognition

Part 2

Other variables used to help make predictions of cmedv include:
lon: longitude of census tract
lat: latitude of census tract
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration (parts per 10 million), a measure of air pollution
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centers
rad: index of accessibility to radial highways

Prerequisites:

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6     ✔ recipes      1.1.0
## ✔ dials        1.3.0     ✔ rsample      1.2.1
## ✔ dplyr        1.1.4     ✔ tibble       3.2.1
## ✔ ggplot2      3.5.1     ✔ tidyr        1.3.1
## ✔ infer        1.0.7     ✔ tune         1.2.1
## ✔ modeldata    1.4.0     ✔ workflows    1.1.4
## ✔ parsnip      1.2.1     ✔ workflowsets 1.1.0
## ✔ purrr        1.0.2     ✔ yardstick    1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(kknn)

Modeling Tasks:

  1. Is this a supervised or unsupervised learning problem? Why?
  • This is a supervised learning problem because cmedv is a known target variable, and the model uses the feature variables to predict its value.
  1. There are 16 variables in this dataset. Which variable is the response variable and which variables are the predictor variables?
  • The response variable is cmedv. The remaining 15 variables are the predictors, including lon, lat, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, and lstat.
  1. Given the type of variable cmedv is, is this a regression or classification problem?
  • This is a regression problem because cmedv is a continuous numerical variable.
  1. Import the Boston housing data set. Are there any missing values? What are the minimum and maximum values of cmedv, and what is the average?
# Import the Boston housing data set
boston <- readr::read_csv("Data/boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check for missing values
sum(is.na(boston))
## [1] 0
# What are the min, max, and average values of cmedv
summary(boston$cmedv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00
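
The call `sum(is.na(boston))` reports only the total number of missing cells. As a minimal sketch, the snippet below breaks that count down by column and pulls out the minimum, mean, and maximum directly. The `toy` data frame and its values are made up for illustration; with the real data you would substitute `boston`.

```r
# Hypothetical toy data standing in for `boston` (values are made up)
toy <- data.frame(
  cmedv = c(24.0, 21.6, NA, 33.4),
  rm    = c(6.58, 6.42, 7.19, NA),
  crim  = c(0.006, 0.027, 0.027, 0.032)
)

# Total number of missing cells, the same check used above
total_missing <- sum(is.na(toy))

# Missing values per column, which shows where the gaps are
missing_by_col <- colSums(is.na(toy))
print(missing_by_col)

# Min, mean, and max of the response; na.rm = TRUE skips missing values
cmedv_stats <- c(
  min  = min(toy$cmedv, na.rm = TRUE),
  mean = mean(toy$cmedv, na.rm = TRUE),
  max  = max(toy$cmedv, na.rm = TRUE)
)
print(cmedv_stats)
```

On the real data the per-column breakdown would be all zeros, since the total is 0.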
  1. Split the data into a training set and test set using a 70-30% split.
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
  1. How many observations are in the training set and test set?
split
## <Training/Testing/Total>
## <352/154/506>
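
The split above passes `strata = cmedv`, which asks for the response to be distributed similarly in both sets. The base R sketch below illustrates the general idea of stratifying on a numeric variable: bin it into quartiles, then sample roughly 70% within each bin. It is a rough illustration under stated assumptions, not rsample's actual implementation, and the vector `y` is simulated as a hypothetical stand-in for `boston$cmedv`.

```r
set.seed(123)

# Simulated response, a hypothetical stand-in for boston$cmedv
y <- round(runif(506, min = 5, max = 50), 1)

# Bin the numeric response into quartiles
bins <- cut(
  y,
  breaks = quantile(y, probs = seq(0, 1, by = 0.25)),
  include.lowest = TRUE
)

# Within each bin, sample about 70% of the row indices for training
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
}))

# Everything not selected for training goes to the test set
test_idx <- setdiff(seq_along(y), train_idx)

length(train_idx)
length(test_idx)
```

Because the sample size is rounded within each bin, the overall proportion ends up close to, but not exactly, 70%.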
  1. Compare the distribution of cmedv between the training set and test set. Do they have the same distribution or differ significantly?
ggplot(mapping = aes(x = cmedv)) +
  geom_histogram(data = train, binwidth = 1, fill = "red", alpha = 0.5) + 
  geom_histogram(data = test, binwidth = 1, fill = "blue", alpha = 0.5)
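
Because the training set (352 rows) is more than twice the size of the test set (154 rows), overlaid count histograms will differ in height even when the underlying distributions match. A numeric comparison can complement the plot. The sketch below puts summary statistics side by side and runs a two-sample Kolmogorov-Smirnov test from base R. The vectors `train_y` and `test_y` are simulated, hypothetical stand-ins for `train$cmedv` and `test$cmedv`, and `six_num` is a helper introduced only for this example.

```r
set.seed(123)

# Simulated values, hypothetical stand-ins for train$cmedv and test$cmedv
train_y <- rnorm(352, mean = 22.5, sd = 9)
test_y  <- rnorm(154, mean = 22.5, sd = 9)

# Helper (hypothetical): min, quartiles, mean, and max of a numeric vector
six_num <- function(x) {
  c(
    min    = min(x),
    q1     = quantile(x, 0.25, names = FALSE),
    median = median(x),
    mean   = mean(x),
    q3     = quantile(x, 0.75, names = FALSE),
    max    = max(x)
  )
}

# One row per data set, so the two can be compared at a glance
comparison <- rbind(train = six_num(train_y), test = six_num(test_y))
print(round(comparison, 2))

# Two-sample Kolmogorov-Smirnov test; a large p-value gives no evidence
# that the two samples come from different distributions
ks <- ks.test(train_y, test_y)
print(ks$p.value)
```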

  1. Fit a linear regression model using the rm feature variable. Predict cmedv and compute RMSE on the test data. What is the test set RMSE?
# Fit model
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)

# Compute the RMSE on the test data
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.83
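
RMSE is the square root of the mean squared difference between observed and predicted values, sqrt(mean((y - y_hat)^2)). As a minimal sketch of what the `rmse()` call computes, the snippet below does the same calculation by hand with base R. The built-in `mtcars` data and the 22-row training sample are assumptions used only so the example runs without `Data/boston.csv`.

```r
set.seed(123)

# mtcars is a built-in data set, used here only as a stand-in for the Boston data
train_rows <- sample(nrow(mtcars), size = 22)
toy_train  <- mtcars[train_rows, ]
toy_test   <- mtcars[-train_rows, ]

# Single-predictor linear regression fit on the training rows
fit <- lm(mpg ~ wt, data = toy_train)

# Predictions for the held-out rows
pred <- predict(fit, newdata = toy_test)

# RMSE by hand: square the errors, average them, take the square root
rmse_manual <- sqrt(mean((toy_test$mpg - pred)^2))
print(rmse_manual)
```

The result is in the same units as the response, which makes it easier to interpret than the mean squared error.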
  1. Fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE, and is this better than the previous model’s performance?
# Fit model
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)

# Compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83

Yes. The test RMSE falls from 6.83 to 4.83, so this model performs better than the previous one.

  1. Fit a K-nearest neighbor model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE, and is this better than the previous two models’ performances?
# Fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)

# Compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        3.37

Yes. The test RMSE of 3.37 is lower than that of both previous models (6.83 and 4.83).
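
To put the three results side by side, the sketch below collects the test RMSE values reported above and computes each model's percentage reduction relative to the single-feature baseline. The data frame `results` and its column names are introduced only for this example.

```r
# Test-set RMSE values copied from the output above
results <- data.frame(
  model = c("lm: cmedv ~ rm", "lm: cmedv ~ .", "knn: cmedv ~ ."),
  rmse  = c(6.83, 4.83, 3.37)
)

# Percentage reduction in RMSE relative to the first (baseline) model
baseline <- results$rmse[1]
results$pct_reduction <- round(100 * (baseline - results$rmse) / baseline, 1)

# Sort from best (lowest RMSE) to worst
results[order(results$rmse), ]
```

On this single train/test split, the full linear model cuts the baseline RMSE by about 29% and the K-nearest neighbor model by about 51%.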