Four real-life applications of supervised and unsupervised learning:
Other variables used to help predict cmedv include:

| Variable | Description |
|---|---|
| lon | longitude of census tract |
| lat | latitude of census tract |
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq. ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitric oxides concentration (parts per 10 million), a measure of air pollution |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted distances to five Boston employment centers |
| rad | index of accessibility to radial highways |
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ recipes 1.1.0
## ✔ dials 1.3.0 ✔ rsample 1.2.1
## ✔ dplyr 1.1.4 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.1 ✔ tidyr 1.3.1
## ✔ infer 1.0.7 ✔ tune 1.2.1
## ✔ modeldata 1.4.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.2 ✔ yardstick 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(kknn)
# Import the Boston housing data set
boston <- readr::read_csv("Data/boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check for missing values
sum(is.na(boston))
## [1] 0
# What are the min, max, and average values of cmedv
summary(boston$cmedv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
split
## <Training/Testing/Total>
## <352/154/506>
# Overlay the distribution of cmedv in the training (red) and test (blue) sets
ggplot(mapping = aes(x = cmedv)) +
  geom_histogram(data = train, binwidth = 1, fill = "red", alpha = 0.5) +
  geom_histogram(data = test, binwidth = 1, fill = "blue", alpha = 0.5)
# Fit a simple linear regression of cmedv on rm
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# Compute the RMSE on the test data
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.83
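For reference, RMSE is the root mean squared error between observed and predicted values. The definition below is a sketch in notation that does not appear in the original: $n$ is the number of test observations, $y_i$ is the observed cmedv for tract $i$, and $\hat{y}_i$ is the corresponding model prediction (the `.pred` column).

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

RMSE is expressed in the same units as cmedv, and lower values mean the predictions sit closer to the observed values.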
# Fit a linear regression of cmedv on all other variables
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# Compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.83
The test RMSE drops from 6.83 to 4.83, so the model using all predictors performs better than the single-predictor model.
# Fit a k-nearest neighbors regression of cmedv on all other variables
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# Compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 3.37
With a test RMSE of 3.37, the k-nearest neighbors model performs better than both linear models (6.83 and 4.83).
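To put the three test RMSE values on a common scale, one option is to express each as a percentage reduction relative to the single-predictor model. The sketch below hard-codes the rounded values from the outputs above rather than recomputing them from the fitted models; the names `rmse_values` and `pct_reduction` are illustrative and not part of the original analysis.

```r
# Test RMSE values copied from the outputs above (rounded to two decimals)
rmse_values <- c(lm1 = 6.83, lm2 = 4.83, knn = 3.37)

# Percentage reduction in RMSE relative to the single-predictor model lm1
pct_reduction <- round(100 * (1 - rmse_values / rmse_values[["lm1"]]), 1)
pct_reduction
```

With these rounded inputs, the reduction is about 29% for lm2 and about 51% for knn. Because the figures come from a single train/test split, they are a rough guide rather than a definitive ranking of the models.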