library(kknn)
library(tidymodels)
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table
── Attaching packages ────────────────────────────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom 1.0.7 ✔ recipes 1.1.0
✔ dials 1.3.0 ✔ rsample 1.2.1
✔ dplyr 1.1.4 ✔ tibble 3.2.1
✔ ggplot2 3.5.1 ✔ tidyr 1.3.1
✔ infer 1.0.7 ✔ tune 1.2.1
✔ modeldata 1.4.0 ✔ workflows 1.1.4
✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
✔ purrr 1.0.2 ✔ yardstick 1.3.1
── Conflicts ───────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
This is a supervised learning problem because we use predictor variables to predict a known response variable (cmedv).
cmedv is the response variable; all other columns are predictor variables.
This is a regression problem because the response variable cmedv is continuous.
boston <- readr::read_csv("~/Library/CloudStorage/OneDrive-UniversityofCincinnati/UC Courses/Fall 2024/Data Mining/boston.csv")
Rows: 506 Columns: 16
── Column specification ────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sum(is.na(boston))
[1] 0
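A total of zero settles the question here, but when the total is not zero a per-column count shows where the gaps are. A minimal sketch on a made-up data frame (`toy` and `na_by_col` are hypothetical names, not part of the Boston data):

```r
# Toy data frame with known gaps; NOT the Boston data
toy <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c(NA, NA, 9))

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_by_col <- colSums(is.na(toy))
na_by_col
```

The same call on the real data would be `colSums(is.na(boston))`.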
summary(boston$cmedv)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 17.02 21.20 22.53 25.00 50.00
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
nrow(train)  # number of rows in the training set
[1] 352
nrow(test)  # number of rows in the test set
[1] 154
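To illustrate what stratifying on a numeric response does, the sketch below bins the response into quartiles and samples 70% of the rows within each bin. It is a simplified base R approximation, not rsample's actual implementation, and it runs on simulated values rather than the real cmedv column; `y`, `bins`, `train_idx`, and `test_idx` are hypothetical names.

```r
# Simulated stand-in for boston$cmedv (506 values); NOT the real data
set.seed(123)
y <- round(rlnorm(506, meanlog = 3, sdlog = 0.4), 1)

# Bin the response into quartiles
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)), include.lowest = TRUE)

# Sample 70% of the row indices within each bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(y), train_idx)

# Each bin should make up a similar share of both sets
round(prop.table(table(bins[train_idx])), 2)
round(prop.table(table(bins[test_idx])), 2)
```

Because the 70% is taken bin by bin and rounded within each bin, the training set size can differ slightly from 0.7 times the total row count.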
library(ggplot2)
# Histogram of cmedv in the training set
ggplot(train, aes(x = cmedv)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  ggtitle("Training Set cmedv Distribution")
# Histogram of cmedv in the test set
ggplot(test, aes(x = cmedv)) +
  geom_histogram(binwidth = 1, fill = "red", color = "black") +
  ggtitle("Test Set cmedv Distribution")
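# Model 1: linear regression with the single predictor rm
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# Compute the RMSE on the test data
lm1 %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
# Model 2: linear regression with all other columns as predictors
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# Compute the RMSE on the test data
lm2 %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)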
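For reference, RMSE is the square root of the mean squared difference between the observed and predicted values, so it is in the same units as cmedv. A small hand calculation on made-up numbers (`truth`, `pred`, and `rmse_manual` are hypothetical, not model output):

```r
# Made-up observed and predicted values; NOT results from lm1 or lm2
truth <- c(24.0, 21.6, 34.7, 33.4)
pred  <- c(25.0, 20.6, 32.7, 35.4)

# Errors are -1, 1, 2, -2; squared errors are 1, 1, 4, 4; their mean is 2.5
rmse_manual <- sqrt(mean((truth - pred)^2))
rmse_manual  # sqrt(2.5), about 1.58
```

Applying the same expression to the `cmedv` and `.pred` columns should give the value that `rmse()` reports in its `.estimate` column.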
# Model 3: k-nearest neighbors regression with all other columns as predictors
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# Compute the RMSE on the test data
knn %>%
  predict(new_data = test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
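To make the idea behind the KNN model concrete, here is a minimal unweighted k-nearest neighbors regression in base R. It is a simplified sketch, not what the kknn engine does internally (kknn, as far as I know, weights neighbors with a kernel by default); `knn_predict` and its arguments are hypothetical names, and the usage runs on the built-in mtcars data as a stand-in for the Boston data.

```r
# Minimal unweighted k-NN regression (a sketch, not the kknn algorithm)
knn_predict <- function(x_train, y_train, x_new, k = 5) {
  # Standardize with the training means and SDs so no predictor dominates the distance
  mu <- colMeans(x_train)
  sdev <- apply(x_train, 2, sd)
  z_train <- scale(x_train, center = mu, scale = sdev)
  z_new <- scale(x_new, center = mu, scale = sdev)

  apply(z_new, 1, function(row) {
    # Euclidean distance from this new point to every training row
    d <- sqrt(colSums((t(z_train) - row)^2))
    # Predict the average response of the k nearest training rows
    mean(y_train[order(d)[seq_len(k)]])
  })
}

# Usage on mtcars: fit on the first 25 rows, predict mpg for the last 7
x <- as.matrix(mtcars[, c("wt", "hp")])
y <- mtcars$mpg
preds <- knn_predict(x[1:25, ], y[1:25], x[26:32, ], k = 3)
preds
```

Unlike the linear models, nothing is estimated at fit time here: each prediction is an average of nearby training responses, which is why the scale of the predictors matters.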