install.packages("tidymodels")
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.7 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.3.0 ✔ yardstick 1.3.2
## ✔ recipes 1.1.1
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
1. Is this a supervised or unsupervised learning problem? Why? This is a supervised problem because we’re trying to predict cmedv using features from the Boston housing dataset, and the value of cmedv is known for every observation, so the model has labeled outcomes to learn from.
2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)? We’re trying to predict cmedv, so cmedv is the response variable. All other variables are predictor variables.
3. Given the type of variable cmedv is, is this a regression or classification problem? Because cmedv is a continuous, numeric variable, this is a regression problem.
4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What are the minimum and maximum values of cmedv? What is the average cmedv value?
boston <- readr::read_csv('data/boston.csv')
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(is.na(boston))
## lon lat cmedv crim
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:506 FALSE:506 FALSE:506 FALSE:506
## zn indus chas nox
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:506 FALSE:506 FALSE:506 FALSE:506
## rm age dis rad
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:506 FALSE:506 FALSE:506 FALSE:506
## tax ptratio b lstat
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:506 FALSE:506 FALSE:506 FALSE:506
There are no missing values in the Boston dataset: every column shows FALSE for all 506 rows.
summary(boston$cmedv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
The minimum cmedv is $5,000, the max is $50,000, and the mean is $22,530.
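A more compact check than `summary(is.na(boston))` is to count the missing values per column. The sketch below uses a small made-up data frame (`toy`, which is not part of the Boston data) so that it runs on its own; applying the same two calls to `boston` should give zero for every column, given the output above.

```r
# Hypothetical toy data frame standing in for `boston`
toy <- data.frame(
  cmedv = c(24.0, 21.6, NA, 33.4),
  rm    = c(6.58, 6.42, 7.19, NA),
  age   = c(65.2, 78.9, 61.1, 45.8)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE if any value anywhere in the data frame is missing
anyNA(toy)
```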
5. Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.
set.seed(123)
boston_split <- initial_split(boston, prop = 0.7, strata = cmedv)
boston_train <- training(boston_split)
boston_test <- testing(boston_split)
boston_split
## <Training/Testing/Total>
## <352/154/506>
6. How many observations are in the training set and test set? There are 352 observations in the training set and 154 in the test set.
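The split sizes can be checked against the requested 70-30 proportion with a little arithmetic. The counts below are copied from the `boston_split` printout; the variable names are only illustrative.

```r
# Counts taken from the <Training/Testing/Total> printout
n_train <- 352
n_test  <- 154
n_total <- n_train + n_test

# Share of rows in each set
split_share <- round(c(train = n_train, test = n_test) / n_total, 3)
split_share
```

The training share comes out slightly below 0.70. A likely reason is the stratification: `initial_split()` bins a numeric `strata` variable and samples within each bin, so rounding inside the bins can shift the totals by a few rows.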
7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly?
ggplot(boston_train, aes(x = cmedv)) +
  geom_line(stat = "density", trim = TRUE) +
  geom_line(data = boston_test, stat = "density", trim = TRUE, col = "darkgreen")
The training (black) and test (dark green) density curves look very similar, so the two sets appear to share the same distribution of cmedv and we can continue.
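A density plot can be backed up with a quick numeric comparison of quantiles. The sketch below uses short made-up vectors (`train_y` and `test_y` are hypothetical stand-ins for `boston_train$cmedv` and `boston_test$cmedv`) so that it runs without the data file.

```r
# Hypothetical stand-ins for boston_train$cmedv and boston_test$cmedv
train_y <- c(12.5, 18.0, 20.1, 21.7, 23.4, 25.0, 31.2, 44.8)
test_y  <- c(13.1, 19.4, 21.0, 24.3, 33.0)

# Side-by-side quantiles: rows that are close suggest similar distributions
probs <- c(0, 0.25, 0.5, 0.75, 1)
quantile_table <- rbind(
  train = quantile(train_y, probs),
  test  = quantile(test_y, probs)
)
quantile_table
```

With the real data, replacing the two vectors by the cmedv columns of the training and test sets gives one row per set to compare.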
8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?
# fit model
linear_model <- linear_reg() %>%
  fit(cmedv ~ rm, data = boston_train)
# compute the RMSE on the test data
linear_model %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.83
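For reference, RMSE is the square root of the mean squared difference between the observed and predicted values. The sketch below computes it by hand in base R; `rmse_manual` and the four value pairs are hypothetical and only illustrate the formula, they are not taken from the Boston data.

```r
# RMSE: square root of the mean squared difference between truth and prediction
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Hypothetical values in the same units as cmedv (thousands of dollars)
truth    <- c(24.0, 21.6, 34.7, 33.4)
estimate <- c(25.0, 20.6, 30.7, 35.4)

# Errors are -1, 1, 4 and -2, so the mean squared error is 22 / 4 = 5.5
rmse_manual(truth, estimate)
```

Because RMSE is expressed in the units of the response, the 6.83 above corresponds to a typical prediction error of roughly $6,830.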
9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?
# fit model
linear_model2 <- linear_reg() %>%
  fit(cmedv ~ ., data = boston_train)
# compute the RMSE on the test data
linear_model2 %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.83
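To put the two linear models side by side, the relative reduction in test RMSE can be computed directly from the two values printed above. The variable names are only illustrative.

```r
# Test-set RMSE values copied from the two outputs above
rmse_rm  <- 6.83  # model using only rm
rmse_all <- 4.83  # model using all features

# Relative reduction in RMSE from adding the remaining features
relative_reduction <- (rmse_rm - rmse_all) / rmse_rm
round(relative_reduction, 3)
```

A lower RMSE is better, so the model with all features outperforms the rm-only model, cutting the test error by a little under 30%.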
10. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?
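# fit model on the training set; the RMSE printed below came from a run that fit on
# boston_test, which scores the model on rows it has already seen
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = boston_train)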
## Warning: package 'kknn' was built under R version 4.4.3
# compute the RMSE on the test data
knn %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.66
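The 2.66 reported here is far lower than either linear model's RMSE, and a gap like that deserves care: a nearest-neighbor model scored on the same rows it was fit on will look better than it really is. The sketch below illustrates this with a hand-written 1-nearest-neighbor regressor on made-up data. It is a simplification for illustration, not the kknn algorithm, and all names and values in it are hypothetical.

```r
# Minimal 1-nearest-neighbor regression on a single feature
# (illustrative only; this is not how kknn is implemented)
predict_1nn <- function(x_fit, y_fit, x_new) {
  sapply(x_new, function(x) y_fit[which.min(abs(x_fit - x))])
}

rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Hypothetical toy data
x_train <- c(1, 2, 3, 4, 5, 6)
y_train <- c(10, 14, 13, 19, 22, 21)
x_test  <- c(1.6, 3.4, 5.4)
y_test  <- c(13, 17, 20)

# Scored on the rows it was fit on: every point is its own nearest neighbor,
# so the error is zero
rmse_in_sample <- rmse_manual(y_train, predict_1nn(x_train, y_train, x_train))

# Scored on held-out rows: a more honest estimate of the error
rmse_held_out <- rmse_manual(y_test, predict_1nn(x_train, y_train, x_test))

c(in_sample = rmse_in_sample, held_out = rmse_held_out)
```

For a fair comparison with the two linear models, the KNN model needs to be fit on `boston_train` and scored on `boston_test`, exactly as they were.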