Part 2

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.0     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

1. Is this a supervised or unsupervised learning problem? Why? This is a supervised learning problem because we’re trying to predict cmedv using features from the Boston housing data set. The value we want to predict is known for every observation, so the model can learn from labeled examples.

2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)? We’re trying to predict cmedv, so cmedv is the response variable. The remaining 15 variables are the predictor variables.

3. Given the type of variable cmedv is, is this a regression or classification problem? Because cmedv is a continuous, numeric variable, this is a regression problem.

4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What are the minimum and maximum values of cmedv? What is the average cmedv value?

boston <- readr::read_csv('data/boston.csv')

## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(is.na(boston))

##     lon             lat            cmedv            crim        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:506       FALSE:506       FALSE:506       FALSE:506      
##      zn            indus            chas            nox         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:506       FALSE:506       FALSE:506       FALSE:506      
##      rm             age             dis             rad         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:506       FALSE:506       FALSE:506       FALSE:506      
##     tax           ptratio            b             lstat        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:506       FALSE:506       FALSE:506       FALSE:506

There are no missing values in the Boston data set: every column reports FALSE for all 506 rows.
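The per-column summary printout gets long at 16 columns. A more compact check is sketched below on a small made-up data frame (`toy` is hypothetical, not the Boston data), counting missing values per column with base R:

```r
# Hypothetical toy data frame standing in for a real data set
toy <- data.frame(
  x = c(1.2, NA, 3.4, 4.1),
  y = c(10, 20, 30, 40),
  z = c(NA, NA, 0.5, 0.7)
)

# Count of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE if the data frame contains any missing value at all
has_missing <- anyNA(toy)
print(has_missing)
```

Applied to the lab data, `colSums(is.na(boston))` should return a zero for every column, in line with the all-FALSE summary.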

summary(boston$cmedv)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

cmedv is recorded in thousands of dollars, so the minimum is $5,000, the maximum is $50,000, and the mean is about $22,530.

5. Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.

set.seed(123)
boston_split <- initial_split(boston, prop = 0.7, strata = cmedv)
boston_train <- training(boston_split)
boston_test <- testing(boston_split)

boston_split

## <Training/Testing/Total>
## <352/154/506>

6. How many observations are in the training set and test set? 352 observations are in the training set, 154 are in the testing set.
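The training count is slightly below 70% of 506 (about 354). One plausible reason is that stratified sampling draws the 70% within bins of the outcome rather than from the whole data set, so rounding happens per bin. The base R sketch below illustrates the idea on simulated data. It is an approximation under my own assumptions (quartile bins, rounding down within each bin), not a reproduction of what initial_split() does internally.

```r
set.seed(123)

# Simulated outcome standing in for cmedv (not the real data)
y <- rlnorm(506, meanlog = 3, sdlog = 0.4)

# Bin the outcome into quartiles to act as the stratification variable
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE, labels = FALSE)

# Within each bin, draw 70% of the row indices (rounded down) for training
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
}), use.names = FALSE)

# Everything not drawn for training goes to the test set
test_idx <- setdiff(seq_along(y), train_idx)

print(length(train_idx))
print(length(test_idx))
```

Sampling within bins keeps low, middle, and high values of the outcome represented in both sets, which is the point of passing strata = cmedv.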

7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly?

ggplot(boston_train, aes(x = cmedv)) + 
  geom_line(stat = "density", trim = TRUE) + 
  geom_line(data = boston_test, stat = "density", trim = TRUE, col = "darkgreen")

The two density curves are very similar, so the split preserved the distribution of cmedv and it is safe to continue.
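A visual comparison can be backed up with numbers. The sketch below uses simulated stand-ins for the two sets (`train_y` and `test_y` are made up, not the lab data) and compares quantiles side by side, then runs a two-sample Kolmogorov-Smirnov test from base R:

```r
set.seed(123)

# Simulated stand-ins for the training and test outcomes (not the real data)
train_y <- rnorm(352, mean = 22.5, sd = 9)
test_y  <- rnorm(154, mean = 22.5, sd = 9)

# Side-by-side quantiles: similar rows suggest similar distributions
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
comparison <- rbind(train = quantile(train_y, probs),
                    test  = quantile(test_y, probs))
print(round(comparison, 2))

# Two-sample Kolmogorov-Smirnov test; a large p-value gives no evidence
# that the two samples come from different distributions
ks <- ks.test(train_y, test_y)
print(ks$p.value)
```

With the lab objects, the same calls would take boston_train$cmedv and boston_test$cmedv in place of the simulated vectors.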

8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?

# fit a simple linear regression of cmedv on rm using the training data
linear_model <- linear_reg() %>%
  fit(cmedv ~ rm, data = boston_train)

# compute the RMSE on the test data
linear_model %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.83

The test set RMSE is 6.83.
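For reference, RMSE is the square root of the average squared prediction error. The short sketch below computes it by hand on made-up numbers (`truth` and `pred` are hypothetical values, not model output) to show what the rmse() call is summarizing:

```r
# Hypothetical observed values and predictions (made up for illustration)
truth <- c(24.0, 21.6, 34.7, 33.4)
pred  <- c(25.0, 20.6, 32.7, 35.4)

# RMSE: square the errors, average them, then take the square root
rmse_by_hand <- sqrt(mean((truth - pred)^2))
print(rmse_by_hand)
```

RMSE is expressed in the units of the response, so with cmedv measured in thousands of dollars an RMSE of 6.83 means predictions are off by roughly $6,830 in the root-mean-square sense.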

9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?

# fit model
linear_model2 <- linear_reg() %>%
  fit(cmedv ~ ., data = boston_train)

# compute the RMSE on the test data
linear_model2 %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
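## 1 rmse    standard        4.83

The test set RMSE is 4.83, which is lower, and therefore better, than the 6.83 from the model that used rm alone.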
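To put the two test RMSE values on a common scale, the arithmetic below expresses the change as a percentage of the single-feature model's error. The variable names are my own; the two numbers are the ones reported in the output for questions 8 and 9.

```r
rmse_rm  <- 6.83  # test RMSE reported for cmedv ~ rm
rmse_all <- 4.83  # test RMSE reported for cmedv ~ .

# Percentage reduction in test RMSE from adding the remaining features
pct_reduction <- 100 * (rmse_rm - rmse_all) / rmse_rm
print(round(pct_reduction, 1))
```

This works out to a reduction of a little over 29%.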

10. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?

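# fit model on the training data
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = boston_train)

# compute the RMSE on the test data
knn %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

The model must be fit on boston_train. An earlier run fit it on boston_test instead and reported an RMSE of 2.66. That figure scores the model on the same rows it was fit to, so it is optimistic and cannot be compared with the 6.83 and 4.83 from the two linear models. The test set RMSE has to be recomputed with the training-set fit shown here before deciding whether KNN outperforms the linear models.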
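A nearest-neighbor model predicts from the rows it was fit on, so which data set is passed to fit() has a large effect on the RMSE computed afterward. The base R sketch below makes this concrete on simulated data. `knn_predict()` is a hypothetical single-feature helper written for this illustration, not the kknn engine, and the data are made up.

```r
set.seed(123)

# Simulated data (not the Boston data): one feature, noisy linear response
n <- 200
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n, sd = 3)
train <- 1:140
test  <- 141:200

# Hypothetical helper: k-nearest-neighbor regression on a single feature,
# predicting the mean response of the k closest fitting points
knn_predict <- function(x_fit, y_fit, x_new, k = 5) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_fit - x0))[seq_len(k)]
    mean(y_fit[nearest])
  })
}

rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# Honest estimate: fit on training rows, evaluate on unseen test rows
rmse_honest <- rmse(y[test], knn_predict(x[train], y[train], x[test]))

# Leaky estimate: fit and evaluate on the same test rows
rmse_leaky <- rmse(y[test], knn_predict(x[test], y[test], x[test]))

# Extreme case: with k = 1 each row is its own nearest neighbor
rmse_leaky_k1 <- rmse(y[test], knn_predict(x[test], y[test], x[test], k = 1))

print(c(honest = rmse_honest, leaky = rmse_leaky, leaky_k1 = rmse_leaky_k1))
```

With k = 1 the leaky RMSE is exactly zero, because every evaluated row is matched to itself. Larger k softens the effect but does not remove it, which is why a test RMSE is only meaningful when the model was fit on the training set.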

Module 8 Lab

Victoria Mwangi

2025-03-04
