Part 2:

Packages:

library(kknn)
## Warning: package 'kknn' was built under R version 4.4.3
library(modeldata)
## Warning: package 'modeldata' was built under R version 4.4.3
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tibble       3.2.1
## ✔ dplyr        1.1.4     ✔ tidyr        1.3.1
## ✔ ggplot2      3.5.1     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ parsnip      1.3.0     ✔ workflowsets 1.1.0
## ✔ purrr        1.0.4     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Modeling Tasks:

  1. Is this a supervised or unsupervised learning problem? Why?

This is a supervised learning problem because the data include a known response variable (cmedv) that we are trying to predict from the other variables.

  2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)?

cmedv is the response variable. The remaining 15 variables are the predictor variables:

• lon: longitude of census tract
• lat: latitude of census tract
• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000 sq. ft.
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• nox: nitric oxides concentration (parts per 10 million), a measure of air pollution
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted distances to five Boston employment centers
• rad: index of accessibility to radial highways
• tax: full-value property-tax rate per USD 10,000
• ptratio: pupil-teacher ratio by town
• b: 1000(B - 0.63)^2, where B is the proportion of Black residents by town
• lstat: percentage of lower status of the population

  3. Given the type of variable cmedv is, is this a regression or classification problem?

This is a regression problem because cmedv is a continuous numerical variable.

  4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What are the minimum and maximum values of cmedv? What is the average cmedv value?

boston <- readr::read_csv('boston (1).csv')
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(boston$cmedv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

Reading cmedv in thousands of USD, the minimum value is 5.00 ($5,000), the maximum is 50.00 ($50,000), and the mean is 22.53 (about $22,530).

summary(is.na(boston))
##     lon             lat            cmedv            crim        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:506       FALSE:506       FALSE:506       FALSE:506      
##      zn            indus            chas            nox         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:506       FALSE:506       FALSE:506       FALSE:506      
##      rm             age             dis             rad         
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:506       FALSE:506       FALSE:506       FALSE:506      
##     tax           ptratio            b             lstat        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:506       FALSE:506       FALSE:506       FALSE:506

There are no missing values: every column shows FALSE for all 506 rows.
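
The is.na() summary above is easy to read for 16 columns, but a more compact check may be handy for wider data. The following is a minimal sketch on a small made-up data frame (toy and its columns are assumptions for illustration, not part of the Boston data); it counts the missing values per column and tests whether any exist at all.

```r
# Hypothetical toy data frame with two missing values
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA),
  d = c(TRUE, FALSE, TRUE)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE if there is at least one missing value anywhere in the data frame
anyNA(toy)
```

Applied to the housing data, the same two calls would be colSums(is.na(boston)) and anyNA(boston).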

5.) Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.

set.seed(123)
boston_split <- initial_split(boston, prop = 0.7, strata = cmedv)
boston_train <- training(boston_split)
boston_test <- testing(boston_split)

6.) How many observations are in the training set and test set?

boston_split
## <Training/Testing/Total>
## <352/154/506>

There are 352 observations in the training set and 154 in the test set.

  7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly?

ggplot(boston_train, aes(x= cmedv)) +
  geom_line(stat ='density', trim = TRUE) +
  geom_line(data = boston_test, stat ='density', trim = TRUE, color = 'red')

The two distributions are very similar, which is what we would expect from a split stratified on cmedv, so we can safely continue.
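
A numeric check can complement the density plot. The sketch below uses simulated data rather than the Boston file (toy, x, and y are made-up names), splits it with the same initial_split() arguments used earlier, and compares the quantiles of the stratification variable in the two sets. As far as I understand rsample, a numeric strata variable is binned into quartiles by default before sampling, which is why the quantiles should end up close.

```r
library(rsample)

set.seed(123)

# Hypothetical toy data standing in for the housing data
toy <- data.frame(x = rnorm(200), y = rexp(200))

# 70-30 split, stratified on the (skewed) variable y
toy_split <- initial_split(toy, prop = 0.7, strata = y)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy)

# Side-by-side quantiles of y in the training and test sets
rbind(train = quantile(toy_train$y), test = quantile(toy_test$y))
```

For the housing data, the last line would compare boston_train$cmedv with boston_test$cmedv.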

  8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?

# fit model
linear_model <- linear_reg() %>%
  fit(cmedv ~ rm, data = boston_train)

# compute the RMSE on the test data
linear_model %>%
  predict(new_data = boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.83

The test set RMSE is 6.83.
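
To make the metric concrete, here is a minimal sketch of the calculation behind rmse(): the square root of the mean squared difference between the true and predicted values. The helper name rmse_by_hand and the three toy values are assumptions made up for this illustration.

```r
# rmse_by_hand() is a hypothetical helper used only for this illustration
rmse_by_hand <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

truth    <- c(3, 5, 7)
estimate <- c(2, 5, 9)

# The errors are 1, 0, and -2, so the result is sqrt((1 + 0 + 4) / 3)
toy_rmse <- rmse_by_hand(truth, estimate)
toy_rmse
```

RMSE is in the same units as the response. If cmedv is read in thousands of USD, as in the summary interpretation earlier, a test RMSE of 6.83 corresponds to a typical prediction error of roughly $6,830.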

  9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?

lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = boston_train)

# compute the RMSE on the test data
lm2 %>%
  predict(new_data = boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
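## 1 rmse    standard        4.83

The test set RMSE is 4.83, which is lower (better) than the 6.83 from the model that used rm alone.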

10.) Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?

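# fit a KNN regression model on all features (`.` already includes rm)
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = boston_train)

# compute the RMSE on the test data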
knn %>%
  predict(boston_test) %>%
  bind_cols(boston_test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
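## 1 rmse    standard        3.37

The test set RMSE is 3.37, which is lower than both the single-feature linear model (6.83) and the all-features linear model (4.83), so the KNN model performs best of the three on this test set.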
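
To show roughly what a K-nearest neighbor regression does, here is a simplified base R sketch: standardize the features, find the k closest training points by Euclidean distance, and average their responses. The helper knn_predict and the toy data are made up for illustration. This is not the kknn implementation; as far as I know, kknn also weights neighbors by distance using a kernel, so its predictions will generally differ from this unweighted version.

```r
# knn_predict() is a hypothetical helper, not part of kknn or parsnip
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Standardize with the training means and standard deviations so that
  # no single feature dominates the distance
  centre <- colMeans(train_x)
  spread <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = centre, scale = spread)
  new_z   <- scale(new_x,   center = centre, scale = spread)

  apply(new_z, 1, function(point) {
    # Euclidean distance from this point to every training point
    d <- sqrt(rowSums(sweep(train_z, 2, point)^2))
    # Average the response of the k closest training points
    mean(train_y[order(d)[1:k]])
  })
}

# Toy data: two well-separated clusters with low and high responses
train_x <- cbind(x1 = c(1, 2, 3, 10, 11, 12), x2 = c(5, 6, 4, 20, 21, 19))
train_y <- c(10, 12, 11, 30, 31, 29)
new_x   <- cbind(x1 = c(2, 11), x2 = c(5, 20))

preds <- knn_predict(train_x, train_y, new_x, k = 3)
preds
# Each new point takes the mean response of its own cluster: 11 and 30
```

Because the prediction is a local average rather than a single global line, KNN can follow nonlinear relationships between the features and cmedv, which is one plausible reason it achieved the lowest test RMSE here.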