Part 2:
Packages:
library(kknn)
## Warning: package 'kknn' was built under R version 4.4.3
library(modeldata)
## Warning: package 'modeldata' was built under R version 4.4.3
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.4.0 ✔ tibble 3.2.1
## ✔ dplyr 1.1.4 ✔ tidyr 1.3.1
## ✔ ggplot2 3.5.1 ✔ tune 1.3.0
## ✔ infer 1.0.7 ✔ workflows 1.2.0
## ✔ parsnip 1.3.0 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.4 ✔ yardstick 1.3.2
## ✔ recipes 1.1.1
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.4 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Modeling Tasks:
This is a supervised learning problem because the data include a known response variable (cmedv) that we are trying to predict from the other features.
• lon: longitude of census tract
• lat: latitude of census tract
• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000 sq.ft
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• nox: nitric oxides concentration (parts per 10 million), i.e., air pollution
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted distances to five Boston employment centers
• rad: index of accessibility to radial highways
• tax: full-value property-tax rate per USD 10,000
• ptratio: pupil-teacher ratio by town
• lstat: percentage of lower status of the population
Given the type of variable cmedv is, is this a regression or classification problem?
This is a regression problem because cmedv is a continuous numerical variable.
Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What is the minimum and maximum values of cmedv? What is the average cmedv value?
boston <- readr::read_csv('boston (1).csv')
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(boston$cmedv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
cmedv is recorded in thousands of dollars, so its minimum value is $5,000, its maximum is $50,000, and its mean is about $22,530.
summary(is.na(boston))
## lon lat cmedv crim
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:506 FALSE:506 FALSE:506 FALSE:506
## zn indus chas nox
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:506 FALSE:506 FALSE:506 FALSE:506
## rm age dis rad
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:506 FALSE:506 FALSE:506 FALSE:506
## tax ptratio b lstat
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:506 FALSE:506 FALSE:506 FALSE:506
Every column shows FALSE for all 506 rows, so there are no missing values.
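The logical summary above has to be read column by column. A more compact check is to count the NAs in each column. The sketch below uses a small made-up data frame (`toy`, which is hypothetical and not the Boston data) so that it runs on its own; with the real data the equivalent call would presumably be `colSums(is.na(boston))`.

```r
# Toy data frame with two deliberately missing entries (hypothetical values)
toy <- data.frame(
  rm    = c(6.5, NA, 7.1, 5.9),
  cmedv = c(24.0, 21.6, NA, 33.4),
  age   = c(65.2, 78.9, 61.1, 45.8)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Total number of missing values; 0 would mean the data are complete
total_na <- sum(is.na(toy))
print(total_na)
```

A total of zero from the same call on a real data set would confirm that no imputation step is needed before modeling.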
5.) Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.
set.seed(123)
boston_split <- initial_split(boston, prop = 0.7, strata = cmedv)
boston_train <- training(boston_split)
boston_test <- testing(boston_split)
6.) How many observations are in the training set and test set?
boston_split
## <Training/Testing/Total>
## <352/154/506>
There are 352 observations in the training set and 154 in the test set.
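The split above is stratified on cmedv, which is why the training set is not exactly 70% of 506. As a rough illustration of the idea (not rsample's actual implementation), the base R sketch below bins a made-up numeric outcome into quartiles and samples 70% of the rows within each bin. The simulated outcome `y`, the choice of four bins, and the use of `floor()` are all assumptions made for this example.

```r
set.seed(123)

# Made-up continuous outcome with the same number of rows as the Boston data
n <- 506
y <- rlnorm(n, meanlog = 3, sdlog = 0.4)

# Assign each row to a quartile bin of the outcome
bins <- cut(y,
            breaks = quantile(y, probs = seq(0, 1, by = 0.25)),
            include.lowest = TRUE)

# Within each bin, sample 70% of the row indices for the training set
train_idx <- unlist(
  lapply(split(seq_along(y), bins), function(idx) {
    idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

# Every row not chosen for training goes to the test set
test_idx <- setdiff(seq_along(y), train_idx)

print(c(train = length(train_idx), test = length(test_idx)))
```

Because the 70% is taken separately inside each bin and rounded down, the training set can end up slightly smaller than 70% of all rows, while low, middle, and high values of the outcome are represented in both sets.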
ggplot(boston_train, aes(x= cmedv)) +
geom_line(stat ='density', trim = TRUE) +
geom_line(data = boston_test, stat ='density', trim = TRUE, color = 'red')
The training (black) and test (red) densities of cmedv are very similar, so we can safely continue.
# fit model
linear_model <- linear_reg() %>%
fit(cmedv ~ rm, data = boston_train)
# compute the RMSE on the test data
linear_model %>%
predict(boston_test)%>%
bind_cols(boston_test %>% select(cmedv)) %>%
rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.83
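The RMSE reported above is the square root of the mean squared difference between the observed and predicted values. The sketch below computes it by hand in base R on a few made-up numbers (`truth` and `pred` are hypothetical, not taken from the Boston test set) to show what the metric measures.

```r
# Hypothetical observed values and model predictions (thousands of USD)
truth <- c(24.0, 21.6, 34.7, 33.4)
pred  <- c(25.0, 20.6, 32.7, 35.4)

# Root mean squared error: square the errors, average them, take the square root
rmse_by_hand <- sqrt(mean((truth - pred)^2))
print(rmse_by_hand)
```

RMSE is expressed in the same units as the outcome. Since cmedv is in thousands of dollars, a test RMSE of 6.83 corresponds to a typical prediction error of roughly $6,830.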
lm2 <- linear_reg() %>%
fit(cmedv ~ ., data = boston_train)
# compute the RMSE on the test data
lm2 %>%
predict(new_data = boston_test) %>%
bind_cols(boston_test %>% select(cmedv)) %>%
rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.83
10.) Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?
knn <- nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("regression") %>%
fit(cmedv ~ ., data = boston_train)
knn %>%
predict(boston_test) %>%
bind_cols(boston_test %>% select(cmedv)) %>%
rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
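## 1 rmse standard 3.37
The test set RMSE is 3.37. This is lower than the RMSE of the linear model using only rm (6.83) and of the linear model using all features (4.83), so the K-nearest neighbor model performs best of the three.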
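To make the idea behind the K-nearest neighbor model concrete, here is a minimal base R sketch of an unweighted KNN regression prediction for one new observation. It is not what the kknn engine does internally: kknn also supports kernel weighting of the neighbors, which this sketch leaves out. The function `knn_predict` and the training values are hypothetical and chosen only for illustration.

```r
# Unweighted k-nearest-neighbor regression for a single new point
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Standardize features with the training means and SDs so that
  # no single feature dominates the distance
  centers <- colMeans(train_x)
  scales  <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = centers, scale = scales)
  new_z   <- (new_x - centers) / scales

  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

  # Average the outcome over the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  mean(train_y[nearest])
}

# Made-up training data: two features and an outcome
train_x <- cbind(rm    = c(5.0, 5.5, 6.0, 6.5, 7.0, 7.5),
                 lstat = c(20, 18, 12, 9, 5, 4))
train_y <- c(15, 17, 21, 24, 33, 36)

# Predict for a new tract using its 2 nearest neighbors
prediction <- knn_predict(train_x, train_y,
                          new_x = c(rm = 7.2, lstat = 4.5), k = 2)
print(prediction)
```

Because the prediction is a local average rather than a single global line, KNN can follow nonlinear relationships between the features and cmedv, which is one plausible reason it outperforms the linear models here.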