This is an R Markdown
Notebook. When you execute code within the notebook, the results appear
beneath the code.
Try executing this chunk by clicking the Run button within
the chunk or by placing your cursor inside it and pressing
Ctrl+Shift+Enter.
library(tidymodels)
library(readr)
For this exercise we’ll use the Boston housing data set. The Boston
Housing data set is derived from information collected by the U.S.
Census Service concerning housing in the area of Boston, MA. Originally
published in Harrison Jr and Rubinfeld (1978). The purpose of this data
set is to predict the median value of owner-occupied homes for various
census tracts in the Boston area. Each row (observation) represents a
given census tract and the variable we wish to predict is cmedv (median
value of owner-occupied homes in USD 1000’s). The other variables are
variables we want to use to help make predictions of cmedv and
include:
• lon: longitude of census tract
• lat: latitude of census tract
• crim: per capita crime rate by town
• zn: proportion of residential land zoned for lots over 25,000
sq.ft
• indus: proportion of non-retail business acres per town
• chas: Charles River dummy variable (= 1 if tract bounds river; 0
otherwise)
• nox: nitric oxides concentration (parts per 10 million) –> aka
air pollution
• rm: average number of rooms per dwelling
• age: proportion of owner-occupied units built prior to 1940
• dis: weighted distances to five Boston employment centers
• rad: index of accessibility to radial highways
• tax: full-value property-tax rate per USD 10,000
• ptratio: pupil-teacher ratio by town
• lstat: percentage of lower status of the population
- Is this a supervised or unsupervised learning problem? Why?
This is a supervised learning problem because we a using a labeled
data set (cmedv)
- There are 16 variables in this data set. Which variable is the
response variable and which variables are the predictor variables (aka
features)?
The response variable is “cmedv” the predictor variables are all
other variables from the dataset (lon, lat, crim, zn, indus, chas, nox,
rm, age, dis, rad, tax, ptratio, and lstat.)
- Given the type of variable cmedv is, is this a regression or
classification problem?
This is a regression porblem because cmedv is a continous
variable
- Fill in the blanks to import the Boston housing data set
(boston.csv). Are there any missing values? What is the minimum and
maximum values of cmedv? What is the average cmedv value?
boston <- readr::read_csv("boston.csv")
Rows: 506 Columns: 16── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##Missing Variables
sum(is.na(boston))
[1] 0
## Min, Max, & Median
summary(boston$cmedv)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 17.02 21.20 22.53 25.00 50.00
### Min =5, Max = 50, Median = 21.20, Average = 22.53
Split the data into a training set and test set using a 70-30% split.
Be sure to include the set.seed(123) so that your train and test sets
are the same size as mine
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
- How many observations are in the training set and test set?
split
<Training/Testing/Total>
<352/154/506>
ggplot(mapping = aes(x = cmedv)) +
geom_histogram(data = train, bindwidth = 1, fill = "purple", alpha = 0.5) +
geom_histogram(data = test, bindwidth = 1, fill ="green", alpha = 0.5)
Warning: Ignoring unknown parameters: `bindwidth`Warning: Ignoring unknown parameters: `bindwidth`

Fit a linear regression model using the rm feature variable. Predict
cmedv and compute RMSE on the test data. What is the test set RMSE?
#Fit model
lml <- linear_reg() %>%
fit(cmedv ~ rm, data = train)
##compute the RMSE on the test data
lml %>%
predict(test) %>%
bind_cols(test %>% select(cmedv)) %>%
rmse(truth = cmedv, estimate = .pred)
NA
Fill in the blanks to fit a linear regression model using all
available features to predict cmedv and compute the RMSE on the test
data. What is the test set RMSE? Is this better than the previous
model’s performance?
# Fit Model
lm2 <- linear_reg() %>%
fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
lm2 %>%
predict(test) %>%
bind_cols(test %>% select(cmedv)) %>%
rmse(truth = cmedv, estimate = .pred)
Fit a K-nearest neighbor model that uses all available features to
predict cmedv and compute the RMSE on the test data. What is the test
set RMSE? Is this better than the previous two models’ performances?
# Fit model
knn <- nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("regression") %>%
fit(cmedv ~ ., data = train)
# Compute the RMSE on the test data
knn %>%
predict(test) %>%
bind_cols(test %>% select(cmedv)) %>%
rmse(truth = cmedv, estimate = .pred)
The new RMSE is better than the two previous models
LS0tDQp0aXRsZTogIk1vZHVsZSA4IExhYjogU3VwZXJ2aXNlZCB2cyBVbnN1cGVydmlzZWQgRGF0YSINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQoNClRoaXMgaXMgYW4gW1IgTWFya2Rvd25dKGh0dHA6Ly9ybWFya2Rvd24ucnN0dWRpby5jb20pIE5vdGVib29rLiBXaGVuIHlvdSBleGVjdXRlIGNvZGUgd2l0aGluIHRoZSBub3RlYm9vaywgdGhlIHJlc3VsdHMgYXBwZWFyIGJlbmVhdGggdGhlIGNvZGUuIA0KDQpUcnkgZXhlY3V0aW5nIHRoaXMgY2h1bmsgYnkgY2xpY2tpbmcgdGhlICpSdW4qIGJ1dHRvbiB3aXRoaW4gdGhlIGNodW5rIG9yIGJ5IHBsYWNpbmcgeW91ciBjdXJzb3IgaW5zaWRlIGl0IGFuZCBwcmVzc2luZyAqQ3RybCtTaGlmdCtFbnRlciouIA0KDQpgYGB7cn0NCmxpYnJhcnkodGlkeW1vZGVscykNCmxpYnJhcnkocmVhZHIpDQpgYGANCkZvciB0aGlzIGV4ZXJjaXNlIHdl4oCZbGwgdXNlIHRoZSBCb3N0b24gaG91c2luZyBkYXRhIHNldC4gVGhlIEJvc3RvbiBIb3VzaW5nIGRhdGEgc2V0IGlzIGRlcml2ZWQgZnJvbSBpbmZvcm1hdGlvbiBjb2xsZWN0ZWQgYnkgdGhlIFUuUy4gQ2Vuc3VzIFNlcnZpY2UgY29uY2VybmluZyBob3VzaW5nIGluIHRoZSBhcmVhIG9mIEJvc3RvbiwgTUEuIE9yaWdpbmFsbHkgcHVibGlzaGVkIGluIEhhcnJpc29uIEpyIGFuZCBSdWJpbmZlbGQgKDE5NzgpLg0KVGhlIHB1cnBvc2Ugb2YgdGhpcyBkYXRhIHNldCBpcyB0byBwcmVkaWN0IHRoZSBtZWRpYW4gdmFsdWUgb2Ygb3duZXItb2NjdXBpZWQgaG9tZXMgZm9yIHZhcmlvdXMgY2Vuc3VzIHRyYWN0cyBpbiB0aGUgQm9zdG9uIGFyZWEuIEVhY2ggcm93IChvYnNlcnZhdGlvbikgcmVwcmVzZW50cyBhIGdpdmVuIGNlbnN1cyB0cmFjdCBhbmQgdGhlIHZhcmlhYmxlIHdlIHdpc2ggdG8gcHJlZGljdCBpcyBjbWVkdiAobWVkaWFuIHZhbHVlIG9mIG93bmVyLW9jY3VwaWVkIGhvbWVzIGluIFVTRCAxMDAw4oCZcykuIFRoZSBvdGhlciB2YXJpYWJsZXMgYXJlIHZhcmlhYmxlcyB3ZSB3YW50IHRvIHVzZSB0byBoZWxwIG1ha2UgcHJlZGljdGlvbnMgb2YgY21lZHYgYW5kIGluY2x1ZGU6DQoNCuKAoiBsb246IGxvbmdpdHVkZSBvZiBjZW5zdXMgdHJhY3QNCg0K4oCiIGxhdDogbGF0aXR1ZGUgb2YgY2Vuc3VzIHRyYWN0DQoNCuKAoiBjcmltOiBwZXIgY2FwaXRhIGNyaW1lIHJhdGUgYnkgdG93bg0KDQrigKIgem46IHByb3BvcnRpb24gb2YgcmVzaWRlbnRpYWwgbGFuZCB6b25lZCBmb3IgbG90cyBvdmVyIDI1LDAwMCBzcS5mdA0KDQrigKIgaW5kdXM6IHByb3BvcnRpb24gb2Ygbm9uLXJldGFpbCBidXNpbmVzcyBhY3JlcyBwZXIgdG93bg0KDQrigKIgY2hhczogQ2hhcmxlcyBSaXZlciBkdW1teSB2YXJpYWJsZSAoPSAxIGlmIHRyYWN0IGJvdW5kcyByaXZlcjsgMCBvdGhlcndpc2UpDQoNCuKAoiBub3g6IG5pdHJpYyBveGlkZXMgY29uY2VudHJhdGlvbiAocGFydHMgcGVyIDEwIG1pbGxpb24pIOKAkz4gYWthIGFpciBwb2xsdXRpb24NCg0K4oCiIHJtOiBhdmVyYWdlIG51bWJlciBvZiByb29tcyBwZXIgZHdlbGxpbmcNCg0K4oCiIGFnZTogcHJvcG9ydGlvbiBvZiBvd25lci1vY2N1cGllZCB1bml0cyBidWlsdCBwcmlvciB0byAxOTQwDQoNCuKAoiBkaXM6IHdlaWdodGVkIGRpc3RhbmNlcyB0byBmaXZlIEJvc3RvbiBlbXBsb3ltZW50IGNlbnRlcnMNCg0K4oCiIHJhZDogaW5kZXggb2YgYWNjZXNzaWJpbGl0eSB0byByYWRpYWwgaGlnaHdheXMNCg0K4oCiIHRheDogZnVsbC12YWx1ZSBwcm9wZXJ0eS10YXggcmF0ZSBwZXIgVVNEIDEwLDAwMA0KDQrigKIgcHRyYXRpbzogcHVwaWwtdGVhY2hlciByYXRpbyBieSB0b3duDQoNCuKAoiBsc3RhdDogcGVyY2VudGFnZSBvZiBsb3dlciBzdGF0dXMgb2YgdGhlIHBvcHVsYXRpb24NCg0KDQoxLiBJcyB0aGlzIGEgc3VwZXJ2aXNlZCBvciB1bnN1cGVydmlzZWQgbGVhcm5pbmcgcHJvYmxlbT8gV2h5Pw0KDQpUaGlzIGlzIGEgc3VwZXJ2aXNlZCBsZWFybmluZyBwcm9ibGVtIGJlY2F1c2Ugd2UgYSB1c2luZyBhIGxhYmVsZWQgZGF0YSBzZXQgKGNtZWR2KQ0KDQoyLiBUaGVyZSBhcmUgMTYgdmFyaWFibGVzIGluIHRoaXMgZGF0YSBzZXQuIFdoaWNoIHZhcmlhYmxlIGlzIHRoZSByZXNwb25zZSB2YXJpYWJsZSBhbmQgd2hpY2ggdmFyaWFibGVzIGFyZSB0aGUgcHJlZGljdG9yIHZhcmlhYmxlcyAoYWthIGZlYXR1cmVzKT8NCg0KVGhlIHJlc3BvbnNlIHZhcmlhYmxlIGlzICJjbWVkdiIgdGhlIHByZWRpY3RvciB2YXJpYWJsZXMgYXJlIGFsbCBvdGhlciB2YXJpYWJsZXMgZnJvbSB0aGUgZGF0YXNldCAobG9uLCBsYXQsIGNyaW0sIHpuLCBpbmR1cywgY2hhcywgbm94LCBybSwgYWdlLCBkaXMsIHJhZCwgdGF4LCBwdHJhdGlvLCBhbmQgbHN0YXQuKQ0KDQozLiBHaXZlbiB0aGUgdHlwZSBvZiB2YXJpYWJsZSBjbWVkdiBpcywgaXMgdGhpcyBhIHJlZ3Jlc3Npb24gb3IgY2xhc3NpZmljYXRpb24gcHJvYmxlbT8NCg0KVGhpcyBpcyBhIHJlZ3Jlc3Npb24gcG9yYmxlbSBiZWNhdXNlIGNtZWR2IGlzIGEgY29udGlub3VzIHZhcmlhYmxlIA0KDQo0LiBGaWxsIGluIHRoZSBibGFua3MgdG8gaW1wb3J0IHRoZSBCb3N0b24gaG91c2luZyBkYXRhIHNldCAoYm9zdG9uLmNzdikuIEFyZSB0aGVyZSBhbnkgbWlzc2luZyB2YWx1ZXM/IFdoYXQgaXMgdGhlIG1pbmltdW0gYW5kIG1heGltdW0gdmFsdWVzIG9mIGNtZWR2PyBXaGF0IGlzIHRoZSBhdmVyYWdlIGNtZWR2IHZhbHVlPyAgDQoNCmBgYHtyfQ0KYm9zdG9uIDwtIHJlYWRyOjpyZWFkX2NzdigiYm9zdG9uLmNzdiIpDQpgYGANCmBgYHtyfQ0KIyNNaXNzaW5nIFZhcmlhYmxlcw0Kc3VtKGlzLm5hKGJvc3RvbikpDQpgYGANCmBgYHtyfQ0KIyMgTWluLCBNYXgsICYgTWVkaWFuDQpzdW1tYXJ5KGJvc3RvbiRjbWVkdikNCg0KYGBgDQpgYGB7cn0NCiMjIyBNaW4gPTUsIE1heCA9IDUwLCBNZWRpYW4gPSAyMS4yMCwgQXZlcmFnZSA9IDIyLjUzDQpgYGANCg0KU3BsaXQgdGhlIGRhdGEgaW50byBhIHRyYWluaW5nIHNldCBhbmQgdGVzdCBzZXQgdXNpbmcgYSA3MC0zMCUgc3BsaXQuIEJlIHN1cmUgdG8NCmluY2x1ZGUgdGhlIHNldC5zZWVkKDEyMykgc28gdGhhdCB5b3VyIHRyYWluIGFuZCB0ZXN0IHNldHMgYXJlIHRoZSBzYW1lIHNpemUgYXMgbWluZQ0KDQpgYGB7cn0NCnNldC5zZWVkKDEyMykNCnNwbGl0IDwtIGluaXRpYWxfc3BsaXQoYm9zdG9uLCBwcm9wID0gMC43LCBzdHJhdGEgPSBjbWVkdikNCnRyYWluIDwtIHRyYWluaW5nKHNwbGl0KQ0KdGVzdCA8LSB0ZXN0aW5nKHNwbGl0KQ0KYGBgDQoNCjYuIEhvdyBtYW55IG9ic2VydmF0aW9ucyBhcmUgaW4gdGhlIHRyYWluaW5nIHNldCBhbmQgdGVzdCBzZXQ/DQoNCmBgYHtyfQ0Kc3BsaXQNCmBgYA0KYGBge3J9DQpnZ3Bsb3QobWFwcGluZyA9IGFlcyh4ID0gY21lZHYpKSArDQogIGdlb21faGlzdG9ncmFtKGRhdGEgPSB0cmFpbiwgYmluZHdpZHRoID0gMSwgZmlsbCA9ICJwdXJwbGUiLCBhbHBoYSA9IDAuNSkgKw0KICBnZW9tX2hpc3RvZ3JhbShkYXRhID0gdGVzdCwgYmluZHdpZHRoID0gMSwgZmlsbCA9ImdyZWVuIiwgYWxwaGEgPSAwLjUpDQpgYGANCg0KRml0IGEgbGluZWFyIHJlZ3Jlc3Npb24gbW9kZWwgdXNpbmcgdGhlIHJtIGZlYXR1cmUgdmFyaWFibGUuIFByZWRpY3QgY21lZHYgYW5kIGNvbXB1dGUgUk1TRSBvbiB0aGUgdGVzdCBkYXRhLiBXaGF0IGlzIHRoZSB0ZXN0IHNldCBSTVNFPw0KDQpgYGB7cn0NCiNGaXQgbW9kZWwNCmxtbCA8LSBsaW5lYXJfcmVnKCkgJT4lDQogIGZpdChjbWVkdiB+IHJtLCBkYXRhID0gdHJhaW4pDQpgYGANCg0KYGBge3J9DQojI2NvbXB1dGUgdGhlIFJNU0Ugb24gdGhlIHRlc3QgZGF0YQ0KbG1sICU+JQ0KICBwcmVkaWN0KHRlc3QpICU+JQ0KICBiaW5kX2NvbHModGVzdCAlPiUgc2VsZWN0KGNtZWR2KSkgJT4lDQogIHJtc2UodHJ1dGggPSBjbWVkdiwgZXN0aW1hdGUgPSAucHJlZCkNCg0KYGBgDQpGaWxsIGluIHRoZSBibGFua3MgdG8gZml0IGEgbGluZWFyIHJlZ3Jlc3Npb24gbW9kZWwgdXNpbmcgYWxsIGF2YWlsYWJsZSBmZWF0dXJlcyB0byBwcmVkaWN0IGNtZWR2IGFuZA0KY29tcHV0ZSB0aGUgUk1TRSBvbiB0aGUgdGVzdCBkYXRhLiBXaGF0IGlzIHRoZSB0ZXN0IHNldCBSTVNFPyBJcyB0aGlzIGJldHRlciB0aGFuIHRoZSBwcmV2aW91cw0KbW9kZWzigJlzIHBlcmZvcm1hbmNlPw0KDQpgYGB7cn0NCiMgRml0IE1vZGVsDQpsbTIgPC0gbGluZWFyX3JlZygpICU+JQ0KICBmaXQoY21lZHYgfiAuLCBkYXRhID0gdHJhaW4pDQpgYGANCg0KYGBge3J9DQojIGNvbXB1dGUgdGhlIFJNU0Ugb24gdGhlIHRlc3QgZGF0YQ0KbG0yICU+JQ0KICBwcmVkaWN0KHRlc3QpICU+JQ0KICBiaW5kX2NvbHModGVzdCAlPiUgc2VsZWN0KGNtZWR2KSkgJT4lDQogIHJtc2UodHJ1dGggPSBjbWVkdiwgZXN0aW1hdGUgPSAucHJlZCkNCmBgYA0KDQpGaXQgYSBLLW5lYXJlc3QgbmVpZ2hib3IgbW9kZWwgdGhhdCB1c2VzIGFsbCBhdmFpbGFibGUgZmVhdHVyZXMgdG8gcHJlZGljdCBjbWVkdiBhbmQgY29tcHV0ZSB0aGUNClJNU0Ugb24gdGhlIHRlc3QgZGF0YS4gV2hhdCBpcyB0aGUgdGVzdCBzZXQgUk1TRT8gSXMgdGhpcyBiZXR0ZXIgdGhhbiB0aGUgcHJldmlvdXMgdHdvIG1vZGVsc+KAmQ0KcGVyZm9ybWFuY2VzPw0KDQpgYGB7cn0NCiMgRml0IG1vZGVsDQprbm4gPC0gbmVhcmVzdF9uZWlnaGJvcigpICU+JQ0KICBzZXRfZW5naW5lKCJra25uIikgJT4lDQogIHNldF9tb2RlKCJyZWdyZXNzaW9uIikgJT4lDQogIGZpdChjbWVkdiB+IC4sIGRhdGEgPSB0cmFpbikNCmBgYA0KDQpgYGB7cn0NCiMgQ29tcHV0ZSB0aGUgUk1TRSBvbiB0aGUgdGVzdCBkYXRhDQprbm4gJT4lDQogIHByZWRpY3QodGVzdCkgJT4lDQogIGJpbmRfY29scyh0ZXN0ICU+JSBzZWxlY3QoY21lZHYpKSAlPiUNCiAgcm1zZSh0cnV0aCA9IGNtZWR2LCBlc3RpbWF0ZSA9IC5wcmVkKQ0KYGBgDQpUaGUgbmV3IFJNU0UgaXMgYmV0dGVyIHRoYW4gdGhlIHR3byBwcmV2aW91cyBtb2RlbHMNCg0K