library(tidymodels)
Boston housing data set: the goal is to predict the median value of owner-occupied homes for census tracts in the Boston area.
Is this a supervised or unsupervised learning problem?
This is a supervised learning problem because it uses feature variables
to predict a known target variable (cmedv).
Which variable is the response variable and which are the
predictor variables?
Response Variable - cmedv (median value of owner-occupied homes in USD
1000s)
Predictor Variables - lon, lat, crim, zn, indus, chas, nox, rm, age,
dis, rad, tax, ptratio, b, lstat
Is this a regression or classification problem?
This is a regression problem because the response, cmedv (median home
value), is a continuous numeric variable.
Import the Boston housing data set
boston <- readr::read_csv('ML-data-1/boston.csv')
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Are there any missing values?
sum(is.na(boston))
## [1] 0
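A total of zero confirms that the data set is complete. When the total is not zero, a per-column count is usually more informative. A minimal sketch on a small made-up data frame (the name `toy` and its values are hypothetical, not the Boston data):

```r
# Hypothetical toy data with a few missing values (not the Boston data)
toy <- data.frame(
  rm    = c(6.5, NA, 5.9, 7.1),
  lstat = c(4.9, 9.1, NA, NA),
  cmedv = c(24.0, 21.6, 34.7, 33.4)
)

# Total number of missing cells, the same check as sum(is.na(boston))
total_missing <- sum(is.na(toy))

# Number of missing values in each column
missing_by_col <- colSums(is.na(toy))

total_missing
missing_by_col
```

The same two calls should work unchanged on `boston`, where every per-column count would be zero.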
Minimum and Maximum Value of cmedv?
# Minimum
min(boston$cmedv)
## [1] 5
# Maximum
max(boston$cmedv)
## [1] 50
Average cmedv value?
# Mean
mean(boston$cmedv)
## [1] 22.52885
# Median
median(boston$cmedv)
## [1] 21.2
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
# Training
count(train)
# Testing
count(test)
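The `strata = cmedv` argument asks for a split in which the training and test sets cover the range of cmedv in similar proportions. The idea can be sketched in base R: bin the numeric outcome into quantile groups, then sample 70% of the rows within each group. This is an illustration under assumptions, not the rsample implementation; the simulated outcome `y` and the choice of four bins are assumptions made for the example.

```r
set.seed(123)

# Hypothetical right-skewed numeric outcome standing in for cmedv
y <- rexp(400, rate = 1 / 20)

# Bin the outcome into quartiles (four bins is an assumption for this sketch)
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE, labels = FALSE)

# Sample 70% of the row indices within each bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  sample(idx, size = floor(0.7 * length(idx)))
}))
test_idx <- setdiff(seq_along(y), train_idx)

# Share of each bin that ended up in the training set
table(bins[train_idx]) / table(bins)
```

Because sampling happens inside each bin, low, middle, and high values of the outcome are all represented in both sets, which is the property the density plot is meant to check.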
# Overlay the cmedv density of the training set (black) and test set (red)
ggplot(train, aes(x = cmedv)) +
  geom_line(stat = "density", trim = TRUE, col = "black") +
  geom_line(data = test, stat = "density", trim = TRUE, col = "red")
The cmedv densities of the training set (black) and test set (red) appear to follow a similar distribution.
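The visual comparison can be backed up with numbers. One possible check, sketched here on simulated data, is to compare summary statistics side by side and run a two-sample Kolmogorov-Smirnov test from base R. The vectors `y_train` and `y_test` are hypothetical stand-ins for `train$cmedv` and `test$cmedv`, and the KS test is an addition for illustration, not part of the original analysis.

```r
set.seed(123)

# Hypothetical outcome values standing in for cmedv
y <- rgamma(500, shape = 6, rate = 0.27)

# Random 70/30 assignment of rows to the two sets
in_train <- sample(c(TRUE, FALSE), size = 500, replace = TRUE,
                   prob = c(0.7, 0.3))
y_train <- y[in_train]
y_test  <- y[!in_train]

# Side-by-side summary statistics
rbind(train = summary(y_train), test = summary(y_test))

# Two-sample Kolmogorov-Smirnov test: a large p-value gives no evidence
# that the two samples come from different distributions
ks <- ks.test(y_train, y_test)
ks$p.value
```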
# fit a simple linear regression of cmedv on rm
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# compute the RMSE on the test data
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
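The RMSE for the test set is 6.83. Because cmedv is measured in USD 1000s, this corresponds to a typical prediction error of about $6,830.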
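RMSE is the square root of the mean squared difference between the observed and predicted values. The following base R sketch computes it by hand on simulated data, as a way to see what `rmse()` reports. The data, the names `train_toy` and `test_toy`, and the unstratified split are all assumptions made for the example.

```r
set.seed(123)

# Hypothetical data: one predictor (like rm) and a noisy linear outcome
n <- 200
x <- runif(n, 4, 9)
y <- -30 + 8 * x + rnorm(n, sd = 5)
dat <- data.frame(x = x, y = y)

# Simple 70/30 split (unstratified, for brevity)
train_rows <- sample(n, size = round(0.7 * n))
train_toy <- dat[train_rows, ]
test_toy  <- dat[-train_rows, ]

# Fit on the training rows, predict on the held-out rows
fit_toy <- lm(y ~ x, data = train_toy)
pred <- predict(fit_toy, newdata = test_toy)

# RMSE: square root of the mean squared prediction error
rmse_manual <- sqrt(mean((test_toy$y - pred)^2))
rmse_manual
```

The result is in the same units as the outcome, which is why a test RMSE on cmedv can be read directly in thousands of dollars.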
# fit a linear regression of cmedv on all other variables
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
The RMSE for the test set is 4.83. This model performs better than the previous one: its test RMSE is lower (4.83 vs. 6.83), and a lower RMSE means smaller prediction errors on unseen data.
# fit a k-nearest neighbors regression of cmedv on all other variables
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
The RMSE for the test set is 3.37. This model performs better than both linear models: its test RMSE is the lowest of the three (3.37 vs. 4.83 and 6.83), so its predictions on unseen data are the most accurate.
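To make the idea behind k-nearest neighbors regression concrete, here is a deliberately simplified base R sketch: a new point is predicted as the plain average of the outcomes of its k closest training points. The function `knn_predict`, the simulated data, and k = 5 are hypothetical choices for illustration. The kknn engine is understood to also scale the predictors and weight neighbors by distance, which this sketch leaves out.

```r
set.seed(123)

# Hypothetical training data: two predictors and a numeric outcome
n <- 100
X <- cbind(x1 = runif(n), x2 = runif(n))
y <- 10 * X[, "x1"] + 5 * X[, "x2"] + rnorm(n, sd = 0.5)

# Predict one new point as the unweighted mean outcome of its k nearest
# training points (Euclidean distance)
knn_predict <- function(X_train, y_train, x_new, k = 5) {
  dists <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))
  nearest <- order(dists)[seq_len(k)]
  mean(y_train[nearest])
}

x_new <- c(x1 = 0.5, x2 = 0.5)
knn_predict(X, y, x_new, k = 5)
```

Unlike linear regression, this approach assumes no particular functional form between predictors and outcome, which is one plausible reason a KNN model can reach a lower test RMSE when the relationship is not linear.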