library(tidymodels)

Boston housing data set - the purpose is to predict the median value of owner-occupied homes for various census tracts in the Boston area

Modeling tasks

  1. Is this a supervised or unsupervised learning problem?
    This is a supervised learning problem because feature variables are used to predict a known target variable (cmedv).

  2. Which variable is the response variable and which are the predictor variables?
    Response Variable - cmedv (median value of owner-occupied homes in USD 1000s)
    Predictor Variables - lon, lat, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, lstat

  3. Is this a regression or classification problem?
    This is a regression problem because cmedv, the median home value, is a continuous numeric variable.

  4. Import the Boston housing data set

boston <- readr::read_csv('ML-data-1/boston.csv')
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
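
For a quick look at the imported variables, glimpse() can be used (a minimal sketch; glimpse() is re-exported by dplyr, which tidymodels loads):

# show each column's type and first few values
glimpse(boston)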

Are there any missing values?

sum(is.na(boston))
## [1] 0
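
A per-column breakdown can confirm that no single variable contains missing values (a minimal sketch using base R):

# count NAs in each column; all should be zero
colSums(is.na(boston))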

Minimum and maximum values of cmedv?

# Minimum
min(boston$cmedv)
## [1] 5
# Maximum 
max(boston$cmedv)
## [1] 50

Average cmedv value?

# Mean
mean(boston$cmedv)
## [1] 22.52885
# Median
median(boston$cmedv)
## [1] 21.2
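
As a sketch, the minimum, maximum, mean, and median above can also be pulled in a single call with base summary():

# min, quartiles, median, mean, and max of cmedv in one call
summary(boston$cmedv)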
  1. Split the data into a training set and test set using a 70-30% split.
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
  1. How many observations are in the training and testing sets?
# Training
count(train)
# Testing
count(test)
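
As a sketch, the same counts can also be obtained with nrow():

# number of rows in each partition
nrow(train)
nrow(test)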
  1. Compare the distribution of cmedv between the training set and the test set. Do they appear to have the same distribution or do they differ slightly?
ggplot(train, aes(x = cmedv)) +
  geom_line(stat = "density", trim = TRUE, col = "black") +
  geom_line(data = test, stat = "density", trim = TRUE, col = "red")

The training and test sets appear to have very similar distributions of cmedv.
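
As an additional check, summary statistics of cmedv can be compared across the two partitions (a minimal sketch using dplyr; the set column is a temporary label created here for illustration):

# stack the two partitions with a label, then summarise cmedv by partition
bind_rows(
  train %>% mutate(set = "train"),
  test %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  summarise(mean = mean(cmedv), median = median(cmedv), sd = sd(cmedv))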

  1. Fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?
# fit model
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)

# compute the RMSE on the test data
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

The RMSE on the test set is 6.83.
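
To inspect the fitted intercept and slope for rm, the coefficients can be extracted with tidy(), which tidymodels re-exports from broom (a minimal sketch):

# estimated coefficients for the single-predictor model
tidy(lm1)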

  1. Fit a linear regression model using all available feature variables to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?
# fit model
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)

# compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

The RMSE on the test set is 4.83. Since a lower RMSE indicates better predictive accuracy, this model performs better than the single-predictor model.
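
Other metrics can be reported alongside RMSE using a yardstick metric set (a minimal sketch computing RMSE and R-squared for the full model on the test data):

# evaluate RMSE and R-squared on the test set in one step
reg_metrics <- metric_set(rmse, rsq)

lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  reg_metrics(truth = cmedv, estimate = .pred)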

  1. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models' performance?
# fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)

# compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)

The RMSE on the test set is 3.37. Since a lower RMSE indicates better predictive accuracy, the K-nearest neighbor model outperforms both linear regression models.
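
The fit above uses the engine's default number of neighbors; as a sketch, that hyperparameter could be tuned with 5-fold cross-validation (the grid of neighbor values below is illustrative):

# specify a KNN model with a tunable number of neighbors
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# bundle the model and formula into a workflow
knn_wf <- workflow() %>%
  add_model(knn_spec) %>%
  add_formula(cmedv ~ .)

# 5-fold cross-validation on the training data, stratified by cmedv
set.seed(123)
folds <- vfold_cv(train, v = 5, strata = cmedv)

# evaluate an illustrative grid of neighbor values
knn_results <- knn_wf %>%
  tune_grid(resamples = folds, grid = tibble(neighbors = seq(2, 20, by = 2)))

# best-performing values of neighbors by cross-validated RMSE
show_best(knn_results, metric = "rmse")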