Part 1

For this part of the lab, work in groups of 3-5. This will mainly be an in-class activity, but your group may be asked to share its thoughts with the rest of the class. The following are four real-life applications of supervised and unsupervised learning to discuss, along with their benefits and potential ethical concerns:

  1. Supervised Learning: Personalized Product Recommendations on Amazon
  2. Unsupervised Learning: Fraud Detection in Banking
  3. Supervised Learning: Language Translation on Google Translate
  4. Unsupervised Learning: Traffic Analysis on Google Maps

Part 2

For this part of the lab, you can still work in groups, but you’ll need to take your own lab quiz and submit your own code.

For this exercise we’ll use the Boston housing data set, which is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. It was originally published in Harrison Jr and Rubinfeld (1978).

The purpose of this data set is to predict the median value of owner-occupied homes for various census tracts in the Boston area. Each row (observation) represents a given census tract, and the variable we wish to predict is cmedv (median value of owner-occupied homes in USD 1000’s). The remaining variables are the features we will use to help predict cmedv.

Prerequisites:

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.3     ✔ recipes      1.0.5
## ✔ dials        1.1.0     ✔ rsample      1.1.1
## ✔ dplyr        1.1.0     ✔ tibble       3.1.8
## ✔ ggplot2      3.4.1     ✔ tidyr        1.3.0
## ✔ infer        1.0.4     ✔ tune         1.0.1
## ✔ modeldata    1.1.0     ✔ workflows    1.1.3
## ✔ parsnip      1.0.4     ✔ workflowsets 1.0.0
## ✔ purrr        1.0.1     ✔ yardstick    1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages

Modeling Tasks

  1. Is this a supervised or unsupervised learning problem? Why?

    This is a supervised learning problem because we have a target variable (cmedv) that we want to predict based on the other variables (features).

  2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)?

    Response variable (target variable): cmedv

    Predictor variables (features): the remaining 15 columns, including lon, lat, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, and lstat.

  3. Given the type of variable cmedv is, is this a regression or classification problem?

    Since cmedv is a continuous numerical variable, this is a regression problem.

  4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What are the minimum and maximum values of cmedv? What is the average cmedv value?

    # Import the Boston housing data set
    boston <- readr::read_csv("boston.csv")
    ## Rows: 506 Columns: 16
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
    ## 
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # Check for missing values: no missing values
    sum(is.na(boston))
    ## [1] 0
    # Check minimum, maximum, and average cmedv values
    # Minimum: 5
    # Maximum: 50
    # Median: 21.20
    # Average: 22.53
    summary(boston$cmedv)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    5.00   17.02   21.20   22.53   25.00   50.00
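
Had `sum(is.na(boston))` returned a nonzero count, a per-column breakdown would show where the gaps are. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not taken from boston.csv.

```r
# Hypothetical toy data with two missing values; not the Boston data
toy <- data.frame(
  rm    = c(6.5, NA, 5.9, 7.1),
  lstat = c(4.9, 9.1, NA, 5.3),
  cmedv = c(24.0, 21.6, 34.7, 33.4)
)

# Total number of missing cells across the whole data frame
total_missing <- sum(is.na(toy))

# Number of missing cells in each column
missing_by_col <- colSums(is.na(toy))

print(total_missing)
print(missing_by_col)
```

The same two calls work unchanged on `boston`, where every column count would be zero.
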

  5. Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include set.seed(123) so that your train and test sets match mine.

    set.seed(123)
    boston_split <- initial_split(boston, prop = 0.7, strata = cmedv)
    train <- training(boston_split)
    test <- testing(boston_split)

  6. How many observations are in the training set and test set?

    # Check for number of observations
    # Training: 352
    # Test: 154
    boston_split 
    ## <Training/Testing/Total>
    ## <352/154/506>
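
The `strata = cmedv` argument used in the split asks for training and test sets whose cmedv distributions are similar. The following is a rough base R sketch of that idea, assuming the outcome is binned into quartiles and about 70% of the rows are sampled within each bin. The simulated `y` is a hypothetical stand-in for cmedv, and this illustrates the concept rather than reproducing rsample's actual implementation.

```r
set.seed(123)

# Hypothetical numeric outcome standing in for cmedv
y <- runif(200, min = 5, max = 50)

# Assign each observation to a quartile bin (1 to 4)
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE, labels = FALSE)

# Within each bin, randomly pick about 70% of the row indices for training
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

# Everything not selected for training goes to the test set
test_idx <- setdiff(seq_along(y), train_idx)

length(train_idx)
length(test_idx)

# Quartiles of the outcome should look similar in both sets
quantile(y[train_idx])
quantile(y[test_idx])
```

Because sampling happens inside each bin, low, middle, and high values of the outcome all appear in both sets in roughly the same proportions.
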

  7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly?

    # The distributions of cmedv in the training and test sets appear to be 
    # similar.
    ggplot(mapping = aes(x = cmedv)) +
      geom_histogram(data = train, binwidth = 1, fill = "blue", alpha = 0.5) +
      geom_histogram(data = test, binwidth = 1, fill = "red", alpha = 0.5)

  8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?

    # fit model
    lm1 <- linear_reg() %>%
      fit(cmedv ~ rm, data = train)
    
    # compute the RMSE on the test data
    # Test set RMSE: 6.83
    lm1 %>%
      predict(test) %>%
      bind_cols(test %>% select(cmedv)) %>%
      rmse(truth = cmedv, estimate = .pred)
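
The RMSE reported by `rmse()` is the square root of the mean squared difference between the observed and predicted values. The following is a minimal sketch that computes it by hand in base R, using the built-in mtcars data as a stand-in for the Boston data (an assumption made only so the example runs on its own); the names `toy_train`, `toy_test`, and `toy_fit` are hypothetical.

```r
set.seed(123)

# mtcars stands in for the Boston data here (illustration only)
n <- nrow(mtcars)
train_rows <- sample.int(n, size = round(0.7 * n))
toy_train <- mtcars[train_rows, ]
toy_test  <- mtcars[-train_rows, ]

# Single-feature linear regression, analogous to cmedv ~ rm
toy_fit <- lm(mpg ~ wt, data = toy_train)

# Predict on the held-out rows
pred <- predict(toy_fit, newdata = toy_test)

# RMSE: square root of the mean squared prediction error
rmse_manual <- sqrt(mean((toy_test$mpg - pred)^2))
rmse_manual
```

Because the errors are squared before averaging, RMSE is expressed in the same units as the response and penalizes large misses more heavily than small ones.
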

  9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?

    # fit model
    lm2 <- linear_reg() %>%
      fit(cmedv ~ ., data = train)
    
    # compute the RMSE on the test data
    # Test set RMSE: 4.83, which is better than the previous model's performance.
    lm2 %>%
      predict(test) %>%
      bind_cols(test %>% select(cmedv)) %>%
      rmse(truth = cmedv, estimate = .pred)

  10. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performance?

    # fit model
    knn <- nearest_neighbor() %>%
      set_engine("kknn") %>%
      set_mode("regression") %>%
      fit(cmedv ~ ., data = train)
    
    # compute the RMSE on the test data
    # Test set RMSE: 3.37, which is better than both previous models' performance.
    knn %>%
      predict(test) %>%
      bind_cols(test %>% select(cmedv)) %>%
      rmse(truth = cmedv, estimate = .pred)
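
To make the K-nearest neighbor idea concrete, the following is a simplified base R sketch: each prediction is the plain average of the outcomes of the k closest training observations, measured by Euclidean distance on standardized features. The function `knn_predict` is a hypothetical helper written for this illustration, and mtcars again stands in for the Boston data. The kknn engine used above also weights neighbors by distance, so this sketch will not reproduce its predictions.

```r
# Simplified k-nearest-neighbor regression (hypothetical helper):
# unweighted mean of the k closest training outcomes
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Standardize both sets with the training means and standard deviations
  centers <- colMeans(train_x)
  scales  <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = centers, scale = scales)
  new_z   <- scale(new_x, center = centers, scale = scales)

  apply(new_z, 1, function(row) {
    # Euclidean distance from this new point to every training point
    dists   <- sqrt(colSums((t(train_z) - row)^2))
    nearest <- order(dists)[seq_len(k)]
    mean(train_y[nearest])
  })
}

# mtcars stands in for the Boston data (illustration only)
set.seed(123)
rows    <- sample.int(nrow(mtcars), size = 22)
x_cols  <- c("wt", "hp", "disp")
train_x <- as.matrix(mtcars[rows, x_cols])
test_x  <- as.matrix(mtcars[-rows, x_cols])

knn_pred <- knn_predict(train_x, mtcars$mpg[rows], test_x, k = 5)
knn_rmse <- sqrt(mean((mtcars$mpg[-rows] - knn_pred)^2))
knn_rmse
```

Standardizing first matters because features on large scales (such as tax in the Boston data) would otherwise dominate the distance calculation.
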