Part 1

For this part of the lab, work in groups of 3-5. This will mainly be an in-class activity, but your group may be asked to share its thoughts with the rest of the class. Below are four real-life applications of supervised and unsupervised learning, along with their benefits and potential ethical concerns:

Supervised Learning: Personalized Product Recommendations on Amazon

Benefits: Machine learning algorithms can analyze customers’ browsing and purchasing history to provide personalized product recommendations, which can improve the shopping experience and increase customer satisfaction and loyalty. This can also help Amazon increase sales and revenue.
Supervised Problem: The target variable is the probability of a customer purchasing a particular product, and the features can include browsing and purchasing history, demographics, and user behavior. This data is collected through customer transactions and activity on the website.
Ethical Concerns: Data privacy and the potential for algorithmic bias are concerns in using machine learning for personalized recommendations.

Unsupervised Learning: Fraud Detection in Banking

Benefits: Machine learning algorithms can detect fraudulent transactions, which can save banks money and protect customers from financial harm. This can also increase customer trust in the bank’s security measures.
Unsupervised Problem: The algorithm looks for patterns in transaction data and flags transactions that are anomalous and therefore likely to be fraudulent. No target variable is defined (transactions are not labeled as fraudulent or legitimate), so the algorithm discovers the patterns by itself. The features can include transaction amount, location, and user history.
Ethical Concerns: False positives can lead to innocent customers being flagged as fraudsters, and false negatives can lead to fraudulent transactions being missed.
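
To make the unsupervised framing concrete, here is a minimal sketch of label-free anomaly flagging in base R. The transaction amounts are made up, and the robust z-score with a cutoff of 3.5 is an assumed rule of thumb chosen for illustration; real fraud systems are far more elaborate.

```r
# Hypothetical transaction amounts (made-up values, not real bank data)
amounts <- c(25, 40, 32, 18, 51, 29, 37, 44, 22, 950)

# Robust z-score built from the median and the median absolute deviation.
# No fraud/legitimate labels are used anywhere, which is what makes this
# an unsupervised approach.
robust_z <- (amounts - median(amounts)) / mad(amounts)

# Flag transactions far from the bulk of the data
# (3.5 is an assumed cutoff, not a standard fixed by any library)
flagged <- abs(robust_z) > 3.5

amounts[flagged]
```

Lowering the cutoff flags more transactions (more false positives), while raising it lets more unusual transactions through (more false negatives), which is the trade-off described in the ethical concerns above.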

Supervised Learning: Language Translation on Google Translate

Benefits: Machine learning algorithms can translate text from one language to another, which can increase access to information and facilitate communication across language barriers.
Supervised Problem: The target variable is the translated text, and the features can include the original text, language, and context. This data is collected from translated texts and user feedback.
Ethical Concerns: The accuracy of translations and the potential for algorithmic bias are concerns in using machine learning for language translation.

Unsupervised Learning: Traffic Analysis on Google Maps

Benefits: Machine learning algorithms can analyze traffic patterns and predict the fastest route for drivers, which can save time and reduce traffic congestion.
Unsupervised Problem: The algorithm identifies patterns in traffic data to predict which roads are congested and which ones are clear. The target variable is not defined, and the algorithm discovers the patterns by itself. The features can include traffic flow, speed, and location data.
Ethical Concerns: The accuracy of traffic predictions and the potential for data privacy violations are concerns in using machine learning for traffic analysis.

Part 2

For this part of the lab, you can still work in groups, but you’ll need to complete your own lab quiz and submit your own code.

For this exercise we’ll use the Boston housing data set, which is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. It was originally published in Harrison Jr and Rubinfeld (1978).

The purpose of this data set is to predict the median value of owner-occupied homes for various census tracts in the Boston area. Each row (observation) represents a given census tract and the variable we wish to predict is cmedv (median value of owner-occupied homes in USD 1000’s). The remaining variables are the features we will use to help predict cmedv. They include:

lon: longitude of census tract
lat: latitude of census tract
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq. ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration (parts per 10 million), i.e., air pollution
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate per USD 10,000
ptratio: pupil-teacher ratio by town
lstat: percentage of lower status of the population

Prerequisites:

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

## ✔ broom        1.0.3     ✔ recipes      1.0.5
## ✔ dials        1.1.0     ✔ rsample      1.1.1
## ✔ dplyr        1.1.0     ✔ tibble       3.1.8
## ✔ ggplot2      3.4.1     ✔ tidyr        1.3.0
## ✔ infer        1.0.4     ✔ tune         1.0.1
## ✔ modeldata    1.1.0     ✔ workflows    1.1.3
## ✔ parsnip      1.0.4     ✔ workflowsets 1.0.0
## ✔ purrr        1.0.1     ✔ yardstick    1.1.0

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages

Modeling Tasks

Is this a supervised or unsupervised learning problem? Why?

This is a supervised learning problem because we have a target variable (cmedv) that we want to predict based on the other variables (features).

There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)?

Response variable (target variable): cmedv

Predictor variables (features): the remaining 15 columns, including lon, lat, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, and lstat.

Given the type of variable cmedv is, is this a regression or classification problem?

Since cmedv is a continuous numerical variable, this is a regression problem.

Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What are the minimum and maximum values of cmedv? What is the average cmedv value?

# Import the Boston housing data set
boston <- readr::read_csv("boston.csv")

## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check for missing values: no missing values
sum(is.na(boston))

## [1] 0
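
A single total of 0 settles the question here, but when the total is non-zero it helps to see where the gaps are. The sketch below uses a small made-up data frame (`toy` is a hypothetical name, not the Boston data) to show per-column and per-row checks in base R.

```r
# Toy data frame with made-up values (not the Boston data)
toy <- data.frame(
  rm    = c(6.5, NA, 5.9, 7.1),
  cmedv = c(24.0, 21.6, NA, 33.4),
  chas  = c(0, 1, 0, 0)
)

# Total number of missing cells, same call as in the lab code
total_missing <- sum(is.na(toy))

# Missing cells broken down by column
missing_by_col <- colSums(is.na(toy))

# Number of rows with no missing values at all
n_complete <- sum(complete.cases(toy))

total_missing
missing_by_col
n_complete
```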

# Check minimum, maximum, and average cmedv values
# Minimum: 5
# Maximum: 50
# Median: 21.20
# Average: 22.53
summary(boston$cmedv)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include set.seed(123) so that the random split is reproducible and your train and test sets match mine.
```
set.seed(123)
boston_split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(boston_split)
test <- testing(boston_split)
```

How many observations are in the training set and test set?

# Check for number of observations
# Training: 352
# Test: 154
boston_split

## <Training/Testing/Total>
## <352/154/506>
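
Note that 0.7 × 506 = 354.2, so the training count of 352 is slightly below a plain 70% cut. This is probably because stratified sampling draws within bins of cmedv rather than from the data as a whole. The sketch below illustrates the idea in base R on a synthetic outcome. It is a simplified illustration under assumed quartile bins, not the rsample implementation, and its counts will not match the lab's.

```r
set.seed(123)

# Synthetic outcome standing in for cmedv (made-up values)
y <- round(runif(200, min = 5, max = 50), 1)

# Bin the outcome into quartiles
bins <- cut(y,
            breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Within each bin, sample roughly 70% of the row indices for training
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
}))

# Everything not selected for training goes to the test set
test_idx <- setdiff(seq_along(y), train_idx)

length(train_idx)
length(test_idx)
```

Because each bin contributes about 70% of its rows, the training and test sets end up with similar outcome distributions, which is the point of passing `strata = cmedv`.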

Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly?

# The distributions of cmedv in the training and test sets appear to be 
# similar.
ggplot(mapping = aes(x = cmedv)) +
  geom_histogram(data = train, binwidth = 1, fill = "blue", alpha = 0.5) +
  geom_histogram(data = test, binwidth = 1, fill = "red", alpha = 0.5)
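
Overlaid histograms are hard to compare by eye when the two sets differ in size, so a numeric check can back up the visual impression. The sketch below compares quantiles and runs a two-sample Kolmogorov-Smirnov test in base R. It uses synthetic stand-ins (`train_y`, `test_y` are hypothetical names) in place of `train$cmedv` and `test$cmedv`.

```r
set.seed(123)

# Synthetic stand-ins for train$cmedv and test$cmedv (made-up values)
train_y <- rnorm(352, mean = 22.5, sd = 9)
test_y  <- rnorm(154, mean = 22.5, sd = 9)

# Side-by-side quantiles of the two samples
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
comparison <- rbind(train = quantile(train_y, probs),
                    test  = quantile(test_y, probs))
round(comparison, 2)

# Two-sample Kolmogorov-Smirnov test: a large p-value gives no evidence
# that the two samples come from different distributions
ks <- ks.test(train_y, test_y)
ks$p.value
```

With the real cmedv values, `ks.test()` may warn about ties because many tracts share the same value. The quantile table is still a useful summary in that case.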

Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE?
```
# fit model
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)

# compute the RMSE on the test data
# Test set RMSE: 6.83
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
```
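
For reference, RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as cmedv (USD 1000's). The sketch below computes it by hand with base R's `lm()` on synthetic data. The variable names and the simulated relationship are assumptions for illustration, and the result is unrelated to the lab's 6.83.

```r
set.seed(123)

# Synthetic data standing in for rm and cmedv (made-up relationship)
n     <- 100
rooms <- rnorm(n, mean = 6.3, sd = 0.7)
value <- -34 + 9 * rooms + rnorm(n, sd = 6)
dat   <- data.frame(rooms = rooms, value = value)

# Simple 70/30 split by row position (the data are already random)
train_toy <- dat[1:70, ]
test_toy  <- dat[71:100, ]

# Fit on the training rows, predict on the held-out rows
fit  <- lm(value ~ rooms, data = train_toy)
pred <- predict(fit, newdata = test_toy)

# RMSE step by step: errors -> squared -> mean -> square root
errors      <- test_toy$value - pred
rmse_manual <- sqrt(mean(errors^2))
rmse_manual
```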
Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model’s performance?
```
# fit model
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)

# compute the RMSE on the test data
# Test set RMSE: 4.83, which is better than the previous model's performance.
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
```
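
When judging whether an RMSE is "good", it can help to compare it with a naive baseline that predicts the training-set mean for every test observation. The sketch below shows that comparison in base R on synthetic data. `root_mse` is a hypothetical helper defined here, and the numbers are not from the Boston data.

```r
set.seed(123)

# Hypothetical helper: root mean squared error
root_mse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Synthetic data (made-up relationship, not the Boston data)
n     <- 200
rooms <- rnorm(n, mean = 6.3, sd = 0.7)
value <- -34 + 9 * rooms + rnorm(n, sd = 3)
dat   <- data.frame(rooms = rooms, value = value)
train_toy <- dat[1:140, ]
test_toy  <- dat[141:200, ]

# Baseline: predict the training-set mean for every test row
rmse_baseline <- root_mse(test_toy$value, mean(train_toy$value))

# Linear model that uses the feature
fit        <- lm(value ~ rooms, data = train_toy)
rmse_model <- root_mse(test_toy$value, predict(fit, newdata = test_toy))

c(baseline = rmse_baseline, model = rmse_model)
```

A model is only adding value if its test RMSE is clearly below the baseline. The same comparison applies when moving from one feature to all features.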
Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models’ performances?
```
# fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)

# compute the RMSE on the test data
# Test set RMSE: 3.37, which is better than both previous models' performances.
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
```
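
To show what a nearest-neighbor regression does under the hood, here is a minimal base R sketch: standardize the features, find the k closest training rows by Euclidean distance, and average their outcomes. This is a simplified, unweighted illustration. The kknn engine is a weighted variant, so its predictions will generally differ. `knn_predict` and the tiny data set are hypothetical.

```r
# Unweighted k-nearest-neighbor regression (illustrative only)
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  # Standardize features using the training means and standard deviations
  mu <- colMeans(train_x)
  s  <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = mu, scale = s)
  new_z   <- scale(new_x, center = mu, scale = s)

  apply(new_z, 1, function(row) {
    # Euclidean distance from this query row to every training row
    d <- sqrt(rowSums(sweep(train_z, 2, row)^2))
    # Average the outcome of the k closest training rows
    mean(train_y[order(d)[1:k]])
  })
}

# Tiny made-up example with two features on very different scales
train_x <- cbind(rooms = c(5, 5.5, 6, 6.5, 7, 7.5),
                 tax   = c(400, 380, 300, 310, 250, 240))
train_y <- c(15, 18, 21, 24, 30, 35)
new_x   <- cbind(rooms = c(5.2, 7.3), tax = c(390, 245))

preds <- knn_predict(train_x, train_y, new_x, k = 2)
preds  # 16.5 32.5
```

Standardizing matters because distance-based methods are sensitive to scale: without it, a feature like tax (hundreds) would dominate a feature like rooms (single digits).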

Module 8 Lab
