Shopping on Amazon: Supervised learning is applied to data collected when users click on or purchase specific items. That data is used to build predictive models that target ads for those products at us.
Watch shows on Netflix: Unsupervised learning is applied to recommend shows or movies to users. Netflix uses clustering algorithms to group users with similar preferences.
Use Waze for navigation: Supervised learning is applied to predict traffic in real time. The app collects data on people’s speed and location.
Use my Google Home to play music: Supervised learning is applied in the speech recognition system. When I give my Google Home a command or question, supervised learning is used to understand it. Unsupervised learning can also be used to cluster similar text together.
Machine learning improves prediction accuracy by learning patterns from data, automates repetitive tasks, supports better decision making, and creates a personalized experience. These benefits apply to both users and organizations.
Supervised learning is used for tasks that involve the prediction of a given output using other variables in the data set. Unsupervised learning is used for identifying groups in a data set, whether those groups are across observations or features.
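The distinction above can be sketched with base R and the built-in mtcars data. This is only an illustration: the choice of mtcars, the variables wt and hp, and three clusters are assumptions made for the example, not part of the assignment.

```r
# Supervised: there is a known output (mpg) that we predict from
# other variables in the data set (wt and hp).
fit <- lm(mpg ~ wt + hp, data = mtcars)
supervised_pred <- predict(fit, newdata = mtcars[1:3, ])

# Unsupervised: there is no output variable; we only look for groups
# of similar observations. Features are scaled so that neither one
# dominates the distance calculation.
set.seed(123)
clusters <- kmeans(scale(mtcars[, c("wt", "hp")]), centers = 3, nstart = 10)
table(clusters$cluster)
```

The supervised model can be scored against the known mpg values, while the cluster labels have no "correct answer" to compare against.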
Amazon:
Target variable: Purchase behavior
Feature variables: Product category, product brand, price, user demographics
Data collection: Through user activity
Netflix:
Target variable: User engagement
Feature variables: Genre, user demographics, viewing history
Data collection: Through user activity or third-party data providers
Waze:
Target variable: Traffic conditions and travel time
Feature variables: Real-time traffic data, accidents, user-reported data
Data collection: Through user location and user-reported incidents
Google Home:
Target variable: The words or command the user actually spoke (accurate voice recognition)
Feature variables: User speech and user location
Data collection: Through user speech patterns and voice recognition software. Data is also collected through the user’s device.
Prerequisites:
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.2 ✔ recipes 1.0.5
## ✔ dials 1.1.0 ✔ rsample 1.1.1
## ✔ dplyr 1.1.0 ✔ tibble 3.1.8
## ✔ ggplot2 3.4.0 ✔ tidyr 1.2.1
## ✔ infer 1.0.4 ✔ tune 1.0.1
## ✔ modeldata 1.1.0 ✔ workflows 1.1.3
## ✔ parsnip 1.0.4 ✔ workflowsets 1.0.0
## ✔ purrr 1.0.1 ✔ yardstick 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(kknn)
Is this a supervised or unsupervised learning problem? Why? • It is supervised because the other variables in the data set are used to predict a known output, cmedv.
There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)? • Response: cmedv • Predictors (the remaining 15 variables): lon, lat, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
Given the type of variable cmedv is, is this a regression or classification problem? • Regression, because cmedv is a continuous numerical variable that we want to predict.
Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What are the minimum and maximum values of cmedv? What is the average cmedv value?
boston <- readr::read_csv("boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(boston$cmedv)
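##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    5.00   17.02   21.20   22.53   25.00   50.00
• cmedv ranges from 5.00 to 50.00 with an average of 22.53. The summary reports no NA's for cmedv, so this column has no missing values.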
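summary() on a single column only speaks for that column. A minimal sketch of checking every column for missing values is shown here. The data frame toy is a hypothetical stand-in with made-up values so the snippet runs on its own; with the real data, boston would be used in its place.

```r
# Hypothetical stand-in for `boston` (values are invented for the example).
toy <- data.frame(
  cmedv = c(24.0, 21.6, NA, 33.4),
  rm    = c(6.58, 6.42, 7.19, NA)
)

total_missing  <- sum(is.na(toy))       # missing cells in the whole table
missing_by_col <- colSums(is.na(toy))   # missing cells per column

# Range and mean of the response, ignoring any missing entries.
cmedv_range <- range(toy$cmedv, na.rm = TRUE)
cmedv_mean  <- mean(toy$cmedv, na.rm = TRUE)

missing_by_col
```

A total of zero from sum(is.na(boston)) would confirm that no column in the real data contains missing values.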
set.seed(123)
split <- initial_split(boston, prop = .7, strata = cmedv)
train <- training(split)
test <- testing(split)
How many observations are in the training set and test set? • 352 in the training set and 154 in the test set (506 observations in total).
Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly? • They appear to have a similar distribution.
library(ggplot2)
ggplot() +
  geom_density(data = train, aes(x = cmedv), fill = "blue", alpha = 0.5) +
  geom_density(data = test, aes(x = cmedv), fill = "red", alpha = 0.5) +
  labs(title = "Distribution of cmedv in Training and Test Sets",
       x = "cmedv (in USD 1000's)", y = "Density")
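The visual comparison can be backed up with a few summary statistics per set. The sketch below uses a simulated vector y as a stand-in for cmedv and a simple (unstratified) 70/30 random split in base R, so it is not the stratified split produced by initial_split(); both are assumptions made so the snippet runs on its own. With the real data, train$cmedv and test$cmedv would take the place of train_y and test_y.

```r
# Simulated stand-in for cmedv (506 right-skewed positive values).
set.seed(123)
y <- rlnorm(506, meanlog = 3, sdlog = 0.4)

# Simple random 70/30 split by row index.
train_idx <- sample(length(y), size = round(0.7 * length(y)))
train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Mean and quartiles side by side: one column per set.
comparison <- sapply(
  list(train = train_y, test = test_y),
  function(v) c(mean = mean(v), quantile(v, c(0.25, 0.5, 0.75)))
)
round(comparison, 2)
```

If the two columns are close, the sets have similar distributions; large gaps in the mean or quartiles would suggest the split is unbalanced.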
# fit model
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# compute the RMSE on the test data
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.83
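For reference, RMSE is the square root of the average squared difference between the actual and predicted values. A minimal hand-written version is sketched below. The name rmse_manual and the three example values are hypothetical; the analysis itself uses the rmse() function loaded with tidymodels.

```r
# Root mean squared error computed directly from its definition.
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Made-up actual and predicted values, in the same units as cmedv.
truth    <- c(24.0, 21.6, 34.7)
estimate <- c(25.0, 20.6, 32.7)

# Errors are -1, 1 and 2, so the mean squared error is 2.
rmse_manual(truth, estimate)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result stays in the units of the response (thousands of USD for cmedv).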
# fit model
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83
# fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        3.37
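The three test-set RMSE values reported above can be collected into one table for comparison. This is a small sketch that copies the numbers from the outputs (6.83, 4.83, 3.37); the results data frame and the pct_improvement_vs_lm1 column are names introduced only for this example.

```r
# Test RMSE values copied from the model outputs above.
results <- data.frame(
  model = c("lm1: cmedv ~ rm", "lm2: cmedv ~ .", "knn: cmedv ~ ."),
  rmse  = c(6.83, 4.83, 3.37)
)

# Percent reduction in RMSE relative to the single-predictor model.
results$pct_improvement_vs_lm1 <-
  round(100 * (1 - results$rmse / results$rmse[1]), 1)

# Lowest RMSE first.
results[order(results$rmse), ]
```

On this split, the k-nearest neighbor model has the lowest test error, roughly half that of the regression on rm alone. Since cmedv is measured in thousands of USD, an RMSE of 3.37 corresponds to a typical miss of about $3,370.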