Part 1

  1. Shopping on Amazon: Supervised learning is applied through data collected when users click on or purchase specific items; that data is used to build predictive models that target ads for those products to us. Watching shows on Netflix: Unsupervised learning is applied in recommending shows or movies; Netflix uses clustering algorithms to group users with similar preferences (a toy sketch of this idea follows this item). Using Waze for navigation: Supervised learning is applied for real-time traffic prediction; the app collects data on drivers' speed and location. Using my Google Home to play music: Supervised learning is applied in the speech recognition system to interpret the commands and questions I give it, while unsupervised learning clusters similar utterances together.
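To make the Netflix clustering idea concrete, here is a minimal sketch using base R's kmeans() on a made-up ratings matrix (the data and genre names are purely hypothetical):

# Cluster users by viewing preferences with k-means.
# The ratings matrix is fabricated for illustration only
# (rows = users, columns = genres).
set.seed(1)
ratings <- matrix(runif(20 * 3), nrow = 20,
                  dimnames = list(NULL, c("drama", "comedy", "scifi")))
clusters <- kmeans(ratings, centers = 3)
clusters$cluster  # cluster label per user; similar users share a label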

  2. Machine learning improves accuracy through algorithms that learn from data, automates repetitive tasks, improves decision making, and creates a personalized experience. These benefits apply to both users and organizations.

  3. Supervised learning is used for tasks that involve predicting a given output using the other variables in the data set. Unsupervised learning is used for identifying groups in a data set, whether those groups are across observations or features (a brief illustration follows).
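As a quick illustration of the distinction, using built-in R data (not part of the assignment): a supervised model predicts a labeled response, while an unsupervised method finds structure without one.

# Supervised: predict a known output (mpg) from the other columns
supervised <- lm(mpg ~ ., data = mtcars)
# Unsupervised: find groups of similar cars, with no labeled response
unsupervised <- kmeans(scale(mtcars), centers = 3)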

  4. There could definitely be ethical issues and misuse when dealing with machine learning. Amazon: using demographic information to target advertisements can be seen as invasive, and relying on demographic data could also lead to discriminatory practices. Waze: the app could encourage unsafe driving, since drivers can enter their own traffic updates while on the road. Google Home: the speech recognition system could collect sensitive information about people, and that data could be retrieved if the device were ever hacked.

Part 2

Prerequisites:

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.2     ✔ recipes      1.0.5
## ✔ dials        1.1.0     ✔ rsample      1.1.1
## ✔ dplyr        1.1.0     ✔ tibble       3.1.8
## ✔ ggplot2      3.4.0     ✔ tidyr        1.2.1
## ✔ infer        1.0.4     ✔ tune         1.0.1
## ✔ modeldata    1.1.0     ✔ workflows    1.1.3
## ✔ parsnip      1.0.4     ✔ workflowsets 1.0.0
## ✔ purrr        1.0.1     ✔ yardstick    1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(kknn)

Modeling tasks:

  1. Is this a supervised or unsupervised learning problem? Why? • It is supervised because the data set contains a labeled response variable, cmedv, that we want to predict using the other variables.

  2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)? • Response: cmedv • Predictors: lon, lat, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat

  3. Given the type of variable cmedv is, is this a regression or classification problem? • Regression, because cmedv is a continuous numerical variable that we want to predict.

  4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What is the minimum and maximum values of cmedv? What is the average cmedv value? • There are no missing values; cmedv ranges from a minimum of 5.00 to a maximum of 50.00, with an average of 22.53 (in USD 1000's). See the explicit check after the summary output below.

boston <- readr::read_csv("boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(boston$cmedv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00
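To confirm the answer explicitly rather than reading it off summary(), a quick check in base R:

# Explicitly verify missingness and the cmedv statistics
sum(is.na(boston))   # 0 -> no missing values anywhere in the data
range(boston$cmedv)  # 5.0 to 50.0
mean(boston$cmedv)   # about 22.53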

Data splitting and model fitting:

  5. Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.
set.seed(123)
split <- initial_split(boston, prop = .7, strata = cmedv)
train <- training(split)
test <- testing(split)
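A quick sanity check of the resulting split sizes (the test count is simply the remaining 506 - 352 = 154 rows):

# Verify the 70-30 split sizes
nrow(train)  # 352
nrow(test)   # 154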
  6. How many observations are in the training set and test set? • 352 observations in the training set and 154 in the test set.

  7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly? • They appear to have a similar distribution.

library(ggplot2)
ggplot() + 
  geom_density(data = train, aes(x = cmedv), fill = "blue", alpha = 0.5) +
  geom_density(data = test, aes(x = cmedv), fill = "red", alpha = 0.5) +
  labs(title = "Distribution of cmedv in Training and Test Sets",
       x = "cmedv (in USD 1000's)", y = "Density")

  8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? • The test set RMSE is 6.83.
# fit model
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# compute the RMSE on the test data
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.83
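If you want to inspect the fitted relationship itself, broom's tidy() (attached with tidymodels) should work on the parsnip fit:

# Look at the intercept and the slope on rm
tidy(lm1)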
  9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model's performance? • The test set RMSE is 4.83, an improvement over the single-feature model's 6.83.
# fit model
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83
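To line the two results up side by side, here is a small helper; the function name rmse_for is my own, purely for illustration:

# Compute the test-set RMSE for a fitted parsnip model
rmse_for <- function(fit) {
  fit %>%
    predict(test) %>%
    bind_cols(test %>% select(cmedv)) %>%
    rmse(truth = cmedv, estimate = .pred) %>%
    pull(.estimate)
}
tibble(model = c("lm1 (rm only)", "lm2 (all features)"),
       test_rmse = c(rmse_for(lm1), rmse_for(lm2)))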
  10. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models' performances? • The test set RMSE is 3.37, the best of the three models.
# fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        3.37
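The fit above leaves the number of neighbors at the engine default (parsnip uses 5 for the kknn engine, if I recall correctly). A quick manual sweep over a few values of k would show whether a different neighborhood size helps; this is a sketch, not a proper tuning workflow like tune_grid():

# Try a few neighborhood sizes and report each test-set RMSE
for (k in c(3, 5, 10, 20)) {
  fit_k <- nearest_neighbor(neighbors = k) %>%
    set_engine("kknn") %>%
    set_mode("regression") %>%
    fit(cmedv ~ ., data = train)
  res <- fit_k %>%
    predict(test) %>%
    bind_cols(test %>% select(cmedv)) %>%
    rmse(truth = cmedv, estimate = .pred)
  cat("k =", k, "-> test RMSE:", round(res$.estimate, 2), "\n")
}
# Note: choosing k by test-set RMSE risks overfitting to the test set;
# proper tuning would use resampling on the training data.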