Part 1

  1. Shopping on Amazon: Supervised learning is applied through data collected when users click on or purchase specific items; that data is used to build predictive models that target ads for those products to us. Watching shows on Netflix: Unsupervised learning is applied in recommending shows or movies; Netflix uses clustering algorithms to group users with similar preferences (a toy sketch of this idea follows this item). Using Waze for navigation: Supervised learning is applied for real-time traffic prediction; the app collects data on drivers' speed and location. Using my Google Home to play music: Supervised learning is applied in the speech recognition system to interpret the commands and questions I give it, while unsupervised learning clusters similar utterances together.
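To make the Netflix clustering idea concrete, here is a minimal sketch using base R's kmeans() on a made-up ratings matrix (the data and genre names are purely hypothetical):

# Cluster users by viewing preferences with k-means.
# The ratings matrix is fabricated for illustration only
# (rows = users, columns = genres).
set.seed(1)
ratings <- matrix(runif(20 * 3), nrow = 20,
                  dimnames = list(NULL, c("drama", "comedy", "scifi")))
clusters <- kmeans(ratings, centers = 3)
clusters$cluster  # cluster label per user; similar users share a label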

  2. Machine learning improves accuracy through algorithms that learn from data, automates repetitive tasks, improves decision making, and creates a personalized experience. These benefits apply to both users and organizations.

  3. Supervised learning is used for tasks that involve predicting a given output using the other variables in the data set. Unsupervised learning is used for identifying groups in a data set, whether those groups are across observations or features (a brief illustration follows).
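As a quick illustration of the distinction, using built-in R data (not part of the assignment): a supervised model predicts a labeled response, while an unsupervised method finds structure without one.

# Supervised: predict a known output (mpg) from the other columns
supervised <- lm(mpg ~ ., data = mtcars)
# Unsupervised: find groups of similar cars, with no labeled response
unsupervised <- kmeans(scale(mtcars), centers = 3)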

  4. There could definitely be ethical issues and misuse when dealing with machine learning. Amazon: using demographic information to target advertisements can be seen as invasive, and relying on demographic data could also lead to discriminatory practices. Waze: the app could encourage unsafe driving, since drivers can enter their own traffic updates while on the road. Google Home: the speech recognition system could collect sensitive information about people, and that data could be retrieved if the device were ever hacked.

Part 2

Prerequisites:

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.2     ✔ recipes      1.0.5
## ✔ dials        1.1.0     ✔ rsample      1.1.1
## ✔ dplyr        1.1.0     ✔ tibble       3.1.8
## ✔ ggplot2      3.4.0     ✔ tidyr        1.2.1
## ✔ infer        1.0.4     ✔ tune         1.0.1
## ✔ modeldata    1.1.0     ✔ workflows    1.1.3
## ✔ parsnip      1.0.4     ✔ workflowsets 1.0.0
## ✔ purrr        1.0.1     ✔ yardstick    1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(kknn)

Modeling tasks:

  1. Is this a supervised or unsupervised learning problem? Why? • It is supervised because the data set contains a labeled response variable, cmedv, that we want to predict using the other variables.

  2. There are 16 variables in this data set. Which variable is the response variable and which variables are the predictor variables (aka features)? • Response: cmedv • Predictors: lon, lat, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat

  3. Given the type of variable cmedv is, is this a regression or classification problem? • Regression, because cmedv is a continuous numerical variable that we want to predict.

  4. Fill in the blanks to import the Boston housing data set (boston.csv). Are there any missing values? What is the minimum and maximum values of cmedv? What is the average cmedv value? • There are no missing values; cmedv ranges from a minimum of 5.00 to a maximum of 50.00, with an average of 22.53 (in USD 1000's). See the explicit check after the summary output below.

boston <- readr::read_csv("boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(boston$cmedv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00
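To confirm the answer explicitly rather than reading it off summary(), a quick check in base R:

# Explicitly verify missingness and the cmedv statistics
sum(is.na(boston))   # 0 -> no missing values anywhere in the data
range(boston$cmedv)  # 5.0 to 50.0
mean(boston$cmedv)   # about 22.53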

Data splitting and model fitting:

  5. Fill in the blanks to split the data into a training set and test set using a 70-30% split. Be sure to include the set.seed(123) so that your train and test sets are the same size as mine.
set.seed(123)
split <- initial_split(boston, prop = .7, strata = cmedv)
train <- training(split)
test <- testing(split)
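A quick sanity check of the resulting split sizes (the test count is simply the remaining 506 - 352 = 154 rows):

# Verify the 70-30 split sizes
nrow(train)  # 352
nrow(test)   # 154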
  6. How many observations are in the training set and test set? • 352 observations in the training set and 154 in the test set.

  7. Compare the distribution of cmedv between the training set and test set. Do they appear to have the same distribution or do they differ significantly? • They appear to have a similar distribution.

library(ggplot2)
ggplot() + 
  geom_density(data = train, aes(x = cmedv), fill = "blue", alpha = 0.5) +
  geom_density(data = test, aes(x = cmedv), fill = "red", alpha = 0.5) +
  labs(title = "Distribution of cmedv in Training and Test Sets",
       x = "cmedv (in USD 1000's)", y = "Density")

  8. Fill in the blanks to fit a linear regression model using the rm feature variable to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? • The test set RMSE is 6.83.
# fit model
lm1 <- linear_reg() %>%
  fit(cmedv ~ rm, data = train)
# compute the RMSE on the test data
lm1 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.83
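If you want to inspect the fitted relationship itself, broom's tidy() (attached with tidymodels) should work on the parsnip fit:

# Look at the intercept and the slope on rm
tidy(lm1)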
  9. Fill in the blanks to fit a linear regression model using all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous model's performance? • The test set RMSE is 4.83, an improvement over the single-feature model's 6.83.
# fit model
lm2 <- linear_reg() %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
lm2 %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83
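To line the two results up side by side, here is a small helper; the function name rmse_for is my own, purely for illustration:

# Compute the test-set RMSE for a fitted parsnip model
rmse_for <- function(fit) {
  fit %>%
    predict(test) %>%
    bind_cols(test %>% select(cmedv)) %>%
    rmse(truth = cmedv, estimate = .pred) %>%
    pull(.estimate)
}
tibble(model = c("lm1 (rm only)", "lm2 (all features)"),
       test_rmse = c(rmse_for(lm1), rmse_for(lm2)))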
  10. Fit a K-nearest neighbor model that uses all available features to predict cmedv and compute the RMSE on the test data. What is the test set RMSE? Is this better than the previous two models' performances? • The test set RMSE is 3.37, the best of the three models.
# fit model
knn <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(cmedv ~ ., data = train)
# compute the RMSE on the test data
knn %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        3.37
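The fit above leaves the number of neighbors at the engine default (parsnip uses 5 for the kknn engine, if I recall correctly). A quick manual sweep over a few values of k would show whether a different neighborhood size helps; this is a sketch, not a proper tuning workflow like tune_grid():

# Try a few neighborhood sizes and report each test-set RMSE
for (k in c(3, 5, 10, 20)) {
  fit_k <- nearest_neighbor(neighbors = k) %>%
    set_engine("kknn") %>%
    set_mode("regression") %>%
    fit(cmedv ~ ., data = train)
  res <- fit_k %>%
    predict(test) %>%
    bind_cols(test %>% select(cmedv)) %>%
    rmse(truth = cmedv, estimate = .pred)
  cat("k =", k, "-> test RMSE:", round(res$.estimate, 2), "\n")
}
# Note: choosing k by test-set RMSE risks overfitting to the test set;
# proper tuning would use resampling on the training data.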