Quiz

Visually Exploring Data

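library(ggplot2)
# Assumes dfC is already loaded (its columns match the Carseats data from the ISLR package)
ggplot(dfC, aes(Sales, fill=ShelveLoc)) +
  geom_histogram()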

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(dfC, aes(CompPrice, fill=ShelveLoc)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(dfC, aes(Income, fill=ShelveLoc)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(dfC, aes(Age, fill=ShelveLoc)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Programming Quiz

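library(dplyr)   # mutate(), sample_frac(), anti_join(), summarise()
library(modelr)  # add_predictions()

set.seed(1)
# Add a row id for the split and a 0/1 response: 1 = "Good" shelf location
dfC <- dfC |> mutate(id = row_number(), ShelveLoc01 = ifelse(ShelveLoc == "Good", 1, 0))
# 80/20 train/test split: test is every row not sampled into train
train <- dfC |> sample_frac(0.8)
test <- dfC |> anti_join(train, by = "id")
# Logistic regression of the 0/1 response on Sales, Income, and Age
glm_model <- glm(ShelveLoc01 ~ Sales + Income + Age, data = train, family = binomial)
rows <- nrow(test)
# Predict "Good" when the fitted probability exceeds 0.8, then compute the test error rate
test |>
  add_predictions(glm_model, var = "pred_prob", type = "response") |>
  mutate(prediction = ifelse(pred_prob > 0.8, 1, 0)) |>
  mutate(right = ifelse(prediction == ShelveLoc01, 1, 0)) |>
  summarise(error = 1 - sum(right) / rows)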
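The thresholding and error-rate step can be checked by hand on a tiny example. The sketch below is an illustration only: the probabilities and labels in `pred_prob` and `truth` are made-up toy values, not output from the model above, and `threshold`, `error_sum`, `error_mean`, and `conf_mat` are names chosen for this example.

```r
# Toy values, made up for illustration (not results from the analysis above)
pred_prob <- c(0.95, 0.85, 0.40, 0.10, 0.70)
truth     <- c(1, 0, 0, 0, 1)

# Classify as 1 only when the predicted probability exceeds the cutoff
threshold  <- 0.8
prediction <- ifelse(pred_prob > threshold, 1, 0)

# Error rate as one minus the share of correct predictions
right     <- ifelse(prediction == truth, 1, 0)
error_sum <- 1 - sum(right) / length(truth)

# Equivalent shortcut: the share of predictions that differ from the truth
error_mean <- mean(prediction != truth)

# Cross-tabulate predictions against the true labels
conf_mat <- table(predicted = prediction, actual = truth)

error_sum
error_mean
conf_mat
```

Both error calculations give the same value here (2 of 5 rows are misclassified). The table shows where the mistakes fall: with a cutoff as high as 0.8, a true 1 with a predicted probability of 0.70 is classified as 0.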

Programming Bonus

The predictors were selected from the visual exploration above. Sales shows a clear trend, with the “Good” shelf locations falling on one side of the histogram. Age seems to have a concentration of “Good” in the middle of its range. Income seems to be more spread out, with somewhat more “Good” at the lower end.
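A visual impression like this can also be checked numerically by comparing group summaries. The following is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical stand-ins and are not taken from `dfC`.

```r
# Hypothetical toy data standing in for dfC (values are made up)
toy <- data.frame(
  ShelveLoc = c("Bad", "Bad", "Medium", "Medium", "Good", "Good"),
  Sales     = c(4, 6, 7, 8, 10, 12),
  Age       = c(60, 40, 55, 45, 50, 52)
)

# Mean of each candidate predictor within each shelf-location group
group_means <- aggregate(cbind(Sales, Age) ~ ShelveLoc, data = toy, FUN = mean)
group_means
```

A predictor whose group means differ clearly, like `Sales` in this toy data, is a plausible candidate for the classifier. Group means will not reveal a pattern such as a concentration in the middle of the range, so the histograms remain useful alongside them.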

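library(dplyr)  # mutate(), sample_frac(), anti_join(), select()
library(class)  # knn()

set.seed(1)
# Add a row id for the split and a 0/1 response: 1 = "Good" shelf location
dfC <- dfC |> mutate(id = row_number(), ShelveLoc01 = ifelse(ShelveLoc == "Good", 1, 0))
# 80/20 train/test split
train <- dfC |> sample_frac(0.8)
test <- dfC |> anti_join(train, by = "id")
# Save the true labels, then reduce both sets to the three predictors
train_true <- train$ShelveLoc01
test_true <- test$ShelveLoc01
train <- train |> select(Sales, Income, Age)
test <- test |> select(Sales, Income, Age)
# 5-nearest-neighbour classification; the error is the share of misclassified test rows
knn_pred <- knn(train, test, train_true, k = 5)
error <- 1 - mean(knn_pred == test_true)
error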

## [1] 0.125
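KNN classifies by Euclidean distance, so a predictor measured on a larger scale can dominate the neighbour search. One common variation, which the analysis above does not use, is to standardize the predictors with the training set's means and standard deviations before calling `knn()`. The sketch below uses made-up values; `train_x`, `test_x`, `mu`, and `sigma` are names chosen for this example.

```r
# Toy predictors on very different scales (made-up values)
train_x <- data.frame(Sales = c(4, 8, 12), Income = c(30, 60, 120))
test_x  <- data.frame(Sales = c(6, 10),    Income = c(45, 90))

# Estimate centering and scaling from the training set only,
# so that no information from the test set leaks into the transformation
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same transformation to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

train_scaled
test_scaled
```

The scaled matrices could then be passed to `knn()` in place of the raw data frames. Whether this lowers the test error depends on the data, so the scaled and unscaled results are worth comparing.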

ISL-04C

Ethan Niser

2/13/2022
