As a reminder, to earn a badge for each lab, you are required to respond to a set of prompts in two parts:

Part I: Reflect and Plan

Part A:

  1. As in each of the previous two learning labs, please interpret the predictive accuracy of the model. How much better was it relative to the untuned model? Was tuning worth the investment?
  2. Please interpret the fit metrics other than accuracy. What do Kappa, sensitivity, specificity, and AUC tell us about the predictive accuracy of the model?
  3. Provide an APA citation for your selected study.

    • Liang, J., Li, C., & Zheng, L. (2016, August). Machine learning application in MOOCs: Dropout prediction. In 2016 11th International Conference on Computer Science & Education (ICCSE) (pp. 52-57). IEEE.
  4. What research questions were the authors of this study trying to address, and why did they consider these questions important?

    • How to predict student dropout in MOOCs
  5. What were the results of these analyses?

    • The authors used a supervised classification approach from machine learning and achieved 89% accuracy on the dropout prediction task with a gradient boosting decision tree model.

Part II: Data Product

This is likely to be the most challenging data product to date. Here, tune an even more complex model - a neural network - and evaluate how much more predictively accurate (if at all, for this data set!) this highly complex algorithm is.

The code we ran in the case study is below. Please modify this code to specify and tune a neural network.

A tutorial you can use as a starting point to adapt the code below is here.

Documentation for a neural network engine we recommend - keras - is here.
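
If you choose the keras engine, note that it requires a one-time setup: the keras R package plus a Python backend (TensorFlow). A minimal setup sketch, assuming neither is installed yet:

# one-time setup for the keras engine (installs the TensorFlow backend)
install.packages("keras")
keras::install_keras()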

Prepare

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom        0.7.12     ✔ rsample      1.0.0 
## ✔ dials        1.0.0      ✔ tune         1.0.0 
## ✔ infer        1.0.2      ✔ workflows    1.0.0 
## ✔ modeldata    1.0.0      ✔ workflowsets 0.2.1 
## ✔ parsnip      1.0.0      ✔ yardstick    1.0.0 
## ✔ recipes      1.0.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(here)
## here() starts at /Users/meinazhu/Documents/GitHub/machine-learning
library(ranger) # this is needed for the random forest algorithm
library(vip) # a new package we're adding for variable importance measures
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
#d <- read_csv(here("data", "ngsschat-processed-data.csv"))

d <- read_csv("data/ngsschat-processed-data-add-three-features.csv")
## Rows: 3793 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): code
## dbl (7): mean_favorite_count, sum_favorite_count, mean_retweet_count, sum_re...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d <- d %>% 
    mutate(code = as.factor(code)) # the outcome must be a factor for classification

Split data

train_test_split <- initial_split(d, prop = .80)
data_train <- training(train_test_split)

kfcv <- vfold_cv(data_train, v = 10) # again, we will use resampling (10-fold cross-validation)

Engineer features

my_rec <- recipe(code ~ ., data = data_train) %>% 
    step_normalize(all_numeric_predictors()) %>%
    step_nzv(all_predictors())

Specify recipe, model, and workflow

# specify model
my_mod <-
    rand_forest(mtry = tune(), # this specifies that we'll take steps later to tune the model
                min_n = tune(), 
                trees = tune()) %>%
    set_engine("ranger", importance = "impurity") %>%
    set_mode("classification")

# specify workflow
my_wf <-
    workflow() %>%
    add_model(my_mod) %>% 
    add_recipe(my_rec)
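
For this lab's data product, the parallel neural network specification might look like the sketch below: mlp() is parsnip's single-hidden-layer neural network (a multilayer perceptron), and hidden_units, penalty, and epochs are its main tunable arguments. The object names nn_mod and nn_wf are ours, not from the case study.

# specify a single-hidden-layer neural network (a sketch, not the only option)
nn_mod <-
    mlp(hidden_units = tune(), # number of units in the hidden layer
        penalty = tune(),      # amount of weight decay (regularization)
        epochs = tune()) %>%   # number of training iterations
    set_engine("keras") %>%
    set_mode("classification")

# the same recipe can be reused in a new workflow
nn_wf <-
    workflow() %>%
    add_model(nn_mod) %>%
    add_recipe(my_rec)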

Fit model

# specify tuning grid
finalize(mtry(), data_train)
## # Randomly Selected Predictors (quantitative)
## Range: [1, 8]
finalize(min_n(), data_train)
## Minimal Node Size (quantitative)
## Range: [2, 40]
finalize(trees(), data_train)
## # Trees (quantitative)
## Range: [1, 2000]
tree_grid <- grid_max_entropy(mtry(range = c(1, 5)),
                              min_n(range = c(2, 40)),
                              trees(range = c(1, 500)),
                              size = 10)
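
The neural network's tuning parameters have their own dials functions: hidden_units(), penalty(), and epochs(). A possible grid, with illustrative (not prescriptive) ranges; note that penalty() is specified on the log10 scale:

# a space-filling grid for the neural network's tuning parameters
nn_grid <- grid_max_entropy(hidden_units(range = c(1, 10)),
                            penalty(range = c(-10, 0)), # log10 scale
                            epochs(range = c(10, 100)),
                            size = 10)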

# fit model with tune_grid
fitted_model <- my_wf %>% 
    tune_grid(
        resamples = kfcv,
        grid = tree_grid,
        metrics = metric_set(roc_auc, accuracy, kap, sensitivity, specificity, precision)
    )
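
Fitting the neural network is then the same tune_grid() call, swapping in the workflow and grid sketched above:

# fit the neural network across the same resamples (a sketch)
nn_fitted <- nn_wf %>%
    tune_grid(
        resamples = kfcv,
        grid = nn_grid,
        metrics = metric_set(roc_auc, accuracy, kap, sensitivity, specificity, precision)
    )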

Fit model (part 2)

# examine the best sets of tuning parameters; repeat tuning if needed
show_best(fitted_model, n = 10, metric = "accuracy")
## # A tibble: 10 × 9
##     mtry trees min_n .metric  .estimator  mean     n std_err .config            
##    <int> <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>              
##  1     2   482    20 accuracy binary     0.888    10 0.00626 Preprocessor1_Mode…
##  2     4    95    39 accuracy binary     0.887    10 0.00666 Preprocessor1_Mode…
##  3     2   263    38 accuracy binary     0.887    10 0.00567 Preprocessor1_Mode…
##  4     1   148    24 accuracy binary     0.887    10 0.00462 Preprocessor1_Mode…
##  5     4   456    40 accuracy binary     0.885    10 0.00644 Preprocessor1_Mode…
##  6     3    15    25 accuracy binary     0.884    10 0.00664 Preprocessor1_Mode…
##  7     5    59    15 accuracy binary     0.876    10 0.00682 Preprocessor1_Mode…
##  8     5   492    11 accuracy binary     0.873    10 0.00640 Preprocessor1_Mode…
##  9     4   273     6 accuracy binary     0.872    10 0.00589 Preprocessor1_Mode…
## 10     3    48     3 accuracy binary     0.869    10 0.00423 Preprocessor1_Mode…
# select best set of tuning parameters
best_tree <- fitted_model %>% select_best(metric = "accuracy")

# finalize workflow with best set of tuning parameters
final_wf <- my_wf %>% 
    finalize_workflow(best_tree)

final_fit <- final_wf %>% 
    last_fit(train_test_split, metrics = metric_set(roc_auc, accuracy, kap, sensitivity, specificity, precision))

final_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits             id               .metrics .notes   .predictions .workflow 
##   <list>             <chr>            <list>   <list>   <list>       <list>    
## 1 <split [3034/759]> train/test split <tibble> <tibble> <tibble>     <workflow>

Evaluate accuracy

# fit stats
final_fit %>%
    collect_metrics()
## # A tibble: 6 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.876 Preprocessor1_Model1
## 2 kap         binary         0.751 Preprocessor1_Model1
## 3 sensitivity binary         0.889 Preprocessor1_Model1
## 4 specificity binary         0.866 Preprocessor1_Model1
## 5 precision   binary         0.844 Preprocessor1_Model1
## 6 roc_auc     binary         0.944 Preprocessor1_Model1
# variable importance plot
final_fit %>% 
    pluck(".workflow", 1) %>%   
    extract_fit_parsnip() %>% 
    vip(num_features = 10)
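
One caveat if you adapt this step for the neural network: importance = "impurity" is specific to the ranger engine, and keras models don't provide a built-in importance measure. One alternative is vip's permutation-based importance; the sketch below is an assumption about how that call might look (the pred_wrapper and metric arguments in particular may need adjusting for your setup):

# permutation-based importance for an engine without built-in importance (a sketch)
final_fit %>%
    pluck(".workflow", 1) %>% # the fitted workflow, so the recipe is applied before predicting
    vip(method = "permute",
        train = data_train,
        target = "code",
        metric = "accuracy",
        pred_wrapper = function(object, newdata) predict(object, newdata)$.pred_class,
        num_features = 10)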

Did specifying and tuning a neural network make any difference compared to the mean predictive accuracy you found in the case study? Add a few notes below:

Knit & Submit

Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:

  1. Change the name of the author: in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.

  3. Commit your changes in GitHub Desktop and push them to your online GitHub repository.

  4. Publish your HTML page to the web using one of the following publishing methods:

    • Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note: you will need to quickly create an RPubs account.

    • Publish on GitHub using either GitHub Pages or the HTML previewer.

  5. Post a new discussion on GitHub to our ML badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.