As a reminder, to earn a badge for each lab, you are required to respond to a set of prompts for two parts:

Part I: Reflect and Plan

Part A:

  1. As in each of the previous two learning labs, please interpret the predictive accuracy of the model. How much better was it relative to the untuned model? Was tuning worth the investment?
  2. Please interpret the fit metrics other than accuracy. What do Kappa, sensitivity, specificity, and AUC tell us about the predictive performance of the model? (A short worked example of how these metrics are computed follows this list.)
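As a quick refresher before writing your interpretation: accuracy, Kappa, sensitivity, and specificity are all computed from the confusion matrix, while AUC summarizes the ROC curve across classification thresholds. A minimal sketch with made-up counts (tp, fp, tn, and fn are hypothetical, purely for illustration):

# hypothetical confusion-matrix counts (for illustration only)
tp <- 40; fp <- 10; tn <- 35; fn <- 15

accuracy    <- (tp + tn) / (tp + fp + tn + fn) # overall proportion of correct predictions
sensitivity <- tp / (tp + fn)                  # proportion of actual positives correctly identified
specificity <- tn / (tn + fp)                  # proportion of actual negatives correctly identified

# Kappa compares observed accuracy with the accuracy expected by chance alone
p_chance <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (tp + fp + tn + fn)^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

# AUC is not computed from a single confusion matrix; it summarizes performance
# across all possible classification thresholds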

Part B: Once again, use the institutional library (e.g., NCSU Library), Google Scholar, or another search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions – and, ideally, one that utilizes a random forest.

  1. Provide an APA citation for your selected study.

    • Batool, S., Rashid, J., Nisar, M. W., Kim, J., Mahmood, T., & Hussain, A. (2021, July). A random forest students’ performance prediction (rfspp) model based on students’ demographic features. In 2021 Mohammad Ali Jinnah University International Conference on Computing (MAJICC) (pp. 1-4). IEEE.
  2. What research questions were the authors of this study trying to address and why did they consider these questions important?

    • How can final exam performance for students in MOOCs be predicted from a set of demographic features using a random forest? It seems the authors considered this question important because they wanted to apply a random forest and had a publicly available data set to work with.
  3. What were the results of these analyses?

    • They fit their random forest model to three different datasets that included demographic features and exam performance coded as pass/fail. They were able to achieve 81%, 95%, and 84% accuracy with these models, as measured by the F metric.

Part II: Data Product

This is likely to be the most challenging data product to date. Here, tune an even more complex model - a neural network - and evaluate how much more predictively accurate (if at all, for this data set!) this highly complex algorithm is.

The code we ran in the case study is below. Please modify this code to specify and tune a neural network.

You are also welcome to use another model type. The key here is that it's a more complex model that requires tuning parameters. Another option is a support vector machine (tutorial here); a sketch of both alternatives appears after the links below.

A tutorial you can use as a starting point to adapt the code below is here.

Documentation for a neural network engine we recommend - keras - is here.
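If you do try one of these alternatives, the main change is the model specification; everything else in the workflow below can stay largely the same. A minimal sketch, assuming the keras/tensorflow backend is installed for the neural network and the kernlab package for the support vector machine (both are illustrative starting points, not the solution used below):

# neural network tuned with the keras engine instead of nnet
my_mod <-
    mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>%
    set_engine("keras") %>%
    set_mode("classification")

# or: a radial-basis support vector machine with its two main tuning parameters
my_mod <-
    svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
    set_engine("kernlab") %>%
    set_mode("classification")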

Prepare

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom        0.8.0     ✔ rsample      1.0.0
## ✔ dials        1.0.0     ✔ tune         1.0.0
## ✔ infer        1.0.2     ✔ workflows    1.0.0
## ✔ modeldata    1.0.0     ✔ workflowsets 0.2.1
## ✔ parsnip      1.0.0     ✔ yardstick    1.0.0
## ✔ recipes      1.0.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(vip) # a new package we're adding for variable importance measures
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
library(NeuralNetTools)

d <- read_csv("data/ngsschat-processed-data-add-three-features.csv")
## Rows: 3793 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): code
## dbl (7): mean_favorite_count, sum_favorite_count, mean_retweet_count, sum_re...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
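Because Kappa, sensitivity, and specificity all depend on how balanced the outcome is, a quick check of the class distribution can be useful before splitting. A small optional check (the outcome column in this data set is code):

# check how balanced the outcome classes are
d %>% 
    count(code)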

Split data

train_test_split <- initial_split(d, prop = .80)
data_train <- training(train_test_split)
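# note: the held-out test set stays inside train_test_split;
# last_fit() at the end of this script uses it automatically, so testing() is not needed here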

kfcv <- vfold_cv(data_train, v = 10) # again, we will use resampling

Engineer features

my_rec <- recipe(code ~ ., data = data_train) %>% 
    step_normalize(all_numeric_predictors()) %>%
    step_nzv(all_predictors())
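To verify what the recipe will do before fitting, one optional check is to prep and bake it on the training data and glimpse the result (not required for the workflow; just a sanity check):

# optional: preview the engineered features
my_rec %>% 
    prep() %>% 
    bake(new_data = NULL) %>% 
    glimpse()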

Specify recipe, model, and workflow

# specify model
my_mod <-
    mlp(hidden_units = tune(), # Hidden Units (type: integer, default: 5L)
        epochs = tune()        # Epochs (type: integer, default: 20L)
    ) %>%
    set_engine("nnet") %>%
    set_mode("classification")

# my_mod <-
#     mlp() %>%
#     set_engine("nnet") %>%
#     set_mode("classification")

# specify workflow
my_wf <-
    workflow() %>%
    add_model(my_mod) %>% 
    add_recipe(my_rec)

Fit model

# specify tuning grid
finalize(hidden_units(), data_train)
## # Hidden Units (quantitative)
## Range: [1, 10]
finalize(epochs(), data_train)
## # Epochs (quantitative)
## Range: [10, 1000]
nn_grid <- grid_max_entropy(hidden_units(range = c(1, 10)),
                            epochs(range = c(10, 1000)),
                            size = 10)
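# an alternative to the max-entropy design is an evenly spaced regular grid, e.g.:
# nn_grid <- grid_regular(hidden_units(range = c(1, 10)),
#                         epochs(range = c(10, 1000)),
#                         levels = 4)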

# fit model with tune_grid
fitted_model <- my_wf %>% 
    tune_grid(
        resamples = kfcv,
        grid = nn_grid,
        metrics = metric_set(roc_auc, accuracy, kap, sensitivity, specificity, precision)
    )

# fitted_model_resamples <- fit_resamples(my_wf, resamples = kfcv,
#                               control = control_grid(save_pred = TRUE)) 
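# to see every candidate from the grid (not just the best), collect the resampled metrics:
# collect_metrics(fitted_model)
# autoplot(fitted_model) # plots performance across the tuning parameters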

Fit model (part 2)

# examine best set of tuning parameters; repeat?
show_best(fitted_model, n = 10, metric = "accuracy")
## # A tibble: 10 × 8
##    hidden_units epochs .metric  .estimator  mean     n std_err .config          
##           <int>  <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>            
##  1            1    993 accuracy binary     0.885    10 0.00843 Preprocessor1_Mo…
##  2            1    367 accuracy binary     0.885    10 0.00843 Preprocessor1_Mo…
##  3            5    964 accuracy binary     0.884    10 0.00813 Preprocessor1_Mo…
##  4            2     46 accuracy binary     0.884    10 0.00826 Preprocessor1_Mo…
##  5           10     19 accuracy binary     0.883    10 0.00982 Preprocessor1_Mo…
##  6            3    616 accuracy binary     0.883    10 0.00929 Preprocessor1_Mo…
##  7            6    354 accuracy binary     0.881    10 0.00929 Preprocessor1_Mo…
##  8            5     12 accuracy binary     0.873    10 0.00683 Preprocessor1_Mo…
##  9           10    339 accuracy binary     0.871    10 0.00827 Preprocessor1_Mo…
## 10            9    692 accuracy binary     0.870    10 0.00868 Preprocessor1_Mo…
# select best set of tuning parameters
best_tree <- fitted_model %>% 
    select_best(metric = "accuracy")

# finalize workflow with best set of tuning parameters
final_wf <- my_wf %>% 
    finalize_workflow(best_tree)

final_fit <- final_wf %>% 
    last_fit(train_test_split, metrics = metric_set(roc_auc, accuracy, kap, sensitivity, specificity, precision))

Evaluate accuracy

# fit stats
final_fit %>%
    collect_metrics()
## # A tibble: 6 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.883 Preprocessor1_Model1
## 2 kap         binary         0.763 Preprocessor1_Model1
## 3 sensitivity binary         0.892 Preprocessor1_Model1
## 4 specificity binary         0.876 Preprocessor1_Model1
## 5 precision   binary         0.849 Preprocessor1_Model1
## 6 roc_auc     binary         0.934 Preprocessor1_Model1
# variable importance plot
final_fit %>% 
    pluck(".workflow", 1) %>%   
    extract_fit_parsnip() %>% 
    vip(num_features = 10)
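Beyond the summary metrics, two optional follow-ups can help with interpretation: the confusion matrix behind the sensitivity and specificity values and, since NeuralNetTools is loaded above, a plot of the fitted network itself. A sketch, assuming the objects created above:

# confusion matrix for the held-out test set
final_fit %>% 
    collect_predictions() %>% 
    conf_mat(truth = code, estimate = .pred_class)

# visualize the fitted nnet model
final_fit %>% 
    pluck(".workflow", 1) %>% 
    extract_fit_engine() %>% 
    plotnet()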

Did specifying and tuning a neural network make any difference compared to the mean predictive accuracy you found in the case study? Add a few notes below:

Knit & Submit

Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:

  1. Change the author: field in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.

  2. Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.

  3. Commit your changes in GitHub Desktop and push them to your online GitHub repository.

  4. Publish your HTML page to the web using one of the following publishing methods:

    • Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note, you will need to quickly create an RPubs account.

    • Publish on GitHub using either GitHub Pages or the HTML previewer.

  5. Post a new discussion on GitHub to our ML badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.