The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts organized into two parts:

Part I: Reflect and Plan

Part A:

  1. How good was the machine learning model we developed in the guided practice? Would you be comfortable using it to code more conversations? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.
  2. How might the model be improved? Share any ideas you have at this time below:

Part B: Use the institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

  1. Provide an APA citation for your selected study.

    • Lukes, D. (2018). Revisiting the Long-Term Impacts of Head Start: A Machine Learning Approach.

    • Hagenbuchner, M., Cliff, D. P., Trost, S. G., Van Tuc, N., & Peoples, G. E. (2015). Prediction of activity type in preschool children using machine learning techniques. Journal of Science and Medicine in Sport, 18(4), 426-431.

  2. What research questions were the authors of this study trying to address, and why did they consider these questions important?

    • Machine learning is not frequently used in Head Start research, so the first study aimed to add to the literature on the lasting impact of Head Start on later achievement using machine learning techniques. The same indicators of achievement have typically been used in Head Start research over the last few decades despite changes in systems and technology, so when discussing the fade-out effect it is important to adjust for changes in cohorts and teaching over time in order to highlight the Head Start features that are related to later achievement.
  3. What were the results of these analyses?

    • Using Lasso, post-Lasso, and random forest methods, along with an updated NLSY and CNLSY analysis sample, this paper casts doubt on the long-term impacts the program may have for newer cohorts. Further exploration is needed into the exact value machine learning methods add to the analysis of this program, particularly as it relates to the causal context.

Part II: Data Product

For the data product, you are asked to dive into what it means for the model to be predictively accurate. Specifically, we’ll explore some measures of just how predictively accurate the model we developed in the guided practice is.

We’ll use a shortcut to cut to the chase: interpreting the model. The code below loads the model we estimated in the guided practice, stored as the object final_fit. This step is necessary even if you currently have final_fit loaded in your environment/current R session, as everything must be generated by code in this document for it to “knit” successfully.

library(here)
## here() starts at /Users/lizfrechette/Desktop/machine-learning
library(readr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom        1.0.0     ✔ recipes      1.0.1
## ✔ dials        1.0.0     ✔ rsample      1.0.0
## ✔ dplyr        1.0.9     ✔ tibble       3.1.7
## ✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
## ✔ infer        1.0.2     ✔ tune         1.0.0
## ✔ modeldata    1.0.0     ✔ workflows    1.0.0
## ✔ parsnip      1.0.0     ✔ workflowsets 0.2.1
## ✔ purrr        0.3.4     ✔ yardstick    1.0.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()  masks scales::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
final_fit <- read_rds("out/ngsschat-final-fit.rds")

final_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits             id               .metrics .notes   .predictions .workflow 
##   <list>             <chr>            <list>   <list>   <list>       <list>    
## 1 <split [3034/759]> train/test split <tibble> <tibble> <tibble>     <workflow>
## 
## There were issues with some computations:
## 
##   - Warning(s) x1: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Run `show_notes(.Last.tune.result)` for more information.
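
Before drilling into the confusion matrix below, it can be helpful to first glance at the overall held-out metrics that tidymodels stored with the fit. This short, optional sketch assumes final_fit was produced by last_fit() in the guided practice (as its printout above suggests), so collect_metrics() can pull those values:

# Optional: overall test-set metrics (e.g., accuracy, ROC AUC) stored with the fit
final_fit %>% 
    collect_metrics()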

Run the code below to calculate a *confusion matrix*.

final_fit$.predictions[[1]] %>% 
    conf_mat(.pred_class, code)
##           Truth
## Prediction  SB  TS
##         SB 260  60
##         TS  37 402

Please interpret the above confusion matrix, using these guidelines, in terms of the true positive, true negative, false positive, and false negative rates. After each of the items below (e.g., “True positive”), add both the number and percentage of observations. For instance, if there were 100 true positives out of a total of 400 data points, please write: 100 (25%).

True positive: 260 (34%)

True negative: 402 (53%)

False positive: 60 (8%)

False negative: 37 (5%)
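
One way to double-check the counts and percentages reported above is to divide each cell of the confusion matrix by the total number of test-set predictions. The optional sketch below re-creates the confusion matrix with the same arguments as the chunk above (cm is just a temporary name for it):

# Optional check: each confusion-matrix cell as a percentage of all test-set predictions
cm <- final_fit$.predictions[[1]] %>% 
    conf_mat(.pred_class, code)

round(cm$table / sum(cm$table) * 100, 1)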

You can read more here about interpreting these in terms of specificity, sensitivity, precision, and recall, four statistics based on the information in the confusion matrix.
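
If you would like to compute those four statistics in code rather than by hand, yardstick’s summary() method for a confusion matrix object is one option. The optional sketch below keeps the same argument order used in the chunk above and relies on yardstick’s default of treating the first factor level as the “positive” class:

# Optional sketch: pull sensitivity, specificity, precision, and recall from the
# confusion matrix (summary() also returns several other statistics)
final_fit$.predictions[[1]] %>% 
    conf_mat(.pred_class, code) %>% 
    summary() %>% 
    filter(.metric %in% c("sens", "spec", "precision", "recall"))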

Return to your answer for Part A of Part I. Now, having examined the true and false positive and negative rates, how good do you think the machine learning model we developed in the guided practice was? Write more specifically, using the evidence you have from creating and interpreting the confusion matrix (above), after the following bullet point.

Knit & Submit

Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:

  1. Change the name in the author: field of the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.

  3. Commit your changes in GitHub Desktop and push them to your online GitHub repository.

  4. Publish your HTML page to the web using one of the following publishing methods:

    • Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note that you will need to quickly create an RPubs account.

    • Publish on GitHub using either GitHub Pages or the HTML previewer.

  5. Post a new discussion on GitHub to our ML Badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.