The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts organized into two parts:

Part I: Reflect and Plan

Part A:

  1. How good was the machine learning model we developed in the guided practice? Would you be comfortable using it to code more conversations? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.
  2. How might the model be improved? Share any ideas you have at this time below:

Part B: Use the institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

  1. Provide an APA citation for your selected study.

    • Lukes, D. (2018). Revisiting the Long-Term Impacts of Head Start: A Machine Learning Approach.

    • Hagenbuchner, M., Cliff, D. P., Trost, S. G., Van Tuc, N., & Peoples, G. E. (2015). Prediction of activity type in preschool children using machine learning techniques. Journal of Science and Medicine in Sport, 18(4), 426-431.

  2. What research questions were the authors of this study trying to address, and why did they consider these questions important?

    • Machine learning is not frequently used in Head Start research, so the first study aimed to add to the literature on the lasting impact of Head Start on later achievement using machine learning techniques. The same indicators of achievement have typically been used in Head Start research over the last few decades despite changes in systems and technology, so when discussing the fade-out effect it is important to adjust for changes in cohorts and teaching over time in order to highlight the Head Start features that are related to later achievement.
  3. What were the results of these analyses?

    • Using Lasso, post-Lasso, and random forest methods, along with an updated NLSY and CNLSY analysis sample, this paper casts doubt on the long-term impacts the program may have for newer cohorts. Further exploration is needed into the exact value machine learning methods add to the analysis of this program, particularly as it relates to the causal context.

Part II: Data Product

For the data product, you are asked to dive into what it means for the model to be predictively accurate. Specifically, we’ll explore some measures of just how predictively accurate the model we developed in the guided practice is.

We’ll use a shortcut to cut to the chase: interpreting the model. The code below loads the model we estimated in the guided practice, stored as the object final_fit. This step is necessary even if you currently have final_fit loaded in your environment/current R session, as everything must be generated by code in this document for it to “knit” successfully.

library(here)
## here() starts at /Users/lizfrechette/Desktop/machine-learning
library(readr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom        1.0.0     ✔ recipes      1.0.1
## ✔ dials        1.0.0     ✔ rsample      1.0.0
## ✔ dplyr        1.0.9     ✔ tibble       3.1.7
## ✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
## ✔ infer        1.0.2     ✔ tune         1.0.0
## ✔ modeldata    1.0.0     ✔ workflows    1.0.0
## ✔ parsnip      1.0.0     ✔ workflowsets 0.2.1
## ✔ purrr        0.3.4     ✔ yardstick    1.0.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()  masks scales::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
final_fit <- read_rds("out/ngsschat-final-fit.rds")

final_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits             id               .metrics .notes   .predictions .workflow 
##   <list>             <chr>            <list>   <list>   <list>       <list>    
## 1 <split [3034/759]> train/test split <tibble> <tibble> <tibble>     <workflow>
## 
## There were issues with some computations:
## 
##   - Warning(s) x1: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Run `show_notes(.Last.tune.result)` for more information.
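
Before drilling into the confusion matrix below, it can be helpful to first glance at the overall held-out metrics that tidymodels stored with the fit. This short, optional sketch assumes final_fit was produced by last_fit() in the guided practice (as its printout above suggests), so collect_metrics() can pull those values:

# Optional: overall test-set metrics (e.g., accuracy, ROC AUC) stored with the fit
final_fit %>% 
    collect_metrics()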

Run the code below to calculate a *confusion matrix*.

final_fit$.predictions[[1]] %>% 
    conf_mat(.pred_class, code)
##           Truth
## Prediction  SB  TS
##         SB 260  60
##         TS  37 402

Please interpret the above confusion matrix, using these guidelines, in terms of the true positive, true negative, false positive, and false negative rates. After each of the items below (e.g., “True positive”), add both the number and percentage of observations. For instance, if there were 100 true positives out of a total of 400 data points, please write: 100 (25%).

True positive: 260 (34%)

True negative: 402 (53%)

False positive: 60 (8%)

False negative: 37 (5%)
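
One way to double-check the counts and percentages reported above is to divide each cell of the confusion matrix by the total number of test-set predictions. The optional sketch below re-creates the confusion matrix with the same arguments as the chunk above (cm is just a temporary name for it):

# Optional check: each confusion-matrix cell as a percentage of all test-set predictions
cm <- final_fit$.predictions[[1]] %>% 
    conf_mat(.pred_class, code)

round(cm$table / sum(cm$table) * 100, 1)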

You can read more here about interpreting these in terms of specificity, sensitivity, precision, and recall, four statistics based on the information in the confusion matrix.
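
If you would like to compute those four statistics in code rather than by hand, yardstick’s summary() method for a confusion matrix object is one option. The optional sketch below keeps the same argument order used in the chunk above and relies on yardstick’s default of treating the first factor level as the “positive” class:

# Optional sketch: pull sensitivity, specificity, precision, and recall from the
# confusion matrix (summary() also returns several other statistics)
final_fit$.predictions[[1]] %>% 
    conf_mat(.pred_class, code) %>% 
    summary() %>% 
    filter(.metric %in% c("sens", "spec", "precision", "recall"))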

Return to your answer for Part A of Part I. Now, having examined the true and false positive and negative rates, how good do you think the machine learning model we developed in the guided practice was? Write more specifically, using the evidence you have from creating and interpreting the confusion matrix (above), after the following bullet point.

Knit & Submit

Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:

  1. Change the name in the author: field of the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.

  3. Commit your changes in GitHub Desktop and push them to your online GitHub repository.

  4. Publish your HTML page to the web using one of the following publishing methods:

    • Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note that you will need to quickly create an RPubs account.

    • Publish on GitHub using either GitHub Pages or the HTML previewer.

  5. Post a new discussion on GitHub to our ML Badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.