The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn the badge for each lab, you are required to respond to a set of prompts in two parts:

Part I: Reflect and Plan

Part A:

  1. How good was the machine learning model we developed in the guided practice? Would you be comfortable using it to code more conversations? What if, as a reviewer of research, you read about someone using such a model? Please add your thoughts and reflections following the bullet point below.
  2. How might the model be improved? Share any ideas you have at this time below:

Part B: Use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

  1. Provide an APA citation for your selected study.

    • Bahr, P. R., Fagioli, L. P., Hetts, J., Hayward, C., Willett, T., Lamoree, D., … Baker, R. B. (2019). Improving placement accuracy in California’s community colleges using multiple measures of high school achievement. *Community College Review, 47*(2), 178–211.
  2. What research questions were the authors of this study trying to address and why did they consider these questions important?

    • Do multiple measures predict college placement better than standardized test scores? Standardized placement tests remain the primary means by which new community college students are assessed and placed in the hierarchy of math and English coursework. A growing body of evidence indicates that placement tests tend to underestimate students’ likelihood of achieving passing grades in college-level courses, leading to students being misplaced in developmental coursework, slowing their academic progress, and increasing their likelihood of dropping out of college. This question is important because placement into remediation has been shown to significantly affect college persistence and completion, particularly for low-income, first-generation, and Black and Hispanic students.
  3. What were the results of these analyses?

    • Cumulative high school grade point average (GPA) is the most consistently useful predictor of performance across levels of math and English coursework, and a higher GPA is necessary to signal readiness for college-level coursework in math than is necessary to signal readiness for college-level coursework in English. In addition, cumulative GPA combined with specific indications of progress in the high school curriculum is frequently useful for predicting performance in math among direct matriculants and for predicting performance in both math and English among nondirect matriculants.

Part II: Data Product

For the data product, you are asked to dig into what it means for a model to be predictively accurate. Specifically, we’ll explore several measures of just how predictively accurate the model we developed in the guided practice is.

We’ll use a shortcut to cut to the chase: interpreting the model. The code below loads the model we estimated in the guided practice, in the form of the final_fit object. Running this chunk is necessary even if you currently have final_fit loaded in your environment or current R session, because everything in this document must be generated by its own code for it to successfully “knit”.

library(here)
## here() starts at /Users/Vero/Documents/GitHub/machine-learning
library(readr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom        1.0.0     ✔ recipes      1.0.1
## ✔ dials        1.0.0     ✔ rsample      1.0.0
## ✔ dplyr        1.0.9     ✔ tibble       3.1.7
## ✔ ggplot2      3.3.5     ✔ tidyr        1.2.0
## ✔ infer        1.0.2     ✔ tune         1.0.0
## ✔ modeldata    1.0.0     ✔ workflows    1.0.0
## ✔ parsnip      1.0.0     ✔ workflowsets 0.2.1
## ✔ purrr        0.3.4     ✔ yardstick    1.0.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()  masks scales::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
final_fit <- read_rds("out/ngsschat-final-fit.rds")

final_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits             id               .metrics .notes   .predictions .workflow 
##   <list>             <chr>            <list>   <list>   <list>       <list>    
## 1 <split [3034/759]> train/test split <tibble> <tibble> <tibble>     <workflow>
## 
## There were issues with some computations:
## 
##   - Warning(s) x1: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Run `show_notes(.Last.tune.result)` for more information.
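
For a quick overall summary before digging into the confusion matrix, you can pull the performance metrics stored in the fitted object. A minimal sketch using tune’s collect_metrics(); the exact values will depend on your saved model:

# Summarize the held-out performance metrics (e.g., accuracy and ROC AUC)
# stored in final_fit; values depend on the model you saved
collect_metrics(final_fit)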

Run the code below to calculate a *confusion matrix*:

final_fit$.predictions[[1]] %>% 
    conf_mat(truth = code, estimate = .pred_class)
##           Truth
## Prediction  SB  TS
##         SB 260  60
##         TS  37 402

Please interpret the confusion matrix above in terms of the true positive, true negative, false positive, and false negative rates. After each label (e.g., “True positive”), add both the number and percentage of observations. For instance, if there were 100 true positives out of a total of 400 data points, please write: 100 (25%).

True positive: 260 (87.5%)
True negative: 402 (87.0%)
False positive: 60 (13.0%)
False negative: 37 (12.5%)

(Percentages here are computed relative to the actual class: positives out of the 297 true SB cases, negatives out of the 462 true TS cases.)
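
If you’d like to double-check these percentages, here is a hand computation of the four rates from the cells of the confusion matrix, treating “SB” as the positive class (an assumption made for illustration):

# Rates computed directly from the confusion matrix above,
# treating "SB" as the positive class (illustrative assumption)
tp <- 260  # predicted SB, truly SB
fn <- 37   # predicted TS, truly SB
tn <- 402  # predicted TS, truly TS
fp <- 60   # predicted SB, truly TS

tp / (tp + fn)  # true positive rate (sensitivity): ~0.875
tn / (tn + fp)  # true negative rate (specificity): ~0.870
fp / (fp + tn)  # false positive rate: ~0.130
fn / (fn + tp)  # false negative rate: ~0.125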

You can read more about interpreting these values in terms of specificity, sensitivity, precision, and recall, four statistics based on the information in the confusion matrix.
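
Rather than computing these by hand, yardstick can report them directly. A minimal sketch using the summary() method for conf_mat objects; your values will depend on your saved model:

# summary() on a conf_mat object reports sensitivity, specificity,
# precision, recall, and several related statistics in one call
cm <- final_fit$.predictions[[1]] %>% 
    conf_mat(truth = code, estimate = .pred_class)
summary(cm)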

Return to your answer for Part I, Part A. Now, having examined the true and false positive and negative rates, how good do you think the machine learning model we developed in the guided practice was? Write more specifically, using the evidence you have from creating and interpreting the confusion matrix (above), after the following bullet point.
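
If you want additional evidence beyond the confusion matrix, you could also plot an ROC curve from the held-out predictions. A sketch, assuming the class-probability column for “SB” is named .pred_SB (check your predictions tibble for the exact column name):

# ROC curve from the held-out predictions; assumes the probability
# column for the "SB" class is named .pred_SB
final_fit$.predictions[[1]] %>% 
    roc_curve(truth = code, .pred_SB) %>% 
    autoplot()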

Knit & Submit

Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:

  1. Change the author: field in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.

  3. Commit your changes in GitHub Desktop and push them to your online GitHub repository.

  4. Publish your HTML page to the web using one of the following publishing methods:

    • Publish on RPubs by clicking the “Publish” button located in the Viewer pane when you knit your document. Note: you will need to create an RPubs account, which only takes a moment.

    • Publish on GitHub using either GitHub Pages or the HTML previewer.

  5. Post a new discussion on GitHub to our ML Badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.