Machine Learning - Learning Lab 1 Independent Practice

The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply an analytic technique introduced in this learning lab.

Part I: Reflect and Plan

Part A:

How good was the machine learning model we developed in the guided practice? Would you be comfortable using it to code more conversations? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.

The metrics certainly looked good, but I don’t have a strong enough intuition to put the model’s accuracy in context. I would want to know what’s generally considered good for a binary classifier in this type of context, where we are categorizing the topics of text.

How might the model be improved? Share any ideas you have at this time below:

Feature engineering using text mining techniques (for example, turning the words in each message into predictors) could substantially improve the model’s fit.
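As a rough illustration of what text-based feature engineering could look like, the sketch below builds simple bag-of-words counts in base R. The example messages and the helper names (tokenize, count_terms) are hypothetical and are not part of the lab's data or code. In a tidymodels workflow, the textrecipes package offers recipe steps such as step_tokenize() and step_tfidf() that serve a similar purpose.

```r
# Hypothetical example messages; these are NOT from the lab's dataset.
messages <- c(
  "How do we assess the new science standards?",
  "Thanks for sharing, see you next week!",
  "Which standards address modeling in science?"
)

# Lowercase, strip punctuation, and split one message into word tokens.
tokenize <- function(x) {
  cleaned <- gsub("[^a-z ]", "", tolower(x))
  tokens <- strsplit(cleaned, " +")[[1]]
  tokens[tokens != ""]
}
tokens <- lapply(messages, tokenize)

# Build the vocabulary, then count how often each term occurs per message.
vocab <- sort(unique(unlist(tokens)))
count_terms <- function(tok) as.integer(table(factor(tok, levels = vocab)))
term_counts <- t(vapply(tokens, count_terms, integer(length(vocab))))
dimnames(term_counts) <- list(paste0("msg_", seq_along(messages)), vocab)

# Each column is a candidate predictor for the classifier.
term_counts[, c("science", "standards", "thanks")]
```

Each row is a message and each column is a term, so the resulting matrix could be joined to the existing predictors before refitting the model.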

Part B: Use your institutional library (e.g., NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

Provide an APA citation for your selected study.
- Bergin, S., Mooney, A., Ghent, J., & Quille, K. (2015). Using Machine Learning Techniques to Predict Introductory Programming Performance. International Journal of Computer Science and Software Engineering, 4(12), 323–328.
What research questions were the authors of this study trying to address and why did they consider these questions important?
- How can we predict the performance of introductory programming students? How accurate are such methods and which one is the best in this regard?
What were the results of these analyses?
- They tried six ML techniques. Naive Bayes performed best, but its accuracy wasn’t significantly better than that of the other methods (accuracy ranged from 71.6% to 78.3% across all six methods).

Part II: Data Product

For the data product, you are asked to dive into what it means for the model to be predictively accurate. Specifically, we’ll explore some measures of just how predictively accurate the model we developed in the guided practice is.

We’ll use a shortcut to cut to the chase – interpreting the model. The code below loads the model we estimated in the guided practice, stored as the final_fit object. This is necessary even if you currently have final_fit loaded in your environment/current R session, as everything must be generated by code in this document for it to successfully “knit”.

library(here)

## here() starts at /home/alishinski/Documents/work/laser_institute/machine-learning

library(readr)
library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──

## ✔ broom        0.8.0     ✔ recipes      1.0.1
## ✔ dials        1.0.0     ✔ rsample      1.0.0
## ✔ dplyr        1.0.9     ✔ tibble       3.1.7
## ✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
## ✔ infer        1.0.2     ✔ tune         1.0.0
## ✔ modeldata    1.0.0     ✔ workflows    1.0.0
## ✔ parsnip      1.0.0     ✔ workflowsets 0.2.1
## ✔ purrr        0.3.4     ✔ yardstick    1.0.0

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()  masks scales::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

final_fit <- read_rds("out/ngsschat-final-fit.rds")

final_fit

## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits             id               .metrics .notes   .predictions .workflow 
##   <list>             <chr>            <list>   <list>   <list>       <list>    
## 1 <split [3034/759]> train/test split <tibble> <tibble> <tibble>     <workflow>
## 
## There were issues with some computations:
## 
##   - Warning(s) x1: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Run `show_notes(.Last.tune.result)` for more information.

Run the code below to calculate a confusion matrix:

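# Cross-tabulate the true labels (code) against the predicted classes.
# conf_mat() takes truth first, then estimate, so both are named explicitly.
cm <- final_fit$.predictions[[1]] %>% 
    conf_mat(truth = code, estimate = .pred_class)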

Please interpret the above confusion matrix using these guidelines in terms of the true positive, true negative, false positive, and false negative rates. After each of the following (i.e., “True positive”), add both the number and percentage of observations. For instance, if there were 100 true positives out of a total of 400 data points, please write: 100 (25%).

Accuracy: 0.8722003

True positive: 0.8125

True negative: 0.9157175

False positive: 0.1875

False negative: 0.0842825
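
The count-and-percentage format requested in the prompt can be produced directly from a cross-tabulation. The following is a minimal base R sketch using made-up labels (the truth and predicted vectors are illustrative assumptions, not the lab's predictions); each percentage is taken out of all observations, as in the "100 (25%)" example.

```r
# Illustrative labels only; these are NOT the lab's predictions.
truth     <- factor(c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"),
                    levels = c("pos", "neg"))
predicted <- factor(c("pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos"),
                    levels = c("pos", "neg"))

# Rows are predictions, columns are true labels.
tab <- table(Prediction = predicted, Truth = truth)

counts <- c(
  true_positive  = tab["pos", "pos"],
  true_negative  = tab["neg", "neg"],
  false_positive = tab["pos", "neg"],
  false_negative = tab["neg", "pos"]
)

# Format each cell as "count (percent of all observations)".
report <- sprintf("%d (%.1f%%)", counts, 100 * counts / sum(counts))
names(report) <- names(counts)
report

# Accuracy is the share of all observations that were classified correctly.
accuracy <- unname((counts["true_positive"] + counts["true_negative"]) / sum(counts))
accuracy
```

Because all four percentages share the same denominator, they sum to 100%, and accuracy is simply the true positive and true negative shares added together.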

You can read more about interpreting these in terms of specificity, sensitivity, precision, and recall, four statistics based on the information in the confusion matrix.
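
To make the four statistics concrete, here is a small sketch that computes them from the cells of a confusion matrix. The counts are invented for illustration and are not the lab's results. The yardstick package (loaded with tidymodels) provides sens(), spec(), precision(), and recall() for the same calculations on a data frame of predictions.

```r
# Illustrative counts only; these are NOT the lab's results.
tp <- 100  # true positives
tn <- 250  # true negatives
fp <- 30   # false positives
fn <- 20   # false negatives

sensitivity <- tp / (tp + fn)  # share of actual positives that were found
specificity <- tn / (tn + fp)  # share of actual negatives that were found
precision   <- tp / (tp + fp)  # share of predicted positives that were correct
recall      <- sensitivity     # recall is another name for sensitivity

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, recall = recall), 3)
```

Note that sensitivity and specificity divide by the number of actual positives or negatives, while precision divides by the number of predicted positives, so the same counts can yield quite different values depending on the denominator.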

Return to your answer for Part I, Part A. Now, having examined the true and false positive and negative rates, how good do you think the machine learning model we developed in the case study was? Write more specifically, using the evidence you have from creating and interpreting the confusion matrix (above), after the following bullet point.

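I still think it was pretty good: overall accuracy was about 87%, and the true negative rate (about 92%) was higher than the true positive rate (about 81%), so the model seems to handle negatives better than positives. But again, I have little context for judging whether these values are good for this kind of task.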

Knit & Submit

Congratulations, you’ve completed your Prediction badge! Complete the following steps to submit your work for review:

Change the name of the author: in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.
Commit your changes in GitHub Desktop and push them to your online GitHub repository.
Publish your HTML page to the web using one of the following publishing methods:
- Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note that you will first need to create an RPubs account.
- Publish on GitHub using either GitHub Pages or the HTML previewer.
Post a new discussion on GitHub to our ML Badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.

Lexi Lishinski

July 13, 2022
