The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts in two parts:

Part I: Extending our model

In this part of the badge activity, please add another variable – a variable representing the number of days before the start of the module that students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model becomes (if at all, as this variable might not have great predictive power). Note that this variable is numeric, so no pre-processing is necessary.

In doing so, please move over all of the code needed to run the analysis from your case study file. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in each chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).

Load Packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.5     ✔ rsample      1.1.1
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
## ✔ recipes      1.0.6
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

WRANGLE

Read CSV Data File

students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Inspect Data

glimpse(students)
## Rows: 32,593
## Columns: 15
## $ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region                     <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band                   <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…

“Mutate” Variables

students <- students %>%
    # create a new variable named "pass", dummy-coded 1 if the value of final_result equals "Pass" and 0 if not
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% 
    # make the variable a factor, helping later steps
    mutate(pass = as.factor(pass))
students <- students %>% 
    mutate(disability = as.factor(disability))
view(students)
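
As a quick sanity check (optional), you can cross-tabulate final_result against the new pass variable to confirm the dummy coding behaves as intended:

students %>% 
    count(final_result, pass) # only the "Pass" row should have pass == 1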

EXPLORE

Examine Variables

students %>% 
    count(id_student) # this many students
## # A tibble: 28,785 × 2
##    id_student     n
##         <dbl> <int>
##  1       3733     1
##  2       6516     1
##  3       8462     2
##  4      11391     1
##  5      23629     1
##  6      23632     1
##  7      23698     1
##  8      23798     1
##  9      24186     1
## 10      24213     2
## # ℹ 28,775 more rows
students %>% 
    count(code_module, code_presentation) # this many offerings
## # A tibble: 22 × 3
##    code_module code_presentation     n
##    <chr>       <chr>             <int>
##  1 AAA         2013J               383
##  2 AAA         2014J               365
##  3 BBB         2013B              1767
##  4 BBB         2013J              2237
##  5 BBB         2014B              1613
##  6 BBB         2014J              2292
##  7 CCC         2014B              1936
##  8 CCC         2014J              2498
##  9 DDD         2013B              1303
## 10 DDD         2013J              1938
## # ℹ 12 more rows

Feature Engineering

students <- students %>% 
    # create a factor with ordered levels
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% 
    # change the levels into integers based on the order of the factor levels
    mutate(imd_band = as.integer(imd_band)) 

students
## # A tibble: 32,593 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  11391 M      East Angli… HE Qualification 
##  2 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  3 AAA         2013J                  30268 F      North West… A Level or Equiv…
##  4 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  5 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
##  6 AAA         2013J                  38053 M      Wales       A Level or Equiv…
##  7 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  8 AAA         2013J                  45642 F      North West… A Level or Equiv…
##  9 AAA         2013J                  52130 F      East Angli… A Level or Equiv…
## 10 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>
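
Because factor() maps any value not in the supplied levels to NA (and as.integer() keeps those NAs), it is worth counting the recoded variable; these missing values are the source of the "observations deleted due to missingness" message when the model is fit below:

students %>% 
    count(imd_band) # integer bands 1 through 10, plus any NA values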

MODEL

Split data

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)
data_train
## # A tibble: 26,074 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 FFF         2014B                 595186 M      South Regi… Lower Than A Lev…
##  2 BBB         2014J                 504066 F      East Midla… Lower Than A Lev…
##  3 BBB         2013J                 585790 F      South East… HE Qualification 
##  4 CCC         2014J                 278413 M      London Reg… HE Qualification 
##  5 GGG         2014B                 634933 F      South Regi… Lower Than A Lev…
##  6 CCC         2014J                 608577 M      North Regi… HE Qualification 
##  7 BBB         2014B                 612120 F      East Midla… Lower Than A Lev…
##  8 FFF         2013J                 530852 M      Wales       Lower Than A Lev…
##  9 CCC         2014J                2555596 M      South Regi… A Level or Equiv…
## 10 DDD         2013B                 556575 M      North Regi… HE Qualification 
## # ℹ 26,064 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>
data_test
## # A tibble: 6,519 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  2 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  3 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  4 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
##  5 AAA         2013J                  65002 F      East Angli… A Level or Equiv…
##  6 AAA         2013J                  71361 M      Ireland     HE Qualification 
##  7 AAA         2013J                  77367 M      East Midla… A Level or Equiv…
##  8 AAA         2013J                  98094 M      Wales       Lower Than A Lev…
##  9 AAA         2013J                 111717 F      East Angli… HE Qualification 
## 10 AAA         2013J                 114017 F      North Regi… Post Graduate Qu…
## # ℹ 6,509 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>

Create a “Recipe”

my_rec <- recipe(pass ~ disability + imd_band, data = data_train)

my_rec
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 2
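
If you would like to see exactly what the recipe hands to the model, you can prep() and bake() it (an optional inspection step; bake(new_data = NULL) returns the processed training data):

my_rec %>% 
    prep() %>% # estimate anything the recipe needs from the training data
    bake(new_data = NULL) %>% # return the processed training data
    glimpse()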

Specify the model and workflow

# specify model
my_mod <-
    logistic_reg() %>% 
    # generalized linear model
    set_engine("glm") %>% 
    # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression
    set_mode("classification") 

my_mod
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
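
For reference, the regression analogue mentioned in the comment above would be specified like this (a sketch only; it is not needed in this lab, since our outcome is dichotomous, and my_mod_reg is a hypothetical name):

# hypothetical: the parallel specification for a numeric outcome
my_mod_reg <-
    linear_reg() %>% 
    set_engine("lm") %>% 
    set_mode("regression")
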
my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

Fit model

fitted_model <- fit(my_wf, data = data_train)
fitted_model
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)  disabilityY     imd_band  
##    -0.78655     -0.28186      0.05974  
## 
## Degrees of Freedom: 22406 Total (i.e. Null);  22404 Residual
##   (3667 observations deleted due to missingness)
## Null Deviance:       29830 
## Residual Deviance: 29650     AIC: 29660
final_fit <- last_fit(fitted_model, train_test_split)
final_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits               id              .metrics .notes   .predictions .workflow 
##   <list>               <chr>           <list>   <list>   <list>       <list>    
## 1 <split [26074/6519]> train/test spl… <tibble> <tibble> <tibble>     <workflow>
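
Note that last_fit() refits the workflow on the training portion of the split before evaluating on the test portion, so passing the unfitted workflow (the more typical pattern) should yield the same result:

# equivalent, and the more common usage:
# final_fit <- last_fit(my_wf, train_test_split)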

Interpret accuracy

# collect test split predictions
final_fit %>%
    collect_predictions()
## # A tibble: 6,519 × 7
##    id               .pred_0 .pred_1  .row .pred_class pass  .config             
##    <chr>              <dbl>   <dbl> <int> <fct>       <fct> <chr>               
##  1 train/test split   0.647   0.353     2 0           1     Preprocessor1_Model1
##  2 train/test split   0.605   0.395     4 0           1     Preprocessor1_Model1
##  3 train/test split   0.634   0.366     7 0           1     Preprocessor1_Model1
##  4 train/test split  NA      NA        10 <NA>        1     Preprocessor1_Model1
##  5 train/test split   0.577   0.423    16 0           0     Preprocessor1_Model1
##  6 train/test split  NA      NA        18 <NA>        1     Preprocessor1_Model1
##  7 train/test split   0.634   0.366    21 0           1     Preprocessor1_Model1
##  8 train/test split   0.577   0.423    24 0           1     Preprocessor1_Model1
##  9 train/test split   0.547   0.453    33 0           1     Preprocessor1_Model1
## 10 train/test split  NA      NA        35 <NA>        1     Preprocessor1_Model1
## # ℹ 6,509 more rows
final_fit %>% 
    # see test set predictions
    collect_predictions() %>% 
    # to make the output easier to view
    select(.pred_class, pass) %>%  
    # create a new variable, correct, telling us when the model was and was not correct
    mutate(correct = .pred_class == pass) 
## # A tibble: 6,519 × 3
##    .pred_class pass  correct
##    <fct>       <fct> <lgl>  
##  1 0           1     FALSE  
##  2 0           1     FALSE  
##  3 0           1     FALSE  
##  4 <NA>        1     NA     
##  5 0           0     TRUE   
##  6 <NA>        1     NA     
##  7 0           1     FALSE  
##  8 0           1     FALSE  
##  9 0           1     FALSE  
## 10 <NA>        1     NA     
## # ℹ 6,509 more rows
final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)
##  correct    n   percent valid_percent
##    FALSE 2071 0.3176868      0.372549
##     TRUE 3488 0.5350514      0.627451
##       NA  960 0.1472618            NA
final_fit %>% 
    collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.627 Preprocessor1_Model1
## 2 roc_auc  binary         0.548 Preprocessor1_Model1
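
Beyond overall accuracy, a confusion matrix shows where the model's errors fall; rows the model could not score (because of missing predictors) are dropped first:

final_fit %>% 
    collect_predictions() %>% 
    filter(!is.na(.pred_class)) %>% # drop rows with missing predictors
    conf_mat(truth = pass, estimate = .pred_class)
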
students %>% 
    count(pass)
## # A tibble: 2 × 2
##   pass      n
##   <fct> <int>
## 1 0     20232
## 2 1     12361
students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & pass == 1 |
               prediction == 0 & pass == 0, 1, 0)) %>% 
    tabyl(correct)
##  correct     n   percent
##        0 16328 0.5009665
##        1 16265 0.4990335
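
A coin flip is a weak baseline, though: since about 62% of students did not pass (20,232 of 32,593), always predicting 0 is a sterner comparison, and our model's 62.7% accuracy only just clears it:

students %>% 
    mutate(correct = pass == "0") %>% # "predict" 0 (did not pass) for everyone
    tabyl(correct)
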
view(students)

Create a new “Recipe” with a third predictor variable

my_rec3 <- recipe(pass ~ disability + imd_band + date_registration, data = data_train)

my_rec3
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 3

Update workflow with new recipe

my_wf_new <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec3) # add our recipe we wrote above

Fit new model

fitted_model_new <- fit(my_wf_new, data = data_train)
fitted_model_new
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##       (Intercept)        disabilityY           imd_band  date_registration  
##         -0.667029          -0.280013           0.059134           0.001643  
## 
## Degrees of Freedom: 22371 Total (i.e. Null);  22368 Residual
##   (3702 observations deleted due to missingness)
## Null Deviance:       29800 
## Residual Deviance: 29580     AIC: 29590
final_fit_new <- last_fit(fitted_model_new, train_test_split)
final_fit_new
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits               id              .metrics .notes   .predictions .workflow 
##   <list>               <chr>           <list>   <list>   <list>       <list>    
## 1 <split [26074/6519]> train/test spl… <tibble> <tibble> <tibble>     <workflow>
final_fit_new %>% 
    collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.627 Preprocessor1_Model1
## 2 roc_auc  binary         0.555 Preprocessor1_Model1
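
To make the comparison easier, you can stack the metrics from both models into one table:

bind_rows(
    collect_metrics(final_fit) %>% mutate(model = "two predictors"),
    collect_metrics(final_fit_new) %>% mutate(model = "three predictors")
) %>% 
    select(model, .metric, .estimate)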

How does the accuracy of this new model compare? Add a few reflections below:

  • The new model with the third predictor variable (date_registration) slightly decreases the accuracy of the model (at the fourth decimal place) compared to the model without this variable. As such, the number of days prior to the start of the module presentation that a student registered does not strongly correlate with whether that student passed the module presentation. (That said, the ROC AUC does improve slightly, from 0.548 to 0.555, suggesting the new variable carries a small amount of additional signal.) Of the predictor variables in the model, the variable corresponding to disability status (disability) has the strongest (namely, negative) correlation with a student’s final result in the module presentation.

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

  1. Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?
  • The two cultures of statistical modeling that Breiman outlines in his paper are the Data Modeling Culture and the Algorithmic Modeling Culture. From the abstract, “[The Data Modeling Culture] assumes that the data are generated by a given stochastic data model. The Algorithmic Modeling Culture uses algorithmic models and treats the data mechanism as unknown.” Based on the article text, I understand the primary difference between these two cultures to be that the former assumes the path from x to y is knowable and ordered, whereas the latter assumes the path from x to y is unknowably complex. The Algorithmic Modeling Culture, consequently, aims only to mimic the pattern by which y is generated from x, without regard for parity with the mechanism of that generation. In contrast, the Data Modeling Culture aims to approximate as closely as possible the mathematical mechanism that generates y from x.
  2. How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?
  • Big data and machine learning have reinforced Breiman’s argument that pragmatism, rather than disciplinary convention, should drive problem solving. Problem-solving approaches that leverage big data and machine learning draw on statistical models (e.g., in supervised ML) as well as algorithmic models (e.g., in unsupervised ML). Breiman essentially argues for a “by any means necessary” approach, with accuracy of output as the goal, and that has indeed been the approach of the current technological era.
  3. Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?
  • I’m glad you asked! If there are no implications associated with the mechanism of how something happens, then sure, predictive accuracy can take precedence over mechanistic understanding. However, for problems that have social implications and where intervention may be needed, I think it is insufficient, and grossly negligent, not to attend to mechanism. Setting predictive accuracy above mechanistic understanding risks the development of deterministic perspectives. Take recidivism as an example: the current algorithms used by judges have been trained on “big data” and yield recidivism risk predictions that are racially biased. The bias is embedded in data that has been socially constructed. When the data used for any algorithm is socially constructed, those using that algorithm must understand that the generated predictions are only relevant to the social entity/phenomenon that has been constructed. In the case of recidivism, the social conditions that lead to recidivism have been constructed, so a recidivism algorithm can only predict the outcomes of that system functioning as it has been constructed. The danger (i.e., gross neglect) arises from the high risk of equating nature and nurture, which can (and often does) come from using algorithms for socially relevant problems. The human tendency, shaped by millions of years of biological evolution, is to assume that outcomes with high predictive value cannot be changed, only reported. Reporting predictive accuracy for a social problem hacks into this evolutionary wiring (a product of nature, not nurture) to the detriment of social improvement. The question that needs to be part of any social problem-solving endeavor is “Why?” Thus, the pursuit of mechanistic understanding in social problem solving creates space for change and betterment. In the case of recidivism, as with any other social concern, predictive accuracy is misplaced and has dire consequences, especially when placed above mechanistic understanding.

Part B:

  1. How good was the machine learning model we developed in the badge activity? What if, as a reviewer of research, you read about someone using such a model? Please add your thoughts and reflections following the bullet point below.
  • Given the model’s relatively weak predictive accuracy (i.e., ~63%), my thoughts about the use of such a model would depend on the purposes for which it was being used. If it were being used to determine whether the module presentation was discriminatory based on disability status, controlling for socioeconomic status, then I would want to see in which direction the model erred. If it incorrectly predicted “fail” for students with a disability more often than for students without a disability, then I would question its usefulness and meaningful contribution to the field. However, if it correctly predicted the final result of students with a disability with high accuracy, then I would consider it a meaningful contribution, even with the relatively weak overall predictive accuracy of ~63%.
  2. How might the model be improved? Share any ideas you have at this time below:
  • The model might be improved by including as predictors other variables theorized (with research-based evidence) to influence students’ academic performance or learning of the module content.

Part C: Use the institutional library (e.g., NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

  1. Provide an APA citation for your selected study.
  • Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
  2. What research questions were the authors of this study trying to address and why did they consider these questions important?
  • RQ: How is it that unlabeled data can increase classification accuracy? “This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available” (extracted from the abstract).
  3. What were the results of these analyses?
  • The results were as follows “(1) unlabeled data can significantly increase performance, (2) the basic EM algorithm can suffer from a misfit between the modeling assumptions and the unlabeled data, and (3) each extension mentioned [below] often reduces the effect of this problem and improves classification… The first extension introduces a weighting factor that dynamically adjusts the strength of the unlabeled data’s contribution to parameter estimation in EM. The second reduces the bias of naive Bayes by modeling each class with multiple mixture components, instead of a single component…[T]o identify the source newsgroup for a UseNet article with 70% classification accuracy, the proposed algorithm takes advantage of 10000 unlabeled examples and requires only 600 labeled examples to achieve the same accuracy…reduc[ing] the need for labeled training examples by more than a factor of three. With only 40 labeled documents (two per class), accuracy is improved from 27% to 43% by adding unlabeled data” (p. 105).

Knit and Publish

Complete the following steps to knit and publish your work:

  1. First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Next, click the Knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.

  3. Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer pane after you knit your document. See the screenshot below.

Your First Machine Learning Badge

Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official LASER Badge, share the link to your published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet. We recommend bookmarking this spreadsheet as we’ll be using it throughout the year to keep track of your progress.

Once your instructor has checked your link, you will be provided a physical version of the badge below!