Machine Learning - Lab 1 Independent Assignment

The lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

In Part I, you will extend our model by adding another variable.
In Part II, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.

Part I: Extending our model

In this part of the badge activity, please add another variable: the number of days before the start of the module that students registered (date_registration). This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is, if at all, since this variable might not have great predictive power. Note that this variable is numeric, so no pre-processing is necessary.

In doing so, please copy all of the code needed to run the analysis from your case study file into this document. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in each chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.0 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

students <- read_csv("data/oulad-students.csv")

## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(students)

## Rows: 32,593
## Columns: 15
## $ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region                     <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band                   <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…

students <- students %>% 
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass", dummy coded as 1 if final_result equals "Pass" and 0 otherwise
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

students <- students %>% 
    mutate(disability = as.factor(disability))

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels

students

## # A tibble: 32,593 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  11391 M      East Angli… HE Qualification 
##  2 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  3 AAA         2013J                  30268 F      North West… A Level or Equiv…
##  4 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  5 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
##  6 AAA         2013J                  38053 M      Wales       A Level or Equiv…
##  7 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  8 AAA         2013J                  45642 F      North West… A Level or Equiv…
##  9 AAA         2013J                  52130 F      East Angli… A Level or Equiv…
## 10 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>
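To make the imd_band conversion above concrete, here is a minimal base R sketch on a small made-up vector. The values in `example_bands`, including the mistyped "10-20" label, are hypothetical and are not taken from the data set. The sketch shows that each band is mapped to its position in the `levels` vector, and that any label not listed in `levels` quietly becomes NA.

```r
# The ten band labels, in the same order used in the mutate() call above
band_levels <- c("0-10%", "10-20%", "20-30%", "30-40%", "40-50%",
                 "50-60%", "60-70%", "70-80%", "80-90%", "90-100%")

# Hypothetical input: four valid labels, one missing value,
# and one label that is missing its "%" sign
example_bands <- c("90-100%", "20-30%", "30-40%", "50-60%", NA, "10-20")

# factor() stores each label as its position in band_levels;
# as.integer() then exposes that position as a plain number
band_codes <- as.integer(factor(example_bands, levels = band_levels))

print(band_codes)
```

Because unmatched labels turn into NA without a warning, it may be worth counting the missing values in imd_band before and after the conversion.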

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)
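As a quick check on what an 80/20 split means for this data set, the sketch below works out the expected sizes of the two sets from the 32,593 rows reported by glimpse(). It assumes the training count is rounded down to a whole row, which is consistent with the split sizes printed later by last_fit(); the variable names are illustrative only.

```r
n_rows <- 32593  # rows reported by glimpse(students)
prop   <- 0.80   # proportion passed to initial_split()

# Assumption: the training set gets the whole-number part of n_rows * prop
n_train <- floor(n_rows * prop)
n_test  <- n_rows - n_train

print(c(train = n_train, test = n_test))
```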

my_rec <- recipe(pass ~ disability + imd_band + date_registration, data = data_train)

my_rec

## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs 
## Number of variables by role
## outcome:   1
## predictor: 3

# specify model
my_mod <-
    logistic_reg()

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

fitted_model <- fit(my_wf, data = data_train)
fitted_model

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##       (Intercept)        disabilityY           imd_band  date_registration  
##         -0.667029          -0.280013           0.059134           0.001643  
## 
## Degrees of Freedom: 22371 Total (i.e. Null);  22368 Residual
##   (3702 observations deleted due to missingness)
## Null Deviance:       29800 
## Residual Deviance: 29580     AIC: 29590
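The coefficients above are on the log-odds scale, which is hard to read directly. A minimal sketch of one common way to interpret them is to exponentiate each coefficient into an odds ratio. The coefficient values are copied from the output above; the 30-day comparison is an illustrative choice, not part of the lab.

```r
# Coefficients copied from the fitted workflow output (log-odds scale)
coefs <- c(disabilityY       = -0.280013,
           imd_band          =  0.059134,
           date_registration =  0.001643)

# exp() turns a log-odds coefficient into an odds ratio:
# the multiplicative change in the odds of passing for a one-unit increase
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Illustrative rescaling: odds ratio for registering 30 days later
or_30_days <- exp(30 * coefs[["date_registration"]])
print(round(or_30_days, 3))
```

Read this way, an odds ratio below 1 (disabilityY) points to lower odds of passing and a ratio above 1 (imd_band, date_registration) points to higher odds, holding the other predictors constant. These are associations in the training data, not causal effects.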

final_fit <- last_fit(my_mod, my_rec, train_test_split)

final_fit

## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits               id              .metrics .notes   .predictions .workflow 
##   <list>               <chr>           <list>   <list>   <list>       <list>    
## 1 <split [26074/6519]> train/test spl… <tibble> <tibble> <tibble>     <workflow>

# collect test split predictions
final_fit %>%
    collect_predictions()

## # A tibble: 6,519 × 7
##    .pred_class .pred_0 .pred_1 id                .row pass  .config             
##    <fct>         <dbl>   <dbl> <chr>            <int> <fct> <chr>               
##  1 0             0.640   0.360 train/test split     2 1     Preprocessor1_Model1
##  2 0             0.598   0.402 train/test split     4 1     Preprocessor1_Model1
##  3 0             0.632   0.368 train/test split     7 1     Preprocessor1_Model1
##  4 <NA>         NA      NA     train/test split    10 1     Preprocessor1_Model1
##  5 0             0.620   0.380 train/test split    16 0     Preprocessor1_Model1
##  6 <NA>         NA      NA     train/test split    18 1     Preprocessor1_Model1
##  7 0             0.617   0.383 train/test split    21 1     Preprocessor1_Model1
##  8 0             0.591   0.409 train/test split    24 1     Preprocessor1_Model1
##  9 0             0.537   0.463 train/test split    33 1     Preprocessor1_Model1
## 10 <NA>         NA      NA     train/test split    35 1     Preprocessor1_Model1
## # ℹ 6,509 more rows
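To see where the .pred_1 values come from, the sketch below recomputes two of them by hand from the coefficients printed for fitted_model. It assumes that .row refers to the row number in students, and that last_fit() refits the same model on the training set so the coefficients are unchanged. The predictor values for rows 2 and 4 are read from the glimpse() output (disability "N", imd_band "20-30%" and "50-60%", which become 3 and 6, and date_registration -53 and -52). The helper name `predict_pass` is hypothetical.

```r
# Coefficients copied from the fitted workflow output
intercept    <- -0.667029
b_disability <- -0.280013
b_imd        <-  0.059134
b_date       <-  0.001643

# Hypothetical helper: logistic regression prediction by hand.
# disabled is 1 for disability "Y" and 0 for "N".
predict_pass <- function(disabled, imd_band, date_registration) {
  log_odds <- intercept +
    b_disability * disabled +
    b_imd * imd_band +
    b_date * date_registration
  plogis(log_odds)  # inverse logit: converts log-odds to a probability
}

p_row2 <- predict_pass(disabled = 0, imd_band = 3, date_registration = -53)
p_row4 <- predict_pass(disabled = 0, imd_band = 6, date_registration = -52)

print(round(c(row2 = p_row2, row4 = p_row4), 3))
```

Both probabilities should agree with the .pred_1 column for rows 2 and 4 (0.360 and 0.402). Since each is below 0.5, the predicted class is 0 even though both students passed.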

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct

## # A tibble: 6,519 × 3
##    .pred_class pass  correct
##    <fct>       <fct> <lgl>  
##  1 0           1     FALSE  
##  2 0           1     FALSE  
##  3 0           1     FALSE  
##  4 <NA>        1     NA     
##  5 0           0     TRUE   
##  6 <NA>        1     NA     
##  7 0           1     FALSE  
##  8 0           1     FALSE  
##  9 0           1     FALSE  
## 10 <NA>        1     NA     
## # ℹ 6,509 more rows

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

##  correct    n   percent valid_percent
##    FALSE 2071 0.3176868     0.3728844
##     TRUE 3483 0.5342844     0.6271156
##       NA  965 0.1480288            NA
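The two percentage columns in the tabyl() output differ only in their denominator. The short sketch below recomputes both from the counts shown above; the variable names are illustrative.

```r
# Counts copied from the tabyl(correct) output
n_false <- 2071  # incorrect predictions
n_true  <- 3483  # correct predictions
n_na    <- 965   # rows with no prediction because a predictor was missing

# percent: share of all test rows, including the rows without a prediction
percent <- n_true / (n_false + n_true + n_na)

# valid_percent: share of rows that received a prediction (the accuracy)
valid_percent <- n_true / (n_false + n_true)

print(round(c(percent = percent, valid_percent = valid_percent), 7))
```

The accuracy reported in this lab is valid_percent, so it describes only the test rows for which a prediction could be made.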

Previous results: The previous model had a valid_percent of 0.627451 for TRUE.

New results: The new model has a valid_percent of 0.627115 for TRUE.

How does the accuracy of this new model compare? Add a few reflections below:

The new model appears to be very slightly less accurate (0.627115 versus 0.627451), a negligible difference, indicating that the new variable adds little predictive power.
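One rough way to judge the size of this change is to compare it with the sampling uncertainty of an accuracy estimate. The sketch below uses the binomial standard error of a proportion as an approximation. It assumes the test predictions are independent, and it uses the new model's count of valid predictions, since the count for the previous model is not shown here.

```r
acc_previous <- 0.627451     # valid_percent for the two-predictor model
acc_new      <- 0.627115     # valid_percent for the three-predictor model
n_valid      <- 2071 + 3483  # test rows that received a prediction

difference <- acc_new - acc_previous

# Approximate standard error of a proportion estimated from n_valid rows
se_new <- sqrt(acc_new * (1 - acc_new) / n_valid)

print(round(c(difference = difference, standard_error = se_new), 6))
```

The change in accuracy is far smaller than one standard error, so under these assumptions the two models are practically indistinguishable on this test set.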

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?

The data modeling culture begins by assuming a stochastic model for what happens inside the black box that links the inputs x to the responses y. The parameters of that model are estimated from the data, and the model is then used for information and prediction. The model is validated using goodness-of-fit tests and residual examination. Breiman estimates that 98% of statisticians belong to this culture.
The algorithmic modeling culture treats the inside of the black box as complex and unknown. This approach finds an algorithm that operates on x to predict the responses y. The model is validated by predictive accuracy. Breiman estimates that only 2% of statisticians belong to this culture, although many researchers in other fields use this approach.
In short, data modeling relies on an assumed model to draw inferences about how the variables are related, while algorithmic modeling focuses on predicting accurately without assuming such a model.

How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?

Breiman notes that data is being generated at an awesome rate. The advent of big data and machine learning reinforces his argument that scientists should be open to using a range of tools to conduct analyses. Data modeling and algorithmic modeling should be combined in order to strengthen predictions and improve accuracy.

Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?

I believe it is important for a model to be accurate, but understanding a model and understanding why it is accurate is of equal importance. The more we understand our methods, the greater the likelihood that we can make advancements within those methods.

Part B:

How good was the machine learning model you developed in the badge activity? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.

The machine learning model I developed in the badge activity had an accuracy of about 62.7%. I would prefer the model to be much more accurate. If I were reviewing research that relied on such a model, I would expect the authors to use a stronger model with higher accuracy.

How might the model be improved? Share any ideas you have at this time below:

The model could be improved by adding variables that increase its predictive power. We could also add more high-quality data, remove outliers, or use feature engineering.

Part C: Use the institutional library (e.g., NU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

Provide an APA citation for your selected study.

Tamez-Peña, J., Rosella, P., Totterman, S., Schreyer, E., Gonzalez, P., Venkataraman, A., & Meyers, S. P. (2021, November 26). Post-concussive mTBI in student athletes: MRI features and machine learning. Frontiers in Neurology. https://www.frontiersin.org/journals/neurology/articles/10.3389/fneur.2021.734329/full

What research questions were the authors of this study trying to address and why did they consider these questions important?

The authors’ purpose for this study was to determine and characterize the radiomics features from structural MRI and Diffusion Tensor Imaging associated with the presence of mild traumatic brain injuries in student athletes with post-concussive syndrome.

What were the results of these analyses?

Following a machine learning strategy, they were able to determine the presence of concussion in 81% of the concussion subjects with a specificity of 74%. The findings suggested that the concussion-induced abnormalities in post-concussion syndrome subjects are not uniformly distributed across the brain tissue. Subjects with post-concussion syndrome may have localized brain abnormalities that are invisible to conventional radiologic observation but are present and detectable with radiomic feature analysis.

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let your instructor know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer pane after you knit your document.

Your First Machine Learning Badge

Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official Lab Badge, submit the link on Blackboard and share with your instructor.

Once your instructor has checked your link, you will be provided a physical version of the badge below!

Machine Learning - Lab 1 Independent Assignment

[Dominic Valdiserri]

April 2, 2024
