Machine Learning - Lab 1 Independent Assignment

The lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

In Part I, you will extend our model by adding another variable.
In Part II, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.

Part I: Extending our model

In this part of the badge activity, please add another variable – a variable for the number of days before the start of the module students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is (if at all, as this variable might not have great predictive power). Note that this variable is a number and so no pre-processing is necessary.

In doing so, please move all of your code needed to run the analysis over from your case study file here. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in the chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).

# ALL RELEVANT CODE FROM CASE STUDY


# LOAD RELEVANT PACKAGES
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.0 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

# DATA WRANGLING
students <- read_csv("data/oulad-students.csv") #load/import files

## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>%
    mutate(pass = as.factor(pass)) # create dichotomous variable showing whether a student passed and convert to factor
students <- students %>% 
    mutate(disability = as.factor(disability)) # convert disability dichotomous variable to factor

# DATA EXPLORATION
students %>% 
    count(id_student) #count number of students

## # A tibble: 28,785 × 2
##    id_student     n
##         <dbl> <int>
##  1       3733     1
##  2       6516     1
##  3       8462     2
##  4      11391     1
##  5      23629     1
##  6      23632     1
##  7      23698     1
##  8      23798     1
##  9      24186     1
## 10      24213     2
## # ℹ 28,775 more rows

students %>% 
    count(code_module, code_presentation) # count number of courses and offerings

## # A tibble: 22 × 3
##    code_module code_presentation     n
##    <chr>       <chr>             <int>
##  1 AAA         2013J               383
##  2 AAA         2014J               365
##  3 BBB         2013B              1767
##  4 BBB         2013J              2237
##  5 BBB         2014B              1613
##  6 BBB         2014J              2292
##  7 CCC         2014B              1936
##  8 CCC         2014J              2498
##  9 DDD         2013B              1303
## 10 DDD         2013J              1938
## # ℹ 12 more rows

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # changes levels into integers based on the order of the factor levels

# FIRST MODEL, NO MODIFICATIONS
set.seed(20230712)
train_test_split <- initial_split(students, prop = .80) # split data

data_train <- training(train_test_split)
data_test  <- testing(train_test_split) # assign data variables for split

data_train

## # A tibble: 26,074 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 FFF         2014B                 595186 M      South Regi… Lower Than A Lev…
##  2 BBB         2014J                 504066 F      East Midla… Lower Than A Lev…
##  3 BBB         2013J                 585790 F      South East… HE Qualification 
##  4 CCC         2014J                 278413 M      London Reg… HE Qualification 
##  5 GGG         2014B                 634933 F      South Regi… Lower Than A Lev…
##  6 CCC         2014J                 608577 M      North Regi… HE Qualification 
##  7 BBB         2014B                 612120 F      East Midla… Lower Than A Lev…
##  8 FFF         2013J                 530852 M      Wales       Lower Than A Lev…
##  9 CCC         2014J                2555596 M      South Regi… A Level or Equiv…
## 10 DDD         2013B                 556575 M      North Regi… HE Qualification 
## # ℹ 26,064 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>

data_test # check variables to verify proportion

## # A tibble: 6,519 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  2 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  3 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  4 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
##  5 AAA         2013J                  65002 F      East Angli… A Level or Equiv…
##  6 AAA         2013J                  71361 M      Ireland     HE Qualification 
##  7 AAA         2013J                  77367 M      East Midla… A Level or Equiv…
##  8 AAA         2013J                  98094 M      Wales       Lower Than A Lev…
##  9 AAA         2013J                 111717 F      East Angli… HE Qualification 
## 10 AAA         2013J                 114017 F      North Regi… Post Graduate Qu…
## # ℹ 6,509 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>

my_rec <- recipe(pass ~ disability + imd_band, data = data_train) # create recipe to prepare data, added date_registration to model

my_mod <- # create model
    logistic_reg() %>% # logistic regression model
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_wf <-
    workflow() %>% # create a workflow using model and recipe outlined earlier
    add_model(my_mod) %>% 
    add_recipe(my_rec) 

fitted_model <- fit(my_wf, data = data_train) # fit model to data
final_fit <- last_fit(my_wf, train_test_split) # final fit using testing data to evaluate accuracy

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

##  correct    n   percent valid_percent
##    FALSE 2071 0.3176868      0.372549
##     TRUE 3488 0.5350514      0.627451
##       NA  960 0.1472618            NA

# SECOND EXTENDED MODEL, REGISTRATION DATE ADDED
set.seed(20230712)
train_test_split <- initial_split(students, prop = .80) # split data

data_train <- training(train_test_split)
data_test  <- testing(train_test_split) # assign data variables for split

data_train

## # A tibble: 26,074 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 FFF         2014B                 595186 M      South Regi… Lower Than A Lev…
##  2 BBB         2014J                 504066 F      East Midla… Lower Than A Lev…
##  3 BBB         2013J                 585790 F      South East… HE Qualification 
##  4 CCC         2014J                 278413 M      London Reg… HE Qualification 
##  5 GGG         2014B                 634933 F      South Regi… Lower Than A Lev…
##  6 CCC         2014J                 608577 M      North Regi… HE Qualification 
##  7 BBB         2014B                 612120 F      East Midla… Lower Than A Lev…
##  8 FFF         2013J                 530852 M      Wales       Lower Than A Lev…
##  9 CCC         2014J                2555596 M      South Regi… A Level or Equiv…
## 10 DDD         2013B                 556575 M      North Regi… HE Qualification 
## # ℹ 26,064 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>

data_test # check variables to verify proportion

## # A tibble: 6,519 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  2 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  3 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  4 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
##  5 AAA         2013J                  65002 F      East Angli… A Level or Equiv…
##  6 AAA         2013J                  71361 M      Ireland     HE Qualification 
##  7 AAA         2013J                  77367 M      East Midla… A Level or Equiv…
##  8 AAA         2013J                  98094 M      Wales       Lower Than A Lev…
##  9 AAA         2013J                 111717 F      East Angli… HE Qualification 
## 10 AAA         2013J                 114017 F      North Regi… Post Graduate Qu…
## # ℹ 6,509 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>

my_rec <- recipe(pass ~ disability + imd_band + date_registration, data = data_train) # create recipe to prepare data, added date_registration to model

my_mod <- # create model
    logistic_reg() %>% # logistic regression model
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_wf <-
    workflow() %>% # create a workflow using model and recipe outlined earlier
    add_model(my_mod) %>% 
    add_recipe(my_rec) 

fitted_model <- fit(my_wf, data = data_train) # fit model to data
final_fit <- last_fit(my_wf, train_test_split) # final fit using testing data to evaluate accuracy

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

##  correct    n   percent valid_percent
##    FALSE 2071 0.3176868     0.3728844
##     TRUE 3483 0.5342844     0.6271156
##       NA  965 0.1480288            NA

Previous results: A 62.75% correct prediction rate

New results: An 62.71% correct prediction rate

How does the accuracy of this new model compare? Add a few reflections below:

The new model performs marginally worse than the previous model after adding date of registration. This points to the date of registration having no effect on predicting dropout, so it can be removed.

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?

According to Leo Breiman, there are two specific cultures of statistical modeling in his paper: data modeling vs. algorithmic modeling. Subscribers of the data modeling perspective (some 98% of statisticians according to Breiman) begin with an intuitive assumption that might explain the data, and fit models that make sense. In contrast, algorithmic modelers (2% of statisticians according to Breiman) consider the data-generating mechanism unknown (a black box) and begin with no assumptions. They instead focus purely upon the model with the most predictive accuracy and performance, and deal with explainability afterwards.

How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?

The huge, complicated data needs of today have only served to validate Breiman’s argument. Algorithmic modeling techniques are being applied en masse through machine learning in virtually every modern industry because of their predictive powers. I imagine the split of 98% data modeling and 2% algorithmic modeling described by Breiman has changed drastically in recent years.

Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?

I agree with Breiman - what point is there explaining and understanding a model that is not accurate or relevant? It doesn’t matter whether a model should or should not perform well based on intuition or subjectivity, what matters is the objective truth of which models perform the best. Theory and understanding will follow. Accuracy should be the most important.

Part B:

How good was the machine learning model you developed in the badge activity? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.

The machine learning model developed in our case study had a relatively poor predictive accuracy of ~63%, which is not much better than random chance. Extending the model in our badge activity did not change this accuracy significantly (a 0.04% change).

How might the model be improved? Share any ideas you have at this time below:

Incorporating other variables (test scores, age, etc.) could add more predictive power and accuracy to the model. Trying other types of models could also be fruitful.

Part C: Use the institutional library (e.g. NU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

Provide an APA citation for your selected study.
- Aulck, L., Velagapudi, N., Blumenstock, J., & West, J. (2017, March 7). Predicting student dropout in higher education. arxiv. https://doi.org/10.48550/arXiv.1606.06364
What research questions were the authors of this study trying to address and why did they consider these questions important?
- What factors are associated with student dropout risk in higher education? Can we predict student dropout, if so, how accurately?
What were the results of these analyses?
- Using a balanced dataset of 32,500 students, logistic regression was the most accurate methodology. The features with the highest predictive performance were GPA in math, english, chemistry, and psychology courses, along with birth year and year of enrollment. The authors say this point out to time effects.

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel for knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let your instructor know if you run into any issues with knitting.
Finally, publish your webpage on Rpubs by clicking the “Publish” button located in the Viewer Pane after you knit your document.

Your First Machine Learning Badge

Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official Lab Badge, submit the link on Blackboard and share with your instructor.

Once your instructor has checked your link, you will be provided a physical version of the badge below!

Machine Learning - Lab 1 Independent Assignment

Benjamin Mauldin

April 2, 2024

Part I: Extending our model

Part II: Reflect and Plan

Knit and Publish

Your First Machine Learning Badge