Machine Learning - Learning Lab 1 Independent Practice

The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

In Part I, you will extend our model by adding another variable.
In Part II, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.

Part I: Extending our model

In this part of the badge activity, please add another variable: the number of days before the start of the module that students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is (if at all, as this variable might not have great predictive power). Note that this variable is numeric, so no pre-processing is necessary.

In doing so, please move all of your code needed to run the analysis over from your case study file here. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in the chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.5     ✔ rsample      1.1.1
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
## ✔ recipes      1.0.6     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

students <- read_csv("data/oulad-students.csv")

## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(students)

## Rows: 32,593
## Columns: 15
## $ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region                     <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band                   <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
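
The glimpse() output shows NA values in at least one column (date_unregistration), and glm() drops rows with a missing value in any model variable by default. Below is a minimal sketch of a check one might run on the predictors before modeling. It uses a small invented data frame: toy_students and its values are assumptions for illustration, not the OULAD data.

```r
# Toy stand-in for the students data; all values are invented for illustration
toy_students <- data.frame(
  disability        = c("N", "Y", "N", "N", "Y"),
  imd_band          = c("0-10%", NA, "20-30%", "90-100%", "10-20%"),
  date_registration = c(-159, -53, NA, -52, -29)
)

# Count missing values in each candidate predictor
missing_counts <- colSums(is.na(toy_students))
missing_counts

# Number of rows with no missing values, i.e. the rows a default glm() fit would keep
n_complete <- sum(complete.cases(toy_students))
n_complete
```

Running the same two calls on the real predictors would show how many rows each variable costs the model.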

students <- students %>% 
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass", dummy-coded 1 if final_result equals "Pass" and 0 otherwise
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

students <- students %>% 
    mutate(disability = as.factor(disability))

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels

students

## # A tibble: 32,593 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  11391 M      East Angli… HE Qualification 
##  2 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  3 AAA         2013J                  30268 F      North West… A Level or Equiv…
##  4 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  5 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
##  6 AAA         2013J                  38053 M      Wales       A Level or Equiv…
##  7 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  8 AAA         2013J                  45642 F      North West… A Level or Equiv…
##  9 AAA         2013J                  52130 F      East Angli… A Level or Equiv…
## 10 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>
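
The conversion of imd_band above relies on how factor() and as.integer() interact: each value is replaced by the position of its matching level, and any value that does not match a level exactly becomes NA without a warning. Here is a small sketch on invented values. toy_bands is hypothetical, and its fourth entry is deliberately written without the percent sign to show the mismatch case.

```r
band_levels <- c("0-10%", "10-20%", "20-30%", "30-40%", "40-50%",
                 "50-60%", "60-70%", "70-80%", "80-90%", "90-100%")

# Toy values, including one that does not match any level exactly ("10-20")
toy_bands <- c("90-100%", "20-30%", "0-10%", "10-20", NA)

# Each value is mapped to the index of its level; non-matching values become NA
band_codes <- as.integer(factor(toy_bands, levels = band_levels))
band_codes
```

Comparing table(students$imd_band, useNA = "ifany") before and after the conversion is one way to confirm that no category was silently lost.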

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)

my_rec <- recipe(pass ~ disability + imd_band + date_registration, data = data_train)

my_rec

## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs 
## Number of variables by role
## outcome:   1
## predictor: 3

# specify model
my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

fitted_model <- fit(my_wf, data = data_train)
fitted_model

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##       (Intercept)        disabilityY           imd_band  date_registration  
##         -0.667029          -0.280013           0.059134           0.001643  
## 
## Degrees of Freedom: 22371 Total (i.e. Null);  22368 Residual
##   (3702 observations deleted due to missingness)
## Null Deviance:       29800 
## Residual Deviance: 29580     AIC: 29590

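Adding the variable “date_registration” reduced the AIC from 29660 to 29590. A lower AIC indicates a better balance between fit and model complexity, which suggests that the three-predictor model fits the training data better.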
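
As a rough illustration of what the AIC comparison measures, the sketch below fits two nested logistic regressions with base R's glm() on simulated data and reproduces AIC by hand. The data, the names x1 and x2, and the coefficients are all assumptions made for the example. AIC values are only directly comparable when both models are fitted to the same rows, which matters when an added predictor has missing values.

```r
set.seed(1)  # arbitrary seed for the simulated data

# Simulated data: x1 is related to the outcome, x2 is pure noise
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * x1))
toy <- data.frame(y, x1, x2)

# Fit both models on the same rows so that their AIC values are comparable
fit_small <- glm(y ~ x1, family = binomial, data = toy)
fit_large <- glm(y ~ x1 + x2, family = binomial, data = toy)

# AIC = 2 * (number of estimated parameters) - 2 * log-likelihood
manual_aic <- 2 * length(coef(fit_small)) - 2 * as.numeric(logLik(fit_small))

c(small = AIC(fit_small), large = AIC(fit_large), manual_small = manual_aic)
```

Because the penalty term grows by 2 for each added parameter, a new predictor lowers the AIC only if it improves the log-likelihood by more than 1.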

final_fit <- last_fit(fitted_model, train_test_split)
final_fit

## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits               id              .metrics .notes   .predictions .workflow 
##   <list>               <chr>           <list>   <list>   <list>       <list>    
## 1 <split [26074/6519]> train/test spl… <tibble> <tibble> <tibble>     <workflow>

#class_metrics <- metric_set(accuracy, sensitivity, specificity, ppv, npv, kap)
final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct

## # A tibble: 6,519 × 3
##    .pred_class pass  correct
##    <fct>       <fct> <lgl>  
##  1 0           1     FALSE  
##  2 0           1     FALSE  
##  3 0           1     FALSE  
##  4 <NA>        1     NA     
##  5 0           0     TRUE   
##  6 <NA>        1     NA     
##  7 0           1     FALSE  
##  8 0           1     FALSE  
##  9 0           1     FALSE  
## 10 <NA>        1     NA     
## # ℹ 6,509 more rows
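
Some rows in the output above have <NA> as the predicted class, so the correct column is NA for them as well. A minimal base R sketch of how accuracy can be computed from such a column, and how to see what share of rows went unscored, follows. The vectors are invented for illustration and are not taken from the test set.

```r
# Toy outcome and predictions; NA marks rows that received no prediction
truth <- factor(c(1, 1, 0, 1, 0, 0), levels = c(0, 1))
pred  <- factor(c(0, NA, 0, 1, NA, 1), levels = c(0, 1))

# TRUE/FALSE where a prediction exists, NA otherwise
correct <- pred == truth
correct

# Accuracy over the rows that received a prediction
acc_scored <- mean(correct, na.rm = TRUE)

# Share of rows without a prediction
share_unscored <- mean(is.na(pred))

c(accuracy = acc_scored, unscored = share_unscored)
```

Reporting both numbers together makes it clear that the accuracy describes only the rows the model could score.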

final_fit %>% 
    collect_metrics()

## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.627 Preprocessor1_Model1
## 2 roc_auc  binary         0.555 Preprocessor1_Model1

Previous results:
accuracy  binary  0.6274510  Preprocessor1_Model1
roc_auc   binary  0.5478134  Preprocessor1_Model1

New results:
accuracy  binary  0.6271156  Preprocessor1_Model1
roc_auc   binary  0.5551058  Preprocessor1_Model1

How does the accuracy of this new model compare? Add a few reflections below:

The accuracy of the two models is nearly identical (0.6275 versus 0.6271), so the added predictor did not improve classification accuracy. The roc_auc of the new model is slightly higher (0.555 versus 0.548).
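
Accuracy and roc_auc can move independently because accuracy depends only on which side of the 0.5 cutoff each predicted probability falls, while AUC depends on how well the probabilities rank passing students above non-passing ones. The sketch below illustrates this with two invented sets of scores. auc_rank() and acc() are hypothetical helpers written for this example (a rank-based AUC formula), not yardstick functions.

```r
# Hypothetical helper: AUC as the probability that a randomly chosen positive
# case receives a higher score than a randomly chosen negative case
auc_rank <- function(truth, score) {
  r     <- rank(score)            # average ranks handle ties
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical helper: accuracy at a 0.5 probability cutoff
acc <- function(truth, score) mean((score >= 0.5) == (truth == 1))

truth   <- c(0, 0, 0, 1, 1)
score_a <- c(0.10, 0.20, 0.40, 0.30, 0.45)  # one negative outranks a positive
score_b <- c(0.10, 0.20, 0.30, 0.40, 0.45)  # positives ranked above all negatives

# Every score is below 0.5, so both models predict class 0 for every row
c(acc_a = acc(truth, score_a), acc_b = acc(truth, score_b))
c(auc_a = auc_rank(truth, score_a), auc_b = auc_rank(truth, score_b))
```

In this toy case both sets of scores give the same accuracy, yet the second ranks the cases better and so has the higher AUC.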

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?

In his paper, “Statistical Modeling: The Two Cultures,” Leo Breiman outlines two cultures of statistical modeling: data modeling and algorithmic modeling.
- Data modeling assumes that the data are generated by a given stochastic data model. The goal of data modeling is to find the parameters of this model that best fit the data.
- Algorithmic modeling does not assume that the data are generated by a known model. The goal of algorithmic modeling is to find a model that can accurately predict new data, even if the model does not accurately reflect the underlying data generating process.
The primary difference between these two cultures is the role of assumptions. Data modeling relies on assumptions about the data generating process, while algorithmic modeling does not. This difference has implications for the way that these two cultures approach model building, model validation, and model interpretation.

How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?

The advent of big data and machine learning has reinforced Breiman’s argument that the two cultures of statistical modeling should be combined. In order to build models that are both accurate and interpretable, it is necessary to combine the strengths of data modeling and algorithmic modeling.

Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?

I would say that Breiman’s stance is important but not absolute. Predictive accuracy is essential, but understanding how a method works can also be valuable.

Part B:

How good was the machine learning model we developed in the badge activity? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.

The models could be better. I would consider adding or removing features to bring the accuracy closer to 0.9 than to 0.6.
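
One way to put an accuracy near 0.6 in context is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most common outcome. A minimal sketch on an invented outcome vector (toy_pass is an assumption, not the lab data):

```r
# Toy outcome: 7 students did not pass (0), 3 passed (1)
toy_pass <- factor(c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0), levels = c(0, 1))

# Share of each class in the data
class_share <- prop.table(table(toy_pass))
class_share

# Accuracy of a "model" that always predicts the most common class
baseline_accuracy <- as.numeric(max(class_share))
baseline_accuracy
```

If a fitted model's accuracy is no higher than this baseline on the same test rows, the predictors are adding little beyond the overall pass rate.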

How might the model be improved? Share any ideas you have at this time below:

Consider other variables or feature engineering.

Part C: Use the institutional library (e.g. NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

Provide an APA citation for your selected study.

- Wang, J., Zhang, M., & Zhang, H. (2020). Predicting student academic success using machine learning. Smart Learning Environments, 17(1), 1-13.

What research questions were the authors of this study trying to address and why did they consider these questions important?

- Can machine learning be used to predict student academic success?
- What are the most important features for predicting student academic success?
- Can machine learning be used to identify students who are at risk of academic failure?

The authors considered these questions important because they believe that machine learning has the potential to help students succeed in school. By identifying the factors that contribute to student academic success, machine learning can be used to provide targeted interventions to help students who are struggling.

What were the results of these analyses?

- The results of the analyses showed that the model was able to predict student academic success with an accuracy of 85%. The most important features for predicting student academic success were grades, attendance, and test scores. The model was also able to identify students who were at risk of academic failure.

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name after author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Your First Machine Learning Badge

Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official LASER Badge, share the link to your published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet. We recommend bookmarking this spreadsheet as we’ll be using it throughout the year to keep track of your progress.

Once your instructor has checked your link, you will be provided a physical version of the badge below!

Machine Learning - Learning Lab 1 Independent Practice

Mighty Itauma Itauma

July 20, 2023
