Machine Learning - Learning Lab 1 Independent Practice

The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

In Part I, you will extend our model by adding another variable.
In Part II, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.

Part I: Extending our model

In this part of the badge activity, please add another variable – a variable for the number of days before the start of the module students registered. This variable will be a third predictor. By adding it, you’ll be able to examine how much more accurate your model is (if at al, as this variable might not have great predictive power). Note that this variable is a number and so no pre-processing is necessary.

In doing so, please move all of your code needed to run the analysis over from your case study file here. This is essential for your analysis to be reproducible. You may wish to break your code into multiple chunks based on the overall purpose of the code in the chunk (e.g., loading packages and data, wrangling data, and each of the machine learning steps).

install.packages("tidymodels")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──

## ✔ broom        1.0.5     ✔ recipes      1.0.6
## ✔ dials        1.2.0     ✔ rsample      1.1.1
## ✔ dplyr        1.1.2     ✔ tibble       3.2.1
## ✔ ggplot2      3.4.2     ✔ tidyr        1.3.0
## ✔ infer        1.0.4     ✔ tune         1.1.1
## ✔ modeldata    1.1.0     ✔ workflows    1.1.3
## ✔ parsnip      1.1.0     ✔ workflowsets 1.0.1
## ✔ purrr        1.0.1     ✔ yardstick    1.2.0

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ lubridate 1.9.2     ✔ stringr   1.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

students <- read_csv("data/oulad-students.csv")

## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(students)

## Rows: 32,593
## Columns: 15
## $ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region                     <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band                   <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…

students <- students %>% 
    mutate(withdrawn = ifelse(final_result == "Withdrawn", 1, 0)) %>% # creates a new variable named "withdrawn" and a dummy code of 1 if value of final_result equals "Withdrawn" and 0 if not
    mutate(withdrawn = as.factor(withdrawn)) # makes the variable a factor, helping later steps

students <- students %>% 
    mutate(disability = as.factor(disability))

view(students)
count(students)

students <- students %>% 
    mutate(withdrawn = ifelse(final_result == "Withdrawn", 1, 0)) %>% # creates a new variable named "withdrawn" and a dummy code of 1 if value of final_result equals "Withdrawn" and 0 if not
    mutate(withdrawn = as.factor(withdrawn)) # makes the variable a factor, helping later steps

students <- students %>% 
    mutate(disability = as.factor(disability))

view(students)

students %>% 
    count(id_student) # this many students

## # A tibble: 28,785 × 2
##    id_student     n
##         <dbl> <int>
##  1       3733     1
##  2       6516     1
##  3       8462     2
##  4      11391     1
##  5      23629     1
##  6      23632     1
##  7      23698     1
##  8      23798     1
##  9      24186     1
## 10      24213     2
## # ℹ 28,775 more rows

students %>% 
    count(code_module, code_presentation) # this many offerings

## # A tibble: 22 × 3
##    code_module code_presentation     n
##    <chr>       <chr>             <int>
##  1 AAA         2013J               383
##  2 AAA         2014J               365
##  3 BBB         2013B              1767
##  4 BBB         2013J              2237
##  5 BBB         2014B              1613
##  6 BBB         2014J              2292
##  7 CCC         2014B              1936
##  8 CCC         2014J              2498
##  9 DDD         2013B              1303
## 10 DDD         2013J              1938
## # ℹ 12 more rows

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels



students

## # A tibble: 32,593 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  11391 M      East Angli… HE Qualification 
##  2 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  3 AAA         2013J                  30268 F      North West… A Level or Equiv…
##  4 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  5 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
##  6 AAA         2013J                  38053 M      Wales       A Level or Equiv…
##  7 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  8 AAA         2013J                  45642 F      North West… A Level or Equiv…
##  9 AAA         2013J                  52130 F      East Angli… A Level or Equiv…
## 10 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, withdrawn <fct>

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)

data_train

## # A tibble: 26,074 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 FFF         2014B                 595186 M      South Regi… Lower Than A Lev…
##  2 BBB         2014J                 504066 F      East Midla… Lower Than A Lev…
##  3 BBB         2013J                 585790 F      South East… HE Qualification 
##  4 CCC         2014J                 278413 M      London Reg… HE Qualification 
##  5 GGG         2014B                 634933 F      South Regi… Lower Than A Lev…
##  6 CCC         2014J                 608577 M      North Regi… HE Qualification 
##  7 BBB         2014B                 612120 F      East Midla… Lower Than A Lev…
##  8 FFF         2013J                 530852 M      Wales       Lower Than A Lev…
##  9 CCC         2014J                2555596 M      South Regi… A Level or Equiv…
## 10 DDD         2013B                 556575 M      North Regi… HE Qualification 
## # ℹ 26,064 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, withdrawn <fct>

data_test

## # A tibble: 6,519 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  2 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  3 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  4 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
##  5 AAA         2013J                  65002 F      East Angli… A Level or Equiv…
##  6 AAA         2013J                  71361 M      Ireland     HE Qualification 
##  7 AAA         2013J                  77367 M      East Midla… A Level or Equiv…
##  8 AAA         2013J                  98094 M      Wales       Lower Than A Lev…
##  9 AAA         2013J                 111717 F      East Angli… HE Qualification 
## 10 AAA         2013J                 114017 F      North Regi… Post Graduate Qu…
## # ℹ 6,509 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, withdrawn <fct>

my_rec <- recipe(withdrawn ~ disability + imd_band + date_registration , data = data_train)

my_rec

##

## ── Recipe ──────────────────────────────────────────────────────────────────────

##

## ── Inputs

## Number of variables by role

## outcome:   1
## predictor: 3

# specify model
my_mod <-
    logistic_reg()

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

library(tidymodels)
library(tidyverse)
library(janitor)
fitted_model <- fit(my_wf, data = data_train)

fitted_model

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##       (Intercept)        disabilityY           imd_band  date_registration  
##         -0.908786           0.380061          -0.054449          -0.004984  
## 
## Degrees of Freedom: 22371 Total (i.e. Null);  22368 Residual
##   (3702 observations deleted due to missingness)
## Null Deviance:       27620 
## Residual Deviance: 27130     AIC: 27140

final_fit <- last_fit(fitted_model, train_test_split)

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, withdrawn) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == withdrawn) # create a new variable, correct, telling us when the model was and was not correct

## # A tibble: 6,519 × 3
##    .pred_class withdrawn correct
##    <fct>       <fct>     <lgl>  
##  1 0           0         TRUE   
##  2 0           0         TRUE   
##  3 0           0         TRUE   
##  4 <NA>        0         NA     
##  5 0           1         FALSE  
##  6 <NA>        0         NA     
##  7 0           0         TRUE   
##  8 0           0         TRUE   
##  9 0           0         TRUE   
## 10 <NA>        0         NA     
## # ℹ 6,509 more rows

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, withdrawn) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == withdrawn) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

##  correct    n   percent valid_percent
##    FALSE 1760 0.2699801     0.3168887
##     TRUE 3794 0.5819911     0.6831113
##       NA  965 0.1480288            NA

final_fit %>% 
    collect_metrics()

## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.683 Preprocessor1_Model1
## 2 roc_auc  binary         0.588 Preprocessor1_Model1

students %>% 
    count(withdrawn)

## # A tibble: 2 × 2
##   withdrawn     n
##   <fct>     <int>
## 1 0         22437
## 2 1         10156

students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & withdrawn == 1 |
               prediction == 0 & withdrawn == 0, 1, 0)) %>% 
    tabyl(correct)

##  correct     n   percent
##        0 16305 0.5002608
##        1 16288 0.4997392

How does the accuracy of this new model compare? Add a few reflections below: The model accuracy actually decreased. -

Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper? data models: data are generated by a given data model. algorithmic models: data mechanism is unknown.

How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published? the advent of big model and machine learning vividly showed that algorithmic models are more appropriate in prediction.

Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?

The availability of large high quality data plus with machine learning tools allows researchers to gain more insights.Yet algorithm models should be carefully evaluated. The research design should place an emphasis on theories and show how theories can be utilized to inform or guide the study.

Part B:

How good was the machine learning model we developed in the badge activity? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.

The machine learning model we develop is good as it demonstrates how the machine learning can be used in learning analytics in education.

If i serve as a reviewer, i would focus on the theories behind the research design;
the data collection procedure as well as data quality.
How the authors interpret the research findings.
How might the model be improved? Share any ideas you have at this time below:
1. Adding more relevant independent variables.
2. Using re-sampling approaches when analyzing the data.

Part C: Use the institutional library (e.g. NCSU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

Provide an APA citation for your selected study.

Baker, R., Berning, A.W., Gowda, S.M., Zhang, S., & Hawn, A. (2020). Predicting K-12 Dropout. Journal of Education for Students Placed at Risk (JESPAR), 25, 28 - 54.

What research questions were the authors of this study trying to address and why did they consider these questions important? What factors predict high school student drop out? Student dropout is linked with unemployment/underemployment and mental health issues.
What were the results of these analyses?
- They authors found a total of 23 factors predict student dropout. Factors such as dress code violations, in-school suspensions, high variance in grades, grades shifting across the course of the year, and absences are all related to student dropout. In particular, the authors found dress code violations is a strong predictor of student dropout.

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel for knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let’s us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Your First Machine Learning Badge

Congratulations, you’ve completed your first badge activity! To receive credit for this assignment and earn your first official LASER Badge, share the link to published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet. We recommend bookmarking this spreadsheet as we’ll be using it throughout the year to keep track of your progress.

Once your instructor has checked your link, you will be provided a physical version of the badge below!

Machine Learning - Learning Lab 1 Independent Practice

Hongwei Yu

July 20, 2023

Part I: Extending our model

Part II: Reflect and Plan

Knit and Publish

Your First Machine Learning Badge