Machine Learning - Lab 2 Badge

Part I: Data Product

For the data product, you will interpret a different type of model: a model in regression mode.

So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).

While many parts of the machine learning process are the same for a regression machine learning model, one key part that is relevant to this lab is different: its interpretation. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression models, which are evaluated with different metrics. For this lab, you will specify and interpret a regression machine learning model.

The requirements are as follows:

Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.
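
For reference, the three metrics named above are usually defined as follows. This is a sketch of the standard textbook definitions, not something taken from the case study; the symbols are assumptions introduced here, with $y_i$ the observed outcome for student $i$, $\hat{y}_i$ the model's prediction, and $n$ the number of test-set observations.

```latex
\begin{align}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}
\end{align}
```

MAE and RMSE are expressed in the units of the outcome, while MSE is in squared units.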

Please use the code chunk below for your code:

library(tidyr)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.1

## Warning: package 'ggplot2' was built under R version 4.3.1

## Warning: package 'readr' was built under R version 4.3.1

## Warning: package 'purrr' was built under R version 4.3.1

## Warning: package 'dplyr' was built under R version 4.3.1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Install the tidymodels and janitor packages if they are not already installed
if (!requireNamespace("tidymodels", quietly = TRUE)) {
  install.packages("tidymodels")
}

if (!requireNamespace("janitor", quietly = TRUE)) {
  install.packages("janitor")
}

# Load the tidymodels and janitor packages
library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.3.1

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8

## Warning: package 'broom' was built under R version 4.3.1

## Warning: package 'dials' was built under R version 4.3.1

## Warning: package 'modeldata' was built under R version 4.3.1

## Warning: package 'parsnip' was built under R version 4.3.1

## Warning: package 'recipes' was built under R version 4.3.1

## Warning: package 'rsample' was built under R version 4.3.1

## Warning: package 'tune' was built under R version 4.3.1

## Warning: package 'workflows' was built under R version 4.3.1

## Warning: package 'workflowsets' was built under R version 4.3.1

## Warning: package 'yardstick' was built under R version 4.3.1

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/

library(janitor)

## Warning: package 'janitor' was built under R version 4.3.1

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

students <- read_csv("C:/Users/nikhi/Desktop/oulad-students.csv")

## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(students)

## Rows: 32,593
## Columns: 15
## $ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region                     <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band                   <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
## $ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…

students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass", coded 1 if final_result equals "Pass" and 0 otherwise
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

students <- students %>% 
    mutate(disability = as.factor(disability))

view(students)


students %>% 
    count(id_student) # this many students

## # A tibble: 28,785 × 2
##    id_student     n
##         <dbl> <int>
##  1       3733     1
##  2       6516     1
##  3       8462     2
##  4      11391     1
##  5      23629     1
##  6      23632     1
##  7      23698     1
##  8      23798     1
##  9      24186     1
## 10      24213     2
## # ℹ 28,775 more rows

students %>% 
    count(code_module, code_presentation) # this many offerings

## # A tibble: 22 × 3
##    code_module code_presentation     n
##    <chr>       <chr>             <int>
##  1 AAA         2013J               383
##  2 AAA         2014J               365
##  3 BBB         2013B              1767
##  4 BBB         2013J              2237
##  5 BBB         2014B              1613
##  6 BBB         2014J              2292
##  7 CCC         2014B              1936
##  8 CCC         2014J              2498
##  9 DDD         2013B              1303
## 10 DDD         2013J              1938
## # ℹ 12 more rows

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # sets the factor levels in their natural order; any value not listed here becomes NA
    mutate(imd_band = as.integer(imd_band)) # converts each level to its position in that order (1 = "0-10%", 10 = "90-100%")
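
To see what the factor-to-integer conversion above does, here is a minimal sketch on a made-up vector. The values in `bands` are illustrative and not rows from the lab data, and `band_levels` and `band_codes` are hypothetical names. One entry is deliberately spelled without the percent sign to show that a value missing from `levels` silently becomes `NA`.

```r
# Toy vector of IMD bands (illustrative values only, not taken from the lab data)
bands <- c("10-20%", "0-10%", "90-100%", "10-20", "50-60%")

# Levels listed in their natural order so the integer codes are meaningful
band_levels <- c("0-10%", "10-20%", "20-30%", "30-40%", "40-50%",
                 "50-60%", "60-70%", "70-80%", "80-90%", "90-100%")

# Each value is replaced by the position of its level; "10-20" is not a level, so it becomes NA
band_codes <- as.integer(factor(bands, levels = band_levels))
band_codes
# [1]  2  1 10 NA  6
```

Running `count(imd_band)` before and after the conversion is one way to check whether any real values were turned into `NA` this way.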

students

## # A tibble: 32,593 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  11391 M      East Angli… HE Qualification 
##  2 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  3 AAA         2013J                  30268 F      North West… A Level or Equiv…
##  4 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  5 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
##  6 AAA         2013J                  38053 M      Wales       A Level or Equiv…
##  7 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  8 AAA         2013J                  45642 F      North West… A Level or Equiv…
##  9 AAA         2013J                  52130 F      East Angli… A Level or Equiv…
## 10 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80)

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)

data_test

## # A tibble: 6,519 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  2 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  3 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  4 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
##  5 AAA         2013J                  65002 F      East Angli… A Level or Equiv…
##  6 AAA         2013J                  71361 M      Ireland     HE Qualification 
##  7 AAA         2013J                  77367 M      East Midla… A Level or Equiv…
##  8 AAA         2013J                  98094 M      Wales       Lower Than A Lev…
##  9 AAA         2013J                 111717 F      East Angli… HE Qualification 
## 10 AAA         2013J                 114017 F      North Regi… Post Graduate Qu…
## # ℹ 6,509 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>

data_train

## # A tibble: 26,074 × 16
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 FFF         2014B                 595186 M      South Regi… Lower Than A Lev…
##  2 BBB         2014J                 504066 F      East Midla… Lower Than A Lev…
##  3 BBB         2013J                 585790 F      South East… HE Qualification 
##  4 CCC         2014J                 278413 M      London Reg… HE Qualification 
##  5 GGG         2014B                 634933 F      South Regi… Lower Than A Lev…
##  6 CCC         2014J                 608577 M      North Regi… HE Qualification 
##  7 BBB         2014B                 612120 F      East Midla… Lower Than A Lev…
##  8 FFF         2013J                 530852 M      Wales       Lower Than A Lev…
##  9 CCC         2014J                2555596 M      South Regi… A Level or Equiv…
## 10 DDD         2013B                 556575 M      North Regi… HE Qualification 
## # ℹ 26,064 more rows
## # ℹ 10 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>

my_rec <- recipe(pass ~ disability + imd_band, data = data_train)

my_rec

##

## ── Recipe ──────────────────────────────────────────────────────────────────────

##

## ── Inputs

## Number of variables by role

## outcome:   1
## predictor: 2

# specify model
my_mod <-
    logistic_reg()

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

fitted_model <- fit(my_wf, data = data_train)

fitted_model

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)  disabilityY     imd_band  
##    -0.78655     -0.28186      0.05974  
## 
## Degrees of Freedom: 22406 Total (i.e. Null);  22404 Residual
##   (3667 observations deleted due to missingness)
## Null Deviance:       29830 
## Residual Deviance: 29650     AIC: 29660

final_fit <- last_fit(my_mod, my_rec, split = train_test_split)

final_fit

## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits               id              .metrics .notes   .predictions .workflow 
##   <list>               <chr>           <list>   <list>   <list>       <list>    
## 1 <split [26074/6519]> train/test spl… <tibble> <tibble> <tibble>     <workflow>
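
The same `last_fit()` pattern carries over to the regression model the requirements describe. The sketch below runs on simulated data rather than the lab's data, because the outcome column for final exam performance does not appear in the output above. Everything named here (`sim`, `exam_score`, `sim_split`, `reg_rec`, `reg_mod`, `reg_fit`, `reg_metrics`) is hypothetical, and the simulated relationship is arbitrary.

```r
library(tidymodels)

set.seed(20230712)

# Simulated stand-in data; `exam_score` is a hypothetical continuous outcome
sim <- tibble(
    imd_band   = sample(1:10, 500, replace = TRUE),
    disability = factor(sample(c("N", "Y"), 500, replace = TRUE))
) %>%
    mutate(exam_score = 50 + 2 * imd_band + rnorm(n(), sd = 10))

sim_split <- initial_split(sim, prop = .80)

# Same predictors as the case-study recipe, with a numeric outcome
reg_rec <- recipe(exam_score ~ disability + imd_band, data = training(sim_split))

# Regression mode with the lm engine instead of logistic_reg() with glm
reg_mod <-
    linear_reg() %>%
    set_engine("lm") %>%
    set_mode("regression")

# Request MAE and RMSE explicitly; the default regression metrics are RMSE and R-squared
reg_fit <- last_fit(reg_mod, reg_rec, split = sim_split,
                    metrics = metric_set(mae, rmse))

reg_metrics <- collect_metrics(reg_fit)
reg_metrics
```

MSE is not requested separately in this sketch; it can be recovered by squaring the RMSE estimate.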

# collect test split predictions
final_fit %>%
    collect_predictions()

## # A tibble: 6,519 × 7
##    id               .pred_0 .pred_1  .row .pred_class pass  .config             
##    <chr>              <dbl>   <dbl> <int> <fct>       <fct> <chr>               
##  1 train/test split   0.647   0.353     2 0           1     Preprocessor1_Model1
##  2 train/test split   0.605   0.395     4 0           1     Preprocessor1_Model1
##  3 train/test split   0.634   0.366     7 0           1     Preprocessor1_Model1
##  4 train/test split  NA      NA        10 <NA>        1     Preprocessor1_Model1
##  5 train/test split   0.577   0.423    16 0           0     Preprocessor1_Model1
##  6 train/test split  NA      NA        18 <NA>        1     Preprocessor1_Model1
##  7 train/test split   0.634   0.366    21 0           1     Preprocessor1_Model1
##  8 train/test split   0.577   0.423    24 0           1     Preprocessor1_Model1
##  9 train/test split   0.547   0.453    33 0           1     Preprocessor1_Model1
## 10 train/test split  NA      NA        35 <NA>        1     Preprocessor1_Model1
## # ℹ 6,509 more rows

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct

## # A tibble: 6,519 × 3
##    .pred_class pass  correct
##    <fct>       <fct> <lgl>  
##  1 0           1     FALSE  
##  2 0           1     FALSE  
##  3 0           1     FALSE  
##  4 <NA>        1     NA     
##  5 0           0     TRUE   
##  6 <NA>        1     NA     
##  7 0           1     FALSE  
##  8 0           1     FALSE  
##  9 0           1     FALSE  
## 10 <NA>        1     NA     
## # ℹ 6,509 more rows

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) %>% # create a new variable, correct, telling us when the model was and was not correct
    tabyl(correct)

##  correct    n   percent valid_percent
##    FALSE 2071 0.3176868      0.372549
##     TRUE 3488 0.5350514      0.627451
##       NA  960 0.1472618            NA
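
The two percentage columns in the `tabyl()` output answer slightly different questions. As a small worked check, the arithmetic below reproduces both from the counts shown; the variable names are introduced here for illustration.

```r
# Counts copied from the tabyl(correct) output above
n_correct   <- 3488
n_incorrect <- 2071
n_missing   <- 960   # rows with no prediction (NA)

# Accuracy among rows that received a prediction (the valid_percent column)
acc_valid <- n_correct / (n_correct + n_incorrect)

# Share of all test rows predicted correctly (the percent column)
acc_all <- n_correct / (n_correct + n_incorrect + n_missing)

round(c(valid = acc_valid, all = acc_all), 3)
# valid   all 
# 0.627 0.535
```

The roughly 63% figure therefore describes only the test rows for which the model could make a prediction.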

students %>% 
    count(pass)

## # A tibble: 2 × 2
##   pass      n
##   <fct> <int>
## 1 0     20232
## 2 1     12361

students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & pass == 1 |
               prediction == 0 & pass == 0, 1, 0)) %>% 
    tabyl(correct)

##  correct     n   percent
##        0 16328 0.5009665
##        1 16265 0.4990335
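
Random guessing is correct about half the time, as the output above shows. A stricter point of comparison is always predicting the more common class. The sketch below computes that baseline from the `count(pass)` output; note that it uses the full data set rather than the test split, so the comparison with the model's test accuracy is only approximate. The variable names are introduced here for illustration.

```r
# Class counts copied from the count(pass) output above
n_not_pass <- 20232   # pass == 0
n_pass     <- 12361   # pass == 1

# Accuracy obtained by always predicting the more common class
majority_baseline <- max(n_not_pass, n_pass) / (n_not_pass + n_pass)
round(majority_baseline, 3)
# [1] 0.621
```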

Please add your interpretations here:

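MAE: The mean absolute error is the average size of the gap between our model’s predicted final exam score and the student’s true score, ignoring whether the prediction was too high or too low. It is in the same units as the outcome, so it reads directly as “how many points off” a typical prediction is.
MSE: The mean squared error is the average of the squared differences between our model’s predictions and the true values. Because each error is squared, a few large misses raise it much more than many small ones, and its value is in squared units, which makes it harder to interpret directly.
RMSE: RMSE is the square root of MSE, which returns the error to the same units as the outcome variable. It is a measure of the typical error in our model’s predictions that, like MSE, penalizes large misses more heavily than MAE does.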
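
To make the difference between the three metrics concrete, here is a minimal sketch on made-up numbers. The scores in `truth` and `estimate` are invented for illustration and are not results from the model above.

```r
# Made-up exam scores and predictions, purely for illustration
truth    <- c(70, 55, 90, 62)
estimate <- c(65, 60, 80, 62)

errors <- truth - estimate   # 5, -5, 10, 0

mae  <- mean(abs(errors))    # average size of a miss, in points
mse  <- mean(errors^2)       # average squared miss, in squared points
rmse <- sqrt(mse)            # back in points

round(c(MAE = mae, MSE = mse, RMSE = rmse), 2)
#   MAE   MSE  RMSE 
#  5.00 37.50  6.12
```

RMSE comes out larger than MAE here because the single 10-point miss is weighted more heavily once errors are squared.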

Part II: Reflect and Plan

What is an example of an outcome related to your research interests that could be modeled using a classification machine learning model?

Predicting, from past usage and demographic information, whether a client will leave or remain with a subscription service. This is a binary classification problem because the outcome is a dichotomous variable.

What is an example of an outcome related to your research interests that could be modeled using a regression machine learning model?

Estimating a retail store’s annual revenue using information on its location, size, marketing budget, and target market. Regression analysis is appropriate in this situation since the outcome variable is continuous (revenue expressed in dollars).

Look back to the study you identified for the first machine learning lab badge activity. Was the outcome one that is modeled using a classification or a regression machine learning model? Identify which mode(s) the authors of that paper used and briefly discuss the appropriateness of their decision.

The authors used a classification model. Their goal was to predict a categorical result (over or under) from inputs such as student demographics, prior learning, and classroom characteristics, so a classification model was the appropriate choice. Classification models suit problems in which the outcome is membership in one of two or more discrete groups rather than a number.

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document.

Have fun!

Nikhila

September 22, 2023
