As a reminder, to earn a badge for these learning labs, you will need to respond to a set of prompts in two parts.

Part I: Data Product

For the data product, you will interpret a different type of model: a model in regression mode.

So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).

While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this learning lab is different: how the model is interpreted. The confusion matrix we created to gauge the predictive strength of our classification model does not apply to regression models; different metrics are used instead. For this badge activity, you will specify and interpret a regression machine learning model.
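
To make the difference concrete, here is a minimal, self-contained sketch (toy data, not part of the assignment) of how two common regression metrics are computed with the yardstick package:

library(tidymodels)
# Hypothetical observed and predicted values:
toy <- tibble(truth = c(60, 75, 90), estimate = c(58, 80, 85))
mae(toy, truth = truth, estimate = estimate)   # mean absolute error: mean(|truth - estimate|)
rmse(toy, truth = truth, estimate = estimate)  # root mean squared error: sqrt(mean((truth - estimate)^2))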

The requirements are as follows:

Please use the code chunk below for your code:

#Setup
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.5     ✔ rsample      1.1.1
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
## ✔ recipes      1.0.6     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
set.seed(20231204)

#Data Loading
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Data Wrangling and Exploration
table(assessments$assessment_type)
## 
##   CMA  Exam   TMA 
## 70527  4959 98426
assessments_exam <- assessments %>% 
    mutate(Bfinal_exam = ifelse(assessment_type == "Exam", 1, 0)) %>% 
    mutate(Bfinal_exam = as.factor(Bfinal_exam))
table(assessments_exam$assessment_type, assessments_exam$Bfinal_exam)
##       
##            0     1
##   CMA  70527     0
##   Exam     0  4959
##   TMA  98426     0
is_unique_identifier <- !any(
    duplicated(assessments_exam[c("id_student", "id_assessment")]) |
        duplicated(assessments_exam[c("id_student", "id_assessment")], fromLast = TRUE)
)
print(is_unique_identifier) 
## [1] TRUE
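# An equivalent check with dplyr (an optional sketch): any rows returned
# here would be id_student/id_assessment combinations appearing more than once.
assessments_exam %>% 
    count(id_student, id_assessment) %>% 
    filter(n > 1)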
assessments_exam <- subset(assessments_exam, Bfinal_exam == 1)
count(assessments_exam)
## # A tibble: 1 × 1
##       n
##   <int>
## 1  4959
summary(assessments_exam$score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   51.00   67.00   65.57   82.00  100.00
table(assessments_exam$Bfinal_exam)
## 
##    0    1 
##    0 4959
assessments_exam <- rename(assessments_exam, final_exam_score = score)
is_unique_identifier <- !any(
    duplicated(assessments_exam[c("id_student", "code_module", "code_presentation")]) |
        duplicated(assessments_exam[c("id_student", "code_module", "code_presentation")], fromLast = TRUE)
)
print(is_unique_identifier) 
## [1] TRUE
assessments_exam <- select(assessments_exam, id_student, code_module, code_presentation, final_exam_score) 
assessments %>% 
    group_by(code_module, code_presentation) %>% 
    summarize(mean_date = mean(date, na.rm = TRUE),
              median_date = median(date, na.rm = TRUE),
              sd_date = sd(date, na.rm = TRUE),
              min_date = min(date, na.rm = TRUE),
              max_date = max(date, na.rm = TRUE),
              first_quantile = quantile(date, probs = .25, na.rm = TRUE)) 
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 8
## # Groups:   code_module [7]
##    code_module code_presentation mean_date median_date sd_date min_date max_date
##    <chr>       <chr>                 <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
##  1 AAA         2013J                 109.          117    71.3       19      215
##  2 AAA         2014J                 109.          117    71.5       19      215
##  3 BBB         2013B                 104.           89    55.5       19      187
##  4 BBB         2013J                 112.           96    61.6       19      208
##  5 BBB         2014B                  98.9          82    58.6       12      194
##  6 BBB         2014J                  99.1         110    65.2       19      201
##  7 CCC         2014B                  98.4         102    68.0       18      207
##  8 CCC         2014J                 104.          109    70.8       18      214
##  9 DDD         2013B                 104.           81    66.0       23      240
## 10 DDD         2013J                 118.           88    77.9       25      261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>
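# The first-quartile date computed below serves as a cutoff: only assessments
# dated before it are kept, so the mean weighted score reflects performance
# early in each module presentation (presumably to support early prediction).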
code_module_dates <- assessments %>% 
    group_by(code_module, code_presentation) %>% 
    summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
assessments_joined <- assessments %>% 
    left_join(code_module_dates)
## Joining with `by = join_by(code_module, code_presentation)`
assessments_filtered <- assessments_joined %>% 
    filter(date < quantile_cutoff_date)
assessments_summarized <- assessments_filtered %>% 
    mutate(weighted_score = score * weight) %>% 
    group_by(id_student) %>% 
    summarize(mean_weighted_score = mean(weighted_score)) 
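# Recode imd_band from text bands (e.g., "0-10%") to an ordered integer
# scale from 1 to 10.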
students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% 
    mutate(imd_band = as.integer(imd_band)) 
nrow(distinct(students))
## [1] 32593
n_distinct(students$id_student)
## [1] 28785
is_unique_identifier <- !any(
    duplicated(students[c("id_student", "code_module", "code_presentation")]) |
        duplicated(students[c("id_student", "code_module", "code_presentation")], fromLast = TRUE)
)
print(is_unique_identifier)
## [1] TRUE
nrow(distinct(assessments_summarized))
## [1] 21577
n_distinct(assessments_summarized$id_student)
## [1] 21577
nrow(distinct(assessments_exam))
## [1] 4959
n_distinct(assessments_exam$id_student)
## [1] 4633
students_and_assessments <- students %>% 
    left_join(assessments_summarized) %>% 
    left_join(assessments_exam)
## Joining with `by = join_by(id_student)`
## Joining with `by = join_by(code_module, code_presentation, id_student)`
table(students_and_assessments$final_exam_score)
## 
##   0   4   7   8   9  10  12  13  14  16  18  20  22  24  26  27  28  29  30  31 
##   3   1   1   1   1   2   2   4   4   9  11  16  23  32  20  27  18  32  20  43 
##  32  33  34  36  38  40  42  44  46  47  48  49  50  51  52  53  54  56  58  60 
##  25  50  32  86 104 110 132 127  45  89  40  98  30  96  48 113  56 178 160 188 
##  62  64  66  67  68  69  70  71  72  73  74  76  78  80  82  84  86  87  88  89 
## 151 178  57 113  46 137  44 108  59 132  54 157 191 135 183 124  50 112  52  45 
##  90  91  92  93  94  96  98 100 
##  56  95  48  36  85 174  98 162
summary(students_and_assessments[c("mean_weighted_score", "final_exam_score")])
##  mean_weighted_score final_exam_score
##  Min.   :   0.0      Min.   :  0.00  
##  1st Qu.: 160.0      1st Qu.: 51.00  
##  Median : 610.0      Median : 67.00  
##  Mean   : 544.7      Mean   : 65.57  
##  3rd Qu.: 875.0      3rd Qu.: 82.00  
##  Max.   :1512.0      Max.   :100.00  
##  NA's   :7958        NA's   :27634
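# Keep only rows with both a mean weighted score and a final exam score;
# students without a final exam record are dropped.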
students_and_assessments <- students_and_assessments %>% 
    drop_na(mean_weighted_score, final_exam_score)

#Data Splitting - Splitting the data into training and testing sets with a 50% split, stratified by final_exam_score.
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "final_exam_score")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
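# Note: because final_exam_score is numeric, initial_split() bins it into
# quantiles (four by default) to carry out the stratification.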

#Recipe Preparation - Creating a recipe for the prediction model, including preprocessing steps like creating dummy variables.
my_rec <- recipe(final_exam_score ~ disability +
                     date_registration + 
                     gender +
                     code_module +
                     mean_weighted_score, 
                 data = data_train) %>% 
    step_dummy(disability) %>% 
    step_dummy(gender) %>%  
    step_dummy(code_module)
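# An equivalent, more compact alternative (not required): the three dummy
# steps above could be written as a single step_dummy(all_nominal_predictors()).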

#Model Specification - Setting up a linear regression model (linear_reg()) with the lm engine and specifying the mode as regression.
my_mod <-
    linear_reg() %>% 
    set_engine("lm") %>% 
    set_mode("regression")

#Workflow -- Creating a workflow
my_wf <-
    workflow() %>% 
    add_model(my_mod) %>% 
    add_recipe(my_rec) 

#Model Fitting -- Combining the model and recipe in the workflow and fitting it to the training data.
fitted_model <- fit(my_wf, data = data_train)
reg_metrics <- metric_set(mae, rmse)

#Model Evaluation -- Using last_fit to evaluate the model on the test split, focusing on Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
final_fit <- last_fit(fitted_model, train_test_split, metrics = reg_metrics)
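# last_fit() refits the workflow on the training portion of the split and
# evaluates it once on the held-out test set using the supplied metrics.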

#Prediction Collection -- Collecting and Displaying Predictions 
predictions <- collect_predictions(final_fit)
predictions
## # A tibble: 2,461 × 5
##    id               .pred  .row final_exam_score .config             
##    <chr>            <dbl> <int>            <dbl> <chr>               
##  1 train/test split  66.7     2               76 Preprocessor1_Model1
##  2 train/test split  67.3     3               66 Preprocessor1_Model1
##  3 train/test split  67.8     5               98 Preprocessor1_Model1
##  4 train/test split  71.6     6              100 Preprocessor1_Model1
##  5 train/test split  67.6    11              100 Preprocessor1_Model1
##  6 train/test split  72.6    15               60 Preprocessor1_Model1
##  7 train/test split  64.7    19               24 Preprocessor1_Model1
##  8 train/test split  64.2    23               60 Preprocessor1_Model1
##  9 train/test split  77.5    25               94 Preprocessor1_Model1
## 10 train/test split  68.1    27               90 Preprocessor1_Model1
## # ℹ 2,451 more rows
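# An optional sketch: plotting predicted against observed scores can make
# the error metrics below easier to interpret.
predictions %>% 
    ggplot(aes(x = final_exam_score, y = .pred)) +
    geom_point(alpha = .3) +
    geom_abline(linetype = "dashed") +
    labs(x = "Observed final exam score", y = "Predicted final exam score")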
final_fit_metrics <- collect_metrics(final_fit) 
final_fit_metrics
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 mae     standard        16.4 Preprocessor1_Model1
## 2 rmse    standard        19.8 Preprocessor1_Model1
#Metric Calculations -- Calculating and displaying predictive accuracy metrics, including the derivation of Mean Squared Error (MSE) by squaring the RMSE.
mse_row <- final_fit_metrics %>%
  filter(.metric == "rmse") %>%
  mutate(.estimate = .estimate^2, .metric = "mse")

final_fit_metrics <- bind_rows(final_fit_metrics, mse_row)

print(final_fit_metrics)
## # A tibble: 3 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 mae     standard        16.4 Preprocessor1_Model1
## 2 rmse    standard        19.8 Preprocessor1_Model1
## 3 mse     standard       391.  Preprocessor1_Model1
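
Note that the MSE is simply the RMSE squared (19.8^2 ≈ 392; squaring the stored, unrounded RMSE gives the 391 shown above), so it is expressed in squared score points rather than raw points.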

Please add your interpretations here:

Part II: Reflect and Plan

  1. What is an example of an outcome related to your research interests that could be modeled using a classification machine learning model?
  2. What is an example of an outcome related to your research interests that could be modeled using a regression machine learning model?
  3. Look back to the study you identified for the first machine learning learning lab badge activity. Was the outcome one that is modeled using a classification or a regression machine learning model? Identify which mode(s) the authors of that paper used and briefly discuss the appropriateness of their decision.

Knit and Publish

Complete the following steps to knit and publish your work:

  1. First, change the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.

  2. Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.

  3. Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Receiving Your Machine Learning Badge

To receive credit for this assignment and earn your second ML Badge, share the link to your published webpage in the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.

Once your instructor has checked your link, you will be provided a physical version of the badge below!