As a reminder, to earn a badge for these learning labs, you will have to respond to a set of prompts in two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this learning lab is different: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression machine learning models; different metrics are used. For this badge activity, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.
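If it helps to ground those definitions, the following is a minimal sketch (using made-up numbers, not the course data) of how each metric is computed by hand:

actual <- c(70, 80, 90, 65)    # hypothetical observed scores
predicted <- c(72, 75, 88, 70) # hypothetical predictions from a model

errors <- actual - predicted
mean(abs(errors))    # MAE: the average absolute difference between observed and predicted values
mean(errors^2)       # MSE: the average squared difference
sqrt(mean(errors^2)) # RMSE: the square root of the MSE, back in the units of the outcome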
Please use the code chunk below for your code:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.1.1
## ✔ dials 1.2.0 ✔ tune 1.1.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.0 ✔ yardstick 1.2.0
## ✔ recipes 1.0.6
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students
## # A tibble: 32,593 × 15
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 11391 M East Angli… HE Qualification
## 2 AAA 2013J 28400 F Scotland HE Qualification
## 3 AAA 2013J 30268 F North West… A Level or Equiv…
## 4 AAA 2013J 31604 F South East… A Level or Equiv…
## 5 AAA 2013J 32885 F West Midla… Lower Than A Lev…
## 6 AAA 2013J 38053 M Wales A Level or Equiv…
## 7 AAA 2013J 45462 M Scotland HE Qualification
## 8 AAA 2013J 45642 F North West… A Level or Equiv…
## 9 AAA 2013J 52130 F East Angli… A Level or Equiv…
## 10 AAA 2013J 53025 M North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 9 more variables: imd_band <chr>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments
## # A tibble: 173,912 × 10
## id_assessment id_student date_submitted is_banked score code_module
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1752 11391 18 0 78 AAA
## 2 1752 28400 22 0 70 AAA
## 3 1752 31604 17 0 72 AAA
## 4 1752 32885 26 0 69 AAA
## 5 1752 38053 19 0 79 AAA
## 6 1752 45462 20 0 70 AAA
## 7 1752 45642 18 0 72 AAA
## 8 1752 52130 19 0 72 AAA
## 9 1752 53025 9 0 71 AAA
## 10 1752 57506 18 0 68 AAA
## # ℹ 173,902 more rows
## # ℹ 4 more variables: code_presentation <chr>, assessment_type <chr>,
## # date <dbl>, weight <dbl>
assessments %>%
count(assessment_type)
## # A tibble: 3 × 2
## assessment_type n
## <chr> <int>
## 1 CMA 70527
## 2 Exam 4959
## 3 TMA 98426
assessments %>%
distinct(id_assessment) # this many unique assessments
## # A tibble: 188 × 1
## id_assessment
## <dbl>
## 1 1752
## 2 1753
## 3 1754
## 4 1755
## 5 1756
## 6 1758
## 7 1759
## 8 1760
## 9 1761
## 10 1762
## # ℹ 178 more rows
assessments %>%
summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
median_date = median(date, na.rm = TRUE), # find the median
sd_date = sd(date, na.rm = TRUE), # find the sd
min_date = min(date, na.rm = TRUE), # find the min
max_date = max(date, na.rm = TRUE)) # find the max
## # A tibble: 1 × 5
## mean_date median_date sd_date min_date max_date
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 131. 129 78.0 12 261
assessments %>%
group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
summarize(mean_date = mean(date, na.rm = TRUE),
median_date = median(date, na.rm = TRUE),
sd_date = sd(date, na.rm = TRUE),
min_date = min(date, na.rm = TRUE),
max_date = max(date, na.rm = TRUE),
first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 8
## # Groups: code_module [7]
## code_module code_presentation mean_date median_date sd_date min_date max_date
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAA 2013J 109. 117 71.3 19 215
## 2 AAA 2014J 109. 117 71.5 19 215
## 3 BBB 2013B 104. 89 55.5 19 187
## 4 BBB 2013J 112. 96 61.6 19 208
## 5 BBB 2014B 98.9 82 58.6 12 194
## 6 BBB 2014J 99.1 110 65.2 19 201
## 7 CCC 2014B 98.4 102 68.0 18 207
## 8 CCC 2014J 104. 109 70.8 18 214
## 9 DDD 2013B 104. 81 66.0 23 240
## 10 DDD 2013J 118. 88 77.9 25 261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
code_module_dates
## # A tibble: 22 × 3
## # Groups: code_module [7]
## code_module code_presentation quantile_cutoff_date
## <chr> <chr> <dbl>
## 1 AAA 2013J 54
## 2 AAA 2014J 54
## 3 BBB 2013B 54
## 4 BBB 2013J 54
## 5 BBB 2014B 47
## 6 BBB 2014J 54
## 7 CCC 2014B 32
## 8 CCC 2014J 32
## 9 DDD 2013B 51
## 10 DDD 2013J 53
## # ℹ 12 more rows
assessments_joined <- assessments %>%
left_join(code_module_dates, by = join_by(code_module, code_presentation)) # join the data based on code_module and code_presentation
assessments_filtered <- assessments_joined %>%
filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score))
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code; note the capital "P" in "Pass" so it matches the values in final_result
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
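As a quick, optional sanity check (assuming the students object above), you could tabulate the new variable against final_result to confirm the dummy coding worked as intended:

students %>%
  count(final_result, pass) # each final_result value should map to the expected 0/1 code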
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
students_and_assessments <- students %>%
left_join(assessments_summarized,by = join_by(id_student))
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "pass")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
head(data_train)
## # A tibble: 6 × 17
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 38053 M Wales A Level or Equiv…
## 2 AAA 2013J 45462 M Scotland HE Qualification
## 3 AAA 2013J 53025 M North Region Post Graduate Qu…
## 4 AAA 2013J 62155 F North Weste… HE Qualification
## 5 AAA 2013J 63400 M Scotland Lower Than A Lev…
## 6 AAA 2013J 65002 F East Anglia… A Level or Equiv…
## # ℹ 11 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## # mean_weighted_score <dbl>
my_rec <- recipe(mean_weighted_score ~ disability +
date_registration +
gender +
code_module,
data = data_train) %>% # mean_weighted_score is the outcome, so it should not also appear as a predictor
step_dummy(disability) %>%
step_dummy(gender) %>%
step_dummy(code_module)
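If you want to confirm what the recipe will feed the model, one optional check (assuming the my_rec and data_train objects above) is to prep and bake it:

my_rec %>%
  prep() %>%                # estimate the preprocessing steps from the training data
  bake(new_data = NULL) %>% # apply the steps and return the processed training data
  glimpse()                 # the dummy-coded columns should now appear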
# specify model
my_mod <-
linear_reg() %>%
set_engine("lm") %>% # linear regression model
set_mode("regression") # multiple regression
my_mod
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
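Because the engine is lm, you can also inspect the estimated coefficients. This is an optional sketch (assuming the fitted_model object above), not a required badge step:

fitted_model %>%
  extract_fit_parsnip() %>% # pull out the underlying parsnip (lm) fit
  tidy()                    # return the coefficients as a tidy tibble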
install.packages("yardstick")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(yardstick)
reg_metrics <- metric_set(mae) # a metric set for regression (MAE here; RMSE and MSE are computed from the predictions below)
final_fit <- last_fit(fitted_model, train_test_split, metrics = reg_metrics)
collect_predictions(final_fit)
## # A tibble: 12,318 × 5
## id .pred .row mean_weighted_score .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split 717. 1 780 Preprocessor1_Model1
## 2 train/test split 701. 2 700 Preprocessor1_Model1
## 3 train/test split 701. 3 720 Preprocessor1_Model1
## 4 train/test split 723. 4 690 Preprocessor1_Model1
## 5 train/test split 697. 7 720 Preprocessor1_Model1
## 6 train/test split 698. 8 720 Preprocessor1_Model1
## 7 train/test split 707. 10 680 Preprocessor1_Model1
## 8 train/test split 700. 11 730 Preprocessor1_Model1
## 9 train/test split 699. 12 670 Preprocessor1_Model1
## 10 train/test split 709. 16 590 Preprocessor1_Model1
## # ℹ 12,308 more rows
collect_metrics(final_fit)
## # A tibble: 1 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 mae standard 194. Preprocessor1_Model1
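The metric set above only reports MAE. One possible sketch (using the predictions collected above) for obtaining the other two required metrics is to compute RMSE with yardstick and then derive MSE by averaging the squared errors (equivalently, squaring the RMSE):

preds <- collect_predictions(final_fit)

preds %>%
  rmse(truth = mean_weighted_score, estimate = .pred) # RMSE, in the same units as the outcome

preds %>%
  mutate(squared_error = (mean_weighted_score - .pred)^2) %>%
  summarize(mse = mean(squared_error)) # MSE: the average squared error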
Please add your interpretations here:
MAE: MAE (Mean Absolute Error) is the average absolute difference between the actual and predicted values. Here, the MAE of roughly 194 means that, on average, the model's predictions of students' mean weighted scores miss the actual values by about 194 points.
MSE: MSE (Mean Squared Error) is the average squared difference between the observed and predicted values; because the errors are squared, large misses are penalized more heavily, and the result is expressed in squared units of the outcome.
RMSE: RMSE (Root Mean Squared Error) is the square root of the MSE; it can be interpreted as the standard deviation of the prediction errors (residuals) and is expressed in the same units as the outcome.
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

To receive credit for this assignment and earn your second ML Badge, share the link to the published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.
Once your instructor has checked your link, you will be provided a physical version of the badge below!