Lab 2 Case Study: How Good Is Our Model at Making Predictions?
Jeevani P May 18, 2024
When we left off the first case study, we saw that our model was pretty accurate. We can, and will, do better in terms of accuracy. But the accuracy value we found also raises a broader question: how good was our model at making predictions?
While accuracy is an easy value to understand and interpret, it provides only a limited answer to that question.
In this learning lab and case study, we explore several ways to understand how good a predictive model ours is, including:
Other statistics, such as sensitivity and specificity
Tables–particularly, a confusion matrix
Our driving question for this case study, then, is: How good is our model at making predictions?
We’ll use the Open University Learning Analytics Dataset (OULAD) again–this time, adding another data source, one on students’ performance on assessments.
Like in the first learning lab, we’ll first load several packages –
{tidyverse}, {tidymodels}, and {janitor}. Make sure these are installed
(via install.packages()). Note that if they’re already
installed, you don’t need to install them again.
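If any of these are missing, a one-time call like the following installs all three (a sketch; run it once, not in every R session):
install.packages(c("tidyverse", "tidymodels", "janitor"))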
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.0
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
library(stats)
library(yardstick)
Like in the code-along for the overview presentation, let’s take a look at the data and do some processing of it.
We’ll load the students file together; you’ll write code to read the
assessments file, which is named “oulad-assessments.csv”. Please assign
the name assessments to the loaded assessments file.
assessments <- read_csv("D:/machine-learning-main/machine-learning-main/lab-2/data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In the last lab, we used the count() function. Let’s
practice that again by counting the number of assessments of
different types. If you need it, recall that the data dictionary is here. Note what
the different types of assessments mean.
count(assessments)
## # A tibble: 1 × 1
## n
## <int>
## 1 173912
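Note that count() with no variables simply tallies the total number of rows. To count assessments of each type, as the prompt asks, pass the variable to count() (a sketch; output omitted):
assessments %>%
  count(assessment_type) # one row per assessment type, with its frequency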
We’ll now use another function from the {tidyverse}, like count().
Specifically, we’ll use the distinct()
function, which returns the unique (or distinct) values of a specified
variable. Learn more about distinct() here.
Below, find the distinct assessment IDs.
distinct(assessments)
## # A tibble: 173,912 × 10
## id_assessment id_student date_submitted is_banked score code_module
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1752 11391 18 0 78 AAA
## 2 1752 28400 22 0 70 AAA
## 3 1752 31604 17 0 72 AAA
## 4 1752 32885 26 0 69 AAA
## 5 1752 38053 19 0 79 AAA
## 6 1752 45462 20 0 70 AAA
## 7 1752 45642 18 0 72 AAA
## 8 1752 52130 19 0 72 AAA
## 9 1752 53025 9 0 71 AAA
## 10 1752 57506 18 0 68 AAA
## # ℹ 173,902 more rows
## # ℹ 4 more variables: code_presentation <chr>, assessment_type <chr>,
## # date <dbl>, weight <dbl>
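Note that distinct() with no variables named returns the distinct rows of the entire data frame. To list only the unique assessment IDs, as the prompt asks, you can name the variable (a sketch; output omitted):
assessments %>%
  distinct(id_assessment) # one row per unique assessment ID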
Let’s explore the assessments data a bit.
We might be interested in how many assessments there are per course.
To calculate that, we need to count() several variables at
once; when doing this, count() tabulates the number of
unique combinations of the variables passed to it.
assessments %>%
count(assessment_type, code_module, code_presentation)
## # A tibble: 41 × 4
## assessment_type code_module code_presentation n
## <chr> <chr> <chr> <int>
## 1 CMA BBB 2013B 5049
## 2 CMA BBB 2013J 6416
## 3 CMA BBB 2014B 4493
## 4 CMA CCC 2014B 3920
## 5 CMA CCC 2014J 5846
## 6 CMA DDD 2013B 5252
## 7 CMA FFF 2013B 6681
## 8 CMA FFF 2013J 8847
## 9 CMA FFF 2014B 5549
## 10 CMA FFF 2014J 8915
## # ℹ 31 more rows
Let’s explore the dates assignments were due, using the
summarize() function:
assessments %>%
summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
median_date = median(date, na.rm = TRUE), # find the median
sd_date = sd(date, na.rm = TRUE), # find the sd
min_date = min(date, na.rm = TRUE), # find the min
max_date = max(date, na.rm = TRUE)) # find the max
## # A tibble: 1 × 5
## mean_date median_date sd_date min_date max_date
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 131. 129 78.0 12 261
What can we take from this? It looks like assignments were due, on average (by both the mean and the median), around 130 days after the start of the course. The earliest assignment seems to have been due 12 days into the course, and the latest 261 days in.
Crucially, though, these dates vary by course. Thus, we need to first
group the data by course. Let’s also use a different function this time,
quantile(), and calculate the value of the first quartile (the 25th percentile). Our
reasoning is that we want to be able to act to support students, and if
we wait until after the average assignment is due, then that might be
too late. The first quartile of due dates, by contrast, falls roughly one-quarter
of the way through the semester, leaving more time to intervene.
assessments %>%
group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
summarize(mean_date = mean(date, na.rm = TRUE),
median_date = median(date, na.rm = TRUE),
sd_date = sd(date, na.rm = TRUE),
min_date = min(date, na.rm = TRUE),
max_date = max(date, na.rm = TRUE),
first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 8
## # Groups: code_module [7]
## code_module code_presentation mean_date median_date sd_date min_date max_date
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAA 2013J 109. 117 71.3 19 215
## 2 AAA 2014J 109. 117 71.5 19 215
## 3 BBB 2013B 104. 89 55.5 19 187
## 4 BBB 2013J 112. 96 61.6 19 208
## 5 BBB 2014B 98.9 82 58.6 12 194
## 6 BBB 2014J 99.1 110 65.2 19 201
## 7 CCC 2014B 98.4 102 68.0 18 207
## 8 CCC 2014J 104. 109 70.8 18 214
## 9 DDD 2013B 104. 81 66.0 23 240
## 10 DDD 2013J 118. 88 77.9 25 261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>
Alright, this is a bit complicated, but we can actually work with
this data. Let’s use just a portion of the above code, assigning it the
name code_module_dates.
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
Let’s take a look at what we just created; type
code_module_dates below:
code_module_dates
## # A tibble: 22 × 3
## # Groups: code_module [7]
## code_module code_presentation quantile_cutoff_date
## <chr> <chr> <dbl>
## 1 AAA 2013J 54
## 2 AAA 2014J 54
## 3 BBB 2013B 54
## 4 BBB 2013J 54
## 5 BBB 2014B 47
## 6 BBB 2014J 54
## 7 CCC 2014B 32
## 8 CCC 2014J 32
## 9 DDD 2013B 51
## 10 DDD 2013J 53
## # ℹ 12 more rows
What have we created? We found the date that is one-quarter of the way through the semester (in terms of the dates assignments are due).
We can thus use this to group and calculate students’ performance on
assignments up to this point. To do this, we need to use a join
function, left_join() in particular. Please use
left_join() to join together assessments and
code_module_dates, with assessments being the
“left” data frame and code_module_dates the right. Assign
the output of the join the name assessments_joined.
assessments_joined <- left_join(assessments, code_module_dates)
## Joining with `by = join_by(code_module, code_presentation)`
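The message above tells us that left_join() guessed the keys from the columns the two data frames share. It can be good practice to state them explicitly; an equivalent call (a sketch) would be:
left_join(assessments, code_module_dates,
          by = join_by(code_module, code_presentation)) # same result, with the join keys stated explicitly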
We’re almost there! The next few lines filter the assessments data so it only includes assessments before our cutoff date.
assessments_filtered <- assessments_joined %>%
dplyr::filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
Finally, we’ll find the average weighted score for each student.
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score))
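One caution: mean(weighted_score) averages the score-times-weight products, so it reflects the weighted points earned per assignment rather than a conventional weighted average of scores. If you wanted the latter, a sketch (using the illustrative name assessments_summarized_alt) would be:
assessments_summarized_alt <- assessments_filtered %>%
  group_by(id_student) %>%
  summarize(weighted_mean_score = weighted.mean(score, weight, na.rm = TRUE)) # sum(score * weight) / sum(weight)
We will keep using mean_weighted_score, as calculated above, for consistency with the rest of the lab.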
As a point of reflection here, note how much work we’ve done before we’ve gotten to the machine learning steps. Though a challenge, this is typical for both machine learning and more traditional statistical models: a lot of the work is in the preparation and data wrangling stages.
Let’s copy below the code we used in the first lab to process the students data
(creating the pass and imd_band
variables).
students <- read_csv("D:/machine-learning-main/machine-learning-main/lab-2/data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
Finally, let’s join together students and
assessments_summarized, assigning the joined data the name
students_and_assessments.
students_and_assessments <- left_join(students, assessments_summarized)
## Joining with `by = join_by(id_student)`
students_and_assessments
## # A tibble: 32,593 × 17
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 11391 M East Angli… HE Qualification
## 2 AAA 2013J 28400 F Scotland HE Qualification
## 3 AAA 2013J 30268 F North West… A Level or Equiv…
## 4 AAA 2013J 31604 F South East… A Level or Equiv…
## 5 AAA 2013J 32885 F West Midla… Lower Than A Lev…
## 6 AAA 2013J 38053 M Wales A Level or Equiv…
## 7 AAA 2013J 45462 M Scotland HE Qualification
## 8 AAA 2013J 45642 F North West… A Level or Equiv…
## 9 AAA 2013J 52130 F East Angli… A Level or Equiv…
## 10 AAA 2013J 53025 M North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 11 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## # mean_weighted_score <dbl>
Big picture, we’ve added another measure – another feature – that we can use to make predictions: students’ performance on assessments 1/4 of the way through the course.
We’re now ready to proceed to our machine learning steps!
The problem we will be working on - predicting students who pass, based on data from roughly the first quarter of the semester - has an analog in a recent paper by Ryan Baker and colleagues:
In Baker et al. (2020), the authors create a precision-recall graph (recall is another name for sensitivity) - one that demonstrates the trade-off between optimizing these two statistics. Review their paper - especially the results section - to see how they discuss these two statistics.
Baker, R. S., Berning, A. W., Gowda, S. M., Zhang, S., & Hawn, A. (2020). Predicting K-12 dropout. Journal of Education for Students Placed at Risk, 25(1), 28-54.
Please review this paper before proceeding, focusing on how they describe and weigh precision and recall when judging their model.
This step - splitting the data into training and testing sets - is identical to what we did in the first learning lab, here using the
students_and_assessments data. We’ll also keep the testing
data set to use later.
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "pass")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
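Because we stratified the split on pass, the proportion of passing students should be nearly identical in the two sets. A quick, optional check (a sketch; output omitted):
data_train %>% count(pass) %>% mutate(prop = n / sum(n)) # class balance in the training set
data_test %>% count(pass) %>% mutate(prop = n / sum(n)) # class balance in the testing set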
We’ll also specify the recipe, adding our
mean_weighted_score variable we calculated above as well as
variables we used in LL1 (case study and badge). Note how we dummy code
three variables using step_dummy() (described further in the
first learning lab).
my_rec <- recipe(pass ~ disability +
date_registration +
gender +
code_module +
mean_weighted_score,
data = data_train) %>%
step_dummy(disability) %>%
step_dummy(gender) %>%
step_dummy(code_module)
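If you would like to see exactly what the recipe produces, including the dummy-coded columns created by step_dummy(), one optional sketch (output omitted) is to prep and bake it:
my_rec %>%
  prep() %>%                # estimate the recipe steps using the training data
  bake(new_data = NULL) %>% # return the processed training data
  glimpse()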
These steps are also the same as in LL1.
# specify model
my_mod <-
logistic_reg() %>%
set_engine("glm") %>% # generalized linear model
set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
Then, we’ll fit our model.
fitted_model <- fit(my_wf, data = data_train)
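Optionally, you can inspect the fitted logistic regression coefficients by pulling the parsnip model out of the fitted workflow, sketched below (output omitted):
fitted_model %>%
  extract_fit_parsnip() %>% # extract the underlying parsnip (glm) fit
  tidy()                    # coefficient estimates, standard errors, and p-values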
Next, we’ll use the last_fit() function, adding
something a little different: a metric set. In addition to the
metrics from the first lab, we’ll include another common metric, Cohen’s Kappa,
which is similar to accuracy but accounts for chance agreement.
To do so, we have to specify which metrics we want
using the metric_set() function (see all of the available
options here).
Please specify six metrics in the metric set: accuracy, sensitivity,
specificity, ppv (precision, or positive predictive value), npv (negative
predictive value), and Cohen’s Kappa (kap). Assign the name
class_metrics to the output of your use of the
metric_set() function.
## Define the metric set.
class_metrics <- metric_set(
  accuracy,    # overall proportion of correct predictions
  sensitivity, # true positive rate
  specificity, # true negative rate
  ppv,         # precision (positive predictive value)
  npv,         # negative predictive value
  kap          # Cohen's Kappa
)
Then, please use last_fit(), looking to how we did this in
the last learning lab for a guide on how to specify the arguments. To
those arguments, add metrics = class_metrics.
final_fit <- last_fit(my_wf, train_test_split, metrics = class_metrics)
We’re now ready to move on to interpreting the accuracy of our model.
Let’s start with a simple confusion matrix. The confusion matrix is a 2 x 2 table with values (cells in the table) representing one of four conditions, elaborated below. You’ll fill in the last two columns in a few moments.
| Statistic | How to Find | Interpretation | Value | % |
|---|---|---|---|---|
| True positive (TP) | Truth: 1, Prediction: 1 | Proportion of the population that is affected by a condition and correctly tested positive | | |
| True negative (TN) | Truth: 0, Prediction: 0 | Proportion of the population that is not affected by a condition and correctly tested negative | | |
| False positive (FP) | Truth: 0, Prediction: 1 | Proportion of the population that is not affected by a condition and incorrectly tested positive | | |
| False negative (FN) | Truth: 1, Prediction: 0 | Proportion of the population that is affected by a condition and incorrectly tested negative | | |
To create a confusion matrix, run collect_predictions(),
which does what its name suggests - it gathers together the model’s test set
predictions. Pass the final_fit object (the output of last_fit()) to this function
below.
collect_predictions(final_fit)
## # A tibble: 12,318 × 5
## .pred_class id .row pass .config
## <fct> <chr> <int> <fct> <chr>
## 1 1 train/test split 1 1 Preprocessor1_Model1
## 2 1 train/test split 2 1 Preprocessor1_Model1
## 3 1 train/test split 3 1 Preprocessor1_Model1
## 4 1 train/test split 5 1 Preprocessor1_Model1
## 5 1 train/test split 6 1 Preprocessor1_Model1
## 6 1 train/test split 8 1 Preprocessor1_Model1
## 7 1 train/test split 10 1 Preprocessor1_Model1
## 8 1 train/test split 11 1 Preprocessor1_Model1
## 9 1 train/test split 12 1 Preprocessor1_Model1
## 10 1 train/test split 17 1 Preprocessor1_Model1
## # ℹ 12,308 more rows
Take a look at the columns. You’ll need to provide the predictions
(created with collect_predictions()) and then pipe that to
conf_mat(), to which you provide the names of a) the known
values for the test set (pass) and b) the predictions (.pred_class), in that
order, since conf_mat() expects the truth column first. Some code to get
you started is below.
collect_predictions(final_fit) %>%
  conf_mat(pass, .pred_class)
## Truth
## Prediction 0 1
## 0 4777 3525
## 1 1824 2187
You should see a confusion matrix output. If so, nice work! Please fill in the Value and Percentage columns in the table above (with TP, TN, FP, and FN), entering the appropriate values and then converting those into a percentage based on the total number of data points in the test data set. The accuracy can be interpreted as the sum of the true positive and true negative percentages. So, what’s the accuracy? Add that below as a percentage.
True positives (TP): 2,187 (17.76% of the 12,313 predictions in the confusion matrix)
True negatives (TN): 4,777 (38.80%)
False positives (FP): 1,824 (14.81%)
False negatives (FN): 3,525 (28.63%)
Accuracy: 17.76% + 38.80% = 56.56%, which matches (to rounding) the 0.566 that collect_metrics() reports below.
Here’s where things get interesting: there are other statistics that have different denominators. Recall from the overview presentation that we can slice and dice the confusion matrix to calculate statistics that give us insights into the predictive utility of the model. Based on the values for TP, TN, FP, and FN you filled in a few moments ago, add the statistic values for sensitivity, specificity, precision, and negative predictive value below, to three decimal places (a quick arithmetic check in code follows the table).
| Statistic | Equation | Interpretation | Question Answered | Statistic Values |
|---|---|---|---|---|
| Sensitivity (AKA recall) | TP / (TP + FN) | Proportion of those who are affected by a condition and correctly tested positive | Out of all the actual positives, how many did we correctly predict? | 0.383 |
| Specificity | TN / (FP + TN) | Proportion of those who are not affected by a condition and correctly tested negative | Out of all the actual negatives, how many did we correctly predict? | 0.724 |
| Precision (AKA Positive Predictive Value) | TP / (TP + FP) | Proportion of those who tested positive who are affected by a condition | Out of all the instances we predicted as positive, how many are actually positive? | 0.545 |
| Negative Predictive Value | TN / (TN + FN) | Proportion of those who tested negative who are not affected by a condition; the probability that a negative test is correct | Out of all the instances we predicted as negative, how many are actually negative? | 0.575 |
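As a quick arithmetic check, here is a small sketch that computes these four statistics directly from the confusion matrix cells above, treating passing ("1") as the positive class:
tp <- 2187 # truth 1, predicted 1
tn <- 4777 # truth 0, predicted 0
fp <- 1824 # truth 0, predicted 1
fn <- 3525 # truth 1, predicted 0
round(c(sensitivity = tp / (tp + fn),
        specificity = tn / (fp + tn),
        precision   = tp / (tp + fp),
        npv         = tn / (tn + fn)), 3) # 0.383, 0.724, 0.545, 0.575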
So, what does this hard-won output tell us? Let’s interpret each statistic carefully in the table below. Please add the value and provide a substantive interpretation. One is provided to get you started.
| Statistic | Substantive Interpretation |
|---|---|
| Accuracy | 56.56%. The model classifies about 57% of the test-set students correctly. For reference, predicting that no student passes would be about 54% accurate, so the model does only a little better than that baseline. |
| Sensitivity (AKA recall) | 0.383. The model correctly identifies only about 38% of the students who actually pass; it misses most passing students. |
| Specificity | 0.724. About 72% of the students who do not pass are correctly predicted as not passing. This is reasonably good, but it could be better. |
| Precision (AKA Positive Predictive Value) | 0.545. When the model predicts that a student will pass, it is correct about 55% of the time. |
| Negative Predictive Value | 0.575. When the model predicts that a student will not pass, it is correct about 58% of the time. |
This process might suggest to you how a “good” model isn’t necessarily one that is the most accurate, but instead is one that has good values for statistics that matter given our particular question and context.
Recall that Baker and colleagues sought to balance between precision and recall (another name for sensitivity). Please briefly discuss how well our model does this; is it better suited to correctly identifying “positive” pass cases (sensitivity) or “negatively” identifying students who do not pass (specificity)?
We can see that the model is considerably better at “negatively” identifying students who do not pass (specificity) than at correctly identifying “positive” pass cases (sensitivity). Sensitivity indicates the proportion of actual positive cases (students who pass) that the model correctly identifies; at 0.383, the model catches only about 38% of the students who actually pass the course.
Specificity, on the other hand, indicates the proportion of actual negative cases (students who do not pass) that the model correctly identifies. With a specificity of 0.724, the model correctly flags about 72% of the students who do not pass the course.
Overall, that asymmetry may be tolerable, or even useful, for our purpose: if the goal is to identify students at risk of not passing early enough to intervene, missing relatively few of them matters most. Still, the negative predictive value tells us that only about 58% of the “will not pass” predictions are correct, so an intervention based on these predictions would also reach many students who would have passed anyway.
After all of this work, we can calculate the above much more
easily, given how we set up our metrics with metric_set()
earlier; this is especially handy when we want to efficiently communicate the results of
our analysis to our audience. Below, use collect_metrics()
with the final_fit object. In addition to accuracy and
Cohen’s Kappa, it reports the four metrics we calculated manually
just a few moments ago (see the note below the output about which class
{yardstick} treats as the positive one).
collect_metrics(final_fit)
## # A tibble: 6 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.566 Preprocessor1_Model1
## 2 sensitivity binary 0.724 Preprocessor1_Model1
## 3 specificity binary 0.383 Preprocessor1_Model1
## 4 ppv binary 0.575 Preprocessor1_Model1
## 5 npv binary 0.545 Preprocessor1_Model1
## 6 kap binary 0.109 Preprocessor1_Model1
Having invested in understanding the terminology of machine learning
metrics, we’ll use this shortcut (collect_metrics())
going forward. One caution when reading its output: by default, {yardstick}
treats the first level of the outcome factor as the event of interest, which
here is “0” (did not pass). The sensitivity, specificity, ppv, and npv above
are therefore computed with not passing as the positive class, which is why
they appear swapped relative to the values we calculated by hand with passing
(“1”) as the positive class.
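If you want these metrics computed with passing ("1") treated as the event instead, one option, sketched here with output omitted, is to apply the individual {yardstick} functions to the collected predictions and set event_level = "second":
collect_predictions(final_fit) %>%
  sensitivity(truth = pass, estimate = .pred_class,
              event_level = "second") # treat the second factor level, "1" (pass), as the event
# specificity(), ppv(), and npv() accept the same event_level argument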
Congratulations - you’ve completed this case study! Go over to Lab 2 badge activity.