Lab 2 Case Study: How Good Is Our Model at Making Predictions?
Jeevani P May 18, 2024
When we left off the first case study, we saw that our model was pretty accurate. We can, and will, do better in terms of accuracy. But the accuracy value we found also raises a broader question: how good was our model at making predictions?
While accuracy is an easy value to understand and interpret, it provides only a limited answer to that question.
In this learning lab and case study, we explore several ways to understand how good a predictive model ours is, including:
Other statistics, such as sensitivity and specificity
Tables–particularly, a confusion matrix
Our driving question for this case study, then, is: How good is our model at making predictions?
We’ll use the Open University Learning Analytics Dataset (OULAD) again–this time, adding another data source, one on students’ performance on assessments.
Like in the first learning lab, we’ll first load several packages –
{tidyverse}, {tidymodels}, and {janitor}. Make sure these are installed
(via install.packages()). Note that if they’re already
installed, you don’t need to install them again.
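If any of these are missing, a one-time call like the following installs all three (a sketch; run it once, not in every R session):
install.packages(c("tidyverse", "tidymodels", "janitor"))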
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.0
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
library(stats)
library(yardstick)
Like in the code-along for the overview presentation, let’s take a look at the data and do some processing of it.
We’ll load the students file together; you’ll write code to read the
assessments file, which is named “oulad-assessments.csv”. Please assign
the name assessments to the loaded assessments file.
assessments <- read_csv("D:/machine-learning-main/machine-learning-main/lab-2/data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In the last lab, we used the count() function. Let’s
practice that again by counting the number of assessments of
different types. If you need it, recall that the data dictionary is here. Note what
the different types of assessments mean.
count(assessments)
## # A tibble: 1 × 1
## n
## <int>
## 1 173912
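Note that count() with no variables simply tallies the total number of rows. To count assessments of each type, as the prompt asks, pass the variable to count() (a sketch; output omitted):
assessments %>%
  count(assessment_type) # one row per assessment type, with its frequency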
We’ll now use another function from the {tidyverse}, like count().
Specifically, we’ll use the distinct()
function, which returns the unique (or distinct) values of a specified
variable. Learn more about distinct() here.
Below, find the distinct assessment IDs.
distinct(assessments)
## # A tibble: 173,912 × 10
## id_assessment id_student date_submitted is_banked score code_module
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1752 11391 18 0 78 AAA
## 2 1752 28400 22 0 70 AAA
## 3 1752 31604 17 0 72 AAA
## 4 1752 32885 26 0 69 AAA
## 5 1752 38053 19 0 79 AAA
## 6 1752 45462 20 0 70 AAA
## 7 1752 45642 18 0 72 AAA
## 8 1752 52130 19 0 72 AAA
## 9 1752 53025 9 0 71 AAA
## 10 1752 57506 18 0 68 AAA
## # ℹ 173,902 more rows
## # ℹ 4 more variables: code_presentation <chr>, assessment_type <chr>,
## # date <dbl>, weight <dbl>
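Note that distinct() with no variables named returns the distinct rows of the entire data frame. To list only the unique assessment IDs, as the prompt asks, you can name the variable (a sketch; output omitted):
assessments %>%
  distinct(id_assessment) # one row per unique assessment ID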
Let’s explore the assessments data a bit.
We might be interested in how many assessments there are per course.
To calculate that, we need to count() several variables at
once; when doing this, count() tabulates the number of
unique combinations of the variables passed to it.
assessments %>%
count(assessment_type, code_module, code_presentation)
## # A tibble: 41 × 4
## assessment_type code_module code_presentation n
## <chr> <chr> <chr> <int>
## 1 CMA BBB 2013B 5049
## 2 CMA BBB 2013J 6416
## 3 CMA BBB 2014B 4493
## 4 CMA CCC 2014B 3920
## 5 CMA CCC 2014J 5846
## 6 CMA DDD 2013B 5252
## 7 CMA FFF 2013B 6681
## 8 CMA FFF 2013J 8847
## 9 CMA FFF 2014B 5549
## 10 CMA FFF 2014J 8915
## # ℹ 31 more rows
Let’s explore the dates assignments were due, using the
summarize() function:
assessments %>%
summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
median_date = median(date, na.rm = TRUE), # find the median
sd_date = sd(date, na.rm = TRUE), # find the sd
min_date = min(date, na.rm = TRUE), # find the min
max_date = max(date, na.rm = TRUE)) # find the max
## # A tibble: 1 × 5
## mean_date median_date sd_date min_date max_date
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 131. 129 78.0 12 261
What can we take from this? It looks like assignments were due, on average (by both the mean and the median), around 130 days after the start of the course. The earliest assignment seems to have been due 12 days into the course, and the latest 261 days in.
Crucially, though, these dates vary by course. Thus, we need to first
group the data by course. Let’s also use a different function this time,
quantile(), and calculate the value of the first quartile (the 25th percentile). Our
reasoning is that we want to be able to act to support students, and if
we wait until after the average assignment is due, then that might be
too late. The first quartile of due dates, by contrast, falls roughly one-quarter
of the way through the semester, leaving more time to intervene.
assessments %>%
group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
summarize(mean_date = mean(date, na.rm = TRUE),
median_date = median(date, na.rm = TRUE),
sd_date = sd(date, na.rm = TRUE),
min_date = min(date, na.rm = TRUE),
max_date = max(date, na.rm = TRUE),
first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 8
## # Groups: code_module [7]
## code_module code_presentation mean_date median_date sd_date min_date max_date
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAA 2013J 109. 117 71.3 19 215
## 2 AAA 2014J 109. 117 71.5 19 215
## 3 BBB 2013B 104. 89 55.5 19 187
## 4 BBB 2013J 112. 96 61.6 19 208
## 5 BBB 2014B 98.9 82 58.6 12 194
## 6 BBB 2014J 99.1 110 65.2 19 201
## 7 CCC 2014B 98.4 102 68.0 18 207
## 8 CCC 2014J 104. 109 70.8 18 214
## 9 DDD 2013B 104. 81 66.0 23 240
## 10 DDD 2013J 118. 88 77.9 25 261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>
Alright, this is a bit complicated, but we can actually work with
this data. Let’s use just a portion of the above code, assigning it the
name code_module_dates.
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
Let’s take a look at what we just created; type
code_module_dates below:
code_module_dates
## # A tibble: 22 × 3
## # Groups: code_module [7]
## code_module code_presentation quantile_cutoff_date
## <chr> <chr> <dbl>
## 1 AAA 2013J 54
## 2 AAA 2014J 54
## 3 BBB 2013B 54
## 4 BBB 2013J 54
## 5 BBB 2014B 47
## 6 BBB 2014J 54
## 7 CCC 2014B 32
## 8 CCC 2014J 32
## 9 DDD 2013B 51
## 10 DDD 2013J 53
## # ℹ 12 more rows
What have we created? We found the date that is one-quarter of the way through the semester (in terms of the dates assignments are due).
We can thus use this to group and calculate students’ performance on
assignments up to this point. To do this, we need to use a join
function, left_join() in particular. Please use
left_join() to join together assessments and
code_module_dates, with assessments being the
“left” data frame and code_module_dates the right. Assign
the output of the join the name assessments_joined.
assessments_joined <- left_join(assessments, code_module_dates)
## Joining with `by = join_by(code_module, code_presentation)`
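The message above tells us that left_join() guessed the keys from the columns the two data frames share. It can be good practice to state them explicitly; an equivalent call (a sketch) would be:
left_join(assessments, code_module_dates,
          by = join_by(code_module, code_presentation)) # same result, with the join keys stated explicitly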
We’re almost there! The next few lines filter the assessments data so it only includes assessments before our cutoff date.
assessments_filtered <- assessments_joined %>%
dplyr::filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
Finally, we’ll find the average weighted score for each student.
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score))
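One caution: mean(weighted_score) averages the score-times-weight products, so it reflects the weighted points earned per assignment rather than a conventional weighted average of scores. If you wanted the latter, a sketch (using the illustrative name assessments_summarized_alt) would be:
assessments_summarized_alt <- assessments_filtered %>%
  group_by(id_student) %>%
  summarize(weighted_mean_score = weighted.mean(score, weight, na.rm = TRUE)) # sum(score * weight) / sum(weight)
We will keep using mean_weighted_score, as calculated above, for consistency with the rest of the lab.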
As a point of reflection here, note how much work we’ve done before we’ve gotten to the machine learning steps. Though a challenge, this is typical for both machine learning and more traditional statistical models: a lot of the work is in the preparation and data wrangling stages.
Let’s copy below the code we used in the first lab to process the students data
(creating the pass and imd_band
variables).
students <- read_csv("D:/machine-learning-main/machine-learning-main/lab-2/data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
Finally, let’s join together students and
assessments_summarized, assigning the joined data the name
students_and_assessments.
students_and_assessments <- left_join(students, assessments_summarized)
## Joining with `by = join_by(id_student)`
students_and_assessments
## # A tibble: 32,593 × 17
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 11391 M East Angli… HE Qualification
## 2 AAA 2013J 28400 F Scotland HE Qualification
## 3 AAA 2013J 30268 F North West… A Level or Equiv…
## 4 AAA 2013J 31604 F South East… A Level or Equiv…
## 5 AAA 2013J 32885 F West Midla… Lower Than A Lev…
## 6 AAA 2013J 38053 M Wales A Level or Equiv…
## 7 AAA 2013J 45462 M Scotland HE Qualification
## 8 AAA 2013J 45642 F North West… A Level or Equiv…
## 9 AAA 2013J 52130 F East Angli… A Level or Equiv…
## 10 AAA 2013J 53025 M North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 11 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## # mean_weighted_score <dbl>
Big picture, we’ve added another measure – another feature – that we can use to make predictions: students’ performance on assessments 1/4 of the way through the course.
We’re now ready to proceed to our machine learning steps!
The problem we will be working on - predicting students who pass, based on data from roughly the first quarter of the semester - has an analog in a recent paper by Ryan Baker and colleagues:
In Baker et al. (2020), the authors create a precision-recall graph (recall is another name for sensitivity) - one that demonstrates the trade-off between optimizing these two statistics. Review their paper - especially the results section - to see how they discuss these two statistics.
Baker, R. S., Berning, A. W., Gowda, S. M., Zhang, S., & Hawn, A. (2020). Predicting K-12 dropout. Journal of Education for Students Placed at Risk, 25(1), 28-54.
Please review this paper before proceeding, focusing on how they describe and weigh precision and recall when judging their model.
This step - splitting the data into training and testing sets - is identical to what we did in the first learning lab, here using the
students_and_assessments data. We’ll also keep the testing
data set to use later.
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "pass")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
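Because we stratified the split on pass, the proportion of passing students should be nearly identical in the two sets. A quick, optional check (a sketch; output omitted):
data_train %>% count(pass) %>% mutate(prop = n / sum(n)) # class balance in the training set
data_test %>% count(pass) %>% mutate(prop = n / sum(n)) # class balance in the testing set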
We’ll also specify the recipe, adding our
mean_weighted_score variable we calculated above as well as
variables we used in LL1 (case study and badge). Note how we dummy code
three variables using step_dummy() (described further in the
first learning lab).
my_rec <- recipe(pass ~ disability +
date_registration +
gender +
code_module +
mean_weighted_score,
data = data_train) %>%
step_dummy(disability) %>%
step_dummy(gender) %>%
step_dummy(code_module)
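If you would like to see exactly what the recipe produces, including the dummy-coded columns created by step_dummy(), one optional sketch (output omitted) is to prep and bake it:
my_rec %>%
  prep() %>%                # estimate the recipe steps using the training data
  bake(new_data = NULL) %>% # return the processed training data
  glimpse()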
These steps are also the same as in LL1.
# specify model
my_mod <-
logistic_reg() %>%
set_engine("glm") %>% # generalized linear model
set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
Then, we’ll fit our model.
fitted_model <- fit(my_wf, data = data_train)
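Optionally, you can inspect the fitted logistic regression coefficients by pulling the parsnip model out of the fitted workflow, sketched below (output omitted):
fitted_model %>%
  extract_fit_parsnip() %>% # extract the underlying parsnip (glm) fit
  tidy()                    # coefficient estimates, standard errors, and p-values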
Next, we’ll use the last_fit() function, adding
something a little different: a metric set. In addition to the
metrics from the first lab, we’ll include another common metric, Cohen’s Kappa,
which is similar to accuracy but accounts for chance agreement.
To do so, we have to specify which metrics we want
using the metric_set() function (see all of the available
options here).
Please specify six metrics in the metric set: accuracy, sensitivity,
specificity, ppv (precision, or positive predictive value), npv (negative
predictive value), and Cohen’s Kappa (kap). Assign the name
class_metrics to the output of your use of the
metric_set() function.
## Define the metric set.
class_metrics <- metric_set(
  accuracy,    # overall proportion of correct predictions
  sensitivity, # true positive rate
  specificity, # true negative rate
  ppv,         # precision (positive predictive value)
  npv,         # negative predictive value
  kap          # Cohen's Kappa
)
Then, please use last_fit(), looking to how we did this in
the last learning lab for a guide on how to specify the arguments. To
those arguments, add metrics = class_metrics.
final_fit <- last_fit(my_wf, train_test_split, metrics = class_metrics)
We’re now ready to move on to interpreting the accuracy of our model.
Let’s start with a simple confusion matrix. The confusion matrix is a 2 x 2 table with values (cells in the table) representing one of four conditions, elaborated below. You’ll fill in the last two columns in a few moments.
| Statistic | How to Find | Interpretation | Value | % |
|---|---|---|---|---|
| True positive (TP) | Truth: 1, Prediction: 1 | Proportion of the population that is affected by a condition and correctly tested positive | | |
| True negative (TN) | Truth: 0, Prediction: 0 | Proportion of the population that is not affected by a condition and correctly tested negative | | |
| False positive (FP) | Truth: 0, Prediction: 1 | Proportion of the population that is not affected by a condition and incorrectly tested positive | | |
| False negative (FN) | Truth: 1, Prediction: 0 | Proportion of the population that is affected by a condition and incorrectly tested negative | | |
To create a confusion matrix, run collect_predictions(),
which does what its name suggests - it gathers together the model’s test set
predictions. Pass the final_fit object (the output of last_fit()) to this function
below.
collect_predictions(final_fit)
## # A tibble: 12,318 × 5
## .pred_class id .row pass .config
## <fct> <chr> <int> <fct> <chr>
## 1 1 train/test split 1 1 Preprocessor1_Model1
## 2 1 train/test split 2 1 Preprocessor1_Model1
## 3 1 train/test split 3 1 Preprocessor1_Model1
## 4 1 train/test split 5 1 Preprocessor1_Model1
## 5 1 train/test split 6 1 Preprocessor1_Model1
## 6 1 train/test split 8 1 Preprocessor1_Model1
## 7 1 train/test split 10 1 Preprocessor1_Model1
## 8 1 train/test split 11 1 Preprocessor1_Model1
## 9 1 train/test split 12 1 Preprocessor1_Model1
## 10 1 train/test split 17 1 Preprocessor1_Model1
## # ℹ 12,308 more rows
Take a look at the columns. You’ll need to provide the predictions
(created with collect_predictions()) and then pipe that to
conf_mat(), to which you provide the names of a) the known
values for the test set (pass) and b) the predictions (.pred_class), in that
order, since conf_mat() expects the truth column first. Some code to get
you started is below.
collect_predictions(final_fit) %>%
  conf_mat(pass, .pred_class)
## Truth
## Prediction 0 1
## 0 4777 3525
## 1 1824 2187
You should see a confusion matrix output. If so, nice work! Please fill in the Value and Percentage columns in the table above (with TP, TN, FP, and FN), entering the appropriate values and then converting those into a percentage based on the total number of data points in the test data set. The accuracy can be interpreted as the sum of the true positive and true negative percentages. So, what’s the accuracy? Add that below as a percentage.
True positives (TP): 2,187 (17.76% of the 12,313 predictions in the confusion matrix)
True negatives (TN): 4,777 (38.80%)
False positives (FP): 1,824 (14.81%)
False negatives (FN): 3,525 (28.63%)
Accuracy: 17.76% + 38.80% = 56.56%, which matches (to rounding) the 0.566 that collect_metrics() reports below.
Here’s where things get interesting: there are other statistics that have different denominators. Recall from the overview presentation that we can slice and dice the confusion matrix to calculate statistics that give us insights into the predictive utility of the model. Based on the values for TP, TN, FP, and FN you filled in a few moments ago, add the statistic values for sensitivity, specificity, precision, and negative predictive value below, to three decimal places (a quick arithmetic check in code follows the table).
| Statistic | Equation | Interpretation | Question Answered | Statistic Values |
|---|---|---|---|---|
| Sensitivity (AKA recall) | TP / (TP + FN) | Proportion of those who are affected by a condition and correctly tested positive | Out of all the actual positives, how many did we correctly predict? | 0.383 |
| Specificity | TN / (FP + TN) | Proportion of those who are not affected by a condition and correctly tested negative | Out of all the actual negatives, how many did we correctly predict? | 0.724 |
| Precision (AKA Positive Predictive Value) | TP / (TP + FP) | Proportion of those who tested positive who are affected by a condition | Out of all the instances we predicted as positive, how many are actually positive? | 0.545 |
| Negative Predictive Value | TN / (TN + FN) | Proportion of those who tested negative who are not affected by a condition; the probability that a negative test is correct | Out of all the instances we predicted as negative, how many are actually negative? | 0.575 |
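As a quick arithmetic check, here is a small sketch that computes these four statistics directly from the confusion matrix cells above, treating passing ("1") as the positive class:
tp <- 2187 # truth 1, predicted 1
tn <- 4777 # truth 0, predicted 0
fp <- 1824 # truth 0, predicted 1
fn <- 3525 # truth 1, predicted 0
round(c(sensitivity = tp / (tp + fn),
        specificity = tn / (fp + tn),
        precision   = tp / (tp + fp),
        npv         = tn / (tn + fn)), 3) # 0.383, 0.724, 0.545, 0.575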
So, what does this hard-won output tell us? Let’s interpret each statistic carefully in the table below. Please add the value and provide a substantive interpretation. One is provided to get you started.
| Statistic | Substantive Interpretation |
|---|---|
| Accuracy | 56.56%. The model classifies about 57% of the test-set students correctly. For reference, predicting that no student passes would be about 54% accurate, so the model does only a little better than that baseline. |
| Sensitivity (AKA recall) | 0.383. The model correctly identifies only about 38% of the students who actually pass; it misses most passing students. |
| Specificity | 0.724. About 72% of the students who do not pass are correctly predicted as not passing. This is reasonably good, but it could be better. |
| Precision (AKA Positive Predictive Value) | 0.545. When the model predicts that a student will pass, it is correct about 55% of the time. |
| Negative Predictive Value | 0.575. When the model predicts that a student will not pass, it is correct about 58% of the time. |
This process might suggest to you how a “good” model isn’t necessarily one that is the most accurate, but instead is one that has good values for statistics that matter given our particular question and context.
Recall that Baker and colleagues sought to balance between precision and recall (another name for sensitivity). Please briefly discuss how well our model does this; is it better suited to correctly identifying “positive” pass cases (sensitivity) or “negatively” identifying students who do not pass (specificity)?
We can see that the model is considerably better at “negatively” identifying students who do not pass (specificity) than at correctly identifying “positive” pass cases (sensitivity). Sensitivity indicates the proportion of actual positive cases (students who pass) that the model correctly identifies; at 0.383, the model catches only about 38% of the students who actually pass the course.
Specificity, on the other hand, indicates the proportion of actual negative cases (students who do not pass) that the model correctly identifies. With a specificity of 0.724, the model correctly flags about 72% of the students who do not pass the course.
Overall, that asymmetry may be tolerable, or even useful, for our purpose: if the goal is to identify students at risk of not passing early enough to intervene, missing relatively few of them matters most. Still, the negative predictive value tells us that only about 58% of the “will not pass” predictions are correct, so an intervention based on these predictions would also reach many students who would have passed anyway.
After all of this work, we can calculate the above much more
easily, given how we set up our metrics with metric_set()
earlier; this is especially handy when we want to efficiently communicate the results of
our analysis to our audience. Below, use collect_metrics()
with the final_fit object. In addition to accuracy and
Cohen’s Kappa, it reports the four metrics we calculated manually
just a few moments ago (see the note below the output about which class
{yardstick} treats as the positive one).
collect_metrics(final_fit)
## # A tibble: 6 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.566 Preprocessor1_Model1
## 2 sensitivity binary 0.724 Preprocessor1_Model1
## 3 specificity binary 0.383 Preprocessor1_Model1
## 4 ppv binary 0.575 Preprocessor1_Model1
## 5 npv binary 0.545 Preprocessor1_Model1
## 6 kap binary 0.109 Preprocessor1_Model1
Having invested in understanding the terminology of machine learning
metrics, we’ll use this shortcut (collect_metrics())
going forward. One caution when reading its output: by default, {yardstick}
treats the first level of the outcome factor as the event of interest, which
here is “0” (did not pass). The sensitivity, specificity, ppv, and npv above
are therefore computed with not passing as the positive class, which is why
they appear swapped relative to the values we calculated by hand with passing
(“1”) as the positive class.
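If you want these metrics computed with passing ("1") treated as the event instead, one option, sketched here with output omitted, is to apply the individual {yardstick} functions to the collected predictions and set event_level = "second":
collect_predictions(final_fit) %>%
  sensitivity(truth = pass, estimate = .pred_class,
              event_level = "second") # treat the second factor level, "1" (pass), as the event
# specificity(), ppv(), and npv() accept the same event_level argument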
Congratulations - you’ve completed this case study! Go over to Lab 2 badge activity.