As a reminder, to earn a badge for these learning labs, you will need to respond to a set of prompts in two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this learning lab differs: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression models; different metrics are used instead. For this badge activity, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and the same training and testing data sets), recipe, and workflow as in the case study, change the mode of your model from classification to regression, and change the engine from glm to lm.
Interpret your regression machine learning model in terms of three regression metrics: MAE, MSE, and RMSE; their formal definitions are summarized just below. Read about these metrics here. As with the classification metrics, focus on the substantive meaning of these statistics.
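For reference, with $y_i$ denoting a student's observed final exam score, $\hat{y}_i$ the model's predicted score, and $n$ the number of test-set predictions, the three metrics are defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$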
Please use the code chunk below for your code:
#Setup
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.1.1
## ✔ dials 1.2.0 ✔ tune 1.1.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.0 ✔ yardstick 1.2.0
## ✔ recipes 1.0.6
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
set.seed(20231204)
#Data Loading
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Data Wrangling and Exploration
table(assessments$assessment_type)
##
## CMA Exam TMA
## 70527 4959 98426
assessments_exam <- assessments %>%
mutate(Bfinal_exam = ifelse(assessment_type == "Exam", 1, 0)) %>%
mutate(Bfinal_exam = as.factor(Bfinal_exam))
table(assessments_exam$assessment_type, assessments_exam$Bfinal_exam)
##
## 0 1
## CMA 70527 0
## Exam 0 4959
## TMA 98426 0
is_unique_identifier <- !any(duplicated(assessments_exam[c("id_student", "id_assessment")]) | duplicated(assessments_exam[c("id_student", "id_assessment")], fromLast = TRUE))
print(is_unique_identifier)
## [1] TRUE
assessments_exam <- subset(assessments_exam, Bfinal_exam == 1)
count(assessments_exam)
## # A tibble: 1 × 1
## n
## <int>
## 1 4959
summary(assessments_exam$score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 51.00 67.00 65.57 82.00 100.00
table(assessments_exam$Bfinal_exam)
##
## 0 1
## 0 4959
assessments_exam <- rename(assessments_exam, final_exam_score = score)
is_unique_identifier <- !any(duplicated(assessments_exam[c("id_student", "code_module", "code_presentation")]) | duplicated(assessments_exam[c("id_student", "code_module", "code_presentation")], fromLast = TRUE))
print(is_unique_identifier)
## [1] TRUE
assessments_exam <- select(assessments_exam, id_student, code_module, code_presentation, final_exam_score)
assessments %>%
group_by(code_module, code_presentation) %>%
summarize(mean_date = mean(date, na.rm = TRUE),
median_date = median(date, na.rm = TRUE),
sd_date = sd(date, na.rm = TRUE),
min_date = min(date, na.rm = TRUE),
max_date = max(date, na.rm = TRUE),
first_quantile = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 8
## # Groups: code_module [7]
## code_module code_presentation mean_date median_date sd_date min_date max_date
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAA 2013J 109. 117 71.3 19 215
## 2 AAA 2014J 109. 117 71.5 19 215
## 3 BBB 2013B 104. 89 55.5 19 187
## 4 BBB 2013J 112. 96 61.6 19 208
## 5 BBB 2014B 98.9 82 58.6 12 194
## 6 BBB 2014J 99.1 110 65.2 19 201
## 7 CCC 2014B 98.4 102 68.0 18 207
## 8 CCC 2014J 104. 109 70.8 18 214
## 9 DDD 2013B 104. 81 66.0 23 240
## 10 DDD 2013J 118. 88 77.9 25 261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
assessments_joined <- assessments |>
left_join(code_module_dates)
## Joining with `by = join_by(code_module, code_presentation)`
assessments_filtered <- assessments_joined %>%
filter(date < quantile_cutoff_date)
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>%
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score))
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>%
mutate(imd_band = as.integer(imd_band))
nrow(distinct(students))
## [1] 32593
n_distinct(students$id_student)
## [1] 28785
is_unique_identifier <- !any(duplicated(students[c("id_student", "code_module", "code_presentation")]) | duplicated(students[c("id_student", "code_module", "code_presentation")], fromLast = TRUE))
print(is_unique_identifier)
## [1] TRUE
nrow(distinct(assessments_summarized))
## [1] 21577
n_distinct(assessments_summarized$id_student)
## [1] 21577
nrow(distinct(assessments_exam))
## [1] 4959
n_distinct(assessments_exam$id_student)
## [1] 4633
students_and_assessments <- students |>
left_join(assessments_summarized) |>
left_join(assessments_exam)
## Joining with `by = join_by(id_student)`
## Joining with `by = join_by(code_module, code_presentation, id_student)`
table(students_and_assessments$final_exam_score)
##
## 0 4 7 8 9 10 12 13 14 16 18 20 22 24 26 27 28 29 30 31
## 3 1 1 1 1 2 2 4 4 9 11 16 23 32 20 27 18 32 20 43
## 32 33 34 36 38 40 42 44 46 47 48 49 50 51 52 53 54 56 58 60
## 25 50 32 86 104 110 132 127 45 89 40 98 30 96 48 113 56 178 160 188
## 62 64 66 67 68 69 70 71 72 73 74 76 78 80 82 84 86 87 88 89
## 151 178 57 113 46 137 44 108 59 132 54 157 191 135 183 124 50 112 52 45
## 90 91 92 93 94 96 98 100
## 56 95 48 36 85 174 98 162
summary(students_and_assessments[c("mean_weighted_score", "final_exam_score")])
## mean_weighted_score final_exam_score
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 160.0 1st Qu.: 51.00
## Median : 610.0 Median : 67.00
## Mean : 544.7 Mean : 65.57
## 3rd Qu.: 875.0 3rd Qu.: 82.00
## Max. :1512.0 Max. :100.00
## NA's :7958 NA's :27634
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score, final_exam_score)
#Data Splitting - Splitting the data into training and testing sets with a 50% split, stratified by final_exam_score.
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "final_exam_score")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
#Recipe Preparation - Creating a recipe for the prediction model, including preprocessing steps like creating dummy variables.
my_rec <- recipe(final_exam_score ~ disability +
date_registration +
gender +
code_module +
mean_weighted_score,
data = data_train) %>%
step_dummy(disability) %>%
step_dummy(gender) %>%
step_dummy(code_module)
#Model Specification - Setting up a linear regression model (linear_reg()) with the lm engine and specifying the mode as regression.
my_mod <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
#Workflow -- Creating a workflow that combines the model and recipe
my_wf <-
workflow() %>%
add_model(my_mod) %>%
add_recipe(my_rec)
#Model Fitting -- Fitting the workflow to the training data.
fitted_model <- fit(my_wf, data = data_train)
reg_metrics <- metric_set(mae, rmse)
#Model Evaluation -- Using last_fit to evaluate the model on the test split, focusing on Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
final_fit <- last_fit(fitted_model, train_test_split, metrics = reg_metrics)
#Prediction Collection -- Collecting and Displaying Predictions
predictions <- collect_predictions(final_fit)
predictions
## # A tibble: 2,461 × 5
## id .pred .row final_exam_score .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split 66.7 2 76 Preprocessor1_Model1
## 2 train/test split 67.3 3 66 Preprocessor1_Model1
## 3 train/test split 67.8 5 98 Preprocessor1_Model1
## 4 train/test split 71.6 6 100 Preprocessor1_Model1
## 5 train/test split 67.6 11 100 Preprocessor1_Model1
## 6 train/test split 72.6 15 60 Preprocessor1_Model1
## 7 train/test split 64.7 19 24 Preprocessor1_Model1
## 8 train/test split 64.2 23 60 Preprocessor1_Model1
## 9 train/test split 77.5 25 94 Preprocessor1_Model1
## 10 train/test split 68.1 27 90 Preprocessor1_Model1
## # ℹ 2,451 more rows
final_fit_metrics <- collect_metrics(final_fit)
final_fit_metrics
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 mae standard 16.4 Preprocessor1_Model1
## 2 rmse standard 19.8 Preprocessor1_Model1
#Metric Calculations -- Calculating and displaying predictive accuracy metrics, including the derivation of Mean Squared Error (MSE) by squaring the RMSE.
mse_row <- final_fit_metrics %>%
filter(.metric == "rmse") %>%
mutate(.estimate = .estimate^2, .metric = "mse")
final_fit_metrics <- bind_rows(final_fit_metrics, mse_row)
print(final_fit_metrics)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 mae standard 16.4 Preprocessor1_Model1
## 2 rmse standard 19.8 Preprocessor1_Model1
## 3 mse standard 391. Preprocessor1_Model1
Please add your interpretations here:
MAE: The MAE was 16.41. The Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, without regard to their direction. It is calculated as the average of the absolute differences between the predicted values and the observed outcomes. In most predictive modeling the goal is to minimize the MAE, but what counts as an acceptable value is highly context-specific. In this context, the MAE means that, on average, the model's predictions of students' final exam scores are about 16.41 points off from their actual exam scores.
MSE: The MSE was 390.95. The Mean Squared Error (MSE) is calculated as the average of the squared differences between the predicted values and the actual values; values closer to zero are better. Because the differences are squared before being averaged, the MSE gives disproportionately greater weight to large errors, so a high MSE can signal the presence of outliers or large deviations. In this context, the MSE of 390.95 indicates a considerable amount of spread between the predicted and actual exam scores.
RMSE: The RMSE was 19.77. The Root Mean Square Error (RMSE) measures the typical difference between the values predicted by a model and the values actually observed. It is calculated by taking the square root of the MSE, which puts the metric back on the same scale as the outcome (exam points). In this context, the RMSE of 19.77 means that the model's predictions are typically off by approximately 19.77 points from the actual exam scores; this is an indication of the typical size of the prediction errors. Because the RMSE is larger than the MAE, some predictions have comparatively large errors.
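As an optional check (a minimal sketch using the prediction tibble created above, not part of the required chunk), the three statistics can be recomputed by hand from the test-set predictions; the result should reproduce the values reported by collect_metrics():

#Optional check -- recomputing MAE, MSE, and RMSE directly from the predictions
predictions %>%
  summarize(mae = mean(abs(final_exam_score - .pred)),
            mse = mean((final_exam_score - .pred)^2),
            rmse = sqrt(mean((final_exam_score - .pred)^2)))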
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

To receive credit for this assignment and earn your second ML Badge, share the link to your published webpage in the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.
Once your instructor has checked your link, you will be provided a physical version of the badge below!