As a reminder, to earn a badge for these learning labs, you will have to respond to a set of prompts in two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this learning lab is different: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression machine learning models; different metrics are used. For this badge activity, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.
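If it helps to ground those definitions, the following is a minimal sketch (using made-up numbers, not the course data) of how each metric is computed by hand:

actual <- c(70, 80, 90, 65)    # hypothetical observed scores
predicted <- c(72, 75, 88, 70) # hypothetical predictions from a model

errors <- actual - predicted
mean(abs(errors))    # MAE: the average absolute difference between observed and predicted values
mean(errors^2)       # MSE: the average squared difference
sqrt(mean(errors^2)) # RMSE: the square root of the MSE, back in the units of the outcome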
Please use the code chunk below for your code:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.1.1
## ✔ dials 1.2.0 ✔ tune 1.1.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.0 ✔ yardstick 1.2.0
## ✔ recipes 1.0.6
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students
## # A tibble: 32,593 × 15
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 11391 M East Angli… HE Qualification
## 2 AAA 2013J 28400 F Scotland HE Qualification
## 3 AAA 2013J 30268 F North West… A Level or Equiv…
## 4 AAA 2013J 31604 F South East… A Level or Equiv…
## 5 AAA 2013J 32885 F West Midla… Lower Than A Lev…
## 6 AAA 2013J 38053 M Wales A Level or Equiv…
## 7 AAA 2013J 45462 M Scotland HE Qualification
## 8 AAA 2013J 45642 F North West… A Level or Equiv…
## 9 AAA 2013J 52130 F East Angli… A Level or Equiv…
## 10 AAA 2013J 53025 M North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 9 more variables: imd_band <chr>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments
## # A tibble: 173,912 × 10
## id_assessment id_student date_submitted is_banked score code_module
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1752 11391 18 0 78 AAA
## 2 1752 28400 22 0 70 AAA
## 3 1752 31604 17 0 72 AAA
## 4 1752 32885 26 0 69 AAA
## 5 1752 38053 19 0 79 AAA
## 6 1752 45462 20 0 70 AAA
## 7 1752 45642 18 0 72 AAA
## 8 1752 52130 19 0 72 AAA
## 9 1752 53025 9 0 71 AAA
## 10 1752 57506 18 0 68 AAA
## # ℹ 173,902 more rows
## # ℹ 4 more variables: code_presentation <chr>, assessment_type <chr>,
## # date <dbl>, weight <dbl>
assessments %>%
count(assessment_type)
## # A tibble: 3 × 2
## assessment_type n
## <chr> <int>
## 1 CMA 70527
## 2 Exam 4959
## 3 TMA 98426
assessments %>%
distinct(id_assessment) # this many unique assessments
## # A tibble: 188 × 1
## id_assessment
## <dbl>
## 1 1752
## 2 1753
## 3 1754
## 4 1755
## 5 1756
## 6 1758
## 7 1759
## 8 1760
## 9 1761
## 10 1762
## # ℹ 178 more rows
assessments %>%
summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
median_date = median(date, na.rm = TRUE), # find the median
sd_date = sd(date, na.rm = TRUE), # find the sd
min_date = min(date, na.rm = TRUE), # find the min
max_date = max(date, na.rm = TRUE)) # find the max
## # A tibble: 1 × 5
## mean_date median_date sd_date min_date max_date
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 131. 129 78.0 12 261
assessments %>%
group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
summarize(mean_date = mean(date, na.rm = TRUE),
median_date = median(date, na.rm = TRUE),
sd_date = sd(date, na.rm = TRUE),
min_date = min(date, na.rm = TRUE),
max_date = max(date, na.rm = TRUE),
first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 8
## # Groups: code_module [7]
## code_module code_presentation mean_date median_date sd_date min_date max_date
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAA 2013J 109. 117 71.3 19 215
## 2 AAA 2014J 109. 117 71.5 19 215
## 3 BBB 2013B 104. 89 55.5 19 187
## 4 BBB 2013J 112. 96 61.6 19 208
## 5 BBB 2014B 98.9 82 58.6 12 194
## 6 BBB 2014J 99.1 110 65.2 19 201
## 7 CCC 2014B 98.4 102 68.0 18 207
## 8 CCC 2014J 104. 109 70.8 18 214
## 9 DDD 2013B 104. 81 66.0 23 240
## 10 DDD 2013J 118. 88 77.9 25 261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
code_module_dates
## # A tibble: 22 × 3
## # Groups: code_module [7]
## code_module code_presentation quantile_cutoff_date
## <chr> <chr> <dbl>
## 1 AAA 2013J 54
## 2 AAA 2014J 54
## 3 BBB 2013B 54
## 4 BBB 2013J 54
## 5 BBB 2014B 47
## 6 BBB 2014J 54
## 7 CCC 2014B 32
## 8 CCC 2014J 32
## 9 DDD 2013B 51
## 10 DDD 2013J 53
## # ℹ 12 more rows
assessments_joined <- assessments %>%
left_join(code_module_dates, by = join_by(code_module, code_presentation)) # join the data based on code_module and code_presentation
assessments_filtered <- assessments_joined %>%
filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score))
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code; note the capital "P" in "Pass" so it matches the values in final_result
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
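As a quick, optional sanity check (assuming the students object above), you could tabulate the new variable against final_result to confirm the dummy coding worked as intended:

students %>%
  count(final_result, pass) # each final_result value should map to the expected 0/1 code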
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
students_and_assessments <- students %>%
left_join(assessments_summarized,by = join_by(id_student))
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "pass")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
head(data_train)
## # A tibble: 6 × 17
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 38053 M Wales A Level or Equiv…
## 2 AAA 2013J 45462 M Scotland HE Qualification
## 3 AAA 2013J 53025 M North Region Post Graduate Qu…
## 4 AAA 2013J 62155 F North Weste… HE Qualification
## 5 AAA 2013J 63400 M Scotland Lower Than A Lev…
## 6 AAA 2013J 65002 F East Anglia… A Level or Equiv…
## # ℹ 11 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## # mean_weighted_score <dbl>
my_rec <- recipe(mean_weighted_score ~ disability +
date_registration +
gender +
code_module,
data = data_train) %>% # mean_weighted_score is the outcome, so it should not also appear as a predictor
step_dummy(disability) %>%
step_dummy(gender) %>%
step_dummy(code_module)
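If you want to confirm what the recipe will feed the model, one optional check (assuming the my_rec and data_train objects above) is to prep and bake it:

my_rec %>%
  prep() %>%                # estimate the preprocessing steps from the training data
  bake(new_data = NULL) %>% # apply the steps and return the processed training data
  glimpse()                 # the dummy-coded columns should now appear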
# specify model
my_mod <-
linear_reg() %>%
set_engine("lm") %>% # linear regression model
set_mode("regression") # multiple regression
my_mod
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
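Because the engine is lm, you can also inspect the estimated coefficients. This is an optional sketch (assuming the fitted_model object above), not a required badge step:

fitted_model %>%
  extract_fit_parsnip() %>% # pull out the underlying parsnip (lm) fit
  tidy()                    # return the coefficients as a tidy tibble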
install.packages("yardstick")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(yardstick)
reg_metrics <- metric_set(mae) # a metric set for regression (MAE here; RMSE and MSE are computed from the predictions below)
final_fit <- last_fit(fitted_model, train_test_split, metrics = reg_metrics)
collect_predictions(final_fit)
## # A tibble: 12,318 × 5
## id .pred .row mean_weighted_score .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split 717. 1 780 Preprocessor1_Model1
## 2 train/test split 701. 2 700 Preprocessor1_Model1
## 3 train/test split 701. 3 720 Preprocessor1_Model1
## 4 train/test split 723. 4 690 Preprocessor1_Model1
## 5 train/test split 697. 7 720 Preprocessor1_Model1
## 6 train/test split 698. 8 720 Preprocessor1_Model1
## 7 train/test split 707. 10 680 Preprocessor1_Model1
## 8 train/test split 700. 11 730 Preprocessor1_Model1
## 9 train/test split 699. 12 670 Preprocessor1_Model1
## 10 train/test split 709. 16 590 Preprocessor1_Model1
## # ℹ 12,308 more rows
collect_metrics(final_fit)
## # A tibble: 1 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 mae standard 194. Preprocessor1_Model1
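The metric set above only reports MAE. One possible sketch (using the predictions collected above) for obtaining the other two required metrics is to compute RMSE with yardstick and then derive MSE by averaging the squared errors (equivalently, squaring the RMSE):

preds <- collect_predictions(final_fit)

preds %>%
  rmse(truth = mean_weighted_score, estimate = .pred) # RMSE, in the same units as the outcome

preds %>%
  mutate(squared_error = (mean_weighted_score - .pred)^2) %>%
  summarize(mse = mean(squared_error)) # MSE: the average squared error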
Please add your interpretations here:
MAE: MAE (Mean Absolute Error) is the average absolute difference between the actual and predicted values. Here, the MAE of roughly 194 means that, on average, the model's predictions of students' mean weighted scores miss the actual values by about 194 points.
MSE: MSE (Mean Squared Error) is the average squared difference between the observed and predicted values; because the errors are squared, large misses are penalized more heavily, and the result is expressed in squared units of the outcome.
RMSE: RMSE (Root Mean Squared Error) is the square root of the MSE; it can be interpreted as the standard deviation of the prediction errors (residuals) and is expressed in the same units as the outcome.
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

To receive credit for this assignment and earn your second ML Badge, share the link to the published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.
Once your instructor has checked your link, you will be provided a physical version of the badge below!