In this lab you will respond to a set of prompts for two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression model, one key part relevant to this lab differs: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression models; different metrics are used instead. For this lab, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression model in terms of three regression metrics: MAE, MSE, and RMSE. Read about these metrics here. As with the classification metrics, focus on the substantive meaning of these statistics.
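Before writing your code, it may help to see how the three metrics relate to one another. Below is a minimal, illustrative sketch using made-up observed and predicted exam scores (the objects obs and pred are hypothetical and are not part of the lab data); MAE and RMSE come from yardstick's vector helpers, and MSE is simply the mean of the squared errors.
# Illustrative sketch only: toy observed and predicted exam scores (hypothetical values)
obs  <- c(70, 55, 90, 40, 80)
pred <- c(65, 60, 85, 50, 75)
yardstick::mae_vec(truth = obs, estimate = pred)   # MAE: mean absolute error
mean((obs - pred)^2)                               # MSE: mean squared error
yardstick::rmse_vec(truth = obs, estimate = pred)  # RMSE: square root of the MSE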
Please use the code chunk below for your code:
## 1. PREPARE
## Load Packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.0
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(yardstick)
library(dplyr)
assessments <- read.csv("data/oulad-assessments.csv")
assessments %>%
count(assessment_type)
## assessment_type n
## 1 CMA 70527
## 2 Exam 4959
## 3 TMA 98426
assessments <- assessments %>%
  filter(assessment_type == "Exam") # keep only rows where assessment_type is "Exam", since we are predicting final exam scores
assessments_score <- assessments %>%
  group_by(id_student) # group exam records by student; some students have more than one exam record
assessments_score
## # A tibble: 4,959 × 10
## # Groups: id_student [4,633]
## id_assessment id_student date_submitted is_banked score code_module
## <int> <int> <int> <int> <int> <chr>
## 1 24290 558914 230 0 32 CCC
## 2 24290 559706 234 0 78 CCC
## 3 24290 559770 230 0 54 CCC
## 4 24290 560114 230 0 64 CCC
## 5 24290 560311 234 0 100 CCC
## 6 24290 560494 230 0 92 CCC
## 7 24290 561363 230 0 84 CCC
## 8 24290 561559 230 0 42 CCC
## 9 24290 561774 230 0 62 CCC
## 10 24290 562450 230 0 46 CCC
## # ℹ 4,949 more rows
## # ℹ 4 more variables: code_presentation <chr>, assessment_type <chr>,
## # date <int>, weight <dbl>
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
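As an optional sanity check (not required for the lab), you can confirm that the ordered bands now map onto the integers 1 through 10, with NA where imd_band was missing:
# Optional check: imd_band should now run from 1 (0-10%) to 10 (90-100%), with NA for missing values
students %>%
  count(imd_band)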
## Join students and assessments_score to create students_and_assessments.
## An inner join keeps only students who have at least one exam score.
students_and_assessments <- inner_join(x = students, y = assessments_score, by = "id_student")
## Warning in inner_join(x = students, y = assessments_score, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 8685 of `x` matches multiple rows in `y`.
## ℹ Row 1624 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
drop_na(score) ## changed to score
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "score") ## changed from pass to score
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
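Note that when the stratification variable is numeric, rsample bins it (by default into quartiles) before sampling, so stratifying on score still works. If you want to verify the split, comparing the distribution of score in the two halves is a quick optional check (output not shown here):
# Optional check: the training and testing sets should have similar exam-score distributions
summary(data_train$score)
summary(data_test$score)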
my_rec <- recipe(score ~ disability + ## outcome changed from pass to score
  date_registration +
  gender +
  highest_education + ## added two predictors: highest_education and age_band
  age_band,
  ## code_module and mean_weighted_score from the case study are not used as predictors here
  data = data_train) %>%
  step_dummy(disability) %>%
  step_dummy(gender) ## %>%
  ## step_dummy(code_module)
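If you want to see what the recipe actually does to the training data before fitting, you can prep and bake it. This step is optional and its output is not shown here:
# Optional: inspect the preprocessed training data produced by the recipe
my_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()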
# specify model
my_mod <-
linear_reg() %>%
set_engine("lm") %>% # updated to lm
set_mode("regression") # updated to regression
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
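Although this lab focuses on predictive metrics rather than coefficients, you can also pull the fitted lm model out of the workflow to see which predictors are associated with exam scores. This is optional and its output is not shown here:
# Optional: extract the underlying lm fit and tidy its coefficients
fitted_model %>%
  extract_fit_parsnip() %>%
  tidy()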
## MSE is not included in this metric set, so it is computed by hand after collecting predictions
reg_metrics <- metric_set(mae, rmse)
final_fit <- last_fit(my_wf, train_test_split, metrics = reg_metrics)
predictions <- collect_predictions(final_fit)
predictions
## # A tibble: 3,337 × 5
## .pred id .row score .config
## <dbl> <chr> <int> <int> <chr>
## 1 64.8 train/test split 1 73 Preprocessor1_Model1
## 2 69.8 train/test split 2 56 Preprocessor1_Model1
## 3 59.8 train/test split 3 71 Preprocessor1_Model1
## 4 57.7 train/test split 4 73 Preprocessor1_Model1
## 5 70.5 train/test split 6 76 Preprocessor1_Model1
## 6 67.8 train/test split 7 66 Preprocessor1_Model1
## 7 72.2 train/test split 8 50 Preprocessor1_Model1
## 8 69.6 train/test split 9 40 Preprocessor1_Model1
## 9 72.2 train/test split 10 98 Preprocessor1_Model1
## 10 70.3 train/test split 11 100 Preprocessor1_Model1
## # ℹ 3,327 more rows
collect_metrics(final_fit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 mae standard 16.9 Preprocessor1_Model1
## 2 rmse standard 20.3 Preprocessor1_Model1
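Because RMSE is the square root of MSE, you can also recover MSE directly from the metrics above by squaring the RMSE estimate; it should agree (up to rounding) with the hand calculation below. The intermediate object name rmse_est is just illustrative.
# MSE is the square of RMSE; this should match the manual calculation below
rmse_est <- collect_metrics(final_fit) %>%
  filter(.metric == "rmse") %>%
  pull(.estimate)
rmse_est^2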
mse <- predictions %>%
  summarise(mse = mean((.pred - score)^2)) %>% # MSE: mean squared difference between predicted and observed scores
  pull(mse)
mse
## [1] 413.8447
std <- sd(predictions$score) # standard deviation of the observed exam scores, for comparison with the RMSE
std
## [1] 20.71507
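As a point of comparison for the interpretations below, an optional baseline is the RMSE you would get by simply predicting the mean exam score for every student. That baseline is essentially the standard deviation of the observed scores, so if the model's RMSE is close to it, the model adds little predictive value. The object name baseline_pred is just illustrative.
# Optional baseline: predict every student's score as the mean observed score and compute its RMSE
baseline_pred <- rep(mean(predictions$score), nrow(predictions))
rmse_vec(truth = predictions$score, estimate = baseline_pred)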
Please add your interpretations here:
MAE: 16.88 (Mean Absolute Error). MAE is the average absolute difference between the predicted and the observed values. Because exam scores range from 0 to 100, an MAE of roughly 17 points is fairly large: on average, the model's predictions miss a student's actual exam score by about 17 points.
MSE: 413.84 (Mean Squared Error). MSE is the average squared difference between the observed values and the values predicted by the model. Because the errors are squared, MSE is expressed in squared score units, which makes it hard to interpret on its own; it is most useful for comparing models, and its square root is the RMSE.
RMSE: 20.34 (Root Mean Squared Error). RMSE is the square root of the MSE, so it summarizes the typical size of the residuals in the same units as the outcome while penalizing large errors more heavily than MAE. An RMSE of about 20 is large for a 0-100 exam scale: roughly one fifth of the possible range. It is also very close to the standard deviation of the observed scores (20.72), which means the model predicts exam scores scarcely better than simply guessing the mean score for every student; in other words, it explains very little of the variability in exam performance.
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.
Next, click the Knit button in the toolbar above to "knit" your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.
Have fun!