In this lab you will respond to a set of prompts in two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part that is relevant to this lab differs: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression machine learning models, which are evaluated with different metrics. For this lab, you will specify and interpret a regression machine learning model.
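To make the contrast with classification metrics concrete, here is a minimal sketch of three common regression metrics (MAE, MSE, and RMSE) computed by hand in base R. The `actual` and `predicted` vectors are made-up illustrative values, not data from this lab.

```r
# Hypothetical exam scores and model predictions (illustrative values only)
actual    <- c(60, 75, 80, 90)
predicted <- c(65, 70, 85, 80)

# Prediction errors (residuals): actual minus predicted
errors <- actual - predicted

# MAE: average size of the errors, ignoring their direction
mae <- mean(abs(errors))

# MSE: average of the squared errors (in squared units of the outcome)
mse <- mean(errors^2)

# RMSE: square root of the MSE (back in the units of the outcome)
rmse <- sqrt(mse)

mae   # 6.25
mse   # 43.75
rmse  # approximately 6.61
```

In this toy example the predictions miss by 6.25 points on average (MAE). The RMSE is slightly larger because squaring gives the single 10-point miss more weight than the three 5-point misses.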
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.
Please use the code chunk below for your code:
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ recipes 1.0.10
## ✔ dials 1.2.1 ✔ rsample 1.2.1
## ✔ dplyr 1.1.4 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.0 ✔ tidyr 1.3.1
## ✔ infer 1.0.7 ✔ tune 1.2.0
## ✔ modeldata 1.3.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.2 ✔ yardstick 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
assessments <- readr::read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students <- readr::read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
code_module_dates <- assessments %>%
  group_by(code_module, code_presentation) %>%
  summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
assessments_joined <- left_join(assessments, code_module_dates)
## Joining with `by = join_by(code_module, code_presentation)`
assessments_filtered <- assessments_joined %>%
  filter(date < quantile_cutoff_date)
assessments_summarized <- assessments_filtered %>%
  mutate(weighted_score = score * weight) %>%
  group_by(id_student) %>%
  summarize(mean_weighted_score = mean(weighted_score))
students <- students %>%
  mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>%
  mutate(pass = as.factor(pass))
students <- students %>%
  mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                "10-20%",
                                                "20-30%",
                                                "30-40%",
                                                "40-50%",
                                                "50-60%",
                                                "60-70%",
                                                "70-80%",
                                                "80-90%",
                                                "90-100%"))) %>%
  mutate(imd_band = as.integer(imd_band))
students_and_assessments <- students %>%
  left_join(assessments_summarized, by = join_by(id_student))
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
  drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .50)
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
# mean_weighted_score is the outcome, so it is not repeated among the predictors
my_rec <- recipe(mean_weighted_score ~ disability +
                   date_registration +
                   gender +
                   code_module,
                 data = data_train) %>%
  step_dummy(disability) %>%
  step_dummy(gender) %>%
  step_dummy(code_module)
my_mod <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
my_wf <-
  workflow() %>%
  add_model(my_mod) %>%
  add_recipe(my_rec)
fitted_model <- fit(my_wf, data = data_train)
# Regression metrics; the MSE is obtained afterwards by squaring the RMSE
reg_metrics <- metric_set(mae, rmse)
final_fit <- last_fit(my_wf, train_test_split, metrics = reg_metrics)
collect_metrics(final_fit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 mae standard 194. Preprocessor1_Model1
## 2 rmse standard 256. Preprocessor1_Model1
Please add your interpretations here:
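MAE: The mean absolute error (MAE), or the average absolute difference between actual and predicted values, is 194.4208 for this model. In other words, the model’s predictions miss a student’s mean weighted score by about 194 points on average. Our objective is to minimize this value.
MSE: The mean squared error (MSE), or the average of the squared errors, is 65,332.1779 for this model. This value was determined by squaring the RMSE (255.6016^2). Because the errors are squared, the MSE is expressed in squared units of the outcome. Our objective is to minimize this value.
RMSE: The root mean squared error (RMSE), or the square root of the MSE, is 255.6016 for this model. Unlike the MSE, it is expressed in the same units as the outcome. Again, our objective is to minimize this value.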
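As a quick check on the arithmetic behind these interpretations, the sketch below recomputes the MSE from the reported RMSE and compares the RMSE with the MAE. The variable names are illustrative, and the two input values are simply copied from the interpretations above.

```r
# Values reported for this model (copied from the interpretations above)
mae_reported  <- 194.4208
rmse_reported <- 255.6016

# MSE is the square of the RMSE
mse_from_rmse <- rmse_reported^2
mse_from_rmse  # approximately 65332.18

# RMSE can never be smaller than MAE; a ratio above 1 means the errors
# are not all the same size, so some predictions miss by more than others
rmse_to_mae <- rmse_reported / mae_reported
rmse_to_mae > 1  # TRUE
```

A ratio well above 1 suggests that a subset of students is predicted much less accurately than the rest. Confirming this would require looking at the individual predictions.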
The study I identified in the first machine learning lab badge activity researched concussions in student athletes. The study was able to detect the presence of concussion in 81% of the concussion subjects. The findings suggested that the concussion-induced abnormalities in subjects with post-concussion syndrome are not uniformly distributed across the brain tissue.
A classification machine learning model was used to determine these results. This was the correct decision, since a student athlete either has or does not have a concussion. The outcome is binary, which makes this a classification problem.
Complete the following steps to knit and publish your work:
First, change the name in the author: field of the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.
Have fun!