In this lab, you will respond to a set of prompts across two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part that is relevant to this lab is different: its interpretation. The confusion matrix we created to gauge the predictive strength of our classification model does not apply to regression models; different metrics are used instead. For this lab, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.
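For reference, these three metrics have standard definitions (general definitions, not specific to this lab). Writing $y_i$ for student $i$'s observed outcome, $\hat{y}_i$ for the model's prediction, and $n$ for the number of students in the test set:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$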
Please use the code chunk below for your code:
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.1
## Warning: package 'tidyr' was built under R version 4.3.1
## Warning: package 'readr' was built under R version 4.3.1
## Warning: package 'dplyr' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.3.1
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## Warning: package 'broom' was built under R version 4.3.1
## Warning: package 'dials' was built under R version 4.3.1
## Warning: package 'modeldata' was built under R version 4.3.1
## Warning: package 'parsnip' was built under R version 4.3.1
## Warning: package 'recipes' was built under R version 4.3.1
## Warning: package 'rsample' was built under R version 4.3.1
## Warning: package 'tune' was built under R version 4.3.1
## Warning: package 'workflows' was built under R version 4.3.1
## Warning: package 'workflowsets' was built under R version 4.3.1
## Warning: package 'yardstick' was built under R version 4.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.1
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
assessments %>%
count(id_assessment)
## # A tibble: 188 × 2
## id_assessment n
## <dbl> <int>
## 1 1752 359
## 2 1753 342
## 3 1754 331
## 4 1755 303
## 5 1756 298
## 6 1758 337
## 7 1759 317
## 8 1760 304
## 9 1761 280
## 10 1762 278
## # ℹ 178 more rows
assessments %>%
distinct(id_assessment)
## # A tibble: 188 × 1
## id_assessment
## <dbl>
## 1 1752
## 2 1753
## 3 1754
## 4 1755
## 5 1756
## 6 1758
## 7 1759
## 8 1760
## 9 1761
## 10 1762
## # ℹ 178 more rows
assessments %>%
count(assessment_type, code_module, code_presentation)
## # A tibble: 41 × 4
## assessment_type code_module code_presentation n
## <chr> <chr> <chr> <int>
## 1 CMA BBB 2013B 5049
## 2 CMA BBB 2013J 6416
## 3 CMA BBB 2014B 4493
## 4 CMA CCC 2014B 3920
## 5 CMA CCC 2014J 5846
## 6 CMA DDD 2013B 5252
## 7 CMA FFF 2013B 6681
## 8 CMA FFF 2013J 8847
## 9 CMA FFF 2014B 5549
## 10 CMA FFF 2014J 8915
## # ℹ 31 more rows
assessments %>%
summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
median_date = median(date, na.rm = TRUE), # find the median
sd_date = sd(date, na.rm = TRUE), # find the sd
min_date = min(date, na.rm = TRUE), # find the min
max_date = max(date, na.rm = TRUE)) # find the max
## # A tibble: 1 × 5
## mean_date median_date sd_date min_date max_date
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 131. 129 78.0 12 261
assessments %>%
group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
summarize(mean_date = mean(date, na.rm = TRUE),
median_date = median(date, na.rm = TRUE),
sd_date = sd(date, na.rm = TRUE),
min_date = min(date, na.rm = TRUE),
max_date = max(date, na.rm = TRUE),
first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first quartile (25th percentile)
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 8
## # Groups: code_module [7]
## code_module code_presentation mean_date median_date sd_date min_date max_date
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAA 2013J 109. 117 71.3 19 215
## 2 AAA 2014J 109. 117 71.5 19 215
## 3 BBB 2013B 104. 89 55.5 19 187
## 4 BBB 2013J 112. 96 61.6 19 208
## 5 BBB 2014B 98.9 82 58.6 12 194
## 6 BBB 2014J 99.1 110 65.2 19 201
## 7 CCC 2014B 98.4 102 68.0 18 207
## 8 CCC 2014J 104. 109 70.8 18 214
## 9 DDD 2013B 104. 81 66.0 23 240
## 10 DDD 2013J 118. 88 77.9 25 261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>
code_module_dates <- assessments %>%
group_by(code_module, code_presentation) %>%
summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE), .groups = 'drop')
code_module_dates
## # A tibble: 22 × 3
## code_module code_presentation quantile_cutoff_date
## <chr> <chr> <dbl>
## 1 AAA 2013J 54
## 2 AAA 2014J 54
## 3 BBB 2013B 54
## 4 BBB 2013J 54
## 5 BBB 2014B 47
## 6 BBB 2014J 54
## 7 CCC 2014B 32
## 8 CCC 2014J 32
## 9 DDD 2013B 51
## 10 DDD 2013J 53
## # ℹ 12 more rows
Joining the data
assessments_joined <- left_join(assessments, code_module_dates)
## Joining with `by = join_by(code_module, code_presentation)`
Filtering the assessment data
assessments_filtered <- assessments_joined %>%
filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
Finding average score for each student
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score))
Mutating
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
Joining the files - “students” and “assessments_summarized”
students_and_assessments <- left_join(students, assessments_summarized)
## Joining with `by = join_by(id_student)`
students_and_assessments
## # A tibble: 32,593 × 17
## code_module code_presentation id_student gender region highest_education
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 AAA 2013J 11391 M East Angli… HE Qualification
## 2 AAA 2013J 28400 F Scotland HE Qualification
## 3 AAA 2013J 30268 F North West… A Level or Equiv…
## 4 AAA 2013J 31604 F South East… A Level or Equiv…
## 5 AAA 2013J 32885 F West Midla… Lower Than A Lev…
## 6 AAA 2013J 38053 M Wales A Level or Equiv…
## 7 AAA 2013J 45462 M Scotland HE Qualification
## 8 AAA 2013J 45642 F North West… A Level or Equiv…
## 9 AAA 2013J 52130 F East Angli… A Level or Equiv…
## 10 AAA 2013J 53025 M North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 11 more variables: imd_band <int>, age_band <chr>,
## # num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## # final_result <chr>, module_presentation_length <dbl>,
## # date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## # mean_weighted_score <dbl>
glimpse(students_and_assessments)
## Rows: 32,593
## Columns: 17
## $ code_module <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band <int> 10, 3, 4, 6, 6, 9, 4, 10, 8, NA, 8, 3, 7, 6…
## $ age_band <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pass <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mean_weighted_score <dbl> 780, 700, NA, 720, 690, 790, 700, 720, 720,…
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "mean_weighted_score")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
glimpse(data_train)
## Rows: 12,317
## Columns: 17
## $ code_module <chr> "BBB", "BBB", "BBB", "BBB", "BBB", "BBB", "…
## $ code_presentation <chr> "2013B", "2013J", "2013J", "2013J", "2013J"…
## $ id_student <dbl> 542562, 485439, 559344, 581016, 590867, 591…
## $ gender <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F"…
## $ region <chr> "South East Region", "Wales", "South Region…
## $ highest_education <chr> "A Level or Equivalent", "Lower Than A Leve…
## $ imd_band <int> 7, 1, 8, 5, 5, 1, 1, 8, 1, 3, 5, 1, 4, NA, …
## $ age_band <chr> "0-35", "35-55", "0-35", "0-35", "35-55", "…
## $ num_of_prev_attempts <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0…
## $ studied_credits <dbl> 150, 60, 120, 60, 90, 60, 60, 120, 180, 120…
## $ disability <chr> "Y", "Y", "N", "N", "N", "N", "N", "N", "N"…
## $ final_result <chr> "Withdrawn", "Withdrawn", "Withdrawn", "Wit…
## $ module_presentation_length <dbl> 240, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration <dbl> -85, -46, -116, -149, -101, -12, -71, -63, …
## $ date_unregistration <dbl> -73, 202, -64, -107, NA, -3, NA, 135, NA, 9…
## $ pass <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mean_weighted_score <dbl> 100.0, 160.0, 0.0, 0.0, 0.0, 0.0, 150.0, 0.…
my_rec <- recipe(mean_weighted_score ~ disability + studied_credits + highest_education + age_band, data = data_train) %>%
step_dummy(disability) %>%
step_dummy(highest_education)
# specify model
my_mod <-
linear_reg() %>%
set_engine("lm") %>% # generalized linear model
set_mode("regression") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
# Drop any remaining missing values in the mean_weighted_score column
# (these were already removed before the split above, so this is just a safeguard)
data_train <- data_train %>% drop_na(mean_weighted_score)
data_test <- data_test %>% drop_na(mean_weighted_score)
Evaluating metrics
# Make predictions on the test data
predictions <- predict(fitted_model, new_data = data_test)
# Calculate MAE (Mean Absolute Error)
mae_value <- mean(abs(data_test$mean_weighted_score - predictions$`.pred`))
# Calculate MSE (Mean Squared Error)
mse_value <- mean((data_test$mean_weighted_score - predictions$`.pred`)^2)
# Calculate RMSE (Root Mean Squared Error)
rmse_value <- sqrt(mse_value)
# Print the metrics
cat("MAE:", mae_value, "\n")
## MAE: 329.6713
cat("MSE:", mse_value, "\n")
## MSE: 137150.7
cat("RMSE:", rmse_value, "\n")
## RMSE: 370.3387
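As an optional cross-check, the same metrics can be computed with yardstick, which is loaded along with tidymodels. The sketch below is one possible approach and reuses the data_test and predictions objects created above; note that yardstick provides mae() and rmse() but, to our knowledge, no built-in mse(), so the MSE can be recovered by squaring the RMSE.
# bind the observed outcome and the predictions into a single data frame
results <- data_test %>%
  select(mean_weighted_score) %>%
  bind_cols(predictions)
# compute MAE and RMSE with yardstick; squaring the RMSE gives the MSE
reg_metrics <- metric_set(mae, rmse)
results %>%
  reg_metrics(truth = mean_weighted_score, estimate = .pred)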
Please add your interpretations here:
MAE: MAE (Mean Absolute Error) measures the average absolute difference between the predicted values and the actual (observed) values. In this case, the MAE is 329.6713: on average, our model’s predictions are off by approximately 329.67 points on the scale of the outcome (the mean weighted score). Lower MAE values indicate better model accuracy.
MSE: MSE (Mean Squared Error) measures the average squared difference between the predicted values and the actual values. Here, the MSE is 137150.7. MSE gives more weight to large errors because it squares the differences, and it is expressed in squared units of the outcome, which makes it harder to interpret directly. Like MAE, lower MSE values indicate better model accuracy.
RMSE: RMSE (Root Mean Squared Error) is the square root of MSE and is often preferred because it returns the error to the same units as the outcome. In our model, the RMSE is 370.3387, meaning that a typical prediction error is approximately 370.34 points. Because squaring emphasizes large errors, the RMSE is somewhat larger than the MAE here. As with MAE and MSE, lower RMSE values indicate better model accuracy.
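One informal way to put these values in context (a rough rule of thumb, not a lab requirement) is to compare the RMSE with the spread of the outcome itself: a model that always predicted the test-set mean would have an RMSE close to the outcome’s standard deviation, so an RMSE well below that value suggests the model adds predictive information. The single line below, which reuses data_test from above, computes that reference value.
# standard deviation of the outcome in the test set, as a baseline for the RMSE
sd(data_test$mean_weighted_score)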
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.
Have fun!