In this lab, you will respond to a set of prompts in two parts. For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this lab differs: how the model is interpreted. The confusion matrix we created to parse the predictive strength of our classification model does not pertain to regression machine learning models; different metrics are used instead. For this lab, you will specify and interpret a regression machine learning model.
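As a quick illustration, here is a minimal sketch using two hypothetical vectors of actual and predicted exam scores (the numbers are made up purely for illustration) showing the three metrics you will work with in this lab:
actual <- c(70, 85, 90) # hypothetical actual exam scores
predicted <- c(75, 80, 96) # hypothetical predicted scores
mean(abs(actual - predicted)) # MAE: (5 + 5 + 6) / 3 ≈ 5.33
mean((actual - predicted)^2) # MSE: (25 + 25 + 36) / 3 ≈ 28.67
sqrt(mean((actual - predicted)^2)) # RMSE: sqrt(28.67) ≈ 5.35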
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.
Please use the code chunk below for your code:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.0
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(yardstick)
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
code_module_dates <- assessments %>% # carried over from the case study; not used below, since we now flag exams instead of filtering by date
  group_by(code_module, code_presentation) %>%
  summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE))
## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.
# change the case study's date filter to a final-exam indicator
assessments_filtered <- assessments %>%
  mutate(final_exam = ifelse(assessment_type == "Exam", 1, 0)) %>% # flag exam records with a 1, all other records with a 0
  mutate(final_exam = as.factor(final_exam))
length(assessments_filtered$assessment_type) # the total number of records in the data
## [1] 173912
assessments_summarized <- assessments_filtered %>%
mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
group_by(id_student) %>%
summarize(mean_weighted_score = mean(weighted_score)) # each student's average weighted score
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
students_and_assessments <- left_join(students, assessments_summarized)
## Joining with `by = join_by(id_student)`
summary(students_and_assessments$mean_weighted_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 529.7 745.6 951.9 1400.0 8400.0 6057
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
drop_na(mean_weighted_score)
train_test_split <- initial_split(students_and_assessments, prop = .50)
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
my_rec <- recipe(mean_weighted_score ~ disability +
                   date_registration +
                   gender +
                   code_module,
                 data = data_train) %>% # the outcome belongs only on the left-hand side of the formula
  step_dummy(disability, gender, code_module) # dummy-code the three nominal predictors
my_mod <-
linear_reg() %>%
set_engine("lm") %>% # change from glm to lm (linear model)
set_mode("regression") # change from classification to regregression
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
reg_metrics <- metric_set(yardstick::mae, yardstick::rmse) # regression metrics (MSE can be recovered as the square of RMSE; see the sketch below)
final_fit <- last_fit(my_wf, train_test_split, metrics = reg_metrics)
collect_metrics(final_fit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 mae standard 352. Preprocessor1_Model1
## 2 rmse standard 493. Preprocessor1_Model1
Please add your interpretations here:
MAE (Mean Absolute Error): measures the average absolute difference between the predicted and actual values; each such difference is known as a residual. Here, the MAE of about 352 means the model's predictions of students' mean weighted score miss the actual value by roughly 352 points on average, for an outcome that ranges from 0 to 8,400.
MSE (Mean Squared Error): measures the average of the squared residuals. Because the residuals are squared, large misses are penalized far more heavily than small ones. It is not reported in the table above, but it equals the square of the RMSE, here roughly 493² ≈ 243,000.
RMSE (Root Mean Squared Error): measures the square root of the MSE, which returns the error to the outcome's original units; it can also be read as the standard deviation of the residuals. That the RMSE here (about 493) is noticeably larger than the MAE suggests the model makes some unusually large errors for a subset of students.
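In symbols, for $n$ held-out observations with actual values $y_i$ and predictions $\hat{y}_i$, the three metrics are:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$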
All three metrics summarize prediction error in closely related ways to gauge model performance; because they track errors, lower values on all three indicate a better-fitting model.
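Because the metric set above computes only MAE and RMSE, here is a minimal sketch, reusing the final_fit object from above, of how all three metrics (including MSE) can be recovered by hand from the held-out predictions; mean_weighted_score is the actual outcome and .pred is the model's prediction:
final_fit %>%
  collect_predictions() %>% # one row per held-out student, with the actual outcome and .pred
  mutate(residual = mean_weighted_score - .pred) %>% # residual = actual - predicted
  summarize(mae = mean(abs(residual)), # average absolute residual
            mse = mean(residual^2), # average squared residual
            rmse = sqrt(mean(residual^2))) # square root of the MSE
The mae and rmse values computed this way should match the collect_metrics() output above, and the mse value should equal the rmse value squared.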
Khor, E. T. (2019). Predictive models with machine learning algorithms to forecast students’ performance. INTED2019 Proceedings, 2831–2837. https://doi.org/10.21125/inted.2019.0757
The study from my Lab 1 not only had an academic application but also comparatively evaluated four algorithms: Decision Tree, Naive Bayes, Support Vector Machine, and Neural Network, all applied as supervised classification methods. (A neural network can also function in an unsupervised setting, but that was not the case in this study.) The study sought to evaluate whether a student should be viewed as a low, middle, or high achiever. Considering that the best-performing model succeeded at a rate of 78.75%, I would call the model selection highly appropriate.
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
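For example, the top of the document might look something like this (the title and output format here are placeholders; only the author: line needs to change):
---
title: "Lab 2" # placeholder title; keep whatever your document already has
author: "Your Name" # <-- change this line to your name
output: html_document
---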
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.
Have fun!