In this lab you will respond to a set of prompts for two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression model, one key part relevant to this lab differs: how the model is interpreted. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression models; different metrics are used instead. For this lab, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression model in terms of three regression metrics: MAE, MSE, and RMSE. Read about these metrics here. As with the classification metrics, focus on the substantive meaning of these statistics.
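Before writing your code, it may help to see how the three metrics relate to one another. Below is a minimal, illustrative sketch using made-up observed and predicted exam scores (the objects obs and pred are hypothetical and are not part of the lab data); MAE and RMSE come from yardstick's vector helpers, and MSE is simply the mean of the squared errors.
# Illustrative sketch only: toy observed and predicted exam scores (hypothetical values)
obs  <- c(70, 55, 90, 40, 80)
pred <- c(65, 60, 85, 50, 75)
yardstick::mae_vec(truth = obs, estimate = pred)   # MAE: mean absolute error
mean((obs - pred)^2)                               # MSE: mean squared error
yardstick::rmse_vec(truth = obs, estimate = pred)  # RMSE: square root of the MSE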
Please use the code chunk below for your code:
## 1. PREPARE
## Load Packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.0
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(yardstick)
library(dplyr)
assessments <- read.csv("data/oulad-assessments.csv")
assessments %>%
count(assessment_type)
## assessment_type n
## 1 CMA 70527
## 2 Exam 4959
## 3 TMA 98426
assessments <- assessments %>%
  filter(assessment_type == "Exam") # keep only rows where assessment_type is "Exam", since we are predicting final exam scores
assessments_score <- assessments %>%
  group_by(id_student) # group exam records by student; some students have more than one exam record
assessments_score
## # A tibble: 4,959 × 10
## # Groups: id_student [4,633]
## id_assessment id_student date_submitted is_banked score code_module
## <int> <int> <int> <int> <int> <chr>
## 1 24290 558914 230 0 32 CCC
## 2 24290 559706 234 0 78 CCC
## 3 24290 559770 230 0 54 CCC
## 4 24290 560114 230 0 64 CCC
## 5 24290 560311 234 0 100 CCC
## 6 24290 560494 230 0 92 CCC
## 7 24290 561363 230 0 84 CCC
## 8 24290 561559 230 0 42 CCC
## 9 24290 561774 230 0 62 CCC
## 10 24290 562450 230 0 46 CCC
## # ℹ 4,949 more rows
## # ℹ 4 more variables: code_presentation <chr>, assessment_type <chr>,
## # date <int>, weight <dbl>
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students <- students %>%
mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps
students <- students %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
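As an optional sanity check (not required for the lab), you can confirm that the ordered bands now map onto the integers 1 through 10, with NA where imd_band was missing:
# Optional check: imd_band should now run from 1 (0-10%) to 10 (90-100%), with NA for missing values
students %>%
  count(imd_band)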
## Join students and assessments_score to create students_and_assessments.
## An inner join keeps only students who have at least one exam score.
students_and_assessments <- inner_join(x = students, y = assessments_score, by = "id_student")
## Warning in inner_join(x = students, y = assessments_score, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 8685 of `x` matches multiple rows in `y`.
## ℹ Row 1624 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
set.seed(20230712)
students_and_assessments <- students_and_assessments %>%
drop_na(score) ## changed to score
train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "score") ## changed from pass to score
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
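Note that when the stratification variable is numeric, rsample bins it (by default into quartiles) before sampling, so stratifying on score still works. If you want to verify the split, comparing the distribution of score in the two halves is a quick optional check (output not shown here):
# Optional check: the training and testing sets should have similar exam-score distributions
summary(data_train$score)
summary(data_test$score)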
my_rec <- recipe(score ~ disability + ## outcome changed from pass to score
  date_registration +
  gender +
  highest_education + ## added two predictors: highest_education and age_band
  age_band,
  ## code_module and mean_weighted_score from the case study are not used as predictors here
  data = data_train) %>%
  step_dummy(disability) %>%
  step_dummy(gender) ## %>%
  ## step_dummy(code_module)
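If you want to see what the recipe actually does to the training data before fitting, you can prep and bake it. This step is optional and its output is not shown here:
# Optional: inspect the preprocessed training data produced by the recipe
my_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()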
# specify model
my_mod <-
linear_reg() %>%
set_engine("lm") %>% # updated to lm
set_mode("regression") # updated to regression
# specify workflow
my_wf <-
workflow() %>% # create a workflow
add_model(my_mod) %>% # add the model we wrote above
add_recipe(my_rec) # add our recipe we wrote above
fitted_model <- fit(my_wf, data = data_train)
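Although this lab focuses on predictive metrics rather than coefficients, you can also pull the fitted lm model out of the workflow to see which predictors are associated with exam scores. This is optional and its output is not shown here:
# Optional: extract the underlying lm fit and tidy its coefficients
fitted_model %>%
  extract_fit_parsnip() %>%
  tidy()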
## MSE is not included in this metric set, so it is computed by hand after collecting predictions
reg_metrics <- metric_set(mae, rmse)
final_fit <- last_fit(my_wf, train_test_split, metrics = reg_metrics)
predictions <- collect_predictions(final_fit)
predictions
## # A tibble: 3,337 × 5
## .pred id .row score .config
## <dbl> <chr> <int> <int> <chr>
## 1 64.8 train/test split 1 73 Preprocessor1_Model1
## 2 69.8 train/test split 2 56 Preprocessor1_Model1
## 3 59.8 train/test split 3 71 Preprocessor1_Model1
## 4 57.7 train/test split 4 73 Preprocessor1_Model1
## 5 70.5 train/test split 6 76 Preprocessor1_Model1
## 6 67.8 train/test split 7 66 Preprocessor1_Model1
## 7 72.2 train/test split 8 50 Preprocessor1_Model1
## 8 69.6 train/test split 9 40 Preprocessor1_Model1
## 9 72.2 train/test split 10 98 Preprocessor1_Model1
## 10 70.3 train/test split 11 100 Preprocessor1_Model1
## # ℹ 3,327 more rows
collect_metrics(final_fit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 mae standard 16.9 Preprocessor1_Model1
## 2 rmse standard 20.3 Preprocessor1_Model1
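Because RMSE is the square root of MSE, you can also recover MSE directly from the metrics above by squaring the RMSE estimate; it should agree (up to rounding) with the hand calculation below. The intermediate object name rmse_est is just illustrative.
# MSE is the square of RMSE; this should match the manual calculation below
rmse_est <- collect_metrics(final_fit) %>%
  filter(.metric == "rmse") %>%
  pull(.estimate)
rmse_est^2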
mse <- predictions %>%
  summarise(mse = mean((.pred - score)^2)) %>% # MSE: mean squared difference between predicted and observed scores
  pull(mse)
mse
## [1] 413.8447
std <- sd(predictions$score) # standard deviation of the observed exam scores, for comparison with the RMSE
std
## [1] 20.71507
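As a point of comparison for the interpretations below, an optional baseline is the RMSE you would get by simply predicting the mean exam score for every student. That baseline is essentially the standard deviation of the observed scores, so if the model's RMSE is close to it, the model adds little predictive value. The object name baseline_pred is just illustrative.
# Optional baseline: predict every student's score as the mean observed score and compute its RMSE
baseline_pred <- rep(mean(predictions$score), nrow(predictions))
rmse_vec(truth = predictions$score, estimate = baseline_pred)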
Please add your interpretations here:
MAE: 16.88 (Mean Absolute Error). MAE is the average absolute difference between the predicted and the observed values. Because exam scores range from 0 to 100, an MAE of roughly 17 points is fairly large: on average, the model's predictions miss a student's actual exam score by about 17 points.
MSE: 413.84 (Mean Squared Error). MSE is the average squared difference between the observed values and the values predicted by the model. Because the errors are squared, MSE is expressed in squared score units, which makes it hard to interpret on its own; it is most useful for comparing models, and its square root is the RMSE.
RMSE: 20.34 (Root Mean Squared Error). RMSE is the square root of the MSE, so it summarizes the typical size of the residuals in the same units as the outcome while penalizing large errors more heavily than MAE. An RMSE of about 20 is large for a 0-100 exam scale: roughly one fifth of the possible range. It is also very close to the standard deviation of the observed scores (20.72), which means the model predicts exam scores scarcely better than simply guessing the mean score for every student; in other words, it explains very little of the variability in exam performance.
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn't actually display in the final output.
Next, click the Knit button in the toolbar above to "knit" your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.
Have fun!