In this lab, you will respond to a set of prompts in two parts.

Part I: Data Product

For the data product, you will interpret a different type of model: a model in regression mode.

So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).

While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this lab is different: its interpretation. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression machine learning models; different metrics are used. For this lab, you will specify and interpret a regression machine learning model.
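As a preview, here is a minimal, self-contained sketch of the two regression metrics applied later in this lab, using toy values rather than the lab data:

# illustrative only: hypothetical observed scores and predictions
library(yardstick)
library(tibble)

toy <- tibble(
    truth    = c(70, 55, 88, 42),
    estimate = c(65, 60, 80, 50)
)

rmse(toy, truth = truth, estimate = estimate) # root mean squared error
mae(toy, truth = truth, estimate = estimate)  # mean absolute error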

The requirements are as follows:

Please use the code chunks below for your code.

1. PREPARE

## Load Packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.0 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(yardstick) # already attached via tidymodels; loaded explicitly for the metric functions
library(dplyr)     # already attached via tidyverse; loaded explicitly for the wrangling verbs

2. WRANGLE

assessments <- read.csv("data/oulad-assessments.csv") # load the OULAD assessments data

3. EXPLORE

assessments %>% 
    count(assessment_type)
##   assessment_type     n
## 1             CMA 70527
## 2            Exam  4959
## 3             TMA 98426

Filter for the Exam assessment type, since we want to predict final exam scores only.

assessments <- assessments %>%
  filter(assessment_type == "Exam") # keep only rows where assessment_type is "Exam"

Each student's score on the final exam

assessments_score <- assessments %>% 
    group_by(id_student) # group the exam rows by student

assessments_score
## # A tibble: 4,959 × 10
## # Groups:   id_student [4,633]
##    id_assessment id_student date_submitted is_banked score code_module
##            <int>      <int>          <int>     <int> <int> <chr>      
##  1         24290     558914            230         0    32 CCC        
##  2         24290     559706            234         0    78 CCC        
##  3         24290     559770            230         0    54 CCC        
##  4         24290     560114            230         0    64 CCC        
##  5         24290     560311            234         0   100 CCC        
##  6         24290     560494            230         0    92 CCC        
##  7         24290     561363            230         0    84 CCC        
##  8         24290     561559            230         0    42 CCC        
##  9         24290     561774            230         0    62 CCC        
## 10         24290     562450            230         0    46 CCC        
## # ℹ 4,949 more rows
## # ℹ 4 more variables: code_presentation <chr>, assessment_type <chr>,
## #   date <int>, weight <dbl>
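Note that group_by() alone only marks the groups; every exam row is retained (4,959 rows across 4,633 students). If a single score per student were needed instead, a summarise() step would collapse the rows. A sketch, assuming the mean across attempts is an acceptable summary:

assessments_mean_score <- assessments %>% 
    group_by(id_student) %>% 
    summarise(mean_exam_score = mean(score, na.rm = TRUE)) # one row per student

This lab keeps the grouped rows, so the step above is optional.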

Processing student variables

students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students <- students %>% 
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
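A quick illustrative check of what this conversion does: as.integer() on a factor returns each value's position in the level order, not the percentage itself.

x <- factor(c("10-20%", "0-10%", "90-100%"),
            levels = c("0-10%", "10-20%", "20-30%", "30-40%", "40-50%",
                       "50-60%", "60-70%", "70-80%", "80-90%", "90-100%"))
as.integer(x) # returns 2 1 10: positions in the level order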

Join students and assessments_score to create students_and_assessments

## an inner join keeps only students who also appear in the assessments_score table (i.e., those with exam scores)
students_and_assessments <- inner_join(x = students, y = assessments_score, by="id_student")
## Warning in inner_join(x = students, y = assessments_score, by = "id_student"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 8685 of `x` matches multiple rows in `y`.
## ℹ Row 1624 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

4. MODEL

Step 1: Split data

set.seed(20230712)

students_and_assessments <- students_and_assessments %>% 
    drop_na(score) # drop rows with a missing outcome

train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "score") # 50/50 split, stratified on the outcome
data_train <- training(train_test_split)
data_test <- testing(train_test_split)
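An optional sanity check on the split: with prop = .50, the training and testing sets should be roughly equal in size.

nrow(data_train) # rows in the training set
nrow(data_test)  # rows in the testing set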

Step 2: Engineer features and write down the recipe

my_rec <- recipe(score ~ disability + # the outcome is the continuous exam score
                     date_registration + 
                     gender +
                     highest_education +
                     age_band,
                 data = data_train) %>% 
    step_dummy(disability) %>% # dummy-code the categorical predictors
    step_dummy(gender)

Step 3: Specify the model and workflow

# specify model
my_mod <-
    linear_reg() %>% 
    set_engine("lm") %>% # ordinary least squares via lm
    set_mode("regression") # regression mode for a continuous outcome

# specify workflow
my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

Step 4: Fit model

fitted_model <- fit(my_wf, data = data_train)
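To peek at the fitted coefficients, which is often useful when interpreting a regression model, one option is to pull the parsnip fit out of the workflow and tidy it:

fitted_model %>% 
    extract_fit_parsnip() %>% # pull the parsnip model out of the workflow
    tidy()                    # coefficient estimates, standard errors, p-values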

Regression metrics

# mse is not included in this metric set; it is computed by hand below
reg_metrics <- metric_set(mae, rmse) # mean absolute error and root mean squared error

The last_fit() function

final_fit <- last_fit(my_wf, train_test_split, metrics = reg_metrics)

Step 5: Interpret model performance

predictions <- collect_predictions(final_fit)
predictions
## # A tibble: 3,337 × 5
##    .pred id                .row score .config             
##    <dbl> <chr>            <int> <int> <chr>               
##  1  64.8 train/test split     1    73 Preprocessor1_Model1
##  2  69.8 train/test split     2    56 Preprocessor1_Model1
##  3  59.8 train/test split     3    71 Preprocessor1_Model1
##  4  57.7 train/test split     4    73 Preprocessor1_Model1
##  5  70.5 train/test split     6    76 Preprocessor1_Model1
##  6  67.8 train/test split     7    66 Preprocessor1_Model1
##  7  72.2 train/test split     8    50 Preprocessor1_Model1
##  8  69.6 train/test split     9    40 Preprocessor1_Model1
##  9  72.2 train/test split    10    98 Preprocessor1_Model1
## 10  70.3 train/test split    11   100 Preprocessor1_Model1
## # ℹ 3,327 more rows

5. COMMUNICATE / RESULTS

collect_metrics(final_fit)
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 mae     standard        16.9 Preprocessor1_Model1
## 2 rmse    standard        20.3 Preprocessor1_Model1
mse <- predictions %>%
  summarise(mse = mean((.pred - score)^2)) %>% # mean squared error; equals rmse squared
  pull(mse)

mse
## [1] 413.8447
std <- sd(predictions$score) # standard deviation of the observed scores, a baseline for comparing rmse

std
## [1] 20.71507
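One further optional metric that can aid interpretation is R-squared, the proportion of variance in the observed scores explained by the predictions; a sketch using yardstick:

predictions %>% 
    rsq(truth = score, estimate = .pred) # proportion of variance explained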

Please add your interpretations here:

Part II: Reflect and Plan

  1. What is an example of an outcome related to your research interests that could be modeled using a classification machine learning model?
  • As in many learning analytics studies, a pass/fail grade is of interest and falls under classification. Another outcome of interest could be attendance: whether a student shows up for class on a given day is a yes/no outcome.
  2. What is an example of an outcome related to your research interests that could be modeled using a regression machine learning model?
  • Exam scores or final grades are an interesting use of regression models. Accurately predicting a student's final grade from predictor variables could help students and teachers improve student performance.
  3. Look back to the study you identified for the first machine learning lab badge activity. Was the outcome one that is modeled using a classification or a regression machine learning model? Identify which mode(s) the authors of that paper used and briefly discuss the appropriateness of their decision.
  • Doctor, A. C. (2023, April 11). A Predictive Model using Machine Learning Algorithm in Identifying Students Probability on Passing Semestral Course. ArXiv.org. https://doi.org/10.25147/ijcsr.2017.001.1.135. The model used in this paper was a classification model: the authors predict a pass/fail outcome with a decision tree. A decision tree works well for this scenario; tree-based methods classify instances by sorting them down the tree from the root to a leaf node, and the resulting classification helps identify students' weaknesses.

Knit and Publish

Complete the following steps to knit and publish your work:

  1. First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Next, click the Knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in the Viewer tab in the lower-right pane or in a new browser window. Let us know if you run into any issues with knitting.

  3. Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer pane after you knit your document.

Have fun!