Machine Learning - Lab 2 Badge

In this lab you will respond to a set of prompts for two parts.

Part I: Data Product

For the data product, you will interpret a different type of model – a model in a regression mode.

So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).

While many parts of the machine learning process are the same for a regression machine learning model, one key part that is relevant to this lab is different: their interpretation. The confusion matrix we created to parse the predictive strength of our classification model does not pertain to regression machine learning models. Different metrics are used. For this lab, you will specify and interpret a regression machine learning model.

The requirements are as follows:

Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.

Please use the code chunk below for your code:

Load Libraries

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.1

## Warning: package 'tidyr' was built under R version 4.3.1

## Warning: package 'readr' was built under R version 4.3.1

## Warning: package 'dplyr' was built under R version 4.3.1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyr)
library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.3.1

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8

## Warning: package 'broom' was built under R version 4.3.1

## Warning: package 'dials' was built under R version 4.3.1

## Warning: package 'modeldata' was built under R version 4.3.1

## Warning: package 'parsnip' was built under R version 4.3.1

## Warning: package 'recipes' was built under R version 4.3.1

## Warning: package 'rsample' was built under R version 4.3.1

## Warning: package 'tune' was built under R version 4.3.1

## Warning: package 'workflows' was built under R version 4.3.1

## Warning: package 'workflowsets' was built under R version 4.3.1

## Warning: package 'yardstick' was built under R version 4.3.1

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/

library(janitor)

## Warning: package 'janitor' was built under R version 4.3.1

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Load data

students <- read_csv("data/oulad-students.csv")

## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

assessments <- read_csv("data/oulad-assessments.csv")

## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore

assessments %>% 
    count(id_assessment)

## # A tibble: 188 × 2
##    id_assessment     n
##            <dbl> <int>
##  1          1752   359
##  2          1753   342
##  3          1754   331
##  4          1755   303
##  5          1756   298
##  6          1758   337
##  7          1759   317
##  8          1760   304
##  9          1761   280
## 10          1762   278
## # ℹ 178 more rows

assessments %>% 
    distinct(id_assessment)

## # A tibble: 188 × 1
##    id_assessment
##            <dbl>
##  1          1752
##  2          1753
##  3          1754
##  4          1755
##  5          1756
##  6          1758
##  7          1759
##  8          1760
##  9          1761
## 10          1762
## # ℹ 178 more rows

assessments %>% 
    count(assessment_type, code_module, code_presentation)

## # A tibble: 41 × 4
##    assessment_type code_module code_presentation     n
##    <chr>           <chr>       <chr>             <int>
##  1 CMA             BBB         2013B              5049
##  2 CMA             BBB         2013J              6416
##  3 CMA             BBB         2014B              4493
##  4 CMA             CCC         2014B              3920
##  5 CMA             CCC         2014J              5846
##  6 CMA             DDD         2013B              5252
##  7 CMA             FFF         2013B              6681
##  8 CMA             FFF         2013J              8847
##  9 CMA             FFF         2014B              5549
## 10 CMA             FFF         2014J              8915
## # ℹ 31 more rows

assessments %>% 
    summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
              median_date = median(date, na.rm = TRUE), # find the median
              sd_date = sd(date, na.rm = TRUE), # find the sd
              min_date = min(date, na.rm = TRUE), # find the min
              max_date = max(date, na.rm = TRUE)) # find the mad

## # A tibble: 1 × 5
##   mean_date median_date sd_date min_date max_date
##       <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
## 1      131.         129    78.0       12      261

assessments %>% 
    group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
    summarize(mean_date = mean(date, na.rm = TRUE),
              median_date = median(date, na.rm = TRUE),
              sd_date = sd(date, na.rm = TRUE),
              min_date = min(date, na.rm = TRUE),
              max_date = max(date, na.rm = TRUE),
              first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile

## `summarise()` has grouped output by 'code_module'. You can override using the
## `.groups` argument.

## # A tibble: 22 × 8
## # Groups:   code_module [7]
##    code_module code_presentation mean_date median_date sd_date min_date max_date
##    <chr>       <chr>                 <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
##  1 AAA         2013J                 109.          117    71.3       19      215
##  2 AAA         2014J                 109.          117    71.5       19      215
##  3 BBB         2013B                 104.           89    55.5       19      187
##  4 BBB         2013J                 112.           96    61.6       19      208
##  5 BBB         2014B                  98.9          82    58.6       12      194
##  6 BBB         2014J                  99.1         110    65.2       19      201
##  7 CCC         2014B                  98.4         102    68.0       18      207
##  8 CCC         2014J                 104.          109    70.8       18      214
##  9 DDD         2013B                 104.           81    66.0       23      240
## 10 DDD         2013J                 118.           88    77.9       25      261
## # ℹ 12 more rows
## # ℹ 1 more variable: first_quantile <dbl>

code_module_dates <- assessments %>% 
    group_by(code_module, code_presentation) %>% 
    summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE), .groups = 'drop')

code_module_dates

## # A tibble: 22 × 3
##    code_module code_presentation quantile_cutoff_date
##    <chr>       <chr>                            <dbl>
##  1 AAA         2013J                               54
##  2 AAA         2014J                               54
##  3 BBB         2013B                               54
##  4 BBB         2013J                               54
##  5 BBB         2014B                               47
##  6 BBB         2014J                               54
##  7 CCC         2014B                               32
##  8 CCC         2014J                               32
##  9 DDD         2013B                               51
## 10 DDD         2013J                               53
## # ℹ 12 more rows

Joining the data

assessments_joined <- left_join(assessments, code_module_dates)

## Joining with `by = join_by(code_module, code_presentation)`

Filtering the assessment data

assessments_filtered <- assessments_joined %>% 
    filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included

Finding average score for each student

assessments_summarized <- assessments_filtered %>% 
    mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
    group_by(id_student) %>% 
    summarize(mean_weighted_score = mean(weighted_score))

Mutating

students <- students %>% 
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                  "10-20%",
                                                  "20-30%",
                                                  "30-40%",
                                                  "40-50%",
                                                  "50-60%",
                                                  "60-70%",
                                                  "70-80%",
                                                  "80-90%",
                                                  "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels

Joining the files - “students” and “assessments_summarized”

students_and_assessments <- left_join(students, assessments_summarized)

## Joining with `by = join_by(id_student)`

students_and_assessments

## # A tibble: 32,593 × 17
##    code_module code_presentation id_student gender region      highest_education
##    <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
##  1 AAA         2013J                  11391 M      East Angli… HE Qualification 
##  2 AAA         2013J                  28400 F      Scotland    HE Qualification 
##  3 AAA         2013J                  30268 F      North West… A Level or Equiv…
##  4 AAA         2013J                  31604 F      South East… A Level or Equiv…
##  5 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
##  6 AAA         2013J                  38053 M      Wales       A Level or Equiv…
##  7 AAA         2013J                  45462 M      Scotland    HE Qualification 
##  8 AAA         2013J                  45642 F      North West… A Level or Equiv…
##  9 AAA         2013J                  52130 F      East Angli… A Level or Equiv…
## 10 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
## # ℹ 32,583 more rows
## # ℹ 11 more variables: imd_band <int>, age_band <chr>,
## #   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
## #   final_result <chr>, module_presentation_length <dbl>,
## #   date_registration <dbl>, date_unregistration <dbl>, pass <fct>,
## #   mean_weighted_score <dbl>

glimpse(students_and_assessments)

## Rows: 32,593
## Columns: 17
## $ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
## $ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
## $ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
## $ region                     <chr> "East Anglian Region", "Scotland", "North W…
## $ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
## $ imd_band                   <int> 10, 3, 4, 6, 6, 9, 4, 10, 8, NA, 8, 3, 7, 6…
## $ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
## $ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
## $ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
## $ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
## $ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
## $ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pass                       <fct> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mean_weighted_score        <dbl> 780, 700, NA, 720, 690, 790, 700, 720, 720,…

Model

Split Data

set.seed(20230712)

students_and_assessments <- students_and_assessments %>% 
    drop_na(mean_weighted_score)

train_test_split <- initial_split(students_and_assessments, prop = .50, strata = "mean_weighted_score")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)

glimpse(data_train)

## Rows: 12,317
## Columns: 17
## $ code_module                <chr> "BBB", "BBB", "BBB", "BBB", "BBB", "BBB", "…
## $ code_presentation          <chr> "2013B", "2013J", "2013J", "2013J", "2013J"…
## $ id_student                 <dbl> 542562, 485439, 559344, 581016, 590867, 591…
## $ gender                     <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F"…
## $ region                     <chr> "South East Region", "Wales", "South Region…
## $ highest_education          <chr> "A Level or Equivalent", "Lower Than A Leve…
## $ imd_band                   <int> 7, 1, 8, 5, 5, 1, 1, 8, 1, 3, 5, 1, 4, NA, …
## $ age_band                   <chr> "0-35", "35-55", "0-35", "0-35", "35-55", "…
## $ num_of_prev_attempts       <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0…
## $ studied_credits            <dbl> 150, 60, 120, 60, 90, 60, 60, 120, 180, 120…
## $ disability                 <chr> "Y", "Y", "N", "N", "N", "N", "N", "N", "N"…
## $ final_result               <chr> "Withdrawn", "Withdrawn", "Withdrawn", "Wit…
## $ module_presentation_length <dbl> 240, 268, 268, 268, 268, 268, 268, 268, 268…
## $ date_registration          <dbl> -85, -46, -116, -149, -101, -12, -71, -63, …
## $ date_unregistration        <dbl> -73, 202, -64, -107, NA, -3, NA, 135, NA, 9…
## $ pass                       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mean_weighted_score        <dbl> 100.0, 160.0, 0.0, 0.0, 0.0, 0.0, 150.0, 0.…

Engineer Features

my_rec <- recipe(mean_weighted_score ~ disability + studied_credits + highest_education + age_band, data = data_train) %>% 
    step_dummy(disability) %>%
    step_dummy(highest_education)

Specify the model and workflow

# specify model
my_mod <-
    linear_reg() %>% 
    set_engine("lm") %>% # generalized linear model
    set_mode("regression") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

# specify workflow
my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

Fit model

fitted_model <- fit(my_wf, data = data_train)

# Handle missing values in the mean_weighted_score column
data_train <- data_train %>% drop_na(mean_weighted_score)
data_test <- data_test %>% drop_na(mean_weighted_score)

Evaluating metrics

# Make predictions on the test data
predictions <- predict(fitted_model, new_data = data_test)

# Calculate MAE (Mean Absolute Error)
mae_value <- mean(abs(data_test$mean_weighted_score - predictions$`.pred`))

# Calculate MSE (Mean Squared Error)
mse_value <- mean((data_test$mean_weighted_score - predictions$`.pred`)^2)

# Calculate RMSE (Root Mean Squared Error)
rmse_value <- sqrt(mse_value)

# Print the metrics
cat("MAE:", mae_value, "\n")

## MAE: 329.6713

cat("MSE:", mse_value, "\n")

## MSE: 137150.7

cat("RMSE:", rmse_value, "\n")

## RMSE: 370.3387

Please add your interpretations here:

MAE: MAE or Mean Absolute Error measures the average absolute difference between the predicted values and the actual (observed) values. In this case, MAE is 329.6713. On average, our model’s predictions are off by approximately 329.67 units (in the same unit as your target variable). Lower MAE values indicate better model accuracy.
MSE: MSE or Mean Squared Error measures the average squared difference between the predicted values and the actual values. Here, the MSE is 137150.7.MSE gives more weight to large errors because it squares the differences. In this context, it means that the model’s predictions have, on average, a squared error of approximately 137150.7 units. Like MAE, lower MSE values indicate better model accuracy.
RMSE: RMSE is the square root of MSE and is often preferred because it provides error values in the same units as the target variable. In our model, the RMSE is 370.3387. This tells us that on average, our model’s predictions have an error of approximately 370.34 units. RMSE is especially useful when you want to express the prediction error in the same units as the target variable. As with MAE and MSE, lower RMSE values indicate better model accuracy.

Part II: Reflect and Plan

What is an example of an outcome related to your research interests that could be modeled using a classification machine learning model?

Sentiment analysis is a common use case for classification machine learning models. It involves determining the sentiment or emotional tone expressed in a piece of text, such as a product review, social media post, or customer feedback. Sentiment analysis can be applied in several contexts, including Customer Feedback Analysis, Social Media Monitoring, Market Research and Brand Reputation Management. In this scenario, a classification machine learning model is trained on labeled data, where each text sample is associated with a sentiment label (e.g., positive, negative, or neutral). The model learns to identify patterns and features in the text that are indicative of the sentiment expressed. Once trained, the model can be used to automatically classify new, unlabeled text samples into sentiment categories.

What is an example of an outcome related to your research interests that could be modeled using a regression machine learning model?

Regression machine learning models are commonly employed to predict numerical outcomes, such as sales revenue in retail forecasting. In this context, historical sales data, along with relevant features like date, product type, and external factors, are used to train the model. The objective is to forecast future sales revenue as a continuous numerical value, enabling businesses to make informed decisions regarding inventory management and financial planning. Evaluation metrics like Mean Absolute Error (MAE) and Mean Squared Error (MSE) assess the model’s accuracy in predicting sales, facilitating data-driven decision-making in the retail sector.

Look back to the study you identified for the first machine learning lab badge activity. Was the outcome one that is modeled using a classification or a regression machine learning model? Identify which mode(s) the authors of that paper used and briefly discuss the appropriateness of their decision.

The outcome of the study was modeled using both classification and regression machine learning models. The authors used artificial neural network models to predict three different outcomes: grade point average, academic retention, and degree completion. Grade point average is a continuous variable that ranges from 0 to 10, so it was predicted using a regression model. Academic retention and degree completion are binary variables that indicate whether a student stayed or left the university, and whether a student graduated or not, respectively. Therefore, they were predicted using classification models. The authors chose these models because they are flexible and powerful, and can capture complex nonlinear relationships between the predictors and the outcomes. They also compared the performance of the neural network models with other traditional models such as logistic regression, linear regression, and decision trees, and found that the neural network models outperformed them in terms of accuracy. The authors’ decision to use neural network models seems appropriate given the nature of the data and the research question.

Knit and Publish

Complete the following steps to knit and publish your work:

First, change the name of the author: in the YAML header at the very top of this document to your name. The YAML header controls the style and feel for knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to a HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pan or in a new browser window. Let’s us know if you run into any issues with knitting.
Finally, publish your webpage on RPubs by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

Have fun!