As a reminder, to earn a badge for these learning labs, you will have to respond to a set of prompts in two parts.
For the data product, you will interpret a different type of model: a model in regression mode.
So far, we have specified and interpreted a classification model: one predicting a dichotomous outcome (i.e., whether students pass a course). In many cases, however, we are interested in predicting a continuous outcome (e.g., students’ number of points in a course or their score on a final exam).
While many parts of the machine learning process are the same for a regression machine learning model, one key part relevant to this learning lab is different: its interpretation. The confusion matrix we created to assess the predictive strength of our classification model does not apply to regression machine learning models; different metrics are used. For this badge activity, you will specify and interpret a regression machine learning model.
The requirements are as follows:
Change your outcome to students’ final exam performance (note: check the data dictionary for a pointer!).
Using the same data (and testing and training data sets), recipe, and workflow as you used in the case study, change the mode of your model from classification to regression and change the engine from a glm to an lm model.
Interpret your regression machine learning model in terms of three regression machine learning model metrics: MAE, MSE, and RMSE. Read about these metrics here. Similar to how we interpreted the classification machine learning metrics, focus on the substantive meaning of these statistics.
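As a quick reference, for observed scores $y_i$, predicted scores $\hat{y}_i$, and $n$ test-set observations, the three metrics are defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$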
Please use the code chunk below for your code:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.1.1
## ✔ dials 1.2.0 ✔ tune 1.1.1
## ✔ infer 1.0.4 ✔ workflows 1.1.3
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.0 ✔ yardstick 1.2.0
## ✔ recipes 1.0.6
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Load the datasets
assessments <- read_csv("data/oulad-assessments.csv")
## Rows: 173912 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): code_module, code_presentation, assessment_type
## dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students <- read_csv("data/oulad-students.csv")
## Rows: 32593 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): code_module, code_presentation, gender, region, highest_education, ...
## dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
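As these messages note, you can silence the column-specification output if you prefer; a minimal example:
# Optional: suppress the column specification message when reading
students <- read_csv("data/oulad-students.csv", show_col_types = FALSE)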
# Filter the assessments data frame for Exams
filtered_assessments <- assessments %>%
filter(assessment_type == "Exam")
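If you want to confirm which assessment types exist before filtering, a quick sketch using dplyr’s count():
# Check the available assessment types and their frequencies
assessments %>%
  count(assessment_type)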
# Join the data frames by id_student and code_presentation using inner_join;
# the relationship argument declares the expected many-to-many match and
# silences the dplyr warning
joined_data <- filtered_assessments %>%
  inner_join(students,
             by = c("id_student", "code_presentation"),
             relationship = "many-to-many")
# Convert the character columns to factors
data <- joined_data %>%
  mutate(across(c(code_presentation, gender, region, highest_education, disability), as.factor))
data <- data %>%
mutate(imd_band = factor(imd_band, levels = c("0-10%",
"10-20%",
"20-30%",
"30-40%",
"40-50%",
"50-60%",
"60-70%",
"70-80%",
"80-90%",
"90-100%"))) %>% # this creates a factor with ordered levels
mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
data <- data %>%
mutate(age_band = factor(age_band, levels = c("0-35",
"35-55",
"55<="))) %>% # this creates a factor with ordered levels
mutate(age_band = as.integer(age_band)) # this changes the levels into integers based on the order of the factor levels
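To see what the factor-to-integer recoding above does, here is a minimal illustration with a few made-up values:
# Illustration only: ordered factor levels map to the integers 1, 2, 3, ...
x <- factor(c("35-55", "0-35", "55<="), levels = c("0-35", "35-55", "55<="))
as.integer(x) # returns 2 1 3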
# Select the columns for analysis
data <- data %>%
select(score, code_presentation, gender, region, highest_education, disability, imd_band, age_band, studied_credits, date_registration)
# Create a training and test set
set.seed(123)
split <- initial_split(data, prop = 0.75)
train <- training(split)
test <- testing(split)
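One optional refinement, not used here (so the results below are unchanged): rsample’s initial_split() accepts a strata argument, which keeps the distribution of the outcome similar across the training and testing sets:
# Optional: stratify the split on the outcome (not run in this document)
split_strat <- initial_split(data, prop = 0.75, strata = score)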
# Fit a linear regression model to the training set
model <- lm(score ~ ., data = train)
# Predict score on the test set
predictions <- predict(model, test)
# Calculate the MAE, MSE, and RMSE
mae <- mean(abs(predictions - test$score), na.rm = TRUE)
mse <- mean((predictions - test$score)^2, na.rm = TRUE)
rmse <- sqrt(mse)
# Print the MAE, MSE, and RMSE
print(mae)
## [1] 16.51279
print(mse)
## [1] 390.5987
print(rmse)
## [1] 19.76357
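Note that the case study used a tidymodels workflow, and the requirement above asks you to keep the same recipe and workflow while switching the mode to regression and the engine to lm. A minimal sketch of that route follows; it assumes a recipe object named my_rec carried over from the case study (the name is hypothetical), and uses yardstick’s mae() and rmse() to compute the same metrics as the base-R arithmetic above.
# Sketch only: assumes a recipe `my_rec` from the case study (hypothetical name)
lm_spec <- linear_reg() %>%
  set_mode("regression") %>% # regression rather than classification
  set_engine("lm")           # lm rather than glm
lm_fit <- workflow() %>%
  add_recipe(my_rec) %>%
  add_model(lm_spec) %>%
  fit(data = train)
# yardstick metrics on the test set (NA values are dropped by default)
preds <- predict(lm_fit, test) %>% bind_cols(test)
mae(preds, truth = score, estimate = .pred)
rmse(preds, truth = score, estimate = .pred) # MSE is the square of RMSE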
Please add your interpretations here:
MAE: 16.51279
MSE: 390.5987
RMSE: 19.76357
Model 1: In general, lower MAE, MSE, and RMSE values indicate a better model. For this model, the MAE is 13.6614, the MSE is 313.6629, and the RMSE is 17.71053. Substantively, an MAE of about 13.7 means the model’s predicted exam scores are off by roughly 13.7 points, on average, on the 0–100 score scale.
Model 2: For the model fit above, the MAE is 16.51279, the MSE is 390.5987, and the RMSE is 19.76357. An MAE of about 16.5 means predictions deviate from students’ actual exam scores by roughly 16.5 points on average; because the RMSE (about 19.8) is noticeably larger than the MAE, some predictions appear to miss by a wide margin, since squared-error metrics penalize large errors more heavily. On all three metrics, Model 1 predicts exam scores with smaller error than Model 2.
Complete the following steps to knit and publish your work:
First, change the author: field in the YAML header at the very top of this document to your name. The YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Next, click the knit button in the toolbar above to “knit” your R Markdown document to an HTML file that will be saved in your R Project folder. You should see a formatted webpage appear in your Viewer tab in the lower right pane or in a new browser window. Let us know if you run into any issues with knitting.
Finally, publish your webpage on Posit Cloud by clicking the “Publish” button located in the Viewer Pane after you knit your document. See screenshot below.

To receive credit for this assignment and earn your second ML Badge, share the link to your published webpage under the next incomplete badge artifact column on the 2023 LASER Scholar Information and Documents spreadsheet: https://go.ncsu.edu/laser-sheet.
Once your instructor has checked your link, you will be provided a physical version of the badge below!