This case study is similar to the first, but it differs in three key ways: we engineer new features from raw gradebook data, we use k-folds cross-validation to estimate our model's performance, and we predict a continuous outcome (students' final grades) using a regression mode.
Feature engineering is a rich topic in machine learning research, including in the learning analytics and educational data mining communities.
Consider research on online learning and the work of Li et al. (2020) and Rodriguez et al. (2021). In these two studies, digital log-trace data (data generated through users' interactions with digital technologies) was used to study elements of the theoretical frame of self-regulated learning and how they related to students' achievement. Notably, the authors took several steps to prepare the data so that it could be validly interpreted as measures of students' self-regulated learning. In short, we need to process data from contexts such as online classes before we can use it in analyses. Citations and links to these papers follow.
Li, Q., Baker, R., & Warschauer, M. (2020). Using clickstream data to measure, understand, and support self-regulated learning in online courses. The Internet and Higher Education, 45, 100727. https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-2/li-et-al-2020-ihe.pdf
Rodriguez, F., Lee, H. R., Rutherford, T., Fischer, C., Potma, E., & Warschauer, M. (2021, April). Using clickstream data mining techniques to understand and support first-generation college students in an online chemistry course. In LAK21: 11th International Learning Analytics and Knowledge Conference (pp. 313-322). https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-2/rodriguez-et-al-2021-lak.pdf
The same is true here in the context of machine learning. In a different context, the work of Gobert et al. (2013) is a great example of feature engineering using data from educational simulations. Salmeron-Majadas et al. (2018) provide an example of feature engineering using keyboard and mouse interaction data. Last, we note that there are methods intended to automate the process of feature engineering (Bosch, 2021), though such processes are not necessarily interpretable, and they usually require some degree of tailoring to your particular context.
Gobert, J. D., Sao Pedro, M., Raziuddin, J., & Baker, R. S. (2013). From log files to assessment metrics: Measuring students’ science inquiry skills using educational data mining. Journal of the Learning Sciences, 22(4), 521-563. https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-3/gobert-et-al-2013-jls.pdf
Salmeron-Majadas, S., Baker, R. S., Santos, O. C., & Boticario, J. G. (2018). A machine learning approach to leverage individual keyboard and mouse interaction behavior from multiple users in real-world learning scenarios. IEEE Access, 6, 39154-39179. https://ieeexplore.ieee.org/iel7/6287639/8274985/08416736.pdf
Bosch, N. (2021). AutoML Feature Engineering for Student Modeling Yields High Accuracy, but Limited Interpretability. Journal of Educational Data Mining, 13(2), 55-79. https://github.com/laser-institute/essential-readings/blob/main/machine-learning/ml-lab-3/bosch-et-al-2021-jedm.pdf
Our driving question for this case study is: How much do new predictors improve the prediction quality?
We use a data set from many online classes to answer this question, engaging in several feature engineering steps along the way.
Like in the first learning lab, we’ll first load several packages.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom 1.0.0 ✔ rsample 1.0.0
## ✔ dials 1.0.0 ✔ tune 1.0.0
## ✔ infer 1.0.2 ✔ workflows 1.0.0
## ✔ modeldata 1.0.0 ✔ workflowsets 0.2.1
## ✔ parsnip 1.0.0 ✔ yardstick 1.0.0
## ✔ recipes 1.0.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
Like in the code-along for the overview presentation, let’s take a look at the data and do some processing of it.
d <- read_csv("data/data-to-model-no-gradebook.csv")
## Rows: 546 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): course_id, gender, enrollment_reason, enrollment_status, subject, s...
## dbl (9): student_id, final_grade, time_spent, int, uv, percomp, tv, sum_disc...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d <- select(d, -time_spent) # this is another outcome, so we'll cut this here
gb <- read_csv("data/data-to-model-gradebook.csv")
## Rows: 14340 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): course_id, gradebook_item, gradebook_type, gradebook_date, grade_c...
## dbl (5): student_id, item_position, points_earned, points_attempted, points...
## lgl (1): status
## time (1): last_access_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
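As the final message suggests, you can quiet this output on future reads by setting show_col_types = FALSE; for example (an optional convenience, not a required step):

gb <- read_csv("data/data-to-model-gradebook.csv", show_col_types = FALSE) # suppresses the column specification message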
We mentioned that this lab is premised on the need to improve on an earlier model. Indeed, an earlier version of this model without feature engineering achieved an RMSE of approximately 13 (see more here), predicting students' passing (or not passing) the course with around 75% accuracy. We think we can do better; the aim of this learning lab is to do just that.
As a bit more background, the online science classes we explore in this chapter were designed and taught by instructors through a state-wide online course provider designed to supplement (but not replace) students' enrollment in their local school. For example, students may have chosen to enroll in an online physics class because one was not offered at their school. The data were originally collected for a research study, which utilized a number of different data sources to understand students' course-related motivation. These included a self-report survey of students' motivation, data from students' discussion board posts, students' final course grades, and gradebook data on individual assignments.

Data sources 1-3 are already joined together in the data frame we named d above. Data source 4, the gradebook data, is separate, in the data frame we named gb.
Take a look at the two data frames by running the two chunks below.
d
## # A tibble: 546 × 15
## student_id course_id gender enrollment_reas… enrollment_stat… final_grade
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 60186 AnPhA-S116-01 M Course Unavaila… Approved/Enroll… 86.3
## 2 66693 AnPhA-S116-01 M Course Unavaila… Approved/Enroll… 93.8
## 3 66811 AnPhA-S116-01 F Course Unavaila… Approved/Enroll… 91.2
## 4 70532 AnPhA-S116-01 F Learning Prefer… Approved/Enroll… 93.6
## 5 77010 AnPhA-S116-01 F Learning Prefer… Approved/Enroll… 73.2
## 6 85249 AnPhA-S116-01 F Course Unavaila… Approved/Enroll… 86.9
## 7 85411 AnPhA-S116-01 F Scheduling Conf… Approved/Enroll… 90.9
## 8 85583 AnPhA-S116-01 F Scheduling Conf… Approved/Enroll… 91.7
## 9 85866 AnPhA-S116-01 F Learning Prefer… Approved/Enroll… 75.1
## 10 85970 AnPhA-S116-01 F Course Unavaila… Approved/Enroll… 81.6
## # … with 536 more rows, and 9 more variables: subject <chr>, semester <chr>,
## # section <chr>, int <dbl>, uv <dbl>, percomp <dbl>, tv <dbl>,
## # sum_discussion_posts <dbl>, sum_n_words <dbl>
gb
## # A tibble: 14,340 × 12
## course_id student_id gradebook_item item_position gradebook_type
## <chr> <dbl> <chr> <dbl> <chr>
## 1 FrScA-S216-02 43146 0-1.1: Intro Assignmen… 10 N
## 2 FrScA-S216-02 43146 0-1.2: Intro Assignmen… 11 N
## 3 FrScA-S216-02 43146 0-1.3: Intro Assignmen… 12 N
## 4 FrScA-S216-02 43146 1-1.1: Lesson 1-1 Grap… 13 N
## 5 FrScA-S216-02 43146 1-2.1: Explore a Caree… 14 N
## 6 FrScA-S216-02 43146 1-2.2: Explore a Caree… 15 N
## 7 FrScA-S216-02 43146 PROGRESS CHECK 1 @ 02-… 16 P
## 8 FrScA-S216-02 43146 1-2.3: Lesson 1-2 Grap… 17 N
## 9 FrScA-S216-02 43146 Unit 1 Assessment 18 N
## 10 FrScA-S216-02 43146 2-1.1: Crime Scene DB … 19 N
## # … with 14,330 more rows, and 7 more variables: gradebook_date <chr>,
## # grade_category <chr>, status <lgl>, points_earned <dbl>,
## # points_attempted <dbl>, points_possible <dbl>, last_access_date <time>
You’ll notice the data have different dimensions. We’ll have to take some steps to further process the gradebook data. In doing so, we’ll engineer some features. Let’s take a closer look at the gradebook data.
gb %>%
glimpse()
## Rows: 14,340
## Columns: 12
## $ course_id <chr> "FrScA-S216-02", "FrScA-S216-02", "FrScA-S216-02", "F…
## $ student_id <dbl> 43146, 43146, 43146, 43146, 43146, 43146, 43146, 4314…
## $ gradebook_item <chr> "0-1.1: Intro Assignment - Send a Message to Your Ins…
## $ item_position <dbl> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 2…
## $ gradebook_type <chr> "N", "N", "N", "N", "N", "N", "P", "N", "N", "N", "N"…
## $ gradebook_date <chr> "31:28.9", "47:10.5", "01:26.5", "33:11.5", "25:33.2"…
## $ grade_category <chr> "Hw", "Hw", "Hw", "Hw", "Hw", "Hw", NA, "Hw", "Qz", "…
## $ status <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ points_earned <dbl> 5, 5, 5, 5, 12, 5, 37, 5, 18, 5, 5, 10, 26, 5, 10, 10…
## $ points_attempted <dbl> 0, 0, 0, 0, 0, 0, 37, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ points_possible <dbl> 5, 5, 5, 5, 12, 5, 37, 5, 20, 5, 5, 10, 28, 5, 10, 10…
## $ last_access_date <time> 00:56:00, 00:56:00, 00:56:00, 00:56:00, 00:56:00, 00…
You may also want to take a look through the data with the View() function; try that out below (asking for help or searching the Internet for help as needed!).
View(gb)
Let’s first consider what these variables are, focusing just on some key variables:
course_id: an identifier for the course
student_id: an identifier for the student
gradebook_item: the name of the gradebook entry/assignment
item_position: the position of the gradebook item in the gradebook; differs between students
grade_category: Hw (homework), Qz (quiz or test), or NA (not classified)
points_earned: the number of points the student earned
points_possible: the number of points possible to earn

What are some features we could create based on these variables? And how might we create them?
Add a few ideas below before proceeding:
the rate of points earned = points_earned / points_possible
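Here is one more idea, sketched in code (just an illustration; hw_percent_earned is a name we made up, and you might engineer this feature differently): the rate of points earned on homework items specifically.

# a sketch of another possible feature: the rate of points earned
# on homework ("Hw") gradebook items only
gb %>%
  filter(grade_category == "Hw") %>%
  group_by(course_id, student_id) %>%
  summarize(hw_percent_earned = sum(points_earned, na.rm = TRUE) /
              sum(points_possible, na.rm = TRUE))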
gb
## # A tibble: 14,340 × 12
## course_id student_id gradebook_item item_position gradebook_type
## <chr> <dbl> <chr> <dbl> <chr>
## 1 FrScA-S216-02 43146 0-1.1: Intro Assignmen… 10 N
## 2 FrScA-S216-02 43146 0-1.2: Intro Assignmen… 11 N
## 3 FrScA-S216-02 43146 0-1.3: Intro Assignmen… 12 N
## 4 FrScA-S216-02 43146 1-1.1: Lesson 1-1 Grap… 13 N
## 5 FrScA-S216-02 43146 1-2.1: Explore a Caree… 14 N
## 6 FrScA-S216-02 43146 1-2.2: Explore a Caree… 15 N
## 7 FrScA-S216-02 43146 PROGRESS CHECK 1 @ 02-… 16 P
## 8 FrScA-S216-02 43146 1-2.3: Lesson 1-2 Grap… 17 N
## 9 FrScA-S216-02 43146 Unit 1 Assessment 18 N
## 10 FrScA-S216-02 43146 2-1.1: Crime Scene DB … 19 N
## # … with 14,330 more rows, and 7 more variables: gradebook_date <chr>,
## # grade_category <chr>, status <lgl>, points_earned <dbl>,
## # points_attempted <dbl>, points_possible <dbl>, last_access_date <time>
Let’s get to feature engineering. First, we’ll have to group our data by course and student ID.
gb <- gb %>%
group_by(course_id, student_id)
Next, let's create a variable with the percent of points earned (points earned divided by points possible). To do so, add to the mutate() function below: create a new variable called percent_earned. You can read more about mutate here.
gb <- gb %>%
mutate(percent_earned = points_earned/points_possible)
Finally, let's create three features from the gradebook data:

overall_percent_earned: the sum of points earned divided by the sum of points possible
variability_percent_earned: the standard deviation of percent_earned across gradebook items
n_with_100_pct: the number of gradebook items on which the student earned 100% of the points

You can probably imagine others; you're welcome to explore adding those, too.
We’ll use summarize to do this, as below:
gb <- gb %>%
summarize(overall_percent_earned = sum(points_earned, na.rm = TRUE) / sum(points_possible, na.rm = TRUE),
variability_percent_earned = sd(percent_earned, na.rm = TRUE),
n_with_100_pct = sum(percent_earned == 1, na.rm = TRUE)) %>%
select(student_id, course_id, overall_percent_earned, variability_percent_earned, n_with_100_pct) # selecting just the variables we'll use
## `summarise()` has grouped output by 'course_id'. You can override using the
## `.groups` argument.
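If you would like to verify that the summarizing worked as intended (an optional sanity check), gb should now contain one row per course-student combination:

nrow(gb) # the total number of rows
n_distinct(paste(gb$course_id, gb$student_id)) # unique course-student pairs; should match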
We have one last step before we can get to modeling: joining the gradebook features (gb) with all of the other data (d).
d <- d %>%
left_join(gb)
## Joining, by = c("student_id", "course_id")
Let's take a look at the joined data to make sure everything is looking as we intend it to. Inspect the data using the code chunk below:
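For example (one option among several), glimpse() provides a compact overview of the joined data, including the three new gradebook features:

glimpse(d) # each variable, its type, and its first few values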
Next, we’ll split the data, just like before. We’ll set the seed again to ensure that we obtain the same results (when running the analysis again and between analysts at the LASER Institute). We use an 80% split again; how will you “spend” your data? You can change this number if you wish, but consider how much data you have to “spend” for both training and testing.
set.seed(20220712)
train_test_split <- initial_split(d, prop = .80)
data_train <- training(train_test_split)
Here's a key difference! Pay careful attention to this next line of code, which sets the groundwork for k-folds cross-validation. Note that in the function below (run ?vfold_cv to see more), the letter v is used instead of k, though they share a meaning, as the documentation notes.
kfcv <- vfold_cv(data_train) # this differentiates this from what we did before
# before, we simply used data_train to fit our model
kfcv
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [392/44]> Fold01
## 2 <split [392/44]> Fold02
## 3 <split [392/44]> Fold03
## 4 <split [392/44]> Fold04
## 5 <split [392/44]> Fold05
## 6 <split [392/44]> Fold06
## 7 <split [393/43]> Fold07
## 8 <split [393/43]> Fold08
## 9 <split [393/43]> Fold09
## 10 <split [393/43]> Fold10
Above, we split the data into 10 different folds. Change the number of folds from 10 to 20 by changing the value of v; 10 is simply the default. For help, run ?vfold_cv to get a hint.
kfcv <- vfold_cv(data_train, v = 20) # this differentiates this from what we did before
# before, we simply used data_train to fit our model
kfcv
## # 20-fold cross-validation
## # A tibble: 20 × 2
## splits id
## <list> <chr>
## 1 <split [414/22]> Fold01
## 2 <split [414/22]> Fold02
## 3 <split [414/22]> Fold03
## 4 <split [414/22]> Fold04
## 5 <split [414/22]> Fold05
## 6 <split [414/22]> Fold06
## 7 <split [414/22]> Fold07
## 8 <split [414/22]> Fold08
## 9 <split [414/22]> Fold09
## 10 <split [414/22]> Fold10
## 11 <split [414/22]> Fold11
## 12 <split [414/22]> Fold12
## 13 <split [414/22]> Fold13
## 14 <split [414/22]> Fold14
## 15 <split [414/22]> Fold15
## 16 <split [414/22]> Fold16
## 17 <split [415/21]> Fold17
## 18 <split [415/21]> Fold18
## 19 <split [415/21]> Fold19
## 20 <split [415/21]> Fold20
Here, we’ll carry out several feature engineering steps.
Read about possible steps and see more about how the following five feature engineering steps work. Like in the first learning lab, this is the step in which we set the recipe:

step_normalize(all_numeric_predictors())
step_nzv(all_predictors())
step_novel(all_nominal_predictors())
step_dummy(all_nominal_predictors())
step_impute_knn(all_predictors(), all_outcomes())

my_rec <- recipe(final_grade ~ ., data = data_train) %>%
step_normalize(all_numeric_predictors()) %>% # standardizes numeric variables
step_nzv(all_predictors()) %>% # remove predictors with a "near-zero variance"
    step_novel(all_nominal_predictors()) %>% # add a new level for previously unseen (novel) factor values
step_dummy(all_nominal_predictors()) %>% # dummy code all factor variables
step_impute_knn(all_predictors()) # impute missing data for all predictor variables
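If you are curious about what these steps do to the data, one optional way to peek (a sketch, not a required part of the workflow) is to prep the recipe and bake the training data:

# optional: preview the preprocessed training data the recipe produces
my_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()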
Next, we specify the model and workflow, using the same engine but a different model and mode: here, regression for a continuous outcome. Specifically, we use:

the linear_reg() function to set the model
set_engine("glm") to set the engine
set_mode("regression") to set the mode

# specify model
my_mod <-
linear_reg() %>%
set_engine("glm") %>%
set_mode("regression")
Last, we’ll put the pieces together - the model and recipe - in a workflow.
# specify workflow
my_wf <-
workflow() %>%
add_model(my_mod) %>%
add_recipe(my_rec)
Note that here we use the kfcv resamples we created above. We'll run the fitting in the next chunk; we can ignore the warnings and messages we see.
fitted_model_resamples <- fit_resamples(my_wf, resamples = kfcv,
control = control_grid(save_pred = TRUE)) # this allows us to inspect the predictions
## ! Fold01: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold01: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold02: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold02: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold03: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold03: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold04: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold04: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold05: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold05: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold06: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold06: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold07: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold07: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold08: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold08: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold09: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold09: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold10: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold10: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold11: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold11: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold12: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold12: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold13: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold13: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold14: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold14: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold15: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold15: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold16: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold16: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold17: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold17: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold18: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold18: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold19: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold19: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
## ! Fold20: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! Fold20: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
What did we get as output? Let's take a look at the metrics. This is critical to understanding how and why we use k-folds cross-validation. Each of the rows below represents the RMSE (in the .estimate column) for each of the 20 folds that we used to train our model; our model was fit 20 times, and the RMSE was calculated separately for each of these fits. Next, we'll summarize these.
Recall our definition of the Root Mean Squared Error (RMSE): it is the square root of the mean of the squared errors, or squared differences between the predicted and known y values (here, students' final grades). Because the square root undoes the squaring, RMSE is on the same scale as the outcome, and its interpretation can be considerably simplified: RMSE can be interpreted as the average error, or typical difference between the predicted and known y values. This, along with the Mean Squared Error (MSE), is among the most common metrics of predictive accuracy for a numeric outcome such as students' final grade. See more about fit metrics for numeric/continuous outcomes (those utilized in a regression mode) here. The goal is to minimize both the RMSE and MSE.
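As a quick illustration with made-up numbers (purely hypothetical values, not from our data), here is RMSE computed by hand:

known <- c(90, 75, 88, 62) # hypothetical actual final grades
predicted <- c(85, 80, 90, 70) # hypothetical predicted final grades
sqrt(mean((predicted - known)^2)) # the RMSE; about 5.4 grade points here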
Note that the common R-squared measure (rsq in the output) can also be interpreted. Though helpful descriptively, it is less useful as a measure of the predictive effectiveness of a trained model, and it should generally not be used to select between competing model specifications.
fitted_model_resamples %>%
unnest(.metrics) %>%
filter(.metric == "rmse") # we also get another metric, the RSQ; we focus just on RMSE for nwo
## # A tibble: 20 × 8
## splits id .metric .estimator .estimate .config .notes
## <list> <chr> <chr> <chr> <dbl> <chr> <list>
## 1 <split [414/22]> Fold01 rmse standard 8.58 Preprocessor1_… <tibble>
## 2 <split [414/22]> Fold02 rmse standard 5.64 Preprocessor1_… <tibble>
## 3 <split [414/22]> Fold03 rmse standard 13.3 Preprocessor1_… <tibble>
## 4 <split [414/22]> Fold04 rmse standard 6.22 Preprocessor1_… <tibble>
## 5 <split [414/22]> Fold05 rmse standard 9.08 Preprocessor1_… <tibble>
## 6 <split [414/22]> Fold06 rmse standard 6.84 Preprocessor1_… <tibble>
## 7 <split [414/22]> Fold07 rmse standard 13.5 Preprocessor1_… <tibble>
## 8 <split [414/22]> Fold08 rmse standard 9.17 Preprocessor1_… <tibble>
## 9 <split [414/22]> Fold09 rmse standard 8.45 Preprocessor1_… <tibble>
## 10 <split [414/22]> Fold10 rmse standard 6.94 Preprocessor1_… <tibble>
## 11 <split [414/22]> Fold11 rmse standard 9.92 Preprocessor1_… <tibble>
## 12 <split [414/22]> Fold12 rmse standard 7.68 Preprocessor1_… <tibble>
## 13 <split [414/22]> Fold13 rmse standard 8.16 Preprocessor1_… <tibble>
## 14 <split [414/22]> Fold14 rmse standard 5.97 Preprocessor1_… <tibble>
## 15 <split [414/22]> Fold15 rmse standard 13.1 Preprocessor1_… <tibble>
## 16 <split [414/22]> Fold16 rmse standard 10.8 Preprocessor1_… <tibble>
## 17 <split [415/21]> Fold17 rmse standard 8.89 Preprocessor1_… <tibble>
## 18 <split [415/21]> Fold18 rmse standard 6.13 Preprocessor1_… <tibble>
## 19 <split [415/21]> Fold19 rmse standard 16.7 Preprocessor1_… <tibble>
## 20 <split [415/21]> Fold20 rmse standard 6.92 Preprocessor1_… <tibble>
## # … with 1 more variable: .predictions <list>
Running the code below calculates the mean of the metrics we inspected in the previous chunk. Focus on the mean value for the rmse metric. Note that this is interpreted differently from the accuracy measure we calculated in learning lab 1 (the percentage of students the model correctly classified as passing or not passing the course): here, the mean represents the typical difference, averaged across the 20 folds, between students' predicted and actual final grades.
# fit stats
fitted_model_resamples %>%
collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 9.10 20 0.672 Preprocessor1_Model1
## 2 rsq standard 0.756 20 0.0402 Preprocessor1_Model1
We can imagine trying out many different sets of features (engineered in different ways). So long as we evaluate the accuracy using the resampling method used above, we can repeat this process as needed. Then, we can carry out a process like that in the first learning lab: fitting the model not using the different folds obtained through the vfold_cv() function, but rather using the entire training data set.
fitted_model <- fit(my_wf, data_train)
Then, we can use the model to predict students' final grades using our testing data, which we have not used for any purpose until this point, and interpret that model. This output suggests how the model would perform on new data, as this testing data set has not been used to make any decisions about the feature engineering.
final_fit <- last_fit(fitted_model, train_test_split)
## ! train/test split: preprocessor 1/1: skipping variable with zero or non-finite range.
## ! train/test split: preprocessor 1/1, model 1/1 (predictions): skipping variable with zero or non-finite range., prediction from a rank...
collect_metrics(final_fit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 8.90 Preprocessor1_Model1
## 2 rsq standard 0.817 Preprocessor1_Model1
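To help interpret the model itself, one option (a sketch using extract_fit_parsnip() from the workflows package and tidy() from broom, both loaded with tidymodels) is to inspect the coefficients of the model fit to the training data:

fitted_model %>%
  extract_fit_parsnip() %>% # pull out the underlying parsnip model fit
  tidy() # coefficient estimates, standard errors, and p-values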
Last, we can plot the predicted versus known y variables to gain a graphical sense for how the model performed:
collect_predictions(final_fit) %>%
ggplot(aes(x = .pred, y = final_grade)) +
geom_point()
## Warning: Removed 5 rows containing missing values (geom_point).
Consider making a modification to the above plot (small or large) using ggplot2.
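For example, one possible modification (just a sketch; try your own instead or in addition) adds a dashed identity line where predicted and actual grades are equal, along with clearer axis labels:

collect_predictions(final_fit) %>%
  ggplot(aes(x = .pred, y = final_grade)) +
  geom_point(alpha = 0.5) + # lighter points to show overlapping values
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # perfect-prediction line
  labs(x = "Predicted final grade", y = "Actual final grade")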
Congratulations - you’ve completed this case study! Consider moving on to the badge activity next.