In Unit 2, we pick up where we left off and learn to apply some basic machine learning techniques to help us understand how a predictive model might be developed and tested as part of an early warning system to identify students at risk of failing. Specifically, we will make a very crude first attempt at developing and testing a logistic regression model and a random forest model that can (we hope!) accurately predict whether a student is likely to pass or fail an online course. Unlike Unit 1, where we were interested in identifying factors from data collected throughout the course that might help explain why students earned the grade they did, we are instead interested in identifying students who may be at risk before it is too late to intervene. Therefore we will only use information that is available at the start of the course.
The tidymodels package is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse. It includes a core set of packages that are loaded on startup and provides tools for splitting and resampling data, preprocessing predictors, fitting models, tuning them, and measuring their performance.
In addition to the {tidymodels} package, we’ll also be using the {tidyverse}, {skimr}, and {here} packages we learned about in Unit 1. Use the code chunk below to load these packages:
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom 0.7.9 ✓ recipes 0.1.17
## ✓ dials 0.0.10 ✓ rsample 0.1.0
## ✓ dplyr 1.0.7 ✓ tibble 3.1.5
## ✓ ggplot2 3.3.5 ✓ tidyr 1.1.4
## ✓ infer 1.0.0 ✓ tune 0.1.6
## ✓ modeldata 0.1.1 ✓ workflows 0.2.3
## ✓ parsnip 0.1.7 ✓ workflowsets 0.1.0
## ✓ purrr 0.3.4 ✓ yardstick 0.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ✓ stringr 1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x stringr::fixed() masks recipes::fixed()
## x dplyr::lag() masks stats::lag()
## x readr::spec() masks yardstick::spec()
library(skimr)
library(here)
## here() starts at /cloud/project
For this case study, we will again be working with data from the online courses introduced in Unit 1. This data is located in the data folder and named data-to-explore.csv. Fortunately, our data has already been largely wrangled. This will save us quite a bit of time, which we’ll need to help wrap our heads around some supervised machine learning basics.
Use the code chunk below to read this data into your R environment and assign the name data_to_model. Note, you may choose to use the {here} package or you may specify the file path directly:
data_to_model <- read_csv(here("data", "data-to-explore.csv"))
## Rows: 943 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): student_id, subject, semester, section, gender, enrollment_reason...
## dbl (23): total_points_possible, total_points_earned, proportion_earned, ti...
## dttm (3): date_x, date_y, date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
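The code above uses the {here} package. If you would rather specify the file path directly, as noted above, a minimal alternative is sketched below (this assumes your working directory is the project root reported by here(), with the file sitting in its data folder):
# Equivalent read without {here}; the relative path assumes the project layout above
data_to_model <- read_csv("data/data-to-explore.csv")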
As noted in Chapter 14 of Data Science in Education Using R, it’s good practice to inspect your data and make sure it looks the way you expect it to.
Use the code chunk below to take a look at our data, then answer the questions that follow:
glimpse(data_to_model)
## Rows: 943
## Columns: 34
## $ student_id <chr> "43146", "44638", "47448", "47979", "48797", "51…
## $ subject <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrSc…
## $ semester <chr> "S216", "S116", "S216", "S216", "S116", "S216", …
## $ section <chr> "02", "01", "01", "01", "01", "03", "01", "01", …
## $ total_points_possible <dbl> 1217, 1676, 1232, 1833, 2225, 1222, 1775, 2225, …
## $ total_points_earned <dbl> 1150.00, 1384.23, 1116.00, 1492.73, 1994.75, 70.…
## $ proportion_earned <dbl> 0.94494659, 0.82591289, 0.90584416, 0.81436443, …
## $ gender <chr> "F", "M", "F", "M", "M", "M", "F", "F", "F", "F"…
## $ enrollment_reason <chr> "Course Unavailable at Local School", "Course Un…
## $ enrollment_status <chr> "Approved/Enrolled", "Approved/Enrolled", "Appro…
## $ time_spent <dbl> 1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.…
## $ time_spent_hours <dbl> 25.91944500, 23.04500167, 14.34055833, 26.643610…
## $ course_id <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
## $ int <dbl> 4.2, 4.0, 4.2, 4.0, 3.8, 3.8, 3.6, 4.2, 3.8, NA,…
## $ val <dbl> 3.666667, 3.000000, 3.000000, 3.666667, 3.666667…
## $ percomp <dbl> 4.0, 3.0, 3.0, 2.5, 3.5, 3.5, 3.0, 3.0, 3.0, NA,…
## $ tv <dbl> 3.857143, 3.571429, 3.714286, 3.857143, 3.714286…
## $ q1 <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 5, NA, 5, 4, 3, 4, 4, 5,…
## $ q2 <dbl> 4, 2, 3, 3, 4, 4, 4, 4, 2, NA, 4, 5, 1, 3, 2, 5,…
## $ q3 <dbl> 4, 2, 3, 2, 3, 3, 4, 3, 3, NA, 4, 4, 3, 3, 2, 4,…
## $ q4 <dbl> 5, 4, 4, 4, 4, 4, 2, 4, 4, NA, 5, 4, 4, 4, 5, 5,…
## $ q5 <dbl> 4, 4, 4, 4, 4, 3, 4, 4, 4, NA, 5, 5, 4, 4, 5, 5,…
## $ q6 <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 2, NA, 3, 5, 3, 3, 3, 5,…
## $ q7 <dbl> 4, 4, 3, 3, 4, 4, 2, 3, 3, NA, 4, 4, 4, 4, 4, 4,…
## $ q8 <dbl> 4, 4, 4, 4, 4, 4, 4, 5, 4, NA, 5, 5, 3, 4, 5, 5,…
## $ q9 <dbl> 3, 3, 3, 4, 3, 4, 4, 3, 2, NA, 3, 4, 2, 3, 1, 5,…
## $ q10 <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 2, NA, 4, 5, 3, 4, 3, 5,…
## $ date_x <dttm> 2016-02-02 18:44:00, 2015-09-09 13:41:00, 2016-…
## $ post_int <dbl> NA, NA, NA, NA, NA, NA, NA, 3.50, 3.75, NA, NA, …
## $ post_uv <dbl> NA, NA, NA, NA, NA, NA, NA, 3.666667, 2.000000, …
## $ post_tv <dbl> NA, NA, NA, NA, NA, NA, NA, 3.571429, 3.000000, …
## $ post_percomp <dbl> NA, NA, NA, NA, NA, NA, NA, 3.5, 3.0, NA, NA, 4.…
## $ date_y <dttm> NA, NA, NA, NA, NA, NA, NA, 2016-01-02 00:41:00…
## $ date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
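Since we also loaded the {skimr} package, you could get a richer summary, including counts of missing values per variable, with the skim() function. A quick optional sketch:
# skim() reports missingness and summary statistics for every column
skim(data_to_model)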
How many observations and variables are in our dataset?
Since our goal is to develop a model that predicts whether students are likely to be at risk of failing, what variable might we use as the outcome variable for our model?
Based on your prior work with this data, and the descriptions of the data in Chapter 7 of DSIEUR, what are some variables collected prior to the course that you think might make good predictors of at risk students? Why?
subject - some classes are potentially more difficult
gender - gender makeup of class could influence study habits
enrollment_reason - personal motivation may influence study effort
int - interest level may be a predictor of intrinsic motivation
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018). In Part 2, we focus on the following wrangling processes:
Isolate Predictors. In this section, we will narrow down our data set to all potential predictors of interest.
Preprocess Data. We create the binary outcome variable we are interested in predicting and remove quite a bit of cruft.
Split Data. We split our data into a training and test set that will be used to develop a predictive model.
Create a Recipe. Finally, we will try out a couple of different “recipes” for our predictive model and learn how to deal with nominal data that we would like to use as predictors.
As you’ve probably noticed, there is a lot of great information in this dataset - but we won’t need all of it. Estrellado et al. note in DSIEUR that for statistical reasons and as a good general practice, it’s better to select a few variables you are interested in and go from there:
At a certain point, adding more variables will appear to make your analysis more accurate, but will in fact obscure the truth from you.
Unlike Unit 1, we are not so much interested in explaining why students earned the grade they did after the fact, or in which variables might be significant predictors of their final grade. Instead, we are interested in a more immediate and practical application of modeling: developing a model, using machine learning techniques, that can accurately predict whether a student is likely to pass or fail an online course using only information available prior to the course.
Our goal, therefore, is to identify students who may be at risk of failing using information that is available to teachers before it is too late to intervene.
Complete the code chunk below to “select” (hint, hint) proportion_earned for our outcome variable, as well as any variables from our data_to_model data frame that teachers would have access to at the start of the course (e.g., survey data, course info, student ID, and demographics). Save the result as a new data frame named course_data and inspect your data.
course_data <- data_to_model %>%
select(student_id, proportion_earned, subject, semester,
section, enrollment_reason, gender, int, val, percomp,
q1:q10)
Prior to developing a predictive model from our data, we have a little bit of data processing to do, after which we will divide, or “split,” our data set into two separate data frames, one for training our predictive model and one for testing our model to see how well it performs.
First, however, we need an outcome variable that lets us know whether each student passed or failed.
Let’s combine the hopefully familiar mutate() function with the if_else() function, also from the {dplyr} package, to create a new variable called at_risk which indicates “no” for students whose proportion_earned is greater than or equal to 66.7% (NC State’s cutoff for a D-) and “yes” if it is not.
course_data <- course_data %>%
mutate(at_risk = if_else(proportion_earned >= .667, "no", "yes"))
As you may have noticed from your inspection of the data earlier, some students do not have course grade data at all. Let’s use the drop_na() function from the {tidyr} package to remove observations with no grade data, and use the select() function again to remove the proportion_earned variable entirely so we’re not tempted to cheat and use what would be a VERY good predictor of being “at risk”:
course_data <- course_data %>%
drop_na(proportion_earned) %>%
select(-proportion_earned)
Let’s also use the na.omit() function introduced previously to delete cases listwise (that is, an entire row is deleted if any single value is missing), since most models do not accept cases with missing data:
course_data <- course_data %>%
na.omit()
While inspecting the data, you may have noticed that many of our predictors of interest are character variables. For modeling, it is better to have qualitative variables like gender encoded as factors instead of character strings. Factors store data as categorical variables, each with its own levels. Because categorical variables are used in statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat them correctly.
To do so, we introduce a variation of the mutate() function that will look for any variable that is a character and recode it as a factor:
course_data <- course_data %>%
mutate_if(is.character, as.factor)
course_data
## # A tibble: 533 × 20
## student_id subject semester section enrollment_reason gender int val
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 43146 FrScA S216 02 Course Unavailable at… F 4.2 3.67
## 2 44638 OcnA S116 01 Course Unavailable at… M 4 3
## 3 47448 FrScA S216 01 Other F 4.2 3
## 4 47979 OcnA S216 01 Course Unavailable at… M 4 3.67
## 5 48797 PhysA S116 01 Learning Preference o… M 3.8 3.67
## 6 51943 FrScA S216 03 Learning Preference o… M 3.8 3.67
## 7 52326 AnPhA S216 01 Other F 3.6 4
## 8 52446 PhysA S116 01 Learning Preference o… F 4.2 3.67
## 9 53447 FrScA S116 01 Course Unavailable at… F 3.8 2
## 10 53475 FrScA S216 01 Course Unavailable at… F 4.8 3.33
## # … with 523 more rows, and 12 more variables: percomp <dbl>, q1 <dbl>,
## # q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
## # q9 <dbl>, q10 <dbl>, at_risk <fct>
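As an aside, mutate_if() still works but has been superseded in more recent versions of {dplyr}. If you prefer the newer style, an equivalent step using across() (assuming {dplyr} 1.0 or later, which we have) would be:
# Equivalent to mutate_if(is.character, as.factor) above
course_data %>%
  mutate(across(where(is.character), as.factor))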
To reduce the redundancy caused by demonstrating each step separately above, complete the code below by using the pipe operator to repeat all of these steps, from isolating predictors through dropping missing values of proportion_earned and removing all incomplete cases.
course_data <- data_to_model %>%
select(student_id, proportion_earned, subject, semester,
section, enrollment_reason, gender, int, val, percomp,
q1:q10) %>%
mutate(at_risk = if_else(proportion_earned >= .667, "no", "yes")) %>%
drop_na(proportion_earned) %>%
select(-proportion_earned) %>%
na.omit() %>%
mutate_if(is.character, as.factor)
Now inspect your data again and make sure it looks as expected:
course_data
## # A tibble: 533 × 20
## student_id subject semester section enrollment_reason gender int val
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 43146 FrScA S216 02 Course Unavailable at… F 4.2 3.67
## 2 44638 OcnA S116 01 Course Unavailable at… M 4 3
## 3 47448 FrScA S216 01 Other F 4.2 3
## 4 47979 OcnA S216 01 Course Unavailable at… M 4 3.67
## 5 48797 PhysA S116 01 Learning Preference o… M 3.8 3.67
## 6 51943 FrScA S216 03 Learning Preference o… M 3.8 3.67
## 7 52326 AnPhA S216 01 Other F 3.6 4
## 8 52446 PhysA S116 01 Learning Preference o… F 4.2 3.67
## 9 53447 FrScA S116 01 Course Unavailable at… F 3.8 2
## 10 53475 FrScA S216 01 Course Unavailable at… F 4.8 3.33
## # … with 523 more rows, and 12 more variables: percomp <dbl>, q1 <dbl>,
## # q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
## # q9 <dbl>, q10 <dbl>, at_risk <fct>
Hint: You should see a total of 533 observations and 20 variables, including 1 outcome variable named at_risk, 1 student identifier named student_id, and 18 potential predictors.
One last step before splitting our data is to take a quick look at the proportion of students in our final data set that were identified as at-risk.
To do so, first let’s count the number of students labeled “yes” or “no” for at_risk, and then create a new variable called proportion that takes the number n of each and divides it by the total:
course_data %>%
count(at_risk) %>%
mutate(proportion = n/sum(n))
## # A tibble: 2 × 3
## at_risk n proportion
## <fct> <int> <dbl>
## 1 no 439 0.824
## 2 yes 94 0.176
As you can see, roughly 17% of students in our data set were identified as “yes” for being at risk.
It is common when beginning a modeling project to separate the data set into two partitions:
The training set is used to estimate parameters, develop and compare models and feature engineering techniques, tune models, etc.
The test set is held in reserve until the end of the project, at which point there should only be one or two models under serious consideration. It is used as an unbiased source for measuring final model performance.
There are different ways to create these partitions of the data and there is no uniform guideline for determining how much data should be set aside for testing. The proportion of data can be driven by many factors, including the size of the original pool of samples and the total number of predictors.
After you decide how much to set aside, the most common approach for actually partitioning your data is to use a random sample. For our purposes, we’ll use random sampling to select 25% for the test set and use the remainder for the training set, which are the defaults for the {rsample} package.
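For reference, if you ever want a different split, say 80% training and 20% testing, initial_split() accepts a prop argument. A quick sketch (not run here; course_split_80 is just an illustrative name):
# prop sets the proportion of data kept for the training (analysis) set
course_split_80 <- initial_split(course_data, prop = 0.80)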
Additionally, since random sampling uses random numbers, it is important to set the random number seed. This ensures that the random numbers can be reproduced at a later time (if needed).
The initial_split() function from the {rsample} package takes the original data and saves the information on how to make the partitions.
Run the following code to set the random number seed and make our initial data split:
set.seed(586)
course_split <- initial_split(course_data,
strata = at_risk)
Note that we used the strata = argument, which conducts a stratified split. This ensures that, despite the imbalance we noticed in our at_risk variable, our training and test data sets will keep roughly the same proportions of at-risk students as in the original data.
Type course_split into the code chunk below, run it, and answer the question that follows:
course_split
## <Analysis/Assess/Total>
## <399/134/533>
How many observations should we expect to see in our training and test sets respectively?
The {rsample} package has two aptly named functions for creating training and testing data sets: training() and testing(), respectively.
Run the following code to split the data into training and testing sets:
train_data <- training(course_split)
test_data <- testing(course_split)
First take a look at the training and testing sets we just created:
train_data
## # A tibble: 399 × 20
## student_id subject semester section enrollment_reason gender int val
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 43146 FrScA S216 02 Course Unavailable at… F 4.2 3.67
## 2 44638 OcnA S116 01 Course Unavailable at… M 4 3
## 3 47979 OcnA S216 01 Course Unavailable at… M 4 3.67
## 4 52326 AnPhA S216 01 Other F 3.6 4
## 5 52446 PhysA S116 01 Learning Preference o… F 4.2 3.67
## 6 54066 OcnA S116 01 Other F 4.6 4.67
## 7 54282 OcnA S116 02 Course Unavailable at… F 3.4 2
## 8 54342 OcnA S116 02 Learning Preference o… F 4 3
## 9 55078 FrScA S216 01 Learning Preference o… M 4.4 3.67
## 10 55140 PhysA S116 01 Course Unavailable at… M 5 4
## # … with 389 more rows, and 12 more variables: percomp <dbl>, q1 <dbl>,
## # q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
## # q9 <dbl>, q10 <dbl>, at_risk <fct>
test_data
## # A tibble: 134 × 20
## student_id subject semester section enrollment_reason gender int val
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 47448 FrScA S216 01 Other F 4.2 3
## 2 48797 PhysA S116 01 Learning Preference o… M 3.8 3.67
## 3 53447 FrScA S116 01 Course Unavailable at… F 3.8 2
## 4 53475 FrScA S216 01 Course Unavailable at… F 4.8 3.33
## 5 54434 PhysA S116 01 Learning Preference o… F 5 5
## 6 54567 OcnA S216 02 Course Unavailable at… M 4 3.67
## 7 57489 PhysA S116 01 Course Unavailable at… M 4.6 3.67
## 8 60186 AnPhA S116 01 Course Unavailable at… M 4.2 4.33
## 9 61357 FrScA S116 02 Course Unavailable at… F 5 4.67
## 10 61941 FrScA S116 01 Course Unavailable at… F 4.6 4.33
## # … with 124 more rows, and 12 more variables: percomp <dbl>, q1 <dbl>,
## # q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
## # q9 <dbl>, q10 <dbl>, at_risk <fct>
Next, recycle the code from above to check that the proportions of at-risk students in our training and test data are close to those in our overall course_data:
train_data %>%
count(at_risk) %>%
mutate(proportion = n/sum(n))
## # A tibble: 2 × 3
## at_risk n proportion
## <fct> <int> <dbl>
## 1 no 329 0.825
## 2 yes 70 0.175
test_data %>%
count(at_risk) %>%
mutate(proportion = n/sum(n))
## # A tibble: 2 × 3
## at_risk n proportion
## <fct> <int> <dbl>
## 1 no 110 0.821
## 2 yes 24 0.179
Now answer the following questions:
Does the number of observations in each set match your expectations? Why?
Do the proportions of at-risk students in each set match your expectations? Why?
In this section, we introduce another tidymodels package, recipes, which is designed to help you prepare your data before training your model. Recipes are built as a series of preprocessing steps, such as:
converting qualitative predictors to indicator variables (also known as dummy variables),
transforming data to be on a different scale (e.g., taking the logarithm of a variable),
transforming whole groups of predictors together,
extracting key features from raw variables (e.g., getting the day of the week out of a date variable),
and so on. If you are familiar with R’s formula interface, a lot of this might sound familiar and like what a formula already does. Recipes can be used to do many of the same things, but they have a much wider range of possibilities.
To get started, let’s create a recipe for a simple logistic regression model. Before training the model, we can use a recipe to create a few new predictors and conduct some preprocessing required by the model.
The recipe() function as we will use it here has two arguments:
A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (at_risk in our case). On the right-hand side of the tilde are the predictors. Variables may be listed by name, or you can use the dot (.) to indicate all other variables as predictors.
The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so data = train_data here. Naming a data set doesn’t actually change the data itself; it is only used to catalog the names of the variables and their types, like factors, integers, dates, etc.
Let’s create our very first recipe using at_risk as our outcome variable; int and gender and enrollment_reason as predictors; and train_data as our data to train:
lr_recipe_1 <- recipe(at_risk ~ int + gender + enrollment_reason,
data = train_data)
Now let’s take a quick peek at our recipe, then create a quick summary of it using the summary() function:
lr_recipe_1
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 3
summary(lr_recipe_1)
## # A tibble: 4 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 int numeric predictor original
## 2 gender nominal predictor original
## 3 enrollment_reason nominal predictor original
## 4 at_risk nominal outcome original
You can see that our recipe has three predictors and one outcome, just as expected.
Because we’ll be using a simple logistic regression model, variables like gender and enrollment_reason will need to be coded as dummy variables. Dummy coding means transforming a variable with multiple categories into new binary variables, where each indicates the presence or absence of a category. For example, gender will be recoded to gender_M, where 1 indicates male and 0 indicates female.
Unlike the standard model formula methods in R, a recipe does not automatically create these dummy variables for you; you’ll need to tell your recipe to add this step. This is for two reasons. First, many models do not require numeric predictors, so dummy variables may not always be preferred. Second, recipes can also be used for purposes outside of modeling, where non-dummy versions of the variables may work better. For example, you may want to make a table or a plot with a variable as a single factor.
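To illustrate that last point, here is a quick optional sketch of using enrollment_reason as a single factor, no dummy coding required, for a simple table and plot:
# A count table and bar chart work directly on the factor version of the variable
train_data %>%
  count(enrollment_reason)

ggplot(train_data, aes(y = enrollment_reason)) +
  geom_bar()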
For these reasons, you need to explicitly tell recipes to create dummy variables using step_dummy(). Let’s add this to our recipe and include all_nominal_predictors() to tell our recipe to change all of our factor variables to dummy variables:
lr_recipe_1 <-
recipe(at_risk ~ int + gender + enrollment_reason,
data = train_data) %>%
step_dummy(all_nominal_predictors())
lr_recipe_1
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 3
##
## Operations:
##
## Dummy variables from all_nominal_predictors()
summary(lr_recipe_1)
## # A tibble: 4 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 int numeric predictor original
## 2 gender nominal predictor original
## 3 enrollment_reason nominal predictor original
## 4 at_risk nominal outcome original
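If you are curious what the dummy-coded data actually looks like, you can prep() the recipe and then bake() it. A quick sketch (in recent versions of {recipes}, bake() with new_data = NULL returns the processed training data):
# Estimate the recipe steps from the training data, then apply them
lr_recipe_1 %>%
  prep(training = train_data) %>%
  bake(new_data = NULL) %>%
  glimpse()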
Before training our model, let’s create a second recipe just for contrast that includes all of our predictors by adding the . after the tilde as noted above.
In addition to our step for creating dummy variables, we’ll also have to indicate that our student_id variable is not used for prediction purposes, but rather as a unique identifier for our students.
To do so, we will add a role to this recipe using the update_role() function, which lets recipes know that student_id is a variable with a custom role that we called “ID” (a role can have any character value). Whereas our formula includes all variables in the training set other than at_risk as predictors, this tells the recipe to keep student_id but not use it as either an outcome or a predictor.
Run the following code to assign the “ID” role to student_id in our recipe:
lr_recipe_2 <-
recipe(at_risk ~ .,
data = train_data) %>%
step_dummy(all_nominal_predictors()) %>%
update_role(student_id, new_role = "ID")
lr_recipe_2
## Recipe
##
## Inputs:
##
## role #variables
## ID 1
## outcome 1
## predictor 18
##
## Operations:
##
## Dummy variables from all_nominal_predictors()
summary(lr_recipe_2)
## # A tibble: 20 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 student_id nominal ID original
## 2 subject nominal predictor original
## 3 semester nominal predictor original
## 4 section nominal predictor original
## 5 enrollment_reason nominal predictor original
## 6 gender nominal predictor original
## 7 int numeric predictor original
## 8 val numeric predictor original
## 9 percomp numeric predictor original
## 10 q1 numeric predictor original
## 11 q2 numeric predictor original
## 12 q3 numeric predictor original
## 13 q4 numeric predictor original
## 14 q5 numeric predictor original
## 15 q6 numeric predictor original
## 16 q7 numeric predictor original
## 17 q8 numeric predictor original
## 18 q9 numeric predictor original
## 19 q10 numeric predictor original
## 20 at_risk nominal outcome original
With tidymodels, we start building a model by specifying the functional form of the model that we want using the {parsnip} package. Since our outcome is binary, the model type we will use is “logistic regression.” We can declare this with logistic_reg() and assign it to an object we will later use in our workflow:
lr_mod <- logistic_reg()
That is pretty underwhelming since, on its own, it doesn’t really do much. However, now that the type of model has been specified, we can state a method for fitting, or training, the model using a model engine.
The engine value is often a mash-up of the software that can be used to fit or train the model as well as the estimation method. For example, we will use “glm,” a generalized linear model for binary outcomes and the default engine for logistic regression in the {parsnip} package.
Run the following code to finish specifying our model:
lr_mod <-
logistic_reg() %>%
set_engine("glm")
We will want to use our recipes created earlier across several steps as we train and test our model. To simplify this process, we can use a model workflow, which pairs a model and recipe together.
This is a straightforward approach because different recipes are often needed for different models, so when a model and recipe are bundled, it becomes easier to train and test workflows.
We’ll use the {workflows} package from tidymodels to bundle our parsnip model (lr_mod) with our first recipe (lr_recipe_1).
lr_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe_1)
Now that we have a single workflow that can be used to prepare the recipe and train the model from the resulting predictors, we can use the fit() function to fit our model to our train_data. And again, we set a random number seed to ensure that if we run this same code again, we will get the same results:
set.seed(586)
lr_fit <-
lr_workflow %>%
fit(data = train_data)
This lr_fit object has the finalized recipe and fitted model objects inside. To extract the model fit from the workflow, we will use the helper function extract_fit_parsnip(). Here we pull the fitted model object and then use the broom::tidy() function to get a tidy tibble of model coefficients:
lr_fit %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.0105 0.920 -0.0114 0.991
## 2 int -0.355 0.209 -1.70 0.0892
## 3 gender_M 0.174 0.283 0.616 0.538
## 4 enrollment_reason_Credit.Recovery 1.08 1.04 1.03 0.301
## 5 enrollment_reason_Learning.Preference.of… -0.497 0.418 -1.19 0.234
## 6 enrollment_reason_Other -0.109 0.377 -0.288 0.773
## 7 enrollment_reason_Scheduling.Conflict -0.255 0.451 -0.566 0.571
Among the predictors included in our model, int, or interest in the subject area, comes closest to being a significant predictor of being identified as at-risk according to our definition (p ≈ .09). This doesn’t bode too well for our model, but let’s proceed with testing anyway.
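If log-odds are hard to interpret, you can convert the estimates to odds ratios. A hedged sketch, assuming the exponentiate argument passes through to broom’s tidy() method for glm fits:
# Exponentiated coefficients read as odds ratios rather than log-odds
lr_fit %>%
  extract_fit_parsnip() %>%
  tidy(exponentiate = TRUE)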
Now that we’ve fit our model to our training data, we’re FINALLY ready to test our model on the data we set aside in the beginning. Just to recap the steps that led to this moment, however, recall that we:
Built the model (lr_mod),
Created a preprocessing recipe (lr_recipe_1),
Bundled the model and recipe (lr_workflow), and
Trained our workflow using a single call to fit().
The next step is to use the trained workflow (lr_fit) to predict outcomes for our unseen test data, which we will do with the function predict(). The predict() method applies the recipe to the new data, then passes them to the fitted model.
predict(lr_fit, test_data)
## # A tibble: 134 × 1
## .pred_class
## <fct>
## 1 no
## 2 no
## 3 no
## 4 no
## 5 no
## 6 no
## 7 no
## 8 no
## 9 no
## 10 no
## # … with 124 more rows
Because our outcome variable at_risk here is a factor, the output from predict() returns the predicted class: no versus yes. Not a super useful output on its own, to be honest.
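If you want the underlying class probabilities instead of hard classes, predict() also accepts a type argument. A quick sketch:
# Returns .pred_no and .pred_yes probability columns
predict(lr_fit, test_data, type = "prob")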
Fortunately, there is also an augment() function we can use with our lr_fit model and test_data to attach both the class predictions and the class probabilities to the test data.
Let’s use this function and save the result as lr_predictions:
lr_predictions <- augment(lr_fit, test_data)
Take a quick look at lr_predictions in the code chunk below and answer the question that follows:
lr_predictions
## # A tibble: 134 × 23
## student_id subject semester section enrollment_reason gender int val
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 47448 FrScA S216 01 Other F 4.2 3
## 2 48797 PhysA S116 01 Learning Preference o… M 3.8 3.67
## 3 53447 FrScA S116 01 Course Unavailable at… F 3.8 2
## 4 53475 FrScA S216 01 Course Unavailable at… F 4.8 3.33
## 5 54434 PhysA S116 01 Learning Preference o… F 5 5
## 6 54567 OcnA S216 02 Course Unavailable at… M 4 3.67
## 7 57489 PhysA S116 01 Course Unavailable at… M 4.6 3.67
## 8 60186 AnPhA S116 01 Course Unavailable at… M 4.2 4.33
## 9 61357 FrScA S116 02 Course Unavailable at… F 5 4.67
## 10 61941 FrScA S116 01 Course Unavailable at… F 4.6 4.33
## # … with 124 more rows, and 15 more variables: percomp <dbl>, q1 <dbl>,
## # q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
## # q9 <dbl>, q10 <dbl>, at_risk <fct>, .pred_class <fct>, .pred_no <dbl>,
## # .pred_yes <dbl>
Was our model successful at predicting any students who were at-risk? How do you know?
Hint: Scroll to the end of the data frame and take a look at our original at_risk outcome and the .pred_class variable which shows the predicted outcomes.
As you probably noticed, just looking at the lr_predictions object is not the easiest way to check for model accuracy. Fortunately, the {yardstick} package has an accuracy() function for looking at the overall classification accuracy, which uses the hard class predictions to measure performance.
Hard class predictions tell us whether our model predicted yes or no for each student in the .pred_class column, alongside the estimated probability of each class. A simple 50% probability cutoff is used to categorize a student as at risk. For example, student 47448 had a .pred_no probability of 0.8854332 and a .pred_yes probability of 0.11456677, and so was classified as “no” for at_risk.
Run the following code to select() these variables followed by the accuracy() function to see how frequently our prediction matched our observed data:
lr_predictions %>%
select(at_risk, .pred_class) %>%
accuracy(truth = at_risk, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.821
Overall, it looks like our model was correct about 82% of the time, which, as we’ll see in a second, is not terribly impressive.
Another way to check model accuracy is with the conf_mat() function from {yardstick} for creating a confusion matrix. Recall from our course text Learning Analytics Goes to School that a confusion matrix is simply a 2 × 2 table that lists the number of true-negatives, false-negatives, true-positives, and false-positives.
Run the following code to create a confusion matrix for our logistic regression predictions:
lr_predictions %>%
conf_mat(at_risk, .pred_class)
## Truth
## Prediction no yes
## no 110 24
## yes 0 0
As you can see, our model correctly predicted “no” for at risk for 110 students, but incorrectly predicted “no” for the 24 students who were actually “yes” in our test_data. Overall, our model was 82% accurate, but it achieved this by simply labeling everyone as “no” for at risk. And since roughly 82% of our students were actually “no,” this is clearly not a great prediction model.
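Because accuracy can be misleading when classes are as imbalanced as ours, it is also worth checking sensitivity and specificity from {yardstick}. A sketch, assuming we treat “yes” (the second level of at_risk) as the event of interest:
# Sensitivity: of the students truly at risk, what share did we catch?
lr_predictions %>%
  sens(truth = at_risk, estimate = .pred_class, event_level = "second")

# Specificity: of the students not at risk, what share did we correctly label "no"?
# We call yardstick::spec() explicitly because readr::spec() masks it (see the conflicts above)
lr_predictions %>%
  yardstick::spec(truth = at_risk, estimate = .pred_class, event_level = "second")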
Recycle our code from above to create a workflow for lr_recipe_2 and test our “kitchen sink” model on our test data. Hint: You can accomplish this by simply copying and pasting and changing a single character from above.
# create workflow
lr_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe_2)
# set seed
set.seed(586)
# fit model to workflow
lr_fit <-
lr_workflow %>%
fit(data = train_data)
# extract model estimates
lr_fit %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 28 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.72 2.74 -0.991 0.322
## 2 int -0.749 2.53 -0.295 0.768
## 3 val 0.544 1.62 0.335 0.738
## 4 percomp 0.752 0.850 0.884 0.377
## 5 q1 -0.702 1.06 -0.664 0.507
## 6 q2 0.624 0.787 0.793 0.428
## 7 q3 0.0931 0.584 0.159 0.873
## 8 q4 -0.863 0.875 -0.987 0.324
## 9 q5 0.920 1.04 0.887 0.375
## 10 q6 0.0174 0.765 0.0227 0.982
## # … with 18 more rows
# get predictions
predict(lr_fit, test_data)
## # A tibble: 134 × 1
## .pred_class
## <fct>
## 1 no
## 2 no
## 3 no
## 4 no
## 5 no
## 6 yes
## 7 no
## 8 no
## 9 no
## 10 no
## # … with 124 more rows
lr_predictions <- augment(lr_fit, test_data)
# check overall accuracy
lr_predictions %>%
select(at_risk, .pred_class) %>%
accuracy(truth = at_risk, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.851
# create a confusion matrix
lr_predictions %>%
conf_mat(at_risk, .pred_class)
## Truth
## Prediction no yes
## no 106 16
## yes 4 8
Does this model perform any better? How do you know?
.estimate increased to 85.1%
multiple “yes” values had a .pred_yes result > 50%
Random forest models are ensembles of decision trees, and there is a dense amount of information required to understand what each of those terms really means. For now, however, the main thing to know is that:
One of the benefits of a random forest model is that it is very low maintenance; it requires very little preprocessing of the data and the default parameters tend to give reasonable results.
For that reason, we’ll just create a very simple recipe for our data that includes all of our predictors and assigns our student_id variable the “ID” role:
rf_recipe <-
recipe(at_risk ~ .,
data = train_data) %>%
update_role(student_id, new_role = "ID")
rf_recipe
## Recipe
##
## Inputs:
##
## role #variables
## ID 1
## outcome 1
## predictor 18
summary(rf_recipe)
## # A tibble: 20 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 student_id nominal ID original
## 2 subject nominal predictor original
## 3 semester nominal predictor original
## 4 section nominal predictor original
## 5 enrollment_reason nominal predictor original
## 6 gender nominal predictor original
## 7 int numeric predictor original
## 8 val numeric predictor original
## 9 percomp numeric predictor original
## 10 q1 numeric predictor original
## 11 q2 numeric predictor original
## 12 q3 numeric predictor original
## 13 q4 numeric predictor original
## 14 q5 numeric predictor original
## 15 q6 numeric predictor original
## 16 q7 numeric predictor original
## 17 q8 numeric predictor original
## 18 q9 numeric predictor original
## 19 q10 numeric predictor original
## 20 at_risk nominal outcome original
To fit a random forest model on the training set, we’ll use the {parsnip} package again along with the ranger engine. We’ll also include the set_mode() function to specify our model as “classification” rather than “regression.”
Recall from our course text Learning Analytics Goes to School that supervised machine learning, or predictive modeling, involves two broad approaches: classification and regression. Classification algorithms model categorical outcomes (e.g., yes or no outcomes like our at-risk data). Regression algorithms characterize continuous outcomes (e.g., test scores).
Run the following code to create our random forest model for our training data:
rf_mod <-
rand_forest(trees = 1000) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
Now let’s combine our model and recipe to create our new random forest workflow:
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe)
And fit our model and recipe to our training data:
set.seed(586)
rf_fit <-
rf_workflow %>%
fit(data = train_data)
Gather our random forest predictions:
rf_predictions <- augment(rf_fit, test_data)
rf_predictions
## # A tibble: 134 × 23
## student_id subject semester section enrollment_reason gender int val
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 47448 FrScA S216 01 Other F 4.2 3
## 2 48797 PhysA S116 01 Learning Preference o… M 3.8 3.67
## 3 53447 FrScA S116 01 Course Unavailable at… F 3.8 2
## 4 53475 FrScA S216 01 Course Unavailable at… F 4.8 3.33
## 5 54434 PhysA S116 01 Learning Preference o… F 5 5
## 6 54567 OcnA S216 02 Course Unavailable at… M 4 3.67
## 7 57489 PhysA S116 01 Course Unavailable at… M 4.6 3.67
## 8 60186 AnPhA S116 01 Course Unavailable at… M 4.2 4.33
## 9 61357 FrScA S116 02 Course Unavailable at… F 5 4.67
## 10 61941 FrScA S116 01 Course Unavailable at… F 4.6 4.33
## # … with 124 more rows, and 15 more variables: percomp <dbl>, q1 <dbl>,
## # q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
## # q9 <dbl>, q10 <dbl>, at_risk <fct>, .pred_class <fct>, .pred_no <dbl>,
## # .pred_yes <dbl>
And check our model’s overall accuracy:
rf_predictions %>%
select(at_risk, .pred_class) %>%
accuracy(truth = at_risk, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.821
And create a confusion matrix to see where it did and did not make accurate predictions:
rf_predictions %>%
conf_mat(at_risk, .pred_class)
## Truth
## Prediction no yes
## no 110 24
## yes 0 0
Overall, our random forest model performed no better than our first logistic regression model: it was about 82% accurate, but only because it predicted “no” for every student, so it failed to identify any of the 24 students who were truly “at risk” according to our definition.
Unfortunately, if this had been a real-world situation, none of those 24 students would have received additional support. Clearly we’d need to build a better model. In this case, more data would definitely help, but it’s also likely we’re missing some important information about students that would be worth including in our model.
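One place to look for ideas about which predictors matter is variable importance. Since we set importance = "impurity" in our ranger engine, we can peek at what the forest leaned on. A sketch, assuming the {vip} package is installed:
# Plot the 10 most important predictors from the fitted ranger model
library(vip)

rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)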
Try creating your own recipe and a logistic regression or random forest model and see how it performs against the three that I created.
Use the code chunk below to record your final model:
# create rf recipe using only 4 predictor variables
rf_recipe2 <-
recipe(at_risk ~ student_id + subject + gender + enrollment_reason + int,
data = train_data) %>%
update_role(student_id, new_role = "ID")
# create model for training data
rf_mod <-
rand_forest(trees = 1000) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# create workflow
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe2)
# fit model and recipe to training data
set.seed(586)
rf_fit <-
rf_workflow %>%
fit(data = train_data)
# gather predictions
rf_predictions <- augment(rf_fit, test_data)
rf_predictions
## # A tibble: 134 × 23
## student_id subject semester section enrollment_reason gender int val
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
## 1 47448 FrScA S216 01 Other F 4.2 3
## 2 48797 PhysA S116 01 Learning Preference o… M 3.8 3.67
## 3 53447 FrScA S116 01 Course Unavailable at… F 3.8 2
## 4 53475 FrScA S216 01 Course Unavailable at… F 4.8 3.33
## 5 54434 PhysA S116 01 Learning Preference o… F 5 5
## 6 54567 OcnA S216 02 Course Unavailable at… M 4 3.67
## 7 57489 PhysA S116 01 Course Unavailable at… M 4.6 3.67
## 8 60186 AnPhA S116 01 Course Unavailable at… M 4.2 4.33
## 9 61357 FrScA S116 02 Course Unavailable at… F 5 4.67
## 10 61941 FrScA S116 01 Course Unavailable at… F 4.6 4.33
## # … with 124 more rows, and 15 more variables: percomp <dbl>, q1 <dbl>,
## # q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>,
## # q9 <dbl>, q10 <dbl>, at_risk <fct>, .pred_class <fct>, .pred_no <dbl>,
## # .pred_yes <dbl>
# check accuracy
rf_predictions %>%
select(at_risk, .pred_class) %>%
accuracy(truth = at_risk, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.821
# create confusion matrix
rf_predictions %>%
conf_mat(at_risk, .pred_class)
## Truth
## Prediction no yes
## no 110 24
## yes 0 0
Now answer the following questions:
How did your model do compared to the others?
What information about students that is available prior to the course start might be useful for improving our model?
In this case study, we focused on applying some basic machine learning techniques to help us understand how a predictive model used in early warning systems might actually be developed and tested. Specifically, we made a very crude first attempt at developing models that, as it turned out, were not terribly great at accurately predicting whether a student is likely to pass or fail an online course.
Below, add a few notes in response to the following prompts:
One thing I took away from this learning lab that I found especially useful: learning the standard workflow to create, train/test a model.
One thing I want to learn more about: how do we adjust the split ratios for our data sets? If we want an 80/20 or a 90/10 split, how would we adjust our code to return those sized data sets?
To “turn in” your work, you can click the “Knit” icon at the top of the file, or click the dropdown arrow next to it and select “Knit to HTML.” This will create a report in your Files pane that serves as a record of your completed assignment and its output, which you can open or share.