In Unit 2, we pick up where we left off and learn to apply some basic machine learning techniques to help us understand how a predictive model might be developed and tested as part of an early warning system to identify students at risk of failing. Specifically, we will make a very crude first attempt at developing and testing a logistic regression model and a random forest model that can (we hope!) accurately predict whether a student will pass or fail an online course. Unlike Unit 1, where we were interested in identifying factors from data collected throughout the course that might help explain why students earned the grade they did, here we are interested in identifying students who may be at risk before it is too late to intervene.
The tidymodels package is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse. It includes a core set of packages that are loaded on startup and contains tools for data splitting and resampling (rsample), data preprocessing (recipes), model specification (parsnip), and model evaluation (yardstick), among others.
In addition to the {tidymodels} package, we’ll also be using the {tidyverse} packages we learned about in Unit 1, and the {ranger} package we’ll be using for our random forest model in Part 4.
Use the code chunk below to load these three packages:
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.1 ✔ recipes 1.0.1
## ✔ dials 1.0.0 ✔ rsample 1.1.0
## ✔ dplyr 1.0.10 ✔ tibble 3.1.8
## ✔ ggplot2 3.3.6 ✔ tidyr 1.2.1
## ✔ infer 1.0.3 ✔ tune 1.0.0
## ✔ modeldata 1.0.1 ✔ workflows 1.1.0
## ✔ parsnip 1.0.1 ✔ workflowsets 1.0.0
## ✔ purrr 0.3.4 ✔ yardstick 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ✔ stringr 1.4.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
library(ranger)
For this case study, we will again be working with data from
the online science courses introduced in Unit 1. This data is located in
the data folder and named processed-data.csv. Fortunately,
our data has already been largely wrangled. This will save us quite a
bit of time, which we’ll need to help wrap our heads around some
supervised machine learning basics.
Run the code chunk below to read this data into your R environment as
an object named processed_data and let’s take a quick look
at the data we may use in our predictive model:
processed_data <- read_csv("data/processed-data.csv")
## Rows: 603 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): course_id, gender
## dbl (15): student_id, final_grade, course_interest, perceived_competence, ut...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
processed_data
## # A tibble: 603 × 17
## student_id course_id final…¹ gender cours…² perce…³ utili…⁴ q1 q2 q3
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S2… 93.5 M 5 4.5 4.33 5 4 4
## 2 44638 OcnA-S11… 81.7 F 4.2 3.5 4 4 4 3
## 3 47448 FrScA-S2… 88.5 M 5 4 3.67 5 4 4
## 4 47979 OcnA-S21… 81.9 M 5 3.5 5 5 5 3
## 5 48797 PhysA-S1… 84 F 3.8 3.5 3.5 4 3 3
## 6 51943 FrScA-S2… NA F 4.6 4 4 NA NA NA
## 7 52326 AnPhA-S2… 83.6 M 5 3.5 5 5 5 3
## 8 52446 PhysA-S1… 97.8 F 3 3 3.33 3 3 3
## 9 53447 FrScA-S1… 96.1 F 4.2 3 2.67 4 3 3
## 10 53475 FrScA-S1… NA M NA NA NA NA NA NA
## # … with 593 more rows, 7 more variables: q4 <dbl>, q5 <dbl>, q6 <dbl>,
## # q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, and abbreviated variable names
## # ¹final_grade, ²course_interest, ³perceived_competence, ⁴utility_value
In addition to students’ gender and their pre-course survey responses assessing three aspects of student motivation, we will also incorporate gradebook data collected during the course to help “train” a predictive model for identifying students at risk of failing.
Use the code chunk below to read in the .csv file named
“grade-book.csv” located in the data folder, assign it to a new object
named grade_book, and use a method of your choosing to
answer the questions that follow:
grade_book <- read_csv("data/grade-book.csv")
## Rows: 29711 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): course_id, gradebook_item
## dbl (3): student_id, points_possible, points_earned
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
grade_book
## # A tibble: 29,711 × 5
## course_id student_id gradebook_item point…¹ point…²
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 AnPhA-S116-01 60186 POINTS EARNED & TOTAL COURSE POINTS 5 4.05
## 2 AnPhA-S116-01 60186 WORK ATTEMPTED 30 24
## 3 AnPhA-S116-01 60186 0.1: Message Your Instructor 105 71.7
## 4 AnPhA-S116-01 60186 0.2: Intro Assignment - Discussion … 140 141.
## 5 AnPhA-S116-01 60186 0.3: Intro Assignment - Submitting … 5 5
## 6 AnPhA-S116-01 60186 1.1: Quiz 5 4
## 7 AnPhA-S116-01 60186 1.2: Quiz 20 NA
## 8 AnPhA-S116-01 60186 1.3: Create a Living Creature 50 50
## 9 AnPhA-S116-01 60186 1.3: Create a Living Creature - Dis… 10 NA
## 10 AnPhA-S116-01 60186 1.4: Negative Feedback Loop Flowcha… 50 44
## # … with 29,701 more rows, and abbreviated variable names ¹points_possible,
## # ²points_earned
How many observations and variables are in our
grade_book dataset? And roughly how many observations are
there per student?
How might this data be used in our model to help predict students at risk of failing the course?
What else do you notice about this data set?
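One way you might start exploring these questions (purely optional hints; any approach works) is sketched below. Both glimpse() and count() are loaded with the {tidyverse}:
# optional hints: inspect dimensions and column types, then gradebook items per student
glimpse(grade_book)

grade_book |>
  count(student_id) |>                  # rows (gradebook items) per student
  summarize(median_items = median(n))   # a rough sense of items per student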
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018). In Part 2, we focus on the following wrangling processes to:
Create an Outcome Variable. We create the binary outcome variable we are interested in predicting, i.e., pass/fail.
Convert Character Strings to Factors. We convert variables stored as character strings to the factors expected by our models.
Select Predictors. We narrow down our data set to specific predictors of interest.
Since we are interested in developing a predictive model that can flag whether a student is at risk of failing a course, so that we can intervene before that happens, we need an outcome variable that lets us know whether each student passed or failed.
Let’s combine the hopefully familiar mutate() function
with the if_else() function also from the {dplyr} package
to create a new variable called at_risk which indicates
“no” for students whose final_grade is greater than or
equal to 66.7% (NC State’s cutoff for a D-) and “yes” if it is not.
course_data <- processed_data %>%
mutate(at_risk = if_else(final_grade >= 66.7, "no", "yes"))
course_data
## # A tibble: 603 × 18
## student_id course_id final…¹ gender cours…² perce…³ utili…⁴ q1 q2 q3
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S2… 93.5 M 5 4.5 4.33 5 4 4
## 2 44638 OcnA-S11… 81.7 F 4.2 3.5 4 4 4 3
## 3 47448 FrScA-S2… 88.5 M 5 4 3.67 5 4 4
## 4 47979 OcnA-S21… 81.9 M 5 3.5 5 5 5 3
## 5 48797 PhysA-S1… 84 F 3.8 3.5 3.5 4 3 3
## 6 51943 FrScA-S2… NA F 4.6 4 4 NA NA NA
## 7 52326 AnPhA-S2… 83.6 M 5 3.5 5 5 5 3
## 8 52446 PhysA-S1… 97.8 F 3 3 3.33 3 3 3
## 9 53447 FrScA-S1… 96.1 F 4.2 3 2.67 4 3 3
## 10 53475 FrScA-S1… NA M NA NA NA NA NA NA
## # … with 593 more rows, 8 more variables: q4 <dbl>, q5 <dbl>, q6 <dbl>,
## # q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, at_risk <chr>, and abbreviated
## # variable names ¹final_grade, ²course_interest, ³perceived_competence,
## # ⁴utility_value
While inspecting the data, you may have noticed that one of our
predictors of interest and our at_risk outcome variable are
stored as character <chr> data types. For creating models, it is
better to have qualitative variables like gender and
at_risk encoded as factors instead of character strings.
Factors store data as categorical variables, each with its own levels.
Because categorical variables are used in statistical models differently
than continuous variables, storing these data as factors ensures that
the modeling functions will treat them correctly.
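As a tiny optional illustration of what a factor stores, you can convert a short character vector yourself and ask for its levels (as_factor() comes from the {forcats} package loaded with the tidyverse):
# levels are created in the order the values first appear: "M", then "F"
as_factor(c("M", "F", "M")) |>
  levels()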
To do so, we once again use the mutate() function, but instead of creating a new variable, we will use the as_factor() function to convert gender and at_risk from character strings to factors and save them as variables of the same name, effectively replacing the old variables with new ones:
course_data <- course_data |>
mutate(gender = as_factor(gender),
at_risk = as_factor(at_risk))
course_data
## # A tibble: 603 × 18
## student_id course_id final…¹ gender cours…² perce…³ utili…⁴ q1 q2 q3
## <dbl> <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S2… 93.5 M 5 4.5 4.33 5 4 4
## 2 44638 OcnA-S11… 81.7 F 4.2 3.5 4 4 4 3
## 3 47448 FrScA-S2… 88.5 M 5 4 3.67 5 4 4
## 4 47979 OcnA-S21… 81.9 M 5 3.5 5 5 5 3
## 5 48797 PhysA-S1… 84 F 3.8 3.5 3.5 4 3 3
## 6 51943 FrScA-S2… NA F 4.6 4 4 NA NA NA
## 7 52326 AnPhA-S2… 83.6 M 5 3.5 5 5 5 3
## 8 52446 PhysA-S1… 97.8 F 3 3 3.33 3 3 3
## 9 53447 FrScA-S1… 96.1 F 4.2 3 2.67 4 3 3
## 10 53475 FrScA-S1… NA M NA NA NA NA NA NA
## # … with 593 more rows, 8 more variables: q4 <dbl>, q5 <dbl>, q6 <dbl>,
## # q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, at_risk <fct>, and abbreviated
## # variable names ¹final_grade, ²course_interest, ³perceived_competence,
## # ⁴utility_value
Note: Be sure to pay attention to the fact that, in addition to assigning our modified gender variable to a variable of the same name using the = operator, we also assigned our modified course_data to an object of the same name using the <- operator. That is, we effectively overwrote the old course_data, which stored gender as a character data type, with a new course_data object in which gender is stored as a factor.
As you’ve probably noticed, there is a lot of great information in our course_data - but we won’t, and shouldn’t, include all of it in our predictive model. Indeed, Estrellado et al. (2020) point out that for statistical reasons and as a good general practice, it’s better to select a few variables of interest because:
At a certain point, adding more variables will appear to make your analysis more accurate, but will in fact obscure the truth from you.
Complete the code chunk below to “select” (hint, hint)
at_risk for our outcome variable, as well as the following
variables we will use in our predictive model:
course_data <- course_data |>
select(student_id, course_id, at_risk, gender, course_interest,
perceived_competence, utility_value)
course_data
## # A tibble: 603 × 7
## student_id course_id at_risk gender course_interest perceived_c…¹ utili…²
## <dbl> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S216-02 no M 5 4.5 4.33
## 2 44638 OcnA-S116-01 no F 4.2 3.5 4
## 3 47448 FrScA-S216-01 no M 5 4 3.67
## 4 47979 OcnA-S216-01 no M 5 3.5 5
## 5 48797 PhysA-S116-01 no F 3.8 3.5 3.5
## 6 51943 FrScA-S216-03 <NA> F 4.6 4 4
## 7 52326 AnPhA-S216-01 no M 5 3.5 5
## 8 52446 PhysA-S116-01 no F 3 3 3.33
## 9 53447 FrScA-S116-01 no F 4.2 3 2.67
## 10 53475 FrScA-S116-02 <NA> M NA NA NA
## # … with 593 more rows, and abbreviated variable names ¹perceived_competence,
## # ²utility_value
As noted by Krumm et al. (2018), exploratory data analysis often involves some combination of data visualization and feature engineering. In Part 3, we will create a quick visualization to help us spot any issues with our data and engineer new features that we will use in our predictive models. Specifically, in Part 3 we will:
Examine Outcomes by taking a quick look at our at_risk outcome variable of interest, as well as our other predictors, to spot any issues that may hang up our models.
Engineer Predictors by creating some new variables we think will be predictive of students at risk, such as performance on assignments during the first half of the course.
Before we move on to feature engineering, let’s take a quick look at the proportion of students in our final data set who were identified as at-risk. To do so, run the following code chunk to count the number of students labeled “yes” or “no” for at_risk, and then create a new variable called proportion that takes the number n of each and divides it by the total number of students:
course_data |>
count(at_risk) |>
mutate(proportion = n/sum(n))
## # A tibble: 3 × 3
## at_risk n proportion
## <fct> <int> <dbl>
## 1 no 465 0.771
## 2 yes 108 0.179
## 3 <NA> 30 0.0498
As you can see, roughly 18% of students in our data set were identified as “yes” for being at risk. And since these are final grades we are working with, these students were not only at risk of failing, but did indeed fail the course. You may also notice that we have a number of students with missing grades, which is something we will have to address prior to analysis.
Alternatively, we could have created a quick visualization to help us get a better sense of the outcomes we are trying to predict and to spot any potential issues we might run into during analysis.
Complete the following code chunk to create a simple bar plot illustrating the number of students who passed, failed, or have missing data:
course_data |>
ggplot() +
geom_bar(mapping = aes(x = at_risk))
As you may have noticed from the outputs above, some students do not have a final grade and some are missing survey data as well. Since the models we will be using are not designed to deal with missing data, we will need to remove cases with missing data.
Let’s use the drop_na() function also from the {dplyr}
package to remove observations with missing data and reassign to our
course_data object:
course_data <- course_data |>
drop_na()
course_data
## # A tibble: 503 × 7
## student_id course_id at_risk gender course_interest perceived_c…¹ utili…²
## <dbl> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S216-02 no M 5 4.5 4.33
## 2 44638 OcnA-S116-01 no F 4.2 3.5 4
## 3 47448 FrScA-S216-01 no M 5 4 3.67
## 4 47979 OcnA-S216-01 no M 5 3.5 5
## 5 48797 PhysA-S116-01 no F 3.8 3.5 3.5
## 6 52326 AnPhA-S216-01 no M 5 3.5 5
## 7 52446 PhysA-S116-01 no F 3 3 3.33
## 8 53447 FrScA-S116-01 no F 4.2 3 2.67
## 9 54066 OcnA-S116-01 no M 4.4 4 5
## 10 54282 OcnA-S116-02 no F 3.4 3 2.67
## # … with 493 more rows, and abbreviated variable names ¹perceived_competence,
## # ²utility_value
Finally, you have probably noticed that we wrote a lot more code than necessary in order to illustrate different wrangling processes to get our data ready for analysis.
To reduce all the redundancy caused by demonstrating each step
separately above, complete the code below by using the new
|> or old %>% pipe
operators to chain the mutate(), select(),
and drop_na() functions together to wrangle our data for
analysis and write a brief comment following each # that
explains what each line of code accomplishes.
course_data <- processed_data |>
mutate(at_risk = if_else(final_grade >= 66.7, "no", "yes"), # adds at-risk variable
at_risk = as_factor(at_risk), # changes at_risk data type from character to a factor
gender = as_factor(gender)) |> # changes gender data type from character to a factor
select(student_id, course_id, at_risk, gender, course_interest,
perceived_competence, utility_value) |> # selects the relevant variables for the data we want to work with
drop_na() # removes the observations with NA for any variable
course_data
## # A tibble: 503 × 7
## student_id course_id at_risk gender course_interest perceived_c…¹ utili…²
## <dbl> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 43146 FrScA-S216-02 no M 5 4.5 4.33
## 2 44638 OcnA-S116-01 no F 4.2 3.5 4
## 3 47448 FrScA-S216-01 no M 5 4 3.67
## 4 47979 OcnA-S216-01 no M 5 3.5 5
## 5 48797 PhysA-S116-01 no F 3.8 3.5 3.5
## 6 52326 AnPhA-S216-01 no M 5 3.5 5
## 7 52446 PhysA-S116-01 no F 3 3 3.33
## 8 53447 FrScA-S116-01 no F 4.2 3 2.67
## 9 54066 OcnA-S116-01 no M 4.4 4 5
## 10 54282 OcnA-S116-02 no F 3.4 3 2.67
## # … with 493 more rows, and abbreviated variable names ¹perceived_competence,
## # ²utility_value
Now inspect your data again and make sure it looks like expected:
view(course_data)
Hint: You should see a total of 503 observations and 7 variables, including 1 outcome variable named at_risk and 1 student identifier named student_id.
As defined by Krumm, Means, and Bienkowski (2018) in Learning Analytics Goes to School:
Feature engineering is the process of creating new variables within a dataset, which goes above and beyond the work of recoding and rescaling variables.
The authors note that feature engineering draws on substantive knowledge from theory or practice, experience with a particular data system, and general experience in data-intensive research. Moreover, these features can be used not only in machine learning models, but also in visualizations and tables comprising descriptive statistics.
Though not often discussed, feature engineering is an important element of data-intensive research that can generate new insights and improve predictive models. Indeed, an earlier attempt using this data without feature engineering predicted students’ passing (or not passing) the course with only around 75% accuracy. One goal of this case study is to improve upon these results by creating some new variables from data that we think may improve our model.
In addition to information about student gender and motivation collected prior to the course, we will also incorporate into our model student performance data on course assignments. And since we are interested in developing a predictive model that can be used to intervene before students actually fail the course, we will develop new features based on student performance on the first 20 assignments completed during the first half of the course. Specifically, we’ll create three “features” from the gradebook data we imported earlier: each student’s overall percentage of points earned on those assignments, the variability in their assignment scores, and the number of assignments on which they earned a perfect score.
The {dplyr} package has a handy slice() function that allows us to select, remove, and duplicate rows in a dataset based on the row numbers we specify.
For example, run the following code chunk to select rows 1 through 20
from our grade_book data.
slice(grade_book, 1:20)
## # A tibble: 20 × 5
## course_id student_id gradebook_item point…¹ point…²
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 AnPhA-S116-01 60186 POINTS EARNED & TOTAL COURSE POINTS 5 4.05
## 2 AnPhA-S116-01 60186 WORK ATTEMPTED 30 24
## 3 AnPhA-S116-01 60186 0.1: Message Your Instructor 105 71.7
## 4 AnPhA-S116-01 60186 0.2: Intro Assignment - Discussion … 140 141.
## 5 AnPhA-S116-01 60186 0.3: Intro Assignment - Submitting … 5 5
## 6 AnPhA-S116-01 60186 1.1: Quiz 5 4
## 7 AnPhA-S116-01 60186 1.2: Quiz 20 NA
## 8 AnPhA-S116-01 60186 1.3: Create a Living Creature 50 50
## 9 AnPhA-S116-01 60186 1.3: Create a Living Creature - Dis… 10 NA
## 10 AnPhA-S116-01 60186 1.4: Negative Feedback Loop Flowcha… 50 44
## 11 AnPhA-S116-01 60186 1.5: Quiz 5 4.19
## 12 AnPhA-S116-01 60186 Unit 1 Assessment: Pickle Autopsy L… 5 5
## 13 AnPhA-S116-01 60186 PROGRESS CHECK 1 @ 10-02-15 24 16
## 14 AnPhA-S116-01 60186 2.1: Types of Membranes 10 9.32
## 15 AnPhA-S116-01 60186 2.2 - 2.3: Quiz 10 9
## 16 AnPhA-S116-01 60186 2.4 - 2.5 Quiz 10 10
## 17 AnPhA-S116-01 60186 2.6: Quiz 5 5
## 18 AnPhA-S116-01 60186 Unit 2 Assessment: Test 443 389
## 19 AnPhA-S116-01 60186 3.1: Models of the Integumentary Sy… 10 10
## 20 AnPhA-S116-01 60186 3.2 - 3.3: Quiz 5 3
## # … with abbreviated variable names ¹points_possible, ²points_earned
While this helps us retrieve the first 20 assignments for the first student, it doesn’t help us with the hundreds of other students in our data set!
Fortunately, the {dplyr} package also has the extremely useful
group_by() function which allows us to perform other
{tidyverse} functions like slice(), mutate(),
and summarize() by groups that we specify as arguments.
For example, run the following code to group our data by student_id and course_id using the group_by() function, and then use the slice() function again, but this time to select the first 20 assignments in each course each student completed, instead of just the first 20 rows for the first student:
grade_book |>
group_by(student_id, course_id) |>
slice(1:20)
## # A tibble: 12,060 × 5
## # Groups: student_id, course_id [603]
## course_id student_id gradebook_item point…¹ point…²
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 FrScA-S216-02 43146 POINTS EARNED & TOTAL COURSE POINTS 5 NA
## 2 FrScA-S216-02 43146 WORK ATTEMPTED 10 NA
## 3 FrScA-S216-02 43146 0-1.1: Intro Assignment - Send a Me… 5 5
## 4 FrScA-S216-02 43146 0-1.2: Intro Assignment - DB #1 5 5
## 5 FrScA-S216-02 43146 0-1.3: Intro Assignment - Submittin… 5 5
## 6 FrScA-S216-02 43146 1-1.1: Lesson 1-1 Graphic Organizer 15 13
## 7 FrScA-S216-02 43146 1-2.1: Explore a Career Assignment 37 37
## 8 FrScA-S216-02 43146 1-2.2: Explore a Career DB #2 443 365
## 9 FrScA-S216-02 43146 PROGRESS CHECK 1 @ 02-18-16 5 5
## 10 FrScA-S216-02 43146 1-2.3: Lesson 1-2 Graphic Organizer 5 5
## # … with 12,050 more rows, and abbreviated variable names ¹points_possible,
## # ²points_earned
You’ll notice the data have significantly different dimensions now.
We’ll have to take some steps to further process our
grade_book data. In doing so, we’ll engineer some
features.
Now let’s create a new grade_book data frame using what we just learned to:
group our data by student_id and course_id;
slice just the first 20 assignments for each student in each course;
create a new variable named percent_earned that equals the points_earned divided by points_possible; and,
drop any rows with missing values.
Complete the following code chunk:
grade_book <- grade_book |>
group_by(student_id, course_id) |>
slice(1:20) |>
mutate(percent_earned = points_earned/points_possible) |>
drop_na()
grade_book
## # A tibble: 10,522 × 6
## # Groups: student_id, course_id [603]
## course_id student_id gradebook_item point…¹ point…² perce…³
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 FrScA-S216-02 43146 0-1.1: Intro Assignment - S… 5 5 1
## 2 FrScA-S216-02 43146 0-1.2: Intro Assignment - D… 5 5 1
## 3 FrScA-S216-02 43146 0-1.3: Intro Assignment - S… 5 5 1
## 4 FrScA-S216-02 43146 1-1.1: Lesson 1-1 Graphic O… 15 13 0.867
## 5 FrScA-S216-02 43146 1-2.1: Explore a Career Ass… 37 37 1
## 6 FrScA-S216-02 43146 1-2.2: Explore a Career DB … 443 365 0.824
## 7 FrScA-S216-02 43146 PROGRESS CHECK 1 @ 02-18-16 5 5 1
## 8 FrScA-S216-02 43146 1-2.3: Lesson 1-2 Graphic O… 5 5 1
## 9 FrScA-S216-02 43146 2-2.1: Evidence Collection … 10 8.5 0.85
## 10 FrScA-S216-02 43146 2-3.1: Hair Analysis Lab 30 23 0.767
## # … with 10,512 more rows, and abbreviated variable names ¹points_possible,
## # ²points_earned, ³percent_earned
Note: If you completed this correctly, you should have a data frame with 10,522 observations and 6 variables.
Finally, let’s create our three features from the gradebook data: overall_percent, each student’s overall percentage of points earned across these assignments; variability, the standard deviation of their percent_earned scores; and n_100, the number of assignments on which they earned a perfect score.
You can probably imagine others; you’re welcome to explore adding those, too.
We’ll use the summarize() function instead of the mutate() function to do this, and then we’ll select just the variables needed to join our data back with course_data:
grade_book <- grade_book |>
summarize(overall_percent = sum(points_earned) / sum(points_possible),
variability = sd(percent_earned),
n_100 = sum(percent_earned == 1)) |>
select(student_id,
course_id,
overall_percent,
variability,
n_100)
## `summarise()` has grouped output by 'student_id'. You can override using the
## `.groups` argument.
grade_book
## # A tibble: 603 × 5
## # Groups: student_id [580]
## student_id course_id overall_percent variability n_100
## <dbl> <chr> <dbl> <dbl> <int>
## 1 43146 FrScA-S216-02 0.840 0.211 9
## 2 44638 OcnA-S116-01 0.560 0.235 12
## 3 47448 FrScA-S216-01 0.420 0.297 5
## 4 47979 OcnA-S216-01 0.713 0.204 6
## 5 48797 PhysA-S116-01 0.905 0.286 9
## 6 51943 FrScA-S216-03 0.936 0.172 8
## 7 52326 AnPhA-S216-01 0.485 0.298 6
## 8 52446 PhysA-S116-01 0.843 0.137 8
## 9 53447 FrScA-S116-01 0.702 0.221 5
## 10 53475 FrScA-S116-02 0.706 0.173 4
## # … with 593 more rows
We have one last step before we can get to modeling - joining this gradebook data (grade_book) with all of the other data (course_data).
course_data <- inner_join(course_data, grade_book)
## Joining, by = c("student_id", "course_id")
Let’s take a look at the joined data to make sure everything is looking as we intend it to. Inspect the data using the code chunk below and answer the question that follows:
view(course_data)
Why do you think we used the summarize() function above instead of the mutate() function to create our new features?
Recall from our readings that there are two general types of modeling approaches: unsupervised and supervised machine learning. In Part 4, we focus on supervised learning models, which are used to quantify relationships between features (e.g., motivation and performance) and a known outcome (e.g., passing or failing a course). These models can be used for statistical inference, as illustrated in Unit 1, or for prediction, as we’ll illustrate in this section. Specifically, in Part 4 we will learn how to:
Split Data into a training and test set that will be used to develop a predictive model.
Create a “Recipe” for our predictive model and learn how to deal with nominal data that we would like to use as predictors.
Fit Models to our training set using logistic regression and random forest models.
Check Model Accuracy on our test set to see how well our model can “predict” our outcome of interest.
The authors of Data Science in Education Using R Estrellado et al. (2020) remind us that:
At its core, machine learning is the process of “showing” your statistical model only some of the data at once and training the model to predict accurately on that training dataset (this is the “learning” part of machine learning). Then, the model as developed on the training data is shown new data - data you had all along, but hid from your computer initially - and you see how well the model that you developed on the training data performs on this new testing data. Eventually, you might use the model on entirely new data.
It is therefore common when beginning a modeling project to separate the data set into two partitions:
The training set is used to develop and compare models, try out feature engineering techniques, tune models, etc.
The test set is held in reserve until the end of the project, at which point there should only be one or two models under serious consideration. It is used as an unbiased source for measuring final model performance.
There are different ways to create these partitions of the data and there is no uniform guideline for determining how much data should be set aside for testing. The proportion of data can be driven by many factors, including the size of the original pool of samples and the total number of predictors.
After you decide how much to set aside, the most common approach for actually partitioning your data is to use a random sample. For our purposes, we’ll use random sampling to select 25% for the test set and use the remainder for the training set, which are the defaults for the {rsample} package.
Additionally, since random sampling uses random numbers, it is important to set the random number seed. This ensures that the random numbers can be reproduced at a later time (if needed).
The initial_split() function from the {rsample}
package takes the original data and saves the information on how to make
the partitions.
Run the following code to set the random number seed and make our initial data split:
set.seed(586)
course_split <- initial_split(course_data,
strata = at_risk)
Note that we used the strata = argument, which conducts
a stratified split. This ensures that, despite the imbalance we noticed
in our at_risk variable, our training and test data sets
will keep roughly the same proportions of at-risk students as in the
original data.
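As an aside, if you ever want to hold out a different share of the data, initial_split() also accepts a prop argument. The 80/20 split below is purely an illustrative sketch and not part of our workflow:
# illustrative only: hold out 20% instead of the default 25%,
# still stratified by our outcome
example_split <- initial_split(course_data, prop = 0.80, strata = at_risk)
example_split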
Type course_split into the code chunk below, run, and
answer the question that follows:
course_split
## <Training/Testing/Total>
## <376/127/503>
How many observations should we expect to see in our training and test sets respectively?
The {rsample} package has two aptly named functions for creating a
training and testing data set called training() and
testing() respectively.
Run the following code to split the data into our training and test data sets:
train_data <- training(course_split)
test_data <- testing(course_split)
Now take a look at the training and testing sets we just created by typing their names into the code chunk:
train_data
## # A tibble: 376 × 10
## studen…¹ cours…² at_risk gender cours…³ perce…⁴ utili…⁵ overa…⁶ varia…⁷ n_100
## <dbl> <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 43146 FrScA-… no M 5 4.5 4.33 0.840 0.211 9
## 2 44638 OcnA-S… no F 4.2 3.5 4 0.560 0.235 12
## 3 47448 FrScA-… no M 5 4 3.67 0.420 0.297 5
## 4 47979 OcnA-S… no M 5 3.5 5 0.713 0.204 6
## 5 52326 AnPhA-… no M 5 3.5 5 0.485 0.298 6
## 6 52446 PhysA-… no F 3 3 3.33 0.843 0.137 8
## 7 54282 OcnA-S… no F 3.4 3 2.67 0.849 0.138 7
## 8 54434 PhysA-… no F 4 3 3 0.914 0.200 4
## 9 55078 FrScA-… no F 4.2 3.5 4.33 0.911 0.171 9
## 10 56152 AnPhA-… no F 4 4 2.67 0.864 0.112 10
## # … with 366 more rows, and abbreviated variable names ¹student_id, ²course_id,
## # ³course_interest, ⁴perceived_competence, ⁵utility_value, ⁶overall_percent,
## # ⁷variability
test_data
## # A tibble: 127 × 10
## studen…¹ cours…² at_risk gender cours…³ perce…⁴ utili…⁵ overa…⁶ varia…⁷ n_100
## <dbl> <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 48797 PhysA-… no F 3.8 3.5 3.5 0.905 0.286 9
## 2 53447 FrScA-… no F 4.2 3 2.67 0.702 0.221 5
## 3 54066 OcnA-S… no M 4.4 4 5 0.848 0.307 6
## 4 55140 PhysA-… no F 3.8 3 2.67 0.772 0.275 6
## 5 57981 OcnA-S… yes F 4.2 3 3.67 0.896 0.118 10
## 6 58168 AnPhA-… no F 5 5 4.67 0.871 0.242 4
## 7 62157 FrScA-… no F 4.5 3.5 3.33 0.859 0.162 7
## 8 62175 OcnA-S… no M 4 4 3.33 0.842 0.356 5
## 9 62752 PhysA-… no F 4.2 2.5 2.33 0.834 0.120 11
## 10 64930 FrScA-… no F 4 3 3.67 0.792 0.151 7
## # … with 117 more rows, and abbreviated variable names ¹student_id, ²course_id,
## # ³course_interest, ⁴perceived_competence, ⁵utility_value, ⁶overall_percent,
## # ⁷variability
Next, recycle the code from above to check to see that the proportion
of at-risk students in our training and test data are close to those in
our overall course_data:
train_data |>
count(at_risk) |>
mutate(proportion = n/sum(n))
## # A tibble: 2 × 3
## at_risk n proportion
## <fct> <int> <dbl>
## 1 no 306 0.814
## 2 yes 70 0.186
test_data |>
count(at_risk) |>
mutate(proportion = n/sum(n))
## # A tibble: 2 × 3
## at_risk n proportion
## <fct> <int> <dbl>
## 1 no 103 0.811
## 2 yes 24 0.189
Now answer the following questions:
Does the number of observations in each set match your expectations? Why?
Does the proportion of at-risk students in each set match your expectations? Why?
In this section, we introduce another tidymodels package, recipes, which is designed to help you prepare your data before training your model. Recipes are built as a series of preprocessing steps, such as:
converting qualitative predictors to indicator variables (also known as dummy variables),
transforming data to be on a different scale (e.g., taking the logarithm of a variable),
transforming whole groups of predictors together,
extracting key features from raw variables (e.g., getting the day of the week out of a date variable),
and so on. If you are familiar with R’s formula interface, a lot of this might sound familiar and like what a formula already does. Recipes can be used to do many of the same things, but they have a much wider range of possibilities.
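To make a couple of those steps concrete before we build our actual recipe, here is a small hypothetical example (a sketch only, not the recipe we’ll use below) that chains two preprocessing steps together using our training data:
# a hypothetical recipe, for illustration only: dummy-code the nominal
# predictors, then center and scale the numeric predictors
example_recipe <- recipe(at_risk ~ gender + course_interest + overall_percent,
                         data = train_data) |>
  step_dummy(all_nominal_predictors()) |>      # e.g., gender -> a 0/1 indicator
  step_normalize(all_numeric_predictors())     # rescale the numeric predictors

example_recipe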
To get started, let’s create a recipe for a simple logistic regression model. Before training the model, we can use a recipe to add a few predictors and conduct some preprocessing required by the model.
The recipe() function as we will use it here has two arguments:
A formula. Any variable on the left-hand side of
the tilde (~) is considered the model outcome
(at_risk in our case). On the right-hand side of the tilde
are the predictors. Variables may be listed by name, or you can use the
dot (.) to indicate all other variables as
predictors.
The data. A recipe is associated with the data
set used to create the model. This will typically be
the training set, so data = train_data here.
Naming a data set doesn’t actually change the data itself; it is only
used to catalog the names of the variables and their types, like
factors, integers, dates, etc.
Let’s create our very first recipe using at_risk as our
outcome variable; course_interest and gender
and overall_percent as predictors; and
train_data as our data to train:
lr_recipe_1 <- recipe(at_risk ~ course_interest + gender + overall_percent,
data = train_data)
Now let’s take a quick peek at our recipe and create a summary of it using the summary() function:
lr_recipe_1
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 3
summary(lr_recipe_1)
## # A tibble: 4 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 course_interest numeric predictor original
## 2 gender nominal predictor original
## 3 overall_percent numeric predictor original
## 4 at_risk nominal outcome original
You can see that our recipe has four ingredients: three predictors and one outcome, just as expected.
Because we’ll be using a simple logistic regression model, variables like gender will need to be coded as dummy variables. Dummy coding means transforming a variable with multiple categories into new binary variables, where each one indicates the presence or absence of a category. For example, gender will be recoded to gender_F, where 1 indicates female and 0 indicates male.
Unlike the standard model formula methods in R, a recipe does not automatically create these dummy variables for you; you’ll need to tell your recipe to add this step. This is for two reasons. First, many models do not require numeric predictors, so dummy variables may not always be preferred. Second, recipes can also be used for purposes outside of modeling, where non-dummy versions of the variables may work better. For example, you may want to make a table or a plot with a variable as a single factor.
For these reasons, we need to explicitly tell recipes to create dummy
variables using step_dummy(). Let’s add this to our recipe
and include all_nominal_predictors() to tell our recipe to
change all of our factor variables to dummy variables:
lr_recipe_1 <-
recipe(at_risk ~ course_interest + gender + overall_percent,
data = train_data) |>
step_dummy(all_nominal_predictors())
lr_recipe_1
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 3
##
## Operations:
##
## Dummy variables from all_nominal_predictors()
summary(lr_recipe_1)
## # A tibble: 4 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 course_interest numeric predictor original
## 2 gender nominal predictor original
## 3 overall_percent numeric predictor original
## 4 at_risk nominal outcome original
Before training our model, let’s create a second recipe just for contrast that includes all of our predictors.
Run the following code to add all our predictors to our new recipe:
lr_recipe_2 <-
recipe(at_risk ~ course_interest + gender + overall_percent +
perceived_competence + utility_value + variability + n_100,
data = train_data) |>
step_dummy(all_nominal_predictors())
lr_recipe_2
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 7
##
## Operations:
##
## Dummy variables from all_nominal_predictors()
summary(lr_recipe_2)
## # A tibble: 8 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 course_interest numeric predictor original
## 2 gender nominal predictor original
## 3 overall_percent numeric predictor original
## 4 perceived_competence numeric predictor original
## 5 utility_value numeric predictor original
## 6 variability numeric predictor original
## 7 n_100 numeric predictor original
## 8 at_risk nominal outcome original
With tidymodels, we start building a model by specifying the functional form of the model that we want using the {parsnip} package. Since our outcome is binary, the model type we will use is “logistic regression.” We can declare this with logistic_reg() and assign it to an object we will later use in our workflow:
lr_mod <- logistic_reg()
That is pretty underwhelming since, on its own, it doesn’t really do much. However, now that the type of model has been specified, we can state a method for fitting or training the model using an engine.
The engine value names the software that will be used to fit or train the model, as well as the estimation method. For example, we will use “glm”, a generalized linear model for binary outcomes and the default engine for logistic regression in the {parsnip} package.
Run the following code to finish specifying our model:
lr_mod <-
logistic_reg() %>%
set_engine("glm")
We will want to use our recipes created earlier across several steps as we train and test our model. To simplify this process, we can use a model workflow, which pairs a model and recipe together.
This is a straightforward approach because different recipes are often needed for different models, so when a model and recipe are bundled, it becomes easier to train and test workflows.
We’ll use the {workflows} package from
tidymodels to bundle our parsnip model (lr_mod) with our
first recipe (lr_recipe_1).
lr_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe_1)
Now that we have a single workflow that can be used to prepare the recipe and train the model from the resulting predictors, we can use the fit() function to fit our model to our train_data. And again, we set a random number seed to ensure that if we run this same code again, we will get the same results:
set.seed(586)
lr_fit <-
lr_workflow %>%
fit(data = train_data)
This lr_fit object has the finalized recipe and fitted model objects inside. To extract the model fit from the workflow, we will use the helper function extract_fit_parsnip(). Here we pull the fitted model object and then use the broom::tidy() function to get a tidy tibble of model coefficients:
lr_fit %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.61 1.36 -1.92 0.0548
## 2 course_interest -0.0513 0.227 -0.226 0.821
## 3 overall_percent 1.56 1.16 1.35 0.177
## 4 gender_F 0.113 0.295 0.384 0.701
Among the predictors included in our model, none has a statistically significant (i.e., p-value < 0.05) relationship with being identified as at-risk according to our definition, which doesn’t bode too well for our model, but let’s proceed with testing anyway.
Now that we’ve fit our model to our training data, we’re FINALLY ready to test our model on the data we set aside in the beginning. Just to recap the steps that led to this moment, however, recall that we:
identified the model to use (lr_mod);
created a preprocessing recipe consisting of predictors and an
outcome (lr_recipe_1);
bundled the model and recipe into a workflow
(lr_workflow); and
trained our workflow using a single call
to fit().
The next step is to use the trained workflow (lr_fit) to
predict outcomes for our unseen test data, which we will do with the
function predict(). The predict() method
applies the recipe to the new data, then passes the result to the fitted
model.
predict(lr_fit, test_data)
## # A tibble: 127 × 1
## .pred_class
## <fct>
## 1 no
## 2 no
## 3 no
## 4 no
## 5 no
## 6 no
## 7 no
## 8 no
## 9 no
## 10 no
## # … with 117 more rows
Because our outcome variable at_risk here is a factor,
the output from predict() returns the predicted
class: no versus yes. On its own, though, that’s not a super useful output.
Fortunately, there is an augment() function we can use with our lr_fit model and test_data to save the test data together with its predictions.
Let’s use this function and save the result as lr_predictions:
lr_predictions <- augment(lr_fit, test_data)
Take a quick look at lr_predictions in the code chunk
below and answer the question that follows:
view(lr_predictions)
Was our model successful at predicting any students who were at-risk? How do you know?
Hint: Scroll to the end of the data frame and take a
look at our original at_risk outcome and the
.pred_class variable which shows the predicted
outcomes.
As you probably noticed, just looking at the
lr_predictions object is not the easiest way to check for
model accuracy. Fortunately, the {yardstick} package
has an accuracy() function for looking at the overall
classification accuracy, which uses the hard class predictions to
measure performance.
Hard class predictions tell us whether our model predicted yes or no for each student in the .pred_class column. Behind these hard class predictions, the model also estimates a probability for each class, and a simple 50% probability cutoff is used to categorize a student as at risk. For example, student 47448 had a .pred_no probability of 0.8854332 and a .pred_yes probability of 0.11456677, and so was classified as “no” for at_risk.
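If you’d like to see those probabilities alongside the hard class predictions, you can peek at the prediction columns that augment() added to lr_predictions earlier (an optional check):
# optional: the predicted class next to the underlying class probabilities
lr_predictions |>
  select(at_risk, .pred_class, .pred_no, .pred_yes) |>
  slice(1:5)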
Run the following code to select() the at_risk and .pred_class columns, followed by the accuracy() function, to see how frequently our predictions matched our observed data:
lr_predictions |>
select(at_risk, .pred_class) |>
accuracy(truth = at_risk, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.811
Overall it looks like our model was correct 81% of the time, which is not too bad, but as we’ll see in a second, that is not good enough to serve its intended purpose of identifying at-risk students.
Another way to check model accuracy is with the
conf_mat() function from {yardstick} for creating a
confusion matrix. Recall from our course text Learning Analytics Goes to
School that a confusion matrix is simply a 2 × 2 table that lists the
number of true-negatives, false-negatives, true-positives, and
false-positives.
Run the follow code to create a confusion matrix for our logistic regression predictions:
lr_predictions %>%
conf_mat(at_risk, .pred_class)
## Truth
## Prediction no yes
## no 103 24
## yes 0 0
As you can see, our model correctly predicted “no” for the 103 students in our test_data who were not at risk, but it also incorrectly predicted “no” for the 24 students who actually were at risk. Overall our model was 81% (103/127) accurate, but it achieved this by simply labeling every student as “no” for at risk. And since roughly 81% of students in our test set were actually “no,” this is clearly not a great prediction model.
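Accuracy alone clearly isn’t telling the whole story here. As an optional extension, the {yardstick} package also has sensitivity() and specificity() functions that report how well the model detects each class. Since “yes” is the second level of our at_risk factor, we set event_level = "second" to treat at-risk students as the class of interest (we use the longer specificity() name because readr::spec() masks yardstick::spec()):
# how often does the model catch students who actually are at risk ("yes")?
lr_predictions |>
  sensitivity(truth = at_risk, estimate = .pred_class, event_level = "second")

# and how often does it correctly clear students who are not at risk ("no")?
lr_predictions |>
  specificity(truth = at_risk, estimate = .pred_class, event_level = "second")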
Recycle our code from above to create a workflow for
lr_recipe_2 and test our “kitchen sink” model on our test
data. Hint: You can accomplish this by simply copying
and pasting and changing a single character from above.
# create workflow
lr_2_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe_2)
# set seed
set.seed(586)
# fit model to workflow
lr_2_fit <-
lr_2_workflow %>%
fit(data = train_data)
# extract model estimates
lr_2_fit %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 8 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.51 1.85 -2.43 0.0149
## 2 course_interest -0.528 0.323 -1.64 0.102
## 3 overall_percent 1.49 1.30 1.15 0.251
## 4 perceived_competence 0.224 0.283 0.792 0.428
## 5 utility_value 0.541 0.259 2.09 0.0370
## 6 variability 0.155 2.41 0.0643 0.949
## 7 n_100 0.136 0.0642 2.11 0.0344
## 8 gender_F 0.0840 0.300 0.280 0.779
# get predictions
predict(lr_2_fit, test_data)
## # A tibble: 127 × 1
## .pred_class
## <fct>
## 1 no
## 2 no
## 3 no
## 4 no
## 5 no
## 6 no
## 7 no
## 8 no
## 9 no
## 10 no
## # … with 117 more rows
lr_2_predictions <- augment(lr_2_fit, test_data)
# check overall accuracy
lr_2_predictions |>
select(at_risk, .pred_class) |>
accuracy(truth = at_risk, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.811
# create a confusion matrix
lr_2_predictions %>%
conf_mat(at_risk, .pred_class)
## Truth
## Prediction no yes
## no 103 24
## yes 0 0
Does this model perform any better? How do you know?
Random forest models are ensembles of decision trees. There is a dense amount of information required to understand what each of those terms really means. For now, however, the main thing to know is that:
One of the benefits of a random forest model is that it is very low maintenance; it requires very little preprocessing of the data and the default parameters tend to give reasonable results.
We’ll also be using the {ranger} package, which provides a fast implementation of random forest models and is particularly suited for high-dimensional data, e.g., models with a lot of predictors and features. This package supports supervised learning approaches including classification, regression, survival, and probability prediction.
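For reference, rand_forest() in {parsnip} exposes three main arguments you could adjust or tune later: mtry, trees, and min_n. The values in this sketch are purely illustrative and not the model we’ll fit below:
# illustrative only: the main rand_forest() arguments
# mtry  = number of predictors randomly sampled at each split
# trees = number of trees in the ensemble
# min_n = minimum number of data points in a node required to split it further
rand_forest(mtry = 3, trees = 1000, min_n = 10) |>
  set_engine("ranger") |>
  set_mode("classification")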
Let’s create a recipe for our course_data data that
includes all our predictors.
rf_recipe <-
recipe(at_risk ~ gender + course_interest + perceived_competence +
utility_value + variability + n_100 + overall_percent,
data = train_data)
rf_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 7
summary(rf_recipe)
## # A tibble: 8 × 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 gender nominal predictor original
## 2 course_interest numeric predictor original
## 3 perceived_competence numeric predictor original
## 4 utility_value numeric predictor original
## 5 variability numeric predictor original
## 6 n_100 numeric predictor original
## 7 overall_percent numeric predictor original
## 8 at_risk nominal outcome original
To fit a random forest model on the training set, we’ll use the {parsnip} package again
along with the ranger
engine. We’ll also include the set_mode() function to
specify our model as “classification” rather than “regression.”
Recall from our course text Learning Analytics Goes to School that supervised machine learning, or predictive modeling, involves two broad approaches: classification and regression. Classification algorithms model categorical outcomes (e.g., yes or no outcomes like our at-risk data). Regression algorithms characterize continuous outcomes (e.g., test scores).
Run the following code to create our random forest model for our training data:
rf_mod <-
rand_forest(trees = 5000) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
Now let’s combine our model and recipe to create our new random forest workflow:
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe)
And fit our model and recipe to our training data:
set.seed(586)
rf_fit <-
rf_workflow %>%
fit(data = train_data)
Gather our random forest predictions:
rf_predictions <- augment(rf_fit, test_data)
rf_predictions
## # A tibble: 127 × 13
## studen…¹ cours…² at_risk gender cours…³ perce…⁴ utili…⁵ overa…⁶ varia…⁷ n_100
## <dbl> <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 48797 PhysA-… no F 3.8 3.5 3.5 0.905 0.286 9
## 2 53447 FrScA-… no F 4.2 3 2.67 0.702 0.221 5
## 3 54066 OcnA-S… no M 4.4 4 5 0.848 0.307 6
## 4 55140 PhysA-… no F 3.8 3 2.67 0.772 0.275 6
## 5 57981 OcnA-S… yes F 4.2 3 3.67 0.896 0.118 10
## 6 58168 AnPhA-… no F 5 5 4.67 0.871 0.242 4
## 7 62157 FrScA-… no F 4.5 3.5 3.33 0.859 0.162 7
## 8 62175 OcnA-S… no M 4 4 3.33 0.842 0.356 5
## 9 62752 PhysA-… no F 4.2 2.5 2.33 0.834 0.120 11
## 10 64930 FrScA-… no F 4 3 3.67 0.792 0.151 7
## # … with 117 more rows, 3 more variables: .pred_class <fct>, .pred_no <dbl>,
## # .pred_yes <dbl>, and abbreviated variable names ¹student_id, ²course_id,
## # ³course_interest, ⁴perceived_competence, ⁵utility_value, ⁶overall_percent,
## # ⁷variability
And check our model’s overall accuracy:
rf_predictions %>%
select(at_risk, .pred_class) %>%
accuracy(truth = at_risk, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.803
And create a confusion matrix to see where it did and did not make accurate predictions:
rf_predictions |>
conf_mat(at_risk, .pred_class)
## Truth
## Prediction no yes
## no 101 23
## yes 2 1
Unfortunately, our random forest model wasn’t much better at predicting students who were defined as “at risk.”
Even more unfortunate, if this had been a real-world situation, only 1 of the 24 students who failed the course would have received additional support. Clearly we’d need to build a better model. In this case, more data would definitely help, but it’s likely we’re missing some important information about students that is predictive of their success.
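If you’re curious which predictors the random forest leaned on most, you can pull the underlying ranger model out of the fitted workflow and look at its impurity-based importance scores; this works because we set importance = "impurity" when we specified the engine. This is an optional check:
# optional: impurity-based variable importance from the fitted ranger model
rf_fit |>
  extract_fit_engine() |>     # the underlying ranger object
  ranger::importance() |>     # named vector of importance scores
  sort(decreasing = TRUE)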
Try creating your own recipe and logistic regression or random forest model and see how it performs against the three that we just created. Feel free to experiment with variables we excluded, like specific survey items. There is also a file in your data folder called sci-mo-with-text.csv that includes data about students’ participation in discussion forums, including the number of posts and psychometric properties of the language they used. You can read more about that data in Chapter 14 of DSIEUR.
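As a starting point, here is a sketch of how you might read in that file and take a first look; how you join it to course_data will depend on which identifier columns it shares with our other data, so inspect it before relying on any join:
# a sketch only: read in the discussion-forum data described above
text_data <- read_csv("data/sci-mo-with-text.csv")
glimpse(text_data)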
I recommend creating and using an R script to develop your model. Use the code chunk below to record your final model:
course_data_3 <- processed_data %>%
mutate(at_risk = if_else(final_grade >= 66.7, "no", "yes")) %>%
mutate(gender = as_factor(gender), at_risk = as_factor(at_risk)) %>%
select(student_id, course_id, at_risk, gender, q1, q2, q3, q4, q5, q6,
q7, q8, q9, q10) %>%
drop_na()
course_data_3 %>%
count(at_risk) %>%
mutate(proportion = n/sum(n))
## # A tibble: 2 × 3
## at_risk n proportion
## <fct> <int> <dbl>
## 1 no 361 0.813
## 2 yes 83 0.187
course_data_3 <- inner_join(course_data_3, grade_book)
## Joining, by = c("student_id", "course_id")
set.seed(123)
course_split_3 <- initial_split(course_data_3, strata = at_risk)
train_data_3 <- training(course_split_3)
test_data_3 <- testing(course_split_3)
lr_recipe_3 <- recipe(at_risk ~ course_id + gender + q1 + q2 + q3 + q4 + q5 +
    q6 + q7 + q8 + q9 + q10, data = train_data_3) %>%
step_dummy(all_nominal_predictors())
lr_workflow_3 <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe_3)
set.seed(123)
lr_fit_3 <-
lr_workflow_3 %>%
fit(data = train_data_3)
lr_fit_3 %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 24 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.48 0.736 -3.38 0.000735
## 2 course_id_AnPhA.S116.02 1.70 0.912 1.86 0.0630
## 3 course_id_AnPhA.S216.01 1.93 0.859 2.24 0.0250
## 4 course_id_AnPhA.S216.02 1.79 1.14 1.58 0.115
## 5 course_id_AnPhA.T116.01 -15.1 1769. -0.00852 0.993
## 6 course_id_BioA.S116.01 2.23 0.892 2.50 0.0123
## 7 course_id_BioA.S216.01 3.18 1.43 2.22 0.0261
## 8 course_id_BioA.T116.01 -15.1 3956. -0.00381 0.997
## 9 course_id_FrScA.S116.01 0.762 0.882 0.864 0.387
## 10 course_id_FrScA.S116.02 2.08 1.17 1.77 0.0762
## # … with 14 more rows
lr_predictions_3 <- augment(lr_fit_3, test_data_3)
lr_predictions_3 %>%
select(at_risk, .pred_class) %>%
accuracy(truth = at_risk, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.795
lr_predictions_3 %>%
conf_mat(at_risk, .pred_class)
## Truth
## Prediction no yes
## no 89 21
## yes 2 0
Now answer the following questions:
How did your model do compared to others?
What information about students that is available prior to the course start might be useful for improving our model?
In this case study, we focused on applying some basic machine learning techniques to help us understand how a predictive model used in early warning systems might actually be developed and tested. Specifically, we made a very crude first attempt at developing models that, as it turned out, were not terribly great at accurately predicting whether a student is likely to pass or fail an online course.
Below, add a few notes in response to the following prompts:
One thing I took away from this learning lab that I found especially useful:
One thing I want to learn more about:
To “turn in” your work, you can click the “Knit” icon at the top of the file, or click the dropdown arrow next to it and select “Knit to HTML”. This will create a report in your Files pane that serves as a record of your completed assignment and that can be opened in a browser or shared on the web.