Anshul Kumar
Jan 10 2020
This presentation introduces one question that predictive analytic methods can help us answer, using example student data and hypothetical analysis methods.
Similar predictive methods can also be applied to a broad range of questions with different forms and goals.
Link to these slides: tinyurl.com/ClassificationExample2020 (not case sensitive)
| Key you can press | What should happen when you press it |
|---|---|
| A | Toggle between seeing all slides and one slide at a time |
| S | Make everything on the slide smaller |
| B | Make everything on the slide bigger |
BE SURE TO PAUSE THE VIDEO ANY TIME YOU FEEL LIKE IT.
By the end of this presentation, our goals are to:
Identify the types of questions that predictive analytics methods can help us answer in education.
Build intuition about how machine learning algorithms can help us predict group membership using systematically organized data (also known as classification).
I really like to eat noodles and pasta. It’s January 10 2020 and the weather is really cold here in Boston. A perfect day for hot noodles. I cooked some noodles today in a pot of boiling water.
But now I have a problem to solve:
(Right now the noodles and water are together in the pot)
Our goal:
Keep all of the noodles.
Discard all of the water.
To be clear: Our goal is to take EVERYTHING IN THE POT and separate it into two groups:
KEEP and DISCARD.
We will have to use some kind of sorting mechanism to separate the noodles and water, both of which are currently mixed together in the pot.
This is the sorting mechanism we’re going to use. We’ll pour the noodles and water from the pot into a colander filter.
This is called colander filtering.
In plain words:
In predictive analytic / machine learning words:
In addition to liking noodles, I am also an educator and administrator in an educational program.
Here are some key details:
I have a problem to solve:
One-year program timeline for students:
Here’s what we do and don’t know for each cohort:
| Cohort | Fall term grades | Spring term grades | Final exam results |
|---|---|---|---|
| C1: 2018–19 | Yes | Yes | Yes |
| C2: 2019–20 | Yes | No | No |
Remember:
PAUSE TO MAKE SURE THIS MAKES SENSE, BEFORE YOU CONTINUE.
Our goal is to predict final exam results for C2, using only their fall term grades.
To be clear: Our goal is to take all of the students in C2 and sort them into two groups:
Predicted to pass final exam
and
Predicted to fail final exam (at-risk students)
Right now it’s Jan 10 2020, in between fall and spring terms.
The final exam for C2 students is not until May, in 5 months.
If we can identify RIGHT NOW who we think will fail in five months, we can give them extra remedial support and prevent them from failing!
Just like with the noodles, we will have to use some kind of sorting mechanism to separate the students in C2 predicted to fail from the students predicted to pass, all of whom are halfway through the one-year program at this point.
Just imagine that you’re “pouring” students into a colander filter that will catch the at-risk students and let the other ones pass through.
Noodles:
Students:
Use (complete) C1 data to look for patterns.
Apply what we learned from C1 to make predictions on C2.
What we already know:
What we want to know:
We want to exploit the completeness of the C1 student data to predict something unknown about the students in C2.
We ignore the spring 2019 grades for C1 because we don’t have them for C2, so we don’t want to use them to train (calibrate) our predictive model.
Now it’s time to look more closely at the data we have for C1 and C2.
See Excel data spreadsheet in video.
Each row in the data is a student.
We have separate data sheets for C1 and C2.
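The spreadsheet itself only appears in the video. If you are curious what loading this kind of layout might look like in Python with pandas, here is a minimal sketch; the file name, sheet names, and column descriptions are assumptions for illustration, not the real spreadsheet's.

```python
import pandas as pd

# Hypothetical workbook with one sheet per cohort; each row is one student.
c1 = pd.read_excel("cohort_data.xlsx", sheet_name="C1")  # fall 2018 grades, spring 2019 grades, final exam result
c2 = pd.read_excel("cohort_data.xlsx", sheet_name="C2")  # fall 2019 grades only

print(c1.head())  # inspect the first few C1 students
print(c2.head())  # inspect the first few C2 students
```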
We give the fall 2018 data and the final exam results for 75 C1 students to the computer.
The computer applies the SVM (support vector machine) algorithm to these 75 students to “learn” and look for patterns in this data.
We won’t go over the technical details of SVM today.
Remember the 25 C1 students we left out? Now we use them. They were not used to train/calibrate the model. So now we can pretend that these 25 C1 students are actually C2 students about whom we want to make predictions. Very sneaky!
But we know the final exam grades of these 25 students. We are keeping them secret from the computer though.
These 25 students from C1 who we left out of the calibration process are called the testing data. We know the final grades for the testing data students but we leave them out of the model calibration anyway, specifically so that we can use them to test how well our model works.
We will compare the predicted final exam grades to the true final exam grades in the testing data.
Pause and go back if needed.
Here’s what we are doing:
Goal: predict whether C2 students will pass or fail, using patterns in data from C1 students.
Randomly separate the students into two datasets: 75 students in training data, 25 students in testing data.
Use all fall grades (independent variables) and final grades (dependent variable) of training dataset students to train a machine learning model.
Plug the testing dataset students’ fall grades (independent variables) into the machine learning model to see if it predicts whether they passed or failed. Compare these predictions to what actually happened (which we know because we actually have their final results and we’re just pretending that we don’t).
If the accuracy of the predictions (the success rate) is good enough, use the same machine learning model to predict final grades for C2 students (for whom we do not have the actual final grades).
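For readers who want to see what steps 2 through 4 could look like in code, here is a minimal sketch using Python's scikit-learn, continuing the loading sketch above. The column names (`fall_course1` through `fall_course3`, `final_exam_passed`) and the 1 = pass / 0 = fail coding are assumptions for illustration, not the actual analysis from the video.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# Independent variables: fall term grades. Dependent variable: final exam
# result, coded 1 = passed, 0 = failed (hypothetical column names).
X = c1[["fall_course1", "fall_course2", "fall_course3"]]
y = c1["final_exam_passed"]

# Step 2: randomly split the 100 C1 students into 75 training and 25 testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=75, test_size=25, random_state=1
)

# Step 3: train (calibrate) the SVM on the 75 training students only.
svm_model = SVC()
svm_model.fit(X_train, y_train)

# Step 4: predict for the 25 held-out testing students and compare to what
# really happened (scikit-learn's matrix puts actual results in rows and
# predictions in columns).
svm_predictions = svm_model.predict(X_test)
print(confusion_matrix(y_test, svm_predictions))
print(accuracy_score(y_test, svm_predictions))
```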
Look at the spreadsheet in the video.
What we wanted:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 3 | 0 |
| Predicted to pass | 0 | 22 |
What we actually got:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 2 | 6 |
| Predicted to pass | 1 | 16 |
Above:
The accuracy (success rate) of the sorting mechanism is the number of correct predictions divided by the total number of students.
\[\text{accuracy} = \frac{\text{correctly classified students}}{\text{all students}} = \frac{\text{true negatives + true positives}}{\text{total students}}\]
\[\text{ideal desired accuracy} = \frac{3+22}{25} = 1\]
\[\text{Sorting mechanism #1 (SVM) actual accuracy} = \frac{2 + 16}{25} = 0.72\]
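The same arithmetic, written out as a quick check using the counts from the SVM table above:

```python
# Confusion matrix counts from the SVM testing results above.
predicted_fail_and_failed = 2    # true positives
predicted_fail_but_passed = 6    # false positives
predicted_pass_but_failed = 1    # false negatives
predicted_pass_and_passed = 16   # true negatives

total_students = 25
accuracy = (predicted_fail_and_failed + predicted_pass_and_passed) / total_students
print(accuracy)  # 0.72
```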
We give the fall 2018 data and the final exam results for 75 C1 students to the computer.
The computer applies the Random Forest algorithm to these 75 students to “learn” and look for patterns in this data.
We won’t go over the technical details of Random Forest today. It looks for patterns differently than SVM does.
Here’s what we are doing again:
Goal: predict whether C2 students will pass or fail, using data from C1 students.
Randomly separate the students into two datasets: 75 students in training data, 25 students in testing data.
Use all fall grades (independent variables) and final grades (dependent variable) of training dataset students to train a machine learning model.
Plug the testing dataset students’ fall grades (independent variables) into the machine learning model to see if it predicts whether they passed or failed. Compare these predictions to what actually happened (which we know because we actually have their final results and we’re just pretending that we don’t).
If the accuracy of the predictions (the success rate) is good enough, use the same machine learning model to predict final grades for C2 students (for whom we do not have the actual final grades).
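In code, the only thing that would change from the SVM sketch earlier is the algorithm itself; the training/testing split and the comparison against the true results stay the same. A sketch, reusing the variables from that earlier (assumed) setup:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Reuse X_train, X_test, y_train, y_test from the SVM sketch above;
# only the sorting mechanism changes.
forest_model = RandomForestClassifier(random_state=1)
forest_model.fit(X_train, y_train)

forest_predictions = forest_model.predict(X_test)
print(confusion_matrix(y_test, forest_predictions))
print(accuracy_score(y_test, forest_predictions))
```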
What we wanted:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 3 | 0 |
| Predicted to pass | 0 | 22 |
What we actually got:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 3 | 8 |
| Predicted to pass | 0 | 14 |
Above:
The accuracy (success rate) of the sorting mechanism is the number of correct predictions divided by the total number of students.
\[\text{accuracy} = \frac{\text{correctly classified students}}{\text{all students}} = \frac{\text{true negatives + true positives}}{\text{total students}}\]
\[\text{ideal desired accuracy} = \frac{3+22}{25} = 1\]
\[\text{Sorting mechanism #2 (Random Forest) actual accuracy} = \frac{3 + 14}{25} = 0.68\]
Before we declare which sorting mechanism is best, let’s review what we’re going to use the best sorting mechanism to do:
Using the trained predictive model that made the best predictions on the testing dataset (from C1), we will make predictions about C2 students’ final exam results.
We will provide remedial support for all students in C2 predicted to fail.
We are trying to create a safety net for early detection of students at risk of failing.
Compare predictions made by each sorting mechanism on the 25 testing students from C1:
SVM:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 2 | 6 |
| Predicted to pass | 1 | 16 |
Random Forest:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 3 | 8 |
| Predicted to pass | 0 | 14 |
Criteria to consider while picking the best one:
Which and how many students will we remediate in each case? Do we have the resources to do so?
There is a trade-off between missing at-risk students (false negatives) and flagging students who would have passed anyway (false positives), which drives up the number of students requiring remediation.
Compare predictions made by each sorting mechanism on the 25 testing students from C1 (with hypothetical C2 numbers in parentheses):
SVM:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 2 (8) | 6 (24) |
| Predicted to pass | 1 (4) | 16 (64) |
Random Forest:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 3 (12) | 8 (32) |
| Predicted to pass | 0 (0 or 1) | 14 (56) |
Criteria to consider while picking the best one:
Which and how many students will we remediate in each case? Do we have the resources to do so?
In this example (not always), there is a trade-off between missing at-risk students (false negatives) and flagging students who would have passed anyway (false positives), which drives up the number of students requiring remediation.
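One way to make this trade-off concrete is to tally, for each sorting mechanism, how many at-risk students it misses and how many students it flags for remediation. A small sketch using the testing-data counts from the tables above:

```python
# Each entry: (predicted-fail & failed, predicted-fail & passed,
#              predicted-pass & failed, predicted-pass & passed)
testing_results = {
    "SVM": (2, 6, 1, 16),
    "Random Forest": (3, 8, 0, 14),
}

for name, (tp, fp, fn, tn) in testing_results.items():
    flagged_for_remediation = tp + fp   # everyone predicted to fail
    missed_at_risk = fn                 # failing students the model let through
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    print(f"{name}: flags {flagged_for_remediation} students, "
          f"misses {missed_at_risk} at-risk students, accuracy {accuracy:.2f}")
```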
In plain words:
Today, on January 10 2020, we “poured” all of our C2 students into a filter (a sorting mechanism).
The filter allowed students who are not at risk of failing to pass through. But it did not allow students who ARE at risk of failing to pass through; it caught and retained them.
We can now give remedial support to the students who were caught by the filter.
In predictive analytics / machine learning terms:
Using complete previous data (from C1), we made a prediction on incomplete current data (from C2). We classified C2 students into two classes: those predicted to pass and those predicted to fail (at-risk students).
Using predictive analytics, we created an early warning system to make predictions of who in C2 would pass and fail the final exam, five months in advance of the exam happening.
We can’t know for sure how accurate our predictions are for C2; but we have a sense for how accurate they might be, given our predictions and accuracy calculations using the 25-person testing dataset from C1.
We can now give remedial support to the students predicted to fail by the analytic models.
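In code, this last step amounts to feeding the C2 students’ fall grades into whichever trained model we chose. A sketch that continues the earlier ones, with the same assumed column names and 1 = pass / 0 = fail coding (here I use the random forest, but the chosen model could be either one):

```python
# Fall term grades for the current cohort. Their final exam results do not
# exist yet, so there is nothing to compare against; we can only predict.
X_c2 = c2[["fall_course1", "fall_course2", "fall_course3"]]

# Classify every C2 student with the chosen trained model.
c2_predictions = forest_model.predict(X_c2)

# Flag the students predicted to fail so they can be offered remedial support.
at_risk_students = c2[c2_predictions == 0]
print(at_risk_students)
```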
Most important:
Independent variables: These are all of the data that we are using to make a prediction. In our case, these are all of the fall term grades.
Dependent variable: This is what you are trying to predict. In our case, this is the final exam result (pass or fail).
Machine learning: Machine learning is a group of analysis techniques that help us do predictive analytics. Machine learning and predictive analytics are subsets of artificial intelligence and statistical analysis.
Optional:
We are using supervised machine learning in this example.
All of the algorithms above use different statistical and/or algorithmic approaches to predict classes (outcome categories) into which our observations (rows of data) fall.
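In scikit-learn, for example, classification algorithms share the same fit/predict interface, so trying several sorting mechanisms on the same training and testing data is straightforward. A sketch reusing the earlier (assumed) split; the extra algorithms shown here (logistic regression, k-nearest neighbors) are just common examples, not necessarily the ones listed on the slide:

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Each algorithm looks for patterns differently, but the workflow is identical.
candidates = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: testing accuracy = {score:.2f}")
```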
Generalizability: If C1 and C2 are very different from each other in any way, our predictions for C2’s final grades won’t be accurate. Example: if C2 WILL be affected by COVID-19 disruptions but C1 WASN’T, our predictions could be inaccurate. The computer doesn’t know about COVID-19; it only knows what we feed into it.
In educational analytics, it’s almost impossible to be 100% accurate. Students are not noodles. You should always compare results of your analytics to other sources of information as well as your own intuition and reasoning.
Predictive analytics is as much an art as it is a science. Correctly calibrating a machine learning algorithm for your specific situation can take time and effort. This presentation is a simplified summary of the entire process, just to introduce the concepts. The trade-off between SVM and Random Forest illustrated earlier in this presentation is an example of how human judgment and involvement are critical to predictive analytics.
Quantitative analysis, the topic of this presentation, can often be paired well with qualitative analysis to achieve your goal. Example: In this case, our goal is to minimize the number of students who fail the final exam. In addition to doing our predictive analysis, we could supplement it by qualitatively interviewing a subset of students who passed and failed the final exam in C1. We can ask them which study techniques they used. We can then compare the study techniques used by the students who passed with those of the students who failed. We can then recommend the successful study techniques to the C2 students, or even build them into the spring curriculum.
Analytics should be used only when they are useful to your work. This may not always be the case. Example: I never use a colander because I find it difficult to clean. For me, the benefits of the colander’s ability to filter well are outweighed by the extra cost of extra cleaning work.
This presentation was actually created in June 2020, but I’m pretending that the date is January 2020 because this makes the examples more logical.
Assistance for this presentation came from a number of people in the HPEd and PA programs at MGHIHP.
This presentation was created for the students of the MS and PhD programs in HPEd (health professions education) at MGHIHP.
The data and results used in this presentation were all fabricated for illustrative purposes. But they are reflective of true examples.
Here are some questions to consider:
How can predictive/learning analytics help you in your own work as an educator or at your organization/institution? What predictions would be useful for you to make?
How can you leverage data that your institution already collects (or that it is well-positioned to collect) using predictive analytic methods?
What would be the benefits and detriments of incorporating predictive analytics into your institution’s practices and processes?
Could predictive analysis complement any already-ongoing initiatives at your institution?
What would the ethical implications be of using predictive analytics at your institution? Would it cause unfair discrimination against particular learners? Would it help level the playing field for all learners?