Anshul Kumar
Version – 19 June 2020
Link to video that goes with this presentation: https://youtu.be/aql1u3Wi8Dg
Link to these slides: https://rpubs.com/AnshulKumar/ClassifyStudents1 (not case sensitive)
Most of the R code used to make this document can be found here: https://rpubs.com/anshulkumar/EducAnalytics1
BE SURE TO PAUSE THE VIDEO ANY TIME YOU FEEL LIKE IT.
By the end of this presentation, our goals are to:
Identify the types of questions that predictive analytics methods can help us answer in education.
Examine real results from a predictive analytic method and brainstorm about how we would use the results.
Build intuition about how machine learning algorithms can help us predict group membership (also known as classification) using systematically organized data.
We have spaghetti and water in a pot. We need to separate them.
We separate them with some kind of sorting mechanism, a colander filter in this case.
In machine learning terms, the colander filter solves a classification problem for us: it “classifies” (sorts) the contents of the pot into two groups.
Our goal is to make a sorting mechanism on the computer to classify (separate from each other) students at risk of failing and those not at risk.
We want to “pour” all of the students into this computerized “filter” and see who it “catches” and identifies as at-risk.
This presentation uses an example from:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. http://www3.dsi.uminho.pt/pcortez/student.pdf.
Download the data from: Student Performance Data Set. UCI Machine Learning Repository. Center for Machine Learning and Intelligent Systems. https://archive.ics.uci.edu/ml/datasets/Student+Performance. Download the file student.zip, then open the file student-por.csv to see the data.
Also available at kaggle.com: https://www.kaggle.com/larsen0966/student-performance-data-set
Imagine we are teaching a large course with many students…
How can we improve student outcomes and support our students better?
See examples in the video of LAST YEAR’s and THIS YEAR’s data.
n = 649 students in one Portuguese language course LAST YEAR.
\(G3 \geq 10\) means passing final grade
Order: green, black, red
Using…
…Can we reduce the number of students who fail THIS YEAR?
| Attribute | Description (Domain) |
|---|---|
| sex | student’s sex (binary: female or male) |
| age | student’s age (numeric: from 15 to 22) |
| school | student’s school (binary: Gabriel Pereira or Mousinho da Silveira) |
| address | student’s home address type (binary: urban or rural) |
| Pstatus | parent’s cohabitation status (binary: living together or apart) |
| Medu | mother’s education (numeric: from 0 to 4 (a)) |
| Mjob | mother’s job (nominal (b)) |
| Fedu | father’s education (numeric: from 0 to 4 (a)) |
| Fjob | father’s job (nominal (b)) |
| guardian | student’s guardian (nominal: mother, father or other) |
| famsize | family size (binary: ≤ 3 or > 3) |
| famrel | quality of family relationships (numeric: from 1 – very bad to 5 – excellent) |
| reason | reason to choose this school (nominal: close to home, school reputation, course preference or other) |
| traveltime | home to school travel time (numeric: 1 – < 15 min., 2 – 15 to 30 min., 3 – 30 min. to 1 hour or 4 – > 1 hour) |
| studytime | weekly study time (numeric: 1 – < 2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours or 4 – > 10 hours) |
| failures | number of past class failures (numeric: n if 1 ≤ n < 3, else 4) |
| schoolsup | extra educational school support (binary: yes or no) |
| famsup | family educational support (binary: yes or no) |
| activities | extra-curricular activities (binary: yes or no) |
| paidclass | extra paid classes (binary: yes or no) |
| internet | Internet access at home (binary: yes or no) |
| nursery | attended nursery school (binary: yes or no) |
| higher | wants to take higher education (binary: yes or no) |
| romantic | with a romantic relationship (binary: yes or no) |
| freetime | free time after school (numeric: from 1 – very low to 5 – very high) |
| goout | going out with friends (numeric: from 1 – very low to 5 – very high) |
| Walc | weekend alcohol consumption (numeric: from 1 – very low to 5 – very high) |
| Dalc | workday alcohol consumption (numeric: from 1 – very low to 5 – very high) |
| health | current health status (numeric: from 1 – very bad to 5 – very good) |
| absences | number of school absences (numeric: from 0 to 93) |
| G1 | first period grade (numeric: from 0 to 20) |
| G2 | second period grade (numeric: from 0 to 20) |
| G3 | final grade (numeric: from 0 to 20) |
Notes:
Source: Table 1 in the Cortez & Silva article (p. 3 of 8).
Here’s what we do and don’t know for each cohort:
| Cohort | Demographic information | Mid-term grades | Final grades |
|---|---|---|---|
| Last year | Yes | Yes | Yes |
| This year | Yes | Yes | No |
LAST YEAR’s data is complete. THIS YEAR’s data is incomplete.
Our procedure:
Look for patterns in LAST YEAR’s data.
Predict final exam results for THIS YEAR’s students, using only their demographic information and mid-term grades.
Provide extra support to at-risk students in THIS YEAR’s cohort.
We need a sorting mechanism that we can “pour” our students into.
The sorting mechanism will tell us who is predicted to fail the course (at-risk students) and who is predicted to pass.
Spaghetti:
Students:
Important trick!
Randomly select 75% of students for training and 25% for testing.
Before we trust, we must test.
Hide 25% of the data from LAST YEAR, then use it to double-check.
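The 75% / 25% split can be sketched in a few lines of Python. This is only an illustration, not the code used for these slides: the function name `split_train_test`, the placeholder student IDs, and the seed value are all assumptions made for the example.

```python
import random


def split_train_test(student_ids, train_fraction=0.75, seed=2020):
    """Randomly split student IDs into a training list and a testing list."""
    shuffled = list(student_ids)
    # A fixed seed (an arbitrary choice here) makes the random split repeatable.
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs 1..649 stand in for LAST YEAR's 649 students.
student_ids = range(1, 650)
train_ids, test_ids = split_train_test(student_ids)
print(len(train_ids), len(test_ids))
```

With 649 students, this gives 487 students for training and 162 hidden students for testing, the same group sizes used in this presentation.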
We give LAST YEAR’s data for 487 students to the computer.
The computer uses these students’ data to build a decision tree that predicts each student’s final grade (G3).
| Feature | Importance |
|---|---|
| G1 | 1.5531560 |
| failures | 0.5927510 |
| age | 0.5624101 |
| absences | 0.2821250 |
| school | 0.2774309 |
| higher | 0.2386127 |
| Dalc | 0.2171346 |
| Mjob | 0.1712615 |
| Walc | 0.1241261 |
| schoolsup | 0.1174620 |
| Medu | 0.0997061 |
| activities | 0.0802697 |
| reason | 0.0753583 |
| Fedu | 0.0465667 |
| studytime | 0.0334740 |
| sex | 0.0000000 |
| address | 0.0000000 |
| famsize | 0.0000000 |
| Pstatus | 0.0000000 |
| Fjob | 0.0000000 |
| guardian | 0.0000000 |
| traveltime | 0.0000000 |
| famsup | 0.0000000 |
| paid | 0.0000000 |
| nursery | 0.0000000 |
| internet | 0.0000000 |
| romantic | 0.0000000 |
| famrel | 0.0000000 |
| freetime | 0.0000000 |
| goout | 0.0000000 |
| health | 0.0000000 |
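To build intuition for why some features matter more than others, here is a minimal sketch of the single step that a decision tree repeats many times: trying every possible cut point on one feature and keeping the cut that best separates high final grades from low ones. The six students below are made up for illustration, and `best_split` is a hypothetical helper, not the algorithm or software behind the table above.

```python
def mean(values):
    return sum(values) / len(values)


def sum_squared_error(values):
    """How spread out a group of grades is around its own average."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, feature, target):
    """Try each cut point on one feature; keep the one with the least leftover error.

    Returns (threshold, error, average target below the cut, average target at or above it).
    """
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r[target] for r in rows if r[feature] < threshold]
        right = [r[target] for r in rows if r[feature] >= threshold]
        error = sum_squared_error(left) + sum_squared_error(right)
        if best is None or error < best[1]:
            best = (threshold, error, mean(left), mean(right))
    return best


# Made-up students (NOT from the real data set), for illustration only.
rows = [
    {"G1": 6, "absences": 10, "G3": 7},
    {"G1": 8, "absences": 2, "G3": 8},
    {"G1": 9, "absences": 6, "G3": 9},
    {"G1": 11, "absences": 0, "G3": 12},
    {"G1": 13, "absences": 4, "G3": 13},
    {"G1": 15, "absences": 8, "G3": 15},
]

g1_split = best_split(rows, "G1", "G3")
absences_split = best_split(rows, "absences", "G3")
print("G1 split:", g1_split)
print("absences split:", absences_split)
```

In this toy example, cutting on G1 leaves far less error than cutting on absences, so a tree would choose G1 first. Importance scores like the ones in the table are commonly based on this kind of error reduction, although the exact formula depends on the software used.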
162 students from LAST YEAR were not used to create the decision tree model. We know their final grades.
Model testing procedure:
Pretend that these students were not part of LAST YEAR’s cohort.
Ask the computer to predict their final grades.
But wait! We actually know their final grades.
Compare true final grades to predicted final grades.
If the predictions were close enough, use this model to make predictions for THIS YEAR’s students.
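The comparison step in the procedure above can be sketched as follows. The grades are made up for illustration, and `confusion_counts` and its category names are hypothetical. The only facts taken from this presentation are that a final grade of 10 or more is a pass and that a prediction below the cutoff counts as "predicted to fail".

```python
PASSING_GRADE = 10  # G3 >= 10 means a passing final grade


def confusion_counts(actual_grades, predicted_grades, cutoff=10):
    """Sort each test student into one of four groups by comparing truth to prediction."""
    counts = {"caught": 0, "false_alarm": 0, "missed": 0, "cleared": 0}
    for actual, predicted in zip(actual_grades, predicted_grades):
        actually_failed = actual < PASSING_GRADE
        predicted_to_fail = predicted < cutoff
        if predicted_to_fail and actually_failed:
            counts["caught"] += 1       # predicted to fail, actually failed
        elif predicted_to_fail:
            counts["false_alarm"] += 1  # predicted to fail, actually passed
        elif actually_failed:
            counts["missed"] += 1       # predicted to pass, actually failed
        else:
            counts["cleared"] += 1      # predicted to pass, actually passed
    return counts


# Made-up true and predicted final grades for six hidden test students.
actual = [8, 12, 9, 15, 10, 11]
predicted = [9.1, 12.4, 10.3, 14.0, 9.5, 11.0]

counts_at_10 = confusion_counts(actual, predicted, cutoff=10)
print(counts_at_10)
```

The four counts fill the four cells of a table with predictions as rows and true outcomes as columns.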
Look at the spreadsheet in the video.
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 24 | 12 |
| Predicted to pass | 5 | 121 |
Out of the 162 students used to test the model:
\[\text{accuracy} = \frac{\text{correct predictions}}{\text{total predictions attempted}} = \frac{24+121}{162} = .895\]
\[\text{students to remediate} = 24+12 = 36\]
\[\text{students who fell through the cracks} = 5\]
Pass/fail cutoff for predictions: 10
PAUSE TO REVIEW IF YOU WANT
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 29 | 56 |
| Predicted to pass | 0 | 77 |
\[\text{accuracy} = \frac{\text{correct predictions}}{\text{total predictions attempted}} = \frac{29+77}{162} = .654\]
\[\text{students to remediate} = 29+56 = 85\]
\[\text{students who fell through the cracks} = 0\]
Pass/fail cutoff for predictions: 11.5
PAUSE TO REVIEW IF YOU WANT
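The three numbers reported for each cutoff can be recomputed directly from the four cells of each table. The sketch below uses the counts shown in this presentation. The function name `summarize` and its argument names are assumptions made for the example.

```python
def summarize(caught, false_alarms, missed, cleared):
    """Compute the slide metrics from the four cells of a prediction table.

    caught       = predicted to fail, actually failed
    false_alarms = predicted to fail, actually passed
    missed       = predicted to pass, actually failed
    cleared      = predicted to pass, actually passed
    """
    total = caught + false_alarms + missed + cleared
    return {
        "accuracy": round((caught + cleared) / total, 3),
        "students_to_remediate": caught + false_alarms,
        "fell_through_the_cracks": missed,
    }


cutoff_10 = summarize(caught=24, false_alarms=12, missed=5, cleared=121)
cutoff_11_5 = summarize(caught=29, false_alarms=56, missed=0, cleared=77)
print("cutoff 10:  ", cutoff_10)
print("cutoff 11.5:", cutoff_11_5)
```

Raising the cutoff from 10 to 11.5 catches all 29 students who failed, but accuracy drops from .895 to .654 and the number of students to remediate rises from 36 to 85.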
Decision tree prediction, cutoff = 10:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 24 | 12 |
| Predicted to pass | 5 | 121 |
Metrics:
Decision tree prediction, cutoff = 11.5:
| | Actually failed | Actually passed |
|---|---|---|
| Predicted to fail | 29 | 56 |
| Predicted to pass | 0 | 77 |
Metrics:
If the predictive model failed our tests (unfavorable metrics):
If the predictive model passed our tests (favorable metrics):
| | ?? |
|---|---|
| Predicted to fail | 33 |
| Predicted to pass | 567 |
Above:
Depends on your needs and resources
Be sure to combine with other sources of information
The PA and HPEd programs at MGHIHP are currently working together to apply this to a project aimed at improving student outcomes in the PA program.
Goal: Increase students’ chances of passing the PANCE (Physician Assistant National Certifying Exam) in cohorts graduating in 2020 and 2021.
Data: All data from TBL-based curriculum for cohorts graduating 2017–2019.
TBL = team based learning. Data is collected on student progress very frequently.
Independent variables: These are all of the data that we are using to make a prediction. In our case, these are all of the data other than G3.
Dependent variable: This is what you are trying to predict, the outcome you care about. In our case, this is the final exam grade, G3.
Machine learning: Machine learning is a group of analysis techniques that help us do predictive analytics. Machine learning and predictive analytics are subsets of artificial intelligence and statistical analysis.
There are many machine learning methods that were not demonstrated in this presentation. They include logistic regression, k-nearest neighbors, random forest, Naive Bayes, and support vector machines. All use different strategies to divide up the data and make predictions. In all cases, we still train and test a predictive model before using it to make new predictions.
Generalizability: If LAST YEAR’s and THIS YEAR’s cohorts differ from each other in important ways, our predictions for THIS YEAR’s final grades may not be accurate.
Predictive analytics is as much an art as it is a science. Correctly calibrating a machine learning algorithm for your specific situation can take time and effort. This presentation is a simplified summary of the entire process, just to introduce the concepts. Example: Choosing whether to use a prediction cutoff of 10, 11.5, or something else requires user input to make the prediction useful.
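One way to explore the choice of cutoff is to try several values and watch the trade-off. The sketch below uses made-up grades for eight test students, and the function name `sweep_cutoffs` is hypothetical. It counts how many students would be flagged for support and how many failing students would be missed at each cutoff.

```python
PASSING_GRADE = 10  # G3 >= 10 means a passing final grade


def sweep_cutoffs(actual_grades, predicted_grades, cutoffs):
    """For each cutoff, return (cutoff, students flagged, failing students missed)."""
    results = []
    pairs = list(zip(actual_grades, predicted_grades))
    for cutoff in cutoffs:
        flagged = sum(1 for _, predicted in pairs if predicted < cutoff)
        missed = sum(
            1
            for actual, predicted in pairs
            if actual < PASSING_GRADE and predicted >= cutoff
        )
        results.append((cutoff, flagged, missed))
    return results


# Made-up true and predicted final grades, for illustration only.
actual = [7, 9, 9, 10, 11, 12, 13, 15]
predicted = [8.0, 9.5, 10.8, 10.2, 11.2, 11.9, 13.1, 14.6]

trade_off = sweep_cutoffs(actual, predicted, cutoffs=[10, 11, 11.5, 12])
for cutoff, flagged, missed in trade_off:
    print(f"cutoff {cutoff}: flag {flagged} students, miss {missed}")
```

In this toy example, a higher cutoff misses fewer failing students but flags more students overall. Which balance is right depends on how much support you can offer.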
Quantitative analysis—the topic of this presentation—can often be paired well with qualitative analysis to achieve your goal. Example: In this case, our goal is to minimize the number of students who fail the final exam. In addition to doing our predictive analysis, we could also supplement this by doing qualitative interviewing of a subset of students who passed and failed the class last year. We can ask them which study techniques they used. We can then compare the study techniques used by the students who passed with those of the students who failed. We can then recommend the successful study techniques—or even build them into our teaching—for this year’s students.
Analytics should be used only when they are useful to your work. This may not always be the case. Example: I never use a colander to filter my spaghetti out of the water, because I find it difficult to clean the colander. For me, the benefit of the colander’s ability to filter well is outweighed by the cost of the extra cleaning work. Don’t assume that the additional tool will always be helpful.
Similar predictive methods can also be applied to a broad range of questions with different forms and goals. They are not only used for predicting student outcomes, as in this presentation. If you have systematically collected and structured data (related to any subject matter), there’s a chance that you can use this data to make a useful prediction, even if it’s completely different from this presentation’s example.
Here are some questions to consider:
How can predictive/learning analytics help you in your own work as an educator or at your organization/institution? What predictions would be useful for you to make?
How can you leverage data that your institution already collects (or that it is well-positioned to collect) using predictive analytic methods?
What would be the benefits and detriments of incorporating predictive analytics into your institution’s practices and processes?
Could predictive analysis complement any already-ongoing initiatives at your institution?
What would the ethical implications be of using predictive analytics at your institution? Would it cause unfair discrimination against particular learners? Would it help level the playing field for all learners?