1. Introduction

Motivation

The actuarial profession has for decades been rated highly for low stress levels, high compensation, an interesting work environment, and job security; the elephant in the room, however, is the rigorous set of actuarial exam requirements. Earning a Fellow credential requires passing 10 exams, each with a pass rate between 40% and 50%. There is a great deal of literature and study tutorial material based on anecdotal evidence, offering prescriptive study methods and schedules, but very little hard data. This analysis uses machine learning and statistics to shed light on which factors contribute to actuarial exam success.

Project Background

My goal in this project was to gain experience using machine learning in a real-world scenario and to apply this knowledge to help actuarial students. The data were provided by CoachingActuaries.com from my personal online-learning accounts for exams P/1, FM/2, and MFE/3F with the Adapt product. This article would not have been possible without the support of Ben Kester, Thane Himes, and Tong Khon Teh, who sent me the data and published the result. I do not work for, or receive funding from, any company or organization that would benefit from this article. A thank you to Dr. Una-May O'Reilly for excellent research ideas.

Actuarial Exam Overview

The preliminary actuarial exams are 3-hour, multiple-choice, computer-based math tests offered once every six months. The first three exams each contain 30-35 questions. My experience has been with exam P, covering probability theory; exam FM, covering financial mathematics; and exam MFE (soon to be IFM), covering models for financial economics.

Adapt Product Overview

Adapt is an automatic learning engine that generates problems to fit a user's current skill level. These questions are intended to mimic the questions on real exams. The Adapt subscription includes access to a practice test bank, an exam simulator, tutorial videos, a user forum, and performance feedback. Each user is given an Earned Level, which ranges from 1 to 10 and increases as the user answers more difficult questions correctly. There are two main types of practice within Adapt: quizzes and exams. Quizzes are untimed and are customized by the user. Exams are limited to 3 hours, contain 30 questions, and are generated to simulate a real exam as closely as possible.

The data and full R code for generating this article can be downloaded from GitHub.

2. Research Findings

3. Data

Limitations

These data represent only a single individual's experience, my own, and the results have not been verified against a broader population. The findings will not necessarily apply to other people's experience. Since the data were collected, exams P, FM, and MFE have undergone significant curriculum changes, which have not been taken into consideration. The analysis uses only a limited number of variables and does not consider environmental factors such as noise level or testing environment. For exams P, FM, and MFE, my account logged 36.7, 60.1, and 54.8 hours of practice respectively. These numbers are only estimates, as there is no clear measure of active screen time. Time spent on the Adapt product represents only a fraction of my total study time, which included off-screen reviews and textbook practice. The Society of Actuaries (SOA) recommends 100 study hours per hour of exam, or 300 hours for each of these three exams.

Timeline

The data were collected between August 2016 and November 2017, with roughly a two-week Adapt subscription for each exam. As the graphs below indicate, the account's Earned Level increased as more time was spent practicing. For exam P, the Earned Level peaked at about 7.3. For MFE, the increase in Earned Level was smaller because most of the practice time was spent on quizzes, which do not affect Earned Level.

Original Features

These features were supplied directly by the CoachingActuaries website and underwent minimal modification. They consist of question-level detail for the online course: around 1,500 math problems from my personal account, with details such as whether each was answered correctly, the time spent on it, the curriculum category, and so forth.

The main question is which features have an impact on exam performance. A starting point is to look at the correlations between features, which show whether any features are consistently different when a question is answered correctly versus incorrectly. At this stage there is not enough information to make causal inferences, only to note the correlations. A correlation of 1 between two features means that information about one completely explains the other. The correlation of 0.97 between question ordinal, q_ordinal, and remaining_exam_time reflects the fact that questions later in the exam have less time remaining on the clock, and questions earlier have more time remaining. Observations such as this serve as a consistency check for the data. As shown below, remaining_exam_time is negatively correlated with difficulty and minutes_used. This is explored later in the modeling section.
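As a rough illustration, the correlation check can be reproduced in a few lines of R. This is a minimal sketch, assuming the question-level data have been loaded into a data frame named questions with the column names used in this article.

```r
# Minimal sketch of the feature correlation check (assumes a data frame
# `questions` containing the numeric columns named below).
library(dplyr)

questions %>%
  select(q_ordinal, remaining_exam_time, difficulty, minutes_used) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)
```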

Class separation describes how well the x-features separate the class label, correct in this case. These graphs show the empirical probability distributions, with red corresponding to incorrect questions and blue to correct questions. Generally, the greater the difference between the red and blue distributions, the more predictive power the given feature has. The greatest class separation is for difficulty and minutes_used, as seen in the lower right.
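A plot of this kind can be sketched with ggplot2. The snippet below is an illustration under the same assumed questions data frame, not the article's exact plotting code.

```r
# Empirical densities of each numeric feature, split by question outcome.
# Assumes `correct` is coded 0/1 (0 = incorrect, 1 = correct).
library(dplyr)
library(tidyr)
library(ggplot2)

questions %>%
  select(correct, difficulty, minutes_used, remaining_exam_time) %>%
  pivot_longer(-correct, names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value, fill = factor(correct))) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ feature, scales = "free") +
  scale_fill_manual(values = c("red", "blue"),
                    labels = c("incorrect", "correct"),
                    name = NULL)
```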

Categorical Features

When each question is created by a CoachingActuaries.com author, it is assigned a series of categorical tags. For instance, if a question on exam P asks for the expected value of a normal random variable with a deductible, the category tags might be “Continuous Probability Distributions” and “Expected Value”, and the subcategory tags might be “Univariate Normal Distribution” and “Insurance Deductible”. From my understanding, these tags are used in the Adapt exam simulator to generate exams that are representative of the SOA’s curriculum.

Historical Features

These question tags were too numerous to interpret directly, so historical experience features were created to approximate a user's learning over time. These features were useful in constructing predictive models, as seen later. An important point is that the historical features allow information from the quiz questions to be incorporated into the training and testing sets. Otherwise, the quiz questions could not be mixed with the exam questions for data quality reasons: during quizzes, my behavior was inconsistent, for example looking up formulas on note sheets and spending long amounts of time per problem. For these reasons, only the exam questions were used as modeling observations, with the quiz questions contributing through the historical features.
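As a sketch of how such features can be built, the snippet below computes two running totals by category. The exact definitions used in the article may differ; the column names answered_at, category, difficulty, and correct are assumptions about the underlying data, and hist_total_diff is a simplified stand-in for features like hist_greater_diff.

```r
# Two illustrative historical features: cumulative difficulty answered in the
# same category before the current question, and a net version that subtracts
# difficulty for incorrect answers (see the Conclusion for the motivation).
library(dplyr)

questions <- questions %>%
  arrange(answered_at) %>%            # chronological order (assumed column)
  group_by(category) %>%
  mutate(
    hist_total_diff = lag(cumsum(difficulty), default = 0),
    hist_net_diff   = lag(cumsum(ifelse(correct == 1, difficulty, -difficulty)),
                          default = 0)
  ) %>%
  ungroup()
```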

As the plot below shows, there is improved class separation for several of the historical features. Intuitively, this implies that my likelihood of answering a question correctly depends on the number of questions I had answered in the past in the same category. Not all of these features are useful; in fact, the graph shows only a subset of the 13 historical features tested.

4. Modeling

Model Statement

The objective was to predict whether questions on the 3-hour practice exams would be answered correctly, with model interpretability as a priority. This was done at an aggregate level, where questions were not grouped by exam and each question was weighted equally. All models were fit with the caret package (Classification And REgression Training) and its dependencies.

Treatment of Missing Values

The first type of missingness was present in the raw data itself. For instance, there were missing timestamp values for an exam P sitting on August 24, 2016. As this was my first of 12 Adapt exams for exam P, it was deemed not to hold predictive value and was dropped. Because each question could have between one and three category and subcategory tags, discretion was needed in order to compare all questions equally. The method used was to weight the tags so that a question with fewer than three tags would have its last tag repeated. For example, a question with tag A would be treated as AAA, a question with tags BA would be treated as BAA, and a question with tags ABC would be left unchanged. This method is not perfect, but given that the category tags are assigned by the problem author, it seemed like a reasonable way of treating the missing values.
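A minimal sketch of this tag-padding step, assuming hypothetical columns tag1, tag2, and tag3 in which missing second and third tags are stored as NA:

```r
# Pad missing tags with the last available tag: A -> AAA, BA -> BAA, ABC unchanged.
library(dplyr)

questions <- questions %>%
  mutate(
    tag2 = coalesce(tag2, tag1),
    tag3 = coalesce(tag3, tag2)
  )
```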

The second type of missingness was in questions with zero minutes used, or in practice exams where the 3-hour time limit was not followed. A cutoff of 10 seconds was used, as this is approximately the amount of time it takes to read a problem in an exam setting; questions with less than 10 seconds of recorded time were dropped. This reduced the number of question observations in the data from 1734 to 1503. For 10 practice exams, the time limit rules were not followed, and so these were kept in the analysis but treated as untimed quizzes. These questions would not be missing completely at random (MCAR), as there is a higher likelihood of going over the time limit when doing poorly on an exam than when succeeding.
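The corresponding filter is a one-liner; a sketch under the same assumed column names:

```r
# Drop questions with less than 10 seconds of recorded time
# (roughly 1734 -> 1503 rows in the source data).
library(dplyr)

questions <- questions %>%
  filter(minutes_used * 60 >= 10)
```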

Model Evaluation

Interpretability was considered at all stages, as the objective of the analysis was to understand the data. The area under the receiver operating characteristic curve (ROC AUC) was used as the metric for model selection. A 75%-25% validation split was first created, and then 10-fold cross-validation with 3 repeats was used for model training. For some of the models, the numeric variables were pre-processed by centering and scaling.
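A minimal sketch of this evaluation setup in caret, assuming a data frame exam_questions whose outcome correct is a factor with levels "incorrect" and "correct":

```r
# 75%-25% split plus repeated 10-fold cross-validation, evaluated on ROC AUC.
library(caret)

set.seed(1)
in_train <- createDataPartition(exam_questions$correct, p = 0.75, list = FALSE)
training <- exam_questions[in_train, ]
testing  <- exam_questions[-in_train, ]

ctrl <- trainControl(
  method = "repeatedcv", number = 10, repeats = 3,
  classProbs = TRUE, summaryFunction = twoClassSummary  # reports ROC, Sens, Spec
)
```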

Because there were more correct than incorrect problems in the training data, with a split of 63% correct and 37% incorrect, several subsampling techniques were tested, including oversampling, undersampling, SMOTE, and ROSE. When evaluated with cross-validation and against the test set, these methods decreased performance and so were not used.
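In caret these techniques can be switched on through the sampling argument of trainControl; a sketch of the variants that were tested (none of which were kept in the final models):

```r
# Same control setup as above, but with SMOTE subsampling applied inside each
# cross-validation fold; "up", "down", and "rose" are the other options.
library(caret)

ctrl_smote <- trainControl(
  method = "repeatedcv", number = 10, repeats = 3,
  classProbs = TRUE, summaryFunction = twoClassSummary,
  sampling = "smote"
)
```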

  • Naive Bayes: A Naive Bayes classifier using Gaussian distributions was fit to a subset of the features. The input numeric data was centered and scaled. This performed well compared to more sophisticated models.
  • Logistic Regression: Several logit models were fit with different subsets of features. The data were centered and scaled prior to fitting. A power transformation was tested and increased accuracy by about 0.01, but it was not used in the final model due to lower interpretability.
  • K-Nearest Neighbor: KNN models were fit to the centered and scaled data. These were tuned by varying the number of neighbors and testing with cross-validation.
  • Random Forest: The random forest model was used throughout the feature engineering process to assess variable importance. Several versions were fit to various subsets of the data. The only parameter tuned was the number of variables available for splitting at each tree node; the number of trees was held constant at 500. A sketch of how these four model families could be specified in caret is shown below.
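The exact formulas and feature subsets are in the article's R code on GitHub; the following is a simplified sketch of the four model calls, reusing the training data and ctrl object from the evaluation setup above. The full predictor set (the "." formula) is a placeholder.

```r
# Simplified caret calls for the four model families (feature subsets omitted).
library(caret)

nb_fit  <- train(correct ~ ., data = training, method = "nb",
                 preProcess = c("center", "scale"),
                 metric = "ROC", trControl = ctrl)

glm_fit <- train(correct ~ ., data = training, method = "glm", family = binomial,
                 preProcess = c("center", "scale"),
                 metric = "ROC", trControl = ctrl)

knn_fit <- train(correct ~ ., data = training, method = "knn",
                 preProcess = c("center", "scale"), tuneLength = 10,
                 metric = "ROC", trControl = ctrl)

rf_fit  <- train(correct ~ ., data = training, method = "rf", ntree = 500,
                 tuneGrid = expand.grid(mtry = 2:6),
                 metric = "ROC", trControl = ctrl)
```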

Model Selection

The goal in modeling was to understand the data. The Naive Bayes model performed best, although its results cannot be easily interpreted. Given the amount of noise present in the data, the fact that this simple model out-performed its more sophisticated counterparts is not surprising. While the logistic regression model performed well, there was doubt about the credibility of this performance due to the number of outliers present and the p-values of the coefficients. Because the random forest is the most resilient to both outlying cases and multicollinearity, it was chosen as the final model for interpretation.

Model Interpretation

Variable importance measures how much the value of a given variable influences the outcome of correct. The plot below was generated using a random forest model and shows which factors matter most to whether a question is answered correctly. The most important feature in determining question outcomes was whether or not the question had been marked for review during the practice exam. This is not surprising, given that exam-takers can identify to a certain degree which questions they understand. As this is a timed exam, minutes_used should be an important feature, as is indicated. The engineered historical feature hist_greater_diff measures the quantity and quality of preparation for the specific question type by taking the total difficulty of all questions previously answered under the current category; hist_net_diff has a similar definition.
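With caret, a variable-importance plot of this kind can be produced directly from the fitted model; a sketch using the rf_fit object from the modeling sketch above:

```r
# Scaled (0-100) variable importance for the random forest, top 10 predictors.
library(caret)

rf_imp <- varImp(rf_fit)
plot(rf_imp, top = 10)
```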

Partial dependence plots are used to interpret complex machine learning models. They show how the expected outcome changes with different levels of the inputs. The plots below were created using the best random forest model from the model selection section, and show how the probability of answering a question correctly changes when adjusting for the other variables in the model.

These are not an exact measure of the conditional expectations, but are a useful tool in understanding the model.
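One way to produce such a plot is with the pdp package (an assumption about tooling, not necessarily what the article used); a sketch for minutes_used:

```r
# Partial dependence of the predicted probability on minutes_used.
# With prob = TRUE the y-axis is the probability of the first outcome level.
library(pdp)

pd_minutes <- partial(rf_fit, pred.var = "minutes_used",
                      prob = TRUE, train = training)
plotPartial(pd_minutes)
```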

Difficulty and Minutes Used (Below Left):

We can plot any two of the partial dependencies together to examine how the probability changes as both inputs change. In the graphs below, the dark blue color represents a higher probability of a correct answer. These are the same data as in the graphs above, only rearranged. As the partial dependence plot of difficulty versus minutes_used (below left) shows, the probability is highest for easy questions on which the exam-taker spends about 3 minutes. In other words, if a question is easy, the exam-taker should not spend much time on it. This suggests that an optimal strategy is to target the easiest questions.
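The two-variable surface can be sketched in the same way, again assuming the pdp package:

```r
# Joint partial dependence of the predicted probability on difficulty and
# minutes_used, restricted to the region covered by the training data.
library(pdp)

pd_2d <- partial(rf_fit, pred.var = c("difficulty", "minutes_used"),
                 prob = TRUE, chull = TRUE, train = training)
plotPartial(pd_2d)
```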

Difficulty and Experience Level (Below Right):

In the graph of the partial dependence of difficulty versus experience with more difficult problems, hist_greater_diff, there is a clear relationship: experience helps across all difficulty levels. The usefulness of experience peaks in the range of 500-1000 total difficulty points. This can be thought of as, for example, one hundred 10-difficulty problems on the current topic, or two hundred 5-difficulty problems. The effect of prior experience improving the probability of success is consistent for all difficulty levels below about 6. This supports the Been There, Done That (BTDT) theory, that experience trumps difficulty. Even if a question's difficulty is only 3, a student who lacks experience with questions of difficulty 3 or above (the y-axis is less than 500) is unlikely to answer it correctly in an exam setting.

5. Conclusion

Recap of Findings

  1. Practice question outcomes can be predicted. The models had an accuracy above 0.75 and an AUC of about 0.70. The exam-taker's history was found to be meaningful in predicting future exam performance, and 14 historical features were evaluated for predictive power.

  2. Exam time is valuable in a 30-question test. The minutes used per problem was consistently one of the most important features measured across several different models. If the question is easy, the exam-taker should not spend much time on it.

  3. Experience trumps difficulty. The Been There, Done That (BTDT) rule is that in an exam setting, the student should already have seen every question before in practice. The partial dependence plots provided evidence in support of BTDT.

  4. Practice questions are less useful when answered incorrectly. For each specific sub-category, the models calculated a running total that added the problem difficulty when a question was answered correctly and subtracted it when answered incorrectly. This feature could be used to predict whether any given question would be answered correctly.

  5. Practice on hard questions. Baseball players practice-swing with weighted bats, martial artists punch bricks to strengthen their hands, and actuarial students should solve math problems more difficult than necessary in order to prepare for the real exam. The models used a feature that filters out questions of lesser difficulty, and this feature was found to be significant in predicting question outcomes.

Future Improvements

The source data from CoachingActuaries.com could be improved. For example, no unique question or exam identifier was provided, so questionID and examID had to be created, and these were not perfect reconstructions. The variable remaining_exam_time was an approximation, as the data source only includes the order in which questions were generated and the time spent on each problem. In a real environment, the order in which the questions appear is rarely the order in which they are answered: a student will often read a problem, skip it, and come back to it later. Having precise clickstream data of this kind could be insightful.

Models could search over a broader range of category and subcategory combinations. In this experiment, the category features did not improve model performance when included explicitly, but they were included implicitly within the historical features. A model that tests for connections between categories at a deeper level could lead to improved accuracy. Alternatively, a dimensionality-reduction technique such as factor analysis could be applied to the large number of categories. Given limited computational capacity, these models were evaluated using only the top 10 most frequent category levels for P, FM, and MFE.

Total exam performance could be evaluated using the question-level predictive models. One possible method would be to feed in unseen practice exam questions in batches of 30 and simulate the exam-taking behavior. This would allow for maximization of exam scores, as opposed to the probability of answering a given question correctly. Various time-allocation and other exam-taking strategies could be tested and evaluated for performance using Monte Carlo simulation.
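A hypothetical sketch of this idea, scoring simulated 30-question exams from a fitted question-level model (the function and object names here are illustrative, not part of the article's code):

```r
# Draw random 30-question batches, predict each question's probability of being
# answered correctly, and simulate exam scores out of 30.
set.seed(1)

simulate_exam_scores <- function(model, question_pool, n_sims = 1000) {
  replicate(n_sims, {
    batch <- question_pool[sample(nrow(question_pool), 30), ]
    p_correct <- predict(model, newdata = batch, type = "prob")[, "correct"]
    sum(rbinom(30, size = 1, prob = p_correct))
  })
}

# e.g. hist(simulate_exam_scores(rf_fit, testing)) to compare study strategies
```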
