Looks can be deceiving: what your features can tell you about your data

Amber Ferger
9/17/2019

(Supervised) Predictive Models in a Nutshell

“How can I put a label on unclassified data?”

  • Develop features
  • Train model on known data
  • Predict on unclassified data
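The three steps above can be sketched in a few lines. This is a toy illustration only: the records, the `age` and `num_claims` fields, and the nearest-centroid model are all made-up assumptions, not the data or the model used in this project.

```python
# Toy illustration of the three steps; data and feature names are hypothetical.
from statistics import mean


def develop_features(record):
    """Step 1: turn a raw record into a numeric feature vector."""
    return [record["age"] / 100.0, float(record["num_claims"])]


def train(records, labels):
    """Step 2: fit a nearest-centroid model on records whose label is known."""
    centroids = {}
    for label in set(labels):
        rows = [develop_features(r) for r, y in zip(records, labels) if y == label]
        # Average each feature column to get the class centroid.
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, record):
    """Step 3: give an unclassified record the label of the closest centroid."""
    x = develop_features(record)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


known = [
    {"age": 25, "num_claims": 0},
    {"age": 31, "num_claims": 1},
    {"age": 58, "num_claims": 7},
    {"age": 62, "num_claims": 9},
]
known_labels = ["no", "no", "yes", "yes"]

model = train(known, known_labels)
prediction = predict(model, {"age": 60, "num_claims": 8})
print(prediction)
```

Any real model would replace the nearest-centroid step; the point is only the order of operations: features first, then training on labeled data, then prediction on unlabeled data.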

[Figure: predictiveModel]

A real life example

The Goal: Identify members who have another commercial insurance plan

The (current) process:

  • Generate a list of members (based on a few parameters)
  • Each week, randomly select 4000 of those members and send them letters
  • Record the results and, if applicable, update the system

The Problem: Very few of the members that we send letters to actually have another insurance plan

Here comes the Predictive Model Piece!

Solution: Develop a predictive model to rank the members from the pull by probability of having another commercial insurance plan!
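To make the contrast with the current process concrete, here is a minimal sketch of a random pull versus a ranked pull. The member IDs, the scores, and the function names `random_pull` and `ranked_pull` are hypothetical; the scores stand in for whatever probabilities the model ends up producing.

```python
import random


def random_pull(member_ids, k, seed=None):
    """Current process: pick k members uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(member_ids, k)


def ranked_pull(scores, k):
    """Proposed process: pick the k members with the highest predicted probability."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical predicted probabilities of having another commercial plan.
scores = {"M001": 0.12, "M002": 0.81, "M003": 0.05, "M004": 0.47, "M005": 0.66}

top_two = ranked_pull(scores, k=2)
print(top_two)  # ['M002', 'M005']
```

In the real pull, `k` would be the weekly letter budget of 4000 rather than 2.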

Note that this is still a work in progress!

Success! (Or is it?)

  • Our first run of the model yielded 95% accuracy on the training set
  • It also yielded 90% accuracy on the test set
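For reference, accuracy here means the share of predictions that match the known labels. A minimal sketch with made-up labels (not the project's data) that happens to reproduce the same two percentages:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 1 = has another plan, 0 = does not.
train_true = [1, 0] * 10
train_pred = list(train_true)
train_pred[0] = 0  # one mistake out of 20

test_true = [1, 0] * 5
test_pred = list(test_true)
test_pred[3] = 1  # one mistake out of 10

train_acc = accuracy(train_true, train_pred)
test_acc = accuracy(test_true, test_pred)
print(f"train: {train_acc:.0%}, test: {test_acc:.0%}")
```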

Cause for celebration? Not yet. The results look too good…

Let's take a look at our features, shall we?

[Figure: feature importance plot]

Good thing we checked...

  • We went back to the data
  • PDPD_ID has the most distinct values of any feature (over 3000)
  • That high cardinality, rather than real predictive signal, is likely what's driving its importance
  • Rerunning the model without the feature drops accuracy to 58%
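The check described above can be sketched as follows: count the distinct values in each column, flag the one with the most, and drop it before a rerun. Only the column name PDPD_ID comes from our data; the other columns, the values, and the helper names are hypothetical.

```python
def distinct_counts(rows):
    """Count the distinct values seen in each column of a list of dict rows."""
    values = {}
    for row in rows:
        for col, val in row.items():
            values.setdefault(col, set()).add(val)
    return {col: len(vals) for col, vals in values.items()}


def drop_feature(rows, name):
    """Return copies of the rows without the named column, ready for a rerun."""
    return [{c: v for c, v in row.items() if c != name} for row in rows]


# Hypothetical rows; real data would have thousands of members.
rows = [
    {"PDPD_ID": "A1001", "age_band": "30-39", "has_dependents": True},
    {"PDPD_ID": "B2002", "age_band": "30-39", "has_dependents": False},
    {"PDPD_ID": "C3003", "age_band": "40-49", "has_dependents": True},
    {"PDPD_ID": "D4004", "age_band": "40-49", "has_dependents": True},
]

counts = distinct_counts(rows)
suspect = max(counts, key=counts.get)  # column with the most distinct values
reduced = drop_feature(rows, suspect)
print(counts, suspect)
```

A column with nearly as many distinct values as there are rows is worth a second look before trusting the importance the model assigns to it.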

And that brings us to the present day. We're back at the drawing board, developing new features (we have 6 new ones to test!)

Conclusion

  • Always question your data!
  • Think logically about your results
  • Developing a model is a process that takes many tries before it's right