Carlos Mercado
August 24, 2017
The data was provided by the Centers for Medicare & Medicaid Services (CMS) and covers the Open Enrollment Period, during which 9 million Americans enrolled.
csv <- "Business Analyst Data Analysis Presentation - Open Enrollment Help Page Comments - Comments.csv"
medicare <- read.csv(csv, stringsAsFactors = FALSE)
#reads as 2,179 observations of 2 columns (URL and comment)
## [1] 51
## [1] 2175
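The code that printed the two values above is not shown. One plausible reading, stated here as an assumption, is that they are the number of distinct URLs and the number of distinct comments. A minimal base R sketch of that kind of check follows; `medicare_demo` is a hypothetical toy data frame, not the CMS data.

```r
# Toy stand-in for the real CSV (hypothetical rows, not the CMS data).
medicare_demo <- data.frame(
  URL = c("help/add-other-income/", "help/add-other-income/",
          "help/automatic-enrollment/", "help/automatic-enrollment/"),
  Comment = c("how do i report this", "not clear",
              "how to cancel", "how to cancel"),
  stringsAsFactors = FALSE
)

# Number of distinct help pages and distinct comment texts.
n_urls <- length(unique(medicare_demo$URL))
n_comments <- length(unique(medicare_demo$Comment))

print(n_urls)
print(n_comments)
```

Run against the real `medicare` data frame, the same two calls would show how many help pages received feedback and whether any comments are exact repeats.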
The reasons I chose Natural Language Processing in R:
The data is small but rich, so an intensive process such as n-gram modeling is computationally feasible on a single laptop.
The data comes from a specific population: people who wanted to enroll, sought help in doing so, and then decided to give feedback.
This population may have unique (possibly demographic) similarities that can be identified by both the language use and the most common feedback.
Demographics and other valuable information would be helpful in the future for making actual product recommendations for the system.
## # A tibble: 51 x 2
## URL Comment
## <chr> <int>
## 1 help/what-health-coverage-do-i-have/ 213
## 2 help/parent-and-caretaker-relative-questions/ 162
## 3 help/add-other-income/ 128
## 4 help/i-am-having-trouble-logging-in-to-my-marketplace-account/ 114
## 5 help/deduction-questions/ 108
## 6 help/automatic-enrollment/ 107
## 7 help/disability-questions/ 103
## 8 help/found-not-eligible-for-medicaid/ 101
## 9 help/losing-health-coverage/ 101
## 10 help/information-on-medicare/ 95
## # ... with 41 more rows
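A table of comment counts per help page, like the one above, can be built in a few lines. The sketch below uses base R rather than whichever packages produced the tibble output, and it runs on a hypothetical toy data frame (`medicare_demo`), so the counts it prints are illustrative only.

```r
# Toy stand-in for the real data (hypothetical rows).
medicare_demo <- data.frame(
  URL = c(rep("help/add-other-income/", 3),
          rep("help/automatic-enrollment/", 2),
          "help/disability-questions/"),
  Comment = paste("hypothetical comment", 1:6),
  stringsAsFactors = FALSE
)

# Count comments per URL; the count column is named "Comment"
# to mirror the table shown in this report.
comments_per_url <- as.data.frame(
  table(URL = medicare_demo$URL),
  responseName = "Comment",
  stringsAsFactors = FALSE
)

# Sort from most-commented page to least-commented page.
comments_per_url <- comments_per_url[order(-comments_per_url$Comment), ]
rownames(comments_per_url) <- NULL

print(comments_per_url)
```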
For feasibility, it may be prudent to only seek to solve the most common pain points (for example, those with 80 or more comments).
Do clients generally know what needs to be fixed first?
Does having NAVA select the features to develop best serve the population?
## # A tibble: 11 x 2
## URL Comment
## <chr> <int>
## 1 help/what-health-coverage-do-i-have/ 213
## 2 help/parent-and-caretaker-relative-questions/ 162
## 3 help/add-other-income/ 128
## 4 help/i-am-having-trouble-logging-in-to-my-marketplace-account/ 114
## 5 help/deduction-questions/ 108
## 6 help/automatic-enrollment/ 107
## 7 help/disability-questions/ 103
## 8 help/found-not-eligible-for-medicaid/ 101
## 9 help/losing-health-coverage/ 101
## 10 help/information-on-medicare/ 95
## 11 help/reconciling-your-tax-credit/ 89
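The cutoff of 80 or more comments amounts to a simple filter on the per-page counts. In the sketch below, the first two rows reuse counts from the table above, the third row is a hypothetical low-volume page added to show the filter removing something, and `min_comments` is an assumed name for the threshold.

```r
# Two rows taken from the table above plus one hypothetical low-volume page.
comments_per_url <- data.frame(
  URL = c("help/information-on-medicare/",
          "help/reconciling-your-tax-credit/",
          "help/hypothetical-low-volume-page/"),
  Comment = c(95L, 89L, 12L),
  stringsAsFactors = FALSE
)

# Keep only pages at or above the chosen threshold.
min_comments <- 80
priority_pages <- comments_per_url[comments_per_url$Comment >= min_comments, ]

print(priority_pages)
```

Keeping the threshold in one variable makes it easy to test how sensitive the shortlist is to the cutoff.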
count3[6,]
## # A tibble: 1 x 3
## # Groups: URL [1]
## URL trigram n
## <chr> <chr> <int>
## 1 help/automatic-enrollment/ how to cancel 15
## # A tibble: 35,181 x 2
## quadgram n
## <chr> <int>
## 1 individual insurance non group 26
## 2 insurance non group coverage 25
## 3 what to do if 19
## 4 how to answer this 17
## 5 it is not clear 15
## 6 the end of the 14
## 7 to do if you 14
## 8 i don't know if 13
## 9 it would be helpful 13
## 10 to answer this question 13
## # ... with 35,171 more rows
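The trigram and quadgram tables above come from splitting each comment into overlapping runs of n words and counting how often each run appears. The sketch below shows that idea in dependency-free base R; it is not the code used for this analysis. The helper names `ngrams` and `count_ngrams` and the demo comments are all hypothetical.

```r
# Split one comment into lowercase words, then paste each run of n words.
ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across a vector of comments, most frequent first.
count_ngrams <- function(comments, n) {
  grams <- unlist(lapply(comments, ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Hypothetical comments, not taken from the CMS data.
demo <- data.frame(
  URL = c(rep("help/automatic-enrollment/", 3),
          "help/losing-health-coverage/"),
  Comment = c("How to cancel automatic enrollment",
              "how to cancel my plan",
              "I need to know how to cancel",
              "what to do if i lose coverage"),
  stringsAsFactors = FALSE
)

# Trigram counts over all comments, then separately for each URL.
top_trigrams <- count_ngrams(demo$Comment, 3)
by_url <- lapply(split(demo$Comment, demo$URL), count_ngrams, n = 3)

print(head(top_trigrams, 3))
```

Changing `n` to 4 or 5 gives the longer word groups discussed in this report; the number of distinct n-grams grows quickly with `n`, which is why most of them appear only once.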
Looking at the broadest tested cases: 4-grams and 5-grams
For example, in the parent and caretaker questions, several comments indicate that a feature is missing (“there is no option for”) or seek extra advice (“19 but…”).
Most common 5-word groups in the parent, caretaker, relative category:
Looking at the most common problems based on counting the different word pairs (or triplets, or quadruplets), we see a few things:
The comments show a clear lack of understanding: “how to”, “if you”, “how do i”, “what to do if”…
People are commenting in the help section because the help mechanisms (whether FAQs, live chat, etc.) aren’t working.
Users feel that their specific situations are unique enough to warrant specific instruction: “if…”
This is different from feeling that a feature should exist but doesn’t, or that an interface is too difficult to use.
Things to consider with more time:
More data on the commenters
Do certain demographics have more issues with certain topics than others? Example: in “add-other-income” and “deduction”: HSA contributions lowering income
Strong local word groupings in rarely used categories
More NLP
Due to time constraints, I decided against removing words or running sentiment analysis (for example, do users give more “negative” feedback in certain categories than in others?).
AUTHOR’S NOTE: A detailed presentation with full annotations and code is available.
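As an illustration of the word-removal step that was skipped, the sketch below drops a handful of common words from a comment before any counting would take place. The list `stop_words_demo` is a hypothetical, deliberately tiny example and not a standard stop-word lexicon; `remove_stop_words` is likewise an assumed helper name.

```r
# Hypothetical, deliberately tiny stop-word list (not a standard lexicon).
stop_words_demo <- c("the", "to", "a", "of", "i", "is", "it")

# Lowercase a comment, split it into words, and drop the stop words.
remove_stop_words <- function(text, stop_words) {
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  paste(words[!words %in% stop_words], collapse = " ")
}

cleaned <- remove_stop_words("It is not clear how to cancel the plan",
                             stop_words_demo)
print(cleaned)
```

Removing these words would change the n-gram results noticeably: phrases such as “how to” and “what to do if” rely on exactly the words a stop-word list tends to drop, so it may be better to leave them in when the goal is to find how-to questions.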