A Data Scientist

Data Science: the Big Picture

Hypothetical Scenario

Medicare Coverage

You are an administrator at a large urban hospital. Many of the hospital’s patients are elderly, and use Medicare to pay a portion of what the hospital charges. You wonder:

  • How do your hospital’s charges for various medical conditions compare to those of other hospitals in the state and around the country?
  • How does your hospital compare to other hospitals in the percentage of costs covered by Medicare Part A payments?

You ask one of the hospital’s data scientists to provide some insight into your questions.

Available Data

Key Variables

  • Diagnosis-Related Group (DRG)
  • Discharges: number of patients treated for a given DRG, by a given hospital, in a given year
  • Medicare Payments: how much Medicare Part A paid for the discharges associated with the DRG during the year, for that hospital.
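
The administrator's second question comes down to a ratio of Medicare payments to charges. Here is a minimal sketch of that calculation. The `charges` field, the hospital names, and all of the numbers are made up for illustration and are not taken from the CMS data.

```python
def percent_covered(total_charges, medicare_payments):
    """Return Medicare payments as a percentage of total charges.

    Returns None when charges are zero or negative, since the ratio
    is not meaningful in that case.
    """
    if total_charges <= 0:
        return None
    return 100.0 * medicare_payments / total_charges


# Hypothetical per-hospital totals for a single DRG in a single year.
records = [
    {"hospital": "Hospital A", "drg": "D1", "discharges": 10,
     "charges": 50000.0, "medicare_payments": 10000.0},
    {"hospital": "Hospital B", "drg": "D1", "discharges": 8,
     "charges": 40000.0, "medicare_payments": 10000.0},
]

coverage = {
    r["hospital"]: percent_covered(r["charges"], r["medicare_payments"])
    for r in records
}
```

Comparing the values in `coverage` across hospitals for the same DRG is one simple way to frame the comparison the administrator asked for.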

Part of the “Answer”

Data Issues

The Big Picture Again

Import Issues

  • Downloading from the cms.gov site is not difficult, but …
  • the data format varied from year to year

So the data scientist had to do some programming in order to combine the individual-year data sets.
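
As a rough sketch of what that programming might look like, the example below reads two yearly extracts whose column headers differ, maps the headers onto one naming scheme, and stacks the rows with a `year` column. The column names and values are invented for illustration; the actual CMS layouts are not reproduced here.

```python
import csv
import io

# Hypothetical yearly extracts: same information, different header styles.
RAW_BY_YEAR = {
    2011: "DRG Definition,Provider Id,Total Discharges\n"
          "D1,P1,10\n",
    2012: "drg_definition,provider_id,total_discharges\n"
          "D1,P1,12\n",
}


def normalize(name):
    """Map a raw header such as 'Provider Id' to 'provider_id'."""
    return "_".join(name.strip().lower().split())


def combine_years(raw_by_year):
    """Read each year's CSV text and return one list of uniform rows."""
    rows = []
    for year, text in sorted(raw_by_year.items()):
        for record in csv.DictReader(io.StringIO(text)):
            row = {normalize(key): value for key, value in record.items()}
            row["year"] = year  # remember which file the row came from
            rows.append(row)
    return rows


combined = combine_years(RAW_BY_YEAR)
```

Real files would likely need more than header cleanup (renamed or missing columns, changed units), so `normalize` stands in for whatever per-year mapping is actually required.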

Tidying Issues

  • CMS data was quite repetitive, resulting in files too large for an app to handle quickly.
  • We had to separate the data into smaller “relational” tables that can be joined quickly in response to particular queries.
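
A minimal sketch of the idea, using made-up rows and field names: the repeated provider details move into their own table keyed by `provider_id`, and are joined back only when a query needs them.

```python
# Hypothetical flat rows: provider details repeat on every line.
flat_rows = [
    {"provider_id": "P1", "provider_name": "Example General", "zip": "00001",
     "drg": "D1", "discharges": 10, "medicare_payments": 1000.0},
    {"provider_id": "P1", "provider_name": "Example General", "zip": "00001",
     "drg": "D2", "discharges": 5, "medicare_payments": 700.0},
    {"provider_id": "P2", "provider_name": "Sample Memorial", "zip": "00002",
     "drg": "D1", "discharges": 8, "medicare_payments": 900.0},
]


def split_tables(rows):
    """Split flat rows into a provider table and a payments table."""
    providers = {}
    payments = []
    for r in rows:
        # Each provider's details are stored once, keyed by provider_id.
        providers[r["provider_id"]] = {
            "provider_name": r["provider_name"],
            "zip": r["zip"],
        }
        payments.append({
            "provider_id": r["provider_id"],
            "drg": r["drg"],
            "discharges": r["discharges"],
            "medicare_payments": r["medicare_payments"],
        })
    return providers, payments


def payments_for_drg(providers, payments, drg):
    """Join the two tables for one DRG, as an app query might."""
    return [
        {**p, **providers[p["provider_id"]]}
        for p in payments
        if p["drg"] == drg
    ]


providers, payments = split_tables(flat_rows)
d1_rows = payments_for_drg(providers, payments, "D1")
```

The payments table stays narrow, and the join touches only the rows a given query asks for.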

External Data

  • CMS data tells us the zip code for each provider.
  • We had to use external data to determine the county for each zip code.
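
A sketch of that lookup, assuming the external data has been reduced to a simple zip-to-county mapping. The zip codes and county names here are placeholders, and a real crosswalk may map one zip code to more than one county.

```python
def clean_zip(value):
    """Return a 5-character zip string, restoring dropped leading zeros."""
    return str(value).strip().zfill(5)


def add_county(providers, zip_to_county):
    """Attach a county to each provider; unmatched zips get None."""
    return [
        {**p, "county": zip_to_county.get(clean_zip(p["zip"]))}
        for p in providers
    ]


# Hypothetical provider rows; the first zip was read as a number.
providers = [
    {"provider_id": "P1", "zip": 1},
    {"provider_id": "P2", "zip": "00002"},
    {"provider_id": "P3", "zip": "99999"},
]

# Hypothetical external lookup table.
zip_to_county = {
    "00001": "Example County",
    "00002": "Sample County",
}

with_county = add_county(providers, zip_to_county)
```

Keeping zip codes as strings avoids losing leading zeros, and leaving unmatched rows as `None` makes gaps in the external data easy to spot.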

Mapping

  • Spatial data is a large, important topic in contemporary data science.
  • You have to think about
    • projection systems
    • standards for encoding spatial data
    • how to access the right online mapping service
    • etc.
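
To make the first two items concrete, here is a small sketch using one common choice for each: the spherical Web Mercator projection used by many online map tiles, and GeoJSON as the encoding standard. These are illustrative picks, not necessarily what the project used, and the hospital name and coordinates are made up.

```python
import json
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator


def to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


def hospital_feature(name, lon_deg, lat_deg):
    """Encode a hospital location as a GeoJSON Feature.

    GeoJSON stores coordinates as [longitude, latitude].
    """
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon_deg, lat_deg]},
        "properties": {"name": name},
    }


# Hypothetical hospital location.
feature = hospital_feature("Example General", -85.0, 38.0)
geojson_text = json.dumps(feature)
x, y = to_web_mercator(-85.0, 38.0)
```

The same point therefore has two representations: degrees in the stored GeoJSON, and metres once projected for display.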

Further Questions

  • What factors affect total costs for a given DRG?
  • What factors affect the percentage covered by Medicare?

We probably need access to per-patient data, not just per-hospital data.