BAIS660_Exam1
US HHS - CE Breach Statistics
Introduction
The Department of Health and Human Services (HHS) is mandated by Congress to report on Protected Health Information (PHI) breaches in the “Covered Entities” (CE) where they occur. Pursuant to that mandate, the Office for Civil Rights (OCR) has been tasked with preparing regular reporting on the aforementioned PHI breach events.
A “Breach Event” is a hacking, penetration, negligent share, or other action that allows the PHI of 500 or more individuals to enter the broader world wide web. Whether the cause is malice, negligence, or poor luck, the OCR must record the incident. Other offices are responsible for enforcement actions or punitive measures and will use those powers where applicable.
The purpose of this document is to summarize and present in visual form the broad trends in CE breach-event data over the duration of the OCR’s collection mission.
Data Collected
Data is collected from within HHS systems and is of two general types: breach reports that have been completely investigated, and breach reports that have not. HHS bears a mandate to investigate every CE breach and transitions “pending” reports to the “completed” bin when they are finished but, as with any enduring mandate, the backlog is ever growing.
This article merges completed and pending data in the interest of presenting the most up-to-date and accurate overview of the CE breach situation. This is done because the newest reports, which may grant immediately applicable insight, are the least likely to have a finalized investigation.
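The merge described above can be sketched in base R as follows. This is a minimal illustration rather than the code used for this article: the data-frame names, column names, and every row are invented placeholders, and the real HHS exports contain more columns.

```r
# Hypothetical stand-ins for the two HHS exports; all rows are invented.
completed <- data.frame(
  Name = c("Example Clinic A", "Example Hospital B"),
  State = c("GA", "TX"),
  Individuals.Affected = c(1200, 560),
  stringsAsFactors = FALSE
)
pending <- data.frame(
  Name = "Example Health Plan C",
  State = "OH",
  Individuals.Affected = 4300,
  stringsAsFactors = FALSE
)

# Record which bin each report came from before the two sets are stacked.
completed$Investigated <- TRUE
pending$Investigated <- FALSE

# rbind() stacks data frames that share the same column names.
breaches <- rbind(completed, pending)
print(breaches)
```

Keeping the Investigated flag means the merged table can still be split back into its completed and pending halves whenever an analysis should use finalized reports only.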
Methodology & Use
This article will assess the current trends in PHI breaches in order to illustrate the hurdles facing HIPAA compliance efforts among CEs and, hopefully, identify key markets or behaviors that may be of use in predicting at-risk CE sectors.
As an example, if the data suggests that CEs in Georgia are more likely than usual to be targeted, this data may help HHS and allied offices reallocate resources to more effectively meet and defeat the threat to our citizens’ health data.
The key takeaway, from an analytical standpoint, is that it is impossible to engage effectively with a situation without understanding its current state and recent trends. The US Army’s own Intelligence Manual suggests that the very first step of any analytical process is to understand the current state.
With that in mind, we first look to our data.
Data Collection and Cleanup
The data was collected and manipulated in RStudio using the R programming language. R may be extended through “packages”, which add functions or features to the base language, much like enabling “add-ons” in Excel. Because there are over 10,000 packages, we list those used here along with their purpose.
library(rmdformats)
library(knitr)
library(ggplot2)
library(tidyverse)
library(lubridate)
library(stringr)
library(DT)

rmdformats - Provides some prebuilt templates and formatting for a “Markdown” document, which is the language in which this article was written. Markdown allows the author to format the text, images, and tables in their articles much like the formatting controls in Microsoft Word.
knitr - Much as Markdown provides a framework for formatting an article, knitr provides a framework for publishing one. To “knit” an article is to render it into an .html file, which may then be uploaded to a website for publication; in this case RPubs serves as the repository and hosting service.
ggplot2 - A library of functions that significantly improves on the basic data-visualization functions built into the core R language. ggplot2 was used to code all of the graphs presented in this article and has helped make them more robust, understandable, and meaningful to the analytical conclusions presented herein.
tidyverse - A large collection of packages bundled into one. Tidyverse is best thought of as a package composed of other packages (ggplot2 and stringr among them), bundled together for convenience and simplicity.
lubridate - Many of the core takeaways from this data depend on when the breaches happened. Unfortunately, when data is imported from a Comma Separated Values document (such as one exported from Excel), it is impossible for the creators of the R language to predict the myriad ways in which calendar dates are recorded. July 18th 2009? 21st October ’16? 6/21/14? These are all valid methods of recording time. Consequently, some manipulation of the data must occur to ensure that all dates adhere to the same standard and may be measured uniformly. Where possible, dates are shown as Year-Month-Day so that events may be grouped most efficiently. It is then easy to find all events occurring in October of 2009, for example. “Lubridate” is a package that specializes in date manipulation.
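stringr - Provides functions for working with text strings, such as detecting, extracting, and replacing patterns.
DT - An R interface to the “DataTables” JavaScript library, which allows us to display interactive data tables on the page.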
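The date clean-up described under lubridate can be sketched as follows. To keep the example free of package dependencies it uses base R rather than lubridate (which offers parsers such as ymd(), mdy(), and dmy() for the same job). The function name normalize_date, the three supported input layouts, and the sample values are assumptions made for illustration, not the code behind this article.

```r
# normalize_date() is a hypothetical helper: it converts a few known
# date layouts to R Date values and returns NA for anything else.
normalize_date <- function(x) {
  out <- rep(as.Date(NA), length(x))

  # Decide the layout by pattern first. Trying formats blindly is unsafe:
  # "%m/%d/%Y" would read "6/21/14" as the year 14, not 2014.
  iso      <- grepl("^[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}$", x)
  us_long  <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$", x)
  us_short <- grepl("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$", x)

  out[iso]      <- as.Date(x[iso], format = "%Y-%m-%d")
  out[us_long]  <- as.Date(x[us_long], format = "%m/%d/%Y")
  out[us_short] <- as.Date(x[us_short], format = "%m/%d/%y")
  out
}

# Invented sample values in mixed layouts; the last one is unparseable.
raw_dates <- c("6/21/14", "10/21/2016", "2009-07-18", "sometime in 2012")
clean_dates <- normalize_date(raw_dates)

# Print every date in the uniform Year-Month-Day form.
print(format(clean_dates, "%Y-%m-%d"))
```

Returning NA for unrecognized layouts, instead of guessing, makes it easy to count how many records still need manual attention.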
Data Summary
The HHS data collected contained almost 2,450 breach reports, although some entries were duplicated.
The data contained the following Variables.
- Name of Covered Entity - The name of an organization where a breach event occurred
- State - The state wherein said organization is located
- Covered Entity Type - The category of business or organization
- Individuals Affected - Number of individuals who may have had PHI leaked
- Breach Submission Date - When the breach was reported. In the data-frame this has been separated out into columns for Year, Month and Day.
- Type of Breach - The category of breach (for example, hacking or theft)
- Location of Breached Information - The medium in which the leaked information was kept
- Business Associate Present - Whether a business associate of the CE was involved in the breach
- Web Description - A prose recording of the facts surrounding that case
- Investigated - Whether an investigation has been completed on this report
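Two of the clean-up steps mentioned in this summary, removing duplicated entries and splitting the submission date into Year, Month, and Day columns, can be sketched in base R as follows. The data frame, its column names, and its rows are invented for illustration and do not come from the HHS data.

```r
# Invented sample reports; the second row exactly duplicates the first.
reports <- data.frame(
  Name = c("Example Clinic A", "Example Clinic A", "Example Hospital B"),
  State = c("GA", "GA", "TX"),
  Submission.Date = as.Date(c("2014-06-21", "2014-06-21", "2016-10-21")),
  stringsAsFactors = FALSE
)

# duplicated() flags rows that repeat an earlier row in every column;
# keeping the unflagged rows drops the repeats.
reports <- reports[!duplicated(reports), ]

# Split the date into separate integer columns for grouping.
reports$Year  <- as.integer(format(reports$Submission.Date, "%Y"))
reports$Month <- as.integer(format(reports$Submission.Date, "%m"))
reports$Day   <- as.integer(format(reports$Submission.Date, "%d"))

print(reports)
```

Note that this only removes rows that match in every column. Two reports of the same incident that differ in any field would need a looser matching rule.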
Interactive Table
Feel free to view the data yourself below.
Analysis
Year Over Year Breach Reporting
Here we see breaches reported, as segmented by year. While it is natural to be excited by the apparent dip in 2018 Breaches, keep in mind that as of the time of this writing the year is not over yet.
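A year-over-year count of this kind can be sketched in base R as follows. The dates are invented placeholders, not HHS records, and the variable names are illustrative only.

```r
# Invented submission dates spanning three years.
submission_dates <- as.Date(c(
  "2016-03-04", "2016-11-18",
  "2017-01-13", "2017-06-02", "2017-09-29",
  "2018-02-09"
))

# Extract the year from each date and count the reports in each year.
report_year <- format(submission_dates, "%Y")
reports_per_year <- table(report_year)
print(reports_per_year)

# diff() gives the change from each year to the next. A negative final
# value may simply reflect an incomplete final year.
yearly_change <- diff(as.vector(reports_per_year))
print(yearly_change)
```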
Additional Analysis
Here we see that most reports are filed with HHS on a Friday.
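A weekday tally like this one can be sketched in base R as follows. The dates are invented for illustration. The day names are supplied by hand because the names returned by weekdays() depend on the system locale.

```r
# Invented submission dates from January 2017.
submission_dates <- as.Date(c(
  "2017-01-13", "2017-01-16", "2017-01-20", "2017-01-25", "2017-01-27"
))

# as.POSIXlt()$wday numbers the days 0 (Sunday) through 6 (Saturday);
# adding 1 turns that into an index into day_names.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
wday_index <- as.POSIXlt(submission_dates)$wday + 1

# A factor with all seven levels keeps zero-count days in the table.
weekday <- factor(day_names[wday_index], levels = day_names)
reports_per_weekday <- table(weekday)
print(reports_per_weekday)

# The name of the weekday with the most reports.
busiest_day <- names(reports_per_weekday)[which.max(reports_per_weekday)]
print(busiest_day)
```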
Analytical Conclusions
In every complete year the number of reported incidents has increased, despite improvements in security technology, enforcement actions by HHS and other government bodies, and growing awareness efforts regarding HIPAA violations. These increases are likely a direct result of the massively increased use of electronic document storage systems by government and healthcare offices, combined with their endemically poor adoption of security best practices and modern data-management systems.
The public and health sectors have historically been poor at data security, and as technology use continues to increase in both sectors, their poor security policies continue to lead to increasing numbers of breaches.