This is an analysis of information security breach data from the website Information is Beautiful. They have made a really elegant bubble chart showing a very intuitive time evolution of data breaches which can be encoded for a number of factors.
Thankfully they have also made their raw data compilation available here so I’ve used it for this analysis to expand on some specific questions I wanted to address.
The specific question I want to address here (quantitatively) is: “How much have the severity and number of hacking-induced data breaches changed over time, and is it different from the general trend of breaches?”
The conclusion is that while hacks are increasing faster than the overall rate of breaches, the sensitivity of the data lost to hacks is not increasing as fast as that of breaches overall.
I’ve left exposed the code essential to the analysis (like data cleaning) but have hidden what I consider the boring parts of the code, like standard plotting. You can get the whole code on my GitHub here.
The data behind the Information is Beautiful chart are made available as a Google Doc. I’ve been unable (so far) to write a successful R program to download the data from the website directly, so in the interest of time I just downloaded their raw data as a .csv.
I’ve made the downloaded copy of the data I use below available on the GitHub repository linked to this analysis.
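Reading that local copy into R is then straightforward. Here is a minimal sketch; the file name is an assumption, and I read with header = FALSE because the sheet’s real column names arrive in the first data row and are assigned during the cleaning below.

##read the downloaded copy of the raw data
##"breaches.csv" is an assumed local file name
Breaches <- read.csv("breaches.csv", header = FALSE, stringsAsFactors = FALSE)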
The data was generally in pretty good shape, but I did need to take some steps to clean it up for analysis in R.
Steps to clean the data include:

* Retaining only spreadsheet columns 1-10 to include only the data I need.
* Getting rid of a second descriptive row (which is not data).
* Converting column names to lower case and deleting spaces.
* Cleaning the formatting of some numbers to remove commas etc.
* Implementing some assumptions about the coding of severities. The legend runs between 1 (least severe) and 5 (most severe), but some of the data are outside this range (e.g. 20 and 50000). Based on my reading of the details, I have assumed these are incorrect and have converted them to single digits.
* Turning the year into an actual numerical representation of a calendar year.
You can see the details in the code.
## DATA CLEANING
## delete trailing columns
Breaches<-Breaches[,1:10]
##get rid of second descriptive row: the real column names live in row 1
testt <- as.character(unlist(Breaches[1, ]))
##manage column names: lower case, no spaces
testt <- tolower(testt)
testt <- gsub(" ", "", testt)
##apply colnames
colnames(Breaches) <- testt
##delete junk rows
Breaches<-Breaches[c(-1, -2),]
##get rid of commas in some numerical records (gsub removes all commas,
##and the result must be assigned back before the numeric conversion below)
Breaches$noofrecordsstolen <- gsub(",", "", as.character(Breaches$noofrecordsstolen))
#Make the year numeric
Breaches$year<-as.integer(as.numeric(as.character(Breaches$year))+2004)
##Make Breaches numeric
Breaches$noofrecordsstolen<-as.integer(as.character(Breaches$noofrecordsstolen))
## Warning: NAs introduced by coercion
##clean up sensitivity
## Note that this is a little subjective. The legend ranges from 1 to 5, but the data goes to 5000 and sometimes contains multiple impacts
## I have just taken the first character of each line. This will tend to underestimate impact but is reproducible from a first pass effort.
##Turn sensitivity into character
Breaches$datasensitivity <- as.character(Breaches$datasensitivity)
##take just the first character of each entry (substring is vectorized,
##so no lapply is needed and the column stays an atomic vector)
Breaches$datasensitivity <- substring(Breaches$datasensitivity, 1, 1)
##turn back into numeric
Breaches$datasensitivity <- as.numeric(Breaches$datasensitivity)
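A random subset of rows can then be drawn for inspection. A minimal sketch (the seed and sample size are arbitrary choices of mine, not from the hidden code):

##draw a reproducible random sample of rows for display
set.seed(42)    ##arbitrary seed, for reproducibility
show.cols <- c("entity", "year", "methodofleak",
               "noofrecordsstolen", "datasensitivity")
Breaches[sort(sample(nrow(Breaches), 10)), show.cols]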
Here is a random subset of the cleaned data, drawn roughly as sketched above. You’ll note that one row has blank data for the number of records stolen; this is presumably where the “NAs introduced by coercion” warning during cleaning comes from. I plan to look into it later but don’t have time right now; judging from a back-of-the-envelope calculation, I don’t think it biases the conclusions of this analysis.
| | entity | year | methodofleak | noofrecordsstolen | datasensitivity |
|---|---|---|---|---|---|
| 3 | Adobe | 2014 | hacked | 38000000 | 5.00 |
| 4 | Advocate Medical Group | 2014 | lost / stolen media | 4000000 | 2.00 |
| 13 | AOL | 2014 | hacked | | 1.00 |
| 28 | UbiSoft | 2013 | hacked | 58000000 | 2.00 |
| 52 | Memorial Healthcare System | 2011 | lost / stolen media | 102153 | 2.00 |
| 54 | Nemours Foundation | 2011 | lost / stolen media | 1055489 | 4.00 |
| 56 | Oregon Department of Motor Vehicles | 2011 | poor security | 1000000 | 2.00 |
| 66 | Stratfor | 2011 | accidentally published | 935000 | 3.00 |
| 124 | TD Ameritrade | 2007 | hacked | 6300000 | 1.00 |
| 125 | Texas Lottery | 2007 | inside job | 89000 | 2.00 |
The total number of records lost from all data breaches shows a general upward trend. The plot below is a full aggregate of all the data. There is substantial year-on-year variation, predominantly because large data breaches are (thankfully) still rare enough that the relative fluctuation, which scales as 1/sqrt(N), is reasonably large.
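The plotting code is hidden, but the aggregation and trend fit behind the plot can be sketched as follows; the object names yearly.all and fit.all are mine, for illustration.

##total records lost per year, summed over all breaches
yearly.all <- aggregate(noofrecordsstolen ~ year, data = Breaches, FUN = sum)
##straight-line trend of total records lost vs. calendar year
fit.all <- lm(noofrecordsstolen ~ year, data = yearly.all)
coef(fit.all)["year"]    ##fitted slope: extra records lost per year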
An interesting segmentation is to look at the records stolen by hacking (as opposed to, say, physical theft or an insider). Hacking is usually associated with a malicious outside attack exploiting a vulnerability. The industry is taking increased steps to close vulnerabilities, but the investment in hacks, especially by nation-states and organized crime, is also rising.
As can be seen from the data, breaches from hacks vary significantly from year to year, but have also produced a generally upward trend in the number of records lost. The fitted trend line rises faster than the trend line for the overall breaches: the rate of increase in records lost to hacks, 6990 thousand/year, is about 6 % higher than the 6593 thousand/year for all breaches.
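The corresponding fit for hacks alone might look like this; I assume the methodofleak coding “hacked” visible in the sample table above.

##repeat the trend fit for hacking breaches only
Hacks <- subset(Breaches, methodofleak == "hacked")
yearly.hacks <- aggregate(noofrecordsstolen ~ year, data = Hacks, FUN = sum)
fit.hacks <- lm(noofrecordsstolen ~ year, data = yearly.hacks)
##ratio of the two fitted slopes
coef(fit.hacks)["year"] / coef(fit.all)["year"]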
As shown above, the loss of records from hacks is increasing faster than that from data breaches as a whole, but is the severity of the attacks also increasing?
The data are categorized by severity on a scale of 1 to 5:
1. Just email address/Online information
2. SSN/Personal details
3. Credit card information
4. Email password/Health records
5. Full bank account details
The scale is of course arbitrary, but it is sensible and provides a systematic framework for measuring qualitative change.
The analysis to compute the mean severity of attacks is straightforward. The results, shown below, indicate that in fact, while the severity of hacks is increasing, it is not increasing as fast as the severity of all attacks.
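A minimal sketch of that computation, reusing the Hacks subset from above (again, the object names are mine):

##mean data sensitivity per year, for all breaches and for hacks only
sens.all <- aggregate(datasensitivity ~ year, data = Breaches, FUN = mean)
sens.hacks <- aggregate(datasensitivity ~ year, data = Hacks, FUN = mean)
##slopes of the fitted severity trends (sensitivity units per year)
coef(lm(datasensitivity ~ year, data = sens.all))["year"]
coef(lm(datasensitivity ~ year, data = sens.hacks))["year"]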
So, in response to the original question:

How much have the severity and number of hacking-induced data breaches changed over time, and is it different from the general trend of breaches?
While the number of hacking-induced data record losses has increased about 6 % faster than the number of records lost from breaches overall, we find the surprising result that the sensitivity of records lost from breaches overall is increasing 1.9 times faster than the sensitivity of hacks.
This means that while the number of hacked records is growing, the biggest problem from a data sensitivity standpoint actually comes from other causes.
The number of records lost to hacks is increasing faster than the overall rate of record losses from data breaches, but the sensitivity of the data stolen by hackers is actually increasing more slowly than that from breaches overall.
A couple of hypotheses:
1. Hacking attacks, such as the large breaches at Target and Home Depot, focus on stealing credit card numbers, presumably for financial gain. These activities, while getting larger, will be concentrated around sensitivity level 3; the attackers are generally not after medical records.
2. HIPAA now requires disclosure of breaches of medical records in some cases. Medical records, on the scale used here, have higher sensitivity, so there may be a systematic bias in which data are reported that is affecting the overall trend.
… which suggests questions for the next round of programming.