This is an analysis of information security breach data from the website Information is Beautiful. They have made a really elegant bubble chart showing a very intuitive time evolution of data breaches which can be encoded for a number of factors.
Thankfully they have also made their raw data compilation available here so I’ve used it for this analysis to expand on some specific questions I wanted to address.
The specific question I want to address here (quantitatively) is: “How much have the severity and number of hacking-induced data breaches changed over time, and is it different from the general trend of breaches?”
The conclusion is that while hacks are increasing faster than the overall rate of breaches, the sensitivity of the data lost to hacks is not increasing as fast as that of breaches overall.
I’ve left exposed the code essential to the analysis (like data cleaning) but have hidden what I consider the boring parts of the code, like standard plotting. You can get the whole code on my GitHub here.
The data behind the Information is Beautiful chart are made available as a Google Doc. I’ve been unable (so far) to write a successful R program to download the data from the website directly, so in the interest of time I just downloaded their raw data as a .csv.
I’ve made the downloaded copy of the data I use below available on the GitHub repository linked to this analysis.
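Reading that local copy into R is then straightforward. Here is a minimal sketch; the file name is an assumption, and I read with header = FALSE because the sheet’s real column names arrive in the first data row and are assigned during the cleaning below.

##read the downloaded copy of the raw data
##"breaches.csv" is an assumed local file name
Breaches <- read.csv("breaches.csv", header = FALSE, stringsAsFactors = FALSE)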
The data was generally in pretty good shape, but I did need to take some steps to clean it up for analysis in R.
Steps to clean the data include:

* Retaining only spreadsheet columns 1-10 to include only the data I need.
* Getting rid of a second descriptive row (which is not data).
* Converting column names to lower case and deleting spaces.
* Cleaning the formatting of some numbers to remove commas etc.
* Implementing some assumptions about the coding of severities. The legend runs between 1 (least severe) and 5 (most severe), but some of the data are outside this range (e.g. 20 and 50000). Based on my reading of the details, I have assumed these are incorrect and have converted them to single digits.
* Turning the year into an actual numerical representation of a calendar year.
You can see the details in the code.
## DATA CLEANING
## delete trailing columns
Breaches<-Breaches[,1:10]
##get rid of second descriptive row: the real column names live in row 1
testt <- as.character(unlist(Breaches[1, ]))
##manage column names: lower case, no spaces
testt <- tolower(testt)
testt <- gsub(" ", "", testt)
##apply colnames
colnames(Breaches) <- testt
##delete junk rows
Breaches<-Breaches[c(-1, -2),]
##get rid of commas in some numerical records (gsub removes all commas,
##and the result must be assigned back before the numeric conversion below)
Breaches$noofrecordsstolen <- gsub(",", "", as.character(Breaches$noofrecordsstolen))
#Make the year numeric
Breaches$year<-as.integer(as.numeric(as.character(Breaches$year))+2004)
##Make Breaches numeric
Breaches$noofrecordsstolen<-as.integer(as.character(Breaches$noofrecordsstolen))
## Warning: NAs introduced by coercion
##clean up sensitivity
## Note that this is a little subjective. The legend ranges from 1 to 5, but the data goes to 5000 and sometimes contains multiple impacts
## I have just taken the first character of each line. This will tend to underestimate impact but is reproducible from a first pass effort.
##Turn sensitivity into character
Breaches$datasensitivity <- as.character(Breaches$datasensitivity)
##take just the first character of each entry (substring is vectorized,
##so no lapply is needed and the column stays an atomic vector)
Breaches$datasensitivity <- substring(Breaches$datasensitivity, 1, 1)
##turn back into numeric
Breaches$datasensitivity <- as.numeric(Breaches$datasensitivity)
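A random subset of rows can then be drawn for inspection. A minimal sketch (the seed and sample size are arbitrary choices of mine, not from the hidden code):

##draw a reproducible random sample of rows for display
set.seed(42)    ##arbitrary seed, for reproducibility
show.cols <- c("entity", "year", "methodofleak",
               "noofrecordsstolen", "datasensitivity")
Breaches[sort(sample(nrow(Breaches), 10)), show.cols]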
Here is a random subset of the cleaned data, drawn roughly as sketched above. You’ll note that one row has blank data for the number of records stolen; this is presumably where the “NAs introduced by coercion” warning during cleaning comes from. I plan to look into it later but don’t have time right now; judging from a back-of-the-envelope calculation, I don’t think it biases the conclusions of this analysis.
| | entity | year | methodofleak | noofrecordsstolen | datasensitivity |
|---|---|---|---|---|---|
| 3 | Adobe | 2014 | hacked | 38000000 | 5.00 |
| 4 | Advocate Medical Group | 2014 | lost / stolen media | 4000000 | 2.00 |
| 13 | AOL | 2014 | hacked | | 1.00 |
| 28 | UbiSoft | 2013 | hacked | 58000000 | 2.00 |
| 52 | Memorial Healthcare System | 2011 | lost / stolen media | 102153 | 2.00 |
| 54 | Nemours Foundation | 2011 | lost / stolen media | 1055489 | 4.00 |
| 56 | Oregon Department of Motor Vehicles | 2011 | poor security | 1000000 | 2.00 |
| 66 | Stratfor | 2011 | accidentally published | 935000 | 3.00 |
| 124 | TD Ameritrade | 2007 | hacked | 6300000 | 1.00 |
| 125 | Texas Lottery | 2007 | inside job | 89000 | 2.00 |
The total number of records lost from all data breaches shows a general upward trend. The plot below is a full aggregate of all the data. There is substantial year-on-year variation, predominantly because large data breaches are (thankfully) still rare enough that the relative fluctuation, which scales as 1/sqrt(N), is reasonably large.
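The plotting code is hidden, but the aggregation and trend fit behind the plot can be sketched as follows; the object names yearly.all and fit.all are mine, for illustration.

##total records lost per year, summed over all breaches
yearly.all <- aggregate(noofrecordsstolen ~ year, data = Breaches, FUN = sum)
##straight-line trend of total records lost vs. calendar year
fit.all <- lm(noofrecordsstolen ~ year, data = yearly.all)
coef(fit.all)["year"]    ##fitted slope: extra records lost per year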
An interesting segmentation is to look at the records stolen by hacking (as opposed to, say, physical theft or an insider). Hacking is usually associated with a malicious outside attack exploiting a vulnerability. The industry is taking increased steps to close vulnerabilities, but the investment in hacks, especially by nation-states and organized crime, is also rising.
As can be seen from the data, breaches from hacks vary significantly from year to year, but have also produced a generally upward trend in the number of records lost. The fitted trend line rises faster than the trend line for the overall breaches: the rate of increase in records lost to hacks, 6990 thousand/year, is about 6 % higher than the 6593 thousand/year for all breaches.
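The corresponding fit for hacks alone might look like this; I assume the methodofleak coding “hacked” visible in the sample table above.

##repeat the trend fit for hacking breaches only
Hacks <- subset(Breaches, methodofleak == "hacked")
yearly.hacks <- aggregate(noofrecordsstolen ~ year, data = Hacks, FUN = sum)
fit.hacks <- lm(noofrecordsstolen ~ year, data = yearly.hacks)
##ratio of the two fitted slopes
coef(fit.hacks)["year"] / coef(fit.all)["year"]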
As shown above, the loss of records from hacks is increasing faster than that from data breaches as a whole, but is the severity of the attacks also increasing?
The data are categorized by severity on a scale of 1 to 5:
1. Just email address/Online information
2. SSN/Personal details
3. Credit card information
4. Email password/Health records
5. Full bank account details
The scale is of course arbitrary, but it is sensible and provides a systematic framework for measuring qualitative change.
The analysis to compute the mean severity of attacks is straightforward. The results, shown below, indicate that in fact, while the severity of hacks is increasing, it is not increasing as fast as the severity of all attacks.
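A minimal sketch of that computation, reusing the Hacks subset from above (again, the object names are mine):

##mean data sensitivity per year, for all breaches and for hacks only
sens.all <- aggregate(datasensitivity ~ year, data = Breaches, FUN = mean)
sens.hacks <- aggregate(datasensitivity ~ year, data = Hacks, FUN = mean)
##slopes of the fitted severity trends (sensitivity units per year)
coef(lm(datasensitivity ~ year, data = sens.all))["year"]
coef(lm(datasensitivity ~ year, data = sens.hacks))["year"]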
So, in response to the original question:

How much have the severity and number of hacking-induced data breaches changed over time, and is it different from the general trend of breaches?
While the number of hacking-induced data record losses has increased about 6 % faster than the number of records lost from breaches overall, we find the surprising result that the sensitivity of records lost from breaches overall is increasing 1.9 times faster than the sensitivity of hacks.
This means that while the number of hacked records is growing, the biggest problem from a data sensitivity standpoint actually comes from other causes.
The number of records lost to hacks is increasing faster than the overall rate of record losses from data breaches, but the sensitivity of the data stolen by hackers is actually increasing more slowly than that from breaches overall.
A couple of hypotheses:
1. Hacking attacks, such as the large breaches at Target and Home Depot, focus on stealing credit card numbers, presumably for financial gain. These activities, while getting larger, will be concentrated around sensitivity level 3; the attackers are generally not after medical records.
2. HIPAA now requires disclosure of breaches of medical records in some cases. Medical records, on the scale used here, have higher sensitivity, so there may be a systematic bias in which data are reported that is affecting the overall trend.
… which suggests questions for the next round of programming.