The data breaches noted here have been reported because the personal information compromised includes data elements useful to identity thieves, such as Social Security numbers, account numbers, and driver’s license numbers. Some breaches that do NOT expose such sensitive information have been included in order to underscore the variety and frequency of data breaches. However, we have not included the number of records involved in such breaches in the total because we want this compilation to reflect breaches that expose individuals to identity theft as well as breaches that qualify for disclosure under state laws. The breaches posted below include only those reported in the United States. They do not include incidents in other countries.
This information was extracted from https://www.privacyrights.org/data-breach and covers breaches reported between 2005 and 2015.
Some of the variables included are: Type of breach, Entity affected by the breach, State, Source and Description.
1. Question:
Is it possible to correctly classify the type of breach using text mining on the full description of the breach? Can we identify specific threats for each Entity?
2. Input Data:
You can download the data-set containing data-breach information (3.6 MB) at: https://www.privacyrights.org/data-breach
The data-set containing city names and GPS coordinates can be obtained from: http://simplemaps.com/files/cities.csv
We will use the USA map from ‘library(maps)’
Finally we will obtain a very detailed table that relates short and long state names from: http://www.infoplease.com/ipa/A0110468.html
3. Preprocessing:
Clean and pool the data.
Deal with NA values; check zero-frequency ratios; remove redundant, highly correlated variables if necessary; remove near-zero-variance predictors.
4. Exploratory Analysis:
Take a look at relationships and patterns.
5. Text Mining:
Before creating a predictive model for the type of data breach, we will apply text mining to the variable Descrip. This variable contains comments about each data breach, and we will try to predict the Type of data breach using VCorpus and tm_map. We will show that we can create clusters of one, two or more words (n-grams) and use them for classification. In the end we will use one-word clusters for the final analysis.
6. Build the models and compare:
A seed is assigned and the data are split 70/30 into training and testing sets. The 'rpart' (Rattle tree) and 'glmnet' model algorithms are then used to see whether a successful predictive model can be built. These models achieved 62% and 84% prediction accuracy, respectively.
setwd("~/R/Data Driven security")
libs <- c ("tm", "plyr", "class", "SnowballC", "dplyr", "caret", "corrplot",
"beepr","RWeka","e1071","ggplot2", "datasets", "maps","plotly", "wordcloud")
lapply(libs, library, character.only=T)
Load the data-set, replacing all missing values ('NA', 'NaN' or empty strings) with NA.
mydata = read.csv("Chronology-of-Data-Breaches_-_Privacy-Rights-Clearinghouse.csv",
header = TRUE, stringsAsFactors = FALSE, na.strings = c("NA", "NaN", ""))
Download the file containing information on all US cities and replace all missing values ('NA', 'NaN' or empty strings) with NA. This data-set, with 5 variables and 29,470 observations, relates zip code, state, and latitude and longitude coordinates to a specific city.
## Outdated Link
#fileURL <- "http://simplemaps.com/files/cities.csv"
# desfile <- "~/R/Data Driven security/cities.csv"
# download.file(fileURL, destfile = desfile)
cities = read.csv("~/R/Data Driven security/cities.csv",
header = TRUE, stringsAsFactors = FALSE, na.strings = c("NA", "NaN", ""))
head(cities,2)
## zip state city lat lng
## 1 35004 AL Acmar 33.58413 -86.51557
## 2 35005 AL Adamsville 33.58844 -86.95973
Load the US map data
all_states <- map_data("state")
A table that contains all the possible versions of state names and territories is very useful when different data-sets are joined using the state as the key.
library(rvest)
StatePosTCODE <- read_html("http://www.infoplease.com/ipa/A0110468.html")
STPC<-StatePosTCODE %>%
html_nodes("table") %>%
.[[2]] %>%
html_table(); head(STPC,2)
## State Abbreviation Postal Code
## 1 Alabama Ala. AL
## 2 Alaska Alaska AK
#Keep only the names and postal code
STPC <- data.frame(STPC[,c(1,3)])
# Eliminate and Rename the necessary variables
mydata$Total.Records <- mydata$X.5
mydata <- select(mydata, -Street, -Postal.Code, -X.3, -X.1,
-Attached.Files, -X.5, -X, -All.Terms)
names(mydata)[names(mydata)=="X.."] <- "State"
names(mydata)[names(mydata)=="X.4"] <- "Data_Sourcee"
names(mydata)[names(mydata)=="X.2"] <- "Descrip"
Some cities contain more than one GPS coordinate. Let's simplify this data-set to one GPS coordinate per city. Also make the names more meaningful.
cities$Postal.Code <- cities$state; cities$state <- NULL
cities$Location <- cities$city; cities$city <- NULL
cities <- ddply(cities, c("Postal.Code", "Location"), summarise, lat=max(lat), lng=max(lng))
# Include Short names in mydata
mydata <- merge(mydata,STPC, by=c("State") )
# Merge mydata with cities
mydata1 <- merge(mydata,cities,by=c("Postal.Code", "Location"))
# size of mydata
print(object.size(mydata1), units="Mb")
## 2.6 Mb
To finalize, let’s transform the Date.Made.Public variable from character to Date (POSIXct format). Let’s also create a new variable containing only the year of the data breach.
mydata1$Date.Made.Public <- as.POSIXct(mydata1$Date.Made.Public, format="%B %d, %Y")
mydata1$year <- strftime(mydata1$Date.Made.Public, "%Y")
# Keep only those variables with less than 10% NA
Find_NA = data.frame (apply(mydata1, 2, function(x) sum(is.na(x)==T)))
mydata1<-mydata1[, colMeans(is.na(mydata1)) <= .1]
# Omit rows with NA
mydata1<- na.omit(mydata1)
## Near-zero-variance and zero-frequency covariates
ZC <- nearZeroVar(mydata1, saveMetrics = T); ZC
## freqRatio percentUnique zeroVar nzv
## Postal.Code 1.787879 1.2790995 FALSE FALSE
## Location 2.542169 26.7843438 FALSE FALSE
## State 1.787879 1.2790995 FALSE FALSE
## Date.Made.Public 1.500000 53.1337938 FALSE FALSE
## Name 1.000000 91.1742134 FALSE FALSE
## Entity 1.637209 0.1790739 FALSE FALSE
## Type 1.029756 0.2046559 FALSE FALSE
## Total.Records 48.146341 23.7145050 FALSE FALSE
## Descrip 1.000000 99.9232540 FALSE FALSE
## Data_Sourcee 2.092541 0.3837299 FALSE FALSE
## lat 2.542169 29.4448708 FALSE FALSE
## lng 2.542169 29.4704528 FALSE FALSE
## year 1.119231 0.2814019 FALSE FALSE
nzv_T <- which(ZC$nzv == T); nzv_T # locate those variables
## integer(0)
# No near-zero-variance or zero-frequency-ratio predictors were found.
Let’s take a look at the outputs Type and Entity.
But first, the ‘CARD’ and ‘UNKN’ classes (from the output Type) are going to be removed, because ‘CARD’ corresponds to Payment Card Fraud and ‘UNKN’ does not provide information. We understand that ‘CARD’ is related to the use of credit cards, but it does not indicate the method used to obtain the information. ‘UNKN’ means that the path that caused the data breach is not known. As a consequence, mining the descriptions of these cases would not provide any useful information.
mydata1 <- subset(mydata1, Type != "CARD")
mydata1 <- subset(mydata1, Type != "UNKN")
table(mydata1$Type, mydata1$Entity)
##
## BSF BSO BSR EDU GOV MED NGO
## DISC 91 73 43 191 155 137 7
## HACK 108 255 211 221 81 70 23
## INSD 79 52 64 20 60 192 7
## PHYS 48 47 34 43 84 184 8
## PORT 141 111 61 110 129 358 31
## STAT 21 18 16 47 21 89 5
ggplot(mydata1, aes(Postal.Code))+geom_bar(aes(fill=Type))+
xlab("State") + ylab("Number Cases") + ggtitle("Data Breach Type by State")+
theme(axis.text.x=element_text(angle=60, hjust = 1))
It is clear that California suffers the highest number of breaches, followed by New York and Texas; Florida, Illinois and Ohio come next. The leading cause of data breach is HACK. PORT (lost, discarded or stolen laptop, PDA, smartphone, portable memory device, CD, hard drive, data tape, etc.) and DISC (sensitive information posted publicly on a website, or mishandled or sent to the wrong party via email, fax or mail) are the next most common causes. These two account for a very large number of cases and are caused by lack of vigilance, not by a direct action against the institution.
ggplot(mydata1, aes(Postal.Code))+geom_bar(aes(fill=Entity))+
xlab("State") + ylab("Number Cases") + ggtitle("Data Breach Entity by State")+
theme(axis.text.x=element_text(angle=60, hjust = 1))
This graph supports the fact that MED (Healthcare - Medical Providers) is the Entity with the highest number of data breaches, followed very closely by BSO (Businesses - Other).
ggplot(mydata1, aes(Entity))+geom_bar(aes(fill=Type))+
xlab("Entity") + ylab("Number of Cases") + ggtitle("Data Breach by Entity (Type)")+
theme(axis.text.x=element_text(angle=90, hjust = 1))
This image shows that Medical entities suffered the largest number of data breaches and that most of these breaches are PORT (lost, discarded or stolen laptop, PDA, smartphone, portable memory device, CD, hard drive, data tape, etc.), in other words lack of vigilance. The second cause of data breaches in medical institutions is INSD (someone with legitimate access, such as an employee or contractor, intentionally breaches information), followed by PHYS (physical loss: lost, discarded or stolen non-electronic records, such as paper documents). It is surprising that the three main causes of data breaches are related to negligence and lack of vigilance.
It is also visible that BSR (Businesses - Retail/Merchant) and BSO (Businesses - Other) are the main targets for hackers. In addition, GOV (Government and Military) data breaches are mainly due to DISC (unintended disclosure: sensitive information posted publicly on a website, or mishandled or sent to the wrong party via email, fax or mail).
ggplot(mydata1, aes(year))+geom_bar(aes(fill=Entity))+
facet_grid(Entity~., scales = "free")+
xlab("Year") + ylab("Number Cases") + ggtitle("Data Breach by Entity (Historical)")+
theme(axis.text.x=element_text(angle=90, hjust = 1))
It can be seen that 2009 represents a minimum in the number of data-breach events, and after that date there is a significant increase for all entities. This increase stops between 2012 and 2013, and the counts have been falling since then, except for BSO (Businesses - Other).
ggplot(mydata1, aes(year))+geom_bar(aes(fill=Type))+
facet_grid(Type~., scales = "free")+
xlab("Year") + ylab("Number Cases") + ggtitle("Data Breach by Type (Historical)")+
theme(axis.text.x=element_text(angle=90, hjust = 1))
We can see that in almost all cases the number of data breaches was small in 2009, then sharply increased until 2011 and 2012; since then it has been decreasing steadily, with a large reduction in the number of cases in 2014 and 2015.
It seems that data breaches have been curbed and their numbers are in decline. However, the news talks more and more about data breaches and record exposure. Let’s take a look at a historical representation of record exposure.
ggplot(mydata1, aes(x=year, y=(Total.Records+1)))+
geom_point(aes(colour = Type, size = (Total.Records+1), shape=Entity))+
scale_size(range = c(0,10))+ geom_jitter(aes(colour = Type))+
xlab("Year") + ylab("Number of records Exposed") +
ggtitle("Historical Data Breach Number Exposed")
Note: NGO has no related symbol in this plot. The number of events and the number of records exposed for NGO are insignificant.
It is obvious that data breaches resulting from hacker attacks are the most damaging, and they focus on BSF and BSR entities. It can also be seen that in 2014 and 2015 the total number of records exposed is dominated by HACK attacks (brown = HACK dominates). In addition, the number of records exposed dropped in 2010 and has been increasing since then. MED entities are starting to become a focus of hacker attacks.
## Duplicated entity
mydata_duplicated <- subset(mydata1, Name == mydata1$Name[anyDuplicated(mydata1$Name)])
mydata_duplicated[,4:7,]
## Date.Made.Public Name Entity Type
## 32 2007-11-05 Alabama Department of Public Health GOV DISC
## 40 2014-05-22 Alabama Department of Public Health GOV INSD
The Alabama Department of Public Health suffered two data breaches, one in 2007 and another in 2014. The Type of data breach was different in each case.
Let’s now mine the Descrip(tion) variable and then use that information to predict the Type of data breach. First, looking at combinations of two words (bigrams):
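The corpus-building step is summarized here as a minimal sketch; the object names docs (the cleaned corpus) and the exact cleaning transformations are assumptions, not the original code. It produces cr2, the bigram frequency table plotted below.
# Build and clean a corpus from the breach descriptions (sketch, assumed steps)
docs <- VCorpus(VectorSource(mydata1$Descrip))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
# Two-word (bigram) tokenizer from RWeka, already loaded above
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm_bi <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))
# Bigram frequency table, sorted from most to least frequent
freq2 <- sort(rowSums(as.matrix(tdm_bi)), decreasing = TRUE)
cr2 <- data.frame(Names = names(freq2), Freq = freq2, stringsAsFactors = FALSE)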
set.seed(512)
wordcloud(cr2$Names, cr2$Freq, min.freq=200, colors=brewer.pal(6, "Dark2"))
# Frequency Classification
ggplot(head(cr2,30), aes(x=reorder(Names,-Freq), y=Freq)) +
geom_bar(stat="Identity", fill="Blue2") +
geom_text(aes(label=Freq), vjust = -0.5, size=2) +
ggtitle("Bigrams frequency") +
ylab("Frequency") +
xlab("Term")+
theme(axis.text.x=element_text(angle=60, hjust=1, size = 10))
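The one-word (monogram) term-document matrix used in the modeling step below appears as tdm2, but its construction is not shown above. A minimal sketch, assuming the docs corpus from the previous chunk and using cr1 as a hypothetical name for the monogram frequency table:
# Single-word term-document matrix (assumed construction of tdm2)
tdm2 <- TermDocumentMatrix(docs)
# Monogram frequency table (hypothetical name cr1)
freq1 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
cr1 <- data.frame(Names = names(freq1), Freq = freq1, stringsAsFactors = FALSE)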
set.seed(512)
wordcloud(cr2$Names, cr2$Freq, min.freq=200, colors=brewer.pal(6, "Dark2"))
# Frequency Classification
ggplot(head(cr2,30), aes(x=reorder(Names,-Freq), y=Freq)) +
geom_bar(stat="Identity", fill="Blue2") +
geom_text(aes(label=Freq), vjust = -0.5, size=2) +
ggtitle("Monograms frequency") +
ylab("Frequency") +
xlab("Term")+
theme(axis.text.x=element_text(angle=60, hjust=1, size = 10))
Add Type variable to the matrix
tdmMTX <- t(data.matrix(tdm2)); tdm <- data.frame(tdmMTX, stringsAsFactors = F)
TypetoTDM <- cbind(tdm, mydata1$Type)
TypetoTDM[is.na(TypetoTDM)]<- 0
names(TypetoTDM)[names(TypetoTDM)=="mydata1$Type"] <- "Type"
Set the system to use parallel processing
library(doParallel)
library(foreach)
library(parallel);
library(doSNOW)
num_Cores=detectCores()-1
cl <- makeCluster(num_Cores, type = "SOCK")
registerDoSNOW(cl)
newdata <- TypetoTDM
set.seed(512)
intrain <- createDataPartition(y = TypetoTDM$Type, p = 0.7, list = F)
training <- TypetoTDM[intrain,]
testing <- TypetoTDM[-intrain,]
dim(training); dim(testing)
## [1] 2624 8364
## [1] 1122 8364
# size of my dataset
print(object.size(training), units="Mb")
## 168.4 Mb
library(pROC)
library(gbm)
# Rattle Tree model
ctrlrpart <- trainControl(method="cv",
number = 10,
classProbs=TRUE,
savePredictions =TRUE)
library(rattle)
set.seed(512)
t3 <- Sys.time()
registerDoSNOW(cl)
rpart_fit <- train(Type~.,
data = training,
method = "rpart",
metric = "ROC",
maxdepth=8,
trControl = ctrlrpart,
tuneGrid = expand.grid(.cp=0.01))
t4 <- Sys.time()
trpart_fit<- difftime(t4,t3); trpart_fit
## Time difference of 2.592191 mins
save(rpart_fit, file = "rpart_fit.rda")
#registerDoSEQ();on.exit(stopCluster(cl))
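Since rattle is already loaded, the fitted tree can also be visualized; a quick sketch (this plot is not part of the original output):
# Plot the final rpart tree selected by caret
fancyRpartPlot(rpart_fit$finalModel)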
# - rpart Testing
# Confusion Matrix
test_pred_rpart <- predict(rpart_fit,testing)
ConMatrix_rpart <-confusionMatrix(testing$Type,test_pred_rpart); ConMatrix_rpart
## Confusion Matrix and Statistics
##
## Reference
## Prediction DISC HACK INSD PHYS PORT STAT
## DISC 116 40 51 1 1 0
## HACK 25 237 20 1 5 2
## INSD 19 12 102 0 8 1
## PHYS 37 19 29 20 29 0
## PORT 21 10 35 0 202 14
## STAT 2 18 10 0 6 29
##
## Overall Statistics
##
## Accuracy : 0.6292
## 95% CI : (0.6002, 0.6576)
## No Information Rate : 0.2995
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.535
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: DISC Class: HACK Class: INSD Class: PHYS
## Sensitivity 0.5273 0.7054 0.41296 0.90909
## Specificity 0.8969 0.9326 0.95429 0.89636
## Pos Pred Value 0.5550 0.8172 0.71831 0.14925
## Neg Pred Value 0.8861 0.8810 0.85204 0.99798
## Prevalence 0.1961 0.2995 0.22014 0.01961
## Detection Rate 0.1034 0.2112 0.09091 0.01783
## Detection Prevalence 0.1863 0.2585 0.12656 0.11943
## Balanced Accuracy 0.7121 0.8190 0.68362 0.90273
## Class: PORT Class: STAT
## Sensitivity 0.8048 0.63043
## Specificity 0.9082 0.96654
## Pos Pred Value 0.7163 0.44615
## Neg Pred Value 0.9417 0.98392
## Prevalence 0.2237 0.04100
## Detection Rate 0.1800 0.02585
## Detection Prevalence 0.2513 0.05793
## Balanced Accuracy 0.8565 0.79849
print(ConMatrix_rpart$overall)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 6.292335e-01 5.349547e-01 6.002162e-01 6.575712e-01 2.994652e-01
## AccuracyPValue McnemarPValue
## 6.454398e-115 NaN
print((ConMatrix_rpart$overall)[1]); print(trpart_fit)
## Accuracy
## 0.6292335
## Time difference of 2.711045 mins
# glmnet model
ctrlCV <- trainControl(method="cv", #Cross validation
number = 10, #K-(number)
returnResamp = "all",
classProbs=TRUE, # should class probabilities be computed for classification
savePredictions =TRUE)
library(glmnet)
set.seed(512)
t5 <- Sys.time()
registerDoSNOW(cl)
glmnetCV_fit <- train(Type~.,
data = training,
method = "glmnet",
metric = "ROC",
#preProc = c("center", "scale"),
trControl = ctrlCV,
tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15),
.lambda = c((1:5)/10)))
t6 <- Sys.time()
tglmnet_fit<- difftime(t6,t5); tglmnet_fit
## Time difference of 39.68031 mins
save(glmnetCV_fit, file = "glmnetCV_fit.rda")
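Before testing, it may be worth inspecting which alpha/lambda pair the cross-validation selected and which terms carry the most weight in the classification; a quick sketch (output not shown here):
# Tuning parameters chosen by 10-fold CV and the most influential terms
glmnetCV_fit$bestTune
head(varImp(glmnetCV_fit)$importance, 10)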
# - Testing
# Confusion Matrix
test_pred_glmnetCV <- predict(glmnetCV_fit,testing)
ConMatrix_glmnetCV <-confusionMatrix(testing$Type,test_pred_glmnetCV); ConMatrix_glmnetCV
## Confusion Matrix and Statistics
##
## Reference
## Prediction DISC HACK INSD PHYS PORT STAT
## DISC 173 25 4 4 2 1
## HACK 5 281 1 2 0 1
## INSD 6 16 108 1 11 0
## PHYS 6 3 8 92 23 2
## PORT 2 3 0 1 271 5
## STAT 3 3 1 2 37 19
##
## Overall Statistics
##
## Accuracy : 0.8414
## 95% CI : (0.8186, 0.8623)
## No Information Rate : 0.3066
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7987
## Mcnemar's Test P-Value : 1.388e-13
##
## Statistics by Class:
##
## Class: DISC Class: HACK Class: INSD Class: PHYS
## Sensitivity 0.8872 0.8489 0.88525 0.90196
## Specificity 0.9612 0.9886 0.96600 0.95882
## Pos Pred Value 0.8278 0.9690 0.76056 0.68657
## Neg Pred Value 0.9759 0.9399 0.98571 0.98988
## Prevalence 0.1738 0.2950 0.10873 0.09091
## Detection Rate 0.1542 0.2504 0.09626 0.08200
## Detection Prevalence 0.1863 0.2585 0.12656 0.11943
## Balanced Accuracy 0.9242 0.9188 0.92562 0.93039
## Class: PORT Class: STAT
## Sensitivity 0.7878 0.67857
## Specificity 0.9859 0.95795
## Pos Pred Value 0.9610 0.29231
## Neg Pred Value 0.9131 0.99149
## Prevalence 0.3066 0.02496
## Detection Rate 0.2415 0.01693
## Detection Prevalence 0.2513 0.05793
## Balanced Accuracy 0.8868 0.81826
print(ConMatrix_glmnetCV$overall)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.413547e-01 7.987381e-01 8.186417e-01 8.622609e-01 3.065954e-01
## AccuracyPValue McnemarPValue
## 5.104556e-302 1.387784e-13
print((ConMatrix_glmnetCV$overall)[1]); print(tglmnet_fit)
## Accuracy
## 0.8413547
## Time difference of 39.68031 mins
registerDoSEQ();on.exit(stopCluster(cl))
library(beepr)
beep(4)
The application of two models, Recursive Partitioning and Regression Trees (rpart) and the Lasso and Elastic-Net Regularized Generalized Linear Model (glmnet), resulted in classification accuracies of 62% and 84%, respectively.
Analysis of the data indicates that the main cause of data breach is HACK, followed by PORT (Lost, discarded or stolen laptop, PDA, smartphone, portable memory device, CD, hard drive, data tape, etc) and DISC (Sensitive information posted publicly on a website, mishandled or sent to the wrong party via email, fax or mail). These last two are related to lack of vigilance.
MED (Healthcare - Medical Providers) is the Entity with the highest number of data breaches, followed very closely by BSO (Businesses - Other).
It is visible that BSR (Businesses - Retail/Merchant) and BSO (Businesses - Other) are the main targets for hackers. In addition, GOV (Government and Military) data breaches are mainly due to DISC (unintended disclosure: sensitive information posted publicly on a website, or mishandled or sent to the wrong party via email, fax or mail).
Historical analysis indicates that the number of data-breach events has decreased in the last two years. This may be due to an improving economy (more jobs, less crime) and/or to improvements in the security of institutions.
However, even though HACK attacks have become less frequent, they are becoming the dominant cause of data breaches: in the last two years they have been the most frequent type of data breach, and HACK also results in the largest number of records exposed.
Future studies could involve matrices of two- or three-word combinations and/or combinations of all the matrices.