Synopsis:

The data breaches noted here have been reported because the personal information compromised includes data elements useful to identity thieves, such as Social Security numbers, account numbers, and driver’s license numbers. Some breaches that do NOT expose such sensitive information have been included in order to underscore the variety and frequency of data breaches. However, we have not included the number of records involved in such breaches in the total because we want this compilation to reflect breaches that expose individuals to identity theft as well as breaches that qualify for disclosure under state laws. The breaches posted below include only those reported in the United States. They do not include incidents in other countries.

Type of breach:

Entity affected by breach:

1. Question:
Is it possible to correctly classify the type of breach using text mining on the full description of the breach? Can we identify specific threats for each Entity?

2. Input Data:
You can download the data-set containing data-breach information (3.6 MB) at: https://www.privacyrights.org/data-breach
The data-set containing city names and GPS coordinates can be obtained from: http://simplemaps.com/files/cities.csv
We will use the USA map from ‘library(maps)’
Finally we will obtain a very detailed table that relates short and long state names from: http://www.infoplease.com/ipa/A0110468.html

3. Preprocessing:
Clean and Pool data
Deal with NA values; check zero-frequency ratios; remove redundant, highly correlated variables if necessary; remove near-zero-variance predictors.

4. Exploratory Analysis:
Take a look at relationships and patterns.

5. Text Mining:
Before creating a predictive model for the type of data breach, we will apply text mining to the variable Descrip. This variable contains commentary about each data breach, and we will try to predict the Type of data breach using VCorpus and tm_map. Here we will show that we can create clusters of one, two or more words and use them for classification. In the end we will use clusters of one word for our final analysis.

6. Build the models and compare:
A seed is assigned and the data are split 70/30 into training and testing data-sets. The 'rpart' (Rattle tree) and 'glmnet' model algorithms are then trained to see if a successful predictive model can be built. These models achieved 62% and 84% prediction accuracy, respectively.

7. Conclusion:

Input Data

setwd("~/R/Data Driven security")

libs <- c ("tm", "plyr", "class", "SnowballC", "dplyr", "caret", "corrplot", 
           "beepr","RWeka","e1071","ggplot2", "datasets", "maps","plotly", "wordcloud")

lapply(libs, library, character.only=T)

A. Data Breach set

Load the data-set, treating 'NA', 'NaN' and empty fields as NA.

mydata = read.csv("Chronology-of-Data-Breaches_-_Privacy-Rights-Clearinghouse.csv",
                  header = TRUE, stringsAsFactors = FALSE, na.strings = c("NA", "NaN", ""))

B. Cities information set

Download the file containing information on all US cities and replace all missing values ('NA', 'NaN' or empty strings) with NA. This data-set, with 5 variables and 29,470 observations, relates zip code, state, and latitude and longitude coordinates to a specific city.

## Outdated Link
#fileURL <- "http://simplemaps.com/files/cities.csv"
#        desfile <- "~/R/Data Driven security/cities.csv"
#        download.file(fileURL, destfile = desfile)
        
cities = read.csv("~/R/Data Driven security/cities.csv",
                  header = TRUE, stringsAsFactors = FALSE, na.strings = c("NA", "NaN", ""))
head(cities,2)     
##     zip state       city      lat       lng
## 1 35004    AL      Acmar 33.58413 -86.51557
## 2 35005    AL Adamsville 33.58844 -86.95973

C. US Map set

Load the US map data

all_states <- map_data("state")

D. Transformations from long state name to short state name set

A table that contains all possible versions of state and territory names is very useful when different data-sets are merged using the state as the binding key.

library(rvest)
StatePosTCODE <- read_html("http://www.infoplease.com/ipa/A0110468.html")
STPC<-StatePosTCODE %>%
        html_nodes("table") %>%
        .[[2]] %>%
        html_table(); head(STPC,2)
##     State Abbreviation Postal Code
## 1 Alabama         Ala.          AL
## 2  Alaska       Alaska          AK
#Keep only the names and postal code
STPC <- data.frame(STPC[,c(1,3)])

Preprocessing

  1. Use more meaningful variable names and eliminate variables that do not contribute significantly to the data.
# Eliminate and Rename the necessary variables
mydata$Total.Records <- mydata$X.5
mydata <- select(mydata, -Street, -Postal.Code, -X.3, -X.1,
                         -Attached.Files, -X.5, -X, -All.Terms)

        names(mydata)[names(mydata)=="X.."] <- "State"
        names(mydata)[names(mydata)=="X.4"] <- "Data_Sourcee"
        names(mydata)[names(mydata)=="X.2"] <- "Descrip"
  2. Prepare the cities data-set to be merged with mydata. Some cities contain more than one GPS coordinate, so let's simplify this data-set to one GPS coordinate per city. Also make the names more meaningful.
cities$Postal.Code <- cities$state; cities$state <- NULL
        cities$Location <- cities$city; cities$city <- NULL
        cities <- ddply(cities, c("Postal.Code", "Location"), summarise, lat=max(lat), lng=max(lng))
  3. Merge the three data-sets
# Include Short names in mydata 
mydata <- merge(mydata,STPC, by=c("State") )

# Merge mydata with cities
mydata1 <- merge(mydata,cities,by=c("Postal.Code", "Location"))

# size of mydata1
print(object.size(mydata1), units="Mb")        
## 2.6 Mb

To finish, let's transform the Date.Made.Public variable from character to date (POSIXct format). Let's also create a new variable containing only the year of the data breach.

mydata1$Date.Made.Public <- as.POSIXct(mydata1$Date.Made.Public, format="%B %d, %Y")
mydata1$year <- strftime(mydata1$Date.Made.Public, "%Y")
# Keep only those variables with less than 10% NA
Find_NA = data.frame (apply(mydata1, 2, function(x) sum(is.na(x)==T)))
mydata1<-mydata1[, colMeans(is.na(mydata1)) <= .1] 

# Omit rows with NA   
mydata1<- na.omit(mydata1) 

## Near-zero variance and zero-frequency covariates
ZC <- nearZeroVar(mydata1, saveMetrics = T); ZC
##                  freqRatio percentUnique zeroVar   nzv
## Postal.Code       1.787879     1.2790995   FALSE FALSE
## Location          2.542169    26.7843438   FALSE FALSE
## State             1.787879     1.2790995   FALSE FALSE
## Date.Made.Public  1.500000    53.1337938   FALSE FALSE
## Name              1.000000    91.1742134   FALSE FALSE
## Entity            1.637209     0.1790739   FALSE FALSE
## Type              1.029756     0.2046559   FALSE FALSE
## Total.Records    48.146341    23.7145050   FALSE FALSE
## Descrip           1.000000    99.9232540   FALSE FALSE
## Data_Sourcee      2.092541     0.3837299   FALSE FALSE
## lat               2.542169    29.4448708   FALSE FALSE
## lng               2.542169    29.4704528   FALSE FALSE
## year              1.119231     0.2814019   FALSE FALSE
nzv_T <- which(ZC$nzv == T); nzv_T # locate those variables 
## integer(0)
# No near-zero-variance predictors or problematic frequency ratios are found.

Exploratory Analysis

Let’s take a look at the outputs Type and Entity.

But first, the ‘CARD’ and ‘UNKN’ classes (from the output Type) are going to be removed, because ‘CARD’ corresponds to payment card fraud and ‘UNKN’ does not provide information. We understand that ‘CARD’ is related to the use of credit cards, but it does not indicate the method used to obtain the information. ‘UNKN’ means that the path that caused the data breach is not known. As a consequence, searching these records will not provide any useful information.

mydata1 <- subset(mydata1, Type != "CARD")
mydata1 <- subset(mydata1, Type != "UNKN")
table(mydata1$Type, mydata1$Entity)
##       
##        BSF BSO BSR EDU GOV MED NGO
##   DISC  91  73  43 191 155 137   7
##   HACK 108 255 211 221  81  70  23
##   INSD  79  52  64  20  60 192   7
##   PHYS  48  47  34  43  84 184   8
##   PORT 141 111  61 110 129 358  31
##   STAT  21  18  16  47  21  89   5
ggplot(mydata1, aes(Postal.Code))+geom_bar(aes(fill=Type))+
        xlab("State") + ylab("Number Cases") + ggtitle("Data Breach Type by State")+
        theme(axis.text.x=element_text(angle=60, hjust = 1)) 

It is clear that California suffers the highest number of threats, followed by New York and Texas; Florida, Illinois and Ohio come next. The leading cause of data breach is HACK. PORT (lost, discarded or stolen laptop, PDA, smartphone, portable memory device, CD, hard drive, data tape, etc.) and DISC (sensitive information posted publicly on a website, mishandled, or sent to the wrong party via email, fax or mail) are the next most common causes. These two account for a very large number of cases and result from lack of vigilance rather than a direct action against the institution.

ggplot(mydata1, aes(Postal.Code))+geom_bar(aes(fill=Entity))+
        xlab("State") + ylab("Number Cases") + ggtitle("Data Breach Entity by State")+
        theme(axis.text.x=element_text(angle=60, hjust = 1))

This graph supports the fact that MED (Healthcare - Medical Providers) is the Entity with the highest number of data breaches, followed very closely by BSO (Businesses - Other).

ggplot(mydata1, aes(Entity))+geom_bar(aes(fill=Type))+
        xlab("Entity") + ylab("Number of Cases") + ggtitle("Data Breach by Entity (Type)")+
        theme(axis.text.x=element_text(angle=90, hjust = 1))

This image shows that medical entities suffered the largest number of data breaches and that most of these breaches are PORT (lost, discarded or stolen laptop, PDA, smartphone, portable memory device, CD, hard drive, data tape, etc.), in other words lack of vigilance. The second cause of data breaches in medical institutions is INSD (someone with legitimate access, such as an employee or contractor, intentionally breaches information), followed by PHYS (physical loss: lost, discarded or stolen non-electronic records, such as paper documents). It is surprising that the three main causes of data breaches are related to negligence and lack of vigilance.

It is also visible that BSR (Businesses - Retail/Merchant) and BSO (Businesses - Other) are the main targets for hackers. In addition, GOV (Government and Military) data breaches are mainly due to DISC (unintended disclosure: sensitive information posted publicly on a website, mishandled, or sent to the wrong party via email, fax or mail).

ggplot(mydata1, aes(year))+geom_bar(aes(fill=Entity))+
        facet_grid(Entity~., scales = "free")+
        xlab("Year") + ylab("Number Cases") + ggtitle("Data Breach by Entity (Historical)")+
        theme(axis.text.x=element_text(angle=90, hjust = 1))

It can be seen that 2009 represents a minimum in the number of data-breach events, and after that date there is a significant increase in all cases. This increase stops between 2012 and 2013, and the counts have been falling since then, except for BSO (Businesses - Other).

ggplot(mydata1, aes(year))+geom_bar(aes(fill=Type))+
        facet_grid(Type~., scales = "free")+
        xlab("Year") + ylab("Number Cases") + ggtitle("Data Breach by Type (Historical)")+
        theme(axis.text.x=element_text(angle=90, hjust = 1))

We can see that in almost all cases the number of data breaches was small in 2009 and then increased sharply until 2011 and 2012. Since then it has been decreasing steadily, with a huge reduction in the number of cases in 2014 and 2015.

It seems that data breaches have been curbed and their numbers are in decline. However, the news talks more and more about data breaches and record exposure. Let's take a look at a historical representation of the number of records exposed.

ggplot(mydata1, aes(x=year, y=(Total.Records+1)))+
        geom_point(aes(colour = Type, size = (Total.Records+1), shape=Entity))+
        scale_size(range = c(0,10))+ geom_jitter(aes(colour = Type))+
        xlab("Year") + ylab("Number of records Exposed") + 
        ggtitle("Historical Data Breach Number Exposed")

Note: NGO has no dedicated symbol in this plot; the number of NGO data-breach events is insignificant.

It is obvious that data breaches resulting from hacker attacks are the most damaging, and they focus on BSF and BSR entities. It can also be seen that in 2014 and 2015 the total number of records exposed is dominated by hack attacks (brown = HACK dominates). In addition, the number of records exposed dropped in 2010 and has been increasing since then. MED entities are starting to become a focus of hacker attacks.

## Duplicated entity
mydata_duplicated <- subset(mydata1, Name == mydata1$Name[anyDuplicated(mydata1$Name)])
mydata_duplicated[,4:7,]
##    Date.Made.Public                                Name Entity Type
## 32       2007-11-05 Alabama Department of Public Health    GOV DISC
## 40       2014-05-22 Alabama Department of Public Health    GOV INSD

The Alabama Department of Public Health suffered two data breaches, one in 2007 and another in 2014. The Type of data breach was different in each case.

Text Mining

  1. Build a TDM (TermDocumentMatrix) and match it with its corresponding predictor, Type.
    Analyze the description of the data breach (the Descrip variable); a sketch of this step is shown below.
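The corpus-building code (VCorpus/tm_map) is not included in this report. Below is a minimal sketch, assuming the text comes from mydata1$Descrip; the object names docs, tdm2 and cr2, and the exact cleaning steps, are assumptions chosen to line up with the objects used later in this section (cr2 appears to be rebuilt for each n-gram size before the corresponding plots).

library(tm)  # already loaded above via libs

# Sketch (assumed names): build and clean a corpus from the breach descriptions
docs <- VCorpus(VectorSource(mydata1$Descrip))
docs <- tm_map(docs, content_transformer(tolower))        # lower-case
docs <- tm_map(docs, removePunctuation)                   # drop punctuation
docs <- tm_map(docs, removeNumbers)                       # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))   # drop common stop words
docs <- tm_map(docs, stripWhitespace)                     # collapse extra spaces

# Unigram (one-word) term-document matrix, reused later in "Build the Model"
tdm2 <- TermDocumentMatrix(docs)

# Term-frequency table in the Names/Freq format used by the word clouds below
Freq <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
cr2  <- data.frame(Names = names(Freq), Freq = Freq, stringsAsFactors = FALSE)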

Two Words Classification

Looking at combinations of two words
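The bigram tokenizer is likewise not shown. A sketch using RWeka's NGramTokenizer (RWeka is loaded at the top of the report) could look like the following; BigramTokenizer and tdm_bi are hypothetical names, and cr2 is rebuilt here from the bigram matrix for the bigram plots below.

# Sketch (assumed names): two-word (bigram) tokenizer for the TermDocumentMatrix
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm_bi <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))

# Rebuild the Names/Freq table from the bigram matrix
Freq_bi <- sort(rowSums(as.matrix(tdm_bi)), decreasing = TRUE)
cr2     <- data.frame(Names = names(Freq_bi), Freq = Freq_bi, stringsAsFactors = FALSE)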

Cloud Graphical representation: Bigram

        set.seed(512)   
        wordcloud(cr2$Names, cr2$Freq, min.freq=200, colors=brewer.pal(6, "Dark2")) 

# Frequency Classification
        ggplot(head(cr2,30), aes(x=reorder(Names,-Freq), y=Freq)) +
        geom_bar(stat="Identity", fill="Blue2") +
        geom_text(aes(label=Freq), vjust = -0.5, size=2) +
        ggtitle("Bigrams frequency") +
        ylab("Frequency") +
        xlab("Term")+
        theme(axis.text.x=element_text(angle=60, hjust=1, size = 10))

Cloud Graphical representation: Monogram

        set.seed(512)   
        wordcloud(cr2$Names, cr2$Freq, min.freq=200, colors=brewer.pal(6, "Dark2")) 

# Frequency Classification
        ggplot(head(cr2,30), aes(x=reorder(Names,-Freq), y=Freq)) +
        geom_bar(stat="Identity", fill="Blue2") +
        geom_text(aes(label=Freq), vjust = -0.5, size=2) +
        ggtitle("Monograms frequency") +
        ylab("Frequency") +
        xlab("Term")+
        theme(axis.text.x=element_text(angle=60, hjust=1, size = 10))

Build the Model

Add Type variable to the matrix

        tdmMTX <-  t(data.matrix(tdm2)); tdm <- data.frame(tdmMTX, stringsAsFactors = F)
        TypetoTDM <- cbind(tdm, mydata1$Type)
        TypetoTDM[is.na(TypetoTDM)]<- 0
        names(TypetoTDM)[names(TypetoTDM)=="mydata1$Type"] <- "Type"

Set the system to use parallel processing

        library(doParallel)
        library(foreach)
        library(parallel);
        library(doSNOW)
        num_Cores=detectCores()-1
        cl <- makeCluster(num_Cores, type = "SOCK")
        registerDoSNOW(cl)
        
        newdata <- TypetoTDM

Create Training set

set.seed(512)
intrain <- createDataPartition(y = TypetoTDM$Type, p = 0.7, list = F)
training <- TypetoTDM[intrain,]
testing <- TypetoTDM[-intrain,]
dim(training); dim(testing)
## [1] 2624 8364
## [1] 1122 8364
# size of my dataset
print(object.size(training), units="Mb")  
## 168.4 Mb
  • As can be seen, we are analyzing a 168.4 Mb data-set!

Set Train Control parameters

Use 10-fold cross validation to evaluate the model

library(pROC)
library(gbm)

Recursive Partitioning and Regression Trees model

# Rattle Tree model
ctrlrpart <- trainControl(method="cv",
                          number = 10,
                          classProbs=TRUE,
                          savePredictions =TRUE)

library(rattle)
set.seed(512)
t3 <- Sys.time()

registerDoSNOW(cl)
        
rpart_fit <- train(Type~.,
                   data = training,
                   method = "rpart",
                   metric = "ROC",
                   maxdepth=8,
                   trControl = ctrlrpart,
                   tuneGrid = expand.grid(.cp=0.01))


t4 <- Sys.time()
trpart_fit<- difftime(t4,t3); trpart_fit
## Time difference of 2.592191 mins
save(rpart_fit, file = "rpart_fit.rda")
#registerDoSEQ();on.exit(stopCluster(cl))
# - rpart Testing
# Confusion Matrix
test_pred_rpart <- predict(rpart_fit,testing)
ConMatrix_rpart <-confusionMatrix(testing$Type,test_pred_rpart); ConMatrix_rpart
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DISC HACK INSD PHYS PORT STAT
##       DISC  116   40   51    1    1    0
##       HACK   25  237   20    1    5    2
##       INSD   19   12  102    0    8    1
##       PHYS   37   19   29   20   29    0
##       PORT   21   10   35    0  202   14
##       STAT    2   18   10    0    6   29
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6292          
##                  95% CI : (0.6002, 0.6576)
##     No Information Rate : 0.2995          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.535           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: DISC Class: HACK Class: INSD Class: PHYS
## Sensitivity               0.5273      0.7054     0.41296     0.90909
## Specificity               0.8969      0.9326     0.95429     0.89636
## Pos Pred Value            0.5550      0.8172     0.71831     0.14925
## Neg Pred Value            0.8861      0.8810     0.85204     0.99798
## Prevalence                0.1961      0.2995     0.22014     0.01961
## Detection Rate            0.1034      0.2112     0.09091     0.01783
## Detection Prevalence      0.1863      0.2585     0.12656     0.11943
## Balanced Accuracy         0.7121      0.8190     0.68362     0.90273
##                      Class: PORT Class: STAT
## Sensitivity               0.8048     0.63043
## Specificity               0.9082     0.96654
## Pos Pred Value            0.7163     0.44615
## Neg Pred Value            0.9417     0.98392
## Prevalence                0.2237     0.04100
## Detection Rate            0.1800     0.02585
## Detection Prevalence      0.2513     0.05793
## Balanced Accuracy         0.8565     0.79849
print(ConMatrix_rpart$overall)
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   6.292335e-01   5.349547e-01   6.002162e-01   6.575712e-01   2.994652e-01 
## AccuracyPValue  McnemarPValue 
##  6.454398e-115            NaN
print((ConMatrix_rpart$overall)[1]); print(trpart_fit)
##  Accuracy 
## 0.6292335
## Time difference of 2.711045 mins

Lasso and Elastic-Net Regularized Generalized Linear Models (GLMNET) Model

# glmnet model
ctrlCV <- trainControl(method="cv",   #Cross validation
                       number = 10,     #K-(number)
                       returnResamp = "all",
                       classProbs=TRUE, # should class probabilities be computed for classification
                       savePredictions =TRUE) 


library(glmnet)
set.seed(512)
t5 <- Sys.time()

registerDoSNOW(cl)
glmnetCV_fit <- train(Type~.,
                      data = training,
                      method = "glmnet",
                      metric = "ROC",
                      #preProc = c("center", "scale"),
                      trControl = ctrlCV,
                      tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15),
                                             .lambda = c((1:5)/10)))
t6 <- Sys.time()
tglmnet_fit<- difftime(t6,t5); tglmnet_fit
## Time difference of 39.68031 mins
save(glmnetCV_fit, file = "glmnetCV_fit.rda")

# - Testing
# Confusion Matrix
test_pred_glmnetCV <- predict(glmnetCV_fit,testing)
ConMatrix_glmnetCV <-confusionMatrix(testing$Type,test_pred_glmnetCV); ConMatrix_glmnetCV
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DISC HACK INSD PHYS PORT STAT
##       DISC  173   25    4    4    2    1
##       HACK    5  281    1    2    0    1
##       INSD    6   16  108    1   11    0
##       PHYS    6    3    8   92   23    2
##       PORT    2    3    0    1  271    5
##       STAT    3    3    1    2   37   19
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8414          
##                  95% CI : (0.8186, 0.8623)
##     No Information Rate : 0.3066          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7987          
##  Mcnemar's Test P-Value : 1.388e-13       
## 
## Statistics by Class:
## 
##                      Class: DISC Class: HACK Class: INSD Class: PHYS
## Sensitivity               0.8872      0.8489     0.88525     0.90196
## Specificity               0.9612      0.9886     0.96600     0.95882
## Pos Pred Value            0.8278      0.9690     0.76056     0.68657
## Neg Pred Value            0.9759      0.9399     0.98571     0.98988
## Prevalence                0.1738      0.2950     0.10873     0.09091
## Detection Rate            0.1542      0.2504     0.09626     0.08200
## Detection Prevalence      0.1863      0.2585     0.12656     0.11943
## Balanced Accuracy         0.9242      0.9188     0.92562     0.93039
##                      Class: PORT Class: STAT
## Sensitivity               0.7878     0.67857
## Specificity               0.9859     0.95795
## Pos Pred Value            0.9610     0.29231
## Neg Pred Value            0.9131     0.99149
## Prevalence                0.3066     0.02496
## Detection Rate            0.2415     0.01693
## Detection Prevalence      0.2513     0.05793
## Balanced Accuracy         0.8868     0.81826
print(ConMatrix_glmnetCV$overall)
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   8.413547e-01   7.987381e-01   8.186417e-01   8.622609e-01   3.065954e-01 
## AccuracyPValue  McnemarPValue 
##  5.104556e-302   1.387784e-13
print((ConMatrix_glmnetCV$overall)[1]); print(tglmnet_fit)
##  Accuracy 
## 0.8413547
## Time difference of 39.68031 mins

Close parallel clustering

  registerDoSEQ(); stopCluster(cl)  # return to sequential processing and stop the worker cluster
library(beepr)
  beep(4)

Conclusion