Synopsis:

The data breaches noted here have been reported because the personal information compromised includes data elements useful to identity thieves, such as Social Security numbers, account numbers, and driver’s license numbers. Some breaches that do NOT expose such sensitive information have been included in order to underscore the variety and frequency of data breaches. However, we have not included the number of records involved in such breaches in the total because we want this compilation to reflect breaches that expose individuals to identity theft as well as breaches that qualify for disclosure under state laws. The breaches posted below include only those reported in the United States. They do not include incidents in other countries.

Type of breach:

Entity affected by breach:

1. Question:
Is it possible to correctly classify the type of breach using text mining on the full description of the breach? Can we identify specific threats for each Entity?

2. Input Data:
You can download the data-set containing data-breach information (3.6 MB) at: https://www.privacyrights.org/data-breach
The data-set containing city names and GPS coordinates can be obtained from: http://simplemaps.com/files/cities.csv
We will use the USA map from ‘library(maps)’
Finally we will obtain a very detailed table that relates short and long state names from: http://www.infoplease.com/ipa/A0110468.html

3. Preprocessing:
Clean and Pool data
Deal with NA values; check zero-frequency ratios; remove redundant, highly correlated variables if necessary; remove near-zero-variance predictors.

4. Exploratory Analysis:
Take a look at relationships and patterns.

5. Text Mining:
Before creating a predictive model for the type of data breach, we will apply text mining to the variable Descrip. This variable contains commentary about each data breach, and we will try to predict the Type of data breach using VCorpus and tm_map. Here we will show that we can create clusters of one, two or more words and use them for classification. In the end we will use clusters of one word for our final analysis.

6. Build the models and compare:
A seed is assigned and the data are split 70/30 into training and testing data-sets. The 'rpart' (Rattle tree) and 'glmnet' model algorithms are then trained to see if a successful predictive model can be built. These models achieved 62% and 84% prediction accuracy, respectively.

7. Conclusion:

Input Data

setwd("~/R/Data Driven security")

libs <- c ("tm", "plyr", "class", "SnowballC", "dplyr", "caret", "corrplot", 
           "beepr","RWeka","e1071","ggplot2", "datasets", "maps","plotly", "wordcloud")

lapply(libs, library, character.only=T)

A. Data Breach set

Load the data-set, treating 'NA', 'NaN' and empty fields as NA.

mydata = read.csv("Chronology-of-Data-Breaches_-_Privacy-Rights-Clearinghouse.csv",
                  header = TRUE, stringsAsFactors = FALSE, na.strings = c("NA", "NaN", ""))

B. Cities information set

Download the file containing information on all US cities and replace all missing values ('NA', 'NaN' or empty strings) with NA. This data-set, with 5 variables and 29,470 observations, relates zip code, state, and latitude and longitude coordinates to a specific city.

## Outdated Link
#fileURL <- "http://simplemaps.com/files/cities.csv"
#        desfile <- "~/R/Data Driven security/cities.csv"
#        download.file(fileURL, destfile = desfile)
        
cities = read.csv("~/R/Data Driven security/cities.csv",
                  header = TRUE, stringsAsFactors = FALSE, na.strings = c("NA", "NaN", ""))
head(cities,2)     
##     zip state       city      lat       lng
## 1 35004    AL      Acmar 33.58413 -86.51557
## 2 35005    AL Adamsville 33.58844 -86.95973

C. US Map set

Load the US map data

all_states <- map_data("state")

D. Transformations from long state name to short state name set

A table that contains all possible versions of state and territory names is very useful when different data-sets are merged using the state as the binding key.

library(rvest)
StatePosTCODE <- read_html("http://www.infoplease.com/ipa/A0110468.html")
STPC<-StatePosTCODE %>%
        html_nodes("table") %>%
        .[[2]] %>%
        html_table(); head(STPC,2)
##     State Abbreviation Postal Code
## 1 Alabama         Ala.          AL
## 2  Alaska       Alaska          AK
#Keep only the names and postal code
STPC <- data.frame(STPC[,c(1,3)])

Preprocessing

  1. Use more meaningful variable names and eliminate variables that do not contribute significantly to the data.
# Eliminate and Rename the necessary variables
mydata$Total.Records <- mydata$X.5
mydata <- select(mydata, -Street, -Postal.Code, -X.3, -X.1,
                         -Attached.Files, -X.5, -X, -All.Terms)

        names(mydata)[names(mydata)=="X.."] <- "State"
        names(mydata)[names(mydata)=="X.4"] <- "Data_Sourcee"
        names(mydata)[names(mydata)=="X.2"] <- "Descrip"
  2. Prepare the cities data-set to be merged with mydata. Some cities contain more than one GPS coordinate, so let's simplify this data-set to one GPS coordinate per city. Also make the names more meaningful.
cities$Postal.Code <- cities$state; cities$state <- NULL
        cities$Location <- cities$city; cities$city <- NULL
        cities <- ddply(cities, c("Postal.Code", "Location"), summarise, lat=max(lat), lng=max(lng))
  3. Merge the three data-sets
# Include Short names in mydata 
mydata <- merge(mydata,STPC, by=c("State") )

# Merge mydata with cities
mydata1 <- merge(mydata,cities,by=c("Postal.Code", "Location"))

# size of mydata1
print(object.size(mydata1), units="Mb")        
## 2.6 Mb

To finish, let's transform the Date.Made.Public variable from character to date (POSIXct format). Let's also create a new variable containing only the year of the data breach.

mydata1$Date.Made.Public <- as.POSIXct(mydata1$Date.Made.Public, format="%B %d, %Y")
mydata1$year <- strftime(mydata1$Date.Made.Public, "%Y")
# Keep only those variables with less than 10% NA
Find_NA = data.frame (apply(mydata1, 2, function(x) sum(is.na(x)==T)))
mydata1<-mydata1[, colMeans(is.na(mydata1)) <= .1] 

# Omit rows with NA   
mydata1<- na.omit(mydata1) 

## Near-zero variance and zero-frequency covariates
ZC <- nearZeroVar(mydata1, saveMetrics = T); ZC
##                  freqRatio percentUnique zeroVar   nzv
## Postal.Code       1.787879     1.2790995   FALSE FALSE
## Location          2.542169    26.7843438   FALSE FALSE
## State             1.787879     1.2790995   FALSE FALSE
## Date.Made.Public  1.500000    53.1337938   FALSE FALSE
## Name              1.000000    91.1742134   FALSE FALSE
## Entity            1.637209     0.1790739   FALSE FALSE
## Type              1.029756     0.2046559   FALSE FALSE
## Total.Records    48.146341    23.7145050   FALSE FALSE
## Descrip           1.000000    99.9232540   FALSE FALSE
## Data_Sourcee      2.092541     0.3837299   FALSE FALSE
## lat               2.542169    29.4448708   FALSE FALSE
## lng               2.542169    29.4704528   FALSE FALSE
## year              1.119231     0.2814019   FALSE FALSE
nzv_T <- which(ZC$nzv == T); nzv_T # locate those variables 
## integer(0)
# No near-zero-variance predictors or problematic frequency ratios are found.

Exploratory Analysis

Let’s take a look at the outputs Type and Entity.

But first, the ‘CARD’ and ‘UNKN’ classes (from the output Type) are going to be removed, because ‘CARD’ corresponds to payment card fraud and ‘UNKN’ does not provide information. We understand that ‘CARD’ is related to the use of credit cards, but it does not indicate the method used to obtain the information. ‘UNKN’ means that the path that caused the data breach is not known. As a consequence, searching these records will not provide any useful information.

mydata1 <- subset(mydata1, Type != "CARD")
mydata1 <- subset(mydata1, Type != "UNKN")
table(mydata1$Type, mydata1$Entity)
##       
##        BSF BSO BSR EDU GOV MED NGO
##   DISC  91  73  43 191 155 137   7
##   HACK 108 255 211 221  81  70  23
##   INSD  79  52  64  20  60 192   7
##   PHYS  48  47  34  43  84 184   8
##   PORT 141 111  61 110 129 358  31
##   STAT  21  18  16  47  21  89   5
ggplot(mydata1, aes(Postal.Code))+geom_bar(aes(fill=Type))+
        xlab("State") + ylab("Number Cases") + ggtitle("Data Breach Type by State")+
        theme(axis.text.x=element_text(angle=60, hjust = 1)) 

It is clear that California suffers the highest number of threats, followed by New York and Texas; Florida, Illinois and Ohio come next. The leading cause of data breach is HACK. PORT (lost, discarded or stolen laptop, PDA, smartphone, portable memory device, CD, hard drive, data tape, etc.) and DISC (sensitive information posted publicly on a website, mishandled, or sent to the wrong party via email, fax or mail) are the next most common causes. These two account for a very large number of cases and result from lack of vigilance rather than a direct action against the institution.

ggplot(mydata1, aes(Postal.Code))+geom_bar(aes(fill=Entity))+
        xlab("State") + ylab("Number Cases") + ggtitle("Data Breach Entity by State")+
        theme(axis.text.x=element_text(angle=60, hjust = 1))

This graph supports the fact that MED (Healthcare - Medical Providers) is the Entity with the highest number of data breaches, followed very closely by BSO (Businesses - Other).

ggplot(mydata1, aes(Entity))+geom_bar(aes(fill=Type))+
        xlab("Entity") + ylab("Number of Cases") + ggtitle("Data Breach by Entity (Type)")+
        theme(axis.text.x=element_text(angle=90, hjust = 1))

This image shows that medical entities suffered the largest number of data breaches and that most of these breaches are PORT (lost, discarded or stolen laptop, PDA, smartphone, portable memory device, CD, hard drive, data tape, etc.), in other words lack of vigilance. The second cause of data breaches in medical institutions is INSD (someone with legitimate access, such as an employee or contractor, intentionally breaches information), followed by PHYS (physical loss: lost, discarded or stolen non-electronic records, such as paper documents). It is surprising that the three main causes of data breaches are related to negligence and lack of vigilance.

It is also visible that BSR (Businesses - Retail/Merchant) and BSO (Businesses - Other) are the main targets for hackers. In addition, GOV (Government and Military) data breaches are mainly due to DISC (unintended disclosure: sensitive information posted publicly on a website, mishandled, or sent to the wrong party via email, fax or mail).

ggplot(mydata1, aes(year))+geom_bar(aes(fill=Entity))+
        facet_grid(Entity~., scales = "free")+
        xlab("Year") + ylab("Number Cases") + ggtitle("Data Breach by Entity (Historical)")+
        theme(axis.text.x=element_text(angle=90, hjust = 1))

It can be seen that 2009 represents a minimum in the number of data-breach events, and after that date there is a significant increase in all cases. This increase stops between 2012 and 2013, and the counts have been falling since then, except for BSO (Businesses - Other).

ggplot(mydata1, aes(year))+geom_bar(aes(fill=Type))+
        facet_grid(Type~., scales = "free")+
        xlab("Year") + ylab("Number Cases") + ggtitle("Data Breach by Type (Historical)")+
        theme(axis.text.x=element_text(angle=90, hjust = 1))

We can see that in almost all cases the number of data breaches was small in 2009 and then increased sharply until 2011 and 2012. Since then it has been decreasing steadily, with a huge reduction in the number of cases in 2014 and 2015.

It seems that data breaches have been curbed and their numbers are in decline. However, the news talks more and more about data breaches and record exposure. Let's take a look at a historical representation of the number of records exposed.

ggplot(mydata1, aes(x=year, y=(Total.Records+1)))+
        geom_point(aes(colour = Type, size = (Total.Records+1), shape=Entity))+
        scale_size(range = c(0,10))+ geom_jitter(aes(colour = Type))+
        xlab("Year") + ylab("Number of records Exposed") + 
        ggtitle("Historical Data Breach Number Exposed")

Note: NGO has no dedicated symbol in this plot; the number of NGO data-breach events is insignificant.

It is obvious that data breaches resulting from hacker attacks are the most damaging, and they focus on BSF and BSR entities. It can also be seen that in 2014 and 2015 the total number of records exposed is dominated by hack attacks (brown = HACK dominates). In addition, the number of records exposed dropped in 2010 and has been increasing since then. MED entities are starting to become a focus of hacker attacks.

## Duplicated entity
mydata_duplicated <- subset(mydata1, Name == mydata1$Name[anyDuplicated(mydata1$Name)])
mydata_duplicated[,4:7,]
##    Date.Made.Public                                Name Entity Type
## 32       2007-11-05 Alabama Department of Public Health    GOV DISC
## 40       2014-05-22 Alabama Department of Public Health    GOV INSD

The Alabama Department of Public Health suffered two data breaches, one in 2007 and another in 2014. The Type of data breach was different in each case.

Text Mining

  1. Build a TDM (TermDocumentMatrix) and match it with its corresponding predictor, Type.
    Analyze the description of the data breach (the Descrip variable); a sketch of this step is shown below.
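The corpus-building code (VCorpus/tm_map) is not included in this report. Below is a minimal sketch, assuming the text comes from mydata1$Descrip; the object names docs, tdm2 and cr2, and the exact cleaning steps, are assumptions chosen to line up with the objects used later in this section (cr2 appears to be rebuilt for each n-gram size before the corresponding plots).

library(tm)  # already loaded above via libs

# Sketch (assumed names): build and clean a corpus from the breach descriptions
docs <- VCorpus(VectorSource(mydata1$Descrip))
docs <- tm_map(docs, content_transformer(tolower))        # lower-case
docs <- tm_map(docs, removePunctuation)                   # drop punctuation
docs <- tm_map(docs, removeNumbers)                       # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))   # drop common stop words
docs <- tm_map(docs, stripWhitespace)                     # collapse extra spaces

# Unigram (one-word) term-document matrix, reused later in "Build the Model"
tdm2 <- TermDocumentMatrix(docs)

# Term-frequency table in the Names/Freq format used by the word clouds below
Freq <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
cr2  <- data.frame(Names = names(Freq), Freq = Freq, stringsAsFactors = FALSE)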

Two Words Classification

Looking at combinations of two words
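The bigram tokenizer is likewise not shown. A sketch using RWeka's NGramTokenizer (RWeka is loaded at the top of the report) could look like the following; BigramTokenizer and tdm_bi are hypothetical names, and cr2 is rebuilt here from the bigram matrix for the bigram plots below.

# Sketch (assumed names): two-word (bigram) tokenizer for the TermDocumentMatrix
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm_bi <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))

# Rebuild the Names/Freq table from the bigram matrix
Freq_bi <- sort(rowSums(as.matrix(tdm_bi)), decreasing = TRUE)
cr2     <- data.frame(Names = names(Freq_bi), Freq = Freq_bi, stringsAsFactors = FALSE)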

Cloud Graphical representation: Bigram

        set.seed(512)   
        wordcloud(cr2$Names, cr2$Freq, min.freq=200, colors=brewer.pal(6, "Dark2")) 

# Frequency Classification
        ggplot(head(cr2,30), aes(x=reorder(Names,-Freq), y=Freq)) +
        geom_bar(stat="Identity", fill="Blue2") +
        geom_text(aes(label=Freq), vjust = -0.5, size=2) +
        ggtitle("Bigrams frequency") +
        ylab("Frequency") +
        xlab("Term")+
        theme(axis.text.x=element_text(angle=60, hjust=1, size = 10))

Cloud Graphical representation: Monogram

        set.seed(512)   
        wordcloud(cr2$Names, cr2$Freq, min.freq=200, colors=brewer.pal(6, "Dark2")) 

# Frequency Classification
        ggplot(head(cr2,30), aes(x=reorder(Names,-Freq), y=Freq)) +
        geom_bar(stat="Identity", fill="Blue2") +
        geom_text(aes(label=Freq), vjust = -0.5, size=2) +
        ggtitle("Monograms frequency") +
        ylab("Frequency") +
        xlab("Term")+
        theme(axis.text.x=element_text(angle=60, hjust=1, size = 10))

Build the Model

Add Type variable to the matrix

        tdmMTX <-  t(data.matrix(tdm2)); tdm <- data.frame(tdmMTX, stringsAsFactors = F)
        TypetoTDM <- cbind(tdm, mydata1$Type)
        TypetoTDM[is.na(TypetoTDM)]<- 0
        names(TypetoTDM)[names(TypetoTDM)=="mydata1$Type"] <- "Type"

Set the system to use parallel processing

        library(doParallel)
        library(foreach)
        library(parallel);
        library(doSNOW)
        num_Cores=detectCores()-1
        cl <- makeCluster(num_Cores, type = "SOCK")
        registerDoSNOW(cl)
        
        newdata <- TypetoTDM

Create Training set

set.seed(512)
intrain <- createDataPartition(y = TypetoTDM$Type, p = 0.7, list = F)
training <- TypetoTDM[intrain,]
testing <- TypetoTDM[-intrain,]
dim(training); dim(testing)
## [1] 2624 8364
## [1] 1122 8364
# size of my dataset
print(object.size(training), units="Mb")  
## 168.4 Mb
  • As can be seen, we are analyzing a 168.4 Mb data-set!

Set Train Control parameters

Use 10-fold cross validation to evaluate the model

library(pROC)
library(gbm)

Recursive Partitioning and Regression Trees model

# Rattle Tree model
ctrlrpart <- trainControl(method="cv",
                          number = 10,
                          classProbs=TRUE,
                          savePredictions =TRUE)

library(rattle)
set.seed(512)
t3 <- Sys.time()

registerDoSNOW(cl)
        
rpart_fit <- train(Type~.,
                   data = training,
                   method = "rpart",
                   metric = "ROC",
                   maxdepth=8,
                   trControl = ctrlrpart,
                   tuneGrid = expand.grid(.cp=0.01))


t4 <- Sys.time()
trpart_fit<- difftime(t4,t3); trpart_fit
## Time difference of 2.592191 mins
save(rpart_fit, file = "rpart_fit.rda")
#registerDoSEQ();on.exit(stopCluster(cl))
# - rpart Testing
# Confusion Matrix
test_pred_rpart <- predict(rpart_fit,testing)
ConMatrix_rpart <-confusionMatrix(testing$Type,test_pred_rpart); ConMatrix_rpart
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DISC HACK INSD PHYS PORT STAT
##       DISC  116   40   51    1    1    0
##       HACK   25  237   20    1    5    2
##       INSD   19   12  102    0    8    1
##       PHYS   37   19   29   20   29    0
##       PORT   21   10   35    0  202   14
##       STAT    2   18   10    0    6   29
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6292          
##                  95% CI : (0.6002, 0.6576)
##     No Information Rate : 0.2995          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.535           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: DISC Class: HACK Class: INSD Class: PHYS
## Sensitivity               0.5273      0.7054     0.41296     0.90909
## Specificity               0.8969      0.9326     0.95429     0.89636
## Pos Pred Value            0.5550      0.8172     0.71831     0.14925
## Neg Pred Value            0.8861      0.8810     0.85204     0.99798
## Prevalence                0.1961      0.2995     0.22014     0.01961
## Detection Rate            0.1034      0.2112     0.09091     0.01783
## Detection Prevalence      0.1863      0.2585     0.12656     0.11943
## Balanced Accuracy         0.7121      0.8190     0.68362     0.90273
##                      Class: PORT Class: STAT
## Sensitivity               0.8048     0.63043
## Specificity               0.9082     0.96654
## Pos Pred Value            0.7163     0.44615
## Neg Pred Value            0.9417     0.98392
## Prevalence                0.2237     0.04100
## Detection Rate            0.1800     0.02585
## Detection Prevalence      0.2513     0.05793
## Balanced Accuracy         0.8565     0.79849
print(ConMatrix_rpart$overall)
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   6.292335e-01   5.349547e-01   6.002162e-01   6.575712e-01   2.994652e-01 
## AccuracyPValue  McnemarPValue 
##  6.454398e-115            NaN
print((ConMatrix_rpart$overall)[1]); print(trpart_fit)
##  Accuracy 
## 0.6292335
## Time difference of 2.711045 mins

Lasso and Elastic-Net Regularized Generalized Linear Models (GLMNET) Model

# glmnet model
ctrlCV <- trainControl(method="cv",   #Cross validation
                       number = 10,     #K-(number)
                       returnResamp = "all",
                       classProbs=TRUE, # should class probabilities be computed for classification
                       savePredictions =TRUE) 


library(glmnet)
set.seed(512)
t5 <- Sys.time()

registerDoSNOW(cl)
glmnetCV_fit <- train(Type~.,
                      data = training,
                      method = "glmnet",
                      metric = "ROC",
                      #preProc = c("center", "scale"),
                      trControl = ctrlCV,
                      tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15),
                                             .lambda = c((1:5)/10)))
t6 <- Sys.time()
tglmnet_fit<- difftime(t6,t5); tglmnet_fit
## Time difference of 39.68031 mins
save(glmnetCV_fit, file = "glmnetCV_fit.rda")

# - Testing
# Confusion Matrix
test_pred_glmnetCV <- predict(glmnetCV_fit,testing)
ConMatrix_glmnetCV <-confusionMatrix(testing$Type,test_pred_glmnetCV); ConMatrix_glmnetCV
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DISC HACK INSD PHYS PORT STAT
##       DISC  173   25    4    4    2    1
##       HACK    5  281    1    2    0    1
##       INSD    6   16  108    1   11    0
##       PHYS    6    3    8   92   23    2
##       PORT    2    3    0    1  271    5
##       STAT    3    3    1    2   37   19
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8414          
##                  95% CI : (0.8186, 0.8623)
##     No Information Rate : 0.3066          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7987          
##  Mcnemar's Test P-Value : 1.388e-13       
## 
## Statistics by Class:
## 
##                      Class: DISC Class: HACK Class: INSD Class: PHYS
## Sensitivity               0.8872      0.8489     0.88525     0.90196
## Specificity               0.9612      0.9886     0.96600     0.95882
## Pos Pred Value            0.8278      0.9690     0.76056     0.68657
## Neg Pred Value            0.9759      0.9399     0.98571     0.98988
## Prevalence                0.1738      0.2950     0.10873     0.09091
## Detection Rate            0.1542      0.2504     0.09626     0.08200
## Detection Prevalence      0.1863      0.2585     0.12656     0.11943
## Balanced Accuracy         0.9242      0.9188     0.92562     0.93039
##                      Class: PORT Class: STAT
## Sensitivity               0.7878     0.67857
## Specificity               0.9859     0.95795
## Pos Pred Value            0.9610     0.29231
## Neg Pred Value            0.9131     0.99149
## Prevalence                0.3066     0.02496
## Detection Rate            0.2415     0.01693
## Detection Prevalence      0.2513     0.05793
## Balanced Accuracy         0.8868     0.81826
print(ConMatrix_glmnetCV$overall)
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   8.413547e-01   7.987381e-01   8.186417e-01   8.622609e-01   3.065954e-01 
## AccuracyPValue  McnemarPValue 
##  5.104556e-302   1.387784e-13
print((ConMatrix_glmnetCV$overall)[1]); print(tglmnet_fit)
##  Accuracy 
## 0.8413547
## Time difference of 39.68031 mins

Close parallel clustering

  registerDoSEQ(); stopCluster(cl)  # return to sequential processing and stop the worker cluster
library(beepr)
  beep(4)

Conclusion