DATA 606 - Data Project

library(DT)
library(tidyverse)
library(dplyr)
library(stringr)
library(ggplot2)
library(gmodels)
library(kableExtra)

# load data
mydata <- "https://raw.githubusercontent.com/theoracley/Data606/master/security_breaches.csv"

cyber_sb <- read.csv(mydata, header=TRUE, stringsAsFactors = FALSE)

Part 1 - Introduction

In recent years, many big companies announced they had been the victims of the biggest data breach in history. Starting in 2003 when a group of hackers compromised a billion accounts. Since then, many companies came forward and stated they were compromised too.

When security breaches happen, they could happen using two types, that is hackers can be remotely hacking or on-premise (locally on the site) hacking.

In on-premise hacking, hackers may end up finding data from paper printers, physical hard drives, or social engineering to obtain usernames and passwords.

In remote hacking, breaches happen remotely and hackers use network vulnerability to gain access to all types of electronic data.

We would like to find out what method is preferrable by hackers; do they prefer to hack remotely or when they are on-premise(i.e. physically present at the hacking site).

Part 2 - Data

data.frame of cyber security breaches involving health care records of 500 or more humans reported to the U.S. Department of Health and Human Services (HHS) as of June 27, 2014.

DataSets Repository is here: All DataSets are here: https://vincentarelbundock.github.io/Rdatasets/datasets.html

Documentation for cyber security breaches is here: https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/breaches.html

Data for cyber security breaches is here: https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/breaches.csv

Per documentation above, here are the columns definitions and usage: - Number: integer record number in the HHS data base - Name_of_Covered_Entity: factor giving the name of the entity experiencing the breach - State: Factor giving the 2-letter code of the state where the breach occurred. This has 52 levels for the 50 states plus the District of Columbia (DC) and Puerto Rico (PR). - Business_Associate_Involved: Factor giving the name of a subcontractor (or blank) associated with the breach. - Individuals_Affected: integer number of humans whose records were compromised in the breach. This is 500 or greater; U.S. law requires reports of breaches involving 500 or more records but not of breaches involving fewer. - Date_of_Breach: character vector giving the date or date range of the breach. Recodes as Dates in breach_start and breach_end. - Type_of_Breach: factor with 29 levels giving the type of breach (e.g., “Theft” vs., “Unauthorized Access/Disclosure”, etc.) - Location_of_Breached_Information: factor with 41 levels coding the location from which the breach occurred (e.g., “Paper”, “Laptop”, etc.) - Date_Posted_or_Updated: Date the information was posted to the HHS data base or last updated. - Summary: character vector of a summary of the incident. - breach_start: Date of the start of the incident = first date given in Date_of_Breach above. - breach_end: Date of the end of the incident or NA if only one date is given in Date_of_Breach above. year integer giving the year of the breach

In this project, we will be using only the following fields (as other fields are irrelevant for our project).

EntityName
State
Year
BreachType
AffectedNumber

colnames(cyber_sb)

##  [1] "Number"                           "Name_of_Covered_Entity"          
##  [3] "State"                            "Business_Associate_Involved"     
##  [5] "Individuals_Affected"             "Date_of_Breach"                  
##  [7] "Type_of_Breach"                   "Location_of_Breached_Information"
##  [9] "Date_Posted_or_Updated"           "Summary"                         
## [11] "breach_start"                     "breach_end"                      
## [13] "year"

#take only the important columns, no need for the others who do not add any value.
cyber_sb_dataframe <- cyber_sb[c("Name_of_Covered_Entity","State","Individuals_Affected","Date_of_Breach","Type_of_Breach","Location_of_Breached_Information","breach_start","breach_end","year")]

# Display data
head(cyber_sb_dataframe) %>% kable() %>% kable_styling()

Name_of_Covered_Entity	State	Individuals_Affected	Date_of_Breach	Type_of_Breach	Location_of_Breached_Information	breach_start	breach_end	year
Brooke Army Medical Center	TX	1000	10/16/2009	Theft	Paper	10/16/2009	NA	2009
Mid America Kidney Stone Association, LLC	MO	1000	9/22/2009	Theft	Network Server	9/22/2009	NA	2009
Alaska Department of Health and Social Services	AK	501	10/12/2009	Theft	Other Portable Electronic Device, Other	10/12/2009	NA	2009
Health Services for Children with Special Needs, Inc.	DC	3800	10/9/2009	Loss	Laptop	10/9/2009	NA	2009
L. Douglas Carlson, M.D.	CA	5257	9/27/2009	Theft	Desktop Computer	9/27/2009	NA	2009
David I. Cohen, MD	CA	857	9/27/2009	Theft	Desktop Computer	9/27/2009	NA	2009

Examining data

If we look at the text contained in the Summary and Location_of_Breached_Information fields, we see that they contain words pertaining to Physical or Remote. Let’s create a vector that has all these types of words, and let’s call it “physical_types”.

# Data cleaning/extraction
cyber_sb_new <- cyber_sb

# List of keywords to extract physical/ remote variable
physical_types <- c("binder","transit","public transportation","desk","usb flash drive","cleaning crew","locked facility","tapes","paper","desktop","portable","disposal")

Extract Physical breaches from text from column “Summary”

cyber_sb_new$PhysicalBreach1 <- str_detect(cyber_sb_new$Summary, fixed(physical_types, ignore_case=TRUE))

Extract Physical breaches from text from column “Location_of_Breached_Information”

cyber_sb_new$PhysicalBreach2 <- str_detect(cyber_sb_new$Location_of_Breached_Information, fixed(physical_types, ignore_case=TRUE))

Combine both variable to one

cyber_sb_new$PhysicalBreach3 <- (cyber_sb_new$PhysicalBreach1 | cyber_sb_new$PhysicalBreach2)

Divide Individuals_Affected by 10000 for the charts

cyber_sb_new$Individuals_Affected <- (cyber_sb_new$Individuals_Affected/10000)

Extract columns

cyber_sb_new <- select(cyber_sb_new, c("State","Individuals_Affected","year","PhysicalBreach3") )
names(cyber_sb_new) <- c("State","AffectedNumber","Year","IsPhysical")

# New data
head(cyber_sb_new) %>% kable() %>% kable_styling()

State	AffectedNumber	Year	IsPhysical
TX	0.1000	2009	TRUE
MO	0.1000	2009	FALSE
AK	0.0501	2009	FALSE
DC	0.3800	2009	FALSE
CA	0.5257	2009	TRUE
CA	0.0857	2009	TRUE

Part 3 - Exploratory data analysis

Chi Squared Tests this!!

H0: (Null Hypothesis) : Physical breach is increasing (and not Remote)
HA: (Alternative Hypothesis) : Remote breach is increasing (and not Physical)

cyber_sb_new_r01 <- cyber_sb_new %>% dplyr::select(Year,IsPhysical) 

# Grouping
cyber_sb_new_rg01 <- cyber_sb_new_r01 %>% group_by(Year,IsPhysical) %>% summarize(n()) 

# Spreading
cyber_sb_new_rg01 <- spread(cyber_sb_new_rg01, Year,"n()")

# substitute NA with 0
cyber_sb_new_rg01[is.na(cyber_sb_new_rg01)] <- 0

cyber_sb_new_rg01 <- as.data.frame(t(cyber_sb_new_rg01))

cyber_sb_new_rg02 <- cbind(rownames(cyber_sb_new_rg01), cyber_sb_new_rg01)
rownames(cyber_sb_new_rg02) <- NULL
colnames(cyber_sb_new_rg02) <- c("Year","IsRemote", "IsPhysical")

cyber_sb_new_rg02 <- cyber_sb_new_rg02[!grepl("IsPhysical", cyber_sb_new_rg02$Year),]

rownames(cyber_sb_new_rg02) <- 1:nrow(cyber_sb_new_rg02)


head(cyber_sb_new_rg02) %>% kable() %>% kable_styling()

Year	IsRemote	IsPhysical
1997	1	0
2002	1	0
2003	1	0
2004	2	0
2005	2	0
2006	1	0

# Here comes Chi test
chisq.test(cyber_sb_new_rg02[,-1])

## 
##  Pearson's Chi-squared test
## 
## data:  cyber_sb_new_rg02[, -1]
## X-squared = 4.8025, df = 13, p-value = 0.9793

Our p-value is so small (less than 0.05), therefore we reject the null hypothesis, that is there is not enough evidence that Physical breach is increasing or on the rise, in favor of the alternatice that is Remote breach is increasing and not the physical.

summary(cyber_sb_new_rg02$IsPhysical)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.500   4.071   8.000  14.000

summary(cyber_sb_new_rg02$IsRemote)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    7.00   71.29  161.25  245.00

qqnorm(cyber_sb_new_rg02$IsPhysical)
qqline(cyber_sb_new_rg02$IsPhysical)

qqnorm(cyber_sb_new_rg02$IsRemote)
qqline(cyber_sb_new_rg02$IsRemote)

hist(cyber_sb_new_rg02$IsPhysical, main = "Physical Breaches", xlab = "Breach")

hist(cyber_sb_new_rg02$IsRemote, main = "Remote Breaches", xlab = "Breach")

plot(cyber_sb_new_rg02$IsPhysical, cyber_sb_new_rg02$Year, xlab = "Physical Breaches", ylab = "Year", col = 'darkblue')

plot(cyber_sb_new_rg02$IsRemote, cyber_sb_new_rg02$Year, xlab = "Remote Breaches", ylab = "Year", col = 'darkblue')

T-Test This!!

Assumption 1: Independent Samples
Assumption 2: Samples are drawn from population with equal variance
Assumption 3: Samples are drawn from the same population with a normal distribution

cyber_sb_new_rgrm02 <- select(cyber_sb_new_rg02, c("Year","IsRemote") )
cyber_sb_new_rgph02 <- select(cyber_sb_new_rg02, c("Year","IsPhysical") )

names(cyber_sb_new_rgrm02) <- c("Year","Count")
names(cyber_sb_new_rgph02) <- c("Year","Count")

cyber_sb_new_rgrm02$BreachType <- "Remote"
cyber_sb_new_rgph02$BreachType <- "Physical"

cyber_sb_new_rgrmph02 <- rbind(cyber_sb_new_rgrm02, cyber_sb_new_rgph02)

cyber_sb_new_rgrmph02mtx <- data.frame(as.matrix(cyber_sb_new_rgrmph02[,2:3]))

# Examine the data
boxplot(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType,ylab="Breach Count", main="Physical vs Remote Breach Count")

# Assumption 2
var.test(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType)

## 
##  F test to compare two variances
## 
## data:  as.numeric(cyber_sb_new_rgrmph02mtx$Count) by cyber_sb_new_rgrmph02mtx$BreachType
## F = 0.30981, num df = 13, denom df = 13, p-value = 0.04356
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.0994556 0.9650622
## sample estimates:
## ratio of variances 
##          0.3098077

Our p-value is so small (less than 0.05), therefore we reject the null hypothesis.

# Assumption 3
isRemote <- subset(cyber_sb_new_rgrmph02mtx, BreachType=="Remote")
isPhysical <- subset(cyber_sb_new_rgrmph02mtx, BreachType=="Physical")

shapiro.test(as.numeric(isRemote$Count))

## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(isRemote$Count)
## W = 0.82213, p-value = 0.009499

shapiro.test(as.numeric(isPhysical$Count))

## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(isPhysical$Count)
## W = 0.77676, p-value = 0.002612

qqnorm(as.numeric(isRemote$Count))
qqline(as.numeric(isRemote$Count))

qqnorm(as.numeric(isPhysical$Count))
qqline(as.numeric(isPhysical$Count))

t.test(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType)

## 
##  Welch Two Sample t-test
## 
## data:  as.numeric(cyber_sb_new_rgrmph02mtx$Count) by cyber_sb_new_rgrmph02mtx$BreachType
## t = -2.3591, df = 20.35, p-value = 0.02842
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4568845 -0.4002583
## sample estimates:
## mean in group Physical   mean in group Remote 
##               3.071429               6.500000

Our p-value is so small (less than 0.05), therefore we reject the null hypothesis and we accept the alternative.

Here comes the Cross Table

# 2-Way Cross Tabulation
CrossTable(cyber_sb_new_rg02$IsPhysical, cyber_sb_new_rg02$IsRemote)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14 
## 
##  
##                              | cyber_sb_new_rg02$IsRemote 
## cyber_sb_new_rg02$IsPhysical |         1 |         2 |        12 |        51 |        54 |       197 |       214 |       216 |       245 | Row Total | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            0 |         5 |         2 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         7 | 
##                              |     2.500 |     1.000 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |           | 
##                              |     0.714 |     0.286 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 | 
##                              |     1.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.357 |     0.143 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            1 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         0 |         1 | 
##                              |     0.357 |     0.143 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                              |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                              |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            2 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         1 | 
##                              |     0.357 |     0.143 |     0.071 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            5 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         1 | 
##                              |     0.357 |     0.143 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                              |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                              |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            9 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         1 |         1 | 
##                              |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |    12.071 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.071 | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           13 |         0 |         0 |         0 |         0 |         0 |         0 |         1 |         1 |         0 |         2 | 
##                              |     0.714 |     0.286 |     0.143 |     0.143 |     0.143 |     0.143 |     5.143 |     5.143 |     0.143 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 |     0.500 |     0.000 |     0.143 | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     1.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.071 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           14 |         0 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         1 | 
##                              |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 Column Total |         5 |         2 |         1 |         1 |         1 |         1 |         1 |         1 |         1 |        14 | 
##                              |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##

cyber_sb_004 <- cyber_sb_new %>% dplyr::select(State,Year,IsPhysical)

cyber_sb_a_004 <- cyber_sb_004 %>%
  group_by(State,IsPhysical) %>%
  summarize(n())

cyber_sb_a1_004 <- spread(cyber_sb_a_004, State,"n()")

cyber_sb_a1_004 <- as.data.frame(t(cyber_sb_a1_004))
cyber_sb_a1_004[is.na(cyber_sb_a1_004)] <- 0


names(cyber_sb_a1_004) <- c("IsRemote", "IsPhysical")
barplot(as.matrix(cyber_sb_a1_004), beside = TRUE)

chisq.test(cyber_sb_a1_004[,-1])

## Warning in chisq.test(cyber_sb_a1_004[, -1]): Chi-squared approximation may
## be incorrect

## 
##  Chi-squared test for given probabilities
## 
## data:  cyber_sb_a1_004[, -1]
## X-squared = 132.07, df = 52, p-value = 6.697e-09

Heat map This!!!

ggplot(cyber_sb_new, aes(Year, State)) + geom_tile(aes(fill = AffectedNumber), colour = "white") + scale_fill_gradient(low = "steelblue", high = "red") + theme(text = element_text(size=7), axis.text.x = element_text(angle=90, hjust=1)) + ylab("State") + xlab("Year") + ggtitle("Heat Map - Breaches by State per Year") + theme(plot.title = element_text(hjust = 0.5))

cyber_sb_new_physical <- cyber_sb_new %>% filter(IsPhysical == TRUE)

ggplot(cyber_sb_new_physical, aes(Year, State)) + geom_tile(aes(fill = AffectedNumber), colour = "white") + scale_fill_gradient(low = "steelblue", high = "red") + theme(text = element_text(size=7), axis.text.x = element_text(angle=90, hjust=1)) + ylab("State") + xlab("Year") + ggtitle("Heat Map - Physical Breaches by State per Year") + theme(plot.title = element_text(hjust = 0.5))

cyber_sb_new_remote <- cyber_sb_new %>% filter(IsPhysical == FALSE)

ggplot(cyber_sb_new_remote, aes(Year, State)) + geom_tile(aes(fill = AffectedNumber), colour = "white") + scale_fill_gradient(low = "steelblue", high = "red") + theme(text = element_text(size=7), axis.text.x = element_text(angle=90, hjust=1)) + ylab("State") + xlab("Year") + ggtitle("Heat Map - Remote Breaches by State per Year") + theme(plot.title = element_text(hjust = 0.5))

ggplot(cyber_sb_new, aes(Year, AffectedNumber)) + geom_count(col="tomato3", show.legend=F)

al_mean <- mean(cyber_sb_new$AffectedNumber)
al_sd <- sd(cyber_sb_new$AffectedNumber)

#Where is my distribution?
ggplot(data = cyber_sb_new, aes(cyber_sb_new$AffectedNumber)) +
  stat_function(fun = dnorm, n = 101, args = list(mean=al_mean, sd=al_sd)) + ylab("") + xlab("Count") +
  scale_y_continuous(breaks = NULL) + ggtitle("Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

summary(cyber_sb_new$AffectedNumber)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0500   0.1000   0.2300   3.0262   0.6941 490.0000

summary(cyber_sb_new$IsPhysical)

##    Mode   FALSE    TRUE 
## logical     998      57

Part 4 - Inference

We infer the following:

We can clearly see that the breached occurred after 2009. Before that only a small amount of breached occurred, but not significant.
The breaches happened in all the states and not particular states.
Hackers prefer to hack remotely instead of on-premise.

Part 5 - Conclusion

We fist started by looking at weither Remote IT Security is increasing or not. We started with a null hypothesis (that is Physical breach is increasing) and we proved that the Null hypothesis needs to be rejected in favor of the alternative (Remote breaches is increasing), that is Physical breaches is not increasing, but rather the remote breaches is the one that is increasing. We determined that through the p-value and it’s statistical calculations. Therefore enterprises need to do more and put more resources into strengthing their Remote IT Security.