library(DT)
library(tidyverse)
library(dplyr)
library(stringr)
library(ggplot2)
library(gmodels)
library(kableExtra)

# load data
mydata <- "https://raw.githubusercontent.com/theoracley/Data606/master/security_breaches.csv"

cyber_sb <- read.csv(mydata, header=TRUE, stringsAsFactors = FALSE) 

Part 1 - Introduction

In recent years, many big companies announced they had been the victims of the biggest data breach in history. Starting in 2003 when a group of hackers compromised a billion accounts. Since then, many companies came forward and stated they were compromised too.

When security breaches happen, they could happen using two types, that is hackers can be remotely hacking or on-premise (locally on the site) hacking.

In on-premise hacking, hackers may end up finding data from paper printers, physical hard drives, or social engineering to obtain usernames and passwords.

In remote hacking, breaches happen remotely and hackers use network vulnerability to gain access to all types of electronic data.

We would like to find out what method is preferrable by hackers; do they prefer to hack remotely or when they are on-premise(i.e. physically present at the hacking site).

Part 2 - Data

data.frame of cyber security breaches involving health care records of 500 or more humans reported to the U.S. Department of Health and Human Services (HHS) as of June 27, 2014.

DataSets Repository is here: All DataSets are here: https://vincentarelbundock.github.io/Rdatasets/datasets.html

Documentation for cyber security breaches is here: https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/breaches.html

Data for cyber security breaches is here: https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/breaches.csv

Per documentation above, here are the columns definitions and usage: - Number: integer record number in the HHS data base - Name_of_Covered_Entity: factor giving the name of the entity experiencing the breach - State: Factor giving the 2-letter code of the state where the breach occurred. This has 52 levels for the 50 states plus the District of Columbia (DC) and Puerto Rico (PR). - Business_Associate_Involved: Factor giving the name of a subcontractor (or blank) associated with the breach. - Individuals_Affected: integer number of humans whose records were compromised in the breach. This is 500 or greater; U.S. law requires reports of breaches involving 500 or more records but not of breaches involving fewer. - Date_of_Breach: character vector giving the date or date range of the breach. Recodes as Dates in breach_start and breach_end. - Type_of_Breach: factor with 29 levels giving the type of breach (e.g., “Theft” vs., “Unauthorized Access/Disclosure”, etc.) - Location_of_Breached_Information: factor with 41 levels coding the location from which the breach occurred (e.g., “Paper”, “Laptop”, etc.) - Date_Posted_or_Updated: Date the information was posted to the HHS data base or last updated. - Summary: character vector of a summary of the incident. - breach_start: Date of the start of the incident = first date given in Date_of_Breach above. - breach_end: Date of the end of the incident or NA if only one date is given in Date_of_Breach above. year integer giving the year of the breach

In this project, we will be using only the following fields (as other fields are irrelevant for our project).

  • EntityName
  • State
  • Year
  • BreachType
  • AffectedNumber
colnames(cyber_sb)
##  [1] "Number"                           "Name_of_Covered_Entity"          
##  [3] "State"                            "Business_Associate_Involved"     
##  [5] "Individuals_Affected"             "Date_of_Breach"                  
##  [7] "Type_of_Breach"                   "Location_of_Breached_Information"
##  [9] "Date_Posted_or_Updated"           "Summary"                         
## [11] "breach_start"                     "breach_end"                      
## [13] "year"
#take only the important columns, no need for the others who do not add any value.
cyber_sb_dataframe <- cyber_sb[c("Name_of_Covered_Entity","State","Individuals_Affected","Date_of_Breach","Type_of_Breach","Location_of_Breached_Information","breach_start","breach_end","year")]

# Display data
head(cyber_sb_dataframe) %>% kable() %>% kable_styling()
Name_of_Covered_Entity State Individuals_Affected Date_of_Breach Type_of_Breach Location_of_Breached_Information breach_start breach_end year
Brooke Army Medical Center TX 1000 10/16/2009 Theft Paper 10/16/2009 NA 2009
Mid America Kidney Stone Association, LLC MO 1000 9/22/2009 Theft Network Server 9/22/2009 NA 2009
Alaska Department of Health and Social Services AK 501 10/12/2009 Theft Other Portable Electronic Device, Other 10/12/2009 NA 2009
Health Services for Children with Special Needs, Inc.  DC 3800 10/9/2009 Loss Laptop 10/9/2009 NA 2009
L. Douglas Carlson, M.D.  CA 5257 9/27/2009 Theft Desktop Computer 9/27/2009 NA 2009
David I. Cohen, MD CA 857 9/27/2009 Theft Desktop Computer 9/27/2009 NA 2009

Examining data

If we look at the text contained in the Summary and Location_of_Breached_Information fields, we see that they contain words pertaining to Physical or Remote. Let’s create a vector that has all these types of words, and let’s call it “physical_types”.

# Data cleaning/extraction
cyber_sb_new <- cyber_sb

# List of keywords to extract physical/ remote variable
physical_types <- c("binder","transit","public transportation","desk","usb flash drive","cleaning crew","locked facility","tapes","paper","desktop","portable","disposal")

Extract Physical breaches from text from column “Summary”

cyber_sb_new$PhysicalBreach1 <- str_detect(cyber_sb_new$Summary, fixed(physical_types, ignore_case=TRUE))

Extract Physical breaches from text from column “Location_of_Breached_Information”

cyber_sb_new$PhysicalBreach2 <- str_detect(cyber_sb_new$Location_of_Breached_Information, fixed(physical_types, ignore_case=TRUE))

Combine both variable to one

cyber_sb_new$PhysicalBreach3 <- (cyber_sb_new$PhysicalBreach1 | cyber_sb_new$PhysicalBreach2)

Divide Individuals_Affected by 10000 for the charts

cyber_sb_new$Individuals_Affected <- (cyber_sb_new$Individuals_Affected/10000)

Extract columns

cyber_sb_new <- select(cyber_sb_new, c("State","Individuals_Affected","year","PhysicalBreach3") )
names(cyber_sb_new) <- c("State","AffectedNumber","Year","IsPhysical")

# New data
head(cyber_sb_new) %>% kable() %>% kable_styling()
State AffectedNumber Year IsPhysical
TX 0.1000 2009 TRUE
MO 0.1000 2009 FALSE
AK 0.0501 2009 FALSE
DC 0.3800 2009 FALSE
CA 0.5257 2009 TRUE
CA 0.0857 2009 TRUE

Part 3 - Exploratory data analysis

Chi Squared Tests this!!

  • H0: (Null Hypothesis) : Physical breach is increasing (and not Remote)
  • HA: (Alternative Hypothesis) : Remote breach is increasing (and not Physical)
cyber_sb_new_r01 <- cyber_sb_new %>% dplyr::select(Year,IsPhysical) 

# Grouping
cyber_sb_new_rg01 <- cyber_sb_new_r01 %>% group_by(Year,IsPhysical) %>% summarize(n()) 

# Spreading
cyber_sb_new_rg01 <- spread(cyber_sb_new_rg01, Year,"n()")

# substitute NA with 0
cyber_sb_new_rg01[is.na(cyber_sb_new_rg01)] <- 0

cyber_sb_new_rg01 <- as.data.frame(t(cyber_sb_new_rg01))

cyber_sb_new_rg02 <- cbind(rownames(cyber_sb_new_rg01), cyber_sb_new_rg01)
rownames(cyber_sb_new_rg02) <- NULL
colnames(cyber_sb_new_rg02) <- c("Year","IsRemote", "IsPhysical")

cyber_sb_new_rg02 <- cyber_sb_new_rg02[!grepl("IsPhysical", cyber_sb_new_rg02$Year),]

rownames(cyber_sb_new_rg02) <- 1:nrow(cyber_sb_new_rg02)


head(cyber_sb_new_rg02) %>% kable() %>% kable_styling()
Year IsRemote IsPhysical
1997 1 0
2002 1 0
2003 1 0
2004 2 0
2005 2 0
2006 1 0
# Here comes Chi test
chisq.test(cyber_sb_new_rg02[,-1])
## 
##  Pearson's Chi-squared test
## 
## data:  cyber_sb_new_rg02[, -1]
## X-squared = 4.8025, df = 13, p-value = 0.9793

Our p-value is so small (less than 0.05), therefore we reject the null hypothesis, that is there is not enough evidence that Physical breach is increasing or on the rise, in favor of the alternatice that is Remote breach is increasing and not the physical.

summary(cyber_sb_new_rg02$IsPhysical)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.500   4.071   8.000  14.000
summary(cyber_sb_new_rg02$IsRemote)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    7.00   71.29  161.25  245.00
qqnorm(cyber_sb_new_rg02$IsPhysical)
qqline(cyber_sb_new_rg02$IsPhysical)

qqnorm(cyber_sb_new_rg02$IsRemote)
qqline(cyber_sb_new_rg02$IsRemote)

hist(cyber_sb_new_rg02$IsPhysical, main = "Physical Breaches", xlab = "Breach")

hist(cyber_sb_new_rg02$IsRemote, main = "Remote Breaches", xlab = "Breach")

plot(cyber_sb_new_rg02$IsPhysical, cyber_sb_new_rg02$Year, xlab = "Physical Breaches", ylab = "Year", col = 'darkblue')

plot(cyber_sb_new_rg02$IsRemote, cyber_sb_new_rg02$Year, xlab = "Remote Breaches", ylab = "Year", col = 'darkblue')

T-Test This!!

  • Assumption 1: Independent Samples
  • Assumption 2: Samples are drawn from population with equal variance
  • Assumption 3: Samples are drawn from the same population with a normal distribution
cyber_sb_new_rgrm02 <- select(cyber_sb_new_rg02, c("Year","IsRemote") )
cyber_sb_new_rgph02 <- select(cyber_sb_new_rg02, c("Year","IsPhysical") )

names(cyber_sb_new_rgrm02) <- c("Year","Count")
names(cyber_sb_new_rgph02) <- c("Year","Count")

cyber_sb_new_rgrm02$BreachType <- "Remote"
cyber_sb_new_rgph02$BreachType <- "Physical"

cyber_sb_new_rgrmph02 <- rbind(cyber_sb_new_rgrm02, cyber_sb_new_rgph02)

cyber_sb_new_rgrmph02mtx <- data.frame(as.matrix(cyber_sb_new_rgrmph02[,2:3]))

# Examine the data
boxplot(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType,ylab="Breach Count", main="Physical vs Remote Breach Count")

# Assumption 2
var.test(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType)
## 
##  F test to compare two variances
## 
## data:  as.numeric(cyber_sb_new_rgrmph02mtx$Count) by cyber_sb_new_rgrmph02mtx$BreachType
## F = 0.30981, num df = 13, denom df = 13, p-value = 0.04356
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.0994556 0.9650622
## sample estimates:
## ratio of variances 
##          0.3098077

Our p-value is so small (less than 0.05), therefore we reject the null hypothesis.

# Assumption 3
isRemote <- subset(cyber_sb_new_rgrmph02mtx, BreachType=="Remote")
isPhysical <- subset(cyber_sb_new_rgrmph02mtx, BreachType=="Physical")

shapiro.test(as.numeric(isRemote$Count))
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(isRemote$Count)
## W = 0.82213, p-value = 0.009499
shapiro.test(as.numeric(isPhysical$Count))
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(isPhysical$Count)
## W = 0.77676, p-value = 0.002612
qqnorm(as.numeric(isRemote$Count))
qqline(as.numeric(isRemote$Count))

qqnorm(as.numeric(isPhysical$Count))
qqline(as.numeric(isPhysical$Count))

t.test(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType)
## 
##  Welch Two Sample t-test
## 
## data:  as.numeric(cyber_sb_new_rgrmph02mtx$Count) by cyber_sb_new_rgrmph02mtx$BreachType
## t = -2.3591, df = 20.35, p-value = 0.02842
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4568845 -0.4002583
## sample estimates:
## mean in group Physical   mean in group Remote 
##               3.071429               6.500000

Our p-value is so small (less than 0.05), therefore we reject the null hypothesis and we accept the alternative.

Here comes the Cross Table

# 2-Way Cross Tabulation
CrossTable(cyber_sb_new_rg02$IsPhysical, cyber_sb_new_rg02$IsRemote)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14 
## 
##  
##                              | cyber_sb_new_rg02$IsRemote 
## cyber_sb_new_rg02$IsPhysical |         1 |         2 |        12 |        51 |        54 |       197 |       214 |       216 |       245 | Row Total | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            0 |         5 |         2 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         7 | 
##                              |     2.500 |     1.000 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |           | 
##                              |     0.714 |     0.286 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 | 
##                              |     1.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.357 |     0.143 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            1 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         0 |         1 | 
##                              |     0.357 |     0.143 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                              |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                              |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            2 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         1 | 
##                              |     0.357 |     0.143 |     0.071 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            5 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         1 | 
##                              |     0.357 |     0.143 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                              |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                              |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                            9 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         1 |         1 | 
##                              |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |    12.071 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.071 | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           13 |         0 |         0 |         0 |         0 |         0 |         0 |         1 |         1 |         0 |         2 | 
##                              |     0.714 |     0.286 |     0.143 |     0.143 |     0.143 |     0.143 |     5.143 |     5.143 |     0.143 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 |     0.500 |     0.000 |     0.143 | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     1.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.071 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                           14 |         0 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         1 | 
##                              |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |           | 
##                              |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                 Column Total |         5 |         2 |         1 |         1 |         1 |         1 |         1 |         1 |         1 |        14 | 
##                              |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
cyber_sb_004 <- cyber_sb_new %>% dplyr::select(State,Year,IsPhysical)

cyber_sb_a_004 <- cyber_sb_004 %>%
  group_by(State,IsPhysical) %>%
  summarize(n())

cyber_sb_a1_004 <- spread(cyber_sb_a_004, State,"n()")

cyber_sb_a1_004 <- as.data.frame(t(cyber_sb_a1_004))
cyber_sb_a1_004[is.na(cyber_sb_a1_004)] <- 0


names(cyber_sb_a1_004) <- c("IsRemote", "IsPhysical")
barplot(as.matrix(cyber_sb_a1_004), beside = TRUE)