library(DT)
library(tidyverse)
library(dplyr)
library(stringr)
library(ggplot2)
library(gmodels)
library(kableExtra)
# load data
mydata <- "https://raw.githubusercontent.com/theoracley/Data606/master/security_breaches.csv"
cyber_sb <- read.csv(mydata, header=TRUE, stringsAsFactors = FALSE)
In recent years, many big companies announced they had been the victims of the biggest data breach in history. Starting in 2003 when a group of hackers compromised a billion accounts. Since then, many companies came forward and stated they were compromised too.
When security breaches happen, they could happen using two types, that is hackers can be remotely hacking or on-premise (locally on the site) hacking.
In on-premise hacking, hackers may end up finding data from paper printers, physical hard drives, or social engineering to obtain usernames and passwords.
In remote hacking, breaches happen remotely and hackers use network vulnerability to gain access to all types of electronic data.
We would like to find out what method is preferrable by hackers; do they prefer to hack remotely or when they are on-premise(i.e. physically present at the hacking site).
data.frame of cyber security breaches involving health care records of 500 or more humans reported to the U.S. Department of Health and Human Services (HHS) as of June 27, 2014.
DataSets Repository is here: All DataSets are here: https://vincentarelbundock.github.io/Rdatasets/datasets.html
Documentation for cyber security breaches is here: https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/breaches.html
Data for cyber security breaches is here: https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/breaches.csv
Per documentation above, here are the columns definitions and usage: - Number: integer record number in the HHS data base - Name_of_Covered_Entity: factor giving the name of the entity experiencing the breach - State: Factor giving the 2-letter code of the state where the breach occurred. This has 52 levels for the 50 states plus the District of Columbia (DC) and Puerto Rico (PR). - Business_Associate_Involved: Factor giving the name of a subcontractor (or blank) associated with the breach. - Individuals_Affected: integer number of humans whose records were compromised in the breach. This is 500 or greater; U.S. law requires reports of breaches involving 500 or more records but not of breaches involving fewer. - Date_of_Breach: character vector giving the date or date range of the breach. Recodes as Dates in breach_start and breach_end. - Type_of_Breach: factor with 29 levels giving the type of breach (e.g., “Theft” vs., “Unauthorized Access/Disclosure”, etc.) - Location_of_Breached_Information: factor with 41 levels coding the location from which the breach occurred (e.g., “Paper”, “Laptop”, etc.) - Date_Posted_or_Updated: Date the information was posted to the HHS data base or last updated. - Summary: character vector of a summary of the incident. - breach_start: Date of the start of the incident = first date given in Date_of_Breach above. - breach_end: Date of the end of the incident or NA if only one date is given in Date_of_Breach above. year integer giving the year of the breach
In this project, we will be using only the following fields (as other fields are irrelevant for our project).
colnames(cyber_sb)
## [1] "Number" "Name_of_Covered_Entity"
## [3] "State" "Business_Associate_Involved"
## [5] "Individuals_Affected" "Date_of_Breach"
## [7] "Type_of_Breach" "Location_of_Breached_Information"
## [9] "Date_Posted_or_Updated" "Summary"
## [11] "breach_start" "breach_end"
## [13] "year"
#take only the important columns, no need for the others who do not add any value.
cyber_sb_dataframe <- cyber_sb[c("Name_of_Covered_Entity","State","Individuals_Affected","Date_of_Breach","Type_of_Breach","Location_of_Breached_Information","breach_start","breach_end","year")]
# Display data
head(cyber_sb_dataframe) %>% kable() %>% kable_styling()
Name_of_Covered_Entity | State | Individuals_Affected | Date_of_Breach | Type_of_Breach | Location_of_Breached_Information | breach_start | breach_end | year |
---|---|---|---|---|---|---|---|---|
Brooke Army Medical Center | TX | 1000 | 10/16/2009 | Theft | Paper | 10/16/2009 | NA | 2009 |
Mid America Kidney Stone Association, LLC | MO | 1000 | 9/22/2009 | Theft | Network Server | 9/22/2009 | NA | 2009 |
Alaska Department of Health and Social Services | AK | 501 | 10/12/2009 | Theft | Other Portable Electronic Device, Other | 10/12/2009 | NA | 2009 |
Health Services for Children with Special Needs, Inc. | DC | 3800 | 10/9/2009 | Loss | Laptop | 10/9/2009 | NA | 2009 |
L. Douglas Carlson, M.D. | CA | 5257 | 9/27/2009 | Theft | Desktop Computer | 9/27/2009 | NA | 2009 |
David I. Cohen, MD | CA | 857 | 9/27/2009 | Theft | Desktop Computer | 9/27/2009 | NA | 2009 |
Examining data
If we look at the text contained in the Summary and Location_of_Breached_Information fields, we see that they contain words pertaining to Physical or Remote. Let’s create a vector that has all these types of words, and let’s call it “physical_types”.
# Data cleaning/extraction
cyber_sb_new <- cyber_sb
# List of keywords to extract physical/ remote variable
physical_types <- c("binder","transit","public transportation","desk","usb flash drive","cleaning crew","locked facility","tapes","paper","desktop","portable","disposal")
Extract Physical breaches from text from column “Summary”
cyber_sb_new$PhysicalBreach1 <- str_detect(cyber_sb_new$Summary, fixed(physical_types, ignore_case=TRUE))
Extract Physical breaches from text from column “Location_of_Breached_Information”
cyber_sb_new$PhysicalBreach2 <- str_detect(cyber_sb_new$Location_of_Breached_Information, fixed(physical_types, ignore_case=TRUE))
Combine both variable to one
cyber_sb_new$PhysicalBreach3 <- (cyber_sb_new$PhysicalBreach1 | cyber_sb_new$PhysicalBreach2)
Divide Individuals_Affected by 10000 for the charts
cyber_sb_new$Individuals_Affected <- (cyber_sb_new$Individuals_Affected/10000)
Extract columns
cyber_sb_new <- select(cyber_sb_new, c("State","Individuals_Affected","year","PhysicalBreach3") )
names(cyber_sb_new) <- c("State","AffectedNumber","Year","IsPhysical")
# New data
head(cyber_sb_new) %>% kable() %>% kable_styling()
State | AffectedNumber | Year | IsPhysical |
---|---|---|---|
TX | 0.1000 | 2009 | TRUE |
MO | 0.1000 | 2009 | FALSE |
AK | 0.0501 | 2009 | FALSE |
DC | 0.3800 | 2009 | FALSE |
CA | 0.5257 | 2009 | TRUE |
CA | 0.0857 | 2009 | TRUE |
cyber_sb_new_r01 <- cyber_sb_new %>% dplyr::select(Year,IsPhysical)
# Grouping
cyber_sb_new_rg01 <- cyber_sb_new_r01 %>% group_by(Year,IsPhysical) %>% summarize(n())
# Spreading
cyber_sb_new_rg01 <- spread(cyber_sb_new_rg01, Year,"n()")
# substitute NA with 0
cyber_sb_new_rg01[is.na(cyber_sb_new_rg01)] <- 0
cyber_sb_new_rg01 <- as.data.frame(t(cyber_sb_new_rg01))
cyber_sb_new_rg02 <- cbind(rownames(cyber_sb_new_rg01), cyber_sb_new_rg01)
rownames(cyber_sb_new_rg02) <- NULL
colnames(cyber_sb_new_rg02) <- c("Year","IsRemote", "IsPhysical")
cyber_sb_new_rg02 <- cyber_sb_new_rg02[!grepl("IsPhysical", cyber_sb_new_rg02$Year),]
rownames(cyber_sb_new_rg02) <- 1:nrow(cyber_sb_new_rg02)
head(cyber_sb_new_rg02) %>% kable() %>% kable_styling()
Year | IsRemote | IsPhysical |
---|---|---|
1997 | 1 | 0 |
2002 | 1 | 0 |
2003 | 1 | 0 |
2004 | 2 | 0 |
2005 | 2 | 0 |
2006 | 1 | 0 |
# Here comes Chi test
chisq.test(cyber_sb_new_rg02[,-1])
##
## Pearson's Chi-squared test
##
## data: cyber_sb_new_rg02[, -1]
## X-squared = 4.8025, df = 13, p-value = 0.9793
Our p-value is so small (less than 0.05), therefore we reject the null hypothesis, that is there is not enough evidence that Physical breach is increasing or on the rise, in favor of the alternatice that is Remote breach is increasing and not the physical.
summary(cyber_sb_new_rg02$IsPhysical)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.500 4.071 8.000 14.000
summary(cyber_sb_new_rg02$IsRemote)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 7.00 71.29 161.25 245.00
qqnorm(cyber_sb_new_rg02$IsPhysical)
qqline(cyber_sb_new_rg02$IsPhysical)
qqnorm(cyber_sb_new_rg02$IsRemote)
qqline(cyber_sb_new_rg02$IsRemote)
hist(cyber_sb_new_rg02$IsPhysical, main = "Physical Breaches", xlab = "Breach")
hist(cyber_sb_new_rg02$IsRemote, main = "Remote Breaches", xlab = "Breach")
plot(cyber_sb_new_rg02$IsPhysical, cyber_sb_new_rg02$Year, xlab = "Physical Breaches", ylab = "Year", col = 'darkblue')
plot(cyber_sb_new_rg02$IsRemote, cyber_sb_new_rg02$Year, xlab = "Remote Breaches", ylab = "Year", col = 'darkblue')
cyber_sb_new_rgrm02 <- select(cyber_sb_new_rg02, c("Year","IsRemote") )
cyber_sb_new_rgph02 <- select(cyber_sb_new_rg02, c("Year","IsPhysical") )
names(cyber_sb_new_rgrm02) <- c("Year","Count")
names(cyber_sb_new_rgph02) <- c("Year","Count")
cyber_sb_new_rgrm02$BreachType <- "Remote"
cyber_sb_new_rgph02$BreachType <- "Physical"
cyber_sb_new_rgrmph02 <- rbind(cyber_sb_new_rgrm02, cyber_sb_new_rgph02)
cyber_sb_new_rgrmph02mtx <- data.frame(as.matrix(cyber_sb_new_rgrmph02[,2:3]))
# Examine the data
boxplot(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType,ylab="Breach Count", main="Physical vs Remote Breach Count")
# Assumption 2
var.test(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType)
##
## F test to compare two variances
##
## data: as.numeric(cyber_sb_new_rgrmph02mtx$Count) by cyber_sb_new_rgrmph02mtx$BreachType
## F = 0.30981, num df = 13, denom df = 13, p-value = 0.04356
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.0994556 0.9650622
## sample estimates:
## ratio of variances
## 0.3098077
Our p-value is so small (less than 0.05), therefore we reject the null hypothesis.
# Assumption 3
isRemote <- subset(cyber_sb_new_rgrmph02mtx, BreachType=="Remote")
isPhysical <- subset(cyber_sb_new_rgrmph02mtx, BreachType=="Physical")
shapiro.test(as.numeric(isRemote$Count))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(isRemote$Count)
## W = 0.82213, p-value = 0.009499
shapiro.test(as.numeric(isPhysical$Count))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(isPhysical$Count)
## W = 0.77676, p-value = 0.002612
qqnorm(as.numeric(isRemote$Count))
qqline(as.numeric(isRemote$Count))
qqnorm(as.numeric(isPhysical$Count))
qqline(as.numeric(isPhysical$Count))
t.test(as.numeric(cyber_sb_new_rgrmph02mtx$Count)~cyber_sb_new_rgrmph02mtx$BreachType)
##
## Welch Two Sample t-test
##
## data: as.numeric(cyber_sb_new_rgrmph02mtx$Count) by cyber_sb_new_rgrmph02mtx$BreachType
## t = -2.3591, df = 20.35, p-value = 0.02842
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.4568845 -0.4002583
## sample estimates:
## mean in group Physical mean in group Remote
## 3.071429 6.500000
Our p-value is so small (less than 0.05), therefore we reject the null hypothesis and we accept the alternative.
# 2-Way Cross Tabulation
CrossTable(cyber_sb_new_rg02$IsPhysical, cyber_sb_new_rg02$IsRemote)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 14
##
##
## | cyber_sb_new_rg02$IsRemote
## cyber_sb_new_rg02$IsPhysical | 1 | 2 | 12 | 51 | 54 | 197 | 214 | 216 | 245 | Row Total |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
## | 2.500 | 1.000 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | |
## | 0.714 | 0.286 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 |
## | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.357 | 0.143 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
## | 0.357 | 0.143 | 12.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.071 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.357 | 0.143 | 0.071 | 0.071 | 12.071 | 0.071 | 0.071 | 0.071 | 0.071 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
## | 0.357 | 0.143 | 0.071 | 12.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.071 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
## | 0.357 | 0.143 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 12.071 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.071 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 13 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 |
## | 0.714 | 0.286 | 0.143 | 0.143 | 0.143 | 0.143 | 5.143 | 5.143 | 0.143 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 | 0.500 | 0.000 | 0.143 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | 0.071 | 0.000 | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 14 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
## | 0.357 | 0.143 | 0.071 | 0.071 | 0.071 | 12.071 | 0.071 | 0.071 | 0.071 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.071 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | 0.000 | 0.000 | 0.000 | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 5 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 14 |
## | 0.357 | 0.143 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | |
## -----------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
cyber_sb_004 <- cyber_sb_new %>% dplyr::select(State,Year,IsPhysical)
cyber_sb_a_004 <- cyber_sb_004 %>%
group_by(State,IsPhysical) %>%
summarize(n())
cyber_sb_a1_004 <- spread(cyber_sb_a_004, State,"n()")
cyber_sb_a1_004 <- as.data.frame(t(cyber_sb_a1_004))
cyber_sb_a1_004[is.na(cyber_sb_a1_004)] <- 0
names(cyber_sb_a1_004) <- c("IsRemote", "IsPhysical")
barplot(as.matrix(cyber_sb_a1_004), beside = TRUE)
chisq.test(cyber_sb_a1_004[,-1])
## Warning in chisq.test(cyber_sb_a1_004[, -1]): Chi-squared approximation may
## be incorrect
##
## Chi-squared test for given probabilities
##
## data: cyber_sb_a1_004[, -1]
## X-squared = 132.07, df = 52, p-value = 6.697e-09
ggplot(cyber_sb_new, aes(Year, State)) + geom_tile(aes(fill = AffectedNumber), colour = "white") + scale_fill_gradient(low = "steelblue", high = "red") + theme(text = element_text(size=7), axis.text.x = element_text(angle=90, hjust=1)) + ylab("State") + xlab("Year") + ggtitle("Heat Map - Breaches by State per Year") + theme(plot.title = element_text(hjust = 0.5))
cyber_sb_new_physical <- cyber_sb_new %>% filter(IsPhysical == TRUE)
ggplot(cyber_sb_new_physical, aes(Year, State)) + geom_tile(aes(fill = AffectedNumber), colour = "white") + scale_fill_gradient(low = "steelblue", high = "red") + theme(text = element_text(size=7), axis.text.x = element_text(angle=90, hjust=1)) + ylab("State") + xlab("Year") + ggtitle("Heat Map - Physical Breaches by State per Year") + theme(plot.title = element_text(hjust = 0.5))
cyber_sb_new_remote <- cyber_sb_new %>% filter(IsPhysical == FALSE)
ggplot(cyber_sb_new_remote, aes(Year, State)) + geom_tile(aes(fill = AffectedNumber), colour = "white") + scale_fill_gradient(low = "steelblue", high = "red") + theme(text = element_text(size=7), axis.text.x = element_text(angle=90, hjust=1)) + ylab("State") + xlab("Year") + ggtitle("Heat Map - Remote Breaches by State per Year") + theme(plot.title = element_text(hjust = 0.5))
ggplot(cyber_sb_new, aes(Year, AffectedNumber)) + geom_count(col="tomato3", show.legend=F)
al_mean <- mean(cyber_sb_new$AffectedNumber)
al_sd <- sd(cyber_sb_new$AffectedNumber)
#Where is my distribution?
ggplot(data = cyber_sb_new, aes(cyber_sb_new$AffectedNumber)) +
stat_function(fun = dnorm, n = 101, args = list(mean=al_mean, sd=al_sd)) + ylab("") + xlab("Count") +
scale_y_continuous(breaks = NULL) + ggtitle("Distribution") +
theme(plot.title = element_text(hjust = 0.5))
summary(cyber_sb_new$AffectedNumber)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1000 0.2300 3.0262 0.6941 490.0000
summary(cyber_sb_new$IsPhysical)
## Mode FALSE TRUE
## logical 998 57
We infer the following:
We fist started by looking at weither Remote IT Security is increasing or not. We started with a null hypothesis (that is Physical breach is increasing) and we proved that the Null hypothesis needs to be rejected in favor of the alternative (Remote breaches is increasing), that is Physical breaches is not increasing, but rather the remote breaches is the one that is increasing. We determined that through the p-value and it’s statistical calculations. Therefore enterprises need to do more and put more resources into strengthing their Remote IT Security.
DataSets Repository is here: All DataSets are here: https://vincentarelbundock.github.io/Rdatasets/datasets.html
Documentation for cyber security breaches is here: https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/breaches.html
Data for cyber security breaches is here: https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/breaches.csv
Visualizations: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html