library(dplyr)
library(kableExtra)
library(DT)
library(ggplot2)
library(stringr)
library(ggplot2)
library(tidyverse)
library(gmodels)
# load data
data_utl <- "https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_606/security_breaches.csv"
#Read csv data
security_breaches <- read.csv(data_utl, stringsAsFactors = FALSE)
There are many types of IT security breaches. They can be grouped into two broad categiries - remote and physical. Breaches that happen remotely uses network vulnerability to gain access to electronic data. In this type of breach the hacker is remote. For breaches that happens physically, the hacker is on site and steal physical data like printouts, stealing hard drives etc.
In this study we are trying to determine if the hypothesis of remote IT security breach rising is true or false. In other words we are trying to find if more hackers prefer to be on site (physically) or want to be remote.
The data primarily consists of breaches that occurred from 2010 through early 2014 when the extract was taken. However, a few breaches are recorded including 1 from 1997, 8 from 2002-2007, 13 from 2008 and 56 from 2009. The numbers of breaches from 2010 - 2014 are 211, 229, 227, 254 and 56, respectively.
A data.frame with 1055 observations on the following 24 variables:
We don’t need all the above columns. We can transform/extract data to form the columns below:
## [1] "Number" "Name_of_Covered_Entity"
## [3] "State" "Business_Associate_Involved"
## [5] "Individuals_Affected" "Date_of_Breach"
## [7] "Type_of_Breach" "Location_of_Breached_Information"
## [9] "Date_Posted_or_Updated" "Summary"
## [11] "breach_start" "breach_end"
## [13] "year"
security_breaches_df = security_breaches[c("Name_of_Covered_Entity","State","Individuals_Affected","Date_of_Breach","Type_of_Breach","Location_of_Breached_Information","breach_start","breach_end","year")]
# Display data
datatable(security_breaches_df)
The data is stored as text in various columns. We have to extract data that suggest Physical or Remote. For this we create a array “physical_key_words” that contains words that indicates Physical breach. We will examine columns - Summary and Location_of_Breached_Information.
# Data cleaning/extraction
security_breaches_new <- security_breaches
# List of keywords to extract physical/ remote variable
physical_key_words <- c("binder","transit","public transportation","desk","usb flash drive","cleaning crew","locked facility","tapes","paper","desktop","portable","disposal")
Extract Physical breaches from text from column “Summary”
Extract Physical breaches from text from column “Location_of_Breached_Information”
Combine both variable to one
Divide Individuals_Affected by 10000 for the charts
Extract columns
security_breaches_new <- select(security_breaches_new, c("State","Individuals_Affected","year","PhysicalBreach3") )
names(security_breaches_new) <- c("State","AffectedNumber","Year","IsPhysical")
# New data
head(security_breaches_new)
## State AffectedNumber Year IsPhysical
## 1 TX 0.1000 2009 TRUE
## 2 MO 0.1000 2009 FALSE
## 3 AK 0.0501 2009 FALSE
## 4 DC 0.3800 2009 FALSE
## 5 CA 0.5257 2009 TRUE
## 6 CA 0.0857 2009 TRUE
# Prepare data for analysis
security_breaches_new_r01 <- security_breaches_new %>% dplyr::select(Year,IsPhysical)
# Group by year
security_breaches_new_rg01 <- security_breaches_new_r01 %>% group_by(Year,IsPhysical) %>% summarize(n())
# Spread the data
security_breaches_new_rg01 <- spread(security_breaches_new_rg01, Year,"n()")
# Remove NA. Replace it with 0.
security_breaches_new_rg01[is.na(security_breaches_new_rg01)] <- 0
security_breaches_new_rg01 <- as.data.frame(t(security_breaches_new_rg01))
security_breaches_new_rg02 <- cbind(rownames(security_breaches_new_rg01), security_breaches_new_rg01)
rownames(security_breaches_new_rg02) <- NULL
colnames(security_breaches_new_rg02) <- c("Year","IsRemote", "IsPhysical")
security_breaches_new_rg02 <- security_breaches_new_rg02[!grepl("IsPhysical", security_breaches_new_rg02$Year),]
rownames(security_breaches_new_rg02) <- 1:nrow(security_breaches_new_rg02)
# View data
head(security_breaches_new_rg02) %>% kable() %>% kable_styling()
Year | IsRemote | IsPhysical |
---|---|---|
1997 | 1 | 0 |
2002 | 1 | 0 |
2003 | 1 | 0 |
2004 | 2 | 0 |
2005 | 2 | 0 |
2006 | 1 | 0 |
##
## Pearson's Chi-squared test
##
## data: security_breaches_new_rg02[, -1]
## X-squared = 4.8025, df = 13, p-value = 0.9793
Since the p-value is less than 0.05, we can reject the null hypothesis that Physical breach is on the rise over year and not remote. We can accept the alternative hypothesis that Remote breach is on the rise over year and not physical
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.500 4.071 8.000 14.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 7.00 71.29 161.25 245.00
Lets begin out t-test. We will examine our assumptions and do the test.
security_breaches_new_rgrm02 <- select(security_breaches_new_rg02, c("Year","IsRemote") )
security_breaches_new_rgph02 <- select(security_breaches_new_rg02, c("Year","IsPhysical") )
names(security_breaches_new_rgrm02) <- c("Year","Count")
names(security_breaches_new_rgph02) <- c("Year","Count")
security_breaches_new_rgrm02$BreachType <- "Remote"
security_breaches_new_rgph02$BreachType <- "Physical"
security_breaches_new_rgrmph02 <- rbind(security_breaches_new_rgrm02, security_breaches_new_rgph02)
security_breaches_new_rgrmph02mtx <- data.frame(as.matrix(security_breaches_new_rgrmph02[,2:3]))
# Examine the data
boxplot(as.numeric(security_breaches_new_rgrmph02mtx$Count)~security_breaches_new_rgrmph02mtx$BreachType,ylab="Breach Count", main="Physical vs Remote Breach Count")
# Assumption 1: Independent Samples
# The samples are independent
# Assumption 2: Samples are drawn from population with equal variance (NOT MET)
var.test(as.numeric(security_breaches_new_rgrmph02mtx$Count)~security_breaches_new_rgrmph02mtx$BreachType)
##
## F test to compare two variances
##
## data: as.numeric(security_breaches_new_rgrmph02mtx$Count) by security_breaches_new_rgrmph02mtx$BreachType
## F = 0.30981, num df = 13, denom df = 13, p-value = 0.04356
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.0994556 0.9650622
## sample estimates:
## ratio of variances
## 0.3098077
As we see the p-value is below 0.05 so we reject the null hypothesis. This assumption is not met. If the assumption were met then we countinue with the below.
# Assumption 3: Samples are drawn from the same population with a normal distribution
isRemote <- subset(security_breaches_new_rgrmph02mtx, BreachType=="Remote")
isPhysical <- subset(security_breaches_new_rgrmph02mtx, BreachType=="Physical")
shapiro.test(as.numeric(isRemote$Count))
##
## Shapiro-Wilk normality test
##
## data: as.numeric(isRemote$Count)
## W = 0.82213, p-value = 0.009499
##
## Shapiro-Wilk normality test
##
## data: as.numeric(isPhysical$Count)
## W = 0.77676, p-value = 0.002612
##
## Welch Two Sample t-test
##
## data: as.numeric(security_breaches_new_rgrmph02mtx$Count) by security_breaches_new_rgrmph02mtx$BreachType
## t = -2.3591, df = 20.35, p-value = 0.02842
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.4568845 -0.4002583
## sample estimates:
## mean in group Physical mean in group Remote
## 3.071429 6.500000
The p-value is less than 0.05 and we reject the null hypothesis and keep the alternative hypothesis
# 2-Way Cross Tabulation
CrossTable(security_breaches_new_rg02$IsPhysical, security_breaches_new_rg02$IsRemote)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 14
##
##
## | security_breaches_new_rg02$IsRemote
## security_breaches_new_rg02$IsPhysical | 1 | 2 | 12 | 51 | 54 | 197 | 214 | 216 | 245 | Row Total |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
## | 2.500 | 1.000 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | |
## | 0.714 | 0.286 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 |
## | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.357 | 0.143 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
## | 0.357 | 0.143 | 12.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.071 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
## | 0.357 | 0.143 | 0.071 | 0.071 | 12.071 | 0.071 | 0.071 | 0.071 | 0.071 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
## | 0.357 | 0.143 | 0.071 | 12.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.071 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
## | 0.357 | 0.143 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 12.071 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.071 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 13 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 |
## | 0.714 | 0.286 | 0.143 | 0.143 | 0.143 | 0.143 | 5.143 | 5.143 | 0.143 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.500 | 0.500 | 0.000 | 0.143 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | 0.071 | 0.000 | |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 14 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
## | 0.357 | 0.143 | 0.071 | 0.071 | 0.071 | 12.071 | 0.071 | 0.071 | 0.071 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.071 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.071 | 0.000 | 0.000 | 0.000 | |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 5 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 14 |
## | 0.357 | 0.143 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | |
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
security_breaches_004 <- security_breaches_new %>% dplyr::select(State,Year,IsPhysical)
security_breaches_a_004 <- security_breaches_004 %>%
group_by(State,IsPhysical) %>%
summarize(n())
security_breaches_a1_004 <- spread(security_breaches_a_004, State,"n()")
security_breaches_a1_004 <- as.data.frame(t(security_breaches_a1_004))
security_breaches_a1_004[is.na(security_breaches_a1_004)] <- 0
names(security_breaches_a1_004) <- c("IsRemote", "IsPhysical")
barplot(as.matrix(security_breaches_a1_004), beside = TRUE)
#barplot(as.matrix(all_age_contin), beside = TRUE)
chisq.test(security_breaches_a1_004[,-1]) #We remove the major names for the chi-squared test
## Warning in chisq.test(security_breaches_a1_004[, -1]): Chi-squared
## approximation may be incorrect
##
## Chi-squared test for given probabilities
##
## data: security_breaches_a1_004[, -1]
## X-squared = 132.07, df = 52, p-value = 6.697e-09
security_breaches_new_physical <- security_breaches_new %>% filter(IsPhysical == TRUE)
ggplot(security_breaches_new_physical, aes(Year, State)) + geom_tile(aes(fill = AffectedNumber), colour = "white") + scale_fill_gradient(low = "steelblue", high = "red") + theme(text = element_text(size=7), axis.text.x = element_text(angle=90, hjust=1)) + ylab("State") + xlab("Year") + ggtitle("Heat Map - Physical Breaches by State per Year") + theme(plot.title = element_text(hjust = 0.5))
security_breaches_new_remote <- security_breaches_new %>% filter(IsPhysical == FALSE)
ggplot(security_breaches_new_remote, aes(Year, State)) + geom_tile(aes(fill = AffectedNumber), colour = "white") + scale_fill_gradient(low = "steelblue", high = "red") + theme(text = element_text(size=7), axis.text.x = element_text(angle=90, hjust=1)) + ylab("State") + xlab("Year") + ggtitle("Heat Map - Remote Breaches by State per Year") + theme(plot.title = element_text(hjust = 0.5))
al_mean <- mean(security_breaches_new$AffectedNumber)
al_sd <- sd(security_breaches_new$AffectedNumber)
# Find the distribution
ggplot(data = security_breaches_new, aes(security_breaches_new$AffectedNumber)) +
stat_function(fun = dnorm, n = 101, args = list(mean=al_mean, sd=al_sd)) + ylab("") + xlab("Count") +
scale_y_continuous(breaks = NULL) + ggtitle("Distribution") +
theme(plot.title = element_text(hjust = 0.5))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1000 0.2300 3.0262 0.6941 490.0000
## Mode FALSE TRUE
## logical 998 57
#unempl <- cbind(all_ages$Unemployment_rate, rct_grad$Unemployment_rate, grad_stdnt$Grad_unemployment_rate) boxplot(unempl,names = c("All", "Recent Grad", "Grad Student"), ylab = "Unemployment Rate")
#ggplot(all_ages, aes(x = Unemployment_rate, y = Median)) + geom_point(color = 'blue') + geom_smooth(method = "lm", formula = y~x)
#security_breaches_new_g1 <- select(security_breaches_new, c("State","Year","IsPhysical") )
#security_breaches_new_g1 %>% group_by(State)
From the various test we conducted, we can see the breach, both physical or remote occurred after 2009. There were smaller breaches prior to that but we not significiant. We can also infer that the breach occurred across all states and not just particular states. Most of the breaches are small. Hackers prefer remote means than on site.
In this study were are trying to determine if the hypothesis of remote IT security breach rising is true or false. From the study we can conclude that preferred mean of breach continues to be remote. More has to be done to strengthen remote security.
Charts: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html