Part 1 - Introduction

IT security breaches come in many forms, but they can be grouped into two broad categories: remote and physical. A remote breach exploits a network vulnerability to gain access to electronic data, and the attacker is off site. In a physical breach the attacker is on site and steals physical items such as printouts or hard drives.

In this study we test the hypothesis that remote IT security breaches are on the rise. In other words, we try to find out whether attackers more often operate on site (physically) or remotely.

Part 2 - Data

The data consist primarily of breaches that occurred from 2010 through early 2014, when the extract was taken. A few earlier breaches are also recorded: 1 from 1997, 8 from 2002-2007, 13 from 2008 and 56 from 2009. The numbers of breaches for 2010 through 2014 are 211, 229, 227, 254 and 56, respectively.
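
As a quick arithmetic check, the yearly counts quoted above should add up to the 1055 observations in the data set. The sketch below only reuses the numbers from the text; the vector names are illustrative and not taken from the original analysis.

```r
# Breach counts per period, copied from the paragraph above.
early_counts <- c("1997" = 1, "2002-2007" = 8, "2008" = 13, "2009" = 56)
main_counts  <- c("2010" = 211, "2011" = 229, "2012" = 227, "2013" = 254, "2014" = 56)

# The two groups together should cover every record in the extract.
total_breaches <- sum(early_counts) + sum(main_counts)
print(total_breaches)  # 1055
```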

The data are a data.frame with 1055 observations. The variables described here are a subset of its columns:

  • Number: integer record number in the HHS database
  • Name_of_Covered_Entity: factor giving the name of the entity experiencing the breach
  • State: factor giving the 2-letter code of the state where the breach occurred. This has 52 levels for the 50 states plus the District of Columbia (DC) and Puerto Rico (PR).
  • Business_Associate_Involved: factor giving the name of a subcontractor (or blank) associated with the breach.
  • Individuals_Affected: integer number of humans whose records were compromised in the breach. This is 500 or greater; U.S. law requires reports of breaches involving 500 or more records but not of breaches involving fewer.
  • Date_of_Breach: character vector giving the date or date range of the breach. Recoded as Dates in breach_start and breach_end.
  • Type_of_Breach: factor with 29 levels giving the type of breach (e.g., “Theft”, “Unauthorized Access/Disclosure”, etc.)
  • Location_of_Breached_Information: factor with 41 levels coding the location from which the breach occurred (e.g., “Paper”, “Laptop”, etc.)
  • Date_Posted_or_Updated: Date the information was posted to the HHS database or last updated.
  • Summary: character vector of a summary of the incident.
  • breach_start: Date of the start of the incident = first date given in Date_of_Breach above.

We don’t need all the above columns. We can transform/extract data to form the columns below:

  • EntityName
  • State
  • Year
  • BreachType
  • AffectedNumber

The loaded data has the following columns:

##  [1] "Number"                           "Name_of_Covered_Entity"          
##  [3] "State"                            "Business_Associate_Involved"     
##  [5] "Individuals_Affected"             "Date_of_Breach"                  
##  [7] "Type_of_Breach"                   "Location_of_Breached_Information"
##  [9] "Date_Posted_or_Updated"           "Summary"                         
## [11] "breach_start"                     "breach_end"                      
## [13] "year"

Data Preparation

The information we need is stored as free text in several columns, so we have to extract the words that indicate whether a breach was physical or remote. To do this we create an array, “physical_key_words”, containing words that indicate a physical breach, and search for them in two columns: Summary and Location_of_Breached_Information.

Extract Physical breaches from text from column “Summary”

Extract Physical breaches from text from column “Location_of_Breached_Information”

Combine both variables into one
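
The keyword search and the combination of the two flags can be sketched as follows. This is a minimal illustration under stated assumptions: the actual contents of physical_key_words are not shown in this report, so the keywords here are hypothetical, and the three rows are toy data standing in for the real columns.

```r
# Hypothetical keyword list; the values used in the original analysis are not shown.
physical_key_words <- c("stolen", "theft", "paper", "break-in", "burglar")

# Toy rows standing in for the Summary and Location_of_Breached_Information columns.
breaches <- data.frame(
  Summary = c("A laptop was stolen from an employee's car.",
              "A server was accessed over the network by an unknown party.",
              NA),
  Location_of_Breached_Information = c("Laptop", "Network Server", "Paper"),
  stringsAsFactors = FALSE
)

# One regular expression that matches any of the keywords.
pattern <- paste(physical_key_words, collapse = "|")

# grepl() returns FALSE for NA input, so a missing summary counts as "no match".
in_summary  <- grepl(pattern, breaches$Summary, ignore.case = TRUE)
in_location <- grepl(pattern, breaches$Location_of_Breached_Information,
                     ignore.case = TRUE)

# A breach is flagged as physical if either column mentions a keyword.
breaches$IsPhysical <- in_summary | in_location
print(breaches$IsPhysical)
```

Matching on a fixed keyword list is simple but sensitive to the words chosen, so the resulting physical/remote split depends directly on that list.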

Divide Individuals_Affected by 10000 for the charts

Extract columns

##   State AffectedNumber Year IsPhysical
## 1    TX         0.1000 2009       TRUE
## 2    MO         0.1000 2009      FALSE
## 3    AK         0.0501 2009      FALSE
## 4    DC         0.3800 2009      FALSE
## 5    CA         0.5257 2009       TRUE
## 6    CA         0.0857 2009       TRUE
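
The scaling and column-selection steps that lead to a table like the one above can be sketched as follows. The input rows are back-calculated from the first three rows of the printed output (AffectedNumber multiplied by 10000), and the object names breaches and breach_subset are assumptions made for this illustration.

```r
# Toy input mirroring the first three printed rows; Individuals_Affected is
# back-calculated from AffectedNumber * 10000.
breaches <- data.frame(
  State = c("TX", "MO", "AK"),
  Individuals_Affected = c(1000L, 1000L, 501L),
  year = c(2009, 2009, 2009),
  IsPhysical = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Scale the affected count to units of 10,000 individuals and keep only the
# columns needed for the analysis.
breach_subset <- data.frame(
  State = breaches$State,
  AffectedNumber = breaches$Individuals_Affected / 10000,
  Year = breaches$year,
  IsPhysical = breaches$IsPhysical,
  stringsAsFactors = FALSE
)
print(breach_subset)
```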

Part 3 - Exploratory data analysis

Chi Squared Tests

  • Null Hypothesis - The split between remote and physical breaches is independent of year
  • Alternative Hypothesis - The split between remote and physical breaches depends on year

First six rows of the yearly counts:

Year  IsRemote  IsPhysical
1997  1         0
2002  1         0
2003  1         0
2004  2         0
2005  2         0
2006  1         0

## 
##  Pearson's Chi-squared test
## 
## data:  security_breaches_new_rg02[, -1]
## X-squared = 4.8025, df = 13, p-value = 0.9793

Since the p-value (0.9793) is far greater than 0.05, we cannot reject the null hypothesis. This test gives no evidence that the share of remote versus physical breaches changes from year to year, so it does not by itself show that remote breaches are rising relative to physical ones.
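
A chi-squared test of this kind can be sketched as follows. The yearly counts are reconstructed from the year totals in Part 2 and the cross table later in this report, so the data frame is an assumption about what security_breaches_new_rg02 looks like, not the original object.

```r
# Yearly counts reconstructed from the report (an assumption, not the original object).
yearly <- data.frame(
  Year       = c(1997, 2002, 2003, 2004, 2005, 2006, 2007,
                 2008, 2009, 2010, 2011, 2012, 2013, 2014),
  IsRemote   = c(1, 1, 1, 2, 2, 1, 1, 12, 51, 197, 216, 214, 245, 54),
  IsPhysical = c(0, 0, 0, 0, 0, 0, 0, 1, 5, 14, 13, 13, 9, 2)
)

# Drop the Year column so that the 14 x 2 table of counts is tested.
# Several expected counts are below 5, so R warns that the chi-squared
# approximation may be inaccurate; the warning is suppressed here.
result <- suppressWarnings(chisq.test(yearly[, -1]))
print(result)

# A large p-value means the remote/physical split shows no detectable
# dependence on year.
reject_null <- result$p.value < 0.05
print(reject_null)
```

Because many early years have only one or two records, a version of the test on pooled years or with a simulated p-value (`simulate.p.value = TRUE`) may be more trustworthy than the default approximation.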

Summary of the yearly physical breach counts:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.500   4.071   8.000  14.000

Summary of the yearly remote breach counts:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    7.00   71.29  161.25  245.00

T-Test

Let's begin our t-test. We will examine the assumptions first and then run the test.

  • Assumption 1: Independent Samples
  • Assumption 2: Samples are drawn from population with equal variance
  • Assumption 3: Samples are drawn from the same population with a normal distribution
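
Assumptions 2 and 3 can be checked with an F test and a Shapiro-Wilk test, as sketched below. The yearly counts are reconstructed from the figures in this report and the object names are illustrative, so the numbers this sketch prints need not match the output shown in this section.

```r
# Yearly counts per breach type, reconstructed from the report (an assumption).
remote   <- c(1, 1, 1, 2, 2, 1, 1, 12, 51, 197, 216, 214, 245, 54)
physical <- c(0, 0, 0, 0, 0, 0, 0, 1, 5, 14, 13, 13, 9, 2)

# Long format: one row per year and breach type, with Count stored as numeric.
counts <- data.frame(
  Count = c(remote, physical),
  BreachType = factor(rep(c("Remote", "Physical"), each = 14))
)

# Assumption 2: F test of the null hypothesis that the two variances are equal.
variance_check <- var.test(Count ~ BreachType, data = counts)

# Assumption 3: Shapiro-Wilk test of the null hypothesis of normality, per group.
normal_remote   <- shapiro.test(remote)
normal_physical <- shapiro.test(physical)

print(variance_check$p.value)
print(normal_remote$p.value)
print(normal_physical$p.value)
```

In each test, a p-value below 0.05 means the corresponding assumption is rejected.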

## 
##  F test to compare two variances
## 
## data:  as.numeric(security_breaches_new_rgrmph02mtx$Count) by security_breaches_new_rgrmph02mtx$BreachType
## F = 0.30981, num df = 13, denom df = 13, p-value = 0.04356
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.0994556 0.9650622
## sample estimates:
## ratio of variances 
##          0.3098077

The p-value (0.04356) is below 0.05, so we reject the null hypothesis that the two variances are equal; Assumption 2 is not met. We nevertheless continue with the normality check and a Welch t-test, which does not assume equal variances.

## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(isRemote$Count)
## W = 0.82213, p-value = 0.009499
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(isPhysical$Count)
## W = 0.77676, p-value = 0.002612

Both p-values are below 0.05, so we reject the null hypothesis of normality for both groups; Assumption 3 is not met either, and the t-test result should be interpreted with caution.

## 
##  Welch Two Sample t-test
## 
## data:  as.numeric(security_breaches_new_rgrmph02mtx$Count) by security_breaches_new_rgrmph02mtx$BreachType
## t = -2.3591, df = 20.35, p-value = 0.02842
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.4568845 -0.4002583
## sample estimates:
## mean in group Physical   mean in group Remote 
##               3.071429               6.500000

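The p-value (0.02842) is less than 0.05, so we reject the null hypothesis of equal group means and keep the alternative hypothesis that the means differ. Note, however, that the group means reported by the test (3.07 and 6.50) do not match the means in the summaries earlier in this section (4.071 and 71.29), which suggests the Count column may have been converted from factor level codes rather than from the actual counts; the result should be read with that caveat.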
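
A Welch t-test on numeric counts can be sketched as follows. The yearly counts are reconstructed from the figures in this report and the object names are illustrative. The first lines show a common pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the values.

```r
# Pitfall: as.numeric() on a factor gives level codes, not the stored numbers.
count_factor <- factor(c("12", "197", "5"))
codes  <- as.numeric(count_factor)                # 1 2 3
values <- as.numeric(as.character(count_factor))  # 12 197 5

# Yearly counts per breach type, reconstructed from the report (an assumption).
remote   <- c(1, 1, 1, 2, 2, 1, 1, 12, 51, 197, 216, 214, 245, 54)
physical <- c(0, 0, 0, 0, 0, 0, 0, 1, 5, 14, 13, 13, 9, 2)
counts <- data.frame(
  Count = c(remote, physical),
  BreachType = factor(rep(c("Remote", "Physical"), each = 14))
)

# Welch two-sample t-test: var.equal = FALSE (the default) does not assume
# equal variances in the two groups.
welch <- t.test(Count ~ BreachType, data = counts, var.equal = FALSE)
print(welch$estimate)  # group means; should equal mean(physical) and mean(remote)
print(welch$p.value)
```

Checking that the reported group means equal the plain means of each group is a cheap way to confirm that the test ran on the intended values.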

Cross Table

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  14 
## 
##  
##                                       | security_breaches_new_rg02$IsRemote 
## security_breaches_new_rg02$IsPhysical |         1 |         2 |        12 |        51 |        54 |       197 |       214 |       216 |       245 | Row Total | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                                     0 |         5 |         2 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         7 | 
##                                       |     2.500 |     1.000 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |     0.500 |           | 
##                                       |     0.714 |     0.286 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 | 
##                                       |     1.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                                       |     0.357 |     0.143 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                                     1 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         0 |         1 | 
##                                       |     0.357 |     0.143 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                                       |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                                       |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                                       |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                                     2 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         1 | 
##                                       |     0.357 |     0.143 |     0.071 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                                     5 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         1 | 
##                                       |     0.357 |     0.143 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                                       |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                                     9 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         1 |         1 | 
##                                       |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |    12.071 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.071 | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |           | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                                    13 |         0 |         0 |         0 |         0 |         0 |         0 |         1 |         1 |         0 |         2 | 
##                                       |     0.714 |     0.286 |     0.143 |     0.143 |     0.143 |     0.143 |     5.143 |     5.143 |     0.143 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 |     0.500 |     0.000 |     0.143 | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     1.000 |     0.000 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.071 |     0.000 |           | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                                    14 |         0 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         1 | 
##                                       |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |    12.071 |     0.071 |     0.071 |     0.071 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.071 | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |           | 
##                                       |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.071 |     0.000 |     0.000 |     0.000 |           | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##                          Column Total |         5 |         2 |         1 |         1 |         1 |         1 |         1 |         1 |         1 |        14 | 
##                                       |     0.357 |     0.143 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |     0.071 |           | 
## --------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

## Warning in chisq.test(security_breaches_a1_004[, -1]): Chi-squared
## approximation may be incorrect
## 
##  Chi-squared test for given probabilities
## 
## data:  security_breaches_a1_004[, -1]
## X-squared = 132.07, df = 52, p-value = 6.697e-09

Heat map

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0500   0.1000   0.2300   3.0262   0.6941 490.0000
##    Mode   FALSE    TRUE 
## logical     998      57

Part 4 - Inference

From the tests we conducted, we can see that most breaches, both physical and remote, were recorded from 2009 onward. There were a few breaches before that, but their numbers are not significant. We can also infer that breaches occurred across all states rather than in a few particular states. Most of the breaches are small. Remote breaches far outnumber physical ones (998 versus 57 records), so attackers appear to prefer remote means over on-site access.

Part 5 - Conclusion

In this study we set out to determine whether the hypothesis that remote IT security breaches are rising is true or false. From the study we conclude that the preferred means of breach continues to be remote. More has to be done to strengthen remote security.