Final: Is there a relationship between cyber attack type and industry type?

This project uses “Data_Breach_Notifications_Affecting_Washington_Residents__Personal_Information_Breakdown_.csv” from data.wa.gov. The dataset has 6,605 observations and 16 variables, including cyber attack type, data breach cause, business type, industry type, and the date the entity became aware of the attack. The two variables used here are cyber attack type and industry type. This topic was chosen because the world is increasingly reliant on technology, and cyber attacks could put many people in danger.

Source: data.wa.gov

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
databreach <- read_csv("Data_Breach_Notifications_Affecting_Washington_Residents__Personal_Information_Breakdown_.csv")
Rows: 6605 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (11): DateAware, DateSubmitted, DataBreachCause, Name, CyberattackType,...
dbl   (3): Id, WashingtoniansAffected, Year
dttm  (2): DateStart, DateEnd

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(databreach)
# A tibble: 6 × 16
     Id DateAware              DateSubmitted DataBreachCause DateStart          
  <dbl> <chr>                  <chr>         <chr>           <dttm>             
1  9667 11/16/2016 12:00:00 AM 02/22/2017 1… Cyberattack     2016-04-24 00:00:00
2  9667 11/16/2016 12:00:00 AM 02/22/2017 1… Cyberattack     2016-04-24 00:00:00
3  9667 11/16/2016 12:00:00 AM 02/22/2017 1… Cyberattack     2016-04-24 00:00:00
4  9668 02/06/2017 12:00:00 AM 02/23/2017 1… Cyberattack     2013-06-01 00:00:00
5  9668 02/06/2017 12:00:00 AM 02/23/2017 1… Cyberattack     2013-06-01 00:00:00
6  9668 02/06/2017 12:00:00 AM 02/23/2017 1… Cyberattack     2013-06-01 00:00:00
# ℹ 11 more variables: DateEnd <dttm>, Name <chr>, CyberattackType <chr>,
#   WashingtoniansAffected <dbl>, IndustryType <chr>, BusinessType <chr>,
#   InformationType <chr>, Year <dbl>, WashingtoniansAffectedRange <chr>,
#   BreachLifecycleRange <chr>, EntityState <chr>
str(databreach)
spc_tbl_ [6,605 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Id                         : num [1:6605] 9667 9667 9667 9668 9668 ...
 $ DateAware                  : chr [1:6605] "11/16/2016 12:00:00 AM" "11/16/2016 12:00:00 AM" "11/16/2016 12:00:00 AM" "02/06/2017 12:00:00 AM" ...
 $ DateSubmitted              : chr [1:6605] "02/22/2017 12:00:00 AM" "02/22/2017 12:00:00 AM" "02/22/2017 12:00:00 AM" "02/23/2017 12:00:00 AM" ...
 $ DataBreachCause            : chr [1:6605] "Cyberattack" "Cyberattack" "Cyberattack" "Cyberattack" ...
 $ DateStart                  : POSIXct[1:6605], format: "2016-04-24" "2016-04-24" ...
 $ DateEnd                    : POSIXct[1:6605], format: "2016-12-14" "2016-12-14" ...
 $ Name                       : chr [1:6605] "Intex Recreation Corp." "Intex Recreation Corp." "Intex Recreation Corp." "Abbott Nutrition" ...
 $ CyberattackType            : chr [1:6605] "Malware" "Malware" "Malware" "Malware" ...
 $ WashingtoniansAffected     : num [1:6605] 1547 1547 1547 1819 1819 ...
 $ IndustryType               : chr [1:6605] "Business" "Business" "Business" "Business" ...
 $ BusinessType               : chr [1:6605] "Manufacturing" "Manufacturing" "Manufacturing" "Consumable" ...
 $ InformationType            : chr [1:6605] "Name" "Financial & Banking Information" "Other" "Name" ...
 $ Year                       : num [1:6605] 2017 2017 2017 2017 2017 ...
 $ WashingtoniansAffectedRange: chr [1:6605] "1,000-9,999" "1,000-9,999" "1,000-9,999" "1,000-9,999" ...
 $ BreachLifecycleRange       : chr [1:6605] "200-299" "200-299" "200-299" "500+" ...
 $ EntityState                : chr [1:6605] NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   Id = col_double(),
  ..   DateAware = col_character(),
  ..   DateSubmitted = col_character(),
  ..   DataBreachCause = col_character(),
  ..   DateStart = col_datetime(format = ""),
  ..   DateEnd = col_datetime(format = ""),
  ..   Name = col_character(),
  ..   CyberattackType = col_character(),
  ..   WashingtoniansAffected = col_double(),
  ..   IndustryType = col_character(),
  ..   BusinessType = col_character(),
  ..   InformationType = col_character(),
  ..   Year = col_double(),
  ..   WashingtoniansAffectedRange = col_character(),
  ..   BreachLifecycleRange = col_character(),
  ..   EntityState = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Checking for missing values (NAs)

sum(is.na(databreach$IndustryType))
[1] 0
sum(is.na(databreach$CyberattackType))
[1] 1725

Data Analysis

First, rows with NA in cyber attack type were filtered out. Then the two variables used in the analysis, CyberattackType and IndustryType, were selected. Lastly, Skimmers was removed from cyber attack type because it was later found to produce expected cell counts below 5.

Cleaning

dbci <- databreach |>
  filter(!is.na(CyberattackType)) |>
  filter(CyberattackType != "Skimmers")
dbci |>
  select(CyberattackType, IndustryType)
# A tibble: 4,848 × 2
   CyberattackType IndustryType
   <chr>           <chr>       
 1 Malware         Business    
 2 Malware         Business    
 3 Malware         Business    
 4 Malware         Business    
 5 Malware         Business    
 6 Malware         Business    
 7 Malware         Business    
 8 Malware         Business    
 9 Malware         Business    
10 Malware         Business    
# ℹ 4,838 more rows
# Skimmers rows were already removed from dbci above,
# so only the two analysis variables need to be kept here.
dbc <- dbci |>
  select(CyberattackType, IndustryType)
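
The row counts printed in this report are enough to back out how many Skimmers rows were dropped and to put an upper bound on one of their expected cell counts. The sketch below is only arithmetic on numbers that already appear in the report (6,605 rows, 1,725 NAs, 4,848 rows kept, and 187 Government rows in the observed table); the variable names are made up for illustration, and the bound assumes every Skimmers row could fall in the Government column.

```r
n_total   <- 6605  # rows in the raw data
n_missing <- 1725  # rows with NA in CyberattackType
n_kept    <- 4848  # rows left after dropping NAs and Skimmers

n_with_type <- n_total - n_missing   # rows that have an attack type
n_skimmers  <- n_with_type - n_kept  # rows removed as Skimmers

# Government has 187 rows once Skimmers are excluded, so with Skimmers
# included its column total is at most 187 + n_skimmers.
gov_max <- 187 + n_skimmers

# Expected count = row total * column total / grand total.
# This is the largest value the Skimmers x Government cell could take.
expected_upper <- n_skimmers * gov_max / n_with_type
expected_upper
```

Even this upper bound falls well below 5, which is consistent with the reason given for removing Skimmers before running the test.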

Statistical analysis

The method used here is the Chi-squared test of independence. This method suits the question because it takes two categorical variables and tests whether they are associated with each other, which is exactly what is being asked: is there a relationship between cyber attack type and industry type?

\(H_0\): Cyberattack Type is not associated with Industry Type

\(H_a\): Cyberattack Type is associated with Industry Type
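
For reference, the test statistic can be written out as follows. The symbols are notation introduced here, not taken from the report: \(O_{ij}\) is the observed count for attack type \(i\) and industry \(j\), \(R_i\) and \(C_j\) are the row and column totals, \(n\) is the grand total, and \(r\) and \(c\) are the numbers of rows and columns.

\[
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
\text{df} = (r - 1)(c - 1)
\]

With the five attack types and six industry types that remain after cleaning, this gives \(\text{df} = (5 - 1)(6 - 1) = 20\).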

observed_dbc <- table(dbc$CyberattackType, dbc$IndustryType)
observed_dbc
                 
                  Business Education Finance Government Health
  Malware              470        17     104         19    156
  Other                205        45     144         16    142
  Phishing             166        33      86         38    177
  Ransomware          1017       326     296         99    563
  Unclear/unknown      183        22      32         15     65
                 
                  Non-Profit/Charity
  Malware                         16
  Other                           16
  Phishing                        47
  Ransomware                     310
  Unclear/unknown                 23
dbc_result <- chisq.test(observed_dbc)
dbc_result

    Pearson's Chi-squared test

data:  observed_dbc
X-squared = 401.97, df = 20, p-value < 2.2e-16
expected_cell_counts <- dbc_result$expected
expected_cell_counts
                 
                   Business Education   Finance Government    Health
  Malware          329.2207  71.45751 106.78300   30.16378 177.91790
  Other            239.1271  51.90264  77.56106   21.90924 129.22937
  Phishing         230.2861  49.98370  74.69348   21.09922 124.45153
  Ransomware      1099.2267 238.58767 356.53507  100.71308 594.04559
  Unclear/unknown  143.1394  31.06848  46.42739   13.11469  77.35561
                 
                  Non-Profit/Charity
  Malware                   66.45710
  Other                     48.27063
  Phishing                  46.48597
  Ransomware               221.89191
  Unclear/unknown           28.89439
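
The expected counts and the test statistic can also be rebuilt by hand as a cross-check. The following is a minimal sketch in base R that retypes the observed table from this report as a matrix; the object names (`observed`, `expected_manual`, and so on) are illustrative and not part of the original analysis.

```r
# Observed counts copied from the contingency table in this report.
observed <- matrix(
  c( 470,  17, 104, 19, 156,  16,
     205,  45, 144, 16, 142,  16,
     166,  33,  86, 38, 177,  47,
    1017, 326, 296, 99, 563, 310,
     183,  22,  32, 15,  65,  23),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Malware", "Other", "Phishing", "Ransomware", "Unclear/unknown"),
    c("Business", "Education", "Finance", "Government", "Health",
      "Non-Profit/Charity")
  )
)

n <- sum(observed)

# Expected count under independence: row total * column total / n.
expected_manual <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic and its degrees of freedom.
chi_sq_manual <- sum((observed - expected_manual)^2 / expected_manual)
df_manual <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Rule of thumb used in this report: every expected count should be at least 5.
all_cells_ok <- all(expected_manual >= 5)

chi_sq_manual
df_manual
all_cells_ok
```

If the table was retyped correctly, the hand-computed values should agree with the `chisq.test()` output above, and the expected-count check should pass now that Skimmers has been removed.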

Bar graph

dbc_v <- dbc |>
  ggplot() +
  geom_bar(aes(x = CyberattackType, fill = IndustryType),
           position = "dodge") +
  labs(title = "Cyber Attack Type and Industry Type",
       x = "Cyber Attack Type",
       y = "Count",
       fill = "Industry Type",
       caption = "Source: data.wa.gov") +
  theme_minimal()

dbc_v

X-squared = 401.97, df = 20, p-value < 2.2e-16

Based on the low p-value obtained from the Chi-squared test, we reject the null hypothesis that cyber attack type is not associated with industry type and conclude that the distribution of cyber attack types differs across industries.
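
To see where the distributions differ, one option is to turn the observed counts into within-industry proportions and add a rough effect size. The sketch below is not part of the original analysis: it retypes the observed table as a matrix, and Cramér's V is brought in here only as a common companion to the Chi-squared test. All object names are illustrative.

```r
# Observed counts copied from the contingency table in this report.
observed <- matrix(
  c( 470,  17, 104, 19, 156,  16,
     205,  45, 144, 16, 142,  16,
     166,  33,  86, 38, 177,  47,
    1017, 326, 296, 99, 563, 310,
     183,  22,  32, 15,  65,  23),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Malware", "Other", "Phishing", "Ransomware", "Unclear/unknown"),
    c("Business", "Education", "Finance", "Government", "Health",
      "Non-Profit/Charity")
  )
)

# Share of each attack type within each industry (every column sums to 1).
col_props <- prop.table(observed, margin = 2)
round(col_props, 3)

# Cramér's V: sqrt(chi-squared / (n * (min(rows, cols) - 1))).
# It ranges from 0 (no association) to 1 (perfect association).
chi_sq <- unname(chisq.test(observed)$statistic)
cramers_v <- sqrt(chi_sq / (sum(observed) * (min(dim(observed)) - 1)))
cramers_v
```

For example, ransomware makes up 326 of the 443 Education rows (about 74%) but 1,017 of the 2,041 Business rows (about 50%). With 4,848 rows, even a modest association produces a very small p-value, so the effect size is worth reading alongside it.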

Conclusion

This project attempts to answer whether there is an association between cyber attack types and industry types. The results show that the distribution of cyber attack types differs across industries. In the future it would be interesting to look at how the number of each type of cyber attack changes over time and which industries are affected, and to test whether there is an association between time and cyber attack type or industry type.

References / Source

data.wa.gov https://catalog.data.gov/dataset/data-breach-notifications-affecting-washington-residents-personal-information-breakdown