Final: Is there a relationship between cyber attack type and industry type?
This project uses “Data_Breach_Notifications_Affecting_Washington_Residents_Personal_Information_Breakdown.csv” from data.wa.gov. The dataset has 6605 observations and 16 variables, such as cyber attack type, data breach cause, business type, industry type, and date aware of the attack. The variables used here are cyber attack type and industry type. This topic was chosen because, as time goes on, the world is increasingly reliant on technology, and cyber attacks could put many people in danger.
Source: data.wa.gov
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 6605 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): DateAware, DateSubmitted, DataBreachCause, Name, CyberattackType,...
dbl (3): Id, WashingtoniansAffected, Year
dttm (2): DateStart, DateEnd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(databreach)
# A tibble: 6 × 16
Id DateAware DateSubmitted DataBreachCause DateStart
<dbl> <chr> <chr> <chr> <dttm>
1 9667 11/16/2016 12:00:00 AM 02/22/2017 1… Cyberattack 2016-04-24 00:00:00
2 9667 11/16/2016 12:00:00 AM 02/22/2017 1… Cyberattack 2016-04-24 00:00:00
3 9667 11/16/2016 12:00:00 AM 02/22/2017 1… Cyberattack 2016-04-24 00:00:00
4 9668 02/06/2017 12:00:00 AM 02/23/2017 1… Cyberattack 2013-06-01 00:00:00
5 9668 02/06/2017 12:00:00 AM 02/23/2017 1… Cyberattack 2013-06-01 00:00:00
6 9668 02/06/2017 12:00:00 AM 02/23/2017 1… Cyberattack 2013-06-01 00:00:00
# ℹ 11 more variables: DateEnd <dttm>, Name <chr>, CyberattackType <chr>,
# WashingtoniansAffected <dbl>, IndustryType <chr>, BusinessType <chr>,
# InformationType <chr>, Year <dbl>, WashingtoniansAffectedRange <chr>,
# BreachLifecycleRange <chr>, EntityState <chr>
First, rows with a missing cyber attack type (NA) were filtered out. Then the two variables used in the analysis, CyberattackType and IndustryType, were selected. Lastly, Skimmers was removed from cyber attack type because it was later found to produce expected cell counts below 5, which would violate an assumption of the Chi-squared test.
# A tibble: 4,848 × 2
CyberattackType IndustryType
<chr> <chr>
1 Malware Business
2 Malware Business
3 Malware Business
4 Malware Business
5 Malware Business
6 Malware Business
7 Malware Business
8 Malware Business
9 Malware Business
10 Malware Business
# ℹ 4,838 more rows
dbc <- dbci |> filter(CyberattackType != "Skimmers") # Skimmers was later found to give expected cell counts below 5
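The cleaning steps described above can be sketched on a small stand-in table. This is only an illustration under stated assumptions: `toy_breach`, `toy_dbci`, and `toy_dbc` are hypothetical names, and the rows are made up for the example rather than taken from the dataset. Only the column names `CyberattackType` and `IndustryType` and the three steps (drop NAs, select, remove Skimmers) come from the report.

```r
library(dplyr)

# Toy stand-in for the `databreach` tibble (hypothetical rows, not real data).
toy_breach <- data.frame(
  Id = 1:6,
  CyberattackType = c("Malware", "Ransomware", NA, "Skimmers", "Phishing", "Malware"),
  IndustryType = c("Business", "Health", "Finance", "Business", "Education", "Government")
)

# Step 1: drop rows with a missing attack type.
# Step 2: keep only the two variables used in the analysis.
toy_dbci <- toy_breach |>
  filter(!is.na(CyberattackType)) |>
  select(CyberattackType, IndustryType)

# Step 3: remove the sparse "Skimmers" category.
toy_dbc <- toy_dbci |>
  filter(CyberattackType != "Skimmers")

print(toy_dbc)
```

On the real data, the same pipeline would presumably start from `databreach` instead of `toy_breach`.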
Statistical analysis
The method used here is the Chi-squared test of independence. This method fits the question because it takes two categorical variables and tests whether they are associated with each other, which is exactly what is being asked: is there a relationship between cyber attack type and industry type?
\(H_0\): Cyberattack Type is not associated with Industry Type
\(H_a\): Cyberattack Type is associated with Industry Type
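To make the test concrete, here is a minimal sketch of a Chi-squared test of independence using base R's `chisq.test()`. It runs on a toy contingency table with hypothetical counts (`toy_counts` is not from the dataset), so the numbers it prints say nothing about the actual results. It also shows how the expected cell counts, which motivated dropping Skimmers, can be inspected.

```r
# Toy contingency table (hypothetical counts, not from the dataset).
toy_counts <- matrix(
  c(30, 10, 20,
    15, 25, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    CyberattackType = c("Malware", "Ransomware"),
    IndustryType = c("Business", "Finance", "Health")
  )
)

# Chi-squared test of independence on the table of counts.
toy_test <- chisq.test(toy_counts)

# Expected counts under H_0: (row total * column total) / grand total.
print(toy_test$expected)

# Rule of thumb for the test: every expected count should be at least 5.
print(all(toy_test$expected >= 5))

# Test statistic, degrees of freedom, and p-value.
print(toy_test)
```

With the cleaned data, one plausible call would be `chisq.test(table(dbc$CyberattackType, dbc$IndustryType))`, assuming `dbc` holds the two columns as shown earlier.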
Based on the low p-value obtained from the Chi-squared test, we reject the null hypothesis that cyber attack type is not associated with industry type and conclude that the probabilities of the different cyber attack types differ across industries.
Conclusion
This project attempts to answer whether there is an association between cyber attack types and industry types. The results show that the probabilities of the different cyber attack types differ across industries. In the future, it would be interesting to look at how the number of each type of cyber attack, and the industries they target, change over time, and to see whether there is an association between time and cyber attack type or industry type.