This document presents a brief exploratory data analysis (“EDA”) of cyber breach data sourced from Cyentia Institute’s public GitHub repository.
Following best practices, I am including this data dictionary to help the reader better understand the analysis.
| column | description |
|---|---|
| id | Unique record identifier |
| affected_count | Number of data records involved in the breach |
| total_amount | Dollar cost of the breach |
| naic_sector | Two-digit NAICS code of the industry sector |
| naic_national_industry | Full six-digit NAICS code for the breached company |
| sector | Text description of the naic_sector field |
| breach_date | Date the breach occurred |
| cause | High level summary of the cause of the breach |
In the initial stages of analyzing the cybersecurity data, I spotted some outliers in key variables like affected_count and total_amount.
To improve visualization and focus, I divided the dataset into data_clean (excluding outliers) and data_outlier for separate analysis.
Although this analysis focuses on the clean dataset, given more time I would explore the outlier dataset to see whether the relationships in the clean dataset also hold there, or whether new patterns emerge.
Outliers are particularly important to examine when analyzing breach cost and number of records affected: large breaches have a real impact on people’s lives and the economy, so they should be explored in detail to glean as much information as possible and help avoid repeat incidents of that scale.
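The report does not say which rule was used to separate outliers, so the following is only a minimal sketch of one common choice: Tukey's upper fence (Q3 + 1.5 × IQR) applied to affected_count and total_amount. The toy values, the helper name `is_outlier`, and the multiplier `k` are assumptions for illustration; only the names data_clean and data_outlier come from the text.

```r
# Toy data: made-up values, NOT taken from the Cyentia dataset.
breaches <- data.frame(
  id             = 1:8,
  affected_count = c(10, 25, 40, 12, 30, 18, 22, 5000000),
  total_amount   = c(5000, 12000, 30000, 7000, 20000, 15000, 11000, 2e9)
)

# Hypothetical helper: TRUE for values above Q3 + k * IQR.
# Assumption: the original analysis may have used a different rule.
is_outlier <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  x > q[2] + k * (q[2] - q[1])
}

# A record is treated as an outlier if either variable is extreme.
flag <- is_outlier(breaches$affected_count) | is_outlier(breaches$total_amount)

data_clean   <- breaches[!flag, ]
data_outlier <- breaches[flag, ]
```

Flagging on either variable keeps the clean set free of extremes on both axes, at the cost of moving a record that is extreme on only one measure into data_outlier.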
First, I created boxplots for both the data_clean and data_outlier datasets to provide a summary of the distribution and demonstrate the importance of removing the outliers.
As expected, the original dataset has a few large-impact breaches affecting millions of records and costing billions of dollars.
By contrast, the dataset without outliers shows breaches with much smaller numbers of affected records and much lower dollar costs.
Histograms of affected_count and total_amount reveal that most breaches affect a small number of records and are relatively low-cost, with most breaches costing under $300,000.
To further understand the histogram data, I looked into the percentage of breaches that were “low-impact” (<50 records impacted) and “low-cost” (<$150,000) events in the dataset.
The result showed that about 74% of the breaches in the dataset were low-impact, and about 88% were low-cost.
## [1] "Percentage of IDs with affected_count < 50: 74.2132261195064"
## [1] "Percentage of IDs with total_amount < 300k: 98.9602107306253"
## [1] "Percentage of IDs with total_amount < 150k: 87.7582143352281"
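Percentages like those above can be computed as the mean of a logical comparison. The sketch below assumes that approach; the helper `pct_below` and the toy vectors are hypothetical and do not reproduce the figures reported above.

```r
# Hypothetical helper: percentage of values strictly below a threshold.
pct_below <- function(x, threshold) {
  100 * mean(x < threshold, na.rm = TRUE)
}

# Toy data: made-up values, NOT taken from the Cyentia dataset.
affected_count <- c(5, 12, 48, 50, 300, 7, 1200, 20)
total_amount   <- c(2000, 90000, 140000, 160000, 250000, 400000, 12000, 30000)

low_impact_pct <- pct_below(affected_count, 50)      # "low-impact" share
low_cost_pct   <- pct_below(total_amount, 150000)    # "low-cost" share
under_300k_pct <- pct_below(total_amount, 300000)

print(paste("Percentage of IDs with affected_count < 50:", low_impact_pct))
```

Because the comparison is strict, a breach with exactly 50 records is not counted as low-impact.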
Analysis by Type, Cause, and Sector revealed insights like higher costs associated with external breaches and notable cost and impact variations among causes and sectors, as discussed below.
The analysis showed that on average, external breaches are associated with a higher total cost, even though the average counts of Internal and External groups are similar.
On average, causes related to the “Former Consultant” group lead to high losses, despite affecting fewer records.
Interestingly, the related “Consultant” group leads to even higher losses and a higher number of records affected. Given more time, I would love to dig into this further, as it could form the basis of a valuable recommendation for organizations that rely on consulting services: strengthen cybersecurity measures when onboarding or offboarding consultants.
Furthermore, the “Terrorist” group tends to lead to both high losses and number of records involved in the breach, another interesting, albeit logical, finding worth exploring.
Lastly, the TTP group’s position indicates a better-managed risk in
third-party interactions. It would be interesting to look at the exact
mitigating steps taken by organizations for this group to see if they
could be replicated or extended to other groups as effectively.
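The per-cause comparisons above rest on group averages. As a rough sketch, and assuming a data frame with the cause, affected_count, and total_amount columns from the data dictionary, base R's `aggregate()` can produce them. The toy rows are invented and the original analysis may well have used dplyr instead.

```r
# Toy data: made-up values, NOT taken from the Cyentia dataset.
toy <- data.frame(
  cause          = c("Consultant", "Consultant", "TTP", "TTP", "Terrorist"),
  affected_count = c(100, 300, 10, 30, 500),
  total_amount   = c(40000, 60000, 1000, 3000, 80000)
)

# Mean records affected and mean cost for each cause.
by_cause <- aggregate(cbind(affected_count, total_amount) ~ cause,
                      data = toy, FUN = mean)

# Sort so the costliest causes come first.
by_cause <- by_cause[order(-by_cause$total_amount), ]
print(by_cause)
```

The same call with a different grouping column on the right-hand side of the formula would give the per-sector averages.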
Although “Agriculture” sector breaches affect few records, their outsized cost is notable and raises the question of what makes this sector particularly vulnerable to such high costs.
Next, I looked at the pattern of cyber breaches over time by Cause. One interesting insight was a “cost dip” in cyber attacks across most causes around 2017-2018, a trend worth investigating further to identify its cause.
Given more time, I would like to look into any changes in the regulatory environment that could explain this observation. For example, given that 2017 marked the first year of a new presidential administration in the US, the regulatory guidelines around reporting of cyber events could have either been relaxed, resulting in under-reporting of losses, or tightened, resulting in a “real” reduction of losses due to preventative measures and automation.
I would also like to investigate this relationship by month rather than by year, to see whether any seasonal trends emerge. For example, in the retail industry, do cyber attackers operate during specific times of the year? Does their behavior form any discernible pattern?
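The yearly view and the proposed monthly view differ only in the grouping key. The sketch below assumes breach_date is already parsed as a `Date` and derives `year` and `month` columns from it; both column names and the toy rows are assumptions for illustration.

```r
# Toy data: made-up values, NOT taken from the Cyentia dataset.
toy <- data.frame(
  breach_date  = as.Date(c("2017-01-15", "2017-01-28", "2017-11-24",
                           "2018-11-23", "2018-06-02")),
  total_amount = c(1000, 3000, 8000, 12000, 500)
)

# Derive grouping keys from the breach date.
toy$year  <- as.integer(format(toy$breach_date, "%Y"))
toy$month <- as.integer(format(toy$breach_date, "%m"))

# Average cost per year (trend) and per calendar month (seasonality).
by_year  <- aggregate(total_amount ~ year,  data = toy, FUN = mean)
by_month <- aggregate(total_amount ~ month, data = toy, FUN = mean)
```

Grouping by calendar month pools the same month across years, which is what a seasonality check needs. Grouping by year and month together would instead give a finer-grained time series.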
This analysis shows that the average cyber attack has become less costly over time, which prompted me to break this down by Cause, as shown below.
This plot shows a dramatic increase in cyber breaches in 2019, and
given more time I would like to investigate the possible underlying
factors further.
A scatter plot analysis indicates a weak positive relationship between affected_count and total_amount, suggesting the need for non-linear modeling.
It appears that a negative logarithmic line could better fit the scatter plot, and this insight could be useful to inform variable transformation decisions if we wanted to create a predictive model down the road.
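One way to explore a variable transformation of this kind is to compare the linear correlation on the raw scale with the correlation after taking logs of both variables. This is only a sketch of one possible transformation, not necessarily the fit described above. The toy data follow an exact power law chosen so the effect is easy to see, which real breach data would not.

```r
# Toy data: made-up values following total = 500 * sqrt(count) exactly.
# NOT taken from the Cyentia dataset.
affected_count <- c(1, 10, 100, 1000, 10000, 100000)
total_amount   <- 500 * sqrt(affected_count)

# Pearson correlation on the raw scale vs. the log-log scale.
cor_raw <- cor(affected_count, total_amount)
cor_log <- cor(log10(affected_count), log10(total_amount))

# Linear model on the transformed variables; the slope is the exponent
# of the assumed power-law relationship.
fit   <- lm(log10(total_amount) ~ log10(affected_count))
slope <- unname(coef(fit)[2])
```

Zero counts or zero costs would need handling before taking logs, for example with `log1p()`.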
These two bar plots help identify the sectors most affected by breach events, viewed through the lens of cost across all time.
The bar plots show that the top 3 sectors in terms of total cost are the financial, professional, and administrative sectors, while the top 3 sectors in terms of average cost are the agriculture, utilities, and transportation sectors.
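The two rankings differ because total cost rewards sectors with many breaches, while average cost rewards sectors with few but expensive ones. A minimal sketch of both rankings, using invented toy rows and the sector and total_amount columns from the data dictionary:

```r
# Toy data: made-up values, NOT taken from the Cyentia dataset.
toy <- data.frame(
  sector       = c("Finance", "Finance", "Finance", "Agriculture",
                   "Retail", "Retail", "Utilities"),
  total_amount = c(100, 200, 300, 500, 50, 70, 400)
)

# Total and average cost per sector.
sum_by  <- aggregate(total_amount ~ sector, data = toy, FUN = sum)
mean_by <- aggregate(total_amount ~ sector, data = toy, FUN = mean)

# Keep the three highest-ranked sectors under each measure.
top_total <- head(sum_by[order(-sum_by$total_amount), ], 3)
top_mean  <- head(mean_by[order(-mean_by$total_amount), ], 3)
```

In this toy example the sector with the most breaches leads on total cost, while a sector with a single expensive breach leads on average cost.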
Outlier Influence: The division of the dataset into data_clean and data_outlier revealed the significant impact outliers have on the overall dataset, particularly in the areas of affected_count and total_amount.
Prevalence of Low-Impact and Low-Cost Breaches: A substantial portion of the breaches are low-impact and low-cost, suggesting that while massive breaches capture headlines, smaller incidents are more common.
Sector Variability: Different sectors exhibit varied patterns in breach impact and cost. Financial, professional, and administrative sectors show the highest total costs, while agriculture, utilities, and transportation sectors have higher average costs per breach.
Temporal Trends: There was a notable dip in the cost of cyber attacks around 2017-2018, coinciding with potential regulatory or environmental changes. Additionally, 2019 saw a significant increase in the number of breaches.
Relationship Between Affected Count and Total Amount: A weak positive relationship was observed, indicating that higher affected counts don’t always correlate with higher financial costs. This suggests the complexity and multifaceted nature of cyber breaches.
Outlier Analysis: A detailed exploration of the data_outlier dataset could reveal insights into the nature of extreme breaches, potentially uncovering patterns or vulnerabilities unique to high-impact incidents.
Regulatory Impact Assessment: Investigating the dip in cyber attack costs around 2017-2018 in more detail could shed light on the influence of regulatory changes or other external factors on cyber breach reporting and management.
Seasonal and Monthly Trends: A finer-grained temporal analysis, breaking down the data by month, could reveal seasonal trends in cyber attacks, particularly in sectors like retail.
Consultant and Terrorist Groups Analysis: The high losses associated with breaches attributed to consultants and terrorist groups warrant a deeper investigation to understand the specific risks and develop targeted mitigation strategies.
Predictive Modeling: With the insights gained, there’s an opportunity to build predictive models to forecast breach impacts, aiding in proactive risk management.