Summary

This document presents a brief exploratory data analysis (“EDA”) of cyber breach data sourced from Cyentia Institute’s public GitHub repository.

Data Dictionary

Following best practices, I am including this data dictionary to help the reader better understand the analysis.


column	description
id	Unique record identifier
affected_count	Number of data records involved in the breach
total_amount	Dollar cost of the breach
naic_sector	Two-digit NAICS code of the industry sector
naic_national_industry	Full six-digit NAICS code for the breached company
sector	Text description of the naic_sector field
breach_date	Date the breach occurred
cause	High-level summary of the cause of the breach

ETL and Outlier Separation

In the initial stages of analyzing the cybersecurity data, I spotted some outliers in key variables like affected_count and total_amount.

To improve visualization and focus, I divided the dataset into data_clean (excluding outliers) and data_outlier for separate analysis.

Although this analysis focuses on the clean dataset, given more time I would explore the outlier dataset to see whether the relationships in the clean data hold there too, or whether new patterns emerge.

Outliers are particularly important to examine when analyzing breach cost and number of records affected. Breaches of that scale have a real impact on people’s lives and the economy, so they should be explored in detail to glean as much information as possible and help avoid repeat incidents.
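
The report does not state which rule was used to flag outliers. As one possible illustration, the sketch below splits records with Tukey's IQR fences (1.5 × IQR beyond the quartiles) in plain Python. The rule, the helper names `iqr_fences` and `split_outliers`, and the sample rows are all assumptions made for this example, not the method or data behind the report.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (low, high) Tukey fences for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(rows, columns, k=1.5):
    """Split rows into (clean, outlier).

    A row counts as an outlier if it falls outside the fences
    on ANY of the given columns (an assumed convention).
    """
    fences = {c: iqr_fences([r[c] for r in rows], k) for c in columns}

    def is_outlier(row):
        return any(
            not (fences[c][0] <= row[c] <= fences[c][1]) for c in columns
        )

    clean = [r for r in rows if not is_outlier(r)]
    outlier = [r for r in rows if is_outlier(r)]
    return clean, outlier


# Made-up records for illustration only; not taken from the Cyentia data.
rows = [
    {"id": 1, "affected_count": 10, "total_amount": 5_000},
    {"id": 2, "affected_count": 20, "total_amount": 8_000},
    {"id": 3, "affected_count": 30, "total_amount": 12_000},
    {"id": 4, "affected_count": 40, "total_amount": 9_000},
    {"id": 5, "affected_count": 50, "total_amount": 15_000},
    {"id": 6, "affected_count": 5_000_000, "total_amount": 2_000_000_000},
]

data_clean, data_outlier = split_outliers(
    rows, ["affected_count", "total_amount"]
)
```

Other rules (percentile cut-offs, z-scores on log-transformed values) would produce a different split, so the choice should be stated alongside any results.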

Boxplots

First, I created boxplots for both the data_clean and data_outlier datasets to provide a summary of the distribution and demonstrate the importance of removing the outliers.

As expected, the original dataset has a few large-impact breaches affecting millions of records and costing billions of dollars.

In contrast, the dataset without outliers shows breaches with much smaller numbers of affected records and much lower dollar costs.

Histograms

Histograms of affected_count and total_amount reveal that most breaches affect a small number of records and are relatively low-cost, with nearly all breaches costing under $300,000.

The percentage of “low-impact” and “low-cost” breaches

To further understand the histogram data, I looked into the percentage of breaches that were “low-impact” (<50 records affected) and “low-cost” (<$150,000) in the dataset.

The result showed that about 74% of the breaches in the dataset were low-impact, and about 88% were low-cost.

## [1] "Percentage of IDs with affected_count < 50: 74.2132261195064"

## [1] "Percentage of IDs with total_amount < 300k: 98.9602107306253"

## [1] "Percentage of IDs with total_amount < 150k: 87.7582143352281"
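
Percentages like the ones printed above come from counting the records below a threshold and dividing by the total. A minimal Python sketch of that arithmetic follows. The helper name `pct_below` and the sample values are hypothetical and do not reproduce the report's figures.

```python
def pct_below(values, threshold):
    """Percentage of values strictly below the threshold."""
    return 100 * sum(v < threshold for v in values) / len(values)


# Made-up values for illustration only.
affected_count = [5, 12, 40, 49, 50, 300, 10_000, 3]
total_amount = [
    20_000, 90_000, 149_999, 150_000,
    250_000, 1_200_000, 40_000, 75_000,
]

low_impact = pct_below(affected_count, 50)      # share with < 50 records
low_cost = pct_below(total_amount, 150_000)     # share costing < $150k
under_300k = pct_below(total_amount, 300_000)   # share costing < $300k
```

The comparison is strict, so a breach with exactly 50 records or exactly $150,000 in cost is not counted as low-impact or low-cost.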

Breaches by Type, Cause, and Sector

Analysis by Type, Cause, and Sector revealed insights like higher costs associated with external breaches and notable cost and impact variations among causes and sectors, as discussed below.

By Type

The analysis showed that, on average, external breaches are associated with a higher total cost, even though the average number of affected records is similar for the Internal and External groups.

By Cause

On average, causes related to the “Former Consultant” group lead to high losses, despite affecting fewer records.

Interestingly, a related “Consultant” group leads to even higher losses and number of records affected. Given more time, I would dig into this further, as it could form the basis of a valuable recommendation for organizations that rely on consulting services: strengthen cybersecurity measures when onboarding or offboarding consultants.

Furthermore, the “Terrorist” group tends to lead to both high losses and number of records involved in the breach, another interesting, albeit logical, finding worth exploring.

Lastly, the TTP group’s position indicates a better-managed risk in third-party interactions. It would be interesting to look at the exact mitigating steps taken by organizations for this group to see if they could be replicated or extended to other groups as effectively.

By Sector

Although breaches in the “Agriculture” sector affect few records, their outsized cost is notable and prompts the question of what makes this sector particularly vulnerable to such high costs.
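
The comparisons by Type, Cause, and Sector all rest on the same step: group the records by a categorical field, then average affected_count and total_amount within each group. A minimal Python sketch of that step is below. The helper `group_means` and the sample rows are hypothetical, and the `type` field name is an assumption, since it does not appear in the data dictionary.

```python
from collections import defaultdict
from statistics import mean


def group_means(rows, key, columns):
    """Average each column within the groups defined by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        name: {c: mean([m[c] for m in members]) for c in columns}
        for name, members in groups.items()
    }


# Made-up records for illustration only; "type" is an assumed field name.
rows = [
    {"type": "External", "affected_count": 100, "total_amount": 400_000},
    {"type": "External", "affected_count": 200, "total_amount": 600_000},
    {"type": "Internal", "affected_count": 140, "total_amount": 50_000},
    {"type": "Internal", "affected_count": 160, "total_amount": 150_000},
]

by_type = group_means(rows, "type", ["affected_count", "total_amount"])
```

Passing `"cause"` or `"sector"` as the key gives the other two breakdowns. Because means are sensitive to a few large values, group sizes are worth reporting next to each average.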

Breaches Over Time

Next, I looked at the pattern of cyber breaches over time by Cause. The most interesting observation was a “cost dip” across most causes around 2017-2018, a trend worth investigating further to identify what drove it.

Given more time, I would like to look into any changes in the regulatory environment that could explain this observation. For example, given that 2017 marked the first year of a new presidential administration in the US, the regulatory guidelines around reporting of cyber events could have either been relaxed, resulting in under-reporting of losses, or tightened, resulting in a “real” reduction of losses due to preventative measures and automation.

I would also like to investigate this relationship by month rather than by year, to see whether any seasonal trends can be found. For example, in the retail industry, do cyber attackers operate during specific times of the year? Does their behavior form any discernible pattern?

Average Breach Cost By Year Over Time

This analysis shows that the average cyber attack has become less costly over time, which prompted me to break this down by Cause, as shown below.

Average Breach Cost By Cause Over Time

Number of Cyber Breaches Per Year

This plot shows a dramatic increase in cyber breaches in 2019, and given more time I would like to investigate the possible underlying factors further.
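
The yearly views above reduce to extracting the year from breach_date, then counting breaches and averaging total_amount per year. A minimal Python sketch of that aggregation follows. It assumes breach_date is stored as an ISO `YYYY-MM-DD` string, which the report does not confirm, and the helper `yearly_summary` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def yearly_summary(rows):
    """Count breaches and average their cost for each calendar year."""
    costs_by_year = defaultdict(list)
    for row in rows:
        # Assumes ISO-formatted dates such as "2019-06-30".
        year = date.fromisoformat(row["breach_date"]).year
        costs_by_year[year].append(row["total_amount"])
    return {
        year: {"breaches": len(costs), "avg_cost": mean(costs)}
        for year, costs in sorted(costs_by_year.items())
    }


# Made-up records for illustration only.
rows = [
    {"breach_date": "2017-03-14", "total_amount": 100_000},
    {"breach_date": "2017-11-02", "total_amount": 300_000},
    {"breach_date": "2019-01-20", "total_amount": 50_000},
    {"breach_date": "2019-06-30", "total_amount": 70_000},
    {"breach_date": "2019-12-05", "total_amount": 90_000},
]

summary = yearly_summary(rows)
```

Grouping on `(year, month)` instead of `year` would give the monthly view, and grouping on `(year, cause)` would give the breakdown by Cause.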

Trends Analysis

A scatter plot analysis indicates a weak positive relationship between affected_count and total_amount, suggesting the need for non-linear modeling.

It appears that a negative logarithmic line could better fit the scatter plot, and this insight could be useful to inform variable transformation decisions if we wanted to create a predictive model down the road.
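
One way to check whether a logarithmic transformation helps is to compare the Pearson correlation on the raw values with the correlation on log10-transformed values. The Python sketch below does that with hand-written helpers. The function names and the sample data are hypothetical (the sample follows an exact power law by construction), so it illustrates the check rather than any finding about the breach data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def log10_all(values):
    """log10 of each value; every value must be strictly positive."""
    return [math.log10(v) for v in values]


# Made-up data for illustration only: cost = 1000 * sqrt(count).
affected_count = [1, 100, 10_000, 1_000_000]
total_amount = [1_000, 10_000, 100_000, 1_000_000]

r_raw = pearson(affected_count, total_amount)
r_log = pearson(log10_all(affected_count), log10_all(total_amount))
```

If the correlation is clearly stronger on the log scale, that supports transforming both variables before fitting a model. Records with a zero count or zero cost need separate handling, because their logarithm is undefined.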

Sector Impact Analysis

These two bar plots show which sectors are most affected by breach events, when viewed through the lens of cost across all time.

The bar plots show that:

Top 3 sectors by total cost: financial, professional, and administrative.
Top 3 sectors by average cost: agriculture, utilities, and transportation.
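
The two rankings differ because total cost rewards sectors with many breaches, while average cost rewards sectors with a few expensive ones. The Python sketch below shows that effect on made-up data. The helper names `sector_costs` and `top_n`, and the placeholder sector labels A to D, are hypothetical.

```python
from collections import defaultdict


def sector_costs(rows):
    """Total and average breach cost per sector."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["sector"]] += row["total_amount"]
        counts[row["sector"]] += 1
    return {
        s: {"total": totals[s], "average": totals[s] / counts[s]}
        for s in totals
    }


def top_n(summary, metric, n=3):
    """Sector names ranked by the chosen metric, largest first."""
    ranked = sorted(summary, key=lambda s: summary[s][metric], reverse=True)
    return ranked[:n]


# Made-up records for illustration only; sector labels are placeholders.
rows = (
    [{"sector": "A", "total_amount": 100}] * 5    # many small breaches
    + [{"sector": "B", "total_amount": 400}]      # one expensive breach
    + [{"sector": "C", "total_amount": 150}] * 2
    + [{"sector": "D", "total_amount": 50}]
)

summary = sector_costs(rows)
top_by_total = top_n(summary, "total", n=2)
top_by_average = top_n(summary, "average", n=2)
```

In this toy data, sector A leads on total cost purely through volume but drops out of the average-cost ranking, which is the same kind of divergence the two bar plots show.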

Concluding Remarks

Key Findings

Outlier Influence: The division of the dataset into data_clean and data_outlier revealed the significant impact outliers have on the overall dataset, particularly in the areas of affected_count and total_amount.
Prevalence of Low-Impact and Low-Cost Breaches: A substantial portion of the breaches are low-impact and low-cost, suggesting that while massive breaches capture headlines, smaller incidents are more common.
Sector Variability: Different sectors exhibit varied patterns in breach impact and cost. Financial, professional, and administrative sectors show the highest total costs, while agriculture, utilities, and transportation sectors have higher average costs per breach.
Temporal Trends: There was a notable dip in the cost of cyber attacks around 2017-2018, coinciding with potential regulatory or environmental changes. Additionally, 2019 saw a significant increase in the number of breaches.
Relationship Between Affected Count and Total Amount: A weak positive relationship was observed, indicating that higher affected counts don’t always correlate with higher financial costs. This suggests the complexity and multifaceted nature of cyber breaches.

Future Areas To Investigate

Outlier Analysis: A detailed exploration of the data_outlier dataset could reveal insights into the nature of extreme breaches, potentially uncovering patterns or vulnerabilities unique to high-impact incidents.
Regulatory Impact Assessment: Investigating the dip in cyber attack costs around 2017-2018 in more detail could shed light on the influence of regulatory changes or other external factors on cyber breach reporting and management.
Seasonal and Monthly Trends: A finer-grained temporal analysis, breaking down the data by month, could reveal seasonal trends in cyber attacks, particularly in sectors like retail.
Consultant and Terrorist Groups Analysis: The high losses associated with breaches attributed to consultants and terrorist groups warrant a deeper investigation to understand the specific risks and develop targeted mitigation strategies.
Predictive Modeling: With the insights gained, there’s an opportunity to build predictive models to forecast breach impacts, aiding in proactive risk management.

Janna Kiseeva Cyentia EDA Project