```r
# Load the packages used throughout this analysis
library(arrow)
library(dplyr)
library(ggplot2)
library(glue)

# Set up and extract your ZIP file
zip_path <- "data/network_logs.zip"  # UPDATE THIS PATH
outdir <- file.path(dirname(zip_path), "extracted_data")
dir.create(outdir, showWarnings = FALSE)
unzip(zip_path, exdir = outdir, overwrite = TRUE)

# Get list of CSV files
csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)
names(csv_files) <- tools::file_path_sans_ext(basename(csv_files))

# Open with Arrow - specify the main file you want to work with
networklogs <- open_dataset(csv_files[1], format = "csv")  # Adjust [1] as needed

# Check memory usage
glue("Memory used by Arrow object: {format(object.size(networklogs), units = 'KB')}")
```
Memory used by Arrow object: 0.5 Kb
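The footprint is tiny because `open_dataset()` only registers the file's schema; rows stay on disk until a query is collected. A minimal sketch of querying the dataset lazily, assuming the `networklogs` object created above:

```r
library(arrow)
library(dplyr)

# Inspect column names and types without reading any rows
networklogs$schema

# Count the rows: the aggregation runs in the Arrow engine and only the
# single-row result is pulled into R
networklogs %>%
  summarise(n = n()) %>%
  collect()
```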
Part 2: READY Framework Analysis
Work through each component of READY with your dataset:
R - Representative Data
Time period covered: January 2, 2023 to January 1, 2024
Geographic coverage: Worldwide coverage. This dataset is collected on network traffic and did not specify that it was collected in a single region or country.
Population represented: People who have access to the internet.
Potential biases or limitations: Lower income countries or poorer areas may be underrepresented, due to lack of access to internet.
E - Executive Driven Questions
How can we patch the most vulnerable spots for cybersecurity threats?
How can we detect cybersecurity attacks/threats before they happen?
How can we improve system resilience against cyber threats?
A - Analytical Framework
Analytical Approach:
Analysis Step 1: Gather summary statistics on the distributions and mean values of single variables. Look for outliers or irregularities, and check whether any distribution points to a key finding about the data (see the sketch after this list).
Analysis Step 2: Perform tests to identify patterns in the data and any correlations between variables that carry significant meaning.
Analysis Step 3: Extract meaning from any correlations found in the dataset, and interpret what the identified patterns mean in the context of the data.
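A minimal sketch of Step 1, with assumed column names (`bytes_transferred` for the numeric field and `protocol` for a categorical one; adjust to the actual schema):

```r
library(dplyr)

# Distribution of a single numeric variable, aggregated by Arrow
networklogs %>%
  summarise(
    mean_bytes = mean(bytes_transferred, na.rm = TRUE),
    min_bytes  = min(bytes_transferred, na.rm = TRUE),
    max_bytes  = max(bytes_transferred, na.rm = TRUE)
  ) %>%
  collect()

# Frequency table of a single categorical variable
networklogs %>%
  group_by(protocol) %>%
  summarise(n = n(), .groups = "drop") %>%
  collect()
```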
D - Data Best Practices
Missing data assessment: There is no missing data
Quality Concerns:
The dataset contains mostly categorical variables, which may limit the range of graphical visualizations available.
The dataset does not specify the conditions under which the network logs were collected, and logs collected in different environments may behave differently.
Y - Your Insights
Initial hypothesis: There is a correlation between at least one variable and the threat label of the network logs, which allows for effective detection of malicious network logs.
Expected findings:
I expect to find a correlation between threat_label and protocol, such that some protocols (for example, HTTP) show a higher ratio of suspicious network logs.
I expect to find correlations between threat_label and request_path, which would be key to detecting cyber threats before they happen.
It would surprise me if there were no patterns that assist in predicting cyber threats.
DATASET OVERVIEW:
- Records: 6 million, representing network logs
- Time span: one year, from January 1, 2024 to December 30, 2024
- Key metrics: the dataset contains 10 columns and 6 million rows
DATA COMPLETENESS:
- Core fields: 100% complete
- Variable 1: 100% complete
- Variable 2: 100% complete
This dataset contains no missing data.
DATA QUALITY STRENGTHS:
1. The data captures the network log fields that are commonly tampered with in cyber threats.
2. The data is recorded directly from network logs, so it is unlikely to carry the kinds of bias introduced by human reporting or judgment.
DATA QUALITY CONCERNS:
1. The environment in which the network logs were collected is unknown, so the results of the analysis may not transfer to real-world situations.
2. The data contains only 10 variables, only one of which is numerical, which complicates the analysis.
3. request_path needs careful handling: it has many unique values that differ only slightly from one another, and this single variable may hide multiple distinct patterns.
MISSING DATA IMPACT:
- Most missing: none, at 0%
- Impact on analysis: every variable has 0% missing data, as a network log must contain all fields before a request is sent
- Handling strategy: handling missing data is not necessary, as there is no missing data
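A minimal sketch of how this completeness could be verified directly with Arrow, assuming the `networklogs` dataset from Part 1 and an arrow version that supports the `.data` pronoun in queries:

```r
library(dplyr)

# Count missing values per column; each aggregation runs in Arrow and only
# the per-column totals are pulled back into R.
cols <- names(networklogs)
missing_pct <- sapply(cols, function(col) {
  counts <- networklogs %>%
    summarise(n_total = n(), n_missing = sum(is.na(.data[[col]]))) %>%
    collect()
  100 * counts$n_missing / counts$n_total
})
missing_pct
```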
JUSTIFICATION: The dataset contains only one numerical variable, and no key findings emerge from analyzing single variables in isolation. The meaningful findings lie in relationships between variables, and those relationships may be hidden and hard to spot.
Research Question #1: How common are cyber attacks?
The following analysis examines the variables action and threat_label to understand how common cyber threats are and how often malicious and suspicious logs are allowed through.
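The code that builds `action_prop_malicious` is not echoed in the rendered output. A minimal sketch of one plausible way to derive it, assuming the threat_label and action columns used in the plot that follows:

```r
library(dplyr)

# Count allowed vs. blocked logs within each threat label, then convert to
# percentages; the counting runs in Arrow and only the small summary is collected.
action_prop_malicious <- networklogs %>%
  group_by(threat_label, action) %>%
  summarise(count = n(), .groups = "drop") %>%
  collect() %>%
  group_by(threat_label) %>%
  mutate(percentage = 100 * count / sum(count)) %>%
  ungroup()
```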
```r
ggplot(action_prop_malicious,
       aes(x = threat_label, y = percentage, fill = as.factor(action))) +
  geom_bar(stat = "identity", position = "fill") +
  labs(x = "Threat Label", y = "Proportion", fill = "Value",
       title = "Proportion of Blocked/Allowed Network Logs by Threat Label") +
  # position = "fill" rescales each bar to sum to 1, so the default
  # percent_format() (which multiplies by 100) labels the axis correctly
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal()
```
Among malicious and suspicious logs, roughly 50% are blocked, when optimally 100% of malicious and suspicious network logs should be blocked.
Research Question #1 - Results:
Of all network logs, 92% are benign and harmless, 6% are suspicious and potentially harmful, and 2% are malicious. Roughly 50% of malicious and suspicious logs are allowed through, when the goal is to detect and block 100% of them.
Research question #2: Are there any patterns in the request path of network logs that allow for threat detection?
Every network log has a request path, which refers to the specific route or URL that a request takes from a client to a server when accessing resources on the internet. The goal of this analysis is to find patterns in the request paths of malicious and suspicious network logs.
```r
# Step 1: Flag whether each request path contains a query string and count by threat label
binary_data <- networklogs %>%
  mutate(query_string = ifelse(grepl("\\?", request_path),
                               "Contains Query String", "No Query String")) %>%
  group_by(threat_label, query_string) %>%
  summarise(count = n(), .groups = "drop") %>%  # Count occurrences
  collect()

# Step 2: Create the bar plot
ggplot(binary_data, aes(x = threat_label, y = count, fill = query_string)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Network Logs with a Query String by Threat Label",
       x = "Threat Label",
       y = "# of Network Logs")
```
An analysis of the request paths reveals a key pattern that allows for effective detection of malicious and suspicious logs: every malicious and suspicious request path in the dataset contains a query string (marked by a "?" symbol in the request path), a feature commonly exploited by attackers to expose sensitive information, while no benign network log contained a query string in its request.
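If this pattern holds on new data, it suggests a very simple screening rule. The sketch below is a hypothetical helper, not part of the original analysis: it flags every log whose request path contains a query string for closer review.

```r
library(dplyr)

# Hypothetical screening rule based on the query-string finding above:
# keep only logs whose request_path contains "?" (filtering runs in Arrow).
flag_query_strings <- function(logs) {
  logs %>%
    filter(grepl("\\?", request_path)) %>%
    collect()
}

candidates <- flag_query_strings(networklogs)
nrow(candidates)
```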
Research Question #3: How is the number of bytes transferred related to the intention of a network log?
Do malicious and suspicious network logs transfer more bytes of data, or fewer? Is there a pattern in the number of bytes transferred that can be used to detect suspicious network logs?
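The summary behind the finding below is not echoed in this document; a minimal sketch, assuming the numeric column is named `bytes_transferred`, of comparing its distribution across threat labels:

```r
library(dplyr)

# Compare bytes transferred across threat labels; the aggregation runs in Arrow
networklogs %>%
  group_by(threat_label) %>%
  summarise(
    mean_bytes = mean(bytes_transferred, na.rm = TRUE),
    min_bytes  = min(bytes_transferred, na.rm = TRUE),
    max_bytes  = max(bytes_transferred, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  collect()
```

A boxplot of a collected sample of `bytes_transferred` by `threat_label` would visualize the same comparison.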
The analysis shows that the number of bytes transferred is not a good indicator of whether a network log is malicious. The distribution of bytes transferred is essentially the same for benign logs and for malicious/suspicious logs, which means the number of bytes transferred has no correlation with a log's intention.
Limitations:
The dataset does not include the reasoning behind the action variable, so network logs that were blocked could have been blocked for many different reasons.
Network logs can differ greatly depending on their purpose and environment, and since this dataset does not specify the environment in which the logs were recorded, the analysis results may not represent some real-world scenarios.
Cybersecurity is an ever-evolving issue, and attackers constantly find new tricks and vulnerabilities to exploit, so there is no certain way to detect every cyber threat.
Next Steps:
The next step is to build an algorithm that continuously reads incoming network logs, stores them in a dataset, and flags any patterns that may correspond to newly developed cyber threats.
Deliverables Checklist
Ensure your submission includes:
Complete READY framework analysis with thoughtful responses
Systematic SCAN framework exploration with specific findings
Successful data loading with Arrow
Professional data description and summary statistics
Comprehensive missing value analysis with percentages
Variable summary table documenting key fields
Memory efficiency demonstration
3-5 well-defined, specific exploratory research questions
Data quality assessment with honest evaluation
Professional summary with clear next steps
Grading Criteria
READY Framework (20%): Thoughtful strategic planning showing understanding of stakeholders and analytical approach
Data Loading (15%): Successful Arrow implementation with proper documentation
SCAN Framework (25%): Systematic exploration with specific, meaningful findings
Data Quality Assessment (20%): Comprehensive evaluation with specific evidence
Research Questions (15%): Clear, answerable questions tied to stakeholder needs and data capabilities
Professional Communication (5%): Clear, honest, well-organized presentation throughout
Tips for Success
Be specific in your observations - avoid vague statements
Think like a stakeholder - what would decision-makers actually want to know?
Document your reasoning for all assessment decisions
Be honest about limitations - this builds credibility
Focus on actionable insights - what can actually be learned from this data?
Ask for help if your data format doesn’t match the provided templates
Remember: This is exploratory data analysis - you’re learning about your data, not proving predetermined hypotheses. Let your curiosity guide your investigation while maintaining systematic rigor.