Final Project - Part 2

Network Logs Cybersecurity Exploratory Analysis Project, Part 2

Part 1: Data Setup and Loading

Step 1: Extract and Load Your Data

Use the appropriate code pattern below based on your data format:

LOAD LIBRARIES

required_packages <- c("tidyverse", "arrow", "DBI", "glue", "duckdb")

for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Attach all packages; invisible() suppresses lapply's printed list output
invisible(lapply(required_packages, library, character.only = TRUE))
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: 'arrow'


The following object is masked from 'package:lubridate':

    duration


The following object is masked from 'package:utils':

    timestamp

For ZIP files containing CSV(s):

# Set up and extract your ZIP file
zip_path <- "data/network_logs.zip"  # UPDATE THIS PATH
outdir <- file.path(dirname(zip_path), "extracted_data")
dir.create(outdir, showWarnings = FALSE)
unzip(zip_path, exdir = outdir, overwrite = TRUE)

# Get list of CSV files
csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)
names(csv_files) <- tools::file_path_sans_ext(basename(csv_files))

# Open with Arrow - specify the main file you want to work with
networklogs <- open_dataset(csv_files[1], format = "csv")  # Adjust [1] as needed

# Check memory usage
glue("Memory used by Arrow object: {format(object.size(networklogs), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
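
Why is this so small? The Arrow object holds only schema metadata and file pointers; the data itself stays on disk until collected. As a contrast, here is a minimal sketch (the protocol_col name is introduced here purely for illustration) showing that collecting even a single column pulls its values into R memory:

# Collect one column into R; the object size now grows with the row count
protocol_col <- networklogs |> select(protocol) |> collect()
format(object.size(protocol_col), units = "MB")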

Part 2: READY Framework Analysis

Work through each component of READY with your dataset:

R - Representative Data

Time period covered: January 1, 2024 to December 30, 2024 (consistent with the dataset overview and the timestamps in the glimpse in Part 3)

Geographic coverage: Worldwide. The dataset was collected from network traffic and does not indicate that collection was limited to a single region or country.

Population represented: People who have access to the internet.

Potential biases or limitations: Lower-income countries and regions may be underrepresented due to more limited internet access.

E - Executive Driven Questions

  1. How can we patch the most vulnerable spots for cybersecurity threats?
  2. How can we detect cybersecurity attacks and threats before they happen?
  3. How can we harden systems against cyber threats?

A - Analytical Framework

Analytical Approach:

  1. Analysis Step 1: Gather statistics on the distributions and mean values of single variables. Look for outliers or irregularities, and check whether any distribution points to a key finding about the data.
  2. Analysis Step 2: Perform tests to identify patterns in the data and any meaningful correlations between variables (see the sketch after this list).
  3. Analysis Step 3: Extract meaning from any correlations found in the dataset, and interpret the patterns in the context of the data.
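
As an illustration of Step 2, here is a minimal sketch (assuming the networklogs Arrow dataset opened in Part 1) of testing whether two categorical variables, such as protocol and threat_label, are associated:

# Aggregate counts inside Arrow, then run a chi-squared test on the
# small contingency table that results
counts <- networklogs |>
  count(protocol, threat_label) |>
  collect()

contingency <- xtabs(n ~ protocol + threat_label, data = counts)
chisq.test(contingency)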

D - Data Best Practices

Missing data assessment: There is no missing data (verified with the missing-value check in Part 3).

Quality Concerns:

  1. The dataset contains mostly categorical variables, which may limit the range of graphical visualizations available.
  2. The dataset does not specify the conditions under which the network logs were collected, and logs from different environments may behave differently.

Y - Your Insights

Initial hypothesis: There is a correlation between at least one variable and the threat label of the network logs, which would allow for effective detection of malicious logs.

Expected findings:

  1. I expect to find a correlation between threat_label and protocol, such that some protocols, for example HTTP, show a higher ratio of suspicious network logs.
  2. I expect to find correlations between threat_label and request_path, which would be key in detecting cyber threats before they happen.
  3. It would surprise me if no patterns emerged that assist in predicting cyber threats.

Part 3: Data Quality Assessment Summary

S - Stakeholders (Revisited)

networklogs |>
  glimpse()
FileSystemDataset with 1 csv file
6,000,000 rows x 10 columns
$ timestamp  <timestamp[s]> 2024-04-30 20:00:00, 2024-07-17 20:00:00, 2024-04-06…
$ source_ip        <string> "192.168.1.125", "192.168.1.201", "192.168.1.248", "…
$ dest_ip          <string> "192.168.1.124", "192.168.1.201", "192.168.1.15", "1…
$ protocol         <string> "TCP", "ICMP", "HTTP", "HTTP", "ICMP", "FTP", "UDP",…
$ action           <string> "blocked", "blocked", "allowed", "allowed", "allowed…
$ threat_label     <string> "benign", "benign", "benign", "benign", "benign", "b…
$ log_type         <string> "firewall", "application", "application", "applicati…
$ bytes_transferred <int64> 10889, 36522, 20652, 5350, 40691, 15742, 30134, 1610…
$ user_agent       <string> "Nmap Scripting Engine", "Nmap Scripting Engine", "M…
$ request_path     <string> "/", "/", "/login", "/login", "/", "/backup", "/", "…

After examining the data structure, who else might be interested? Anyone interested in finding patterns in network logs, including adversaries such as hackers.

What specific questions would they have? How can I exploit a system through a network log without using a known pattern?

What concerns might they have about data quality? The documentation does not specify how the data was collected or in what environment.

C - Columns and Coverage

Create a summary table of your variables:

Variable           Description
timestamp          Date and time of the network log
source_ip          IP address initiating the connection
dest_ip            Destination IP address receiving the connection
protocol           Communication protocol of the network log
action             Whether the connection was blocked or allowed
threat_label       Classification as benign, suspicious, or malicious
log_type           Type of system logging the event
bytes_transferred  Size of the data transferred, in bytes
user_agent         Software used to initiate the request
request_path       Path accessed in the request

A - Aggregates: Overall Picture

# Get comprehensive dataset statistics
ncol(networklogs)
[1] 10
nrow(networklogs)
[1] 6000000
# Statistics of numeric variables
networklogs |>
  select(where(is.numeric)) |>
  collect() |>
  summary()
 bytes_transferred
 Min.   :  100    
 1st Qu.:12584    
 Median :25036    
 Mean   :25046    
 3rd Qu.:37519    
 Max.   :50000    
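
A more memory-friendly variant (a sketch, not a required change): the same aggregates can be computed inside Arrow so that only a one-row result is collected, instead of materializing all 6 million values first. Only functions with Arrow bindings, such as min(), mean(), and max(), are used here.

# Aggregate inside Arrow; collect() retrieves just one row
networklogs |>
  summarise(
    min_bytes  = min(bytes_transferred),
    mean_bytes = mean(bytes_transferred),
    max_bytes  = max(bytes_transferred)
  ) |>
  collect()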

N - Notable Segments

# Analyze key categorical variables
# Modify based on your specific data
networklogs |>
  select(threat_label) |>
  collect() |>
  table()
threat_label
    benign  malicious suspicious 
   5517611     121506     360883 
networklogs |>
  select(protocol) |>
  collect() |>
  table()
protocol
    FTP    HTTP   HTTPS    ICMP     SSH     TCP     UDP 
 300220 1201528 1199970  599903  300464 1497493  900422 
networklogs |>
  select(action) |>
  collect() |>
  table()
action
allowed blocked 
3000646 2999354 
networklogs |>
  select(log_type) |>
  collect() |>
  table()
log_type
application    firewall         ids 
    2001768     1998676     1999556 
networklogs |>
  select(user_agent) |>
  collect() |>
  table() |>
  prop.table()
user_agent
                                                                                                          curl/7.64.1 
                                                                                                            0.2001232 
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 
                                                                                                            0.2000103 
      Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 
                                                                                                            0.2001203 
                                                                                                Nmap Scripting Engine 
                                                                                                            0.1999425 
                                                                                                       SQLMap/1.6-dev 
                                                                                                            0.1998037 
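
The select()/collect()/table() pattern above materializes each full column in R before tabulating. A sketch of an equivalent Arrow-native pattern, which aggregates before collecting so only the small summary crosses into R:

# Count threat labels inside Arrow; only a 3-row table is collected
networklogs |>
  count(threat_label) |>
  collect() |>
  mutate(prop = n / sum(n))
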
missing_percentage <- networklogs |>
  summarise(across(everything(), ~ mean(is.na(.)) * 100, .names = "missing_percentage_{col}")) |>
  collect()
missing_percentage
# A tibble: 1 × 10
  missing_percentage_timestamp missing_percentage_sourc…¹ missing_percentage_d…²
                         <dbl>                      <dbl>                  <dbl>
1                            0                          0                      0
# ℹ abbreviated names: ¹​missing_percentage_source_ip,
#   ²​missing_percentage_dest_ip
# ℹ 7 more variables: missing_percentage_protocol <dbl>,
#   missing_percentage_action <dbl>, missing_percentage_threat_label <dbl>,
#   missing_percentage_log_type <dbl>,
#   missing_percentage_bytes_transferred <dbl>,
#   missing_percentage_user_agent <dbl>, …

Complete this comprehensive assessment:

DATASET OVERVIEW: - Records: [6 million] representing [network logs] - Time span: [one year] from [January 1, 2024] to [December 30, 2024] - Key metrics: The dataset contains 10 columns and 6 million rows

DATA COMPLETENESS: - Core fields: [100%] complete - Variable 1: [100%] complete - Variable 2: [100%] complete. This dataset contains no missing data.

DATA QUALITY STRENGTHS: 1. [The data contains fields, such as request_path and user_agent, that are commonly tampered with in cyber attacks] 2. [The data is machine-generated log output rather than self-reported, which limits human reporting bias]

DATA QUALITY CONCERNS: 1. [The environment in which the network logs were collected is unknown, so the results of the analysis may not apply to real-world situations] 2. [The data contains only 10 variables, just one of which is numeric, which constrains the analysis] 3. [request_path needs careful handling: it has many unique values that differ only slightly, and multiple patterns may be hidden within this single variable]

MISSING DATA IMPACT: - Most missing: [None] at [0]% - Impact on analysis: [Every variable has 0% missing data, as a network log must contain all of these fields before a request is sent] - Handling strategy: [No handling is necessary, since there is no missing data]

RELIABILITY ASSESSMENT: - Most reliable variables: threat_label, log_type, action, protocol - Variables needing caution: user_agent, and request_path - Overall confidence level: [Medium]

JUSTIFICATION: The dataset contains only one numeric variable, and single-variable analysis appears to yield few key findings. Every finding will come from relationships between variables, and these may be hidden and hard to spot.

Research Question #1: How common are cyber attacks?

The following analysis delves into the variables action and threat_label in order to understand how common cyber threats are and how often malicious and suspicious logs are allowed through.

networklogs |>
  filter(!is.na(threat_label)) |>
  pull(threat_label, as_vector = TRUE) |>  # silences the Arrow pull() deprecation warning
  table() |>
  prop.table()

    benign  malicious suspicious 
0.91960183 0.02025100 0.06014717 

About 92% of network logs are benign, 6% are suspicious, and only 2% are malicious.

# BENIGN LOGS
action_prop_benign <- networklogs |>
  filter(threat_label == "benign") |>
  count(action) |>
  collect() |>
  mutate(prop = n / sum(n))

# Bar plot of action taken for benign logs
ggplot(action_prop_benign, aes(x = action, y = prop)) +
  geom_col()

Of benign network logs, 50% are allowed and 50% are blocked.

action_malicious <- networklogs |>
  filter(threat_label == "malicious" | threat_label == "suspicious") |>
  collect()

# Within each threat label, compute the percentage of each action
action_prop_malicious <- action_malicious |>
  group_by(threat_label, action) |>
  summarize(count = n(), .groups = "drop_last") |>
  mutate(percentage = count / sum(count) * 100) |>
  ungroup()

ggplot(action_prop_malicious, aes(x = threat_label, y = percentage, fill = action)) +
  geom_col(position = "fill") +
  labs(x = "Threat Label", y = "Proportion", fill = "Action",
       title = "Proportion of Blocked/Allowed Network Logs by Threat Label") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal()

As for malicious and suspicious logs, roughly 50% are blocked, when ideally 100% of malicious and suspicious network logs should be blocked.

Research Question #1 - Results:

Of all network logs, 92% are benign and harmless, 6% are suspicious and can be harmful, and 2% are malicious. Yet roughly half of the malicious and suspicious logs are allowed through, when the goal is to detect and block 100% of them.

Research Question #2: Are there any patterns in the request path of network logs that allow for threat detection?

Every network log has a request path, which refers to the specific route or URL that a request takes from a client to a server when accessing resources on the internet. In this analysis, the goal is to find a pattern in the request paths of malicious and suspicious network logs.

# Step 1: Flag whether each request path contains a query string
binary_data <- networklogs |>
  mutate(query_string = ifelse(grepl("\\?", request_path),
                               "Contains Query String", "No Query String")) |>
  group_by(threat_label, query_string) |>
  summarise(count = n(), .groups = "drop") |>
  collect()

# Step 2: Create the bar plot
ggplot(binary_data, aes(x = threat_label, y = count, fill = query_string)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Network Logs with a Query String by Threat Label",
       x = "Threat Label",
       y = "# of Network Logs")

binary_data
# A tibble: 3 × 3
  threat_label query_string            count
  <chr>        <chr>                   <int>
1 benign       No Query String       5517611
2 suspicious   Contains Query String  360883
3 malicious    Contains Query String  121506

Research Question #2 - Results:

An analysis of the request paths of network logs shows a key pattern that allows for effective detection of malicious and suspicious logs. Every malicious and suspicious request path in the dataset contains a query string (marked by a “?” in the request path), a feature commonly exploited by attackers to reveal sensitive information, while no benign network log contained a query string in its request.
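
A hedged sketch of how this finding could become a screening rule (the flag_query column is introduced here for illustration and is not part of the dataset):

# Flag logs whose request path contains a query string, then check how
# the flagged logs break down by threat label
networklogs |>
  mutate(flag_query = grepl("\\?", request_path)) |>
  filter(flag_query) |>
  count(threat_label) |>
  collect()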

Limitations:

  1. The dataset does not include a reason for the action variable, so network logs that were blocked could have been blocked for multiple different reasons.
  2. Network logs can differ greatly depending on their purpose and environment, and since this dataset does not specify the environment in which the logs were recorded, the analysis results may not represent some real-world scenarios.
  3. Cybersecurity is an ever-evolving issue; attackers constantly find new tricks and vulnerabilities to exploit, so there is no certain way to detect every cyber threat.

Next Steps:

The next step is to build a process that continuously reads incoming network logs, stores them in a dataset, and flags patterns that may correspond to newly developed cyber threats (a sketch of one building block follows).
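
A speculative sketch of one such building block: a helper that scores a batch of newly collected logs with the query-string rule and reports the share flagged. The function name score_batch and its output columns are assumptions, not part of this project's pipeline.

# Score a batch of logs (a plain data frame) with the query-string rule
score_batch <- function(batch) {
  batch |>
    mutate(flagged = grepl("\\?", request_path)) |>
    summarise(n_logs = n(), flagged_share = mean(flagged))
}

# Example usage on the first 1,000 rows of the existing dataset
score_batch(collect(head(networklogs, 1000)))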

Deliverables Checklist

Ensure your submission includes:

  • Complete READY framework analysis with thoughtful responses

  • Systematic SCAN framework exploration with specific findings

  • Successful data loading with Arrow

  • Professional data description and summary statistics

  • Comprehensive missing value analysis with percentages

  • Variable summary table documenting key fields

  • Memory efficiency demonstration

  • 3-5 well-defined, specific exploratory research questions

  • Data quality assessment with honest evaluation

  • Professional summary with clear next steps

Grading Criteria

  • READY Framework (20%): Thoughtful strategic planning showing understanding of stakeholders and analytical approach

  • Data Loading (15%): Successful Arrow implementation with proper documentation

  • SCAN Framework (25%): Systematic exploration with specific, meaningful findings

  • Data Quality Assessment (20%): Comprehensive evaluation with specific evidence

  • Research Questions (15%): Clear, answerable questions tied to stakeholder needs and data capabilities

  • Professional Communication (5%): Clear, honest, well-organized presentation throughout

Tips for Success

  • Be specific in your observations - avoid vague statements

  • Think like a stakeholder - what would decision-makers actually want to know?

  • Document your reasoning for all assessment decisions

  • Be honest about limitations - this builds credibility

  • Focus on actionable insights - what can actually be learned from this data?

  • Ask for help if your data format doesn’t match the provided templates

Remember: This is exploratory data analysis - you’re learning about your data, not proving predetermined hypotheses. Let your curiosity guide your investigation while maintaining systematic rigor.