Working on actual problems is central to learning. This is your first problem set. The assignments consist of analysis of cyber attacks using basic data management approaches and visualization tools (in R). Late submissions will not be accepted without prior permission. Students are encouraged to discuss the problems together, but must independently produce and submit solutions.

Show your solutions in the chunks below.

Before we dive in

Familiarize yourself with RMarkdown

Check this video to familiarize yourself with the RMarkdown interface

Rename this file

You are in the Problem-Set-1.Rmd file now. First, close this file and rename it using your last and first name: Last-Name-First-Name-Problem-Set-1.Rmd. Reopen the file. Good!

0 Getting started

Loading packages

Loading packages is boring and time-consuming. First, you need to install the packages. Second, you need to load them into R’s environment. Delete # before install.packages("pacman") and run this chunk of code. Now this package is installed on your system. Put # back.

#install.packages("pacman")
#library("pacman")

There is an easier way: the pacman package. It checks whether the package you want (say dplyr) is already installed on your laptop, installs it if it is not, and loads it. There is no need to put package names in quotes. The following chunk should install and load all the packages that you might need for this problem set. Just execute it.

Loading datasets

We will use four datasets for this problem set. I describe them in detail later. First, we need to upload these datasets in R memory and store them as data-objects.

Here is the list of data-objects we need and their corresponding files:

Data objects for this assignment	File
`d`	“cyberattacks-across-the-globe-cases.csv”
`d.attacks.by.year`	“cyberattacks-by-year.csv”
`d.attacks.by.year.and.method`	“cyberattacks-by-year-and-method.csv”
`d.attacks.by.attack_on`	“cyberattacks-by-attack_on.csv”

All files are stored in your “0-Data” subfolder. To read them and store them, we use the fread() function from the data.table package:

d <- fread("cyberattacks-across-the-globe-cases.csv")
d.attacks.by.year <- fread("cyberattacks-by-year.csv")
d.attacks.by.year.and.method <- fread("cyberattacks-by-year-and-method.csv")
d.attacks.by.attack_on <- fread("cyberattacks-by-attack_on.csv" )
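
The text says the files live in the “0-Data” subfolder, while the calls above use bare file names, which only works when the working directory is already that folder. A minimal sketch, assuming the Rmd file sits one level above “0-Data” (an assumption about your folder layout, not something the assignment states), builds full paths with file.path() and checks them before reading:

```r
# Assumption: "0-Data" is a subfolder of the current working directory.
data.dir <- "0-Data"

files <- c(
  d                            = "cyberattacks-across-the-globe-cases.csv",
  d.attacks.by.year            = "cyberattacks-by-year.csv",
  d.attacks.by.year.and.method = "cyberattacks-by-year-and-method.csv",
  d.attacks.by.attack_on       = "cyberattacks-by-attack_on.csv"
)

# file.path() joins folder and file name with "/" on every platform
paths <- file.path(data.dir, files)
names(paths) <- names(files)

# TRUE for each file that R can find from the working directory
file.exists(paths)

# With data.table loaded and the file present, reading would look like:
# d <- fread(paths[["d"]])
```

If file.exists() returns FALSE, check the working directory with getwd() before changing the code.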

Now, we can start! Good luck!

1 Sources and targets of cyber-attacks (1 point)

1.a Explore objects with data on cyber attacks

We start with the most detailed dataset: d. Each row in d represents a cyber-attack. In the chunk below, use function names() to print the names of variables (columns) in d

names(d)

## [1] "source"    "target"    "year"      "attack_on" "method"    "success"  
## [7] "num"

What do we know about each attack?

source - a territory (country) from which the attack was organized

target - a territory (country) where the target (e.g., a private firm or a state agency) was located

year - when the attack took place. Later in this class, we will use precise dates of attacks

attack_on - who was under attack (private firms, state agencies, or military objects)

method - the method used for this attack

success - a dummy-variable. It takes a value of 1 if attackers achieved their goals (money, concessions, physical damage, etc) and 0 otherwise

num - a helper variable that always takes a value of 1. We will often use it to aggregate data (summing num counts attacks).

We can check what is inside any data object in different ways. For example, we can use glimpse() (from the dplyr package) to check the number of rows and columns, the names and types of the columns (variables), as well as some example values from these columns.

glimpse(d)

## Rows: 266
## Columns: 7
## $ source    <chr> "US", "US", "Russia", "Russia", "Russia", "US", "Russia", "R…
## $ target    <chr> "Russia", "Russia", "US", "US", "US", "Russia", "US", "US", …
## $ year      <int> 2008, 2008, 2008, 2008, 2008, 2008, 2009, 2009, 2011, 2013, …
## $ attack_on <chr> "Government", "Government", "Government", "Government", "Mil…
## $ method    <chr> "Intrusion", "Infiltration", "Infiltration", "Infiltration",…
## $ success   <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

In fact, we can print the content of d by putting it in a chunk without any functions (check it by deleting ‘#’ in the chunk below; do not forget to put ‘#’ back). Do not do this in your reports: R will print all rows of d and your reports will be impossible to read.

#d

In the chunk below, check the structure of your d using at least 2 functions you learnt in DataCamp’s “Intro to R”.

str(d)

## Classes 'data.table' and 'data.frame':   266 obs. of  7 variables:
##  $ source   : chr  "US" "US" "Russia" "Russia" ...
##  $ target   : chr  "Russia" "Russia" "US" "US" ...
##  $ year     : int  2008 2008 2008 2008 2008 2008 2009 2009 2011 2013 ...
##  $ attack_on: chr  "Government" "Government" "Government" "Government" ...
##  $ method   : chr  "Intrusion" "Infiltration" "Infiltration" "Infiltration" ...
##  $ success  : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ num      : int  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

summary(d)

##     source             target               year       attack_on        
##  Length:266         Length:266         Min.   :2000   Length:266        
##  Class :character   Class :character   1st Qu.:2008   Class :character  
##  Mode  :character   Mode  :character   Median :2011   Mode  :character  
##                                        Mean   :2011                     
##                                        3rd Qu.:2014                     
##                                        Max.   :2016                     
##     method             success             num   
##  Length:266         Min.   :0.00000   Min.   :1  
##  Class :character   1st Qu.:0.00000   1st Qu.:1  
##  Mode  :character   Median :0.00000   Median :1  
##                     Mean   :0.04511   Mean   :1  
##                     3rd Qu.:0.00000   3rd Qu.:1  
##                     Max.   :1.00000   Max.   :1

dim(d)

## [1] 266   7
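
A few other base R functions answer the same questions one at a time. The sketch below uses only the first four rows of d, copied from the glimpse() output above, so that it runs without the CSV files; the object name d.small is introduced here for illustration.

```r
# First four rows of `d`, copied from the glimpse() output above
d.small <- data.frame(
  source    = c("US", "US", "Russia", "Russia"),
  target    = c("Russia", "Russia", "US", "US"),
  year      = c(2008L, 2008L, 2008L, 2008L),
  attack_on = c("Government", "Government", "Government", "Government"),
  method    = c("Intrusion", "Infiltration", "Infiltration", "Infiltration"),
  success   = c(0L, 0L, 0L, 0L),
  num       = c(1L, 1L, 1L, 1L)
)

nrow(d.small)            # number of rows
ncol(d.small)            # number of columns
sapply(d.small, class)   # type of each column
head(d.small, 2)         # first two rows only
```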

1.b Explore cyberattacks with: Vectors

Let’s explore targets and sources of cyberattacks. For example, to obtain a vector of countries that were under attack at least once, we can use unique() on variable target inside d:

target.countries <- unique(d$target)
target.countries

##  [1] "Russia"       "US"           "Iran"         "China"        "N Korea"     
##  [6] "Canada"       "UK"           "France"       "Germany"      "Poland"      
## [11] "Estonia"      "Lithuania"    "Ukraine"      "Georgia"      "Turkey"      
## [16] "Israel"       "Saudi Arabia" "Syria"        "Lebanon"      "Taiwan"      
## [21] "Japan"        "India"        "Vietnam"      "Philippines"  "S Korea"     
## [26] "Pakistan"

In the following chunk, use d to create vector source.countries that will contain all territories from which at least one attack was organized.

source.countries <- unique(d$source)
source.countries

##  [1] "US"       "Russia"   "Iran"     "Syria"    "China"    "N Korea" 
##  [7] "Ukraine"  "Georgia"  "Turkey"   "Israel"   "Taiwan"   "Vietnam" 
## [13] "S Korea"  "Japan"    "Pakistan" "India"

It is hard to compare these two vectors. To ease this comparison, we can sort() territories alphabetically (increasing order) or in the reverse order (decreasing order). You choose the order by specifying option decreasing inside sort()

target.countries <- sort(target.countries, decreasing = FALSE) 
target.countries

##  [1] "Canada"       "China"        "Estonia"      "France"       "Georgia"     
##  [6] "Germany"      "India"        "Iran"         "Israel"       "Japan"       
## [11] "Lebanon"      "Lithuania"    "N Korea"      "Pakistan"     "Philippines" 
## [16] "Poland"       "Russia"       "S Korea"      "Saudi Arabia" "Syria"       
## [21] "Taiwan"       "Turkey"       "UK"           "Ukraine"      "US"          
## [26] "Vietnam"

target.countries <- sort(target.countries, decreasing = TRUE)
target.countries

##  [1] "Vietnam"      "US"           "Ukraine"      "UK"           "Turkey"      
##  [6] "Taiwan"       "Syria"        "Saudi Arabia" "S Korea"      "Russia"      
## [11] "Poland"       "Philippines"  "Pakistan"     "N Korea"      "Lithuania"   
## [16] "Lebanon"      "Japan"        "Israel"       "Iran"         "India"       
## [21] "Germany"      "Georgia"      "France"       "Estonia"      "China"       
## [26] "Canada"

Sort source.countries in the reverse alphabetical order and store the results in the same object:

source.countries <- sort(source.countries, decreasing = TRUE)
source.countries

##  [1] "Vietnam"  "US"       "Ukraine"  "Turkey"   "Taiwan"   "Syria"   
##  [7] "S Korea"  "Russia"   "Pakistan" "N Korea"  "Japan"    "Israel"  
## [13] "Iran"     "India"    "Georgia"  "China"

Do these two vectors (target.countries and source.countries) look alike? Are source-territories and target-territories often the same? One way to check is to look at the overlap of the two vectors. In the following chunk, create a vector target.source.intersection of countries that are present in both target.countries and source.countries. You can do it manually or using functions like intersect(). Also, create a vector all.countries of all unique territories present in either target.countries or source.countries (in other words, create a union of the unique elements from target.countries and source.countries). Hint: remember, you can always combine vectors by applying the function c()

source.countries <- unique(d$source)
source.countries

##  [1] "US"       "Russia"   "Iran"     "Syria"    "China"    "N Korea" 
##  [7] "Ukraine"  "Georgia"  "Turkey"   "Israel"   "Taiwan"   "Vietnam" 
## [13] "S Korea"  "Japan"    "Pakistan" "India"

target.countries <- unique(d$target)
target.countries

##  [1] "Russia"       "US"           "Iran"         "China"        "N Korea"     
##  [6] "Canada"       "UK"           "France"       "Germany"      "Poland"      
## [11] "Estonia"      "Lithuania"    "Ukraine"      "Georgia"      "Turkey"      
## [16] "Israel"       "Saudi Arabia" "Syria"        "Lebanon"      "Taiwan"      
## [21] "Japan"        "India"        "Vietnam"      "Philippines"  "S Korea"     
## [26] "Pakistan"

target.source.intersection <- intersect(target.countries , source.countries)
target.source.intersection

##  [1] "Russia"   "US"       "Iran"     "China"    "N Korea"  "Ukraine" 
##  [7] "Georgia"  "Turkey"   "Israel"   "Syria"    "Taiwan"   "Japan"   
## [13] "India"    "Vietnam"  "S Korea"  "Pakistan"

all.countries <- unique(c(target.countries, source.countries)) # combine with c() first; unique() takes a single vector
all.countries

##  [1] "Russia"       "US"           "Iran"         "China"        "N Korea"     
##  [6] "Canada"       "UK"           "France"       "Germany"      "Poland"      
## [11] "Estonia"      "Lithuania"    "Ukraine"      "Georgia"      "Turkey"      
## [16] "Israel"       "Saudi Arabia" "Syria"        "Lebanon"      "Taiwan"      
## [21] "Japan"        "India"        "Vietnam"      "Philippines"  "S Korea"     
## [26] "Pakistan"

Looks good!
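
The set functions in base R make the comparison explicit. In this sketch the two vectors are copied from the printed output above; union() is shorthand for combining with c() and then applying unique(), and setdiff() lists what appears in one vector but not the other.

```r
source.countries <- c("US", "Russia", "Iran", "Syria", "China", "N Korea",
                      "Ukraine", "Georgia", "Turkey", "Israel", "Taiwan",
                      "Vietnam", "S Korea", "Japan", "Pakistan", "India")
target.countries <- c("Russia", "US", "Iran", "China", "N Korea", "Canada",
                      "UK", "France", "Germany", "Poland", "Estonia",
                      "Lithuania", "Ukraine", "Georgia", "Turkey", "Israel",
                      "Saudi Arabia", "Syria", "Lebanon", "Taiwan", "Japan",
                      "India", "Vietnam", "Philippines", "S Korea", "Pakistan")

# union() is shorthand for unique(c(x, y))
all.countries <- union(target.countries, source.countries)
length(all.countries)

# Territories that were attacked but never appear as a source
setdiff(target.countries, source.countries)

# Territories that launched attacks but were never attacked
setdiff(source.countries, target.countries)
```

The last call returns an empty vector: every source territory in this dataset is also a target.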

Question. So what do you see? Do source-territories and target-territories tend to be the same? Write your answer below.

Your Answer: I see that they are different in the sense that source countries have a larger fiscal budget for attacks and are larger players in the world’s economy and the target countries are easier targets because they have smaller fiscal budgets in terms of what resources they have to mitigate attacks. Ironically, the large players in the source countries list are in the target countries list because they target each other.

To support your answer with some evidence, calculate the share of territories in target.source.intersection to all the territories in all.countries:

target.source.intersection <- intersect(target.countries , source.countries)
target.source.intersection

##  [1] "Russia"   "US"       "Iran"     "China"    "N Korea"  "Ukraine" 
##  [7] "Georgia"  "Turkey"   "Israel"   "Syria"    "Taiwan"   "Japan"   
## [13] "India"    "Vietnam"  "S Korea"  "Pakistan"

all.countries <- unique(c(target.countries, source.countries)) # combine with c() first; unique() takes a single vector
target.source.intersection <- intersect(target.countries, source.countries)
target.source.all.countries.intersection <- intersect(target.source.intersection, all.countries)
target.source.all.countries.intersection

##  [1] "Russia"   "US"       "Iran"     "China"    "N Korea"  "Ukraine" 
##  [7] "Georgia"  "Turkey"   "Israel"   "Syria"    "Taiwan"   "Japan"   
## [13] "India"    "Vietnam"  "S Korea"  "Pakistan"
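
The share itself is one division. A minimal sketch, with the two counts read off the printed vectors above (16 territories in the intersection, 26 in all.countries):

```r
# Counts read off the printed vectors above
n.intersection <- 16   # length(target.source.intersection)
n.all          <- 26   # length(all.countries)

share <- n.intersection / n.all
round(share, 3)

# In the report itself, with the vectors in memory, this would be:
# share <- length(target.source.intersection) / length(all.countries)
```

Roughly 62% of all territories in the data appear both as a source and as a target.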

2 Explore cyberattacks with: `table()` (1 point)

We often want to narrow our focus to specific cases. For example, we may be interested only in the territories that experienced a lot of attacks. The function table() will calculate the number of times each territory appears in column target.

table(d$target) # check the results without storing them in an object

## 
##       Canada        China      Estonia       France      Georgia      Germany 
##            2            7            4            3            6            3 
##        India         Iran       Israel        Japan      Lebanon    Lithuania 
##           20           14           11           14            1            4 
##      N Korea     Pakistan  Philippines       Poland       Russia      S Korea 
##            5            7            5            3           11           23 
## Saudi Arabia        Syria       Taiwan       Turkey           UK      Ukraine 
##            7            1            7            4            3           15 
##           US      Vietnam 
##           82            4

target.by.frequency <- table(d$target) # store the results in an object
target.by.frequency # check what is inside this object

## 
##       Canada        China      Estonia       France      Georgia      Germany 
##            2            7            4            3            6            3 
##        India         Iran       Israel        Japan      Lebanon    Lithuania 
##           20           14           11           14            1            4 
##      N Korea     Pakistan  Philippines       Poland       Russia      S Korea 
##            5            7            5            3           11           23 
## Saudi Arabia        Syria       Taiwan       Turkey           UK      Ukraine 
##            7            1            7            4            3           15 
##           US      Vietnam 
##           82            4

By examining target.by.frequency we see that the top 3 countries on the list are “US”, “S Korea”, and “India”. We can store them in a new object manually:

target.by.frequency.top.3 <- c("US", "S Korea", "India")
target.by.frequency.top.3

## [1] "US"      "S Korea" "India"

We could also specify the position of these three elements inside target.by.frequency:

target.by.frequency.top.3 <- target.by.frequency[c(7,18,25)]
target.by.frequency.top.3

## 
##   India S Korea      US 
##      20      23      82

But there is a difference. table() basically creates a vector of elements. Each element reflects the number of times a territory occurred in our input column d$target. Each element also has a name. To obtain just the list of names, we need to use names() on target.by.frequency.top.3:

names(target.by.frequency.top.3)

## [1] "India"   "S Korea" "US"

Use table() on d$source to create vector source.by.frequency. Next, manually create a list of top-3 territories that launched attacks. Store the results in source.by.frequency.top.3:

table(d$source)

## 
##    China  Georgia    India     Iran   Israel    Japan  N Korea Pakistan 
##       74        1        7       33        9        3       26       13 
##   Russia  S Korea    Syria   Taiwan   Turkey  Ukraine       US  Vietnam 
##       65        7        1        1        2        2       21        1

source.by.frequency <- table(d$source) 
source.by.frequency

## 
##    China  Georgia    India     Iran   Israel    Japan  N Korea Pakistan 
##       74        1        7       33        9        3       26       13 
##   Russia  S Korea    Syria   Taiwan   Turkey  Ukraine       US  Vietnam 
##       65        7        1        1        2        2       21        1

source.by.frequency.top.3 <- table(d$source)
source.by.frequency.top.3

## 
##    China  Georgia    India     Iran   Israel    Japan  N Korea Pakistan 
##       74        1        7       33        9        3       26       13 
##   Russia  S Korea    Syria   Taiwan   Turkey  Ukraine       US  Vietnam 
##       65        7        1        1        2        2       21        1

source.by.frequency.top.3 <- c("China", "Russia", "Iran")
source.by.frequency.top.3

## [1] "China"  "Russia" "Iran"

Now, re-write `source.by.frequency.top.3` by specifying the positions of top-3 elements in `source.by.frequency`:

source.by.frequency <- table(d$source)
source.by.frequency

## 
##    China  Georgia    India     Iran   Israel    Japan  N Korea Pakistan 
##       74        1        7       33        9        3       26       13 
##   Russia  S Korea    Syria   Taiwan   Turkey  Ukraine       US  Vietnam 
##       65        7        1        1        2        2       21        1

source.by.frequency.top.3 <- source.by.frequency[c(1, 4, 9)] # positions of China, Iran, and Russia
source.by.frequency.top.3

## 
##  China   Iran Russia 
##     74     33     65

names(source.by.frequency.top.3)

## [1] "China"  "Iran"   "Russia"

source.by.frequency.top.3 <- names(source.by.frequency.top.3)
source.by.frequency.top.3

## [1] "China"  "Iran"   "Russia"

Nice! But can we obtain the top-3 territories in an automated way? Yes. For example, we can use sort() on target.by.frequency and subset the first or last elements of this vector. The following chunk obtains a list of the three territories that experienced the fewest attacks. Note: the elements in target.by.frequency are numbers, so sort() will use them for ordering (not the names of territories).

target.by.frequency <- table(d$target)
target.by.frequency <-  sort(target.by.frequency)
target.by.frequency.bottom.3 <- target.by.frequency[1:3]
target.by.frequency.bottom.3

## 
## Lebanon   Syria  Canada 
##       1       1       2

target.by.frequency.bottom.3 <- names(target.by.frequency.bottom.3)
target.by.frequency.bottom.3

## [1] "Lebanon" "Syria"   "Canada"

OK! Now let’s put together things we’ve learnt so far. In the following chunk, create vectors target.by.frequency.top.5 and source.by.frequency.top.5. source.by.frequency.top.5 should contain names of top-5 territories that launched attacks; target.by.frequency.top.5 should contain names of top-5 territories that experienced cyberattacks.

target.by.frequency <- table(d$target)
target.by.frequency

## 
##       Canada        China      Estonia       France      Georgia      Germany 
##            2            7            4            3            6            3 
##        India         Iran       Israel        Japan      Lebanon    Lithuania 
##           20           14           11           14            1            4 
##      N Korea     Pakistan  Philippines       Poland       Russia      S Korea 
##            5            7            5            3           11           23 
## Saudi Arabia        Syria       Taiwan       Turkey           UK      Ukraine 
##            7            1            7            4            3           15 
##           US      Vietnam 
##           82            4

source.by.frequency <- table(d$source)
source.by.frequency

## 
##    China  Georgia    India     Iran   Israel    Japan  N Korea Pakistan 
##       74        1        7       33        9        3       26       13 
##   Russia  S Korea    Syria   Taiwan   Turkey  Ukraine       US  Vietnam 
##       65        7        1        1        2        2       21        1

target.by.frequency <- sort(target.by.frequency, decreasing = TRUE)
target.by.frequency

## 
##           US      S Korea        India      Ukraine         Iran        Japan 
##           82           23           20           15           14           14 
##       Israel       Russia        China     Pakistan Saudi Arabia       Taiwan 
##           11           11            7            7            7            7 
##      Georgia      N Korea  Philippines      Estonia    Lithuania       Turkey 
##            6            5            5            4            4            4 
##      Vietnam       France      Germany       Poland           UK       Canada 
##            4            3            3            3            3            2 
##      Lebanon        Syria 
##            1            1

source.by.frequency <- sort(source.by.frequency, decreasing = TRUE)
source.by.frequency

## 
##    China   Russia     Iran  N Korea       US Pakistan   Israel    India 
##       74       65       33       26       21       13        9        7 
##  S Korea    Japan   Turkey  Ukraine  Georgia    Syria   Taiwan  Vietnam 
##        7        3        2        2        1        1        1        1

target.by.frequency.top.5 <- names(target.by.frequency[1:5])
target.by.frequency.top.5

## [1] "US"      "S Korea" "India"   "Ukraine" "Iran"

source.by.frequency.top.5 <- names(source.by.frequency[1:5])
source.by.frequency.top.5

## [1] "China"   "Russia"  "Iran"    "N Korea" "US"
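
The sort-then-subset steps can be wrapped into one reusable function. This is a sketch only: top.n() is a hypothetical helper introduced here, not part of the assignment, and the counts are copied from the table(d$source) output above.

```r
# Attack counts per source, copied from the table(d$source) output above
source.counts <- c(China = 74, Georgia = 1, India = 7, Iran = 33, Israel = 9,
                   Japan = 3, "N Korea" = 26, Pakistan = 13, Russia = 65,
                   "S Korea" = 7, Syria = 1, Taiwan = 1, Turkey = 2,
                   Ukraine = 2, US = 21, Vietnam = 1)

# Hypothetical helper: names of the n largest counts
top.n <- function(counts, n) {
  names(head(sort(counts, decreasing = TRUE), n))
}

top.n(source.counts, 5)
```

One caveat when cutting at a fixed rank: in the target counts above, Iran and Japan both have 14 attacks, so which of them lands in fifth place depends on how ties are ordered.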

Another important use of table() is to cross-tabulate two variables. For example, we can contrast different methods of cyberattacks with the type of targets:

table(d$method, d$attack_on)

##               
##                Government Military Private
##   DDoS                 24        6      16
##   Defacement           20        1       7
##   Infiltration         18       18      12
##   Intrusion            70       24      50

Question. What is the most common method used against the government agencies? Military? Private? What is the most common cyber method overall?

Your Answer: Intrusion is the most common method against government agencies, Intrusion for military as well, Intrusion for Private as well, OVERALL: Intrusion.
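
Row and column totals make these comparisons easier to read. The sketch below rebuilds the cross-table from the counts printed above (the object name method.by.target is introduced here) and uses addmargins(), apply(), and which.max() from base R.

```r
# Counts copied from the table(d$method, d$attack_on) output above
method.by.target <- as.table(matrix(
  c(24,  6, 16,
    20,  1,  7,
    18, 18, 12,
    70, 24, 50),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    method    = c("DDoS", "Defacement", "Infiltration", "Intrusion"),
    attack_on = c("Government", "Military", "Private")
  )
))

# Row and column totals in one table
addmargins(method.by.target)

# Most common method within each type of target (column by column)
apply(method.by.target, 2, function(counts) names(which.max(counts)))

# Most common method overall
names(which.max(rowSums(method.by.target)))
```

With the real data in memory, the same calls work directly on table(d$method, d$attack_on).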

Use table() to check how successful are cyberattacks (variable success in d) against different targets (variable attack_on)

table(d$success , d$attack_on)

##    
##     Government Military Private
##   0        128       43      83
##   1          4        6       2

Question. Attackers have higher chances when they attack …

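Your Answer: Military targets. 6 of 49 attacks on military targets succeeded (about 12.2%), compared with 4 of 132 (about 3.0%) for Government and 2 of 85 (about 2.4%) for Private.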
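
Raw counts can mislead when the groups differ in size, so it helps to convert the table into shares. A minimal sketch, rebuilding the table from the counts printed above and using prop.table() with margin = 2 to divide each cell by its column total:

```r
# Counts copied from the table(d$success, d$attack_on) output above
success.by.target <- as.table(matrix(
  c(128, 43, 83,
      4,  6,  2),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    success   = c("0", "1"),
    attack_on = c("Government", "Military", "Private")
  )
))

# margin = 2 divides each cell by its column total,
# so row "1" holds the share of successful attacks per target type
shares <- prop.table(success.by.target, margin = 2)
round(shares["1", ], 3)
```

With the real data in memory, prop.table(table(d$success, d$attack_on), margin = 2) gives the same shares.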

Now, analyze what methods are usually more successful than the others (variable method in d)

table(d$method , d$success)

##               
##                  0   1
##   DDoS          44   2
##   Defacement    28   0
##   Infiltration  43   5
##   Intrusion    139   5

table(d$success, d$method) # same table transposed; no assignment inside the call, so d is not modified

##    
##     DDoS Defacement Infiltration Intrusion
##   0   44         28           43       139
##   1    2          0            5         5

Question. Calculate the ratio of successful attacks for each method …

Your Answer: Share of successful attacks by method: DDoS 2/46 ≈ 4.3%; Defacement 0/28 = 0%; Infiltration 5/48 ≈ 10.4%; Intrusion 5/144 ≈ 3.5%.
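
Because success is coded 0/1, its mean within a group equals the share of successful attacks in that group. The sketch below rebuilds the two columns from the counts in the table above (so it runs without the CSV files) and applies tapply():

```r
# Rebuild method and success columns from the counts in the table above
method  <- rep(c("DDoS", "Defacement", "Infiltration", "Intrusion"),
               times = c(46, 28, 48, 144))
success <- c(rep(c(0, 1), times = c(44, 2)),    # DDoS
             rep(c(0, 1), times = c(28, 0)),    # Defacement
             rep(c(0, 1), times = c(43, 5)),    # Infiltration
             rep(c(0, 1), times = c(139, 5)))   # Intrusion

# The mean of a 0/1 variable is the share of ones
success.ratio <- tapply(success, method, mean)
round(success.ratio, 3)
```

With the real data in memory, the equivalent call is tapply(d$success, d$method, mean).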

3 Subset data on cyber attacks (1 point)

We often conduct the analysis on a subset of data. One way to subset data is to specify the value of some variable in our dataset. (Remember, that in this case we need to use == not =.) For example, we can subset all observations from 2012 using data.table syntax:

d.2012 <- d[year ==  2012,]
# Note 1: Because the class of `d` is `data.table`, we do not need to repeat the data object's name inside the square brackets.
# If we used an old-school format like `data.frame`, we would need to be explicit: d[d$year == 2012,]
# Note 2: Do not forget to put a comma after your logical expression. This way, R understands that we apply our logical expression to the rows, not to the columns

The same is true, if we want to focus on a specific territory. Here is an example for India as a target:

d.india <- d[target == "India",]

We can also make complex logical expressions with ‘&’ and subset on them:

d.india.2012 <- d[target == "India" & year == 2012,]
d.india.2012

We can also subset with multiple values. The operator %in% is our good friend here:

countries.of.interest <- c("S Korea", "N Korea")
d.subset <- d[target %in% countries.of.interest,]
head(d.subset)

Now, subset three territories that experienced the least number of attacks (remember target.by.frequency.bottom.3?). Also, only keep observations from 2010 to 2014. Store the results in d.bottom.3.from.2010.to.2014.

bottom_3_territories <- c("Lebanon", "Syria", "Canada")
d_bottom_3 <- d[d$target %in% bottom_3_territories, ]
d_bottom_3_from_2010_to_2014 <- d_bottom_3[d_bottom_3$year >= 2010 & d_bottom_3$year <= 2014, ]
d.bottom.3.from.2010.to.2014 <- d_bottom_3_from_2010_to_2014
head(d.bottom.3.from.2010.to.2014)

How many observations do you have in d.bottom.3.from.2010.to.2014? Use nrow() on your data object to answer this question.

num_observations <- nrow(d.bottom.3.from.2010.to.2014)
num_observations

## [1] 3
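
Both filters can also be written as one logical expression. The sketch below uses a plain data.frame with toy rows invented for illustration (they are not taken from the real dataset), so the explicit toy$ prefix is needed; the object names toy and territories are introduced here.

```r
# Toy rows invented for illustration; they are NOT taken from the real dataset
toy <- data.frame(
  target = c("Canada", "Canada", "Syria", "Lebanon", "US", "Syria"),
  year   = c(2009, 2012, 2014, 2016, 2012, 2010)
)
territories <- c("Lebanon", "Syria", "Canada")

# Membership and both year bounds combined with & in one expression
toy.subset <- toy[toy$target %in% territories &
                    toy$year >= 2010 & toy$year <= 2014, ]
toy.subset
nrow(toy.subset)

# data.table equivalent on the real data, reusing the vector built earlier:
# d[target %in% target.by.frequency.bottom.3 & year >= 2010 & year <= 2014, ]
```

Reusing target.by.frequency.bottom.3 instead of retyping the names keeps the subset correct if the underlying data changes.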

4 Explore sources and targets with Sankey Diagrams (1 point)

Tables are great. But it is usually hard to infer systematic patterns in your data while looking at them. For example, do the territories usually mirror each other’s attacks? A simple way to check it is to produce a Sankey Diagram.

To plot our Sankey Diagrams we will use the function alluvial() from the alluvial package (yes, the names are confusing, but it is what it is). Inside alluvial() we need to specify two options. For data we need to specify which columns represent targets and sources. We can reference them by their names in the dataset: data = d[,c("source","target")]. Alternatively, we can reference them by their position: data = d[,1:2].

We also need to specify frequency with freq = d$num. This option is useful when each row represents not a single attack, but for example all the attacks from source == "Country A" to target == "Country B". It is not our case, yet we are required to specify this option.

alluvial(data = d[,1:2], freq=d$num)
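
To see what freq does when rows are aggregated, here is a sketch that collapses individual attacks into one row per source-target pair with aggregate() from base R. The seven rows are the first seven source and target values from the glimpse() output above; the object names flows and flows.agg are introduced here.

```r
# First seven rows of `d` (source and target), copied from the glimpse() output above
flows <- data.frame(
  source = c("US", "US", "Russia", "Russia", "Russia", "US", "Russia"),
  target = c("Russia", "Russia", "US", "US", "US", "Russia", "US"),
  num    = 1
)

# One row per source-target pair; num now counts the attacks in each pair
flows.agg <- aggregate(num ~ source + target, data = flows, FUN = sum)
flows.agg

# With the alluvial package installed, the aggregated version would be drawn as:
# alluvial(data = flows.agg[, c("source", "target")], freq = flows.agg$num)
```

The picture is the same either way: seven rows with freq equal to 1, or two rows with freq equal to 3 and 4.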

Well, this looks awful and messy. This is because we have a lot of territories that appear in d only a couple of times. Instead, let’s subset the data and keep only territories that appear either in target.by.frequency.top.5 or in source.by.frequency.top.5. Store this subset of data in d.for.alluvial.

target.by.frequency.top.5 <- c("US", "S Korea", "India", "Ukraine", "Iran") # top-5 targets computed in section 2
source.by.frequency.top.5 <- c("China", "Russia", "Iran", "N Korea", "US") # top-5 sources computed in section 2
relevant_territories <- unique(c(target.by.frequency.top.5, source.by.frequency.top.5))
d.for.alluvial <- d[d$target %in% relevant_territories | d$source %in% relevant_territories, ]
head(d.for.alluvial)

Now, make your Sankey Diagram with alluvial() using d.for.alluvial as your input dataset.

if (!requireNamespace("alluvial", quietly = TRUE)) {
  install.packages("alluvial")
}
library(alluvial)
alluvial(data = d.for.alluvial[, c("target", "source")],
         freq = d.for.alluvial$num,
         col = "purple",
         border = "pink",
         cex = 0.5)

Question. Describe interesting patterns you see in this diagram. And what about attacks? Do we see the symmetry?

Your Answer: The attacks look like they match in a sense that if target countries are attacked then the source country will receive a counter attack so long as resources permit.

5 Explore trends in cyberattacks with `ggplot` (1 point)

Finally, we can use visual tools to explore trends in attacks with the ggplot function. For example, we can look at how the number of attacks changes from year to year. We will use d.attacks.by.year. This dataset has only two columns, year and attacks:

head(d.attacks.by.year)

ggplot works like Lego. You add up different elements to build your final figure. First you need to specify your baseplate. Your baseplate is ggplot(). It has two major options: data (the dataset to plot, as in alluvial()) and aes(), which stands for aesthetics. Within aes() we specify the parameters of the plot we need, like the x-axis or y-axis:

ggplot(data = d.attacks.by.year,
       aes(x = year, y = attacks)
       )

We specified the baseplate but it is blank. This is because we can use different layers to visualize our data points. For example, we can add a geom_point() layer to produce points:

ggplot(data = d.attacks.by.year,
       aes(x = year, y = attacks)
       ) + 
  geom_point()

Use geom_line() instead of geom_point() and reproduce our plot from the previous chunk:

library(ggplot2)
yearly_attacks <- aggregate(num ~ year, data = d, FUN = sum)
base_plot <- ggplot(data = yearly_attacks, aes(x = year, y = num))
final_plot <- base_plot +
  geom_line(color = "purple") +
  labs(title = "Trend in Cyberattacks Over Time", x = "Year", y = "Attacks") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 50, hjust = 0.5))
final_plot

We can also specify that our observations are organized as groups. We use color = group_variable to produce separate trends for each of the groups. Let’s take d.attacks.by.year.and.method and make trends separately for each method used for cyberattacks:

ggplot(data = d.attacks.by.year.and.method,
       aes(x = year, y = attacks, color = method) # We explicitly require ggplot to treat observations as groups
       ) + # Use the plus to add layers and other elements to the baseplate. This plus should always appear at the end of the previous line of code, not at the beginning of the new line.
  geom_line()
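
Layers can be stacked on the same baseplate. The sketch below uses toy counts invented for illustration (they are not the real yearly figures) and only builds the plot if ggplot2 is installed; the object names toy and p are introduced here.

```r
# Toy counts invented for illustration; they are NOT the real yearly figures
toy <- data.frame(
  year    = rep(2010:2012, times = 2),
  attacks = c(3, 5, 4, 1, 2, 6),
  method  = rep(c("DDoS", "Intrusion"), each = 3)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Layers stack in order: lines first, then points on top, both split by method
  p <- ggplot(data = toy, aes(x = year, y = attacks, color = method)) +
    geom_line() +
    geom_point()
  # Typing p (or print(p)) draws the figure
}
```

Storing the plot in an object before printing it makes it easy to add further layers later with another +.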

Question. Describe what you see here

Your Answer: It appears that Intrusion was the most common type of attack in 2015 and DDoS came second; Infiltration went down as well. Defacement is the least used method towards 2015.

For the final task we will use data from d.attacks.by.attack_on. This dataset summarizes cyberattacks according to the type of their target. As in our main dataset attack_on in d.attacks.by.attack_on has three values: “Government”, “Military”, and “Private”. In the chunk below, use ggplot to plot the year-trends in the number of attacks by each type of their target. Your plot should only show the attacks on private firms and government agencies (you will need to exclude military). Also, plot the results only for years from 2009 to 2014. Your plot should include both geom_line and geom_point layers.

library(ggplot2)
filtered_data <- d.attacks.by.attack_on[
  d.attacks.by.attack_on$attack_on %in% c("Private", "Government") &
    d.attacks.by.attack_on$year >= 2009 &
    d.attacks.by.attack_on$year <= 2014, ]
base_plot <- ggplot(data = filtered_data,
                    aes(x = year, y = attacks, color = attack_on, group = attack_on))
final_plot <- base_plot +
  geom_line() +
  geom_point() +
  labs(title = "Annual Trends in Cyberattacks by Target",
       x = "Year", y = "Attacks", color = "Target Type") +
  scale_color_manual(values = c("Private" = "pink", "Government" = "purple")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 50, hjust = 0.5))
final_plot

That’s it!

Now use the triangle next to the ‘knit’ button to compile an html version of your report. Submit your html file to the class website under “Problem Set 1”. If you cannot compile the html, that means you have some errors in your code. The R console shows you the line with an error. Check it and try to compile your report again. If (after spending some time) you still cannot compile your html, submit your Rmd file for partial credit.

Good job!

Cybersecurity Policy: Data Management and Visualization for Cyber Attacks