Working on actual problems is central to learning. This is your first problem set. The assignments consist of analysis of cyber attacks using basic data management approaches and visualization tools (in R). Late submissions will not be accepted without prior permission. Students are encouraged to discuss the problems together, but must independently produce and submit solutions.
Show your solutions in the chunks below.
Familiarize yourself with RMarkdown
Watch this video to familiarize yourself with the RMarkdown interface
Rename this file
You are in the Problem-Set-1.Rmd file now. First, close
this file and rename it using your last and first name:
Last-Name-First-Name-Problem-Set-1.Rmd. Reopen the file.
Good!
Loading packages is boring and time-consuming. First, you need to
install packages. Second, you need to run them in R’s environment.
Delete # before install.packages("pacman") and run this
chunk of code. Now this package is installed on your system. Put #
back.
#install.packages("pacman")
#library("pacman")
There is an easy way: the pacman package. This package
checks whether the package you want (say, dplyr) is already
installed on your laptop, installs it if it is missing, and loads it. No
need to use quotes with this package. The following chunk should install and
load all packages that you might need for this problem set. Just execute it.
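A minimal sketch of such a chunk, assuming the package list below. The list is an assumption (these are the packages named elsewhere in this problem set), and `needed.packages` is a hypothetical name used only for illustration.

```r
# Assumed package list: the packages mentioned in this problem set.
needed.packages <- c("data.table", "dplyr", "ggplot2", "alluvial")

# p_load() installs any missing package and then attaches it.
# The guard keeps the chunk from failing if pacman itself is not installed yet.
if (requireNamespace("pacman", quietly = TRUE)) {
  pacman::p_load(char = needed.packages)
}
```

Passing the names through `char =` keeps the list in one object; calling `p_load(data.table, dplyr, ggplot2, alluvial)` without quotes should work as well.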
We will use four datasets for this problem set. I describe them in detail later. First, we need to load these datasets into R's memory and store them as data-objects.
Here is the list of data-objects we need and their corresponding files:
| Data objects for this assignment | File |
|---|---|
| d | “cyberattacks-across-the-globe-cases.csv” |
| d.attacks.by.year | “cyberattacks-by-year.csv” |
| d.attacks.by.year.and.method | “cyberattacks-by-year-and-method.csv” |
| d.attacks.by.attack_on | “cyberattacks-by-attack_on.csv” |
All files are stored in your “0-Data” subfolder. To read them and
store them we use the fread() function from the
data.table package:
d <- fread("cyberattacks-across-the-globe-cases.csv")
d.attacks.by.year <- fread("cyberattacks-by-year.csv")
d.attacks.by.year.and.method <- fread("cyberattacks-by-year-and-method.csv")
d.attacks.by.attack_on <- fread("cyberattacks-by-attack_on.csv" )
Now, we can start! Good luck!
We start with the most detailed dataset: d. Each row in
d represents a cyber-attack. In the chunk below, use
function names() to print the names of variables (columns)
in d
names(d)
## [1] "source" "target" "year" "attack_on" "method" "success"
## [7] "num"
What do we know about each attack?
source - a territory (country) from which the attack was
organized
target - a territory (country) where the target (e.g., a
private firm or a state agency) was located
year - when the attack took place. Later in this class,
we will use precise dates of attacks
attack_on - who was under attack (private firms, state
agencies, or military objects)
method - a method used for this attack
success - a dummy-variable. It takes a value of 1 if
attackers achieved their goals (money, concessions, physical damage,
etc) and 0 otherwise
num - another dummy variable. It always takes a value of
1. We will often use it to aggregate data.
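To illustrate why a column that is always 1 is useful, here is a small sketch with made-up data (the `toy` data frame is hypothetical, not one of the course files): summing `num` within groups counts the rows in each group.

```r
# Toy data (made up for illustration, not taken from the course files)
toy <- data.frame(
  year = c(2008, 2008, 2009, 2009, 2009),
  num  = c(1, 1, 1, 1, 1)
)

# Summing num within each year counts the attacks in that year
attacks.per.year <- aggregate(num ~ year, data = toy, FUN = sum)
attacks.per.year
```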
We can check what is inside any data object in different ways. For
example, we can use glimpse (from the dplyr
package) to check the number of rows and columns, names and types of the
columns (variables), as well as some examples from these columns.
glimpse(d)
## Rows: 266
## Columns: 7
## $ source <chr> "US", "US", "Russia", "Russia", "Russia", "US", "Russia", "R…
## $ target <chr> "Russia", "Russia", "US", "US", "US", "Russia", "US", "US", …
## $ year <int> 2008, 2008, 2008, 2008, 2008, 2008, 2009, 2009, 2011, 2013, …
## $ attack_on <chr> "Government", "Government", "Government", "Government", "Mil…
## $ method <chr> "Intrusion", "Infiltration", "Infiltration", "Infiltration",…
## $ success <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
In fact, we can print the content of d by putting it in
the chunk without any functions (check it by deleting ‘#’ in the chunk
below; do not forget to put ‘#’ back). Do not do this in the future: R will
print all rows of d and your reports will be impossible to
read.
#d
In the chunk below, check the structure of your d using
at least 2 functions you learnt in DataCamp’s “Intro to R”.
str(d)
## Classes 'data.table' and 'data.frame': 266 obs. of 7 variables:
## $ source : chr "US" "US" "Russia" "Russia" ...
## $ target : chr "Russia" "Russia" "US" "US" ...
## $ year : int 2008 2008 2008 2008 2008 2008 2009 2009 2011 2013 ...
## $ attack_on: chr "Government" "Government" "Government" "Government" ...
## $ method : chr "Intrusion" "Infiltration" "Infiltration" "Infiltration" ...
## $ success : int 0 0 0 0 0 1 0 0 0 0 ...
## $ num : int 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(d)
## source target year attack_on
## Length:266 Length:266 Min. :2000 Length:266
## Class :character Class :character 1st Qu.:2008 Class :character
## Mode :character Mode :character Median :2011 Mode :character
## Mean :2011
## 3rd Qu.:2014
## Max. :2016
## method success num
## Length:266 Min. :0.00000 Min. :1
## Class :character 1st Qu.:0.00000 1st Qu.:1
## Mode :character Median :0.00000 Median :1
## Mean :0.04511 Mean :1
## 3rd Qu.:0.00000 3rd Qu.:1
## Max. :1.00000 Max. :1
dim(d)
## [1] 266 7
Let’s explore targets and sources of cyberattacks. For example, to
obtain a vector of countries that were under attack at least once, we
can use unique() on variable target inside
d:
target.countries <- unique(d$target)
target.countries
## [1] "Russia" "US" "Iran" "China" "N Korea"
## [6] "Canada" "UK" "France" "Germany" "Poland"
## [11] "Estonia" "Lithuania" "Ukraine" "Georgia" "Turkey"
## [16] "Israel" "Saudi Arabia" "Syria" "Lebanon" "Taiwan"
## [21] "Japan" "India" "Vietnam" "Philippines" "S Korea"
## [26] "Pakistan"
In the following chunk, use d to create vector
source.countries that will contain all territories from
which at least one attack was organized.
source.countries <- unique(d$source)
source.countries
## [1] "US" "Russia" "Iran" "Syria" "China" "N Korea"
## [7] "Ukraine" "Georgia" "Turkey" "Israel" "Taiwan" "Vietnam"
## [13] "S Korea" "Japan" "Pakistan" "India"
It is hard to compare these two vectors. To ease this comparison, we
can sort() territories alphabetically (increasing order) or
in the reverse order (decreasing order). You choose the order by
specifying option decreasing inside sort()
target.countries <- sort(target.countries, decreasing = FALSE)
target.countries
## [1] "Canada" "China" "Estonia" "France" "Georgia"
## [6] "Germany" "India" "Iran" "Israel" "Japan"
## [11] "Lebanon" "Lithuania" "N Korea" "Pakistan" "Philippines"
## [16] "Poland" "Russia" "S Korea" "Saudi Arabia" "Syria"
## [21] "Taiwan" "Turkey" "UK" "Ukraine" "US"
## [26] "Vietnam"
target.countries <- sort(target.countries, decreasing = TRUE)
target.countries
## [1] "Vietnam" "US" "Ukraine" "UK" "Turkey"
## [6] "Taiwan" "Syria" "Saudi Arabia" "S Korea" "Russia"
## [11] "Poland" "Philippines" "Pakistan" "N Korea" "Lithuania"
## [16] "Lebanon" "Japan" "Israel" "Iran" "India"
## [21] "Germany" "Georgia" "France" "Estonia" "China"
## [26] "Canada"
Sort source.countries in the reverse alphabetical order
and store the results in the same object:
source.countries <- sort(source.countries, decreasing = TRUE)
source.countries
## [1] "Vietnam" "US" "Ukraine" "Turkey" "Taiwan" "Syria"
## [7] "S Korea" "Russia" "Pakistan" "N Korea" "Japan" "Israel"
## [13] "Iran" "India" "Georgia" "China"
Do these two vectors (target.countries and
source.countries) look alike? Are
source-territories and target-territories
often the same? One way to check is to look at the overlap of the two
vectors. In the following chunk, create a vector
target.source.intersection of countries that are present both
in target.countries and source.countries. You
can do it manually or using functions like intersect().
Also, create a vector all.countries of all unique
territories present in either target.countries or
source.countries (in other words, create a union of unique
elements from target.countries and
source.countries). Hint: remember, you can always
combine vectors by applying function c()
source.countries <- unique(d$source)
source.countries
## [1] "US" "Russia" "Iran" "Syria" "China" "N Korea"
## [7] "Ukraine" "Georgia" "Turkey" "Israel" "Taiwan" "Vietnam"
## [13] "S Korea" "Japan" "Pakistan" "India"
target.countries <- unique(d$target)
target.countries
## [1] "Russia" "US" "Iran" "China" "N Korea"
## [6] "Canada" "UK" "France" "Germany" "Poland"
## [11] "Estonia" "Lithuania" "Ukraine" "Georgia" "Turkey"
## [16] "Israel" "Saudi Arabia" "Syria" "Lebanon" "Taiwan"
## [21] "Japan" "India" "Vietnam" "Philippines" "S Korea"
## [26] "Pakistan"
target.source.intersection <- intersect(target.countries , source.countries)
target.source.intersection
## [1] "Russia" "US" "Iran" "China" "N Korea" "Ukraine"
## [7] "Georgia" "Turkey" "Israel" "Syria" "Taiwan" "Japan"
## [13] "India" "Vietnam" "S Korea" "Pakistan"
all.countries <- unique(c(target.countries, source.countries)) # combine with c() first, then keep unique values
all.countries
## [1] "Russia" "US" "Iran" "China" "N Korea"
## [6] "Canada" "UK" "France" "Germany" "Poland"
## [11] "Estonia" "Lithuania" "Ukraine" "Georgia" "Turkey"
## [16] "Israel" "Saudi Arabia" "Syria" "Lebanon" "Taiwan"
## [21] "Japan" "India" "Vietnam" "Philippines" "S Korea"
## [26] "Pakistan"
Looks good!
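As a small illustration of the set operations used above, here is a sketch on two made-up vectors (`a` and `b` are hypothetical). It also shows why the vectors need to be combined with `c()` before calling `unique()`.

```r
a <- c("US", "Russia", "China")
b <- c("Russia", "China", "Iran")

both   <- intersect(a, b)   # elements present in both vectors
either <- union(a, b)       # unique elements present in either vector
both
either

# unique(c(a, b)) gives the same result as union(a, b).
# unique(a, b) does not: the second argument of unique() is `incomparables`,
# so b is never merged into the result.
same.as.union <- unique(c(a, b))
same.as.union
```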
Question. So what do you see? Do
source-territories and target-territories tend
to be the same? Write your answer below.
Your Answer: I see that they are different in the sense that source countries have a larger fiscal budget for attacks and are larger players in the world’s economy and the target countries are easier targets because they have smaller fiscal budgets in terms of what resources they have to mitigate attacks. Ironically, the large players in the source countries list are in the target countries list because they target each other.
To support your answer with some evidence, calculate the share of
territories in target.source.intersection to all the
territories in all.countries:
target.source.intersection <- intersect(target.countries , source.countries)
target.source.intersection
## [1] "Russia" "US" "Iran" "China" "N Korea" "Ukraine"
## [7] "Georgia" "Turkey" "Israel" "Syria" "Taiwan" "Japan"
## [13] "India" "Vietnam" "S Korea" "Pakistan"
all.countries <- unique(c(target.countries, source.countries))
target.source.intersection <- intersect(target.countries, source.countries)
# Share of territories that appear both as a target and as a source
length(target.source.intersection) / length(all.countries)
## [1] 0.6153846
table() (1 point)
We often want to narrow our focus to specific cases. For example, we
are only interested in the territories that experienced a lot of
attacks. Function table() will calculate the number of times
each territory appeared in column target.
table(d$target) # check the results without storing them in an object
##
## Canada China Estonia France Georgia Germany
## 2 7 4 3 6 3
## India Iran Israel Japan Lebanon Lithuania
## 20 14 11 14 1 4
## N Korea Pakistan Philippines Poland Russia S Korea
## 5 7 5 3 11 23
## Saudi Arabia Syria Taiwan Turkey UK Ukraine
## 7 1 7 4 3 15
## US Vietnam
## 82 4
target.by.frequency <- table(d$target) # store the results in an object
target.by.frequency # check what is inside this object
##
## Canada China Estonia France Georgia Germany
## 2 7 4 3 6 3
## India Iran Israel Japan Lebanon Lithuania
## 20 14 11 14 1 4
## N Korea Pakistan Philippines Poland Russia S Korea
## 5 7 5 3 11 23
## Saudi Arabia Syria Taiwan Turkey UK Ukraine
## 7 1 7 4 3 15
## US Vietnam
## 82 4
By examining target.by.frequency we see that the top 3
countries on the list include “US”, “S Korea”, and “India”. We can store
them in a new object manually:
target.by.frequency.top.3 <- c("US", "S Korea", "India")
target.by.frequency.top.3
## [1] "US" "S Korea" "India"
We could also specify the position of these three elements inside
target.by.frequency:
target.by.frequency.top.3 <- target.by.frequency[c(7,18,25)]
target.by.frequency.top.3
##
## India S Korea US
## 20 23 82
But there is a difference. table() basically creates a
vector of elements. Each element reflects the number of times a territory
occurred in our input column d$target. Each element also
has a name. To obtain just the list of names, we need to use
names() on target.by.frequency.top.3:
names(target.by.frequency.top.3)
## [1] "India" "S Korea" "US"
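The difference between positions, names, and counts can be seen on a tiny made-up vector (`x` and `counts` are hypothetical names). The sketch also shows indexing by name, which does not depend on where an element sits in the table.

```r
x <- c("US", "India", "US", "Iran", "US", "India")
counts <- table(x)

counts[c(1, 3)]             # by position (names are sorted alphabetically)
counts[c("India", "US")]    # by name, which still works if positions shift
names(counts)               # just the names, without the counts
```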
Use table() on d$source to create vector
source.by.frequency. Next, manually create a list of top-3
territories that launched attacks. Store the results in
source.by.frequency.top.3:
table(d$source)
##
## China Georgia India Iran Israel Japan N Korea Pakistan
## 74 1 7 33 9 3 26 13
## Russia S Korea Syria Taiwan Turkey Ukraine US Vietnam
## 65 7 1 1 2 2 21 1
source.by.frequency <- table(d$source)
source.by.frequency
##
## China Georgia India Iran Israel Japan N Korea Pakistan
## 74 1 7 33 9 3 26 13
## Russia S Korea Syria Taiwan Turkey Ukraine US Vietnam
## 65 7 1 1 2 2 21 1
# Manually list the three territories with the largest counts above
source.by.frequency.top.3 <- c("China", "Russia", "Iran")
source.by.frequency.top.3
## [1] "China" "Russia" "Iran"
Now create source.by.frequency.top.3 by specifying
the positions of top-3 elements in
source.by.frequency:
source.by.frequency <- table(d$source)
source.by.frequency
##
## China Georgia India Iran Israel Japan N Korea Pakistan
## 74 1 7 33 9 3 26 13
## Russia S Korea Syria Taiwan Turkey Ukraine US Vietnam
## 65 7 1 1 2 2 21 1
# Positions of China (1), Iran (4), and Russia (9) in the alphabetical table
source.by.frequency.top.3 <- source.by.frequency[c(1, 4, 9)]
source.by.frequency.top.3
##
## China Iran Russia
## 74 33 65
names(source.by.frequency.top.3)
## [1] "China" "Iran" "Russia"
source.by.frequency.top.3 <- names(source.by.frequency.top.3)
source.by.frequency.top.3
## [1] "China" "Iran" "Russia"
Nice! But can we obtain the top-3 territories in an automated way?
Yes. For example, we can use sort() on
target.by.frequency and subset first or last elements of
this vector. The following chunk obtains a list of three territories
that experienced the fewest attacks. Note: the
elements in target.by.frequency are numbers, so
sort() orders by them (not by the names of
territories).
target.by.frequency <- table(d$target)
target.by.frequency <- sort(target.by.frequency)
target.by.frequency.bottom.3 <- target.by.frequency[1:3]
target.by.frequency.bottom.3
##
## Lebanon Syria Canada
## 1 1 2
target.by.frequency.bottom.3 <- names(target.by.frequency.bottom.3)
target.by.frequency.bottom.3
## [1] "Lebanon" "Syria" "Canada"
OK! Now let’s put together things we’ve learnt so far. In the
following chunk, create vectors target.by.frequency.top.5
and source.by.frequency.top.5.
source.by.frequency.top.5 should contain names of top-5
territories that launched attacks;
target.by.frequency.top.5 should contain names of top-5
territories that experienced cyberattacks.
target.by.frequency <- table(d$target)
target.by.frequency
##
## Canada China Estonia France Georgia Germany
## 2 7 4 3 6 3
## India Iran Israel Japan Lebanon Lithuania
## 20 14 11 14 1 4
## N Korea Pakistan Philippines Poland Russia S Korea
## 5 7 5 3 11 23
## Saudi Arabia Syria Taiwan Turkey UK Ukraine
## 7 1 7 4 3 15
## US Vietnam
## 82 4
source.by.frequency <- table(d$source)
source.by.frequency
##
## China Georgia India Iran Israel Japan N Korea Pakistan
## 74 1 7 33 9 3 26 13
## Russia S Korea Syria Taiwan Turkey Ukraine US Vietnam
## 65 7 1 1 2 2 21 1
target.by.frequency <- sort(target.by.frequency, decreasing = TRUE)
target.by.frequency
##
## US S Korea India Ukraine Iran Japan
## 82 23 20 15 14 14
## Israel Russia China Pakistan Saudi Arabia Taiwan
## 11 11 7 7 7 7
## Georgia N Korea Philippines Estonia Lithuania Turkey
## 6 5 5 4 4 4
## Vietnam France Germany Poland UK Canada
## 4 3 3 3 3 2
## Lebanon Syria
## 1 1
source.by.frequency <- sort(source.by.frequency, decreasing = TRUE)
source.by.frequency
##
## China Russia Iran N Korea US Pakistan Israel India
## 74 65 33 26 21 13 9 7
## S Korea Japan Turkey Ukraine Georgia Syria Taiwan Vietnam
## 7 3 2 2 1 1 1 1
target.by.frequency.top.5 <- names(target.by.frequency[1:5])
target.by.frequency.top.5
## [1] "US" "S Korea" "India" "Ukraine" "Iran"
source.by.frequency.top.5 <- names(source.by.frequency[1:5])
source.by.frequency.top.5
## [1] "China" "Russia" "Iran" "N Korea" "US"
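The same steps can be chained into one expression. The sketch below uses a made-up vector (`x` is hypothetical) and picks the top 2 instead of the top 5.

```r
x <- c("US", "India", "US", "Iran", "US", "India")

# Count, sort from most to least frequent, keep the first two, take the names
sorted.counts <- sort(table(x), decreasing = TRUE)
top.2 <- names(sorted.counts[seq_len(2)])
top.2
```

Ties are not handled specially here: if two territories share a count, which of them makes the cut depends on the sort order.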
The other important feature of table() is to contrast
variables against each other. For example, we can contrast different
methods of cyberattacks with the type of targets:
table(d$method, d$attack_on)
##
## Government Military Private
## DDoS 24 6 16
## Defacement 20 1 7
## Infiltration 18 18 12
## Intrusion 70 24 50
Question. What is the most common method used against the government agencies? Military? Private? What is the most common cyber method overall?
Your Answer: Intrusion is the most common method against government agencies, Intrusion for military as well, Intrusion for Private as well, OVERALL: Intrusion.
Use table() to check how successful cyberattacks are
(variable success in d) against different
targets (variable attack_on)
table(d$success , d$attack_on)
##
## Government Military Private
## 0 128 43 83
## 1 4 6 2
Question. Attackers have higher chances when they attack …
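Your Answer: Military targets. 6 of 49 attacks on military targets succeeded (about 12%), compared with 4 of 132 (about 3%) on government agencies and 2 of 85 (about 2%) on private firms.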
Now, analyze what methods are usually more successful than the others
(variable method in d)
table(d$method , d$success)
##
## 0 1
## DDoS 44 2
## Defacement 28 0
## Infiltration 43 5
## Intrusion 139 5
round(tapply(d$success, d$method, mean), 3) # share of successful attacks by method
## DDoS Defacement Infiltration Intrusion
## 0.043 0.000 0.104 0.035
Question. Calculate the ratio of successful attacks for each method …
Your Answer: DDoS: 2/46 ≈ 4.3%; Defacement: 0/28 = 0%; Infiltration: 5/48 ≈ 10.4%; Intrusion: 5/144 ≈ 3.5%. Infiltration has the highest success ratio.
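One way to turn a cross-tabulation into ratios is `prop.table()`. The sketch below uses made-up counts (the matrix `tab` and methods "A" and "B" are hypothetical) rather than the course data.

```r
# Toy cross-tabulation (made-up counts): rows are methods, columns are success
tab <- matrix(c(8, 2,
                9, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(method = c("A", "B"), success = c("0", "1")))

# margin = 1 divides every cell by its row total,
# so column "1" holds the share of successful attacks per method
shares <- prop.table(tab, margin = 1)
shares[, "1"]
```

With `margin = 2` the division would be by column totals instead, which answers a different question (what share of successes came from each method).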
We often conduct the analysis on a subset of data. One way to subset
data is to specify the value of some variable in our dataset. (Remember
that in this case we need to use ==, not =.)
For example, we can subset all observations from 2012 using
data.table syntax:
d.2012 <- d[year == 2012,]
# Note 1: Because the class of `d` is `data.table`, we do not need to specify the data object inside the square brackets.
# If we use an old-school format like `data.frame`, we need to be explicit: d[d$year == 2012,]
# Note 2: Do not forget to put a comma after your logical expression. This way, R understands that we apply our logical expression to the rows, not to the columns
The same is true, if we want to focus on a specific territory. Here is an example for India as a target:
d.india <- d[target == "India",]
We can also make complex logical expressions with ‘&’ and subset on them:
d.india.2012 <- d[target == "India" & year == 2012,]
d.india.2012
We can also subset with multiple values. The operator
%in% is our good friend here:
countries.of.interest <- c("S Korea", "N Korea")
d.subset <- d[target %in% countries.of.interest,]
head(d.subset)
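For comparison, here is a sketch of the same kind of subsetting on a plain data.frame with made-up rows (`toy` is hypothetical). It combines %in% with a range of years, and, as Note 1 above says, the columns have to be referenced through the object.

```r
toy <- data.frame(
  target = c("India", "US", "S Korea", "India", "N Korea"),
  year   = c(2009, 2012, 2012, 2013, 2015),
  stringsAsFactors = FALSE
)

# Keep rows whose target is in the list AND whose year is between 2010 and 2014
keep <- toy$target %in% c("India", "N Korea") & toy$year >= 2010 & toy$year <= 2014
toy.subset <- toy[keep, ]
nrow(toy.subset)
```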
Now, subset three territories that experienced the least number of
attacks (remember target.by.frequency.bottom.3?). Also,
only keep observations from 2010 to 2014. Store the results in
d.bottom.3.from.2010.to.2014.
bottom_3_territories <- c("Lebanon", "Syria", "Canada")
d_bottom_3 <- d[d$target %in% bottom_3_territories, ]
d_bottom_3_from_2010_to_2014 <- d_bottom_3[d_bottom_3$year >= 2010 & d_bottom_3$year <= 2014, ]
d.bottom.3.from.2010.to.2014 <- d_bottom_3_from_2010_to_2014
head(d.bottom.3.from.2010.to.2014)
How many observations do you have in
d.bottom.3.from.2010.to.2014? Use nrow() on
your data object to answer this question.
num_observations <- nrow(d.bottom.3.from.2010.to.2014)
num_observations
## [1] 3
Tables are great. But it is usually hard to infer systematic patterns in your data while looking at them. For example, do the territories usually mirror each other's attacks? A simple way to check it is to produce a Sankey Diagram.
To plot our Sankey Diagrams we will use function
alluvial() from the alluvial package (yes, the
names are confusing, but it is what it is). Inside
alluvial() we need to specify two options. For
data we need to specify which columns represent targets and
sources. We can reference them by their names in the dataset:
data = d[,c("source","target")]. Alternatively, we can
reference them by their position: data = d[,1:2].
We also need to specify frequency with freq = d$num.
This option is useful when each row represents not a single attack, but
for example all the attacks from source == "Country A" to
target == "Country B". It is not our case, yet we are
required to specify this option.
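To see when freq matters, consider a sketch with made-up data (`toy` and `pairs` are hypothetical) in which the rows are first collapsed to one row per source-target pair. The summed `num` column then carries the number of attacks for each pair. The plotting call is guarded, so it only runs if the alluvial package is installed.

```r
toy <- data.frame(
  source = c("A", "A", "B", "A"),
  target = c("B", "B", "A", "C"),
  num    = 1
)

# Collapse to one row per source-target pair; num becomes the pair's frequency
pairs <- aggregate(num ~ source + target, data = toy, FUN = sum)
pairs

# The aggregated table can be plotted the same way, if alluvial is installed
if (requireNamespace("alluvial", quietly = TRUE)) {
  alluvial::alluvial(pairs[, c("source", "target")], freq = pairs$num)
}
```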
alluvial(data = d[,1:2], freq=d$num)
Well, this looks awful and messy. This is because we have a lot of
territories that appear in d only a couple of times.
Instead, let’s subset data and keep only territories that appear either in
target.by.frequency.top.5, or in
source.by.frequency.top.5. Store this subset of data in
d.for.alluvial.
# Top-5 vectors as computed earlier from the sorted frequency tables
target.by.frequency.top.5 <- c("US", "S Korea", "India", "Ukraine", "Iran")
source.by.frequency.top.5 <- c("China", "Russia", "Iran", "N Korea", "US")
relevant_territories <- unique(c(target.by.frequency.top.5, source.by.frequency.top.5))
d.for.alluvial <- d[d$target %in% relevant_territories | d$source %in% relevant_territories, ]
head(d.for.alluvial)
Now, make your Sankey Diagram with alluvial() using
d.for.alluvial as your input dataset.
if (!requireNamespace("alluvial", quietly = TRUE)) {
install.packages("alluvial")
}
library(alluvial)
alluvial(data = d.for.alluvial[, c("target", "source")],freq = d.for.alluvial$num,col = "purple",border = "pink",cex = 0.5)
Question. Describe interesting patterns you see in this diagram. And what about attacks? Do we see the symmetry?
Your Answer: The attacks look like they match in a sense that if target countries are attacked then the source country will receive a counter attack so long as resources permit.
ggplot (1 point)
Finally, we can use visual tools to explore trends in attacks with the
ggplot function. For example, we can look at how the number of
attacks changes from year to year. We will use
d.attacks.by.year. This dataset has only two columns,
year and attacks:
head(d.attacks.by.year)
ggplot works like Lego. You add up different elements to
build your final figure. First you need to specify your baseplate. Your
baseplate is ggplot(). It has two major options:
data (works like in alluvial(),
names() and other functions) and aes() –
stands for aesthetics. Within aes() we specify the
parameters of the plot we need, like x-axis or y-axis:
ggplot(data = d.attacks.by.year,
aes(x = year, y = attacks)
)
We specified the baseplate, but it is blank. This is because we can
use different layers to visualize our data-points. For example, we can
add a geom_point() layer to produce points:
ggplot(data = d.attacks.by.year,
aes(x = year, y = attacks)
) +
geom_point()
Use geom_line() instead of geom_point() and
reproduce our plot from the previous chunk:
library(ggplot2)
yearly_attacks <- aggregate(num ~ year, data = d, FUN = sum)
base_plot <- ggplot(data = yearly_attacks, aes(x = year, y = num))
final_plot <- base_plot +
geom_line(color = "purple") +
labs(title = "Trend in Cyberattacks Over Time", x = "Year", y = "Attacks") + theme_minimal() + theme(axis.text.x = element_text(angle = 50, hjust = 0.5))
final_plot
We can also specify that our observations are organized as groups. We
use color = group_variable to produce separate trends for
each group. Let’s take d.attacks.by.year.and.method
and make trends separately for each method used for cyberattacks:
ggplot(data = d.attacks.by.year.and.method,
aes(x = year, y = attacks, color = method) # We explicitly require ggplot to treat observations as groups
) + # Use the plus to add layers and other elements to the baseplate. This plus should always appear at the end of the previous line of code, not at the beginning of the new line.
geom_line()
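The Lego idea can be tried on a small made-up dataset (`toy` is hypothetical) by stacking two layers on the same baseplate. The chunk is guarded, so it only draws the plot if ggplot2 is installed.

```r
toy <- data.frame(
  year    = rep(2010:2012, times = 2),
  attacks = c(3, 5, 4, 1, 2, 6),
  method  = rep(c("DDoS", "Intrusion"), each = 3)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data = toy, aes(x = year, y = attacks, color = method)) +
    geom_line() +    # one line per method
    geom_point()     # points drawn on top of the lines
  print(p)
}
```

Because color is set inside aes() on the baseplate, both layers inherit the grouping.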
Question. Describe what you see here
Your Answer: It appears as though that Intrusion was the largest type of attack in the year 2015 and DDos came second, infiltration went down as well. Defacement is the lowest method towards 2015.
For the final task we will use data from
d.attacks.by.attack_on. This dataset summarizes
cyberattacks according to the type of their target. As in our main
dataset attack_on in d.attacks.by.attack_on
has three values: “Government”, “Military”, and “Private”. In the chunk
below, use ggplot to plot the year-trends in the number of
attacks by each type of their target. Your plot should only show the
attacks on private firms and government agencies (you will need to
exclude military). Also, plot the results only for years from 2009 to
2014. Your plot should include both geom_line and
geom_point layers.
library(ggplot2)
filtered_data <- d.attacks.by.attack_on[d.attacks.by.attack_on$attack_on %in% c("Private", "Government") & d.attacks.by.attack_on$year >= 2009 & d.attacks.by.attack_on$year <= 2014, ]
base_plot <- ggplot(data = filtered_data, aes(x = year, y = attacks, color = attack_on, group = attack_on))
final_plot <- base_plot + geom_line() + geom_point() + labs(title = "Annual Trends in Cyberattacks by Target", x = "Year",y = "Attacks",color = "Target Type") + scale_color_manual(values = c("Private" = "pink", "Government" = "purple")) + theme_minimal() + theme(axis.text.x = element_text(angle = 50, hjust = 0.5))
final_plot
Now click the triangle next to the ‘Knit’ button to compile the html version of your report. Submit your html file to the class website under “Problem Set 1”. If you cannot compile the html, that means you have some errors in your code. The R console shows you the line with an error. Check it and try to compile your report again. If (after spending some time) you cannot compile your html, submit your Rmd file for partial credit.
Good job!