The example below demonstrates what it takes for a data scientist to analyze a given data set and derive meaningful information from it. Data scientists frequently work with ambiguous data and are tasked with inferring meaningful information from it. Let’s take the example of the Global Terrorism Database (GTD).
This is an open-source database that includes information on terrorist attacks around the world from 1970 through 2017.
The GTD includes systematic data on domestic as well as international terrorist incidents that occurred during this time period and now includes more than 180,000 attacks. The database is maintained by researchers at the National Consortium for the Study of Terrorism and Responses to Terrorism (START), headquartered at the University of Maryland.
Let’s see how a data scientist would approach this data set and what kind of insights can be obtained.
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
data.url <- getURL("https://media.githubusercontent.com/media/mlforsachid/Data607-Project3/master/Misc/globalterrorism.csv")
attack <- read_csv(data.url,
  col_types = cols(
    nkill = col_double(),      # The number of total confirmed fatalities for the incident
    nhours = col_double(),
    propvalue = col_double(),
    nwound = col_double(),     # Number of confirmed non-fatal injuries to both perpetrators and victims
    nperpcap = col_double(),
    nhostkid = col_double(),
    nreleased = col_double(),
    nkillter = col_double(),
    nkillus = col_double(),
    nwoundus = col_double(),   # The number of confirmed non-fatal injuries to U.S. citizens, both perpetrators and victims
    ransomamt = col_double(),
    ransompaid = col_double(),
    nwoundte = col_double()
  )
)
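To see what the `col_types` argument does without downloading the full file, here is a minimal sketch on a three-row inline CSV. The rows are made up for illustration, `toy_csv` and `toy` are hypothetical names, and wrapping the string in `I()` assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical three-row extract; the values are invented for illustration
toy_csv <- "iyear,country_txt,success,nkill
1970,United States,1,0
1971,United Kingdom,0,
1972,United Kingdom,1,3
"

# I() marks the string as literal data rather than a file path (readr >= 2.0)
toy <- read_csv(
  I(toy_csv),
  col_types = cols(
    nkill = col_double()  # fix the type up front instead of relying on guessing
  )
)

str(toy$nkill)  # a double vector; the empty field is read as NA
```

Columns that are not named in `cols()` are still guessed, so only the columns whose type matters need to be listed.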
# iyear: the year in which the incident occurred.
attack %>%
  mutate(total = 1) %>%                   # each row is one attack, so total is 1 per row
  count(iyear, wt = total - success) %>%  # failed attempts per year
  cbind("failure") -> failure
colnames(failure)[3] <- "rate"            # the third column holds the outcome label
attack %>%
  count(iyear, wt = success) %>%          # successful attacks per year
  cbind("success") -> success
colnames(success)[3] <- "rate"
rbind(failure, success) %>%
  ggplot(aes(iyear, n)) +                 # x axis = year, y axis = number of attacks
  geom_line(aes(group = rate, colour = rate))
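The same success and failure counts can also be computed in a single pass. The sketch below uses a small invented data frame, and the names `toy`, `outcomes`, `n_success` and `n_failure` are hypothetical. It relies on `success` being coded as 0 or 1, as it is in the code above.

```r
library(dplyr)

# Hypothetical incidents; success is 1 for a successful attack, 0 otherwise
toy <- data.frame(
  iyear   = c(1970, 1970, 1970, 1971, 1971),
  success = c(1, 1, 0, 0, 1)
)

outcomes <- toy %>%
  group_by(iyear) %>%
  summarise(
    n_success = sum(success),      # successful attacks per year
    n_failure = sum(1 - success)   # failed attempts per year
  )

outcomes
```

For 1970 this gives two successes and one failure, and for 1971 one of each.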
attack %>%
  group_by(iyear, region_txt) %>%   # allow operations to be performed by groups
  count(wt = success) %>%           # number of successful attacks per year and region
  arrange(desc(n)) %>%              # arrange in descending order
  ggplot(aes(iyear, n)) +           # x axis = year, y axis = number of successful attacks
  geom_line(aes(group = region_txt), colour = "grey50") +
  geom_point(aes(colour = region_txt))
attack %>%
  # When matching against several values in filter, use %in% and not ==
  filter(parse_character(country_txt) %in% c("Singapore", "China", "United States", "United Kingdom")) %>% # select four countries
  group_by(iyear, country_txt) %>%  # allow operations to be performed by groups
  count(wt = success) %>%           # number of successful attacks per year and country
  ggplot(aes(iyear, n)) +           # x axis = year, y axis = number of successful attacks
  geom_line(aes(group = country_txt), colour = "grey50") +
  geom_point(aes(colour = country_txt))
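The comment about `%in%` versus `==` is worth a quick demonstration. This is a small base R sketch with invented values, and `country`, `wanted`, `eq_match` and `in_match` are hypothetical names.

```r
country <- c("China", "Singapore", "China", "Singapore")
wanted  <- c("Singapore", "China")

# == compares element by element and silently recycles the shorter vector
eq_match <- country == wanted
eq_match   # FALSE FALSE FALSE FALSE

# %in% asks, for each element of country, whether it appears anywhere in wanted
in_match <- country %in% wanted
in_match   # TRUE TRUE TRUE TRUE
```

With `==`, every row would be dropped here even though all four values are in the wanted set, which is why `%in%` is the safer choice inside `filter`.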
attack %>%
  filter(parse_character(country_txt) %in% c("United Kingdom")) %>% # select the United Kingdom only
  group_by(iyear, attacktype1_txt) %>%  # allow operations to be performed by groups
  count(wt = success) %>%               # number of successful attacks per year and attack type
  ggplot(aes(iyear, n)) +               # x axis = year, y axis = number of successful attacks
  geom_line(aes(group = attacktype1_txt), colour = "grey50") +
  geom_point(aes(colour = attacktype1_txt))
# Criterion 1: POLITICAL, ECONOMIC, RELIGIOUS, OR SOCIAL GOAL (CRIT1)
# 1 = The incident meets Criterion 1.
# 0 = The incident does not meet Criterion 1.
attack %>%
  filter(parse_character(country_txt) %in% c("United Kingdom")) %>% # select the United Kingdom only
  group_by(iyear, crit1) %>%                      # allow operations to be performed by groups
  count(success) %>%                              # count attacks by outcome within each year and crit1 value
  ggplot(aes(iyear, n, fill = factor(crit1))) +   # x axis = year, y axis = number of attacks; factor() gives a discrete fill
  geom_bar(stat = "identity") +                   # bars are stacked on top of each other
  ggtitle("Attack Agenda")
attack %>%
  select(iyear, attacktype1_txt, attacktype2_txt) %>%  # select variables
  filter(iyear == 2013) %>%                            # subset rows by condition
  gather(attacktype1_txt, attacktype2_txt, key = 'Type', value = 'Attack') %>% # transform wide to long
  filter(Attack != ".") %>%                            # remove all '.'
  ggplot(aes(x = Type, fill = Attack)) + geom_bar(position = "dodge") + # plot
  ggtitle("Types of Attack")
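The wide-to-long step uses `gather`, which tidyr 1.0 and later mark as superseded in favour of `pivot_longer`. As a sketch of the equivalent reshape, here is the same idea on a two-row invented data frame. The names `toy` and `long` are hypothetical, and missing secondary attack types are assumed to be `NA` here rather than `"."`.

```r
library(tidyr)

# Hypothetical incidents with a primary and an optional secondary attack type
toy <- data.frame(
  iyear = c(2013, 2013),
  attacktype1_txt = c("Bombing/Explosion", "Armed Assault"),
  attacktype2_txt = c("Armed Assault", NA),
  stringsAsFactors = FALSE
)

long <- pivot_longer(
  toy,
  cols = c(attacktype1_txt, attacktype2_txt),
  names_to = "Type",         # which column the value came from
  values_to = "Attack",      # the attack type itself
  values_drop_na = TRUE      # drop incidents with no secondary attack type
)

long
```

The result has one row per recorded attack type, so the incident with no secondary type contributes a single row.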
attack %>%
  select(iyear, country_txt, city) %>%   # select variables
  filter(country_txt == "United Kingdom") %>%
  group_by(city) %>%                     # group by city
  count() %>%                            # count the number of times each city appears
  arrange(desc(n)) %>%                   # sort by count, largest first
  head(n = 10) -> state                  # keep the top 10 cities
state
## # A tibble: 10 x 2
## # Groups: city [10]
## city n
## <chr> <int>
## 1 Belfast 2170
## 2 London 399
## 3 Londonderry 366
## 4 Armagh 125
## 5 Newry 112
## 6 Unknown 90
## 7 Lurgan 71
## 8 Crossmaglen 65
## 9 Strabane 65
## 10 Dungannon 62
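A more compact way to get a top-n table is `count` with `sort = TRUE` followed by `slice_head`. This is only a sketch on invented data, with hypothetical names `toy` and `top2`, and it assumes dplyr 1.0 or later for `slice_head`.

```r
library(dplyr)

# Hypothetical city column; the letters stand in for real city names
toy <- data.frame(
  city = c("A", "B", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

top2 <- toy %>%
  count(city, sort = TRUE) %>%  # count rows per city, largest count first
  slice_head(n = 2)             # keep the first two rows

top2
```

Unlike the `group_by` plus `count()` version above, the result here is not grouped by `city`.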
## plot the top 10 cities
attack %>%
  filter(country_txt == "United Kingdom") %>%
  select(iyear, city) %>%
  filter(city %in% state$city) %>%                    # keep only the top 10 cities
  ggplot(aes(x = city, fill = city)) + geom_bar() +   # one bar per city
  ggtitle("Attack Locations Bar Plot")
attack %>%
  select(iyear, country_txt, city) %>%   # select variables
  filter(country_txt == "United Kingdom") %>%
  group_by(city) %>%                     # group by city
  count() %>%                            # count the number of times each city appears
  arrange(desc(n)) %>%                   # sort by count, largest first
  head(n = 10) -> state                  # keep the top 10 cities
attack %>%
  filter(country_txt == "United Kingdom") %>%
  select(iyear, city) %>%
  filter(city %in% state$city) %>%       # select top 10
  filter(city != "Unknown") %>%          # remove unknown
  group_by(iyear, city) %>%              # group to do calculations
  count() %>%                            # attacks per year and city
  ggplot(aes(x = iyear, y = n)) + geom_line(aes(color = city)) + # plot
  ggtitle("Attack Timeline by city")
The example above shows that the following are the fundamental building blocks of any data analysis project: