When something noteworthy happens in your neck of the woods (however small or big the woods might be), the following chain of events unrolls: newsfeeds publish the news, media react with expanded coverage and analysis, people react on Facebook, Twitter and other social networks, more extended analysis from the news outlets follows, TV stations continue with reports and interviews, and relevant Wikipedia pages get updated with new facts. This is how the digital media ecosystem continues to thrive through constant self-regulating cycles, leaving behind an unprecedented digital footprint. In fact, without such a footprint none of it would be perceived as real by anyone - if you can’t share it online then it doesn’t exist.
Behind each block (except the event itself) is a myriad of sites of the digital ecosystem competing for our time, clicks, bookmarks, Google PageRank score, FB likes, and ultimately traffic. What makes them stand out are timeliness, content, focus, digital savviness and culture, among other elements (e.g. see 7 Factors behind Success with Digital News). Data science backed analytics and visualizations are becoming an increasingly important factor in their success, as evident from the highest standards in visualization, statistics, and infographics set by The New York Times, The Economist, FiveThirtyEight and others (here and here).
Below is a simple but complete example of how a data scientist can contribute to, enhance, and complement digital news content. It illustrates a practical and realistic workflow of news flowing into data that becomes a visualization that contributes back to a story again.
This is a how-to follow-up to my post How Flynn’s Term Compares to the National Security Advisor Tenures since 1953, triggered by the abrupt resignation of President Trump’s national security adviser Michael Flynn on February 13. The main premise of the analysis is to demonstrate how dramatically short Michael Flynn’s tenure happened to be. I used data from the Wikipedia page National Security Advisor (United States). The visual is a bar chart with tenures of all NSC advisors with emphasis on:
This example demonstrates how information from the news event (the latest update to the Wikipedia page) was transformed into a visualization about the significance of Flynn’s resignation in light of the historical record of the rest of the NSC advisors. Below I dissect this process step-by-step with explanations and code snippets in R. The document you are reading resulted from an R Notebook script that automated this process from beginning to end.
If you found a data source as part of a web page, chances are it looks similar to this List of National Security Advisors found on the National Security Advisor (United States) Wikipedia page:
Figure 2. Screenshot of the List of National Security Advisors (Wikipedia).
The image above is a typical Wikipedia page displayed by a browser - it contains a table with the data we are interested in. To gain access to this table an R program must perform exactly the same steps the browser did to display it:
As with almost anything in R, there are many ways to accomplish a certain task. For the tasks above we chose packages httr and XML, both available from CRAN, to accomplish the following (listed in the order they apply):
library(httr)
secAdvisors = GET(
"https://en.wikipedia.org/",
path="wiki/National_Security_Advisor_(United_States)"
)
class(secAdvisors)
## [1] "response"
The resulting object secAdvisors (class "response") contains all information returned by the GET command against the Wikipedia URL “https://en.wikipedia.org/wiki/National_Security_Advisor_(United_States)”, including the HTML inside the content element:
summary(secAdvisors)
## Length Class Mode
## url 1 -none- character
## status_code 1 -none- numeric
## headers 27 insensitive list
## all_headers 1 -none- list
## cookies 7 data.frame list
## content 86770 -none- raw
## date 1 POSIXct numeric
## times 6 -none- numeric
## request 7 request list
## handle 1 curl_handle externalptr
library(XML)
htmlTree = htmlParse(content(secAdvisors, "text"), asText = TRUE)
class(htmlTree)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument"
Resulting object htmlTree contains all parsed HTML elements including 3 tables:
summary(htmlTree)
## $nameCounts
##
## a td span li div tr th sup img b i
## 313 160 129 120 58 45 44 37 34 30 25
## ul link p h3 cite h2 br meta script input small
## 20 13 13 12 8 8 5 5 5 4 4
## abbr table body dl dt form h1 head html label noscript
## 3 3 1 1 1 1 1 1 1 1 1
## ol strong title
## 1 1 1
##
## $numNodes
## [1] 1110
Next we find and extract table data into a list with 3 dataframes and save the 2nd dataframe, which contains historical data with all NSC advisors:
secAdvisorsTables = readHTMLTable(htmlTree, stringsAsFactors=FALSE)
advisors = secAdvisorsTables[[2]]
head(advisors)
## V1 V2 V3 V4 V5 V6 V7
## 1 Start End Days <NA> <NA> <NA> <NA>
## 2 1 Robert Cutler March 23, 1953 April 2, 1955 740 Dwight D. Eisenhower
## 3 2 Dillon Anderson April 2, 1955 September 1, 1956 519 <NA>
## 4 3 William H. Jackson September 1, 1956 January 7, 1957 129 <NA>
## 5 4 Robert Cutler January 7, 1957 June 24, 1958 533 <NA>
## 6 5 Gordon Gray June 24, 1958 January 13, 1961 934 <NA>
After inspecting it we confirmed that it does contain List of National Security Advisors table from the wiki page.
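Picking the table by position (`[[2]]`) works for the page as it looked at the time, but the index would shift if a table were added to or removed from the page. A minimal sketch of selecting the table by content instead is shown here. The helper name `findTableWith` is hypothetical (it is not part of the XML package), and the toy dataframes only stand in for the list returned by readHTMLTable:

```r
# Hypothetical helper (not part of the XML package): return the index of the
# first dataframe in a list that contains the given text in any of its cells.
findTableWith = function(tables, text) {
  hits = vapply(tables, function(tbl) {
    any(grepl(text, unlist(tbl), fixed = TRUE))
  }, logical(1))
  which(hits)[1]   # NA when no table matches
}

# Toy stand-ins for the list returned by readHTMLTable (made-up content)
toyTables = list(
  data.frame(V1 = c('Infobox', 'Seat'), V2 = c('x', 'y'),
             stringsAsFactors = FALSE),
  data.frame(V1 = c('1', '2'), V2 = c('Robert Cutler', 'Dillon Anderson'),
             stringsAsFactors = FALSE)
)

idx = findTableWith(toyTables, 'Robert Cutler')
idx
```

With the real list this could be used as `secAdvisorsTables[[findTableWith(secAdvisorsTables, 'Robert Cutler')]]`, assuming the name appears only in the table of interest.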
At this point all data from Wikipedia we intend to use resides inside dataframe advisors (see above). Unfortunately, like any online data source, such information is almost never 100% ready for analysis for several reasons:
Before taking on the task of fixing data always review the dataframe with the convenient utils::View function:
View(advisors)
After reviewing we can narrow down to the set of just 3 columns: V3, V4, V5 (for the sake of the exercise let’s ignore the column containing the number of days):
advisors = advisors[,3:5]
There are several ways to deal with empty rows, and using the complete.cases function is the most straightforward:
advisors = advisors[complete.cases(advisors), ]
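One caveat worth keeping in mind: complete.cases keeps only rows with no NA in any column, so it drops partially filled rows as well as entirely empty ones. A minimal sketch on a toy dataframe (the missing values below are made up for illustration and are not the Wikipedia table):

```r
# Toy dataframe: row 1 is mostly empty, row 3 misses only one value
toy = data.frame(
  advisor = c(NA, 'Robert Cutler', 'Dillon Anderson'),
  from    = c(NA, 'March 23, 1953', 'April 2, 1955'),
  to      = c('Days', 'April 2, 1955', NA),
  stringsAsFactors = FALSE
)

# TRUE only for rows where every column has a value
keep = complete.cases(toy)
keep

# Both the mostly empty row and the partially filled row are dropped
toy[keep, ]
```

If partially filled rows should survive, a filter on specific columns (for example `!is.na(toy$advisor)`) may be a better fit than complete.cases.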
Renaming columns 3, 4, and 5 to the meaningful names advisor, from, to:
names(advisors) = c('advisor','from','to')
head(advisors)
## advisor from to
## 2 Robert Cutler March 23, 1953 April 2, 1955
## 3 Dillon Anderson April 2, 1955 September 1, 1956
## 4 William H. Jackson September 1, 1956 January 7, 1957
## 5 Robert Cutler January 7, 1957 June 24, 1958
## 6 Gordon Gray June 24, 1958 January 13, 1961
## 7 McGeorge Bundy January 20, 1961 February 28, 1966
Dirty data comes in many forms and shapes. Observe these 4 rows, for example:
head(advisors[21:24,])
## advisor from to
## 24 Stephen Hadley January 26, 2005[11] January 20, 2009
## 25 James Jones[12] January 20, 2009 October 8, 2010
## 26 Tom Donilon[13] October 8, 2010 July 1, 2013[14]
## 27 Susan Rice[14] July 1, 2013[14] January 20, 2017
and note the footnotes appended to some of the values in each column. Keeping such footnotes would render advisors’ names unpresentable and dates invalid. Unfortunately, there is no single recipe for dealing with dirty data and custom code is required. The following application of the lapply function to the dataframe removes footnotes everywhere:
advisors[] = lapply(advisors, FUN=function(x) {
ifelse(regexpr('[', x, fixed = TRUE)>0,
substr(x, 1, regexpr('[', x, fixed = TRUE)-1),
x)
})
head(advisors[21:24,])
## advisor from to
## 24 Stephen Hadley January 26, 2005 January 20, 2009
## 25 James Jones January 20, 2009 October 8, 2010
## 26 Tom Donilon October 8, 2010 July 1, 2013
## 27 Susan Rice July 1, 2013 January 20, 2017
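The same cleanup could also be sketched with a regular expression. This is an alternative to the regexpr/substr code above, not the code used for the post. It assumes, as above, that everything from the first `[` to the end of the value is a footnote. The function name `stripFootnotes` is made up for this example:

```r
# Hypothetical helper: drop everything from the first '[' to the end of string
stripFootnotes = function(x) sub('\\[.*$', '', x)

# Values copied from the rows shown above
toy = data.frame(
  advisor = c('Stephen Hadley', 'James Jones[12]', 'Susan Rice[14]'),
  from    = c('January 26, 2005[11]', 'January 20, 2009', 'July 1, 2013[14]'),
  stringsAsFactors = FALSE
)

# Apply to every column while keeping the dataframe structure
toy[] = lapply(toy, stripFootnotes)
toy
```

Values without a footnote pass through unchanged because sub returns its input when the pattern does not match.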
At this point our original Wikipedia table contains all data we need for analysis, free of any artifacts of its HTML origin. But it is not ready for analysis and visualization just yet. Indeed, our dates are still strings and advisor tenures are missing.
Data that required some cleanup is likely still stored as strings even though it may contain numbers, dates, times, or currency. In our case both columns from and to contain dates but are still strings. Again, with many ways to parse and convert strings to dates in R, we chose to use package lubridate:
library(lubridate)
# remove the last 2 rows: the interim advisor and the advisor designate
n = nrow(advisors)
advisors = advisors[-c(n - 1, n), ]
advisors$from_date = mdy(advisors$from)
advisors$to_date = mdy(advisors$to)
head(advisors)
## advisor from to from_date to_date
## 2 Robert Cutler March 23, 1953 April 2, 1955 1953-03-23 1955-04-02
## 3 Dillon Anderson April 2, 1955 September 1, 1956 1955-04-02 1956-09-01
## 4 William H. Jackson September 1, 1956 January 7, 1957 1956-09-01 1957-01-07
## 5 Robert Cutler January 7, 1957 June 24, 1958 1957-01-07 1958-06-24
## 6 Gordon Gray June 24, 1958 January 13, 1961 1958-06-24 1961-01-13
## 7 McGeorge Bundy January 20, 1961 February 28, 1966 1961-01-20 1966-02-28
Notice that we had to remove 2 rows, with interim advisor Keith Kellogg and current advisor designate H.R. McMaster, because their terms contain invalid strings inside the to column but, more importantly, because neither is relevant for this analysis.
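Rather than relying on row positions, rows whose dates fail to parse could also be detected after the fact. The sketch below uses base R as.Date instead of lubridate so that it runs without extra packages. The string 'Incumbent' is a made-up placeholder for whatever non-date text the to column held:

```r
# %B matches full month names in the current locale, so force English names
invisible(Sys.setlocale('LC_TIME', 'C'))

# Two real end dates from the table plus a hypothetical non-date value
to = c('April 2, 1955', 'February 13, 2017', 'Incumbent')

# Unparseable strings become NA instead of raising an error
to_date = as.Date(to, format = '%B %d, %Y')
valid = !is.na(to_date)

# The values that would need attention before any date arithmetic
to[!valid]
```

Filtering with `valid` would keep only the rows with usable dates, regardless of where they sit in the table.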
The original data is often not all the data fed to the analytics and visualization steps. Having start and end dates tells us how long the term was, but not explicitly. To simplify analysis and visualization later on let’s derive the number of days for each term as a new column days:
advisors$days =
apply(advisors[,c('from_date','to_date')], 1, FUN=function(x) {
length(seq(as.Date(x[[1]]), as.Date(x[[2]]), by='day'))
})
head(advisors)
## advisor from to from_date to_date days
## 2 Robert Cutler March 23, 1953 April 2, 1955 1953-03-23 1955-04-02 741
## 3 Dillon Anderson April 2, 1955 September 1, 1956 1955-04-02 1956-09-01 519
## 4 William H. Jackson September 1, 1956 January 7, 1957 1956-09-01 1957-01-07 129
## 5 Robert Cutler January 7, 1957 June 24, 1958 1957-01-07 1958-06-24 534
## 6 Gordon Gray June 24, 1958 January 13, 1961 1958-06-24 1961-01-13 935
## 7 McGeorge Bundy January 20, 1961 February 28, 1966 1961-01-20 1966-02-28 1866
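Because Date objects support arithmetic, the same inclusive day count could also be sketched without apply and seq. The snippet below is an alternative under that assumption, not the code used for the post. It rebuilds the first three terms from the table above:

```r
# First three terms from the table above
from_date = as.Date(c('1953-03-23', '1955-04-02', '1956-09-01'))
to_date   = as.Date(c('1955-04-02', '1956-09-01', '1957-01-07'))

# Subtracting Dates gives the difference in days; add 1 to count both the
# first and the last day, matching length(seq(from, to, by = 'day'))
days = as.integer(to_date - from_date) + 1L
days
## [1] 741 519 129
```

The vectorized form avoids the round trip through character strings that apply performs on a dataframe with Date columns.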
Sometimes new data sources are required to construct more elaborate data before analysis begins. Then data from multiple sources is likely consolidated into a single analytical dataset. For simplicity this example doesn’t go that far (multiple data sources), but engineering new data, however simple or complex it might be, is both a necessary and important step towards the final result.
The table with advisors contains their names together with how long their terms were (in days). The abrupt resignation of Michael Flynn sets a new record by any administration for the shortest such term. But how much shorter should become evident from the visualization. The ggplot2 package is the de facto standard for visualization in R, but more importantly its API turns building and refining a visualization into an iterative and intuitive exercise.
We begin with a quick basic bar chart that will serve as a starting point so we can immediately move to addressing problems and making refinements.
library(ggplot2)
ggplot(advisors) +
geom_bar(aes(x=advisor, y=days), stat = 'identity', color='black', fill='white') +
coord_flip()
Figure 3. Basic Bar Chart.
The only unnecessary customization above was adding coord_flip, which swaps the x and y coordinates so that the bars are drawn horizontally rather than vertically.
Before getting down to the business of refining looks and aesthetics, notice that there are 2 data points containing multiple bars: Robert Cutler and Brent Scowcroft. After reviewing the NSC advisor table we find that they actually served as advisors multiple times, and hence their terms appear together based on the default position stack. Unfortunately, there is no easy way to address this problem with ggplot2 (that I know of), so for their terms to appear as separate data points let’s resort to some custom but rather trivial transformations:
advisors[advisors$advisor %in% c('Robert Cutler','Brent Scowcroft'),1] =
paste(advisors[advisors$advisor %in% c('Robert Cutler','Brent Scowcroft'),1], '-', rep(c('I','II'),2))
Both Robert Cutler and Brent Scowcroft appear twice - once for each of their 2 terms:
Figure 4. Basic Bar Chart after Splitting Bars with Multiple Terms.
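The rep(c('I','II'), 2) call relies on knowing in advance which advisors repeat and in which order their rows appear. A more general sketch, assuming only that repeated names mark repeated terms, numbers each occurrence with base R ave and as.roman. The short vector below is a toy stand-in for the advisor column:

```r
# Toy stand-in for advisors$advisor, in chronological order
advisor = c('Robert Cutler', 'Dillon Anderson', 'Robert Cutler',
            'Brent Scowcroft', 'Gordon Gray', 'Brent Scowcroft')

# Running count of each name: 1 for the first occurrence, 2 for the second...
occurrence = ave(seq_along(advisor), advisor, FUN = seq_along)

# Only names that appear more than once get a suffix
repeated = advisor %in% advisor[duplicated(advisor)]
advisor[repeated] = paste(advisor[repeated], '-',
                          as.character(as.roman(occurrence[repeated])))
advisor
```

This keeps working if another advisor serves a second term or if the rows are reordered chronologically.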
Since Michael Flynn’s tenure is the singular interest of this exercise, his name should stand out (would red be appropriate?). To accomplish this, a simple transformation and a new flag attribute suffice:
advisors$isFlynn = ifelse(advisors$advisor == 'Michael Flynn', TRUE, FALSE)
ggplot(advisors) +
geom_bar(aes(x=advisor, y=days), stat = 'identity', color='black', fill='white') +
coord_flip() +
theme(axis.text.y = element_text(colour=ifelse(advisors$isFlynn,
'red','black'),
face=ifelse(advisors$isFlynn,
'bold','plain')))
Figure 5. Making Data Point Stand Out.
I also changed the font face to bold for Flynn.
Why is Michael Flynn’s name not in red? The answer lies in the fact that while geom_bar uses the alphabetical order of its x = advisor, the isFlynn vector keeps the original order inherited from dataframe advisors. Such inconsistency sorts itself out - literally - with sorting of the dataframe:
advisors = advisors[order(advisors$days), ]
advisors$advisor = factor(advisors$advisor, levels=advisors$advisor[order(advisors$days)], ordered = TRUE)
Figure 6. Reordering Bars in Bar Chart.
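The mismatch between axis order and color order can be seen without drawing anything. The sketch below uses a three-row toy dataframe (names and day counts taken from this post) to compare the default alphabetical factor levels with the row order, and then shows that sorting the rows and fixing the levels to that order makes the two agree:

```r
# Toy subset of the advisors dataframe
toy = data.frame(
  advisor = c('Robert Cutler', 'Gordon Gray', 'Michael Flynn'),
  days    = c(741, 935, 25),
  stringsAsFactors = FALSE
)

# A character column is treated as a factor with alphabetical levels:
# this is the order the discrete axis follows
alphabetical = levels(factor(toy$advisor))

# The vector passed to element_text follows the row order instead
rowOrder = toy$advisor

# Sort the rows by days and make the factor levels follow the sorted rows
toy = toy[order(toy$days), ]
toy$advisor = factor(toy$advisor, levels = toy$advisor)

# Now level order and row order are the same
identical(levels(toy$advisor), as.character(toy$advisor))
```

Once the levels follow the rows, any per-row vector such as isFlynn lines up with the axis labels.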
Next we add more color with the fill aesthetic and a custom color scheme via scale_fill_manual:
Figure 7. Adding More Color in Bar Chart.
Finally, we add the following elements and adjust some properties:
ggthemes for the final version:
library(ggthemes)
ggplot(advisors) +
geom_bar(aes(advisor, days, fill=isFlynn), stat = 'identity') +
scale_fill_manual(values=c("#3C3B6E","#B22234"), guide="none") +
coord_flip() +
labs(title="Michael Flynn's 25 Days vs. The Rest",
subtitle="The National Security Advisors since 1953\nSource: https://goo.gl/rrWmeS © 2017 Gregory Kanevsky Infographics.",
y="Days in Office", x=NULL) +
theme_tufte(base_size = 16, ticks = FALSE) +
theme(axis.text.y = element_text(size=16, hjust = 0,
colour=ifelse(advisors$isFlynn,
"#B22234","#3C3B6E"),
face=ifelse(advisors$isFlynn,
'bold','plain')))
Figure 8. Final Version of Bar Chart.