How Data Scientists Can Be Part of the Digital Media Ecosystem

Introduction

When something noteworthy happens in your neck of the woods (however small or big the woods might be), the following chain of events unrolls: newsfeeds publish the news, media react with expanded coverage and analysis, people react on Facebook, Twitter and other social networks, more extended analysis from the news outlets follows, TV stations continue with reports and interviews, and relevant Wikipedia pages get updated with new facts. This is how the digital media ecosystem continues to thrive through constant self-regulating cycles, leaving behind an unprecedented digital footprint. In fact, without such a footprint none of it would be perceived as real by anyone - if you can’t share it online then it doesn’t exist.

Behind each block (except the event itself) is a myriad of sites of the digital ecosystem competing for our time, clicks, bookmarks, Google PageRank score, FB likes, and ultimately traffic. What makes them stand out are timeliness, content, focus, digital savviness and culture, among other elements (e.g. see 7 Factors behind Success with Digital News). Data science backed analytics and visualizations are becoming an increasingly important factor in their success, as is evident from the highest standards in visualization, statistics, and infographics set by The New York Times, The Economist, FiveThirtyEight and others (here and here).

How-To of Visualization in the Context of Digital News

The simple but complete example below illustrates how a data scientist can contribute to, enhance, and complement digital news content. It follows a practical, realistic workflow: news flows into data, the data becomes a visualization, and the visualization contributes back to the story.

This is a how-to follow-up to my post How Flynn’s Term Compares to the National Security Advisor Tenures since 1953, triggered by the abrupt resignation of President Trump’s national security adviser Michael Flynn on February 13. The main premise of the analysis is to demonstrate how dramatically short Michael Flynn’s tenure happened to be. I used data from the Wikipedia page National Security Advisor (United States). The visual is a bar chart with the tenures of all NSC advisors, with emphasis on:

Michael Flynn’s tenure (single data point)
How significantly shorter it was compared to the rest of the advisors (all data points)

This example demonstrates how information from a news event (the latest update to a Wikipedia page) was transformed into a visualization about the significance of Flynn’s resignation in light of the historical record of the rest of the NSC advisors. Below I dissect this process step by step with explanations and code snippets in R. The document you are reading resulted from an R Notebook script that automated this process from beginning to end.

Web Page as a Data Source

If you found a data source as part of a web page, chances are it looks similar to this List of National Security Advisors found on the National Security Advisor (United States) Wikipedia page:

Figure 2. Screenshot of the List of National Security Advisors (Wikipedia).

The image above is a typical Wikipedia page displayed by a browser - it contains a table with the data we are interested in. To gain access to this table, an R program must perform exactly the same steps that the browser did to display it:

retrieve HTML page using HTTP protocol’s GET command
parse HTML page to extract its elements including table data

As with almost anything in R, there are many ways to accomplish a certain task. For the tasks above we chose the packages httr and XML, both available from CRAN, to accomplish the following (listed in the order they apply):

Using httr, execute the HTTP GET method that retrieves the HTML page from Wikipedia
Using XML, parse the HTML into a list of elements representing the HTML tree
Using XML, read table elements, including the List of National Security Advisors, into a list of R dataframes
Identify and extract the dataframe with the List of National Security Advisors table data

Step 1: execute HTTP GET method

library(httr)

secAdvisors = GET(
  "https://en.wikipedia.org/",
  path="wiki/National_Security_Advisor_(United_States)"
)
class(secAdvisors)

## [1] "response"

The resulting object secAdvisors (class "response") contains all the information returned by the GET command against the Wikipedia URL “https://en.wikipedia.org/wiki/National_Security_Advisor_(United_States)”, including the HTML inside its content element:

summary(secAdvisors)

##             Length Class       Mode       
## url             1  -none-      character  
## status_code     1  -none-      numeric    
## headers        27  insensitive list       
## all_headers     1  -none-      list       
## cookies         7  data.frame  list       
## content     86770  -none-      raw        
## date            1  POSIXct     numeric    
## times           6  -none-      numeric    
## request         7  request     list       
## handle          1  curl_handle externalptr
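
Before parsing, it can be worth confirming that the request actually succeeded. The following is a minimal sketch rather than part of the original workflow: fetch_html is a hypothetical helper name, and calling it assumes network access to Wikipedia. It relies on httr's stop_for_status and http_type to fail early on an error response or an unexpected content type.

```r
library(httr)

# Hypothetical helper (not from the original post): retrieve a page and
# stop with an informative error if the request did not succeed.
fetch_html = function(base_url, path) {
  resp = GET(base_url, path = path)
  stop_for_status(resp)                 # raises an R error on 4xx/5xx responses
  if (http_type(resp) != "text/html") { # guard against a non-HTML payload
    stop("Expected an HTML page, got: ", http_type(resp))
  }
  content(resp, as = "text", encoding = "UTF-8")
}

# Example call (requires network access):
# html = fetch_html("https://en.wikipedia.org/",
#                   "wiki/National_Security_Advisor_(United_States)")
```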

Step 2: parse HTML tree

library(XML)

htmlTree = htmlParse(content(secAdvisors, "text"), asText = TRUE)
class(htmlTree)

## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"  "XMLAbstractDocument"

The resulting object htmlTree contains all parsed HTML elements, including 3 tables:

summary(htmlTree)

## $nameCounts
## 
##        a       td     span       li      div       tr       th      sup      img        b        i 
##      313      160      129      120       58       45       44       37       34       30       25 
##       ul     link        p       h3     cite       h2       br     meta   script    input    small 
##       20       13       13       12        8        8        5        5        5        4        4 
##     abbr    table     body       dl       dt     form       h1     head     html    label noscript 
##        3        3        1        1        1        1        1        1        1        1        1 
##       ol   strong    title 
##        1        1        1 
## 
## $numNodes
## [1] 1110

Steps 3 and 4: read table elements from HTML tree and extract table data into dataframe

Next we find and extract the table data into a list with 3 dataframes and save the 2nd dataframe, which contains the historical data for all NSC advisors:

secAdvisorsTables = readHTMLTable(htmlTree, stringsAsFactors=FALSE)

advisors = secAdvisorsTables[[2]]
head(advisors)

##      V1  V2                 V3                V4                V5   V6                   V7
## 1 Start End               Days              <NA>              <NA> <NA>                 <NA>
## 2     1          Robert Cutler    March 23, 1953     April 2, 1955  740 Dwight D. Eisenhower
## 3     2        Dillon Anderson     April 2, 1955 September 1, 1956  519                 <NA>
## 4     3     William H. Jackson September 1, 1956   January 7, 1957  129                 <NA>
## 5     4          Robert Cutler   January 7, 1957     June 24, 1958  533                 <NA>
## 6     5            Gordon Gray     June 24, 1958  January 13, 1961  934                 <NA>

After inspecting it, we confirmed that it does contain the List of National Security Advisors table from the wiki page.
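
Picking the table by position (secAdvisorsTables[[2]]) stops working as soon as the page layout changes. One possible alternative, sketched here under assumptions, is to pick the first table that contains a known value. The example runs on a small made-up HTML string, not the real Wikipedia page, and toyHtml, toyTables and hasCutler are illustrative names only.

```r
library(XML)

# Made-up page with two tables; only the second one lists an advisor.
toyHtml = '<html><body>
<table><tr><td>Key</td><td>Value</td></tr><tr><td>a</td><td>1</td></tr></table>
<table><tr><td>Name</td><td>Start</td><td>End</td></tr>
<tr><td>Robert Cutler</td><td>March 23, 1953</td><td>April 2, 1955</td></tr></table>
</body></html>'

toyTree = htmlParse(toyHtml, asText = TRUE)
toyTables = readHTMLTable(toyTree, stringsAsFactors = FALSE)

# For each table, check whether any cell equals a value we expect to find.
hasCutler = sapply(toyTables, function(tbl) {
  any(unlist(tbl) == "Robert Cutler", na.rm = TRUE)
})

# Take the first matching table instead of relying on a fixed index.
toyAdvisors = toyTables[[which(hasCutler)[1]]]
toyAdvisors
```

The trade-off is that the known value (here an advisor's name) becomes the assumption instead of the table's position.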

Fixing Data

At this point all the data from Wikipedia we intend to use resides inside the dataframe advisors (see above). Unfortunately, as with any online data source, such information is almost never 100% ready for analysis, for several reasons:

some columns are irrelevant or contain no information
some rows are irrelevant, incomplete, or contain no information
column names may be meaningless or empty
data is dirty

Before taking on the task of fixing data, always review the dataframe with the convenient utils::View function:

View(advisors)

Problem 1: irrelevant columns

After reviewing, we can narrow down to a set of just 3 columns: V3, V4, V5 (for the sake of the exercise let’s ignore the column containing the number of days):

advisors = advisors[,3:5]

Problem 2: empty rows

There are several ways to deal with empty rows, and using the complete.cases function is the most straightforward:

advisors = advisors[complete.cases(advisors), ]
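
To see what complete.cases does, here is a small sketch on a made-up dataframe (toy is not part of the original data). Only rows with no NA in any column survive:

```r
# Made-up dataframe: the 2nd and 3rd rows each have a missing value.
toy = data.frame(
  advisor = c("Robert Cutler", NA, "Gordon Gray"),
  from    = c("March 23, 1953", "April 2, 1955", NA),
  stringsAsFactors = FALSE
)

keep = complete.cases(toy)   # TRUE only where every column is non-NA
toyClean = toy[keep, ]
toyClean
```

The order of the steps matters here: in the head(advisors) output shown earlier, column V7 is NA for most rows, so calling complete.cases before dropping that column would discard those rows as well.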

Problem 3: renaming columns

Renaming columns 3, 4, and 5 to the meaningful names advisor, from, to:

names(advisors) = c('advisor','from','to')

head(advisors)

##              advisor              from                to
## 2      Robert Cutler    March 23, 1953     April 2, 1955
## 3    Dillon Anderson     April 2, 1955 September 1, 1956
## 4 William H. Jackson September 1, 1956   January 7, 1957
## 5      Robert Cutler   January 7, 1957     June 24, 1958
## 6        Gordon Gray     June 24, 1958  January 13, 1961
## 7     McGeorge Bundy  January 20, 1961 February 28, 1966

Problem 4: cleaning up dirty data

Dirty data comes in many shapes and forms. Observe these 4 rows, for example:

head(advisors[21:24,])

##            advisor                 from               to
## 24  Stephen Hadley January 26, 2005[11] January 20, 2009
## 25 James Jones[12]     January 20, 2009  October 8, 2010
## 26 Tom Donilon[13]      October 8, 2010 July 1, 2013[14]
## 27  Susan Rice[14]     July 1, 2013[14] January 20, 2017

Note the footnotes appended to some of the values in each column. Keeping such footnotes would render the advisors’ names unpresentable and the dates invalid. Unfortunately, there is no single recipe for dealing with dirty data, and custom code is required. The following application of the lapply function to the dataframe removes footnotes everywhere:

advisors[] = lapply(advisors, FUN=function(x) {
  ifelse(regexpr('[', x, fixed = TRUE)>0,
         substr(x, 1, regexpr('[', x, fixed = TRUE)-1), 
         x)
})

head(advisors[21:24,])

##           advisor             from               to
## 24 Stephen Hadley January 26, 2005 January 20, 2009
## 25    James Jones January 20, 2009  October 8, 2010
## 26    Tom Donilon  October 8, 2010     July 1, 2013
## 27     Susan Rice     July 1, 2013 January 20, 2017
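
The same cleanup can also be written with a regular expression. This is an alternative sketch, not the code used for the post: strip_footnotes is a hypothetical helper, and it removes only bracketed numbers such as [12], whereas the version above cuts off everything from the first '[' onward.

```r
# Hypothetical helper: drop footnote markers like "[12]" wherever they occur.
strip_footnotes = function(x) {
  trimws(gsub("\\[[0-9]+\\]", "", x))
}

# Values taken from the rows shown above.
examples = c("January 26, 2005[11]", "James Jones[12]",
             "July 1, 2013[14]", "Susan Rice")
cleaned = strip_footnotes(examples)
cleaned

# Applied to the whole dataframe it would read:
# advisors[] = lapply(advisors, strip_footnotes)
```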

Engineering Data

At this point our original Wikipedia table contains all the data we need for analysis, free of any artifacts of its HTML origin. But it is not ready for analysis and visualization just yet: our dates are still strings and the advisor tenures are missing.

Data type conversion

Data that required some cleanup is likely still stored as strings, even though it may contain numbers, dates, times, or currency. In our case both columns from and to contain dates but are still strings. Again, there are many ways to parse and convert strings to dates in R; we chose the package lubridate:

library(lubridate)

# remove the last 2 rows: interim advisor Keith Kellogg and
# advisor designate H.R. McMaster
advisors = head(advisors, -2)
advisors$from_date = mdy(advisors$from)
advisors$to_date = mdy(advisors$to)

head(advisors)

##              advisor              from                to  from_date    to_date
## 2      Robert Cutler    March 23, 1953     April 2, 1955 1953-03-23 1955-04-02
## 3    Dillon Anderson     April 2, 1955 September 1, 1956 1955-04-02 1956-09-01
## 4 William H. Jackson September 1, 1956   January 7, 1957 1956-09-01 1957-01-07
## 5      Robert Cutler   January 7, 1957     June 24, 1958 1957-01-07 1958-06-24
## 6        Gordon Gray     June 24, 1958  January 13, 1961 1958-06-24 1961-01-13
## 7     McGeorge Bundy  January 20, 1961 February 28, 1966 1961-01-20 1966-02-28

Notice that we had to remove 2 rows, with interim advisor Keith Kellogg and current advisor designate H.R. McMaster, because their terms contain invalid strings in the to column but, more importantly, because neither is relevant for this analysis.
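
Rows with unparseable dates can also be located by checking which values mdy fails to convert, instead of relying on their position. The sketch below uses made-up values for the to column; in particular the string "Incumbent" is an assumption for illustration, not what the Wikipedia table actually contained.

```r
library(lubridate)

# Made-up "to" values: two valid dates and one placeholder string.
to = c("January 20, 2017", "February 13, 2017", "Incumbent")

parsed = mdy(to, quiet = TRUE)   # strings that cannot be parsed become NA
unparsed = is.na(parsed)

to[unparsed]        # the values that need attention
parsed[!unparsed]   # the dates that converted cleanly
```

Whether to drop such rows or repair them remains a decision for the analysis at hand.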

Constructing New Columns

The original data is often not all the data fed to the analytics and visualization steps. Having start and end dates tells us how long a term was, but not explicitly. To simplify analysis and visualization later on, let’s derive the number of days for each term as a new column days:

advisors$days = 
  apply(advisors[,c('from_date','to_date')], 1, FUN=function(x) {
    length(seq(as.Date(x[[1]]), as.Date(x[[2]]), by='day'))
  })

head(advisors)

##              advisor              from                to  from_date    to_date days
## 2      Robert Cutler    March 23, 1953     April 2, 1955 1953-03-23 1955-04-02  741
## 3    Dillon Anderson     April 2, 1955 September 1, 1956 1955-04-02 1956-09-01  519
## 4 William H. Jackson September 1, 1956   January 7, 1957 1956-09-01 1957-01-07  129
## 5      Robert Cutler   January 7, 1957     June 24, 1958 1957-01-07 1958-06-24  534
## 6        Gordon Gray     June 24, 1958  January 13, 1961 1958-06-24 1961-01-13  935
## 7     McGeorge Bundy  January 20, 1961 February 28, 1966 1961-01-20 1966-02-28 1866
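
Because the seq-based version counts both the first and the last day, the same numbers can be obtained with plain date subtraction plus one. This is an equivalent sketch under that assumption, shown on two terms: Robert Cutler's first term from the table above, and Michael Flynn's, assuming a start date of January 20, 2017 and the February 13 resignation date.

```r
from_date = as.Date(c("1953-03-23", "2017-01-20"))
to_date   = as.Date(c("1955-04-02", "2017-02-13"))

# Subtracting Dates gives a difference in days; add 1 to count both end points.
days = as.integer(to_date - from_date) + 1L
days

# Applied to the dataframe it would read:
# advisors$days = as.integer(advisors$to_date - advisors$from_date) + 1L
```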

Sometimes new data sources are required to construct more elaborate data before analysis begins. In that case data from multiple sources is likely consolidated into a single analytical dataset. For simplicity this example doesn’t go that far (multiple data sources), but engineering new data, however simple or complex it might be, is both a necessary and an important step towards the final result.

Visualizing Michael Flynn Term as NSC Advisor

The advisors table contains their names together with how long their terms were (in days). The abrupt resignation of Michael Flynn sets a new record by administration for the shortest such term. But how much shorter should become evident from the visualization. The ggplot2 package is the de facto standard for R, but more importantly its API turns building and refining a visualization into an iterative and intuitive exercise.

First Take

We begin with a quick basic bar chart that will serve as a starting point, so we can immediately move on to addressing problems and making refinements.

library(ggplot2)

ggplot(advisors) +
  geom_bar(aes(x=advisor, y=days), stat = 'identity', color='black', fill='white') +
  coord_flip()

Figure 3. Basic Bar Chart.

The only optional customization above was adding coord_flip, which swaps the x and y coordinates so that the bars are drawn horizontally instead of vertically.

Custom Data Transformations

Before getting down to the business of refining looks and aesthetics, notice that there are 2 data points containing multiple bars: Robert Cutler and Brent Scowcroft. After reviewing the NSC advisor table we find that they actually served as advisors multiple times, and hence their terms appear together based on the default position stack. Unfortunately, there is no easy way to address this problem with ggplot2 (that I know of), so for their terms to appear as separate data points let’s resort to some custom but rather trivial transformations:

advisors[advisors$advisor %in% c('Robert Cutler','Brent Scowcroft'),1] =
  paste(advisors[advisors$advisor %in% c('Robert Cutler','Brent Scowcroft'),1], '-', rep(c('I','II'),2))

Both Robert Cutler and Brent Scowcroft appear twice - once for each of their 2 terms:

Figure 4. Basic Bar Chart after Splitting Bars with Multiple Terms.
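
The transformation above hard-codes the two names and assumes exactly 2 terms each. A more general sketch, assuming the rows are in chronological order, numbers repeated names with ave; the vector advisor below is a made-up excerpt and termNo, nTerms and label are illustrative names.

```r
# Made-up excerpt of the advisor column, in chronological order.
advisor = c("Robert Cutler", "Dillon Anderson", "Robert Cutler",
            "Brent Scowcroft", "Brent Scowcroft")

# Running term number within each name, and total number of terms per name.
termNo = ave(seq_along(advisor), advisor, FUN = seq_along)
nTerms = ave(seq_along(advisor), advisor, FUN = length)

# Append a Roman numeral only where a name occurs more than once.
label = ifelse(nTerms > 1,
               paste(advisor, "-", as.character(as.roman(termNo))),
               advisor)
label
```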

Making a Certain Data Point Stand Out

Since Michael Flynn’s tenure is the singular interest of this exercise, his name should stand out (would red be appropriate?). To accomplish this, a simple transformation and a new flag attribute suffice:

advisors$isFlynn = advisors$advisor == 'Michael Flynn'

ggplot(advisors) +
  geom_bar(aes(x=advisor, y=days), stat = 'identity', color='black', fill='white') +
  coord_flip() +
  theme(axis.text.y = element_text(colour=ifelse(advisors$isFlynn,
                                                 'red','black'),
                                   face=ifelse(advisors$isFlynn,
                                               'bold','plain')))

Figure 5. Making Data Point Stand Out.

I also changed the font face to bold for Flynn.

Properly Ordering by Length of Tenure

Why is Michael Flynn’s name not in red? The answer lies in the fact that while geom_bar uses the alphabetical order of its x = advisor, the isFlynn vector keeps the original order inherited from the dataframe advisors. Such inconsistency sorts itself out - literally - by sorting the dataframe:

advisors = advisors[order(advisors$days), ]
advisors$advisor = factor(advisors$advisor, levels=advisors$advisor[order(advisors$days)], ordered = TRUE)

Figure 6. Reordering Bars in Bar Chart.
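
Why sorting fixes the label colours can be checked on a toy example. The sketch assumes that ggplot2 draws a discrete axis in the order of the factor levels, so a colour vector built from the sorted dataframe lines up with the axis labels only when the levels follow the same order. The three rows use day counts that appear in this post; toy and labelColours are illustrative names.

```r
toy = data.frame(
  advisor = c("Gordon Gray", "Michael Flynn", "Dillon Anderson"),
  days    = c(935, 25, 519),
  stringsAsFactors = FALSE
)

# Sort the rows by tenure, then fix the factor levels to that same order.
toy = toy[order(toy$days), ]
toy$advisor = factor(toy$advisor, levels = toy$advisor)

levels(toy$advisor)   # axis order: shortest tenure first

# One colour per row; because rows and levels now agree, the colours
# match the axis labels position by position.
labelColours = ifelse(toy$advisor == "Michael Flynn", "red", "black")
labelColours
```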

Fill the Bars with Color to Distinguish Flynn Even More

As Flynn’s name stands out in red, his bar should too. We use the fill aesthetic and the custom color scale scale_fill_manual:

Figure 7. Adding More Color in Bar Chart.
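
A minimal sketch of this step on a toy dataframe might look as follows. The colours ('white' and 'red') and the toy rows are assumptions for illustration; the exact code behind the figure is not shown in this post. Naming the values by 'FALSE' and 'TRUE' ties each colour to the flag value explicitly instead of depending on the order of the values.

```r
library(ggplot2)

toy = data.frame(
  advisor = c("Michael Flynn", "Dillon Anderson", "Gordon Gray"),
  days    = c(25, 519, 935),
  stringsAsFactors = FALSE
)
toy$advisor = factor(toy$advisor, levels = toy$advisor)  # keep the sorted order
toy$isFlynn = toy$advisor == "Michael Flynn"

p = ggplot(toy) +
  geom_bar(aes(x = advisor, y = days, fill = isFlynn),
           stat = 'identity', color = 'black') +
  # map each flag value to a colour and hide the legend
  scale_fill_manual(values = c('FALSE' = 'white', 'TRUE' = 'red'),
                    guide = 'none') +
  coord_flip()
p
```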

Final Refinements

Finally, we add the following elements and adjust some properties:

Title, subtitle and axis labels
Theme using package ggthemes
Larger font and left adjustment for the names of advisors

for the final version:

library(ggthemes)

ggplot(advisors) +
  geom_bar(aes(advisor, days, fill=isFlynn), stat = 'identity') +
  scale_fill_manual(values=c("#3C3B6E","#B22234"), guide="none") +
  coord_flip() +
  labs(title="Michael Flynn's 25 Days vs. The Rest", 
       subtitle="The National Security Advisors since 1953\nSource: https://goo.gl/rrWmeS © 2017 Gregory Kanevsky Infographics.", 
       y="Days in Office", x=NULL) +
  theme_tufte(base_size = 16, ticks = FALSE) +
  theme(axis.text.y = element_text(size=16, hjust = 0, 
                                   colour=ifelse(advisors$isFlynn,
                                                 "#B22234","#3C3B6E"),
                                   face=ifelse(advisors$isFlynn,
                                               'bold','plain')))

Figure 8. Final Version of Bar Chart.