Data Storytelling Scenarios

Questions to be considered

What type of data is most suited to answer your question?
Is the quality of the data sufficiently high to answer your question?
Is the information systematically flawed?

Web data quality: origin of online data

What are the primary sources of secondary data?

There may be situations where you are unable to retrace the source of online data.

If so, does it make sense to use data from the Web? Can we consider data from the Web reliable for journalism?

The transparency of online data generation

Is it legitimate to quote Wikipedia for scientific and journalistic purposes?

  Giles J. 2005. Internet encyclopaedias go head to head. Nature 438, 900-901.

  Rector LH. 2008. Comparison of Wikipedia and other encyclopedias for accuracy, breadth, and depth in historical articles. Reference Services Review 36(1), 7-22.

It is always advisable to find a second source and compare the content. This cross-checking is a vital part of journalistic practice.

Data quality depends on the purpose the data will serve in our story.

A sample of tweets from a random day is not useful for predicting electoral outcomes because it is likely to be biased: the people who tweet are not necessarily representative of the electorate.

Online data often lack quality in terms of representativeness
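
The representativeness problem can be made concrete with a small simulation. The sketch below is only an illustration in Python (the course material itself works in R), and every number in it is a made-up assumption: a hypothetical electorate split roughly 50/50, in which supporters of candidate A are three times as likely to post on a given day as everyone else.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical electorate: each voter supports candidate A with probability 0.5.
POPULATION = 100_000
voters = [random.random() < 0.50 for _ in range(POPULATION)]  # True = supports A


def posts_today(supports_a: bool) -> bool:
    """Return True if this voter posts on the sampled day (assumed rates)."""
    rate = 0.15 if supports_a else 0.05  # A's supporters post three times as often
    return random.random() < rate


# The "sample of tweets": only voters who happened to post on that day.
online_sample = [v for v in voters if posts_today(v)]

true_share = sum(voters) / len(voters)
sample_share = sum(online_sample) / len(online_sample)

print(f"Support for A in the electorate: {true_share:.1%}")
print(f"Support for A among posters:     {sample_share:.1%}")
```

Under these assumed posting rates the share among posters is expected to be about 0.15 / (0.15 + 0.05) = 75%, even though the electorate is evenly split. Collecting more tweets does not repair this, because the bias comes from who posts, not from how many posts are collected.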

Why web data can be of higher quality for the user

Are the people “representative” of the people I want to know something about?

Are the questions that I pose suited to solicit the answers to my problem?

Proxies: Indicators that do not directly measure the concept of interest, but which are strongly related.

Phone Sales

In many situations, choosing a data source is a trade-off between advantages and disadvantages, accuracy versus completeness, coverage versus validity, and so forth.

Five steps to guide our data collection process:

Make sure you know exactly what kind of information you need.
Find out whether there are any data sources on the Web that might provide direct or indirect information on your problem.
Develop a theory of the data generation process when looking into potential sources.
Balance advantages and disadvantages of potential data sources.
Make a decision!

Technologies for disseminating, extracting, and storing web data

Three areas that are important for data collection on the Web with R

Source: Automated Data Collection with R by Munzert S. et al.

Technologies for disseminating content on the Web

HyperText Markup Language (HTML): A hidden standard that structures how information is displayed in a browser

HTML
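
As a minimal sketch of what this markup looks like and how a program can read it, the Python snippet below (an illustrative stand-in; the course material works in R) parses a tiny hand-written page with the standard library's html.parser module. The page content, the example.org link, and the OutlineParser class name are all hypothetical.

```python
from html.parser import HTMLParser

# A tiny hand-written page; real pages are much larger but use the same building blocks.
PAGE = """<!DOCTYPE html>
<html>
  <head><title>City Council Votes</title></head>
  <body>
    <h1>Budget vote</h1>
    <p>The council approved the budget <b>7 to 2</b>.</p>
    <a href="https://example.org/minutes">Full minutes</a>
  </body>
</html>"""


class OutlineParser(HTMLParser):
    """Collect the opening tags, link targets, and non-empty text of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.links.extend(value for name, value in attrs if name == "href")

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


parser = OutlineParser()
parser.feed(PAGE)

print(parser.tags)   # the nesting structure the browser works from
print(parser.links)  # link targets stored in attributes
print(parser.text)   # the text fragments contained in the page
```

The tags themselves never appear on screen; the browser uses them to decide how the text is displayed, which is why the structure has to be parsed before the content can be collected.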

eXtensible Markup Language (XML): Hierarchical structures for data storage and exchange over the Web
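
To illustrate the hierarchical structure, here is a minimal sketch in Python (used only for illustration; the course material works in R) that reads a small XML document with the standard library's xml.etree.ElementTree and flattens it into rows. The election data, names, and vote counts are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented example data: elections contain candidates, candidates contain fields.
XML_DOC = """<elections>
  <election year="2016">
    <candidate party="A"><name>Kim</name><votes>5120</votes></candidate>
    <candidate party="B"><name>Park</name><votes>4870</votes></candidate>
  </election>
  <election year="2020">
    <candidate party="A"><name>Kim</name><votes>4630</votes></candidate>
    <candidate party="B"><name>Choi</name><votes>5390</votes></candidate>
  </election>
</elections>"""

root = ET.fromstring(XML_DOC)

# Walk the hierarchy (root -> election -> candidate) and build one flat row per candidate.
rows = []
for election in root:
    for candidate in election:
        rows.append({
            "year": int(election.get("year")),        # attribute of the parent element
            "party": candidate.get("party"),          # attribute of the candidate
            "name": candidate.findtext("name"),       # text of a child element
            "votes": int(candidate.findtext("votes")),
        })

for row in rows:
    print(row)
```

Unlike HTML, the tag names here are chosen by whoever publishes the data, so they describe the content (election, candidate, votes) rather than how it should be displayed.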

Technologies for information extraction from web documents

XPath: A query language used to select specific pieces of information from marked up documents such as HTML or XML. Extraction proceeds as a series of filtering and extraction steps, whose main purpose is to recast or reorganize information stored in marked up documents into formats suitable for further processing and analysis in R.
Selenium: AJAX technologies enable a website to update its visual appearance dynamically. To extract information from such AJAX-enriched webpages, the Selenium framework is useful: it collects web data by sending commands, such as mouse clicks or keyboard inputs, to a browser window.
Regular Expressions (RegEx): To extract the systematic components in text (e.g., numbers or uppercase letters), regular expressions are used. They are abstract sequences of characters that match concrete, recurring patterns in strings.
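
Two of these techniques can be combined in a short sketch: an XPath expression first selects the relevant elements, and a regular expression then pulls the number out of each selected text. The example is written in Python for illustration (the course material works in R) and uses only the standard library; ElementTree supports a limited subset of XPath, which is enough here. The report document and its figures are invented. Selenium is left out because it requires an external browser driver.

```python
import re
import xml.etree.ElementTree as ET

# Invented marked up document with free-text notes.
DOC = """<report>
  <item kind="sale"><region>North</region><note>Sold 1,250 units in Q1</note></item>
  <item kind="sale"><region>South</region><note>Sold 980 units in Q1</note></item>
  <item kind="return"><region>North</region><note>Returned 40 units in Q1</note></item>
</report>"""

root = ET.fromstring(DOC)

# XPath step: keep only the <note> children of <item> elements whose kind is "sale".
sale_notes = [note.text for note in root.findall(".//item[@kind='sale']/note")]

# Regex step: a standalone number, optionally with thousands separators.
# The word boundaries (\b) prevent a match on the digit inside "Q1".
NUMBER = re.compile(r"\b\d+(?:,\d{3})*\b")

units = []
for text in sale_notes:
    match = NUMBER.search(text)
    if match:
        units.append(int(match.group().replace(",", "")))

print(sale_notes)
print(units)
```

The division of labor is typical: XPath navigates the document structure, while regular expressions handle patterns inside unstructured text that the markup does not separate.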

Technologies for data storage

R is well suited to working with data storage technologies such as databases. R also offers many data management facilities for importing and exporting data in various formats.
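
The same storage workflow (write to a database, query it, export the result to a file format) can be sketched outside R as well. The Python example below is only an illustration of the idea, using the standard library's sqlite3 and csv modules; the sales table, its columns, and the figures are hypothetical, and the CSV is written to an in-memory buffer rather than to disk.

```python
import csv
import io
import sqlite3

# Hypothetical collected records: (region, year, units sold).
records = [("North", 2021, 1250), ("South", 2021, 980), ("North", 2020, 1100)]

# Store the records in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
conn.commit()

# Query the database: total units per region.
totals = conn.execute(
    "SELECT region, SUM(units) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

# Export the query result as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["region", "total_units"])
writer.writerows(totals)
csv_text = buffer.getvalue()

print(totals)
print(csv_text)
```

Keeping collected data in a database makes repeated collection runs easier to manage, while an export format such as CSV makes the result easy to share with tools that cannot query the database directly.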

Approaches to web data collection for journalism

W4-1: Data Storytelling Scenario

Shin Lee

2021-09-22