What type of data is most suited to answer your question?
Is the quality of the data sufficiently high to answer your question?
Is the information systematically flawed?
There may be situations where you are unable to retrace the source of online data.
If so, does it make sense to use data from the Web? Can we consider Web data reliable enough for journalism?
Is it legitimate to quote Wikipedia for scientific and journalistic purposes?
Giles J. 2005. Internet encyclopaedias go head to head. Nature 438, 900-901.
Rector LH. 2008. Comparison of Wikipedia and other encyclopedias for accuracy, breadth, and depth in historical articles. Reference Services Review 36(1), 7-22.
Always find a second source and compare the content. Cross-checking is a vital element of journalistic practice.
A sample of tweets from a random day is not useful for predicting electoral outcomes because such a sample is likely to be biased.
Online data often lack quality in terms of representativeness.
Are the people “representative” of the people I want to know something about?
Are the questions that I pose suited to solicit the answers to my problem?
Proxies: Indicators that do not directly measure the concept of interest but are strongly related to it.
Phone Sales
In many situations, choosing a data source is a trade-off between advantages and disadvantages, accuracy versus completeness, coverage versus validity, and so forth.
Make sure you know exactly what kind of information you need.
Find out whether there are any data sources on the Web that might provide direct or indirect information on your problem.
Develop a theory of the data generation process when looking into potential sources.
Balance advantages and disadvantages of potential data sources.
Make a decision!
Three areas that are important for data collection on the Web with R
Source: Automated Data Collection with R by Munzert S. et al.
HTML
XPath: A series of filtering and extraction steps used to select specific pieces of information from marked-up documents such as HTML or XML. The main purpose of these steps is to reorganize information stored in marked-up documents into formats that are suitable for further processing and analysis in R.
Selenium: AJAX technologies enable a website to update its visual appearance dynamically. To extract information from AJAX-enriched webpages, the Selenium framework is useful: it collects web data by sending commands such as mouse clicks or keyboard inputs to a browser window.
Regular Expressions (RegEx): To extract the systematic components of text (e.g., numbers or uppercase letters), regular expressions are used. They are abstract sequences of characters that match such concrete, recurring patterns.
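A minimal sketch of how XPath and regular expressions can work together in R. It assumes the xml2 package is installed. The HTML fragment, the table id, the class names, and the sales figures are all hypothetical and only stand in for a downloaded page.

```r
library(xml2)  # assumption: xml2 is installed (read_html, xml_find_all, xml_text)

# Hypothetical HTML fragment standing in for a page fetched from the Web
page <- read_html('
<html><body>
  <table id="sales">
    <tr><th>Region</th><th>Phones sold</th></tr>
    <tr><td class="region">North</td><td class="units">1,250 units</td></tr>
    <tr><td class="region">South</td><td class="units">980 units</td></tr>
  </table>
</body></html>')

# XPath step: select the <td> cells by their class attribute
regions   <- xml_text(xml_find_all(page, "//td[@class='region']"))
units_raw <- xml_text(xml_find_all(page, "//td[@class='units']"))

# Regex step: drop every character that is not a digit, then convert to numbers
units <- as.numeric(gsub("[^0-9]", "", units_raw))

# Recast the extracted pieces into a data frame for further analysis
sales <- data.frame(region = regions, units = units, stringsAsFactors = FALSE)
print(sales)
```

The XPath expressions do the structural filtering (which cells), while the regular expression cleans the text inside each cell (which characters). Pages that build their content with AJAX would need a browser-driven tool such as Selenium before this step.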
R is well suited for working with data storage technologies such as databases. R also offers many data management facilities for importing and exporting data in various formats.
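As a small illustration of import and export, the sketch below uses only base R. The data frame and its values are hypothetical, and temporary files are used so that nothing is written to the working directory. Database access, which usually goes through add-on packages, is not shown.

```r
# Hypothetical collected data to be stored
scraped <- data.frame(
  region = c("North", "South"),
  units  = c(1250, 980),
  stringsAsFactors = FALSE
)

# Export to CSV (plain text, readable by other tools) and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(scraped, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Export to RDS (R's native binary format, preserves column types exactly)
rds_path <- tempfile(fileext = ".rds")
saveRDS(scraped, rds_path)
from_rds <- readRDS(rds_path)

str(from_csv)
str(from_rds)
```

CSV is the safer choice for sharing data with other software, while RDS keeps the R object as it was, including column types that a CSV round trip may re-infer.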