Twitter is a unique source of data for many reasons: along with text, it includes date and time information, plus data on the type of device and/or software used to post the tweet. What we often do not have reliable Twitter data on is location. There are two ways for Twitter to record your location. The first is set during your Twitter account setup, where you (optionally) fill in where you are located. Many users fill in joke locations (The Milky Way, 3rd Rock From the Sun). For location data to be usable, it would have to be somewhat standardized, e.g. separate text inputs for Town and State. Without those prompts, values vary wildly, from 'North Dakota' to 'Brooklyn' to '27510,' so it's really not very useful data.

The other way Twitter records your location is by geolocating your exact position. This is something you turn on or off by choosing whether Twitter can use your location on your device. Many users - most, in fact - have location services turned off. That means if you use geolocation data, you're only observing users who have opted in to location services for Twitter. That subset of Twitter users will drastically skew our results, so I discourage its use as well.

One other caveat with Twitter, which can be frustrating to new users: you can’t go back in time and look at every Tweet ever posted. (Access to this is referred to as Twitter’s ‘Firehose.’) Just searching Twitter for a term will only return results from the last seven days. How can we get around this limitation?
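As a quick illustration of that window, here's roughly what a term search looks like with the rtweet package (the package we'll set up later in this chapter; the #rstats query is just an example):

library(rtweet)

# grab up to 1,000 recent tweets matching a term; the standard search
# endpoint only reaches back about seven days, no matter how large n is
rstats <- search_tweets("#rstats", n = 1000, include_rts = FALSE)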

One way is to analyze an individual account, or Twitter handle - this will allow us to download thousands of Tweets, assuming we’re looking at a prolific user.

The other way, which is beyond the scope of this Chapter, is to set up a loop in R that grabs Tweets every minute, or hour, or day, and records them, over time, to a database. This (obviously) requires pre-planning, so you can’t just do a lengthy report on the history and progression of the #MeToo movement on Twitter: you can only record the present and future. (Note that some interesting historical Twitter activity has been downloaded and shared online, such as the accounts of everyone in the Trump White House during his Presidency.)
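For the curious, here is a bare-bones sketch of such a collector, assuming an authenticated rtweet session and rtweet's usual column names (the query, file name, and hourly interval are all illustrative):

library(rtweet)

repeat {
  # pull fresh matches; a real collector would also deduplicate by status_id
  batch <- search_tweets("#MeToo", n = 1000, include_rts = FALSE)
  keep  <- batch[, c("status_id", "created_at", "screen_name", "text")]
  new_file <- !file.exists("metoo_tweets.csv")
  write.table(keep, "metoo_tweets.csv", sep = ",",
              append = !new_file, col.names = new_file, row.names = FALSE)
  Sys.sleep(60 * 60)  # wait an hour, then pull again
}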

Why is all of this so complicated? Well, Twitter’s valuation is based almost entirely on the huge collection of data it has amassed (the Firehose), and selling access to historical Twitter data is one of the ways Twitter actually makes money.

Like many proprietary online sources of data, Twitter gives data analysts this limited access to their real-time data via what’s called an API, or Application Programming Interface. That means instead of downloading every tweet ever made onto your laptop, you can use this ‘interface’ to access the database of information online and only grab the relevant data. Another example may help clarify: How could we analyze the headlines of the New York Times over the last 50 years? We cannot download all of that text onto our computer, of course, but the Times provides limited historical access via their API as well. Again, the limitations are somewhat severe, as they are protecting their intellectual property: The NY Times API only allows you to get ten results every minute. In other words, you query the API, get 10 results, wait a minute, and then you can get the next 10 results.
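To make that query-wait-query rhythm concrete, here's a sketch of paging through the Times' Article Search endpoint with the httr and jsonlite packages (the search term and placeholder key below are mine, not the book's):

library(httr)
library(jsonlite)

base_url <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"

results <- list()
for (p in 0:4) {  # five pages of ten headlines each
  resp <- GET(base_url, query = list(q = "inflation", page = p,
                                     `api-key` = "YOUR_NYT_API_KEY"))
  results[[p + 1]] <- fromJSON(content(resp, as = "text"))
  Sys.sleep(60)  # each request returns 10 results; wait before the next
}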

Note: A student of mine made a package that automates this process [here](https://github.com/toldham2/nytall).

OK, so how do we gain access to an API? We’ll use Twitter as our example, as it’s overly complicated. Generally speaking, you need to have an account with the service. Twitter is free, and you need a Twitter account to access the API. (The New York Times is behind a paywall, but you can still create an account on their website and access the API for free, without a subscription.)

So, first and foremost, create a Twitter account if you don’t have one (don’t worry, you never have to Tweet).

Then, go to developer.twitter.com while signed in. You have to do two somewhat confusing things: create a Project, and then create an App. The App is what we’ll use to access the API; all Apps must reside inside a Project. To quote Twitter’s Help: “To create a Project, click on ‘New Project’ in your dashboard or the Projects & Apps page within the developer portal. You’ll only be able to see this option if you haven’t already created a Project. You will be prompted to create a Project name, description, and use case. You will also be asked to create a new App or connect an existing standalone App.”

Why all the steps? Twitter wants to make sure you do not use their API to maliciously set up a Bot account that posts Tweets.

And what is our aim here? We need Twitter to give us passwords, basically, that we can use to log in via R and access the API. These passwords are called the ‘API key’ and the ‘API secret.’ They are unique to each App.

The rest of this is best explained via video, and then we can get back into the coding.

(video embed: twitter setup)

Using Your Twitter App in R

First, we have to ‘tell’ Twitter who we are and what App we’re using, by entering our API key and secret. Let’s start by loading a package that will allow for authentication via a web browser:
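Here's a minimal sketch of that step, assuming the rtweet package and its create_token() helper (the app name, key, and secret below are placeholders for your own App's values):

library(rtweet)

# hand rtweet our App's credentials; this kicks off the browser-based
# authorization against our own Twitter account
token <- create_token(
  app             = "my_first_twitter_app",
  consumer_key    = "YOUR_API_KEY",
  consumer_secret = "YOUR_API_SECRET"
)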

If you are lucky, you’ll get a popup in your web browser completing the authentication process.

# get_timeline() expects a Twitter handle (screen name), not a hashtag;
# passing "#pontifex" fails with '34 - Sorry, that page does not exist.'
pope <- get_timeline("Pontifex", n = 18000, retryonratelimit = TRUE)
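If the call succeeds, a couple of quick sanity checks are worth running (assuming rtweet's usual column names):

nrow(pope)              # how many tweets did we actually get?
range(pope$created_at)  # how far back does the timeline reach?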