Mining Twitter Data + Handling Strings

Albert Y. Kim
Monday 2015/04/06

Deviant Cartography

h/t to @armalarm42

Mining Twitter Data

Beer vs church map

What is the most important thing to keep in mind when observing such maps?

Mining Twitter Data

Strong selection biases at every step:

  • Who signs up to Twitter
  • Who tweets at all
  • Who tweets about these topics
  • Who lets their location be revealed

i.e. the sample is non-random, so it is not representative of the population as a whole, and hence the results of the sample are not generalizable to the entire US population.

However...

George Box, one of the most famous statisticians said: All models are wrong, but some are useful.

alt text

Create Twitter OAS Authentication

We are getting digital permission to open the Twitter pipeline for our use, just like any other developer.

  • Install twitteR package from RStudio (homepage is here)
  • Create a Twitter application at the Twitter developers page.
  • Make sure to give the app read, write and direct message authority.
  • Note your “API key”, “API secret”, “Access token”, and “Access token secret”. Do not share these with anyone nor save them in a publicly viewable file.

Handling Text in R

Text is computer programming are called (character) strings. Like handling dates, handling text is a non-sexy, but necessary, task. The stringr package tries to alleviate this problem.

  • Install the “current development version” of the package from https://github.com/hadley/stringr (not via the Packages window of RStudio)
  • Open and knit stringr_vignette.Rmd

Today's Exercise

  • Make sure you can run twitter.R code.
  • Getting familiar with all functions in the stringr_vignette.Rmd file.
  • For each function, play with the Examples at the end of the help file.
  • Things get tricky at the “Regular Expressions” section, so don't worry if things don't make complete sense just yet.
  • Sorry, there is no cheatsheet for this package.