To obtain data using Facebook’s API, we need to create an app on Facebook’s developer website (https://developers.facebook.com). Follow that link and log in to your Facebook account using the button in the top right of the window. Once logged in, use the drop-down menu in the same location to select “Add a New App”:
Then select “Website” and give your app a name, such as “PAA2016_YourName”, and assign it any category. Click “Create App ID” to continue. The quick start page will now show some code, which you can ignore. Scroll to the bottom of the page and give your app a web address. Since this app is only for your personal use, you can use a page such as your directory page at your institution or your personal website if you have one. Once you enter an address, the page will expand. Continue down, look under “Next Steps” for “Skip to Developer Dashboard”, and click the link.
At the top of the developer dashboard you should find your app information for authentication with the Facebook API:
You will need to copy your “App ID” and “App Secret” (press “Show” to obtain it) into specific locations in the next part of this tutorial to authenticate. Do not use the example App ID in the image above and do not share your App Secret code with anyone else. For more information on your App Secret code, look here: https://developers.facebook.com/docs/facebook-login/security#appsecret. With your App ID and App Secret, you should now be able to authorize packages like Rfacebook and SocialMediaLab to access the Facebook API. These codes are not limited to use with R packages and can be used with any software built to utilize the Facebook API.
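One way to avoid typing the App Secret directly into scripts you might share is to store both credentials in environment variables and read them with Sys.getenv(). The variable names FB_APP_ID and FB_APP_SECRET below are our own invention; the literal-string approach used later in this tutorial works just as well for personal use:
# Optional: keep credentials out of the script via ~/.Renviron, e.g.:
#   FB_APP_ID=yourappid
#   FB_APP_SECRET=yourappsecret
fb.appid <- Sys.getenv("FB_APP_ID") # Read App ID from the environment
fb.appsecret <- Sys.getenv("FB_APP_SECRET") # Read App Secret from the environment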
Before we can get started, we need to clear our workspace, set a working directory, and install and load the packages we’ll be using. You will need to fill in your working directory below by replacing “FILL IN HERE” with the location on your computer where the workshop folder was placed. Use forward slashes in the path, even on Windows.
rm(list=ls())
setwd("FILL IN HERE\PAA_2016_SMDM_Workshop\Facebook")
install.packages("SocialMediaLab")
install.packages("Rfacebook")
install.packages("plyr")
install.packages("stringr")
install.packages("dplyr")
install.packages("igraph")
library(SocialMediaLab)
library(Rfacebook)
library(plyr)
library(stringr)
library(dplyr, pos=99) # Attach dplyr and igraph low in the search path
library(igraph, pos=100) # so they do not mask functions from plyr.
First we need to obtain a token to authorize our access to the Facebook API. For this you will need to fill in your Application ID and Application Secret below. These are taken as arguments by SocialMediaLab’s authentication function, which returns an OAuth token permitting it, and other Facebook packages such as Rfacebook, to access the Facebook API to collect data and perform tasks. The option “extended_permissions” enables additional permissions needed for advanced functions beyond the scope of this workshop, and “useCachedToken” tells the function to search your workspace for an existing token rather than creating a new one every time the script is run.
fb.appid <- "APPID GOES HERE"
fb.appsecret <- "APP SECRET GOES HERE"
fb_oauth <- AuthenticateWithFacebookAPI(appID = fb.appid,
appSecret = fb.appsecret,
extended_permissions = FALSE, # Public Info
useCachedToken = TRUE) # Use existing
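If no cached token is found, you can cache one yourself. Below is a minimal sketch using base R’s save() and load(), following the pattern in Rfacebook’s documentation; the file name "fb_oauth" is, to our understanding, the name useCachedToken looks for, so verify this against the package documentation:
save(fb_oauth, file = "fb_oauth") # Cache the token for reuse
# In a later session, load("fb_oauth") restores the fb_oauth object.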
The first example of Facebook data collection we will explore is collecting posts from public Facebook pages. Unlike collecting user data, this is not limited to your Facebook network. We will use Rfacebook’s getPage() function to get posts from the Population Association of America’s Facebook page. Comments in the code block below describe the content of the function’s arguments.
popassoc.page.df <- getPage(page="PopAssoc", # Takes page ID or page name
token=fb_oauth, # OAuth token
n=50, # Number of posts to return
since=NULL, # Date of earliest posts returned
until=NULL, # Date of latest posts returned
feed=TRUE) # T/F: Return posts by page non-owners
## 50 posts
We’ve collected up to 50 posts from the page. The data frame this function returns contains a number of variables, and we can take a look at what they hold:
names(popassoc.page.df) # Get names of returned columns
## [1] "from_id" "from_name" "message" "created_time"
## [5] "type" "link" "id" "likes_count"
## [9] "comments_count" "shares_count"
glimpse(popassoc.page.df) # Take a peek at the data
## Observations: 50
## Variables: 10
## $ from_id (chr) "524242131043367", "524242131043367", "52424213...
## $ from_name (chr) "Population Association of America", "Populatio...
## $ message (chr) "Have you checked this out?", "Have you checked...
## $ created_time (chr) "2016-03-23T05:54:50+0000", "2016-03-21T15:18:4...
## $ type (chr) "link", "link", "link", "link", "link", "link",...
## $ link (chr) "http://conta.cc/1SSjaEP", "http://conta.cc/1SW...
## $ id (chr) "524242131043367_793099594157618", "52424213104...
## $ likes_count (dbl) 1, 0, 0, 0, 1, 15, 0, 0, 1, 5, 6, 0, 119, 0, 1,...
## $ comments_count (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ shares_count (dbl) 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
The “message” component of these data, the text of posts, is likely to be the most useful for most research, as we can use it for text analysis, though there are many potential uses for the other components. For this workshop, we will use the text of posts to do sentiment analysis.
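For example, to glance at the text we will be working with (output not shown, since it changes as the page is updated):
head(popassoc.page.df$message, 3) # Peek at the first few post texts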
We can use the text content from pages to do sentiment analysis, much as one might with Twitter data. Let’s pick some pages likely to contain a large number of positive and negative words.
trump.page.df <- getPage("DonaldTrump", fb_oauth, n=10, feed=TRUE)
## 10 posts
sanders.page.df <- getPage("berniesanders", fb_oauth, n=10, feed=TRUE)
## 10 posts
Now we need to load a function to do sentiment scoring. This short script uses a simple algorithm: using dictionaries of positive and negative words, it counts the words in each provided sentence that match positive terms, subtracts the count of negative matches, and returns the difference as an integer score.
source("./sentiment.r") # Script for analysis. Author: Jeffrey Breen
To use this, we need dictionaries of positive and negative words. We’ve provided some relatively basic dictionaries, but you can easily produce your own. You need not be limited to “positive” and “negative” terms, either; with some careful planning, you could compare any two concepts using this same method (see the sketch after the dictionary samples below).
pos <- readLines("./opinion_lexicon/positive_words.txt") # Positive words
neg <- readLines("./opinion_lexicon/negative_words.txt") # Negative words
# Let's see what sort of words are "positive" or "negative".
sample(pos, size=10)
## [1] "stimulates" "amazed" "straighten" "pamperedly"
## [5] "breakthrough" "brilliances" "genius" "sparkling"
## [9] "sweeping" "glorious"
sample(neg, size=10)
## [1] "carnage" "overstated" "rift" "bastards"
## [5] "opportunistic" "decry" "lorn" "engulf"
## [9] "invalidity" "arduous"
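As a sketch of the point above about comparing other concepts, one could build small custom dictionaries by hand. The terms below are invented examples, not a validated lexicon:
# Hypothetical custom dictionaries comparing "economy" and "family" themes
econ.words <- c("economy", "jobs", "wages", "trade", "taxes")
family.words <- c("family", "children", "parents", "home", "community")
# score.sentiment(trump.page.df$message, econ.words, family.words)$score
# would then return (economy matches minus family matches) for each post.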
Now we use our sentiment function and dictionaries to score the text data from the pages.
sanders.ss <- score.sentiment(sentence = sanders.page.df$message, # Text to score
pos.words = pos, # Positive words
neg.words = neg)$score # Negative words
trump.ss <- score.sentiment(trump.page.df$message, pos, neg)$score
If you get an error about unrecognized characters, you may have characters the sentiment function was not able to filter out. This happens occasionally and may require additional data cleaning with regular expressions. Learning to use regular expressions is very useful if you plan to work with text data from social media. Here is a good starting point for regular expressions in R: http://www.regular-expressions.info/rlanguage.html
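As a starting point, two blunt cleaning options are sketched below; both simply discard problem characters such as emoji, which is crude but usually sufficient for this workshop’s purposes:
# Option 1: convert to ASCII, replacing unconvertible characters with ""
clean.messages <- iconv(trump.page.df$message, from = "UTF-8", to = "ASCII", sub = "")
# Option 2: a regular expression keeping only printable and whitespace characters
clean.messages <- gsub("[^[:graph:][:space:]]", "", trump.page.df$message)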
Now let’s get some summary statistics on our sentiment scores. First we define a function for the standard error of the mean, then display our results in a simple data frame.
st.err <- function(x){
return(sd(x) / sqrt(length(x)))
}
data.frame(Mean=c(mean(sanders.ss), mean(trump.ss)),
SE=c(st.err(sanders.ss), st.err(trump.ss)),
row.names=c("Sanders", "Trump"))
## Mean SE
## Sanders -0.3 0.6155395
## Trump 1.5 0.7637626
What if we have a topic we’re interested in collecting page data on, but we don’t know which pages to use? Rfacebook includes a function for page searches. Here we will search for pages with the word “demography” in their name and see what sort of pages, and what kind of data, this search returns.
demography.search.df <- searchPages("Demography", fb_oauth, n=20)
## 20 pages
# What sort of values do we get?
names(demography.search.df)
## [1] "id" "about" "category"
## [4] "description" "general_info" "likes"
## [7] "link" "city" "state"
## [10] "country" "latitude" "longitude"
## [13] "name" "talking_about_count" "username"
## [16] "website"
# Let's see what pages we found.
demography.search.df$name
## [1] "Demography"
## [2] "Medieval demography"
## [3] "Demography of Japan"
## [4] "Demography of the United Kingdom"
## [5] "Demography of the United States"
## [6] "Animal Demography Unit"
## [7] "Center for Studies in Demography and Ecology"
## [8] "Demography of Birmingham"
## [9] "Historical demography"
## [10] "UTSA Department of Demography"
## [11] "Stockholm University Demography Unit - SUDA"
## [12] "Centre for Economic Demography"
## [13] "Divorce demography"
## [14] "Italian society of economics demography and statistics"
## [15] "Demography and Sociology/Demography, U.C. Berkeley"
## [16] "DEMOGRAPHY - Institute for Population and Human Studies"
## [17] "Demography"
## [18] "Demography (journal)"
## [19] "Demography of Afghanistan"
## [20] "FSU Center for Demography and Population Health"
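With search results in hand, a natural next step is to feed one of these pages back into getPage(). The sketch below simply takes the first result; the choice of page and the 25-post limit are arbitrary:
# Collect recent posts from the first page returned by the search
demography.page.df <- getPage(page = demography.search.df$id[1],
                              token = fb_oauth,
                              n = 25)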