Social networks, Facebook in particular, have become part of many people's lives, and R is a powerful tool for harvesting Facebook data. This blog post shows you how to count how often each word appears across all of your posts.
Here is a short step-by-step guide to extracting Facebook data from your account:
Note: In this guide, I will use data for “Posts” and “Photos & Videos” only.
First, I load the necessary libraries:
## Package "jsonlite": This package has tools to map between the data in the .json file downloaded from Facebook and R data types.
library(jsonlite)
## Package "tidyverse": This popular collection bundles many useful packages such as ggplot2, dplyr, tidyr, readr,...
library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag() masks stats::lag()
## Package "qdapRegex": This package has tools for removing, extracting, and replacing abbreviations, dates, dollar amounts, email addresses, hash tags,...
library(qdapRegex)
##
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:dplyr':
##
## explain
## The following object is masked from 'package:ggplot2':
##
## %+%
## The following object is masked from 'package:jsonlite':
##
## validate
## Package "tidytext": This package has tools to split text into tokens, which makes it easy to normalize upper and lower case, numbers, symbols,...
library(tidytext)
Then, I move the downloaded JSON file into R's working directory.
## getwd() returns the directory that R is currently working in.
getwd()
## [1] "/Users/dao/dunganduyen"
The download arrives as a .zip archive. I unzip it and copy the “Posts” folder into the R working directory, then load the data into the variable myposts with the following commands:
## Turn .JSON data into data type usable in R.
FBposts = fromJSON("facebook-CopThoiNay/posts/your_posts.json")
myposts = FBposts$status_updates
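Before relying on column names such as data, it can help to look at how fromJSON() shapes nested JSON. The sketch below uses a tiny hand-written JSON string, not a real export. Its field names (timestamp, post) and the names toy_json and toy_posts are assumptions made for illustration, and a real file will contain more fields.

```r
library(jsonlite)

# Hypothetical miniature of your_posts.json; the real export has more fields.
toy_json <- '{"status_updates": [
  {"timestamp": 1546300800, "data": [{"post": "Happy bday anh"}]},
  {"timestamp": 1546387200, "data": [{"post": "lol"}]}
]}'

# fromJSON() accepts a JSON string as well as a file path.
toy <- fromJSON(toy_json)
toy_posts <- toy$status_updates

# The nested "data" array becomes a list column holding one small
# data frame per post, which is why it needs further processing.
str(toy_posts, max.level = 2)
names(toy_posts)
```

Running str() on the real myposts object in the same way shows which columns are plain vectors and which are nested lists.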
Now all my posts are in the variable myposts$data. I want to extract only the text that sits between quotation marks. I store it in a new variable, myposts$posttext, then convert it to type character.
## Extract the text between each pair of quotation marks, then convert it to character type in R.
myposts$posttext = myposts$data %>%
rm_between('"','"',extract = TRUE)
## Warning in stringi::stri_extract_all_regex(text.var, pattern): argument is
## not an atomic vector; coercing
myposts$posttext = as.character(myposts$posttext)
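One thing to watch for: with extract = TRUE the result is a list, and as.character() on a list deparses any element that holds several matches into a string like c("a", "b"). The sketch below is a base R alternative, not the method used above. It assumes each entry looks like the deparsed strings in the hypothetical raw vector, and it collapses all quoted pieces of an entry into one string.

```r
# Hypothetical stand-ins for deparsed entries of myposts$data.
raw <- c('list(post = "Happy bday anh")',
         'list(post = "lol", title = "em")')

# Find every quoted piece: a quote, any run of non-quote characters, a quote.
pieces <- regmatches(raw, gregexpr('"[^"]*"', raw))

# Strip the quotes and join the pieces so each entry stays a single string.
posttext <- vapply(
  pieces,
  function(x) paste(gsub('"', "", x, fixed = TRUE), collapse = " "),
  character(1)
)
posttext
```

Keeping one string per entry means every post still occupies exactly one row when the text is tokenized later.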
At this stage, the data in myposts is not yet in a form suitable for tidy text analysis. Each row holds the full text of a post, so I have to split it into one token (a unit of text, most often a word) per row.
I also want to remove stop words, which are common words like “the”, “of”, and “to” that carry little meaning for text analysis.
## Split the text into one token per row and remove stop words.
mypost_text = myposts %>%
unnest_tokens(word, posttext) %>%
anti_join(stop_words)
## Joining, by = "word"
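To see what these two steps do, here is a minimal sketch on a made-up two-row table. The table posts and its id column are assumptions for illustration; stop_words is the data set that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical posts; posttext plays the role of myposts$posttext.
posts <- tibble(
  id = 1:2,
  posttext = c("Happy bday to you", "The party was happy")
)

tokens <- posts %>%
  # One lowercased word per row; the other columns are carried along.
  unnest_tokens(word, posttext) %>%
  # Drop every row whose word appears in the stop word list.
  anti_join(stop_words, by = "word")

tokens
```

Passing by = "word" explicitly gives the same result as the call above while suppressing the "Joining, by" message.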
Next, I create a table that holds each word together with the number of times it appears across all my Facebook posts.
counts = mypost_text %>%
drop_na(word) %>%
count(word, sort = TRUE)
counts
## # A tibble: 1,937 x 2
## word n
## <chr> <int>
## 1 ä 231
## 2 ng 160
## 3 cã 134
## 4 happy 124
## 5 bday 97
## 6 bẠ91
## 7 á 86
## 8 lol 81
## 9 anh 74
## 10 em 68
## # … with 1,927 more rows
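Fragments such as "cã" in the table above look like Vietnamese text whose UTF-8 bytes were read as Latin-1, a problem often reported with Facebook JSON exports. That diagnosis is an assumption here, not something verified against the original file. If it holds, a repair along the following lines, applied to the text before tokenizing, may recover the original characters. The helper fix_mojibake is hypothetical.

```r
# Hypothetical example: this is what "á" looks like after its two
# UTF-8 bytes (C3 A1) have been misread as two Latin-1 characters.
garbled <- "\u00c3\u00a1"

fix_mojibake <- function(x) {
  # Turn each character back into the single byte it came from ...
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # ... then declare that those bytes are UTF-8.
  Encoding(bytes) <- "UTF-8"
  bytes
}

fixed <- fix_mojibake(garbled)
fixed
```

iconv() returns NA for strings that contain characters outside Latin-1, so it is worth checking the result for missing values before replacing the original column.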
Finally, I display the result as a word cloud limited to the 50 most frequent words, using the following code:
## Packages "RColorBrewer" and "wordcloud" have useful tools for drawing a word cloud.
library(RColorBrewer)
library(wordcloud)
counts %>%
with(wordcloud(word, n, max.words = 50))
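The call above loads RColorBrewer without using it. A possible variation, sketched below, passes a Brewer palette to wordcloud() and fixes the random seed so the layout is the same on every run. The toy_counts table reuses five counts from the output above, and the palette choice and seed value are arbitrary assumptions.

```r
library(RColorBrewer)
library(wordcloud)

# A small stand-in for the counts table, using values shown earlier.
toy_counts <- data.frame(
  word = c("happy", "bday", "lol", "anh", "em"),
  n = c(124, 97, 81, 74, 68)
)

# Eight colors from the "Dark2" palette.
pal <- brewer.pal(8, "Dark2")

# Word placement is random, so fix the seed for a repeatable picture.
set.seed(42)
wordcloud(
  words = toy_counts$word,
  freq = toy_counts$n,
  max.words = 50,
  random.order = FALSE,  # plot the most frequent words first, near the center
  colors = pal
)
```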
Julia Silge & David Robinson. Text Mining with R. O’Reilly, 2019. Print.