Text mining - Count words on Facebook posts (Beginner)

Introduction

Social networks, Facebook in particular, have become part of many people's lives, and R is a powerful tool for harvesting and analyzing Facebook data. This post will guide you through counting how often specific words appear across all your posts.

How to get data from Facebook

Here is a short step-by-step guide on how to extract Facebook data from your account:

Log in to Facebook.
Go to Settings > Your Facebook Information > View > Download Your Information (Format: JSON).
Wait for Facebook to process the request, then download the archive.

Note: In this guide, I will use data for “Posts” and “Photos & Videos” only.

Processing Facebook Data

First, I load the necessary libraries:

## Package "jsonlite": This package has tools to map between the data found in the .JSON file downloaded from Facebook and R data types.
library(jsonlite)
## Package "tidyverse": This popular package bundles many useful tools such as ggplot2, dplyr, tidyr, readr,...
library(tidyverse)

## ── Attaching packages ─────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()

## Package "qdapRegex": This package has tools for removing, extracting and replacing abbreviations, dates, dollar amounts, email addresses, hash tags,...
library(qdapRegex)

## 
## Attaching package: 'qdapRegex'

## The following object is masked from 'package:dplyr':
## 
##     explain

## The following object is masked from 'package:ggplot2':
## 
##     %+%

## The following object is masked from 'package:jsonlite':
## 
##     validate

## Package "tidytext": This package has tools to split text into tokens, which makes it easier to handle upper and lower case, numbers, symbols,...
library(tidytext)

Then, I move the downloaded JSON data into R's working directory.

## getwd() returns the folder that R is currently working in.
getwd()

## [1] "/Users/dao/dunganduyen"

The download arrives as a .zip file, and I extract the data from its “Posts” folder into the R working directory. Then I load the data into the variable myposts with the following commands:

## Read the .JSON data into data types usable in R.
FBposts = fromJSON("facebook-CopThoiNay/posts/your_posts.json")
myposts = FBposts$status_updates
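
To see what fromJSON() hands back, it can help to try it on a tiny input first. The sketch below uses a made-up, two-post JSON string: toy_json and its field names are illustrative assumptions, not the exact layout of Facebook's export. It shows that a nested array ends up as a list column rather than a plain character vector.

```r
library(jsonlite)

# Hypothetical miniature input; the real export has more fields
# and may be laid out differently.
toy_json <- '{"status_updates": [
  {"timestamp": 1553558400, "data": [{"post": "happy bday"}]},
  {"timestamp": 1553644800, "data": [{"post": "lol happy"}]}
]}'

toy <- fromJSON(toy_json)
toy_posts <- toy$status_updates

# One row per post; "data" is a list column holding one small data frame per row.
str(toy_posts, max.level = 1)
class(toy_posts$data)

# The text of the first post sits inside that nested data frame.
toy_posts$data[[1]]$post
```

A list column like this is not an atomic vector, which is why string functions have to coerce it before they can work on it.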

Now all my posts are in the variable myposts$data. I want to extract only the text, which sits between quotation marks. I store it in a new variable, myposts$posttext, and then convert it to type character.

## Extract the text between the quotation marks, then turn it into character type in R.
myposts$posttext = myposts$data %>%
  rm_between('"', '"', extract = TRUE)

## Warning in stringi::stri_extract_all_regex(text.var, pattern): argument is
## not an atomic vector; coercing

myposts$posttext = as.character(myposts$posttext)
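
To make the extraction step easier to follow, here is a minimal sketch of rm_between() on a plain character vector. The strings in toy_raw are hypothetical stand-ins for what an entry can look like once it has been coerced to text; they are not taken from the real export.

```r
library(qdapRegex)

# Hypothetical inputs: each string contains exactly one quoted passage.
toy_raw <- c('list(post = "happy bday")', 'list(post = "lol")')

# extract = TRUE returns the text found between the markers
# instead of removing it; the result is a list with one element per input.
toy_extracted <- rm_between(toy_raw, '"', '"', extract = TRUE)

# Flatten to a character vector (safe here because each input has one match).
toy_text <- unlist(toy_extracted)
toy_text
```

Because the left and right markers are the same character, inputs with several quoted passages may need extra care; this sketch deliberately avoids that case.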

At this stage, the data in myposts is not yet in a tidy format for text analysis. Each row holds a whole post made up of many words, which I have to split into one token (a unit of text, most often a word) per row.

I also want to remove stop words, which are common words such as “the”, “of” and “to” that carry little meaning for text analysis.

## Split the text into one token per row and remove the stop words.
mypost_text = myposts %>%
  unnest_tokens(word, posttext) %>%
  anti_join(stop_words)

## Joining, by = "word"
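
The same two steps can be tried on a small example before running them on a full export. This is a sketch with two made-up posts (toy_posts and its column names are assumptions for illustration); it passes by = "word" explicitly, which also silences the "Joining, by" message.

```r
library(dplyr)
library(tidytext)

# Two hypothetical posts.
toy_posts <- tibble(
  post_id  = 1:2,
  posttext = c("Happy bday to you", "LOL the bday")
)

toy_tokens <- toy_posts %>%
  # One lowercase word per row; punctuation is dropped by default.
  unnest_tokens(word, posttext) %>%
  # Keep only the rows whose word is NOT in the stop word list.
  anti_join(stop_words, by = "word")

toy_tokens
```

Words such as “to”, “you” and “the” are filtered out, while the other columns (here post_id) are kept so each token can still be traced back to its post.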

Next, I create a table that holds each word and the number of times it appears across all my Facebook posts.

counts = mypost_text %>%
  drop_na(word) %>%
  count(word, sort = TRUE)
counts

## # A tibble: 1,937 x 2
##    word      n
##    <chr> <int>
##  1 ä       231
##  2 ng      160
##  3 cã      134
##  4 happy   124
##  5 bday     97
##  6 báº      91
##  7 á        86
##  8 lol      81
##  9 anh      74
## 10 em       68
## # … with 1,927 more rows
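
To look up the count of one specific word, the counts table can simply be filtered. The sketch below rebuilds a small table from made-up tokens (toy_words is hypothetical) so that it runs on its own; with the real data, the same filter() call could be applied to counts.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens, including one missing value.
toy_words <- tibble(word = c("happy", "bday", "lol", "bday", NA))

toy_counts <- toy_words %>%
  drop_na(word) %>%          # discard rows without a word
  count(word, sort = TRUE)   # one row per word, most frequent first

# Count for a single word of interest.
bday_count <- toy_counts %>%
  filter(word == "bday")

bday_count
```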

Finally, I can display the result as a word cloud that shows only the top 50 words, using the following code:

## Packages "RColorBrewer" and "wordcloud": These packages have useful tools for drawing word clouds.
library(RColorBrewer)
library(wordcloud)
counts %>%
  with(wordcloud(word, n, max.words = 50))
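
A variation worth trying is to fix the random layout and add colour. The sketch below is self-contained: toy_counts reuses five of the word counts printed earlier, while the seed value and the "Dark2" palette are arbitrary choices, not part of the original analysis.

```r
library(RColorBrewer)
library(wordcloud)

# A small table built from five of the counts shown above.
toy_counts <- data.frame(
  word = c("happy", "bday", "lol", "anh", "em"),
  n    = c(124, 97, 81, 74, 68),
  stringsAsFactors = FALSE
)

# Word placement is random, so set a seed for a reproducible picture.
set.seed(42)

wordcloud(
  words        = toy_counts$word,
  freq         = toy_counts$n,
  min.freq     = 1,                      # default is 3; keep rare words too
  max.words    = 50,                     # cap on the number of words drawn
  random.order = FALSE,                  # most frequent words in the centre
  colors       = brewer.pal(8, "Dark2")  # palette from RColorBrewer
)
```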

References

Julia Silge & David Robinson. Text Mining with R. O’Reilly, 2019. Print.


Dung Dao

26/03/2019
