Social Media Data Collection

Collecting Twitter data and creating social networks

In this section we will run through how to collect data from Twitter, create networks, and perform different kinds of analysis.

It is currently possible to create 3 different types of networks using Twitter data collected with SocialMediaLab. These are (1) actor networks; (2) bimodal networks; and (3) semantic networks.

First, define the API credentials. Due to the Twitter API specifications, it is not possible to save authentication tokens between sessions. The Authenticate() function is called only for its side effect, which provides access to the Twitter API for the current session.

Authenticating with the Twitter API

Go to Twitter Application Management and “Create New App”. Complete all fields in the form and create a new app. In your app page, go to “Keys and Access Tokens”, generate your access token, and copy the information to R:

myapikey <- "801xU41LxZmHrNh7NY3wVnB7C"
myapisecret <- "E4sATBjT3ZSARghmXgjrdcOZP5rVWHVXFxAwxvoG24QLN5oDQg"
myaccesstoken <- "19856730-XHDOuNW03D6Gqjv2JDQFxhZmurdsTe55nLMU48kjq" # avoids the browser authentication dance
myaccesstokensecret <- "2xFYrFO1yt1H0WzgHEjzqojf6ThNeBfC0f0bYJB2kwaQZ" # avoids the browser authentication dance
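Rather than pasting keys directly into a script, one option is to read them from environment variables. The following is a minimal sketch using base R only; the helper read_secret() and the variable name TWITTER_API_KEY are hypothetical and not part of SocialMediaLab.

```r
# Hypothetical helper: read a secret from an environment variable and
# fail loudly if it is missing, so secrets stay out of shared scripts.
read_secret <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

# For demonstration only: normally the variable is set outside R,
# for example in ~/.Renviron (the variable name is an assumption).
Sys.setenv(TWITTER_API_KEY = "example-key")
myapikey <- read_secret("TWITTER_API_KEY")
```

The same pattern would apply to the API secret, access token, and access token secret.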

Because we are going to create several different types of Twitter networks (actor, semantic, and bimodal), we will Collect() the data but not pipe it straight through to Create(). This means we can reuse the same data to create different kinds of networks for analysis. We will collect 150 recent tweets that have used the #auspol hashtag, the dominant hashtag for Australian politics. The first step in the workflow is to authorise access to the Twitter API. Instructions for obtaining Twitter API access are available from the VOSON website. See the previous section for a brief explanation of APIs.

require(SocialMediaLab)
require(magrittr)
myTwitterData <- Authenticate("twitter",
                              apiKey=myapikey,
                              apiSecret=myapisecret,
                              accessToken=myaccesstoken,
                              accessTokenSecret=myaccesstokensecret) %>%
Collect(searchTerm="#auspol", numTweets=150, writeToFile=FALSE, verbose=TRUE)
[1] "Using direct authentication"
Now retrieving data based on search term: #auspol
Done
Cleaning and sorting the data...
Done

We can have a quick look at the data we just collected:

View(myTwitterData)

Note the class of the data frame: it tells SocialMediaLab that this is an object of class dataSource, which we can then pass to the Create() function to generate different kinds of networks:

class(myTwitterData)
[1] "data.frame" "dataSource" "twitter"   

If you find that you are encountering errors possibly related to the text of the tweets, you can try converting the tweet text to UTF-8 character encoding. Roughly speaking, this command will help to deal with ‘odd’ characters in the text.

myTwitterData$text <- iconv(myTwitterData$text, to = 'utf-8')

Mac users may also wish to try the following if they encounter errors that may be due to character encoding issues:

myTwitterData$text <- iconv(myTwitterData$text, to = 'utf-8-mac')

Creating social networks with Twitter data

Actor network

First, we will create an actor network. In this actor network, edges represent interactions between Twitter users. An interaction is defined as a “mention”, “reply”, or “retweet” from user i to user j, given tweet m. In a nutshell, a Twitter actor network shows us who is interacting with whom in relation to a particular hashtag or search term.

g_twitter_actor <- myTwitterData %>% Create("Actor")
Generating the network...

Done.

We can now examine the description of our network:

g_twitter_actor
IGRAPH a7d7116 DN-- 152 223 -- 
+ attr: name (v/c), label (v/c), edgeType (e/c), timeStamp (e/c), tweet_id (e/c)
+ edges from a7d7116 (vertex names):
 [1] rainey_knight  ->JohnWren1950    Warstub        ->unionsaustralia ballgameskeith ->ClubeGaffer     atheist4maga   ->SydneyAtheist  
 [5] ochreblue      ->sammmw8         JoRobinson_Aus ->HuffPostAU      PreAmpPlus     ->GhostWhoVotes   croswell_g     ->SydneyAtheist  
 [9] ochreblue      ->lovethatloaf    plabg          ->GoldCoastNurse  ochreblue      ->hearyanow       ochreblue      ->tilleyfab      
[13] GoldSuzie      ->protecttheplain ochreblue      ->CorruptNSW      BobOfAlex      ->CFMEUJohnSetka  PreAmpPlus     ->unionsaustralia
[17] Firemonkey991  ->WendyFarmer_    PreAmpPlus     ->MGliksmanMDPhD  NaomiCra       ->LesStonehouse   TotalREProperty->FifiFoxfoot    
[21] dannytweeets   ->breko           plabg          ->protecttheplain ochreblue      ->stopcoalexports GoldSuzie      ->protecttheplain
[25] allisonBeDemure->protecttheplain KirstyWho      ->FrBower         Redbank80Graeme->protecttheplain stopcoalexports->Qldaah         
[29] InvestInScience->protecttheplain AliKayaBrisbane->Loud_Lass       PilligaPush    ->protecttheplain FeeFeeCee      ->MGliksmanMDPhD 
+ ... omitted several edges
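To make the idea of an actor edge concrete, here is a rough base R sketch that turns mentions in tweet text into a "from, to" edge list. The tweets are made up, and this is not how Create("Actor") is implemented internally; it only illustrates the definition of an interaction given above.

```r
# Toy tweets (invented for illustration; not from the collected data)
tweets <- data.frame(
  from = c("alice", "bob", "carol"),
  text = c("@bob have you seen this? #auspol",
           "RT @alice: budget night #auspol",
           "no mentions here #auspol"),
  stringsAsFactors = FALSE
)

# Extract every @username that appears in each tweet
mentions <- regmatches(tweets$text, gregexpr("@[A-Za-z0-9_]+", tweets$text))

# One row per (sender, mentioned user) pair; tweets without mentions drop out
edges <- data.frame(
  from = rep(tweets$from, lengths(mentions)),
  to   = sub("^@", "", unlist(mentions)),
  stringsAsFactors = FALSE
)
edges
```

Each row of edges corresponds to one directed edge from user i to user j.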

Semantic network

Next, we will create a semantic network. In this network, nodes represent unique concepts (in this case unique terms/words extracted from a set of 150 tweets), and edges represent the co-occurrence of terms across all observations in the data set. For example, for this Twitter semantic network, nodes represent either hashtags (e.g. “#auspol”) or single terms (e.g. “politics”). If there are 150 tweets in the data set (i.e. 150 observations), and the term #auspol and the term politics appear together in every tweet, then this would be represented by an edge with weight equal to 150.

g_twitter_semantic <- myTwitterData %>% Create("Semantic")
[1] "Generating Twitter semantic network..."

Done.

Let’s have a look at the network description:

g_twitter_semantic
IGRAPH 1de936d UNW- 54 121 -- 
+ attr: name (v/c), label (v/c), weight (e/n)
+ edges from 1de936d (vertex names):
 [1] #auspol        --auspol            #auspol        --turnbullmalcolm   auspol         --#nrlgf            turnbullmalcolm--#nrlgf           
 [5] #auspol        --ausunions         auspol         --#ausunions        ausunions      --#ausunions        auspol         --#qanda           
 [9] ausunions      --#qanda            #auspol        --adani             auspol         --#islam            auspol         --#paris           
[13] auspol         --#grandmufti       auspol         --#adani            adani          --#adani            #adani         --stopadani        
[17] auspol         --#stopadani        adani          --#stopadani        stopadani      --#stopadani        #auspol        --stopadani        
[21] auspol         --#marriageequality auspol         --#reachtel         auspol         --#manus            auspol         --#nauru           
[25] #stopadani     --corners           #auspol        --corners           auspol         --#4corners         adani          --#4corners        
[29] stopadani      --#4corners         corners        --#4corners         #auspol        --done              #auspol        --csg              
+ ... omitted several edges
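The co-occurrence weights described above can be sketched in a few lines of base R. The three "tweets" below are invented term sets, and the counting scheme is an assumption made for illustration rather than the package's actual implementation.

```r
# Each element holds the terms found in one (invented) tweet
docs <- list(
  c("#auspol", "politics", "budget"),
  c("#auspol", "politics"),
  c("#auspol", "budget")
)

# All unordered pairs of distinct terms within each tweet
pairs <- do.call(rbind, lapply(docs, function(terms) {
  t(combn(sort(unique(terms)), 2))
}))

# The number of tweets in which a pair co-occurs is the edge weight
weights <- table(paste(pairs[, 1], pairs[, 2], sep = " -- "))
weights
```

Here #auspol and politics co-occur in two of the three tweets, so that edge would have weight 2.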

Bimodal network

Now that we have our Twitter data we can generate a bimodal network. This kind of network provides many possibilities for analysis and generating insights from our data.

In this bimodal network there are two types of nodes: users and hashtags. The bimodal network therefore has the following properties:

  • directed (users can use hashtags, but hashtags can’t use users)
  • weighted (users can use the same hashtag multiple times)
  • bipartite (edges only run from users to hashtags, never between two users or two hashtags)
  • multiple edges or parallel edges (we have one edge for each use of hashtag j by user i)

We now run the Create() function, which creates an igraph object called g_bimodal_twitter. Creating networks in SocialMediaLab is straightforward. We simply pass the myTwitterData object to the Create() function, and it takes care of the rest. We specify what kind of network we want to create (i.e. a bimodal network) by specifying this as an argument to the Create() function.

Note also the ‘pipe’ operator %>% , which we have been using without explanation so far. This operator comes from the magrittr package, and it is used to ‘pipe’ together commands in a chain, passing the values along the pipeline until they reach the final command, which returns the output (i.e. the network we wish to create). In this instance we are passing (or “piping”) the data we collected using Collect() through to the Create() function.

g_bimodal_twitter <- myTwitterData %>% Create("Bimodal")
[1] "Generating Twitter bimodal network..."

Done
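If the pipe is unfamiliar, the following small example (unrelated to SocialMediaLab) shows that a piped chain is just another way of writing nested function calls. It assumes only that the magrittr package is installed.

```r
library(magrittr)

squares <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(squares)), 1)

# The same computation, read from left to right with the pipe
piped <- squares %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

In the same way, myTwitterData %>% Create("Bimodal") passes myTwitterData as the first argument to Create().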

We can now view basic information about the network:

g_bimodal_twitter
IGRAPH 9009817 DN-- 100 160 -- 
+ attr: name (v/c), label (v/c), edgeType (e/c), timeStamp (e/c), tweet_id (e/c)
+ edges from 9009817 (vertex names):
 [1] rainey_knight ->#auspol           rainey_knight ->#nrlgf            Warstub       ->#auspol           Warstub       ->#ausunions       
 [5] Charlie5009   ->#auspol           ballgameskeith->#QandA            ballgameskeith->#ausunions        ballgameskeith->#auspol          
 [9] msrose2343    ->#auspol           trollsforpeace->#AusPol           atheist4maga  ->#Islam            atheist4maga  ->#Paris           
[13] atheist4maga  ->#GrandMufti       atheist4maga  ->#Auspol           ochreblue     ->#Adani            ochreblue     ->#StopAdani       
[17] ochreblue     ->#auspol           JoRobinson_Aus->#auspol           JoRobinson_Aus->#MarriageEquality PreAmpPlus    ->#ReachTEL        
[21] PreAmpPlus    ->#auspol           croswell_g    ->#Islam            croswell_g    ->#Paris            croswell_g    ->#GrandMufti      
[25] croswell_g    ->#Auspol           ochreblue     ->#StopAdani        ochreblue     ->#auspol           bprophetable  ->#Manus           
[29] bprophetable  ->#Nauru            bprophetable  ->#auspol           plabg         ->#auspol           ochreblue     ->#StopAdani       
+ ... omitted several edges
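The printed edge list contains repeated edges (for example, the same user using the same hashtag several times). One way to collapse such parallel edges into a single weighted edge is igraph's simplify() function. This is a sketch on toy data with invented user and hashtag names; it is a suggestion, not a step the package performs for you.

```r
library(igraph)

# Toy bimodal edge list: u1 uses #a twice
edges <- data.frame(
  from = c("u1", "u1", "u2", "u1"),
  to   = c("#a", "#a", "#a", "#b")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Give every edge weight 1, then sum weights over parallel edges
E(g)$weight <- 1
g_weighted <- simplify(g, remove.multiple = TRUE, remove.loops = FALSE,
                       edge.attr.comb = list(weight = "sum"))

ecount(g)            # parallel edges counted separately
ecount(g_weighted)   # one edge per user-hashtag pair
E(g_weighted)$weight
```

After simplification the edge from u1 to #a carries weight 2, while the total weight still equals the original number of edges.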

Collecting Facebook data

In this section we will run through how to collect data from Facebook, create networks, and perform different kinds of analysis.

Authenticate with Facebook Developers API

The process of authentication, data collection, and creating social networks can be expressed with the 3 verb functions: Authenticate() , Collect() , and Create() . This simplified workflow exploits the pipe interface of the Magrittr package, and provides better handling of API authentication between R sessions. What we are doing is “piping” the data forward using the %>% operator, in a kind of functional programming approach. It means we can pipe together all the different elements of the work flow in a quick and easy manner. This also provides the ability to save and load authentication tokens, so we don’t have to keep authenticating with APIs between sessions. Obviously, this opens up possibilities for automation and data mining projects.

Go to Facebook Developers Page (https://developers.facebook.com/apps/), create a new app, and copy the appID and appSecret information.

Make sure we have our appID and appSecret values defined:

appID <- "2025204167698114" # Put your own api info here
appSecret <- "3a0b8f329e1ed9d803e02b5b3fdae831" # Put your own api info here

Save credential file for later access:

require(SocialMediaLab)
require(magrittr)
Authenticate("Facebook", appID = appID, appSecret = appSecret) %>% SaveCredential("FBCredential.RDS")

The first time you authenticate you will see this:

Copy and paste into Site URL on Facebook App Settings: http://localhost:1410/

You have to paste this URL (http://localhost:1410/) into the developer app settings: (i) Click on “Add Platform”. (ii) Choose “Website” and paste http://localhost:1410/ into the Site URL field.

Bimodal networks

First, we will collect 2 days’ worth of activity from the official Star Wars page. This will collect all the posts made between the rangeFrom and rangeTo dates, including all comments and likes, and other associated data such as usernames and timestamps for comments. Note: the date format is YYYY-MM-DD.

We will be using this data to create a bimodal network. This graph object is bimodal because edges represent relationships between nodes of two different types. For example, in our bimodal Facebook network, nodes represent Facebook users or Facebook posts, and edges represent whether a user has commented or ‘liked’ a post. Edges are directed and weighted (e.g. if user i has commented n times on post j, then the weight of this directed edge equals n).

g_bimodal_facebook_star_wars <- 
  LoadCredential("FBCredential.RDS") %>% 
  Collect(pageName="StarWars", rangeFrom="2015-03-01", rangeTo="2015-03-02", writeToFile=TRUE) %>% 
  Create("Bimodal")
Now retrieving data from page: StarWars
2015-03-01  1 posts 
Now collecting data from POSTS (within the page).
 Collecting posts data from 1 posts.
Collecting maximum of 1000 comments and likes for each post.
.Facebook data was written to current working directory, with filename:
2015-03-01_to_2015-03-02_StarWars_FacebookData.csv

Creating Facebook bimodal network...

Done!

The magrittr pipe approach used in this example means that we only end up with the final graph object (in the global environment). To ensure we retain the data that are collected, the argument writeToFile=TRUE is used. This writes the data collected by the Collect() function to a local CSV file before it is piped through to the network generation function Create(). We can then read it in as a data frame (see code snippet below).

# myStarWarsData <- read.csv("2015-03-01_to_2015-03-02_StarWars_FacebookData.csv")
myStarWarsData <- importData("2015-03-01_to_2015-03-02_StarWars_FacebookData.csv", "facebook")
View(myStarWarsData)
g_bimodal_facebook_star_wars <- myStarWarsData %>% Create("Bimodal")

Creating Facebook bimodal network...

Done!

This means we end up with two objects for further analysis, a graph object g_bimodal_facebook_star_wars , and a dataframe object myStarWarsData.
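One plausible reason why read.csv() is commented out above in favour of importData() is that a CSV file does not store R class attributes, while SocialMediaLab data frames carry extra classes such as dataSource. The base R sketch below demonstrates this loss on a toy data frame; the class labels are copied from the class() output shown earlier, and what importData() does internally is an assumption here.

```r
# Toy data frame tagged with extra classes, mimicking a collected data set
toy <- data.frame(user = c("a", "b"), post = c("p1", "p1"),
                  stringsAsFactors = FALSE)
class(toy) <- c("data.frame", "dataSource", "facebook")

# Round trip through a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
reloaded <- read.csv(path, stringsAsFactors = FALSE)

class(toy)       # extra classes present
class(reloaded)  # only "data.frame" survives the round trip
```

The contents are identical, but the reloaded object no longer identifies itself as a dataSource.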

Before proceeding to the analysis, we will collect 2 days’ worth of data from the Star Trek Facebook page. As before, we pipe through the LoadCredential() function, meaning that we reuse the authentication token that we stored locally in an earlier step.

require(SocialMediaLab)
g_bimodal_facebook_star_trek <- 
  LoadCredential("FBCredential.RDS") %>%
  Collect(pageName="StarTrek", rangeFrom="2015-03-01", rangeTo="2015-03-02", writeToFile=TRUE) %>%
  Create("Bimodal")
Now retrieving data from page: StarTrek
2015-03-01  1 posts 
Now collecting data from POSTS (within the page).
 Collecting posts data from 1 posts.
Collecting maximum of 1000 comments and likes for each post.
.Facebook data was written to current working directory, with filename:
2015-03-01_to_2015-03-02_StarTrek_FacebookData.csv

Creating Facebook bimodal network...

Done!

Read in the data to a data frame:

#read.csv() is not equivalent to Collect()
myStarTrekData <- importData("2015-03-01_to_2015-03-02_StarTrek_FacebookData.csv", "facebook") 
#if you want to create network from local data:
# g_bimodal_facebook_star_trek <- myStarTrekData %>% Create("Bimodal")

Now we can perform some analysis on the Star Wars network. Firstly, we will run through some essential SNA techniques. After that we will do something a bit fancier, by comparing whether there are gender differences between Star Wars and Star Trek networks.

We can get descriptive information about the network:

g_bimodal_facebook_star_wars
IGRAPH 8e21e39 DN-B 1208 1220 -- 
+ attr: name (v/c), type (v/c), label (v/c), edgeType (e/c), timestamp (e/c)
+ edges from 8e21e39 (vertex names):
 [1] AL Rbg              ->169299103121699_909270052457930 Abraham Gonzales    ->169299103121699_909270052457930
 [3] Abril Ramirez       ->169299103121699_909270052457930 Ace Falcon          ->169299103121699_909270052457930
 [5] Adam Blankenship    ->169299103121699_909270052457930 Adam Evenson        ->169299103121699_909270052457930
 [7] Adam Henshel        ->169299103121699_909270052457930 Adam Herb           ->169299103121699_909270052457930
 [9] Adam John Kiseloff  ->169299103121699_909270052457930 Adam Rollefson      ->169299103121699_909270052457930
[11] Adonis Lugo         ->169299103121699_909270052457930 Adrian Dobre        ->169299103121699_909270052457930
[13] Aidan Latham Daye   ->169299103121699_909270052457930 Aj Green            ->169299103121699_909270052457930
[15] Ajani Thomas        ->169299103121699_909270052457930 Akeem J. Webb       ->169299103121699_909270052457930
+ ... omitted several edges

This informs us that there are 1208 nodes and 1220 edges in the network (this may differ somewhat for your own collected data). The code DN-B tells us that our graph is Directed (D), Named (N), and Bipartite (B). The dash in the third position, where a W would otherwise appear, means that this graph object does not carry an edge weight attribute.
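The same flags can be queried programmatically instead of being read off the printed summary. The sketch below builds a toy user-to-post graph (names invented) and uses igraph's predicate functions.

```r
library(igraph)

# Toy bimodal graph: two users interacting with one post
edges <- data.frame(
  from = c("user1", "user2", "user1"),
  to   = c("post1", "post1", "post1")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# igraph treats a graph as bipartite when vertices have a 'type' attribute
V(g)$type <- V(g)$name %in% edges$to

flags <- c(directed  = is_directed(g),
           named     = is_named(g),
           weighted  = is_weighted(g),
           bipartite = is_bipartite(g))
flags
c(nodes = vcount(g), edges = ecount(g))
```

Printing g for this toy graph would show the same style of header, with the node count followed by the edge count.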

Collecting YouTube video comment data

Authenticate

We first ensure that the SocialMediaLab package is loaded:

require(SocialMediaLab)
Loading required package: SocialMediaLab

A Google Developer API Key is required for authenticating with the API (otherwise we cannot collect data). This requires a Google account. Instructions for obtaining a Google Developer API Key are available from the YouTube Data API Overview. Main steps:

  1. You need a Google Account to access the Google Developers Console, request an API key, and register your application.
  2. Create a project in the Google Developers Console and obtain authorization credentials so your application can submit API requests.
  3. After creating your project, make sure the YouTube Data API is one of the services that your application is registered to use.

my_apiKeyYoutube <- "AIzaSyAOS6z8kQjHJW5xL1kmjt8tYZd_JnzhpFE"

We then run the following function, which ensures everything is correctly set up to access the API:

apiKeyYoutube <- AuthenticateWithYoutubeAPI(my_apiKeyYoutube)

Get data

We now assign a character vector, specifying one or more YouTube video IDs that we wish to collect data from. For example, if the video URL is https://www.youtube.com/watch?v=W2GZFeYGU3s, then use videoIDs = ‘W2GZFeYGU3s’. Tip: for many videos, the function GetYoutubeVideoIDs can be used to create a vector object suitable as input for videoIDs.

videoIDs <- c("W2GZFeYGU3s","mL27TAJGlWc")

We now collect the YouTube comment data and store it in a data frame object named myYoutubeData. This data frame can then be used for creating networks for further analysis. We can supply various arguments to the CollectDataYoutube function, providing various options for the data collection process.

myYoutubeData <- CollectDataYoutube(videoIDs, apiKeyYoutube, writeToFile=TRUE, verbose=TRUE, maxComments=100)

Now scraping video number: 1 (out of 2 videos in total).
Scraping a maximum of 100 comments for each video.
.................................................................................
Now scraping video number: 2 (out of 2 videos in total).
Scraping a maximum of 100 comments for each video.
.

We can examine the structure of our data frame:

str(myYoutubeData)
Classes 'dataSource', 'youtube' and 'data.frame':   135 obs. of  9 variables:
 $ Comment           : chr  "Pascal did not die: https://www.embarcadero.com/products/delphi\nFor a language guaranteed not to disappear wit"| __truncated__ "Is it safe to assume that user base of R are more business oriented people than hardcore CS or It professional?" "thank you" "Very good video" ...
 $ User              : chr  "Chuck Becker" "Swarnkar Rajesh" "Aparajito Sengupta" "James Piazza" ...
 $ ReplyCount        : chr  "0" "1" "0" "0" ...
 $ LikeCount         : chr  "0" "0" "0" "0" ...
 $ PublishTime       : chr  "2017-07-13T20:24:51.000Z" "2017-06-26T06:41:32.000Z" "2017-06-04T12:55:42.000Z" "2017-05-26T03:17:24.000Z" ...
 $ CommentId         : chr  "z13qvx3btvr3ehlps04cghkivy2wspoorao" "z12eu1qoamb5s3sjr04cdfnb3wa4vtobhqg0k" "z135jfthfzmexpivi232xxe5ltjqfpokk" "z13giraztku0h1out22xyr0qyyvvchz2t" ...
 $ ParentID          : chr  "None" "None" "None" "None" ...
 $ ReplyToAnotherUser: chr  "FALSE" "FALSE" "FALSE" "FALSE" ...
 $ VideoID           : chr  "W2GZFeYGU3s" "W2GZFeYGU3s" "W2GZFeYGU3s" "W2GZFeYGU3s" ...
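Note that every column in the str() output is of type character, including the counts and timestamps. Before doing any numeric or time-based analysis you may want to convert them. The sketch below does this on a small toy data frame whose values are copied from the output above; it assumes your collected data has the same column names and formats.

```r
# Toy rows with the same column names and formats as shown above
toy <- data.frame(
  ReplyCount  = c("0", "1"),
  LikeCount   = c("0", "0"),
  PublishTime = c("2017-07-13T20:24:51.000Z", "2017-06-26T06:41:32.000Z"),
  stringsAsFactors = FALSE
)

# Counts: character to integer
toy$ReplyCount <- as.integer(toy$ReplyCount)
toy$LikeCount  <- as.integer(toy$LikeCount)

# Timestamps: drop the fractional seconds and trailing "Z", then parse as UTC
clean_time <- sub("\\.[0-9]+Z$", "", toy$PublishTime)
toy$PublishTime <- as.POSIXct(clean_time, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

str(toy)
```

With proper types in place, comments can be sorted by time or filtered by reply count directly.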

Generate actor network

We will now create a unimodal network, a.k.a ‘actor network’, representing relationships between users who have interacted with each other. For YouTube comment threads, a relationship is defined as user i ‘replying to’ or ‘mentioning’ user j in a comment. In this network the vertices (a.k.a ‘nodes’) represent YouTube users and the edges (a.k.a ‘links’) represent whether (and how many times) user i has interacted with user j. The edges in this network are both directed and weighted. Edges are ‘directed’ because interactions may not be reciprocated (e.g. user i replies to user j, but user j does not reply to user i), and edges are also ‘weighted’ in order to show how many times user i has interacted with user j.

require(igraph)
g_actor_youtube <- myYoutubeData %>% Create("Actor")

Done!

We can now view basic information about our network, notably the number of vertices (users) and number of interactions (edges):

g_actor_youtube
IGRAPH 97a6d5d DNW- 36 41 -- 
+ attr: name (v/c), label (v/c), weight (e/n)
+ edges from 97a6d5d (vertex names):
 [1] SoHoTandCool                                                                                                             ->FarsightPress                 
 [2] FarsightPress                                                                                                            ->SoHoTandCool                  
 [3] Ed Boone                                                                                                                 ->PsycAndrew                    
 [4] bytejuggler                                                                                                              ->Tank8484                      
 [5] dagda825                                                                                                                 ->RenegadeThinking              
 [6] RenegadeThinking                                                                                                         ->dagda825                      
 [7] FarsightPress                                                                                                            ->Swarnkar Rajesh               
 [8] <U+0410><U+043B><U+0435><U+043A><U+0441><U+0430><U+043D><U+0434><U+0440> <U+0418><U+0432><U+0430><U+043D><U+043E><U+0432>->EvaSlash                      
+ ... omitted several edges

Storing your social network in graphml format

You can export a social network generated by SocialMediaLab to “graphml” format using the igraph package:

require(igraph) # install.packages("igraph") if not installed yet
write.graph(g_bimodal_facebook_star_trek, "g_bimodal_facebook_star_trek.graphml", format="graphml")

Now you can visualize the “graphml” file with Gephi or other compatible tools.
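To check that an export worked, you can read the file back into R and compare basic counts. The sketch below uses a toy graph and a temporary file; write_graph() and read_graph() are, as far as I know, the current igraph names for the older write.graph() and read.graph().

```r
library(igraph)

# Toy graph standing in for a network created by SocialMediaLab
g <- graph_from_data_frame(
  data.frame(from = c("user1", "user2"), to = c("post1", "post1")),
  directed = TRUE
)

# Write to a temporary graphml file, then read it back in
path <- tempfile(fileext = ".graphml")
write_graph(g, path, format = "graphml")
g_back <- read_graph(path, format = "graphml")

# The node and edge counts should match the original graph
c(vcount(g_back), ecount(g_back))
```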

Social Network Analysis

Basic counting

Next we will do some more descriptive analysis.

How many nodes are in the network?

vcount(g_bimodal_twitter)
[1] 100

How many edges are in the network?

ecount(g_bimodal_twitter)
[1] 160

Get a list of the nodes in the network:

V(g_bimodal_twitter)
+ 100/100 vertices, named, from 9009817:
  [1] rainey_knight     Warstub           Charlie5009       ballgameskeith    msrose2343        trollsforpeace    atheist4maga      ochreblue        
  [9] JoRobinson_Aus    PreAmpPlus        croswell_g        bprophetable      plabg             GoldSuzie         asciigoat         Kaneosaurus      
 [17] BobOfAlex         Firemonkey991     NaomiCra          TotalREProperty   dannytweeets      4U_WTF            detispify         allisonBeDemure  
 [25] Bev_n_W           Medicayy          indica2007        KirstyWho         RrRjrobinson9     Redbank80Graeme   stopcoalexports   InvestInScience  
 [33] Left_of_Labor     AliKayaBrisbane   AngelaKorras      PilligaPush       FeeFeeCee         amagickeagle999   MrNixonsWife      MalurusSally     
 [41] SocialistMason    Salig08           idiosoCamel       GoldCoastNurse    Broadband_        madnyc            Elaine_de_Saxe    johngwass        
 [49] cleverclicks      MsJmaid           MiztaRabbit       mishyloan         sierra4oz         TaodeHaas         Ducatio_          #auspol          
 [57] #nrlgf            #ausunions        #QandA            #AusPol           #Islam            #Paris            #GrandMufti       #Auspol          
 [65] #Adani            #StopAdani        #MarriageEquality #ReachTEL         #Manus            #Nauru            #4Corners         #CSG             
 [73] #qldpol           #India            #Queensland       #coal             #adani            #VoteYes          #climatechange    #Trump           
+ ... omitted several vertices

List of edges in the network:

E(g_bimodal_twitter)
+ 160/160 edges from 9009817 (vertex names):
 [1] rainey_knight ->#auspol           rainey_knight ->#nrlgf            Warstub       ->#auspol           Warstub       ->#ausunions       
 [5] Charlie5009   ->#auspol           ballgameskeith->#QandA            ballgameskeith->#ausunions        ballgameskeith->#auspol          
 [9] msrose2343    ->#auspol           trollsforpeace->#AusPol           atheist4maga  ->#Islam            atheist4maga  ->#Paris           
[13] atheist4maga  ->#GrandMufti       atheist4maga  ->#Auspol           ochreblue     ->#Adani            ochreblue     ->#StopAdani       
[17] ochreblue     ->#auspol           JoRobinson_Aus->#auspol           JoRobinson_Aus->#MarriageEquality PreAmpPlus    ->#ReachTEL        
[21] PreAmpPlus    ->#auspol           croswell_g    ->#Islam            croswell_g    ->#Paris            croswell_g    ->#GrandMufti      
[25] croswell_g    ->#Auspol           ochreblue     ->#StopAdani        ochreblue     ->#auspol           bprophetable  ->#Manus           
[29] bprophetable  ->#Nauru            bprophetable  ->#auspol           plabg         ->#auspol           ochreblue     ->#StopAdani       
[33] ochreblue     ->#auspol           ochreblue     ->#4Corners         ochreblue     ->#StopAdani        ochreblue     ->#auspol          
[37] GoldSuzie     ->#auspol           GoldSuzie     ->#CSG              ochreblue     ->#4Corners         ochreblue     ->#StopAdani       
+ ... omitted several edges

Access a particular node in the network (node #42):

V(g_bimodal_twitter)[42]
+ 1/100 vertex, named, from 9009817:
[1] Salig08

Access a particular edge in the network (edge #1):

E(g_bimodal_twitter)[1]
+ 1/160 edge from 9009817 (vertex names):
[1] rainey_knight->#auspol

Graph connectivity

Look at the connectivity of the graph:

# who are the neighbours of node #42?
neighbors(g_bimodal_facebook_star_wars,42)
+ 1/1208 vertex, named, from 8e21e39:
[1] 169299103121699_909270052457930
#is the graph weakly connected?
is.connected(g_bimodal_facebook_star_wars, mode="weak")
[1] TRUE
#information on connected components
cc <- clusters(g_bimodal_facebook_star_wars)
#which component node is assigned to
# cc$membership
#size of each component
cc$csize
[1] 1208
#number of components
cc$no
[1] 1
#subnetwork - giant component
g3 <- induced_subgraph(g_bimodal_facebook_star_wars, which(cc$membership == which.max(cc$csize)))
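The Star Wars network consists of a single component, so extracting the giant component changes nothing there. The toy example below (invented vertex names) has two components, which makes the effect of the same steps visible. It uses components(), which to my knowledge is the newer igraph name for clusters().

```r
library(igraph)

# Toy directed graph with two weakly connected components: {a, b, c} and {d, e}
g <- graph_from_literal(a -+ b, b -+ c, d -+ e)

cc <- components(g, mode = "weak")
cc$no      # number of components
cc$csize   # size of each component

# Keep only the vertices that belong to the largest component
giant <- induced_subgraph(g, which(cc$membership == which.max(cc$csize)))
V(giant)$name
```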

We will now look at node centrality:

#node indegree
degree(g3, mode="in")
#node outdegree
degree(g3, mode="out")
#node indegree, using edge weights
ind <- strength(g3, mode="in")
#top-3 nodes, based on (weighted) indegree
V(g3)[order(ind, decreasing=TRUE)[1:3]]
#closeness centrality
closeness(g3)
#betweenness centrality
betweenness(g3)
#eigenvector centrality
evcent(g3)$vector
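The difference between degree() and strength() only shows up when edges carry weights. The following toy example (names and weights invented) makes the distinction explicit.

```r
library(igraph)

# Toy weighted edge list: the third column becomes the edge attribute 'weight'
edges <- data.frame(
  from   = c("u1", "u2", "u3"),
  to     = c("p1", "p1", "p2"),
  weight = c(3, 1, 1)
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Unweighted indegree counts incoming edges
indeg <- degree(g, mode = "in")

# Weighted indegree sums the weights of incoming edges
ind <- strength(g, mode = "in")

indeg[c("p1", "p2")]
ind[c("p1", "p2")]
names(which.max(ind))
```

Post p1 receives two edges but a total weight of four, because u1 interacted with it three times.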

We can look at some network cohesion measures. How dense is the graph? In other words, of all the possible connections between nodes, how many are actually observed?

# density
graph.density(g3)
[1] 0.0008367306
# (global) clustering coefficient:
# relative frequency with which connected triples close to form triangles
transitivity(g3)
[1] 0
# total number of reciprocated edges/total number of edges
reciprocity(g3, mode="default")
[1] 0
# number of dyads with reciprocated (mutual) edges/
# number of dyads with at least one edge (mutual or single)
reciprocity(g3, mode="ratio")
[1] 0
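The density value can be verified by hand. For a directed graph without self-loops, density is the number of edges divided by the number of possible ordered pairs of nodes, n(n - 1). Using the node and edge counts printed for the Star Wars network earlier:

```r
# Node and edge counts taken from the printed graph summary (1208 nodes, 1220 edges)
n <- 1208
m <- 1220

# Directed density: observed edges over all possible ordered node pairs
density_by_hand <- m / (n * (n - 1))
density_by_hand
```

This agrees with the graph.density() output above to the digits shown.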

Find important nodes in the network

What are the 3 most important nodes in the Facebook network? There are several ways to answer this. For fun we will use the PageRank algorithm implementation in igraph. PageRank was made famous by the Google co-founders, who invented this method to determine the importance of webpages, revolutionising the search engine industry. The following code calculates PageRank for nodes in the network, and returns the 3 ‘top’ nodes (which have the highest share of PageRank), providing the ID.

pagerank_facebook <- sort(page.rank(g_bimodal_facebook_star_wars)$vector,decreasing=TRUE)
head(pagerank_facebook,n=3)
169299103121699_909270052457930                          AL Rbg                Abraham Gonzales 
                   0.4597014257                    0.0004476376                    0.0004476376 
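The pattern in this output, where one post holds a large share of PageRank and every user holds a tiny equal share, is what you would expect when many users all point at a single post. A toy version (invented names) reproduces it. The sketch uses page_rank(), which I believe is the newer igraph spelling of page.rank().

```r
library(igraph)

# Toy network: three users all interact with the same post
edges <- data.frame(
  from = c("user1", "user2", "user3"),
  to   = c("post1", "post1", "post1")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# PageRank scores, sorted from most to least important
pr <- sort(page_rank(g)$vector, decreasing = TRUE)
names(pr)[1]   # the post comes first
sum(pr)        # PageRank scores sum to 1
```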

What are the top 10 important terms in our #auspol semantic network? There is no reason why we can’t use the PageRank algorithm to calculate this (as per the Facebook analysis previously):

pageRank_auspol_semantic <- sort(page.rank(g_twitter_semantic)$vector,decreasing=TRUE)
head(pageRank_auspol_semantic,n=10)
   #auspol     auspol       #csg      adani     #adani     qldpol #stopadani  #4corners  stopadani    #qldpol 
0.18150152 0.15599852 0.08531406 0.03674505 0.02625074 0.02496031 0.02491640 0.02386476 0.02338559 0.02305313 

What about the 3 least important terms (with all due respect…):

tail(pageRank_auspol_semantic,n=3)
#lockthegate  #northkorea    #reachtel 
  0.00360652   0.00360652   0.00360652 

Obviously the #auspol hashtag is going to be the most important because it occurs at least once in every tweet. We can actually avoid this by using the removeTermsOrHashtags argument when we Create() the network. This argument specifies which terms or hashtags (i.e. nodes with a name that matches one or more terms) should be removed from the semantic network. This is useful to remove the search term or hashtag that was used to collect the data (i.e. remove the corresponding node in the graph). For example, a value of “#auspol” means that the node with the name “#auspol” will be removed. Note: you could also just delete the #auspol node manually.
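Deleting the node manually, as mentioned in the note, can be done with igraph's delete_vertices() function. The sketch below uses a toy semantic network with invented edges rather than the collected data.

```r
library(igraph)

# Toy undirected semantic network in which "#auspol" connects to everything
edges <- data.frame(
  from = c("#auspol", "#auspol", "adani"),
  to   = c("adani", "csg", "csg")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Remove the search-term node; its incident edges are removed with it
g_without <- delete_vertices(g, "#auspol")

V(g_without)$name
ecount(g_without)
```

Only the edge between the two remaining terms survives.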

Another key aspect of semantic networks is how many terms to include in the network. By default, SocialMediaLab does not include every unique term that it finds in the tweets, but only the 5 percent most frequently occurring terms. You can change this with the termFreq argument when calling the Create() function, for example by specifying termFreq=50 (meaning that the 50 percent most frequently occurring terms will be included in the semantic network).
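To get a feel for what keeping “the top x percent most frequent terms” means, here is a rough base R sketch. The helper keep_top_percent() is hypothetical, and the exact cutoff rule used by SocialMediaLab (for instance, how ties are handled) is not documented here, so treat this only as an illustration of the idea.

```r
# Toy term occurrences pooled across tweets (invented)
terms <- c("adani", "adani", "adani", "csg", "csg",
           "coal", "india", "trump", "qanda", "qanda")

# Frequency of each unique term, most frequent first
freq <- sort(table(terms), decreasing = TRUE)

# Hypothetical helper: keep the top 'pct' percent of unique terms by frequency
keep_top_percent <- function(freq, pct) {
  n_keep <- max(1, ceiling(length(freq) * pct / 100))
  names(freq)[seq_len(n_keep)]
}

top50  <- keep_top_percent(freq, 50)
top100 <- keep_top_percent(freq, 100)
top50
```

With six unique terms, a value of 50 keeps the three most frequent ones, while 100 keeps them all.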

We can actually try this out now. We will create another semantic network, but we will exclude the #auspol hashtag, and we will include every single term available in the tweets.

g_twitter_semantic_auspol_allTerms <- myTwitterData %>% Create("Semantic", termFreq=100, removeTermsOrHashtags=c("#auspol"))
[1] "Generating Twitter semantic network..."

Done.

The size of the network will increase a lot, even in the absence of the #auspol term!

g_twitter_semantic_auspol_allTerms
IGRAPH db51550 UNW- 385 663 -- 
+ attr: name (v/c), label (v/c), weight (e/n)
+ edges from db51550 (vertex names):
 [1] auspol         --#nrlgf      turnbullmalcolm--#nrlgf      great          --#nrlgf      nrlgf          --#nrlgf      turnbull       --#nrlgf     
 [6] johnwren       --#nrlgf      judgement      --#nrlgf      keating        --#nrlgf      said           --#nrlgf      men            --#nrlgf     
[11] auspol         --#ausunions  ausunions      --#ausunions  cut            --#ausunions  pay            --#ausunions  boycott        --#ausunions 
[16] cream          --#ausunions  ice            --#ausunions  streets        --#ausunions  unionsaustralia--#ausunions  workers        --#ausunions 
[21] australia      --#ausunions  auspol         --#qanda      ausunions      --#qanda      #qanda         --qanda       #qanda         --taken      
[26] #qanda         --clubegaffer #qanda         --hasnt       #qanda         --increase    #qanda         --job         #qanda         --profiteer  
[31] #qanda         --robot       #qanda         --wealth      #qanda         --take        #ausunions     --qanda       #ausunions     --taken      
[36] #ausunions     --clubegaffer #ausunions     --hasnt       #ausunions     --increase    #ausunions     --job         #ausunions     --profiteer  
+ ... omitted several edges

What are the top 10 important terms in our semantic network now? Once again we will calculate this using PageRank:

pageRank_auspol_semantic_replicate <- sort(page.rank(g_twitter_semantic_auspol_allTerms)$vector,decreasing=TRUE)
pageRank_auspol_semantic_replicate[1:10]
 #4corners     #adani     auspol       #csg    #qldpol #stopadani   #voteyes     #qanda #ausunions     #nrlgf 
0.03945061 0.03310908 0.03280464 0.03179612 0.03114190 0.03057364 0.02554519 0.02432106 0.02142076 0.01754020 
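The same ranking step can be tried on a toy network. The sketch below uses made-up term names and igraph's page_rank() function (the current name for page.rank() used above); it is only meant to show the shape of the result, not to reproduce the #auspol figures.

```r
library(igraph)

# Toy undirected term network: "hub" co-occurs with every other term
edges <- data.frame(
  from = c("hub", "hub", "hub", "hub", "a"),
  to   = c("a",   "b",   "c",   "d",   "b"),
  stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(edges, directed = FALSE)

# page_rank() returns a list; the scores are stored in $vector
pr <- sort(page_rank(g_toy)$vector, decreasing = TRUE)
top_term <- names(pr)[1]
```

The scores sum to 1 across all nodes, so each value can be read as that term's share of the total "importance" in the network.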

Find communities

Is there any kind of community structure within the user network? We will use the infomap algorithm implementation in igraph.

library(igraph)
# increase nb.trials for better quality communities
imc <- infomap.community(g_twitter_actor, nb.trials = 10)
Modularity is implemented for undirected graphs only.
# create a vector of users with their assigned community number
communityMembership_auspol <- membership(imc)
# summarise the distribution of users to communities
commDistribution <- summary(as.factor(communityMembership_auspol))
# which community has the max number of users
tail(sort(commDistribution), n = 1)
 1 
81 

Look into the members of each community:

# create a list of communities that includes the users assigned to each community
communities_auspol <- communities(imc)
# look at the members of the most populated community 
communities_auspol[names(tail(sort(commDistribution),n=1))]
$`1`
 [1] "Warstub"         "ochreblue"       "PreAmpPlus"      "plabg"           "BobOfAlex"       "Firemonkey991"   "NaomiCra"        "FeeFeeCee"      
 [9] "SocialistMason"  "Elaine_de_Saxe"  "mishyloan"       "Ducatio_"        "jurylady5"       "MSMWatchdog2013" "Isayneversaydie" "pleaseuseaussie"
[17] "HittingAlice"    "MischelleCamill" "1JoyDuck"        "Ruxyrob"         "rustenburg_J"    "defendressofsan" "MdmAbsentMinded" "Left_of_Labor"  
[25] "Angelioannou"    "AnnalieseRoss"   "OzMacca46"       "pbro2333_brown"  "r7yrb7"          "blanketcrap"     "mormorlady"      "VeriteGrace"    
[33] "AmbientUXr"      "JayjaysMd"       "tilleyfab"       "alicia94985048"  "oldjoeschmo"     "energy_bu"       "vonoviedo"       "drewsmilitia"   
[41] "HOSKINMANDY22"   "MilesChamp"      "mavisgrizzltits" "deniseshrivell"  "leafyflower1"    "JustDoingJunk"   "Eschertology"    "wsj2150"        
[49] "KATHS97"         "ceciliemurray"   "bigislandwa"     "IndiBlu"         "KCGMSuperpit"    "sacarlin48"      "cameron_gobbo1"  "ajcdjp"         
[57] "fredanurks"      "Ned_Kelly"       "unionsaustralia" "GhostWhoVotes"   "lovethatloaf"    "hearyanow"       "CFMEUJohnSetka"  "WendyFarmer_"   
[65] "MGliksmanMDPhD"  "LesStonehouse"   NA                "AustralianLabor" "jackietrad"      "OzSheela"        "SaveOurSpit"     "GetUp"          
[73] "43a6f0ce5dac4ea" "gautam_adani"    "AdaniOnline"     "JulieBishopMP"   "4corners"        "JoshFrydenberg"  "FergusonNews"    "neighbour_s"    
[81] "StephenLongAus" 
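The bookkeeping steps above (count users per community, find the largest community, list its members) can be sketched in base R without running a community detection algorithm. The membership vector and user names below are made up for illustration.

```r
# Hypothetical community assignments for six users
membership_toy <- c(alice = 1, bob = 1, carol = 2, dave = 1, erin = 3, frank = 2)

# Number of users per community
comm_sizes <- table(membership_toy)

# Label of the most populated community
largest <- names(comm_sizes)[which.max(comm_sizes)]

# Users grouped by community, analogous to communities(imc)
members <- split(names(membership_toy), membership_toy)
members[[largest]]
```

Here table() plays the role of summary(as.factor(...)) and split() plays the role of communities().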

Graph Projection

Another useful technique is to project the Facebook networks we just created. These networks are bipartite because nodes of the same type cannot share an edge (e.g. a user can only like or comment on a post, not on another user, and posts cannot perform directed actions on users or other posts).

What we can do is induce two subgraphs from each network. More specifically, we can induce two actor networks, one for the users and one for the posts.

## some data preparation
# coerce to factor
g_bimodal_facebook_star_trek_projection <- g_bimodal_facebook_star_trek
V(g_bimodal_facebook_star_trek_projection)$type <- as.factor(V(g_bimodal_facebook_star_trek_projection)$type)
# coerce all posts (i.e. "1") to logical (i.e. FALSE)
V(g_bimodal_facebook_star_trek_projection)$type[which(V(g_bimodal_facebook_star_trek_projection)$type=="1")] <- as.logical(FALSE)
# coerce all users (i.e. "2") to logical (i.e. TRUE)
V(g_bimodal_facebook_star_trek_projection)$type[which(V(g_bimodal_facebook_star_trek_projection)$type=="2")] <- as.logical(TRUE)
# now project the network
projection_g_bimodal_facebook_star_trek <- bipartite.projection(g_bimodal_facebook_star_trek_projection)
vertex types converted to logical
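To see what a projection does on a small scale, here is a minimal sketch on a toy user-post network. The vertex names are made up, and the vertex type is stored directly as a logical attribute (TRUE for users, FALSE for posts), which is an assumption that sidesteps the coercion step above.

```r
library(igraph)

# Toy user-post network: logical `type` marks users (TRUE) and posts (FALSE)
verts <- data.frame(
  name = c("u1", "u2", "u3", "p1", "p2"),
  type = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("u1", "u2", "u2", "u3"),
  to   = c("p1", "p1", "p2", "p2"),
  stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(edges, directed = FALSE, vertices = verts)

# proj1 holds the FALSE-type vertices (posts), proj2 the TRUE-type (users)
proj <- bipartite_projection(g_toy)
post_net <- proj$proj1
user_net <- proj$proj2
```

In this toy case the two posts are linked because u2 interacted with both, while the user projection links u1 with u2 (via p1) and u2 with u3 (via p2).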

Firstly, we will look at the induced graph for the “posts”. The induced “posts” actor network consists only of nodes that are of type “post”. An edge exists between post i and post j if they were both liked or commented on by the same user (i.e. if they have any user in common). Not surprisingly, every post has at least one user in common, which results in the network being “complete”.

projection_g_bimodal_facebook_star_trek[[1]]
IGRAPH b541d3a UN-- 1 0 -- 
+ attr: name (v/c), label (v/c)
+ edges from b541d3a (vertex names):
# png('facebook_star_trek_posts.png', width=800, height=700)
plot(projection_g_bimodal_facebook_star_trek[[1]], edge.width = 1.5, edge.curved = 0.5, 
    edge.arrow.size = 0.5)  #vertex.shape='none',

# dev.off()

Secondly, we will look at the induced graph for the “users”. The induced “users” actor network consists only of nodes that are of type “user”. An edge exists between user i and user j if they both liked or commented on the same post (i.e. they share an interaction with some post k). As you might expect, this creates a network with a massive number of edges! A lot of users co-interact with the same posts. In the output below, the network has 1,155 users and 666,435 edges (your results might be somewhat different).

# warning - do not use ‘str‘ function because it will
# cause R to freeze up due to overloading the console output!
# Also: you will probably have difficulty plotting this graph in R because it is so big 
projection_g_bimodal_facebook_star_trek[[2]]
IGRAPH 5aa91b1 UNW- 1155 666435 -- 
+ attr: name (v/c), label (v/c), weight (e/n)
+ edges from 5aa91b1 (vertex names):
 [1] A Louise Garber--A.J. Zeien            A Louise Garber--Aaron Carroll         A Louise Garber--Aaron Niedzielski    
 [4] A Louise Garber--Adalmir Quintanilha   A Louise Garber--Adam Manny Breaux     A Louise Garber--Adam Redfern         
 [7] A Louise Garber--Adam Terrazas         A Louise Garber--Addie Tennant         A Louise Garber--Aditya Singh         
[10] A Louise Garber--Aditya Tamhankar      A Louise Garber--Adriana McGee         A Louise Garber--Akira Kudou          
[13] A Louise Garber--Al Shakespeare        A Louise Garber--Al Stikeleather       A Louise Garber--Alain Adriaenssens   
[16] A Louise Garber--Alan C. Huffines      A Louise Garber--Alan Crumb            A Louise Garber--Alan George          
[19] A Louise Garber--Albert Orkenbjorken   A Louise Garber--Alberto Gudiño        A Louise Garber--Aldemar López        
[22] A Louise Garber--Ale Pescus            A Louise Garber--Alejandro Santillan   A Louise Garber--Alex Cheatom         
+ ... omitted several edges

Maybe there is some community structure to this large network. There are several ways to find out. We will use the infomap algorithm implementation in igraph. Infomap uses an information theoretic, flow-based approach to calculating community structure in networks. It supports weighted and directed graphs, and also scales well.

The results show that there is definitely some interesting community structure to the user actor network (a handful of large communities and a tiny community), although your results might differ, depending on the actual data collected.

# limit the 'nb.trials' argument to a small number to save time
# (number of attempts to partition the network)
imc_starwars <- infomap.community(projection_g_bimodal_facebook_star_trek[[2]], 
    nb.trials = 3)

# create a vector of users with their assigned community number
communityMembership_starwars <- membership(imc_starwars)

# summarise the distribution of users to communities
commDistribution_starwars <- summary(as.factor(communityMembership_starwars))

# which community has the max number of users
tail(sort(commDistribution_starwars), n = 1)

# create a list of communities that includes the users assigned to each
# community
communities_starwars <- communities(imc_starwars)

# look at the members of the *least* populated community
communities_starwars[names(head(sort(commDistribution_starwars), n = 1))]

Text Analysis

Next, we will do some descriptive text analysis of the Star Wars fan comments.

Data pre-processing

We just want to keep the character vector of ‘comments’ data, for our purposes in this session:

fbData <- myStarWarsData$commentText

We only want elements of fbData that contain comment text (many rows of our Facebook data represent ‘likes’, rather than ‘comments’). So we remove any text data that equals “Not_applicable” (this is how SocialMediaLab designates rows in the dataframe that are ‘likes’). Note: in earlier versions of SocialMediaLab these elements were designated as NA, however this caused unintended consequences so it was changed.

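toRemove <- which(fbData=="Not_applicable")
# remove the elements we want to exclude
# (only if there are any, because indexing with an empty negative index would drop everything)
if (length(toRemove) > 0) fbData <- fbData[-toRemove]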
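One pitfall with removing elements by negative index is worth knowing: if which() finds no matches, it returns an empty vector, and indexing with that empty vector drops every element rather than none. A minimal sketch with made-up comments:

```r
comments <- c("great film", "love it")       # no "Not_applicable" entries

idx <- which(comments == "Not_applicable")   # integer(0): nothing matched
dropped_all <- comments[-idx]                # character(0): everything is lost

# Logical indexing keeps every element when nothing matches
kept <- comments[comments != "Not_applicable"]
```

Either checking that the index is non-empty before using it, or filtering with a logical condition as in the last line, avoids the problem.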

How many comments do we have left now?

length(fbData)
[1] 220

We convert the character encoding to UTF-8. This avoids errors relating to ‘odd’ characters in the text. This is usually a good idea, but there may be situations when it is not useful, or even detrimental. Note: Mac users may encounter errors/bugs relating to character encoding, and a workaround is to convert to ‘utf-8-mac’:

fbData <- iconv(fbData, to = "utf-8")
# **MAC USERS ONLY** should use this instead (uncomment to run):
# fbData <- iconv(fbData,to="utf-8-mac")

We convert our character vector fbData to a VCorpus object:

library(tm)
fbCorpus <- VCorpus(VectorSource(fbData))

Individual comments (a.k.a. ‘documents’) can be accessed via the double brackets notation or the ‘dollar sign’ notation for accessing list elements. Let’s look at comment #4.

fbCorpus[[4]][[1]]
[1] "OMG"
# another way to access it
fbCorpus[[4]]$content
[1] "OMG"

We can perform a number of highly useful transformations of text using the tm_map function (i.e. ‘mapping to the corpus’). Not all of these transformations are useful in every scenario! Apply each one only when it makes sense for your data and research question.

Converting all the text to lowercase:

fbCorpus <- tm_map(fbCorpus, content_transformer(tolower))

Remove numbers from the text:

fbCorpus <- tm_map(fbCorpus, removeNumbers)

Remove punctuation from the text:

fbCorpus <- tm_map(fbCorpus, removePunctuation)

Perform ‘word stemming’ on the text. Note: this transformation can be highly useful, but also highly detrimental!

# use lazy=TRUE argument to avoid warning on some machines with multiple CPU cores
fbCorpus <- tm_map(fbCorpus, stemDocument,lazy=TRUE)

We can also remove English ‘stop words’ from the text. These are common words (e.g. ‘the’, ‘and’, ‘or’) that we may want to exclude from our analysis. Once again, this is highly useful but also needs to be carefully applied.

fbCorpus <- tm_map(fbCorpus, removeWords, stopwords("english"),lazy=TRUE) 
# use lazy=TRUE argument to avoid warning on some machines with multiple CPU cores

Eliminate unnecessary ‘white space’ from the text. For example, “hello    everyone   my name    is fred” becomes “hello everyone my name is fred”:

fbCorpus <- tm_map(fbCorpus, stripWhitespace, lazy=TRUE)

We can observe the difference now by examining comment #4 again:

fbCorpus[[4]]$content
[1] "omg"

We could also define our own stop words and transform the text using these:

myStopwords <- c("jar","binks")
fbCorpus <- tm_map(fbCorpus, removeWords, myStopwords)
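To see roughly what these transformations do to a single string, the following sketch uses base R stand-ins. The regular expressions are an approximation of the tm functions named in the comments, not their actual implementations, and the example sentence is made up.

```r
raw <- "Hello,   Everyone!  My name is Fred 42"

step1 <- tolower(raw)                       # like content_transformer(tolower)
step2 <- gsub("[[:digit:]]+", "", step1)    # like removeNumbers
step3 <- gsub("[[:punct:]]+", "", step2)    # like removePunctuation
step4 <- gsub("[[:space:]]+", " ", step3)   # like stripWhitespace
clean <- trimws(step4)                      # drop leading/trailing blanks
```

Running the steps one at a time like this is a quick way to check whether a transformation removes something you actually need (text emoticons, for example, do not survive punctuation removal).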

Frequency analysis

Next we create a document-term matrix (DTM) from the fbCorpus object. DTMs are a very important concept for text analysis and are highly useful. DTMs can be thought about as a table (i.e. matrix) where the rows are ‘documents’ (i.e. Facebook comments in our dataset), and the columns are ‘terms’ (i.e. each unique word found across all the documents in the dataset). The ‘cells’ (i.e. elements) of the matrix indicate how many times term n occurred in document m.

Note: we use the control argument to specify that we only want to retain words that are minimum character length of 3, up to a maximum of 20 characters.

dtm <- DocumentTermMatrix(fbCorpus,control = list(wordLengths=c(3, 20)))
dtm
<<DocumentTermMatrix (documents: 220, terms: 816)>>
Non-/sparse entries: 1564/177956
Sparsity           : 99%
Maximal term length: 17
Weighting          : term frequency (tf)
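The structure of a document-term matrix is easy to see on a tiny example. The sketch below builds one by hand in base R from three made-up documents; it illustrates the concept only and is not how the tm package constructs its (sparse) matrices.

```r
docs <- c(d1 = "star wars star", d2 = "death star", d3 = "empire")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells hold term counts
dtm_toy <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm_toy) <- vocab

# Share of cells that are zero
sparsity <- mean(dtm_toy == 0)
```

Even with three short documents more than half of the cells are zero, which is why real document-term matrices are reported as 99% sparse or more.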

What we have is a sparse matrix, i.e. most of the elements of the matrix are 0, because most Facebook comments in our dataset contain only a small percentage of the ‘vocabulary’ of terms observed across the entire set of comments. What we want to do is remove terms that occur very infrequently, which will leave us with the most ‘important’ terms. We remove sparse terms using the removeSparseTerms function, which removes terms whose sparsity (the proportion of documents in which the term does not occur) reaches a given threshold.

For example, if we set it to 0.995, then all terms that are at least 99.5% sparse are removed. The following command lets us ‘test out’ what our document-term matrix would look like if we set the threshold to 0.995:

removeSparseTerms(dtm, 0.995)
<<DocumentTermMatrix (documents: 220, terms: 249)>>
Non-/sparse entries: 997/53783
Sparsity           : 98%
Maximal term length: 11
Weighting          : term frequency (tf)

0.995 will do the trick for us in this workshop, so we will create a new document-term matrix with this threshold applied to it:

dtmSparseRemoved <- removeSparseTerms(dtm, 0.995)
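The thresholding rule can be sketched by hand on a small matrix. The example below is a base R illustration with made-up counts, assuming the rule described above (a term is kept only if its sparsity is below the threshold); it does not call removeSparseTerms itself.

```r
# Toy document-term matrix: 4 documents (rows), 3 terms (columns)
m <- matrix(
  c(1, 1, 0,
    2, 0, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("star", "wars", "death"))
)

# Sparsity of a term = share of documents in which it does not occur
term_sparsity <- colMeans(m == 0)

# Keep terms whose sparsity is below the threshold
threshold <- 0.6
m_kept <- m[, term_sparsity < threshold, drop = FALSE]
```

With 220 documents and a threshold of 0.995, the same rule keeps only terms that occur in at least two documents, which matches the minimum frequency of 2 seen in the term list that follows.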

We can examine term frequencies in our data. We create a numeric vector of the column sums of our document-term matrix (after coercing it to a matrix object). This gives us a named numeric vector where the names are the unique terms in our document-term matrix, and the values are the number of times each term occurs across the whole corpus.

freqTerms <- colSums(as.matrix(dtmSparseRemoved))
freqTerms
     actual       admir     allianc        also       alway        amaz       among       anoth       anyon       anyth 
          4           2           2           3           5           3           2           4           2           3 
   argument         arm      artist         ask        atat      attack      awesom        back  background         bad 
          2           2           2           4           2           3           4           3           2           2 
       base       battl      beauti      behind      believ        best         big        blew        blow       brian 
          3           4           3           8           2           2           4           2           3           2 
      bring      budget       build       built   butterfli        call        came         can        cant       choke 
          2           3          16           2           4           2           4           6           2           2 
       come     command     complet   construct  contractor        cool     couldnt      dainti       darth        dead 
          2           4           3           8          10           3           4           3           3           2 
      death   deathstar     definit     destroy    destruct      detail       didnt         die        dish        dont 
         44           2           2          11           4           3           6           2           2           5 
      doubl      effort      elabor     emperor       empir       endor       enemi      episod        even        ever 
          2           4           2          17          21           6           2           3           4           3 
    everyth        evil        ewok      expect        face        fail      falcon     favorit        find      finish 
          2           5           3           2           2           2           2           3           2           8 
      first      flight        forc      forgiv       found       fulli         get        good         got      govern 
         16           2           6           3           2           6           5          11           4           2 
      great      ground       hadnt        half        hate        help        hope       hundr      imagin      imperi 
          2           2           2           4           3           2           4           3           2           3 
   independ      instal        issu        jedi         job        juan        just        keep        kill        kind 
          2           2           2           3           2           2          15           2           4           2 
       knew        know    knowledg        land       lando       laser       learn         led         let        life 
          2          10           2           6           2           2           4           2           2           2 
       like        live         lol        long        look        lost        love        luke        made        main 
         15           2           4           4           4           3           9           2           3           2 
       make         man       manag        mani        mass         may        mayb         men      moment        moon 
         11           4           2           7           2           2           4           3           2           2 
       movi        much        need         new        nice         now         old         one        oper      origin 
         11           2           4           9           3           5           2          17           7           6 
      paint       peopl     permiss         pic      pictur       pilot        plan      planet       pleas        poor 
          6           3           4           2           7           2           4           3           2           3 
     poster       power     probabl     project      rather      realli       rebel   rebellion     redoubl      releas 
          2           5           4           6           2           5          11           5           2           2 
     rememb      return       right         run         saw         say     schedul      screen      second       secur 
          2           2           3           2           4           2           7           2           6           2 
        see      shield        ship        shot    shouldnt        show      shuttl        sick        size      someth 
          7           3           5           2           2           2           7           2           2           5 
      space        span        star       start     station       steal  stormtroop    straight      strike superweapon 
          4           2          51           3           3           2           3           2           3           2 
       sure        take      target      tarkin       thank        that       there       thing       think     thought 
          4           3           2           2           2           8           5          11           9           8 
     thrawn        time      toilet        took        trap         tri      trilog        turn         two    unfinish 
          2          10           2           8          10           3           2           3           3           3 
      union     univers         use       vader     version       visit        vong        wait        wall     wallpap 
          4           2           3           9           2           3           2           2           2           2 
       want         war       wasnt       watch         way        weak      weapon        well      werent        will 
          5           9           6           5           3           3           4           5           2           5 
    william        wish     without        work      worker     wouldnt        yeah        year     yuuzhan 
          2           3           2           6           3           2           2           9           2 

We order the term frequencies and look at the 6 most frequent terms and then the 6 least frequent terms (6 is the default for head and tail):

orderTerms <- order(freqTerms,decreasing=TRUE)
freqTerms[head(orderTerms)]
   star   death   empir emperor     one   build 
     51      44      21      17      17      16 
freqTerms[tail(orderTerms)]
 werent william without wouldnt    yeah yuuzhan 
      2       2       2       2       2       2 

Which terms occurred at least 20 times?

findFreqTerms(dtmSparseRemoved, 20)
[1] "death" "empir" "star" 

We can do a basic correlation analysis by looking at the correlations between terms with the findAssocs function. If two terms always appear together then corr = 1, and terms that never appear together are not positively correlated. Let’s look at which terms co-occur with the term “good”, with a lower correlation limit of 0.5.

findAssocs(dtmSparseRemoved, "good", corlimit=0.5)
$good
   evil   anyon  believ     can  expect    kind    life  return without   anyth    ever   peopl   start    turn 
   0.87    0.68    0.68    0.68    0.68    0.68    0.68    0.68    0.68    0.55    0.55    0.55    0.55    0.55 
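The idea behind these association scores can be illustrated with plain correlations between the columns of a small matrix. This is a base R sketch with made-up counts, assuming the scores are ordinary (Pearson) correlations between term columns; it does not call findAssocs.

```r
# Toy document-term matrix: 5 documents, 3 terms
m <- cbind(
  good = c(1, 0, 1, 0, 0),
  evil = c(1, 0, 1, 0, 0),   # always appears with "good"
  ewok = c(0, 1, 0, 1, 0)    # never appears with "good"
)

# Correlation of "good" with every other term
assoc <- cor(m[, "good"], m[, c("evil", "ewok")])[1, ]

# Keep only associations at or above the lower limit
strong <- assoc[assoc >= 0.5]
```

In this toy case "evil" has a correlation of 1 with "good", while "ewok" has a negative correlation and would be filtered out by the limit.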

Next, we can do some text visualisation. First, we can plot our descriptive statistics in various ways. For example, using a barchart to visualise the 20 most frequent terms (we will use the lattice package for a nice bar chart):

require(lattice)
Loading required package: lattice
# png("barchart_frequent_terms.png", width=800, height=700)
barchart(freqTerms[orderTerms[1:20]])

# dev.off()

Word Cloud

Next, we will construct a comparison word cloud of the Star Wars and Star Trek fan page comments.

# create a character vector of the Star Wars comments
# (i.e., take a subset of elements from the commentText column of the dataframe)
starWarsComments <- myStarWarsData$commentText[which(myStarWarsData$commentText!="Not_applicable")]
starWarsComments <- paste(starWarsComments , collapse = " ")
# do the same, but for Star Trek
starTrekComments <- myStarTrekData$commentText[which(myStarTrekData$commentText!="Not_applicable")]
starTrekComments <- paste(starTrekComments , collapse = " ")
# combine them together into a dataframe
df_ALL <- data.frame(group=c("Star_Wars","Star_Trek"),words=c(starWarsComments,starTrekComments))

Data pre-processing:

# search for any texts that have no characters (i.e. are ’empty’)
# and then remove these elements from the vector
toRemove <- which(df_ALL$words=="")
# are there any ’empty’ text elements?
# (i.e. length of toRemove is not equal to zero)
# if true, then we remove the corresponding rows from the dataframe
if (isTRUE(length(toRemove)!=0)) {
df_ALL <- df_ALL[-toRemove,]
}
# we create a character vector from the "words" column of df_ALL
# this will be our independent variable.
# we do not want text as factors, so we will coerce it to character
words <- as.character(df_ALL$words)
# we will convert the character encoding to UTF-8
# just to be sure there are no odd characters that
# may cause problems later on
words <- iconv(words, to = "UTF-8")
# ** MAC USERS ONLY ** should use this instead (uncomment to run):
# words <- iconv(words, to = "UTF-8-mac")
# using ’tm’ package we convert character vector to a VCorpus object (volatile corpus)
corp <- VCorpus(VectorSource(words))
## now we do transformations of text using tm_map (’mapping to the corpus’)
# eliminate extra whitespace
corp <- tm_map(corp, stripWhitespace)
# convert to all lowercase
corp <- tm_map(corp, content_transformer(tolower))
# perform stemming (not always useful!)
# corp <- tm_map(corp, stemDocument)
# remove numbers (not always useful!)
corp <- tm_map(corp, removeNumbers)
# remove punctuation (not always useful! e.g. text emoticons)
corp <- tm_map(corp, removePunctuation)
# remove stop words (not always useful!)
corp <- tm_map(corp, removeWords, stopwords("english"))
# create a term-document matrix (terms as rows, one column per group)
# had to do it this way to be able to use colnames
tdm <- TermDocumentMatrix(corp)
tdm <- as.matrix(tdm)
#print(tdm)
colnames(tdm) <- c("Star_Wars","Star_Trek")
colorsx=c("blue","red")

Word Cloud visualization:

require(wordcloud)
Loading required package: wordcloud
#note: if changing res of png, can’t have dimensions in pixels (led to wordclouds with very few words...)
# png("facebook_starwars_startrek_comparison_cloud.png", width=12, height=8, units="in", res=300)
#comparison.cloud(tdm,max.words=300,random.order=FALSE)
comparison.cloud(tdm,max.words=200,random.order=FALSE,colors=colorsx)

#commonality.cloud(tdm,random.order=FALSE)
# dev.off()

Social Network Visualization

Small-scale visualization

YouTube actor network (user-user)

We can visualise a network by plotting it directly in R:

# png("g_actor_youtube.png", width=800, height=700)
plot(g_actor_youtube, vertex.shape="none", edge.width=1.5, edge.curved=.5, edge.arrow.size=0.5, asp=9/16, main="Users as actor network")

# dev.off()

Facebook bimodal network (user-post)

Before plotting the graph, change the node color such that Posts are red, while Users are the default color (blue).

V(g_bimodal_facebook_star_trek)$color <- ifelse(V(g_bimodal_facebook_star_trek)$type == "Post", "red", "blue")

We can see the network with the following:

plot(g_bimodal_facebook_star_trek)

In RStudio, the plot pane is generally too small, so it can help to open a separate X11 graphics device (only on machines with access to an X server):

x11()
plot(g_bimodal_facebook_star_trek)

The following set of commands prints the plot (with some plot options to improve the visualisation) to file:

# png('g_bimodal_facebook_star_trek.png', width=800, height=700)
plot(g_bimodal_facebook_star_trek, vertex.shape = "none", edge.width = 1.5, 
    edge.curved = 0.5, edge.arrow.size = 0.5, vertex.label.color = V(g_bimodal_facebook_star_trek)$color, 
    asp = 9/16, margin = -0.15)
# dev.off()

Twitter actor network (user-user)

Next, we can visualise the network by plotting it directly in R:

# png('g_twitter_actor.png', width=800, height=700)
plot(g_twitter_actor, vertex.shape = "none", edge.width = 1.5, edge.curved = 0.5, 
    edge.arrow.size = 0.5, asp = 9/16, margin = -0.15)

# dev.off()

Large-scale visualization

Installing Gephi

You can export the social network created by R/SocialMediaLab and import it into Gephi, which is a fantastic network visualisation program. If you wish to do this, then you need to download and install Gephi.

To do network analysis (e.g. export network to file) we also need to download and install the igraph package created by Gabor Csardi. We can do this by entering the following command into the R console:

install.packages("igraph") # Install package

Or load the package if it is already installed:

require(igraph) #Load package

Using Gephi

Because of the size of the two networks we have created, we won’t try to visualise them in R. To visualise the “Star Trek” network using Gephi, first export the network using the ‘graphml’ network file format:

write.graph(g_bimodal_facebook_star_trek, "g_bimodal_facebook_star_trek.graphml", format="graphml")

Gephi is the ideal software for visualising networks created in SocialMediaLab. In this section we will import into Gephi the Instagram network we generated earlier, in order to visualise it.

First, open Gephi. You should be presented with a dialogue box. Click “Open Graph File…” (or select from menu ‘File’–>‘Open’) and navigate to the working directory of your R project for this tutorial. The working directory should contain a graphml file that was generated in the Instagram section. Select this file and click open.

You will be presented with an ‘Import report’ dialogue box providing information about the network. Simply click OK. Your network will then be presented in a very awful format - possibly looking like a gray-coloured hairball. Let’s fix that.

In the top-left hand box under the ‘Appearance’ tab, click the little icon that says ‘Size’ when you hover over it. Select “Indegree” from the dropdown menu. Set the min size to 5 and the max size to 50, and click Apply. Now the nodes in your network are sized based on how many inlinks they receive from other nodes. It’s still an ugly network though!
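The sizing step can also be reasoned about outside Gephi. The sketch below computes in-degree from a made-up directed edge list in base R and rescales it to the range 5 to 50; the linear interpolation is an assumption for illustration and may not match exactly how Gephi maps values to sizes.

```r
# Toy directed edge list (made-up node names)
edges <- data.frame(
  from = c("a", "b", "c", "d", "a"),
  to   = c("hub", "hub", "hub", "hub", "b"),
  stringsAsFactors = FALSE
)
nodes <- sort(unique(c(edges$from, edges$to)))

# In-degree: number of edges pointing at each node
indegree <- table(factor(edges$to, levels = nodes))

# Linearly rescale in-degree to the range [min_size, max_size]
min_size <- 5
max_size <- 50
scaled <- (indegree - min(indegree)) / (max(indegree) - min(indegree))
node_size <- setNames(as.numeric(min_size + (max_size - min_size) * scaled), nodes)
```

Nodes that receive no links get the minimum size, and the most linked-to node gets the maximum.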

---
title: "SocialMediaLab - Social Network Analysis (SNA)"
output:
  html_notebook:
    toc: yes
  html_document:
    highlight: tango
    number_sections: yes
    theme: united
    toc: yes
    toc_depth: 4
  pdf_document:
    highlight: tango
    latex_engine: xelatex
    toc: yes
  word_document:
    toc: yes
---

# Social Media Data Collection

## Collecting Twitter data and creating social networks

In this section we will run through how to collect data from Twitter, create networks, and perform different
kinds of analysis.

It is currently possible to create 3 different types of networks using Twitter data collected with SocialMediaLab.
These are (1) actor networks; (2) bimodal networks; and (3) semantic networks. 

First, define the API credentials. Due to the Twitter API specifications, it is not possible to save authentication
tokens between sessions. The Authenticate() function is called only for its side effect, which provides access
to the Twitter API for the current session.

### Authenticating with the Twitter API

Go to Twitter [Application Management](https://apps.twitter.com/) and "Create New App". Complete all fields in the form and create a new app. In your app page, go to "Keys and Access Tokens", generate your access token, and copy the information to R:
```{r}
myapikey <- "801xU41LxZmHrNh7NY3wVnB7C" # Put your own api info here
myapisecret <- "E4sATBjT3ZSARghmXgjrdcOZP5rVWHVXFxAwxvoG24QLN5oDQg" # Put your own api info here
myaccesstoken <- "19856730-XHDOuNW03D6Gqjv2JDQFxhZmurdsTe55nLMU48kjq" # Put your own api info here
myaccesstokensecret <- "2xFYrFO1yt1H0WzgHEjzqojf6ThNeBfC0f0bYJB2kwaQZ" # Put your own api info here
```

Given that we are going to be creating two different types of Twitter networks (actor and semantic), we will
Collect() the data, but not pipe it directly through to Create() straight away. This means we can reuse
the data multiple times to create two different kinds of networks for analysis.
We will collect 150 recent tweets that have used the #auspol hashtag. This is the dominant hashtag for
Australian politics.
The first step in the workflow is to authorise access to the Twitter API. Instructions for obtaining Twitter API
access are available from the VOSON website. See the previous section for a brief explanation of APIs.
```{r}
require(SocialMediaLab)
require(magrittr)
myTwitterData <- Authenticate("twitter",
                              apiKey=myapikey,
                              apiSecret=myapisecret,
                              accessToken=myaccesstoken,
                              accessTokenSecret=myaccesstokensecret) %>%
Collect(searchTerm="#auspol", numTweets=150, writeToFile=FALSE, verbose=TRUE)
```

We can have a quick look at the data we just collected:
```{r}
View(myTwitterData)
```

Note the class of the dataframe, which lets SocialMediaLab know that this is an object of class dataSource,
which we can then pass to the Create() function to generate different kinds of networks:

```{r}
class(myTwitterData)
```

If you find that you are encountering errors possibly related to the text of the tweets, you can try converting
the tweet text to UTF-8 character encoding. Roughly speaking, this command will help to deal with 'odd'
characters in the text.
```{r}
myTwitterData$text <- iconv(myTwitterData$text, to = 'utf-8')
```

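Mac users may also wish to try the following if they are encountering errors that may be due to
character encoding issues (the 'utf-8-mac' encoding is typically only available on macOS, so this chunk is not evaluated by default):
```{r eval=FALSE}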
myTwitterData$text <- iconv(myTwitterData$text, to = 'utf-8-mac')
```
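To see what such a conversion does, here is a minimal sketch on a made-up string (not taken from the collected tweets). It assumes the problematic text is Latin-1 encoded, which may not hold for your data; the variable names are illustrative only.

```{r}
# a made-up string containing a Latin-1 encoded "e acute" (byte 0xE9),
# which on its own is not valid UTF-8
toy_text <- "caf\xe9"
validUTF8(toy_text)       # FALSE

# convert from the assumed source encoding (Latin-1) to UTF-8
toy_text_utf8 <- iconv(toy_text, from = "latin1", to = "UTF-8")
validUTF8(toy_text_utf8)  # TRUE
```

If the source encoding is unknown, specifying only the `to` argument (as in the commands above) lets iconv assume the current locale's encoding.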

### Creating social networks with Twitter data

#### Actor network
First, we will create an actor network. In this actor network, edges represent interactions between Twitter
users. An interaction is defined as a "mention", "reply", or "retweet" from user i to user j, given "tweet" m. In
a nutshell, a Twitter actor network shows us who is interacting with whom in relation to a particular hashtag
or search term.
```{r}
g_twitter_actor <- myTwitterData %>% Create("Actor")
```
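The idea behind an actor network can be sketched on a toy example. The interactions below are invented, and the construction uses plain igraph functions rather than SocialMediaLab's internal method, which may differ in detail.

```{r}
library(igraph)

# made-up interactions: each row is one mention/reply/retweet from user i to user j
toy_interactions <- data.frame(
  from = c("alice", "alice", "bob", "carol"),
  to   = c("bob",   "bob",   "carol", "alice"),
  stringsAsFactors = FALSE
)

# count repeated interactions between the same ordered pair of users
# to obtain edge weights
toy_interactions$weight <- 1
toy_edges <- aggregate(weight ~ from + to, data = toy_interactions, FUN = sum)

# build a directed, weighted graph: one node per user, one edge per (i, j) pair
g_toy_actor <- graph_from_data_frame(toy_edges, directed = TRUE)
g_toy_actor
E(g_toy_actor)$weight
```

Here alice interacts with bob twice, so the alice-to-bob edge carries a weight of 2, while the other two edges have weight 1.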

We can now examine the description of our network:
```{r}
g_twitter_actor
```

#### Semantic network

[//]: # (
myTwitterData <- read.csv("Oct_02_15_08_38_2017_CEST_#auspol_TwitterData.csv")
)

Next, we will create a semantic network. In this network nodes represent unique concepts (in this case
unique terms/words extracted from a set of 150 tweets), and edges represent the co-occurrence of terms for all
observations in the data set. For example, for this Twitter semantic network, nodes represent either hashtags
(e.g. "#auspol") or single terms ("politics"). If there are 150 tweets in the data set (i.e. 150 observations),
and the term #auspol and the term politics appear together in every tweet, then this would be represented
by an edge with weight equal to 150.
```{r}
g_twitter_semantic <- myTwitterData %>% Create("Semantic")
```
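The co-occurrence weighting described above can be illustrated with base R on three invented "tweets". This is only a sketch of the counting idea, under the assumption that each tweet has already been split into unique terms; it is not how SocialMediaLab extracts terms internally.

```{r}
# made-up tweets, each already reduced to its set of unique terms
toy_tweets <- list(
  c("#auspol", "politics", "budget"),
  c("#auspol", "politics"),
  c("#auspol", "budget")
)
toy_terms <- sort(unique(unlist(toy_tweets)))

# tweet-by-term incidence matrix: 1 if the term appears in the tweet, else 0
toy_incidence <- t(vapply(toy_tweets,
                          function(tw) as.integer(toy_terms %in% tw),
                          integer(length(toy_terms))))
colnames(toy_incidence) <- toy_terms

# term-by-term co-occurrence counts; the diagonal (a term with itself) is dropped
toy_cooccurrence <- crossprod(toy_incidence)
diag(toy_cooccurrence) <- 0
toy_cooccurrence["#auspol", "politics"]  # 2: they appear together in two tweets
```

Each non-zero cell of `toy_cooccurrence` corresponds to a weighted edge between two terms in the semantic network.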

Let's have a look at the network description:
```{r}
g_twitter_semantic
```

#### Bimodal network

Now that we have our Twitter data we can generate a bimodal network. This kind of network provides
many possibilities for analysis and generating insights from our data.

In this bimodal network there are two types of nodes: users and hashtags. The bimodal network is therefore:

* directed (users can use hashtags, but hashtags can't use users)
* weighted (users can use the same hashtag multiple times)
* bipartite (edges only connect users to hashtags, never users to users or hashtags to hashtags)
* multiple edges or parallel edges (we have one edge for each use of hashtag j by user i)

We now run the Create() function, which creates an igraph object called g_bimodal_twitter. Creating
networks in SocialMediaLab is straightforward. We simply pass the myTwitterData object to the Create()
function, and it takes care of the rest. We specify what kind of network we want to create (i.e. a bimodal
network) by specifying this as an argument to the Create() function.

Note also the 'pipe' operator %>%, which we used when collecting the data but have not yet explained. This operator comes from the magrittr package, and it is used to 'pipe' together commands in a chain, passing each value along the pipeline until it reaches the final command, which returns the output (i.e. the network we wish to create). In this instance we are passing (or "piping") the data we collected using Collect() through to the Create() function.
```{r}
g_bimodal_twitter <- myTwitterData %>% Create("Bimodal")
```
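As a small illustration of how the pipe works, the sketch below computes the same value twice, once with nested function calls and once with %>%. The numbers are arbitrary and the variable names are illustrative.

```{r}
library(magrittr)

# nested calls are read from the inside out
toy_nested <- round(sqrt(sum(c(1, 4, 9))), 1)

# the pipe passes the left-hand value as the first argument of the next function,
# so the same steps are read from left to right
toy_piped <- c(1, 4, 9) %>% sum() %>% sqrt() %>% round(1)

identical(toy_nested, toy_piped)  # TRUE
```

In the same way, `myTwitterData %>% Create("Bimodal")` is equivalent to `Create(myTwitterData, "Bimodal")`.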

We can now view basic information about the network:
```{r}
g_bimodal_twitter
```

## Collecting Facebook data

In this section we will run through how to collect data from Facebook, create networks, and perform different
kinds of analysis.

### Authenticate with Facebook Developers API

The process of authentication, data collection, and creating social networks can be expressed with the 3 verb
functions: Authenticate(), Collect(), and Create(). This simplified workflow exploits the pipe interface
of the magrittr package, and provides better handling of API authentication between R sessions.
What we are doing is "piping" the data forward using the %>% operator, in a kind of functional programming
approach. It means we can pipe together all the different elements of the workflow in a quick and easy
manner. This also provides the ability to save and load authentication tokens, so we don’t have to keep authenticating
with APIs between sessions. This opens up possibilities for automation and data mining projects.

Go to Facebook Developers Page (https://developers.facebook.com/apps/), create a new app, and copy the appID and appSecret information.

Make sure we have our appID and appSecret values defined:
```{r}
appID <- "2025204167698114" # Put your own api info here
appSecret <- "3a0b8f329e1ed9d803e02b5b3fdae831" # Put your own api info here
```

Save credential file for later access:
```{r eval=FALSE}
require(SocialMediaLab)
require(magrittr)
Authenticate("Facebook", appID = appID, appSecret = appSecret) %>% SaveCredential("FBCredential.RDS")
```

The first time you authenticate you will see this:

>Copy and paste into Site URL on Facebook App Settings: http://localhost:1410/

You have to paste this URL (http://localhost:1410/) into your app settings on the Facebook Developers page: (i) click on "Add Platform"; (ii) choose "Website" and paste http://localhost:1410/ into the Site URL field.

### Bimodal networks

First, we will collect 2 days' worth of activity from the Star Wars official page. This will collect all the posts
posted between the rangeFrom and rangeTo dates, including all comments and likes, and other associated
data including usernames, timestamps for comments, etc. Note: the date format is YYYY-MM-DD.

We will be using this data to create a bimodal network. This graph object is bimodal because edges represent
relationships between nodes of two different types. For example, in our bimodal Facebook network, nodes
represent Facebook users or Facebook posts, and edges represent whether a user has commented or ‘liked’ a
post. Edges are directed and weighted (e.g. if user i has commented n times on post j, then the weight of this
directed edge equals n).

```{r}
g_bimodal_facebook_star_wars <- 
  LoadCredential("FBCredential.RDS") %>% 
  Collect(pageName="StarWars", rangeFrom="2015-03-01", rangeTo="2015-03-02", writeToFile=TRUE) %>% 
  Create("Bimodal")
```

The magrittr pipe approach used in this example means that we only end up with the final graph object (in
the global environment). To ensure we retain the data that are collected, the argument writeToFile=TRUE is
used. This writes the data collected using the Collect() function to a local CSV file before it is piped through to
the network generation function Create(). We can then read it in as a dataframe (see code snippet below).
```{r}
# myStarWarsData <- read.csv("2015-03-01_to_2015-03-02_StarWars_FacebookData.csv")
myStarWarsData <- importData("2015-03-01_to_2015-03-02_StarWars_FacebookData.csv", "facebook")
View(myStarWarsData)
g_bimodal_facebook_star_wars <- myStarWarsData %>% Create("Bimodal")
```

This means we end up with two objects for further analysis, a graph object g_bimodal_facebook_star_wars,
and a dataframe object myStarWarsData.

Before proceeding into analysis, we will collect 2 days' worth of data from the Star Trek Facebook page, again
piping through the LoadCredential function, meaning that we are using the authentication
token that we stored locally in the previous step.
```{r}
require(SocialMediaLab)
g_bimodal_facebook_star_trek <- 
  LoadCredential("FBCredential.RDS") %>%
  Collect(pageName="StarTrek", rangeFrom="2015-03-01", rangeTo="2015-03-02", writeToFile=TRUE) %>%
  Create("Bimodal")
```

Read in the data to a dataframe:
```{r}
#read.csv() is not equivalent to Collect()
myStarTrekData <- importData("2015-03-01_to_2015-03-02_StarTrek_FacebookData.csv", "facebook") 

#if you want to create network from local data:
# g_bimodal_facebook_star_trek <- myStarTrekData %>% Create("Bimodal")
```

Now we can perform some analysis on the Star Wars network. Firstly, we will run through some essential
SNA techniques. After that we will do something a bit fancier, by comparing whether there are gender
differences between Star Wars and Star Trek networks.

We can get descriptive information about the network:
```{r}
g_bimodal_facebook_star_wars
```

This informs us that there are 1219 nodes and 1218 edges in the network (this may differ somewhat for your
own collected data). It tells us that our graph is Directed, Named, the edges are Weighted, and it also has
the additional property of being a Bipartite graph.

## Collecting YouTube video comment data

### Authenticate

We first ensure that the SocialMediaLab package is loaded:
```{r}
require(SocialMediaLab)
```


A Google Developer API Key is required for authenticating with the API (otherwise we cannot collect data). This requires a Google account. Instructions for obtaining a Google Developer API Key are available from the [YouTube Data API Overview](https://developers.google.com/youtube/v3/getting-started). Main steps:

1. You need a Google Account to access the Google Developers Console, request an API key, and register your application.
2. Create a project in the Google Developers Console and obtain authorization credentials so your application can submit API requests.
3. After creating your project, make sure the YouTube Data API is one of the services that your application is registered to use.

```{r}
my_apiKeyYoutube <- "AIzaSyAOS6z8kQjHJW5xL1kmjt8tYZd_JnzhpFE"
```

We then run the following function, which ensures everything is correctly set up to access the API:
```{r}
apiKeyYoutube <- AuthenticateWithYoutubeAPI(my_apiKeyYoutube)
```

### Get data

We now assign a character vector, specifying one or more YouTube video IDs that we wish to collect data from. For example, if the video URL is https://www.youtube.com/watch?v=W2GZFeYGU3s, then use videoIDs = 'W2GZFeYGU3s'. Tip: for many videos, the function GetYoutubeVideoIDs can be used to create a vector object suitable as input for videoIDs.
```{r}
videoIDs <- c("W2GZFeYGU3s","mL27TAJGlWc")
```
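If you have full video URLs rather than IDs, the ID can be pulled out with a regular expression. The helper below is a hypothetical illustration written in base R; it is not part of SocialMediaLab, and it assumes the standard `watch?v=` URL form (the `&t=10s` suffix on the second URL is added here only to show that extra parameters are ignored).

```{r}
# hypothetical helper: extract the value of the "v" parameter from a watch URL
extract_video_id <- function(url) {
  sub(".*[?&]v=([^&#]+).*", "\\1", url)
}

toy_urls <- c("https://www.youtube.com/watch?v=W2GZFeYGU3s",
              "https://www.youtube.com/watch?v=mL27TAJGlWc&t=10s")
toy_videoIDs <- extract_video_id(toy_urls)
toy_videoIDs
```

Note that a URL without a `v=` parameter would be returned unchanged, so the result should be checked before it is passed on.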

We now collect the YouTube comment data and store it in a data frame object named myYoutubeData. This data frame can then be used for creating networks for further analysis. We can supply various arguments to the CollectDataYoutube function, providing various options for the data collection process.
```{r cache=TRUE}
myYoutubeData <- CollectDataYoutube(videoIDs, apiKeyYoutube, writeToFile=TRUE, verbose=TRUE, maxComments=100)
```

We can examine the structure of our data frame:
```{r}
str(myYoutubeData)
```

### Generate actor network

We will now create a unimodal network, a.k.a. 'actor network', representing relationships between users who have interacted with each other. For YouTube comment threads, a relationship is defined as user i 'replying to' or 'mentioning' user j in a comment. In this network the vertices (a.k.a. 'nodes') represent YouTube users and the edges (a.k.a. 'links') represent whether (and how many times) user i has interacted with user j. The edges in this network are both directed and weighted. Edges are 'directed' because interactions may not be reciprocated (e.g. user i replies to user j, but user j does not reply to user i), and edges are also 'weighted' in order to show how many times user i has interacted with user j.
```{r}
require(igraph)
g_actor_youtube <- myYoutubeData %>% Create("Actor")
```

We can now view basic information about our network, notably the number of vertices (users) and number of interactions (edges):
```{r}
g_actor_youtube
```


## Storing your social network in graphml format

You can export a social network generated by SocialMediaLab to "graphml" format using the igraph package:

```{r}
require(igraph) # install.packages("igraph") if not installed yet
```

The following command writes the network to a "graphml" file, which you can then visualize with Gephi or other compatible tools:
```{r}
write.graph(g_bimodal_facebook_star_trek, "g_bimodal_facebook_star_trek.graphml", format="graphml")
```

# Social Network Analysis

## Basic counting

Next we will do some more descriptive analysis.

How many nodes are in the network?
```{r}
vcount(g_bimodal_twitter)
```

How many edges in the network?
```{r}
ecount(g_bimodal_twitter)
```

Get a list of the nodes in the network:
```{r}
V(g_bimodal_twitter)
```

List of edges in the network:
```{r}
E(g_bimodal_twitter)
```

Access a particular node in the network (node #42):
```{r}
V(g_bimodal_twitter)[42]
```

Access a particular edge (edge #1):
```{r}
E(g_bimodal_twitter)[1]
```

## Graph connectivity

Look at the connectivity of the graph:
```{r}
# who are the neighbours of node #42?
neighbors(g_bimodal_facebook_star_wars, 42)
# is the graph weakly connected (i.e. connected when edge directions are ignored)?
is.connected(g_bimodal_facebook_star_wars, mode="weak")
# information on connected components
cc <- clusters(g_bimodal_facebook_star_wars)
# which component each node is assigned to
# cc$membership
# size of each component
cc$csize
# number of components
cc$no

# subnetwork - giant (largest) component
g3 <- induced_subgraph(g_bimodal_facebook_star_wars, which(cc$membership == which.max(cc$csize)))
```
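The same steps can be followed on a toy graph where the answer is easy to verify by eye. This sketch uses the current igraph function names (`is_connected`, `components`), which correspond to the older `is.connected` and `clusters` used above; the graph itself is invented.

```{r}
library(igraph)

# toy undirected graph with two components: a-b-c and d-e
g_toy_cc <- graph_from_literal(a-b, b-c, d-e)

is_connected(g_toy_cc)   # FALSE: there is no path from a to d

toy_cc <- components(g_toy_cc)
toy_cc$no                # 2 components
toy_cc$csize             # sizes of the components

# keep only the nodes that belong to the largest component
g_toy_giant <- induced_subgraph(
  g_toy_cc,
  which(toy_cc$membership == which.max(toy_cc$csize))
)
V(g_toy_giant)$name
```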

We will now look at node centrality:
```{r eval=FALSE}
#node indegree
degree(g3, mode="in")
#node outdegree
degree(g3, mode="out")
#node indegree, using edge weights
ind <- strength(g3, mode="in")
#top-3 nodes, based on (weighted) indegree
V(g3)[order(ind, decreasing=TRUE)[1:3]]
#closeness centrality
closeness(g3)
#betweenness centrality
betweenness(g3)
#eigenvector centrality
evcent(g3)$vector
```
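To make the difference between degree and strength (weighted degree) concrete, here is a minimal sketch on an invented user-to-post network. The node names and weights are assumptions made for illustration.

```{r}
library(igraph)

# made-up directed, weighted edges from users (u*) to posts (p*)
toy_edges_w <- data.frame(
  from   = c("u1", "u2", "u3", "u3"),
  to     = c("p1", "p1", "p1", "p2"),
  weight = c(1, 2, 3, 1)
)
g_toy_w <- graph_from_data_frame(toy_edges_w, directed = TRUE)

# unweighted in-degree counts incoming edges
degree(g_toy_w, mode = "in")["p1"]        # 3

# strength sums the weights of incoming edges
toy_ind <- strength(g_toy_w, mode = "in")
toy_ind["p1"]                             # 1 + 2 + 3 = 6

# the two nodes with the highest weighted in-degree
V(g_toy_w)[order(toy_ind, decreasing = TRUE)[1:2]]
```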


We can look at some network cohesion measures. How dense is the graph? In other words, of all the possible connections between nodes, how many are actually
observed?
```{r}
# density
graph.density(g3)

# (global) clustering coefficient:
# relative frequency with which connected triples close to form triangles
transitivity(g3)

# proportion of edges that are reciprocated:
# number of reciprocated edges/total number of edges
reciprocity(g3, mode="default")

# proportion of connected dyads that are mutual:
# number of mutual dyads/number of dyads with at least one edge
reciprocity(g3, mode="ratio")
```
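These cohesion measures can be checked by hand on a small invented graph. The sketch below uses `edge_density`, the current igraph name for `graph.density`; the edge list is an assumption chosen so that the arithmetic is easy to follow.

```{r}
library(igraph)

# toy directed graph: 4 nodes, 5 edges, with one mutual pair (a <-> b)
toy_edges_d <- data.frame(
  from = c("a", "b", "b", "c", "a"),
  to   = c("b", "a", "c", "a", "d")
)
g_toy_d <- graph_from_data_frame(toy_edges_d, directed = TRUE)

# density: 5 observed edges out of 4 * 3 = 12 possible directed edges
edge_density(g_toy_d)

# default reciprocity: 2 reciprocated edges (a->b and b->a) out of 5 edges
reciprocity(g_toy_d, mode = "default")

# ratio reciprocity: 1 mutual dyad out of 4 connected dyads
reciprocity(g_toy_d, mode = "ratio")
```

The two reciprocity modes answer slightly different questions, so it is worth stating which one is reported.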

## Find important nodes in the network

What are the 3 most important nodes in the Facebook network? There are several ways to do this. For fun we will
use the PageRank algorithm implementation in igraph to calculate this. PageRank was made famous by the
Google co-founders, who invented this method to determine the importance of webpages, revolutionising the
search engine industry. The following code calculates PageRank for nodes in the network, and returns the 3
'top' nodes (which have the highest share of PageRank), providing the ID.
```{r}
pagerank_instagram <- sort(page.rank(g_bimodal_facebook_star_wars)$vector,decreasing=TRUE)
head(pagerank_instagram,n=3)
```
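As a quick sanity check of what PageRank rewards, consider an invented network in which three users all point to a single post. This sketch uses `page_rank`, the current igraph name for `page.rank`; the node names are made up.

```{r}
library(igraph)

# toy directed network: three users all interact with the same post
toy_edges_pr <- data.frame(
  from = c("user1", "user2", "user3"),
  to   = c("post", "post", "post")
)
g_toy_pr <- graph_from_data_frame(toy_edges_pr, directed = TRUE)

# PageRank scores, sorted from most to least important
toy_pagerank <- sort(page_rank(g_toy_pr)$vector, decreasing = TRUE)
head(toy_pagerank, n = 1)
```

The node that receives all the incoming edges ends up with the largest share of PageRank, which is why posts tend to dominate the ranking in a user-to-post network.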

What are the top 10 important terms in our #auspol semantic network? There is no reason why we can't use
the PageRank algorithm to calculate this (as per the Facebook analysis previously):
```{r}
pageRank_auspol_semantic <- sort(page.rank(g_twitter_semantic)$vector,decreasing=TRUE)
head(pageRank_auspol_semantic,n=10)
```

What about the 3 least important terms (with all due respect...):
```{r}
tail(pageRank_auspol_semantic,n=3)
```

Obviously the #auspol hashtag is going to be the most important because it occurs at least once in every
tweet. We can actually avoid this by using the removeTermsOrHashtags argument when we Create() the
network. This argument specifies which terms or hashtags (i.e. nodes with a name that matches one or more
terms) should be removed from the semantic network. This is useful to remove the search term or hashtag
that was used to collect the data (i.e. remove the corresponding node in the graph). For example, a value of
"#auspol" means that the node with the name "#auspol" will be removed. Note: you could also just delete
the #auspol node manually.

Another key aspect of semantic networks is how many terms to include in the network. By default,
SocialMediaLab does not include every unique term that it finds in the tweets, but only the 5 percent most
frequently occurring terms. You can change this when calling the Create() network function, for example by
setting the termFreq argument to 50 (meaning that the 50 percent most frequently occurring terms will be included in the
semantic network).

We can actually try this out now. We will create another semantic network, but we will exclude the #auspol
hashtag, and we will include every single term available in the tweets.
```{r}
g_twitter_semantic_auspol_allTerms <- myTwitterData %>% Create("Semantic", termFreq=100, removeTermsOrHashtags=c("#auspol"))
```

The size of the network will increase a lot, even in the absence of the #auspol term!
```{r}
g_twitter_semantic_auspol_allTerms
```

What are the top 10 important terms in our semantic network now? Once again we will calculate this using PageRank:
```{r}
pageRank_auspol_semantic_replicate <- sort(page.rank(g_twitter_semantic_auspol_allTerms)$vector,decreasing=TRUE)
pageRank_auspol_semantic_replicate[1:10]
```


## Find communities

Is there any kind of community structure within the user network? We will use the infomap algorithm implementation in igraph.
```{r tidy=TRUE, warning=FALSE}
library(igraph)
# increase nb.trials for better quality communities
imc <- infomap.community(g_twitter_actor, nb.trials = 10) 

# create a vector of users with their assigned community number
communityMembership_auspol <- membership(imc)

# summarise the distribution of users to communities 
commDistribution <- summary(as.factor(communityMembership_auspol)) 

# which community has the max number of users 
tail(sort(commDistribution),n=1)
```
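To get a feel for what a community detection result looks like, the sketch below runs infomap on an invented graph made of two tightly knit groups joined by a single edge. It uses `cluster_infomap`, the current igraph name for `infomap.community`. Infomap is a randomised algorithm, so the exact partition is not guaranteed, although two communities of five nodes would be the natural outcome here.

```{r}
library(igraph)

# two fully connected groups of 5 nodes each, joined by a single bridging edge
g_toy_comm <- disjoint_union(make_full_graph(5), make_full_graph(5))
g_toy_comm <- add_edges(g_toy_comm, c(1, 6))

# detect communities; more trials generally give a more stable partition
toy_imc <- cluster_infomap(g_toy_comm, nb.trials = 10)

# community number assigned to each node, and the size of each community
membership(toy_imc)
sizes(toy_imc)
```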

Look into the members of each community:
```{r}
# create a list of communities that includes the users assigned to each community
communities_auspol <- communities(imc)

# look at the members of the most populated community 
communities_auspol[names(tail(sort(commDistribution),n=1))]
```

## Graph Projection

Another useful technique we can do is to perform a projection of the Facebook networks we just created. These networks are bipartite because nodes of the same type cannot share an edge (e.g. a user can only like/comment on a post, but not like/comment another user, and posts cannot perform directed actions either on users or other posts).

What we can do is induce two subgraphs from each network. More specifically, we can induce two actor networks, one for the users and one for the posts.
```{r}
## some data preparation
# coerce to factor
g_bimodal_facebook_star_trek_projection <- g_bimodal_facebook_star_trek
V(g_bimodal_facebook_star_trek_projection)$type <- as.factor(V(g_bimodal_facebook_star_trek_projection)$type)

# coerce all posts (i.e. "1") to logical (i.e. FALSE)
V(g_bimodal_facebook_star_trek_projection)$type[which(V(g_bimodal_facebook_star_trek_projection)$type=="1")] <- as.logical(FALSE)

# coerce all users (i.e. "2") to logical (i.e. TRUE)
V(g_bimodal_facebook_star_trek_projection)$type[which(V(g_bimodal_facebook_star_trek_projection)$type=="2")] <- as.logical(TRUE)

# now project the network
projection_g_bimodal_facebook_star_trek <- bipartite.projection(g_bimodal_facebook_star_trek_projection)
```
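The projection step can be seen more clearly on a tiny invented bipartite graph. This sketch uses `bipartite_projection`, the current igraph name for `bipartite.projection`, and assumes that users are marked with type FALSE and posts with type TRUE (the opposite of the coding used above, so the order of the two projections is swapped).

```{r}
library(igraph)

# made-up user-post interactions
toy_edges_bp <- data.frame(
  user = c("u1", "u2", "u2", "u3"),
  post = c("p1", "p1", "p2", "p2"),
  stringsAsFactors = FALSE
)
g_toy_bp <- graph_from_data_frame(toy_edges_bp, directed = FALSE)

# the 'type' attribute must be logical: FALSE for users, TRUE for posts
V(g_toy_bp)$type <- V(g_toy_bp)$name %in% toy_edges_bp$post

toy_projection <- bipartite_projection(g_toy_bp)

# user-user network: u1-u2 share p1, u2-u3 share p2
toy_projection$proj1
# post-post network: p1-p2 share u2
toy_projection$proj2
```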

Firstly, we will look at the induced graph for the "posts". The induced "posts" actor network consists only of nodes that are of type "post". An edge exists between post i and post j if they are both liked or commented on by the same user (i.e. if they have any user in common). In this dataset, every pair of posts has at least one user in common, which results in the network being "complete".
```{r tidy=TRUE}
projection_g_bimodal_facebook_star_trek[[1]]
# png("facebook_star_trek_posts.png", width=800, height=700) 
plot(projection_g_bimodal_facebook_star_trek[[1]], edge.width=1.5, edge.curved=0.5, edge.arrow.size=0.5) #vertex.shape="none",
# dev.off()
```

Secondly, we will look at the induced graph for the “users”. The induced “users” actor network consists only of nodes that are of type “user”. An edge exists between user i and user j if they both liked or commented on the same post (i.e. they share an interaction with a post j). As you might expect, this creates a network with a massive number of edges! A lot of users co-interact with the same posts. For this example, there are over 4.5 million edges (your results might be somewhat different).
```{r}
# warning - do not use ‘str‘ function because it will
# cause R to freeze up due to overloading the console output!
# Also: you will probably have difficulty plotting this graph in R because it is so big 
projection_g_bimodal_facebook_star_trek[[2]]
```

Maybe there is some community structure to this large network. There are several ways to find out. We will use the infomap algorithm implementation in igraph. Infomap uses an information theoretic, flow-based approach to calculating community structure in networks. It supports weighted and directed graphs, and also scales well.

The results show that there is definitely some interesting community structure to the user actor network (a handful of large communities and a tiny community), although your results might differ, depending on the actual data collected.
```{r tidy=TRUE, warning=FALSE, eval=FALSE}
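# limit the nb.trials argument (number of attempts to partition the network)
# to a small number to save time
# note: despite the 'starwars' variable names, this uses the Star Trek user projection created above
imc_starwars <- infomap.community(projection_g_bimodal_facebook_star_trek[[2]], nb.trials = 3)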

# create a vector of users with their assigned community number
communityMembership_starwars <- membership(imc_starwars) 

# summarise the distribution of users to communities 
commDistribution_starwars <- summary(as.factor(communityMembership_starwars))

# which community has the max number of users 
tail(sort(commDistribution_starwars),n=1)

# create a list of communities that includes the users assigned to each community
communities_starwars <- communities(imc_starwars)

# look at the members of the *least* populated community 
communities_starwars[names(head(sort(commDistribution_starwars),n=1))]
```

# Text Analysis

Next, we will do some descriptive text analysis of the Star Wars fan comments.

## Data pre-processing

We just want to keep the character vector of ‘comments’ data, for our purposes in this session:
```{r}
fbData <- myStarWarsData$commentText
```

We only want elements of fbData that contain comment text (many rows of our Facebook data represent
‘likes’, rather than ‘comments’). So we remove any text data that equals “Not_applicable” (this is how
SocialMediaLab designates rows in the dataframe that are ‘likes’). Note: in earlier versions of SocialMediaLab
these elements were designated as NA, however this caused unintended consequences so it was changed.
```{r}
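toRemove <- which(fbData=="Not_applicable")
# remove the elements we want to exclude; the length check matters because
# fbData[-integer(0)] would return an empty vector rather than the full vector
if (length(toRemove) > 0) fbData <- fbData[-toRemove]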
```

How many comments do we have left now?
```{r}
length(fbData)
```

We convert the character encoding to UTF-8. This avoids errors relating to ‘odd’ characters in the text. This
is usually a good idea, but there may be situations when it is not useful, or even detrimental. Note: Mac users
may encounter errors/bugs relating to character encoding, and a workaround is to convert to ‘utf-8-mac’:
```{r}
fbData <- iconv(fbData, to = "utf-8")
# **MAC USERS ONLY** should uncomment and use this instead:
# fbData <- iconv(fbData, to="utf-8-mac")
```

We convert our character vector fbData to a VCorpus object:
```{r warning=FALSE}
library(tm)
fbCorpus <- VCorpus(VectorSource(fbData))
```

Individual comments (a.k.a. ‘documents’) can be accessed via the double brackets notation or the ‘dollar
sign’ notation for accessing list elements. Let’s look at comment #4.
```{r}
fbCorpus[[4]][[1]]
# another way to access it
fbCorpus[[4]]$content
```

We can perform a number of highly useful transformations of text using the tm_map function (i.e. ‘mapping to
the corpus’). Not all of these transformations are useful in every scenario! They should be used only when they
make sense for the analysis at hand.

Converting all the text to lowercase:
```{r}
fbCorpus <- tm_map(fbCorpus, content_transformer(tolower))
```

Remove numbers from the text:
```{r}
fbCorpus <- tm_map(fbCorpus, removeNumbers)
```

Remove punctuation from the text:
```{r}
fbCorpus <- tm_map(fbCorpus, removePunctuation)
```

Perform ‘word stemming’ on the text. Note: this transformation can be highly useful, but also highly
detrimental!
```{r}
# use lazy=TRUE argument to avoid warning on some machines with multiple CPU cores
fbCorpus <- tm_map(fbCorpus, stemDocument,lazy=TRUE)
```

We can also remove English ‘stop words’ from the text. These are common words (e.g. ‘the’, ‘and’, ‘or’) that
we may want to exclude from our analysis. Once again, this is highly useful but also needs to be carefully
applied.
```{r}
fbCorpus <- tm_map(fbCorpus, removeWords, stopwords("english"),lazy=TRUE) 
# use lazy=TRUE argument to avoid warning on some machines with multiple CPU cores
```

Eliminate unnecessary ‘white space’ from the text. For example, `"hello    everyone   my name   is fred"` becomes
`"hello everyone my name is fred"`:
```{r}
fbCorpus <- tm_map(fbCorpus, stripWhitespace, lazy=TRUE)
```

We can observe the difference now by examining comment #4 again:
```{r}
fbCorpus[[4]]$content
```
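The combined effect of these transformations is easiest to see on a single invented sentence. The sketch below applies the same tm functions as above (except stemming) to a one-document toy corpus; the sentence is made up for illustration.

```{r}
library(tm)

# a one-document toy corpus
toy_corpus <- VCorpus(VectorSource("Hello   everyone, my name is Fred and I am 42!"))

toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))          # lowercase
toy_corpus <- tm_map(toy_corpus, removeNumbers)                         # drop "42"
toy_corpus <- tm_map(toy_corpus, removePunctuation)                     # drop "," and "!"
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("english"))     # drop "my", "is", ...
toy_corpus <- tm_map(toy_corpus, stripWhitespace)                       # collapse spaces

toy_clean <- toy_corpus[[1]]$content
toy_clean
```

Only the content words ("hello", "everyone", "name", "fred") should survive, which shows how much information these steps can discard.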

We could also define our own stop words and transform the text using these:
```{r}
myStopwords <- c("jar","binks")
fbCorpus <- tm_map(fbCorpus, removeWords, myStopwords)
```



## Frequency analysis

Next we create a document-term matrix (DTM) from the fbCorpus object. DTMs are a very important
concept for text analysis and are highly useful. DTMs can be thought about as a table (i.e. matrix) where
the rows are ‘documents’ (i.e. Facebook comments in our dataset), and the columns are ‘terms’ (i.e. each
unique word found across all the documents in the dataset). The ‘cells’ (i.e. elements) of the matrix indicate
how many times term n occurred in document m.

Note: we use the control argument to specify that we only want to retain words that are minimum character
length of 3, up to a maximum of 20 characters.
```{r}
dtm <- DocumentTermMatrix(fbCorpus,control = list(wordLengths=c(3, 20)))
dtm
```
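To see the structure of a document-term matrix directly, here is a minimal sketch built from two invented "comments". It uses the same wordLengths control as above, so the two-letter word "is" is dropped.

```{r}
library(tm)

# two made-up documents
toy_docs <- c("the force is strong", "strong force strong")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)),
                              control = list(wordLengths = c(3, 20)))

# rows are documents, columns are terms, cells are counts
toy_dtm_matrix <- as.matrix(toy_dtm)
toy_dtm_matrix

# "strong" occurs twice in the second document
toy_dtm_matrix[2, "strong"]
```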

What we have is a sparse matrix, i.e. most of the elements of the matrix are 0, because in our dataset most
Facebook comments contain only a small percentage of the ‘vocabulary’ of terms observed across the entire set of
comments. What we want to do is remove terms that occur very infrequently, which will leave us with the
most ‘important’ terms. We remove sparse terms using the removeSparseTerms function, which removes
terms whose sparsity (the proportion of documents in which the term does not occur) is at or above a given threshold.

For example, if we set it to 0.995, then all terms that are at least 99.5% sparse are removed. The following
command lets us ‘test out’ what our document-term matrix would look like if we set the threshold to 0.995:
```{r}
removeSparseTerms(dtm, 0.995)
```

0.995 will do the trick for us in this workshop, so we will create a new document-term matrix with this
threshold applied to it:
```{r}
dtmSparseRemoved <- removeSparseTerms(dtm, 0.995)
```
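The meaning of the sparsity threshold can be checked on a small invented corpus. In this sketch, a term's sparsity is computed by hand as the share of documents in which it does not appear; the documents and the threshold of 0.7 are assumptions chosen for illustration.

```{r}
library(tm)

# four made-up documents
toy_docs_sparse <- c("force jedi", "force jedi", "force sith", "force wookiee")
toy_dtm_sparse <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs_sparse)))

# sparsity per term: proportion of documents in which the term is absent
toy_sparsity <- colMeans(as.matrix(toy_dtm_sparse) == 0)
toy_sparsity

# terms with sparsity above the threshold ("sith" and "wookiee" at 0.75) are removed
toy_dtm_kept <- removeSparseTerms(toy_dtm_sparse, 0.7)
Terms(toy_dtm_kept)
```

With a threshold as high as 0.995, only terms that are absent from nearly every document are dropped.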

We can examine term frequencies in our data. We create a numeric vector of the sums of columns of our
document-term matrix (after coercing it to a matrix object), meaning that we have a named numeric
vector where the names are the unique terms in our document-term matrix, and the values of the elements
are the number of times that particular word occurs across all of our corpus.
```{r}
freqTerms <- colSums(as.matrix(dtmSparseRemoved))
freqTerms
```

We order the term frequencies and look at the 5 most frequent terms and then 5 least frequent terms:
```{r}
orderTerms <- order(freqTerms,decreasing=TRUE)
freqTerms[head(orderTerms, n=5)]
freqTerms[tail(orderTerms, n=5)]
```

Which terms occurred at least 20 times?
```{r}
findFreqTerms(dtmSparseRemoved, 20)
```

We can do a basic correlation analysis by looking at the correlations between terms with the findAssocs
function. If two words always appear together then corr = 1; terms that never appear together are not positively
correlated and so are not reported. Let’s look at which terms co-occur with the term “good”, with a lower correlation limit of 0.5.
```{r}
findAssocs(dtmSparseRemoved, "good", corlimit=0.5)
```
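The correlations involved can be reproduced by hand with base R. The sketch below uses an invented term-count matrix (four documents, three terms) and assumes that the association measure is the ordinary Pearson correlation between term columns.

```{r}
# made-up counts: rows are documents, columns are terms
toy_counts <- cbind(
  good  = c(1, 0, 1, 0),
  movie = c(1, 0, 1, 0),
  bad   = c(0, 1, 0, 0)
)

# "good" and "movie" always appear together
toy_cor_together <- cor(toy_counts[, "good"], toy_counts[, "movie"])
toy_cor_together   # 1

# "good" and "bad" never appear together, so the correlation is negative
toy_cor_apart <- cor(toy_counts[, "good"], toy_counts[, "bad"])
toy_cor_apart
```

Only associations at or above the chosen corlimit are returned by findAssocs, so negatively correlated pairs like the second one do not show up in its output.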

Next, we can do some text visualisation. First, we can plot our descriptive statistics in various ways. For
example, using a barchart to visualise the 20 most frequent terms (we will use the lattice package for a
nice bar chart):
```{r}
require(lattice)
# png("barchart_frequent_terms.png", width=800, height=700)
barchart(freqTerms[orderTerms[1:20]])
# dev.off()
```



## Word Cloud

Next, we will construct a comparison word cloud of the Star Wars and Star Trek fan page comments.
```{r}
# create a character vector of the Star Wars comments
# (i.e., take a subset of elements from the commentText column of the dataframe)
starWarsComments <- myStarWarsData$commentText[which(myStarWarsData$commentText!="Not_applicable")]
starWarsComments <- paste(starWarsComments , collapse = " ")

# do the same, but for Star Trek
starTrekComments <- myStarTrekData$commentText[which(myStarTrekData$commentText!="Not_applicable")]
starTrekComments <- paste(starTrekComments , collapse = " ")

# combine them together into a dataframe
df_ALL <- data.frame(group=c("Star_Wars","Star_Trek"),words=c(starWarsComments,starTrekComments))
```

Data pre-processing:
```{r}
# search for any texts that have no characters (i.e. are ’empty’)
# and then remove these elements from the vector
toRemove <- which(df_ALL$words=="")

# are there any ’empty’ text elements?
# (i.e. length of toRemove is not equal to zero)
# if true, then we remove the corresponding rows from the dataframe
if (isTRUE(length(toRemove)!=0)) {
df_ALL <- df_ALL[-toRemove,]
}
# we create a character vector from the "words" column of df_ALL
# this will be our independent variable.
# we do not want text as factors, so we will coerce it to character
words <- as.character(df_ALL$words)
# we will convert the character encoding to UTF-8
# just to be sure there are no odd characters that
# may cause problems later on
words <- iconv(words, to = "UTF-8")
# ** MAC USERS ONLY ** should uncomment and use this instead:
# words <- iconv(words, to = "UTF-8-mac")
# using ’tm’ package we convert character vector to a VCorpus object (volatile corpus)
corp <- VCorpus(VectorSource(words))
## now we do transformations of text using tm_map (’mapping to the corpus’)
# eliminate extra whitespace
corp <- tm_map(corp, stripWhitespace)
# convert to all lowercase
corp <- tm_map(corp, content_transformer(tolower))
# perform stemming (not always useful!)
#corp <- tm_map(corp, stemDocument)
# remove numbers (not always useful!)
corp <- tm_map(corp, removeNumbers)
# remove punctuation (not always useful! e.g. text emoticons)
corp <- tm_map(corp, removePunctuation)
# remove stop words (not always useful!)
corp <- tm_map(corp, removeWords, stopwords("english"))
# create a term-document matrix (terms as rows, documents as columns)
# and coerce it to a plain matrix so that we can set the column names
tdm <- TermDocumentMatrix(corp)
tdm <- as.matrix(tdm)
#print(tdm)
colnames(tdm) <- c("Star_Wars","Star_Trek")
colorsx=c("blue","red")
```

Word Cloud visualization:
```{r}
require(wordcloud)
#note: if changing res of png, can’t have dimensions in pixels (led to wordclouds with very few words...)
# png("facebook_starwars_startrek_comparison_cloud.png", width=12, height=8, units="in", res=300)

#comparison.cloud(tdm,max.words=300,random.order=FALSE)
comparison.cloud(tdm,max.words=200,random.order=FALSE,colors=colorsx)
#commonality.cloud(tdm,random.order=FALSE)

# dev.off()
```




# Social Network Visualization

## Small-scale visualization

### YouTube actor network (user-user)
We can visualise a network by plotting it directly in R:
```{r}
# png("g_actor_youtube.png", width=800, height=700)
plot(g_actor_youtube, vertex.shape="none", edge.width=1.5, edge.curved=.5, edge.arrow.size=0.5, asp=9/16, main="Users as actor network")
# dev.off()
```

### Facebook bimodal network (user-post)

Before plotting the graph, set the node color so that Posts are red and Users are blue.
```{r}
V(g_bimodal_facebook_star_trek)$color <- ifelse(V(g_bimodal_facebook_star_trek)$type == "Post", "red", "blue")
```
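
Because `ifelse()` silently falls back to "blue" for anything that is not exactly `"Post"`, a quick count of the resulting colors is a cheap sanity check. The sketch below uses a toy user-post graph; the vertex names and the way `type` is assigned here are assumptions for illustration, not how SocialMediaLab builds the attribute.

```{r eval=FALSE}
library(igraph)

# Toy bimodal (user-post) edge list; names are invented
edges <- data.frame(
  from = c("user_a", "user_b", "user_b", "user_c"),
  to   = c("post_1", "post_1", "post_2", "post_2")
)
toy_g <- graph_from_data_frame(edges, directed = TRUE)

# Assumed convention: a 'type' vertex attribute holding "User" or "Post"
V(toy_g)$type <- ifelse(grepl("^post_", V(toy_g)$name), "Post", "User")

# Same coloring rule as in the chunk above
V(toy_g)$color <- ifelse(V(toy_g)$type == "Post", "red", "blue")

# Count vertices per color to confirm the mapping did what we expect
color_counts <- table(V(toy_g)$color)
print(color_counts)
```

If every vertex comes out blue, the values stored in `type` probably differ from `"Post"` (for example in capitalisation).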

We can see the network with the following:
```{r}
plot(g_bimodal_facebook_star_trek)
```

In RStudio, the plot pane is often too small. One improvement is to open a separate X11 graphics
device (only on machines with access to an X server):
```{r eval=FALSE}
x11()
plot(g_bimodal_facebook_star_trek)
```

The following set of commands prints the plot (with some plot options to improve the visualisation) to file:
```{r eval=FALSE, tidy=TRUE}
# png("g_bimodal_facebook_star_trek.png", width=800, height=700)
plot(g_bimodal_facebook_star_trek,vertex.shape="none",edge.width=1.5,edge.curved = .5,edge.arrow.size=0.5,vertex.label.color=V(g_bimodal_facebook_star_trek)$color,asp=9/16,margin=-0.15)
# dev.off()
```
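
With the `png()` and `dev.off()` lines uncommented, the pattern looks roughly like the sketch below. It uses a placeholder ring graph and a temporary file so that it can be run on its own; the file name is hypothetical.

```{r eval=FALSE}
library(igraph)

# Placeholder graph standing in for the real network
toy_g <- make_ring(10, directed = TRUE)

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "toy_network.png")

png(out_file, width = 800, height = 700)  # open the file device
plot(toy_g, vertex.shape = "none", edge.width = 1.5, edge.curved = 0.5,
     edge.arrow.size = 0.5, asp = 9/16)
dev.off()                                 # close the device so the file is written

file.exists(out_file)
```

The file is only complete once `dev.off()` has been called, so forgetting that line is a common cause of empty or missing images.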

### Twitter actor network (user-user)

Next, we can visualise the network by plotting it directly in R:
```{r tidy=TRUE}
# png("g_twitter_actor.png", width=800, height=700)
plot(g_twitter_actor,vertex.shape="none",edge.width=1.5,edge.curved = .5,edge.arrow.size=0.5,asp=9/16,margin=-0.15)
# dev.off()
```



## Large-scale visualization

### Installing Gephi

You can export the social network from R/SocialMediaLab and import it into Gephi,
a dedicated network visualisation program. To do this, you need to download and
install [Gephi](https://gephi.org/users/download/).

To do network analysis (e.g. export network to file) we also need to download and install the igraph package created by Gabor Csardi.
We can do this by entering the following command into the R console:

```{r eval=FALSE}
install.packages("igraph") # Install package
```

Or load the package if it is already installed:
```{r}
require(igraph) #Load package
```


### Using Gephi

Because of the size of the two networks we have created, we won't try to visualise them in R. To visualise the "Star Trek" network using Gephi, first export the network using the 'graphml' network file format:
```{r}
write.graph(g_bimodal_facebook_star_trek, "g_bimodal_facebook_star_trek.graphml", format="graphml")
```
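
To confirm that an export is readable before opening Gephi, one option is to read the file back into R and compare it with the original. The sketch below does this for a toy graph written to a temporary file. It uses `write_graph()` and `read_graph()`, which are the current igraph names for `write.graph()` and `read.graph()`; the graph, attribute values, and file name are invented for illustration.

```{r eval=FALSE}
library(igraph)

# Toy graph with a couple of vertex attributes
toy_g <- make_ring(5)
V(toy_g)$name <- paste0("user_", 1:5)
V(toy_g)$type <- c("User", "User", "Post", "Post", "User")

# Hypothetical output file in the session's temporary directory
graphml_file <- file.path(tempdir(), "toy_network.graphml")
write_graph(toy_g, graphml_file, format = "graphml")

# Read the file back and compare basic properties
g_back <- read_graph(graphml_file, format = "graphml")
vcount(g_back) == vcount(toy_g)
ecount(g_back) == ecount(toy_g)
vertex_attr_names(g_back)
```

Vertex attributes such as `type` are stored in the graphml file, which is what lets Gephi use them later for sizing and coloring.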

Gephi is well suited to visualising networks created in SocialMediaLab. In this section we will import
into Gephi the Facebook "Star Trek" network we exported above, in order to visualise it.

First, open Gephi. You should be presented with a dialogue box. Click “Open Graph File...” (or select from
menu ‘File’–>‘Open’) and navigate to the working directory of your R project for this tutorial. The working
directory should contain the graphml file that was exported above. Select this file and click
open.

You will be presented with an ‘Import report’ dialogue box providing information about the network. Simply
click OK. Your network will then be displayed in an unappealing default layout, possibly looking like a grey
hairball. Let’s fix that.

![](g_bimodal_facebook_star_trek.gelphi.png)

In the top-left hand box under the ‘Appearance’ tab, click the little icon that says ‘Size’ when you hover over
it. Select “Indegree” from the dropdown menu. Set the min size to 5 and the max size to 50, and click Apply.
Now the nodes in your network are sized based on how many inlinks they receive from other nodes.
It's still an ugly network though!

![](g_bimodal_facebook_star_trek.gelphi2.png)
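
For comparison, a similar indegree-based sizing can be sketched in R with igraph. This is only an approximation of the Gephi step: it assumes a simple linear rescale of indegree to the range 5 to 50, which may not match Gephi's own scaling exactly. The toy edge list and the helper `rescale()` are invented for illustration.

```{r eval=FALSE}
library(igraph)

# Toy directed user-post network; names are invented
edges <- data.frame(
  from = c("user_a", "user_b", "user_c", "user_a"),
  to   = c("post_1", "post_1", "post_1", "post_2")
)
toy_g <- graph_from_data_frame(edges, directed = TRUE)

# Number of incoming edges per vertex
indeg <- degree(toy_g, mode = "in")

# Hypothetical helper: linearly rescale values to the range [lo, hi]
rescale <- function(x, lo = 5, hi = 50) {
  if (max(x) == min(x)) return(rep((lo + hi) / 2, length(x)))
  lo + (x - min(x)) / (max(x) - min(x)) * (hi - lo)
}

sizes <- rescale(indeg)
V(toy_g)$size <- sizes
print(sizes)
```

Calling `plot(toy_g)` afterwards picks up the `size` vertex attribute, so vertices that receive more inlinks are drawn larger.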

