Social Network Analysis
Basic counting
Next we will do some more descriptive analysis.
How many nodes are in the network?
vcount(g_bimodal_twitter)
[1] 100
How many edges are in the network?
ecount(g_bimodal_twitter)
[1] 160
Get a list of the nodes in the network:
V(g_bimodal_twitter)
+ 100/100 vertices, named, from 9009817:
[1] rainey_knight Warstub Charlie5009 ballgameskeith msrose2343 trollsforpeace atheist4maga ochreblue
[9] JoRobinson_Aus PreAmpPlus croswell_g bprophetable plabg GoldSuzie asciigoat Kaneosaurus
[17] BobOfAlex Firemonkey991 NaomiCra TotalREProperty dannytweeets 4U_WTF detispify allisonBeDemure
[25] Bev_n_W Medicayy indica2007 KirstyWho RrRjrobinson9 Redbank80Graeme stopcoalexports InvestInScience
[33] Left_of_Labor AliKayaBrisbane AngelaKorras PilligaPush FeeFeeCee amagickeagle999 MrNixonsWife MalurusSally
[41] SocialistMason Salig08 idiosoCamel GoldCoastNurse Broadband_ madnyc Elaine_de_Saxe johngwass
[49] cleverclicks MsJmaid MiztaRabbit mishyloan sierra4oz TaodeHaas Ducatio_ #auspol
[57] #nrlgf #ausunions #QandA #AusPol #Islam #Paris #GrandMufti #Auspol
[65] #Adani #StopAdani #MarriageEquality #ReachTEL #Manus #Nauru #4Corners #CSG
[73] #qldpol #India #Queensland #coal #adani #VoteYes #climatechange #Trump
+ ... omitted several vertices
List of edges in the network:
E(g_bimodal_twitter)
+ 160/160 edges from 9009817 (vertex names):
[1] rainey_knight ->#auspol rainey_knight ->#nrlgf Warstub ->#auspol Warstub ->#ausunions
[5] Charlie5009 ->#auspol ballgameskeith->#QandA ballgameskeith->#ausunions ballgameskeith->#auspol
[9] msrose2343 ->#auspol trollsforpeace->#AusPol atheist4maga ->#Islam atheist4maga ->#Paris
[13] atheist4maga ->#GrandMufti atheist4maga ->#Auspol ochreblue ->#Adani ochreblue ->#StopAdani
[17] ochreblue ->#auspol JoRobinson_Aus->#auspol JoRobinson_Aus->#MarriageEquality PreAmpPlus ->#ReachTEL
[21] PreAmpPlus ->#auspol croswell_g ->#Islam croswell_g ->#Paris croswell_g ->#GrandMufti
[25] croswell_g ->#Auspol ochreblue ->#StopAdani ochreblue ->#auspol bprophetable ->#Manus
[29] bprophetable ->#Nauru bprophetable ->#auspol plabg ->#auspol ochreblue ->#StopAdani
[33] ochreblue ->#auspol ochreblue ->#4Corners ochreblue ->#StopAdani ochreblue ->#auspol
[37] GoldSuzie ->#auspol GoldSuzie ->#CSG ochreblue ->#4Corners ochreblue ->#StopAdani
+ ... omitted several edges
Access a particular node in the network (node #42):
V(g_bimodal_twitter)[42]
+ 1/100 vertex, named, from 9009817:
[1] Salig08
Access a particular edge (edge #1):
E(g_bimodal_twitter)[1]
+ 1/160 edge from 9009817 (vertex names):
[1] rainey_knight->#auspol
Graph connectivity
Look at the connectivity of the graph:
# who are the neighbours of node #42?
neighbors(g_bimodal_facebook_star_wars,42)
+ 1/1208 vertex, named, from 8e21e39:
[1] 169299103121699_909270052457930
# is the graph weakly connected (i.e. connected when edge direction is ignored)?
is.connected(g_bimodal_facebook_star_wars, mode="weak")
[1] TRUE
#information on connected components
cc <- clusters(g_bimodal_facebook_star_wars)
#which component node is assigned to
# cc$membership
#size of each component
cc$csize
[1] 1208
#number of components
cc$no
[1] 1
#subnetwork - giant component
g3 <- induced_subgraph(g_bimodal_facebook_star_wars, which(cc$membership == which.max(cc$csize)))
We will now look at node centrality:
#node indegree
degree(g3, mode="in")
#node outdegree
degree(g3, mode="out")
#node indegree, using edge weights
ind <- strength(g3, mode="in")
#top-3 nodes, based on (weighted) indegree
V(g3)[order(ind, decreasing=T)[1:3]]
#closeness centrality
closeness(g3)
#betweenness centrality
betweenness(g3)
#eigenvector centrality
evcent(g3)$vector
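As a rough illustration of what degree() and strength() report, the following base R sketch recomputes in-degree, out-degree, and weighted in-degree from a small adjacency matrix. The matrix A and its node names are made up for illustration and are not taken from the Facebook data.

```r
# Toy weighted adjacency matrix (hypothetical data): A[i, j] is the weight
# of the directed edge from node i to node j, and 0 means "no edge".
nodes <- c("user_a", "user_b", "user_c", "post_1", "post_2")
A <- matrix(0, nrow = 5, ncol = 5, dimnames = list(nodes, nodes))
A["user_a", "post_1"] <- 3   # user_a commented 3 times on post_1
A["user_b", "post_1"] <- 1
A["user_c", "post_1"] <- 1
A["user_c", "post_2"] <- 2

in_degree   <- colSums(A > 0)   # number of incoming edges per node
out_degree  <- rowSums(A > 0)   # number of outgoing edges per node
in_strength <- colSums(A)       # incoming edges, summed by weight

# top-2 nodes by weighted in-degree
top2 <- names(in_strength)[order(in_strength, decreasing = TRUE)[1:2]]
print(top2)
```

The unweighted and weighted versions can rank nodes differently, which is why the code above this sketch uses strength() rather than degree() before picking the top nodes.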
We can look at some network cohesion measures. How dense is the graph? In other words, of all the possible connections between nodes, how many are actually observed?
# density
graph.density(g3)
[1] 0.0008367306
# (global) clustering coefficient
# relative frequency with which connected triples close to form triangles
transitivity(g3)
[1] 0
# total number of reciprocated edges / total number of edges
reciprocity(g3, mode="default")
[1] 0
# number of mutual dyads / number of dyads with at least one edge
# (i.e. mutual dyads plus dyads with a single, unreciprocated edge)
reciprocity(g3, mode="ratio")
[1] 0
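To make these cohesion measures concrete, here is a minimal base R sketch that computes density and two reciprocity ratios by hand on a toy directed graph. The 4-node adjacency matrix is hypothetical. The two reciprocity formulas follow the igraph documentation for mode="default" and mode="ratio" as we read it, so treat the mapping to igraph as an assumption.

```r
# Toy directed graph (hypothetical): edges 1->2, 2->1, 2->3, 3->4
A <- matrix(0, nrow = 4, ncol = 4)
A[1, 2] <- 1; A[2, 1] <- 1; A[2, 3] <- 1; A[3, 4] <- 1

n <- nrow(A)   # number of nodes
m <- sum(A)    # number of directed edges

# density: observed edges / possible directed edges (no self-loops)
density <- m / (n * (n - 1))

# edges whose opposite counterpart also exists (counts both directions)
reciprocated_edges <- sum(A * t(A))
recip_default <- reciprocated_edges / m

# dyads: unordered node pairs, classified as mutual or asymmetric
mutual_dyads <- reciprocated_edges / 2
asym_dyads   <- sum(A != t(A)) / 2
recip_ratio  <- mutual_dyads / (mutual_dyads + asym_dyads)

print(c(density = density, default = recip_default, ratio = recip_ratio))
```

In a directed graph with 1208 nodes there are 1208 × 1207 possible edges, so even a graph with over a thousand edges has a density well below 0.001.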
Find important nodes in the network
Which are the 3 most important nodes in the Facebook network? There are several ways to find out. For fun we will use the PageRank implementation in igraph. PageRank was made famous by the Google co-founders, who devised it to determine the importance of webpages, revolutionising the search engine industry. The following code calculates PageRank for every node in the network and returns the 3 ‘top’ nodes (those with the highest share of PageRank), along with their IDs.
pagerank_instagram <- sort(page.rank(g_bimodal_facebook_star_wars)$vector,decreasing=TRUE)
head(pagerank_instagram,n=3)
169299103121699_909270052457930 AL Rbg Abraham Gonzales
0.4597014257 0.0004476376 0.0004476376
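For intuition about why a single post can absorb such a large share of PageRank, the sketch below implements a simplified power iteration in base R. It is not igraph's implementation: the function name pagerank_sketch, the toy graph, and the uniform handling of nodes without outgoing edges are assumptions made for illustration. The damping factor of 0.85 is the commonly used value.

```r
# Simplified PageRank by power iteration (illustrative sketch only).
# A[i, j] = 1 if there is a directed edge from node i to node j.
pagerank_sketch <- function(A, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(A)
  out_deg <- rowSums(A)
  # row-normalise so that each row is a probability distribution
  P <- A / ifelse(out_deg == 0, 1, out_deg)
  # nodes with no outgoing edges jump to any node uniformly (assumption)
  P[out_deg == 0, ] <- 1 / n
  r <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    r_new <- as.vector((1 - damping) / n + damping * crossprod(P, r))
    converged <- sum(abs(r_new - r)) < tol
    r <- r_new
    if (converged) break
  }
  names(r) <- rownames(A)
  r
}

# Toy star-shaped network (hypothetical): three users all point to one post
nodes <- c("user_1", "user_2", "user_3", "post")
A <- matrix(0, nrow = 4, ncol = 4, dimnames = list(nodes, nodes))
A[c("user_1", "user_2", "user_3"), "post"] <- 1

pr <- pagerank_sketch(A)
print(sort(pr, decreasing = TRUE))
```

Because every user points to the same post, the post collects rank from all of them while the users share the remainder equally, which mirrors the pattern in the output above.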
What are the 10 most important terms in our #auspol semantic network? There is no reason why we can’t use the PageRank algorithm to calculate this (as per the Facebook analysis previously):
pageRank_auspol_semantic <- sort(page.rank(g_twitter_semantic)$vector,decreasing=TRUE)
head(pageRank_auspol_semantic,n=10)
#auspol auspol #csg adani #adani qldpol #stopadani #4corners stopadani #qldpol
0.18150152 0.15599852 0.08531406 0.03674505 0.02625074 0.02496031 0.02491640 0.02386476 0.02338559 0.02305313
What about the 3 least important terms (with all due respect…):
tail(pageRank_auspol_semantic,n=3)
#lockthegate #northkorea #reachtel
0.00360652 0.00360652 0.00360652
Obviously the #auspol hashtag is going to be the most important because it occurs at least once in every tweet. We can actually avoid this by using the removeTermsOrHashtags argument when we Create() the network. This argument specifies which terms or hashtags (i.e. nodes with a name that matches one or more terms) should be removed from the semantic network. This is useful to remove the search term or hashtag that was used to collect the data (i.e. remove the corresponding node in the graph). For example, a value of “#auspol” means that the node with the name “#auspol” will be removed. Note: you could also just delete the #auspol node manually.
Another key aspect of semantic networks is how many terms to include in the network. By default, SocialMediaLab does not include every unique term that it finds in the tweets, but only the 5 percent most frequently occurring terms. You can change this when calling the Create() function, for example by specifying termFreq=50 (meaning that the 50 percent most frequently occurring terms will be included in the semantic network).
We can actually try this out now. We will create another semantic network, but we will exclude the #auspol hashtag, and we will include every single term available in the tweets.
g_twitter_semantic_auspol_allTerms <- myTwitterData %>% Create("Semantic", termFreq=100, removeTermsOrHashtags=c("#auspol"))
[1] "Generating Twitter semantic network..."
Done.
The size of the network will increase a lot, even in the absence of the #auspol term!
g_twitter_semantic_auspol_allTerms
IGRAPH db51550 UNW- 385 663 --
+ attr: name (v/c), label (v/c), weight (e/n)
+ edges from db51550 (vertex names):
[1] auspol --#nrlgf turnbullmalcolm--#nrlgf great --#nrlgf nrlgf --#nrlgf turnbull --#nrlgf
[6] johnwren --#nrlgf judgement --#nrlgf keating --#nrlgf said --#nrlgf men --#nrlgf
[11] auspol --#ausunions ausunions --#ausunions cut --#ausunions pay --#ausunions boycott --#ausunions
[16] cream --#ausunions ice --#ausunions streets --#ausunions unionsaustralia--#ausunions workers --#ausunions
[21] australia --#ausunions auspol --#qanda ausunions --#qanda #qanda --qanda #qanda --taken
[26] #qanda --clubegaffer #qanda --hasnt #qanda --increase #qanda --job #qanda --profiteer
[31] #qanda --robot #qanda --wealth #qanda --take #ausunions --qanda #ausunions --taken
[36] #ausunions --clubegaffer #ausunions --hasnt #ausunions --increase #ausunions --job #ausunions --profiteer
+ ... omitted several edges
What are the top 10 important terms in our semantic network now? Once again we will calculate this using PageRank:
pageRank_auspol_semantic_replicate <- sort(page.rank(g_twitter_semantic_auspol_allTerms)$vector,decreasing=TRUE)
pageRank_auspol_semantic_replicate[1:10]
#4corners #adani auspol #csg #qldpol #stopadani #voteyes #qanda #ausunions #nrlgf
0.03945061 0.03310908 0.03280464 0.03179612 0.03114190 0.03057364 0.02554519 0.02432106 0.02142076 0.01754020
Find communities
Is there any kind of community structure within the user network? We will use the infomap algorithm implementation in igraph.
library(igraph)
# increase nb.trials for better quality communities
imc <- infomap.community(g_twitter_actor, nb.trials = 10)
Modularity is implemented for undirected graphs only.
# create a vector of users with their assigned community number
communityMembership_auspol <- membership(imc)
# summarise the distribution of users to communities
commDistribution <- summary(as.factor(communityMembership_auspol))
# which community has the max number of users
tail(sort(commDistribution), n = 1)
1
81
Look into the members of each community:
# create a list of communities that includes the users assigned to each community
communities_auspol <- communities(imc)
# look at the members of the most populated community
communities_auspol[names(tail(sort(commDistribution),n=1))]
$`1`
[1] "Warstub" "ochreblue" "PreAmpPlus" "plabg" "BobOfAlex" "Firemonkey991" "NaomiCra" "FeeFeeCee"
[9] "SocialistMason" "Elaine_de_Saxe" "mishyloan" "Ducatio_" "jurylady5" "MSMWatchdog2013" "Isayneversaydie" "pleaseuseaussie"
[17] "HittingAlice" "MischelleCamill" "1JoyDuck" "Ruxyrob" "rustenburg_J" "defendressofsan" "MdmAbsentMinded" "Left_of_Labor"
[25] "Angelioannou" "AnnalieseRoss" "OzMacca46" "pbro2333_brown" "r7yrb7" "blanketcrap" "mormorlady" "VeriteGrace"
[33] "AmbientUXr" "JayjaysMd" "tilleyfab" "alicia94985048" "oldjoeschmo" "energy_bu" "vonoviedo" "drewsmilitia"
[41] "HOSKINMANDY22" "MilesChamp" "mavisgrizzltits" "deniseshrivell" "leafyflower1" "JustDoingJunk" "Eschertology" "wsj2150"
[49] "KATHS97" "ceciliemurray" "bigislandwa" "IndiBlu" "KCGMSuperpit" "sacarlin48" "cameron_gobbo1" "ajcdjp"
[57] "fredanurks" "Ned_Kelly" "unionsaustralia" "GhostWhoVotes" "lovethatloaf" "hearyanow" "CFMEUJohnSetka" "WendyFarmer_"
[65] "MGliksmanMDPhD" "LesStonehouse" NA "AustralianLabor" "jackietrad" "OzSheela" "SaveOurSpit" "GetUp"
[73] "43a6f0ce5dac4ea" "gautam_adani" "AdaniOnline" "JulieBishopMP" "4corners" "JoshFrydenberg" "FergusonNews" "neighbour_s"
[81] "StephenLongAus"
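The bookkeeping above (counting community sizes and listing the members of one community) can be mimicked in base R with table() and split(). The membership vector below is hypothetical. In the tutorial it would come from membership(imc).

```r
# Hypothetical membership vector: names are users, values are community ids
membership_toy <- c(ann = 1, bob = 1, cat = 2, dan = 1, eve = 3, fay = 2)

# size of each community
comm_sizes <- table(membership_toy)

# id of the most populated community (as a character label)
largest <- names(comm_sizes)[which.max(comm_sizes)]

# list of communities, each holding the names of its members
members_by_comm <- split(names(membership_toy), membership_toy)

print(comm_sizes)
print(members_by_comm[[largest]])
```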
Graph Projection
Another useful technique is to project the Facebook networks we just created. These networks are bipartite because nodes of the same type cannot share an edge (e.g. a user can only like/comment on a post, but not like/comment on another user, and posts cannot perform directed actions on either users or other posts).
What we can do is derive two projected graphs from each network. More specifically, we can derive two actor networks, one for the users and one for the posts.
## some data preparation
# coerce to factor
g_bimodal_facebook_star_trek_projection <- g_bimodal_facebook_star_trek
V(g_bimodal_facebook_star_trek_projection)$type <- as.factor(V(g_bimodal_facebook_star_trek_projection)$type)
# coerce all posts (i.e. "1") to logical (i.e. FALSE)
V(g_bimodal_facebook_star_trek_projection)$type[which(V(g_bimodal_facebook_star_trek_projection)$type=="1")] <- as.logical(FALSE)
# coerce all users (i.e. "2") to logical (i.e. TRUE)
V(g_bimodal_facebook_star_trek_projection)$type[which(V(g_bimodal_facebook_star_trek_projection)$type=="2")] <- as.logical(TRUE)
# now project the network
projection_g_bimodal_facebook_star_trek <- bipartite.projection(g_bimodal_facebook_star_trek_projection)
vertex types converted to logical
Firstly, we will look at the induced graph for the “posts”. The induced “posts” actor network consists only of nodes that are of type “post”. An edge exists between post i and post j if they were both liked or commented on by the same user (i.e. if they have any user in common). When every pair of posts has at least one user in common, the network is “complete”. In the output shown here the data contains a single post, so the projection has one node and no edges (your results might differ).
projection_g_bimodal_facebook_star_trek[[1]]
IGRAPH b541d3a UN-- 1 0 --
+ attr: name (v/c), label (v/c)
+ edges from b541d3a (vertex names):
# png('facebook_star_trek_posts.png', width=800, height=700)
plot(projection_g_bimodal_facebook_star_trek[[1]], edge.width = 1.5, edge.curved = 0.5,
edge.arrow.size = 0.5) #vertex.shape='none',

# dev.off()
Secondly, we will look at the induced graph for the “users”. The induced “users” actor network consists only of nodes that are of type “user”. An edge exists between user i and user j if they both liked or commented on the same post (i.e. they share an interaction with some post k). As you might expect, this creates a network with a massive number of edges! A lot of users co-interact with the same posts. For this example, there are 666,435 edges (your results might be somewhat different).
# warning - do not use ‘str‘ function because it will
# cause R to freeze up due to overloading the console output!
# Also: you will probably have difficulty plotting this graph in R because it is so big
projection_g_bimodal_facebook_star_trek[[2]]
IGRAPH 5aa91b1 UNW- 1155 666435 --
+ attr: name (v/c), label (v/c), weight (e/n)
+ edges from 5aa91b1 (vertex names):
[1] A Louise Garber--A.J. Zeien A Louise Garber--Aaron Carroll A Louise Garber--Aaron Niedzielski
[4] A Louise Garber--Adalmir Quintanilha A Louise Garber--Adam Manny Breaux A Louise Garber--Adam Redfern
[7] A Louise Garber--Adam Terrazas A Louise Garber--Addie Tennant A Louise Garber--Aditya Singh
[10] A Louise Garber--Aditya Tamhankar A Louise Garber--Adriana McGee A Louise Garber--Akira Kudou
[13] A Louise Garber--Al Shakespeare A Louise Garber--Al Stikeleather A Louise Garber--Alain Adriaenssens
[16] A Louise Garber--Alan C. Huffines A Louise Garber--Alan Crumb A Louise Garber--Alan George
[19] A Louise Garber--Albert Orkenbjorken A Louise Garber--Alberto Gudiño A Louise Garber--Aldemar López
[22] A Louise Garber--Ale Pescus A Louise Garber--Alejandro Santillan A Louise Garber--Alex Cheatom
+ ... omitted several edges
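The two projections described above can also be sketched with plain matrix algebra. Assuming a user-by-post incidence matrix B (a hypothetical toy example here), the user projection is B multiplied by its transpose, the post projection is the transpose multiplied by B, and each off-diagonal cell counts the neighbours two nodes have in common.

```r
# Hypothetical incidence matrix: rows are users, columns are posts,
# 1 means the user liked or commented on the post
B <- rbind(
  ann = c(p1 = 1, p2 = 0),
  bob = c(p1 = 1, p2 = 1),
  cat = c(p1 = 0, p2 = 1)
)

# user projection: cell [i, j] = number of posts users i and j share
user_proj <- B %*% t(B)
diag(user_proj) <- 0   # drop self-loops

# post projection: cell [i, j] = number of users posts i and j share
post_proj <- t(B) %*% B
diag(post_proj) <- 0

print(user_proj)
print(post_proj)

# a single post shared by k users links every pair of them
k <- 1155
print(choose(k, 2))
```

With 1155 users all attached to one post, the user projection contains 1155 × 1154 / 2 = 666435 edges, which matches the edge count printed above.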
Maybe there is some community structure to this large network. There are several ways to find out. We will use the infomap algorithm implementation in igraph. Infomap uses an information theoretic, flow-based approach to calculating community structure in networks. It supports weighted and directed graphs, and also scales well.
The results show that there is definitely some interesting community structure to the user actor network (a handful of large communities and a tiny community), although your results might differ depending on the actual data collected.
# limit the ‘nb.trials’ argument to a small number to save time
# (number of attempts to partition the network)
imc_starwars <- infomap.community(projection_g_bimodal_facebook_star_trek[[2]],
nb.trials = 3)
# create a vector of users with their assigned community number
communityMembership_starwars <- membership(imc_starwars)
# summarise the distribution of users to communities
commDistribution_starwars <- summary(as.factor(communityMembership_starwars))
# which community has the max number of users
tail(sort(commDistribution_starwars), n = 1)
# create a list of communities that includes the users assigned to each
# community
communities_starwars <- communities(imc_starwars)
# look at the members of the *least* populated community
communities_starwars[names(head(sort(commDistribution_starwars), n = 1))]
Text Analysis
Next, we will do some descriptive text analysis of the Star Wars fan comments.
Data pre-processing
We just want to keep the character vector of ‘comments’ data, for our purposes in this session:
fbData <- myStarWarsData$commentText
We only want elements of fbData that contain comment text (many rows of our Facebook data represent ‘likes’, rather than ‘comments’). So we remove any text data that equals “Not_applicable” (this is how SocialMediaLab designates rows in the dataframe that are ‘likes’). Note: in earlier versions of SocialMediaLab these elements were designated as NA, however this caused unintended consequences so it was changed.
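toRemove <- which(fbData=="Not_applicable")
# remove the elements we want to exclude, but only if there are any:
# with an empty toRemove, fbData[-toRemove] would return an empty vector
if (length(toRemove) > 0) fbData <- fbData[-toRemove]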
How many comments do we have left now?
length(fbData)
[1] 220
We convert the character encoding to UTF-8. This avoids errors relating to ‘odd’ characters in the text. This is usually a good idea, but there may be situations when it is not useful, or even detrimental. Note: Mac users may encounter errors/bugs relating to character encoding, and a workaround is to convert to ‘utf-8-mac’:
fbData <- iconv(fbData, to = "utf-8")
# **MAC USERS ONLY** should use this instead (uncomment to run):
# fbData <- iconv(fbData,to="utf-8-mac")
We convert our character vector fbData to a VCorpus object:
library(tm)
fbCorpus <- VCorpus(VectorSource(fbData))
Individual comments (a.k.a. ‘documents’) can be accessed via the double brackets notation or the ‘dollar sign’ notation for accessing list elements. Let’s look at comment #4.
fbCorpus[[4]][[1]]
[1] "OMG"
# another way to access it
fbCorpus[[4]]$content
[1] "OMG"
We can perform a number of highly useful transformations of text using the tm_map function (i.e. ‘mapping to the corpus’). Not all of these transformations are useful in every scenario! They should be used only when they make sense for the analysis at hand.
Converting all the text to lowercase:
fbCorpus <- tm_map(fbCorpus, content_transformer(tolower))
Remove numbers from the text:
fbCorpus <- tm_map(fbCorpus, removeNumbers)
Remove punctuation from the text:
fbCorpus <- tm_map(fbCorpus, removePunctuation)
Perform ‘word stemming’ on the text. Note: this transformation can be highly useful, but also highly detrimental!
# use lazy=TRUE argument to avoid warning on some machines with multiple CPU cores
fbCorpus <- tm_map(fbCorpus, stemDocument,lazy=TRUE)
We can also remove English ‘stop words’ from the text. These are common words (e.g. ‘the’, ‘and’, ‘or’) that we may want to exclude from our analysis. Once again, this is highly useful but also needs to be carefully applied.
# use lazy=TRUE argument to avoid warning on some machines with multiple CPU cores
fbCorpus <- tm_map(fbCorpus, removeWords, stopwords("english"),lazy=TRUE)
Eliminate unnecessary ‘white space’ from the text. For example, “hello    everyone   my name    is fred” becomes “hello everyone my name is fred”:
fbCorpus <- tm_map(fbCorpus, stripWhitespace, lazy=TRUE)
We can observe the difference now by examining comment #4 again:
fbCorpus[[4]]$content
[1] "omg"
We could also define our own stop words and transform the text using these:
myStopwords <- c("jar","binks")
fbCorpus <- tm_map(fbCorpus, removeWords, myStopwords)
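To see what these transformations do to a single string, here is a rough base R analogue of the lowercase, number, punctuation, and whitespace steps. The helper clean_text and the example comments are hypothetical, and the regular expressions only approximate what the tm functions do.

```r
# Rough base R analogue of some tm transformations (illustrative only)
clean_text <- function(x) {
  x <- tolower(x)                     # lowercase
  x <- gsub("[[:digit:]]+", "", x)    # remove numbers
  x <- gsub("[[:punct:]]+", "", x)    # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # drop leading/trailing whitespace
}

comments_toy <- c("OMG!!  The Death Star 2 is   HUGE",
                  "  I love Star Wars...  ")
cleaned <- clean_text(comments_toy)
print(cleaned)
```

The order of the steps matters: removing numbers and punctuation leaves gaps behind, so whitespace is collapsed last.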
Frequency analysis
Next we create a document-term matrix (DTM) from the fbCorpus object. DTMs are a very important concept for text analysis and are highly useful. DTMs can be thought about as a table (i.e. matrix) where the rows are ‘documents’ (i.e. Facebook comments in our dataset), and the columns are ‘terms’ (i.e. each unique word found across all the documents in the dataset). The ‘cells’ (i.e. elements) of the matrix indicate how many times term n occurred in document m.
Note: we use the control argument to specify that we only want to retain words that are minimum character length of 3, up to a maximum of 20 characters.
dtm <- DocumentTermMatrix(fbCorpus,control = list(wordLengths=c(3, 20)))
dtm
<<DocumentTermMatrix (documents: 220, terms: 816)>>
Non-/sparse entries: 1564/177956
Sparsity : 99%
Maximal term length: 17
Weighting : term frequency (tf)
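The structure of a document-term matrix is easier to see on a tiny example. The base R sketch below builds one by hand for three hypothetical comments. It does not use tm, and the object names are made up for illustration.

```r
# Three toy 'documents' (hypothetical comments, already cleaned)
docs <- c("death star", "star wars star", "omg")

# split each document into terms and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# count how often each vocabulary term occurs in each document
counts <- vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
)

# rows are documents, columns are terms
dtm_toy <- t(counts)
dimnames(dtm_toy) <- list(paste0("doc", seq_along(docs)), vocab)
print(dtm_toy)

# share of cells that are zero, i.e. the sparsity of the matrix
sparsity <- mean(dtm_toy == 0)
print(round(sparsity, 2))
```

The same arithmetic explains the summary printed above: 1564 non-zero cells plus 177956 zero cells gives 220 × 816 = 179520 cells in total, of which about 99% are zero.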
What we have is a sparse matrix, i.e. most of the elements of the matrix are 0: in our dataset, most Facebook comments contain only a small percentage of the ‘vocabulary’ of terms observed across the entire set of comments. What we want to do is remove terms that occur very infrequently, which will leave us with the most ‘important’ terms. We remove sparse terms using the removeSparseTerms function, which removes every term whose sparsity (the proportion of documents in which the term does not occur) is at or above a given threshold.
For example, if we set it to 0.995, then all terms that are at least 99.5% sparse are removed. The following command lets us ‘test out’ what our document-term matrix would look like if we set the threshold to 0.995:
removeSparseTerms(dtm, 0.995)
<<DocumentTermMatrix (documents: 220, terms: 249)>>
Non-/sparse entries: 997/53783
Sparsity : 98%
Maximal term length: 11
Weighting : term frequency (tf)
0.995 will do the trick for us in this workshop, so we will create a new document-term matrix with this threshold applied to it:
dtmSparseRemoved <- removeSparseTerms(dtm, 0.995)
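To see what a threshold of 0.995 means with 220 documents, the following base R sketch computes per-term sparsity by hand. It assumes, following the tm documentation, that removeSparseTerms keeps only terms whose sparsity is below the threshold. The toy matrix and its term names are hypothetical.

```r
n_docs <- 220

# Toy document-term matrix with three terms (hypothetical data)
dtm_toy <- cbind(
  rare   = c(1, rep(0, n_docs - 1)),            # occurs in 1 document
  pair   = c(1, 1, rep(0, n_docs - 2)),         # occurs in 2 documents
  common = c(rep(1, 50), rep(0, n_docs - 50))   # occurs in 50 documents
)

# sparsity of a term = share of documents in which it does not occur
term_sparsity <- colMeans(dtm_toy == 0)
print(round(term_sparsity, 4))

# keep only terms whose sparsity is below the threshold
kept <- names(term_sparsity)[term_sparsity < 0.995]
print(kept)
```

A term that occurs in only one of 220 documents has a sparsity of 219/220, which is above 0.995, so in this setting the threshold amounts to keeping terms that appear in at least two documents.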
We can examine term frequencies in our data. We create a numeric vector of the column sums of our document-term matrix (explicitly coercing it to a matrix object first), meaning that we have a named numeric vector where the names are the unique terms in our document-term matrix, and the values are the number of times each term occurs across our whole corpus.
freqTerms <- colSums(as.matrix(dtmSparseRemoved))
freqTerms
actual admir allianc also alway amaz among anoth anyon anyth
4 2 2 3 5 3 2 4 2 3
argument arm artist ask atat attack awesom back background bad
2 2 2 4 2 3 4 3 2 2
base battl beauti behind believ best big blew blow brian
3 4 3 8 2 2 4 2 3 2
bring budget build built butterfli call came can cant choke
2 3 16 2 4 2 4 6 2 2
come command complet construct contractor cool couldnt dainti darth dead
2 4 3 8 10 3 4 3 3 2
death deathstar definit destroy destruct detail didnt die dish dont
44 2 2 11 4 3 6 2 2 5
doubl effort elabor emperor empir endor enemi episod even ever
2 4 2 17 21 6 2 3 4 3
everyth evil ewok expect face fail falcon favorit find finish
2 5 3 2 2 2 2 3 2 8
first flight forc forgiv found fulli get good got govern
16 2 6 3 2 6 5 11 4 2
great ground hadnt half hate help hope hundr imagin imperi
2 2 2 4 3 2 4 3 2 3
independ instal issu jedi job juan just keep kill kind
2 2 2 3 2 2 15 2 4 2
knew know knowledg land lando laser learn led let life
2 10 2 6 2 2 4 2 2 2
like live lol long look lost love luke made main
15 2 4 4 4 3 9 2 3 2
make man manag mani mass may mayb men moment moon
11 4 2 7 2 2 4 3 2 2
movi much need new nice now old one oper origin
11 2 4 9 3 5 2 17 7 6
paint peopl permiss pic pictur pilot plan planet pleas poor
6 3 4 2 7 2 4 3 2 3
poster power probabl project rather realli rebel rebellion redoubl releas
2 5 4 6 2 5 11 5 2 2
rememb return right run saw say schedul screen second secur
2 2 3 2 4 2 7 2 6 2
see shield ship shot shouldnt show shuttl sick size someth
7 3 5 2 2 2 7 2 2 5
space span star start station steal stormtroop straight strike superweapon
4 2 51 3 3 2 3 2 3 2
sure take target tarkin thank that there thing think thought
4 3 2 2 2 8 5 11 9 8
thrawn time toilet took trap tri trilog turn two unfinish
2 10 2 8 10 3 2 3 3 3
union univers use vader version visit vong wait wall wallpap
4 2 3 9 2 3 2 2 2 2
want war wasnt watch way weak weapon well werent will
5 9 6 5 3 3 4 5 2 5
william wish without work worker wouldnt yeah year yuuzhan
2 3 2 6 3 2 2 9 2
We order the term frequencies and look at the 6 most frequent terms and then the 6 least frequent terms (head() and tail() return 6 elements by default):
orderTerms <- order(freqTerms,decreasing=TRUE)
freqTerms[head(orderTerms)]
star death empir emperor one build
51 44 21 17 17 16
freqTerms[tail(orderTerms)]
werent william without wouldnt yeah yuuzhan
2 2 2 2 2 2
Which terms occurred at least 20 times?
findFreqTerms(dtmSparseRemoved, 20)
[1] "death" "empir" "star"
We can do a basic correlation analysis by looking at the correlations between terms with the findAssocs function. If two terms always appear together then corr = 1. Terms that rarely or never appear together have a correlation near or below 0. Let’s look at which terms co-occur with the term “good”, with a lower correlation limit of 0.5.
findAssocs(dtmSparseRemoved, "good", corlimit=0.5)
$good
evil anyon believ can expect kind life return without anyth ever peopl start turn
0.87 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.55 0.55 0.55 0.55 0.55
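The idea behind these association scores can be sketched with base R's cor() applied to two columns of a document-term matrix. This assumes that findAssocs reports the Pearson correlation between term columns. The count vectors below are hypothetical.

```r
# Hypothetical term counts across six documents
good <- c(1, 1, 0, 0, 1, 0)
evil <- c(1, 1, 0, 0, 0, 0)

# Pearson correlation between the two term columns
r_good_evil <- cor(good, evil)
print(round(r_good_evil, 2))

# a term is perfectly correlated with an identical column
r_same <- cor(good, good)
print(r_same)
```

With so few documents per term, a handful of shared comments is enough to produce a high correlation, so scores like these are best read as hints rather than firm evidence.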
Next, we can do some text visualisation. First, we can plot our descriptive statistics in various ways. For example, we can use a bar chart to visualise the 20 most frequent terms (we will use the lattice package for a nice bar chart):
require(lattice)
Loading required package: lattice
# png("barchart_frequent_terms.png", width=800, height=700)
barchart(freqTerms[orderTerms[1:20]])

# dev.off()
Word Cloud
Next, we will construct a comparison word cloud of the Star Wars and Star Trek fan page comments.
# create a character vector of the Star Wars comments
# (i.e., take a subset of elements from the commentText column of the dataframe)
starWarsComments <- myStarWarsData$commentText[which(myStarWarsData$commentText!="Not_applicable")]
starWarsComments <- paste(starWarsComments , collapse = " ")
# do the same, but for Star Trek
starTrekComments <- myStarTrekData$commentText[which(myStarTrekData$commentText!="Not_applicable")]
starTrekComments <- paste(starTrekComments , collapse = " ")
# combine them together into a dataframe
df_ALL <- data.frame(group=c("Star_Wars","Star_Trek"),words=c(starWarsComments,starTrekComments))
Data pre-processing:
# search for any texts that have no characters (i.e. are ’empty’)
# and then remove these elements from the vector
toRemove <- which(df_ALL$words=="")
# are there any ’empty’ text elements?
# (i.e. length of toRemove is not equal to zero)
# if true, then we remove the corresponding rows from the dataframe
if (isTRUE(length(toRemove)!=0)) {
df_ALL <- df_ALL[-toRemove,]
}
# we create a character vector from the "words" column of df_ALL
# we do not want text as factors, so we will coerce it to character
words <- as.character(df_ALL$words)
# we will convert the character encoding to UTF-8
# just to be sure there are no odd characters that
# may cause problems later on
words <- iconv(words, to = "UTF-8")
# ** MAC USERS ONLY ** should use this instead (uncomment to run):
# words <- iconv(words, to = "UTF-8-mac")
# using the ’tm’ package we convert the character vector to a VCorpus object (volatile corpus)
corp <- VCorpus(VectorSource(words))
## now we do transformations of text using tm_map (’mapping to the corpus’)
# eliminate extra whitespace
corp <- tm_map(corp, stripWhitespace)
# convert to all lowercase
corp <- tm_map(corp, content_transformer(tolower))
# perform stemming (not always useful!)
#corp <- tm_map(corp, stemDocument)
# remove numbers (not always useful!)
corp <- tm_map(corp, removeNumbers)
# remove punctuation (not always useful! e.g. text emoticons)
corp <- tm_map(corp, removePunctuation)
# remove stop words (not always useful!)
corp <- tm_map(corp, removeWords, stopwords("english"))
# create a term-document matrix
# had to do it this way to be able to use colnames
tdm <- TermDocumentMatrix(corp)
tdm <- as.matrix(tdm)
#print(tdm)
colnames(tdm) <- c("Star_Wars","Star_Trek")
colorsx=c("blue","red")
Word Cloud visualization:
require(wordcloud)
Loading required package: wordcloud
#note: if changing res of png, can’t have dimensions in pixels (led to wordclouds with very few words...)
# png("facebook_starwars_startrek_comparison_cloud.png", width=12, height=8, units="in", res=300)
#comparison.cloud(tdm,max.words=300,random.order=FALSE)
comparison.cloud(tdm,max.words=200,random.order=FALSE,colors=colorsx)

#commonality.cloud(tdm,random.order=FALSE)
# dev.off()

Social Media Data Collection
Collecting Twitter data and creating social networks
In this section we will run through how to collect data from Twitter, create networks, and perform different kinds of analysis.
It is currently possible to create 3 different types of networks using Twitter data collected with SocialMediaLab. These are (1) actor networks; (2) bimodal networks; and (3) semantic networks.
First, define the API credentials. Due to the Twitter API specifications, it is not possible to save authentication tokens between sessions. The Authenticate() function is called only for its side effect, which provides access to the Twitter API for the current session.
Authenticating with the Twitter API
Go to Twitter Application Management and “Create New App”. Complete all fields in the form and create a new app. In your app page, go to “Keys and Access Tokens”, generate your access token, and copy the information to R:
Given that we are going to be creating two different types of Twitter networks (actor and semantic), we will Collect() the data, but not pipe it directly through to Create() straight away. This means we can reuse the data multiple times to create two different kinds of networks for analysis. We will collect 150 recent tweets that have used the #auspol hashtag. This is the dominant hashtag for Australian politics. The first step in the workflow is to authorise access to the Twitter API. Instructions for obtaining Twitter API access are available from the VOSON website. See the previous section for a brief explanation of APIs.
We can have a quick look at the data we just collected:
Note the class of the dataframe, which lets SocialMediaLab know that this is an object of class dataSource, which we can then pass to the Create() function to generate different kinds of networks:
If you find that you are encountering errors possibly related to the text of the tweets, you can try converting the tweet text to UTF-8 character encoding. Roughly speaking, this command will help to deal with ‘odd’ characters in the text.
Mac users only may also wish to try the following if they are encountering errors that may be due to character encoding issues:
Creating social networks with Twitter data
Actor network
First, we will create an actor network. In this actor network, edges represent interactions between Twitter users. An interaction is defined as a “mention”, “reply”, or “retweet” from user i to user j, given “tweet” m. In a nutshell, a Twitter actor network shows us who is interacting with whom in relation to a particular hashtag or search term.
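To make the idea of an interaction edge concrete, the base R sketch below turns @mentions in tweet text into a directed edge list. This only illustrates the concept and is not how SocialMediaLab builds its actor network. The data frame, its column names, and the regular expression are assumptions.

```r
# Hypothetical tweets: 'from' is the author, 'text' is the tweet text
tweets <- data.frame(
  from = c("ann", "bob", "cat"),
  text = c("@bob great point, cc @cat", "no mentions here", "RT @ann: agreed"),
  stringsAsFactors = FALSE
)

# extract every @mention from each tweet (a list with one entry per tweet)
mentions <- regmatches(tweets$text, gregexpr("@[A-Za-z0-9_]+", tweets$text))

# one directed edge per mention: author -> mentioned user
edges <- data.frame(
  from = rep(tweets$from, lengths(mentions)),
  to   = sub("^@", "", unlist(mentions)),
  stringsAsFactors = FALSE
)
print(edges)
```

A tweet without any mention contributes no edges, so its author appears in the network only if someone else interacts with them.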
We can now examine the description of our network:
Semantic network
Next, we will create a semantic network. In this network nodes represent unique concepts (in this case unique terms/words extracted from a set of 150 tweets), and edges represent the co-occurrence of terms for all observations in the data set. For example, for this Twitter semantic network, nodes represent either hashtags (e.g. “#auspol”) or single terms (“politics”). If there are 150 tweets in the data set (i.e. 150 observations), and the term #auspol and the term politics appear together in every tweet, then this would be represented by an edge with weight equal to 150.
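The co-occurrence weighting described above can be sketched in base R: build a binary tweet-by-term matrix and multiply it with its own transpose. The toy tweets and object names are hypothetical, and this is not the SocialMediaLab implementation.

```r
# Hypothetical tweets, already split into terms
tweets <- list(
  c("#auspol", "politics", "budget"),
  c("#auspol", "politics"),
  c("#auspol", "budget")
)
terms <- unique(unlist(tweets))

# binary tweet-by-term matrix: 1 if the term occurs in the tweet
B <- t(vapply(
  tweets,
  function(tk) as.integer(terms %in% tk),
  integer(length(terms))
))
colnames(B) <- terms

# term-by-term co-occurrence counts: cell [a, b] = tweets containing both
co <- crossprod(B)
diag(co) <- 0   # a term co-occurring with itself is not an edge
print(co)
```

Here “#auspol” and “politics” appear together in two tweets, so the edge between them has weight 2. If they appeared together in all 150 tweets, the weight would be 150.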
Let’s have a look at the network description:
Bimodal network
Now that we have our Twitter data we can generate a bimodal network. This kind of network provides many possibilities for analysis and generating insights from our data.
In this bimodal network there are two types of nodes: users and hashtags. The bimodal network is therefore:
We now run the Create() function, which creates an igraph object called g_bimodal_twitter. Creating networks in SocialMediaLab is straightforward. We simply pass the myTwitterData object to the Create() function, and it takes care of the rest. We specify what kind of network we want to create (i.e. a bimodal network) by specifying this as an argument to the Create() function.
Note also that there is a tricky operator introduced here, the ‘pipe’ operator %>%, which we have not covered yet. This operator comes from the magrittr package, and it is used to ‘pipe’ together commands in a chain, passing the values along the pipeline until it reaches the final command, which returns the output (i.e. the network we wish to create). In this instance we are passing (or “piping”) the data we collected using Collect() through to the Create() function.
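A minimal example, unrelated to Twitter data, may help to show what the pipe does. It assumes the magrittr package is installed. The two expressions compute the same value, once as nested function calls and once as a pipeline.

```r
library(magrittr)

# nested calls: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# the same computation as a pipeline: read from left to right
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

print(nested)
print(piped)
```

The value on the left of %>% becomes the first argument of the function on the right, which is why a collected data object can be passed straight into Create().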
We can now view basic information about the network:
Collecting Facebook data
In this section we will run through how to collect data from Facebook, create networks, and perform different kinds of analysis.
Authenticate with Facebook Developers API
The process of authentication, data collection, and creating social networks can be expressed with three verb functions: Authenticate(), Collect(), and Create(). This simplified workflow exploits the pipe interface of the magrittr package, and provides better handling of API authentication between R sessions. What we are doing is “piping” the data forward using the %>% operator, in a kind of functional programming approach. It means we can pipe together all the different elements of the workflow quickly and easily. This also provides the ability to save and load authentication tokens, so we don’t have to keep authenticating with APIs between sessions. This opens up possibilities for automation and data mining projects.
Go to the Facebook Developers page (https://developers.facebook.com/apps/), create a new app, and copy the appID and appSecret information.
Make sure we have our appID and appSecret values defined:
Save credential file for later access:
The first time you authenticate you will see this:
You have to add the URL http://localhost:1410/ to the developer app settings: (i) click on “Add Platform”; (ii) choose “Website” and paste http://localhost:1410/ into the URL field.
Bimodal networks
First, we will collect 2 days’ worth of activity from the official Star Wars page. This will collect all the posts made between the rangeFrom and rangeTo dates, including all comments and likes, and other associated data such as usernames and comment timestamps. Note: the date format is YYYY-MM-DD.
We will be using this data to create a bimodal network. This graph object is bimodal because edges represent relationships between nodes of two different types. For example, in our bimodal Facebook network, nodes represent Facebook users or Facebook posts, and edges represent whether a user has commented or ‘liked’ a post. Edges are directed and weighted (e.g. if user i has commented n times on post j, then the weight of this directed edge equals n).
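The edge-weighting rule described here can be sketched with a toy example. The activity data frame below is invented for illustration, and the code is only an igraph approximation of the idea, not the routine SocialMediaLab runs.

```r
library(igraph)

# Hypothetical activity log: one row per comment or like.
activity <- data.frame(
  user = c("user_a", "user_a", "user_b", "user_c", "user_a"),
  post = c("post_1", "post_1", "post_1", "post_2", "post_2"),
  stringsAsFactors = FALSE
)

# Weight n = number of times user i acted on post j.
edges <- aggregate(
  weight ~ user + post,
  data = transform(activity, weight = 1),
  FUN = sum
)

g_bimodal_toy <- graph_from_data_frame(edges, directed = TRUE)

# igraph treats a graph as bipartite when vertices carry a logical
# 'type' attribute; here TRUE marks posts and FALSE marks users.
V(g_bimodal_toy)$type <- V(g_bimodal_toy)$name %in% activity$post

E(g_bimodal_toy)$weight
```

In this toy data user_a acts on post_1 twice, so the edge from user_a to post_1 has weight 2, while every other edge has weight 1.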
The magrittr pipe approach used in this example means that we only end up with the final graph object (in the global environment). To ensure we retain the data that are collected, the argument writeToFile=TRUE is used. This writes the data collected by the Collect() function to a local CSV file before it is piped through to the network generation function Create(). We can then read it in as a dataframe (see code snippet below).
This means we end up with two objects for further analysis, a graph object g_bimodal_facebook_star_wars , and a dataframe object myStarWarsData.
Before proceeding to the analysis, we will collect 2 days’ worth of data from the Star Trek Facebook page, but this time we will pipe through the LoadCredential() function, meaning that we are using the authentication token that we stored locally in the previous step.
Read in the data to a dataframe
Now we can perform some analysis on the Star Wars network. Firstly, we will run through some essential SNA techniques. After that we will do something a bit fancier, by comparing whether there are gender differences between Star Wars and Star Trek networks.
We can get descriptive information about the network:
This informs us that there are 1219 nodes and 1218 edges in the network (this may differ somewhat for your own collected data). It tells us that our graph is Directed, Named, the edges are Weighted, and it also has the additional property of being a Bipartite graph.
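The same four properties can also be queried programmatically. The sketch below uses a tiny hand-built graph (the vertex names are made up) rather than the collected Facebook network. In igraph's printed summary these properties appear as a four-letter code after the word IGRAPH, such as DNWB.

```r
library(igraph)

# Tiny hypothetical user -> post graph with edge weights.
g_flags_toy <- graph_from_data_frame(
  data.frame(
    from = c("user_a", "user_b"),
    to = c("post_1", "post_1"),
    weight = c(2, 1),
    stringsAsFactors = FALSE
  ),
  directed = TRUE
)

# Mark posts (TRUE) versus users (FALSE) so igraph sees two node types.
V(g_flags_toy)$type <- grepl("^post_", V(g_flags_toy)$name)

# The first line of the summary carries the property code and the
# vertex and edge counts.
summary(g_flags_toy)

flags <- c(
  directed = is_directed(g_flags_toy),
  named = is_named(g_flags_toy),
  weighted = is_weighted(g_flags_toy),
  bipartite = is_bipartite(g_flags_toy)
)
flags
```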
Collecting YouTube video comment data
Authenticate
We first ensure that the SocialMediaLab package is loaded:
A Google Developer API Key is required for authenticating with the API (otherwise we cannot collect data). This requires a Google account. Instructions for obtaining a Google Developer API Key are available from the YouTube Data API Overview. Main steps:
We then run the following function, which ensures everything is correctly set up to access the API:
Get data
We now assign a character vector specifying one or more YouTube video IDs that we wish to collect data from. For example, if the video URL is https://www.youtube.com/watch?v=W2GZFeYGU3s, then use videoIDs = "W2GZFeYGU3s". Tip: when collecting from many videos, the GetYoutubeVideoIDs function can be used to create a vector object suitable as input for videoIDs.
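To show where the ID sits inside a URL, here is a small base R sketch. The helper extract_video_id() is a hypothetical function written for this illustration and is not part of SocialMediaLab. The second URL is a made-up variant with extra query parameters, and the sketch assumes standard watch URLs that carry a v parameter.

```r
# Hypothetical helper: pull the 'v' query parameter out of a watch URL.
# URLs without a 'v' parameter are returned unchanged by sub().
extract_video_id <- function(urls) {
  sub("^.*[?&]v=([^&#]+).*$", "\\1", urls)
}

urls <- c(
  "https://www.youtube.com/watch?v=W2GZFeYGU3s",
  "https://www.youtube.com/watch?feature=share&v=W2GZFeYGU3s&t=10s"
)

videoIDs <- extract_video_id(urls)
videoIDs
```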
We now collect the YouTube comment data and store it in a data frame object named myYoutubeData. This data frame can then be used for creating networks for further analysis. We can supply various arguments to the CollectDataYoutube function, providing various options for the data collection process.
We can examine the structure of our data frame:
Generate actor network
We will now create a unimodal network, a.k.a ‘actor network’, representing relationships between users who have interacted with each other. For YouTube comment threads, a relationship is defined as user i ‘replying to’ or ‘mentioning’ user j in a comment. In this network the vertices (a.k.a ‘nodes’) represent YouTube users and the edges (a.k.a ‘links’) represent whether (and how many times) user i has interacted with user j. The edges in this network are both directed and weighted. Edges are ‘directed’ because interactions may not be reciprocated (e.g. user i replies to user j, but user j does not reply to user i), and edges are also ‘weighted’ in order to show how many times user i has interacted with user j.
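The point about reciprocation can be checked on a toy example. The comment log below is invented; the sketch only illustrates how igraph reports mutual edges in a directed, weighted network of this kind.

```r
library(igraph)

# Hypothetical reply/mention log between three users.
comments <- data.frame(
  from = c("user_i", "user_i", "user_j", "user_k"),
  to = c("user_j", "user_j", "user_k", "user_j"),
  stringsAsFactors = FALSE
)

# Collapse repeated interactions into one weighted edge per ordered pair.
reply_edges <- aggregate(
  weight ~ from + to,
  data = transform(comments, weight = 1),
  FUN = sum
)
g_replies_toy <- graph_from_data_frame(reply_edges, directed = TRUE)

# TRUE for each edge whose reverse edge also exists.
mutual <- which_mutual(g_replies_toy)

# Share of edges that are reciprocated.
recip <- reciprocity(g_replies_toy)

mutual
recip
```

Here user_i replies to user_j twice but never receives a reply, so that edge is not mutual, whereas user_j and user_k have replied to each other.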
We can now view basic information about our network, notably the number of vertices (users) and number of interactions (edges):
Storing your social network in graphml format
You can export a social network generated by SocialMediaLab to “graphml” format using the igraph package:
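As a minimal sketch, the example below writes a small hand-built graph to a temporary file and reads it back. The graph and the file name are made up for illustration; in practice you would pass your own network object, such as g_bimodal_twitter, and a path of your choice.

```r
library(igraph)

# Small hypothetical graph standing in for a collected network.
g_export_toy <- graph_from_data_frame(
  data.frame(
    from = c("user_a", "user_b"),
    to = c("#auspol", "#auspol"),
    weight = c(2, 1),
    stringsAsFactors = FALSE
  ),
  directed = TRUE
)

# Write to a temporary location so the example leaves no files behind.
out_file <- file.path(tempdir(), "g_export_toy.graphml")
write_graph(g_export_toy, out_file, format = "graphml")

# Read the file back to confirm that the export round-trips.
g_back <- read_graph(out_file, format = "graphml")
vcount(g_back)
ecount(g_back)
```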
Now you can visualize the “graphml” file with Gephi or other compatible tools.