SNA Grad Seminar, Fall 2017 Due: October 24th, 11:59 pm Name of Student:
For this lab, you will search the New York Times, save that data, create networks from that data, compare the differences among networks, and demonstrate your proficiency with basic network descriptive statistics.
When working with R, you should run each line of code individually, unless it is part of a function definition, so you can see the results. Generally speaking, any line of code that includes ‘{’ (the beginning of a function definition) should be run with all the other lines until you hit ‘}’.
# Lines that start with a hashtag/pound symbol, like this one, are comment lines. Comment lines are ignored by R when it is interpreting code.
# You only need to install packages once. Remove the # in front of each line and then run it to install each package. After successful installation, delete the line of code or replace the #s so the R Notebook doesn't run into problems.
#install.packages('magrittr', repos = "https://cran.rstudio.com")
#install.packages('igraph', repos = "https://cran.rstudio.com")
#install.packages('httr', repos = "https://cran.rstudio.com")
#install.packages('data.table', repos = "https://cran.rstudio.com")
#install.packages('dplyr', repos = "https://cran.rstudio.com")
#install.packages('xml2', repos = "https://cran.rstudio.com")
#You need to load packages every time you run the script or restart R.
library(magrittr)
library(httr)
library(data.table)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:igraph':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(xml2)
# Set your directory for the project
# You can either enter your directory path within the parentheses below and remove the # that makes the line a comment, or select "Session > Set Working Directory > To Source File Location" in RStudio.
# setwd("Input Directory")
You can decide search terms based on personal interests, research interests, or popular topical areas, among others. You have flexibility in selecting your search term list. For example, you can search for some commercial brands, celebrities, countries, universities, etc. It will be most useful if you choose a collection of words that are not all extremely common. Think about a set of words that might have interesting co-occurrences in articles within the New York Times website. For example, you might be interested in the last names of every Senator involved in a certain political debate, football teams, or cities and their co-occurrence in news articles. Generally speaking, proper nouns are best, but you might have compelling reasons to choose verbs or adjectives. You might want to throw a couple of terms in that aren’t thematically related to make sure you don’t get a totally connected component. The more interesting your network is in terms of differing centrality, distinct components, etc., the easier it will be to do the written analysis. Keep in mind that the Article Search archive is very large; many terms co-occur. You might want to consider two tenuously related subjects. The example file uses four football teams and their home senators, plus a few topical terms.
Create a plain text file with .txt extension in the same directory as the R Markdown Notebook used in this assignment. Make a note of the file name for use in the next code snippet. Place one search term per line, and use 15 to 20 terms. You’ll also likely want to add quotation marks around your search terms to ensure that you’re only receiving results for the complete term. NOTE: The function will process your terms so that they work in the URL request. You do not need to encode non-alphabetic characters.
The text file cannot include any additional information or characters and it must be a .txt file; Word or RTF documents won’t work.
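As a quick check before calling the API, the sketch below writes a small, made-up term file to a temporary location and reads it back with base R. The file contents and the `clean_terms` name are illustrative assumptions and are not part of the lab script. `trimws()` is included because a stray trailing space on a line becomes part of the search term.

```r
# Hypothetical term file, written to a temporary path for illustration only.
term_file <- tempfile(fileext = ".txt")
writeLines(c('"Green Bay Packers"', "Equifax ", "  Privacy", ""), term_file)

# Read one term per line, strip surrounding whitespace, and drop empty lines.
raw_terms <- readLines(term_file, warn = FALSE)
clean_terms <- trimws(raw_terms)
clean_terms <- clean_terms[nzchar(clean_terms)]

print(clean_terms)   # the cleaned search terms
length(clean_terms)  # how many terms will be sent to the API
```

Cleaning the terms once, right after reading them, keeps the labels in later tables and plots free of invisible whitespace.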
a. Provide a high-level overview of the terms you included in the search query.
b. Why did you choose this collection of terms? Was there a specific overarching question (intellectual or extracurricular curiosity) that motivated this collection of terms?
c. How did you decide which terms to use in the search query? Were these terms you intuitively deemed important? Were they culled from a specific source or the result of some separate analysis or search query?
d. What insights do you hope to glean by looking at the network of terms in terms of individual node metrics, sub-grouping of nodes, and overall global network properties?
The New York Times controls access to its API by assigning each user a key. Each key has a limited number of calls that can be made within a certain time period. You can read more about the limitations of the API system here.
You will need to create your own API key to complete this assignment. Go to the New York Times developers page and request a key. You will copy that key (received via email) into the api variable below.
# Import your word list
name_of_file <- "NYT Word list.txt" # Creates a variable called name_of_file that you should populate with the name of your text file between quotation marks.
word_list <- read.table(name_of_file, sep = "\n", stringsAsFactors = F) %>% unlist %>% as.vector # Reads the content of your file into a variable.
num_words <- length(word_list) # Creates a variable with the number of words in your list.
url_base <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
# When you receive the email with your API key, paste it below between the quotation marks.
api <- '886cfdebb3a54e0da0a75ed80a527db9'
Our first function will gather all of the search terms and their number of hits to be placed in a table. All lines of a function should be run together.
Get_hits_one <- function(keyword1) {
Sys.sleep(time=3)
url <- paste0(url_base, "?api-key=", api, "&q=", URLencode(keyword1),"&begin_date=","20160101") # Begin date is in format YYYYMMDD; you can change it if you want only more recent results, for example.
# The number of results
print(keyword1)
hits <- content(GET(url))$response$meta$hits %>% as.numeric
print(hits)
# Put results in table
c(SearchTerm=keyword1,ResultsTotal=hits)
}
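To see what the request looks like before any call is made, the following sketch assembles a query URL without sending it. The helper name `build_query_url` and the placeholder key are assumptions for illustration. Unlike the lab function, it passes `reserved = TRUE` to `URLencode()`, which also encodes characters such as `&` and `+` inside a term; that is a design choice, not a requirement of the lab.

```r
# Hypothetical helper: assembles an Article Search request URL without sending it.
build_query_url <- function(term, key = "YOUR_API_KEY", begin_date = "20160101") {
  base <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
  paste0(base, "?api-key=", key,
         "&q=", utils::URLencode(term, reserved = TRUE),
         "&begin_date=", begin_date)   # YYYYMMDD, as in Get_hits_one
}

# A quoted phrase: the quotation marks and the space are percent-encoded.
encoded <- utils::URLencode('"New York"', reserved = TRUE)
print(encoded)

example_url <- build_query_url('"New York"')
print(example_url)
```

Printing the URL this way is a cheap method for confirming that a quoted, multi-word term survives encoding intact before spending API calls on it.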
Now we will invoke our function to put information from the API into our global environment.
#Create a table of your words and their number of results.
total_table <- t(sapply(word_list,Get_hits_one))
## [1] "Privacy"
## [1] 2604
## [1] "Private"
## [1] 16748
## [1] "Data"
## [1] 12268
## [1] "Security"
## [1] 20344
## [1] "Secure"
## [1] 4542
## [1] "Vulnerabilities"
## [1] 433
## [1] "Personal "
## [1] 15025
## [1] "Information "
## [1] 17432
## [1] "Disclosure "
## [1] 1655
## [1] "Identity "
## [1] 5886
## [1] "Breach "
## [1] 1407
## [1] "Equifax"
## [1] 127
## [1] "Credit"
## [1] 7678
## [1] "Confidentiality"
## [1] 391
## [1] "Status "
## [1] 6954
## [1] "Anonymous "
## [1] 1795
## [1] "Surveillance"
## [1] 2509
## [1] "Secret"
## [1] 6833
## [1] "Secrecy"
## [1] 815
## [1] "Conspiracy"
## [1] 2487
## [1] "Hacking"
## [1] 1384
## [1] "Leak"
## [1] 1026
## [1] "Leaks"
## [1] 863
total_table <- as.data.frame(total_table)
total_table$ResultsTotal <- as.numeric(as.character(total_table$ResultsTotal))
If you get zero hits for any of these terms, you should replace that term with something else and rerun the lab up to this point. Next, we will define the function that will collect the article co-occurrence network.
Get_hits_two <- function(row_input) {
keyword1 <- row_input[1]
keyword2 <- row_input[2]
url <- paste0(url_base, "?api-key=", api, "&q=", URLencode(keyword1),"+", URLencode(keyword2),"&begin_date=","20160101") #match w/ Begin Date in Get_hits_one.
# The number of results
print(paste0(keyword1," ",keyword2))
hits <- content(GET(url))$response$meta$hits %>% as.numeric
print(hits)
Sys.sleep(time=3)
# Put results in table
c(SearchTerm1=keyword1,SearchTerm2=keyword2,CoOccurrences=hits)
}
In this next step, we will call the API and collect the co-occurrence network. This may take some time. If you receive “numeric(0)” in any of your responses, you’ve likely hit your API key limit and will either need to wait for the calls to reset (24 hours) or request a new key. If you receive the error message “$ operator is invalid for atomic vectors,” you have also hit the API call limit. This could be due to running the script multiple times, or due to hitting too many results based on very common search terms. Request a new API key, shorten your word list, and try again. Don’t forget you need to reload your word list from the first part of the Lab in order to get a different set of results! You must also rerun the functions to reassign the API value. If none of your results come back as “0,” you might want to redo your search with the appropriate words.
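The two failure modes described above can be handled defensively. The sketch below is one possible approach, not part of the lab script: `extract_hits` is a hypothetical helper, and the mock responses are assumed shapes chosen to reproduce the "numeric(0)" and "$ operator is invalid for atomic vectors" situations without contacting the API.

```r
# Hypothetical helper: pull the hit count out of a parsed API response,
# returning NA instead of stopping when the expected fields are missing.
extract_hits <- function(parsed) {
  hits <- tryCatch(parsed$response$meta$hits, error = function(e) NULL)
  if (is.null(hits) || length(hits) == 0) NA_real_ else as.numeric(hits)
}

# Mock inputs (assumed shapes, for illustration only).
ok_response    <- list(response = list(meta = list(hits = 42)))  # normal reply
limit_response <- "API rate limit exceeded"                      # atomic vector: `$` would error
empty_response <- list(message = "no response field")            # list without a hit count

extract_hits(ok_response)
extract_hits(limit_response)
extract_hits(empty_response)
```

Returning `NA` keeps a long run of calls from aborting midway, and the `NA` rows show exactly which term pairs need to be requested again.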
# Build the table of unique, unordered term pairs (one row per pair).
# Note: expand.grid(word_list, word_list) would list every pair twice (A-B and B-A) and double the API calls, so combn() is used instead.
pairs_list <- t(combn(word_list,2))
#Create a network table, run the Get_hits_two function using the pairs lists
network_table <- t(apply(pairs_list,1,Get_hits_two))
## [1] "Privacy Private"
## [1] 709
## [1] "Privacy Data"
## [1] 833
## [1] "Privacy Security"
## [1] 785
## [1] "Privacy Secure"
## [1] 171
## [1] "Privacy Vulnerabilities"
## [1] 46
## [1] "Privacy Personal "
## [1] 699
## [1] "Privacy Information "
## [1] 1048
## [1] "Privacy Disclosure "
## [1] 129
## [1] "Privacy Identity "
## [1] 305
## [1] "Privacy Breach "
## [1] 89
## [1] "Privacy Equifax"
## [1] 14
## [1] "Privacy Credit"
## [1] 189
## [1] "Privacy Confidentiality"
## [1] 58
## [1] "Privacy Status "
## [1] 190
## [1] "Privacy Anonymous "
## [1] 142
## [1] "Privacy Surveillance"
## [1] 320
## [1] "Privacy Secret"
## [1] 276
## [1] "Privacy Secrecy"
## [1] 98
## [1] "Privacy Conspiracy"
## [1] 55
## [1] "Privacy Hacking"
## [1] 125
## [1] "Privacy Leak"
## [1] 38
## [1] "Privacy Leaks"
## [1] 67
## [1] "Private Data"
## [1] 2080
## [1] "Private Security"
## [1] 3097
## [1] "Private Secure"
## [1] 793
## [1] "Private Vulnerabilities"
## [1] 91
## [1] "Private Personal "
## [1] 2789
## [1] "Private Information "
## [1] 2854
## [1] "Private Disclosure "
## [1] 464
## [1] "Private Identity "
## [1] 885
## [1] "Private Breach "
## [1] 287
## [1] "Private Equifax"
## [1] 30
## [1] "Private Credit"
## [1] 1447
## [1] "Private Confidentiality"
## [1] 118
## [1] "Private Status "
## [1] 1069
## [1] "Private Anonymous "
## [1] 317
## [1] "Private Surveillance"
## [1] 421
## [1] "Private Secret"
## [1] 1249
## [1] "Private Secrecy"
## [1] 228
## [1] "Private Conspiracy"
## [1] 394
## [1] "Private Hacking"
## [1] 334
## [1] "Private Leak"
## [1] 204
## [1] "Private Leaks"
## [1] 234
## [1] "Data Security"
## [1] 2214
## [1] "Data Secure"
## [1] 581
## [1] "Data Vulnerabilities"
## [1] 121
## [1] "Data Personal "
## [1] 1782
## [1] "Data Information "
## [1] 3400
## [1] "Data Disclosure "
## [1] 327
## [1] "Data Identity "
## [1] 548
## [1] "Data Breach "
## [1] 387
## [1] "Data Equifax"
## [1] 92
## [1] "Data Credit"
## [1] 1277
## [1] "Data Confidentiality"
## [1] 66
## [1] "Data Status "
## [1] 735
## [1] "Data Anonymous "
## [1] 239
## [1] "Data Surveillance"
## [1] 458
## [1] "Data Secret"
## [1] 629
## [1] "Data Secrecy"
## [1] 131
## [1] "Data Conspiracy"
## [1] 216
## [1] "Data Hacking"
## [1] 405
## [1] "Data Leak"
## [1] 175
## [1] "Data Leaks"
## [1] 160
## [1] "Security Secure"
## [1] 1482
## [1] "Security Vulnerabilities"
## [1] 230
## [1] "Security Personal "
## [1] 2499
## [1] "Security Information "
## [1] 3962
## [1] "Security Disclosure "
## [1] 437
## [1] "Security Identity "
## [1] 1070
## [1] "Security Breach "
## [1] 612
## [1] "Security Equifax"
## [1] 78
## [1] "Security Credit"
## [1] 1169
## [1] "Security Confidentiality"
## [1] 86
## [1] "Security Status "
## [1] 1384
## [1] "Security Anonymous "
## [1] 329
## [1] "Security Surveillance"
## [1] 1105
## [1] "Security Secret"
## [1] 1600
## [1] "Security Secrecy"
## [1] 270
## [1] "Security Conspiracy"
## [1] 545
## [1] "Security Hacking"
## [1] 777
## [1] "Security Leak"
## [1] 305
## [1] "Security Leaks"
## [1] 359
## [1] "Secure Vulnerabilities"
## [1] 72
## [1] "Secure Personal "
## [1] 627
## [1] "Secure Information "
## [1] 778
## [1] "Secure Disclosure "
## [1] 95
## [1] "Secure Identity "
## [1] 254
## [1] "Secure Breach "
## [1] 130
## [1] "Secure Equifax"
## [1] 11
## [1] "Secure Credit"
## [1] 359
## [1] "Secure Confidentiality"
## [1] 25
## [1] "Secure Status "
## [1] 345
## [1] "Secure Anonymous "
## [1] 84
## [1] "Secure Surveillance"
## [1] 161
## [1] "Secure Secret"
## [1] 320
## [1] "Secure Secrecy"
## [1] 57
## [1] "Secure Conspiracy"
## [1] 105
## [1] "Secure Hacking"
## [1] 147
## [1] "Secure Leak"
## [1] 60
## [1] "Secure Leaks"
## [1] 71
## [1] "Vulnerabilities Personal "
## [1] 102
## [1] "Vulnerabilities Information "
## [1] 149
## [1] "Vulnerabilities Disclosure "
## [1] 24
## [1] "Vulnerabilities Identity "
## [1] 35
## [1] "Vulnerabilities Breach "
## [1] 36
## [1] "Vulnerabilities Equifax"
## [1] 4
## [1] "Vulnerabilities Credit"
## [1] 53
## [1] "Vulnerabilities Confidentiality"
## [1] 5
## [1] "Vulnerabilities Status "
## [1] 34
## [1] "Vulnerabilities Anonymous "
## [1] 11
## [1] "Vulnerabilities Surveillance"
## [1] 49
## [1] "Vulnerabilities Secret"
## [1] 55
## [1] "Vulnerabilities Secrecy"
## [1] 8
## [1] "Vulnerabilities Conspiracy"
## [1] 11
## [1] "Vulnerabilities Hacking"
## [1] 93
## [1] "Vulnerabilities Leak"
## [1] 16
## [1] "Vulnerabilities Leaks"
## [1] 13
## [1] "Personal Information "
## [1] 2846
## [1] "Personal Disclosure "
## [1] 391
## [1] "Personal Identity "
## [1] 1110
## [1] "Personal Breach "
## [1] 327
## [1] "Personal Equifax"
## [1] 72
## [1] "Personal Credit"
## [1] 1259
## [1] "Personal Confidentiality"
## [1] 84
## [1] "Personal Status "
## [1] 1025
## [1] "Personal Anonymous "
## [1] 413
## [1] "Personal Surveillance"
## [1] 371
## [1] "Personal Secret"
## [1] 1139
## [1] "Personal Secrecy"
## [1] 170
## [1] "Personal Conspiracy"
## [1] 387
## [1] "Personal Hacking"
## [1] 313
## [1] "Personal Leak"
## [1] 172
## [1] "Personal Leaks"
## [1] 167
## [1] "Information Disclosure "
## [1] 748
## [1] "Information Identity "
## [1] 937
## [1] "Information Breach "
## [1] 487
## [1] "Information Equifax"
## [1] 86
## [1] "Information Credit"
## [1] 1279
## [1] "Information Confidentiality"
## [1] 179
## [1] "Information Status "
## [1] 1060
## [1] "Information Anonymous "
## [1] 481
## [1] "Information Surveillance"
## [1] 796
## [1] "Information Secret"
## [1] 1494
## [1] "Information Secrecy"
## [1] 334
## [1] "Information Conspiracy"
## [1] 539
## [1] "Information Hacking"
## [1] 626
## [1] "Information Leak"
## [1] 409
## [1] "Information Leaks"
## [1] 425
## [1] "Disclosure Identity "
## [1] 100
## [1] "Disclosure Breach "
## [1] 85
## [1] "Disclosure Equifax"
## [1] 5
## [1] "Disclosure Credit"
## [1] 189
## [1] "Disclosure Confidentiality"
## [1] 44
## [1] "Disclosure Status "
## [1] 125
## [1] "Disclosure Anonymous "
## [1] 59
## [1] "Disclosure Surveillance"
## [1] 93
## [1] "Disclosure Secret"
## [1] 242
## [1] "Disclosure Secrecy"
## [1] 102
## [1] "Disclosure Conspiracy"
## [1] 71
## [1] "Disclosure Hacking"
## [1] 107
## [1] "Disclosure Leak"
## [1] 78
## [1] "Disclosure Leaks"
## [1] 75
## [1] "Identity Breach "
## [1] 120
## [1] "Identity Equifax"
## [1] 33
## [1] "Identity Credit"
## [1] 382
## [1] "Identity Confidentiality"
## [1] 27
## [1] "Identity Status "
## [1] 561
## [1] "Identity Anonymous "
## [1] 215
## [1] "Identity Surveillance"
## [1] 176
## [1] "Identity Secret"
## [1] 541
## [1] "Identity Secrecy"
## [1] 81
## [1] "Identity Conspiracy"
## [1] 176
## [1] "Identity Hacking"
## [1] 107
## [1] "Identity Leak"
## [1] 64
## [1] "Identity Leaks"
## [1] 59
## [1] "Breach Equifax"
## [1] 88
## [1] "Breach Credit"
## [1] 180
## [1] "Breach Confidentiality"
## [1] 32
## [1] "Breach Status "
## [1] 89
## [1] "Breach Anonymous "
## [1] 42
## [1] "Breach Surveillance"
## [1] 53
## [1] "Breach Secret"
## [1] 139
## [1] "Breach Secrecy"
## [1] 33
## [1] "Breach Conspiracy"
## [1] 48
## [1] "Breach Hacking"
## [1] 200
## [1] "Breach Leak"
## [1] 78
## [1] "Breach Leaks"
## [1] 58
## [1] "Equifax Credit"
## [1] 98
## [1] "Equifax Confidentiality"
## [1] 1
## [1] "Equifax Status "
## [1] 11
## [1] "Equifax Anonymous "
## [1] 1
## [1] "Equifax Surveillance"
## [1] 4
## [1] "Equifax Secret"
## [1] 5
## [1] "Equifax Secrecy"
## [1] 2
## [1] "Equifax Conspiracy"
## [1] 1
## [1] "Equifax Hacking"
## [1] 17
## [1] "Equifax Leak"
## [1] 4
## [1] "Equifax Leaks"
## [1] 2
## [1] "Credit Confidentiality"
## [1] 34
## [1] "Credit Status "
## [1] 547
## [1] "Credit Anonymous "
## [1] 121
## [1] "Credit Surveillance"
## [1] 119
## [1] "Credit Secret"
## [1] 402
## [1] "Credit Secrecy"
## [1] 66
## [1] "Credit Conspiracy"
## [1] 137
## [1] "Credit Hacking"
## [1] 134
## [1] "Credit Leak"
## [1] 69
## [1] "Credit Leaks"
## [1] 44
## [1] "Confidentiality Status "
## [1] 40
## [1] "Confidentiality Anonymous "
## [1] 28
## [1] "Confidentiality Surveillance"
## [1] 12
## [1] "Confidentiality Secret"
## [1] 70
## [1] "Confidentiality Secrecy"
## [1] 38
## [1] "Confidentiality Conspiracy"
## [1] 12
## [1] "Confidentiality Hacking"
## [1] 17
## [1] "Confidentiality Leak"
## [1] 23
## [1] "Confidentiality Leaks"
## [1] 13
## [1] "Status Anonymous "
## [1] 153
## [1] "Status Surveillance"
## [1] 139
## [1] "Status Secret"
## [1] 418
## [1] "Status Secrecy"
## [1] 65
## [1] "Status Conspiracy"
## [1] 142
## [1] "Status Hacking"
## [1] 85
## [1] "Status Leak"
## [1] 68
## [1] "Status Leaks"
## [1] 63
## [1] "Anonymous Surveillance"
## [1] 74
## [1] "Anonymous Secret"
## [1] 203
## [1] "Anonymous Secrecy"
## [1] 52
## [1] "Anonymous Conspiracy"
## [1] 66
## [1] "Anonymous Hacking"
## [1] 63
## [1] "Anonymous Leak"
## [1] 49
## [1] "Anonymous Leaks"
## [1] 66
## [1] "Surveillance Secret"
## [1] 340
## [1] "Surveillance Secrecy"
## [1] 60
## [1] "Surveillance Conspiracy"
## [1] 112
## [1] "Surveillance Hacking"
## [1] 131
## [1] "Surveillance Leak"
## [1] 73
## [1] "Surveillance Leaks"
## [1] 96
## [1] "Secret Secrecy"
## [1] 237
## [1] "Secret Conspiracy"
## [1] 307
## [1] "Secret Hacking"
## [1] 222
## [1] "Secret Leak"
## [1] 179
## [1] "Secret Leaks"
## [1] 168
## [1] "Secrecy Conspiracy"
## [1] 44
## [1] "Secrecy Hacking"
## [1] 35
## [1] "Secrecy Leak"
## [1] 60
## [1] "Secrecy Leaks"
## [1] 60
## [1] "Conspiracy Hacking"
## [1] 77
## [1] "Conspiracy Leak"
## [1] 54
## [1] "Conspiracy Leaks"
## [1] 47
## [1] "Hacking Leak"
## [1] 118
## [1] "Hacking Leaks"
## [1] 126
## [1] "Leak Leaks"
## [1] 193
#Convert the network table into a dataframe
network_table <- as.data.frame(network_table)
# Read the content of each item within the $CoOccurrences factor as characters,
# then force those characters into the "numeric" or "double" type.
network_table$CoOccurrences <- as.numeric(as.character(network_table$CoOccurrences))
# Convert data to data.table type.
total_table <- as.data.table(total_table)
network_table <- as.data.table(network_table)
# Remove zero edges from your network
network_table <- network_table[!CoOccurrences==0]
# Create a graph object with your data
g_valued <- graph_from_data_frame(d = network_table[,1:3,with=FALSE],directed = FALSE,vertices = total_table)
# If you're having trouble with data collection, you can load the 'NFL Lab Results.RData' file now by clicking the open folder icon on the "Environment" tab and continue the lab from here. You'll need to figure out what the significance of the terms is yourself, however.
# You should save your data at this point by clicking the floppy disk icon under the "Environment" tab.
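To make the structure of the graph object concrete, here is a minimal sketch that builds a graph from toy tables with the same column layout as `network_table` and `total_table`. The values and the `toy_` names are made up for illustration.

```r
library(igraph)

# Toy data with the same column layout as network_table and total_table (values are made up).
toy_edges <- data.frame(SearchTerm1   = c("A", "A"),
                        SearchTerm2   = c("B", "C"),
                        CoOccurrences = c(12, 3))
toy_nodes <- data.frame(SearchTerm   = c("A", "B", "C"),
                        ResultsTotal = c(100, 40, 25))

# The first two edge columns name the endpoints; remaining columns become edge attributes.
# The first vertex column names the nodes; remaining columns become vertex attributes.
toy_g <- graph_from_data_frame(d = toy_edges, directed = FALSE, vertices = toy_nodes)

is_directed(toy_g)       # whether ties have a direction
vcount(toy_g)            # number of nodes
ecount(toy_g)            # number of links
E(toy_g)$CoOccurrences   # edge attribute carried over from the table
V(toy_g)$ResultsTotal    # vertex attribute carried over from the table
```

Because co-occurrence is symmetric (A appearing with B is the same as B appearing with A), the graph is built with `directed = FALSE`.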
Is the graph directed or undirected?
How many nodes and links does your network have?
numVertices <- vcount(g_valued)
numVertices
## [1] 23
numEdges <- ecount(g_valued)
numEdges
## [1] 253
What is the number of possible links in your network?
maxEdges <- numVertices*(numVertices-1)/2
maxEdges
## [1] 253
What is the density of your network?
graphDensity <- numEdges/maxEdges # manual calculation
graphDensity
## [1] 1
graphDensity1 <- graph.density(g_valued) # using the graph.density function from igraph
graphDensity1
## [1] 1
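As a small worked example of the two calculations above, the sketch below compares a sparse and a complete toy graph. The graphs are illustrative and unrelated to the search data; `edge_density()` is the current igraph name for the density function.

```r
library(igraph)

n <- 5
possible_links <- choose(n, 2)   # same as n * (n - 1) / 2 for an undirected graph

ring <- make_ring(n)             # 5 nodes, 5 links
full <- make_full_graph(n)       # every possible link is present

manual_density <- ecount(ring) / possible_links
manual_density
edge_density(ring)               # igraph's calculation for the ring
edge_density(full)               # a complete graph has density 1
```

A density of exactly 1, as in the output above, means every pair of terms co-occurred at least once, so measures that ignore edge weights cannot distinguish between nodes.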
Briefly describe how your choice of dataset may influence your findings. What differences would you expect if you use different search terms? Are the current search terms related to one another? If so, how? Do you think the limitation to one word might skew your answers? (e.g., if you’re interested in Hillary Clinton, but you include “Clinton” as a term, you will get stories that mention Chelsea, Bill, & even P-Funk Allstar George Clinton).
## Learn more about plotting with igraph
# ?? igraph.plotting
colbar = rainbow(length(word_list)) ## we are selecting different colors to correspond to each word
V(g_valued)$color = colbar
# Set layout here
L = layout_with_fr(g_valued) # Fruchterman Reingold
plot(g_valued,vertex.color=V(g_valued)$color, layout = L, vertex.size=6, vertex.label = NA)
## Analysis
In a paragraph, describe the macro-level structure of your graphs based on the Fruchterman Reingold visualization. Is it a giant, connected component, are there distinct sub-components, or are there isolated components? Can you recognize common features of the subcomponents? Does this visualization give you any insight into the co-occurrence patterns of the search-terms? If yes, what? If not, why?
Now we’ll create a second visualization using a different layout.
##You can change the layout by picking one of the other options. Uncomment one of the lines below by erasing the # and running the line. Try to find a layout that gives you different information than Fruchterman Reingold.
L = layout_with_dh(g_valued) ## Davidson and Harel
#L= layout_with_drl(g_valued) ## Force-directed
#L = layout_with_kk(g_valued) ## Spring
plot(g_valued,vertex.color=V(g_valued)$color, layout = L, vertex.size=6)
## Analysis
In a paragraph, compare and contrast the information given to you by the two different layouts.
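When comparing layouts, it helps to know that a layout is just a matrix of x and y coordinates, and that force-directed layouts start from random positions. The sketch below, on a toy ring graph, assumes that fixing the random seed is acceptable for the comparison; seeding should make a layout repeatable between runs.

```r
library(igraph)

g <- make_ring(10)   # toy graph for illustration

# Seed before each call so the random starting positions are the same.
set.seed(42)
L1 <- layout_with_fr(g)
set.seed(42)
L2 <- layout_with_fr(g)

dim(L1)                      # one row per node, two columns (x and y)
isTRUE(all.equal(L1, L2))    # checks whether the two seeded runs agree

# A different algorithm produces different coordinates for the same graph.
L3 <- layout_with_kk(g)
head(L3)
```

Storing the layout in a variable, as the lab does with `L`, lets several plots share the same node positions so that only size or color changes between them.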
Identifying subgroups within a network is of great interest to social network researchers, so a variety of algorithms have been developed to identify and measure subgroups. We will use some of R’s built-in tools to identify subgroups and central nodes for visual inspection.
For the remainder of the visualizations we will use the Fruchterman Reingold layout.
L = layout_with_fr(g_valued)
Cluster the nodes in your network.
# Learn more about the clustering algorithm.
?? cluster_walktrap
## starting httpd help server ... done
cluster <- cluster_walktrap(g_valued)
# Find the number of clusters
membership(cluster) # affiliation list
## Privacy Private Data Security
## 1 2 3 4
## Secure Vulnerabilities Personal Information
## 5 6 7 8
## Disclosure Identity Breach Equifax
## 9 10 11 12
## Credit Confidentiality Status Anonymous
## 13 14 15 16
## Surveillance Secret Secrecy Conspiracy
## 17 18 19 20
## Hacking Leak Leaks
## 21 22 23
length(sizes(cluster)) # number of clusters
## [1] 23
# Find the size of each cluster
# Note that communities with one node are isolates, or have only a single tie
sizes(cluster)
## Community sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
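According to the igraph documentation, `cluster_walktrap()` only picks up edge weights automatically from an edge attribute named `weight`. The lab graph stores its counts in `CoOccurrences`, so the sketch below shows one way to pass them explicitly. The toy graph (two tightly linked triangles joined by one weak tie) and its values are made up for illustration.

```r
library(igraph)

# Toy network: two triangles with strong ties, joined by a single weak tie (C-D).
toy_edges <- data.frame(
  from = c("A", "A", "B", "D", "D", "E", "C"),
  to   = c("B", "C", "C", "E", "F", "F", "D"),
  CoOccurrences = c(50, 40, 45, 60, 55, 50, 1)
)
toy_g <- graph_from_data_frame(toy_edges, directed = FALSE)

# Pass the counts explicitly, since the attribute is not called "weight".
toy_cluster <- cluster_walktrap(toy_g, weights = E(toy_g)$CoOccurrences)

membership(toy_cluster)   # community label for each node
sizes(toy_cluster)        # number of nodes per community
```

In a fully connected network, weighting the clustering by co-occurrence counts is one of the few ways to recover group structure, because every node is tied to every other node.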
How many communities have been created?
How many nodes are in each community? In networks containing node attribute information, we can often gain insight into a network by looking at the nodes that get placed in the same partition. For your network, what might each cluster of nodes potentially have in common? Describe each cluster, its membership, and the relationship between nodes in the cluster. Next we visualize the clusters by coloring nodes according to their modularity class.
plot(cluster, g_valued, col = V(g_valued)$color, layout = L, vertex.size=6)
What information does this layout convey? Are the clusters well-separated, or is there a great deal of overlap? Is it easier to identify the common themes among clusters in this layout rather than looking only at the graphs?
What differences are there between nodes in the same cluster and across clusters?
Describe the brokers between any components and cliques. What are common features of these brokers? About how many brokers would you have to remove from your network to “shatter” it into two or more disconnected components?
For each network, you will use centrality metrics to improve your visualization. You may need to adjust the size parameter to make your network more easily visible.
totalDegree <- degree(g_valued,mode="all")
sort(totalDegree,decreasing=TRUE)[1:23]
## Privacy Private Data Security
## 22 22 22 22
## Secure Vulnerabilities Personal Information
## 22 22 22 22
## Disclosure Identity Breach Equifax
## 22 22 22 22
## Credit Confidentiality Status Anonymous
## 22 22 22 22
## Surveillance Secret Secrecy Conspiracy
## 22 22 22 22
## Hacking Leak Leaks
## 22 22 22
g2 <- g_valued
V(g2)$size <- totalDegree*.4 #can adjust the number if nodes are too big
plot(g2, layout = L, vertex.label=NA)
Briefly explain degree centrality and why nodes are more or less central in the network.
wd <- graph.strength(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wd,decreasing=TRUE)[1:23]
## Information Security Private Personal
## 24917 24395 20094 18744
## Data Secret Credit Status
## 16856 10235 9554 8348
## Identity Secure Privacy Surveillance
## 7786 6728 6386 5163
## Hacking Disclosure Breach Conspiracy
## 4259 3990 3600 3546
## Anonymous Leaks Leak Secrecy
## 3208 2566 2489 2231
## Vulnerabilities Confidentiality Equifax
## 1258 1012 659
wg2 <- g_valued
V(wg2)$size <- wd*.01 # adjust the number if nodes are too big
plot(wg2, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge. Otherwise edges tend to cover up all the other edges and obscure the relationships.
What does the addition of weighted degree and edge information tell you about your graph?
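The difference between degree and weighted degree (strength) can be seen on a three-node toy graph. The values below are made up; `strength()` is the current igraph name for `graph.strength()`.

```r
library(igraph)

# Toy triangle with unequal co-occurrence counts.
toy_edges <- data.frame(from = c("A", "A", "B"),
                        to   = c("B", "C", "C"),
                        CoOccurrences = c(10, 1, 5))
toy_g <- graph_from_data_frame(toy_edges, directed = FALSE)

toy_degree   <- degree(toy_g)                                     # number of ties per node
toy_strength <- strength(toy_g, weights = E(toy_g)$CoOccurrences) # sum of tie weights per node

toy_degree
toy_strength
```

Every node has the same degree here, yet their strengths differ, which mirrors the lab network: identical degree for all terms but very different weighted degree.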
b <- betweenness(g_valued,directed=TRUE)
sort(b,decreasing=TRUE)[1:5]
## Privacy Private Data Security Secure
## 0 0 0 0 0
g4 <- g_valued
V(g4)$size <- b*10000 #can adjust the number
plot(g4, layout = L, vertex.label=NA)
Briefly explain betweenness centrality and why nodes are more or less central in the network.
wbtwn <- betweenness(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wbtwn,decreasing=TRUE)[1:23]
## Equifax Confidentiality Privacy Private
## 212 76 0 0
## Data Security Secure Vulnerabilities
## 0 0 0 0
## Personal Information Disclosure Identity
## 0 0 0 0
## Breach Credit Status Anonymous
## 0 0 0 0
## Surveillance Secret Secrecy Conspiracy
## 0 0 0 0
## Hacking Leak Leaks
## 0 0 0
wBtwnG <- g_valued
V(wBtwnG)$size <- wbtwn*.5 # adjust the number if nodes are too big
plot(wBtwnG, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge.
What does the addition of weighted betweenness and edge information tell you about your graph?
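When interpreting weighted path-based measures, keep in mind that igraph treats edge weights as distances (costs) when finding shortest paths, so a large co-occurrence count makes two terms look far apart. The sketch below contrasts raw counts with inverted counts on a toy triangle. Inverting with `1 / w` is one common convention and is an assumption here, not something the lab prescribes.

```r
library(igraph)

# Toy triangle: A-B and B-C co-occur often, A-C rarely.
toy_edges <- data.frame(from = c("A", "B", "A"),
                        to   = c("B", "C", "C"),
                        CoOccurrences = c(10, 10, 1))
toy_g <- graph_from_data_frame(toy_edges, directed = FALSE)
w <- E(toy_g)$CoOccurrences

# Raw counts as distances: the rare A-C tie is the "shortest" route, so no node is a broker.
btw_raw <- betweenness(toy_g, weights = w)

# Inverted counts: frequent co-occurrence means closeness, so B sits between A and C.
btw_inv <- betweenness(toy_g, weights = 1 / w)

btw_raw
btw_inv
```

Under the raw-count reading, weakly connected terms can appear as brokers simply because their small counts look like short distances, which is worth considering when explaining the results above.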
c <- closeness(g_valued)
sort(c,decreasing=TRUE)[1:5]
## Privacy Private Data Security Secure
## 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455
g5 <- g_valued
V(g5)$size <- c*500 #can adjust the number
plot(g5, layout = L, vertex.label=NA)
Briefly explain closeness centrality and why nodes are more or less central in the network.
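Closeness can be reproduced by hand from the shortest-path distance matrix, which may help when explaining the scores. The sketch below uses a toy complete graph with five nodes, where every node is one step from each of the other four.

```r
library(igraph)

g <- make_full_graph(5)   # toy complete graph

# Shortest-path distances between every pair of nodes.
d <- distances(g)

# Unnormalized closeness: 1 divided by the sum of distances to all other nodes.
manual_closeness  <- 1 / rowSums(d)
builtin_closeness <- closeness(g)

manual_closeness
builtin_closeness
```

The same arithmetic applies to the lab network: with 23 terms all directly connected, each node is one step from 22 others, giving 1/22, or about 0.04545.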
wClsnss <- closeness(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wClsnss,decreasing=TRUE)[1:5]
## Equifax Confidentiality Anonymous Conspiracy
## 0.001956947 0.001912046 0.001879699 0.001879699
## Secrecy
## 0.001814882
wClsnssG <- g_valued
V(wClsnssG)$size <- wClsnss*4000 # adjust the number if nodes are too big
plot(wClsnssG, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge.
What does the addition of weighted closeness and edge information tell you about your graph?
eigc <- eigen_centrality(g_valued,directed=TRUE)
sort(eigc$vector,decreasing=TRUE)[1:5]
## Privacy Credit Private Secure Breach
## 1 1 1 1 1
g6 <- g_valued
V(g6)$size <- eigc$vector*50 #can adjust the number
plot(g6, layout = L, vertex.label=NA)
Briefly explain eigenvector centrality and why nodes are more or less central in the network.
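Eigenvector centrality is the leading eigenvector of the adjacency matrix, rescaled so the largest score is 1. The following sketch checks this on a toy graph (a triangle with one extra node attached) using base R's `eigen()`; the graph is made up for illustration.

```r
library(igraph)

# Toy graph: triangle A-B-C with a pendant node D attached to C.
g <- graph_from_literal(A-B, B-C, A-C, C-D)

# Adjacency matrix and its leading eigenvector (eigenvalues come back in decreasing order).
A <- as_adjacency_matrix(g, sparse = FALSE)
lead <- abs(eigen(A, symmetric = TRUE)$vectors[, 1])

# Rescale so the most central node has score 1, matching igraph's default scaling.
manual_eig <- lead / max(lead)
names(manual_eig) <- V(g)$name

builtin_eig <- eigen_centrality(g)$vector

round(manual_eig, 4)
round(builtin_eig, 4)
```

A node scores highly when its neighbors score highly. In a complete, unweighted graph every node has identical neighbors, which is why all scores in the output above equal 1.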
Choose the visualization that you think is most interesting and briefly explain what it tells you about a central node in your network. Discuss the type of centrality, and what that node’s centrality score tells you about the search co-occurrence network.
Briefly discuss an interesting difference between types of centrality for your network.
Compute the network centralization scores for your network for degree, betweenness, closeness, and eigenvector centrality.
# Degree centralization
centralization.degree(g_valued,normalized = TRUE)
## $res
## [1] 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22
##
## $centralization
## [1] 0
##
## $theoretical_max
## [1] 506
# Betweenness centralization
centralization.betweenness(g_valued,normalized = TRUE)
## $res
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## $centralization
## [1] 0
##
## $theoretical_max
## [1] 5082
# Closeness centralization
centralization.closeness(g_valued,normalized = TRUE)
## $res
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $centralization
## [1] 0
##
## $theoretical_max
## [1] 10.74419
# Eigenvector centralization
centralization.evcent(g_valued,normalized = TRUE)
## $vector
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $value
## [1] 22
##
## $options
## $options$bmat
## [1] "I"
##
## $options$n
## [1] 23
##
## $options$which
## [1] "LA"
##
## $options$nev
## [1] 1
##
## $options$tol
## [1] 0
##
## $options$ncv
## [1] 0
##
## $options$ldv
## [1] 0
##
## $options$ishift
## [1] 1
##
## $options$maxiter
## [1] 1000
##
## $options$nb
## [1] 1
##
## $options$mode
## [1] 1
##
## $options$start
## [1] 1
##
## $options$sigma
## [1] 0
##
## $options$sigmai
## [1] 0
##
## $options$info
## [1] 0
##
## $options$iter
## [1] 1
##
## $options$nconv
## [1] 1
##
## $options$numop
## [1] 11
##
## $options$numopb
## [1] 0
##
## $options$numreo
## [1] 11
##
##
## $centralization
## [1] 0
##
## $theoretical_max
## [1] 21
Record the centralization score of each centrality measure.
Briefly explain what the centralization of a network is.
Compare the centralization scores above with the graphs you created where the nodes are scaled by centrality. Describe the appearance of more centralized v. less centralized networks.
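As an illustration of what a centralization score measures, the sketch below computes Freeman degree centralization by hand for two extreme toy graphs: a star (one hub) and a complete graph (no hub). The helper `freeman_degree` is hypothetical, and the denominator assumes an undirected graph without self-loops, which is why `loops = FALSE` is passed to `centr_degree()`.

```r
library(igraph)

star <- make_star(6, mode = "undirected")   # one hub tied to five leaves
full <- make_full_graph(6)                  # everyone tied to everyone

# Hypothetical helper: sum of gaps between the most central node and each node,
# divided by the largest such sum possible for an undirected graph without loops.
freeman_degree <- function(g) {
  d <- degree(g)
  n <- vcount(g)
  sum(max(d) - d) / ((n - 1) * (n - 2))
}

star_manual <- freeman_degree(star)
full_manual <- freeman_degree(full)
star_igraph <- centr_degree(star, loops = FALSE)$centralization
full_igraph <- centr_degree(full, loops = FALSE)$centralization

c(star_manual, star_igraph)   # maximally centralized
c(full_manual, full_igraph)   # no centralization at all
```

Centralization is high when one node dominates and zero when all nodes have equal centrality, which matches the zeros reported for the complete lab network.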
Networks often demonstrate power law distributions. Plot the degree distribution of the nodes in your base graph.
# Calculate degree distribution
deg <- degree(g_valued,v=V(g_valued), mode="all")
deg
## Privacy Private Data Security
## 22 22 22 22
## Secure Vulnerabilities Personal Information
## 22 22 22 22
## Disclosure Identity Breach Equifax
## 22 22 22 22
## Credit Confidentiality Status Anonymous
## 22 22 22 22
## Surveillance Secret Secrecy Conspiracy
## 22 22 22 22
## Hacking Leak Leaks
## 22 22 22
# Degree distribution is the cumulative frequency of nodes with a given degree
deg_distr <-degree.distribution(g_valued, cumulative=T, mode="all")
deg_distr
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
plot(deg_distr, ylim=c(.01,10), bg="black",pch=21, xlab="Degree", ylab="Cumulative Frequency") #You may need to adjust the ylim to a larger or smaller number to make the graph show more data.
Test whether it’s approximately a power law: estimate $\log f(k) = \log a - c \log k$. “This says that if we have a power-law relationship, and we plot $\log f(k)$ as a function of $\log k$, then we should see a straight line: $-c$ will be the slope, and $\log a$ will be the y-intercept. Such a ‘log-log’ plot thus provides a quick way to see if one’s data exhibits an approximate power-law: it is easy to see if one has an approximately straight line, and one can read off the exponent from the slope.” (E&K, Chapter 18, p.546).
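The log-log relationship in the quotation can be checked numerically. The sketch below generates exact power-law values with assumed parameters (a = 100, c = 2.5, chosen only for illustration) and recovers them with an ordinary linear regression on the logged values.

```r
# Synthetic power-law data: f(k) = a * k^(-c), with made-up parameters.
a_true <- 100
c_true <- 2.5
k <- 1:50
f <- a_true * k^(-c_true)

# On log-log axes the relationship is a straight line: log f = log a - c * log k.
fit <- lm(log(f) ~ log(k))

slope     <- unname(coef(fit)[2])        # estimates -c
intercept <- unname(coef(fit)[1])        # estimates log(a)

slope
exp(intercept)
```

Real degree data are noisy, so the fitted line will only be approximate, and a distribution in which every node has the same degree (as in a complete graph) has no slope to estimate at all.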
power <- power.law.fit(deg_distr)
power
## $continuous
## [1] FALSE
##
## $alpha
## [1] Inf
##
## $xmin
## [1] 1
##
## $logLik
## [1] NaN
##
## $KS.stat
## [1] 1.797693e+308
##
## $KS.p
## [1] 1
plot(deg_distr, log="xy", ylim=c(.01,10), bg="black",pch=21, xlab="Degree", ylab="Cumulative Frequency")
Does your network exhibit a power law distribution of degree centrality?
Networks often demonstrate small world characteristics. Compute the average clustering coefficient (ACC) and the characteristic path length (CPL).
# Average clustering coefficient (ACC)
transitivity(g_valued, type = c("average"))
## [1] 1
# Characteristic path length (CPL)
average.path.length(g_valued)
## [1] 1
Compute the ACC and CPL for 100 random networks with the same number of nodes and ties as your test network.
accSum <- 0
cplSum <- 0
for (i in 1:100){
grph <- erdos.renyi.game(numVertices, numEdges, type = "gnm")
accSum <- accSum + transitivity(grph, type = c("average"))
cplSum <- cplSum + average.path.length(grph)
}
accSum/100
## [1] 1
cplSum/100
## [1] 1
Based on these data, would you conclude that the observed network demonstrates small world properties? Why or why not?
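For contrast with a network that does show small-world structure, the sketch below compares a Watts-Strogatz toy graph against random graphs with the same number of nodes and ties. The parameters (100 nodes, 5 neighbors on each side, 5% rewiring, 20 random graphs) are arbitrary choices for illustration; `sample_gnm()` and `mean_distance()` are the current igraph names for `erdos.renyi.game(type = "gnm")` and `average.path.length()`.

```r
library(igraph)

set.seed(1)

# Toy small-world graph: a ring lattice with a small share of rewired ties.
sw <- sample_smallworld(dim = 1, size = 100, nei = 5, p = 0.05)
sw_acc <- transitivity(sw, type = "average")   # average clustering coefficient
sw_cpl <- mean_distance(sw)                    # characteristic path length

# Random graphs with the same number of nodes and ties.
n_sims <- 20
rand_acc <- numeric(n_sims)
rand_cpl <- numeric(n_sims)
for (i in seq_len(n_sims)) {
  r <- sample_gnm(vcount(sw), ecount(sw))
  rand_acc[i] <- transitivity(r, type = "average")
  rand_cpl[i] <- mean_distance(r)
}

c(observed_acc = sw_acc, random_acc = mean(rand_acc))
c(observed_cpl = sw_cpl, random_cpl = mean(rand_cpl))
```

A small-world network has much higher clustering than comparable random graphs while keeping a similarly short path length. When the observed network is complete, the random graphs with the same number of ties are also complete, so the comparison cannot separate them.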
To complete the lab, make sure output/previews have been generated for each block of code. Then click the “Publish” button on the upper right hand corner of this screen and sign up for an RPubs account. Submit the URL of the published, completed lab on Canvas.
installr::updateR() # updating R