SNA Grad Seminar, Fall 2017
Due: October 24th, 11:59 pm
Name of Student: Jue Wu

The purpose of this lab is to develop your familiarity with conducting descriptive network analysis using the statistical software package R. This assignment will make use of a data set you collect from the New York Times’s Article Search Application Programming Interface by defining a search query (a collection of your user-defined search terms). Networks are generated from the co-occurrences between search terms included in the same search query. For example, a link exists between “apple” and “orange” if there are articles in the New York Times that contain both terms. You will be visualizing and interpreting individual and global network properties of this network.

You will be graded primarily on the completeness and accuracy of your responses, but the clarity of the prepared report will also affect your grade. While students may work together to perform the analysis, each student must submit his or her own report and is responsible for writing the narrative in the report. You must answer all of the bolded questions.

Part 1: Collect Network Data (20 pts)

For this lab, you will search the New York Times, save that data, create networks from that data, compare the differences among networks, and demonstrate your proficiency with basic network descriptive statistics.

Loading and Installing Packages, Set Working Directory

When working with R, you should run each line of code individually, unless it is part of a function definition, so you can see the results. Generally speaking, any line of code that includes ‘{’ (the beginning of a function definition) should be run with all the other lines until you hit ‘}’.

library(magrittr)
library(httr)
library(data.table)
data.table 1.10.4.1
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com

Attaching package: ‘data.table’

The following object is masked _by_ ‘.GlobalEnv’:

    .N
library(igraph)

Attaching package: ‘igraph’

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union
library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:igraph’:

    as_data_frame, groups, union

The following objects are masked from ‘package:data.table’:

    between, first, last

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
library(xml2)

Choose a topic for your search terms

You can decide search terms based on personal interests, research interests, or popular topical areas, among others. You have flexibility in selecting your search term list. For example, you can search for some commercial brands, celebrities, countries, universities, etc. It will be most useful if you choose a collection of words that are not all extremely common. Think about a set of words that might have interesting co-occurrences in articles within the New York Times website. For example, you might be interested in the last names of every Senator involved in a certain political debate, football teams, or cities and their co-occurrence in news articles. Generally speaking, proper nouns are best, but you might have compelling reasons to choose verbs or adjectives. You might want to throw a couple of terms in that aren’t thematically related to make sure you don’t get a totally connected component. The more interesting your network is in terms of differing centrality, distinct components, etc., the easier it will be to do the written analysis. Keep in mind that the Article Search archive is very large; many terms co-occur. You might want to consider two tenuously related subjects. The example file uses four football teams and their home senators, plus a few topical terms.

Create your text input

Create a plain text file with .txt extension in the same directory as the R Markdown Notebook used in this assignment. Make a note of the file name for use in the next code snippet. Place one search term per line, and use 15–20 terms. You’ll also likely want to add quotation marks around your search terms to ensure that you’re only receiving results for the complete term. NOTE: The function will process your terms so that they work in the URL request. You do not need to encode non-alphabetic characters.

The text file cannot include any additional information or characters and it must be a .txt file; Word or RTF documents won’t work.
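The file-preparation step above can be sketched in base R. This is a minimal sketch with hypothetical terms and a hypothetical file name (`my_terms.txt`, written to a temporary directory); it is not the lab's own input file. One detail worth checking: by default `read.table` treats straight quotation marks as quoting characters and drops them, so the sketch passes `quote = ""` to keep the marks in the terms.

```r
# Hypothetical terms and file name, for illustration only.
terms <- c("Donald Trump", "Roger Goodell", "Atlanta Falcons")

# Wrap each term in double quotes so the search treats it as a complete phrase.
quoted_terms <- paste0('"', terms, '"')

# Write one term per line to a plain .txt file.
term_file <- file.path(tempdir(), "my_terms.txt")
writeLines(quoted_terms, term_file)

# Read the file back; quote = "" keeps the literal quotation marks.
kept <- read.table(term_file, sep = "\n", quote = "",
                   stringsAsFactors = FALSE)[[1]]
print(kept)
```

If the quotation marks are missing after import, the search would run on the individual words rather than on the complete phrase.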

Analysis

a. Provide a high-level overview of the terms you included in the search query. The terms in the search query include NFL teams, players, and officials (New Orleans Saints, Tampa Bay Buccaneers, Carolina Panthers, Atlanta Falcons, Roger Goodell, Colin Kaepernick), and political figures (Steve Scalise, Sen. Bill Cassidy, Sen. Bill Nelson, Sen. Marco Rubio, Sen. Lindsey Graham, Sen. Tim Scott, Sen. Johnny Isakson, Sen. David Perdue, Donald Trump).

b. Why did you choose this collection of terms? Was there some specific overarching question (intellectual or extracurricular curiosity) that motivated this collection of terms? I used the provided dataset, but I was curious about how people and organizations in politics and sports relate to each other. For example, what are the political leanings of teams and people in the NFL? Are there financial interests connecting sports teams and political figures?

c. How did you decide which terms to use in the search query? Were these terms you intuitively deemed important? Were they culled from a specific source or the result of some separate analysis or search query? Originally I was interested in the diplomatic relations between countries (especially how US views China and whether it is biased or not), but I couldn’t successfully collect the data I wanted. In order to finish the assignment, I just used the provided dataset.

d. What are the insights you hope to glean by looking at the network of terms in terms of individual node metrics, sub-grouping of nodes, overall global network properties? My guess is that Donald Trump will have the most links and the highest degree centrality, because he is the President of the United States and every term in the query is American. At the sub-group level, I expect two communities, one set of NFL-related nodes and one set of politics-related nodes, each with many ties inside the community. At the global level, I expect this network to be centralized.

Working with the API to Collect Your Data

The New York Times controls access to its API by assigning each user a key. Each key has a limited number of calls that can be made within a certain time period. You can read more about the limitations of the API system here.

You will need to create your own API key to complete this assignment. Go to the New York Times developers page and request a key. You will copy that key (received via email) into the api variable below.

# Import your word list
name_of_file <- "NFL.txt" # Creates a variable called name_of_file that you should populate with the name of your text file between quotation marks.
word_list <- read.table(name_of_file, sep = "\n", stringsAsFactors = F) %>% unlist %>% as.vector # Reads the content of your file into a variable.
num_words <- length(word_list) # Creates a variable with the number of words in your list.
url_base <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
# When you receive the email with your API key, paste it below between the quotation marks.
api <- '76f06c3d16c54280b9233d8f3d76e4bf'

Our first function will gather all of the search terms and their number of hits to be placed in a table. All lines of a function should be run together.

Get_hits_one <- function(keyword1) {
  Sys.sleep(time=3)
  url <- paste0(url_base, "?api-key=", api, "&q=", URLencode(keyword1),"&begin_date=","20160101") # Begin date is in format YYYYMMDD; you can change it if you want only more recent results, for example.
  # The number of results
  print(keyword1)
  hits <- content(GET(url))$response$meta$hits %>% as.numeric
  print(hits)
  # Put results in table
  c(SearchTerm=keyword1,ResultsTotal=hits)
}
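To see what the request built inside Get_hits_one looks like, the URL construction can be run on its own without contacting the API. This is a sketch with a placeholder key (`YOUR-KEY-HERE`) and a hypothetical term; the variable names ending in `_demo` are mine and are not part of the lab code.

```r
# Placeholder key and hypothetical term; no request is sent.
url_base_demo <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
api_demo <- "YOUR-KEY-HERE"
term <- '"Donald Trump"'

# URLencode() percent-encodes characters that are not allowed in a URL,
# such as the space and the straight quotation marks.
encoded <- utils::URLencode(term)

# Assemble the query the same way Get_hits_one does.
query_url <- paste0(url_base_demo, "?api-key=", api_demo,
                    "&q=", encoded, "&begin_date=", "20160101")
print(encoded)
print(query_url)
```

Printing the URL before calling GET is a cheap way to confirm that the term and the begin date arrive in the request as intended.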

Now we will invoke our function to put information from the API into our global environment.

#Create a table of your words and their number of results.
total_table <- t(sapply(word_list,Get_hits_one))
total_table <- as.data.frame(total_table)
total_table$ResultsTotal <- as.numeric(as.character(total_table$ResultsTotal))

If you get zero hits for any of these terms, you should substitute something else for that term and rerun the lab up to this point. Next, we will define the function that will collect the article co-occurrence network.

Get_hits_two <- function(row_input) {
  keyword1 <- row_input[1]
  keyword2 <- row_input[2]
  url <- paste0(url_base, "?api-key=", api, "&q=", URLencode(keyword1),"+", URLencode(keyword2),"&begin_date=","20160101") #match w/ Begin Date in Get_hits_one.
  # The number of results
  print(paste0(keyword1," ",keyword2)) 
  hits <- content(GET(url))$response$meta$hits %>% as.numeric
  print(hits)
  Sys.sleep(time=3)
  # Put results in table
  c(SearchTerm1=keyword1,SearchTerm2=keyword2,CoOccurrences=hits)
} 

In this next step, we will call the API and collect the co-occurrence network. This may take some time. If you receive “numeric(0)” in any of your responses, you’ve likely hit your API key limit and will either need to wait for the calls to reset (24 hours) or request a new key. If you receive the error message “$ operator is invalid for atomic vectors,” you have also hit the API call limit. This could be due to running the script multiple times, or due to hitting too many results based on very common search terms. Request a new API key, shorten your word list, and try again. Don’t forget that you need to reload your word list from the first part of the lab in order to get a different set of results! You must also rerun the functions so that they pick up the new API key value. If none of your results come back as “0,” you might want to redo your search with the appropriate words.
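The two failure modes described above (an empty “numeric(0)” and the “$ operator is invalid for atomic vectors” error) both come from pulling the hit count out of a response that does not have the expected shape. A minimal defensive sketch follows, assuming a hypothetical helper named `extract_hits` that is not part of the lab code and using mock parsed responses rather than real API output.

```r
# Hypothetical helper: return the hit count, or NA when the parsed
# response does not have the expected list structure.
extract_hits <- function(parsed) {
  # A rate-limited reply may parse to a plain string instead of a list,
  # which is what triggers the "$ operator" error.
  if (!is.list(parsed)) return(NA_real_)
  hits <- parsed$response$meta$hits
  # A reply without a hit count would otherwise become numeric(0).
  if (is.null(hits)) return(NA_real_)
  as.numeric(hits)
}

# Mock responses for illustration only.
good_reply    <- list(response = list(meta = list(hits = 12)))
limited_reply <- "API rate limit exceeded"
empty_reply   <- list(message = "no response element")

extract_hits(good_reply)
extract_hits(limited_reply)
extract_hits(empty_reply)
```

Returning NA keeps a long collection run from stopping halfway, and the NA rows show which pairs need to be requested again later.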

# Build the list of unique, unordered pairs of search terms (one row per pair).
# combn() yields each pair once, so the API is not queried for both (A, B) and (B, A).
pairs_list <- t(combn(word_list,2))
#Create a network table, run the Get_hits_two function using the pairs lists
network_table <- t(apply(pairs_list,1,Get_hits_two))
#Convert the network table into a dataframe
network_table <- as.data.frame(network_table)
# Read the content of each item in the $CoOccurrences factor as characters,
# then force those characters into the "numeric" or "double" type.
network_table$CoOccurrences <- as.numeric(as.character(network_table$CoOccurrences))
# Convert data to data.table type.
total_table <- as.data.table(total_table)
network_table <- as.data.table(network_table)

# Remove zero edges from your network
network_table <- network_table[!CoOccurrences==0] 

# Create a graph object with your data
g_valued <- graph_from_data_frame(d = network_table[,1:3,with=FALSE],directed = FALSE,vertices = total_table)

# If you're having trouble with data collection, you can load the 'NFL Lab Results.RData' file now by clicking the open folder icon on the "Environment" tab and continue the lab from here. You'll need to figure out what the significance of the terms is yourself, however.
# You should save your data at this point by clicking the floppy disk icon under the "Environment" tab.

Analysis

Is the graph directed or undirected? Undirected

How many nodes and links does your network have? There are 15 nodes and 35 links.

numVertices <- vcount(g_valued)
numVertices
[1] 15
numEdges <- ecount(g_valued)
numEdges
[1] 35

What is the number of possible links in your network? There are 105 possible links.

maxEdges <- numVertices*(numVertices-1)/2
maxEdges
[1] 105

What is the density of your network? The density is 0.3333333

graphDensity1 <- graph.density(g_valued) # using the graph.density function from igraph
graphDensity1
[1] 0.3333333
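The link count and density reported above can be cross-checked by hand. The sketch below uses only the node and edge counts from the output above; the variable names are mine.

```r
# Counts taken from the vcount() and ecount() output above.
n_nodes <- 15
n_edges <- 35

# An undirected graph without self-loops has n * (n - 1) / 2 possible links.
possible_links <- choose(n_nodes, 2)

# Density is the share of possible links that are actually present.
density_check <- n_edges / possible_links

print(possible_links)
print(density_check)
```

Both values agree with maxEdges and with the graph.density result, so about one third of all possible term pairs co-occur at least once.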

Briefly describe how your choice of dataset may influence your findings. What differences would you expect if you use different search terms? Are the current search terms related to one another? If so, how? Do you think the limitation to one word might skew your answers? (e.g., if you’re interested in Hillary Clinton, but you include “Clinton” as a term, you will get stories that mention Chelsea, Bill, & even P-Funk Allstar George Clinton).

It was good to include the full names instead of limiting each term to a single word, which reduces the chance of getting unwanted results. Also, including Sen. in front of people’s names ensured that the results are the political ones we wanted. However, people might also write President Trump when referring to Donald Trump, so using “Donald Trump” might influence the results by giving us fewer data points.

Part 2: Visualize Your Network (20 points)

Let’s start by visualizing the network that we’ve collected from the New York Times Article Search API. We’ll need to choose node colors and set a layout. You can learn more about Fruchterman Reingold layout and other layouts here.

Analysis

In a paragraph, describe the macro-level structure of your graphs based on the Fruchterman Reingold visualization. Is it a giant, connected component, are there distinct sub-components, or are there isolated components? Can you recognize common features of the subcomponents? Does this visualization give you any insight into the co-occurrence patterns of the search-terms? If yes, what? If not, why?

The graph is one giant component in which every node is connected. At first sight, the visualization tells me that Donald Trump is a cutpoint. He connects the politics camp and the sports camp; if we remove him, the network will break into two components. In addition, the NFL-related terms are clearly more densely connected within their community than the politics terms are.

Now we’ll create a second visualization using a different layout.

Analysis

In a paragraph, compare and contrast the information given to you by the two different layouts. Both layouts suggest that Donald Trump has the highest degree centrality and is the cutpoint. However, the sports community appears less centralized in the second layout than in the first.

Part 3: Community Detection Analysis with R (20 Points)

Identifying subgroups within a network is of great interest to social network researchers, so a variety of algorithms have been developed to identify and measure subgroups. We will use some of R’s built-in tools to identify subgroups and central nodes for visual inspection.

For the remainder of the visualizations we will use the Fruchterman Reingold layout.

Cluster the nodes in your network.

cluster <- cluster_walktrap(g_valued)
# Find the number of clusters
membership(cluster)   # affiliation list
  “New Orleans Saints”        “Steve Scalise”    “Sen. Bill Cassidy” 
                     3                      1                      2 
“Tampa Bay Buccaneers”     “Sen. Bill Nelson”     “Sen. Marco Rubio” 
                     3                      1                      1 
   “Carolina Panthers”  “Sen. Lindsey Graham”       “Sen. Tim Scott” 
                     3                      1                      2 
     “Atlanta Falcons”  “Sen. Johnny Isakson”    “Sen. David Perdue” 
                     3                      1                      1 
       “Roger Goodell”         “Donald Trump”     “Colin Kaepernick” 
                     3                      1                      3 
length(sizes(cluster)) # number of clusters
[1] 3
# Find the size of each cluster
# Note that communities with one node are isolates, or have only a single tie
sizes(cluster) 
Community sizes
1 2 3 
7 2 6 

How many communities have been created? 3

How many nodes are in each community? In networks containing node attribute information, we can often gain insight into a network by looking at the nodes that get placed in the same partition.

There are 7, 2, and 6 nodes in communities 1, 2, and 3, respectively.
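To read the clusters by name rather than by number, the membership vector printed above can be grouped by cluster. The sketch below retypes the values from the membership() output by hand (without the quotation marks that are part of the stored names); with the real objects, `membership(cluster)` could be used in place of the hand-typed vector.

```r
# Cluster assignments copied from the membership() output above.
membership_vec <- c(
  "New Orleans Saints" = 3, "Steve Scalise" = 1, "Sen. Bill Cassidy" = 2,
  "Tampa Bay Buccaneers" = 3, "Sen. Bill Nelson" = 1, "Sen. Marco Rubio" = 1,
  "Carolina Panthers" = 3, "Sen. Lindsey Graham" = 1, "Sen. Tim Scott" = 2,
  "Atlanta Falcons" = 3, "Sen. Johnny Isakson" = 1, "Sen. David Perdue" = 1,
  "Roger Goodell" = 3, "Donald Trump" = 1, "Colin Kaepernick" = 3
)

# List the members of each cluster.
by_cluster <- split(names(membership_vec), membership_vec)
print(by_cluster)

# Cluster sizes, which should match sizes(cluster).
cluster_sizes <- table(membership_vec)
print(cluster_sizes)
```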

For your network, what might each cluster of nodes potentially have in common? Describe each cluster, its membership, and the relationship between nodes in the cluster. Cluster 1 is all politicians, including several senators, the President, and the House Majority Whip. The relationship between these nodes is probably work-related. Cluster 2 consists of two senators. They are both Republican, but because I’m not familiar with the politics, I don’t know why they fall in a separate cluster from Cluster 1. My guess is that they might hold different opinions from the other politicians. Cluster 3 is all NFL related, including NFL teams, an official, and a player. The relationship between these nodes is probably competition and service.

Next we visualize the clusters by coloring nodes according to their modularity class.

What information does this layout convey? Are the clusters well-separated, or is there a great deal of overlap? Is it easier to identify the common themes among clusters in this layout rather than looking only at the graphs? The clusters are well separated. It is easier to identify the common themes among clusters, since we can tell directly from the layout which actors are in the same cluster.

What differences are there between nodes in the same cluster and across clusters? Nodes in the same cluster are more densely connected to each other and separated by shorter distances, whereas nodes in different clusters are less connected and reaching them requires longer paths.

Describe the brokers between any components and cliques. What are common features of these brokers? About how many brokers would you have to remove from your network to “shatter” it into two or more disconnected components? Brokers are the actors who have high betweenness centrality and control the group’s information flows. Based on the layout, Donald Trump is the broker for Cluster 1; without him, no one else in Cluster 1 could get information from outside. Sen. Bill Cassidy is the broker for Cluster 2; without him, Sen. Tim Scott would not get information. As for Cluster 3, everyone is directly connected to Donald Trump and they are interconnected as well, so no one is the broker. The common feature of these brokers is that the actors in their cluster would not get information from outside if the broker were removed. To break the network into more disconnected components, we can remove either Donald Trump or Sen. Bill Cassidy.
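The idea of a broker whose removal disconnects the network can be checked with igraph's articulation_points function rather than by eye. The sketch below uses a small hypothetical graph (two triangles joined by one tie), not the lab network; with the real data, `articulation_points(g_valued)` would be the analogous call.

```r
library(igraph)

# Hypothetical toy graph: triangle A-B-C and triangle D-E-F joined by the tie C-D.
toy <- graph_from_literal(A - B, B - C, A - C, C - D, D - E, E - F, D - F)

# Articulation points (cutpoints) are nodes whose removal disconnects the graph.
cut_nodes <- as_ids(articulation_points(toy))
print(cut_nodes)

# Removing one cutpoint splits the toy graph into separate components.
n_components_after <- components(delete_vertices(toy, "C"))$no
print(n_components_after)
```

Counting the components after a removal also shows how many pieces the network breaks into, which a layout alone can hide when some nodes hang off a single neighbor.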

Part 4: Centrality Visualization & Weighted Values (20 Points)

For each network, you will use centrality metrics to improve your visualization. You may need to adjust the size parameter to make your network more easily visible.

Degree Centrality

totalDegree <- degree(g_valued,mode="all")
sort(totalDegree,decreasing=TRUE)[1:5]
        “Donald Trump”   “New Orleans Saints” “Tampa Bay Buccaneers” 
                    13                      6                      6 
   “Carolina Panthers”      “Atlanta Falcons” 
                     6                      6 
g2 <- g_valued
V(g2)$size <- totalDegree*2 #can adjust the number if nodes are too big
plot(g2, layout = L, vertex.label=NA)

Briefly explain degree centrality and why nodes are more or less central in the network. Degree centrality measures the “popularity” of an actor by counting the total number of links it has with other actors. In this case, Donald Trump has the highest degree centrality, probably because he is the President and every person and team in the query is in this country. The four NFL teams all have relatively high degree centrality, and they have exactly the same number of links. This is probably because they play games against one another, and the number of games between these teams should be the same.

Weighted Degree Centrality

wd <- graph.strength(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wd,decreasing=TRUE)[1:5]
   “Carolina Panthers”      “Atlanta Falcons”     “Colin Kaepernick” 
                   246                    213                    194 
  “New Orleans Saints” “Tampa Bay Buccaneers” 
                   190                    189 
wg2 <- g_valued
V(wg2)$size <- wd*.1 # adjust the number if nodes are too big
plot(wg2, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences))

What does the addition of weighted degree and edge information tell you about your graph? The weighted results are quite different from the unweighted ones. The weighted measure takes into account how frequently the terms co-occurred in the NYT, rather than only whether a relation is present or absent. NFL-related terms have the highest weighted degree centrality, probably because the teams play against one another quite often and therefore co-occur more. The unweighted centrality, by contrast, only tells us whether there is a link between two nodes.

Betweenness Centrality

b <- betweenness(g_valued,directed=TRUE)
sort(b,decreasing=TRUE)[1:5]
     “Donald Trump” “Sen. Bill Cassidy”     “Steve Scalise”  “Sen. Bill Nelson” 
         65.3333333          13.0000000           2.0000000           0.3333333 
 “Sen. Marco Rubio” 
          0.3333333 
g4 <- g_valued
V(g4)$size <- b*1.2#can adjust the number
plot(g4, layout = L, vertex.label=NA)

Briefly explain betweenness centrality and why nodes are more or less central in the network. Betweenness centrality assesses how much an actor lies between distinct groups, measured by the number of geodesics that pass through the actor. In this case, Donald Trump has the highest betweenness centrality; he lies between the politicians and the sports-related terms and serves as a broker. Similarly, Sen. Bill Cassidy has high betweenness centrality because Sen. Tim Scott can connect with others only through him. Nodes with low betweenness centrality are less likely to affect information flows if they are removed.

Weighted Betweenness Centrality

wbtwn <- betweenness(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wbtwn,decreasing=TRUE)[1:5]
        “Donald Trump”     “Sen. Bill Nelson” “Tampa Bay Buccaneers” 
              70.50000               37.50000               22.50000 
       “Steve Scalise”    “Sen. Bill Cassidy” 
              18.33333               13.00000 
wBtwnG <- g_valued
V(wBtwnG)$size <- wbtwn*.5 # adjust the number if nodes are too big
plot(wBtwnG, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences))

What does the addition of weighted degree and edge information tell you about your graph? Although Donald Trump still has the highest weighted betweenness centrality, the ranking of the other nodes changes when the calculation is weighted. For example, Sen. Bill Cassidy no longer has the second-highest betweenness centrality. The weighted results provide more information about how these actors are related in the real world, because the relations are no longer just binary but take into account how frequently the connections appear.
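One caveat when reading the weighted path-based measures: igraph's betweenness and closeness treat edge weights as distances (costs), so a larger co-occurrence count makes a tie "longer", not stronger. The sketch below illustrates this on a hypothetical three-node graph and shows one possible transformation, inverting the counts. The inversion is my assumption about how strong ties could be made short; it is not part of the lab's code.

```r
library(igraph)

# Hypothetical toy data: A-B and B-C co-occur often, A-C rarely.
edges <- data.frame(from = c("A", "B", "A"),
                    to   = c("B", "C", "C"),
                    CoOccurrences = c(10, 10, 1))
toy <- graph_from_data_frame(edges, directed = FALSE)

# Raw counts as weights: the rare A-C tie (length 1) is the shortest path,
# so B lies on no geodesic between A and C.
btw_raw <- betweenness(toy, weights = E(toy)$CoOccurrences)

# Inverted counts: frequent co-occurrence now means a short distance,
# so the route A-B-C (0.1 + 0.1) beats the direct A-C tie (1).
btw_inv <- betweenness(toy, weights = 1 / E(toy)$CoOccurrences)

print(btw_raw)
print(btw_inv)
```

Which reading is appropriate depends on the question being asked; the point of the sketch is only that the two choices can rank the same node very differently.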

Closeness Centrality

c <- closeness(g_valued)
sort(c,decreasing=TRUE)[1:5]
        “Donald Trump”   “New Orleans Saints” “Tampa Bay Buccaneers” 
            0.06666667             0.04347826             0.04347826 
   “Carolina Panthers”      “Atlanta Falcons” 
            0.04347826             0.04347826 
g5 <- g_valued
V(g5)$size <- c*500  #can adjust the number
plot(g5, layout = L, vertex.label=NA)

Briefly explain closeness centrality and why nodes are more or less central in the network. Closeness centrality measures how easily one actor can reach the rest of the network. The actor with the shortest average path length to all others has the highest closeness centrality and serves as a pulse-taker. This means that information will be transmitted much more slowly if we remove nodes with high closeness centrality. In this case, Donald Trump has the highest closeness centrality, meaning that he has the shortest average path length in the network. Without him, information would have to travel along longer paths. Apart from Donald Trump, all the nodes appear fairly high in closeness centrality; they are similar in size in the plot. Nodes with low closeness centrality would have the least influence on information flows if removed.
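The raw closeness scores above can be reproduced by hand as the reciprocal of a node's summed distances to all other nodes. The distance profiles in this sketch are my reconstruction, chosen to be consistent with the reported degrees (13 and 6) and scores; they are not output taken from the lab.

```r
n_nodes <- 15

# Assumed distances from the hub to the 14 other nodes:
# 13 direct neighbors and one node two steps away.
hub_dist <- c(rep(1, 13), 2)

# Assumed distances from an NFL team: 6 neighbors, 7 nodes two steps away,
# and 1 node three steps away.
team_dist <- c(rep(1, 6), rep(2, 7), 3)

# Unnormalized closeness is the reciprocal of the total distance.
closeness_raw <- function(d) 1 / sum(d)
hub_closeness  <- closeness_raw(hub_dist)
team_closeness <- closeness_raw(team_dist)

# Multiplying by (n - 1) gives the normalized score on a 0-to-1 scale.
hub_closeness_norm <- (n_nodes - 1) * hub_closeness

print(c(hub = hub_closeness, team = team_closeness))
print(hub_closeness_norm)
```

The two raw values match the reported 0.06666667 and 0.04347826, which suggests the printed scores are unnormalized. That also explains why they look small even for the most central node.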

Weighted Closeness Centrality

wClsnss <- closeness(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wClsnss,decreasing=TRUE)[1:5]
       “Donald Trump”    “Sen. Bill Nelson”       “Steve Scalise” 
           0.01587302            0.01515152            0.01369863 
“Sen. Johnny Isakson”   “Sen. David Perdue” 
           0.01315789            0.01315789 
wClsnssG <- g_valued
V(wClsnssG)$size <- wClsnss*1000 # adjust the number if nodes are too big
plot(wClsnssG, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences))

What does the addition of weighted degree and edge information tell you about your graph? Although Donald Trump still has the highest weighted closeness centrality, the other actors that score high on the weighted measure are all politicians rather than NFL teams. This suggests that if we consider the frequency of co-occurrence of the terms, these politicians play a more important role in effectively transmitting information to others in the network.

Eigenvector Centrality

eigc <- eigen_centrality(g_valued,directed=TRUE)
sort(eigc$vector,decreasing=TRUE)[1:5]
        “Donald Trump” “Tampa Bay Buccaneers”    “Carolina Panthers” 
             1.0000000              0.7745642              0.7745642 
       “Roger Goodell”     “Colin Kaepernick” 
             0.7745642              0.7745642 
g6 <- g_valued
V(g6)$size <- eigc$vector*50 #can adjust the number
plot(g6, layout = L, vertex.label=NA)

Briefly explain eigenvector centrality and why nodes are more or less central in the network. Eigenvector centrality scores each node by the weighted sum of its neighbors’ centralities, recomputed until the scores stabilize. The actor with the highest betweenness and/or degree centrality is often also the one with the highest eigenvector centrality, and that is the case here. Donald Trump has the highest eigenvector centrality, meaning that the nodes he connects to themselves have many links. However, several nodes from the NFL cluster also have high eigenvector centrality, probably because actors in the NFL cluster are much more interconnected due to competition between teams.
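The definition above can be made concrete with base R's eigen function on a small adjacency matrix. This is a sketch on a hypothetical four-node star, not the lab network: the scores are the entries of the leading eigenvector, rescaled so the largest equals 1.

```r
# Adjacency matrix of a hypothetical 4-node star; node 1 is the hub.
A <- matrix(0, nrow = 4, ncol = 4)
A[1, 2:4] <- 1
A[2:4, 1] <- 1

# eigen() returns eigenvalues in decreasing order, so column 1 of $vectors
# belongs to the largest eigenvalue.
eig <- eigen(A, symmetric = TRUE)
leading_value <- eig$values[1]

# The sign of an eigenvector is arbitrary, so take absolute values,
# then rescale so the top score equals 1.
lead_vec <- abs(eig$vectors[, 1])
ev_cent <- lead_vec / max(lead_vec)

print(leading_value)
print(ev_cent)
```

In the star, each leaf scores the hub's score divided by the leading eigenvalue. The same relation appears to hold in the lab output: 1 / 6.291049 is about 0.159, matching the two lowest politician scores of 0.15895601, which is consistent with those nodes being tied only to the top-scoring node.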

Analysis

Choose the visualization that you think is most interesting and briefly explain what it tells you about a central node in your network. Discuss the type of centrality, and what that node’s centrality score tells you about the search co-occurrence network. I think the most interesting one is the closeness centrality visualization. Closeness centrality measures how easily one actor can reach the rest of the network. The actor with the shortest average path length has the highest closeness centrality and serves as a pulse-taker. Based on the layout, many actors appear relatively high in closeness centrality, and apart from Donald Trump we cannot really tell who is higher. This is not the case for the other visualizations, where some actors are obviously more central. This result tells us that many actors are important for passing on information effectively in this network.

Briefly discuss an interesting difference between types of centrality for your network. One interesting difference is between degree centrality and betweenness centrality. Although Donald Trump has the highest centrality on all of these measures, NFL teams are among the highest for degree centrality, whereas senators are the highest for betweenness centrality. This probably suggests that although NFL teams have more links and appear more often in the NYT, politicians are more important in terms of whether information can be passed on.

Global Network Metrics with R

Compute the network centralization scores for your network for degree, betweenness, closeness, and eigenvector centrality.

centralization.evcent(g_valued,normalized = TRUE)
$vector
 [1] 0.77456419 0.28452963 0.20947656 0.77456419 0.29025659 0.29025659
 [7] 0.77456419 0.25123207 0.03329756 0.77456419 0.15895601 0.15895601
[13] 0.77456419 1.00000000 0.77456419

$value
[1] 6.291049

$options
$options$bmat
[1] "I"

$options$n
[1] 15

$options$which
[1] "LA"

$options$nev
[1] 1

$options$tol
[1] 0

$options$ncv
[1] 0

$options$ldv
[1] 0

$options$ishift
[1] 1

$options$maxiter
[1] 1000

$options$nb
[1] 1

$options$mode
[1] 1

$options$start
[1] 1

$options$sigma
[1] 0

$options$sigmai
[1] 0

$options$info
[1] 0

$options$iter
[1] 2

$options$nconv
[1] 1

$options$numop
[1] 11

$options$numopb
[1] 0

$options$numreo
[1] 8


$centralization
[1] 0.5904349

$theoretical_max
[1] 13

Record the centralization score of each centrality measure. Degree centralization: 0.5952381; Betweenness centralization: 0.7056515; Closeness centralization: 0.7779971; Eigenvector centralization: 0.5904349

Briefly explain what the centralization of a network is. Network centralization reflects how unequally centrality is distributed across the actors in the network. Specifically, a more centralized network is one in which a few actors have many links and the rest have few. For example, the most centralized network is the star network, where one actor has links to everyone else while the others have only one link each.
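The degree centralization score can be recomputed by hand as the summed gap between the most central node's degree and every other node's degree, divided by the largest value that sum could take. The degree sequence in this sketch is reconstructed from the cumulative degree distribution printed in Part 5. The two denominators are shown side by side because I am inferring, from the arithmetic alone, which one reproduces the recorded score.

```r
# Degree sequence reconstructed from the cumulative degree distribution in Part 5.
deg <- c(13, rep(6, 6), rep(4, 3), rep(3, 2), rep(1, 3))
n <- length(deg)

# Summed gap between the most central node and every node.
spread <- sum(max(deg) - deg)

# Freeman's maximum for a simple graph is reached by a star: (n - 1) * (n - 2).
cent_star_max <- spread / ((n - 1) * (n - 2))

# A larger maximum, n * (n - 1), reproduces the recorded 0.5952381.
cent_loops_max <- spread / (n * (n - 1))

print(spread)
print(c(star_max = cent_star_max, larger_max = cent_loops_max))
```

The choice of denominator changes the score noticeably (roughly 0.69 versus 0.60), so it is worth stating which convention was used when comparing centralization across networks or software.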

Compare the centralization scores above with the graphs you created where the nodes are scaled by centrality. Describe the appearance of more centralized v. less centralized networks. Centralization scores range from 0 to 1, where 0 means least centralized and 1 means most centralized. In this case, the network is most centralized in terms of closeness (0.7779971) and least centralized in terms of eigenvector centrality (0.5904349). Overall, the network is fairly centralized.

Part 5. Power Laws & Small Worlds (20)

Power Laws

Networks often demonstrate power law distributions. Plot the degree distribution of the nodes in your base graph.

deg_distr <-degree.distribution(g_valued, cumulative=T, mode="all")
deg_distr
 [1] 1.00000000 1.00000000 0.80000000 0.80000000 0.66666667 0.46666667
 [7] 0.46666667 0.06666667 0.06666667 0.06666667 0.06666667 0.06666667
[13] 0.06666667 0.06666667
plot(deg_distr, ylim=c(.01,10), bg="black",pch=21, xlab="Degree", ylab="Cumulative Frequency") 
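Because the distribution above is cumulative (the share of nodes with degree at least k), the number of nodes at each exact degree can be recovered by differencing. A minimal sketch, retyping the printed values by hand:

```r
# Cumulative degree distribution copied from the output above (degrees 0 to 13).
deg_distr <- c(1, 1, 0.8, 0.8, 0.66666667, 0.46666667, 0.46666667,
               rep(0.06666667, 7))
n_nodes <- 15
k <- seq_along(deg_distr) - 1

# The share with degree exactly k is P(degree >= k) minus P(degree >= k + 1).
share <- -diff(c(deg_distr, 0))
node_counts <- round(share * n_nodes)
names(node_counts) <- k

# Show only the degrees that actually occur.
print(node_counts[node_counts > 0])

# The degrees should sum to twice the number of links.
total_degree <- sum(k * node_counts)
print(total_degree)
```

The reconstructed counts give one node of degree 13 and six nodes of degree 6, in line with the degree centrality results in Part 4, and the degree total of 70 equals twice the 35 links.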

Test whether it’s approximately a power law, estimate log f (k) = log a − c log k. “This says that if we have a power-law relationship, and we plot log f (k) as a function of log k, then we should see a straight line: −c will be the slope, and log a will be the y-intercept. Such a “log-log” plot thus provides a quick way to see if one’s data exhibits an approximate power-law: it is easy to see if one has an approximately straight line, and one can read off the exponent from the slope.” (E&K, Chapter 18, p.546).

power <- power.law.fit(deg_distr)
power
$continuous
[1] TRUE

$alpha
[1] 3.069992

$xmin
[1] 0.4666667

$logLik
[1] 1.343847

$KS.stat
[1] 0.2920275

$KS.p
[1] 0.502511
plot(deg_distr, log="xy", ylim=c(.01,10), bg="black",pch=21, xlab="Degree", ylab="Cumulative Frequency")

Does your network exhibit a power law distribution of degree centrality? The network does not really exhibit a power law distribution of degree centrality, because the log-log plot does not form a straight line from which a slope could easily be read.
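The straight-line check described in the E&K quotation can be sketched with an ordinary least-squares fit on the log-log values. This is a rough diagnostic under my own assumptions (degree 0 is dropped because its logarithm is undefined, and the cumulative values are retyped from the output above); it is not a formal power-law test.

```r
# Cumulative degree distribution copied from the output above (degrees 0 to 13).
deg_distr <- c(1, 1, 0.8, 0.8, 0.66666667, 0.46666667, 0.46666667,
               rep(0.06666667, 7))
k <- seq_along(deg_distr) - 1

# Drop degree 0, whose logarithm is undefined.
keep <- k > 0
log_k <- log(k[keep])
log_f <- log(deg_distr[keep])

# Fit log f(k) = log a - c * log k; the slope estimates -c.
fit <- lm(log_f ~ log_k)
slope <- unname(coef(fit)[2])
r_squared <- summary(fit)$r.squared

print(slope)
print(r_squared)
```

An R-squared well below 1 would support the visual impression that the points do not fall on a line. Separately, power.law.fit is documented as taking the data vector itself, so passing the raw degrees (for example `degree(g_valued)`) rather than the cumulative distribution may be closer to what that fit expects.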

Small Worlds

Networks often demonstrate small world characteristics. Compute the average clustering coefficient (ACC) and the characteristic path length (CPL). ACC = 0.8269915; CPL = 1.771429

average.path.length(g_valued)
[1] 1.771429

Compute the ACC and CPL for 100 random networks with the same number of nodes and ties as your test network. ACC = 0.8893372 CPL = 1.110294

accSum <- 0
cplSum <- 0
for (i in 1:100){
  grph <- erdos.renyi.game(numVertices, numEdges, type = "gnm")
  accSum <- accSum + transitivity(grph, type = c("average"))
  cplSum <- cplSum + average.path.length(grph)
}
accSum/100
[1] 0.8893372
cplSum/100
[1] 1.110294

Based on these data, would you conclude that the observed network demonstrates small world properties? Why or why not? I would say the observed network does not demonstrate small world properties. To be a small world network, a network should have a high local clustering coefficient and a low average path length between actors. In the observed network, ACC = 0.8279915 and CPL = 1.771429. Compared with the randomly generated networks, the ACC of the observed network is lower and its CPL is higher. Therefore, the observed network does not demonstrate small world properties.

Wrapping up

To complete the lab, make sure output/previews have been generated for each block of code. Then click the “Publish” button on the upper right hand corner of this screen and sign up for an RPubs account. Submit the URL of the published, completed lab on Canvas.

---
title: 'Descriptive Analytic Exercise 1: Visualizing and Interpreting Networks'
output:
  html_notebook: default
  html_document: default
  pdf_document: default
  word_document: default
---
**SNA Grad Seminar, Fall 2017**
**Due:** October 24th, 11:59 pm
**Name of Student**: Jue Wu

The purpose of this lab is to develop your familiarity conducting descriptive network analysis using the statistical software package R. This assignment will make use of a data set you collect by defining a search query (a collection of your user-defined search terms) from the *[New York Times](https://www.nytimes.com)*'s Article Search [Application Programming Interface](https://en.wikipedia.org/wiki/Application_programming_interface). Networks are generated from the co-occurrences between search terms included in the same search query. For example, a link exists between “apple” and “orange” if there are articles in the *New York Times* that contained these two terms.  You will be visualizing and interpreting individual and global network properties of this network.

You will be graded primarily on the completeness and accuracy of your responses, but the clarity of the prepared report will also affect your grade.  While students may work together to perform the analysis, each student must submit his or her own report and is responsible for writing the narrative in the report. You must answer all of the bolded questions.

# Part 1: Collect Network Data (20 pts)

For this lab, you will search the *New York Times*, save that data, create networks from that data, compare the differences among networks, and demonstrate your proficiency with basic network descriptive statistics.

## Loading and Installing Packages, Set Working Directory

When working with R, you should run each line of code individually, unless it is part of a function definition, so you can see the results. Generally speaking, any line of code that includes '{' (the beginning of a function definition) should be run with all the other lines until you hit '}'.

```{r}
# Lines that start with a hashtag/pound symbol, like this one, are comment lines. Comment lines are ignored by R when it is interpreting code.
# You only need to install packages once. Remove the # in front of each line and then run it to install each package. After successful installation, delete the line of code or replace the #s so the R Notebook doesn't run into problems.
# install.packages('magrittr', repos = "https://cran.rstudio.com")
# install.packages('igraph', repos = "https://cran.rstudio.com")
# install.packages('httr', repos = "https://cran.rstudio.com")
# install.packages('data.table', repos = "https://cran.rstudio.com")
# install.packages('dplyr', repos = "https://cran.rstudio.com")
# install.packages('xml2', repos = "https://cran.rstudio.com")
# You need to load packages every time you run the script or restart R.
library(magrittr)
library(httr)
library(data.table)
library(igraph)
library(dplyr)
library(xml2)
# Set your directory for the project
# You can either enter your filename path within the parentheses below and remove the # creating the comment, or select "Session > Set Working Directory ... Source File Location" in R Studio.
# setwd("Input Directory")
```

## Choose a topic for your search terms

You can decide search terms based on personal interests, research interests, or popular topical areas, among others. You have flexibility in selecting your search term list. For example, you can search for some commercial brands, celebrities, countries, universities, etc. It will be most useful if you choose a collection of words that are not all extremely common. Think about a set of words that might have interesting co-occurrences in articles within the *New York Times* website. For example, you might be interested in the last names of every Senator involved in a certain political debate, football teams, or cities and their co-occurrence in news articles. Generally speaking, proper nouns are best, but you might have compelling reasons to choose verbs or adjectives. You might want to throw a couple of terms in that aren't thematically related to make sure you don't get a totally connected component. The more interesting your network is in terms of differing centrality, distinct components, etc., the easier it will be to do the written analysis. Keep in mind that the Article Search archive is very large; many terms co-occur. You might want to consider two tenuously related subjects. The example file uses four football teams and their home senators, plus a few topical terms.

## Create your text input

Create a plain text file with .txt extension in the same directory as the R Markdown Notebook used in this assignment. Make a note of the file name for use in the next code snippet. Place one search term per line, and use 15–20 terms.  You'll also likely want to add quotation marks around your search terms to ensure that you're only receiving results for the complete term. NOTE: The function will process your terms so that they work in the URL request. You do not need to encode non-alphabetic characters.

The text file cannot include any additional information or characters and it must be a .txt file; Word or RTF documents won’t work.

## Analysis

**a.	Provide a high level overview of the terms you included in the search query.**
The terms in the search query include NFL teams, players, and officers (New Orleans Saints, Tampa Bay Buccaneers, Carolina Panthers, Atlanta Falcons, Roger Goodell, Colin Kaepernick), and political figures (Steve Scalise, Sen. Bill Cassidy, Sen. Bill Nelson, Sen. Marco Rubio, Sen. Lindsey Graham, Sen. Tim Scott, Sen. Johnny Isakson, Sen. David Perdue, Donald Trump).

**b.	Why did you choose this collection of terms?  Were there some specific overarching question—intellectual or extracurricular curiosity—that motivated this collection of terms?**
I used the provided dataset, but I was curious about how people and organizations in politics and sports relate to each other. For example, what are the political leanings of teams and people in the NFL? Are there financial interests linking sports teams and political figures?

**c.	How did you decide which terms to use in the search query? Were these terms you intuitively deemed important? Were they culled from a specific source or the result of some separate analysis or search query?**
Originally I was interested in the diplomatic relations between countries (especially how US views China and whether it is biased or not), but I couldn't successfully collect the data I wanted. In order to finish the assignment, I just used the provided dataset. 

**d.	What are the insights you hope to glean by looking at the network of terms in terms of individual node metrics, sub-grouping of nodes, overall global network properties?**
My guess is that Donald Trump will have the most links and the highest degree centrality, because he is the President of the United States and every term in the query is American. At the sub-group level, I expect two communities, one set of NFL-related nodes and one set of politics-related nodes, with most interaction happening within each community. At the global level, I expect the network to be centralized.

## Working with the API to Collect Your Data
The *New York Times* controls access to its API by assigning each user a key. Each key has a limited number of calls that can be made within a certain time period. You can read more about the limitations of the API system [here](http://developer.nytimes.com/article_search_v2.json#).

You will need to create your own API key to complete this assignment. Go to the *New York Times* [developers page](https://developer.nytimes.com/signup) and request a key. You will copy that key (received via email) into the api variable below.

```{r, eval = FALSE}
# Import your word list
name_of_file <- "NFL.txt" # Creates a variable called name_of_file that you should populate with the name of your text file between quotation marks.
word_list <- read.table(name_of_file, sep = "\n", stringsAsFactors = F) %>% unlist %>% as.vector # Reads the content of your file into a variable.
num_words <- length(word_list) # Creates a variable with the number of words in your list.
url_base <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
# When you receive the email with your API key, paste it below between the quotation marks.
api <- '76f06c3d16c54280b9233d8f3d76e4bf'
```

Our first function will gather all of the search terms and their number of hits to be placed in a table. All lines of a function should be run together.

```{r, eval = FALSE}
Get_hits_one <- function(keyword1) {
  Sys.sleep(time=3)
  url <- paste0(url_base, "?api-key=", api, "&q=", URLencode(keyword1),"&begin_date=","20160101") # Begin date is in format YYYYMMDD; you can change it if you want only more recent results, for example.
  # The number of results
  print(keyword1)
  hits <- content(GET(url))$response$meta$hits %>% as.numeric
  print(hits)
  # Put results in table
  c(SearchTerm=keyword1,ResultsTotal=hits)
}
```

Now we will invoke our function to put information from the API into our global environment.

```{r, eval = FALSE}
#Create a table of your words and their number of results.
total_table <- t(sapply(word_list,Get_hits_one))
total_table <- as.data.frame(total_table)
total_table$ResultsTotal <- as.numeric(as.character(total_table$ResultsTotal))
```

If you get zero hits for any of these terms, you should substitute something else for that term and rerun the lab up to this point.

Next, we will define the function that will collect the article co-occurrence network.

```{r, eval = FALSE}
Get_hits_two <- function(row_input) {
  keyword1 <- row_input[1]
  keyword2 <- row_input[2]
  url <- paste0(url_base, "?api-key=", api, "&q=", URLencode(keyword1),"+", URLencode(keyword2),"&begin_date=","20160101") #match w/ Begin Date in Get_hits_one.
  # The number of results
  print(paste0(keyword1," ",keyword2)) 
  hits <- content(GET(url))$response$meta$hits %>% as.numeric
  print(hits)
  Sys.sleep(time=3)
  # Put results in table
  c(SearchTerm1=keyword1,SearchTerm2=keyword2,CoOccurrences=hits)
} 
```

In this next step, we will call the API and collect the co-occurrence network. This may take some time. If you receive "numeric(0)" in any of your responses, you've likely hit your API key limit and will either need to wait for the calls to reset (24 hours) or request a new key. If you receive the error message "$ operator is invalid for atomic vectors," you have also hit the API call limit. This could be due to running the script multiple times, or due to hitting too many results based on very common search terms. Request a new API key, shorten your word list, and try again. Don't forget that you need to reload your word list from the first part of the lab in order to get a different set of results! You must also rerun the code that assigns the API key. If none of your results come back as "0," every pair of terms co-occurs and the network will be fully connected, so you might want to redo your search with less closely related words.
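
Both failure modes described above come from indexing into a response that does not have the expected shape. The following is a minimal sketch of a guard, assuming the parsed response is the nested list that `content()` returns; `extract_hits` is a hypothetical helper and is not part of the lab code.

```r
# Hypothetical helper: pull the hit count out of a parsed API response,
# returning NA instead of failing when an expected field is missing.
extract_hits <- function(parsed) {
  node <- parsed
  for (key in c("response", "meta", "hits")) {
    # A rate-limited or failed call may not parse to a nested list at all.
    if (!is.list(node) || is.null(node[[key]])) return(NA_real_)
    node <- node[[key]]
  }
  if (length(node) != 1) return(NA_real_)
  as.numeric(node)
}

# A well-formed response and two malformed ones (made-up examples).
ok_response  <- list(response = list(meta = list(hits = 42L)))
bad_response <- list(message = "API rate limit exceeded")

extract_hits(ok_response)
extract_hits(bad_response)
extract_hits("not a list")
```

With a guard like this, a rate-limited call shows up as `NA` in the results table rather than stopping the whole `apply()` loop, so the affected pairs can be re-queried later.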

```{r, eval = FALSE}
# Build a table of all unordered pairs of search terms (each pair appears once)
pairs_list <- t(combn(word_list,2))
#Create a network table, run the Get_hits_two function using the pairs lists
network_table <- t(apply(pairs_list,1,Get_hits_two))
#Convert the network table into a dataframe
network_table <- as.data.frame(network_table)
# Read the content of each item within the $CoOccurrences factor as characters, 
# then force those characters into the "numeric" or "double" type.
network_table$CoOccurrences <- as.numeric(as.character(network_table$CoOccurrences))
# Convert data to data.table type.
total_table <- as.data.table(total_table)
network_table <- as.data.table(network_table)

# Remove zero edges from your network
network_table <- network_table[!CoOccurrences==0] 

# Create a graph object with your data
g_valued <- graph_from_data_frame(d = network_table[,1:3,with=FALSE],directed = FALSE,vertices = total_table)

# If you're having trouble with data collection, you can load the 'NFL Lab Results.RData' file now by clicking the open folder icon on the "Environment" tab and continue the lab from here. You'll need to figure out what the significance of the terms is yourself, however.
# You should save your data at this point by clicking the floppy disk icon under the "Environment" tab.
```

## Analysis

**Is the graph directed or undirected?** 
Undirected

**How many nodes and links does your network have?**
There are 15 nodes and 35 links.
```{r}
numVertices <- vcount(g_valued)
numVertices
numEdges <- ecount(g_valued)
numEdges
```

**What is the number of possible links in your network?**
There are 105 possible links.
```{r}
maxEdges <- numVertices*(numVertices-1)/2
maxEdges
```

**What is the density of your network?** 
The density is 0.3333333
```{r}
graphDensity <- numEdges/maxEdges # manual calculation
graphDensity
graphDensity1 <- graph.density(g_valued) # using the graph.density function from igraph
graphDensity1
```

**Briefly describe how your choice of dataset may influence your findings.**  What differences would you expect if you use different search terms? Are the current search terms related to one another? If so, how? Do you think the limitation to one word might skew your answers? (e.g., if you’re interested in Hillary Clinton, but you include “Clinton” as a term, you will get stories that mention Chelsea, Bill, & even P-Funk Allstar George Clinton).

Including full names instead of single words was helpful because it reduces the chance of matching unwanted articles. Also, including "Sen." in front of people's names helped ensure that the results are about the politicians we intended. However, writers may also use "President Trump" when referring to Donald Trump, so searching for "Donald Trump" might yield fewer hits than the true number of relevant articles.

# Part 2: Visualize Your Network (20 points)

Let's start by visualizing the network that we've collected from the *New York Times* Article Search API. We'll need to choose node colors and set a layout. You can learn more about Fruchterman Reingold layout and other layouts [here](http://igraph.org/r/doc/layout_with_fr.html).

```{r}
## Learn more about plotting with igraph
?? igraph.plotting
colbar = rainbow(length(word_list)) ## we are selecting different colors to correspond to each word
V(g_valued)$color = colbar
# Set layout here 
L = layout_with_fr(g_valued)  # Fruchterman Reingold
plot(g_valued,vertex.color=V(g_valued)$color, layout = L, vertex.size=6) 

```
## Analysis
**In a paragraph, describe the macro-level structure of your graphs based on the Fruchterman Reingold visualization.**
Is it a giant, connected component, are there distinct sub-components, or are there isolated components?  Can you recognize common features of the subcomponents?  Does this visualization give you any insight into the co-occurrence patterns of the search-terms?  If yes, what? If not, why?

The network is one giant component in which every node is connected. At first sight, the visualization suggests that Donald Trump is a cutpoint: he connects the politics camp and the sports camp, and removing him would split the network into two components. In addition, the NFL-related terms are more densely connected with one another than the political terms are.

Now we'll create a second visualization using a different layout.
```{r}
## You can change the layout by picking one of the other options. Uncomment one of the lines below by erasing the # and running the line. Try to find a layout that gives you different information than Fruchterman Reingold.

 L = layout_with_dh(g_valued) ## Davidson and Harel

# L = layout_with_drl(g_valued) ## Force-directed

# L = layout_with_kk(g_valued) ## Spring
plot(g_valued,vertex.color=V(g_valued)$color, layout = L, vertex.size=6) 
```
## Analysis

**In a paragraph, compare and contrast the information given to you by the two different layouts.**
Both layouts suggest that Donald Trump has the highest degree centrality and that he is the cutpoint. However, the sports community appears less centralized in the second layout than in the first.

# Part 3: Community Detection Analysis with R (20 Points)

Identifying subgroups within a network is of great interest to social network researchers, so a variety of algorithms have been developed to identify and measure subgroups.  We will use some of R’s built-in tools to identify subgroups and central nodes for visual inspection.

For the remainder of the visualizations we will use the Fruchterman Reingold layout.
```{r}
L = layout_with_fr(g_valued) 
```

Cluster the nodes in your network.
```{r}
# Learn more about the clustering algorithm.
?? cluster_walktrap
cluster <- cluster_walktrap(g_valued)
# Find the number of clusters
membership(cluster)   # affiliation list
length(sizes(cluster)) # number of clusters
# Find the size of each cluster 
# Note that communities with one node are isolates, or have only a single tie
sizes(cluster) 
```

**How many communities have been created?**
3

**How many nodes are in each community?**
In networks containing node attribute information, we can often gain insight into a network by looking at the nodes that get placed in the same partition. 

There are 7, 2, and 6 nodes in each community.

**For your network, what might each cluster of nodes potentially have in common? Describe each cluster, its membership, and the relationship between nodes in the cluster.**
Cluster 1 is all politicians, including several senators, the President, and the House Majority Whip. The relationship between these nodes is probably work. Cluster 2 is two senators. They are both Republicans, but because I'm not familiar with the politics, I don't know why they are placed in a separate cluster from Cluster 1. My guess is that they might hold different opinions from the other politicians. Cluster 3 is all NFL related, including NFL teams, an officer, and a player. The relationship between these nodes is probably competition and service.

Next we visualize the clusters by coloring nodes according to their modularity class. 
```{r}
plot(cluster, g_valued, col = V(g_valued)$color, layout = L, vertex.size=6)
```

**What information does this layout convey?  Are the clusters well-separated, or is there a great deal of overlap? Is it easier to identify the common themes among clusters in this layout rather than looking only at the graphs?**
The clusters are well separated. It is easier to identify the common themes among clusters because the layout shows directly which actors are in the same cluster.

**What differences are there between nodes in the same cluster and across clusters?**
Nodes in the same cluster are more densely connected to each other and are separated by shorter distances, whereas nodes in different clusters are less connected and reaching them requires longer paths.

**Describe the brokers between any components and cliques.  What are common features of these brokers?  About how many brokers would you have to remove from your network to "shatter" it into two or more disconnected components?**
Brokers are the actors who have high betweenness centrality and control the group’s information flows. Based on the layout, Donald Trump is the broker for Cluster 1: without him, everyone else in Cluster 1 would be unable to get information from outside. Sen. Bill Cassidy is the broker for Cluster 2: without him, Sen. Tim Scott would not get information. As for Cluster 3, everyone is directly connected to Donald Trump and they are interconnected as well, so no one is the broker. The common feature of these brokers is that the members of their cluster would lose access to outside information if the broker were removed. To break the network into more disconnected components, we would only need to remove one broker, either Donald Trump or Sen. Bill Cassidy.

# Part 4: Centrality Visualization & Weighted Values (20 Points)

For each network, you will use centrality metrics to improve your visualization. You may need to adjust the size parameter to make your network more easily visible.

## Degree Centrality
```{r}
totalDegree <- degree(g_valued,mode="all")
sort(totalDegree,decreasing=TRUE)[1:5]
g2 <- g_valued
V(g2)$size <- totalDegree*2 #can adjust the number if nodes are too big
plot(g2, layout = L, vertex.label=NA)
```
**Briefly explain degree centrality and why nodes are more or less central in the network.**
Degree centrality measures the "popularity" of an actor by counting the total number of links it has with other actors. In this case, Donald Trump has the highest degree centrality, probably because he is the President and every person and team in the query is in this country. The four NFL teams all have relatively high degree centrality, and they have exactly the same number of links. This is probably because they play games against one another, and the number of games between these teams should be the same.

## Weighted Degree Centrality
```{r}
wd <- graph.strength(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wd,decreasing=TRUE)[1:5]
wg2 <- g_valued
V(wg2)$size <- wd*.1 # adjust the number if nodes are too big
plot(wg2, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge. Otherwise edges tend to cover up all the other edges and obscure the relationships.
```
**What does the addition of weighted degree and edge information tell you about your graph?**
The weighted results are quite different from the unweighted ones. The weighted measure takes into account how frequently these terms co-occurred in the NYT, instead of treating a relation as simply present or absent. NFL-related terms have the highest weighted degree centrality, probably because the teams play against one another quite often and therefore co-occur more. The unweighted centrality, by contrast, only tells us whether there is a link between two nodes.

## Betweenness Centrality
```{r}
b <- betweenness(g_valued,directed=TRUE)
sort(b,decreasing=TRUE)[1:5]
g4 <- g_valued
V(g4)$size <- b*1.2#can adjust the number
plot(g4, layout = L, vertex.label=NA)
```
**Briefly explain betweenness centrality and why nodes are more or less central in the network.**
Betweenness centrality assesses how much an actor lies between distinct groups by counting the number of geodesics that pass through the actor. In this case, Donald Trump has the highest betweenness centrality: he lies between the politicians and the sports-related terms and serves as a broker. Similarly, Sen. Bill Cassidy also has high betweenness centrality because Sen. Tim Scott is only able to connect with others through him. Nodes with lower betweenness centrality are less likely to affect information flows if they are removed.

### Weighted Betweenness Centrality
```{r}
wbtwn <- betweenness(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wbtwn,decreasing=TRUE)[1:5]
wBtwnG <- g_valued
V(wBtwnG)$size <- wbtwn*.5 # adjust the number if nodes are too big
plot(wBtwnG, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge.
```
**What does the addition of weighted degree and edge information tell you about your graph?**
Although Donald Trump is still the highest in weighted betweenness centrality, the ranking of the other nodes changes when the calculation is weighted. For example, Sen. Bill Cassidy is no longer the second highest in betweenness centrality. The weighted results provide more information about how these actors are related in the real world because the relations are no longer just binary but take into account how frequently the connections appear.
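
One point to keep in mind when reading the weighted betweenness and closeness results: igraph interprets edge weights in these functions as distances, so a larger weight means a longer path. The sketch below uses a made-up three-node network (not the lab data) to show how raw co-occurrence counts and inverted counts can give different brokers. Inverting the counts is one common choice, offered here as an assumption rather than as the lab's intended method.

```r
library(igraph)

# Toy co-occurrence network: A-B and B-C co-occur often, A-C rarely.
edges <- data.frame(
  from = c("A", "B", "A"),
  to   = c("B", "C", "C"),
  CoOccurrences = c(50, 40, 2)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Raw counts as weights: the rare A-C tie is the "shortest" route,
# so B sits on no shortest path between A and C.
b_raw <- betweenness(g, weights = E(g)$CoOccurrences)

# Inverted counts as weights: frequent ties become short distances,
# so the A-B-C route is preferred and B becomes the broker.
b_inv <- betweenness(g, weights = 1 / E(g)$CoOccurrences)

rbind(raw = b_raw, inverted = b_inv)
```

Which version is appropriate depends on whether frequent co-occurrence should count as closeness or as cost in the question being asked.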

## Closeness Centrality
```{r}
c <- closeness(g_valued)
sort(c,decreasing=TRUE)[1:5]
g5 <- g_valued
V(g5)$size <- c*500  #can adjust the number
plot(g5, layout = L, vertex.label=NA)
```
**Briefly explain closeness centrality and why nodes are more or less central in the network.**
Closeness centrality measures how easily one actor can reach the rest of the network. The actor with the shortest average path length to all others has the highest closeness centrality and serves as the pulse-taker. This means that information will be transmitted much more slowly if we remove nodes with high closeness centrality. In this case, Donald Trump has the highest closeness centrality, meaning that he has the shortest average path length in the network. Without him, information flows would slow down considerably. Other than Donald Trump, all the nodes appear to be fairly high in closeness centrality, since they are similar in size in the layout. Nodes with lower closeness centrality have the least influence on information flows when removed.

### Weighted Closeness Centrality

```{r}
wClsnss <- closeness(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wClsnss,decreasing=TRUE)[1:5]
wClsnssG <- g_valued
V(wClsnssG)$size <- wClsnss*1000 # adjust the number if nodes are too big
plot(wClsnssG, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge.
```
**What does the addition of weighted degree and edge information tell you about your graph?**
Although Donald Trump remains the highest in weighted closeness centrality, the other actors that are high in the weighted measure are all politicians instead of NFL teams. This suggests that if we consider the frequency of co-occurrence of the terms, these politicians play a more important role in effectively transmitting information to others in the network.

## Eigenvector Centrality
```{r}
eigc <- eigen_centrality(g_valued,directed=TRUE)
sort(eigc$vector,decreasing=TRUE)[1:5]
g6 <- g_valued
V(g6)$size <- eigc$vector*50 #can adjust the number
plot(g6, layout = L, vertex.label=NA)
```

**Briefly explain eigenvector centrality and why nodes are more or less central in the network.**
Eigenvector centrality is computed by repeatedly recalculating each node’s score as the weighted sum of its neighbors’ centralities. The actor with the highest betweenness and/or degree centrality is often the actor with the highest eigenvector centrality, and that is the case here. Donald Trump has the highest eigenvector centrality, meaning that the nodes he connects to themselves have many links. However, several nodes from the NFL cluster also have high eigenvector centrality, probably because actors in the NFL cluster are much more interconnected due to competition between teams.
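
The "recompute each score from the neighbors' scores" idea can be sketched as a power iteration in base R. The adjacency matrix below is made-up toy data with hypothetical node names, and the loop is an illustration of the idea rather than the routine igraph uses internally.

```r
# Toy undirected network: a hub tied to b, c, and a leaf; b and c are also tied.
nodes <- c("hub", "b", "c", "leaf")
A <- matrix(c(0, 1, 1, 1,
              1, 0, 1, 0,
              1, 1, 0, 0,
              1, 0, 0, 0),
            nrow = 4, byrow = TRUE, dimnames = list(nodes, nodes))

score <- rep(1, nrow(A))
for (i in 1:100) {
  score <- as.vector(A %*% score)  # each node sums its neighbors' scores
  score <- score / max(score)      # rescale so the top node scores 1
}
names(score) <- nodes
round(score, 3)
```

The leaf has only one tie, but because that tie is to the hub it still receives a non-trivial score, which is the sense in which eigenvector centrality rewards being connected to well-connected nodes.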

## Analysis
**Choose the visualization that you think is most interesting and briefly explain what it tells you about a central node in your network. Discuss the type of centrality, and what that node’s centrality score tells you about the search co-occurrence network.**
I think the most interesting one is closeness centrality. Closeness centrality measures how easily one actor can reach the rest of the network. The actor with the shortest average path length in the network has the highest closeness centrality and serves as the pulse-taker. Based on the layout, many actors appear to be relatively high in closeness centrality, and apart from Donald Trump we cannot really tell who is higher. This is not the case for the other layouts, where some actors are obviously more central. This result tells us that many actors are important for effectively passing on information in this network.

**Briefly discuss an interesting difference between types of centrality for your network.**
One interesting difference is degree centrality versus betweenness centrality. Although Donald Trump turns out to have the highest centrality on all these measures, NFL teams are among the highest for degree centrality, whereas senators are the highest for betweenness centrality. This probably suggests that although NFL teams have more links and appear more often in the NYT, politicians are more important in terms of whether information can be passed on.

## Global Network Metrics with R

Compute the network centralization scores for your network for degree, betweenness, closeness, and eigenvector centrality.

```{r}
# Degree centralization
centralization.degree(g_valued,normalized = TRUE)

# Betweenness centralization
centralization.betweenness(g_valued,normalized = TRUE)

# Closeness centralization 
centralization.closeness(g_valued,normalized = TRUE)

# Eigenvector centralization 
centralization.evcent(g_valued,normalized = TRUE)

```
**Record the centralization score of each centrality measure.**
Degree centralization: 0.5952381
Betweenness centralization: 0.7056515
Closeness centralization: 0.7779971
Eigenvector centralization: 0.5904349

**Briefly explain what the centralization of a network is.**
Network centralization reflects how unequal the actors in the network are in terms of centrality. Specifically, a more centralized network is one in which a few actors have many links while most actors have few. For example, the most centralized network is the star network, where one actor has links to everyone else while each of the others has only one link.
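
The star example can be checked with a short base R sketch of Freeman's degree centralization. The helper `degree_centralization` is hypothetical and assumes a simple undirected graph with no self-loops; igraph's `centralization.degree` may use a different denominator by default, so its values need not match this one exactly.

```r
# Freeman degree centralization from a vector of node degrees
# (assumes an undirected graph with no loops or multi-edges).
degree_centralization <- function(deg) {
  n <- length(deg)
  stopifnot(n > 2)
  # Sum of gaps to the most central node, divided by the value
  # that sum takes in a star graph of the same size.
  sum(max(deg) - deg) / ((n - 1) * (n - 2))
}

star_deg <- c(4, 1, 1, 1, 1)  # one hub tied to four leaves
ring_deg <- c(2, 2, 2, 2, 2)  # five nodes in a cycle

degree_centralization(star_deg)  # 1: maximally centralized
degree_centralization(ring_deg)  # 0: every node is equally central
```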

**Compare the centralization scores above with the graphs you created where the nodes are scaled by centrality. Describe the appearance of more centralized v. less centralized networks.**
Centralization scores range from 0 to 1, where 0 means least centralized and 1 means most centralized. In this case, the network is most centralized in terms of closeness and least centralized in terms of eigenvector centrality. Overall, the network is fairly centralized: every score is above 0.5.

# Part 5: Power Laws & Small Worlds (20 Points)

## Power Laws
Networks often demonstrate power law distributions. Plot the degree distribution of the nodes in your base graph. 
```{r}
# Calculate degree distribution
deg <- degree(g_valued,v=V(g_valued), mode="all")
deg

# Degree distribution is the cumulative frequency of nodes with a given degree
deg_distr <-degree.distribution(g_valued, cumulative=T, mode="all")
deg_distr
plot(deg_distr, ylim=c(.01,10), bg="black",pch=21, xlab="Degree", ylab="Cumulative Frequency") #You may need to adjust the ylim to a larger or smaller number to make the graph show more data.
```

To test whether the distribution is approximately a power law, estimate $\log f(k) = \log a - c \log k$. “This says that if we have a power-law relationship, and we plot $\log f(k)$ as a function of $\log k$, then we should see a straight line: $-c$ will be the slope, and $\log a$ will be the y-intercept. Such a ‘log-log’ plot thus provides a quick way to see if one’s data exhibits an approximate power-law: it is easy to see if one has an approximately straight line, and one can read off the exponent from the slope.” (E&K, Chapter 18, p. 546).

```{r}
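# Fit a power law to the observed degrees. power.law.fit expects the raw data,
# not the cumulative distribution; isolates are dropped because the fit needs positive values.
power <- power.law.fit(deg[deg > 0])
power

# Log-log plot of the cumulative degree distribution; a power law would appear as a roughly straight line
plot(deg_distr, log = "xy", ylim = c(.01, 10), bg = "black", pch = 21, xlab = "Degree", ylab = "Cumulative Frequency")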
```
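
The straight-line check quoted above can also be done numerically. The sketch below uses synthetic data that follows an exact power law with made-up parameters (`true_a` and `true_c` are assumptions for illustration) and recovers them with an ordinary least-squares fit on the log-log scale. Real degree data is noisy, so this regression is a rough diagnostic and not a replacement for `power.law.fit`.

```r
# Synthetic frequencies that follow f(k) = a * k^(-c) exactly (hypothetical parameters)
true_a <- 2
true_c <- 2.5
k <- 1:50
f <- true_a * k^(-true_c)

# Regress log f(k) on log k: the slope estimates -c, the intercept estimates log a
loglog_fit <- lm(log(f) ~ log(k))
est_c <- -unname(coef(loglog_fit)[2])
est_a <- exp(unname(coef(loglog_fit)[1]))

est_c
est_a
```

On real data, curvature in the log-log plot or a poor linear fit suggests the distribution is not well described by a power law.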

**Does your network exhibit a power law distribution of degree centrality?**
The network does not clearly exhibit a power law distribution of degree centrality: on the log-log plot, the points do not fall along a straight line from which a slope could be read.

## Small Worlds

Networks often demonstrate small world characteristics. Compute the average clustering coefficient (ACC) and the characteristic path length (CPL).

ACC = 0.8269915; CPL = 1.771429
```{r}
# Average clustering coefficient (ACC)
transitivity(g_valued, type = "average")

# Characteristic path length (CPL)
average.path.length(g_valued)
```

Compute the ACC and CPL for 100 random networks with the same number of nodes and ties as your test network.

ACC = 0.8893372; CPL = 1.110294
```{r}
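# Running totals of ACC and CPL across the random networks
accSum <- 0
cplSum <- 0
for (i in 1:100) {
  # G(n, m) random graph; numVertices and numEdges are assumed to hold the
  # node and tie counts of the observed network
  grph <- erdos.renyi.game(numVertices, numEdges, type = "gnm")
  accSum <- accSum + transitivity(grph, type = "average")
  cplSum <- cplSum + average.path.length(grph)
}
# Mean ACC and CPL over the 100 random networks
accSum / 100
cplSum / 100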
```

**Based on these data, would you conclude that the observed network demonstrates small world properties? Why or why not?**
I would say the observed network does not demonstrate small world properties. A small world network combines a high clustering coefficient with a short average path length: its clustering is much higher than that of a comparable random network, while its path length stays about as short. In the observed network, ACC ≈ 0.83 and CPL ≈ 1.77, whereas the 100 random networks average ACC ≈ 0.89 and CPL ≈ 1.11. The ACC of the observed network is lower than that of the random networks, and its CPL is higher. Therefore, the observed network does not demonstrate small world properties.
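
For contrast, the sketch below shows the pattern a small world network would be expected to produce. It compares a simulated Watts-Strogatz network with a random network of the same size; the seed, the network size, and the rewiring probability are arbitrary assumptions chosen for illustration, and the code uses the newer igraph function names (`sample_smallworld`, `sample_gnm`, `mean_distance`).

```r
library(igraph)

set.seed(42)  # arbitrary seed so the sketch is reproducible

# Watts-Strogatz network: 100 nodes on a ring, each tied to 3 neighbours
# on either side, with 5% of ties rewired at random
sw <- sample_smallworld(dim = 1, size = 100, nei = 3, p = 0.05)

# Random network with the same number of nodes and ties
rand <- sample_gnm(vcount(sw), ecount(sw))

# Average clustering coefficient (ACC) and characteristic path length (CPL)
acc_sw   <- transitivity(sw, type = "average")
acc_rand <- transitivity(rand, type = "average")
cpl_sw   <- mean_distance(sw)
cpl_rand <- mean_distance(rand)

c(acc_small_world = acc_sw, acc_random = acc_rand)
c(cpl_small_world = cpl_sw, cpl_random = cpl_rand)
```

In a small world network the ACC should clearly exceed that of the random network while the CPL stays of a similar order, which is the opposite of the ACC pattern seen in the observed network.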

## Wrapping up
To complete the lab, make sure output/previews have been generated for each block of code. Then click the "Publish" button on the upper right hand corner of this screen and sign up for an RPubs account. Submit the URL of the published, completed lab on Canvas.