SNA Grad Seminar, Fall 2017 Due: October 24th, 11:59 pm Name of Student: Carmanie Bhatti
The purpose of this lab is to develop your familiarity conducting descriptive network analysis using the statistical software package R. This assignment will make use of a data set you collect by defining a search query (a collection of your user-defined search terms) from the New York Times’s Article Search Application Programming Interface. Networks are generated from the co-occurrences between search terms included in the same search query. For example, a link exists between “apple” and “orange” if there are articles in the New York Times that contained these two terms. You will be visualizing and interpreting individual and global network properties of this network.
You will be graded primarily on the completeness and accuracy of your responses, but the clarity of the prepared report will also affect your grade. While students may work together to perform the analysis, each student must submit his or her own report and is responsible for writing the narrative in the report. You must answer all of the bolded questions.
For this lab, you will search the New York Times, save that data, create networks from that data, compare the differences among networks, and demonstrate your proficiency with basic network descriptive statistics.
When working with R, you should run each line of code individually, unless it is part of a function definition, so you can see the results. Generally speaking, any line of code that includes ‘{’ (the beginning of a function definition) should be run with all the other lines until you hit ‘}’.
# Lines that start with a hashtag/pound symbol, like this one, are comment lines. Comment lines are ignored by R when it is interpreting code.
# You only need to install packages once. Remove the # in front of each line and then run it to install each package. After successful installation, delete the line of code or replace the #s so the R Notebook doesn't run into problems.
install.packages('magrittr', repos = "https://cran.rstudio.com")
##
## The downloaded binary packages are in
## /var/folders/h9/_68xmrd96fj7lzq9wsgr020w0000gn/T//RtmpJUZzOT/downloaded_packages
#install.packages('igraph', repos = "https://cran.rstudio.com")
#install.packages('httr', repos = "https://cran.rstudio.com")
#install.packages('data.table', repos = "https://cran.rstudio.com")
#install.packages('dplyr', repos = "https://cran.rstudio.com")
#install.packages('xml2', repos = "https://cran.rstudio.com")
# You need to load packages every time you run the script or restart R.
library(magrittr)
library(httr)
library(data.table)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:igraph':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(xml2)
# Set your directory for the project
# You can either enter your filename path within the parentheses below and remove the # creating the comment, or select "Session > Set Working Directory ... Source File Location" in R Studio.
# setwd("Input Directory")
International movies watched by Americans and television shows watched by Indians
You can decide search terms based on personal interests, research interests, or popular topical areas, among others. You have flexibility in selecting your search term list. For example, you can search for some commercial brands, celebrities, countries, universities, etc. It will be most useful if you choose a collection of words that are not all extremely common. Think about a set of words that might have interesting co-occurrences in articles within the New York Times website. For example, you might be interested in the last names of every Senator involved in a certain political debate, football teams, or cities and their co-occurrence in news articles. Generally speaking, proper nouns are best, but you might have compelling reasons to choose verbs or adjectives. You might want to throw a couple of terms in that aren’t thematically related to make sure you don’t get a totally connected component. The more interesting your network is in terms of differing centrality, distinct components, etc., the easier it will be to do the written analysis. Keep in mind that the Article Search archive is very large; many terms co-occur. You might want to consider two tenuously related subjects. The example file uses four football teams and their home senators, plus a few topical terms.
Create a plain text file with .txt extension in the same directory as the R Markdown Notebook used in this assignment. Make a note of the file name for use in the next code snippet. Place one search term per line, and use 15–20 terms. You’ll also likely want to add quotation marks around your search terms to ensure that you’re only receiving results for the complete term. NOTE: The function will process your terms so that they work in the URL request. You do not need to encode non-alphabetic characters.
The text file cannot include any additional information or characters and it must be a .txt file; Word or RTF documents won’t work.
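If you prefer to build the word list file from R rather than a text editor, a minimal sketch is below. The file name my_terms.txt is a placeholder, not the file this report actually uses; the terms shown are a subset of the list described in part (a).
# Sketch only: write one quoted search term per line to a plain text file.
# "my_terms.txt" is a hypothetical file name; substitute your own.
example_terms <- c('"Tom Cruise"', '"Shah Rukh Khan"', '"Wonder Woman"', '"India"')
writeLines(example_terms, "my_terms.txt")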
a. Provide a high-level overview of the terms you included in the search query. Tom Cruise, Chris Pine, Shah Rukh Khan, Priyanka Chopra, Kuch Kuch Hota Hai, Quantico, Top Gun, Wonder Woman, India, United States
b. Why did you choose this collection of terms? Was there a specific overarching question (intellectual or extracurricular curiosity) that motivated this collection of terms? These actors, movies, and the television show belong to the romantic genre. Do people prefer to watch romantic movies, or do they appreciate these actors?
c. How did you decide which terms to use in the search query? Were these terms you intuitively deemed important? Were they culled from a specific source or the result of some separate analysis or search query? The romantic movie genre differs between Hollywood and Bollywood. How do people perceive the notion of a romantic movie, a detective series, and a science-fiction movie?
d. What are the insights you hope to glean by looking at the network of terms in terms of individual node metrics, sub-grouping of nodes, and overall global network properties? Do people like to watch movies or television shows? Do they watch these movies because of their favorite actors or because of their genre?

## Working with the API to Collect Your Data

The New York Times controls access to its API by assigning each user a key. Each key has a limited number of calls that can be made within a certain time period. You can read more about the limitations of the API system here.
You will need to create your own API key to complete this assignment. Go to the New York Times developers page and request a key. You will copy that key (received via email) into the api variable below.
# Import your word list
name_of_file <- "bhatti.txt" # Creates a variable called name_of_file that you should populate with the name of your text file between quotation marks.
word_list <- read.table(name_of_file, sep = "\n", stringsAsFactors = F) %>% unlist %>% as.vector # Reads the content of your file into a variable.
num_words <- length(word_list) # Creates a variable with the number of words in your list.
url_base <- "https://api.nytimes.com/svc/search/v2/articlesearch.json"
# When you receive the email with your API key, paste it below between the quotation marks.
api <- 'dd2debaca7064fa9a8fa4bdde0d7f284'
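As an aside (not part of the lab template), hard-coding a key into a published notebook exposes it; one hedge is to read the key from an environment variable instead. NYT_API_KEY is an assumed variable name used only for this sketch.
# Optional sketch: override the hard-coded key with an environment variable, if one is set.
# Set NYT_API_KEY in ~/.Renviron or with Sys.setenv() before knitting.
if (nzchar(Sys.getenv("NYT_API_KEY"))) {
  api <- Sys.getenv("NYT_API_KEY")
}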
Our first function will gather all of the search terms and their number of hits to be placed in a table. All lines of a function should be run together.
Get_hits_one <- function(keyword1) {
Sys.sleep(time=3)
url <- paste0(url_base, "?api-key=", api, "&q=", URLencode(keyword1),"&begin_date=","20160101") # Begin date is in format YYYYMMDD; you can change it if you want only more recent results, for example.
# The number of results
print(keyword1)
hits <- content(GET(url))$response$meta$hits %>% as.numeric
print(hits)
# Put results in table
c(SearchTerm=keyword1,ResultsTotal=hits)
}
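Before looping over the whole list, it can be worth spending one API call on a quick sanity check. This is a sketch only, left commented out so re-knitting the notebook does not burn extra calls.
# Sketch: test the function on the first term to confirm the key and URL format work.
# Get_hits_one(word_list[1])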
Now we will invoke our function to put information from the API into our global environment.
#Create a table of your words and their number of results.
total_table <- t(sapply(word_list,Get_hits_one))
total_table <- as.data.frame(total_table)
total_table$ResultsTotal <- as.numeric(as.character(total_table$ResultsTotal))
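Before moving on, it can help to scan the results table for terms that returned no hits; a minimal sketch using only the total_table built above:
# Sketch: list any search terms with zero results so they can be swapped out.
total_table$SearchTerm[total_table$ResultsTotal == 0]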
If you get zero hits for any of these terms, you should replace that term with something else and rerun the lab up to this point. Next, we will define the function that will collect the article co-occurrence network.
Get_hits_two <- function(row_input) {
keyword1 <- row_input[1]
keyword2 <- row_input[2]
url <- paste0(url_base, "?api-key=", api, "&q=", URLencode(keyword1),"+", URLencode(keyword2),"&begin_date=","20160101") #match w/ Begin Date in Get_hits_one.
# The number of results
print(paste0(keyword1," ",keyword2))
hits <- content(GET(url))$response$meta$hits %>% as.numeric
print(hits)
Sys.sleep(time=3)
# Put results in table
c(SearchTerm1=keyword1,SearchTerm2=keyword2,CoOccurrences=hits)
}
In this next step, we will call the API and collect the co-occurrence network. This may take some time. If you receive "numeric(0)" in any of your responses, you've likely hit your API key limit and will either need to wait for the calls to reset (24 hours) or request a new key. If you receive the error message "$ operator is invalid for atomic vectors," you have also hit the API call limit. This could be due to running the script multiple times, or due to hitting too many results based on very common search terms. Request a new API key, shorten your word list, and try again. Don't forget that you need to reload your word list from the first part of the lab in order to get a different set of results! You must also rerun the chunk that assigns the api variable so the new key is used. If none of your results come back as "0," every pair of terms co-occurs and your network will be a single fully connected component, so you might want to redo your search with less common terms.
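If rate limiting keeps interrupting the run, one defensive pattern (a sketch only, not part of the lab template; get_hits_safely is a hypothetical helper name) is to catch a malformed response, pause, and retry once instead of letting the script error out. It could stand in for the content(GET(url)) line inside either function above.
get_hits_safely <- function(url, wait = 10) {
  # Sketch: return NA instead of erroring when the API response is malformed,
  # and retry once after a longer pause if no hit count comes back.
  pull_hits <- function() {
    tryCatch(content(GET(url))$response$meta$hits, error = function(e) NULL)
  }
  hits <- pull_hits()
  if (is.null(hits) || length(hits) == 0) {
    Sys.sleep(wait)
    hits <- pull_hits()
  }
  if (is.null(hits) || length(hits) == 0) NA_real_ else as.numeric(hits)
}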
# Convert the word list into a table of term pairs.
# combn() returns each unordered pair of terms exactly once, which is what an
# undirected co-occurrence network needs (expand.grid() would double-count pairs).
pairs_list <- t(combn(word_list,2))
#Create a network table, run the Get_hits_two function using the pairs lists
network_table <- t(apply(pairs_list,1,Get_hits_two))
#Convert the network table into a dataframe
network_table <- as.data.frame(network_table)
# Read the content of each item within the $CoOccurrences factor as characters,
# then coerce those characters to the "numeric" (double) type.
network_table$CoOccurrences <- as.numeric(as.character(network_table$CoOccurrences))
# Convert data to data.table type.
total_table <- as.data.table(total_table)
network_table <- as.data.table(network_table)
# Remove zero edges from your network
network_table <- network_table[!CoOccurrences==0]
# Create a graph object with your data
g_valued <- graph_from_data_frame(d = network_table[,1:3,with=FALSE],directed = FALSE,vertices = total_table)
# If you're having trouble with data collection, you can load the 'NFL Lab Results.RData' file now by clicking the open folder icon on the "Environment" tab and continue the lab from here. You'll need to figure out what the significance of the terms is yourself, however.
# You should save your data at this point by clicking the floppy disk icon under the "Environment" tab.
load("BhattiMovieData.Rdata")
Is the graph directed or undirected? This is an undirected graph. How many nodes and links does your network have?
numVertices <- vcount(g_valued)
numVertices
## [1] 10
numEdges <- ecount(g_valued)
numEdges
## [1] 24
What is the number of possible links in your network?
maxEdges <- numVertices*(numVertices-1)/2
maxEdges
## [1] 45
What is the density of your network?
graphDensity <- numEdges/maxEdges # manual calculation
graphDensity
## [1] 0.5333333
graphDensity1 <- graph.density(g_valued) # using the graph.density function from igraph
graphDensity1
## [1] 0.5333333
Let’s start by visualizing the network that we’ve collected from the New York Times Article Search API. We’ll need to choose node colors and set a layout. You can learn more about Fruchterman Reingold layout and other layouts here.
## Learn more about plotting with igraph
?? igraph.plotting
colbar = rainbow(length(word_list)) ## we are selecting different colors to correspond to each word
V(g_valued)$color = colbar
# Set layout here
L = layout_with_fr(g_valued) # Fruchterman Reingold
plot(g_valued,vertex.color=V(g_valued)$color, layout = L, vertex.size=6)
## Analysis

In a paragraph, describe the macro-level structure of your graphs based on the Fruchterman Reingold visualization. Is it a giant, connected component, are there distinct sub-components, or are there isolated components? Can you recognize common features of the subcomponents? Does this visualization give you any insight into the co-occurrence patterns of the search terms? If yes, what? If not, why?
This is not a single giant component. The movie Kuch Kuch Hota Hai is an isolated node, while the remaining nodes form one connected component. Interestingly, although I did not expect any connection between the actress Priyanka Chopra and the movie Wonder Woman, the graph shows one. Priyanka Chopra is a Bollywood actress and Chris Pine is a Hollywood actor; Wonder Woman is a superhero film, whereas Priyanka Chopra acts in romantic and action films.

Now we'll create a second visualization using a different layout.
## You can change the layout by picking one of the other options. Uncomment one of the lines below by erasing the # and running the line. Try to find a layout that gives you different information than Fruchterman Reingold.
L = layout_with_dh(g_valued) ## Davidson and Harel
# L = layout_with_drl(g_valued) ## Force-directed
# L = layout_with_kk(g_valued) ## Spring
plot(g_valued,vertex.color=V(g_valued)$color, layout = L, vertex.size=6)
## Analysis
In a paragraph, compare and contrast the information given to you by the two different layouts. My first startled reaction was seeing no connection between the actor Shah Rukh Khan and the movie Kuch Kuch Hota Hai or India, because Shah Rukh Khan was the protagonist of that film, which was released in India. Overall, the two graphs read slightly differently: Graph 1 makes the connection between the movie Wonder Woman and Priyanka Chopra easier to see, while Graph 2 draws attention to the connection between Shah Rukh Khan and Wonder Woman, which is notable because Shah Rukh Khan is a romantic hero and Wonder Woman is a superhero film.
Identifying subgroups within a network is of great interest to social network researchers, so a variety of algorithms have been developed to identify and measure subgroups. We will use some of R’s built-in tools to identify subgroups and central nodes for visual inspection.
For the remainder of the visualizations we will use the Fruchterman Reingold layout.
L = layout_with_fr(g_valued)
Cluster the nodes in your network.
# Learn more about the clustering algorithm.
?? cluster_walktrap
cluster <- cluster_walktrap(g_valued)
# Find the number of clusters
membership(cluster) # affiliation list
## “Tom Cruise” “Chris Pine” “Shah Rukh Khan”
## 1 1 2
## “Priyanka Chopra” “Kuch Kuch Hota Hai” “Quantico”
## 2 3 2
## “Top Gun” “Wonder Woman” “India”
## 1 1 2
## “United States”
## 1
length(sizes(cluster)) # number of clusters
## [1] 3
# Find the size of each cluster
# Note that communities with one node are isolates, or have only a single tie
sizes(cluster)
## Community sizes
## 1 2 3
## 5 4 1
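To answer the membership questions below, it helps to list the member nodes of each community by name; a minimal sketch using only base R and the objects already created:
# Sketch: show which search terms fall into each walktrap community.
split(V(g_valued)$name, membership(cluster))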
How many communities have been created? Three different communities exist.
How many nodes are in each community? Community 1 has 5 nodes, community 2 has 4 nodes, and community 3 has 1 node.
For your network, what might each cluster of nodes potentially have in common? Describe each cluster, its membership, and the relationship between nodes in the cluster. There are two multi-node clusters in this network plus one isolate. One cluster contains India, Shah Rukh Khan, Quantico, and Priyanka Chopra; the other contains Tom Cruise, Chris Pine, the U.S., Wonder Woman, and Top Gun. The isolate is Kuch Kuch Hota Hai. No node is shared between the two clusters.
Next we visualize the clusters by coloring nodes according to their modularity class.
plot(cluster, g_valued, col = V(g_valued)$color, layout = L, vertex.size=6)
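As a rough numeric check on how well separated the clusters are (a sketch, not required by the lab): modularity close to 0 means weak community structure, while values nearer 1 mean ties fall mostly within communities.
# Sketch: modularity of the walktrap partition found above.
modularity(cluster)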
What information does this layout convey? Are the clusters well-separated, or is there a great deal of overlap? Is it easier to identify the common themes among clusters in this layout rather than looking only at the graphs? There is no overlap, and the clusters are well separated with respect to countries. Yes, the common themes (countries and film actors) are easier to identify in this layout than from the earlier graphs alone. What differences are there between nodes in the same cluster and across clusters? The two clusters are separated by country, India and the U.S.A. Within them, smaller groupings are organized around the movies or television shows that certain actors have been cast in. A few examples: Chris Pine and Wonder Woman (with Tom Cruise as a third node), Tom Cruise and Top Gun (with Wonder Woman as a third node), and Priyanka Chopra and Quantico (with India as a third node), suggesting Priyanka Chopra is an Indian native.

The interesting connections seen within the clusters are among Tom Cruise, Chris Pine, and Wonder Woman, which might be because Chris Pine is one of the lead actors in Wonder Woman, suggesting he is a superhero film actor. In a similar manner, the grouping of Tom Cruise, the U.S., and Wonder Woman suggests that Tom Cruise is a Hollywood actor and that Wonder Woman is an American film.

Shah Rukh Khan has a connection with Quantico; that is, he is connected to Priyanka Chopra as her co-star, and Priyanka Chopra has worked in the television show Quantico.

Interestingly, there is no cluster linking the movie Kuch Kuch Hota Hai, India, and Shah Rukh Khan, which is surprising given that the film won 12 awards, including Best Film and Best Actor in a male and a female role, respectively.

Additionally, Tom Cruise and Chris Pine appear together in the second cluster mainly because they are Hollywood actors, even though they belong to different film genres. Hence, the common theme across clusters is actors grouped with their respective native countries, India and the U.S.

Describe the brokers between any components and cliques. What are common features of these brokers? About how many brokers would you have to remove from your network to "shatter" it into two or more disconnected components? The brokers between the cliques and components are India and the U.S., which happen to be countries. If the node India is removed, the component and clique involving Tom Cruise and Chris Pine would separate; it would be similar for Shah Rukh Khan and Priyanka Chopra, whose component and clique would shatter if the node U.S. is removed. If Priyanka Chopra is removed, the clique between India and Shah Rukh Khan would break; in a similar manner, if Chris Pine is removed, the clique between Tom Cruise and Wonder Woman would shatter.

# Part 4: Centrality Visualization & Weighted Values (20 Points)

For each network, you will use centrality metrics to improve your visualization. You may need to adjust the size parameter to make your network more easily visible.
totalDegree <- degree(g_valued,mode="all")
sort(totalDegree,decreasing=TRUE)[1:5]
## “United States” “Tom Cruise” “Priyanka Chopra” “Wonder Woman”
## 8 7 7 6
## “India”
## 6
g2 <- g_valued
V(g2)$size <- totalDegree*2 #can adjust the number if nodes are too big
plot(g2, layout = L, vertex.label=NA)
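Because the centrality plots suppress labels, a quick labeled redraw (a sketch; the label size is an arbitrary choice) helps confirm which node is which before writing the interpretation.
# Sketch: same degree-scaled plot, with node names shown.
plot(g2, layout = L, vertex.label = V(g2)$name, vertex.label.cex = 0.8)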
Briefly explain degree centrality and why nodes are more or less central in the network. Degree centrality counts the number of neighbors a node has: the more direct connections a node has, the more central it is. The most connected node in this network is the U.S. (degree 8), followed by Tom Cruise, a Hollywood romantic and action film actor, and Priyanka Chopra, a Bollywood actor who works in the romantic genre of films that are popular among audiences (degree 7 each).

## Weighted Degree Centrality
wd <- graph.strength(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wd,decreasing=TRUE)[1:5]
## “United States” “India” “Wonder Woman” “Tom Cruise”
## 1774 1665 76 52
## “Quantico”
## 36
wg2 <- g_valued
V(wg2)$size <- wd*.1 # adjust the number if nodes are too big
plot(wg2, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge. Otherwise edges tend to cover up all the other edges and obscure the relationships.
What does the addition of weighted degree and edge information tell you about your graph? The U.S. has the highest weighted degree centrality and India the second highest; Wonder Woman has the third highest, far behind the two countries. The reason is that both countries produce a large number of movies annually, while the movie Wonder Woman was produced only once, in the U.S. and in English.

## Betweenness Centrality
b <- betweenness(g_valued,directed=TRUE)
sort(b,decreasing=TRUE)[1:5]
## “United States” “Tom Cruise” “Priyanka Chopra” “Wonder Woman”
## 4.833333 2.333333 2.333333 1.250000
## “India”
## 1.250000
g4 <- g_valued
V(g4)$size <- b*1.2#can adjust the number
plot(g4, layout = L, vertex.label=NA)
Briefly explain betweenness centrality and why nodes are more or less central in the network. Betweenness centrality is a measure of centrality based on shortest paths: for each pair of nodes, one counts how many of the geodesics (shortest paths) between them pass through a given node, so a node that sits on many shortest paths is more central. The U.S. has the highest betweenness centrality, followed by Tom Cruise and Priyanka Chopra. The nodes India and Wonder Woman have low betweenness centrality, perhaps because India produces hundreds of movies for its domestic cinema that may or may not be released internationally, while Wonder Woman, being a superhero film, might not be popular among international audiences outside the U.S. who do not appreciate fiction or superhero films.

### Weighted Betweenness Centrality
wbtwn <- betweenness(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wbtwn,decreasing=TRUE)[1:5]
## “Priyanka Chopra” “Shah Rukh Khan” “Wonder Woman” “India”
## 21 7 7 7
## “Tom Cruise”
## 1
wBtwnG <- g_valued
V(wBtwnG)$size <- wbtwn*.5 # adjust the number if nodes are too big
plot(wBtwnG, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge.
What does the addition of weighted degree and edge information tell you about your graph? There seems to be a strong connection between the nodes India and the U.S. The broker in this network is Priyanka Chopra, because she not only works in the Indian film industry but has also worked in an American television show and movie. Therefore, Priyanka Chopra's weighted betweenness centrality is high in comparison to Shah Rukh Khan, who works only in Indian Hindi films. In the same way, the weighted betweenness centrality is low for Wonder Woman because the movie might be watched mainly by English-speaking audiences or by those who appreciate American culture, including the superhero genre.

## Closeness Centrality
c <- closeness(g_valued)
sort(c,decreasing=TRUE)[1:5]
## “United States” “Tom Cruise” “Priyanka Chopra” “Wonder Woman”
## 0.05555556 0.05263158 0.05263158 0.05000000
## “India”
## 0.05000000
g5 <- g_valued
V(g5)$size <- c*500 #can adjust the number
plot(g5, layout = L, vertex.label=NA)
Briefly explain closeness centrality and why nodes are more or less central in the network. Closeness centrality measures how close a node is to all other nodes in the network; a node that is far from the others, or not connected to them at all, has little or no closeness. The nodes U.S., Tom Cruise, and Priyanka Chopra have higher closeness centrality in this network than Wonder Woman and India. One reason the U.S. is close to the other nodes is that it produces movies in American English, which are popular among international audiences, while Tom Cruise is a romantic and action film hero much appreciated by audiences. Similarly, Priyanka Chopra works not only in the Indian romantic film genre but also in Quantico and Baywatch, where most of her scenes combine romance and action, which is popular among audiences.
In this graph, although I expected a connection between Shah Rukh Khan and the movie Kuch Kuch Hota Hai, no such tie exists, so there is no closeness between the two.
wClsnss <- closeness(g_valued,weights = E(g_valued)$CoOccurrences)
sort(wClsnss,decreasing=TRUE)[1:5]
## “Priyanka Chopra” “Tom Cruise” “Shah Rukh Khan” “Chris Pine”
## 0.03333333 0.02941176 0.02941176 0.02777778
## “Wonder Woman”
## 0.02500000
wClsnssG <- g_valued
V(wClsnssG)$size <- wClsnss*1000 # adjust the number if nodes are too big
plot(wClsnssG, layout = L, vertex.label=NA, edge.width=sqrt(E(g_valued)$CoOccurrences)) #taking the square root is a good way to make a large range of numbers visible in an edge.
What does the addition of weighted degree and edge information tell you about your graph? The nodes Priyanka Chopra, Tom Cruise, and Shah Rukh Khan have higher weighted closeness than Chris Pine and Wonder Woman. The edge weights might reflect attributes of the above-mentioned actors, who work in the action and romantic genres.
## Eigenvector Centrality
eigc <- eigen_centrality(g_valued,directed=TRUE)
sort(eigc$vector,decreasing=TRUE)[1:5]
## “United States” “Priyanka Chopra” “Tom Cruise” “India”
## 1.0000000 0.9291746 0.9291746 0.8345373
## “Wonder Woman”
## 0.8345373
g6 <- g_valued
V(g6)$size <- eigc$vector*50 #can adjust the number
plot(g6, layout = L, vertex.label=NA)
Briefly explain eigenvector centrality and why nodes are more or less central in the network. Eigenvector centrality measures the influence of a node in a network: a node scores highly when it is connected to other nodes that are themselves well connected. The graph shows the U.S. with the highest eigenvector centrality, followed by Priyanka Chopra and Tom Cruise. The reason is that the U.S. attracts international film and television actors, like Priyanka Chopra, and produces not only action and romantic films but also science-fiction and superhero films such as Wonder Woman. Priyanka Chopra has the second-highest eigenvector centrality because she is a model, a Bollywood film actor, a former Miss World, a brand ambassador, and has now worked in an American drama and movie. The third node with the highest eigenvector centrality is Tom Cruise, who works mainly in American romantic and action films.
## Analysis

Choose the visualization that you think is most interesting and briefly explain what it tells you about a central node in your network. Discuss the type of centrality, and what that node's centrality score tells you about the search co-occurrence network. I will analyze the betweenness centrality network. The central node in this network is the U.S., with Tom Cruise and Priyanka Chopra as the other prominent nodes. The U.S. node is connected to other nodes in the network, such as India, providing a path for an Indian actor like Priyanka Chopra to work in an American drama and movie. Similarly, Tom Cruise is prominent because he is an American romantic and action film hero who is watched by international audiences, including in India. Priyanka Chopra is the third most prominent node in this network because she works in both the U.S. and India.
Briefly discuss an interesting difference between types of centrality for your network. One interesting difference is that this network contains an isolate, Kuch Kuch Hota Hai, which has no centrality on any measure, and this stands out most clearly in the eigenvector centrality results.

## Global Network Metrics with R
Compute the network centralization scores for your network for degree, betweenness, closeness, and eigenvector centrality.
# Degree centralization
centralization.degree(g_valued,normalized = TRUE)
## $res
## [1] 7 4 4 7 0 3 3 6 6 8
##
## $centralization
## [1] 0.3555556
##
## $theoretical_max
## [1] 90
# Betweenness centralization
centralization.betweenness(g_valued,normalized = TRUE)
## $res
## [1] 2.333333 0.000000 0.000000 2.333333 0.000000 0.000000 0.000000
## [8] 1.250000 1.250000 4.833333
##
## $centralization
## [1] 0.1121399
##
## $theoretical_max
## [1] 324
# Closeness centralization
centralization.closeness(g_valued,normalized = TRUE)
## $res
## [1] 0.4736842 0.4090909 0.4090909 0.4736842 0.1000000 0.3913043 0.3913043
## [8] 0.4500000 0.4500000 0.5000000
##
## $centralization
## [1] 0.2247403
##
## $theoretical_max
## [1] 4.235294
# Eigenvector centralization
centralization.evcent(g_valued,normalized = TRUE)
## $vector
## [1] 0.9291746 0.6403818 0.6403818 0.9291746 0.0000000 0.4792540 0.4792540
## [8] 0.8345373 0.8345373 1.0000000
##
## $value
## [1] 5.766695
##
## $options
## $options$bmat
## [1] "I"
##
## $options$n
## [1] 10
##
## $options$which
## [1] "LA"
##
## $options$nev
## [1] 1
##
## $options$tol
## [1] 0
##
## $options$ncv
## [1] 0
##
## $options$ldv
## [1] 0
##
## $options$ishift
## [1] 1
##
## $options$maxiter
## [1] 1000
##
## $options$nb
## [1] 1
##
## $options$mode
## [1] 1
##
## $options$start
## [1] 1
##
## $options$sigma
## [1] 0
##
## $options$sigmai
## [1] 0
##
## $options$info
## [1] 0
##
## $options$iter
## [1] 4
##
## $options$nconv
## [1] 1
##
## $options$numop
## [1] 14
##
## $options$numopb
## [1] 0
##
## $options$numreo
## [1] 11
##
##
## $centralization
## [1] 0.4041631
##
## $theoretical_max
## [1] 8
Record the centralization score of each centrality measure. This is a simple, undirected graph. The degree centralization is 0.356, the betweenness centralization is 0.112, the closeness centralization is 0.225, and the eigenvector centralization is 0.404.
Briefly explain what the centralization of a network is. Centralization is a network-level measure of how concentrated centrality is in one or a few nodes: a network is highly centralized when a single actor holds most of the ties (a star-like structure), and less centralized when centrality is spread fairly evenly across the nodes. Compare the centralization scores above with the graphs you created where the nodes are scaled by centrality. Describe the appearance of more centralized v. less centralized networks. This is a relatively less centralized graph: the scaled-node plots show several medium-sized nodes rather than one dominant hub, perhaps because these actors appear in different genres of movies.

## Part 5. Power Laws & Small Worlds (20)
Networks often demonstrate power law distributions. Plot the degree distribution of the nodes in your base graph.
# Calculate degree distribution
deg <- degree(g_valued,v=V(g_valued), mode="all")
deg
## “Tom Cruise” “Chris Pine” “Shah Rukh Khan”
## 7 4 4
## “Priyanka Chopra” “Kuch Kuch Hota Hai” “Quantico”
## 7 0 3
## “Top Gun” “Wonder Woman” “India”
## 3 6 6
## “United States”
## 8
# Degree distribution is the cumulative frequency of nodes with a given degree
deg_distr <-degree.distribution(g_valued, cumulative=T, mode="all")
deg_distr
## [1] 1.0 0.9 0.9 0.9 0.7 0.5 0.5 0.3 0.1
plot(deg_distr, ylim=c(.01,10), bg="black",pch=21, xlab="Degree", ylab="Cumulative Frequency") #You may need to adjust the ylim to a larger or smaller number to make the graph show more data.
Test whether it's approximately a power law by estimating log f(k) = log a − c log k. "This says that if we have a power-law relationship, and we plot log f(k) as a function of log k, then we should see a straight line: −c will be the slope, and log a will be the y-intercept. Such a "log-log" plot thus provides a quick way to see if one's data exhibits an approximate power-law: it is easy to see if one has an approximately straight line, and one can read off the exponent from the slope." (E&K, Chapter 18, p. 546).
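A minimal sketch of that log-log check, using the cumulative distribution computed above: degree 0 is dropped because log(0) is undefined, the fitted slope approximates -c, and the intercept approximates log a. igraph's power.law.fit call below then performs a more formal fit.
# Sketch: straight-line fit on the log-log scale, per the E&K description.
degs <- seq_along(deg_distr) - 1   # degree.distribution() starts at degree 0
keep <- degs > 0 & deg_distr > 0   # log() needs strictly positive values
loglog_fit <- lm(log(deg_distr[keep]) ~ log(degs[keep]))
coef(loglog_fit)                   # intercept ~ log a, slope ~ -c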
power <- power.law.fit(deg_distr)
power
## $continuous
## [1] TRUE
##
## $alpha
## [1] 3.192999
##
## $xmin
## [1] 0.5
##
## $logLik
## [1] 1.43094
##
## $KS.stat
## [1] 0.3422286
##
## $KS.p
## [1] 0.3059285
plot(deg_distr, log="xy", ylim=c(.01,10), bg="black",pch=21, xlab="Degree", ylab="Cumulative Frequency")
Does your network exhibit a power law distribution of degree centrality? Yes, approximately: whether the hubs are the countries (India and the U.S.) or the leading film actors, a few nodes attract most of the connections while the remaining nodes have low degree.
## Small Worlds
Networks often demonstrate small world characteristics. Compute the average clustering coefficient (ACC) and the characteristic path length (CPL).
# Average clustering coefficient (ACC)
transitivity(g_valued, type = c("average"))
## [1] 0.8190476
# Characteristic path length (CPL)
average.path.length(g_valued)
## [1] 1.333333
Compute the ACC and CPL for 100 random networks with the same number of nodes and ties as your test network.
accSum <- 0
cplSum <- 0
for (i in 1:100){
grph <- erdos.renyi.game(numVertices, numEdges, type = "gnm")
accSum <- accSum + transitivity(grph, type = c("average"))
cplSum <- cplSum + average.path.length(grph)
}
accSum/100
## [1] 0.5387913
cplSum/100
## [1] 1.486889
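One common way to summarize this comparison (a sketch, not part of the lab template) is the small-world coefficient, sigma = (ACC_obs / ACC_rand) / (CPL_obs / CPL_rand); values well above 1 are usually read as evidence of small-world structure.
# Sketch: small-world coefficient comparing the observed graph to the random-graph averages.
accObs <- transitivity(g_valued, type = "average")
cplObs <- average.path.length(g_valued)
(accObs / (accSum / 100)) / (cplObs / (cplSum / 100))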
Based on these data, would you conclude that the observed network demonstrates small world properties? Why or why not? Yes, it does, except that there is one isolate. The observed average clustering coefficient (0.82) is well above the random-graph average (about 0.54), while the characteristic path length (1.33) is slightly shorter than the random average (about 1.49). Since the nodes are countries and film actors, it is likely that there is some sort of connection between these people and places. The biggest question I have is how Shah Rukh Khan is not connected to his blockbuster film, Kuch Kuch Hota Hai.

## Wrapping up

To complete the lab, make sure output/previews have been generated for each block of code. Then click the "Publish" button on the upper right hand corner of this screen and sign up for an RPubs account. Submit the URL of the published, completed lab on Canvas.