1 Deliverable 3 Questions and Answers:

Explain what the issue was with solo authors in past deliverables.

In past deliverables, solo authors were dropped from the adjacency tables because they were paired with NAs, and rows containing NAs were removed. As a result, those solo authors could not be included in the igraph one-mode network, which displays connections between co-authors.

How do the transformations conducted in the R code file “two_mode_coauthorship.r” handle solo authors?

In a two-mode network, each edge connects an author’s ID to the EID of a paper that author wrote. Co-authorship is not stored as a direct author-to-author tie; it is implied when two authors are connected to the same paper’s EID. Because a solo author’s ID is still connected to their paper’s EID, it remains in the two-mode network and is not dropped.
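A minimal sketch of this idea on made-up data (the EIDs `P1` to `P3` and author IDs `A1` to `A3` are hypothetical and not taken from the dataset; only base R is used). It builds both a two-mode edgelist and a one-mode pair list from the same toy table and checks which authors each one contains:

```r
# Hypothetical toy data: paper P2 has a single author (A3)
papers <- data.frame(
  EID = c("P1", "P2", "P3"),
  IDs = c("A1;A2", "A3", "A1;A2"),
  stringsAsFactors = FALSE
)

# Two-mode edgelist: one row per (paper, author) pair
id_list <- strsplit(papers$IDs, ";", fixed = TRUE)
two_mode <- data.frame(
  EID = rep(papers$EID, lengths(id_list)),
  ID  = unlist(id_list),
  stringsAsFactors = FALSE
)

# One-mode edgelist: one row per co-author pair; a solo paper yields no pair
pair_list <- lapply(id_list, function(ids) {
  if (length(ids) < 2) return(NULL)
  t(combn(ids, 2))
})
one_mode <- do.call(rbind, pair_list)

authors_two_mode <- unique(two_mode$ID)
authors_one_mode <- unique(c(one_mode))

# Authors present in the two-mode edgelist but absent from the pair list
print(setdiff(authors_two_mode, authors_one_mode))
```

In this toy case the only author missing from the pair list is the solo author, while the two-mode edgelist still has a row for them.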

Does our dataset with ID, names, and publication counts contain solo authors? If so, how, or why was this achieved?

Yes, that dataset contains solo authors. The code “df_id_au_unlisted <- data.frame(ID = unlist(id_split), Names = unlist(au_split))” places each author’s ID and name together in one row of the dataframe. Instead of dropping the NAs from the ID adjacency list, which removes the solo authors’ IDs and loses their information, we drop the NAs from this unlisted dataframe. Each solo author has both a name and an ID stored in the dataframe, so their row has a non-missing ID and is not dropped by the code “df_id_au_unlisted <- df_id_au_unlisted[!is.na(df_id_au_unlisted$ID),]”.
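The difference can be illustrated with a small sketch. The two wide tables below are hypothetical stand-ins for `id_split` and `au_split` (the names and IDs are invented), and the use of `complete.cases()` is only an assumption about one way the earlier loss could occur, not necessarily the exact code used in past deliverables:

```r
# Hypothetical wide tables, as produced by splitting on ";":
# row 2 is a solo-authored paper, so its second cell is NA padding
id_split <- data.frame(ID_1 = c("A1", "A3"), ID_2 = c("A2", NA),
                       stringsAsFactors = FALSE)
au_split <- data.frame(Name_1 = c("Lee J.", "Diaz M."),
                       Name_2 = c("Kim S.", NA),
                       stringsAsFactors = FALSE)

# Dropping incomplete rows of the wide table loses the solo author A3
wide_complete <- id_split[complete.cases(id_split), ]

# Unlisting first gives every cell its own row, so only the padding is dropped
long <- data.frame(ID = unlist(id_split), Names = unlist(au_split),
                   stringsAsFactors = FALSE)
long <- long[!is.na(long$ID), ]

print(long)
```

Here `long` keeps three authors, including the solo author, whereas `wide_complete` keeps only the co-authored row.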

What is the meaning of the dimensions of the object “edgelist_two_mode” in plain English?

The dimensions of “edgelist_two_mode” are (2743, 2). The 2743 rows mean that the dataset captures a total of 2743 author-paper relationships, and the 2 columns hold the paper’s EID and the author’s ID for each relationship. The dimensions alone do not tell us how many unique authors or how many unique papers are included in that number.
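A short sketch of that distinction, using an invented five-row edgelist (the values are hypothetical; column 1 is assumed to hold paper EIDs and column 2 author IDs, as in “edgelist_two_mode”):

```r
# Hypothetical edgelist: column 1 = paper EID, column 2 = author ID
edgelist <- cbind(
  c("P1", "P1", "P2", "P3", "P3"),
  c("A1", "A2", "A3", "A1", "A2")
)

n_links   <- nrow(edgelist)                 # author-paper relationships
n_papers  <- length(unique(edgelist[, 1]))  # distinct papers
n_authors <- length(unique(edgelist[, 2]))  # distinct authors

print(c(links = n_links, papers = n_papers, authors = n_authors))
```

The row count reports relationships only; the unique counts have to be computed separately from each column.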

In class we observed that the following tabulations rendered different True and False distributions: table(is.na(V(g2)$label)) and table(V(g2)$type). What was the reason for that discrepancy and how did we address it? (You can place the code we used and explain it to address this question.)

The purpose of running both commands was to find out how many of the 3176 vertices are authors and how many are papers. The code “table(V(g2)$type)” counts the TRUEs (papers) and FALSEs (authors) among the vertices of graph g2, and it showed 986 papers and 2190 authors. The code “table(is.na(V(g2)$label))”, however, showed only 2184 vertices with a label, so 6 authors were missing. This is because in 6 instances in the database an author’s ID was present but the name was missing (for example, when an incorrect separator between two authors led R to read two distinct authors as one). Those 6 authors have NA in the label attribute, so the second command counts them together with the papers instead of with the authors. We addressed this with the code “V(g2)$label <- ifelse(is.na(V(g2)$label), V(g2)$name, V(g2)$label)”, which fills every missing label with the vertex’s own name (its ID) and keeps the labels that already exist.
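The discrepancy and the fallback can be reproduced on a toy example. Everything below is hypothetical (five invented vertices and a made-up lookup table); plain vectors stand in for the `type` and `label` vertex attributes so that the sketch runs in base R without igraph:

```r
# Hypothetical vertex names: three author IDs followed by two paper EIDs
vertex_name <- c("A1", "A2", "A3", "P1", "P2")
is_paper    <- vertex_name %in% c("P1", "P2")

# Lookup table in which author A3 has an ID but no name
lookup <- data.frame(ID = c("A1", "A2", "A3"),
                     Names = c("Lee J.", "Kim S.", NA),
                     stringsAsFactors = FALSE)

# Same matching idea as for V(g2)$label: papers and A3 both come back as NA
label <- lookup$Names[match(vertex_name, lookup$ID)]
n_unlabelled <- sum(is.na(label))

print(table(is_paper))      # counts by vertex type
print(table(is.na(label)))  # counts by presence of a label

# Fall back to the vertex name wherever the label is missing
label <- ifelse(is.na(label), vertex_name, label)
print(label)
```

The two tables disagree because the unnamed author is counted with the papers in the second one; after the `ifelse()` step no label is missing.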

Did you face a problem similar to the one we faced in class? Do all your authors have names associated with IDs?

No, I didn’t. Both commands returned the same split (425 authors and 154 papers), so every author ID in my dataset has a name associated with it.

2 Steps for deliverable 2:

#clean data, prepare author names that match IDs, get two-mode graph g2 and one-mode graph g
library(splitstackshape)
library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

source <- read.csv("/Users/joyceshuai/Desktop/SSNA/classtech_tenyrs.csv")
### create an author-author matrix through transformation of a two-mode edgelist
# store authors' IDs into a dataframe called "authors" and name the column "AU"
authors <- as.data.frame(source$Author.s..ID)
colnames(authors) <- "AU"
# put each ID into a separate column and name the list "source_split"
source_split <- cSplit(authors, splitCols = "AU", sep = ";", direction = "wide", drop = TRUE)

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE

# trimws trims leading and trailing whitespace; lapply applies it to every column (and so every cell) of source_split
source_split <- data.frame(lapply(source_split, trimws), stringsAsFactors = FALSE)
# turn source_split into a matrix
mat_source_split <- as.matrix(source_split)
#combine the column of EID from the original dataset with the matrix of source_split
combined <- cbind(source$EID, mat_source_split)
mat_combined_eid_source_split <- as.matrix(combined)
#connect the EID in each row with author IDs in the same row, one by one, into an edgelist
edgelist_two_mode <- cbind(mat_combined_eid_source_split[, 1], c(mat_combined_eid_source_split[, -1]))
#drop rows without authors
edgelist_two_mode <- edgelist_two_mode[!is.na(edgelist_two_mode[,2]), ]
dim(edgelist_two_mode)

## [1] 436   2
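
The step that builds `edgelist_two_mode` relies on two base R behaviours: `c()` flattens a matrix column by column, and `cbind()` recycles the shorter EID column to match. A minimal sketch on an invented two-paper matrix (all values hypothetical):

```r
# Hypothetical wide matrix: column 1 = EID, remaining columns = author IDs
wide <- cbind(
  EID  = c("P1", "P2"),
  AU_1 = c("A1", "A3"),
  AU_2 = c("A2", NA)
)

# c() flattens the ID columns column by column: A1, A3, A2, NA.
# cbind() then recycles the EID column (P1, P2, P1, P2),
# so each ID lines up with the EID of the row it came from.
edges <- cbind(wide[, 1], c(wide[, -1]))

# Drop the padding cells that have no author
edges <- edges[!is.na(edges[, 2]), ]

print(edges)
```

The alignment works because every ID column has the same number of rows as the EID column, so the recycled EIDs repeat in exactly the row order of each flattened column.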

# build a graph from the two-mode edgelist (columns swapped so that the author ID comes first)
g2 <- graph.edgelist(edgelist_two_mode[, 2:1], directed = FALSE)
g2

## IGRAPH 080b650 UN-- 579 436 -- 
## + attr: name (v/c)
## + edges from 080b650 (vertex names):
##  [1] 57209211014--2-s2.0-85118353799 57209977398--2-s2.0-85069208777
##  [3] 57206698047--2-s2.0-85120980960 57222036217--2-s2.0-85130859202
##  [5] 57672771100--2-s2.0-85129722779 56285357900--2-s2.0-85092481468
##  [7] 8247022600 --2-s2.0-85141740589 57204936425--2-s2.0-85132312213
##  [9] 35409037500--2-s2.0-85094896026 50261568800--2-s2.0-85058508859
## [11] 58044794200--2-s2.0-85146963015 7004594995 --2-s2.0-85044028507
## [13] 57198347616--2-s2.0-85036472155 57194013851--2-s2.0-85109103540
## [15] 57195758433--2-s2.0-85029746212 57222635157--2-s2.0-85149058232
## + ... omitted several edges

#create a new attribute called type for g2 to distinguish authors and papers
V(g2)$type <- V(g2)$name %in% edgelist_two_mode[ , 1]
#count the number of Trues (EIDs) and Falses (authorIDs) in the type attribute
table(V(g2)$type)

## 
## FALSE  TRUE 
##   425   154

#Creating adjacency list from author names
df_au <- as.data.frame(source$Authors) 
colnames(df_au) <- "Name"
#separate authors' names into each column
au_split <- cSplit(df_au, splitCols = "Name", sep = ";", direction = "wide", drop = TRUE)

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE

dim(au_split)

## [1] 154  11

#Creating adjacency list from authorsID
df_id <- as.data.frame(source[,3])
colnames(df_id) <- "ID"
#separate ids into each column
id_split <- cSplit(df_id, splitCols = "ID", sep = ";", direction = "wide", drop = TRUE)

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE

dim(id_split)

## [1] 154  11

#Decompose authors and IDs and put them into a dataframe with 2 columns, one called "ID" and one called "Names"
df_id_au_unlisted <- data.frame(ID = unlist(id_split), Names = unlist(au_split))
#drop the rows where "ID" is NA, i.e. the empty cells left where a paper has no further co-authors
df_id_au_unlisted <- df_id_au_unlisted[!is.na(df_id_au_unlisted$ID),]
#drop duplicates in the column "ID" so that we don't count the same author multiple times
df_id_au_unlisted <- df_id_au_unlisted[!duplicated(df_id_au_unlisted$ID),]
dim(df_id_au_unlisted)

## [1] 425   2

#add author names to the graph
V(g2)$label <- df_id_au_unlisted$Names[match(V(g2)$name, df_id_au_unlisted$ID)]
# the following code counts labelled (FALSE) and unlabelled (TRUE) vertices; only author vertices have labels, so TRUE is the number of publications
table(is.na(V(g2)$label))

## 
## FALSE  TRUE 
##   425   154

#to deal with any discrepancy caused by names going missing when matching IDs and names, use an ifelse statement
# if a vertex's label is missing, use the vertex's name (its ID or EID) as the label; otherwise keep the existing label
V(g2)$label <- ifelse(is.na(V(g2)$label), V(g2)$name, V(g2)$label)
table(is.na(V(g2)$label))

## 
## FALSE 
##   579

###Transformations to retain actors
mat_g2_incidence <- get.incidence(g2) 
dim(mat_g2_incidence)

## [1] 425 154

# multiply (425, 154) by its transpose (154, 425) to get a (425, 425) author-by-author matrix, retaining actors in a one-mode network
mat_g2_incidence_to_1 <- mat_g2_incidence %*% t(mat_g2_incidence)
dim(mat_g2_incidence_to_1)

## [1] 425 425
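
A small worked example of this projection, on an invented incidence matrix with three authors and four papers (all IDs hypothetical; base R only):

```r
# Hypothetical incidence matrix: rows = authors, columns = papers
inc <- matrix(
  c(1, 0, 1, 1,   # A1 wrote P1, P3 and P4
    1, 0, 1, 0,   # A2 wrote P1 and P3
    0, 1, 0, 0),  # A3 wrote P2 alone
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A1", "A2", "A3"), c("P1", "P2", "P3", "P4"))
)

# (3 x 4 authors-by-papers) %*% (4 x 3 papers-by-authors) = 3 x 3 authors-by-authors
co <- inc %*% t(inc)

# Before zeroing, the diagonal holds each author's own paper count
papers_per_author <- diag(co)

# Zero the diagonal so that no author is tied to themselves
diag(co) <- 0

print(co)
```

Each off-diagonal cell counts the papers two authors share, and the solo author keeps a row and column of zeros, so they stay in the one-mode matrix as an isolate.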

# set the diagonal to 0 so that authors are not connected to themselves
diag(mat_g2_incidence_to_1) <- 0
g <- graph.adjacency(mat_g2_incidence_to_1, mode = "undirected")
g

## IGRAPH 67e5a42 UN-- 425 597 -- 
## + attr: name (v/c)
## + edges from 67e5a42 (vertex names):
##  [1] 57209211014--55505853200 57209211014--57320369800 57206698047--36154796800
##  [4] 57206698047--57685428900 57222036217--57202216646 57222036217--57203923749
##  [7] 57672771100--57194830879 56285357900--56575051400 8247022600 --6507838357 
## [10] 57204936425--57200916290 57204936425--57750635500 35409037500--16237824500
## [13] 50261568800--57205094161 58044794200--57209485440 58044794200--39362289600
## [16] 58044794200--55560959100 7004594995 --7006464833  7004594995 --7006544785 
## [19] 57198347616--57190220967 57198347616--57198345447 57198347616--57198345447
## [22] 57198347616--24173606100 57198347616--24173606100 57198347616--24173606100
## + ... omitted several edges

EDUC 7847 Deliverable 3

Yizheng Shuai

June 25, 2023
