Useful commands or R Markdown Cheat Sheet
Ignacio et al. (2022); Banyard, Hamby, and Grych (2017); Cai et al. (2024); Cano et al. (2020); Cromer et al. (2019); Hong et al. (2018); Ferber and Weller (2022); Callaghan et al. (2019); Chan et al. (2023); Weidacker et al. (2022); Hirshfeld-Becker et al. (2019); Gur et al. (2020); Bram, Gottschalk, and Leeds (2018); Wymbs et al. (2020); Bernstein and McNally (2018); Gildawie, Honeycutt, and Brenhouse (2020); Kuhlman et al. (2023); Kayser et al. (2019); Dvorsky et al. (2019); Kirby et al. (2022); Murphy et al. (2017); Carter, Powers, and Bradley (2020); Ramaiya et al. (2018); Sendzik et al. (2017); Wu, Slesnick, and Murnan (2018); Cornwell et al. (2024); Skinner et al. (2020); Gibbons and Bouldin (2019); Rodman et al. (2019); Higheagle Strong et al. (2020); McRae et al. (2017); Martinez Jr et al. (2022); Vannucci et al. (2019); Noroña-Zhou and Tung (2021); Motsan, Yirmiya, and Feldman (2022); Rich et al. (2019); Krause et al. (2018); Sui et al. (2020); Grych et al. (2020); Vega-Torres et al. (2020); Kliewer and Parham (2019); Griffith, Farrell-Rosen, and Hankin (2023); Siciliano et al. (2023); White et al. (2021); Bettis et al. (2019); Rudolph et al. (2024); Linke et al. (2020); Cornwell et al. (2023); Criss et al. (2017); Koban et al. (2017); Kashdan et al. (2020); Priel et al. (2020); Hannan et al. (2017); Malberg (2023); Tang, Tang, and Gross (2019); Caceres et al. (2024); Gupta, Dickey, and Kujawa (2022); Gee (2022); Khahra et al. (2024); Szoko et al. (2023); Barzilay et al. (2020); Jiang, Paley, and Shi (2022); Xiao et al. (2019); Finkelstein-Fox, Park, and Riley (2018); Sevinc et al. (2019); Blair et al. (2018); Buthmann et al. (2024); Schäfer et al. (2017); and Smith and Jen’nan (2024) examined a range of populations and settings to assess how factors such as life experience, environment, and physical and mental health relate to emotion regulation and the development of resilience.
This is where the steps go
#clean data, prepare author names that match ID, get one-mode matrix g and two-mode matrix g2
library(splitstackshape)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
source <- read.csv("/Users/jinzeyang/Desktop/Fall 2024/Social Networks Analysis/A_Delieverables/US emotion regulation.csv")
dim(source)
## [1] 70 25
# clean the data, keep only authors and articles
source <- source[source$Author.s..ID!="[No author id available]",]
dim(source)
## [1] 70 25
#Addressing names and titles
#Create labels for each author into V(g)
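# Strip the " II." suffix, with or without a trailing comma; fixed = TRUE makes the dot literal.
# Removal of other suffixes (" Jr.", " M.S.") is currently disabled.
source$au <- gsub(" II.", "", gsub(" II.,", "", source[, 1], fixed = TRUE), fixed = TRUE)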
#source$au <- gsub("\\.,", ";", source$au)
source$au <- tolower(source$au)
head(source$au )
## [1] "ignacio d.a.; emick-seibert j.; serpas d.g.; fernandez y.s.; bargotra s.; bush j."
## [2] "banyard v.; hamby s.; grych j."
## [3] "cai y.; she x.; singh m.k.; wang h.; wang m.; abbey c.; rozelle s.; tong l."
## [4] "cano m.á.; castro f.g.; de la rosa m.; amaro h.; vega w.a.; sánchez m.; rojas p.; ramírez-ortiz d.; taskin t.; prado g.; schwartz s.j.; córdova d.; salas-wright c.p.; de dios m.a."
## [5] "cromer k.d.; d'agostino e.m.; hansen e.; alfonso c.; frazier s.l."
## [6] "hong f.; tarullo a.r.; mercurio a.e.; liu s.; cai q.; malley-morrison k."
# Creating adjacency list from names
df_all_authors <- as.data.frame(source[ , ncol(source)]) #use ncol(data.frame) to get the last column of the data.frame
colnames(df_all_authors) <- "AU"
head(df_all_authors)
## AU
## 1 ignacio d.a.; emick-seibert j.; serpas d.g.; fernandez y.s.; bargotra s.; bush j.
## 2 banyard v.; hamby s.; grych j.
## 3 cai y.; she x.; singh m.k.; wang h.; wang m.; abbey c.; rozelle s.; tong l.
## 4 cano m.á.; castro f.g.; de la rosa m.; amaro h.; vega w.a.; sánchez m.; rojas p.; ramírez-ortiz d.; taskin t.; prado g.; schwartz s.j.; córdova d.; salas-wright c.p.; de dios m.a.
## 5 cromer k.d.; d'agostino e.m.; hansen e.; alfonso c.; frazier s.l.
## 6 hong f.; tarullo a.r.; mercurio a.e.; liu s.; cai q.; malley-morrison k.
# As can be seen in the file, the author names within each record are separated by semicolons:
authornames_split <- cSplit(df_all_authors, splitCols = "AU", sep = ";", direction = "wide", drop = TRUE) #retain the matrix form version of the adjacency list input
dim(authornames_split)
## [1] 70 16
#Adjacency list from authors' IDs, that is, split the authorsID column
df_authors_id <- as.data.frame(source[ , 3])
colnames(df_authors_id) <- "AU"
author_id_split <- cSplit(df_authors_id, splitCols = "AU", sep = ";", direction = "wide", drop = TRUE)
dim(author_id_split)
## [1] 70 16
# Start unlisting: flatten both adjacency lists column by column so that each ID lines up with its name
df_authorid_authorname_unlisted <- data.frame(id = unlist(author_id_split), names = unlist(authornames_split))
dim(df_authorid_authorname_unlisted)
## [1] 1120 2
#remove NA
df_authorid_authorname_unlisted <- df_authorid_authorname_unlisted[!is.na(df_authorid_authorname_unlisted$id),]
dim(df_authorid_authorname_unlisted)
## [1] 407 2
# Remove duplicated IDs so that each author appears only once.
# The rows dropped here belong to authors who published more than one article (> 1).
df_authorid_authorname_unlisted <- df_authorid_authorname_unlisted[!duplicated(df_authorid_authorname_unlisted$id),]
dim(df_authorid_authorname_unlisted)
## [1] 370 2
#most Prolific authors by ID
pub_count <- as.data.frame(table(unlist(author_id_split)))
# summarize how many times each author ID appears
summary(pub_count)
## Var1 Freq
## 6506441612: 1 Min. :1.0
## 6506508954: 1 1st Qu.:1.0
## 6506979787: 1 Median :1.0
## 6507073144: 1 Mean :1.1
## 6507460719: 1 3rd Qu.:1.0
## 6507545038: 1 Max. :3.0
## (Other) :364
#match names and id
df_authorid_authorname_unlisted$pub_count <- pub_count$Freq[match(df_authorid_authorname_unlisted$id, pub_count$Var1)]
head(df_authorid_authorname_unlisted)
## id names pub_count
## AU_011 57211550278 ignacio d.a. 1
## AU_012 57192659150 banyard v. 2
## AU_013 59169224100 cai y. 1
## AU_014 36141884200 cano m.á. 1
## AU_015 57070244400 cromer k.d. 1
## AU_016 57196352348 hong f. 1
# Retrieve the names ordered by publication count, most prolific first
a <- df_authorid_authorname_unlisted
a <- a[order(a$pub_count, decreasing = TRUE), ]
head(a)
## id names pub_count
## AU_0621 7004847215 compas b.e. 3
## AU_012 57192659150 banyard v. 2
## AU_0112 7103065698 gur r.e. 2
## AU_0126 57194620691 cornwell h. 2
## AU_0139 6603834101 grych j. 2
## AU_0140 57208803309 smith m.r. 2
Answer:
The first problem we intend to solve is node representation. Authors have long and complex names, which clutter the network visualization. An adjacency list of IDs gives every author a unique, compact identifier, which makes the network easier to read and simplifies the computation and the network analysis.
The second problem is that ID numbers cannot be interpreted directly. The adjacency list of names lets us map each ID back to a specific author, so findings such as the most prolific authors or particular collaborations can be reported by name rather than by ID number.
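The ID-to-name mapping described above can be illustrated with a minimal sketch. The IDs, names, and the `lookup` and `top_ids` objects are hypothetical and are not taken from the dataset; only the `match()` pattern mirrors the code earlier in this report.

```r
# Toy lookup table (hypothetical IDs and names, not from the dataset)
lookup <- data.frame(
  id    = c("101", "102", "103"),
  names = c("lee a.", "kim b.", "diaz c."),
  stringsAsFactors = FALSE
)

# A result expressed only in IDs, e.g. two nodes of interest
top_ids <- c("103", "101")

# match() returns the row position of each ID in the lookup table,
# so indexing the names column translates IDs into readable labels
top_names <- lookup$names[match(top_ids, lookup$id)]
print(top_names)
```

The analysis can then run entirely on IDs, and names are attached only when results are reported.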
Answer: The dimensions are (70, 16): 70 rows, one per article, and 16 columns, the largest number of authors on any single article.
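To see why the column count equals the longest author list, here is a small base R sketch. It is not `cSplit()` itself, only an assumed equivalent for illustration, and the author strings are hypothetical.

```r
# Hypothetical author strings for three articles
au <- c("lee a.; kim b.", "diaz c.", "lee a.; diaz c.; park d.; kim b.")

# Split on the semicolon and trim the surrounding whitespace
parts <- lapply(strsplit(au, ";", fixed = TRUE), trimws)

# Pad every article with NA up to the length of the longest author list
width <- max(lengths(parts))
wide <- t(sapply(parts, function(p) c(p, rep(NA, width - length(p)))))

# Rows = articles, columns = maximum number of authors on one article
dim(wide)
```

Shorter author lists are padded with NA, which is why most cells in the wide table are empty.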
Answer:
One possible source of the discrepancy is the label cleaning: removing labels from the name strings can alter the names in each column and therefore the number of columns after splitting. We need to check the labels in the “source” data frame and avoid removing text that is part of a name. Another possibility is that the original dataset uses a delimiter other than the one we assume (e.g., commas rather than semicolons). Either issue could cause incorrect splitting when the adjacency lists are created.
The discrepancy might also arise because the same ID is associated with different names. Authors may use different name formats or change their last names after marriage or for other reasons. As a result, we may have failed to match every name to the same author (ID) in the dataset.
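One way to check whether an ID carries more than one name spelling is sketched below. The `pairs` data frame and its values are hypothetical; in this report the check would have to run on the unlisted ID-name pairs before duplicates are removed.

```r
# Hypothetical unlisted ID-name pairs, before duplicates are removed
pairs <- data.frame(
  id    = c("101", "102", "101", "103", "102"),
  names = c("lee a.", "kim b.", "lee a.b.", "diaz c.", "kim b."),
  stringsAsFactors = FALSE
)

# Count the distinct name spellings recorded for each ID
n_spellings <- tapply(pairs$names, pairs$id, function(x) length(unique(x)))

# IDs with more than one spelling deserve a manual look
ambiguous_ids <- names(n_spellings)[n_spellings > 1]
pairs[pairs$id %in% ambiguous_ids, ]
```

Swapping the roles of `id` and `names` in the `tapply()` call would flag the opposite case, one name shared by several IDs.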
Answer:
The 370 rows are 370 unique pairings of an author ID with an author name, that is, the 370 distinct authors in the co-authorship network.
Answer:
No, it does not match. Object g has 376 authors, whereas here we have 370.
Answer:
The mismatch arises because some articles have a single author. We did not remove solo authors in deliverable 1, but we do remove them in this deliverable.
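Whether solo-authored articles account for the gap can be checked directly. The sketch below uses a hypothetical wide ID table (`id_split`) with the same shape as `author_id_split`; it illustrates the check and does not reproduce a result from the dataset.

```r
# Hypothetical wide adjacency list: one row per article, NA pads short author lists
id_split <- data.frame(
  AU_1 = c("101", "104", "102"),
  AU_2 = c("102", NA, "103"),
  AU_3 = c("103", NA, NA),
  stringsAsFactors = FALSE
)

# An article is solo-authored when exactly one cell in its row is filled
is_solo <- rowSums(!is.na(id_split)) == 1

# Collect the IDs on solo articles and on co-authored articles
solo_ids <- unname(unlist(id_split[is_solo, ]))
solo_ids <- solo_ids[!is.na(solo_ids)]
coauthored_ids <- unname(unlist(id_split[!is_solo, ]))
coauthored_ids <- coauthored_ids[!is.na(coauthored_ids)]

# Authors who appear only on solo articles have no co-authorship ties
isolates <- setdiff(solo_ids, coauthored_ids)
print(isolates)
```

Comparing `length(isolates)` with the difference between the two author counts would show how much of the mismatch solo authors explain.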
Answer:
Here, unlisting flattens the ID adjacency list and the name adjacency list into two vectors of the same length, so that each ID and each name becomes an individual entry. Because the two lists have the same dimensions, placing the two vectors side by side in one data frame pairs every ID with its name; we then remove the NA cells and the duplicated entries. The result is a specific connection between each ID and its name, which prepares the data for the later co-authorship network analysis.
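The pairing relies on `unlist()` traversing both data frames in the same column-by-column order. A minimal sketch with hypothetical `ids` and `nms` tables, stand-ins for `author_id_split` and `authornames_split`:

```r
# Hypothetical wide tables of identical shape: cell (i, j) in one matches cell (i, j) in the other
ids <- data.frame(AU_1 = c("101", "104"), AU_2 = c("102", NA), stringsAsFactors = FALSE)
nms <- data.frame(AU_1 = c("lee a.", "park d."), AU_2 = c("kim b.", NA), stringsAsFactors = FALSE)

# unlist() walks a data frame column by column, so both vectors stay aligned
pairs <- data.frame(id = unlist(ids), names = unlist(nms), stringsAsFactors = FALSE)

# Drop the NA padding, then keep the first row seen for each ID
pairs <- pairs[!is.na(pairs$id), ]
pairs <- pairs[!duplicated(pairs$id), ]
pairs
```

The alignment holds only if both tables have identical dimensions and the same pattern of NA cells, which is worth verifying with `dim()` before unlisting.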