1.1 In your own words, explain what problems we aimed to solve with the creation of an ID adjacency list and a names adjacency list?
We want one ID for each person, no matter how many times they change their name. We can always attach more than one name to the same ID.
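A minimal sketch of this idea, using made-up IDs and names (not taken from our Scopus file): the same person can appear under several spellings, and keying on the ID keeps them as one author.

```r
# Hypothetical records: one person (ID "101") appears under two name spellings
records <- data.frame(
  id    = c("101", "102", "101"),
  names = c("smith j.", "lee k.", "smith-jones j."),
  stringsAsFactors = FALSE
)

# Counting by name splits that person in two; counting by ID does not
n_by_name <- length(unique(records$names))
n_by_id   <- length(unique(records$id))

# Keep every name variant attached to its ID
name_variants <- split(records$names, records$id)
```

Here `name_variants[["101"]]` holds both spellings, so no name is lost even though the person is counted once.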
2.1 During your cleaning procedures, did your adjacency lists for names and ID have the same dimensions?
Yes
2.2 If so, what is the meaning of these dimensions?
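The matching dimensions indicate that both lists cover the same number of articles (rows) and the same maximum number of authors per article (columns), which suggests that every author name has a corresponding ID and that we are not missing any IDs.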
2.3 Let us assume that you did not have the same dimensions, what would be the reason for this discrepancy and how would you address this issue?
Possible causes include download errors from Scopus or human error. We would go back to the source file and check which names or IDs are missing.
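One way to locate such a discrepancy, sketched with made-up strings rather than our data, is to count the entries per article in each column and flag the rows where the counts disagree. The vectors `au_names` and `au_ids` are hypothetical stand-ins for the cleaned name and ID columns.

```r
# Hypothetical raw strings, ";"-separated like our cleaned columns
au_names <- c("smith j.; lee k.", "garcia m.", "chen y.; patel r.; kim s.")
au_ids   <- c("101; 102", "103", "104; 105")  # third article is missing one ID

# Count entries per article in each column
n_names <- lengths(strsplit(au_names, ";", fixed = TRUE))
n_ids   <- lengths(strsplit(au_ids, ";", fixed = TRUE))

# Rows where the counts disagree are the ones to inspect by hand
mismatch_rows <- which(n_names != n_ids)
```

A row-level check like this is stricter than comparing `dim()` alone, because two lists can share the same dimensions while an individual article still has unequal numbers of names and IDs.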
As part of the feature engineering procedures, we created a database with ID and Name columns.
3.1 What is the meaning of the number of rows of this database?
The number of unique authors in our database, since each author ID is kept only once.
3.2 Does this number match the number of unique actors you have in your first deliverable in the object g (i.e., the graph we created)?
Yes
3.3 If the number does not match, what do you think is the reason for this mismatch?
In the first deliverable, we eliminated all single-author articles, whereas now we have included every article. That could be the reason for a mismatch.
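If the counts did differ, a set comparison would show which authors cause the gap. This is a sketch with made-up IDs; `table_ids` and `graph_ids` are hypothetical names, and in practice the second vector might come from something like `V(g)$name`, assuming the graph's vertices are named by author ID.

```r
# Hypothetical ID sets: authors in the ID-name table vs. vertices in the graph
table_ids <- c("101", "102", "103", "104")
graph_ids <- c("101", "102", "104")

# IDs present in one set but not in the other
only_in_table <- setdiff(table_ids, graph_ids)
only_in_graph <- setdiff(graph_ids, table_ids)
```

Authors listed in `only_in_table` would be candidates for having published only single-author articles.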
4.1 Please explain the role of unlisting in this data cleaning or feature engineering process.
Unlisting flattens the wide ID and name adjacency lists into two aligned vectors, which gives us an information bank that matches IDs with authors’ names for later use. We also removed the duplicate entries for authors who published more than once.
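The unlisting step can be illustrated on a tiny made-up example. The two data frames below are hypothetical stand-ins for the wide adjacency lists, with one row per article and one column per author slot.

```r
# Hypothetical wide adjacency lists (NA = empty author slot)
id_wide <- data.frame(
  AU_1 = c("101", "103"),
  AU_2 = c("102", "101"),
  AU_3 = c(NA, "104"),
  stringsAsFactors = FALSE
)
name_wide <- data.frame(
  AU_1 = c("smith j.", "garcia m."),
  AU_2 = c("lee k.", "smith j."),
  AU_3 = c(NA, "chen y."),
  stringsAsFactors = FALSE
)

# unlist() flattens both tables column by column, so IDs and names stay aligned
lookup <- data.frame(id = unlist(id_wide), names = unlist(name_wide),
                     stringsAsFactors = FALSE)

# Drop empty slots, then keep the first occurrence of each ID
lookup <- lookup[!is.na(lookup$id), ]
lookup <- lookup[!duplicated(lookup$id), ]
```

Author "101" appears in both articles but ends up as a single row in `lookup`, which is why the final table has one row per unique author.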
5.1 Who are the top five most prolific authors? And what are they publishing about?
Bialystok E., Marin-Marin L., Costumero V., Malyshevskaya A.S., and Berkes M.
They publish on the differences between monolingual and bilingual brains, and on methods for training the cognitive reserve of the human brain.
setwd("/Users/zufuquan/Downloads")
#clean data, prepare author names that match ID, get one-mode matrix g and two-mode matrix g2
library(splitstackshape)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
source <- read.csv("/Users/zufuquan/Downloads/scopus (4).csv")
# clean the data, keep only authors and articles
source <- source[source$Author.s..ID!="[No author id available]",]
#Addressing names and titles
#Create labels for each author into V(g)
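#Strip name suffixes; fixed = TRUE treats "." as a literal dot, not a regex wildcard
#Only the suffix is removed, so the "., " separator between authors is preserved
source$au <- source[,1]
for (suffix in c(" Jr.", " II.", " M.S.")) {
  source$au <- gsub(suffix, "", source$au, fixed = TRUE)
}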
source$au <- gsub("\\.,", ";", source$au)
source$au <- tolower(source$au)
# Creating adjacency list from names
df_all_authors <- as.data.frame(source[ , ncol(source)]) #use ncol(data.frame) to get the last column of the data.frame
colnames(df_all_authors) <- "AU"
#After the cleaning above, authors' names are separated by ";":
authornames_split <- cSplit(df_all_authors, splitCols = "AU", sep = ";", direction = "wide", drop = TRUE) #retain the matrix form version of the adjacency list input
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE
dim(authornames_split)
## [1] 28 11
#Adjacency list from authors' IDs, that is, split the authorsID column
df_authors_id <- as.data.frame(source[ , 3])
colnames(df_authors_id) <- "AU"
author_id_split <- cSplit(df_authors_id, splitCols = "AU", sep = ";", direction = "wide", drop = TRUE)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE
dim(author_id_split)
## [1] 28 11
#If the dimensions of the ID and name adjacency lists match, we are good; if not, we need to find the differences
df_authorid_authorname_unlisted <- data.frame(id = unlist(author_id_split), names = unlist(authornames_split))
df_authorid_authorname_unlisted <- df_authorid_authorname_unlisted[!is.na(df_authorid_authorname_unlisted$id),]
df_authorid_authorname_unlisted <- df_authorid_authorname_unlisted[!duplicated(df_authorid_authorname_unlisted$id),]
#most Prolific authors by ID
pub_count <- as.data.frame(table(unlist(author_id_split)))
df_authorid_authorname_unlisted$pub_count <- pub_count$Freq[match(df_authorid_authorname_unlisted$id, pub_count$Var1)]
head(df_authorid_authorname_unlisted[order(df_authorid_authorname_unlisted$pub_count, decreasing=T), ],5)
## id names pub_count
## AU_0228 7004079669 bialystok e. 4
## AU_018 57211234788 marin-marin l. 3
## AU_0112 55385615100 costumero v. 3
## AU_016 58031163900 malyshevskaya a.s. 2
## AU_0116 57190801405 berkes m. 2
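a <- df_authorid_authorname_unlisted
#The data frame has only three columns (id, names, pub_count), so selecting column 5 returned NULL; sort the rows and keep all columns instead
a <- a[order(a$pub_count, decreasing=T), ]
head(a, 5)
## id names pub_count
## AU_0228 7004079669 bialystok e. 4
## AU_018 57211234788 marin-marin l. 3
## AU_0112 55385615100 costumero v. 3
## AU_016 58031163900 malyshevskaya a.s. 2
## AU_0116 57190801405 berkes m. 2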