In your own words, explain what problems we aimed to solve by creating an ID adjacency list and a names adjacency list.
We aimed to handle authors who have published under more than one name but keep the same ID across all their publications. Duplicated IDs reveal that these records belong to the same person; the differing names do not. To clean the data and eliminate duplicates, we created an ID adjacency list and a names adjacency list so that each author name can be matched to its ID.
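As a minimal sketch of this problem, using made-up names and IDs rather than records from the dataset, counting distinct names overstates the number of authors while counting distinct IDs does not:

```r
# Hypothetical records: one author (ID 101) appears under two name variants
authors <- data.frame(
  ID    = c("101", "101", "202"),
  Names = c("smith j.", "smith j.a.", "lee k."),
  stringsAsFactors = FALSE
)

n_by_name <- length(unique(authors$Names))  # name variants are counted separately
n_by_id   <- length(unique(authors$ID))     # each person is counted once

# Keep the first name seen for each ID
unique_authors <- authors[!duplicated(authors$ID), ]
```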
During your cleaning procedures, did your adjacency lists for names and ID have the same dimensions? If so, what is the meaning of these dimensions?
Yes, the names adjacency list and the ID adjacency list both have dimensions (154, 11), that is, 154 rows and 11 columns. In the context of this dataset, it means there are 154 papers and the paper with the most co-authors has 11 authors. Matching dimensions also suggest that there are no discrepancies between the author names and their IDs as recorded in the dataset, so each name entry is paired with an author ID. Equal dimensions alone do not guarantee that every row holds as many names as IDs, however.
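A stricter check than comparing dimensions is to compare the number of filled cells per paper. The sketch below uses two small hypothetical adjacency lists (`au_toy` and `id_toy` are illustrative names, not objects from the analysis):

```r
# Hypothetical adjacency lists: one row per paper, one column per author slot
au_toy <- data.frame(AU_1 = c("smith j.", "lee k."), AU_2 = c("doe a.", NA),
                     stringsAsFactors = FALSE)
id_toy <- data.frame(ID_1 = c("101", "202"), ID_2 = c("303", NA),
                     stringsAsFactors = FALSE)

# Same shape overall?
same_dims <- identical(dim(au_toy), dim(id_toy))

# Same number of names and IDs on every paper?
names_per_paper <- rowSums(!is.na(au_toy))
ids_per_paper   <- rowSums(!is.na(id_toy))
mismatched_rows <- which(names_per_paper != ids_per_paper)
```

An empty `mismatched_rows` indicates that every paper has as many names as IDs; any row number it returns points to a paper worth inspecting by hand.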
Let us assume that you did not have the same dimensions, what would be the reason for this discrepancy and how would you address this issue?
A likely cause is a name suffix ending in “.”, such as “Jr.”, “II.”, or “M.S.”, which R can mistake for the boundary between two authors, so that one author is split and treated as two distinct authors. That author would then occupy two columns in the names adjacency list but only one column in the ID adjacency list, causing R to return different dimensions for the two lists. To address this, we can use the gsub function to replace such suffixes with “” before creating the names adjacency list.
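The effect of a suffix can be sketched on a single made-up author string. The string and the helper `split_authors` are hypothetical; the separator step imitates the `gsub("\\.,", ";", ...)` call used in the code for this assignment.

```r
# Hypothetical author string in the "Surname I., Surname I." layout
raw <- "Smith J., Jr., Doe A., Lee K."

# Replace the "., " boundary with ";" and split into one name per element
split_authors <- function(x) {
  trimws(strsplit(gsub("\\.,", ";", x), ";", fixed = TRUE)[[1]])
}

without_fix <- split_authors(raw)
# fixed = TRUE matches " Jr.," literally, so "." is not a regex wildcard
with_fix <- split_authors(gsub(" Jr.,", "", raw, fixed = TRUE))

length(without_fix)  # "Jr" is counted as an extra author
length(with_fix)
```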
As part of the feature engineering procedures, we created a database with ID and Name columns. What is the meaning of the number of rows of this database?
The number of rows means the number of unique authors as captured in this dataset, which is 425.
Does this number match the number of unique actors you have in your first deliverable in the object g (i.e., the graph we created)?
No, the numbers do not match. There are only 401 unique authors in the graph but 425 unique authors in the dataset.
If the number does not match, what do you think is the reason for this mismatch?
One likely reason is that solo authors are removed from the graph because they are not connected to anyone else, while they are still included in the database with ID and Name columns. That would account for the 24 authors (425 minus 401) missing from the graph.
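One way to test this explanation is to count the authors who never share a paper with anyone. The sketch below does this on a small hypothetical ID adjacency list (`id_toy` is an illustrative name, not an object from the analysis):

```r
# Hypothetical ID adjacency list: paper 1 has two authors, paper 2 has one
id_toy <- data.frame(ID_1 = c("101", "202"), ID_2 = c("303", NA),
                     stringsAsFactors = FALSE)

ids <- unlist(id_toy, use.names = FALSE)
all_ids <- unique(ids[!is.na(ids)])

# IDs that appear on at least one paper with two or more authors
coauthored <- rowSums(!is.na(id_toy)) > 1
connected <- unlist(id_toy[coauthored, ], use.names = FALSE)
connected_ids <- unique(connected[!is.na(connected)])

# Authors with no co-author at all; these have no edges in a co-authorship graph
solo_only <- setdiff(all_ids, connected_ids)
```

If the explanation holds, the same calculation on the real `id_split` should return 24 IDs. Assuming `g` is an igraph object whose vertex names are author IDs (an assumption, since its construction is not shown here), `setdiff(df_id_au_unlisted$ID, igraph::V(g)$name)` would be a more direct comparison.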
Please explain the role of unlisting in this data cleaning or feature engineering process.
Unlisting the author names decomposes the names adjacency list into individual names and places them one per row. Unlisting the author IDs does the same for the IDs. After unlisting, we can clean up the NAs and duplicates present in the IDs. Because the names column and the IDs column are combined in one data frame, dropping a duplicated ID also drops the name attached to it. For an author who appears under several names, duplicated() keeps the first occurrence in the unlisted order, which corresponds to the name on the most recent publication only if the export is sorted with the newest papers first. The result is a table of unique authors and their IDs. Only after performing this step can we proceed to count the number of papers written by each unique author using the table function.
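The steps described above can be sketched on two tiny hypothetical adjacency lists. All names and IDs here are made up, and plain data frames stand in for the objects that cSplit returns.

```r
# Hypothetical adjacency lists with matching shapes (one row per paper)
id_toy <- data.frame(ID_1 = c("101", "101"), ID_2 = c("303", NA),
                     stringsAsFactors = FALSE)
au_toy <- data.frame(AU_1 = c("smith j.", "smith j.a."), AU_2 = c("doe a.", NA),
                     stringsAsFactors = FALSE)

# unlist() stacks the columns into one long vector, column by column
long <- data.frame(ID    = unlist(id_toy, use.names = FALSE),
                   Names = unlist(au_toy, use.names = FALSE),
                   stringsAsFactors = FALSE)

long    <- long[!is.na(long$ID), ]        # drop empty author slots
authors <- long[!duplicated(long$ID), ]   # the first name seen per ID is kept

# Papers per author, counted on IDs before de-duplication
pubs <- table(long$ID)
authors$pub_count <- as.integer(pubs[authors$ID])
```

Pairing the two unlisted vectors by position only works when both adjacency lists have the same shape, which is why the dimension check comes first.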
Who are the top five most prolific authors? And what are they publishing about?
Within this dataset, 6 authors have published more than 1 paper and all the rest have published only 1. The 6 most prolific authors and their publications are therefore:
- vongkulluksn v.w.:
The role of value on teachers’ internalization of external barriers and externalization of personal beliefs for classroom technology integration
Cognitive engagement with technology scale: a validation study
Investing time in technology: Teachers’ value beliefs and time cost profiles for classroom technology integration
E-Reader apps and reading engagement: A descriptive case study
Teachers’ exposure to professional development and the quality of their instructional technology use: The mediating role of teachers’ value and ability beliefs
- xie k.:
The role of value on teachers’ internalization of external barriers and externalization of personal beliefs for classroom technology integration
Cognitive engagement with technology scale: a validation study
Investing time in technology: Teachers’ value beliefs and time cost profiles for classroom technology integration
Teachers’ exposure to professional development and the quality of their instructional technology use: The mediating role of teachers’ value and ability beliefs
- donham c.:
I will teach you here or there, I will try to teach you anywhere: perceived supports and barriers for emergency remote teaching during the COVID-19 pandemic
Increasing Student Engagement through Course Attributes, Community, and Classroom Technology: Lessons from the Pandemic
- bowman m.a.:
The role of value on teachers’ internalization of external barriers and externalization of personal beliefs for classroom technology integration
Teachers’ exposure to professional development and the quality of their instructional technology use: The mediating role of teachers’ value and ability beliefs
- menke e.:
I will teach you here or there, I will try to teach you anywhere: perceived supports and barriers for emergency remote teaching during the COVID-19 pandemic
Increasing Student Engagement through Course Attributes, Community, and Classroom Technology: Lessons from the Pandemic
- kranzfelder p.:
I will teach you here or there, I will try to teach you anywhere: perceived supports and barriers for emergency remote teaching during the COVID-19 pandemic
Increasing Student Engagement through Course Attributes, Community, and Classroom Technology: Lessons from the Pandemic
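Title lists like the ones above can be retrieved by looking up an author ID in each paper's ID string. This is a sketch under assumptions: the `papers` data frame is a made-up miniature of the export, the `Title` column name is assumed, and `titles_for` is a hypothetical helper.

```r
# Hypothetical miniature of the export: IDs separated by ";" within each paper
papers <- data.frame(
  Title        = c("Paper A", "Paper B", "Paper C"),
  Author.s..ID = c("101; 303", "101", "202"),
  stringsAsFactors = FALSE
)

titles_for <- function(id, data) {
  # Split each paper's ID string and test for an exact match,
  # so that ID "10" never matches "101"
  has_id <- vapply(strsplit(data$Author.s..ID, ";", fixed = TRUE),
                   function(ids) id %in% trimws(ids), logical(1))
  data$Title[has_id]
}

titles_101 <- titles_for("101", papers)
```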
#read the dataset downloaded from Scopus
b <- read.csv("/Users/joyceshuai/Desktop/SSNA/classtech_tenyrs.csv")
b <- b[b$Author.s..ID!="[No author id available]",]
dim(b)
## [1] 154 23
#Create a column called "au" to store author names, and remove suffixes that could be mistaken for separate authors
b$au <- b[, 1]
#Apply each substitution to b$au (not b[,1]) so earlier substitutions are not overwritten;
#fixed = TRUE treats "." literally, and the comma variants come first so no stray comma is left behind
for (suffix in c(" Jr.,", " II.,", " M.S.,", " Jr.", " II.", " M.S.")) {
  b$au <- gsub(suffix, "", b$au, fixed = TRUE)
}
b$au <- gsub("\\.,", ";", b$au)
b$au <- tolower(b$au)
dim(b)
## [1] 154 24
# Creating adjacency list from au
df_au <- as.data.frame(b[ , ncol(b)])
colnames(df_au) <- "AU"
#separate authors' names into each column
library(splitstackshape)
au_split <- cSplit(df_au, splitCols = "AU", sep = ";", direction = "wide", drop = TRUE)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE
dim(au_split)
## [1] 154 11
#Creating adjacency list from authorsID
df_id <- as.data.frame(b[,3])
colnames(df_id) <- "ID"
#separate ids into each column
id_split <- cSplit(df_id, splitCols = "ID", sep = ";", direction = "wide", drop = TRUE)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE
dim(id_split)
## [1] 154 11
#Decompose authors and IDs and put them into a dataframe with 2 columns, one called "ID" and one called "Names"
df_id_au_unlisted <- data.frame(ID = unlist(id_split), Names = unlist(au_split))
#drop the rows where the "ID" is NA, i.e. those with no more co-authors
df_id_au_unlisted <- df_id_au_unlisted[!is.na(df_id_au_unlisted$ID),]
#drop duplicates in the column "ID" so that the same author is not counted multiple times
df_id_au_unlisted <- df_id_au_unlisted[!duplicated(df_id_au_unlisted$ID),]
dim(df_id_au_unlisted)
## [1] 425 2
#find the most prolific authors by ID
pub_count <- as.data.frame(table(unlist(id_split)))
# link the two tables by matching IDs, generate a new column called pub_count in df_id_au_unlisted
df_id_au_unlisted$pub_count <- pub_count$Freq[match(df_id_au_unlisted$ID, pub_count$Var1)]
#sort by the column "pub_count" in decreasing order; head() returns the first 6 rows by default
head(df_id_au_unlisted[order(df_id_au_unlisted$pub_count, decreasing = TRUE), ])
##                  ID             Names pub_count
## ID_0113 57198347616 vongkulluksn v.w.         5
## ID_0213 24173606100            xie k.         4
## ID_0153 57207690066         donham c.         2
## ID_0184 57198345447       bowman m.a.         2
## ID_0364 34975670700          menke e.         2
## ID_0464 58052982400    kranzfelder p.         2