1 Introduction of the Dataset:

The topic you selected (with a description of why you are interested in such a topic):

The keyword I used to search for paper entries on Scopus is “classroom technology”. I meant to look for papers that discuss the adoption of technical or technological devices and/or means in classroom teaching contexts (or the lack thereof). I am interested in this topic because I perceive a wider scale of technology integration into classrooms in recent years, especially after the pandemic.
The timespan (why):

I limited the timespan to the most recent 10 years, that is, from 2014 to 2023. I hoped to see a trend in the number of papers published on this topic and, more specifically, growth in the ways technology is integrated into classrooms as the years go by.
Type of publication (journal article, conference papers, book chapters) mentioning also why you selected this topic:

I also limited the type of publication to journal articles only, and the language to English only. I wanted to exclude textbook and handbook chapters, which are more likely to describe the theoretical basis of technology use than to analyze its real-life applications. These qualifiers returned a decent number of papers, i.e., 154.

2 Deliverable 1 Questions and Answers:

What are the dimensions of the initial adjacency list matrix, and what do those numbers represent (in actual English)?

The dimensions of the initial adjacency list matrix are [154, 11]. This means the dataset contains 154 papers, and the largest number of co-authors on any single paper is 11.
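To make the meaning of the two dimensions concrete, here is a minimal base-R sketch on four made-up papers. The IDs are hypothetical, and `strsplit` stands in for the `cSplit` call used in the steps of this report, so this is only an illustration of the idea, not the exact procedure.

```r
# Toy author-ID strings, one per paper (hypothetical IDs, not from the Scopus file)
papers <- c("101; 102; 103", "104", "101; 105", "106; 107")

# Split each paper's string on ";" and trim the surrounding spaces
split_ids <- lapply(strsplit(papers, ";", fixed = TRUE), trimws)

# Pad every paper with NA up to the length of the longest author list
max_authors <- max(lengths(split_ids))
adj <- t(sapply(split_ids, function(x) c(x, rep(NA, max_authors - length(x)))))

# Rows = number of papers; columns = largest number of authors on one paper
dim(adj)
```

The second paper has a single author, so its remaining cells are NA padding; the number of columns is set by the paper with the most authors.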
After generating your edgelist, how many authors do you have in your resulting network?

There are 403 unique authors in the network, since the network has 403 vertices.
How many connections are there in such a network?

There are 597 connections in the network, since the network has 597 edges.

3 Steps for deliverable 1:

# read the dataset downloaded from Scopus
a <- read.csv("/Users/joyceshuai/Desktop/SSNA/classtech_tenyrs.csv")

# only column 3 contains the author IDs, which is the authorship information needed here
a <- as.data.frame(a[,3])
colnames(a) <- "Authors"

library(splitstackshape)
# split authors in the column into separate columns 
a1 <- cSplit(a, splitCols = "Authors", sep = ";", direction = "wide", drop = TRUE)

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE

# turn a1 into matrix form
mat <- as.matrix(a1)
dim(mat)

## [1] 154  11

# use a for loop to store co-authorship information in an edgelist
edgelist <- matrix(NA, 1, 2)
for (i in 1:(ncol(mat)-1)){
  indiv_edgelist <- cbind(mat[,i], c(mat[,-c(1:i)]))
  edgelist <- rbind(edgelist, indiv_edgelist)
  edgelist <- edgelist[!is.na(edgelist[,2]),]
  edgelist <- edgelist[edgelist[,2]!="",]
}
dim(edgelist)

## [1] 597   2

head(edgelist)

##             [,1]        [,2]
## [1,] 57209211014 55505853200
## [2,] 57206698047 36154796800
## [3,] 57222036217 57202216646
## [4,] 57672771100 57194830879
## [5,] 56285357900 56575051400
## [6,]  8247022600  6507838357

# turn the edgelist into a graph
library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

g <- graph.data.frame(edgelist, directed = FALSE)
dim(g)

## NULL

x11()
plot(g)
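
The `dim(g)` call above returns NULL because `dim()` does not report anything for an igraph graph object. The vertex and edge counts are usually read with igraph's `vcount()` and `ecount()` instead. The sketch below uses a made-up edgelist with hypothetical IDs; the counts are computed in base R first, and the igraph check runs only if the package is installed.

```r
# Toy edgelist (hypothetical IDs): each row is one co-authorship tie
toy_edges <- matrix(c("101", "102",
                      "101", "103",
                      "102", "103",
                      "104", "105"), ncol = 2, byrow = TRUE)

# Base R: unique authors appearing in either column, and the number of ties
n_authors <- length(unique(c(toy_edges)))
n_ties <- nrow(toy_edges)

# Same counts from igraph, if the package is available
if (requireNamespace("igraph", quietly = TRUE)) {
  toy_g <- igraph::graph_from_data_frame(as.data.frame(toy_edges), directed = FALSE)
  stopifnot(igraph::vcount(toy_g) == n_authors,
            igraph::ecount(toy_g) == n_ties)
}
```

Applied to the real graph, `vcount(g)` and `ecount(g)` would be the direct way to confirm the number of authors and connections reported in the answers.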

4 Deliverable 2 Questions and Answers:

In your own words, explain what problems we aimed to solve with the creation of an ID adjacency list and a names adjacency list.

We aimed to solve the potential problem that there may be authors who have used multiple names for their various publications, yet they always have the same ID throughout the publications. We can tell those are the same authors from their duplicated IDs but not from the different names they appear as. To clean the data and eliminate duplicates, we created an ID adjacency list and a names adjacency list to match the authors and their IDs.
During your cleaning procedures, did your adjacency lists for names and ID have the same dimensions? If so, what is the meaning of these dimensions?

Yes, the dimensions of the names adjacency list and the ID adjacency list are both (154, 11). This means both lists have 154 rows and 11 columns. In the context of this dataset, there are 154 papers, and the largest number of co-authors on any single paper is 11. Since the dimensions are the same, it also suggests there are no discrepancies between the authors' names and their IDs as recorded in the dataset: each author name lines up with an author ID.
Let us assume that you did not have the same dimensions, what would be the reason for this discrepancy and how would you address this issue?

It could be that an author's name contains an element ending in “.” (such as a suffix), which leads R to split the name in two and treat the parts as two distinct authors. In this case, the same author would occupy two columns in the names adjacency list but only one column in the ID adjacency list, causing R to return different dimensions for the two adjacency lists. To solve the issue, we can use the gsub function to replace all potentially confusing name endings, such as “Jr.”, “II.”, and “M.S.”, with “” before creating the names adjacency list.
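
As a small illustration of this failure mode, the sketch below uses two made-up author strings in the "Surname Initials.," pattern that the cleaning code in this report assumes when it turns ".," into ";". The names and the helper `count_authors` are hypothetical.

```r
# Hypothetical author strings; "Jr." and "II." sit right before a ".," boundary
authors <- c("Smith Jr., J., Doe A.", "Lee K., Park II., S.")

# Count authors the way the report does: turn ".," into ";" and split on ";"
count_authors <- function(x) {
  lengths(strsplit(gsub(".,", ";", x, fixed = TRUE), ";", fixed = TRUE))
}

# Without cleaning, each suffix creates a spurious extra author
raw_counts <- count_authors(authors)

# Remove the suffixes literally (fixed = TRUE keeps "." from matching any character),
# feeding each result into the next call so the substitutions accumulate
cleaned <- authors
for (suffix in c(" Jr.,", " II.,")) {
  cleaned <- gsub(suffix, "", cleaned, fixed = TRUE)
}
clean_counts <- count_authors(cleaned)
```

Here `raw_counts` is 3 for both papers while `clean_counts` is 2, which is the kind of mismatch that would make the names adjacency list wider than the ID adjacency list.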
As part of the feature engineering procedures, we created a database with ID and Name columns. What is the meaning of the number of rows of this database?

The number of rows means the number of unique authors as captured in this dataset, which is 425.
Does this number match the number of unique actors you have in your first deliverable in the object g (i.e., the graph we created)?

No, the numbers do not match. There are only 401 unique authors in the graph but 425 unique authors in the dataset.
If the number does not match, what do you think is the reason for this mismatch?

It may be that all the solo authors are removed from the graph because they are not connected with anyone else, but those solo authors are still included in the database with ID and Name columns.
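
This explanation can be checked on a toy adjacency list. The sketch below (hypothetical IDs) builds an edgelist with the same pairing logic as the Deliverable 1 loop and compares the authors that survive in the edgelist with the authors found by unlisting.

```r
# Toy ID adjacency list (hypothetical IDs): paper 2 has a single author, "104"
adj <- matrix(c("101", "102", "103",
                "104", NA,    NA,
                "101", "105", NA), ncol = 3, byrow = TRUE)

# Pair every author in column i with the authors in the later columns
edges <- NULL
for (i in 1:(ncol(adj) - 1)) {
  edges <- rbind(edges, cbind(adj[, i], c(adj[, -c(1:i)])))
}
# Drop rows whose second author is NA, as the original loop does
edges <- edges[!is.na(edges[, 2]), , drop = FALSE]

in_graph <- unique(c(edges))                  # authors that reach the edgelist
all_ids <- c(adj)
in_table <- unique(all_ids[!is.na(all_ids)])  # authors found by unlisting

# Authors present in the table but missing from the graph
missing_from_graph <- setdiff(in_table, in_graph)
```

In this toy case `missing_from_graph` contains only "104", the solo author, because every row pairing that author with a co-author has NA in the second column.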
Please explain the role of unlisting in this data cleaning or feature engineering process.

Unlisting the authors' names decomposes the names adjacency list and places the names one per row. Unlisting the author IDs does the same for the IDs, placing one ID on each row. After unlisting, we can clean up the NAs and duplicates present in the IDs. Since we have combined the names column and the IDs column, dropping duplicated IDs also drops the extra names attached to them, keeping a single name for each author who appears under multiple names (the first one encountered, which is the most recent one when the export is sorted newest first). The result is a table of unique authors and their IDs. Only after performing this step can we proceed to count the number of papers written by each unique author using the table function.
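
A minimal sketch of the unlisting step on toy data follows. All names and IDs are hypothetical; author 101 deliberately appears under two spellings to show what de-duplicating by ID does.

```r
# Toy ID and name adjacency lists (hypothetical); author 101 has two name spellings
id_adj <- data.frame(ID_1 = c("101", "104", "101"),
                     ID_2 = c("102", NA, "105"),
                     stringsAsFactors = FALSE)
name_adj <- data.frame(AU_1 = c("smith j.", "lee k.", "smith j.a."),
                       AU_2 = c("doe a.", NA, "park s."),
                       stringsAsFactors = FALSE)

# unlist() stacks the columns into one long vector, so position k in both
# vectors refers to the same author slot
long <- data.frame(ID = unlist(id_adj), Names = unlist(name_adj),
                   stringsAsFactors = FALSE)
long <- long[!is.na(long$ID), ]           # drop the NA padding
authors <- long[!duplicated(long$ID), ]   # keep the first row seen for each ID

# Papers per ID, counted before de-duplication, then attached to the author table
pub_count <- table(long$ID)
authors$pub_count <- as.integer(pub_count[as.character(authors$ID)])
```

Note that `!duplicated()` keeps the first occurrence of each ID, so the surviving spelling for author 101 here is "smith j."; which spelling survives in real data depends on the row order of the export.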
Who are the top five most prolific authors? And what are they publishing about?

Within this dataset, 6 authors have published more than 1 paper, and the rest published only 1 paper each. Thus the top 6 most prolific authors are: vongkulluksn v.w. (5 papers); xie k. (4); donham c. (2); bowman m.a. (2); menke e. (2); and kranzfelder p. (2).

5 Steps for deliverable 2:

#read the dataset downloaded from Scopus
b <- read.csv("/Users/joyceshuai/Desktop/SSNA/classtech_tenyrs.csv")
b <- b[b$Author.s..ID!="[No author id available]",]
dim(b)

## [1] 154  23

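#Create a column called "au" in source to store author names, and account for possible cases of confusion
#each gsub() reads from b$au so the substitutions accumulate instead of overwriting one another;
#fixed = TRUE matches the suffixes literally, and the comma variants are removed first
b$au <- b[,1]
b$au <- gsub(" Jr.,", "", b$au, fixed = TRUE)
b$au <- gsub(" II.,", "", b$au, fixed = TRUE)
b$au <- gsub(" M.S.,", "", b$au, fixed = TRUE)
b$au <- gsub(" Jr.", "", b$au, fixed = TRUE)
b$au <- gsub(" II.", "", b$au, fixed = TRUE)
b$au <- gsub(" M.S.", "", b$au, fixed = TRUE)
b$au <- gsub("\\.,", ";", b$au)
b$au <- tolower(b$au)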
dim(b)

## [1] 154  24

# Creating adjacency list from au
df_au <- as.data.frame(b[ , ncol(b)]) 
colnames(df_au) <- "AU"

#separate authors' names into each column
au_split <- cSplit(df_au, splitCols = "AU", sep = ";", direction = "wide", drop = TRUE)

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE

dim(au_split)

## [1] 154  11

#Creating adjacency list from authorsID
df_id <- as.data.frame(b[,3])
colnames(df_id) <- "ID"

#separate ids into each column
id_split <- cSplit(df_id, splitCols = "ID", sep = ";", direction = "wide", drop = TRUE)

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by
## the caller; using TRUE

dim(id_split)

## [1] 154  11

#Decompose authors and IDs and put them into a dataframe with 2 columns, one called "ID" and one called "Names"
df_id_au_unlisted <- data.frame(ID = unlist(id_split), Names = unlist(au_split))
#drop the rows where the "ID" is NA, i.e. those with no more co-authors
df_id_au_unlisted <- df_id_au_unlisted[!is.na(df_id_au_unlisted$ID),]
#drop duplicates in the column "id" so that we don't count the same authors multiple times
df_id_au_unlisted <- df_id_au_unlisted[!duplicated(df_id_au_unlisted$ID),]
dim(df_id_au_unlisted)

## [1] 425   2

#find the most prolific authors by ID
pub_count <- as.data.frame(table(unlist(id_split)))


# link the two tables by matching IDs, generate a new column called pub_count in df_id_au_unlisted
df_id_au_unlisted$pub_count <- pub_count$Freq[match(df_id_au_unlisted$ID, pub_count$Var1)] 
#sort by the column "pub_count" in decreasing order; head() shows the first 6 rows by default
head(df_id_au_unlisted[order(df_id_au_unlisted$pub_count, decreasing=T), ])

##                  ID             Names pub_count
## ID_0113 57198347616 vongkulluksn v.w.         5
## ID_0213 24173606100            xie k.         4
## ID_0153 57207690066         donham c.         2
## ID_0184 57198345447       bowman m.a.         2
## ID_0364 34975670700          menke e.         2
## ID_0464 58052982400    kranzfelder p.         2

6 Deliverable 3 Questions and Answers:

Explain what the issue was with solo authors in past deliverables.

In past deliverables, solo authors were dropped from the edgelist because they were paired only with NAs, and rows with an NA in the second column were removed. As a result, those solo authors could not be included in the igraph plot of the one-mode network displaying connections between co-authors.
How do the transformations conducted in the R code file “two_mode_coauthorship.r” handle solo authors?

By creating a two-mode network, the connections are established between each author’s ID and the paper the author wrote. The co-authorships between authors, on the other hand, are shown by their mutual affiliations with the same paper’s EID. This way, no solo authors are missing from the two-mode network, because even if the author is a solo author, their ID will be kept by its connection to their paper’s EID, hence will not be dropped.
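
The idea can be sketched in base R on toy data. This is not the code from “two_mode_coauthorship.r”; the EIDs, author IDs, and the object name `toy_two_mode` are all hypothetical.

```r
# Toy data (hypothetical EIDs and author IDs); paper "eid_2" has a single author
papers <- data.frame(EID = c("eid_1", "eid_2", "eid_3"),
                     IDs = c("101; 102", "104", "101; 105"),
                     stringsAsFactors = FALSE)

# One row per author-paper pair: repeat each EID once per author on that paper
id_list <- lapply(strsplit(papers$IDs, ";", fixed = TRUE), trimws)
toy_two_mode <- data.frame(author = unlist(id_list),
                           paper = rep(papers$EID, lengths(id_list)),
                           stringsAsFactors = FALSE)

n_ties <- nrow(toy_two_mode)                      # author-paper ties
n_unique_authors <- length(unique(toy_two_mode$author))
n_unique_papers <- length(unique(toy_two_mode$paper))
solo_kept <- "104" %in% toy_two_mode$author       # solo author still present
```

The solo author "104" keeps a row because the tie is to the paper, not to a co-author. The row count is the number of author-paper ties, which is generally larger than either the number of unique authors or the number of unique papers.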
Does our dataset with ID, names, and publication counts contain solo authors? If so, how, or why was this achieved?

Yes, that dataset contains solo authors. This is because when we run the code “df_id_au_unlisted <- data.frame(ID = unlist(id_split), Names = unlist(au_split))”, we are putting each author’s name and ID together into the same row in the dataframe. Instead of dropping the NAs in the ID adjacency list, which drops the solo authors’ IDs hence losing solo author information, we drop the NAs from this unlisted dataframe. Since those solo authors have both author names and IDs stored in the dataframe, their rows do not have empty IDs, and so their information will not be dropped by the code “df_id_au_unlisted <- df_id_au_unlisted[!is.na(df_id_au_unlisted$ID),]”.
What is the meaning of the dimensions of the object “edgelist_two_mode” in plain English?

The dimension of “edgelist_two_mode” is (2743, 2). It means there are a total of 2743 relationships captured in the dataset between authors and papers, but it does not tell us how many unique authors and how many unique papers are included in the number.
In class we observed that the following tabulations rendered different True and False distributions: table(is.na(V(g2)$label)) and table(V(g2)$type). What was the reason for that discrepancy, and how did we address it? (You can place the code we used and explain it to address this question.)

The purpose of running both commands was to find out how many of the 3176 vertices are authors and how many are papers. The command “table(V(g2)$type)” returns the number of TRUE values (i.e., papers) and FALSE values (i.e., authors) among the vertices of graph g2, and it showed 986 papers and 2190 authors. However, the command “table(is.na(V(g2)$label))” found only 2184 authors, so 6 authors were missing. This was because there are 6 instances in the database where an author's name was missing (for reasons such as an incorrect separator between two authors, which leads R to read two distinct authors as one) while the ID was present. With “table(is.na(V(g2)$label))”, those 6 authors without names were not counted as authors because they were represented by NA in the label column.
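
The discrepancy can be reproduced with plain vectors standing in for the vertex attributes. This is only a sketch with hypothetical values; it shows how to locate the affected vertices and does not reproduce the exact fix used in class.

```r
# Toy vertex attributes for a two-mode graph (hypothetical values):
# type is TRUE for papers and FALSE for authors; label holds author names only
type <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
label <- c("smith j.", "doe a.", NA, NA, NA)   # third author has an ID but no name

type_counts <- table(type)            # 3 authors (FALSE), 2 papers (TRUE)
label_counts <- table(is.na(label))   # only 2 non-missing labels

# Authors whose name is missing: authors by type, but with an NA label
nameless_authors <- which(!type & is.na(label))
```

In this toy case the two tables disagree by one, and `nameless_authors` points to the third vertex, the author whose label is NA.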
Did you face a problem similar to the one we saw in class? Do all your authors have names associated with IDs?

No, I didn’t. Both commands returned the same distribution: 425 authors and 154 papers.

EDUC 7847 Deliverable 1-3

Yizheng Shuai

June 19, 2023
