This is part of my coursework, and in it I would like to focus on two fields of science studies: Scientometrics and Science & Technology Studies. In the early 1980s these disciplines differentiated from each other, each with its own specialization, but both remained focused on scientific practices and their (re)production. So I would like to explore the connection between these fields today. In this work I do a semantic-level analysis, which consists of topic modeling on abstracts downloaded from journals related to the two disciplines. I use the structural topic modeling method because it allows a document-level covariate, namely the journal, so we can test how the journals relate to different topics.

I exported 7252 articles from Web of Science that were published in the following journals: Journal of Informetrics, Minerva, Research Evaluation, Research Policy, Science as Culture, Science and Public Policy, Scientometrics, Social Studies of Science, and Science, Technology, & Human Values. The data covers the period from 2007 to 2018. My decision about which journals to take is based on the list of journals published on the website of the Center for Science and Technology Studies, European University at St. Petersburg (they recently changed their website and I don’t know how to find the previous link), and on a discussion on ResearchGate.
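Before the code: the analysis below relies on several R packages whose library() calls are not shown here. The list below is my reconstruction of what the functions used later seem to require, so treat the package names (especially tidystm) as assumptions.

# packages assumed by the code below (reconstructed, not shown in the original)
library(bibliometrix)   # readFiles(), convert2df() for the Web of Science export
library(dplyr)          # filter(), select(), mutate(), the %>% pipe
library(stm)            # textProcessor(), prepDocuments(), stm(), selectModel(), estimateEffect(), ...
library(tidystm)        # extract.estimateEffect(), if I remember the package correctly
library(tidyr)          # unite()
library(forcats)        # fct_reorder2()
library(ggplot2)        # effect plots
library(factoextra)     # fviz_nbclust(), eclust(), fviz_dend()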

setwd("C:/Users/Proksenia/Documents/rockstar")

# read all Web of Science export files in the folder and convert them into a data frame
D <- readFiles(list.files(pattern = "*.txt"))
M <- convert2df(D, dbsource = "wos")

Ms <- M %>% 
  filter(DT == "ARTICLE") %>%      # keep research articles only
  select(TI, SO, AB) %>%           # title, journal (source), abstract
  mutate(sent_id = row_number())   # add a document id

Preprocessing of the data: we use the “SMART” stopword list, which covers around 500 words. We also remove the Elsevier copyright signature at the end of the abstracts of the Journal of Informetrics, as well as numbers and punctuation.

processed_again <- textProcessor(Ms$AB, metadata = Ms,
                                 removestopwords = TRUE,
                                 removenumbers = TRUE,
                                 removepunctuation = TRUE,
                                 stem = FALSE,
                                 customstopwords = c("(c)", "elsevier", "ltd.",
                                                     "all", "rights", "reserved"))

In the next step we set thresholds to decide which words we would like to keep, based on the number of times we encounter them in our corpus. The graph shows how many tokens, words and documents we lose at different thresholds.

plotRemoved(processed_again$documents, lower.thresh=seq(1,1000, by=5))

out_again <- prepDocuments(processed_again$documents, processed_again$vocab,
                           processed_again$meta,
                           lower.thresh = 9, upper.thresh = 500)
## Removing 24531 of 29911 terms (192359 of 529546 tokens) due to frequency 
## Your corpus now has 7252 documents, 5380 terms and 337187 tokens.
docs_again <- out_again$documents
vocab_again <- out_again$vocab
meta_again <- out_again$meta

out_again$meta$SO <- as.factor(out_again$meta$SO)

For a long time I modelled with 9 as the lower and 750 as the upper threshold, but a closer look at the topics showed that some of them are dominated by one or two vague words that seem to have too much weight in their topics, as in the picture below:

figure.1

So I ended up with thresholds of 9 and 500. The next step is to decide how many topics we would like to see. First I tried an algorithm that suggests an “optimal” number of topics, which came out as 83. I also tried 30, 50, 70 and 90 topics; 90 was the best, slightly better than 70. But such a large number of topics is not really handy or easily interpretable, so I tried working with 30 and 50 topics and stayed with 50, because 30 turned out to be a bit too coarse for clustering. Our final model is therefore a structural topic model with 50 topics, with journal added as a document-level covariate. I leave the code here for comparison, but it is computationally expensive to run.

# searchK(docs_again, vocab_again, K = 0, content = out_again$meta$SO, data = meta_again)

# storepls <- searchK(out_again$documents, out_again$vocab, K = seq(30, 90, by = 20), content = out_again$meta$SO, data = meta_again)

# final model: 50 topics, journal (SO) as the content covariate
poliblogPrevFit_50 <- stm(out_again$documents, out_again$vocab, K=50,
                          verbose = TRUE, content = out_again$meta$SO, 
                          max.em.its=15, init.type="Spectral", 
                          interactions = FALSE, LDAbeta = FALSE, seed=8458159)

What can we do next?

Well, we can start by doubting that our model is any good, because it is only one of the many models that stm can produce with different combinations of topics and words. What we can do about that is either reduce the number of words further (it is already relatively small compared with what we started with) or keep using the stm package toolkit.

poliblogSelect <- selectModel(out_again$documents, out_again$vocab, K = 50,
                              max.em.its = 15,
                              content = out_again$meta$SO,
                              verbose = TRUE,
                              data = out_again$meta, runs = 20, seed =8458159)

The line above creates an object that keeps the models with the highest likelihood values from several runs with a small number of EM iterations (10 of them here, judging by the output below). We can compare the resulting models on several criteria: semantic coherence, exclusivity and sparsity, but for my kind of model (with a content covariate) only semantic coherence and sparsity can be computed. Based on these results, model 4 is the best.

plotModels(poliblogSelect)
## [1] "Model 1 has on average -180.335309809554 semantic coherence and 0.992373512808676 sparsity"
## [1] "Model 2 has on average -176.525729167944 semantic coherence and 0.992328967935791 sparsity"
## [1] "Model 3 has on average -179.076096142275 semantic coherence and 0.992456528253598 sparsity"
## [1] "Model 4 has on average -175.321722030713 semantic coherence and 0.992397000105288 sparsity"
## [1] "Model 5 has on average -177.646756147809 semantic coherence and 0.99245126385953 sparsity"
## [1] "Model 6 has on average -177.589329376003 semantic coherence and 0.992262150626463 sparsity"
## [1] "Model 7 has on average -177.622096113897 semantic coherence and 0.992383636643422 sparsity"
## [1] "Model 8 has on average -178.962207672171 semantic coherence and 0.992425751795968 sparsity"
## [1] "Model 9 has on average -177.951462061293 semantic coherence and 0.992408743753594 sparsity"
## [1] "Model 10 has on average -177.145551874924 semantic coherence and 0.992477585829871 sparsity"
poliblogfinal_50 <- poliblogSelect$runout[[4]]   # model 4: highest average semantic coherence

Now we are more or less ready to move on to describing our model.

plot(poliblogfinal_50, type="summary", topics = 1:25)

plot(poliblogfinal_50, type="summary", topics = 26:50)

The graphs above show the expected proportion of each topic. The proportions are fairly similar, around 0.02 per topic (for me this was a good sign, because in a plain LDA the topic proportions ranged from 0.01 to 0.06, which I took as a hint that the bigger topics could be split further into smaller ones).
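As a quick sanity check of that claim (my own addition, not part of the original pipeline), the average topic proportions can be read directly from the model’s theta matrix, the document-by-topic proportion matrix stored in the fitted stm object:

# average expected proportion of each topic across all abstracts;
# with 50 topics an even split would be 1/50 = 0.02
topic_props <- colMeans(poliblogfinal_50$theta)
round(topic_props, 3)
summary(topic_props)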

out_corr <- topicCorr(poliblogfinal_50)

plot(topicCorr(poliblogfinal_50, cutoff = 0.15))

The graph above shows correlations between topics. I struggled to describe what this correlation refers to, because the stm package puts it bluntly: “Positive correlations between topics indicate that both topics are likely to be discussed within a document”. My reading is that the correlation is computed over the per-document topic proportions, so positively correlated topics tend to appear in the same abstracts, rather than simply sharing similar words. Coming back to the graph, we can clearly see some clusters, so my next step was clustering.
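That reading can be checked directly (again a sketch of my own, not code from the original analysis): with the default “simple” method, topicCorr() essentially takes the correlation matrix of the per-document topic proportions and zeroes out small values, so cor() on theta should give almost the same numbers as the matrix stored in out_corr.

# correlations of topic proportions across documents
theta_cor <- cor(poliblogfinal_50$theta)

# compare a corner of the two matrices; they should agree except where
# topicCorr() has zeroed out correlations below its cutoff
round(theta_cor[1:5, 1:5], 2)
round(out_corr$cor[1:5, 1:5], 2)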

fviz_nbclust(out_corr$cor, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")

fviz_nbclust(out_corr$cor, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

fviz_nbclust(out_corr$cor, kmeans, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")
## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 50)  [one "." per sample]:
## .................................................. 50

res.hc <- eclust(scale(out_corr$cor), "hclust", nboot = 500)
## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 500)  [one "." per sample]:
## .................................................. 50 
## .................................................. 100 
## .................................................. 150 
## .................................................. 200 
## .................................................. 250 
## .................................................. 300 
## .................................................. 350 
## .................................................. 400 
## .................................................. 450 
## .................................................. 500
fviz_dend(res.hc, rect = TRUE)

vec_color <- res.hc$cluster
vec_color <- as.factor(vec_color)            # cluster membership as a factor
levels(vec_color) <- c("yellow", "blue", "purple", "red")

Above we can see the results of several methods for estimating the optimal number of clusters; most of them point to 4 clusters. I used the results of the hierarchical clustering to assign each topic its “cluster colour” and applied these colours to the earlier network.

plot(topicCorr(poliblogfinal_50, cutoff = 0.10), vertex.color = vec_color)

From this graph we can see three closely knit clusters and one more dispersed one. It is hard to say whether the dispersed one represents a theme of its own, because it may simply collect the topics that did not fit strongly into any other cluster, so the function assigned them to a kind of “rubbish” cluster.

My next intention was to try to describe these clusters based on their themes, or based on the journals and their similarities.
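One note before the next chunk: the plot below colours the topic labels using the vectors clust_purp, clust_blue, clust_red and clust_yellow, which are not defined in the code shown here. Presumably they hold the topic numbers of each cluster, something like the following (reusing the level-to-colour mapping applied to vec_color above):

# topic numbers belonging to each cluster; the mapping 1 = yellow, 2 = blue,
# 3 = purple, 4 = red mirrors the levels() assignment used for vec_color
clust_yellow <- which(res.hc$cluster == 1)
clust_blue   <- which(res.hc$cluster == 2)
clust_purp   <- which(res.hc$cluster == 3)
clust_red    <- which(res.hc$cluster == 4)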

prep_50 <- estimateEffect(formula = 1:50 ~ SO, stmobj = poliblogfinal_50,
                          metadata = out_again$meta,
                          uncertainty = "Global")
effect_50 <- extract.estimateEffect(prep_50, "SO", method = "pointestimate")

effect_50$topic <- reorder(x = effect_50$topic, effect_50$estimate)

effect_50 <- effect_50 %>% mutate(topic = fct_reorder2(as.factor(topic), covariate.value == "SCIENCE TECHNOLOGY & HUMAN VALUES", estimate))
effect_50 %>% 
  ggplot(., aes(x = topic, y = estimate, group = covariate.value, color = covariate.value)) +
  geom_point(size = 2 ) +
  geom_smooth(se = FALSE) +
  scale_y_continuous(limits = c(0,0.075)) +
  scale_x_discrete("Topics") +
  theme(axis.text.x = element_text(color = ifelse(levels(effect_50$topic) %in% as.character(clust_purp), "purple",
                                                  ifelse(levels(effect_50$topic) %in% as.character(clust_blue), "blue",
                                                         ifelse(levels(effect_50$topic) %in% as.character(clust_red), "red",
                                                                ifelse(levels(effect_50$topic) %in% as.character(clust_yellow), "yellow", "black"))))))

To do this I ran the estimateEffect function, which gives a regression-like output with journal as the independent variable. My understanding is that the coefficient is the estimated proportion of each topic in a given journal, because the coefficients sum to 1 for every journal. I drew a plot and ordered the topics by the estimated proportion for Science, Technology, & Human Values, from higher to lower. With this ordering we can see that Science as Culture, Science, Technology & Human Values, Social Studies of Science and Minerva tend to be represented in the blue cluster, and the Journal of Informetrics and Scientometrics tend to be in the red cluster (on the network graph it is yellow, although I tried to make the colours consistent); lastly, Research Policy also seems to be strongly represented in the yellow cluster (orange on the graph). So it is likely that the clusters reflect the journals’ “textual boundaries”, but this is not certain, and we still have not discussed the topics themselves or the topics that for some reason connect different clusters.

To do this I reduced my woeful model to 20 topics, because it was hard for me to understand why certain topics end up in the same cluster: the topics are not that clear-cut, and it is hard to reduce each one to a single theme, since a topic seems to contain a mixture of themes with some more prevalent than others. I explain it this way: in abstracts some themes depend on each other and get reflected together in articles. For example, articles using networks, clustering, statistics or mapping in a bibliometrics journal are likely to use citation data, and they will all use more or less similar words. So in a model with 20 topics the topics should be more coherent and their word distributions easier to explain. I also tried to improve the model by adding more words to the vocabulary, with a new threshold of 9 to 1000 occurrences.

out_more <- prepDocuments(processed_again$documents, processed_again$vocab,
                          processed_again$meta,
                          lower.thresh = 9, upper.thresh = 1000)

docs_more <- out_more$documents
vocab_more <- out_more$vocab
meta_more <- out_more$meta

out_more$meta$SO <- as.factor(out_more$meta$SO)

poliblogSelect_new <- selectModel(out_more$documents, out_more$vocab, K = 20, max.em.its = 50,
                              content = out_more$meta$SO,
                              verbose = TRUE,
                              data = out_more$meta, runs = 50, N = 6, seed = 8458150, to.disk = TRUE)

From these 6 models we select the one with the highest semantic coherence.

plotModels(poliblogSelect_new)
## [1] "Model 1 has on average -131.845880331693 semantic coherence and 0.988665709219141 sparsity"
## [1] "Model 2 has on average -135.73280688192 semantic coherence and 0.988247668750126 sparsity"
## [1] "Model 3 has on average -136.338326798322 semantic coherence and 0.988646444681398 sparsity"
## [1] "Model 4 has on average -132.191283094821 semantic coherence and 0.988644518227624 sparsity"
## [1] "Model 5 has on average -136.822517346354 semantic coherence and 0.988573239437976 sparsity"
## [1] "Model 6 has on average -132.345175128286 semantic coherence and 0.988478843203038 sparsity"
poliblogfinal_20 <- poliblogSelect_new$runout[[1]]   # model 1: highest average semantic coherence

Here is the proportion of each topic.

sage <- sageLabels(poliblogfinal_20, n = 3)

plot(poliblogfinal_20, type="summary", topics = 1:20, custom.labels = as.matrix(unite(data.frame(sage$marginal$frex), "frex", sep = ", ")), n = 1)

Below we plot the expected proportion of each topic per journal, and the network of topic relatedness.
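The object effect_20 used below is not built in the code shown; I assume it is constructed the same way as effect_50 above, just for the 20-topic model:

# estimated effect of journal (SO) on topic proportions, 20-topic model
prep_20 <- estimateEffect(formula = 1:20 ~ SO, stmobj = poliblogfinal_20,
                          metadata = out_more$meta,
                          uncertainty = "Global")
effect_20 <- extract.estimateEffect(prep_20, "SO", method = "pointestimate")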

effect_20 <- effect_20 %>% mutate(topic = fct_reorder2(as.factor(topic), covariate.value == "SOCIAL STUDIES OF SCIENCE", estimate))
effect_20 %>% 
  ggplot(., aes(x = topic, y = estimate, group = covariate.value, color = covariate.value)) +
  geom_point(size = 2 ) +
  geom_smooth(se = FALSE) +
  scale_y_continuous(limits = c(0,0.25)) +
  scale_x_discrete("Topics")

out_corr_20 <- topicCorr(poliblogfinal_20)
plot(topicCorr(poliblogfinal_20, cutoff = 0.1), noverlap = TRUE,
     vlabels = paste0("Topic ", 1:20, ":",
                      as.matrix(unite(data.frame(sage$marginal$frex), "frex", sep = ", "))))

The clusters do not change much compared with the 50-topic model, and it seems we have 3-4 of them. But if we lower the cutoff value to 0.07, so that only edges between topics with a correlation below 0.07 are dropped,

plot(topicCorr(poliblogfinal_20, cutoff = 0.07))

here it looks like 3 clusters.

To describe these topics we have several visualisation tools and techniques: we can look at the abstracts with the highest proportion of a particular topic, we can look at the words most associated with a topic of interest using different weighting methods, we can search for the topics containing given words, and we can search for abstracts that fit a chosen “mixture” of topic proportions. My idea was to look at the abstracts to get an idea of what the topics are about. With 20 topics they seem more consistent, but this sometimes conflicts with the word-based methods. For example, let’s look at topic 11.

cloud(poliblogfinal_20, topic = 11)

sage_10 <- sageLabels(poliblogfinal_20, n = 10)

sage_10$marginal$frex[11,]
##  [1] "health"         "scientometrics" "attention"      "governance"    
##  [5] "concept"        "cell"           "society"        "issues"        
##  [9] "issue"          "care"
thoughts <- findThoughts(poliblogfinal_20, texts = meta_more$AB, n = 5, topics = 11)$docs[[1]]
plotQuote(thoughts, width = 80, maxwidth = 250, text.cex = 0.8)

We get a word cloud, which “shows words weighted by their probability conditional that the word comes from a particular topic. With content covariates it averages over the values for all levels of the content covariate weighted by the empirical frequency in the dataset”.

The console output shows the 10 words with the highest FREX score, which is the weighted harmonic mean of a word’s rank by frequency within the topic and its rank by exclusivity to the topic. Basically this score favours words that are both frequent in the topic and exclusive to it, and it was suggested as an alternative to ranking words by their raw probability within a topic.
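If I read the stm documentation correctly, the formula combines the ECDF-normalised frequency and exclusivity of a word. A rough sketch of that computation for one topic (my own illustration, not the stm implementation, with an illustrative weight w on exclusivity; the actual default differs between stm functions):

# beta_all is a topics-by-words matrix of word probabilities
frex_sketch <- function(beta_all, k, w = 0.7) {
  freq <- beta_all[k, ]                       # how frequent each word is in topic k
  excl <- beta_all[k, ] / colSums(beta_all)   # how exclusive each word is to topic k
  # weighted harmonic mean of the two ECDF-normalised rankings
  1 / (w / ecdf(excl)(excl) + (1 - w) / ecdf(freq)(freq))
}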

And the last output shows the first part of the abstracts that have the highest proportion of the given topic, topic 11. (We could look at the full abstracts, but that is a much longer output.)
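For completeness, the two other search tools mentioned above could be used roughly like this (a sketch of my own; I have not checked how findTopic() behaves with a content-covariate model, so treat that call as an assumption):

# search for the topics most associated with given words, using FREX weighting
findTopic(poliblogfinal_20, list = c("citation", "governance"), n = 5, type = "frex")

# rank abstracts by a chosen "mixture" of topics, e.g. high in both topic 11 and topic 5
mix_score <- poliblogfinal_20$theta[, 11] * poliblogfinal_20$theta[, 5]
head(meta_more$TI[order(mix_score, decreasing = TRUE)], 5)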

Going back to the five abstracts above, it is hard to find a single main theme for them, but most of them seem to describe a relation between the public and something governmental, organisational or political that raises issues, debates or controversies. However, if we look at the graph of topic proportions per journal, we see that Scientometrics and the Journal of Informetrics make up only a small part of this topic, yet the words “scientometrics” and “bibliometric” score highly on FREX (and it is the only topic where these words are significant indicators), which does not support the idea that topics (and clusters) should be described purely by their association with journals. To finish, I present the ideas I grasped from the topic-related abstracts, answering the question of what every topic is about.

plot(topicCorr(poliblogfinal_20, cutoff = 0.09),
     vlabels = paste0("Topic ", 1:20, ":",
                      as.matrix(unite(data.frame(sage$marginal$frex), "frex", sep = ", "))),
     vertex.color = c("blue", "blue", rep("green", 7), "blue",
                      rep("green", 6), "blue", "green", "green", "blue"))

Topic 20: University ranking & academic ranking. This topic covers themes of university rankings, educational background, professorship, the performance of universities and their transformation as institutions.

Topic 17 (diverse): This topic focuses both on companies/organisations and on different research areas and their partnerships, transfers, absorptive capacity and collaborations, together with analysis of transformation/activities/performance/relationships.

Topic 10: Mostly relations between patents/innovation/investment and different variables, with evaluation via citations or market value; it generally covers their performance in different scenarios.

Topic 1: Scientific collaboration between countries (it seems mostly with China) and regions, also nation’s research output.

Topic 2 (diverse): Funding/careers/women inequality in science and academy.

plot(topicCorr(poliblogfinal_20, cutoff = 0.09),
     vlabels = paste0("Topic ", 1:20, ":",
                      as.matrix(unite(data.frame(sage$marginal$frex), "frex", sep = ", "))),
     vertex.color = c("green", "green", "green", "blue", "blue", "green", "blue",
                      rep("green", 8), "blue", "green", "blue", "green", "green"))

Topic 5: Research metrics, indexes, research evaluation

Topic 18: Citation behaviour/relations in different cases like multiauthorship, collaboration, non-cited papers, time periods etc

Topic 16: Similar to Topic 18, but focused more on databases and their coverage

Topic 4: Delayed recognition/sleeping-beauties papers + some citation behavior

Topic 7: Bibliometric analysis focused more on the level of articles and their attributes, like keywords, byline, type of references/citation and ways of assessing it

plot(topicCorr(poliblogfinal_20, cutoff = 0.09),
     vlabels = paste0("Topic ", 1:20, ":",
                      as.matrix(unite(data.frame(sage$marginal$frex), "frex", sep = ", "))),
     vertex.color = c("green", "green", "blue", rep("green", 9),
                      "blue", "blue", "blue", rep("green", 5)))

Topic 13: Clustering, science mapping, thematic areas, bibliometric networks, visualisation, semantic analysis + methods

Topic 15: It seems reminiscent of topic 13 in terms of the data analysed, but topic 15 seems to be more about problem solving and describing cases of ambiguity. It also tries to reflect literature linkages or even particular events, and it is even more about methods, their evaluation and possible interaction.

Topic 3: It’s a nightmare: too much about everything, and I can’t give it a description from the abstracts.

Topic 14: Is largely about networks, but it describes them in two ways: network as a method of analysis and network in the sense of social network theory. The first is more focused on the methods themselves and how they can be improved or used with (or as) indexes; the second is more of an analytical toolkit in terms of link formation/strong ties/“heterogeneous networks”, and there are also words like collaboration, coauthors, indexes, edge, node.

plot(topicCorr(poliblogfinal_20, cutoff = 0.09),
     vlabels = paste0("Topic ", 1:20, ":",
                      as.matrix(unite(data.frame(sage$marginal$frex), "frex", sep = ", "))),
     vertex.color = c(rep("green", 5), "blue", "green", "blue", "blue", "green",
                      "blue", "blue", rep("green", 6), "blue", "green"))

Topic 12: It’s quite a broad topic and also very close to topic 19. The articles may not cohere with each other and come from different sources, but they all describe large-scale global processes like disease/war/earth/climate/progress in terms of history or even world policy; they analyse population, policymaking, technology and countries, and they also use words like coordination and decades/centuries.

Topic 19: Is close to topic 12, but 19 is more about cities and the urban environment, management and infrastructure. From my point of view it also favours a case-study approach and focuses on objects and technologies themselves rather than on processes, but in terms of policymaking the two look similar.

Topic 8: (STS-like) clinics/human +bodies/rights/race/genome + genetics

Topic 11: citizen + engagement + technoscience + politics + political philosophy + publics + participation + government decision

Topic 9: users, community, innovation, technologies/software, firms, development, contribute/needs and solution, diffusion, internal/external, boundaries

Topic 6: Risk regulation, funding/investment, evaluation, agency, standards + looks like in the field of food industries, medicine (pharmaceutics/drugs), administrative (research funding)

To conclude, we tried to analyse the 20 topics, and in doing so we separated them into 4 clusters, which seem to flow from more Scientometrics-like to more STS-like, with networks as a buffer zone between them. This seems generally plausible, but it is too simplistic a view. If we drill down to the topics we see that only some of them can be successfully described through the lens of journals. Themes of policymaking, governance, technology, the public or innovation appear in all clusters and are covered in a large number of topics. So we need to work further on this, or accept that it might be a problem of my analysis and some fallacious interpretation.