NOTICE TO MARINERS
The corpus
A t-SNE model
Finding clusters of texts
- Density based clusters
- Subsetting and investigating the clusters
Get the total picture
- The totality: clusters and texts

NOTICE TO MARINERS

We will look into how “DIGITALIZATION” is written about in the social sciences. The data are gathered from WoS SSCI 1992-2018, using “digital*” to harvest English article abstracts, a little more than 25 000 in total. The aim is to get a sense of what social science scholars write about digitalization.

We will in this exercise try something other than the usual run-of-the-mill LDA approach to text mining. Instead, we will try out t-distributed stochastic neighbor embedding (t-SNE). The goal is to get an initial structured overview of the data.

Clusters of text on digitalization from social science papers

Little surprise: this quick fix of text mining does not get us very far. There are several reasons for that. The major lesson is how little there is to say about digitalization: what we get in the end are the major topics in SSCI. There are no common digitalization topics, or at least we do not get to them with the method used here. Put differently, social scientists do not seem to have anything in particular to say about digitalization.

A special difficulty is the sparsity of the document-term matrix. Almost all corpora produce a sparse matrix, but this one is extraordinarily sparse because a) scientific language is highly technical and prone to specific, unusual terms, and, to make it worse, b) the corpus spans several disciplines, so the lexical specificity of every single discipline is fed into the corpus, leaving us with over 60’ unique tokens. Also, there is something to be said for the miller: it is hard to beat LDA at the LDA game.
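To make the notion of sparsity concrete: it is simply the share of zero cells in the documents-by-terms matrix. Below is a minimal base-R sketch on three made-up mini-abstracts. The texts and the names `toy_texts`, `toy_tokens` and `toy_dtm` are invented for illustration and are not part of the corpus.

```r
# Three invented mini-abstracts from different fields (illustration only).
toy_texts <- c("digital archives preserve records",
               "hedonic pricing of digital goods",
               "digital elevation models of terrain")

# Split on whitespace and collect the vocabulary.
toy_tokens <- strsplit(toy_texts, " ", fixed = TRUE)
vocab <- sort(unique(unlist(toy_tokens)))

# Build a documents x terms count matrix.
toy_dtm <- t(sapply(toy_tokens, function(tok) table(factor(tok, levels = vocab))))

# Sparsity = share of cells that are zero.
sparsity <- mean(toy_dtm == 0)

ncol(toy_dtm)       # number of unique tokens
round(sparsity, 2)  # share of empty cells
```

Only "digital" and "of" are shared across the toy documents. Every other term belongs to a single field, which is the mechanism that inflates the vocabulary in a multi-disciplinary corpus.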

Well, something learned then, after all.

prescript

#If you, like me, run your code in RStudio, this could save you 1.5 seconds :-):

library(rstudioapi)
setwd(dirname(rstudioapi::callFun("getActiveDocumentContext")$path))

script

The corpus

library(textstem)
text0$AB <- lemmatize_strings(text0$AB) #AB stands for abstract. Lemmatization will reduce the no of unique tokens slightly. An option is stemming.

text0[text0==""] <- NA #Mark empty AB-fields as NA.

text0 <- text0[complete.cases(text0),] #We have the WoS no so we can match it with the corpus, for later reassembly.

library(tm)
vsc <- VectorSource(text0$AB) #make a vector

corp1 <- Corpus(vsc) #make a corpus

#clean up the corpus
corp1 <- tm_map(corp1, removePunctuation)
corp1 <- tm_map(corp1, removeNumbers)
corp1 <- tm_map(corp1, content_transformer(tolower))
corp1 <- tm_map(corp1, removeWords, stopwords("english"))
corp1 <- tm_map(corp1, stripWhitespace)

dtm1 <- DocumentTermMatrix(corp1, control = list(weighting = weightTf)) #We'll just weight with plain term frequency, i.e. raw counts. Note that this does not normalize for document length.

ncol(dtm1)

## [1] 51449

inspect(removeSparseTerms(dtm1, 0.8)) #with inspect you can test the no of tokens with different levels of sparsity.

## <<DocumentTermMatrix (documents: 25418, terms: 29)>>
## Non-/sparse entries: 221954/515168
## Sparsity           : 70%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    can datum digital information much research result study
##   10914   0     5       2          11    7       11      1    20
##   12341   4     5       7           0    2       10      0     1
##   17192   0     0       3           0   10        0      1     0
##   18206   1    26       1           0    1        0      2     0
##   24293   7    13       1          12    1        1      2     0
##   2627    2     5       7           0    0        6      1     2
##   2873    2     1      38           3    0        4      3    11
##   5792    3    17       1           1    2        3      1     4
##   6403    6     1       2           3    2        6      1     8
##   9112   15     4      21           1    1        4      0     0
##        Terms
## Docs    technology use
##   10914          0   2
##   12341          2   3
##   17192         30  19
##   18206          1   7
##   24293         21   0
##   2627          13  10
##   2873           1   4
##   5792           0   3
##   6403           0   2
##   9112           0   4

dtm2 <- removeSparseTerms(dtm1, 0.8) #Let's be radical here and go for 80%. It reduces the matrix to almost nothing: 29 unique tokens.

ncol(dtm2)

## [1] 29

dtmMATRIX <- as.matrix(dtm2) #Make the dtm a matrix to prepare for a t-SNE.

This is not, and I repeat, NOT a wholesome approach to text mining. But I know the corpus a little, and so I know already that this will (curiously) work anyway.
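What the 0.8 threshold does can be sketched in base R. The sketch assumes, as far as I understand `removeSparseTerms`, that a term survives only if it occurs in more than a share of 1 - 0.8 of the documents. The toy matrix and the names `toy`, `sparse_max`, `doc_freq` and `toy_reduced` are made up for illustration.

```r
# Toy documents x terms count matrix (invented numbers, 10 documents).
toy <- cbind(
  use     = c(1, 2, 0, 1, 3, 1, 0, 2, 1, 1),  # occurs in 8 of 10 documents
  digital = c(1, 0, 1, 0, 0, 2, 0, 0, 1, 0),  # occurs in 4 of 10 documents
  hedonic = c(0, 0, 0, 0, 4, 0, 0, 0, 0, 0),  # occurs in 1 of 10 documents
  terrain = c(0, 0, 2, 0, 0, 1, 0, 1, 0, 0)   # occurs in 3 of 10 documents
)

sparse_max <- 0.8

# Document frequency: in how many documents does each term occur at all?
doc_freq <- colSums(toy > 0)

# Assumed rule: keep a term if it occurs in more than (1 - sparse_max) of the documents.
keep <- doc_freq > nrow(toy) * (1 - sparse_max)
toy_reduced <- toy[, keep, drop = FALSE]

colnames(toy_reduced)
```

The rule looks only at how many documents a term appears in, not at how often it is used. A term that is frequent in one narrow field ("hedonic" above) is dropped, which is why so little of the disciplinary vocabulary survives the 80% cut.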

A t-SNE model

library(Rtsne)
set.seed(44)
m0_8 <- Rtsne(dtmMATRIX, pca=TRUE,
            max_iter=2500, check_duplicates=FALSE,
            perplexity=100, theta=0.5, dims=2,
            num_threads=0, stop_lying_iter=1000,
            exaggeration_factor = 12)
#pca=TRUE -> run the initial PCA step.
#perplexity is typically between 5 and 100, but smaller for smaller datasets. Crucial but slippery.
#theta = speed/accuracy trade-off (increase for less accuracy). 0.5 is ok.
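Perplexity is easiest to read as an effective number of neighbours. A small base-R sketch, not taken from Rtsne itself: for one point, Gaussian similarities to all other points are normalised into probabilities, and perplexity is 2 raised to the Shannon entropy of those probabilities. The function name `perplexity_of` and the toy distances are made up for illustration.

```r
# Perplexity of the conditional neighbour distribution of a single point.
# d2: squared distances from the point to all other points; sigma: Gaussian bandwidth.
perplexity_of <- function(d2, sigma) {
  p <- exp(-d2 / (2 * sigma^2))
  p <- p / sum(p)           # conditional probabilities p(j | i)
  p <- p[p > 0]             # drop zeros to avoid 0 * log(0)
  2^(-sum(p * log2(p)))     # 2 to the power of the Shannon entropy
}

# Four close and four distant neighbours (squared distances).
d2 <- c(1, 1, 1, 1, 100, 100, 100, 100)^2

perplexity_of(d2, sigma = 1)     # narrow kernel: only the four close points count
perplexity_of(d2, sigma = 1000)  # wide kernel: all eight points count almost equally
```

t-SNE works the other way around: you state the perplexity, and the bandwidth is tuned per point until the neighbour distribution matches it. A perplexity of 100 thus asks each abstract to pay attention to roughly a hundred neighbours.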

The model is called m0_8 because I have psychic powers. I know there are 8 clusters in there.

# Show the objects in the 2D tsne representation
par(family = "serif")
plot(m0_8$Y)

Representation of the most probable neighbors in a two-dimensional space.

Finding clusters of texts

Density based clusters

Clustering on top of any t-SNE model is a precarious matter. The space is not, in my view, properly Euclidean. A density-based clustering approach is therefore to be preferred.

library(dbscan)
df_m12 <- as.data.frame(m0_8$Y)

cl <- hdbscan(df_m12, minPts = 300) #using the hierarchical approach, we do not need to f**k around with parameters.
cl

## HDBSCAN clustering for 25418 objects.
## Parameters: minPts = 300
## The clustering contains 7 cluster(s) and 8146 noise points.
## 
##     0     1     2     3     4     5     6     7 
##  8146  1380  1056   782   854   983  1536 10681 
## 
## Available fields: cluster, minPts, cluster_scores,
##                   membership_prob, outlier_scores, hc
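How to read this output: cluster 0 is not a cluster but the bin for noise points, so the eight columns hold seven clusters plus noise. A toy run on invented data shows the same structure on a small scale. The object names and the two Gaussian blobs are made up, and the sketch assumes that the dbscan package is installed.

```r
library(dbscan)

# Two invented Gaussian blobs plus a handful of scattered points (illustration only).
set.seed(44)
blob_a  <- cbind(rnorm(150, mean = -5), rnorm(150, mean = 0))
blob_b  <- cbind(rnorm(150, mean =  5), rnorm(150, mean = 0))
scatter <- cbind(runif(20, -15, 15), runif(20, -15, 15))
toy_xy  <- rbind(blob_a, blob_b, scatter)

# minPts is the smallest group of points that is allowed to count as a cluster.
toy_cl <- hdbscan(toy_xy, minPts = 20)

table(toy_cl$cluster)            # label 0 is reserved for noise points
summary(toy_cl$membership_prob)  # how firmly each point sits in its cluster
```

The number of clusters is not given beforehand. It follows from the density of the points and from minPts, which is why a large value such as 300 yields only a few, large clusters.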

df_m12$cluster <- as.factor(cl$cluster)
df_m12$membership <- cl$membership_prob
df_m12 <- transform(df_m12, membership = (membership - min(membership)) / (max(membership) - min(membership))) # Min-max rescale to [0, 1], to allow plotly to handle the degrees of probability better.
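The rescaling on the line above is a plain min-max transformation. As a small stand-alone sketch, with a hypothetical helper `rescale01` that also guards against a constant input (a case the one-liner does not handle):

```r
# Min-max rescaling: maps the smallest value to 0 and the largest to 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant input: avoid dividing by zero
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0.25, 0.5, 0.75))  # the middle value ends up halfway between 0 and 1
rescale01(c(0.4, 0.4, 0.4))    # constant input is mapped to zeros
```

Note that this stretches the probabilities: the least certain point becomes fully transparent in the plot, whatever its original membership probability was.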

kindofpretty <- c("#E5E6CF", "#DEA9A9", "#5F854B", "#796985", "#EB912A", "#A32F2F", "#2D4585", "#52522C")

library(plotly)

t <- list(
  family = "serif",
  size = 11,
  color = "#852E2E")

pal <- c("red", "blue", "green")

p <- plot_ly(type ='scatter', mode = 'markers', colors = kindofpretty, color = df_m12$cluster) %>%
  add_trace(
    x = df_m12$V1,
    y = df_m12$V2,
    marker = list(
      size = 6,
      opacity = df_m12$membership
      ),
    name = df_m12$cluster,
    text = paste("Cluster no.: ", df_m12$cluster),
    showlegend = T
  ) %>%
  
  layout(
    title = "Clusters of texts",
    titlefont = list(
      size = 10),
    font = t,
    xaxis = list(
      zeroline = F
    ),
    yaxis = list(
      zeroline = F
    )
  )  
p

Pretty clusters of text

Subsetting and investigating the clusters

So far, we only have pretty clusters to look at, but we do not know what they mean. We need to get the words back into the game to get a sense of what is going on. We will kind of start over, with a different approach to the text, using the very competent package Quanteda. It has its own classes of document term matrices and corpora. Once we get them in place, there are numerous analytic/illustrative possibilities. Now that we no longer need the heavily stripped dtm, we can reuse the texts to get a richer material.

library(quanteda)

colnames(text0)[colnames(text0)=="AB"] <- "text" #Quanteda needs the text column to be labelled "text".

#glue cluster membership to the texts, then create the corpus.
text0$cluster <- cl$cluster
corpQ <- corpus(text0)

#Now let us combine the texts with the clustering.

#subset the corpus.
subsetcorp <- corpus_subset(corpQ, 
                             cluster %in% c("0", "1", "2", "3", "4", "5", "6", "7"))
#inspect
summary(subsetcorp, 100)

#inherit the subsetting into the term matrix. Note that Quanteda uses its own class, "dfm".
my_dfm <- dfm(subsetcorp, groups = "cluster") #named my_dfm, which is the name the keyness calls further down expect.
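A caveat for readers on a newer Quanteda: as far as I know, recent versions (3 and later) no longer accept `groups` inside `dfm()` and expect the corpus to be tokenised first. Below is a minimal sketch of the same grouping step on a made-up three-document data frame. The names `toy`, `toy_corp`, `toy_dfm` and `toy_grouped` are invented, and the code assumes a Quanteda version that provides `tokens()` and `dfm_group()`.

```r
library(quanteda)

# Invented mini-corpus with a cluster label per text (illustration only).
toy <- data.frame(
  text = c("digital maps and spatial data",
           "social media use among students",
           "spatial models of digital data"),
  cluster = c(4, 2, 4),
  stringsAsFactors = FALSE
)

toy_corp <- corpus(toy, text_field = "text")

# Tokenise first, then build the document-feature matrix.
toy_dfm <- dfm(tokens(toy_corp, remove_punct = TRUE))

# Collapse the documents into one row per cluster.
toy_grouped <- dfm_group(toy_dfm, groups = docvars(toy_dfm, "cluster"))

ndoc(toy_grouped)  # one row per cluster instead of one per text
```

Grouping sums the token counts of all texts in a cluster, so each cluster is afterwards treated as one long document.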

# So-called keyness is a way of looking at the lexical distinctiveness of a text in relation to other texts. It is thus a very different approach compared to tf-idf. The choice of reference text(s) becomes a somewhat obscured but crucial analytical choice. Here we use the whole corpus apart from the text in focus.

kwic1 <- kwic(corpQ, pattern = "digitalization") #keywords in context
head(kwic1)

##                                                                    
##    [42, 39]         efficiency in scale, whereas | digitalization |
##   [84, 123]                manner. Moreover, the | digitalization |
##  [139, 203] broader process of blossoming health | digitalization |
##   [180, 56]  energy transition, urbanization and | digitalization |
##   [186, 41]           houses to the emergence of | digitalization |
##   [237, 18]            been a consequence of the | digitalization |
##                                                
##  requires customer-oriented business models and
##  of users, things,                             
##  . Accordingly, the disruptive                 
##  the Finnish district heating sector           
##  , we distill a novel                          
##  of artifacts, which has

kwic2 <- kwic(corpQ, pattern = "digital*") #it is possible to truncate to get the broader context.
head(kwic2)

##                                                             
##   [1, 8]   border control increasingly rely on |  digital  |
##  [1, 93]                      2013, it aims at | digitally |
##  [2, 24] Utilizing in-depth interviews with 53 |  digital  |
##  [2, 31]            working in both legacy and | digitally |
##  [5, 50]                    by a wide range of |  digital  |
##  [5, 85]       create value, the opportunities |  digital  |
##                                          
##  biometrics in order to sort             
##  registering all third-country nationals'
##  journalists working in both legacy      
##  native newsrooms, the results           
##  technologies that aim to make           
##  technologies offer must meet the

result_keyness1 <- textstat_keyness(my_dfm, target = "1", measure = "lr") #several different measures are available, including the standard chi2. Here we use the likelihood ratio.

textplot_keyness(result_keyness1)

textplot_keyness(result_keyness1, show_reference = FALSE) #The reference is less interesting to look at.
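For the curious, the likelihood-ratio measure can be worked out by hand for a single word. The sketch below uses the textbook G2 formula on a 2 x 2 table of observed versus expected counts. The function name `keyness_g2` and all the counts are invented, and Quanteda's own implementation may differ in details such as corrections.

```r
# Log-likelihood ratio (G2) for one word in a target group versus a reference group.
# a: count of the word in the target, b: count of the word in the reference,
# n_target / n_reference: total token counts of the two groups.
keyness_g2 <- function(a, b, n_target, n_reference) {
  # Rows: target, reference. Columns: the word, all other words.
  observed <- matrix(c(a, b, n_target - a, n_reference - b), nrow = 2)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  g2 <- 2 * sum(terms)
  # Sign the statistic: negative when the word is rarer in the target than expected.
  if (a < expected[1, 1]) -g2 else g2
}

keyness_g2(a = 120, b = 80,  n_target = 10000, n_reference = 90000)  # over-represented in the target
keyness_g2(a = 10,  b = 190, n_target = 10000, n_reference = 90000)  # under-represented in the target
keyness_g2(a = 20,  b = 180, n_target = 10000, n_reference = 90000)  # same rate in both groups
```

The expected counts depend entirely on what is put in the reference group, which is why the choice of reference is such a crucial analytical decision.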

result_keyness2 <- textstat_keyness(my_dfm, target = "2", measure = "lr")
result_keyness3 <- textstat_keyness(my_dfm, target = "3", measure = "lr")
result_keyness4 <- textstat_keyness(my_dfm, target = "4", measure = "lr")
result_keyness5 <- textstat_keyness(my_dfm, target = "5", measure = "lr")
result_keyness6 <- textstat_keyness(my_dfm, target = "6", measure = "lr")
result_keyness7 <- textstat_keyness(my_dfm, target = "7", measure = "lr")

textplot_keyness(result_keyness2, show_reference = FALSE)

textplot_keyness(result_keyness3, show_reference = FALSE)

textplot_keyness(result_keyness4, show_reference = FALSE)

textplot_keyness(result_keyness5, show_reference = FALSE)

textplot_keyness(result_keyness6, show_reference = FALSE)

textplot_keyness(result_keyness7, show_reference = FALSE)

Get the total picture

The totality: clusters and texts

#By the psychic powers invested in me, I shall name the clusters prior to producing the wordcloud.
cluster_names <- c("0" = "Noise", "1" = "Information Science", "2" = "Social Media",
                   "3" = "Modeling", "4" = "Geo-Science", "5" = "Digital Education",
                   "6" = "Documentation", "7" = "Health Care")
text0$cluster <- unname(cluster_names[as.character(text0$cluster)]) #recode the cluster column only, not every cell in the data frame.
corpQ <- corpus(text0) #rebuild the corpus, so that it carries the cluster names.

library(quanteda)
library(wordcloud)
set.seed(1)
corpus_subset(corpQ, 
              cluster %in% c("Noise", "Information Science", "Social Media", "Modeling", "Geo-Science", "Digital Education", "Documentation", "Health Care")) %>%
  dfm(groups = "cluster", remove = stopwords("english"), remove_punct = TRUE) %>%
  dfm_trim(min_termfreq = 5, verbose = FALSE) %>%
  textplot_wordcloud(comparison = TRUE, min_count = 6, random_order = FALSE,
                                        rotation = .25, 
                                        color = RColorBrewer::brewer.pal(8,"Dark2"))

This is the end of the line for Phantom 309. How can we get further? We could use kwic, we could employ various linguistic measurements and we could do a lot better when sampling. With the present approach, too much of the remaining signal is at a high level and determined almost exclusively by the WoS subject categories, even if we were to use a less heavily trimmed matrix with a few thousand tokens.

GMY

## [1] "MYA"

Min(d)ing Digitalization

Love Börjeson, PhD

January, 2019