Required packages

The following packages are required to run this program.

require(easyPubMed)

## Loading required package: easyPubMed

require(tm)

## Loading required package: tm

## Loading required package: NLP

require(udpipe)

## Loading required package: udpipe

require(wordcloud)

## Loading required package: wordcloud

## Loading required package: RColorBrewer

require(word2vec)

## Loading required package: word2vec

require(Rtsne)

## Loading required package: Rtsne

require(plotly)

## Loading required package: plotly

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

require(here)

## Loading required package: here

## here() starts at /Users/hiro/Documents/Rprojects/textMining

Data preparation

Retrieve the data from PubMed.

query <- "COVID-19" 
ids <- get_pubmed_ids(query)
pmd.xml <- fetch_pubmed_data(ids, retmax = 10000)
pmd.list <- articles_to_list(pmd.xml)
length(pmd.list)

## [1] 4997

Create a data frame consisting of the titles and abstracts. Articles with missing values (no abstract) are removed.

titl <- rep(NA, length(pmd.list))
abst <- rep(NA, length(pmd.list))
for (i in seq_along(pmd.list)) {
  # Convert one article record to a one-row data frame (authors are not needed)
  art <- article_to_df(pmd.list[[i]], max_chars = -1, getAuthors = FALSE)
  titl[i] <- art$title
  abst[i] <- art$abstract
}
df <- data.frame(titl, abst)
df <- na.omit(df)
dim(df)

## [1] 4264    2

Analysis of the titles

Examine how often each word appears in the titles.

First, check the original title data.

doc <- df$titl
doc[1]

## [1] "[Effort-Reward Imbalance, Ability to Work and the Desire for Career Exits: a Cross-sectional Study of Nurses]."

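Convert everything to lowercase and remove digits, brackets, and punctuation. Note that removing punctuation also joins hyphenated words: "Effort-Reward" becomes "effortreward".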

doc.cleaned <- stripWhitespace(
  removePunctuation(
    removeNumbers(tolower(doc))))
doc.cleaned[1]

## [1] "effortreward imbalance ability to work and the desire for career exits a crosssectional study of nurses"
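The same cleaning can be approximated with base R regular expressions, which makes it easier to see why hyphenated words end up merged. This is a minimal sketch and not the implementation used by tm; the helper name `clean_title`, its `split_hyphens` argument, and the example string are assumptions introduced for illustration.

```r
# Hypothetical helper (not part of tm): lowercase, drop digits and punctuation,
# then collapse runs of whitespace. Unlike stripWhitespace(), it also trims
# leading and trailing spaces.
clean_title <- function(x, split_hyphens = FALSE) {
  x <- tolower(x)
  if (split_hyphens) {
    # Optional: turn hyphens into spaces first, so "cross-sectional"
    # becomes "cross sectional" instead of "crosssectional".
    x <- gsub("-", " ", x, fixed = TRUE)
  }
  x <- gsub("[[:digit:]]+", "", x)
  x <- gsub("[[:punct:]]+", "", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

example <- "[Effort-Reward Imbalance: a Cross-sectional Study of 120 Nurses]."
merged <- clean_title(example)
spaced <- clean_title(example, split_hyphens = TRUE)
print(merged)
print(spaced)
```

Whether to merge or split hyphenated terms is a design choice: merging keeps "crosssectional" as a single token, while splitting spreads its counts over "cross" and "sectional".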

Count the term frequencies (per article).

dtf <- document_term_frequencies(doc.cleaned)
head(dtf, 10)

##     doc_id         term freq
##  1:   doc1 effortreward    1
##  2:   doc1    imbalance    1
##  3:   doc1      ability    1
##  4:   doc1           to    1
##  5:   doc1         work    1
##  6:   doc1          and    1
##  7:   doc1          the    1
##  8:   doc1       desire    1
##  9:   doc1          for    1
## 10:   doc1       career    1

Sum the frequencies over all articles.

res <- tapply(dtf$freq, dtf$term, sum)
sort(res, decreasing = TRUE)[1:50]

##          of         the       covid         and          in           a 
##        3484        2737        2726        2639        2347        1702 
##    pandemic         for        with      during          to     sarscov 
##         868         781         778         760         697         668 
##          on       study    patients      health        from          an 
##         664         473         444         350         339         286 
##      impact       among        care   infection      review     disease 
##         280         278         260         240         219         218 
##     vaccine        case vaccination          by    analysis coronavirus 
##         205         191         178         177         172         163 
##    clinical       after          as       using          at  associated 
##         157         149         149         136         129         121 
##        risk     against      mental  systematic   treatment      severe 
##         118         116         116         111         111         109 
##     factors  healthcare      social         use     between     effects 
##         108         108         105         105         104         101 
##    hospital      report 
##          99          99
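The two counting steps (per-document counts, then totals per term) can be illustrated on a toy corpus with base R only. This sketch does not use udpipe; the three toy documents and the name `dtf_toy` are made up for the example.

```r
# Toy corpus (made-up titles, already cleaned)
docs <- c("covid pandemic study", "covid vaccine study", "covid pandemic")

# Step 1: per-document term frequencies, in the same long format
# (doc_id, term, freq) as the table shown above.
tokens <- strsplit(docs, " ", fixed = TRUE)
dtf_toy <- do.call(rbind, lapply(seq_along(tokens), function(i) {
  tab <- table(tokens[[i]])
  data.frame(doc_id = paste0("doc", i),
             term = names(tab),
             freq = as.integer(tab),
             stringsAsFactors = FALSE)
}))

# Step 2: sum the frequencies of each term over all documents.
total <- tapply(dtf_toy$freq, dtf_toy$term, sum)
print(sort(total, decreasing = TRUE))
```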

Remove common words that carry little meaning (stop words). First, prepare the list of such words.

stp <- stopwords("en")
head(stp, 10)

##  [1] "i"         "me"        "my"        "myself"    "we"        "our"      
##  [7] "ours"      "ourselves" "you"       "your"

Drop every term that matches an entry in the list above.

selector <- !(dtf$term %in% stp)
dtf.sel <- dtf[selector, ]

Count the remaining words.

word.count <- tapply(dtf.sel$freq, dtf.sel$term, sum)
sort(word.count, decreasing = TRUE)[1:50]

##          covid       pandemic        sarscov          study       patients 
##           2726            868            668            473            444 
##         health         impact          among           care      infection 
##            350            280            278            260            240 
##         review        disease        vaccine           case    vaccination 
##            219            218            205            191            178 
##       analysis    coronavirus       clinical          using     associated 
##            172            163            157            136            121 
##           risk         mental     systematic      treatment         severe 
##            118            116            111            111            109 
##        factors     healthcare         social            use        effects 
##            108            108            105            105            101 
##       hospital         report       students       syndrome          first 
##             99             99             99             98             96 
##    respiratory         survey       children       response           data 
##             95             95             94             93             91 
##        workers          acute crosssectional         public      detection 
##             91             90             90             88             86 
##       learning       lockdown       outcomes        patient         cohort 
##             86             85             85             85             80

Display the result as a word cloud, using only the 100 most frequent words.

word.top100 <- sort(word.count, decreasing = TRUE)[1:100]
wordcloud(names(word.top100), freq = word.top100)

Analysis with word2vec

Next, run an analysis with word2vec. The original word2vec paper is available at https://arxiv.org/pdf/1301.3781.pdf.

The following blog post and paper are also good references: https://ruder.io/word-embeddings-1/ https://arxiv.org/pdf/1411.2738.pdf

First, prepare the data.

x <- txt_clean_word2vec(df$abst)

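Next, learn the relationships between words with the word2vec function. The skip-gram algorithm is used here, with 30-dimensional vectors and a window of 15 words.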

model <- word2vec(x, type = "skip-gram", dim = 30, window = 15, iter = 5)
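To make the role of the window concrete, the sketch below lists the (center, context) pairs that a skip-gram model is trained to predict for one short sentence. It is a simplified illustration only: the function `skipgram_pairs` and the example sentence are assumptions, and real word2vec training adds further steps (such as negative sampling and subsampling of frequent words) that are omitted here.

```r
# Hypothetical helper: enumerate (center, context) training pairs for a
# fixed symmetric window. Not part of the word2vec package.
skipgram_pairs <- function(tokens, window = 2) {
  n <- length(tokens)
  pairs <- lapply(seq_len(n), function(i) {
    # Positions within `window` words of position i, excluding i itself
    idx <- setdiff(max(1, i - window):min(n, i + window), i)
    data.frame(center = rep(tokens[i], length(idx)),
               context = tokens[idx],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

sentence <- c("wearing", "a", "mask", "reduces", "transmission")
pairs <- skipgram_pairs(sentence, window = 1)
print(pairs)
```

With a larger window, such as the 15 used above, each word is paired with more distant neighbors, which tends to emphasize topical rather than purely syntactic similarity.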

Display the result. Each word is represented as a point in the 30-dimensional vector space.

head(as.matrix(model), 10)

##                       [,1]        [,2]       [,3]       [,4]        [,5]
## spread        0.4520214200  0.01982646  0.7402269  0.2336672  0.29515609
## 2250         -0.4601969123 -0.81091893  1.3559760 -0.4495805  0.89198577
## development  -0.8400969505  1.37467599 -0.4643043  0.9992725 -0.29507962
## overweight    0.7383289933 -1.15229416  0.1796875  0.3445530 -0.21481802
## facilitators -1.0576896667 -2.06134129  1.0013345  1.1189053 -0.29168746
## offered       0.1299842596 -0.59942853  0.2731722  1.2644868 -0.09282218
## constant      0.0665775165 -0.88700992  2.5957649  1.5450702  1.52805603
## makes        -0.3491486609  0.67862570  0.8007759  0.4107775 -0.95123988
## hrct          0.0007800715  1.71242642  1.3537987 -1.3064724 -0.93673891
## door          0.1165204793 -0.81286192  1.1735734  0.6422958  0.20757714
##                    [,6]       [,7]        [,8]        [,9]       [,10]
## spread        0.7151327  1.8754727 -0.13276638  0.33863384  0.16876695
## 2250         -0.8129580 -0.4815224  1.79528105  0.30786598  0.33757219
## development  -0.1777177  0.8938470  1.45137095  0.21241586  0.53466266
## overweight   -0.5310414 -1.2843819  0.34786931 -0.16680828 -0.29187000
## facilitators -1.3703411 -0.3458766 -0.60719353  0.53521961  0.24375539
## offered      -0.7633473 -0.9698189 -0.07846674  2.01354384 -0.49253932
## constant     -1.0559298  0.5479648  0.22827454  0.06985193  0.12351657
## makes         0.8303112 -0.3699438 -0.07734013  1.62063396  0.61739147
## hrct         -0.2289768  0.5148864 -0.18482909  0.15089843  0.75733632
## door          1.3958154  0.9105923 -0.19801074  1.48099649  0.03333392
##                    [,11]        [,12]       [,13]      [,14]      [,15]
## spread        1.53732014  1.000580430  2.02346063 -1.9172499 -0.5497498
## 2250         -1.18838871  0.089142457 -0.46073514  0.2598677 -0.5783548
## development   1.16170585  0.573057592 -0.04535265 -0.6475110 -1.2943190
## overweight   -0.04613147 -0.880035222 -0.99052644 -1.3121864 -1.7649405
## facilitators  0.94924349  0.013399127 -0.23672156 -0.7423967  0.1365642
## offered       1.22995186  0.745887995 -0.41149566 -0.2073045 -0.8300067
## constant     -0.20016693 -0.008759694 -0.30882466 -0.7804393 -0.6129454
## makes         1.64125228 -0.208750471  1.50842083 -1.4000434 -2.0361733
## hrct         -0.80800796  0.128040150 -2.05782747 -1.1445453 -1.1808213
## door         -0.04776298  2.149433851 -0.36034513  0.0254088 -0.9793495
##                   [,16]       [,17]      [,18]      [,19]      [,20]
## spread        0.3599528  0.88801330 -0.1652039 -2.1712296  0.1274741
## 2250          1.7029228 -0.52262402  0.5174385 -1.3846997 -0.5156091
## development  -0.6658509  1.42778051  0.5021302 -2.2183750  0.7130885
## overweight   -0.8172048  0.38596201  0.1197450 -1.3696548  2.0808895
## facilitators -1.1144742  1.64357948 -0.2167569 -0.6922774 -0.4530377
## offered      -0.7720231  0.05587474 -0.8646914 -2.2204714 -1.3066896
## constant      0.2980711  1.81263018  0.5249712 -1.2585499 -0.2972787
## makes         0.9320040  0.72332364  1.0177402 -2.0599470 -0.5641957
## hrct          1.4552363  0.81961507  0.4495192 -2.2947118 -0.8491314
## door         -0.6278632  0.17910148 -0.5655844 -2.6831987  0.2821904
##                    [,21]       [,22]       [,23]       [,24]      [,25]
## spread       -1.53338873 -1.14535475 -0.39852038  0.28479096 -0.6879856
## 2250          0.59850174 -1.55611455 -0.04851867  1.57306707  0.5445135
## development  -1.67078209 -0.04150536  0.26096937 -0.95307463 -1.7819059
## overweight   -2.05460143  0.40159649 -0.52835780  2.02463603  0.4342352
## facilitators  0.15430093 -1.20883024 -1.33872032  0.06882582 -0.1982248
## offered      -1.28653717 -1.14365685 -0.02885747 -0.37034440 -0.1310008
## constant      0.98075253 -0.91148055 -1.43347883 -0.09543622 -1.4255906
## makes        -0.64314681 -0.25234625  0.86694264 -0.30478802  0.2575774
## hrct          0.01785801 -0.37175333  0.82698339  0.64089406  0.8169389
## door          0.14667721 -1.50685155  0.37082866 -1.38918281 -0.6206015
##                   [,26]       [,27]       [,28]      [,29]       [,30]
## spread       -1.3818840 -0.67085230  0.62786621  0.3635623 -0.73535007
## 2250         -1.6989353 -2.17069674 -0.50919485 -0.2794248 -0.90682179
## development  -1.4372041  0.27258244 -0.74469632 -1.0989015  0.22390436
## overweight   -0.2818217  1.07850897 -1.10142291 -0.6930379 -0.49691492
## facilitators -2.3339920  0.07341208 -1.94650507 -0.8710522  0.23805828
## offered      -1.8110186 -0.92995977 -1.71545672 -0.9417403 -0.25999641
## constant     -1.7988596 -0.20455910 -0.48102352 -0.1841014  0.62163550
## makes        -1.5269244 -0.08891134  0.88788396  1.1059235  0.12578748
## hrct         -0.6283227 -0.23575822 -0.05935121  1.6600286 -0.07598664
## door         -1.8516753  0.49490950  0.51699054 -0.2403716 -0.03158898

List the 30 words most similar to "mask".

nn <- predict(model, c("mask"), type = "nearest", top_n = 30)
nn

## $mask
##    term1      term2 similarity rank
## 1   mask    wearing  0.9583051    1
## 2   mask      masks  0.9442208    2
## 3   mask       ffp2  0.9281084    3
## 4   mask      cloth  0.9160715    4
## 5   mask       wear  0.9016488    5
## 6   mask        n95  0.8987269    6
## 7   mask    washing  0.8900403    7
## 8   mask      hands  0.8854057    8
## 9   mask    placing  0.8766692    9
## 10  mask   facemask  0.8765922   10
## 11  mask    exhaled  0.8685558   11
## 12  mask disposable  0.8632147   12
## 13  mask  sanitizer  0.8589706   13
## 14  mask   aerosols  0.8541734   14
## 15  mask    correct  0.8536401   15
## 16  mask       face  0.8533462   16
## 17  mask     gloves  0.8531128   17
## 18  mask   improper  0.8529463   18
## 19  mask  facemasks  0.8470936   19
## 20  mask respirator  0.8415533   20
## 21  mask  procedure  0.8408191   21
## 22  mask   droplets  0.8402680   22
## 23  mask distancing  0.8360549   23
## 24  mask       hand  0.8336802   24
## 25  mask     cotton  0.8298696   25
## 26  mask  usability  0.8296772   26
## 27  mask    avoided  0.8275607   27
## 28  mask     shield  0.8257001   28
## 29  mask exercising  0.8256449   29
## 30  mask    aerosol  0.8239759   30
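A nearest-neighbor lookup of this kind boils down to scoring every word vector against the query vector and sorting. The sketch below uses cosine similarity on a tiny hand-made embedding matrix. Cosine similarity is a common choice and is assumed here for illustration; the exact similarity measure used by `predict()` in the word2vec package may differ. The matrix `emb_toy` and its values are invented.

```r
# Cosine similarity between each row of matrix m and vector v
cosine_sim <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Invented 3-dimensional embeddings, for illustration only
emb_toy <- rbind(
  mask    = c(1.0, 0.0, 1.0),
  masks   = c(0.9, 0.1, 1.1),
  vaccine = c(-1.0, 1.0, 0.0),
  n95     = c(1.0, 0.2, 0.8)
)

query <- emb_toy["mask", ]
others <- emb_toy[rownames(emb_toy) != "mask", , drop = FALSE]
sims <- cosine_sim(others, query)
names(sims) <- rownames(others)

# Most similar words first
nearest <- sort(sims, decreasing = TRUE)
print(nearest)
```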

List the 30 words most similar to "omicron".

nn <- predict(model, c("omicron"), type = "nearest", top_n = 30)
nn

## $omicron
##      term1            term2 similarity rank
## 1  omicron              voc  0.9592404    1
## 2  omicron          variant  0.9587039    2
## 3  omicron         variants  0.9559210    3
## 4  omicron           escape  0.9422399    4
## 5  omicron        mutations  0.9386650    5
## 6  omicron         mutation  0.9347604    6
## 7  omicron             vois  0.9328879    7
## 8  omicron             vocs  0.9326198    8
## 9  omicron            d614g  0.9306197    9
## 10 omicron          mutated  0.9291540   10
## 11 omicron            n501y  0.9195092   11
## 12 omicron            delta  0.9167343   12
## 13 omicron            n440k  0.9001133   13
## 14 omicron      outcompeted  0.8999096   14
## 15 omicron        antigenic  0.8998986   15
## 16 omicron             beta  0.8998317   16
## 17 omicron            vnars  0.8967146   17
## 18 omicron transmissibility  0.8962524   18
## 19 omicron              529  0.8950707   19
## 20 omicron     neutralizing  0.8916616   20
## 21 omicron            gamma  0.8915190   21
## 22 omicron         lineages  0.8913069   22
## 23 omicron            alpha  0.8907861   23
## 24 omicron               mu  0.8904122   24
## 25 omicron       neutralize  0.8892992   25
## 26 omicron             229e  0.8880194   26
## 27 omicron            r346k  0.8876979   27
## 28 omicron            spike  0.8871691   28
## 29 omicron          concern  0.8864110   29
## 30 omicron      infectivity  0.8857899   30

List the 30 words most similar to "machine".

nn <- predict(model, c("machine"), type = "nearest", top_n = 30)
nn

## $machine
##      term1            term2 similarity rank
## 1  machine        algorithm  0.9279332    1
## 2  machine       prediction  0.9087546    2
## 3  machine         machines  0.9044300    3
## 4  machine         forecast  0.9042990    4
## 5  machine    architectures  0.9037781    5
## 6  machine         proposed  0.8938621    6
## 7  machine            built  0.8931577    7
## 8  machine   classification  0.8925775    8
## 9  machine              svm  0.8917192    9
## 10 machine generalizability  0.8906193   10
## 11 machine        denoising  0.8904656   11
## 12 machine       algorithms  0.8892074   12
## 13 machine    automatically  0.8891503   13
## 14 machine     unsupervised  0.8885775   14
## 15 machine            model  0.8881191   15
## 16 machine        histogram  0.8880148   16
## 17 machine     efficientnet  0.8875355   17
## 18 machine    convolutional  0.8865160   18
## 19 machine           models  0.8860496   19
## 20 machine        construct  0.8847618   20
## 21 machine      adversarial  0.8824225   21
## 22 machine        computing  0.8806093   22
## 23 machine            vgg19  0.8804816   23
## 24 machine         gaussian  0.8798761   24
## 25 machine       classifier  0.8798543   25
## 26 machine             deep  0.8795468   26
## 27 machine      forecasting  0.8783342   27
## 28 machine       supervised  0.8758425   28
## 29 machine          stacked  0.8756320   29
## 30 machine      expectation  0.8751626   30

Next, take the words used most often in the titles (the 100 most frequent) and visualize their positions in the vector space with t-SNE.

selector <- names(word.top100) %in% rownames(as.matrix(model))
dm <- as.matrix(model)[names(word.top100)[selector],]
word <- rownames(dm)
tsne.dm <- Rtsne(dm)
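Two practical points about this step: t-SNE is stochastic, so a seed is needed for a reproducible layout, and Rtsne rejects a perplexity that is too large for the number of rows. The sketch below picks a perplexity that stays within the bound 3 * perplexity <= nrow - 1. That bound, the helper name `safe_perplexity`, and the random toy matrix are assumptions made for this example; the Rtsne call only runs if the package is installed.

```r
# Hypothetical helper: largest perplexity not exceeding `target` that
# satisfies 3 * perplexity <= n_rows - 1 (assumed Rtsne requirement).
safe_perplexity <- function(n_rows, target = 30) {
  max(1, min(target, floor((n_rows - 1) / 3)))
}

set.seed(1)  # fix the random layout
dm_toy <- matrix(rnorm(40 * 30), nrow = 40)  # 40 fake words, 30 dimensions
perp <- safe_perplexity(nrow(dm_toy))
print(perp)

if (requireNamespace("Rtsne", quietly = TRUE)) {
  fit <- Rtsne::Rtsne(dm_toy, perplexity = perp)
  print(dim(fit$Y))  # one 2-D coordinate per row of dm_toy
}
```

This matters here because fewer than 100 rows may remain once title words missing from the model vocabulary are filtered out.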

Visualize the result.

df <- data.frame(tsne.dm$Y, word)
plot_ly(df, x = ~X1, y = ~ X2, type = "scatter", mode = "text", text = word)

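Next, embed the documents in the 30-dimensional space, based on the learned word relationships. Note that titl and abst still hold all 4,997 retrieved articles, including those without an abstract.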

x <- data.frame(doc_id = titl, text = abst)
emb <- doc2vec(model, x, type = "embedding")
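One simple way to turn word vectors into a document vector is to average the vectors of the words that occur in the document. The sketch below shows that idea on an invented embedding matrix. It is an illustration under that assumption, not a reimplementation of `doc2vec()`, whose exact aggregation and scaling may differ; `embed_doc` and `emb_toy` are hypothetical names.

```r
# Hypothetical helper: average the embeddings of the known words in `text`.
# Returns NA for missing text or when no word is in the vocabulary.
embed_doc <- function(text, emb) {
  if (is.na(text)) return(rep(NA_real_, ncol(emb)))
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens <- tokens[tokens %in% rownames(emb)]
  if (length(tokens) == 0) return(rep(NA_real_, ncol(emb)))
  colMeans(emb[tokens, , drop = FALSE])
}

# Invented 2-dimensional word embeddings
emb_toy <- rbind(omicron = c(1, 0), variant = c(0, 1), mask = c(1, 1))

v_known   <- embed_doc("Omicron variant!", emb_toy)
v_missing <- embed_doc(NA_character_, emb_toy)
print(v_known)
print(v_missing)
```

In this sketch, a document with no usable text yields an all-NA vector, so any similarity computed from it is NA as well.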

Use a description of Omicron (from the WHO website) as the query and retrieve the most similar documents. https://www.who.int/news/item/28-11-2021-update-on-omicron

q <- txt_clean_word2vec("On 26 November 2021, WHO designated the variant B.1.1.529 a variant of concern, named Omicron, on the advice of WHO’s Technical Advisory Group on Virus Evolution (TAG-VE).  This decision was based on the evidence presented to the TAG-VE that Omicron has several mutations that may have an impact on how it behaves, for example, on how easily it spreads or the severity of illness it causes. Here is a summary of what is currently known. ")
# from https://www.who.int/news/item/28-11-2021-update-on-omicron
newdoc <- doc2vec(model, q)
sim.om <- word2vec_similarity(emb, newdoc)
names(sim.om) <- rownames(emb)
sort(sim.om, decreasing = TRUE)[1:10]

##                                                         The omicron variant of SARS-CoV-2: Understanding the known and living with unknowns. 
##                                                                                                                                    0.9916369 
##                                                                      The unresolved question on COVID-19 virus origin: The three cards game? 
##                                                                                                                                    0.9907525 
##                                                                The challenges of COVID-19 Delta variant: Prevention and vaccine development. 
##                                                                                                                                    0.9906693 
##                                                                Sequence analysis of the Emerging Sars-CoV-2 Variant Omicron in South Africa. 
##                                                                                                                                    0.9902452 
##                                                                                PCR performance in the SARS-CoV-2 Omicron variant of concern? 
##                                                                                                                                    0.9901107 
##                   Detection of Omicron (B.1.1.529) variant has created panic among the people across the world: What should we do right now? 
##                                                                                                                                    0.9900354 
## The Electrostatic Potential of the Omicron Variant Spike is Higher than in Delta and Delta-plus Variants: a Hint to Higher Transmissibility? 
##                                                                                                                                    0.9895625 
##                                  Characterization of the novel SARS-CoV-2 Omicron (B.1.1.529) Variant of Concern and its global perspective. 
##                                                                                                                                    0.9891242 
##                                                            OMICRON (B.1.1.529): A new SARS-CoV-2 Variant of Concern mounting worldwide fear. 
##                                                                                                                                    0.9886011 
##                                                                          The Development of SARS-CoV-2 Variants: The Gene Makes the Disease. 
##                                                                                                                                    0.9867635

Use the abstract of a paper on neural networks for COVID-19 as the query and retrieve similar documents. https://pubmed.ncbi.nlm.nih.gov/34745319/

my.abst <- txt_clean_word2vec("Recently, people around the world are being vulnerable to the pandemic effect of 
the novel Corona Virus. It is very difficult to detect the virus infected chest 
X-ray (CXR) image during early stages due to constant gene mutation of the 
virus. It is also strenuous to differentiate between the usual pneumonia from 
the COVID-19 positive case as both show similar symptoms. This paper proposes a 
modified residual network based enhancement (ENResNet) scheme for the visual 
clarification of COVID-19 pneumonia impairment from CXR images and 
classification of COVID-19 under deep learning framework. Firstly, the residual 
image has been generated using residual convolutional neural network through 
batch normalization corresponding to each image. Secondly, a module has been 
constructed through normalized map using patches and residual images as input. 
The output consisting of residual images and patches of each module are fed into 
the next module and this goes on for consecutive eight modules. A feature map is 
generated from each module and the final enhanced CXR is produced via 
up-sampling process. Further, we have designed a simple CNN model for automatic 
detection of COVID-19 from CXR images in the light of 'multi-term loss' function 
and 'softmax' classifier in optimal way. The proposed model exhibits better 
result in the diagnosis of binary classification (COVID vs. Normal) and 
multi-class classification (COVID vs. Pneumonia vs. Normal) in this study. The 
suggested ENResNet achieves a classification accuracy 99.7% and 98.4% for binary 
classification and multi-class detection respectively in comparison with 
state-of-the-art methods.")
# Ghosh and Ghosh (2022) ENResNet: A novel residual neural network for chest X-ray enhancement based COVID-19 detection. Biomed Signal Process Control. doi: 10.1016/j.bspc.2021.103286.
# PMID: 34745319
newdoc <- doc2vec(model, my.abst)
sim.ml <- word2vec_similarity(emb, newdoc)
names(sim.ml) <- rownames(emb)
sort(sim.ml, decreasing = TRUE)[1:10]

##                                                                 Automatic Sequence-Based Network for Lung Diseases Detection in Chest CT. 
##                                                                                                                                 0.9975762 
##                   DC-GAN-based synthetic X-ray images augmentation for increasing the performance of EfficientNet for COVID-19 detection. 
##                                                                                                                                 0.9957998 
##                                    CNN Filter Learning from Drawn Markers for the Detection of Suggestive Signs of COVID-19 in CT Images. 
##                                                                                                                                 0.9957918 
##                               Fusion of multi-scale bag of deep visual words features of chest X-ray images to detect COVID-19 infection. 
##                                                                                                                                 0.9953493 
##                                 Automated COVID-19 detection from X-ray and CT images with stacked ensemble convolutional neural network. 
##                                                                                                                                 0.9953292 
##                                     C3D-UNET: A Comprehensive 3D Unet for Covid-19 Segmentation with Intact Encoding and Local Attention. 
##                                                                                                                                 0.9947068 
##                                                         Quadruple Augmented Pyramid Network for Multi-class COVID-19 Segmentation via CT. 
##                                                                                                                                 0.9939834 
##                 COVID-19 Volumetric Pulmonary Lesion Estimation on CT Images using a U-NET and Probabilistic Active Contour Segmentation. 
##                                                                                                                                 0.9939392 
##                                          Feature extraction with capsule network for the COVID-19 disease prediction though X-ray images. 
##                                                                                                                                 0.9938018 
## One Shot Model For The Prediction of COVID-19 And Lesions Segmentation In Chest CT Scans Through The Affinity Among Lesion Mask Features. 
##                                                                                                                                 0.9937366

Using a pretrained model

Use a model pretrained on Google News. The model can be downloaded from the link "The archive is available here: GoogleNews-vectors-negative300.bin.gz." under "Pre-trained word and phrase vectors" on the following page. The code below reads the decompressed .bin file. https://code.google.com/archive/p/word2vec/

model.gn <- read.word2vec(here("data", "GoogleNews-vectors-negative300.bin"))
model.gn

## $model
## <pointer: 0x7fb15ea64a10>
## 
## $model_path
## [1] "/Users/hiro/Documents/Rprojects/textMining/data/GoogleNews-vectors-negative300.bin"
## 
## $dim
## [1] 300
## 
## $vocabulary
## [1] 3e+06
## 
## attr(,"class")
## [1] "word2vec"

The model contains \(3\times10^6\) words, each represented as a 300-dimensional vector.
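A model of this size is expensive to hold in memory as a dense matrix, which is worth keeping in mind before calling `as.matrix()` on it. The arithmetic below is a rough estimate under two assumptions: that the binary file stores 32-bit floats, and that an R numeric matrix uses 8 bytes per element. Overheads such as row names are ignored.

```r
n_words <- 3e6   # vocabulary size reported by the model
n_dim   <- 300   # embedding dimension reported by the model

bytes_float  <- n_words * n_dim * 4  # assumed on-disk storage (32-bit floats)
bytes_double <- n_words * n_dim * 8  # dense numeric matrix in R (doubles)

to_gib <- function(bytes) bytes / 2^30
gib_float  <- to_gib(bytes_float)
gib_double <- to_gib(bytes_double)
print(round(c(float32 = gib_float, double = gib_double), 2))
```

Under these assumptions, a full dense matrix in R takes roughly twice the space of the raw vectors, so it is preferable to extract it at most once.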

nn <- predict(model.gn, c("mask"), type = "nearest", top_n = 30)
nn

## $mask
##    term1                          term2 similarity rank
## 1   mask                           mask  0.1752332    1
## 2   mask                          masks  0.1589547    2
## 3   mask                  surgical_mask  0.1574033    3
## 4   mask                protective_mask  0.1536028    4
## 5   mask                  SSL_VPN_Proxy  0.1509127    5
## 6   mask  reader_interaction_discussion  0.1493519    6
## 7   mask         Spelling_follows_North  0.1480598    7
## 8   mask                filters_encrypt  0.1458109    8
## 9   mask             bomb_shaped_turban  0.1435986    9
## 10  mask                       ski_mask  0.1415209   10
## 11  mask                  silicone_mask  0.1412422   11
## 12  mask                       facemask  0.1408077   12
## 13  mask                 surgical_masks  0.1402936   13
## 14  mask             sports_magazine_SZ  0.1388565   14
## 15  mask                    oxygen_mask  0.1380737   15
## 16  mask             strengths_weakness  0.1363707   16
## 17  mask                   blue_bandana  0.1350546   17
## 18  mask eyewitness_accounts_background  0.1348199   18
## 19  mask                          cloak  0.1343265   19
## 20  mask                        goggles  0.1343031   20
## 21  mask             invisibility_cloak  0.1340850   21
## 22  mask                      balaclava  0.1338594   22
## 23  mask                       disguise  0.1338415   23
## 24  mask            Important_locations  0.1337272   24
## 25  mask              Enterprise_Terror  0.1333233   25
## 26  mask                           veil  0.1330992   26
## 27  mask             Dow_Jones_Reprints  0.1323623   27
## 28  mask                prosthetic_nose  0.1323556   28
## 29  mask                      spacesuit  0.1322882   29
## 30  mask    Remember_comment_moderation  0.1322630   30

As expected, the similar words retrieved here differ from those learned from the abstracts of COVID-19 papers.

nn <- predict(model.gn, c("machine"), type = "nearest", top_n = 30)
nn

## $machine
##      term1                        term2 similarity rank
## 1  machine                      machine  0.1520944    1
## 2  machine                     Snow_WPS  0.1428445    2
## 3  machine          Custom_manufacturer  0.1417987    3
## 4  machine                     machines  0.1411512    4
## 5  machine                       lathes  0.1299090    5
## 6  machine               lever_machines  0.1291039    6
## 7  machine                  hedged_play  0.1255907    7
## 8  machine              voting_machines  0.1244253    8
## 9  machine                        servo  0.1229710    9
## 10 machine              optical_scanner  0.1227231   10
## 11 machine               automated_spam  0.1223796   11
## 12 machine        Simple_Moving_Average  0.1219988   12
## 13 machine                Sunspot_forum  0.1200565   13
## 14 machine                 CNC_machines  0.1200184   14
## 15 machine                    workpiece  0.1197284   15
## 16 machine                       coater  0.1196761   16
## 17 machine           Dow_Jones_Reprints  0.1195178   17
## 18 machine                   tabulators  0.1193935   18
## 19 machine     visit_eFinancialNews.com  0.1192743   19
## 20 machine              thermal_printer  0.1191476   20
## 21 machine                        lathe  0.1191054   21
## 22 machine                    tabulator  0.1190799   22
## 23 machine                       sorter  0.1189296   23
## 24 machine             optical_scanners  0.1189102   24
## 25 machine          strong_bullish_bias  0.1188440   25
## 26 machine             printing_presses  0.1183960   26
## 27 machine redesigned_Internet_Explorer  0.1181469   27
## 28 machine                  centrifuges  0.1180813   28
## 29 machine                   workpieces  0.1179092   29
## 30 machine           entirety_via_email  0.1177564   30

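Next, take the words used most often in the titles (the 100 most frequent) and compute their positions in the vector space of the model trained on Google News.

# Extract the embedding matrix once; it has 3e6 rows and 300 columns
mat.gn <- as.matrix(model.gn)
selector <- names(word.top100) %in% rownames(mat.gn)
dm <- mat.gn[names(word.top100)[selector], ]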
word <- rownames(dm)
tsne.dm <- Rtsne(dm)

Visualize the result.

df <- data.frame(tsne.dm$Y, word)
plot_ly(df, x = ~X1, y = ~ X2, type = "scatter", mode = "text", text = word)

Embed the documents using the model pretrained on Google News.

x <- data.frame(doc_id = titl, text = abst)
emb.gn <- doc2vec(model.gn, x, type = "embedding")

Use the description of Omicron (from the WHO website) as the query and retrieve the most similar documents. https://www.who.int/news/item/28-11-2021-update-on-omicron

q <- txt_clean_word2vec("On 26 November 2021, WHO designated the variant B.1.1.529 a variant of concern, named Omicron, on the advice of WHO’s Technical Advisory Group on Virus Evolution (TAG-VE).  This decision was based on the evidence presented to the TAG-VE that Omicron has several mutations that may have an impact on how it behaves, for example, on how easily it spreads or the severity of illness it causes. Here is a summary of what is currently known. ")
# from https://www.who.int/news/item/28-11-2021-update-on-omicron
newdoc <- doc2vec(model.gn, q)
sim.om.gn <- word2vec_similarity(emb.gn, newdoc)
names(sim.om.gn) <- rownames(emb.gn)
sort(sim.om.gn, decreasing = TRUE)[1:10]

##                                       Estimating a time-to-event distribution from right-truncated data in an epidemic: A review of methods. 
##                                                                                                                                    0.9329065 
## The Electrostatic Potential of the Omicron Variant Spike is Higher than in Delta and Delta-plus Variants: a Hint to Higher Transmissibility? 
##                                                                                                                                    0.9325275 
##                                                                          Test sensitivity for infection versus infectiousness of SARS-CoV-2. 
##                                                                                                                                    0.9310710 
##                                                                          The Development of SARS-CoV-2 Variants: The Gene Makes the Disease. 
##                                                                                                                                    0.9308123 
##                                                  Can COVID 19 cause atypical forms of Pityriasis Rosea refractory to conventional therapies? 
##                                                                                                                                    0.9293905 
##                                           Waves and variants of SARS-CoV-2: understanding the causes and effect of the COVID-19 catastrophe. 
##                                                                                                                                    0.9290826 
##                                                                           Transgenic Model Systems Have Revolutionized the Study of Disease. 
##                                                                                                                                    0.9284527 
##                                                                      The unresolved question on COVID-19 virus origin: The three cards game? 
##                                                                                                                                    0.9284171 
##                                        Time series analysis and predicting COVID-19 affected patients by ARIMA model using machine learning. 
##                                                                                                                                    0.9281802 
##                                                                                          A network SIRX model for the spreading of COVID-19. 
##                                                                                                                                    0.9275452

Compare the similarities obtained in the vector space pretrained on Google News with those obtained in the vector space trained on PubMed abstracts of COVID-19-related papers.

plot(sim.om, sim.om.gn)

cor(sim.om, sim.om.gn, method = "pearson", use = "pairwise.complete.obs")

##          [,1]
## [1,] 0.987598

cor(sim.om, sim.om.gn, method = "spearman", use = "pairwise.complete.obs")

##           [,1]
## [1,] 0.7956069

The Pearson correlation is high (0.99), whereas the rank (Spearman) correlation is noticeably lower (0.80).
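One possible reason for a gap between the two coefficients is that Pearson's coefficient can be dominated by a few extreme values, while Spearman's coefficient depends only on the ordering. The following is a minimal sketch on synthetic data; the vectors `a` and `b` are invented for illustration and are not the similarities computed in this analysis.

```r
# Synthetic similarity scores (hypothetical; not the values from this analysis).
set.seed(1)
a <- c(runif(200, 0.90, 0.92), 0.99, 0.60)  # tightly clustered bulk plus two extreme values
b <- a + rnorm(length(a), sd = 0.005)        # the same scores with small noise added

pearson  <- cor(a, b, method = "pearson")
spearman <- cor(a, b, method = "spearman")

# Spearman's coefficient is Pearson's coefficient computed on the ranks.
spearman.by.rank <- cor(rank(a), rank(b), method = "pearson")

print(c(pearson = pearson, spearman = spearman, spearman.by.rank = spearman.by.rank))
```

In this toy setting the extreme values pull the Pearson coefficient up, while the noise reshuffles the ranks inside the clustered bulk and lowers the Spearman coefficient.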

Check how well the top 30 documents agree between the two models.

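# Flag the 30 most similar documents under each model.
# The names say "top100", but the cutoff actually used is 30.
top100 <- rank(1 - sim.om) <= 30
top100.gn <- rank(1 - sim.om.gn) <= 30
table(top100, top100.gn)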

##        top100.gn
## top100  FALSE TRUE
##   FALSE  4943   24
##   TRUE     24    6

Six documents appear in the top 30 under both models.
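The same overlap can also be obtained directly from the document names. The helper below is a minimal sketch: `top.k.overlap` is a hypothetical function, not part of any package used here, and it assumes that both score vectors carry the document titles as names.

```r
# Hypothetical helper: names of the documents ranked in the top k of both score vectors.
# Assumes x and y are named with the same document titles; sort() drops NA scores.
top.k.overlap <- function(x, y, k) {
  top.x <- names(sort(x, decreasing = TRUE))[seq_len(min(k, length(x)))]
  top.y <- names(sort(y, decreasing = TRUE))[seq_len(min(k, length(y)))]
  intersect(top.x, top.y)
}

# Toy scores (invented for illustration).
s1 <- c(A = 0.9, B = 0.8, C = 0.7, D = 0.6, E = 0.5)
s2 <- c(A = 0.5, B = 0.9, C = 0.8, D = 0.4, E = 0.7)
shared <- top.k.overlap(s1, s2, k = 3)
print(shared)
```

With the objects from this analysis, the call would presumably be `top.k.overlap(sim.om, sim.om.gn, k = 30)`.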

The six shared documents are listed below with their similarities from the PubMed-trained model (sim.om).

sim.om[top100 & top100.gn]

##                                                                          The Development of SARS-CoV-2 Variants: The Gene Makes the Disease. 
##                                                                                                                                    0.9867635 
##                                                                                                          Omicron: call for updated vaccines. 
##                                                                                                                                    0.9863397 
## The Electrostatic Potential of the Omicron Variant Spike is Higher than in Delta and Delta-plus Variants: a Hint to Higher Transmissibility? 
##                                                                                                                                    0.9895625 
##                                           Waves and variants of SARS-CoV-2: understanding the causes and effect of the COVID-19 catastrophe. 
##                                                                                                                                    0.9863994 
##                                                                      The unresolved question on COVID-19 virus origin: The three cards game? 
##                                                                                                                                    0.9907525 
##                                  Estimating the transmission advantage of the D614G mutant strain of SARS-CoV-2, December 2019 to June 2020. 
##                                                                                                                                    0.9848339

Even though the model was pretrained on Google News, which contains no information about the Omicron variant of COVID-19, several Omicron-related papers are still retrieved.

The same analysis is repeated for the machine-learning example.

my.abst <- txt_clean_word2vec("Recently, people around the world are being vulnerable to the pandemic effect of 
the novel Corona Virus. It is very difficult to detect the virus infected chest 
X-ray (CXR) image during early stages due to constant gene mutation of the 
virus. It is also strenuous to differentiate between the usual pneumonia from 
the COVID-19 positive case as both show similar symptoms. This paper proposes a 
modified residual network based enhancement (ENResNet) scheme for the visual 
clarification of COVID-19 pneumonia impairment from CXR images and 
classification of COVID-19 under deep learning framework. Firstly, the residual 
image has been generated using residual convolutional neural network through 
batch normalization corresponding to each image. Secondly, a module has been 
constructed through normalized map using patches and residual images as input. 
The output consisting of residual images and patches of each module are fed into 
the next module and this goes on for consecutive eight modules. A feature map is 
generated from each module and the final enhanced CXR is produced via 
up-sampling process. Further, we have designed a simple CNN model for automatic 
detection of COVID-19 from CXR images in the light of 'multi-term loss' function 
and 'softmax' classifier in optimal way. The proposed model exhibits better 
result in the diagnosis of binary classification (COVID vs. Normal) and 
multi-class classification (COVID vs. Pneumonia vs. Normal) in this study. The 
suggested ENResNet achieves a classification accuracy 99.7% and 98.4% for binary 
classification and multi-class detection respectively in comparison with 
state-of-the-art methods.")
# Ghosh and Ghosh (2022) ENResNet: A novel residual neural network for chest X-ray enhancement based COVID-19 detection. Biomed Signal Process doi: 10.1016/j.bspc.2021.103286.
# PMID: 34745319
newdoc <- doc2vec(model.gn, my.abst)
sim.ml.gn <- word2vec_similarity(emb.gn, newdoc)
names(sim.ml.gn) <- rownames(emb.gn)
sort(sim.ml.gn, decreasing = T)[1:10]

##                                                 Automatic Sequence-Based Network for Lung Diseases Detection in Chest CT. 
##                                                                                                                 0.9716943 
##                                  Multi-feature Multi-Scale CNN-Derived COVID-19 Classification from Lung Ultrasound Data. 
##                                                                                                                 0.9629323 
##                    CNN Filter Learning from Drawn Markers for the Detection of Suggestive Signs of COVID-19 in CT Images. 
##                                                                                                                 0.9598468 
##               Fusion of multi-scale bag of deep visual words features of chest X-ray images to detect COVID-19 infection. 
##                                                                                                                 0.9590419 
##                      MRFGRO: a hybrid meta-heuristic feature selection method for screening COVID-19 using deep features. 
##                                                                                                                 0.9582909 
##      Non-contact Measurement of Pulse Rate Variability Using a Webcam and Application to Mental Illness Screening System. 
##                                                                                                                 0.9559405 
##                Automated Detection of COVID-19 Cases using Recent Deep Convolutional Neural Networks and CT images<sup/>. 
##                                                                                                                 0.9553129 
##                    Stacking Ensemble-Based Intelligent Machine Learning Model for Predicting Post-COVID-19 Complications. 
##                                                                                                                 0.9549385 
## A novel unsupervised approach based on the hidden features of Deep Denoising Autoencoders for COVID-19 disease detection. 
##                                                                                                                 0.9547197 
##                 Automated COVID-19 detection from X-ray and CT images with stacked ensemble convolutional neural network. 
##                                                                                                                 0.9546166

Compare the similarities obtained in the vector space pretrained on Google News with those obtained in the vector space trained on PubMed abstracts of COVID-19-related papers.

plot(sim.ml, sim.ml.gn)

cor(sim.ml, sim.ml.gn, method = "pearson", use = "pairwise.complete.obs")

##           [,1]
## [1,] 0.9914014

cor(sim.ml, sim.ml.gn, method = "spearman", use = "pairwise.complete.obs")

##           [,1]
## [1,] 0.8211091

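Both correlations are higher for this example than for the previous one. This may be because the previous query concerned the new Omicron variant, which was absent from the Google News training data.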
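One way to probe this explanation is to list the query words that a pretrained model has never seen. The sketch below uses base R only; `oov.tokens` and the toy `vocab` are hypothetical, and the tokenization is simplified and may differ from what `txt_clean_word2vec` does. In practice the vocabulary would have to be taken from the pretrained model itself.

```r
# Hypothetical helper: query tokens that are absent from a model's vocabulary.
oov.tokens <- function(text, vocab) {
  # Lowercase, split on anything that is not a letter or digit, and drop empty tokens.
  tokens <- unique(strsplit(tolower(text), "[^a-z0-9]+")[[1]])
  tokens <- tokens[nzchar(tokens)]
  setdiff(tokens, tolower(vocab))
}

# Toy vocabulary standing in for a pretrained model's word list (invented).
vocab <- c("the", "variant", "of", "concern", "named", "virus")
absent <- oov.tokens("The variant of concern, named Omicron", vocab)
print(absent)
```

A query whose key terms all fall outside the vocabulary can only be matched through its remaining, more generic words.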

Check how well the top 30 documents agree between the two models.

# Flag the 30 most similar documents under each model (cutoff is 30 despite the "top100" names).
top100 <- rank(1 - sim.ml) <= 30
top100.gn <- rank(1 - sim.ml.gn) <= 30
table(top100, top100.gn)

##        top100.gn
## top100  FALSE TRUE
##   FALSE  4956   11
##   TRUE     11   19

Nineteen documents appear in the top 30 under both models. As expected, the agreement is higher than in the previous example.

The 19 shared documents are listed below.

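# NOTE: sim.om holds the similarities to the Omicron query.
# To show the similarities to the machine-learning query, index sim.ml instead.
sim.om[top100 & top100.gn]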

##               A novel unsupervised approach based on the hidden features of Deep Denoising Autoencoders for COVID-19 disease detection. 
##                                                                                                                               0.9624341 
## Unsupervised Anomaly Detection in Multivariate Spatio-Temporal Data Using Deep Learning: Early Detection of COVID-19 Outbreak in Italy. 
##                                                                                                                               0.9683324 
##                                                             COVID-19 Detection Through Transfer Learning Using Multimodal Imaging Data. 
##                                                                                                                               0.9566748 
##                                                               Automatic Sequence-Based Network for Lung Diseases Detection in Chest CT. 
##                                                                                                                               0.9623592 
##                     COVID-MTL: Multitask learning with Shift3D and random-weighted loss for COVID-19 diagnosis and severity assessment. 
##                                                                                                                               0.9549226 
##                                           Transfer learning based novel ensemble classifier for COVID-19 detection from chest CT-scans. 
##                                                                                                                               0.9739927 
##                                    MRFGRO: a hybrid meta-heuristic feature selection method for screening COVID-19 using deep features. 
##                                                                                                                               0.9700499 
##                                   Automatic detection of multiple types of pneumonia: Open dataset and a multi-scale attention network. 
##                                                                                                                               0.9539314 
##                               Automated COVID-19 detection from X-ray and CT images with stacked ensemble convolutional neural network. 
##                                                                                                                               0.9551770 
##                             Fusion of multi-scale bag of deep visual words features of chest X-ray images to detect COVID-19 infection. 
##                                                                                                                               0.9571036 
##                               Double paths network with residual information distillation for improving lung CT image super resolution. 
##                                                                                                                               0.9622886 
##                 DC-GAN-based synthetic X-ray images augmentation for increasing the performance of EfficientNet for COVID-19 detection. 
##                                                                                                                               0.9582230 
##                  Local binary pattern and deep learning feature extraction fusion for COVID-19 detection on computed tomography images. 
##                                                                                                                               0.9619632 
##         Assessing Lobe-wise Burden of COVID-19 Infection in Computed Tomography of Lungs using Knowledge Fusion from Multiple Datasets. 
##                                                                                                                               0.9660850 
##                Multi-class Generative Adversarial Networks: Improving One-class Classification of Pneumonia Using Limited Labeled Data. 
##                                                                                                                               0.9688569 
##                              Automated Detection of COVID-19 Cases using Recent Deep Convolutional Neural Networks and CT images<sup/>. 
##                                                                                                                               0.9683853 
##                                  CNN Filter Learning from Drawn Markers for the Detection of Suggestive Signs of COVID-19 in CT Images. 
##                                                                                                                               0.9614700 
##                                                Multi-feature Multi-Scale CNN-Derived COVID-19 Classification from Lung Ultrasound Data. 
##                                                                                                                               0.9663518 
##                                   C3D-UNET: A Comprehensive 3D Unet for Covid-19 Segmentation with Intact Encoding and Local Attention. 
##                                                                                                                               0.9654121

An example code of text mining

岩田洋佳 hiroiwata@g.ecc.u-tokyo.ac.jp

2021/12/25
