The following packages are required to run this program.
require(easyPubMed)
## Loading required package: easyPubMed
require(tm)
## Loading required package: tm
## Loading required package: NLP
require(udpipe)
## Loading required package: udpipe
require(wordcloud)
## Loading required package: wordcloud
## Loading required package: RColorBrewer
require(word2vec)
## Loading required package: word2vec
require(Rtsne)
## Loading required package: Rtsne
require(plotly)
## Loading required package: plotly
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
require(here)
## Loading required package: here
## here() starts at /Users/hiro/Documents/Rprojects/textMining
Retrieve the data from PubMed.
query <- "COVID-19"
ids <- get_pubmed_ids(query)
pmd.xml <- fetch_pubmed_data(ids, retmax = 10000)
pmd.list <- articles_to_list(pmd.xml)
length(pmd.list)
## [1] 4997
Build a data frame of titles and abstracts, and drop articles with missing values (those without an abstract).
titl <- rep(NA, length(pmd.list))
abst <- rep(NA, length(pmd.list))
for (i in seq_along(pmd.list)) {
  # Parse one article at a time; author details are not needed here
  df <- article_to_df(pmd.list[[i]], max_chars = -1, getAuthors = FALSE)
  titl[i] <- df$title
  abst[i] <- df$abstract
}
df <- data.frame(titl, abst)
df <- na.omit(df)
dim(df)
## [1] 4264 2
Examine how often each word appears in the titles.
First, check the original title data.
doc <- df$titl
doc[1]
## [1] "[Effort-Reward Imbalance, Ability to Work and the Desire for Career Exits: a Cross-sectional Study of Nurses]."
Convert everything to lowercase, then remove digits, brackets, and punctuation.
doc.cleaned <- stripWhitespace(
removePunctuation(
removeNumbers(tolower(doc))))
doc.cleaned[1]
## [1] "effortreward imbalance ability to work and the desire for career exits a crosssectional study of nurses"
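The cleaned title above shows that removing punctuation fuses hyphenated words ("effort-reward" becomes "effortreward"). The following is a minimal base R sketch of an alternative, assuming hyphens should act as word separators. `clean_title` is a hypothetical helper written for illustration and is not part of tm.

```r
# Hypothetical helper: same steps as above, but hyphens become spaces
# before the remaining punctuation is stripped.
clean_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[0-9]+", "", x)            # remove digits
  x <- gsub("-", " ", x, fixed = TRUE)  # treat hyphens as word separators
  x <- gsub("[[:punct:]]+", "", x)      # remove remaining punctuation
  x <- gsub("\\s+", " ", x)             # collapse repeated whitespace
  trimws(x)
}

clean_title("[Effort-Reward Imbalance: a Cross-sectional Study of 12 Nurses].")
```

The trade-off is that terms such as "sars-cov-2" are split into "sars" and "cov" rather than kept as one token, so which variant is preferable depends on the vocabulary of interest.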
Count term frequencies for each article.
dtf <- document_term_frequencies(doc.cleaned)
head(dtf, 10)
## doc_id term freq
## 1: doc1 effortreward 1
## 2: doc1 imbalance 1
## 3: doc1 ability 1
## 4: doc1 to 1
## 5: doc1 work 1
## 6: doc1 and 1
## 7: doc1 the 1
## 8: doc1 desire 1
## 9: doc1 for 1
## 10: doc1 career 1
Sum the frequencies over all articles.
res <- tapply(dtf$freq, dtf$term, sum)
sort(res, decreasing = T)[1:50]
## of the covid and in a
## 3484 2737 2726 2639 2347 1702
## pandemic for with during to sarscov
## 868 781 778 760 697 668
## on study patients health from an
## 664 473 444 350 339 286
## impact among care infection review disease
## 280 278 260 240 219 218
## vaccine case vaccination by analysis coronavirus
## 205 191 178 177 172 163
## clinical after as using at associated
## 157 149 149 136 129 121
## risk against mental systematic treatment severe
## 118 116 116 111 111 109
## factors healthcare social use between effects
## 108 108 105 105 104 101
## hospital report
## 99 99
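To make the aggregation step concrete, here is a small sketch on made-up data (the `toy` data frame is an assumption for illustration, shaped like the output of `document_term_frequencies`). `tapply` groups the per-document counts by term and sums them.

```r
# Toy per-document term counts (hypothetical data)
toy <- data.frame(
  doc_id = c("doc1", "doc1", "doc2", "doc2", "doc3"),
  term   = c("covid", "study", "covid", "vaccine", "covid"),
  freq   = c(1, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Sum freq within each term, then order from most to least frequent
total <- tapply(toy$freq, toy$term, sum)
sort(total, decreasing = TRUE)
```

Here "covid" receives 1 + 2 + 1 = 4, while "study" and "vaccine" each receive 1.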
Remove common words that carry little meaning (stop words). First, prepare the list of such words.
stp <- stopwords("en")
head(stp, 10)
## [1] "i" "me" "my" "myself" "we" "our"
## [7] "ours" "ourselves" "you" "your"
Drop every term that matches an entry in the list above.
selector <- !(dtf$term %in% stp)
dtf.sel <- dtf[selector, ]
Recount the words.
word.count <- tapply(dtf.sel$freq, dtf.sel$term, sum)
sort(word.count, decreasing = T)[1:50]
## covid pandemic sarscov study patients
## 2726 868 668 473 444
## health impact among care infection
## 350 280 278 260 240
## review disease vaccine case vaccination
## 219 218 205 191 178
## analysis coronavirus clinical using associated
## 172 163 157 136 121
## risk mental systematic treatment severe
## 118 116 111 111 109
## factors healthcare social use effects
## 108 108 105 105 101
## hospital report students syndrome first
## 99 99 99 98 96
## respiratory survey children response data
## 95 95 94 93 91
## workers acute crosssectional public detection
## 91 90 90 88 86
## learning lockdown outcomes patient cohort
## 86 85 85 85 80
Display the result as a word cloud, restricted to the 100 most frequent words.
word.top100 <- sort(word.count, decreasing = T)[1:100]
wordcloud(names(word.top100), freq = word.top100)
Next, run an analysis with Word2vec. The original Word2vec paper is available at https://arxiv.org/pdf/1301.3781.pdf.
The following blog post and paper are also good references: https://ruder.io/word-embeddings-1/ https://arxiv.org/pdf/1411.2738.pdf
First, prepare the data.
x <- txt_clean_word2vec(df$abst)
Next, learn the relationships between words with the word2vec function. The skip-gram algorithm is used here.
model <- word2vec(x, type = "skip-gram", dim = 30, window = 15, iter = 5)
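As a rough illustration of what skip-gram trains on, the sketch below generates the (center, context) pairs for one sentence. `skipgram_pairs` is a hypothetical helper, and this covers only pair generation under a fixed window; the actual training (which also involves steps such as sampling and the optimisation itself) is handled inside the word2vec package and is not reproduced here.

```r
# Hypothetical helper: list every (center, context) pair within a fixed window
skipgram_pairs <- function(tokens, window = 2) {
  n <- length(tokens)
  pairs <- lapply(seq_len(n), function(i) {
    # Positions within `window` of i, excluding i itself
    ctx <- setdiff(max(1, i - window):min(n, i + window), i)
    data.frame(center = rep(tokens[i], length(ctx)),
               context = tokens[ctx],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

tokens <- c("wearing", "a", "mask", "reduces", "transmission")
pairs <- skipgram_pairs(tokens, window = 1)
pairs
```

With `window = 1` the five tokens yield eight pairs; for example, "mask" is paired with "a" and "reduces". A larger window, such as the `window = 15` used above, pairs each word with much more of the surrounding abstract.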
Show the result. Each word is represented as a point in the 30-dimensional vector space.
head(as.matrix(model), 10)
## [,1] [,2] [,3] [,4] [,5]
## spread 0.4520214200 0.01982646 0.7402269 0.2336672 0.29515609
## 2250 -0.4601969123 -0.81091893 1.3559760 -0.4495805 0.89198577
## development -0.8400969505 1.37467599 -0.4643043 0.9992725 -0.29507962
## overweight 0.7383289933 -1.15229416 0.1796875 0.3445530 -0.21481802
## facilitators -1.0576896667 -2.06134129 1.0013345 1.1189053 -0.29168746
## offered 0.1299842596 -0.59942853 0.2731722 1.2644868 -0.09282218
## constant 0.0665775165 -0.88700992 2.5957649 1.5450702 1.52805603
## makes -0.3491486609 0.67862570 0.8007759 0.4107775 -0.95123988
## hrct 0.0007800715 1.71242642 1.3537987 -1.3064724 -0.93673891
## door 0.1165204793 -0.81286192 1.1735734 0.6422958 0.20757714
## [,6] [,7] [,8] [,9] [,10]
## spread 0.7151327 1.8754727 -0.13276638 0.33863384 0.16876695
## 2250 -0.8129580 -0.4815224 1.79528105 0.30786598 0.33757219
## development -0.1777177 0.8938470 1.45137095 0.21241586 0.53466266
## overweight -0.5310414 -1.2843819 0.34786931 -0.16680828 -0.29187000
## facilitators -1.3703411 -0.3458766 -0.60719353 0.53521961 0.24375539
## offered -0.7633473 -0.9698189 -0.07846674 2.01354384 -0.49253932
## constant -1.0559298 0.5479648 0.22827454 0.06985193 0.12351657
## makes 0.8303112 -0.3699438 -0.07734013 1.62063396 0.61739147
## hrct -0.2289768 0.5148864 -0.18482909 0.15089843 0.75733632
## door 1.3958154 0.9105923 -0.19801074 1.48099649 0.03333392
## [,11] [,12] [,13] [,14] [,15]
## spread 1.53732014 1.000580430 2.02346063 -1.9172499 -0.5497498
## 2250 -1.18838871 0.089142457 -0.46073514 0.2598677 -0.5783548
## development 1.16170585 0.573057592 -0.04535265 -0.6475110 -1.2943190
## overweight -0.04613147 -0.880035222 -0.99052644 -1.3121864 -1.7649405
## facilitators 0.94924349 0.013399127 -0.23672156 -0.7423967 0.1365642
## offered 1.22995186 0.745887995 -0.41149566 -0.2073045 -0.8300067
## constant -0.20016693 -0.008759694 -0.30882466 -0.7804393 -0.6129454
## makes 1.64125228 -0.208750471 1.50842083 -1.4000434 -2.0361733
## hrct -0.80800796 0.128040150 -2.05782747 -1.1445453 -1.1808213
## door -0.04776298 2.149433851 -0.36034513 0.0254088 -0.9793495
## [,16] [,17] [,18] [,19] [,20]
## spread 0.3599528 0.88801330 -0.1652039 -2.1712296 0.1274741
## 2250 1.7029228 -0.52262402 0.5174385 -1.3846997 -0.5156091
## development -0.6658509 1.42778051 0.5021302 -2.2183750 0.7130885
## overweight -0.8172048 0.38596201 0.1197450 -1.3696548 2.0808895
## facilitators -1.1144742 1.64357948 -0.2167569 -0.6922774 -0.4530377
## offered -0.7720231 0.05587474 -0.8646914 -2.2204714 -1.3066896
## constant 0.2980711 1.81263018 0.5249712 -1.2585499 -0.2972787
## makes 0.9320040 0.72332364 1.0177402 -2.0599470 -0.5641957
## hrct 1.4552363 0.81961507 0.4495192 -2.2947118 -0.8491314
## door -0.6278632 0.17910148 -0.5655844 -2.6831987 0.2821904
## [,21] [,22] [,23] [,24] [,25]
## spread -1.53338873 -1.14535475 -0.39852038 0.28479096 -0.6879856
## 2250 0.59850174 -1.55611455 -0.04851867 1.57306707 0.5445135
## development -1.67078209 -0.04150536 0.26096937 -0.95307463 -1.7819059
## overweight -2.05460143 0.40159649 -0.52835780 2.02463603 0.4342352
## facilitators 0.15430093 -1.20883024 -1.33872032 0.06882582 -0.1982248
## offered -1.28653717 -1.14365685 -0.02885747 -0.37034440 -0.1310008
## constant 0.98075253 -0.91148055 -1.43347883 -0.09543622 -1.4255906
## makes -0.64314681 -0.25234625 0.86694264 -0.30478802 0.2575774
## hrct 0.01785801 -0.37175333 0.82698339 0.64089406 0.8169389
## door 0.14667721 -1.50685155 0.37082866 -1.38918281 -0.6206015
## [,26] [,27] [,28] [,29] [,30]
## spread -1.3818840 -0.67085230 0.62786621 0.3635623 -0.73535007
## 2250 -1.6989353 -2.17069674 -0.50919485 -0.2794248 -0.90682179
## development -1.4372041 0.27258244 -0.74469632 -1.0989015 0.22390436
## overweight -0.2818217 1.07850897 -1.10142291 -0.6930379 -0.49691492
## facilitators -2.3339920 0.07341208 -1.94650507 -0.8710522 0.23805828
## offered -1.8110186 -0.92995977 -1.71545672 -0.9417403 -0.25999641
## constant -1.7988596 -0.20455910 -0.48102352 -0.1841014 0.62163550
## makes -1.5269244 -0.08891134 0.88788396 1.1059235 0.12578748
## hrct -0.6283227 -0.23575822 -0.05935121 1.6600286 -0.07598664
## door -1.8516753 0.49490950 0.51699054 -0.2403716 -0.03158898
List the 30 words most similar to "mask".
nn <- predict(model, c("mask"), type = "nearest", top_n = 30)
nn
## $mask
## term1 term2 similarity rank
## 1 mask wearing 0.9583051 1
## 2 mask masks 0.9442208 2
## 3 mask ffp2 0.9281084 3
## 4 mask cloth 0.9160715 4
## 5 mask wear 0.9016488 5
## 6 mask n95 0.8987269 6
## 7 mask washing 0.8900403 7
## 8 mask hands 0.8854057 8
## 9 mask placing 0.8766692 9
## 10 mask facemask 0.8765922 10
## 11 mask exhaled 0.8685558 11
## 12 mask disposable 0.8632147 12
## 13 mask sanitizer 0.8589706 13
## 14 mask aerosols 0.8541734 14
## 15 mask correct 0.8536401 15
## 16 mask face 0.8533462 16
## 17 mask gloves 0.8531128 17
## 18 mask improper 0.8529463 18
## 19 mask facemasks 0.8470936 19
## 20 mask respirator 0.8415533 20
## 21 mask procedure 0.8408191 21
## 22 mask droplets 0.8402680 22
## 23 mask distancing 0.8360549 23
## 24 mask hand 0.8336802 24
## 25 mask cotton 0.8298696 25
## 26 mask usability 0.8296772 26
## 27 mask avoided 0.8275607 27
## 28 mask shield 0.8257001 28
## 29 mask exercising 0.8256449 29
## 30 mask aerosol 0.8239759 30
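The idea behind a nearest-neighbour lookup can be sketched in base R as follows. This is an illustration only: it assumes cosine similarity and uses a made-up three-word embedding matrix (`emb_toy`), whereas the similarity measure and scaling used inside the word2vec package may differ. `cosine_sim` and `nearest` are hypothetical helpers.

```r
# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy embedding matrix, one row per word (hypothetical values)
emb_toy <- rbind(
  mask    = c(1, 0, 1),
  masks   = c(2, 0, 2),
  vaccine = c(0, 1, 0)
)

# Rank all other words by similarity to `term`
nearest <- function(emb, term, top_n = 2) {
  sims <- apply(emb, 1, cosine_sim, b = emb[term, ])
  sims <- sims[names(sims) != term]  # drop the query word itself
  head(sort(sims, decreasing = TRUE), top_n)
}

nn_toy <- nearest(emb_toy, "mask")
nn_toy
```

In this toy space "masks" points in the same direction as "mask" (similarity 1), while "vaccine" is orthogonal to it (similarity 0).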
List the 30 words most similar to "omicron".
nn <- predict(model, c("omicron"), type = "nearest", top_n = 30)
nn
## $omicron
## term1 term2 similarity rank
## 1 omicron voc 0.9592404 1
## 2 omicron variant 0.9587039 2
## 3 omicron variants 0.9559210 3
## 4 omicron escape 0.9422399 4
## 5 omicron mutations 0.9386650 5
## 6 omicron mutation 0.9347604 6
## 7 omicron vois 0.9328879 7
## 8 omicron vocs 0.9326198 8
## 9 omicron d614g 0.9306197 9
## 10 omicron mutated 0.9291540 10
## 11 omicron n501y 0.9195092 11
## 12 omicron delta 0.9167343 12
## 13 omicron n440k 0.9001133 13
## 14 omicron outcompeted 0.8999096 14
## 15 omicron antigenic 0.8998986 15
## 16 omicron beta 0.8998317 16
## 17 omicron vnars 0.8967146 17
## 18 omicron transmissibility 0.8962524 18
## 19 omicron 529 0.8950707 19
## 20 omicron neutralizing 0.8916616 20
## 21 omicron gamma 0.8915190 21
## 22 omicron lineages 0.8913069 22
## 23 omicron alpha 0.8907861 23
## 24 omicron mu 0.8904122 24
## 25 omicron neutralize 0.8892992 25
## 26 omicron 229e 0.8880194 26
## 27 omicron r346k 0.8876979 27
## 28 omicron spike 0.8871691 28
## 29 omicron concern 0.8864110 29
## 30 omicron infectivity 0.8857899 30
List the 30 words most similar to "machine".
nn <- predict(model, c("machine"), type = "nearest", top_n = 30)
nn
## $machine
## term1 term2 similarity rank
## 1 machine algorithm 0.9279332 1
## 2 machine prediction 0.9087546 2
## 3 machine machines 0.9044300 3
## 4 machine forecast 0.9042990 4
## 5 machine architectures 0.9037781 5
## 6 machine proposed 0.8938621 6
## 7 machine built 0.8931577 7
## 8 machine classification 0.8925775 8
## 9 machine svm 0.8917192 9
## 10 machine generalizability 0.8906193 10
## 11 machine denoising 0.8904656 11
## 12 machine algorithms 0.8892074 12
## 13 machine automatically 0.8891503 13
## 14 machine unsupervised 0.8885775 14
## 15 machine model 0.8881191 15
## 16 machine histogram 0.8880148 16
## 17 machine efficientnet 0.8875355 17
## 18 machine convolutional 0.8865160 18
## 19 machine models 0.8860496 19
## 20 machine construct 0.8847618 20
## 21 machine adversarial 0.8824225 21
## 22 machine computing 0.8806093 22
## 23 machine vgg19 0.8804816 23
## 24 machine gaussian 0.8798761 24
## 25 machine classifier 0.8798543 25
## 26 machine deep 0.8795468 26
## 27 machine forecasting 0.8783342 27
## 28 machine supervised 0.8758425 28
## 29 machine stacked 0.8756320 29
## 30 machine expectation 0.8751626 30
Next, for the words used most often in the titles (the 100 most frequent), visualize their relative positions in the vector space with t-SNE.
selector <- names(word.top100) %in% rownames(as.matrix(model))
dm <- as.matrix(model)[names(word.top100)[selector],]
word <- rownames(dm)
tsne.dm <- Rtsne(dm)
Plot the result.
df <- data.frame(tsne.dm$Y, word)
plot_ly(df, x = ~X1, y = ~ X2, type = "scatter", mode = "text", text = word)
Next, embed the documents in the 30-dimensional space, based on the word relationships learned above.
x <- data.frame(doc_id = titl, text = abst)
emb <- doc2vec(model, x, type = "embedding")
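One simple way to think about a document embedding is as a combination of the vectors of the words in the document. The sketch below uses a plain average over a made-up two-dimensional embedding (`emb_toy`); this is a simplification for illustration, and the exact aggregation and scaling performed by `doc2vec` may differ. `embed_doc` is a hypothetical helper.

```r
# Toy word embeddings (hypothetical values)
emb_toy <- rbind(
  omicron = c(1.0, 0.0),
  variant = c(0.8, 0.2),
  mask    = c(0.0, 1.0)
)

# Average the vectors of the words that are in the vocabulary
embed_doc <- function(text, emb) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  known <- tokens[tokens %in% rownames(emb)]
  if (length(known) == 0) {
    return(rep(NA_real_, ncol(emb)))  # nothing to embed
  }
  colMeans(emb[known, , drop = FALSE])
}

embed_doc("Omicron variant", emb_toy)
embed_doc("unseen words only", emb_toy)
```

The first call gives the midpoint of the "omicron" and "variant" vectors. The second returns missing values because no word is in the vocabulary, which is one way a document without usable text can end up without an embedding.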
Use a description of omicron (from the WHO website) as the query and retrieve the documents most similar to it. https://www.who.int/news/item/28-11-2021-update-on-omicron
q <- txt_clean_word2vec("On 26 November 2021, WHO designated the variant B.1.1.529 a variant of concern, named Omicron, on the advice of WHO’s Technical Advisory Group on Virus Evolution (TAG-VE). This decision was based on the evidence presented to the TAG-VE that Omicron has several mutations that may have an impact on how it behaves, for example, on how easily it spreads or the severity of illness it causes. Here is a summary of what is currently known. ")
# from https://www.who.int/news/item/28-11-2021-update-on-omicron
newdoc <- doc2vec(model, q)
sim.om <- word2vec_similarity(emb, newdoc)
names(sim.om) <- rownames(emb)
sort(sim.om, decreasing = T)[1:10]
## The omicron variant of SARS-CoV-2: Understanding the known and living with unknowns.
## 0.9916369
## The unresolved question on COVID-19 virus origin: The three cards game?
## 0.9907525
## The challenges of COVID-19 Delta variant: Prevention and vaccine development.
## 0.9906693
## Sequence analysis of the Emerging Sars-CoV-2 Variant Omicron in South Africa.
## 0.9902452
## PCR performance in the SARS-CoV-2 Omicron variant of concern?
## 0.9901107
## Detection of Omicron (B.1.1.529) variant has created panic among the people across the world: What should we do right now?
## 0.9900354
## The Electrostatic Potential of the Omicron Variant Spike is Higher than in Delta and Delta-plus Variants: a Hint to Higher Transmissibility?
## 0.9895625
## Characterization of the novel SARS-CoV-2 Omicron (B.1.1.529) Variant of Concern and its global perspective.
## 0.9891242
## OMICRON (B.1.1.529): A new SARS-CoV-2 Variant of Concern mounting worldwide fear.
## 0.9886011
## The Development of SARS-CoV-2 Variants: The Gene Makes the Disease.
## 0.9867635
Use the abstract of a COVID-19 neural network paper as the query and retrieve similar documents. https://pubmed.ncbi.nlm.nih.gov/34745319/
my.abst <- txt_clean_word2vec("Recently, people around the world are being vulnerable to the pandemic effect of
the novel Corona Virus. It is very difficult to detect the virus infected chest
X-ray (CXR) image during early stages due to constant gene mutation of the
virus. It is also strenuous to differentiate between the usual pneumonia from
the COVID-19 positive case as both show similar symptoms. This paper proposes a
modified residual network based enhancement (ENResNet) scheme for the visual
clarification of COVID-19 pneumonia impairment from CXR images and
classification of COVID-19 under deep learning framework. Firstly, the residual
image has been generated using residual convolutional neural network through
batch normalization corresponding to each image. Secondly, a module has been
constructed through normalized map using patches and residual images as input.
The output consisting of residual images and patches of each module are fed into
the next module and this goes on for consecutive eight modules. A feature map is
generated from each module and the final enhanced CXR is produced via
up-sampling process. Further, we have designed a simple CNN model for automatic
detection of COVID-19 from CXR images in the light of 'multi-term loss' function
and 'softmax' classifier in optimal way. The proposed model exhibits better
result in the diagnosis of binary classification (COVID vs. Normal) and
multi-class classification (COVID vs. Pneumonia vs. Normal) in this study. The
suggested ENResNet achieves a classification accuracy 99.7% and 98.4% for binary
classification and multi-class detection respectively in comparison with
state-of-the-art methods.")
# Ghosh and Ghosh (2022) ENResNet: A novel residual neural network for chest X-ray enhancement based COVID-19 detection. Biomed Signal Process doi: 10.1016/j.bspc.2021.103286.
# PMID: 34745319
newdoc <- doc2vec(model, my.abst)
sim.ml <- word2vec_similarity(emb, newdoc)
names(sim.ml) <- rownames(emb)
sort(sim.ml, decreasing = T)[1:10]
## Automatic Sequence-Based Network for Lung Diseases Detection in Chest CT.
## 0.9975762
## DC-GAN-based synthetic X-ray images augmentation for increasing the performance of EfficientNet for COVID-19 detection.
## 0.9957998
## CNN Filter Learning from Drawn Markers for the Detection of Suggestive Signs of COVID-19 in CT Images.
## 0.9957918
## Fusion of multi-scale bag of deep visual words features of chest X-ray images to detect COVID-19 infection.
## 0.9953493
## Automated COVID-19 detection from X-ray and CT images with stacked ensemble convolutional neural network.
## 0.9953292
## C3D-UNET: A Comprehensive 3D Unet for Covid-19 Segmentation with Intact Encoding and Local Attention.
## 0.9947068
## Quadruple Augmented Pyramid Network for Multi-class COVID-19 Segmentation via CT.
## 0.9939834
## COVID-19 Volumetric Pulmonary Lesion Estimation on CT Images using a U-NET and Probabilistic Active Contour Segmentation.
## 0.9939392
## Feature extraction with capsule network for the COVID-19 disease prediction though X-ray images.
## 0.9938018
## One Shot Model For The Prediction of COVID-19 And Lesions Segmentation In Chest CT Scans Through The Affinity Among Lesion Mask Features.
## 0.9937366
Next, use a model pre-trained on Google News. The model can be downloaded from the link "The archive is available here: GoogleNews-vectors-negative300.bin.gz." under "Pre-trained word and phrase vectors" on the following page. https://code.google.com/archive/p/word2vec/
model.gn <- read.word2vec(here("data", "GoogleNews-vectors-negative300.bin"))
model.gn
## $model
## <pointer: 0x7fb15ea64a10>
##
## $model_path
## [1] "/Users/hiro/Documents/Rprojects/textMining/data/GoogleNews-vectors-negative300.bin"
##
## $dim
## [1] 300
##
## $vocabulary
## [1] 3e+06
##
## attr(,"class")
## [1] "word2vec"
\(3\times10^6\) words are each represented as a 300-dimensional vector.
nn <- predict(model.gn, c("mask"), type = "nearest", top_n = 30)
nn
## $mask
## term1 term2 similarity rank
## 1 mask mask 0.1752332 1
## 2 mask masks 0.1589547 2
## 3 mask surgical_mask 0.1574033 3
## 4 mask protective_mask 0.1536028 4
## 5 mask SSL_VPN_Proxy 0.1509127 5
## 6 mask reader_interaction_discussion 0.1493519 6
## 7 mask Spelling_follows_North 0.1480598 7
## 8 mask filters_encrypt 0.1458109 8
## 9 mask bomb_shaped_turban 0.1435986 9
## 10 mask ski_mask 0.1415209 10
## 11 mask silicone_mask 0.1412422 11
## 12 mask facemask 0.1408077 12
## 13 mask surgical_masks 0.1402936 13
## 14 mask sports_magazine_SZ 0.1388565 14
## 15 mask oxygen_mask 0.1380737 15
## 16 mask strengths_weakness 0.1363707 16
## 17 mask blue_bandana 0.1350546 17
## 18 mask eyewitness_accounts_background 0.1348199 18
## 19 mask cloak 0.1343265 19
## 20 mask goggles 0.1343031 20
## 21 mask invisibility_cloak 0.1340850 21
## 22 mask balaclava 0.1338594 22
## 23 mask disguise 0.1338415 23
## 24 mask Important_locations 0.1337272 24
## 25 mask Enterprise_Terror 0.1333233 25
## 26 mask veil 0.1330992 26
## 27 mask Dow_Jones_Reprints 0.1323623 27
## 28 mask prosthetic_nose 0.1323556 28
## 29 mask spacesuit 0.1322882 29
## 30 mask Remember_comment_moderation 0.1322630 30
As expected, the similar words retrieved here differ from those learned from the abstracts of COVID-19 papers.
nn <- predict(model.gn, c("machine"), type = "nearest", top_n = 30)
nn
## $machine
## term1 term2 similarity rank
## 1 machine machine 0.1520944 1
## 2 machine Snow_WPS 0.1428445 2
## 3 machine Custom_manufacturer 0.1417987 3
## 4 machine machines 0.1411512 4
## 5 machine lathes 0.1299090 5
## 6 machine lever_machines 0.1291039 6
## 7 machine hedged_play 0.1255907 7
## 8 machine voting_machines 0.1244253 8
## 9 machine servo 0.1229710 9
## 10 machine optical_scanner 0.1227231 10
## 11 machine automated_spam 0.1223796 11
## 12 machine Simple_Moving_Average 0.1219988 12
## 13 machine Sunspot_forum 0.1200565 13
## 14 machine CNC_machines 0.1200184 14
## 15 machine workpiece 0.1197284 15
## 16 machine coater 0.1196761 16
## 17 machine Dow_Jones_Reprints 0.1195178 17
## 18 machine tabulators 0.1193935 18
## 19 machine visit_eFinancialNews.com 0.1192743 19
## 20 machine thermal_printer 0.1191476 20
## 21 machine lathe 0.1191054 21
## 22 machine tabulator 0.1190799 22
## 23 machine sorter 0.1189296 23
## 24 machine optical_scanners 0.1189102 24
## 25 machine strong_bullish_bias 0.1188440 25
## 26 machine printing_presses 0.1183960 26
## 27 machine redesigned_Internet_Explorer 0.1181469 27
## 28 machine centrifuges 0.1180813 28
## 29 machine workpieces 0.1179092 29
## 30 machine entirety_via_email 0.1177564 30
Next, for the words used most often in the titles (the 100 most frequent), compute their relative positions in the vector space of the model trained on Google News.
selector <- names(word.top100) %in% rownames(as.matrix(model.gn))
dm <- as.matrix(model.gn)[names(word.top100)[selector],]
word <- rownames(dm)
tsne.dm <- Rtsne(dm)
Plot the result.
df <- data.frame(tsne.dm$Y, word)
plot_ly(df, x = ~X1, y = ~ X2, type = "scatter", mode = "text", text = word)
Embed the documents with the Google News model so that similarities between documents can be computed.
x <- data.frame(doc_id = titl, text = abst)
emb.gn <- doc2vec(model.gn, x, type = "embedding")
Use a description of omicron (from the WHO website) as the query and retrieve the documents most similar to it. https://www.who.int/news/item/28-11-2021-update-on-omicron
q <- txt_clean_word2vec("On 26 November 2021, WHO designated the variant B.1.1.529 a variant of concern, named Omicron, on the advice of WHO’s Technical Advisory Group on Virus Evolution (TAG-VE). This decision was based on the evidence presented to the TAG-VE that Omicron has several mutations that may have an impact on how it behaves, for example, on how easily it spreads or the severity of illness it causes. Here is a summary of what is currently known. ")
# from https://www.who.int/news/item/28-11-2021-update-on-omicron
newdoc <- doc2vec(model.gn, q)
sim.om.gn <- word2vec_similarity(emb.gn, newdoc)
names(sim.om.gn) <- rownames(emb.gn)
sort(sim.om.gn, decreasing = T)[1:10]
## Estimating a time-to-event distribution from right-truncated data in an epidemic: A review of methods.
## 0.9329065
## The Electrostatic Potential of the Omicron Variant Spike is Higher than in Delta and Delta-plus Variants: a Hint to Higher Transmissibility?
## 0.9325275
## Test sensitivity for infection versus infectiousness of SARS-CoV-2.
## 0.9310710
## The Development of SARS-CoV-2 Variants: The Gene Makes the Disease.
## 0.9308123
## Can COVID 19 cause atypical forms of Pityriasis Rosea refractory to conventional therapies?
## 0.9293905
## Waves and variants of SARS-CoV-2: understanding the causes and effect of the COVID-19 catastrophe.
## 0.9290826
## Transgenic Model Systems Have Revolutionized the Study of Disease.
## 0.9284527
## The unresolved question on COVID-19 virus origin: The three cards game?
## 0.9284171
## Time series analysis and predicting COVID-19 affected patients by ARIMA model using machine learning.
## 0.9281802
## A network SIRX model for the spreading of COVID-19.
## 0.9275452
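Compare the similarities obtained in the vector space pre-trained on Google News with those obtained in the vector space learned from the abstracts of COVID-19 papers in PubMed.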
plot(sim.om, sim.om.gn)
cor(sim.om, sim.om.gn, method = "pearson", use = "pairwise.complete.obs")
## [,1]
## [1,] 0.987598
cor(sim.om, sim.om.gn, method = "spearman", use = "pairwise.complete.obs")
## [,1]
## [1,] 0.7956069
The Pearson correlation is high (0.99), but the rank correlation (Spearman, 0.80) is not especially high.
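A gap between the two coefficients can arise when a few extreme values dominate the linear relationship while the ordering of the remaining points disagrees. The sketch below shows this on made-up numbers; it illustrates the mechanism only and makes no claim about what drives the values above.

```r
# Six nearby points whose order is locally swapped, plus one extreme point
x <- c(1, 2, 3, 4, 5, 6, 100)
y <- c(2, 1, 4, 3, 6, 5, 100)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
c(pearson = pearson, spearman = spearman)
```

The single extreme point pulls the Pearson coefficient very close to 1, whereas the Spearman coefficient, which depends only on ranks, is about 0.89 because three neighbouring pairs are out of order.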
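Check how well the top 30 articles agree between the two models. (Despite their names, the variables top100 and top100.gn flag the top 30 articles.)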
top100 <- rank(1 - sim.om) <= 30
top100.gn <- rank(1 - sim.om.gn) <= 30
table(top100, top100.gn)
## top100.gn
## top100 FALSE TRUE
## FALSE 4943 24
## TRUE 24 6
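The same kind of top-k overlap check can be written with explicit name sets, as sketched below on made-up scores (`sim_a`, `sim_b`, and `top_k` are hypothetical). Sorting and taking the first k names always returns exactly k items, whereas a `rank(...) <= k` filter can select more or fewer than k when scores are tied.

```r
# Toy similarity scores from two models for the same six documents
sim_a <- c(d1 = 0.90, d2 = 0.80, d3 = 0.70, d4 = 0.60, d5 = 0.50, d6 = 0.40)
sim_b <- c(d1 = 0.50, d2 = 0.95, d3 = 0.90, d4 = 0.30, d5 = 0.85, d6 = 0.20)

# Names of the k highest-scoring documents
top_k <- function(sim, k) names(sort(sim, decreasing = TRUE))[seq_len(k)]

shared  <- intersect(top_k(sim_a, 3), top_k(sim_b, 3))
jaccard <- length(shared) / length(union(top_k(sim_a, 3), top_k(sim_b, 3)))
shared
jaccard
```

In this toy case two of the three top documents are shared (d2 and d3), giving a Jaccard index of 0.5.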
Six articles appear in both top-30 lists.
These six shared articles are listed below.
sim.om[top100 & top100.gn]
## The Development of SARS-CoV-2 Variants: The Gene Makes the Disease.
## 0.9867635
## Omicron: call for updated vaccines.
## 0.9863397
## The Electrostatic Potential of the Omicron Variant Spike is Higher than in Delta and Delta-plus Variants: a Hint to Higher Transmissibility?
## 0.9895625
## Waves and variants of SARS-CoV-2: understanding the causes and effect of the COVID-19 catastrophe.
## 0.9863994
## The unresolved question on COVID-19 virus origin: The three cards game?
## 0.9907525
## Estimating the transmission advantage of the D614G mutant strain of SARS-CoV-2, December 2019 to June 2020.
## 0.9848339
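Even though the Google News model was pre-trained without any information on the omicron variant of COVID-19, it still picks up several omicron-related articles.
Run the same analysis for the machine learning example.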
my.abst <- txt_clean_word2vec("Recently, people around the world are being vulnerable to the pandemic effect of
the novel Corona Virus. It is very difficult to detect the virus infected chest
X-ray (CXR) image during early stages due to constant gene mutation of the
virus. It is also strenuous to differentiate between the usual pneumonia from
the COVID-19 positive case as both show similar symptoms. This paper proposes a
modified residual network based enhancement (ENResNet) scheme for the visual
clarification of COVID-19 pneumonia impairment from CXR images and
classification of COVID-19 under deep learning framework. Firstly, the residual
image has been generated using residual convolutional neural network through
batch normalization corresponding to each image. Secondly, a module has been
constructed through normalized map using patches and residual images as input.
The output consisting of residual images and patches of each module are fed into
the next module and this goes on for consecutive eight modules. A feature map is
generated from each module and the final enhanced CXR is produced via
up-sampling process. Further, we have designed a simple CNN model for automatic
detection of COVID-19 from CXR images in the light of 'multi-term loss' function
and 'softmax' classifier in optimal way. The proposed model exhibits better
result in the diagnosis of binary classification (COVID vs. Normal) and
multi-class classification (COVID vs. Pneumonia vs. Normal) in this study. The
suggested ENResNet achieves a classification accuracy 99.7% and 98.4% for binary
classification and multi-class detection respectively in comparison with
state-of-the-art methods.")
# Ghosh and Ghosh (2022) ENResNet: A novel residual neural network for chest X-ray enhancement based COVID-19 detection. Biomed Signal Process doi: 10.1016/j.bspc.2021.103286.
# PMID: 34745319
newdoc <- doc2vec(model.gn, my.abst)
sim.ml.gn <- word2vec_similarity(emb.gn, newdoc)
names(sim.ml.gn) <- rownames(emb.gn)
sort(sim.ml.gn, decreasing = T)[1:10]
## Automatic Sequence-Based Network for Lung Diseases Detection in Chest CT.
## 0.9716943
## Multi-feature Multi-Scale CNN-Derived COVID-19 Classification from Lung Ultrasound Data.
## 0.9629323
## CNN Filter Learning from Drawn Markers for the Detection of Suggestive Signs of COVID-19 in CT Images.
## 0.9598468
## Fusion of multi-scale bag of deep visual words features of chest X-ray images to detect COVID-19 infection.
## 0.9590419
## MRFGRO: a hybrid meta-heuristic feature selection method for screening COVID-19 using deep features.
## 0.9582909
## Non-contact Measurement of Pulse Rate Variability Using a Webcam and Application to Mental Illness Screening System.
## 0.9559405
## Automated Detection of COVID-19 Cases using Recent Deep Convolutional Neural Networks and CT images<sup/>.
## 0.9553129
## Stacking Ensemble-Based Intelligent Machine Learning Model for Predicting Post-COVID-19 Complications.
## 0.9549385
## A novel unsupervised approach based on the hidden features of Deep Denoising Autoencoders for COVID-19 disease detection.
## 0.9547197
## Automated COVID-19 detection from X-ray and CT images with stacked ensemble convolutional neural network.
## 0.9546166
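Compare the similarities obtained in the vector space pre-trained on Google News with those obtained in the vector space learned from the abstracts of COVID-19 papers in PubMed.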
plot(sim.ml, sim.ml.gn)
cor(sim.ml, sim.ml.gn, method = "pearson", use = "pairwise.complete.obs")
## [,1]
## [1,] 0.9914014
cor(sim.ml, sim.ml.gn, method = "spearman", use = "pairwise.complete.obs")
## [,1]
## [1,] 0.8211091
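The correlation is higher in this example than in the previous one. This may be because the previous query concerned the new omicron variant, information that was not in the Google News training data.
Check how well the top 30 articles agree.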
top100 <- rank(1 - sim.ml) <= 30
top100.gn <- rank(1 - sim.ml.gn) <= 30
table(top100, top100.gn)
## top100.gn
## top100 FALSE TRUE
## FALSE 4956 11
## TRUE 11 19
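Nineteen articles appear in both top-30 lists. As expected, the agreement is higher than in the previous example.
The 19 shared articles are listed below. Note that the code indexes sim.om rather than sim.ml, so the values printed are similarities to the omicron query; only the titles are relevant here, and sim.ml[top100 & top100.gn] would show the scores for the machine learning query.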
sim.om[top100 & top100.gn]
## A novel unsupervised approach based on the hidden features of Deep Denoising Autoencoders for COVID-19 disease detection.
## 0.9624341
## Unsupervised Anomaly Detection in Multivariate Spatio-Temporal Data Using Deep Learning: Early Detection of COVID-19 Outbreak in Italy.
## 0.9683324
## COVID-19 Detection Through Transfer Learning Using Multimodal Imaging Data.
## 0.9566748
## Automatic Sequence-Based Network for Lung Diseases Detection in Chest CT.
## 0.9623592
## COVID-MTL: Multitask learning with Shift3D and random-weighted loss for COVID-19 diagnosis and severity assessment.
## 0.9549226
## Transfer learning based novel ensemble classifier for COVID-19 detection from chest CT-scans.
## 0.9739927
## MRFGRO: a hybrid meta-heuristic feature selection method for screening COVID-19 using deep features.
## 0.9700499
## Automatic detection of multiple types of pneumonia: Open dataset and a multi-scale attention network.
## 0.9539314
## Automated COVID-19 detection from X-ray and CT images with stacked ensemble convolutional neural network.
## 0.9551770
## Fusion of multi-scale bag of deep visual words features of chest X-ray images to detect COVID-19 infection.
## 0.9571036
## Double paths network with residual information distillation for improving lung CT image super resolution.
## 0.9622886
## DC-GAN-based synthetic X-ray images augmentation for increasing the performance of EfficientNet for COVID-19 detection.
## 0.9582230
## Local binary pattern and deep learning feature extraction fusion for COVID-19 detection on computed tomography images.
## 0.9619632
## Assessing Lobe-wise Burden of COVID-19 Infection in Computed Tomography of Lungs using Knowledge Fusion from Multiple Datasets.
## 0.9660850
## Multi-class Generative Adversarial Networks: Improving One-class Classification of Pneumonia Using Limited Labeled Data.
## 0.9688569
## Automated Detection of COVID-19 Cases using Recent Deep Convolutional Neural Networks and CT images<sup/>.
## 0.9683853
## CNN Filter Learning from Drawn Markers for the Detection of Suggestive Signs of COVID-19 in CT Images.
## 0.9614700
## Multi-feature Multi-Scale CNN-Derived COVID-19 Classification from Lung Ultrasound Data.
## 0.9663518
## C3D-UNET: A Comprehensive 3D Unet for Covid-19 Segmentation with Intact Encoding and Local Attention.
## 0.9654121