LaCCAN, Brasil | Federal University of Alagoas
The project associated with this document can be found on GitHub at https://github.com/SensorNet-UFAL/slrqqanalysis. We accept pull requests, so please feel free to send us your recommendations. Everything was tested with the following configuration:
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=pt_BR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=pt_BR.UTF-8 LC_COLLATE=pt_BR.UTF-8
## [5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=pt_BR.UTF-8
## [7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.4 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2
## [5] tools_3.4.4 htmltools_0.3.6 yaml_2.1.19 Rcpp_0.12.17
## [9] stringi_1.2.3 rmarkdown_1.10 knitr_1.20 stringr_1.3.1
## [13] digest_0.6.15 evaluate_0.11
There are many ways to read BibTeX data in R. We list the different methods we have found, and propose new ones.
bibliometrix is the library used in this snippet, and savedrecs.bib is the extracted BibTeX file.
library(bibliometrix)
D <- readFiles("savedrecs.bib")
M <- convert2df(D, dbsource = "isi", format = "bibtex")
##
## Converting your isi collection into a bibliographic dataframe
##
## Articles extracted 100
## Articles extracted 200
## Articles extracted 300
## Articles extracted 358
## Done!
##
##
## Generating affiliation field tag AU_UN from C1: Done!
# One can explore all the field tags imported from the bib file
M
This section is based on https://cran.r-project.org/web/packages/bibliometrix/vignettes/bibliometrix-vignette.html, where more details on exploring the bibliometrix library can be found. The field tags are described at http://www.bibliometrix.org/documents/Field_Tags_bibliometrix.pdf.
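Several field tags pack more than one value into a single string, separated by ";" (the sep argument used throughout this document). The following is a minimal base R sketch of unpacking such a field. The data frame toy and its contents are hypothetical; it only mimics the AU, TI and DE column names and is not the real M.

```r
# Toy data frame that mimics three bibliometrix field tags (hypothetical data).
toy <- data.frame(
  AU = c("SILVA A;COSTA B", "COSTA B"),
  TI = c("PAPER ONE", "PAPER TWO"),
  DE = c("SENSOR NETWORKS;MIDDLEWARE", "MIDDLEWARE;PROGRAMMING MODELS"),
  stringsAsFactors = FALSE
)

# List the field tags (column names) available in the data frame.
field_tags <- names(toy)

# Split the author keywords (DE) of every record on ";" and trim whitespace.
keywords <- lapply(strsplit(toy$DE, ";", fixed = TRUE), trimws)

# Count how often each keyword occurs over the whole collection.
keyword_counts <- sort(table(unlist(keywords)), decreasing = TRUE)

print(field_tags)
print(keyword_counts)
```

On the real data, names(M) plays the role of names(toy) here.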
To read .pdf files, one can take the following steps:
Applying topic modeling to published articles provides a way to quantify the content of the study. The following snippets walk you through these findings.
tm is the library that we'll explore; all files are in the folder pdfs.
library(tm)
# list the pdf files found in the folder pdfs
# Warning: an error is raised if there is a number in the pdf file
files <- list.files("pdfs/",pattern = "pdf$")
# define a PDF reader function called rpdf; "-layout" asks the extraction engine to keep the page layout
rpdf <- readPDF(control = list(text = "-layout"))
setwd("pdfs/")
# the resulting my.corpus can be explored using $ or [[ ]]
# inspect it and try to extract some information
my.corpus <- Corpus(URISource(files), readerControl = list(reader = rpdf))
my.pdfdtm <- DocumentTermMatrix(my.corpus, control=list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE, stemming = FALSE, removeNumbers = TRUE))
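To make the result of the DocumentTermMatrix call above more concrete, here is a minimal base R sketch of what a document-term matrix contains. It is an illustration under stated assumptions, not how tm builds the matrix: the two documents are hypothetical, the helper tokenize is our own, and stopword removal is left out.

```r
# Two toy "documents" (hypothetical text standing in for the PDF contents).
docs <- c(doc1 = "Sensor networks need middleware.",
          doc2 = "Middleware hides sensor hardware, sensor drivers.")

# Lower-case, strip punctuation and digits, then split on whitespace,
# loosely mirroring the tolower/removePunctuation/removeNumbers options.
tokenize <- function(text) {
  clean <- gsub("[[:punct:][:digit:]]", "", tolower(text))
  strsplit(clean, "[[:space:]]+")[[1]]
}
tokens <- lapply(docs, tokenize)

# Vocabulary: every distinct term found in the collection.
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document (terms x documents) ...
counts <- vapply(tokens,
                 function(tk) as.integer(table(factor(tk, levels = vocab))),
                 integer(length(vocab)))

# ... and transpose so that rows are documents and columns are terms.
dtm <- t(counts)
colnames(dtm) <- vocab
print(dtm)
```

Each row of dtm is one document and each cell is the number of times the column's term occurs in it, which is the shape that the LDA step later in this document consumes.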
Using the bibliometrix package, the graphic below can be produced.
NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "keywords", sep = ";")
net <- networkPlot(NetMatrix, normalize = "association", weighted = TRUE, n = 30,
                   Title = "Keyword Co-occurrences", type = "fruchterman",
                   size = TRUE, edgesize = 5, labelsize = 0.7)
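The idea behind a keyword co-occurrence network can be sketched in base R: build a binary article-by-keyword matrix A and multiply it by its own transpose. This is only an illustration of the concept on hypothetical keyword strings; it is not claimed to be the internal implementation of biblioNetwork.

```r
# Hypothetical keyword fields, one string per article, ";"-separated.
de <- c("WSN;MIDDLEWARE", "WSN;MIDDLEWARE;TINYOS", "TINYOS;WSN")

kw <- strsplit(de, ";", fixed = TRUE)
terms <- sort(unique(unlist(kw)))

# Binary incidence matrix A: rows are articles, columns are keywords.
A <- t(vapply(kw, function(k) as.integer(terms %in% k), integer(length(terms))))
colnames(A) <- terms

# t(A) %*% A counts, for each keyword pair, the articles that contain both.
# The diagonal holds each keyword's own frequency.
cooc <- crossprod(A)
print(cooc)
```

The off-diagonal entries are the edge weights of the network, and the diagonal gives the node sizes.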
Countries collaborating on this set of bibliographic data
countries <- metaTagExtraction(M, Field = "AU_CO", sep = ";")
NetMatrix <- biblioNetwork(countries, analysis = "collaboration", network = "countries", sep = ";")
net <- networkPlot(NetMatrix, n = dim(NetMatrix)[1], Title = "Country Collaboration",
                   type = "circle", size = TRUE, remove.multiple = FALSE,
                   labelsize = 0.7, cluster = "none")
One can list, for instance, which articles cite a specific author:
M[grep("CHEONG E", M$CR),2]
## [1] "A SIMPLE VISUALIZATION AND PROGRAMMING FRAMEWORK FOR WIRELESS SENSOR NETWORKS: PROVIZ"
## [2] "A SERVICE-ORIENTED APPROACH TO FACILITATE WSAN APPLICATION DEVELOPMENT"
## [3] "BOTS: A CONSTRAINT-BASED COMPONENT SYSTEM FOR SYNTHESIZING SCALABLE SOFTWARE SYSTEMS"
## [4] "PROGRAMMING PARADIGMS FOR NETWORKED SENSING: A DISTRIBUTED SYSTEMS' PERSPECTIVE"
## [5] "HIERARCHICAL INTEGRATION OF RUNTIME MODELS"
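The grep() call above selects the rows whose cited-references field (CR) contains the given text and returns column 2 of those rows. A minimal sketch of the same pattern on a hypothetical data frame, with placeholder author and journal names:

```r
# Hypothetical records; CR holds ";"-separated cited references.
toy <- data.frame(
  AU = c("FIRST F", "SECOND S", "THIRD T"),
  TI = c("FIRST PAPER", "SECOND PAPER", "THIRD PAPER"),
  CR = c("AUTHOR A, 2001, JOURNAL X;AUTHOR B, 2004, JOURNAL Y",
         "AUTHOR B, 2004, JOURNAL Y",
         "AUTHOR C, 2003, JOURNAL Z;AUTHOR A, 2005, JOURNAL X"),
  stringsAsFactors = FALSE
)

# Row indices whose CR string contains the literal text "AUTHOR A".
hits <- grep("AUTHOR A", toy$CR, fixed = TRUE)

# Titles of the citing articles, selected by name instead of column position.
citing_titles <- toy$TI[hits]
print(citing_titles)
```

Because this is a plain substring match, a pattern such as "AUTHOR A" would also match a cited author named "AUTHOR AB", so the results are worth checking by eye.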
Local citations
localTC <- localCitations(M, sep = ";")
## Articles analysed 100
## Articles analysed 200
## Articles analysed 267
lapply(localTC, head)
## $Authors
## Author LocalCitations
## 13 AKERHOLM M 3
## 112 CARLSON J 3
## 246 FREDRIKSSON J 3
## 307 HAKANSSON J 3
## 315 HANSSON H 3
## 533 MOELLER A 3
##
## $Papers
## Paper DOI Year LCS
## 96 AKERHOLM M, 2007, J SYST SOFTW 10.1016/J.JSS.2006.08.016 2007 3
## 2 COSTAGLIOLA G, 1995, COMPUTER 10.1109/2.366162 1995 2
## 150 BOZZANO M, 2011, COMPUT J 10.1093/COMJNL/BXQ024 2011 2
## 38 ELLIOTT C, 2003, J FUNCT PROGRAM 10.1017/S0956796802004574 2003 1
## 132 BENISTY M, 2010, ASTRON ASTROPHYS 10.1051/0004-6361/201014776 2010 1
## 174 MATTINGLEY J, 2012, OPTIM ENG 10.1007/S11081-011-9176-9 2012 1
## GCS
## 96 27
## 2 36
## 150 58
## 38 30
## 132 24
## 174 236
Or, to obtain the most dominant authors in the collection:
library(bibliometrix)
results <- biblioAnalysis(M, sep = ";")
dominance(results, k=10)
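As far as we understand it, the dominance ranking is based on a dominance factor: the share of an author's multi-authored papers on which that author appears first. The base R sketch below computes that ratio on hypothetical author lists. It illustrates the ratio only and is not the code behind dominance(); the variable names n_mt and n_mf are our own.

```r
# Hypothetical author fields, first author listed first.
au <- c("SILVA A;COSTA B", "SILVA A;LIMA C", "COSTA B;SILVA A", "LIMA C")

authors_per_paper <- strsplit(au, ";", fixed = TRUE)

# Keep only the multi-authored papers.
multi <- authors_per_paper[lengths(authors_per_paper) > 1]

# n_mt: number of multi-authored papers each author appears on.
n_mt <- table(unlist(multi))

# n_mf: number of multi-authored papers on which the author is listed first.
first <- vapply(multi, `[`, character(1), 1)
n_mf <- table(factor(first, levels = names(n_mt)))

# Dominance factor: n_mf / n_mt, per author.
dominance_factor <- as.numeric(n_mf) / as.numeric(n_mt)
names(dominance_factor) <- names(n_mt)
print(sort(dominance_factor, decreasing = TRUE))
```

Single-authored papers, such as the fourth record here, do not enter the ratio.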
The field used below is DE (author's keywords). Type help("conceptualStructure") to see details about this function.
library(bibliometrix)
cs <- conceptualStructure(M, field = "DE", method = "MCA", quali.supp = NULL, quanti.supp = NULL, minDegree = 2, k.max = 5, stemming = FALSE, labelsize = 10, documents = 50)
library(bibliometrix)
library(reshape2)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
kword <- KeywordGrowth(M, Tag = "DE", sep = ";", top = 15, cdf = TRUE)
DF = melt(kword, id='Year')
#timeline keywords ggplot
ggplot(DF,aes(x=Year,y=value, group=variable, shape=variable, colour=variable))+
geom_point()+geom_line()+
scale_shape_manual(values = 1:15)+
labs(color="Author Keywords")+
scale_x_continuous(breaks = seq(min(DF$Year), max(DF$Year), by = 5))+
scale_y_continuous(breaks = seq(0, max(DF$value), by=10))+
guides(color=guide_legend(title = "Author Keywords"), shape=FALSE)+
labs(y="Count", variable="Author Keywords", title = "Author's Keywords Usage Evolution Over Time")+
theme(text = element_text(size = 10))+
facet_grid(variable ~ .)
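The cdf = TRUE argument used above asks for cumulative rather than yearly keyword counts. A minimal base R sketch of that accumulation, on hypothetical records with a publication year (PY) and a keyword field (DE):

```r
# Hypothetical records: publication year and ";"-separated author keywords.
toy <- data.frame(
  PY = c(2001, 2001, 2002, 2004),
  DE = c("WSN;MIDDLEWARE", "WSN", "MIDDLEWARE", "WSN;MIDDLEWARE"),
  stringsAsFactors = FALSE
)

# One row per (year, keyword) occurrence.
kw <- strsplit(toy$DE, ";", fixed = TRUE)
long <- data.frame(Year = rep(toy$PY, lengths(kw)), Keyword = unlist(kw))

# Occurrences per year (rows) and keyword (columns).
per_year <- table(long$Year, long$Keyword)

# Running total down each keyword column.
cumulative <- apply(per_year, 2, cumsum)
print(cumulative)
```

Years with no records (2003 here) simply do not appear as rows in this sketch.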
Create the historiographic map proposed by E. Garfield, a chronological network map of the most relevant direct citations in a bibliographic collection.
options(width=130)
histResults <- histNetwork(M, min.citations = 10, sep = ". ")
## Articles analysed 79
net <- histPlot(histResults, n=10, size = 10, labelsize=5, size.cex=TRUE, arrowsize = 0.5, color = TRUE)
##
## Legend
##
## Paper
## 1991 - 1 MYERS PC, 1991, ASTROPHYS J
## 1995 - 2 COSTAGLIOLA G, 1995, COMPUTER
## 1997 - 3 COSTAGLIOLA G, 1997, IEEE TRANS SOFTW ENG
## 1997 - 4 BHATTACHARYYA SS, 1997, DES AUTOM EMBED SYST
## 1997 - 5 FRANZ M, 1997, MOBILE OBJECT SYSTEMS: TOWARDS THE PROGRAMMABLE INTERNET
## 1998 - 6 LANG J, 1998, ACM TRANS PROGRAM LANG SYST
## 1998 - 7 POIGNE A, 1998, FORM METHODS SYST DES
## 1999 - 8 BUTLER KL, 1999, IEEE TRANS VEH TECHNOL
## 1999 - 9 RAU BR, 1999, DES AUTOM EMBED SYST
## 2000 - 10 ELLIOTT C, 2000, SEMANTICS, APPLICATIONS AND IMPLEMENTATION OF PROGRAM GENERATION, PROCEEDINGS
## 2000 - 11 DRUSINSKY D, 2000, SPIN MODEL CHECKING AND SOFTWARE VERIFICATION
## 2001 - 12 JENKO M, 2001, MICROPROCESS MICROSYST
## 2001 - 13 FRIEDRICH IF, 2001, IEEE MICRO
## 2002 - 14 JAHNKE JH, 2002, IEEE WIREL COMMUN
## 2002 - 15 COSTAGLIOLA G, 2002, J VIS LANG COMPUT
## 2002 - 16 CZARNECKI K, 2002, GENERATIVE PROGRAMMING AND COMPONENT ENGINEERING 2002, PROCEEDINGS
## 2002 - 17 SZTIPANOVITS J, 2002, GENERATIVE PROGRAMMING AND COMPONENT ENGINEERING, PROCEEDINGS
## 2002 - 18 CORSARO A, 2002, ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2002: COOPLS, DOA, AND ODBASE
## 2003 - 19 DOUCET F, 2003, IEEE TRANS COMPUT-AIDED DES INTEGR CIRCUITS SYST
## 2003 - 20 ELLIOTT C, 2003, J FUNCT PROGRAM
## 2003 - 21 NEEMA S, 2003, EMBEDDED SOFTWARE, PROCEEDINGS
## 2003 - 22 EKER J, 2003, PROC IEEE
## 2003 - 23 WILLIAMS BC, 2003, PROC IEEE
## 2004 - 24 HSIUNG PA, 2004, IEEE TRANS SOFTW ENG
## 2004 - 25 APVRILLE L, 2004, IEEE TRANS SOFTW ENG
## 2004 - 26 VAN DER BIJL M, 2004, FORMAL APPROACHES TO SOFTWARE TESTING
## 2004 - 27 KULLMANN PHM, 2004, J NEUROPHYSIOL
## 2005 - 28 PAUL JM, 2005, ACM TRANSACT DES AUTOMAT ELECTRON SYST
## 2005 - 29 VOLGYESI P, 2005, SCI COMPUT PROGRAM
## 2005 - 30 CORRADINI F, 2005, FORMAL METHODS FOR MOBILE COMPUTING
## 2005 - 31 WUYTS R, 2005, J SYST SOFTW
## 2006 - 32 GABOR MU, 2006, REAL-TIME SYST
## 2006 - 33 GIESE H, 2006, COMPUTER SAFETY, RELIABILITY, AND SECURITY, PROCEEDINGS
## 2006 - 34 BRAVENBOER M, 2006, GENERATIVE AND TRANSFORMATIONAL TECHNIQUES IN SOFTWARE ENGINEERING
## 2007 - 35 CHEN M, 2007, IEEE WIREL COMMUN
## 2007 - 36 AKERHOLM M, 2007, J SYST SOFTW
## 2007 - 37 SUBRAMONIAN V, 2007, J SYST SOFTW
## 2007 - 38 GRUNSKE L, 2007, J SYST SOFTW
## 2007 - 39 KUZ I, 2007, J SYST SOFTW
## 2007 - 40 HUSSEIN M, 2007, J SYST SOFTW
## 2007 - 41 ROSING TS, 2007, IEEE TRANS VERY LARGE SCALE INTEGR (VLSI) SYST
## 2007 - 42 ARZEN KE, 2007, EUR J CONTROL
## 2007 - 43 BALASUBRAMANIAN K, 2007, J COMPUT SYST SCI
## 2008 - 44 CORTELLESSA V, 2008, COMPUT OPER RES
## 2008 - 45 HUGHES D, 2008, CONCURR COMPUT -PRACT EXP
## 2008 - 46 COULSON G, 2008, ACM TRANS COMPUT SYST
## 2008 - 47 PRADAL C, 2008, FUNCT PLANT BIOL
## 2009 - 48 KWON H, 2009, IEEE TRANS EDUC
## 2009 - 49 HAN SI, 2009, INTEGRATION-VLSI J
## 2010 - 50 OBERMAISSER R, 2010, IEEE TRANS IND INFORM
## 2010 - 51 DI NATALE M, 2010, IEEE TRANS IND INFORM
## 2010 - 52 BENISTY M, 2010, ASTRON ASTROPHYS
## 2010 - 53 CONMY P, 2010, IEEE TRANS IND INFORM
## 2010 - 54 BLACK G, 2010, IEEE TRANS AUTOM SCI ENG
## 2010 - 55 RENARD Y, 2010, PRESENCE-TELEOPER VIRTUAL ENV
## 2010 - 56 PHAITHOONBUATHONG P, 2010, INT J COMPUT INTEGR MANUF
## 2011 - 57 KHALGUI M, 2011, IEEE-ASME TRANS MECHATRON
## 2011 - 58 PALVIAINEN M, 2011, J SYST SOFTW
## 2011 - 59 BOZZANO M, 2011, COMPUT J
## 2011 - 60 HWANG KS, 2011, IEEE TRANS EDUC
## 2011 - 61 CANETE E, 2011, AD HOC NETW
## 2011 - 62 KHALGUI M, 2011, IEEE TRANS COMPUT
## 2011 - 63 ROMPF T, 2011, ACM SIGPLAN NOT
## 2011 - 64 MOHAMMAD M, 2011, J SYST SOFTW
## 2012 - 65 ROMPF T, 2012, COMMUN ACM
## 2012 - 66 SEINTURIER L, 2012, SOFTW -PRACT EXP
## 2012 - 67 CHEN YL, 2012, SENSORS
## 2012 - 68 MATTINGLEY J, 2012, OPTIM ENG
## 2012 - 69 WAECHTER J, 2012, NAT HAZARDS EARTH SYST SCI
## 2013 - 70 DAI W, 2013, IEEE TRANS IND INFORM
## 2013 - 71 BARAKOVA EI, 2013, ROBOT AUTON SYST
## 2013 - 72 TAHERKORDI A, 2013, ACM TRANS SENS NETW
## 2013 - 73 MUBEEN S, 2013, COMPUT SCI INF SYST
## 2014 - 74 GUENTHER C, 2014, COMPUT FLUIDS
## 2014 - 75 MUBEEN S, 2014, J SYST ARCHITECT
## 2015 - 76 ZHANG S, 2015, ACM TRANS SENS NETW
## 2015 - 77 CIMATTI A, 2015, SCI COMPUT PROGRAM
## 2016 - 78 HARRISON R, 2016, PROC IEEE
## 2016 - 79 VALE T, 2016, J SYST SOFTW
## DOI Year LCS GCS
## 1991 - 1 10.1086/170070 1991 0 93
## 1995 - 2 10.1109/2.366162 1995 2 36
## 1997 - 3 10.1109/32.637392 1997 0 35
## 1997 - 4 10.1023/A:1008806425898 1997 0 18
## 1997 - 5 <NA> 1997 0 15
## 1998 - 6 10.1145/276393.276395 1998 0 18
## 1998 - 7 10.1023/A:1008697810328 1998 0 11
## 1999 - 8 10.1109/25.806769 1999 0 110
## 1999 - 9 10.1023/A:1008842521805 1999 0 12
## 2000 - 10 <NA> 2000 0 16
## 2000 - 11 <NA> 2000 0 100
## 2001 - 12 10.1016/S0141-9331(01)00120-X 2001 0 12
## 2001 - 13 10.1109/40.928765 2001 0 19
## 2002 - 14 10.1109/MWC.2002.1160084 2002 0 18
## 2002 - 15 10.1006/JVLC.2002.0234 2002 0 35
## 2002 - 16 <NA> 2002 0 30
## 2002 - 17 <NA> 2002 0 16
## 2002 - 18 <NA> 2002 0 11
## 2003 - 19 10.1109/TCAD.2003.819385 2003 0 12
## 2003 - 20 10.1017/S0956796802004574 2003 1 30
## 2003 - 21 <NA> 2003 0 33
## 2003 - 22 10.1109/JPROC.2002.805829 2003 0 358
## 2003 - 23 10.1109/JPROC.2002.805828 2003 0 50
## 2004 - 24 10.1109/TSE.2004.68 2004 0 25
## 2004 - 25 10.1109/TSE.2004.34 2004 0 33
## 2004 - 26 <NA> 2004 0 31
## 2004 - 27 10.1152/JN.00559.2003 2004 0 52
## 2005 - 28 10.1145/1080334.1080335 2005 0 16
## 2005 - 29 10.1016/J.SCICO.2004.11.012 2005 0 11
## 2005 - 30 <NA> 2005 0 16
## 2005 - 31 10.1016/J.JSS.2003.05.004 2005 0 11
## 2006 - 32 10.1007/S11241-006-6883-Y 2006 0 18
## 2006 - 33 <NA> 2006 0 11
## 2006 - 34 <NA> 2006 0 16
## 2007 - 35 10.1109/MWC.2007.4407223 2007 0 93
## 2007 - 36 10.1016/J.JSS.2006.08.016 2007 1 27
## 2007 - 37 10.1016/J.JSS.2006.08.023 2007 0 11
## 2007 - 38 10.1016/J.JSS.2006.08.014 2007 0 24
## 2007 - 39 10.1016/J.JSS.2006.08.039 2007 0 12
## 2007 - 40 10.1016/J.JSS.2006.08.017 2007 0 13
## 2007 - 41 10.1109/TVLSI.2007.895245 2007 0 32
## 2007 - 42 10.3166/EJC.13.261-279 2007 0 11
## 2007 - 43 10.1016/J.JCSS.2006.04.008 2007 0 12
## 2008 - 44 10.1016/J.COR.2007.01.011 2008 0 40
## 2008 - 45 10.1002/CPE.1279 2008 0 13
## 2008 - 46 10.1145/1328671.1328672 2008 0 67
## 2008 - 47 10.1071/FP08084 2008 0 128
## 2009 - 48 10.1109/TE.2008.927691 2009 0 12
## 2009 - 49 10.1016/J.VLSI.2008.08.003 2009 0 14
## 2010 - 50 10.1109/TII.2010.2071388 2010 0 18
## 2010 - 51 10.1109/TII.2010.2072511 2010 0 20
## 2010 - 52 10.1051/0004-6361/201014776 2010 1 24
## 2010 - 53 10.1109/TII.2009.2039938 2010 0 17
## 2010 - 54 10.1109/TASE.2008.2007216 2010 0 45
## 2010 - 55 10.1162/PRES.19.1.35 2010 0 161
## 2010 - 56 10.1080/09511920903440313 2010 0 17
## 2011 - 57 10.1109/TMECH.2010.2050697 2011 0 19
## 2011 - 58 10.1016/J.JSS.2011.01.048 2011 0 16
## 2011 - 59 10.1093/COMJNL/BXQ024 2011 1 58
## 2011 - 60 10.1109/TE.2010.2049359 2011 0 11
## 2011 - 61 10.1016/J.ADHOC.2010.08.022 2011 0 12
## 2011 - 62 10.1109/TC.2010.96 2011 0 14
## 2011 - 63 10.1145/1942788.1868314 2011 0 13
## 2011 - 64 10.1016/J.JSS.2010.08.048 2011 0 12
## 2012 - 65 10.1145/2184319.2184345 2012 0 33
## 2012 - 66 10.1002/SPE.1077 2012 0 60
## 2012 - 67 10.3390/S120302373 2012 0 18
## 2012 - 68 10.1007/S11081-011-9176-9 2012 0 236
## 2012 - 69 10.5194/NHESS-12-1923-2012 2012 0 14
## 2013 - 70 10.1109/TII.2012.2235450 2013 0 15
## 2013 - 71 10.1016/J.ROBOT.2012.08.001 2013 0 18
## 2013 - 72 10.1145/2422966.2422971 2013 0 12
## 2013 - 73 10.2298/CSIS120614011M 2013 0 11
## 2014 - 74 10.1016/J.COMPFLUID.2014.06.023 2014 0 11
## 2014 - 75 10.1016/J.SYSARC.2013.10.008 2014 0 11
## 2015 - 76 10.1145/2746343 2015 0 16
## 2015 - 77 10.1016/J.SCICO.2014.06.011 2015 0 18
## 2016 - 78 10.1109/JPROC.2015.2510665 2016 0 21
## 2016 - 79 10.1016/J.JSS.2015.09.019 2016 0 17
Using the dplyr, ggplot2 and tidytext packages, the corpus data can be explored to extract information and build visualizations.
For this, LDA (Latent Dirichlet Allocation) is a well-suited algorithm for extracting topics from all this data.
The ldatuning library is used to find the best number of topics to represent this sample in an LDA model.
# observe the method used, you can choose Gibbs or VEM
library(ldatuning)
library(topicmodels)
tunningresult <- FindTopicsNumber(
my.pdfdtm,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 3L,
verbose = TRUE)
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
## Deveaud2014... done.
FindTopicsNumber_plot(tunningresult)
You can run this snippet as many times as you'd like, varying the number of CPU cores (mc.cores) to suit your needs.
Look carefully at the graph above: it suggests a model with k between 12 and 16. We'll choose k = 13.
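When reading the plot, keep in mind that, per the ldatuning documentation, CaoJuan2009 and Arun2010 are metrics to minimize while Griffiths2004 and Deveaud2014 are metrics to maximize. The sketch below picks the best k per metric from a data frame shaped like the output of FindTopicsNumber(). All numbers are hypothetical and do not come from this study.

```r
# Hypothetical metric values; the real ones come from FindTopicsNumber().
toy <- data.frame(
  topics        = c(2, 5, 10, 13, 15),
  Griffiths2004 = c(-900, -850, -820, -800, -805),
  CaoJuan2009   = c(0.30, 0.22, 0.15, 0.10, 0.12),
  Arun2010      = c(50, 40, 31, 25, 27),
  Deveaud2014   = c(1.1, 1.4, 1.7, 1.9, 1.8)
)

# Best k per metric: maximize Griffiths2004 and Deveaud2014,
# minimize CaoJuan2009 and Arun2010.
best_k <- c(
  Griffiths2004 = toy$topics[which.max(toy$Griffiths2004)],
  CaoJuan2009   = toy$topics[which.min(toy$CaoJuan2009)],
  Arun2010      = toy$topics[which.min(toy$Arun2010)],
  Deveaud2014   = toy$topics[which.max(toy$Deveaud2014)]
)
print(best_k)
```

On real data the four metrics rarely agree on a single value, so the result is usually a range of candidate k values rather than one number.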
# Note the value of k and the method used; there are two options (VEM or Gibbs)
# matrix: whether to tidy the beta (per-term-per-topic, default) or gamma (per-document-per-topic) matrix
library(ldatuning)
library(topicmodels)
library(tidytext)
my.pdf.lda <- LDA(my.pdfdtm, k=13, control=list(seed=77), method="Gibbs")
my.pdf.topics <- tidy(my.pdf.lda, matrix = "beta")
# optional: to inspect the data, uncomment the line below
#my.pdf.topics
Note the use of matrix = "beta", which returns the per-topic term probabilities. Later on we use matrix = "gamma" to analyze how much of each topic is represented in an article.
For a better visualization of the topics, the snippet below gets the top 10 words in each topic.
library(ldatuning)
library(topicmodels)
library(tidytext)
library(dplyr)
library(magrittr)
my.pdf.top_terms <- my.pdf.topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup %>%
arrange(topic, -beta)
# optional: to inspect the data, uncomment the line below
#my.pdf.top_terms
Finally, the visualization of the top 10 words in each topic.
library(ldatuning)
library(topicmodels)
library(tidytext)
library(magrittr)
my.pdf.top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill=factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
Now we are going to explore how those topics are distributed across the content of the articles.
# retrieve the documents associated with each topic
my.pdf.gamma <- tidy(my.pdf.lda, matrix = "gamma")
# plot gamma by topic, one panel per document
my.pdf.gamma %>%
mutate(title = reorder(document, gamma * topic)) %>%
ggplot(aes(factor(topic), gamma)) +
geom_boxplot() +
facet_wrap(~ title)
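The gamma values can also be summarized numerically. The following base R sketch assumes a data frame with the same three columns as the tidied gamma matrix (document, topic, gamma) and picks the dominant topic of each document. The file names and values are hypothetical.

```r
# Hypothetical per-document-per-topic proportions, shaped like the
# output of tidy(..., matrix = "gamma").
gamma <- data.frame(
  document = rep(c("a.pdf", "b.pdf"), each = 3),
  topic    = rep(1:3, times = 2),
  gamma    = c(0.70, 0.20, 0.10, 0.25, 0.15, 0.60),
  stringsAsFactors = FALSE
)

# Within a document the topic proportions add up to 1.
totals <- tapply(gamma$gamma, gamma$document, sum)

# For each document keep the row with the highest gamma.
by_doc <- split(gamma, gamma$document)
dominant <- do.call(rbind, lapply(by_doc, function(d) d[which.max(d$gamma), ]))
print(dominant)
```

A document whose highest gamma is only slightly above the others mixes several topics, so the dominant topic alone can be misleading.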
library(dplyr)
library(ggplot2)
library(tidytext)
my.pdftidy <- tidy(my.pdfdtm)
my.pdfsentiment <- my.pdftidy %>%
inner_join(get_sentiments("bing"), by = c(term ="word"))
my.pdfsentiment %>%
count(sentiment, term, wt = count) %>%
ungroup() %>%
filter(n >= 60) %>%
mutate(n =ifelse(sentiment == "negative", -n, n))%>%
mutate(term=reorder(term,n)) %>%
ggplot(aes(term, n, fill = sentiment)) +
geom_bar(stat="identity") +
ylab("Sentiment analysis on this set") +
coord_flip()
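The pipeline above joins the term counts with a sentiment lexicon and flips the sign of the negative terms so that they plot to the left of zero. A minimal base R sketch of those two steps, using a tiny hypothetical lexicon in place of get_sentiments("bing"):

```r
# Hypothetical term counts, shaped like a tidied document-term matrix.
counts <- data.frame(term  = c("error", "robust", "failure", "network"),
                     count = c(70, 90, 65, 300),
                     stringsAsFactors = FALSE)

# Hypothetical three-word lexicon standing in for the real one.
lexicon <- data.frame(word      = c("error", "failure", "robust"),
                      sentiment = c("negative", "negative", "positive"),
                      stringsAsFactors = FALSE)

# Inner join: terms missing from the lexicon ("network") are dropped.
scored <- merge(counts, lexicon, by.x = "term", by.y = "word")

# Negative terms get a negative signed count, positive terms keep theirs.
scored$n <- ifelse(scored$sentiment == "negative", -scored$count, scored$count)

# Order from most negative to most positive, as in the bar chart.
scored <- scored[order(scored$n), ]
print(scored)
```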
Perplexity is a measure of how well a probability model predicts a sample. All credit goes to http://freerangestats.info/blog/2017/01/05/topic-model-cv; please visit that site for more details.
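For intuition, perplexity can be computed as the exponential of the negative mean log-probability that the model assigns to the observed tokens. Lower values mean better predictions. The sketch below uses hypothetical probabilities, and the helper name perplexity_of is our own. For fitted models, the topicmodels package provides a perplexity() function.

```r
# Perplexity of a sample: exp of the negative mean log-probability
# that the model assigned to each observed token.
perplexity_of <- function(p) exp(-mean(log(p)))

# Hypothetical probabilities assigned to four held-out tokens.
p_model <- c(0.20, 0.25, 0.10, 0.50)

# Baseline: a model that guesses uniformly among 8 possible words.
p_uniform <- rep(1 / 8, 4)

pp_model <- perplexity_of(p_model)
pp_uniform <- perplexity_of(p_uniform)

# A uniform guess over 8 words has perplexity 8; the model does better.
print(c(model = pp_model, uniform = pp_uniform))
```

Perplexity is therefore often read as the effective number of equally likely words the model is choosing between.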