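OpenAlex is a large, free index of scholarly resources; its name comes from the ancient Library of Alexandria in Egypt. OpenAlex offers both a free tier and a paid premium service. The free API allows up to 100,000 requests per user per day.
The OpenAlex dataset describes five types of scholarly entities and the connections between them:
Works: papers, books, datasets, and so on; works cite other works
Authors: the people who create works
Sources: the journals or repositories that host works
Institutions: universities and other organizations affiliated with works (via authors)
Concepts: topics used to tag works
The database can be accessed through its API. If we already know any of the attributes shown in the diagram, such as an author name or ORCID, a paper DOI, a concept, an author's institution, a country, a source, or a topic, we can search on it. In R, the openalexR package provides this access. The table below lists the five main functions of openalexR: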
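Function | Description |
---|---|
oa_fetch | Composes the three functions below so that everything runs in one step: oa_query, then oa_request, then oa2df. |
oa_query | Generates a valid query from a set of arguments provided by the user, written following the OpenAlex API syntax. |
oa_request | Downloads the collection of entities matching the query created by oa_query, or written manually by the user, and returns the JSON object in list format. |
oa2df | Converts the JSON object into a classical bibliographic tibble/data frame. |
oa_random | Gets a random entity; for example, oa_random("works") gives a different result each time it is run. |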
There are two ways to install the openalexR package. The first is to install the development version from GitHub:
options(repos = c(CRAN = "https://cran.rstudio.com/"))
utils::install.packages("remotes")
## package 'remotes' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\RENJIANCHAO\AppData\Local\Temp\RtmpcTsarq\downloaded_packages
remotes::install_github("ropensci/openalexR")
## Downloading GitHub repo ropensci/openalexR@HEAD
## Running `R CMD build`...
## * checking for file 'C:\Users\RENJIANCHAO\AppData\Local\Temp\RtmpcTsarq\remotes43803b863e86\ropensci-openalexR-e4ffbe3/DESCRIPTION' ... OK
## * preparing 'openalexR':
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## Removed empty directory 'openalexR/vignettes'
## * building 'openalexR_1.3.0.tar.gz'
The second is to install the released version from CRAN:
install.packages("openalexR")
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## openalexR 1.2.3 1.3.0 FALSE
## installing the source package 'openalexR'
For a better experience, add your email address to the API requests before you start (and, if you have OpenAlex Premium, add your API key to the openalexR.apikey option):
options(openalexR.mailto = "847807047@sina.com")
options(openalexR.apikey = "EXAMPLE_APIKEY")
Alternatively, run file.edit("~/.Renviron") to open .Renviron and add the following lines directly:
openalexR.mailto = your_email
openalexR.apikey = EXAMPLE_APIKEY
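As a quick check that a setting is visible to R, the minimal sketch below sets the option with a placeholder address and reads it back. The helper `get_openalex_setting` is hypothetical and not part of openalexR. It assumes a value may come either from an environment variable (as loaded from .Renviron) or from options(), and the precedence shown is an assumption rather than documented package behavior.

```r
# Hypothetical helper (not part of openalexR): look up a setting by name,
# first among environment variables (e.g. loaded from .Renviron),
# then among R options. The precedence order is an assumption.
get_openalex_setting <- function(name) {
  from_env <- Sys.getenv(name, unset = "")
  if (nzchar(from_env)) {
    return(from_env)
  }
  getOption(name, default = NULL)
}

# Placeholder address for illustration only
options(openalexR.mailto = "you@example.com")
mailto <- get_openalex_setting("openalexR.mailto")
print(mailto)

# A setting that was never defined comes back as NULL
missing_key <- get_openalex_setting("openalexR.not_a_real_setting")
print(is.null(missing_key))
```

Note that changes to .Renviron only take effect after R is restarted, whereas options() applies immediately to the current session.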
Using doi as the filter for works:
library(openalexR)
## Thank you for using openalexR!
## To acknowledge our work, please cite the package by calling `citation("openalexR")`.
## To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
works_from_dois <- oa_fetch(
entity = "works",
doi = c("10.1016/j.joi.2017.08.007", "https://doi.org/10.1007/s11192-013-1221-3"),
verbose = TRUE
)
## Requesting url: https://api.openalex.org/works?filter=doi%3A10.1016%2Fj.joi.2017.08.007%7Chttps%3A%2F%2Fdoi.org%2F10.1007%2Fs11192-013-1221-3
## Getting 1 page of results with a total of 2 records...
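# We can view the tibble/data frame stored in works_from_dois interactively in RStudio, or inspect it with base functions such as str or head. The current version also provides the experimental show_works function to simplify the result (for example, dropping some columns and keeping only the first/last author) for easier viewing.
# Note: the table below is wrapped in knitr::kable() so that it displays nicely in this README, but you will most likely not need this function.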
# str(works_from_dois, max.level = 2)
# head(works_from_dois)
# show_works(works_from_dois)
works_from_dois |>
show_works() |>
knitr::kable()
id | display_name | first_author | last_author | so | url | is_oa | top_concepts |
---|---|---|---|---|---|---|---|
W2755950973 | bibliometrix : An R-tool for comprehensive science mapping analysis | Massimo Aria | Corrado Cuccurullo | Journal of informetrics | https://doi.org/10.1016/j.joi.2017.08.007 | FALSE | Workflow, Bibliometrics, Software |
W2038196424 | Coverage and adoption of altmetrics sources in the bibliometric community | Stefanie Haustein | Jens Terliesner | Scientometrics | https://doi.org/10.1007/s11192-013-1221-3 | FALSE | Altmetrics, Bookmarking, Social media |
The following uses author.orcid as the filter (with or without the https://orcid.org/ prefix):
library(openalexR)
works_from_orcids <- oa_fetch(
entity = "works",
author.orcid = c("0000-0001-6187-6610", "0000-0002-8517-9411"),
verbose = TRUE
)
## Requesting url: https://api.openalex.org/works?filter=author.orcid%3A0000-0001-6187-6610%7C0000-0002-8517-9411
## Getting 2 pages of results with a total of 252 records...
## Warning in oa_request(oa_query(filter = filter_i, multiple_id = multiple_id, :
## The following work(s) have truncated lists of authors: W4230863633.
## Query each work separately by its identifier to get full list of authors.
## For example:
## lapply(c("W4230863633"), \(x) oa_fetch(identifier = x))
## Details at https://docs.openalex.org/api-entities/authors/limitations.
works_from_orcids |>
show_works() |>
knitr::kable()
id | display_name | first_author | last_author | so | url | is_oa | top_concepts |
---|---|---|---|---|---|---|---|
W2755950973 | bibliometrix : An R-tool for comprehensive science mapping analysis | Massimo Aria | Corrado Cuccurullo | Journal of informetrics | https://doi.org/10.1016/j.joi.2017.08.007 | FALSE | Workflow, Bibliometrics, Software |
W2741809807 | The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles | Heather Piwowar | Stefanie Haustein | PeerJ | https://doi.org/10.7717/peerj.4375 | TRUE | Citation, License, Bibliometrics |
W2122130843 | Scientometrics 2.0: New metrics of scholarly impact on the social Web | Jason Priem | Bradely H. Hemminger | First Monday | https://doi.org/10.5210/fm.v15i7.2874 | FALSE | Bookmarking, Altmetrics, Social media |
W2038196424 | Coverage and adoption of altmetrics sources in the bibliometric community | Stefanie Haustein | Jens Terliesner | Scientometrics | https://doi.org/10.1007/s11192-013-1221-3 | FALSE | Altmetrics, Bookmarking, Social media |
W2396414759 | The Altmetrics Collection | Jason Priem | Dario Taraborelli | PloS one | https://doi.org/10.1371/journal.pone.0048753 | TRUE | Social media, Citation, Altmetrics |
W2408216567 | Foundations and trends in performance management. A twenty-five years bibliometric analysis in business and public administration domains | Corrado Cuccurullo | Fabrizia Sarto | Scientometrics | https://doi.org/10.1007/s11192-016-1948-8 | FALSE | Domain (mathematical analysis), Content analysis, Public domain |
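Because both forms of an ORCID are accepted, identifiers collected from different places may mix the two styles. The following is a minimal sketch for bringing them to one form before comparing or deduplicating them. The helper `normalize_orcid` is hypothetical and not part of openalexR.

```r
# Hypothetical helper (not part of openalexR): strip the optional
# "https://orcid.org/" prefix so that both input forms compare equal.
normalize_orcid <- function(x) {
  sub("^https?://orcid\\.org/", "", trimws(x))
}

orcids <- c("https://orcid.org/0000-0001-6187-6610", "0000-0002-8517-9411")
bare <- normalize_orcid(orcids)
print(bare)

# Check the usual pattern: four groups of four characters, where the
# last character may be the check character "X"
is_valid <- grepl("^\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]$", bare, perl = TRUE)
print(is_valid)
```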
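Download all works published between 2020 and 2021 that have been cited more than 50 times and whose title contains "bibliometric analysis" or "science mapping". We may also want the results sorted by total citations in descending order.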
library(openalexR)
works_search <- oa_fetch(
entity = "works",
title.search = c("bibliometric analysis", "science mapping"),
cited_by_count = ">50",
from_publication_date = "2020-01-01",
to_publication_date = "2021-12-31",
options = list(sort = "cited_by_count:desc"),
verbose = TRUE
)
## Requesting url: https://api.openalex.org/works?filter=title.search%3Abibliometric%20analysis%7Cscience%20mapping%2Ccited_by_count%3A%3E50%2Cfrom_publication_date%3A2020-01-01%2Cto_publication_date%3A2021-12-31&sort=cited_by_count%3Adesc
## Getting 2 pages of results with a total of 236 records...
works_search |>
show_works() |>
knitr::kable()
id | display_name | first_author | last_author | so | url | is_oa | top_concepts |
---|---|---|---|---|---|---|---|
W3160856016 | How to conduct a bibliometric analysis: An overview and guidelines | Naveen Donthu | Weng Marc Lim | Journal of business research | https://doi.org/10.1016/j.jbusres.2021.04.070 | TRUE | Bibliometrics, Field (mathematics), Resource (disambiguation) |
W3038273726 | Investigating the emerging COVID-19 research trends in the field of business and management: A bibliometric analysis approach | Surabhi Verma | Anders Gustafsson | Journal of business research | https://doi.org/10.1016/j.jbusres.2020.06.057 | TRUE | Bibliometrics, Field (mathematics), Empirical research |
W3001491100 | Software tools for conducting bibliometric analysis in science: An up-to-date review | José A. Moral-Muñoz | Manuel J. Cobo | El Profesional de la información | https://doi.org/10.3145/epi.2020.ene.03 | TRUE | Bibliometrics, Visualization, Set (abstract data type) |
W2990450011 | Forty-five years of Journal of Business Research: A bibliometric analysis | Naveen Donthu | Debidutta Pattnaik | Journal of business research | https://doi.org/10.1016/j.jbusres.2019.10.039 | FALSE | Publishing, Bibliometrics, Empirical research |
W3044902155 | Financial literacy: A systematic review and bibliometric analysis | Kirti Goyal | Satish Kumar | International journal of consumer studies | https://doi.org/10.1111/ijcs.12605 | FALSE | Financial literacy, Content analysis, Citation |
W2990688366 | A bibliometric analysis of board diversity: Current status, development, and future research directions | H. Kent Baker | Arunima Haldar | Journal of business research | https://doi.org/10.1016/j.jbusres.2019.11.025 | FALSE | Diversity (politics), Ethnic group, Bibliometrics |
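The request URL printed by verbose = TRUE shows how the arguments are combined: the values of one filter are joined with "|" (OR), different filters are joined with "," (AND), and the whole string is percent-encoded. The sketch below rebuilds such a URL with base R only. The helper `build_filter_url` is hypothetical and only mimics the pattern visible in the printed URL; it is not how openalexR is implemented.

```r
# Hypothetical helper (not part of openalexR): rebuild a request URL
# from a named list of filters, following the pattern seen in the
# verbose output. Values of one filter are joined with "|" (OR),
# different filters with "," (AND), and the result is percent-encoded.
build_filter_url <- function(entity, filters, sort = NULL) {
  parts <- vapply(
    names(filters),
    function(key) paste0(key, ":", paste(filters[[key]], collapse = "|")),
    character(1)
  )
  filter_string <- paste(parts, collapse = ",")
  url <- paste0(
    "https://api.openalex.org/", entity,
    "?filter=", utils::URLencode(filter_string, reserved = TRUE)
  )
  if (!is.null(sort)) {
    url <- paste0(url, "&sort=", utils::URLencode(sort, reserved = TRUE))
  }
  url
}

search_url <- build_filter_url(
  entity = "works",
  filters = list(
    title.search = c("bibliometric analysis", "science mapping"),
    cited_by_count = ">50",
    from_publication_date = "2020-01-01",
    to_publication_date = "2021-12-31"
  ),
  sort = "cited_by_count:desc"
)
print(search_url)
```

Comparing the printed string with the "Requesting url" line in the output is a simple way to confirm which conditions a query actually sends.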
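We first download the records of all level-1 concepts/keywords associated with more than one million works (restricted here to concepts whose ancestor is Biology):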
library(openalexR)
library(gghighlight)
## Loading required package: ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
concept_df <- oa_fetch(
entity = "concepts",
level = 1,
ancestors.id = "https://openalex.org/C86803240", # Biology
works_count = ">1000000"
)
concept_df |>
select(display_name, counts_by_year) |>
tidyr::unnest(counts_by_year) |>
filter(year < 2022) |>
ggplot() +
aes(x = year, y = works_count, color = display_name) +
facet_wrap(~display_name) +
geom_line(linewidth = 0.7) +
scale_color_brewer(palette = "Dark2") +
labs(
x = NULL, y = "Works count",
title = "Virology spiked in 2020."
) +
guides(color = "none") +
gghighlight(
max(works_count) > 200000,
min(works_count) < 400000,
label_params = list(nudge_y = 10^5, segment.color = NA)
)
## label_key: display_name
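We want to download all records for Italian institutions (country_code:it) that are classified as educational (type:education). Again, we check how many records match the query and then download the corresponding collection of records: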
italy_insts <- oa_fetch(
entity = "institutions",
country_code = "it",
type = "education",
verbose = TRUE
)
## Requesting url: https://api.openalex.org/institutions?filter=country_code%3Ait%2Ctype%3Aeducation
## Getting 2 pages of results with a total of 232 records...
italy_insts |>
slice_max(cited_by_count, n = 8) |>
mutate(display_name = forcats::fct_reorder(display_name, cited_by_count)) |>
ggplot() +
aes(x = cited_by_count, y = display_name, fill = display_name) +
geom_col() +
scale_fill_viridis_d(option = "E") +
guides(fill = "none") +
labs(
x = "Total citations", y = NULL,
title = "Italian references"
) +
coord_cartesian(expand = FALSE)
# The package wordcloud needs to be installed to run this chunk
library(wordcloud)
## Loading required package: RColorBrewer
concept_cloud <- italy_insts |>
select(inst_id = id, x_concepts) |>
tidyr::unnest(x_concepts) |>
filter(level == 1) |>
select(display_name, score) |>
group_by(display_name) |>
summarise(score = sum(score))
pal <- c("black", scales::brewer_pal(palette = "Set1")(5))
set.seed(1)
wordcloud::wordcloud(
concept_cloud$display_name,
concept_cloud$score,
scale = c(2, .4),
colors = pal
)
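To make the aggregation behind the word cloud concrete, here is a minimal sketch on invented toy data with the same shape as the unnested x_concepts column. It uses base R only (subsetting plus aggregate) as a stand-in for the filter, group_by and summarise steps; the values are made up for illustration.

```r
# Toy data (invented for illustration): one row per institution-concept
# pair, mimicking the shape of the unnested x_concepts column.
toy_concepts <- data.frame(
  inst_id = c("I1", "I1", "I2", "I2", "I3"),
  display_name = c("Biology", "Physics", "Biology", "Chemistry", "Physics"),
  level = c(1, 1, 1, 1, 0),
  score = c(60, 40, 30, 50, 80)
)

# Keep level-1 concepts only, then sum the scores per concept name
level_one <- toy_concepts[toy_concepts$level == 1, ]
concept_totals <- aggregate(score ~ display_name, data = level_one, FUN = sum)
print(concept_totals)
```

The summed score per concept is what determines the size of each word in the cloud.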
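We first download the records of all journals (sources) that have published more than 200,000 works, and then visualize their scored concepts: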
# The package ggtext needs to be installed to run this chunk
# library(ggtext)
jours_all <- oa_fetch(
entity = "sources",
works_count = ">200000",
verbose = TRUE
)
## Requesting url: https://api.openalex.org/sources?filter=works_count%3A%3E200000
## Getting 1 page of results with a total of 42 records...
jours <- jours_all |>
filter(!is.na(x_concepts), type != "ebook platform") |>
slice_max(cited_by_count, n = 9) |>
distinct(display_name, .keep_all = TRUE) |>
select(jour = display_name, x_concepts) |>
tidyr::unnest(x_concepts) |>
filter(level == 0) |>
left_join(concept_abbrev, by = join_by(id, display_name)) |>
mutate(
abbreviation = gsub(" ", "<br>", abbreviation),
jour = gsub("Journal of|Journal of the", "J.", gsub("\\(.*?\\)", "", jour))
) |>
tidyr::complete(jour, abbreviation, fill = list(score = 0)) |>
group_by(jour) |>
mutate(
color = if_else(score > 10, "#1A1A1A", "#D9D9D9"), # CCCCCC
label = paste0("<span style='color:", color, "'>", abbreviation, "</span>")
) |>
ungroup()
jours |>
ggplot() +
aes(fill = jour, y = score, x = abbreviation, group = jour) +
facet_wrap(~jour) +
geom_hline(yintercept = c(45, 90), colour = "grey90", linewidth = 0.2) +
geom_segment(
aes(x = abbreviation, xend = abbreviation, y = 0, yend = 100),
color = "grey95"
) +
geom_col(color = "grey20") +
coord_polar(clip = "off") +
theme_bw() +
theme(
plot.background = element_rect(fill = "transparent", colour = NA),
panel.background = element_rect(fill = "transparent", colour = NA),
panel.grid = element_blank(),
panel.border = element_blank(),
axis.text = element_blank(),
axis.ticks.y = element_blank()
) +
ggtext::geom_richtext(
aes(y = 120, label = label),
fill = NA, label.color = NA, size = 3
) +
scale_fill_brewer(palette = "Set1", guide = "none") +
labs(y = NULL, x = NULL, title = "Journal clocks")
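Users can also perform snowballing with oa_snowball. Snowballing is a literature search technique in which the researcher starts from a set of articles and finds the articles that cite, or are cited by, that original set. oa_snowball returns a list with two elements: nodes and edges. Like oa_fetch, oa_snowball finds and returns information on a core set of articles that satisfy certain criteria; unlike oa_fetch, it also returns information on the articles that cite, and are cited by, this core set.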
# The packages ggraph and tidygraph need to be installed to run this chunk
library(ggraph)
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
snowball_docs <- oa_snowball(
identifier = c("W1964141474", "W1963991285"),
verbose = TRUE
)
## Requesting url: https://api.openalex.org/works?filter=openalex%3AW1964141474%7CW1963991285
## Getting 1 page of results with a total of 2 records...
## Collecting all documents citing the target papers...
## Requesting url: https://api.openalex.org/works?filter=cites%3AW1963991285%7CW1964141474
## Getting 3 pages of results with a total of 533 records...
## Collecting all documents cited by the target papers...
## Requesting url: https://api.openalex.org/works?filter=cited_by%3AW1963991285%7CW1964141474
## Getting 1 page of results with a total of 91 records...
ggraph(graph = as_tbl_graph(snowball_docs), layout = "stress") +
geom_edge_link(aes(alpha = after_stat(index)), show.legend = FALSE) +
geom_node_point(aes(fill = oa_input, size = cited_by_count), shape = 21, color = "white") +
geom_node_label(aes(filter = oa_input, label = id), nudge_y = 0.2, size = 3) +
scale_edge_width(range = c(0.1, 1.5), guide = "none") +
scale_size(range = c(3, 10), guide = "none") +
scale_fill_manual(values = c("#a3ad62", "#d46780"), na.value = "grey", name = "") +
theme_graph() +
theme(
plot.background = element_rect(fill = "transparent", colour = NA),
panel.background = element_rect(fill = "transparent", colour = NA),
legend.position = "bottom"
) +
guides(fill = "none")
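The idea of a single snowball step can be sketched on an invented citation edge list, without calling the API. The helper `snowball_once` is hypothetical, and the column names `from` and `to` are assumptions chosen for this toy example (only `oa_input` is taken from the plotting code). It collects the works that cite a seed and the works a seed cites, and returns them as nodes and edges.

```r
# Toy citation edge list (invented): each row means `from` cites `to`
edges <- data.frame(
  from = c("W2", "W3", "W1", "W1", "W4"),
  to = c("W1", "W1", "W5", "W6", "W7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one snowball step around a set of seed works
snowball_once <- function(edges, seeds) {
  citing <- edges$from[edges$to %in% seeds]    # works that cite a seed
  cited_by <- edges$to[edges$from %in% seeds]  # works that a seed cites
  ids <- unique(c(seeds, citing, cited_by))
  list(
    nodes = data.frame(id = ids, oa_input = ids %in% seeds),
    edges = edges[edges$from %in% seeds | edges$to %in% seeds, ]
  )
}

toy_snowball <- snowball_once(edges, seeds = "W1")
print(toy_snowball$nodes)
print(toy_snowball$edges)
```

Here W4 and W7 are left out because neither is linked to the seed W1 by a citation.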
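OpenAlex offers (limited) support for full-text N-grams of Work entities (whose ids start with "W"). Given a vector of work ids, oa_ngrams returns a data frame of n-gram data for each work (in the ngrams list-column).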
ngrams_data <- oa_ngrams(
works_identifier = c("W1964141474", "W1963991285"),
verbose = TRUE
)
ngrams_data
## # A tibble: 2 × 4
## id doi count ngrams
## <chr> <chr> <int> <list>
## 1 https://openalex.org/W1964141474 https://doi.org/10.1016/j.conb.… 2733 <df>
## 2 https://openalex.org/W1963991285 https://doi.org/10.1126/science… 2338 <df>
lapply(ngrams_data$ngrams, head, 3)
## [[1]]
## ngram ngram_count ngram_tokens
## 1 brain basis and core cause 2 5
## 2 cause be not yet fully 2 5
## 3 include structural and functional magnetic 2 5
## term_frequency
## 1 0.0006637902
## 2 0.0006637902
## 3 0.0006637902
##
## [[2]]
## ngram ngram_count ngram_tokens
## 1 intact but less accessible phonetic 1 5
## 2 accessible phonetic representation in Adults 1 5
## 3 representation in Adults with Dyslexia 1 5
## term_frequency
## 1 0.0003756574
## 2 0.0003756574
## 3 0.0003756574
ngrams_data |>
tidyr::unnest(ngrams) |>
filter(ngram_tokens == 2) |>
select(id, ngram, ngram_count) |>
group_by(id) |>
slice_max(ngram_count, n = 10, with_ties = FALSE) |>
ggplot(aes(ngram_count, forcats::fct_reorder(ngram, ngram_count))) +
geom_col(aes(fill = id), show.legend = FALSE) +
facet_wrap(~id, scales = "free_y") +
labs(
title = "Top 10 fulltext bigrams",
x = "Count",
y = NULL
)
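For readers new to the term, the sketch below counts bigrams (n-grams with two tokens) in a short invented sentence using base R. The helper `count_ngrams` is hypothetical and uses a deliberately naive tokenizer; it only illustrates what an n-gram count is and is not how OpenAlex derives its n-gram data.

```r
# Hypothetical helper: count word n-grams in a single character string
count_ngrams <- function(text, n = 2) {
  # Naive tokenizer: lower-case, split on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) {
    return(data.frame(ngram = character(0), ngram_count = integer(0)))
  }
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(
    ngram = names(counts),
    ngram_count = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

toy_text <- "science mapping and bibliometric analysis; science mapping tools"
bigrams <- count_ngrams(toy_text, n = 2)
print(head(bigrams, 3))
```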
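oa_ngrams can be slow at times because n-gram data can get quite large, but given that the n-grams are "cached via CDN" (https://docs.openalex.org/api-entities/works/get-n-grams#api-endpoint), you may also consider parallelizing for this special case (oa_ngrams does this automatically if you have {curl} >= v5.0.0).
Sources: https://github.com/ropensci/openalexR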
https://baijiahao.baidu.com/s?id=1722906102002408666&wfr=spider&for=pc
https://zhuanlan.zhihu.com/p/643347576