John Bohannon published an article and analysis of download data from SciHub in Science. Bohannon noted that the single largest download city in the US was Ashburn, VA with 96,857 downloads in the 6 months through February 2016.
See map at http://science.sciencemag.org/content/352/6285/508.figures-only.
There was some speculation in the article and on Twitter that Janelia Research Campus, located in Ashburn, could be the source of these downloads. I was curious since I have lots of collaborators there!
John Bohannon and Alexandra Elbakyan shared the raw data used for the analysis in the article on Dryad (see references).
I downloaded the 650 MB of data and grepped all the files for Ashburn to produce a small table, which is used for this analysis:
grep 'Ashburn' -r scihub_data > Ashburn.tab
library(readr)
ashburn=read_tsv("Ashburn.tab", col_names = c("date","doi","IP_code","country","city","coords"))
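Now we need to do a little munging of the first column to recover the date-time: because of the recursive grep, each value is prefixed with the name of the file it came from and a colon.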
library(stringr)
unmangled=str_match(ashburn$date,"([^:]+):(.*)")[,2:3]
ashburn$file=unmangled[,1]
ashburn$date=parse_datetime(unmangled[,2])
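To make the unmangling step concrete, here is a minimal sketch on two made-up values. The file names and timestamps are hypothetical (not taken from the Dryad archive), UTC is an assumption, and I use base R's `regexec` with the same pattern so the snippet runs without extra packages:

```r
# Hypothetical first-column values as written by `grep -r`:
# "<file path>:<timestamp>"
raw <- c("scihub_data/sep2015.tab:2015-09-01 00:00:12",
         "scihub_data/feb2016.tab:2016-02-28 23:59:58")

# Same pattern as above: everything up to the first colon, then the rest
m <- regmatches(raw, regexec("([^:]+):(.*)", raw))
parts <- do.call(rbind, m)  # column 1 = full match, 2 = file, 3 = timestamp

file_part <- parts[, 2]
time_part <- as.POSIXct(parts[, 3], tz = "UTC")  # UTC is an assumption

print(data.frame(file = file_part, date = time_part))
```

Because `[^:]+` cannot cross a colon, the split always happens at the first colon, so the colons inside the timestamp are left alone.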
And we can also compute a DOI prefix
ashburn$prefix=str_split_fixed(ashburn$doi, fixed("/"), 2)[,1]
Let’s check that I picked up just one city as expected
length(unique(ashburn$coords))
## [1] 1
Good, only one location.
Elbakyan apparently encoded the raw IP addresses before passing them to Bohannon. I’m not sure if the resultant IP_code column is unique per IP address, but that’s what it looks like. Of course institutions frequently map all of their outgoing traffic through a single IP address so this would still not equate with single users.
t_ip=table(ashburn$IP_code)
plot(t_ip)
plot(ecdf(t_ip))
So we can see that one IP_code accounts for >70% of the downloads and two more account for almost all the rest. But are these institutions or individuals?
The top IP_code accounts for 71.5 percent of the total downloads.
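For anyone who wants to reproduce that kind of percentage, this is a small sketch of the calculation on toy data. The IP codes `a`, `b` and `c` are invented for illustration; the real column holds the encoded identifiers:

```r
# Toy IP codes (hypothetical), one entry per download
ip <- c(rep("a", 7), rep("b", 2), "c")

# Count downloads per code, largest first
t_ip <- sort(table(ip), decreasing = TRUE)

# Percent of all downloads per code, and the running total
share <- setNames(100 * as.numeric(t_ip) / sum(t_ip), names(t_ip))
cum_share <- cumsum(share)

print(round(share, 1))
print(round(cum_share, 1))
```

The cumulative share is the same information that the `ecdf` plot shows graphically, just ordered by download count rather than by value.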
Let’s take a look at when those people were downloading
sorted_ips=names(sort(t_ip, decreasing = T))
top1=sorted_ips[1]
library(dplyr)
library(ggplot2)
ashburn %>%
filter(IP_code==top1) %>%
arrange(date) %>%
with(expr = qplot(date, 1:length(date), ylab='Download number'))
So that’s pretty interesting. These downloads are happening in bursts every week. This is almost certainly a script, probably being run by a single individual. Note that they only happened in the period up to 2 November. The next run, on about 9 November, would have come after the scihub domain was seized and scihub went offline.
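The weekly rhythm can also be checked numerically rather than by eye. Here is a rough sketch on synthetic timestamps; the dates, the burst size and the one-day gap threshold are all assumptions made for illustration:

```r
# Hypothetical timestamps: three weekly bursts of four downloads, one minute apart
t0 <- as.POSIXct("2015-09-07 03:00:00", tz = "UTC")
offsets <- rep((0:2) * 7 * 86400, each = 4) + rep((0:3) * 60, times = 3)
dates <- t0 + offsets

# Gap between consecutive downloads, in days
gap_days <- as.numeric(diff(dates), units = "days")

# A new burst starts wherever the gap exceeds one day (assumed threshold)
burst_id <- cumsum(c(TRUE, gap_days > 1))
burst_start <- dates[!duplicated(burst_id)]

print(table(burst_id))    # downloads per burst
print(diff(burst_start))  # spacing between burst starts
```

On the real data, a spacing between burst starts that stays close to seven days would support the scripted, weekly interpretation.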
Let’s check what time of day these downloads start. I’m not actually sure of the timezone being used (server, downloader, UTC) so I won’t try to set it.
library(lubridate)
ashburn = mutate(ashburn, h=hour(date)+minute(date)/60)
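Since the timezone is unknown, it is worth seeing how much the choice matters. This base R sketch computes the same fractional hour for one made-up instant under two assumed timezones; `frac_hour` is a helper name I introduce here, and US Eastern time is only a guess based on Ashburn's location:

```r
# One hypothetical download time, stored as an instant in UTC
x <- as.POSIXct("2015-09-07 03:30:00", tz = "UTC")

# Fractional hour of day for an instant, as seen in a given timezone
frac_hour <- function(t, tz) {
  lt <- as.POSIXlt(t, tz = tz)
  lt$hour + lt$min / 60
}

h_utc <- frac_hour(x, "UTC")
h_eastern <- frac_hour(x, "America/New_York")  # assumed local zone

print(c(utc = h_utc, eastern = h_eastern))
```

A different timezone shifts the start time by a fixed offset (apart from daylight saving changes), so the conclusion that the bursts start at a consistent time of day does not depend on getting the zone right.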
Now let’s filter out their data to do some more analysis.
top1d=ashburn %>%
filter(IP_code==top1) %>%
mutate(index=seq_along(date))
Now let’s plot the time of day:
qplot(h, index, data=top1d)
So the downloads are all starting at the same time of day. Definitely looks like a script.
So what is this person interested in? There are a lot of neuroscientists at Janelia. Is this person interested in neuroscience? I picked three well-known neuroscience journals (Nature Neuroscience, J. Neurosci, Neuron):
neuro_doistems <- c("10.1038/nn.", "10.1523/JNEUROSCI.", "10.1016/j.neuron.")
neuro_dois <- unlist(sapply(neuro_doistems, grep, ashburn$doi, value=T, fixed=T))
sum(top1d$doi %in% neuro_dois)
## [1] 0
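As a cross-check on the matching logic, here is a small variant that anchors each stem at the start of the DOI with `startsWith` instead of searching anywhere in the string. The first two DOIs are invented placeholders; the last two are taken from the samples in this document:

```r
neuro_doistems <- c("10.1038/nn.", "10.1523/JNEUROSCI.", "10.1016/j.neuron.")

dois <- c("10.1038/nn.0000",                         # hypothetical
          "10.1016/j.neuron.0000.00.000",            # hypothetical
          "10.1016/j.neuropsychologia.2008.10.020",  # not one of the three journals
          "10.1016/j.vaccine.2004.11.057")

# TRUE where a DOI begins with any of the stems
is_neuro <- Reduce(`|`, lapply(neuro_doistems, function(s) startsWith(dois, s)))

print(data.frame(doi = dois, is_neuro = is_neuro))
```

The trailing dot in each stem matters: it stops `10.1016/j.neuron.` from matching other Elsevier journals whose abbreviation merely begins with "neuro".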
Nope, no downloads from them. So what are they downloading? Let’s look at the DOIs in more detail. First, are there any repeats?
downloads_doi=sort(table(top1d$doi))
table(downloads_doi)
## downloads_doi
## 1 2 3 4 5 6 7 8 9
## 105 430 244 31 39 2872 4788 2049 12
median(downloads_doi)
## [1] 7
mean(downloads_doi)
## [1] 6.552886
Well, that’s a bit odd! Only 10570 of the 69264 downloads are for distinct DOIs! Nearly all of the downloads are repeats (median 7, mean 6.55 per DOI). The number of repeats is close to the number of download bursts (9 weeks; in the first week of October there were hardly any downloads).
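Those totals can be recovered directly from the frequency table printed above, which is a handy sanity check. The numbers below are copied from that output; only the variable names are mine:

```r
# Frequency table from the output above:
# how many DOIs were downloaded 1, 2, ..., 9 times
times_downloaded <- 1:9
n_dois <- c(105, 430, 244, 31, 39, 2872, 4788, 2049, 12)

distinct_dois   <- sum(n_dois)                     # number of distinct DOIs
total_downloads <- sum(n_dois * times_downloaded)  # all downloads, repeats included
mean_repeats    <- total_downloads / distinct_dois

# Share of DOIs fetched six or more times
share_6plus <- sum(n_dois[times_downloaded >= 6]) / distinct_dois

print(c(distinct = distinct_dois, total = total_downloads))
print(round(c(mean = mean_repeats, share_6plus = share_6plus), 3))
```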
So what are some of these downloads? I picked downloads from Elsevier journals (the largest group represented). Elsevier now has a standard form for its DOIs that includes an abbreviation of the journal name.
library(knitr)
top1d %>%
filter(prefix=='10.1016') %>%
distinct(doi) %>%
sample_n(100) %>%
select(doi) %>%
kable
10.1016/j.apmr.2012.06.002
10.1016/j.amepre.2015.04.026
10.1016/j.jada.2005.05.171
10.1016/j.anai.2015.08.005
10.1016/j.vaccine.2004.11.057
10.1016/j.jpainsymman.2013.04.010
10.1016/S0025-6196(11)61093-8
10.1016/j.socscimed.2012.03.016
10.1016/S0890-4065(98)90023-8
10.1016/j.orcp.2014.10.007
10.1016/j.janxdis.2010.01.010
10.1016/j.healthpol.2012.09.016
10.1016/S0140-6736(11)60783-6
10.1016/j.jtcvs.2010.08.059
10.1016/S0738-3991(03)00109-5
10.1016/j.resuscitation.2011.02.023
10.1016/j.envint.2012.11.005
10.1016/j.pmn.2012.06.004
10.1016/j.amjmed.2005.02.013
10.1016/j.arth.2013.03.004
10.1016/j.earlhumdev.2007.01.001
10.1016/j.jadohealth.2005.05.016
10.1016/j.amepre.2009.10.034
10.1016/j.jsat.2004.09.002
10.1016/j.pec.2010.05.006
10.1016/j.wombi.2006.08.001
10.1016/j.ajic.2014.03.121
10.1016/j.acra.2006.04.009
10.1016/j.surg.2009.11.002
10.1016/j.pec.2009.09.035
10.1016/j.amjcard.2009.03.013
10.1016/j.jadohealth.2004.09.022
10.1016/j.vaccine.2007.08.008
10.1016/j.atherosclerosis.2005.06.018
10.1016/j.jadohealth.2008.05.011
10.1016/j.cnur.2005.07.006
10.1016/j.acra.2008.01.018
10.1016/j.jpainsymman.2010.04.006
10.1016/j.healthpol.2013.12.007
10.1016/j.earlhumdev.2010.08.005
10.1016/j.ijmedinf.2011.02.010
10.1016/j.acalib.2010.01.006
10.1016/j.ejcnurse.2003.12.002
10.1016/j.jclinepi.2009.01.010
10.1016/S0022-3182(98)70320-6
10.1016/j.ijmedinf.2009.12.003
10.1016/j.jvs.2008.07.014
10.1016/j.jamcollsurg.2009.07.013
10.1016/j.vaccine.2010.04.052
10.1016/j.vaccine.2003.11.053
10.1016/j.chiabu.2010.11.003
10.1016/j.pec.2003.12.010
10.1016/0190-7409(88)90033-3
10.1016/j.ejca.2008.11.040
10.1016/j.maturitas.2009.09.004
10.1016/j.ejcnurse.2008.03.001
10.1016/j.annemergmed.2009.07.024
10.1016/j.ijnurstu.2013.05.015
10.1016/j.schres.2010.12.018
10.1016/j.midw.2004.02.001
10.1016/j.annepidem.2013.07.001
10.1016/j.jaci.2003.12.584
10.1016/j.ijnurstu.2009.12.011
10.1016/j.jpainsymman.2010.02.026
10.1016/S0140-6736(12)60240-2
10.1016/j.jclinepi.2006.02.009
10.1016/j.resuscitation.2014.07.005
10.1016/j.aprim.2008.05.009
10.1016/j.jaad.2009.08.047
10.1016/j.jpainsymman.2005.02.008
10.1016/j.pec.2010.12.001
10.1016/j.jclinepi.2011.08.007
10.1016/j.healthplace.2010.08.011
10.1016/j.healthpol.2010.01.008
10.1016/j.ijmedinf.2012.06.003
10.1016/j.amjsurg.2009.05.016
10.1016/j.jpsychores.2009.10.012
10.1016/j.jsr.2007.09.002
10.1016/j.jsams.2008.10.003
10.1016/j.pec.2011.09.007
10.1016/j.jtcvs.2009.08.045
10.1016/j.ajog.2011.05.014
10.1016/j.envres.2009.09.009
10.1016/j.ygyno.2008.12.010
10.1016/j.zefq.2014.10.013
10.1016/j.amepre.2013.12.003
10.1016/j.jamda.2014.06.012
10.1016/j.tmaid.2011.12.003
10.1016/j.semarthrit.2014.08.002
10.1016/j.amjcard.2008.10.029
10.1016/j.apmr.2013.05.015
10.1016/j.vaccine.2010.04.010
10.1016/j.ecns.2009.05.007
10.1016/j.jen.2009.11.003
10.1016/j.ijcard.2010.04.035
10.1016/j.thromres.2012.02.048
10.1016/j.jhin.2007.05.009
10.1016/j.resuscitation.2010.11.022
10.1016/S0165-0327(12)70012-5
10.1016/j.jval.2014.08.1596
Looking at those, this person is downloading exclusively from medical journals. I didn’t look in more detail, but this seems most unlikely to be someone at Janelia.
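Rather than eyeballing the list, the journal abbreviations could be tallied. This is a sketch on six DOIs from the sample above, assuming the `10.1016/j.<abbreviation>.` form; older-style Elsevier DOIs (those starting with `S` or a digit after the slash) do not carry an abbreviation and come out as `NA`:

```r
dois <- c("10.1016/j.vaccine.2004.11.057", "10.1016/j.vaccine.2007.08.008",
          "10.1016/j.pec.2010.05.006", "10.1016/j.pec.2009.09.035",
          "10.1016/j.pec.2003.12.010", "10.1016/S0140-6736(11)60783-6")

# Capture the text between "10.1016/j." and the next dot
m <- regmatches(dois, regexec("^10\\.1016/j\\.([^.]+)\\.", dois))
abbr <- vapply(m, function(x) if (length(x) == 2) x[2] else NA_character_,
               character(1))

# Tally abbreviations, most frequent first (NA entries are dropped by table)
counts <- sort(table(abbr), decreasing = TRUE)
print(counts)
```

Run over all distinct DOIs for an IP_code, a table like this would give a quick profile of which journals dominate.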
In some exploratory analysis, I noticed that the second and third most active IP_codes seem linked:
ashburn %>%
filter(IP_code %in% sorted_ips[2:3]) %>%
arrange(date) %>%
mutate(index=1:length(date)) %>%
with(expr = qplot(date, index, col=factor(IP_code)))
This pattern of downloads suggested that IP_code 56ed2b50aaad6 migrated to 56ed2ca722e35 in late January 2016.
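One simple way to test a migration like this is to compare the active date ranges of the two codes: if one source simply changed identifier, the ranges should not overlap. A toy sketch follows; the labels `old` and `new` and all dates are made up:

```r
# Hypothetical download log for two IP codes
d <- data.frame(
  ip = c("old", "old", "old", "new", "new"),
  date = as.POSIXct(c("2016-01-10", "2016-01-18", "2016-01-25",
                      "2016-01-27", "2016-02-05"), tz = "UTC")
)

# Last time the first code was seen, first time the second code was seen
last_old  <- max(d$date[d$ip == "old"])
first_new <- min(d$date[d$ip == "new"])

# TRUE if the first code stops before the second one starts
no_overlap <- last_old < first_new
print(list(last_old = last_old, first_new = first_new, no_overlap = no_overlap))
```

Non-overlapping ranges are consistent with a handover but do not prove it; similar download content on both sides of the switch would strengthen the case.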
top23d=ashburn %>%
filter(IP_code %in% sorted_ips[2:3])
with(top23d, plot(table(prefix)))
pref23=table(top23d$prefix)
Let’s try the same trick looking at a sample of journal titles here:
top23d=ashburn %>%
filter(IP_code %in% sorted_ips[2:3])
top23d %>%
filter(prefix=='10.1016') %>%
distinct(doi) %>%
sample_n(100) %>%
select(doi) %>%
kable
10.1016/j.ultrasmedbio.2008.09.028
10.1016/j.rser.2015.10.088
10.1016/j.mrrev.2008.02.004
10.1016/j.cps.2012.07.018
10.1016/0379-6779(93)91140-W
10.1016/j.ijhydene.2013.06.065
10.1016/j.compscitech.2005.11.027
10.1016/S0378-1127(00)00710-6
10.1016/j.biomaterials.2004.10.012
10.1016/j.envsoft.2015.12.003
10.1016/S0360-3199(97)00109-2
10.1016/j.theriogenology.2013.02.017
10.1016/S0167-2991(09)60222-6
10.1016/0963-8695(96)00013-8
10.1016/j.foodchem.2005.04.031
10.1016/S0263-7863(01)00069-2
10.1016/0006-8993(82)90818-6
10.1016/j.chemphys.2010.02.021
10.1016/j.energy.2015.05.038
10.1016/0014-4827(71)90315-6
10.1016/j.fertnstert.2014.01.055
10.1016/j.aim.2004.06.002
10.1016/j.cofs.2015.05.011
10.1016/j.coi.2013.09.015
10.1016/j.applthermaleng.2007.06.025
10.1016/j.phytochem.2013.08.005
10.1016/j.aquatox.2014.01.005
10.1016/S0093-691X(01)00524-6
10.1016/j.ijpe.2013.10.009
10.1016/j.jbankfin.2009.02.011
10.1016/j.soilbio.2013.03.017
10.1016/j.apsusc.2010.09.094
10.1016/j.matdes.2009.11.038
10.1016/0742-8413(92)90203-J
10.1016/bs.mcb.2015.05.008
10.1016/j.expthermflusci.2009.07.009
10.1016/j.jcorpfin.2004.02.005
10.1016/j.cca.2010.07.008
10.1016/j.geothermics.2012.04.003
10.1016/j.jdeveco.2012.09.004
10.1016/S0378-5173(99)00173-8
10.1016/j.jelechem.2013.09.016
10.1016/j.jmatprotec.2006.11.173
10.1016/B978-0-12-386944-9.50001-7
10.1016/0092-8674(91)90360-B
10.1016/j.arcmed.2013.10.004
10.1016/j.future.2013.01.010
10.1016/0742-8413(94)00048-F
10.1016/j.jmacro.2004.02.008
10.1016/S0168-1605(99)00072-0
10.1016/0042-6822(73)90065-2
10.1016/j.jtherbio.2005.11.028
10.1016/j.econedurev.2004.03.004
10.1016/j.jenvman.2014.12.051
10.1016/S0273-1223(99)00428-X
10.1016/j.energy.2012.10.039
10.1016/S0196-8904(01)00082-6
10.1016/j.compbiomed.2015.03.023
10.1016/j.ijmachtools.2009.07.007
10.1016/j.tics.2005.03.005
10.1016/j.neuropsychologia.2008.10.020
10.1016/j.jceh.2015.08.001
10.1016/j.jamda.2012.09.007
10.1016/j.tree.2013.05.012
10.1016/j.agwat.2015.08.003
10.1016/j.fct.2009.06.024
10.1016/j.tibs.2014.10.005
10.1016/j.jpowsour.2015.06.062
10.1016/j.jvir.2014.10.044
10.1016/j.bmcl.2009.02.025
10.1016/j.intell.2010.07.003
10.1016/j.jmb.2004.10.035
10.1016/S0169-409X(02)00044-3
10.1016/j.plefa.2007.10.015
10.1016/S0008-6223(97)00121-8
10.1016/S0022-2836(63)80023-6
10.1016/j.paid.2012.12.021
10.1016/j.rser.2014.10.105
10.1016/j.intacc.2015.10.007
10.1016/s0140-6736(16)00273-7
10.1016/j.foodres.2015.06.014
10.1016/j.jbiotec.2010.08.454
10.1016/j.surfcoat.2014.08.021
10.1016/0042-6822(75)90077-x
10.1016/j.engstruct.2011.09.025
10.1016/j.precamres.2012.03.001
10.1016/0300-483X(81)90136-0
10.1016/0003-2697(77)90052-5
10.1016/j.corsci.2008.10.021
10.1016/0378-5963(85)90056-X
10.1016/j.jtherbio.2003.10.004
10.1016/j.biomaterials.2014.06.036
10.1016/0375-6505(77)90007-4
10.1016/S0376-7388(00)00514-7
10.1016/j.soildyn.2016.01.015
10.1016/j.bcp.2015.10.001
10.1016/j.yofte.2006.09.001
10.1016/j.memsci.2007.03.038
10.1016/S1381-5148(98)00029-7
10.1016/j.jappgeo.2009.02.001
These look more varied to me, but again the mix of journals doesn’t look like it would be of major interest at Janelia. The mix looks a bit eclectic to me for a single individual, so I suspect that this could be a gateway for an institution with many scihub users. Note that this is quite a bit of use, roughly 116 downloads per day! Together they account for 21.6 percent of the total downloads.
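As a back-of-the-envelope check, the headline figures hang together. The sketch below uses the 96,857 Ashburn downloads quoted from Bohannon, the 69264 downloads from the top IP_code, and the unrounded values from the knitted output (21.5565215 percent and 115.9944444 per day). It assumes the grepped table has exactly 96,857 rows, which I have not verified separately:

```r
total <- 96857   # Ashburn downloads reported in the article
top1  <- 69264   # downloads from the top IP_code

# Share of the top IP_code, in percent
pct_top1 <- 100 * top1 / total

# Downloads from the two linked IP_codes, from their unrounded share
pct_top23 <- 21.5565215
n_top23 <- round(total * pct_top23 / 100)

# Number of days implied by the unrounded per-day rate
days_implied <- n_top23 / 115.9944444

print(round(c(pct_top1 = pct_top1, n_top23 = n_top23,
              days_implied = days_implied), 2))
```

The implied period is about six months, which matches the span of the dataset described at the start.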
The large number of downloads in Ashburn is mostly (about 72%) due to a single IP_code, probably one individual running a script, repeatedly downloading the same set of medical articles. Most of the remainder (about 22% of the total) originates from a pair of related IP_codes, which may represent multiple individuals at an institution with a wide range of research interests. The mix of downloads from both sources seems inconsistent with research interests at Janelia. Oh well!
Bohannon J (2016) Who’s downloading pirated papers? Everyone. Science 352(6285): 508-512. http://dx.doi.org/10.1126/science.352.6285.508
Elbakyan A, Bohannon J (2016) Data from: Who’s downloading pirated papers? Everyone. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.q447c