There are two different datasets to use.
Abstracts from 10 journals covering the past 5 years were collected. In addition to the abstract text, author names and keywords or MeSH terms were collected in separate data frames. The journals were selected to support text mining in the bioinformatics field: 7 of the 10 focus on bioinformatics, while the remaining three (Nature, Science and PeerJ) have a broader scope and can be used as background or reference if needed.
| Journal | h-index | PubMed query | Abstract count | Author count | Keyword count | MeSH term count |
|---|---|---|---|---|---|---|
| Bioinformatics | 300 | link | 3976 | 18941 | 26 | 23812 |
| BMC Bioinformatics | 163 | link | 2699 | 12749 | 4237 | 22220 |
| BMC Syst Biol | 61 | link | 759 | 3780 | 1000 | 5890 |
| Brief. Bioinformatics | 79 | link | 605 | 3062 | 2797 | 2168 |
| Genome Biol. | 188 | link | 1147 | 11243 | 1570 | 11319 |
| Genome Res. | 248 | link | 892 | 9090 | 0 | 8797 |
| Nature | 1011 | link | 4311 | 58890 | 0 | 46914 |
| PeerJ | 27 | link | 4043 | 18784 | 24923 | 0 |
| PLoS Comput. Biol. | 125 | link | 2774 | 12826 | 0 | 26681 |
| Science | 978 | link | 4253 | 45522 | 0 | 30951 |
The abstract counts might not match the PubMed query results, since articles without abstract text were removed.
The RData file can be downloaded from this link (you don’t need a Dropbox account; just click "No thanks, continue to view" if you’re asked to sign in or sign up). After downloading the file, load it with the following command:
load("pubmed_project.rda")
After this command, you’ll have four tables in your environment.
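To check what a `load()` call actually created, the file can be loaded into a separate environment first. The sketch below is self-contained: it writes two small, made-up tables to a temporary `.rda` file (the table contents, `rda_path` and `project_env` are illustrative assumptions, not part of the course data) and then inspects what comes back. Note that `load()` expects the file name as a character string.

```r
# Toy stand-ins for the real tables (hypothetical contents).
abstracts <- data.frame(pmid = c(1L, 2L),
                        journal = c("Bioinformatics", "PeerJ"),
                        stringsAsFactors = FALSE)
authors <- data.frame(pmid = c(1L, 1L, 2L),
                      lastname = c("Doe", "Roe", "Poe"),
                      stringsAsFactors = FALSE)

# Save them to a temporary .rda file so the example does not need the real data.
rda_path <- tempfile(fileext = ".rda")
save(abstracts, authors, file = rda_path)

# Load into a fresh environment instead of the global one.
# load() invisibly returns the names of the objects it created.
project_env <- new.env()
loaded_names <- load(rda_path, envir = project_env)
print(loaded_names)

# Report the row count and column names of each loaded table.
for (name in loaded_names) {
  tbl <- get(name, envir = project_env)
  cat(name, ":", nrow(tbl), "rows; columns:",
      paste(names(tbl), collapse = ", "), "\n")
}
```

With the real file, the same loop should list the four tables described in this section.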
Here are the contents of each table:
      keywords                 abstracts               authors
    +-----------+            +-----------+         +--------------+
    | pmid      | <----+-----+ pmid      +-------> | pmid         |
    +-----------+      |     +-----------+         +--------------+
    | keyword   |      |     | journal   |         | firstname    |
    +-----------+      |     +-----------+         +--------------+
                       |     | title     |         | lastname     |
                       |     +-----------+         +--------------+
        mesh           |     | type      |         | affiliation  |
    +-----------+      |     +-----------+         +--------------+
    | pmid      | <----+     | abstract  |
    +-----------+            +-----------+
    | mesh_term |            | year      |
    +-----------+            +-----------+
                             | month     |
                             +-----------+
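Since `pmid` is the key shared by all four tables, they can be combined with ordinary joins. The following is a minimal base-R sketch on made-up rows (the values are invented for illustration; only the column names follow the diagram above), showing an inner join, a per-abstract author count, and a left join that keeps abstracts without keywords.

```r
# Small made-up tables that follow the column layout in the diagram.
abstracts <- data.frame(
  pmid    = c(101L, 102L, 103L),
  journal = c("Bioinformatics", "Genome Biol.", "PeerJ"),
  year    = c(2016L, 2017L, 2017L),
  stringsAsFactors = FALSE
)
authors <- data.frame(
  pmid     = c(101L, 101L, 102L, 103L, 103L, 103L),
  lastname = c("Doe", "Roe", "Poe", "Doe", "Lee", "Kim"),
  stringsAsFactors = FALSE
)
keywords <- data.frame(
  pmid    = c(101L, 103L, 103L),
  keyword = c("alignment", "RNA-seq", "clustering"),
  stringsAsFactors = FALSE
)

# Inner join: one row per (abstract, author) pair.
abstract_authors <- merge(abstracts, authors, by = "pmid")

# Number of authors per abstract.
author_counts <- aggregate(lastname ~ pmid, data = authors, FUN = length)
names(author_counts)[2] <- "author_count"

# Left join: abstracts without keywords are kept, with keyword set to NA.
abstract_keywords <- merge(abstracts, keywords, by = "pmid", all.x = TRUE)

print(author_counts)
print(abstract_keywords)
```

The left join matters for this data because, according to the table of counts, several journals have no keywords at all; an inner join would silently drop their abstracts.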
Some notes about fields:
- The `journal` column uses abbreviated journal names (e.g. `Genome Res.` instead of Genome Research).
- The `month` column is inconsistent: a value can be `NA`, `09` or `Sep`. So, use the `month` column with caution.
- In the `authors` table, the `affiliation` column refers to the university or institution the author belongs to. However, there are problematic entries where a multi-author publication (with the same `pmid`) lists authors separately but keeps the affiliation information of all authors in one cell next to the first author. Also, be aware that there might be many authors with the same `firstname` and `lastname`!
- […] `/` character.

The second data set comprises 2.1 million abstracts published in or after 2016. Due to the size of the data, only a single table is generated; its fields are as follows:
     abstracts_two
    +--------------+
    | pmid         |
    +--------------+
    | author_count |
    +--------------+
    | journal      |
    +--------------+
    | title        |
    +--------------+
    | type         |
    +--------------+
    | abstract     |
    +--------------+
    | year         |
    +--------------+
    | month        |
    +--------------+
The file is too large to share, so please visit the instructor with a USB drive to get the dataset.
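Both data sets carry a `month` column, and for the first one the field notes warn that it mixes values such as `NA`, `09` and `Sep`. One possible way to make such a column usable is to map everything to an integer from 1 to 12. The helper below, `normalize_month()`, is a hypothetical name introduced for this sketch; it assumes months appear either as one or two digits or as English abbreviations, and treats anything else as missing.

```r
# Hypothetical helper: map mixed month encodings to integers 1..12.
normalize_month <- function(x) {
  x <- trimws(as.character(x))
  out <- rep(NA_integer_, length(x))

  # Purely numeric values such as "09" or "9".
  is_num <- !is.na(x) & grepl("^[0-9]{1,2}$", x)
  out[is_num] <- as.integer(x[is_num])

  # English abbreviations such as "Sep", matched on the first three letters.
  is_abb <- !is.na(x) & !is_num
  out[is_abb] <- match(tolower(substr(x[is_abb], 1, 3)), tolower(month.abb))

  # Anything outside 1..12 is treated as missing.
  out[!is.na(out) & (out < 1L | out > 12L)] <- NA_integer_
  out
}

# Made-up sample values covering the cases named in the notes.
months_raw <- c(NA, "09", "Sep", "Jan", "3", "unknown")
months_clean <- normalize_month(months_raw)
print(months_clean)
```

Whether unparseable values should become `NA` or be investigated one by one depends on the analysis; the sketch simply chooses `NA`.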
The story you would like to tell is up to you. You can try to
Your analysis should be written in R Markdown. Both the Rmd and HTML outputs are needed for evaluation. A dashboard-style output will get a higher score. Please visit the flexdashboard samples page to get an idea.
If you need help preparing or shaping data that you found, I can help integrate it with the abstract data.
Please consider packages like openNLP or word2vec for in-depth text mining of the data.
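Before reaching for those packages, a plain term-frequency count is often a useful first look at the abstracts. The sketch below uses base R only and does not use openNLP or word2vec; the sample sentences and the tiny `stop_words` list are made up for illustration, and a real analysis would need a proper stop word list and tokenizer.

```r
# Made-up abstract snippets standing in for the abstract column.
abstract_text <- c(
  "Gene expression analysis with RNA-seq data.",
  "A new algorithm for sequence alignment of RNA-seq reads.",
  "Clustering gene expression profiles."
)

# Lowercase, then split on anything that is not a letter, digit or hyphen.
tokens <- unlist(strsplit(tolower(abstract_text), "[^a-z0-9-]+"))
tokens <- tokens[nchar(tokens) > 0]

# Drop a tiny hand-picked stop word list (an assumption; real lists are longer).
stop_words <- c("a", "of", "for", "with", "new")
tokens <- tokens[!tokens %in% stop_words]

# Count each term and show the most frequent ones.
term_freq <- sort(table(tokens), decreasing = TRUE)
print(head(term_freq, 5))
```

Keeping the hyphen inside tokens is a deliberate choice here so that terms like "rna-seq" stay in one piece.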
Please visit the Data Analysis and Visualization Exercises (dav-exercises) repo for instructions.
Grading: 5 points are needed, and this is how they can be collected:
Under the Issues tab you can find sample questions submitted by the instructor.
January 10th, 2018, Wed 23:59
Last Updated 2017-12-12