Pubmed Abstract Text Mining Project

There are two different datasets to use.

Dataset 1 - Bioinformatics Journals - Last 5 years

Abstracts from 10 journals published over the past 5 years were collected. In addition to the abstract text, author names, keywords, and MeSH terms were collected in separate data frames. The journals were selected so that text mining in the bioinformatics field can be performed. Seven of the 10 journals focus on bioinformatics. The remaining three (Nature, Science, and PeerJ) cover a broader spectrum and can be used as background or reference if needed.

Journal                h-index  PubMed query  Abstract count  Author count  Keyword count  MeSH term count
---------------------  -------  ------------  --------------  ------------  -------------  ---------------
Bioinformatics             300  link                    3976         18941             26            23812
BMC Bioinformatics         163  link                    2699         12749           4237            22220
BMC Syst Biol               61  link                     759          3780           1000             5890
Brief. Bioinformatics       79  link                     605          3062           2797             2168
Genome Biol.               188  link                    1147         11243           1570            11319
Genome Res.                248  link                     892          9090              0             8797
Nature                    1011  link                    4311         58890              0            46914
PeerJ                       27  link                    4043         18784          24923                0
PLoS Comput. Biol.         125  link                    2774         12826              0            26681
Science                    978  link                    4253         45522              0            30951

The abstract counts might not match the PubMed query results, because articles without abstract text were excluded.

The RData file can be downloaded from this link (you don’t need a Dropbox account; just click “No thanks, continue to view” if you’re asked to sign in or sign up). After downloading the file, load it with the following command:

load("pubmed_project.rda")

After this command, you’ll have four tables in your environment.
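
To check what a load actually restored, note that load() returns the names of the objects it creates. The sketch below is illustrative only: `summarise_rda` is a hypothetical helper, and the small stand-in file exists just so the example runs without the real download. It loads into a separate environment so nothing in your workspace is overwritten.

```r
# Load an .rda file into its own environment and report the row count
# of every table it contains. `summarise_rda` is a hypothetical helper.
summarise_rda <- function(path) {
  env <- new.env()
  loaded <- load(path, envir = env)   # load() returns the restored object names
  sapply(loaded, function(name) nrow(get(name, envir = env)))
}

# Stand-in file with made-up rows so the sketch is runnable;
# replace `demo_path` with "pubmed_project.rda" for the project data.
keywords <- data.frame(pmid = c(1L, 1L, 2L),
                       keyword = c("rna-seq", "alignment", "gwas"))
abstracts <- data.frame(pmid = c(1L, 2L),
                        journal = c("Bioinformatics", "Genome Res."))
demo_path <- tempfile(fileext = ".rda")
save(keywords, abstracts, file = demo_path)

row_counts <- summarise_rda(demo_path)
print(row_counts)
```

With the real file, the result should list the four tables named in this document.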

Here are the contents of each table:

  keywords             abstracts            authors
+----------+         +-----------+      +--------------+
| pmid     | <----+--+ pmid      +----> | pmid         |
+----------+      |  +-----------+      +--------------+
| keyword  |      |  | journal   |      | firstname    |
+----------+      |  +-----------+      +--------------+
                  |  | title     |      | lastname     |
                  |  +-----------+      +--------------+
     mesh         |  | type      |      | affiliation  |
+-----------+     |  +-----------+      +--------------+
| pmid      | <---+  | abstract  |
+-----------+        +-----------+
| mesh_term |        | year      |
+-----------+        +-----------+
                     | month     |
                     +-----------+
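
Since pmid links every table to abstracts, the tables can be combined with a join on that column. A minimal sketch with base R's merge(), using made-up rows; the column names follow the diagram above, but the column types (integer pmid here) are an assumption about the real data.

```r
# Toy stand-ins for two of the four tables (values are made up).
abstracts <- data.frame(
  pmid    = c(101L, 102L, 103L),
  journal = c("Bioinformatics", "Genome Res.", "PeerJ"),
  stringsAsFactors = FALSE
)
keywords <- data.frame(
  pmid    = c(101L, 101L, 103L),
  keyword = c("rna-seq", "alignment", "ecology"),
  stringsAsFactors = FALSE
)

# Inner join on pmid: one row per (abstract, keyword) pair.
# Abstracts without keywords (pmid 102 here) drop out;
# pass all.x = TRUE to keep them with NA keywords.
abstract_keywords <- merge(abstracts, keywords, by = "pmid")

# Number of keyword rows per journal
keywords_per_journal <- table(abstract_keywords$journal)
print(keywords_per_journal)
```

The same pattern applies to the authors and mesh tables.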

Some notes about fields:

  • journal is the abbreviated name of the journal (e.g., Genome Res. instead of Genome Research)
  • type is a combination of values such as “journal article”, “review”, and “case study”. This field is not tidy (multiple values in one cell) because it was used for filtering. Only abstracts whose type contains “journal article” were selected for this dataset
  • year and month refer to when the article was published. The year column contains a 4-digit year, but the month column can contain NA, 09 or Sep, so use the month column with caution
  • in the authors table, the affiliation column refers to the university or institution the author belongs to. However, there are problematic entries where a multi-author publication (same pmid) lists the authors in separate rows but keeps the affiliation information of all authors in one cell next to the first author. Also, be aware that there might be many authors with the same firstname and lastname!
  • keyword is a user-supplied keyword provided by the author during submission of the article. Not all journals use keywords, so not all journals or articles have this field.
  • mesh_term is the MeSH term assigned by NCBI. There are several issues with this column:
    • almost all journals have this field (except PeerJ)
    • some terms carry a subheading (hierarchy) within a header. In such cases the parent term is pasted next to the MeSH term with a / character.
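
The notes above suggest a few cleaning steps that can be sketched in base R. The helpers `normalise_month` and `split_mesh` are hypothetical names, and the sample values are made up. The sketch does not assume which side of the / is the parent term, and it does not assume a particular separator inside the type field.

```r
# Convert a month column holding NA, "09" or "Sep" style values to integers.
normalise_month <- function(month) {
  month <- as.character(month)
  out <- suppressWarnings(as.integer(month))   # "09" -> 9, month names -> NA
  is_name <- is.na(out) & !is.na(month)
  # month.abb is the built-in vector "Jan", "Feb", ..., "Dec"
  out[is_name] <- match(tolower(month[is_name]), tolower(month.abb))
  out
}

# Split "A/B" MeSH entries at the first "/" into two columns.
# Which part is the parent term is not assumed; inspect the data first.
split_mesh <- function(mesh_term) {
  has_slash <- grepl("/", mesh_term, fixed = TRUE)
  data.frame(
    before_slash = sub("/.*$", "", mesh_term),
    after_slash  = ifelse(has_slash, sub("^[^/]*/", "", mesh_term), NA_character_),
    stringsAsFactors = FALSE
  )
}

months <- normalise_month(c(NA, "09", "Sep", "12"))
print(months)

mesh_parts <- split_mesh(c("Neoplasms/genetics", "Algorithms"))
print(mesh_parts)

# The type field holds several values per cell, so match by substring.
types <- c("journal article", "journal article review", "case study")
is_review <- grepl("review", types, fixed = TRUE)
print(is_review)
```

Entries that cannot be parsed as a month come back as NA, so it is worth counting the NA values before and after conversion.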

Dataset 2 - All Journals - Last 2 years

This dataset comprises 2.1 million abstracts published in or after 2016. Due to the size of the data, only a single table was generated; its fields are as follows:

 abstracts_two
+--------------+
| pmid         |
+--------------+
| author_count |
+--------------+
| journal      |
+--------------+
| title        |
+--------------+
| type         |
+--------------+
| abstract     |
+--------------+
| year         |
+--------------+
| month        |
+--------------+

The file is too large to share, so please visit the instructor with a USB drive to get the dataset.

Analysis

The story you would like to tell is up to you. You can try to

Your analysis should be written in R Markdown. Both the Rmd file and the HTML output are needed for evaluation. Dashboard-style output will get a higher score. Please visit the flexdashboard samples page to get an idea.

If you need help preparing or shaping data that you found, I can help integrate it with the abstract data.

Please consider packages like openNLP or word2vec for in-depth text mining of the data.
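
Before reaching for those packages, a plain term-frequency count can serve as a baseline. The sketch below uses base R only. It is a simple bag-of-words count, not openNLP or word2vec, and the two sample abstracts and the tiny stop-word list are made up for illustration.

```r
# Made-up abstracts standing in for the abstract column
abstract_text <- c(
  "We present a new algorithm for sequence alignment.",
  "The alignment algorithm is evaluated on simulated sequence data."
)

# Lowercase, turn every run of non-letters into a space, split on spaces.
# Note: this drops digits and non-ASCII letters.
tokenise <- function(text) {
  cleaned <- gsub("[^a-z]+", " ", tolower(text))
  tokens <- unlist(strsplit(cleaned, " ", fixed = TRUE))
  tokens[nzchar(tokens)]
}

stop_words <- c("we", "a", "the", "is", "on", "for", "new")  # illustrative only
tokens <- tokenise(abstract_text)
tokens <- tokens[!tokens %in% stop_words]

# Term frequencies, most frequent first
term_freq <- sort(table(tokens), decreasing = TRUE)
print(head(term_freq, 5))
```

On the real data, the same counts computed per journal or per year give a quick first comparison.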

Question pool

Please visit Data Analysis and Visualization Exercises (dav-exercises) repo for instructions.

Grading: 5 points are needed, and this is how they can be collected:

Under the Issues tab you can find sample questions submitted by the instructor.

Deadline for both project and question pool

Wednesday, January 10th, 2018, 23:59

Last Updated 2017-12-12