PubMedQA - A Dataset for Biomedical Research Question Answering

Summary

In this paper, the authors present PubMedQA, a biomedical research question answering dataset with three subsets:

  1. Expert-annotated - 1k instances (used for cross-validation and testing)
  2. Artificially labeled - 211.3k instances (labeled by a simple deterministic heuristic)
  3. Unlabeled - 61.2k instances

The heuristic the authors use to generate the artificially labeled subset is (see the sketch after this list):

  1. Identify statement titles with the POS structure NP-(VBP/VBZ) [1]
  2. Use that verb to turn the statement into a question, by moving or adding the corresponding copula ("is", "are") or auxiliary verb ("does", "do")
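A minimal sketch of this conversion is below, using NLTK's `pos_tag` as a stand-in for the Stanford CoreNLP parser [1] used in the paper; the function name and example title are hypothetical.

```python
# Sketch of the statement-to-question conversion, assuming NLTK's pos_tag
# in place of the Stanford CoreNLP parser used in the paper.
# (Requires the punkt, averaged_perceptron_tagger and wordnet NLTK data packages.)
import nltk

def statement_to_question(title: str):
    tokens = nltk.word_tokenize(title.rstrip(". "))
    tags = nltk.pos_tag(tokens)
    for i, (word, tag) in enumerate(tags):
        if tag in ("VBZ", "VBP"):
            subject = tokens[:i]       # the leading noun phrase
            rest = tokens[i + 1:]
            if word.lower() in ("is", "are"):
                # Copula: move it to the front ("X is Y" -> "Is X Y?")
                question = [word.capitalize()] + subject + rest
            else:
                # Other verbs: prepend "does"/"do" and use the verb's base form
                aux = "Does" if tag == "VBZ" else "Do"
                base = nltk.stem.WordNetLemmatizer().lemmatize(word, "v")
                question = [aux] + subject + [base] + rest
            return " ".join(question) + "?"
    return None  # no VBP/VBZ match; the title is not converted

# e.g. statement_to_question("Vitamin D supplementation improves bone density")
#   -> "Does Vitamin D supplementation improve bone density?"
```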

Annotations

Annotation

« PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances. »()

Annotation

« Interestingly, more than half of the question titles of PubMed articles can be briefly answered by yes/no/maybe, »(2)

Annotation

« manually labeled 1k of them for cross-validation and testing »(2)

Annotation

« The rest of yes/no/maybe answerable QA instances compose the unlabeled subset which can be used for semi-supervised learning. »(2)

Annotation

« we automatically convert statement titles of 211.3k PubMed articles to questions and label them with yes/no answers using a simple heuristic. »(2)

Annotation

« PubMed articles which have i) a question mark in the titles and ii) a structured abstract with conclusive part are collected and denoted as pre-PQA-U. Now each instance has 1) a question which is the original title 2) a context which is the structured abstract without the conclusive part and 3) a long answer which is the conclusive part of the abstract »(3)
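A minimal sketch of how such an instance might be assembled is below; the `article` dict, its field names, and the set of conclusive section labels are assumptions for illustration, not details from the paper.

```python
# Sketch of assembling one pre-PQA-U instance from a structured abstract.
# The `article` dict and its field names are hypothetical; the paper only
# specifies the filtering criteria and the question / context / long-answer split.
CONCLUSIVE_LABELS = {"CONCLUSION", "CONCLUSIONS"}

def build_instance(article):
    title = article["title"]
    sections = article["structured_abstract"]  # e.g. {"METHODS": "...", "RESULTS": "...", "CONCLUSIONS": "..."}
    conclusive = {k: v for k, v in sections.items() if k.upper() in CONCLUSIVE_LABELS}
    # Keep only articles with i) a question mark in the title and
    # ii) a structured abstract that has a conclusive part.
    if "?" not in title or not conclusive:
        return None
    context = {k: v for k, v in sections.items() if k.upper() not in CONCLUSIVE_LABELS}
    return {
        "question": title,                              # 1) the original title
        "context": context,                             # 2) abstract without the conclusive part
        "long_answer": " ".join(conclusive.values()),   # 3) the conclusive part
    }
```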

Annotation

« We denote prediction using question and context as a reasoning-required setting, »(4)


  1. Using Stanford CoreNLP parser ↩︎
