RedHOT: A Corpus of Annotated Medical Questions, Experiences, and Claims on Social Media
(EACL 2023)


Reddit Health Online Talk (RedHOT) is a large-scale corpus of over 22k richly annotated social media posts from Reddit spanning 24 health conditions. Annotations mark the spans corresponding to medical claims, personal experiences, and questions. Additionally, claims are annotated for medically relevant Populations, Interventions, and Outcomes (PIO elements).
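To make the annotation layers concrete, here is a minimal sketch of how one annotated post could be represented in memory. The class names, field names, label strings, and the example post are all hypothetical, chosen for illustration only; the files produced by the download script may be organised quite differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    start: int  # character offset into the post text, inclusive
    end: int    # character offset into the post text, exclusive
    label: str  # e.g. "claim", "experience", "question", or a PIO type


@dataclass
class AnnotatedPost:
    # Hypothetical schema, not the released format.
    text: str
    condition: str
    spans: List[Span] = field(default_factory=list)  # claims, experiences, questions
    pio: List[Span] = field(default_factory=list)    # PIO elements marked inside claims

    def extract(self, label: str) -> List[str]:
        """Return the surface text of every span carrying the given label."""
        return [self.text[s.start:s.end] for s in self.spans + self.pio if s.label == label]


def span_of(text: str, phrase: str, label: str) -> Span:
    """Build a Span from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


# Invented example post, not taken from the corpus.
text = "Is this normal? I read that metformin reduces appetite in people with PCOS."
post = AnnotatedPost(
    text=text,
    condition="PCOS",
    spans=[
        span_of(text, "Is this normal?", "question"),
        span_of(text, "metformin reduces appetite in people with PCOS", "claim"),
    ],
    pio=[
        span_of(text, "people with PCOS", "population"),
        span_of(text, "metformin", "intervention"),
        span_of(text, "appetite", "outcome"),
    ],
)

print(post.extract("claim"))
print(post.extract("intervention"))
```

Storing character offsets rather than copied strings keeps each annotation tied to its position in the post, so overlapping layers (a PIO element inside a claim) can coexist without duplication.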

We introduce the task of retrieving trustworthy evidence relevant to a given claim made on social media. We also propose a novel method to automatically derive (noisy) supervision for this task, which we use to train a dense retrieval model. This model outperforms baselines when evaluated by medical doctors (MDs).

The Evidence Retrieval Task

We describe the task of evidence retrieval as follows:

Given a natural language medical claim, identified PIO elements, and a very large corpus of medical abstracts from randomized controlled trials (RCTs), identify the relevant abstracts that aid users in making an informed decision about the underlying claim.
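The input/output shape of this task can be sketched as ranking abstracts by similarity to a query built from the claim and its PIO elements. The sketch below is not the trained model described above: `embed` is a sparse bag-of-words stand-in for a learned dense encoder, used only so the example runs without downloading anything. The function names, the way the query is built, and the three abstracts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(claim: str, pio: Dict[str, str], abstracts: Dict[str, str],
             k: int = 2) -> List[Tuple[str, float]]:
    """Rank abstracts against a query made of the claim plus its PIO elements."""
    # Assumption: the PIO elements are simply appended to the claim text.
    query = embed(" ".join([claim, *pio.values()]))
    scored = [(doc_id, cosine(query, embed(body))) for doc_id, body in abstracts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented abstracts, not drawn from any real RCT collection.
abstracts = {
    "rct-1": "Randomized trial of metformin versus placebo on appetite and body weight in women with PCOS.",
    "rct-2": "Randomized trial of vitamin D supplementation on bone density in older adults.",
    "rct-3": "Effect of metformin on glycemic control in adults with type 2 diabetes.",
}
claim = "metformin reduces appetite in people with PCOS"
pio = {"population": "people with PCOS", "intervention": "metformin", "outcome": "appetite"}

results = retrieve(claim, pio, abstracts, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In a real system, `embed` would be replaced by the trained encoder and the linear scan by an approximate nearest-neighbour index, since the abstract collection is very large.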



Link to paper

  author = {Somin Wadhwa and Vivek Khetan and Silvio Amir and Byron Wallace},
  booktitle = {European Association of Computational Linguistics (EACL)},
  year = {2023},
  title = {RedHOT: A Corpus of Annotated Medical Questions, Experiences, and Claims on Social Media}


The Reddit posts we collected are public and typically made under anonymous pseudonyms, but they are health-related comments and therefore inherently sensitive. To respect this, we (a) notified all users in the dataset of their (potential) inclusion in this corpus and provided an opportunity to opt out, and (b) do not release the data directly, but rather a script to download annotated comments, so that individuals may choose to remove their comments in the future. Furthermore, we consulted with our Institutional Review Board (IRB) and confirmed that the initial collection and annotation of such data does not constitute human subjects research. However, EACL reviewers rightly pointed out that certain uses of this data may be sensitive. Therefore, to access the collected dataset, we require researchers to self-attest that they have obtained prior approval from their own IRB regarding their intended use of the corpus.