
Pubmed-NCT-extractor

A Python script that extracts ClinicalTrials.gov ("NCT") numbers from abstracts in Pubmed XML search results and checks for a corresponding entry on ClinicalTrials.gov

NOTE: The Pubmed web front-end was changed in 2020 and no longer allows search results to be downloaded as XML. This script is therefore of limited use, and the following instructions are of historical value only.

System requirements

This script was written for Python 3 (v. 3.6.9) and tested on elementary OS v. 5.1.

How to use

  • Save extractor.py from this repository to a new empty local folder
  • Navigate to Pubmed in your web browser
  • Conduct a search of your choice

An example Pubmed search for: Carlisle, B

  • On the Pubmed search result page, click "Send to," then "File." Choose XML format and click the "Create file" button

Pubmed "Send to" options with XML export selected

  • Save the resulting Pubmed XML output file as pubmed_result.xml to the local folder where you saved extractor.py
  • In your terminal, navigate to the local folder where you saved extractor.py and the Pubmed XML output file, and enter the following:
$ python3 extractor.py pubmed_result.xml Y > extracted_ncts.tsv
  • If you do not wish to have column headings, enter the following in your terminal:
$ python3 extractor.py pubmed_result.xml N > extracted_ncts.tsv

Interpreting the output

If everything is working properly, the above instructions will create a new tab-separated values file called extracted_ncts.tsv. If you passed Y, the first line will contain the column headings. The script prints one row per Pubmed entry when the abstract contains no NCT numbers or exactly one. If an abstract contains more than one NCT number, the script prints one row per NCT number and records how many were found in that abstract in the "Number of NCTs extracted" column.
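The row rule described above can be sketched as follows. This is an illustration only and is not taken from extractor.py: the function name rows_for_entry, the tuple layout, and the use of an empty string when no NCT is found are all assumptions.

```python
def rows_for_entry(pmid, ncts):
    """Build output rows for one Pubmed entry (hypothetical helper).

    `ncts` is the list of NCT strings found in the entry's abstract.
    Each row is a (pmid, number_of_ncts, nct) tuple; this layout is an
    assumption, not the script's real column order.
    """
    count = len(ncts)
    if count == 0:
        # No NCT found: still emit one row so the entry is not dropped.
        return [(pmid, count, "")]
    # One row per NCT; the count is repeated on every row for that entry.
    return [(pmid, count, nct) for nct in ncts]
```

An entry with no NCT or exactly one NCT yields a single row, while an entry with three NCTs yields three rows that each carry the count 3.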

The "Extracted NCT" column indicates the raw text found in the abstract. The "Compressed NCT" column is the same as the "Extracted NCT" column, but with spaces and hyphens (if any) removed.
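As a rough sketch of how the raw and compressed forms relate, the example below matches "NCT" followed by optional spaces or hyphens and eight digits. The pattern and the helper name extract_ncts are assumptions made for illustration; the expression actually used by extractor.py may differ.

```python
import re

# Assumed pattern: "NCT", any run of spaces or hyphens, then eight digits.
# The real script may use a different expression.
NCT_PATTERN = re.compile(r"NCT[\s-]*\d{8}")


def extract_ncts(abstract):
    """Return (raw, compressed) pairs for each NCT-like match in the text."""
    pairs = []
    for match in NCT_PATTERN.finditer(abstract):
        raw = match.group(0)
        # The compressed form drops spaces and hyphens from the raw text.
        compressed = re.sub(r"[\s-]", "", raw)
        pairs.append((raw, compressed))
    return pairs
```

For example, the raw text "NCT 01234567" would appear unchanged in the "Extracted NCT" column and as "NCT01234567" in the "Compressed NCT" column.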

The "PMID" column is the Pubmed ID for the entry in question.

The script tries to populate the "Date" column first from the ArticleDate field, then from the PubMedPubDate field.
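The fallback order can be sketched with the standard library's xml.etree.ElementTree. The element paths follow the usual layout of a PubmedArticle record; the helper name entry_date, the YYYY-MM-DD output format, and the choice of the first PubMedPubDate element are assumptions rather than the script's confirmed behaviour.

```python
import xml.etree.ElementTree as ET


def entry_date(article):
    """Return a date for one PubmedArticle element, or '' if none is found.

    Tries ArticleDate first, then the first PubMedPubDate in the history.
    """
    candidate_paths = (
        "MedlineCitation/Article/ArticleDate",
        "PubmedData/History/PubMedPubDate",
    )
    for path in candidate_paths:
        node = article.find(path)
        if node is None:
            continue
        year = node.findtext("Year")
        month = node.findtext("Month")
        day = node.findtext("Day")
        if year and month and day:
            # Zero-pad so that dates sort correctly as text (assumed format).
            return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
    return ""
```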

The "Abstract" and "Journal" columns contain the abstract and the journal of the entry in question.

The "Pubmed Metadata Registry" column is populated from the "AccessionNumberList" field.
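In Pubmed XML, AccessionNumberList normally sits inside a DataBank element alongside a DataBankName. The sketch below collects those values for one record; the helper name registry_accessions and the "name:number" output format are assumptions, and the script itself may format this column differently.

```python
import xml.etree.ElementTree as ET


def registry_accessions(article):
    """List 'DataBankName:AccessionNumber' strings for one PubmedArticle."""
    found = []
    for databank in article.iter("DataBank"):
        name = databank.findtext("DataBankName", default="")
        for accession in databank.findall("AccessionNumberList/AccessionNumber"):
            # Guard against empty elements, whose text is None.
            found.append(f"{name}:{accession.text or ''}")
    return found
```

Comparing this metadata against the NCT numbers found in the abstract shows whether a trial registration appears in the abstract text, in the indexed metadata, or in both.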

This file can be opened with LibreOffice Calc, or imported directly into R for analysis.
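The file can also be read with Python's standard csv module. The sketch below counts non-empty "Compressed NCT" values per PMID. It assumes the file was generated with the Y option and that the heading text matches the column names quoted above exactly; the helper name ncts_per_pmid is hypothetical.

```python
import csv
from collections import Counter


def ncts_per_pmid(tsv_file):
    """Count rows with a non-empty 'Compressed NCT' value for each PMID.

    `tsv_file` is an open text file (or any iterable of lines) whose first
    line holds the column headings.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        if row["Compressed NCT"]:
            counts[row["PMID"]] += 1
    return counts


# Example usage, assuming extracted_ncts.tsv exists in the current folder:
# with open("extracted_ncts.tsv", newline="") as handle:
#     print(ncts_per_pmid(handle))
```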

How to cite

BibTeX

@software{carlisle_pubmed_nct_extractor_2020,
location = {{Retrieved from https://codeberg.org/bgcarlisle/Pubmed-NCT-extractor}},
title = {Pubmed NCT Extractor},
url = {https://blog.bgcarlisle.com/2020/02/05/introducing-pubmed-nct-extractor/},
organization = {{The Grey Literature}},
date = {2020-02},
author = {Carlisle, Benjamin Gregory}
}

Vancouver

Carlisle BG. Pubmed NCT Extractor [Internet]. Retrieved from https://blog.bgcarlisle.com/2020/02/05/introducing-pubmed-nct-extractor/: The Grey Literature; 2020. Available from: https://codeberg.org/bgcarlisle/Pubmed-NCT-extractor

Acknowledgements

Many thanks to Alex Bannach-Brown and Peter Grabitz for conversations that motivated this tool.