
Pubmed-NCT-extractor

A Python script that extracts ClinicalTrials.gov (“NCT”) numbers from abstracts in Pubmed XML search results and checks for a corresponding entry on ClinicalTrials.gov

System requirements

This script was written for Python 3.6.9 and tested on elementary OS 5.1.

How to use

  • Save extractor.py from this repository to a new empty local folder
  • Navigate to Pubmed in your web browser
  • Conduct a search of your choice

  • On the Pubmed search result page, click “Send to,” then “File.” Choose XML format and click the “Create file” button

  • Save the resulting Pubmed XML output file as pubmed_result.xml to the local folder where you saved extractor.py
  • In your terminal, navigate to the local folder where you saved extractor.py and the Pubmed XML output file, and enter the following to produce output with column headings:
$ python3 extractor.py pubmed_result.xml Y > extracted_ncts.tsv
  • If you do not wish to have column headings, enter the following in your terminal:
$ python3 extractor.py pubmed_result.xml N > extracted_ncts.tsv

Interpreting the output

If everything is working properly, the above instructions will create a new tab-separated values file called extracted_ncts.tsv. Unless you passed N instead of Y, the first line contains the column headings. The script prints one row for each entry whose abstract contains no NCT number or exactly one NCT number. If an abstract contains more than one NCT number, the script prints one row per NCT number and reports how many were found in that abstract in the “Number of NCTs extracted” column.
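The row-per-entry rule described above can be sketched in a few lines of Python. This is only an illustration, not the code in extractor.py: the function name `rows_for_entry` and the three-field tuple layout are assumptions made for the example.

```python
def rows_for_entry(pmid, extracted_ncts):
    """Return one (pmid, nct, count) tuple per output row for a single entry.

    Hypothetical helper: the real script may structure its rows differently.
    """
    count = len(extracted_ncts)
    if count == 0:
        # No NCT number found: the entry still gets a single row.
        return [(pmid, "", 0)]
    # One row per NCT number, each carrying the total found in the abstract.
    return [(pmid, nct, count) for nct in extracted_ncts]


print(rows_for_entry("111", []))
print(rows_for_entry("222", ["NCT01234567"]))
print(rows_for_entry("333", ["NCT01234567", "NCT 07654321"]))
```

An entry with two NCT numbers therefore produces two rows, both with a count of 2.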

The “Extracted NCT” column contains the NCT number exactly as it appears in the abstract. The “Compressed NCT” column contains the same value with any spaces and hyphens removed.
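As a rough illustration of the difference between the two columns, the sketch below pulls NCT-like strings out of an abstract and then compresses them. The regular expression is an assumption (the letters NCT, optional spaces or hyphens, then eight digits); the pattern actually used by extractor.py may differ, and the names `NCT_PATTERN`, `extract_ncts` and `compress_nct` are hypothetical.

```python
import re

# Assumed pattern: "NCT", optional spaces or hyphens, then eight digits.
NCT_PATTERN = re.compile(r"NCT[\s-]*\d{8}")


def extract_ncts(abstract):
    """Return the raw NCT strings as they appear in the abstract text."""
    return NCT_PATTERN.findall(abstract)


def compress_nct(raw):
    """Remove spaces and hyphens from a raw NCT string."""
    return re.sub(r"[\s-]", "", raw)


abstract = "Registered as NCT 01234567 and NCT-07654321."
for raw in extract_ncts(abstract):
    print(repr(raw), "->", compress_nct(raw))
```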

The “PMID” column is the Pubmed ID for the entry in question.

The script tries to populate the “Date” column first from the ArticleDate field, then from the PubMedPubDate field.
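The fallback order can be sketched with Python's standard xml.etree.ElementTree module. The sample XML is a cut-down, made-up record that only mirrors the element names used in Pubmed exports; the function name `entry_date` and the YYYY-MM-DD output format are assumptions for the example, not a description of what extractor.py prints.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal record for illustration only.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <ArticleDate DateType="Electronic">
        <Year>2020</Year><Month>01</Month><Day>15</Day>
      </ArticleDate>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <History>
      <PubMedPubDate PubStatus="pubmed">
        <Year>2020</Year><Month>1</Month><Day>20</Day>
      </PubMedPubDate>
    </History>
  </PubmedData>
</PubmedArticle>"""


def entry_date(article):
    """Try ArticleDate first, then the first PubMedPubDate; return '' if neither is usable."""
    for tag in ("ArticleDate", "PubMedPubDate"):
        node = article.find(f".//{tag}")
        if node is not None:
            parts = [node.findtext(p) for p in ("Year", "Month", "Day")]
            if all(parts):
                # Zero-pad month and day so the result sorts as text.
                return "-".join(p.zfill(2) for p in parts)
    return ""


print(entry_date(ET.fromstring(SAMPLE)))
```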

The “Abstract” and “Journal” columns contain the abstract and the journal of the entry in question.

The “Pubmed Metadata Registry” column is populated from the “AccessionNumberList” field.
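For readers who want to see where this metadata lives, the sketch below reads accession numbers from an AccessionNumberList element with xml.etree.ElementTree. The sample record is invented and heavily trimmed, and `registry_numbers` is a hypothetical helper, so treat it as one possible way to read the field and not as the script's own code.

```python
import xml.etree.ElementTree as ET

# Invented, trimmed record for illustration only.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <DataBankList>
        <DataBank>
          <DataBankName>ClinicalTrials.gov</DataBankName>
          <AccessionNumberList>
            <AccessionNumber>NCT01234567</AccessionNumber>
          </AccessionNumberList>
        </DataBank>
      </DataBankList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def registry_numbers(article):
    """Return every non-empty AccessionNumber found under an AccessionNumberList."""
    nodes = article.findall(".//AccessionNumberList/AccessionNumber")
    return [n.text.strip() for n in nodes if n.text and n.text.strip()]


print(registry_numbers(ET.fromstring(SAMPLE)))
```

Comparing these values with the “Compressed NCT” column shows whether a number mentioned in the abstract was also recorded in the Pubmed metadata.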

This file can be opened with LibreOffice Calc, or imported directly into R for analysis.
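The file can also be read with Python's standard csv module. The sketch below groups compressed NCT numbers by PMID; it assumes the file was generated with the Y option so that the “PMID” and “Compressed NCT” headings are present, and the function name `ncts_by_pmid` is hypothetical. An in-memory sample stands in for the real file here.

```python
import csv
import io


def ncts_by_pmid(handle):
    """Map each PMID to the list of compressed NCT numbers found for it."""
    result = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        nct = row["Compressed NCT"]
        if nct:
            # Entries with no NCT number are skipped.
            result.setdefault(row["PMID"], []).append(nct)
    return result


# Stand-in for the real file; only two of the columns are shown.
sample = io.StringIO(
    "PMID\tCompressed NCT\n"
    "111\t\n"
    "333\tNCT01234567\n"
    "333\tNCT07654321\n"
)
print(ncts_by_pmid(sample))
```

To use it on real output, pass an open file instead, for example `open("extracted_ncts.tsv", newline="", encoding="utf-8")`.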

How to cite

BibTeX

@software{carlisle_pubmed_nct_extractor_2020,
location = {{Retrieved from https://codeberg.org/bgcarlisle/Pubmed-NCT-extractor}},
title = {Pubmed NCT Extractor},
url = {https://blog.bgcarlisle.com/2020/02/05/introducing-pubmed-nct-extractor/},
organization = {{The Grey Literature}},
date = {2020-02},
author = {Carlisle, Benjamin Gregory}
}

Vancouver

Carlisle BG. Pubmed NCT Extractor [Internet]. Retrieved from https://blog.bgcarlisle.com/2020/02/05/introducing-pubmed-nct-extractor/: The Grey Literature; 2020. Available from: https://codeberg.org/bgcarlisle/Pubmed-NCT-extractor

Acknowledgements

Many thanks to Alex Bannach-Brown and Peter Grabitz for conversations that motivated this tool.