A Python script that extracts ClinicalTrials.gov ("NCT") numbers from abstracts in Pubmed XML search results and checks each one for a corresponding entry on ClinicalTrials.gov.
NOTE: The Pubmed web front-end was changed in 2020 to remove the ability to download search results as XML, so this script is now of limited use and the following instructions are of historical value only.
This script was written for Python 3.6.9 and tested on elementary OS 5.1.
Download extractor.py from this repository to a new empty local folder.
Save the Pubmed XML search results as pubmed_result.xml to the local folder where you saved extractor.py.
Open a terminal in the folder containing extractor.py and the Pubmed XML output file, and enter one of the following, with either Y or N as the final argument:
$ python3 extractor.py pubmed_result.xml Y > extracted_ncts.tsv
$ python3 extractor.py pubmed_result.xml N > extracted_ncts.tsv
If everything is working properly, the above instructions will create a new tab-separated values file called extracted_ncts.tsv. The first line will be the column headings. The script prints out one row per entry if there are no NCT numbers identified, or if there is exactly one NCT number in the abstract. If there is more than one NCT number in the abstract, the script will print out one row per NCT number and indicate how many were found in that abstract in the "Number of NCTs extracted" column.
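The row rule described above can be sketched roughly as follows. This is an illustration only, not the code in extractor.py: the function name rows_for_entry is hypothetical, and only three of the output columns are shown.

```python
def rows_for_entry(pmid, ncts):
    """Build output rows for one Pubmed entry (hypothetical helper).

    ncts is the list of NCT numbers found in that entry's abstract.
    """
    count = len(ncts)
    if count == 0:
        # No NCT number found: still emit a single row for the entry.
        return [{"PMID": pmid, "Extracted NCT": "", "Number of NCTs extracted": 0}]
    # One row per NCT number, each carrying the total found in the abstract.
    return [
        {"PMID": pmid, "Extracted NCT": nct, "Number of NCTs extracted": count}
        for nct in ncts
    ]
```

An entry with zero or one NCT number yields a single row, and an entry with three NCT numbers yields three rows that each report a count of 3.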
The "Extracted NCT" column indicates the raw text found in the abstract. The "Compressed NCT" column is the same as the "Extracted NCT" column, but with spaces and hyphens (if any) removed.
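As a minimal sketch of the difference between the two columns, the following assumes a regular expression that tolerates spaces and hyphens inside an NCT number. The pattern and the function name extract_ncts are assumptions made for illustration; the pattern actually used by extractor.py may differ.

```python
import re

# Assumed pattern: "NCT" followed by 8 digits, where each digit may be
# preceded by spaces or hyphens. Not taken from extractor.py.
NCT_PATTERN = re.compile(r"NCT(?:[\s-]*\d){8}")


def extract_ncts(abstract):
    """Return (extracted, compressed) pairs for each NCT-like match."""
    pairs = []
    for match in NCT_PATTERN.finditer(abstract):
        extracted = match.group(0)  # raw text as it appears in the abstract
        compressed = re.sub(r"[\s-]", "", extracted)  # spaces and hyphens removed
        pairs.append((extracted, compressed))
    return pairs
```

Under this assumed pattern, an abstract containing "NCT 0123-4567" would give "NCT 0123-4567" as the extracted value and "NCT01234567" as the compressed value.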
The "PMID" column is the Pubmed ID for the entry in question.
The script tries to populate the "Date" column first from the ArticleDate field, then from the PubMedPubDate field.
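The fallback order can be illustrated with the standard library's xml.etree.ElementTree. This is a sketch under assumptions: the function name entry_date is hypothetical, and joining Year, Month and Day with hyphens is an assumed output format rather than the one extractor.py necessarily uses.

```python
import xml.etree.ElementTree as ET


def entry_date(article):
    """Return a date string from ArticleDate, else from the first PubMedPubDate.

    article is an Element for one PubmedArticle. Returns "" if neither
    field is present.
    """
    for tag in ("ArticleDate", "PubMedPubDate"):
        node = article.find(f".//{tag}")
        if node is not None:
            # Assumed format: Year-Month-Day, using the values as given.
            parts = [node.findtext(p, default="") for p in ("Year", "Month", "Day")]
            return "-".join(parts)
    return ""
```

The explicit `is not None` check matters here because an element's truth value does not indicate whether it was found.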
The "Abstract" and "Journal" columns contain the abstract text and the journal name of the entry in question.
The "Pubmed Metadata Registry" column is populated from the "AccessionNumberList" field.
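A minimal sketch of reading that field, assuming each registry number sits in an AccessionNumber element inside AccessionNumberList, as in Pubmed's XML export. The function name metadata_registry_numbers is hypothetical, and how extractor.py combines multiple values into one cell is not shown here.

```python
import xml.etree.ElementTree as ET


def metadata_registry_numbers(article):
    """Collect the text of every AccessionNumber under an AccessionNumberList."""
    return [
        (node.text or "").strip()
        for node in article.findall(".//AccessionNumberList/AccessionNumber")
    ]
```

Comparing these values with the compressed NCT numbers from the abstract would show whether the registration in the Pubmed metadata matches the one reported in the text.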
This file can be opened with LibreOffice Calc, or imported directly into R for analysis.
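The file can also be read back in Python with the standard csv module. The sketch below assumes the script writes plain tab-separated text without quoting, which is why quoting is disabled; the function name load_rows is hypothetical.

```python
import csv


def load_rows(handle):
    """Read tab-separated rows into dicts keyed by the column headings."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Example usage, assuming extracted_ncts.tsv exists in the current folder:
# with open("extracted_ncts.tsv", newline="", encoding="utf-8") as handle:
#     rows = load_rows(handle)
```

Each row is then a dictionary, so a column such as "PMID" can be accessed by its heading.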
BibTeX
@software{carlisle_pubmed_nct_extractor_2020,
location = {{Retrieved from https://codeberg.org/bgcarlisle/Pubmed-NCT-extractor}},
title = {Pubmed NCT Extractor},
url = {https://blog.bgcarlisle.com/2020/02/05/introducing-pubmed-nct-extractor/},
organization = {{The Grey Literature}},
date = {2020-02},
author = {Carlisle, Benjamin Gregory}
}
Vancouver
Carlisle BG. Pubmed NCT Extractor [Internet]. Retrieved from https://blog.bgcarlisle.com/2020/02/05/introducing-pubmed-nct-extractor/: The Grey Literature; 2020. Available from: https://codeberg.org/bgcarlisle/Pubmed-NCT-extractor
Many thanks to Alex Bannach-Brown and Peter Grabitz for conversations that motivated this tool.