Med7 — an information extraction model for clinical natural language processing


A brief overview

Recent years have seen remarkable technological advances in healthcare and biomedical research, mostly driven by the availability of vast amounts of digital patient-generated data and the democratisation of state-of-the-art algorithms from computer science and engineering. Open source frameworks and libraries such as PyTorch, TensorFlow, fast.ai, spacy.io, scikit-learn and huggingface.co, among others, have simplified the use of complex machine learning and deep learning pipelines in research and production.

In the era of digital platforms, and in medicine and healthcare in particular, the majority of patients’ medical records are now collected electronically. They therefore represent a true asset for research and for personalised approaches to treatment, ultimately leading to improved patient outcomes. However, most patient information is contained in free-text form, summarised by clinicians, nurses and caregivers during interviews and assessments. Free-text medical records normally contain very rich information about a patient’s history, since natural language can capture nuanced details, but this also makes them more challenging to use than structured, ready-to-use data sources. Recent advances in natural language processing (NLP), augmented with deep learning and novel Transformer-based architectures, offer new opportunities to extract meaningful information from unstructured medical records.

Concept recognition

Identification of concepts of interest in free text is a sub-task of information extraction, more commonly known as Named-Entity Recognition (NER), which seeks to classify tokens (words) into pre-defined categories. For example, using the NER component of spaCy:
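
A minimal sketch of what this looks like (the example sentence and the general-purpose en_core_web_sm model are illustrative choices, not part of Med7, and the model needs to be downloaded separately):

import spacy

# load spaCy's small general-purpose English model
# (assumes it has been downloaded, e.g. python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# print every recognised entity together with its label
print([(ent.text, ent.label_) for ent in doc.ents])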

Here, some of the words (tokens) are identified as concepts and classified (labelled) appropriately, for example as an organisation, a country or a monetary amount.

spaCy’s NER model is ready to use in various downstream NLP tasks and can identify 18 different types of concepts in text, ranging from names of people (including fictional characters), countries, locations, vehicles, food and book titles to dates and numerical quantities. While spaCy’s NER is fairly generic, several Python implementations of biomedical NER have recently been introduced (scispaCy, BioBERT and ClinicalBERT). These models were trained to identify particular concepts in biomedical texts, such as drug names, organ tissue, organisms, cells, amino acids, gene products, cellular components, DNA, cell types and others.

In order to maximise the utilisation of free-text electronic health records (EHR), we focused on a particular subtask of clinical information extraction and developed a dedicated named-entity recognition model, Med7, for the identification of seven medication-related concepts: dosage, drug name, duration, form, frequency, route of administration and strength. The model is trained on MIMIC-III, one of the largest openly available datasets, developed by the MIT Lab for Computational Physiology. MIMIC-III comprises EHR from over 60,000 intensive care unit admissions, including both structured and unstructured medical records. Med7 is open source, utilises the best practices introduced in spaCy and is interoperable with pipelines from within the spaCy Universe. Additionally, we provide a number of spaCy weights pre-trained on the entire MIMIC-III corpus, comprising over 2 million documents, using various architectural parameters. It has been shown that initialising model weights by pre-training on data from the target domain marginally improves the performance of the model on downstream NLP tasks when training with a limited amount of gold-annotated examples. This is particularly pertinent to the EHR domain, where high-quality, manually annotated training examples with correctly identified clinical concepts are seriously lacking.

Med7 in a nutshell

Med7 is a freely available Python package for spaCy. As a prerequisite, it requires spaCy (version 2.2.3 or later) and Python 3.6+. It is trained in part on manually annotated data provided by the 2018 National NLP Clinical Challenges (n2c2), which comprises a collection of 303 training and 202 test documents sampled from the discharge notes category of the MIMIC-III data. In order to improve the accuracy of the Med7 NER, we created a noisy, ‘silver’-annotated training set of 303 documents from MIMIC-III, using spaCy’s rule-based matching with a list of patterns for each of the seven categories (a sketch of this approach is shown below). Additionally, to gather even more gold-labelled training data, two annotators used the radically efficient active-learning annotation tool Prodigy to annotate 606 additional documents sampled from MIMIC-III, closely following the official 2018 n2c2 annotation guidance. Examples of the seven categories (taken from the usage example later in this post) are: DRUG (‘Magnesium hydroxide’), STRENGTH (‘400mg/5ml’), FORM (‘suspension’), ROUTE (‘PO’), DOSAGE (‘30ml’), FREQUENCY (‘bid’) and DURATION (‘for the next 5 days’).
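
A minimal sketch of the rule-based ‘silver’ annotation idea, using spaCy’s EntityRuler with the spaCy 2.x style API used elsewhere in this post (the patterns below are illustrative toy examples, not the actual Med7 pattern lists):

import spacy
from spacy.pipeline import EntityRuler

# blank English pipeline with a rule-based entity ruler (spaCy 2.x API)
nlp = spacy.blank("en")
ruler = EntityRuler(nlp)

# toy patterns for a few of the seven categories; the real 'silver' data
# was produced with much longer pattern lists per category
patterns = [
    {"label": "DRUG", "pattern": [{"LOWER": "magnesium"}, {"LOWER": "hydroxide"}]},
    {"label": "FORM", "pattern": [{"LOWER": "suspension"}]},
    {"label": "ROUTE", "pattern": [{"LOWER": "po"}]},
    {"label": "FREQUENCY", "pattern": [{"LOWER": "bid"}]},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp("Magnesium hydroxide suspension PO bid")
print([(ent.text, ent.label_) for ent in doc.ents])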

How to install Med7

It is recommended to create a dedicated virtual environment and install all required packages there. The trained model was tested with spaCy version 2.3.2 and Python 3.7. For example, if the Anaconda distribution of Python is already installed:

1. create a new virtual environment:

(base) conda create -n med7 python=3.7

2. activate and install spaCy:

(base) conda activate med7
(med7) pip install spacy==2.3.5

3. once everything has installed smoothly, check the Hugging Face repository for the Med7 model packages

The vectors-based Med7 model can be installed with:

(med7) pip install https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl

or the Transformer-based model with:

(med7) pip install https://huggingface.co/kormilitzin/en_core_med7_trf/resolve/main/en_core_med7_trf-any-py3-none-any.whl
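
A quick sanity check that the installation worked (assuming the vectors-based package above was installed):

(med7) python -c "import spacy; print(spacy.load('en_core_med7_lg').pipe_names)"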

For more details, please see the dedicated GitHub repository.

How to use Med7

import spacy

med7 = spacy.load("en_core_med7_lg")

# create distinct colours for labels
col_dict = {}
seven_colours = ['#e6194B', '#3cb44b', '#ffe119', '#ffd8b1', '#f58231', '#f032e6', '#42d4f4']
for label, colour in zip(med7.pipe_labels['ner'], seven_colours):
    col_dict[label] = colour

options = {'ents': med7.pipe_labels['ner'], 'colors':col_dict}

text = 'A patient was prescribed Magnesium hydroxide 400mg/5ml suspension PO of total 30ml bid for the next 5 days.'
doc = med7(text)

spacy.displacy.render(doc, style='ent', jupyter=True, options=options)

[(ent.text, ent.label_) for ent in doc.ents]

and the output:

[('Magnesium hydroxide', 'DRUG'),
('400mg/5ml', 'STRENGTH'),
('suspension', 'FORM'),
('PO', 'ROUTE'),
('30ml', 'DOSAGE'),
('bid', 'FREQUENCY'),
('for the next 5 days', 'DURATION')]

It is also possible to display the identified concepts visually: the spacy.displacy.render call above highlights each entity in the text with the colour assigned to its label.

This example can also be run in Colab.

The developed NER model can easily be integrated into pipelines developed within the spaCy framework. For example, integration with negspaCy will identify negated concepts, such as drugs which were mentioned but not actually prescribed.
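
A minimal sketch of that kind of integration, using negspaCy’s spaCy 2.x style API (assumes negspacy has been installed with pip; the example sentence is illustrative and constructor arguments may differ between negspacy versions):

import spacy
from negspacy.negation import Negex

med7 = spacy.load("en_core_med7_lg")

# append the negation detector to the end of the Med7 pipeline
negex = Negex(med7)
med7.add_pipe(negex, last=True)

doc = med7("The patient denies taking warfarin and was started on aspirin 75mg daily.")

# ent._.negex is True for entities found inside a negated context
print([(ent.text, ent.label_, ent._.negex) for ent in doc.ents])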

This article is a first step towards open source models for clinical natural language processing. More information about the model development can be found in our recent pre-print: Med7: a transferable clinical natural language processing model for electronic health records.
