MedScan (indra.sources.medscan)¶
MedScan is Elsevier’s proprietary text-mining system for reading the biological literature. This INDRA module enables processing output files (in CSXML format) from the MedScan system into INDRA Statements.
MedScan API (indra.sources.medscan.api)¶
-
indra.sources.medscan.api.process_directory(directory_name, lazy=False)[source]¶ Processes a directory filled with CSXML files, first normalizing the character encodings to utf-8, and then processing into a list of INDRA statements.
Parameters: - directory_name (str) – The name of a directory filled with csxml files to process
- lazy (bool) – If True, the statements will not be generated immediately, but rather a generator will be formulated, and statements can be retrieved by using iter_statements. If False, the statements attribute will be populated immediately. Default is False.
Returns: mp – A MedscanProcessor populated with INDRA statements extracted from the csxml files
Return type: MedscanProcessor
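The difference between eager and lazy processing described for the lazy parameter can be illustrated with a small stand-in class. This is only a sketch of the general pattern: ToyProcessor and its records are hypothetical, it is not the real MedscanProcessor, and the real class may organize this differently.

```python
class ToyProcessor:
    """Hypothetical stand-in used for illustration; not the real MedscanProcessor."""

    def __init__(self, records, lazy=False):
        self._records = records
        self.statements = []
        if not lazy:
            # Eager mode: the statements attribute is populated immediately.
            self.statements = list(self.iter_statements())

    def iter_statements(self):
        # Lazy mode: statements are produced one at a time, on request.
        for record in self._records:
            yield "Statement(%s)" % record


eager = ToyProcessor(["a", "b"])
lazy = ToyProcessor(["a", "b"], lazy=True)
```

In the eager case the list is available right away; in the lazy case nothing is computed until the generator is consumed, which can matter for large directories of files.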
-
indra.sources.medscan.api.process_directory_statements_sorted_by_pmid(directory_name)[source]¶ Processes a directory filled with CSXML files, first normalizing the character encoding to utf-8, and then processing into INDRA statements sorted by pmid.
Parameters: directory_name (str) – The name of a directory filled with csxml files to process
Returns: pmid_dict – A dictionary mapping pmids to a list of statements corresponding to that pmid
Return type: dict
-
indra.sources.medscan.api.process_file(filename, interval=None, lazy=False)[source]¶ Process a CSXML file for its relevant information.
Consider running the fix_csxml_character_encoding.py script in indra/sources/medscan to fix any encoding issues in the input file before processing.
-
indra.sources.medscan.api.filename¶ The csxml file, containing Medscan XML, to process
Type: str
-
indra.sources.medscan.api.interval¶ Select the interval of documents to read, starting with the `start`th document and ending before the `end`th document. If either is None, the value is considered undefined. If the value exceeds the bounds of available documents, it will simply be ignored.
Type: (start, end) or None
-
indra.sources.medscan.api.lazy¶ If True, the statements will not be generated immediately, but rather a generator will be formulated, and statements can be retrieved by using iter_statements. If False, the statements attribute will be populated immediately. Default is False.
Type: bool
Returns: mp – A MedscanProcessor object containing extracted statements
Return type: MedscanProcessor
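The interval semantics described above (start inclusive, end exclusive, None meaning unbounded, out-of-range values ignored) happen to match ordinary Python slicing. The helper below, select_interval, is a hypothetical name used only to illustrate that reading; it assumes non-negative indices and is not taken from the INDRA source.

```python
def select_interval(docs, interval=None):
    """Return the documents selected by an optional (start, end) interval.

    Hypothetical helper: start is inclusive, end is exclusive, and either
    may be None. Assumes non-negative indices.
    """
    docs = list(docs)
    if interval is None:
        return docs
    start, end = interval
    # Slicing treats None as "unbounded" and clips out-of-range bounds.
    return docs[start:end]


docs = ["doc0", "doc1", "doc2", "doc3", "doc4"]
middle = select_interval(docs, (1, 3))
clipped = select_interval(docs, (3, 100))
```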
-
indra.sources.medscan.api.process_file_sorted_by_pmid(file_name)[source]¶ Processes a file and returns a dictionary mapping pmids to a list of statements corresponding to that pmid.
Parameters: file_name (str) – A csxml file to process
Returns: s_dict – Dictionary mapping pmids to a list of statements corresponding to that pmid
Return type: dict
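The shape of the dictionary returned by the two "sorted by pmid" functions can be sketched with plain Python data. In this illustration, group_by_pmid is a hypothetical helper and the (pmid, statement) pairs are placeholder strings; how the real functions obtain the pmid for each statement is not shown here.

```python
from collections import defaultdict


def group_by_pmid(pairs):
    """Group (pmid, statement) pairs into a dict of pmid -> list of statements.

    Hypothetical helper for illustration only.
    """
    pmid_dict = defaultdict(list)
    for pmid, stmt in pairs:
        pmid_dict[pmid].append(stmt)
    return dict(pmid_dict)


pairs = [
    ("111", "stmt A"),
    ("222", "stmt B"),
    ("111", "stmt C"),
]
grouped = group_by_pmid(pairs)
```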
MedScan Processor (indra.sources.medscan.processor)¶
-
class
indra.sources.medscan.processor.MedscanEntity(name, urn, type, properties, ch_start, ch_end)¶ -
ch_end¶ Alias for field number 5
-
ch_start¶ Alias for field number 4
-
name¶ Alias for field number 0
-
properties¶ Alias for field number 3
-
type¶ Alias for field number 2
-
urn¶ Alias for field number 1
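The "Alias for field number N" entries indicate a named tuple, where each field can be read either by name or by position. The sketch below redefines a tuple with the same documented field order locally so that it runs without INDRA installed; the field values are made-up placeholders, not real Medscan output.

```python
from collections import namedtuple

# Local stand-in mirroring the documented field order.
MedscanEntity = namedtuple(
    "MedscanEntity",
    ["name", "urn", "type", "properties", "ch_start", "ch_end"],
)

# Placeholder values chosen for illustration only.
entity = MedscanEntity(
    name="BRCA1",
    urn="urn:example:1",
    type="Protein",
    properties={},
    ch_start=0,
    ch_end=5,
)

# Positional and named access refer to the same fields.
same_name = entity[0] == entity.name
same_end = entity[5] == entity.ch_end
```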
-
-
class
indra.sources.medscan.processor.MedscanProcessor[source]¶ Processes Medscan data into INDRA statements.
The special StateEffect event conveys information about the binding site of a protein modification. Sometimes this is paired with additional event information in a separate SVO. When we encounter a StateEffect, we don’t process it into an INDRA statement right away, but instead store the site information and use it if we encounter a ProtModification event within the same sentence.
-
statements¶ A list of extracted INDRA statements
Type: list<str>
-
sentence_statements¶ A list of statements for the sentence we are currently processing. Deduplicated and added to the main statement list when we finish processing a sentence.
Type: list<str>
-
num_entities¶ The total number of subject or object entities the processor attempted to resolve
Type: int
-
num_entities_not_found¶ The number of subject or object IDs which could not be resolved by looking in the list of entities or tagged phrases.
Type: int
-
last_site_info_in_sentence¶ Stored protein site info from the last StateEffect event within the sentence, allowing us to combine information from StateEffect and ProtModification events within a single sentence in a single INDRA statement. This is reset at the end of each sentence.
Type: SiteInfo
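The bookkeeping described for last_site_info_in_sentence can be sketched as a small loop: remember the most recent StateEffect site, attach it to any later ProtModification in the same sentence, and clear it when the sentence ends. The event tuples and the combine_events function below are hypothetical simplifications, not the processor's actual data structures.

```python
def combine_events(sentences):
    """Pair each ProtModification with the last StateEffect site seen
    earlier in the same sentence (or None if there was none).

    Hypothetical sketch: each sentence is a list of (kind, payload) tuples.
    """
    results = []
    for events in sentences:
        last_site = None
        for kind, payload in events:
            if kind == "StateEffect":
                # Store the site instead of emitting a statement now.
                last_site = payload
            elif kind == "ProtModification":
                results.append((payload, last_site))
        # Reset at the end of each sentence so sites never leak across.
        last_site = None
    return results


sentences = [
    [("StateEffect", "S22"), ("ProtModification", "phosphorylation")],
    [("ProtModification", "acetylation")],
]
combined = combine_events(sentences)
```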
-
agent_from_entity(relation, entity_id)[source]¶ Create a (potentially grounded) INDRA Agent object from a given Medscan entity describing the subject or object.
Uses helper functions to convert a Medscan URN to an INDRA db_refs grounding dictionary.
If the entity has properties indicating that it is a protein with a mutation or modification, then constructs the needed ModCondition or MutCondition.
Parameters: - relation (MedscanRelation) – The current relation being processed
- entity_id (str) – The ID of the entity to process
Returns: agent – A potentially grounded INDRA agent representing this entity
Return type: indra.statements.Agent
-
process_csxml_file(filename, interval=None, lazy=False)[source]¶ Processes a filehandle to MedScan csxml input into INDRA statements.
The CSXML format consists of a top-level <batch> root element containing a series of <doc> (document) elements, in turn containing <sec> (section) elements, and in turn containing <sent> (sentence) elements.
Within the <sent> element, a series of additional elements appear in the following order:
- <toks>, which contains a tokenized form of the sentence in its text attribute
- <textmods>, which describes any preprocessing/normalization done to the underlying text
- <match> elements, each of which contains one or more <entity> elements, describing entities in the text with their identifiers. The local ID of each entity is given in the msid attribute of this element; these IDs are then referenced in any subsequent SVO elements.
- <svo> elements, representing subject-verb-object triples. SVO elements with a type attribute of CONTROL represent normalized regulation relationships; they often represent the normalized extraction of the immediately preceding (but unnormalized) SVO element. However, in some cases there can be a “CONTROL” SVO element without its parent immediately preceding it.
Parameters: - filename (string) – The path to a Medscan csxml file.
- interval ((start, end) or None) – Select the interval of documents to read, starting with the `start`th document and ending before the `end`th document. If either is None, the value is considered undefined. If the value exceeds the bounds of available documents, it will simply be ignored.
- lazy (bool) – If True, only create a generator which can be used by the get_statements method. If False, populate the statements list now.
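The element nesting described above can be explored with the standard library alone. The sketch below parses a tiny hand-written document that follows the documented structure; it uses only the element and attribute names mentioned in the description (text on <toks>, msid on <entity>, type on <svo>), and real CSXML files carry many more attributes that are omitted here. The summarize function is a hypothetical helper, not part of INDRA.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the documented nesting; not real MedScan output.
SAMPLE = """<batch>
  <doc>
    <sec>
      <sent>
        <toks text="A activates B ."/>
        <textmods/>
        <match>
          <entity msid="1"/>
          <entity msid="2"/>
        </match>
        <svo type="CONTROL"/>
      </sent>
    </sec>
  </doc>
</batch>"""


def summarize(xml_text):
    """Collect the tokenized text, entity IDs and SVO types per sentence."""
    root = ET.fromstring(xml_text)
    summary = []
    for sent in root.iter("sent"):
        toks = sent.find("toks")
        summary.append({
            "text": toks.get("text") if toks is not None else None,
            "msids": [e.get("msid") for e in sent.iter("entity")],
            "svo_types": [s.get("type") for s in sent.findall("svo")],
        })
    return summary


result = summarize(SAMPLE)
```

For large files, a streaming approach such as xml.etree.ElementTree.iterparse would avoid loading the whole batch into memory.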
-
process_relation(relation, last_relation)[source]¶ Process a relation into an INDRA statement.
Parameters: - relation (MedscanRelation) – The relation to process (a CONTROL svo with normalized verb)
- last_relation (MedscanRelation) – The relation immediately preceding the relation to process within the same sentence, or None if there are no preceding relations within the same sentence. This preceding relation, if available, will refer to the same interaction but with an unnormalized (potentially more specific) verb, and is used when processing protein modification events.
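One way to picture how relation and last_relation travel together is a loop that remembers the previous relation while walking the relations of a sentence. The sketch below is an assumption about the general pattern, not the processor's actual control flow; relations are represented as plain dictionaries and pair_with_previous is a hypothetical helper.

```python
def pair_with_previous(relations):
    """Pair each CONTROL relation with the relation immediately before it
    in the same sentence, or None if it is the first one.

    Hypothetical sketch using dicts with 'verb' and 'svo_type' keys.
    """
    pairs = []
    last = None
    for rel in relations:
        if rel["svo_type"] == "CONTROL":
            pairs.append((rel, last))
        last = rel
    return pairs


sentence_relations = [
    {"verb": "phosphorylates", "svo_type": "OTHER"},
    {"verb": "ProtModification", "svo_type": "CONTROL"},
]
pairs = pair_with_previous(sentence_relations)
```

Here the svo_type value "OTHER" and the verb strings are placeholders; the point is only that the unnormalized relation is available when the normalized CONTROL relation is handled.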
-
-
class
indra.sources.medscan.processor.MedscanProperty(type, name, urn)¶ -
name¶ Alias for field number 1
-
type¶ Alias for field number 0
-
urn¶ Alias for field number 2
-
-
class
indra.sources.medscan.processor.MedscanRelation(pmid, uri, sec, entities, tagged_sentence, subj, verb, obj, svo_type)[source]¶ A structure representing the information contained in a Medscan SVO xml element as well as associated entities and properties.
-
pmid¶ The URI of the current document (such as a PMID)
Type: str
-
sec¶ The section of the document the relation occurs in
Type: str
-
entities¶ A dictionary mapping entity IDs from the same sentence to MedscanEntity objects.
Type: dict
-
tagged_sentence¶ The sentence from which the relation was extracted, with some tagged phrases and annotations.
Type: str
-
subj¶ The entity ID of the subject
Type: str
-
verb¶ The verb in the relationship between the subject and the object
Type: str
-
obj¶ The entity ID of the object
Type: str
-
svo_type¶ The type of SVO relationship (for example, CONTROL indicates that the verb is normalized)
Type: str
-
-
class
indra.sources.medscan.processor.ProteinSiteInfo(site_text, object_text)[source]¶ Represent a site on a protein, extracted from a StateEffect event.
Parameters: - site_text (str) – The site as a string (ex. S22)
- object_text (str) – The protein being modified, as the string that appeared in the original sentence
-
indra.sources.medscan.processor.normalize_medscan_name(name)[source]¶ Removes the “complex” and “complex complex” suffixes from a medscan agent name so that it better corresponds with the grounding map.
Parameters: name (str) – The Medscan agent name
Returns: norm_name – The Medscan agent name with the “complex” and “complex complex” suffixes removed.
Return type: str
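The suffix removal described for normalize_medscan_name can be approximated in a few lines. The function below, normalize_name_sketch, is a hypothetical reimplementation written from the description alone; it assumes the suffix is separated from the name by a single space, and the real function may treat edge cases differently.

```python
def normalize_name_sketch(name):
    """Strip a trailing ' complex' up to twice, covering both the
    'complex' and 'complex complex' suffixes.

    Hypothetical sketch; assumes a single space before each suffix.
    """
    suffix = " complex"
    for _ in range(2):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
    return name


single = normalize_name_sketch("NFkB complex")
double = normalize_name_sketch("NFkB complex complex")
plain = normalize_name_sketch("NFkB")
```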