Monday, October 21, 2013

'13 ShARe/CLEF eHealth's Top System: UTHealthCCB (Tang et al.)

The First Team's Paper: Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model

First of all, I am happy to see that I had captured most of the related work accurately in my 2012 HIMSS Poster: i2b2 NLP challenges, MedLEE, SymText/MPlus, MetaMap, KnowledgeMap, cTAKES and HITEx.

Second, the authors clarify the similarities and differences between this task and the 2010 i2b2 challenge on clinical problem extraction. There are two major differences: 1) the ShARe/CLEF task allowed disjoint entities, while the 2010 i2b2 clinical problem extraction task only dealt with entities made of consecutive words; and 2) the ShARe/CLEF task required mapping disorder entities to SNOMED-CT concepts (using UMLS CUIs), which was not required in the 2010 i2b2 challenge.

An overview of the architecture of the disorder concept extraction system:

  1. Preprocessing: Sentence boundary detection and tokenization

  2. Entity Representation: Representation of disorder mentions

  3. Machine Learning: CRF or SSVM

  4. Entity Parsing: Parse results of disorder mentions

  5. Post-processing: Alignment of sentences and tokens

  6. [For Task 1b] CUI Mapping: Vector Space Model (VSM)

Disorder entity recognition: (i) For consecutive disorder entities, the authors used the traditional BIO approach (standard for ML-based NER), where each word is labeled as B (beginning of an entity), I (inside an entity), or O (outside of an entity). The NER problem thus turns into a three-class sequence labeling problem. (ii) For disjoint disorder entities, the authors introduced two additional sets of tags, D{B,I} and H{B,I}. Words labeled HB or HI belong to two or more disjoint concepts (a shared head), while, as the example suggests, words labeled DB or DI belong to a single disjoint concept. E.g.:
Sentence: “The aortic root and ascending aorta are moderately dilated .”
Encoding: “The/O aortic/DB root/DI and/O ascending/DB aorta/DI are/O moderately/O dilated/HB ./O”
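To make the tagging scheme concrete, here is a minimal Python sketch that reproduces the encoding above. It is my own reading of the scheme, not the authors' code: the function names `is_contiguous` and `encode` are made up, and I assume each disorder mention is given as a sorted list of token indices.

```python
from collections import Counter


def is_contiguous(indices):
    """True if the sorted token indices form one unbroken run."""
    return indices == list(range(indices[0], indices[-1] + 1))


def encode(tokens, entities):
    """Tag tokens with B/I/O plus DB/DI/HB/HI for disjoint mentions.

    entities: list of sorted token-index lists, one per disorder mention
    (an assumed input format, chosen for illustration).
    """
    tags = ["O"] * len(tokens)
    # How many disjoint mentions each token takes part in.
    disjoint_count = Counter(
        i for ent in entities if not is_contiguous(ent) for i in ent
    )
    for ent in entities:
        contiguous = is_contiguous(ent)
        for k, i in enumerate(ent):
            # A token starts a run if it is first or not adjacent to the previous one.
            starts_run = k == 0 or ent[k - 1] != i - 1
            if contiguous:
                prefix = ""   # plain B / I
            elif disjoint_count[i] >= 2:
                prefix = "H"  # shared by two or more disjoint mentions
            else:
                prefix = "D"  # part of exactly one disjoint mention
            tags[i] = prefix + ("B" if starts_run else "I")
    return tags


tokens = "The aortic root and ascending aorta are moderately dilated .".split()
# Two disjoint mentions: "aortic root ... dilated" and "ascending aorta ... dilated".
entities = [[1, 2, 8], [4, 5, 8]]
print(" ".join(f"{t}/{g}" for t, g in zip(tokens, encode(tokens, entities))))
```

With the two mentions "aortic root ... dilated" and "ascending aorta ... dilated", the printed tags match the encoding in the example; a plain consecutive mention would simply get B and I.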

ML algorithms employed:

  1. Conditional Random Fields (CRFs) [CRFsuite]

  2. Structural Support Vector Machines (SSVMs) [SVM-hmm]

Features used: Bag-of-words, part-of-speech (POS) tags from the Stanford tagger, note type, section information, word representations from Brown clustering and random indexing, and semantic categories of words based on UMLS lookup, MetaMap, or cTAKES outputs. The SSVM achieved an F-measure about 0.03 higher than the CRF, although the CRF had about 0.03 better precision; since F-measure combines precision and recall, the SSVM's gain must have come from higher recall.
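As a rough illustration of how such features might be assembled per token before being handed to a sequence labeler, here is a small Python sketch. Everything in it is an assumption for illustration: the function `token_features`, the feature names, the one-word context window, and the toy lookup tables are mine, and random indexing is left out for brevity.

```python
def token_features(tokens, pos_tags, i, note_type, section,
                   brown_clusters, semantic_lookup):
    """Build a feature dict for token i (hypothetical feature names)."""
    word = tokens[i].lower()
    return {
        "word": word,
        "pos": pos_tags[i],
        "note_type": note_type,   # e.g. discharge summary vs. radiology report
        "section": section,       # section heading the sentence falls under
        "brown": brown_clusters.get(word, "NONE"),
        "semantic": semantic_lookup.get(word, "NONE"),
        # Neighbouring words, with boundary markers at the sentence edges.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = "The aortic root and ascending aorta are moderately dilated .".split()
# Hand-written toy values, not real tagger, clustering, or UMLS output.
pos_tags = ["DT", "JJ", "NN", "CC", "VBG", "NN", "VBP", "RB", "VBN", "."]
brown_clusters = {"dilated": "1101"}
semantic_lookup = {"aorta": "anatomy"}

print(token_features(tokens, pos_tags, 8, "ECHO_REPORT", "FINDINGS",
                     brown_clusters, semantic_lookup))
```

One dict per token, in sentence order, is the shape that CRF-style toolkits commonly accept, which is why I picked it here.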

[Task 1b] Disorder entity encoding: The authors approach this as a ranking problem in which the query is an entity identified in Task 1a and the documents are candidate SNOMED-CT terms. For a given disorder entity, the terms (each tied to a CUI) that contain all of the entity's words, excluding stop words, are selected as candidates. tf-idf vectors are built over all SNOMED-CT terms, and the candidates are ranked by their cosine similarity to the disorder entity.
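The ranking step can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions, not the authors' implementation: the stop-word list is a tiny made-up one, the idf is a common smoothed variant, the function names are hypothetical, and the CUI labels and terms below are placeholders rather than real SNOMED-CT entries.

```python
import math
from collections import Counter

STOP_WORDS = {"of", "the", "and", "in", "with"}  # toy list, an assumption


def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(mention, terms):
    """terms: list of (cui, term) pairs. Returns (cui, term, score), best first."""
    docs = [tokenize(term) for _, term in terms]
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))

    def tfidf(words):
        # Smoothed idf so that every word keeps a positive weight.
        return {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
                for w, c in Counter(words).items()}

    query_words = tokenize(mention)
    if not query_words:
        return []
    query_vec = tfidf(query_words)
    ranked = []
    for (cui, term), doc in zip(terms, docs):
        # Keep only terms that contain every non-stop word of the mention.
        if set(query_words) <= set(doc):
            ranked.append((cui, term, cosine(query_vec, tfidf(doc))))
    return sorted(ranked, key=lambda r: r[2], reverse=True)


# Placeholder identifiers and terms, for illustration only.
terms = [
    ("CUI_A", "dilated aorta"),
    ("CUI_B", "dilated aortic root"),
    ("CUI_C", "severe dilated aortic root"),
    ("CUI_D", "fracture of femur"),
]
for cui, term, score in rank_candidates("aortic root dilated", terms):
    print(cui, term, round(score, 3))
```

In this toy run only the two terms containing all three mention words survive the candidate filter, and the exact bag-of-words match ranks first because the extra word in the longer term lowers its cosine similarity.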
