Towards CSS

Wednesday, November 6, 2013

Sentiment Analysis and Opinion Mining

In this post, I am sharing my notes on the first three chapters of the book Sentiment Analysis and Opinion Mining by Bing Liu.

Sentiment Analysis Research

Document level, sentence level, entity and aspect level

Sentiment Lexicon and Its Issues

Context matters: "It sucks" vs. "The vacuum cleaner sucks"
Questions and conditional statements might not express an opinion: "Is Android good?", "If Android is good, then I will buy it"
Sarcastic sentences: "Nokia 5310 is great to use as a brick"
Sentences with no opinion words can still imply an opinion: "Its color changed after one use"

Opinion Spam Detection

Definition (opinion): An opinion is a quintuple (e_i, a_ij, s_ijkl, h_k, t_l), where e_i is the name of an entity, a_ij is an aspect of e_i, s_ijkl is the sentiment on aspect a_ij of entity e_i, h_k is the opinion holder, and t_l is the time when the opinion is expressed by h_k. The sentiment s_ijkl is positive, negative, or neutral, or expressed with different strength/intensity levels, e.g., 1–5 stars as used by most review sites on the Web. When an opinion is on the entity itself as a whole, the special aspect GENERAL is used to denote it. Here, e_i and a_ij together represent the opinion target.

Task 1 (entity extraction and categorization): Extract all entity expressions in D, and categorize or group synonymous entity expressions into clusters (or categories). Each entity expression cluster indicates a unique entity e_i.
Task 2 (aspect extraction and categorization): Extract all aspect expressions of the entities, and categorize them into clusters. Each aspect expression cluster of entity e_i represents a unique aspect a_ij.
Task 3 (opinion holder extraction and categorization): Extract opinion holders for opinions from text or structured data and categorize them. The task is analogous to the above two tasks.
Task 4 (time extraction and standardization): Extract the times when opinions are given and standardize different time formats. The task is also analogous to the above tasks.
Task 5 (aspect sentiment classification): Determine whether an opinion on an aspect a_ij is positive, negative, or neutral, or assign a numeric sentiment rating to the aspect.
Task 6 (opinion quintuple generation): Produce all opinion quintuples (e_i, a_ij, s_ijkl, h_k, t_l) expressed in document d based on the results of the above tasks. This task seems very simple, but it is in fact very difficult in many cases, as the example below shows.

Example: Posted by: big John Date: Sept. 15, 2011
(1) i bought a Samsung camera and my friends brought a canon camera yesterday. (2) in the past week, we both used the cameras a lot. (3) the photos from my Samy are not that great, and the battery life is short too. (4) my friend was very happy with his camera and loves its picture quality. (5) i want a camera that can take good photos. (6) i am going to return it tomorrow.
Output:
(Samsung, picture_quality, negative, big John, Sept-15-2011)
(Samsung, battery_life, negative, big John, Sept-15-2011)
(Canon, GENERAL, positive, big John’s_friend, Sept-15-2011)
(Canon, picture_quality, positive, big John’s_friend, Sept-15-2011)
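
To make the quintuple concrete, here is a minimal sketch of how the output above could be held in code. The `Opinion` type, its field names, and the `summarize` helper are my own illustrative choices, not something the book prescribes.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Opinion(NamedTuple):
    """One opinion quintuple (entity, aspect, sentiment, holder, time)."""
    entity: str
    aspect: str     # "GENERAL" when the opinion is on the entity as a whole
    sentiment: str  # "positive", "negative", or "neutral"
    holder: str
    time: str


# The four quintuples from the camera review example.
opinions = [
    Opinion("Samsung", "picture_quality", "negative", "big John", "Sept-15-2011"),
    Opinion("Samsung", "battery_life", "negative", "big John", "Sept-15-2011"),
    Opinion("Canon", "GENERAL", "positive", "big John's_friend", "Sept-15-2011"),
    Opinion("Canon", "picture_quality", "positive", "big John's_friend", "Sept-15-2011"),
]


def summarize(ops):
    """Count sentiments per opinion target, i.e. per (entity, aspect) pair."""
    summary = defaultdict(Counter)
    for op in ops:
        summary[(op.entity, op.aspect)][op.sentiment] += 1
    return summary


summary = summarize(opinions)
for target, counts in summary.items():
    print(target, dict(counts))
```

Grouping by (entity, aspect) reflects the note above that the entity and the aspect together form the opinion target.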

Sentiment classification using supervised learning

Terms and their frequency. Part of speech. Sentiment words and phrases. Rules of opinions. Sentiment shifters. Syntactic dependency
Utilize the features listed to run traditional or new ML algorithms
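
As a rough illustration of what such features might look like, here is a small sketch covering three of the items listed: term frequencies, sentiment words, and sentiment shifters. The word lists and the two-token shifter window are assumptions I made up for the example; a real system would use a proper lexicon and parser.

```python
from collections import Counter

# Toy resources, assumed for illustration only (not taken from the book).
SENTIMENT_WORDS = {"good": 1, "great": 1, "happy": 1, "short": -1, "bad": -1}
SHIFTERS = {"not", "never", "no"}


def extract_features(text):
    """Return term-frequency features plus counts of positive and negative
    sentiment words, flipping polarity when a shifter precedes the word."""
    tokens = text.lower().split()
    features = Counter(f"tf={tok}" for tok in tokens)
    for i, tok in enumerate(tokens):
        if tok in SENTIMENT_WORDS:
            polarity = SENTIMENT_WORDS[tok]
            # A shifter within the two preceding tokens flips the polarity.
            if any(t in SHIFTERS for t in tokens[max(0, i - 2):i]):
                polarity = -polarity
            features["pos_words" if polarity > 0 else "neg_words"] += 1
    return features


feats = extract_features("the photos are not that great")
print(feats["neg_words"], feats["pos_words"])
```

The resulting dictionary could then be fed to whichever classifier you prefer.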

Sentiment classification using unsupervised learning

Five patterns of POS tags used for extracting two-word phrases, such as adjective followed by a noun
Sentiment orientation (SO) of phrases using point-wise mutual information (PMI)
Lexicon-based method: words/phrases are mapped to a sentiment strength, e.g., in the range [-2, +2]
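
For the PMI step, here is a minimal sketch assuming Turney's formulation, in which the orientation of a phrase is its PMI with the reference word "excellent" minus its PMI with "poor". The function names, the count arguments, and the 0.01 smoothing term are my assumptions for illustration; the counts would come from a search engine or a corpus.

```python
import math


def pmi(joint, count_a, count_b, total):
    """Point-wise mutual information: log2( P(a, b) / (P(a) * P(b)) )."""
    return math.log2((joint / total) / ((count_a / total) * (count_b / total)))


def sentiment_orientation(near_pos, near_neg, count_phrase,
                          count_pos, count_neg, total, smooth=0.01):
    """SO(phrase) = PMI(phrase, positive word) - PMI(phrase, negative word).

    near_pos / near_neg: co-occurrence counts of the phrase with the positive
    and negative reference words; `smooth` avoids log(0) for unseen pairs.
    """
    return (pmi(near_pos + smooth, count_phrase, count_pos, total)
            - pmi(near_neg + smooth, count_phrase, count_neg, total))


# Hypothetical counts: the phrase co-occurs 8 times with the positive
# reference word and 2 times with the negative one.
so = sentiment_orientation(8, 2, 10, 100, 100, 1000)
print("positive" if so > 0 else "negative")
```

A phrase is treated as positive when its SO is above zero and negative otherwise; averaging the SO of all extracted phrases gives a score for the whole review.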

Sentiment rating prediction

SVM one-vs-all (OVA) approach (reported as performing poorly)
A similarity graph is built to smooth the ratings produced by SVM OVA
Constrained ridge regression on bag of opinions (sentiment-word, negator, modifier)
Aggregate rating of aspects
Learning from comprehensive reviews only with a Bayesian model

Cross-domain sentiment classification
Cross-language sentiment classification

Monday, October 21, 2013

'13 ShARe/CLEF eHealth's Top System: UTHealthCCB (Tang et al.)

The First Team's Paper: Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model

First of all, I am happy to see that I had captured most of the related work accurately in my 2012 HIMSS Poster: i2b2 NLP challenges, MedLEE, SymText/MPlus, MetaMap, KnowledgeMap, cTAKES and HiTEX.

Second, the authors clarify the similarities and differences between this task and the 2010 i2b2 challenge on clinical problem extraction. The two major differences are: 1) the ShARe/CLEF task allowed disjoint entities, while the 2010 i2b2 clinical problem extraction task only dealt with entities of consecutive words; and 2) the ShARe/CLEF task required mapping disorder entities to SNOMED-CT (using UMLS CUIs), which was not required in the 2010 i2b2 challenge.

An overview of the architecture of the disorder concept extraction system:

Preprocessing: Sentence boundary detection and tokenization

Entity Representation: Representation of disorder mentions

Machine Learning: CRF or SSVM

Entity Parsing: Parse results of disorder mentions

Post-processing: Alignment of sentences and tokens

[For Task 1b] CUI Mapping: Vector Space Model (VSM)

Disorder entity recognition: (i) For consecutive disorder entities, the authors used the traditional BIO approach (common for NER in ML), where each word is labeled as B (beginning of an entity), I (inside an entity), or O (outside of an entity). The NER problem thus turns into a three-class classification problem. (ii) For disjoint disorder entities, the authors introduced two additional sets of tags, D{B,I} and H{B,I}. Words labeled as HB or HI belong to two or more disjoint concepts. E.g.:

Sentence: “The aortic root and ascending aorta are moderately dilated .”
Encoding: “The/O aortic/DB root/DI and/O ascending/DB aorta/DI are/O moderately/O dilated/HB ./O”
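
To check my understanding of the tag scheme, here is a small sketch that reproduces the encoding above from entity annotations. The `encode` function and the way entities are represented (a list of segments, each a list of consecutive token indices) are my own assumptions, not the authors' implementation.

```python
from collections import Counter


def encode(tokens, entities):
    """Tag tokens with B/I/O, plus DB/DI and HB/HI for disjoint entities.

    entities: list of entities; each entity is a list of segments, and each
    segment is a list of consecutive token indices.
    """
    tags = ["O"] * len(tokens)
    # Count how many disjoint entities each segment takes part in.
    seg_use = Counter(tuple(seg) for ent in entities if len(ent) > 1 for seg in ent)
    for ent in entities:
        for seg in ent:
            if len(ent) == 1:
                prefix = ""   # consecutive entity: plain B/I
            elif seg_use[tuple(seg)] > 1:
                prefix = "H"  # segment shared by two or more disjoint entities
            else:
                prefix = "D"  # segment owned by a single disjoint entity
            for pos, idx in enumerate(seg):
                tags[idx] = prefix + ("B" if pos == 0 else "I")
    return tags


tokens = "The aortic root and ascending aorta are moderately dilated .".split()
# Two disjoint disorders that share the word "dilated" (token index 8).
entities = [[[1, 2], [8]], [[4, 5], [8]]]
tags = encode(tokens, entities)
print(" ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags)))
```

The shared word "dilated" gets HB because it belongs to both "aortic root ... dilated" and "ascending aorta ... dilated", while each non-shared segment gets DB/DI.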

ML algorithms employed:

Conditional Random Fields (CRFs) [CRFsuite]

Structural Support Vector Machines (SSVMs) [SVM-hmm]

Features used: Bag-of-words, part-of-speech (POS) tags from the Stanford tagger, type of notes, section information, word representations from Brown clustering and random indexing, and semantic categories of words based on UMLS lookup, MetaMap, or cTAKES outputs. It turned out that SSVM achieved a ~0.03 higher F-measure, although CRF had ~0.03 better precision.

[Task 1b] Disorder entity encoding: The authors approach it as a ranking problem in which the query is an entity identified in Task 1a and the documents are candidate SNOMED-CT terms. For a given disorder entity, the terms of CUIs that contain all of the entity's words (except stop words) are selected as candidates. Their tf-idf vectors are built using all SNOMED-CT terms, and the cosine similarity between each candidate and the disorder entity is used to rank the candidates.
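
Here is a minimal standard-library sketch of that ranking idea. The stop-word list, the particular tf-idf weighting, the toy terminology, and the placeholder concept IDs (`ID_1` and so on, not real CUIs) are all assumptions for illustration; the paper's exact weighting may differ.

```python
import math
from collections import Counter

STOP_WORDS = {"of", "the", "and", "in"}  # assumed, tiny stop-word list


def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_idf(terms):
    """Inverse document frequency computed over every term string."""
    df = Counter(w for t in terms for w in set(tokenize(t)))
    return {w: math.log(len(terms) / d) + 1.0 for w, d in df.items()}


def tfidf(text, idf):
    return {w: c * idf.get(w, 0.0) for w, c in Counter(tokenize(text)).items()}


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(mention, terminology):
    """Rank candidate terms (those containing all mention words) by cosine."""
    idf = build_idf([t for ts in terminology.values() for t in ts])
    query, words = tfidf(mention, idf), set(tokenize(mention))
    scored = [(cosine(query, tfidf(term, idf)), cid, term)
              for cid, terms in terminology.items() for term in terms
              if words <= set(tokenize(term))]
    return sorted(scored, reverse=True)


# Hypothetical mini-terminology with placeholder concept IDs.
terminology = {
    "ID_1": ["dilated aortic root", "dilatation of aortic root"],
    "ID_2": ["aortic root"],
    "ID_3": ["dilated ascending aorta"],
    "ID_4": ["severely dilated aortic root"],
}
ranked = rank("aortic root dilated", terminology)
for score, cid, term in ranked:
    print(f"{score:.3f}  {cid}  {term}")
```

Only terms containing every non-stop word of the mention survive the candidate filter, so "aortic root" alone is not a candidate here, and the exact match ranks above the longer term.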

CLEF e-Health 2013

Here is the proceedings link where you can find the task reviews as well as the articles for free.

Task #1: Annotation of disorder mentions in clinical reports by (1a) identifying a span of text as a disorder mention, and (1b) [optional] mapping the span to a UMLS CUI (i.e., SNOMED-CT codes).

Dataset: The corpus was drawn from different clinical encounters, including radiology reports, discharge summaries, and ECG/ECHO reports, with about 181K words annotated by the organizers. A total of 200 documents were provided for training (5811 disorder entities were annotated and mapped to 1007 unique CUIs or marked CUI-less), and 100 documents were set aside for testing (5340 disorder entities with 795 CUIs or CUI-less).

Competition Results: The best systems had an F1 score of 0.75 (0.80 precision, 0.71 recall) in Task 1a and an accuracy of 0.59 in Task 1b. The top three teams in Task 1a were UTHealthCCB.A, NCBI, and CLEAR.
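
As a quick sanity check, the reported Task 1a F1 follows from the precision and recall, since F1 is their harmonic mean. The helper name below is mine.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


best_f1 = f1_score(0.80, 0.71)
print(round(best_f1, 2))  # 0.75
```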

Wednesday, September 25, 2013

Last Post?

The human is an interesting creature. It's like a machine controlled by different people inside oneself. If a person can't reduce the number of I's within himself, then his first post may turn out to be his last post as well.