Special issue:
Natural Language Processing
and its Applications
Research in Computing Science
Series Editorial Board
Comité Editorial de la Serie
Editors-in-Chief:
Editores en Jefe
Associate Editors:
Editores Asociados
Jesús Angulo (France)
Jihad El-Sana (Israel)
Jesús Figueroa (Mexico)
Alexander Gelbukh (Russia)
Ioannis Kakadiaris (USA)
Serguei Levachkine (Russia)
Petros Maragos (Greece)
Julian Padget (UK)
Mateo Valero (Spain)
Juan Humberto Sossa Azuela (Mexico)
Gerhard Ritter (USA)
Jean Serra (France)
Ulises Cortés (Spain)
Editorial Coordination / Coordinación Editorial: Blanca Miranda Valencia
Formatting / Formato: Sulema Torres Ramos
Research in Computing Science es una publicación trimestral, de circulación internacional, editada por el
Centro de Investigación en Computación del IPN, para dar a conocer los avances de investigación científica
y desarrollo tecnológico de la comunidad científica internacional. Volumen 46 Marzo, 2010. Tiraje: 500
ejemplares. Certificado de Reserva de Derechos al Uso Exclusivo del Título No. 04-2004-062613250000102, expedido por el Instituto Nacional de Derecho de Autor. Certificado de Licitud de Título No. 12897,
Certificado de licitud de Contenido No. 10470, expedidos por la Comisión Calificadora de Publicaciones y
Revistas Ilustradas. El contenido de los artículos es responsabilidad exclusiva de sus respectivos autores.
Queda prohibida la reproducción total o parcial, por cualquier medio, sin el permiso expreso del editor,
excepto para uso personal o de estudio haciendo cita explícita en la primera página de cada documento.
Impreso en la Ciudad de México, en los Talleres Gráficos del IPN – Dirección de Publicaciones, Tres
Guerras 27, Centro Histórico, México, D.F. Distribuida por el Centro de Investigación en Computación,
Av. Juan de Dios Bátiz S/N, Esq. Av. Miguel Othón de Mendizábal, Col. Nueva Industrial Vallejo, C.P.
07738, México, D.F. Tel. 57 29 60 00, ext. 56571.
Editor Responsable: Juan Humberto Sossa Azuela, RFC SOAJ560723
Research in Computing Science is published by the Center for Computing Research of IPN. Volume 46,
March, 2010. Printing 500. The authors are responsible for the contents of their articles. All rights
reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior
permission of Centre for Computing Research. Printed in Mexico City, March, 2010, in the IPN Graphic
Workshop – Publication Office.
Volume 46
Volumen 46
Special issue:
Natural Language Processing
and its Applications
Volume Editor:
Editor de Volumen
Alexander Gelbukh
Instituto Politécnico Nacional
Centro de Investigación en Computación
México 2010
ISSN: 1870-4069
Copyright © Instituto Politécnico Nacional 2010
Copyright © by Instituto Politécnico Nacional
Instituto Politécnico Nacional (IPN)
Centro de Investigación en Computación (CIC)
Av. Juan de Dios Bátiz s/n esq. M. Othón de Mendizábal
Unidad Profesional “Adolfo López Mateos”, Zacatenco
07738, México D.F., México
http://www.ipn.mx
http://www.cic.ipn.mx
The editors and the Publisher of this journal have made their best effort in
preparing this special issue, but make no warranty of any kind, expressed or
implied, with regard to the information contained in this volume.
All rights reserved. No part of this publication may be reproduced, stored on a
retrieval system or transmitted, in any form or by any means, including
electronic, mechanical, photocopying, recording, or otherwise, without prior
permission of the Instituto Politécnico Nacional, except for personal or
classroom use provided that copies bear the full citation notice provided on
the first page of each paper.
Indexed in LATINDEX and Periodica / Indexada en LATINDEX y Periodica
Printing: 500 / Tiraje: 500
Printed in Mexico / Impreso en México
Preface
Natural Language Processing is an interdisciplinary research area at the border
between linguistics and artificial intelligence aiming at developing computer
programs capable of human-like activities related to understanding or producing texts
or speech in a natural language, such as English or Chinese.
The most important applications of natural language processing include
information retrieval and information organization, machine translation, and natural
language interfaces, among others. However, as in any science, the activities of the
researchers are mostly concentrated on its internal art and craft, that is, on the solution
of the problems arising in analysis or generation of natural language text or speech,
such as syntactic and semantic analysis, disambiguation, or compilation of
dictionaries and grammars necessary for such analysis.
This volume presents 27 original research papers written by 63 authors
representing 25 different countries: Argentina, Canada, China, Cuba, Czech Republic,
Denmark, France, Germany, India, Indonesia, Islamic Republic of Iran, Italy, Japan,
Republic of Korea, Mexico, Republic of Moldova, Pakistan, Portugal, Romania,
Spain, Sweden, Tajikistan, Turkey, United Kingdom, and United States. The volume
is structured in 8 thematic areas of both theory and applications of natural language
processing:
– Semantics
– Morphology, Syntax, Named Entity Recognition
– Opinion, Emotions, Textual Entailment
– Text and Speech Generation
– Machine Translation
– Information Retrieval and Text Clustering
– Educational Applications
– Applications
The papers included in this volume were selected on the basis of a rigorous
international reviewing process out of 101 submissions considered for evaluation;
thus the acceptance rate of this volume was 27%.
I would like to cordially thank all people involved in the preparation of this volume.
In the first place I want to thank the authors of the published papers for their excellent
research work that gives sense to the work of all other people involved, as well as the
authors of rejected papers for their interest and effort. I also thank the members of the
Editorial Board of the volume and additional reviewers for their hard work on
reviewing and selecting the papers. I thank Sulema Torres and Corina Forăscu for
their valuable collaboration in the preparation of this volume. The submission, reviewing,
and selection process was supported for free by the EasyChair system,
www.EasyChair.org.
Alexander Gelbukh
March 2010
Table of Contents
Semantics
Lexical Representation of Agentive Nominal Compounds
in French and Swedish.................................................................................................3
Maria Rosenberg
Computing Linear Discriminants for Idiomatic Sentence Detection .........................17
Jing Peng, Anna Feldman, Laura Street
Robust Temporal Processing: from Model to System ...............................................29
Tommaso Caselli, Irina Prodanof
Near-Synonym Choice using a 5-gram Language Model..........................................41
Aminul Islam, Diana Inkpen
Morphology, Syntax, Named Entity Recognition
Exploring the N-th Dimension of Language..............................................................55
Prakash Mondal
Automatic Derivational Morphology Contribution
to Romanian Lexical Acquisition ..............................................................................67
Mircea Petic
POS-tagging for Oral Texts with CRF and Category Decomposition .......................79
Isabelle Tellier, Iris Eshkol, Samer Taalab, Jean-Philippe Prost
Chinese Named Entity Recognition
with the Improved Smoothed Conditional Random Fields........................................91
Xiaojia Pu, Qi Mao, Gangshan Wu, Chunfeng Yuan
Ontology-Driven Approach to Obtain Semantically Valid Chunks
for Natural Language Enabled Business Applications .............................................105
Shailly Goyal, Shefali Bhat, Shailja Gulati, C Anantaram
Opinion, Emotions, Textual Entailment
Word Sense Disambiguation in Opinion Mining: Pros and Cons............................119
Tamara Martín, Alexandra Balahur, Andrés Montoyo, Aurora Pons
Improving Emotional Intensity Classification
using Word Sense Disambiguation..........................................................................131
Jorge Carrillo de Albornoz, Laura Plaza, Pablo Gervás
Sentence Level News Emotion Analysis
in Fuzzy Multi-label Classification Framework ......................................................143
Plaban Kumar Bhowmick, Anupam Basu,
Pabitra Mitra, Abhisek Prasad
Recognizing Textual Entailment:
Experiments with Machine Learning Algorithms and RTE Corpora ......................155
Julio J. Castillo
Text and Speech Generation
Discourse Generation from Formal Specifications
Using the Grammatical Framework, GF..................................................................167
Dana Dannélls
An Improved Indonesian Grapheme-to-Phoneme Conversion
Using Statistic and Linguistic Information ..............................................................179
Agus Hartoyo, Suyanto
Machine Translation
Long-distance Revisions in Drafting and Post-editing ............................................193
Michael Carl, Martin Kay, Kristian Jensen
Dependency-based Translation Equivalents
for Factored Machine Translation............................................................................205
Elena Irimia, Alexandru Ceauşu
Information Retrieval and Text Clustering
Relation Learning from Persian Web: A Hybrid Approach.....................................219
Hakimeh Fadaei, Mehrnoush Shamsfard
Towards a General Model of Answer Typing:
Question Focus Identification..................................................................................231
Razvan Bunescu, Yunfeng Huang
Traditional Rarámuri Songs used by a Recommender System
to a Web Radio ........................................................................................................243
Alberto Ochoa-Zezzatti, Julio Ponce, Arturo Hernández,
Sandra Bustillos, Francisco Ornelas, Consuelo Pequeño
Improving Clustering of Noisy Documents
through Automatic Summarisation.........................................................................253
Seemab Latif, Mary McGee Wood, Goran Nenadic
Educational Applications
User Profile Modeling in eLearning using Sentiment Extraction from Text ...........267
Adrian Iftene, Ancuta Rotaru
Predicting the Difficulty of Multiple-Choice Cloze Questions
for Computer-Adaptive Testing...............................................................................279
Ayako Hoshino, Hiroshi Nakagawa
MathNat - Mathematical Text in a Controlled Natural Language ...........................293
Muhammad Humayoun, Christophe Raffalli
Applications
A Low-Complexity Constructive Learning Automaton Approach
to Handwritten Character Recognition ....................................................................311
Aleksei Ustimov, M. Borahan Tümer, Tunga Güngör
Utterances Assessment in Chat Conversations ........................................................323
Mihai Dascalu, Stefan Trausan-Matu, Philippe Dessus
Punctuation Detection with Full Syntactic Parsing..................................................335
Miloš Jakubíček, Aleš Horák
Author Index ..........................................................................................................345
Editorial Board of the Volume .............................................................................347
Additional Referees ...............................................................................................347
Semantics
Lexical Representation of Agentive Nominal Compounds
in French and Swedish
Maria Rosenberg
Stockholm University, maria.rosenberg@fraita.su.se
Abstract. This study addresses the lexical representation of French VN and
Swedish NV-are agentive nominal compounds. The objective is to examine
their semantic structure and output meaning. The analysis shows that, as a result
of their semantic structure, the compounds group into some major output
meanings. Most frequently, the N constituent corresponds to an Undergoer in
the argument structure of the V constituent, and the compound displays an
Actor role, which more precisely denotes entities such as Persons, Animals,
Plants, Impersonals, Instruments or Locatives, specified in the Telic role in the
Qualia. We propose that the Agentive role can be left unspecified with regard to
action modality. In conclusion, our study proposes a unified semantic account
of the French and Swedish compounds, which can have applications for NLP
systems, particularly for disambiguation and machine translation tasks.
Keywords. Agentive nominal compounds, Actor, Undergoer, semantic
structure, lexical representation, Generative Lexicon, telic, disambiguation
1 Introduction
This study addresses the semantics of French and Swedish agentive nominal
compounds that contain an N and a V constituent, manifesting an argumental relation.
French has only one such compound type, which has exocentric structure [1]1:
• [VNy]Nx: porte-drapeau ‘bear-flag=flag bearer’
Table 1 shows the initial data for our study, which aimed to localize Swedish
correspondents of French VN compounds. By going through four bilingual French-Swedish dictionaries, we attested 432 French nominal VN compounds. Among these,
229 were rendered by four Swedish compound types. The remaining cases
corresponded mainly to simple words or syntactic phrases. Apart from the data in
Table 1, our data draws mainly from dictionaries (TLFi and SAOB) and the Internet.
We support our analysis by a restricted sample of representative examples.
1 Romance VN compounds are also analyzed as left-headed: a nominal zero suffix adds to the
V [2], or as right-headed: a nominal zero suffix adds to the compound, considered as a VP [3].
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 3-15
Received 24/11/09
Accepted 16/01/10
Final version 08/03/10
4
Rosenberg M.
Table 1. French VN compounds and their corresponding Swedish compounds in four bilingual
dictionaries.

Compound type    n
VN (fra)       432
NV-are (swe)   108
NV-a (swe)      16
NV (swe)        54
VN (swe)        51
According to Table 1, the Swedish NV-are compound is the most frequent
counterpart. Hence, we focus solely on this Swedish compound in this study. Swedish
NV-are compounds are right-headed and also called synthetic, since they involve
both compounding and derivation. Their formation can be considered as a process of
conflation, uniting two templates [NV]V and [V-are]N into a unified productive
template [[NV]-are]N (cf. [1]):
• [[NV]-are]N: fanbärare ‘flag bearer’
French VN and Swedish NV-are compounds give rise to polysemy and sense shifting
within the agentive domain. They denote not only humans, but all sorts of animate
entities, e.g. animals, plants and insects, as well as artefacts, e.g. instruments,
impersonal agents and locatives. They can also refer to places, events and results.
Thus, morphologically, VN and NV-are correspond to two constructions, which have
several underlying semantic structures.
Our main objective is to examine the lexical representation of the compounds. We
explore the semantic roles of the N constituents and the semantic characteristics of the
V constituents, as well as their semantic structures. Note that both the N constituent
and the compound fulfil different roles in the argument structure of the V. The role
displayed by the compound corresponds to its output meaning. Despite the formal
differences of French and Swedish compounds, we aim at a unified semantic account.
Moreover, we discuss the importance of action modality as a component in the lexical
representation of agentive compounds. Automatic analyses of nominal compounds
constitute an intriguing question within NLP. In order for our study to have some
predictive power, we attempt to relate the semantic structures and output meanings to
productivity and frequency. At present, we are laying the groundwork for a future
implementation of regular lexical morphology principles in a machine translation
(MT) system.
Section 2 addresses the morphological context. Section 3 deals with the semantic
characteristics of the constituents within the compounds. In section 4, we analyze the
semantic structures and the output meanings of the compounds. Section 5 discusses
the notion of action modality. In Section 6, we propose GL representations for the
three most frequent cases. Section 7 discusses potential applications within NLP, and
Section 8 contains a conclusion.
2 Compounds in Morphology
The morphological approach is lexeme-based, and adheres to Construction
Morphology, being elaborated by Booij (e.g. [1]). A compound is defined as a
sequence which cannot be generated otherwise than by morphological rules. Hence, it
does not have syntactic structure [4]. This criterion is valid for French VN and
Swedish NV-are compounds. Interaction between syntax and lexicon is however
tolerated: the lexical rules may make use of syntactic information [5]. “Word-formation patterns can be seen as abstract schemas that generalize over sets of
existing complex words with a systematic correlation between form and meaning”
[1]. The generalizations are expressed in the lexicon by assuming intermediate levels
of abstractions between the most general schema and individual existent compounds.
Hence, the lexicon is hierarchically structured. The morphology combines three
aspects of complex words: phonological form, formal structure and meaning. The
architecture of grammar is tripartite and parallel [6]. Abstract schemas coexist with
their individual instantiations in the lexicon. Thus, outputs of productive rules can
also be listed [1].
3 Semantic Characteristics of the Compounds
3.1 Argument Structure and Semantic Roles of the N Constituent
Agentive compounds contain an argumental relation between the V and N
constituents. According to our analysis, the N constituent can correspond, more or
less, to all four types of arguments, distinguished in the Generative Lexicon (GL)
framework [7]:
• True arguments: ouvre-boîte ‘open-can=can opener’, burköppnare ‘can opener’.
• Default arguments: claque-soif ‘die-thirst=person dying of thirst’, cuit-vapeur
‘boil-steam=steamer’, ångkokare ‘steam+boiler=steamer’, betonggjutare ‘concrete
caster’.
• Shadow arguments: marche-pied ‘march-foot=step, running board’, bensparkare
‘leg kicker’.
• True adjuncts: réveille-matin ‘wake-morning=alarm clock’, trädkrypare
‘tree+crawler=bird’.
In other terms, the semantic roles of the N constituents can correspond to Agent
(croque-madame ‘crunch-madam=toast’), Patient2 (ouvre-boîte ‘open-can=can
opener’), Theme (hatthängare ‘hat+hanger=hat-rack’), Place (bordslöpare
‘table+runner=cloth’), Time (dagdrömmare ‘day dreamer’), Instrument/Manner
(cuit-vapeur ‘steam+boiler=steamer’, fotvandrare ‘foot+rambler’), Cause (claque-soif
‘die-thirst=someone dying of thirst’, sorgedrickare ‘grief+drinker’) or Goal
(cherche-pain ‘search-bread=beggar’, målsökare ‘target seeker’).
2 Patient corresponds to an entity, internally affected by the event expressed by the V, whereas
Theme corresponds to an entity in motion, in change or being located [8].
However, an overwhelming majority of the French VN compounds, 96% (415/432),
and of the Swedish NV-are compounds, 97% (105/108), in our data contain an N
which is the direct object of a transitive V. Furthermore, about 73% (79/108) of the
Swedish NV-are counterparts of the French VN compounds contain semantically
similar lexical units, such as allume-gaz ‘light-gas=gas lighter’ vs. gaständare ‘gas
lighter’. Hence, an MT system could benefit from an implementation of these facts
(cf. Section 7).
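As a minimal sketch of how an MT system might exploit this generalization (the N of a French [V N] compound is the direct object of V, and the typical Swedish counterpart is [N V-are]), a compound can be transferred constituent by constituent. The bilingual lexicon entries and the stemming rule below are illustrative assumptions, not part of the paper:

```python
# Hypothetical toy bilingual lexicon (citation forms); real entries would
# come from the dictionaries used in the study.
FR_SV_VERBS = {"ouvrir": "öppna", "allumer": "tända"}
FR_SV_NOUNS = {"boîte": "burk", "gaz": "gas"}

def vn_to_nvare(french_v: str, french_n: str) -> str:
    """Map a French VN compound (given as V and N lexemes) to a Swedish
    NV-are candidate, assuming the N is the V's direct object."""
    sv_v = FR_SV_VERBS[french_v]
    sv_n = FR_SV_NOUNS[french_n]
    # Assumption: -are attaches to the verb stem, i.e. the citation form
    # minus its final -a (öppna -> öppn-, tända -> tänd-).
    stem = sv_v[:-1] if sv_v.endswith("a") else sv_v
    return f"{sv_n}{stem}are"

print(vn_to_nvare("ouvrir", "boîte"))   # burköppnare 'can opener'
print(vn_to_nvare("allumer", "gaz"))    # gaständare 'gas lighter'
```

This only covers the ~73% of cases where the two languages use semantically similar lexical units; the remaining cases would need lexicalized exceptions.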
3.2 The Four Classes of Aktionsart
The four Aktionsarten [9] can occur within the French and Swedish compounds. The
state reading is rare, but not unproductive. New formations arise quite easily, such as
godisälskare ‘candy lover’.
• Activity: traîne-nuit ‘loaf-night’, dagdrivare ‘day loafer’.
• Accomplishment: presse-citron ‘squeeze-lemon’, pennvässare ‘pencil sharpener’.
• Achievement: presse-bouton ‘push-button’, cigarrtändare ‘cigar lighter’.
• State: songe-malice ‘think-malice=someone who plots evil’, vinkännare
‘wine+knower=connoisseur of wine’.
3.3 Unaccusative and Unergative Verbs
We see that both unaccusative and unergative readings of intransitive verbs can be
attested within the compounds [10], [11]:
• Unaccusative: caille-lait ‘clot-milk=plant’, oljedroppare ‘oil dripper’.
• Unergative: trotte-bébé ‘toddle-baby=baby walker’, hundpromenerare
‘dog+walker=person who takes the dog out for a walk’.
4 Semantic Structures and Output Meanings of the Compounds
The output meaning of French VN and Swedish NV-are compounds is taken to be a
function of the meanings of their constituents [6]. The agentive compounds
themselves display a role in the argument structure of the V (cf. [12] for French VN
compounds). The N constituent can be classified for semantic macro-role, Actor or
Undergoer [8] (correspond more or less to Proto-Agent and Proto-Patient [13]). In
general, the compound corresponds to the Actor (including thematic roles such as
Agent, Instrument and Experiencer), and its N constituent to an Undergoer
(comprising roles such as Patient, Theme, Source and Recipient) of a transitive V. In
order to come up with a more fine-grained semantic analysis, we split up the Actor
interpretation into Actors corresponding to first arguments, Instruments, Locatives
and Causatives.
We propose that the construction itself links to the output meaning (N3). The
structure of French VN compounds corresponds to [V1N2]N3 in our analysis. The same
proposal is made for Swedish NV-are compounds. Instead of linking the Actor
interpretation to the suffix, it links to the entire construction [N1V2-are]N3. Thus, the
meaning of the compound is the output of its semantic structure. We assume French
VN and Swedish NV-are compounds to have similar semantic structures; their
exocentricity or endocentricity, as well as the order between the V and N constituents,
are of minor importance. Two implications follow from our proposal: the formal
structure, not the -are suffix, is polysemous; null elements are not stipulated. We
adopt Jackendoff’s framework [6] to account for the semantic structures of the
compounds. Table 2 shows the frequency of the output meanings of French VN and
Swedish NV-are compounds in the initial data. The most frequent cases are
Instrument, Agent and Instrumental Locative. They account for more than 90 % of all
cases. This figure is confirmed for a collection of 1075 French VN compounds drawn
from TLFi [14], but further data needs to be added for Swedish NV-are compounds.
Table 2. Output meanings of French VN and Swedish NV-are compounds in the initial data.

              ACTOR                                         UND       PLACE     EV       RES         n
              Arg1       INSTR      LOC       CAUS
VN (fra)      128 (30%)  193 (45%)  84 (19%)  3 (0.7%)      2 (0.5%)  1 (0.2%)  18 (4%)  3 (0.7%)  432
NV-are (swe)   40 (37%)   47 (44%)  20 (19%)  1 (0.9%)      –         –         –        –         108
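As a toy illustration (our own encoding, not the authors'), the proposal that the whole construction, rather than an affix, links to the output meaning can be represented by pairing a formal template with a semantic structure; two compounds with different formal structures then share one semantic analysis:

```python
from dataclasses import dataclass

@dataclass
class CompoundAnalysis:
    construction: str    # formal template, e.g. "[V1 N2]N3" or "[N1 V2-are]N3"
    verb: str
    noun: str
    noun_role: str       # macro-role of the N constituent
    output_meaning: str  # role of the whole compound (N3)

# porte-drapeau / fanbärare 'flag bearer': different formal structures,
# same semantic structure (N = Undergoer, compound = Actor/Arg1).
porte_drapeau = CompoundAnalysis("[V1 N2]N3", "porter", "drapeau",
                                 "Undergoer", "Actor (Arg1)")
fanbarare = CompoundAnalysis("[N1 V2-are]N3", "bära", "fana",
                             "Undergoer", "Actor (Arg1)")

# The unified account: only `construction` differs between the two entries.
assert porte_drapeau.noun_role == fanbarare.noun_role
assert porte_drapeau.output_meaning == fanbarare.output_meaning
```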
4.1 Actor is the First Argument
In the Actor interpretation, where the compound corresponds to the first argument of
the V, we find compounds such as porte-drapeau or fanbärare, both ‘flag bearer’: ‘a
flag bearer bears a flag’ (cf. 1-2). In some cases, the V is intransitive, and the N
constituent displays a Place role (cf. 3-4). The compounds denote not only human
agents, but also animals (cf. 3), plants, impersonals (cf. 5) (cf. also [15] who relates
the Agent polysemy to the Animacy hierarchy). Sometimes, according to the
semantics of the V, the compounds manifest an Experiencer role (cf. 6). According to
[6], the function PROTECT (X, Y FROM Z) creates two groups of compounds, ‘N2
protects N1’ (cf. 7-8) and ‘N2 protects from N1’ (cf. 9-10), which denote some sort of
disposal. According to Lieber “verbs which take more than one obligatory internal
argument (e.g., put) [i.e. ditransitives] cannot form the base of synthetic compounds”
[16]. This claim does not seem to be an absolute restriction, in any case not for French
and Swedish (cf. also 17-18 in sub-section 4.3).
1. [porte-1drapeau2]3 = PERSON3α; [BEAR1 (α, FLAG2)]
2. [fan1bär2are]3 = PERSON3α; [BEAR2 (α, FLAG1)]
3. [trotte-1chemin2]3 = ANIMAL3α; [TROT1 (α, ON ROAD2)]
4. [kåk1far2are]3 = PERSON3α; [GO2 (α, IN SLAMMER1)]
5. [lave-1vaisselle2]3 = MACHINE3α; [WASH1 (α, DISH2)]
6. [vin1känn2are]3 = PERSON3α; [KNOW2 (α, WINE1)]
7. [garde-1roue2]3 = DISPOSAL3α; [PROTECT1 (α, WHEEL2, FROM INDEF)]
8. [blus1skydd2are]3 = DISPOSAL3α; [PROTECT2 (α, BLOUSE1, FROM INDEF)]
9. [garde-1boue2]3 = DISPOSAL3α; [PROTECT1 (α, INDEF, FROM MUD2)]
10. [blixt1skydd2are]3 = DISPOSAL3α; [PROTECT2 (α, INDEF, FROM LIGHTNING1)]
4.2 Instrument
Some Instrument denoting compounds are ouvre-boîte or burköppnare, both ‘can
opener’: ‘one opens a can with a can opener’, or casse-noix or nötknäppare, both
‘nutcracker’. This meaning is the most productive one in both French and Swedish.
11. [ouvre-1boîte2]3 = INSTR3α; [OPEN1 (INDEF, CAN2, WITH α)]
12. [burk1öppn2are]3 = INSTR3α; [OPEN2 (INDEF, CAN1, WITH α)]
13. [casse-1noix2]3 = INSTR3α; [CRACK1 (INDEF, NUT2, WITH α)]
14. [nöt1knäpp2are]3 = INSTR3α; [CRACK2 (INDEF, NUT1, WITH α)]
4.3 Locative
French VN and Swedish NV-are compounds quite frequently have a Locative
interpretation. It is close to the Instrument meaning, but instead of denoting
something that one does things with, the compound denotes a location: ‘one burns
incense in a brûle-parfum ‘censer’’ (cf. 15-16) or ‘one hangs saucepans on a saucepan
hanger’ (cf. 17-18).
15. [brûle-1parfum2]3 = LOC3α; [BURN1 (INDEF, INCENSE2, IN α)]
16. [kaffe1bränn2are]3 = LOC3α; [BURN2 (INDEF, COFFEE1, IN α)]
17. [accroche-1casseroles2]3 = LOC3α; [HANG1 (INDEF, SAUCEPAN2, ON α)]
18. [kastrull1häng2are]3 = LOC3α; [HANG2 (INDEF, SAUCEPAN1, ON α)]
4.4 Causative
Some of the rare French VN compounds that accept unaccusative and unergative Vs
receive a reading involving a causative relation. We assume the same semantic structure
for both cases: an additional argument (the causer or Actor) adds to the V, and the N
is interpreted as an Undergoer (not acting entirely volitionally) of the V (cf. [17]). For
example, trotte-bébé ‘toddle-baby=baby walker’, is a device that makes the baby
toddle. In Swedish, folkförödare ‘people+devastater=tuberculosis’ involves a
causative relation.
19. [trotte-1bébé2]3= DEVICE3α; [CAUSE (α (TODDLE1 (BABY2)))]
20. [folk1föröd2are]3 = DISEASE3α; [CAUSE (α (DEVASTATE2 (PEOPLE1)))]
4.5 Undergoer
Exceptionally, a few French VN compounds have an Undergoer interpretation, in
which the N constituent, instead, is an Actor. This case is thus the opposite of the
Agent case in sub-section 4.1. For example, croque-monsieur ‘crunch-sir=toast (that
the sir crunches)’, or pique-poule (normally spelled as picpoul) ‘pick-hen=grape
(picked by hens)’. This meaning is unproductive in contemporary French, and seems
to be ruled out for Swedish NV-are compounds.
21. [croque-1monsieur2]3= UND3α; [CRUNCH1 (SIR2, α)]
4.6 Place and Event
Apart from the output meanings above, French VN compounds can denote the place,
where the event expressed takes place: coupe-gorge ‘cut-throat=dangerous place
where one risks having one’s throat cut’. They are often toponyms, such as Chante-merle ‘sing-blackbird=a place where the blackbirds sing’. We have not attested any
Swedish NV-are compound with a Place meaning (cf. diner in English).
22. [coupe-1gorge2]3= PLACE3; [LOC (CUT1 (INDEF, THROAT2))]
23. [Chante-1merle2]3= PLACE3; [LOC (SING1 (BLACKBIRD2))]
In addition, French VN and Swedish NV-are compounds can denote the event itself
expressed by the compound, such as höftrullare ‘hip roller=rolling the hip’. Some of
the compounds with an Event meaning can, according to context, have an additional
result interpretation, e.g. baise-main ‘kiss-hand=the act of kissing a hand’ vs. ‘hand-kiss’.
24. [höft1rull2are]3 = EVENT3; ROLL2 (HIP1)
25. [baise-1main2]3= EVENT3; KISS1 (INDEF, HAND2)
The Place and Event cases do not involve any linking variable. The compound’s
output meaning does not correspond to a participant in the argument structure of the
V; the N constituent can either be an Actor of a V, taking one argument, or an
Undergoer of a V, taking two arguments.
5 Action Modality
In [18] a distinction is made between event and non-event English -er nominals,
corresponding more or less to the distinction between stage-level and individual-level
nominals [7], [19]. Inheritance of complement and argument structure correlates with
the event interpretation, whereas instruments and occupations, which do not
presuppose the existence of an event, typically are non-events. Busa [20], instead,
claims that all agentive nominals are best characterized in terms of events, and
distinguishes between a changeable property for stage-level nominals, encoded as an
Agentive role (cf. 26), and a persistent property for individual-level nominals,
encoded as a Telic role (cf. 27):
26. passenger
    QUALIA = [ FORMAL = person
               AGENTIVE = travel on vehicle ]
27. smoker
    QUALIA = [ FORMAL = person
               TELIC = smoke ]
Moreover, Busa [20] argues that state predicates of individual-level nominals can also
encode for an agentive role, such as Habit for smoker or Ability for violinist:
28. violinist
    QUALIA = [ FORMAL = person
               TELIC = play violin
               AGENTIVE = ability to play violin ]

29. smoker
    QUALIA = [ FORMAL = person
               TELIC = smoke
               AGENTIVE = habit to smoke ]
Jackendoff [6], referring to [21], emphasizes that action modality is an important
component for the interpretation and lexical representation of agentive nominals, and
not only a matter of pragmatics. There are five major types:
• Current (e.g. gâte-fête, festförstörare ‘party trasher’)
• Ability (e.g. gobe-mouches, flugsnappare ‘fly catcher’)
• Habit (e.g. rabat-joie, glädjedödare ‘killjoy’)
• Occupation (e.g. croque-mort, likbärare ‘pall bearer’)
• Proper function (e.g. ouvre-boîte, burköppnare ‘can opener’)
The Current modality refers to a specific activity on a specific occasion. It concerns
stage-level nominals and encodes as an Agentive role. Ability presupposes a potential
event (may or may not occur), whilst Habit presupposes repetitive events. Occupation
regards persons, practicing the profession indicated by the compound. Proper function
concerns objects, and is true irrespectively of actual situations. Thus, the last four
modalities involve state predicates and encode as a Telic role, but could additionally
encode for an Agentive role [20].
In Table 3, we relate the semantic structures within French VN and Swedish NV-are compounds to action modalities. We see that only Proper Function is relevant for
objects, all other modalities concern Actors. None of the modalities are relevant for
compounds with Place or Event meanings. According to our data, Current (stage-level
interpretation) is rarely lexicalized among French VN and Swedish NV-are
compounds.
Table 3. Semantic structures in relation to action modalities.

ACTOR (Arg1)   Current, Habit, Ability, Occupation
CAUSATIVE      Current, Habit, Ability, ?
INSTRUMENT     Proper function
LOCATIVE       Proper function
PLACE          ?
EVENT          ?
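The mapping in Table 3 can be sketched as a simple lookup. This is our own hedged reading of the table: labels are ours, and uncertain (‘?’) cells are treated as unattested:

```python
# Which action modalities each semantic structure admits, following Table 3.
# '?' cells and unattested combinations are represented as absent.
ADMISSIBLE_MODALITIES = {
    "Actor (Arg1)": {"Current", "Habit", "Ability", "Occupation"},
    "Causative":    {"Current", "Habit", "Ability"},
    "Instrument":   {"Proper function"},
    "Locative":     {"Proper function"},
    "Place":        set(),   # no modality relevant
    "Event":        set(),   # no modality relevant
}

# Only Proper Function is relevant for object-denoting structures:
for structure in ("Instrument", "Locative"):
    assert ADMISSIBLE_MODALITIES[structure] == {"Proper function"}
```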
In relation to the general notion of modality, action modality is normally labelled
as dynamic, which can be abilitive or volitive [22]. Nuyts [23] proposes that dynamic
modality is a subcategory of quantificational aspect, since notions such as
“ability/potential” and “need” are semantically similar to notions such as “iterative”,
“habitual” and “generic”. Furthermore, action modality is not overtly linguistically
coded and does not affect the lexical content of the verb stem [24]. It can be
lexicalized and does not depend solely on context for its interpretation. Hence, action
modality, which cannot be defined as an attitudinal expression, seems to be a sort of
objective modality [25]. In our view, this notion is not needed here: we propose that
the Agentive can be left underspecified for action modality, and that only the Telic
is important for the lexical representation of agentive nominal compounds in French
and Swedish.
6 GL Representations of the Most Frequent Cases
In order for our study to have some predictive power and importance for NLP
systems, we focus on the lexical representation of the three most frequent cases,
namely Actor (where the compound is the Arg1 of the V), Instrument and Locative, in
which the N constituent is an Undergoer and the compound an Actor in relation to the
V. Another possible analysis, different from ours, would be to consider the Actor
interpretation as a case of lexical underspecification [26]. The other semantic
structures and output meanings, some of them unproductive, are marginal and can
probably be listed exhaustively. Nevertheless, instead of proposing a single lexical
rule with a common denominator (cf. [17] for French VN compounds), we propose different
lexical representations. The output meanings of the compounds are assumed to be
specified in the Type structure, and their internal semantic structure in the Telic role.
We do not assume an “instrumental subject” interpretation of compounds with
Instrument or Locative meanings: ‘*a can opener opens can’ or ‘*a clothes hanger
hangs clothes’ are not well-formed, in our opinion. Instead, we introduce a default
argument, preferably a human agent (not a user w, cf. [21]). We use a simplified form
of the GL [7], [27], and omit for example Constitutive and Agentive in the Qualia
structure.
30. porte-drapeau, fanbärare ‘flag bearer’
TYPESTR =
[ ARG1 = x: human ]
ARGSTR =
[ D-ARG1 = y: flag ]
EVENTSTR =
[ D-E1 = e: process ]
QUALIA =
FORMAL = x
TELIC = bear_flag_act (e, x, y)
31. ouvre-boîte, burköppnare ‘can opener’
TYPESTR =
[ ARG1 = z: artefact_instrument ]
ARGSTR =
[ D-ARG1 = x: human
  D-ARG2 = y: can ]
EVENTSTR =
[ D-E1 = e: process ]
QUALIA =
FORMAL = z
TELIC = open_can_act (e, x, y, with z)
Rosenberg M.
32. accroche-casseroles, kastrullhängare ‘saucepan hanger’
TYPESTR =
[ ARG1 = z: artefact_locative ]
ARGSTR =
[ D-ARG1 = x: human
  D-ARG2 = y: saucepan ]
EVENTSTR =
[ D-E1 = e: process ]
QUALIA =
FORMAL = z
TELIC = hang_saucepan_act (e, x, y, on z)
Note that x in the representations (30-32) can be filled with any entity able to display
an Actor role. Likewise, y can be filled with any entity manifesting an Undergoer role.
The events can also be of different types. Our data seems to indicate that intransitive
Vs and N constituents with Place or Time meanings (roles displayed by adjuncts in
syntax) occur especially in compounds with Actor (Arg1) meanings. In sum, the
specification of arguments and predicate structures in the Qualia is important for the
analysis of compounds: those included here are all linked to the Telic. Furthermore,
phrase structure schemes could be used to account for their compounding (cf. [27]).
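For use in an implementation, representations such as (30)-(32) can be encoded as ordinary data structures. The sketch below renders entry (31) as a plain Python dictionary; the layout and the helper function are our illustrative assumptions, with only the field names and values taken from the representation itself.

```python
# Illustrative encoding of GL entry (31), ouvre-boîte / burköppnare 'can opener'.
# The dictionary layout is an assumption for exposition; only the field names
# and values come from the representation in the text.
entry_31 = {
    "lemma": ("ouvre-boîte", "burköppnare"),
    "typestr": {"ARG1": ("z", "artefact_instrument")},
    "argstr": {"D-ARG1": ("x", "human"), "D-ARG2": ("y", "can")},
    "eventstr": {"D-E1": ("e", "process")},
    "qualia": {
        "FORMAL": "z",
        # Telic: open_can_act(e, x, y, with z)
        "TELIC": ("open_can_act", ["e", "x", "y"], ("with", "z")),
    },
}

def output_type(entry):
    """Return the compound's output meaning, read off the Type structure."""
    return entry["typestr"]["ARG1"][1]

print(output_type(entry_31))  # artefact_instrument
```

An NLP lexicon could store one such record per compound, with the Type structure giving the output meaning and the Telic giving the internal predicate structure, as assumed in the text.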
7 Discussion
Our analysis of compounds is domain independent, and aims at general semantic
structures (cf. [28]), supposed to be lexicalized and more or less productive. Through
knowledge about productive semantic patterns, new compounds are created and
interpreted [29]. Odd interpretations of compounds are in fact rare [30]. Lapata [31]
underlines three problems that compounds still pose for automatic interpretation
within NLP: (i) their high productivity implies a need to interpret previously unseen
formations; (ii) their internal semantic relation is often implicit; (iii) context and
pragmatics have an impact on their interpretation.
Contextual information can help to disambiguate unknown compounds of the types
included in our study: e.g. subject position in combination with Actor (Arg1)
interpretation, “with” in combination with Instruments, and “in” or “on” in
combination with Locatives (cf. however [30] for the problematic distinction between
agent and instrument at both the morphological and the syntactic level). Since the V
constituents in French VN and Swedish NV-are compounds cannot always occur as
independent Ns in syntax (*porte-, *häng-/*hängare), it is not possible to map each
of the constituents onto a conceptual representation as is possible for NN root
compounds (cf. the systems of [32], [33]). However, a disambiguation algorithm can
map the V constituents to their respective verbs and examine distributional properties:
e.g. retrieve frequencies of the verb’s relation to its objects (verb-argument tuples). In
the majority of cases, the N constituent is an internal argument of the (transitive) V
constituent. The set of possible interpretations provided by our study enables manual
disambiguation of compounds in context, which then can be added to the lexicon.
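The distributional check described above can be sketched as follows; the verb-object counts and the mapping from bound V constituents to verbs are invented for illustration.

```python
from collections import Counter

# Hypothetical verb-object tuple counts harvested from a parsed corpus
# (the figures are invented for illustration).
verb_object_counts = Counter({
    ("porter", "drapeau"): 17,   # 'carry flag'
    ("ouvrir", "boîte"): 42,     # 'open can'
    ("ouvrir", "porte"): 350,    # 'open door'
})

# Hypothetical mapping from bound V constituents to their verbs.
v_constituent_to_verb = {"porte-": "porter", "ouvre-": "ouvrir"}

def object_likelihood(v_constituent, noun):
    """Relative frequency of `noun` among the verb's objects: evidence that
    the N constituent is the internal argument of the V constituent."""
    verb = v_constituent_to_verb[v_constituent]
    total = sum(c for (v, _), c in verb_object_counts.items() if v == verb)
    return verb_object_counts[(verb, noun)] / total if total else 0.0

# 'boîte' accounts for 42 of the 392 object tokens of 'ouvrir' in the toy counts.
print(round(object_likelihood("ouvre-", "boîte"), 3))  # 0.107
```

High relative frequency of the noun among the verb's objects supports the internal-argument analysis that holds in the majority of our cases.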
Our unified account of French VN and Swedish NV-are compounds can be relevant
for MT and other multilingual language processing tasks involving Romance and
Germanic languages: the GL representation constitutes a neutral platform [27], [32].
Cartoni [34] proposes a prototype MT system for handling constructed neologisms,
to which our analysis could be fitted. The first module
checks unknown words with regard to whether they are potentially constructed. If
they are, it performs a morphological analysis of their structure and lexeme bases.
The second module generates a possible translation of the analyzed construction. The
prototype relies on lexical resources and a set of bilingual Lexeme Formation Rules.
The lexeme bases are checked against the lexical resources, and the rules provide
information on how to translate them into the target language (e.g. French VxNy →
Swedish NyVxare: brise-glace ‘break-ice=icebreaker’ → isbrytare, or alternatively, if
the French V constituent corresponds to a lexically established N in Swedish, French
VxNy → Swedish NyNx: appuie-tête ‘rest-head=headrest’ → huvudstöd).
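A bilingual Lexeme Formation Rule of this kind reduces, in the simplest case, to a string transformation over the two lexeme bases; the tiny lexicon below is invented, and a real system would also need morphological analysis to segment the compound, plus the alternative NyNx rule for cases like appuie-tête.

```python
# Hypothetical French->Swedish lexicon for the V and N bases
# (invented for illustration).
v_lexicon = {"brise": "bryt"}
n_lexicon = {"glace": "is"}

def translate_vn_compound(french_compound):
    """Apply the rule French VxNy -> Swedish NyVx-are,
    e.g. brise-glace -> isbrytare."""
    v, n = french_compound.split("-")
    return n_lexicon[n] + v_lexicon[v] + "are"

print(translate_vn_compound("brise-glace"))  # isbrytare
```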
8 Conclusion
This study has attempted to provide a unified account of the complex semantics of
French VN and Swedish NV-are compounds. We have adopted the frameworks of
Jackendoff [6] and GL [7], [27], and been able to find some general semantic
structures giving rise to particular output meanings. In the most productive semantic
structures, the compounds as well as the N constituents display a role in the argument
structure of the V constituent. We assume the Telic role in the Qualia to be most
important for their lexical representation. Contrary to the opinion expressed in [20],
we suggest that the Agentive role can be left un(der)specified, since it does not add
much to their disambiguation or analysis. In conclusion, we hope that our study can
have applications for NLP systems. One possible application would be a probabilistic
algorithm for disambiguating unseen compounds in domain-independent, unrestricted
text. Our unified account also has relevance
for machine translation between French and Swedish, and for multi-lingual language
processing with regard to Romance and Germanic languages.
Dictionaries
Fransk-svensk ordbok. (1995). Natur och kultur, Stockholm.
Norstedts fransk-svenska ordbok. (1993). Norstedt, Stockholm.
Norstedts stora fransk-svenska ordbok. (1998). Norstedt, Stockholm.
Norstedts stora svensk-franska ordbok. (1998). Norstedt, Stockholm.
TLFi, Le Trésor de La Langue Française informatisé. http://atilf.atilf.fr/tlf.htm
SAOB, Svenska Akademiens Ordbok. http://g3.spraakdata.gu.se/saob/
References
1. Booij, G.: Compounding and Construction Morphology. In: Lieber, R., Štekauer, P. (eds.)
The Oxford Handbook of Compounding, pp. 201-216. Oxford University Press, Oxford
(2009)
2. Bisetto, A.: Italian Compounds of the Accendigas Type: A Case of Endocentric
Formations?. In: Bouillon, P., Estival, D. (eds.) Proceedings of the Workshop on
Compound Nouns: Multilingual Aspects of Nominal Composition, pp. 77-87. ISSCO,
Geneva (1994)
3. Lieber, R.: Deconstructing Morphology: Word Formation in Syntactic Theory. University
of Chicago Press, Chicago (1992)
4. Corbin, D.: Hypothèses sur les frontières de la composition nominale. Cahiers de
grammaire 17, 25-55 (1992)
5. Booij, G.: The Grammar of Words. Oxford University Press, Oxford (2005)
6. Jackendoff, R.: Compounding in the Parallel Architecture and Conceptual Semantics. In:
Lieber, R., Štekauer, P. (eds.) The Oxford Handbook of Compounding, pp. 105-128.
Oxford University Press, Oxford (2009)
7. Pustejovsky, J.: The Generative Lexicon. MIT Press, Cambridge, MA (1995)
8. Van Valin, R. D., Jr.: Semantic Macroroles in Role and Reference Grammar. In: Kailuweit,
R., Hummel, M. (eds.) Semantische Rollen, pp. 62-82. Narr, Tübingen (2002)
9. Vendler, Z.: Verbs and Times. The Philosophical Review 66, 143-160 (1957)
10. Burzio, L.: Italian Syntax: A Government-Binding Approach. Reidel, Dordrecht (1986)
11. Perlmutter, D. M.: The Split Morphology Hypothesis. In: Hammond, M., Noonan, M. (eds.)
Theoretical Morphology: Approaches in Modern Linguistics, pp. 79-100. Academic Press,
San Diego (1988)
12. Fradin, B.: On a Semantically Grounded Difference between Derivations and
Compounding. In: Dressler, W. U. et al. (eds.) Morphology and its Demarcations, pp. 161-182. Benjamins, Amsterdam/Philadelphia (2005)
13. Dowty, D.: Thematic Proto-Roles and Argument Selection. Language 67, 547-619 (1991)
14. Rosenberg, M.: La formation agentive en français : les composés [VN/A/Adv/P]N/A et les
dérivés V-ant, V-eur et V-oir(e). PhD Dissertation, Department of French, Italian and
Classical Languages, Stockholm University (2008)
15. Dressler, W. U.: Explanation in Natural Morphology, Illustrated with Comparative and
Agent-Noun Formation. Linguistics 24, 519-548 (1986)
16. Lieber, R.: Morphology and Lexical Semantics. Cambridge University Press, Cambridge
(2004)
17. Roussarie, L., Villoing, F.: Some Semantic Investigation of the French VN Construction. In:
Bouillon, P., Kanzaki, K. (eds.) Proceedings of the Second International Workshop on
Generative Approaches to the Lexicon, Geneva, Switzerland (2003).
18. Rappaport Hovav, M., Levin, B.: -er Nominals: Implications for the Theory of Argument
Structure. In: Stowell, T., Wehrli, E. (eds.) Syntax and the Lexicon. Syntax and Semantics
26, pp. 127-153. Academic Press, San Diego, CA (1992)
19. Carlson, G. N.: Reference to Kinds in English. PhD Dissertation, University of
Massachusetts, Amherst (1977)
20. Busa, F.: The Semantics of Agentive Nominals. In: Saint-Dizier, P. (ed.) Proceedings of
ECAI Workshop on Predicative Forms for the Lexicon. Toulouse, France (1996)
21. Busa, F.: Compositionality and the Semantics of Nominals. PhD Dissertation, Department
of Computer Science, Brandeis University (1996)
22. Palmer, F. R.: Mood and Modality. 2nd ed. Cambridge University Press, Cambridge (2001)
23. Nuyts, J.: The Modal Confusion. In Klinge, A., Müller, H. H. (eds.) Modality: Studies in
Form and Function, pp. 5-38. Equinox, London (2005)
24. Bybee, J. L.: Morphology: A Study of the Relation between Meaning and Form.
Benjamins, Amsterdam/Philadelphia (1985)
25. Herslund, M.: Subjective and Objective Modality. In: Klinge, A., Müller, H. H. (eds.),
Modality: Studies in Form and Function, pp. 39-48. Equinox, London (2005)
26. Pustejovsky, J.: The Semantics of Lexical Underspecification. Folia Linguistica 32, 323-348 (1998)
27. Johnston, M., Busa, F.: Qualia Structure and the Compositional Interpretation of
Compounds. In: Viegas, E. (ed.) Breadth and Depth of Semantic Lexicons, pp. 167-189.
Kluwer, Dordrecht (1999)
28. Fabre, C.: Interpretation of Nominal Compounds: Combining Domain-Independent and
Domain-Specific Information. In: Proceedings of the 16th Conference of Computational
Linguistics, pp. 364-369. Copenhagen, Denmark (1996)
29. Ryder, M. E.: Ordered Chaos: The Interpretation of English Noun-Noun Compounds.
University of California Press, Berkeley (1994)
30. Isabelle, P.: Another Look at Nominal Compounds. In: Proceedings of the 10th International
Conference on Computational Linguistics and the 22nd Annual Meeting of the Association
for Computational Linguistics, pp. 509-516. Stanford, California (1984)
31. Lapata, M.: The Disambiguation of Nominalizations. Computational Linguistics 28, 357-388
(2002)
32. McDonald, D. B.: Understanding Noun Compounds. PhD Dissertation, Carnegie Mellon
University, Pittsburgh, Pennsylvania (1982)
33. Finin, T.: The Semantic Interpretation of Nominal Compounds. In: Proceedings of First
Annual National Conference on Artificial Intelligence, pp. 310-315. Stanford, California
(1980)
34. Cartoni, B.: Lexical Morphology in Machine Translation: A Feasibility Study. In:
Proceedings of the 12th Conference of the European Chapter of the ACL, pp. 130-138.
Association for Computational Linguistics (2009).
Computing Linear Discriminants
for Idiomatic Sentence Detection
Jing Peng¹, Anna Feldman¹,², and Laura Street²
¹ Department of Computer Science
² Department of Linguistics
Montclair State University
Montclair, NJ 07043, USA
{pengj,feldmana,streetl1}@mail.montclair.edu
Abstract. In this paper, we describe the binary classification of sentences
into idiomatic and non-idiomatic. Our idiom detection algorithm is based
on linear discriminant analysis (LDA). To obtain a discriminant subspace,
we train our model on a small number of randomly selected idiomatic and
non-idiomatic sentences. We then project both the training and the test
data onto the chosen subspace and use the three nearest neighbor (3NN)
classifier to obtain accuracy. The proposed approach is more general than
previous algorithms for idiom detection: it relies neither on target idiom
types, lexicons, nor large manually annotated corpora, and it does not limit
the search space to a particular linguistic construction.
1 Introduction
Previous work on automatic idiom classification has typically been of two types:
those which make use of type-based classification methods (Lin, 1999; Baldwin
et al., 2002; Fazly and Stevenson, 2006; Bannard, 2007; Fazly et al., 2009) and
those which make use of token-based classification methods (Birke and Sarkar,
2006; Katz and Giesbrecht, 2006; Fazly et al., 2009; Sporleder and Li, 2009).
Type-based classification methods recognize idiomatic expressions (=types) to
include in a lexicon and typically rely on the notion that many idioms share
unique properties with one another. For instance, several idioms are composed
of verb-noun constructions (e.g., break a leg, get a grip, kick the bucket) that
cannot be altered syntactically or lexically (e.g., break a skinny leg, a grip was
got, kick the pail). These unique properties are used to distinguish idiomatic
expressions from other types of expressions in a text. Token-based classification
methods recognize a particular usage (literal vs. non-literal) of a potentially idiomatic expression. Both of these approaches view idioms as multi-word expressions (MWEs) and rely crucially on preexisting lexicons or manually annotated
data. They also tend to limit the search space by a particular type of linguistic construction (e.g., Verb+Noun combinations). The task of automatic idiom
classification is extremely important for a variety of NLP applications; e.g., a
machine translation system must translate held fire differently in The army held
their fire and The worshippers held the fire up to the idol (Fazly et al., 2009).
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 17-28
Received 23/11/09
Accepted 16/01/10
Final version 10/03/10
2 Our Approach
Unlike previous work on idiom detection, we view the solution to this problem
as a two-step process: 1) filtering out sentences containing idioms; 2) extracting
idioms from these filtered-out sentences. In our current work we only consider
step 1, and we frame this task as one of classification. We believe that the
result of filtering out idiomatic sentences is already useful for many applications,
such as machine translation, information retrieval, or foreign/second language
instruction, e.g., for effective demonstrations of contexts in which specific idioms
might occur.
Our idiom detection algorithm is based on linear discriminant analysis (LDA).
To obtain a discriminant subspace, we train our model on a small number of randomly
selected idiomatic and non-idiomatic sentences. We then project both the
training and the test data onto the chosen subspace and use the three nearest
neighbor (3NN) classifier to obtain accuracy. The proposed approach is more
general than previous algorithms for idiom detection: it relies neither on target
idiom types, lexicons, nor large manually annotated corpora, and it does not limit
the search space to a particular type of linguistic construction. The following
sections describe the algorithm, the data and the experiments in more detail.
2.1 Idiom Detection based on Discriminant Analysis
The approach we are taking for idiomatic sentence detection is based on linear
discriminant analysis (LDA) (Fukunaga, 1990). LDA often significantly simplifies
tasks such as regression and classification by computing low-dimensional subspaces
having statistically uncorrelated or discriminant variables. In language
analysis, statistically uncorrelated or discriminant variables are extracted and
utilized for description, detection, and classification. Woods et al. (1986), for
example, use statistically uncorrelated variables for language test scores. A group
of subjects was scored on a battery of language tests, where the subtests measured
different abilities such as vocabulary, grammar or reading comprehension.
Horvath (1985) analyzes speech samples of Sydney speakers to determine the
relative occurrence of five different variants of each of five vowel sounds. Using
these data, the speakers were clustered according to factors such as gender, age,
ethnicity and socio-economic class.
LDA is a class of methods used in machine learning to find the linear combination
of features that best separates two classes of events. LDA is closely related
to principal component analysis (PCA), which finds a linear combination of features
that best explains the data. Discriminant analysis explicitly exploits class
information in the data, while PCA does not.
Idiom detection based on discriminant analysis has several advantages. First,
it does not make any assumption regarding data distributions. Many statistical
detection methods assume a Gaussian distribution of normal data, which is far
from reality. Second, by using a few discriminants to describe the data, discriminant
analysis provides a compact representation of the data, resulting in increased
computational efficiency and real-time performance.
2.2 Linear Discriminant Analysis
In LDA, within-class, between-class, and mixture scatter matrices are used to
formulate the criteria of class separability. Consider a J-class problem, where
m_0 is the mean vector of all data, and m_j is the mean vector of the jth class data.
A within-class scatter matrix characterizes the scatter of samples around their
respective class mean vector, and it is expressed by

    S_w = \sum_{j=1}^{J} p_j \sum_{i=1}^{l_j} (x_{ji} - m_j)(x_{ji} - m_j)^t,        (1)

where l_j is the size of the data in the jth class, p_j (with \sum_j p_j = 1) represents the
proportion of the jth class contribution, and t denotes the transpose operator. A
between-class scatter matrix characterizes the scatter of the class means around
the mixture mean m_0. It is expressed by
    S_b = \sum_{j=1}^{J} p_j (m_j - m_0)(m_j - m_0)^t.        (2)

The mixture scatter matrix is the covariance matrix of all samples, regardless of
their class assignment, and it is given by

    S_m = \sum_{i=1}^{l} (x_i - m_0)(x_i - m_0)^t = S_w + S_b.        (3)
The Fisher criterion is used to find a projection matrix W ∈ R^{q×d} that maximizes

    J(W) = |W^t S_b W| / |W^t S_w W|.        (4)

In order to determine the matrix W that maximizes J(W), one can solve the
generalized eigenvalue problem: S_b w_i = λ_i S_w w_i. The eigenvectors corresponding
to the largest eigenvalues form the columns of W. For a two-class problem, it
can be written in a simpler form: S_w w = m = m_1 - m_2, where m_1 and m_2 are
the means of the two classes. In practice, the small sample size problem is often
encountered, when l < q. In this case S_w is singular, and the maximization
problem can be difficult to solve.
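A direct NumPy transcription of the scatter matrices and the two-class Fisher direction may make the definitions concrete; the toy data below are invented, and the pseudo-inverse is one standard way around the singular-S_w case just mentioned.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices,
    following Eqs. (1)-(2), with class proportions p_j estimated
    from the label counts."""
    m0 = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        pj = len(Xc) / len(X)
        mj = Xc.mean(axis=0)
        Sw += pj * (Xc - mj).T @ (Xc - mj)
        Sb += pj * np.outer(mj - m0, mj - m0)
    return Sw, Sb

# Invented toy data: two 3-dimensional classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(2.0, 1.0, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
Sw, Sb = scatter_matrices(X, y)

# Two-class Fisher direction: solve Sw w = m1 - m2; the pseudo-inverse
# guards against the singular small-sample case.
m1, m2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
w = np.linalg.pinv(Sw) @ (m1 - m2)
print(w.shape)  # (3,)
```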
2.3 Margin Criterion for Linear Dimensionality Reduction
For idiomatic sentence detection, we propose an alternative to the Fisher criterion.
Here we first focus on two-class problems. We note that the goal of LDA is
to find a direction w that simultaneously places the two classes far apart and
minimizes within-class variation. Fisher's criterion (4) achieves this goal.
Alternatively, we can achieve the same goal by maximizing

    J(w) = tr(w^t (S_b - S_w) w),        (5)
where tr denotes the trace operator. Notice that tr(S_b) measures the overall
scatter of the class means. Therefore, a large tr(S_b) implies that the class means
are spread out in the transformed space. On the other hand, a small tr(S_w) indicates
that in the transformed space the spread of each class is small. Thus, when
maximized, J indicates that data points are close to each other within a class,
while they are far from each other if they come from different classes.
To see that our proposal (Eq. 5) is margin based, notice that maximizing
tr(S_b - S_w) is equivalent to maximizing J = (1/2) \sum_{i=1}^{2} \sum_{j=1}^{2} p_i p_j d(C_i, C_j), where
p_i denotes the probability of class C_i. The interclass distance d is defined as
d(C_i, C_j) = d(m_i, m_j) - tr(S_i) - tr(S_j), where m_i represents the mean of class
C_i, and S_i represents the scatter matrix of class C_i. Here d(C_i, C_j) measures the
average margin between the two classes. Therefore, maximizing our objective produces
large-margin linear discriminants. Large-margin discriminants often result
in better generalization (Vapnik, 1998). In addition, there is no need to calculate
the inverse of S_w, thereby avoiding the small sample size problem associated
with the Fisher criterion.
3 Computing Linear Discriminants with Semi-Definite Programming
Suppose that w optimizes (5). Then so does cw for any constant c ≠ 0. Thus we
require that w have unit length. The optimization problem then becomes

    max_w  tr(w^t (S_b - S_w) w)
    subject to: ||w|| = 1.

This is a constrained optimization problem. Since tr(w^t (S_b - S_w) w) =
tr((S_b - S_w) w w^t) = tr((S_b - S_w) X), where X = w w^t, we can rewrite the above
constrained optimization problem as

    max_X  tr((S_b - S_w) X)
    subject to: I • X = 1,  X ⪰ 0,        (6)

where I is the identity matrix, the inner product of symmetric matrices is
A • B = \sum_{i,j=1}^{n} a_{ij} b_{ij}, and X ⪰ 0 means that the symmetric matrix X is
positive semi-definite. Indeed, if X is a solution to the above optimization problem,
then X ⪰ 0 and I • X = 1 imply ||w|| = 1, assuming rank(X) = 1.
The above problem is a semi-definite program (SDP), where the objective
is linear with linear matrix inequality and affine equality constraints. Because
linear matrix inequality constraints are convex, SDPs are convex optimization
problems. The significance of SDP is due to several factors. SDP is an elegant
generalization of linear programming, and inherits its duality theory. For a
comprehensive overview of SDP, see (Vandenberghe and Boyd, 1996).
SDPs arise in many applications, including sparse PCA, learning kernel matrices,
Euclidean embedding, and others. In general, generic methods are rarely
used for solving SDPs, because their time grows at the rate of O(n³) and their
memory grows as O(n²), where n is the number of rows (or columns) of a
semi-definite matrix. When n is greater than a few thousand, SDPs are typically not
used. However, there are algorithms with a good theoretical foundation for
solving SDPs (Vandenberghe and Boyd, 1996). In addition, semi-definite programming
is a very useful technique for solving many problems. For example, SDP
relaxations can be applied to clustering problems: after solving an SDP,
final clusters can be computed by projecting the data onto the space spanned by
the first few eigenvectors of the SDP solution. For large-scale problems, there is
a tremendous opportunity for exploiting special structure, such as that
suggested in (Ben-Tal and Nemirovski, 2004; Nesterov, 2003).
Assume rank(X) = 1. Since X is symmetric, one can show that rank(X) = 1
iff X = w w^t for some vector w. Therefore, we can recover w from X as follows.
Select any column (say the ith column) of X such that X(1, i) ≠ 0, and let
    w = X(:, i) / X(1, i),        (7)
where X(:, i) denotes the ith column of the matrix X. Thus, our goal here is to
ensure the solution X to the above constraint optimization problem has rank at
most 1.
One way to guarantee rank(X) = 1 is to use rank(X) = 1 as an additional
constraint in the optimization problem. However, the constraint rank(X) = 1 is
not convex and the resulting problem is difficult to solve. It turns out that the
above formulation (6) is sufficient to ensure that the rank of the optimal solution
X to Eq. (6) is one, i.e., rank(X) = 1.
Theorem 1. Let X be the solution to the semi-definite program (6). Also, let
rank(X) = r. Then r = rank(X) = 1.
The proof of the theorem is in Appendix A. The theorem states that our
procedure for computing w from the matrix X (Eq. 7) is guaranteed to produce
the correct answer. We call our algorithm SDP-LDA. An attractive property
associated with our algorithm is that it does not have any procedural parameters.
Thus, it does not require expensive cross-validation to determine its optimal
performance.
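Since Theorem 1 guarantees a rank-one solution X = w w^t with ||w|| = 1, for this particular program w is the unit-norm leading eigenvector of S_b - S_w; the sketch below exploits that observation with an ordinary symmetric eigendecomposition rather than a general-purpose SDP solver (the toy matrices are invented).

```python
import numpy as np

def sdp_lda_direction(Sb, Sw):
    """Maximize tr(w^t (Sb - Sw) w) subject to ||w|| = 1.
    By the rank-one guarantee (X = w w^t), w is the leading
    eigenvector of the symmetric matrix Sb - Sw."""
    eigvals, eigvecs = np.linalg.eigh(Sb - Sw)  # ascending eigenvalues
    w = eigvecs[:, -1]                          # eigenvector of the largest one
    return w / np.linalg.norm(w)

# Invented toy scatter matrices.
Sb = np.array([[3.0, 1.0], [1.0, 2.0]])
Sw = np.array([[1.0, 0.0], [0.0, 1.0]])
w = sdp_lda_direction(Sb, Sw)
print(np.isclose(np.linalg.norm(w), 1.0))  # True
```

As the text notes, this procedure has no tunable parameters: the discriminant follows directly from the two scatter matrices.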
4 Dataset
In our experiments, we used the dataset described by Fazly et al. (2009). This
is a dataset of verb-noun combinations extracted from the British National Corpus
(BNC, Burnard (2000)). The VNC tokens are annotated as either literal,
idiomatic, or unknown. The list contains only those VNCs whose frequency in
BNC was greater than 20 and that occurred at least in one of two idiom dictionaries (Cowie et al., 1983; Seaton and Macaulay, 2002). The dataset consists of
2,984 VNC tokens³.
Since our task is framed as sentence classification rather than MWE extraction
and filtering, we had to translate these data into our format. Basically, our
dataset has to contain sentences with the following tags: I (= idiomatic sentence),
L (= literal), and Q (= unknown). Translating the VNC data into our format is
not trivial. A sentence that contains a VNC idiomatic construction can be
unquestionably marked as I (= idiomatic); however, a sentence that contains a
non-idiomatic occurrence of a VNC cannot be marked as L, since such sentences
could contain other types of idiomatic expressions (e.g., prepositional
phrases) or even other figures of speech. So, by automatically marking all sentences
that contain non-idiomatic usages of VNCs, we create an extremely noisy
dataset of literal sentences. The dataset consists of 2,550 sentences, of which
2,013 are idiomatic sentences and the remaining 537 are literal sentences.
5 Experiments
We first apply the bag-of-words model to create a term-by-sentence representation
of the 2,550 sentences in a 6,844-dimensional term space. The Google stop
list is used to remove stop words.
We randomly choose 300 literal sentences and 300 idiomatic sentences for
training, and randomly choose 100 literals and 100 idioms from the remaining
sentences for testing. Thus the training dataset consists of 600 examples, while
the test dataset consists of 200 examples. We train our model on the training
data and obtain one discriminant subspace. We then project both training and
test data onto the chosen subspace. Note that for the two-class case (literal vs.
idiom), a one-dimensional subspace is sufficient. In the reduced subspace, we compare
three classifiers: the three nearest neighbor (3NN) classifier, the quadratic
classifier that fits multivariate normal densities with covariance estimates stratified
by class (Krzanowski, 1988), and support vector machines (SVMs) with
the Gaussian kernel (Cristianini and Shawe-Taylor, 2000). The kernel parameter
was chosen through 10-fold cross-validation. We repeat the experiment 10
times to obtain the average accuracy rates registered by the three methods.
Table 1 shows the accuracy rates over the ten runs.
We compare the proposed technique against a random baseline approach.
The baseline approach flips a fair coin. If the outcome is head, it classifies a
given sentence as idiomatic. If the outcome is tail, it classifies a given sentence
as a regular sentence.
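The experimental pipeline (bag-of-words vectors, a one-dimensional discriminant subspace, 3NN) can be sketched end to end; the synthetic vectors below merely stand in for the real term-by-sentence matrix, and normalizing the within-class scatter per class is our own practical choice to keep S_b and S_w on a comparable scale, not something the text specifies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for bag-of-words sentence vectors (invented data):
# two classes with shifted means in a 20-dimensional "term space".
X_train = np.vstack([rng.normal(0.0, 1.0, (300, 20)),
                     rng.normal(1.0, 1.0, (300, 20))])
y_train = np.array([0] * 300 + [1] * 300)
X_test = np.vstack([rng.normal(0.0, 1.0, (100, 20)),
                    rng.normal(1.0, 1.0, (100, 20))])
y_test = np.array([0] * 100 + [1] * 100)

# Between-class and (per-class normalized) within-class scatter.
m0 = X_train.mean(axis=0)
Sw, Sb = np.zeros((20, 20)), np.zeros((20, 20))
for c in (0, 1):
    Xc = X_train[y_train == c]
    pj, mj = len(Xc) / len(X_train), Xc.mean(axis=0)
    Sw += pj * np.cov(Xc, rowvar=False)
    Sb += pj * np.outer(mj - m0, mj - m0)

# Margin-based discriminant: leading eigenvector of Sb - Sw.
w = np.linalg.eigh(Sb - Sw)[1][:, -1]

# Project onto the one-dimensional subspace and classify with 3NN.
z_train, z_test = X_train @ w, X_test @ w
preds = np.array([
    int(y_train[np.argsort(np.abs(z_train - z))[:3]].sum() >= 2)
    for z in z_test
])
accuracy = (preds == y_test).mean()
print(accuracy)  # typically well above the 0.50 coin-flip baseline
```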
Even though we used Fazly et al. (2009)'s dataset for these experiments (see
Section 4), a direct comparison with their methods is impossible here because
our tasks are formulated differently. Fazly et al. (2009)'s unsupervised model,
which relies on the so-called canonical forms (CForm), gives 72.4% (macro-)accuracy
on the extraction of idiomatic tokens when evaluated on their test data.

Table 1. Classification accuracy rates computed by the three competing methods
compared against the baseline.

3NN      Quadratic  SVMs     Baseline
0.8015   0.7690     0.7890   0.50

³ To read more about this dataset, the reader is referred to Cook et al. (2008).
6 Analysis
To gain insight into the performance of the proposed technique, we created
a manually annotated dataset to avoid noise in the literal data. We
asked three human subjects to annotate 200 sentences from the VNC dataset as
idiomatic, non-idiomatic or unknown. 100 of these sentences contained idiomatic
expressions from the VNC data. We then merged the results of the annotation
by majority vote.
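Merging three annotators' labels by majority vote can be made precise as follows; the labels shown are invented, and we fall back to Q (unknown) when all three annotators disagree.

```python
from collections import Counter

def majority_label(labels):
    """Majority vote over per-sentence annotations ('I', 'L', or 'Q');
    returns 'Q' (unknown) when no label reaches a majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "Q"

# Invented annotations from three hypothetical annotators.
print(majority_label(["I", "I", "L"]))  # I
print(majority_label(["I", "L", "Q"]))  # Q
```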
We also measured the inter-annotator agreement (the Cohen kappa k, Cohen
(1960); Carletta (1996)) on the task. Interestingly, the Cohen kappa coefficient
was much higher for the idiomatic data than for the so-called literal data: k
(idioms) = 0.91; k (literal) = 0.66. There are several explanations for this
difference. First, the idiomatic data is much more homogeneous, since we selected
sentences that already contained VNC idiomatic expressions. The rest of the
sentences might have contained metaphors or other figures of speech, and thus
the judgments were more difficult to make. Second, humans easily identify idioms,
but the decision whether a sentence is literal or figurative is much more
challenging. The notion of “figurativeness” is not a binary property (as might be
suggested by the labels that were available to the annotators). “Figurativeness”
falls on a continuum from completely transparent (= literal) to entirely opaque
(= figurative).⁴ Third, the human annotators had to select the label, literal or
idiomatic, without having access to a larger, extra-sentential context, which might
have affected their judgments. Although the boundary between idiomatic and
literal expressions is not entirely clear (expressions do seem to fall on a continuum
in terms of idiomaticity), some expressions are clearly idiomatic and others
clearly literal, based on the overall agreement of our annotators. By classifying
sentences as either idiomatic or literal, we believe that this additional sentential
context could be used to further investigate how speakers go about making these
distinctions.
⁴ A similar observation is made by Cook et al. (2008) with respect to idioms.
7 Discussion
Below we provide output sentences identified by our algorithm as either idiomatic
or literal.
1. True Positives (TP): Idiomatic sentences identified as idiomatic
– We lose our temper, feel cornered and frightened, it can be the work of an
instant.
– Omanis made their mark in history as early as the third century.
2. False Positives (FP): Non-idiomatic sentences identified as idiomatic
– We had words of the sixties, there were words of the seventies, there
were words of the eighties, words of the nineties, and we’re influencing
by those words, actually that’s reasonably in popularity and er increasing usage,
and sometime we, people actually use it and they don’t know what it means.
– Therefore, taking the square root of this measure we get the correlation
coefficient.
3. True Negatives (TN): Non-idiomatic sentences identified as non-idiomatic
– It holds up to three horses and will be driven to and from London by Mrs.
Charley from their home just outside Coventry.
– The referee blew a toy trumpet and Harry Payne gave the golf club a mighty
hit with his bat, breaking the shaft in two.
4. False Negatives (FN): Idiomatic sentences identified as non-idiomatic
– It therefore has a long-term future.
– It has also been agreed that Italy will pay a reciprocal visit to Dublin in
April when they will take part in a Four Nations competition to replace the
Home.
Our error analysis reveals that many cases are fuzzy and that a clear
literal/idiomatic demarcation is difficult.
In examining our false positives (i.e., non-idiomatic expressions that were
marked as idiomatic by the model), it becomes apparent that the classification
of cases is not clear-cut. The expression words of the sixties/seventies/
eighties/nineties is not idiomatic; however, it is not entirely literal either. It
is metonymic: these decades could not literally produce words. Another false
positive contains the expression take the square root. While seemingly similar to
the idiom take root in plans for the new park began to take root, the expression
take the square root is not idiomatic. It does not mean "to take hold like roots in
soil." Like the previous false positive, we believe take the square root is figurative
to some extent. A person cannot literally take the square root of a number as
he can literally take milk out of the fridge.
When it comes to classifying expressions as idiomatic or literal, our false
negatives (i.e., idiomatic expressions that were marked as non-idiomatic by the
model) reveal that human judgments can be misleading. For example, It therefore
has a long-term future was marked as idiomatic in the test corpus. While our
human annotators may have thought that an object could not literally have (or
hold) a long-term future, this expression does not appear to be truly idiomatic.
We do not consider it to be as figurative as a true positive like lose our temper.
Computing Linear Discriminants for Idiomatic Sentence Detection
Another false negative contains a case of metonymy, Italy will pay a reciprocal
visit, and the verbal phrase take part. In this case, our model correctly predicted
that the expression is non-idiomatic. Properties of metonymy are different from
those of idioms, and the verbal phrase take part has a meaning separate from
that of the idiomatic expression take someone's part.
Another interesting feature that we discovered in analyzing our false negatives is that some idiomatic expressions still retain their original meanings even
when other words intervene and the idioms’ component words are separated and
reordered. For example, in the sentence I was little better than a criminal on
whom they must keep tabs, the prepositional phrase on whom is removed from
the end of the idiom keep tabs on whom and placed in an earlier position in
the sentence. Despite this permutation, the idiom still maintains its idiomatic
meaning.
All of these observations support Gibbs's (1984) claim (based on experimental
evidence) that the distinctions between literal and figurative meanings have little
psychological validity. He views literal and figurative expressions as end points of
a single continuum. This makes the task of idiom detection even more challenging
because often, perhaps, there is no objective, clear boundary between idioms and
literal expressions.
8 Conclusion
In this study we did not want to restrict ourselves to idioms of a particular
syntactic form. We applied this method to English and used the VNC (Fazly
et al., 2009) corpus for our experiments. However, in principle, the technique is
language- and structure-independent.
Our binary classification approach has multiple practical applications. It is
useful for indexing purposes in information retrieval (IR) as well as for increasing
the precision of IR systems. Knowledge of which sentences should be interpreted
literally and which figuratively can also improve text summarization and machine
translation systems. Applications such as style evaluation or textual steganography detection can directly benefit from the method proposed in this paper as
well. Classified sentences are useful for language instruction as well, e.g., for effective demonstrations of contexts in which specific idioms might occur. We also
feel that identifying idioms at the sentence level may provide new insights into
the kinds of contexts that idioms are situated in. These findings could further
highlight properties that are unique to specific idioms if not idioms in general.
Our current work is concerned with improving the detection rates of our
model. At present, our model does not use text coherence as a feature, and we
think we could significantly improve our performance if we considered a larger
context. Once the detection rates of our model have been improved, we will
extract idioms from the sentences our model has classified as idiomatic. We have
yet to see which method will work best for this task.
A Proof of Theorem 1
Proof. We rewrite $S_b - S_w = 2S_b - S_m$, where $S_m = S_b + S_w$. Let $\mathrm{null}(A)$
denote the null space of matrix $A$. Since $\mathrm{null}(S_m) \subseteq \mathrm{null}(S_b)$, there exists a
matrix $P \in \Re^{q \times s}$ that simultaneously diagonalizes $S_b$ and $S_m$ (Fukunaga, 1990),
where $s \le \min\{l-1, q\}$ is the rank of $S_m$.
The matrix $P$ is given by

$$P = Q \Lambda_m^{-1/2} U,$$

where $\Lambda_m$ and $Q$ are the eigenvalue and eigenvector matrices of $S_m$, and $U$ is
the eigenvector matrix of $\Lambda_m^{-1/2} Q^t S_b Q \Lambda_m^{-1/2}$. Thus, the columns of $P$ are the
eigenvectors of $2S_b - S_m$ and the corresponding eigenvalues are $2\Lambda_b - I$. We
then have

$$P^t S_b P = \Lambda_b, \qquad P^t S_m P = I, \eqno(8)$$

where $\Lambda_b = \mathrm{diag}\{\sigma_1, \cdots, \sigma_s\}$.
Consider the range of $P$ over $Y \in \Re^{s \times q}$ with $\mathrm{rank}(Y) = s$. The range $W = PY$
includes all $q \times q$ matrices with rank $= s$. Then

$$\max_W \mathrm{tr}(W^t (2S_b - S_m) W) = \max_Y \mathrm{tr}((PY)^t (2S_b - S_m) PY)
  = \max_Y \mathrm{tr}(Y^t (2\Lambda_b - I) Y).$$

It is straightforward to show that the maximum is attained by $Y = [e_1\, e_2 \cdots e_r; 0]$,
where $e_i$ is a vector whose $i$th component is one and the rest are zero. From this it
is clear that $W = PY$ consists of the first $r$ columns of $P$, i.e., the eigenvectors
corresponding to $2\sigma_i - 1 > 0$.
Now, since $X = W W^t$, we have $X = \sum_{i=1}^{r} w_i w_i^t$. Thus,

$$\mathrm{tr}(X) = \sum_{i=1}^{r} w_i^t w_i = r.$$

However, the constraint $I \cdot X = 1$ states that $\mathrm{tr}(X) = 1$. It follows that $r = 1$.
That is, $\mathrm{rank}(X) = 1$.
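The simultaneous diagonalization used in the proof can be checked numerically. The sketch below is our illustration only: it builds random full-rank scatter matrices (so that s = q), constructs P = QΛm^{-1/2}U as above, and verifies equation (8):

```python
import numpy as np

rng = np.random.default_rng(0)
q = 5
# Random symmetric positive-definite "scatter" matrices (full rank).
A = rng.standard_normal((q, q)); Sw = A @ A.T
B = rng.standard_normal((q, q)); Sb = B @ B.T
Sm = Sb + Sw

# Eigendecomposition of Sm: Sm = Q diag(lam) Q^t.
lam, Q = np.linalg.eigh(Sm)
W = Q @ np.diag(lam ** -0.5)        # Q Λm^{-1/2}: whitens Sm
# U: eigenvector matrix of the whitened between-class scatter.
_, U = np.linalg.eigh(W.T @ Sb @ W)
P = W @ U

# Equation (8): P^t Sm P = I and P^t Sb P = Λb (diagonal).
print(np.allclose(P.T @ Sm @ P, np.eye(q)))          # True
D = P.T @ Sb @ P
print(np.allclose(D, np.diag(np.diag(D))))           # True
```

The diagonal entries of `D` are the σ_i of the proof, so the retained directions are exactly those with 2σ_i − 1 > 0.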
References
1. Baldwin, T., C. Bannard, T. Tanaka, and D. Widdows (2002). An empirical model
of multiword expression decomposability. In Proceedings of the ACL 03
Workshop on Multiword Expressions: Analysis, Acquisition and Treatment,
pp. 89–96.
2. Bannard, C. (2007). A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL 07 Workshop on
A Broader Perspective on Multiword Expressions, pp. 1–8.
3. Ben-Tal, A. and A. Nemirovski (2004). Non-euclidean restricted memory level
method for large-scale convex optimization.
4. Birke, J. and A. Sarkar (2006). A clustering approach to the nearly unsupervised
recognition of nonliteral language. In Proceedings of the 11th Conference of the
European Chapter of the Association for Computational Linguistics (EACL'06),
Trento, Italy, pp. 329–336.
5. Burnard, L. (2000). The British National Corpus Users Reference Guide. Oxford
University Computing Services.
6. Carletta, J. (1996). Assessing Agreement on Classification Tasks: The Kappa
Statistic. Computational Linguistics 22(2), 249–254.
7. Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational
and Psychological Measurement 20, 37–46.
8. Cook, P., A. Fazly, and S. Stevenson (2008, June). The VNC-Tokens Dataset. In
Proceedings of the LREC Workshop: Towards a Shared Task for Multiword
Expressions (MWE 2008), Marrakech, Morocco.
9. Cowie, A. P., R. Mackin, and I. R. McCaig (1983). Oxford Dictionary of Current
Idiomatic English, Volume 2. Oxford University Press.
10. Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector
Machines and Other Kernel-based Learning Methods. Cambridge, UK: Cambridge
University Press.
11. Fazly, A., P. Cook, and S. Stevenson (2009). Unsupervised Type and Token
Identification of Idiomatic Expressions. Computational Linguistics 35(1), 61–103.
12. Fazly, A. and S. Stevenson (2006). Automatically Constructing a Lexicon of Verb
Phrase Idiomatic Combinations. In Proceedings of the 11th Conference of the
European Chapter of the Association for Computational Linguistics (EACL-2006),
Trento, Italy, pp. 337–344.
13. Fukunaga, K. (1990). Introduction to statistical pattern recognition. Academic
Press.
14. Gibbs, R. W. (1984). Literal Meaning and Psychological Theory. Cognitive
Science 8, 275–304.
15. Horvath, B. M. (1985). Variation in Australian English. Cambridge: Cambridge
University Press.
16. Katz, G. and E. Giesbrecht (2006). Automatic identification of non-compositional
multi-word expressions using latent semantic analysis. In Proceedings of the
ACL'06 Workshop on Multiword Expressions: Identifying and Exploiting
Underlying Properties, Sydney, Australia, pp. 12–19.
17. Krzanowski, W. (1988). Principles of Multivariate Analysis. Oxford, UK: Oxford
University Press.
18. Lin, D. (1999). Automatic Identification of Non-compositional Phrases. In
Proceedings of ACL, College Park, Maryland, pp. 317–324.
19. Nesterov, I. (2003). Smooth minimization of non-smooth functions.
20. Seaton, M. and A. Macaulay (Eds.) (2002). Collins COBUILD Idioms
Dictionary (second ed.). HarperCollins Publishers.
21. Sporleder, C. and L. Li (2009). Lexical Encoding of MWEs. In Proceedings of
EACL 2009.
22. Vandenberghe, L. and S. Boyd (1996). Semidefinite programming. SIAM Review
38(1), 49–95.
23. Vapnik, V. (1998). Statistical Learning Theory. New York: Wiley.
24. Woods, A., P. Fletcher, and A. Hughes (1986). Statistics in Language Studies.
Cambridge: Cambridge University Press.
Robust Temporal Processing:
from Model to System
Tommaso Caselli and Irina Prodanof
ILC-CNR, Pisa
firstName.secondName@ilc.cnr.it
Abstract. This paper shows the functioning and the general architecture of an empirically-based model for robust temporal processing of
text/discourse. The starting point for this work has been the understanding of how humans process and recognize temporal relations. The
empirical results show that the different salience of the linguistic and
commonsense knowledge sources of information calls for specific computational components and procedures to deal with them.
1 Introduction
Temporal processing of text/discourse has recently become one of
the most active areas in NLP, boosted by the presence of specific
markup languages (ISO-TimeML, SemAF/Time Project) and by a
growing number of initiatives (CLEF, TERN, SemEval-TempEval2).
Natural languages have a variety of devices to communicate information
about events and their temporal organization, and the identification
of the temporal relations in a text/discourse is not a trivial
task. Previous research has explored and analyzed what sources of
information are at play when inferring the temporal order of eventualities,
such as tense, temporal adverbs, signals, viewpoint aspect, lexical
aspect, discourse relations, and commonsense and pragmatic knowledge.
Most sources of information for inferring a temporal relation
very rarely code in an explicit and clear-cut way the specific temporal
relation holding between two entities (i.e., eventuality - eventuality,
eventuality - temporal expression), and this may lead to biases and
incorrect tagging. Substantial linguistic processing is required for a
system to perform temporal inferences, and commonsense knowledge
can hardly be encoded in domain-independent programs. One of the
main issues which has not been answered so far is how the linguistic
devices which languages have at their disposal to codify temporal relations
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 29-40
Received 23/11/09
Accepted 16/01/10
Final version 09/03/10
interact both with each other and under which conditions they are
autonomous, i.e., able to codify a temporal relation between eventualities
without the support of non-purely linguistic elements, like
discourse structure or world-knowledge-based inferences. This calls
for the development of procedures and techniques which maximize
the role of the sources of information and identify the conditions under
which they are necessary and sufficient to determine the correct
temporal relation.
This paper presents a general empirically-based model for robust
temporal processing of text/discourse. Though the experiments have
been conducted on Italian, the model is language independent. The
lack of a complete system is mainly due to the absence of temporally
annotated resources for Italian over which systems could be developed
and evaluated. The remainder of the paper is organized as follows:
in sect. 2 we illustrate the methodology and the experimental
results on the basis of which the model has been developed. Section
3 reports the overall architecture of the model and the functioning
of its core components. Finally, sect. 4 presents the conclusion and
observations for future work.
2 Linguistic Information and Pragmatic Mechanisms: Defining an Order of Application
Recent psychological studies ([1], [2]) have established correspondences
between the formal aspects of the temporal structure of discourse
and the mental representations interpreters build. The order
in which eventualities are presented in a text/discourse vs. their real
chronological order, the use of particular tenses, and the presence of
elements which explicitly mark a temporal relation are all features used
when constructing a mental model of a text/discourse. As [3] pointed
out, these features are of the kind which can be constructed automatically
by information extraction systems. Knowing how these
features interact with respect to their different nature, i.e., linguistic
vs. world-knowledge based, is a necessary step toward robust automatic
extraction systems.
To develop a model for temporal processing we have decided to investigate,
through an experimental study, whether it is possible to determine
a hierarchical order of application of the linguistic and non-linguistic
sources of information, and under which conditions purely linguistic
information is necessary and sufficient to determine the temporal
relations between the entities under analysis. The aim of this study is
to identify how deep the computation of information must
go, that is, how many modules must be activated in order to obtain
a reliable temporal representation of the text/discourse.
2.1 Methodology
In order to verify the existence of a salience order of the sources
of information and to obtain cues on the way the model should be
implemented, we elaborated a test which was submitted to two
groups of subjects: a first group of 29 subjects, none of whom had any
background in linguistics (Group 1), and a second group of 6 subjects,
all MA students in Linguistics (Group 2). The two groups were given
comparable, though not identical, test data, given their different
backgrounds and the level of metalinguistic analysis required.

In both experiments the subjects were presented with a set of
52 discourse excerpts, automatically extracted from the Italian Syntactic
Semantic Treebank (ISST), and were asked to temporally order two
highlighted eventualities in the discourse segments. To improve
reliability and avoid inconsistency, the subjects were asked
to choose the temporal relation among a restricted set of 5
predetermined values, namely BEFORE, AFTER, SIMULTANEOUS,
OVERLAP, and NO TEMPORAL RELATION. No binary interpretation
of the temporal relations was allowed. In order to discover the
existence of a salience order of application of the sources of information,
and to determine in a reliable way under which conditions
linguistic (grammatical and lexical) information is autonomous (i.e.,
necessary and sufficient) for the identification of the temporal relations
with respect to non-purely linguistic, i.e., contextual, information, the
subjects were asked to state what source of information had most
helped them in the identification of the temporal relation. As in
the first task, and to keep the experiments under control,
we provided the subjects with a predetermined set of possible answers
according to their background. Group 1 had at its disposal the
values TENSE, TEMPORAL EXPRESSIONS, and NOT SPECIFIED,
while Group 2 had a larger set, i.e., TENSE, TEMPORAL EXPRESSIONS,
SIGNAL, ASPECT, SEMANTICS and NOT SPECIFIED.
2.2 Data Analysis and Results
In Table 1 we report the results obtained for the identification of
temporal relations. The agreement among the subjects has been
computed by means of the K statistic.

Table 1. Agreement of the subjects on temporal relation identification

    Agreement of temporal relations                    K value
    Overall agreement                                  0.58
    Agreement in presence of temporal expressions      0.64
    Agreement in presence of signals                   0.73
    Agreement in presence of shifts in tense           0.70
As the results illustrate, the identification of temporal relations is
a challenging task. Only in the presence of specific markers of temporal
relations, such as shifts in tense, temporal expressions and signals,
does the agreement rise to reliable values. The data have also shown
that the temporal representations humans construct are varied: in the
absence of specific information they are mainly coarse grained, while
unique and clear-cut values can be obtained only in the presence of
explicit information, i.e., of markers which guide the interpretation
process.
One of the main results is the identification of a set of constraints
and preferences. The constraints apply both to the role of and to the
relationships among the sources of information, while the preferences
deal with tense patterns and associated temporal relations.
Different constraints are activated for each source of information.
The constraints can be conceived as representing the conditions under
which each source of information is a necessary and sufficient
element for recovering the correct temporal relation. In particular,
we claim that:
– Constraint 1 (Tense): sequences of adjacent eventualities must
have different tense meanings; otherwise other sources of information
are responsible for their ordering;
– Constraint 2 (Temporal Expressions): temporal expressions
may represent explicit information for the ordering of eventualities
provided that: a.) they are related to the moment of the
event, E, or to a secondary deictic moment, R; b.) if more than
one temporal expression is present, they must stand in an anchoring
relation with each other to signal a temporal relation; c.) in case there
is just one temporal expression related to an eventuality, its
contribution is relevant for anchoring the eventuality on the timeline
but null for determining the temporal relation between two eventualities;
– Constraint 3 (Signals): signals represent salient information
for temporal relations only when their semantics is explicit, as
for dopo [after] and intanto [simultaneously]. When they are implicit,
like quando [when] and per [for/since], they offer ancillary information
which reinforces the contribution of other sources, like tense,
viewpoint aspect, lexical aspect, temporal expressions, and commonsense
knowledge;
– Constraint 4 (Viewpoint and Lexical Aspect): these two
types of information have a constraint similar to that for tense:
when adjacent eventualities have the same values, either for viewpoint
or lexical aspect, their knowledge is necessary but not sufficient to
determine the temporal relation.
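As an illustration only (the class and attribute names below are ours, not part of the model's specification), Constraints 1 and 4 can be read as simple applicability predicates over pairs of adjacent eventualities:

```python
from dataclasses import dataclass

@dataclass
class Eventuality:
    """Hypothetical minimal representation of one eventuality."""
    tense: str            # e.g. "passato_composto", "trapassato_I"
    viewpoint: str        # e.g. "perfective", "imperfective"
    lexical_aspect: str   # e.g. "achievement", "state"

def constraint_1_tense(e1: Eventuality, e2: Eventuality) -> bool:
    """Constraint 1: tense alone can order adjacent eventualities only
    when their tense meanings differ."""
    return e1.tense != e2.tense

def constraint_4_aspect(e1: Eventuality, e2: Eventuality) -> bool:
    """Constraint 4: aspect is sufficient only when the eventualities
    differ on at least one aspectual value."""
    return (e1.viewpoint != e2.viewpoint
            or e1.lexical_aspect != e2.lexical_aspect)

# Example (1) below: two passato composto forms with identical aspect,
# so neither constraint fires and ordering falls back to commonsense.
fell = Eventuality("passato_composto", "perfective", "achievement")
pushed = Eventuality("passato_composto", "perfective", "achievement")
print(constraint_1_tense(fell, pushed), constraint_4_aspect(fell, pushed))
# prints: False False
```

When a predicate returns False, the corresponding source abstains and the next source in the hierarchy is consulted.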
When all other sources of information fail to provide distinguishing
cues, commonsense knowledge is used. We claim that commonsense
knowledge is the most salient source of information for recovering
temporal relations but also the least reliable, since it may introduce
biases and disagreement. The identification that a sequence
of sentences forms a text/discourse is the precondition for the existence
of any kind of relation between the discourse entities. It is only
in this sense that temporal relations are a by-product of the computation
of the general discourse structure, and, as the data have
shown, discourse structure cannot be considered the primary source
of information for the identification of temporal relations. Of course,
knowledge of discourse structure can improve the automatic recognition
of temporal relations, as [4] has demonstrated. To illustrate
how the constraints work, consider this example:
(1) Marco è caduto (e1). Giovanni l'ha spinto (e2).
    Marco fell (e1). Giovanni pushed him (e2).
In (1) the two eventualities cannot be ordered by exploiting only
linguistic information. They comply neither with Constraint 1
(different tense meaning) nor with Constraint 4 (different viewpoint
and lexical aspect). Constraints 2 and 3 do not
apply since no temporal expression or signal is present. The only
available source of information is commonsense knowledge, on
the basis of which we can infer that e2 stands in a precedence relation
with e1. In case we had more specific information (about the context
of occurrence), the temporal relation could be overridden.
The analysis of the correlation between tense patterns and temporal
relations has suggested that it is possible to associate a preference
order for temporal relations according to the combination of
the tense forms. In particular, it appears from the data that certain
tense forms, when appearing in particular tense patterns, like the
trapassato I [past perfect], tend to grammaticalize particular temporal
relations, while others are more prone to code a larger set of relations,
like the passato composto [present perfect or simple past]. The
preferences have been introduced to reduce the possible temporal
relations which may be computed. To clarify how preference rules
work, consider the following example, which is an adaptation of
example (1). In this reduced and simplified formalization, the t_i
represent the beginning or ending points of the eventualities, S the
moment of utterance, and R, as already stated, a possible secondary
deictic moment/point necessary to describe the semantics of some
tense forms, like the trapassato I.
(2) Marco è caduto (e1). Giovanni l'aveva spinto (e2).
    Marco fell (e1). Giovanni had pushed him (e2).

– discourse sequence: passato composto (e1) - trapassato I (e2)
– e1 = ((E1 ≺ S) ∧ (t1 ≤ E1 ≤ t2)) [tense analysis for e1]
– e2 = ((E2 ≺ R2) ∧ (R2 ≺ S) ∧ (t3 ≤ R2 ≤ t4)) [tense analysis for e2]
– (t1 ≺ t3) ∧ (t2 ≺ t4) ∧ (t1 ≺ t4) ∧ (t2 ?? t3)
– possible temporal relations: ((E2 ≺ E1) ∨ (E2 m E1) ∨ (E2 o E1))
The final output does not provide a unique temporal relation due
to the missing information about the ordering of the ending point of
e1 and the beginning point of e2, i.e., (t2 ?? t3). The application of
the preference rule for sequences of passato composto - trapassato I
states that the reliable temporal relation is that of precedence:

– possible temporal relations: ((E2 ≺ E1) ∨ (E2 m E1) ∨ (E2 o E1))
– Preference Rule: if the sequence is passato composto - trapassato I,
then reduce the output to E2 ≺ E1
– final output: (E2 ≺ E1)
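A minimal sketch of this reduction step, assuming a table keyed by tense patterns (the relation labels and the table contents are our hypothetical encoding, not the model's actual rule base):

```python
# Preference table: tense pattern -> the single relation to keep
# out of the disjunction (contents are illustrative).
PREFERENCES = {
    ("passato composto", "trapassato I"): "E2 < E1",
}

def apply_preference(pattern, candidates):
    """If a preference rule exists for this tense pattern and its
    preferred relation is among the candidates, reduce the disjunction
    to that single relation; otherwise return the set unchanged."""
    preferred = PREFERENCES.get(pattern)
    return {preferred} if preferred in candidates else candidates

candidates = {"E2 < E1", "E2 m E1", "E2 o E1"}
print(apply_preference(("passato composto", "trapassato I"), candidates))
# prints: {'E2 < E1'}
```

Patterns without a table entry, such as those involving the futuro composto, simply pass the disjunctive set through unchanged.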
It is important to point out that the preference rules do not apply
to all tense patterns. For instance, with the futuro composto [future
perfect] and the futuro nel passato [future-in-the-past], where the
relationship between E and S cannot be reliably stated, no preference
rules apply, and the output of the component is given by disjunctive
fine-grained relations which can be rearranged in terms of
coarse-grained temporal relations.
Finally, it has been possible to formulate a saliency-based hierarchical
order of application of the sources of information, as reported
in Formula 1. The symbol / stands for "in absence of more specific
linguistic information, X is the most salient source of information"
and . for "in absence of more specific information, X is the most
salient source of information". Notice that when stating "in absence
of more specific (linguistic) information", we are referring to the
constraints we have identified for the saliency of the sources of
information:

Formula 1 (Hierarchical order of information): COMMONSENSE
KNOWLEDGE / (IMPLICIT SIGNALS . TENSE . VIEWPOINT ASPECT .
LEXICAL ASPECT . TEMPORAL EXPRESSIONS . EXPLICIT SIGNALS)
The saliency-based hierarchy is an abstraction: a human interpreter
always has all the sources of information at his or her disposal. On the
basis of the experimental data, we have deduced that the most probable
order for processing the information is the one illustrated in the
hierarchy, since as soon as the subjects found a reliable solution
they presumably blocked their inferencing processes. The behavior
of the pragmatic, i.e., commonsense, knowledge seems to offer further
support to this observation. In fact, this source was selected as the
most salient only when all the others were "absent", i.e., when the
constraints we have illustrated were not respected.
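This blocking behavior suggests a simple control strategy: query the sources in salience order and stop at the first one whose constraints are satisfied. A sketch under our own naming assumptions (the toy sources are illustrative, not the model's components):

```python
def resolve(pair, linguistic_sources, commonsense):
    """Try each linguistic source in salience order; a source returns a
    temporal relation when its constraints hold, or None when it abstains.
    Commonsense knowledge, the last resort, is consulted only if every
    linguistic source abstains."""
    for source in linguistic_sources:
        relation = source(pair)
        if relation is not None:
            return relation          # reliable solution found: stop here
    return commonsense(pair)

# Toy sources (ours): an explicit signal licenses a relation; tense abstains.
explicit_signal = lambda pair: "BEFORE" if pair.get("signal") == "dopo" else None
tense = lambda pair: None

print(resolve({"signal": "dopo"}, [explicit_signal, tense],
              lambda p: "OVERLAP"))  # prints: BEFORE
print(resolve({}, [explicit_signal, tense],
              lambda p: "OVERLAP"))  # prints: OVERLAP
```

The first call is resolved by the explicit signal alone; the second falls through the whole linguistic hierarchy to commonsense knowledge, mirroring example (1).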
3 The Model Architecture
This section is devoted to the illustration of the general architecture
and mechanisms of a computational model for automatically resolving
temporal relations in a text/discourse. The model is based on the
empirical data and results illustrated above. Its modular organization
is proposed as a strategy to improve the reliability of the output
and avoid failure. Each module has specialized functions and
components which are conceived to deal with one source of information
at a time. The modules are organized in a pipeline, according to
which the output of one module represents the input for the next.
On the basis of the hierarchy we have illustrated by means of
Formula 1, and as a general strategy, the specialized components
of each module should be activated only when necessary. Figure 1
illustrates the overall workflow of the model, from raw input text to
the final output.
The first module is responsible for two primary tasks: the identification,
normalization and assignment of temporal relations between
the temporal expressions, by means of a Timex Grammar and Normalizer,
and the identification of the eventualities and the assignment
of their default lexical aspect, through an Event Detector component.
The two components take as input shallow-parsed text, since
the chunks' extents approximate the extents of temporal expressions
and eventualities. Moreover, chunks can be easily combined together
for items whose extent corresponds to more than one token.
The second module has three main components which compute
three different types of information strictly connected with each
other, namely tense, viewpoint aspect and lexical aspect. The main
result of the analysis of this internal submodule is the formalization
of each eventuality in its corresponding interval representation.
The output of these three components is necessary for two purposes:
Fig. 1. Workflow of the Model.
firstly, it is used to determine the temporal relations between temporal
expressions and eventualities when associated with the output of
the temporal expression component of module 1, and secondly, it is
used to activate the components of module 3 only when they can
provide a reliable output. The temporal relations which are assumed
to be valid are all of [5]'s 13 interval relations and [6]'s 8 instant-interval
relations. The two signal components are activated when the connection
between the eventuality and the temporal expression is "mediated" by a signal.
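For concreteness, the 13 interval relations of [5] can be decided purely from endpoint comparisons; the sketch below (our own naming of the relations, which varies across the literature) classifies a pair of intervals:

```python
def allen_relation(a, b):
    """Classify the pair of intervals a = (a1, a2), b = (b1, b2), with
    a1 < a2 and b1 < b2, into one of Allen's 13 basic relations using
    endpoint comparisons only."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1: return "before"
    if b2 < a1: return "after"
    if a2 == b1: return "meets"
    if b2 == a1: return "met-by"
    if a1 == b1 and a2 == b2: return "equal"
    if a1 == b1: return "starts" if a2 < b2 else "started-by"
    if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2: return "during"
    if a1 < b1 and b2 < a2: return "contains"
    # Remaining case: proper overlap with four distinct endpoints.
    return "overlaps" if a1 < b1 else "overlapped-by"

print(allen_relation((0, 3), (1, 5)))  # prints: overlaps
print(allen_relation((1, 2), (0, 5)))  # prints: during
```

This is exactly the kind of endpoint reasoning module 3 applies to the interval representations produced by module 2.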
The third module is responsible for the identification of the temporal
relations between eventualities. Each component has a set of
heuristics based on the empirical data and provides as output the
temporal relation value(s). The heuristics are divided into two main
groups: one for complex sentence contexts and the other for adjacent
eventualities in discourse segments. The four internal components
are mutually exclusive. A general
principle, which results from the constraints illustrated in sect. 2,
guides their functioning. Each component is activated if and only if
a set of preconditions is respected; otherwise the temporal ordering
is completely inferred by means of the external component (module
4). The module assumes as basic temporal relations [5]'s 13 interval
relations and [6]'s 8 instant-interval relations. A special predicate,
hold, is postulated to account for the measure of the duration of the
eventualities. As a general principle, the temporal relations between
two adjacent eventualities are computed by considering the relations
between the beginning and ending points of their interval representations.
However, much of this information is missing or only vaguely
present in the text/discourse. In order to deal with this issue, the
final outputs of the internal components can differ in terms of the
preciseness of the temporal knowledge expressed, so that we can have
(i.) precise temporal knowledge, when a single temporal relation can
be stated, or (ii.) coarse-grained temporal knowledge, when more
than one temporal relation can be inferred. In the presence of coarse-grained
knowledge, the multiple temporal relations do not represent
contradicting temporal representations, but related or conceptually
adjacent temporal relations. Instead of expressing these types of temporal
relations by means of disjunctive fine-grained relations, we
have decided to use coarse-grained knowledge based on [7]'s notion
of conceptual neighbors. The main advantage of such a representation
is twofold: on the one hand, the model is somehow cognitively
similar to the temporal representations that humans may have, and,
on the other hand, it prevents the inferencing module from failing to
compute the whole set of temporal relations. According to our analysis,
we have instances of precise temporal relations only when the
output is obtained by three components, namely (i.) tense, (ii.) explicit
signals and (iii.) discourse relations. The viewpoint and lexical
aspect components will produce coarse-grained temporal knowledge.
Module 4 is responsible for the inferencing process of temporal relations. This module takes in input both the output of module 2 and
that from module 3 and it activates two different types of inferencing mechanisms according to which the module provides its output.
When Module 2 provides the input, the eventualities are already
connected by means of a temporal relation to a temporal expression.
In this case, module 4 activates a set of inferencing rules according
Robust Temporal Processing: from Model to System
39
to which the relations between the temporal expressions are transferred to their connected eventualities. Things are more complicate
when the input comes from module 3. In this case, module 4 looks
for couples of adjacent eventualities with one of them in common.
Once identified, it will activate inferencing rules based on a transitivity table which preserves the insights of Allen’s table and the coarse
grained knowledge, as proposed by [7]. As for eventualities realized
by nouns or other parts of speech different than verbs, our model
implements a strategy based on [3]. Event nouns do not present information on their temporal location, thus the identification of the
temporal relations requires an extended use of commonsense knowledge. Following [3]'s proposal, an abstract device, a Chronoscope,
applies. This device allows temporal representation abstraction. The
Chronoscope requires that we index temporal relations to a certain
level of time granularity g. In our account, event nouns (and all
parts-of-speech other than verbs) will be considered as simultaneous
with the verbal eventualities with which they co-occur. This means
that the finely grained distinction between tensed eventualities will
be maintained and preserved, and, at the same time, there is no need
to make reference to commonsense knowledge in order to extend the
model to event nouns. The only operation is to abstract their temporal representation to the same level of temporal granularity as that
of the verbal eventualities.
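The inferencing step of Module 4 can be illustrated with a short sketch. The relation names, the toy transitivity table, and the data structures below are illustrative assumptions for exposition only; they are not Allen's [5] full table nor the coarse grained table of [7].

```python
# Toy transitivity table: (rel(a,b), rel(b,c)) -> set of possible rel(a,c).
# A set with more than one member models coarse grained knowledge:
# conceptually adjacent candidates are kept as a disjunction, not discarded.
TRANSITIVITY = {
    ("before", "before"): {"before"},
    ("before", "meets"): {"before"},
    ("meets", "before"): {"before"},
    ("during", "before"): {"before"},
    ("before", "during"): {"before", "meets", "overlaps"},  # coarse grained
}

def infer(relations):
    """Close a list of (a, rel, b) triples under the toy transitivity table.

    Pairs of relations sharing a middle eventuality b yield new (a, c)
    pairs; all candidate relations licensed by the table are collected."""
    inferred = {}
    for (a, r1, b) in relations:
        for (b2, r2, c) in relations:
            if b == b2 and a != c:
                for r in TRANSITIVITY.get((r1, r2), set()):
                    inferred.setdefault((a, c), set()).add(r)
    return inferred
```

For instance, from e1 before e2 and e2 meets e3 the sketch infers the precise relation e1 before e3, while e2 during e3 would yield a coarse grained disjunction.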
4 Conclusion and Future Work
This paper illustrates a general architecture for robust temporal
processing of text/discourse. The workflow of the model has been
elaborated on the basis of experimental data which have suggested a
saliency-based order of application of the linguistic and non-linguistic
sources of information involved in this task. With respect to previous research, this work has presented a first systematization of the
linguistic devices involved in temporal processing and how they interact with each other. The development of a system compliant with
the model also has advantages for the interpretation of errors and
may facilitate their solution.
The actual realization of the model is ongoing. Some components
(temporal expression tagger [8], an intrasentential tagger for tem-
poral relations between eventualities and temporal expressions in
presence of signals [9]) and procedures (use of lexical resources for
the identification of eventualities) have been realized and evaluated
on specifically annotated documents by using different techniques
(rule-based systems vs. machine learning techniques). A side effect
of this work is the realization of an annotated corpus for Italian for
events, temporal expressions and temporal relations, comparable to
the English TimeBank [10].
As future work, we plan to test the validity of the experimental results by developing classifiers which apply different orders
of application of the linguistic information with respect to those illustrated in sect. 2. A further element of analysis will be the evaluation
of the flow of information from one module to the other and how
errors can influence their functioning.
References
1. van der Meer, E., Beyer, R., Heinze, B., Badel, I.: Temporal order relations in
language comprehension. Journal of Experimental Psychology: Learning, Memory
and Cognition 28 (2002) 770–79
2. Kelter, S., Kaup, B., Claus, B.: Representing a described sequence of events: A
dynamic view of narrative comprehension. Journal of Experimental Psychology:
Learning, Memory and Cognition 30 (2004) 451–64
3. Mani, I.: Chronoscopes: A theory of underspecified temporal representation. In
Schilder, F., Katz, G., Pustejovsky, J., eds.: Annotating, Extracting and Reasoning
about Time and Events. LNAI. Springer-Verlag, Berlin Heidelberg (2007) 127–39
4. Forascu, C., Pistol, I., Cristea, D.: Temporality in relation with discourse structure.
In: Proceedings of the Fifth International conference on Language Resources and
Evaluation (LREC-06). (2006) 65–70
5. Allen, J.: Maintaining knowledge about temporal intervals. Communications of
the ACM 26(11) (1983) 832–43
6. Allen, J., Hayes, P.: Moments and points in an interval-based temporal logic.
Computational Intelligence 5(3) (1989) 225–38
7. Freksa, C.: Temporal reasoning based on semi-intervals. Artificial Intelligence 54
(1992) 199–227
8. Caselli, T., dell'Orletta, F., Prodanof, I.: A TimeML compliant TimeX tagger for
Italian. In: Proceedings of the International Multiconference on Computer Science
and Information Technology – IMCSIT 2009. Volume 4. (2009) 185–192
9. Caselli, T., dell'Orletta, F., Prodanof, I.: Temporal relations with signals: the case
of Italian temporal prepositions. In Lutz, C., Raskin, J.F., eds.: 16th International
Symposium on Temporal Representation and Reasoning, 2009. TIME 2009. (2009)
125–132
10. Pustejovsky, J., Hanks, P., Saurí, R., See, A., Gaizauskas, R., Setzer, A., Radev,
D., Sundheim, B., Day, D., Ferro, L., Lazo, M.: The TIMEBANK corpus. In:
Corpus Linguistics 2003. (2003)
Near-Synonym Choice
using a 5-gram Language Model
Aminul Islam and Diana Inkpen
University of Ottawa
School of Information Technology and Engineering
Ottawa, ON, Canada, K1N 6N5
{mdislam, diana}@site.uottawa.ca
Abstract. In this work, an unsupervised statistical method for automatic choice of near-synonyms is presented and compared to the state-of-the-art. We use a 5-gram language model built from the Google Web
1T data set. The proposed method works automatically, does not require
any human-annotated knowledge resources (e.g., ontologies) and can be
applied to different languages. Our evaluation experiments show that this
method outperforms two previous methods on the same task. We also
show that our proposed unsupervised method is comparable to a supervised method on the same task. This work is applicable to an intelligent
thesaurus, machine translation, and natural language generation.
1 Introduction
Choosing the wrong near-synonym can convey unwanted connotations, implications, or attitudes. In machine translation and natural language generation
systems, the choice among near-synonyms needs to be made carefully. By near-synonyms we mean words that have the same meaning, but differ in lexical
nuances. For example, error, mistake, and blunder all mean a generic type of
error, but blunder carries an implication of accident or ignorance. In addition
to paying attention to lexical nuances, when choosing a word we need to make
sure it fits well with the other words in a sentence. In this paper we investigate
how the collocational properties of near-synonyms can help with choosing the
best words. This problem is difficult because the near-synonyms have senses that
are very close to each other, and therefore they occur in similar contexts. We
build a strong representation of the context in order to capture the more subtle
differences specific to each near-synonym.
The work we present here can be used in an intelligent thesaurus. A writer
can access a thesaurus to retrieve words that are similar to a given word, when
there is a need to avoid repeating the same word, or when the word does not seem
to be the best choice in the context. A standard thesaurus does not offer any
explanation about the differences in nuances of meaning between the possible
word choices.
This work can also be applied to a natural language generation system [1]
that needs to choose among near-synonyms. Inkpen and Hirst [1] included a
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 41-52
Received 23/11/09
Accepted 16/01/10
Final version 10/03/10
preliminary collocation module that reduces the risk of choosing a near-synonym
that does not fit with the other words in a generated sentence (i.e., violates
collocational constraints). The work presented in this paper allows for a more
comprehensive near-synonym collocation module.
The task we address in this paper is the selection of the best near-synonym
that should be used in a particular context. Inkpen [2] argues that the natural
way to validate an algorithm for this task would be to ask human readers to
evaluate the quality of the algorithm’s output, but this kind of evaluation would
be very laborious. Instead, Inkpen [2] validates her algorithms by deleting selected words from sample sentences, to see whether the algorithms can restore
the missing words. That is, she creates a lexical gap and evaluates the ability
of the algorithms to fill the lexical gap. Two examples from [2] are presented in
Figure 1. All the near-synonyms of the original word, including the word itself,
become the choices in the solution set (see the figure for two examples of solution sets). The task is to automatically fill the gap with the best choice in the
particular context. We present a method that can be used to score the choices.
For our particular task, we choose only the highest scoring near-synonym. In
order to evaluate how well our method works we consider that the only correct
solution is the original word. This will cause our evaluation scores to underestimate the performance of our method, as more than one choice will sometimes
be a perfect solution. Moreover, what we consider to be the best choice is the
typical usage in the corpus, but it may vary from writer to writer. Nonetheless,
it is a convenient way of producing test data in an automatic way. To verify how
difficult the task is for humans, Inkpen [2] performed experiments with human
judges on a sample of the test data.
The near-synonym choice method that we propose here uses the Google Web
1T n-gram data set [3], contributed by Google Inc., that contains English word
n-grams (from unigrams to 5-grams) and their observed frequency counts calculated over 1 trillion words from web page text collected by Google in January
2006. The text was tokenized following the Penn Treebank tokenization, except
that hyphenated words, dates, email addresses and URLs are kept as single tokens. The sentence boundaries are marked with two special tokens <S> and
</S>. Words that occurred fewer than 200 times were replaced with the special
token <UNK>. Table 1 shows the data sizes of the Web 1T corpus. The n-grams
Table 1. Google Web 1T Data Sizes

Number of   Number             Size on disk (in KB)
Tokens      1,024,908,267,229  N/A
Sentences   95,119,665,584     N/A
Unigrams    13,588,391         185,569
Bigrams     314,843,401        5,213,440
Trigrams    977,069,902        19,978,540
4-grams     1,313,818,354      32,040,884
5-grams     1,176,470,663      33,678,504
themselves must appear at least 40 times to be included in the Web 1T corpus1.
It is expected that this data will be useful for statistical language modeling, e.g.,
for machine translation or speech recognition, as well as for other uses.
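As a sketch of an access layer for this data, the uncompressed Web 1T distribution files list one n-gram per line followed by a tab and its count. The function below builds a toy in-memory table from such lines; this is an assumption for illustration only, since at tens of gigabytes the real data set calls for an indexed or on-disk store.

```python
from collections import defaultdict

def load_ngram_counts(lines):
    """Build an n-gram count table from Web 1T-style lines.

    Each line has the form "w1 w2 ... wn<TAB>count"; `lines` can be an
    open file or any iterable of such lines, so the same code works on
    the real distribution files and on toy data."""
    counts = defaultdict(int)
    for line in lines:
        ngram, sep, count = line.rstrip("\n").rpartition("\t")
        if sep:  # skip malformed lines without a tab separator
            counts[tuple(ngram.split())] += int(count)
    return counts
```

Keying the table by token tuples keeps lookups for 1- to 5-grams uniform.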
Sentence: This could be improved by more detailed consideration of the processes of
......... propagation inherent in digitizing procedures.
Original near-synonym: error
Solution set: mistake, blooper, blunder, boner, contretemps, error, faux pas, goof,
slip, solecism
Sentence: The day after this raid was the official start of operation strangle, an attempt to completely destroy the ......... lines of communication.
Original near-synonym: enemy
Solution set: opponent, adversary, antagonist, competitor, enemy, foe, rival
Fig. 1. Examples of sentences with a lexical gap, and candidate near-synonyms to fill
the gap.
This paper is organized as follows: Section 2 presents a brief overview of the
related work. Our proposed method is described in Section 3. Evaluation and
experimental results are discussed in Section 4. We conclude in Section 5.
2 Related Work
The idea of using the Google Web 1T n-gram data set as a resource in different
natural language processing applications has been exploited by many researchers.
Islam and Inkpen [4] use 3-grams of this data set to detect and correct real-word
spelling errors and also use n-grams to only correct real-word spelling errors [5].
Nulty and Costello [6] deduce the semantic relation that holds between two nouns
in a noun-noun compound phrase such as “flu virus” or “morning exercise” using
lexical patterns in the Google Web 1T corpus. Klein and Nelson [7] investigate
the relationship between term count (TC) and document frequency (DF) values
of terms occurring in the Web as Corpus (WaC) and also the similarity between
TC values obtained from the WaC and the Google n-gram dataset and they
mention that a strong correlation between the two would give them confidence
in using the Google n-grams to estimate accurate inverse document frequency
(IDF) values in order to generate well-performing lexical signatures based on the
TF-IDF scheme. Murphy and Curran [8] explore the strengths and limitations
of Mutual Exclusion Bootstrapping (MEB) by applying it to two novel lexicalsemantic extraction tasks: extracting bigram named entities and WordNet lexical
file classes [9] from the Google Web 1T 5-grams.
Turney et al. [10] addressed the multiple-choice synonym problem: given a
word, choose a synonym for that word, among a set of possible solutions. In
1 Details of the Google Web 1T data set can be found at www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt
this case the solutions contain one synonym and some other (unrelated) words.
They achieve high performance by combining classifiers. Clarke and Terra [11]
addressed the same problem as Turney et al., using statistical associations measures computed with counts from the Waterloo terabyte corpus. In our case,
all the possible solutions are synonyms of each other, and the task is to choose
one that best matches the context: the sentence in which the original synonym
is replaced with a gap. It is much harder to choose between words that are
near-synonyms because the context features that differentiate a word from other
words might be shared among the near-synonyms.
In fact, the works that address exactly the same task are that of Edmonds [12]
and Inkpen [2], as far as we are aware. Edmonds [12] gives a solution based on a
lexical co-occurrence network that included second-order co-occurrences whereas
Inkpen [2] uses a much larger corpus and a simpler method, and obtains better
results than that of [12].
Inkpen’s [2] unsupervised method is based on the mutual information scores
between a near-synonym and the content words in the context, filtering out the
stopwords2. The pointwise mutual information (PMI) between two words x and
y compares the probability of observing the two words together (their joint probability) to the probabilities of observing x and y independently (the probability
of occurring together by chance) [13].
PMI(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)}
The probabilities can be approximated by: P(x) = C(x)/N, P(y) = C(y)/N, P(x, y) = C(x, y)/N, where C denotes frequency counts and N is the total number of words in the corpus. Therefore,

PMI(x, y) = \log_2 \frac{C(x, y) \cdot N}{C(x) \cdot C(y)}

where N can be ignored in comparisons, since it is the same in all the cases.
Inkpen [2] models the context as a window of size 2k around the gap (the missing
word): k words to the left and k words to the right of the gap. If the sentence is
s = · · · w_1 · · · w_k Gap w_{k+1} · · · w_{2k} · · ·, for each near-synonym NS_i from the
group of candidates, the score is computed by the following formula:

Score(NS_i, s) = \sum_{j=1}^{k} PMI(NS_i, w_j) + \sum_{j=k+1}^{2k} PMI(NS_i, w_j).
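As an illustration of the formula above, the following sketch computes PMI scores from raw counts. The count tables are toy stand-ins for corpus counts, and returning negative infinity for unseen pairs is our assumption, not necessarily Inkpen's exact implementation.

```python
import math

def pmi(c_xy, c_x, c_y):
    """PMI with the constant log2(N) dropped, since it is the same for
    every candidate: log2(C(x, y) / (C(x) * C(y))).  Unseen pairs get
    -inf (an assumption of this sketch)."""
    if c_xy == 0 or c_x == 0 or c_y == 0:
        return float("-inf")
    return math.log2(c_xy / (c_x * c_y))

def score(candidate, context, unigram, bigram):
    """Sum PMI(candidate, w) over the context words w around the gap;
    `unigram` and `bigram` are toy count tables (word -> count and
    word pair -> count, counting both orders)."""
    return sum(
        pmi(bigram.get((candidate, w), 0) + bigram.get((w, candidate), 0),
            unigram.get(candidate, 0),
            unigram.get(w, 0))
        for w in context)
```

The candidate with the highest total score is then chosen to fill the gap.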
In a supervised learning method, Inkpen [2] trains classifiers for each group
of near-synonyms. The classes are the near-synonyms in the solution set. Each
sentence is converted into a vector of features to be used for training the supervised classifiers. Inkpen used two types of features. One type of features consists of the scores of the left and right context with each class (i.e., with each near-synonym from the group). The number of features of this type is equal to twice
the number of classes: one feature for the score between the near-synonym and
2 We do not filter out stopwords or punctuation in our method.
the part of the sentence at the left of the gap, and one feature for the score between the near-synonym and the part of the sentence at the right of the gap. The
second type of features is formed by the words in the context windows. For each
group of near-synonyms, Inkpen used as features the 500 most frequent words
situated close to the gaps in a development set. The value of a word feature for
each training example is 1 if the word is present in the sentence (at the left or
at the right of the gap), and 0 otherwise. Inkpen trained classifiers using several
machine learning algorithms to see which one is best at discriminating among
the near-synonyms.
There has been quite a lot of work in unsupervised learning of word clusters
based on n-grams [14–16]. Our task has similarities to the word sense disambiguation task. Our near-synonyms have senses that are very close to each other.
In Senseval, some of the fine-grained senses are also close to each other, so they
might occur in similar contexts, while the coarse-grained senses are expected to
occur in distinct contexts. In our case, the near-synonyms are different words to
choose from, not the same word with different senses.
3 Proposed Method
Our task is to find the best near-synonym from a set of candidates that could fill
in the gap in an input text, using the Google Web 1T data set. Let us consider
an input text W which after tokenization3 has p (2≤p≤9) words4 , i.e.,
W = {. . . wi−4 wi−3 wi−2 wi−1 wi wi+1 wi+2 wi+3 wi+4 . . .}
where position i indicates the gap and wi denotes a set of m near-synonyms (i.e., wi = {s1, s2, · · ·, sj, · · ·, sm}). We take into account at
most four words before the gap and at most four words after the gap. Our task
is to choose the sj ∈ wi that best matches with the context. In other words, the
position i is the gap that needs to be filled with the best suited member from
the set, wi .
3 We need to tokenize the input sentence to make the n-grams formed using the
tokens returned after the tokenization consistent with the Google n-grams. The
input sentence is tokenized in a manner similar to the tokenization of the Wall Street
Journal portion of the Penn Treebank. Notable exceptions include the following:
- Hyphenated words are usually separated, and hyphenated numbers usually form
one token.
- Sequences of numbers separated by slashes (e.g., in dates) form one token.
- Sequences that look like urls or email addresses form one token.
4 If the input text has more than 9 words then we keep at most four words before the
gap and at most four words after the gap to make the length of the text 9. We choose
these numbers so that we could maximize the number of n-grams to use, given that
we have up to 5-grams in the Google Web 1T data set.
We construct m strings (S1 · · · Sm) replacing the gap in position i with
sj ∈ wi as follows:
S1 = · · · wi−4 wi−3 wi−2 wi−1 s1 wi+1 wi+2 wi+3 wi+4 · · ·
S2 = · · · wi−4 wi−3 wi−2 wi−1 s2 wi+1 wi+2 wi+3 wi+4 · · ·
⋮
Sm = · · · wi−4 wi−3 wi−2 wi−1 sm wi+1 wi+2 wi+3 wi+4 · · ·
Using equation 7 in Section 3.2, we calculate P(S_1), · · ·, P(S_m). The index of the target synonym will be \hat{j} = \arg\max_{j \in 1 \cdots m} P(S_j).
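The selection step is a direct transcription of this argmax; `prob` stands for any sentence-probability function such as equation 7, and the function name and signature here are illustrative assumptions.

```python
def choose_near_synonym(left, right, candidates, prob):
    """Return the candidate s_j maximizing P(S_j), where S_j is the
    context with the gap filled by s_j.  `left`/`right` are the token
    lists before and after the gap; `prob` is any callable returning
    the probability of a token list."""
    left, right = left[-4:], right[:4]  # at most four tokens per side
    return max(candidates, key=lambda s: prob(left + [s] + right))
```

Ties and missing n-grams are left to the language model itself; the wrapper only builds the m candidate strings and compares their probabilities.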
3.1 n-gram Language Model
A language model is usually formulated as a probability distribution P (S) over
strings S, and attempts to reflect how frequently a string S occurs as a sentence.
The most widely-used language models, by far, are n-gram language models [17].
We introduce these models by considering the case n = 5; these models are called
5-gram models. For a sentence S composed of the words w1 · · · wp , without loss
of generality we can express P (S) as
P(S) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) \cdots P(w_p | w_1 \cdots w_{p-1}) = \prod_{i=1}^{p} P(w_i | w_1 \cdots w_{i-1}) \qquad (1)
For n-gram models where n > 2, we condition the probability of a word on the
identity of the last n−1 words. Generalizing equation 1 to n > 2, we get

P(S) = \prod_{i=1}^{p+1} P(w_i | w_{i-n+1}^{i-1}) \qquad (2)
where w_i^j denotes the words w_i \cdots w_j. To estimate P(w_i | w_{i-n+1}^{i-1}), the frequency
with which the word w_i occurs given that the words w_{i-n+1}^{i-1} precede the current
word w_i, we simply count how often the n-gram w_{i-n+1}^{i} occurs in some text and
normalize by the total number of occurrences of any word in the same context.
Let C(w_{i-n+1}^{i}) denote the number of times the n-gram w_{i-n+1}^{i} occurs in the
given text. Then

P(w_i | w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i})}{\sum_{w_i} C(w_{i-n+1}^{i})} \qquad (3)

Notice that C(w_{i-n+1}^{i-1}) \geq \sum_{w_i} C(w_{i-n+1}^{i}). To understand the inequality of
C(w_{i-n+1}^{i-1}) and \sum_{w_i} C(w_{i-n+1}^{i}), assume that both i and n are 5. Then, C(w_{i-n+1}^{i-1})
becomes C(w_1 w_2 w_3 w_4), which is actually the frequency of a specific 4-gram,
w_1 w_2 w_3 w_4, and \sum_{w_i} C(w_{i-n+1}^{i}) becomes \sum_{w_5} C(w_1 w_2 w_3 w_4 w_5), which is the
sum of the frequencies of all the 5-grams that start with the 4-gram w_1 w_2 w_3 w_4.
Thus, in general, C(w_1 w_2 w_3 w_4) is equal to \sum_{w_5} C(w_1 w_2 w_3 w_4 w_5). But, for some
specific cases, it is possible that C(w_1 w_2 w_3 w_4) is greater than \sum_{w_5} C(w_1 w_2 w_3 w_4 w_5).
For example, all the n-grams (2 ≤ n ≤ 5) that appear less than 40 times
have been filtered out from the Google Web 1T n-grams. This means all the
5-grams (starting with the 4-gram w_1 w_2 w_3 w_4) that appear less than 40 times
have not been included in \sum_{w_5} C(w_1 w_2 w_3 w_4 w_5). Thus, when we deal with the
Web 1T 5-grams and 4-grams, it is obvious that C(w_1 w_2 w_3 w_4) is greater than or
equal to \sum_{w_5} C(w_1 w_2 w_3 w_4 w_5). Thus, in general, we can say that C(w_{i-n+1}^{i-1}) \geq
\sum_{w_i} C(w_{i-n+1}^{i}). This also supports the idea of using the missing count (equation 4) in the smoothing formula (equation 5).
We use a smoothing method loosely based on the one-count method described
in [18]. Because tokens that appear less than 200 times and n-grams that appear
less than 40 times have been filtered out from the Web 1T, we use n-grams with
missing counts instead of n-grams with counts of one [19]. The missing count is
defined as:
M(w_{i-n+1}^{i-1}) = C(w_{i-n+1}^{i-1}) - \sum_{w_i} C(w_{i-n+1}^{i}) \qquad (4)
The corresponding smoothing formula is:
P(w_i | w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i}) + (1 + \alpha_n) M(w_{i-n+1}^{i-1}) P(w_i | w_{i-n+2}^{i-1})}{C(w_{i-n+1}^{i-1}) + \alpha_n M(w_{i-n+1}^{i-1})} \qquad (5)
Yuret [19] optimized the parameters αn > 0 for n = 2 · · · 5 on the Brown corpus
to yield a cross entropy of 8.06 bits per token. The optimized parameters are:
α2 = 6.71, α3 = 5.94, α4 = 6.55, α5 = 5.71
Thus, incorporating the smoothing formula in equation 2, we get
P(S) = \prod_{i=1}^{p+1} \frac{C(w_{i-n+1}^{i}) + (1 + \alpha_n) M(w_{i-n+1}^{i-1}) P(w_i | w_{i-n+2}^{i-1})}{C(w_{i-n+1}^{i-1}) + \alpha_n M(w_{i-n+1}^{i-1})} \qquad (6)

3.2 The Language Model Used for Our Task
For the specified task, we simplify the 5-gram model described in Section 3.1.
From equation 2, it is clear that the maximum number of products possible is 10
as 2 ≤ p ≤ 9. Among these products, we can omit the products P(w_{i-1} | w_{i-5}^{i-2}),
P(w_{i-2} | w_{i-6}^{i-3}), P(w_{i-3} | w_{i-7}^{i-4}), P(w_{i-4} | w_{i-8}^{i-5}), and P(w_{i+5} | w_{i+1}^{i+4}) because these
product items have the same values for all j ∈ 1 · · · m. Thus, the five product items
that we consider are: P(w_i | w_{i-4}^{i-1}), P(w_{i+1} | w_{i-3}^{i}), P(w_{i+2} | w_{i-2}^{i+1}), P(w_{i+3} | w_{i-1}^{i+2}),
and P(w_{i+4} | w_{i}^{i+3}). Applying this simplification in equation 2 and equation 6,
we get
P(S) = \prod_{i=5}^{p} P(w_i | w_{i-n+1}^{i-1}) = \prod_{i=5}^{p} \frac{C(w_{i-n+1}^{i}) + (1 + \alpha_n) M(w_{i-n+1}^{i-1}) P(w_i | w_{i-n+2}^{i-1})}{C(w_{i-n+1}^{i-1}) + \alpha_n M(w_{i-n+1}^{i-1})} \qquad (7)
Equation 7 is actually used in a recursive way for n=5,4,3,2,1 (i.e., if the
current n-gram count is zero, then it is used with a lower n-gram, and so on). For
example, P (w5 |w1 w2 w3 w4 ) is a function of C(w1 w2 w3 w4 w5 ) and P (w5 |w2 w3 w4 );
if C(w1 w2 w3 w4 w5 ) > 0 then we do not consider P (w5 |w2 w3 w4 ). This is a backoff
language model.
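A minimal sketch of this recursion follows, assuming a toy in-memory count table keyed by token tuples. Computing the missing count by scanning the whole table is an illustrative shortcut, not a realistic strategy for the full Web 1T data.

```python
ALPHA = {2: 6.71, 3: 5.94, 4: 6.55, 5: 5.71}  # Yuret's [19] optimized parameters

def smoothed_prob(tokens, counts, total_tokens):
    """Recursively compute P(w_i | w_{i-n+1}^{i-1}) from equations 4-5
    for an n-gram given as a token tuple (n <= 5).  `counts` maps
    n-gram tuples to frequencies; the recursion backs off to shorter
    histories and bottoms out at the unigram relative frequency."""
    n = len(tokens)
    if n == 1:
        return counts.get(tokens, 0) / total_tokens
    history = tokens[:-1]
    c_hist = counts.get(history, 0)
    if c_hist == 0:
        # unseen history: fall back directly to the shorter n-gram
        return smoothed_prob(tokens[1:], counts, total_tokens)
    # equation 4: missing count = history count minus mass of seen extensions
    seen = sum(c for g, c in counts.items() if len(g) == n and g[:-1] == history)
    missing = c_hist - seen
    alpha = ALPHA[n]
    backoff = smoothed_prob(tokens[1:], counts, total_tokens)
    # equation 5
    return (counts.get(tokens, 0) + (1 + alpha) * missing * backoff) / \
           (c_hist + alpha * missing)
```

Equation 7 is then the product of five such terms, one per position from the gap to the end of the (at most) 9-token window.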
4 Evaluation and Experimental Results

4.1 Comparison to Edmonds's and Inkpen's methods
In this section we present results of the proposed method explained in Section
3. We compare our results with those of Inkpen [2] and Edmonds [12]. Edmonds's [12] solution used the texts from the year 1989 of the Wall Street Journal
(WSJ) to build a lexical co-occurrence network for each of the seven groups of
near-synonyms from Table 2. The network included second-order co-occurrences.
Edmonds used the WSJ 1987 texts for testing, and reported accuracies only a
little higher than the baseline. Inkpen’s [2] method is based on mutual information, not on co-occurrence counts. Inkpen’s counts are collected from a much
larger corpus.
                                                       Accuracy
Test Set                    Number    Base   Edmonds's  Inkpen's      Inkpen's  Proposed
                            of Cases  Line   Method     Method        Method    Method
                                                        (Supervised)  (Unsup.)  (Unsup.)
difficult, hard, tough      6,630     41.7%  47.9%      57.3%         59.1%     63.2%
error, mistake, oversight   1,052     30.9%  48.9%      70.8%         61.5%     78.7%
job, task, duty             5,506     70.2%  68.9%      86.7%         73.3%     78.2%
responsibility, burden,
  obligation, commitment    3,115     38.0%  45.3%      66.7%         66.0%     72.2%
material, stuff, substance  1,715     59.5%  64.6%      71.0%         72.2%     70.4%
give, provide, offer        11,504    36.7%  48.6%      56.1%         52.7%     55.8%
settle, resolve             1,594     37.0%  65.9%      75.8%         76.9%     70.8%
ALL (average over
  all sentences)            31,116    44.9%  53.5%      65.2%         61.7%     65.3%
ALL (average from
  group averages)           31,116    44.8%  55.7%      69.2%         66.0%     69.9%
Table 2. Comparison among the new proposed method, a baseline algorithm, Edmonds's method, and Inkpen's unsupervised and supervised methods
For comparison purposes, in this section we use the same test data (WSJ
1987) and the same groups of near-synonyms. The seven groups of near-synonyms
used by Edmonds are listed in the first column of Table 2. The near-synonyms
in the seven groups were chosen to have low polysemy. This means that some
sentences with wrong senses of near-synonyms might be in the automatically
produced test set, but hopefully not many.
Before we look at the results, we mention that the accuracy values we compute are the percentage of correct choices when filling in the gap with the winning
near-synonym. The expected solution is the near-synonym that was originally in
the sentence, and it was taken out to create the gap. This measure is conservative; it does not consider cases when more than one solution is correct.
Table 2 presents the comparative results for the seven groups of near-synonyms.
The second-to-last row averages the accuracies over all the test sentences, i.e., these
are calculated as the number of correct choices returned over the total number of
sentences (i.e., 31,116). The last row averages the accuracies over all the group
averages, i.e., these are calculated as the sum of the accuracies (in percentage)
of all the seven groups over the number of groups (i.e., 7). The second column
shows how many test sentences we collected for each near-synonym group. The
third column is for the baseline algorithm that always chooses the most frequent near-synonym. The fourth column presents the results reported in [12].
The fifth column presents the result of Inkpen’s [2] supervised method when
using boosting (decision stumps) as machine learning method and PMI+500
words as features. The sixth column presents the result of Inkpen’s [2] unsupervised method when using word counts in PMI, and the last column is for our
proposed unsupervised method using the Google n-grams. We show in bold the
best accuracy figure for each data set. We notice that the automatic choice is
more difficult for some near-synonym groups than for the others.
Our method performs significantly better than the baseline algorithm, Edmonds's method, and Inkpen's unsupervised method, and comparably to Inkpen's
supervised method. For all the results presented in this paper, statistical significance tests were done using the paired t-test, as described in [20], page 209.
Error analysis reveals that incorrect choices happen more often when the context
is weak, that is, very short sentences or sentences with very few content words.
On average, our method performs 25 percentage points better than the baseline algorithm, 14 percentage points better than Edmonds's method, 4 percentage points better than Inkpen's unsupervised method, and comparably to
Inkpen’s supervised method. An important advantage of our method is that
it works on any group of near-synonyms without training, whereas Edmonds’s
method requires a lexical co-occurrence network to be built in advance for each
group of near-synonyms and Inkpen’s supervised method requires training for
each near-synonym group.
Some examples of correct and incorrect choices using our proposed method
are shown in Table 3. Table 4 shows some examples of cases where our proposed method fails to generate any suggestion. Cases where our method failed
to provide any suggestion are due to the appearances of some very uncommon
proper names or nouns, contractions (e.g., n’t), hyphens between two words
(e.g., teen-agers), single and double inverted commas in the n-grams, and so on.
Preprocessing to tackle these issues would improve the results.
CORRECT CHOICE:
· · · viewed widely as a mistake → mistake [error, mistake, oversight] and a major
···
· · · analysts expect stocks to settle → settle [settle, resolve] into a steady · · ·
· · · Sometimes that task → task [job, task, duty] is very straightforward · · ·
· · · carry a heavier tax burden → burden [responsibility, burden, obligation, commitment] during 1987 because · · ·
INCORRECT CHOICE:
· · · would be a political mistake → error [error, mistake, oversight] to criticize the
···
· · · its energies on the material → substance [material, stuff, substance] as well as
···
· · · 23 , and must provide → give [give, provide, offer] Washington-based USAir at
···
· · · Phog Allen – to resolve → settle [settle, resolve] the burning question · · ·
Table 3. Examples of correct and incorrect choices using our proposed method. Italics
indicate the proposed near-synonym choice returned by the method, arrow indicates
the original near-synonym, square brackets indicate the solution set.
NO CHOICE:
· · · two exchanges ’ ” → commitment [responsibility, burden, obligation, commitment] to making serious · · ·
· · · He said ECD ’s → material [material, stuff, substance] is a multicomponent
···
· · · Safe Rides , teen-agers → give [give, provide, offer] rides to other · · ·
· · · guilty plea does n’t → resolve [settle, resolve] the issue for · · ·
· · · sees Mr. Haig ’s → tough [difficult, hard, tough] line toward Cuba · · ·
· · · The 1980 Intelligence → Oversight [error, mistake, oversight] Act requires that
···
· · · thought Mr. Cassoni ’s → job [job, task, duty] will be made · · ·
Table 4. Examples of sentences where our method fails to generate any suggestion.
4.2 Experiment with human judges
Inkpen [2] asked two human judges, native speakers of English, to guess the
missing word in a random sample of the experimental data set (50 sentences
for each of the 7 groups of near-synonyms, 350 sentences in total). The results
in Table 5 show that the agreement between the two judges is high (78.5%),
but not perfect. This means the task is difficult even if some wrong senses in
the automatically-produced test data might have made the task easier in a few
cases.
5 Here, each of the seven groups has an equal number of sentences, which is 50. Thus, the
average from all 350 sentences and the average from group averages are the same.
Test set                    J1-J2      J1        J2        Inkpen's  Our
                            Agreement  Accuracy  Accuracy  Accuracy  Accuracy
difficult, hard, tough      72%        70%       76%       53%       62%
error, mistake, oversight   82%        84%       84%       68%       70%
job, task, duty             86%        92%       92%       78%       80%
responsibility, burden,
  obligation, commitment    76%        82%       76%       66%       76%
material, stuff, substance  76%        82%       74%       64%       56%
give, provide, offer        78%        68%       70%       52%       52%
settle, resolve             80%        80%       90%       77%       66%
All (average)5              78.5%      79.7%     80.2%     65.4%     66%
Table 5. Experiments with two human judges on a random subset of the experimental
data set
The human judges were allowed to choose more than one correct answer when
they were convinced that more than one near-synonym fits well in the context.
They used this option sparingly, only in 5% of the 350 sentences. In future work,
we plan to allow the system to make more than one choice when appropriate
(e.g., when the second choice has a very close score to the first choice).
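The planned extension is straightforward to sketch: return the top-scoring near-synonym, and also the runner-up whenever its score is within some tolerance of the best. The function and its `margin` parameter below are hypothetical illustrations of the idea, not the system's actual implementation:

```python
def near_synonym_choices(scores, margin=0.05):
    """Pick the best-scoring near-synonym; also return the runner-up
    when its score is within a relative `margin` of the best score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    picks = [ranked[0][0]]
    if len(ranked) > 1 and ranked[1][1] >= (1 - margin) * ranked[0][1]:
        picks.append(ranked[1][0])
    return picks

# Hypothetical language-model scores for one context:
print(near_synonym_choices({"settle": 0.048, "resolve": 0.047}))  # both returned
print(near_synonym_choices({"job": 0.09, "task": 0.02, "duty": 0.01}))  # only 'job'
```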
5 Conclusions
We presented an unsupervised statistical method for choosing the best near-synonym in a context. We compared this method with three previous methods (Edmonds's method and two of Inkpen's methods) and showed that the performance improved considerably. It is interesting that our unsupervised statistical method is comparable to a supervised learning method.
Future work includes an intelligent thesaurus and a natural language generation system that has knowledge of the nuances of meaning of near-synonyms. We plan to include a near-synonym sense disambiguation module to ensure that the thesaurus does not offer alternatives for wrong senses of words.
References
1. Inkpen, D.Z., Hirst, G.: Near-synonym choice in natural language generation. In:
Proceedings of the International Conference RANLP-2003 (Recent Advances in
Natural Language Processing), Borovets, Bulgaria (2003) 204–211
2. Inkpen, D.: A statistical model for near-synonym choice. ACM Transactions on
Speech and Language Processing 4 (2007) 1–17
3. Brants, T., Franz, A.: Web 1T 5-gram corpus version 1.1. Technical report, Google
Research (2006)
52
Islam A., Inkpen D.
4. Islam, A., Inkpen, D.: Real-word spelling correction using Google Web 1T 3-grams.
In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2009, Singapore, Association for Computational Linguistics
(2009) 1241–1249
5. Islam, A., Inkpen, D.: Real-word spelling correction using Google Web 1T n-gram data set. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, ACM (2009) 1689–1692
6. Nulty, P., Costello, F.: Using lexical patterns in the Google Web 1T corpus to
deduce semantic relations between nouns. In: Proceedings of the Workshop on
Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009),
Boulder, Colorado, Association for Computational Linguistics (2009) 58–63
7. Klein, M., Nelson, M.L.: Correlation of term count and document frequency for Google n-grams. In: Boughanem, M., et al., eds.: Proceedings of the 31st European Conference on Information Retrieval. Volume 5478/2009 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg (2009) 620–627
8. Murphy, T., Curran, J.: Experiments in mutual exclusion bootstrapping. In:
Proceedings of the Australasian Language Technology Workshop 2007, Melbourne,
Australia (2007) 66–74
9. Fellbaum, C., ed.: WordNet: An electronic lexical database. MIT Press (1998)
10. Turney, P., Littman, M., Bigham, J., Shnayder, V.: Combining independent modules to solve multiple-choice synonym and analogy problems. In: Proceedings of
the International Conference RANLP-2003 (Recent Advances in Natural Language
Processing), Borovets, Bulgaria (2003) 482–489
11. Clarke, C.L.A., Terra, E.: Frequency estimates for statistical word similarity measures. In: Proceedings of the Human Language Technology Conference of the North
American Chapter of the Association for Computational Linguistics (HLT-NAACL
2003), Edmonton, Canada (2003) 165–172
12. Edmonds, P.: Choosing the word most typical in context using a lexical co-occurrence network. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain (1997) 507–509
13. Church, K., Hanks, P.: Word association norms, mutual information and lexicography. Computational Linguistics 16 (1) (1990) 22–29
14. Brown, P.F., deSouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based
n-gram models of natural language. Computational Linguistics 18 (1992) 467–479
15. Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (1993) 183–190
16. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 17th International Conference on Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1998) 768–774
17. Chen, S.F., Goodman, J.T.: An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Computer Science Group, Harvard
University (1998)
18. Chen, S.F., Goodman, J.: An empirical study of smoothing techniques for language modeling. In: 34th Annual Meeting of the Association for Computational
Linguistics. (1996) 310–318
19. Yuret, D.: KU: Word sense disambiguation by substitution. In: Proceedings of the
4th workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic
(2007) 207–214
20. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing.
The MIT Press, Cambridge, Massachusetts (1999)
Morphology, Syntax,
Named Entity Recognition
Exploring the N-th Dimension of Language
Prakash Mondal
Cognitive Science Lab, International Institute of Information Technology
Gachibowli, Hyderabad 500032, India
Abstract. This paper explores the hidden fundamental computational property of natural language, a property so elusive that it has made all attempts to characterize it ultimately fail. Earlier, natural language was thought to be context-free. However, it was gradually realized that this view does not hold much water, given that a range of natural language phenomena have been found to be of non-context-free character, which has all but scuttled plans to brand natural language context-free. So it has been suggested that natural language is mildly context-sensitive and to some extent context-free. In all, it seems that the issue of the exact computational property has not yet been settled. Against this background it will be proposed that this exact computational property of natural language is perhaps the N-th dimension of language, if by dimension we mean nothing but a universal (computational) property of natural language.
Keywords: Hidden fundamental variable; natural language; context-free; computational property; N-th dimension of language.
1 Introduction
Let’s start with the question “What exactly is the universal computational property of natural language?” The simple, and perhaps a little mysterious, answer is that nobody knows what it is. Then we may ask the natural follow-up question: “Why?” Even here we are nowhere nearer to having any clear grasp of the reason why the exact computational property of natural language is beyond our reach. Perhaps it is not knowable at all. Or perhaps it is knowable, but the question is not a relevant one. Or perhaps there is no single answer to this question. This is the reason why we have, to date, a plethora of grammar formalisms or models of natural language defining and characterizing different classes of languages: some defining context-free languages, some context-sensitive ones, and some of a mixed nature, with each type having weak or strong generative capacity. In this paper it will be argued that this seemingly unknowable computational fundamental, having universal applicability and realization, constitutes the N-th dimension of natural language. Hence, we may know a little about the N-1, …, N-m (where m is an arbitrary
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 55-66
Received 28/10/09; Accepted 16/01/10; Final version 08/03/10
number and m < N) computational properties of natural language, but we do not
understand the N-th. This is the N-th dimension of natural language in that we do not
yet know what N stands for. On the surface of it all, it appears that the exact
computational property of natural language defies any attempt to uncover it.
However, it will also be argued that it may have, though not necessarily, something
to do with the emergent nature of natural language since natural language is an
emergent entity derived out of the interaction with several cognitive domains
involving emotion, social cognition, vision, memory, attention, motor system,
auditory system etc. And here at the linguistic level, language emerges through
integrated and interwoven, but partially constrained, interactions between syntax,
semantics, morphology, lexicon and phonology which form an overlapping network.
This is a case of recursive emergence in that there are two layers of emergence: one at
the level of language and another at the level of cognitive architecture as a whole.
Apart from all that, it will also be shown that our ignorance of the N-th dimension of natural language hampers our progress in natural language processing in particular and artificial intelligence in general. That means we do not yet have a universally implementable grammar formalism or formal model of natural language that can characterize, to the fullest possible extent, only and all conceivable natural languages and all possible natural language expressions, somewhat metaphorically like a Universal Turing machine. The lack of this hinders us in building a parser equipped with such a grammar, able to universally generate all possible natural languages and all possible natural language expressions. The same can be said of natural language generation systems, natural language understanding systems, etc. However, the fact that such a model does not now exist does not entail that we can never achieve it. Once achieved, it will perhaps serve as a key toward solving most of the problems we face in building a universally implementable language model that may facilitate cost-effective solutions across languages. Let’s now explore the possible nature, form and origin of the N-th dimension of language.
2 Laying the Exploration out against the Backdrop
From the earliest days of the Chomskyan hierarchy, the computational nature of natural language was assumed to be context-free. In fact, Chomsky’s generative grammar started out with context-free grammar [1]. Even before that, language was characterized as context-free in the structuralist tradition. But later Chomsky [2] himself argued against the validity of context-free grammar for characterizing natural language. With this the debate started over whether natural language is context-free or not, and a cascade of papers came out arguing that natural language cannot be context-free [3], [4], [5], [6], [7], [8]. Pullum and Gazdar [9] and Pullum [10], on the other hand, cast skepticism on such arguments. In the middle, Joshi [11], [12] tried to strike a middle ground between mild context-sensitivity and context-freeness with his Tree Adjoining Grammar. Even Chomsky’s updated versions of transformational grammar [13], [14], [15] had to constrain the power of his generative grammar by imposing filters in the form of hard constraints, in order to enrich context-free phrase structure grammar. The issue did not, however, come to a halt. Head-Driven Phrase Structure Grammar [16] and Lexical Functional Grammar [17] are also of a somewhat similar nature.
This shows that the question of which class of formal languages natural language belongs to is still a moot point. Is it because we classify languages (both artificial and natural) in terms of the Chomskyan hierarchy? Or is it because of some sort of natural order imposed upon natural languages, such that categorizing them into an exact class is not possible within the boundaries of the classes of language that we know of? It seems that the two questions are related, in that the available classes of languages in the Chomskyan hierarchy do not take us much further in identifying, and thereby understanding, what computational property natural language universally possesses. The problem with such an approach is that it is still not clear what it would mean to revise and modify the Chomskyan hierarchy in view of the fact that we cannot determine the exact computational dimension all natural languages have. In sum, what is clear enough is that this dimension, which can be called the N-th dimension of language, has not been broadly explored. People are debating whether natural language is context-free or context-sensitive; nobody seems to be concerned with whether natural language has some unknown computational dimension separate, segregated from, and independent of context-freeness and context-sensitivity. Nobody wonders whether it is reasonable at all to get mired in the question of natural language being context-free or context-sensitive, because received wisdom in the computational linguistics community says language is not context-free at all. So one possibility may be to explore how much and to what extent natural language is context-free [18]. Nobody has yet done this, but its implications may be nontrivial in some respects. Let’s now turn to some examples to get a flavor of the complication that the hidden dimension of language creates. If we have cases like the following in natural language,
(1) The girl the cat the dog chased bit saw … the boy.
(2) I know he believes she knows John knows I believe …. we are here.
we have a pattern like a^n b^n, a canonical context-free (but non-regular) language. In (1) we have a rule of the form (NP)^n (VP)^n, and (NP VP)^n in (2). But it is quite clear that such cases are just a fragment of the set of all possible natural language expressions. Context-sensitivity also appears in a range of natural language expressions (especially in Dutch). Consider the following from English,
(3) What did I claim to know without seeing?
(4) What did you believe without seeing?
Cases like these show multiple dependencies: between ‘what’ and the gaps after ‘know’ and ‘seeing’ in (3), and between ‘what’ and the gaps after ‘believe’ and ‘seeing’ in (4). They may well fall into a pattern of context-sensitivity, and such cases are fairly clear in the way they reveal computational properties of context-sensitivity. What is not clear is twofold: these examples do not explicitly demonstrate whether this suffices for us to characterize natural language invariably as context-free or context-sensitive when one looks at it holistically, nor do they show the exact computational property within which both context-freeness and context-sensitivity
can be accommodated. Overall, it seems that in showing different fragments of natural language expressions from different languages to be context-free or context-sensitive, we are missing some fundamental generalizations which might underlie the way natural language shows both context-sensitive and context-free properties. We still do not know what these generalizations may amount to, but they clearly point to a vacuum, as it is unclear how to fit the property of context-freeness into that of context-sensitivity.
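The contrast between the two kinds of pattern can be made concrete with toy recognizers. A single stack suffices for the a^n b^n pattern underlying center-embedding, which is context-free but not regular; the cross-serial-style pattern a^n b^n c^n is the textbook example that exceeds context-free power. The functions below are illustrative sketches over abstract strings, not grammars of any actual language:

```python
def accepts_anbn(s: str) -> bool:
    """Pushdown-style recognizer for a^n b^n, n >= 1 (context-free)."""
    stack = []
    i = 0
    while i < len(s) and s[i] == "a":
        stack.append("A")        # push one symbol per 'a'
        i += 1
    while i < len(s) and s[i] == "b":
        if not stack:            # more b's than a's
            return False
        stack.pop()              # match one 'b' against one 'a'
        i += 1
    return i == len(s) and not stack and len(s) > 0

def accepts_anbncn(s: str) -> bool:
    """a^n b^n c^n, n >= 1: the classic non-context-free pattern."""
    n = len(s) // 3
    return n >= 1 and s == "a" * n + "b" * n + "c" * n

print(accepts_anbn("aaabbb"), accepts_anbn("aabbb"))     # True False
print(accepts_anbncn("aabbcc"), accepts_anbncn("aabb"))  # True False
```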
All this makes it quite evident that natural language is neither fully context-free nor fully context-sensitive. If this is the case, then it is of some unknown computational property intermediate between context-freeness and context-sensitivity; and it may well have some property of a type-0 grammar as well. Nobody knows. This is what we may term the N-th dimension of language. Let’s now move on to its nature.
3 The Nature of the N-th Dimension
A little bit of precision is now necessary. Hence, for the sake of precision, we will model the N-th dimension, written 𝒩, as a 4-tuple

𝒩 = < Dn, Ep, ⊗w, ∞ > .    (1)

Dn is the set of hidden sub-dimensions characterizing and representing the components of the N-th dimension. Ep is the unknown expressive power of the N-th dimensional computational property of natural language. ⊗w is the Cartesian product Dn^i ⊗ … ⊗ Dn^m, where m ≠ n. And ∞ is the distribution of the other three in a linear (or real or complex) space Ś.
What is required at this stage is the stipulation that Dn determines how the unknown computational property appears whenever natural language is put under analysis, since this set contains the hidden components constituting the N-th dimension. So Dn ⊨ (∀x)(∃y)[x ≜ y] when x ϵ Dn and y ϵ P(Dn)^C, the complement of the power set of Dn. This makes sure that it may well be possible that some components of the hidden N-th dimension of natural language correspond to some other components in the other classes of computational properties in the Chomskyan hierarchy.
This is what leads us to make such claims as that natural language is context-free or context-sensitive. Moving on, let’s now say something about Ep. It can describe and generate all possible natural languages and all possible natural language expressions with its absolute generative capacity. Let’s first denote by ℒ the set of all possible languages (both artificial and natural), and let ℒ^C denote the set of only and all possible natural languages. Finally, let ℰ be the set of all possible expressions in natural language. So we can now say that

Ep ⋟ (ℒ^C) | É ⊆ ℰ ∧ É = ∅ when Ep ⋡ É .    (2)
Here read the sign ⋟ as ‘generable from’. This guarantees that there cannot exist any subset of the set of all possible expressions of natural languages which is not generable from Ep. Lastly, ∞ orders the 4-tuple in configurations we are still unaware of. Roughly, for the purpose of modeling, let’s say there can possibly be C = {C1, …, Ck} such configurations, where k can be any arbitrary number. So we have

O1(Dn), O2(Ep), O3(⊗w) ⊢ Oi(∞) .    (3)
This expresses the ordering Oi of the other three members of the 4-tuple with respect to ∞. Now it becomes clear that the N-th dimension of language is quite complex in nature. Even if it can be mathematically modeled, the model is at best patchy; we can never be sure that we have grasped it. For the sake of simplification, it was nevertheless vital to model it. Now a bit of focus will be shed on how it might work.
3.1 How the N-th Dimension Possibly Works
The description of how the N-th dimension can possibly work is provided to facilitate an understanding of the way it operates in natural language in general. In addition, this will also help us proceed one step further in realizing why it is so elusive that it has cudgeled the brains of so many computer scientists and linguists alike. Given that Dn is the set of hidden sub-dimensions characterizing and representing the components of the N-th dimension, we are now able to say that

∫ f1, …, fm (x1, …, xm) dx1 … dxm . Δ∞ .    (4)
where x1, …, xm ϵ Dn when m ≤ n, such that |Dn| = n. This shows that x1, …, xm ϵ Dn may appear in different distributions, which determine the differential probabilities associated with each distribution. The nature of these differential probabilities themselves spaces out the way the N-th dimension appears in our analysis. Hence it is not simply the product of a linear function mapping one class of computational property into something else, for example, from context-freeness into the N-th dimension. Using typed predicate calculus [19], we can say that the following

Г ⊢ t1 : T1 , … , Г ⊢ tn : Tn
―――――――――――――――――――――
Г ⊢ F(t1, …, tn) : Oi(T1, …, Tn) .    (5)
does not hold for the N-th dimension of natural language, where Г is a context and t1, …, tn are computational properties of formal languages in the Chomskyan hierarchy, like finite-stateness, context-freeness, context-sensitivity, etc., which can be labeled as being of different types T1, …, Tn. Let us suppose here that the operator Oi fixes how the N-th dimension arises out of such a mapping from those computational properties t1, …, tn into the types T1, …, Tn. But even this does not capture the N-th
dimension, as analyzing natural language as a complex of different computational types does not help us much, in that we do not know how Oi really works. We have just a vague idea of how it may function, given that natural language has the properties of the types T1, …, Tn.
Other dimensions of natural language also do not take us any farther in such a
situation. We now know that a language has a grammar along with a lexicon and it
generates expressions of that language. This is one of the (computational) properties
of (natural) language. Given a grammar and a lexicon, the generated expressions of
the grammar in that language are string sets. Of course, language users do not think of their languages as string sets; however, this is another dimension. So is the infinite recursion of string sets in natural language. Constraints on natural language constructions are determined by a set of universal principles. Pendar [20] has emphasized that natural language is a modularized computational system of interaction and satisfaction of such soft constraints. Optimality Theory [21], a theory of natural language, has just such a system built into it. These are some of the dimensions of language. Now consider the question: what if we add them together and then derive the N-th dimension, since, above all, cumulative summing over all dimensions is equivalent to a higher dimension? A little common sense shows that this may well be wrong. Let’s see how. Let’s sum all the existing dimensions of natural language:

S(d) = Σ_{i=1}^{n} d_i .    (6)
Here S(d) is never equivalent to the N-th dimension. Had it been, then by that principle we would obtain the following:

S(d) = < Dn, Ep, ⊗w, ∞ > .    (7)

But a careful consideration of the matter suggests that we do not get Ep out of S(d), even if the equivalence says we should. We do not get Ep out of S(d) mainly because S(d) may well generate some artificial language which does not have Ep. By analogy, suppose that entities in the world had only two dimensions, namely width and length, but not height. In such a case, combining the two known dimensions would never give us the third one, height.
All this indicates that the N-th dimension is much more complex than has been recognized. However, this gives us no reason for rejoicing in such a nature of natural language. Rather, it challenges us to decode its complexity, which is so overwhelming that it is at present beyond our rational hold. This brings us to the next issue, that is, the relation of complexity to the N-th dimension.
3.2 Complexity and the N-th Dimension
It may now be necessary to draw connections between the N-th dimension and complexity. Is there any connection between them at all? How are they related? Of course, answers to such questions are not easy to get, and we shall not try to answer them here. Instead, an attempt will be made to map out the space over which the N-th dimension can be said to be complex. Now let’s proceed. The notion of Kolmogorov complexity will be used here, as it is handy enough in that it has been proved to be machine-independent [22]. We know that the Kolmogorov complexity of an object is the length of the shortest possible program for a Universal Turing machine that computes a description of that object. Now let’s assume that the object under consideration is the N-th dimension 𝒩. What would its measure of complexity be with respect to Kolmogorov complexity? Mathematically speaking, we have

K(𝒩) = min_{p : U(p) = 𝒩} l(p)    (8)

where K(𝒩) is the Kolmogorov complexity computed relative to a Universal Turing machine U, and p is the shortest possible program that can give a description of 𝒩, the N-th dimension. If, according to Kolmogorov complexity, the complexity of 𝒩 is the length of the shortest possible program for computing 𝒩, then such a program p does not independently exist. The term ‘shortest possible program’ is relative: unless we find a set of such programs P = {p1, …, pk}, we can never measure the complexity of 𝒩. But we can obviously be sure that K(𝒩) is lower than l(this paper itself)!
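Kolmogorov complexity is uncomputable in general, but it is easy to bound from above with any concrete compressor, which is the spirit of the l(this paper) remark. A minimal sketch of this standard upper-bound idea, using zlib as an assumed stand-in compressor:

```python
import os
import zlib

def kc_upper_bound(data: bytes) -> int:
    """Length of a zlib-compressed encoding: a concrete upper bound
    (up to an additive constant) on the Kolmogorov complexity of `data`."""
    return len(zlib.compress(data, 9))

regular = b"ab" * 500          # highly patterned: a short program generates it
random_ish = os.urandom(1000)  # incompressible with overwhelming probability

print(kc_upper_bound(regular))     # far below the raw length of 1000 bytes
print(kc_upper_bound(random_ish))  # close to (or above) 1000 bytes
```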
Now suppose the 4-tuple model provided above to model 𝒩 actually attains K(𝒩). What is the implication then? The length of any program that can model the 4-tuple is then the complexity K(𝒩). Are we done with it? Possibly not. We still do not know whether this is the shortest program in all possible worlds. But if that program is, by the present criteria, the complexity measure for 𝒩, we have to accept it. However, the problem is still open-ended, because we have not yet fully understood 𝒩 at all, despite the progress that has been made in formal language theory. This leads us to the following definition.

Definition (Unboundedness of the Complexity of 𝒩): Given that there may well exist a set of programs P = {p1, …, pk} whose actual lengths we are yet to figure out, K(𝒩) is not bounded from above by any constant C.
This definition ensures that we remain uncertain as to what the actual complexity of 𝒩 can be. Science consists in the simplification of complex phenomena, and the N-th dimension of natural language is a case where, even if we aim at simplification, it defies such attempts. That is what this section teaches us. This may be because of another important property of natural language: it is an emergent entity. Let’s see what this has in store for the N-th dimension of natural language.
3.3 Emergence and the N-th Dimension of Language
Now we have come to the last candidate that has any claim to being related to the N-th dimension. It is not, of course, necessary that a link be forged between the N-th dimension and the case of emergence, according to which the whole is not a sum of its parts. The case for natural language being emergent has been made quite forcefully by Andreewsky [23]. It appears that language is recursively emergent in that, at the higher level, language emerges out of the interaction of several cognitive domains involving emotion, social cognition, vision, memory, attention, the motor system, the auditory system, etc. And then at the linguistic level, spanning out at the lower level, language emerges through integrated and interwoven, but partially constrained, interactions between syntax, semantics, morphology, lexicon and phonology, which form an overlapping network. One type of emergence is thus embedded in another. The cognitive architecture of language at the higher level is shown below.
Fig. 1. Interconnection of language to other cognitive domains
Fig. 1 above is symmetric, with certain constraints on the mutual interaction of the cognitive domains/systems, such that not all information passes from one domain/system to another. Such constraints, computational, algorithmic and implementational in Marr’s [24] sense, are specified by our genome. Here all these domains have varying interactions with respect to each other; what this means is that all these cognitive domains have differential mutual interactions with respect to each other. Following Mondal [25], this can be represented through a distribution of joint probability in the following equation. The interaction potential I(φ) can be characterized as

I(φ) = Σ_{i=1}^{N} P_i ∫ dc_1 … dc_N δ(c_1 … c_N) . Δψ .    (9)
Here c1 … cN are differential probability functions coming out of the interaction
dynamics of language with other cognitive domains and N in c1 … cN must be a
finite arbitrary number as required for the totality of cognitive domains; P is a
probability function; Δψ is a temporal distribution. The cognitive domains are
functionally coherent units of emergent self-organizing representational resources
realized in diffuse, often shared and overlapping, networks of brain regions as
mediated by the bidirectional interactions in Fig. 1.
Now it may be asked what connection exists between such cognitive emergence and the N-th dimension. Note that what is emergent is so complex that its complexity can be neither represented nor predicted; the above discussion on complexity and the N-th dimension shows this all the more clearly. In addition, emergent entities are chaotic and random most of the time, and natural language is no exception. The way the lack of a grasp of the N-th dimension has been the centerpiece of the debate on the computational property of natural language evidences the fact that it has appeared, randomly, as different things to different researchers. That is why it has at times been claimed to be of a finite-state nature, sometimes context-free, sometimes mildly context-sensitive, and so on. This is perhaps because of the emergent randomness and an apparent lack of order in the dimensions of natural language. One may also argue that two separate issues, the computational property of linguistic structures and the evolution of language, are being conflated here. But one should also note that whatever the (computational) properties of natural language are, they have come about only through the mechanisms of evolution. Even many distinguishing properties of natural language, like ambiguity, vagueness, redundancy and arbitrariness, have appeared in natural language only as products or by-products of evolution, which is generally blind and so not goal-driven [26]. And it may well be plausible that it is such properties that confound and mystify the exact computational property of natural language. However, whatever the N-th dimension turns out to be, it will certainly be one of the greatest challenges for natural language processing, mathematical linguistics and theoretical computer science in general. This brings us nearer to the issue of what the N-th dimension means for natural language processing.
4 What the N-th Dimension Means for Natural Language Processing
Grammar formalisms are at the heart of natural language processing. Whether we are doing parsing, natural language generation, natural language understanding or machine translation, grammar formalisms are inevitable. Even if in recent years grammar induction through statistical techniques has become possible, the combination of robustness and minimal computational costs in terms of time and storage, still a goal aimed at, demands that a grammar formalism be available at hand. Even in cases where grammar induction is possible through machine learning, having a grammar formalism is always indispensable, in that a standard model must be available for modeling before the training period [27]. In fact, there can be, in a more general sense, no parsing without a grammar. And parsing is at the heart of a range of natural language processing tasks, ranging from natural language generation, natural language understanding and machine translation to dialog systems. All present types of parsing use one or another grammar formalism based on different types of formal grammar [28], [29], [30], [31], [32], [33], [34].
So let us confront the question of the N-th dimension in a rather straightforward manner. If we someday become successful in finding this dimension, however chaotic and random it may well be, it will be one of the greatest achievements in the entire field of natural language processing and AI in general; for it will place us on a pedestal for building grammar models that can be implemented across languages universally. There will be no variation, no fractionation, in the way grammar formalisms and models are built and implemented computationally. Once we arrive at the N-th dimension of natural language, it will enable us to refine and enhance the effectiveness of our natural language processing tools, which still lag far behind humans [35].
This will also help us achieve uniformity in our grammar models, thereby eliminating the barrier of cross-linguistic transfer. This will propagate into multiple channels of progress in natural language technology in general. Just as the discovery of the formal language hierarchy opened up new vistas for computation in general, which ultimately gave birth to the field of natural language processing or NLP, the discovery of the N-th dimension will take us one revolutionary step farther toward a ‘golden age’ of natural language technology. One may complain that such optimism is misplaced; but consider the sort of achievement in formal language theory this would amount to, as formal language theory has long been embroiled in debates on the exact computational nature of natural language. After many years, the efforts have been futile, in that nobody has yet discovered the N-th dimension. What if we discover it some day? Would it not constitute a bigger achievement than even the discovery of the Chomskyan hierarchy? Well, only the future will tell. We are nearing the conclusion with this ray of hope.
5 Concluding Remarks
This paper has offered a brief sketch of what the N-th dimension is: its nature, its
possible functioning and its links to complexity, emergence and NLP. After this brief
tour, it seems that we have learned very little about the computational nature of
natural language, even if we know a lot about formal or artificial languages. The
reason is perhaps the N-th dimension itself. It is a major bottleneck on our
path toward being able to compute all natural languages on earth. This may sound too
ambitious, but any NLP researcher may be asked whether he or she would not want to see
how his or her algorithms fare in other languages. The answer had better be yes.
Moreover, it is also a dream for formal language theorists and perhaps
mathematical linguists. Even so, there is still scope for the claim that
we may never be able to discover the N-th dimension of natural language because it is
emergent, random and chaotic. We consider such claims exactly as plausible as the claim that
there does exist an N-th dimension of natural language waiting to be discovered. Nobody can
predict the trajectory that research takes in any field. The same applies
to NLP as well. Even if it ultimately transpires that the entire research program
built on the search for the computational class of natural language is flawed
and wrong, it will still have been fruitful in showing us the right path and a gateway
toward a greater understanding of natural language.
References
1. Chomsky, N.: Syntactic Structures. Mouton, The Hague (1957).
2. Chomsky, N.: Formal Properties of Grammars. In: Luce, R. D., Bush, R. R.,
Galanter, E. (eds.) Handbook of Mathematical Psychology. Vol. II. Wiley, New
York (1963).
3. Postal, P. M.: Limitations of Phrase-Structure Grammars. In: Fodor, Jerry, Katz,
Jerald. (eds.) The Structure of Language: Readings in the Philosophy of Language.
Prentice Hall, New Jersey (1964).
4. Langendoen, D. T.: On the Inadequacy of Type-2 and Type-3 Grammars for Human
Languages. In: Hopper, P. J. (ed.) Studies in Descriptive and Historical Linguistics.
John Benjamins, Amsterdam (1977).
5. Higginbotham, J.: English is not a Context-Free Language. Linguistic Inquiry. Vol.
15, 225-234 (1984).
6. Langendoen, D. T., Postal, P. M.: The Vastness of Natural Language. Blackwell,
Oxford (1984).
7. Langendoen, D. T., Postal, P. M.: English and the Class of Context-Free Languages.
Computational Linguistics. Vol. 10, 177-181 (1985).
8. Shieber, S. M.: Evidence against Context-Freeness of Natural Language. Linguistics
and Philosophy. Vol. 8, 333-343 (1985).
9. Pullum, G. K., Gazdar, G.: Natural Languages and Context-Free Languages.
Linguistics and Philosophy. Vol. 4, 471-504 (1982).
10. Pullum, G. K.: On Two Recent Attempts to Show that English is not a CFL.
Computational Linguistics. Vol. 10, 182-186 (1985).
11. Joshi, A.: Factoring Recursion and Dependencies: An Aspect of Tree-Adjoining
Grammars and a Formal Comparison of some Formal Properties of TAGs, GPSGs,
PLGs, and LFGs. In: Proceedings of the 21st Annual Meeting of the Association for
Computational Linguistics, pp. 7-15. Association for Computational Linguistics,
Cambridge, MA (1983).
12. Joshi, A.: Tree Adjoining Grammars: How much Context-Sensitivity is Required to
Provide Reasonable Structural Descriptions? In: Dowty, D., Karttunen, L., Zwicky,
A. M. (eds.) Natural Language Processing: Psycholinguistic, Computational and
Theoretical Perspectives. Cambridge University Press, New York (1985).
13. Chomsky, N.: Aspects of the Theory of Syntax. MIT Press, Cambridge, Mass.
(1965).
14. Chomsky, N.: Rules and Representations. Columbia University Press, New York
(1980).
15. Chomsky, N.: The Minimalist Program. MIT Press, Cambridge, Mass. (1995).
16. Pollard, C., Sag, I.: Head-Driven Phrase Structure Grammar. Chicago University
Press, Chicago (1994).
17. Bresnan, J.: The Mental Representation of Grammatical Relations. MIT Press,
Cambridge, Mass. (1982).
18. Pullum, G. K.: Systematicity and Natural Language Syntax. Croatian Journal of
Philosophy. Vol. 7. 375-402 (2007).
19. Turner, R.: Computable Models. Springer, London (2009).
20. Pendar, N.: Linguistic Constraint Systems as General Soft Constraint Satisfaction.
Research on Language and Computation. Vol. 6, 163-203 (2008).
21. Prince, A., Smolensky, P.: Optimality Theory: Constraint Interaction in Generative
Grammar. Rutgers University Centre for Cognitive Science, New Jersey (1993).
22. Cover, T. M., Thomas, J. A.: Elements of Information Theory. Wiley, New York
(1991).
23. Andreewsky, E.: Complexity of the Basic Unit of Language: Some Parallels in
Physics and Biology. In: Mugur-Schachetr, M., van der Merwe, A. (eds.) Quantum
Mechanics, Mathematics, Cognition and Action. Springer, Amsterdam (2002).
24. Marr, D.: Vision: A Computational Investigation into the Human Representation and
Processing of Visual Information. W H Freeman, San Francisco (1982).
25. Mondal, P.: How Limited is the Limit? In: Proceedings of the International
Conference on Recent Advances in Natural Languages Processing, pp. 270-274.
RANLP, Borovets (2009).
26. Kinsella, A. R., Marcus, G. F.: Evolution, Perfection and Theories of Language.
Biolinguistics. Vol. 3.2-3, 186-212 (2009).
27. Ristad, E. S.: The Language Complexity Game. MIT Press, Cambridge, Mass.
(1993).
28. Goodman, J.: Semiring Parsing. Computational Linguistics. Vol. 25, 573-605 (1999).
29. Charniak, E.: A Maximum-Entropy-Inspired Parser. In: Proceedings of North
American Chapter of the Association for Computational Linguistics-Human
Language Technology, pp. 132-139. Association for Computational Linguistics,
Washington (2000).
30. Eisner, J.: Bilexical Grammars and their Cubic-Time Parsing Algorithms. In: Bunt,
H., Nijholt, A. (eds.) Advances in Probabilistic and Other Parsing Technologies, pp.
29-62. Kluwer Academic Publishers, Amsterdam (2000).
31. Nederhof, M. J.: Weighted Deductive Parsing and Knuth's Algorithm.
Computational Linguistics. Vol. 29, 135-143 (2003).
32. Pradhan, S., Ward, W., Hacioglu, K., Martin, J. H., Jurafsky, D.: Shallow Semantic
Parsing Using Support Vector Machines. In: Proceedings of North American
Chapter of the Association for Computational Linguistics-Human Language
Technology, pp. 233-240. Association for Computational Linguistics, Boston, MA
(2004).
33. Charniak, E., Johnson, M.: Coarse-to-Fine n-Best Parsing and MaxEnt
Discriminative Reranking. In: Proceedings of the 43rd Annual Meeting of the
Association for Computational Linguistics, pp. 173-180. Association for Computational
Linguistics, Michigan (2005).
34. Huang, L., Chiang, D.: Better k-Best Parsing. In: Proceedings of International
Workshop on Parsing Technologies (IWPT), pp. 53-64. Vancouver, Omnipress
(2005).
35. Moore, R. K.: Towards a Unified Theory of Spoken Language Processing. In:
Proceedings of 4th IEEE International Conference on Cognitive Informatics, pp. 167-172. IEEE Press, Irvine, CA (2005).
Automatic Derivational Morphology
Contribution to Romanian Lexical Acquisition
Petic Mircea
Institute of Mathematics and Computer Science of the ASM,
5, Academies str., Chisinau, MD-2028, Republic of Moldova
mirsha@math.md
Abstract. Derivation with affixes is a method of vocabulary enrichment.
Awareness of the stages in enriching the lexicon by means of derivational
morphology mechanisms leads to the construction of an automatic generator
of new derivatives. The digital variant of the derivatives dictionary helps
to overcome difficult situations in validating new words and in handling
the uncertain character of the affixes. In addition, the derivative groups,
the concrete consonant and vowel alternations and the lexical families can
be established using the dictionary of derivatives.
1 Introduction
Linguistic resources represent the fundamental support for the development of
automatic tools for processing linguistic information. Lexical acquisition
constitutes one of the most important methods of enriching lexical resources,
and the examination of the problems of automatization is one of the main
aspects of creating linguistic resources.

The need to enrich lexical resources is satisfied not only by borrowing words
from other languages, but also by exclusively internal processes [1]. Inflection,
derivation and compounding are the most productive ways of word formation in
Romanian. Inflection is the generation of word forms without changing their
initial meaning. Derivation means the creation of a new word by adding one or
more affixes to an existing lexical base. Compounding combines existing
independent words, or words with elements of thematic type. In this article
derivation is investigated.

The aim of this article is to study the derivational morphology mechanisms
which permit automatic lexical resource acquisition for Romanian. The paper is
structured as follows. First, the derivational particularities of Romanian are
described; then the most important stages in automatic derivation are reviewed.
A special description is dedicated to the methods of validating new words, and
the uncertain particularities of Romanian affixes are examined; a solution to
this uncertainty was found in the digital variant of the Romanian derivatives
dictionary. This new electronic resource can be used to detect derivative
groups. Moreover, it permits the examination of vowel and consonant alternations
and the construction of lexical families in the derivation process.
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 67-78
Received 22/11/09
Accepted 16/01/10
Final version 09/03/10
2 Derivational Particularities of Romanian Language
The majority of Romanian derivational mechanisms are also common to other
European languages, especially those of Latin origin.
The affix represents any morpheme which remains beside the root when a word is
segmented. Affixes include prefixes (added before the root) and suffixes (added
after the root) [2]. Romanian has 86 prefixes [3] and more than 600 suffixes.
There are two types of suffixes: lexical and grammatical. A lexical suffix is a
group of letters (or a single letter) which is added after the root and forms a
new word. A grammatical suffix is a group of letters (or a single letter) which
is added after the stem so that a form of the same word is obtained [4]. Roots,
prefixes and suffixes are all morphemes [5].
The word formed by adding a prefix or a suffix is called a derivative. For
example, the derivative frumuseţe consists of the root frumos and the suffix
-eţe. Often suffixes and prefixes are added not directly to roots but to a
lexical stem, which comprises a root and at least one prefix or suffix; for
example, străbătător is formed from the prefix stră- and the stem bătător,
which consists of the root bate and the suffix -ător [2].
The derivatives can be classified into three groups: analyzable, semi-analyzable,
and non-analyzable. In analyzable derivatives both the affix and the root are
distinguishable. Semi-analyzable derivatives are words in which only the affix
is distinguishable. In non-analyzable derivatives we can distinguish neither
the affix nor the root [3].
3 Stages in the Automatic Derivation
In the process of automatizing derivational morphology it is important to
examine the following steps: the analysis of the affixes, the elaboration of an
affix-detection algorithm, the formalization of the derivational rules and the
validation of the new words generated by specific algorithms.

The analysis of the affixes consists in establishing their quantitative
features. The quantitative features for some prefixes and suffixes were set up
according to the list of Romanian prefixes and suffixes. On the basis of the
elaborated algorithms [6], programs were developed that made it possible to find out:

– the number of words which begin (end) with some prefixes (suffixes);
– the number of words derived with these affixes;
– the distribution of the letters that follow the mentioned affixes;
– the part-of-speech distribution for some Romanian affixes.

Taking the obtained results into account, it was noticed that some prefixes
and suffixes have several phonological forms. That is why it is difficult to set
up the quantitative characteristics of the affixes [6].
As not all words end (begin) with the same suffixes (prefixes), algorithms were
elaborated to enable the automatic extraction of derivatives from the lexicon.
The elaborated algorithms took into account the fact that, for x, y ∈ Σ+, where
Σ+ is the set of all possible roots, if y = xv then v is a suffix of y, and if
y = ux then u is a prefix of y. In this context both y and x must be valid words
of Romanian, and u and v are strings that can be affixes attested for Romanian
[3]. The problem of consonant and vowel alternations was neglected in this
derivative-extraction algorithm, which prevents the exact detection of all
derivatives.
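Under these assumptions the extraction step can be sketched as follows; the toy lexicon and affix lists stand in for the real Romanian resources, and the function name is an assumption of this illustration:

```python
# Sketch of derivative extraction by affix stripping, ignoring
# alternations as in the paper. The lexicon and affix lists are toy
# stand-ins for the real Romanian resources.

def extract_derivatives(lexicon, prefixes, suffixes):
    """Return (derivative, affix, base) triples where stripping an
    attested affix from a lexicon word leaves another lexicon word."""
    words = set(lexicon)
    found = []
    for y in words:
        for u in prefixes:                 # y = u + x  ->  u is a prefix
            x = y[len(u):]
            if y.startswith(u) and x in words:
                found.append((y, u + '-', x))
        for v in suffixes:                 # y = x + v  ->  v is a suffix
            x = y[:-len(v)]
            if y.endswith(v) and x in words:
                found.append((y, '-' + v, x))
    return sorted(found)

lexicon = ['scris', 'nescris', 'bun', 'bunic', 'ic']
print(extract_derivatives(lexicon, ['ne'], ['ic']))
# [('bunic', '-ic', 'bun'), ('nescris', 'ne-', 'scris')]
```

As in the paper's algorithm, a pair is accepted only when both the derivative and the remaining base are valid lexicon words.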
For a certain number of affixes, derivational rules were found which allow the
generation of new derived words not registered in dictionaries [7]. In the
process of completing linguistic resources by automatic derivation there appears
a natural tendency to use the most frequent affixes. In fact, however, the most
productive affixes prove problematic because of their irregular behavior [7].
That is why affixes which permit the formulation of simpler rules of behavior,
without many exceptions, were taken for the research. As a consequence, these
rules operate with the prefixes ne- and re-, and also with the suffixes -tor and
-bil (the latter being frequent in derivation with the prefix ne-). The lexical
suffix -iza was also included in the research, as it has neological origin and
is currently very productive, in strong relation with the lexical suffixes -ism
and -ist.
After the derivative-generation process, not all obtained words can be
considered valid. The set of words must pass a validation step, an important
level in the correct detection of the derivatives generated by the programs
based on derivation rules. It is also possible to validate the words with the
help of linguists, but this requires more time, and there remains the
possibility of mistakes.
On the other hand, validation can be done by checking the presence of the
derived words in electronic documents. In this situation it is possible to use
verified electronic documents, such as various Romanian corpora. Unfortunately,
the corpora contain an insufficient number of words usable for validating new
derivatives. Therefore, the simplest way is to work with the existing documents
on the Internet, which are considered unverified sources. In this case the
difficulty lies in setting up the conditions which guarantee that a word is
valid. Thus, it was attempted to determine indices of frequency [7] for some
Romanian affixes, and a program was elaborated which extracts the number of
occurrences of words from the Google search engine for Romanian.
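The validation criterion can be sketched as a simple frequency threshold; here a plain dictionary of counts stands in for the occurrence counts the authors retrieved, and the threshold value is an assumption for illustration only:

```python
# Sketch of frequency-based validation of generated derivatives.
# `hit_counts` stands in for occurrence counts retrieved from a search
# engine or corpus; MIN_HITS is an illustrative threshold, not the
# authors' actual criterion.

MIN_HITS = 100

def validate(candidates, hit_counts):
    """Keep only candidates whose occurrence count reaches the threshold."""
    return [w for w in candidates if hit_counts.get(w, 0) >= MIN_HITS]

hit_counts = {'recitire': 2400, 'nescris': 18000, 'rescrisitor': 3}
print(validate(['recitire', 'nescris', 'rescrisitor'], hit_counts))
# ['recitire', 'nescris'] -- the rare invented form is rejected
```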
4 Digital Variant of the Derivatives Dictionary
The derivatives dictionary [8] contains only the graphic representation of each
derivative and its constituent morphemes, without any information about the part
of speech of the derivative or of its stems. The digital variant of the
dictionary [8] was obtained by scanning, OCR processing and correction of the
original entries. This electronic variant of the dictionary [8] becomes
important because it is difficult to establish the criterion for validating new
derived words. Moreover, it permits the detection of a derivative's morphemes
with their types (prefix, root and suffix), and constitutes an important
electronic linguistic resource.
Practically, the entries in this dictionary are constructed on an uncertain
scheme: it is not clear where the affixes and the root are. In order to exclude
this uncertainty from the entries of the digital variant of the dictionary, a
regular expression was devised that represents the structure of the derivatives:

derivative = (+morpheme)* .morpheme (−morpheme)*

where +morpheme is a prefix, .morpheme is the root and −morpheme is a suffix.
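Entries in this format can be parsed mechanically; the sketch below uses the subordonator example from Section 8, while the function name and character classes are assumptions of this illustration:

```python
import re

# Sketch: split a dictionary entry of the form (+prefix)* .root (-suffix)*
# into its morphemes, following the regular expression given in the text.

ENTRY = re.compile(r'((?:\+[^+.\-]+)*)\.([^+.\-]+)((?:-[^+.\-]+)*)')

def parse_entry(entry):
    m = ENTRY.fullmatch(entry)
    if m is None:
        raise ValueError('malformed entry: %r' % entry)
    prefixes = re.findall(r'\+([^+.\-]+)', m.group(1))
    suffixes = re.findall(r'-([^+.\-]+)', m.group(3))
    return prefixes, m.group(2), suffixes

print(parse_entry('+ne.scris'))         # (['ne'], 'scris', [])
print(parse_entry('+sub.ordona-tor'))   # (['sub'], 'ordona', ['tor'])
```

Because Python's repeated groups capture only their last match, the prefix and suffix runs are captured as whole substrings first and then split with `findall`.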
Table 1. The most frequent prefixes

Prefix  Number of distinct derivatives
ne-     571
ı̂n-     293
re-     281
des-    109
pre-    109
To find out the statistical characteristics of the dictionary, algorithms were
elaborated and programs were then developed. It was calculated that the
dictionary consists of 15,300 derivatives with 42 prefixes (Table 1), 433
suffixes (Table 2) and over 6,800 roots.
Table 2. The most frequent suffixes

Suffix  Number of distinct derivatives
-re     2793
-tor    605
-toare  522
-eală   514
-ie     400

5 Uncertain Characters of the Affixes
One of the main problems concerning Romanian derivation is the uncertainty of
morpheme boundaries. In many cases different people, or even the same person put
in different situations, divide the same word into segments in different ways.
Besides the segmentation into morphemes (for example: anti-rachetă) or into
allomorph variants, a word form can be segmented in other ways [8], such as:
syllables (for example: an-ti-ra-che-tă), sounds (for example:
a-n-t-i-r-a-ch-e-t-ă) and letters (for example: a-n-t-i-r-a-c-h-e-t-ă). All
these types of segmentation have nothing to do with segmentation into morphemes,
because they do not indicate the morpheme boundaries.
Segmentation into morphemes implies the detection of the morphemes together with
their types (root, prefix, and suffix). Unfortunately, not all morphemes are
distinct. It was observed that derivatives can contain roots which can also be
suffixes or prefixes (for example: -re, -tor, -os, -uşor, -uliţă). There are
even morphemes where root, prefix and suffix coincide; for example, the morpheme
an is a prefix in the word anistoric, a root in the word anişor, and a suffix in
the word american.

Taking into account this uncertain character of the morpheme boundaries within
derivatives, a natural tendency appears to apply probabilistic methods to
measure this uncertainty.
Let X be a discrete variable with the values {x1, x2, x3, x4, x5, x6}, where
x1 represents the case when a string is a prefix, for example, anistoric;
x2 – a part of a prefix, for example, antevorbitor;
x3 – a root, for example, anişor;
x4 – a part of a root, for example, uman;
x5 – a suffix, for example, american;
x6 – a part of a suffix, for example, comediant.
Table 3. Prefixes with the lowest entropy

String  Number of  Number of parts  Number of  Number of parts  Number of  Number of parts  Entropy
        prefixes   of prefixes      roots      of roots         suffixes   of suffixes
super   6          0                0          0                0          0                0.0000
ultra   12         0                0          0                0          0                0.0000
arhe    1          0                0          0                0          0                0.0000
com     1          0                0          63               0          0                0.1161
auto    87         0                0          6                0          0                0.3451
Let p(xk) be the probability of the event xk. Of course, not all the affixes can
correspond to all the values of the discrete variable X. Therefore, consider
the set:

Y = {yk | p(xk) ≠ 0, 1 ≤ k ≤ 6}.

Let H(Y) be the entropy of Y:

H(Y) = − Σ_{i=1}^{6} p(yi) × log2 p(yi)
For example, the prefix răs- corresponds to three values of the discrete
variable X, so the set is Y = {y1, y2, y4}. The respective probabilities are
p(y1) = 0.34313, p(y2) = 0.01960, p(y4) = 0.63725. Then

H(Y) = −(p(y1) × log2 p(y1) + p(y2) × log2 p(y2) + p(y4) × log2 p(y4)) = 1.05495
It is worth mentioning that higher uncertainty means a larger number of cases,
so the entropy increases; otherwise it decreases. For example, the prefix ultra-
is used only as a prefix, which is why its entropy is 0 (Table 3). In the case
of the prefix an- the entropy is 1.2322 (Table 4), because it can correspond to
all six values of the discrete variable X.
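The entropy computation itself is straightforward; the sketch below reproduces the value for an- from its occurrence counts in Table 4 (the function name is illustrative):

```python
import math

# Sketch: entropy of a string's role distribution (prefix, part of
# prefix, root, part of root, suffix, part of suffix), as defined in
# the text. Zero counts are dropped, mirroring the set Y.

def role_entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Occurrence counts for the string 'an' (Table 4): prefix, part of
# prefix, root, part of root, suffix, part of suffix.
an_counts = [1, 105, 1, 1058, 74, 200]
print(round(role_entropy(an_counts), 4))   # close to the 1.2322 reported
```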
Table 4. Prefixes with the highest entropy

String  Number of  Number of parts  Number of  Number of parts  Number of  Number of parts  Entropy
        prefixes   of prefixes      roots      of roots         suffixes   of suffixes
re      281        118              0          4708             3218       975              1.6005
intra   2          0                2          1                0          0                1.5219
an      1          105              1          1058             74         200              1.2322
de      71         209              0          547              0          0                1.2000
auto    571        0                0          426              0          39               1.1790

6 The Analysis of the Consonant and Vowel Alternations
The problem of derivation consists not only in the detection of derivational
rules for separate affixes, but also in the examination of the concrete
consonant and vowel alternations for the affixes. Notably, not all affixes
require vowel and consonant alternations in the process of derivation. In order
to determine which affixes show alternations in derivation, the digital variant
of the derivatives dictionary was studied (Table 5).

There are more derivatives without alternations than with them, especially in
the case of prefix derivation. The absence of vowel and consonant alternations
is observed with the following most frequent prefixes: ne-, re-, pre-, anti-,
auto-, supra-, and de-. The prefixes ı̂n-, des-, sub-, dez-, and ı̂m- do use
vowel and consonant alternations in the process of derivation (Table 6).
Table 5. Statistics about the derivatives

Affix              Number of derivatives  Number of derivatives
                   without alternations   with alternations
prefix             1134                   224
suffix             6809                   6381
prefix and suffix  632                    191
total              8575                   6796

There are several types of vowel and consonant alternations in the process of
derivation with the prefixes:
– the addition of a letter to the end of the root, for example, şurub →
ı̂nşuruba, bold → ı̂mboldi, plin → ı̂mplini ;
– the changing of the final letter in the root, for example, lı̂nă → dezlı̂na,
purpură → ı̂mpurpura, puşcă → ı̂mpuşca;
– the changing of the final letter in the root and the addition of the letter
to the end of the root, for example, avut → ı̂navuţi, compus → descompune,
păros → ı̂mpăroşa, blı̂nd → ı̂mblı̂nzi ;
– the changing of the vowels in the root, for example, cataramă → ı̂ncătărăma,
primăvară → desprimăvăra, rădăcină → dezrădăcina, platoşă → ı̂mplătoşi ;
– the changing in the prefix, for example, şoca → deşoca, pat → supat;
– the avoiding of the double consonant, for example, spinteca → despinteca,
braţ → subraţ.
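A first automatic pass over these cases can simply test whether the root survives unchanged inside the derivative, flagging everything else as an alternation for closer classification; this is a deliberately crude sketch, and the function name is an assumption:

```python
# Sketch: flag derivatives whose root does not appear unchanged after
# stripping the affixes, i.e. candidates for vowel/consonant
# alternation. The examples follow the lists in the text.

def has_alternation(root, derivative, prefix='', suffix=''):
    """True if stripping the given affixes does not leave the root."""
    stem = derivative
    if prefix and stem.startswith(prefix):
        stem = stem[len(prefix):]
    if suffix and stem.endswith(suffix):
        stem = stem[:-len(suffix)]
    return stem != root

print(has_alternation('scris', 'nescris', prefix='ne'))
# False: the root is unchanged
print(has_alternation('spinteca', 'despinteca', prefix='des'))
# True: the double consonant is avoided (des + spinteca -> despinteca)
```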
Table 6. Prefixes with vowel and consonant alternations

Prefix  Number of derivatives  Number of derivatives  Total number
        without alternations   with alternations      of derivatives
ı̂n-     33                     115                    148
des-    57                     6                      63
sub-    81                     2                      83
dez-    32                     2                      34
ı̂m-     3                      38                     41
total   206                    163                    369
In most cases, in derivation with the prefixes ı̂n- and ı̂m- the alternations
occur because the part of speech changes, especially from adjectives and nouns
to verbs.

Derivation with suffixes attests both situations: there are cases where
derivation is made with a minimum number of alternations (Table 7) and cases
with a maximum of changes in the root (Table 8). The possible vowel and
consonant alternations are so varied that it is difficult to describe them all
in a chapter, but it is possible, at least, to classify them:
Table 7. Suffixes with the most derivatives formed without alternations

Suffix  Number of derivatives  Number of derivatives  Total number
        without alternations   with alternations      of derivatives
-re     2782                   11                     2793
-tor    561                    43                     605
-toare  478                    35                     522
-iza    221                    23                     244
-tură   205                    27                     232
– the changing of the final letter in the root, for example, alinia → aliniere,
aşchia → aşchietor, cumpăra → cumpărător, curăţi → curăţător, delăsa →
delăsător, depune → depunător, faianţă → faianţator, fărı̂ma → fărı̂mător,
ı̂mpinge → ı̂mpingător, transcrie → transcriitor ;
– the removing of the last vowel in the root, for example, răşchia → răşchitor,
acri → acreală, aduna → adunătoare;
– the removing of the final vowel in the root and the changing of the letter before the last one, for example, zeflemea → zeflemitor, ascunde → ascunzătoare;
Table 8. Suffixes with the fewest derivatives formed without alternations

Suffix   Number of derivatives  Number of derivatives  Total number
         without alternations   with alternations      of derivatives
-eală    19                     495                    514
-ătoare  5                      281                    286
-ător    5                      279                    284
-ar      110                    249                    359
-ie      166                    234                    400
– the changing of two final letters in the root, for example, bea → băutor,
bea → băutoare, ı̂ncăpea → ı̂ncăpătoare;
– the changing of the first letter of the suffix, for example, bı̂ntui → bı̂ntuială,
murui → muruială;
– the removing of the final letter in the root with the vowel changing, for
example, cană → căneală, atrage → atrăgătoare, bate → bătătoare;
– the removing of the two letters in the suffix, for example, căpia → căpială,
ı̂ncleia → ı̂ncleială;
– the removing of the final letter and that of the vowel inside the root, for
example, coace → cocătoare;
– the removing of the final vowel and the changing of the final consonant,
for example, descreşte → descrescătoare, ı̂nchide → ı̂nchizătoare, ı̂ncrede →
ı̂ncrezătoare, promite → promiţătoare;
– the changing in the root, for example, rı̂de → rı̂zătoare, recunoaşte →
recunoscătoare, roade → rozătoare, sta → stătătoare, şedea → şezătoare, vedea
→ văzătoare, şti → ştiutor.
7 Lexical Families
The set of derivatives with a common root and meaning represents a lexical
family. The second part of this definition is very important, because there is a
tendency to group words into lexical families by a common root alone. On that
basis the words alb, albastru, and albanez could be considered a lexical family,
ignoring the fact that a lexical family consists of words that share the
appropriate meaning. In addition, a lexical family consists of words which
differ in their grammatical categories while beginning with the same root. The
word base can have the same form of the root for all words in the family, for
example: actor, actoraş, actoricesc, actorie.

When the affixes of a derivative are trimmed, only the root remains. The root
can undergo small alternations: ţară (ţar-/ţăr-), but it is never replaced.
During derivation it was observed that not all lexical units derive directly
from the root; some of them derive from previous derivatives, for example,

[cruce]noun → [cruciş]adj,adv → [ı̂ncrucişa]verb → [ı̂ncrucişator]adj,noun.

A lexical family also contains the compounds which include the word base of the
respective family. So the compound word cal-de-mare is in the lexical family of
the word cal and also in that of the word mare. But it will not be in the
lexical family of the word călare, which is the word base of another family:
călari, călarie, călarime, călaresc, călaraş.
In the electronic variant of the derivatives dictionary the most numerous
lexical families are of the root bun (32 derivatives with the prefixes: stră-, ı̂m-,
ne- and ı̂n-; and the suffixes: -el, -eţe, -ătate, -ic, -uţă, -icea, -icel, -icică, -işoară,
-işor, -iţă, -uţ, -re, -i, -ariţă, -atic, -eală, -ească, -esc, -eşte, -toare, -tor, and
-ie); alb (25 derivatives with the prefix: ı̂n-; and with the suffixes: -eaţă, -ei,
-eţ, -eală, -icioasă, -icios, -iliţă, -re, -i, -ime, -ineaţă, -ineţ, -ior, -işor, -itoare,
-itor, -ie, -itură, -iţă, -ui, -uie, -uleţ, -uş, and -uţ); şarpe (22 derivatives without
any prefixes and with the suffixes: -ar, -aş, -ărie, -ească, -esc, -eşte, -işor,
-oaică, -oaie, -oi, -ui, -eală, -re, -toare, -tor, -tură, -urel, and -uşor); roată (22
derivatives without any prefixes and with the suffixes: -ar, -easă, -ie, -it, -iţă,
-aş, -at, -ată, -i, -cică, -re, -tor, -toare, -tură, -ilă, -at, -iş, -ocoală, and -ocol);
om (20 derivatives with the prefixes: ne- and supra-; and with the suffixes: -ime,
-oasă, -os, -oi, -uleţ, -uşor, -ească, -esc, -eşte, and -ie). In the same context
there are over 3000 roots with a single derivative.
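Given parsed dictionary entries, grouping derivatives into root-based candidate families is a simple aggregation; the sketch below is illustrative only, since real lexical families additionally require the shared-meaning check described above, which string grouping alone cannot provide:

```python
from collections import defaultdict

# Sketch: group parsed derivatives by root as *candidate* lexical
# families. Entries are (derivative, prefixes, root, suffixes) tuples;
# the meaning-based filtering described in the text is not modeled.

def candidate_families(entries):
    families = defaultdict(list)
    for word, prefixes, root, suffixes in entries:
        families[root].append(word)
    return dict(families)

entries = [
    ('actoraş', [],     'actor', ['aş']),
    ('actorie', [],     'actor', ['ie']),
    ('nescris', ['ne'], 'scris', []),
]
print(candidate_families(entries))
# {'actor': ['actoraş', 'actorie'], 'scris': ['nescris']}
```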
It was found that 7 prefixes, namely a-, arhe-, para-, dis-, i-, im- and
ı̂ntru-, are not attached directly to roots, but only to stems. There are also
several suffixes that are not attached directly to roots.
8 The Process of Establishing Derivative Groups

8.1 Derivative Groups with Prefixes
The derivatives were extracted separately for every affix and compared with the
flection groups. Using the flection groups from [9, 10] and the derivatives from
the morphological dictionary [8], it was attempted to detect the derivation
groups; the morphological dictionary [8] already contains many derivatives.

In order to acquire information about which affixes can be attached to the roots
from the dictionary [9], special programs were developed. The derivatives with
the following prefixes proved to be the most numerous, in descending order: ne-,
re-, ı̂n-, des-, pre-, anti-, auto-, sub-, dez-, supra-, de-, and ı̂m-. The rest
of the prefixes have an insignificant number of derivatives. These 12 prefixes
out of 42 account for 88.2 per cent of all derivatives with prefixes registered
in this electronic dictionary.
Firstly, the flection groups of the roots corresponding to derivatives with
prefixes and no suffixes were set up. Nevertheless, there can be a large number
of flection groups for a single prefix, for example,

nescris=+ne.scrisN29
nescris=+ne.scrisN1
nescris=+ne.scrisN24
nescris=+ne.scrisA4
nescris=+ne.scrisM6

For every prefix the most frequent flection group of the derivative roots was
determined. Roots with the verbal flection group V201 attach mostly to the
following prefixes: re-, des-, auto-, dez-, de-; masculine noun flection group
M1: anti-, sub-, ı̂m-; feminine noun flection group F1: ı̂n-, supra-; verbal
flection group V401: pre-; adjectival flection group A2: ne-.
Secondly, the flection groups of the roots corresponding to derivatives with
prefixes that were first derived with suffixes were extracted. In this case
there may be roots with the same flection group that form derivatives with
different suffixes, for example,

subordonator=+sub.ordonaV201-tor
subordonatoare=+sub.ordonaV201-toare

By the same procedure as described above, the most frequent flection groups of
the derivative roots were determined: verbal flection group V201 occurs mostly
with the following prefixes: pre-, dez-, auto-, sub-, de-, supra-; verbal
flection group V401: ne-, des-; masculine noun flection group M1: ı̂m-;
adjectival flection group A1: anti-; neuter noun flection group N24: ı̂n-.
During the investigation it was found that the following prefixes attach
especially to nouns: ne-, ı̂n-, anti-, sub-, supra- and ı̂m-. Another group of
prefixes attaches in most cases to verbs: de-, dez-, auto-, pre-, des- and re-.
8.2 Derivative Groups with Suffixes
As for the prefixes, in order to decide to which roots a given suffix from the
morphological dictionary can be attached, special programs were developed which
extracted the derivatives separately for every suffix; the results were then
compared with the flection groups.
The suffixes with the most numerous derivatives proved to be, in descending
order, the following: -re, -tor, -toare, -eală, -ie, -ătoare, -iza, -oasă, -ar,
-ător, -ească, -os, -aş, -esc, -tură, -iţă, -ist, -uţă, -el, -i, -ui, -ătură,
-eşte, -ism, -a, -ărie, -ică, -ime, -itate, -ioară, -işor, -işoară, -ic, -uleţ,
-că, -ean, -iş, -easă, -bil, -uţ, -at, -oaică, -uşor, -an, -oi, -uliţă, -iu,
-enie, -istă, -al, and -ea. The rest of the suffixes have an insignificant number
of derivatives. Of the 430 suffixes registered in this electronic dictionary,
these 52 account for 87.7 per cent of the suffix derivatives.
We then retrieved the flection groups of the roots that correspond to derivatives
with suffixes. The words in the dictionary [9] may have several entries for
different flection groups: for example, the verb învîrti belongs to the verbal
flection groups V401 and V305 (the same part of speech), while the word croi
belongs to different parts of speech: the noun flection group N67 and the verbal
flection group V408.
For every suffix, the most frequent flection groups of the derivative roots were
established. Thus, from roots with the masculine noun flection group M1,
derivatives can be generated with the following suffixes: -ie, -ească, -aş, -esc,
-iţă, -el, -i, -eşte, -ism, -ime, -ic, -iş, -oaică, -an, -oi, -iu, -uţ; with the
neutral noun flection group N24: -oasă, -os, -ărie, -işor, -uleţ, -ean, -uţ,
-uşor, -al; with the feminine noun flection group F1: -ar, -uţă, -ică, -işoară,
-uliţă; with the verbal flection group V401: -tor, -toare, -eală, -tora, -enie;
with the verbal flection group V201: -re, -ătoare, -ător, -ătură, -bil; with N1:
-ist, -at, -al (also the neutral noun flection group N24); with the adjectival
flection group A1: -iza, -itate; with the feminine noun flection group F135:
-ioară; with the masculine noun flection group M20: -că; with the neutral noun
flection group N11: -istă; and with the feminine noun flection group F43: -ea.
After examining the most frequent suffixes, it was found that the following
suffixes attach mostly to nouns: -ie, -iza, -oasă, -ar, -ească, -os, -aş, -esc,
-tură, -iţă, -ist, -uţă, -el, -i, -ui, -eşte, -ism, -a, -ărie, -ică, -ioară,
-işor, -işoară, -ic, -uleţ, -că, -ean, -iş, -easă, -uş, -at, -oaică, -uşor, -an,
-oi, -uliţă, -iu, -enie, -istă, -al, -ea, and -uţ. Another group of suffixes,
namely -re, -tor, -toare, -eală, -ătoare, -ător, -tură, -ătură, and -bil,
attaches mostly to verbs. Only two suffixes attach in most cases to adjectives:
-ime and -itate.
Mircea P.
9 Conclusions
Constructing a generator of derivatives requires a detailed study of the affixes
and of the features of the derivatives. The validation of the new derivatives is
one of the steps of automatic derivation that raises many questions.
When it is difficult to set up criteria for validating words by means of the
Internet, it is important to use the digital variant of the derivatives
dictionary, which permits establishing the morphemes of the derivatives together
with their type (prefix, root, or suffix). This proved useful for detecting the
whole variety of consonant and vowel alternations that occur in derivation with
prefixes and suffixes.
Another notion connected with lexical derivation is the lexical family, which
offers the possibility of acquiring the sets of affixes that can attach to
concrete roots.
Grouping the derivatives by flection classes makes it possible to find the
morphological characteristics of the words. The next step of this research should
answer whether the most numerous flection groups for affixes can serve in the
automatic generation of new valid derivatives.
References
1. Tufiş, D., Barbu, A.M., Revealing Translator’s Knowledge: Statistical Methods in
Constructing Practical Translation Lexicons for Language and Speech Processing,
International Journal of Speech Technology 5, 2002, pp. 199-209.
2. Hristea, T., Sinteze de limba română, Bucureşti, 1984, pp. 66-99.
3. Graur, Al., Avram, M., Formarea cuvintelor ı̂n limba română, vol. II Editura
Academiei, Bucureşti, 1978.
4. Gramatica limbii române. Cuvı̂ntul, Editura Academiei Române, Bucureşti, 2005.
5. Hausser, R., Foundations of Computational Linguistics. Human-Computer Communication in Natural Language, 2nd Edition, Revised and Extended, Springer,
2001, 580 p.
6. Petic, M., Specific features in automatic processing of formations with prefixes,
Computer Science Journal of Moldova, 4 1(7), 2008, pp. 209-222.
7. Cojocaru, S., Boian, E., Petic, M., Stages in automatic derivational morphology
processing, International Conference on Knowledge Engineering, Principles and
Techniques, KEPT2009, Selected Papers, Cluj-Napoca, July 24, 2009, pp. 97-104.
8. Constantinescu, S., Dicţionar de cuvinte derivate. Editura Herra, Bucureşti, 2008.
9. Lombard, A., Gâdei, C., Dictionnaire morphologique de la langue roumaine, Editura Academiei, Bucureşti, 1981, 232 p.
10. Cojocaru, S., Evstunin, M., Ufnarovski, V., Detecting and correcting spelling
errors for the Roumanian language, Computer Science Journal of Moldova, vol. 1,
no. 1(1), 1993, pp. 3-21.
POS-tagging for Oral Texts
with CRF and Category Decomposition
Isabelle Tellier¹, Iris Eshkol², Samer Taalab¹, and Jean-Philippe Prost¹,³

¹ LIFO, Université d’Orléans, France
² LLL, Université d’Orléans, France
³ INRIA Lille - Nord Europe, France
{name}.{lastname}@univ-orleans.fr
Abstract. The ESLO (Enquête sociolinguistique d’Orléans, i.e. Sociolinguistic Survey of Orléans) campaign gathered a large oral corpus,
which was later transcribed into a text format. The purpose of this work
is to assign morpho-syntactic labels to each unit of this corpus. To this
end, we first studied the specificities of the labels required for oral data,
and their various possible levels of description. This led to a new original hierarchical structure of labels. Then, since our new set of labels
was different from any of those of existing taggers, which are usually
not fit for oral data, we have built a new labelling tool using a Machine
Learning approach. As a starting point, we used data labelled by Cordial
and corrected by hand. We used CRF (Conditional Random Fields), to
try to take the best possible advantage of the linguistic knowledge used
to define the set of labels. We measure accuracies between 85 and 90 per cent,
depending on the parameters.
1 Introduction
Morpho-syntactic tagging is essential to text analysis, as a preliminary step to
any high level processing. Different reliable taggers exist for French, but they
have been designed for handling written texts and are, therefore, not suited
to the specificities of less “normalised” language. Here, we are interested in
the ESLO4 corpus, which comes from records of spoken language. ESLO thus
presents specific features, which are not well accounted for by standard taggers.
Several options are possible for labelling transcribed spoken language: one can
take a tagger initially developed for written texts, provided new formal rules
let us adapt it to take disfluencies into account (Dister, 2007 [1]); or one can
adapt the transcribed corpus to the requirements of written language (Valli and
Veronis, 1999 [2]). We have chosen a different methodological approach. Starting
from the output of a tagger for written language, we have first defined a new
tag set, which meets our needs; then, we have annotated a reference corpus with
those new tags and used it to train a Machine Learning system.
For that kind of annotation task, the state-of-the-art technology for supervised example-based Machine Learning are the Conditional Random Fields
4 Enquête sociolinguistique d’Orléans, i.e. Sociolinguistic Survey of Orléans
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 79-90
Received 27/11/09
Accepted 16/01/10
Final version 09/03/10
(CRF). CRF is a family of recently introduced statistical models (Lafferty et al.,
2001 [3], Sutton and McCallum, 2007 [4]), which have already proven their efficiency in many natural language engineering tasks (McCallum and Li, 2003 [5],
Pinto et al., 2003 [6], Altun et al., 2003 [7], Sha and Pereira, 2003 [8]). Our
experiments make use of CRF++5, a free open-source library developed by Taku
Kudo. We proceed with the testing of various strategies for decomposing the
labels into a hierarchy of simpler sub-labels. The approach is original, in that it
eases the learning process, while optimising the use of the linguistic knowledge
that ruled the choice of initial labels. In that, we follow the procedure suggested
by Jousse (2007 [9]), and Zidouni (2009 [10]).
In the following, Sect. 2 is dedicated to the presentation of our corpus and
of the tagging process, focusing on the labelling problems raised by spoken language. After justifying the choice of a new tag set, we explain the procedure
we have adopted for acquiring a sample set of correctly labelled data. In Sect.
3 we present the experiments we have carried out with CRF++, to learn a
morpho-syntactic tagger from this sample. We show how the performance of the
learning process can be influenced by different possible decompositions of the
target labels into simpler sub-labels.
2 Oral Corpus and its Tagging
This section deals with the morpho-syntactic tagging of an oral corpus, and
the difficulties that it causes to the tagger Cordial. The specificities of spoken
language lead us to propose a new set of more suitable tags.
The morpho-syntactic Tagging of an Oral Corpus. The purpose of tagging is to assign to each word in a corpus a tag containing morpho-syntactic
information about that word. This process can be coupled with stemming, to
reduce the occurrence of a given word to its base form or lemma. The main
difficulty of morpho-syntactic tagging is the ambiguity of words belonging to
different lexical categories (e.g. the form portes in French is either a plural
noun (doors) or the second person singular of the present indicative or
subjunctive of the verb porter (to carry)): the tagger must assign the correct
tag in a given context.
Taggers usually also have problems with words which are not recognized by their
dictionary: misspelled words, proper nouns, neologisms, compound words, . . .
Tagging an oral corpus faces even more problems. Firstly, the transcriptions are
usually not punctuated, in order to avoid anticipating the interpretation
(Blanche-Benveniste and Jeanjean, 1987 [11]). Punctuation marks such as the comma
or the full stop, as well as casing, are purely typographical marks. The notion
of sentence, a mostly graphical one, was quickly abandoned by linguists
interested in speech.
Studies on spoken language have also identified phenomena which are specific
to speech, called disfluence: repeats, self-corrections, truncations, etc. Following
(Blanche-Benveniste, 2005 [12]), we believe that all these phenomena should be
5 http://crfpp.sourceforge.net/
included in the analysis of language even if it raises processing issues. Elements,
like hein, bon, bien, quoi, voilà, comment dire, (eh, well, what, how to say, . . . )
with a high frequency of occurrence in oral corpora, and without punctuation,
can be ambiguous, because they can sometimes also be nouns or adjectives (as is
the case for bon and bien, meaning good). The currently existing tagging tools
are not suited to oral data, which is why this task is so difficult.
The Tagging by Cordial and its Limits. The Socio-Linguistic Survey in Orleans (Enquête Socio-Linguistique à Orléans, ESLO) is a large oral corpus of 300
hours of speech (approximately 4,500,000 words) which represents a collection of
200 interviews recorded in a professional or private context. The investigation
was carried out towards the end of the Sixties by British academics for didactic
purposes. The corpus is made up of 105 XML files generated by Transcriber and
converted to text format. Each file corresponds to a recording situation. Significant
conventions of transcription are:
– the segmentation was made according to an intuitive unit of the type “breathing group” and was performed by a human transcriber;
– the turn-taking was defined by speaker changes;
– no punctuation except exclamation and question marks;
– no uppercase letters except for named entities;
– word truncation is indicated by the dash (word-);
– the transcription is orthographic.
The transcribed data was tagged by Cordial in order to have a reference corpus.
This software was chosen for its reliability. As of today, it is one of the best
taggers for French written language, with a wide range of tags, rich in linguistic
information. The result of tagging is presented in a 3-column format: word,
lemma and lexical category (POS), as exemplified in Table 1.

Word                 Lemma     POS
comment (how)        comment   ADV
vous (you)           vous      PPER2P
faites (make/do)     faire     VINDP2P
une (one/a)          un        DETIFS
omelette (omelette)  omelette  NCFS

Table 1. Example of tagging by Cordial.

Cordial uses about
200 tags encoding morphological information of different kinds such as gender,
number or invariability for nouns and adjectives; distinction in modality, tense
and person for verbs, and even the presence of aspirated ‘h’ at the beginning of
words. However, the analysis of Cordial’s outcome revealed a number of errors.
The first group of errors includes “classical” errors such as the following ones.
Ambiguity: et vous êtes pour ou contre (and are you for or against) ⇒ {contre,
contrer, VINDP3S} instead of {contre, contre, PREP3}.
Proper nouns: les différences qu’il y a entre les lycées les CEG et les CES (the
differences that exist among [types of secondary and high schools]) ⇒ {CEG,
Ceg, NPMS} instead of {CEG, CEG, NPPIG4} and {CES, ce, DETDEM}
instead of {CES, CES, NPPIG}.
Locutions: en effet (indeed) ⇒ analysed in two separate lines, as opposed to a
compound: {en, en, PREP}, then {effet, effet, NCMS} while it is an adverb.
We have also found errors which are specific to oral data, such as:
Truncation: by convention in ESLO, word truncation is indicated by the dash,
which raises a problem for tagging by Cordial:
on fait une ou deux réclam- réclamations (we make one or two complaints) ⇒
{réclam- réclamations, réclamréclamations, NCMIN5} instead of analysing
the sequence in two separate units: {réclam-, reclam-, NCI6 } and {réclamations,
réclamation, NCFP}
Interjection: Cordial does not recognize all the possible interjections present
in oral corpora:
alors ben écoutez madame (so uh listen madam) ⇒ {ben, ben, NCMIN}.
This phenomenon also presents a problem of ambiguity since, according to
Dister (2007 [1, p. 350]):
any form can potentially become an interjection. One then observes a
grammatical re-classification (. . . ): the grammatical class to which a
word belongs may, in speech, change.
j’ai quand même des attaches euh ben de la campagne qui est proche quoi
(PRI7) (I still have ties [euh ben] to the nearby countryside [quoi])
Repeat and self-correction: je crois que le ({le, le, PPER3S} instead of {le,
le, DETDMS}) le ({le, le, DETDMS}) les saisons (I think that the the
seasons)
Note, as well, a number of errors such as typos or spelling mistakes made by the
human transcribers. The transcription was not spell-checked.
New Choice of Tags. In order to adapt the tagging to our needs, we propose
a number of modifications to the tag set. Those changes are motivated on the
one hand by the reduction of the number of tags without loss of the necessary
linguistic information, and on the other hand, by the need to adapt the tags
to spoken language and to the conventions adopted for the transcription of our
corpus. We present here a (non-exhaustive) list of modifications.
– New tags were introduced, such as MI (unknown word) for cases of truncation, and PRES (announcer) for phrases such as il y a, c’est, voilà (there is,
this is, there it is), both very frequent in oral data.
6 Common noun invariable in number
– A few tags, which are too detailed in Cordial, were simplified. For example,
the set of tags marking the invariability of adjectives or nouns (masculine
invariant in number, feminine invariant in number, singular invariant in gender, plural invariant in number and gender) were replaced by a single tag
(invariable). The tags marking the possibility for the word to begin with an
aspirated ‘h’ were removed.
– In order to make the system more uniform, some tags were enriched. For example, indications about the gender and number were added to the demonstrative and possessive determiners for coherence purpose with other types,
such as definite and indefinite determiners.
The morpho-syntactic tags often contain information of different kinds. They always mark information about the Part-Of-Speech. That basic information seems
to be the easiest to acquire from dictionary lookup except, of course, in the case
of lexical ambiguity. The tags generally include additional information from different linguistic dimensions:
morphological: concerns the structure of a word, such as its gender and number,
or the invariability of nouns, adjectives, pronouns and some determiners;
syntactic: describes the function of words in the sentence and how they relate
to each other, e.g. coordination and subordination for conjunctions;
semantic: relates to the description of the word’s meaning, such as the
possessive, demonstrative, definite, indefinite or interrogative feature for
the determiners.
In order to account for that extra information, we propose to structure the tags
in 3 levels, called respectively L0 (level of POS tags), L1 (morphological level)
and L2 (syntactic and semantic level). A sample of that hierarchical structure is
illustrated in Fig. 1.

L0: DET
L1: DETMS, DETFS, ...
L2: DETMSDEF, DETMSIND, DETMSDEM, ...

L0: N
L1: NMS, NMP, NFS, NFP
L2: NMS, NMP, NFS, NFP

L0: PREP
L1: PREP
L2: PREP

Fig. 1. Sample of the tags hierarchical structure.

As shown in Fig. 1, some tags:
– remain the same on all 3 levels, e.g. adverbs, announcer, prepositions, . . . ;
– only change on level 2, such as nouns, adjectives, verbs;
– change on every level, including new information such as pronouns and determiners.
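To make the three levels concrete, projecting a full L2 tag onto the coarser levels can be sketched as below; the mapping table is a small hypothetical extract built from Fig. 1, not the full tag set, and the function name is ours:

```python
# Hypothetical extract of the Fig. 1 hierarchy: full L2 tag -> (L0, L1).
HIERARCHY = {
    "DETMSDEF": ("DET", "DETMS"),
    "DETMSIND": ("DET", "DETMS"),
    "DETMSDEM": ("DET", "DETMS"),
    "NFS":      ("N", "NFS"),
    "PREP":     ("PREP", "PREP"),
}

def project(tag_l2, level):
    """Project a full (L2) tag onto level 0, 1 or 2 of the hierarchy."""
    l0, l1 = HIERARCHY[tag_l2]
    return (l0, l1, tag_l2)[level]

print(project("DETMSDEF", 0))  # DET
print(project("PREP", 2))      # PREP
```

Tags like PREP illustrate the first bullet above: they project to themselves on every level.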
In addition to that hierarchical structure, other types of linguistic knowledge
can be taken into account during tagging. According to inflectional morphology,
a word is made up of a root and a sequence of letters which often carry
morphological information: in French, endings such as -ait, -ais, -e, -es
indicate the tense, gender, number, etc. In inflectional morphology, those
endings are called
grammatical morphemes. When considering the root as the part shared by all
the forms of a word, it is possible to extract these final sequences from the surface form in order to determine the morphological part of the tag which must be
associated with this word. That linguistic knowledge can be exploited in order to
improve the performance of a Machine Learning system, as we discuss in the next
section. The reference corpus contains 18,424 words and 1,723 utterances.
This data was first tagged by Cordial, and then corrected semi-automatically,
in order to make it meet our new tagging conventions. The hand processing was
made by linguistic students as part of a 3-month internship.
3 Experiments
We now have a reference corpus, whose labelling was manually corrected and is
considered as (nearly) perfect. It is, thus, possible to use it for training a Machine
Learning system. As far as learning a tagger is concerned, the best performing
statistical model is the one of Conditional Random Fields (CRF) (Lafferty et
al., 2001 [3], Sutton and McCallum, 2007 [4]). We chose to work with it. In this
section, we first briefly describe the fundamental properties of CRF, then
present the experimental process, and finally we detail the results. Our goal is
to maximise the use of the linguistic knowledge which guided the definition of
our tag set, in order to improve the quality of the learned labels. In particular,
we want to determine whether learning the full labels (i.e., those containing
all the hierarchical levels of information) could be improved by a sequence of
intermediate learning steps involving less information. Note that we do not rely
on any dictionary, which would enumerate all the possible labels for a text unit.
CRF and CRF++. CRF are a family of statistical models, which associate
an observation x with an annotation y using a set of labelled training pairs
(x, y). In our case, each x coincides with a sequence of words, possibly enriched
with additional information (e.g., if the words’ lemmas are available, x becomes
a sequence of pairs (word, lemma)), and y is the sequence of corresponding
morpho-syntactic labels. Both x and y are decomposed into random variables.
The dependencies among the variables Yi are represented in an undirected graph.
The probability p(y|x) of an annotation y, knowing the observation x is:
p(y|x) = \frac{1}{Z(x)} \prod_{c \in C} \psi_c(y_c, x), \quad \text{with} \quad Z(x) = \sum_y \prod_{c \in C} \psi_c(y_c, x)
where C is the set of cliques (i.e. completely connected subgraphs) of the graph,
y_c is the configuration taken by the set of random variables Y belonging to the
clique c, and Z(x) is a normalization factor. The potential functions \psi_c(y_c, x)
take the following form:
\psi_c(y_c, x) = \exp\left( \sum_k \lambda_k f_k(y_c, x, c) \right)
The functions f_k are called features, each one being weighted by a parameter
\lambda_k. The set of features must be provided to the system, whose learning purpose is to
assign the most likely value to each \lambda_k according to the available labelled data.
Most of the time, the function values are 0 or 1 (but they could also be real-valued).
In linear CRF, which are well-suited to sequence annotation, the graph simply
links together the successive variables associated with the sequence elements. The
maximal cliques of that kind of graph are, thus, the successive pairs (Yi , Yi+1 ).
That model is potentially richer than the HMM one, and usually gives better
results.
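The formulas above can be made concrete with a toy linear-chain CRF. The two-label tag set, the single feature, and its weight are invented for illustration; they are not taken from the paper's model:

```python
import math
from itertools import product

# Toy linear-chain CRF: cliques are the successive pairs (Y_i, Y_{i+1}).
LABELS = ["ADV", "V"]
LAMBDA = 2.0  # weight of the single, hand-picked feature below

def feature(y_prev, y_cur, x, i):
    # Illustrative binary feature: current word ends in "ez" and is tagged V.
    return 1.0 if x[i].endswith("ez") and y_cur == "V" else 0.0

def psi(y_prev, y_cur, x, i):
    return math.exp(LAMBDA * feature(y_prev, y_cur, x, i))

def score(y, x):
    # Product of the potentials over all maximal cliques (i-1, i).
    s = 1.0
    for i in range(1, len(x)):
        s *= psi(y[i - 1], y[i], x, i)
    return s

def prob(y, x):
    # p(y|x) = score(y, x) / Z(x), with Z summing over all label sequences.
    z = sum(score(yy, x) for yy in product(LABELS, repeat=len(x)))
    return score(y, x) / z

x = ["alors", "mangez"]
print(round(prob(("ADV", "V"), x), 2))  # 0.44
```

Enumerating all label sequences to compute Z(x) is only feasible for toy inputs; real implementations use the forward-backward algorithm instead.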
CRF++, the software that we are using, is based on that model. Features
can be specified through templates, which are instantiated with the example pairs
(x, y) provided to the program. We kept the default templates provided by the
library; they generate boolean functions using the words located within a
two-word neighborhood around the current position, as exemplified in Ex. 1.
Example 1. In Table 1, the first column corresponds to the observation x, the
third one to the annotation. Hence:
x = “comment vous faites une omelette”,
y = ADV, PPER2P, VINDP2, PPER2P, DETIFS, NCFS.
For a given position i identifying the clique (i, i+1), the template tests the values
of Y s in the clique, and the values of X in position i, i − 2, i − 1, i + 1, i + 2. At
position i = 3 we get the following feature f :
if Yi = VINDP2 and Yi+1 = PPER2P and Xi = ‘faites’ and Xi−2 = ‘comment’
and Xi−1 = ‘vous’ and Xi+1 = ‘vous’ and Xi+2 = ‘une’ then f = 1 else f = 0.
The template also generates simpler functions, where only the positions i, i − 1,
and i + 1 of X are tested, for example. With that example, we see that the
disfluencies are directly taken into account in the model by the fact that they
occur in the set of training examples provided to the learner.
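For reference, default-style CRF++ templates of the kind described above look like the following: unigram features over a two-word window on the word column, plus a bigram feature over successive labels. This is a sketch of the standard template syntax, not the exact file used by the authors:

```
# Unigram features: observation at row offsets -2..+2, column 0 (the word)
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]

# Bigram template: generates features over pairs of successive labels (Y_i, Y_i+1)
B
```

Each `U` line is expanded into one boolean feature per distinct value observed in the training data, which explains the millions of generated features reported below.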
Experimental Framework. For our experiments, the corpus was split into 10
subsets and we performed a 10-fold cross-validation. The features are mainly
built from observations over words. We have also carried out experiments where
the word lemma is supposed to be known. In order to enrich the data even more,
we also relied on inflectional morphology, mentioned in Sect. 2:
1. the distinction between root and rest: the root is the string shared between
the word and the lemma, while the rest is the difference between them;
if word=lemma, by convention we note Rword=Rlemma=x, else word=
Root+Rword and lemma=Root+Rlemma (where the symbol + denotes here
the string concatenation);
2. the word tail: Dn (word) = n last letters of the word; for instance, if word=
‘marchant’ and lemma=‘marcher’ then Root=‘march’, Rword=‘ant’, Rlemma
=‘er’ and D2 (word)=‘nt’.
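Under the conventions just given, the decomposition can be sketched as follows, assuming the root is the longest common prefix of word and lemma (the function name is ours):

```python
import os

def decompose(word, lemma, n=2):
    """Return (Root, Rword, Rlemma, D_n(word)) following the conventions above."""
    if word == lemma:
        root, rword, rlemma = word, "x", "x"  # convention: Rword = Rlemma = x
    else:
        root = os.path.commonprefix([word, lemma])  # shared initial string
        rword, rlemma = word[len(root):], lemma[len(root):]
    return root, rword, rlemma, word[-n:]  # word tail D_n = n last letters

print(decompose("marchant", "marcher"))  # ('march', 'ant', 'er', 'nt')
```

The output reproduces the example above: Root='march', Rword='ant', Rlemma='er', D2(word)='nt'.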
Reference Experiments. The reference experiments consist of learning the
most detailed level (L2) directly. Six different tests were run, which we describe
next. We denote by F eat(args) the features built from (args).
Test I F eat(word, lemma) : about 10,000,000 features produced; F1 = 0.86.
Test II F eat(word, lemma, Rword, Rlemma) ; 11,000,000 features; F1 = 0.88.
Test III If word=lemma we use D2 (word) and D3 (lemma), hence:
F eat(word, lemma, Rword|D2 (word),Rlemma|D3 (lemma)) ; 20,000,000 feat.; F1 = 0.82.
Now, if the lemmas are unknown, we obtain the following:
Test IIIbis F eat(word, D3 (word)) ; 8,000,000 features; F1 = 0.87;
Test IV Similar to III, but with D3 everywhere:
F eat(word, lemma, Rword|D3 (word), Rlemma|D3 (lemma)) ; 20,000,000 feat.; F1 = 0.89.
Again, without relying on lemmas, we get:
Test IVbis F eat(word, D3 (word), D2 (word), D1 (word)) ; 20,000,000 feat.; F1 = 0.88.
As expected, the richer the features, the better the results. Knowing the lemmas,
for instance, increases the accuracy by 2 points in average. The downside is the
increased cost timewise for the learning phase, caused by a much larger number
of generated features.
Cascade Learning. In order to exploit the knowledge contained in the labels
— i.e., mainly their organisation in a 3-level hierarchical structure — we first
learned each level independently, using the same test set (Test I to IVbis) as
previously. The scores obtained are presented in Table 2.

Level (num. of tags)  Test I  Test II  Test III  Test IV  Test IIIbis  Test IVbis
L0 (16)               0.93    0.93     0.94      0.94     0.92         0.93
L1 (72)               0.86    0.89     0.90      0.90     0.88         0.89
L2 (107)              0.86    0.88     0.82      0.89     0.87         0.88

Table 2. Accuracy measures when learning separately each of the hierarchical
levels of labels.

We observe that the coarser a level is, in terms of how detailed its information
is, the easier it is to learn. This can be explained by the reduced number of
labels to be learned.
Meanwhile, since each level of labels in the hierarchy depends on the previous
one, one can hypothesise that using the results from the levels Lj when learning
the level Li (with j < i) may improve the results at Li. The purpose of the next set
of experiments is to test that hypothesis: we say that the different hierarchical
levels are learned in cascade, as in Jousse (2007 [9]) and Zidouni (2009 [10]). In
the previous set of experiments, the best scoring tests are Test III and IV; as
an attempt to improve those tests, we have designed Test V and VI as follows.
We denote by CRF(Li|feat(args)) the learning of level Li knowing the features
based on args.
Test V (derived from Test III) word, lemma and D3 (lemma) are used to generate the features for learning Level L0; then the result RL0 is used, with the
same data, to learn Level L1; and so forth. The successive learning phases are
given next.
– CRF(L0|feat(word, lemma, D3(lemma))) → ResL0
– CRF(L1|feat(word, lemma, D3(lemma), ResL0)) → ResL01
– CRF(L2|feat(word, lemma, D3(lemma), ResL0, ResL01)) → ResL012
Test VI (derived from Test IV) This time, the initial features are generated with
word, Rword, Rlemma, D3(word), D3(lemma). The successive learning phases
are the following:
– CRF(L0|feat(word, Rword, Rlemma, D3(word), D3(lemma))) → ResL0
– CRF(L1|feat(word, Rword, Rlemma, D3(word), D3(lemma), ResL0)) → ResL01
– CRF(L2|feat(word, Rword, Rlemma, D3(word), D3(lemma), ResL0, ResL01)) →
ResL012
Level  Test III  Test IV  Test V  Test VI
L0     0.94      0.94     0.94    0.94
L1     0.90      0.90     0.88    0.90
L2     0.82      0.89     0.87    0.89

Fig. 2. Accuracy measures for Test III to VI.
As shown in Fig. 2, Test V and VI give good results, but not really better than
the initial Test III and IV. Therefore, unfortunately, cascade learning does not
seem to improve the results obtained in the reference experiments, where L2
is learned directly. That conclusion is confirmed by the experiments without
lemmas, the outcome of which we do not detail. Next, we re-consider the way
the information contained in the L2 labels is decomposed, in order to better
learn those labels.
Learning by Decomposition and Recombination of Labels. We decompose the L2 labels into components, so that they can be learned independently.
We call label component a group of atomic symbols, which cannot all belong to
the same label. Intuitively, those components correspond to the possible values
for a linguistic feature, such as Gender or Number.
Example 2. The labels in L = {NFS, NFP, NMS, NMP} come from the concatenation of elements in the sets {N } (Noun), {M, F } (Gender), and {S, P }
(Number). All the four labels in L can be recombined by the cartesian product of
three components: {N }·{M, F }·{S, P }, where · (dot) denotes the concatenation
of sub-labels.
A first option for building those components is to propose the following sets:
– POS={ADJ, ADV, CH, CONJCOO, CONJSUB, DET, INT, MI, N, PREP, PRES,
P, PP, V}
– Genre={M, F}; Pers={1, 2, 3}; Num={S, P}
– Mode Tense={CON, IMP, SUB, IND, INDF, INDI, INDP, INF, PARP,
PARPRES}
– Det Pro={IND, DEM, DEF, POSS, PER, INT}
Each of those components can be learned independently. However, some of them
are still mutually exclusive: for example, Person and Gender can be grouped
together since their values (respectively in {1, 2, 3} and {M, F }) never occur
together. On the contrary, Gender and Number can not be grouped, since, for
instance, the value ‘F’ may occur with ‘S’ or ‘P’ within the same label. We
end up working with 4 components: G0 = POS, G1 = Genre ∪ Pers ∪ {ε},
G2 = Num ∪ {ε}, and G3 = Mode Tense ∪ Det Pro ∪ {ε}, with ε the empty
string, the neutral element for the concatenation. Each of these label subsets can
now be learned independently by a specific CRF. In that case, the final label
proposed by the system results from the concatenation of all the CRF outcomes.
If we denote by · (dot) the concatenation operator, the cartesian product G0 ·
G1 · G2 · G3 generates every L2 label. But it also generates labels which are not
linguistically motivated. For example, ADVMP = ADV · M · P · ε is meaningless,
since an adverb is invariable. In order to avoid that problem, we have tested
two different approaches. The first one consists of using a new CRF, whose
features are the components learned independently. The second one consists of
introducing explicit symbolic rules in the concatenation process, in order to rule
out the illegal combinations. Example rules are as follows:
– ADV, CONJCOO, CONJSUB and INT can only combine with ε
– V can not combine with values of Det Pro
– DET can not combine with values of Mode Tense
Those rules exploit the fact that the POS category (i.e. the G0 component)
is learned with strong enough confidence (F1 = 0.94) to constrain the other
sub-labels with which it may combine. We have carried out the tests presented
below:
Expe. 1 CRF(Gi |feat(word, lemma, D3 (word))) → ResGi
We have also tested different versions where feature generation is achieved without lemma but with word tails instead (as in Test IVbis).
Expe. 2 CRF(Gi |feat(word, D3 (word), D2 (word), D1 (word))) → ResbisGi
Test VII CRF(L2|feat(word, Rlemma, ResG0, ResG1, ResG2, ResG3)) → ResL2
Test VIIbis Same as VII, but without lemmas:
CRF(L2|feat(word, ResbisG0, ResbisG1, ResbisG2, ResbisG3)) → ResbisL2
Test VIII Here, the outcome from Test VII is replaced by symbolic combination
rules using the results ResGi.
Test VIIIbis Same as VIII, except that the combination rules use ResbisGi.
Figure 3 illustrates the two possible strategies for recombining the full labels,
along with the results from learning each component independently (accuracy
measures). Note that the component G2 is better learned without lemmas but
with word tails.

Component   ResGi   ResbisGi
G0          0.94    0.93
G1          0.92    0.95
G2          0.92    0.95
G3          0.92    0.94

Test   VII    VIIbis   VIII   VIIIbis
F1     0.89   0.875    0.9    0.895

Fig. 3. Learning components: accuracy measures for two recombination strategies.

The recombination strategy based on CRF (Test VII and VIIbis)
does not improve the scores obtained by direct learning of the full labels on
L2 (Test IV and IVbis). However, the rule-based recombination strategy (Test
VIII and VIIIbis) does improve direct learning. Test VIIIbis illustrates that, in
general, the absence of lemmas can be balanced by word tails associated with
symbolic recombination of the labels. Meanwhile, timewise the learning phase is
considerably improved by the recombination strategy: Test VIII only takes 75
min., while Test IV takes up to 15h. (using a standard PC). It should also be
noted that since the labels obtained by recombination are most of the time only
partially (in)correct, those experiments would be better evaluated with other
measurements than accuracy.
Note, as well, that our definition of a specific set of labels prevents
comparing our performance against that of other taggers.
4  Conclusion
In this paper, we have shown that it is possible to efficiently learn a morphosyntactic tagger specialised for a specific type of corpus. First, we have seen
that the specificities of spoken language are difficult to enumerate. Instead of
trying to cover them all with rules, it is natural to rely on Machine Learning techniques.
Our experiments all take the input data as they are, without filtering out any
difficulties.
Note that it is not possible to rigorously compare the performance achieved
by Cordial with the performances reported here, since the target label sets are
different. Yet, the performances of the best learned taggers seem comparable to
those usually obtained by Cordial on oral corpora. The incentive for using CRF
for that task is that it does not require many parameters to be set, and that the
settings involved are flexible enough to integrate external linguistic knowledge.
We have mostly used our understanding of the labels in order to focus
on learning sub-labels, which are simpler and more coherent. The performance
would certainly also be improved by the use of a dictionary of labels for
each word, or each lemma, to specify features.
Finally, it seems quite difficult to further improve the quality of the learned labels
by simply relying on the decomposition into simpler sub-labels. However, that
decomposition strategy is very efficient timewise, and the learning process
has been greatly improved in that respect. It is also interesting to notice that
the most efficient strategy relies on a combination of statistical learning and
symbolic rules. Further work is going in that direction.
References
1. Dister, A.: De la transcription à l’étiquetage morphosyntaxique. Le cas de la
banque de données textuelle orale VALIBEL. Thèse de doctorat, Université de
Louvain (2007)
2. Valli, A., Veronis, J.: Étiquetage grammatical des corpus de parole : problèmes et
perspectives. Revue française de linguistique appliquée IV(2) (1999) 113–133
3. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In: Proceedings of ICML’01.
(2001) 282–289
4. Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields
for Relational Learning. The MIT Press (2007)
5. McCallum, A., Li, W.: Early results for named entity recognition with conditional
random fields, feature induction and web-enhanced lexicons. In: Proceedings of
CoNLL. (2003)
6. Pinto, D., McCallum, A., Lee, X., Croft, W.: Table extraction using conditional
random fields. In: SIGIR’03: Proceedings of the 26th ACM SIGIR. (2003)
7. Altun, Y., Johnson, M., Hofmann, T.: Investigating loss functions and optimization
methods for discriminative learning of label sequences. In: Proceedings of EMNLP.
(2003)
8. Sha, F., Pereira, F.: Shallow parsing with conditional random fields. In: Proceedings of HLT-NAACL. (2003) 213–220
9. Jousse, F.: Transformations d’arbres XML avec des modèles probabilistes pour
l’annotation. Thèse de doctorat, Université de Lille (2007)
10. Zidouni, A., Glotin, H., Quafafou, M.: Recherche d’Entités Nommées dans les
Journaux Radiophoniques par Contextes Hiérarchique et Syntaxique. In: Actes de
Coria. (2009)
11. Blanche-Benveniste, C., Jeanjean, C.: Le Français parlé. Transcription et édition,
Paris (1987)
12. Blanche-Benveniste, C.: Les aspects dynamiques de la composition sémantique de
l’oral. In Condamines, A., ed.: Sémantique et corpus. Lavoisier, Hermès, Paris
(2005) 39–74
Chinese Named Entity Recognition with the Improved
Smoothed Conditional Random Fields1
Xiaojia Pu, Qi Mao, Gangshan Wu2, and Chunfeng Yuan
Department of Computer Science and Technology, Nanjing University
{puxiaojia, maoq1984}@gmail.com, {gswu, cfyuan}@nju.edu.cn
Abstract. As a kind of state-of-the-art sequence classifier, Conditional Random
Fields (CRFs) have recently been widely used for natural language
processing tasks that can be viewed as sequence labeling problems, such
as POS tagging, named entity recognition (NER), etc. But CRFs suffer from the
failing that they are prone to overfitting when the number of features grows. For
the NER task, the feature set is very large, especially for Chinese, because
of its complex characteristics. Existing approaches to avoid overfitting include
regularization and feature selection. The main shortcoming of these
approaches is that they ignore the so-called unsupported features, i.e. the
features appearing in the test set but with zero count in the training set.
Without their information, the generalization of CRFs
suffers. This paper describes a model called Improved Smoothed CRF, which
captures the information of the unsupported features using smoothing
features. It provides a very effective and practical way to improve the
generalization performance of CRFs. Experiments on Chinese NER prove the
effectiveness of our method.
Keywords: Chinese named entity recognition; Conditional Random Fields;
overfitting; generalization; Improved Smoothed CRF; smoothing feature
1  Introduction
Named entity recognition (NER) is one of the fundamental tasks in natural language
processing and text processing. The aim of NER is to find the names of special
entities in free text, e.g. persons, locations, organizations, etc.
Compared with English, Chinese NER is more difficult because of the complex
characteristics of the language. For example, an English sentence is a sequence of words
separated by spaces, but a Chinese sentence is a sequence of
characters without any spaces between them.
Viewed as a sequence labeling problem, various sequence labeling models have
been used to solve the NER problem such as Hidden Markov model (HMM) [1],
Maximum Entropy (ME) [2], Maximum Entropy Markov model (MEMM) [3],
1 This work was supported by the research grants from the Natural Science Foundation of
China (60721002 and 60975043).
2 Corresponding author.
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 91-103
Received 05/12/09
Accepted 16/01/10
Final version 12/03/10
Conditional Random Fields (CRFs) [4-10], and so on. Many works have proven that
CRFs achieve excellent performance and are superior to the previous models [13].
A key advantage of CRFs is that they support the use of complex features, e.g.
long-range features that are overlapping and inter-dependent. This flexibility
encourages the use of a rich collection of features. But with a large feature set,
CRFs are prone to overfitting [13] [15]; e.g. in the Chinese NER task, because of the
complex characteristics, the scale of the feature set exceeds millions. So it is
really important to solve the overfitting problem.
There are two existing approaches to address this problem: regularization [12] [14]
and feature selection [11] [15]. The regularization method adds a regularization
term to the objective function to penalize large weight vectors. This method is also
called smoothing [13] [20], similar to the smoothing methods in the Maximum Entropy
model [19]. A typical feature selection approach is feature induction [11], which
induces the best features from the candidate sets in an iterative and approximate
manner.
We conducted a detailed analysis of the issues which influence the
generalization ability of CRFs. After studying the existing approaches, we found
that their main disadvantage is that they do not take into account
the so-called unsupported features [12], which have zero count in the training set
but occur in the test set. Surely, these unsupported features have an important
impact on the generalization of CRFs [12]. For example, in the named entity
recognition task, due to the lack of some efficient features which appear only in the
test data set, some named entities can be very difficult to recognize.
In this paper, we propose a new model called Improved Smoothed CRF, adopting a
new smoothing method to help improve the generalization ability of CRFs. The
insight of our method is that though the unsupported features are unknown, we
can use some high-level features to cover them. These high-level features, called
smoothing features, are predefined based on the feature templates; each smoothing
feature corresponds to a feature template. During the training procedure, in order
to estimate the distribution of these smoothing features, a validation set, split
randomly from the whole training set, is used to simulate the test set.
Then, the ordinary features and smoothing features are integrated together for
training. During the decoding procedure, each unknown feature is mapped to
the corresponding smoothing feature, which keeps the information of the unknown
feature rather than just omitting it.
In order to evaluate our method, we did comparative experiments on the Chinese
NER task, and the results showed its effectiveness. Besides, we analyze why this
method is effective and discuss it.
The contribution of our work mainly includes: (1) a detailed analysis of the issues
which influence the generalization of CRFs and of some existing approaches to improve
the generalization ability; (2) an Improved Smoothed CRF and a demonstration of its
effectiveness on the Chinese NER task.
The paper is organized as follows. In Section 2, we review Conditional
Random Fields and analyze their generalization ability. In Section 3, we give a
detailed description of our Improved Smoothed CRF. The
experiments and result analysis are presented in Section 4. Finally, we give
the conclusion and some possible future works.
2  Conditional Random Fields and Analysis of its Generalization
Conditional Random Fields (CRFs) are a class of discriminative probabilistic models
trained to maximize a conditional probability. A commonly used special graph structure
is a linear chain, as shown in Fig. 1, which avoids the label bias problem of Maximum
Entropy Markov Models (MEMM) [3].
Fig. 1. A linear-chain CRF model
A linear-chain CRF^3 with parameters Λ = {λ1, λ2, ..., λK} defines the conditional
probability of a state sequence y = {y1, y2, ..., yT} given the observation sequence
x = {x1, x2, ..., xT} as

p_\Lambda(y \mid x) = \frac{\exp\left(\sum_t \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\right)}{Z(x)},   (1)

where Z(x) is the normalization constant that makes the probability sum to one, f_k is
a feature function which is often binary-valued, and λ_k is a learned weight associated
with feature f_k. Feature functions can measure any aspect of a state transition,
y_{t−1} → y_t, and of the observation sequence x, centered at the current time step
t. For example, one feature function might have the value 1 when y_{t−1} is the state B,
y_t is the state I, and x_t is the character ‘国’.
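Such a binary feature function can be sketched as follows (the state names and the example character are the illustrative ones from the text):

```python
# A minimal sketch of a binary feature function as in formula (1).
def f_example(y_prev, y_t, x, t):
    """Fires (returns 1) when the previous state is B, the current state
    is I, and the current observation is the character '国'."""
    return 1 if (y_prev == "B" and y_t == "I" and x[t] == "国") else 0

x = ["中", "国"]  # a two-character observation sequence
value = f_example("B", "I", x, 1)  # the feature fires at position 1
```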
2.1  Training of CRF
The training is performed by maximizing the log-likelihood L_Λ on the given
training set D = {(x^1, y^1), (x^2, y^2), ..., (x^N, y^N)}.
3 For convenience of description, all CRFs mentioned in this paper are linear-chain
CRFs.
\tilde{\Lambda} = \arg\max_{\Lambda \in \mathbb{R}^K} L_\Lambda,   (2)

where

L_\Lambda = \sum_{i=1}^{N} \left( \sum_t \sum_k \lambda_k f_k(y_{t-1}^i, y_t^i, x^i, t) - \log Z(x^i) \right).   (3)

Because the likelihood function is convex, we can find the optimum by
seeking the zero of the gradient, i.e. of the partial derivative

\frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_t f_k(y_{t-1}^i, y_t^i, x^i, t) - \sum_{i=1}^{N} \sum_t \sum_y p_\Lambda(y \mid x^i) f_k(y_{t-1}, y_t, x^i, t).   (4)

The first term is the expected value of f_k under the empirical distribution
\tilde{p}(x, y). The second term is the expectation of f_k under the model distribution
p_\Lambda(y \mid x). For easier understanding, formula (4) can be written as

\frac{\partial L_\Lambda}{\partial \lambda_k} = E_{\tilde{p}(x,y)}[f_k] - E_{p_\Lambda(y \mid x)}[f_k], \quad \forall k.   (5)
Therefore, at the maximum likelihood solution, where the gradient is zero, the
two expectations are equal. Setting this gradient to zero does not yield any
closed-form solution, so training typically resorts to iterative methods, such as L-BFGS
[16], which has been the method of choice since the work of [17].
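Formula (5) can be checked numerically on a toy example: for a sequence short enough to enumerate every label sequence, the gradient is exactly the empirical feature count minus the model's expected feature count. The labels, features, and weights below are illustrative, not the paper's:

```python
import math
from itertools import product

labels = ["B", "I"]
x = ["中", "国"]
y_gold = ("B", "I")

def features(y_prev, y_t, x, t):
    # two illustrative binary features, indexed 0 and 1
    return [1 if (y_t == "B" and t == 0) else 0,
            1 if (y_prev == "B" and y_t == "I") else 0]

lam = [0.5, -0.2]  # current weights

def score(y):
    # the linear score inside the exp of formula (1)
    s = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "START"
        s += sum(l * f for l, f in zip(lam, features(y_prev, y[t], x, t)))
    return s

ys = list(product(labels, repeat=len(x)))        # all label sequences
Z = sum(math.exp(score(y)) for y in ys)          # normalization Z(x)
p = {y: math.exp(score(y)) / Z for y in ys}      # p_Lambda(y | x)

def counts(y):
    c = [0, 0]
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "START"
        for k, f in enumerate(features(y_prev, y[t], x, t)):
            c[k] += f
    return c

empirical = counts(y_gold)
expected = [sum(p[y] * counts(y)[k] for y in ys) for k in range(2)]
grad = [e - m for e, m in zip(empirical, expected)]  # formula (5)
```

In practice the model expectation is computed with the forward-backward algorithm rather than by enumeration, but on this toy scale the two agree.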
2.2  Analysis of the Generalization
In this section, we conduct a detailed analysis of the issues which lead to
the overfitting problem and decrease the generalization ability of CRFs. These issues
include the following.
Firstly, CRF training can be viewed as maximum likelihood
estimation (MLE), and like other maximum likelihood methods, this type of modeling
is prone to overfitting because of its inherent weakness, i.e. the lack of any prior
information about the parameter distribution [18]. The common way to avoid this
is to use Maximum a Posteriori (MAP) estimation instead.
Secondly, formula (1) tells us that a CRF is an exponential linear model. The number of
parameters is equal to the number of features. As features are added, the
parameter dimension becomes very large, and thus the freedom of the parameters grows.
For Chinese NER, the scale of the features reaches millions or more, and almost 35%
of them are sparse features. In order to fit these sparse
features, some parameters become very large. This highly uneven distribution of
the parameter values leads to the overfitting problem.
Further, during original CRF training, the features used are usually simply
aggregated from the training set following the feature templates. But the diversity
between the training set and the test set is inevitable, and the features occurring in
the training set cannot cover the full feature set. So, during the decoding period, the
original CRF omits the unsupported features [12], the features occurring in the test
set but with zero count in the training set, ignoring the fact that these features surely
contain useful information. In [12], the authors found that the unsupported features can
be extremely useful for pushing Viterbi inference away from certain paths by
assigning such features negative weight. So if a model is trained without the information
of the unsupported features, its generalization ability decreases when it is applied to
the test set.
2.3  Existing Work
Focusing on part of the issues discussed above, some approaches have been proposed to
avoid the overfitting problem, including feature selection and regularization.
Feature Selection. Wise choice of features is always vital in machine learning
solutions. For CRFs, this method targets the second issue discussed above, reducing
the parameter dimension. A typical method for feature selection is feature induction
[11], which induces the best features from the candidate sets in an iterative and
approximate manner. But the computational cost becomes very large as the scale of the
features grows, so it is not very practical in applications such as Chinese NER.
Regularization. As a common way to avoid overfitting, regularization is a penalty on
parameter vectors whose norm is too large. Instead of maximizing solely the
likelihood of the training set, typically a quadratic penalty term is added:

L_\Lambda = \sum_{i=1}^{N} \left( \sum_t \sum_k \lambda_k f_k(y_{t-1}^i, y_t^i, x^i, t) - \log Z(x^i) \right) - \sum_k \frac{\lambda_k^2}{2\delta^2},   (6)

where \delta^2 specifies how much the penalty is applied.
In general, the penalty prevents the absolute values of the parameters {|λ_k|} from
becoming too large. This method has a Maximum a Posteriori (MAP)
interpretation: the parameter Λ follows a Gaussian prior distribution.
Following the expression of formula (5), the gradient of the objective function
can be written as:

\frac{\partial L_\Lambda}{\partial \lambda_k} = E_{\tilde{p}(x,y)}[f_k] - E_{p_\Lambda(y \mid x)}[f_k] - \frac{\lambda_k}{\delta^2}, \quad \forall k.   (7)
The regularization method is also called smoothing [13] [15] [20], because it is
similar to the smoothing methods in the Maximum Entropy model [19]. It addresses
the first and second issues discussed above, by adopting a MAP training method and
tightening the values of the parameters.
We can therefore conclude that the existing approaches all ignore the third issue, i.e.
the unsupported features; actually, the unsupported features have a great impact on
the generalization performance of CRFs.
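The effect of the quadratic penalty on the objective and its gradient (formulas (6) and (7)) can be sketched in a few lines; the weights, raw gradient values, and δ² below are purely illustrative:

```python
# Sketch of the L2 (Gaussian-prior) penalty of formulas (6)-(7):
# the penalized objective subtracts sum(lam_k^2 / (2 * delta^2)), and each
# gradient component is shifted by -lam_k / delta^2.
delta2 = 10.0                 # delta^2, the penalty strength
lam = [0.5, -0.2]             # current weights (illustrative)
grad = [0.3, 0.1]             # unpenalized gradient, as in formula (5)

penalty = sum(l * l for l in lam) / (2 * delta2)          # subtracted from L
grad_reg = [g - l / delta2 for g, l in zip(grad, lam)]    # formula (7)
```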
An intuitive approach is to enumerate the full feature set, containing all
possible features, and then use the regularization (smoothing) method. During
training, all the unsupported features would then get non-zero weights. However, for
Chinese NER, it is hard to enumerate the full feature set due to the large number of
Chinese characters and the complex characteristics of the language; besides, doing so
often greatly increases the number of parameters, which itself causes overfitting.
[12] presents a compromise method called incremental support, which
introduces just some heuristic unsupported features in an iterative way. However, it is
not practical for the large feature set of Chinese NER.
3  Improved Smoothed Conditional Random Fields
We propose a new model called Improved Smoothed Conditional Random Field,
which provides a practical way to capture the information of unsupported features by
means of introducing smoothing features.
The inspiration of our method is that though the unsupported features are
unknown, we can use some high-level features to cover them. Every high-level
feature, called a smoothing feature, is predefined corresponding to a feature template.
Since each unsupported feature is also generated from a specific feature template,
we can use the smoothing feature to replace it.
During training, in order to extract the smoothing features and estimate
their distribution, we use a validation set as a simulation of the test set. The
validation set is split randomly from the whole training set. In the end, the
ordinary features and smoothing features are put together for training. During
decoding, each unknown feature is mapped to the corresponding smoothing
feature rather than simply omitted.
The advantage of our method is that it provides a practical way to capture the
information of unsupported features while not enlarging the parameter dimension
too much, because the smoothing feature set is small.
3.1  Smoothing Features
The so-called smoothing features are predefined based on the feature templates, as
shown in Table 2.
For a given training set Train_set, after feature selection we get a
feature vector F1 = {f1, f2, ..., fK}; for a given validation set V_set, following the
same feature selection method, we get a feature vector F2 = {f1', f2', ..., fm'}. We
use T(f) to denote the template which generates the feature f. If
f ∈ F2 and f ∉ F1, we use a special feature f*(T) to describe T(f), and f*(T) is
called the smoothing feature for the given feature template.
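Under this set-based view, the construction of the smoothing feature set can be sketched as follows. The (template, value) pair representation and the example feature strings are our assumptions, not the paper's implementation:

```python
# Sketch of Section 3.1: a feature is a (template, value) pair; any
# validation-set feature absent from the training set is represented by
# the smoothing feature f*(T) of its template T.
def smoothing_features(F1, F2):
    """F1: feature set from the training set, F2: from the validation set.
    Returns the smoothing features f*(T), one per template T that produced
    at least one feature unseen in training."""
    return {("<Unknown>", template) for (template, value) in F2
            if (template, value) not in F1}

F1 = {("C0", "国"), ("C0", "中")}       # seen in training
F2 = {("C0", "国"), ("C0", "民")}       # '民' never seen in training
F_star = smoothing_features(F1, F2)     # {("<Unknown>", "C0")}
```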
3.2  Extraction of Features
For the training of a CRF, the extraction of features is the vital first step, including
calculating the frequency of each feature. The extraction sequence for the Improved
Smoothed CRF is shown in Fig. 2.
Unlike the original CRF, during the extraction procedure the Improved
Smoothed CRF divides the features into ordinary features and smoothing features.
As shown in Fig. 2, the training set D is divided into two subsets, D1 and D2. D1 is
the data set, and D2 is the validation set. F3 is the collection of the smoothing
features, and F*, containing the distribution of smoothing features, is the final
smoothing feature set. The validation set is used to simulate the test set to get the
distribution of the smoothing features. F1 and F2 − F* are the ordinary features. F is
the final full feature set with ordinary features and smoothing features, including
their distributions.
Fig. 2. The sequence of extracting features
3.3  The Definition of Improved Smoothed CRF
Following the definition of the original CRF, the conditional probability of the state
sequence y = {y1, y2, ..., yT} given the observation sequence x = {x1, x2, ..., xT}
defined by the Improved Smoothed CRF can be formulated as

p(y \mid x) = \frac{\exp\left(\sum_t \left(\sum_k \lambda_k f_k(y_{t-1}, y_t, x, t) + \sum_m \mu_m g_m(y_{t-1}, y_t, x, t)\right)\right)}{Z(x)},   (8)

where g_m(y_{t-1}, y_t, x, t) ∈ F* is a smoothing feature at time t, and μ_m is the
weight associated with it.
From formula (8), the new model seems the same as the original CRF, but
essentially they are different. The new model uses the predefined smoothing features
to cover the unsupported features, and predicts the distribution of the unsupported
features based on the validation set, which simulates the test data.
Incorporating the smoothing features, the model fits the test data better and
improves the generalization performance.
Because it can be seen as an extension of regularization (smoothing) in an
improved smoothing way, we call it the Improved Smoothed CRF.
3.4  Training of the Improved Smoothed CRF
After the extraction of the features, the training procedure is the same as for the
original CRF. The entire training process is shown below:

Algorithm: The training of the Improved Smoothed CRF
Improved-Smoothed-CRF-training(D, F_temp)
Input: D = {d_i}, d_i = {x_i, y_i}; F_temp  // D is the training set, F_temp are the feature templates
Output: Λ = {λ1, ..., λk, μ1, ..., μm}  // the parameter weights: λ1, ..., λk for the ordinary features, μ1, ..., μm for the smoothing features
begin
  divide D into D1 and D2 (validation set)
  extract F1, F2 from D1, D2
  generate F3 based on F_temp
  extract F* from F1, F2 and F3
  Λ = 0  // initialization
  do {
    for d = {x, y} in D do
      calculate p_Λ(y | x) of d based on formula (1)
      for each λ / μ in the features of d do
        update ∂L_Λ/∂λ (∂L_Λ/∂μ) based on formula (6)
      end for
      update L_Λ based on p_Λ(y | x)
    end for
    L-BFGS(Λ, L_Λ, {∂L/∂λ(μ)_i})  // update the parameters with L-BFGS
  } until converged
end.
3.5  Decoding of the Improved Smoothed CRF
The decoding process is the same as for the original CRF after replacing the
unsupported features with the smoothing features.
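The decode-time replacement can be sketched as follows, reusing the hypothetical (template, value) feature representation (the names are ours, not the paper's implementation):

```python
# Sketch of Section 3.5: a test-set feature unknown to the model is not
# dropped but mapped to the smoothing feature of its template.
def map_feature(feature, known, smoothing):
    """Return the feature itself if the model knows it, otherwise the
    smoothing feature of its template (or None if no such feature exists)."""
    template, value = feature
    if feature in known:
        return feature
    star = ("<Unknown>", template)
    return star if star in smoothing else None

known = {("C0", "国")}                  # ordinary features learned in training
smoothing = {("<Unknown>", "C0")}       # smoothing features learned in training
mapped = map_feature(("C0", "民"), known, smoothing)
```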
4  Experiments
We did comparative experiments on the Chinese NER task to evaluate our method on two
different corpora, MSR [21] and PKU [22]. We picked 3000 sentences from
each corpus as the test sets; the remainder is used for training.
The Chinese NER problem can be addressed mainly at two levels: the word level [8] [9] and
the character level [6] [10]. In order to analyze the results clearly, we did our experiments
at the character level.
We extended CRF4 in the Mallet toolkit [23] to implement our Improved Smoothed
CRF, and the baseline model is CRF4, an implementation of the original CRF.
The evaluation metrics are precision, recall and F-measure.
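These metrics can be sketched at the entity-span level (the paper does not state its exact matching criterion; exact-span matching is assumed here):

```python
# Precision, recall and F-measure over predicted entity spans, where a
# span is a (type, start, end) triple and only exact matches count.
def prf(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # exact-span matches
    p = tp / len(predicted) if predicted else 0.0    # precision
    r = tp / len(gold) if gold else 0.0              # recall
    f = 2 * p * r / (p + r) if p + r else 0.0        # F-measure
    return p, r, f

# one correct span, one miss, one false alarm -> P = R = F = 0.5
p, r, f = prf({("NR", 0, 2), ("NS", 5, 7)}, {("NR", 0, 2), ("NT", 9, 12)})
```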
4.1  Feature Templates and Tag Set
The feature templates are shown in Table 1. Since the purpose of our experiments is
to compare the performance of the original CRF and the Improved Smoothed CRF,
the feature templates we used are basic and simple ones, which makes the
analysis of the results clearer and more persuasive.

Table 1. Feature templates used in the experiment

type              template
Base feature      Cn (n = −2, −1, 0, 1, 2)
Bi-gram feature   CnCn+1 (n = −2, −1, 0, 1); C−1C1
Cn is the character at relative distance n from the observation position;
these feature templates are the basic templates for Chinese NER.
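Feature generation from these templates can be sketched as follows; the padding symbol ‘#’ and the string encoding of the features are our assumptions, not the paper's:

```python
# Features of Table 1 for one position t of a character sequence:
# unigrams C_n for n in -2..2, bi-grams C_n C_{n+1}, and C_-1 C_1.
def base_features(chars, t):
    def c(n):
        i = t + n
        return chars[i] if 0 <= i < len(chars) else "#"   # out-of-range padding
    feats = {f"C{n}={c(n)}" for n in range(-2, 3)}                    # unigrams
    feats |= {f"C{n}C{n+1}={c(n)}{c(n+1)}" for n in range(-2, 2)}     # bi-grams
    feats.add(f"C-1C1={c(-1)}{c(1)}")                                 # skip bi-gram
    return feats

fs = base_features(list("南京大学"), 1)  # features at the character 京
```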
Corresponding to the feature templates, some smoothing features are listed in
Table 2. We list only the node features, without their combination with the state
transition features; during training, the state transitions are of course considered.

Table 2. Some corresponding smoothing features

template   Smoothing feature
C2         <Unknown>@2
C-1C0      <Unknown>@-1_&_<Unknown>@0
C1C2       <Unknown>@1_&_<Unknown>@2
…          …
The tag set is the coding of the states; we choose BIO as our tag set, and the full
state set is {O, B-NR, I-NR, B-NS, I-NS, B-NT, I-NT}.
NR, NS and NT represent person names, locations and organizations respectively. O
means the character is not part of a named entity; B-NR marks the first
character of a person name, and I-NR a non-initial character of a person name.
NS and NT are used similarly.
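A small illustration of this character-level BIO coding, together with the recovery of entity spans from a tag sequence (the example sentence is ours):

```python
def spans(tags):
    """Recover (type, start, end) entity spans from a BIO tag sequence."""
    out, start = [], None
    for i, tag in enumerate(tags + ["O"]):      # sentinel closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            out.append((tags[start][2:], start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return out

chars = list("我在南京工作")                     # "I work in Nanjing"
tags = ["O", "O", "B-NS", "I-NS", "O", "O"]     # 南京 labelled as a location (NS)
entities = spans(tags)                          # the span covering 南京
```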
4.2  Comparative Results
The results of the comparative experiments are shown in Table 3 and Table 4. The three
types of named entities to be recognized are person, location and organization.

Table 3. Comparative results based on PKU corpus

                    Metric      Person    Loc.      Org.      All
Original CRF        Precision   92.83%    89.62%    85.21%    90.26%
                    Recall      80.69%    81.45%    77.54%    80.13%
                    F-measure   86.33%    85.34%    81.19%    84.89%
Imp. Smoothed CRF   Precision   93.05%    90.72%    87.66%    91.28%
                    Recall      87.96%    85.38%    81.46%    85.90%
                    F-measure   90.43%    87.97%    84.44%    88.51%

Table 4. Comparative results based on MSR corpus

                    Metric      Person    Loc.      Org.      All
Original CRF        Precision   93.95%    86.02%    82.08%    87.44%
                    Recall      72.14%    74.14%    64.25%    70.79%
                    F-measure   81.62%    79.64%    72.08%    78.24%
Imp. Smoothed CRF   Precision   92.69%    87.72%    81.99%    87.84%
                    Recall      82.36%    79.02%    70.49%    77.78%
                    F-measure   87.22%    83.15%    75.81%    82.50%
For the experiments on the PKU corpus, we used about 16 thousand sentences for
training, and on the MSR corpus about 20 thousand sentences.
Because we aim to compare the performance of the models rather than to get the best
recognition result, we did not use the entire corpora for training. The size of our
validation set is about 1/3 to 1/2 of the training corpus.
We can see that with the Improved Smoothed CRF the F-measure improves on
both corpora: by 3.62% on PKU and 4.26% on MSR.
The recall shows the largest increase: 5.77% on PKU and 6.99% on MSR. This
increase indicates that the information of the unsupported features is very useful,
and that our model captures it efficiently. With this information, we can recognize some
entities which could not be recognized correctly by the original model.
4.3  The Change of the Parameters
In order to see clearly the impact of the Improved Smoothed CRF on the
parameters, some values of the parameters f(y_{t−1}, y_t, '国') are shown in Table 5.

Table 5. The change of some example parameters

Features (国)   Original CRF              Imp. Smoothed CRF
O -> O          0.27270093137463947       0.5236346196173783
O -> B-ns       -0.09193322956278925      0.3431855304817065
B-nr -> O       -0.18969131328341626      0.11111461613932119
B-nr -> I-nr    0.14558200966493068       0.12232168710742694
B-nr -> B-ns    0.05574692403247397       -0.11382501213178316
B-nr -> B-nt    0.03156266967717719       0.11615382524919472
...             ...                       ...
We can see that the parameter values become more tightly distributed, which helps to
improve the generalization performance, and the experimental results indeed showed this.
4.4  Different Training Size
We also did an experiment on the MSR corpus to find how the size of the training set
influences the performance. The result is shown in Fig. 3.
The four training sets we used are respectively 25%, 50%, 75% and 100% of the whole
corpus. For the 75% and 100% settings, due to the memory limits of the JVM
and our machine, we discarded the sparse features, but the experimental results still
showed that our model improves the performance.
As the training scale increases, the improvement becomes relatively smaller. This
indicates that the unseen information of the unsupported features is more useful for
small training sets, and that our model deals with this effectively.
Fig. 3. The result based on different training set size
4.5  Analysis and Discussion
The main particularity of our smoothing method is that it can capture some
unknown features while also reducing the computation cost. We think that, for the
NER task, the sparse unknown features have something in common, and our method
makes the best use of these common characteristics.
For the special task of Chinese NER, our smoothing method gives a very
effective and practical way to improve the generalization of CRFs. In future
work, we want to do experiments on other tasks to analyze whether this
smoothing method is limited to some special tasks or depends on factors like text
genre or text domain.
5  Conclusion
In this paper, we have given a detailed analysis of the factors influencing the
generalization ability of CRFs, and then proposed an Improved Smoothed CRF. The
substance of our work is to use smoothing features to capture the information of the
unsupported features and to use a validation set to simulate the test set.
The Improved Smoothed CRF provides a practical and effective way to increase
the generalization performance of CRFs. The experiments on Chinese NER proved its
effectiveness. We will incorporate more meaningful feature templates, such as
the surname lists and location ending lists of [5-10], to achieve a better result for
Chinese NER.
We believe that our method could also be useful in other NLP tasks, e.g. POS
tagging, Chinese word segmentation, etc.
References
1. L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition. In: Proceedings of IEEE, pp. 257--285. (1989)
2. A. L. Berger, S. A. D. Pietra, V. J. D. Pietra: A Maximum Entropy Approach to Natural
Language Processing. Computational Linguistics, pp. 39-71. (1996)
3. A. McCallum, D. Freitag, F. Pereira: Maximum Entropy Markov Models for Information
Extraction and Segmentation. In: Proceedings of ICML’2000.
4. J. Lafferty, A. McCallum, F. Pereira: Conditional Random Fields: Probabilistic Models for
Segmenting and Labeling Sequence Data. In: Proceedings of ICML’ 2001, pages 282-289.
5. A. McCallum, W. Li: Early Results for Named Entity Recognition with Conditional
Random Fields, Feature Induction and Web-Enhanced Lexicons. In: Proceedings of 7th
Conference on Natural Language Learning (CoNLL). (2003)
6. W. Chen, Y. Zhang, H. Isahara: Chinese Named Entity Recognition with Conditional
Random Fields. In: Proceedings of the 5th SIGHAN Workshop on Chinese Language
Processing & the 3rd International Chinese Language Processing Bakeoff. (2006)
7. A. Chen, F. Peng, R. Shan, G. Sun: Chinese Named Entity Recognition with Conditional
Probabilistic Models. In: Proceedings of the 5th SIGHAN Workshop on Chinese Language
Processing & the 3rd International Chinese Language Processing Bakeoff. (2006)
8. Z. Xu, X. Qian, Y. Zhang, Y. Zhou: CRF-based Hybrid Model for Word Segmentation,
NER and even POS Tagging. In: The 4th International Chinese Language Processing
Bakeoff & the First CIPS Chinese Language Processing Evaluation. (2008)
9. H. Zhao, C. Kit: Unsupervised Segmentation Helps Supervised Learning of Character
Tagging for Word Segmentation and Named Entity Recognition. In: The 4th International
Chinese Language Processing Bakeoff & the First CIPS Chinese Language Processing
Evaluation. (2008)
10. Yuanyong Fen, Le Sun, Wenbo Li, Dakun Zhang: A Rapid Algorithm to Chinese Named
Entity Recognition Based on Single Character Hints. Journal of Chinese Information
Processing, Vol.22, No.1, pp.104-110. (2008)
11. A. McCallum: Efficiently Inducing Features of Conditional Random Fields. In:
Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI). (2003)
12. F. Peng, A. McCallum: Accurate Information Extraction from Research Papers using
Conditional Random Fields. In: Proceedings of Human Language Technologies: The
11th Annual Conference of the North American Chapter of the Association for
Computational Linguistics. (2004)
13. Roman Klinger, Katrin Tomanek: Classical Probabilistic Models and Conditional Random
Fields. Algorithm Engineering Report TR07-2-013, Dortmund University of Technology.
(2007)
14. Charles Sutton, Andrew McCallum: An introduction to Conditional Random Fields for
Relational Learning. In: Lise Getoor, Benjamin Taskar (Editors.): Introduction to Statistical
Relational Learning, MIT Press, Chap.1, pp.93-127. (2006)
15. Trevor A. Cohn: Scaling Conditional Random Fields for Natural Language Processing.
Ph.D. Thesis, University of Melbourne. (2007)
16. D.C. Liu, J. Nocedal: On the Limited Memory BFGS Method for Large Scale
Optimization. Mathematical Programming, pp. 49-55. (1989)
17. F. Sha, F. Pereira: Shallow Parsing with Conditional Random Fields. In: Proceedings of
Human Language Technologies: The 11th Annual Conference of the North American
Chapter of the Association for Computational Linguistics. (2003)
18. Stanley F. Chen, Ronald Rosenfeld: A Survey of Smoothing Techniques for ME Models.
In: IEEE Transactions on Speech and Audio Processing, Vol.8, No.1, pp. 37-50. (2000)
19. Stanley F. Chen, Ronald Rosenfeld: A Gaussian Prior for Smoothing Maximum Entropy
Models. Technical Report CMU-CS-99-108, Carnegie Mellon University. (1999)
20. D. L. Vail, J. D. Lafferty, M. M. Veloso: Feature Selection in Conditional Random Fields
for Activity Recognition. In: Proceedings of the 2007 IEEE/RSJ International Conference
on Intelligent Robots and Systems, San Diego, CA, USA, Oct 29 - Nov 2. (2007)
21. Corpus of MSR, http://www.sighan.org/bakeoff2006/
22. Corpus of Peking University, http://icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp
23. Mallet Toolkit, http://mallet.cs.umass.edu
Ontology-Driven Approach to Obtain
Semantically Valid Chunks for Natural
Language Enabled Business Applications
Shailly Goyal, Shefali Bhat, Shailja Gulati, C Anantaram
Innovation Labs, Tata Consultancy Services Ltd
Gurgaon, India
Abstract. For a robust natural language question answering system for
business applications, query interpretation is a crucial and complicated
task. The complexity arises due to the inherent ambiguity in natural language which may result in multiple interpretations of the user’s query.
General purpose natural language (NL) parsers are also insufficient for
this task because, while they produce a syntactically correct parse, they lose
the semantics of the sentence, since such parsers lack domain knowledge. In the present work we address this shortcoming and
describe an approach to enrich a general purpose NL parser with domain
knowledge to obtain semantically valid chunks for an input query.
A part of the domain knowledge, expressed as domain ontology, along
with the part-of-speech (POS) tagging is used to identify the correct
predicate-object pairs. These pairs form the constraints in the query.
In order to identify the semantically valid chunks of a query, we use
the syntactic chunks obtained from a parser, constraints obtained by
predicate-object binding, and the domain ontology. These semantically
valid chunks help in understanding the intent of the input query, and
assist in its answer extraction. Our approach works seamlessly across
various domains, provided the corresponding domain ontology is available.
1 Introduction
Natural language (NL) enabled question answering systems for business applications [1, 2] aim at providing appropriate answers to user queries. In such systems, query interpretation is a fundamental task. However, due to the innately
ambiguous nature of natural language, interpretation of a user's query is
usually not straightforward. The ambiguity can be either syntactic, e.g., prepositional phrase (PP) attachment, or it can be semantic. In order to resolve such
ambiguities, NL enabled question answering systems mostly use general purpose NL parsers. Although these parsers give syntactically correct chunks for a
sentence, these chunks might not be semantically meaningful in a domain. For
example, consider the following queries:
– “List the employees working in loss making projects”. In this query, a human can easily disambiguate that “loss making” is a modifier of “projects”, and
“working in loss making projects” is a modifier of “the employees”. That is,
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 105-116
Received 23/11/09
Accepted 16/01/10
Final version 10/03/10
the correct chunks are “[List [[the employees] [working [in [[loss making
[projects]]]]]]]”. However, the chunks obtained from an NL parser are “[List
[[[the employees] [working [in [loss]]]] [making [projects]]]]”, which will
be interpreted as “List the employees who are working in loss and who make
projects”.
– “Give the projects having costing and billing >$25000 and <$35000,
respectively”. The NL chunker may chunk this query as “[Give [[the projects]
[having [[costing] and [billing]] [>$25000] and [<$35000]]], respectively”.
From these chunks it is not possible to identify that “costing >$25,000” and
“billing <$35,000” are the two constraints which modify “the projects”.
Thus the chunks obtained from such parsers may not be helpful in extracting
the answer to the user’s query. The problem becomes even more severe in case of
complex queries involving multiple constraints and nested sub-questions. Thus
the problem at hand is “How can we automatically enrich the output of a general
purpose NL parser with the domain knowledge in order to obtain syntactically
as well as semantically valid chunks for the queries in the domain? ”
We put forward an approach to solve the queries in a domain using the
domain knowledge and the syntactic structure of the queries. A part of the
domain knowledge is represented in the form of domain ontology which is based
on semantic web technologies [3]. We define ‘constraints’ and ‘semantically valid
chunks’ for a query. Semantically valid chunks aid in the interpretation and
extraction of the answers to the user’s queries unambiguously.
2 Related Work
Processing natural language questions to obtain an equivalent structured query
has long been an area of research in artificial intelligence [2, 1]. Still, most existing
approaches cannot interpret real-life natural language questions properly. Even
the few that can do so depend so heavily on domain-specific rules that porting
them to other domains becomes an issue.
Popescu et al. [4] adapt the Charniak parser [5] for domain-specific question
answering by extending the training corpus of the parser with a set of 150 hand-tagged domain-specific questions. Further, semantic rules inferred from domain
knowledge are used to check and correct preposition attachment and preposition
ellipsis errors. START [6] decomposes complex questions syntactically or semantically to obtain sub-questions that can be answered from available resources. If
these answers are not sufficient to solve the question, semantic information - in
the form of rules that map ‘key’ domain questions to the answers - is used. The
main drawback of these approaches is that the creation of domain-specific rules
is very resource intensive, and hence restricts portability.
AquaLog [7] tries to transform the NL question to ontology-specific triples
using syntactic annotations, semantic terms and relations, and question words
to interpret the natural language question. If these cannot resolve the ambiguity
in the question, domain ontology and/or WordNet are used to make sense of the
input query.
3 The Architecture
An NL-based question answering system requires the queries to be analyzed and
chunked in an appropriate manner so as to have correct query generation and
answer extraction. An NL query can be viewed as consisting of a set of unknown
predicates whose values need to be determined based on the constraints imposed
by the rest of the query. Domain ontology along with a POS tagger is used to
identify the constraints in the query. These constraints along with the domain
knowledge and the parse structure of the query are used to find the semantically
valid chunk set. These chunks are then converted to a formal query language
and the answer is retrieved from the ontology. Figure 1 demonstrates the overall
architecture of our approach. In this figure, solid arrows represent the process
flow for a query, and dashed arrows represent the information flow.
Hence for the appropriate interpretation
and analysis of the query, the important issues that need to be addressed can be summarized as:
Constraint identification. This involves identifying the correct predicate-object pairs.
Semantically valid chunk set. This involves
identification of valid constraints for each
unknown predicate so that correct interpretation of the given query can be ensured.
Query generation. In this step, the semantically valid chunk set is converted to appropriate formal language query using the domain ontology.
Fig. 1. The Architecture

In Section 4 we briefly describe the domain ontology and formalize the terminology used in this work. In the subsequent sections we discuss constraint identification and semantically valid chunk set formation.

4 Domain Ontology and Other Preliminaries
We use semantic web technologies [3] to create the domain ontology (in RDF
format) using the relational data of the business application along with its
meta information stored in the seed ontology¹ [8]. The ontology D_O of a domain D describes the domain terms and their relationships in the ⟨subject - predicate - object⟩ format. For illustration, ⟨Ritesh - project name - Bechtel⟩ describes that the predicate 'project name' of the subject 'Ritesh' has the object 'Bechtel'. A synonym dictionary with information about the synonyms of the domain terms is also maintained.

¹ The seed ontology has meta information about the domain, like ⟨ns:employee owl:type owl:person⟩, ⟨ns:employee ns:hasName ns:employee_name⟩.
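To make the triple representation concrete, the following is a minimal Python sketch of an ontology fragment and synonym dictionary; the triples, helper names, and data are illustrative assumptions, not the paper's actual data structures or RDF store.

```python
# Illustrative sketch: a domain ontology as <subject - predicate - object>
# triples plus a synonym dictionary.  All names and data are hypothetical.

DOMAIN_ONTOLOGY = {
    ("Ritesh", "project name", "Bechtel"),
    ("Ritesh", "role", "project leader"),
    ("Bechtel", "project type", "SWON"),
}

SYNONYMS = {"associate": "employee", "associates": "employee"}


def objects_of(subject, predicate):
    """All objects o such that <subject - predicate - o> is in the ontology."""
    return {o for s, p, o in DOMAIN_ONTOLOGY if s == subject and p == predicate}


def normalize(term):
    """Map a surface form to its canonical domain term, if a synonym exists."""
    return SYNONYMS.get(term.lower(), term.lower())
```

In a real deployment the triples would live in an RDF store rather than an in-memory set; the set-based lookup is only meant to show the shape of the data.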
108
Goyal S., Bhat S., Gulati S., Anantaram C.
The domain ontology and the synonym dictionary are used to identify the concepts in a user query Q posed in the domain D. The domain ontology D_O is used to further classify the concepts as predicates and objects. For a query Q, we denote the set of predicates as P_Q = {p_1, p_2, ..., p_n | ∃⟨s - p_i - o⟩ ∈ D_O, and p_i is present in the query Q}. The set of objects present in the query Q is O_Q = {o_1, o_2, ..., o_m | ∃⟨s - p - o_i⟩ ∈ D_O or o_i is a numerical/date value, and o_i is present in the query Q}.
In this work, we have considered examples from a logical subset of the Project
Management System for an organization. The database consists of tables containing data about the projects, the various costs associated with the projects,
employees and their allocations in different projects. The important tables are
named ProjectDetails, Employees, CostDetails and Allocations. Some of the attributes of these tables are:
ProjectDetails. project id, project name, project type
Employees. employee id, employee name, age, gender, joining date
CostDetails. costDetails id, project id, costing, billing, revenue
Allocations. allocation id, employee id, project id, role
For illustration, consider the query:
Example 1. What is the role of the associates who joined before 18/10/2008
in the SWON projects with costing and revenue more than $15000 and < $25000,
respectively?
In this query, the concepts identified using the domain ontology are: role,
employee name², joining date, SWON, project name, costing, and revenue. After
considering the date and the numeric values in the query, i.e., 18/10/2008, 15000,
and 25000, as objects, the predicate and object sets obtained are:
P_Q = {role, employee name, joining date, project name, costing, revenue},
O_Q = {SWON, 18/10/2008, 15000, 25000}.
In the following section we present an algorithm to bind these predicates and
objects so as to obtain constraints for the query.
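The concept-identification step described above can be sketched as follows; the predicate vocabulary and the surface-phrase map are hypothetical stand-ins for the lookups the paper performs against the ontology and synonym dictionary, and numeric/date detection is reduced to a simple regular expression.

```python
import re

# Hypothetical vocabulary; in the paper this comes from the domain
# ontology D_O together with the synonym dictionary.
PREDICATES = {"role", "employee name", "joining date", "project name",
              "costing", "revenue"}


def concept_sets(query, term_map):
    """Split the domain concepts found in a query into the predicate set P_Q
    and the object set O_Q.  term_map maps surface phrases (e.g. 'associates')
    to canonical domain terms (e.g. 'employee name')."""
    p_q, o_q = set(), set()
    for phrase, term in term_map.items():
        if phrase in query:
            (p_q if term in PREDICATES else o_q).add(term)
    # numerical and date values in the query are objects by definition
    o_q |= set(re.findall(r"\d{2}/\d{2}/\d{4}|\$?\d+", query))
    return p_q, o_q
```

Run on the query of Example 1, this yields the same P_Q as above and an O_Q containing SWON, 18/10/2008, and the two monetary values.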
5 Constraint Identification
For a successful query creation and execution, identification and formulation
of correct constraints is of utmost importance. With reference to this work,
constraint identification involves binding each ‘object’ in the query with its corresponding ‘predicate’. This predicate-object pair is referred to as ‘constraint’.
We define a constraint as c_k = (p_i, o_j), where o_j ∈ O_Q, p_i ∈ P_Q, and o_j is the value of the predicate p_i in Q. All the constraints in the query Q are identified, and C_Q = {c_1, c_2, ..., c_m} denotes the constraint set. A predicate used in any constraint is referred to as a constraint predicate. The set of constraint predicates is P^C_Q = {p_i | p_i ∈ P_Q such that ∃(p_i, o_i) ∈ C_Q}. Predicates that do not form part of the constraint set are referred to as unknown predicates. The set of unknown predicates is P^U_Q = {p_i | p_i ∈ P_Q, p_i ∉ P^C_Q}.
² The synonym dictionary has mappings defined between 'employee' and 'associate', etc.
In a natural language query, constraint identification (or predicate-object
binding) needs special attention due to the following reasons:
Unspecified predicates. For some (or all) of the objects present in the query,
the corresponding predicate might not be explicitly specified. For example consider the query ‘Give me the role of Puneet in the project having Ritesh as
project leader”. Here the objects ‘Puneet’ and ‘Ritesh’ need to be attached to
the corresponding predicate ‘employee name’, which is not specified in the query.
Constraint vs. unknown predicate. The issue of unspecified predicates becomes even more severe when a predicate pi for an object o is present in the
query, but the same predicate pi also happens to be an unknown predicate. For
example, in the query mentioned above, the value 'project leader' in O_Q is
compatible with the predicate 'role' in P_Q. But this predicate and value are not
to be bound, because the predicate 'role' is an unknown predicate whose value
needs to be determined.
Predicates followed by the respective objects. In questions with multiple
constraints, a predicate and its object may sometimes not be given consecutively.
Instead, the query may have a predicate list followed by the corresponding object
list (or vice versa). Considering the same Example 1 (Section 4), in the phrase
‘with costing and revenue more than $15000 and < $25000, respectively’, the
predicate list, i.e. [costing, revenue], is followed by the corresponding objects list,
i.e. [more than $15000, < $25000]. We need to identify and bind the appropriate
predicate-operator-object pairs from the predicate and the object lists.
In the following we describe an algorithm for predicate-object discovery and
binding.
5.1
Algorithm for Predicate-Object Binding
The main steps of the algorithm are as follows.
Step 1. Operator-Object binding for numerical/date objects. The first step towards Operator-Object binding is the identification of the comparison operators
in the query. For operator identification, the system maintains a mathematical
operator dictionary. All the occurrences of the string comparators in the question are replaced by the corresponding mathematical comparator. Also, if there
is any numeric value in the question that is not preceded by any operator, by
default ‘=’ operator is prefixed. These form the corresponding operator-object
pairs. Henceforth we refer to these operator-object pairs as objects.
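Step 1 can be sketched as below; the operator dictionary is a hypothetical, much smaller stand-in for the system's actual mathematical operator dictionary, and the default-'=' rule is implemented with a regular expression.

```python
import re

# Hypothetical operator dictionary; the paper's actual dictionary is larger
# and not published.
OPERATORS = {"more than": ">", "greater than": ">", "less than": "<",
             "at least": ">=", "at most": "<=", "before": "<", "after": ">"}


def normalize_operators(query):
    """Step 1 sketch: replace string comparators with mathematical ones and
    prefix a default '=' to numeric/date values that carry no operator."""
    for phrase, op in OPERATORS.items():
        query = query.replace(phrase, op)
    # a numeric or monetary value not preceded by an operator gets '='
    return re.sub(r"(?<![<>=])\s(\$?\d)", r" =\1", query)
```

The resulting operator-object pairs (e.g. `> $15000`, `< 18/10/2008`) are then treated as objects in the later steps.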
Step 2. Group the predicates and objects that immediately follow/precede the
POS tags of the assignment words³. POS tags of some such words are 'VBZ',
'VBP', 'IN', 'SYM', etc. We also group the predicates that are immediately followed
(or preceded) by any object. In case there is a list of predicates and a list of
objects satisfying the above, then these lists are also grouped. These groups are
the possible pairs for predicate-object binding. For instance, in Example 1, we
³ Assignment words are words that are usually specified between a predicate and its objects, e.g., a form of the copula 'be', the preposition 'as', or mathematical operators.
get the pairs (joining date : <18/10/2008), (project name : SWON), and (costing,
revenue : >$15000, <$25000).
Step 3. From the groups obtained in Step 2, we bind the predicates and objects that are of the same data type. The compatibility of a predicate and an
object is checked using the domain ontology. In the case of a predicate list and an object list,
one-to-one binding is done. For example, from the groups obtained above, we
get the following predicate-object pairs: (joining date, <18/10/2008), (costing,
>$15000), and (revenue, <$25000). The pair (project name, SWON) is not bound
because the predicate of the object ‘SWON’ as obtained from the domain ontology
is ‘project type’.
Step 4. The string objects that are not bound to any predicate in Step 3 are
bound to their compatible predicates. The compatible predicate for an object
is determined using the domain ontology. For instance, since the object ‘SWON’
is not bound to any predicate till now, it is bound to its compatible predicate
‘project type’ to obtain the constraint (project type, SWON).
The predicates bound to any object in the above steps form the constraint predicate set, and the remaining predicates constitute the unknown predicate set.
Using the above algorithm for Example 1, we obtain the constraint set, constraint predicate set, and unknown predicate set as:
– C_Q = {(joining date, <18/10/2008), (project type, SWON), (costing, >$15000), (revenue, <$25000)}.
– P^C_Q = {joining date, project type, costing, revenue}.
– P^U_Q = {role, employee name, project name}.
The constraint sets thus obtained are used to find the semantically valid
chunk set as discussed in the following section.
6 Semantically Valid Chunk Set
The semantically valid chunk set identifies the conditions on each unknown predicate
in the query, and is constituted from the constraints and unknown predicates
obtained in Section 5. For instance, in Example 1 (Section 4), 'joining date <
18/10/2008’ is a condition on the predicate ‘employee name’. Due to the syntactic
ambiguity, more than one syntactic parse might be obtained for an NL query.
Such cases may eventually result in more than one semantically viable chunk
set. Formally we define semantically viable chunk sets as follows.
Definition. A semantically viable chunk set (SVC set) of a query Q corresponding to the k-th parse is a set SVC_{Q_k} = {SC^p_{Q_k} | p ∈ P^U_Q}, where SC^p_{Q_k} is a semantic chunk. The semantic chunk of a predicate p ∈ P^U_Q is defined as:
– If C_Q ≠ {}, SC^p_{Q_k} = ⟨p, c_1, c_2, ..., c_i, ..., c_r⟩ (r ≥ 1), where c_i ∈ C_Q or c_i = SC^{p′}_{Q_k} ∈ SVC_{Q_k}, and c_i is a condition on the predicate p.
– If C_Q = {} (and P^U_Q ≠ {}), SC^p_{Q_k} = ⟨p⟩.
Such that SVC_{Q_k} satisfies the following:
a. ∀p ∈ P^U_Q, ∃SC^p_{Q_k} ∈ SVC_{Q_k}.
b. ∀c′ ∈ C_Q, ∃SC^p_{Q_k} ∈ SVC_{Q_k} such that SC^p_{Q_k} = ⟨p, c_1, c_2, ..., c′, ..., c_r⟩.
The condition ‘a’ states that there is a semantic chunk for each unknown
predicates in the query. The condition ‘b’ states that each constraint in the
query is used in at least one semantic chunk.
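The two conditions can be checked mechanically. Below is a sketch under the assumption that a semantic chunk is modelled as a tuple whose head is the unknown predicate and whose tail holds conditions, each either a constraint pair or a nested chunk; this representation is ours, not the paper's.

```python
# Sketch of conditions (a) and (b) on an SVC set.  A semantic chunk is a
# tuple (predicate, condition, ...); a condition is a (predicate, value)
# constraint pair or a nested semantic chunk.

def is_svc_set(chunks, unknown_predicates, constraints):
    """(a) every unknown predicate heads a chunk; (b) every constraint
    appears in at least one chunk, possibly inside a nested chunk."""
    heads = {chunk[0] for chunk in chunks}
    used = set()

    def collect(chunk):
        for cond in chunk[1:]:
            if cond and cond[0] in heads:   # nested semantic chunk
                collect(cond)
            else:                           # plain constraint pair
                used.add(cond)

    for chunk in chunks:
        collect(chunk)
    return set(unknown_predicates) <= heads and set(constraints) <= used
```

For the SVC sets of Example 1 the check succeeds, while dropping the 'employee name' chunk violates condition (a).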
Definition. For a query Q, the semantically viable chunk set which is semantically valid as per the domain ontology is the semantically valid chunk set, SVaC_Q. These sets are referred to as SVaC sets.
We can classify the queries posed for a natural language interface to business
application in the following categories:
1. The unknown predicate is specified as a noun in the question. E.g., ‘What is
the role of Ritesh in AB Corp?’, ‘Give me the project of Ritesh’. In this
case the syntactic modifiers of the predicate (noun) determine the constraints
on the predicate.
2. A wh-word (‘who/when/where’) refers to the unknown predicate. E.g., ‘Who is
the project leader of Bechtel?’, ‘When did Ritesh join Bechtel?’
3. The unknown predicate may have a wh-word (‘what/which/whose/how much/how
many’) as a determiner. In such questions the wh-word is placed before the
unknown predicate, and these words ask which thing or person is being referred to. For example, ‘In which project is Ritesh allocated?’, ‘How many
employees are allocated to Bechtel?’
4. Why/how questions. These questions expect a descriptive answer. Since our
system aims to handle only factual questions, such questions are out of scope.
5. Yes/No questions. These questions expect ‘yes’ or ‘no’ as answer. Due to
space limitations, such questions are kept out of scope of the current work.
We use syntactic information of the question to obtain the semantically viable
chunk sets as discussed in the following section.
6.1 Finding Semantically Viable Chunk Sets
For a query, the main task for identification of semantically viable chunk sets is
to identify the conditions for all the unknown predicates. We exploit syntactic
information of the query for this purpose. We use a dependency-based parser
(e.g. Stanford Parser [9], Link Parser [10]) to obtain the syntactic structure of the
question. These parsers provide the phrase structure as well as the dependencies
between different words of a given sentence. In case of syntactic ambiguity, these
parsers provide all possible interpretations of the input sentence. In the following,
we discuss the process of identifying the appropriate semantic chunks for different
categories of queries.
Unknown Predicate as Noun. If an unknown predicate in the query plays
the role of noun, its syntactic modifiers identify the constraints on the predicate. Dependency based parsers provide dependencies between noun and its
modifiers. This information along with the phrase structure of the query is
used to determine the phrase modifying the unknown predicate. These phrases
give the constraints for the unknown predicate. The unknown predicate with its
constraint is a candidate semantic chunk. For example, for the question ‘Give
me the role of the associates with age > 30 years?’, the preposition phrase
‘with age > 30 years’ is a post-nominal modifier of the noun ‘associates’. The
constraint corresponding to this preposition phrase is ‘age > 30’, and hence the
corresponding semantic chunk can be obtained as SC^{employee name}_Q = ⟨employee name, age > 30⟩.
Further, the preposition phrase 'of the associates with age > 30 years' is modifying the noun 'role'. Since the semantic chunk SC^{employee name}_Q for the phrase
'of the associates with age > 30 years' has already been identified, the semantic chunk for the predicate 'role' is SC^{role}_Q = ⟨role, SC^{employee name}_Q⟩.
Unknown Predicate as wh-word. In a domain ‘who’ usually refers to a person,
such as ‘employee name’, ‘student name’; ‘when’ refers to date/time attributes like
‘joining date’, ‘completion time’; and ‘where’ refers to locations like ‘address’,
‘city’. For the given business application, this information about the wh-words
is identified, and stored in the seed ontology. In questions involving any of these
wh-word, the predicate corresponding to the wh-word is found using the domain
ontology, which might be a possible candidate for being a unknown predicate.
If the wh-word in the question is compatible to more than one predicate in
the domain, then more semantic chunks - corresponding to each compatible
predicate - are obtained. Semantic information is used in such cases to resolve
the ambiguity regarding the most appropriate predicate (See Section 6.2). The
constraints of the wh-word are determined on the basis of the role of the wh-word
in the question as discussed below.
– If the wh-word is the subject in the question, the corresponding verb phrase determines the constraint on the wh-word. For example, consider the query ‘Who is the
project leader of Bechtel?’ In this question, the verb phrase ‘is the project
leader of Bechtel’ gives the constraints of ‘who’. The constraints embedded in
this verb phrase are ‘role=project leader’ and ‘project name=Bechtel’. Since in
our domain ‘who’ corresponds only to the predicate ‘employee name’, the semantic
chunk is ⟨employee name, role = project leader, project name = Bechtel⟩.
– In other cases, the words in the phrase enclosing the wh-word determines the constraints on the wh-word. For example, for the question ‘When did Ritesh join
Bechtel?’, the chunk structure as given by a NL parser is ‘[S When did [NP
Ritesh NP] [VP join [NP Bechtel NP] VP] S]’. In our domain ‘when’ corresponds
to ‘joining date’. The constraints in this question are ‘employee name=Ritesh’
and ‘project name=Bechtel’. Thus, the semantic chunk obtained is hjoining date,
employee name = Ritesh, project name = Bechteli.
Wh-word as the Determiner of the Unknown Predicate. In this case also,
the constraints are determined as discussed above. For example, consider
the question 'In which project is Ritesh allocated?'. The constraint for the
unknown predicate 'project name' can be identified as 'employee name = Ritesh'.
Thus the semantic chunk is ⟨project name, employee name = Ritesh⟩.
Using the syntactic information as discussed above, all possible semantic
chunks for a parse structure of the question are determined. The set of these
chunks is a semantically viable chunk set only if the chunk set satisfies the
conditions (a) and (b) specified in the definition of SVC sets. For instance, for
the query in Example 1, two SVC sets are obtained as given in Figure 2.
SVC_{Q_1} = {SC^{project name}_{Q_1}, SC^{employee name}_{Q_1}, SC^{role}_{Q_1}}, where:
– SC^{project name}_{Q_1} = (project name, project type = SWON, costing > $15000, revenue < $25000);
– SC^{employee name}_{Q_1} = (employee name, joining date < 18/10/2008);
– SC^{role}_{Q_1} = (role, SC^{project name}_{Q_1}, SC^{employee name}_{Q_1}).
And,
SVC_{Q_2} = {SC^{project name}_{Q_2}, SC^{employee name}_{Q_2}, SC^{role}_{Q_2}}, where:
– SC^{project name}_{Q_2} = (project name, project type = SWON);
– SC^{employee name}_{Q_2} = (employee name, joining date < 18/10/2008, costing > $15000, revenue < $25000);
– SC^{role}_{Q_2} = (role, SC^{project name}_{Q_2}, SC^{employee name}_{Q_2}).

Fig. 2. SVC Sets for Example 1
If for a query Q, only one semantically viable chunk set is found then this
chunk set is the semantically valid chunk set. In other cases, the semantically
valid chunk set is found by using the domain specific semantic information as
discussed in the following section.
6.2 Finding Semantically Valid Chunk Sets
If more than one semantically viable chunk set is obtained for a question,
semantic information obtained from the domain ontology is used to determine
the semantically valid chunk set. Let SVC_{Q_1} = {SC^p_{Q_1} | p ∈ P^U_Q} and SVC_{Q_2} = {SC^p_{Q_2} | p ∈ P^U_Q} be any two SVC sets for a query Q. Since there is more than one SVC set for Q, there exist p_i, p_j ∈ P^U_Q and c′ = (p′, v′) ∈ C_Q such that c′ is a constituent of SC^{p_i}_{Q_1} ∈ SVC_{Q_1} and of SC^{p_j}_{Q_2} ∈ SVC_{Q_2}. But in the valid interpretation of Q, c′ can specify either the unknown predicate p_i or the unknown predicate p_j. Hence we conclude that, in this case, the syntactic information is not sufficient to resolve the ambiguity of whether c′ is a constraint of p_i or of p_j.
For instance, consider the SVC sets of the query in Example 1 (Figure 2). In
the two SVC sets obtained for this query, the constraints 'costing > $15000' and
'revenue < $25000' are bound to 'project name' in SVC_{Q_1}, and to 'employee name'
in SVC_{Q_2}.
To resolve such ambiguities, we use the depth between the concerned predicates.
The number of tables that must be traversed⁴ in order to find a relationship
between any two predicates is determined through the domain ontology; this is
referred to as the depth between the two predicates. If there exists more than
one path for a pair of predicates, then we choose the one with the minimum
depth. It is observed that the semantic chunk in which the unknown predicate
and constraint predicate pair has the smaller depth is more likely
to be the correct one. We use the domain ontology to find the depth between two
predicates as described below.

⁴ This is found using the primary and foreign key information of the tables.
Step 1. Breadth-first search (BFS): The system does a BFS on the tables in the
ontology to determine whether p_i or p_j belongs to the same table as p′. Without
loss of generality, assume that p_i and p′ belong to the same table, and p_j does not
belong to the table of p′. In this case, SC^{p_i}_{Q_1}, and consequently SVC_{Q_1}, is assumed
to be correct, and SVC_{Q_2} is rejected. Thus, in this case, SVaC_Q = SVC_{Q_1}.
Step 2. Depth-first search (DFS): We invoke the DFS method to resolve the ambiguity
regarding the constraint c′ if BFS is not able to do so. The depth of the path from
p′ to p_i and to p_j is found using the domain ontology. The constraint c′ is attached
to the predicate whose distance from p′ is minimum, and the corresponding
SVC set is the semantically valid chunk set.
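The depth measure can be sketched as a shortest-path search over the table graph of the schema in Section 4. The key links below mirror the obvious project_id/employee_id relationships, but the exact key graph is our assumption.

```python
from collections import deque

# Sketch of the depth measure over the Project Management schema.
# The table links (assumed) follow the primary/foreign-key columns.

TABLE_OF = {"role": "Allocations", "employee name": "Employees",
            "joining date": "Employees", "project name": "ProjectDetails",
            "costing": "CostDetails", "revenue": "CostDetails"}
LINKS = {"ProjectDetails": {"CostDetails", "Allocations"},
         "CostDetails": {"ProjectDetails"},
         "Employees": {"Allocations"},
         "Allocations": {"ProjectDetails", "Employees"}}


def depth(p1, p2):
    """Minimum number of table traversals between the tables of p1 and p2."""
    start, goal = TABLE_OF[p1], TABLE_OF[p2]
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        table, d = queue.popleft()
        if table == goal:
            return d
        for nxt in LINKS.get(table, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # no path between the tables
```

Under this graph, 'costing' is closer to 'project name' than to 'employee name', matching the disambiguation of Example 1 described below.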
In Example 1 (Section 4), the constraint predicates 'revenue' and 'costing'
are found to be closer to the predicate 'project name' than to the predicate
'employee name'. Hence, the semantically viable chunk set SVC_{Q_1} is the semantically valid chunk set.
An advantage of this approach is that the system performs a deeper analysis only as the question's complexity demands. The domain ontology is used only if a
question cannot be resolved using just the syntactic information. If the domain
information is also not sufficient for question interpretation, then answers for all
interpretations are found, and the user is asked to choose the correct answer.
7 Formal Query Formation
The semantic chunks of the SVaC set are processed by the QueryManager. In
this module, a formal query is generated on-the-fly from the semantic chunks to
extract the answer of the user’s question. Since the domain ontology is in RDF
format, we generate queries in SPARQL5 which is a query language for RDF.
For a semantically valid chunk set, we start with formulating SPARQL queries
for the semantic chunks which do not contain any sub-chunk. The unknown predicate of the semantic chunk forms the ‘SELECT’ clause, and the constraints form
a part of the ‘WHERE’ clause.
For instance, from the SVaC set SVaC_Q (i.e., SVC_Q1 in Figure 2) of Example 1, we first formulate SPARQL queries for the semantic chunks SC_Q1^{project name}
and SC_Q1^{employee name}, and obtain the answers from the RDF. Assume that the
answers to these chunks are obtained as “[Quantas, Bechtel, NYK]” and “[Rajat,
Nidhi]”. The answers obtained from the independent semantic chunks are then substituted into the semantic chunks involving nested sub-chunks. For example, to
formulate the query for the semantic chunk SC_Q1^{role}, the answers of the semantic
chunks SC_Q1^{project name} and SC_Q1^{employee name} are used.

^5 http://dev.w3.org/cvsweb/2004/PythonLib-IH/Doc/sparqlDesc.html?rev=1.11

Ontology-Driven Approach to Obtain Semantically Valid Chunks...

Therefore, upon substituting these answers, the semantic chunk SC_Q1^{role} is modified as:

SC_Q1^{role} = {role, project name = Quantas, project name = Bechtel, project name = NYK, employee name = Rajat, employee name = Nidhi}.

The SPARQL query for this chunk is then generated, and the answer to the user query is retrieved.
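The SELECT/WHERE construction described above can be illustrated with a small string-building sketch. The chunk representation, the `?record` variable, the `ex:` namespace and the FILTER formulation are our assumptions for illustration, not the paper's QueryManager.

```python
def chunk_to_sparql(chunk, prefix="ex"):
    """Build a SPARQL query string from a semantic chunk: the unknown
    predicate forms the SELECT clause, the (predicate, value) constraints
    populate the WHERE clause.  All identifiers are illustrative."""
    unknown = chunk["unknown"]
    var = "?" + unknown.replace(" ", "_")
    lines = [f"SELECT {var} WHERE {{"]
    subject = "?record"
    lines.append(f"  {subject} {prefix}:{unknown.replace(' ', '_')} {var} .")
    # A constraint predicate may repeat (e.g. several project names after
    # substitution); keep alternatives for one predicate in a single FILTER.
    by_pred = {}
    for pred, value in chunk["constraints"]:
        by_pred.setdefault(pred, []).append(value)
    for i, (pred, values) in enumerate(by_pred.items()):
        cvar = f"?c{i}"
        lines.append(f"  {subject} {prefix}:{pred.replace(' ', '_')} {cvar} .")
        alts = " || ".join(f'{cvar} = "{v}"' for v in values)
        lines.append(f"  FILTER ({alts})")
    lines.append("}")
    return "\n".join(lines)


# The substituted 'role' chunk from Example 1.
role_chunk = {
    "unknown": "role",
    "constraints": [("project name", "Quantas"),
                    ("project name", "Bechtel"),
                    ("project name", "NYK"),
                    ("employee name", "Rajat"),
                    ("employee name", "Nidhi")],
}
query = chunk_to_sparql(role_chunk)
```

Running this on the substituted chunk yields a query selecting `?role` with one triple pattern and FILTER per constraint predicate.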
8 Experimental Results
To test the approach discussed above, we carried out experiments on various
domains (project management, retail, asset management). Users were asked to
pose queries to an existing question answering system [8]. A set of almost 2750
questions was posed to the system, out of which approximately 35% consisted
of fairly simple questions (no chunking required, but predicate-value binding required), e.g. “List all projects with costing less than $30000”. The remaining 65% of the questions required correct predicate-value binding as well as correct
chunking, e.g. “list the employees in the projects having costing less than
$30000, started on 2009-10-10 with Ritesh as group leader”. When the system’s output was compared with the actual answers, the following observations were made:
– 791 out of 946 simple questions were answered correctly, which amounts to approximately 83.6%.
– For complex queries, 431 out of 1804 were correctly answered, which accounts for approximately 23.9%.
The users’ questions were then tested on the new system discussed in this
work. We have used the Link Parser [10] for the syntactic analysis of the queries. The
link parser yields the syntactic relations between the different words of the sentence
in the form of labeled links. In case of ambiguity in the input sentence, the link
parser yields syntactic structures corresponding to all possible interpretations.
The following observations were made:
– 863 out of 946 simple questions were answered correctly, which amounts to approximately 91.2%.
– 1508 out of 1804 complex questions were answered correctly, which accounts for approximately 83.6%.
Comparing the results of the two approaches, a direct increase of about 7%
was attained for simple questions, and for complex queries the accuracy went up by almost
58%. The increase in the correctness of the answers was due to the following:
– Due to the predicate-value binding, correct bindings were obtained (even for
simple questions), which increased the number of correctly answered questions.
– The chunks obtained from the link parser, when furnished with domain knowledge
in the form of constraints, helped in achieving more correct results.
Approximately 13% of the questions were not answered at all, were answered incorrectly, or in some cases received only partially correct answers.
After analyzing why the answers to those queries were not obtained, the
following conclusions were drawn:
– The syntactic analysis by the parser was not correct.
– The question posed by the user was grammatically incorrect.
– The system was not able to capture the semantic knowledge. E.g., for the queries
‘Who is reporting to Ritesh?’ and ‘Who is Ritesh reporting to?’, the system could not fetch correct answers.
We are currently working towards handling these issues.
Goyal S., Bhat S., Gulati S., Anantaram C.

9 Conclusion
We have described an approach to obtain a semantically valid chunk set for NL-enabled business applications. For any question posed by a user to a business
application system in natural language, the system should be robust enough
to analyze, understand and comprehend the question and come up with the
appropriate answer. This requires correct parsing, chunking, constraint formulation and sub-query generation. Although most general-purpose parsers parse
the query correctly, domain-relevant chunks are not obtained due to the lack of domain knowledge. Therefore, in our work we have concentrated on enriching general-purpose parsers with domain knowledge using a domain ontology in the form of
RDF. We have handled constraint formulation and sub-query generation, which
form the backbone of any robust NL system. Tackling all these issues makes a
natural language-enabled business application system more robust, and enables
it to handle even complex queries easily, efficiently and effectively.
References
1. Lopez, V., Motta, E., Uren, V., Sabou, M.: State of the art on semantic question
answering - a literature review. Technical report, KMI (May 2007)
2. Androutsopoulos, I., Ritchie, G., Thanisch, P.: Natural language interfaces to
databases - an introduction. Natural Language Engineering 1(1) (1995) 29–81
3. Antoniou, G., van Harmelen, F.: A Semantic Web Primer. The MIT Press (2004)
4. Popescu, A.M., Armanasu, A., Etzioni, O., Ko, D., Yates, A.: Modern natural language interfaces to databases: Composing statistical parsing with semantic tractability. In: 20th international conference on Computational Linguistics,
Geneva, Switzerland (2004)
5. Charniak, E.: A maximum-entropy-inspired parser. In: NAACL. (2000)
6. Katz, B., Borchardt, G., Felshin, S.: Syntactic and semantic decomposition
strategies for question answering from multiple resources. In: AAAI 2005 Workshop on Inference for Textual Question Answering, Pittsburgh, PA (July 2005)
35–41
7. Lopez, V., Motta, E., Uren, V.: Aqualog: an ontology-driven question answering
system to interface the semantic web. In: NAACL on Human Language Technology,
New York (2006) 269 – 272
8. Bhat, S., Anantaram, C., Jain, H.K.: A framework for intelligent conversational
email interface to business applications. In: ICCIT, Korea (2007)
9. Klein, D., Manning, C.D.: Fast exact inference with a factored model for natural language parsing. In: Advances in Neural Information Processing Systems.
Volume 15., Cambridge, MA, MIT Press (2003) 3–10
10. Grinberg, D., Lafferty, J., Sleator, D.: A robust parsing algorithm for link
grammars. In: Fourth International Workshop on Parsing Technologies, Prague
(September 1995)
Opinion, Emotions, Textual Entailment

Word Sense Disambiguation in Opinion Mining: Pros and Cons
Tamara Martín-Wanton^1, Alexandra Balahur-Dobrescu^2, Andrés Montoyo-Guijarro^2 and Aurora Pons-Porrata^1

^1 Universidad de Oriente, Center for Pattern Recognition and Data Mining,
Patricio Lumumba s/n, Santiago de Cuba, Cuba
{tamara, aurora}@cerpamid.co.cu
^2 University of Alicante, Department of Software and Computing Systems,
Apartado de Correos 99, E-03080 Alicante, Spain
{abalahur, montoyo}@dlsi.ua.es
Abstract. The past years have marked the birth of a new type of society, that of interaction and subjective communication, using the mechanisms of the Social Web. As a response to the growth in subjective
information, a new task was defined: opinion mining, dealing with its automatic treatment. Like the majority of natural language processing tasks,
opinion mining is faced with the issue of language ambiguity, as different
senses of the same word may have different polarities. This article studies the influence of applying word sense disambiguation (WSD) within
the task of opinion mining, evaluating the advantages and disadvantages
of the approach. We evaluate the WSD-based method on a corpus of
newspaper quotations and compare it to the results of an opinion mining system without WSD. Finally, we discuss our findings and show
how WSD helps in the task of opinion mining.
1 Introduction
The past years, with the growing volume of subjective data originating from
texts pertaining to the Social Web (blogs, reviews, forums, discussion panels), have marked the birth of a new type of society, where individuals can freely
communicate and exchange opinions. The large benefits that can be obtained from
the analysis of this data (more informed customers, companies, societies) have made
essential the study of methods that can be employed to automatically extract
the required information.
Therefore, over the past few years, there has been a large increase of interest in the identification and automatic extraction of the attitudes, opinions
and feelings expressed in texts. This movement is driven by the need to provide tools for users of different domains, who require, for different reasons, the
automatic monitoring of information that expresses opinion. A system that automatically carries out this task would eliminate the effort of manually extracting
useful knowledge from the information available on the Web.
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 119-129
Received 30/11/09
Accepted 16/01/10
Final version 08/03/10
Martín T., Balahur A., Montoyo A., Pons A.
Opinion Mining (also known as sentiment classification or subjectivity analysis) covers a wide area of Natural Language Processing, Computational Linguistics and Text Mining. The goal is not to determine what the topic of a document
is, but the opinion that is expressed in it. Therefore, its objective is to determine
the opinion of a speaker or a writer on a topic [1].
Many approaches to sentiment analysis rely on lexicons of words that may
be used to express subjectivity; these works do not make a distinction between
different senses of a word, so that the term, and not its senses, is classified.
Moreover, most subjectivity lexicons are compiled as lists of keywords, rather
than word meanings. However, many keywords have both subjective and objective senses, and even the purely subjective senses have a degree of positivity or
negativity, depending on the context where the corresponding word appears.
This paper presents research on the advantages of using word sense
disambiguation to determine the polarity of opinions, and on the role of existing
resources. To this end, we evaluated two unsupervised approaches, one based
on lists of positive and negative words and the other based on a word sense
disambiguation algorithm over the same affect lexicons.
2 Related Work
A major task of Opinion Mining consists in classifying the polarity of the extracted opinion. This process determines whether the opinion is positive, negative or neutral with respect to the entity to which it refers (for example, a
person, a product, a topic, a movie, etc.).
The large majority of research in this field has focused on annotating the sentiment of opinions out of context (e.g. [2–5]). Recently, some works have
been published that determine the polarity of word senses, thus building resources that can be useful in different tasks of Opinion Mining ([1, 6, 7]). Esuli
and Sebastiani [1] determine the polarity of word senses in WordNet, distinguishing among positive, negative and objective. They manually annotate a seed set of
positive/negative senses in WordNet and, by following the relations in WordNet,
expand the small set using a supervised approach. They extend their work [6]
by applying the PageRank algorithm for ranking the WordNet senses in terms
of how strongly a sense possesses a given semantic property (e.g., positive or
negative). Wiebe and Mihalcea [7] label word senses in WordNet as subjective
or objective. They use a method relying on distributional similarity as well as
an independent, large, manually annotated opinion corpus (MPQA) [8] for determining subjectivity.
Only a few recent works take into account the correct senses of the words in
the opinions (e.g. [9, 10]), but they are supervised methods. Akkaya et al. [9]
build and evaluate a supervised disambiguation system that determines whether
a word in a given context is being used with an objective or a subjective sense.
In this approach, the inventory of objective and subjective senses of a word
can be viewed as an inventory of the senses of the word, but with a coarser
granularity. Rentoumi et al. [10] go a step further and determine the polarity
by disambiguating the words and then mapping the senses to models of positive
and negative polarity. To compute these models and produce the mappings of
senses, they adopt a graph-based method which takes into account contextual
and sub-word information.
3 Motivation and Contribution
Whereas most research concentrates on the analysis of opinions at the word level,
there are some approaches dealing with the analysis at the sense level. The reason is that the meaning of most words depends on the context where
they appear, and in order to determine the polarity of opinions, it is also important to take into account the meanings of the words and the relations between
them. The recent studies that regard word senses concentrate on the detection
of subjectivity or on ranking senses in a lexicon according to their polarity, but
they do not have the classification of polarity as their main aim. On the other
hand, most research concentrates on assessing the polarity of opinion using one
of the available lexicons of opinion. However, using one resource or another does not by itself
give a measure of the impact the use of these resources has on the final system
results.
The main motivation of the present research is to study the impact of word
sense disambiguation on determining the polarity of opinions. There are approaches that perform the analysis at the sense level, but they lack a thorough
study of the advantages and disadvantages of disambiguation as an intermediate
task. They do not analyse the approaches at the word and sense
levels on the same corpus and using the same resources.
Thus, the contribution of this paper is the evaluation of two unsupervised approaches on the same corpus and using the same resources; both are
knowledge-based but differ in the level at which the analysis is performed
(word versus sense level). The second objective is to carry out an analysis of existing public resources for opinion mining and their influence on both approaches.
In this way, we can have a clear measure of the impact given by the use of each
of the resources separately, as well as study methods to combine them to obtain
better results.
We show that word sense disambiguation avoids the imbalance between
the classification of positive and negative opinions that is present in an approach as
simple as counting the positive or negative words belonging to an opinion. The majority of
the existing resources that annotate senses with their corresponding polarity
have little coverage.
4 Experiments and Evaluation
In the experiments, we intend to evaluate the impact of the disambiguation of
words on the task of polarity classification of an opinion. With this aim, we first
present a “bag of words” approach, and then a second one that uses a word
sense disambiguation algorithm to determine the correct sense of the words in
the opinion. In both approaches, we first perform a pre-processing of the text,
including sentence recognition, stop-word removal, part-of-speech tagging and
word stemming, using the TreeTagger tool [11].
We comparatively analyse the different possible methods and resources for
opinion mining that are publicly available and explore the possibility to combine
them in order to increase the accuracy of the classification. For the evaluation
of both methods we use precision, recall and F1 measures for the Positive (P+,
R+, F1+) and Negative (P-, R-, F1-) categories, and the overall precision, recall
and F1 (P, R, F1). Additionally, the coverage (Cov) of the method is calculated
as the ratio of the number of opinions classified as negative or positive over the
total number of opinions.
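The evaluation measures above can be sketched as follows. This is a minimal implementation assuming a "neutral" prediction for quotes the method leaves unclassified; since the text does not spell out how the overall P/R/F1 are aggregated, only per-class values and coverage are computed here (the function name and data layout are ours).

```python
def evaluate(gold, predicted):
    """Per-class precision/recall/F1 plus coverage, following the measures
    named in the text.  'predicted' may contain 'neutral' for unclassified
    quotes; coverage is the fraction classified as positive or negative."""
    scores = {}
    for cls in ("positive", "negative"):
        tp = sum(1 for g, p in zip(gold, predicted) if g == p == cls)
        pred_cls = sum(1 for p in predicted if p == cls)
        gold_cls = sum(1 for g in gold if g == cls)
        prec = tp / pred_cls if pred_cls else 0.0
        rec = tp / gold_cls if gold_cls else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[cls] = (prec, rec, f1)
    coverage = sum(1 for p in predicted if p != "neutral") / len(gold)
    return scores, coverage


# Tiny illustrative run on four quotes.
gold = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "negative", "neutral", "positive"]
scores, cov = evaluate(gold, predicted)
```

Here one positive quote is left neutral and one negative quote is misclassified, giving a coverage of 0.75.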
4.1 Data and Resources
For our experiments, we chose a set of 99 quotes described in [12], on which
agreement between a minimum of two annotators could be reached regarding
their classification into the positive and negative categories, as well as their being
neutral/controversial or improperly extracted. In this paper, we only use the 68
quotes classified as positive (35) or negative (33). The explanation for employing
this dataset is that reported speech (a person referring to another person or
event) represents a direct and unbiased expression of opinions that, in the
majority of cases, does not depend on the interpretation of the reader.
At present, there are some lexicons annotated with affect and polarity at the sense level that are based on WordNet [13]: WordNet-Affect [14], SentiWordNet [1] and Micro-WNOp [15]. WordNet-Affect, an extension of WordNet
Domains, is a hierarchy of affective domain labels, developed by selecting suitable synsets from WordNet that represent affective concepts and
dividing them into subsets of affective data. In SentiWordNet, each synset in
WordNet is assigned three sentiment values, positive, negative and objective,
whose sum is 1. For example, the synset HAPPY#3 (marked by good fortune; “a
felicitous life”; “a happy outcome”) is annotated as Positive = 0.875, Negative =
0.0 and Objective = 0.125. This resource was created through a mix of linguistic
techniques and statistical classifiers. It was built semi-automatically, so not all the
results were manually validated and some resulting classifications may be incorrect. Finally, the Micro-WNOp corpus is composed of 1105 WordNet
synsets manually annotated in a manner similar to SentiWordNet.
4.2 Word-based Method Applied to Polarity Classification
For the first approach, as in [12], each of the employed resources was mapped to
four categories, which were given different scores: positive (1), high positive (4),
negative (-1) and high negative (-4). On the one hand, the approach is motivated
by the same mapping done in [12], and, on the other, by the wish to maintain
the “idea” of the lexicons employed: that the same term may have different
strengths of polarity, and that two different terms, even if they have the same
polarity, may differ in intensity.
Word Sense Disambiguation in Opinion Mining: Pros and Cons
123
The words belonging to the WordNet-Affect categories of anger and disgust
were grouped, as in [12], under high negative; fear and sadness were considered
negative; joy was taken as containing positive words and surprise as highly positive. SentiWordNet and Micro-WNOp contain positive and negative scores between 0 and 1, and in their case the mapping was done in the following manner:
words that have senses with positive scores lower than or equal to 0.5 were assigned to the
positive category, those with positive scores higher than 0.5 to the high positive set, those with negative
scores lower than or equal to 0.5 to the negative category, and those with negative scores higher
than 0.5 to the high negative set. See Table 1 for the statistics of the categories
built from each resource. The last row corresponds to the union of the categories
for all resources.
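The score-to-category mapping can be sketched as below. Treating a score of exactly 0 as belonging to no category is our assumption (the text does not say), and the function name and the senses' (positive, negative) score pairs are illustrative.

```python
def categorize(word_senses):
    """Assign a word to the four polarity categories used by the word-based
    method, from the (positive, negative) scores of its senses.  A word can
    land in several categories, because different senses may carry different
    scores and no disambiguation is done at this stage."""
    cats = set()
    for pos, neg in word_senses:
        if 0 < pos <= 0.5:
            cats.add("positive")
        elif pos > 0.5:
            cats.add("high positive")
        if 0 < neg <= 0.5:
            cats.add("negative")
        elif neg > 0.5:
            cats.add("high negative")
    return cats


# A word with one strongly positive sense (like HAPPY#3) and one mixed sense
# ends up in three categories at once.
cats = categorize([(0.875, 0.0), (0.25, 0.25)])
```

This also shows why the word-to-sense relationship is lost: the category set no longer records which sense contributed which label.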
Table 1. Statistics of the categories used by the word-based method.

Resource | Positive | Negative | High Positive | High Negative
WN-Affect | 192 | 215 | 73 | 201
Micro-WNOp | 436 | 396 | 409 | 457
SentiWN | 23133 | 22144 | 2462 | 5279
SentiWN+Micro-WNOp+WN-Affect | 23394 | 22442 | 2804 | 5713
Finally, the polarity value of each of the quotes was computed as the sum of the
values of the words identified; a positive score leads to the classification of the
quote as positive, whereas a final negative score leads to the system classifying
the quote as negative. A quote is classified as neutral if the score is equal to
0. Note that no word sense disambiguation is done in this method; rather, a
word is incorporated into a category depending on the annotation of its senses.
Thus, the same word can be included in several categories, and the word-to-sense
relationship is lost. The results of this approach are shown in Table 2.
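Putting the category scores together, the word-based decision rule reads roughly as follows. Letting a word that falls in several categories contribute each of its category scores is our assumption, since the text does not spell out this detail; the lexicon layout is illustrative.

```python
CATEGORY_SCORES = {"positive": 1, "high positive": 4,
                   "negative": -1, "high negative": -4}


def classify_quote(words, lexicon):
    """Word-based (no WSD) classification: sum the category scores of every
    lexicon word found in the quote; the sign of the total decides the class.
    'lexicon' maps a word to the set of categories it was assigned to."""
    total = 0
    for w in words:
        for cat in lexicon.get(w, ()):
            total += CATEGORY_SCORES[cat]
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Toy lexicon: 'crisis' sits in two negative categories at once.
lexicon = {"happy": {"high positive"},
           "crisis": {"negative", "high negative"}}
label = classify_quote(["a", "happy", "crisis"], lexicon)
```

In the toy run, 'happy' contributes +4 and 'crisis' contributes -1 and -4, so the quote comes out negative.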
Table 2. Classification results of the method without WSD.

Resources | P+ | P- | R+ | R- | F1+ | F1- | P | R | F1 | Cov
WN-Affect | 0.75 | 0.60 | 0.08 | 0.09 | 0.15 | 0.16 | 0.67 | 0.09 | 0.15 | 0.13
Micro-WNOp | 0.57 | 0.67 | 0.57 | 0.18 | 0.57 | 0.28 | 0.59 | 0.38 | 0.46 | 0.65
SentiWN | 0.55 | 0.46 | 0.48 | 0.36 | 0.51 | 0.41 | 0.51 | 0.43 | 0.46 | 0.84
SentiWN+WN-Affect | 0.55 | 0.46 | 0.48 | 0.36 | 0.51 | 0.41 | 0.51 | 0.43 | 0.46 | 0.84
SentiWN+Micro-WNOp | 0.53 | 0.48 | 0.54 | 0.33 | 0.54 | 0.39 | 0.51 | 0.44 | 0.47 | 0.87
All | 0.53 | 0.45 | 0.54 | 0.33 | 0.53 | 0.38 | 0.50 | 0.44 | 0.46 | 0.88
4.3 WSD Method Applied to Polarity Classification
This approach, based on word sense disambiguation to determine the polarity
of opinions, was previously presented in [16], where it was evaluated on the
SemEval Task No. 14: Affective Text data, outperforming the results obtained
by both the unsupervised and the supervised systems participating in the competition.
Word Sense Disambiguation (WSD) is an intermediate task of Natural Language Processing. It consists in selecting the appropriate meaning of a word
given the context in which it occurs [17].
The approach is based on the assumption that the same word, in different
contexts, may not have the same polarity. For example, the word ”drug” can
be positive, negative or objective, depending on the context where it appears
(e.g., ”she takes drugs for her heart” (objective), ”to be on drugs” (negative)).
Bearing in mind this need to appropriately identify the correct sense, we use a
word sense disambiguation algorithm to obtain the correct sense of the words
in the opinion, and subsequently obtain the polarity of the senses from resources
based on senses annotated with valence and emotions. The WSD-based method
also handles negations and other polarity shifters obtained from the General
Inquirer dictionary.
For the disambiguation of the words, we use the method proposed in [18],
which relies on clustering as a way of identifying semantically related word senses.
In this WSD method, the senses are represented as signatures built from the
repository of concepts of WordNet. The disambiguation process starts from a
clustering distribution of all possible senses of the ambiguous words, obtained by applying
the Extended Star clustering algorithm [19]. Such a clustering tries to identify
cohesive groups of word senses, which are assumed to represent different meanings for the set of words. Subsequently, the clusters that best match the context are
selected. If the selected clusters disambiguate all words, the process stops and
the senses belonging to the selected clusters are interpreted as the disambiguated ones. Otherwise, the clustering process is performed again (over the
remaining senses), until a complete disambiguation is achieved.
Once the correct sense of each word in the opinion is obtained, the method
determines its polarity according to the sentiment annotation of this sense in the
lexical resource utilized. From SentiWordNet and Micro-WNOp we obtain a
positive and a negative value for the target sense (in Micro-WNOp only a part
of the synsets are annotated with polarity, so the senses that are not
annotated are considered to be completely objective). In the case of WordNet-Affect, which is annotated with emotions and not with polarity values as such,
we built a mapping: the senses pertaining to the hierarchy of positive (negative)
affective domain labels were assigned a positive value of 1 (0) and a negative
value of 0 (1), respectively.
Finally, the polarity of the opinion is determined from the scores of the positive
and negative words it contains. To sum up, for each word w and its correct sense
s, the positive (P(w)) and negative (N(w)) scores are calculated as:

P(w) = positive value of s in the lexical resource
N(w) = negative value of s in the lexical resource

Finally, the global positive and negative scores (Sp, Sn) are calculated as:

Sp = Σ_{w : P(w) > N(w)} P(w)
Sn = Σ_{w : N(w) > P(w)} N(w)
If Sp is greater than Sn then the opinion is considered as positive. On the
contrary, if Sp is less than Sn the opinion is negative. Finally, if Sp is equal to
Sn the opinion is considered as neutral. In Table 3 the results are shown.
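The decision rule above can be written out directly. Here `word_scores` holds the (P(w), N(w)) pair for the disambiguated sense of each word; the function and variable names are ours.

```python
def classify_opinion(word_scores):
    """WSD-based decision rule: sum P(w) over words with P(w) > N(w) into
    Sp, and N(w) over words with N(w) > P(w) into Sn; the larger of the two
    global scores decides the polarity, with a tie meaning neutral."""
    sp = sum(p for p, n in word_scores if p > n)
    sn = sum(n for p, n in word_scores if n > p)
    if sp > sn:
        return "positive"
    if sp < sn:
        return "negative"
    return "neutral"


# One clearly positive sense, one mildly negative one, one objective one:
# Sp = 0.875 and Sn = 0.6, so the opinion is classified as positive.
label = classify_opinion([(0.875, 0.0), (0.1, 0.6), (0.0, 0.0)])
```

Note that objective senses (P(w) = N(w)) contribute to neither sum, so they never tip the decision.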
Table 3. Classification results of the WSD-based method.

Resources | P+ | P- | R+ | R- | F1+ | F1- | P | R | F1 | Cov
WN-Affect | 1.00 | 0.75 | 0.17 | 0.09 | 0.29 | 0.16 | 0.90 | 0.13 | 0.23 | 0.15
Micro-WNOp | 0.58 | 0.40 | 0.20 | 0.06 | 0.30 | 0.11 | 0.53 | 0.13 | 0.21 | 0.25
SentiWN | 0.48 | 0.53 | 0.46 | 0.45 | 0.47 | 0.49 | 0.51 | 0.46 | 0.48 | 0.90
SentiWN+WN-Affect | 0.50 | 0.55 | 0.46 | 0.48 | 0.48 | 0.52 | 0.52 | 0.47 | 0.50 | 0.90
SentiWN+Micro-WNOp | 0.50 | 0.56 | 0.49 | 0.45 | 0.49 | 0.50 | 0.52 | 0.47 | 0.50 | 0.90
All | 0.53 | 0.57 | 0.51 | 0.48 | 0.52 | 0.52 | 0.55 | 0.50 | 0.52 | 0.91

5 Discussion
From Table 2, we can observe that the worst results for the word-based approach
were obtained using WN-Affect. Note that the categories built from
this resource contain few words, and therefore the coverage of the method is
affected (see Table 1 for the statistics of the resources). With Micro-WNOp the method
improves the coverage but fails in the detection of negative quotes (see the low values
of R- and F1-). The best results were obtained in the combinations that use
SentiWN; more negative opinions are correctly classified and better coverage
is achieved. Combining SentiWN with other resources does not seem to improve
the F1 scores, even though the coverage is slightly better. As WN-Affect and
Micro-WNOp were built by annotating a subset of WordNet senses, and SentiWN
includes all of these senses, it is likely that the four categories of SentiWN and
those built from each combination are not significantly different (see Table 1).
Regarding the approach based on word sense disambiguation, we can observe
in Table 3 that for the Micro-WNOp and WN-Affect resources the method obtains very low results, due to the low coverage of the annotated senses; of the
115425 synsets in WordNet, only 1105 (0.96%) and 884 (0.77%) are annotated
in these resources, respectively. Also, the corpus has 1472 words, of which 1277
are non-stop-words and are disambiguated with the senses of WordNet; Micro-WNOp covers only 57 (4.46%) and WN-Affect 18 (1.41%) of these ambiguous
words. Consequently, when we use these resources individually, only a few words obtain a polarity value. In spite of the low coverage, the method obtains acceptable precision values when these resources
are used. On the other hand, the
use of SentiWN significantly improves both the F1 scores and the coverage.
Note also that the combination of several resources obtains better precision,
recall and F1 scores. Due to the fact that SentiWN was not manually annotated,
some senses are misclassified (e.g., the sense FLU#1 (an acute febrile highly
contagious viral disease) is annotated as Positive = 0.75, Negative = 0.0 and
Objective = 0.25, despite having many negative words in its gloss). These mistakes affect the polarity classification. We suppose that combining WN-Affect
and Micro-WNOp with SentiWN reduces this negative influence, and consequently the precision, recall and F1 values are improved. The low coverage of
the Micro-WNOp and WN-Affect resources does not allow higher increases in the
classification quality.
Finally, Figure 1 shows the comparison of both methods (with and without
WSD) for each resource combination with respect to the overall F1 measure. As
can be seen, the results of the method based on word sense disambiguation are
better than those of the bag-of-words approach, except for Micro-WNOp, where
the WSD-based method is severely affected by the low coverage of this resource.
The best F1 score of the WSD-based method is 0.52, while that of the method
without WSD is 0.47. Note also that the WSD-based method not only obtains a
better overall F1, but also a higher coverage (see Tables 2 and 3). This confirms
that word sense disambiguation is useful for determining the polarity of a word.
Fig. 1. Overall F1 scores for both methods.
Unlike in the WSD-based method, the addition of WN-Affect and Micro-WNOp,
with respect to using only SentiWN, does not contribute to improving the results
of the method without WSD. In the first method, this is due to the higher
quality of the sense polarity annotation, while in the second it is due to the absence
of differences among the built categories.
From the results of Tables 1 and 2, we can also notice that Micro-WNOp and
WN-Affect achieve better overall precision but lower overall recall than SentiWN
in both methods. This is to be expected, due to the manual annotation and the low
coverage of these resources, respectively.
Another interesting observation is that the bag-of-words approach leads to
better performance when classifying positive quotes, whereas the WSD-based
method achieves a good balance between the classification of positive quotes
and that of negative ones. This can be seen in the precision, recall and F1
values for each class.
6 Conclusions and Future Work
In this paper, a comparison between two unsupervised methods for determining
the polarity of opinions has been presented. One of them performs the analysis
at the word level, whereas the other works at the sense level. Studies of the behaviour
of both methods were presented over the same corpus, using several public
resources for opinion mining.
The use of word sense disambiguation in polarity classification has pros
and cons. The advantages lie in the superiority of the results (with respect to precision, recall and F1) obtained by taking into account the context
of the words appearing in the opinion. Also, the polarity detection behaves the same way
for both classes; that is, the performance is balanced when positive
and negative quotes are classified. Although some of the resources have low
coverage, this method obtains a better coverage than the bag-of-words approach.
However, as word sense disambiguation constitutes an intermediate task in
the polarity classification, disambiguation errors could affect the classification quality. This provides further motivation to study this problem in depth,
given the lack of a corpus manually annotated with both senses and polarity. Also,
the disambiguation algorithm depends on the knowledge resources used and,
as we saw in the experiments, there are no resources that have both high
coverage and good quality in the annotation of sense polarities.
Future work includes the study of alternative methods to extract and classify
opinions, working at a syntactic level, or using local contexts, semantic representations of concepts and the modelling of discourse structures. Our idea is
to study in a broader context the impact of word sense disambiguation on the
performance of opinion mining systems, be it in small texts (such as the ones we
have studied in this paper) or in larger contexts (on-line discussion forums, blogs or
newspaper articles). Another interesting application would be the determination
of figurative senses, which are used to express opinions in a more sophisticated
manner. To this end, the application of word sense disambiguation is an essential
step.
References
1. Esuli, A., Sebastiani, F.: Sentiwordnet: A publicly available lexical resource for
opinion mining. In: Fifth international conference on Language Resources and
Evaluation (LREC 2006). (2006) 417–422
2. Hatzivassiloglou, V., McKeown, K.R.: Predicting the semantic orientation of adjectives. In: 35th Annual Meeting of the Association for Computational Linguistics,
Madrid, Spain, The Association for Computational Linguistics (1997) 174–181
3. Turney, P., Littman, M.: Measuring praise and criticism: Inference of semantic
orientation from association. ACM Transactions on Information Systems 21 (2003)
315–346
4. Kim, S., Hovy, E.: Determining the sentiment of opinions. In: 20th International
Conference on Computational Linguistics (ACL 2004), Morristown, NJ, USA, The
Association for Computational Linguistics (2004) 1367–1373
5. Takamura, H., Inui, T., Okumura, M.: Extracting emotional polarity of words
using spin model. In: 43rd Annual Meeting of the Association for Computational
Linguistics, Ann Arbor, US, The Association for Computational Linguistics (2005)
133–140
6. Esuli, A., Sebastiani, F.: Pageranking wordnet synsets: An application to opinion
mining. In: 45th Annual Meeting of the Association for Computational Linguistics,
Prague, Czech Republic, The Association for Computational Linguistics (2007) 424–431
7. Wiebe, J., Mihalcea, R.: Word sense and subjectivity. In: 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association
for Computational Linguistics, Sydney, Australia, The Association for Computational Linguistics (2006) 1065–1072
8. Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions
in language. Language Resources and Evaluation 1 (2005) 165–210
9. Akkaya, C., Wiebe, J., Mihalcea, R.: Subjectivity word sense disambiguation. In:
Conference on Empirical Methods in Natural Language Processing, Singapore, The
Association for Computational Linguistics (2009) 190–199
10. Rentoumi, V., Giannakopoulos, G.: Sentiment analysis of figurative language using
a word sense disambiguation approach. In: International Conference on Recent
Advances in Natural Language Processing (RANLP 2009), The Association for
Computational Linguistics (2009)
11. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Conference on New Methods in Language Processing. (1994) 44–49
12. Balahur, A., Steinberger, R., v. d. Goot, E., Pouliquen, B., Kabadjov, M.: Opinion
mining on newspaper quotations. Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference 3 (2009) 523–526
13. Fellbaum, C.: WordNet: an electronic lexical database. MIT Press (1998)
14. Strapparava, C., Valitutti, A.: Wordnet-affect: an affective extension of wordnet.
In: 4th International Conference on Language Resources and Evaluation, LREC
2004. (2004) 1083–1086
15. Cerini, S., Compagnoni, V., Demontis, A., Formentelli, M., Gandini, G.: Micro-WNOp: A gold standard for the evaluation of automatically compiled lexical resources for opinion mining. In: Language resources and linguistic theory: Typology, second language acquisition, English linguistics. Franco Angeli Editore, Milano, IT (2007)
16. Martín-Wanton, T., Pons-Porrata, A., Montoyo-Guijarro, A., Balahur, A.: Opinion polarity detection: Using word sense disambiguation to determine the polarity of opinions. In: 2nd International Conference on Agents and Artificial Intelligence, Volume 1 - Artificial Intelligence, Valencia, Spain, INSTICC Press (2010) 483–486
17. Agirre, E., Edmonds, P.: Word Sense Disambiguation: Algorithms and Applications (Text, Speech and Language Technology). Volume 33. Springer-Verlag New York, Inc., Secaucus, NJ, USA (2006)
18. Anaya-Sánchez, H., Pons-Porrata, A., Berlanga-Llavori, R.: Word sense disambiguation based on word sense clustering. In Coelho, J.S.H., Oliveira, S., eds.: Lecture Notes in Artificial Intelligence. Volume 4140., Springer (2006) 472–481
19. Gil-García, R., Badía-Contelles, J.M., Pons-Porrata, A.: Extended star clustering algorithm. In Sanfeliu, A., Ruiz-Shulcloper, J., eds.: Lecture Notes in Computer Science. Volume 2905., 8th Iberoamerican Congress on Pattern Recognition (CIARP), Springer-Verlag (2003) 480–487
Improving Emotional Intensity Classification
using Word Sense Disambiguation
Jorge Carrillo de Albornoz, Laura Plaza, Pablo Gervás
Computer Science Faculty, Universidad Complutense de Madrid,
C/ Prof. José García Santesmases, s/n. 28040, Madrid, Spain
jcalbornoz@fdi.ucm.es, lplazam@fdi.ucm.es, pgervas@sip.ucm.es
Abstract. In recent years, sentiment analysis has become a very popular
task in Natural Language Processing. Affective analysis of text is usually
presented as the problem of automatically identifying a representative
emotional category or scoring the text within a set of emotional dimensions.
However, most existing approaches determine these categories and dimensions
by matching the terms in the text with those present in an affective lexicon,
without taking into account the context in which these terms occur. This
paper presents a method for the automatic tagging of sentences with an
emotional intensity value, which makes use of the WordNet Affect lexicon and
a word sense disambiguation algorithm to assign emotions to concepts rather
than terms. An extensive evaluation is performed using the metrics and
guidelines proposed in the SemEval 2007 Affective Text Task. Results are
discussed and compared with those obtained by similar systems in the same
task.
Keywords: Sentiment Analysis, Word Sense Disambiguation, Emotional
Intensity, Machine Learning Techniques
1 Introduction
Sentiment analysis has become increasingly important in recent years. This discipline
comprises quite distinct research areas, such as text analysis, speech and facial
expression analysis, which have motivated a great number of systems, each with
specific properties and frequently relying on ad hoc, rarely available resources.
Focusing on text applications, emotional analysis usually relies on the
identification of emotional keywords in the text [1]. These emotional keywords are
either compared with those existing in an affective lexicon or used as the keys of a set
of rules. Furthermore, most approaches work at the lexical level, using words or stems
as emotional keywords, instead of the appropriate concepts according to their
contexts. The few approaches that work at a conceptual level rarely make use of word
sense disambiguation techniques in order to obtain the correct senses of the concepts.
Instead, they simply obtain the first meaning or all possible meanings [2].
On the other hand, most sentiment analysis systems are focused on the
identification of emotional categories or dimensions in the sentences [3, 4]. This
information may be insufficient or even irrelevant in some applications. For instance,
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 131-142
Received 26/11/09
Accepted 16/01/10
Final version 08/03/10
the emotional intensity as well as its polarity (positive vs. negative) can be of interest
in public opinion analysis systems [5].
In this paper, an emotional intensity classifier capable of determining the intensity
and polarity of the sentences in a text is presented. This system has been conceived to
be used in an automatic camera management system for virtual environments based
on the emotional analysis of the film script. In this context, the emotional intensity of
the scene is an important parameter if the adequate camera position and movements
need to be selected from the dialogs and actions described in the film script. The
system is based on the identification of concepts in the sentences rather than terms,
using a word sense disambiguation tool to obtain the correct senses for these
concepts. The WordNet Affect lexicon is used to identify those concepts which are a
priori candidates to denote an emotion or feeling.
The paper is organized as follows. Section 2 exposes the background and related
work on sentiment analysis. Section 3 presents the method proposed for the tagging of
sentences with emotional intensities. Section 4 introduces the evaluation methodology
as well as the results obtained and the comparison with other similar systems. In
section 5, the experimental results are discussed. Finally, section 6 provides
concluding remarks and identifies pending problems and future work.
2 Background
This section presents the background of this study, as well as some recent works on
sentiment analysis.
2.1 Emotion
Nowadays, the most prominent psychological theories behind sentiment analysis are the
emotional dimensions theory and the emotional categories theory. The first
proposes interpreting human emotions through a set of emotional
dimensions. It is based on James Russell's studies [6], where the human
emotional space is presented as a circular bipolar structure representing
pleasure and activation. The emotional dimensions were employed in the
development of the ANEW list (Affective Norms for English Words) according to the
SAM standard (Self-Assessment Manikin) [7], where each English word is assigned a
value, between 1 and 9, in each of the three dimensions proposed: pleasure, arousal
and dominance. This work was supported by the idea that humans understand
emotions as opposite poles.
On the other hand, the emotional categories theory describes emotions as
entities: countable primitive units with boundaries [8]. This idea can be
traced back to René Descartes [9], who proposed a set of primitive emotions from which
the other emotions can be derived. To represent these categories, this theory
makes use of everyday language (i.e. joy and anger are emotional categories). The
main drawback of this theory is the disagreement on the most adequate set of emotion
categories. While some works argue that a small set of categories is the most suitable
selection [10], other studies argue that a larger, hierarchical set of categories is
necessary to capture the richness of human feeling [11]. This debate is also fuelled by
the specific purpose of the studies. For instance, Ekman [12] argues that only six
basic emotions (anger, disgust, fear, joy, sadness and surprise) are needed to analyze
facial expressions, while Ortony et al. [13] present a set of 22 emotions in their OCC
model for emotion synthesis in speech.
2.2 Corpora and Affective Lexical Dictionaries
According to the different emotional theories, a wide range of resources has been
developed, each supported by a psychological study and most of them specific
to a given task. Focusing on text sentiment analysis, any system that attempts to
identify the affective meaning of a text needs at least two types of NLP
resources: an annotated corpus to train and test the system, and an
affective lexical dictionary that attaches affective meanings to words.
It is difficult to find affective corpora publicly available to researchers. Besides,
most of them are very specific to the task and domain for which they were
designed, so that it is often difficult to use them in a
different task. An example of an emotional corpus is Emotag [14]. Emotag consists of a
set of sentences extracted from eight popular tales marked up by human evaluators
with an emotional category and values for three emotional dimensions (evaluation,
activation and power). A radically different corpus is the one proposed for the
SemEval 2007 Affective Text task [15], where a set of 1250 sentences from news
headings are manually tagged with six basic emotions (anger, disgust, fear, joy,
sadness and surprise). Each emotion is scored between 0 and 100, where 0 indicates
that the emotion is not present in the headline, and 100 indicates that the maximum
amount of emotion is found in the headline. This corpus also includes a valence value
in the interval [-100, 100] for each sentence, which expresses the negative (-100),
neutral (0) or positive (100) intensity of the sentence.
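To make this annotation scheme concrete, a headline record can be modelled as follows. This is only a sketch: the class and field names are ours (not the official corpus format), and the scores are invented for illustration.

```python
from dataclasses import dataclass, field

# The six basic emotions used in the SemEval 2007 Affective Text task.
EMOTIONS = ("anger", "disgust", "fear", "joy", "sadness", "surprise")

@dataclass
class Headline:
    text: str
    # Each emotion scored in [0, 100]; 0 = emotion absent, 100 = maximal.
    emotions: dict = field(default_factory=dict)
    # Valence in [-100, 100]: negative (-100), neutral (0) or positive (100).
    valence: int = 0

    def dominant_emotion(self):
        """Return the highest-scoring emotion, or None if all scores are 0."""
        if not self.emotions or max(self.emotions.values()) == 0:
            return None
        return max(self.emotions, key=self.emotions.get)

# Invented scores for an example headline from the discussion section.
h = Headline("Jet flips in snowstorm, none dead",
             emotions={"anger": 0, "disgust": 0, "fear": 40,
                       "joy": 10, "sadness": 5, "surprise": 25},
             valence=20)
print(h.dominant_emotion())  # fear
```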
Similarly to the corpora, the affective lexical dictionaries strongly depend on the
underlying psychological theory. In relation to the emotional dimensions theory, the
most popular lexicons are the ANEW word list (Affective Norms for English Words)
[7] and the DAL dictionary (Whissell’s Dictionary of Affect Language) [16]. The first
one consists of a list of words scored within three emotional dimensions: pleasure,
arousal and dominance, according to the SAM standard; while the second one
contains 8742 words rated by people for their activation, evaluation and imagery. In
relation to the emotional categories theory, the LIWC Dictionary (Linguistic Inquiry
and Word Count Dictionary) [17] provides a set of 2290 words and stems, classified
in one or more categories, such as sadness, negative emotion or overall affect.
The main limitation of these lexicons is the use of words or stems, instead of
concepts, as the primitive units, without recognizing the context in which the words
are used. In contrast, the WordNet Affect database [18] provides a list of 911
WordNet synsets labeled with a hierarchical set of emotional categories. Most of the
labeled synsets directly represent the meaning of the emotional categories (nouns
and adjectives), while others are merely suitable for denoting affective meanings.
2.3 Sentiment Analyzers
As already mentioned, the most accepted approach to sentiment analysis is the
identification of a set of emotional keywords in the text. However, a great variety of
methods have been proposed to achieve this purpose. Francisco and Gervás [14]
present a system which is consistent with both theories: the emotional categories and
the emotional dimensions, based on the averaged frequencies of the words found in a
corpus of tales. Subasic and Huettner [19] propose a fuzzy approach that uses an
affective lexicon containing words manually annotated with two properties: centrality
and intensity. These properties are, respectively, the degree of membership in an
emotional category and the strength of the affective meaning. A similar approach is
presented in [20], where the emotional categories are represented as fuzzy hypercubes
with intensities ranging between 0 and 1.
More sophisticated are the methods that aim to improve the emotional information
extracted by means of semantic rules and syntactical analysis. Zhe and Boucouvalas
[21] present an emotion extraction engine that uses a parser to identify auxiliary
verbs, negations, the subject of the sentence, etc. Wu et al. [3] describe a system where the
annotation process is guided by emotion generation rules manually deduced from
psychology. In the same line, Mostafa Al Masum et al. [22] present the system ASNA
(Affective Sensitive News Agent) that uses the OCC model with a set of rules and
natural language processing techniques to automatically classify news. Nicolov et al.
[23] analyzed the effect of coreference resolution in sentiment analysis.
A third common approach to the problem is the use of Machine Learning
techniques. Devillers et al. [24] study the applicability of different machine learning
algorithms to identify relevant emotional states in real life spoken interactions. In
contrast, Seol et al. [25] present a hybrid system that uses emotional keywords when
these are present in the text, or a set of domain-knowledge-based artificial neural
networks when no emotional keywords are found. The neural networks are initialized
with a set of rules that determine possible emotional states.
3 The Emotion Classifier
In this section, the method for automatically labeling sentences with an emotional
intensity is presented. The problem is addressed as a text classification task. The classifier
aims to identify the emotional intensity of each sentence in a text, as well as whether this
intensity denotes a positive or negative emotional meaning. The method accomplishes
the task in four steps. Each step is explained in detail in the following
subsections, along with a working example that illustrates the algorithm. In addition, to
clarify how the system works and what resources are used, its architecture is
shown in Fig. 1.
Fig. 1. Architecture of the emotional intensity tagger
3.1 Preprocessing
As in most NLP systems, a preliminary preprocessing of the input text is needed. This
includes splitting the text into sentences and tagging the words with their part of
speech (POS). For this purpose, the Tokenizer, Part of Speech tagger and Sentence
splitter modules in GATE [26] have been used. Generic and high-frequency terms are
removed using a stop list. Besides, as only nouns, verbs, adjectives and adverbs
can present an emotional meaning, only terms from these grammatical categories are
considered.
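This preprocessing stage can be sketched as follows. It is a minimal stand-in: a toy stop list and a small hypothetical POS lexicon replace the GATE Tokenizer, POS tagger and Sentence splitter modules actually used by the system.

```python
import re

# Toy resources standing in for a real stop list and GATE's POS tagger.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is", "it"}
POS_LEXICON = {  # hypothetical lexicon: word -> coarse POS
    "foetal": "ADJ", "mechanism": "NOUN", "helps": "VERB",
    "heart": "NOUN", "failure": "NOUN",
}
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # only these can carry emotion

def split_sentences(text):
    # Naive punctuation-based splitter; GATE's Sentence splitter is used in reality.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def preprocess(sentence):
    """Tokenize, drop stop words, and keep only content-POS tokens."""
    tokens = re.findall(r"[a-zA-Z]+", sentence.lower())
    return [(t, POS_LEXICON.get(t, "OTHER")) for t in tokens
            if t not in STOP_WORDS and POS_LEXICON.get(t, "OTHER") in CONTENT_POS]

print(preprocess("Foetal mechanism helps heart failure."))
# [('foetal', 'ADJ'), ('mechanism', 'NOUN'), ('helps', 'VERB'),
#  ('heart', 'NOUN'), ('failure', 'NOUN')]
```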
3.2 Concept Identification
Once the text has been split into sentences and the words have been labeled with their
POS, the next step is the mapping of the terms in the sentences to their appropriate
concepts in the WordNet lexical database [27].
In order to correctly translate these terms to WordNet concepts, a word sense
disambiguation tool is needed. To this aim, the implementation of the Lesk algorithm
in the WordNet::SenseRelate Perl package was used [28]. As a result, for each word
its corresponding stem and sense in WordNet are obtained. This information is used
to retrieve the appropriate synset that represents the concept in WordNet. Next, the
hypernyms of each concept are also retrieved.
Fig. 2 shows the concept identification process for the sentence: Foetal mechanism
helps heart failure. In this sentence, the term foetal was not correctly
disambiguated by WordNet::SenseRelate, and its synset could not be retrieved from
WordNet.
Fig. 2. An example of the concept identification process
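The Lesk algorithm chooses, for each ambiguous word, the sense whose dictionary gloss shares the most words with the surrounding context. A minimal sketch over an invented two-sense inventory (the actual system uses WordNet glosses via the Perl package; sense identifiers and glosses below are illustrative):

```python
# Toy sense inventory: word -> list of (sense_id, gloss), invented for illustration.
SENSES = {
    "failure": [
        ("failure.n.01", "an act that fails to achieve its goal"),
        ("failure.n.02", "a disorder in which an organ such as the heart "
                         "loses its ability to function"),
    ],
}

def lesk(word, context_words, inventory):
    """Pick the sense whose gloss overlaps most with the context (simplified Lesk)."""
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense_id, gloss in inventory.get(word, []):
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense

# "heart" appears in the gloss of the medical sense, so that sense wins.
context = ["foetal", "mechanism", "helps", "heart"]
print(lesk("failure", context, SENSES))  # failure.n.02
```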
3.3 Emotion Identification
The goal of this third step is to match the previously identified WordNet concepts to
their emotional categories in the WordNet Affect affective lexicon.
Focusing on the WordNet Affect emotion sub-hierarchy, the first level
distinguishes between positive-emotions, negative-emotions, neutral-emotions and
ambiguous-emotions, while the second level encloses the emotional categories
themselves. This level contains most of the basic emotions exposed in the emotional
categories theories, such as sadness, joy and surprise. As the hierarchy used in
WordNet Affect is considerably broader than those frequently used in sentiment
analysis, the authors consider the second level to be a good representation of
human feeling and a good starting point for the attribute selection for the classifier
and its evaluation. This subset contains 32 emotional categories.
Thus, once the synset of each word in the sentence has been identified, its
emotional category is retrieved from WordNet Affect (if the concept appears in the
lexicon). The same analysis is carried out over its hypernyms if no entry in
WordNet Affect is found for the synset. To this aim, a previous mapping between
synsets of the WordNet 2.1 and WordNet 1.6 versions is needed, since the method and
the affective lexicon work on different versions of WordNet. WordNet 2.1 is used
instead of WordNet 1.6 because it is the most up-to-date version available for
Windows operating systems.
This process is illustrated in Fig. 3. It can be observed that only two concepts in the
example sentence have been assigned an emotional meaning after this step. The
emotional category for the concept helps is retrieved from its own synset, which is
assigned the liking category. In contrast, the synset of the concept failure is not
labeled in WordNet Affect, so the analysis of its hypernyms is carried out. As its
first-level hypernym (disorder) is labeled with the general-dislike category, this same
category is assigned to failure.
Fig. 3. An example of the emotion identification process
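The lookup with hypernym fallback can be sketched as follows. The hypernym chain and affect labels here are toy stand-ins for WordNet and WordNet Affect; the synset names are illustrative only.

```python
# Toy hypernym chains and affect labels; the real data comes from WordNet 2.1/1.6
# and WordNet Affect (all identifiers below are invented for illustration).
HYPERNYMS = {"failure.n.02": "disorder.n.01", "disorder.n.01": "condition.n.01"}
AFFECT = {"help.v.01": "liking", "disorder.n.01": "general-dislike"}

def emotion_of(synset, max_levels=3):
    """Return (category, depth): depth 0 = the synset itself carries the label,
    depth k > 0 = the label was inherited from the k-th level hypernym."""
    current = synset
    for depth in range(max_levels + 1):
        if current in AFFECT:
            return AFFECT[current], depth
        current = HYPERNYMS.get(current)
        if current is None:
            break
    return None, None

print(emotion_of("help.v.01"))     # ('liking', 0)
print(emotion_of("failure.n.02"))  # ('general-dislike', 1)
```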
3.4 Emotional Intensity Classification
Up to this point, all the words in the sentence have been labeled with their emotional
category (if any). Next, a vector of emotional occurrences (VEO) is created, which
will be used as input for the Random Forest classifier implementation provided by
the Weka machine learning tool [29].
A VEO vector is an array of 32 positions, one representing each emotional
category. To construct this vector, each concept is evaluated in order to determine its
degree of affective meaning. If the concept has been assigned an emotional category,
then the position in the VEO vector that represents that category is increased by 1. If
no emotional category was retrieved for the concept, then the nearest labeled
hypernym is used. As a hypernym is a generalization of the concept, a lower weight is
assigned to the category position in the VEO vector, depending on the depth of
the hypernym in the hierarchy, as defined in (1). In order to avoid an excessive
generalization of emotions, only the first n levels of hypernyms are considered, where n
has been empirically set to 3.

VEO[i] = VEO[i] + 1 / (HypernymDepth + 1)                                    (1)
Fig. 4 shows the VEO vector for the example sentence. Since the emotional
category liking is assigned to helps, the position of liking in the VEO vector is
increased by 1. On the other hand, the concept failure is labeled with the
emotional category general-dislike through its first-level hypernym, so its position in
the VEO vector is increased by 0.5.
Fig. 4. An example of the emotional intensity classifier
Finally, the VEO vector is used as input for the classifier, and an intensity value
between -100 and 100 is obtained as output.
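Equation (1) can be sketched as follows: each position counts direct hits with weight 1, while a category inherited from a hypernym at depth d contributes 1/(d+1). The category subset and the input representation below are our own illustration, not the system's actual data structures.

```python
# Illustrative subset of the 32 categories (the 4-attribute subset reported later).
CATEGORIES = ["liking", "negative-fear", "sadness", "general-dislike"]

def build_veo(labelled_concepts, categories=CATEGORIES):
    """labelled_concepts: (category, hypernym_depth) pairs for one sentence.
    Depth 0 means the concept's own synset carried the label (weight 1);
    depth d > 0 means the label came from a hypernym (weight 1/(d+1), Eq. 1)."""
    veo = [0.0] * len(categories)
    index = {c: i for i, c in enumerate(categories)}
    for category, depth in labelled_concepts:
        if category in index:
            veo[index[category]] += 1.0 / (depth + 1)  # Eq. (1)
    return veo

# "Foetal mechanism helps heart failure": helps -> liking (depth 0),
# failure -> general-dislike via its first-level hypernym (depth 1).
print(build_veo([("liking", 0), ("general-dislike", 1)]))
# [1.0, 0.0, 0.0, 0.5]
```

This reproduces the Fig. 4 example: helps contributes 1 to liking and failure contributes 0.5 to general-dislike.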
4 Evaluation
In order to evaluate the performance of the method, a large-scale evaluation was
carried out following the guidelines of the SemEval 2007 Affective Text
Task and using the corpus developed for this task.
The evaluation corpus consists of a training set of 250 sentences and a test set of
1000 sentences of news headlines. Each sentence has been manually labeled by
experts with a score between -100 and 100, where -100 means a strongly negative
emotional intensity, 100 means a strongly positive emotional intensity and 0 means
neutral.
In this task, the intensity score assigned to each sentence has been mapped to three
broader classes: -100 (between -100 and -50), 0 (between -50 and 50) and 100
(between 50 and 100). The results are evaluated using precision and recall metrics.
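The score-to-class mapping can be sketched as follows. How the boundary values -50 and 50 themselves are assigned is our assumption, since the interval description leaves the boundaries open.

```python
def intensity_class(score):
    """Map a valence score in [-100, 100] to the three evaluation classes.
    Boundary handling (scores of exactly -50 or 50 go to class 0) is assumed."""
    if score < -50:
        return -100   # strongly negative
    if score > 50:
        return 100    # strongly positive
    return 0          # neutral

print([intensity_class(s) for s in (-80, -50, 0, 50, 80)])
# [-100, 0, 0, 0, 100]
```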
Due to the specific language used in the news domain, most of the affective words
in the headlines are not found in WordNet Affect. Thus, a set of emotional meanings
has been manually added to the lexicon, for words such as dead, kill and kidnap. In particular,
188 synsets have been added: 140 nouns, 56 verbs and 22 adjectives.
To determine the best classification algorithm, all the experiments have been
carried out over the different classification techniques implemented in Weka,
although only the results of the best four algorithms are presented here.
The first group of experiments aims to find the best value for the number
of hypernym levels, as explained in Section 3.4. For this parameter, we have
considered all possible values from 0 to 5. As can be observed in Table 1, using
three levels of hypernyms produces the best results in 2 out of the 4 algorithms.
Table 1. Evaluation of the number of hypernym levels

                                       Number of hypernym levels
Algorithm             Metric        0     1     2     3     4     5
Functional Trees      Precision  52,2  57,0  58,1  60,4  60,2  60,2
                      Recall     62,2  62,2  62,8  63,1  62,9  62,8
Random Forest         Precision  57,5  57,0  57,4  57,6  57,4  57,3
                      Recall     62,0  62,0  62,1  62,5  62,3  62,2
Naïve Bayes           Precision  56,3  58,5  58,3  58,3  58,3  58,3
                      Recall     60,1  59,8  59,7  59,5  59,5  59,5
Multinomial Logistic  Precision  55,8  57,5  57,4  57,4  57,8  58,1
                      Recall     62,0  62,6  62,9  63,1  63,1  63,2
The aim of the second group of experiments is to determine the best set of
emotional categories for the classifier. Starting from the initial set of 32 categories,
three algorithms for attribute selection implemented in Weka have been applied,
which lead to three subsets of 3, 4 and 12 attributes respectively. Next, the four
classifiers have been evaluated over these reduced sets of attributes. For these
experiments, the number of hypernym levels was set to 3. Table 2 shows that two
classifiers perform better with the subset of attributes selected by the CFS Best First
algorithm.
Table 2. Evaluation of the set of attributes

Algorithm             Metric     Consistency   CFS          Consistency
                                 Best First    Best First   Genetic
Functional Trees      Precision  57,7          57,0         57,2
                      Recall     62,8          62,2         62,3
Random Forest         Precision  63,1          64,0         63,5
                      Recall     63,1          63,5         63,5
Naïve Bayes           Precision  55,8          58,7         55,8
                      Recall     62,4          61,3         62,4
Multinomial Logistic  Precision  57,7          57,0         57,3
                      Recall     62,8          62,2         62,5
A third group of experiments has been carried out in order to evaluate the effect of
the distribution of the intensity within classes. To this aim, the emotional intensity of
the sentences has been mapped to 3 balanced classes: -100 (from -100 to -35), 0 (from
-35 to 35) and 100 (from 35 to 100), and to 5 balanced classes: -100 (-100 to -60), -50
(-60 to -20), 0 (-20 to 20), 50 (20 to 60) and 100 (60 to 100). For these experiments,
the number of hypernym levels was set to 3, while the subset of attributes selected by
the CFS Best First algorithm was used. Table 3 shows that the method performance
decreases substantially when the number of classes is increased, while only a small
reduction is reported when the 3 classes are equitably distributed with respect to the
original distribution.
Therefore, the best configuration consists of the Random Forest algorithm along
with the 4-attribute subset obtained by the CFS Best First algorithm and 3 hypernym
levels.
Table 3. Evaluation of the intensity distribution in classes

Algorithm             Metric     3 Classes   5 Classes
Functional Trees      Precision  54,8        26,7
                      Recall     51,0        35,5
Random Forest         Precision  55,4        36,6
                      Recall     52,1        35,7
Naïve Bayes           Precision  52,1        39,1
                      Recall     43,9        32,9
Multinomial Logistic  Precision  55,9        26,7
                      Recall     50,7        35,6
Finally, Table 4 summarizes the best results of our method along with those of the
systems participating in the SemEval 2007 Affective Text Task. Our method
outperforms the other systems in precision, while CLaC-NB obtains the best recall.
Table 4. Comparison with SemEval 2007 systems

Systems      Precision   Recall
CLaC         61.42        9.20
UPAR7        57.54        8.78
SWAT         45.71        3.42
CLaC-NB      31.18       66.38
SICS         28.41       60.17
Our method   64.00       63.50
5 Discussion
As already mentioned, the system has obtained very promising results. When
compared to the other systems that took part in the SemEval task, our method obtains the
best precision (64%), while providing the second highest recall (63.5%). Furthermore,
both metrics are well balanced, which does not occur in the other systems.
The use of concepts instead of terms, along with the use of a word sense
disambiguation algorithm, allows the method to correctly map the words in the
sentence to their emotional categories in the affective lexicon. Besides, the use of
hypernyms has made it possible to increase the number of concepts labeled with an emotion,
which has significantly improved the evaluation results. This indicates that the
method is strongly dependent on the number of concepts labeled in the lexicon.
Another interesting result is that the initial set of 32 emotional categories has
proved to be too large. Some of these categories introduce noise and decrease the
efficiency of the classification algorithms. According to the experiments, the best set
consists of 4 emotional categories: liking, negative-fear, sadness and general-dislike.
However, it is important to note that this subset will depend on the domain considered.
A detailed examination of the precision and recall obtained for each intensity class
has shown that the positive class encloses most of the classification errors. The reason
seems to be that a good number of the positive sentences are expressed as the
negation of negative emotional concepts (e.g. Jet flips in snowstorm, none dead). A
previous negation detection process, which adapts negated sentences to a positive
form through antonym relations, could minimize this problem.
6 Conclusions and Future Work
In this paper, an effective approach to automatically assigning an emotional intensity and
polarity to a sentence has been presented. The method obtains both high precision and recall,
which are also well balanced. The evaluation results outperform those obtained by the
systems participating in the SemEval 2007 Affective Text Task.
However, several problems have been identified. First, the experimentation has
shown that the number of indexed concepts is low, since most concepts in the corpus
cannot be retrieved from WordNet Affect. This clearly influences the
precision of the method, which can be improved by enriching the lexicon with the
corpus-specific vocabulary.
A second obstacle to concept identification is the part-of-speech tagging errors
produced by the GATE POS tagger, which resulted in a good number of concepts
that could not be correctly disambiguated and retrieved from WordNet. Future work
will include a further evaluation using the statistical Stanford parser [30].
Finally, different techniques will be studied in order to detect negated concepts, as
well as their scope, since they can invert the polarity of the sentences.
Acknowledgments. This research is funded by the Spanish Ministry of Science and
Innovation (TIN2009-14659-C03-01). It is also partially funded by the Comunidad
Autonoma de Madrid (CAM) and the European Social Fund (ESF) through the IV
PRICIT program, and by the Spanish Ministry of Science and Innovation through the
FPU program.
References
1. Strapparava, C., Mihalcea, R.: Learning to identify emotions in text. In: Proceedings of the
2008 ACM symposium on Applied computing, pp. 1556--1560. Ceara, Brazil (2008)
2. Chaumartin, F.R.: UPAR7: a knowledge-based system for headline sentiment tagging. In:
Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 422--425.
Prague, Czech Republic (2007)
3. Wu, C.H., Chuang, Z.J., Lin, Y.C.: Emotion recognition from text using semantic labels
and separable mixture models. Journal of ACM Transactions on Asian Language
Information Processing 5, pp. 165--183 (2006)
4. Rubin, V.L., Stanton, J.M., Liddy, E.D.: Discerning Emotions in Texts. In: Proceedings of
the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and
Applications. Stanford, C.A. (2004)
5. Mullen, T., Collier, N.: Sentiment Analysis using Support Vector Machines with Diverse
Information Sources. In: Proceedings of EMNLP, pp. 412--418. Barcelona, Spain (2004)
6. Russell, J. A.: A circumplex model of affect. Journal of Personality and Social Psychology
36, pp. 1161--1178 (1980)
7. Bradley, M.M., Lang, P.J.: Affective norms for English words (ANEW): Instruction
manual and affective ratings. Technical Report C-1, The Center for Research in
Psychophysiology, University of Florida (1999)
8. James, W.: What is an Emotion? Mind 9, pp. 188--205 (1884)
9. Anscombe, E., Geach, P.T.: Descartes: Philosophical Writings. Ed. Nelson University
Paperbacks (1972)
10. Plutchik, R.: A general Psycho Evolutionary Theory of Emotion. Emotion: Theory,
Research, and Experience (1980)
11. Parrott, W.G.: Emotions in Social Psychology: Essential Readings. Psychology Press,
Philadelphia (2001)
12. Ekman, P.: Are there basic emotions? Psychological Review 99, pp. 550--553 (1992)
13. Ortony, A., Clore, G. L., Collins, A.: The Cognitive Structure of Emotions. Cambridge
University Press (1988)
14. Francisco, V., Gervás, P.: Ontology-Supported Automated Mark Up of Affective
Information in Texts. Special Issue of Language Forum on Computational Treatment of
Language 34, pp. 26--36 (2008)
15. SemEval 2007 Affective Text Task Corpus, http://www.cse.unt.edu/~rada/affectivetext/
16. Whissell, C.M.: The dictionary of affect in language. In: Robert Plutchik and Henry
Kellerman (Ed.), Emotion: Theory, Research, and Experience, pp. 113--131 (1989)
17. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic Inquiry and Word Count (LIWC):
LIWC2001. Erlbaum Publishers, Mahwah, NJ (2001)
18. Strapparava, C., Valitutti, A.: Wordnet-affect: an affective extension of WordNet. In:
Proceedings of the 4th International Conference on Language Resources and Evaluation,
pp. 1083--1086. Lisbon, Portugal (2004)
19. Subasic, P., Huettner, A.: Affect Analysis of Text Using Fuzzy Semantic Typing. Journal
of IEEE Transactions on Fuzzy Systems 9, pp. 483--496 (2001)
20. Esau, N., Kleinjohann, L., Kleinjohann, B.: An Adaptable Fuzzy Emotion Model for
Emotion Recognition. In: Proceeding of EUSFLAT Conference, pp. 73--78 (2005)
21. Zhe, X., Boucouvalas, A.: Text-to-Emotion Engine for Real Time Internet Communication.
In: Proceedings of International Symposium on CSNDSP, pp. 164--168 (2002)
22. Mostafa Al Masum, S., Islam, M. T., Ishizuka, M.: ASNA: An Intelligent Agent for
Retrieving and Classifying News on the Basis of Emotion-Affinity. In: Proceedings of the
CIMCA-IAWTIC'06 (2006)
23. Nicolov, N., Salvetti, F., Ivanova, S.: Sentiment Analysis: Does Coreference Matter? In:
Proceedings of the Symposium on Affective Language in Human and Machine, pp. 37--40 (2008)
24. Devillers, L., Vidrascu, L., Lamel, L.: Challenges in real-life emotion annotation and
machine learning based detection. Neural Networks 18, pp. 407--422 (2005)
25. Seol, Y.S., Kim, D.J., Kim, H.W.: Emotion Recognition from Text Using Knowledge-based ANN. In: Proceedings of 23rd International Technical Conference on
Circuits/Systems, Computers and Communications, pp. 1569--1572 (2008)
26. General Architecture for Text Engineering, http://gate.ac.uk/
27. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet:
An On-Line Lexical Database. International Journal of Lexicography 3 (4), pp. 235--244
(1990)
28. Patwardhan, S., Banerjee, S., Pedersen, T.: SenseRelate::TargetWord - A Generalized
Framework for Word Sense Disambiguation. In: Proceedings of the Twentieth National
Conference on Artificial Intelligence (Intelligent Systems Demonstrations), pp. 1692–1693.
Pittsburgh, PA (2005)
29. Weka Machine Learning Tool, http://www.cs.waikato.ac.nz/ml/weka/
30. Statistical Stanford Parser, http://nlp.stanford.edu/software/lex-parser.shtml
Sentence Level News Emotion Analysis in Fuzzy
Multi-label Classification Framework
Plaban Kr. Bhowmick, Anupam Basu, Pabitra Mitra and Abhisek Prasad
Department of Computer Science & Engineering
Indian Institute of Technology Kharagpur
Kharagpur, India-721302
plaban@gmail.com, anupambas@gmail.com, pabitra@gmail.com,
abhisek.hi@gmail.com
Abstract. Multiple emotions are evoked with different intensities in
readers’ minds in response to text stimuli. In this work, we perform
reader-perspective emotion analysis at the sentence level, considering each
sentence to be associated with the emotion classes with fuzzy belongingness. As news articles present emotionally charged stories and facts, a
corpus of 1305 news sentences is considered in this study. Experiments
have been performed in a Fuzzy k Nearest Neighbor (FkNN) classification
framework with four different feature groups. A word-feature-based classification model is considered as the baseline. In addition, we have
proposed three features, namely polarity, semantic frame and emotion
eliciting context (EEC) based features. Different measures applicable to
the multi-label classification problem have been used to evaluate system
performance. Comparisons between the feature groups revealed that the
EEC-based feature is the most suitable one for the reader-perspective
emotion classification task.
1 Introduction
With the recent thrust in Human-Computer Interaction (HCI) and Human Centered Computing (HCC), researchers are concerned with modeling human behavior in order to provide truly intelligent interfaces. Emotion is one of the distinguishing features of human character and plays an important role in shaping
human behavior. Current efforts in the HCI area are exploring the possibilities of developing emotionally intelligent interfaces. Emotional intelligence refers to one’s
ability to understand and manage the emotions of one’s self or of others. Apart
from other modes like speech and facial expression, language is one of the most
common modes for expressing emotion, whether in day-to-day speech communication (spoken language) or published communication (written language).
Recent work in the natural language processing area looks into different behavioral
aspects of humans, such as personality traits, sentiment and emotion. Emotion can be
studied from two perspectives:
– From the writer/speaker perspective, where we need to understand the emotion that the writer/speaker intended to communicate, and
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 143-154
Received 23/11/09
Accepted 16/01/10
Final version 11/03/10
144
Kumar P., Basu A., Mitra P., Prasad A.
– from the reader’s perspective, where we try to identify the emotion that is
triggered in a reader in response to a language stimulus.
In this work, we intend to perform sentence level emotion classification from the
reader’s perspective. The issues that need to be addressed in this kind of
task are as follows:
– Fuzzy and multi-label characteristics: In response to a text stimulus, a blend of
emotions may be evoked in the reader’s mind with different degrees of intensity.
Thus, from a classification point of view, a text segment may have multiple
memberships in different emotion categories.
– Identification of suitable features: As reader-perspective emotion analysis is
in its infancy, the identification of suitable features has to be explored.
– Feature sparseness: While emotion elicitation from a discourse or paragraph
may provide a larger number of cues as features, the number of features available from a single sentence is small, and hence the feature space becomes
sparse.
The earlier works towards reader-perspective emotion classification have used
word features [1, 2] and word co-occurrence statistics [3]. In this work, we introduce three new features, namely the polarity feature, the semantic frame feature
and the emotion elicitation context (EEC) feature, towards the same objective. As
fuzziness is involved in a subjective entity like emotion, the Fuzzy k Nearest Neighbor
(FkNN) framework has been used for developing emotion classification models
with the proposed features.
2 Related Works
As stated earlier, emotion analysis can be performed from two different perspectives. There are a number of efforts towards writer-perspective emotion analysis [4–8]. As we focus on performing reader-perspective emotion analysis, we
provide an overview of the works addressing the related task.
Affective text analysis was the task set in SemEval-2007 Task 14 [9]. A corpus
of news headlines extracted from Google News and CNN was provided. The two
tasks were to classify headlines into positive/negative emotion categories as well
as into distinct emotion categories like anger, disgust, fear, happiness, sadness and
surprise.
The system UA-ZBSA [3] gathers statistics from three different search engines
(MyWay, AllWeb and Yahoo) to attach emotion labels to news headlines.
The work computes the PMI score of each content word of a headline with
respect to each emotion by querying the search engines with the headline and
the emotion. The accuracy, precision and recall of the system are reported to be
85.72%, 17.83% and 11.27%, respectively.
UPAR7 [10] is a linguistic rule-based approach towards emotion classification. The system performs emotion analysis on news headline data provided in
SemEval-2007 Task 14. In the preprocessing step, the common words are decapitalized with the help of a part-of-speech tagger and WordNet. Each word is first
rated with respect to the emotion classes. The main theme word is detected by
parsing a headline, and it is given a higher weight than the other words in the
headline. Emotion score boosting for nouns is performed based on their
belongingness to some general categories in WordNet. The word scoring also
considers some other factors like human will, negation and modals, high-tech
names, celebrities, etc. The average accuracy, precision and recall of the system
are 89.43%, 27.56% and 5.69%.
The system SWAT [11] adopts a supervised approach towards emotion classification in news headlines. A word-emotion map constructed by querying
Roget’s New Millennium Thesaurus is used to score each word in the headline,
and the average score of the headline words is taken into account while labeling
it with a particular emotion. The reported classification accuracy, precision and
recall are 88.58%, 19.46% and 8.62%.
The work by Lin and Chen [1, 2] provides a method for ranking readers’
emotions in Chinese news articles. Eight emotion classes are considered in this
work. Chinese character bigrams, Chinese words, news metadata, affix similarity
and word emotion have been used as features. The best reported system accuracy
is 76.88%.
3 Emotion Data
The emotion text data collected by us consist of 1305 sentences extracted from
the Times of India newspaper archive1. The sentences were collected from headlines
as well as from the bodies of articles belonging to the political, social, sports and
entertainment domains. The annotation scheme considers the following points:
– Choice of emotion classes: The annotation scheme considers four basic emotions, namely Disgust, Fear, Happiness and Sadness.
– Fuzzy and multi-label annotation: A sentence may trigger multiple emotions
simultaneously, so an annotator may classify a sentence into more than one
emotion category. Fuzzy annotation is considered in this work, i.e., for a
sentence, the annotators provide a value from the range [0,1] against each
emotion category.
The distribution of sentences across emotion categories is as follows: Disgust =
307, Fear = 371, Happiness = 282 and Sadness = 735.
4 Features for Emotion Classification
The following features were considered in the experiments on emotion analysis on
the data set described above.
1 http://timesofindia.indiatimes.com/archive.cms
4.1 Word Feature
Words are sometimes indicative of the emotion class of a text segment. For
example, the word ‘bomb’ may be highly associated with the fear emotion. Thus,
words present in the sentences may be considered to be potential features. Now,
if we consider all the words in a text corpus, only a subset of these will be present
in a particular sentence. The presence of these words is used to form a binary
feature vector. Before creating the word feature vectors, the following preprocessing
steps are adopted.
– Stop words are removed.
– Named entities may introduce noise in emotion classification, so named
entities are removed using the Stanford named entity recognizer2.
– The remaining content words are stemmed using Porter’s stemmer algorithm.
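The three preprocessing steps and the binary vector construction can be sketched as follows. The stop list, the toy suffix-stripping "stemmer", and the `remove_named_entities` placeholder are simplified stand-ins for the stop-word list, Porter stemmer and Stanford NER actually used in the paper.

```python
# Sketch of the Sec. 4.1 word-feature pipeline: stop-word removal,
# named-entity removal, stemming, and a binary presence vector.

STOP_WORDS = {"the", "a", "an", "of", "in", "was", "by"}

def toy_stem(word):
    # Crude suffix stripping; the paper uses Porter's algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def remove_named_entities(tokens, named_entities):
    # Placeholder for the Stanford-NER-based removal step.
    return [t for t in tokens if t not in named_entities]

def word_features(sentence, vocabulary, named_entities=frozenset()):
    tokens = [t.lower() for t in sentence.split()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = remove_named_entities(tokens, named_entities)
    stems = {toy_stem(t) for t in tokens}
    # Binary feature vector over a fixed stemmed vocabulary
    return [1 if w in stems else 0 for w in vocabulary]

vocab = ["bomb", "kill", "flood", "win"]
print(word_features("The bomb killed people in the flood", vocab))  # [1, 1, 1, 0]
```

In the real system the vocabulary is the full stemmed corpus lexicon (2345 dimensions, per Section 5.3); the toy vocabulary here is for illustration only.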
4.2 Polarity based Feature
Polarity of the subject, object and verb of a sentence may be a good indicator of
whether the sentence evokes positive or negative emotions. For example, let us
consider the following sentence:
Relief work improves the poor conditions of flood affected people.
Here, the subject, Relief work, is of positive polarity; the verb, improves, is of
positive polarity; and the object phrase, poor conditions of flood affected people,
is of negative polarity. Intuitively, a positive subject performs a positive action
on a negative object and this pattern evokes a positive emotion.
The polarity values of each word in the corpus are tagged manually. Existing
resources like SentiWordNet could have been employed for word-level polarity tagging. However, as this resource is developed using machine learning techniques,
the error introduced in polarity learning may affect the performance of emotion classification. The polarity of a word may take the values POSITIVE (P),
NEGATIVE (Ne) or NEUTRAL (N).
The problem of finding the polarity of a verb and its corresponding subject and
object in a sentence can be broken down into the following sub-problems:
– Finding out the main verb and head words of the corresponding subject and
object phrase
– Finding the modifier words for verb, subject and object head words
– Finding polarities of subject, object and verb phrases
The Stanford Parser3 is used to parse the sentences, and the dependency relations
(nsubj, dobj, etc.) obtained as parser output are used to extract the subject, verb
and object phrases. A dependency relation from the output of the parser is of
the following form:
2 http://nlp.stanford.edu/software/CRF-NER.shtml
3 http://nlp.stanford.edu/software/lex-parser.shtml
relation(arg1, arg2)
The main verb, subject and object head words in a sentence are detected using the
dependency relations obtained from the parser output. Some of these relations
are given in Table 1.
Table 1. Example dependency relations for identification of verb, subject and object
head words.

Relation | Argument | Example Sentence | Example Relation

Example dependency relations for identification of verb:
Agent: agent | arg1 | The man has been killed by the police | agent(killed, police)
Passive auxiliary: auxpass | arg1 | The president was killed | auxpass(killed, was)
Clausal subject: csubj | arg1 | What he has done made me proud | csubj(made, done)
Direct object: dobj | arg1 | He gave me a book | dobj(gave, book)
Nominal subject: nsubj | arg1 | Militants destroyed the bridge | nsubj(destroyed, militants)
Passive nominal subject: nsubjpass | arg1 | The bridge was destroyed by militants | nsubjpass(destroyed, bridge)

Example dependency relations for identification of subject head word:
Agent: agent | arg2 | The man has been killed by the police | agent(killed, police)
Clausal subject: csubj | arg2 | What he has done made me proud | csubj(made, done)
Clausal passive subject: csubjpass | arg2 | That he will excel was predicted | csubjpass(predicted, excel)
Nominal subject: nsubj | arg2 | Militants destroyed the bridge | nsubj(destroyed, militants)
Controlling subject: xsubj | arg2 | Tom likes to eat fish | xsubj(eat, Tom)

Example dependency relations for identification of object head word:
Direct object: dobj | arg2 | He gave me a book | dobj(gave, book)
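The use of Table 1 can be sketched in a few lines: given relation(arg1, arg2) triples from the parser, the relation type determines which argument is the verb and which is the subject or object head word. The relation-to-role map below covers only a small subset of Table 1 and is illustrative, not the paper's implementation.

```python
# Sketch of head-word extraction from dependency triples (Sec. 4.2, Table 1).
# relation -> (role of arg1, role of arg2); a simplified subset of Table 1.
RELATION_ROLES = {
    "nsubj": ("verb", "subject"),
    "agent": ("verb", "subject"),
    "dobj": ("verb", "object"),
}

def extract_heads(dependencies):
    """dependencies: list of (relation, arg1, arg2) triples from the parser."""
    heads = {"verb": None, "subject": None, "object": None}
    for rel, arg1, arg2 in dependencies:
        roles = RELATION_ROLES.get(rel)
        if roles is None:
            continue  # ignore relations outside the simplified subset
        if heads[roles[0]] is None:
            heads[roles[0]] = arg1
        if heads[roles[1]] is None:
            heads[roles[1]] = arg2
    return heads

deps = [("nsubj", "destroyed", "militants"), ("dobj", "destroyed", "bridge")]
print(extract_heads(deps))
# {'verb': 'destroyed', 'subject': 'militants', 'object': 'bridge'}
```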
The second step is to find out the modifiers of the subject, object and verb
head words. The example relations that were used for extracting the modifiers
are given in Table 2. The polarity assignment to a phrase is performed with
two different sets of phrase polarity assignment rules, one for verb phrases (see
Table 3) and another for subject and object phrases (see Table 4).

Table 2. Example dependency relations for identification of modifier words

Relation | Argument | Example Phrase | Example Relation
Adverbial modifier: advmod | arg2 | Genetically modified food | advmod(modified, genetically)
Adjectival modifier: amod | arg2 | Genetically modified food | amod(food, modified)
Negation modifier: neg | arg2 | He was not killed | neg(killed, not)
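A minimal sketch of how rule tables like Tables 3 and 4 can be applied is shown below. Only the advmod rules V1–V5 for verb phrases are encoded; the fallback behavior for combinations outside the tables is an assumption of this sketch.

```python
# Sketch of the verb-phrase polarity rules V1-V5 (Table 3).
# Polarities: "P" (positive), "Ne" (negative), "N" (neutral).

def verb_phrase_polarity(head, modifier):
    # Modifier attached to the head verb via advmod
    if head == "Ne" and modifier == "Ne":   # V1: brutally killed -> Ne
        return "Ne"
    if head == "P" and modifier == "P":     # V2: heartily welcomed -> P
        return "P"
    if head == "Ne" and modifier == "P":    # V3: artistically murdered -> Ne
        return "Ne"
    if head == "P" and modifier == "Ne":    # V4: ghastly welcomed -> Ne
        return "Ne"
    if head == "N" and modifier in ("P", "Ne"):
        return modifier                     # V5: neutral head takes modifier polarity
    return head  # fallback (assumption): keep head polarity

print(verb_phrase_polarity("Ne", "P"))  # Ne  (artistically murdered)
print(verb_phrase_polarity("N", "P"))   # P   (beautifully taken)
```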
4.3 Semantic Frame Feature
Every word in the lexicon refers to some ground-truth conceptual meaning that
helps in clustering words based on their conceptual similarity. In frame semantics [12], a word evokes a frame of semantic knowledge relating to the specific
concept it refers to.
Table 3. Example rules for verb polarity assignment (P = Positive, Ne = Negative, N
= Neutral, NULL = absent, X = independent of relation)

Rule# | Head | Modifier | Relation | Phrase | Example
V1 | Ne | Ne | advmod | Ne | [brutally]/Ne [killed]/Ne −→ [brutally killed]/Ne
V2 | P | P | advmod | P | [heartily]/P [welcomed]/P −→ [heartily welcomed]/P
V3 | Ne | P | advmod | Ne | [artistically]/P [murdered]/Ne −→ [artistically murdered]/Ne
V4 | P | Ne | advmod | Ne | [ghastly]/Ne [welcomed]/P −→ [ghastly welcomed]/Ne
V5 | N | P|Ne | advmod | Modifier polarity | [beautifully]/P [taken]/N −→ [beautifully taken]/P
Table 4. Example rules for subject and object polarity assignment

Rule# | Head | Modifier | Phrase | Example
N1 | P | P | P | [great]/P [win]/P −→ [great win]/P
N2 | Ne | P|Ne|N | Ne | [airplane]/N [hijack]/Ne −→ [airplane hijack]/Ne
N3 | N | P|Ne | Modifier polarity | [excellent]/P [performance]/N −→ [excellent performance]/P
N4 | N | N | N | [Minor]/N [girl]/N −→ [Minor girl]/N
N5 | P|Ne | NULL | Head polarity | [bomb]/Ne −→ [bomb]/Ne
The Berkeley FrameNet project4 is a well-known frame-semantic lexical resource
for English. Apart from storing the predicate-argument structure, the
frames group the lexical units. For example, the frame Apply heat is evoked
by lexical units such as bake, blanch, boil, simmer, steam, etc. So, assignment
of appropriate frames to the words may be used as a generalization technique.
The semantic frame feature extraction was performed by obtaining the
semantic parse of each sentence through SHALMANESER5. In the example
sentence given below, the words ‘arrest’, ‘man’, ‘abducting’, ‘assaulting’ and
‘girl’ are assigned the Arrest, People, Kidnapping, Rape and People frames,
respectively. These frames are considered as semantic frame features.
Villivakkam police arrested a 26-year old married man for
abducting and sexually assaulting a 16-year-old girl
4.4 Emotion Elicitation Context (EEC) Feature
Emotion is evoked in the reader based on the situation described in the text.
Surface-level features like words are not adequate to encode these situations.
In order to represent them, we need to capture the contexts in which
they occur. To this end, we develop a knowledge base of
emotion eliciting contexts.
Representing EEC Knowledge. An EEC involves an action and the entities related to the action. One context is described by a semantic graph that contains a
special node called the pivot, representing the action part of the EEC. The pivot node
4 http://framenet.icsi.berkeley.edu/
5 http://www.coli.uni-saarland.de/projects/salsa/shal/
refers to a semantic frame like Cause harm of FrameNet. The entity nodes
in the semantic graph are related to the pivot node with semantic relations. The
entity nodes refer to semantic groups (SG). An SG is a collection
of similar semantic frames or concepts or both. The g-th SG is represented as
follows:

SGg : SF1, SF2, . . . , SFs ; C1, C2, . . . , Cc    (1)

where SGg contains s semantic frames and c concepts.
The s-th semantic frame SFs refers to a frame in FrameNet. There are
some terms which cannot be mapped to any frame in FrameNet; those
terms have been accommodated in the knowledge base by defining concepts. Thus, a concept is represented as a collection of terms. For example, there
is no entry for ‘tiger’ in FrameNet, and it has been represented through the concept fearful animal. The semantic group Fearful Entity is given in Table 5. In
this example, the SG Fearful Entity contains three frames (Terrorist, Weapon,
Catastrophe) and one concept (fearful animal).

Table 5. Example of the Fearful Entity semantic group
Fearful Entity: Terrorist (frame), Weapon (frame), Catastrophe (frame), fearful animal (concept)
An example semantic graph for the EEC describing the killing of people by
disease (Killing by Disease) is presented in Figure 1.
Fig. 1. An example EEC describing the context of killing by disease.
Identifying EEC in Sentence. We analyze a sentence to identify the EECs
present in it. The semantic parse of the sentence is obtained by means of a
semantic parser like SHALMANESER. The EEC identification method takes the
EEC graphs and the semantic parse graph of a sentence as input and outputs
the matched EECs for that particular sentence.
The identification method starts by taking each EEC graph from the EEC
knowledge base and trying to fit it into the semantic parse graph. Based on the
extent of match, a match score is assigned to the current EEC. The matching
process for an EEC graph and a semantic parse graph is depicted in Figure 2.
Fig. 2. Illustration of the match procedure of an EEC graph (pivot node P with relations r1, r2, r3 to entity nodes A, B, C) and the semantic parse graph of a sentence (pivot P′ with relations r1′–r6′ to nodes A′, B′, C′, D, E, F).

The matching process starts with finding the pivot node of the EEC in the
semantic parse graph. Then the relations (an edge and node pair) are probed for
matching in semantic parse graph. The match score (m) for an EEC is computed
as follows.
m = (Number of relations that are matched) / (Total number of relations in the EEC)    (2)
In the example shown in Figure 2, pivot node P is matched with P′ of the semantic
parse graph. The relations P − r1 − A, P − r2 − B and P − r3 − C are matched
with P′ − r1′ − A′, P′ − r2′ − B′ and P′ − r3′ − C′, respectively. So, the match
score m for this EEC is 3/3 = 1. After computing match scores, the EECs with
match scores greater than zero are selected to be the emotion elicitation contexts
for the concerned sentence.
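The matching procedure and Eq. (2) can be sketched by representing each graph as a pivot with (relation, entity) edges. The contents of the 'Killing by Disease' EEC below are hypothetical, chosen only to illustrate the score.

```python
# Sketch of the EEC match score of Eq. (2). An EEC graph is a pivot node
# plus (relation, entity) edges; a parse graph maps each pivot-like node
# to its own (relation, entity) edges.

def match_score(eec, parse):
    """Fraction of the EEC's relations found around the matched pivot."""
    pivot, relations = eec["pivot"], eec["relations"]
    if pivot not in parse:        # pivot frame absent: no match at all
        return 0.0
    parse_relations = set(parse[pivot])
    matched = sum(1 for r in relations if r in parse_relations)
    return matched / len(relations)

# Hypothetical "Killing by Disease" EEC: a Death pivot with two relations
eec = {"pivot": "Death", "relations": [("cause", "Disease"), ("victim", "People")]}
parse = {"Death": [("cause", "Disease"), ("victim", "People"), ("place", "City")]}
print(match_score(eec, parse))  # 1.0 -- both EEC relations matched (2/2)
```

EECs with a score greater than zero would then be kept as the sentence's emotion elicitation contexts, as described above.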
By analyzing the corpus, we have defined 62 EECs, some of which are presented in Table 6.
Table 6. Examples of EECs

Killing by Mass Destruction Entity | Death by Accident | Punishing for Illegal Act
Death by Catastrophe | Killing by Relatives | Performing Illegal Act
Suffering from Disease | Death by Suicide | Reduction in Harmful Entity
Growth of Positive Entity | Appreciation of Entity | Outbreak of Disease
5 Experiments
As we are dealing with fuzzy annotated data, a fuzzy classification framework
has been used for developing the emotion classification models. In this section,
we present results pertaining to experiments with different feature combinations
in FkNN framework.
5.1 Fuzzy k Nearest Neighbor (FkNN) for Emotion Classification
FkNN [13] is a fuzzy extension of the popular k nearest neighbor algorithm; it
assigns class memberships to a test instance. Let S = {s1, s2, . . . , sn} be a set of n
labeled training instances and µij be the membership of training instance
sj in the ith class. In order to assign the membership of a test instance st in the ith
class, the k nearest neighbors of st in the training data set are found with a distance
measure. The membership of st in the ith class is determined using Equation 3.
measure. The membership of st in ith class is determined using equation 3.
∑k
µi (st ) =
2
(f −1)
j=1 µij (1/∥st − sj ∥
2
∑k
(f −1) )
j=1 (1/∥st − sj ∥
)
(3)
The variable f controls the extent to which the distance is weighted while computing a neighbor’s contribution to the membership value. Number of nearest
neighbors (k) is another parameter in FkNN algorithm.
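A self-contained sketch of Eq. (3) is given below. The brute-force neighbor search and the small eps guard against zero distances are implementation choices of this sketch, not of the paper.

```python
# Sketch of the FkNN membership computation of Eq. (3).
import math

def fknn_memberships(train, test_point, k=5, f=1.5, eps=1e-9):
    """train: list of (feature_vector, membership_vector) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # k nearest labeled instances (brute force, fine for a sketch)
    neighbors = sorted(train, key=lambda s: dist(s[0], test_point))[:k]
    n_classes = len(neighbors[0][1])
    exponent = 2.0 / (f - 1.0)
    # Inverse-distance weights; eps avoids division by zero on exact matches
    weights = [1.0 / (dist(x, test_point) ** exponent + eps) for x, _ in neighbors]
    total = sum(weights)
    return [
        sum(w * mu[i] for w, (_, mu) in zip(weights, neighbors)) / total
        for i in range(n_classes)
    ]

# Toy 1-D data with fuzzy memberships in two classes
train = [([0.0], [1.0, 0.0]), ([0.1], [0.9, 0.1]), ([1.0], [0.0, 1.0])]
mu = fknn_memberships(train, [0.05], k=3)
print(mu)  # memberships sum to 1 and strongly favor the first class
```

Because each training membership vector sums to one, the predicted vector also sums to one, which is what lets the α-cut of Section 5.2 be applied directly.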
5.2 Evaluation Measures
As mentioned earlier, reader-perspective emotion analysis is a fuzzy,
multi-label classification task. The classification model outputs a membership
vector for each test instance, where the ith entry µi (0 ≤ µi ≤ 1) is the predicted
membership value in the ith class. The evaluation of the fuzzy membership value
prediction is performed by measuring the Euclidean distance between the predicted
and actual membership vectors. The evaluation measures that are applicable
to the multi-label classification task can also be applied here by converting the real-valued prediction vector into a binary prediction vector. This kind of conversion
is performed by applying an α-cut with α = 0.4. The evaluation measures used in
this study are presented in Table 7.
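The α-cut conversion and one of the example-based measures (Hamming loss) can be sketched as follows. Whether the cut is strict or inclusive at exactly α is not specified in the paper, so ≥ is assumed here.

```python
# Sketch of the Sec. 5.2 evaluation setup: an alpha-cut turns the predicted
# fuzzy membership vector into a binary multi-label prediction, which
# example-based measures such as Hamming loss can then score.

def alpha_cut(memberships, alpha=0.4):
    # Inclusive threshold assumed; the paper does not specify >= vs >
    return [1 if mu >= alpha else 0 for mu in memberships]

def hamming_loss(y_true, y_pred):
    """Fraction of label slots where prediction and truth disagree."""
    assert len(y_true) == len(y_pred)
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

predicted = [0.7, 0.3, 0.5, 0.1]   # memberships in the 4 emotion classes
actual = [1, 0, 1, 1]
binary = alpha_cut(predicted)       # [1, 0, 1, 0]
print(binary, hamming_loss(actual, binary))  # [1, 0, 1, 0] 0.25
```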
5.3 Experimental Results
Experiments have been performed with different feature combinations. All
experiments used f = 1.5 and k = 5. Results are reported for a 5-fold cross
validation setting for each experiment. Table 8 summarizes the results of emotion
classification with different features and their combinations; the best results are
obtained with the P+EEC combination.
When the different features are considered separately, the performance of
the emotion classifier with polarity feature (P) deteriorated as compared to the
Table 7. Evaluation measures (↑ = higher value is better, ↓ = lower value is better).

Example based measures [14]: Hamming Loss (HL) ↓, Partial Match Accuracy (P-Acc) ↑, Subset Accuracy (S-Acc) ↑, F1 Measure (F1) ↑
Ranking based measures [15]: One Error (OE) ↓, Coverage (COV) ↓, Ranking Loss (RL) ↓, Average Precision (AVP) ↑
Label based measures [14]: Micro Average F1 (Micro-F1) ↑, Macro Average F1 (Macro-F1) ↑
Fuzzy membership measure: Euclidean distance (ED) ↓
Table 8. Comparison of features (W = word feature, P = polarity feature, SF =
semantic frame feature, EEC = emotion elicitation context)

Measure    W      P      SF     EEC    W+P    P+SF   W+SF   W+P+SF  P+EEC
HL         0.136  0.186  0.092  0.063  0.121  0.080  0.099  0.109   0.051
P-Acc      0.645  0.562  0.781  0.864  0.705  0.823  0.757  0.728   0.894
S-Acc      0.574  0.514  0.692  0.791  0.612  0.740  0.668  0.644   0.832
F1         0.679  0.628  0.803  0.899  0.750  0.862  0.800  0.769   0.925
OE         0.180  0.213  0.116  0.059  0.167  0.090  0.129  0.158   0.061
COV        0.931  1.105  0.719  0.547  0.868  0.656  0.767  0.795   0.484
RL         0.163  0.194  0.081  0.053  0.136  0.061  0.104  0.092   0.041
AVP        0.850  0.796  0.884  0.925  0.864  0.922  0.889  0.874   0.957
Micro-F1   0.688  0.641  0.781  0.827  0.729  0.816  0.750  0.737   0.877
Macro-F1   0.615  0.613  0.743  0.818  0.652  0.771  0.681  0.665   0.857
ED         0.282  0.302  0.253  0.185  0.284  0.218  0.250  0.260   0.151
baseline classifier (using the word feature (W)) for all the evaluation metrics. This
shows how important the terms present in the text are for emotion classification. The use of semantic frames (SF) as features improves the performance of
emotion classification significantly. This improvement may be attributed to two
different transformations over the word feature set.
– Dimensionality reduction: There is a significant reduction in the dimension
of the semantic frame feature set as compared to the word feature set (semantic
frame feature dimension = 279; word feature dimension = 2345).
– Feature generalization: Semantic frame assignment to the terms in the sentences is a generalization technique where conceptually similar terms are
grouped into a semantic frame.
On the other hand, a notable improvement has been observed with the use of
EEC features. As contextual information is encoded in the EEC feature, it is
more powerful than the semantic frame features. A reduction in dimension with
respect to the semantic frame feature is also observed in the case of the EEC feature.
General observations from the feature comparison experiment are as follows.
– The P+EEC feature combination performs best in emotion classification
with Fuzzy kNN. The EEC feature alone performs closest to P+EEC as compared
to the other feature combinations.
– The polarity feature (P) is less effective than the other combinations, but whenever
it is coupled with other features (i.e., W vs. W+P, SF vs. SF+P,
W+SF vs. W+SF+P and EEC vs. P+EEC), the performance improves.
This improvement can be explained by the fact that the polarity feature
may help the word or semantic frame based models by separating the data
set into positive and negative categories.
6 Conclusions
In this paper, we have presented a fuzzy classification model based on different
proposed features in order to perform reader-perspective emotion analysis. The
problem of reader-perspective emotion recognition has been posed as a fuzzy
classification problem. We have introduced three new features, namely polarity,
semantic frame and emotion eliciting context based features. Extensive experiments with different feature combinations have been performed, and the best
performance was achieved with the EEC and polarity feature combination.
Acknowledgment
Plaban Kumar Bhowmick is partially supported by projects sponsored by
Microsoft Corporation, USA, and Media Lab Asia, India.
References
1. Lin, K.H.Y., Yang, C., Chen, H.H.: Emotion classification of online news articles
from the reader’s perspective. In: Web Intelligence. (2008) 220–226
2. Lin, K.H.Y., Chen, H.H.: Ranking reader emotions using pairwise loss minimization
and emotional distribution regression. In: Proceedings of the 2008 Conference on
Empirical Methods in Natural Language Processing, Honolulu, Hawaii, Association
for Computational Linguistics (2008) 136–144
3. Kozareva, Z., Navarro, B., Vazquez, S., Montoyo, A.: Ua-zbsa: A headline emotion
classification through web information. In: Proceedings of the Fourth International
Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, Association for Computational Linguistics (2007) 334–337
4. Mihalcea, R., Liu, H.: A corpus-based approach to finding happiness. In: In AAAI
2006 Symposium on Computational Approaches to Analysing Weblogs, AAAI
Press (2006) 139–144
5. Alm, C.O., Roth, D., Sproat, R.: Emotions from text: machine learning for text-based emotion prediction. In: HLT ’05: Proceedings of the conference on Human
Language Technology and Empirical Methods in Natural Language Processing,
Morristown, NJ, USA, Association for Computational Linguistics (2005) 579–586
6. Jung, Y., Park, H., Myaeng, S.H.: A hybrid mood classification approach for blog
text. In: PRICAI. (2006) 1099–1103
7. Wu, C.H., Chuang, Z.J., Lin, Y.C.: Emotion recognition from text using semantic labels and separable mixture models. ACM Transactions on Asian Language
Information Processing (TALIP) 5 (2006) 165–183
8. Abbasi, A., Chen, H., Thoms, S., Fu, T.: Affect analysis of web forums and blogs
using correlation ensembles. IEEE Transactions on Knowledge and Data Engineering 20 (2008) 1168–1180
9. Strapparava, C., Mihalcea, R.: Semeval-2007 task 14: Affective text. In: Proceedings of the 4th International Workshop on the Semantic Evaluations (SemEval
2007), Prague, Czech Republic (2007)
10. Chaumartin, F.R.: Upar7: A knowledge-based system for headline sentiment tagging. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Prague, Czech Republic, Association for Computational
Linguistics (2007) 422–425
11. Katz, P., Singleton, M., Wicentowski, R.: SWAT-MP: The SemEval-2007 systems for
task 5 and task 14. In: Proceedings of the Fourth International Workshop on
Semantic Evaluations (SemEval-2007), Prague, Czech Republic, Association for
Computational Linguistics (2007) 308–313
12. Fillmore, C.J.: Frames and the semantics of understanding. Quaderni di semantica
6 (1985) 222–254
13. Keller, J., Gray, M., Givens, J.: A fuzzy k−nearest neighbor algorithm. IEEE
Transactions on Systems Man and Cybernetics. 15 (1985) 580–585
14. Tsoumakas, G., Katakis, I.: Multi label classification: An overview. International
Journal of Data Warehouse and Mining 3 (2007) 1–13
15. Zhang, M.L., Zhou, Z.H.: ML-kNN: A lazy learning approach to multi-label learning.
Pattern Recognition 40 (2007) 2038–2048
Recognizing Textual Entailment:
Experiments with Machine Learning Algorithms
and RTE Corpora
Julio J. Castillo
Faculty of Mathematics Astronomy and Physics
Cordoba, Argentina
jotacastillo@gmail.com
Abstract. This paper presents a system that uses machine learning algorithms
and a combination of datasets for the task of recognizing textual entailment.
The features chosen quantify lexical, syntactic and semantic level matching
between the text and hypothesis sentences. Additionally, we created a filter which
uses a set of heuristics based on named entities to detect cases where no
entailment holds. We analyze how the different sizes of the datasets (RTE1,
RTE2, RTE3, RTE4 and RTE5) and the classifiers (SVM, AdaBoost, BayesNet,
MLP, and Decision Trees) impact the final overall performance of the
systems. We show that the system performs better than the baseline and the
average of the RTE systems on both the two-way and three-way tasks. We
conclude that using the RTE3 corpus with the Multilayer Perceptron algorithm, for both
the two-way and three-way RTE tasks, outperformed any other combination of RTE
corpora and classifiers in predicting the RTE4 test data.
Keywords: Natural Language Processing, Textual Entailment, Machine
Learning, RTE datasets.
1 Introduction
The objective of the Recognizing Textual Entailment (RTE) Challenge is to determine whether or not the meaning of a Hypothesis (H) can be inferred from a Text (T). Recently the RTE Challenge has changed to a three-way task that consists in deciding between entailment, contradiction, and unknown when there is no information to accept or reject the hypothesis. The traditional two-way distinction between entailment and non-entailment is still allowed.
In past RTE Challenges, machine learning algorithms were widely used for the task of recognizing textual entailment (Marneffe et al., 2006; Zanzotto et al., 2007). Thus, in this paper we test the most common classifiers that have been used by other researchers, in order to provide a common framework for evaluating ML algorithms (with the features fixed) and to show how the development data set can impact them. We generate a feature vector with the following components for both Text and Hypothesis: the Levenshtein distance, a lexical distance based on
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 155-164
Received 29/11/09
Accepted 16/01/10
Final version 09/03/10
156
Castillo J.
Levenshtein, a WordNet-based semantic similarity measure, and the LCS (longest common substring) metric.
We chose only four features to learn from the development sets. Larger feature sets do not necessarily lead to improved classification performance, because they can increase the risk of overfitting the training data. In Section 3 we provide a correlation analysis of these features.
The motivation for the input features is as follows. The Levenshtein distance is motivated by the good results it has obtained as a similarity measure between two strings. Additionally, we propose a lexical distance based on the Levenshtein distance but working at the sentence level. We create a WordNet-based metric in order to capture the semantic similarity between T and H at the sentence level. The longest common substring is selected because it is easy to implement and provides a good measure of word overlap.
The system produces feature vectors for all possible combinations of the available development data: RTE1, RTE2, RTE3 and RTE5. Weka (Witten and Frank, 2005) is used to train classifiers on these feature vectors. We experiment with the following five machine learning algorithms: Support Vector Machine (SVM), AdaBoost (AB), BayesNet (BN), Multilayer Perceptron (MLP), and Decision Trees (DT). Decision Trees are interesting because we can see which features were selected for the top levels of the trees. SVM, BayesNet and AdaBoost were selected because they are known for achieving high performance. MLP was used because it has achieved high performance in other NLP tasks.
We experimented with various parameter settings for the machine learning algorithms and report only the results for the best parameters.
For the two-way classification task, we use the RTE1, RTE2 and RTE3 development sets from the PASCAL RTE Challenge, the RTE5 gold standard, and the BPI test suite (Boeing, 2008). For the three-way task we use the RTE1, RTE2 and RTE3 development sets from the Stanford group, and the RTE5 gold standard set. Additionally, we generate the following development sets: RTE1+RTE2, RTE2+RTE3, RTE1+RTE3, RTE1+RTE5, RTE2+RTE5, RTE3+RTE5, RTE2+RTE3+RTE5, RTE1+RTE2+RTE3 and RTE1+RTE2+RTE3+RTE5, in order to train on different corpora of different sizes. In all cases, the RTE4 TAC 2008 gold standard dataset was used as the test set.
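As an illustration of this pipeline, the following sketch trains a Decision Tree (one of the five classifiers listed above) on toy four-feature vectors. The paper used Weka; scikit-learn is substituted here, and all feature values and labels are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors for T-H pairs, one row per pair:
# [levenshtein, lexical_distance, wordnet_sim, lcs], all in [0, 1]
X_train = [
    [0.10, 0.15, 0.90, 0.80],  # very similar pair -> entailment
    [0.20, 0.25, 0.85, 0.70],
    [0.80, 0.75, 0.30, 0.10],  # very different pair -> no entailment
    [0.90, 0.85, 0.20, 0.05],
]
y_train = [1, 1, 0, 0]  # 1 = entailment, 0 = no entailment

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Classify an unseen pair that resembles the entailment examples
pred = clf.predict([[0.15, 0.20, 0.88, 0.75]])
print(pred[0])  # -> 1
```

In the actual system, each row would be the four-feature vector computed for one text-hypothesis pair of an RTE development set.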
The remainder of the paper is organized as follows. Section 2 describes the architecture of our system, whereas Section 3 presents the results of the experimental evaluation and a discussion of them. Finally, Section 4 summarizes the conclusions and lines for future work.
2 System description
This section provides an overview of our system, which was evaluated in the Fourth PASCAL RTE Challenge. The system is based on a machine learning approach to recognizing textual entailment. Figure 1 presents a brief overview of the system.
Using a machine learning approach, we tested different classifiers in order to classify the RTE4 test pairs into three classes: entailment, contradiction or unknown. To deal with RTE4 as a two-way task, we needed to convert this corpus into only two classes: yes and no. For this purpose, both contradiction and unknown were taken as the class no.
Recognizing Textual Entailment: Experiments...
157
Figure 1. General architecture of our system.
There are two variants for dealing with each particular text-hypothesis pair or instance. The first directly uses four features: (1) the Levenshtein distance between each pair, (2) a lexical distance based on Levenshtein, (3) a semantic distance based on WordNet, and (4) their longest common substring. The second uses the "NER preprocessing module" to determine whether non-entailment holds between text and hypothesis, thus differing only in the treatment of Named Entities. The Levenshtein distance (Levenshtein, 1966) is computed among the characters of the stemmed Text and Hypothesis strings. The other three features are detailed below.
Text-hypothesis pairs are stemmed with Porter’s stemmer and PoS tagged with the
tagger in the OpenNLP framework.
2.1 NER Filter
The system applies a filter based on Named Entities. The purpose of the filter is to identify those pairs where the system is confident that no entailment relation holds.
Thus, the NER preprocessing module performs NER on the text-hypothesis pairs, applying several heuristic rules to detect when an entailment relation is not found in the pair. In this case, a specialized classifier, SVM2, was trained only on the contradiction and unknown cases of the RTE3 corpus and used to classify the pairs between these two classes.
We employed the following set of heuristic rules: for each type of Named Entity (person, organization, location, etc.), if there is a NE of this type occurring in H that does not occur in T, then the pair does not convey an entailment and therefore should be classified as either contradiction or unknown.
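The rule above can be sketched as a simple set comparison per entity type; the NER output format (dictionaries mapping entity types to sets of entity strings) and all entity values below are assumptions for illustration, not the system's actual data structures:

```python
def ner_filter(text_entities, hyp_entities):
    """Return True when the heuristic signals NO entailment:
    some named entity in H does not occur in T (per entity type)."""
    for ne_type, hyp_names in hyp_entities.items():
        text_names = text_entities.get(ne_type, set())
        if hyp_names - text_names:  # an NE in H with no counterpart in T
            return True
    return False

# Hypothetical NER output for a T-H pair (entity type -> entity strings)
text_nes = {"person": {"Walt", "Roy"}, "location": {"Los Angeles", "Honolulu"}}
hyp_nes = {"location": {"Paris"}}

print(ner_filter(text_nes, hyp_nes))  # -> True: 'Paris' never occurs in T
```

Pairs flagged this way would then be passed to the specialized classifier to decide between contradiction and unknown.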
The text-hypothesis pairs are tokenized with the tokenizer of the OpenNLP framework and stemmed with Porter's stemmer. We also enhanced this NER preprocessing module with an acronym database (British Atmospheric Data Centre (BADC)).
The module was applied to approximately 10 percent of the text-hypothesis pairs of RTE4. The accuracy of the filter evaluated at TAC'08 was 0.71, with 66 cases correctly classified out of the 92 where the rules applied. An error analysis revealed that the misclassified cases were indeed difficult ones, as in the following example (pair 807, RTE4):
Text:
Large scores of Disney fans had hoped Roy would read
the Disneyland Dedication Speech on the theme park's
fiftieth birthday next week, which was originally read by
Walt on the park's opening day, but Roy had already
entered an annual sailing race from Los Angeles to
Honolulu.
Hypothesis:
Disneyland theme park was built fifty years ago.
We plan to extend this module so it can also be used to filter cases where an
entailment between text and hypothesis can be reliably identified via heuristic rules.
2.2 Lexical Distance
We use the standard Levenshtein distance as a simple measure of how different two text strings are. This distance quantifies the number of character-based changes needed to generate one text string from the other; for example, how many changes are necessary in the hypothesis H to obtain the text T. For identical strings, the distance is 0.
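The standard dynamic-programming recurrence for this distance can be sketched as follows (a generic implementation, not the system's actual code):

```python
def levenshtein(s: str, t: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
print(levenshtein("text", "text"))       # -> 0 for identical strings
```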
Additionally, using the Levenshtein distance we define a lexical distance, computed by the following procedure:
1. Each string, T and H, is divided into a list of tokens.
2. The similarity between each pair of tokens in T and H is computed using the Levenshtein distance.
3. The string similarity between the two lists of tokens is reduced to the problem of bipartite graph matching, solved by applying the Hungarian algorithm over this bipartite graph. We then find the assignment that maximizes the sum of ratings of the tokens. Note that each graph node is a token of the list.
4. Finally, the final score is calculated by:

    finalscore = TotalSim / max(Length(T), Length(H))

where TotalSim is the sum of the similarities under the optimal assignment in the graph, Length(T) is the number of tokens in T, and Length(H) is the number of tokens in H.
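The full procedure can be sketched as below, assuming whitespace tokenization and a Levenshtein-based token similarity normalized to [0, 1]; SciPy's `linear_sum_assignment` implements the Hungarian algorithm (negating the matrix turns its minimization into the maximization we need):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lev(s, t):
    # Compact Levenshtein edit distance
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def token_sim(a, b):
    # Levenshtein similarity normalized to [0, 1] (an illustrative choice)
    m = max(len(a), len(b))
    return 1.0 if m == 0 else 1.0 - lev(a, b) / m

def lexical_score(text, hyp):
    t_toks, h_toks = text.split(), hyp.split()
    # Rating matrix of the bipartite graph: rows = T tokens, cols = H tokens
    sim = np.array([[token_sim(a, b) for b in h_toks] for a in t_toks])
    # Hungarian algorithm: assignment maximizing the total similarity
    rows, cols = linear_sum_assignment(-sim)
    total_sim = sim[rows, cols].sum()
    return total_sim / max(len(t_toks), len(h_toks))

print(round(lexical_score("the cat sat", "the cat sat"), 2))  # -> 1.0
```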
2.3 Wordnet Distance
WordNet is used to calculate the semantic similarity between a T and an H. The following procedure is applied:
1. Word sense disambiguation using the Lesk algorithm (Lesk, 1986), based on WordNet definitions.
2. A semantic similarity matrix between the words in T and H is defined. Only words in synonym and hypernym relationships are used. A breadth-first search is performed over these tokens; similarity is calculated using two factors: the length of the path and the orientation of the path.
3. To obtain the final score, we use the matching average.
The semantic similarity between two words (step 2) is computed as:

    Sim(s, t) = 2 × Depth(LCS(s, t)) / (Depth(s) + Depth(t))

where s and t are the source and target words being compared (s is in H and t is in T), Depth(s) is the shortest distance from the root node to the given node, and LCS(s, t) is the least common subsumer of s and t.
The matching average (step 3) between two sentences X and Y is calculated as follows:

    MatchingAverage = 2 × Match(X, Y) / (Length(X) + Length(Y))
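To make the depth-based formula concrete, here is a toy sketch over a hand-made hypernym hierarchy; the `PARENT` map is invented for illustration and WordNet itself is not consulted:

```python
# Toy hypernym hierarchy (child -> parent); the root is 'entity'.
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "poodle": "dog",
}

def ancestors(word):
    chain = []
    while word is not None:
        chain.append(word)
        word = PARENT[word]
    return chain  # the word itself first, the root last

def depth(word):
    # Distance from the root to the node (the root has depth 0)
    return len(ancestors(word)) - 1

def lcs_node(s, t):
    # Least common subsumer: the deepest shared ancestor
    anc_t = set(ancestors(t))
    for node in ancestors(s):
        if node in anc_t:
            return node

def sim(s, t):
    return 2 * depth(lcs_node(s, t)) / (depth(s) + depth(t))

print(sim("dog", "cat"))     # LCS is 'animal': 2*1/(2+2) = 0.5
print(sim("poodle", "dog"))  # LCS is 'dog':    2*2/(3+2) = 0.8
```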
2.4 Longest Common Substring
Given two strings, T of length n and H of length m, the Longest Common Substring (LCS) method (Dan, 1999) finds the longest string that is a substring of both T and H. It is found by dynamic programming.
    lcs(T, H) = Length(MaxComSub(T, H)) / min(Length(T), Length(H))
In all practical cases, min(Length(T), Length(H)) would be equal to Length(H). All
values will be numerical in the [0,1] interval.
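A generic dynamic-programming sketch of this metric (not the system's actual implementation):

```python
def lcs_len(t, h):
    """Length of the longest common substring of t and h."""
    n, m = len(t), len(h)
    # dp[i][j] = length of the common substring ending at t[i-1], h[j-1]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if t[i - 1] == h[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def lcs_score(t, h):
    # Normalized to [0, 1] by the length of the shorter string
    return lcs_len(t, h) / min(len(t), len(h))

print(lcs_score("abcdef", "zcdem"))  # common substring "cde" -> 3/5 = 0.6
```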
3 Experimental Evaluation and Discussion of the Results
We use the following combinations of datasets to deal with the two-way classification task: RTE1, RTE2, RTE3, RTE5 (gold set), BPI, RTE1+RTE2, RTE1+RTE3, RTE2+RTE3, RTE1+RTE5, RTE2+RTE5, RTE3+RTE5, RTE1+RTE2+RTE3, and RTE1+RTE2+RTE3+RTE5; and the following combinations to deal with the three-way classification task: RTE1¹, RTE2¹, RTE3¹, RTE5, RTE1+RTE2, RTE2+RTE3, RTE1+RTE3, RTE1+RTE5, RTE2+RTE5, RTE3+RTE5, RTE1+RTE2+RTE3, and RTE1+RTE2+RTE3+RTE5. We use the following five classifiers to learn every development set: Support Vector Machine, AdaBoost, BayesNet, Multilayer Perceptron (MLP) and Decision Tree, using the open source WEKA Data Mining Software (Witten & Frank, 2005). In all result tables we show only the accuracy of the best classifier. The RTE4 test set and the RTE5 gold set were converted to "RTE4 2-way" and "RTE5 2-way" by taking contradiction and unknown pairs as non-entailment, in order to assess the system in the two-way task. Tables 1 and 2 show the results for the two-way and three-way tasks, respectively.
Table 1. Results of two-way classification task.

Training Set                   Classifier      Acc %
RTE3                           MLP             58.4%
RTE3 + RTE5                    MLP             57.8%
RTE3 with NER module           SVM             57.6%
RTE2 + RTE3                    MLP             57.5%
RTE1 + RTE2 + RTE3             MLP             57.4%
RTE1 + RTE5                    MLP             57.2%
RTE5                           SVM             57.1%
RTE1 + RTE3                    Decision Tree   57.1%
RTE1 + RTE2 + RTE3 + RTE5      MLP             57.0%
RTE2 + RTE3 + RTE5             BayesNet        56.9%
RTE2 + RTE5                    Decision Tree   56.3%
RTE1 + RTE2                    Decision Tree   56.2%
RTE2                           AdaBoost        55.6%
RTE1                           AdaBoost        54.6%
Baselines                      -               50.0%
BPI                            BayesNet        49.8%

¹ Data set from the Stanford group.
Table 2. Results of three-way classification task.

Training Set                   Classifier   Acc %
RTE3                           MLP          55.4%
RTE2 + RTE3 + RTE5             MLP          55.3%
RTE1 + RTE3                    MLP          55.1%
RTE1 + RTE5                    SVM          54.9%
RTE3 + RTE5                    SVM          54.9%
RTE1 + RTE2 + RTE3             MLP          54.8%
RTE1 + RTE2                    SVM          54.7%
RTE2                           SVM          54.6%
RTE2 + RTE3                    MLP          54.6%
RTE5                           SVM          54.6%
RTE1 + RTE2 + RTE3 + RTE5      SVM          54.6%
RTE2 + RTE5                    SVM          54.5%
RTE1                           SVM          54.0%
RTE3 with NER module           SVM          53.8%
Baseline                       -            50.0%
Here we note that, in both classification tasks (two-way and three-way), using RTE3 instead of RTE2 or RTE1 always achieves better results. Interestingly, the RTE3 training set alone outperforms any other combination of RTE datasets, despite the increased corpus size. Thus, for training purposes, it seems that any dataset added to RTE3 introduces "noise" into the classification task.
Zanzotto et al. (2007) showed that RTE3 alone can produce higher results than training on RTE3 merged with RTE2 for the two-way task. Thus, it seems that it is not always true that more learning examples increase the accuracy of RTE systems. These experiments provide additional evidence for both classification tasks. However, this claim is still under investigation.
The RTE1 dataset always yields the worst results, perhaps because this dataset was collected from different text processing applications (QA, IE, IR, SUM, PP and MT), and our system does not take this into account.
In addition, a significant difference in performance of 3.8% and 8.6% was obtained using different corpora in the two-way classification task (without and with the BPI development set, respectively).
In the three-way task, a slight and not statistically significant difference of 1.4% between the best and worst combination of datasets and classifiers is found. This suggests that the combination of dataset and classifier has more impact on the two-way task than on the three-way task.
The best performance of our system was achieved with the Multilayer Perceptron classifier on the RTE3 dataset: 58.4% and 55.4% accuracy for the two-way and three-way tasks, respectively. The average difference between the best and the worst classifier over all datasets was 1.6% in the two-way task and 2.4% in the three-way task.
On the other hand, even though the SVM classifier does not come out as the favorite in either classification task, on average SVM is one of the best classifiers.
The performance in all cases was clearly above the baselines. Only when using BPI in the two-way classification did we obtain a result worse than the baseline; this is because BPI is syntactically simpler than the PASCAL RTE data and therefore seems not to be a good enough training set for a machine learning algorithm.
Although the best results were obtained without the Named Entity preprocessing module, we believe these results could be enhanced. The accuracy of this module was 71%, but additional analysis of the misclassified instances provides evidence that it could be improved to almost 80% (e.g., by improving the acronym database, knowledge base information, etc.), and thus it could have a positive impact on the overall performance of the system.
With the aim of analyzing feature dependency, we calculated the correlation between the features. Correlation and causation are connected, because correlation is necessary for causation to be proved. The correlation matrix of the features is shown below:
Table 3. Correlation matrix of features.

Feature      1         2         3         4
1            -         0.8611    0.6490    0.2057
2            0.8611    -         0.6951    0.0358
3            0.6490    0.6951    -         0.1707
4            0.2057    0.0358    0.1707    -
The table shows that features (1) and (2) are strongly correlated, so we experimented with eliminating feature (1) to assess the effect on the overall performance under cross validation, and found that accuracy slightly decreases, by 1%. Similar results are obtained by eliminating feature (2).
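The correlation matrix can be reproduced in outline with NumPy; the feature matrix below is invented toy data standing in for the real per-pair feature vectors:

```python
import numpy as np

# Hypothetical feature matrix: one row per T-H pair, columns are the
# four features (Levenshtein, lexical distance, WordNet sim, LCS).
features = np.array([
    [0.10, 0.15, 0.90, 0.80],
    [0.20, 0.22, 0.85, 0.72],
    [0.55, 0.60, 0.50, 0.40],
    [0.80, 0.78, 0.30, 0.15],
    [0.90, 0.88, 0.25, 0.10],
])

# np.corrcoef correlates rows by default, so transpose to correlate columns
corr = np.corrcoef(features.T)
print(corr.shape)        # (4, 4)
print(corr[0, 1] > 0.9)  # features 1 and 2 strongly correlated in this toy data
```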
Finally, we assessed our system using ten-fold cross validation on every corpus, testing our five classifiers on both classification tasks. The results are shown in Tables 4 and 5 below.
Table 4. Results obtained with ten-fold cross validation in the three-way task.

Training Set                   Classifier      Acc %
RTE3                           MLP             65.5%
RTE3 + RTE5                    MLP             61.42%
RTE2 + RTE3                    MLP             60.68%
RTE1 + RTE2 + RTE3             MLP             59.35%
RTE2 + RTE3 + RTE5             SVM             58.72%
RTE1 + RTE2 + RTE3 + RTE5      SVM             57.74%
RTE5                           MLP             57.16%
RTE2                           SVM             56.62%
RTE1 + RTE2                    SVM             55.84%
RTE2 + RTE5                    MLP             55.28%
RTE1                           Decision Tree   54.70%
RTE1 + RTE5                    MLP             53.88%
Table 5. Results obtained with ten-fold cross validation in the two-way task.

Training Set                   Classifier   Acc %
RTE3                           BayesNet     67.85%
BPI                            BayesNet     64.00%
RTE1 + RTE2 + RTE3             MLP          63.16%
RTE3 + RTE5                    BayesNet     63.07%
RTE2 + RTE3 + RTE5             MLP          61.77%
RTE1 + RTE2 + RTE3 + RTE5      AdaBoost     60.91%
RTE5                           SVM          60.33%
RTE2                           SVM          60.12%
RTE1 + RTE2                    MLP          59.79%
RTE2 + RTE5                    BayesNet     58.78%
RTE1                           SVM          57.83%
RTE1 + RTE5                    SVM          56.93%
The results on the test set are worse than those obtained on the training set, which is most probably due to overfitting of the classifiers and to possible differences between these datasets. As before, RTE3 outperforms any other combination of datasets in ten-fold cross validation for both the two-way and three-way tasks.
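Ten-fold cross validation of this kind can be sketched with scikit-learn (substituted for Weka) on synthetic, cleanly separable stand-in data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for RTE feature vectors: 20 "entailment" pairs with
# high similarity scores and 20 "non-entailment" pairs with low scores.
pos = rng.uniform(0.6, 1.0, size=(20, 4))
neg = rng.uniform(0.0, 0.4, size=(20, 4))
X = np.vstack([pos, neg])
y = np.array([1] * 20 + [0] * 20)

# Ten-fold cross validation, one accuracy score per fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(len(scores), scores.mean())  # 10 folds; 1.0 on this separable toy data
```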
4 Conclusions
We presented our RTE system, which is based on a wide range of machine learning classifiers and datasets. Regarding development sets, we note that the results obtained using RTE3 were very similar to those obtained with the unions RTE1+RTE2+RTE3 and RTE3+RTE5, for both the two-way and three-way tasks. Thus the claim that using more training material helps seems not to be supported by these experiments.
Additionally, we concluded that the relatively similar performance of RTE3 with and without the NER preprocessing module suggests that further refinement of the heuristic rules could achieve better results.
Although we did not present here an exhaustive comparison of all available datasets and classifiers, we can conclude that the choice of RTE datasets and classifier has more impact on the two-way task than on the three-way task, in almost all the experiments we performed. In fact, the use of RTE3 alone improves the performance of our system. Thus, we conclude that the RTE3 corpus, for both the two-way and three-way tasks, outperforms any other combination of RTE corpora when using the Multilayer Perceptron classifier.
Future work is oriented toward experimenting with additional lexical and semantic similarity features and testing the improvements they may yield. Additional work will focus on improving the performance of our NE preprocessing module.
References
1. Prodromos Malakasiotis and Ion Androutsopoulos. Learning Textual Entailment using
SVMs and String Similarity Measures. ACL-PASCAL, Prague (2007)
2. Julio Javier Castillo, and Laura Alonso. An approach using Named Entities for Recognizing
Textual Entailment. TAC 2008, Maryland, USA (2008)
3. M. Lesk. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In SIGDOC '86 (1986)
4. Gusfield, Dan. Algorithms on Strings, Trees and Sequences. CUP (1999)
5. V. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals.
Soviet Physics Doklady, 10:707 (1966)
6. Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and
techniques, 2nd Edition, Morgan Kaufmann, San Francisco (2005)
7. D. Inkpen, D. Kipp and V. Nastase. Machine Learning Experiments for Textual Entailment.
RTE2 Challenge, Venice, Italy (2006)
8. F. Zanzotto, Marco Pennacchiotti and Alessandro Moschitti. Shallow Semantics in Fast
Textual Entailment Rule Learners, RTE3, Prague (2007)
9. Marie-Catherine de Marneffe, Bill MacCartney, Trond Grenager, Daniel Cer, Anna Rafferty
and Christopher D. Manning. Learning to distinguish valid textual entailments. RTE2
Challenge, Italy (2006)
Text and Speech Generation
Discourse Generation from Formal Specifications
Using the Grammatical Framework, GF
Dana Dannélls
NLP Research Unit, Department of Swedish Language,
University of Gothenburg, Sweden
dana.dannells@svenska.gu.se
Abstract. Semantic web ontologies contain structured information that does not have discourse structure embedded in it. Hence, it becomes increasingly hard to devise multilingual texts that humans comprehend. In this paper we show how to generate coherent multilingual texts from formal representations using discourse strategies. We demonstrate how discourse structures are mapped to GF's abstract grammar specifications, from which multilingual descriptions of works of art are generated automatically.
Key words: MLG, Ontology, Semantic Web, CIDOC-CRM, Cohesion, Discourse strategies, Functional programming.
1 Introduction
During the past few years there has been a tremendous increase in promoting metadata standards to help different organizations and groups, such as libraries, museums, biologists, and scientists, to store and make their material available to a wide audience through the use of the metadata model RDF (Resource Description Framework) or the Web Ontology Language (OWL) [1, 2]. Web ontology standards offer users direct access to ontology objects; they also provide good ground for information extraction, retrieval and language generation that can be exploited to produce textual descriptions tailored to museum visitors. These advantages have brought new challenges to the Natural Language Generation (NLG) community, which is concerned with the process of mapping from some underlying representation of information to a presentation of that information in linguistic form, whether textual or spoken. As the logical structure of ontologies becomes richer, it becomes increasingly hard to devise appropriate textual presentations in several languages that humans comprehend [3].
In this article we argue that discourse structures are necessary to generate natural language from semantically structured data. This argument is based on our investigations of text cohesion and syntactic phenomena across comparable English, Swedish and Hebrew texts. The use of a discourse strategy implies that a text is generated by selecting and ordering information out of the underlying domain ontology, a process which provides the resulting text with
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 167-178
Received 19/11/09
Accepted 16/01/10
Final version 09/03/10
168
Dannélls D.
fluency and cohesion. It is an approach that relies on principles drawn from both linguistics and computer science to enable automatic translation of ontology specifications to natural language. We demonstrate how discourse structures are mapped to GF's abstract grammar specifications, from which multilingual descriptions of works of art are generated automatically. GF is a grammar formalism with several advantages which make it suitable for this task; we motivate the benefits GF offers for multilingual language generation. In this work, we focus on the cultural heritage domain, employing the ontology codified in the CIDOC Conceptual Reference Model (CRM).
The organization of this paper is as follows. We present some of the principles of cohesive text structure (Section 2) and outline the difficulties of following these principles when generating from a domain ontology (Section 3). We show how discourse strategies can bridge the gap between formal specifications and natural language and suggest a discourse schema characteristic of the cultural heritage domain (Section 4). We demonstrate our grammar approach to generating multilingual object descriptions automatically (Section 5). We conclude with a summary and pointers to future work (Section 6).
2 Global and Local Text Structure
Early work on text and context [4] has shown that cultural content is reflected in language in terms of text as a linguistic category of genre, or text type. A text type is defined by the concept of Generic Structure Potential (GSP) [5]. According to this definition, any text, whether written or spoken, comprises a series of optional and obligatory macro (global) structural elements sequenced in a specific order, and the obligatory elements define the type to which a text belongs. The text type examined here is written for the purpose of describing works of art in a museum.
To find the generic structure potential of written object descriptions, we examined a variety of object descriptions written by four different authors in varying styles. Our empirical evidence suggests that there is a typical generic structure potential for work-of-art descriptions, with the following semantic groupings:
1. object's title, date of execution, creation place
2. name of the artist (creator), year of birth/death
3. inventory number when entered into the museum, collection name
4. medium, support and dimensions (height, width)
5. subject origin, dating, function, history, condition.
To produce a coherent text structure for an object description, the author must follow these semantic specification sequences, which convey the macro structure of the text. Apart from the macro structural elements, there is a micro (local) integration among the semantic units of the text type that gives the text unity. These types are reflected in reference types that may serve in making a
Discourse Generation from Formal Specifications...
169
text cohesive at the paragraph or embedded discourse level. Some examples of reference types are: conjunction, logical relationships between parts of an argument, consistency of grammatical subject, lexical repetition, and consistency of temporal and spatial indicators. Thus local structure is expressed partly through the grammar and partly through the vocabulary.
3 The Realities of a Domain Specific Ontology
The ontology we utilize is the Erlangen CRM, an OWL-DL (Description Logic) implementation of the International Committee for Documentation Conceptual Reference Model (CIDOC-CRM) [6].¹ The CIDOC-CRM is an event-centric core domain ontology that is intended to facilitate the integration, mediation and interchange of heterogeneous cultural heritage information and museum documentation.² One of the basic principles in the development of the CIDOC-CRM has been to have empirical confirmation for the concepts in the model; that is, for each concept there must be evidence from actual, widely used data structures. Even though the model was initially based on data structures in museum applications, most of the classes and relationships are surprisingly generic. In the following we use this model to illustrate the limitations imposed by a domain-specific ontology on generation, where concepts and relationships cannot easily be mapped to natural language.
According to the CIDOC-CRM specifications, a museum object is represented as an instance of the concept E22.Man Made Object, which has several properties, including:³ P55.has current location, P108B.has dimension, P45F.consists of, P101F.had general use, P108B.was produced by. A concrete example of a formal specification (presented in Turtle notation) of the RestOntheHunt PE34604 object, modeled according to the CIDOC Documentation Standards Working Group (DSWG), is given in Figure 1.
Taking the domain ontology structure as the point of departure, the information at hand is an unordered set of statements, each conveying a piece of information about an object. The information the RestOntheHunt PE34604 statements convey spans at least four of the semantic sequences that we outlined in Section 2. To generate a coherent text, some ordering constraints must be imposed upon them. This is particularly important because a statement may map to an additional set of statements about an object; for example, the relationship P108B.was produced by maps to an instance of the concept E12.Production, which has the following properties: P14F.carried out by, P7F.took place at, P4F.has time span.
¹ The motivation behind the choice of DL is that it allows tractable reasoning and inference; it ensures decidability, i.e. a question about a concept in the ontology can always be answered; and it supports the intuition that the model must be clear, unambiguous and machine-processable. These aspects are particularly important in a computational setting, where we would like our logic to be processed automatically.
² The model was accepted by ISO in 2006 as ISO 21127.
³ Property is a synonym for relationship that maps between two instances. In this paper we use the term statement to refer to a relationship between instances.
Fig. 1. Formal specification of a museum object modeled in the CIDOC-CRM.
4 From Formal Specifications to Coherent Representation
As we pointed out in the previous section, the structure of the ontology is not a good point of departure for producing coherent texts and therefore requires pre-processing. In broad terms, this involves taking a set of information elements to be presented to a user and imposing upon this set a structure which provides the resulting text with fluency and cohesion.
Some of the pre-processing steps that have been suggested by previous authors [7, 8] include removing repetitive statements that have the same property and arguments, and grouping together similar statements to produce a coherent summary. Although there is a need to select statements that mirror linguistic complexity [9], most authors focus on the semantics of the ontology rather than on the syntactic form of the language. They assume that the ontology structure is appropriate for natural language generation, an assumption which in many cases only applies to English.
In this section we describe the approach we exploit to learn how the ontology statements are realized and combined in naturally occurring texts. We perform a domain-specific text analysis; texts are studied through text linguistics, by which the critic seeks to understand the relationships between sections of the author's discourse.
4.1 Linking Statements to Lexical Units
When text generation proceeds from a formal representation to natural language output, the elements of the representation need to be somehow linked
to lexical items of the language. We examined around 100 object descriptions in English, Swedish and Hebrew and studied how statements are ordered, lexicalised and combined in the discourse. To capture the distribution of discourse entities across text sentences we performed a semantic and syntactic analysis, taking as our unit of analysis the traditional sentence, i.e. a main clause with accompanying subordinate and adjunct clauses. Below we exemplify how the ontology statements are mapped to lexical items in the studied texts.⁴
Statements:
1. P55F.has current location maps between instances of E22.Man-Made-Object and instances of E53.Place (see line 22, Figure 1)
2. P52F.has current owner maps between instances of E22.Man-Made-Object and instances of E40.Legal Body (see line 9, Figure 1)
3. P82F.at some time within maps between instances of E52.Time-Span and String data values.
Text examples:
Eng> The subject made its first appearance [in 1880]P82F. It is [now installed]P52F in the Wallace Collection[,]P55F London.
Swe> Först [på 1900 talet]P82F kom den till Sverige och [hänger nu på]P55F Gripsholms slott [i]P52F Statens porträttsamling.
Heb> ha-tmuwnah hegieh larisunah le-Aeretz yisraAel [be-snat 1960]P82F. hyA [sayeket le]P52F-quwleqitzyah sel Amir bachar [se-nimtzet]P55F be-muwzeyAuwn haAretz be-tel Aabiyb
These text examples exhibit a few local linguistic differences between the languages. In English and Hebrew the order of the statements is 3, 2, 1, while in the Swedish text it is 3, 1, 2. It is interesting to note how the domain entities and properties are lexicalized in the different languages. In all three languages the property P82F.at some time within is lexicalised with a prepositional phrase. On the other hand, the lexicalisation of the property P55F.has current location differs significantly. Furthermore, in the Swedish text all statements are realized in one single sentence; the statements are combined by simple syntactic aggregation using the conjunction och ’and’. In both the English and the Hebrew examples, statements 3 and 2 are realized as two sentences which are combined with a referring pronoun, i.e. it and hyA. When generating naturally occurring texts it is important to utilize a generation machinery that supports such syntactic variations. In section 5 we demonstrate how these variations are supported in the GF formalism.
Empirical representations of stereotypical clause structures such as those presented above not only provide evidence on how to pair ontology statements with lexical units according to language-specific patterns, but also guide template construction proceeding according to the organization of the domain semantics.
4 ISO-8859-8 Hebrew characters are transliterated into ASCII to enhance readability.
Table 1. Template specification that governs text structures of a cultural object in a museum.

Name  Template slot
T1    (a) object’s title | (b) object’s creator | (c) creation date | (d) creation place
T2    (a) creator date of birth | (b) creator date of death
T3    (a) object id | (b) object material | (c) object size
T4    (a) current owner | (b) current location | (c) catalogue date | (d) collection
T5    (a) object’s identifier | (b) identified place

4.2 Template Specifications
In section 2 we presented a typical five-stage GSP for a work of art object description. To guarantee that the selected statements follow this structure, we defined a sequence of templates describing the discourse structure; this approach was first introduced by [10]. Each sequence in a template consists of slots that correspond to a set of statements in the domain knowledge.
The template specification as a whole provides a set of ordering constraints over a pattern of statements in such a way that it may yield a fluent and coherent output text. The templates and slots are specified in Table 1.
4.3 A Discourse Schema
A discourse schema is an approach to text structuring through which particular organizing principles for a text are defined. It straddles the border between a domain representation and a well-defined structured specification of natural language that can be found through linguistic analysis. This idea is based on the observation that people follow certain standard patterns of discourse organization for different discourse goals in different domains.
Our text analysis has shown that certain combinations of statements are more appropriate for the communicative goal of describing a museum object. Following our observations, we defined a discourse schema, Description schema (see below), consisting of two rhetorical predicates, namely Identification–Property and Attributive–Property.5 The schema encodes communicative goals and structural relations in the analyzed texts. Each rhetorical predicate in the schema is associated with a set of templates (specified in Table 1). In the notation used to represent the schema, ’,’ indicates the mathematical relation and, ’{}’ indicates optionality, and ’/’ indicates alternatives.
Description schema:

Describe–Object ->
    Identification–Property /
    Attributive–Property
Identification–Property ->
    T1, {T2 / T3}
Attributive–Property ->
    T4 / T5

5 The notion of rhetorical predicates goes back to Aristotle, who presented predicates as assertions which a speaker can use for persuasive argument.
An example taken from one of the studied texts:
[T1b]Thomas Sully [T2](1783-1872) painted this half-length [T1a] Portrait of Queen Victoria [T1c] in 1838. The subject is now installed in the
[T4d] Wallace Collection, [T4b] London.
The first sentence, corresponding to the rhetorical predicate Identification–Property, captures four statements (comprising the relationships P82F.at some time within, P14F.carried out by, P108B.was produced by and P102.has title) that are combined according to local and global text cohesion principles.
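The Description schema above can be read as a small grammar over template names. As a toy illustration (a hypothetical Python encoding of our own, not part of the paper's implementation), the template sequences the schema licenses can be enumerated mechanically:

```python
from itertools import product

# Hypothetical encoding of the Description schema: each rule maps a
# predicate to a sequence of slots; a slot is a list of alternatives
# ('/'), and an optional slot ('{}') also admits None (left out).
SCHEMA = {
    "Describe-Object": [["Identification-Property", "Attributive-Property"]],
    "Identification-Property": [["T1"], ["T2", "T3", None]],  # T1, {T2 / T3}
    "Attributive-Property": [["T4", "T5"]],                   # T4 / T5
}

def expand(symbol):
    """Yield every template sequence the schema licenses for `symbol`."""
    if symbol not in SCHEMA:          # terminal template such as T1
        yield [symbol]
        return
    for choice in product(*SCHEMA[symbol]):
        seqs = [[]]
        for picked in choice:
            if picked is None:        # optional slot omitted
                continue
            seqs = [s + e for s in seqs for e in expand(picked)]
        yield from seqs

print(sorted(expand("Identification-Property")))
# [['T1'], ['T1', 'T2'], ['T1', 'T3']]
```

Expanding `Describe-Object` likewise yields all template orderings available to the content planner.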
5 Domain Dependent Grammar-Based Generation
After the information from the ontology has been selected and organized according to the pre-defined schema, it is translated to abstract grammar specifications. The grammar formalism is the Grammatical Framework (GF) [11], a formalism suited for describing both the semantics and syntax of natural languages. It is based on Martin-Löf’s type theory [12] and is particularly oriented towards multilingual grammar development and generation. GF allows the separation of language-specific grammar rules that govern both morphology and syntax while unifying as many lexicalisation rules as possible across languages. With GF it is possible to specify one high-level description of a family of similar languages that can be mapped to several instances of these languages. The formalism has been exploited in many natural language processing applications such as spoken dialogue systems [13], controlled languages [14] and generation [15].
GF distinguishes between abstract syntax and concrete syntax. The abstract syntax is a set of functions (fun) and categories (cat) that can be defined as semantic specifications; the concrete syntax defines the linearization of functions (lin) and categories (lincat) into strings that can be expressed by calling functions in the resource grammar.6 Each language in the resource grammar has its own module of inflection paradigms that defines the inflection tables of lexical units, and a module for specifying the syntactic constructions of the language. Below we present the abstract and concrete syntax of the rhetorical predicate Identification–Property presented in section 4.3.7 Figure 2 illustrates the abstract syntax tree of our abstract grammar, which reflects the semantics of the domain and is common to all languages.
6 A resource grammar is a fairly complete linguistic description of a specific language. GF has a resource grammar library that supports 14 languages.
7 The GF Resource Grammar API can be found at the following URL: <http://www.grammaticalframework.org/lib/doc/synopsis.html>.
Fig. 2. Abstract syntax tree for Rest on the Hunt was painted by John Miel in 1642.
abstract syntax
cat
  IdentificationMessage ; ObjTitle ; CreationProperty ; Artist ; TimeSpan ;
  CreationStatement ; ArtistClass ; TimeSpanClass ;
fun
  Identification : ObjTitle → CreationStatement → IdentificationMessage ;
  CreationAct : CreationStatement → TimeSpanClass → CreationStatement ;
  HasCreator : CreationProperty → ArtistClass → CreationStatement ;
  CreatorName : Artist → ArtistClass ;
  CreationDate : TimeSpan → TimeSpanClass ;
  Year : Int → TimeSpan ;
  RestOnTheHunt : ObjTitle ;
  JohnMiel : Artist ;
  Paint : CreationProperty ;
The abstract specification expresses the semantics of the ontology and is language independent. What makes the abstract syntax particularly appealing in this context is the ability to expand the grammar by simply adding new constants that share both common semantics and syntactic alternations. For example, Beth Levin’s [16] English Performance Verbs class contains a number of verbs that can be added as constants of type CreationProperty, such as draw and produce, as follows: Paint, Draw, Produce : CreationProperty.
GF offers a way to share similar structures of different languages in one parametrized module called a functor [17]. In our implementation the common structure of the concrete syntax for English and Swedish is shared in a functor. Since the function CreationDate is linearized differently, it is defined separately for each language. This is illustrated below.
incomplete concrete syntax8
lincat
  IdentificationMessage = S ;
  TimeSpanClass, ArtistClass = Adv ;
  TimeSpan = NP ;
  CreationStatement = VP ;
  CreationProperty = V2 ;
  ObjTitle, Artist = PN ;
lin
  Identification np vp = mkS pastTense (mkCl (mkNP np) vp) ;
  CreationAct vp compl = mkVP vp compl ;
  HasCreator v np = mkVP (passiveVP v) np ;
  CreatorName obj = mkAdv by8agent_Prep (mkNP obj) ;
  Year y = mkNP (SymbPN y) ;

concrete English syntax
  lin CreationDate obj = mkAdv in_Prep obj ;

concrete Swedish syntax
  lin CreationDate obj = mkAdv noPrep (mkCN year_N (mkNP obj)) ;
The lexicon is implemented as an interface module which contains oper names that are the labels of the record types. It is used by the functor and by each of the language-specific lexicons.
interface lexicon
oper
  year_N : N ;
  restOnTheHunt_PN : PN ;
  johnMiel_PN : PN ;
  paint_V2 : V2 ;

instance English lexicon
oper
  restOnTheHunt_PN = mkPN ["Rest on the Hunt"] ;
  johnMiel_PN = mkPN "John Miel" ;
  year_N = regN "year" ;
  paint_V2 = mkV2 "paint" ;

instance Swedish lexicon
oper
  restOnTheHunt_PN = mkPN ["Rastande jägare"] ;
  johnMiel_PN = mkPN "John Miel" ;
  year_N = regN "år" ;
  paint_V2 = mkV2 "måla" ;
8 The word incomplete suggests that the functor is not a complete concrete syntax by itself.
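The functor pattern, shared syntax rules parameterized over a lexicon and over the few rules that diverge per language, can be approximated at string level in plain Python. The following sketch is ours and purely illustrative (the real GF modules build full inflection tables, not strings):

```python
# A crude string-level sketch of GF's functor pattern: make_grammar is
# the shared "functor", parameterized over a lexicon and over the one
# rule (creation_date) that differs per language. All names are ours.
def make_grammar(lexicon, creation_date):
    def linearize():
        # "X was painted by Y in 1642" / "X blev målad av Y år 1642"
        return " ".join([
            lexicon["restOnTheHunt"],
            lexicon["was_painted"],
            lexicon["by"],
            lexicon["johnMiel"],
            creation_date("1642"),
        ])
    return linearize

english = make_grammar(
    {"restOnTheHunt": "Rest on the Hunt", "johnMiel": "John Miel",
     "was_painted": "was painted", "by": "by"},
    creation_date=lambda y: "in " + y,    # prepositional phrase, as in English
)
swedish = make_grammar(
    {"restOnTheHunt": "Rastande jägare", "johnMiel": "John Miel",
     "was_painted": "blev målad", "by": "av"},
    creation_date=lambda y: "år " + y,    # noun 'år' instead of a preposition
)

print(english())   # Rest on the Hunt was painted by John Miel in 1642
print(swedish())   # Rastande jägare blev målad av John Miel år 1642
```

Only the lexicon entries and the CreationDate analogue change between the two languages; everything else is shared, which is the point of the functor.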
In GF it is possible to build a grammar for new languages by using simple record types. In our case we implemented a small application grammar for Hebrew, i.e. concrete Hebrew, that uses the same abstract syntax as English and Swedish. In this module functions are linearized as strings, where records {s : Str} are used as the simplest type.9 We introduce the parameter type Gender with two values, Masc and Fem; these are used in table types to formalize inflection tables. In Hebrew, verb phrases are parameterized over gender and are therefore stored as an inflection table {s : Gender => Str}; noun phrases have an inherent gender that is stored in a record together with the linearized string {s : Str ; g : Gender}.10
concrete Hebrew syntax
lincat
  IdentificationMessage, TimeSpan, ArtistClass, TimeSpanClass = {s : Str} ;
  Artist, ObjTitle = {s : Str ; g : Gender} ;
  CreationProperty, CreationStatement = {s : Gender => Str} ;
lin
  Identification np vp = {s = np.s ++ vp.s ! np.g} ;
  CreationAct vp compl = {s = \\g => vp.s ! g ++ compl.s} ;
  HasCreator v obj = {s = \\g => v.s ! g ++ obj.s} ;
  CreatorName obj = {s = ["al yedey"] ++ obj.s} ;
  CreationDate obj = {s = ["be"] ++ obj.s} ;
  RestOnTheHunt = {s = ["menuhat tzayydym"] ; g = Fem} ;
  JohnMiel = {s = ["guwn miyAel"] ; g = Masc} ;
  Paint = {s = table {Masc => "tzuwyr" ; Fem => "tzuwyrah"}} ;
param
  Gender = Fem | Masc ;
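The records and gender tables above can be mimicked with Python dicts to see the agreement mechanism at work. This is a sketch of ours, not the GF code, using the paper's transliterated lexical entries:

```python
# Gender parameter values, as in the Hebrew concrete syntax.
Masc, Fem = "Masc", "Fem"

# Noun phrases: a record with a string and an inherent gender,
# mirroring {s : Str ; g : Gender}.
rest_on_the_hunt = {"s": "menuhat tzayydym", "g": Fem}
john_miel        = {"s": "guwn miyAel", "g": Masc}

# Verb: an inflection table over gender, mirroring {s : Gender => Str}.
paint = {Masc: "tzuwyr", Fem: "tzuwyrah"}

def creator_name(np):
    # CreatorName: 'al yedey' ++ obj.s
    return "al yedey " + np["s"]

def has_creator(vp, agent):
    # HasCreator: the result is still open in gender (a table).
    return {g: vp[g] + " " + creator_name(agent) for g in (Masc, Fem)}

def identification(np, vp):
    # Identification: the subject's inherent gender selects the verb form.
    return np["s"] + " " + vp[np["g"]]

print(identification(rest_on_the_hunt, has_creator(paint, john_miel)))
# menuhat tzayydym tzuwyrah al yedey guwn miyAel
```

The feminine subject selects tzuwyrah from the table, which is exactly the agreement the GF parameter type enforces.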
The complete grammar specifications yield the following text, in English,
Swedish and Hebrew:
Eng> Rest on the Hunt was painted by John Miel in 1642. The painting
is located in the Hallwyska museum in Stockholm.
Swe> Rastande jägare blev målad av John Miel år 1642. Tavlan hänger
på Hallwyska museet i Stockholm.
Heb> menuhat tzayydym tzuwyrah ’al yedey guwn miyAel be-1642. htmwnh
memukemet be-muwzeyAuwn hallwiska be-stukholm.
This kind of multi-level grammar specification maps non-linguistic information to linguistic representation in a way that supports local and global text
variations. For example, in the English and the Hebrew concrete syntax, the
sentence complement is realized as a prepositional phrase (signalled by the
prepositions in and be), but in the Swedish sentence the complement is realized as a noun phrase (signalled by the noun år). In the above example this is illustrated in the linearization of CreationDate.

9 The resource grammar for Hebrew is currently under development.
10 Hebrew has a more complex morphology than the one described here. However, in this implementation the grammar handles only gender agreement.

In the Swedish concrete syntax
no preposition is used (noPrep), and a different NP rule is applied to generate the noun phrase år 1642, i.e. CN → NP → CN. Lexical variations are supported by the grammar as well; for instance, the verb located is not a direct translation of the Swedish verb hänger ’hang’, but the interpretation of the verb in this context implies the same meaning, namely that the painting exists in the Hallwyska museum. The choice of lexical units is governed by the semantic structure of the ontology that is reflected in the abstract syntax.
While the functional orientation of isolated sentences is supported by GF concrete representations, there are cross-linguistic textual differences that we touched upon in section 4.1 and that are not yet covered in the grammar specifications, i.e. the patterns with which cohesive and coherent texts are created. In English, cohesive means comprise conjunction, substitution and ellipsis, which can frequently be used to realize a logical relation. In Swedish, cohesion is often realized with elliptical items, prepositional phrases, and/or punctuation. In Hebrew, cohesion is realized through the verbal form; usage of ellipsis and conjunctive elements is not common.
6 Conclusion
In this paper we have presented a grammar-driven approach for generating object descriptions from formal representations of a domain-specific ontology. We illustrated how the lexicons of individual languages pair ontology statements with lexical units which form the backbone of the discourse structure. We demonstrated how a schema-based discourse structure is mapped to an abstract grammar specification using the domain-specific ontology concepts and properties.

We are now developing schemata that are being continually modified and evaluated; each rhetorical predicate should capture as many sentence structure variations as possible. A limitation of discourse schemata development is that it requires considerable human effort; however, once a discourse schema is defined it can automatically be translated to abstract grammar specifications. This method of assembling coherent discourses from basic semantic building blocks will allow any generation system to assemble its texts dynamically, i.e. re-plan portions of its text and communicate successfully.

In the near future we intend to extend the grammar to support grouping of rhetorical predicates, which requires a certain coverage of linguistic phenomena such as ellipsis, focus, discourse and lexical semantics. The long-term challenge of this work is in capturing linguistic properties of a language already during the schema development process, to guide further development of language-independent grammar specifications.
Acknowledgements
The author would like to express her appreciation to Robert Dale for helpful
discussions and acknowledge three anonymous readers for commenting on the
paper. The GF summer school 2009 and the Centre for Language Technology
(CLT) for sponsoring it.
References
1. Schreiber, G., Amin, A., van Assem, M., de Boer, V., Hardman, L., Hildebrand, M., Hollink, L., Huang, Z., Kersen, J., Niet, M., Omelayenko, B., Ossenbruggen, J., Siebes, R., Taekema, J., Wielemaker, J., Wielinga, B.: MultimediaN E-Culture demonstrator. In Cruz, I.F., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L., eds.: International Semantic Web Conference. Volume 4273, Springer (2006) 951–958
2. Bryne, K.: Having triplets – holding cultural data as RDF. In: Proceedings of the IACH workshop at ECDL 2008 (European Conference on Digital Libraries), Aarhus (2009)
3. Hielkema, F., Mellish, C., Edwards, P.: Evaluating an ontology-driven WYSIWYM interface. In: Proc. of the Fifth International NLG Conference (2008)
4. Hasan, R.: Linguistics, language and verbal art. Geelong: Deakin University. (1985)
5. Halliday, M.A., Hasan, R.: Language, Context, and Text: Aspects of Language in a
Social-Semiotic Perspective. Oxford: Oxford University Press (1989)
6. Crofts, N., Doerr, M., Gill, T., Stead, S., Stiff, M.: Definition of the CIDOC Conceptual Reference Model. (2005)
7. O’Donnell, M.J., Mellish, C., Oberlander, J., Knott, A.: ILEX: An architecture for a dynamic hypertext generation system. Natural Language Engineering 7 (2001) 225–250
8. Bontcheva, K.: Generating tailored textual summaries from ontologies. In: Second
European Semantic Web Conference (ESWC). (2005) 531–545
9. Mellish, C., Pan, J.Z.: Natural language directed inference from ontologies. Artificial Intelligence 172 (2008) 1285–1315
10. McKeown, K.R.: Text generation : using discourse strategies and focus constraints
to generate natural language text. Cambridge University Press (1985)
11. Ranta, A.: Grammatical framework, a type-theoretical grammar formalism. Journal
of Functional Programming 14 (2004) 145–189
12. Martin-Löf, P.: Intuitionistic type theory. Bibliopolis, Napoli (1984)
13. Ljunglöf, P., Larsson, S.: A grammar formalism for specifying ISU-based dialogue systems. In: Advances in Natural Language Processing, 6th International Conference, GoTAL 2008, Gothenburg, Sweden. Volume 5221 of Lecture Notes in Computer Science, Springer (2008) 303–314
14. Khegai, J., Nordström, B., Ranta, A.: Multilingual syntax editing in GF. In: Computational Linguistics and Intelligent Text Processing (CICLing-2003), LNCS 2588, Mexico, Springer (2003) 453–464
15. Johannisson, K.: Formal and Informal Software Specifications. PhD thesis, Chalmers
University of Technology (2005)
16. Levin, B.: English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago (1993)
17. Ranta, A.: The GF resource grammar library. Linguistic Issues in Language Technology (LiLT) (2009)
An Improved Indonesian Grapheme-to-Phoneme
Conversion Using Statistic and Linguistic Information
Agus Hartoyo, Suyanto
Faculty of Informatics - IT Telkom, Jl. Telekomunikasi No. 1 Terusan Buah Batu
Bandung, West Java, Indonesia
truegushar@yahoo.co.id, suy@ittelkom.ac.id
Abstract. This paper focuses on IG-tree + best-guess strategy as a model to
develop Indonesian grapheme-to-phoneme conversion (IndoG2P). The model is
basically a decision-tree structure built based on a training set. It is constructed
using a concept of information gain (IG) in weighing the relative importance of
attributes, and equipped with the best-guess strategy in classifying the new
instances. It is also extended with two new features. The first is a pruning mechanism that minimizes the IG-tree dimension and improves its generalization ability. The second is a homograph handler using a text-categorization method to handle the special case of sets of words that are exactly the same in their spelling representations but differ from each other in their phonetic representations. Computer simulation showed that the complete model performs well, and the two additional features gave the expected benefits.
Keywords: Indonesian grapheme-to-phoneme conversion, IG-tree, best-guess
strategy, pruning mechanism, homograph handler.
1 Introduction
Many data-driven methods have been proposed to solve the grapheme-to-phoneme (G2P) conversion problem, such as instance-based learning, artificial neural networks, and decision trees. In [7], it was stated that the IG-tree + best-guess strategy achieves high performance. It compresses a given training set into an interpretable model. In this research, the method is adopted to develop a new model for Indonesian G2P (IndoG2P). In the new model, two new features are added for improvement: a pruning mechanism using statistic information and a homograph handler based on linguistic information provided by a linguist.
Since the model is a lossless compression structure, which means that it stores all data, including outliers, into rules, a pruning mechanism is proposed to prune the rules that accommodate outliers. The model is thus expected to improve its generalization ability while decreasing its size.
Furthermore, the model does not handle homograph problems. The letter-based
inspection mechanism performed letter by letter internally in a word cannot handle a
few sets of words which are exactly the same in spelling representations but different
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 179-190
Received 23/11/09
Accepted 16/01/10
Final version 12/03/10
in phonetic representations. In order to solve this problem, the system should perform an inspection mechanism over a wider context. The problem is actually one of recognizing the topic of the surrounding text (sentence, paragraph, or passage), which is known as the problem of text categorization. Since [8] stated that the centroid-based classifier for text categorization significantly outperforms other classifiers on a wide range of data sets, this classifier is adopted to solve the homograph problem in this research.
2 IndoG2P
The IndoG2P system is designed to use both statistic and linguistic information. This
design is expected to solve some different problems in G2P.
2.1 The Phonetic Alphabets
This research uses the IPA (International Phonetic Association) Indonesian alphabet system to symbolize pronunciations on the phonetic side of its dataset. As explained in [1], 6 vowels and 22 consonants compose the alphabet system, as listed in Table 1.
Table 1. The IPA Indonesian alphabets

Phoneme  Category   Sample Word     Phoneme  Category   Sample Word
a        vowel      /akan/          l        consonant  /lama/
e        vowel      /sore/          m        consonant  /makan/
ə        vowel      /ənam/          n        consonant  /nakal/
i        vowel      /ini/           p        consonant  /pintu/
o        vowel      /toko/          r        consonant  /raja/
u        vowel      /baru/          s        consonant  /sama/
b        consonant  /tembakan/      t        consonant  /timpa/
c        consonant  /cari/          w        consonant  /waktu/
d        consonant  /duta/          x        consonant  /axir/
f        consonant  /faksin/        y        consonant  /yakin/
g        consonant  /gula/          z        consonant  /zat/
h        consonant  /hari/          š        consonant  /mašarakat/
j        consonant  /juga/          ŋ        consonant  /təmpuru-ŋ/
k        consonant  /kaki/          ň        consonant  /ňaňian/
2.2 Datasets
The system involves two kinds of datasets: 1) the IndoG2P datasets, used to train the system in building the IG-tree model, to validate the rules during the IG-tree pruning process, and to test the IG-tree classifier as the IndoG2P conversion system; and 2) the homograph datasets, used to train the centroid-based classifiers and to test them as the homograph handler.
IndoG2P Datasets. “Given a written or spelled word in Indonesian, the system should output how the word is pronounced” is the main problem IndoG2P conversion must cope with. The dataset should therefore give the learning system examples of how words in Indonesian are spelled and then pronounced. The dataset can simply be a table with two attributes, with each record demonstrating the spelling and pronunciation of a word: the spelling transcription of a word in the first attribute and the phonetic transcription of the same word in the second. The format of the dataset is shown in Table 2.
Table 2. Format of the IndoG2P dataset.

Graphemic transcription   Phonetic transcription
malang                    mala-ŋ
tembakan                  tembakan
tempurung                 təmpuru-ŋ
tempatmu                  təmpatmu
The learning mechanism requires that the relation between spelling symbols and their corresponding phonetic symbols be a one-to-one mapping. This poses no problem when the spelling and phonetic transcriptions of a word have the same length, as in the words “tembakan” (meaning “a shot”) and “tempatmu” (meaning “your place”). When the spelling and phonetic transcriptions of a word differ in length, [7] suggests performing an alignment mechanism, as shown in the words “malang” (meaning “poor” or “unfortunate”) and “tempurung” (meaning “shell” or “skull”), by inserting the null phoneme ‘-’ at certain positions in the phonetic transcription in such a way that (i) every single spelling symbol is mapped to a single phonetic symbol; (ii) each grapheme-to-phoneme mapping is viable (i.e. the mappings can be motivated intuitively or linguistically); (iii) the combination of all mappings within the alignment has a maximal probability; and (iv) it is consistent with the alignments of other similar words.
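A naive version of the alignment step, padding the phonetic transcription with the null phoneme so that the mapping becomes one-to-one, can be sketched as a plain edit-distance alignment. This is our illustration only: it minimizes mismatches, whereas criteria (ii)-(iv) above would require linguistically informed costs.

```python
def align(spelling, phonemes, null="-"):
    """Insert the null phoneme into `phonemes` (assumed no longer than
    `spelling`) so it matches `spelling` letter for letter, minimizing
    mismatched symbol pairs. A toy dynamic-programming alignment."""
    n, m = len(spelling), len(phonemes)
    INF = float("inf")
    # cost[i][j]: best cost aligning the first i graphemes with the
    # first j phonemes (only j <= i is reachable).
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(min(i, m) + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:    # map grapheme i to phoneme j
                c = cost[i][j] + (spelling[i] != phonemes[j])
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1], c)
            if i < n:              # map grapheme i to the null phoneme
                cost[i + 1][j] = min(cost[i + 1][j], cost[i][j] + 1)
    # Backtrack to recover the padded phonetic transcription.
    out, i, j = [], n, m
    while i > 0:
        if j > 0 and cost[i][j] == cost[i - 1][j - 1] + (spelling[i - 1] != phonemes[j - 1]):
            out.append(phonemes[j - 1]); i, j = i - 1, j - 1
        else:
            out.append(null); i -= 1
    return "".join(reversed(out))

print(align("malang", "malaŋ"))        # mala-ŋ
print(align("tempurung", "təmpuruŋ"))  # təmpuru-ŋ
```

On equal-length pairs such as "tembakan" the function simply returns the phonetic transcription unchanged.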
Table 3. The IndoG2P datasets.

Dataset         Number of instances   Percentage
Training set    5,455                 80%
Validation set  679                   10%
Test set        679                   10%
In this research, the IndoG2P dataset was developed from a corpus of words collected from articles published by an Indonesian newspaper, and the correctness of its grapheme-phoneme pairs was validated by a professional Indonesian linguist. The dataset consists of 6,791 distinct instances, which are randomly divided into three subsets: a training set to train the system in building the IG-tree model, a validation set to validate the rules during the IG-tree pruning process, and a test set to test the IG-tree classifier. The proportions of the three subsets are shown in Table 3.
Homograph Datasets. These are the datasets used in the homograph handler module. The datasets are composed of texts as their instances. In this module, a particular dataset is provided for each particular homograph word. Any text in the dataset for a homograph word must: 1) contain at least one occurrence of the related homograph word; 2) be composed of relevant sentences; and 3) be labeled with the category representing the phonetic representation of the ambiguous graphemes. The real-world datasets in this research are composed of texts taken from many articles on the Internet. Referring to the list of Indonesian homograph words shown in Table 5, we provide 5 datasets for 5 homograph words as follows.
Table 4. Homograph datasets

Homograph word   Training set   Test set
apel             80             20
penanya          14             5
sedan            48             12
mental           40             10
tahu             48             12
2.3 IG-Tree + Best-Guess Strategy
As a lossless compression structure, the original IG-tree (without the pruning mechanism) stores in a compressed format the complete grapheme-to-phoneme knowledge of all words provided by the training set. In this system, compression does not only mean a decrease in the model’s size; it means generalization as well. As explained in [7], the generated rules can be seen as an optimized, generalized lexical lookup. Words spelled similarly are pronounced similarly, since the system’s reasoning is based on analogy in the overall correspondence of the grapheme-to-phoneme patterns. The system automatically learns the parts of words on which similarity matching can safely be performed. At the end of the learning phase, a generated rule corresponds to a grapheme with a minimal context that disambiguates the mapping of the grapheme to a certain phoneme.
For an illustration, consider how the system determines the phonetic representation of grapheme <e> in <tempayan> (meaning “large water jar”) when the dataset shown in Table 2 is given as its learning material. Based on the learning material, the system finds that grapheme <e> has two probable phonemic representations: /e/ and /ə/. Both maximal subword chunks <tembakan> and <tempatmu> disambiguate the <e> mapping patterns, meaning that the context surrounding <e> in chunk <tembakan> certainly leads its <e> to be mapped to /e/, in the same way as that in chunk <tempatmu> certainly leads its <e> to be mapped to /ə/.

However, these subword chunks do not represent the minimal context disambiguating the mapping patterns. In contrast, the subword chunk <em> represents a smaller context but is ambiguous, since this chunk belongs to words with different phonetic representations of <e>. Hence, during the learning phase the system will look for more contextual information and finally find that the subword chunk <temb> represents the minimal context disambiguating the focus grapheme <e> to be pronounced as /e/
and <temp> represents the minimal context disambiguating the focus grapheme <e> to be pronounced as /ə/. So, when the system is requested to determine how a certain grapheme in a given word should be pronounced, it will find a mapping pattern with the matching focus grapheme and context, and then return the phonetic label given by the pattern as the answer. In our case, since the given word <tempayan> on focus grapheme <e> matches the context represented by the subword chunk <temp>, the system maps the focus grapheme to phoneme /ə/ instead of /e/. When other words such as “ditempati” (meaning “being inhabited”) and “tempatku” (meaning “my place”) are given, the generalization ability of the system is shown, as the same rule covers these cases as well.
Dataset Transformation. On the lowest level, instead of running word by word, IndoG2P conversion actually runs letter by letter. If our problem is considered as a classification problem, then given an unknown instance with attributes consisting of a focus grapheme and its context graphemes, IndoG2P is the classification task responsible for labeling the instance with a phonetic representation. This observation suggests transforming the IndoG2P dataset discussed before (a word-by-word dataset, we can say) into a new letter-by-letter format. The basic idea of the transformation algorithm is to consecutively locate each grapheme (occurring in a word) as the focus/target grapheme while ensuring that, when a grapheme is located as focus, the other graphemes occurring in the same word are simultaneously located in their appropriate context positions. As the number of records in the word-by-word dataset is the number of words itself, the number of records in the letter-by-letter dataset is the total number of letters occurring in all words. This transformation is illustrated in Fig. 1.
Graphemic transcription: kamper, tembak        Phonemic transcription: kampər, tembak

L7  L6  L5  L4  L3  L2  L1   F    R1  R2  R3  R4  R5  R6  R7   Phonemic label
.   .   .   .   .   .   ^    k    a   m   p   e   r   ^   .    k
.   .   .   .   .   ^   k    a    m   p   e   r   ^   .   .    a
.   .   .   .   ^   k   a    m    p   e   r   ^   .   .   .    m
.   .   .   ^   k   a   m    p    e   r   ^   .   .   .   .    p
.   .   ^   k   a   m   p    e    r   ^   .   .   .   .   .    ə
.   ^   k   a   m   p   e    r    ^   .   .   .   .   .   .    r
.   .   .   .   .   .   ^    t    e   m   b   a   k   ^   .    t
.   .   .   .   .   ^   t    e    m   b   a   k   ^   .   .    e
.   .   .   .   ^   t   e    m    b   a   k   ^   .   .   .    m
.   .   .   ^   t   e   m    b    a   k   ^   .   .   .   .    b
.   .   ^   t   e   m   b    a    k   ^   .   .   .   .   .    a
.   ^   t   e   m   b   a    k    ^   .   .   .   .   .   .    k

Fig. 1. Dataset transformation.
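The word-to-letter transformation in Fig. 1 is mechanical windowing; a sketch of ours (the field names are illustrative, not from the paper):

```python
def to_records(word, phonemes, width=7, pad_word="^", pad_none="."):
    """Turn one aligned (word, phoneme string) pair into letter-by-letter
    records: a focus grapheme, `width` context graphemes on each side,
    and the phonemic class label. '^' marks the word boundary and '.'
    marks positions beyond it, as in Fig. 1."""
    padded = (pad_none * (width - 1) + pad_word
              + word
              + pad_word + pad_none * (width - 1))
    records = []
    for i, label in enumerate(phonemes):
        focus = width + i  # index of the focus grapheme in `padded`
        records.append({
            "left":  padded[focus - width:focus],           # L7 ... L1
            "focus": padded[focus],                         # F
            "right": padded[focus + 1:focus + 1 + width],   # R1 ... R7
            "label": label,
        })
    return records

recs = to_records("kamper", "kampər")
print(recs[4])
# {'left': '..^kamp', 'focus': 'e', 'right': 'r^.....', 'label': 'ə'}
```

The printed record is the fifth row of the kamper block in Fig. 1: focus <e>, left context <..^kamp>, right context <r^.....>, label /ə/.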
As shown in Fig. 1, we provided 14 context graphemes surrounding the focus grapheme: 7 on each of the right and left sides (R1 represents the first context on the right, L1 represents the first context on the left, and so on). The width of the context provided in the dataset should be able to accommodate the context expansion performed during the learning process (context expansion is discussed in the next section). It should not be too narrow, as the disambiguation point could then not be reached for patterns with long conditions. It should not be too wide either, as the system would then be too space-consuming. Our choice of 7 surrounding graphemes in each direction as the width of the context is based on the result of our early investigation, which found that the phonetic mapping of a grapheme in any ordinary Indonesian word is ambiguous up to at most 5 steps to the right and/or left side. We gave 2 extra steps to anticipate extraordinary material in the real dataset.
IG-Tree Construction. The IG-tree is a compression format of context-sensitive rules. Each path in the decision tree represents a rule; it starts at a node representing a focus grapheme to be mapped to a phoneme, and each consecutive node represents the consecutive context. Information gain (IG), a computational metric based on information theory, is used to determine the order of context expansion. A higher information gain of an attribute theoretically reflects less randomness or impurity of the partitions resulting from partitioning on that attribute; in our case it indicates greater importance of the attribute in disambiguating the grapheme-to-phoneme mapping. The result of the information gain computation on our IndoG2P dataset (in its letter-by-letter format) for each attribute is shown in the graph in Fig. 2.
Fig. 2. Information gain for attributes in IndoG2P dataset.
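The ordering metric can be computed with the standard definition of information gain: the entropy of the phoneme labels minus the weighted entropy after partitioning on an attribute. A minimal sketch; the toy instances below are illustrative, not the IndoG2P dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(instances, labels, attr):
    """Information gain of the attribute at position `attr`."""
    n = len(labels)
    by_value = {}
    for inst, lab in zip(instances, labels):
        by_value.setdefault(inst[attr], []).append(lab)
    remainder = sum(len(p) / n * entropy(p) for p in by_value.values())
    return entropy(labels) - remainder

# Toy attributes (F, R1) and phoneme labels of the focus grapheme:
X = [("e", "m"), ("e", "r"), ("a", "k"), ("a", "m")]
y = ["ə", "ə", "a", "a"]
gains = {a: info_gain(X, y, a) for a in range(2)}
```

Sorting the attributes by gain in descending order gives the expansion order; on this toy data F comes first, as in the paper.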
Note that the attribute with the highest importance in disambiguating the grapheme-to-phoneme mapping is the attribute F (the focus grapheme). This is consistent with what we stated above: the focus grapheme is the starting node on every path in the IG-tree. Furthermore, the graph clearly shows that the information gain gets smaller as the context position gets further from the focus grapheme. The system sorts the attributes by their information gain values, records the order, and uses it to determine the next attribute to which the context inspection should expand during the learning process. In our case, based on
An Improved Indonesian Grapheme-to-Phoneme Conversion...
the result of our computation, we got this attribute order: F – R1 – L1 – L2 – R2 – L3
– R3 – L4 – R4 – R5 – L5 – L6 – R6 – L7 – R7.
The construction of the IG-tree, similar to that of standard decision tree structures such as ID3, C4.5, and CART, can be done in a top-down, divide-and-conquer manner. Performed on our letter-by-letter training set, the basic decision tree construction algorithm is greedy and can be expressed recursively as follows.
Base: If the current node is pure, i.e. the instances reaching it are all of the same class, return the node as a leaf labeled with that class and stop developing that part of the tree;
Recurrence: Otherwise, i.e. when the instances reaching the current node differ in class labels, use a splitting attribute to divide the instances into subsets, one for each value of the attribute. Apply this procedure to each subset.
For our case this recursive algorithm needs one additional constraint: the splitting attribute mentioned in the recurrence is at any time governed by the attribute order we discussed earlier. The order is applied uniformly on every path in the decision tree.
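The constrained construction can be sketched as below. The dictionary node layout and names are our own assumptions; the split attribute at each depth is fixed by the attribute order reported above, and each non-leaf node keeps the label statistics that the best-guess strategy will later need.

```python
from collections import Counter

ATTR_ORDER = ["F", "R1", "L1", "L2", "R2", "L3", "R3",
              "L4", "R4", "R5", "L5", "L6", "R6", "L7", "R7"]

def build(instances, depth=0):
    """instances: list of (feature_dict, phoneme). Returns a node dict."""
    stats = Counter(lab for _, lab in instances)
    if len(stats) == 1 or depth == len(ATTR_ORDER):  # pure: make a leaf
        return {"label": stats.most_common(1)[0][0]}
    attr = ATTR_ORDER[depth]                          # fixed split order
    subsets = {}
    for feats, lab in instances:
        subsets.setdefault(feats[attr], []).append((feats, lab))
    return {"attr": attr, "stats": stats,
            "children": {v: build(s, depth + 1) for v, s in subsets.items()}}

# Toy instances: focus grapheme and first right context only
data = [({"F": "e", "R1": "m"}, "ə"),
        ({"F": "e", "R1": "r"}, "e"),
        ({"F": "a", "R1": "k"}, "a")]
tree = build(data)
```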
Another feature of the IG-tree is the statistics-recording mechanism performed at every non-leaf node. This mechanism records the phonetic-label statistics of all instances reaching each node. These statistics will be employed for the best-guess strategy and the pruning mechanism.
To illustrate the model concretely, we will use the dataset shown in Table 2 as our case study. As the dataset is still in its word-by-word version, our workflow demands that it be transformed to its letter-by-letter format. The IG-tree construction using the procedure discussed above is then performed on the latter format of the dataset. Fig. 3 illustrates part of the constructed IG-tree, namely the mapping paths for grapheme <e>.
Retrieving Phonetic Label in IG-Tree. Given a focus grapheme and its context graphemes, the phonetic label of the focus grapheme is retrieved by tracing a proper path in the IG-tree down to a leaf and returning the label of that leaf as the requested phonetic label. As an example, we show how phoneme /ə/ is obtained as the phonetic label of grapheme <e> in the word <tempayan>. Based on the attribute order discussed above, the tracing starts with the node labeled <e>, the focus grapheme. The node labeled with the first grapheme to the right of <e> in <tempayan>, i.e. <m>, is the second node accessed on the path. The next node taken is the one labeled <t>, the first grapheme to the left of <e>; followed by the node labeled <-> as the second "grapheme" to the left of <e>. The tracing ends when the node labeled <p>, the second grapheme to the right of <e>, is accessed, since this node is a leaf. Thus, the label /ə/ is retrieved as the phonetic label corresponding to grapheme <e> in the word <tempayan>. In a similar way, the phoneme /e/ can be retrieved as the phonetic label for grapheme <e> in the word <tembaklah>. However, the tracing fails to reach any leaf node when our word is, for instance, <teman> with focus on grapheme <e>. Note that the tracing gets stuck
on the node labeled <->, since this node has no child labeled <a>. In our research, such a problem is solved with the best-guess strategy.
Fig. 3. The IG-tree constructed on the dataset in Table 2, with emphasis on the mapping of <e>
Best-Guess Strategy. This is a strategy employed by the system to avoid stumped mappings and to increase its generalization ability. The strategy is performed when a tracing to retrieve a phonetic label cannot reach any leaf node. When the tracing gets stuck on a node, the system "guesses" the phonetic label using the most probable label at that node, computed from the statistical information stored there: the most frequent phonetic label in the records affiliated with the node is returned as the "guessed" label. In case more than one phonetic label is in the majority, a random selection among those labels is performed.
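Retrieval with the best-guess fallback can be sketched as follows; the node layout (a dict with either a "label", or an "attr" with "stats" and "children") is our own illustration, not the authors' implementation.

```python
from collections import Counter

def retrieve(node, feats):
    """Trace the path; fall back to the node's majority label when stuck."""
    while "label" not in node:
        child = node["children"].get(feats[node["attr"]])
        if child is None:                 # stuck: best-guess strategy
            return node["stats"].most_common(1)[0][0]
        node = child
    return node["label"]

# Tiny hand-built subtree for a focus grapheme, split on R1:
tree = {"attr": "R1", "stats": Counter({"ə": 2, "e": 1}),
        "children": {"m": {"label": "ə"}, "r": {"label": "e"}}}
```

Here `retrieve(tree, {"R1": "m"})` follows the path to a leaf, while `retrieve(tree, {"R1": "k"})` gets stuck and returns the majority label.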
2.4 Pruning Mechanism
When a decision tree is built, many of its branches reflect anomalies due to outliers in the learning materials. The pruning mechanism is proposed to address this problem of overfitting the data. It is performed after the IG-tree has initially been grown to its entirety. Pruning is done by replacing a subtree with a new leaf node labeled with the subtree's majority phonetic label. The new tree is then validated using the validation set. If the pruning step decreases the generalization accuracy, the previous subtree is retained; otherwise, the subtree is permanently replaced by the new leaf node. In case the accuracy is constant, the replacement is kept, since for the same performance the more concise model is the better one. This procedure is performed on all subtrees in a bottom-up fashion.
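The bottom-up pruning loop can be sketched as follows, under the same assumed dictionary node layout; a subtree is collapsed into a leaf carrying its majority label whenever validation accuracy does not decrease, so at equal accuracy the more concise model wins.

```python
from collections import Counter

def predict(node, feats):
    while "label" not in node:
        child = node["children"].get(feats[node["attr"]])
        if child is None:
            return node["stats"].most_common(1)[0][0]
        node = child
    return node["label"]

def accuracy(tree, val):
    return sum(predict(tree, f) == y for f, y in val) / len(val)

def prune(tree, node, val):
    """Try to collapse `node` (and its subtrees) into a leaf, in place."""
    if "label" in node:
        return
    for child in node["children"].values():   # bottom-up: children first
        prune(tree, child, val)
    before = accuracy(tree, val)
    saved = dict(node)
    node.clear()
    node["label"] = saved["stats"].most_common(1)[0][0]
    if accuracy(tree, val) < before:          # accuracy dropped: undo
        node.clear()
        node.update(saved)

# A redundant subtree: both children predict the same phoneme
tree = {"attr": "R1", "stats": Counter({"ə": 2}),
        "children": {"m": {"label": "ə"}, "r": {"label": "ə"}}}
prune(tree, tree, [({"R1": "m"}, "ə"), ({"R1": "r"}, "ə")])
```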
2.5 Centroid-based Text Categorization
Indonesian has some homograph words, as shown in Table 5. Daelemans et al. [7] state that their research fails in handling such words in their G2P conversion system. We address this failure by proposing a text-categorization approach using a centroid-based classifier to cope with the problem.
This text-categorization approach and the main IndoG2P conversion system using the IG-tree actually work on different levels. While the IG-tree works at the level of letters, with a context-inspection mechanism internal to the containing word, the text-categorization approach works at the level of text, computing over aspects of its contained words/terms. This implies that the proposed approach is applicable to systems with text inputs, not only word inputs.
How centroid-based text categorization copes with the homograph problem is briefly explained as follows. In this approach each homograph word is treated as a separate problem to solve, which demands that a separate learning step, with a dedicated dataset, be performed for each homograph word. Furthermore, the text of each instance is represented as a vector in term space, with a TF-IDF computation for each of its dimensions. A centroid model of each category is then constructed as the average representation of all instances labeled with that category. When an unlabeled instance is given, the classifier computes the similarity between the vector representing the instance and the vector representing the centroid of each category. The category with the most similar centroid is output as the label for the new instance. Thus, with the centroid models constructed in the learning process, IndoG2P can correctly map the ambiguous grapheme in a given homograph word, surrounded by a particular text, to its phonetic representation.
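The learning and classification steps for one homograph word can be sketched as follows. The tokenised texts, labels, and the exact TF-IDF variant (raw term frequency times log inverse document frequency, with +1 so that terms occurring in every training text are not zeroed out) are our own illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from math import log, sqrt

def vectorise(tokens, idf):
    tf = Counter(tokens)
    return {w: c * idf.get(w, 0.0) for w, c in tf.items()}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy training texts for the homograph <apel>, one per category:
texts = [["upacara", "apel", "pagi"], ["buah", "apel", "manis"]]
labels = ["/apəl/", "/apel/"]

n = len(texts)
df = Counter(w for t in texts for w in set(t))
idf = {w: log(n / c) + 1.0 for w, c in df.items()}

vecs = [vectorise(t, idf) for t in texts]
groups = {}
for v, lab in zip(vecs, labels):
    groups.setdefault(lab, []).append(v)
cents = {}
for lab, vs in groups.items():
    acc = Counter()
    for v in vs:
        acc.update(v)                       # sum the vectors...
    cents[lab] = {w: x / len(vs) for w, x in acc.items()}  # ...and average

def classify(tokens):
    v = vectorise(tokens, idf)
    return max(cents, key=lambda lab: cosine(v, cents[lab]))
```

A new text is labeled with the category whose centroid is most cosine-similar to its TF-IDF vector.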
Table 5. The list of some Indonesian homograph words.
Homograph word                      Ambiguous grapheme   Phonetic representation
<apel>                              <e>                  /apəl/, /apel/
<penanya>                           <e>                  /pənaña/, /penaña/
<memerah>, <pemerah>, <pemerahan>   <e>                  /məmerah/, /pəmerah/, /pəmeraħan/, /məmərah/, /pəmərah/, /pəməraħan/
<seri>                              <e>                  /səri/, /seri/
<semi>                              <e>                  /səmi/, /semi/
<sedan>                             <e>                  /sedan/, /sədan/
<mental>                            <e>                  /mental/, /məntal/
<seret>                             <e>                  /seret/, /sərət/
<serak>                             <e>                  /sərak/, /serak/
<tahu>                              <h>                  /taħu/, /tahu/
<gulai>                             <i>                  /gulay/, /gulai/
3 Evaluation and Results
Although the two modules are practically inseparable, the IG-tree construction and the homograph handling substantially address different levels of the problem. Hence, their datasets differ, as discussed before. We therefore divide this section into two parts, one for each module.
3.1 IG-Tree Construction
In this part we examine the model improvements due to pruning. The performance of the final model is then evaluated using two accuracy metrics: accuracy-per-phoneme and accuracy-per-word. Since at the lowest level the mapping is done letter by letter, accuracy-per-phoneme accommodates evaluation at this level. Accuracy-per-word, on the other hand, is more interpretable in the sense that real-world communication using linguistic tools is word-based, not letter-based. Technically, note that accuracy-per-word must be less than or equal to accuracy-per-phoneme: for a word to be correctly mapped, all of its letters must be correct, while a single mispronounced letter is sufficient to make the containing word wrong.
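The two metrics can be computed as follows; the gold and predicted sequences below are made up for illustration, not our experimental data.

```python
def accuracies(gold, pred):
    """gold, pred: per-word phoneme sequences.
    Returns (phoneme_accuracy, word_accuracy)."""
    phon_ok = phon_all = word_ok = 0
    for g, p in zip(gold, pred):
        hits = sum(a == b for a, b in zip(g, p))
        phon_ok += hits
        phon_all += len(g)
        word_ok += hits == len(g)    # a word counts only if all letters match
    return phon_ok / phon_all, word_ok / len(gold)

gold = [["k", "a", "m", "p", "ə", "r"], ["t", "e", "m", "b", "a", "k"]]
pred = [["k", "a", "m", "p", "ə", "r"], ["t", "ə", "m", "b", "a", "k"]]
pa, wa = accuracies(gold, pred)      # 11/12 phonemes, 1/2 words
```

By construction, word accuracy can never exceed phoneme accuracy.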
The resulting IG-tree has 2,112 leaf nodes. The number of leaves here trivially represents the model's dimension. The pruning mechanism is then performed on the model to decrease its dimension and increase its accuracy, as shown in Table 6. After 554 iterations, the dimension of the final IG-tree is 57% smaller than that of the original, and its accuracy on the validation set is better.
Table 6. Pruning and its improvements in dimension and accuracy for the validation set.

Pruning iteration   Number of leaves   Phoneme accuracy   Word accuracy   Mean accuracy   Mean error
0                   2,112              99.19              93.96           96.57           3.43
1                   2,111              99.19              93.96           96.57           3.43
2                   2,096              99.19              93.96           96.57           3.43
3                   2,074              99.19              93.96           96.57           3.43
110                 1,871              99.23              94.26           96.74           3.26
221                 1,658              99.23              94.26           96.74           3.26
332                 1,496              99.30              94.70           97.00           3.00
443                 1,315              99.36              95.14           97.25           2.75
554                 908                99.38              95.29           97.33           2.67
The final IG-tree was then tested using our test set. The test gave a result of 99.01% phoneme accuracy and 92.42% word accuracy. Unfortunately, we did not find any similar system working on the same language to compare with. However, this result is much better than those of similar studies on other languages published in [3], [4], [5], [6], [8], [9], [10], and [11]. The method we used and the language we worked on seem to be the factors contributing most to this result. It is stated in [7] that the high performance of the IG-tree in G2P conversion suggests
an overstatement in previous knowledge-based approaches as well as in more computationally expensive learning approaches. Moreover, in this research we improved the original method with new features which increase its performance. As for the language we worked on, Indonesian clearly has much simpler phonetic rules than languages such as English, Dutch, and French. These simple Indonesian linguistic patterns seem to be easily captured during the learning process, so the system could perform better.
Another aspect of the IG-tree we want to stress is its interpretability. As it is constructed as a decision tree, a structure that is descriptive as well as predictive, the IG-tree gives us a reason for each of its predictions. The model "teaches" us a complete, detailed "lesson" on how to pronounce letters in Indonesian words. At a practical level, the comprehensive rules exposed in the constructed model may even help Indonesian linguists codify Indonesian pronunciation standards.
3.2 Homograph Handler
In this part of the research we conducted experiments on five cases for five homograph words, as mentioned in Table 4. Even with the small number of training and test instances, in every single case the model built during learning achieved surprisingly perfect accuracy: its centroid-based classifier made no mistake in predicting the category of new homograph-containing texts in its particular test set.
The well-designed features of the centroid-based classifiers are the main factors behind the perfect performance of the models in the five cases. The other factor is the well-prepared datasets used in each learning and test process. In spite of their small sizes, the datasets are noise-free, discriminating, and balanced.
4 Conclusion
The high performance of the system implies that the IG-tree + best-guess strategy is a powerful, language-independent, reasoning-based method well suited to the grapheme-to-phoneme conversion problem. The result also owes to the characteristics of Indonesian itself, whose pronunciation rules are relatively easy for the learning system to capture. The proposed pruning mechanism successfully improves the system's performance by significantly reducing the dimension of the model and increasing its generalization ability. The high interpretability of the model developed in this work must be considered one aspect of its high performance as well. In addition, the homograph problem, totally unsolved or unsatisfactorily solved in previous works, can be handled very well with the proposed centroid-based classifiers.
Acknowledgments
We would like to thank Mrs. Dyas Puspandari, an Indonesian linguist, and all colleagues at IT Telkom for their kind and useful suggestions.
References
1. Alwi, H., et al.: Tata Bahasa Baku Bahasa Indonesia, 3rd edn. Balai Pustaka, Jakarta (2003)
2. Asian, J., Williams, H.E., Tahaghoghi, S.M.M.: Stemming Indonesian. In: Proceedings of the Twenty-eighth Australasian Conference on Computer Science, Newcastle, Australia, pp. 307-314 (2005)
3. Bouma, G.: A finite state and data-oriented method for grapheme-to-phoneme conversion. In: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, Seattle, Washington, pp. 303-310 (2000)
4. Caseiro, D., Trancoso, I., Oliveira, L., Viana, C.: Grapheme-to-phone using finite-state transducers. In: Proc. IEEE Workshop on Speech Synthesis, Santa Monica, CA, USA (2002)
5. Daelemans, W., Bosch, A.: Tabtalk: reusability in data-oriented grapheme-to-phoneme conversion. In: Proceedings of Eurospeech (1993)
6. Bosch, A., Daelemans, W.: Data-oriented methods for grapheme-to-phoneme conversion. In: Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands (1993)
7. Daelemans, W., Bosch, A.: Language-independent data-oriented grapheme-to-phoneme conversion. In: van Santen, J.P., Sproat, R.W., Olive, J.P., Hirschberg, J. (eds.) Progress in Speech Synthesis. Springer-Verlag (1997)
8. Han, E.-H., Karypis, G.: Centroid-based document classification: analysis and experimental results. In: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pp. 424-431 (2000)
9. Reichel, U.D., Schiel, F.: Using morphology and phoneme history to improve grapheme-to-phoneme conversion. In: Proceedings of InterSpeech, pp. 1937-1940 (2005)
10. Taylor, P.: Hidden Markov Models for grapheme to phoneme conversion. In: Proceedings of InterSpeech, pp. 1973-1976 (2005)
11. Yvon, F.: Self-learning techniques for grapheme-to-phoneme conversion. In: Proceedings of the 2nd Onomastica Research Colloquium, London (1994)
Machine Translation
Long Distance Revisions in Drafting and Post-editing

Michael Carl(1,2), Martin Kay(1), Kristian T.H. Jensen(2)

(1) Stanford University, H-STAR and Department of Linguistics
(2) Copenhagen Business School, Languages & Computational Linguistics, Frederiksberg, Denmark
Abstract. This paper investigates properties of translation processes, as observed in the translation behaviour of student and professional translators. The translation process can be divided into a gisting, drafting and post-editing phase. We find that student translators have longer gisting phases whereas professional translators have longer post-editing phases. Long-distance revisions, which would typically be expected during post-editing, occur to the same extent during drafting as during post-editing. Further, both groups of translators seem to face the same translation problems. We suggest how those findings might be taken into account in the design of computer assisted translation tools.
1 Introduction
In contrast to the large number of publications on MT post-editing, little research has been carried out on how translators review and post-edit their own translations. Lörscher [10], one of the pioneers in translation process research, points out:
Solving translation problems is often carried out as a series of steps. Generally, subjects do not immediately reach solutions which they consider to be optimal. ... subjects generally use (linguistically) simple strategies first, and only when they turn out to be unsuccessful do the subjects employ more complex strategies. This procedure of the subjects complies with the generative principle whereby complex translation strategies are ... derived from simpler structures. (p. 430)
Revision and post-editing of drafted translation are thus in order and
indicative of the complexity (or uncertainty) of a translation problem.
Only a few years ago, research on human translation processing was based on think-aloud protocols [4,9,10]; however, recent technological developments have made it possible to directly analyse user activity data (UAD), notably eye movement data and keystroke data [5,3].
In a recent study, Malkiel [11] investigates the predictability of "self-revisions" in English-Hebrew translations, based on manual analysis of
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 193-204
Received 21/11/09
Accepted 16/01/10
Final version 10/03/10
the revision keystrokes. In this paper, we use our triangulation technology
[2,3] and discuss a method to automatically detect and analyse revision
patterns.
Given the increasing interest in interactive Machine Translation [8,13]3 and in the design of man-machine interfaces, we expect that insights derived from the study of human translation processing will provide valuable information for the designers of MT post-editing tools.
2 Gisting, Drafting and Post-editing
We base our research on a translation experiment [7] in which 12 professional and 12 student translators produced translations using the Translog
[5] software.4 Translog presents the source text (ST) in the upper part
of the monitor, and the target text (TT) is typed in a window in the
lower part of the monitor. When the start button is pressed, the ST is
displayed and eye movement and keystroke data are registered. The task
of the translator is then to type the translation in the lower window. After
having completed the translation, the subject presses a stop button, and
the translation, along with the translation process data, are stored in a
log file.
Translators vary greatly with respect to how they produce translations. However, the process can be divided into three phases, which we
refer to as gisting, in which the translator acquires a preliminary notion
of the ST, drafting in which the actual translation is typed (drafted),
and post-editing in which some or all of the drafted text is re-read, typos corrected, and sentences rearranged or reformulated in light of the translator's better understanding of the text by the time this stage is reached.
2.1 Translation Progression Graphs
The UAD can be represented in so-called translation progression graphs [12]. Figure 1 shows translation progression graphs for two students (S17 and S23), at the top and bottom respectively, and a professional (P1) in the middle. The graphs plot activity data collected during the translation of a 160-word text from English into Danish.5
3 Google has just made available a toolkit for human assisted translation with more than 50 languages.
4 The software can be downloaded from www.translog.dk
5 The English source text is shown in the Appendix.
Fig. 1. Three translation progression graphs, from top to bottom for subjects S17, P1 and S23, showing keystrokes and eye movements: S17 shows a clear division into gisting, drafting and post-editing; P1 has no gisting phase and spends almost 50% of the translation time on post-editing, while S23 only has a drafting phase.
The horizontal axis represents the translation time in milliseconds, and the vertical axis represents the source-language words from the beginning of the text (bottom) to the end (top). As described in Carl, 2009 [2], keystrokes that contribute to the TT are mapped onto the ST words which they translate. All keystrokes that contribute to the translation of the ith source word are represented as single dots in the ith line from the bottom of the graph. The red (i.e. grey) line plots the gaze activities on the source text words. Single eye fixations are marked with a dot on the fixation line.6
The progression graph of subject S17 (top graph in figure 1) shows a clear distinction between gisting, drafting and post-editing. Subject S17 spends almost 40 seconds getting acquainted with the text. The graph nicely shows the progression of fixations, in which the ST is apparently read from beginning to end.
The drafting phase takes place between seconds 40 and 320. Eye movements can be observed where the translator moves back and forth between the ST and the TT. Some fixations captured during this journey between the current ST position and the TT window (or the keyboard) are mapped on text positions remote from the current location of the corresponding translation.
The drafting phase is followed by a post-editing phase, from approx. second 320 until second 480. Translator S17 seems to re-read much of the ST during post-editing, but only a few keystrokes occur, i.e. around seconds 360 and 440.
Translator P1, the second graph in figure 1, shows virtually no gisting
phase. The first keystrokes can be observed less than 5 seconds after the
ST appears on the screen. P1 also has a long post-editing phase of two
minutes, from seconds 220 to 360. A number of revision keystrokes are
visible, around seconds 300 and 340.
A third translation pattern, for translator S23, is shown in the final graph. No gisting and no post-editing take place, but some revision occurs at various places, e.g. around seconds 100 and 170. The time it takes to produce the translations ranges between 6 minutes (P1) and 8 minutes (S17).
2.2 Translation Expertise and Translation Phases
For students, there is a clear tendency towards longer gisting and shorter
post-editing phases, whereas professional translators have shorter gisting
6 Notice that only fixations on the source text are represented in the graph. Our software was not able to compute and map fixations on the emerging target text words.
Fig. 2. Top: drafting time (horizontal) and gisting time (vertical). Rectangular symbols represent student translators, diamond shapes represent professionals. Students spend more time on gisting than professionals. Bottom: drafting time (horizontal) and post-editing time (vertical). Rectangular symbols represent students, diamond shapes represent professionals. On average, professionals spend more time post-editing than do students; many students completely skip post-editing.
and longer post-editing phases. Figure 2 (top) plots the relationship between drafting and gisting time, and the bottom graph in figure 2 shows the relation between drafting time and post-editing time. Almost all professional translators (9 out of 12) engage in some kind of post-editing, while 7 out of 12 student translators do not post-edit. The inverse observation can be made with respect to gisting: 3 students but no professional translator engage in gisting for more than 20 seconds. These results are only partially in line with Jakobsen, 2002 [6], who finds that professional translators invest more time than students in gisting and post-editing, but are faster at drafting the translation.
3 Long Distance Revisions
Changes in the target-text translation may take place at any moment during drafting or post-editing: in the middle or at the end of a word, or after or at the end of a sentence or paragraph. We distinguish between two types of revisions: short-distance revisions and long-distance revisions.
3.1 Translation Phases and Long Distance Revisions
Long-distance revisions occur if two successive keystrokes are located 2 or more words apart from each other. For instance, a translator might first translate "nurse" into "sygeplejerske", but when she realizes that the 'nurse' is in fact masculine, she might correct all occurrences into "sygeplejer".7 To do so, the cursor must move to a previous word, and if the corrected word is two or more words apart from the last cursor action we will observe long-distance keystrokes. A long-distance revision is thus a sequence of two successive keystrokes which are located in different parts of the target-text translation. All other modifications of drafted text are short-distance revisions. Whereas short-distance revisions are most likely associated with typing errors, which the translator immediately corrects, it is plausible that long-distance revisions are indicative of 'real' translation problems that the translator is struggling with.
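Under this definition, detection reduces to scanning successive keystrokes for a word-position gap of two or more. A sketch, assuming a simplified log of (target-word index, keystroke kind) pairs of our own design:

```python
def long_distance_revisions(keystrokes, min_gap=2):
    """keystrokes: list of (word_index, kind), kind 'I' (insert) or 'D' (delete).
    Returns (kind1, kind2, word1, word2) for each long-distance pair."""
    return [(k1, k2, w1, w2)
            for (w1, k1), (w2, k2) in zip(keystrokes, keystrokes[1:])
            if abs(w2 - w1) >= min_gap]

# Drafting words 0..3, with one jump back to word 0 and one jump forward
log = [(0, "I"), (1, "I"), (2, "I"), (0, "D"), (0, "I"), (3, "I")]
revs = long_distance_revisions(log)
```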
One would expect long-distance revisions to be particularly abundant during post-editing; however, our data indicate that they occur with the same frequency even when no separate post-editing phase takes place.8 Figure 3 suggests that the post-editing time and the number of long-distance revisions are basically independent: long-distance revisions take
7 In our classification below this would correspond to an IDWX pattern.
8 An example of this is subject S23 in figure 1, above.
Fig. 3. Number of long-distance revisions (vertical) and post-editing time (horizontal), showing the parameters to be unrelated. Long-distance revisions occur equally frequently for students and professionals, irrespective of the length of the post-editing phase.
place in approximately equal number, whether or not there is a separate post-editing phase. Thus, more experienced, professional translators seem to prefer a modular mode of working, in which both types of editing are separated into two clearly different phases. Conversely, students are more likely to mix those two phases. Jakobsen [6] reports similar findings in his experiments, where students produce more revisions during drafting.
Figure 3 also shows that translators perform between 11 and 45 long-distance revisions on the 160-word text. Students perform slightly more revisions, on average one revision every 6.5 words, while professionals revise once every 7.8 words. This figure approximately coincides with the one given by Malkiel [11], whose student translators "self-revise" every 8th word. In the next sections we will show that these revisions are by no means equally distributed in the text.
3.2 Patterns in Long Distance Revisions
A related question is whether and to what extent translators face the same difficulties during translation. That is, we may be confident that translators share similar problems if long-distance revisions cluster at particular text positions so that common patterns can be observed in the UAD.
Indeed, figure 4 shows that revisions of the 24 translators occur more frequently at certain positions in the texts. The graph shows four or five positions where many revisions take place, i.e. around word positions 14, 50, 105, 120 and 151. The contexts of these passages are shown in bold in the Appendix. We briefly discuss some of the difficulties that these particular passages might present to a translator.
A word-for-word translation of "imprisoned for life today" would not be idiomatic in Danish. In order to find an idiomatically acceptable rendering of the expression, the translator would have to reorder the constituents and make different lexical choices.
Fig. 4. Elapsed time (vertical) and positions of long-distance revisions in the translation (horizontal): the horizontal axis enumerates the source-language words (0 to 160) and the dots in the graph represent different types of long-distance revisions of their translations.
The translation of "counts of murder" into Danish may cause difficulty since the expression occurs infrequently in this context. The translator would have to test several Danish equivalent expressions in order to find an acceptable one. The translation data show more than 12 possible solutions for this passage.
The compound expression “hospital staff” has no exact equivalent in
Danish. The translator would have to test several possible translation
alternatives before reaching a satisfying solution. This difficulty can also
be measured by the fact that the data contain 20 different translations
for “awareness of other hospital staff”.
3.3 Classifying Long Distance Revisions
The keystrokes in our representations can be either text-inserting or text-deleting. That is, keystrokes for mere cursor movement are skipped and ignored in the graphs. Accordingly, in order to classify the long-distance revisions, we distinguish between insertion (I) and deletion (D) revisions. Since each of the two keystrokes in a revision can be an insertion or a deletion, we have four categories of pairs of revision keystrokes. In addition, we also distinguish the situation in which the second keystroke immediately follows a word separator (S) from the situation in which the second keystroke is in the middle of a word (W). Thus, in principle there could be eight types of long-distance revision.9 The six most frequent combinations are shown in figure 4 and are briefly described below:10
1. IISX: two successive long-distance insertion keystrokes, the second immediately following a word separator, e.g. inserting an article.
2. IIWX: two successive long-distance insertion keystrokes, the second not immediately following a word separator, e.g. inserting a suffix of a word.
3. IDSX: an insertion followed by a long-distance deletion keystroke occurring at the beginning of a word, e.g. deleting an article.
4. IDWX: an insertion followed by a long-distance deletion keystroke occurring in the middle of a word, e.g. deleting a suffix.
5. DISX: a deletion followed by a long-distance insertion that occurs at the beginning of a dislocated word, e.g. inserting an article.
6. DIWX: a deletion followed by a long-distance insertion in the middle of a dislocated word, e.g. inserting a suffix of a word.
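The six labels combine two simple observations per keystroke pair, which makes the classification itself mechanical; a sketch under our own assumed inputs:

```python
def revision_type(kind1, kind2, after_separator):
    """kind1/kind2: 'I' or 'D'; after_separator: whether the second
    keystroke immediately follows a word separator. The trailing X
    marks the long distance between the two keystrokes."""
    return kind1 + kind2 + ("S" if after_separator else "W") + "X"
```

For example, an insertion followed by a mid-word long-distance deletion is classified as IDWX.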
Table 1 summarizes revision types for all 24 translations. It gives rise
to the following observations: revisions usually start at the beginning of a
word (461 occurrences) and less frequently in the middle (74 occurrences).
ID revision patterns require much more time than DI or II revisions. That
is, the time lapse between the end of an insertion and the beginning of a
9 The long distance between successive keystrokes is marked as X in the examples below.
10 Unfortunately, our data show too few instances of 'DD' revisions to draw any conclusions.
202
Carl M., Kay M., Jensen K.
deletion in another passage of the text is much higher than that between a
deletion and a following insertion, or between two successive insertions.
On average, the pause between the insertion and the long-distance deletion
is 7734ms when the deletion takes place at the beginning of a word and
8676ms when it takes place in the middle of a word, while it is only a
fraction of this for the other types of revisions.
Table 1. Number of occurrences and time interval between the two
keystrokes for several types of long-distance revision pattern.

Type                         IIWX  IISX  IDWX  IDSX  DIWX  DISX
Number of occurrences          12   423    35    20    27    18
Average time interval (ms)    755   169  8676  7734   689  1677
Presumably, the reason for the long ID revision is that a meaning hypothesis was realized and finished with the last insertion, and a new meaning
hypothesis must mature before the deletion can take place. This would
require much more anticipation and effort than a DI pattern, where the
long-distance insertion is presumably only a consequence of the thought
that led to the deletion, or than the II patterns, where the second insertion
is a continuation of the first insertion.
4 Conclusion
Three phases can be distinguished in human translation: a gisting phase, a
drafting phase and a post-editing phase. In our relatively short and simple
text, gisting and post-editing seem to be optional: professional translators
skip the gisting phase, tend to start immediately with drafting and have
a longer post-editing phase. Novices, in contrast, require a longer gisting
phase and often completely skip post-editing. In line with this, [7] finds
that students "allocate considerably more time to each ST segment"; our
investigation indicates that this might be due to the longer gisting phase.
However, there seems to be an equal number of long-distance revisions
for students and professionals. Hence, students revise parts of their translations while drafting, whereas professional translators work in a more
structured manner and postpone revisions to a post-editing phase. Interestingly, irrespective of when the revision is made, students and professionals
revise the same parts of the translations, presumably because they face the
same problems in the translation.
Long-distance Revisions in Drafting and Post-editing
203
In order to figure out which of the phases in a translation process can
be mechanized, computer assistance might be conceived to support the
translator's structuring of the task: gisting support tools could prepare
the translator for difficulties of the ST, e.g. giving them a review of
frequently used terms in their contexts or pointing to unusual collocations,
whereas translation memories or MT post-editing tools [8,13] might be a
basis for drafting and post-editing support.
In the design of automated support for drafting and post-editing, special
attention should be given to the ID revision patterns, on which translators
spend much of their time.
If certain translation and post-editing strategies turn out to be more
successful than others, as in the case of our professional translators, then
they should presumably be taken into account in the design of translation
support tools. Under this assumption, an MT post-editing tool seems to be
better grounded than a translation completion tool [1], which would mix
the drafting and post-editing phases, as we have observed in novice translators.
References
1. Sergio Barrachina, Oliver Bender, Francisco Casacuberta, Jorge Civera, Elsa
Cubel, Shahram Khadivi, Antonio Lagarda, Hermann Ney, Jesús Tomás, Enrique
Vidal, and Juan-Miguel Vilar. Statistical Approaches to Computer-Assisted Translation. Computational Linguistics, 35(1):3–28, 2009.
2. Michael Carl. Triangulating product and process data: quantifying alignment units
with keystroke data. Copenhagen Studies in Language, 38:225–247, 2009.
3. Michael Carl and Arnt Lykke Jakobsen. Towards statistical modelling of translators’ activity data. International Journal of Speech Technology, 12(4):124–146,
2010.
4. P. Gerloff. Second Language Learners Reports on the Interpretive Process: Talkaloud Protocols of Translation. In [?], pages 243–262, 1986.
5. Arnt Lykke Jakobsen. Logging target text production with Translog. In [?], pages
9–20, 1999.
6. Arnt Lykke Jakobsen. Translation drafting by professional translators and by
translation students. In [?], pages 191–204, 2002.
7. Kristian T. H. Jensen. Distribution of attention between source text and target
text during translation. In IATIS, 2009.
8. Philipp Koehn and Barry Haddow. Interactive Assistance to Human Translators using Statistical Machine Translation Methods. In MT Summit, 2009.
http://www.mt-archive.info/MTS-2009-TOC.htm.
9. H. Krings. Was in den Köpfen von Übersetzern vorgeht. Gunter Narr, Tübingen,
1986.
10. W. Lörscher. Investigating the Translation Process. Meta, XXXVII(3):426–439,
1992.
11. Brenda Malkiel. From Ántonia to My Ántonia: tracking self-corrections with
Translog. Volume 37 of Copenhagen Studies in Language, pages 149–167. Copenhagen: Samfundslitteratur, 2008.
12. Daniel Perrin. Progression analysis (PA): investigating writing strategies at the
workplace. Pragmatics, 35:907–921, 2003.
13. Marco Trombetti. Creating the World's Largest Translation Memory. In MT
Summit, 2009. http://www.mt-archive.info/MTS-2009-TOC.htm.
Appendix: Source Text
Killer nurse receives four life sentences
Hospital Nurse Colin Norris was imprisoned for life today for the killing of
four of his patients. 32 year old Norris from Glasgow killed the four women
in 2002 by giving them large amounts of sleeping medicine. Yesterday, he
was found guilty of four counts of murder following a long trial. He was
given four life sentences, one for each of the killings. He will have
to serve at least 30 years. Police officer Chris Gregg said that Norris
had been acting strangely around the hospital. Only the awareness of
other hospital staff put a stop to him and to the killings. The police
have learned that the motive for the killings was that Norris disliked
working with old people. All of his victims were old weak women with
heart problems. All of them could be considered a burden to hospital
staff.
Dependency-based Translation Equivalents
for Factored Machine Translation
Elena Irimia, Alexandru Ceauşu
Research Centre for Artificial Intelligence, Bucharest, Romania
{elena, aceausu}@racai.ro
www.racai.ro
Abstract. One of the major concerns of machine translation practitioners is
to create good translation models: correctly extracted translation equivalents
and a reduced size of the translation table are the most important evaluation
criteria. This paper presents a method for extracting translation examples using
the dependency linkage of both the source and target sentence. To decompose
the source/target sentence into fragments, we identified two types of
dependency link-structures - super-links and chains - and used these structures
to set the translation example borders. The option for the dependency-linked
n-grams approach is based on the assumption that a decomposition of the sentence
into coherent segments, with complete syntactic structure and accounting
for extra-phrasal syntactic dependency, would guarantee "better" translation
examples and would make better use of the storage space. The performance of
the dependency-based approach is measured with the BLEU-NIST score, in
comparison with a baseline system.
Keywords. Lexical attraction model, statistical machine translation, translation
model
1 Introduction
The corpus-based paradigm in machine translation has seen various approaches to the
task of constructing reliable translation models,
− starting from the naïve "word-to-word" correspondences solution which was
studied in the early works ([1], [2]),
− continuing with the chunk-bounded n-grams ([3], [4], [5]), which were supposed to
account for compound nouns, collocations or idiomatic expressions,
− passing through the early approach of the bounded-length n-grams IBM statistical
translation models and the subsequent phrase-based statistical translation models
([6], [7], etc.),
− exploring the dependency-linked n-grams solutions, which can offer the possibility
of extracting long and sometimes non-contiguous examples and are able to capture
the structural dependencies in a sentence (e.g., the agreement between a verb and a
noun phrase in the subject position), see [8],
− and ending with the double-sided option for the sentence granularity level, which
can be appealing since sentence boundaries are easy to identify but brings the
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 205-216
Received 28/11/09
Accepted 16/01/10
Final version 12/03/10
206
Irimia E., Ceauşu A.
additional problem of fuzzy matching and complicated mechanisms of
recombination.
Several studies have been dedicated to the impact of using syntactic information in the
phrase extraction process on translation accuracy. Comparing the
constituency-based model and the dependency-based model, [9] concluded that "using
dependency annotation yields greater translation quality than constituency annotation
for PB-SMT". But, as previous works ([10] and [11]) have noted, the new phrase
models created by incorporating linguistic knowledge do not necessarily improve
translation accuracy by themselves, but rather in combination with the "old-fashioned"
bounded-length phrase models.
The process of extracting syntactically motivated translation examples varies
according to the resources and tools available to specific research groups
and for specific language pairs. In a detailed report on the syntactically motivated
approaches in SMT, focused on the methods that use the dependency formalism, [12]
distinguishes the situations in which dependency parsers are used for both the source
and target languages from those in which only a parser for the source side is available. In
the latter case, a direct projection technique is usually used to transfer the annotation
from the source to the target translation unit. This approach is motivated by
the direct correspondence assumption (DCA, [13]), which states that dependency
relations are preserved under direct projection. The projection is based on
correspondences between the words in the parallel sentences, obtained through the
lexical alignment (also called word alignment) process. Obviously, the quality of the
projection depends on the quality of the lexical alignment. Furthermore, [13] notes that
the target syntax structure obtained through direct projection is isomorphic to the
source syntax structure, thus producing isomorphic translation models. This
rarely corresponds to a real isomorphism between the two languages involved.
In the experiments we describe in this paper, we had the advantage of a
probabilistic unsupervised dependency analyzer which depends on the text's
language only through a small set of rules designed to filter the previously identified
links. As dependency-link analyses are available for both the source and the target
side, there is no need for direct projection in the translation example extraction, and
the problem of "compulsory isomorphism" is avoided.
2 Research Background
In previous experiments with an example-based approach to machine translation for
the English-Romanian language pair, we developed a strategy for extracting
translation examples using the information provided by a dependency linker
described in [14]. We justified our choice of the dependency-linked n-grams
approach by the assumption in [15] that the EBMT potential should rely on
exploiting text fragments shorter than the sentence, and also by the intuition that a
decomposition of the source sentence into "coherent segments", with complete
syntactic structure, would be "the best covering" of that sentence.
Dependency-based Translation Equivalents for Factored Machine Translation
207
The dependency linker used is based on Yuret's Lexical Attraction Model (LAM,
[16]), in whose view lexical attraction is a probabilistic measure of the
combining affinity between two words in the same sentence. Applied to machine
translation, the lexical attraction concept can serve as a means of guaranteeing the
usefulness of translation examples. If two words are "lexically attracted" to one another
in a sentence, the probability that they combine in future sentences is significant.
Therefore, two or more words from the source sentence that manifest lexical
attraction, together with their translations in the target language, represent a better
translation example than a bounded-length n-gram.
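As a rough illustration of the idea, lexical attraction can be approximated by a pointwise mutual information score over co-occurrence counts. This is only a sketch in the spirit of the model, not LexPar's actual computation, and the counts below are invented:

```python
import math

# PMI-style lexical attraction: how much more often two words co-occur
# than they would if they occurred independently of each other.
def lexical_attraction(pair_count: int, count1: int, count2: int, total: int) -> float:
    """log2( P(w1,w2) / (P(w1) * P(w2)) ), computed from raw counts."""
    p_pair = pair_count / total
    p1, p2 = count1 / total, count2 / total
    return math.log2(p_pair / (p1 * p2))

# Invented counts: "national" and "currency" co-occur 4 times in 16 sentences.
score = lexical_attraction(4, 8, 4, 16)  # positive score => attracted
```

A positive score indicates that the pair co-occurs more often than chance, which is the property the extraction procedure exploits.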
The choice of Yuret's LAM as the basis for the dependency analyzer
was motivated by the lack of a dependency grammar for Romanian. The
alternative was to perform syntactic analysis based on automatically induced
grammatical models. A basic requirement for the construction of this type of model is the
existence of syntactically annotated corpora from which machine learning techniques
could extract statistical information about the ways in which syntactic elements
combine. As no syntactically annotated corpus for Romanian was available, the fact
that Yuret's method could use LAM for finding dependency links in an unannotated
corpus made this algorithm a practical choice.
LexPar [14], the dependency link analyzer we used for the experiments described
in this paper, extends Yuret's algorithm with a set of syntactic rules specific to
the processed languages (Romanian and English) that constrain link formation.
It also contains a simple generalization mechanism for link properties, which
eliminates the original algorithm's inability to handle unknown words. However, the
LexPar algorithm does not guarantee a complete analysis, because the syntactic filter
can contain rules that forbid the linking of two words in a case in which this link
should be allowed. The rules were designed by the algorithm's author based on his
observations of the increased ability of a certain rule to reject wrong links, at the
risk of rejecting good links in a few cases.
In our research group, significant effort has been invested in experimenting with
statistical machine translation methodologies, focused on building accurate language
resources (the larger the better) and on fine-tuning the statistical parameters. The aim
was to demonstrate that acceptable MT prototypes can be quickly developed in this
way, and the claim was supported by the encouraging BLEU scores we obtained
for the Romanian<->English translation system. The translation experiments
employed the MOSES toolkit, an open-source platform for the development of statistical
machine translation systems (see next section).
One of the goals of this paper is to analyze the impact of incorporating syntactic
information in the translation model by means of a probabilistic dependency link
analyzer. Although the unsupervised nature of the analyzer affects its recall,
using this tool brings the advantage of having syntactic information available for
translation without the need to train on syntactically annotated corpora. We feed the
Moses decoder with the new translation model and compare the translation results
with those of the baseline system. In the remaining sections we give a short
survey of the resources and tools used in the SMT experiments (section 3), describe
the dependency-motivated translation example extraction process (section 4) and
present the experiments and results with the dependency-based translation model
(section 5).
3 Factored Phrase-Based Statistical Machine Translation
The corpus. The Acquis Communautaire is the total body of European Union
(EU) law applicable in the EU Member States. This collection of legislative texts
changes continuously and currently comprises texts written between the 1950s and
2008 in all the languages of EU Member States. A significant part of these parallel
texts has been compiled by the Language Technology group of the European
Commission's Joint Research Centre at Ispra into an aligned parallel corpus, called
JRC-Acquis [17], publicly released in May 2006. Recently, the Romanian side of the
JRC-Acquis corpus was extended to a size comparable with the dimensions of the
other language parts (19,211 documents).
For the experiments described in this paper, we retained only 1-1 alignment pairs
and restricted the selected pairs so that no sentence contained more than 80
words and the ratio between the sentence lengths in an aligned pair was less
than 7. Finally, the Romanian-English parallel corpus we used contained about
600,000 translation units.
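The filtering just described amounts to a simple predicate over each aligned pair. A minimal sketch, with the thresholds taken from the text (the function name is ours):

```python
def keep_pair(src_words, tgt_words, max_len=80, max_ratio=7):
    """Keep a 1-1 aligned sentence pair only if both sides have at most
    max_len words and the sentence-length ratio is below max_ratio."""
    len_s, len_t = len(src_words), len(tgt_words)
    if len_s == 0 or len_t == 0:
        return False
    if len_s > max_len or len_t > max_len:
        return False
    return max(len_s, len_t) / min(len_s, len_t) < max_ratio
```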
Romanian and English texts were processed with the RACAI tools [18] integrated
into the linguistic web-service platform available at http://nlp.racai.ro/webservices.
After tokenization, tagging and lemmatization, this new information was added to
the XML encoding of the parallel corpora. Figure 1 shows the representation of the
Romanian segment encoding for the translation unit displayed in Figure 2. The
tagsets used were compliant with the MULTEXT-East specifications Version 3 [19]
(for the details of the morpho-syntactic annotation, see http://nl.ijs.si/ME/V3/msd/).
<tu id="3936">
...
<seg lang="ro">
<s id="31985L0337.n.83.1">
<w lemma="informaţie" ana="Ncfpry">Informaţiile</w>
<w lemma="culege" ana="Vmp--pf">culese</w>
<w lemma="conform" ana="Spsd">conform</w>
<w lemma="art." ana="Yn">art.</w>
<w lemma="5" ana="Mc">5</w>
<c>,</c>
<w lemma="6" ana="Mc">6</w>
<w lemma="şi" ana="Crssp">şi</w>
<w lemma="7" ana="Mc">7</w>
<w lemma="trebui" ana="Vmip3s">trebuie</w>
<w lemma="să" ana="Qs">să</w>
<w lemma="fi" ana="Vasp3">fie</w>
<w lemma="lua" ana="Vmp--pf">luate</w>
<w lemma="în" ana="Spsa">în</w>
<w lemma="considerare" ana="Ncfsrn">considerare</w>
<w lemma="în cadrul" ana="Spcg">în cadrul</w>
<w lemma="procedură" ana="Ncfsoy">procedurii</w>
<w lemma="de" ana="Spsa">de</w>
<w lemma="autorizare" ana="Ncfsrn">autorizare</w>
<c>.</c>
</s>
</seg>
...
</tu>
Figure 1: Linguistically analysed sentence (Romanian) of a translation unit of the JRC-Acquis
parallel corpus
Based on the monolingual data from the JRC-Acquis corpus we built language
models for each language. For Romanian we used the TTL [20] and METT [21]
tagging modelers. Both systems are able to perform tiered tagging [22], a morpho-syntactic
disambiguation method specially designed to work with large (lexical) tagsets.
In order to build the translation models from the linguistically analyzed parallel
corpora we used GIZA++ [23] and constructed unidirectional translation models (EN-RO,
RO-EN) which were subsequently combined. After that step, the final translation
tables were computed. The processing unit considered in each language was not the
word form but the string formed by its lemma and the first two characters of the
associated morpho-syntactic tag (e.g. for the wordform "informaţiile" we took the
item "informaţie/Nc"). For each language we used 20 iterations (5 for Model 1, 5 for
HMM, 1 for THTo3, 4 for Model 3, 1 for T2To4 and 4 for Model 4). We included
neither Model 5 nor Model 6, as we noticed a degradation of the perplexities of the
alignment models on the evaluation data.
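The processing unit just described (lemma plus the first two tag characters) can be sketched as a one-line helper (the function name is ours):

```python
def processing_unit(lemma: str, msd_tag: str) -> str:
    """Reduce a token to its lemma plus the first two characters of its
    morpho-syntactic tag, e.g. ('informaţie', 'Ncfpry') -> 'informaţie/Nc'."""
    return f"{lemma}/{msd_tag[:2]}"
```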
The MOSES toolkit [24] is a public-domain environment, developed in the ongoing
European project EUROMATRIX, which allows for rapid prototyping of
statistical machine translation systems. It assists the developer in constructing the
language and translation models for the languages he/she is concerned with, and
through its advanced factored decoder and control system it ensures the solving of
the fundamental equation of statistical machine translation in a noisy-channel model:

Target* = argmax_Target P(Source|Target) * P(Target)    (1)
P(Target) is the statistical representation of the (target) language model. In our
implementation, a language model is a collection of prior and conditional
probabilities for unigrams, bigrams and trigrams seen in the training corpus. The
conditional probabilities relate lemmas and morpho-syntactic descriptors (MSDs),
word forms and lemmas, and sequences of two or three MSDs. P(Source|Target) is
the statistical representation of the translation model, and it consists of conditional
probabilities for various attributes characterizing equivalences for the considered
source and target languages (lemmas, MSDs, word forms, phrases, dependencies,
etc.). The argmax functional is realized by a decoder, a procedure able to find, in
the huge search space P(Source|Target)*P(Target) corresponding to possible
translations of a given Source text, the Target text that represents the optimal
translation, i.e. the one which maximizes the compromise between the faithfulness of
the translation (P(Source|Target)) and its fluency/grammaticality (P(Target)). The
standard implementation of a decoder is essentially an A* search
algorithm. The current state-of-the-art decoder is the factored decoder implemented in
the MOSES toolkit. As the name suggests, this decoder is capable of considering
multiple information sources (called factors) in implementing the argmax search.
What is extremely useful is that the MOSES environment allows a developer to
provide the MOSES decoder with externally developed language and translation
models, offering means to convert the necessary data structures into the expected
format and to further improve them. Once the statistical models are in the prescribed
format, the MT system developer may define his/her own factoring strategy. If the
information is provided, the MOSES decoder can use various factors (attributes) of
each of the lexical items (words or phrases): occurrence form, lemmatized form,
associated part-of-speech or morpho-syntactic tag. Moreover, the system allows for
the integration of higher-order information (shallow or even deep parsing
information) in order to improve the reordering of output lexical items. For further
details on the MOSES Toolkit for Statistical Machine Translation and its tuning, the
reader is directed to the EUROMATRIX project web-page http://www.euromatrix.net/
and to the download web-page http://www.statmt.org/moses/.
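Equation (1) can be illustrated with a toy enumeration over a hand-made candidate list. A real decoder such as Moses performs an A*-style beam search over an enormous hypothesis space; in the sketch below all words and probabilities are invented for illustration only.

```python
def decode(source, candidates, tm_prob, lm_prob):
    """Pick the target maximizing P(Source|Target) * P(Target), as in eq. (1)."""
    return max(candidates, key=lambda t: tm_prob(source, t) * lm_prob(t))

# Toy models: 'house' wins because 0.7 * 0.5 > 0.6 * 0.3.
tm = {("casa", "house"): 0.7, ("casa", "home"): 0.6}
lm = {"house": 0.5, "home": 0.3}
best = decode("casa", ["house", "home"],
              lambda s, t: tm[(s, t)], lambda t: lm[t])
```

The compromise between faithfulness and fluency is visible even here: "home" has the higher translation probability on its own, but the language model tips the product in favour of "house".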
4 Extracting Translation Examples from Corpora (ExTract)
In our approach, based on the availability of a dependency linker for both the source
and the target language, the task of extracting translation examples from a corpus
contains two sub-problems: dividing the source and target sentences into fragments
(according to the chosen approach) and setting correspondences between the
fragments in the source sentence and their translations in the target sentence. The
second problem is basically fragment alignment, and we solved it through a heuristic
based on lexical alignments produced by GIZA++.
The remaining problem was addressed using the information provided by LexPar,
the dependency linker mentioned above. With a recall of 60.70% for English, LexPar
was considered an appropriate starting point for the experiments (extending or
correcting the set of rules incorporated as a filter in LexPar can improve its recall).
Using MtKit, a tool specially designed for the visualization and correction of
lexical alignments, adapted to allow the graphical representation of the dependency
links, we could study the dependency structures created by the identified links inside
a sentence, and we were able to observe some patterns in the links' behavior: they tend
to group by nesting and to decompose the sentence by chaining. Of course, these
patterns are direct consequences of the syntactic structures and rules of the
studied languages, but the visual representation offered by MtKit simplified the task
of formalization and heuristic modeling (see Fig. 2).
These properties suggest several possible decompositions of the same sentence, and
implicitly the extraction of substrings of different lengths that satisfy the condition of
lexical attraction between the component words.
Example 1: in Figure 2, from the word sequence "made in the national currency"
the following subsequences can be extracted: "national currency", "the national
currency", "in the national currency", "made in the national currency". The irrelevant
sequences and those susceptible of generating errors (like "the national", "in the",
"made in the national") are ignored.
Fig. 2. MtKit visualisation of the alignments and links for an English-Romanian translation unit.
An arrow marks the existence of a dependency link between the two words it unites. The arrow
direction is not relevant for the dependency link orientation.
The patterns observed above were formalized as superlinks (link structures
composed of at least two simple links which nest, see Figure 3) and as chains (link
structures composed of at least two simple links or superlinks which form a chain, see
Figure 4).
Fig. 3. Superlink structures
Fig. 4. Chain structures
As input data, ExTract (the application that extracts translation examples from
corpora) receives the processed corpus and a file containing the lexical alignments
produced by GIZA++ [23]. We will describe the extraction procedure for a single
translation unit U in the corpus, containing Ss (a source sentence) and its translation
Ts (a target sentence). Starting from the first position in Ss (respectively Ts), we
identify and extract every possible chaining of links and superlinks, with the condition
that the number of chain loops is limited to 3. The limitation was introduced to avoid
overloading the database; subsequent experiments showed that increasing the
limit to 4 or 5 chains did not significantly improve the BLEU score of the
translation system. Two lists of candidate sentence fragments, one from Ss and one
from Ts, are extracted.
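A rough sketch of the chaining step, under an assumed data model (links as word-index pairs): the chaining and the loop cap of 3 follow the description above, while the filtering of irrelevant fragments illustrated in Example 1 is omitted.

```python
def chain_fragments(words, links, max_loops=3):
    """Collect word spans covered by chains of at most max_loops links,
    starting from every position in the sentence."""
    by_left = {}
    for i, j in links:
        by_left.setdefault(i, []).append(j)
    fragments = set()
    for start in range(len(words)):
        ends = [start]
        for _ in range(max_loops):
            # follow every link whose left end is a current chain end
            ends = [j for e in ends for j in by_left.get(e, [])]
            if not ends:
                break
            fragments.add(" ".join(words[start:max(ends) + 1]))
    return fragments

# Invented links over the sequence from Example 1:
words = "made in the national currency".split()
links = [(0, 1), (1, 4), (2, 4), (3, 4)]
frags = chain_fragments(words, links)
```

With these (invented) links, the candidate set contains the four useful subsequences listed in Example 1.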
Every fragment in both sentences is projected through the lexical alignment onto a
word string in the other language (note that this is not the direct syntactic structure
projection discussed above). A projected string of a candidate fragment in Ss is not
necessarily part of the list of candidate sentence fragments of Ts, and vice versa (LexPar
is not able to identify all the dependency links in a sentence, and the lexical alignments
are also subject to errors). But if a fragment candidate from Ss projects onto a fragment
candidate from Ts, the pair has a higher probability of representing a correct
translation example. In this stage, the application extracts all the possible translation
examples (<source fragment candidate, projected word string>, <projected word
string, target fragment candidate>) but distinguishes between them, associating a
"trust" flag f="2" with the translation examples of the form <source fragment
candidate, target fragment candidate>, and a flag f="1" with all the others. Thereby, it is
possible to experiment with translation tables of different sizes and different quality
levels.
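The trust-flag assignment can be sketched as follows, under an assumed data model (candidate fragments as strings, the projection as a lookup derived from the GIZA++ alignments; all names are ours):

```python
def flag_examples(src_candidates, tgt_candidates, project):
    """Pair each source fragment with its projected target string and flag it:
    f=2 when the projection is itself a target-side candidate fragment,
    f=1 otherwise."""
    examples = []
    for frag in src_candidates:
        projected = project(frag)
        examples.append((frag, projected, 2 if projected in tgt_candidates else 1))
    return examples

# Invented toy projection for illustration:
project = {"national currency": "moneda naţională", "in the": "în"}.get
out = flag_examples(["national currency", "in the"], {"moneda naţională"}, project)
```

Keeping only f=2 entries yields the trustful table described in the next section; keeping everything yields the relaxed table.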
5 Experiments and Results
Taking into account results from previous works ([12], [13]) showing that
dependency-based translation models give improved performance in combination
with a phrase-based translation model, we decided to conduct our experiments in a
mixed frame: we extracted from the dependency-based translation model only the
translation examples longer than 2 source words <-> 2 target words, creating a
reduced dependency-based translation model, and we combined it with the phrase-based
translation model generated with the Moses toolkit.
Starting from the reduced D-based translation model, we can develop two different
translation tables, based on the “trust” flags we introduced before:
- a trustful D-based translation table (if we keep only the examples with the
flag f=”2”)
- a relaxed D-based translation table (if we accept all the examples, irrespective
of the flags).
As we previously mentioned, the initial working corpus contained around 600,000
translation units. From this number, 600 were extracted for tuning and testing. The
tuning of the factored translation decoder (the weights on the various factors) was
based on the 200 development sentence pairs, using the MERT method [25]. The test
set contains 400 translation units.
The evaluation tool was the latest version of the NIST official mteval script1, which
produces BLEU and NIST scores [26]. For the evaluation, we lowercased both the
reference and the automatic translations. The results are synthesized in the following
table, where it can be noticed that our assumption that the trustful table would produce
better results than the relaxed one was contradicted by the evidence. We thus learned
that a wider range of multi-word examples is preferable to a restricted one, even if
their correctness is not guaranteed by the syntactic analysis.
Table 1. Evaluation of the dependency translation table compared with the translation table
generated with Moses (on unseen data)

                       Moses table          Dependency translation table
                                            Trustful table       Relaxed table
Language pair          NIST     BLEU        NIST      BLEU       NIST      BLEU
English to Romanian    8.6671   0.5300      8.4998    0.5006     8.6900    0.5334
Romanian to English   10.7655   0.6102     10.3122    0.5812    10.3235    0.6191
As can be seen in the table, the translation accuracy obtained with the dependency-based
translation table is very close to that obtained with Moses, but still lower.
We therefore took a closer look at the translations and noticed a considerable
number of cases in which the dependency-based translation was more accurate in
terms of human evaluation. Because of space restrictions, we present here only
a few of these cases, and only for one direction of translation (English to Romanian).
It can be noticed that the exact n-gram matching between the dependency-based
translation and the reference is not as successful as that between the Moses
translation and the reference. But a flexible word matching, allowing
morphological variants and synonyms to be taken into account as legitimate
correspondences, shows that the dependency-based translation is also legitimate
in terms of human translation evaluation.
English original:
the insurance is connected to a contract to provide assistance in the event of accident or
breakdown involving a road vehicle;
whereas, in the light of experience gained, it is necessary to reconsider the consequences of
the disposal of products from intervention on the markets of third countries other than those
intended at the time of exportation;
the competent authorities of the member states shall afford each other administrative
assistance in all supervisory procedures in connection with legal provisions and quality
standards applicable to foodstuffs and in all proceedings for infringements of the law
applicable to foodstuffs.
any administrative measure taken against an individual, leaving aside any consideration of
general interest referred to above, on one of the grounds mentioned in article 1a, which is
sufficiently severe in the light of the criteria referred to in section 4 of this joint position, may
be regarded as persecution, in particular where it is intentional, systematic and lasting.
Romanian original:
1 ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v12.pl
asigurarea priveşte un contract de acordare de asistenţă în caz de accident sau defecţiune a
unui vehicul rutier;
întrucât, luând în considerare experienţa dobândită, este necesar să se reconsidere
consecinţele desfacerii produselor de intervenţie asupra pieţelor din ţări terţe altele decât cele
prevăzute în cazul exportului;
autorităţile competente din statele membre trebuie să îşi acorde reciproc asistenţă
administrativă în toate procedurile de supraveghere prevăzute în dispoziţiile legale şi în
normele de calitate aplicabile alimentelor, precum şi în toate procedurile privind încălcarea
legislaţiei în domeniul produselor alimentare.
orice măsură administrativă luată împotriva unui individ, în afara considerentelor de
interes general evocate mai sus, datorită unuia dintre motivele menţionate în art. 1a, care este
suficient de severă potrivit criteriilor enunţate în secţiunea 4 din prezenta poziţie comună,
poate fi considerată ca persecuţie, în special când aceasta prezintă un caracter intenţional,
sistematic şi durabil.
Moses translation:
asigurarea este conectat la un contract să furnizeze asistenţă în caz de accident sau defecţiune a unui vehicul rutier;
întrucât, ţinând seama de experienţa dobândită, este necesar să se reconsidere consecinţele comercializării produselor de intervenţie pe pieţele din ţările terţe, altele decât cele prevăzute în momentul exportului;
autorităţile competente ale statelor membre îşi acordă reciproc asistenţă administrativă în toate procedurile de supraveghere legate de dispoziţiile legale şi standardele de calitate se aplică produselor alimentare şi în toate procedurile pentru încălcarea legii aplicabile produselor alimentare
orice măsură administrativă luată împotriva unui individ, lăsând din circuitul agricol orice cauză de interes general menţionat anterior, pe unul din motivele menţionate în art. 1a, care este suficient de grave în lumina criteriilor menţionate la punctul 4 din prezenta poziţie comună, pot fi considerate ca persecuţie, în special atunci când s-a intenţionat, sistematic şi de durată.
Dependency-based translation:
asigurarea priveşte un contract de asistenţă în caz de accident sau defecţiune a unui vehicul rutier;
întrucât, în lumina experienţei acumulate, este necesar să se reconsidere consecinţele comercializării produselor de intervenţie pe pieţele ţărilor terţe altele decât cele prevăzute în cazul exportului;
autorităţile naţionale competente din statele membre acorde reciproc asistenţă administrativă în toate procedurile prevăzute în dispoziţiile financiare şi ale standardelor de calitate aplicabile produselor alimentare şi în toate procedurile privind încălcarea legii aplicabile produselor alimentare.
orice măsură administrativă luată împotriva unui individ, exclusiv, în afara considerentelor de interes general menţionat anterior, pentru unul din motivele menţionate la articolul 1a, care este suficient de grave ţinând seama de criteriile enunţate în secţiunea 4 din prezenta poziţie comună, poate fi considerată ca persecuţie, în cazul în care este intenţionat, sistematic şi durabile.
Dependency-based Translation Equivalents for Factored Machine Translation

5 Conclusions
We described in this paper our method of extracting translation examples from corpora, based on the links identified with a statistical, unsupervised dependency linker. Although the evaluation results did not surpass the performance of the Moses translation model, the scores are promising and can be improved by increasing LexPar's recall. We also intend to evaluate the results using metrics more sensitive to morphological variation and synonymy (e.g. METEOR [26]).
Acknowledgements
The work reported here is funded by the STAR project, financed by the Ministry of Education, Research and Innovation under grant no. 742.
References
1. Gale, W. and K. Church. 1991. Identifying Word Correspondences in Parallel Texts. In Proceedings of the 4th DARPA Speech and Natural Language Workshop, Pacific Grove, CA, pp. 152-157.
2. Melamed, I.D. 1995. Automatic Evaluation and Uniform Filter Cascades for Inducing N-best Translation Lexicons. In Proceedings of the Third Annual Workshop on Very Large Corpora, Cambridge, England, pp. 184-198.
3. Kupiec, J. 1993. An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora. In 31st Annual Meeting of the Association for Computational Linguistics, Columbus, OH, pp. 23-30.
4. Kumano, A. and H. Hirakawa. 1994. Building an MT Dictionary from Parallel Texts Based on Linguistic and Statistical Information. In COLING-94: Proceedings of the 15th International Conference on Computational Linguistics, Kyoto, Japan, pp. 76-81.
5. Smadja, F., K.R. McKeown and V. Hatzivassiloglou. 1996. Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics 22(1):1-38.
6. Och, F.-J., Ch. Tillmann and H. Ney. 1999. Improved Alignment Models for Statistical Machine Translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC 99), College Park, MD, June, pp. 20-28.
7. Marcu, D. and W. Wong. 2002. A Phrase-Based, Joint Probability Model for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 02), Philadelphia, PA, July, pp. 133-139.
8. Yamamoto, K. and Y. Matsumoto. 2003. Extracting Translation Knowledge from Parallel Corpora. In: Michael Carl and Andy Way (eds.), Recent Advances in Example-Based Machine Translation. Dordrecht: Kluwer Academic Publishers, pp. 365-395.
9. Hearne, M., S. Ozdowska and J. Tinsley. 2008. Comparing Constituency and Dependency Representations for SMT Phrase-Extraction. In Proceedings of TALN '08, Avignon, France.
10. Groves, D. and A. Way. 2005. Hybrid Example-Based SMT: the Best of Both Worlds? In Proceedings of the ACL 2005 Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, Ann Arbor, MI, pp. 183-190.
11. Tinsley, J., M. Hearne and A. Way. 2007. Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT-07), Bergen, Norway.
12. Ambati, V. 2008. Dependency Structure Trees in Syntax Based Machine Translation. 11-734 Spring 2008, Survey Report. http://www.cs.cmu.edu/~vamshi/publications/DependencyMT_report.pdf
13. Hwa, R., Ph. Resnik, A. Weinberg, C. Cabezas and O. Kolak. 2005. Bootstrapping Parsers via Syntactic Projection across Parallel Texts. Natural Language Engineering, 11(3):311-325, September.
14. Ion, R. 2007. Metode de dezambiguizare automată. Aplicaţii pentru limbile engleză şi română [Automatic Disambiguation Methods. Applications for English and Romanian]. PhD thesis, Romanian Academy, Bucharest.
15. Cranias, L., H. Papageorgiou and S. Piperidis. 1994. A Matching Technique in Example-Based Machine Translation. In Proceedings of the 15th Conference on Computational Linguistics, Volume 1, Kyoto, Japan, pp. 100-104.
16. Yuret, D. 1998. Discovery of Linguistic Relations Using Lexical Attraction. PhD thesis, Department of Computer Science and Electrical Engineering, MIT.
17. Steinberger, R., B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec and D. Tufiş. 2006. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of the 5th LREC Conference, Genoa, Italy, 22-28 May, pp. 2142-2147.
18. Tufiş, D., R. Ion, A. Ceauşu and D. Ştefănescu. 2008. RACAI's Linguistic Web Services. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech, Morocco. ELRA - European Language Resources Association. ISBN 2-9517408-4-0.
19. Erjavec, T. 2004. MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), ELRA, Paris, pp. 1535-1538.
20. Ion, R. 2007. Word Sense Disambiguation Methods Applied to English and Romanian. PhD thesis (in Romanian), Romanian Academy, Bucharest, 138 p.
21. Ceauşu, Al. 2006. Maximum Entropy Tiered Tagging. In Janneke Huitink and Sophia Katrenko (eds.), Proceedings of the Eleventh ESSLLI Student Session, pp. 173-179.
22. Tufiş, D. 1999. Tiered Tagging and Combined Language Models Classifiers. In Václav Matousek, Pavel Mautner, Jana Ocelíková and Petr Sojka (eds.), Text, Speech and Dialogue (TSD 1999), Lecture Notes in Artificial Intelligence 1692, Springer Berlin/Heidelberg, ISBN 978-3-540-66494-9, pp. 28-33.
23. Och, F.J. and H. Ney. 2000. Improved Statistical Alignment Models. In Proceedings of the 38th Conference of the ACL, Hong Kong, pp. 440-447.
24. Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, S. Wade, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst. 2007. MOSES: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic.
25. Och, F.J. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, July 2003, pp. 160-167.
26. Banerjee, S. and A. Lavie. 2005. An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, June 2005, pp. 65-72.
Information Retrieval
and Text Clustering
Relation Learning from Persian Web:
A Hybrid Approach
Hakimeh Fadaei1, Mehrnoush Shamsfard1
1 NLP Research Laboratory, Faculty of Electrical & Computer Engineering,
Shahid Beheshti University, Tehran, Iran
ha.fadaee@mail.sbu.ac.ir, m-shams@sbu.ac.ir
Abstract. In this paper a hybrid approach is presented for relation extraction from the Persian web. The approach combines statistical, pattern-based, structure-based and similarity-based methods, using linguistic heuristics to detect some of the errors. In addition to the web, the developed system employs tagged corpora and WordNet as input resources in the relation learning procedure. The proposed methods extract both taxonomic and non-taxonomic, specific or unlabeled relations from semi-structured and unstructured documents.
In this system, a set of Persian patterns was manually extracted to be used in the pattern-based component. The similarity-based approach, which uses WordNet relations as a guide to extracting Persian relations, applies a WSD method to map Persian words to English synsets. The system, one of the few ontology learning systems for Persian, showed good results in the performed tests; in spite of the shortage of resources and tools for Persian, the results were comparable with methods proposed for English.
Keywords: Relation learning, knowledge extraction, ontology learning, web,
Wikipedia, similarity, Persian.
1 Introduction
Automatic extraction of semantic relations is a challenging task in the field of knowledge acquisition from text and has been addressed by many researchers in recent years. As ontologies are widely used in many branches of science, building or enriching them is of great importance. Automatic methods for performing these tasks are highly desirable, since building ontologies manually is very time consuming. There are no available ontologies for Persian, and little work has been done on automatic extraction of ontological knowledge for this language.
In this paper we present a hybrid approach for extracting taxonomic and non-taxonomic relations from Persian resources. In the proposed system our focus is on using the web as the learning resource, although we also use other resources to increase the effectiveness of our system. The paper is organized as follows: in section 2 a brief review of related work is presented. The third section is dedicated to describing
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 219-230
Received 29/11/09
Accepted 16/01/10
Final version 10/03/10
Fadaei H., Shamsfard M.
our system and the different methods used in it. Finally, the performed tests and their results are described in section 4.
2 Related Work
Automatic extraction of conceptual relations has attracted much attention, and many researchers work on proposing more efficient strategies in this field. In recent years, different approaches have been proposed to extract taxonomic and non-taxonomic relations from different resources.
Pattern-based, statistical, structure-based and linguistic approaches are well-known approaches for extracting relations from texts. Many systems [2, 3, 8] use combinations of these approaches to accumulate their advantages.
Pattern matching methods are widely used in extracting taxonomic and non-taxonomic relations. In this category, Hearst patterns [1] are among the most famous patterns defined for extracting taxonomic relations and have been used or adapted in many ontology learning systems [2, 3]. Patterns may be defined manually [3] or extracted automatically [4]. Some other systems [5, 14] use document structures to extract relations; these structures include tables, hyperlinks, HTML and XML tags and so on.
Some systems use statistical methods and rely on the distribution of words and their modifiers in text to extract relations [3, 6, 7, 8]. The linguistic structure of sentences is another source of information used in some systems [2, 8, 9]. Linguistic methods use morphological, syntactic and semantic analysis to extract relations. These methods need many linguistic tools, such as chunkers, taggers and parsers, and are not easily used in languages such as Persian, for which these tools are unavailable.
On the other hand, ontology learning systems use different resources to extract ontological knowledge; these resources include structured, semi-structured or unstructured data. Raw or tagged texts are used by many systems, such as [3, 5, 6]. Tagged corpora are proper resources for knowledge extraction, as they are targeted at this task. The tags (POS, semantic or …) help systems to better detect relations, but they are not available in all languages. In recent years many researchers have been attracted to the web as a knowledge extraction resource.
The main reason that has attracted researchers to web documents for ontology learning is the huge amount of text in many languages that is available to everybody. Apart from availability and size, there are some other features of web documents which make them suitable for the task of ontology learning and especially relation extraction.
Web documents are usually filled with structured and semi-structured data, tables and hyperlinks which can be used in the process of ontology learning; some systems, like [5, 14], use these structures to learn ontological knowledge. Another facility provided by the web is the existence of high-performance and efficient search engines like Google, which can be used to search this large body of text. Many systems, like [2, 4, 8, 11, 12], use search engines in the procedure of relation extraction. Wikipedia is another advantage of using the web in ontology learning, since it is a collection of very informative short articles well suited for knowledge extraction. In systems like [5, 8, 13] Wikipedia is used in relation extraction.
Besides the positive points mentioned above for the web as a learning resource, the web has some shortcomings as well. Web documents are mainly written by ordinary people with no NLP background, and as they are not primarily targeted at NLP applications, they may need special processing in comparison with corpora prepared by language experts.
3 The Proposed Hybrid Approach
Our proposed approach is a combination of statistical, pattern-based, linguistic, structure-based and similarity-based methods. These methods may be used in a serial or parallel order. In the serial (sequential) order the output of one method is the input of another, while in the parallel order each method extracts some relations and the best ones are then chosen by voting. The methods can also be used separately to extract different types of relations. In this approach we use the web to a great extent, to benefit from its advantages; but to cover the drawbacks of using the web as a resource, we also use other resources such as corpora, dictionaries and WordNet. In this section we describe each method in more detail.
3.1 Structure Based Approach
The structure-based part of our system uses the structures of Wikipedia pages, such as tables, bullets and hyperlinks, to extract relations. In many Wikipedia documents some information is given via bullets, and this information usually expresses taxonomic relations. Given a Persian word, the system follows these steps to extract relations from bulleted text:
1. Extract the Wikipedia article of the given word.
2. Find the bulleted parts and extract their items.
3. Refine the items extracted in the second step by omitting stop words and prepositional phrases, finding the heads of nominal groups, and so on.
4. Make new relations between the title of the bulleted part and the results of the third step.
The translations of some of these relations are presented in table 1.
Table 1. Translation of some extracted relations from bullets
Isa(Versailles , historical place)
Isa(car accident, event)
Isa(Islam, religion)
Isa (hypertension, disease)
Isa(suicide, death)
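The steps above can be sketched roughly as follows; the helper names, the tiny stop-word list and the example item are our own illustrations (step 1, fetching the article, is omitted), not part of the described system:

```python
# Sketch of steps 2-4, assuming the article's bulleted section has already
# been fetched and split into a section title and its items.
STOP_WORDS = {"the", "a", "an", "of"}  # tiny illustrative stand-in

def refine(item):
    # Step 3 (crude): drop stop words; a real system would also strip
    # prepositional phrases and keep only the head of the nominal group.
    return " ".join(w for w in item.lower().split() if w not in STOP_WORDS)

def relations_from_bullets(section_title, items):
    # Step 4: relate each refined item to the title of the bulleted part.
    return [("isa", refine(it), section_title) for it in items]
```

For instance, `relations_from_bullets("historical place", ["the Versailles"])` yields the first relation of table 1.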
Disambiguation pages in Wikipedia are also very helpful in extracting taxonomic relations. When a polysemous word is searched in Wikipedia and there are separate articles for each meaning of the word, Wikipedia returns a disambiguation page as the search result. On this page some or all of the meanings of the word are presented, usually with a brief explanation in front of them. These explanations can be either a phrase or just a word indicating the parent of the word.
When searching for the word "tree" in Wikipedia, on the disambiguation page we come across the following meanings of the word "tree":
• Tree is a woody plant
• Tree structure, a way of representing the hierarchical nature of a structure in a
graphical form
• Tree (data structure), a widely used computer data structure that emulates a tree
structure with a set of linked nodes
• Tree (graph theory), a connected graph without cycles
From the above explanations we can extract the relations:
isa(tree, woody plant)
isa(tree, way of representing)
isa(tree, computer data structure)
isa(tree, connected graph)
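A rough sketch of how such a parent term might be pulled out of one disambiguation entry; the two regular expressions cover only the entry shapes shown above, and the clause-cutting word list is our assumption:

```python
import re

ARTICLES = re.compile(r"^(?:a|an|the)\s+", re.I)

def parent_from_entry(entry):
    # "Tree is a woody plant" -> gloss after the copula
    m = re.match(r".+?\bis an?\s+(.+)", entry)
    if not m:
        # "Tree (graph theory), a connected graph without cycles"
        m = re.match(r".+?,\s*(.+)", entry)
    if not m:
        return None
    gloss = ARTICLES.sub("", m.group(1))
    # keep only the leading noun phrase, cutting at a subordinate clause
    return re.split(r"\s+(?:that|which|without|with|in)\b", gloss)[0]
```

For the entries above, this yields "woody plant" and "connected graph" as parents of "tree".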
Table 2 contains some of the relations extracted from disambiguation pages. The extracted relations hold between Persian terms, but to be understandable for all readers we give the English translations of the extracted relations throughout this paper.
Table 2. Translation of some extracted relations from disambiguation pages
Isa (milk, dairy product)
Isa (lion, Felidae)
Isa (valve, device)
Isa (electric charge, concept)
Isa (Municipality, administrative division)
Isa (watch , device)
3.2 Similarity Based Approach
The similarity-based part of our system uses the Persian or English synonyms of a given word to find its related words; it is based on the similarities among the synonyms' contexts. The input of this part is a Persian word, and as output it returns a set of words that are candidates to be related to the given word. The relation extraction resource can be WordNet, a Persian corpus, Wikipedia or another resource, according to the application. The system finds the parts of the resource which are related to each synonym; the intersection of these parts leads us to new relations. The process of finding the related parts of the resource and finding the intersection is defined according to the type of the resource and the task at hand. In the rest of this section we present three similarity-based methods, using three different resources, to extract taxonomic or non-taxonomic relations. In these methods (apart from the one using WordNet) the types of the extracted relations are not determined, and the system just extracts related words. The type (label) of these relations can be found by using the pattern-based method (see section 3.3)
or by doing linguistic analysis. In this section we show the application of this approach to the different learning resources.
3.2.1 WordNet
In this part the system uses WordNet relations as a guide for extracting Persian relations. The idea is to find WordNet synsets whose meaning is close to that of the Persian word, and then to translate the relations of these synsets into Persian. The problem is how to find these WordNet synsets, and it is solved by the similarity-based method. Although in this section we talk about extracting taxonomic relations, the method can be adapted to any other WordNet relation. To extract relations with WordNet as a resource, the system follows these steps:
1. Find the English equivalents (translations) of the Persian word using a bilingual
dictionary.
2. Find the related WordNet synsets for each English equivalent.
3. Select the synset(s) with the greatest number of words in common with all the English equivalents.
4. Find the hypernym synsets of the selected synset(s) in step 3.
5. Translate the words in hypernym synset(s) to Persian.
6. Make new relations between the given Persian word and each translation from step
5.
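Step 3 can be sketched as below over a hand-coded fragment of the synsets in table 3 (a real implementation would query WordNet itself, e.g. through NLTK); the synsets and equivalents shown are only a subset:

```python
# Each synset is modelled as the set of its member words.
SYNSETS = [
    {"teaching method", "pedagogics", "pedagogy"},
    {"teaching", "instruction", "pedagogy"},
    {"education", "instruction", "teaching", "pedagogy", "didactics",
     "educational activity"},
    {"learning", "acquisition"},
    {"tuition", "tuition fee"},
]

def select_synsets(equivalents, synsets=SYNSETS):
    # Step 3: keep the synset(s) sharing the most words (the "common
    # number") with the English equivalents of the Persian word.
    eq = set(equivalents)
    best = max(len(s & eq) for s in synsets)
    return [s for s in synsets if len(s & eq) == best], best

EQUIVALENTS = ["pedagogy", "pedagogics", "learning", "instruction",
               "tuition", "study", "teaching", "schooling", "educating",
               "education"]
chosen, common = select_synsets(EQUIVALENTS)
```

For the equivalents of "Amuzesh", this picks the (education, instruction, teaching, pedagogy, didactics, educational activity) synset with a common number of 4, matching the worked example below.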
It is worth mentioning that zero or more translations may be found for each English word or phrase in step 5. In this method the system uses a Persian-to-English dictionary which gives English equivalents for Persian synsets. The structure of the dictionary is as follows:

Persian Word | Sense Number | Persian Synonyms | English Synonyms

As we did not have any English-to-Persian dictionary with the same structure, we decided to use this dictionary in reverse to translate English words to Persian. Since the English words are not annotated with sense numbers, the reverse dictionary's structure is as follows:

English Word | Persian Word | Persian Sense Number | Persian Synonyms
The fact that we cannot distinguish different senses of an English word causes some problems in translating the hypernym synset(s) to Persian: the system does not know which translations correspond to the word's sense indicated in the WordNet synset. So we need a mechanism for finding the correct translation of the English words. To solve this problem and to increase the precision, we used a voting strategy. In this strategy all the words in the hypernym synset are searched in three resources to find their translations: the bilingual dictionary, Wikipedia [15] and Wiktionary [16], a wiki-based dictionary. Then, among all the retrieved Persian equivalents for all the synset words, we choose the equivalent(s) with the greatest frequency. The translations of English words can be found directly in the bilingual dictionary and in Wiktionary; to find a translation via Wikipedia, the word is searched in the English Wikipedia and we check whether the retrieved article links to a corresponding Persian article. If such a link exists, the title of the Persian Wikipedia article is the translation of the original English word.
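The voting step can be sketched like this, with the three resource lookups stubbed out as plain candidate lists ("T1"/"T2" are placeholder translations, since the lookup machinery itself is not shown):

```python
from collections import Counter

def vote(candidate_lists):
    # Pool the Persian candidates returned by the bilingual dictionary,
    # Wikipedia inter-language links and Wiktionary, then keep the
    # candidate(s) retrieved most often across all lookups.
    counts = Counter()
    for candidates in candidate_lists:
        counts.update(candidates)
    top = max(counts.values())
    return [w for w, c in counts.items() if c == top]
```

If two resources return "T1" and one of them also returns "T2", `vote([["T1", "T2"], ["T1"], []])` keeps only "T1".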
To further clarify this method, we follow what the system does for the first sense of the Persian word "آموزش" ("Amuzesh"). According to the bilingual dictionary, the English equivalents of this word are: pedagogy, pedagogics, learning, instruction, tuition, study, teaching, schooling, educating and education. The related synsets of these words are shown in table 3.
Table 3. English synsets for the English synonyms of the word "Amuzesh"

pedagogy: (teaching method, pedagogics, pedagogy) || (teaching, instruction, pedagogy) || (education, instruction, teaching, pedagogy, didactics, educational activity) (common numbers: 2, 3, 4)
pedagogics: (teaching method, pedagogics, pedagogy) (common number: 2)
learning: (learning, acquisition) || (eruditeness, erudition, learnedness, learning, scholarship, encyclopedism, encyclopaedism) (common numbers: 1, 2)
instruction: (direction, instruction) || (education, instruction, teaching, pedagogy, didactics, educational activity) || (teaching, instruction, pedagogy) || (instruction, command, statement, program line) (common numbers: 1, 4, 3, 1)
tuition: (tuition, tuition fee) || (tutelage, tuition, tutorship) (common numbers: 1, 1)
study: (survey, study) || (study, work) || (report, study, written report) || (study) || (study) || (discipline, subject, subject area, subject field, field, field of study, study, bailiwick, branch of knowledge) || (sketch, study) || (cogitation, study) || (study) (common numbers: 1, 1, 1, 1, 1, 1, 1, 1, 1)
teaching: (teaching, instruction, pedagogy) || (teaching, precept, commandment) || (education, instruction, teaching, pedagogy, didactics, educational activity) (common numbers: 3, 1, 4)
schooling: (schooling) || (school, schooling) || (schooling) (common numbers: 1, 1, 1)
educating: no relevant synsets found
education: (education, instruction, teaching, pedagogy, didactics, educational activity) || (education) || (education) || (education) || (education, training, breeding) || (Department of Education, Education Department, Education) (common numbers: 4, 1, 1, 1, 1, 1)
The system should select the synset(s) which have the most words in common with all the English equivalents, i.e. we find the intersection of each synset with the set of English equivalents and choose the synset with the largest intersection. As a result we reach the English synset most similar to our Persian word. The number of words in the intersection of each synset (which is a measure of similarity and is called the "common number") is indicated for our example in table 3; we can see that the synset (education, instruction, teaching, pedagogy, didactics, educational activity) has the largest intersection, with 4 common words, so this synset is selected as our target synset in WordNet.
When the target English synset is found, we start mapping its relations to Persian. As mentioned before, for now we use the hypernymy relation to find taxonomic relations for the given Persian word. So we find the hypernym synset(s) of the selected synset, which is: (activity). Now we translate all the words in the hypernym synset to Persian, and as a result we obtain hypernymy relations between the Persian word "آموزش" ("Amuzesh") and the translated words. In our example the relation isa(آموزش, فعالیت), which means isa(instruction, activity), is extracted.
In some cases no synset is selected in step 3; this occurs when each synset covers only one word among the English equivalents. In these cases the system follows another strategy. In this alternative strategy, steps 1 and 2 are the same as before, and the strategy continues with the following steps:
3. Find the hypernym(s) of all of the synsets retrieved in step 2.
4. Select the hypernym synset(s) with the highest frequency among the hypernyms found in step 3.
5. Translate the words in the selected synset(s) of step 4.
6. Make new relations between the given Persian word and each translation from step 5.
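Steps 3 and 4 of this fallback amount to picking the most frequent hypernym; a minimal sketch, with hypernym synsets represented by plain identifier strings of our choosing:

```python
from collections import Counter

def most_frequent_hypernyms(hypernyms_per_synset):
    # Steps 3-4: count every hypernym over all retrieved synsets and
    # keep the hypernym synset(s) that occur most often.
    counts = Counter(h for hs in hypernyms_per_synset for h in hs)
    top = max(counts.values())
    return {h for h, c in counts.items() if c == top}
```

For example, `most_frequent_hypernyms([["activity"], ["activity", "act"], ["cognition"]])` keeps only "activity".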
3.2.2 Wikipedia
Another resource used in extracting conceptual relations via the similarity-based method is Wikipedia. Throughout Wikipedia articles, each important word which itself has an article in Wikipedia is linked to its related article. These linked words, especially the ones located in the first section of the text, are usually related to the title of the document. We use this fact to extract some taxonomic and non-taxonomic relations. This method can also be categorized under the structure-based approach. The system follows these steps to extract relations from Wikipedia by this method:
1. Find the Persian synonyms of the given word.
2. Find the Wikipedia articles related to the given word and all its synonyms.
3. Extract the hyperlinked words of each Wikipedia article of step 2.
4. Find the common hyperlinked words in all extracted Wikipedia documents.
5. Make new relations between the given word and the words found in step 4.
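With the per-article link sets already extracted (steps 2-3), steps 4-5 reduce to a set intersection; the link sets below are invented for illustration:

```python
def relations_from_links(word, link_sets):
    # Step 4: hyperlinked words common to every article; step 5: turn
    # each of them into an (as yet unlabeled) relation with the word.
    common = set.intersection(*[set(ls) for ls in link_sets])
    return [("related", word, w) for w in sorted(common)]
```

For example, if the articles of "instruction" and of one synonym link to {"school", "teacher"} and {"school", "university"} respectively, only ("related", "instruction", "school") survives.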
The translations of some of the relations extracted by this method are shown in table 4.
Table 4. Translations of some relations extracted from Wikipedia using the similarity-based approach

Input Word | Related Word
Instruction | School
Child | Human
Life | Death
Life | Birth
Calculus | Math
Child | Son
3.2.3 Corpus
The last resource used in the similarity-based part is a corpus. In this system a general-domain corpus named Peykareh [10] is used, a collection gathered from the Ettela'at and Hamshahri newspapers of the years 1999 and 2000, dissertations, books, magazines and weblogs. This method, which can be classified as a statistical method, consists of the following steps:
1. Finding the Persian synonyms of the given word.
2. Finding the words which co-occur with the given word and its synonyms
(separately) and their co-occurrence frequencies.
3. Selecting the words among the results of step 2 with a frequency above a given threshold.
4. Finding the words of step 3 which co-occur with the given word and all its
synonyms.
5. Making new relations between the words extracted in step 4 and the given word.
The threshold used in the third step serves to increase the precision of the extracted relations. The test results showed that 8% of the total frequency of the word (the input word or any of its synonyms) is a proper threshold. Some of the relations extracted by this method are presented in table 5.
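A toy sketch of this corpus method, counting sentence-level co-occurrence; the paper does not fix the co-occurrence window, so the sentence window and the sample sentences are our assumptions:

```python
from collections import Counter

def cooccurring(target, sentences, ratio=0.08):
    # Step 2: count words co-occurring with the target; step 3: keep
    # those above 8% of the target's total frequency.
    counts, freq = Counter(), 0
    for sent in sentences:
        words = sent.split()
        if target in words:
            freq += 1
            counts.update(w for w in words if w != target)
    return {w for w, c in counts.items() if c > ratio * freq}

def related_words(word, synonyms, sentences):
    # Step 4: keep only candidates shared by the word and all synonyms.
    sets = [cooccurring(t, sentences) for t in [word] + synonyms]
    return set.intersection(*sets)
```

With the sentences ["football team wins", "soccer team plays", "football fans cheer"], treating "soccer" as a synonym of "football", only "team" survives the intersection.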
Table 5. Translations of some relations extracted from the corpus using the similarity-based approach

Input Word | Related Word
Football | Team
Why | Reason
Scene | Movie
Politics | Government
Man | Woman
Actor | Movie
Success | Failure
Increase | Decrease
Death | Life

3.3 Pattern Based Approach
In this section we describe the pattern-based part of our relation learning system. The system exploits a pattern-based approach to extract both taxonomic and non-taxonomic relations from Persian texts. To extract taxonomic relations we defined a set of 36 patterns containing adaptations of Hearst patterns for Persian and some other new patterns. We have also extracted some patterns for well-known non-taxonomic relations such as "part of", "has part", "member of" and "synonymy". The translations of some of these patterns are shown in table 6 (TW stands for target word).
In this system the pattern matching method is used in two modes. In the first mode the system is given a pair of related words, and the target is to find the type of relation between them. These related pairs are obtained using the structure-based or similarity-based methods described above. In this case the two words are substituted into each pattern, and the instantiated patterns are searched in the corpus or on the web to find their occurrences, i.e. both TW and X in the patterns are given. By following this method the system is able to detect the type of a relation if it is a taxonomic relation or a non-taxonomic one for which we have a template.
Table 6. Translation of some patterns for extracting relations

Pattern | Relation
TW is X. | Hypernymy
TW is a X | Hypernymy
TW is considered as X | Hypernymy
TW is known as X | Hypernymy
TW is called X | Hypernymy
TW is named as X | Hypernymy
TW is a part of X | Part of
TW includes X | Has part
TW means X | Definition
TW is defined as X | Definition
TW1 or TW2 or … are | Synonymy
TW has X | Has
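The first mode (both words known) amounts to instantiating each template and searching for it in the text; the sketch below uses English stand-ins for the Persian patterns of table 6, and the sample sentence is invented:

```python
# A few English paraphrases of the table 6 patterns, as (template,
# relation) pairs; the real system uses the Persian patterns.
PATTERNS = [
    ("{tw} is a {x}", "Hypernymy"),
    ("{tw} is a part of {x}", "Part of"),
    ("{tw} means {x}", "Definition"),
]

def relation_type(tw, x, text):
    # Mode 1: substitute both known words into every template and report
    # the relation of the first instantiated template found in the text.
    for template, relation in PATTERNS:
        if template.format(tw=tw, x=x) in text:
            return relation
    return None
```

For example, `relation_type("Iran", "country", "we know that Iran is a country in Asia")` returns "Hypernymy".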
In the second mode the system is given just one word, for which it should find some relations. In this case the X part of the patterns is known, and we should find TW by matching the patterns over text. As occurrences of these patterns can hardly be found in the corpus, and since large corpora are not available for Persian, we decided to use Wikipedia in the process of relation extraction. In the first section of Wikipedia articles we can usually find some occurrences of our patterns. To start the pattern matching phase, we extracted the 1000 most frequent Persian nouns and retrieved the Wikipedia articles related to these words. For each word, the related article is searched for phrases matching any of the patterns. The translations of some of the relations extracted by this method are given in table 7.
Table 7. Translation of some extracted relations
Isa (Iran, country)
Isa (newspaper, publication)
Isa (water, liquid)
Isa (man, human)
Isa (pen, tool)
Has part (personality, specificity)
Has part (Tehran, Tajrish)
Has (Greece, history)
Has (Iraq, source)
Synonym (thought, idea)
It should be mentioned that these patterns work well for simple phrases; when complex phrases containing several syntactic groups are encountered, the precision of the method decreases. Avoiding this problem requires text processing tools (e.g. a chunker) to find the constituents of sentences. As there is no efficient chunker for Persian, we applied some post-processing to eliminate incorrectly extracted relations. This phase includes eliminating stop words and applying heuristics such as
matching the head of the first noun phrase in the sentence with the head of the extracted TW in copular sentences, eliminating prepositional phrases for taxonomic relations, and replacing long phrases with their heads.
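The post-processing phase described above can be sketched as follows. The stop-word and preposition lists are toy examples, and the rule that the head of a phrase is its last content word is our simplifying assumption, not the paper's exact heuristic.

```python
# Toy lists; a real system would use full Persian stop-word and
# preposition inventories.
STOP_WORDS = {"the", "a", "an", "of", "in"}
PREPOSITIONS = {"of", "in", "on", "with"}

def clean_argument(phrase):
    """Remove stop words, then replace the phrase with its head,
    here assumed to be the last remaining word."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    return words[-1] if words else ""

def keep_taxonomic(relation, argument):
    """Reject taxonomic relations whose argument is a prepositional phrase."""
    if relation == "Isa" and argument.split()[0].lower() in PREPOSITIONS:
        return False
    return True

print(clean_argument("the largest city"))     # city
print(keep_taxonomic("Isa", "in the north"))  # False
```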
4
Experimental Results
The proposed methods were tested separately and the results are presented in this section. As mentioned in section 3.3, the second mode of the pattern based approach was tested on the 1000 most frequent nouns of Persian. The extracted relations were mainly taxonomic; the number of non-taxonomic relations was much smaller. Since the given words were not domain specific and no reference ontology was available, human evaluation was used to evaluate our system. The results of the pattern based part were evaluated by an ontology expert and a precision of 76% was obtained. Most of the errors in this part stem from the lack of efficient linguistic tools, such as a chunker, for Persian. Although the refinement strategies described in section 3.3 were applied, further processing is still needed to refine the extracted relations. Despite the unavailability of linguistic tools, our system has a precision comparable with that of English pattern matching systems such as [2] and [4].
Test results for the structure based approach show a precision of 55% in extracting relations from bullet structures and 74% in relation extraction via disambiguation pages. The problems mentioned above are also present in the structure based methods, but a great portion of them is handled by following the same rules described in section 3.3. For this method, as for the pattern based approach, we had no efficient way to count the total number of relations in the input resource in order to calculate the recall of our system.
The similarity based approach was tested separately for each resource used. Again we used human evaluation; according to our ontology expert, the similarity based approach using WordNet has a precision of 73% and a recall of 54%. The similarity based method which uses Wikipedia has a precision of 76% according to human evaluation. This method has high precision, but, as in the other proposed methods using Wikipedia, the number of extracted relations was low. This fact has two major reasons. The first is that the Persian Wikipedia is sparse and there are many words for which no Wikipedia article has been created: only about 35% of the selected words had corresponding articles in Wikipedia. The second is that the most frequent Persian common nouns used in the test are mostly abstract nouns, while Persian Wikipedia articles are usually about proper or concrete nouns.
The final test was performed on the output relations of the similarity based approach which uses a corpus as input. This method was tested on the 300 most frequent common nouns of Persian and the results were verified by three ontology experts. As mentioned in section 3.2.3, to increase the precision of the extracted relations we applied a threshold and accepted only the co-occurrence relations with a frequency above it. To find the proper threshold we performed an initial test which showed that this threshold should be between 5% and 10%. We then tested our system with different thresholds between these boundaries and asked our experts to mark the extracted relations as "Acceptable" or "Unacceptable" for inclusion in a domain independent ontology. As we had no means to calculate recall, we compared the different results with respect to their precision and the number of correctly extracted relations.
The test results are shown in figure 1, in which the horizontal axis shows precision and the vertical axis indicates the number of correctly extracted relations.
Fig. 1. Test results for different thresholds
As can be seen in figure 1, the precision increases as the threshold is raised, while the number of correct relations decreases. To obtain a reasonable precision without missing many correct relations, we set the threshold to 8%.
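The thresholding step can be sketched as follows, with invented co-occurrence frequencies; the function name and the dictionary representation are assumptions for illustration.

```python
def filter_by_threshold(cooccurrences, threshold):
    """cooccurrences maps a word pair to its relative co-occurrence
    frequency; keep only the pairs above the threshold."""
    return {pair: f for pair, f in cooccurrences.items() if f > threshold}

# Invented frequencies, purely for illustration.
cooc = {("thought", "idea"): 0.12, ("water", "glass"): 0.06, ("pen", "sky"): 0.01}

# Compare candidate thresholds between 5% and 10%, as in the paper's
# tuning procedure: higher thresholds raise precision but keep fewer
# candidate relations.
for t in (0.05, 0.08, 0.10):
    kept = filter_by_threshold(cooc, t)
    print(t, sorted(kept))
```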
5
Conclusion and Further Work
In this paper a hybrid approach was presented for extracting conceptual relations from Persian resources, especially from the web. The approach combines pattern based, structure based, similarity based and statistical methods, enriched with linguistic heuristics. The system, one of the few works on ontology learning for Persian, uses different methods and resources so as to exploit their advantages and to cover the disadvantages of each method with the others. The proposed approach is able to extract a noticeable number of relations and is used in the process of building the Persian WordNet.
To increase the precision of the extracted relations, more linguistic heuristics could be applied. Extracting more patterns for taxonomic relations and covering more non-taxonomic relations are further directions for future work. More sophisticated methods could be used to find the constituents of Persian sentences; in this way more relations, especially non-taxonomic ones, could be extracted from text, and the precision of the pattern based method would also increase. Finally, more advanced ways of determining the types of relations via web search and linguistic analysis could be explored.
References
1. Hearst, M.: Automatic acquisition of hyponyms from large text corpora. In: 14th International Conference on Computational Linguistics, pp. 539--545 (1992).
2. Cimiano, P., Pivk, A., Schmidt-Thieme, L., Staab, S.: Learning Taxonomic Relations from Heterogeneous Sources of Evidence. In: Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press (2005).
3. Shamsfard, M., Barforoush, A.: Learning Ontologies from Natural Language Texts. International Journal of Human-Computer Studies 60, pp. 17--63 (2004).
4. Sanchez, D., Moreno, A.: Discovering non-taxonomic relations from the Web. In: LNCS, vol. 4224, pp. 629--636. Springer, Heidelberg (2006).
5. Ruiz-Casado, M., Alfonseca, E., Okumura, M., Castells, P.: Information Extraction and Semantic Annotation of Wikipedia. In: Ontology Learning and Population: Bridging the Gap Between Text and Knowledge. IOS Press (2008).
6. Reinberger, M., Spyns, P.: Unsupervised Text Mining for the Learning of DOGMA-Inspired Ontologies. In: Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press (2005).
7. Ryu, P., Choi, K.: An Information-Theoretic Approach to Taxonomy Extraction for Ontology Learning. In: Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press (2005).
8. Suchanek, F. M., Ifrim, G., Weikum, G.: Combining linguistic and statistical analysis to extract relations from web documents. In: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA (2006).
9. Ciaramita, M., Gangemi, A., Ratsch, E., Saric, J., Rojas, I.: Unsupervised Learning of Semantic Relations for Molecular Biology Ontologies. In: Ontology Learning and Population: Bridging the Gap Between Text and Knowledge. IOS Press (2008).
10. Bijankhan, M.: Role of language corpora in writing grammar: introducing a computer software. Iranian Journal of Linguistics 38, pp. 38--67 (2004).
11. Sanchez, D., Moreno, A.: Automatic Generation of Taxonomies from the WWW. In: 5th International Conference on Practical Aspects of Knowledge Management. LNAI, vol. 3336, pp. 208--219 (2004).
12. Cimiano, P., Staab, S.: Learning by googling. ACM SIGKDD Explorations Newsletter 6(2), pp. 24--33 (2004).
13. Herbelot, A., Copestake, A.: Acquiring Ontological Relationships from Wikipedia Using RMRS. In: Web Content Mining with Human Language Technologies Workshop, USA (2006).
14. Hazman, M., El-Beltagy, S.R., Rafea, A.: Ontology learning from textual web documents. In: 6th International Conference on Informatics and Systems, NLP, pp. 113--120, Giza, Egypt (2008).
15. Persian Wikipedia, the free encyclopedia, http://fa.wikipedia.org
16. Persian Wiktionary, the free dictionary, http://fa.wiktionary.org
Towards a General Model of Answer Typing:
Question Focus Identification
Razvan Bunescu and Yunfeng Huang
School of EECS
Ohio University
Athens, OH 45701
bunescu@ohio.edu, yh324906@ohio.edu
Abstract. We analyze the utility of question focus identification for answer typing models in question answering, and propose a comprehensive
definition of question focus based on a relation of coreference with the
answer. Equipped with the new definition, we annotate a dataset of 2000
questions with focus information, and design two initial approaches to
question focus identification: one that uses expert rules, and one that is
trained on the annotated dataset. Empirical evaluation of the two approaches shows that focus identification using the new definition can be
done with high accuracy, holding the promise of more accurate answer
typing models.
1
Introduction and Motivation
Open domain Question Answering (QA) is one of the most complex and challenging tasks in natural language processing. While building on ideas from Information Retrieval (IR), question answering is generally seen as a more difficult
task due to constraints on both the input representation (natural language questions vs. keyword-based queries) and the form of the output (focused answers
vs. entire documents). A common approach to the corresponding increased complexity has been to decompose the QA task into a pipeline of quasi-independent
tasks, such as question analysis, document retrieval, and answer extraction. As
part of question analysis, most QA systems determine the answer type, i.e. the
class of the object, or rhetorical type of discourse, sought by the question [1]. For
example, the question Q1 :Who discovered electricity? is looking for the name of
a Human entity, whereas Q2 :What are liver enzymes? asks for a Definition
type of discourse. The corresponding answer types will therefore be Human, and
Definition respectively. Knowledge of the answer type associated with a given
question can help during the answer extraction stage, when the system can use
it to filter out a wide range of candidates. Moreover, the answer type may determine the strategy used for extracting the correct answer. The Human answer
type for question Q1 means that the answer is simply the name of a person,
possibly identified using a named entity recognizer. A Definition question like
Q2 , on the other hand, may involve strategies that identify paragraphs with definition structures focused on the question topic (liver enzymes), or more complex
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 231-242
Received 23/11/09
Accepted 16/01/10
Final version 11/03/10
strategies in which sentences on the question topic from multiple documents are
automatically assembled into an answer paragraph that is given the rhetorical
structure of a definition.
Most previous approaches to answer typing employ a predefined set of answer types, and use classifiers or manually crafted rules to assign answer types to
questions. For example, [2] use a maximum entropy classifier to map each question into a predefined set of categories that contains all the MUC types, plus
two additional categories: Reason for capturing WHY questions, and Phrase as
a catch-all category. Realizing the benefit of using more fine-grained categories,
Li and Roth have proposed in [3] a more comprehensive set of answer types in
the form of a two-level hierarchy, with a first level of 6 coarse classes that are
further split into 50 fine classes on the second level. As pointed out by Pinchak
and Lin in [4], using a predefined set of categories presents two major drawbacks:
1. There will always be questions whose answer types do not match any of the
predefined categories, e.g. What are the names of the tourist attractions in Reims. Many question analysis systems employ a special catch-all category
for these cases, which leads to a less effective treatment compared with the
other categories.
2. The predetermined granularity of the categories leads to a trade-off between
how well they match actual answer types and how easy it is to build taggers
and classifiers for them. Thus, while it is relatively easy to tag names as
instances of People, this category is not a perfect fit for the question Which
former actor became president of the United States?. Conversely, while an
Actor answer type would be a better fit for this question, the corresponding
tagging task during answer extraction will be more difficult.
As a solution to these problems, Pinchak and Lin introduced in [4] a probabilistic answer type model that directly computes the degree of fitness between a
potential answer and the question context, effectively obviating the need for a
predefined set of answer types. Their follow-up work in [5] presents an alternative approach to answer typing based on discriminative preference ranking. Like
the probabilistic model in [4], the new flexible approach works without explicit
answer types, and is shown to obtain improved results on a set of “focused” WHAT
and WHICH questions.
Irrespective of whether they use an explicit set of answer types or not, many
answer typing models emphasize the importance of one particular part of the
question: the question focus, defined in [1] as “generally a compound noun phrase
but sometimes a simple noun, that is the property or entity being sought by the
question”. According to [1], the nouns city, population and color are the focus
nouns in the following questions:
Q3 McCarren Airport is located in what city?
Q4 What is the population of Japan?
Q5 What color is yak milk?
The focus of a question is generally seen as determining its answer type. Singhal
et al. [6], for example, use a lexicon that maps focus nouns to answer types.
They also give a syntactic rule for extracting the question focus from questions
of the type What X ..., What is the X ..., and Name the X ..., according to which
the focus is simply the syntactic head of the noun phrase X. Consequently, the
nouns company and city constitute the focus nouns in the following two example
questions taken from [6]:
Q6 What company is the largest Japanese builder?
Q7 What is the largest city in Germany?
The question focus is also important in approaches that do not use explicit
answer types. The models of Pinchak et al. from [4, 5] compute how appropriate
an arbitrary word is for answering a question by counting how many times
the word appears in question contexts, where a question context is defined as
a dependency tree path involving the wh-word. For a focused question such
as What city hosted the 1988 Winter Olympics?, the authors observed that a
question focus context such as X is a city is more important than the non-focus
context X host Olympics.
Motivated by the observed importance of the question focus to answer typing
models, in this paper we take a closer look at the associated problem of question
focus identification. We first give an operational definition (Section 2), followed
by a set of examples illustrating the various question categories that result from a
question focus analysis (Section 3). We describe a rule-based system (Section 4.2)
and a machine learning approach (Section 4.3) that automatically identify focus
words in input questions, and compare them empirically on a dataset of 2,000
manually annotated questions (Section 5). The paper ends with a discussion of
the results and ideas for future work.
2
What is the Question Focus?
To the best of our knowledge, all previous literature on answer typing assumes
that a question has at most one instance of question focus. The examples given so
far in questions Q1 through Q7 seem to validate this assumption. Accordingly,
the question focus in many questions can be extracted using simple syntactic
rules such as the noun phrase (NP) immediately following the WH-word (e.g.
Q3 , Q5 , Q6 ), or the predicative NP (e.g. Q4 , Q7 ). However, the uniqueness
assumption may be violated in the case of questions matching more than one
extraction rule, such as Q6 above, or Q8 and Q9 below:
Q8 Which Vietnamese terrorist is now a UN delegate in Doonesbury?
Q9 What famed London criminal court was once a feudal castle ?
One approach to enforcing uniqueness would be to rank the extraction rules
and consider only the extraction from the top most matching rule. It is unclear
though on which principles the rules would be ranked. Any relative ranking
between the NP immediately following the WH-word and the predicative NP
seems to be arbitrary, since questions Q8 and Q9 can be reformulated as questions
Q10 and Q11 below:
Fig. 1. Question focus example.
Q10 Which UN delegate in Doonesbury was once a Vietnamese terrorist?
Q11 What feudal castle is now a famed London criminal court?
In order to eschew these difficulties, we propose a definition of question focus
that covers both the NP immediately following the WH-word and the predicative
NP. For the definition to be as general as possible, it needs to take into account
the fact that a question focus, as observed in [1], can denote a property or entity
being sought by the answer. In question Q6 , for example, the focus word company
specifies a property of the answer. In question Q7 , on the other hand, the noun
phrase the largest city in Germany denotes the answer entity, while at the same
time its head noun city specifies a property of the answer. Without exception,
the noun phrases considered so far as potential instances of question focus have
one thing in common: they can all be considered to corefer with the answer. It is
this relation of coreference with the answer that allows us to give the following
simple, yet comprehensive and operational definition for question focus:
Definition 1. The question focus is the set of all maximal noun phrases in the
question that corefer with the answer.
Figure 1 shows the two noun phrases that are identified as question focus for
question Q8 . The term noun phrase in the definition refers only to phrases
marked as NP in the parse tree, and thus excludes wh-noun phrases (WHNP). A
noun phrase is defined to be maximal if it is not contained in another NP with
the same syntactic head.
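The maximality condition can be illustrated with a small sketch. The tuple encoding of parse trees and the head-finding rule below are simplifying assumptions of ours (a real system would use Penn Treebank parses and standard head rules), and the coreference condition of Definition 1 is not modeled here.

```python
# Trees are (label, children) tuples; a leaf has its word as children.
def leaves(tree):
    label, children = tree
    if isinstance(children, str):
        return [children]
    return [w for c in children for w in leaves(c)]

def head(tree):
    """Assumed head rule: head of the first NP child if any, else the
    rightmost leaf that is not inside a PP."""
    label, children = tree
    if isinstance(children, str):
        return children
    for c in children:
        if c[0] == "NP":
            return head(c)
    non_pp = [c for c in children if c[0] != "PP"]
    return leaves(non_pp[-1])[-1] if non_pp else leaves(tree)[-1]

def maximal_nps(tree, ancestor_heads=()):
    """An NP is maximal if no ancestor NP shares its syntactic head."""
    label, children = tree
    if isinstance(children, str):
        return []
    found = []
    if label == "NP" and head(tree) not in ancestor_heads:
        found.append(" ".join(leaves(tree)))
        ancestor_heads = ancestor_heads + (head(tree),)
    for c in children:
        found += maximal_nps(c, ancestor_heads)
    return found

# "a UN delegate in Doonesbury": the inner NP "a UN delegate" shares
# its head (delegate) with the outer NP, so it is not maximal.
np = ("NP", [("NP", [("DT", "a"), ("NNP", "UN"), ("NN", "delegate")]),
             ("PP", [("IN", "in"), ("NP", [("NNP", "Doonesbury")])])])
print(maximal_nps(np))
```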
In deriving the syntactic parse tree of a question, we use the same notation
and bracketing criteria as the Penn Treebank [7], with one exception: if the
WHNP contains more than a single wh-word, the rest of the phrase is abstracted
as an NP. This contrasts with the flat structure employed in the Penn Treebank,
and helps in simplifying the question focus definition.
According to the definition, a question may have one or more instances of
question focus. We believe that identifying more than one focus noun phrase can
be advantageous in answer extraction, when all question focus instances may be
used concurrently to filter out an even wider range of answer candidates. For
example, when searching for noun phrases as potential answers to question Q8 ,
a question answering system may choose to enforce the constraint that the noun
phrase refers to both a “Vietnamese terrorist” and a “UN delegate in Doonesbury”. An answer typing system that employs an explicit set of answer types
such as [3] may exploit the enriched question focus (e.g. {terrorist, delegate})
to improve the accuracy of mapping questions to answer types (e.g. Human).
Alternatively, the identification of multiple question focus instances may also
benefit approaches that do not use a predefined set of answer categories. The
answer typing methods of [4, 5], for example, may choose to give preference to
dependency paths that start at any focus head in the question.
There are no constraints on the type of noun phrases that may be considered
as question focus. Consequently, the focus may be a definite, indefinite, or bare
noun phrase, as well as a proper name, or a pronoun, as illustrated in questions
Q8 (repeated below), Q12 , and Q13 :
Q8 Which Vietnamese terrorist is now a UN delegate in Doonesbury?
Q12 What French seaport claims to be The Home of Wines?
Q13 Who was the first black performer to have his own network TV show?
The actual semantic constraints imposed on candidate answers by a question
focus vary from alternate names (e.g. The Home of Wines), to category information (e.g. seaport), to pronouns (e.g. his). Even a simple personal pronoun
such as his, when identified as a question focus, may trigger a useful elimination
of candidate noun phrases that do not refer to entities that are both Male and
Human.
3
Question Categories
When a question has at least one instance of question focus, as question Q14
below, the answer type can be determined from the focus. For questions such as
Q15 that lack an explicit question focus, the answer type is implicit in the wh-word if it is one of who, when, where, or why. Question Q16 is an example where
the answer type is both implicit in the wh-word, and explicit in the question
focus, albeit at different levels of granularity. Finally, there are questions such
as Q17 that do not contain any explicit question focus and where the wh-word
does not convey any information about the answer type – except maybe as a
negative implicature, e.g. since the question does not use the wh-word who, then
it is unlikely that the answer is of type Human.
Q14 What country do the Galapagos Islands belong to?
Q15 Who killed Gandhi?
Q16 Who was the inventor of silly putty?
Q17 What do bats eat?
The implicit answer type of how questions is Manner (e.g. question Q18 ),
unless the wh-word is followed by an adjective or an adverb, as in questions Q19
and Q20 below. A full treatment of these quantifiable how questions is beyond
the scope of this paper (Pinchak and Bergsma [8] have recently introduced an
answer typing strategy specifically designed for such cases).
Q18 How does a rainbow form?
Q19 How successful is aromatherapy?
Q20 How long is the Coney Island boardwalk?
Using coreference to define a question focus implies an identity relationship
between the question focus and the answer, which might not be as evident for
questions Q21 , Q22 , or Q23 below. There is nevertheless an implicit identity
relationship between the focus of these questions and their answers. Taking Q21
as an example, the answer is a text fragment with an appropriate rhetorical
structure that describes some conceptual structure X that IS the “nature of
learning”.
Q21 What is the nature of learning?
Q22 What is the history of skateboarding?
Q23 What is the definition of a cascade?
Q24 What is a cascade?
Definition questions have a special status in this category, as their answer type
can be expressed either explicitly through a question focus (Q23 ), or just implicitly (Q24 ).
4
Automatic Identification of Question Focus
Based on the definition from Section 2, a straightforward method for solving the
task of question focus identification would contain the following two steps:
1. Run coreference resolution on the question sentence.
2. Select the coreference chain that is grounded in the answer.
In this section we present a more direct approach to question focus identification, in which every word of the question is classified as either belonging to the question focus or not, leaving the coreference resolution based approach as a subject of future work.
4.1
Question Focus Dataset
In order to evaluate our word tagging approaches to question focus identification,
we selected the first 2000 questions from the answer type dataset of Li and
Roth [3], and for each question we manually annotated the syntactic heads of all
focus instances. Since, by definition, question focus identification is a constrained
version of coreference resolution, we used the annotation guidelines of the MUC
7 coreference task [9]. Three statistics of the resulting dataset are as follows:
– 1138 questions have at least one instance of question focus.
– 121 questions have two or more instances of question focus.
– 29 questions have a pronoun as one instance of the question focus.
All 29 questions that have a pronoun as a question focus also contain a non-pronominal NP focus. This property, together with the relatively low occurrence of pronouns, led us to design the initial extraction approaches to identify only non-pronominal instances of question focus.
4.2
A Rule Based Approach
In the first approach to question focus identification, we have manually created a
set of extraction rules that correspond to common patterns of focused questions.
The rules, together with a set of illustrative examples, are shown in Figure 2.
Given that the syntactic information is not always correct, we have decided
to associate each syntactic rule with an analogous rule in which some of the
syntactic constraints are approximated with part-of-speech constraints. In most
cases, this meant approximating the “head of an NP” with “the last word in a
maximal sequence of words tagged with one of {JJX, NNX, CD}”, where JJX
refers to any adjective tag, and NNX refers to any noun tag. There are in total
five syntactic rules R1 to R5 , together with their part-of-speech analogues R1′
to R5′ . A definite noun phrase is either a noun phrase starting with a definite
determiner, a possessive construction, or a proper name. Whenever the focus
is extracted as the head of a possessive construction in rules R2 and R2′ , we
modify the head extraction rules from [10] to output the “possessor” instead of
the “possessed” noun (e.g. output country as head of country’s president). We
also use two small lexicons: one for BE verbs such as {be, become, turn into},
and one for NAME verbs such as {name, nickname, call, dub, consider as, know
as, refer to as}.
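As an illustration, a part-of-speech analogue such as R2′ can be sketched as follows. The tag set and the pre-tagged input are simplifications: a real system would first run a POS tagger, and the function name is ours.

```python
# JJX/NNX expand to the concrete Penn Treebank adjective and noun tags.
FOCUS_TAGS = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "CD"}

def rule_r2_prime(tagged_question):
    """tagged_question: (word, tag) pairs starting with the wh-word.
    Return the last word of the maximal run of focus-taggable words
    immediately after the wh-word, or None if there is no such run."""
    run = []
    for word, tag in tagged_question[1:]:  # skip the wh-word itself
        if tag in FOCUS_TAGS:
            run.append(word)
        else:
            break  # the run must start immediately after the wh-word
    return run[-1] if run else None

q = [("What", "WP"), ("company", "NN"), ("is", "VBZ"),
     ("the", "DT"), ("largest", "JJS"), ("Japanese", "JJ"), ("builder", "NN")]
print(rule_r2_prime(q))  # company
```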
4.3
A Machine Learning Approach
The rules Ri′ in the rule based approach were constructed to complement the syntactic rules Ri by approximating noun phrase constraints with constraints on sequences of part-of-speech tags. While they often lead to the extraction of focus words that otherwise would have been missed due to parsing errors, the rules Ri′ are generally expected to obtain lower precision and recall (see also Section 5). Ideally, each rule would be associated with a weight, and a measure of confidence
1. If a question starts with the verb Name:
R1 = extract the head of the highest NP immediately after Name.
R1′ = extract the last word of the maximal sequence of {JJX, NNX, CD} immediately after Name.
Q : Name the scar-faced bounty hunter of The Old West.
2. If a question starts or ends with What/Which immediately followed by an NP:
R2 = extract the head of the highest NP immediately after the wh-word.
R2′ = extract the last word of the maximal sequence of {JJX, NNX, CD} immediately after the wh-word.
Q : What company is the largest Japanese builder?
Q : The corpus callosum is in what part of the body?
3. If a question starts with What/Which/Who immediately followed by a BE verb
and does not end with a preposition or a past participle verb:
R3 = extract the head of the definite highest NP after the BE verb.
R3′ = extract the last word of the maximal definite sequence of {JJX, NNX, CD}
after the BE verb.
Q : What company is the largest Japanese builder?
4. If a question starts with What/Which/Who, optionally followed by a non-possessive
NP, followed by a NAME verb in passive voice:
R4 = extract the head of the highest NP after the NAME verb.
R4′ = extract the last word of the maximal sequence of {DT, JJX, NNX, POS,
CD} after the NAME verb.
Q : What city is sometimes called Gotham?
5. If a question starts with What/Which/Who, followed by an interrogative pattern
of a NAME verb:
R5 = extract the head of the highest NP after the NAME verb.
R5′ = extract the last word of the maximal sequence of {DT, JJX, NNX, POS, CD} after the NAME verb.
Q : What author did photographer Yousuf Karsh call the shiest man I ever met?
Fig. 2. Focus Identification Rules and Example Questions.
would be computed for each candidate focus word based on the weights of the
rules used to extract it. Such a setting can be obtained by modeling the focus
identification task as a binary classification problem in which question words
are classified as either part of the question focus or not. Each rule from the rule
based approach would give rise to a binary feature whose value would be 1 only
for the word positions matched by the rule. The fact that one word may be
identified as focus by multiple rules will not be problematic for discriminative
learning methods such as Support Vector Machines (SVMs) [11] or Maximum
Entropy [12], which can deal efficiently with thousands of overlapping features.
For our machine learning approach to focus identification we chose to use SVMs,
Question-level features:
1. The question starts with a preposition followed by a wh-word.
Q : In what U.S. state was the first woman governor elected?
2. The question starts with Name.
Q : Name the scar-faced bounty hunter of The Old West.
3. The question starts with a wh-word followed by a BE verb.
4. The question starts with What/Which followed by a BE verb and a bare NP.
Q : What are liver enzymes?
5. The question starts with What/Which followed by a BE verb and ends with a past
participle verb, optionally followed by a preposition.
Q : What is a female rabbit called?
6. The question starts with a wh-word in an empty WHNP.
Q : What are liver enzymes?
7. The question starts with a wh-word followed by an NP.
Q : What company is the largest Japanese builder?
8. The question ends with a preposition.
Q : What are Cushman and Wakefield known for?
9. The first verb after the wh-word is not a BE verb.
Q : Who killed Gandhi?
10. The question starts with ⟨wh-word⟩: create a feature for each possible wh-word.
Word-level features:
1. Create a feature for each of the rules R1 , R1′ ... R5 , R5′ .
2. The head of the highest bare NP after the WHNP.
Q : What are liver enzymes?
3. The head of the highest definite NP after the WHNP.
Q : What company is the largest Japanese builder?
4. The head of the highest indefinite NP after the WHNP.
Q : What is a cascade?
5. The head of the highest NP nearest the wh-word.
6. The last word of the first maximal sequence of {JJX, NNX, CD} nearest the wh-word.
Q : What is considered the costliest disaster the insurance industry has ever faced?
7. The part-of-speech ⟨tag⟩ of the word: create a feature for each possible tag.
Fig. 3. Question-level and Word-level Features.
motivated by their capability to automatically induce implicit features as conjunctions of original features when run with polynomial or Gaussian kernels [13].
The explicit features employed in the SVM model are shown in Figure 3. Apart
from the rules Ri and their analogues Ri′ from Figure 2, the feature set also
contains more atomic features designed to capture simple constraints used in
the rules. Splitting rules into their more atomic features means that an isolated
error in one part of the parse tree would only affect some of the features, while
the rest of the features will still be relevant. Thus, while the rule as a whole may
lead to an incorrect extraction, some of its atomic features may still be used
240
Bunescu R., Huang Y.
to reach a correct decision. Furthermore, if the SVM model is coupled with a
polynomial kernel, then more complex parts of the original rules, when seen as
conjunctions of elementary features, would be considered as implicit features.
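To make the feature extraction concrete, a few of the question-level features from Figure 3 can be approximated with shallow pattern checks. This is only an illustrative sketch under our own simplifications: the actual features are computed over parse trees, and the word lists, patterns, and function name below are ours, not the paper's.

```python
import re

# Simplified word lists; the features in Figure 3 are defined over
# parse trees, not surface tokens.
WH_WORDS = {"what", "which", "who", "whom", "whose", "when", "where", "why", "how"}
BE_VERBS = {"is", "are", "was", "were", "be", "been", "being"}
PREPOSITIONS = {"for", "of", "in", "on", "to", "with", "about"}

def question_level_features(question):
    """Approximate a few of the question-level features with shallow checks."""
    tokens = re.findall(r"[\w'-]+", question.lower())
    if not tokens:
        return {}
    feats = {}
    # The question starts with a wh-word followed by a BE verb.
    feats["wh_then_be"] = (len(tokens) >= 2 and tokens[0] in WH_WORDS
                           and tokens[1] in BE_VERBS)
    # The question ends with a preposition.
    feats["ends_with_prep"] = tokens[-1] in PREPOSITIONS
    # The first verb after the wh-word is not a BE verb (crudely
    # approximated by inspecting the second token).
    feats["non_be_after_wh"] = (len(tokens) >= 2 and tokens[0] in WH_WORDS
                                and tokens[1] not in BE_VERBS)
    # One binary feature per possible wh-word.
    for w in WH_WORDS:
        feats["wh=" + w] = (tokens[0] == w)
    return feats
```

In the actual model, each such binary feature becomes one dimension of the SVM input vector, and the polynomial kernel supplies their conjunctions implicitly.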
5 Experimental Evaluation
We empirically evaluated the rule based approach and the learning approach
on the task of question focus identification using the 2000 manually labeled
questions. For the rule based approach, we evaluated 3 rule sets: the set of rules
R1 to R5 , their approximations R1′ to R5′ , and all the rules from R1 , R1′ to R5 ,
R5′ in a combined set. For the learning approach, we performed 10-fold cross-validation by splitting the dataset into 10 equally sized folds, training on 9 folds
and testing on the remaining 1 fold for 10 iterations. To compute the accuracy
and F1 measure, we pooled the results across all 10 folds. We used SVMLight1
with its default parameters and a quadratic kernel. Table 1 shows the precision,
recall, F1 measure, and accuracy for all four systems. The precision, recall and
F1 measures correspond to the task of focus word extraction and are therefore
computed at word level. Accuracy is computed at question level by considering
a question correctly classified if and only if the set of focus words found by the
system is exactly the same as the set of focus words in the annotation.
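The pooled 10-fold protocol above can be sketched as follows. This is a minimal illustration: `train_and_predict` is a hypothetical stand-in for the SVMLight train/classify step, and the interleaved fold split is our own choice, not necessarily the one used in the paper.

```python
def pooled_cross_validation(examples, train_and_predict, k=10):
    """Pool predictions across k folds, then compute word-level P/R/F1.

    `examples` is a list of (features, gold_label) pairs with boolean
    labels (True = focus word); `train_and_predict` takes (train, test)
    and returns one predicted label per test example.
    """
    folds = [examples[i::k] for i in range(k)]
    tp = fp = fn = 0
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        preds = train_and_predict(train, test)
        for (_, gold), pred in zip(test, preds):
            if pred and gold:
                tp += 1
            elif pred and not gold:
                fp += 1
            elif gold and not pred:
                fn += 1
    # Counts are pooled over all folds before computing the measures.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```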
Table 1. Experimental Results.

             Rule Based                             SVM
Measure      R1 to R5   R1′ to R5′   Combined   Quadratic
Precision    93.3%      92.5%        88.1%      95.2%
Recall       89.2%      71.6%        90.1%      91.3%
F1           91.2%      80.7%        89.1%      93.2%
Accuracy     91.1%      81.8%        88.3%      93.5%
As expected, adding the approximation rules to the original rules in the combined system helps by increasing the recall, but hurts the precision significantly.
Overall, the learning approach obtains the best performance across all measures,
showing that it can exploit useful combinations of overlapping rules and features.
Figure 4 shows graphically the precision vs. recall results for the four systems.
The curve for the SVM approach was obtained by varying a threshold on the
extraction confidence, which was defined to be equal with the distance to the
classification hyperplane, as computed by the SVMLight package.
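Sweeping the confidence threshold to trace such a curve can be sketched as below; the data layout (a list of `(confidence, is_gold_focus)` pairs, with the confidence being the signed distance to the hyperplane) is our own assumption.

```python
def precision_recall_points(scored, thresholds):
    """For each threshold, extract words whose confidence meets it and
    compute precision/recall against the gold focus annotations.

    `scored` is a list of (confidence, is_gold_focus) pairs.
    Returns a list of (threshold, precision, recall) triples.
    """
    points = []
    total_gold = sum(1 for _, gold in scored if gold)
    for t in thresholds:
        extracted = [(s, g) for s, g in scored if s >= t]
        tp = sum(1 for _, g in extracted if g)
        precision = tp / len(extracted) if extracted else 1.0
        recall = tp / total_gold if total_gold else 0.0
        points.append((t, precision, recall))
    return points
```

Raising the threshold trades recall for precision, which is exactly what the four curves in Figure 4 visualize.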
A significant number of errors made by the learning approach are caused by parsing or part-of-speech tagging errors. Other examples are misclassified because syntactic information alone is sometimes insufficient. For example, questions such as What is a film starring Jude Law? have
1 URL: http://svmlight.joachims.org
Towards a General Model of Answer Typing: Question Focus Identification
[Figure: precision (y-axis) vs. recall (x-axis) curves for the four systems: R1 to R5, R1′ to R5′, Combined, and SVM Quadratic.]
Fig. 4. Precision vs. Recall graphs.
the structure of an implicit definition question, yet they do contain an explicit
focus word. The opposite is also possible: for the implicit definition question
What is the “7-minute cigarette”? the system identifies cigarette as the focus word.
Semantic features that discriminate between proper names and titles may also
eliminate some of the errors. The system learns, for example, that words tagged
as NNP are unlikely to be focus words, which is why it fails to extract the focus
word for the question Who was President of Afghanistan in 1994?.
Recently, Mikhailian et al. [14] have proposed a Maximum Entropy approach
for identifying the question focus (the asking point in their terminology). However, they use the traditional, less comprehensive definition of question focus,
whereby a question can have at most one focus noun phrase. In order to empirically compare our learning approach with theirs, we created a second version of
the dataset in which the questions were annotated with at most one NP focus.
Using exact matching between system output and annotations, our SVM based
approach obtains a question level accuracy of 93.7%, which compares favorably
with their reported accuracy of 88.8%.
6 Future Work
We plan to augment the SVM model with semantic features, some of them
identified in Section 5, in order to further increase the accuracy. We also intend
to implement the alternative method mentioned at the beginning of Section 4, in
which the identification of question focus is done by classifying the coreference
chains extracted from the question as referring to the answer or not.
7 Conclusions
We proposed a comprehensive definition of question focus based on coreference
with the answer that eliminates inconsistencies from previous definitions. We designed both a rule based approach and a machine learning approach to question
focus identification, and evaluated them on a dataset of 2000 questions manually
annotated with focus information. Empirical evaluation of the two approaches
shows that focus identification using the new definition can be done with high
accuracy, offering the promise of more accurate answer typing models.
References
1. Prager, J.M.: Open-domain question-answering. Foundations and Trends in Information Retrieval 1 (2006) 91–231
2. Ittycheriah, A., Franz, M., Zhu, W.J., Ratnaparkhi, A., Mammone, R.J.: IBM’s
statistical question answering system. In: Proceedings of the Ninth Text REtrieval
Conference, National Institute of Standards and Technology (NIST) (2000)
3. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, Association for
Computational Linguistics (2002) 1–7
4. Pinchak, C., Lin, D.: A probabilistic answer type model. In: Proceedings of the
11th Conference of the European Chapter of the Association for Computational
Linguistics, Trento, Italy, The Association for Computer Linguistics (2006)
5. Pinchak, C., Lin, D., Rafiei, D.: Flexible answer typing with discriminative preference ranking. In: Proceedings of the 12th Conference of the European Chapter
of the Association for Computational Linguistics, Athens, Greece (2009) 666–674
6. Singhal, A., Abney, S., Bacchiani, M., Collins, M., Hindle, D., Pereira, F.: AT&T at TREC-8. In: Proceedings of the Eighth Text REtrieval Conference (TREC-8) (2000) 317–330
7. Bies, A., Ferguson, M., Katz, K., MacIntyre, R.: Bracketing Guidelines for Treebank II Style Penn Treebank Project. University of Pennsylvania. (1995)
8. Pinchak, C., Bergsma, S.: Automatic answer typing for how-questions. In: Human
Language Technologies 2007: The Conference of the North American Chapter of
the Association for Computational Linguistics, Rochester, New York, Association
for Computational Linguistics (2007) 516–523
9. DARPA, ed.: Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, Morgan Kaufmann (1998)
10. Collins, M.: Head-driven Statistical Models for Natural Language Parsing. PhD
thesis, University of Pennsylvania (1999)
11. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons (1998)
12. Berger, A.L., Della Pietra, S.A., Della Pietra, V.J.: A maximum entropy approach
to natural language processing. Computational Linguistics 22 (1996) 39–71
13. Schölkopf, B., Smola, A.J.: Learning with kernels - support vector machines, regularization, optimization and beyond. MIT Press, Cambridge, MA (2002)
14. Mikhailian, A., Dalmas, T., Pinchuk, R.: Learning foci for question answering
over topic maps. In: Proceedings of the ACL-IJCNLP 2009 Conference, Suntec,
Singapore, Association for Computational Linguistics (2009) 325–328
Traditional Rarámuri Songs
used by a Recommender System to a Web Radio
Alberto Ochoa-Zezzatti1 , Julio Ponce2, Arturo Hernández3, Sandra Bustillos1,
Francisco Ornelas2 & Consuelo Pequeño1
1 Instituto de Ciencias Sociales y Administración, Universidad Autónoma de Ciudad Juárez; México.
2 Aguascalientes University; Aguascalientes, México.
3 CIMAT; Guanajuato, México.
alberto.ochoa@uacj.mx, jcponce@correo.uaa.mx, artha@cimat.mx, fjornel@correo.uaa.mx
Abstract. This paper describes an Intelligent Web Radio associated with a Recommender System that uses a database of songs belonging to a kind of traditional music (Rarámuri songs). The system employs the Dublin Core metadata standard for document description and the XML standard for describing the user profile, and relies on the user's profile and on service and data providers to generate musical recommendations for a Web Radio. The main contribution of this work is to provide a recommendation mechanism, based on this Recommender System, that reduces the human effort spent on profile generation. In addition, this paper presents and discusses some experiments based on quantitative and qualitative evaluations.
Keywords: Recommender Systems, User Profile and Thematic Music
1 Introduction
Today, songs can be accessed electronically as soon as they are published on the Web. The main advantage of open music is the minimization of promotion time. In this context, Digital Libraries (DLs) have emerged as the main repositories of digital documents, links and associated metadata. A Recommender System involves personalized information. Personalization is related to the ways in which contents and services can be tailored to match the specific needs of a user or a community (the Rarámuri people) [3]. Specifying a human-centered demand is not an easy task; one experiences this difficulty when trying to find a new song even in a good indexing and retrieval system such as a Web Radio.
Query formulation is complex, and fine tuning of the user requirements is a time-consuming task. Few users have enough time to spend hours searching for new songs. Query specification can instead be achieved through the analysis of the user's activities, history, and information demands, among others. This paper presents a musical recommendation system for a Web Radio; the songs recovered are
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 243-252
Received 17/10/09
Accepted 16/01/10
Final version 10/03/10
associated with Traditional Rarámuri Songs. The main contribution of this work is to
provide a recommendation mechanism, based on the user profile, that reduces the human effort spent on profile generation. The paper is organized as follows. We start by giving an overview of the background literature and concepts, then present the recommender system and detail its architecture and techniques. Finally, we present some quantitative and qualitative experiments to evaluate and validate our system, and discuss the results and conclusions of our work.
2 Background
The semantic Web technologies promote an efficient and intelligent access to the
digital documents on the Web. The standards based on metadata to describe
information objects have two main advantages: computational efficiency during the
information harvesting process and interoperability among DLs. The first is a
consequence of the increasing use of Dublin Core (DC) metadata standard [8]; the
latter has been obtained as a result of the OAI initiative (Open Archive Initiative)
[17]. DC metadata standard was conceived with the objective of defining a minimal
metadata set that could be used to describe the available resources of a DL. This
standard defines a set of 15 metadata (Dublin Core Metadata Element Set – DCMES)
[8].
The main goal of OAI is to create a standard communication way, allowing DLs
around the world to interoperate as a federation [21]. The DL metadata harvesting
process is accomplished by the OAI-PMH protocol (Open Archives Initiative
Protocol for Metadata Harvesting) [18], which defines how metadata is transferred between the two entities, data and service providers. The data provider acts by searching the metadata in databases and making it available to a service provider, which uses the gathered data to provide specific services.
Considering that a Recommender System is concerned with information personalization, it is essential that it copes with the user profile. In our work, the user profile is obtained from a Web Radio similar to the one used in [13]. According to [11], there are three different methodologies used in Recommender Systems to perform recommendation: (i) content-based, which recommends items classified according to the user profile and early choices; (ii) collaborative filtering, which deals with similarities among users' interests; and (iii) the hybrid approach, which combines the two to take advantage of their benefits. In our work, the content-based approach is used, since the information about the user is taken from Web Radio users.
This recommendation process can be perceived as an information retrieval process,
in which user’s relevant songs should be retrieved and recommended. Thus, to
perform recommendations, we can use the classical information retrieval models such
as the Boolean Model, the Vector Space Model (VSM) or the Probabilistic Model [1,
9, 20]. In this work, the VSM was selected since it provides satisfactory results
with a convenient computational effort. In this model, songs and queries are
represented by terms vectors. The terms are words or expressions extracted from the
documents (lyrics) and from queries that can be used for content identification and
representation. Each term has a weight associated to it to provide distinctions among
them according to their importance. According to [19], the weight can vary continuously between 0 and 1: values near 1 are more important, while values near 0 are irrelevant.
The VSM uses an n-dimensional space to represent the terms, where n corresponds
to the number of distinct terms. For each document or query represented, the weights are the vector's coordinates in the corresponding dimensions. The VSM principle is based on the inverse correlation between the distance (angle) among term vectors in the space and the similarity between the songs they represent.
calculate the similarity score, the cosine (Equation 1) can be used. The resultant value
indicates the relevance degree between a query (Q) and a document (song) (D), where
w represents the weights of the terms contained in Q and D, and t represents the
number of terms (size of the vector). This equation provides ranked retrieval output
based on decreasing order of the ranked retrieval similarity values [19].
sim(Q, D) = ( Σ_{i=1..t} w_{Q,i} · w_{D,i} ) / ( √(Σ_{i=1..t} w_{Q,i}²) · √(Σ_{i=1..t} w_{D,i}²) )    (1)
The same equation is widely used to compare the similarity among songs; in our case, Q represents the user profile and D the document descriptors (lyrics) that are harvested from the DL (see Section 3.2 for details). The term weighting scheme is very important to guarantee an effective retrieval process.
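The cosine score of Equation 1 can be computed directly from the two weight vectors. A minimal sketch follows; the sparse dictionary representation ({term: weight}) is our own choice of data layout.

```python
import math

def cosine_similarity(q_weights, d_weights):
    """Cosine of the angle between a query vector Q (user profile) and a
    document vector D (song lyrics), given as {term: weight} dicts."""
    common = set(q_weights) & set(d_weights)
    dot = sum(q_weights[t] * d_weights[t] for t in common)
    norm_q = math.sqrt(sum(w * w for w in q_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in d_weights.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)
```

Scoring every song against the profile and sorting by this value yields the ranked retrieval output described in the text.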
The results depend crucially on the term weighting scheme chosen. In addition, the selection of query terms is fundamental to obtain a recommendation that matches the user's needs. Our research is focused on query term selection and weighting. Anyone with basic knowledge of the Rarámuri language who has attempted musical retrieval can appreciate the complexity of the process and the difficulty of finding adequate songs. The central idea is to develop an automated retrieval and musical recommendation system where the cost to the user is limited to submitting an already existing preferences query.
3 The Recommender System
Our system focuses on the recommendation of Traditional Rarámuri songs. The information source used to perform recommendations is the database associated with a Web Radio (the lyrics and music of each song), while the user profile is obtained from the Database Profile Register subset. However, any DL repository providing DC metadata and supporting the OAI-PMH protocol can be used as a source. An alternative to user profile generation is under development; this alternative approach consists of an information retrieval system that gathers data from other music sources. A DL repository stores digital songs or their location (web or physical), and the respective metadata. A DL data provider allows an agent to harvest document metadata through the OAI-PMH protocol. Our system handles songs described with XML in the DC standard [7, 15].
3.1 Recommendation System Architecture
In this section we present the architectural elements of our system and their functionalities (Fig. 1). To start the process, the users must supply their preferences to the system in XML. Whenever a user registers in the system and sends his or her preferences list (1), the XML Songs Document module is activated and the information about the user's interests is stored in the local database named User Profile (2). Then the Metadata Harvesting module is activated to update the local database Songs Metadata. This module makes a request to a DL data provider to harvest specific document metadata. It receives an XML document as response (3) and the XML DC to local DB module is activated (4). This module extracts from the XML document the metadata relevant to the recommendation and stores it in the local database named Songs Metadata (5). Once the user profile and the songs metadata are available in the local databases, the Recommendation module can be activated (6). The focus is to retrieve the lyrics and songs of a DL that best match the user profile of each user in the Web Radio.
Fig. 1. The recommender system architecture.
3.2 The Recommendation Model
As stated before, the recommendation is based on the VSM. The query vector is built with the terms parsed from the title, keywords, singer or band, and date. The parser ignores stop-words [5] (a list of common or general terms that are not used in the information retrieval process, e.g., prepositions, conjunctions and articles). The parser considers each term as a single word; keyword expressions, on the other hand, are taken integrally, as single expressions.
The query vector term weights are built according to Equation 2. This equation considers the type of term (keyword or title), the language, and the year of the first air date. Keyword terms are considered more important than song titles and receive a higher weight, and terms obtained from traditional Rarámuri songs are assigned a higher weight than translations into the Rarámuri language.
Wt = WKeywordOrTitle ∗ WLanguage ∗ WYear
(2)
The weights WKeywordOrTitle, WLanguage, WYear are calculated with Equation 3.
Wi = 1 − (i − 1) · (1 − Wmin) / (n − 1)    (3)
In this equation Wi varies according to the type of weight we want to compute. To illustrate, in the experimental evaluation (Section 4), for WKeywordOrTitle Wmin was 0.95, and i is 1 if the language-skill level is “good”, 2 for “reasonable” and 3 for “few”. For WYear, Wmin was 0.55 and i varies from 1 to n, where n is the number of years in the interval considered, 1 being the most recent and n the least recent. In the experimental evaluation a sample of songs between 2002 and 2008 was considered. However, if the interval is omitted, it is taken as the span between the present year and the least recent year (the smallest between artist:first-song and artist:last-song).
If Wmin is not informed, the default value of Equation 4 is used, in which case Equation 3 reduces to Equation 5.

Wmin,default = 1 / n    (4)

Wi = (n − i + 1) / n    (5)
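Equation 3 and its default-weight reduction can be implemented in a few lines. The sketch below is illustrative; the function name and the guard for n = 1 are our own additions.

```python
def term_weight(i, n, w_min=None):
    """Equation 3: W_i = 1 - (i - 1) * (1 - W_min) / (n - 1).

    `i` is the rank of the value (1 = highest weight), `n` the number of
    distinct values. When `w_min` is omitted it defaults to 1/n
    (Equation 4), and the formula reduces to (n - i + 1) / n (Equation 5).
    """
    if w_min is None:
        w_min = 1.0 / n                     # Equation 4 (default)
    if n == 1:
        return 1.0                          # single value: full weight
    return 1.0 - (i - 1) * (1.0 - w_min) / (n - 1)   # Equation 3
```

Note that substituting W_min = 1/n into Equation 3 gives 1 − (i − 1)/n = (n − i + 1)/n, which is exactly Equation 5, so the two forms are consistent.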
Once the query vector is built, the song vector terms and the respective weights must be defined. The adopted approach was (tf * idf), i.e., the product of the term
frequency and the inverse document frequency [19]. This approach allows automatic
term weights assignment for the songs retrieval. The term frequency (tf) corresponds
to the number of occurrences of a term in a document. The inverse document
frequency (idf) is a factor that varies inversely with the number of the songs n to
which a term is assigned in a collection of N songs (typically computed as log (N/n)).
The best terms for content identification are those able to distinguish individual documents from the remainder of the collection [19]. Thus, the best terms correspond to the ones with high term frequencies (tf) and low overall collection frequencies (high idf). To compute tf * idf, the system uses the DC metadata dc:title and dc:description to represent the songs' content. Moreover, as our system deals with the Rarámuri language and translations into the Rarámuri language, the total number of songs will vary accordingly. After building the query and song vectors, the system is able to compute the similarity values between the songs and the query according to Equation 1.
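The (tf * idf) weighting described above can be sketched as follows. The input format (a map from document id to its term list, e.g. parsed from dc:title and dc:description) is our own assumption; idf is computed as log(N/n) as stated in the text.

```python
import math

def tf_idf_vectors(docs):
    """Compute (tf * idf) term weights for each document in a collection.

    `docs` maps a document id to its list of terms. tf is the number of
    occurrences of a term in the document; idf = log(N / n), where N is
    the collection size and n the number of documents containing the term.
    """
    N = len(docs)
    # n: number of documents each term is assigned to
    doc_freq = {}
    for terms in docs.values():
        for t in set(terms):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    vectors = {}
    for doc_id, terms in docs.items():
        tf = {}
        for t in terms:
            tf[t] = tf.get(t, 0) + 1
        vectors[doc_id] = {t: tf[t] * math.log(N / doc_freq[t]) for t in tf}
    return vectors
```

A term occurring in every song gets idf = log(N/N) = 0 and thus no discriminative weight, which matches the intuition that the best terms combine high tf with low collection frequency.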
4 Experimental Evaluation
In order to evaluate the musical recommender system, we asked listeners of a Web Radio for their preferences. In response, a group of 57 people sent us their lists of preferences, and this information was loaded into the User Profile local database related to the Web Radio, while the Songs Metadata local database was loaded from the Web Radio repository. Up to August 2009 this database stored a total of 87 songs from 11 singers or bands across 7 albums.
Then 20 recommendations were generated by the system for each hour on the Web Radio, considering the individual profile and preferences of each user. This information was obtained from the users' database related to the Web Radio.
Two evaluations were performed. The first was based on the hypothesis that the
best songs to describe the profile of a user should be those produced by them. Since
we had information about the songs by each user, we can match the items
recommended to those. This evaluation was accomplished by the recall and precision
metrics, a standard evaluation strategy for information retrieval systems [1, 20].
The recall is used to measure the percentage of relevant songs retrieved in relation to
the amount that should have been retrieved. In the case of songs categorization, the
recall metric is used to measure the percentage of songs that are correctly classified in
relation to the number of songs that should be classified. Precision is used to measure
the percentage of songs correctly recovered, i.e., the number of songs correctly
retrieved divided by the number of songs retrieved.
As the profiles can be seen as classes and the songs as items to be classified in
these profiles, we can verify the amount of items from the singers that are correctly
identified (i.e. classified) by the user profile. As we have many users (i.e., many
classes), it is necessary to combine the results. The macroaverage presented in
Equation 6 was designed by D. Lewis [14] to perform this specific combination (“the
unweighted mean of effectiveness across all categories”), and was applied by him in
the evaluation of classification algorithms and techniques.
macroaverage = ( Σ_{i=1..n} Xi ) / n    (6)
In this formula, Xi is the recall or the precision, depending on the metric we want
to evaluate, of each individual class (user in our case) and n is the number of classes
(users). Thus, the macroaverage recall is the arithmetic average of the recalls obtained
for each individual, and the macroaverage precision is the arithmetic average of the
precisions obtained for each individual.
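Equation 6 is simply the unweighted mean of the per-user scores; a minimal sketch:

```python
def macroaverage(per_user_scores):
    """Equation 6: the unweighted mean of a metric (recall or precision)
    across all classes (users). Returns 0.0 for an empty list."""
    n = len(per_user_scores)
    return sum(per_user_scores) / n if n else 0.0
```

Passing the per-user recalls gives the macroaverage recall, and the per-user precisions the macroaverage precision, exactly as described above.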
Given that the users are not interested in their own preferred songs as recommendations, we performed another evaluation that takes into account only the items from other users. Then, 15 recommendations were presented to each individual, ranked by the relative grade of relevance generated by the system. In this ranking, the song with the highest grade of similarity with the user profile was set as 100% relevant and the others were adjusted to values relative to it. In this case, each user was requested to evaluate the recommendations generated for them by assigning one of the following concepts (following the bipolar five-point Likert scale): “Inadequate”, “Bad”, “Average”, “Good”, and “Excellent”, and was also asked to comment on the results. The following sections present the results obtained.
5 Analysis of Experiments
The first experiment was designed to evaluate the capability of the system to correctly identify the user profile (i.e., to represent the user's preferences), since we believe that the best songs to describe the user profile are those selected by the users themselves, as stated before. To perform this evaluation, we identified the songs each user had at the Web Radio. After that, we employed the recall metric to evaluate the number of songs recovered for each user and combined the results with the macroaverage equation explained before. We found a macroaverage recall of 43.25%. It is important to state that each user received 20 recommendations. This is an acceptable value, as the query construction was made automatically without human intervention. The recall would likely have been higher had more songs been available, possibly including translated songs, but the number of songs per singer or band is limited. Another important consideration is that the recommendation ranking was generated with a depreciation degree dependent on year and language, as explained in the previous section. As the time slice considered corresponds to the period stored in the database related with the Web Radio, not all songs are good recommendations, since preferences change over time, similar to what is proposed in [23].
[Figure: users’ evaluations across the concepts “Inadequate”, “Bad”, “Average”, “Good”, and “Excellent” (0%–50%), grouped by “first match”, “top 5”, “top 10”, and “top 15”.]
Fig. 2. Users’ evaluations of the recommendations.
Figure 2 presents the results of the second experiment, which was based on the users’
qualitative evaluation of the recommended songs. On this experiment each user
received 15 recommendations and evaluated them according to one of the following
concepts: “inadequate”, “bad”, “average”, “good”, and “excellent”. The results were
grouped into the categories “first match”, “top 5”, “top 10”, and “top 15”, and are
presented in Figure 2.
Analyzing these results, it is possible to observe that, if we only consider the first song recommended (the “first match”), the number of items qualified as “excellent” is greater than the others (i.e., 42.86%) and none were classified as “inadequate”. This confirms the capability of the system to produce recommendations adjusted to the user's current genre preferences. We have also grouped the concepts “good” and “excellent” into a category named “positive recommendation” and the concepts “bad” and “inadequate” into a “negative
recommendation” group, so we could obtain a better visualization and comprehension
of the results (Fig. 3).
[Figure: grouped evaluations (“Positive Recommendation”, “Average”, “Negative Recommendation”, 0%–60%) for the “first match”, “top 5”, “top 10”, and “top 15” groups.]
Fig. 3. Grouped users’ evaluation.
We can see that the positive recommendations, considering only the “first match”, are superior (57.14%) to the negative ones (7.14%). The same behavior can be perceived in the “top 5” and “top 10” categories; the recommendations had a negative evaluation only in the “top 15” category, and that probably happened because, as the number of recommendations grows, the number of correct recommendations falls. It is clear that the automated procedure adopted here is adequate for an alert recommender system. Our proposal is to add to the Web Radio an automated alert system that periodically sends to the user a list of the most relevant songs recently listened to on it during seven or more weeks.
Further, in our tests the users who had changed their search behavior in the last three months qualified the recommendations negatively. In the next experiments a variable time threshold and different depreciation values will be employed, and the temporal component will be exhaustively analyzed.
6 Conclusions
This paper presented a Recommender System for users of a Web Radio based on the lyrics of Rarámuri songs; this language has 87,500 speakers in Mexico (see Figure 4). Nowadays, when the recovery of relevant digital information on the Web is a complex task, such systems are of great value in minimizing the problems associated with the information overload phenomenon, reducing the time spent to access the right information.
The main contribution of this research consists in the heavy utilization of automated musical recommendation and in the use of Digital Library (DL) metadata to create the recommendations. The system was evaluated with BDBComp, but it is designed to work with the open digital library protocol OAI-PMH, so it may be easily extended to work with any DL that supports this mechanism. The same holds for the lyrics format related with the song, which can be extended to support
other formats or to analyze information about the user. Alternatively, the operational prototype offers the user the possibility of loading the lyrics via an electronic form.
Fig. 4. Distribution of Rarámuri Language in Chihuahua State, México
The developed system has many applications. One of them is the recommendation of songs to support a Web Radio related with Rarámuri music. Thus, the user could log into a specific theme and receive recommendations of songs containing up-to-date relevant material to complement his or her current musical selection.
References
1. Baeza-Yates, R.; Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley,
Workingham, UK (1999)
2. BDBComp: Biblioteca Digital Brasileira de Computação. Online at: http://www.lbd.dcc.ufmg.br/bdbcomp/ (last access: 2006-11)
3. Callahan, J. et al.: Personalization and Recommender Systems in Digital Libraries. Joint
NSF-EU DELOS Working Group Report, May (2003)
4. CITIDEL: Computing and Information Technology Interactive Digital Educational Library.
Institut interfacultaire d’informatique, University of Neuchatel. Online at:
http://www.unine.ch/info/clef/ (last access: 2005)
5. CLEF and Multilingual information retrieval. Institut interfacultaire d’informatique,
University of Neuchatel. http://www.unine.ch/info/clef/ (last access: 2005)
6. Contessa, D., Fraga F. and Palazzo A.: An OAI Data Provider for JEMS. Proceedings of
the ACM DocEng 2006 Conference, Amsterdam. Oct (2006) 218-220
7. DC-OAI: A XML schema for validating Unqualified Dublin Core metadata associated with the reserved oai_dc metadataPrefix. Online at: http://www.openarchives.org/OAI/2.0/oai_dc.xsd (last access: 2005)
8. Dublin Core Metadata Initiative. http://dublincore.org (last access: 2005)
9. Grossman, David A. Information retrieval: algorithms and heuristics. 2nd ed. Dordrecht:
Springer, (2004) 332
10. Gutteridge, C. GNU EPrints 2 overview, Jan. 01 (2002).
11. Huang, Z. et al. A Graph-based Recommender System for Digital Library. In JCDL’02
Portland, Oregon (2002)
12. Laender, A., Gonçalves, M. and Roberto, P.: BDBComp: Building a Digital Library for the
BrazilianComputer Science Community. In Proceedings of the 4th ACM/IEEE-CS Joint
Conference on Digital Libraries, Tucson, AZ; USA (2004) 23-24
13. Laustanou, K.: MySpace Music (2007)
14. Lewis, D.D.; Evaluating text categorization. In Proceedings of Speech and Natural
Language Workshop. Defense Advanced Research Projects Agency, Morgan Kaufmann.
(1991) 312-318
15. LPMP-CNPq. Padronização XML: Curriculum Vitae. Online at: http://lmpl.cnpq.br. (last
access: 2005-03)
16. Maly, K., Nelson, M., Zubaír, M., Amrou, A, Kothamasa, S., Wang, L. and Luce, R.:
Lightweight communal digital libraries. In Proceedings of JCDL’04, Tucson; AZ (2004)
237-238
17. OAI: Open Archives Initiative. Online at: http://openarchive.org (last access: 2005-10)
18. OAI-PMH. The Open Archives Initiative Protocol for Metadata Harvesting. Online at:
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm (last access: 2005-11)
19. Salton, G. and Buckley, C.: Term-Weighting Approaches in Automatic Text Retrieval,
Information Processing and Management an International Journal, v.24, Issue 5, (1988)
513-523
20. Salton, G. and Macgill, M.: Introduction to Modern Information Retrieval. New York.
McGraw-Hill. (1983) 448
21. Sompel, H. and de Lagoze, C.: The Santa Fe Convention of the Open Archives Initiative DLib Magazine, [S.1.], v.6, n.2, Feb (2000)
22. Tansley, R., Bass, M., Stuve, D., Branschofsky, M., Chudnov, D., McClellan, G. and
Smith, M.: Dspace: An institutional digital repository system. In Proceedings of JCDL’03,
Houston, TX (2003) 87-97
23. Ochoa, A et al. Musical Recommendation on Thematic Web Radio. In Journal of
Computers, Oulu, Finland (2009) By publishing
Improving Clustering of Noisy Documents through
Automatic Summarisation
Seemab Latif 1, Mary McGee Wood 2, and Goran Nenadic 1
1 School of Computer Science, University of Manchester, UK
latifs@cs.man.ac.uk, G.Nenadic@cs.man.ac.uk
2 Assessment21, Cooper Buildings, Sheffield, UK
mary@cs.man.ac.uk, mmw@assessment21.com
Abstract. In this paper we discuss the clustering of students' textual answers in examinations to provide a grouping that will help with their marking. Since such answers may contain noisy sentences, automatic summarisation has been applied as a pre-processing technique. The summarised answers are then clustered based on their similarity, using the k-means and agglomerative clustering algorithms. We have evaluated the quality of document clustering when applied to full texts and to summarised texts. The analyses show that automatic summarisation filters out noisy sentences from the documents, which makes the resulting clusters more homogeneous, complete and coherent.
1 Introduction
Document clustering is a generic problem with widespread applications. The motivation
behind clustering a set of documents is to find inherent structure and relationships between the documents. Document clustering has been applied successfully in the field of Information Retrieval, in web applications to assist in the organization of the information
on the web (Maarek et al., 2000), to cluster biomedical literature (Illhoi et al., 2006)
and to improve document understanding by using document clustering and multiple
document summarisation where summarisation is used to summarise the documents in
resultant clusters (Wang et al., 2008).
The quality of document clustering is dependent on the length of the documents and
the “noise” present within documents. The main features of a document are the words
that it contains. Unimportant words contribute to noise and thus may lead to faulty results from automatic processing. Assess By Computer (ABC) (Sargeant et al., 2004) is a suite of software tools for the assessment of students' exams. It uses document clustering to
group students’ textual answers to support semi-automated marking and assessment in
examinations (Wood et al., 2006). One important feature of ABC is abstraction: instead
of referring to a large number of students’ answers one at a time, ABC refers to them
as groups of similar answers at one time. Students’ answers are written under time
pressure in an examination environment. Therefore, there is always a high chance of
students making spelling mistakes and writing sentences or words that are not relevant
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 253-264
Received 02/12/09
Accepted 16/01/10
Final version 12/03/10
254
Latif S., McGee M., Nenadic G.
to the question being asked and thus could be considered as noise, in particular, if we
want to assess students’ understanding of the concepts in questions. These sentences or
words may affect the performance of the clustering process. To avoid the deterioration
of the clustering results, we hypothesized that Automatic Text Summarisation could be
one of the solutions to remove noisy data from the answers efficiently and effectively.
Previous extensive work on document clustering has focused on issues such as initial number of clusters, similarity measures and document representation, and has to
some extent ignored the issue of document pre-processing. The principal pre-processing
techniques applied to documents have been stopword removal, stemming and case folding. However, document clustering algorithms are highly dependent on the length of the
document (Hsi-Cheng and Chiun-Chieh, 2005). Therefore, we hypothesize that pre-processing texts using automatic summarisation before document clustering may improve the performance of clustering algorithms in terms of both efficiency and quality of the clustering, and that summary-based clusters will be more homogeneous, complete and coherent than full-text clusters.
2 Integrating Text Summarisation into Document Clustering
In this paper, we integrate text summarisation with document clustering to utilize the
mutual influence of both techniques to help in grouping of students’ free text answers.
We apply automatic text summarisation as a pre-processing method for students’ answers before clustering. The main aim behind using automatic text summarisation is to
extract the information content of the answers, and to improve document clustering by
reducing the noise and the length of the answers.
2.1 Automatic Summarisation
In this paper, we have used a Keyword-dependent Summarizer (KdS) that summarises a document using keywords given by the user. In the case of summarising students' answers, these keywords are given by the marker. This summarisation method follows the shallow approach to summarisation, i.e. it employs text extraction techniques to generate the summary (Neto et al., 2002). An abstractive, or language generation, approach to summarisation was not used, as there is no guarantee that the generated text will be useful (Neto et al., 2000). In order to perform extractive summarisation, the first question that arises is the granularity at which segments will be extracted, i.e. the "size" of the textual units that are copied from the original text and pasted into the summary. This could be a paragraph, a sentence, a phrase or a clause. We have performed summarisation at the sentence level using a structural analysis technique. This technique uses the semantic structure of the text (such as keyword frequency, position of the sentence in the document and paragraph, cue phrases and sentence length) to rank sentences for inclusion in a final summary.
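The exact scoring scheme of KdS is not spelled out in the paper; a minimal sketch of keyword-driven extractive scoring along these lines (the weights and the cue-phrase list are illustrative assumptions, not the published KdS parameters) might look like:

```python
import re

def score_sentence(sentence, position, total, keywords):
    """Rank a sentence by keyword frequency, position, cue phrases and length."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    # Keyword frequency: fraction of words matching marker-supplied keywords.
    kw = sum(1 for w in words if w in keywords) / len(words)
    # Position: sentences near the start of the answer score higher.
    pos = 1.0 - position / total
    # Cue phrases (illustrative list) signal summary-worthy content.
    cue = 0.2 if any(c in sentence.lower()
                     for c in ("in conclusion", "therefore", "because")) else 0.0
    # Penalise very short fragments.
    length = min(len(words) / 10.0, 1.0)
    return kw + 0.5 * pos + cue + 0.3 * length

def summarise(text, keywords, ratio=0.4):
    """Keep the top `ratio` of sentences, restored to document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i], i,
                                                 len(sentences), keywords),
                    reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:n]))
```

A 40% ratio here corresponds to the 40% compression level that the experiments later identify as optimal.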
This summarisation method has been evaluated qualitatively and quantitatively through both human and automatic evaluations. Both human and ROUGE (Lin and Hovy, 2003) evaluations achieved a high positive correlation with the scores assigned to the essays by the human marker.
2.2 Clustering process with Summarisation: Framework
Clustering students’ free text answers is a difficult task as there are a number of problems posed by natural language for automated processing of these answers. In this paper, each student answer is referred to as a document. Documents are represented as
a vector of words in the Vector Space Model (VSM). The performance of the VSM is
highly dependent on the number of dimensions in the model; it can be improved by removing terms that carry little semantic meaning and are not helpful when calculating the similarity between documents (stopword removal), converting terms to their stems (stemming), correcting spellings, and term weighting (the process of modifying
the values associated with each term to reflect the information content of that term). The
similarity between document vectors is calculated using the cosine of the angle between
them.
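A minimal sketch of the vector-space representation and cosine similarity just described, with tf-idf shown as one common choice of term weighting (the paper does not specify the exact weighting scheme used):

```python
import math
from collections import Counter

def build_vectors(docs):
    """Represent each tokenised document as a tf-idf weighted sparse term vector."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Documents sharing weighted terms score close to 1; documents with no terms in common score 0.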
The overall framework of the integration of text summarisation with document clustering for clustering students' answers is given in Figure 1 and follows these steps:
– Step 1: Extracting Answers
In this step, students’ answer strings for a question, which are to be clustered, are
extracted from the XML file provided by the ABC marking tool.
– Step 2: Summarising Answers
Each answer string is then summarised using KdS. Summarisation extracts the information content of the answers and filters out the noise. This reduces the number
of terms in the answer string.
– Step 3: Document Pre-processing
The summarised answers are then pre-processed using spelling correction, stopword removal, stemming and term weighting. The first three pre-processing steps, together with summarisation, aim to reduce the total number of terms used in clustering, while term weighting applies different weights to the terms to reflect their importance.
– Step 4: Creating the Term-By-Document Matrix
The term-by-document matrix on full-text answer strings could potentially be a
sparse matrix containing over 500 terms (dimensions). Many terms in the matrix
are noisy and useless, giving no semantic information about the answer. To improve the performance of the term-by-document matrix, we have created vectors
on the summarised answer strings instead of full-text answer strings. The matrix
created on summarised answer strings has fewer dimensions than the matrix created on full-text answer strings.
Fig. 1. Framework for Clustering Students’ Answers
– Step 5: Creating Clusters
The last step in this procedure is to create the clusters. Two clustering algorithms are used, grouping answers based on the terms present within each answer and the similarity between answers.
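As a sketch of step 5, a bare-bones k-means over dense document vectors could look like the following (Euclidean distance is used here for brevity; the paper computes similarity with the cosine measure, which ranks identically to Euclidean distance on length-normalised vectors):

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means on dense document vectors; returns a cluster id per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, centroids[c])))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```

The agglomerative alternative used in the paper starts instead from singleton clusters and repeatedly merges the most similar pair.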
3 Experimental Setup
In this section, the design and results of an experiment are discussed. Here, automatic
summarisation has been applied as a pre-processing step for document clustering to reduce the noise. The full documents and summarised documents are then clustered and
clustering results are evaluated using the precision and recall measures (defined below).
The aim of this experiment is to evaluate the clustering results, when carried out on full-text and on summarised documents, for their quality and accuracy.
Experimental Design: The automatic summarisation pre-processing method was applied for the partitional (k-means) and hierarchical (agglomerative) clustering algorithms. We used KdS, with the keywords taken from the "Model Answer" to each examination question. The clustering algorithms were run on the full-text documents and on five levels of summarisation compression: each document in the dataset was compressed to 50%, 40%, 30%, 20% and 10% of its original length, and both algorithms were run on each of the five compressed versions of every dataset as well as on the whole full-text documents.
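This experimental grid (each dataset clustered at full length and at five compression levels) can be organised as a simple driver; the `summarise` and `cluster` parameters below are placeholders standing in for KdS and for each of the two clustering algorithms:

```python
def run_experiment(datasets, summarise, cluster,
                   levels=(1.0, 0.5, 0.4, 0.3, 0.2, 0.1)):
    """Cluster every dataset at each compression level (1.0 = full text).

    datasets: {name: (documents, number_of_clusters)}
    summarise(doc, ratio) -> compressed document
    cluster(docs, k) -> cluster assignment
    """
    results = {}
    for name, (docs, k) in datasets.items():
        for ratio in levels:
            compressed = [summarise(d, ratio) for d in docs]
            results[(name, ratio)] = cluster(compressed, k)
    return results
```

With seven dataset questions, six conditions, and two algorithms, this yields the 84 clustering runs reported in tables 2 to 7.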
Datasets: The datasets used for this experiment were taken from the ABC marking
tool, and include marked exams from Pharmacy, Biology and Computer Science undergraduate courses conducted at the University over the past five years. The reason for
choosing these datasets is that human-assigned marks are available for them, and
these marks will be used to evaluate the clustering results. These exams were marked
by an instructor who was teaching the course.
Table 1 gives the number of answers in each dataset, average number of terms in the
answers and the number of clusters generated for each dataset. The number of clusters
for each dataset was the total number of distinct marks for that question plus one (for
mark 0). For example, if we want to cluster answers to a question worth of 4 marks then
the number of clusters will be 5 (one for mark 0). As a sample, one question along with
its answer from the Biology and Pharmacy datasets is given in appendix A.
Table 1. Datasets statistics for evaluating summary-based clusters
Dataset    Question ID   Number of   Number of   Number of
                         Documents   Features    Clusters
CS         CS1               108        880          5
Biology    Biology1          279        960          6
           Biology2          280        708          6
           Biology3          276        909          7
           Biology4          276        996          6
Pharmacy   Phar1             139       1103          5
           Phar2             165       1278          7
3.1 Evaluation Metrics
In general, the categories are not known beforehand, but in our case we have used
human-marked data from the ABC marking tool to evaluate the quality of clustering.
The answers in each dataset were grouped according to the marks awarded by the human marker. These marked categories have served as the "gold standard"
categories for the evaluation. For the evaluation of document clustering results, Precision and Recall were defined as follows.
Precision associated with one document represents what fraction of documents in
its cluster belongs to its marked category. Precision of a clustering result on a dataset
can be either calculated at the document level (micro-precision) or at the cluster level
(macro-precision). If the micro-precision is high, then it means that the number of noisy
documents or misclassified documents in the clusters is low. If macro-precision is high
then it means that most documents from the same category are grouped together.
Recall associated with one document represents what fraction of documents from its
marked category appears in its cluster. Similarly to precision, recall of a clustering result on a dataset can be either calculated on the document level (micro-recall) or on
the cluster level (macro-recall). If the micro-recall is high then it means that most of
the documents from its model category lie in its cluster. If macro-recall is high then it
means clusters are similar to the model categories.
We have used only the micro-precision and micro-recall measures for clustering result evaluations, combined using the standard combining metric, Van Rijsbergen's F-measure (Rijsbergen, 1974). Both precision and recall of a cluster will be high when the computed cluster is similar to the model cluster. The precision will be 1 and the recall of each document will decrease by 1/|ModelCategory| when each document is placed in an independent cluster, i.e. each cluster has only one document. The recall will be 1 and the precision of each document in the cluster will decrease by n/(n+1) when all the documents are clustered into one cluster. Note that it is not possible to have a precision or recall of 0, because a cluster and a category always share at least one common document.
There are four mathematical constraints proposed by Amigo et al. (2009) that should be satisfied by clustering evaluation metrics: Cluster Homogeneity, Cluster Completeness, Rag Bag, and Cluster Size versus Quantity. The metrics used here satisfy these constraints: micro-precision satisfies the cluster homogeneity and rag bag constraints, while micro-recall satisfies the cluster completeness and cluster size versus quantity constraints.
4 Results and Analysis
Because of the random initialisation of k-means, the algorithm was run 10 times on each dataset and mean values were calculated for the evaluation. All summarisation-based clustering results performed better than the full-text clustering results. In tables 2 and 3, the micro-precision values are higher for the summarisation-based clusters than for the full-text clusters.
Table 2. Micro-precision for k-means clustering
Dataset    Question ID  Fulltext  ------- Compression Level -------
                                    50%     40%     30%     20%     10%
CS         CS1            0.513   0.831   0.837   0.793   0.753   0.728
Biology    Biology1       0.519   0.843   0.880   0.838   0.776   0.711
           Biology2       0.495   0.904   0.899   0.852   0.832   0.780
           Biology3       0.617   0.814   0.823   0.759   0.711   0.677
           Biology4       0.672   0.777   0.786   0.764   0.719   0.679
Pharmacy   Phar1          0.569   0.774   0.800   0.737   0.691   0.642
           Phar2          0.586   0.798   0.811   0.770   0.720   0.663
Table 3. Micro-precision for agglomerative clustering
Dataset    Question ID  Fulltext  ------- Compression Level -------
                                    50%     40%     30%     20%     10%
CS         CS1            0.466   0.840   0.840   0.783   0.750   0.694
Biology    Biology1       0.715   0.855   0.862   0.780   0.805   0.727
           Biology2       0.738   0.900   0.895   0.865   0.849   0.736
           Biology3       0.673   0.809   0.843   0.764   0.718   0.675
           Biology4       0.695   0.786   0.800   0.753   0.707   0.705
Pharmacy   Phar1          0.599   0.775   0.779   0.762   0.718   0.602
           Phar2          0.655   0.772   0.812   0.761   0.721   0.678
According to the definition of micro-precision, it is high when most of the items from a single model category are clustered together in one cluster or when the majority of the items have their own individual clusters. In our case, the number of clusters is fixed for each dataset, which eliminates the possibility of each document having its own cluster. This means that most of the documents belonging to a single model category are clustered together and that summarisation has filtered out the feature terms that are not useful in distinguishing the documents. Therefore, summary-based clusters are more homogeneous and noise-free than full-text clusters.
Micro-recall values for the two algorithms are given in tables 4 and 5. These values are high for full-text clusters. According to the definition of micro-recall, it is high when the resultant clusters are similar to the model categories or when the majority of the items are clustered in one cluster. We analysed the full-text document clustering results manually, which showed that the high micro-recall for full-text clusters is due to most of the documents being grouped into one cluster.
However, many of the feature terms in full-text documents have low distinctive power and are not useful in clustering. This suggests that if initial documents are misclassified, these documents "attract" other documents with high similarity and a large number of common feature terms. This makes the cluster noisy with documents that are not part of it.
Table 4. Micro-recall for k-means clustering
Dataset    Question ID  Fulltext  ------- Compression Level -------
                                    50%     40%     30%     20%     10%
CS         CS1            0.795   0.715   0.769   0.678   0.592   0.549
Biology    Biology1       0.688   0.724   0.730   0.838   0.616   0.545
           Biology2       0.769   0.668   0.715   0.643   0.588   0.518
           Biology3       0.674   0.760   0.769   0.706   0.655   0.618
           Biology4       0.697   0.772   0.772   0.750   0.702   0.650
Pharmacy   Phar1          0.628   0.756   0.764   0.695   0.653   0.590
           Phar2          0.638   0.754   0.771   0.725   0.667   0.606
Table 5. Micro-recall for agglomerative clustering
Dataset    Question ID  Fulltext  ------- Compression Level -------
                                    50%     40%     30%     20%     10%
CS         CS1            0.863   0.727   0.768   0.712   0.556   0.564
Biology    Biology1       0.574   0.733   0.733   0.722   0.620   0.599
           Biology2       0.496   0.645   0.680   0.622   0.586   0.530
           Biology3       0.613   0.740   0.777   0.727   0.675   0.636
           Biology4       0.667   0.781   0.779   0.739   0.684   0.664
Pharmacy   Phar1          0.557   0.713   0.728   0.683   0.690   0.558
           Phar2          0.598   0.764   0.761   0.713   0.676   0.617
Tables 6 and 7 and figures 2 and 3 give the micro-F-measure values for the k-means
and agglomerative clusterings on the three datasets.
Analysis of the results shows that both clustering algorithms achieved optimal clustering at the 40% compression level. At this level, the micro-F-measure is approximately 80% for both algorithms, averaged over all the datasets. The datasets with a large number of documents have a smooth curve with a peak at 40% summary-based clustering. For most of the datasets, even 10% summarised-text clustering performed better than full-text clustering.
5 Conclusion
In this paper, we have discussed experiments that evaluate a method of improving clustering results by performing automatic summarisation of students' textual
answers. Automatic summarisation has reduced the length of the documents and hence
the number of feature terms that were potentially noisy for clustering. The results suggest that automatic summarisation has filtered out the noise from the documents and
Table 6. Micro-F-measure for k-means clustering
Dataset    Question ID  Fulltext  ------- Compression Level -------
                                    50%     40%     30%     20%     10%
CS         CS1            0.624   0.768   0.802   0.731   0.662   0.626
Biology    Biology1       0.591   0.778   0.797   0.741   0.685   0.616
           Biology2       0.602   0.767   0.796   0.733   0.689   0.622
           Biology3       0.644   0.786   0.795   0.731   0.682   0.646
           Biology4       0.684   0.774   0.779   0.757   0.690   0.664
Pharmacy   Phar1          0.597   0.764   0.782   0.715   0.672   0.614
           Phar2          0.611   0.775   0.790   0.746   0.693   0.633
Average                                   0.791
Fig. 2. K-means clustering micro-F-measure
Table 7. Micro-F-measure for agglomerative clustering
Dataset    Question ID  Fulltext  ------- Compression Level -------
                                    50%     40%     30%     20%     10%
CS         CS1            0.605   0.780   0.802   0.746   0.638   0.623
Biology    Biology1       0.637   0.789   0.792   0.750   0.701   0.657
           Biology2       0.593   0.751   0.773   0.723   0.694   0.616
           Biology3       0.641   0.773   0.808   0.745   0.696   0.655
           Biology4       0.681   0.784   0.789   0.746   0.695   0.684
Pharmacy   Phar1          0.577   0.743   0.753   0.720   0.703   0.579
           Phar2          0.625   0.768   0.786   0.736   0.698   0.646
Average                                   0.790
Fig. 3. Agglomerative clustering micro-F-measure
has extracted the relevant information content of the documents.
Due to the noise removal and reduction in document length, the performance of the
two document clustering algorithms has improved in terms of quality of the clustering
results. While evaluating the clustering results, we came to the conclusion that summarisation-based clusters are more homogeneous (the micro-precision of summary-based clusters is higher than that of full-text clusters for both algorithms, see tables 2 and 3) and more complete (the micro-recall of summary-based clusters is higher than that of full-text clusters, except for 10% summarised-text clusters using k-means, see tables 4 and 5) than full-text clusters. Optimal clustering results were
achieved when the documents were summarised to 40% of their original length.
References
1. Amigo, E., Gonzalo, J., Artiles, J. and Verdejo, F. (2009). A Comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints, Information Retrieval Journal 12(4): 461–486.
2. Hsi-Cheng, C. and Chiun-Chieh, H. (2005). Using Topic Keyword Clusters for Automatic Document Clustering, Proceedings of the 3rd International Conference on Information Technology and Applications, pp. 419–424.
3. Illhoi, Y., Hu, X. and Il-Yeol, S. (2006). A Coherent Biomedical Literature Clustering and Summarization Approach through Ontology-Enriched Graphical Representations, Proceedings of the 8th International Conference on Data Warehousing and Knowledge Discovery, pp. 374–383.
4. Lin, C. and Hovy, E. (2003). Automatic Evaluation of Summaries using N-gram Co-occurrence Statistics, Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pp. 71–78.
5. Maarek, Y., Fagin, R., Ben-Shaul, I. and Pelleg, D. (2000). Ephemeral Document Clustering
for Web Applications, Technical Report RJ 10186, IBM Research Report.
6. Neto, J., Freitas, A. and Kaestner, C. (2002). Automatic Text Summarization using a
Machine Learning Approach, Proceedings of the 16th Brazilian Symposium on Artificial
Intelligence, Advances in Artificial Intelligence, pp. 205–215.
7. Neto, L., Santos, A., Kaestner, C. A. and Freitas, A. (2000). Document Clustering and Text
Summarization, Proceedings of the 4th International Conference on Practical Applications of
Knowledge Discovery and Data Mining, pp. 41–55.
8. Rijsbergen, C. V. (1974). Foundation of Evaluation, Journal of Documentation 30(4): 365–373.
9. Sargeant, J., Wood, M. and Anderson, S. (2004). A Human-Computer Collaborative Approach to the Marking of Free Text Answers, Proceedings of the 8th International Conference on Computer Assisted Assessment, pp. 361–370.
10. Wang, D., Zhu, S., Li, T., Chi, Y. and Gong, Y. (2008). Integrating Clustering and Multi-Document Summarization to Improve Document Understanding, Proceedings of the 17th Conference on Information and Knowledge Management, pp. 1435–1436.
11. Wood, M., Jones, C., Sargeant, J. and Reed, P. (2006). Light-Weight Clustering Techniques for
Short Text Answers in HCC CAA, Proceedings of the 10th International Conference on
Computer Assisted Assessment, pp. 291–305.
Appendix A
One question, along with its answer, from each of the Biology and Pharmacy datasets is given in this
appendix.
Biology II
Question: What do you understand by the term Haematocrit? Could a person have a normal RBC
count but a low Haematocrit? What could be the cause of this?
Answer: The Haematocrit shows the relative proportion of red blood cells to plasma and (white
blood cells and other proteins), the figure given being the proportion of ’packed red blood cells’.
The ideal proportion of packed cell volume (haematocrit) being, 37-47% in females, and 40-54%
in males. If a person was to have a normal red blood cell count (normal amount of cells per micro
litre), but smaller cells (microcytic cells) due to a disorder with the peptide chains, for example a
missing peptide chain, then the cells would be able to pack closer together, and hence have a low
haematocrit. This is classed as microcytic anemia, and can be due to problems in transcribing the
two alpha or two beta chains, or perhaps the inability to form the globular protein. Eitherway, the
size of the haemoglobin is reduced, hence the volume is reduced, but there is still likely to be the
same number of cells. Other reasons for a normal red blood count and a low Haematocrit could
be an increase in bodily fluids, i.e. plasma. The red blood cell count would be considered within
’normal’ range, however, due to the increase in plasma, the ratio of blood to plasma would be
altered in that the Haematocrit value would be lower.
Pharmacy
Question: State the factors to be considered when selecting an antiepileptic regimen for an 18
year old female patient who has been newly diagnosed as suffering from epilepsy by the local
Neurologist.
Answer: Firstly, before selecting a treatment regimen, it is important to establish the number of
seizures the patient has suffered. This is because treatment is rarely initiated after a single seizure.
Usually, a person must suffer from two seizures in twelve months before treatment is given. The
type of seizure is also important in selecting a treatment regimen i.e partial seizures are generally treated differently to general seizures. These classes of seizure can be further subdivided and
will have specific treatment protocols. Thirdly, it is important to establish whether the patient is
taking any other medication or has any other medical conditions which could affect the treatment
options. For example, many antiepileptic drugs can interact with and reduce the efficacy of the
combined oral contraceptive. It is important to try and give the patient a single agent wherever
possible as many antiepileptic patients are successfully controlled with monotherapy. This would
avoid the problems asociated with polypharmacy such as drug interactions, increased drug tox-
icity and reduced compliance. The age of the patient also needs to be considered especially if
they were very young or elderly as this may limit the treatment options available or the doses
may need adjusting. This does not seem to be a problem in this patient as she is 18 years old.
However, the fact that she is a female needs to be considered. This is because many of the drugs
can cause unacceptable side effects in women i.e. sodium valproate can cause hair loss, phenytoin
can cause acne and gingival hyperplasia. This woman is also of child bearing age and it would
need to be established if the patient was pregnant or breastfeeding as many of the antiepileptic
drugs are teratogenic. It would also need to be established whether this patient was taking the
combined oral contraceptive pill as antieplieptic medication can reduce the efficacy of this. This
would mean that other precautionary advise on alternative methods of contraception would need
to be given.
Educational Applications
User Profile Modeling in eLearning
using Sentiment Extraction from Text
Adrian Iftene and Ancuța Rotaru
Faculty of Computer Science, “Al. I. Cuza” University,
General Berthelot Street, No. 13, 700483 Iasi, Romania
{adiftene, ancuta.rotaru}@infoiasi.ro
Abstract. This paper addresses an important issue in the context of current Web applications: providing personalized services to users. This aspect of web applications is controversial because it is hard to identify the main component that should be emphasized, namely the development of a user model. Customizing an application has many advantages (the elimination of repeated tasks, behavior recognition, indicating a shorter way to achieve a particular purpose, filtering out information irrelevant to a user, flexibility), but also disadvantages (there are users who wish to remain anonymous, users who refuse the offered customization, and users who do not trust the effectiveness of personalization systems). This means that much remains to be done in this field: personalization systems can be improved, and the user models created can be adapted to a wider range of applications. The eLearning system we created aims to reduce the distance between the actors involved (the student and the teacher) by providing easy communication over the Internet. Based on this application, students can ask questions and teachers can provide answers to them. Then, on the basis of this dialogue, using sentiment extraction from text, we built a user model in order to improve the communication between the student and the teacher. In this way we not only built a user model, but also helped the teacher to better understand the problems faced by the students.
Keywords: eLearning, Sentiment Extraction, User Profile Modeling
1 Introduction
Web pages are customized for specific users based on certain features, such as their interests, the social class they belong to, or the context in which they access the pages. Customization starts with creating a user model that includes modeling the user's skills and knowledge. This model can predict (where appropriate) the mistakes typically made during the learning process, because it is essentially a collection of the user's personal data.
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 267-278
Received 23/11/09
Accepted 16/01/10
Final version 09/03/10
268
Iftene A., Rotaru A.
User models are also found on the social Web, where they are used to describe how people socialize and how they interact via the World Wide Web. For example, people can explicitly define their identity by creating a profile on social networking services like Facebook, LinkedIn and MySpace, or tacitly by creating blogs to express their own opinions.
GNU Mailman1 is free software for managing electronic mail discussion and e-newsletter lists. Mailman is integrated with the web, making it easy for users to
manage their accounts and for list owners to administer their lists. Mailman supports
built-in archiving, automatic bounce processing, content filtering, digest delivery,
spam filters, digests (RFC 934 and RFC 1153), Usenet gateways and more. It
provides a web front-end for easy administration, both for list owners and list
members.
Our system is similar to the one presented before, but in addition we have addressed the construction of a user model in the eLearning system and have examined the different traits that underlie it, such as interests, knowledge, experience, goals, personal traits and work context. We also describe the created patterns and their forms of representation (e.g., vectors or arrays of concepts), and we discuss recommendation systems (systems that give users recommendations depending on the created model or on their interests).
For creating the user model we used a module for extracting sentiment from
Romanian texts, similar to [1]; thus we emphasize the positive (frumos – En: nice,
excelent – En: excellent), negative (groaznic – En: awful, teamă – En: fear) or neutral
(a merge – En: to go, masă – En: table) significance of the terms in a text. We also
identify the role of negations (nu – En: not, niciodată – En: never), of diminishing
words (mai puțin – En: less, probabil – En: probable) and of intensifying words
(desigur – En: sure, cert – En: certain). To investigate the significance of each term
we used the Romanian WordNet2, and after that we created our own resource with
specific terms. Thus, we first add general terms about people, taken from Maslow's
pyramid of human motivation3 [6]; then, for each of these terms, we add its synonyms,
hyponyms and hypernyms; and after that we add to our custom resource the terms of
the field for which we built this system.
The application is a site created for second-year students of our faculty (the Faculty
of Computer Science, "Alexandru Ioan Cuza" University of Iasi), where they could
ask the teachers questions on certain subjects. The purpose of this application is to
contribute to strengthening the teacher-student relationship by easing communication
between them at times when they cannot meet for various reasons. Of course, for
each of the two types of users who interact with the application we provide specific
functionality. Thus, a student may ask questions with a specific, explicitly stated
purpose or with a certain priority and can view the answers provided by the teacher,
while a teacher can answer questions and can view a full report of each student's
work on the site.
Using this theme (the teacher-student communication) we built a database of
features which was modeled by our system for each student. To assess the quality of
1 GNU Mailman: http://www.gnu.org/software/mailman/index.html
2 WordNet: http://wordnet.princeton.edu/
3 Maslow hierarchy: http://en.wikipedia.org/wiki/Maslow%27s_hierarchy_of_needs
User Profile Modeling in eLearning using Sentiment Extraction from Text
269
these results on the one hand, and to improve the quality of the recommendation
system on the other, we created two psychological tests which we offered to the
students to complete. Finally, based on this information we were able to formulate
recommendations for teachers in order to improve their communication with students.
2 eLearning Elements
Both teachers and psychologists have concluded that learning is a process which
involves all the components of the human personality: from the simplest to the most
complex tasks, the entire human system is set in motion in order to achieve the
reception, processing, interpretation and exploitation of information.
The eLearning concept involves combining art and psycho-pedagogy, because the
starting point in building such a system is knowledge of the learning mechanisms (so
that the system works more efficiently and material is assimilated through it).
eLearning4 (Electronic Learning) is a type of technology-supported education in
which training takes place via computer technology. This form of education is used in
many areas: in companies, to provide training courses for employees; in universities,
to define a specific mode of attending a course (especially when students cannot
attend the courses in person).
Although it is said that this approach does not favor face-to-face interaction (the user
relies only on the computer to find the necessary information), as opposed to standard
education, where there is always a teacher who teaches the student, this new
educational technology can be used in combination with already established
techniques: the teacher can enrich his lecture by using such an educational system,
presenting information to his students together with representations, associated by
the system, that are easier to assimilate.
Distance learning encompasses different modes of deployment and different
technologies for providing instruction (correspondence, audio, video and computer).
It implies a physical distance between the actors of education (student and teacher).
eLearning systems recognize this distance and try to compensate for it with a variety
of strategies that encourage interaction between student and teacher by offering new
possibilities: the exchange of messages, documents or answers to required tasks.
An eLearning system (for distance training or virtual education) consists of a
planned teaching-learning experience organized by an institution which provides
mediated materials in a sequential, logical order, to be processed by students in their
own way, without forcing the participants to be present in a certain place at a certain
time or to carry out a certain activity. Mediation is done in various ways, from
material on CD (possibly sent by mail) to technologies for transmitting content via
the Internet.
For example, AeL (Advanced eLearning)5 is an integrated system for
teaching/learning and content management, which facilitates the work of the actors
involved in designing and deploying the educational process: teachers, students,
4 eLearning: http://en.wikipedia.org/wiki/E-Learning
5 AeL: http://advancedelearning.com/index.php/articles/c322
evaluators, content developers. Although initially built for universities (especially for
distance learning), it is currently used in school education, being suitable for different
languages, regions, levels of knowledge and types of organization. The AeL platform,
designed as a multilayer system, consists of a standard web-browser client
application and an application server based on the Java platform.
One of the main characteristics of our eLearning system is that it provides two-way
communication between the teacher and the learner, with the aim of reducing the
distance between them. After the students log in, they can see the electronic material
for current courses and labs, ask questions, and see the answers to their own or to
other questions, while the teacher can see these questions and provide answers, all in
real time.
3 User Modeling
Because of the continuous expansion of the Internet and of the WWW (World Wide
Web), the interconnected system of documents accessible through the Internet, and
because of the increasing number of computer users, software systems must be
increasingly adaptable: they must adapt to their users' interests, skills and experience
(users may come from a wide range of areas). Although graphical interfaces have
been built to make computers more accessible to different types of users (in terms of
work field, interests, age and experience), good interfaces for each individual user
have not yet been built.
Creating user models (User Modeling) is a research area which tries to build models
of human behavior in environments specific to human-computer interaction. The
purpose is not to imitate human behavior, but to make the program understand the
user's desires, needs and expectations during the interaction. Moreover, the aim is for
the system to be able to help the user solve certain tasks (available or proposed by
the program). The computerized representation of the user is called a user model, and
the systems which create and use such models are called modeling systems.
Our modeling system combines ideas from adaptive hypermedia systems [2] and
from recommendation systems [3, 7]; additionally, we bring in techniques from
computational linguistics which allow us to extract sentiments from texts and to
identify "similar questions". Thus, like an adaptive hypermedia system, our system
builds user models starting from the users' interests, desires and levels of knowledge,
and it adapts the program's interface accordingly during the interaction with the
corresponding actors. Like a recommendation system, it can recommend to students
which professor is most appropriate for a certain type of question, or recommend
similar questions for which answers are already available.
The features used by our system to obtain a user’s model are:
a. Knowledge – as a baseline we consider the student's current year of study
(e.g., 2nd year), together with additional information such as grades in the
courses taught for a specific domain. For example, to analyze questions
related to an advanced course we consider the grades from basic courses
followed in previous years.
b. Interests – represent the competencies that the user would like to acquire. We
suppose that the students' interests are related to the areas in which they ask
questions. For example, if a student asks a question about the discipline
"Advanced Java Techniques", we suppose he wants to learn more about
specific Java techniques.
c. The intention – represents the user's aim during his interaction with the
adaptive system. The student's intentions are: asking questions, finding
answers, consulting the recommendations proposed by the system, and
filtering the existing questions. Through these options the student can find
support for solving homework in different areas or for preparing project work.
d. Previous experience – the aspects relevant for the system are work experience
and the ability to speak a language. A student's experience is assumed to be in
accordance with his year of study (e.g., a third-year student has experience
with second-year courses, as opposed to a second-year student who is only
now studying them).
e. Individual traits – the components which define the user as an individual:
personality, cognitive traits and learning style. In terms of individual features,
the system focuses on each student's type of character and on identifying the
appropriate scope of his work; these traits were extracted through two
personality questionnaires.
f. Work context – approximated by elements such as the user's platform,
location, physical environment and personal background.
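The six features above can be grouped into one record per student. The sketch below is a hypothetical illustration in Python; the field names and types are our assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class UserModel:
    """One record per student; the fields follow items (a)-(f) above."""
    knowledge: dict = field(default_factory=dict)   # course -> grade, seeded by study year
    interests: list = field(default_factory=list)   # areas the student asks questions about
    intention: str = "ask"                          # e.g. "ask", "consult", "filter"
    experience: int = 2                             # year of study as a proxy for experience
    traits: dict = field(default_factory=dict)      # character / work-area type from questionnaires
    context: dict = field(default_factory=dict)     # platform, location, personal background
```

Such a record would be updated iteratively, question by question, as described in Section 6.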
We gave the users a list of areas in which they could ask questions of a specialist,
and we considered their interests to be the areas in which they chose to raise
questions with the aim of finding out more information. The list contains the
following subjects: current courses (Software Engineering, Advanced Programming
in Java, etc.) and general-information topics (Work License, Research Projects, etc.).
The eLearning system collects information about the user explicitly, on the basis of
two psychological tests. In this way the system retrieves personal traits of the student
that could not be obtained otherwise, namely the type of character (nervous,
emotional, violent, passionate, sanguine, phlegmatic, amorphous or melancholic) and
the type of area in which he would like to work (conventional or conformist,
entrepreneurial or persuasive, investigative or intellectual).
In addition to the information extracted explicitly from a user, we used a software
agent to extract information automatically. In this way we enriched the knowledge
base about the user by drawing conclusions from the explicitly entered information
(e.g., if a user explicitly specifies that he needs a response immediately and he
receives it in less than 24 hours, we may conclude that he is a happy user). Also, an
agent can inform a professor about a student's grades from previous years, in order to
help him better understand why the student asked a certain question.
4 Sentiment Extraction
The proposed system is based on the idea that the words in a text have no emotional
charge in themselves (they describe facts and events), but become emotionally
charged according to the interpretation of each reader and the intention of each
author (in accordance with their interests) [1]. These interests are usually composed
of personal needs, concepts that meet those needs, motivational factors, social and
historical knowledge of facts, and information circulated in the media. Together
these factors form the "knowledge base", which generally consists of general
knowledge of words and their meanings, affective terms and emotion triggers.
An emotion trigger is a word or a concept, in accordance with the interests of the
user, which leads to an emotional interpretation of the text's content. With these
words we built a database which enables us to classify and determine the valences
and feelings in a text.
We now present how we identify and classify the valences and emotions in the texts
written by students. To do this, similarly to [1], we first incrementally build a lexical
database for the Romanian language (containing words that trigger emotions) in
order to discover opinions and emotions in text at the word level (recognizing the
positive, negative or neutral side of a sense). The second step is to assign valences
and emotions to the terms in the database, and the third step is to identify the valency
modifiers.
First step: in order to build a database of words that represent emotion triggers, we
start from the terms presented in "Maslow's pyramid" (the hierarchy of human needs,
about 30 terms in English) and translate them into Romanian. Because the number of
terms is relatively small, we disambiguate them using the Romanian WordNet [8],
and then we associate with every term the synonym set corresponding to the word's
sense. In this way we are sure that henceforth each new word carries the meaning for
which it was added.
After that, we used the Romanian WordNet again to add, for every term, all the
valid meanings and grammatical categories corresponding to Maslow's pyramid. For
these new words we add their hyponyms and the words which are in an entailment
relation with them.
Similarly, we apply the same steps to the terms of Max-Neef's matrix [5]; Max-Neef
considers that human needs are equally important, few, finite and classifiable.
Additionally, we consider that terms related to exams or to deadlines are also
emotion triggers, so we added to our database terms like: punctaj (En: score), notă
(En: grade), termen limită (En: deadline), examen (En: exam), parțial (En: midterm
exam), etc.
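The first step can be sketched as follows. Since the Romanian WordNet cannot be bundled here, the example uses a tiny hand-made stand-in dictionary; the seed terms, synonym and hyponym entries are purely illustrative assumptions (diacritics omitted):

```python
# Stand-in for the Romanian WordNet: each disambiguated seed sense maps to
# its synonyms and hyponyms (illustrative entries only).
MOCK_WORDNET = {
    "siguranta": {"synonyms": ["securitate"], "hyponyms": ["protectie"]},
    "hrana": {"synonyms": ["mancare"], "hyponyms": ["aliment"]},
}

def expand_lexicon(seeds, wordnet):
    """Grow the emotion-trigger lexicon from translated Maslow-pyramid seeds
    by adding each term's synonyms and hyponyms."""
    lexicon = set(seeds)
    for term in seeds:
        entry = wordnet.get(term, {})
        lexicon.update(entry.get("synonyms", []))
        lexicon.update(entry.get("hyponyms", []))
    return lexicon

# Domain-specific triggers (exams, deadlines) are appended afterwards.
lexicon = expand_lexicon(["siguranta", "hrana"], MOCK_WORDNET)
lexicon |= {"punctaj", "nota", "termen limita", "examen"}
```

The same expansion would be repeated for the Max-Neef matrix terms.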
The second step aims to assign valences to the emotion-trigger terms in the
database built in step 1. For this, the following rules presented in [1] were taken into
account:
- The main emotion triggers and their hyponyms are given a positive value.
- The antonyms of the above terms are given a negative value.
- A term's valence is modified according to the modifiers (terms that negate,
emphasize or diminish a valence) by which it is accompanied.
In the third step we define a set of valence modifiers (shifters) in Romanian,
starting from the English modifiers in [1], in order to determine the changes in
meaning of the above emotion triggers. Additionally, we add specific shifters
according to the courses followed by the students. Shifters can change a term's
meaning radically, from positive to negative or vice versa, or can make a term's
meaning neutral. We consider a set of shifters that contains:
a. Negation words: niciodată (En: never), nu (En: no), which change the
valence's sign.
b. A set of adjectives that enhance the meaning of a term: mare (En: high), mai
mult (En: more), mai bine (En: better), profund (En: intense).
c. A set of adjectives that diminish a term's meaning: mic (En: small), mai puțin
(En: less), mai rău (En: worse), mai degrabă (En: rather).
d. A set of modal verbs: a putea (En: can), a fi posibil (En: to be possible), a
trebui (En: should), a vrea (En: to want). They introduce the concepts of
uncertainty and possibility, distinguishing between events that have occurred,
could take place, are taking place, or will take place in the future.
e. A set of adverbs that emphasize the meaning of the whole content: cu
siguranță (En: definitely), sigur (En: certainly), cert (En: sure), în definitiv
(En: eventually).
f. A set of adverbs that change the valence and diminish the emotion of the
entire context: posibil (En: possible), probabil (En: probable).
g. Terms which add a note of uncertainty to the context: abia (En: hardly).
These terms may add uncertainty to the positive valence of a context, even if
no other negative terms occur in the text.
h. Connectors: chiar dacă (En: even if), deși (En: although), dar (En: but), din
contră (En: on the contrary), which can introduce information, but also
influence the text to which they belong.
Negations are used to switch a term to its opposite meaning, while intensifiers and
diminishers have the role of increasing or decreasing a term's degree of valence.
The main observation to be made here is the following: for a valency modifier to
fulfill its purpose (to alter a term's valence), an attitude must be expressed in the text
(it is this attitude that gets modified).
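A minimal sketch of how the shifters (a)-(h) can be applied when scoring a question is given below; the lexicon entries and numeric weights are our illustrative assumptions, not values taken from [1] (diacritics omitted):

```python
# Illustrative valence lexicon (step 2) and shifters (step 3); weights are assumptions.
LEXICON = {"frumos": 1.0, "excelent": 1.0, "groaznic": -1.0, "teama": -1.0}
NEGATIONS = {"nu", "niciodata"}                 # flip the sign (category a)
INTENSIFIERS = {"desigur": 1.5, "cert": 1.5}    # amplify (categories b, e)
DIMINISHERS = {"probabil": 0.5, "abia": 0.5}    # attenuate (categories c, f, g)

def question_valence(tokens):
    """Sum the shifted valences of the emotion triggers in a tokenized question.
    A pending shifter applies only to the next trigger term, mirroring the
    observation that a modifier needs an expressed attitude to act on."""
    total, sign, scale = 0.0, 1.0, 1.0
    for tok in tokens:
        if tok in NEGATIONS:
            sign = -1.0
        elif tok in INTENSIFIERS:
            scale = INTENSIFIERS[tok]
        elif tok in DIMINISHERS:
            scale = DIMINISHERS[tok]
        elif tok in LEXICON:
            total += sign * scale * LEXICON[tok]
            sign, scale = 1.0, 1.0  # the shifter is consumed by the trigger
    return total
```

For example, "nu frumos" yields a negative score, while "desigur excelent" yields an amplified positive one.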
5 The System
From the beginning we established the site's purpose: to build a consistent
database of student models, after the investigation steps, in order to identify and
interpret the sentiments in students' questions. To achieve this objective we built two
interfaces: one for second-year students of our faculty (the Faculty of Computer
Science, the Alexandru Ioan Cuza University of Iasi) and one for their professors.
The system's architecture is presented in Figure 1.
[Figure: Professors and Students connect through the Professor Interface and the
Student Interface to a Server that hosts the Q&A module, the Courses, the Users'
Models and the Questionnaires.]
Fig. 1. The system's architecture
The Student's Interface: after registering, every student receives an account that
allows him to communicate with his professors. A student has access to the
following components:
a. Courses page – here a student can find the available courses and can attend a
preferred course. For every course there are electronic materials,
presentations, practical or theoretical exercises and useful links.
b. Question&Answering page – allows and facilitates discussions between
students and professors. Here, the students can ask questions related to
current courses or general questions (regarding the work license, etc.). Every
question carries additional information about its priority (normal, urgent,
trivial) and about the student's motivation (to solve problems, to gain more
knowledge, etc.). Moreover, the students may request a meeting with the
teacher when they believe that a face-to-face discussion would help them
more.
c. Questionnaires – help our programs build the user's model. We consider two
types of questionnaires: the first concerns the character type (nervous,
emotional, violent, passionate, sanguine, phlegmatic, amorphous or
melancholic) and the second the area in which the student would like to work
(conventional or conformist, entrepreneurial or persuasive, investigative or
intellectual). The first test is composed of 12 questions and the second of 25
questions; the latter was constructed following the Holland test6 model.
The Professor’s Interface: after registering, every professor receives an account
that allows him to communicate with his students. A professor has access to the
following components:
6 Holland test model: http://www.hollandcodes.com/my_career_profile.html
a. Courses page – for adding new courses and new materials for them. A
professor can also see students' comments on the courses, the number of
students who attend them, and the number of students who solve the
exercises, together with their proposed solutions.
b. Question&Answering page – the professors can answer questions related to
their own courses, but they may also respond to general questions or to
questions about other courses. The order of the answers depends on the
priority of the questions, and an answer's content depends on the student's
motivation (a simple answer if the student wants to solve an exercise, or a
detailed answer if the student wants to use certain techniques to implement
complex projects, etc.). In addition, the teachers can arrange meetings with
the students who requested one explicitly.
c. User's Models – this page can be accessed from the Question&Answering
page in order to better understand a question and to identify the desired
answer.
Both the teachers and the students have the opportunity to filter the questions
according to certain criteria. The teachers get a list of questions:
- Sorted in ascending/descending order by the date on which the student asked
the question.
- Sorted according to the priority specified by the student.
- Asked on a particular subject by all students.
The students get a list of their own questions:
- Sorted by the date when the professor offered the response.
- Sorted by the priority they specified.
On the Question-Answering page, we added a specific module that analyzes a
question and shows the student similar questions together with their answers. This
module processes the question and identifies, similarly to [4], the question type
(which can be Definition, Factoid or List), the answer type (which can be a date, a
number, a link address, an organization, etc.) and the keywords (the most relevant
words of the question). With these values, we use two methods to search for
questions similar to a given question:
1. The first method considers previous questions and concludes that two
questions are similar if they have the same question type and answer type and
if they have common keywords. For example, consider the following pair: the
initial question Care este termenul limită pentru alegerea temei proiectului?
(En: What is the deadline for choosing the theme of the project?) and the
current question Până când trebuie ales proiectul? (En: By when must the
project be chosen?). Both are Factoid questions and expect an answer of type
Date, and they have one keyword in common: proiect (En: project). For these
reasons we consider the questions similar and we offer the student the answer
to the initial question for analysis.
2. The second method is applied when the question or answer types differ, but
the questions come from the same period and are related to the same course.
Again, in this case we offer the answer to the initial question for analysis.
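The two matching methods above can be sketched as follows. The record fields (`qtype`, `atype`, `keywords`, `course`, `period`) are illustrative assumptions about how a question might be represented:

```python
def similar(q_old, q_new):
    """Method 1: same question type, same answer type, and shared keywords."""
    return (q_old["qtype"] == q_new["qtype"]
            and q_old["atype"] == q_new["atype"]
            and bool(set(q_old["keywords"]) & set(q_new["keywords"])))

def similar_fallback(q_old, q_new):
    """Method 2: the types may differ, but the questions come from the same
    period and relate to the same course."""
    return (q_old["course"] == q_new["course"]
            and q_old["period"] == q_new["period"])

# The example pair from the text: both Factoid questions expecting a Date,
# sharing the keyword "proiect" (hypothetical course/period values).
q1 = {"qtype": "Factoid", "atype": "Date", "keywords": ["termen", "proiect"],
      "course": "SE", "period": "2009-11"}
q2 = {"qtype": "Factoid", "atype": "Date", "keywords": ["proiect"],
      "course": "SE", "period": "2009-11"}
```

Here `similar(q1, q2)` holds, so the answer to `q1` would be offered to the student who asked `q2`.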
6 Statistics
During one semester we offered our second-year students the possibility to ask the
teachers questions and to find out new things in certain areas. 132 students created
accounts, representing over 50% of the total number of students, and 112 asked at
least one question. The total number of questions was 305, meaning that on average
each student asked about 3 questions. The maximum number of questions asked by
one student was 11. Table 1 shows the distribution of questions per professor.
Table 1. The distribution of questions per professor

Professor ID    Questions number    Percent
1               7                   2.29 %
2               20                  6.56 %
3               124                 40.66 %
4               83                  27.21 %
5               21                  6.89 %
6               41                  13.44 %
7               9                   2.95 %
We studied the questions and noticed that the teachers who received more
questions are those who perform additional research activities with students, or who
set complex problems in the laboratory practical activities. We discussed the
advantages and disadvantages of using the system that builds the user model with the
professors who received a large number of questions (IDs 3, 4 and 6).
Seven teachers were involved: two did not use the system at all (IDs 1 and 7), two
used it very little because they did not believe in its effectiveness (IDs 2 and 5), and
three used it extensively (IDs 3, 4 and 6).
The third professor believes that the system improved his communication with the
students and would use it in the future to give students more opportunities to keep in
touch. The same professor regularly consulted the models created by the system
before answering questions. He said that about 50% of the models were useful (he
was satisfied with the way they were created and with their content) and about 50%
were unnecessary (the models were incomplete or contained no information useful
for answering the questions).
The system involved 112 students, but we conducted the research on 40 of them
(each of whom asked more than 5 questions). Our user models are built iteratively,
from question to question, using additional information obtained from the
questionnaires. When professors receive a question, they additionally receive the
user's model built from previous questions and an analysis of the current question.
Let us look at a few examples.
For the question "Aș putea să fac un proiect cu informații despre hoteluri?" (En:
Could I make a project with information about hotels?) the student specified that the
question is urgent and that he absolutely needs an answer. Because the student used
"aș putea" (En: could) in his question, our application identifies the uncertainty of a
student who does not know the possible alternatives. This question has a higher
positive valence and differs from his previous questions, which have a lower positive
valence or are even neutral. In this kind of situation our application notes that
something has happened and that the student needs more information. After some
verification we found that our supposition was correct: the student had missed one
week because he was at a student contest.
Out of the 305 questions, our system marked 114 as containing emotion-trigger
terms and calculated their valences. In 9 cases the positive valences were higher than
in the rest of the marked questions. One of these questions is "Dacă pentru
realizarea aplicației pare mai ușor să încalc un design pattern, decât să îl respect,
trebuie totuși să urmez patternul respectiv sau e în regulă să fac aplicația cum
consider de cuviință, atât timp cât rezultatul final funcționează?" (En: If, in order to
implement the application, it seems easier to break a design pattern than to respect it,
must I still follow that pattern, or is it ok to build the application as I see fit, as long
as the final result works?). For this question we identified the following emotion
triggers: pare (En: seems), mai ușor (En: easier), decât (En: than), trebuie (En: must),
totuși (En: still), and in this case we obtained the highest valence value. Interestingly,
this was the first question asked by that student, and the rest of his questions, even
those without emotion triggers, were influenced by it.
After several discussions with the professors, we decided to offer, on request, for a
new question the user model of the student who asked it, obtained from all his
questions. Additionally, the system highlights the differences between the valence
values obtained for the current question and for previous questions. The professor
also receives the user's behaviour type and the user's future goals.
7 Conclusions
Adaptive hypermedia came as an answer to help the user who is "lost in space"
because there are too many links to choose from, or because he does not know how
to find the shortest way to his personal goal. In the application we created, we tried to
help the user using natural language processing techniques.
Thus, we helped the student by suggesting answers that a teacher had given to
similar questions, and we helped the teacher to better understand a question by
giving him access to the user model we created for that student. In the first case we
used question-answering techniques, and in the second we combined the
psychological profiles of the students with profiles built from the feelings expressed
in the questions they submitted.
In recent years, various methods have been proposed to identify feelings in a text:
explicitly requesting the transmitter's opinion, or identifying feelings directly related
to different areas of interest. In this paper the emphasis was placed on the role of
emotions and of trigger terms; moreover, feelings were identified in the text by
identifying the positive, negative or neutral aspects of words.
An assessment of the students' opinion of the created system has not yet been
carried out, but in the future we want to create a questionnaire for this purpose. We
also want to talk with the students who used the system most, in order to improve the
components dedicated to them according to their preferences.
References
1. Balahur, A., Montoyo, A.: Applying a Culture Dependent Emotion Triggers Database for
Text Valence and Emotion Classification. Procesamiento del Lenguaje Natural, ISSN
1135-5948, No. 40, pp. 107-114 (2008)
2. Brusilovsky, P., Millán, E.: User Models for Adaptive Hypermedia and Adaptive
Educational Systems. The Adaptive Web, pp. 3-53 (2007)
3. Brut, M.: Ontology-Based Modeling and Recommendation Techniques for Adaptive
Hypermedia Systems. Ph.D. Thesis, Technical Report 09-04, "Al. I. Cuza" University,
ISSN 1224-9327, 155 pages, Iasi, Romania (2009)
4. Iftene, A., Trandabăț, D., Pistol, I., Moruz, A., Husarciuc, M., Cristea, D.: UAIC
Participation at QA@CLEF2008. In: Evaluating Systems for Multilingual and Multimodal
Information Access, Lecture Notes in Computer Science, Vol. 5706, pp. 448-451 (2009)
5. Max-Neef, M., Elizalde, A., Hopenhayn, M.: Human Scale Development: Conception,
Application and Further Reflections. Apex, New York, Chapter 2, "Development and
Human Needs", p. 18 (1991)
6. Maslow, A.H.: A Theory of Human Motivation. Psychological Review 50(4), pp. 370-396
(1943)
7. Tran, T., Cimiano, P., Ankolekar, A.: A Rule-Based Adaptation Model for Ontology-Based
Personalization. Springer, Vol. 93, pp. 117-135 (2008)
8. Tufiș, D., Ion, R., Ide, N.: Word Sense Disambiguation as a Wordnets' Validation Method
in BalkaNet. In: LREC-2004, Fourth International Conference on Language Resources and
Evaluation, Proceedings, Lisbon, Portugal, 26-28 May 2004, pp. 1071-1074 (2004)
Predicting the Difficulty of Multiple-Choice
Cloze Questions for Computer-Adaptive Testing
Ayako Hoshino1∗, Hiroshi Nakagawa2
1 NEC Common Platform Software Research Laboratories
2 University of Tokyo
a-hoshino@cj.jp.nec.com, n3@dl.itcu-tokyo.ac.jp
Abstract
Multiple-choice, fill-in-the-blank questions are widely used to assess language
learners' knowledge of grammar and vocabulary. Such questions are often used in
CAT (Computer-Adaptive Testing) systems, which are commonly based on IRT
(Item Response Theory). The drawback of a straightforward application of IRT is
that it requires training data which are not available in many real-world situations. In
this work, we explore a machine learning approach to predicting the difficulty of a
question from automatically extractable features. Using the SVM (Support Vector
Machine) learning algorithm and 27 features, we achieve over 70% accuracy on a
two-way classification task. The trained classifier is applied in a CAT system with a
group of postgraduate ESL (English as a Second Language) students. The results
show that the predicted values are more indicative of the testees' performance than a
baseline index (sentence length) alone.
1 Introduction
Multiple-Choice (MC) questions are commonly used in many standardized tests, due
to the ease of collecting and marking the answers. Multiple-Choice Fill-In-the-Blank1
(MC-FIB) questions, especially, have proven their effectiveness in testing language
learners’ knowledge and usage of a certain grammar rule or a vocabulary item. The
TOEIC2 test, for example, has questions in this format. Below is an example of a
multiple-choice cloze question:

There is a [ ] on the tree.
1) bird 2) birds 3) orange 4) oranges

The sentence is called the stem, in which the blank is shown as brackets. The
alternatives consist of one right answer and the distractors.
∗ This work has been done as a part of Ph.D. studies at the University of Tokyo.
1 For brevity, we use cloze instead of fill-in-the-blank.
2 Test of English for International Communication, run by ETS (Educational Testing Service), U.S.
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 279-292
Received 04/10/09
Accepted 16/01/10
Final version 10/03/10
CAT is a technology motivated to offer better assessment by being adaptive to the testee: the computer administers each subsequent question depending on
the testee’s preceding performance. IRT provides the theoretical background of
most current CAT systems, in which a testee’s ability (or latent trait) and the
difficulty of a question are described as values in a common unit called the logit. In IRT,
the difference between ability and difficulty is projected onto the probability of the testee
getting the right answer, using a sigmoid function. For a fuller introduction to IRT,
readers are referred to Baker and Kim [1].
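As a sketch, the one-parameter (Rasch) form of this projection can be written as follows; `theta` and `b` are illustrative names for the testee's ability and the item's difficulty in logits:

```python
import math

def p_correct(theta: float, b: float) -> float:
    """Rasch (one-parameter) IRT model: probability that a testee
    of ability `theta` answers an item of difficulty `b` correctly.
    Both values are in logits; their difference goes through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the probability is exactly 0.5;
# a testee two logits above the item answers correctly about 88% of the time.
print(p_correct(1.0, 1.0))  # 0.5
print(p_correct(2.0, 0.0))  # 0.8807970779778823
```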
IRT is a well-researched area in which many positive results have been reported. For
example, Urry reports that an IRT-based CAT system achieves sufficient precision with
20 questions, whereas pen-and-paper testing requires 100 questions [2]. However, less
attention has been paid to the cost of calibrating the model, which is, as commonly
practiced, the cost of conducting pre-tests on a comparable group of testees.
In cases where pre-testing is not possible, all questions are assumed to be of equal difficulty at the onset, and the difficulty of each question is then updated as the users’ responses
accumulate [1].
This research is motivated by combining CAT with the recently emerging AQG technology (Automatic Question Generation, explained in the following section), which would
render possible a novel assessment system that adaptively administers questions from
an automatically generated question set. One obstacle, however, is the above-mentioned problem of adapting an IRT model to the testees’ level, whose cost grows
as the number of available questions increases. Existing methods of IRT model
adaptation become impractical when an unlimited number of newly generated questions are
added to the question pool.
In this situation, it is vital to have a means of automatically computing a rough
prediction of difficulty, or of inferring the difficulty of a question from the performance of
the targeted group on similar questions. The use of supervised machine learning is
worth exploring; it is also an attempt to have the computer acquire a general
notion of the difficulty of MC-FIB questions.
The rest of the paper is organized as follows: We review relevant work on difficulty
prediction in Section 2. In Section 3, the proposed method is presented with evaluation
results on a closed data set. In Section 4, the trained classifier is tested in a subject-group experiment. Section 5 concludes the paper.
2 Related Work
There is only limited literature in the field of computational linguistics on MC-FIB
questions for language proficiency testing. There have been several attempts at generating MC
questions from an input text [3] [4] [5] [6] [7]. One of the AQG studies provides a
method to compute the complexity of reading comprehension questions [8]3 to be used
in a CAT system. Their measure is defined as a weighted sum of complexity values on
the stem sentence, the paragraph, the answer sentence, and so forth.
3 In their study, the format of questions was neither MC nor FIB, thus the answer is composed by the
testee.
In fact, the complexity of a sentence alone has been studied for decades in computational linguistics, and many indices have been proposed. Segler, in his work on
extracting example sentences for language learners, compares traditionally proposed
indices [9]. The indices include sentence length (the number of words in the sentence),
parse-tree depth (the maximum number of levels from the root to a word), and combinations of such factors. Segler’s comparison reveals that it is very hard to beat
sentence length, the simplest of the measures.
The complexity of the stem sentence affects the difficulty of a question such as the one
presented above. But other factors, such as the similarity between the right answer and the distractors, surely influence the difficulty of an MC-FIB question and thus should be taken into
account.
The Flesch Reading Ease index is a readability measure for a passage which is widely
used among educators. The index is defined as follows:

FRE = 206.835 − 84.6X − 1.015Y

where X is the average number of syllables in a word and Y is the average number
of words in a sentence. The index can be applied to a single sentence by treating it as a
passage of one sentence. The related Flesch-Kincaid score is composed so that its value
can be interpreted as a grade level in an American elementary school.
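A minimal sketch of this computation, with a crude vowel-group syllable counter standing in for a real pronunciation dictionary (an assumption):

```python
import re

def count_syllables(word: str) -> int:
    """Crude syllable counter: number of consecutive-vowel groups
    (a heuristic stand-in for a pronunciation dictionary)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(sentence: str) -> float:
    """Flesch Reading Ease applied to a single sentence,
    treated as a one-sentence passage."""
    words = re.findall(r"[A-Za-z]+", sentence)
    x = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    y = len(words)                                           # words per sentence
    return 206.835 - 84.6 * x - 1.015 * y

# Shorter words and shorter sentences score as easier (higher FRE).
print(flesch_reading_ease("There is a bird on the tree"))
```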
Some improvements to these readability measures have been proposed [10]. Miyazaki
et al. proposed an individualized measure for reading-text extraction [11].
Evaluation of the existing indices is often done manually, with ad-hoc parameters. In this study, a supervised machine learning technique is used to tune the parameters for combining the feature values.
3 Difficulty Prediction
In this study, we use supervised machine learning for the task of difficulty prediction. We first explore learning algorithms and train the best-performing classifier on question data annotated with correct response rates.
Then, with a simple binary-search-like method based on the predicted difficulty values,
we build a CAT system and have it tried out by human subjects.
As this is one of the earliest attempts at applying machine learning methods to such
a task, we set out with a simple binary classification. We did not employ regression,
which outputs numerical values, as some readers may wonder. The reason is that
predicting the exact correct response rate observed from a subject group is expected to be
unrewarding, since such observations usually contain what is called measurement
error in the psychometrics literature. We instead train the classifier with the labels “easy”
and “difficult,” letting the computer try to grasp a rough notion of difficulty.
3.1 Training Data
The training data set is obtained from a series of TOEIC preparation books (a total of
702 questions from Part 5, which are MC-FIBs). Each question is annotated with its
correct response rate, ranging from 0.0 to 98.5. The figures are based on the tests in
a TOEIC preparation school in Japan and reportedly based on the results of about 300
testees. Table 6 (in Appendix) shows samples from the training data.
All questions consist of a stem sentence of 20-30 words and four alternatives.
Seemingly, all questions are intended to be of the same difficulty, rather than becoming
increasingly difficult with the question number. Based on the correct response rates, we
labeled the 305 easiest questions “easy” and the 305 most difficult questions “difficult,”
leaving out the 8% around the average value4.
3.2 Features
In deciding the feature set, we take an approach similar to Kunichika et al., who designed factors of difficulty depending on the complexity of the question and of the
answer. We assume the difficulty of a question to be composed of 1) the difficulty of
the stem sentence, 2) the difficulty of the correct answer, and 3) the similarity between
the correct answer and the distractors. As MC-FIB questions are often criticized for
allowing the right answer to be obtained just by reading a few words before and after
the blank, we have also added 4) features on the tri-gram including the correct
answer or a distractor. Each feature, with the notation used in this paper, is explained
as follows:
1) Sentence features
The sentence features consist of sentencelength, which is the number of words in
the original sentence; maxdepth, which is defined as the depth, i.e., the number of bracket levels from the root to the deepest word in a parse result5; and avr wlen, which is
the average number of characters in a word.
2) Answer features
The answer features provide information on the right answer, consisting of blanklength,
which is the number of words in the correct answer, and an array of binary features on
the POS (Part Of Speech) of the right answer (pos V, pos N, pos PREP, pos ADV,
and pos ADJ), which indicate the inclusion of that part of speech in the right answer. For
example, pos V is true if any form of a verb is included in the correct alternative.
3) Distractor similarity features
The distractor similarity features are obtained from an analysis using a modified edit
distance, a technique that has been used to extract a lowest-cost conversion
path from one string to another. In our version of edit distance, we applied the
algorithm on a word basis, as opposed to the character-based application seen in
spelling-error correction. While three kinds of operations, insert, delete, and change,
are used in a standard edit distance, we have additionally defined three operations:
change inflection, change lexical item, and change suffix. Change inflection is an operation where a word is substituted by the same vocabulary item, but in a different
inflectional form. Change lexical item is a substitution of a word with a different vocabulary item in the same inflectional form. Change suffix is a substitution of a word
4 This 8% was decided in a cross validation; we took the breaking point that maximizes the accuracy.
5 We used the Charniak parser http://www.cs.brown.edu/∼ec/.
Table 1: Results with different learning algorithms

Algorithm             Accuracy
SVM                   62.7960%
SMO                   58.1481%
Logistic              57.4074%
VotedPerceptron       57.0370%
IB5                   57.0370%
NaiveBayes            56.6667%
SimpleLogistic        56.4198%
IB10                  55.5556%
MultilayerPerceptron  53.9506%
J48                   53.7037%
IB3                   52.8395%
RBF Network           52.2222%
IB1                   51.8519%
with another word with the same stem, which can be of a different part of speech. We
set the costs of the operations so that change inflection (cost: 2), change lexical item (3),
change suffix (4), and standard change (5) are preferred in this order. The cost of insert
and delete is set to 3, so a combination of deletion and insertion is never used when
one change yields the same result.
The features derived from this analysis are pathlength, defined as the number
of operations, and an array of binary features (include insert, include delete, include changeinflection, include changelexicalitem, include changesuffix, include
change). For example, include insert is true if any of the conversions from the right
answer to the three distractors includes an insert operation.
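The word-level modified edit distance can be sketched as below. The lemma and stem tests here are crude suffix heuristics standing in for whatever morphological analysis the authors used (an assumption), and the "same inflectional form" test needed for change lexical item would require a POS tagger, so this sketch falls back to a standard change in that case:

```python
INFLECTIONS = ("ing", "es", "ed", "s", "d")  # longest suffixes first

def strip_inflection(word):
    """Crude stand-in for a lemmatizer (assumption): strip one
    inflectional suffix if enough of the word remains."""
    for suf in INFLECTIONS:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def change_cost(a, b):
    """Substitution cost, ordered as in the text: change inflection (2)
    < change lexical item (3) < change suffix (4) < standard change (5)."""
    if strip_inflection(a) == strip_inflection(b):
        return 2   # same vocabulary item, different inflection
    if a[:4] == b[:4] and min(len(a), len(b)) > 4:
        return 4   # shared stem, different suffix (e.g. act/action)
    return 5       # plain change (lexical-item test omitted in this sketch)

def edit_distance(answer, distractor):
    """Dynamic-programming edit distance over word lists;
    insert and delete both cost 3."""
    n, m = len(answer), len(distractor)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * 3
    for j in range(1, m + 1):
        d[0][j] = j * 3
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = answer[i - 1], distractor[j - 1]
            sub = 0 if a == b else change_cost(a, b)
            d[i][j] = min(d[i - 1][j] + 3,       # delete
                          d[i][j - 1] + 3,       # insert
                          d[i - 1][j - 1] + sub) # (possibly zero-cost) change
    return d[n][m]

# "walked" -> "walks" is one change of inflection (cost 2).
print(edit_distance(["he", "walked"], ["he", "walks"]))  # 2
```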
4) Tri-gram features
The tri-gram features are obtained from a search engine’s hit score, assuming that the
figure reflects the familiarity of a given tri-gram. The tri-gram features on the correct
answer are hit correct, hit pre2 corr, hit pre corr post, and hit corr post2, where
hit correct is the Google hit score of the correct answer, hit pre2 corr is
the same score for the correct answer with the previous two words, hit pre corr post is
the hit score of the correct answer with the previous word and the following word, and
hit corr post2 is the hit score of the correct answer with the following two words. The
same set of features is defined for the distractors, where hit distractor is the average
of the Google hit scores of the three distractors. All queries are posted as quoted phrases.
All features are turned into numerical values and normalized6 before being fed to the
learner.
3.3 Learning Algorithms
We conduct a 10-fold cross validation to compare the performance of different learning algorithms from the weka machine learning toolkit [12] and SVMlight applied to this
task. Table 1 shows the accuracy of each learning algorithm.
To our disappointment, many of the learning algorithms performed no differently
from sheer randomness (50%). The best accuracy is achieved by SVMlight, noted as
6 We use the normalization filter in the weka package (http://www.cs.waikato.ac.nz/∼ml/weka/). We also tried standardization, but did not obtain better performance than with normalization.
[Figure 1: Accuracy on top n instances. x-axis: number of instances (sorted by classifier’s confidence); y-axis: accuracy.]
SVM in Table 1, outperforming complex algorithms such as VotedPerceptron. Although some of the classifiers have unexplored parameters (for which we used default
values) that could change the resulting accuracy, we conclude that SVM is the
most suited algorithm for this task. The fact that SMO (Sequential Minimal Optimization), which is also a version of SVM, performs second best supports the hypothesis
that the high-dimensional maximum-margin approach is effective for this problem.
One can observe that the parameter k optimizes IBk, the k-nearest-neighbor algorithm, at around 5, though the gain over IB1 remains about 5%.
3.4 Top-N Accuracy
Using SVMlight, we conducted another cross validation with top-N evaluation. Valuing precision over recall, top-N evaluation sorts the test instances by the confidence7
of the classifier and computes accuracy on the N most confidently classified instances.
The top-N accuracy can be interpreted as the performance of the classifier when it is
allowed to skip less confident instances.
The result of top-N evaluation is shown in Figure 1.
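Given a list of (confidence, correctness) pairs from a cross-validation fold, the top-N accuracy can be computed as in this sketch (names and sample values are illustrative):

```python
def top_n_accuracy(predictions, n):
    """predictions: (confidence, is_correct) pairs, where confidence is
    the signed distance to the SVM hyperplane and is_correct says whether
    the predicted label matched the gold label. Instances are ranked by
    absolute confidence and accuracy is computed on the top n."""
    ranked = sorted(predictions, key=lambda p: abs(p[0]), reverse=True)
    top = ranked[:n]
    return sum(1 for _, ok in top if ok) / len(top)

preds = [(1.9, True), (-1.5, True), (0.8, False), (0.2, True)]
print(top_n_accuracy(preds, 2))  # the two most confident are both right -> 1.0
```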
The line indicates the mean accuracy over the 10 splits, averaged over the results of 1,000
runs. The x-axis shows the number of instances the classifier has labeled, those of which
the classifier is most confident. The mean accuracy draws a sharp curve at first, achieving 72% with about 10 instances at the top. The accuracy is kept above 70% up
until the top 18 instances. It then slopes down, marking 65% at the top 50 instances, and
gradually declines to 62.9% when labeling all instances. The overall standard
deviation was about 0.005, which indicates the classifier’s stability. On the condition
7 The confidence value in the context of the SVM algorithm is the distance to the separating hyperplane.
We use this index as the difficulty value, which ranges over negative and positive values; in our case, a smaller
value signifies a more difficult question: if the classifier is more confident of a question’s being difficult, the
question is more difficult.
Table 2: Accuracy gain of each feature (top: features with positive contribution;
bottom: those with zero or negative contribution)

Feature f                  w/o f     Gain
all features               0.70769   —
pos N                      0.65385   0.05385
pos V                      0.66154   0.04615
sentencelength             0.66923   0.03846
hit pre2 dist              0.66923   0.03846
include change             0.67692   0.03077
maxdepth                   0.68462   0.02308
avr wlen                   0.68462   0.02308
hit pre2 answer            0.69231   0.01538
hit pre dist post          0.69231   0.01538
include changesuffix       0.70000   0.00769
hit dist post2             0.70000   0.00769
hit pre answer post        0.70000   0.00769
include insert             0.70769   0.00000
answer wlen                0.70769   0.00000
pos ADV                    0.70769   0.00000
blanklength                0.70769   0.00000
include delete             0.71538   -0.00769
include changelexicalitem  0.71538   -0.00769
include changeinflection   0.72308   -0.01538
pos PREP                   0.72308   -0.01538
hit dist count             0.72308   -0.01538
pathlength                 0.73077   -0.02308
pos ADJ                    0.73077   -0.02308
hit answer post2           0.73077   -0.02308
that the predictor works similarly well on the automatically generated questions, the
predictor could label questions with over 70% accuracy while skipping three out of four
instances.8
Note that binary classification can be extended to comparing and ranking two or
more instances. The SVM algorithm can be used for comparing two or more candidate
instances, which is the way the classifier will work in an actual CAT system.
3.5 Feature Analysis
We investigate which of the features contribute to the accuracy of the classifier
by removing one feature at a time. The difference from the accuracy with all features
signifies the accuracy gain attributable to that feature. The experimental settings were the
same as in the above cross validation. Table 2 shows the accuracy and the performance
gain at the top 15 instances, the largest number of labeled instances at which the performance is kept above 70%.
The most contributing feature was pos N, with an accuracy gain of 5%. Pos V
provides the next largest gain. These two features exceed the contribution of sentence
length, which is assumed to be an undoubted predictor of sentence difficulty.
Then, information on the hit scores and the operations in the conversion paths follows,
with include change and include changesuffix providing larger contributions. Four
features (include insert, blanklength, pos ADV and hit answer), however, do not affect the accuracy at the point of top 15. Several features, such as hit answer post2
and pos ADJ, exhibit negative contributions, though those features contribute by large
margins as more instances are labeled. Avr wlen provides a larger contribution as more
instances are counted in the top n.
8 Note that skipping unconfident instances does not cause a problem in our AQG+CAT system, since an
unlimited number of automatically generated questions are available.
To summarize the results of the above experiments: 1) the overall performance
is higher on the instances with higher confidence values than on all instances; 2) path
features contribute to the accuracy; 3) include change and include changesuffix contribute more than the other path-related features, while features such as include insert
do not provide much information at the point of n = 15; 4) the POS information on
the answer phrase helps boost the accuracy, depending on which POS is looked at:
information on verb or noun inclusion in the answer phrase contributes significantly
to the accuracy, whereas adverb, preposition, and adjective do not yield positive
accuracy gains; and 5) some features do not look effective with a smaller number of
confident instances, although many of them prove effective with larger numbers of
instances.
The results of cross validation with different features removed also provide an analysis of the group of testees. In the case of our data, the feature contributions reflect
the tendencies of the testees who attend the TOEIC preparation school. It could also be
assumed that this group of testees is a good sample of adult Japanese learners
of English. Since the method of cross validation allows us to repeat the experiments
with different features, researchers in SLA (Second Language Acquisition) can perform analyses with their own devised features, without the need to collect data through
a carefully designed subject experiment. Also, provided a sufficient amount of
data from a given learner, the analysis can offer a personal diagnosis of that learner’s
tendencies.
3.6 Human Performance
The differences among the difficulties of questions taken from the same series of
books were quite subtle. To see how difficult this task is for human judges, we asked
two Japanese assistant professors in computer science to perform the binary classification. They were presented with a mixed set of 40 “difficult” questions and 40 “easy”
questions and guessed the label of each instance. The accuracy of the two human judges
was 70% and 72.5%. Considering that the performance of human subjects is
normally deemed the upper limit of an NLP task, this not-too-high performance of the
SVM classifier makes sense. The judges pointed out that the type of question was a clue
they used in deciding the labels: grammar questions tended to seem easy, while vocabulary
questions seemed more difficult.
4 Subject Experiment
In order to see the efficacy of a trained difficulty predictor in the context of AQG+CAT
testing, we conducted a subject experiment. The questions used in the experiment are
automatically generated from online news articles and are administered by a simple
algorithm based on the predicted difficulty values.
The entire evaluation was conducted over the Internet. The subjects were recruited
through the department’s email list and volunteered. The participants were instructed by
email, tried out the CAT system in their web browsers, and answered the post-task questionnaire by email.
Twenty students responded to our call. They are master’s and doctoral students
majoring in information studies (with either a literature or a science background). Their
first languages are Japanese (12), Chinese (6), and other languages (2).
4.1 Automatic Question Generation
With our in-house AQG system, we generated grammar and vocabulary questions from
articles on several online news websites: NHK (Japan), BBC (U.K.), and DongA
(Korea). The AQG method we employed was Coniam’s frequency-based method
[13] for vocabulary distractors and hand-written patterns for grammar distractors. Table 7 (in Appendix) shows samples of the questions used in the experiment, along
with their predicted difficulty values (first column) and correct response rates (second column). In this subject experiment, a set of automatically generated questions
was labeled with a classifier trained as described above, then administered
to the subjects.9 For more information on our AQG method, see [14].
4.2 Administration algorithm
Assessing a testee with a CAT system is done in a way similar to finding their position
on a number line. The system starts with a question of mean difficulty, then it
moves a pointer (representing the participant’s position) up or down depending on
the participant’s response. The width of the jump is reduced as more questions are
administered, following an inverse logarithmic function defined below:

C(top − bottom)/log(n + 1)

where top is the maximum and bottom the minimum difficulty value in the question
pool, which, at the beginning of the evaluation, contained 3,000 automatically generated
questions. The value n is the number of questions attempted so far. The constant
C was set by simulation experiments. In this experiment, our system excludes
sentences that have previously been exposed to the participant.
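A sketch of this administration loop, assuming a hypothetical `responses_from` callback for the testee's answers and an illustrative value for the constant C (the paper sets C by simulation but does not report its value):

```python
import math

C = 0.5  # the paper sets C by simulation; this value is only illustrative

def administer(pool, responses_from, n_questions=50):
    """Sketch of the binary-search-like administration. `pool` maps
    question id -> predicted difficulty; `responses_from(qid)` is a
    hypothetical callback returning True iff the testee answers
    question qid correctly (the paper's web front-end plays this role)."""
    top, bottom = max(pool.values()), min(pool.values())
    pointer = (top + bottom) / 2          # start at the mean difficulty
    seen = set()
    for n in range(1, n_questions + 1):
        # administer the unseen question nearest to the pointer
        qid = min((q for q in pool if q not in seen),
                  key=lambda q: abs(pool[q] - pointer))
        seen.add(qid)
        # jump width shrinks as an inverse logarithmic function of n
        step = C * (top - bottom) / math.log(n + 1)
        pointer += step if responses_from(qid) else -step
    return pointer
```

Note that the update starts at n = 1 (the first attempted question), since log(1) would make the step undefined.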
4.3 Experiment results
We conducted a three-session experiment, in which each participant took part in two
or all three of the sessions. The number of participants was 12 in the first, 15 in the second,
and 17 in the third session. Fifty questions were administered in each session; two
sessions used random administration and one session used adaptive
administration. The basic information on the test results is summarized in Table 3.
9 When applying the SVMlight classifier trained with the aforementioned data, the resulting values on the
test data were extremely skewed; in fact, most of the time the same value was obtained for all test data. This
could be attributed to the difference between the training data and the test data. For example, the sentences
from news articles tend to be longer than the ones in TOEIC MC-FIB questions, so the feature values
(e.g., sentencelength) of the test data range outside those of the training data, making all instances appear
more difficult. We re-ran the training process with the option of preference ranking, with the input training
data (rankings) being all combinations of the “difficult” and “easy” instances. (About the preference
ranking option, see the website of SVMlight, http://svmlight.joachims.org/.) With this setting, the
resulting difficulty values on the test data were distributed much like a normal distribution.
Table 3: Summary of the test results

                                        First session       Second session      Third session
Average correct response rate (stdev.)  0.785 (0.097)       0.741 (0.075)       0.758 (0.079)
highest/lowest                          0.980 / 0.627       0.860 / 0.600       0.920 / 0.660
Average total time                      0:28:33             0:30:45             0:31:13
longest/shortest                        1:10:53 / 0:15:32   0:54:58 / 0:12:41   0:49:51 / 0:11:39
Table 4: Average of the two indices in two groups based on the observed difficulty

            sentence length                 predicted difficulty
average     difficult: 26.14, easy: 24.91   difficult: -1.626, easy: -1.512
variance    difficult: 72.09, easy: 115.89  difficult: 0.326, easy: 0.404
p value     0.6425                          0.2183
Table 5: Correct response rate by part of session
part
1-10 11-20 21-30 31-40 41-50 variance
random1 0.753 0.846 0.741 0.792 0.725 0.00188
random2 0.716 0.799 0.724 0.758 0.696 0.00131
adaptive 0.729 0.724 0.727 0.790 0.788 0.00094
The questions administered to the participants were automatically generated, and hence
not always error-free. Still, the correct response rate of the participants was quite high,
75% on average, with the session highs ranging from 86% to 98%.
4.4 Information gain by difficulty prediction
There were 103 questions that were solved by more than three participants. At first
look, disappointingly, there was no significant correlation between the predicted difficulty
values and the correct response rates. We further examined the distribution of
the correct response rates and sampled the “difficult” and the “easy” questions;
we call this the observed difficulty, as opposed to the predicted difficulty. There were 44
questions that were answered correctly by all of the participants to whom they had been
administered. We labeled those questions “observed easy,” and labeled the questions whose
correct response rate was 0.6 or below “observed difficult.” Table 4 compares
two indices, 1) sentence length (a baseline) and 2) the predicted difficulty value, in
their relation to the observed difficulty.
The results show that both indices differ between the two groups as expected: the
average sentence length is larger, and the predicted difficulty value is smaller, in the
observed “difficult” group. The p-value is smaller for the predicted difficulty, which
means that the predicted difficulty value differentiates the two groups better than
sentence length does.
A weak level of significance (p = 0.2) is observed on the predicted difficulty, despite
the difference between the two sets of questions, as well as a rather diverse subject group.
The predicted difficulty was calculated from the test results of an English school whose
students are mostly Japanese learners. The participants we gathered, on the other hand,
included many international students with various first languages. Also, the difficulty
must have reflected the difference in nature between the professionally
written TOEIC preparation questions and the automatically generated questions. The
former are free of context and generally very well written, while the latter tend to be
context-dependent and, although generated from 10 different patterns, still give an impression
of being pattern-generated. These differences will be incorporated to further improve
the current system. For example, we can add the observed data to the training
data to better tune the difficulty prediction.
4.5 Transition of the correct response rate
Finally, we examined transitional changes in the correct response rate over the three
sessions to see the system’s adaptivity to human users. We split each session
into five parts by order of administration. Table 5 shows the correct response rate
calculated for each part, along with the variances.
First, it is observed that the correct response rate is more stable in the adaptively
administered session. This is generally a good sign for a test, since stability of the correct
response rate can be attributed to the system’s administering questions of similar difficulty, rather than moving to and from the extremities. Second, the adaptive session was the
only one where the performance of the participants rose in the last half of the session10,
which was unexpected, since we had hoped that the correct response rate would fall
as the CAT system administers better-suited and thus more challenging questions to the user.
The rise of the correct response rate can be attributed to the users’ habituation to the
patterns, since the adaptive session was the third session for all participants. Also, it is
known that a CAT system needs fewer questions than conventional pen-and-paper tests
to reach the true values with minimized errors. As Urry reports, only about 20 questions
are necessary in English grammar and vocabulary tests for adult native speakers. In
our data, the adaptive session was the only one where the correct response rate did not
rise on the second split. We speculate that the adaptive administration chose
more difficult questions in response to the users’ high correct response rates,
approaching the true value after 11 to 20 questions, and then drifted toward easier
questions.
5 Conclusions
We have investigated an application of machine learning techniques to the problem
of difficulty prediction for a CAT system. The SVM classifier shows performance
on par with human judges, and the predicted values show some evidence of efficacy,
yielding more information gain than the sentence length index alone and more stable
correct response rates than random administration. Future directions include more investigation
10 Whose difference from the other sessions was significant in a t-test (0.5 < p < 1.0).
of the features, such as the use of syllable counts and other word difficulty measures.
The problem of the difference between subject groups could be alleviated by
re-training and updating the classifier as data from the targeted group are obtained.
Combining this prediction with the standard IRT procedure is also an interesting avenue
towards a more effective assessment system.
References
1. Baker, F.B., Kim, S.H.: Item Response Theory: Parameter Estimation Techniques, Second
Edition (Statistics, a Series of Textbooks and Monographs). Marcel Dekker, Inc., New
York, USA (July 2004)
2. Urry, V.W.: Tailored testing: A successful application of latent trait theory. Journal of
Educational Measurement 14(2) (1977) 181–196
3. Sumita, E., Sugaya, F., Yamamoto, S.: Measuring non-native speakers’ proficiency of
english by using a test with automatically-generated fill-in-the-blank questions. In: Proceedings of the Second Workshop on Building Educational Applications Using Natural
Language Processing, Ann Arbor, Michigan, U.S., Association for Computational Linguistics (June 2005) 61–68
4. Liu, C.L., Wang, C.H., Gao, Z.M., Huang, S.M.: Applications of lexical information for
algorithmically composing multiple-choice cloze items. In: Proceedings of the Second
Workshop on Building Educational Applications Using Natural Language Processing, Ann
Arbor, Michigan, U.S., Association for Computational Linguistics (June 2005) 1–8
5. Brown, J., Frishkoff, G., Eskenazi, M.: Automatic question generation for vocabulary
assessment. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia,
Canada, Association for Computational Linguistics (October 2005) 819–826
6. Chen, C.Y., Liou, H.C., Chang, J.S.: Fast: An automatic generation system for grammar
tests. In: Proceedings of the COLING/ACL on Interactive presentation sessions, Morristown, NJ, U.S., Association for Computational Linguistics (2006) 1–4
7. Lee, J., Seneff, S.: Automatic generation of cloze items for prepositions. In: Proceedings
INTERSPEECH 2007, Antwerp, Belgium (August 2007) 2173–2176
8. Kunichika, H., Urushima, M., Hirashima, T., Takeuchi, A.: A computational method of
complexity of questions on contents of english sentences and its evaluation. In: ICCE 2002:
Proceedings of the International Conference on Computers in Education. (2002) 97–101
9. Segler, T.M.: Investigating the Selection of Example Sentences for Unknown Target Words
in ICALL Reading Texts for L2 German. PhD in Informatics, School of Informatics, University of Edinburgh, Edinburgh, U.K. (2005)
10. Terada, H., Tanaka-Ishii, K.: Sorting texts by relative readability. In: Proceedings of
Empirical Methods on Natural Language Processing (EMNLP) 2008, Honolulu, Hawaii,
U.S., Association for Computational Linguistics (October 2008) 127–133
11. Miyazaki, Y., Norizuki, K.: Developing a computerized readability estimation program
with a web-searching function to match text difficulty with individual learners’ reading
ability. In: Proceedings of WorldCALL 2008, Fukuoka, Japan, CALICO (August 2008)
d–111
Predicting the Difficulty of Multiple-Choice Cloze Questions...
291
12. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor (October 1999)
13. Coniam, D.: A preliminary inquiry into using corpus word frequency data in the automatic generation of English language cloze tests. CALICO Journal 16(2-4) (1997) 15–33
14. Hoshino, A., Huan, L., Nakagawa, H.: A framework for automatic generation of grammar and vocabulary questions. In: Proceedings of WorldCALL 2008, Fukuoka, Japan,
WorldCALL (August 2008)
difficulty  CRR
-1.28       9.7
-1.36       50.0
-2.33       64.1
-2.95       98.5

Table 7: Sample questions with predicted difficulty value and CRR in the subject experiment

CRR  instance (automatically-generated questions)
33.3 He was discharged after the hospital cited no external problems and [ ].
     a. negative CT scan results  b. results scan negative CT  c. results scan negative concussion  d. negative concussion scan results
0.4  The researchers found for the first time that [ ] pylori reduces the risk of a relapse of stomach cancer by two-thirds.
     a. removing helicobacter  b. removing gastritis  c. to remove helicobacter  d. to remove gastritis
1.0  Koumura later told reporters that Rice's response to his question on North Korea was what Japan [ ] expected.
     a. had  b. has  c. was  d. were
1.0  However, disagreement persisted over which side should act first [ ] the fighting.
     a. to stop  b. stop  c. to pull  d. pull

Table 6: Sample questions with different CRR (Correct Response Rates) in (%)

instance (TOEIC preparation questions)
If you would like to learn more about [ ] to use this advanced copy machine, simply call the number on the front of the pamphlet and we will send out one of our representatives.
     a. how  b. which  c. who  d. what
Workers must [ ] the parcels on to a conveyor belt that carries them to the delivery trucks.
     a. load  b. wrap  c. fill  d. enter
Contract negotiations between the union and Pacific Shipping Inc. [ ] in Long Beach after a three-week break.
     a. have resumed  b. has resumed  c. is resumed  d. resumes
The purchasing manager is trying to [ ] a deal with the supplier, which could reduce the total cost of materials significantly.
     a. strike  b. discount  c. place  d. drive
Appendix. Train data and test data
MathNat - Mathematical Text in a
Controlled Natural Language
Muhammad Humayoun and Christophe Raffalli
Laboratory of Mathematics (LAMA)
Université de Savoie, France
{mhuma, raffalli}@univ-savoie.fr?
Abstract. The MathNat1 project aims at being a first step towards
automatic formalisation and verification of textbook mathematical text.
First, we develop a controlled language for mathematics (CLM) which is
a precisely defined subset of English with restricted grammar and dictionary. To make CLM natural and expressive, we support some complex
linguistic features such as anaphoric pronouns and references, rephrasing
of a sentence in multiple ways producing canonical forms and the proper
handling of distributive and collective readings.
Second, we automatically translate CLM into a system independent formal language (MathAbs), with a hope to make MathNat accessible to
any proof checking system. Currently, we translate MathAbs into equivalent first order formulas for verification.
In this paper, we give an overview of MathNat, describe the linguistic
features of CLM, demonstrate its expressive power and validate our work
with a few examples.
Key words: Mathematical Discourse, Informal Proofs, Anaphora, Controlled Language, Formalisation
1 Introduction
Since Euclid, mathematics has been written in a specific scientific language which uses a fragment of a natural language (NL) along with symbolic expressions and notations. This language is structured and semantically well understood by mathematicians, but it is still not precise enough for automatic formalisation. By "not precise enough", we mean:
– Like any natural language text, mathematical text contains complex linguistic features such as anaphoric pronouns and references, rephrasing of
a sentence in multiple ways producing canonical forms, proper handling of
distributive and collective readings, etc.
? This work is funded by "Informatique, Signal, Logiciel Embarqué" (ISLE), Rhône-Alpes, France. http://ksup-gu.grenet.fr/isle/
1 http://www.lama.univ-savoie.fr/~humayoun/phd/mathnat.html
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 293-307
Received 27/11/09
Accepted 16/01/10
Final version 10/03/10
294
Humayoun M., Raffalli C.
– To make text comprehensive and aesthetically elegant, mathematicians tend
to omit obvious details. Such reasoning gaps may be quite easy for a human
to figure out but definitely not trivial for a machine.
In the current state of the art, mathematical texts are sometimes formalised in very precise and accurate systems using specific formalisms, normally based on some particular calculus or logic. Such a formal piece of mathematics does not contain natural language elements at all. Instead, it contains a lot of technical details of the underlying formal system, making it unsuitable for human comprehension. This wide gap between textbook and formal mathematics reduces the usefulness of computer assisted theorem proving in learning, teaching and formalising mathematics.
In this paper we focus on the first difficulty by developing a controlled language with the look and feel of textbook mathematics. We name it CLM (Controlled Language of Mathematics). It is a precisely defined subset of English
with a slightly restricted grammar and lexicon. To make it natural and expressive enough, we support the above mentioned linguistic features. Here are the
three main components of MathNat:
1. The Controlled Language of Mathematics (CLM): translates the NL text into an abstract syntax tree (AST) in two steps:
(a) Sentence Level Grammar: a sentence-level (context-ignoring) attributed grammar with dependent records, which we implement in Grammatical Framework (GF) [11]. We describe it in sections 3 and 4.
(b) Context building: we build the context from the CLM discourse and resolve the above-mentioned linguistic features. Context building is described in section 5 and the linguistic features in section 6.
2. Translation to MathAbs: We automatically translate the AST into a system independent formal language (MathAbs), with a hope to make MathNat
accessible to any proof checking system, described in section 7.
3. Proof checking: Currently, we translate MathAbs into equivalent first order
formulas for automatic verification by automated theorem provers (ATP),
described in section 7. This step is problematic since most ATP cannot verify
even very simple proofs without help from the user. Further, (1) ATP are
very sensitive to such hypotheses or details whose sole purpose is to offer
an explanation to the reader. (2) Proofs in NL never give the exact list of
hypotheses and definitions necessary at each step. This paper does not cover
these problems.
The overall picture of the MathNat project is shown in figure 1.
Fig. 1. MathNat - Overall picture
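Read as a pipeline, the three components compose as follows. The Haskell sketch below is our own schematic rendering of figure 1: every type is an opaque placeholder and every function body is a stub assumed for illustration, not the actual MathNat implementation.

```haskell
-- Schematic MathNat pipeline (our reading of figure 1).
-- All types and bodies are illustrative stubs.

newtype CLMText = CLMText String deriving (Eq, Show)  -- controlled-language input
newtype AST     = AST String     deriving (Eq, Show)  -- sentence-level syntax tree
newtype MathAbs = MathAbs String deriving (Eq, Show)  -- system-independent formal text
newtype FOL     = FOL String     deriving (Eq, Show)  -- first-order formulas for ATP

-- 1. CLM parsing (GF grammar + context building), one AST per sentence
parseCLM :: CLMText -> [AST]
parseCLM (CLMText s) = map AST (lines s)

-- 2. Translation of the ASTs into MathAbs
toMathAbs :: [AST] -> MathAbs
toMathAbs asts = MathAbs (unwords [a | AST a <- asts])

-- 3. Proof obligations handed to automated theorem provers
toFOL :: MathAbs -> [FOL]
toFOL (MathAbs s) = [FOL s]

pipeline :: CLMText -> [FOL]
pipeline = toFOL . toMathAbs . parseCLM
```

The point of the sketch is only the typing discipline: each stage consumes the previous stage's representation, so a different proof checker can be targeted by swapping the last stage.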
Such a controlled language is definitely easier to read than a formal language
used by proof assistants. But is it easier to write? A realistic answer is negative
MathNat - Mathematical Text in a Controlled Natural Language
295
because a writer may go out of the scope of our grammar very quickly. However, an ambitious answer could be positive, because the design of CLM supports incremental extension of the grammar. Further, appropriate tools such as word completion could help the writer to remain within the scope of the grammar.
But even if it fails to give enough freedom to an author for writing mathematics using CLM, we can still consider this work a first step towards an almost complete mathematical language parser, although further work will be needed to extend the coverage considerably, resolve more linguistic features and solve the complexity issues that will certainly arise. With this in mind, the reader should consider this article a "proof of concept".
2 The Language of textbook mathematics
Mathematical texts mainly consist of theorems, lemmas and their proofs along
with some supporting axioms, definitions and remarks. Axioms and definitions
normally consist of a few statements expressing a proposition or a definitional
equality. On the other hand, a proof is a collection of arguments presented to
establish the truth of a proposition. It mainly follows a narrative style and its
structure mostly remains the same for all mathematical domains.
The text in figure 2 is the main example that we will use in this paper to illustrate the possibilities of MathNat. Sentences are numbered for future reference. Here are a few remarks, general for mathematical text, showing that this text is already quite hard to formalise automatically: it mixes NL with symbolic expressions; it uses anaphoric pronouns, e.g. at lines 7 and 12; it uses explicit references, implicit references or both, e.g. at lines 9, 10 and 6 respectively; a lot of keywords are used in the text (e.g. let, suppose that, then, thus, etc.) that are mostly part of specific patterns such as "if proposition then proposition" and "(let | suppose | Thus | . . . ) proposition (because | by | . . . ) proposition"; it uses subordinates, e.g. at line 8, noun adjuncts, e.g. "with no common factor" at line 4, and explicit quantification, e.g. "for every x, if x is even then x + 1 is odd" or "there is an integer x such that x > 0".

1. Definition 1. x is a rational number if it is expressed as p/q, where p and q are integers with q > 0. [. . . ]
2. Theorem 1. Prove that √2 is irrational.
3. Proof. Assume that √2 is a rational number.
4. By the definition of rational numbers, we can assume that √2 = a/b where a and b are non-zero integers with no common factor.
5. Thus, b√2 = a.
6. Squaring both sides yields 2b² = a² (1).
7. a² is even because it is a multiple of 2.
8. So we can write a = 2c, where c is an integer.
9. We get 2b² = (2c)² = 4c² by substituting the value of a into equation 1.
10. Dividing both sides by 2 yields b² = 2c².
11. Thus b is even because 2 is a factor of b².
12. If a and b are even then they have a common factor.
13. It is a contradiction.
14. Therefore, we conclude that √2 is an irrational number.
15. This concludes the proof. □

Fig. 2. A typical math text
3 Sentence level CLM Grammar
GF [11] is a programming language for defining NL grammars that is based on Martin-Löf's dependent type theory [9]. We refer to [5] for further details. In GF,
we completely ignore the context and design CLM as an attributed grammar with dependent records. A GF grammar has two components: abstract syntax and concrete syntax. The abstract syntax defines semantic conditions to form the abstract syntax trees (AST) of a language, with grammatical functions (fun ) making nodes of categories (cat ). A concrete syntax is a set of linguistic objects (strings, inflection tables, records) associated to ASTs, providing rendering and parsing. The process of translating an AST into one of its linguistic objects is called linearization.
Consider a part of our grammar for propositions such as "they are even integers". In figure 3, we define three categories. The function MkProp takes two parameters (a subject and an attribute) and forms a proposition. A subject is formed by the pronouns It or They.

cat Subject; Attribute; Prop;
fun MkProp: Subject -> Attribute -> Prop;
fun MkPronSubj: Pron -> Subject;
fun It: Pron; They: Pron;

Fig. 3. abstract syntax
As shown in figure 4, the MkAttrb function forms an attribute from a list of Property and a Type. Next, we define the functions Even and Integer, of category Property and Type respectively. In the full CLM grammar, we add properties (e.g. positive, odd, distinct, equal, etc.) and types (e.g. rational, natural number, set, proposition, etc.) in a similar fashion. To map this abstract syntax into its concrete syntax, we define a set of linguistic objects corresponding to the above categories and functions.

cat Property ; Type ;
fun MkAttrb : [Property] -> Type -> Attribute;
fun Even    : Property ;
fun Integer : Type ;

Fig. 4. abstract syntax
lincat Property = {s : Str};
lin Even = {s = "even"};
param Number = Sg | Pl ;
lincat Type = {s : Number => Str};
lin Integer = {s = table{
    Sg => "integer";
    Pl => "integers"}};
lincat Pron = {s : Str ; n : Number} ;
lin It   = {s = "it"   ; n = Sg};
lin They = {s = "they" ; n = Pl} ;

Fig. 5. concrete syntax

In figure 5, the first line defines the linearization of the Property category, which is simply a string record. The second line shows this for its function Even. The linearization of category Type is an inflection table (a finite function) from number to string, having one string value for each of singular and plural, as shown in the fourth line. Its function Integer fills this table with the two appropriate values in the next three lines.
The linearization of the pronoun category Pron is a record containing a string and a number. Further, we define the linearization of its functions and record the fact that It is singular and They is plural, which will help us to make number agreement. Similar to category Type, Attribute is also an inflection table. Therefore, in figure 6, we define the linearization of its function MkAttrb accordingly.

lincat Attrb = {s : Number => Str};
lin MkAttrb props type = {s = table {
    Sg => artIndef ++ props.s ++ type.s!Sg;
    Pl =>             props.s ++ type.s!Pl}};

Fig. 6. concrete syntax

For instance, for the singular value, we select the string value
of the category list of Property (props) with (.s). Then, we select the singular string value of category Type with (type.s!Sg). (++) concatenates these two strings with a space between them. artIndef makes agreement of an indefinite article with the first letter of the next word, e.g. producing "an even number" or "a positive number". It is defined in the GF resource library [12], which provides basic grammars for fourteen languages as an API.
Similarly, in figure 7, to form a proposition in MkProp, we select the appropriate string values of the tables be and attribute by number agreement with the subject, and concatenate them with the subject.

lincat Prop = {s : Str};
oper be = {s = table{Sg => "is" ; Pl => "are"}};
lin MkProp subj attrb =
    {s = subj.s ++ be.s!subj.n ++ attrb.s!subj.n};

Fig. 7. concrete syntax

For instance, if we parse propositions such as "they are even integers" and "it is an even integer", we get the following abstract syntax trees:
1. MkProp (MkPronSubj They) (MkAttrb (BaseProperty Even) Integer)
2. MkProp (MkPronSubj It) (MkAttrb (BaseProperty Even) Integer)
Fig. 8. abstract syntax trees
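GF can export such an abstract syntax as Haskell algebraic data types (a facility used later, in section 5). The sketch below is our own illustration of what the fragment of figures 3-7 amounts to, not the generated code: it rebuilds the same number agreement for the two trees of figure 8, and the artIndef approximation is a naive assumption made for the example.

```haskell
-- Hypothetical Haskell mirror of the GF fragment in figures 3-7.
-- The real GF export differs in detail; this only illustrates linearization.

data Number'   = Sg | Pl deriving (Eq, Show)
data Pron      = It | They deriving Show
data Subject   = MkPronSubj Pron deriving Show
data Property  = Even deriving Show
data Type'     = Integer' deriving Show
data Attribute = MkAttrb [Property] Type' deriving Show
data Prop      = MkProp Subject Attribute deriving Show

pronNum :: Pron -> Number'
pronNum It   = Sg
pronNum They = Pl

linProperty :: Property -> String
linProperty Even = "even"

linType :: Type' -> Number' -> String
linType Integer' Sg = "integer"
linType Integer' Pl = "integers"

-- artIndef of the GF resource library, naively approximated here
artIndef :: String -> String
artIndef w | head w `elem` "aeiou" = "an " ++ w
           | otherwise             = "a " ++ w

linAttribute :: Attribute -> Number' -> String
linAttribute (MkAttrb props t) Sg =
  artIndef (unwords (map linProperty props ++ [linType t Sg]))
linAttribute (MkAttrb props t) Pl =
  unwords (map linProperty props ++ [linType t Pl])

linProp :: Prop -> String
linProp (MkProp s@(MkPronSubj p) attrb) =
  unwords [linSubject s, be, linAttribute attrb n]
  where n  = pronNum p
        be = case n of Sg -> "is"; Pl -> "are"
        linSubject (MkPronSubj It)   = "it"
        linSubject (MkPronSubj They) = "they"
```

Linearizing the two trees of figure 8 with linProp then recovers "they are even integers" and "it is an even integer", the number agreement coming from the subject's pronoun as in figure 7.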
4 Synopsis of CLM Grammar
As a whole, a CLM text is a collection of axioms, definitions, propositions (theorems, lemmas, . . . ) and proofs structured by specific keywords (Axiom, Definition, Theorem and Proof). The text following these keywords is a list of sentences obeying different GF grammars (one for axioms, one for definitions, . . . ). These grammars are not independent and share a lot of common rules.
We present in this section a short but incomplete synopsis of the grammar for sentences allowed in theorems and proofs. We describe it in an abstract way with some examples, and it obeys the following conventions: [text] means that text is optional, (text1|text2) means that both text1 and text2 are possible, dots (. . . ) mean that only a few constructions are given due to space limitations, and each dash (–) represents a pattern in the grammar. We start this synopsis by extending the grammar of our running example:
1. Exp is a symbolic expression (equations not allowed). It is encapsulated by quotes to distinguish it from the natural language parts of the grammar. Defining a formal grammar for symbolic expressions and equations in GF is definitely possible, but as GF is specially designed for NL grammars, in the past this has caused serious efficiency overhead for parsing because of CLM's size. Therefore we define a Labelled BNF grammar in the bnfc tool [4]. Examples of Exp are √2 in lines 2-3, x in line 1, and a² and 2 in line 7 of figure 2.
2. Exps is a list of Exp. e.g. a, b in line 12 of figure 2, etc.
3. Subject, as partially described before, is (Exps | anaphoric pronouns | . . . )
4. Attribute, as partially described before, is (
Quantity, list of Property and Type. e.g. two positive even integers. three
irrational numbers, etc | Property e.g. positive, even, distinct, equal, etc |
Quantity, Relation and Exp. e.g. an element of y. two elements of z, etc |. . . )
5. Prop of proposition is ( Positive and negative statements formed by Subject
and Attribute e.g. at line 2, 3, etc | Statements formed by Subject, Quantity
and Relation. e.g. x, y and z have a common factor, they have two common
multiples, etc | Statements containing existential and universal quantification.
| If then statements. | Disjunction statements. | . . . )
6. Eq is a symbolic equation, encapsulated by quotes. As described for Exp, it is
also defined as a Labelled BNF grammar. e.g. in line 4, 5, 6, 8, 9, 10, etc
7. EqOrProp is (Eq | Prop)
8. DefRef is ( Property and Type. e.g the definition of even numbers, etc
| Property e.g. the definition of evenness, etc )
9. Operation is ( Relation e.g factoring both sides, taking square at both sides,
etc | Relation and Exp e.g. dividing both sides by x, multiplying the equation
by 2, etc | . . . )
10. Justif is (Eq | Prop | DefRef | Operation)
Sentences for theorems can be described as follows:
11. “Prove statement” with an optional subordinate
– [(show|prove) [that]] Eq [holds] [(where|when|if|with a condition that|. . . )
EqOrProp]
– [(show|prove) that] Prop [(where|when|if|with a condition that|. . . ) EqOrProp]
e.g. “Prove that x + y = x holds if y = 0”, “If x is even then x + 1 is odd with
a condition that x > 0”, line 2 of figure 2, etc.
The main difference between these two patterns is "holds", which is optional for statements containing an equation but does not appear in statements containing propositions. This applies to all CLM statements. So, in the following statements, we present these two patterns as one, using EqOrProp.
12. Assumption with an optional subordinate
– (let |[we] suppose [that]|we can write [that]|. . . ) EqOrProp [holds] [(where|. . . )
EqOrProp] e.g. “we suppose that x + y > 0”, line 3 of figure 2, etc
Note: with the example of section 3, we can infer a statement such as “let x is
an even number”, which is grammatically incorrect. In fact, the actual grammar
defined for propositions is a bit more complicated than what is given in section 3.
In the actual grammar, attribute and proposition are inflection tables with two values: one for let statements ("be an even ..."), and a second for the rest ("is/are ...").
13. Sentences that cannot be the first sentence in a theorem and proof
(then|thus|so|therefore), followed by any statement. e.g. line 5, 11 of figure 2, etc
Note: In theorems, statements of forms 12 and 13 are often stated to help the reader as a starting point for the proof.
Proof statements could be described as follows:
14. Shall-Prove statement with an optional subordinate
– we [(will | shall |have to)] prove [that] EqOrProp [holds] [(where|. . . ) EqOrProp]
Shall-Prove pattern is almost the same as 11, but in proofs, with different keywords, we distinguish goals from deductions.
15. Assumption with an optional subordinate
Same as 12. e.g. line 8 of figure 2 (if we remove “So” from the sentence), etc
16. Assumption with a justification and an optional subordinate
– ([we] assume [that]|. . . ) EqOrProp (since |because |by |. . . ) Justif [(where |. . . )
EqOrProp]
– (since |because |by |. . . ) Justif ([we] assume [that]|. . . ) EqOrProp [(where |. . . )
EqOrProp] e.g. line 4 of figure 2, etc
These patterns are a rough estimate of the actual coverage; it is possible to infer a statement which is grammatically incorrect, e.g. "assume x + y = z because squaring both sides". In the actual grammar, we have six patterns for 16 that ensure grammatical correctness. In doing so, Justif is not just one category. Instead, it is formed from some of these: (Eq |Prop |DefRef |Operation). This applies to all patterns of the grammar.
17. Deduction with an optional subordinate
– [(we (get |conclude |deduce |write |. . . ) [that])] EqOrProp [holds] [(where|. . . )
EqOrProp]
e.g. “we conclude that there is an even number x such that x > 2”, line 5, 12
and 14 (if we remove “Therefore,”) of figure 2, etc
18. Deduction with a justification and an optional subordinate
– (we get |. . . ) EqOrProp (because |. . . ) Justif [(where |. . . ) EqOrProp]
– (because |. . . ) Justif (we get |. . . ) EqOrProp [(where |. . . ) EqOrProp]
– Operation (yields | shows | establishes |. . . ) EqOrProp [(where |. . . ) EqOrProp]
– (because |. . . ) Justif, EqOrProp [where EqOrProp]
e.g. line 6, 7, 9, 10, 11 (if we remove “Thus”) of figure 2, etc
19. Proof by case (nested cases are allowed)
– we proceed by case analysis
  case: condition . . . [this ends the case.]
  case: condition . . . [this ends the case.]
  . . . [it was the last case.]
– if condition then . . .
  otherwise if condition then . . .
  . . .
  otherwise . . .

5 Discourse Building
GF can compile the grammar into code usable in other general purpose programming languages. For instance, in Haskell, rendering functions linearize and
parse are made available as ordinary functions. It translates abstract syntax into
algebraic data types forming objects of type AST.
When the math text is parsed by the CLM parser, a list of sentence-level ASTs is produced. We recognise each AST by pattern matching on algebraic data types and build the context from the CLM discourse. For an AST, we record every occurrence of symbolic expressions, equations, pronouns and references, as shown in Table 1. From Table 1, it may seem that we keep a placeholder for each anaphoric pronoun that appears in the text. In fact, they are resolved immediately as they appear, with the algorithm described in section 6. Context building and the translation of the math text into MathAbs are interleaved, as shown in the following procedure, so both are performed in the same step. The MathAbs translation itself is described in section 7. The context for the theorem and proof in figure 2 is given in Tables 1, 2 and 3.
Notations: S denotes an arbitrary AST (abstract syntax tree).
Context is 3-tuple (CV, ST, CE) where:
CV is a list of 4-tuple (sentence number Sn , list of symbolic objects (obj1 , ..., objn ),
type, number) as shown in table 1.
ST is a list of 3-tuple (Sn , logical formula, type) as shown in table 2.
CE is a list of 3-tuple (Sn , equation, reference) as shown in table 3.
Procedure: we start the analysis with an empty Context, which evolves while examining the AST of each sentence. The following procedure is repeated until we reach the end of the text. Just a few cases are mentioned here due to space limitations.
//an assumption or deduction containing an equation. e.g. line 4,8 or 5,6,9 respectively
If S matches (Assume Eq) or (Deduce Eq)
mathabs eq := translate Eq into MathAbs’s formula
left eq := left side of mathabs eq*
look up the latest (Sn , obj, type, Sg) of CV with left eq = obj
t := if such a (Sn , obj, type, Sg) is found then type else NoType
add (Sn , left eq, t, Sg) in CV
add (Sn , mathabs eq, reference if provided in Eq) in CE
add (Sn , mathabs eq, st type(S)) in ST
st type(S) := if S matches (Assume ...) then Hypothesis //common to all
else if S matches (Deduce ...) then Deduction else Goal
*The choice of taking the left side of an equation is a bit naive. But there is no established algorithm to decide which identifier an anaphoric pronoun refers to. e.g. In “[...]
we have ∀x (f (x) = p). Thus, it is a constant function” pronoun “it” refers to f , which
is not at all trivial even for a human sometimes. So we always take the left side as a
convention, which in fact makes sense for many equations.
//an assumption or deduction containing Exps e.g. line 2,3 or line 7,11 respectively
If S matches (Assume (MkProp ... Exps)) or (Deduce (MkProp ... Exps))
1. Statements** such as “we (assume|conclude) that x is a positive integer”
type := given in S
number := if (length Exps)=1 then Sg else Pl
MathNat - Mathematical Text in a Controlled Natural Language
301
add (Sn , Exps***, type, number) in CV
mathabs exp := translate S into MathAbs’s formula
add (Sn , mathabs exp, st type(S)) in ST
**We mention only one case due to space limitation.
***The convention of taking whole expression is also a bit naive. But again, there is
no established algorithm to decide which identifier an anaphoric pronoun refers to. e.g.
“[. . . ] because a|p (a divides p), it is a prime or 1” Here pronoun “it” refers to a. But
in statement “[. . . ] 2|b. Thus it is even” pronoun it refers to b.
//a prove statement containing an equation. e.g. show that 2x + 2y = 2(x + y)
If S matches (Prove Eq)
mathabs eq := translate Eq into MathAbs’s formula
left eq := left side of mathabs eq
add (Sn , left eq , NoType, Sg) in CV
add (Sn , mathabs eq, reference if provided in Eq) in CE
vars := list of all variables appeared in mathabs formula
not decl vars := all variables of vars that not found in MathAbs context
//e.g. 2x + 2y = 2(x + y) translated as ∀x,y (2x + 2y = 2(x + y)) if x, y not found
mathabs formula := ∀not decl vars (mathabs eq)
add (Sn , mathabs formula, st type) in ST
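The equation case of this procedure can be sketched in Haskell, the host language mentioned above. The types below are simplified stand-ins assumed for the example: formulas are kept as plain strings, and the MathAbs translation is taken as already done.

```haskell
-- Simplified sketch of the Context (CV, ST, CE) from section 5.
-- Types and field layouts are illustrative, not the actual MathNat code.
-- Newest entries come first in every list.

data StmntType = Hypothesis | Deduction | Goal deriving (Eq, Show)
data Num'      = Sg | Pl deriving (Eq, Show)

type Sn      = Int       -- sentence number
type Obj     = String    -- symbolic object, e.g. "2b^2"
type Formula = String    -- MathAbs formula, kept as text here

type CV = [(Sn, [Obj], String, Num')]  -- objects with type and number
type ST = [(Sn, Formula, StmntType)]   -- logical formulas
type CE = [(Sn, Formula, Maybe Int)]   -- equations with optional reference

type Context = (CV, ST, CE)

-- The (Assume Eq) / (Deduce Eq) case: record the left-hand side in CV,
-- the equation in CE, and the formula in ST.
addEquation :: Sn -> Formula -> Maybe Int -> StmntType -> Context -> Context
addEquation n eq ref sty (cv, st, ce) = (cv', st', ce')
  where lhs = takeWhile (/= '=') eq        -- naive left side, as in the text
        -- reuse the type of the latest singular entry for the same object
        t   = case [ty | (_, [o], ty, Sg) <- cv, o == lhs] of
                (ty:_) -> ty
                []     -> "NoType"
        cv' = (n, [lhs], t, Sg) : cv
        ce' = (n, eq, ref) : ce
        st' = (n, eq, sty) : st
```

Processing line 6 of figure 2 with addEquation 6 "2b^2=a^2" (Just 1) Deduction reproduces the corresponding rows of Tables 1, 2 and 3: "2b^2" lands in CV with NoType, and the equation is stored in CE under reference 1.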
6 Linguistic features
Recall that (CV, ST, CE) is the Context defined in section 5. Also recall that in the course of context building, we resolve all kinds of anaphora immediately as they appear, with the following algorithm.
6.1 A Naive Algorithm for Anaphoric Pronouns
If S matches (. . . (. . . It) . . . ), i.e. a statement containing the pronoun "it": we replace this pronoun with the latest obj1 of (Sn , obj1 , Sg, type) from CV. e.g. line 7 of figure 2 is interpreted as "a² is even because a² is a multiple of 2".
If S matches (. . . (. . . They) . . . )
1. If no quantity (e.g. two, three, . . . ) is mentioned in S, we replace pronoun
“they” with the latest (obj1 , ..., objn ) of (Sn , (obj1 , ..., objn ), Pl, type) from
CV. e.g. line 12 of figure 2 is interpreted as “If a and b are even then a and
b have a common factor”
2. Otherwise, if there is a quantity Q mentioned in S then we replace this
pronoun with the latest (obj1 , ..., objn ) of (Sn , (obj1 , ..., objn ), Pl, type) when
Q=length(obj1 , ..., objn ).
e.g. the last statement from “suppose that x + y + z > 0.[. . . ] assume that a
and b are positive numbers.[. . . ] they are three even integers” is interpreted
as “[. . . ] x, y and z are three even integers”
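The algorithm above is essentially a most-recent-first lookup in CV, filtered by number (and, for "they", by an optional quantity). A minimal Haskell sketch, over a simplified CV layout assumed for the example (newest entries first):

```haskell
-- Naive anaphora resolution of section 6.1, over a simplified CV.
-- Field layout is an assumption for the example; newest entries first.

data Num'  = Sg | Pl deriving (Eq, Show)
type Entry = (Int, [String], String, Num')  -- (Sn, objects, type, number)

-- "it": the latest singular object
resolveIt :: [Entry] -> Maybe String
resolveIt cv = case [o | (_, [o], _, Sg) <- cv] of
                 (o:_) -> Just o
                 []    -> Nothing

-- "they": the latest plural entry, optionally of a given quantity
resolveThey :: Maybe Int -> [Entry] -> Maybe [String]
resolveThey q cv =
  case [os | (_, os, _, Pl) <- cv, maybe True (== length os) q] of
    (os:_) -> Just os
    []     -> Nothing
```

On the context of line 7 of figure 2, resolveIt picks the latest singular object a², matching the interpretation given above; resolveThey (Just 3) skips a two-element entry and finds x, y and z, as in the last example.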
6.2 Anaphora of Demonstrative Pronouns
The <type, number> pair of (Sn , (obj1 , ..., objn ), number, type) from CV allows us to solve anaphora for the demonstrative pronouns "this" and "these". e.g. "these integers . . . " is replaced by the latest (obj1 , ..., objn ) with number=Pl ∧ type=Integer. For the moment we only deal with pronouns referring to expressions. Pronouns referring to propositions and equations are left as future work.
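The same most-recent-first lookup, now keyed on the <type, number> pair, resolves phrases like "these integers". Again a minimal sketch over a simplified CV assumed for the example:

```haskell
-- Demonstrative-pronoun resolution of section 6.2.
-- Simplified CV layout assumed for the example; newest entries first.

data Num'  = Sg | Pl deriving (Eq, Show)
type Entry = (Int, [String], String, Num')  -- (Sn, objects, type, number)

-- "these <type>": the latest plural entry whose recorded type matches
resolveThese :: String -> [Entry] -> Maybe [String]
resolveThese ty cv =
  case [os | (_, os, t, Pl) <- cv, t == ty] of
    (os:_) -> Just os
    []     -> Nothing
```

With a CV whose latest plural Integer entry is (a, b), resolveThese "Integer" returns those objects and skips plural entries of other types, exactly the number=Pl ∧ type=Integer condition above.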
Table 1. CV: Symbolic objects

Sn  Objects Type       Number
2.  √2      NoType     Sg
3.  √2      Rational   Sg
4.  √2      Rational   Sg
4.  a, b    Integer    Pl,2
5.  b√2     NoType     Sg
6.  2b²     NoType     Sg
7.  a²      NoType     Sg
7.  It      ?          Sg
7.  2       NoType     Sg
8.  a       Integer    Sg
8.  c       Integer    Sg
9.  2b²     NoType     Sg
9.  a       Integer    Sg
10. 2       NoType     Sg
10. b²      NoType     Sg
11. b       Integer    Sg
11. 2       NoType     Sg
11. b²      NoType     Sg
12. a, b    Integer    Pl,2
12. They    ?          Pl,?
14. √2      Irrational Sg
Table 2. ST: Logical formulas

Sn  Logical formula                               Stmnt type
2.  √2 ∉ Q                                        Goal
3.  √2 ∈ Q                                        Hypothesis
4.  a, b ∈ Z ∧ positive(a) ∧ positive(b) ∧
    no_cmn_factor(a, b) ∧ √2 = a/b                Hypothesis
5.  b√2 = a                                       Deduction
6.  2b² = a²                                      Deduction
7.  even(a²)                                      Deduction
8.  c ∈ Z ∧ a = 2c                                Hypothesis
9.  2b² = (2c)² = 4c²                             Deduction
10. b² = 2c²                                      Deduction
11. even(b)                                       Deduction
12. even(a) ∧ even(b) ⇒ one_cmn_factor(a, b)      Deduction
13. False                                         Deduction
14. √2 ∉ Q                                        Deduction
Table 3. CE: Equations

Sn  Equation            Reference
4.  √2 = a/b            NoRef
5.  b√2 = a             NoRef
6.  2b² = a²            1
8.  a = 2c              NoRef
9.  2b² = (2c)² = 4c²   NoRef
10. b² = 2c²            NoRef

6.3 Solving References
1. Explicit reference to an equation, as appears at line 9. In the current setting of the context, it is trivial to solve such anaphora because each reference is preserved. e.g. line 9 is interpreted as "We get 2b² = (2c)² = 4c² by substituting the value of a into 2b² = a²".
2. Implicit reference to an equation, as appears at lines 6 and 10. At line 10, dividing both sides by 2 implies that there is an equation in some previous sentence. So we check this condition and, if an equation is found in CE, we put it in place and interpret the sentence as "Dividing 2b² = (2c)² = 4c² by 2 at both sides yields b² = 2c²".
3. Reference to an equation that is mentioned far from the current sentence, e.g. "dividing the (last|first) equation by 2". It is also quite trivial to solve because we look up the last or first element of CE.
4. Reference to a hypothesis, deduction or statement. e.g. “we deduce that
x + y = 2(a + b) by the (last|first) (statement|hypothesis|deduction)”.
For instance, for a statement containing “by the last hypothesis”, we simply
pick the latest formula of (Sn , formula, type) from ST when type=Hypothesis.
However, for a statement containing “by the last statement”, we pick the
latest formula of (Sn , formula, type) when type=(Hypothesis or Deduction).
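All four reference cases reduce to simple lookups in CE and ST. A minimal Haskell sketch, over the same simplified, newest-first representations assumed earlier:

```haskell
-- Reference resolution of section 6.3 over simplified CE and ST.
-- Layouts are assumptions for the example; newest entries first.

data StmntType = Hypothesis | Deduction | Goal deriving (Eq, Show)
type CE = [(Int, String, Maybe Int)]  -- (Sn, equation, reference label)
type ST = [(Int, String, StmntType)]  -- (Sn, formula, statement type)

-- case 1: explicit reference, e.g. "equation 1"
byLabel :: Int -> CE -> Maybe String
byLabel r ce = case [eq | (_, eq, Just r') <- ce, r' == r] of
                 (eq:_) -> Just eq
                 []     -> Nothing

-- cases 2 and 3: implicit reference / "the last equation"
lastEquation :: CE -> Maybe String
lastEquation ce = case ce of ((_, eq, _):_) -> Just eq; [] -> Nothing

-- case 4: "by the last hypothesis" / "by the last statement"
lastOf :: [StmntType] -> ST -> Maybe String
lastOf tys st = case [f | (_, f, t) <- st, t `elem` tys] of
                  (f:_) -> Just f
                  []    -> Nothing
```

With the CE of Table 3, byLabel 1 recovers 2b² = a² for line 9's "equation 1", and lastOf [Hypothesis, Deduction] implements "by the last statement" as described above.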
6.4 Distributive vs. Collective Readings
As in any natural language text, distributive and collective readings are common
in math text, and we deal with them in CLM appropriately. For example, the
statements "x, y are positive" and "x, y are equal" are distributive and
collective, respectively. Some collective readings could be ambiguous. Consider
the statement "a, b, c are distinct". Are the variables a, b, c pair-wise
distinct, or only some of them? Currently we ignore such ambiguity and always
translate such properties as 2-arity (pairwise) predicates. Further, because
collective readings require their subjects to be plural, statements such as
"x is equal" are not allowed.
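The two readings can be sketched as follows; the property lists and the `translate` helper are illustrative assumptions, not part of the CLM grammar itself:

```python
from itertools import combinations

DISTRIBUTIVE = {"positive", "even"}   # the property applies to each subject
COLLECTIVE = {"equal", "distinct"}    # the property relates the subjects

def translate(subjects, prop):
    """Translate "x, y are P" into a list of atomic predicates."""
    if prop in DISTRIBUTIVE:
        return [f"{prop}({v})" for v in subjects]
    if prop in COLLECTIVE:
        if len(subjects) < 2:
            raise ValueError("a collective reading needs a plural subject")
        # CLM always reads collectives as pairwise 2-arity predicates
        return [f"{prop}({a}, {b})" for a, b in combinations(subjects, 2)]
    raise ValueError(f"unknown property: {prop}")

assert translate(["x", "y"], "positive") == ["positive(x)", "positive(y)"]
assert translate(["a", "b", "c"], "distinct") == [
    "distinct(a, b)", "distinct(a, c)", "distinct(b, c)"]
```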
7 MathAbs and proof checking
The Haskell code supporting the above-mentioned linguistic features does more.
It translates the AST given by GF to an abstract mathematical language
(MathAbs). This is a system-independent formal language that, we hope, will
make MathNat accessible to any proof checking system.
MathAbs (formally, new command) was designed as an intermediate language
between natural language proofs and the proof assistant PhoX [10] in the
DemoNat project [14][15]. Since new command includes some of the standard
commands of PhoX, we adapted it to MathNat with some minor changes.
MathAbs can represent theorems and their proofs along with supporting axioms
and definitions. On a macro level, a MathAbs document is a sequence of
definitions, axioms and theorems with their proofs. However, analogous to
informal mathematical text, the most significant part of MathAbs is the
language of proof; the definitions and statements of propositions are only a
fraction of the text compared to the proofs.
A proof is described as a tree of logical (meta) rules. Intuitively, at each
step of a proof there is an implicit active sequent, with some hypotheses and
one conclusion, which is being proved, and some other sequents to prove later.
The text in NL explains how the active sequent is modified to progress in the
proof and gives some justifications (hints). Similarly, a theorem forms the
initial sequent with some hypotheses and one goal, which is then passed to the
proof. Axioms and definitions also contribute to the initial sequent, being
added as hypotheses.
304
Humayoun M., Raffalli C.
The basic commands of MathAbs are let to introduce a new variable in the
sequent, assume to introduce a new hypothesis, show to change the conclusion
of the sequent, trivial to end the proof of the active sequent, and
{ . . . ; . . . } to branch in the proof and state that the active sequent
should be replaced by two or more sequents. deduce A . . . is syntactic sugar
for { show A trivial ; assume A . . . }. We translate the math text of
figure 2 into MathAbs as shown below:
1. Definition. r ∈ Q ⇔ ∃p,q∈Z (r = p/q ∧ q > 0)
2. Theorem. show √2 ∉ Q
3. Proof. assume √2 ∈ Q
4. let a, b ∈ Z assume √2 = a/b assume positive(a) ∧ positive(b)
   ∧ no_cmn_factor(a, b) by def rational_Number
5. deduce b√2 = a
6. deduce 2b² = a² by oper squaring_both_sides(b√2 = a)
7. deduce multiple_of(a², 2) deduce even(a²) by form multiple_of(a², 2)
8. let c ∈ Z assume a = 2c
9. deduce 2b² = (2c)² = 4c² by oper substitution(a, 2b² = a²)
10. deduce b² = 2c² by oper division(2, 2b² = (2c)² = 4c²)
11. deduce factor_of(2, b²) deduce even(b) by form factor_of(2, b²)
12. deduce (even(a) ∧ even(b)) ⇒ one_cmn_factor(a, b)
13. show ⊥ trivial
Remarks:
– At line 7, we first deduce the justification, i.e. multiple_of(a², 2), and
then deduce the whole statement. The same applies to line 11.
– However, the above rule does not apply to definitional references and
operations, as shown in lines 4, 6, 9 and 10.
– We can safely ignore lines 14 and 15 of figure 2 because the proof tree is
already finished at line 13.
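The desugaring of deduce described above can be sketched as follows; the tuple encoding of MathAbs commands is our own illustrative assumption, not the actual Haskell data type:

```python
# Hypothetical encoding of MathAbs commands as nested tuples.
def deduce(formula, rest):
    """deduce A ... is sugar for { show A trivial ; assume A ... }."""
    return ("branch", [("show", formula, ("trivial",)),
                       ("assume", formula, rest)])

cmd = deduce("even(a^2)", ("trivial",))
assert cmd[0] == "branch"
assert cmd[1][0] == ("show", "even(a^2)", ("trivial",))
assert cmd[1][1] == ("assume", "even(a^2)", ("trivial",))
```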
We can represent the above MathAbs proof as a proof tree using arbitrary rules
(not just the rules of natural deduction). Then, for each rule we can produce
a formula that justifies it. Lines 2-5 of the above MathAbs can be read as the
following proof tree (read bottom-up):

                              ⋮
    Γ₂ ⊢ b√2 = a        Γ₃ ≡ (Γ₂, b√2 = a) ⊢ √2 ∉ Q
    ------------------------------------------------------------------
    Γ₂ ≡ (Γ₁, a, b ∈ Z, √2 = a/b, positive(a), positive(b), no_cmn_factor(a, b)) ⊢ √2 ∉ Q
    ------------------------------------------------------------------
    (Γ₁ ≡ Γ, √2 ∈ Q) ⊢ √2 ∉ Q
    ------------------------------------------------------------------
    Γ ⊢ √2 ∉ Q

Fig. 9. MathAbs proof as a proof tree. Γ is a context which contains the useful
definitions and axioms needed to validate this proof.
As a first prototype we implemented a translation from MathAbs to first-order
formulas for validation. The above MathAbs is translated as follows (one
formula for each rule):
3. ⊢ (rational(√2) ⇒ irrational(√2)) ⇒ irrational(√2)

4. Γ₁ ⊢ (∀a,b ((int(a) ∧ int(b) ∧ √2 = a/b ∧ a, b > 0 ∧ gcd(a, b) = 1) ⇒
   irrational(√2))) ⇒ irrational(√2)
   where Γ₁ ≡ rational(√2)

5. Γ₂ ⊢ (b√2 = a ∧ (b√2 = a ⇒ irrational(√2))) ⇒ irrational(√2)*
   Γ₂ ⊢ b√2 = a
   where Γ₂ ≡ Γ₁, ∀a,b (int(a) ∧ int(b) ∧ √2 = a/b ∧ a, b > 0 ∧ gcd(a, b) = 1)

6. Γ₃ ⊢ (2b² = a² ∧ (2b² = a² ⇒ irrational(√2))) ⇒ irrational(√2)*
   Γ₃ ⊢ 2b² = a²
   where Γ₃ ≡ Γ₂, b√2 = a

7. Γ₄ ⊢ (multiple_of(a², 2) ∧ (multiple_of(a², 2) ⇒ irrational(√2))) ⇒
   irrational(√2)*
   Γ₄ ⊢ multiple_of(a², 2)
   where Γ₄ ≡ Γ₃, 2b² = a²
   Γ₅ ⊢ (even(a²) ∧ (even(a²) ⇒ irrational(√2))) ⇒ irrational(√2)*
   Γ₅ ⊢ even(a²)
   where Γ₅ ≡ Γ₄, multiple_of(a², 2)

8. Γ₆ ⊢ (∀c ((int(c) ∧ a = 2c) ⇒ irrational(√2))) ⇒ irrational(√2)
   where Γ₆ ≡ Γ₅, even(a²)

9. Γ₇ ⊢ (2b² = (2c)² = 4c² ∧ (2b² = (2c)² = 4c² ⇒ irrational(√2))) ⇒ irrational(√2)*
   Γ₇ ⊢ 2b² = (2c)² = 4c²
   where Γ₇ ≡ Γ₆, int(c) ∧ a = 2c

10. Γ₈ ⊢ (b² = 2c² ∧ (b² = 2c² ⇒ irrational(√2))) ⇒ irrational(√2)*
    Γ₈ ⊢ b² = 2c²
    where Γ₈ ≡ Γ₇, 2b² = (2c)² = 4c²

11. Γ₉ ⊢ (factor_of(2, b²) ∧ (factor_of(2, b²) ⇒ irrational(√2))) ⇒ irrational(√2)*
    Γ₉ ⊢ factor_of(2, b²)
    where Γ₉ ≡ Γ₈, b² = 2c²
    Γ₁₀ ⊢ (even(b) ∧ (even(b) ⇒ irrational(√2))) ⇒ irrational(√2)*
    Γ₁₀ ⊢ even(b)
    where Γ₁₀ ≡ Γ₉, factor_of(2, b²)

12. Γ₁₁ ⊢ ((even(a) ∧ even(b) ⇒ one_cmn_factor(a, b)) ∧ ((even(a) ∧ even(b) ⇒
    one_cmn_factor(a, b)) ⇒ irrational(√2))) ⇒ irrational(√2)*
    Γ₁₁ ⊢ even(a) ∧ even(b) ⇒ one_cmn_factor(a, b)
    where Γ₁₁ ≡ Γ₁₀, even(b)

13. Γ₁₂ ⊢ ⊥ ⇒ irrational(√2)
    Γ₁₂ ⊢ ⊥
    where Γ₁₂ ≡ Γ₁₁, even(a) ∧ even(b) ⇒ one_cmn_factor(a, b)
Remarks:
– Since deduce A is syntactic sugar for { show A trivial ; assume A . . . },
it produces many tautologies of the form (A ∧ (A ⇒ B)) ⇒ B in the first-order
formulas, where B is the main goal to prove. They are marked with * above.
– In a proof, a sentence such as "proof by contradiction" adds a MathAbs
command show ⊥ that replaces the conclusion irrational(√2) by ⊥.
– The justifications such as "by def rational_Number" that are preserved in
MathAbs were removed from the first-order translation, because most automated
theorem provers are unable to use such justifications.
– In the above first-order formulas, types are treated as predicates. The
notion of types used in the grammar is linguistic and does not exactly
correspond to the notion of type in a given type theory. This is one of the
main problems if we want to translate CLM to a typed framework such as Coq² or
Agda³.
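The tautology scheme produced by deduce, mentioned in the remarks above, can be sketched as follows; the string rendering and the `deduce_obligations` helper are our own illustrative assumptions, not the actual translation code:

```python
def deduce_obligations(hypotheses, formula, goal):
    """For a 'deduce A' step under context Gamma, emit the two first-order
    obligations: the tautology Gamma |- (A & (A => goal)) => goal (marked *
    in the text) and the real obligation Gamma |- A."""
    gamma = " , ".join(hypotheses)
    tautology = f"{gamma} |- (({formula}) & (({formula}) => {goal})) => {goal}"
    obligation = f"{gamma} |- {formula}"
    return tautology, obligation

taut, ob = deduce_obligations(["rational(sqrt(2))"], "b*sqrt(2) = a",
                              "irrational(sqrt(2))")
assert ob == "rational(sqrt(2)) |- b*sqrt(2) = a"
assert taut.endswith("=> irrational(sqrt(2))")
```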
8 Related Work
AutoMath [2] of N. G. de Bruijn is one of the pioneering works, in which a very
restricted proof language was proposed. Since then, such restricted languages
have been presented by many systems: for instance, the language of Mizar,
Isar [16] for Isabelle, the notion of formal proof sketches [17] for Mizar, and
the Mathematical Proof Language MPL [1] for Coq. However, such languages are
quite restricted and unambiguous, with a programming-language-like syntax and
only a few syntactic constructions. Therefore, like MathAbs, we consider them
as intermediate languages between mathematical text and a proof checking
system.
The MathLang project [7] goes one step further by supporting the manual
annotation of NL mathematical text. Once the annotation is done by the author,
a number of transformations of the annotated text are automatically performed
for automatic verification. So MathLang seems quite good at answering the
second question raised in the introduction, but it neglects the possibility of
NL parsing completely.
The work of Hallgren and Ranta [6] presents an extension to the
type-theoretical syntax of the logical framework Alfa, supporting
self-extensible NL input and output. Like MathNat, its NL grammar is developed
in GF, but it does not support rich linguistic features as we do. Similarly to
this work, we hope to make the CLM grammar extensible in the future.
Nthchecker of Simon [13] is perhaps the pioneering work, for its time, towards
parsing and verifying informal proofs. However, according to Zinn [18], its
linguistic and mathematical analysis is quite ad hoc, and we second his
opinion. In recent times, Vip, the prototype of Zinn [18], is a quite promising
work. In his doctoral manuscript and papers, Zinn gives a good linguistic and
logical analysis of textbook proofs. In Vip, he builds an extension of
discourse representation theory (DRT) for parsing and integrates proof planning
techniques for verification. Vip can process two proofs (nine sentences) from
number theory. In our opinion, this coverage is too limited to verify the
usefulness of the presented concepts. Further, it supports limited linguistic
features. Unfortunately, no further work has appeared after 2005. We do not use
DRT because it is perhaps too complex and an overkill for the task of analysing
mathematical text. However, we may consider integrating proof planning
techniques for verification in the future.
Naproche [8] is also based on an extension of DRT. Like MathNat, Naproche
translates its output to first-order formulas. Currently, it has a quite
limited controlled language without rich linguistic features. Further, as in
MathNat, the problem of reasoning gaps in math text is yet to be tackled.

² http://coq.inria.fr/
³ http://wiki.portal.chalmers.se/agda/
WebALT [3] is a GF-based project that provides multilingual mathematical
exercises in seven languages. It has a fairly good coverage for mathematical
statements. However, a significant part of mathematics, i.e. the language of
proof, is out of the scope of this work.
9 Conclusion and Future Work
For textbook mathematical text, we do not know of any system that provides such
linguistic features as MathNat does. The coverage of CLM facilitates some
common reasoning patterns found in proofs, but it is still limited. So, for
coverage, we aim at working in two directions: enlarging the grammar manually
for common patterns, and making some parts of the CLM grammar extensible.
Context building for symbolic expressions and equations should be improved,
and we want to find a consistent algorithm to pick the right pronoun referents.
Some theorems and their proofs from elementary number theory, set theory and
analysis were parsed in CLM, translated into MathAbs, and further translated
into equivalent first-order formulas. But we were able to validate only a few
of them due to reasoning gaps. So we want to improve MathNat's interaction with
automated theorem provers. We also want to explore the possibility of
integrating with various proof assistants.
References
1. H. Barendregt. 2003. Towards an Interactive Mathematical Proof Language. Thirty Five Years
of Automath, Ed. F. Kamareddine, Kluwer, 25-36.
2. N. G. de Bruijn. 1994. Mathematical Vernacular: a Language for Mathematics with Typed Sets.
In R. Nederpelt, editor, selected Papers on Automath, pages 865-935.
3. O. Caprotti. WebALT! Deliver Mathematics Everywhere. In Proceedings of SITE 2006. Orlando
March 20-24, 2006.
4. M. Forsberg, A. Ranta. 2004. Tool Demonstration: BNF Converter HW’2004, ACM SIGPLAN.
5. Grammatical Framework Homepage. http://www.grammaticalframework.org/
6. T. Hallgren & A. Ranta. 2000. An extensible proof text editor. Springer LNCS/LNAI 1955.
7. F. Kamareddine & J. B. Wells. 2008. Computerizing Mathematical Text with MathLang, ENTCS,
205, 1571-0661, Elsevier Science Publishers B. V. Amsterdam.
8. D. Kühlwein, M. Cramer, P. Koepke, and B. Schröder. 2009. The Naproche System. In Intelligent
Computer Mathematics, Springer LNCS, ISBN: 978-3-642-02613-3.
9. Per Martin-Löf. 1984. Intuitionistic Type Theory. Bibliopolis, Napoli.
10. The PhoX Proof Assistant. http://www.lama.univ-savoie.fr/~RAFFALLI/af2.html
11. A. Ranta. 2004. Grammatical Framework: A Type-Theoretical Grammar Formalism. Journal of
Functional Programming, 14(2):145–189.
12. A. Ranta. 2008. Grammars as Software Libraries. To appear in G. Huet, et al. (eds), From
semantics to computer science. Cambridge University Press.
13. D.L. Simon. 1988. Checking natural language proofs, in: Springer LNCS 310.
14. P. Thévenon. 2006. PhD Thesis. Vers un assistant à la preuve en langue naturelle. Université
de Savoie, France.
15. P. Thévenon. 2004. Validation of proofs using PhoX. ENTCS. www.elsevier.nl/locate/entcs
16. M. Wenzel. 1999. Isar: a generic interpretative approach to readable formal proof documents,
Springer LNCS 1690.
17. F. Wiedijk. 2003. Formal Proof Sketches. TYPES 2003, Italy, Springer LNCS 3085, 378-393
18. C. Zinn. 2006. Supporting the formal verification of mathematical texts. Journal of Applied
Logic, 4(4).
Applications
A Low-Complexity Constructive Learning
Automaton Approach to Handwritten Character
Recognition
Aleksei Ustimov¹, M. Borahan Tümer², and Tunga Güngör¹
¹ Department of Computer Engineering, Boğaziçi University, İstanbul, Türkiye
² Department of Computer Science Engineering, Marmara University, İstanbul,
90210, Türkiye
Abstract. The task of syntactic pattern recognition has aroused the
interest of researchers for several decades. The power of the syntactic
approach comes from its capability in exploiting the sequential characteristics of the data. In this work, we propose a new method for syntactic recognition of handwritten characters. The main strengths of our
method are its low run-time and space complexity. In the lexical analysis phase, the lines of the presented sample are segmented into simple
strokes, which are matched to the primitives of the alphabet. The reconstructed sample is passed to the syntactic analysis component in the
form of a graph where the edges are the primitives and the vertices are
the connection points of the original strokes. In the syntactic analysis
phase, the interconnections of the primitives extracted from the graph
are used as a source for the construction of a learning automaton. We
reached recognition rates of 72% for the best match and 94% for the top
five matches.
Key words: Syntactic Pattern Recognition, Handwritten character recognition, 2D Space, Parsing, Automaton, Graph, Data sequence, Expected Value,
Matching
1 Introduction
Pattern recognition has received increasing attention for more than three
decades. While considerable progress has been made, some challenging problems
remain unsolved [1, 5].
Syntactic Pattern Recognition (SPR), one of the four subcategories of Pattern
Recognition, consists of two parts: lexical analysis (segmentation) and
syntactic analysis (parsing).
In lexical analysis a pattern is reconstructed in terms of primitives [7],
which are obtained in a deterministic [2] or adaptive [19] way. Syntactic
analysis is responsible for grammatical inference in the training session, or
grammar checking in the recognition session.
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 311-322
Received 23/11/09
Accepted 16/01/10
Final version 09/03/10
In our method we use a structural representation technique. Presented data
samples are segmented into simple strokes (primitives) with three properties:
length, slope and shape (curvature). We use a graph to hold a segmented
character. Edges and vertices of the graph represent the strokes and the
connection points of these strokes, respectively.
For training and recognition we use the syntactic approach. The segmented and
digitally represented shape is converted to an automaton. The conversion is
performed using an intermediate stage in which the graph is converted to a
sequence of elements, each of which holds the smallest unit of the structure.
In the training session, the obtained automaton contributes to the grammar of
the specific class; in the recognition session, it is matched to the grammar of
each class to detect the best match.
Research on recognition based on curve representation has been reported in the
literature. In [12] (topological representation), the primitives of the
presented curve are encoded with the slope property, and consecutive primitives
having the same slope form a part of the curve. The parts are converted into
the string representation of the curve (coding). The authors in [2] (coding
representation with syntactic recognition) used a representation similar to
ours, except for the curvature property. Primitives of a 2D ECG signal were
adaptively created using an ART2 ANN in [19] (coding representation with
syntactic recognition). A series of fuzzy automata, each with a different level
of detail, participate in the classification process.
A shape analysis model was reported in [17] (graph representation and template
matching recognition). The curves in skeletonized characters are approximated
by sequences of straight line segments and converted into a graph
representation. A graph and tree representation with a mixture of syntactic and
template matching recognition was used in [10] to recognize various elements on
architectural plans. Straight lines formed the primitive alphabet. Small
patterns (windows, doors, stairs, etc.) were represented using graphs, and
recognition was performed by graph matching. The author in [13] (graph and tree
representation and natural recognition) used SPR to classify objects in medical
images. The method learns the structure of the wrist bones with all variations
and recognizes defects caused by bone joints or dislocations.
A new method for cursive word recognition was proposed in [18] (coding
representation with template matching recognition). Single characters are
segmented into straight lines, arcs, ascenders, descenders and loops.
Primitives are coded into 1D sequences and the system searches for common
rules. Recognition of traffic signs was reported in [8] (coding representation
and neural network recognition). The author used two types of primitives: lines
with the slope property and circles with the breaking point orientation
property.
In most cases, authors used a small number of properties for segmented curves
together with complex algorithms for training and recognition. Increasing the
complexity of the algorithm will increase the recognition rates to some extent.
Our aim was to develop an approach to handwritten character recognition whose
main strengths are simplicity and high speed, at the expense of slightly less
accurate recognition rates. All main parts of our method are simple and can be
implemented with O(n) running time. This method is useful in cases where the
computational power of the hardware is not very high and the data to be
recognized is not too distorted. As an example we may take pocket computers
with handwritten word recognition software, plus a cooperative user with good
handwriting.
The rest of the paper is organized as follows. The lexical analysis and syntactic analysis are explained in detail in Sections 2 and 3, respectively. In Section 4
we explain the practical application of the method proposed. Section 5 concludes
the paper.
2 Lexical Analysis
In the lexical analysis phase we prepare the presented data for the syntactic
analysis. In this section, we first explain the alphabet and then the segmentation.
2.1 Alphabet
In SPR the alphabet of primitives is obtained either adaptively or is known a
priori. In our method we decided to use a predefined alphabet of primitives.
When the type and format of the data are known, a human supervisor may
provide a sufficient alphabet. To represent the strokes of handwritten characters
we introduce the term arc. An arc is a small segment of a stroke for which the
set of property values can be calculated, and it may be completely represented
by length, slope and shape, each with a small set of discrete values.
The length of an arc is the number of pixels that form the arc. The assignment
of length values may depend on the expected height of the character’s body
(character without ascenders and descenders).
The slope of an arc is the second property; it contains direction information.
To determine the slope of an arc, we use the line between the arc ends, called
the chord. The chord of a sample arc is shown in Fig. 1.
The shape property indicates the direction of the arc's curvature. In the
calculation of this property we again use the chord. The value of this property
may change according to the number, distance and position of the pixels in the
arc relative to the chord.
Fig. 1. The farthest point (at distance d) from the chord on the arc.
2.2 Segmentation
The segmentation module is responsible for breaking a given figure into a set
of arcs in such a way that all strokes in the figure are encoded by the arcs.
In our method, an entire figure is expressed in terms of arcs. In order not to
lose the syntactic properties of the figure, arcs have to be connected to each
other in the same manner as in the original figure. A good way to represent
such a structure is a graph. The edges of the graph are the arcs with the
properties converted to numerical values. The vertices of the graph are the
common points of the arcs on the original figure.
3 Syntactic Analysis
In the training session, the task of the syntactic component is to generate a
grammar for each presented class of data. A grammar is a set of rules that
helps us distinguish patterns. In the recognition session, the grammar is used
as part of the input along with the reconstructed signal. Here the task is to
determine the degree of match between a new signal and the grammar presented.
3.1 Conversion from Graph to Sequence of Code Pairs
The graph produced by lexical analysis holds a character in terms of arcs and
their interconnections. In the training and recognition sessions we use
automata. The structure of the pattern that we use in automata differs from the
one produced by the lexical analysis. In order to convert the graph
representation of the signal to an automaton we use an intermediate step. In
this step, to preserve the 2D topology, a sequence of connected edge pairs,
called code pairs, is extracted from the initial graph. Each element of the
sequence contains two edges of the graph with one common vertex. The length of
the sequence obtained from a graph depends on the number of edges at each
vertex. A vertex that connects n edges will produce C(n, 2) = n!/((n − 2)! 2!)
different pairs.
3.2 Grammatical Inference
In the training session, sequences obtained from characters belonging to the
same class have to be merged into one learning structure. Knowing that a
sequence consists only of primitives and that the primitives are unique, we
employ automata. An automaton is an appropriate structure for storing syntactic
data: it assumes that all states are unique, has the ability to learn, and may
be trained on infinitely long input sequences.
During the insertion of each code pair into the automaton, the state and
transition probabilities are updated. In order to always maintain a normalized
automaton, we use the linear reward-inaction (LR−I) probability update method
used in learning automata [11]. After the insertion of an arc pair into an
automaton, both states (corresponding to codes) are rewarded. To reward more
than one state at once we changed the reward and punishment formulae as
follows:
P_i^(t+1) = P_i^t + (1 − P_i^t) λ (1 − P_I) / (n − P_I)
P_k^(t+1) = P_k^t (1 − λ)                                        (1)
where n, λ (0 < λ < 1), P_i, P_k, P_I and t denote the number of states to be
rewarded, the learning rate, the probability of a rewarded state, the
probability of a punished state, the probability sum of the rewarded states,
and time, respectively. The last (added) factor in the reward formula
distributes the total reward according to the prior probabilities of the
rewarded states: the higher the probability, the smaller the reward. Transition
probabilities are updated using the same formulae with n = 1, which neutralizes
the last factor.
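The update in Eq. (1) can be sketched as follows; the dictionary representation and the `lr_i_update` helper are our own illustrative assumptions, not the paper's implementation:

```python
def lr_i_update(probs, rewarded, lam=0.1):
    """Apply Eq. (1): reward the states in `rewarded`, punish the rest.
    `probs` maps state -> probability and must sum to 1."""
    n = len(rewarded)
    p_I = sum(probs[s] for s in rewarded)      # probability mass of rewarded states
    new = {}
    for s, p in probs.items():
        if s in rewarded:
            new[s] = p + (1 - p) * lam * (1 - p_I) / (n - p_I)
        else:
            new[s] = p * (1 - lam)
    return new

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
new = lr_i_update(probs, rewarded={"a", "b"})
assert abs(sum(new.values()) - 1.0) < 1e-12   # the automaton stays normalized
assert new["a"] > probs["a"] and new["c"] < probs["c"]
```

A quick check that the update keeps the distribution normalized: the rewarded mass becomes P_I + λ(1 − P_I) and the punished mass (1 − P_I)(1 − λ), which sum to 1.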
3.3 Pattern Matching
In the training session we obtain a trained automaton for each class presented.
In the recognition session we follow the algorithm explained above and stop
after the generation of the sequence of code pairs is completed. Then, instead of
generating an automaton, we match the sequence to all trained automata and
choose the best match.
When the training session completes, the system ends up with the normalized
automata. For the matching operation we still need a few more values.
Matching Parameters During matching, when a sequence of code pairs is
presented to a trained automaton, a score (τ ) is calculated that denotes the
presence of the specific arcs and their connections to each other. To calculate
the score we search the trained automaton for the states and transitions that
represent each pair in the matched sequence:
τ = ∑_{i,j ≤ n} [ P(s_i) P(t_ij) + P(s_j) P(t_ji) ]              (2)
where n is the number of states, si , sj ∈ S are the present states and tij , tji ∈ T
are the transitions between si and sj . In other words, the score shows the extent
to which the sequence is present in the automaton.
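The score of Eq. (2) can be sketched as follows; the dictionary-based automaton and the `score` helper are illustrative assumptions, not the paper's data structures:

```python
def score(sequence, states, transitions):
    """Eq. (2): sum P(s_i)P(t_ij) + P(s_j)P(t_ji) over the code pairs.
    States or transitions absent from the trained automaton contribute 0."""
    tau = 0.0
    for i, j in sequence:
        tau += states.get(i, 0.0) * transitions.get((i, j), 0.0)
        tau += states.get(j, 0.0) * transitions.get((j, i), 0.0)
    return tau

states = {232: 0.4, 341: 0.6}
transitions = {(232, 341): 0.5, (341, 232): 0.5}
assert abs(score([(232, 341)], states, transitions) - 0.5) < 1e-12
assert score([(111, 222)], states, transitions) == 0.0   # unknown codes score 0
```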
The automaton of each class produces a score for the sequence of the test
character. The scores produced cannot be directly compared. To be able to
compare those scores, they have to be converted into a common scale. For this
purpose, we use the expected value of the scores that each trained automaton
produces for the characters it represents. To determine this value, for each
automaton we make a second pass over the training data to calculate the scores
of the training characters and the expected value from the collection of
obtained scores.
E(τ) = (1/n) ∑_{i=1}^{n} τ_i                                     (3)
We use this value as a reference point: the closer a character's score is to
the expected value, the higher the degree of match. The degree of match is
calculated by the formula:
Dm = 1 − |E − τ| / (E₂ α)                                        (4)

where Dm is the degree of match, E is the expected value (the average) of the
scores, E₂ is the second central moment (the variance), and α ∈ (0, 1) is a
constant. To quantify the asymmetry of the distribution of the scores we use
the third central moment [15, 16].
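Equations (3) and (4) can be combined into a single sketch; the `degree_of_match` helper and its defaults are our own illustrative assumptions:

```python
def degree_of_match(tau, training_scores, alpha=0.5):
    """Eq. (3)-(4): compare a test score against the automaton's expected
    score E and its second central moment (variance) E2."""
    n = len(training_scores)
    e = sum(training_scores) / n                          # E(tau), Eq. (3)
    e2 = sum((t - e) ** 2 for t in training_scores) / n   # second central moment
    return 1 - abs(e - tau) / (e2 * alpha)                # Dm, Eq. (4)

scores = [1.0, 2.0, 3.0]                 # E = 2.0, variance = 2/3
assert degree_of_match(2.0, scores) == 1.0   # a score at E matches perfectly
assert degree_of_match(1.5, scores) < 1.0    # any deviation lowers Dm
```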
Matching In the recognition session, an unknown character passes the lexical
analysis phase and is converted to a sequence of code pairs. Next, the sequence
obtained is matched against all automata and degrees of match are computed.
Finally, the character is assumed to belong to the class whose automaton produced the highest degree of match.
4 Practical Application
In our application we used the Turkish alphabet, which contains 29 letters and
is similar to the English alphabet. The Turkish alphabet does not use the three
letters q, w, and x; instead it includes six additional letters: ç, ğ, ı, ö, ş,
and ü.
4.1 Preprocessing
Before presenting the handwritten character images to the lexical analysis
component, they have to be preprocessed. The lexical analysis component works
with one-pixel-thin foreground strokes, so our preprocessing consists of two
operations: binarization [4] and skeletonization [9, 14, 6].
4.2 Segmentation
The first step in segmentation locates the special pixels, called reference points,
on the skeletonized character. There are two types of such pixels: cross points
and ends of the lines.
Fig. 2. Cross and end points located on a shape.
A cross point denotes a pixel that has more than two foreground pixels in
its 8-pixel neighborhood. An end point is a pixel with only one foreground pixel
in its 8-pixel neighborhood. In Fig. 2, the cross points and ends of the lines are
determined.
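The two reference-point tests above can be sketched as follows; the set-of-pixels representation and the `classify` helper are illustrative assumptions, not the paper's implementation:

```python
def classify(pixel, foreground):
    """Classify a foreground pixel by its 8-pixel neighborhood:
    'end' = exactly one foreground neighbour, 'cross' = more than two."""
    x, y = pixel
    neighbours = sum((x + dx, y + dy) in foreground
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    if neighbours == 1:
        return "end"
    if neighbours > 2:
        return "cross"
    return "regular"

# A horizontal 5-pixel line: both tips are end points, the middle is not.
line = {(x, 0) for x in range(5)}
assert classify((0, 0), line) == "end"
assert classify((2, 0), line) == "regular"
```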
Initially, a stroke between each pair of reference points is assumed to be an
arc. The next step is to calculate the values of the properties for each arc. In our
application we used a predefined number of values for each property.
The length property may take three values: short (s), medium (m) and long (l).
The slope may take four values: horizontal (h), vertical (v), positive (p) and
negative (n). If the chord encloses an angle of less than a threshold value
(about 15°) with the x-axis or with the y-axis, it is assumed to be horizontal
or vertical, respectively. A positive value is assigned when the chord lies in
the first and third (positive) quarters of the circle drawn around it. A chord
located in the second and fourth (negative) quarters is assigned a negative
slope value. For example, the chord of the arc in Fig. 1 is not close to any of
the axes and lies in the positive quarters of the circle, so the slope property
value of that arc is positive.
The shape property may take three values: straight (s), convex (x) and concave
(v). To determine the shape property value we draw the chord and calculate the
average distance davg of the pixels in the arc to the chord, and the numbers of
pixels, pa and pb, vertically above and below the chord, respectively. The
average distance shows the degree of the arc's concavity, and the numbers of
points help us distinguish between the convex and concave values. The arc is
assumed to be straight if the degree of concavity is less than a specific
threshold value. If the arc is not straight, we check the number of pixels
above and below the chord. If pa > pb the shape property of the arc is set to
convex; if pa < pb, to concave.
There are 3 × 4 × 3 = 36 different value combinations for an arc. Each
combination is called a code. Codes are generated as follows: each property
value is given a number that is unique within its property (length: s = 1,
m = 2, l = 3; slope: h = 1, v = 2, p = 3, n = 4; shape: s = 1, x = 2, v = 3).
The code is a number that combines all properties in the same order: code =
100 × length + 10 × slope + 1 × shape. For example, an arc with property values
medium, positive and convex is encoded as 100 × 2 + 10 × 3 + 1 × 2 = 232.
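The encoding above can be sketched directly; the lookup tables and the `arc_code` helper are illustrative naming, not the paper's code:

```python
LENGTH = {"s": 1, "m": 2, "l": 3}           # short, medium, long
SLOPE = {"h": 1, "v": 2, "p": 3, "n": 4}    # horizontal, vertical, pos., neg.
SHAPE = {"s": 1, "x": 2, "v": 3}            # straight, convex, concave

def arc_code(length, slope, shape):
    """code = 100 * length + 10 * slope + 1 * shape"""
    return 100 * LENGTH[length] + 10 * SLOPE[slope] + SHAPE[shape]

assert arc_code("m", "p", "x") == 232   # medium, positive, convex (mpx)
assert arc_code("l", "n", "s") == 341   # long, negative, straight (lns)
```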
The properties of some arcs may fail to be calculated: an arc may be too long
or too short, or may be too complex a curve to be clearly classified as convex
or concave. Such arcs are broken up into several simpler arcs by locating
additional reference points. The leftmost and the rightmost arcs in Fig. 2 are
examples of complex arcs. To locate additional reference points we use the
bounding box operation, where an arc is surrounded by a tight (bounding) box.
The pixels that touch the sides of the box become new reference points. In
Fig. 3, two points for the leftmost and the rightmost arcs were located using
this operation.
In the case that the bounding box operation does not produce any new reference
point, we apply the farthest point operation, where the pixel with the highest
distance from the chord, as shown in Fig. 1, becomes the additional reference
point. In Fig. 3, one point for each of the middle arcs was located using the
farthest point operation.
The segmentation process continues until all arcs are assigned valid values
for their three properties.
The representation of the segmented character is based on the reconstruction of
the original figure in terms of primitives (arc codes) instead of arcs,
presented as a graph. For instance, the graph corresponding to the symbol in
Fig. 3 consists of 10 vertices and edges, as shown in Fig. 4. We assume that
the code of an arc contains sufficient information about the arc, so we do not
have to keep the pixel details.
[Figure: arcs labelled mpx:232, mpx:232, lnx:342, mnx:242, lns:341, lns:341, mnv:243, lnv:343, mpv:233, mpv:233]
Fig. 3. Property values’ abbreviations and codes of all the arcs in the shape.
[Figure: graph with edges labelled 232, 341, 232, 342, 242, 341, 243, 343, 233, 233]
Fig. 4. Graph representation of the shape in Fig. 3.
4.3 Code Pairs
To convert the graph obtained from the lexical analysis to a sequence of code pairs, we extract all possible combinations of connected edges from each vertex. As stated earlier, the number of pairs from a vertex that connects n edges is C(n, 2) = n! / ((n − 2)! 2!). The only exception is a vertex with a single edge, which produces a pair with 0 on one side. For instance, the graph in Fig. 4 produces the following sequence: 0-243, 243-232, 232-341, 341-232, 341-343, 232-343, 232-342, 343-233, 342-233, 342-341, 233-341, 341-233, 233-242 and 242-0. The order
of the pairs is irrelevant since each pair is a part of the graph topology and the
order of the codes in a pair is also irrelevant since the edges in a graph are not
directed.
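The pair extraction can be sketched as follows (an illustrative sketch; the incidence-list representation and all names are ours):

```python
from itertools import combinations

def code_pairs(vertex_edges):
    """vertex_edges: one list of incident edge codes per vertex.
    Each vertex with n incident edges yields C(n, 2) unordered code
    pairs; a vertex with a single edge yields a (0, code) pair."""
    pairs = []
    for codes in vertex_edges:
        if len(codes) == 1:
            pairs.append((0, codes[0]))
        else:
            pairs.extend(combinations(codes, 2))
    return pairs

# Hypothetical incidence lists for a small fragment of such a graph:
print(code_pairs([[243], [243, 232], [232, 341, 343]]))
# -> [(0, 243), (243, 232), (232, 341), (232, 343), (341, 343)]
```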
A Low-Complexity Constructive Learning Automaton Approach...
4.4 Automaton
Presenting one element of a sequence, a code pair, to an automaton consists of
a few simple and straightforward operations. First, each code in a code pair is
assumed to represent a state of an automaton. New states are created if the automaton does not contain a newly presented code. Second, a connection between
the codes in the pair represents a transition between states. We create two transitions with opposite directions for each code pair [20], because transitions in an
automaton are directed. The automaton corresponding to the graph in Fig. 4 is
shown in Fig. 5.
[Figure: automaton with states 0, 243, 242, 232, 233, 341, 342, 343]
Fig. 5. Automaton representation of the graph in Fig. 4.
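The construction described above can be sketched as follows (illustrative code; the class and method names are ours):

```python
class Automaton:
    """Sketch of the constructive automaton: states are arc codes and
    each code pair contributes two transitions of opposite direction."""
    def __init__(self):
        self.states = set()
        self.transitions = set()

    def present(self, a, b):
        # Create states on first presentation of a code.
        self.states.update((a, b))
        # Code pairs are undirected, so add both directed transitions.
        self.transitions.add((a, b))
        self.transitions.add((b, a))

auto = Automaton()
for pair in [(0, 243), (243, 232), (232, 341)]:
    auto.present(*pair)
print(len(auto.states), len(auto.transitions))  # -> 4 6
```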
4.5 Results and Discussion
The main strength of our method is its simplicity and low time complexity. All main parts of our method are simple and can be implemented with O(n) running
time. In the training session, the algorithm requires two passes over the training
data. The runtime of the second pass can be made negligible by using only a
small portion of the training data. For example, if an automaton was built using
10000 samples, then the second pass may be completed with only 100 samples
chosen randomly. Segmentation is performed by a single pass over all pixels of a
sample image and all arcs extracted are used once in an automaton generation.
All structures used in our method are simple and require a limited amount of memory. Each trained automaton consists of at most 36 states and 36 × 36 transitions.
Since no database was available for Turkish handwritten character recognition, we compiled a new one. The data set of 12000 handwritten characters was collected from 10 writers. After testing our system on the Turkish handwritten characters, we obtained promising results in the area of character recognition. Each character in the test set was presented to all automata and the degrees of match provided by each automaton were sorted in descending order. In 71.94% of the cases the score of the correct automaton was the highest (positioned at the top of the order) and in 93.79% of the cases it was among the top five results. The distribution of the recognition rates from the top result to the top five results is shown in Fig. 6.
[Figure: bar chart, x-axis “Top Results” (1–5), y-axis “Recognition Rates” (0–80)]
Fig. 6. Average positions of the correct automata in the recognition order.
The majority of the correct recognitions is concentrated in the first position. The positions not shown in the figure have average values of less than one percent.
[Figure: bar chart over the 29 letters of the Turkish alphabet, x-axis “Letters”, y-axis “Average Scores (%)” (0–100)]
Fig. 7. Distribution of the top five results for each letter.
In our system, the simpler the topology of a character, the higher the recognition performance. The distribution of the recognition results for each character is shown in Fig. 7. The top result is marked with the darkest portion of the bars. The second and third results are combined and marked in gray. The fourth and fifth results are also combined and shown as the white portions of the bars. As we expected, the simplest letters such as ı, i, j, l, o, ö and t have top recognition rates of more than 80%, while the lowest rates were obtained for the letters c, ç, f, ğ and s, which have many curved portions.
We also studied the recognition errors for some of the most frequent letters in
Turkish: a, e, i, l, n and r (Table 1). Most frequently a is erroneously recognized
Table 1. Misclassifications of some of the most frequent letters in Turkish. Each row lists, for the letter in its header, the erroneous choices and their percentages among all misclassifications of that letter.

a: o 35.96%, ü 15.79%, e 15.35%, c 7.02%, u 5.70%
e: v 23.95%, r 20.96%, o 11.38%, i 10.78%, c 10.18%
i: l 26.53%, r 22.45%, ü 12.24%, ş 10.20%, v 8.16%
l: i 57.14%, r 19.05%, v 7.14%, ç 4.76%, ş 4.76%
n: i 21.85%, ş 17.65%, r 16.81%, e 12.61%, ü 11.76%
r: i 39.80%, ş 18.37%, l 13.27%, c 5.10%, e 5.10%
as o, ü, e, c and u. The likely reason is that those letters have a topology similar to that of a. Misclassifications for the same reason may be observed for i, l and r. The letter e is mostly misclassified as v and r, because they have a similar handwritten topology. The misclassifications of n do not display a clear pattern, because n can be easily distorted during handwriting and hence may have a topology similar to many letters.
5 Conclusion
In this paper we proposed a constructive learning automaton approach to the recognition of handwritten characters. Promising results were obtained after testing our system: recognition rates vary between 71.94% for the top result and 93.79% for the top five results. For unconstrained characters segmented from words, the performance of our approach is similar to that reported in [21, 22]. A similar approach was used in [3], where the authors reached an average top-five recognition rate of 93.78% for online English handwritten character recognition.
As can be concluded from the results, the weakest point of our application is the representation of the smooth curves usually encountered in the letters c, ç, g, ğ, s and ş. A direction for future work is the revision of the representation technique for smooth curves. A new technique for the placement of reference points may focus on sharp changes in the degree of curvature along the presented curve. This would lead to robustness against slant variation and increase the performance.
Acknowledgements. This work has been supported by the Boğaziçi University Research Fund under grant number 09A107D.
References
1. N. Arica and F.T. Yarman-Vural. An overview of character recognition focused
on off-line handwriting. Systems, Man, and Cybernetics, Part C: Applications and
Reviews, IEEE Transactions, 31:216–233, 2001.
2. Y. Assabie and J. Bigun. Ethiopic character recognition using direction field tensor.
Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, 3:284–
287, 2006.
3. V. S. Chakravarthy and B. Kompella. The shape of handwritten characters. Pattern Recognition Letters, 24:1901–1913, 2003.
4. D.A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall,
2003.
5. R. Gonzalez and M. Thomason. Syntactic Pattern Recognition: An Introduction.
Addison-Wesley, 1978.
6. C.M. Holt, A. Stewart, M. Clint, and R.H. Perrott. An improved parallel thinning
algorithm. Communications of the ACM, 30(2):156–160, 1987.
7. K.-Y. Huang. Syntactic Pattern Recognition For Seismic Oil Exploration. World
Scientific, 2002.
8. S. Kantawong. Road traffic signs detection and classification for blind man navigation system. Control, Automation and Systems, 2007. ICCAS ’07. International
Conference on, pages 847–852, 2007.
9. L. Lam, S.-W. Lee, and C.Y. Suen. Thinning methodologies - a comprehensive
survey. Pattern Analysis and Machine Intelligence, 14(9):869–885, 1992.
10. J. Llados and G. Sanchez. Symbol recognition using graphs. Image Processing,
2003. ICIP 2003. Proceedings. 2003 International Conference on, 2:49–52, 2003.
11. K. Narendra and M.A.L. Thathachar. Learning Automata: An Introduction. Addison-Wesley, New York, 1989.
12. H. Nishida and S. Mori. Algebraic description of curve structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:516–533, 1992.
13. M.R. Ogiela. Automatic understanding of medical images based on grammar
approach. Imaging Systems and Techniques, 2007. IST ’07. IEEE International
Workshop on, pages 1–4, 2007.
14. B.R. Okombi-Diba, J. Miyamichi, and K. Shoji. Segmentation of spatially variant
image textures. 16th International Conference on Pattern Recognition (ICPR’02),
2:20917, 2002.
15. A. Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill
Companies; 3rd edition, 1991.
16. W. Paul and J. Baschnagel. Stochastic Processes: From Physics to Finance.
Springer, 1999.
17. J. Rocha and T. Pavlidis. A shape analysis model with applications to a character recognition system. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16:393–404, 1994.
18. J.C. Simon. Off-line cursive word recognition. Proceedings of the IEEE, 80:1150–
1161, 1992.
19. M.B. Tümer, L.A. Belfore, and K.M. Ropella. A syntactic methodology for automatic diagnosis by analysis of continuous time measurements using hierarchical
signal representations. Systems, Man, and Cybernetics, Part B, IEEE Transactions
on, 33(33):951–965, 2003.
20. A. Ustimov and B. Tümer. Construction of a learning automaton for cycle detection in noisy data sequences. Lecture Notes in Computer Science, pages 543–552,
2005.
21. B.A. Yanikoglu and P.A. Sandon. Off-line cursive handwriting recognition using
style parameters. Department of Mathematics and Computer Science, Dartmouth
College, 1993.
22. B.A. Yanikoglu and P.A. Sandon. Recognizing off-line cursive handwriting. Proc.
Computer Vision and Pattern Recognition, pages 397–403, 1994.
Utterances Assessment in Chat Conversations
Mihai Dascalu1,2, Stefan Trausan-Matu1,3, Philippe Dessus4
1 University “Politehnica” of Bucharest, 313, Splaiul Indepentei, 060042 Bucharest, ROMANIA
2 S.C. CCT S.R.L., 30, Gh Bratianu, 011413 Bucharest, ROMANIA
3 Romanian Academy Research Institute for Artificial Intelligence, 13, Calea 13 Septembrie, Bucharest, ROMANIA
4 Grenoble University, 1251, av. Centrale, BP 47, F-38040 Grenoble CEDEX 9, FRANCE
mikedascalu@yahoo.com, stefan.trausan@cs.pub.ro, Philippe.Dessus@upmf-grenoble.fr
Abstract. With the continuous evolution of collaborative environments, the need for automatic analysis and assessment of participants in instant messenger conferences (chats) has become essential. To this end, on one hand, a series of factors based on natural language processing (including lexical analysis and Latent Semantic Analysis) and data mining have been taken into consideration. On the other hand, in order to thoroughly assess participants, measures such as Page’s essay grading, readability and social network analysis metrics were computed. The weights of each factor in the overall grading system are optimized using a genetic algorithm whose entries are provided by a perceptron in order to ensure numerical stability. A gold standard has been used for evaluating the system’s performance.
Keywords: assessment of collaboration, analysis of discourse in conversation,
social networks, LSA, Computer Supported Collaborative Learning.
1 Introduction
As a result of the ongoing evolution of the web, new collaboration tools have emerged and with them the desire to automatically process large amounts of information. From the Computer Supported Collaborative Learning (CSCL) point of view [1], chats play an important role and are increasingly used in the learning process. On the other hand, manual assessment of chats is a time-consuming process for the teacher, and therefore the need to develop applications that can aid the evaluation process has become essential. From this perspective, the major contribution of this paper is the development of an automatic assessment system to evaluate each participant in a chat environment. A series of natural language processing and social network analysis methods were used, together with other computed metrics for assessment.
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 323-334
Received 24/11/09
Accepted 16/01/10
Final version 09/03/10
The system was used for CSCL chats in which teams of 4-8 students were asked to discuss, without a moderator, the benefits of online collaboration tools. Each student was assigned a collaborative technology to support (wikis, blogs, chats or forums), arguing both pros and cons for it. The language was English and the environment used was Concert Chat [6], which offers the possibility of explicitly referencing previous utterances. From the obtained corpus, 80 chats were afterwards manually evaluated by a student from a different year, so as not to influence the assessment process.
The next section of this paper presents the metrics used in the evaluation process, starting from the simplest, such as readability or Page’s factors, initially used for essay grading [3], moving on to social network analysis and finally to Latent Semantic Analysis (LSA) for a semantic approach to the marking system. The third section evaluates the system.
2 The Evaluation Process
Communication between participants in a chat is conveyed through language in a
written form. Lexical, syntactic, and semantic information are the three levels used to
describe the features of written utterances [2], and will be taken into account for the
analysis of a participant’s involvement in a chat. First, surface metrics are computed
for all the utterances of a participant in order to determine factors like fluency,
spelling, diction or utterance structure [2, 3]. All these factors are combined and a
mark is obtained for each participant without taking into consideration a lexical or a
semantic analysis of what they are actually discussing. At the same level readability
ease measures are computed.
The next step is grammatical and morphological analysis based on spellchecking,
stemming, tokenization and part of speech tagging. Eventually, a semantic evaluation
is performed using LSA [4]. For assessing the on-topic grade of each utterance a set
of predefined keywords for all corpus chats is taken into consideration.
Moreover, at the surface and at the semantic levels, metrics specific to social
networks are applied for proper assessment of participants’ involvement and
similarities with the overall chat and predefined topics of the discussion.
2.1 Surface Analysis
In order to perform a detailed surface analysis two categories of factors are taken into
consideration at a lexical level: Page’s essay grading proxes and readability. Page’s idea was that computers could be used to automatically evaluate and grade student essays as effectively as any human teacher, using only simple measures: statistically and easily detectable attributes [5]. The main purpose was to prove that computers could grade as well, but with less effort and time, thereby enabling teachers to assign more writing. The goal was thus to improve students’ capabilities through practice, having at hand the statistical capabilities of computers for writing analysis.
In order to perform a statistical analysis, Page correlated two concepts: proxes (computer approximations of interest) with human trins (intrinsic variables, the human measures used for evaluation). The overall results were remarkable: a correlation of 0.71 using only simple measures, which proved that computer programs could predict grades quite reliably; at least, the grades given by the computer correlated with the human judges as well as the humans correlated with each other.
Starting from Page’s metrics [5] for automatically grading essays, and taking into consideration Slotnick’s method [5] of grouping them according to their intrinsic values, the following factors and values were identified in order to evaluate each participant at the surface level only:
Table 1. Categories taken into consideration and corresponding proxes

1. Fluency: number of total characters, number of total words, number of different words, mean number of characters per utterance, number of utterances, number of sentences (different, because multiple sentences can be identified in one utterance)
2. Spelling: misspelled words; in order to obtain a positive measure (the greater the percentage, the better), the percentage of correctly written words is used
3. Diction: mean and standard deviation of word length
4. Utterance Structure: number of utterances, mean utterance length in words, mean utterance length in characters
All the above proxes determine the average consistency of utterances. Although simple, all these factors play an important role in discovering the most important person in a chat, in other words in measuring a participant’s activity. In addition, quantity is also important when analyzing each participant’s utterances.
Each factor has the same weight within its quality category, and the overall grade is obtained as the arithmetic mean of all predefined values. All these factors, except misspelled words, are converted into percentages in order to scale them and to obtain a relative mark for all participants.
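The relative scaling can be sketched as follows (an illustrative sketch; the participant names and counts are hypothetical):

```python
def scale_to_percentages(values):
    """Scale one factor's raw values across participants so that they
    sum to 100%, yielding relative marks."""
    total = sum(values.values())
    return {p: 100.0 * v / total for p, v in values.items()}

# Hypothetical total word counts per participant:
words = {"ana": 420, "bob": 180, "cris": 600}
print(scale_to_percentages(words))  # -> {'ana': 35.0, 'bob': 15.0, 'cris': 50.0}
```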
The second factor taken into account is readability. It can be defined as the reading ease of a particular text, especially as it results from one’s writing style. This factor is very important because extensive research in this field shows that easy-reading text (in our case, chats and utterances) has a great impact on comprehension, retention, reading speed, and reading persistence.
Because readability implies the interaction between a participant and the collaborative environment, several features from the reader’s point of view are essential: prior knowledge, personal skills and traits (for example intelligence), interest, and motivation.
In the currently evaluated chats, the first factor (prior knowledge) can be considered approximately the same for all students because all come from the same educational environment and share a common background. On the other hand, the remaining features vary greatly from one student to another, and the last two are strongly reflected in their involvement in the chat.
Therefore two key aspects must be taken into consideration: involvement and
competency, both evaluated from the social network’s point of view and with a
semantic approach which will be detailed further in this paper.
Starting from Jacques Barzun’s quote, “Simple English is no person’s native tongue”, it is very difficult to write for a class of readers other than one’s own; therefore readability plays an important role in understanding a chat. Although in a chat environment some words are omitted and syntax is usually simplified, readability still offers a good perspective on one’s current level of knowledge/understanding or, in some cases, attitude, but all the information obtained from readability measures must be correlated with other factors.
Readability is commonly assessed unconsciously, based on insight into the other chat participants, but for its evaluation a readability formula is used, which is calibrated against a more labor-intensive readability survey and which matches the overall text with the expected reading level of the audience [4]. These formulas estimate the reading skill required to read the utterances in a chat and evaluate the overall complexity of the words used, therefore providing the means to target an audience.
Three formulas were computed. The Flesch Reading Ease Readability Formula (http://www.readabilityformulas.com/flesch-reading-ease-readability-formula.php) is one of the oldest and most accurate readability formulas, providing a simple approach to assess the grade level of a chat participant and the difficulty of reading the current text. This score rates all utterances of a user on a 100-point scale: the higher the score, the easier the text is to read (not necessarily to understand). A score of 60 to 70 is considered optimal.

RE = 206.835 − (1.015 × ASL) − (84.6 × ASW) .    (1)

RE is the Readability Ease, ASL is the Average Sentence Length (the number of words divided by the number of sentences) and ASW is the Average number of Syllables per Word (the number of syllables divided by the number of words).
The Gunning’s Fog Index (or FOG) Readability Formula (http://www.readabilityformulas.com/gunning-fog-readability-formula.php) is based on Robert Gunning’s opinion that newspapers and business documents were full of “fog” and unnecessary complexity. The index indicates the number of years of formal education a reader of average intelligence would need to understand the text on a first reading. A drawback of the Fog Index is that not all multi-syllabic words are difficult; for computational purposes, however, all words with more than 2 syllables are considered complex.

FOG = 0.4 × (ASL + PHW) .    (2)

ASL is the Average Sentence Length (the number of words divided by the number of sentences) and PHW is the Percentage of Hard Words (in the current implementation, words with more than 2 syllables and not containing a dash).
The Flesch Grade Level Readability Formula (http://www.readabilityformulas.com/flesch-grade-level-readability-formula.php) rates utterances on U.S. grade school level, so a score of 8.0 means that the document can be understood by an eighth grader. This score makes it easier to judge the readability level of various texts in order to assign them to students. A document whose score is between 7.0 and 8.0 is considered optimal, since it will be highly readable.

FKRA = (0.39 × ASL) + (11.8 × ASW) − 15.59 .    (3)

FKRA is the Flesch-Kincaid Reading Age, ASL is the Average Sentence Length (the number of words divided by the number of sentences) and ASW is the Average number of Syllables per Word (the number of syllables divided by the number of words).
For each given chat, the system evaluates all three formulas and provides the user with detailed information for each participant. Relative correlations between these factors and the manual annotation grades are also computed in order to evaluate their relevance to the overall grading process.
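The three scores can be computed from simple counts. The sketch below uses the standard published coefficients of each formula; the function names and sample counts are ours:

```python
def flesch_reading_ease(words, sentences, syllables):
    """RE = 206.835 - 1.015*ASL - 84.6*ASW (standard coefficients)."""
    asl = words / sentences
    asw = syllables / words
    return 206.835 - 1.015 * asl - 84.6 * asw

def gunning_fog(words, sentences, hard_words):
    """FOG = 0.4 * (ASL + PHW), with PHW a percentage of hard words."""
    asl = words / sentences
    phw = 100.0 * hard_words / words
    return 0.4 * (asl + phw)

def flesch_kincaid_grade(words, sentences, syllables):
    """FKRA = 0.39*ASL + 11.8*ASW - 15.59 (standard coefficients)."""
    asl = words / sentences
    asw = syllables / words
    return 0.39 * asl + 11.8 * asw - 15.59

# Hypothetical counts for one participant's utterances:
print(flesch_reading_ease(120, 10, 180))   # falls in the optimal 60-70 band
print(gunning_fog(120, 10, 18))
print(flesch_kincaid_grade(120, 10, 180))
```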
2.2 Social Networks Analysis
In addition to the quantity and quality measures computed from the utterances, social factors are also taken into account in our approach. Consequently, a graph is generated from the chat transcript based on the utterances exchanged by the participants. Nodes are participants in the collaborative environment and ties are generated from explicit links (obtained through the explicit referencing facility of the chat environment used [6], which enables participants to manually add links during the conversation, marking utterances derived from a specific one).
From the point of view of social networks, various metrics are computed in order to determine the most competitive participant in the chat: degree (indegree, outdegree), centrality (closeness centrality, graph centrality, eigenvalues) and a user ranking similar to the well-known Google PageRank algorithm [7]. These metrics are first applied to the effective number of interchanged utterances between participants, providing a quantitative approach. Second, the metrics are applied to the sum of utterance marks based on a semantic evaluation of each utterance; the evaluation process will be discussed in section 2.5 and, based on the results obtained for each utterance, a new graph is built on which all social metrics are applied. This provides the basis for a qualitative evaluation of the chat.
All the metrics used in the social network analysis are relative, in the sense that they provide markings relevant only in comparison with other participants in the same chat, not with those from other chats. This is the main reason why all factors are scaled between all the participants, giving each participant a weighted percentage of the overall performance of all participants.
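As an illustration, the simplest of these metrics, in/out-degree on the weighted adjacency matrix, can be computed as follows (a minimal sketch; the matrix values are hypothetical):

```python
def degrees(adj):
    """In-degree and out-degree from a weighted adjacency matrix where
    adj[i][j] holds the utterances (or summed marks) sent from i to j."""
    n = len(adj)
    outdeg = [sum(adj[i]) for i in range(n)]
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return indeg, outdeg

# Hypothetical 3-participant chat: adj[i][j] = utterances from i to j.
adj = [[0, 4, 1],
       [2, 0, 3],
       [1, 1, 0]]
print(degrees(adj))  # -> ([3, 5, 4], [5, 5, 2])
```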
2.3 LSA and the Corresponding Learning Process
Latent Semantic Analysis is a technique based on the vector-space model [10, 14]. It is used for analyzing relationships between a set of documents and the terms they contain by projecting them onto sets of concepts related to those documents [9, 10]. LSA starts from a term-document matrix which describes the occurrences of each term in the corpus documents. LSA transforms this occurrence matrix into a relation between terms and concepts, and a relation between those concepts and the corresponding documents. Thus, the terms and the documents are now indirectly related through concepts [10, 13]. This transformation is obtained by a singular value decomposition of the matrix and a reduction of its dimensionality.
Our system uses words from a chat corpus. The first step in the learning process, after spell-checking, is stop word elimination (very frequent and irrelevant words like “the”, “a”, “an”, “to”, etc.) from each utterance. The next step is POS tagging; verbs are additionally stemmed in order to decrease the number of corresponding forms identified in chats, keeping track only of the verb’s stem (the meaning of all forms is actually the same, but in LSA only one form is learnt). All other words are left in their identified forms, with the corresponding tag added, because the same words with different POS tags have different contextual senses and therefore different semantic neighbors [11].
Once the term-document matrix is populated, Tf-Idf (term frequency - inverse document frequency [13]) is computed. The final steps are the singular value decomposition (SVD) and the projection of the matrix in order to reduce its dimensions. According to [12], the optimal empirical value for k is 300, a value used in the current experiments and on which multiple sources concur.
Another important aspect of the LSA learning process is segmentation, the process of dividing chats into units with similar meaning and high internal cohesion. In the current implementation, the chat is divided between participants because of the assumed unity and cohesion between utterances from the same participant. These documents are afterwards divided into segments using fixed non-overlapping windows. Contiguous segments are less effective here because of the intertwined themes present in chats; these aspects will be dealt with in future improvements of the marking system.
LSA is used for evaluating the proximity between two words by the cosine measure:

Sim(word1, word2) = ∑_{i=1..k} word1,i × word2,i / ( √(∑_{i=1..k} word1,i²) × √(∑_{i=1..k} word2,i²) ) .    (4)
Similarities between utterances, and similarities of utterances with the entire document, are used in order to assess the importance of each utterance compared with the entire chat or with a predefined set of keywords referenced as a new document:

Vector(utterance) = ∑_i (1 + log(no_occurrences(word_i))) × vector(word_i) .    (5)

Sim(utterance1, utterance2) = Sim(Vector(utterance1), Vector(utterance2)) .    (6)
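Equations (4)-(6) can be sketched directly (an illustrative sketch over a tiny hypothetical 2-dimensional LSA space; all names and vectors are ours):

```python
import math

def cosine(u, v):
    """Cosine similarity between two k-dimensional LSA vectors (eq. (4))."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def utterance_vector(word_vectors, counts):
    """eq. (5): sum of word vectors weighted by 1 + log(occurrences)."""
    k = len(next(iter(word_vectors.values())))
    vec = [0.0] * k
    for word, n in counts.items():
        weight = 1 + math.log(n)
        for i, x in enumerate(word_vectors[word]):
            vec[i] += weight * x
    return vec

# Hypothetical 2-dimensional LSA space with two word vectors:
wv = {"chat": [1.0, 0.0], "forum": [0.6, 0.8]}
u1 = utterance_vector(wv, {"chat": 1})
u2 = utterance_vector(wv, {"chat": 1, "forum": 1})
print(cosine(u1, u2))  # eq. (6): similarity of the two utterance vectors
```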
2.4 The Utterance and Participants’ Evaluation Process
2.4.1 The Utterance Marking Process
The first aspect that needs to be addressed is building the graph of utterances, which highlights the correlations between utterances on the basis of explicit references.
In order to evaluate each utterance, after finishing the morphological and lexical analysis, three steps are performed:
1. Evaluate each utterance individually, taking into consideration several features: the effective length of the initial utterance; the number of occurrences of all keywords which remain after eliminating stop words, spell-checking and stemming; the level at which the current utterance is situated in the overall thread (similar to a breadth-first search in the utterance space/threads based only on explicit links); the branching factor corresponding to the actual number of utterances derived from the current one; the correlation/similarity with the overall chat; the correlation/similarity with a predefined set of topics of discussion.
This mark combines the quantitative approach (the length of the utterance, starting from the assumption that a piece of information should be more valuable if transmitted in multiple messages, linked together, and expressed in more words, meaningful in context rather than used merely to impress) with a qualitative one (the use of LSA and keywords).
In the process of evaluating each utterance, the semantic value is assessed via the similarity between the terms used in the current utterance (those remaining after preliminary processing) and the whole document, and respectively those from a list of predefined topics of discussion.
The formulas used for evaluating each utterance are:

mark_empiric = length(initial_utterance) / 10 + (9/10) × ∑_{word ∈ remaining} mark(word) × emphasis .    (7)

mark(word) = length(word) × (1 + log(no_occurrences)) .    (8)

emphasis = (1 + log(level)) × (1 + log(branching_factor)) × Sim(utterance, whole_document) × Sim(utterance, predefined_keywords) .    (9)
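A sketch of this marking step, equations (7)-(9), with illustrative names and hypothetical inputs of our own choosing:

```python
import math

def mark_word(word, occurrences):
    # eq. (8): mark(word) = length(word) * (1 + log(no_occurrences))
    return len(word) * (1 + math.log(occurrences))

def mark_empiric(utterance_len, word_counts, level, branching,
                 sim_doc, sim_topics):
    """eq. (7), with the 'emphasis' factor of eq. (9)."""
    emphasis = ((1 + math.log(level)) * (1 + math.log(branching))
                * sim_doc * sim_topics)
    keyword_sum = sum(mark_word(w, n) for w, n in word_counts.items())
    return utterance_len / 10 + (9 / 10) * keyword_sum * emphasis

# Hypothetical utterance: 50 characters, keyword "wiki" occurring twice,
# thread level 1, branching factor 1, LSA similarities 0.8 and 0.7.
print(mark_empiric(50, {"wiki": 2}, 1, 1, 0.8, 0.7))
```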
2. Emphasize Utterance Marks. Each thread obtained by chaining utterances based
upon explicit links has a global maximum around which all utterance marks are
increased correspondingly with a Gaussian distribution:
p(x) = (1 / (σ √(2π))) × e^(−(x − µ)² / (2σ²)) , where:    (10)

σ = (max(id_utter_thread) − min(id_utter_thread)) / 2 ;    (11)

µ = id_utterance_with_highest_mark .    (12)

Therefore each utterance mark is multiplied by a factor of 1 + p(current_utterance).
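The boosting step can be sketched as follows (illustrative code; taking utterance ids as consecutive thread positions is our simplifying assumption):

```python
import math

def gaussian_boost(marks):
    """Boost each mark in a thread by 1 + p(position), where p is a
    Gaussian centred on the highest-scoring utterance (eqs. (10)-(12));
    utterance ids are taken as consecutive thread positions here."""
    ids = list(range(len(marks)))
    mu = max(ids, key=lambda i: marks[i])        # eq. (12)
    sigma = (max(ids) - min(ids)) / 2 or 1.0     # eq. (11); guard for 1-utterance threads
    def p(x):
        return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                / (sigma * math.sqrt(2 * math.pi)))
    return [m * (1 + p(i)) for i, m in zip(ids, marks)]

boosted = gaussian_boost([1.0, 3.0, 2.0, 1.5])
print(boosted[1])  # the thread maximum receives the largest boost
```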
3. Determine the final grade for each utterance in the current thread. Based upon the empiric mark, the final mark is obtained for each utterance in its corresponding thread:

mark_final = mark_final(prev_utter) + coefficient × mark_empiric ,    (13)

where the coefficient is determined from the type of the current utterance and of the one to which it is tied.
For the coefficient determination, the identification of speech acts plays an important role: verbs, punctuation signs and certain keywords are inspected. Starting from a set of predefined types of speech acts, the coefficients are obtained from a predefined matrix. These predefined values were determined after analyzing and estimating the impact of the current utterance considering only the previous one in the thread (similar to a Markov process). The grade of a discussion thread may be raised or lowered by each utterance. Therefore, depending on the type of an utterance and the identified speech acts, the final mark may have a positive or negative value.
2.4.2 Participant Grading
The in-degree, out-degree, closeness and graph centrality, eigenvalue and rank factors are applied both to the matrix with the numbers of interchanged utterances between participants and to the matrix which uses the empiric mark of an utterance instead of the default value of 1. In the second approach quality, not quantity, is important (an element [i, j] equals the sum of mark_empiric over all utterances from participant i to participant j), providing a deeper analysis of chats using a social network approach based on a semantic utterance evaluation.
Each of the analysis factors (applied on both matrices) is converted to a percentage
(current grade / sum of all grades for that factor, except for eigen centrality,
where the conversion is made by multiplying the absolute value of the corresponding
eigenvalue by 100). The final grade takes into consideration
all these factors (including those from the surface analysis) and their corresponding
weights:
final_grade_i = Σ_k (weight_k × percentage_k,i) ,  (24)
where k is a factor used in the final evaluation of the participant i and the weight of
each factor is read from a configuration file.
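The percentage conversion and the weighted combination of Eq. (24) might be sketched as follows; the factor names, weights and participant scores are illustrative, not the system's actual configuration:

```python
def final_grade(factor_scores, weights):
    """factor_scores[k][i]: raw score of participant i on factor k.
    Each factor is converted to a percentage of its column sum, then
    combined with the per-factor weights as in Eq. (24)."""
    grades = {}
    for k, scores in factor_scores.items():
        total = sum(scores.values())
        for i, s in scores.items():
            pct = 100.0 * s / total if total else 0.0   # percentage per factor
            grades[i] = grades.get(i, 0.0) + weights[k] * pct
    return grades

# Two invented factors and two participants.
scores = {
    "in_degree":  {"alice": 6, "bob": 2},
    "page_grade": {"alice": 3, "bob": 1},
}
weights = {"in_degree": 0.05, "page_grade": 0.10}
g = final_grade(scores, weights)
```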
After all measures are computed, the Pearson correlation of each factor with the
grades from the human evaluators is determined, providing the means to assess its
importance and relevance compared with the manual grades taken as reference.
General information about the chat – for example the overall grade correlation and the
absolute and relative correctness – is also determined and displayed by the system.
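The per-factor Pearson correlation can be computed directly from its definition; a small self-contained sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between automatic factor scores and manual grades."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A factor whose scores rise and fall with the manual grades yields a value close to 1, which is how the relevance of each factor is judged against the reference.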
2.5 Optimizing each Metric’s Grade
The goal of the designed algorithm is to determine the optimal weight for each
factor in order to obtain the highest correlation with the manual annotator grades.
A series of constraints had to be applied. First, minimum/maximum values for each
weight are considered: a minimum of 2% in order to take into
account at least a small part of each factor, and a maximum of 40% in order to give
all factors a chance and not simply obtain a solution with all factors at 0% besides the
one with the best overall correlation at 100%. Second, the sum of all weights must be
100%. Third, the mean correlation over all chats in the corpus must be maximized.
In this case, the system has two components. A perceptron is used for obtaining
fast solutions as inputs for the genetic algorithm. The main advantages of using this
kind of network are its capacity to learn and adapt from examples, its fast
convergence and numerical stability, its search in the weight space for an optimal solution,
and the duality and correlation between inputs and weights.
Secondly, a genetic algorithm is used for fine-tuning the solutions given by the
neural network, while also respecting the predefined constraints. This algorithm
operates over a population of chromosomes which represent potential solutions. Each
generation represents an approximation of the solution: the determination of optimal
weights that assure the best overall correlation, not the best distance between
automatic grades and annotator ones. Correlation is expressed as an arithmetic mean
of all per-chat correlations because of the differences between evaluator styles.
The goal of this algorithm is to maximize the overall correlation, and the specific
characteristics of the implemented algorithm are:
− Initialization: 2/3 of the initial population is obtained via the neural network
(perceptron), the rest is randomly generated in order to avoid local optima;
− Fixed number of 100 chromosomes per population;
− Fitness - overall correlation of all chats from the corpus evaluated as a mean
of all individual correlations;
− Selection – roulette-based or elitist selection: the higher the fitness, the
greater the possibility that a chromosome is selected for crossover;
− Correction – an operator necessary to assure that the initial constraints
are satisfied: if a weight is above or below the minimum/maximum values, it is
reinitialized starting from the threshold plus a random quantity; if the overall sum of
percentages differs from 100%, weights are randomly adjusted in steps of
1/precision;
− Crossover – based on Real Intermediate Recombination, which has the
highest dispersion of newly generated weights: select a random alpha for
each factor in [−0.25; 1.25]; the relative distance between the 2
chromosomes selected for crossover must be at least 20% in order to apply
the operator to them;
− Use CHC optimization, with a small modification: generate N children and
retain the 20% best newly generated chromosomes; the 20% best parents
are kept in the new generation and the rest is made of the best remaining
individuals;
− Multiple populations that exchange their best individuals: after every 10
generations, add the best individual to a common list and replace the worst
individual with one randomly selected from the list;
− After a population reaches convergence (20% of the maximum number of
generations consecutively have the same best individual), reinitialize the
population: keep the best 10% of existing individuals, obtain 30% via the neural
network, and generate the remainder randomly.
The solution for determining the optimal weights combines the two approaches in
order to obtain benefits from both – numerically stable solutions from neural networks
and the flexibility of genetic algorithms in adjusting these partial solutions.
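The correction and crossover operators described above can be sketched as follows. This is a simplified illustration, not the system's implementation: the bounds are the 2%/40% constraints from Section 2.5, and the final renormalisation is a single pass rather than the iterative 1/precision adjustment the paper describes:

```python
import random

W_MIN, W_MAX = 0.02, 0.40  # per-factor bounds from the constraints above

def correct(weights):
    """Clamp each weight into [W_MIN, W_MAX] with a small random offset,
    then renormalise so the weights sum to 1 (i.e. 100%)."""
    w = []
    for x in weights:
        if x < W_MIN:
            x = W_MIN + random.uniform(0, 0.01)
        elif x > W_MAX:
            x = W_MAX - random.uniform(0, 0.01)
        w.append(x)
    s = sum(w)
    return [x / s for x in w]

def intermediate_recombination(a, b):
    """Real intermediate recombination: child_k = a_k + alpha_k * (b_k - a_k),
    with an independent alpha in [-0.25, 1.25] per factor, then corrected."""
    child = [x + random.uniform(-0.25, 1.25) * (y - x) for x, y in zip(a, b)]
    return correct(child)
```

Because alpha may fall outside [0, 1], children can lie slightly outside the segment between the parents, which is what gives this recombination its high dispersion.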
3 System Evaluation
The initial running configuration used by the system was: 10% for Page's Grading,
5% for the social network factors applied on the number of interchanged utterances, and
10% for the semantic social network factors applied on utterance marks. The overall
results obtained with these weights are: Relative correctness: 77.44%, Absolute
correctness: 70.07%, Correlation: 0.514.
Relative and absolute correctness represent the relative/absolute distances
in a one-dimensional space between the annotator's grade and the grade obtained
automatically by the Ch.A.M.P. system, computed over the given
corpus. Finally, the overall results (as arithmetic means of each of the 3 individual
measures determined per chat) are also displayed.
The results after multiple runs of the weight optimization system (all with 4
concurrent populations) show that the most importance in the manual evaluation process
is given to the following factors:
Table 2. Results after multiple runs of the weight optimization system, with regards to factors with a corresponding percentage ≥ 10%

Percentage  Factor
30-40%      Page's Grading methods – so only surface analysis factors
20-25%      Indegree from the social network's point of view, applied on the number of interchanged utterances
10-15%      Outdegree, also determined by the number of outgoing utterances – somehow a participant's gregariousness measure
≈ 10%       Semantic graph centrality – the only measure with a higher importance which relies on utterance marks
Utterances Assessment in Chat Conversations
333
All remaining factors are weighted below 5% and therefore do not have high importance
in the final grading process. The overall results, with regards to correlation
optimization, obtained after running the genetic algorithm are: Relative correctness:
≈ 46.83%, Absolute correctness: ≈ 45.70%, Correlation: ≈ 0.594.
Fig. 1. Convergence to an optimal solution using 4 populations with the visualization of
optimum/average chromosomes
The spikes in each population's average fitness are caused by newly
inserted individuals or by the population reinitialization. After the first 10 iterations
important improvements can be observed, whereas after 30 generations the optimum
chromosomes of each population stagnate. Only population reinitializations and
chromosome interchanges provide minor improvements in the current solution.
Our results entail several conclusions: the human grading process uses a
predominantly quantitative approach; uncorrelated evaluations and the different
styles/principles used by different human annotators are the main causes lowering
the overall correlation and correctness; the improvement in correlation came at the
detriment of absolute/relative correctness; convergence of the genetic algorithm can
be assumed after 30 generations.
4 Conclusions
The results obtained with our system allow us to conclude that the evaluation of a
participant's overall contribution in a chat environment can be achieved. We also
strongly believe that further tuning of the weights, better LSA learning and an
increased number of social network factors (including those applied to the entire
network) will improve the performance and reliability of the results. Moreover,
the subjective factor in manual evaluation is also present and influences the overall
correctness.
At present, evaluation and tuning of the assessment system are performed in the
LTfLL project, in which the work presented in this paper is one of the modules for
feedback generation [16].
Acknowledgements
The research presented in this paper was partially performed within the FP7 EU
STREP project LTfLL. We would like to thank all the students of the Computer Science
Department, University "Politehnica" of Bucharest, who participated in our
experiments and provided the inputs for generating our gold standard.
References
1. Stahl, G.: Group Cognition: Computer Support for Building Collaborative Knowledge. MIT Press (2006)
2. Anderson, J. R.: Cognitive Psychology and its Implications. Freeman, New York (1985)
3. Page, E. B., Paulus, D. H.: Analysis of Essays by Computer. Predicting Overall Quality. U.S. Department of Health, Education and Welfare (1968)
4. http://www.streetdirectory.com/travel_guide/15672/writing/all_about_readability_formulas__and_why_writers_need_to_use_them.html
5. Wresch, W.: The Imminence of Grading Essays by Computer – 25 Years Later. Computers and Composition 10(2), 45–58, retrieved from http://computersandcomposition.osu.edu/archives/v10/10_2_html/10_2_5_Wresch.html (1993)
6. Holmer, T., Kienle, A., Wessner, M.: Explicit Referencing in Learning Chats: Needs and Acceptance. In: Nejdl, W., Tochtermann, K. (eds.): Innovative Approaches for Learning and Knowledge Sharing – ECTEL, LNCS 4227, Springer, pp. 170–184 (2006)
7. Dascălu, M., Chioaşcă, E.-V., Trăuşan-Matu, S.: ASAP – An Advanced System for Assessing Chat Participants. In: Dochev, D., Pistore, M., Traverso, P. (eds.): AIMSA 2008, LNAI 5253, Springer, pp. 58–68 (2008)
8. Bakhtin, M. M.: Problems of Dostoevsky's Poetics (edited and translated by Caryl Emerson). University of Minnesota Press, Minneapolis (1993)
9. http://lsa.colorado.edu/
10. Landauer, T. K., Foltz, P. W., Laham, D.: An Introduction to Latent Semantic Analysis. Discourse Processes 25, 259–284 (1998)
11. Wiemer-Hastings, P., Zipitria, I.: Rules for Syntax, Vectors for Semantics. In: Proceedings of the 23rd Annual Conference of the Cognitive Science Society (2001)
12. Lemaire, B.: Limites de la lemmatisation pour l'extraction de significations. In: JADT 2008: 9es Journées internationales d'Analyse statistique des Données Textuelles (2008)
13. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (Mass.) (1999)
14. Miller, T.: Latent Semantic Analysis and the Construction of Coherent Extracts. In: Nicolov, N., Botcheva, K., Angelova, G., Mitkov, R. (eds.): Recent Advances in Natural Language Processing III. John Benjamins, Amsterdam/Philadelphia, pp. 277–286 (2004)
15. Fernandez, S., Velazquez, P., Mandin, S.: Les systèmes de résumé automatique sont-ils vraiment des mauvais élèves? In: JADT 2008: 9es Journées internationales d'Analyse Statistique des Données Textuelles (2008)
16. Trausan-Matu, S., Rebedea, T.: A Polyphonic Model and System for Inter-animation Analysis in Chat Conversations with Multiple Participants. In: Gelbukh, A. (ed.): CICLing 2010, LNCS 6008, Springer, pp. 354–363 (2010)
Punctuation Detection
with Full Syntactic Parsing
Miloš Jakubíček and Aleš Horák
Faculty of Informatics
Masaryk University
Botanická 68a, 602 00 Brno
Czech Republic
xjakub@fi.muni.cz, hales@fi.muni.cz
Abstract. The correct placement of punctuation characters is in many
languages, including Czech, driven by complex guidelines. Although
those guidelines use information from morphology, syntax and semantics,
state-of-the-art systems for punctuation detection and correction are limited
to simple rule-based backbones. In this paper we present a syntax-based
approach utilizing the Czech parser synt. This parser uses an adapted
chart parsing technique for building the chart structure for the sentence.
synt can then process the chart and provide several kinds of output
information. The implemented punctuation detection technique utilizes
the synt output in the form of automatic and unambiguous extraction
of optimal syntactic structures from the sentence (noun phrases, verb
phrases, clauses, relative clauses or inserted clauses). Using this feature
it is possible to obtain information about syntactic structures related to
expected punctuation placement. We also present experiments proving
that this method covers most syntactic phenomena needed for
punctuation detection and correction.
1 Introduction
Incorrect usage of punctuation characters such as comma, semi-colon, or dash
has always been a common mistake in written texts, especially in languages with
such complex (and strict) guidelines for punctuation placement as in the case of
the Czech language [1]. It has been shown that mistakes in punctuation (21 % of
all errors) represent the second most frequent category of mistakes in Czech – the
first position is occupied by stylistics with 23 % [2].
It is no wonder that automatic detection and correction of punctuation for
Czech is still an open task: the state-of-the-art systems usually achieve both precision
and recall of only around 50 % [3], as they use only a small-to-medium set of rules
matching the simplest punctuation guidelines. It is obvious that such methods
do not cover phenomena on the syntactic or semantic levels and that such an approach
cannot be easily adapted to do so.
In this paper we focus on overcoming the deficiencies of current punctuation
detection systems with analysis on the syntactic level by using synt, a powerful
© A. Gelbukh (Ed.)
Special issue: Natural Language Processing and its Applications.
Research in Computing Science 46, 2010, pp. 335-343
Received 27/11/09
Accepted 16/01/10
Final version 10/03/10
336
Jakubíček M., Horák A.
and feature-rich syntactic parser for Czech (more on the parser in Section 2).
The main idea of the technique is that since the parser is able to parse (i. e.
recognize) punctuation in the input, it might also be able to fill in (i. e. generate)
missing punctuation. We show that with a relatively small set of post-processing
rules this method achieves significantly better results than the current systems
(as described in [3]).
The synt parser provides several kinds of output information based on
the packed chart structure containing the resulting analyses. Besides the standard
enumeration of all phrasal or dependency trees, synt can compute the optimal
decomposition of the input text into selected syntactic structures such as noun,
verb or prepositional phrases, clauses etc. The extraction of structures (described
in detail in Section 3) allows us to obtain all the syntactic information
needed for filling in the punctuation in the given sentence.
The structure of this paper is as follows: in Sections 2 and 3 we briefly
describe the synt parser and the extraction of structures, in Section 4 we
explain how we adapt and use it for punctuation detection, and finally we
present the results of our work in Section 5.
1.1 Related Work
Since the Czech rules for placing punctuation are so complicated, this topic has
been addressed by several authors, however with only partial success. There are
two commercially available products which try to tackle this task: Grammaticon [4] and Grammar Checker [5], which is included in the Czech localisation
of the MS Word text editor. A comparison of both systems was made by Pala
in 2005 [3], showing that both of them lack recall in particular. A proof-of-concept
system trying to use Inductive Logic Programming to solve this task has also
been presented in [6].
The main problem of the current solutions is that they are designed as rather
simple rule-based systems with a set of (hard-coded or induced, context-free or
context-sensitive) rules trying to describe the placement of Czech punctuation.
Although many of the principles for placing punctuation have a syntactic background,
none of these systems applies full syntactic parsing to this task. In this paper
we utilize the Czech parser synt and show that a syntax-based approach yields
promising results.
2 The synt Parser
The Czech parser synt [7, 8] has been developed in the Natural Language Processing Centre at Masaryk University. It performs a head-driven chart-type syntactic analysis based on the provided context-free grammar with additional contextual tests and actions. For easy maintenance, this grammar is edited in the
form of a metagrammar (with about 200 rules) from which the full grammar
(with almost 4,000 rules) can be automatically derived. The per-rule defined
contextual actions are used to cover phenomena such as case-number-gender
agreement.
Recent measurements [9, p. 77] have shown that synt achieves very
good coverage (above 90 %), but the analysis is highly ambiguous: for some sentences, even millions of output syntactic trees can occur. Several strategies have been developed to cope with the ambiguity: first, the grammar rules are divided
into different priority levels which are used to prune the resulting set of output
trees. Second, every resulting chart edge is assigned a ranking value from which
the ranking of the whole tree can be efficiently computed in order to sort the
trees and output the N best trees in polynomial time.
The synt parser contains (besides the chart parsing algorithm itself) many
additional functions, such as maximum optimal coverage of partial analyses (shallow parsing) [10], effective selection of the n best output trees [11] or chart and tree
beautification [12]. The punctuation detection technique uses the function of
extraction of syntactic structures [13], which is described in detail in the next
section.
3 Extraction of Phrase Structures
Usually, a derivation tree is presented as the main output of the syntactic parsing of
natural languages, but currently most syntactic analysers for Czech lack
precision. However, there are many situations in which it is not necessary, and
sometimes not even desirable, to require such derivation trees as the output of
syntactic parsing, be it simple information extraction and retrieval, transformation of sentences into a predicate-argument structure, or any other case in
which we rather need to process whole syntactic structures in the given sentence,
especially noun, prepositional and verb phrases, numerals or clauses. Moreover,
so as not to face the same problems as with the tree parser output, we need
to identify these structures unambiguously.
The phrase extraction functionality in synt enables us to obtain, in a number of
ways, syntactic structures from the analysed sentence that correspond to a given grammar
nonterminal. For the purpose of phrase extraction, the internal
parsing structure of synt is used: the so-called chart, a multigraph which is built
up during the analysis and holds all the resulting trees. An important feature of
the chart is its polynomial size [7, p. 133], which makes it a structure suitable for
further effective processing1 – as the number of output trees can be exponential
in the length of the input sentence, processing each tree separately would be
computationally infeasible.
Moreover, as the algorithm works directly with the chart and not with trees, it is very
fast and can be used for processing massive amounts of data in a short time, even
if we are extracting many structures at once (see Table 1 for details).
The output of the extraction is shown in the following two examples of extracting clauses:
1 By processing the chart we refer to the result of the syntactic analysis, i.e. to the state of the chart after the analysis.
338
Jakubíček M., Horák A.
– Example 1.
Input:
Muž, který stojí u cesty, vede kolo.
(A man who stands at the road leads a bike.)
Output:
[0-9): Muž , , vede kolo
(a man leads a bike)
[2-6): který stojí u cesty
(who stands at the road)
– Example 2.
Input:
Vidím ženu, která drží růži, jež je červená.
(I see a woman who holds a flower which is red.)
Output:
[0-3): Vidím ženu ,
(I see a woman)
[3-7): která drží růži ,
(who holds a flower)
[7-10): jež je červená
(which is red)
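The half-open interval notation [i-j) used in the examples can be reproduced with a small helper; the token list and rendering function below are illustrative only, not synt's actual API:

```python
def render_spans(tokens, spans):
    """Render extracted structures as half-open token intervals [i-j),
    mirroring the output format of the examples above."""
    lines = []
    for start, end in spans:
        lines.append("[%d-%d): %s" % (start, end, " ".join(tokens[start:end])))
    return lines

# Tokenized input of Example 1; the relative clause covers tokens 2..5.
tokens = ["Muž", ",", "který", "stojí", "u", "cesty", ",", "vede", "kolo"]
relative_clause = (2, 6)
```

Representing extracted structures as token intervals over the sentence is what allows the unambiguous projection from the chart back to the surface form.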
4 Punctuation Detection
The main idea of using the syntactic parser with extraction of structures for
punctuation detection is as follows: since the parser has to expect (i. e. match
and parse) the actual presence of punctuation using its grammar rules anyway, we
may try to modify those grammar rules in such a way that enables2 punctuation
detection even if the punctuation is (by mistake) not present in the sentence,
and then extract the related (empty) punctuation nonterminals as described in
2 By allowing empty productions in the grammar.
Table 1. Results of detection on a sample set of 500 sentences.†

Step                              Precision   Recall
Adding ε-rules                    35.37 %     20.12 %
Further grammar modifications     69.43 %     57.31 %
Matching coordinations            82.27 %     84.76 %

Sentences: 500
Average time needed per sentence: 0.65 s

† The precision and recall were measured across the whole sentence set.
Fig. 1. Example tree for an input sentence Proto jsme zůstali u mýdel, pracích prostředků, zubních past, rostlinných olejů a ztužených tuků (Thus we continued with soaps, detergents, tooth pastes, vegetable oils and hardened fats).
the previous section. Since the extraction of the punctuation basically represents a projection from the chart structure to the sentence surface structure, the
missing punctuation can be identified as a post-analysis step.
The difference between parsing a present punctuation mark and detecting
a missing one is displayed in Figures 1 and 2: while the first represents a regular derivation tree for a sentence containing punctuation, the other
shows an almost identical derivation tree that was produced from
a sentence in which no punctuation (except the trailing full stop) was present,
yet the parser still deduced the correct placement of the comma nonterminals. The
whole process of punctuation detection is demonstrated in Figure 3.
As the first step, we modified the relevant grammar punctuation rules
to allow empty productions.3 Even this trivial change in the grammar led to
a recall below 20 % (see Table 1 for details), therefore we further analysed the
grammar rules and improved them to better fit the purpose of punctuation detection.
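The grammar transformation can be illustrated on a toy context-free grammar; the dictionary representation is invented for the example and is not synt's metagrammar format:

```python
# For every rule that derives a comma terminal, add an empty (epsilon)
# production so the parser can hypothesise a comma even when it is missing.
EPSILON = ()

def add_epsilon_rules(grammar):
    """grammar: dict mapping nonterminal -> list of right-hand sides (tuples).
    Returns a copy where each comma-deriving nonterminal also derives epsilon."""
    out = {nt: list(rhss) for nt, rhss in grammar.items()}
    for nt, rhss in grammar.items():
        if any(rhs == (",",) for rhs in rhss) and EPSILON not in out[nt]:
            out[nt].append(EPSILON)   # comma -> epsilon
    return out

g = {"comma": [(",",)], "np": [("N",), ("ADJ", "np")]}
g2 = add_epsilon_rules(g)
```

An empty `comma` edge in the resulting chart then marks exactly the surface position where a comma should have appeared, which is what the post-analysis projection step reads off.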
The grammar modifications mainly focused on improving the ability to parse
(and hence also detect) punctuation in common Czech sentences (especially between clauses, relative clauses and conjunctions); the recall increased to almost 60 %.

3 I. e. for each rule covering some punctuation mark, e. g. comma → ",", we added a rule allowing the empty production: comma → ε.

Fig. 2. Example tree for the same sentence from which punctuation has been removed.
Finally, we paid special attention to coordinations of phrases, where punctuation
plays an important role.4 Detecting the correct placement of punctuation
marks in coordinations required new grammar rules to distinguish coordinations
(in which all members have to agree in grammatical case) from common noun
or prepositional phrases (where no such grammatical restrictions apply), as
shown in the example derivation tree with coordinations (Figure 2). The resulting recall as well as precision increased to more than 80 %.
4.1 Evaluation and Results
For the purpose of evaluating the described method, 500 randomly chosen
sentences from the manually annotated DESAM corpus [14] were used.
First, we removed all punctuation marks from each sentence, then we
filled in the punctuation using the enhanced parser, and finally we compared
4 In general, the punctuation here distinguishes the meaning of the coordinations. This, however, requires semantic information (see Section 4.2); therefore we concentrated on syntactic phenomena in coordinations.
Fig. 3. Overview of the punctuation detection process (for the input sentence Proto jsme zůstali u mýdel pracích prostředků zubních past rostlinných olejů a ztužených tuků.).
it with the original sentence. The presented results confirm the importance of
coordinations for solving this problem, because they contain many punctuation
characters, as indicated by the increase in recall of almost 30 %. Also, as can
be seen from the precision values, coordinations can be precisely analysed
on the syntactic level.
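The comparison with the original sentence can be scored per comma position; a minimal sketch of the evaluation, with illustrative position sets:

```python
def comma_prf(gold_positions, predicted_positions):
    """Precision/recall of predicted comma positions (token indices)
    against the commas of the original sentence."""
    gold, pred = set(gold_positions), set(predicted_positions)
    tp = len(gold & pred)                       # correctly restored commas
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Original sentence had commas after tokens 1 and 6; the parser
# restored one of them and hypothesised a spurious one after token 4.
p, r = comma_prf({1, 6}, {1, 4})
```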
4.2 Problems and Limitations
As mentioned above, the guidelines for placing punctuation in Czech take information not only from morphology and syntax, but also from semantics; such
phenomena cannot be covered by a purely syntax-based system. Using semantic
information for punctuation detection is necessary in order to distinguish e. g.:
– coordinations from gradually extending attributes
In Czech, coordinations are written with punctuation, but gradually extending attributes without it, e.g. Je to velký starý drahý dům (It is a big old
expensive house), but Vidím velký, střední a malý dům (I see a large, medium
and small house). Currently, we consider a noun phrase sequence to be a coordination only if it contains a coordination core as its part, consisting of
a phrase followed by a conjunction and another phrase5, i.e. střední a malý
(medium and small) in the previous sentence. However, the coordination core
does not need to be present in some (stylistically determined) situations.
– sequences of subordinating clauses
If considering the sentence Petr si uvědomil, že Pavel zavřel okno[,] a šel
domů (Peter realized that Paul closed the window[,] and went home), the
presence of the comma determines whether it was Peter or Paul who went
home.
Obviously, for the proper detection of comma placement in the first example, it
would be necessary to know the meaning of the sentences or
at least of some parts of them; moreover, we would need to know the
relation between the meanings: to know that large, medium, small belong to the
same semantic class while big, old, expensive do not. It turns out that to tackle
this problem, large ontologies for Czech would be needed, which are unfortunately
not available.
The situation in the second example is even worse: we would need a large context, anaphora resolution and logical inference to deduce the proper placement
of the punctuation, and even then the situation might remain ambiguous given
the available information.
5 Conclusions and Future Directions
In this paper we have presented an efficient method for punctuation detection
in common sentences by employing full syntactic parsing. Although all measurements
were related to Czech, we believe that the technique can easily be generalized
to other languages with complicated punctuation rules, provided that
the necessary language resources and tools are available.
The proposed method already achieves significantly better results than
the state-of-the-art systems, and we believe that it can be further improved by
supplying additional information to overcome its current limitations
described in the previous section. In particular, we plan to use the Czech WordNet [15] to distinguish coordinations from gradually extending attributes, and to
search for other semantic resources that may help us improve this method.
Furthermore, an application consisting of an OpenOffice.org [16] grammar
checker extension is being developed for practical testing of the punctuation
detection technique and for making it available to a broad range of users.
5 Both phrases must also agree in case.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within
the Center of basic research LC536 and in the National Research Programme
II project 2C06009 and by the Czech Science Foundation under the project
P401/10/0792.
References
1. Hlavsa, Z., et al.: Akademická pravidla českého pravopisu (Rules of Czech Orthography). Akademia, Praha (1993)
2. Pala, K., Rychlý, P., Smrž, P.: Text corpus with errors. In: Text, Speech and
Dialogue: Sixth International Conference, Berlin, Springer Verlag (2003) 90–97
3. Pala, K.: Pište dopisy konečně bez chyb – Český gramatický korektor pro Microsoft
Office. Computer 13–14 (2005) 72 CPress Media.
4. Lingea s. r. o.: Grammaticon. http://www.lingea.cz/grammaticon.htm (2003)
5. Oliva, K., Petkevič, V., Microsoft s. r. o.: Czech grammatical checker. http://office.microsoft.com/word (2005)
6. Pala, K., Nepil, M.: Checking punctuation errors in Czech texts. [online, quoted on
2009-11-20]. Available from: http://nlp.fi.muni.cz/publications/neco2003_pala_nepil/neco2003_pala_nepil.rtf (2003)
7. Horák, A.: The Normal Translation Algorithm in Transparent Intensional Logic for
Czech. PhD thesis, Faculty of Informatics, Masaryk University, Brno (November
2001)
8. Kadlec, V., Horák, A.: New Meta-grammar Constructs in Czech Language Parser
synt. In: Lecture Notes in Computer Science, Springer Berlin / Heidelberg (2005)
9. Kadlec, V.: Syntactic Analysis of Natural Languages Based on Context-Free Grammar Backbone. PhD thesis, Faculty of Informatics, Masaryk University, Brno
(September 2007)
10. Ailomaa, M., Kadlec, V., Rajman, M., Chappelier, J.C.: Robust stochastic parsing: Comparing and combining two approaches for processing extra-grammatical
sentences. In Werner, S., ed.: Proceedings of the 15th NODALIDA Conference,
Joensuu 2005, Joensuu, Ling@JoY (2005) 1–7
11. Kovář, V., Horák, A., Kadlec, V.: New Methods for Pruning and Ordering of
Syntax Parsing Trees. In Proceedings of Text, Speech and Dialogue 2008. In:
Lecture Notes in Artificial Intelligence, Proceedings of Text, Speech and Dialogue
2008, Brno, Czech Republic, Springer-Verlag (2008) 125–131
12. Kovář, V., Horák, A.: Reducing the Number of Resulting Parsing Trees for the
Czech Language Using the Beautified Chart Method. In: Proceedings of 3rd Language and Technology Conference, Poznań, Wydawnictwo Poznańskie (2007) 433–
437
13. Jakubíček, M.: Extraction of syntactic structures based on the Czech parser synt.
In: Proceedings of Recent Advances in Slavonic Natural Language Processing 2008,
Brno, Czech Republic, Masaryk University (2008) 56–62
14. Pala, K., Rychlý, P., Smrž, P.: DESAM – Annotated Corpus for Czech. In:
Proceedings of SOFSEM ’97, Springer-Verlag (1997) 523–530
15. Pala, K., Hlaváčková, D.: Derivational Relations in Czech WordNet. In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing 2007, Praha,
Czech Republic, The Association for Computational Linguistics (2007) 75–81
16. OpenOffice.org Community: Openoffice.org. http://www.openoffice.org (2009)
Author Index
Anantaram, C. 105
Balahur, Alexandra 119
Basu, Anupam 143
Bhat, Shefali 105
Bhowmick, Plaban Kumar 143
Bunescu, Razvan 231
Bustillos, Sandra 243
Carl, Michael 193
Carrillo de Albornoz, Jorge 131
Caselli, Tommaso 29
Castillo, Julio J. 155
Ceauşu, Alexandru 205
Dannélls, Dana 167
Dascalu, Mihai 323
Dessus, Philippe 323
Eshkol, Iris 79
Fadaei, Hakimeh 219
Feldman, Anna 17
Gervás, Pablo 131
Goyal, Shailly 105
Güngör, Tunga 311
Gulati, Shailja 105
Hartoyo, Agus 179
Hernández, Arturo 243
Horák, Aleš 335
Hoshino, Ayako 279
Huang, Yunfeng 231
Humayoun, Muhammad 293
Iftene, Adrian 267
Inkpen, Diana 41
Irimia, Elena 205
Islam, Aminul 41
Jakubíček, Miloš 335
Jensen, Kristian 193
Kay, Martin 193
Latif, Seemab 253
Mao, Qi 91
Martín, Tamara 119
McGee Wood, Mary 253
Mitra, Pabitra 143
Mondal, Prakash 55
Montoyo, Andrés 119
Nakagawa, Hiroshi 279
Nenadic, Goran 253
Ochoa-Zezzatti, Alberto 243
Ornelas, Francisco 243
Peng, Jing 17
Pequeño, Consuelo 243
Petic, Mircea 67
Plaza, Laura 131
Ponce, Julio 243
Pons, Aurora 119
Prasad, Abhisek 143
Prodanof, Irina 29
Prost, Jean-Philippe 79
Pu, Xiaojia 91
Raffalli, Christophe 293
Rosenberg, Maria 3
Rotaru, Ancuta 267
Shamsfard, Mehrnoush 219
Street, Laura 17
Suyanto 179
Taalab, Samer 79
Tellier, Isabelle 79
Trausan-Matu, Stefan 323
Tümer, M. Borahan 311
Ustimov, Aleksei 311
Wu, Gangshan 91
Yuan, Chunfeng 91
Editorial Board of the Volume
Eneko Agirre
Sivaji Bandyopadhyay
Roberto Basili
Christian Boitet
Nicoletta Calzolari
Dan Cristea
Alexander Gelbukh
Gregory Grefenstette
Eva Hajičová
Yasunari Harada
Graeme Hirst
Eduard Hovy
Nancy Ide
Diana Inkpen
Alma Kharrat
Adam Kilgarriff
Igor Mel’čuk
Rada Mihalcea
Ruslan Mitkov
Dunja Mladenić
Masaki Murata
Vivi Nastase
Nicolas Nicolov
Kemal Oflazer
Constantin Orasan
Maria Teresa Pazienza
Ted Pedersen
Viktor Pekar
Anselmo Peñas
Stelios Piperidis
James Pustejovsky
Fuji Ren
Fabio Rinaldi
Horacio Rodríguez
Vasile Rus
Franco Salvetti
Serge Sharoff
Grigori Sidorov
Thamar Solorio
Juan Manuel Torres-Moreno
Hans Uszkoreit
Manuel Vilares Ferro
Leo Wanner
Yorick Wilks
Annie Zaenen
Additional Referees
Rodrigo Agerri
Muath Alzghool
Javier Artiles
Bernd Bohnet
Ondřej Bojar
Nadjet Bouayad-Agha
Luka Bradesko
Janez Brank
Julian Brooke
Miranda Chong
Silviu Cucerzan
Lorand Dali
Víctor Manuel Darriba Bilbao
Amitava Das
Dipankar Das
Arantza Díaz de Ilarraza
Kohji Dohsaka
Iustin Dornescu
Asif Ekbal
Santiago Fernández Lanza
Robert Foster
Oana Frunza
René Arnulfo García Hernández
Ana García-Serrano
Byron Georgantopoulos
Chikara Hashimoto
Laura Hasler
William Headden
Maria Husarciuc
Adrian Iftene
Iustina Ilisei
Ikumi Imani
Aminul Islam
Toshiyuki Kanamaru
Fazel Keshtkar
Jason Kessler
Michael Kohlhase
Olga Kolesnikova
Natalia Konstantinova
Valia Kordoni
Hans-Ulrich Krieger
Geert-Jan Kruijff
Yulia Ledeneva
Yang Liu
Oier Lopez de Lacalle
Fernando Magán-Muñoz
Aurora Marsye
Kazuyuki Matsumoto
Alex Moruz
Sudip Kumar Naskar
Peyman Nojoumian
Blaz Novak
Inna Novalija
Tomoko Ohkuma
Bostjan Pajntar
Partha Pakray
Pavel Pecina
Ionut Cristian Pistol
Natalia Ponomareva
Marius Raschip
Luz Rello
Francisco Ribadas
German Rigau
Alvaro Rodrigo
Franco Salvetti
Kepa Sarasola
Gerold Schneider
Marc Schroeder
Ivonne Skalban
Simon Smith
Mohammad G. Sorba
Tadej Štajner
Sanja Štajner
Jan Strakova
Xiao Sun
Masanori Suzuki
Motoyuki Suzuki
Motoyuki Takaai
Irina Temnikova
Zhi Teng
Nenad Tomasev
Eiji Tomida
Sulema Torres
Mitra Trampas
Diana Trandabat
Stephen Tratz
Yasushi Tsubota
Hiroshi Umemoto
Masao Utiyama
Andrea Varga
Tong Wang
Ye Wu
Keiji Yasuda
Zhang Yi
Daisuke Yokomori
Caixia Yuan
Zdeněk Žabokrtský
Venta Zapirain
Daniel Zeman
Hendrik Zender