EMNLP’22 trip report: neuro-symbolic approaches in NLP are on the rise

The trip to the Empirical Methods in Natural Language Processing 2022 conference is certainly one I’ll remember. The conference had well over 1000 in-person attendees, who took in what they could of the 6 tutorials and 24 workshops on Wednesday and Thursday, and then the 175 oral presentations, 654 posters, 3 keynotes and a panel session, and 10 Birds of a Feather sessions on Friday-Sunday, all topped off with a welcome reception and a social dinner. The open air dinner was on the one day in the year that it rains in the desert! More precisely on the venue: it was the ADNEC conference centre in Abu Dhabi, from 7 to 11 December.

With so many parallel sessions, it was not always easy to choose. Although I expected many presentations about just large language models (LLMs) that I’m not particularly interested in from a research perspective, it turned out to be quite possible to find a straight road through the parallel NLP sessions with research that had at least added an information-based or a knowledge-based approach to do NLP better. Ha! NLP needs structured data, information, and knowledge to mitigate the problems with hallucinations – elsewhere called “fluent bullshit” – that those LLMs suffer from in natural language generation, among other tasks. Adding a symbolic approach into the mix turned out to be a recurring theme in the conference. Some authors tried to hide a rule-based approach or were apologetic about it, so the topic is not quite ‘hot’ just yet, but we’ll get there. In any case, it worked so much better for my one-liner intro to state that I’m into ontologies and have been branching out to NLG than to say I’m into NLG for African languages. Most people I met had heard of ontologies or knowledge graphs, whereas African languages mostly drew a blank expression.

It was hard to choose what to attend, especially on the first day, but eventually I participated in part of the second workshop on Natural Language Generation, Evaluation, and Metrics (GEM’22), NLP for positive impact (NLP4PI’22), and Data Science with Human-in-the-Loop (DaSH’22), and walked into a few more poster sessions of other workshops. The main conference had 8 sessions in parallel in each timeslot; I chose the ones on semantics, ethics, NLG, commonsense reasoning, and speech and robotics grounding, and the Birds of a Feather sessions on ethics and on code-switching. I’ve structured this post by topic rather than by type of session or actual session, however, in the following order: NLP with structured stuff, ethics, a basket with other presentations that were interesting, NLP for African languages, the two BoF sessions, and a few closing remarks. I did at least skim over the papers associated with the presentations referenced here, and any errors in discussing the works are still mine. Logistically, the links to the papers in this post are a bit iffy: about 900 EMNLP + workshops papers were already on arXiv according to the organisers, and the 828 papers of the main conference are being ingested into the ACL anthology, so their permanent URLs are not functional yet. My linking practice was therefore inconsistent and may suffer link rot. Be that as it may, let’s get to the science.

The entrance of the conference venue, ADNEC in Abu Dhabi, at the end of the first workshop and tutorials day.

NLP with at least some structured data, information, or knowledge and/or reasoning

I’ve tried to structure this section, roughly going from little addition of structured stuff to more, and then from less to more inferencing.

The first poster session that I attended on the first day was the one of the NLP4PI workshop; it was supposed to last 1 hour, but after 2.5h it was still well attended. I also passed by the adjacent Machine translation session (WMT’22), which also paid off. There were several posters there that were of interest to my inclination toward knowledge engineering. Abhinav Lalwani presented a Findings paper on Logical Fallacy Detection in the NLP4PI’22 poster session, which was interesting both for the computer ethics that I have to teach and for their method: create a dataset of 2449 fallacies of 13 types that were taken from online educational resources, machine-learn templates from those sentences – which they call generating a “structure-aware model” – and then use those templates to find new ones in the wild, which was on climate change claims in this case [1]. Their dataset and code are available on GitHub. The one presented by Lifeng Han from the University of Manchester was part of WMT’22: their aim was to see whether a generic LLM would do better or worse than smaller in-domain language models enhanced with clinical terms extracted from biomedical literature and electronic health records and from class names of (unspecified in the paper) ontologies. The smaller models win, and terms or concepts may win depending on the metric used [2].

For the main conference, and unsurprisingly for a session called “semantics”, it wasn’t just about LLMs. The first paper was about Structured Knowledge Grounding, of which the tl;dr is that SQL tables and queries improve on the ‘state of the art’ of just GPT-3 [3]. The Reasoning Like Program Executors paper aims to fix nonsensical numerical output of LLMs by injecting small programs/code for sound numerical reasoning, among the reasoning types that LLMs are incapable of, and is successful at doing so [4]. And there’s a paper on using WordNet for sense retrieval in the context of word in/vs context use, and on discovering that the human evaluators were less biassed than the language model [5].

The commonsense reasoning session also – inevitably, I might add – had papers that combined techniques. The first paper of the session looked into the effects of injecting external knowledge (Comet) to enhance question answering; the effect is generally positive, and more so for smaller models [6]. I also have in my notes that they developed an ontology of knowledge types, and the paper text claims so too, but it is missing from the paper, unless they are referring to the 5 terms in its table 6.

I also remember seeing a poster on using Abstract Meaning Representation. Yes, indeed, and there turned out to be a place for it: for text style transfer to convert a piece of text from one style into another. The text-to-AMR + AMR-to-text model T-STAR beat the state of the art with a 15% increase in content preservation without substantive loss of accuracy (3%) [7].

Moving on to rules and more or less reasoning, first, at the NLP4PI’22 poster session, there was a poster on “Towards Countering Essentialism through Social Bias Reasoning”, which was presented by Maarten Sap. They took a very interdisciplinary approach, mixing logic, psychology and cognitive science to get the job done, and the whole system was entirely rules-based. The motivation was to find a way to assist content moderators by generating possible replies to counter prejudiced statements in online comments. They generated five types of replies and asked users which one they preferred. Types of sample generated replies include, among others, to compute exceptions to the prejudice (e.g., an individual in the group who does not have that trait), attributing the trait also to other groups, and a generic statement on tolerance. Bland seemed to work best. I tried to find the paper for details, but was unsuccessful.

The DaSH’22 presentation about WaNLI concerned the creation of a dataset and pipeline to have crowdsourcing workers and AI “collaborate” in dataset creation, which had a few rules sprinkled into the mix [8]. It turns out that humans are better at revising and evaluating than at creating sentences from scratch, so the pipeline takes that into account. First, from a base set, it uses NLG to generate complement sentences, which are filtered and then reviewed and possibly revised by humans. Complement sentence generation (the AI part) involves taking sentences like “5% chance that the object will be defect free” + “95% that the object will have defects” to then generate (with GPT-3, in this case) the candidate sentence pairs “1% of the seats were vacant” + “99% of the seats were occupied”, using encoded versions of the principles of entailment and set complement, among the reasoning cases used.
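To make the set-complement principle behind that example concrete, here is a toy sketch in Python. It is emphatically not the WaNLI pipeline, which uses GPT-3 for generation followed by filtering and human review; the function name, its parameters, and the sentence template are all hypothetical and merely reproduce the seats example above.

```python
def complement_pair(percent, trait, complement_trait, subject="the seats"):
    """Build a sentence pair based on the set-complement principle.

    Hypothetical helper for illustration only: if `percent` of a group has
    a trait, then 100 - percent of it has the complementary trait.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    first = f"{percent}% of {subject} were {trait}"
    second = f"{100 - percent}% of {subject} were {complement_trait}"
    return first, second


# Reproduces the candidate pair quoted in the text.
premise, hypothesis = complement_pair(1, "vacant", "occupied")
print(premise)
print(hypothesis)
```

In the actual pipeline, such candidate pairs would still go through the filtering step and the human review and revision step.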

Turning up the reasoning a notch, Sean Welleck of the University of Washington gave the keynote at GEM’22. His talk consisted of two parts, on unlearning bad behaviour of LLMs and then an early attempt with a neuro-symbolic approach. The latter concerned connecting an LLM’s output to some logic reasoning. He chose Isabelle, of all reasoners, as a way to get it to check and verify the hallucinations (the nonsense) the LLMs spit out. I asked him why he chose a reasoner for an undecidable language, but the response was not a direct answer. It seemed that he liked the proof trace but was unaware of the undecidability issues. Maybe there’s a future for description logics reasoners here. Elsewhere, and hidden behind a paper title that mentions language models, lies a reality of the ConCoRD relation detection for “boosting consistency of pre-trained language models” with a MAX-SAT solver in the toolbox [9].
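For readers who haven’t met MAX-SAT before, the general idea of using it for consistency can be sketched as follows. This is a brute-force toy in Python, not ConCoRD’s implementation: the function `max_sat`, the integer weights, and the two example statements are assumptions made up for illustration. Each statement the model believes becomes a weighted unit clause, each detected contradiction becomes a weighted clause forbidding both statements at once, and the solver picks the truth assignment with the highest total weight of satisfied clauses.

```python
from itertools import product


def max_sat(n_vars, clauses):
    """Brute-force weighted MAX-SAT (only feasible for a handful of variables).

    clauses: list of (weight, literals); a literal +i means variable i is
    true and -i means variable i is false (variables are 1-indexed).
    Returns (best_total_weight, assignment as a tuple of booleans).
    """
    best_weight, best_assignment = -1, None
    for values in product([False, True], repeat=n_vars):
        weight = sum(
            w for w, literals in clauses
            if any(values[abs(lit) - 1] == (lit > 0) for lit in literals)
        )
        if weight > best_weight:
            best_weight, best_assignment = weight, values
    return best_weight, best_assignment


# Hypothetical example: x1 = "a sparrow is a bird" (confidence weight 9),
# x2 = "a sparrow is a fish" (confidence weight 6), and a contradiction
# between the two encoded as the clause (not x1 or not x2) with weight 20.
clauses = [(9, [1]), (6, [2]), (20, [-1, -2])]
print(max_sat(2, clauses))  # (29, (True, False))
```

The less confident of the two contradicting statements gets flipped to false, which is the intuition behind boosting consistency this way; a real system would use a proper solver rather than enumerating all assignments.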

Impression of the NLP4PI’22 poster session 2.5h into the 1h session timeslot.

There are (many?) more relevant presentations that I did not get around to attending, such as on dynamic hierarchical reasoning that uses both a LM and a knowledge graph for their scope of question answering [10], a unified representation for graph query languages, GraphQ IR [11], a finding that RoBERTa, T5, and GPT3 have problems especially with deductive reasoning involving negation [12], and PLOG table-to-logic to enhance table-to-text. Open the conference program handbook and search on things like “commonsense reasoning” or NLI, where the I isn’t an abbreviation of Interface but of Inference, and there’s even neural-symbolic inference for graph parsing. The compound term “Knowledge graph” has 84 mentions and “reasoning” has 244 mentions. There are also four papers with “grammar induction”, two papers with CFGs, and one with a construction grammar.

It was a pleasant surprise to not be entirely swamped by the “stats/NN + automated metric” formula. I fancy thinking it’s an indication that the frontiers of NLP research have already grown out of that and are adding knowledge into the mix.

Ethics and computational social science

Of course, the previously-mentioned topic of trying to fix hallucinations and issues with reasoning and logical coherence of what the language models spit out implies researchers know there’s a problem that needs to be addressed. That is a general issue. Specific ones are unique in their own way; I’ll mention three. Inna Lin presented work on gendered mental health stigma and potential downstream issues with health chatbots that would rely on such language models [13]. For instance, women were more likely to be recommended to seek professional help and men to toughen up and get on with it. The GeoMLAMA dataset showed that not everything is as bad as one might suspect. The dataset was created to explore multilingual Pre-Trained Language Models on cultural commonsense knowledge, like which colour the dress of the bride typically is. The authors selected English, Chinese, Hindi, Persian, and Swahili. Evaluation showed that multilingual PLMs are not biased toward the USA, that the native language of a country may not be the best language to probe its knowledge (as the commonsense isn’t explicitly stated), and that a language may better probe knowledge about a nonnative country than its native country [14]. The third paper is more about working on a mechanism to help NLP ethics: modelling information change in science communication. The scientist or the press release says one thing, which gets altered slightly in a popular science article, and then morphs into tweets and toots with yet another, different, message. More distortion occurs in the step from popsci article to tweet than from scientist to popsci article. The sort of distortion or ‘not as faithful as one would like’? Notably, “Journalists tend to downplay the certainty and strength of findings from abstracts” and “limitations are more likely to be exaggerated and overstated” [15].

In contrast, Fatemehsadat Mireshghallah showed some general ethical issues with the very LLMs in her lively presentation. They are so large and have so many parameters that what they end up doing is more akin to memorising text and outputting that memorised text, rather than outputting de novo generated text [16]. She focussed on potential privacy issues, where such models may output sensitive personal data. It also applies to copyright infringement issues: if they return a chunk of already existing text, say, a paragraph from this blog, it would be copyright infringement, since I hold the copyright on it by default and I made it CC-BY-NC-SA, which those LLMs do not adhere to and they don’t credit me. Copilot is already facing a class action lawsuit for unfairly reusing open source code without having obtained permission. In both cases, there’s the question, or task, of removing pieces of text and retraining the model, or not, as well as how to know whether your text was used to create the model. I recall seeing something about that in the presentations and we had some lively discussions about it as well, leaning toward a remove & re-train and suspecting that’s not what’s happening now (except at IBM apparently).

Last, but not least, on this theme: the keynote by Gary Marcus turned out to be a pre-recorded one. It was mostly a popsci talk (see also his recent writings here, among others) on the dangers of those large language models, with plenty of examples of problems with them that have been posted widely recently.

Noteworthy “other” topics

The ‘other’ category in ontologies may be dubious, but here it is not meant as such – I just didn’t have enough material or time to write more about them in this post, but they deserved a mention nonetheless.

The opening keynote of the EMNLP’22 conference by Neil Cohn was great. His main research is in visual languages, and those in comic books in particular. He raised some difficult-to-answer questions and topics. For instance, is language multimodal – vocal, body, graphic – and are gestures separate from, alongside, or part of language? Or take the idea of abstract cognitive principles as basis for both visual and textual language, the hypothesis of “true universals” that should span across modalities, and the idea of “conceptual permeability” on whether the framing in one modality of communication affects the others. He also talked about the topic of cross-cultural diversity in those structures of visual languages, of comic books at least. It almost deserves to be in the “NLP + symbolic” section above, for the grammar he showed and to try to add theory into the mix, rather than just more LLMs and automated evaluation scores.

The other DaSH paper that I enjoyed after the aforementioned WaNLI was the Cheater’s Bowl, where the authors tried to figure out how humans cheat in online quizzes [17]. Compared to automated open-domain question-answering, humans use fewer keywords more effectively, use more world knowledge to narrow searches, use dynamic refinement and abandonment of search chains, have multiple search chains, and do answer validation. Also during the workshop days, I somehow walked into a poster session of the BlackboxNLP’22 workshop on analysing and interpreting neural networks for NLP. Priyanka Sukumaran enthusiastically talked about her research on how LSTMs handle (grammatical) gender [18]. They wanted to know whereabouts in the LSTM a certain grammatical feature is dealt with; and they found out, at least for gender agreement in French. The ‘knowledge’ is encoded in just a few nodes and the model does better on longer than on shorter sentences, since then it can use more other cues in the sentence, including gendered articles, to figure out the M/F needed for constructions like noun-adjective agreement. That is definitely similar to how humans do it, but then, algorithms do not need to copy human cognitive processes.

NLP4PI’s keynote was given by Preslav Nakov, who recently moved to the Mohamed Bin Zayed University of AI. He gave an interesting talk about fake news, mis- and dis-information detection, and also differentiated it from propaganda detection that, in turn, consists of emotion and logical fallacy detection. If I remember correctly, not with knowledge-based approaches either, but interesting nonetheless.

I had more papers marked for follow up, including on text generation evaluation [19], but this post is starting to become very long as it is already.

Papers with African languages, and Niger-Congo B (‘Bantu’) languages in particular

Last, but not least, something on African languages. There were a few. Some papers had it clearly in the title, others not at all but they used at least one of them in their dataset. The list here is thus incomplete and merely reflects on what I came across.

On the first day, as part of NLP4PI, there was also a poster on participatory translations of Oshiwambo, a language spoken in Namibia, which was presented by Jenalea Rajab from Wits and Millicent Ochieng from Microsoft Kenya, both with the Masakhane initiative; the associated paper seems to have been presented at the ICLR 2022 Workshop on AfricaNLP. Also within the Masakhane project is the progress on named entity recognition [20]. My UCT colleague Jan Buys also had papers with poster presentations, together with two of his students, Khalid Elmadani and Francois Meyer. One was part of WMT’22, on multilingual machine translation for African languages [21], and another was on sub-word segmentation for Nguni languages (EMNLP Findings) [22]. The authors of AfroLID report some 96% accuracy on identification of a whopping 517 African languages, which sounds very impressive [23].

Birds of a Feather sessions

The BoF sessions seemed to be loosely organised discussions and exchanges of ideas about a specific topic. I tried out the Ethics and NLP one, organised by Fatemehsadat Mireshghallah, Luciana Benotti, and Patrick Blackburn, and the code-switching & multilinguality one, organised by Genta Winata, Marina Zhukova, and Sudipta Kar. Both sessions were very lively and constructive and I can recommend going to at least one of them the next time you attend EMNLP, or organising something like that at a conference. The former had specific questions for discussion, such as on the reviewing process and on that required ethics paragraph; the latter had themes, including datasets and models for code-switching and metrics for evaluation. For ethics, there seems to be a direction to head toward, whereas the NLP for code-switching seems to be still very much in its infancy.

Final remarks

As if all that wasn’t keeping me busy already, there were lots of interesting conversations, meeting people I haven’t seen in many years, including Barbara Plank who finished her undergraduate studies at FUB when I was a PhD student there (and focussing on ontologies rather, which I still do) and likewise for Luciana Benotti (who had started her European Masters at that time, also at FUB); people with whom I had emailed before but not met due to the pandemic; and new introductions. There was a reception and an open air social dinner; an evening off meeting an old flatmate from my first degree and a soccer watch party seeing Argentina win; and half a day off after the conference to bridge the wait for the bus to leave, time that I used to visit the mosque (it doubles as a worthwhile tourist attraction), chat with other attendees hanging around for their evening travels, and start writing this post.

Will I go to another EMNLP? Perhaps. Attendance was most definitely very useful, some relevant research outputs I do have, and there’s cookie dough and buns in the oven, but I’d first need a few new bucketloads of funding to be able to pay for the very high registration cost that comes on top of the ever increasing travel expenses. EMNLP’23 will be in Singapore.

References

[1] Zhijing Jin, Abhinav Lalwani, Tejas Vaidhya, Xiaoyu Shen, Yiwen Ding, Zhiheng Lyu, Mrinmaya Sachan, Rada Mihalcea, Bernhard Schölkopf. Logical Fallacy Detection. EMNLP’22 Findings.

[2] L Han, G Erofeev, I Sorokina, S Gladkoff, G Nenadic. Examining Large Pre-Trained Language Models for Machine Translation: What You Don’t Know About It. 7th Conference on Machine translation at EMNLP’22.

[3] Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer and Tao Yu. UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. EMNLP’22.

[4] Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Qiang Fu, Yan Gao, Jian-Guang LOU and Weizhu Chen. Reasoning Like Program Executors. EMNLP’22

[5] Qianchu Liu, Diana McCarthy and Anna Korhonen. Measuring Context-Word Biases in Lexical Semantic Datasets. EMNLP’22

[6] Yash Kumar Lal, Niket Tandon, Tanvi Aggarwal, Horace Liu, Nathanael Chambers, Raymond Mooney and Niranjan Balasubramanian. Using Commonsense Knowledge to Answer Why-Questions. EMNLP’22

[7] Anubhav Jangra, Preksha Nema and Aravindan Raghuveer. T-STAR: Truthful Style Transfer using AMR Graph as Intermediate Representation. EMNLP’22

[8] A Liu, S Swayamdipta, NA Smith, Y Choi. Wanli: Worker and ai collaboration for natural language inference dataset creation. DaSH’22 at EMNLP2022.

[9] Eric Mitchell, Joseph Noh, Siyan Li, Will Armstrong, Ananth Agarwal, Patrick Liu, Chelsea Finn and Christopher Manning. Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference. EMNLP’22

[10] Miao Zhang, Rufeng Dai, Ming Dong and Tingting He. DRLK: Dynamic Hierarchical Reasoning with Language Model and Knowledge Graph for Question Answering. EMNLP’22

[11] Lunyiu Nie, Shulin Cao, Jiaxin Shi, Jiuding Sun, Qi Tian, Lei Hou, Juanzi Li, Jidong Zhai. GraphQ IR: Unifying the semantic parsing of graph query languages with one intermediate representation. EMNLP’22

[12] Soumya Sanyal, Zeyi Liao and Xiang Ren. RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners. EMNLP’22

[13] Inna Lin, Lucille Njoo, Anjalie Field, Ashish Sharma, Katharina Reinecke, Tim Althoff and Yulia Tsvetkov. Gendered Mental Health Stigma in Masked Language Models. EMNLP’22

[14] Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li, Kai-Wei Chang. Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models. EMNLP’22

[15] Dustin Wright, Jiaxin Pei, David Jurgens, Isabelle Augenstein. Modeling Information Change in Science Communication with Semantically Matched Paraphrases. EMNLP’22

[16] Fatemehsadat Mireshghallah, Archit Uniyal, Tianhao Wang, David Evans and Taylor Berg-Kirkpatrick. An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models. EMNLP’22

[17] Cheater’s Bowl: Human vs. Computer Search Strategies for Open-Domain QA. DaSH’22 at EMNLP2022.

[18] Priyanka Sukumaran, Conor Houghton, Nina Kazanina. Do LSTMs See Gender? Probing the Ability of LSTMs to Learn Abstract Syntactic Rules. BlackboxNLP’22 at EMNLP2022. 7-11 Dec 2022, Abu Dhabi, UAE. arXiv:2211.00153

[19] Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji and Jiawei Han. Towards a Unified Multi-Dimensional Evaluator for Text Generation. EMNLP’22

[20] David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson KALIPE, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, victoire Memdjokam Koagne, Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, MBONING TCHIAZE Elvis, Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia and Joyce Nakatumba-Nabende. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition. EMNLP’22

[21] Khalid Elmadani, Francois Meyer and Jan Buys. University of Cape Town’s WMT22 System: Multilingual Machine Translation for Southern African Languages. WMT’22 at EMNLP’22.

[22] Francois Meyer and Jan Buys. Subword Segmental Language Modelling for Nguni Languages. Findings of EMNLP, 7-11 December 2022, Abu Dhabi, United Arab Emirates.

[23] Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed and Alcides Inciarte. AfroLID: A Neural Language Identification Tool for African Languages. EMNLP’22

Conference report: SWAT4HCLS 2022

The things one can do when on sabbatical! For this week, it’s mainly attending the 13th Semantic Web Applications and Tools for Health Care and Life Science (SWAT4HCLS) conference and even having some time to write a conference report again. (The last post tagged with conference report was FOIS2018, at the end of my previous sabbatical.) The conference consisted of a tutorial day, two conference days with several keynotes and invited talks, paper presentations and poster sessions, and on the last day a ‘hackathon’/unconference. This clearly has grown over the years from the early days of the event series (one day, workshop, life science).

A photo of the city where it was supposed to take place: Leiden (NL) (Source: here)

It’s been a while since I looked in more detail into the life sciences and healthcare semantics-driven software ecosystems. The problems are largely the same, or more complex, with more technologies and standards to choose from that promise that this time it will be solved once and for all but where practitioners know it isn’t that easy. And lots of tooling for SARS-CoV-2 and COVID-19, of course. I’ll summarise and comment on a few presentations in the remainder of this post.

Keynotes

The first keynote speaker was Karin Verspoor from RMIT in Melbourne, Australia, who focussed her talk on their COVID-SEE tool [1], a Scientific Evidence Explorer for COVID-19 information that relies on advanced NLP and some semantics to help finding information, notably taking open questions where the sentence is analysed by PICO (population, intervention, comparator, outcome) or part thereof, and using UMLS and MetaMap to help find more connections. In contrast to a well-known domain with well-known terminology to formulate very specific queries over academic literature, that was (and still is) not so for COVID-19. Their “NLP+” approach helped to get better search results.

The second keynote was by Martina Summer-Kutmon from Maastricht University, the Netherlands, who focussed on metabolic pathways and computation and is involved in WikiPathways. With pretty pictures, like the COVID-19 Disease map that culminated from a lot of effort by many research communities with lots of online data resources [2]; see also the WikiPathways one for covid, where the work had commenced in February 2020 already. She also came to the idea that there’s a lot of semantics embedded in the varied pathway diagrams. They collected 64643 diagrams from the literature of the past 25 years, analysed them with ML, OCR, and manual curation, and managed to find gaps between information in those diagrams and the databases [3]. It reminded me of my own observations and work on that with DiDOn, on how to get information from such diagrams into an ontology automatically [4]. There’s clearly still lots more work to do, but substantive advances surely have been made over the past 10 years since I looked into it.

Then there were Mirjam van Reisen from Leiden UMC, the Netherlands, and Francisca Oladipo from the Federal University of Lokoja, Nigeria, who presented the VODAN-Africa project that tries to get Africa to buy into FAIR data, especially for COVID-19 health monitoring within this particular project, but also more generally to try to get Africans to share data fairly. Their software architecture with tooling is open source. Apart from, perhaps, South Africa, the disease burden picture for, and due to, COVID-19, is not at all clear in Africa, but ideally would be. Let me illustrate this: the world-wide trackers say there are some 3.5mln infections and 90000+ COVID-19 deaths in South Africa to date, and from far away, you might take this at face value. But we know from SA’s data at the SAMRC that deaths are about three times as much; that only about 10% of the COVID-19-positives are detected by the diagnostics tests—the rest doesn’t get tested [asymptomatic, the hassle, cost, etc.]; and that about 70-80% of the population already had it at least once (that amounts to about 45mln infected, not the 3.5mln recorded), among other things that have been pieced together from multiple credible sources. There are lots of issues with ‘sharing’ data for free with The North, but then not getting the know-how with algorithms and outcomes etc back (a key search term for that debate has become digital colonialism), so there’s some increased hesitancy. The VODAN project tries to contribute to addressing the underlying issues, starting with FAIR and the GDPR as basis.
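The back-of-the-envelope arithmetic behind those figures can be spelled out in a few lines of Python. All inputs are the rough numbers quoted above, except the population of about 60 million for South Africa, which is my own assumption added for illustration; the variable names are likewise just for this sketch.

```python
# Figures quoted in the text (world-wide trackers and SAMRC-based estimates).
recorded_infections = 3_500_000
recorded_deaths = 90_000
detection_rate_percent = 10   # only about 10% of positives are detected
death_undercount_factor = 3   # deaths are about three times as many

# Assumption, not from the text: South Africa has roughly 60 million people.
population = 60_000_000

# Infections implied by the detection rate alone.
implied_infections = recorded_infections * 100 // detection_rate_percent

# Deaths implied by the undercount factor.
estimated_deaths = recorded_deaths * death_undercount_factor

# Range implied by 70-80% of the population having had it at least once.
ever_infected_low = population * 70 // 100
ever_infected_high = population * 80 // 100

print(implied_infections)                     # 35000000
print(estimated_deaths)                       # 270000
print(ever_infected_low, ever_infected_high)  # 42000000 48000000
```

The two routes to an infection estimate do not coincide exactly (about 35mln versus 42-48mln, with 45mln as the midpoint), which is to be expected for rough figures pieced together from multiple sources; either way, it is an order of magnitude more than the 3.5mln recorded.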

The last keynote at the end of the conference was by Amit Sheth, with the University of South Carolina, USA, whose talk focussed on how to get to augmented personalised health care systems, with asthma as one of the cases. Big Data augmented with Smart Data, mainly, combining multiple techniques. Ontologies, knowledge graphs, sensor data, clinical data, machine learning, Bayesian networks, chatbots and so on: you name it, somewhere it’s used in the systems.

Papers

Reporting on the papers isn’t as easy and reliable as it used to be. Once upon a time, the papers were available online beforehand, so I could come prepared. Now it was a case of ‘rock up and listen’ and there’s no access to the papers yet to look up more details to check my notes and pad them. I’m assuming the papers will be accessible online soon (CEUR-WS again presumably). So, aside from our own paper, described further below, all of the following is based on notes, presentation screenshots, and any Q&A on Discord.

Ruduan Plug elaborated on the FAIR & GDPR and querying over integrated data within that above-mentioned VODAN-Africa project [5]. He also noted that South Africa’s PoPIA is stricter than the GDPR. I’m suspecting that is due to the cross-border restrictions on the flow of data that the GDPR won’t have. (PoPIA is based on the GDPR principles, btw).

Deepak Sharma talked about FHIR with RDF and JSON-LD and ShEx and validation, which also related to the tutorial from the preceding day. The threesome Mercedes Arguello-Casteleiro, Chloe Henson, and Nava Maroto presented a comparison of MetaMap vs BERT in the context of covid [6], which I have to leave here with a cliff-hanger, because I didn’t manage to make a note of which one won because I had to go to a meeting that we were already starting later because of my conference attendance. My bet would be on the semantics (those deep learning models probably need more reliable data than there is available to date).

Besides papers related to scientific research into all things covid, another recurring topic was FAIR data, that is, whether it’s findable, accessible, interoperable, and reusable. Fuqi Xu and collaborators assessed 11 features for FAIR vocabularies in practice, and how to use them properly. Some noteworthy observations were that comparing a FAIR level makes more sense before-and-after changing a single resource than pitting different vocabularies against each other, that “FAIR enough” can be enough (cf. demanding 100% compliance) [7], and that a FAIR vocabulary does not imply that it is also a good quality vocabulary. Arriving at the topic of quality, César Bernabé presented an analysis on the use of foundational ontologies in bioinformatics by means of a systematic literature mapping. It showed that they’re used in a range of activities of ontology engineering, that there’s not enough empirical analysis of the pros and cons of using one, and, for the numbers game: 33 of the ontologies described in the selected literature used BFO, 16 DOLCE, 7 GFO, and 1 SUMO [8]. What to do next with these insights remains to be seen.

Last, but not least—to try to keep the blog post at a sort of just about readable length—our paper, among the 15 that were accepted. Frances Gillis-Webber, a PhD student I supervise, did most of the work surveying OWL Ontologies in BioPortal on whether, and if so how, they take into account the notion of multilingualism in some way. TL;DR: they barely do [9]. Even when they do, it’s just with labels rather than any of the language models, be they the ontolex-lemon from the W3C community group or another, and if so, mainly French and German.

Source: [9]

Does it matter? It depends on what your aims are. We use mainly the motivation of ontology verbalisation and electronic health records with SNOMED CT and patient discharge note generation, which ideally also would happen for ‘non-English’. Another use case scenario, indicated by one of the participants, Marco Roos, was that the bio-ontologies—not just health care ones—could use it as well, especially in the case of rare diseases, where the patients are more involved and up-to-date with the science, and thus where science communication plays a larger role. One could argue the same way for the science about SARS-CoV-2 and COVID-19, and thus that also the related bio-ontologies can do with coordinated multilingualism so that it may assist in better communication with the public. There are lots of opportunities for follow-up work here as well.

Other

There were also posters where we could hang out in gathertown, and more data and ontologies for a range of topics, such as protein sequences, patient data, pharmacovigilance, food and agriculture, bioschemas, and more covid stuff (like Wikidata on COVID-19, to name yet one more such resource). Put differently: the science can’t do without the semantic-driven tools, from sharing data, to searching data, to integrating data, and analysis to develop the theory figuring out all its workings.

The conference was supposed to be mainly in person, but then on 18 Dec, the Dutch government threw a curveball and imposed a relatively hard lockdown prohibiting all in-person events effective until, would you believe, 14 Jan, one day after the end of the event. This caused extra work with last-minute changes to the local organisation, but in the end it all worked out online. Hereby thanks to the organising committee for making it work under the difficult circumstances!

References

[1] Verspoor K. et al. Brief Description of COVID-SEE: The Scientific Evidence Explorer for COVID-19 Related Research. In: Hiemstra D., Moens MF., Mothe J., Perego R., Potthast M., Sebastiani F. (eds). Advances in Information Retrieval. ECIR 2021. Springer LNCS, vol 12657, 559-564.

[2] Ostaszewski M. et al. COVID19 Disease Map, a computational knowledge repository of virus–host interaction mechanisms. Molecular Systems Biology, 2021, 17:e10387.

[3] Hanspers, K., Riutta, A., Summer-Kutmon, M. et al. Pathway information extracted from 25 years of pathway figures. Genome Biology, 2020, 21:273.

[4] Keet, C.M. Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn. Journal of Biomedical Informatics, 2012, 45(3): 482-494. DOI: dx.doi.org/10.1016/j.jbi.2012.01.004.

[5] Ruduan Plug, Yan Liang, Mariam Basajja, Aliya Aktau, Putu Jati, Samson Amare, Getu Taye, Mouhamad Mpezamihigo, Francisca Oladipo and Mirjam van Reisen: FAIR and GDPR Compliant Population Health Data Generation, Processing and Analytics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[6] Mercedes Arguello-Casteleiro, Chloe Henson, Nava Maroto, Saihong Li, Julio Des-Diz, Maria Jesus Fernandez-Prieto, Simon Peters, Timothy Furmston, Carlos Sevillano-Torrado, Diego Maseda-Fernandez, Manoj Kulshrestha, John Keane, Robert Stevens and Chris Wroe, MetaMap versus BERT models with explainable active learning: ontology-based experiments with prior knowledge for COVID-19. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[7] Fuqi Xu, Nick Juty, Carole Goble, Simon Jupp, Helen Parkinson and Mélanie Courtot, Features of a FAIR vocabulary. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[8] César Bernabé, Núria Queralt-Rosinach, Vitor Souza, Luiz Santos, Annika Jacobsen, Barend Mons and Marco Roos, The use of Foundational Ontologies in Bioinformatics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[9] Frances Gillis-Webber and C. Maria Keet, A Survey of Multilingual OWL Ontologies in BioPortal. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

FOIS’18 conference report

To some perhaps surprisingly, despite being local organizer, I could attend all sessions of the 10th International Conference Formal Ontology in Information Systems as participant (cf. running around for last-minute things). It just wasn’t as much of a trip as it usually is: only 15 minutes to town at the Atlantic Imbizo conference venue, which is situated between the Clock Tower and (award-winning) Zeitz MOCAA at Cape Town’s V&A Waterfront. This blog post has turned into a longer post than intended—yet, there’s still so much left out to talk about—and it is divided up into sections on keynotes, presentations, ontologies, and the (ontologically inappropriate basket of) other things.

 

Keynotes

The first keynote was presented by (emeritus) professor in philosophy Peter Simons from Trinity College Dublin and Universität Salzburg, on the ontology of aboutness (slides).

Peter Simons during his keynote talk

That may sound a bit abstract, but it is not unusual for some information system that it will have to record statements about something, such as different medical opinions, changes of policies, plans or expectations, and we need a way to represent that and deal with it. Simons discussed several earlier proposals before proposing his own, which includes as main entities a bearer, act, time, act-type, mental content, mental content type, intentional objects, referent, and referent type (slide 16), and then variants for pictorial and linguistic (speech and writing). And, in closing, his advice of “Don’t get involved in irrelevant philosophical disputes”.

The second keynote was presented by Alessandro Oltramari, who works at Bosch Research and Technology Centre in Pittsburgh, USA. He presented several of Bosch’s projects that he was involved in and where ontologies are used in one way or another (slides). One of them was about knowledge-based intelligent IoT and another on an emergency assistant, or, in business sales parlance, a “personal guardian angel” mobile device that has location awareness, safety information of those locations, a decision support system for alternate route computation, and automatic escalation. The ontologies used include the foundational ontology DOLCE, the domain ontology of semantic sensor networks (SSN) from the W3C, and specific schemas developed in-house. Another project on a knowledge-based chatbot for healthcare policies links up DOLCE, schema.org, and some in-house schemas with Highmark-specific information (and is not ashamed of using SKOS). To my question about what methods and methodologies were used for the in-house ontology development, the (disappointing) answer was, unfortunately, only “DOLCE and OntoClean”, but the former is neither a method nor a methodology (it implies a top-down approach), and the latter is some 15 years old, as if nothing has happened in ontology engineering in the meantime (more about that further below). Regardless, it was good to see that ontologies are being used in industry.

The third keynote (slides) was by Riichiro Mizoguchi from the Japan Advanced Institute of Science and Technology (JAIST), on a state-centric methodology, which I’ll leave for a separate post.

Riichiro Mizoguchi during his keynote talk.

 

Presentations

The report on the presentations easily could take up several pages, but I’ll try to keep it short, lest otherwise this post never gets posted. The first session of the conference was on foundations. This included Antony Galton’s assessment of the treatment of time in upper ontologies [1]. It was mildly entertaining in that it turned out that BFO would need abstract things for its treatment of time (which it doesn’t have and doesn’t like) and adheres to Newtonian physics cf. the latest scientific theories. It is definitely on my list of papers to read in more detail. Another paper-for-printing to read is Torsten Hahmann’s work on mereotopology, which extends it to multidimensional space [2]. A nice bonus (though it ought not to be perceived as such) is that at least the theorems in the paper have been proved with Prover9 and Vampire (cf. having to double-check them manually). Laure Vieu presented a proposal for a graph-based approach to represent structure among the components of an entity [3], which is apparently different from the graph-based approach for representing molecules (within the Semantic Web context); I’ll have to look at that in more detail, for it sounds like it might be of some use for the parts aspects of part-whole relations.

Besides such theoretical contributions that are rather distant from applications, there were two of note that were motivated from praxis more clearly. One was about the ontological foundations of competition and the sort of competitive relations there are [4], which was presented by Tiago Prince Sales. The other one was presented by Pawel Garbacz, whose presentation conveyed more than the paper so as to get a real feel of the problem, being identity criteria for localities [5], with complicating use cases extracted from a Polish history project. He presented some examples of changes and a proposal for how to identify a locality/settlement. For instance, settlements can get moved altogether, have a population-only move, split into two, be merged, renamed and renamed again, deserted by a population and repopulated and renamed, and so on. When is it the same settlement and when is it another one? The paper [5] describes a first solution for identity criteria with an event-based approach to identity of localities.

My presentation on part-whole relations in Zulu language and culture [6] was scheduled in the ‘applications’ session; it received positive feedback and some pointers that may assist with future work.

 

venue during a Q&A session

Ontologies

Besides presentations, there was a discussion session on “what constitutes a good ontology paper?” for the Applied Ontology journal. Seeing the ontology papers at FOIS now, they should have done such a session for FOIS as well. There are four papers in the proceedings describing OWL files: “Amnestic forgery” (AF, conceptual metaphors) [7] presented by Mehwish Alam, UNiCS for research and innovation policy [8] presented by Fernando Roda, SAREF4Health [9] presented by João Moreira, and religious and spiritual belief (ORSB) [10] presented by Stefan Schulz. Skimming through each paper, AF, UNiCS and ORSB do not use a methodology explicitly, none of them uses existing methods, but they all do use a foundational or top-level ontology or the WordNet material, and then it’s cool enough to get into FOIS, apparently. This is a bit disappointing. At least SAREF4Health presented a set of competency questions, a systematic approach and broader framework, and some evaluation, and ORSB reuses not only top-level and top-domain ontologies but also tests some patterns. AF and ORSB are of some interest as they’re addressing relatively novel modeling issues to solve and the ORSB discussion could be used more broadly for any “terms of dubious reference”. UNiCS is not really an ontology but an information model or, at best, a conceptual data model (e.g. calling “SCOPUS subject” an ontology is pushing it a bit too far); it makes their OBDA scenario easier to realize, true, but that’s a separate discussion. Fig 1 of SAREF4Health doesn’t look any better either, which has all the hallmarks of a plain UML Class Diagram (attributes with data types and such), with object diagram components attached and coloured in and annotated with OntoUML. SAREF4Health’s other downsides are things like “implementing the ontology as RDF” that just hurts to read (it is left implicit for AF that is plugged into the LOD cloud), as is the download in Turtle format (cf. the required exchange syntax of OWL 2), which isn’t even available at the provided link when you click on it (copy-paste gets you in the right direction), but is [I think] in some github sub-directory that has a whole bunch of ttl files with neither head nor tail, but one of them is called saref4health.ttl. On first inspection, it has plenty of data properties and data type use, and the class-as-instance issue here and there (e.g., ‘Rechargeable Lithium Polymer battery’ as instance cf. class), and others (e.g., a ‘series’ of measurements is not a subclass of a measurement) and very many classes directly subsumed by top, though some are knock-on effects from imports.

And then ontologists at FOIS deplored that there are many domain ontologies of poor quality and artifacts that are presented as ontologies but aren’t. The FOIS reviewers themselves apparently can’t even get their act together in the reviewing process, where artifacts that are sold as domain ontologies but aren’t (UNiCS, SAREF4Health) make it not only through the reviewing process but, moreover, even get a best paper award from the PC chairs (SAREF4Health). The PC chairs wanted to make a political statement to communicate that FOIS accepts domain ontology papers. It is good that the FOIS topics are becoming less narrow and I’m not saying they are pointless papers or lousy artifacts per se: they are useful reference papers and UNiCS and SAREF4Health perform the application tasks they’re supposed to be performing, which is a good thing. Maybe, collectively, ontology developers can’t do better or don’t need to do better w.r.t. applied ontology? Either way, once upon a time there were principles for what ontologies are; what happened to that? Also, there are multiple methodologies for domain ontology development, and there are a myriad of methods and tools, which have been mostly ignored. For instance, using one foundational ontology over another ‘just because I know x’ is neither a scientific nor a sound engineering approach. There are comparisons, requirements, and a mix of the two to help you figure out which one is the best to use; an early tool for that is ONSET, the ONtology Selection and Explanation Tool, developed by Zubeida Khan (more data). To name one example.

Coincidentally, ontology engineering papers with such a content do not, or very rarely, make it into FOIS; but just that they don’t (because they’re typically not philosophical enough), doesn’t mean they don’t exist. Just in case a FOIS ontologist would like to explore methods, methodologies and tools for ontology development: ESWC, EKAW, and K-CAP are good/top conferences covering such topics in whole or in part, and Chapter 5 of the ontology engineering textbook provides a sampling as well (as do some other sections in Block II). Considering my critical comments, one may ask whether my ontologies and ontology papers are any better, or anyone else’s for that matter. Perhaps, perhaps not. You can check for yourself some of my recent papers on domain ontologies that also have OWL files[1] that I was involved in developing; one paper was intended as a reference paper for the domain ontology [11], another paper was a bit of both domain ontology and some framework [12], and yet another turned into a core ontology [13] (v1, with the main categories; there’s an updated version for the relations).

Anyway, returning to the first sentence of this section: the open forum discussion did not make it any clearer as to what would be the characteristics of a good ontology paper for the Applied Ontology journal (or FOIS, for that matter). Mainly just Protégé screenshots certainly is not, but opinions varied as to what would be. Going by examples of the ontology papers that made it through: use of a top-level or foundational ontology and some modeling issues and solutions seems to be preferred, with evaluation and usage & uptake as a nice-to-have. Is developing a (domain) ontology science? That question wasn’t answered unanimously; I think it was leaning towards a ‘mostly no’ w.r.t. applied ontology but it may be if it’s the first to solve a modeling issue. How to evaluate the ontology? Another question without a satisfactory answer. Overall, the criteria for an ontology paper (let alone for the ontology itself) are “TBD” and meanwhile one has to hope that one will get a supportive ‘reviewer 2’.

 

Other

In case you have clicked through to one or more of the listed papers, you may have noticed that the FOIS’18 proceedings are Open Access, paid for by those who registered for the conference (it was calculated in the registration fee). I suppose the next FOIS organisers and the IAOA exec may like your opinion on that approach.

mentors of the early career symposium papers

Besides the best paper award for SAREF4Health [9], there were two “distinguished paper awards”, which went to aforementioned paper on the graph-based approach for structured universals by Laure Vieu and Claudio Masolo [3] and to the foundational ontologies for units of measure by Michael Grüninger and co-authors [14]. The early career symposium went well and from hearsay they had a good social activity, too. There were lots of interesting conversations, networking, good food, and so on, and lots more to write about. There are also more photos.

Some of the postgraduate students and a recent PhD graduate in the spotlight at the closing ceremony, being thanked for chairing the sessions.

Last, but not least: the next FOIS in 2020 will be in Bolzano, Italy, as part of a ‘Bolzano summer of knowledge’ with more co-located conferences, workshops, and summer schools.

 

References

[1] Antony Galton. The treatment of time in upper ontologies. Proc. of FOIS’18. IOS Press, 306: 33-46.

[2] Torsten Hahmann. On Decomposition Operations in a Theory of Multidimensional Qualitative Space. Proc. of FOIS’18. IOS Press, 306: 173-186.

[3] Claudio Masolo, Laure Vieu. Graph-Based Approaches to Structural Universals and Complex States of Affairs. Proc. of FOIS’18. IOS Press, 306: 69-82.

[4] Tiago Prince Sales, Daniele Porello, Nicola Guarino, Giancarlo Guizzardi, John Mylopoulos. Ontological Foundations of Competition. Proc. of FOIS’18. IOS Press, 306: 96-112.

[5] Pawel Garbacz, Agnieszka Ławrynowicz, Bogumił Szady. Identity criteria for localities. Proc. of FOIS’18. IOS Press, 306: 47-56.

[6] C. Maria Keet, Langa Khumalo. On the Ontology of Part-Whole Relations in Zulu Language and Culture. Proc. of FOIS’18. IOS Press, 306: 225-238.

[7] Aldo Gangemi, Mehwish Alam, Valentina Presutti. Amnestic Forgery: An Ontology of Conceptual Metaphors. Proc. of FOIS’18. IOS Press, 306: 159-172.

[8] Alessandro Mosca, Fernando Roda, Guillem Rull. UNiCS – The Ontology for Research and Innovation Policy Making. Proc. of FOIS’18. IOS Press, 306: 200-210.

[9] João Moreira, Luís Ferreira Pires, Marten van Sinderen, Laura Daniele. SAREF4health: IoT Standard-Based Ontology-Driven Healthcare Systems. Proc. of FOIS’18. IOS Press, 306: 239-252.

[10] Stefan Schulz, Ludger Jansen. Towards an Ontology of Religious and Spiritual Belief. Proc. of FOIS’18. IOS Press, 306: 253-260.

[11] Keet, C.M., Lawrynowicz, A., d’Amato, C., Kalousis, A., Nguyen, P., Palma, R., Stevens, R., Hilario, M. The Data Mining OPtimization ontology. Web Semantics: Science, Services and Agents on the World Wide Web, 2015, 32:43-53.

[12] Chavula, C., Keet, C.M. An Orchestration Framework for Linguistic Task Ontologies. 9th Metadata and Semantics Research Conference (MTSR’15), Garoufallou, E. et al. (Eds.). Springer CCIS vol. 544, 3-14.

[13] Keet, C.M. A core ontology of macroscopic stuff. 19th International Conference on Knowledge Engineering and Knowledge Management (EKAW’14). K. Janowicz et al. (Eds.). 24-28 Nov, 2014, Linkoping, Sweden. Springer LNAI vol. 8876, 209-224.

[14] Michael Grüninger, Bahar Aameri, Carmen Chui, Torsten Hahmann, Yi Ru. Foundational Ontologies for Units of Measure. Proc. of FOIS’18. IOS Press, 306: 211-224.

[1] I have others developed as part of methods & tools research

‘Problem shopping’ and networking at IST-Africa’18 in Gaborone

There are several local and regional conferences in (Sub-Saharan) Africa with a focus on Africa in one way or another, be it for, say, computer science and information systems in (mainly) South Africa, computer networks in Africa, or for (computer) engineers. The IST-Africa series covers a broad set of topics and papers must explicitly state how and where all that research output is good for within an African context, hence, with a considerable proportion of the scope within the ICT for Development sphere. I had heard from colleagues it was a good networking opportunity, one of my students had obtained some publishable results during her CS honours project that could be whipped into paper-shape [1], I hadn’t been to Botswana before, and I’m on sabbatical so have some time. To make a long story short: the conference just finished, and I’ll write a bit about the experiences in the remainder of this post.

First, regarding the title of the post: I’m not quite an ICT4D researcher, but I do prefer to work on computer science problems that are based on actual problems that don’t have a solution yet, rather than invented toy examples. A multitude of papers presented at the conference were elaborate on problem specification: the authors had gone out in the field and done the contextual inquiries, attitude surveys, and the like so as to better understand the multifaceted problems themselves before working toward a solution that will actually work (cf. the white elephants littered around on the continent). So, in a way, the conference also doubled as a ‘problem shopping’ event, though note that many solutions were presented as well. Here’s a brief smorgasbord of them:

  • Obstacles to eLearning in, say, Tanzania: internet access (40% only), lack of support, lack of local digital content, and too few data-driven analyses of experiments [2].
  • Digital content for healthcare students and practitioners in WikiTropica [3], which has the ‘usual’ problems of low resource needs (e.g., a textbook with lots of pictures but has to work on the mobile phone or tablet nonetheless), the last mile, and language. Also: the question of how to get people to participate to develop such resources? That’s still an open question; students of my colleague Hussein Suleman have been trying to figure out how to motivate them. As to the 24 responses by participants to the question “…Which incentive do you need?” the results were: 7 money/devices, 7 recognition, 4 none, 4 humanity/care/usefulness, 1 share & learn, and 1 not sure (my encoding).

    Content collaboration perceptions

    information sharing perceptions

    With respect to practices and attitudes toward information sharing, the answers were not quite encouraging (see thumbnails). Of course, all this is but a snapshot, but still.

  • The workshop on geospatial sciences & land administration had a paper on building a national database infrastructure that wasn’t free of challenges, among others: buying data is costly, data is available but without metadata, there are privacy issues, and for data already collected one can’t ask for consent again to repurpose it (p16) [4].
  • How to overcome the (perceived to be the main) hurdle of lack of trust in electronic voting in Kenya [5]. In Thiga’s case, they let the students help coding the voting software and kept things ‘offline’ with a local network in the voting room and the server in sight [5]. There were lively comments in the whole session on voting (session 8c), including privacy issues, auditability, whether blockchain could help (yes on auditability and also anonymity, but consumes a lot of [too much?] electricity, according to a Namibian delegate also in attendance), and scaling up to the population or not (probably not for a while, due to digital literacy and access issues, in addition to the trust issue). The research and experiments continue.
  • Headaches of data integration in Buffalo City to get the water billing information system working properly [6]. There are the usual culprits in system integration from the information systems viewpoint (e.g., no buy-in by top management or users) that were held against the case in the city (cf. the CS side of the equation, like noisy data, gaps, vocabulary alignment etc.). Upon further inquiry, specific issues came to the surface, like water meters not having been read for several years while customers paid some guesstimate all the while, and the interaction between paying for water (one system) and electricity (another system) causing problems for customers even when they have paid, among others [6]. A framework was proposed, but that hasn’t solved the actual data integration problem.

There were five parallel sessions over the three days (programme), so there are many papers to check out still.

As to networking with people in Africa, it was good especially to meet African ontologists and semantic web enthusiasts, and learn of the Botswana National Productivity Centre (a spellchecker might help, though needing a bit more research for seTswana then), and completely unrelated ending up bringing up the software-based clicker system we developed a few years ago (and still works). The sessions were well-attended—most of us having seen monkeys and beautiful sunsets, done game drives and such—and for many it was a unique opportunity, ranging from lucky postgrads with some funding to professors from the various institutions. A quick scan through the participants list showed that relatively many participants are affiliated with institutions from South Africa, Botswana, Tanzania, Kenya, and Uganda, but also a few from Cameroon, Burkina Faso, Senegal, Angola, and Malawi, among others, and a few from outside Africa, such as the USA, Finland, Canada, and Germany. There was also a representative from the EU’s DEVCO and from GEANT (the one behind Eduroam). Last, but not least, not only the Minister of Transport and Communication, Onkokame Kitso, was present at the conference’s opening ceremony, but also the brand new—39 days and counting—President of Botswana, Mokgweetsi Masisi.

No doubt there will be a 14th installment of the conference next year. The paper deadline tends to be in December and extended into January.

 

References

(papers are now only on the USB stick but will appear in IEEE Xplore soon)

[1] Mjaria F, Keet CM. A statistical approach to error correction for isiZulu spellcheckers. IST-Africa 2018.

[2] Mtebe J, Raphael C. A critical review of eLearning Research trends in Tanzania. IST-Africa 2018.

[3] Kennis J. WikiTropica: collaborative knowledge management in the field of tropical medicine and international health. IST-Africa 2018.

[4] Maphanyane J, Nkwae B, Oitsile T, Serame T, Jakoba K. Towards the Building of a Robust National Database Infrastructure (NSDI) Developing Country Needs: Botswana Case Study. IST-Africa 2018.

[5] Thiga M, Chebon V, Kiptoo S, Okumu E, Onyango D. Electronic Voting System for University Student Elections: The Case of Kabarak University, Kenya. IST-Africa 2018.

[6] Naki A, Boucher D, Nzewi O. A Framework to Mitigate Water Billing Information Systems Integration Challenges at Municipalities. IST-Africa 2018.

CFP 6th Controlled Natural Languages workshop

Here’s some advertisement to submit a paper to a great scientific event that has a constructive and stimulating atmosphere. How can one state these positive aspects upfront, one might wonder. I happened to have participated in previous editions (e.g., this time and another time) and now I’m also a member of the organising committee for this 6th edition of the workshop, and we’ll do our best to make it a great event again.

 

——–

Final Call for Papers

Sixth Workshop on Controlled Natural Language (CNL 2018)

Submission deadline (All papers): 15 April 2018

Workshop: 27-28 August 2018 in Maynooth, Co Kildare, Ireland

This workshop on Controlled Natural Language (CNL) has a broad scope and embraces all approaches that are based on natural language and apply restrictions on vocabulary, grammar, and/or semantics.

The workshop proceedings will be published open access by IOS Press.

For further information, please see: http://www.sigcnl.org/cnl2018.html

Logics and other math for computing (LAC18 report)

Last week I participated in the Workshop on Logic, Algebra, and Category theory (LAC2018) (and their applications in computer science), which was held 12-16 February at La Trobe University in Melbourne, Australia. It’s not fully in my research area, so there was lots of fun stuff to learn. There were tutorials in the morning and talks in the afternoon, and, of course, networking and collaborations over lunch and in the evenings.

I finally learned some (hardcore) foundations of institutions that underpin the OMG-standardised Distributed Ontology, Model, and Specification Language DOL, whose standard we used in the (award-winning) KCAP17 paper. It concerns the mathematical foundations to handle different languages in one overarching framework. That framework takes care of the ‘repetitive stuff’ (like all languages dealing with sentences, signatures, models, satisfaction etc.) in one fell swoop instead of repeating that for each language (logic). The 5-day tutorial was given by Andrzej Tarlecki from the University of Warsaw (slides).
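For readers who want to see what that ‘repetitive stuff’ looks like on paper, here is the textbook definition of an institution as I understand it, written out in LaTeX. This is the standard formulation rather than a transcription of the tutorial slides, and the symbol names (Sign, Sen, Mod) are the customary ones, not taken from the slides.

```latex
% An institution consists of a category Sign of signatures together with:
\[
\mathbf{Sen}\colon \mathbf{Sign} \to \mathbf{Set}, \qquad
\mathbf{Mod}\colon \mathbf{Sign}^{op} \to \mathbf{Cat}, \qquad
{\models_{\Sigma}} \subseteq |\mathbf{Mod}(\Sigma)| \times \mathbf{Sen}(\Sigma)
\]
% Satisfaction condition: for every signature morphism \sigma\colon \Sigma \to \Sigma',
% every \Sigma'-model M', and every \Sigma-sentence \varphi:
\[
M' \models_{\Sigma'} \mathbf{Sen}(\sigma)(\varphi)
\quad\Longleftrightarrow\quad
\mathbf{Mod}(\sigma)(M') \models_{\Sigma} \varphi
\]
```

Informally: truth is invariant under change of notation. Translating a sentence along a signature morphism and checking it in a model amounts to the same thing as reducing the model back and checking the original sentence.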

Oliver Kutz, from the Free University of Bozen-Bolzano, presented our K-CAP paper as part of his DOL tutorial (slides), as well as some more practical motivations for and requirements that went into DOL, or: why ontology engineers need DOL to solve some of the problems.

Dirk Pattinson from the Australian National University started gently with modal logics, but it soon got more involved with coalgebraic logics later on in the week.

The afternoons had two presentations each. The ones of most interest to me included, among others, CSP by Michael Jackson; José Fiadeiro’s fun flexible modal logic for specifying actor networks for, e.g., robots and security breaches (that looks hopeless for implementations, but that as an aside); Ionuț Țuțu’s presentation on model transformations focusing on the maths foundations (cf. the boxes-and-lines in, say, Eclipse); and Adrian Rodriguez’s program analysis with Maude (slides). My own presentation was about ontological and logical foundations for interoperability among the main conceptual data modelling languages (slides). It covered some of the outcomes from the bilateral project with Pablo Fillottrani and some new results obtained afterward.

Last, but not least, emeritus Prof Jennifer Seberry gave a presentation about a topic we probably all should have known about: Hadamard matrices and transformations, which appear to be used widely in, among others, error correction, cryptography, spectroscopy and NMR, data encryption, and compression algorithms such as MPEG-4.
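To give an idea of what such a matrix and transform look like, here is a minimal Python sketch. It is not from the talk: it uses the Sylvester construction (which only yields orders that are powers of two) and the usual in-place butterfly for the unnormalised fast Walsh-Hadamard transform. The function names are my own.

```python
def sylvester_hadamard(k):
    """Return the 2^k x 2^k Hadamard matrix (Sylvester construction) as lists of +1/-1."""
    h = [[1]]
    for _ in range(k):
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def is_hadamard(h):
    """Check that distinct rows are orthogonal and each row has squared norm n."""
    n = len(h)
    return all(
        sum(h[i][t] * h[j][t] for t in range(n)) == (n if i == j else 0)
        for i in range(n)
        for j in range(n)
    )


def fwht(values):
    """Unnormalised fast Walsh-Hadamard transform; the length must be a power of two."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                x, y = a[i], a[i + step]
                a[i], a[i + step] = x + y, x - y  # butterfly: sum and difference
        step *= 2
    return a
```

The transform computes the same thing as multiplying by the Sylvester matrix, only in O(n log n) additions instead of O(n^2), and applying it twice returns the input scaled by n, which is why it is cheap to invert.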

Lots of thanks go to Daniel Găină for taking care of most of the organization of the successful event. (and thanks to the generous funders, which made it possible for all of us to fly over to Australia and stay for the week 🙂 ). My many pages of notes will keep me occupied for a while!

Brief report on the INLG16 conference

Another long wait at the airport is being filled with writing up some of the 10 pages of notes I scribbled while attending the WebNLG’16 workshop and the 9th International Natural Language Generation conference 2016 (INLG’16), that were held from 6 to 10 September in Edinburgh, Scotland.

There were two keynote speakers, Yejin Choi and Vera Demberg, and several long and short presentations and a bunch of posters and demos, all of which had full or short papers in the (soon to appear) ACL proceedings online. My impression was that, overall, the ‘hot’ topics were image-to-text, summaries and simplification, and then some question generation and statistical approaches to NLG.

The talk by Yejin Choi was about sketch-to-text, or: pretty much anything to text, such as image captioning, recipe generation based on the ingredients, and one could even do it with sonnets. She used a range of techniques to achieve it, such as probabilistic CFGs and recurrent neural networks. Vera Demberg’s talk, on the other hand, was about psycholinguistics for NLG, starting from the ‘uniform information density hypothesis’ compared to surprisal words and grammatical errors and how that affects a person reading the text. It appears that there’s more pupil jitter when there’s a grammar error. The talk then moved on to see how one can model and predict information density, for which there are syntactic, semantic, and event surprisal models. For instance, with the semantic one: ‘Peter felled a tree’: then how predictable is ‘tree’, given that it’s already kind of entailed in the word ‘felled’? Some results were shown for the most likely fillers for, e.g., ‘serves’ as in ‘the waitress serves…’ and ‘the prisoner serves…’, which then could be used to find suitable word candidates in the sentence generation.
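To make the notion of surprisal a bit more concrete: it is commonly defined as the negative log probability of a word given its context, so a highly predictable word has low surprisal. The sketch below is my own illustration in Python, not the models from the talk; the counts in `TOY_COUNTS` are invented purely to have something to compute with.

```python
import math

# Invented co-occurrence counts, for illustration only: how often each
# word followed the given context in some imaginary corpus.
TOY_COUNTS = {
    "felled": {"tree": 8, "opponent": 1, "building": 1},
    "the waitress serves": {"coffee": 6, "food": 3, "time": 1},
    "the prisoner serves": {"time": 7, "sentence": 2, "coffee": 1},
}


def surprisal(word, context, counts):
    """Surprisal in bits: -log2 P(word | context), estimated from raw counts."""
    following = counts[context]
    total = sum(following.values())
    count = following.get(word, 0)
    if count == 0:
        # Unseen continuation: infinitely surprising under this toy estimate.
        return math.inf
    return -math.log2(count / total)


def likely_fillers(context, counts):
    """Candidate continuations for a context, least surprising first."""
    return sorted(counts[context], key=lambda w: surprisal(w, context, counts))
```

With these made-up counts, `surprisal("tree", "felled", TOY_COUNTS)` is lower than `surprisal("opponent", "felled", TOY_COUNTS)`, which is the intuition behind ‘tree’ being nearly entailed by ‘felled’.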

The best paper award went to “Towards generating colour terms for referents in photographs: prefer the expected or the unexpected?”, by Sina Zarrieß and David Schlangen [1]. While the title might sound a bit obscure, the presentation was very clear. There is the colour spectrum, and people assign names to the colours, which one could take as RGB colour values for images. This is all nice and well on the colour strip, but when a colour is put in the context of other colours and background knowledge, the colour term humans would use to describe that patch on an image isn’t always in line with the actual RGB colour. The authors approached the problem by viewing it as a multi-class classification problem and used a multi-layer perceptron with some top-down recalibration, and voilà, the software returns the intended colour, most of the time. (Knowing the name of the colour, one then can go on trying to automatically annotate images with text.)

As for the other plenary presentations, I did make notes of all of them, but will select only a few due to time limitations. The presentation by Advaith Siddharthan on summarisation of news stories for children [2] was quite nice, as it needed three aspects together: summarising text (with NLG, not just repeating a few salient sentences), simplifying it with respect to children’s vocabulary, and editing out or rewording the harsh news bits. Another paper on summaries was presented by Sabita Acharya [3], which is likely to be relevant also to my student’s work on NLG for patient discharge notes [4]. Sabita focussed on trying to get doctors’ notes and the plan of care into a format understandable by a layperson, and used the UMLS in the process. A different topic was NLG for automatically describing graphs to blind people, with grade-appropriate lexicons (4-5th grade learners and students) [5]. Kathy McCoy outlined how they were happy to remember their computer science classes, and to see that they could use graph search to solve it, with its states, actions, and goals. They evaluated the generated text for the graphs, as many others did in their research, with crowdsourcing using the Mechanical Turk. One other paper that is definitely on my post-conference reading list is the one about mereology and geographic entities for weather forecasts [6], which was presented by Rodrigo de Oliveira. For instance, ‘the south’ in a Scottish weather forecast refers to a different region than ‘the south’ in a forecast for the UK as a whole, and the task was how to generate the right term for the intended region.

[Figure: our poster on generating sentences with part-whole relations in isiZulu]

My 1-minute lightning talk of Langa’s and my long paper [7] went well (one other speaker of the same session even resentfully noted afterward that I got all the accolades of the session), as did the poster and demo session afterward. The contents of the paper on part-whole relations in isiZulu were introduced in a previous post, and you can click on the thumbnail on the right for a png version of the poster (which has less text than the blog post). Note that the poster only highlights three part-whole relations from the 11 discussed in the paper.

ENLG and INLG will merge and become a yearly INLG, there is a SIG for NLG (www.siggen.org), and one of the ‘challenges’ for this upcoming year will be on generating text from RDF triples.
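For those unfamiliar with what ‘text from RDF triples’ amounts to, here is a deliberately naive, template-based sketch in Python. It is my own illustration and not the challenge’s set-up or data: the predicates, the templates, and the function names are all hypothetical.

```python
# Hypothetical predicate-to-template mapping; a real system would need
# far more than one fixed sentence pattern per predicate.
TEMPLATES = {
    "birthPlace": "{s} was born in {o}.",
    "capital": "The capital of {s} is {o}.",
}


def label(term):
    """Turn a local name such as 'Alan_Turing' into a readable label."""
    return term.replace("_", " ")


def verbalise(triple):
    """Render one (subject, predicate, object) triple as an English sentence."""
    subject, predicate, obj = triple
    # Fall back to a crude 'subject predicate object.' pattern when no
    # template is known for the predicate.
    template = TEMPLATES.get(predicate, "{s} {p} {o}.")
    return template.format(s=label(subject), p=predicate, o=label(obj))
```

For example, `verbalise(("Alan_Turing", "birthPlace", "London"))` yields "Alan Turing was born in London."; the hard part of the task is everything this sketch ignores, such as aggregating several triples into one fluent sentence.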

Irrelevant for the average reader, I suppose, was that there were some 92 attendees, most of whom attended the social dinner where there was a ceilidh (Scottish traditional music by a band with traditional dancing by the participants), where it was even possible to have many (traditional) couples for the couples dances. There was some overlap in attendees between CNL16 and INLG16, so while it was my first INLG it wasn’t all brand new, yet there were also new people to meet and network with. As a welcome surprise, it was even mostly dry and sunny during the conference days in the otherwise quite rainy Edinburgh.

 

References

(links TBA shortly—neither Google nor duckduckgo found their pdfs yet)

[1] Sina Zarrieß and David Schlangen. Towards generating colour terms for referents in photographs: prefer the expected or the unexpected? INLG’16. ACL, 246-255.

[2] Iain Macdonald and Advaith Siddharthan. Summarising news stories for children. INLG’16. ACL, 1-10.

[3] Sabita Acharya, Barbara Di Eugenio, Andrew D. Boyd, Karen Dunn Lopez, Richard Cameron, Gail M Keenan. Generating summaries of hospitalizations: A new metric to assess the complexity of medical terms and their definitions. INLG’16. ACL, 26-30.

[4] Joan Byamugisha, C. Maria Keet, Brian DeRenzi. Tense and aspect in Runyankore using a context-free grammar. INLG’16. ACL, 84-88.

[5] Priscilla Morales, Kathleen McCoy, and Sandra Carberry. Enabling text readability awareness during the micro planning phase of NLG applications. INLG’16. ACL, 121-131.

[6] Rodrigo de Oliveira, Somayajulu Sripada and Ehud Reiter. Absolute and relative properties in geographic referring expressions. INLG’16. ACL, 256-264.

[7] C. Maria Keet and Langa Khumalo. On the verbalization patterns of part-whole relations in isiZulu. INLG’16. ACL, 174-183.

Reflections on ESWC 2016: where are the ontologies papers?

Although I did make notes of the presentations I attended at the 13th Extended Semantic Web Conference a fortnight ago, with the best intentions to write a conference report, it’s going to be an opinion piece of some sort, on ontology engineering, or, more precisely: the lack thereof at ESWC2016.

That there isn’t much on ontology research at ISWC over the past several years, I already knew, both from looking at the accepted papers and the grapevine, but ESWC was still known to be welcoming to ontology engineering. ESWC 2016, however, had only one “vocabularies, schemas, and ontologies” [yes, in that order] session (and one on reasoning), with only the paper by Agnieszka and me solidly in the ‘ontologies’/ontology engineering bracket, with new theory, a tool implementing it, experiments, and a methodology sketch [1]. The other two papers were more on using ontologies, in annotating documents and in question answering. My initial thought was: “ah, hm, bummer, so ESWC also shifted focus”. There also were few ontologists at the conference, so I wondered whether the others had moved on to a non-LD related field, like I did shift focus a bit thanks/due to funded projects in adjacent fields (I did try to get funds for ontology engineering projects, though).

To my surprise, however, it appeared that a whopping 27 papers had been submitted to the “vocabularies, schemas, and ontologies” track. It was just that only three had made it through the review process. Asking around a bit, the comments were sort of like when I was co-chair of the track for ESWC 2014: ‘meh’, not research (e.g., just developing a domain ontology), minor delta, need/relevance unclear. And looking again at my reviews for 2015 and 2016, in addition to those reasons: failing to consider relevant related work, or lacking a comparison with related work (needed to demonstrate improvement), and/or some issues with the theory (formal stuff). So, are we to blame and ‘suicidal’, or have we become complacent and lazy? It’s not like the main problems have been solved and developing an ontology has become a piece of cake now, compared to, say, 10 years ago. And while it is somewhat tempting to do some paper/presentation bashing, I won’t go into specifics, other than that at two presentations I attended, where they did show a section of an ontology, there was even the novice error of confusing classes with instances.

Anyway, there used to be more ontology papers in earlier ESWCs. To check that subjective impression, I did a quick-and-dirty check of the previous 12 editions as well, of which 11 had named sessions. Here’s the overview of the number of ontology papers over the years (minus the first one as it did not have named sessions):

[Figure: number of ontology papers at ESWC per year]

The aggregates are a bit ‘dirty’: the 2010 increase comes from grouping ontologies together with reasoning (if done for 2016, we’d have made it to 6), 2007 was likewise a bit flexible on that, and 2015 had 3 ontologies papers + 3 ontology matching & summarization papers, so stretching it a bit in that direction, as was the case in 2013. The number of papers in 2006 is indeed that high, with sessions on ontology engineering (3 papers), ontology evaluation (3), ontology alignment (5), ontology evolution (3), and ontology learning (3). So, there is indeed a somewhat downward trend.

Admittedly, ‘ontologies’ is over the initial hype and it probably now requires more preparation and work to come up with something sufficiently new than it did 10 years ago. Looking rather at the proceedings of 5 years ago, the 7 ontologies papers were definitely not trivial, and I still remember the one on removing redundancies [2], the introduction of two new matching evaluation measures and comparison with other methods [3], and automatically detecting related ontology versions [4]. Five ontology papers then had new theory and some experiments, and two had extensive experiments [5,6]. 2012 had 6 ontologies papers, some interesting, but something like the ‘SKOS survey’ is a dated thing (nice, but ESWC-level?) and ISOcat isn’t great (but I’m biased here, as I don’t like it that noun classes aren’t in there, and it is hard to access).

Now what? Work more/harder on ontology engineering if you don’t want to have it vanish from ESWC. That’s easier said than done, though. But I suppose it’s fair to say that one should not discard the ESWC venue as being ‘not an ontology venue anymore’, and instead use these six months to the deadline to work hard enough. Yet, who knows, maybe we are harder on ourselves when reviewing papers compared to other tracks. Either way, it is something to reflect upon, as an 11% acceptance rate for a track, like this year, isn’t great. ESWC16 in general had good papers and interesting discussions. While the parties don’t seem to be as big as they used to be, there sure is a good time to be had as well.

 

p.s.: Cretan village, where I stayed for the first time, was good and had a nice short walk on the beach to the conference hotel, but beware that the mosquitos absent from Knossos Hotel all flock to that place.

 

References

[1] Keet, C.M., Lawrynowicz, A. Test-Driven Development of Ontologies. In: Proceedings of the 13th Extended Semantic Web Conference (ESWC’16). Springer LNCS 9678, 642-657. 29 May – 2 June, 2016, Crete, Greece.

[2] Stephan Grimm and Jens Wissmann. Elimination of redundancy in ontologies. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC’11). Heraklion, Crete, Greece, 29 May – 2 June 2011. Springer LNCS 6643, 260-274.

[3] Xing Niu, Haofen Wang, Gang Wu, Guilin Qi, and Yong Yu. Evaluating the Stability and Credibility of Ontology Matching Methods. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC’11). Heraklion, Crete, Greece, 29 May – 2 June 2011. Springer LNCS 6643, 275-289.

[4] Carlo Allocca. Automatic Identification of Ontology Versions Using Machine Learning Techniques. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC’11). Heraklion, Crete, Greece, 29 May – 2 June 2011. Springer LNCS 6643, 275-289.

[5] Keet, C.M. The use of foundational ontologies in ontology development: an empirical assessment. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC’11). Heraklion, Crete, Greece, 29 May – 2 June 2011. Springer LNCS 6643, 321-335.

[6] Wei Hu, Jianfeng Chen, Hang Zhang, and Yuzhong Qu. How Matchable Are Four Thousand Ontologies on the Semantic Web. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC’11). Heraklion, Crete, Greece, 29 May – 2 June 2011. Springer LNCS 6643, 290-304.

CFP Logics and Reasoning for Conceptual Models (LRCM 2016)

Just in case you don’t have enough to do these days, or want to ‘increase exposure’ when attending KR2016/DL2016/NMR2016 in Cape Town in April, or try to use it as another way in to attend KR2016/DL2016/NMR2016, or [fill in another reason]: please consider submitting a paper or an abstract to the Second Workshop on Logics and Reasoning for Conceptual Models (LRCM 2016):

================================================================
Second Workshop on Logics and Reasoning for Conceptual Models (LRCM 2016)
April 21, 2016, Cape Town, South Africa
http://lrcm2016.cs.uct.ac.za/
==
Co-located with:
15th Int. Conference on Knowledge Representation and Reasoning (KR 2016)
  http://kr2016.cs.uct.ac.za/
29th Int. Workshop on Description Logics (DL 2016)
  http://dl2016.cs.uct.ac.za/
==============================================================

There is an increase in complexity of information systems due to,
among others, company mergers with information system integration,
upscaling of scientific collaborations, e-government etc., which push
the necessity for good quality information systems. An information
system’s quality is largely determined in the conceptual modelling
stage, and avoiding or fixing errors of the conceptual model saves
resources during design, implementation, and maintenance. The size and
high expressivity of conceptual models represented in languages such
as EER, UML, and ORM require a logic-based approach in the
representation of information and adoption of automated reasoning
techniques to assist in the development of good quality conceptual
models. The theory to achieve this is still in its infancy, however,
with only a limited set of theories and tools that address subtopics
in this area. This workshop aims at bringing together researchers
working on the logic foundations of conceptual data modelling
languages and the reasoning techniques that are being developed so as
to discuss the latest results in the area.

**** Topics ****
Topics of interest include, but are not limited to:
- Logics for temporal and spatial conceptual models and BPM
- Deontic logics for SBVR
- Other logic-based extensions to standard conceptual modelling languages
- Unifying formalisms for conceptual schemas
- Decidable reasoning over conceptual models
- Dealing with finite and infinite satisfiability of a conceptual model
- Reasoning over UML state and behaviour diagrams
- Reasoning techniques for EER/UML/ORM
- Interaction between ontology languages and conceptual data modelling languages
- Tools for logic-based modelling and reasoning over conceptual models
- Experience reports on logic-based modelling and reasoning over conceptual models
- Logics and reasoning over models for Big Data

To this end, we solicit mainly theoretical contributions with regular
talks and implementation/system demonstrations and some modelling
experience reports to facilitate cross-fertilisation between theory
and praxis.  Selection of presentations is based on peer-review of
submitted papers by at least 2 reviewers, with a separation between
theory and implementation & experience-type of papers.

**** Submissions ****
We welcome submissions in LNCS style in the following two formats for
oral presentation:
- Extended abstracts of maximum 2 pages;
- Research papers of maximum 10 pages.
Both can be submitted in pdf format via the EasyChair website at
https://easychair.org/conferences/?conf=lrcm2016.

**** Important dates ****
Submission of papers/abstracts:  February 7, 2016
Notification of acceptance:      March 15, 2016
Camera-ready copies:             March 21, 2016
Workshop:                        April 21, 2016

**** Organisers ****
Diego Calvanese (Free University of Bozen-Bolzano, Italy)
Alfredo Cuzzocrea (University of Trieste and ICAR-CNR, Italy)
Maria Keet (University of Cape Town, South Africa)

**** PC Members ****
Alessandro Artale (Free University of Bozen-Bolzano, Italy)
Arina Britz (Stellenbosch University, South Africa)
Thomas Meyer (University of Cape Town, South Africa)
Marco Montali (Free University of Bozen-Bolzano, Italy)
Alessandro Mosca (SIRIS Academic, Spain)
Till Mossakowski (University of Magdeburg, Germany)
Anna Queralt (Barcelona Supercomputing Center, Spain)
Vladislav Ryzhikov (Free University of Bozen-Bolzano, Italy)
Pablo Fillottrani (Universidad Nacional del Sur, Argentina)
Szymon Klarman (Brunel University London, UK)
Roman Kontchakov (Birkbeck, University of London, UK)
Oliver Kutz (Free University of Bozen-Bolzano, Italy)
Ernest Teniente (Universitat Politecnica de Catalunya, Spain)
David Toman (University of Waterloo, Canada)
(Further invitations pending)

Depending on the number of submissions, the duration of the workshop
will be either half a day or a full day.

Fruitful ADBIS’15 in Poitiers

The 19th Conference on Advances in Databases and Information Systems (ADBIS’15) just finished yesterday. It was an enjoyable and well-organised conference in the lovely town of Poitiers, France. Thanks to the general chair, Ladjel Bellatreche, and the participants I had the pleasure to meet up with, listen to, and receive feedback from. The remainder of this post mainly recaps the keynotes and some of the presentations.

 

Keynotes

The conference featured two keynotes, one by Serge Abiteboul and one by Jens Dittrich, both distinguished scientists in databases. Abiteboul presented the multi-year project on Webdamlog that ended up as a ‘personal information management system’, which is a simple term that hides the complexity that happens behind the scenes. (PIMS is informally explained here). It breaks with the paradigm of centralised text (e.g., Facebook) to distributed knowledge. To achieve that, one has to analyse what’s happening and construct the knowledge from that, exchange knowledge, and reason and infer knowledge. This requires distributed reasoning, exchanging facts and rules, and taking care of access control. It is being realised with a datalog-style language that can also handle a non-local knowledge base. That is, there’s both solid theory and implementation (going by the presentation; I haven’t had time to check it out).
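To give a flavour of what ‘datalog-style’ reasoning over exchanged facts means, here is a minimal sketch in Python. It is emphatically not Webdamlog, nor its syntax: it only shows naive bottom-up evaluation of two rules over facts that I pretend come from two peers. The peer names, the `friend` facts, and the function name are made up for illustration, and access control and actual distribution are left out entirely.

```python
def infer_reachable(facts):
    """Naive bottom-up evaluation of two Datalog-style rules:

        reach(X, Y) :- friend(X, Y).
        reach(X, Z) :- reach(X, Y), friend(Y, Z).

    `facts` is a set of (x, y) pairs standing for friend(x, y).
    """
    reach = set(facts)  # first rule: every friend fact is a reach fact
    while True:
        # second rule: extend every known reach fact by one friend step
        derived = {(x, z) for (x, y) in reach for (y2, z) in facts if y == y2}
        if derived <= reach:
            # fixpoint: nothing new can be inferred
            return reach
        reach |= derived


# Hypothetical facts held locally by two different peers.
peer_alice = {("alice", "bob")}
peer_bob = {("bob", "carol"), ("carol", "dave")}

# 'Exchanging facts' is reduced here to a plain set union.
reachable = infer_reachable(peer_alice | peer_bob)
```

Neither peer can derive `("alice", "dave")` from its own facts alone; it only follows once the facts are combined, which is the kind of inference that makes the distributed setting interesting.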

The main part of the cool keynote talk by Dittrich was on ‘the case for small data management’. From the who-wants-to-be-a-millionaire style popquiz question asking us to guess the typical size of a web database, it appeared to be only in the MBs (which most of us overestimated), and that sort of explains why MySQL [that doesn’t scale well] is used rather widely. This results in a mismatch between problem size and tools. Another popquiz question answer: the 100MB RDF can just as well be handled efficiently by Python, apparently. Interesting factoids, and ones that have, or should have, as a consequence that we should perhaps be looking more into ‘small data’. He presented his work on PDbF as an example of that small data management. Very briefly, and based on my scribbles from the talk: it’s an enhanced pdf where you can access the raw data behind the graphs in the paper as well (it is embedded in it, with an OLAP engine for posing the same and other queries), has an HTML rendering so you can hover over the graphs, and some more visualisation. If there’s software associated with the paper, it can go into the whole thing as well. Overall, that makes the data dynamic, manageable, traceable (from figure back to raw data), and re-analysable. The last part of his talk was on his experiences with the flipped classroom (more here; in German), but that was not nearly as fun as his analysis and criticism of the “big data” hype. I can’t recall exactly his plain English terms for the “four Vs”, but the ‘lots of crappy XML data that changes’ remained of it in my memory bank (it was similar to the first 5 minutes of another keynote talk he gave).

 

Sessions

Sure, despite the notes on big data, there were presentations in the sessions that could be categorised under ‘big data’. Among others, Ajantha Dahanayake presented a paper on a proposal for requirements engineering for big data [1]. Big data people tend to assume it is just there already for them to play with. But how did it get there, how to collect good data? The presentation outlined a scenario-based backwards analysis, so that one can reduce unnecessary or garbage data collection. Dahanayake also has a tool for it. Besides the requirements analysis for big data, there’s also querying the data and the desire to optimize it so as to keep having fast responses despite its large size. A solution to that was presented by Reuben Ndindi, whose paper also won the best paper award of the conference [2] (for the Malawians at CS@UCT: yes, the Reuben you know). It was scheduled in the very last session on Friday and my note-taking had ground to a halt. If my memory serves me well, they make a metric database out of a regular database, compute the distances between the values, and evaluate the query on that, so as to obtain a good approximation of the true answer. There’s both the theoretical foundation and an experimental validation of the approach. In the end, it’s faster.

Data and schema evolution research is alive and well, as were time series and temporal aspects. Due to parallel sessions and my time constraints writing this post, I’ll mention only two on the evolution; one because it was a very good talk, the other because of the results of the experiments. Kai Herrmann presented the CoDEL language for database evolution [3]. A database and the application that uses it change (e.g., adding an attribute, splitting a table), which requires quite lengthy scripts with lots of SQL statements to execute. CoDEL does it with fewer statements, and the language has the good quality of being relationally complete [3]. Lesley Wevers approached the problem from a more practical angle and restricted to online databases. For instance, Wikipedia does make updates to their database schema, but they wouldn’t want to have Wikipedia go offline for that duration. How long does it take for which operation, in which RDBMS, and will it only slow down during the schema update, or block any use of the database entirely? The results obtained with MySQL, PostgreSQL and Oracle are a bit of a mixed bag [4]. It generated a lively debate during the presentation regarding the test set-up, what one would have expected the results to be, and the duration of blocking. There’s some work to do there yet.
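To illustrate why even a single ‘split a table’ change takes a handful of SQL statements when done by hand, here is a small sketch using Python’s built-in `sqlite3` module. It shows only the plain-SQL side of the story, not CoDEL, whose syntax I do not reproduce here; the `person` and `address` tables, their columns, and the function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with one wide, hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person ("
        "id INTEGER PRIMARY KEY, name TEXT, street TEXT, city TEXT)"
    )
    conn.execute(
        "INSERT INTO person VALUES (1, 'Ada', '1 Example Street', 'Exampleville')"
    )
    conn.commit()
    return conn


def split_person_table(conn):
    """Split person(id, name, street, city) into person(id, name)
    and address(person_id, street, city), keeping the data."""
    conn.executescript(
        """
        CREATE TABLE address (
            person_id INTEGER PRIMARY KEY,
            street TEXT,
            city TEXT
        );
        INSERT INTO address (person_id, street, city)
            SELECT id, street, city FROM person;

        -- Rebuild person without the address columns.
        CREATE TABLE person_new (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO person_new (id, name)
            SELECT id, name FROM person;
        DROP TABLE person;
        ALTER TABLE person_new RENAME TO person;
        """
    )
    conn.commit()
```

That is six statements for one conceptual operation on a toy schema, before considering indexes, foreign keys, or the application code that has to change along with it.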

The presentation of the paper I co-authored with Pablo Fillottrani [5] (informally described here) was scheduled for that dreaded 9am slot the morning after the social dinner. Notwithstanding, quite a few participants did show up, and they showed interest. The questions and comments had to do with earlier work we used as input (the metamodel), qualifying quality of the conceptual model, and that all too familiar sense of disappointment that so few language features were used widely in publicly available conceptual models (the silver lining of excellent prospects of runtime usage of conceptual models notwithstanding). Why this is so, I don’t know, though I have my guesses.

 

And the other things that make a conference useful and fun to go to

In short: Networking, meeting up again with colleagues not seen for a while (ranging from a few months [Robert Wrembel] to some 8 years [Nadeem Iftikhar] and in between [a.o., Martin Rezk, Bernhard Thalheim]), meeting new people, exchanging ideas, and the social events.

2008 was the last time I’d been in France, for EMMSAD’08, where, looking back now, I coincidentally presented a paper also on conceptual modelling languages and logic [6], but one that looked at comprehensive feature coverage and comparing languages rather than unifying. It was good to be back in France, and it was nice to realise my understanding and speaking skills in French aren’t as rusty as I thought they were. The travels from South Africa are rather long, but definitely worthwhile. And it gives me time to write blog posts while killing time at the airport.

 

References

(note: most papers don’t show up at Google scholar yet, hence, no links; they are on the Springer website, though)

[1] Noufa Al-Najran and Ajantha Dahanayake. A Requirements Specification Framework for Big Data Collection and Capture. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, .

[2] Boris Cule, Floris Geerts and Reuben Ndindi. Space-bounded query approximation. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 397-414.

[3] Kai Herrmann, Hannes Voigt, Andreas Behrend and Wolfgang Lehner. CoDEL – A Relationally Complete Language for Database Evolution. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 63-76.

[4] Lesley Wevers, Matthijs Hofstra, Menno Tammens, Marieke Huisman and Maurice van Keulen. Analysis of the Blocking Behaviour of Schema Transformations in Relational Database Systems. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 169-183.

[5] Pablo R. Fillottrani and C. Maria Keet. Evidence-based Languages for Conceptual Data Modelling Profiles. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 215-229.

[6] C. Maria Keet. A formal comparison of conceptual data modeling languages. EMMSAD’08. CEUR-WS Vol-337, 25-39.