Workshop held at the
Fifth International Semantic Web Conference
ISWC 2006
November 5 - 9, 2006
Proceedings of the
First International Workshop on
Applications and Business
Aspects of the Semantic Web
SEBIZ 2006
November 6, 2006
Athens, Georgia, USA
Edited by
Elena Paslaru Bontas Simperl,
Martin Hepp, and
Christoph Tempich
The workshop Website is available online at http://www.ag-nbi.de/conf/SEBIZ06/
Contents

Introduction .............................................. iv
  Motivation .............................................. iv
  The Workshop ............................................. v
  Technical presentations ................................. vi
  Conclusions and Outlook ................................. vi

1  Enhancing Data and Processes Integration and Interoperability
   in Emergency Situations: a SWS based Emergency Management System
   Alessio Gugliotta, Rob Davies, Leticia Gutiérrez Villarías, Vlad Tanasescu,
   John Domingue, Mary Rowlatt, Marc Richardson, Sandra Stinčić ...... 1

2  Building Ontology in Public Administration: A Case Study
   Graciela Brusa, Ma. Laura Caliusco, Omar Chiotti ................. 16

3  Personalized Question Answering: A Use Case for Business Analysis
   VinhTuan Thai, Sean O’Riain, Brian Davis, David O’Sullivan ....... 31

4  OntoCAT: An Ontology Consumer Analysis Tool and Its Use on
   Product Services Categorization Standards
   Valerie Cross, Anindita Pal ...................................... 44

5  Improving the recruitment process through ontology-based querying
   Malgorzata Mochol, Holger Wache, Lyndon Nixon .................... 59
Introduction
Motivation
Within the past five years, the Semantic Web research community has brought
to maturity a comprehensive set of foundational technology components, both
at the conceptual level and in the form of prototypes and software. This
includes, among other assets, ontology engineering methodologies, standardized
ontology languages, ontology engineering tools, and other infrastructure such
as APIs, repositories, and scalable reasoners, plus a plethora of work for
making the Deep Web and computational functionality in the form of Web
Services accessible at a semantic level. However, in order for these
achievements to provide a feasible basis for ontology-based systems to start
up at large scale in corporate applications, they should be complemented by
methods, validated by practical application, which allow enterprises to:
• Effectively adopt ontology-based systems in the existing infrastructure.
In particular this requires
– best practices and convincing showcases
– means to monitor the quality of the ontology development and deployment processes
– means to estimate and control the costs involved in the development and usage of ontologies
– means to investigate the costs and benefits of applying particular development
or deployment strategies in specific application settings
• Evaluate the quality of existing ontologies and of ontology engineering
methodologies, methods and tools. In particular, the dissemination of
ontology-based technologies at corporate level requires methods to measure the
usability of a particular ontology in a specific business scenario and to
estimate the business value of ontologies, as well as objective means to
compare methodologies, methods and tools dealing with them.
The availability of best practices, convincing showcases, and metrics, as well
as quantitative and qualitative measurements assisting particular stages of
ontology engineering processes, is an essential requirement for organizations
to be able to optimize these processes.
The Workshop
The SEBIZ Workshop on Applications and Business Aspects of the Semantic
Web brought together 30 professionals affiliated with both industry and academia.
The workshop program included a short introductory talk given by the organizers,
four technical presentations and extensive discussions.
The workshop organizers, Elena Simperl, Martin Hepp and Christoph Tempich,
received seven paper submissions in response to the call for papers. As a
result of the peer reviewing process, five of these were selected for publication in
these proceedings. The program committee consisted of the following Semantic
Web experts from industry and academia: Richard Benjamins (iSOCO), Chris
Bizer (Free University of Berlin), Christoph Bussler (Cisco Systems), Jorge
Cardoso (University of Madeira), Oscar Corcho (University of Manchester),
Roberta Cuel (University of Trento), John Davies (BT), Jos de Bruijn (University of Innsbruck), Tommaso Di Noia (Politecnico di Bari), John Domingue
(Open University), Dieter Fensel (University of Innsbruck), Doug Foxvog (National University of Ireland Galway), Fausto Giunchiglia (University of Trento),
Michel Klein (Vrije Universiteit Amsterdam), Juhnyoung Lee (IBM Research),
Alain Leger (France Telecom), Miltiadis Lytras (Athens University of Economics & Business), Dumitru Roman (University of Innsbruck), York Sure
(University of Karlsruhe), Robert Tolksdorf (Free University of Berlin), Ioan
Toma (University of Innsbruck) and Yuxiao Zhao (Linköping University). The
organization committee would like to thank all PC members for their thorough
and substantial reviews, which were crucial for the success of the workshop.
In the first session of the workshop Elena Simperl gave, on behalf of the
organizers, a short talk which outlined the motivation and objectives of the
workshop and introduced the technical program. This consisted of two sessions
of paper presentations, followed by a closing session in which the attendees
participated in a lively discussion on the main topics covered by the event and
pointed out open issues for making the Semantic Web a success at industry
level. In the following, we summarize the main points.
Technical Presentations
The presentations given at this workshop covered the following topics: Business
Intelligence (BI); ontology-based content integration tasks in business and
public sector applications; Human Resources (HR); metrics to evaluate and
compare existing ontologies; metrics to determine the usability of a particular
ontology in a specific business scenario; and quality frameworks for ontologies.
The paper “OntoCAT: An Ontology Consumer Analysis Tool and Its Use on
Product Services Categorization Standards” by Valerie Cross and Anindita Pal
gives an in-depth overview of the field of ontology evaluation. It introduces
OntoCAT, a tool which computes a comprehensive set of metrics for use by the
ontology consumer or knowledge engineer to assist in ontology evaluation for
re-use.
The paper “Improving the recruitment process through ontology-based querying”
by Malgorzata Mochol, Lyndon Nixon and Holger Wache approaches the
problem of approximate reasoning in the context of an eRecruitment scenario.
It describes a query relaxation method which demonstrates the benefit of using
formal ontologies for improving the retrieval performance and the
user-friendliness of a semantic job portal.
Leticia Gutierrez Villarias presented the paper “Enhancing Data and Processes
Integration and Interoperability in Emergency Situations: a SWS based Emergency Management System” by Alessio Gugliotta, Leticia Gutierrez Villarias,
Vlad Tanasescu, John Domingue, Mary Rowlatt, Marc Richardson and Sandra Stincic. The talk describes how semantic technologies, and in particular
Semantic Web Services, can be successfully deployed to integrate data and applications in the field of emergency management.
A second use case for the Semantic Web was discussed in the paper “Personalized Question Answering: A Use Case for Business Analysis” by VinhTuan
Thai, Sean O’Riain, Brian Davis and David O’Sullivan. The approach provides
evidence on the importance of using domain semantics in question answering
tasks to resolve ambiguities and to improve the recall for retrieving relevant
passages.
Conclusions and Outlook
The workshop gave clear evidence that semantic technologies are experiencing
a shift from a pure research topic to real-world applications. Furthermore, the
presentations and the discussions among the attendees showed the substantial
interest of academia in transferring the results achieved so far to industry. On
the other hand, industry seems to be well aware of these achievements and of the
added value of using semantics for data and application integration purposes.
The main results of the SEBIZ06 workshop and associated discussions can
be summarized as follows:
• The results achieved by the research community in the last decade provide
the core building blocks for realizing the Semantic Web. France Telecom, HP,
IBM and Vodafone provide first success stories of deploying semantic
technologies within enterprises, while companies such as Ontoprise,
TopQuadrant, Cerebra, Oracle and Altova are established technology vendors.
• There was consensus among the workshop participants that the mainstream adoption of semantic technologies will take about five years from
now.
Furthermore, the discussion revealed a series of open issues which are crucial
for the uptake of semantic technologies at industrial level:
• A major drawback when applying semantics within enterprises is the lack
of tools for leveraging semantic data from existing legacy systems.
• Business people require means to evaluate both the technology and the content.
• The business aspects of the development and deployment of semantic
technologies are still only marginally addressed, thus impeding their
large-scale adoption.
Berlin, Innsbruck and Karlsruhe
November, 2006
Elena Simperl
Martin Hepp
Christoph Tempich
Proceedings of SEBIZ 2006
Enhancing Data and Processes Integration and
Interoperability in Emergency Situations:
a SWS based Emergency Management System
Alessio Gugliotta¹, Rob Davies², Leticia Gutiérrez-Villarías², Vlad Tanasescu¹,
John Domingue¹, Mary Rowlatt², Marc Richardson³, Sandra Stinčić³

¹ Knowledge Media Institute, The Open University,
Walton Hall, Milton Keynes, MK7 6AA, UK
{v.tanasescu, a.gugliotta, j.b.domingue}@open.ac.uk
² Essex County Council, County Hall,
Chelmsford, CM1 1LX, UK
{Leticia.gutierrez, maryr}@essexcc.gov.uk
rob.davies@mdrpartners.com
³ BT Group,
Adastral Park, Martlesham, Ipswich IP5 3RE, UK
{marc.richardson, sandra.stincic}@bt.com
Abstract. In this paper we describe a use case application in the area of
emergency management that illustrates the benefits of a system based on
Semantic Web Services (SWS) through the automation of the business processes
involved. After creating Web services to provide spatial data to third parties
through the Internet, semantics and domain ontologies were added to represent
the business processes involved, allowing ease of access to, and combination
of, heterogeneous data from different providers, as well as automatic
discovery, access and composition to perform more complex tasks. In this way,
our prototype contributes to better management of emergency situations by
those responsible. The work described is supported by DIP (Data, Information
and Process Integration with Semantic Web Services), an Integrated Project
(FP6-507483) funded under the European Union’s IST programme.
1. Introduction
In an emergency response situation there are predefined procedures which set out the
duties of all agencies involved. A very wide range of agencies is often involved in the
management of an emergency situation, potentially involving a huge data provision
and communication requirement between them. Needs and concerns are escalated
through a specified chain of command, but the organisations are independent of one
another and decisions have to be made rapidly, based on knowledge of the situation
(e.g. the type of problem, the site, and the population affected) and the data available.
Gathering all the data in a manual or semi-automated way takes time and resources
that those responsible for emergency planning and incident response may not have.
By making data and resources available through the Internet, companies and
public organizations can easily and inexpensively share information with
customers and partners. Web Services (WS) would allow emergency planning
agencies and rescue corps to interoperate and share vital information easily.
The supplied services are autonomous and platform-independent computational
elements. They can be described, published, discovered, orchestrated, and
programmed using XML artifacts for the purpose of developing massively
distributed interoperable applications.
Unfortunately, despite progress in the use of standards for Web Service description
(WSDL [9]) and publishing (UDDI [10]), the syntactic definitions used in these
specifications do not completely describe the capability of a service and cannot be
understood by software programs. A human developer is required to interpret the
meaning of inputs, outputs and applicable constraints as well as the context in which
services can be used.
Semantic Web Services (SWS) technology aims to alleviate these problems. It
combines the flexibility, reusability, and universal access that typically characterize a
WS, with the expressivity of semantic mark-up, and reasoning in order to make
feasible the invocation, composition, mediation, and automatic execution of complex
services with multiple paths of execution, and levels of process nesting. As a result,
computers can automatically interoperate and combine information, creating the
most comprehensive and relevant response possible, which is seamlessly
delivered to end-users in real time.
The Emergency Management System (EMS) envisaged within the DIP use case
will provide a decision support system which will assist the emergency planning
officer to automatically gather and analyze relevant information in a particular
emergency scenario, through the adoption of SWS technology. This should improve
the quality of the information available to emergency managers in all the phases of
emergency management: before (planning), during (response), and after (evaluation
and analysis); thereby facilitating their work and improving the quality of their
decisions in critical situations.
Our work contributes to raising awareness of potential SWS benefits in
real-world applications (easing the creation of an infrastructure in which new
services can be continually added, discovered and composed, and in which
organizational processes are automatically updated to reflect new forms of
cooperation) and to promoting the availability of working SWS platforms.
2. Integrated Emergency Management (IEM) Requirements
In the definition of the use case scenario, an attempt has been made to bring together
the needs of all the groups that would be involved in case of an emergency occurring
in Essex - a large region in South East England (UK). We have conducted interviews
with emergency planning personnel in Essex County Council (ECC) and several other
agencies which are involved in various types of emergency scenario (e.g.
Meteorological Office; police, fire, ambulance emergency services; traffic control
service; British Airport Authority; and other County Councils surrounding Essex). As
a result of this work, the following main requirements were delineated:
R1. In an emergency event all the authorities involved have to cooperate and provide
relevant data to each other upon request. This data comes from many sources in
many different formats. As required in the Civil Contingencies Act 2004 [1]:
“local responder bodies have to co-operate in preparing for and responding to
emergencies through a Local Resilience Forum (LRF)”. ECC is aware of the
importance of multi-agency working and consequently has belonged for many
years to several emergency groups and networks. All of these groups collaborate
now under the Essex Resilience Forum. There is also in Essex an “Essex
Emergency Services Coordinating Group (EESCG)” which is formed by
representatives from Essex Police, Essex Fire and Rescue Service, British
Transport Police, Essex Ambulance Service, Maritime Coastguard Agency,
Military and Local Authorities.
R2. Interoperation and collaboration among many agencies in an emergency
situation follow predefined procedures which set out each agency’s duties. As stated
in the COPE (Combined Operational Procedures for Essex) document [2]: “The
purpose of the group is to develop, maintain and improve effective co-ordination
between the Emergency Services and the principal emergency Support
Organizations and to identify the means to ensure effective co-ordination and
regular liaison between those services in the planned response to emergencies.”
R3. Geographical Information Systems (GIS) applied to an IEM scenario can ease
the integration, storage, querying, analysis, modeling, reporting and mapping of
geographically-referenced data relevant to the emergency situation. As stated
by the UK Emergency Planning College in their “Guide to GIS Applications in
Integrated Emergency Management (IEM)” [4]: “Geography matters to IEM:
hazards are spatially distributed, and generally very uneven in that distribution,
vulnerable facilities are distributed and clustered in space, and resources may be
sub-optimally located to deal with anticipated and actual emergencies”.
R4. Cross-border relationships are highly important in an emergency situation,
especially in the context of the Stansted area. The Airport is considered to be in
its own ‘territory’ governed by British Airports Authority (BAA) and does not
form part of a local government District. In the event of an emergency situation
around Stansted, ECC needs to work closely with the other affected adjacent
local government authorities, namely: Hertfordshire County Council and
Uttlesford District Council and with BAA itself.
3. The Emergency Management System
We are developing an Emergency Management System (EMS), which is an end-user
Web application providing e-Emergency services to customers. The system is
intended to be used during the planning and response phases of an emergency.
Provided services can cover all kinds of information concerned with emergencies including information about hazardous weather, personnel involved in an emergency
situation, rescue corps involved in the prevention response and recovery phases of an
emergency situation, evacuation procedures, provision of supplies and help to
affected people, location of damaged facilities and the consequences, assistance
needed by vulnerable people, location of ‘hotspots’ etc.
Figure 1 – Context Diagram
As depicted in Figure 1, there are three main actors in the general use case,
each participating with a different role. These are:
• Customer (EPO): The end user that requests the services provided by the EMS.
They select and invoke services through a user-friendly emergency planning
interface. We envisage this application will be used by the Emergency Planning
Officers (EPO) in public organizations, and other emergency partners (Police,
Fire & Rescue, Ambulance service, NHS, Rover Rescue, etc.). As a result we
obtain a cross-border application (IEM requirement R4).
• Emergency Planning and Geographical Information Service providers:
Governmental authorities, Ordnance Survey, Meteorological Office, emergency
agencies, commercial companies, etc, which provide specific emergency
planning services and spatially-related services through the Internet in the form
of WS. They provide services to end users to improve collaboration in an
emergency-based scenario (IEM requirements R1, R3).
• EMS: The intermediary between the customer and the providers. This
management system holds all the functionalities for handling SWS - supporting
automatic discovery, composition, mediation and execution. It exposes services
to end-users, using existing emergency services and aggregating them into new
high-level services in order to improve collaboration in an emergency-based
scenario (IEM requirement R2). The EMS is considered to have a non-profit
governmental basis and to serve public interests in case of an emergency. It
interacts with customers (emergency planners and heads of rescue corps) via a
user-friendly interface, allowing users to access and combine the different
services provided by the service providers.
3.1 Use case
Several emergency-related scenarios were considered in order to pilot the prototype
definition. With the collaboration of the ECC emergency planners, we finally decided
to focus on a real past situation: “Heavy snowstorm around the Stansted area and
M11 corridor (Essex, UK) on 31st January 2003”, in which thousands of motorists
were trapped overnight on some of Britain’s busiest motorways [3]. By focusing on a
past event we ensure the availability of real data. An additional advantage is the
ability to compare the actions taken and the data available at that time with
the data and actions that would have been taken if a SWS-based emergency
planning tool had been available.
3.2 Business process and data
The current version of the prototype focuses on the planning phase. Figure 2
depicts the main goals to achieve (business processes) in a hazardous
snowstorm situation before planning an adequate emergency response. The first
step is to identify the affected area by analysing snow data. Then, the EPO
has to locate suitable shelters where affected people can rest and, not
necessarily in this order, identify available relevant people (rescue corps)
in the affected area. These goals are not merely retrieval operations, but
involve sub-processes that select services and manipulate retrieved data
according to situation-specific requirements. Semantics will be adopted to
represent these decompositions. A detailed example is provided in Section 4.5.
Figure 2 – Emergency procedure in a snowstorm hazardous situation.
The prototype will aggregate data and functionalities from the following three
heterogeneous sources:
• Meteorological Office: a national UK organization which provides environmental
resources and in particular weather forecast data. The prototype aggregates snow
data related to the date of the snowstorm in question.
• ECC geospatial and emergency data: The prototype makes use of a wide range
of geospatial data, such as administrative boundaries, buildings, Ordnance Survey
maps, etc, as well as other data from the emergency department. Building related
data is used to support searches for suitable rest centres.
• BuddySpace is an Instant Messaging client facilitating lightweight
communication, collaboration, and presence management [5], built on top of the
instant messaging protocol Jabber (http://www.jabber.org/). The BuddySpace
client can be accessed on standard PCs, as well as on PDAs and on mobile
phones (which in an emergency situation may be the only hardware devices
available).
As many of the integrated real systems have security and access restriction
policies, British Telecommunications (BT) has created a single corporate
spatial data warehouse in which all Meteorological Office and ECC data sources
have been replicated in order to work with them in a safe environment, thereby
providing suitable Web Services (WS) to work with. Nevertheless, the prototype
represents how this system would work in a distributed environment with
heterogeneous data sources scattered over the Internet.
WS will provide a first level of interoperability by encapsulating functionality
regardless of the specific technologies/protocols of the providers’ legacy systems.
Semantic descriptions will provide the final level of interoperability, allowing
automation of all the stages of the WS use (mainly: discovery, composition and
invocation). In Section 4, we will detail these aspects.
4. The Prototype
The main functional requirements of our SWS-enabled EMS are:
• FR1: providing a graphic user interface (GUI) for customer interaction and
for displaying outputs, e.g. a browser/visualization tool to display and
select data layers on a map;
• FR2: discovering, combining and invoking suitable Web services for a user
request;
• FR3: providing a WS execution environment with control functions, error
handling, and support for optional user interaction;
• FR4: dealing effectively with heterogeneous resources, thus allowing for
appropriate mediation facilities (ontology-to-ontology mediation has been
identified in the earlier stages of the prototype; other kinds of mediation
may be identified later);
• FR5: providing interfaces for cooperation with GIS and emergency service
providers.
In order to provide semantics and step toward the creation of added-value
services (FR2, FR3, FR4, FR5), we adopt WSMO [6], a promising SWS framework,
and IRS-III [7], a tested implementation of this standard. The reference
language for creating ontologies is OCML [8].
4.1 Semantic Web Services framework: WSMO and IRS-III
The Web Service Modeling Ontology (WSMO) [6] is a formal ontology for
describing the various aspects of services in order to enable the automation of WS
discovery, composition, mediation and invocation. The meta-model of WSMO
defines four top level elements:
• Ontologies: provide the foundation for describing domains semantically. They
are used by the three other WSMO components.
• Goals: define the tasks that a service requester expects a Web service to fulfil. In
this sense they express the requester’s intent.
• Web Service descriptions represent the functional behavior of an existing
deployed Web service. The description also outlines how Web services
communicate (choreography) and how they are composed (orchestration).
• Mediators: handle data and process interoperability issues that arise when
handling heterogeneous systems.
One of the main characterizing features of WSMO is that ontologies, goals and Web
services are linked by mediators:
• OO-mediators enable components to import heterogeneous ontologies;
• WW-mediators link Web Services to Web Services;
• WG-mediators connect Web Services with Goals;
• GG-mediators link different Goals.
The incorporation of four classes of mediators in WSMO facilitates the clean
separation of different mapping mechanisms. For example, an OO-mediator may
specify an ontology mapping between two ontologies whereas a GG-mediator may
specify a process or data transformation between two goals.
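To make the four mediator classes concrete, the following minimal Python sketch (all class and instance names are illustrative, not part of the WSMO specification or IRS-III) models ontologies, goals and Web services as simple records and links them through typed mediators:

```python
from dataclasses import dataclass, field

# Minimal, illustrative records for the WSMO top-level elements.
@dataclass
class Ontology:
    name: str

@dataclass
class Goal:
    name: str
    inputs: list = field(default_factory=list)

@dataclass
class WebService:
    name: str

# A mediator links two WSMO elements and carries a mapping/transformation.
@dataclass
class Mediator:
    kind: str      # "OO", "WW", "WG" or "GG"
    source: object
    target: object
    mapping: str   # human-readable description of the transformation

meteo = Ontology("MeteorologyOntology")       # hypothetical names
spatial = Ontology("SpatialOntology")
get_snow = Goal("GetSnowfallGoal", inputs=["area"])
met_ws = WebService("MetOfficeSnowService")

mediators = [
    # OO-mediator: an ontology-to-ontology mapping
    Mediator("OO", meteo, spatial, "map Met Office grid refs to polygons"),
    # WG-mediator: connects a Web service to the goal it can fulfil
    Mediator("WG", met_ws, get_snow, "lift SOAP output to ontology instances"),
]

wg = [m for m in mediators if m.kind == "WG"]
assert wg[0].target is get_snow
```

The point of the separation is visible in the `mapping` slot: the OO-mediator carries an ontology mapping, while the WG-mediator carries a service-to-goal transformation, each kept in its own place.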
IRS-III, the Internet Reasoning Service [7], is a platform which allows the
description, publication and execution of Semantic Web Services, according to the
WSMO conceptual model.
Based on a distributed architecture communicating via XML/SOAP messages, it
provides an execution environment for SWS; ontologies are stored by the server, and
used in WSMO descriptions to support discovery, composition, invocation and
orchestration of WS. It allows one-click publishing of “standard” program code to WS
by automatically generating an appropriate wrapper. Standard WS or REST services
can also be trivially integrated and described by using the platform.
Also, by extending the WSMO goal and Web service concepts, clients of IRS-III
can invoke Web services via goals. That is, IRS-III supports so-called
capability- or goal-driven service invocation, which allows the user to supply
only generic inputs, hiding the possible complexity of a chain of
heterogeneous WS invocations.
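As a rough illustration of goal-driven invocation (the endpoint path and parameter names below are invented for the example and do not reflect the actual IRS-III API), a client addresses a goal by name and supplies only the generic inputs; the broker, not the client, selects and chains the concrete Web services:

```python
from urllib.parse import urlencode

def goal_invocation_url(base, goal, inputs):
    """Build an HTTP GET request that asks the SWS broker to achieve a
    named goal with the given generic inputs. The client never names a
    concrete Web service, only the goal it wants achieved."""
    return f"{base}/achieve-goal?" + urlencode({"goal": goal, **inputs})

# Hypothetical broker host, goal name and inputs for the snowstorm scenario.
url = goal_invocation_url(
    "http://irs.example.org",
    "GetSnowfallGoal",
    {"area": "Stansted", "date": "2003-01-31"},
)
print(url)
```

The same goal request could equally be sent through a programmatic API; the point is that `area` and `date` are the only things the client needs to know.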
4.2 Architecture
The general architecture of our semantically-enhanced prototype is depicted in Figure
3. As can be seen, it is a service oriented architecture (SOA) composed of the
following four layers:
• Legacy System layer: consists of the existing data sources and IT systems
available from each of the parties involved in the integrated application.
• Service Abstraction layer: exposes (micro-) functionality of the legacy systems as
WS, abstracting from the hardware and software platforms. The adoption of
existing Enterprise Application Integration (EAI) software facilitated the creation
of required WS.
• Semantic Web Service layer: given a goal request, this layer, implemented in
IRS-III will (i) discover a candidate set of Web services, (ii) select the most
appropriate, (iii) mediate any mismatches at the data, ontological or business
process level, and (iv) invoke the selected Web services whilst adhering to any
data, control flow and Web service invocation requirements. To achieve this,
IRS-III utilises the set of SWS descriptions, which are composed of goals,
mediators, and Web services, supported by relevant ontologies.
• Presentation layer: is a Web application accessible through a standard Web
browser. The goals defined within the SWS layer are reflected in the structure of
the interface and can be invoked either through the IRS-III API or as an HTTP
GET request. The goal requests are filled with data provided by the user and sent
to the Semantic Web Service layer. We should emphasise that the presentation
layer may be comprised of a set of Web applications to support distinct user
communities. In this case, each community would be represented by a set of
goals supported by community related ontologies.
Figure 3. The EMS general architecture.
4.3 Services
We distinguish between two classes of services: data and smart. The former
refer to the three data sources introduced in Section 3 and are exposed by
means of WS:
• Meteorological service: this service provides weather information (e.g. snowfall)
over a specific rectangular spatial area.
• ECC Emergency Planning services: using the ViewEssex data each service in
this set returns detailed information on a specific type of rest centre within a
given circular area. For example, the ‘getHospitals’ Web service returns a list of
relevant hospitals.
• BuddySpace services: these services allow presence information on online users
to be accessed.
Smart services represent specific emergency planning reasoning and operations on the
data provided by the data services. They are implemented in a mixture of Common
Lisp and OCML and make use of the developed ontologies. In particular, we created a
number of filter services that manipulate meteorological and GIS data
according to semantically described, emergency-specific requirements, e.g.
rest centres with
a heating system, hotels with at least 40 beds, the most easily accessible
hospital, etc. The criteria used were gained from our discussions with the EPOs.
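A filter service of this kind can be pictured as a simple predicate over the lifted data. The sketch below is purely illustrative (the rest-centre fields, names and thresholds are invented for the example, not taken from the ViewEssex data):

```python
# Illustrative records standing in for lifted ECC rest-centre data.
rest_centres = [
    {"name": "Stansted Community Hall", "heating": True,  "beds": 55},
    {"name": "M11 Services Annex",      "heating": False, "beds": 80},
    {"name": "Chelmsford Hotel",        "heating": True,  "beds": 30},
]

def filter_rest_centres(centres, require_heating=True, min_beds=40):
    """Smart-service-style filter: keep only the centres that satisfy
    the emergency-specific criteria (heated, with enough beds)."""
    return [
        c for c in centres
        if (not require_heating or c["heating"]) and c["beds"] >= min_beds
    ]

suitable = filter_rest_centres(rest_centres)
print([c["name"] for c in suitable])  # only the heated, >=40-bed centre
```

In the actual prototype such logic lives in Common Lisp/OCML and operates on ontology instances rather than dictionaries, but the shape of the computation is the same.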
4.4 Semantic Web Services: Ontologies
In this and the next section, we focus on the semantic descriptions defined in
the Semantic Web Services layer. The following ontologies, reflecting the
client and provider domains, were developed to support the WSMO descriptions:
• Meteorology, Emergency Planning and Jabber Domain Ontologies: represent
the concepts used to describe the services attached to the data sources, such
as snow and rain for the Met Office, hospitals and supermarkets for ECC
Emergency Planning, and sessions and presences for Jabber. If a new source,
and the Web services exposing its data and functionalities, are integrated, a
new domain ontology has to be introduced, also reusing existing ontologies.
The services, comprising the data types involved as well as their interfaces,
have to be described in such an ontology, usually at a level low enough to
remain close to the data.
To bring the information provided by the WS up to the semantic level, we introduce
lifting operations that allow the passage of data type instances from a syntactic level
(XML) to an ontological one (OCML), as specified in the domain ontology
definitions. These Lisp functions automatically extract data from SOAP messages
and create the counterpart class instances. The mapping information between data
types and ontological classes is defined at design time by the developers.
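As an illustration of the idea only (the real lifting functions are Lisp functions driven by OCML definitions; the class, the XML tags, and the mapping below are invented for the example), a minimal lifting step could look like this:

```python
# Illustrative sketch: a design-time mapping between XML elements and an
# ontological class drives the automatic creation of class instances.
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class Hospital:  # hypothetical ontological class
    name: str
    beds: int
    has_heating: bool

# Design-time mapping: XML tag -> (field name, converter)
MAPPING = {
    "hospitalName": ("name", str),
    "bedCount": ("beds", int),
    "heating": ("has_heating", lambda v: v == "true"),
}

def lift(xml_fragment: str) -> Hospital:
    """Lift a syntactic (XML) record to an ontological instance."""
    fields = {}
    for elem in ET.fromstring(xml_fragment):
        if elem.tag in MAPPING:
            field, convert = MAPPING[elem.tag]
            fields[field] = convert(elem.text)
    return Hospital(**fields)

msg = ("<record><hospitalName>St Mary</hospitalName>"
       "<bedCount>120</bedCount><heating>true</heating></record>")
print(lift(msg))
```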
• HCI Ontology: part of the user layer, this ontology is composed of HCI and
user-oriented concepts. It supports lowering results from the semantic level to the
particular interface in use (e.g. stating that the Google Maps API is used,
defining “pretty names” for ontology elements, etc.). Note that although the
choice of the resulting syntactic format depends on the chosen lowering process,
concepts from the HCI ontology are used to achieve this transformation in a
suitable way.
• Archetypes Ontology: part of the user layer, this is a minimal-ontological-commitment
ontology aiming to provide a cognitively meaningful insight into the
nature of a specialized object; for example, by conveying the cognitive (“naïve”)
feeling that a hospital, as a “container” of people and a provider of “shelter”, can
be assimilated to the more universal concept of “house”, which we consider an
archetypal concept, i.e. one based on image schemata and therefore supposed to
convey meaning immediately. It is moreover assumed that any client, whilst
perhaps lacking the representation of a specific basic-level concept, knows its
archetypal representation.
• Spatial Ontology: a part of the mediation layer, it describes GIS concepts of
location, such as coordinates, points, polygonal areas, and fields. It also allows
describing spatial objects as entities with a set of attributes, and a location.
The purpose of the HCI, Archetypes and Spatial ontologies is the aggregation of
different data sources on, respectively, a representation, a cognitive and a spatial
level. We can therefore group them under the name of aggregation ontologies.
They allow the different data sources to be handled and presented in a uniform way.
Inversely to the lifting operations, lowering operations transform instances of
aggregation ontologies into syntactic documents to be used by the server and client
applications. This step is usually fully automated, since aggregation ontologies are,
by definition, quite stable and unique.
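Conversely, a minimal lowering sketch (again illustrative Python rather than the Lisp/OCML implementation; the class and the JSON target format are assumptions) could serialize an aggregation-ontology instance for a client application:

```python
# Illustrative sketch: lower an ontological instance to a syntactic document
# for the client, here JSON for a hypothetical map interface.
import json
from dataclasses import dataclass, asdict

@dataclass
class SpatialObject:  # hypothetical aggregation-ontology class
    pretty_name: str  # a "pretty name" as defined in the HCI ontology
    latitude: float
    longitude: float

def lower(obj: SpatialObject) -> str:
    """Lower an instance to the syntactic format expected by the UI."""
    return json.dumps(asdict(obj))

doc = lower(SpatialObject("St Mary Hospital", 51.73, 0.47))
print(doc)
```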
• Context Ontology: the context ontology allows describing context tuples which
represent a particular situation. In the emergency planning application, context
tuples have up to four components: the use case, the user role, the location, and
the type of object. Contexts are linked with goals, i.e. if this type of user accesses
this type of object around this particular location, these particular goals will be
presented. Contexts also help to inform goals, e.g. if a goal provides information
about petrol stations in an area, the location part of the context is used to define
this area, and input from the user is therefore not needed. Each time an object is
displayed to a user at a particular location, a function of the context ontology
provides the goals which need to be displayed and which inputs are implicit.
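The context-to-goal linking can be sketched as a simple lookup (illustrative Python, not the OCML implementation; the goal names echo those offered to the EPO in the usage example of Section 4.6, while the key values are our assumptions; the location component is used at invocation time to fill implicit inputs rather than to select goals here):

```python
# Illustrative sketch: context tuples (use case, user role, object type)
# mapped to the goals to present to the user.
CONTEXT_GOALS = {
    ("snow_emergency", "EPO", "hazard_region"): [
        "show-available-shelters",
        "login-to-buddyspace",
        "get-staff-presence",
    ],
}

def goals_for(use_case: str, role: str, obj_type: str) -> list:
    """Return the goals to display for this context (empty if none linked)."""
    return CONTEXT_GOALS.get((use_case, role, obj_type), [])

print(goals_for("snow_emergency", "EPO", "hazard_region"))
```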
4.5 Semantic Web Services: WSMO descriptions
As depicted in Figure 3, the goals, mediators, and Web Services descriptions of our
application currently link the UK Meteorological Office, ECC Emergency Planning,
and BuddySpace Web services to the user interface. Correspondingly, the Web
service descriptions use the SGIS spatial, meteorology, ECC Emergency Planning
and Jabber domain ontologies, whilst the goal encodings rely on the GUI and
archetypes ontologies. Mismatches are resolved by the defined mediators. As
introduced in the previous section, the inputs of the WS (XML in our particular
scenario, but any other format could be handled) are lifted to the ontology and, after
invoking a goal, the results are lowered back into XML so that they can be displayed
to the user. For illustration purposes, a small portion of the SWS descriptions is
shown in Figure 4. The example details the main goal “Locate suitable shelters for
evacuated people” introduced in Section 3.2.
Figure 4. A portion of WSMO descriptions for the EMS prototype.
Get-Polygon-GIS-data-with-Filter-Goal represents a request for available shelters
within a delimited area. The user specifies the requirements as a target area, a
sequence of at least three points (a polygon), and a shelter type (e.g. hospitals, inns,
hotels). As mentioned above, the ECC Emergency Planning Web services each
return potential shelters of a specific type within a circular query area. The obtained
results need to be filtered in order to return only shelters matching emergency-specific
requirements (for example, a snowstorm). The process automated in our
application is usually performed manually by an EPO.
From a SWS point of view the problems to be solved by this particular portion of
the SWS layer included: (i) discovering the appropriate ECC Emergency Planning
Web service; (ii) mediating the difference in area representations (polygon vs.
circular) between the goal and Web services; (iii) composing the retrieve and filter
data operations. Below we outline how the WSMO representations in Figure 4
address these problems.
• Web service discovery (FR2): each SWS description of an ECC Emergency
Planning service defines, in its capability, the specific class of shelter that the
service provides. Each definition is linked to the Get-Circle-GIS-Data-Goal by
means of a unique WG-mediator (shown as wgM). The inputs of the goal specify
the class of shelter and the circular query area. At invocation, IRS-III discovers
all associated Web services through the WG-mediator and selects one on the
basis of the specific class of shelter described in the Web service capability.
• Area mediation and orchestration (FR2, FR4, FR5): the Get-Polygon-GIS-data-with-Filter-Goal
is associated with a unique Web service that orchestrates by
simply invoking three sub-goals in sequence. The first gets the list of polygon
points from the input; the second is the Get-Circle-GIS-Data-Goal described
above; finally, the third invokes the smart service that filters the list of GIS data.
The first two sub-goals are linked by means of three GG-mediators (depicted as
ggM) that return the centre, as a latitude and a longitude, and the radius of the
smallest circle which circumscribes the given polygon. To accomplish this, we
created three mediation services invoked through Polygon-to-Circle-Lat-Goal,
Polygon-to-Circle-Lon-Goal, and Polygon-to-Circle-Rad-Goal (the related
WG-mediator and Web service ovals were omitted to avoid cluttering the diagram).
The results of the mediation services and the class of shelter required are provided
as inputs to the second sub-goal. A unique GG-mediator connects the output of
the second to the input of the third sub-goal. In this instance no mediation service
is necessary.
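The three problems can be illustrated with a small sketch (Python, purely illustrative; the actual system uses OCML/Lisp descriptions in IRS-III, and all names below except getHospitals are assumptions). Note that the circle computed here is an approximation centred on the bounding box, which always encloses the polygon but is not guaranteed to be the smallest circumscribing circle that the mediation services return:

```python
import math

# (i) Discovery: each capability advertises the class of shelter provided;
# the broker selects by matching the goal input.
SERVICES = {"hospital": "getHospitals", "hotel": "getHotels",
            "rest-centre": "getRestCentres"}

def discover(shelter_class):
    return SERVICES[shelter_class]

# (ii) Mediation: reduce a polygon to a circular query area. Approximation:
# centre of the bounding box, radius = distance to the farthest vertex.
def polygon_to_circle(points):
    """points: [(lat, lon), ...] -> (centre_lat, centre_lon, radius)"""
    lats, lons = [p[0] for p in points], [p[1] for p in points]
    clat = (min(lats) + max(lats)) / 2
    clon = (min(lons) + max(lons)) / 2
    radius = max(math.hypot(p[0] - clat, p[1] - clon) for p in points)
    return clat, clon, radius

# (iii) Composition: retrieve by circle, then filter (both stubbed here).
def locate_shelters(polygon, shelter_class, retrieve, shelter_filter):
    clat, clon, r = polygon_to_circle(polygon)
    service = discover(shelter_class)
    return shelter_filter(retrieve(service, clat, clon, r))

result = locate_shelters(
    [(0, 0), (0, 2), (2, 0)], "hospital",
    retrieve=lambda svc, lat, lon, r: [f"{svc} result near ({lat},{lon})"],
    shelter_filter=lambda xs: xs,
)
print(result)
```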
It is important to note that if new WS become available (for instance, providing data
from further GIS), new Web service descriptions can simply be introduced and linked
to the Get-Circle-GIS-Goal by the proper mediators (even reusing the existing ones if
no semantic mismatches exist), without affecting the existing structure. In the same
way, new GIS filter services (e.g. more efficient ones) may be introduced. The
effective workflow, i.e. which services are invoked, is known only at run-time.
4.6 User Interface: usage example
The user interface has been developed using Web standards: XHTML and CSS are
used for presentation, JavaScript (i.e. ECMAScript) is used to handle user interaction
and AJAX provides IRS-III goal invocation (FR1, FR3). One of the main components
of the interface is a map, which uses the Google Maps API to display polygons and
objects (custom images) at specific coordinates and zoom levels. These objects are
displayed in a pop-up window or in a hovering transparent region over the maps.
When the application is launched, a goal is invoked for the Essex region, and snow
hazard or storm polygons are drawn according to data from the meteorological office.
The values above which snowfall constitutes a hazard or a storm are heuristic; as
emergency knowledge is gathered, they can easily be improved by modifying the
smart services which are composed with the weather information, while the goal
visible to the user remains the same. As an example of practical usage, we describe
how an EPO describes an emergency situation before trying to contact the relevant
agents. The procedure is as follows:
1. The EPO clicks within the displayed hazard region to bring up a menu of
available goals. In this case (Figure 5a) three goals are available: show available
shelters, login to BuddySpace and get the presence information for related staff.
2. The EPO asks for the available Rest Centres inside the region, and then inspects
the detailed attributes of the Rest Centre returned (Figure 5b).
3. The EPO requests to see the presence status of all staff within the region and
then initiates an online discussion with the closest online agency worker (Figure 5c).
Figure 5 - Three views of the application in use: 5a) Goals available for the snow
hazard, 5b) obtaining detailed information for a specific rest centre, 5c) initiating a
discussion with an online emergency worker.
5. Related Work and Lessons Learned
Spatially related data is traditionally managed with the help of GIS, which, by linking
spatial algorithms and representation means to spatially extended databases, support
decision making by facilitating the integration, storage, querying, analysis,
modeling, reporting, and mapping of this data in order to analyze possible models.
However, each agency tends to collect only the data relevant to itself and organizes it
in the way that suits it best, managing it according to particular business processes
and sharing only what is not judged confidential. In an emergency situation, such
access and semantic barriers are unacceptable, and the wish for more complete
interoperability through the network is often expressed 2.
Maps available on the web, for identifying an address or getting transportation
information, are popular but allow only simple queries. Recently, however, a new
type of mapping system has emerged: highly responsive mapping frameworks
providing APIs (Google4, Yahoo5, Mapquest6, etc.). They are also usually enhanced
with “reality effects”, e.g. seamless transitions between map, satellite and hybrid
views, 2.5-3D visualisations, street-level photography, etc., which make them even
more appealing. The APIs allow developers to populate online maps with custom
information, such as the locations of “events” or “things”, by collecting data from
standard documents such as RDF files, or simply by ad hoc “web scraping” of
HTML resources. These embryonic but very agile Web GIS, called mashups, can
merge more than one data source and add functionality such as filtering and search
features. However, although extremely popular and relatively easy to build and
enhance, Web GIS do not avoid the traditional issues attached to non-semantic
applications; indeed (i) handling data heterogeneity still requires considerable manual
work, (ii) the lack of semantics limits the precision of queries, and (iii) limited
expressiveness usually drastically limits functionality.
Any information system can benefit from the use of semantics [14]. In GIS-related
applications, the use of semantic layers, although not yet firmly established, is being
investigated in a number of research studies [11][12][13]. Having ontologies that
describe a spatially related data repository and its functionalities is believed to make
cooperation with other systems easier and to better match user needs.
In our approach, we adopted WSMO and IRS-III to provide an infrastructure in
which new services can be added, discovered and composed continually, and which
allows the automatic invocation, composition, mediation, and execution of complex
services. The integration of new data sources turns out to be relatively simple; the
steps involved in adding a new data source can be summarized as follows: (i)
ontological description of the service; (ii) definition of the lifting operations; (iii)
mapping to the aggregation ontologies; (iv) goal description; (v) mediation
description; (vi) lowering definition; and (vii) context linking. Although this
procedure may seem tedious, and can currently only be performed by a knowledge
expert, it presents many advantages compared to standards-based approaches such as
the one demonstrated in the OWS-3 Initiative 3:
• Framework openness: standards are helpful but not necessary. For example,
when querying sensor data, the use of standards, e.g. SensorML 4, helps the
reuse of service ontologies and lifting procedures, since they can be applied to
any service using a similar schema. However, any other schema can be integrated
with the same results.
2 http://www.technewsworld.com/story/33927.html
3 http://www.opengeospatial.org/initiatives/?iid=162
4 http://vast.nsstc.uah.edu/SensorML/
• High-level services support: since services are described as SWS, they inherit all
the benefits of the underlying SWS execution platform and are updated as more
features are added to the platform (e.g. trust-based invocation). In other solutions,
support for composition and discovery is embedded in the syntactic standards
themselves, which implies specific parsing features and adding ad hoc reasoning
capabilities to standard software applications, which is time consuming and error
prone. Moreover, SWS introduce a minimalist approach to the description of a
domain, by modeling only the concepts used by the Web services and allowing
on-the-fly creation of instances when Web services are invoked (lifting).
• Support of the emergency handling process: the conceptual distinction between
goals and Web services introduced by WSMO allows developers to easily design
business processes that are known a priori (e.g. an emergency procedure) in terms
of compositions of goals, and to move the (automatic) identification of the most
suitable service to run-time. Specifically, the constant use of context to link goals
and situations greatly enhances the decision process: actions are oriented
depending on the use case, the object, the user role and the location. With the
help of explanations of the utility of each goal in each context, the Emergency
Officer's task is greatly simplified. A future development of the context ontology
will include feedback from the goal invocation history and allow workflow
definitions, i.e. this goal only appears after these two have been invoked. Note
that all goals are also accessible independently of any context, which allows
non-directed queries to occur if needed.
6. Conclusions and Future Work
In the future, a new era of emergency management can be envisaged, in which EMSs
‘collaborate’ through the Internet to provide relevant information in emergency
situations by means of SWS technology. In this way, the agencies and emergency
corps involved can extend their knowledge about a particular emergency situation,
making use of different functionalities based on data held by other agencies, which
might otherwise be inaccessible to them or slow to obtain.
The proposed EMS is a decision support system based on SWS technology, which
assists the EPO in the tasks of retrieving, processing, displaying, and interacting with
only the emergency-relevant information, more quickly and accurately.
In our approach, we aimed at a development process that would be pragmatic, in
order to lead quickly to a working outcome, as well as flexible, in order to respond
easily to eventual changes and improvements and to meet the multiple actors'
viewpoints. We followed a prototyping approach that produced two main cycles; a
third one is under way.
The first cycle rapidly defined the structure, processes and data sources of the EMS
(Section 3), starting from the requirements of a real-world integrated emergency
management (Section 2). The result was validated by the stakeholders (the emergency
planning department of ECC) before advancing with the application development.
The second cycle implemented the required EMS functional requirements (Section 4)
by adopting semantic technologies. Specifically, WSMO and IRS-III have been used
to implement the SWS infrastructure, which has been linked to the user interface
(based on Google Maps) through an AJAX approach. As a result, we obtained a
working prototype that has been shown to the EPOs and other people dealing with
emergency situations in the ECC area (i.e. potential end-users).
On the basis of their feedback, the third cycle has been planned. Future
improvements involve integrating demographic, highways and transport data from
ECC. Moreover, we are seeking to use real-time data (e.g. real-time RADAR data
instead of weather forecasts). Assuming the availability of this data, the system
could also be used in the response phase of the designed EMS.
References
1. Essex Resilience Forum (2006). (http://www.essexcc.gov.uk/microsites/essex_resilience/)
2. Combined Operational Procedures For Essex (2006). (http://www.essex.police.uk/pages/about/mipm.pdf)
3. BBC news web site (2006). (http://news.bbc.co.uk/2/hi/talking_point/2711291.stm)
4. A Guide to GIS Applications in Integrated Emergency Management. Cabinet Office Emergency Planning College (2006). (http://www.epcollege.gov.uk/training/events/gis-guide_acro6.pdf)
5. Eisenstadt, M., Komzak, J., Dzbor, M.: Instant messaging + maps = powerful collaboration tools for distance learning (2003).
6. WSMO Working Group: D2v1.0: Web Service Modeling Ontology (WSMO). WSMO Working Draft (2004). (http://www.wsmo.org/2004/d2/v1.0/)
7. Cabral, L., Domingue, J., Galizia, S., Gugliotta, A., Norton, B., Tanasescu, V., Pedrinaci, C.: IRS-III: A Broker for Semantic Web Services based Applications. In: Proceedings of the 5th International Semantic Web Conference (ISWC 2006), Athens, USA (2006).
8. Motta, E.: An Overview of the OCML Modelling Language (1998).
9. WSDL: Web Services Description Language (WSDL) 1.1 (2001). (http://www.w3.org/TR/2001/NOTE-wsdl-20010315)
10. UDDI Consortium: UDDI specification (2000). (http://www.uddi.org/)
11. Casati, R., Smith, B., Varzi, A. C.: Ontological tools for geographic representation (1998) 77-85.
12. Peuquet, D., Smith, B., Brogaard, B.: The ontology of fields (1999).
13. Fonseca, F. T., Egenhofer, M. J.: Ontology-Driven Geographic Information Systems. ACM-GIS (1999) 14-19.
14. Semantic Interoperability Community of Practice (SICoP): Introducing Semantic Technologies and the Vision of the Semantic Web (2005).
Building Ontology in Public Administration: A Case
Study
Graciela Brusa1, Ma. Laura Caliusco2, Omar Chiotti3
1 Dirección Provincial de Informática, San Martín 2466, Santa Fe, Santa Fe, Argentina, gracielabrusa@santafe.gov.ar
2 CIDISI, CONICET-UTN-FRSF, Lavaise 610, Santa Fe, Santa Fe, Argentina, mcaliusc@frsf.utn.edu.ar
3 INGAR, CONICET-UTN-FRSF, Avellaneda 3657, Santa Fe, Santa Fe, Argentina, chiotti@ceride.gov.ar
Abstract. The adoption of Semantic Web technologies in some areas, particularly
in the public sector, has not met expectations. This is, among other reasons,
because government processes require a large amount of information whose
semantics is difficult to carry across organizations. Hence, public servants depend
on specific technical areas to acquire knowledge about the information that crosses
the government organizational structure. The same occurs when government
administrations move towards web services and people need access to the
semantics of those services. In this transformation of public services, it is necessary
to incorporate new tools that can be used by the community to whom these services
are addressed. Ontologies are important for sharing information in the internal
activities of government administration and for facilitating information access in
e-government services. This work presents the experiences gathered while building
an ontology for a local public sector: the budgetary and financial system of the
Santa Fe Province (Argentina). Software engineering techniques were used in order
to minimize the impact of the technical knowledge required. Finally, an architecture
is proposed in order to show applications of ontologies in government areas and
their advantages.
Keywords: ontology, public sector, budgetary and financial system.
1 Introduction
During the last years, important progress has been made on achieving information
interoperability between heterogeneous applications in the business sector. Public
administrations face the same information integration problems as business
organizations. In the public sector, however, the direct replication of experiences
from the business sector causes several problems [20], mainly due to the complexity
of the public sector.
The difference between the business sector and the public sector lies not only in the
complexity but also in the bureaucracy and idiosyncrasy. To comprehend the public
sector idiosyncrasy, it is useful to consider the holistic reference model presented in
[26], which, based on a socio-technical approach, considers the public sector
showing different views, the progress of public services and
abstraction layers. From a technological point of view, a main government challenge
is to acquire a set of capabilities that facilitate the interoperability needed for
integration as well as the suitable interpretation of information for decision making.
The interpretation of information without misunderstanding requires making its
meaning explicit. To this aim, an ontology can be used: an ontology provides a
shared vocabulary for a common understanding of a domain.
There are several works on how to develop ontologies methodically. As examples,
Grüninger and Fox [9], METHONTOLOGY [8][22], and the 101 Method [16] can
be mentioned, among others. These methodologies have been successfully used to
define ontologies in different domains [4]. Each of them presents different
intermediate representations.
Concerning software platforms that aid in ontology development, Protégé 3.1 and
WebODE [1] can be mentioned, among others.
In this paper we present how a budgetary ontology was developed, following
different ontology development methodologies and using Protégé 3.1. The paper is
organized as follows. Section 2 describes the budgetary and financial system of the
Santa Fe Province. Section 3 discusses the tasks carried out to build the budgetary
ontology. Section 4 presents the ontology implementation using Protégé 3.1.
Section 5 introduces an architecture to support information integration using the
implemented ontology. Finally, Section 6 presents our conclusions.
2 Budgetary and Financial System: Domain Description
The budget of a government is a plan of the intended revenues and expenditures of
that government. The budget is prepared by different entities in different
governments. In particular, in the Santa Fe Province (Argentina) the following actors
participate:
• Executive Power: it elaborates the Provincial Budget Draft. It comprises a Rector
Organism, which conducts all the activities, and all the Executor Organisms
existing in the government, which formulate their own budgets.
• Legislative Power: it sanctions the annual budget law.
The interaction among these actors leads the budget through different states: In
Formulation, In Parliamentary Proceeding, and Approved. This iterative process is
shown in Fig. 1.
Fig. 1. Iterative process until budget is ready for execution.
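The budget states named above can be sketched as a small state machine (an illustrative Python sketch of ours, not part of any system described here; the assumption that a project may be sent back from parliamentary proceeding to formulation reflects the iterative process of Fig. 1):

```python
# Hypothetical sketch of the budget states and transitions between them;
# state names follow the paper, the transition table is our assumption.
TRANSITIONS = {
    "In Formulation": ["In Parliamentary Proceeding"],
    "In Parliamentary Proceeding": ["Approved", "In Formulation"],
    "Approved": [],
}

def advance(state: str, new_state: str) -> str:
    """Move the budget to a new state, rejecting illegal transitions."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

s = advance("In Formulation", "In Parliamentary Proceeding")
s = advance(s, "Approved")
print(s)  # Approved
```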
In the Executive Power, a Rector Organism is responsible for the whole budgetary
formulation process. This Rector Organism sets the budgetary policies and drives the
jurisdictional interactions to complete and integrate the jurisdictions' own expense
and resource estimates throughout this formulation process. Each jurisdiction, such
as the Health or Production Ministries, has Executor Organisms, which are
responsible for formulating and executing the budget. The formulation process
results in the Project of Budget Law, which is issued to the Legislative Power for
approval.
The local budget life cycle (Fig. 2) is complex because it involves a sequence of
different instances with a lot of interrelated data, and great and specific knowledge
is required to operate with them.
Fig. 2. Local Budget Life Cycle: Formulation, Approval, Execution, Modifications, and Closure of the Fiscal Year.
Along this life cycle, the evaluation and control of actual and financial resources is
performed, and all of them are assigned to the production of goods and services.
Table 1 shows the detailed steps.
Table 1. Budget Life Cycle Steps

1. To Initiate Fiscal Year and Distribute Classifiers
2. To Prepare Preliminary Budget and Resources Estimation
3. To Define Budgetary Policy and Expenses Projection
4. To Determine Expenses Top
5. To Formulate Budget Project Draft
6. To Present Budget Project Draft to Legislature
7. To Approve Budget in Legislature
8. To Elaborate new budget according to Budget Law
9. To Distribute Budget for executing
10. To Elaborate Budgetary Modifications
11. To Program Budget executing
12. To Reconduct Budget
13. To Close Fiscal Year
There is information common to all the budget life cycle stages: the expense and
resource classifiers. They are carried over all the budgetary life cycle states,
providing a thematic classification for the amounts involved. The primary classifiers
used in this work are: Institutional, Expense Object, Geographic Location, Finality
Function, Resource Item, Financing Source, and Programmatic Categories.
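As an illustration of how the classifiers annotate budgetary amounts, the following sketch (Python, ours; all field values are invented) tags a budget line with the seven primary classifiers and aggregates amounts by one of them:

```python
# Hypothetical budget line annotated with the primary classifiers named
# above; the values are invented for illustration only.
budget_line = {
    "amount": 150000,
    "classifiers": {
        "institutional": "Health Ministry",
        "expense_object": "Personnel",
        "geographic_location": "Santa Fe",
        "finality_function": "Public Health",
        "resource_item": None,  # expense line: no resource item
        "financing_source": "Provincial Treasury",
        "programmatic_category": "Primary Care Programme",
    },
}

def total_by(lines, classifier):
    """Aggregate amounts thematically by one classifier, as any life
    cycle stage might do."""
    totals = {}
    for line in lines:
        key = line["classifiers"][classifier]
        totals[key] = totals.get(key, 0) + line["amount"]
    return totals

print(total_by([budget_line], "institutional"))
```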
There are two situations where the availability of semantic information associated
with budgetary data is critical: the budget formulation and approval tasks. In the first
case, only government staff with specific knowledge can be involved in the task,
which concentrates a high responsibility in a few persons and makes knowledge
transfer difficult. In the second case, semantic information is necessary to analyze
the budgetary data and then to sanction the budget law. Here it is more complex,
because all the legislators must vote and the majority do not have the specific
knowledge. For simplicity, the Formulation stage of the expenses budget was
considered for this case study.
3 Building the Budgetary Ontology
The objective of this section is to discuss the steps we have carried out in order to
define an ontology that describes the semantics of the budgetary system domain.
3.1 A Methodology Selection
Before starting to define the ontology, different development methodologies were
studied [5][14][24]. From this study, two main groups could be identified. On the
one hand, there are experience-based methodologies, such as the methodology
proposed by Grüninger and Fox [9], based on the TOVE Project, or the one proposed
by Uschold and King [21][24] from the Enterprise Model. Both were issued in 1995
and both belong to the enterprise modelling domain. On the other hand, there are
methodologies that propose evolving prototype models, such as METHONTOLOGY
[8], which proposes a set of activities to develop ontologies based on their life cycle
and on prototype refinement, and the 101 Method [16], which proposes an iterative
approach to ontology development.
There is no single correct way or methodology for developing ontologies. Usually,
the former are more appropriate when the purposes and requirements of the ontology
are clear, while the latter are more useful when the environment is dynamic and
difficult to understand [5]. Moreover, it is common to merge different
methodologies, since each of them provides design ideas that distinguish it from the
rest. This merging depends on the ontology users and the ontology goals.
In this work, both approaches were merged because, on the one hand, the core
requirements are clear but, on the other, the domain complexity leads to adopting an
iterative approach to manage refinement and extensibility.
In general, ontology development can be divided into two main phases:
specification and conceptualization. The goal of the specification phase is to acquire
knowledge about the domain. The goal of the conceptualization phase is to organize
and structure this knowledge using external representations that are independent of
the implementation languages and environments. In order to define the ontology for
the budget domain, we have followed the 101 Method (OD101) guide for creating a
first ontology [16] and used the analysis steps from METHONTOLOGY in the
conceptualization process. Both consider an incremental construction that allows
refining the original model in successive steps, and they offer different
representations for the conceptualization task.
3.2. Specification: Goal and Scope of the Ontology
The scope limits the ontology, specifying what must be included and what must
not. In OD101 this task is proposed at a later step, but we considered it appropriate to
include it at this point in order to minimize the amount of data and concepts to be
analyzed, especially given the extent and complexity of the budgetary semantics. In
successive verification iterations, it will be adjusted if necessary.
This ontology only considers what is needed to elaborate a project of budget law,
with the concepts related to expenses. It is a first prototype and, therefore, it does not
consider the
concepts related to other stages, such as budget execution, accounting, payments,
purchases or the fiscal year closure. Therefore, it includes general concepts for the
budget life cycle and specific concepts for the formulation.
3.3. Specification: Domain Description
Since this work was made from scratch, a previous domain analysis was necessary.
In this analysis, the application for formulating the provincial budget and its related
documentation were studied and revised. Furthermore, meetings with a group of
experts were carried out. This group was formed by the public officials responsible
for the whole budget formulation process in the Executive Power, expert
professionals of the Budget Committee in the Legislative Power, public agents of the
administrative areas in charge of elaborating their own budgets, and the software
engineers who provide informatics support for these tasks.
3.4. Specification: Motivating Scenarios and Competence Questions
We included this step following the opinion of Gruninger and Fox [9], who
consider that modeling ontologies requires an informal logical knowledge model in
addition to the requirements resulting from different scenarios. The motivating
scenarios show problems that arise when people need information that the system
does not provide, together with a set of solutions to these problems in which the
relevant semantic aspects appear. In order to define the motivating scenarios and
communicate them to the people involved, templates were used, based on those
proposed for specifying use cases in object-oriented methodology [23]. An example
is shown in Table 2; the template includes the main semantic problems and a list of
key terms.
Table 2. Scenario description.

SCENARIO N°: 1
NAME: Local Budget Formulation
DESCRIPTION: Tasks necessary to estimate the expenses for the next year, which
will be integrated with those of the other government jurisdictions to elaborate the
Draft Local Budget. ……
SITE: Executor Organism of a Jurisdiction
ACTORS: Public agents in charge of the jurisdictional budget; Rector Organism
agents; public agents from the areas of the jurisdiction
PREREQUIREMENTS: Budgetary Policy defined; Expenses Classifiers received
from the Rector Organism
ASSOCIATED REQUIREMENTS: Reference documentation; agents trained in
Budget Formulation tasks; advisory agents from the Rector Organism
NORMAL SEQUENCE (STEP: ACTION):
  1: Receive the expenses estimations from the jurisdiction areas.
  2: Support these areas in elaborating their own expenses programs.
  3: Integrate all the expenses programs of the jurisdiction.
  4: Create the Programming Categories and send them to the Rector Organism.
  5: Create the Jurisdictional Budget Project.
  6: Load the budget into the informatics system and send it to the Rector Organism.
  7: Receive the approved jurisdictional budget from the Rector Organism.
POSTCONDITION: Jurisdictional Expenses Budget Project; Jurisdictional
Programmatic Categories
EXCEPTIONS (STEP: ACTION):
  5: Consult the Rector Organism if some aspect of budget formulation is not
understood.
  7: Modify the budget if it is not approved.
PERMANENT TASKS: Interact with the Rector Organism to clarify conceptual
aspects of the domain; support the different areas of the jurisdiction.
MAIN PROBLEMS: A lot of time lost clarifying conceptual doubts; great problems
when an agent in a key position must be replaced; the whole process is highly
dependent on the knowledge of a few persons; ……….
MAIN TERMS: Budgetary Classifier; Expenses Classifier; Institutional,
Programmatic Category, Geographic, Expenses Object, Financing Source, and
Finality Function Classifiers, among others, used in the budget draft task.
The competency questions proceed from the motivating scenarios. They allow
deciding the ontology scope, verifying whether the ontology contains enough
information to answer them, and specifying the level of detail required for the
responses. They also define expressivity requirements, because the ontology must be
able to give answers using its own terms, axioms, and definitions. The scope must
establish all the knowledge that should be in the ontology as well as the knowledge
that must not be: a concept must not be included if there is no competency question
that uses it, and the same rule determines whether an axiom must be included.
Moreover, the competency questions allow defining a hierarchy, so that the answer to
one question may also reply to others with a more general scope by means of
composition and decomposition processes. Table 3 shows some of them.
Table 3. Samples of Competency Questions

Simple Questions:
- Which are the budget states?
- Which are the budgetary classifiers?
- Which are the expenses classifiers?
- Which are the resources classifiers?
- Which are the executor organisms for the Health Ministry?

Complex Questions:
- Which is the institutional code for the Department of Labor?
- Which are the sector and subsector for the Central Administration?
- What is the character code for “Decentralized Organism”?
- Which properties does an Institution have?
- Which is the institutional code for the “Pharmacological Producer Laboratory” SAF?
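The scope rule stated above, that a concept or axiom earns its place in the ontology only through some competency question that uses it, can be sketched as a simple filter. The following stand-alone Python sketch uses invented concept and question lists, not the paper's actual terms:

```python
# Sketch of the scope rule: keep only the concepts mentioned by at least
# one competency question. The concept and question lists are invented
# for illustration; they are not the ontology's real term list.

def prune_by_competency(concepts, questions):
    """Return the subset of concepts that some competency question mentions."""
    text = " ".join(q.lower() for q in questions)
    return {c for c in concepts if c.lower() in text}

candidate_concepts = {"Budget State", "Budgetary Classifier", "Payment Order"}
competency_questions = [
    "Which are the budget states?",
    "Which are the budgetary classifiers?",
]

in_scope = prune_by_competency(candidate_concepts, competency_questions)
# "Payment Order" is excluded: no competency question uses it.
```

The same filter could in principle be run over candidate axioms, since the text applies the rule to axioms as well.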
3.5. Specification: Ontology Granularity and Type
According to its purpose and level of granularity [8], the ontology proposed here is
a domain ontology: it describes the vocabulary related to a specific domain, in this
case study the budgetary domain of the Santa Fe Province. The objective of the
ontology is to facilitate communication between the central administration staff who
must deal with the local budget, bringing adequate terminology to non-expert users.
The term ontology can be used to describe models with different degrees of
structure. In particular, the ontology defined in this paper is a formal structure,
expressed in an artificial, formally defined language.
3.6. Conceptualization: Conceptual Domain Model Determination
In this step, a list of the most important terms was elaborated, following OD101.
To this aim, the middle-out strategy [19] was used: the core of basic terms is
identified first, and these terms are then specialized and generalized as necessary.
Fig. 3. Basic terms of the budget domain: Provincial Administration, Budgetary
Policy, Institution, Budgetary Top, Budgetary Classifier, Estimation, Budget.
Then, with these concepts as reference, the key term list was defined. The list
shown in Table 4 does not include partial or total overlappings of concepts,
synonyms, properties, relations, or attributes.
Table 4. Key Terms

Activity, Budget, Budget Analytic, Budget Approved, Budget Project Draft, Budget
States, Budget Synthetic, Budgetary Classifier, Budgetary Fiscal Year, Budgetary
Policy, Budgetary Top, Executor Organism, Expense, Expense Object, Expenses
Classifier, Finality Function, Financial Administration, Financial Administrative
Service (SAF), Financing Source, Geographic Locate, Institution, Institutional,
Jurisdiction, Program, Program Executor Unit (UEP), Programmatic Category,
Project, Public Funds Administrative Service (SAFOP), Rector Organism, Resource,
Resources Estimation, Subpartial Item, Subprogram
To properly understand the conceptual aspects in context, we elaborated a Unified
Modeling Language (UML) diagram [23] (Fig. 4) with the main relations among the
defined concepts. UML is a useful tool for ontology modeling, even though it was not
designed for this task [3].
[Figure omitted: UML class diagram. Budget has Budget States (Formulation,
Approval, Execution, Closure) and Budgetary Classifiers (Expenses and Resource
Classifiers). The classifier branch comprises Programmatic Category (Program,
Subprogram, Project, Activity/Work, Principal, Partial, and Subpartial Items),
Financing Source, Finality Function, Expense Object, Geographic Locate (Province,
Department, District), and Institutional (Sector, Subsector, Character, Institution,
SAF, SAFOP, UEP).]
Fig. 4. Domain Model in UML.
This information was the basis for building the ontology term glossary, trying to
include further concepts by means of generalization and specialization techniques.
Conflicting assertions over the same entity may be discovered if the concepts are
described as completely as possible [12]; to this aim, definitions were made as
complete as possible, which also contributes to defining the rules and axioms.
The UML model was useful to verify the ontology scope and to take an important
design decision: working with two ontologies. One of them is the Domain Ontology,
which contains the general concepts of the budget life cycle; the other, the
Formulation Ontology, contains the semantics specific to formulating it. The latter is
a task ontology [10], since it defines concepts related to a specific task, budget
formulation. Consequently, we had to modify the list of key terms and the
hierarchical relations, and to group the competency questions according to the
ontology concepts they relate to. As Guarino states [10], there are types of ontologies
according to their level of dependence on a task or viewpoint. Hence, constructing
these ontologies implies two different strategies [6]. On the one hand, the domain
ontology follows an application-independent strategy, because its general concepts
must be available at all times; on the other hand, the task ontology is
application-semidependent, because different use scenarios can be identified and its
conceptualization is associated with real activities.
Working with different ontologies enables term reusability and usability. These are
important goals in ontology construction [13] and differ subtly: while reusability
implies maximizing the use of the ontology among different task types, usability
implies maximizing the number of different applications using the same ontology.
From now on,
the work concentrates on the development of the Domain Ontology. This Domain
Ontology will be usable in all budget states, facilitating term reusability.
3.7. Conceptualization: Identification of Classes, Relations and Attributes
At this step, we considered the OD101 recommendations. In addition, we used the
representations proposed by METHONTOLOGY for organizing knowledge, such as
concept classification trees (Fig. 5) to analyze hierarchies and attributes, binary
relation tables, axiom tables, and instance tables. To determine the classes, we
identified, from the key term list and the glossary, those terms with an independent
existence.
[Figure omitted: concept classification tree. Budgetary Classifier has an exhaustive
decomposition into Resources Classifier and Expenses Classifier; a disjoint
decomposition groups Finality Function, Programmatic Category, Financing Source,
Expense Object, Geographic Locate, Resource Item, and Institutional. Institutional
comprises Sector, Subsector, Character, and Institution, with Rector Organism (SAF)
and Executor Organism (SAFOP, UEP) as is-a specializations of Institution.]
Fig. 5. Taxonomy of Budgetary Ontology Concepts.
Disjoint classes, exhaustive decompositions, and partitions [12] may be identified
in these graphic representations. A Disjoint-Decomposition of a concept C is a set of
subclasses of C that do not have common instances and do not cover C; that is, there
can be instances of C that are not instances of any of the concepts in the
decomposition. As an example (see Fig. 5), Finality Function, Financing Source,
Expense Object, Programmatic Category, Geographic Locate, and Institutional are
disjoint. An Exhaustive-Decomposition of a concept C is a set of subclasses of C that
cover C and may have common instances and subclasses; that is, there cannot be
instances of C that are not instances of at least one of the concepts in the
decomposition. For example (see Fig. 5), the concepts Expenses Classifier and
Resources Classifier make up an exhaustive decomposition of the concept Budgetary
Classifier, because there is no classifier that is not an instance of at least one of those
concepts, and those concepts can have common instances. A Partition of a concept C
is a set of subclasses of C that do not share common instances
and that cover C; that is, there are no instances of C that are not instances of one of
the concepts in the partition. In this scenario there are no partitions.
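These three definitions can be checked mechanically over instance sets. The following stand-alone sketch uses invented instance identifiers; only the shape of the check reflects the definitions above:

```python
def classify_decomposition(parent_instances, subclass_instances):
    """Classify a set of subclasses of a concept C from instance sets.

    parent_instances: set of instances of C.
    subclass_instances: list of instance sets, one per subclass.
    """
    covered = set().union(*subclass_instances)
    covers = parent_instances <= covered          # no instance of C left out
    disjoint = all(
        a.isdisjoint(b)
        for i, a in enumerate(subclass_instances)
        for b in subclass_instances[i + 1:]
    )
    if disjoint and covers:
        return "partition"
    if covers:
        return "exhaustive decomposition"
    if disjoint:
        return "disjoint decomposition"
    return "none"

# Invented toy data mirroring Fig. 5: every budgetary classifier instance is
# an expenses or a resources classifier, and the two may overlap, so the
# pair forms an exhaustive decomposition of Budgetary Classifier.
budgetary = {"c1", "c2", "c3"}
expenses = {"c1", "c2"}
resources = {"c2", "c3"}
kind = classify_decomposition(budgetary, [expenses, resources])
```

Under this toy data `kind` is an exhaustive decomposition; removing the shared instance from one subclass would turn the pair into a partition.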
It is always convenient to begin with the primitive classes, examining which of
them are disjoint and verifying that this condition does not leave instances out.
Once the hierarchies and their features have been identified, a table reflecting the
bidirectional relations can be elaborated, assigning names under a uniform criterion.
An example is shown in Table 5. The shaded rows are self-evident relations between
concepts shown in the Concept Classification Tree (see Fig. 5) that result in
bidirectional relations after analysis.
Table 5. Bidirectional Relations

CONCEPT        RELATION             CARDINALITY  CONCEPT        INVERSE RELATION
Institutional  inst-include-sec     1            Sector         sec-isPartOf-Inst
Institutional  inst-include-sbsec   1            Subsector      sbsec-isPartOf-Inst
Institutional  inst-include-char    1            Character      char-isPartOf-Inst
Sector         sec-isPartOf-Inst    1,n          Institutional  inst-include-sec
Subsector      sbsec-isPartOf-Inst  1,n          Institutional  inst-include-sbsec
Character      char-isPartOf-Inst   1,n          Institutional  inst-include-char
Character      char-has-Inst        1,n          Institution    inst-correspond-char
Institution    ins-has-SAF          1            SAF            SAF-correspond-inst
The direction of a relation depends on the competency questions to be solved and
on possible conflicts with the restrictions of other defined classes. A restriction list
identifies the necessary and sufficient conditions and the merely necessary ones, for
later work on their formalization. We analyzed the axioms individually, but also over
groups of classes, to verify whether closure restrictions are required.
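The inverse-relation bookkeeping of Table 5 lends itself to a mechanical consistency check: every relation should declare an inverse that exists and points back to it with swapped source and target. A small sketch, with rows hand-copied from the table excerpt (the check itself is our illustration, not part of METHONTOLOGY):

```python
# Each row: relation name -> (source concept, target concept, inverse name).
# The rows are transcribed from Table 5; the check is our own sketch.
relations = {
    "inst-include-sec":  ("Institutional", "Sector",        "sec-isPartOf-Inst"),
    "sec-isPartOf-Inst": ("Sector",        "Institutional", "inst-include-sec"),
    "char-has-Inst":     ("Character",     "Institution",   "inst-correspond-char"),
}

def inverse_errors(relations):
    """Report relations whose declared inverse is missing or inconsistent."""
    errors = []
    for name, (src, dst, inv) in relations.items():
        if inv not in relations:
            errors.append(f"{name}: inverse {inv} not defined")
            continue
        inv_src, inv_dst, inv_inv = relations[inv]
        if (inv_src, inv_dst) != (dst, src) or inv_inv != name:
            errors.append(f"{name}: inverse {inv} does not point back")
    return errors

# char-has-Inst declares inst-correspond-char as its inverse, but that
# relation is not in this table excerpt, so it is reported.
```

Running the check over the full Table 5 would confirm that every forward/inverse pair is closed under this rule.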
3.8 Conceptualization: Instances Definition
Once the conceptual model of the ontology has been created, the next step is to
define the relevant instances in an instance table. According to METHONTOLOGY,
for each instance the following should be defined: its name, the name of the concept
it belongs to, and its attribute values, if known, as Table 6 shows.
Table 6. An excerpt of the Instance Table of the Budgetary Ontology.

CONCEPT NAME   INSTANCE NAME      PROPERTY            VALUE
Institutional  Institutional_111  cod-institutional   1.1.1
                                  has-fiscal-year     2004
                                  inst-include-sec    1-No Financial Local Public Sector
                                  inst-include-sbsec  1-Local Administration
                                  inst-include-char   1-Main Administration
Institutional  Institutional_212  cod-institutional   2.1.2
                                  has-fiscal-year     2004
                                  inst-include-sec    2-Financial Local Public Sector
                                  inst-include-sbsec  1-Official Banking System
                                  inst-include-char   2-Official Banks
4 Implementing the Budget Ontology with PROTÉGÉ 3.1
In order to implement the ontology, we chose Protégé 3.1, because it is extensible
and provides a plug-and-play environment that makes it a flexible base for rapid
prototyping and application development. Protégé ontologies can be exported to
different formats, including RDF Schema (RDFS) [2] and the Web Ontology
Language (OWL) [19].
In particular, we implemented the Budgetary Ontology in OWL. The first
challenge in this task was how to transform the UML diagram from the
conceptualization phase into the OWL formalism. This task was hard and
time-consuming. Modeling in OWL implied transforming composition relations into
bidirectional relations. In addition, some concepts modeled as classes in UML
became properties in the ontology, and not all the relations in the UML model were
carried over, but only those necessary to answer the competency questions.
Moreover, since the granularity of the domain ontology is coarse, it was adequate to
select a flat structure for it.
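The composition-to-bidirectional transformation mentioned above can be illustrated with a tiny naming sketch. The naming scheme mirrors the has/isPartOf pattern of Table 5, but the helper and its outputs are illustrative, not the actual mapping the authors used:

```python
# Sketch of the UML-to-OWL mapping step: each UML composition
# ("whole has part") becomes a pair of mutually inverse relations.
# The naming scheme below is an invented illustration.

def composition_to_bidirectional(whole, part):
    """Derive a forward/inverse relation name pair from a composition."""
    forward = f"{whole.lower()}-has-{part.lower()}"
    inverse = f"{part.lower()}-isPartOf-{whole.lower()}"
    return forward, inverse

pairs = [composition_to_bidirectional(w, p)
         for w, p in [("Program", "Subprogram"), ("Province", "Department")]]
```

Each generated pair would then be declared in OWL as two object properties related by owl:inverseOf.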
Then, we verified the ontology using RACER [11]. During the verification
process, we took into account the experiences of the CO-ODE project [15, 17]. We
verified both consistency and classification. While loading classes and attributes, the
verification was incremental and continuous, to avoid future error propagation.
When a class is unsatisfiable, RACER shows it with a red-bordered icon; there are
different categories of causes [25], and errors may propagate. At this point, how the
classes are defined (disjoint, isSubclassOf, partial class, defined class, etc.) and their
restrictions (unionOf, allValuesFrom, etc.) are very important. The classification
process can be invoked either for the whole ontology or for selected subtrees only.
When testing the whole ontology, errors were searched for starting from the
lowest-level classes in the hierarchy, to minimize error propagation.
Fig. 6. An excerpt of Ontology Taxonomy.
To compare the ontology implementation with its conceptualization, graphics were
generated using the OWLViz and OntoViz plug-ins and compared with the UML
diagrams. Fig. 6 shows an excerpt of the General Ontology taxonomy.
On the one hand, OWLViz enables the class hierarchies of an OWL ontology to be
viewed, allowing comparison of the asserted and the inferred class hierarchies. With
OWLViz, primitive and defined classes can be distinguished, computed changes to
the class hierarchy can be clearly seen, and inconsistent concepts are highlighted in
red. The taxonomy shown here illustrates how multiple inheritance is represented,
with the Geographic Location term defined twice; another option is to use axioms
and let the reasoner generate the inferred classes.
On the other hand, OntoViz generates a graphic with all the relations defined over
the ontology instances and attributes. It permits visualizing several disconnected
graphs at once. These graphs are suitable for presentation purposes, as they tend to be
clear, with no overlapping nodes. They are also very useful for deciding when a
concept must be modeled as a class and when as an attribute. An example is shown in
Fig. 7.
Fig. 7. Main Relations Between Concepts of Institutional Classifier.
4.1 Ontology Querying
In order to verify and validate the ontology with regard to the competency
questions, we used the RDF Data Query Language (RDQL) [18]. RDQL is an
implementation of an SQL-like query language for RDF: it treats RDF as data and
provides queries with triple patterns and constraints over a single RDF model.
Another query language is OWL-QL [7], which was designed for query-answering
dialogues among agents using knowledge in OWL; OWL-QL is therefore suitable
when inference is necessary within the query. This is not the case for most of the
competency questions, so RDQL is sufficient. To implement the queries we used the
Jena framework, which provides an API for creating and manipulating RDF models.
The RDQL query that models the competency question “Which are the sector and
subsector for the Main Administration?” is shown below.
SELECT ?x, ?y, ?z, ?nsec, ?nsbsec
WHERE (?x, <adm:rdfsec-hassbsec>, ?y),
      (?y, <adm:rdfsbsec-has-char>, ?z),
      (?z, <rdfn:label>, "1-Main Administration"),
      (?x, <rdfn:label>, ?nsec),
      (?y, <rdfn:label>, ?nsbsec)
USING rdfn FOR <http://www.w3.org/2000/01/rdf-schema#>,
      adm FOR <http://protege.stanford.edu/>
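The join over triple patterns that such an RDQL query performs can be illustrated with a small standard-library matcher. The miniature triple store and the property names below are invented stand-ins for the ontology's real data:

```python
# Minimal illustration of how an RDQL-style query joins triple patterns.
# The triples and property names are invented stand-ins, not the
# Budgetary Ontology's actual data.
triples = {
    ("sec1", "sec-has-sbsec", "sbsec1"),
    ("sbsec1", "sbsec-has-char", "char1"),
    ("char1", "label", "1-Main Administration"),
    ("sec1", "label", "1-No Financial Local Public Sector"),
    ("sbsec1", "label", "1-Local Administration"),
}

def match(patterns, binding=None):
    """Yield variable bindings satisfying all triple patterns (vars start with '?')."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    pat, rest = patterns[0], patterns[1:]
    for triple in triples:
        b = dict(binding)
        ok = True
        for p, t in zip(pat, triple):
            if p.startswith("?"):
                if b.setdefault(p, t) != t:   # variable already bound differently
                    ok = False
                    break
            elif p != t:                      # constant does not match
                ok = False
                break
        if ok:
            yield from match(rest, b)

query = [
    ("?x", "sec-has-sbsec", "?y"),
    ("?y", "sbsec-has-char", "?z"),
    ("?z", "label", "1-Main Administration"),
    ("?x", "label", "?nsec"),
    ("?y", "label", "?nsbsec"),
]
results = list(match(query))
```

For this toy store the query binds exactly one sector/subsector pair, which mirrors how Jena resolves the RDQL query above against the RDF model.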
5 Using the Budget Ontology
The main goal of the ontology is to provide a mechanism for sharing information
between people and applications without misunderstanding, independently of its
implementation. The final step towards this goal is therefore to design and implement
an architecture such as the one shown in Fig. 8. Its components are described next.
Fig. 8. Ontology-based Architecture for Budget Content Integration in Public Sector.
Ontological Budgetary Information System: the ontology design team carries
out the design, implementation, and maintenance tasks using Protégé. The
architecture proposes a general ontology for the concepts common to the whole
budgetary life cycle, and specific ontologies for each budget stage: formulation,
approval, execution, and closure.
Budgetary Instance Maintenance: expert users maintain the instances of the
general and specific ontologies, requesting the necessary adjustments from the
ontology designers through their interaction with the budgetary system and its users.
Search and Query Interface: receives queries and returns their results through a
user-friendly interface. Applications or persons can issue queries through this
interface, which uses RDQL as its query language.
Transactional Systems: both the administrative and the productive government
systems. In this case study, a productive system such as the Hospitals or the School
Infrastructure Administrative System can simply access the budgetary information
relevant to its own interests through the ontological system.
6 Conclusions
In this paper, we have shown how domain experts in the public sector can develop
their own ontologies by merging two different methodologies and software
engineering techniques, taking advantage of both. In particular, this approach has
been used to define a General Ontology for a Budgetary and Financial System, which
could be extended by task ontologies and used by different government applications.
The main conclusions that can be transmitted to the reader are:
• Assign all the necessary time to a good conceptual analysis, since it is considered
the most important task during ontology development.
• Modularize the ontology as far as possible, to give it more flexibility and to
permit extensibility and reuse. This can be done by observing the relations and
attributes of the conceptual aspects involved.
• Take into account that some steps can be applied during the development of any
ontology, whereas others are domain-dependent.
• Carry out a permanent and iterative validation process, taking into account that
partial verifications allow identifying error propagation between sets of classes.
• Define graphics to transmit the domain conceptualization to the domain experts.
Software engineering techniques that may be familiar to them, such as UML, can
be useful.
• Consider during development who the expert user responsible for maintenance
is, and anticipate a user-friendly interface.
References
1. Arpírez J.C., Corcho O., Fernández-López M., Gómez-Pérez A. (2003) WebODE in a
nutshell. AI Magazine, 24(3):37-47.
2. Brickley, D., Guha, R.V. RDF Vocabulary Description Language 1.0: RDF Schema. W3C
Recommendation 10 February 2004. http://www.w3.org/TR/rdf-schema/
3. Caliusco M.L. A Semantic Definition Support of Electronic Business Documents in
e-Collaboration. PhD Thesis, Universidad Tecnológica Nacional, F.R.S.F., 2005.
4. Corcho O, Fernández-López M, Gómez-Pérez A, López-Cima A. Building legal ontologies
with METHONTOLOGY and WebODE. Law and the Semantic Web. Legal Ontologies,
Methodologies, Legal Information Retrieval, and Applications. March 2005.
5. Cristani M. and Cuel R. Methodologies for the Semantic Web: state-of-the-art of ontology
methodology. SIGSEMIS Bulletin. SW Challenges for KM. July 2004 Vol. 1 Issue 2.
6. Fernandez Lopez M. (1999) Overview of methodologies for building ontologies. In:
Benjamins VR, editor. IJCAI-99 Workshop on Ontologies and Problem-Solving
Methods; Stockholm, Sweden: CEUR Publications: 1999.
7. Fikes, R., Hayes, P., Horrocks, I., OWL-QL - A Language for Deductive Query Answering
on the Semantic Web. KL Laboratory, Stanford University, Stanford, CA, 2003.
8. Gómez-Pérez A., Fernández López M. and Corcho O. Ontological Engineering with
examples from the areas of knowledge management, e-commerce and the semantic web.
London: Springer, 2004.
9. Gruninger M. and Fox M.S., Methodology for the Design and Evaluation of Ontologies,
IJCAI Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, Canada, 1995.
10. Guarino N. (1998) Formal Ontology and Information Systems. In Proceedings of FOIS'98,
Trento, Italy. Amsterdam, IOS Press.
11. Haarslev V. and Möller R. 2001. RACER System Description. In Proceedings of the First
international Joint Conference on Automated Reasoning. IJCAR, June 2001.
12. Horridge M., Knublauch H., Rector A., Stevens R., Wroe C., A Practical Guide To
Building OWL Ontologies Using The Protégé-OWL Plugin and CO-ODE Tools Edition
1.0, The University Of Manchester Stanford University, August 27, 2004.
13. Jarrar M., Towards Methodological Principles for Ontology Engineering. PhD Thesis, Vrije
Universiteit Brusell, 2005.
14. Jones D., Bench-Capon T. and Visser P., Methodologies for Ontology Development, in
Proc. IT&KNOWS Conference, XV IFIP World Computer Congress, Budapest, August 1998.
15. Knublauch H., Horridge M., Musen M., Rector A., Stevens R., Drummond N., Lord P.,
Noy N., Seidenberg J., Wang H., The Protégé OWL Experience, Workshop on OWL:
Experiences and Directions, Fourth International Semantic Web Conference (ISWC2005),
Galway, Ireland, 2005.
16. Noy, N., McGuinness D., Ontology Development 101: A Guide to Creating Your First
Ontology, 2001.
17. Rector A., Drummond N., Horridge M., Rogers J., Knublauch H., Stevens R., Wang H.,
Wroe C., OWL Pizzas: Practical Experience of Teaching OWL-DL: Common Errors &
Common Patterns.
18. Seaborne A., RDQL - A Query Language for RDF, W3C Member Submission 9 January
2004, http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/
19. Smith M., Welty C., McGuinness D., OWL Web Ontology Language Guide, W3C
Recommendation 10 February 2004, http://www.w3.org/TR/owl-guide/
20. Traunmüller R., Wimmer M., Feature Requirements for KM in Public Administration,
2002, http://www.lri.jur.uva.nl/~winkels/eGov2002/Traunmuller.pdf
21. Uschold, M., Building Ontologies: Towards a Unified Methodology, 16th Annual
Conference of the British Computer Society Specialists Group on Expert Systems,
Cambridge, UK, 16-18 December 1996.
22. Uschold, M., Gruninger M., Ontologies: Principles, Methods and Applications, Knowledge
Engineering Review, 1996.
23. Unified Modeling Language. http://www.uml.org/
24. Wache H., Vögele T., Visser U., Stuckenschmidt H., Schuster G., Neumann H., Hübner S.,
Ontology-Based Integration of Information –A Survey of Existing Approaches, in Proc.
IJCAI-01 Workshop: Ontologies and Information Sharing, Seattle, WA, pp. 108-117, 2001.
25. Wang H., Horridge M., Rector A., Drummond N., Seidenberg J. (2005) Debugging
OWL-DL Ontologies: A Heuristic Approach. 4th International Semantic Web Conference
(ISWC'05), Galway, Ireland.
Personalized Question Answering: A Use Case
for Business Analysis
VinhTuan Thai1 , Sean O’Riain2 , Brian Davis1 , and David O’Sullivan1
1
Digital Enterprise Research Institute,
National University of Ireland, Galway, Ireland
{VinhTuan.Thai,Brian.Davis,David.OSullivan}@deri.org
2
Semantic Infrastructure Research Group,
Hewlett-Packard, Galway, Ireland
sean.oriain@hp.com
Abstract. In this paper, we introduce the Personalized Question Answering framework, which aims at addressing certain limitations of existing domain specific Question Answering systems. Current development
efforts are ongoing to apply this framework to a use case within the domain of Business Analysis, highlighting the important role of domain
specific semantics. Current research indicates that the inclusion of domain semantics helps to resolve the ambiguity problem and furthermore
improves recall for retrieving relevant passages.
Key words: Question Answering, Business Analysis, Semantic Technology, Human Language Technologies, Information Retrieval, Information
Extraction
1
Question Answering Overview
Question answering (QA) research originated in the 1960s with the appearance of
domain specific QA systems, such as BASEBALL which targeted the American
baseball games domain and LUNAR which in turn focused on the lunar rock
domain [1]. These early systems were concerned with answering questions posed
in natural language, against a structured knowledge base of a specific domain [1].
They are commonly known as natural language front ends to databases. Research
within this field remains active with the introduction of new approaches and
techniques to enhance QA performance in more real-life, complex settings (e.g.
[2–4]).
With the advent of Semantic Web technologies, domain specific knowledge
can also be encoded within formal domain ontologies. This in turn has motivated
the growth of another branch of QA research; focusing on answering natural language questions against formal domain ontologies. Attempto Controlled English
(ACE) [5] and AquaLog [6] are recent research additions to this area. Rather
than translating natural language questions into SQL statements, these systems
translate them into variants of first-order predicate logic, such as Discourse
Representation Structure (DRS) in the context of ACE and Query-Triples for
AquaLog respectively. Consequently this permits answer derivation as an outcome of the unification process of both question and knowledge-based logical
statements.
The application of domain specific QA extends beyond systems whose knowledge
sources are completely encoded in relational databases or formal ontologies. Many
research initiatives have investigated the use of domain ontologies or thesauri to
assist in finding answers to questions against small collections of unstructured,
terminology-rich documents. A variety of approaches have been proposed, ranging
from a pure Vector Space Model based on traditional Information Retrieval research,
extended with a domain specific thesaurus in [7], to a template-based approach for
medical domain systems [8, 9], and a computationally intensive approach in [10],
whose goal is to convert both the knowledge source and the question into Minimal
Logical Form.
Apart from domain specific QA research, the introduction of the QA track
at Text Retrieval Conference TREC-8 in 1999 involved researchers focusing on
combining tools and techniques from research fields such as Natural Language
Processing, Information Retrieval, and Information Extraction in an attempt to
solve the QA problem in an open-domain setting: the main knowledge source being
a predefined large newswire text corpus, with the World Wide Web acting as
an auxiliary source of information. The questions being asked consist mainly
of: factoid questions, list questions, definition questions and most recently, the
relationship task type question [11]. A review of participating systems in TREC
QA track is beyond the scope of this paper. Interested readers are referred to
[12] for further details. Of crucial importance however is that the existing QA
track does not target domain specific knowledge.
QA, in itself, remains an open problem. The research overview has highlighted the diversity of QA systems under development. Each system is designed
to address the problem in a particular usage scenario, which imposes certain
constraints on available resources and feasible techniques. Nevertheless, there
remain usage scenarios for QA systems that require addressing, one of which is
Personalized Question Answering. We discuss our motivation for using Personalized Question Answering in Section 2.
The remainder of this paper is structured as follows: Section 3 describes our
use case of Personalized QA within the Business Analysis domain. Section 4
presents a proposed framework for Personalized QA; Section 5 concludes this
paper, reaffirms our goals and identifies future work.
2 Personalized Question Answering
Our motivation towards Personalized Question Answering stems from existing
shortcomings within current QA systems designed for extracting/retrieving information from unstructured texts. The shortcomings are categorized below:
Proceedings of SEBIZ 2006
Authoritative source of information: In an open-domain QA setting, end-users have little control over the source of information from which answers are
sought. The reliability of answers is based mostly on the redundancy of data
present on the WWW [13]. Similarly, existing domain specific QA systems also
limit the source of information to a designated collection of documents. To our
knowledge, no QA system is designed in such a way that allows end-users to flexibly specify the source of information from which the answers are to be found.
This is of importance with respect to the design of a QA solution. For the majority of existing work, the collection of documents must initially undergo pre-processing. This pre-processing is performed only once, with the results stored for later retrieval. This offline processing strategy makes a computationally intensive approach (such as in [14, 10]) feasible, because all the necessary processing is already performed offline before any questions can be asked, and therefore significantly reduces the time required to find answers at run time. A QA system
that has a dynamic knowledge source will therefore need to take this amount of
necessary processing into consideration.
Contextual information: The use of contextual information in QA has not
received adequate attention yet. Only the work of Chung et al. [3] highlights the
fact that while forming the question, users may omit important facts that are
necessary to find the correct answer. User profile information is used in their
work to augment the original question with relevant information. The design of
QA systems therefore needs to take into account how and to what degree a given
question can be expanded to adequately reflect the context in which it is asked.
Writing style of documents: Current domain specific QA systems are usually
targeted to scientific domains, with the knowledge source, such as technical,
medical and scientific texts [8, 14, 7], written in a straight-forward, declarative
manner. This characteristic reduces the ambiguity in these texts. However, this is
not always the case with other types of documents, for example business reports.
Therefore, a QA system should be able to utilize domain and/or personal
knowledge to resolve ambiguity in texts that are written in a rhetorical way.
To address the above limitations, we propose Personalized Question
Answering, which:
– is domain specific, therefore avails of a formal domain ontology
– can cope with a dynamic collection of unstructured texts written in a rhetorical
style
– can handle various question types
– resolves implicit context within questions
– provides an answer-containing chunk of text rather than the precise answer
Before discussing details of the proposed framework, we first outline in Section
3, a use case for Personalized QA within the domain of Business Analysis.
3 Business Analysis use case
Business Analysis is largely performed as a Business Intelligence³ (BI) activity, with data mining and warehousing providing the information source for monitoring, identification and gathering of information. On-line analytical processing then makes differing data views and report generation possible, from which further BI analysis may be performed. Data excluded from the extract, transform and load phase passes through the process unaltered as unstructured information. Subsequent mining efforts on this information compound the problem through their reliance upon problematic document-level technologies such as string-based searching, resulting in information being missed.
Enterprises performing customer analysis as a means to identify new business opportunities currently rely on this type of mining activity. The problem becomes more complex when one considers that business analysts performing company health checks depend largely upon the consolidated financial information and management statements found in the free-text areas of
Form 10-Q. Management statements are those from the company's CEO and are concerned with the company's performance. They are viewed as a promotional medium for presenting the corporate image and are important in building
credibility and investor confidence. Despite analysts having a clear understanding of the information content that the statements may contain, the searching,
identification and extraction of relevant information remains a resource-intensive
activity.
Current BI technologies remain limited in this type of identification and extraction activity when processing information contained in unstructured texts written in a rhetorical manner. For example, part of performing a company health check involves building an understanding of that company's sales situation as described in Form 10-Q⁴. Sales⁵ performance is in turn partially dependent upon product and services revenue. An intuitive and time-saving way to gain an understanding of these areas is to pose non-trivial questions in natural language, such as "What is the strategy to increase revenues?" or "Are there any plans to reduce the cost of sales versus revenue?", and retrieve chunks of text that contain answers to these questions.
Our Personalized QA framework is based upon semantic technology and, when applied to the Business Analysis domain, will offer business analysts the ability to expedite the customer analysis process by having potentially relevant information presented in a timely manner. This can be achieved by having the business
analysts associate their knowledge in the form of a formal ontology and a set of
³ Term introduced by Gartner in the 1980s that refers to the user-centered process of data gathering and analysis for the purpose of developing insights and understanding leading to improved and informed decision making.
⁴ Quarterly report filed with the Securities and Exchange Commission (SEC) in the US. It includes un-audited financial statements and provides a view of the company's financial position. Relied upon by investors and financial professionals when evaluating investment opportunities.
⁵ The discussion context is the software services domain.
domain specific synonyms to the QA system, specify the source document (Form
10-Q), pose their questions, and retrieve chunks of text that contain answers to
conduct further analysis. The framework is described in the next section.
4 Personalized Question Answering Framework
The proposed framework for Personalized QA, as shown in Fig. 1, consists of two
main modules: Passage Retrieval and Answer Extraction. The Passage Retrieval
module performs text processing of documents and analysis of questions on-the-fly to identify passages that are relevant to the input question. This coarse-grained processing significantly reduces the search space for the answer, as only
relevant passages are fed to Answer Extraction (which is a more computationally
intensive module), to perform further fine-grained processing to identify chunks
of texts containing the correct answer. The details of these modules are discussed
below.
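As a rough sketch, the two-stage design can be expressed as a pipeline in which a cheap overlap-based ranker narrows the candidate set before a more expensive extraction step runs. All function names and the toy term-overlap scoring below are our own illustrative stand-ins, not the system's actual implementation:

```python
import re

def tokens(text):
    """Lower-cased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_passages(question, passages, k=3):
    """Coarse-grained stage: rank passages by query-term overlap."""
    terms = tokens(question)
    ranked = sorted(passages, key=lambda p: -len(terms & tokens(p)))
    return ranked[:k]

def extract_answer(question, passages):
    """Fine-grained stage (stub): return the top-ranked passage."""
    return passages[0] if passages else None

def answer_question(question, passages):
    # Only the reduced candidate set reaches the expensive stage.
    return extract_answer(question, retrieve_passages(question, passages))
```

The point of the split is cost: the fine-grained stage (dependency parsing and matching in the real framework) runs only on the few passages that survive the coarse filter.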
Fig. 1. Personalized Question Answering Framework. (The figure shows the Passage Retrieval module, comprising Document Processing (semantic annotation and passage splitting) and Question Analysis (question typing, semantic annotation and query terms expansion), with Lucene indexing and searching over annotated documents producing ranked passages; these are fed to the Answer Extraction module, which performs dependency parsing and dependency tree matching to produce the answer-containing texts.)
4.1 Passage Retrieval
Passage Retrieval serves as an important module in the whole QA process. If
this module cannot locate any passage that possibly contains the answer when
one actually exists, an answer cannot be found. On the other hand, as noted
by Cui et al. [15], too many irrelevant passages returned by this module could
hinder the Answer Extraction module in locating the correct answer. It is worth
noting that many research works in open-domain QA have studied the Passage
Retrieval problem and proposed different density-based algorithms, which are
quantitatively evaluated by Tellex et al. [16]. The lack of domain-specific knowledge makes these works significantly different from ours, because no semantics is taken into consideration; the similarity between the question and the documents is statistically measured based only on the original words. However, this is an advantage of domain-specific QA systems in terms of the resources available to them, rather than a limitation of the approaches currently used for open-domain QA systems. The work of Zhang et al. [7] takes semantics into consideration
while weighting similarity between the question and passages. This is similar to
our Passage Retrieval approach described below; however, this work lacks the
fine-grained processing performed by our Answer Extraction module to filter out passages that contain the query terms but not in a syntactic structure that answers the question. The following paragraphs describe each component of the Passage Retrieval module.
Document Processing: Document Processing involves two text processing
tasks: Semantic annotation and Passage splitting.
Although to date there is no formal definition of "Semantic Annotation", the concept is generally understood as "a specific metadata generation and usage schema, aiming to enable new information access methods and to extend
the existing ones” [17]. In other words, the semantic annotation task is performed based on the domain ontology, in order to associate the appearances of
domain specific terms or named entities with the respective ontological concepts,
therefore anchoring those terms or named entities within contents to their corresponding semantic information. There have been a lot of research efforts within
the field of semantic annotation with respect to discerning what to annotate,
what additional information users expect to have, whether to embed annotation or not, and how to perform automatic annotation etc. [17]. It is our belief
that a semantic annotation strategy should be tailored to the specific task at
hand. In the context of Personalized QA, we employ a similar strategy used
in the Knowledge and Information Management (KIM) platform [17], to link
annotations to concepts in the domain ontology. The General Architecture for
Text Engineering - GATE platform [18] provides a type of Processing Resource
called a Gazetteer, which performs gazetteer list lookup and links recognized entities to concepts in the ontology based on a mapping list manually
defined by users. However, instead of embedding an alias of the instance’s URI
(Uniform Resource Identifier) as in KIM, we directly embed the label of the most
specific class associated with the recognized terms or named entities inline with
the texts. For example, the following sentence ”CompanyX releases a new operating system.” is annotated with ”CompanyX BIOntoCompany releases a new
operating system BIOntoSoftware” whereby BIOntoCompany, BIOntoSoftware
are the labels of ontological concepts http://localhost/temp/BIOnto#Company,
http://localhost/temp/BIOnto#Software respectively. Care is taken while naming the labels by prefixing the concept name with the ontology name, which is
”BIOnto” in our use case, to make them unique. The rationale for this semantic
annotation strategy is as follows:
– Preserving the original terms or named entities, e.g. "CompanyX", ensures that exact keyword matching still works and avoids generating noise through over-generation if the original term were completely replaced by its annotation.
This ensures that a question such as "Which products did CompanyX release?" does not receive as an answer a sentence referring to products related to other companies.
– Embedding the class label directly in the text adds semantic information about the recognized terms or named entities. The annotation BIOntoCompany that follows "CompanyX" provides an abstraction that helps to answer a question such as "Which products did the competitors release?". In this case, the term "competitors" in the question is also annotated with BIOntoCompany; therefore, a relevant answer can be found even though the question
– Based on the concept label, the system can derive the URI of the concept in
the domain ontology, query the ontology for relevant concepts and use them
to expand the set of query terms of the original question.
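The inline annotation strategy can be illustrated with a toy gazetteer. The mapping entries below follow the paper's own examples ("CompanyX", "operating system", "competitors" and the BIOnto-prefixed labels); the function itself is a hypothetical simplification of the GATE gazetteer lookup, not the system's code:

```python
import re

# Toy gazetteer: surface form -> ontology class label (paper's examples).
GAZETTEER = {
    "CompanyX": "BIOntoCompany",
    "operating system": "BIOntoSoftware",
    "competitors": "BIOntoCompany",
}

def annotate(text):
    """Append the class label after each recognized term, keeping the
    original term in place, as the strategy described above requires."""
    for term, label in GAZETTEER.items():
        text = re.sub(re.escape(term), lambda m: f"{m.group(0)} {label}", text)
    return text
```

Because the original token is preserved, exact keyword matching on "CompanyX" still works, while a question mentioning "competitors" can match any BIOntoCompany annotation.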
Once the documents are annotated, they are split into passages based on the
paragraph marker identified by GATE. Each passage is now considered as one
document on its own and is indexed in the next step. Before indexing is carried
out, stop-word removal is applied to each of the documents. Stop-words are
words that do not carry any significant meaning, such as ”the”, ”a”, etc. They
are used frequently but do not help to distinguish one document from the others
and therefore do not help in searching [19]. Removing these insignificant words
makes the indexing and searching task more effective. Porter stemming is also
applied to convert the morphological variants of words into their roots.
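A minimal sketch of this preprocessing step follows. The stop-word list is a tiny sample, and `naive_stem` is a crude suffix-stripping stand-in for the full Porter stemmer cascade, shown only to make the pipeline concrete:

```python
STOP_WORDS = {"the", "a", "an", "to", "of", "is", "are", "and"}  # tiny sample

def naive_stem(word):
    """Crude suffix stripping; the Porter stemmer applies a much
    longer, condition-guarded cascade of rules."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(passage):
    """Stop-word removal followed by stemming, per indexed passage."""
    words = [w.lower().strip(".,?!\"") for w in passage.split()]
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]
```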
Document Indexing: The texts within processed documents are fully indexed using Lucene⁶, a freely available Information Retrieval (IR) library. Lucene supports Boolean queries based on the well-known tf.idf scoring function in IR
research. Interested readers are referred to [19] for more details on the scoring
formula being used in Lucene.
Question Analysis: Question Analysis involves three text processing tasks:
Question typing, Semantic annotation, and Query terms expansion.
⁶ http://lucene.apache.org/
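For illustration, the textbook tf.idf computation underlying this kind of scoring can be sketched as below. This is a deliberate simplification: Lucene's actual scoring formula adds length normalization, coordination and boost factors on top of the tf and idf components:

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc, corpus):
    """Textbook tf.idf over token lists: sum of tf * log(N/df) for each
    query term present in the document. A simplification of Lucene's
    scoring, for illustration only."""
    n_docs = len(corpus)
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        score += tf[term] * math.log(n_docs / df)
    return score
```

Terms concentrated in few passages thus score higher than terms spread across the whole indexed collection.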
Question typing is a common process used in many QA systems. For instance, in [20], a question type taxonomy is created to map questions into their respective types. This helps to bridge the gap between the wording used in the question and that used in the texts; for example, the system is aware that a question starting with "Where" asks about places, so it is typed as "Location". However, since a domain-specific QA system already has the domain ontology in place, instead of building a separate question type taxonomy as in [20], a set of pattern-matching rules is built to map the question type to one of the concepts
in the domain ontology. Therefore, for a question such as: ”Which products did
CompanyX release? ”, the question type is BIOntoProduct. The Wh-word and
the noun that follows it are replaced by the question type, and the question becomes "BIOntoProduct did CompanyX release?".
There are, however, some special cases in question typing, for instance, from
the sample questions from business analysts in our use case, we observe that
for ”Yes/No” questions such as ”Are there any CompanyX’s plans to release
new products? ” end-users actually do not expect to receive ”Yes” or ”No” as
an answer, but rather proof that the entities/events of interest exist, if they do. Therefore, a set of pattern-matching rules is in place to reformulate this type of question as a "What" question; the above example becomes "What are CompanyX's plans to release new products?", and the question typing process is then carried out as described above. There are also cases where
the questions cannot be typed to one of the domain concepts. In these cases,
question words are removed and the remaining words are treated as a set of
query terms.
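The typing and reformulation rules described above can be sketched with two pattern rules. The regular expressions and the BIOntoLocation concept are hypothetical examples of our own; the actual rule set is hand-built for the business-analysis ontology and is not reproduced here:

```python
import re

# Illustrative pattern rules mapping Wh-phrases to ontology concepts.
# BIOntoProduct follows the paper's example; BIOntoLocation is invented.
TYPE_RULES = [
    (re.compile(r"^Which products\b", re.I), "BIOntoProduct"),
    (re.compile(r"^Where\b", re.I), "BIOntoLocation"),
]

def reformulate_yes_no(question):
    """Turn 'Are there any X?' into 'What are X?' so typing can proceed."""
    m = re.match(r"^Are there any (.+\?)$", question, re.I)
    return f"What are {m.group(1)}" if m else question

def type_question(question):
    question = reformulate_yes_no(question)
    for pattern, concept in TYPE_RULES:
        if pattern.match(question):
            # Replace the matched Wh-phrase with the concept label.
            return pattern.sub(concept, question, count=1)
    return question  # untyped: caller falls back to plain query terms
```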
Once the question is typed, it is also annotated, but in a different manner
from the semantic annotation performed on documents. Care is taken so that
specific named entities are not annotated with their ontological concept's label, to avoid noise: e.g. attaching the label BIOntoCompany after the word "CompanyX" in the question would match any term annotated with BIOntoCompany in the document.
Before the question is split into query terms and submitted to the IR engine, Query terms expansion is performed based on the domain ontology and a set of domain-specific synonyms. Initial analysis of the sample questions from the business analysts in the use case indicates two phenomena:
– When the question is typed into a concept in the ontology and that concept
has sub-concepts, the question needs expanding with all the sub-concepts in
the ontology. Assuming that concept http://localhost/temp/BIOnto#Product
has sub-concepts http://localhost/temp/BIOnto#Software and
http://localhost/temp/BIOnto#Hardware, the first example question in this
section needs to include those two sub-concepts as query terms. This ensures
that those entities or terms annotated as BIOntoSoftware or BIOntoHardware can also be matched during the searching stage.
– End-users tend to use synonyms of verbs specifically to express the same
meaning. For example, "reduce" and "lower" are used interchangeably. Therefore, synonym lookup is performed against the available synonym set to include them in the set of query terms sent to the search engine.
Performing query terms expansion based on the domain ontology and synonym
sets in effect addresses the issue of ambiguity caused by the rhetorical writing style used in the source documents.
Searching: Lucene is used to search for indexed documents containing the expanded query terms. The Boolean query type is used, with the AND operator between original terms and the OR operator between expanded terms. Ranked relevant
passages returned from the search are fed into the next module, Answer Extraction, to filter out sentences containing query terms whose syntactic structures
do not match that of the original question.
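The expansion and Boolean query construction can be sketched as follows. The sub-concept and synonym tables follow the paper's examples (BIOntoProduct with BIOntoSoftware/BIOntoHardware sub-concepts, "reduce"/"lower"); the query string uses generic Lucene-style Boolean syntax, and the real data would come from the domain ontology and analyst-supplied synonym sets:

```python
# Toy expansion tables based on the paper's examples.
SUB_CONCEPTS = {"BIOntoProduct": ["BIOntoSoftware", "BIOntoHardware"]}
SYNONYMS = {"reduce": ["lower"]}

def build_boolean_query(terms):
    """AND between original terms; OR between each term and its
    sub-concept/synonym expansions."""
    clauses = []
    for term in terms:
        alts = [term] + SUB_CONCEPTS.get(term, []) + SYNONYMS.get(term, [])
        clauses.append("(" + " OR ".join(alts) + ")" if len(alts) > 1 else term)
    return " AND ".join(clauses)
```

A passage annotated with BIOntoSoftware or BIOntoHardware can thus still match a question typed as BIOntoProduct, while every original term must still be present.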
4.2 Answer Extraction
In this module, the question is matched with a given candidate answer, which
is a sentence derived from passages selected by the Passage Retrieval module.
A good review of previous work on Answer Extraction is provided in [1]. Typically, once a sentence has been found by some coarse-grained processing, a set of constraints is applied to check whether the candidate sentence actually contains the answer. A drawback of the majority of answer extraction techniques, such as simple word overlap or term density ranking, is that they fail to capture grammatical roles and dependencies within candidate answer sentences, such as logical subjects and objects [15, 1]. For instance, when presented with the question "Which company acquired Compaq?", one would expect to retrieve "HP"
as an answer to the question. However, typical term density ranking systems
would have difficulty with distinguishing the sentence ”HP acquired Compaq”
from ”Compaq acquired HP ”. It is concluded that neglecting relations between
words can result in a ”major source of false positives within the IR” [16].
Answer Extraction modules in systems that involve processing beyond the lexical level typically pre-annotate grammatical constraints/relations in order to match questions against candidate sentences [1]. The set of constraints can be regarded either as "absolute" or as a set of "preferences"
[1]. However, the degree of constraint plays an important role in determining the
robustness of the system. It is recommended that grammatical constraints "be treated as preferences and not as being mandatory" [1]. Previous work, such as the PiQASso system [21], shows that strict relation matching suffers substantially from poor recall. Cui et al. [15] propose a solution to this "strict matching problem" by employing fuzzy or approximate matching of dependency relations using MiniPar [22]. To make this paper self-contained, we provide an overview of the MiniPar Dependency Parser and the dependency trees it generates.
MiniPar Dependency Parser: MiniPar [22] is a fast, robust dependency parser which generates dependency trees for the words within a given sentence. In a dependency tree, each word or chunked phrase is represented by a node. Each link connects two nodes, one corresponding to the governor and the other, the daughter node, corresponding to the modifier. The label on each link denotes the dependency relation between the two nodes, e.g. subj, obj, gen. Fig.
2 is generated by MiniPar and illustrates the output dependency parse trees of
a sample question-Q1 and sample candidate answer-S1 taken from an extract of
Form 10-Q.
Q1: "What is the strategy to increase revenues?"
S1: "the need to expand our relationships with third parties in order to support license revenue growth"
Fig. 2. Dependency trees for Question Q1 and Answer candidate S1
Approximate/Fuzzy relation matching: The work of Cui et al. [15] addresses the shortcomings of strict matching between dependency relations. The authors extract relation paths between two given nodes, based on previous work in [23]. A variation of a Statistical Machine Translation model [24] is applied to calculate the probability that a candidate sentence matches the question given a combination of relation paths. Mapping scores between relation paths are learned using two statistical techniques: (1) Mutual Information, to learn pair-wise relation mappings between questions and candidate answers, and (2) Expectation Maximization, as an iterative training process [24]. Question-answer pairs are extracted from the TREC 8 and 9
QA tasks in order to provide training data. Quantitative evaluation shows that their approach achieves significant retrieval performance when implemented on a range of current QA systems, with MRR improvements of 50-138% and of over 95% for the top one passage. It is therefore concluded that the approximate dependency relation matching method can boost precision in identifying the answer sentence.
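The intuition behind relation-path matching can be sketched with hand-coded trees. Unlike Cui et al.'s fuzzy scoring, this toy version performs only strict path equality, and the trees are invented for illustration rather than taken from MiniPar output (each entry maps a word to its head and relation label):

```python
# "HP acquired Compaq" vs "Compaq acquired HP": same bag of words,
# opposite subject/object roles.
TREE_A = {"HP": ("acquired", "subj"), "Compaq": ("acquired", "obj")}
TREE_B = {"Compaq": ("acquired", "subj"), "HP": ("acquired", "obj")}

def relation_path(tree, word):
    """Chain of relation labels from a word up to the tree root."""
    path = []
    while word in tree:
        head, rel = tree[word]
        path.append(rel)
        word = head
    return path

def paths_match(tree_q, tree_s, word):
    """Strict path comparison; a fuzzy matcher would instead score
    how likely one path maps to the other."""
    return relation_path(tree_q, word) == relation_path(tree_s, word)
```

Term-density ranking treats both sentences identically; path comparison distinguishes them, which is exactly the false-positive source noted in [16].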
Applications for Personalized Question Answering: It is our intention to adapt the above approach to the specific domain of Business Analysis and to integrate it into the Answer Extraction module of the framework described in Fig. 1. We can utilize questions collected from business analysts, together with corpora of Form 10-Q filings, to extract dependency relations using the MiniPar parser and thereby generate sample question-answer pairs for statistical modeling similar to [15]. We believe that this approach, in combination with large samples of
restricted domain training data will yield high precision while still maintaining
high recall.
5 Conclusion and Future work
In this paper we introduce the idea of ”Personalized Question Answering” and
propose a framework for its realization. A usage scenario within the Business
Analysis domain is also presented. Investigative analysis has highlighted that domain knowledge in the form of a formal ontology plays an important role in shaping the approach and design of the aforementioned framework. This is particularly true for semantic annotation and query expansion, where semantics is needed to address the issue of ambiguity caused by the rhetorical writing style used in the source documents. Current research indicates that: (1) the inclusion of domain semantics leads to better recall in passage retrieval; (2) in a domain-specific QA system, certain types of questions may require specific analysis (e.g. "Yes/No" questions in the business analysis domain); (3) the use of approximate dependency matching between question-candidate answer pairs may yield higher precision for answer extraction without impacting recall.
An application prototype applying the Personalized Question Answering framework to the Business Analysis use case is being implemented. Once the fully functional prototype is available, a quantitative evaluation scheme will be implemented to gauge the effectiveness of the system within the Business Analysis context. As the system is domain-specific, the TREC QA track training data is not suitable for benchmarking, since it is targeted exclusively towards open-domain systems. The evaluation scheme will therefore involve the manual creation of a test corpus of business reports, from which, given a collection of test questions, business analysts will manually extract corresponding question-answer pairs. This derived set of question-answer pairs will be used to benchmark the performance of our Personalised Question Answering System.
Future work will also involve prototype functionality enhancement to cater
for complex questions whose answers are not explicitly stated, and for those that contain implicit contexts. Additional functionality will focus on a caching mechanism for the Question Analysis component, to improve performance for frequently asked domain questions. Last but not least, it is our goal to integrate
the QA system with the Analyst Workbench [25] to provide business analysts
with an integrated environment to perform business intelligence activity in an
effective and timely manner.
Acknowledgments. We would like to thank John Collins, Business Development & Business Engineering Manager, HP Galway, and Mike Turley, CEO of DERI, for discussions on the business analysis problem. We also thank the anonymous reviewers for their constructive comments. This work is supported by Science Foundation Ireland (SFI) under the DERI-Lion project (SFI/02/CE1/1131).
References
1. Hirschman, L., Gaizauskas, R.: Natural language question answering: the view
from here. Nat. Lang. Eng. 7 (2001) 275–300
2. Berger, H., Dittenbach, M., Merkl, D.: An adaptive information retrieval system based on associative networks. In: APCCM '04: Proceedings of the First Asian-Pacific Conference on Conceptual Modelling, Darlinghurst, Australia, Australian Computer Society, Inc. (2004) 27–36
3. Chung, H., Song, Y.I., Han, K.S., Yoon, D.S., Lee, J.Y., Rim, H.C., Kim, S.H.: A
practical qa system in restricted domains. In Aliod, D.M., Vicedo, J.L., eds.: ACL
2004: Question Answering in Restricted Domains, Barcelona, Spain, Association
for Computational Linguistics (2004) 39–45
4. Sneiders, E.: Automated question answering using question templates that cover the conceptual model of the database. In: NLDB '02: Proceedings of the 6th International Conference on Applications of Natural Language to Information Systems, Revised Papers, London, UK, Springer-Verlag (2002) 235–239
5. Bernstein, A., Kaufmann, E., Fuchs, N.E., von Bonin, J.: Talking to the semantic web: a controlled English query interface for ontologies. In: 14th Workshop on Information Technology and Systems. (2004) 212–217
6. Lopez, V., Pasin, M., Motta, E.: Aqualog: An ontology-portable question answering
system for the semantic web. In: ESWC. (2005) 546–562
7. Zhang, Z., Sylva, L.D., Davidson, C., Lizarralde, G., Nie, J.Y.: Domain-specific
qa for construction sector. In: Proceedings of SIGIR 04 Workshop: Information
Retrieval for Question Answering. (2004)
8. Demner-Fushman, D., Lin, J.: Knowledge extraction for clinical question answering: Preliminary results. In: Proceedings of the AAAI-05 Workshop on Question
Answering in Restricted Domains, Pittsburgh, Pennsylvania. (2005)
9. Niu, Y., Hirst, G.: Analysis of semantic classes in medical text for question answering. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Workshop on Question Answering in Restricted Domains. (2004) 54–61
10. Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J., Hess, M.: Extrans: Extracting
answers from technical texts. IEEE Intelligent Systems 18 (2003) 12–17
11. TREC: Text REtrieval Conference, http://trec.nist.gov (2005)
12. Andrenucci, A., Sneiders, E.: Automated question answering: Review of the main
approaches. In: ICITA ’05: Proceedings of the Third International Conference on
Information Technology and Applications (ICITA’05) Volume 2, Washington, DC,
USA, IEEE Computer Society (2005) 514–519
13. Dumais, S., Banko, M., Brill, E., Lin, J., Ng, A.: Web question answering: Is more
always better? In: Proceedings of the 25th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 2002),
Tampere, Finland. (2002)
14. Diekema, A.R., Yilmazel, O., Chen, J., Harwell, S., Liddy, E.D., He, L.: What do
you mean? finding answers to complex questions. In: Proceedings of the AAAI
Spring Symposium: New Directions in Question Answering. Palo Alto, California.
(2003)
15. Cui, H., Sun, R., Li, K., Kan, M.Y., Chua, T.S.: Question answering passage
retrieval using dependency relations. In: SIGIR ’05: Proceedings of the 28th annual
international ACM SIGIR conference on Research and development in information
retrieval, New York, NY, USA, ACM Press (2005) 400–407
16. Tellex, S., Katz, B., Lin, J., Fernandes, A., Marton, G.: Quantitative evaluation of passage retrieval algorithms for question answering. In: SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, ACM Press (2003) 41–47
17. Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic annotation, indexing, and retrieval. Journal of Web Semantics 2 (2005) 39
18. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: Gate: A framework
and graphical development environment for robust nlp tools and applications. In:
Proceedings of the 40th Annual Meeting of the ACL. (2002)
19. Hatcher, E., Gospodnetic, O.: Lucene in Action. Manning Publications Co. (2005)
20. Hovy, E., Gerber, L., Hermjakob, U., Lin, C., Ravichandran, D.: Toward semantics-based answer pinpointing (2001)
21. Attardi, G., Cisternino, A., Formica, F., Simi, M., Tommasi, A.: Piqasso: Pisa
question answering system. In: Text REtrieval Conference. (2001)
22. Lin, D.: Dependency-based evaluation of minipar. In: Proc. Of Workshop on the
Evaluation of Parsing Systems, Granada, Spain. (1998)
23. Gao, J., Nie, J.Y., Wu, G., Cao, G.: Dependence language model for information
retrieval. In: SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR
conference on Research and development in information retrieval, New York, NY,
USA, ACM Press (2004) 170–177
24. Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19 (1993) 263–311
25. O’Riain, S., Spyns, P.: Enhancing business analysis function with semantics.
In Meersma, R., Tari, Z., eds.: On the Move to Meaningful Internet Systems
2006: CoopIS, DOA, GADA and ODBASE; Confederated International Conferences CoopIS, DOA, GADA and ODBASE 2006 Proceedings. LNCS 4275, Springer
(2006) 818–835
Proceedings of SEBIZ 2006
OntoCAT: An Ontology Consumer Analysis Tool and
Its Use on Product Services Categorization Standards
Valerie Cross and Anindita Pal
Computer Science and Systems Analysis, Miami University,
Oxford, OH 45056
{crossv, pala}@muohio.edu
Abstract. The ontology consumer analysis tool, OntoCAT, provides a
comprehensive set of metrics for use by the ontology consumer or knowledge
engineer to assist in ontology evaluation for re-use. This evaluation process is
focused on the size, structural, hub and root properties of both the intensional
and extensional ontology. It has been used on numerous ontologies from
varying domains. Results of applying OntoCAT to two Product and Service
Categorization Standards, the UNSPSC and eCl@ss ontologies, are reported.
Keywords: ontology evaluation, ontology metrics, ontology ranking, hubs
1 Introduction
Domain ontology development and management are increasingly important to
most kinds of knowledge-driven applications. Development
and deployment of extensive ontology-based software solutions represent
considerable challenges in terms of the amount of time and effort required to
construct the ontology. These challenges can be addressed by the reuse of ontologies.
The extent to which reuse of ontologies could contribute cost and time savings
parallels that obtained in software reuse [17] because acquiring domain knowledge,
constructing a conceptual model and implementing the model require a huge effort.
As with any other resource used in software applications, ontologies need to be
evaluated before use to prevent applications from using inconsistent, incorrect, or
redundant ontologies. Even if an ontology has none of these problems and is formally
correct, users must decide whether the content of ontology is appropriate for the
requirements of their project, that is, they must determine how well the metadata and
instances meet the requirements for their problem domain. Knowledge engineers need
an ontology analysis tool to help in the process of ontology assessment for reuse.
Much ontology research has focused on new methodologies, languages, and tools
[4]; recently, however, since the OntoWeb 2 position statement stressed the
insufficient research on ontology evaluation and the lack of evaluation tools [11],
much attention has been directed towards ontology evaluation [3,8]. Initially this
attention concentrated on a formal analysis approach to evaluating ontologies [9].
Others have created taxonomies of ontology characteristics [12] to quantify the
suitability of ontologies for users’ systems. Knowledge engineers must analyze these
characteristics for the prospective ontologies in order to compare them and select the
appropriate ontology for the system. More recent efforts address evaluating ontologies
for reuse, not by ontology developers and experts, but by ontology consumers [14]
who are users such as system project managers hoping to find existing ontologies on
the Semantic Web which can be reused and adapted for their systems.
The objective of this research is to describe an ontology consumer analysis tool,
OntoCAT [16], that summarizes essential size, structural, root and hub properties for
both an intensional ontology and its corresponding extensional ontology. An
intensional ontology only includes the ontology schema or definitions. An
extensional ontology includes the instances, i.e., occurrences of classes and
relationships. OntoCAT supports the ontology consumer by performing an analysis
on the graph-like properties of an ontology. First, a brief overview of the variety of
approaches to evaluating ontologies is presented in Section 2. Included in more detail
in this presentation are those methods which take the ontology consumer perspective
on evaluation. Section 3 describes some of the metrics included in OntoCAT.
OntoCAT has been created as a plug-in for the Protégé Ontology Editor
(http://protege.stanford.edu/overview/). The OntoCAT user interface is presented in
Section 4. The results of performing an ontology consumer analysis on two product
and service categorization standards (PSCS), UNSPSC (United Nations Standard
Products and Services Code) and eCl@ss, are discussed in Section 5 along with a brief
description of the UNSPSC and eCl@ss ontologies. Conclusions and planned future
work are presented in Section 6.
2 Ontology Evaluation
A variety of approaches to ontology evaluation have been proposed depending on the
perspectives of what should be evaluated, how it should be evaluated and when it
should be evaluated. As such, ontology evaluation has become a broad research area
[3] with numerous frameworks proposed for evaluating how “good an ontology is”.
These frameworks can be classified into various categories depending on what
qualities are considered most relevant to an ontology and how they are being
evaluated. These qualities may also have an importance factor. For example, is the
quality of the design more important than the quality of the content [15] and can a
gold standard be used or is an assessment by a human expert required [3]?
In addition, some evaluation methods make specific recommendations about when in the
ontology development lifecycle the evaluations should be performed. Others suggest
developing methodologies to evaluate an ontology during the development process
and throughout its entire lifetime [9].
Another distinction made in ontology evaluation is that of selection versus
evaluation. In [18], ontology selection is defined as “ontology evaluation in the real
Semantic Web.” Their survey of existing selection algorithms reveals that few
ontology evaluation methods are incorporated except for similarities in topic
coverage. They conclude that although evaluation and selection consider different
requirements, they are complementary. In [7] a holistic view of ontology evaluation
is considered by viewing an ontology as a communication object. The Qood grid
method permits parametric design for both evaluation and selection (diagnostic) tasks.
Ontologies are analyzed in their graph and formal elements, functional requirements,
and annotation profile. The Qood evaluation based on graph analysis parallels that of
the OntoCAT metrics proposed in [5].
Due to space limitations, not all of these various evaluation methods can be
discussed. The following sections briefly describe ontology consumer evaluation
methods and overview the two tools most closely related to OntoCAT.
2.1 Ontology Evaluation from the Consumers’ Perspective
To make reusing ontologies easier, more research needs to address the evaluation
of an ontology from the consumer point of view [14]. Ontology consumers need
tools to help with two different tasks: selecting, from the enormous number of
available ontologies, the most appropriate candidates for their applications, and
evaluating the quality of those ontologies. As pointed out previously, ontology evaluation
and ontology selection are complementary. The question is in what order should
these two tasks be performed. The answer might depend on the individual ontology
developer, but typically the answer is that selection is performed first for filtering
purposes and then followed by a more time consuming quality evaluation. Selection
or filtering methods typically employ topic coverage, popularity, and richness of
conceptualized knowledge [18].
Consumer analysis tools could be useful to both selection and evaluation tasks.
One approach suggested in [14] for consumer analysis is ontology summarizations.
Ontology summarizations are analogous to approaches used by researchers in
reviewing the usefulness of papers or deciding on whether to purchase a book. Just as
a researcher examines a book’s summary or a paper’s abstract when deciding on its
usefulness, similarly there should be some abstract or summary of what an ontology
covers to help consumers decide whether it fits their application requirements. The
summary might include the top-level concepts and links between them as a graphical
representation and a listing of hub concepts –concepts that have the largest number of
links in and out. It could also include metrics similar to Google’s PageRank that
determine that a concept is more important if other important concepts are linked to it.
OntoCAT metrics are based on these analogies and fall into both the structural and
functional types of measures for ontology evaluation [7]. OntoCAT metrics can be
valuable to both the selection and evaluation tasks performed on ontologies. The
summaries which provide the consumer with a high level view of the topic coverage
are functional types of measure and important to the selection task. The OntoCAT
metrics analyzing an ontology as a graph structure are structural metrics that can be
used to evaluate the quality of the ontology design similar to those used for software
metrics [17]. Two of the more recent related approaches to OntoCAT are presented
below. The ones selected either focus on quantitative size and structural metrics for
ontology selection or evaluation or they have a component that includes such metrics.
Structural types of measures in [7] correspond closely with OntoCAT metrics
presented in [5].
2.2 OntoQA
The OntoQA tool [20], developed by the LSDIS Lab at the University of Georgia,
measures the quality of an ontology from the consumer perspective in terms of
Schema and Instance metrics. The
schema metrics of OntoQA address the design of the ontology schema, some of which
correspond with the OntoCAT intensional metrics calculated on the intensional
ontology (the definition of the ontology). Instance metrics of OntoQA deal with the
size and distribution of the instance data, some of which correspond with the
OntoCAT extensional metrics calculated on the ontology occurrences.
OntoQA defines three metrics in its Schema metrics group. These are relationship
richness, attribute richness and inheritance richness. Relationship richness measures
the percentage of relationships that are not is-a or taxonomic relationships. Attribute
richness measures the average number of attributes per class, dividing the cardinality
of the attribute set by the cardinality of the class set. Inheritance richness measures
the average number of subclasses per class.
Metrics for the Instance group fall into two subgroups: metrics for the whole
instance ontology and class metrics. Class metrics describe how a class is
being populated in the instance ontology. Metrics for the whole instance ontology
include class richness, average population, and cohesion. Class richness measures the
distribution of instances across classes. Formally, it is defined as the ratio of the
number of classes with instances divided by the number of classes defined in the
ontology. Average population measures the average distribution of instances across
all classes. Formally, it is defined as the number of instances divided by the number of
classes in the ontology. Cohesion measures the number of connected components in the
instance graph built using the taxonomic relationships among instances.
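The richness metrics described above reduce to simple ratios. The following sketch illustrates a few of them on a toy ontology; the dict/set representation, the function names, and the example data are assumptions for illustration, not OntoQA's actual implementation.

```python
# Illustrative sketch of OntoQA-style richness metrics (assumed encoding,
# not OntoQA's data model).

def relationship_richness(non_isa_relations, isa_relations):
    """Fraction of relationships that are not is-a (taxonomic) links."""
    total = non_isa_relations + isa_relations
    return non_isa_relations / total if total else 0.0

def attribute_richness(num_attributes, num_classes):
    """Average number of attributes per class."""
    return num_attributes / num_classes if num_classes else 0.0

def class_richness(instances_by_class, all_classes):
    """Ratio of classes that have at least one instance."""
    populated = sum(1 for c in all_classes if instances_by_class.get(c))
    return populated / len(all_classes) if all_classes else 0.0

def average_population(instances_by_class, all_classes):
    """Average number of instances per class."""
    total = sum(len(v) for v in instances_by_class.values())
    return total / len(all_classes) if all_classes else 0.0

classes = {"Product", "Service", "Supplier"}
instances = {"Product": ["p1", "p2"], "Supplier": ["s1"]}
print(class_richness(instances, classes))      # 2 of 3 classes populated
print(average_population(instances, classes))  # 3 instances over 3 classes
```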
Class metrics include importance, fullness, inheritance richness, connectivity,
relationship richness and readability. Importance refers to the distribution of instance
over classes and is measured on a per sub-tree root class. It specifies the percentage of
instances that belong to classes in the sub-tree rooted at the selected class with respect
to the total number of instances in the ontology. This definition is somewhat
confusing because multiple instance sub-trees for a selected class could exist. It is
assumed that this definition would include all instances of sub-trees with the selected
class type. Fullness is primarily for use by ontology developers to measure how well
the data population was done with respect to the expected number of instances of each
class. It is similar to importance except that it is measured relative to the expected
number of instances that belong to the sub-tree rooted at the selected class instead of
the total number of instances in the ontology. Connectivity for a selected class is the
number of instances of other classes connected by relationships to instances of the
selected class. Relationship richness measures the number of the properties in the
selected class that are actually being used in the instance ontology relative to the
number of relationships defined for the selected class in the ontology definition.
Readability measures the existence of human readable descriptions in the ontology.
Human readable descriptions include comments, labels, or captions.
2.3 AKTiveRank
Several ontology search engines such as Swoogle [6] and OntoSearch [21] can be
used by entering specific search terms to produce a list of ontologies that include the
search terms somewhere in the ontology. AKTiveRank [1] ranks the ontologies retrieved by
an ontology search engine. Its initial implementation evaluated each retrieved
ontology using four measures: class match, density, semantic similarity and centrality.
The class match measure evaluates the coverage of an ontology by scoring the
number of query terms contained in the ontology, using both exact matches, where
the search term is identical to the class name, and partial matches, where the search
term is contained within the class name. The density
measure evaluates the degree of detail in the representation of the knowledge
concerning the matching classes. The density value for an individual class is the
weighted sum of the count of its number of super-classes, subclasses, direct and
indirect relations (in and out), and siblings. The number of instances was initially
included but dropped since it was felt that this parameter might bias evaluation toward
populated ontologies [2]. This bias might penalize ontologies with higher quality
definitions (schemas). The density measure for the query is the average for all
matching classes. The semantic similarity measure (SSM) determines how close the
classes that match the search terms are in an ontology. The semantic similarity is
calculated between all pairs of matching classes and then the average is taken.
The centrality measure assumes that the more central a class is in the hierarchy, the
better analyzed and more carefully represented it is. A centrality measure for a class is
calculated for each class that matches fully or partially a given query term based on its
distance from the class midway from the root to the leaf on the path containing the
matching class. Then the centrality measure for the query is the average for all
matching classes. More recent research [2] identified the redundancy of the centrality
measure because of its close correspondence with the density measure and replaced it
with the betweenness measure. The higher the betweenness measure for a class, the
more central that class is to the ontology. For betweenness, the number of shortest
paths between any two classes that contain a class matching a queried concept is
calculated. These numbers are summed over all queried concepts, and their average
determines the betweenness measure. The overall rank score for an ontology is the
weighted aggregation of these component measures.
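As a concrete illustration of the class match measure just described, the following sketch scores exact and partial term matches. The weights and the toy class names are assumptions for the example, not AKTiveRank's actual parameters.

```python
# Sketch of an AKTiveRank-style class match score: exact matches (query term
# equals a class name) outweigh partial matches (term contained in a class
# name). The weights below are assumed for illustration.

EXACT_WEIGHT, PARTIAL_WEIGHT = 1.0, 0.4  # assumed weights

def class_match(class_names, query_terms):
    score = 0.0
    for term in query_terms:
        t = term.lower()
        for name in class_names:
            n = name.lower()
            if n == t:
                score += EXACT_WEIGHT       # exact match
            elif t in n:
                score += PARTIAL_WEIGHT     # partial match
    return score

ontology = ["Student", "PhDStudent", "Staff"]
print(class_match(ontology, ["student"]))  # one exact + one partial match
```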
The researchers creating AKTiveRank note that such a tool “needs to be designed
in a way so as to provide more information of the right sort. Mere ranking, while it
will no doubt be useful, may not be the whole story from a practical perspective” and
further suggest that there is “a need to disentangle a number of different parameters
which can affect the usability of a knowledge representation” since the perception of
the knowledge engineers with respect to different ontology parameters “may vary
depending on the particular application context” [1]. A limitation of this tool is that it
only ranks intensional ontologies since all measurements are based on the definition
of the ontology. There are some ontologies, especially terminological ontologies,
whose intensional ontology is quite simple but whose extensional ontology is quite
complex. An ontology consumer analysis tool should be able to process both
components of an ontology to provide the user with as much information as possible.
3 OntoCAT Metrics
The ontology consumer analysis tool OntoCAT provides a comprehensive set of
metrics to be used by the ontology consumer or knowledge engineer to assist in
ontology evaluation for re-use. Rather than quality measurements, OntoCAT
provides summarizations and size and structural metrics. The metrics are
separated into two categories: intensional metrics and extensional metrics.
Intensional metrics are calculated based on the ontology definition itself, that is, its
classes and subclasses and their properties. Extensional metrics measure the
assignment of actual occurrences of ontological concepts, that is, the instances and
how effectively the ontology is used to include the domain knowledge. Much
research has focused on extensional ontologies, in part because consideration of
ontology reuse has often centered on terminological ontologies such as those found
in the biomedical fields.
The following metrics are relative to the metadata being assessed: C (class), P
(property), A (attribute), and R (relationship). Metrics beginning with an “i” are for the
intensional ontology, and those beginning with an “e” are for the extensional
ontology. Some of the metrics do not return a numeric value but instead indicate
identifying information. For example, iMaxClasses provides a list of classes that
have the maximum number of properties. In the following, Cnt stands for count, Avg
for average, and Rng for range. The main approach is to determine various metrics
and to also examine them on both horizontal (per depth) and vertical (per path) slices
of the ontology. Below only a sample of the metrics are presented due to page
limitations. A complete description of all the metrics can be found in [16].
3.1 Size Metrics
Intensional. Typically an intensional ontology has one root concept, but multiple root
concepts are possible. If no concept or class cj is specified, the intensional size metric
is calculated for the entire ontology, that is, over all the trees defined in the ontology.
When a concept cj is specified to be used as a root, the size metric is calculated on the
tree specified by the selected concept cj as its root. Although the measures using a
root are referred to as size metrics, they do, however, use the “is-a” or subsumption
hierarchy to determine the tree for which the size metrics are being determined.
iCnt(C) = the number of classes for the entire intensional ontology
iCnt(C)(cj-root) = the number of classes for the sub-tree at the selected class cj .
iCnt(P) = the number of properties defined for the entire intensional ontology
iCnt(P)(cj-root) = the number of properties defined for the entire sub-tree at
class cj . A property may be inherited from its parents. Only new properties
are counted for each class.
iCntTotal(P)(cj) =the total (new + inherited) number of properties for class cj.
iCnt(R) = the number of relationships defined for the entire intensional ontology.
A relationship is a special kind of property that has a class as its range.
iCntTotal(R)(cj) = the total (new + inherited) number of relationships defined for
only the selected class
iMaxTotal(P to C)(cj-root) = max number of (new + inherited) properties defined
for a single class over all classes in the sub-tree at the selected class cj
iMaxTotalClasses(P to C)(cj-root) = class names for classes within sub-tree at the
selected class cj that have the max number of properties
Extensional.
eCnt(cj) = the number of object occurrences for class cj
eCnt(C) = ∑j eCnt(cj), the total number of object occurrences in the ontology
eCnt(C)(cj-root) = ∑i eCnt(ci), the total number of object occurrences in the
sub-trees for the selected class cj where ci is in sub-tree cj
eAvgCnt(C) = eCnt(C)/iCnt(C), the average number of occurrences for all classes
eMaxCnt(C) = maxi[eCnt(ci)] and identify eMaxCntClass, i.e., the class with the
maximum number of occurrences in the ontology
eCnt(ri) = the number of occurrences for relation ri
eCnt(R) = ∑i eCnt(ri) total number of occurrences for all relations in ontology
eAvgCnt(R) = eCnt(R)/eCnt(C), average number of relationships per occurrence
eMaxCnt(R) = maxi[eCnt(ri)] and identify eMaxCntRelation
3.2 Structural Metrics
Intensional structural metrics are similar to size metrics since they are over the entire
intensional ontology, that is, over all the root trees defined in the ontology if no
concept or class is specified. When a class is specified, the structural metrics are
calculated for the entire sub-tree at that class cj. The class hierarchy (sub-class/superclass) relationships are used for the structural organization.
iCnt(Roots) = number of roots in the ontology.
iCnt(leaves(cj-root)) = number of leaf classes of the sub-tree at the selected class cj
iCnt(leaves) = ∑j iCnt(leaves(cj-root)), the total number of leaf classes in the entire
ontology, where cj-root is a root class.
iPer(leaves(cj-root)) = iCnt(leaves(cj-root)) /iCnt(C)( cj-root) the fraction of classes
that are leaves of the is-a hierarchy for the entire sub-tree at class cj .
iAvg(leaves) = iCnt(leaves)/iCnt(C), the fraction of classes that are leaves for
the entire ontology.
iMaxDepth(cj-root) = maxj [depth(leafij)], the maximum depth of the sub-tree
at the selected class cj and return the class of the leaf at the maximum depth
Several intensional structural metrics are adapted from WordNet’s information content (IC) measure [19].
The IC for class cij for cj-root (the class may be in multiple trees, therefore, the
subscript j specifies the root of the tree) is given as [5]:
iIC(cij) = 1- log(iCnt(C)(cij-root) + 1)/log(iCnt(C)(cj-root))
The class cj-root must be a root class of the ontology, whereas cij-root is any class
within the inheritance hierarchy rooted at cj-root. This measure can be used to identify
the degree of information content on a per depth basis within the ontology. Using
class information content as a new measure provides a novel way of examining the
ontology for symmetry and balance across its horizontal slices. Some of the following
measures proposed for each cj-root of an intensional ontology are:
iIC(depthk(cj-root)) = ∑i IC(cij) for all ci at depth k for cj-root
iAvgIC(depthk(cj-root)) = iIC(depthk(cj-root))/iWidth(depthk(cj-root))
iRngIC(depthk(cj-root)) = iMaxIC(depthk(cj-root)) - iMinIC(depthk(cj-root))
iAvgIC(cj-root) = ∑k iAvgIC(depthk(cj-root)) / iMaxDepth(cj-root), the average IC
over all depths in the tree at root cj-root
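The iIC formula above can be illustrated on a small hierarchy. This is a sketch under an assumed tree encoding, not OntoCAT's code; sub_tree_size plays the role of iCnt(C) for a sub-tree.

```python
import math

# Sketch of the information content (IC) metric defined above, on a toy tree.
# The dict encoding is an assumption for the example.

subclasses = {"Root": ["A", "B"], "A": ["A1", "A2"], "B": [], "A1": [], "A2": []}

def sub_tree_size(c):
    """iCnt(C)(c-root): number of classes in the sub-tree rooted at c."""
    return 1 + sum(sub_tree_size(s) for s in subclasses.get(c, []))

def i_ic(c, root):
    """iIC(cij) = 1 - log(iCnt(C)(cij-root) + 1) / log(iCnt(C)(cj-root))."""
    return 1 - math.log(sub_tree_size(c) + 1) / math.log(sub_tree_size(root))

# Leaves carry the most information content; classes near the root the least.
print(i_ic("A1", "Root"))    # leaf: sub-tree size 1
print(i_ic("Root", "Root"))  # root of the tree: lowest IC
```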
Extensional. Structural metrics are calculated on the specified root concept cj-root
and the specified relationships used to create the structural organization of the
extensional ontology. For example, in the WordNet extensional ontology, the
specified relationships providing its structure are the hyponym and hypernym
relationships. If no concept is specified or if the specified concept is the top most
concept of the ontology, structural extensional metrics for the complete ontology are
calculated. When cj-root is specified for metrics of extensional ontologies, only root
occurrences of this class with respect to the structural organization of the extensional
ontology are considered. The metrics listed below that have the occ parameter are
automatically produced for each root occurrence of the class selected by the user.
eCnt(roots) = number of root occurrences for all root classes
eCnt(leaves(cj-root)) = number of leaves for all occurrences of class cj-root.
eCnt(leaves(cj-root(occ))) = number of leaves for the specified occurrence of cj-root.
eMaxCnt(leaves(cj-root)) = maxi [eCnt(leaves(cj-root (occi)))], the maximum
number of leaves in all rooted occurrences of class cj-root, and give its identity
eMinDepth(cj-root(occ)) = mini [depth(leafij(occ))], the minimum depth of the
sub-tree at the selected root occurrence of root class cj and return the leaf
occurrence(s) at the minimum depth.
eWidth(depthk(cj-root)) = ∑i eWidth(depthk(cj-root(occi))), the number of
instances at depth k for all occurrences of the selected root class cj
3.3 Summarization Metrics
The hub summary displays information on the hubs, i.e., object occurrences (for
extensional) and classes (for intensional) having the maximum number of links in and
out. For intensional, the count of links is the number of subclasses and superclasses
defined. For extensional, the links are based on the relationships specified for
creating its structure. A list of the top n hubs (user-specified), is reported with
statistics for each hub. Note that the ‘i’ or ‘e’ preceding the metric is omitted since it
is determined by whether it is for an intensional or extensional ontology.
depth(hub) = the depth of the tree where the hub concept is located
width(hub) = the number of other occurrences at that depth in the tree
Cnt(P(hub)) = number and list of properties defined for the hub in case of classes
Cnt(child(hub)) = the number and list of direct children of the hub
Cnt(parent(hub)) = the number and list of direct parents of the hub
A root summary may be calculated for both the intensional and extensional ontology
and include class or occurrence counts for roots and leaves, the minimum, maximum
and average depths of the intensional and extensional ontology, and the minimum,
maximum, and average widths of the intensional and extensional ontology.
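The hub summary above amounts to ranking classes or occurrences by their total number of links in and out. A minimal sketch, assuming a simple edge-list representation of the relevant relationships:

```python
from collections import Counter

# Sketch of hub identification as described above: rank nodes by their total
# link count (in + out) and report the top n. The edge-list input format is
# an assumption for this example.

def top_hubs(edges, n):
    """Return the n nodes with the most links (in + out)."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return [node for node, _ in degree.most_common(n)]

links = [("animal", "dog"), ("animal", "cat"), ("dog", "poodle"),
         ("dog", "terrier"), ("cat", "siamese")]
print(top_hubs(links, 1))  # ['dog']: 3 links, the most of any node
```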
4 OntoCAT User Interface
OntoCAT is implemented as a Protégé plug-in so that it is incorporated into the
ontology engineering environment itself. As an ontology is being developed,
OntoCAT may be executed to determine how the structural types of measures change
during the development cycle. Since OntoCAT is part of the ontology engineering
environment, evaluation can easily be performed without altering the ontology.
The implementation is generalized to handle the structural difference in ontologies
and is parameterized to permit easily switching between an intensional and
extensional ontology. The user selects the root class and relationship to be used to
calculate the metrics. The implementation uses the OWL API because of its flexibility
and easy access for ontologies. Metrics for ontologies in RDF/RDF(S) can also be
calculated through conversion to OWL ontologies with Pr
ot
é
g
é
’
se
x
por
tf
un
c
t
i
on
.
The main user interface consists of two split panes. The left “Selection” panel
permits selection of the metrics. The “Result” panel displays the calculated metrics.
In the following figures, a small version of the WordNet ontology is used. WordNet is
a general terminological ontology of the English language which serves as a freely
available electronic dictionary organized by nouns, verbs, adjectives and adverbs into
synonym sets (synsets), each representing one underlying lexical concept [13].
4.1 Selection Panel
The user selects which metrics to calculate for the ontology. Figure 1 shows the
“Selection” panel. The metrics are grouped into intensional size and structure and
extensional size and structure. This organization allows the users to easily switch
between the intensional and extensional ontology. The IC metrics are separated from
the structural metrics for aesthetic reasons. Users also enter the depth values in the
two text fields for calculating the IC metrics and the width metrics at the depth.
The next set of parameters to input after selection of metrics is the root concept on
which to measure and the relationship used to build the extensional taxonomical
structure. The user can select these para
me
t
e
r
sbyc
l
i
c
k
i
ngt
h
e“
Me
t
r
i
c
s
”or“
Re
por
t
”
buttons. When these buttons are clicked a pop-up window is opened as shown in
Figure 2 below. This window contains the list of classes and relationships defined in
the ontology. For example, after selecting the metrics shown in Figure 1, the user
clicks on the “Metric” button. The concept and relation selection window pops up. In
Figure 2, the Lexical Concept class of the WordNet ontology is selected. By default
metrics are calculated on the whole ontology so that users only need to select the class
on which they want to calculate metrics. For the “Metric” button, if no class is
selected, then only the ontology level metrics are displayed since there is no space in
the UI to display the metric results for all classes in the ontology. For the “Report”
button, if no class is selected, the metrics are calculated on all classes in the ontology.
The “Report” button generates a report of the selected metrics to a file in the tool’s
home directory. This report is formatted for easy importing to an Excel spreadsheet
to analyze and generate charts and tables as done in Section 5. If users do not select
a class, the report is generated on each of the classes in the ontology.
Fig. 1. Selection Panel with list of metrics
Fig. 2. Selection of Class and Relationship
The user selects the relationship for building the extensional taxonomic structure. For
intensional, the structure uses the sub-class relationship. The extensional taxonomic
structure differs from ontology to ontology. For example, WordNet uses the
hyponymOf and hypernymOf relationships. If the extensional metrics button is
selected, the parent relationship for structuring must be entered.
4.2 Result Panel
The “Result” panel displays the metrics from the “Selection” panel and has three tabs:
Hubs, Intensional, and Extensional. Figure 3 shows the hub summary for a small
WordNet ontology. The intensional table lists the hub classes, those with the
maximum number of subclasses and super-classes. The extensional table lists the
instance hubs, those with the maximum number of links in and out. The last table is
specific to the class listed above the table. For example, the third table lists the
extensional hubs for the lexicalConcept class selected in the “Selection” panel pop-up
window. The summary provides the following: depth, width, number of properties
and IC. By default the plug-in displays the top 10 hub concepts. Users can specify the
number or percent of hubs to display by changing the value in the text fields (for
example to 20 or 10%), located beside the table labels, and clicking the button.
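The top-N or percentage selection described above can be sketched as follows (the function and its input format are assumptions for illustration, not the plug-in's code):

```python
def top_hubs(hub_scores, spec="10"):
    """Return the hub concepts with the highest scores; `spec` is either a
    count ("20") or a percentage ("10%"), mirroring the text-field input."""
    ranked = sorted(hub_scores, key=hub_scores.get, reverse=True)
    if spec.endswith("%"):
        n = max(1, round(len(ranked) * float(spec[:-1]) / 100))
    else:
        n = int(spec)
    return ranked[:n]

# Hypothetical hub scores for four concepts.
scores = {"lexicalConcept": 9, "noun": 7, "verb": 5, "adverb": 1}
```

With these toy scores, both `top_hubs(scores, "2")` and `top_hubs(scores, "50%")` return the two highest-scoring concepts.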
5 Analysis of UNPSCS and ecl@ass Ontologies
An important requirement of e-Commerce is effective communication between
software agents. A common approach to provide machine-accessibility for agents is
a standardized vocabulary of product and services terminology referred to as Product
and Service Categorization Standards (PSCS) [10]. UNSPSC (United Nations
Standard Products and Services Code) and eCl@ss are two example PSCS developed
into intensional ontologies consisting of the schemas and definitions of the concepts
in the product and service domain. UNSPSC is a hierarchical classification of all
products and services for use throughout the global marketplace. eCl@ss, developed
by German companies, offers a standard for information exchange between suppliers
and their customers. It uses a 4-level hierarchical classification system that maps the
market structure for industrial buyers and supports engineers at development,
planning and maintenance.
Martin Hepp has developed an OWL version of eCl@ss (http://www.heppnetz.de/eclassowl).
A previous study of PSCS ontologies uses a framework of metrics "to assess the
quality and maturity of products and services categorization standards" [10]. This
framework is applied to the most current and multiple past releases of eCl@ss,
UNSPSC, eOTD, and RNTD. In that study, the term "size of segments" corresponds
to OntoCAT's iCnt(C)(cj-root), the number of classes for a root class. The term "size"
corresponds to OntoCAT's iCnt(C), the number of classes for the entire intensional
ontology. The "property list size" corresponds to iCnt(P), the number of properties
defined for the entire intensional ontology. Using OntoCAT, an analysis for both
UNSPSC and eCl@ss was performed. Due to space limitations, only root summary
reports are provided below in table format. Because eCl@ss has over 25000 roots, its
root summary shows only a selected set of roots that have more interesting data.
Table 1 shows the root summary for the UNSPSC ontology. It is arranged in
descending order of the total number of classes under each root class. Only the top 13 roots are
shown due to space limitations. For all root classes there is a uniform maximum and
minimum depth of 3. The root classes have all leaves at the same level and the
maximum width occurs at the maximum depth, i.e., it is equivalent to the number of
leaves for the root class. The minimum width varies but it always occurs at depth 1,
i.e., the first level down from the root. The four root classes with the greatest number
of classes and leaves are "Drugs Pharmaceutical_Products", "Chemicals including
Bio Chemicals and Gas Materials", "Laboratory and Measuring and Observing and
Testing Equipment", and "Structures and Building and Construction and
Manufacturing Components and Supplies".
Fig. 3. The Result Panel showing Hub concept report
Table 2 displays the root summary for several of the *_tax roots of the eCl@ss
ontology. Note that the maximum depth for all root classes is 4 and the minimum
depth is 1. Unlike UNSPSC, the length of every path for each root class in the
ontology is not identical, since a variation exists in the average depth. The maximum
width occurs not at the greatest depth but at depth 3 for all roots. Like UNSPSC, the
minimum width varies but always occurs at depth 1 for each root. Looking at the
ratio of the total number of leaves to the total number of classes, UNSPSC has a much larger
percentage of leaf classes for its roots than eCl@ss.
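The depth, width, and leaf figures in these root summaries can be computed with a breadth-first walk over the subclass hierarchy. The following is a minimal sketch over a toy taxonomy, not the actual UNSPSC or eCl@ss data:

```python
from collections import deque

def root_summary(children, root):
    """Breadth-first walk computing per-root metrics: total classes
    (including the root), leaf count, max/min leaf depth, and the
    maximum width (number of classes) over all levels."""
    width = {}          # level -> number of classes at that level
    leaf_depths = []
    total = 0
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        total += 1
        width[depth] = width.get(depth, 0) + 1
        kids = children.get(node, [])
        if not kids:
            leaf_depths.append(depth)
        for kid in kids:
            queue.append((kid, depth + 1))
    return {
        "total_classes": total,
        "leaves": len(leaf_depths),
        "max_depth": max(leaf_depths),
        "min_depth": min(leaf_depths),
        "max_width": max(width.values()),
    }

# Toy taxonomy: root -> {a, b}; a -> {a1, a2}; b -> {b1}
taxonomy = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
summary = root_summary(taxonomy, "root")
```

Whether the root itself counts toward "total classes" is a convention; the tables above are consistent either way since all roots share the same structure.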
Table 1. UNSPSC Root Summary
Table 2. eCl@ss Root Summary

Concept Name    | Total Classes | Total Leaf | Max Depth | Min Depth | Avg Depth | Max Width | Level @ max Width | Min Width | Level @ min Width | Avg Width
C_AAG961003-tax | 10623 | 5038 | 4 | 1 | 3.94 | 5292 | 3 | 20 | 1 | 2655.75
C_AAB572002-tax | 5317  | 2181 | 4 | 1 | 3.8  | 2624 | 3 | 35 | 1 | 1329.25
C_AAB072002-tax | 3983  | 1669 | 4 | 1 | 3.82 | 1973 | 3 | 19 | 1 | 995.75
C_AAD302002-tax | 3585  | 1317 | 4 | 1 | 3.71 | 1756 | 3 | 37 | 1 | 896.25
C_AAF876003-tax | 2927  | 1315 | 4 | 1 | 3.88 | 1444 | 3 | 20 | 1 | 731.75
C_AAC473002-tax | 2653  | 1186 | 4 | 1 | 3.88 | 1320 | 3 | 7  | 1 | 663.25
C_AAC350002-tax | 2431  | 1024 | 4 | 1 | 3.82 | 1192 | 3 | 24 | 1 | 607.75
C_AAB315002-tax | 2127  | 850  | 4 | 1 | 3.77 | 1041 | 3 | 23 | 1 | 531.75
C_AAA183002-tax | 2065  | 832  | 4 | 1 | 3.79 | 1019 | 3 | 14 | 1 | 516.25
C_AAA862002-tax | 1927  | 739  | 4 | 1 | 3.73 | 932  | 3 | 32 | 1 | 481.75
C_AAA647002-tax | 1603  | 589  | 4 | 1 | 3.68 | 763  | 3 | 39 | 1 | 400.75
C_AAD111002-tax | 1519  | 580  | 4 | 1 | 3.74 | 750  | 3 | 10 | 1 | 379.75
C_AAF397003-tax | 1451  | 499  | 4 | 1 | 3.62 | 680  | 3 | 46 | 1 | 362.75
C_AAT090003-tax | 1239  | 445  | 4 | 1 | 3.64 | 577  | 3 | 43 | 1 | 309.75
C_AAD025002-tax | 1041  | 318  | 4 | 1 | 3.57 | 502  | 3 | 19 | 1 | 260.25
C_AAW154003-tax | 1007  | 417  | 4 | 1 | 3.79 | 490  | 3 | 14 | 1 | 251.75
C_AKJ313002-tax | 977   | 403  | 4 | 1 | 3.79 | 477  | 3 | 12 | 1 | 244.25
C_AAD640002-tax | 879   | 329  | 4 | 1 | 3.7  | 420  | 3 | 20 | 1 | 219.75
C_AKK397002-tax | 863   | 286  | 4 | 1 | 3.62 | 416  | 3 | 16 | 1 | 215.75
C_AAC286002-tax | 701   | 253  | 4 | 1 | 3.68 | 339  | 3 | 12 | 1 | 175.25
C_AAN560003-tax | 515   | 214  | 4 | 1 | 3.8  | 253  | 3 | 5  | 1 | 128.75
C_AKJ644002-tax | 509   | 121  | 4 | 1 | 3.41 | 242  | 3 | 13 | 1 | 127.25
C_AAE587002-tax | 493   | 189  | 4 | 1 | 3.73 | 240  | 3 | 7  | 1 | 123.25
C_AAD170002-tax | 451   | 175  | 4 | 1 | 3.74 | 221  | 3 | 5  | 1 | 112.75
C_AAC168002-tax | 405   | 131  | 4 | 1 | 3.58 | 191  | 3 | 12 | 1 | 101.25
6 Summary and Conclusions
OntoCAT provides a comprehensive set of metrics for use by the ontology consumer.
It may be used to assist in ontology evaluation for re-use or regularly during ontology
development and throughout the ontology's lifecycle to record a history of the
changes in both the intensional and extensional ontology. It includes, either directly or
as components, many of the OntoQA metrics. It differs from AKTiveRank, which uses
query concepts to rank ontologies. OntoCAT could be used to further analyze the top
ranked ontologies produced by AKTiveRank. Numerous ontologies from varying
domains (WordNet, UMLS, UNSPSC, and eCl@ss) have been analyzed with OntoCAT.
Here the results for the two PSCS ontologies have been reported. The metrics
identified and implemented as plug-in software for Protégé are the most
comprehensive set of metrics currently available in a tool for both kinds of ontologies.
The tool still needs more capabilities to summarize the metrics both in intuitive terms
and visually for the user. Another useful feature would be producing analysis based
on query terms to provide a context on which to calculate more detailed metrics
reflecting topic coverage. The structural types of metrics proposed in [7] that do not
already exist in OntoCAT are to be investigated further for inclusion.
References
1. Alani, H. and Brewster, C. Ontology Ranking based on the Analysis of Concept Structures,
International Conference On Knowledge Capture, Alberta, Canada (2005)
2. Alani, H. and Brewster, C. Metrics for Ranking Ontologies, Fourth International
Evaluation of Ontologies for the Web Workshop (EON 2006), Edinburgh, UK, May (2006).
3. Brank, J., Grobelnik, M., and Mladenic, D. A survey of ontology evaluation techniques. In
Proceedings of the 8th Int. Multi-Conference Information Society IS-2005, 2005.
4. Corcho, O., Fernandez-Lopez, M., Gomez-Perez, A., Methodologies, tools and languages
for building ontologies, Data & Knowledge Engineering 46, 41–64 (2003)
5. Cross, V. and Pal, A., Ontology Metrics, 2005 North American Fuzzy Information
Processing Society, Ann Arbor Michigan, July (2005)
6. Ding, L., T. Finin, A. Joshi, R. Pan, R. S. Cost, Y. Peng, P. Reddivari, V. C. Doshi, and J.
Sachs. Swoogle: A semantic web search and metadata engine. In Proc. 13th ACM Conf. on
Information and Knowledge Management, Nov. (2004)
7. Gangemi, A., Catenacci, C., Massimiliano, C. and Lehmann, J., Ontology Evaluation and
Validation: An integrated formal model for the quality diagnostic task, On-line:
http://www.loa-cnr.it/Files/OntoEval4OntoDev_Final.pdf (2005).
8. Gomez-Perez, A. Evaluating Ontology Evaluation, in Why evaluate ontology technologies?
Because it works!, Intelligent Systems, IEEE, Volume 19, Issue 4, Jul-Aug 2004.
9. Guarino, N. and Welty, C. Evaluating ontological decisions with OntoClean,
Communications of the ACM, Volume 45, Number 2, February (2002)
10. Hepp, M., Leukel, J., and Schmitz, V. A Quantitative Analysis of eClass, UNSPSC, eOTD,
and RNTD: Content, Coverage, and Maintenance, IEEE International Conference on
eBusiness Engineering (ICEBE'05) pp. 572-581 2005.
11. Kalfoglou, Y., Evaluating ontologies during deployment in applications, position statement,
OntoWeb 2 meeting, 07-12-2002.
12. Lozano-Tello, A.and Gómez-Pérez, A., ONTOMETRIC: A Method to Choose the
Appropriate Ontology, Journal of Database Management, 15(2), 1-18, April-June 2004.
13. Miller, G., WordNet: a lexical database for English, Comm. of the ACM 38 (11), 39–41 (1995)
14. Noy, N., Evaluation by Ontology Consumers, in Why Evaluate ontology technologies?
Because it works!, IEEE Intelligent Systems, July/August (2004)
15. Obrst, L., Hughes, T., and Ray, S. Prospects and Possibilities for Ontology Evaluation:
The View from NCOR, Fourth International Evaluation of Ontologies for the Web
Workshop (EON 2006), Edinburgh, UK, May (2006).
16. Pal, A. An Ontology Analysis Tool For Consumers, Masters Thesis, Miami University,
Oxford, OH May (2006).
17. Poulin J.S. Measuring Software Reuse: Principles, Practices, and Economic Models.
Addison Wesley Longman, (1997)
18. Sabou, M., Lopez, V., Motta, E. and Uren, V. Ontology Selection: Evaluation on the Real
Semantic Web, Fourth International Evaluation of Ontologies for the Web Workshop (EON
2006), Edinburgh, UK, May (2006).
19. Seco, N.,Veale, T. and Hayes, J. An Intrinsic Information Content Metric for Semantic
Similarity in WordNet, ECAI 2004, 1089-1090 (2004)
20. Tartir, S., Arpinar, I.B., Moore, M., Sheth, A.P. and Aleman-Meza, B. OntoQA: Metric-Based
Ontology Quality Analysis, IEEE Workshop on Knowledge Acquisition from
Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources,
Houston, TX, USA, November 2005.
21. Zhang, Y., W. Vasconcelos, and D. Sleeman. OntoSearch: An ontology search engine. In
Proc. 24th SGAI Int. Conf. on Innovative Techniques and Applications of Artificial
Intelligence, Cambridge, UK (2004).
Improving the recruitment process through ontology-based
querying
Malgorzata Mochol
Freie Universität Berlin
Institut für Informatik
AG Netzbasierte Informationssysteme
Takustr. 9, 14195 Berlin
Germany
mochol@inf.fu-berlin.de
Holger Wache
Vrije Universiteit Amsterdam
Artificial Intelligence Department
de Boelelaan 1081a, 1081HV Amsterdam
The Netherlands
holger@cs.vu.nl
Lyndon Nixon
Freie Universität Berlin
Institut für Informatik
AG Netzbasierte Informationssysteme
Takustr. 9, 14195 Berlin
Germany
nixon@inf.fu-berlin.de
http://nbi.inf.fu-berlin.de
Abstract: While using semantic data can enable improved retrieval of suitable jobs
or applicants in a recruitment process, cases of inconsistent or overly specific queries
which would return no results still have to be dealt with. In this paper the extension of
a semantic job portal with a novel query relaxation technique is presented which is able
to return results even in such cases. Subsymbolic methods estimate the (quantitative)
similarity between job or applicant descriptions. Symbolic approaches allow a more
intuitive way to formulate and handle preferences and domain knowledge, but due to
their partial preference order they cannot in practice rank all results as subsymbolic
approaches do. In this paper we propose a query relaxation method which combines both
methods. This method demonstrates that, with data based on formal ontologies,
retrieval can also be improved in a user-friendly way.
1
Introduction
Human resources management, like many business transactions, is increasingly taking
place on the Internet. In Germany 90% of human resource managers rated the Internet
as an important recruitment channel [Ke05] and over half of all personnel recruitment is
the result of online job postings [Mo03]. Although job portals are an increasingly important source for job applicants and recruitment managers, they still exhibit shortcomings
in retrieval and precision, as the stored job offers are in syntactic formats, i.e. searches
are subject to the ambiguities of natural language, and job descriptions and characteristics
lack relations to similar or interdependent concepts. In particular, queries which are over-
specified or inconsistent return no matches while relevant job offers could actually still be
found if the consistency or specificity problem were to be resolved. Extending job offers in
a portal with semantics can enable improved search and precision based on the use of ontologies and semantic matching. A further extension of such a semantics-based job portal
with novel query relaxation techniques additionally solves the problem of inconsistent and
overspecified querying, which can be expected to occur in a real world setting.
In this paper we report on such results from the German national project Wissensnetze
(Knowledge Nets1 ) and European Network of Excellence Knowledge Web2 . Having identified the importance of online recruitment processes for businesses today, Section 1.1 will
introduce requirements that arise when seeking to match appropriate applicants to vacancies or vice versa. Wissensnetze is working together with a German job portal provider
to develop an innovative new approach to online recruitment based on the use of semantically annotated job and applicant descriptions. In Section 2 we introduce the web-based
prototype of a semantic job portal that has been developed in this project. To meet the
identified requirements, we introduce two semantic-based approaches which rely on the
formal logical basis of ontology languages. Semantic matching techniques (Section 3) are
applied to determine the level of similarity between a job and an applicant. However, we
note that this technique alone can not meet real world requirements, which include the
possibility of overly specific or inconsistent queries. Hence, in a co-operation supported
by Knowledge Web, we are extending the prototype with state of the art query relaxation
techniques (Section 4). As a result we are in a position to evaluate both the emerging semantic technique as well as the real world viability of the semantic job portal (Section 5).
In general, we are able to demonstrate the value of ontology-based approaches to a typical
business activity as well as the added value of the use of a formal logic based approach in
that new semantic techniques are easily applicable to Semantic Web systems to solve open
requirements.
1.1
Requirements
Currently, a large number of online job portals divide the online labour market into information islands, making it close to impossible for a job seeker to get an overview of all
relevant open positions. In spite of the large number of portals, employers still publish their
openings on a rather small number of portals assuming that a job seeker will visit multiple
portals while searching for open positions. Alternatively, companies can publish job postings on their own website [Mü00]. This way of publishing, however, makes it difficult for
job portals to gather and integrate job postings into their database. Furthermore, the quality of search results depends not only on the search and index methods applied but also on
the processability of the web technologies used and the quality of the automatic interpretation of the company-specific terms occurring in the job descriptions. The deficiencies of
a website’s machine processability result from the inability of current web technologies,
such as HTML, to semantically annotate the content of a given website.
1 http://wissensnetze.ag-nbi.de
2 http://knowledgeweb.semanticweb.org
Therefore, computers can easily display the content of an HTML site, but they lack the ability to interpret
the content properly.
In our opinion using Semantic Web technologies in the domain of online recruitment can
overcome the problems of distribution, heterogeneity and machine non-processability of
existing job information and substantially increase market transparency, lower transaction costs and speed up the procurement process for businesses. For this reason, in Wissensnetze we have developed job portal which is based on Semantic Web technologies as
a basis for exploring the potential of the Semantic Web from a business and a technical
viewpoint by examining the effects of the deployment of Semantic Web technologies for
particular application scenarios and market sectors. Every scenario includes a technological component which makes use of the prospected availability of semantic technologies in
a perspective of several years and a deployment component assuming the availability of the
required information in machine-readable form. The combination of these two projections
allows us, on the one hand, to build e-business scenarios for analysis and experimentations
and, on the other hand, to make statements about the implications of the new technology
on the participants of the scenario in the current early stage of development.
In the first version of the semantic job portal we concentrated on the modelling of the human resource ontology and the development of a matching approach for comparing applicant profiles and job openings, with a focus on skills, occupations and industry sector
descriptions [BHM+ 05]. The further specification of the complete job portal includes the
comparison of applicants and openings considering not only skills and their
levels but also the professional experience of a job seeker in relation to the requirements of
the hiring company. We want to express not only the duration of a particular experience
(3 years of Java programming) but also to deliver those job applications which may not
fit the defined requirements 100% but are still acceptable for the employer (3 years
instead of 5 years of industry experience). Furthermore, to verify the consistency of the job
opening descriptions we also have to avoid the definition of nonsensical requirements like
job postings which demand only very young (under 25) yet highly qualified people (with
at least 10 years work experience). Following this, we need an additional method which
starts checking the data with the strongest possible query that is supposed to return the
“best” answers satisfying most of the given conditions and then weaken the query if the
returned result set is either empty or contains unsatisfactory results.
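This weakening strategy can be sketched as a loop that tries queries in decreasing order of strictness and stops at the first non-empty result set; the constraint representation below is an illustrative assumption, not the portal's implementation:

```python
def relaxed_search(candidates, constraints):
    """Try the full constraint set first; if no candidate satisfies it,
    drop the least important constraint (the last one) and retry, so the
    strongest satisfiable query wins."""
    for cut in range(len(constraints) + 1):
        active = constraints[:len(constraints) - cut]
        hits = [c for c in candidates if all(check(c) for check in active)]
        if hits:
            return hits, len(active)  # results plus how many constraints held
    return [], 0

applicants = [
    {"degree": "CS", "java_years": 3, "age": 27},
    {"degree": "CS", "java_years": 5, "age": 31},
]
# Ordered from hardest requirement (kept longest) to weakest preference.
constraints = [
    lambda a: a["degree"] == "CS",
    lambda a: a["java_years"] >= 5,
    lambda a: a["age"] <= 25,
]
hits, kept = relaxed_search(applicants, constraints)
# The age preference is relaxed; the applicant with 5 years of Java remains.
```

Because the relaxation is explicit, the system can also report which constraints were dropped, addressing the explainability concern discussed later.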
Since we have been working very closely with a German job search engine, we were able
to define several exemplary use cases which focus on the definition of such requirements
and issues. From a practical point of view, the use cases represent the kind of queries
which occur in the real world; from a scientific point of view, they are challenges to
the techniques which we want to apply.
When implementing a (Semantic Web) job portal, the requirements of the system depend
on meaningful use cases which are derived by the industrial project partner from its
day-to-day business practices within the HR domain. To clarify the still outstanding problems, we briefly present one such use case which, at first view, seems quite
simple. However, if we look closer and try to represent the data in an ontology or satisfy
the requirements in the semantic portal, we meet some difficulties which at the same
time show the complexity of such "simple" queries.
We are looking for a person who:
1. has a degree in computer science,
2. wants to work in software consulting and development,
3. is an expert in C, Java, PHP, UML, .Net and WindowsNT,
4. has worked for at least 5 years in an industrial and 5 years in a research
project,
5. should have experience as a project or team manager,
6. should not be older than 25.
This example serves as a guideline and a running thread through the rest of the article.
2
Semantic Job Portal
In a Semantic Web-based recruitment scenario the data exchange between employers, applicants and job portals is based on a set of vocabularies which provide shared terms to
describe occupations, industrial sectors and job skills [MPB06]. In this context, the first
step towards the realization of the Semantic Web e-Recruitment scenario was the creation
of a human resources ontology (HR-ontology). The ontology was intended to be used in
a job portal by allowing not only a uniform representation of job postings and job seeker
profiles but first of all the semantic matching (a technique that combines annotations using
controlled vocabularies with background knowledge about a certain application domain)
in job seeking and procurement tasks. Another important requirement was the development
of the Semantic Web-based job portal with respect to user needs, taking into
consideration practices already common in the industry. According to this specification we focused
on how vocabularies can be derived from standards already in use within the recruitment
domain and how the data integration infrastructure can be coupled with existing non-RDF
human-resource systems.
In the process of ontology building we first identified the sub-domains of the application
setting (skills, types of professions, etc.) and several useful knowledge sources covering them (approx. 25) [BMW04]. As candidate ontologies we selected some of the most
relevant classifications in the area, deployed by federal agencies or statistic organizations:
German Profession Reference Number Classification (BKZ), Standard Occupational Classification (SOC), German Classification of Industrial Sector (WZ2003), North American
Industry Classification System (NAICS), German version of the Human Resources XML
(HR-BA-XML) and Skill Ontology developed by the KOWIEN Project[SBA03]. These
sources are represented in different formats and languages with various levels of the formality (textual descriptions, XML-schemes, DAML) and cover different domains at different precision levels. Since these knowledge sources were defined in different languages
(English/German) we first generated (depending on the language) lists of concept names.
Except for the KOWIEN ontology, additional ontological primitives were not supported
by the candidate sources. In order to reduce the computation effort required to compare
and merge similar concept names we identified the sources which had to be completely
integrated into the target ontology. For the remaining sources we identified several thematic
clusters for further similarity computations. For instance, BKZ was directly integrated into
the final ontology, while the KOWIEN skill ontology was the subject of additional customization. To have an appropriate vocabulary for a core skill ontology we compiled a small
conceptual vocabulary (15 concepts) from various job portals and job procurement Web
sites and matched them against the comprehensive KOWIEN vocabulary. Next, the relationships extracted from KOWIEN and various job portals were evaluated by HR experts
and inserted into the target skill sub-ontology. The resulting conceptual model was translated mostly manually to OWL (since except for KOWIEN the knowledge sources were
not formalized using a Semantic Web representation language) [PBM05].
We are still evaluating and refining the preliminary HR-ontology for the purpose of
further development and to calculate the costs of reusing existing sources. About 15% of
the total engineering time was spent on source gathering and about 30% was spent on
customizing the selected source ontologies. Several ontologies have been fully integrated
into the resulting ontology, while KOWIEN and the XML-based sources required additional customization effort. Although the entire integration took up over 45% of the total
engineering time, reusing classification schemes like BKZ or WZ2003, which did not
require any customization effort, definitely resulted in significant cost savings while guaranteeing a comprehensive conceptualization of occupations and industrial sectors, respectively. The last phase of the building, refinement and evaluation process will require 10%
of the overall time. The aggregation of knowledge from different domains and the evaluation of available, large-sized ontologies were tedious and time-consuming. Nevertheless, the
benefits of using standard classification systems in this application setting outweigh the
costs of ontology reuse. The reuse process could be significantly optimized in terms
of costs and quality of the outcomes if the necessary technical support were provided.
Having modelled the HR-ontology and prepared the RDF repository to store applicant
profiles and job descriptions, we developed the matching engine, which as the core component of the system plays a crucial role in the procurement process. The function of the
matching engine is the focus of the following chapter.
3
Semantic matching
Semantic matching is a technique which combines annotations using controlled vocabularies with background knowledge about a certain application domain. In our prototypical
implementation, the domain specific knowledge is represented by concept hierarchies like
skills, skill level classification, occupational classification, and a taxonomy of industrial
sectors. Having this background knowledge of the recruitment domain (i.e. formal definition of various concepts and specification of the relationships between these concepts)
represented in a machine-understandable format allows us to compare job descriptions and
applicant profiles based on their semantic similarity [PC95] instead of merely relying on
the containment of keywords like most of the contemporary search engines do.
In our HR-scenario, our matching approach3 utilizes metadata of job postings and candidate profiles and as the matching result, a ranked list of best candidates for a given job
position (and vice versa) is generated.
In both a job posting and an applicant profile we group pieces of information
into “thematic clusters”, e.g. information about skills, information regarding industry
sector and occupation category, and finally job position details like salary information,
travel requirement, etc. Each thematic cluster from a job posting is to be compared with
the corresponding cluster from an applicant profile (and the other way round). The total similarity between a candidate profile and a job description is then calculated as the
average of the cluster similarities. The cluster similarity itself is computed based on the
similarities of semantically corresponding concepts from a job description and an applicant profile. The taxonomic similarity between two concepts is determined by the distance between them which reflects their respective positions in the concept hierarchy.
Following this, the distance d between two given concepts in a hierarchy e.g. .Net
and DCOM (cf. Fig. 1) represents the path from one concept to the other over the closest common parent. The semantic differences between upper level concepts are bigger than those between concepts on lower hierarchy levels (in other words, two general
concepts like object oriented programming languages and imperative
procedural programming languages are less similar than two specialized ones like
C# and Java), and the distance between siblings is greater than the distance between parent and child (d(Java,C#) > d(Java,PureObjectOrientedLanguages)).
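The path-over-closest-common-parent distance can be sketched as follows; this is a toy hierarchy, and the sketch captures only the raw path length, not the depth-dependent scaling described above:

```python
def distance(parent, a, b):
    """Edge-count distance between two concepts: the path from each
    concept up to their closest common ancestor."""
    def ancestors(c):
        chain = [c]
        while c in parent:
            c = parent[c]
            chain.append(c)
        return chain
    up_a, up_b = ancestors(a), ancestors(b)
    common = next(c for c in up_a if c in up_b)  # closest common parent
    return up_a.index(common) + up_b.index(common)

# Toy fragment of a programming-language taxonomy (illustrative only).
parent = {
    "Java": "PureObjectOrientedLanguages",
    "C#": "PureObjectOrientedLanguages",
    "PureObjectOrientedLanguages": "ObjectOrientedLanguages",
    ".Net": "ObjectOrientedLanguages",
}
```

With this toy data, d(Java, C#) = 2 while d(Java, PureObjectOrientedLanguages) = 1, matching the sibling-versus-parent property above.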
Since we also provide means for specifying competence levels (e.g. expert or beginner)
in applicants profiles as well as job postings we compare these levels in order to find the
best match. Furthermore, our approach also gives employers the opportunity to specify
the importance of different job requirements. The concept similarity is then weighted
accordingly, i.e. the similarity between more important concepts, like the skills
crucial for a given job position, will have greater influence on the similarity between a
job position posting and an applicant profile.
Given the example from Section 1.1, we can apply the developed semantic matching engine to compare the requirements defined within a job opening with the job applicant
descriptions (and, the other way round, the applicant profiles with the job descriptions). The results of the comparisons are presented in the form of a ranked list where
each applicant profile can be separately viewed and considered in detail (cf. Fig. 1).
The approach described above allows comparisons of job openings with applicant profiles
based on verifying occupation and industry descriptions, skills and their competence levels
as well as some general information like salary and job type. Hence, the prototype can
satisfy the first three points of the specification from the above-mentioned job description
(cf. Sec. 1.1), but we are not able to deliver an answer to the other requirements, especially
those concerning the minimal required experience in different projects or experience as a team
manager. To tackle this problem we have decided to extend our prototype by applying
not only the semantic matching technique but also query relaxation methods to
compare the job description with applicants.
3 For further information about the matching framework SemMF used here, see [OB05]
Figure 1: Matching result
4
Query relaxation
The previous approach basically uses a similarity measure which calculates the similarity
between the job posting and the candidate profile. Such a function f(jp, cp) -> [0..1] directly
provides a ranking between the results, because answers which are more similar can be
ranked higher. In order to ensure that the job portal ranks certain answers higher than
others, similarity measures can normally be biased in such a way that weights w_i are attached
to some parts of the calculation, i.e. f(p, r) = sum_{i=1}^{n} w_i * f_i(p, r). With such weights users
also can ensure that the system will respect their preferences during the ranking.
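A minimal sketch of such a weighted combination follows; the cluster similarity functions and weights are illustrative assumptions, not SemMF's actual measures:

```python
def weighted_similarity(posting, profile, parts):
    """Combine per-cluster similarity functions f_i with weights w_i,
    normalizing so the overall score stays in [0, 1]."""
    total_w = sum(w for w, _ in parts)
    return sum(w * f(posting, profile) for w, f in parts) / total_w

# Two toy cluster similarities: skill overlap and exact sector match.
skill_sim = lambda p, r: len(set(p["skills"]) & set(r["skills"])) / len(p["skills"])
sector_sim = lambda p, r: 1.0 if p["sector"] == r["sector"] else 0.0

posting = {"skills": ["Java", "UML"], "sector": "IT"}
profile = {"skills": ["Java", "PHP"], "sector": "IT"}
# Skills weighted twice as heavily as the industry sector.
score = weighted_similarity(posting, profile, [(2.0, skill_sim), (1.0, sector_sim)])
```

Raising a weight w_i makes the corresponding cluster dominate the ranking, which is exactly how user preferences enter the calculation.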
However, similarity functions also have their drawbacks. Like all subsymbolic approaches,
similarity functions do not explain how the job posting and the request differ. They only
return a value like 0.78 but do not provide any information on how the job posting and
the candidate profile differ in detail. For example, they cannot explain that the candidate
has only three years of experience instead of the requested five years.
Furthermore, the similarity function is also not able to explain the differences between
the answers. Another answer with a very close similarity value, say 0.79, may give the
impression of a similarly good candidate, but it may differ in an entirely different thematic
cluster, e.g. the candidate has no experience in leading a project. The difference with this
answer is not obvious and is not explained. The similarity function suggests a ranking
but in fact the result is an unordered listing; a grouping of semantically similar answers would improve acceptance and usability for the user.
On the other hand, the user can directly specify how to relax the request, for example: "if nobody has 5 years of industrial experience, then I will also accept 3 years of experience". Furthermore, the system can explain how a set of returned answers relates to the original query, e.g. that the following answers have 3 rather than the requested 5 years of experience (cf. Section 1.1).
Such preferences can be specified in symbolic approaches. Advanced methods like [BBD+04] also allow conditional preferences, where a preference depends on some other decisions. However, most of these approaches implicitly assume a flat data structure, e.g. a set of variables [BBD+04]. Here, in contrast, we need the full expressive power of an advanced object-centred representation of job descriptions and candidate profiles, which may be difficult to encode in such symbolic approaches.
In the following we describe an approach that uses rewriting rules to capture preference and domain knowledge explicitly, and we show how this knowledge is used to relax the original query into a set of approximated queries. Rewriting rules are able to work on complex data structures. We propose an approach for query rewriting based on conditional rewriting rules to solve the problem of incorporating domain knowledge and user preferences when matching similar job requests and openings. This rewriting relaxes the over-constrained query based on rules, in an order defined by some conditions. This has the advantage that we start with the strongest possible query, which is supposed to return the "best" answers satisfying most of the conditions. If the returned result set is either empty or contains unsatisfactory results, the query is modified, i.e. relaxed, by replacing or deleting further parts of it. The relaxation should be a continuous, step-by-step, (semi-)automatic process, providing the user with the possibility to interrupt further relaxations.
Query rewriting with rewriting rules helps to incorporate domain knowledge and user preferences into the semantic matching in an appropriate way. It returns a set of rewritten queries. However, the result of each rewritten query may be an unordered set of answers. How to order or rank them? For this part we can fall back on the similarity function with its implicitly encoded knowledge and rank the answers of each rewritten query. To summarize: query rewriting provides a high-level relaxation, including a grouping of the results according to the domain knowledge and the user preferences, while the similarity function provides a kind of fine tuning when the results within one group are ranked.
Before we investigate concrete relaxation strategies in the context of our example domain, we first briefly give a general definition of the framework for rewriting an RDF query.
4.1 Formal definition
The RDF data model foresees sets of statements in the form of triples [Ha04]. In [DSW06], Dolog et al. proposed a rule-based query rewriting framework for RDF queries, independent of a particular query language, which we summarize here. The framework is based on the notion of triple patterns (RDF statements that may contain variables) as the
-66/73-
Proceedings of SEBIZ 2006
______________________________
______________________________
______________________________
basic element of an RDF query and represents RDF queries in terms of three sets:
• triple patterns that must be matched by the result (mandatory patterns),
• triple patterns that may be matched by the result (optional patterns), and
• conditions in terms of constraints on the possible assignments of variables in the query patterns.
More precisely, Dolog et al. define a (generic) RDF query as a tuple of these three sets.
Definition 1 RDF Query
Let T be a set of terms, V a set of variables, RN a set of relation names, and PN a set of predicate names. The set of possible triple patterns TR is defined as TR ⊆ (T ∪ V) × (RN ∪ V) × (T ∪ V). A query Q is defined as the tuple ⟨M_Q, O_Q, P_Q⟩ with M_Q, O_Q ⊆ TR and P_Q ⊆ P, where M_Q is the set of mandatory patterns (patterns that have to be matched by the result), O_Q is the set of optional patterns (patterns that contribute to the result but do not necessarily have to match it), and P is the set of predicates with names from PN, defined over T and V.
A result set of such an RDF query is a set of substitutions. Formally, a substitution τ is a list of pairs (X_i, T_i), where each pair states that variable X_i is to be replaced by T_i ∈ T ∪ V. A ground substitution replaces each variable X_i by a term and not by another variable, i.e. T_i ∈ T for all i. The (ground) substitution τ replaces variables in M_Q and O_Q with appropriate terms. If τ(M_Q) is equal to some ground triples, then the substitution is valid. All valid ground substitutions for M_Q, plus existing ground substitutions for O_Q, constitute answers to the query. Additionally, the predicates P_Q restrict these substitutions: only those bindings are valid answers for which the predicates, i.e. τ(P_Q), are also satisfied. The predicates thus additionally constrain the selection of appropriate triples.
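As an illustration of these definitions, the following Python sketch matches mandatory triple patterns against ground triples and collects the valid ground substitutions. The variable syntax ("?x") and the example data are our own illustrative choices, not part of the framework in [DSW06]:

```python
# Triple patterns are 3-tuples over terms and variables; a substitution
# is a dict mapping variables to terms (Definition 1 and its answers).
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def match_triple(pattern, triple, subst):
    """Extend subst so that pattern equals triple, or return None."""
    subst = dict(subst)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if p in subst and subst[p] != t:
                return None          # variable already bound differently
            subst[p] = t
        elif p != t:
            return None              # constant terms disagree
    return subst

def answers(mandatory, triples):
    """All ground substitutions making every mandatory pattern match."""
    results = [{}]
    for pat in mandatory:
        results = [s2 for s in results for t in triples
                   if (s2 := match_triple(pat, t, s)) is not None]
    return results

data = [("anna", "hasExperience", "Java"),
        ("anna", "hasDuration", "3"),
        ("bob", "hasExperience", "PHP")]
q = [("?x", "hasExperience", "Java")]
# answers(q, data) -> [{"?x": "anna"}]
```

Optional patterns and predicates would filter and extend these substitutions in the same style; they are omitted here for brevity.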
Rewritings of such queries are described by transformation rules Q →_R Q^R, where Q is the original and Q^R the rewritten query generated by applying R. A rewriting rule consists of three parts:
• a matching pattern, represented by an RDF query in the sense of the description above,
• a replacement pattern, also represented by an RDF query in the above sense, and
• a set of conditions in terms of special predicates that restrict the applicability of the rule by restricting the possible assignments of variables in the matching and the replacement pattern.
Based on the abstract definition of an RDF query, we can now define the notion of a rewriting rule. We define rewriting in terms of rewriting rules that take parts of a query, in particular triple patterns and conditions, as input (PA) and replace them by different elements (RE).
Definition 2 Rewriting Rule
A rewriting rule R is a 3-tuple ⟨PA, RE, CN⟩ where PA and RE are RDF queries according to Definition 1 and CN is a set of predicates.
For conditions, the same constructs as for queries are used, where the possible results are also constrained by predicates. Patterns and replacements formally have the same structure as queries: they also consist of a set of triples and predicates. But a pattern normally does not address a complete query, only a subpart of it; normally the subpart addresses some triples as well as some predicates in the query. In order to write more generic rewriting rules, the pattern must be instantiated, which is done by a substitution.
Definition 3 Pattern Matching
A pattern PA of a rewriting rule R is applicable to a query Q = ⟨M_Q, O_Q, P_Q⟩ if there are subsets M′_Q ⊆ M_Q, O′_Q ⊆ O_Q and P′_Q ⊆ P_Q and a substitution θ with ⟨M′_Q, O′_Q, P′_Q⟩ = θ(PA).
In contrast to term rewriting systems [BN98], the definition of a query as sets of triples and predicates simplifies the pattern matching. The identification of the right subpart of the query for the pattern match is simplified by the use of sets: only a subset of both sets has to be determined which is syntactically equal to the instantiated pattern. Note that, due to the set semantics, the triples and predicates in the pattern may be distributed over the query.
A rewriting is now performed in the following way: if the matching pattern matches a given query Q, in the sense that the mandatory and optional patterns as well as the conditions of the matching pattern are subsets of the corresponding parts of Q, then these subsets are removed from Q and replaced by the corresponding parts of the replacement pattern. The application of R is only allowed if the predicates in the conditions of R are satisfied for some variable values in the matching and the replacement pattern.
Definition 4 Query Rewriting
If a rewriting rule R = ⟨PA, RE, CN⟩
• matches a query Q = ⟨M_Q, O_Q, P_Q⟩ with subsets M′_Q ⊆ M_Q, O′_Q ⊆ O_Q and P′_Q ⊆ P_Q under substitution θ, and
• θ(CN) is satisfied,
then the rewritten query Q^R = ⟨M^R_Q, O^R_Q, P^R_Q⟩ can be constructed with M^R_Q = (M_Q \ M′_Q) ∪ θ(M_RE), O^R_Q = (O_Q \ O′_Q) ∪ θ(O_RE) and P^R_Q = (P_Q \ P′_Q) ∪ θ(P_RE), where RE = ⟨M_RE, O_RE, P_RE⟩.
The former definition clarifies formally how to generate a rewritten query Q^R from Q with the help of R, i.e. Q →_R Q^R. We denote by Q_R the set of all queries that can be generated from Q with the rules R ∈ R. To each query in Q_R the rewriting rules can be applied again; we denote by Q_R* the set of all queries that can be generated by applying the rules in R successively.
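A minimal Python sketch of a single rewriting step in the spirit of Definition 4, simplified to ground rules (the substitution θ is omitted); the queries, triples, and rule below are hypothetical:

```python
# A query is a tuple (mandatory, optional, predicates) of frozensets;
# a rule is (PA, RE, condition), where PA and RE have query shape and
# condition is a predicate on the whole query. Variable substitution
# is omitted for brevity, so patterns must match syntactically.
def apply_rule(query, rule):
    """Return the rewritten query Q^R, or None if PA does not match Q."""
    (m_q, o_q, p_q) = query
    (m_pa, o_pa, p_pa), (m_re, o_re, p_re), cond = rule
    if not (m_pa <= m_q and o_pa <= o_q and p_pa <= p_q and cond(query)):
        return None
    # Remove the matched subsets and add the replacement parts.
    return ((m_q - m_pa) | m_re, (o_q - o_pa) | o_re, (p_q - p_pa) | p_re)

# Relax the skill "Java" to "PureObjectOrientedLanguages":
java = ("?x", "hasExperience", "Java")
pool = ("?x", "hasExperience", "PureObjectOrientedLanguages")
rule = ((frozenset([java]), frozenset(), frozenset()),
        (frozenset([pool]), frozenset(), frozenset()),
        lambda q: True)
q = (frozenset([java, ("?x", "rdf:type", "Person")]),
     frozenset(), frozenset())
q_r = apply_rule(q, rule)
# q_r's mandatory part now contains the relaxed triple instead of "Java"
```

Because queries are sets, the matched triples may sit anywhere in the query, mirroring the set semantics noted after Definition 3.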
4.2 Application in the job portal
Each job request and opening is annotated with an RDF description, i.e. a set of triples. A query over these job openings is formulated as triple patterns and a set of conditions that restrict the possible variable bindings in the patterns. Each triple pattern represents a set of triples. The corresponding abstract definition of a query focuses on the essential features of queries over RDF.
To clarify the approach, we take example 1 from Section 1: someone who has experience in C, Java, PHP, UML, .Net and WindowsNT. Looking for such a person requires the system to translate this free-text description into an instance retrieval problem. The query must be translated into a concept expression; the retrieval process will then return all job seekers that belong to that concept expression, i.e. satisfy all the requirements in it. The following OWL⁴ expression shows the concept expression for a person who has experience in some member of the intersection (the intersectionOf property) of the OWL classes C, Java, PHP and UML⁵.
<owl:Class rdf:ID="Query">
  <rdfs:subClassOf>
    <owl:Class rdf:ID="Person"/>
  </rdfs:subClassOf>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:someValuesFrom>
        <owl:Class>
          <owl:intersectionOf rdf:parseType="Collection">
            <owl:Class rdf:about="C"/>
            <owl:Class rdf:about="Java"/>
            <owl:Class rdf:about="PHP"/>
            <owl:Class rdf:about="UML"/>
          </owl:intersectionOf>
        </owl:Class>
      </owl:someValuesFrom>
      <owl:onProperty>
        <owl:ObjectProperty rdf:ID="hasExperience"/>
      </owl:onProperty>
    </owl:Restriction>
  </rdfs:subClassOf>
  ...
</owl:Class>
In the following we give some examples of rewriting rules that use the aforementioned example as a basis.
⁴OWL is an extension of RDF allowing for more expressive features than RDF, such as number restrictions.
⁵Originally we modelled these as nominals (enumerations, like Week = {Monday, Tuesday, ...}). Nominals are instances and classes at the same time. However, current DL systems have problems with nominals; therefore we use classes in the current approach.
A very simple rewriting rule takes into account a required skill, e.g. Java. It relaxes the requirements on the experiences: instead of Java, the class PureObjectOrientedLanguages or even ObjectOrientedLanguages would be a possible weakening of the original query:⁶
pattern(<owl:Class rdf:about="Java"/>)
==>
replace(<owl:Class rdf:about="PureObjectOrientedLanguages"/>)
&& true.
This means that whenever the term representing the Java language appears anywhere in a query, it can be replaced by the more general term representing pure object-oriented languages, of which Java is one.
Making use of the predicates, we can generalize the previous rewriting rule and generate generic rules that are guided by information from the ontology. The predicate subsumed, for example, is satisfied when X is more specific than Y. With the following rewriting rule we are able to take the knowledge in the ontology into account.
pattern(<owl:Class rdf:about="X"/>)
==>
replace(<owl:Class rdf:about="Y"/>)
&& subsumed(X,Y).
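The subsumed(X, Y) condition can be illustrated with a toy class hierarchy; in practice a DL reasoner would answer such subsumption queries, and the hierarchy below is invented for illustration:

```python
# Hypothetical skill-ontology fragment: child class -> parent class.
HIERARCHY = {
    "Java": "PureObjectOrientedLanguages",
    "PureObjectOrientedLanguages": "ObjectOrientedLanguages",
    "ObjectOrientedLanguages": "ProgrammingLanguages",
}

def subsumed(x, y):
    """True if class x is (transitively) more specific than class y."""
    while x in HIERARCHY:
        x = HIERARCHY[x]      # climb one level up the hierarchy
        if x == y:
            return True
    return False

# subsumed("Java", "ObjectOrientedLanguages") -> True
# subsumed("ObjectOrientedLanguages", "Java") -> False
```

A generic rule guarded by this predicate can then replace any class X in the query by any of its superclasses Y, relaxing the query one generalization step at a time.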
In the same way, number restrictions can be relaxed. In our example, the requirement that a person has experience in a five-year industrial project is encoded with the help of the (artificial) class FiveYearsOrMore, which represents all numbers of years larger than or equal to five. This class can be replaced by the class TwoYearsOrMore, which is obviously more general (weaker) than the former. Furthermore, we can restrict the replacement so that it is only allowed for the restriction on the property hasDuration. The corresponding rewriting rule looks like this:
pattern(<owl:Restriction>
<owl:onProperty rdf:resource="#hasDuration"/>
<owl:someValuesFrom>
<owl:Class rdf:ID="FiveYearsOrMore"/>
</owl:someValuesFrom>
</owl:Restriction>)
==>
replace(<owl:Restriction>
<owl:onProperty rdf:resource="#hasDuration"/>
<owl:someValuesFrom>
<owl:Class rdf:ID="TwoYearsOrMore"/>
</owl:someValuesFrom>
</owl:Restriction>)
&& true.
⁶For the sake of readability the examples are simplified.
The main problem of the rewriting approach to query relaxation is the definition of an appropriate control structure that determines in which order the individual rewriting rules are applied to generate new queries, in other words, how to explore Q_R*. Different strategies can be applied when multiple rewritings of a given query are possible; one example is a divide-and-conquer strategy, in which the best results of each possible combination of rewritings are returned. In the current version of the system we have implemented a simple version with similarities to skylining [KK02, LL87], which is well known in database query relaxation.
In particular, we interpret the problem of finding relaxed queries as a classical search problem with small adaptations. The search space is defined by the set Q_R* of all possible queries. Each application of a rewriting rule R to a query Q is a possible action, denoted as Q →_R Q^R. A query represents a goal state in the search space if it has answers. In the current implementation we use breadth-first search to explore this search space. Different from classical search, however, the method does not stop when a goal state is reached: every goal state has to be determined, because each goal state represents one sequence of successful rewritings for which answers are provided. However, a goal state need not be explored further (i.e. no further rewritings must be applied to it), but each search branch has to be closed by a goal state (or by a predefined depth). Breadth-first search ensures that each goal state represents the best solution to the relaxation problem with respect to a certain combination of rewritings. The goal states form a "skyline" for the rewriting problem, and each of them is returned to the user together with the query answers.
The second difference to classical search is that we do not allow the same rule to be applied more than once with the same parameters in each branch of the search tree. The only kind of rules that can in principle be applied twice are rules that add something to the query (rules that delete or replace parts of the query disable themselves by removing the parts they need to match against). Applying the same query-extending rule twice leads to an unwanted duplication of conditions in the query, which does not change the query result but only increases the complexity of query answering.
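The search loop described above might be sketched as follows; the query representation, the rules, and the has_answers test are toy stand-ins, not the portal's actual components:

```python
from collections import deque

# Breadth-first relaxation: expand queries level by level, stop each
# branch at the first query that has answers (a goal state), and never
# apply the same rule twice in one branch.
def relax(query, rules, has_answers, max_depth=3):
    goals = []
    frontier = deque([(query, frozenset())])  # (query, names of rules used)
    while frontier:
        q, used = frontier.popleft()
        if has_answers(q):
            goals.append(q)          # goal state: do not expand further
            continue
        if len(used) >= max_depth:
            continue                 # branch closed by depth limit
        for name, rewrite in rules:
            if name in used:
                continue             # each rule at most once per branch
            q2 = rewrite(q)
            if q2 is not None:
                frontier.append((q2, used | {name}))
    return goals

# Toy example: a query is a frozenset of required skills, and a
# candidate answers it if they offer a superset of those skills.
db = [frozenset({"PHP"}), frozenset({"Java", "UML"})]
def has_answers(q):
    return any(q <= cand for cand in db)
def drop(skill):
    return lambda q: q - {skill} if skill in q else None
rules = [("drop-C", drop("C")), ("drop-.Net", drop(".Net"))]
goals = relax(frozenset({"Java", "UML", "C"}), rules, has_answers)
# goals == [frozenset({"Java", "UML"})]
```

Each returned goal corresponds to one minimal combination of relaxations that yields answers, i.e. one point on the "skyline"; the similarity function from Section 3 would then rank the answers within each goal.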
5 Conclusions and Future Work
We have shown the use of semantic techniques in a prototype job portal in which both job offers and applicants are described according to ontologies. While preparing the ontologies from non-ontological sources (classification schemes) was time-consuming and tedious, doing so enables us to use ontology-based querying approaches to match relevant job applicants to vacancies and vice versa, as well as to rank them in terms of similarity. Furthermore, we have shown that semantic matching alone does not allow levels of similarity to be differentiated, or inconsistent and overly specific queries to be resolved. Hence we introduce another technique, query relaxation, in which queries can be rewritten to allow similarity to be ranked in different directions (e.g. in terms of subjects the applicant has experience in, or the total number of years of experience he or she
has) or to widen the scope of the query in order to find matches (e.g. by searching on a
superclass of the class searched for, or by making cardinalities less restrictive).
At present, the semantic job portal demonstrates improved precision and recall in semantic matching, finding relevant job offers or applicants which would not be selected by a syntactic (text-based) query algorithm. However, we have noted that this alone does not resolve the more complex queries which can be expected in the real world. The European Network of Excellence Knowledge Web supports the co-operation of leading Semantic Web researchers with selected industry partners that have a real-world business problem to solve. One of the first steps taken in the network was to collect industrial use cases where semantic technologies could form a potential solution, and to derive from those use cases industry requirements for Semantic Web research [LPNS05]. The recruitment scenario was one of the industrial use cases provided and was identified by Semantic Web researchers as an ideal real-world use case for testing their results in query relaxation techniques. Within the network, the job portal is now being extended to support the rule rewriting approach described in this paper.
Our intention is to test the extended prototype against the original prototype (which supports only semantic matching) using a set of benchmark queries. The HR ontology has been extended for this purpose with the property of experience, and instances using this new property have been added to the semantic data used by the job portal. The rule rewriting tool has been implemented, and an interface to the tool has been specified which is more general than the DIG interface used by most Semantic Web reasoners (in order not to limit ourselves to Description Logics). We have also defined the first concrete technical details of rule rewriting. We plan to carry out the first tests at the beginning of 2007. This will provide us with a valuable real-world test case to analyse the value of query relaxation techniques as an extension to semantic matching in ontology-based systems, in order to solve real-world problems of inconsistent or overly specific queries. Performance will also be measured, as another expected advantage of query relaxation is more robust and efficient querying of large knowledge bases which can scale to real-world enterprise size.
Further research issues include determining general guidelines for the specification of rewriting rules and a generic framework for working with those rules. In combination, we believe this work not only demonstrates the use and value of Semantic Web techniques in a real-world industrial test case; it also indicates that any assessment of the cost of shifting to an ontology-based approach must take into account the value to be gained from the availability of semantic techniques that becomes possible when system data is based on formal logic through an ontology. In this paper we have introduced two such techniques, semantic matching and query relaxation, which are shown to be of value to the online recruitment process.
Acknowledgement. The work described in this paper is supported by the EU Network of Excellence KnowledgeWeb (FP6-507482) and the Knowledge Nets project, which is part of the InterVal Berlin Research Centre for the Internet Economy funded by the German Ministry of Research (BMBF).
References
[BBD+04] Boutilier, C., Brafman, R. I., Domshlak, C., Hoos, H., and Poole, D.: CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements. Journal of AI Research, 21:135-191, 2004.
[BHM+05] Bizer, C., Heese, R., Mochol, M., Oldakowski, R., Tolksdorf, R., and Eckstein, R.: The Impact of Semantic Web Technologies on Job Recruitment Processes. In: International Conference Wirtschaftsinformatik (WI'05), 2005.
[BMW04] Bizer, C., Mochol, M., and Westphal, D.: Recruitment, report, April 2004.
[BN98] Baader, F. and Nipkow, T.: Term Rewriting and All That. Cambridge University Press, New York, NY, USA, 1998.
[DSW06] Dolog, P., Stuckenschmidt, H., and Wache, H.: Robust query processing for personalized information access on the semantic web. In: 7th International Conference on Flexible Query Answering Systems (FQAS 2006), number 4027 in LNCS/LNAI, Milan, Italy, June 2006. Springer.
[Ha04] Hayes, P.: RDF Semantics. W3C Recommendation, 2004.
[Ke05] Keim, T. et al.: Recruiting Trends 2005. Working Paper No. 2005-22, efinance Institut, Johann-Wolfgang-Goethe-Universität Frankfurt am Main, 2005.
[KK02] Kießling, W. and Köstler, G.: Preference SQL: design, implementation, experiences. In: VLDB 2002, Proceedings of the 28th International Conference on Very Large Data Bases, August 20-23, 2002, Hong Kong, China, pp. 990-1001. Morgan Kaufmann, 2002.
[LL87] Lacroix, M. and Lavency, P.: Preferences: putting more knowledge into queries. In: Stocker, P. M., Kent, W., and Hammersley, P. (eds.), VLDB'87, Proceedings of the 13th International Conference on Very Large Data Bases, September 1-4, 1987, Brighton, England, pp. 217-225. Morgan Kaufmann, 1987.
[LPNS05] Leger, A., Paulus, F., Nixon, L., and Shvaiko, P.: Towards a successful transfer of knowledge-based technology to European Industry. In: Proceedings of the 1st Workshop on Formal Ontologies Meet Industry (FOMI 2005), 2005.
[Mü00] Mülder, W.: Personalinformationssysteme - Entwicklungsstand, Funktionalität und Trends. Wirtschaftsinformatik, Special Issue IT Personal, 42:98-106, 2000.
[Mo03] Monster: Monster Deutschland and TMP Worldwide: Recruiting Trends 2004. In: 2. Fachsymposium für Personalverantwortliche, Institut für Wirtschaftsinformatik der Johann Wolfgang Goethe-Universität Frankfurt am Main, 2003.
[MPB06] Mochol, M. and Paslaru Bontas, E.: Practical Guidelines for Building Semantic eRecruitment Applications. In: International Conference on Knowledge Management, Special Track: Advanced Semantic Technologies (AST'06), 2006.
[OB05] Oldakowski, R. and Bizer, C.: SemMF: A Framework for Calculating Semantic Similarity of Objects Represented as RDF Graphs. Poster at the 4th International Semantic Web Conference (ISWC 2005), 2005.
[PBM05] Paslaru Bontas, E. and Mochol, M.: Towards a reuse-oriented methodology for ontology engineering. In: Proc. of the 7th International Conference on Terminology and Knowledge Engineering (TKE 2005), 2005.
[PC95] Poole, J. and Campbell, J.: A Novel Algorithm for Matching Conceptual and Related Graphs. Conceptual Structures: Applications, Implementation and Theory, 954:293-307, 1995.
[SBA03] Sowa, F., Bremen, A., and Apke, S.: Entwicklung der Kompetenz-Ontologie für die Deutsche Montan Technologie GmbH. http://www.kowien.uni-essen.de/workshop/DMT\ 01102003.pdf, 2003.