2012 19th Working Conference on Reverse Engineering

Reconstructing Architectural Views from Legacy Systems

Ghizlane El Boussaidi, Alvine Boaye Belle
Department of Software and IT Engineering
École de technologie supérieure
Montreal, Canada

Stéphane Vaucher
Benchmark Consulting
Montreal, Canada

Hafedh Mili
Department of Computer Science
Université du Québec à Montréal
Montreal, Canada

Abstract—Modernizing a large legacy system is a demanding and costly process which requires a deep understanding of the system's architecture and its components. However, legacy systems are poorly documented and they have often undergone many changes that make them deviate from their initial architectural design. Approaches for reconstructing architectural views from legacy systems and re-documenting the resulting components are therefore of great value in the context of a modernization process. In this paper, we propose an approach that helps construct distinct architectural views from legacy systems. To do so, we propose various clustering algorithms which are driven by common architectural views and styles. Our approach makes use of the Knowledge Discovery Metamodel, which provides a standard, machine-independent representation of legacy systems. We implemented and applied the approach in an industrial setting. Preliminary experiments have shown that the algorithms perform well and produce comprehensible views.

Keywords—legacy system modernization; architecture reconstruction; architectural views; software clustering.

I. INTRODUCTION

Modernizing a legacy system is a demanding and costly task. It is largely motivated by the fact that the system is no longer able to efficiently support the business goals of a company [1]. Understanding the legacy system is mandatory to all the modernization disciplines, and it usually involves constructing various representations of the system that support its comprehension. Depending on the modernization goals, various other disciplines are considered as part of this process, including assessment of the legacy system's architecture, its restructuring and refactoring, and its transformation to generate the target system [2].

When the modernization process involves an architectural transformation of the system, we first need to reconstruct the system's software architecture. When designing these architectures, an architect relies on a set of idiomatic patterns commonly named architectural styles, which describe families of systems [3]; for example, the commonly used layered style. However, many researchers have observed that the as-built architecture does not conform to the initial style that guided its design (e.g., [4, 5, 6]). This is mainly due to: 1) the conceptual gap between the abstract elements that define a style and the concrete source code constructs that implement the system [5]; 2) violations of the style constraints due to their misinterpretation; and 3) the continuous and cumulative changes undergone by the system, which increase its complexity and lead to a deviation from its initial design [1, 6]. Furthermore, the as-built architecture is often insufficiently and inaccurately documented [4]. Hence, a software architecture reconstruction process is required to reconstruct and document the architecture of existing systems before initiating any modernization actions.

Many approaches have been proposed to support architecture recovery, relying on various techniques and producing different tools that support them [4]. The technique used generally depends on the way the system's data is represented. In [22], techniques were classified into three automation levels: quasi-manual, semi-automatic and quasi-automatic. In the context of large and complex legacy systems, we need a quasi-automatic technique that alleviates the burden of reconstructing these systems' architectures. One such technique is clustering, which is commonly used to reconstruct architectures [5, 10, 11, 12, 13, 15, 17, 19, 20, 21]. However, these approaches target specific languages and systems and do not use a standard representation of the system's data. As a consequence, the resulting tools do not interoperate with each other and can hardly be used together in a modernization process [2]. Another known problem related to clustering-based techniques is the selection of the appropriate entities and properties/relations to be used in the clustering. Most approaches base this selection on the system under analysis and its data, and do not enable building architectural views such as the ones commonly used in a forward architectural construction process (e.g., a layered view [16]).

1095-1350/12 $25.00 © 2012 IEEE
DOI 10.1109/WCRE.2012.44

The OMG's architecture-driven modernization (ADM) standards [7] were introduced to tackle some of these challenges. Indeed, ADM consists of a number of standards whose purposes are mainly to harmonize IT and business modernization goals while enabling interoperability between supporting tools. The Knowledge Discovery Metamodel (KDM) is one of these standards. The KDM defines a meta-model for representing, at various levels of abstraction, all aspects of existing legacy systems [7]. This meta-model provides a common interchange format to ensure interoperability between tools that support modernization. Although the KDM specifies concepts and relations to describe
software architecture of legacy systems, it does not specify or
suggest a way of inferring these high-level representations from
low-level data that were extracted from the system.
Software architecture is usually described using a set of
complementary views. The concept of architectural view is
described in [8] as the fundamental organizing principle for
documenting architecture and it is considered as the result of
applying a style to a system. For example, the layered style
yields a module decomposition that is a static/structural view of
the system; a pipe and filter style yields a dynamic view based
on the data flow between components; and a client-server style
yields a dynamic view based on the control flow (request/reply
interactions). In this paper, we rely on this mapping between
views and styles to recover relevant views from a legacy
system.
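This mapping between styles and the views they yield can be made explicit as a simple configuration. The following minimal Python sketch is purely illustrative: the dictionary entries and relation names are our assumption of how such a mapping could be recorded, not something defined by the paper or by KDM.

```python
# Hypothetical style-to-view mapping (names are illustrative only):
# each style yields a view type and the kinds of relations to analyze.
STYLE_TO_VIEW = {
    "layered": ("static/module view", ["imports", "includes", "depends-on"]),
    "pipe-and-filter": ("dynamic view (data flow)", ["reads", "writes"]),
    "client-server": ("dynamic view (control flow)", ["calls"]),
}

def relations_for(style):
    """Return the relations to extract when recovering the view for a style."""
    view, relations = STYLE_TO_VIEW[style]
    return relations
```

A recovery tool could consult such a table to decide which facts to extract before clustering.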
To address the issues mentioned above, we propose an
approach that makes use of the KDM standard to reconstruct
and document software architectural views of the legacy
system. We consider an architectural view to be a way of
partitioning a system using a specific set of KDM relevant
concepts and relations and we propose clustering algorithms
that target specific views mainly a layered view that we call
horizontal view and a feature-based view that we call vertical
view. The proposed approach has two main advantages: 1) the
approach is language and platform independent; and 2) the
proposed architecture reconstruction process and the supporting
algorithms are driven by targeted architectural views and
styles. It is worth pointing out that at this stage of our research,
our aim is to propose and implement relevant clustering
algorithms and heuristics that target a common set of
architectural views. The approach was implemented as a plugin
within a commercial tool and applied to large industrial legacy
systems. The preliminary results have shown that the approach
considerably reduces the clustering time compared to similar
approaches and it yields comprehensible views.
The paper is organized as follows. We introduce the basic concepts related to our approach in Section 2. Section 3 gives an overview of our approach and details its steps. In Section 4 we describe our view-based clustering algorithms. Section 5 presents and discusses the experimentation results. Related work is discussed in Section 6, and we conclude and outline some future work in Section 7.

II. BACKGROUND

In this section, we start by briefly introducing the concepts relevant to our approach, including software architecture, architectural styles and views. We also give an overview of the KDM specification and meta-model that we used to implement our approach.

A. Software Architecture and Architectural Views

There are many definitions of software architecture in the literature (e.g., [3, 8, 9]). However, there is a consensus on the following: 1) architecture represents a judicious partitioning of the system into parts, with specific relations among these parts [8]; and 2) architecture aims at satisfying a set of functional requirements and quality attributes [3, 9]. Software architecture is commonly defined as a set of components and connectors (i.e., interactions between components). When designing software architectures, an architect relies on a set of idiomatic patterns commonly named architectural styles or patterns. An architectural style determines the vocabulary of components and connectors that can be used in instances of that style, together with a set of constraints on how they can be combined. Many common architectural styles are described in [3, 8, 16]. Examples of such styles include the layered, pipes and filters, client-server and service-oriented styles; each of these styles has its own vocabulary and constraints and promotes specific quality attributes. In practice, software systems are built by combining and composing these styles.

B. The Knowledge Discovery Metamodel

In ADM, modernization is driven by architecture, to underline the need to reconstruct and restructure the architecture of the system to be modernized. A number of standards are listed under the ADM umbrella. These standards aim at synchronizing the IT and business aspects inherent to modernization and at enabling interoperability between modernization tools. One of these standards is the Knowledge Discovery Metamodel (KDM).

The KDM defines a meta-model for representing existing software assets, their associations, and their operational environments [7]. KDM can represent various applications, platforms, and programming languages. Following the separation-of-concerns principle, KDM defines several domains, each of which corresponds to a particular architectural viewpoint. Each domain is defined by a KDM package that gathers the meta-model elements representing aspects related to this domain. The KDM specification defines four layers (Figure 1). The KDM Infrastructure Layer defines common meta-model elements which constitute the infrastructure for the other packages. The Program Elements Layer contains two packages: Code and Action. Together, these packages describe the code models of software systems, i.e., they represent the knowledge explicitly expressed in the source code. The Code package focuses on named units of implementation and their structural relationships, while the Action package focuses on units of behavior and control flow relationships.

The Runtime Resource and Abstractions Layers represent higher-level views of the existing system. Most of the information required by these views is implicit in the source code, and recovering it usually requires analysis techniques applied to low-level representations, combined with input from experts and stakeholders.

Figure 1. KDM Layers and packages; extracted from [7]

A particular package we are interested in is the Structure package, which represents the architectural organization of the existing software system. Our goal is to make use of KDM models extracted at lower levels (code model, data model, etc.) and construct different views of the system at hand that will feed the structure model.
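Before any clustering, the low-level facts selected from the KDM models can be abstracted into a weighted dependency graph. The sketch below is a minimal illustration, assuming the facts have already been extracted as (source, relation, target) triples; the triple format and the identifiers are our assumptions, not a KDM API.

```python
from collections import defaultdict

def build_mdg(facts, selected=("Calls", "Includes", "Reads", "Writes")):
    """facts: iterable of (source_unit, relation, target_unit) triples.
    Returns a weighted module dependency graph where the edge weight is
    the number of dependencies between two units; only the relations
    selected for the targeted view are kept."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, rel, dst in facts:
        if rel in selected:
            graph[src][dst] += 1
    return graph

# Illustrative COBOL-like facts: PGM1 calls PGM2 twice and copies CPY1.
facts = [("PGM1", "Calls", "PGM2"),
         ("PGM1", "Calls", "PGM2"),
         ("PGM1", "Includes", "CPY1"),
         ("PGM2", "Reads", "REC1")]
mdg = build_mdg(facts)
```

Restricting the relations at this stage is what makes the same extraction pipeline serve both the static and the dynamic views.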
III. OVERVIEW OF THE APPROACH FOR RECONSTRUCTING ARCHITECTURAL VIEWS

Our approach is based on the idea that we can establish a mapping between an architectural view and one or more specific algorithms analyzing an appropriate set of concepts and relationships. Figure 2 illustrates the most important steps of our approach. A preliminary step, "Extract KDM representation", is performed using a commercial tool that analyzes the software artifacts and generates KDM models at a low level of abstraction, e.g., KDM code, action and inventory models, which are instances of metamodel elements belonging to the Infrastructure and Program Elements layers. These KDM models are low-level models providing views of the implementation elements. Our approach relies on the existence of such low-level models, which it analyzes to produce high-level views and models.

A. Selecting Relevant KDM Entities and Relationships

We consider an architectural view to be a way of partitioning the system using a specific set of relevant concepts and relations. Hence, the first step of the framework aims at selecting the concepts and relationships that are most relevant for the targeted view. In this context, KDM offers a great deal of information; depending on the modernization targets, we have to choose the architectural views that will guide the analysis of the system at hand, and the appropriate concepts and relations accordingly. The selection of concepts and relations depends on the system at hand. For example, when analyzing a COBOL-based legacy system we will choose the KDM concepts representing a program (i.e., CompilationUnit), a procedure (i.e., CallableUnit) and copybooks (i.e., SharedUnit), and the KDM relations representing a COPY statement (i.e., Includes), CALL and PERFORM statements (i.e., Calls) and statements manipulating data records (i.e., Reads and Writes). Moreover, depending on the targeted view, we have to choose a coherent subset of concepts and relations to be included in the view.

For example, a static view, called the module view in [1] and [8], shows units of implementation and their structural relations. This view helps in understanding the system's functions and supports requirements traceability and impact analysis [8]. To build the static view of an object-oriented legacy system, we may select the KDM concepts representing classes, interfaces and packages, and the KDM relations representing extends, implements, imports, depends-on and includes relationships, as the elements to be included in this view. The dynamic view, called the component-and-connector view in [1] and [8], is mainly based on the control flow and the data flow between implementation units. Control flow and data flow elements are represented by concepts and relations from the KDM Action package. To build the dynamic view, we need to select appropriate computational objects, as represented by KDM concepts, and the appropriate KDM actions representing interactions between these concepts. For example, we may select MethodUnit and DataElement as concepts, and the Calls, Reads, Writes and Creates action relations that may relate these concepts. Figure 3 shows a simplified excerpt from the KDM metamodel where the grayed rectangles are the KDM concepts and relationships we use in the dynamic view, while the other concepts are those we use when building the static view.

B. Selecting a Relevant View and the Corresponding Clustering Algorithm

The second step of the framework aims at revealing the system's structure. To do so, we apply analysis techniques to aggregate the system's modules and abstract its actual architecture. To build different views of the analyzed system, we propose various pattern-driven clustering algorithms to decompose the system. This includes:
• A horizontal clustering that partitions the system into a set of clusters, each of which corresponds to a given layer in the sense of a layered architecture (e.g., UI layer, business layer, data management layer). The layered architecture is one of the commonly used module views; it helps in understanding the system and analyzing some of its important properties, such as portability, in the context of a modernization process. Indeed, the elements of the top-level layer may be analyzed to build a suitable modern front-end. This partitioning is also useful in the case of systems that have undergone many modifications, as it reveals the layers that were added on top of the initial system to support new business needs (e.g., a web layer).

• A vertical clustering that, as opposed to horizontal clustering, identifies functionally cohesive groups of modules. The resulting view is complementary to the one generated via the horizontal clustering, as we exploit computational objects and actions in this view. This clustering can be seen as a feature-based decomposition of the system at hand. It helps in understanding which part of the system supports which functions, and hence which business process. The results may prove useful when identifying candidate services during a SOA migration process.

• A hierarchical clustering whose purpose is to keep the size of the clusters at a manageable level, as suggested by [10]. In this case, a cluster may contain elements which are themselves clusters. This is mainly useful for scalability purposes. It also enables us to address the problem of finding disjoint clusters during the vertical clustering: usually, some modules (such as libraries and the modules that access data) are used by many other modules, which makes it difficult, and even meaningless, to partition the system into disjoint sets. In this context, we can apply the horizontal clustering to identify the layers of the system and then apply the vertical clustering to the topmost layers to identify independent high-level features of the system.

Figure 2. Overview of the steps of the approach

Our goal is to populate our framework with families of decomposition algorithms. The ones we have already implemented may be set so that the algorithm ignores (or not) library components (introduced as omnipresent modules in Muller et al.'s work [11]). Library components obscure the system's structure if they are considered during the decomposition of the system [11]. An alternative way of handling these components is to group them into one subsystem, which eases the recovery of the system's structure [10]. Section 4 describes the horizontal and vertical algorithms in detail.

Figure 3. KDM concepts and relationships exploited when building views

C. Refining the Resulting View

Software clustering researchers have recognized that a clustering algorithm can never generate a better partition of the system than the one produced by the system's experts [12]. Hence, the third step of the framework enables the user to modify and adjust the resulting view, and specifically the clusters which correspond to layers or components, depending on the selected clustering algorithm. To do so, a view resulting from the clustering process is displayed to stakeholders as a KDM structure model (see next subsection) whose elements can be dragged and dropped to move an element from one cluster to another. The ultimate goal here is to find meaningful components with regard to some common style (e.g., SOA, layers, etc.) to help with the modernization process. Involving stakeholders is mandatory for two main reasons: 1) their knowledge of the system can be used to improve the quality of the groupings by adding information that is not present in the code; and 2) they adhere better to a modernization process whose efforts are organized in iterations based on the components they helped to identify. The process for involving stakeholders is out of the scope of this paper.

D. Documenting the Resulting View

In the final step of the framework, the result of the clustering is documented using the KDM structure model, whose contents are defined in the Structure package (see Figure 1). This model represents the architectural organization of the existing software system. It may contain architectural elements that represent architectural views, layers or components. The semantics of these concepts is not defined in the KDM specification. The relations between architectural elements are specified using the KDM AggregatedRelationship concept, which represents a set of the primitive relationships between entities (transitively) owned by architectural elements. Figure 4 shows the KDM concepts we used for documenting the views.

Hence, we represent the views resulting from the clustering algorithms as ArchitecturalView objects, while the resulting clusters are represented as instances of the Layer concept in the case of the horizontal clustering and of the Component concept in the case of the vertical clustering. Relations between two clusters (layers or components) are represented using two instances of AggregatedRelationship: one instance for the outgoing relations (i.e., required services) and one instance for the incoming relations (i.e., provided services). We also attach to the generated architectural view and its layers or components a preliminary list of attributes that enables us to record some properties (e.g., in the case of an architectural view, we keep track of the parameters that were used to generate it, including the list of selected concepts and relations and the name of the clustering algorithm).

Figure 4. KDM concepts used when building architectural views

IV. BUILDING VIEWS USING CLUSTERING ALGORITHMS

Many approaches and tools have been developed to support module decomposition and clustering in the context of software engineering (e.g., [5, 10, 11, 13, 17, 19, 20, 21]). Our initial goal was to apply and adapt existing algorithms to our clustering needs, which are driven by the targeted views. However, the majority of the proposed software clustering algorithms use high-cohesion and low-coupling principles to identify the boundaries of the clusters [15]. This does not really apply when identifying the layers of a system. A layer may exhibit a set of independent features that are commonly used by the layer above; hence its cohesion may be low and its coupling with other layers may be high. Identifying layers therefore requires a different approach to grouping modules.

Regarding the vertical clustering, it aims at recognizing features of the system, which are generally implemented using cohesive and loosely coupled sets of modules. Hence, existing algorithms can be adapted to support this clustering. Since the goal was to apply our approach to large industrial systems, we decided to make use of approaches that rely on optimization to reduce the search space for an optimal decomposition of a software system. We attempted to use the Bunch modularization algorithm proposed in [13], which uses a family of search-based algorithms including hill-climbing and genetic algorithms. We focused on the hill-climbing algorithm because it performs well in the context of large systems and has been successfully used in several approaches [18]. The algorithm works in an iterative way. It starts with an initial partition, usually a randomly generated partition as in [13]. Modules are then moved between clusters to improve the partition according to some criterion. This criterion is based on maximizing a fitness function. In [13], a modularization quality (MQ) function is proposed as the fitness function; it aims at minimizing the coupling between the resulting clusters (i.e., the inter-relations) and maximizing their cohesion (i.e., the intra-relations). The algorithm stops when it cannot improve the criterion anymore.

Bunch [13] starts by creating a module dependency graph (MDG) from the source code. The MDG nodes represent source code entities (e.g., classes, functions) while the edges represent dependencies between these components (e.g., inheritance, function calls). An edge is weighted to specify the number of dependencies existing between two nodes. We adapted the approach in [13] to our vertical clustering in two ways. First, while Bunch aims at finding partitions that maximize cohesion and minimize coupling, our aim is to identify subsystems corresponding to the features supported by the system. Hence, our initial partition is not random; it exploits the root nodes of the MDG generated from the legacy system. Our intuition is that a root node (i.e., a node that has only outgoing edges) should correspond to an entry point to a subsystem. Second, we modified the way new partitions, called neighboring partitions, are inferred from the current partition.

The intuitions behind the horizontal and vertical clustering algorithms, and the heuristics we developed to support them, are explained in the following subsections.

A. Horizontal Clustering

This algorithm aims at partitioning the system into disjoint subsystems, each of which corresponds to a layer of the layered view. Although the layered view is widely used in software architecture, it is a poorly defined view [8]. In the layered architecture as described in [16], requests are sent from layer N to the lower-level layer N-1, and answers to these requests or notifications move in the opposite direction. Yet many implementations of the layered architecture may violate this strict layering principle.

In light of this, our horizontal partitioning starts by identifying the data-access elements (i.e., compilation units that create, read or write data through their control elements) and creates the lowermost layer containing these elements. In some legacy systems (e.g., batch sequential systems), the elements of the top layer (e.g., GUI modules) are more difficult to identify than in others (e.g., object-oriented systems). In the first case, the algorithm assigns a level number to each element/module of the system depending on the number of dependencies that must be crossed to get to an element of the lowermost layer. To do so, we also developed heuristics which are based on the fact that a layered view is not just a decomposition that reflects module dependencies; it also considers cohesion, reuse and portability. For example, when module x uses module y and module y is not in the lowermost layer, two situations can arise: 1) module x is the only one to use module y, in which case we put module y in the same layer as module x; or 2) module x is not the only one to use module y, in which case we put module x and module y in distinct successive layers. Another heuristic we introduce aims at resolving the level-numbering conflicts which result from violations of the strict layering principle. Practically, this heuristic assigns the minimum level number (or the maximum number, if layers are numbered from top to bottom) to a module when the module has been assigned different level numbers through distinct flows. The heuristic is illustrated in Figure 5, where the left part shows a module C that was assigned two
different level-numbers after analyzing two distinct
dependency flows that use the module. The result of applying
the heuristic is shown in the right part of the figure.
In the second case where user modules are easy to identify,
the algorithm simultaneously creates the lowest-level layer
containing data-access modules and the top-level layer
containing user modules. Then we assign a level to each
element/module of the system depending on its position in the
dependency flow starting from the top layer all the way across
to the lowermost layer. We apply the same heuristics as in the
first case. The modules are then grouped according to their
level number. This algorithm is a construction algorithm (as
defined by [17]) since it assigns modules to clusters in very few
passes.
Figure 5. Example of the application of the level-number minimization
(or maximization) heuristic
module B is used uniquely by identified omnipresent modules,
we consider module B as omnipresent itself.
V.
B. Vertical Clustering
We used the hill-climbing algorithm to our vertical
clustering. In particular, we adapted the approach in [13]. In
our context, we partition the system starting with a set of
clusters where all but two of the clusters, contain one
node/module corresponding to a root module of the MDG that
was built from the legacy system. Root nodes are nodes that
have only outgoing edges; they represent modules through
which a user or another system may interact with the legacy
system under analysis. One of the two remaining clusters
contains the set of omnipresent modules, or modules that are
used by an abnormally large number of other modules. These
can skew the results of the clustering. So we identify and
isolate them using some statistical analysis (e.g. outliers in a
box-plot) combined with a manual tuning. The other cluster
contains all the remaining modules, i.e., all modules except
roots and those in the omnipresent cluster.
EXPERIMENTATIONS WITH THE APPROACH
We implemented our approach as a plugin in the IRISTM
tool, the workbench developed by our industrial partner. We
conducted preliminary experiments on large industrial systems,
which are mainly batch sequential systems written in COBOL.
We illustrate the results of the application on an industrial
system and compare them to the results of applying the
approach in [13] to the same system. Due to confidentiality
agreements, we cannot describe the system in detail. The
system is written in COBOL that is mostly COBOL 85
compliant and is deployed on a HP mainframe. The whole
system is comprised of nearly 5000 programs and includes over
6000 copybooks (included COBOL code), for a total of over 6
million lines of code. In this section, we present the results of
our experimentation as well as threats to the validity of our
results.
A. Experiment
Our experiments were focused on subsystems for which we
had information on the specific functionalities they support.
For each of these subsystems, we applied our clustering
algorithms to yield the feature-based and layered views and we
validate them by the system’s experts. For the purpose of this
paper, we present the results of our clustering algorithms
applied to one of the small subsystems that contained five
distinct functionalities. Figure 6 shows a dependency graph of
this subsystem. This dependency graph is call-based, i.e. it only
uses compilation units and the calls relationships. The analysis
of this subsystem helped us assess the accuracy and the
usefulness of the views that are built using our approach.
In the following iterations, neighboring partitions are
created by moving modules to other clusters and the resulting
neighbors are evaluated using the MQ function as proposed in
[13]. However, in [13] a partition Y is considered as a neighbor
of partition X if Y is exactly the same as X except that a single
module of a cluster in partition Y is in a different cluster in
partition X; This basically means that a neighboring partition is
generated by randomly moving exactly a single module from
one cluster to another from the current partition. In our context,
it does not make sense to move a root module from one cluster
to another; this breaks the root-based intuition that directs our
vertical clustering. Further, ultimately the root module of a
cluster will be (transitively) related to all modules in the same
cluster. Hence, we generate a neighboring partition by
randomly moving exactly a single module M from one cluster
X to another cluster Y from the current partition if the module
M has a relationship with at least one module in the cluster Y.
Figure 7 shows the view resulting from applying the
vertical clustering to the subsystem of Figure 6. The view
displays five subsystems and a subsystem called “Libraries”
(i.e., the cluster containing omnipresent compilation units). The
five subsystems are consistent with the features that were
identified by the system’s experts and stakeholders. The main
issue that we observed is the identification of omnipresent
modules that may be too permissive when we rely on statistical
analysis (such as the boxplot) only. This problem is discussed
in [14], and a way to handle this it is to let the user manually
define a threshold for considering a module omnipresent.
Regarding the identification of omnipresent modules, we
developed and applied some heuristics. For example, when
identifying these modules we do not consider the weight of the
relations: if module A is the only user of module B, calling it
a hundred times does not qualify module B as omnipresent; on
the other hand, if module B is used once by each of a hundred
distinct modules, then module B is omnipresent.
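The distinct-client heuristic, combined with the boxplot-style upper fence and the optional user-defined threshold, might be sketched as follows. The function and parameter names are hypothetical, and the boxplot fence (Q3 + 1.5 × IQR) is one common formulation rather than the paper's exact rule:

```python
from statistics import quantiles

def omnipresent_modules(calls, user_threshold=None):
    """Flag omnipresent modules by the number of *distinct* clients,
    ignoring call weights: a module called 100 times by one client is
    not omnipresent; one called once by 100 clients is.
    `calls` is an iterable of (caller, callee) pairs."""
    fan_in = {}
    for caller, callee in set(calls):        # distinct client pairs only
        fan_in[callee] = fan_in.get(callee, 0) + 1

    if user_threshold is not None:           # manual override (cf. [14])
        cut = user_threshold
    else:                                    # boxplot upper fence
        q1, _, q3 = quantiles(fan_in.values(), n=4)
        cut = q3 + 1.5 * (q3 - q1)
    return {m for m, n in fan_in.items() if n > cut}
```

Deduplicating the call pairs before counting is what makes the heuristic insensitive to relation weights.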
To assess the usefulness of our clustering algorithm, we
implemented a version of the steepest ascent hill-climbing
(SAHC) algorithm as described in [13] and ran it on the same
system. Figure 8 shows a decomposition resulting from the
SAHC algorithm when applied to our subsystem. Although the
final MQ (i.e., the modularization quality function) of the
SAHC is better than the final MQ of our vertical clustering, the
decomposition view generated by the SAHC did not match the
functionalities described by the system’s owners. It is worth
noting that the decomposition generated by SAHC would have
improved had we better tuned the omnipresence threshold,
which shows that our vertical clustering is more robust to the
choice of this threshold. Moreover, the vertical algorithm performs much
better on large systems; this is mostly due to the way we
compute the neighboring partitions, which reduces the number
of neighbors to consider.
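For reference, the MQ function reused from Bunch [13] sums a cluster factor per cluster, rewarding intra-cluster edges (cohesion) and penalizing inter-cluster edges (coupling). A minimal sketch of the TurboMQ-style formulation, assuming a weighted directed module graph; the encoding of `edges` as a dict is an assumption:

```python
def mq(partition, edges):
    """TurboMQ-style modularization quality: the sum over clusters of
    CF_i = 2*mu_i / (2*mu_i + eps_i), where mu_i is the total weight of
    intra-cluster edges and eps_i the total weight of edges crossing the
    cluster boundary. `edges` maps (src, dst) -> weight."""
    total = 0.0
    for cluster in partition:
        mu = sum(w for (s, d), w in edges.items()
                 if s in cluster and d in cluster)
        eps = sum(w for (s, d), w in edges.items()
                  if (s in cluster) != (d in cluster))
        if mu:                      # CF is 0 for clusters with no internal edges
            total += 2 * mu / (2 * mu + eps)
    return total
```

Both our vertical clustering and the SAHC baseline evaluate candidate partitions with a function of this shape; they differ in the initial partition and in how neighbors are generated.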
Figure 6. The dependency graph of the subsystem under analysis
Figure 7. The view resulting from the vertical clustering
Figure 8. The decomposition generated by the SAHC algorithm
Figure 9. The layered view resulting from the horizontal clustering
Figure 10. A hierarchical view resulting from combining horizontal and vertical clusterings
B. Threats To Validity
Our preliminary experiments were performed on systems
written in COBOL. However, we used the KDM representation
of these systems, which is language- and platform-independent
and makes the approach applicable to other kinds of systems.
Nevertheless, we need to experiment with the approach on
other kinds of systems to assess and refine the
heuristics we developed. Another issue that we have to deal
with is the assessment of the results of the approach when
applied to a large legacy system: the results and observations of
our experiment were limited to parts of the system for which
we had information indicating expected functional groupings.
We hope to get additional tagged subsystems from the
stakeholders to increase the scope of our experimentation; this
would help us to generalize the results. Additionally, we would
like to reproduce the results on open systems, but there are few
representative systems containing a large number of
interconnected programs.
Figure 9 shows the layered view resulting from applying
the horizontal clustering to our subsystem example. Layer 1 (at
the bottom) contains one module which is a data access
module. However, Layer 5, which is the topmost layer,
contains many modules. From the stakeholders’ point of view,
this view is less comprehensible and intuitive than the feature-based
view. Hence, we applied the vertical clustering algorithm
to identify subsystems within Layer 5. The resulting view is
shown in Figure 10. Interestingly enough, Layer 5 comprises
five clusters that correspond to the five subsystems previously
identified in the vertical clustering. Hence, together layers 1 to
4 contain modules that were recognized as omnipresent in the
vertical clustering. This is consistent with the fact that layers at
the bottom in a layered architecture are common services used
by higher-level layers, which in turn exhibit a set of features
supported by the system. As a matter of fact, the vertical
clustering may be applied to each layer in case of larger
subsystems. Stakeholders found the view in Figure 10 more
comprehensible and useful to restructure and migrate the
system’s parts.
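The combination described above (layer the subsystem first, then split the large topmost layer into feature-based subsystems) can be sketched as follows. `hierarchical_view` is a hypothetical name, and `horizontal` and `vertical` are placeholders standing in for the two clustering algorithms of the paper:

```python
def hierarchical_view(modules, deps, horizontal, vertical):
    """Combine the two clusterings: compute the layered view first, then
    split the (typically large) topmost layer into feature-based
    subsystems. `deps` maps (src, dst) -> weight; `horizontal` returns a
    bottom-up list of layers, `vertical` a list of clusters."""
    layers = horizontal(modules, deps)            # e.g., [layer1, ..., layer5]
    top = layers[-1]
    sub_deps = {(s, d): w for (s, d), w in deps.items()
                if s in top and d in top}         # restrict graph to top layer
    layers[-1] = vertical(top, sub_deps)          # clusters within top layer
    return layers
```

For larger subsystems, the same refinement could be applied to each layer rather than only the topmost one, as noted above.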
VI. RELATED WORK
Several software architecture reconstruction approaches
were proposed in the literature (e.g., [4, 19, 10, 13, 14, 15, 17,
21, 25]). Most of these approaches follow a typical process that
generally includes three steps: 1) Identifying relevant concepts
and accordingly extracting information from the system under
analysis; 2) Constructing higher-level models using some
analysis techniques; and 3) Visualizing the resulting models.
Depending on their goals and the targeted systems, approaches
propose different methods or techniques to support the
reconstruction process.
Zhang et al. [25] propose a hybrid clustering algorithm that
exploits a weighted directed class graph (WDCG) extracted
from object-oriented systems; a WDCG includes both static
and dynamic information. However, some important issues
such as omnipresent modules are not tackled by these
approaches.
Our approach is based on view-driven (i.e., style
representation) clustering algorithms that make use of search-based algorithms while still trying to minimize coupling
between clusters and maximize their cohesion. Hence our
approach is most closely related to those of Mitchell et al.
[13, 14] and Tzerpos and Holt [10]. Mitchell et al. [13] propose
search-based clustering approach, which has been implemented
in the Bunch tool. We described the principles behind Bunch in
section IV. Our approach for the vertical clustering reuses
Bunch’s modularization quality (MQ) function. However, our
initial partition is driven by the views we are attempting to
build, while in Bunch the initial partition is randomly
generated. Additionally, we constrained the way neighboring
partitions are generated from the partition of the current
iteration to remain compatible with the idea of a vertical
clustering, while Bunch randomly generates these partitions
from the current one. Although Bunch enables the user to
specify the number of neighbors to explore in each iteration of
the algorithm (e.g., all: steepest ascent hill climbing (SAHC),
the first best: next ascent hill climbing (NAHC)), our approach
prudently reduces this number by relying on existing relations
between clusters. Our experimentation with our vertical
algorithm, the NAHC and the SAHC algorithms has shown that
the vertical clustering yields more comprehensible views and
performs better.
Both Riva [19] and Stoermer et al. [4] propose an
architecture reconstruction process. Riva [19] considers the
choice of architecturally significant concepts as the key to deliver
meaningful models to the architects, while the goal of Stoermer
et al. [4] is to extract information that will help in the analysis
of the quality attributes supported by systems. In [19] the
proposed process is iterative and incremental. Key concepts are
first identified, and a conceptual view is built. These concepts
are extracted from documentation and discussions with experts.
Then the source code is analyzed to produce a model that is
enriched with domain-specific knowledge, yielding a high-level
view of the system. Many of these activities rely on
existing documentation and the manual intervention of the
architects and experts, which are not always available in the
context of legacy systems. In [4], the proposed process links
architecture reconstruction to quality attribute-driven analysis.
In the first step of the process, the system to be analyzed and
the architecture views are identified. This identification
depends on the type of the system and the quality attributes to
be analyzed. In the second step, a source model is extracted
from the source code. Elements of this model are then
aggregated in the third step and the resulting aggregates are
assigned element types as specified by the targeted view. The
resulting views are given to the quality analysis framework,
which analyzes them. Although we adopt a similar process for
architecture reconstruction, the focus of this approach is on the
analysis of quality attributes supported by existing systems.
Furthermore, relationships between the views to be built and
supporting aggregation techniques are discussed in neither [4]
nor [19].
Tzerpos and Holt [10] propose an algorithm for
comprehension-driven clustering (ACDC) based on subsystem
patterns. Rather than using any modularity criterion to
decompose systems, ACDC relies on familiar patterns
observed in large systems to find clusters while keeping the
size of the clusters at a manageable level. The approach also
assigns meaningful names to the resulting clusters. ACDC
creates, first, a skeleton of the final decomposition using a
subsystem pattern. Then it assigns remaining modules (i.e.,
orphans) to existing subsystems using the Orphan Adoption
technique described in [23]. Proposed subsystem patterns
include the source file pattern, support library pattern and the
subgraph dominator pattern. For example, the source file
pattern is a basic pattern that creates a cluster grouping
procedures and variables contained in the same source file. The
support library pattern groups omnipresent modules into one
subsystem; we used this pattern in our vertical clustering where
the initial partition contains a subsystem grouping omnipresent
modules. The subgraph dominator pattern identifies a particular
subgraph in the system that contains a dominant node DN such
that there exists a path from DN to every node in the subgraph.
Interestingly, our vertical clustering can be seen as a particular
dominator pattern clustering where the dominant node is
constrained to be a root node.
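Read this way, the root-constrained variant amounts to collecting, for each root module, the subgraph it (transitively) reaches. A rough sketch, assuming dependencies encoded as a set of (source, target) pairs; the function name is illustrative:

```python
def root_dominated_subgraphs(deps):
    """For each root module (one with no incoming dependency), collect
    the set of modules reachable from it. This mirrors the subgraph
    dominator pattern with the dominant node constrained to be a root."""
    nodes = {n for e in deps for n in e}
    targets = {d for _, d in deps}
    roots = nodes - targets                  # modules nobody depends on

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:                         # depth-first traversal
            n = stack.pop()
            for s, d in deps:
                if s == n and d not in seen:
                    seen.add(d)
                    stack.append(d)
        return seen

    return {r: reachable(r) for r in roots}
```

Modules reachable from several roots are exactly the contested cases that the MQ-guided search must assign to one cluster or another.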
Many of the proposed architecture reconstruction
approaches use clustering including [10, 13, 14, 15, 17, 21, 25].
Various clustering-based approaches are discussed in [20, 24].
Many of the proposed approaches aim at finding a
clustering of the system that minimizes the coupling between
resulting components and maximizes the cohesion of each
component (e.g., [13, 14, 17, 21, 25]). Lung et al. [21] propose
a hierarchical approach based on a numerical taxonomy
clustering technique to classify software components. The
method starts by constructing a data matrix of components of
the system and their properties to be used for the clustering. A
resemblance coefficient is then computed for each pair of
components to yield a resemblance matrix. The resemblance
coefficient is based on coupling measures between
components. The resemblance matrix is used to group similar
components into clusters. The algorithm works in an iterative
way where, at each iteration, the two closest clusters are merged
and the resemblance matrix is updated consequently. The
algorithm stops when all clusters are exhausted or when a given
threshold is reached. This approach is very intuitive and can be
applied in both forward- and reverse-engineering processes. It
also performs well on large systems.
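The merge loop of such a resemblance-based algorithm can be sketched as follows. This is a generic agglomerative sketch, not Lung et al.'s implementation: single linkage is one plausible merge criterion, and the names are hypothetical (their resemblance coefficient is coupling-based).

```python
def agglomerate(items, resemblance, threshold):
    """Numerical-taxonomy-style clustering: repeatedly merge the two most
    similar clusters until no pair resembles more than `threshold`.
    `resemblance(x, y)` returns a similarity score for two items."""
    clusters = [{i} for i in items]

    def sim(a, b):                  # single linkage between clusters
        return max(resemblance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(sim(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        best, i, j = max(pairs)     # the two closest clusters
        if best < threshold:
            break                   # stopping criterion reached
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

A full implementation would update a resemblance matrix incrementally instead of recomputing pair similarities at each iteration.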
VII. CONCLUSION AND FUTURE WORK
The process of reconstructing meaningful architectural
models from legacy systems remains a difficult task and an
active research field in software engineering. This process is
mandatory in various contexts, spanning from the re-documentation
and understanding of existing systems to
their restructuring and migration. In this paper, we presented an
approach that reconstructs and documents the software
architecture of legacy systems using view-driven clustering
algorithms. Specifically, we proposed and implemented
clustering algorithms that target two specific views: a layered
view that we called horizontal view and a feature-based view
that we called vertical view. The approach is language and
platform independent as it relies on the KDM specification
standard for describing low-level models of legacy systems.
The approach was implemented as a plugin within the IRIS
modernization tool, and it has been applied to large industrial
legacy systems. Preliminary results have shown that the
algorithms perform well on large systems and the resulting
views are comprehensible and may be used to support
restructuring and migration processes.
While we continue to refine our view-driven algorithms, we
need to perform more experiments and analysis to evaluate and
refine the heuristics we used. We also need to establish a strict
mapping between the architectural views as described in [8] and
the KDM concepts and relations that can be included in a view.
As a matter of fact, the challenge in reconstructing architectural
views from legacy systems resides in the difficulty of
establishing a clear and precise mapping between the abstract
architectural elements as defined by architectural styles and the
concrete language-dependent constructs used in implementing
software systems. In the near future, we are planning to
experiment with our approach on open source systems (e.g., Mozilla
and Linux) so that we can compare the results with other
approaches. We also want to automatically analyze the KDM
AggregatedRelationship instances that relate the resulting clusters (i.e.,
layers in the horizontal clustering or subsystems in the vertical
clustering) to construct their interfaces.
In future work, we would like to explore other analysis
and classification techniques to build the architectural views.
We also intend to explore the use of domain-specific
knowledge to improve the resulting views,
although using domain-specific knowledge is known to confine
the scope of the approach [24].
REFERENCES
[1] R. C. Seacord, D. Plakosh, and G. A. Lewis, Modernizing Legacy Systems: Software Technologies, Engineering Process and Business Practices, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003.
[2] W. Ulrich and P. Newcomb, Information Systems Transformation: Architecture-Driven Modernization Case Studies, Morgan Kaufmann OMG Press, 2010.
[3] M. Shaw and D. Garlan, Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall, 1996.
[4] C. Stoermer, L. O'Brien, and C. Verhoef, “Moving Towards Quality Attribute Driven Software Architecture Reconstruction,” in Proceedings of the 10th Working Conference on Reverse Engineering (WCRE '03), 2003.
[5] D. R. Harris, H. B. Reubenstein, and A. S. Yeh, “Recognizers for Extracting Architectural Features from Source Code,” in Proceedings of the 2nd Working Conference on Reverse Engineering (WCRE '95), 1995, pp. 252-261.
[6] T. Mens and T. Tourwé, “A Survey of Software Refactoring,” IEEE Transactions on Software Engineering, vol. 30, no. 2, pp. 126-139, 2004.
[7] OMG Modernization Specifications catalog: http://www.omg.org/technology/documents/modernization_spec_catalog.htm [accessed in July 2012]
[8] P. Clements, F. Bachmann, L. Bass, D. Garlan, J. Ivers, R. Little, R. Nord, and J. Stafford, Documenting Software Architectures: Views and Beyond, Addison-Wesley, 2003.
[9] L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, Addison-Wesley, 2003.
[10] V. Tzerpos and R. C. Holt, “ACDC: An Algorithm for Comprehension-Driven Clustering,” in Proceedings of the Seventh Working Conference on Reverse Engineering (WCRE '00), IEEE Computer Society, Washington, DC, USA, 2000, pp. 258-267.
[11] H. A. Müller, M. A. Orgun, S. R. Tilley, and J. S. Uhl, “A Reverse-Engineering Approach to Subsystem Structure Identification,” Journal of Software Maintenance: Research and Practice, vol. 5, no. 4, pp. 181-204, 1993.
[12] V. Tzerpos, Comprehension-Driven Software Clustering, Ph.D. thesis, University of Toronto, Toronto, Canada, 2001.
[13] B. S. Mitchell and S. Mancoridis, “On the Evaluation of the Bunch Search-Based Software Modularization Algorithm,” Soft Computing, vol. 12, no. 1, pp. 77-93, 2007.
[14] S. Mancoridis, B. S. Mitchell, Y. Chen, and E. R. Gansner, “Bunch: A Clustering Tool for the Recovery and Maintenance of Software System Structures,” in Proceedings of the IEEE International Conference on Software Maintenance (ICSM '99), 1999, pp. 50-59.
[15] P. Andritsos and V. Tzerpos, “Information-Theoretic Software Clustering,” IEEE Transactions on Software Engineering, vol. 31, no. 2, pp. 150-165, 2005.
[16] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, and M. Stal, Pattern-Oriented Software Architecture: A System of Patterns, John Wiley & Sons, 1996.
[17] T. A. Wiggerts, “Using Clustering Algorithms in Legacy Systems Remodularization,” in Proceedings of the Fourth Working Conference on Reverse Engineering (WCRE '97), 1997, pp. 33-43.
[18] J. Clark, J. J. Dolado, M. Harman, R. Hierons, B. Jones, M. Lumkin, B. Mitchell, S. Mancoridis, K. Rees, M. Roper, and M. Shepperd, “Reformulating Software Engineering as a Search Problem,” IEE Proceedings - Software, vol. 150, no. 3, pp. 161-175, 2003.
[19] C. Riva, “Architecture Reconstruction in Practice,” in Proceedings of the 3rd Working IEEE/IFIP Conference on Software Architecture (WICSA 2002), Kluwer Academic Publishers, 2002, pp. 159-173.
[20] O. Maqbool and H. A. Babri, “Hierarchical Clustering for Software Architecture Recovery,” IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 759-780, 2007.
[21] C.-H. Lung, M. Zaman, and A. Nandi, “Applications of Clustering Techniques to Software Partitioning, Recovery and Restructuring,” The Journal of Systems and Software, vol. 73, pp. 227-244, 2004.
[22] D. Pollet, S. Ducasse, L. Poyet, I. Alloui, S. Cîmpan, and H. Verjus, “Towards a Process-Oriented Software Architecture Reconstruction Taxonomy,” in Proceedings of the 11th European Conference on Software Maintenance and Reengineering (CSMR '07), IEEE Computer Society, 2007, pp. 137-148.
[23] V. Tzerpos and R. C. Holt, “The Orphan Adoption Problem in Architecture Maintenance,” in Proceedings of the Fourth Working Conference on Reverse Engineering (WCRE '97), 1997, pp. 76-82.
[24] M. Shtern and V. Tzerpos, “Clustering Methodologies for Software Engineering,” Advances in Software Engineering, vol. 2012, 18 pages, 2012.
[25] Q. Zhang, D. Qiu, Q. Tian, and L. Sun, “Object-Oriented Software Architecture Recovery Using a New Hybrid Clustering Algorithm,” in Proceedings of the 7th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2010), 2010, vol. 6, pp. 2546-2550.