2012 19th Working Conference on Reverse Engineering

Reconstructing Architectural Views from Legacy Systems

Ghizlane El Boussaidi, Alvine Boaye Belle
Department of Software and IT Engineering, École de technologie supérieure, Montreal, Canada

Stéphane Vaucher
Benchmark Consulting, Montreal, Canada

Hafedh Mili
Department of Computer Science, Université du Québec à Montréal, Montreal, Canada

2) violations of the style constraints due to their misinterpretation; and 3) the continuous and cumulative changes undergone by the system, which increase its complexity and lead to a deviation from its initial design [1, 6]. Furthermore, the as-built architecture is often insufficiently and inaccurately documented [4]. Hence, a software architecture reconstruction process is required to reconstruct and document the architecture of existing systems before initiating any modernization actions.

Abstract—Modernizing a large legacy system is a demanding and costly process which requires a deep understanding of the system’s architecture and its components. However, legacy systems are poorly documented, and they have often undergone many changes that make them deviate from their initial architectural design. Approaches for reconstructing architectural views from legacy systems and re-documenting the resulting components are of great value in the context of a modernization process. In this paper, we propose an approach that helps construct distinct architectural views from legacy systems. To do so, we propose various clustering algorithms which are driven by common architectural views and styles. Our approach makes use of the Knowledge Discovery Metamodel (KDM), which provides a standard machine-independent representation of legacy systems. We implemented and applied the approach in an industrial setting.
The preliminary experiments have shown that the algorithms perform well and produce comprehensive views.

Many approaches have been proposed to support architecture recovery, using various techniques and producing different supporting tools [4]. The technique used generally depends on the way the system’s data is represented. In [22], techniques were classified into three automation levels: quasi-manual, semi-automatic and quasi-automatic. In the context of large and complex legacy systems, we need a quasi-automatic technique that alleviates the burden of reconstructing these systems’ architectures. One such technique is clustering, which is commonly used to reconstruct architecture [5, 10, 11, 12, 13, 15, 17, 19, 20, 21]. However, these approaches target specific languages and systems and do not use a standard representation of the system’s data. As a consequence, the resulting tools do not interoperate with each other and can hardly be used together in a modernization process [2]. Another known problem related to clustering-based techniques is the selection of the appropriate entities and properties/relations to be used in the clustering. Most approaches base the selection on the system under analysis and its data, and do not make it possible to build architectural views such as the ones commonly used in a forward architectural construction process (e.g., a layered view [16]).

Keywords-legacy system modernization; architecture reconstruction; architectural views; software clustering.

I. INTRODUCTION

Modernizing a legacy system is a demanding and costly task. It is largely motivated by the fact that the system is no longer able to efficiently support the business goals of a company [1]. Understanding the legacy system is mandatory for all the modernization disciplines, and it usually involves constructing various representations of the system that support its comprehension.
Depending on the modernization goals, various other disciplines are considered as part of this process, including the assessment of the legacy system’s architecture, its restructuring and refactoring, and its transformation to generate the target system [2].

When the modernization process involves an architectural transformation of the system, we first need to reconstruct the system’s software architecture. When designing these architectures, an architect relies on a set of idiomatic patterns commonly named architectural styles, which describe families of systems [3]; an example is the commonly used layered style. However, many researchers have observed that the as-built architecture does not conform to the initial style that guided its design (e.g., [4, 5, 6]). This is mainly due to: 1) the conceptual gap between the abstract elements that define a style and the concrete source code constructs that implement the system [5];

1095-1350/91 $25.00 © 4891 IEEE DOI 10.1109/WCRE.2012.44

The OMG’s architecture-driven modernization (ADM) standards [7] were introduced to tackle some of these challenges. Indeed, ADM consists of a number of standards whose purposes are mainly to harmonize IT and business modernization goals while enabling interoperability between supporting tools. The Knowledge Discovery Metamodel (KDM) is one of these standards. The KDM defines a metamodel for representing, at various levels of abstraction, all aspects of existing legacy systems [7]. This metamodel provides a common interchange format to ensure interoperability between tools that support modernization. Although the KDM specifies concepts and relations to describe the software architecture of legacy systems, it does not specify or suggest a way of inferring these high-level representations from the low-level data extracted from the system.

Software architecture is usually described using a set of complementary views.
The concept of architectural view is described in [8] as the fundamental organizing principle for documenting architecture and it is considered as the result of applying a style to a system. For example, the layered style yields a module decomposition that is a static/structural view of the system; a pipe and filter style yields a dynamic view based on the data flow between components; and a client-server style yields a dynamic view based on the control flow (request/reply interactions). In this paper, we rely on this mapping between views and styles to recover relevant views from a legacy system. To address the issues mentioned above, we propose an approach that makes use of the KDM standard to reconstruct and document software architectural views of the legacy system. We consider an architectural view to be a way of partitioning a system using a specific set of KDM relevant concepts and relations and we propose clustering algorithms that target specific views mainly a layered view that we call horizontal view and a feature-based view that we call vertical view. The proposed approach has two main advantages: 1) the approach is language and platform independent; and 2) the proposed architecture reconstruction process and the supporting algorithms are driven by targeted architectural views and styles. It is worth pointing out that at this stage of our research, our aim is to propose and implement relevant clustering algorithms and heuristics that target a common set of architectural views. The approach was implemented as a plugin within a commercial tool and applied to large industrial legacy systems. The preliminary results have shown that the approach considerably reduces the clustering time compared to similar approaches and it yields comprehensible views. B. The Knowledge Discovery Metamodel In ADM, modernization is driven by architecture to underline the need to reconstruct and restructure the architecture of the system to be modernized. 
A number of standards are listed under the ADM umbrella. These standards aim at synchronizing the IT and business aspects inherent to modernization and enabling interoperability between modernization tools. One of these standards is the Knowledge Discovery Metamodel (KDM). The KDM defines a meta-model for representing existing software assets, their associations, and operational environments [7]. KDM enables to represent various applications, platforms, and programming languages. Following the separation of concerns principle, KDM defines several domains each of which corresponds to a particular architectural viewpoint. Each domain is defined by a KDM package that gathers meta-model elements representing aspects related to this domain. KDM specification defines 4 layers (Figure 1). The KDM Infrastructure Layer defines common meta-model elements which constitute the infrastructure for other packages. The Program Elements Layer contains two packages: Code and Action. Together these packages enable to describe code models of software systems, i.e., they represent the knowledge explicitly represented in the source code. The Code package focuses on named units of implementation and their structural relationships while the Action package focuses on units of behavior and the control flow relationships. The paper is organized as follows. We introduce basic concepts related to our approach in section 2. Section 3 gives an overview of our approach and details its steps. In section 4 we describe our view-based clustering algorithms. Section 5 presents and discusses the experimentation results. Related works are discussed in section 6 and we conclude and outline some future works in section 7. II. BACKGROUND In this section, we start by briefly introducing relevant concepts to our approach including software architecture, architectural styles and views. We also give an overview of the KDM specification and meta-model that we used to implement our approach. A. 
Software Architecture and Architectural Views

There are many definitions of software architecture in the literature (e.g., [3, 8, 9]). However, there is a consensus on the following: 1) architecture represents a judicious partitioning of the system into parts with specific relations among these parts [8]; and 2) architecture aims at satisfying a set of functional requirements and quality attributes [3, 9]. Software architecture is commonly defined as a set of components and connectors (i.e., interactions between components). When designing software architectures, an architect relies on a set of idiomatic patterns commonly named architectural styles or patterns. An architectural style determines the vocabulary of components and connectors that can be used in instances of that style, together with a set of constraints on how they can be combined. Many common architectural styles are described in [3, 8, 16]. Examples of such styles include layered, pipes and filters, client-server and service-oriented styles; each of these styles has its own vocabulary and constraints and promotes some specific quality attributes. In practice, software systems are built by combining and composing these styles.

The Runtime Resource and the Abstractions Layers represent higher-level views of the existing system. Most information required by these views is implicit in the source code, and obtaining it usually requires some analysis of low-level representations combined with input from experts and stakeholders. A particular package we are interested in is the structure package, which represents the architectural organization of the existing software system. Our goal is to make use of KDM models extracted at lower levels (code model, data model, etc.) and construct different views of the system at hand that will feed the structure model.

Figure 1. KDM Layers and packages; extracted from [7]

III.
statement (i.e., Includes), a CALL and PERFORM statements (i.e., Calls) and statements manipulating data records (i.e., Reads and Writes). Moreover, depending on the targeted view, we have to choose a coherent subset of concepts and relations to be included in the view. For example a static view, called module view in [1] and [8], shows units of implementation and their structural relations. This view helps understanding the system’s functions and support requirements traceability and impact analysis [8]. To build the static view of an object-oriented legacy system, we may select KDM concepts representing classes, interfaces and packages and KDM relations representing extends, implements, imports, depends-on and includes relationships as elements to be included in this view. The dynamic view, called component and connector view in [1] and [8], is mainly based on the control flow and the data flow between implementation units. Control flow and data flow elements are represented by concepts and relations from the KDM Action package. To build the dynamic view, we need to select appropriate computational objects as represented by KDM concepts and appropriate KDM actions representing interactions between these concepts. For example, we may select MethodUnit and DataElement as concepts and the Calls, Reads, Writes and Creates action relations that may relate these concepts. Figure 3 shows a simplified excerpt from the KDM metamodel where grayed rectangles are KDM concepts and relationships we use in the dynamic view while the other concepts are those we use when building the static view. OVERVIEW OF THE APPROACH FOR RECONSTRUCTING ARCHITECTURAL VIEWS Our approach is based on the idea that we can establish a mapping between an architectural view and one or more specific algorithms analyzing an appropriate set of concepts and relationships. Figure 2 illustrates the most important steps of our approach. 
A preliminary step “Extract KDM representation” is performed using a commercial tool that analyzes the software artifacts and generates KDM models that are at a low-level of abstraction; e.g., KDM code, action and inventory models which are instances of metamodel elements belonging to the Infrastructure and Program Elements layers. These KDM models are low-level models providing views of the implementation elements. Our approach relies on the existence of such low-level models which it analyzes to produce high-level views and models. A. Selecting Relevant KDM Entities and Relationships We consider an architectural view to be a way of partitioning the system using a specific set of relevant concepts and relations. Hence, the first step of the framework aims at selecting the most relevant concepts and relationships that are relevant for the targeted view. In this context, KDM offers too much information and depending on the modernization targets we have to choose the architectural views that will guide the analysis of the system at hand and the appropriate concepts and relations accordingly. The selection of the concepts and relations depends on the system at hand. For example, when analyzing a COBOL-based legacy system we will choose KDM concepts representing a program (i.e., CompilationUnit), a procedure (i.e., CallableUnit) and copybooks (i.e., SharedUnit); and KDM relations representing a COPY B. Selecting a Relevant View and the Corresponding Clustering Algorithm The second step of the framework aims at revealing the system’s structure. To do so, we apply some analysis techniques to aggregate the system’s modules and abstract its actual architecture. To build different views of the analyzed system, we propose various pattern-driven clustering algorithms to decompose the system. This includes: • A horizontal clustering that partitions the system into a set of clusters, each of which corresponds to a given Figure 2. 
Overview of the steps of the approach

Figure 3. KDM concepts and relationships exploited when building views

components is to group them into one subsystem, which eases the recovery of the system’s structure [10]. Section 4 describes in detail the horizontal and vertical algorithms.

layer in the sense of a layered architecture (e.g., UI layer, business layer, data management layer). The layered architecture is one of the commonly used module views; it helps in understanding the system and analyzing some of its important properties, such as portability, in the context of a modernization process. Indeed, the elements of the top-level layer may be analyzed to build a suitable modern front-end. This partitioning is also useful for systems that have undergone many modifications, as it reveals the layers that were added on top of the initial system to support new business needs (e.g., a web layer).
• A vertical clustering, as opposed to horizontal clustering, that identifies functionally cohesive groups of modules. The resulting view is complementary to the one generated via the horizontal clustering, as we exploit computational objects and actions in this view. This clustering can be seen as a feature-based decomposition of the system at hand. It helps in understanding which part of the system supports which functions, and hence which business process. The results may prove useful when identifying candidate services during a SOA migration process.
• A hierarchical clustering whose purpose is to keep the size of the clusters at a manageable level, as suggested by [10]. In this case, a cluster may contain elements which are themselves clusters. This is mainly useful for scalability purposes.
It also enables to address the problem of finding disjoint clusters during the vertical clustering: usually some modules (as libraries and the modules that access data) are used by many other modules and this makes it difficult and even meaningless to partition the system into disjoint sets. In this context, we can apply the horizontal clustering to identify layers of the system and then apply the vertical clustering to the topmost layers to identify independent high-level features of the system. C. Refining The Resulting View Software clustering researchers have recognized that a clustering algorithm can never generate a better partition of the system than the one produced by the system’s experts [12]. Hence, the third step of the framework enables the user to modify and adjust the resulting view and specifically the clusters which correspond to layers or components depending on the selected clustering algorithm. To do so a view resulting from the clustering process is displayed to stakeholders as a KDM structure model (see next subsection) whose elements can be dragged and dropped to move an element from a cluster to another. The ultimate goal here is to find meaningful components with regard to some common style (e.g., SOA, layers, etc.) to help with the modernization process. Involving stakeholders is mandatory for two main reasons: 1) their knowledge of the system can be used to improve the quality of the groupings by adding information that is not present in the code; and 2) they adhere better to a modernization process whose efforts are organized in iterations based on the components they helped to identify. The process to involve stakeholders is out of scope of this paper. D. Documenting The Resulting View In the final step of the framework, the result of the clustering is documented using the KDM structure model whose contents are defined in the structure package (see Figure 1). 
This model represents the architectural organization of the existing software system. It may contain architectural elements that represent architectural views, layers or components. The semantics of these concepts is not defined in the KDM specification. The relations between architectural elements are specified using the KDM AggregatedRelationship concept, which represents a set of the primitive relationships between entities (transitively) owned by architectural elements. Figure 4 shows the KDM concepts we used for documenting the views.

Figure 4. KDM concepts used when building architectural views

Our goal is to populate our framework with families of decomposition algorithms. The ones we have already implemented can be configured to either ignore or take into account library components (introduced as omnipresent modules in Muller et al.’s work [11]). Library components obscure the system’s structure if they are considered during the decomposition of the system [11]. Another alternative in handling these

cohesion (i.e., the intra-relations). The algorithm stops when it cannot improve the criterion anymore. Bunch [13] starts by creating a module dependency graph (MDG) from the source code. The MDG nodes represent source code entities (e.g., classes, functions) while the edges represent dependencies between these entities (e.g., inheritance, function call). An edge is weighted to specify the number of dependencies existing between two nodes. We adapted the approach in [13] to our vertical clustering in two ways. First, while Bunch aims at finding partitions that maximize cohesion and minimize coupling, our aim is to identify subsystems corresponding to features supported by the system. Hence, our initial partition is not random; it exploits root nodes of the MDG generated from the legacy system. Our intuition is that a root node (i.e., a node that has only outgoing edges) should correspond to an entry point to a subsystem.
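The root-node intuition can be illustrated with a minimal sketch. It assumes, purely for illustration, that the MDG is held as a mapping from (source, target) pairs to edge weights; the names `build_mdg` and `root_nodes` and the module names are hypothetical and are not taken from Bunch [13] or from the tool used in this work.

```python
from collections import Counter


def build_mdg(dependencies):
    """Build a weighted MDG: (source, target) -> number of dependencies."""
    return Counter(dependencies)


def root_nodes(mdg):
    """Return the nodes that have outgoing edges but no incoming edge."""
    sources = {source for source, _ in mdg}
    targets = {target for _, target in mdg}
    return sources - targets


# Hypothetical call dependencies between programs (one pair per call site).
calls = [("MAIN", "PAY"), ("MAIN", "PAY"), ("MAIN", "LOG"),
         ("BATCH", "PAY"), ("PAY", "LOG")]
mdg = build_mdg(calls)
roots = root_nodes(mdg)
```

In this toy graph, `MAIN` and `BATCH` are never the target of a dependency, so they are the candidate entry points from which an initial partition could be seeded.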
Second, we modified the way new partitions, called neighboring partitions, are inferred from the current partition. Hence, we represent resulting views from clustering algorithms as ArchitecturalView objects while the resulting clusters are represented as instances of the Layer concept in the case of the horizontal clustering and the Component concept in the case of the vertical clustering. Relations between two clusters (layers or components) are represented using two instances of AggregatedRelationship: an instance for the outgoing relations (i.e., required services) and an instance for the incoming relations (i.e., provided services). We also attach to the generated architectural view and layers or components a preliminary list of attributes that enables to record some properties (e.g., in case of an architectural view, we keep track of the parameters that were used to generate it including a list of the selected concepts and relations, and the name of the clustering algorithm). IV. The intuitions behind the horizontal and the vertical clustering algorithms and the heuristics we developed to support them, are explained in the following subsections. A. Horizontal Clustering This algorithm aims at partitioning the system into disjoint subsystems each of which corresponds to a layer of the layered view. Although the layered view is widely used in software architecture it is a poorly defined view [8]. In the layering architecture as described in [16] requests are sent from layer N to the lower-level layer N-1, and answers to these requests or notifications are moving in the opposite direction. Yet many implementations of the layered architecture may violate this strict layering principle. BUILDING VIEWS USING CLUSTERING ALGORITHMS Many approaches and tools were developed to support module decomposition and clustering in the context of software engineering (e.g., [5, 10, 11, 13, 17, 19, 20, 21]). 
Our initial goal was to apply and adapt existing algorithms to our clustering needs, which are driven by targeted views. However, the majority of the proposed software clustering algorithms use high-cohesion and low-coupling principles to identify the boundaries of the clusters [15]. This does not really apply when identifying the layers of a system. A layer may exhibit a set of independent features that are commonly used by the layer above; hence its cohesion may be low and its coupling with other layers may be high. Identifying layers therefore requires a different approach to grouping modules.

In light of this, our horizontal partitioning starts by identifying the data-access elements (i.e., compilation units that create, read or write data through their control elements) and creates the lowermost layer containing these elements. In some legacy systems (e.g., batch sequential systems), elements of the top layer (e.g., GUI modules) are more difficult to identify than in others (e.g., object-oriented systems). In the first case, the algorithm assigns a level number to each element/module of the system depending on the number of dependencies that must be crossed to get to an element of the lowermost layer. To do so, we also developed heuristics based on the fact that a layered view is not just a decomposition that reflects module dependencies; it also considers cohesion, reuse and portability. For example, when module x uses module y and module y is not in the lowermost layer, two situations may arise: 1) module x is the only one to use module y, in which case we put module y in the same layer as module x; or 2) module x is not the only one to use module y, in which case we put module x and module y in distinct successive layers. Another heuristic we introduce aims at resolving level-numbering conflicts, which result from violations of the strict layering principle.
Practically, this heuristic assigns the minimum level number (or the maximum number if layers are numbered from top to bottom) to a module when the module has been assigned different level numbers through distinct flows. The heuristic is illustrated in Figure 5, where the left part shows a module C that was assigned two different level numbers after analyzing two distinct dependency flows that use the module. The result of applying the heuristic is shown in the right part of the figure.

Regarding the vertical clustering, this clustering aims at recognizing features of the system, which are generally implemented using cohesive and loosely-coupled sets of modules. Hence existing algorithms can be adapted to support this clustering. Since the goal was to apply our approach to large industrial systems, we decided to make use of approaches that rely on optimization to reduce the search space for an optimal decomposition of a software system. We attempted to use the Bunch modularization algorithm proposed in [13], which uses a family of search-based algorithms including hill-climbing and genetic algorithms. We focused on the hill-climbing algorithm because it performs well in the context of large systems and it has been successfully used in several approaches [18]. The algorithm works in an iterative way. It starts with an initial partition, usually a randomly generated one as in [13]. Modules are then moved between clusters to improve the partition according to some criterion. This criterion is based on maximizing a fitness function. In [13], a modularization quality (MQ) function is proposed as a fitness function, which aims at minimizing the coupling between resulting clusters (i.e., the inter-relations) and maximizing their

In the second case, where user modules are easy to identify, the algorithm simultaneously creates the lowest-level layer containing data-access modules and the top-level layer containing user modules.
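The level numbering of the first case, together with the minimum-level conflict resolution, can be sketched as a breadth-first traversal from the data-access modules. This is only an illustration under stated assumptions: the dependency graph is given as a `uses` mapping, the lowermost layer is numbered 1, and the names `assign_levels` and `group_by_level` are hypothetical. The heuristic that places a module in the same layer as its sole user is deliberately left out of this sketch.

```python
from collections import deque


def assign_levels(uses, data_access):
    """Give each module the fewest dependencies to cross to reach the lowermost layer.

    uses: module -> set of modules it depends on.
    data_access: modules forming the lowermost layer (level 1, an assumption).
    Breadth-first search yields the minimum level when several flows disagree.
    """
    used_by = {}
    for module, dependencies in uses.items():
        for dependency in dependencies:
            used_by.setdefault(dependency, set()).add(module)
    level = {module: 1 for module in data_access}
    queue = deque(sorted(data_access))
    while queue:
        module = queue.popleft()
        for user in sorted(used_by.get(module, ())):
            if user not in level:  # first visit is the shortest flow
                level[user] = level[module] + 1
                queue.append(user)
    return level  # modules that never reach the data layer stay unassigned


def group_by_level(level):
    """Group modules into layers keyed by their level number."""
    layers = {}
    for module, number in level.items():
        layers.setdefault(number, set()).add(module)
    return layers


# Hypothetical example: B2 reaches DAO directly and also through B1.
uses = {"UI": {"B1"}, "B1": {"DAO"}, "B2": {"B1", "DAO"}, "DAO": set()}
levels = assign_levels(uses, {"DAO"})
layers = group_by_level(levels)
```

Here `B2` would receive level 3 through `B1` and level 2 through its direct use of `DAO`; the traversal keeps the minimum, so `B2` ends up in layer 2 alongside `B1`.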
Then we assign a level to each element/module of the system depending on its position in the dependency flow, starting from the top layer all the way down to the lowermost layer. We apply the same heuristics as in the first case. The modules are then grouped according to their level number. This algorithm is a construction algorithm (as defined by [17]) since it assigns modules to clusters in very few passes.

Figure 5. Example of the application of the level-number minimization (or maximization) heuristic

module B is used uniquely by identified omnipresent modules, we consider module B as omnipresent itself.

B. Vertical Clustering

We applied the hill-climbing algorithm to our vertical clustering. In particular, we adapted the approach in [13]. In our context, we partition the system starting with a set of clusters where all but two of the clusters contain one node/module corresponding to a root module of the MDG that was built from the legacy system. Root nodes are nodes that have only outgoing edges; they represent modules through which a user or another system may interact with the legacy system under analysis. One of the two remaining clusters contains the set of omnipresent modules, i.e., modules that are used by an abnormally large number of other modules. These can skew the results of the clustering, so we identify and isolate them using some statistical analysis (e.g., outliers in a box plot) combined with manual tuning. The other cluster contains all the remaining modules, i.e., all modules except roots and those in the omnipresent cluster.

V. EXPERIMENTATIONS WITH THE APPROACH

We implemented our approach as a plugin in the IRIS™ tool, the workbench developed by our industrial partner. We conducted preliminary experiments on large industrial systems, which are mainly batch sequential systems written in COBOL.
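The initial partition used by the vertical clustering (one cluster per root, one cluster of omnipresent modules, one cluster for the rest) could be sketched as follows. The sketch assumes the MDG is an iterable of (source, target) pairs, counts distinct users rather than edge weights, and uses the usual box-plot upper fence (third quartile plus 1.5 times the interquartile range) as the default outlier rule; that rule, the optional manual `threshold`, and all function names are assumptions made for illustration.

```python
import statistics


def fan_in(mdg):
    """Count the distinct modules that use each module (edge weights are ignored)."""
    users = {}
    for source, target in mdg:
        users.setdefault(source, set())
        users.setdefault(target, set()).add(source)
    return {module: len(used_by) for module, used_by in users.items()}


def omnipresent_modules(mdg, threshold=None):
    """Modules whose number of users exceeds a manual or box-plot threshold."""
    counts = fan_in(mdg)
    if threshold is None:
        # Box-plot upper fence; needs at least two modules in the graph.
        q1, _, q3 = statistics.quantiles(counts.values(), n=4)
        threshold = q3 + 1.5 * (q3 - q1)
    return {module for module, count in counts.items() if count > threshold}


def initial_partition(mdg, threshold=None):
    """One singleton cluster per root, then the omnipresent cluster, then the rest."""
    counts = fan_in(mdg)
    roots = {module for module, count in counts.items() if count == 0}
    libraries = omnipresent_modules(mdg, threshold) - roots
    others = set(counts) - roots - libraries
    return [{root} for root in sorted(roots)] + [libraries, others]


# Hypothetical graph: two entry points and one widely used library module.
edges = {("R1", "A"), ("R1", "B"), ("R2", "C"), ("R2", "D"),
         ("A", "LIB"), ("B", "LIB"), ("C", "LIB"), ("D", "LIB"),
         ("R1", "LIB"), ("R2", "LIB")}
partition = initial_partition(edges)
```

Passing an explicit `threshold` corresponds to the manual tuning mentioned above; the rule that also treats a module as omnipresent when it is used only by omnipresent modules is not covered by this sketch.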
We illustrate the results of the application on an industrial system and compare them to the results of applying the approach in [13] to the same system. Due to confidentiality agreements, we cannot describe the system in detail. The system is written in COBOL that is mostly COBOL 85 compliant and is deployed on an HP mainframe. The whole system comprises nearly 5000 programs and includes over 6000 copybooks (included COBOL code), for a total of over 6 million lines of code. In this section, we present the results of our experimentation as well as threats to the validity of our results.

A. Experiment

Our experiments focused on subsystems for which we had information on the specific functionalities they support. For each of these subsystems, we applied our clustering algorithms to yield the feature-based and layered views, and we had them validated by the system’s experts. For the purpose of this paper, we present the results of our clustering algorithms applied to one of the small subsystems, which contains five distinct functionalities. Figure 6 shows a dependency graph of this subsystem. This dependency graph is call-based, i.e., it only uses compilation units and the calls relationships. The analysis of this subsystem helped us assess the accuracy and the usefulness of the views that are built using our approach.

In the following iterations, neighboring partitions are created by moving modules to other clusters, and the resulting neighbors are evaluated using the MQ function as proposed in [13]. However, in [13] a partition Y is considered a neighbor of partition X if Y is exactly the same as X except that a single module of a cluster in partition Y is in a different cluster in partition X. This basically means that a neighboring partition is generated by randomly moving exactly a single module from one cluster to another in the current partition.
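The single-move neighborhood described above can be sketched as a generator. For simplicity this illustration enumerates every single-module move instead of sampling one at random, and it exposes two optional, hypothetical parameters: `frozen` (modules that must never be moved) and `links` (when given, a move is kept only if the module is related to a module of the destination cluster). These hooks are assumptions of the sketch, not part of the algorithm in [13].

```python
def neighbors(partition, links=None, frozen=frozenset()):
    """Yield every partition that differs from `partition` by one moved module.

    partition: list of sets of module names (left unmodified).
    links: optional mapping module -> set of modules it is related to.
    frozen: modules that are never moved.
    """
    for i, source in enumerate(partition):
        for module in sorted(source - frozen):
            for j, destination in enumerate(partition):
                if i == j:
                    continue
                if links is not None and not (links.get(module, set()) & destination):
                    continue  # no relationship with the destination cluster
                moved = [set(cluster) for cluster in partition]
                moved[i].remove(module)  # may leave an empty cluster behind
                moved[j].add(module)
                yield moved


# Hypothetical partition: one root cluster, one mixed cluster, one library cluster.
current = [{"R"}, {"A", "B"}, {"L"}]
unrestricted = list(neighbors(current))
restricted = list(neighbors(current,
                            links={"A": {"R"}, "B": {"A"}, "L": set()},
                            frozen={"R"}))
```

With no restriction, the four modules can each go to two other clusters, giving eight neighbors; with the root frozen and the relationship filter enabled, only the move of `A` into the root's cluster remains, which illustrates how such restrictions shrink the number of candidates to evaluate.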
In our context, it does not make sense to move a root module from one cluster to another; this would break the root-based intuition that directs our vertical clustering. Further, the root module of a cluster will ultimately be (transitively) related to all modules in the same cluster. Hence, we generate a neighboring partition by randomly moving exactly one module M from one cluster X to another cluster Y of the current partition, provided that M has a relationship with at least one module in cluster Y.

Figure 7 shows the view resulting from applying the vertical clustering to the subsystem of Figure 6. The view displays five subsystems and a subsystem called “Libraries” (i.e., the cluster containing omnipresent compilation units). The five subsystems are consistent with the features that were identified by the system’s experts and stakeholders. The main issue that we observed is that the identification of omnipresent modules may be too permissive when we rely only on statistical analysis (such as the box plot). This problem is discussed in [14], and one way to handle it is to let the user manually define a threshold for considering a module omnipresent.

Regarding the identification of omnipresent modules, we developed and applied some heuristics. For example, when identifying these modules we do not consider the weight of the relations. Indeed, if module A is the only one using module B and uses it a hundred times, this does not qualify module B as omnipresent. On the other hand, if module B is used by a hundred distinct modules, each using it once, this makes module B omnipresent. Furthermore if a

To assess the usefulness of our clustering algorithm, we implemented a version of the steepest ascent hill-climbing (SAHC) algorithm as described in [13] and ran it on the same system. Figure 8 shows a decomposition resulting from the SAHC algorithm when applied to our subsystem.
Although the final MQ (i.e., the value of the modularization quality function) of the SAHC is better than the final MQ of our vertical clustering, the decomposition view generated by the SAHC did not match the functionalities described by the system’s owners. It is worth pointing out that the decomposition generated by SAHC would have been improved had we better tuned the omnipresence threshold, which suggests that our vertical clustering is less sensitive to the choice of this threshold. Moreover, the vertical algorithm performs much better on large systems; this is mostly due to the way we compute the neighboring partitions, which reduces the number of neighbors to consider.

Figure 6. The dependency graph of the subsystem under analysis
Figure 7. The view resulting from the vertical clustering
Figure 8. The decomposition generated by the SAHC algorithm

Figure 9 shows the layered view resulting from applying the horizontal clustering to our subsystem example. Layer 1 (at the bottom) contains one module, which is a data access module. However, Layer 5, which is the topmost layer, contains many modules. From the stakeholders’ point of view, this view is less comprehensible and intuitive than the feature-based view. Hence, we applied the vertical clustering algorithm to identify subsystems of Layer 5. The resulting view is shown in Figure 10. Interestingly enough, Layer 5 comprises five clusters that correspond to the five subsystems previously identified in the vertical clustering. Hence, together, layers 1 to 4 contain the modules that were recognized as omnipresent in the vertical clustering. This is consistent with the fact that the bottom layers in a layered architecture are common services used by higher-level layers, which in turn exhibit a set of features supported by the system. As a matter of fact, the vertical clustering may be applied to each layer in the case of larger subsystems. Stakeholders found the view in Figure 10 more comprehensible and more useful for restructuring and migrating the system’s parts.

Figure 9. The layered view resulting from the horizontal clustering
Figure 10. A hierarchical view resulting from combining horizontal and vertical clusterings

B. Threats To Validity

Our preliminary experiments were performed on systems written in COBOL. However, we used the KDM representation of these systems, which is language- and platform-independent and makes the approach applicable to other kinds of systems. Nevertheless, we need to evaluate the approach on other kinds of systems to assess and refine the heuristics we developed. Another issue that we have to deal with is the assessment of the results of the approach when applied to a large legacy system: the results and observations of our experiment were limited to parts of the system for which we had information indicating expected functional groupings. We hope to get additional tagged subsystems from the stakeholders to increase the scope of our experimentation; this would help us generalize the results. Additionally, we would like to reproduce the results on open systems, but there are few representative systems containing a large number of interconnected programs.

VI. RELATED WORK

Several software architecture reconstruction approaches were proposed in the literature (e.g., [4, 10, 13, 14, 15, 17, 19, 21, 25]). Most of these approaches follow a typical process that generally includes three steps: 1) identifying relevant concepts and accordingly extracting information from the system under analysis; 2) constructing higher-level models using some analysis techniques; and 3) visualizing the resulting models. Depending on their goals and the targeted systems, approaches propose different methods or techniques to support the reconstruction process.
Both Riva [19] and Stoermer et al. [4] propose an architecture reconstruction process. Riva [19] considers the choice of architecturally significant concepts as the key to delivering meaningful models to the architects, while the goal of Stoermer et al. [4] is to extract information that will help in the analysis of the quality attributes supported by systems. In [19] the proposed process is iterative and incremental. Key concepts are first identified, and a conceptual view is built. These concepts are extracted from documentation and discussions with experts. Then the source code is analyzed to produce a model that is enriched with domain-specific knowledge to produce a high-level view of the system. Many of these activities rely on existing documentation and the manual intervention of architects and experts, which are not always available in the context of legacy systems. In [4], the proposed process links architecture reconstruction to quality attribute-driven analysis. In the first step of the process, the system to be analyzed and the architecture views are identified. This identification depends on the type of the system and the quality attributes to be analyzed. In the second step, a source model is extracted from the source code. Elements of this model are then aggregated in the third step, and the resulting aggregates are assigned element types as specified by the targeted view. The resulting views are given to the quality analysis framework, which analyzes them. Although we adopt a similar process for architecture reconstruction, the focus of the approach in [4] is on the analysis of quality attributes supported by existing systems. Furthermore, relationships between the views to be built and supporting aggregation techniques are discussed in neither [4] nor [19].

Many of the proposed architecture reconstruction approaches use clustering, including [10, 13, 14, 15, 17, 21, 25]. Various clustering-based approaches are discussed in [20, 24]. Many of the proposed approaches aim at finding a clustering of the system that minimizes the coupling between the resulting components and maximizes the cohesion of each component (e.g., [13, 14, 17, 21, 25]). Lung et al. [21] propose a hierarchical approach based on a numerical taxonomy clustering technique to classify software components. The method starts by constructing a data matrix of the components of the system and their properties to be used for the clustering. A resemblance coefficient is then computed for each pair of components to yield a resemblance matrix. The resemblance coefficient is based on coupling measures between components. The resemblance matrix is used to group similar components into clusters. The algorithm works iteratively: at each iteration, the two closest clusters are merged and the resemblance matrix is updated accordingly. The algorithm stops when all clusters are exhausted or when a given threshold is reached. This approach is very intuitive and can be applied in both forward- and reverse-engineering processes. It also performs well on large systems. Zhang et al. [25] propose a hybrid clustering algorithm that exploits a weighted directed class graph (WDCG) extracted from object-oriented systems. A WDCG includes both static and dynamic information. However, some important issues such as omnipresent modules are not tackled by these approaches.

Our approach is based on view-driven (i.e., style representation) clustering algorithms that make use of search-based algorithms while still trying to minimize coupling between clusters and maximize their cohesion. Hence our approach is more closely related to the approaches of Mitchell et al. [13, 14] and of Tzerpos and Holt [10]. Mitchell et al. [13] propose a search-based clustering approach, which has been implemented in the Bunch tool. We described the principles behind Bunch in Section IV. Our approach for the vertical clustering reuses Bunch’s modularization quality (MQ) function. However, our initial partition is driven by the views we are attempting to build, while in Bunch the initial partition is randomly generated. Additionally, we constrained the way neighboring partitions are generated from the partition of the current iteration to remain compatible with the idea of a vertical clustering, while Bunch randomly generates these partitions from the current one. Although Bunch enables the user to specify the number of neighbors to explore in each iteration of the algorithm (e.g., all neighbors: steepest ascent hill climbing (SAHC); the first better neighbor: next ascent hill climbing (NAHC)), our approach deliberately reduces this number by relying on existing relations between clusters. Our experimentation with our vertical algorithm and the NAHC and SAHC algorithms has shown that the vertical clustering yields more comprehensible views and performs better.

Tzerpos and Holt [10] propose an algorithm for comprehension-driven clustering (ACDC) based on subsystem patterns. Rather than using a modularity criterion to decompose systems, ACDC relies on familiar patterns observed in large systems to find clusters while keeping the size of the clusters at a manageable level. The approach also assigns meaningful names to the resulting clusters. ACDC first creates a skeleton of the final decomposition using subsystem patterns. Then it assigns the remaining modules (i.e., orphans) to existing subsystems using the Orphan Adoption technique described in [23]. Proposed subsystem patterns include the source file pattern, the support library pattern and the subgraph dominator pattern. For example, the source file pattern is a basic pattern that creates a cluster grouping the procedures and variables contained in the same source file. The support library pattern groups omnipresent modules into one subsystem; we used this pattern in our vertical clustering, where the initial partition contains a subsystem grouping omnipresent modules. The subgraph dominator pattern identifies a particular subgraph in the system that contains a dominant node DN such that there exists a path from DN to every node in the subgraph. Interestingly, our vertical clustering can be seen as a particular dominator pattern clustering where the dominant node is constrained to be a root node.

VII. CONCLUSION AND FUTURE WORK

The process of reconstructing meaningful architectural models from legacy systems remains a difficult task and an active research field in software engineering. This process is mandatory in various contexts, ranging from the redocumentation and the understanding of existing systems to their restructuring and migration. In this paper, we presented an approach that reconstructs and documents the software architecture of legacy systems using view-driven clustering algorithms. Specifically, we proposed and implemented clustering algorithms that target two specific views: a layered view that we called the horizontal view and a feature-based view that we called the vertical view. The approach is language- and platform-independent, as it relies on the KDM specification standard for describing low-level models of legacy systems. The approach was implemented as a plugin within the IRIS modernization tool, and it has been applied to large industrial legacy systems. Preliminary results have shown that the algorithms perform well on large systems and that the resulting views are comprehensible and may be used to support restructuring and migration processes.

While we continue to refine our view-driven algorithms, we need to perform more experiments and analysis to evaluate and refine the heuristics we used. We also need to establish a strict mapping between the architectural views as described in [8] and the KDM concepts and relations that can be included in a view.
As a matter of fact, the challenge in reconstructing architectural views from legacy systems resides in the difficulty of establishing a clear and precise mapping between the abstract architectural elements as defined by architectural styles and the concrete language-dependent constructs used in implementing software systems. In the near future, we are planning to evaluate our approach on open source systems (e.g., Mozilla and Linux) so that we can compare the results with other approaches. We also want to automatically analyze the KDM AggregatedRelationship elements that relate the resulting clusters (i.e., layers in the horizontal clustering or subsystems in the vertical clustering) to construct their interfaces.

In future work, we would like to explore other analysis and classification techniques to build the architectural views. We also intend to explore the use of domain-specific knowledge to improve the resulting views, although using domain-specific knowledge is known to confine the scope of the approach [24].

REFERENCES

[1] R. C. Seacord, D. Plakosh, and G. A. Lewis, Modernizing Legacy Systems: Software Technologies, Engineering Process and Business Practices, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003.
[2] W. Ulrich and P. Newcomb, Information Systems Transformation: Architecture-Driven Modernization Case Studies, Morgan Kaufmann OMG Press, 2010.
[3] M. Shaw and D. Garlan, Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall, 1996.
[4] C. Stoermer, L. O'Brien, and C. Verhoef, “Moving Towards Quality Attribute Driven Software Architecture Reconstruction,” In Proceedings of the 10th Working Conference on Reverse Engineering (WCRE '03), 2003.
[5] D. R. Harris, H. B. Reubenstein, and A. S. Yeh, “Recognizers for Extracting Architectural Features from Source Code,” In Proceedings of the 2nd Working Conference on Reverse Engineering, 1995 (WCRE’95), pp. 252-261.
[6] T. Mens and T. Tourwé, “A Survey of Software Refactoring,” IEEE Transactions on Software Engineering, 2004, vol. 30 (2), pp. 126-139.
[7] OMG Modernization Specifications catalog: http://www.omg.org/technology/documents/modernization_spec_catalog.htm [accessed in July 2012].
[8] P. Clements, F. Bachmann, L. Bass, D. Garlan, J. Ivers, R. Little, R. Nord and J. Stafford, Documenting Software Architectures: Views and Beyond, Addison-Wesley, 2003.
[9] L. Bass, P. Clements and R. Kazman, Software Architecture in Practice, Addison-Wesley, 2003.
[10] V. Tzerpos and R. C. Holt, “ACDC: An Algorithm for Comprehension-Driven Clustering,” In Proceedings of the Seventh Working Conference on Reverse Engineering, 2000 (WCRE'00), IEEE Computer Society, Washington, DC, USA, pp. 258-267.
[11] H. A. Müller, M. A. Orgun, S. R. Tilley and J. S. Uhl, “A Reverse-Engineering Approach to Subsystem Structure Identification,” Journal of Software Maintenance: Research and Practice, 1993, vol. 5, issue 4, pp. 181–204.
[12] V. Tzerpos, Comprehension-Driven Software Clustering, Ph.D. thesis, University of Toronto, Toronto, Canada, 2001.
[13] B. S. Mitchell and S. Mancoridis, “On the Evaluation of the Bunch Search-Based Software Modularization Algorithm,” Soft Comput., 2007, vol. 12, issue 1, pp. 77-93.
[14] S. Mancoridis, B. S. Mitchell, Y. Chen, and E. R. Gansner, “Bunch: a Clustering Tool for the Recovery and Maintenance of Software System Structures,” In Proceedings of the IEEE International Conference on Software Maintenance, 1999 (ICSM '99), pp. 50-59.
[15] P. Andritsos and V. Tzerpos, “Information-Theoretic Software Clustering,” IEEE Transactions on Software Engineering, 2005, vol. 31, no. 2, pp. 150-165.
[16] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad and M. Stal, Pattern-Oriented Software Architecture: A System of Patterns, John Wiley & Sons, 1996.
[17] T. A. Wiggerts, “Using clustering algorithms in legacy systems remodularization,” In Proceedings of the Fourth Working Conference on Reverse Engineering, 1997 (WCRE’97), pp. 33-43.
[18] J. Clark, J. J. Dolado, M. Harman, R. Hierons, B. Jones, M. Lumkin, B. Mitchell, S. Mancoridis, K. Rees, M. Roper and M. Shepperd, “Reformulating Software Engineering as a Search Problem,” IEE Proceedings - Software, 2003, vol. 150, issue 3, pp. 161-175.
[19] C. Riva, “Architecture Reconstruction in Practice,” In Proceedings of the 3rd Working IEEE/IFIP Conference on Software Architecture, 2002 (WICSA 2002), Kluwer Academic Publishers, pp. 159-173.
[20] O. Maqbool and H. A. Babri, “Hierarchical Clustering for Software Architecture Recovery,” IEEE Transactions on Software Engineering, 2007, vol. 33, no. 11, pp. 759-780.
[21] C-H. Lung, M. Zaman and A. Nandi, “Applications of Clustering Techniques to Software Partitioning, Recovery and Restructuring,” The Journal of Systems and Software, 2004, vol. 73, pp. 227–244.
[22] D. Pollet, S. Ducasse, L. Poyet, I. Alloui, S. Cîmpan and H. Verjus, “Towards A Process-Oriented Software Architecture Reconstruction Taxonomy,” In Proceedings of the 11th European Conference on Software Maintenance and Reengineering, 2007 (CSMR '07), IEEE Computer Society, pp. 137-148.
[23] V. Tzerpos and R. C. Holt, “The Orphan Adoption Problem in Architecture Maintenance,” In Proceedings of the Fourth Working Conference on Reverse Engineering, 1997 (WCRE’97), pp. 76-82.
[24] M. Shtern and V. Tzerpos, “Clustering Methodologies for Software Engineering,” Advances in Software Engineering, vol. 2012, 18 pages, 2012.
[25] Q. Zhang, D. Qiu, Q. Tian, L. Sun, “Object-oriented software architecture recovery using a new hybrid clustering algorithm,” In the 7th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010, vol. 6, pp. 2546-2550.