Proceedings. Student Conference on Software Engineering and Database Systems



Proceedings, Student Conference on Software Engineering and Database Systems, 27th June 2009, University of Magdeburg

Table of Contents

Database Track
- Current State and Future Challenges of Indexing and Self-Tuning Approaches (Felix Penzlin)
- Current State and Future Challenges of Real-Time ETL (Hagen Schink)
- The Future of DBMS Lies in Application-Oriented Solutions (Pyvovar Dmytro)

Software Engineering / Product Lines Track
- Implementation of Software Product Lines: Current State and Future Challenges (Christian Becker)
- Current State and Future Challenges in Optional Weaving (Constanze Michaelis)
- Challenges for Testing of Software Product Lines (Afra'a Ahmad Alyosef)
- Current State and Future Challenges in the Visualization of Software Product Lines (Marius Krug)
- Current State and Future Challenges in Aspect Mining (Alexander Dreiling)

Current State and Future Challenges of Indexing and Self-Tuning Approaches
Felix Penzlin
Otto-von-Guericke Universität Magdeburg, Germany

Abstract: Database tuning is a common problem, but it requires deep knowledge of the characteristics of the database system and the expected workload. A dynamically changing workload further complicates the search for the right settings. In this contribution we present several approaches that autonomously examine the operations and adjust different parameters. Even though various tools for this task are available, self-tuning database technology is still an active field of research.

Keywords: database; indexing; self-tuning; memory management

I. INTRODUCTION
As available database systems have to cover a wide variety of use cases, tuning the system for a specific task is an important challenge. With the rising complexity of modern database management systems (DBMS), it requires more and more expertise. Even if a database administrator (DBA) has mastered the diversity of options, a changing workload can make this task unmanageable. Therefore several efforts have been made to support the DBA through automation. All major vendors offer such tools today, such as the Self-Tuning Memory Manager for DB2, the Automatic Database Diagnostic Monitor (ADDM) for Oracle and the Resource Advisor for SQL Server. But there is still room for improvement. Most of the recent approaches are related to physical design tuning. This paper gives an overview of this field of research. In the following sections, in addition to physical design tuning and self-tuning memory management, we describe logical design tuning.

II. RELATED WORK
A survey on self-tuning database systems is given in [1]. The authors focus on their project AutoAdmin, which is developed at Microsoft Research. An overview of the self-tuning architecture used by Oracle is presented in [2] and [3]. It refers to the self-managing capabilities of Oracle 10g.
In order to allow consistent measurement of performance increases caused by changes in different subsystems, a new measure called Database Time is introduced. It is defined as the sum of the time spent inside the database on processing user requests. Oracle 10g continuously collects statistical data. The Automatic Database Diagnostic Monitor (ADDM) automatically detects performance issues and suggests recommendations to maximize the total database throughput. The ADDM requires appropriate data in order to operate correctly. Therefore time measurements are taken, data for active sessions are collected, changes to the database settings are logged and, for estimating the impact of a particular change, data are generated by simulation. In Oracle 10g tuning is done by the component that is responsible for selecting the execution plan, called the Automatic SQL Tuning Advisor. It is able to compute the cost of an execution plan considering a hypothetical access path, known as what-if analysis. Indexes are only created if performance is expected to improve by a large factor. A short introduction to the topic is given by the report on the workshop on self-managing database systems [4]. A brief overview of recent self-tuning technology is given in [5]. It pleads for a change in database development: the author argues that databases should consist of simple, independent, self-tuning components, which would simplify database development and make database performance more predictable.

III. SELF-TUNING MEMORY MANAGER
As a general rule primary storage is an order of magnitude faster than secondary storage, but its price is significantly higher as well. Thus on most platforms the size of primary storage is just a fraction of the size of secondary storage, and hence the provident use of primary storage is of great importance. This section addresses buffer tuning as an important avenue for database tuning.
In [6] the Self-Tuning Memory Manager (STMM) for DB2 is introduced, which provides adaptive tuning of database memory. The features of the STMM address the following problems: For database administrators (DBAs) with insufficient knowledge of the memory usage of the database, tuning is a complicated task. The memory requirements depend on the workload; even for an experienced DBA it may be unfeasible to tune the database for an unknown workload. Monitoring the workload allows the STMM to find the correct settings. In many scenarios the workload changes over time; the system recognizes such changes and adapts the configuration accordingly. Tuning a database is time-consuming and expensive. The STMM saves the DBA's time, although
the system achieves performance levels similar to an expertly tuned system. Manual memory tuning is normally done incrementally, starting with a commonly good configuration and adjusting the parameters to the workload. The STMM works in a similar way: it adapts the total amount of memory available to each database, and the available memory is distributed among the different areas of the database system, e.g., the memory for sort, hash join and buffer pools. If the workload changes, the configuration is adjusted within a reasonable amount of time; the author considers up to one hour, depending on the workload. A self-tuning cache is considered in [7]. The approach focuses on reducing energy consumption, as energy costs are an important parameter today from an economical point of view. It therefore adjusts the tuning interval to the requirements of the system. Another approach regarding self-tuning database buffers is presented in [8]. It addresses the following issues: In spite of dropping memory prices, there are always competing demands for memory space. On one large system, workloads with different priorities may run; here it may be useful to shrink some buffers so as to accelerate tasks of higher priority. A changing workload requires continuous adaptation of the buffer sizes. The presented approach is based on a buffer miss equation derived from an analytical model: the available data are fitted to the equation, and the equation is then used for tuning calculations.

IV. PHYSICAL DESIGN
In the following sections we describe different approaches regarding physical database design. They implement a control loop that realizes a gainful adaptation of various settings by monitoring [9] the current system performance.

Figure 1. Control loop for physical design tuning: a self-tuning component monitors system performance and the execution costs reported by the query processor and adapts the physical design.

A. Self-tuning indexes
An important property of an RDBMS is physical data independence.
Physical data independence allows physical structures like indexes to change without affecting the results of queries, though such changes may affect efficiency. In [10] a solution named COLT (Continuous On-Line Tuning) is presented that maximizes query performance by continuously adjusting the system configuration to the incoming queries under a given storage budget. To this end the current query load is steadily monitored and the configuration is changed online. An implementation of the framework has been done for the PostgreSQL database system. It consists of two components, a query optimizer and a self-tuning module. The query optimizer is able to compute the gain in efficiency for plans considering hypothetical indexes, referred to as what-if plans. The self-tuning module adjusts the set of materialized indexes to gain performance. To gather information about the current needs, it requests the query optimizer to profile indexes. It thus has to perform two tasks: on the one hand it chooses candidates to profile, on the other it maintains a set of materialized indexes that fit within the storage budget. Therefore it manages three sets of indexes: the first set contains materialized indexes, the second keeps interesting index candidates, while the third holds candidates that are not expected to improve performance. Information is gathered over a period of 10 requests; after such a period the candidates are re-examined and changes may be made to the materialized indexes. A similar approach has been presented in [11]. Indexes are created to improve the execution time of queries, assuming an index pool limited in space. For given queries the benefit of non-existent indexes, stored in a set of virtual indexes, is determined. If the cumulative profit for a certain index is high enough, it is created and added to the index pool, where it replaces another index if not enough space is available.
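The virtual-index bookkeeping just described can be sketched as follows. This is a minimal sketch, not the algorithm from [11]: the profit threshold, index sizes and what-if profit numbers are illustrative assumptions, and a real system would obtain the profits from the optimizer's what-if interface.

```python
# Hedged sketch of index-pool maintenance with virtual indexes: per-query
# what-if profits accumulate for non-existent (virtual) indexes; an index is
# materialized once its cumulative profit crosses a threshold, evicting the
# least profitable pool member if space runs out. All numbers illustrative.

THRESHOLD = 10.0
POOL_SPACE = 100                      # pages available for the index pool

pool = {}                             # materialized: index -> (size, profit)
virtual = {}                          # virtual: index -> cumulative profit

def observe_query(whatif_profits, sizes):
    # whatif_profits: {index: what-if profit of this index for one query}
    for idx, profit in whatif_profits.items():
        if idx in pool:
            continue
        virtual[idx] = virtual.get(idx, 0.0) + profit
        if virtual[idx] >= THRESHOLD:
            _materialize(idx, sizes[idx], virtual.pop(idx))

def _materialize(idx, size, profit):
    # Evict the least profitable pool members until the new index fits.
    while pool and sum(s for s, _ in pool.values()) + size > POOL_SPACE:
        victim = min(pool, key=lambda i: pool[i][1])
        del pool[victim]
    pool[idx] = (size, profit)
```

For example, an index that earns a profit of 2.5 on each of four queries crosses the threshold on the fourth observation and is materialized; a later, more profitable candidate that does not fit alongside it evicts it from the pool.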
In addition, the paper proposes a self-tuning index structure: an adaptable binary tree. This tree is access-balanced; for data that is accessed more often, page access time is minimized. A more advanced approach is shown in [12]. It is implemented in PostgreSQL. The implementation consists of three components: the Index Advisor, the Soft Index Manager and the IndexBuildScan/SwitchPlan operators. A soft index is created by the system without interaction by the DBA, in contrast to the usual hard index. The key improvement is the IndexBuildScan operator, an extension of the TableScan operator, which builds up indexes while performing a table scan. To use the built index within the same query, the SwitchPlan operator has been implemented; it allows an IndexBuildScan operator to be swapped in for a TableScan operator in the execution plan. The approaches shown so far do not consider the order of the workload sequence. In [13] the order of the statements is taken into account. In many data warehouse scenarios the character
of queries changes completely between day and night. Considering such a change in workload allows an adaptation right before it appears; for example, every morning an index could be created that is dropped in the evening. It has to be considered that changes to the physical structure may be computationally expensive. Another example would be a table for the sales of the current quarter: new tuples are inserted each day into a table that is empty at the beginning of the quarter, and at its end all rows are deleted. The workload is modeled as a sequence of SQL statements. To find the optimal solution for a certain sequence, the sequence is modeled as a graph. The nodes are ordered in layers, and each layer represents one statement of the sequence. The nodes in each layer represent the possible sets of indexes. The nodes of adjacent layers are connected by edges. These edges stand for the creation or removal of an index, with its costs; these costs include the execution of the statement of the layer. If the set of indexes does not change from one statement to the next, the edge carries only the costs of the statement. The graph starts with a node that has an empty set of indexes and no statement to execute. A final node is introduced; all edges to it have no costs and it has no statement. The shortest path, i.e. the path from the first node to the last node with the least costs, yields the optimal chronology of index creation and deletion. An implementation has been done for the Database Tuning Advisor in Microsoft SQL Server. An example working on bitmap indexes is introduced in [14]. Bitmap indexes work well for attributes with a small number of distinct values, for instance gender. To store the information, bitmap indexes use bit arrays. For example, the entry for a man in the bitmap index for gender would have a bit set in the bitmap for male and no bit set in the bitmap for female. Queries are answered by performing bitwise logical operations on the bitmaps.
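The bitmap-index mechanics just described can be sketched in a few lines. This is a minimal illustration, not the structure from [14]; Python integers serve as the bit arrays, and the column data is invented.

```python
# Hedged sketch of a bitmap index on low-cardinality attributes. Each
# distinct value gets a bit array with one bit per row; multi-attribute
# selections combine the bitmaps with bitwise AND/OR.

def build_bitmaps(column):
    bitmaps = {}
    for row, value in enumerate(column):
        # Set the bit for this row in the bitmap of the row's value.
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

gender = build_bitmaps(["m", "f", "f", "m", "f"])
region = build_bitmaps(["north", "north", "south", "south", "north"])

# Selection on two attributes: gender = 'f' AND region = 'north'.
hits = gender["f"] & region["north"]
matching_rows = [row for row in range(5) if hits >> row & 1]
# matching_rows == [1, 4]
```

This also shows why such indexes shine for multi-attribute selections: the intersection is a single bitwise AND, regardless of how many rows qualify per attribute.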
As in the previous approaches, conclusions about the benefit of indexes are drawn from monitoring the workload. For index candidates it has to be considered that a bitmap index is only useful for a specific attribute if its domain of possible values is much smaller than the cardinality of the relation, and that bitmap indexes realize their full potential for selections on multiple attributes. In contrast to tree-based index structures, the costs for generating bitmap indexes are low, while the costs for updating them are relatively high. With these points in mind, hypothetical index candidates are generated and offered to the optimizer. The optimizer performs a what-if analysis with the candidates to determine their profit in comparison to the existing index configuration or no index usage. The creation of bitmap indexes could be done during periods of low workload or right before a query is executed; that way the query can benefit from the new index, but its response is delayed by the index creation. An approach based on a heuristic algorithm is presented in [15]. The algorithm works as follows: It finds a good set of candidate structures for every incoming query in the workload. Useful columns for index keys, or views which should be materialized, are guessed from the query structure. The initial candidate set is extended by merging two or more candidates to find a candidate that helps multiple queries while requiring less space. This space is searched and promising candidates are added incrementally to build a valid configuration. During the optimization of a single query, the optimizer issues several access path requests for indexes and materialized views. The optimal configuration is obtained by gathering all simulated physical structures generated during optimization, which essentially correspond to the union of the optimal structures for each index or view request.

B. Materialized Views
In an RDBMS a view is a virtual table that represents the result of a database query.
Queries or updates on the view are translated by the DBMS into queries or updates against the underlying tables. By contrast, a materialized view is cached as a concrete table. This table has to be updated if changes are made to the original tables, and if the materialized view is both readable and writable, changes made to the view have to be written back to the underlying tables. As the view is a real table, queries on it can be accelerated by physical structures like indexes. [16] addresses the selection of views to materialize for data warehouse solutions. Data warehouses integrate data from production databases as a set of views. Materializing views incurs large computational and space costs, hence selecting the appropriate set of views to materialize is important from a performance perspective. The view selection problem is NP-complete. The author proposes a simulated annealing approach, a heuristic algorithm that approximates an optimal solution, as an exhaustive search of the large space of configurations would be too costly. Simulated annealing works well for large combinatorial optimization problems like the Travelling Salesman Problem. It works by randomly selecting an initial configuration that is then iteratively and randomly perturbed. In contrast to hill climbing, a step that leads to a worse state than the current one may be accepted with a certain probability, but the probability of accepting such a change decreases over time. This increases the chance of finding the global minimum, although possibly only a local minimum is found. The author has compared the proposed algorithm to evolutionary genetic algorithms and other heuristic algorithms and identified an improvement in costs. A similar approach is considered in [17] for aggregation tables in the context of Online Analytical Processing (OLAP). Aggregation tables improve query performance
like materialized views: they contain the results of complex aggregation queries, and the task is to choose the optimal set of aggregation tables. The presented approach adapts the configuration dynamically to the current workload, whereby it considers the sequence of table creations and drops with their costs.

C. Clustering
In [18] a solution for automatic database clustering is presented. Clustering tables means saving rows that are usually queried together on the same page. In general the costs of an operation are determined by the pages accessed on secondary storage. Up to now clustering has had to be triggered by the DBA. The challenge is to find the appropriate tables to cluster; as re-clustering is expensive, it demands careful decision-making. Clustering only the frequently accessed parts of a table would be a possible approach. An implementation of the presented approach, called AutoClust, has been done. AutoClust performs attribute clustering by mining frequent closed item sets. A closed item set is a maximal item set contained in the same transactions. To make its decisions AutoClust does not collect statistics; instead it uses a query log, which records the attributes accessed, the number of tuples returned and the number of disk blocks accessed. AutoClust is triggered when a drop in query response time is detected. It then checks for bad record and attribute clustering, which are detected by an increased number of accesses to record and attribute clusters, respectively. If bad clustering is detected, AutoClust triggers a re-clustering process.

D. Partitioning
Besides index structures and clustering, partitioning provides an opportunity for performance improvements. Partitioning allows a table, index or index-organized table to be subdivided into smaller pieces. The pieces of the database objects are called partitions. Each partition may have its own storage characteristics.
As a result of distributing the pieces across independent storage devices, throughput increases, given that requested pages can be fetched in parallel. Recent approaches do not consider online partitioning, which is understandable, as changing the database partitioning is a complex and time-consuming task. Current contributions attempt to simplify the utilization of partitioning for the DBA. In [19] an approach is presented that searches for an optimal partitioning of a given database, considering the computation of splitters either online or offline. If two tables are often joined on a shared attribute, it tries to utilize a precomputed splitter set to make the join process more efficient. The problem is phrased as follows: find a set of splitters for the relations such that the biggest cost among all inequality partitions is minimized. A splitter set is optimal if it meets this condition. To find optimal splitters on a sorted column, a binary search is performed iteratively. The splitter algorithm has been implemented in C++.

V. LOGICAL DESIGN
In [20] logical self-tuning is introduced. Instead of adapting the physical structure, the database schema is changed by the system. The more a database is normalized, the longer join paths may become, resulting in more costly join operations. On the other hand, a less normalized database may contain many null values, and data redundancy may appear, which can lead to update anomalies. The problem is to trade off efficiency against redundancy. The aim is to keep third normal form or Boyce-Codd normal form but to minimize the join paths. To find the optimal schema, the author considers the workload. Changes to the database schema either necessitate a change in the application, which would mean a violation of the principle of logical data independence, or the old schema has to be simulated through a view that equals the original schema.
The author argues that updating relational views is a difficult problem that is still only weakly supported by major relational database management systems (RDBMS). The information for the tuning is gathered either from the SQL workload or from the data in the database. Changing the database schema is expensive, as the contained data have to be migrated.

VI. FUTURE CHALLENGES
The previous sections have shown that research on self-tuning database technology is still active. In [1] directions for further development are pointed out as follows: The results of performance tuning by the tools offered by the major vendors of commercial database systems are difficult to compare. Changes to the physical design are costly operations; the authors propose partial indexes and materialized views as a remedy. The special requirements of distributed databases have been neglected in the context of self-tuning. For self-tuning tasks the authors propose machine learning techniques, control theory and online algorithms. In my opinion the integration of the promising approaches into common DBMSs will be the biggest challenge.

VII. CONCLUSION
Manual tuning of databases means a large investment of time for the database administrator. Self-tuning database systems reduce these costs, and several approaches that address this problem have been published. Most of the present contributions consider physical tuning: in a control loop the system settings, such as the memory settings or the index configuration, are adapted to the changing workload. Besides those approaches, there have been efforts to achieve self-tuning by adapting the logical structure of the database. Furthermore, performance is gained
by storing the results of costly operations, which can concern complex views or aggregation tables. Even though self-tuning tools are part of recent major database systems, there are still open issues.

REFERENCES
[1] S. Chaudhuri and V. Narasayya, "Self-tuning database systems: a decade of progress," in VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB Endowment, 2007.
[2] K. Dias, M. Ramacher, U. Shaft, V. Venkataramani, and G. Wood, "Automatic performance diagnosis and tuning in Oracle," in CIDR, 2005.
[3] B. Dageville and K. Dias, "Oracle's self-tuning architecture and solutions," IEEE Data Eng. Bull., vol. 29, no. 3.
[4] A. Ailamaki, S. Babu, P. Furtado, S. Lightstone, G. M. Lohman, P. Martin, V. R. Narasayya, G. Pauley, K. Salem, K.-U. Sattler, and G. Weikum, "Report: 3rd int'l workshop on self-managing database systems (SMDB 2008)," IEEE Data Eng. Bull., vol. 31, no. 4, pp. 2-5.
[5] S. Chaudhuri and G. Weikum, "Rethinking database system architecture: Towards a self-tuning RISC-style database system," in VLDB, 2000.
[6] A. J. Storm, C. G. Arellano, S. S. Lightstone, Y. Diao, and M. Surendra, "Adaptive self-tuning memory in DB2," in VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006.
[7] A. Gordon-Ross and F. Vahid, "A self-tuning configurable cache," in DAC '07: Proceedings of the 44th Annual Conference on Design Automation. New York, NY, USA: ACM, 2007.
[8] D. N. Tran, P. C. Huynh, Y. C. Tay, and A. K. H. Tung, "A new approach to dynamic self-tuning of database buffers," Trans. Storage, vol. 4, no. 1, pp. 1-25.
[9] A. Thiem and K.-U. Sattler, "An integrated approach to performance monitoring for autonomous tuning," in ICDE, 2009.
[10] K. Schnaitter, S. Abiteboul, T. Milo, and N. Polyzotis, "COLT: continuous on-line tuning," in SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2006.
[11] K.-U. Sattler, E. Schallehn, and I. Geist, "Towards indexing schemes for self-tuning DBMS," in Data Engineering Workshops, 22nd International Conference on, p. 1216.
[12] M. Lühring, K.-U. Sattler, E. Schallehn, and K. Schmidt, "Autonomes Index-Tuning: DBMS-integrierte Verwaltung von Soft-Indexen," in BTW, 2007.
[13] S. Agrawal, E. Chu, and V. Narasayya, "Automatic physical design tuning: workload as a sequence," in SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2006.
[14] A. Lübcke, "Self-Tuning für Bitmap-Index-Konfigurationen," in BTW Studierendenprogramm, 2007.
[15] N. Bruno and S. Chaudhuri, "Automatic physical database tuning: a relaxation-based approach," in SIGMOD '05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 2005.
[16] R. Derakhshan, F. Dehne, O. Korn, and B. Stantic, "Simulated annealing for materialized view selection in data warehousing environment," in DBA '06: Proceedings of the 24th IASTED International Conference on Database and Applications. Anaheim, CA, USA: ACTA Press, 2006.
[17] K. Hose, D. Klan, and K.-U. Sattler, "Online tuning of aggregation tables for OLAP," in Data Engineering, International Conference on, 2009.
[18] S. Guinepain and L. Gruenwald, "Research issues in automatic database clustering," SIGMOD Rec., vol. 34, no. 1.
[19] K. A. Ross and J. Cieslewicz, "Optimal splitters for database partitioning with size bounds," in ICDT '09: Proceedings of the 12th International Conference on Database Theory. New York, NY, USA: ACM, 2009.
[20] F. De Marchi, M.-S. Hacid, and J.-M. Petit, "Some remarks on self-tuning logical database design," in ICDEW '05: Proceedings of the 21st International Conference on Data Engineering Workshops. Washington, DC, USA: IEEE Computer Society, 2005.

Current State and Future Challenges of Real-Time ETL
Hagen Schink
Department of Technical and Operational Information Systems (ITI), Otto-von-Guericke University Magdeburg, Germany

Abstract: Business becomes faster every day, i.e. business opportunities and problems appear and pass frequently. A competitive business has to react to them as they arise. To react as fast as possible, decision makers use sophisticated business intelligence (BI) tools. These tools use data obtained by monitoring the business process itself; usually data warehouses (DWH) store and provide this data. To support fast and profound business decisions, DWHs have to provide data that is as up-to-date as possible. The operation which fills the DWH with the newest data is called extraction, transformation and loading (ETL). That is why speeding up the ETL process to provide real-time ETL is an important issue. This paper gives an overview of the current state and future challenges of real-time ETL. This includes a short introduction to ETL and to the problems arising with real-time ETL, followed by a presentation of recent work on real-time ETL. The paper ends with a discussion of further problems and future challenges for the work on real-time ETL, as well as suggestions for future investigations.

Keywords: real-time ETL, business intelligence, active data warehouse

I. INTRODUCTION
While business becomes faster, business intelligence (BI) becomes crucial. Decision makers have to react as fast as possible to new business opportunities and problems to stay competitive. They can use advanced tools which help to gather new information from data obtained by monitoring the business process itself. This data is usually stored in data warehouses (DWH). The process responsible for filling the DWH with data is called extract, transform and load (ETL). It is important to speed up this operation to provide up-to-date data for the BI process.
That is why real-time ETL has become a big topic in recent years. This paper gives an overview of the current state and future challenges of real-time ETL. For that purpose, the paper introduces the ETL process itself and describes the problems arising for a real-time ETL approach, followed by a presentation of recent work on real-time ETL. The work ends with a discussion of future problems and challenges for further investigations into real-time ETL. The discussion summarizes the challenges for real-time ETL and its sub-processes, namely extraction, transformation and loading. The remainder of this paper is structured as follows. In Section II related work is introduced. Section III gives an introduction to ETL and real-time ETL. Current work on real-time ETL is presented in Section IV. Section V describes the issues and further questions raised by the work presented in Section IV. Finally, Section VI concludes the paper.

II. RELATED WORK
In this section we give an overview of related work that also summarizes ETL and real-time ETL approaches. There are already several books, e.g. [1], [2], that cover the topics of ETL and real-time ETL. These books do not include recent work on real-time ETL or a discussion of future investigations. To the best of our knowledge there is only one book [3] that includes a chapter about the current state of real-time ETL. Unfortunately, it was not possible for the authors to obtain a copy of this book, which is why we cannot give an adequate comparison of the two works.

III. BACKGROUND
In this section we introduce the ETL process [2], real-time ETL and related terms. The ETL process is responsible for the extraction of data from one or many source systems, the transformation of the data and its loading into the DWH. The extraction process loads data from the source systems. Source systems are, e.g., databases or other systems involved in the business process.
The extraction has to be done in a manner that does not put heavy load on the source systems; this is important because the main business process must not be disturbed. During the transformation process the data is converted into the schema of the DWH, because DWHs are created for different purposes and are thus likely to have a schema that differs from those of the operational systems. The purpose of the data stored in a DWH is to provide a history of the business process; this history builds the basis for the work of business analysts. The data is also cleaned and checked, to make sure that only data of high quality is used in the DWH. The load process finishes the whole ETL process by loading the data into the DWH.
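The three sub-processes just described can be sketched on in-memory data. This is a minimal illustration of the extract, transform and load steps; the source rows, warehouse schema and field names are invented for the example.

```python
# Hedged sketch of the three ETL sub-processes: extract rows from a "source
# system", transform them into the warehouse schema (with a basic cleaning
# check), and load them into the DWH table. All names are illustrative.

source_system = [
    {"id": 1, "amount": "19.99", "country": "de"},
    {"id": 2, "amount": "bad-value", "country": "us"},  # fails cleaning
]

def extract(source):
    # A real extractor would read incrementally to keep the source load low.
    return list(source)

def transform(rows):
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])     # data-quality check
        except ValueError:
            continue                          # a real system logs rejects
        clean.append({"sale_id": row["id"],
                      "amount_eur": amount,
                      "region": row["country"].upper()})
    return clean

def load(dwh, rows):
    dwh.extend(rows)

warehouse = []
load(warehouse, transform(extract(source_system)))
# warehouse now holds one cleaned row for sale 1; the malformed row is gone
```

The rejected second row illustrates the cleaning step: only data of high quality reaches the warehouse, while malformed records are filtered out during transformation.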

The ETL process is normally started when the source systems are quiescent, so the business process itself is not disturbed by the data obtainment. During this time, or update window, the data is processed in a bulk or batch process. This separation between updating and using the DWH also prevents changes to the data while it is queried. Because of the size of DWHs, updates during a query can have a big impact on the performance of the data warehouse. ETL solutions are normally implemented by so-called ETL scripts, which handle the extraction, transformation and loading of data. These scripts are handmade, but vendor-specific solutions exist as well. Due to this approach the data in a DWH is only up-to-date if updates are performed frequently. To support just-in-time decision making, it is important to minimize the gap between the occurrence of a certain business event and its appearance in the DWH. Approaches to this goal are summarized under the term real-time or near real-time ETL. Hard real-time constraints are not needed in every use case; most of the time it is sufficient if the information is available in right-time [4]. That is why we use the terms real-time and near real-time interchangeably throughout this paper. Real-time ETL tries to bring the business process and the meta-information about it closer together. There exist at least three approaches [5] to achieve real-time or near real-time ETL: minimizing the update window, data/table swapping, and continuous data integration. Minimizing the update window is the most straightforward way to reduce the gap between the occurrence and the obtainment of a business event by the ETL process. To this end, data is mostly written directly to disk, avoiding standard database mechanisms like transactions. It is therefore not recommended to work on the DWH during the update process. Data/table swapping overcomes this problem by providing separate tables for the update and for the query/production process.
If a table is updated, it is swapped into production while the previous production table is released. The continuous data integration approach handles continuous data streams that are loaded into the DWH through regular database transactions. This ensures a low latency and keeps the load on the DWH down, as only small pieces of data are loaded into the data warehouse at a time. In this section we saw that ETL is a basic process consisting of three subprocesses. These processes ensure that data is obtained, adjusted and loaded into a DWH. A standard ETL process runs during an update window, which specifies the time within which no regular load is placed on the operational systems and the DWH. Real-time or near real-time ETL minimizes the gap between the state of the DWH and the operational systems.

IV. CURRENT WORK

In this section we present selected work [5] [7] on the field of real-time ETL, as well as work [8] [10] that connects real-time ETL with other topics and issues. First of all, we present a proposal for an integrated near real-time ETL approach with J2EE (Java 2 Platform, Enterprise Edition). This is followed by a summary of the paper presenting the MESHJOIN algorithm, which was developed specifically for time- and memory-constrained real-time ETL environments. A description of an implemented real-time ETL environment for semi-structured text files follows directly. Afterwards, we present a description of the future role of ETL and issues connected to it regarding business intelligence (BI). A proposal for a system using the MESHJOIN algorithm is presented in the next part. At the end we present proposals for a real-time ETL architecture that ensures real-time data analysis. We focus directly on the opportunities and issues stated for real-time ETL. If the statements in the papers are embedded in a bigger topic, we outline it briefly. A.
J2EE based near real-time ETL infrastructure

In [5] Schiefer and Bruckner present an integrated near real-time ETL approach based on the J2EE architecture. The J2EE architecture provides means to implement portable, robust, scalable and secure server-side Java applications. The approach of Schiefer and Bruckner consists of three components: event adapters, ETLets and evaluators. Event adapters are used to extract the data and unify the different data formats obtained from the source systems. ETLets are objects which represent the different tasks performed on the data extracted by the event adapters. These components are also responsible for the load of data into the DWH. The authors also propose different kinds of ETLets: with respect to the event that triggers an ETLet, they distinguish event-driven, scheduled and exception ETLets. Evaluators are the last part of the proposed system. They are able to work on the metric types provided by the different ETLets. The paper also provides an evaluation of the proposed system. The authors make clear that their proposed system is not as efficient as a second one using a vendor-specific tool. In the summary of advantages and drawbacks the authors show that one of the biggest advantages is at the same time a drawback: the standardized connectors of J2EE do not use advanced capabilities of a certain database. Hence, the stated approach will reach its limits faster compared to vendor-specific tools, as the latter are able to handle more events. On the other side, the paper highlights advantages of the J2EE platform like the power of the Java programming language in comparison to ETL scripting languages, platform independence and the overall flexibility.
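To make the division of labor among the three components more tangible, here is a rough transliteration of the adapter/ETLet/evaluator pipeline into a few lines of Python. The class and method names are our own shorthand, not the API of [5]:

```python
# Sketch of the adapter -> ETLet -> evaluator pipeline described above,
# transliterated to Python; all names are our invention, not the authors' API.

class EventAdapter:
    """Unifies source-specific records into a common event format."""
    def adapt(self, raw):
        return {"type": raw[0], "value": raw[1]}

class ETLet:
    """One transformation task, triggered per event (an 'event-driven' ETLet)."""
    def __init__(self, warehouse):
        self.warehouse = warehouse
    def process(self, event):
        self.warehouse.append(event)   # load step
        # Emit a metric type for the evaluators to work on.
        return {"metric": "events_loaded", "value": len(self.warehouse)}

class Evaluator:
    """Works on the metric types the ETLets provide."""
    def evaluate(self, metric):
        return metric["value"]

warehouse = []
adapter, etlet, evaluator = EventAdapter(), ETLet(warehouse), Evaluator()
for raw in [("order", 10), ("order", 20)]:
    print(evaluator.evaluate(etlet.process(adapter.adapt(raw))))  # prints 1, then 2
```

The point of the sketch is the decoupling: only the adapter knows the source format, only the ETLet touches the warehouse, and the evaluator sees nothing but metrics.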

B. MESHJOIN

The need for extracting, transforming and loading data on the fly to prevent time-consuming disk I/O is the topic of Polyzotis, Skiadopoulos, Vassiliadis and Simitsis in [6]. The proposed algorithm allows the extracted data to be retained in memory. This is important because unnecessary disk I/O would prevent the data from being processed within the near real-time constraints. The algorithm assumes that only a part of the join relation is stored in main memory. This is due to the fact that join relations are often much bigger than main memory. The idea is to perform a cyclic load of the join relation into main memory, one partition at a time. Each extracted tuple is joined with the partition of the join relation currently in memory. Once a tuple has been joined with all partitions, it is discarded from memory and can be made available for further processing steps. The authors also introduce a sophisticated cost model. Depending on the available memory or the arrival rate of the extracted tuples, the cost model allows the algorithm to be tuned to meet certain constraints. The authors prove their cost model and the efficiency of the MESHJOIN algorithm. In comparison to the Indexed Nested Loop (INL), the results show that the MESHJOIN algorithm outperforms the INL on synthetic and real-life data.

C. Real-time ETL for semi-structured text files

Viana, Raminhos and Pires describe a productive system which extracts space weather and spacecraft data from semi-structured text files in [7]. They focus on the data processing module (DPM) of the system, especially the uniform data extractor and transformer (UDET). The DPM is responsible for the file retrieval and the extraction and transformation of the data within 5 minutes. The system processes files as follows. First, the files are retrieved and stored in a local file cache. Afterwards, the UDET module applies an ETL process to the files. UDET uses file format definitions (FFD), which are descriptions of the structure of a certain file type.
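Stepping back to section IV-B for a moment, the cyclic in-memory join at the core of MESHJOIN can be sketched as follows. This is a strong simplification under our own assumptions (unique join keys, a fixed admission batch size) and omits the paper's cost model entirely:

```python
from itertools import cycle, islice

# Simplified sketch of the MESHJOIN idea (section IV-B): the join relation is
# cycled through memory one partition at a time, and a stream tuple is emitted
# only after it has been probed against every partition. Names are our own.

def meshjoin(stream, relation, num_parts, batch_size=2):
    keys = list(relation)
    size = max(1, -(-len(keys) // num_parts))      # ceil division
    parts = [keys[i:i + size] for i in range(0, len(keys), size)]
    it, part_ids = iter(stream), cycle(range(len(parts)))
    waiting, out = [], []                          # waiting: (tuple, partitions left)
    while True:
        # Admit the stream tuples that arrived since the last iteration.
        waiting.extend((t, len(parts)) for t in islice(it, batch_size))
        if not waiting:
            break                                  # stream drained, all tuples joined
        # Load the next partition of the join relation into memory.
        resident = {k: relation[k] for k in parts[next(part_ids)]}
        still_waiting = []
        for (key, payload), remaining in waiting:
            if key in resident:
                out.append((key, payload, resident[key]))
            if remaining > 1:                      # has not yet seen all partitions
                still_waiting.append(((key, payload), remaining - 1))
        waiting = still_waiting                    # fully probed tuples are discarded
    return out

rows = meshjoin([(1, "x"), (3, "y"), (2, "z")], {1: "a", 2: "b", 3: "c", 4: "d"}, 2)
print(sorted(rows))  # [(1, 'x', 'a'), (2, 'z', 'b'), (3, 'y', 'c')]
```

Because the partitions cycle, every waiting tuple meets the whole relation within one full cycle, which is exactly what bounds the memory footprint.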
Afterwards, UDET pushes the extracted data to the next system components. The provided test results show that the presented system is capable of meeting the stated requirement of processing files within 5 minutes. The average processing time is 11.83 seconds for a daily total of 4000 files.

D. Real-time ETL and real-time business intelligence

In [8] Azvine, Cui, Nauck and Majeed discuss current issues of BI. They give an outline of their vision of real-time business intelligence (RTBI). The authors ascertain that data warehouse technology remains important for providing data to analytical tools in the future. Hence, ETL will become a bottleneck for the realisation of RTBI. The authors propose real-time ETL as a solution for the bottleneck because of its capability to fetch and process data from operational sources in real time. Furthermore, they propose the availability of metadata and semantics on the side of the data source. With this additional information it would be possible to connect to different data sources flexibly. The idea is that the analyst chooses the data sources for his DWH. This information then provides the basis his analytical tools work with. The authors state that it will also be possible to extract data from, e.g., e-mails, reports, and feedback in the future.

E. Near real-time ETL for business process data

Castellanos, Simitsis, Wilkinson and Dayal present an implementation to process business process data in near real-time in [9]. This implementation uses the MESHJOIN algorithm described in section IV-B. To distinguish between new data and already processed data, the system uses an operator called diff. Internally, diff uses the MESHJOIN algorithm. The problem is that MESHJOIN assumes the join relations to be static. To handle this issue, the system uses a log that stores the state of the data source, and a buffer that stores changes to the data source state and reduces the load on the log.
With the introduction of the buffer, the log can be updated whenever no data is being processed, so the join relations can stay static.

F. Real-time ETL in a business intelligence architecture

In [10] Nguyen, Schiefer and Tjoa state that the introduction of real-time ETL reveals new problems for a standard data warehouse infrastructure. The main problem is that continuous updates of the DWH would probably interfere with running queries, sophisticated index structures, etc. To overcome this issue, the authors propose the usage of a real-time data cache: arriving data is continuously stored in the real-time data cache, which also provides the updates for the data warehouse itself. The combination of a DWH, a real-time data cache (called real-time data store) and a logic that combines the two databases is called a zero latency data warehouse. The real-time data store gets its data from an analysis server that works on the stream of event data provided by the source systems.

V. OPEN ISSUES AND FURTHER QUESTIONS

In this section, we present open issues and further questions that are raised by the previously presented work. First of all, open issues and further questions on real-time ETL itself will be discussed. This is followed by a discussion of the single processes extraction, transformation and loading. The papers [5], [7], [10] show in comparison that the term real-time ETL is used in slightly different ways. In [5] the process ends with the load of the data into a DWH. The system presented in [7] loads the extracted and transformed data into a cache before the load into a DWH is started. Finally, [10] introduces an active DWH. The introduction of an active DWH implies that real-time ETL processes need a data store which is capable of handling the continuous arrival of new data. This issue needs to be explicitly addressed. In analogy to ETL, we should define that the real-time ETL process ends with the availability of the extracted data for analyses. With regard to current technology this means that the data is available in a DWH.

A. Extraction

The papers [5], [8] address the issue of extracting data from different types of data sources, e.g., databases or e-mails. The work presented in [5] utilizes the resource adapters provided by the J2EE environment and third-party vendors respectively. This means that data-source-specific code is required to retrieve new data. Because of the real-time constraints it would be interesting to see a more general, message-based approach. We think that a real-time ETL system would become more efficient and easier to implement if the data sources were aware of the ETL process and provided means to support it. An interesting question is whether a more general and ETL-aware approach can provide the basis to standardize the extraction process. With a standardized protocol for the communication between the data sources and the ETL process, it would be easier to implement the extraction process. Together with the meta and context data proposed in [8], the data obtainment can become more flexible.

B. Transformation

In [6], [7] the authors present algorithms and means that support the transformation process. The file format definition (FFD) supports the transformation of data into a different structure. A natural question is whether the file format definition can be made more general in a way that the transformation of data can be done automatically. The authors of [9] present a technique to generate a mapping between the schemata of the data sources and the DWH. But this approach does not support an on-the-fly mapping.
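To illustrate what such a message-based, ETL-aware communication between source systems and the ETL process could look like, here is a purely hypothetical change-event format; every field name is our invention and no existing protocol is implied:

```python
import json

# Hypothetical change-event message a data source could emit to make the
# extraction process source-agnostic; all field names are invented.

def make_change_event(source, table, operation, row):
    return json.dumps({
        "source": source,                  # identifies the operational system
        "table": table,
        "op": operation,                   # "insert" | "update" | "delete"
        "row": row,
        "meta": {"schema_version": 1},     # meta/context data as proposed in [8]
    })

msg = make_change_event("crm", "orders", "insert", {"id": 7, "total": 99.0})
event = json.loads(msg)
print(event["op"], event["row"]["id"])  # insert 7
```

With a format like this, the ETL process would no longer need data-source-specific extraction code; the attached schema metadata could also feed a more automatic mapping on the transformation side.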
In an ETL-aware environment it would also be possible to decrease the transformation effort to a minimum, because the data sources would be able to provide only the important data, that is, the data the DWH is interested in. Ideas similar to those stated in section V-A could be applied. This includes the idea of a message-based approach and a standardized protocol to be able to agree on the data provided by the data source.

C. Loading

At the beginning of section V we stated that there are differences in the definition of the purpose of the load process. We proposed that, according to the definition of ETL, the load process ends with the availability of the data for the analytical process. The paper [10] shows problems arising when loading data in real time. Furthermore, [5] states that there is no general approach to load data into a DWH that utilizes sophisticated and efficient load tools, because only vendor-specific ones are available. Due to the different ways a real-time ETL process has to treat extracted data, an investigation of the usefulness of batch-oriented or vendor-specific tools is appropriate. Assuming that a real-time ETL process creates a sequence of new data tuples, a data storage is needed that supports this workflow and allows a concurrent analysis of the stored data. An investigation of the load process with respect to the data storage is also appropriate, as the advent of active data warehouses shows [10].

VI. CONCLUSION

In this paper, we gave a short introduction to ETL and real-time ETL. We showed that real-time ETL needs different approaches, in comparison to ETL, to be able to meet the specific real-time constraints. We also outlined possible ideas for the implementation of real-time ETL. A presentation of different work on real-time ETL followed. We presented three papers directly related to real-time ETL and three papers which presented real-time ETL in a bigger context, and summarized the results and statements related to real-time ETL.
In the discussion on open issues and further questions we presented several problems for each subprocess of the real-time ETL workflow. We also gave suggestions for further investigations and improvements. The paper showed existing potential to increase the flexibility, efficiency, and performance of real-time ETL processes. We highlighted the missing integration of all participants of the ETL process, that is, of the source systems, as a major drawback. As a topic for future investigations, we suggest a message-based connection to the source data and the integration of meta and context data of the data sources into the ETL process. Regarding the integration of BI into the workflow of companies, real-time ETL plays and will play an important role. The automatic optimization of business processes, based on analytical results, needs real-time ETL to fulfill the requirements of a competitive business.

REFERENCES

[1] R. Kimball and J. Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. John Wiley & Sons.
[2] V. Rainardi, Building a Data Warehouse: With Examples in SQL Server. Springer.
[3] S. Kozielski and R. Wrembel, New Trends in Data Warehousing and Data Analysis. Springer Publishing Company, Incorporated.
[4] S. Rizzi, A. Abelló, J. Lechtenbörger, and J. Trujillo, "Research in data warehouse modeling and design: dead or alive?" in DOLAP '06: Proceedings of the 9th ACM International Workshop on Data Warehousing and OLAP. New York, NY, USA: ACM, 2006.

[5] J. Schiefer and R. M. Bruckner, "Container-managed ETL applications for integrating data in near real-time," in ICIS. Association for Information Systems, 2003.
[6] N. Polyzotis, S. Skiadopoulos, P. Vassiliadis, A. Simitsis, and N.-E. Frantzell, "Supporting streaming updates in an active data warehouse," in ICDE. IEEE, 2007.
[7] N. Viana, R. Raminhos, and J. M. Pires, "A real time data extraction, transformation and loading solution for semi-structured text files," in EPIA, ser. Lecture Notes in Computer Science, C. Bento, A. Cardoso, and G. Dias, Eds. Springer, 2005.
[8] B. Azvine, Z. Cui, D. D. Nauck, and B. Majeed, "Real time business intelligence for the adaptive enterprise," in CEC-EEE '06: Proceedings of the 8th IEEE International Conference on E-Commerce Technology and the 3rd IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services. Washington, DC, USA: IEEE Computer Society, 2006, p. 29.
[9] M. Castellanos, A. Simitsis, K. Wilkinson, and U. Dayal, "Automating the loading of business process data warehouses," in EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology. New York, NY, USA: ACM, 2009.
[10] T. M. Nguyen, J. Schiefer, and A. M. Tjoa, "Zelessa: an enabler for real-time sensing, analysing and acting on continuous event streams," Int. J. Bus. Intell. Data Min., vol. 2, no. 1.

Current Trends and Problems in the Area of Data Warehousing (Summary)
Dmytro Pyvovar, Otto-von-Guericke Universität Magdeburg, Germany

In this article I describe future trends in the field of data warehousing, how they might look, and the technologies applied in the areas I have selected:
1. A data warehouse stores highly sensitive data that is used for decision making and must therefore be protected from unauthorized access.
2. Many companies need analysis capabilities that support tactical decisions of the daily business in real time. This current trend is called real-time analytics (RTA) or active data warehousing (ADW). Application areas for RTA are business processes in which interaction with customers and business partners is of great importance, e.g., the Internet, mobile services, and electronic securities trading.
3. There are numerous business intelligence (BI) products on the market, both industry-specific and cross-industry. Such a product covers a substantial part of the BI reference architecture. Two selection methods exist: best-of-breed and tool suite. Each method has its advantages and disadvantages.

1. Introduction
So that the problems and technologies can be understood, I will first explain some basic terms: A data warehouse is a subject-oriented, time-variant, integrated, and non-volatile collection of data whose contents can be analyzed for management decisions. Subject-oriented means everything about customers, products, etc.; time-variant refers to the periodic addition of current data and its aggregation over time intervals; integrated means the consolidation of data from different operational systems; non-volatile means that data, once stored, is not changed anymore [1]. Business intelligence (BI) is the process of transforming data into information and further into knowledge. Decisions and forecasts rely on this knowledge and thereby create added value for a company. A data warehouse (DW) often forms the basis for the implementation of a BI solution [3].

Figure 1. RTA architecture.

OLAP: multidimensional analysis. With OLAP (Online Analytical Processing) tools, an interactive analysis technique has become established in which the measures to be analyzed are organized along dimensions. The dimensions are usually hierarchical, i.e., they contain several aggregation levels. An important dimension, which already follows from the historization of the data, is time. Navigation in OLAP applications is performed with so-called slice/dice, drill-down/roll-up and other operations [3]. Drill-down and roll-up: stepwise refinement or aggregation of analysis results, for example from yearly over monthly to daily evaluations; the aggregation is implemented with grouping sets and the cube operator [3]. Slice and dice: navigation in a multidimensional data space by focusing on individual aspects, for example the distribution of revenues for a certain product across different regions and periods [3]. Real-time analytics (RTA) or active data warehousing: analysis capabilities supporting tactical decisions of the daily business in real time [2]. Data marts: partial views of a data warehouse.

2. Problem Statement
2.1 Security
On the one hand, a basic principle of the data warehouse is to make the data as easily accessible for analyses as possible; on the other hand, this data is often highly sensitive and used for decision support. It is often argued that security can be neglected because only top management is involved. This statement is wrong if only because data warehouses are by now also being opened up to lower management levels and even to customers and partners with access via portals and intra-/internet solutions. The classical security requirements of an information system are confidentiality, integrity and availability. I will concentrate here entirely on confidentiality, i.e., the protection against unauthorized data access.

2.2 Real-Time Analytics (RTA) or Real-Time Data Warehousing (RTDW)
The data warehouse structure used for strategic analyses is not applicable to operational decision making, because RTA must above all be optimized for fast processing. During decision making, attention is paid not only to the quality of the data but also to the response time (the execution time of a query).

2.3 Tool Selection
Given the variety of products on the market, it is hard to find exactly the right tool. There are general procedures that are supposed to help with this.

3. Solution Descriptions
3.1 A Security Concept for Access Control
Access control must be one of the basic functions that trustworthy IT systems offer in order to guarantee the confidentiality and integrity of information. We consider a set of objects (e.g., data and documents), subjects (e.g., users and applications), and possible operations (e.g., read, write and execute). Authorization deals with the management of access rights, which define which operations a user may perform on a certain object. Access control, as shown in Figure 2, is automated: permissions are granted or revoked after authorization, and when someone tries to access a protected object, their rights are checked by the access function.

Figure 2. Access control.
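The model behind Figure 2 (subjects, objects, operations, and an access function) can be sketched in a few lines. This is a generic illustration with invented names, not taken from any of the cited systems; it implements a closed policy, i.e., everything not explicitly granted is denied:

```python
# Minimal access-control sketch: subjects, objects, operations, and an
# access function over granted (subject, operation, object) triples.
# Names are illustrative; closed policy: unlisted rights are denied.

rights = set()

def authorize(subject, operation, obj):
    rights.add((subject, operation, obj))

def revoke(subject, operation, obj):
    rights.discard((subject, operation, obj))

def access(subject, operation, obj):
    # The access function checks the subject's rights on the object.
    return (subject, operation, obj) in rights

authorize("analyst", "read", "sales_cube")
print(access("analyst", "read", "sales_cube"))   # True
print(access("analyst", "write", "sales_cube"))  # False
```

The bottom-up and top-down approaches discussed next differ in where such triples come from: derived from the source databases, or designed explicitly at the OLAP level.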
In principle, two viewpoints can be distinguished. The bottom-up approach sees a data warehouse as a federation consisting of several autonomous database systems. It tries not only to merge the data via integration mechanisms but also to take over the access rights from the source systems [5]. The top-down approach goes the opposite way: it primarily considers the (OLAP) applications of the end users and decides at this level who should get which access. Access rights are thus designed explicitly [5]. To show that security is a complex problem, I will describe the principles of the bottom-up and the top-down approach in more detail.

The Bottom-Up Approach
In the bottom-up approach [Figure 3], the access rights defined on the data sources are integrated and implemented at the level of the data warehouse. Since the mechanisms provided by the DBMS are limited, various extensions of SQL statements have been proposed (see [6]). That work proposes a decoupling from the physical level, so that authorization refers to content and not to physical tables.

Figure 3. Bottom-up approach.

The Top-Down Approach
The top-down approach works according to a different principle. It primarily considers the (OLAP) applications of the end users and decides at this level who should get which access; access rights are thus designed explicitly [5]. With the bottom-up approach it is very hard to answer which queries a user is allowed to issue, because the authorizations refer to the (physical, relational) data store and are too far removed from a multidimensional OLAP application. This is particularly problematic because access restrictions can distort query results and thus lead to wrong decisions. Therefore, the requirements for access control are often defined from the application (OLAP) perspective [5]. Looking at the security problem from the application point of view thus yields a top-down approach in which OLAP access rights, or rather user views, are defined using multidimensional techniques (see Figure 4). In this way, for example, access to different aggregation levels can also be regulated. An early work in this area dealing with authorization at the OLAP level is described in more detail in [12]; it presents an approach based on AMAC (Adapted Mandatory Access Controls). A later contribution by the same group [14] proposes the use of metadata to describe the authorizations. This proposal is taken up again and made concrete in [13]. Today's commercial products offer quite mature access control mechanisms at the OLAP level. However, they use very proprietary techniques to implement the access restrictions, which are rather unsuitable for design and documentation purposes [5]. An overview can be found in [15]. Because of the sensitivity of the data and the risk of distorted results caused by overly restrictive authorizations, particular care is required when designing such access rights. In [5] a design approach is presented that is based on MDSCL (Multidimensional Security Constraint Language), a language developed for describing the authorizations [5].

Figure 4. Top-down approach.

3.2 Acceleration Potentials in the Data Warehouse (RTA)
In a data warehouse there is great optimization potential at two points: first, in the transition from the operational systems to the data warehouse itself, and second, in internal acceleration concepts. We distinguish: extraction of the data from the source systems and loading into the data warehouse [2]; processing within the data warehouse [2]; processing in the analysis systems [2]. Extraction, preparation and load processes usually take place periodically. They can thus be accelerated by increasing the load frequency and/or shortening the load time. However, the source systems cannot be accessed arbitrarily often. This problem can be solved by not performing a direct bulk load but generating a delta file instead [2]. The complex transformation and integration processes must not be underestimated either: with more frequent load processes, the operations required to integrate the data can, despite the smaller data volumes, still be complex, computation- and thus time-intensive, so that the processing capacities are quickly exhausted. Within the data warehouse reside the actual data warehouse data as well as the multidimensionally structured data, e.g., in the form of OLAP cubes or data marts. But since the analysis systems, as front ends, usually access the databases directly, there is only limited acceleration potential at this point, and it can only be judged for each application individually [4].

3.3 Selecting the Right Products
The DW products that have been on the market for more than 10 years are by now mature, and therefore using a product is normally preferable to individually tailored development. The following approaches to product selection are distinguished. Best-of-breed: for each service of the reference architecture, the tool best suited for the application is selected, even if the tools are offered by different vendors [3]. Tool suite: a vendor is selected that offers an integrated platform with all essential services of the reference architecture [3]. Each of these approaches has its advantages and disadvantages, and the choice depends on the use case. Best-of-breed solutions have better functional coverage and are better suited for specific applications, but they entail large integration and management efforts.
In favor of a tool suite are reduced integration and management efforts and the vendor's support in building the system.

4. Future Challenges
4.1 Security
New research approaches in the area of security and access rights try to combine the two techniques (bottom-up and top-down), which is quite difficult because different data models (relational vs. multidimensional) and security policies (open vs. closed) are involved. Work on conceptual data warehouse modeling could help here.

4.2 RTA
Here it should above all be considered to what extent this speed is really needed, because system-imposed limits are reached very easily. It can be observed that ETL and EAI tools are converging functionally (e.g., support for more complex transformation rules is being integrated into EAI systems) and that the vendors try to offer their tools in the respective other market, but equal capability has not yet been achieved [2].

4.3 Selecting the Right Products
Against this background it can currently be assumed that the market consolidation will continue. Only a handful of large vendors (candidates are SAS, Oracle, Microstrategy, Business Objects, Hyperion, Cognos, SAP, Informatica, Microsoft) will prevail; in addition, only some specialists with niche offerings or BI solutions tailored to certain industries or analysis methods (as plug-ins for BI platforms) will survive [3].

5. Other Approaches
In the field of data management, representatives of other architectures can also be found. These include MOLAP multidimensional databases, represented by Hyperion Essbase and PowerPlay. SAS, moreover, is a classic HOLAP representative; HOLAP is a combination of ROLAP and MOLAP [7].

Products such as Informatica PowerCenter not only allow many data sources to be managed but also provide functionality that merges the data before it is loaded into the data warehouse. The products from Cognos or Business Objects are comprehensive suites for different user groups. These systems offer the following functionality: creation of partial views, analysis tools, access security, and the possibility of embedding into applications or portals and even access via a web browser [8]. Tool suites, or so-called platforms, bring mutually coordinated components to the market that support the complete BI architecture. Representatives of these products are SAP, Oracle and IBM with SAP BW, Oracle 9i and IBM DB2, respectively. The platforms naturally do not offer as much flexibility and are rather hard to apply to specific solutions. In 2007 a veritable takeover battle took place in the BI market: Oracle bought Hyperion, SAP took over Business Objects, Cognos took over Applix, and IBM acquired Cognos [9][10].

6. Conclusion
In the field of data warehousing as well as BI, some problems remain to be solved in the future, but as long as these technologies are in demand, the newer systems will open up new horizons in decision making.

References
[1] Inmon, W. H.: Building the Data Warehouse, 2nd edn. New York: John Wiley & Sons.
[2] Real-Time Warehousing und EAI, Joachim Schelp. 31/fulltext.pdf
[3] Architektur von Data Warehouses und Business Intelligence Systemen, Bernhard Humm, Frank Wietek, Informatik Spektrum, 23. Februar.
[4] Günter Saake, Kai-Uwe Sattler, Andreas Heuer: Datenbanken: Implementierungstechniken, 2. Auflage, 2005, ISBN.
[5] Sicherheit in Data-Warehouse- und OLAP-Systemen, Torsten Priebe, Günter Pernul, Lehrstuhl für Wirtschaftsinformatik I (Informationssysteme), Universität Regensburg, Regensburg.
[6] Rosenthal, A., Sciore, E., Doshi, V.: Security Administration for Federations, Warehouses, and other Derived Data. In Proc. IFIP WG 11.3 Working Conference on Database Security, Seattle, WA, USA.
[7] Antje Höll, Speichermodelle für Data-Warehouse-Strukturen, Warehouse/mat/Hoell_Speichermodelle_text.pdf
[8] Cognos,
[9] Oracle9i,
[10] IBM DB2
[11] Delta files, =9097
[12] Kirkgöze, R., Katic, N., Stolba, M., Tjoa, A. M.: A Security Concept for OLAP. Proc. of the 8th International Workshop on Database and Expert Systems Applications (DEXA '97), Toulouse, France, September 1-2.
[13] Essmayr, W., Weippl, E., Winiwarter, W., Mangisengi, W., Lichtenberger, F.: An Authorization Model for Data Warehouses and OLAP. Workshop on Security in Distributed Data Warehousing, in conjunction with the 20th IEEE Symposium on Reliable Distributed Systems (SRDS 2001), October 28-29, 2001, New Orleans, USA.
[14] Katic, N., Quirchmayr, G., Schiefer, J., Stolba, M., Tjoa, A. M.: A Prototype Model for Data Warehouse Security Based on Metadata. In Proc. DEXA '98, Ninth International Workshop on DEXA, Vienna, Austria, August 26-28.
[15] Priebe, T., Pernul, G.: Towards OLAP Security Design: Survey and Research Issues. Proc. Third ACM International Workshop on Data Warehousing and OLAP (DOLAP 2000), McLean, VA, USA, November.


Implementation of Software Product Lines: Current State and Future Challenges

Christian Becker
University of Magdeburg, Germany

Abstract
Software product lines are an effective way to produce software with variable features and to reuse legacy source code. The implementation of a software product line is an open research field, because traditional solutions do not meet all requirements. In this paper we compare two common ways to implement variability in software, an annotative and a compositional approach, and take a look at future challenges. We also present two software development tools that support these approaches and provide variant management for a software product line.

1. Introduction
In recent years modern software has become bigger and bigger and provides more and more features. Of course, this is an advantage for the customer, but in some cases customers want an individual software product. Embedded computers, for example, need small software because of their limited resources, so software developers need to tailor their programs to the requirements. Rewriting or modifying the software every time is impractical: it takes a lot of time and also carries a risk of errors. The better way is to use a software product line (SPL). It helps to customize the software and is an effective way to reuse source code and reduce the time to market [16, 17]. An SPL is designed for a specific domain and provides a set of features. To find this set of features and their relationships, the domain is analyzed (domain analysis). To produce a specific software product, a requirements analysis yields a selection of features. Feature models (FMs) are a graphical method to model SPLs; they arrange the features in a hierarchical tree. This approach is well known and provides a theoretical base [15, 1]. In this paper we describe how to implement an SPL with two common approaches.
The annotative approach marks features in legacy code and thus provides variability in the software. Compositional approaches implement features in distinct modules and compose them into one program. There are many ways to achieve that, but we only look at feature-oriented programming (FOP) approaches. We compare both approaches and also present two tools that can be used to implement an SPL. These tools might be a basis for future extensions or a combination of both approaches.

2. Feature Models
As written before, the most common way to present an SPL is a feature model, so in this chapter we give a small introduction to FMs. We will use FMs to show the relationship between features and their source code for different implementation techniques. An FM is hierarchically organized with one root node and child nodes. The connection between two nodes can be and, alternative, or, mandatory, and optional [15]. In some cases these connections are not sufficient to describe the relationships between features, so it is also possible to use additional expressions. Figure 1 shows a typical FM. It presents a chat software product line, in which it is possible to choose between two encryption algorithms (ROT 13 and Caesar), Windows or Linux as operating system, authentication for client and server, and a log-file mechanism. The feature Win requires the feature Authentication.

[Figure 1. Chat-Software Product Line: root Chat with subfeatures Encryption (ROT 13, Caesar: alternative), Log-File, OS (Win, Linux), Authentication, Client, Server; cross-tree constraint Win => Authentication; legend: and, alternative, or, mandatory, optional]
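The constraints of Figure 1 can also be checked programmatically. A minimal sketch (class and feature names are ours, not from the paper): a selection is valid if the mandatory features are present, each alternative group contributes at most one choice with the OS group contributing exactly one, and the cross-tree constraint Win => Authentication holds.

```java
import java.util.Set;

// Sketch of a validity check for the chat feature model in Figure 1
// (illustrative names; a real tool would use a SAT solver, cf. GUIDSL).
class FeatureModelCheck {
    static boolean isValid(Set<String> sel) {
        boolean oneOs = sel.contains("Win") ^ sel.contains("Linux");            // alternative, mandatory parent
        boolean oneCrypt = !(sel.contains("ROT13") && sel.contains("Caesar"));  // alternative, optional parent
        boolean winImpliesAuth = !sel.contains("Win") || sel.contains("Authentication"); // Win => Authentication
        boolean mandatory = sel.contains("Client") && sel.contains("Server");
        return oneOs && oneCrypt && winImpliesAuth && mandatory;
    }
}
```

Such a check answers only single configurations; debugging a whole model needs the solver-based techniques discussed next.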

FMs are a good way to present an SPL, but this graphical modeling offers no support for debugging FMs. In this small example it might be easy to see whether a selection of features is valid, but in general an SPL can contain hundreds of features, where it is not easy to see a valid selection. To solve this problem, Batory connects the FM to the GUIDSL grammar [1]. With this grammar it is possible to check a selection of features or to debug large FMs with satisfiability solvers. In the next chapter we discuss how the variable features can be implemented in a programming language.

3. Implementation of a Software Product Line
In the previous chapter we saw how to model SPLs with FMs. The next step is to implement the SPL in a programming language. Writing a new program for every variant of the SPL is of course an insufficient solution. As a result of the variability of the features, in some configurations not every source code fragment will be needed, so the problem is to find implementation techniques that handle this variability. Traditional programming paradigms like object-oriented programming do not provide a good solution to this problem [18]. So in the next chapters we show two different approaches for the implementation of an SPL. In [2], the authors describe two common ways to implement an SPL: the annotative and the compositional approach.

3.1. Annotative Approach
Annotative approaches use annotations to mark source code that belongs to a feature. In programming languages like C/C++ it is possible to surround the source code of a feature with preprocessor annotations like #ifdef and #endif. The preprocessor cuts off the surrounded code if the feature is not selected. Another way to mark feature code is to use naming conventions. For example, every method that belongs to the feature Logfile starts with log_, so an additional tool can cut off those methods when the feature Logfile is not selected.
But in fact the best-known and most common way to get variability into a software program is to use preprocessor annotations. Figure 2 shows how the annotative approach can implement an SPL with preprocessor annotations: there are implementation units (e.g. classes or methods); each unit contains the implementation of one or more features, and a feature is connected to one or more implementation units.

[Figure 2. Annotative approach for implementing an SPL: features connected to implementation units]

Figure 3 shows the source code of the chat software using preprocessor annotations.

class Client {
  /* some lines of code */
  void send() {
    #ifdef LOGFILE
    logMsg(msg);
    #endif
    #ifdef CRYPT
    msg = cryptMsg(msg);
    #endif
    sendMsg(msg);
  }
}

Figure 3. Usage of preprocessor annotations

In this simple example the feature code for encryption is just one line of code. But in typical programs there may be hundreds of lines, and these lines may be scattered over many implementation units (feature traceability problem), so it is not easy to identify the source code that belongs to a feature. It is also possible to cascade preprocessor annotations, which leads to source code that is difficult to understand and read. Without tool support this becomes very complex, and it is not easy to maintain the source code that belongs to a feature. With preprocessor annotations it is possible to mark fine-grained source code fragments, for example to change the parameter of a method depending on the feature selection.

class Client {
  void connect(
    #ifdef WIN
    ipv4
    #endif
    #ifdef LINUX
    ipv6
    #endif
  ) {
    /* implementation of the feature */
  }
}

Figure 4. Alternative parameter for a method with preprocessor annotations
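Java itself has no preprocessor, but the same select-then-strip effect can be imitated with compile-time constant feature flags, since the Java compiler removes branches guarded by a constant false condition. This is a rough analogue, not the C/C++ technique itself, and the names are ours:

```java
// Hypothetical Java analogue of #ifdef-based variability: branches guarded
// by a compile-time constant false flag are removed by javac, so unselected
// feature code does not reach the bytecode.
class Features {
    static final boolean LOGFILE = true;   // feature Log-File selected
    static final boolean CRYPT   = false;  // feature Encryption not selected
}

class Client {
    final StringBuilder log = new StringBuilder();

    String send(String msg) {
        if (Features.LOGFILE) log.append(msg);   // annotated Log-File code
        if (Features.CRYPT)   msg = rot13(msg);  // annotated Encryption code
        return msg;                              // stands in for sendMsg(msg)
    }

    static String rot13(String s) {
        StringBuilder b = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) {
                char base = Character.isLowerCase(c) ? 'a' : 'A';
                b.append((char) (base + (c - base + 13) % 26));
            } else b.append(c);
        }
        return b.toString();
    }
}
```

Unlike real preprocessor annotations, the flag-guarded code must still compile in every configuration, which is exactly the fine-granularity limit discussed below.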

Figure 4 shows an example of fine-grained feature code. Depending on the operating system, the method connect needs different parameters: the feature WIN uses ipv4 and the feature LINUX uses ipv6 as parameter. This small example shows the possibilities of annotative approaches. But it can become more complex if the relationship between the features is not alternative. In this case the annotations must also consider the comma between the two parameters. Figure 5 shows an example: line 3 provides the comma between the parameters only if both features are selected.

1 class Client {
2   void connect( #ifdef WIN ipv4
3     #ifdef LINUX , #endif
4   #endif
5   #ifdef LINUX ipv6 #endif ) {
6     /* implementation of the feature */
7   }
8 }

Figure 5. One or two parameters for a method with preprocessor annotations

One of the advantages of annotative approaches is that it is possible to write the implementation units in common programming languages and paradigms and add the annotations afterwards. That makes it easy to build up an SPL from legacy software applications: a software developer can mark the feature code step by step and thus build up an SPL. In this chapter we analyzed the annotative approaches, especially those using preprocessor annotations. In the next chapter we take a look at the compositional approaches; in contrast to the annotative approaches, they implement features in distinct modules.

3.2. Compositional Approaches
In compositional approaches, features are implemented in distinct modules. To create a variant of the SPL, a set of modules is composed. This can be done at compile time or deploy time [2]. There are many compositional approaches, such as component technologies [6], frameworks [7] or aspects [8]. First we take a look at component technologies. Here every feature should be implemented by one component, but this may not always be possible: for example, if a feature just adds some lines of code to a method, the feature cannot be encapsulated in a component. To create a program from the SPL, software developers take the set of components and connect them together. For this they need to write additional code, so-called glue code. Figure 6 shows how component technology implements the chat-software SPL.

[Figure 6. Implementation with components: features A-D mapped to components, connected by additional glue code]

The disadvantage of component technology is that it is not possible to encapsulate every feature in just one component, and a variant of the SPL needs additional source code. To solve these problems we look at an approach that uses the feature-oriented programming paradigm [5]. The advantage is that this approach solves the feature traceability problem (every feature is implemented in one distinct module) and no additional source code is needed. As shown in Figure 7, every feature is connected to exactly one implementation unit.

[Figure 7. Compositional approach for implementing an SPL: each feature maps to one implementation unit]

To get a variant of the SPL, a set of features is selected and the corresponding implementation units are composed, as shown in Figure 8.

[Figure 8. Composing a program from a set of features: selected features -> implementation units -> composed program]
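The selection-and-composition step of Figure 8 can be sketched in a few lines. This is an illustrative model with invented names, not a real composer; it simply treats each feature's implementation unit as a text fragment and concatenates the units of the selected features in declaration order:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of Figure 8: each feature owns one implementation unit;
// a variant is derived by selecting features and composing their units
// in declaration order (illustrative names only).
class VariantBuilder {
    final Map<String, String> units = new LinkedHashMap<>(); // feature -> unit

    VariantBuilder unit(String feature, String code) {
        units.put(feature, code);
        return this;
    }

    String compose(List<String> selection) {
        StringBuilder program = new StringBuilder();
        units.forEach((feature, code) -> {
            if (selection.contains(feature)) program.append(code).append('\n');
        });
        return program.toString();
    }
}
```

Real composers such as those in the AHEAD tool suite do far more than concatenate, as the mixin and jampack algorithms discussed next show.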

The next step is to find a suitable composition algorithm. A common way is to use the mixin or the jampack algorithm. The idea of both is that a feature refines a class or a method of a basic implementation; in [9] the authors compare mixin and jampack. For example, our chat software has a class Client that implements a method that sends messages to a server, and the feature Log-File refines this method. Figure 9 shows the units implemented with Jak. Jak belongs to the AHEAD Tool Suite [12][14] and provides a collection of Java-based tools for compositional programming. The first block of Figure 9 is the basic implementation of the Client class. The next block (lines 8-14) extends the basic implementation using the refines keyword; in this case the existing method sendMsg is extended. The keyword Super calls the original implementation of the method. It is also possible to introduce new data fields, methods or classes.

1  // Basic implementation
2  class Client {
3    public void sendMsg(Message msg) {
4      send(msg);
5    }
6  }
7
8  // Feature Log-File
9  refines class Client {
10   public void sendMsg(Message msg) {
11     logMsg(msg);
12     Super().sendMsg(msg);
13   }
14 }
15 // Feature Encryption
16 refines class Client {
17   public void sendMsg(Message msg) {
18     msg = cryptMsg(msg);
19     Super().sendMsg(msg);
20   }
21 }

Figure 9. Refinement of the class Client with Jak

The refinements in the previous example show a language-dependent approach with new keywords. But the implementation of an SPL should be language independent. In [13] Apel et al. show that composing software fragments is possible for many programming languages: refining a class can be interpreted as superimposition of feature structure trees. In this chapter we presented the second common way to implement an SPL. The compositional approach using mixin or jampack solves the feature traceability problem by implementing the features in distinct modules. As an effect of that, this approach also provides a separation of concerns.
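The refinement chain of Figure 9 can be approximated in plain Java by stacking subclasses, each playing the role of one feature; super then stands in for Jak's Super(). This is a sketch with invented names, not AHEAD output:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation of the Jak refinement chain in Figure 9:
// each feature is a subclass that overrides sendMsg and delegates via
// super, mirroring "refines ... Super()" (illustrative names only).
class ClientBase {
    final List<String> wire = new ArrayList<>();   // messages actually sent
    void sendMsg(String msg) { wire.add(msg); }
}

class ClientLogFile extends ClientBase {           // feature Log-File
    final List<String> log = new ArrayList<>();
    @Override void sendMsg(String msg) {
        log.add(msg);
        super.sendMsg(msg);
    }
}

class ClientEncryption extends ClientLogFile {     // feature Encryption
    @Override void sendMsg(String msg) {
        // toy "cipher": reverse the string before passing it on
        super.sendMsg(new StringBuilder(msg).reverse().toString());
    }
}
```

Note that the subclass order fixes the composition order: because Encryption sits above Log-File here, the log records the already-encrypted message, exactly as the Super() chain in Figure 9 would.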
In the next chapter we give a short comparison of the annotative and the compositional approach.

4. Comparison
We presented two different approaches for implementing an SPL: the annotative approach marks features in source code, while the compositional approach implements features in distinct modules. But which one should a developer choose? Of course, in general this question cannot be answered. Annotative approaches with C/C++-like preprocessors are well known, and their acceptance in industry is high. The big advantage is that it is possible to mark features in legacy applications. Building an SPL from scratch is not the normal way, so the annotative approach helps the developer to transform a legacy application step by step into an SPL. Implementing an SPL with the compositional approach solves the feature traceability problem and yields feature modules that are easier to maintain. In contrast to the annotative approaches, fine-grained extensions are not possible; for example, just changing the parameter of a method is not possible. A workaround is to introduce a new method with the new parameter, but this is an insufficient solution because it leads to code duplication, and as a result to potential errors. Figure 10 summarizes the properties of both approaches.

                              Annotative   Compositional
Reuse of legacy application       +             -
Separation of concerns            -             +
Feature traceability              -             +
Language independence             +            (+)
Granularity                       +             -
Maintenance                       -             +

Figure 10. Comparison of both approaches

In Figure 10 the entry for language independence of the compositional approach is in brackets: we saw that it is possible to use compositional approaches with many programming languages, but the composer tool must support the programming language, so it is not completely language independent. Maintenance means how easy it is to maintain the source code that belongs to a feature.
In the compositional approaches it is easier to find and maintain the code that belongs to a feature. In the next chapter we discuss the future challenges.

5. Future Challenges
As shown in the previous chapter, both approaches have their advantages and disadvantages. For both, one of the future challenges is good tool support and an IDE: today nobody wants to work with console tools once he knows the advantages of an IDE (e.g. type checking, refactoring, etc.). So we present two tools that help the developer to implement and manage the variants of an SPL. The Colored Integrated Development Environment (CIDE) is a tool for analyzing SPLs and decomposing legacy code [2][3]. Code that belongs to a feature can be marked with colors; the tool does not pollute the legacy code with additional annotations.

[Figure 11. Screenshot of the Colored Integrated Development Environment]

The other tool is FeatureIDE [4]. It is an open framework and provides tools for a feature-oriented design process and the implementation of SPLs. It integrates AHEAD, FeatureC++ and FeatureHouse tools such as composers and compilers. Both tools also provide variant management for an SPL; Figure 13 shows this for FeatureIDE.

[Figure 12. Screenshot of the Chat-Software Product Line in FeatureIDE]

[Figure 13. Variant Management in FeatureIDE]

But these tools are academic prototypes, and there is still work to do. Case studies on the usage of these tools would provide information about the requirements of software developers. The granularity of the extensions is another difference between the two approaches, so an idea for the future might be to combine both approaches: for small feature source code fragments use an annotative approach, and for bigger fragments a feature-oriented approach. This combined approach would use the advantages of both. An algebra for annotative approaches, like the algebra for FOP, would help to see the consequences and the problems of combining both approaches. FOP provides a separation of concerns, but for some features, like lock or log mechanisms, it is necessary to refine many units.
Such features are called crosscutting concerns. Here the aspect-oriented programming (AOP) paradigm provides a good solution [8]: with aspects it is possible to quantify over the points in the source code where a feature has to refine a method. In [11] the authors also provide a method to combine FOP and AOP, called Aspectual Feature Modules. This can be a good way to solve the problem with crosscutting concerns, but today there is no tool that supports Aspectual Feature Modules. In this paper we only looked at the software engineering and development aspects of an SPL. But when an SPL is deployed in a company, the IDE should have an interface to existing software systems, such as Product Lifecycle Management (PLM) or Enterprise Application Integration (EAI) systems. The challenges are to develop this interface and to extend the SPL to a complete product model. Furthermore, an SPL should contain more than just source code: documentation and manuals also belong to a software product. The algebra for FOP [13] also provides the possibility to compose text documents (principle of uniformity). After this advice for future work, we summarize our paper in the next chapter.

6. Conclusion
Software product lines are a common and effective way to reuse software, reduce the time to market and bring variability into software according to customer requirements. In this paper we surveyed the current state of SPL implementation and described two common approaches. The annotative approach uses annotations to mark the source code that belongs to a feature. We saw that it is possible to mark fine-grained software fragments, and as a result tool support such as type checking might be a problem; but annotative approaches are well suited to build up an SPL from legacy applications. Compositional approaches implement features in distinct modules; these modules provide the functionality for one concern (separation of concerns). We took a closer look at approaches that use the feature-oriented programming paradigm: it is easy to find the source code that belongs to a feature (solving the feature traceability problem), and as a result the source code of a feature is easier to maintain. Regarding future challenges, we saw that good tool support and integrated development environments are important, so we also presented two tools: CIDE and FeatureIDE. CIDE can be counted among the annotative approaches and uses colors to mark feature code; FeatureIDE uses FOP. Both tools also provide variant management. Future work is to improve these tools and maybe combine both. The integration of AOP to handle crosscutting concerns is also a future challenge and needs tool support.

7. References
[1] Batory, D., Feature Models, Grammars, and Propositional Formulas, In Proc. Software Product Line Conference (SPLC), 2005.
[2] Kästner, C., Apel, S., Kuhlemann, M., Granularity in Software Product Lines, In Proc. International Conference on Software Engineering (ICSE), Leipzig, Germany, May 2008.
[3] Kästner, C., Trujillo, S., Apel, S., Visualizing Software Product Line Variabilities in Source Code, In Proc. 2nd International SPLC Workshop on Visualization in Software Product Line Engineering, Limerick, Ireland, September 2008.
[4] Kästner, C., et al., FeatureIDE: A Tool Framework for Feature-Oriented Software Development, In Proc. 31st International Conference on Software Engineering (ICSE), Vancouver, Canada, May 2009.
[5] Smaragdakis, Y., Batory, D., Mixin Layers: An Object-Oriented Implementation Technique for Refinements and Collaboration-Based Designs, ACM TOSEM, 11(2), 2002.
[6] Szyperski, C., Component Software: Beyond Object-Oriented Programming, Addison-Wesley, 2002.
[7] Johnson, R. E., Foote, B., Designing Reusable Classes, Journal of Object-Oriented Programming, 1(2), 1988.
[8] Kiczales, G., et al., Aspect-Oriented Programming, In Proc. European Conference on Object-Oriented Programming (ECOOP), 1997.
[9] Kuhlemann, M., Apel, S., Leich, T., Streamlining Feature-Oriented Designs, In Proc. International Symposium on Software Composition, 2007.
[10] Batory, D., Sarvela, J. N., Rauschmayer, A., Scaling Step-Wise Refinement, IEEE Transactions on Software Engineering, 30(6), 2004.
[11] Apel, S., Leich, T., Saake, G., Aspectual Feature Modules, IEEE Transactions on Software Engineering, 34(2), March-April 2008.
[12] Batory, D., Feature-Oriented Programming and the AHEAD Tool Suite, In Proc. 26th International Conference on Software Engineering (ICSE), 2004.
[13] Apel, S., et al., An Algebra for Features and Feature Composition, In Proc. International Conference on Algebraic Methodology and Software Technology (AMAST), Springer, 2008.
[14]
[15] Kang, K., et al., Feature-Oriented Domain Analysis (FODA) Feasibility Study, Technical Report CMU/SEI-90-TR-21, Software Engineering Institute, Carnegie Mellon University, 1990.
[16]
[17] Bachmann, F., Clements, P., Variability in Software Product Lines, Technical Report CMU/SEI-2005-TR-012, Software Engineering Institute, Carnegie Mellon University, 2005.
[18] Tarr, P., et al., N Degrees of Separation: Multi-Dimensional Separation of Concerns, In Proc. International Conference on Software Engineering (ICSE), 1999.

Current State and Future Challenges in Optional Weaving

Constanze Michaelis
Department of Computer Science
Otto-von-Guericke University Magdeburg, Germany

Abstract
In software development, software product lines are becoming more and more attractive. Feature- and aspect-oriented programming are techniques that make it very comfortable to realize software product lines. This paper deals with a well-known issue in software product lines: the feature optionality problem, which occurs when features that depend on each other are optional. One possible, promising solution to this issue, optional weaving, will be discussed. Minimal implementations of common software are an ambitious aim of software product lines; realizing optional features with optional weaving shortens the resulting code, because unnecessary code is omitted.

I. INTRODUCTION
Feature-oriented programming (FOP) and aspect-oriented programming (AOP) are solutions for realizing software product lines. The idea of both methods is to develop software according to a domain and to implement features that describe concerns of the software for domain, not programming, purposes. With these programming paradigms the separation of concerns [1] can be realized, which is one of the major design principles with respect to software product lines. Developing software with software product lines offers a lot of advantages, for example customized programs, shortened development periods and reusable code. Features defined within software product lines can be mandatory or optional, and in many cases they depend on each other. These dependencies lead to the feature optionality problem, and therefore this paper deals with the current state of this problem and a possible solution called optional weaving. Optional weaving yields shorter resulting code, which can be essential for embedded systems like smart cards, mobile phones or sensor networks.
Such embedded systems are characterized by little storage, little processing power and battery operation; therefore it is necessary to develop tailor-made solutions [2], [3]. The objective of this paper is to survey the current state of solutions to the feature optionality problem, with a focus on optional weaving. In its current state, optional weaving offers a partial solution to the feature optionality problem within AspectJ; with FeatureC++ it is possible to have two interacting optional features, which is likewise a partial solution. First of all, this paper gives a brief background on aspect- and feature-oriented programming. Afterwards the feature optionality problem is discussed and the current approaches in optional weaving are introduced. Because this approach is in its infancy, suggestions for improvement are made in the section on future challenges. Finally the paper is summarized and concluded.

II. BACKGROUND
This section gives a very brief review of techniques to implement software product lines: AOP, FOP and the combination of both.

A. Aspect-oriented programming
AOP aims at separating crosscutting concerns, i.e. concerns whose code is scattered across multiple components [4], [5]. According to Kiczales et al. [6], the idea of AOP is to implement the crosscutting concerns (also called crosscuts [4]) as aspects to eliminate code tangling and scattering. The core features are implemented with traditional design and implementation concepts; pointcuts, advice and the aspect weaver build the program consisting of the core and the aspects, which represent the additional features.

B. Feature-oriented programming
FOP aims at feature traceability. According to Apel et al. [7], the idea of FOP is to build a program by composing features, where one feature incrementally refines another, which leads to a step-wise refinement.
Features are mapped within a feature model and composed by the mixin approach within the AHEAD toolsuite (Algebraic Hierarchical Equations for Application Design), which is a prominent example for the Java community [7]. A feature model is the result of the analysis of the commonality and variability of a domain [2] and displays the feature dependencies in a graphic representation. Jak is a programming language that represents a superset of the Java language and is used in the AHEAD toolsuite for composing features [8]; it is a Java extension for metaprogramming, state machines and refinements. Mixin is one appropriate technique to build a feature-oriented program. The basic idea is that features are not implemented inside one class, but within a set of collaborating classes. Due to [9], classes play different roles in different collaborations. For a better understanding, Figure 1 shows a model of the mixin approach: the roles (R1..R3) represent the collaborations and the classes (C1..C3) implement the features; one collaboration represents one feature within a set of classes.

[Figure 1. Roles and Collaborations]

[Figure 2. Feature Model Chat Program]

According to current research, the trend goes towards the combination of FOP and AOP for the implementation of software product lines, because the advantages of AOP compensate the disadvantages of FOP: the modularity of FOP for the base features and the encapsulation of AOP for the crosscutting concerns. The group around Apel realizes this with Aspectual Mixin Layers and Aspectual Mixins (for further information see [9], [10], [7], [5]). Similarly, Lee et al. [4] pursue the combination of both for software product line development with respect to the mentioned advantages. After this short summary of techniques to implement software product lines, where features are the central elements, the next section discusses the problem of optional features interacting with each other.

III. FEATURE OPTIONALITY PROBLEM
First of all, this section introduces feature interactions and feature dependencies. Secondly, the feature optionality problem will be discussed.

A.
Feature Interactions and Feature Dependencies
Feature interactions occur when one or more features modify or influence another feature. There are many ways in which features can interact; optional weaving focuses on a particular form of interactions that are static and structural, i.e. that influence or change the source code of another feature [11]. Because features interact with each other, their dependencies are classified by [4] into configuration and operational dependencies. Configuration dependencies determine whether a feature is required by or excluded from a configuration. Consider the example in the feature model of Figure 2, which represents a short chat application; it indicates a configuration dependency: the Color feature can only be selected if the Gui feature has been selected. Operational dependencies describe the feature interactions; they include usage, modification and activation dependencies, such as the interaction between Authentication and Encryption presented below. These operational dependencies are considered in optional weaving with respect to optional features.

B. Feature Optionality Problem
The feature optionality problem is a well-known problem in software product line development with optional features [12], [13], [11]. According to Kästner [13], this problem occurs when multiple optional features interact with each other, e.g. feature A refers to feature B or feature B extends feature A. This relates to the crosscutting concerns within software product lines mentioned above. The small chat example shown in Figure 2, with the rudimentary implementation in Figure 3, illustrates the feature optionality problem. Figure 3 represents the composed class refinements of the Server class, which is located in the Base feature (see Figure 2). Lines one to five show the base implementation, where a method logIn is defined.
This method will be refined by two features: Authentication and Encryption.

1  SoUrCe RooT Base "workspace/fopchat/src/base/Server.jak";
2  abstract class Server$$Base {
3    public void logIn(Connection c, TextMessage msg) {...}
4  }
5
6  SoUrCe Encryption "workspace/fopchat/src/encryption/Server.jak";
7  abstract class Server$$Encryption extends Server$$Base {
8    public void logIn(Connection c, TextMessage msg)
9    {
10     msg = Encryption.decrypt(msg);
11     Super().logIn(c, msg);
12   }
13 }
14 SoUrCe Authentication "workspace/fopchat/src/authentication/Server.jak";
15 abstract class Server$$Authentication extends Server$$Encryption {
16   public void logIn(Connection c, TextMessage msg)
17   {
18     Super().logIn(c, msg);
19     c.enableClient(this.checkPassword(msg));
20   }
21   private boolean checkPassword(TextMessage tm) {...}
22 }

Figure 3. Refinements

[Figure 4. (a) Optional Features, (b) Optional Weaving Approach, (c) Derivative Feature Approach [13]]

The logIn method of the Authentication feature verifies the password in order to log in and start the chat (see lines 14 to 22). The logIn method of the Encryption feature decrypts the message, which contains the password, before it can be verified. Because of the dependency between Authentication and Encryption, and the fact that Encryption has to refine the logIn method before Authentication does, this method cannot be introduced by Authentication but has to be introduced by the base class Server. This results in code replication and shows the feature optionality problem. Figure 2 contains four optional features: Color, Authentication, History and Encryption. If Encryption and Authentication are selected, then Authentication must use the Encryption feature: the authentication message has to be sent encrypted to ensure a secure transfer. If just Authentication is selected, the encryption algorithms are not required. But what if Authentication is not selected and Encryption wants to refine the non-existing Server method logIn (see Figure 3)?
With this selection the system will raise an error. Furthermore History, which provides some logging functionality, must also use Encryption if both are selected simultaneously. This very small example shows that there is a need to overcome the stated feature optionality problem. Figure 4 illustrates the feature optionality problem for two interacting features (a) and approaches to overcome it. One of these approaches is the derivative feature approach [14] (c), where the interacting code is moved to a derivative module that is included only when both feature A and feature B are selected (see Section VI). Optional weaving (b) is the other approach and a solution to the stated feature optionality problem. This approach, first introduced by Leich et al. [12], implements the optional interactions within the features, but with language constructs such as advice statements that are ignored when the second feature is not selected. Therefore programs generated from software product lines contain just the necessary code, reduced to a minimum. This section discussed the issue of feature optionality; the next section shows the current state of the optional weaving approach.

IV. OPTIONAL WEAVING - CURRENT STATE
This section describes optional weaving as implemented in two programming languages: first the FeatureC++ approach, afterwards the AspectJ approach.

A. Optional Weaving with FeatureC++
The first approach within FeatureC++ by Leich et al. [12] was based on ideas of aspect-oriented programming. As stated in the previous section, the optional parts are implemented in the original feature and woven only when necessary, i.e. an optional part is woven when the interacting feature is selected. Using before, after, and around, the pointcut is implicitly defined by the signature of the refined method. Figure 5 from [12] shows a log feature implementation using this new extension.
The concat() method is made optional by using the keyword before, which also describes when the functionality is processed (AOP style). The super keyword is still used for mandatory features.

Figure 5. Optional Method Refinement [12]

B. Optional Weaving with AspectJ

A similar approach was described by Kästner [13] with the Java equivalent AspectJ. It was shown that optional weaving avoids the need to create derivative features for resolving dependencies, unlike the derivative approach shown in Figure 4(c). A further advantage pointed out there is that the dependencies are implemented as optional extensions within the genuine feature, so they can be maintained locally. The implementation of the example introduced in Section III with this approach is shown in Figure 6:

1  public class Server {
2    public void broadcast(TextMessage tm) {...}
3  }
4  aspect Authentication {
5    void Server.logIn(Connection c, TextMessage msg) {...}
6    around(Connection c, Object o) :
7      call(* Connection.send(TextMessage)) && withincode(* Server.broadcast(TextMessage)) && target(c) {...
8      c.server.logIn(c, tm);...
9    }
10 }

14 aspect Encryption {
15   TextMessage encrypt(TextMessage tm) {...}
16   TextMessage decrypt(TextMessage tm) {...}
17   before(Connection c, TextMessage tm): call(* Server.logIn(Connection, TextMessage)) {
18     tm = this.decrypt(tm);
19   }
20 }

Figure 6. Optional Weaving with AspectJ

The aspect Encryption advises the method logIn, which is introduced by the aspect Authentication. When the Authentication aspect is not included in the compilation, this advice statement is not woven. With this approach the implementation dependencies can be resolved and features can be composed individually. In the case study of Kästner [13] the optional weaving approach was compared to the derivative approach (see Section VI). In contrast to the derivative approach, optional weaving has no scaling problem, needs no tools for hiding complexity, and the interactions are woven automatically by the AspectJ compiler when the dependent feature is selected. Despite the advantages resulting from the case study given in [13], serious shortcomings in current AspectJ were found, making it unusable for the optional weaving approach. First, it is not possible to reference optional classes, methods or member variables in optional advice statements, which results in code replication.
Second, this approach can only be used for advice statements, not for inter-type member declarations. Furthermore, these scope problems impede advising methods in optional classes. Finally, [13] pointed out that, because of these shortcomings in the current version of AspectJ, the optional weaving approach cannot serve as a solution for the feature optionality problem in the given case study. This section described the attempts made to implement the idea of optional weaving and the problems that remain. The next section therefore presents possible enhancements that could allow the optional weaving approach to overcome the feature optionality problem.

V. FUTURE CHALLENGES

This section highlights possibilities with AspectJ to overcome the problems stated above. Furthermore, another approach, not directly related to optional weaving, is briefly introduced to suggest a way to implement optional weaving with CaesarJ.

A. Future directions with AspectJ

As stated before, AspectJ cannot cope with references to optional classes, methods or member variables, and it handles only advice statements, not inter-type member declarations. In [13] it was shown that with the current AspectJ language and its restrictions it is hard or even impossible to implement optional interactions with this approach. It was therefore suggested to define groups of statements (advice and inter-type member declarations) as optional refinements, where the entire group has transactional semantics, i.e. either the group is woven with all its advice or no statement is woven at all. To this end, an optional keyword is suggested to mark the optional part within the feature.

B. Possible directions with CaesarJ

In [15] another approach, using CaesarJ instead of AspectJ, is discussed.
The authors show that features can be freely composed in CaesarJ by instantiating and deploying instances of the mixin classes representing a bound feature. CaesarJ has a mechanism, called deployment, that specifies in which context advice definitions are activated (for more information see [15]). Additionally, CaesarJ has a concept of bidirectional interfaces, which supports reusable features. Owing to these promising mechanisms, already integrated into the language but still missing in AspectJ, we suggest evaluating the optional weaving approach with CaesarJ. This section indicated future directions for optional weaving. The next section introduces another concept to overcome the feature optionality problem.
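CaesarJ's deployment semantics can be approximated in plain Java to illustrate why they help here. This is a hypothetical sketch of the behavior only, not CaesarJ syntax, and the class and method names are invented for illustration.

```java
// Hypothetical sketch of CaesarJ-style deployment semantics in plain Java:
// the Encryption advice only intercepts logIn while an aspect instance is
// deployed, so a feature can be activated and deactivated per context.
class EncryptionAspect {
    private boolean deployed = false;

    void deploy()   { deployed = true; }
    void undeploy() { deployed = false; }

    // Stands in for the before advice on Server.logIn: the message is
    // decrypted only while the aspect is deployed.
    String adviseLogIn(String msg) {
        return deployed ? "decrypt(" + msg + ")" : msg;
    }
}
```

When the Encryption feature is deselected, no instance is ever deployed and logIn sees the message unchanged, which is exactly the ignore-if-absent behavior the optional weaving approach requires.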
