Datenbank-Spektrum Zeitschrift für Datenbanktechnologien und Information Retrieval


Datenbank-Spektrum, Volume 13, Issue 1, March 2013
Focus topic: MapReduce Programming Model
Guest editor: Theo Härder

EDITORIAL
- Editorial. T. Härder

FOCUS ARTICLES
- Compilation of Query Languages into MapReduce. C. Sauer, T. Härder
- Efficient OR Hadoop: Why Not Both? J. Dittrich, S. Richter, S. Schuh
- Parallel Entity Resolution with Dedoop. L. Kolb, E. Rahm
- Inkrementelle Neuberechnungen in MapReduce (Incremental Recomputations in MapReduce). J. Schildgen, T. Jörg, S. Deßloch

TECHNICAL CONTRIBUTION
- Towards Integrated Data Analytics: Time Series Forecasting in DBMS. U. Fischer, L. Dannecker, L. Siksnys, F. Rosenthal, M. Boehm, W. Lehner

DATABASE GROUPS PRESENTED
- Datenmanagement und -exploration an der RWTH Aachen (Data Management and Exploration at RWTH Aachen). T. Seidl

DISSERTATIONS
- Dissertations

COMMUNITY
- Report on the autumn meeting of the GI special interest group on database systems (GI-Fachgruppe Datenbanksysteme). A. Kemper, T. Mühlbauer, T. Neumann, A. Reiser, W. Rödiger
- News

OBITUARY
- Dr. Dean Jacobs. A. Kemper, W. Lehner

Abstracts published/indexed in Google Scholar, Academic OneFile, DBLP, io-port.net, OCLC, Summon by Serial Solutions.

Datenbank Spektrum (2013) 13:1–3
EDITORIAL

Editorial

Theo Härder

Published online: 1 February 2013
Springer-Verlag Berlin Heidelberg

T. Härder, AG Datenbanken und Informationssysteme, TU Kaiserslautern, Kaiserslautern, Germany
haerder@cs.uni-kl.de

Focus topic: MapReduce Programming Model

MapReduce is a programming model for the parallel processing of large data volumes on a large number of machines. It was introduced in 2004 by the Google employees Jeffrey Dean and Sanjay Ghemawat in their paper "MapReduce: Simplified Data Processing on Large Clusters" at the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004). Since then, this paper has triggered an avalanche of research approaches and system developments for the analysis and processing of Big Data.

The MapReduce programming model enables the scalable analysis and transformation of large, distributed, and heterogeneous data sets. To simplify and speed up the development of specific MapReduce applications, a MapReduce implementation provides a framework that takes care of data distribution and of the scheduling of parallel computing tasks. Essentially, the user only has to complete this framework by specifying a Map function, which produces a new list of key/value pairs as an intermediate result from a list of key/value pairs, and a Reduce function, which groups all records with the same key in an intermediate result list, merges all values of such groups, and reduces them through computations. With this functional programming approach, programs can automatically exploit high degrees of parallelism and thereby scale perfectly.
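The division of labor between user code and framework can be illustrated with a minimal, sequential sketch in Python. It is not taken from the editorial or from any MapReduce implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, the word-count task is chosen only as a familiar example, and the grouping by key is performed here by the stand-in framework function rather than in parallel on a cluster.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit one (word, 1) pair for every word of an input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: collapse all values that share a key into a single result."""
    return (word, sum(counts))


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the framework (hypothetical helper).

    A real implementation would distribute the three phases over many
    machines; here they simply run one after another.
    """
    # Map phase: apply the user-supplied map function to every input pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group all intermediate values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: apply the user-supplied reduce function to each group.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Made-up input: (line number, line text) pairs.
lines = [(0, "big data needs big clusters"), (1, "data moves to clusters")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Only `map_fn` and `reduce_fn` are specific to the task; replacing them changes the computation without touching `run_mapreduce`, which is the property the programming model relies on.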
Because the MapReduce programming model is also suitable for a broad spectrum of different computational problems, it has been enormously successful in recent years for Big Data processing in many different application areas. In this special issue, four contributions describe interesting questions in the context of MapReduce and the analysis of large data volumes. They also show the many possible uses of the programming model and its range of variation in applications from diverse areas.

The first contribution, "Compilation of Query Languages into MapReduce" by Caetano Sauer and Theo Härder, attempts to build a bridge from SQL to database programming languages for Big Data. In particular, it raises the question of why SQL is too restrictive in many respects for Big Data analysis. Limitations of SQL that are particularly noticeable in this context led to the development of better-suited data programming languages, of which PigLatin, HiveQL, Jaql, and XQuery are examined more closely. Important language properties form a kind of wish list for Big Data processing, which is then used to assess qualitatively how well these languages overcome the lack of flexibility and the restrictions of SQL. Based on this wish list of language properties, the compilation into the MapReduce programming model is described in an abstract way, which makes clear that the translation process is essentially the same for all four languages. Simple generalizations of the original MapReduce programming model allow the reuse of proven query processing techniques, which then facilitate the generation of optimized query execution plans for MapReduce analyses.

The second contribution, "Efficient OR Hadoop: Why Not Both?" by Jens Dittrich, Stefan Richter, and Stefan Schuh, is devoted to query optimization in the context of Big Data and describes several approaches pursued in the Information Systems Group at Saarland University with the goal of making Hadoop more efficient. The Hadoop++ project concentrated on making the individual steps (pipeline) of query processing in Hadoop more flexible, without changing the code of the Hadoop Distributed File System (HDFS) or Hadoop MapReduce. This reduced runtimes by up to a factor of 20. The Trojan Layouts project examined different data layouts (row-wise, column-wise, Partition Attributes Across (PAX)) in the context of MapReduce processing, with the goal of finding a data layout for Hadoop that is better suited to analytical query processing. This optimization was shown to improve query processing runtimes by up to a factor of 5 compared to row and PAX layouts. The HAIL (Hadoop Aggressive Indexing Library) project evaluated the use of different clustered index structures. Here, too, enormous performance gains over query processing in Hadoop and Hadoop++ were demonstrated.

In the third contribution of this issue, "Parallel Entity Resolution with Dedoop", Lars Kolb and Erhard Rahm describe a tool for identifying entities, also referred to as entity resolution (ER), in cloud infrastructures based on Hadoop. Especially for heterogeneous data collections, detecting duplicates among objects (e.g., authors, customers, or products) that are represented with similar structures and values is of great importance for safeguarding processing quality.
In conventional approaches, entities have to be compared pairwise in a costly manner using various similarity measures in order to reach match decisions that are as accurate as possible. To improve efficiency, the search space is normally reduced by so-called blocking techniques. The Dedoop system developed by the authors has an extensive library of blocking and matching methods and uses training-based machine learning methods to configure suitable ER strategies for a given application. Once such methods have been selected, they are automatically translated into MapReduce programs, which can then be executed in parallel on different Hadoop clusters. Dedoop improves performance through multi-pass blocking and effective load balancing methods. The versatility and performance of Dedoop are demonstrated by a comparative evaluation of different ER strategies on real data collections.

Finally, the article "Inkrementelle Neuberechnungen in MapReduce" (Incremental Recomputations in MapReduce) by Johannes Schildgen, Thomas Jörg, and Stefan Deßloch deals with a MapReduce application that is important for data management. It first shows that the MapReduce approach is suitable as a solution concept for a broad spectrum of computational problems, e.g., building word histograms for a text, deriving an inverted link graph from a collection of web pages, or computing friend-of-friend relationships in social networks. Such MapReduce computations are typically performed on large data collections, which normally reside in distributed file systems. If these data collections change, the computation results become less accurate as updates accumulate and have to be recreated from time to time. A complete recomputation is usually not an efficient solution.
Therefore, the article proposes an approach to incremental recomputation in MapReduce that is based on the ideas and concepts of incremental maintenance of materialized views in relational database systems. To this end, it presents the MapReduce-based Marimba framework, which supports the easy development of MapReduce programs that perform only incremental recomputations after changes to the data and thus avoid a complete repetition of the MapReduce run. The development of such incremental MapReduce programs is shown for several applications; for two different strategies, their performance is determined as a function of the degree of change in the data and compared with that of complete recomputation.

These focus articles are complemented by a technical contribution, "Towards Integrated Data Analytics: Time Series Forecasting in DBMS" by Ulrike Fischer, Lars Dannecker, Laurynas Siksnys, Frank Rosenthal, Matthias Boehm, and Wolfgang Lehner. Integrated statistical methods are becoming increasingly important for database applications in order to cope with growing data volumes and the rising algorithmic complexity of data analysis. The authors argue for a deep integration of sophisticated statistical methods into database management systems. Specifically, the contribution discusses the integration of time series forecasting, which plays a major role in decision-making processes in many areas.

In addition, the section "Datenbankgruppen vorgestellt" (Database Groups Presented) contains a contribution by Thomas Seidl on data management and exploration at RWTH Aachen. The "Dissertationen" (Dissertations) section is pleasingly extensive in this issue, with eight dissertation abstracts. In the "Community" section, Alfons Kemper, Tobias Mühlbauer, Thomas Neumann, Angelika Reiser, and Wolf Rödiger report on the autumn meeting of the GI special interest group on database systems (GI-Fachgruppe Datenbanksysteme) at TU München. The meeting on the topic of Scalable Analytics drew a pleasing response with more than 80 participants and ran under the motto "Industry meets Academia". The "Community" section also contains a "News" item with current information.

2 Upcoming Focus Topics

2.1 RDF Data Management

Nowadays, more and more data is modeled and managed by means of the W3C Resource Description Framework (RDF) and queried by the W3C SPARQL Protocol and RDF Query Language (SPARQL). RDF is commonly known as a conceptual data model for structured information that was standardized to become a key enabler of the Semantic Web to express metadata on the web. It supports relationships between resources as first-class citizens, provides modeling flexibility towards any kind of schema, and is even usable without a schema at all. Furthermore, RDF allows data to be collected starting with very little schema information, with the schema refined later as required. This flexibility led to wide adoption in many other application domains, including life sciences, multifaceted data integration, community-based data collection, and large knowledge bases like DBpedia.

This special issue of the Datenbank-Spektrum aims to provide an overview of recent developments, challenges, and future directions in the field of RDF technologies and applications.
Topics of interest include (but are not limited to)
- RDF data management
- RDF access over the Web
- Querying and query optimization over RDF data, especially when accessed over the Web
- Applications and usage scenarios
- Case studies and experience reports

Guest editors:
Johann-Christoph Freytag, Humboldt-Universität zu Berlin, freytag@dbis.informatik.hu-berlin.de
Bernhard Mitschang, University of Stuttgart, Bernhard.Mitschang@ipvs.uni-stuttgart.de

2.2 Best Workshop Papers of BTW 2013

This special issue of the Datenbank-Spektrum is dedicated to the Best Papers of the Workshops running at the BTW 2013 at the TU Magdeburg. The selected Workshop contributions should be extended to match the format of regular DASP papers.

Paper format: 8–10 pages, double column
Selection of the Best Papers by the Workshop chairs and the guest editor: April 15th, 2013
Guest editor: Theo Härder, University of Kaiserslautern, haerder@cs.uni-kl.de
Deadline for submissions: June 1st

2.3 Information Retrieval

The amount of available information has increased dramatically in the last decades. At the same time, the way in which this information is presented has changed rapidly: Multimedia data such as audio, images, and video complements or even replaces textual information, user-generated content from blogs or social networks replaces static Web sites, and highly dynamic content such as tweets is published in realtime. Information Retrieval methods make it possible to quickly find relevant pieces of information for a possibly complex information need in this huge pile of data.

This special issue of the Datenbank-Spektrum aims to provide an overview of recent developments, challenges, and future directions in the field of Information Retrieval technologies and applications.
Topics of interest include (but are not limited to)
- Crawling, Indexing, Query Processing
- Information Extraction and Mining
- Interactive Information Retrieval
- Personalized and Context-Aware Retrieval
- Structured and Semantic Search
- Evaluation and Benchmarking
- Archiving and Time-Aware Retrieval Models
- Enterprise Search
- Realtime Search: Streams, Tweets, Social Networks
- Multimedia Retrieval

Paper format: 8–10 pages, double column
Notice of intent for a contribution: July 15th, 2013
Guest editors:
Ralf Schenkel, Max-Planck-Institut für Informatik, schenkel@mpi-inf.mpg.de
Christa Womser-Hacker, Universität Hildesheim, womser@uni-hildesheim.de
Deadline for submissions: October 1st, 2013

Datenbank Spektrum (2013) 13:17–22
FOCUS ARTICLE

Efficient OR Hadoop: Why Not Both?

Jens Dittrich · Stefan Richter · Stefan Schuh

Received: 19 November 2012 / Accepted: 17 December 2012 / Published online: 11 January 2013
Springer-Verlag Berlin Heidelberg 2013

J. Dittrich, Information Systems Group, Campus E1 1, Saarland University, Saarbrücken, Germany
jens.dittrich@cs.uni-saarland.de

Abstract In this article, we give an overview of research related to Big Data processing in Hadoop going on at the Information Systems Group at Saarland University. We discuss how to make Hadoop efficient. We briefly survey three of our projects in this context: Hadoop++, Trojan Layouts, and HAIL.

Keywords Hadoop · HDFS · MapReduce · Indexing · Big data

1 Introduction

Nowadays, the amount of data that organizations have to manage is growing exponentially. For a growing number of companies, like Google, Facebook, and Twitter, this data volume already reaches the order of petabytes. The same holds for scientific organizations, like CERN, that collect large amounts of sensor data from their experiments [3]. For these companies and institutions, their ever-growing collections of data are like gushing sources of raw information that might yield beneficial or business-relevant knowledge. However, exploiting these resources and turning them into value comes at a price. It takes the means to store and analyze huge data volumes and to keep pace with their constant growth. In the past, these scalability requirements already exceeded the capabilities of all but the largest (and most expensive) computers. Hence, many companies moved from mainframes to clusters of cheap commodity hardware to distribute data and computation among a large number of nodes. But the problem was not only about hardware; it was about software as well.
Traditional database systems and analytics that have been subject to research and development for decades were simply not designed for massively parallel environments and thus only scale out to a limited number of nodes.

In the past years, the Hadoop ecosystem has become the de facto standard to handle so-called Big Data. Its main components are Hadoop Distributed File System (HDFS) and Hadoop MapReduce. HDFS allows users to store petabytes of data on large distributed clusters. HDFS provides high fault-tolerance capabilities in an environment where failures of single disks or whole nodes are not the exception but the rule. Hadoop MapReduce allows users to query the data with the simple yet expressive map()/reduce() paradigm without a need for the user to care about parallelization, scheduling, or failover.

In contrast to parallel DBMS, Hadoop MapReduce scales easily to very large clusters of thousands of machines. In addition, the upfront investment for using MapReduce is small: no need to use schemas, no integrity constraints, no data cleaning, no normalization, and: NoSQL. Moreover, installing and configuring a MapReduce cluster is relatively easy compared to a parallel DBMS: Almost any user with minimal knowledge of Java is able to write and run Hadoop jobs. All of this explains the popularity of MapReduce among non-database people.

On the flip side, MapReduce does not really have an optimizer: MapReduce jobs are scan-oriented; in fact, the entire system design is centered around the idea of executing long-running jobs. Furthermore, several classes of tasks cannot be expressed naturally with the map()/reduce() paradigm, e.g., joins, iterative tasks, and updates. And finally: the performance of MapReduce is in many cases far from that of an optimized parallel DBMS.

One might conclude that there is a deep divide between the two classes of systems, parallel DBMS and Hadoop MapReduce. And in fact: in 2009, the database community triggered a heated discussion with a paper by Pavlo et al. [9] which unfortunately widened that divide. However, given the recent popularity of Hadoop, one might get the idea that there must be a reason for this popularity. If database systems are so great, why isn't everyone using them? We believe that the key to this discussion is not about the new kid on the block, Hadoop, solely learning from mature database technology, but that it is key for databases to also learn from Hadoop.

Our research question is: is there a way to preserve the properties of Hadoop while fixing its issues AND without turning it into yet another parallel DBMS? As a consequence, in 2009, we started a series of projects investigating this. These projects are Hadoop++, Trojan Layouts, and HAIL. We will briefly sketch these projects in the following.

2 Hadoop++

What if we do not touch the source code of HDFS and Hadoop? Is it still possible to substantially improve runtimes of MapReduce jobs? We investigated this question in [4]. In that work, we analyzed the query processing pipeline of Hadoop. The major observation was that Hadoop implements a hardcoded processing pipeline whose structure is very hard if not impossible to change. However, Hadoop's processing pipeline also provides at least ten different user-defined functions (UDFs), map() and reduce() being just two of them. These different UDFs may be exploited to place arbitrary code inside the Hadoop processing pipeline and turn Hadoop into a versatile distributed runtime. We exploited this to inject indexing and co-partitioning algorithms into Hadoop. This idea is somewhat similar to injecting a Trojan, however: this time for good.
Hence, we coined the resulting index structure a Trojan index. For instance, we change the group() and shuffle() UDFs that control grouping and shuffling. This allows us to create separate indexes for each HDFS block. We evaluated our indexes using the benchmark proposed in [9]. We could show that the runtimes of Hadoop++ are up to a factor of 20 faster than Hadoop.¹ These performance improvements are possible, as we spend additional time creating indexes and copartitions before executing any MapReduce job. The time spent for creating those indexes may be considerable [4].

¹ A companion paper explores the pitfalls when measuring distributed systems like MapReduce in a cloud environment [13]. Other works look at the efficiency of the Hadoop failover algorithms [10, 11].

The idea of improving the performance of a closed-source system by injecting code afterwards may also be applied to traditional database systems. In an upcoming work [8], we investigate how to change the data layout of a closed-source row-store into using compressed column-oriented layouts, yielding up to a factor of 20 performance improvements.

3 Trojan Layouts

How could we change the data layout of Hadoop to be better suited for analytical query processing? We investigated this question in [6]. Obviously, one could simply store all data in a compressed column layout and hope for similar speed-ups as known from traditional column stores. However, in a distributed system there is a major issue with this approach: column data representing the same rows should be stored physically close together so as to avoid expensive network I/O for tuple reconstruction.

Fig. 1 Data access costs for different data layouts in Hadoop

Figure 1 simulates this effect. The horizontal axis depicts the number of referenced attributes for a query. The vertical axis depicts the data access costs. For a Column Layout, the costs for network transfer have to be factored in and ruin the overall performance.
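The qualitative shape of this trade-off can be sketched with a toy cost model in Python. This is an illustration only and does not reproduce the simulation behind Fig. 1: the function `access_cost`, the unit costs `local_io` and `network_io`, the table width of ten attributes, and the assumption that every referenced column except the first lives on a remote node are all hypothetical.

```python
def access_cost(layout, referenced, total_attrs=10, local_io=1.0, network_io=5.0):
    """Toy per-tuple access cost for a query touching `referenced` attributes.

    All unit costs are invented for illustration; they are not measurements.
    """
    if not 1 <= referenced <= total_attrs:
        raise ValueError("referenced must be between 1 and total_attrs")
    if layout == "row":
        # A row layout reads the whole tuple, whatever the query references.
        return total_attrs * local_io
    if layout == "pax":
        # PAX reads only the referenced columns, all inside one local block.
        return referenced * local_io
    if layout == "column":
        # Assumption: every referenced column beyond the first is stored on
        # another node, so tuple reconstruction pays network I/O for it.
        remote_columns = referenced - 1
        return referenced * local_io + remote_columns * network_io
    raise ValueError(f"unknown layout: {layout}")


# Cost curves for 1..10 referenced attributes, one list per layout.
costs = {
    layout: [access_cost(layout, k) for k in range(1, 11)]
    for layout in ("row", "column", "pax")
}
```

Under these assumptions the row curve is flat, the PAX curve grows with the number of referenced attributes but never exceeds the row curve, and the column curve overtakes both once network transfer dominates.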
For Row Layout, the number of referenced attributes does not have an effect. Therefore, a popular layout in the context of MapReduce is the hybrid layout PAX [1]: in this approach, all data inside an HDFS block, i.e., a large horizontal partition of data of at least 64 MB, is stored in column layout. This avoids the problems with network I/O for tuple reconstruction and still gives column-like access. However, for some workloads, PAX is not the best layout.

In [6] we follow the PAX philosophy in that we keep data belonging to a particular HDFS block on that HDFS block, i.e., there is no global reorganization of data across HDFS blocks. However, in contrast to PAX, we introduce an important change: Hadoop's Distributed File System (HDFS) stores three copies of an HDFS block for fault-tolerance anyway. All of these copies are byte-identical. We change this to allow the different copies of a logical HDFS block to have different physical layouts. As we do not remove any data from the different copies, we fully preserve the fault-tolerance properties of HDFS. At the same time, we are able to optimize the different copies for different types of queries. In [6], we exploit this to compute different vertical partitionings for each copy, i.e., we end up with three different vertical partitionings, which in turn are then exploited at query time. Trojan Layouts improves query runtimes both over row layouts and over PAX layouts by up to a factor of 5. However, two interesting questions remain.

4 HAIL

How could we instrument the different copies of an HDFS block to use different clustered indexes? And how could we teach Hadoop to create those indexes without paying a high price for expensive index creation jobs as observed for Hadoop++? For this we proposed HAIL (Hadoop Aggressive Indexing Library) [5] to improve the total runtime of those tasks dramatically. HAIL is an enhancement of HDFS and Hadoop MapReduce that keeps the existing physical replicas of an HDFS block in different sort orders and with different clustered indexes. Hence, for a default replication factor of three, three different sort orders and indexes are available for MapReduce job processing. Thus, the likelihood of finding a suitable index increases and hence the runtime for a workload improves. In fact, the HAIL upload pipeline is so effective when compared to HDFS that the additional overhead for sorting and index creation is hardly noticeable in the overall process.

Why don't we have high costs at upload time? We essentially exploit CPU cycles that standard HDFS leaves unused.
As the standard HDFS upload pipeline is I/O-bound, the effort for our sorting and index creation in the HAIL upload pipeline is hardly noticeable. In addition, as we already parse data to binary while uploading, we often benefit from a smaller binary representation triggering less network and disk I/O.

In the following, we give a simplified, high-level overview of the HAIL upload pipeline. For example, let us assume we have a world population table stored in HDFS containing records of type [city, country, population]. If we now want to analyze the population of China, Hadoop MapReduce has to scan the whole world population table and filter for people living in China. While this might be relatively efficient for a country like China, we would still waste time reading data that is not needed, and this becomes even more extreme if we scan for people living in Luxembourg. If we are now interested in data for a specific city, the traditional Hadoop approach feels like finding a needle in a haystack. This is a typical situation that could be solved with an index, e.g., the Trojan index from Hadoop++. However, we have already seen that creating Trojan indexes is a very costly operation that is only amortized after many queries selecting on the indexed attribute. Additionally, the Trojan index will only help when selecting on one particular attribute. But what happens if our workload consists of queries selecting on many different attributes like age or name?

Figure 2 shows the HAIL upload pipeline. When uploading a data file to HDFS using the HAIL client, we first analyze the schema of the input (step 1) and convert the textual data into PAX layout (step 2). This allows us to save bandwidth, because the binary format is often more space-efficient than the textual representation. As in normal Hadoop, HAIL first asks the Namenode for the locations of all Datanodes that should store a replica of the current block (step 3).
Then, HAIL divides each block into packets (step 4) and sends them to the first Datanode, i.e., the node that was chosen by the Namenode to store the first replica (step 5). The first Datanode then reassembles the blocks from the packets (step 6), sorts the tuples on the index attribute, and creates the actual clustered index (step 7). In parallel, the first Datanode immediately forwards each incoming packet to the next Datanode that stores replica 2. This procedure is repeated for all Datanodes that store a replica until the packets reach the last Datanode. This allows us to create different indexes in parallel on all Datanodes. After reaching the last Datanode, all packets are validated against their checksums (step 9). Finally, if the blocks could be verified, all Datanodes register their created indexes with the Namenode (steps 10 and 11). With this approach, HAIL even allows us to create more than three indexes at reasonable costs.

Figure 3 shows a comparison of upload times for Hadoop, Hadoop++, and HAIL on our ten-node cluster with a dataset of 130 GB. This dataset resembles a typical scientific dataset. A more detailed description of the experiments and the datasets used can be found in our HAIL paper [5]. In Fig. 3(a), we vary the number of indexes from 0 to 3 for HAIL and from 0 to 1 for Hadoop++ (this is because Hadoop++ cannot create more than one index). Notice that we only report numbers for 0 indexes for standard Hadoop, as it cannot create any indexes. We observe that HAIL significantly outperforms Hadoop++, by a factor of 5.2 when creating no index and by a factor of 8.2 when creating one index. We observe that HAIL also outperforms Hadoop by a factor of 1.6 even when creating three indexes. This is because HAIL's binary representation of the dataset has a reduced size, which allows HAIL to outperform Hadoop even when creating one, two, or three indexes.
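The core idea of the pipeline, keeping each replica of a block in its own sort order with its own clustered index, can be sketched in a few lines of single-process Python. This is a simplified illustration under stated assumptions, not HAIL's implementation: `build_replica`, `upload`, and `select_eq` are hypothetical names, packets, checksums, the Namenode, and the PAX conversion are left out, the "clustered index" is just the sorted key list searched with the standard `bisect` module, and the population figures are made-up placeholders.

```python
from bisect import bisect_left, bisect_right


def build_replica(block, sort_attr):
    """Sort one replica on `sort_attr` and keep the sorted keys as its index."""
    rows = sorted(block, key=lambda row: row[sort_attr])
    keys = [row[sort_attr] for row in rows]
    return {"attr": sort_attr, "rows": rows, "keys": keys}


def upload(block, index_attrs):
    """Create one replica per index attribute (stand-in for per-Datanode work)."""
    return [build_replica(block, attr) for attr in index_attrs]


def select_eq(replicas, attr, value):
    """Answer `attr == value`, preferring a replica that is sorted on `attr`."""
    for replica in replicas:
        if replica["attr"] == attr:
            lo = bisect_left(replica["keys"], value)
            hi = bisect_right(replica["keys"], value)
            return replica["rows"][lo:hi], "index"
    # No suitable index: fall back to scanning any replica (assumes at least one).
    return [row for row in replicas[0]["rows"] if row[attr] == value], "scan"


# Placeholder records of type [city, country, population]; numbers are invented.
block = [
    {"city": "Shanghai", "country": "China", "population": 4},
    {"city": "Esch", "country": "Luxembourg", "population": 1},
    {"city": "Beijing", "country": "China", "population": 3},
    {"city": "Luxembourg City", "country": "Luxembourg", "population": 2},
]

replicas = upload(block, ["city", "country", "population"])
lux_rows, lux_path = select_eq(replicas, "country", "Luxembourg")
```

With three replicas, three different selection attributes can be served by an index while every replica still holds all the data, which mirrors the fault-tolerance argument made in the text.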

Fig. 2 Overview of the HAIL upload pipeline

Fig. 3 Upload times when varying the number of created indexes (a) and the number of data block replicas (b)

We now analyze how well HAIL performs when increasing the number of replicas. In particular, we aim at finding out how many indexes HAIL can create for a given dataset in the same time standard Hadoop needs to upload the same dataset with the default replication factor of three and creating no indexes. Those results are presented in Fig. 3(b). The dotted line marks the time Hadoop takes to upload with the default replication factor of three. We see that HAIL significantly outperforms Hadoop for any replication factor, by up to a factor of 2.5. More interestingly, we observe that HAIL stores six replicas (and hence creates six different clustered indexes) in a little less than the time Hadoop needs to upload the same dataset with only three replicas and without creating any index. Still, when increasing the replication factor even further for HAIL, we see that HAIL has only a minor overhead over Hadoop with three replicas.

A more detailed description of the HAIL upload pipeline that discusses some interesting implementation challenges like adapting Hadoop's packeting and checksumming, Namenode extension, index structure, and fault tolerance can be found in our paper [5]. From these results, we can see a huge improvement in indexing overhead when compared to Hadoop++ and conclude that HAIL provides efficient indexing of many attributes with no or almost invisible overhead.

But how can we now use the HAIL indexes, and what are the corresponding improvements in terms of query performance? There are at least three options:

1. We can analyze the user-provided map() function using static code analysis. Then we rewrite the map() function automatically against our indexes. This approach is fully user transparent.
This type of code analysis has already been done successfully in [2] and could be extended to exploit HAIL indexes as well.

2. We allow users to annotate the map functions slightly. This approach is not fully user transparent, yet minimally invasive. A simple example would be to find the names of all people living in Luxembourg. If we assume that name is the first attribute and country is the second attribute in the world-population dataset, we simply annotate the map function in Java with a filter on the second attribute for the value "Luxembourg" and with projection={@1}. This has the effect that the dataset is pre-filtered and only the attribute name of tuples where country equals Luxembourg is passed to the map function.

3. The third approach is to modify the applications sitting on top of HDFS or Hadoop MapReduce. As HAIL is a replacement for HDFS, user transparency may be achieved by modifying any software layer on top. For instance, Hive and Pig output machine-generated MapReduce programs; Impala operates directly on flat HDFS files. For these systems, it would be straightforward to change their MapReduce program generation to exploit HAIL indexes, similar to changing a DB optimizer to create physical plans using index access paths.

Fig. 4 End-to-end job runtimes for two different workloads

Figure 4 illustrates the query performance of HAIL compared to Hadoop and Hadoop++. We clearly observe that HAIL significantly outperforms both Hadoop and Hadoop++. We see in Fig. 4(a) that HAIL outperforms Hadoop by up to a factor of 68 and Hadoop++ by up to a factor of 73 for a log analysis workload (Bob queries). For a Synthetic workload (Fig. 4(b)), we observe that HAIL outperforms Hadoop by up to a factor of 26 and Hadoop++ by up to a factor of 25. Overall, we observe in Fig. 4(c) that using HAIL we can run all five queries 39× faster than Hadoop and 36× faster than Hadoop++. We also observe that HAIL runs all six Synthetic queries 9× faster than Hadoop and 8× faster than Hadoop++.

When developing HAIL we learned that the high scheduling overhead of MapReduce tasks is a severe problem when improving the performance of block accesses. All improvements can be eaten up by this overhead. HAIL reduces this overhead significantly using a novel splitting policy at query time (HAIL scheduling). At its core, HAIL scheduling assigns multiple index accesses to a single map task. This way, we avoid the Hadoop MapReduce overheads for scheduling multiple map waves (see [5] for details). Overall, using HAIL scheduling we achieve the performance seen in Fig. 4.

5 Lessons Learned and Conclusion

We learned that it is possible to introduce indexing into the Hadoop upload pipeline with little to no overhead (Hadoop++). Additional, substantial performance improvements are possible when HDFS is changed to support multiple physical layouts (Trojan Layouts). An interesting challenge was to instrument HDFS to provide efficient index creation and query processing at the same time (HAIL). Future work aims at generalizing the different projects into a common storage optimizer [7] and adding zero-overhead adaptive indexing to Hadoop [12].

Yes, parallel DBMS and Hadoop MapReduce are very different systems at first sight. Hadoop is a young system compared to parallel DBMS and can still be improved in many different ways. The Hadoop ecosystem provides an opportunity for the database community to broaden the impact of our research. It is also an opportunity to revisit design decisions taken in the past and take different routes than the ones we took before. In this spirit, we believe that it will be important to teach efficiency to Hadoop without turning it into yet another parallel DBMS.

Acknowledgements Research partially supported by BMBF. We would like to thank all authors and team members of the Hadoop++, Cloud Variance, RAFT, Trojan Layouts, HAIL, and LIAH projects for their support.

References

1. Ailamaki A et al (2001) Weaving relations for cache performance. In: VLDB
2. Cafarella MJ, Ré C (2010) Manimal: relational optimization for data-intensive programs. In: WebDB
3. Dittrich J, Quiané-Ruiz JA (2012) Efficient big data processing in Hadoop MapReduce. Proc VLDB Endow 5(12)
4. Dittrich J, Quiané-Ruiz JA, Jindal A, Kargin Y, Setty V, Schad J (2010) Hadoop++: making a yellow elephant run like a Cheetah (without it even noticing). Proc VLDB Endow 3(1–2)

11 22 Datenbank Spektrum (2013) 13: Dittrich J, Quiané-Ruiz JA, Richter S, Schuh S, Jindal A, Schad J (2012) Only aggressive elephants are fast elephants. Proc VLDB Endow 5(11): Jindal A, Quiané-Ruiz JA, Dittrich J (2011) Trojan data layouts: right shoes for a running elephant. In: SOCC 7. Jindal A, Quiané-Ruiz JA, Dittrich J (2013) WWHow! Freeing data storage from cages. In: CIDR 8. Jindal A, Schuhknecht FM, Dittrich J, Khachatryan K, Bunte A (2013) How Achaeans would construct columns in Troy. In: CIDR 9. Pavlo A et al (2009) A comparison of approaches to large-scale data analysis. In: SIGMOD, pp Quiané-Ruiz JA, Pinkel C, Schad J, Dittrich J (2011) RAFT at work: speeding-up MapReduce applications under task and node failures. In: SIGMOD, pp Quiané-Ruiz JA, Pinkel C, Schad J, Dittrich J (2011) RAFTing MapReduce: fast recovery on the RAFT. In: ICDE, pp Richter S, Quiané-Ruiz JA, Schuh S, Dittrich J (2012) Towards zero-overhead adaptive indexing in Hadoop. arxiv: [cs.db] 13. Schad J, Dittrich J, Quiané-Ruiz JA (2010) Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proc VLDB Endow 3(1):

Datenbank Spektrum (2013) 13:55–58
DATABASE GROUPS PRESENTED (DATENBANKGRUPPEN VORGESTELLT)
Data Management and Data Exploration at RWTH Aachen (Datenmanagement und -exploration an der RWTH Aachen)
Thomas Seidl
Published online: 16 January 2013. Springer-Verlag Berlin Heidelberg 2013

T. Seidl, Lehrstuhl für Informatik 9 (Datenmanagement und -exploration), RWTH Aachen, Aachen, Germany. seidl@informatik.rwth-aachen.de

Abstract The Chair of Computer Science 9 (Data Management and Data Exploration) at RWTH Aachen works on data mining and database technologies for multimedia and spatio-temporal data in applications from engineering, the natural sciences, the life sciences, economics, and the social sciences. Both the large volume of data and the complexity of the individual objects pose distinct challenges for the analysis and exploration of real-world data, which we address by developing new effective and efficient concepts for data analysis and data management.

Keywords Data mining · Analysis of high-dimensional and complex data · Efficient content-based similarity search

1 Development

The Chair of Computer Science 9 (Data Management and Data Exploration) belongs to the Computer Science department in the Faculty of Mathematics, Computer Science and Natural Sciences at RWTH Aachen and has been headed by Prof. Thomas Seidl since it was established in 2002. Thomas Seidl studied computer science with a minor in economics at the Technische Universität München. He completed his studies in 1992 with a diploma thesis on object storage systems under Prof. Rudolf Bayer. His dissertation, Adaptable Similarity Search in 3-D Spatial Database Systems, followed under Prof. Hans-Peter Kriegel at the Ludwig-Maximilians-Universität München, where he received his doctorate in 1997 and his habilitation in computer science in 2001. After a lectureship on multimedia databases at the University of Augsburg and a period as substitute professor at the University of Konstanz (standing in for Prof. Daniel Keim, who was in the USA at the time), he took up the chair at RWTH in September 2002. The first research assistants started work in 2003.

Seven doctorates have been completed at the chair so far: Ira Assent (2008), Christoph Brochhaus (2008), Ralph Krieger (2008), Emmanuel Müller (2010), Marc Wichterich (2010), Philipp Kranen (2011), and Stephan Günnemann (2012). A further nine doctoral candidates are currently preparing their doctorates. Three of the graduates are continuing an academic career: Ira Assent has been Professor of Databases at Aarhus University in Denmark since December 2010; Emmanuel Müller leads a Young Investigator Group at the Karlsruhe Institute of Technology and is a postdoctoral fellow of the Research Foundation Flanders (FWO) at the University of Antwerp in Belgium; and Stephan Günnemann conducts research at Carnegie Mellon University in Pittsburgh, USA, on a DAAD postdoctoral fellowship. The other four have moved to private companies and now work in the research-oriented innovation department of an international software company, at a large mail-order company in the USA, at a well-known automotive supplier in southern Germany, and at a regional IT company.

Over the past ten years, more than 90 diploma and master's theses have been completed at the chair. In addition to many internal topics, some theses were supervised in cooperation with other departments and with various companies.

2 Research Projects

The chair's environment is very active and offers many points of contact. Within the Computer Science department there are various collaborations with the Chair of Computer Science 5 (Information Systems and Databases, Prof. Matthias Jarke) and with machine learning groups such as Computer Science 6 (Speech Processing and Pattern Recognition, Prof. Hermann Ney), Computer Science 8 (Mobile Multimedia Processing, Prof. Bastian Leibe), and Computer Science 5 (Knowledge-Based Systems, Prof. Gerhard Lakemeyer). In parallel with the chair, the Teaching and Research Area Computer Science 9 (Learning Technologies) was also established in 2002 and filled by Prof. Ulrik Schroeder. Collaborations with other disciplines such as mechanical engineering, electrical engineering, civil engineering, economics, physics, and medicine are reflected in the projects described below. Interesting research questions from other applications also arise from membership in the interdisciplinary Forum Informatik at RWTH Aachen and in the Regionaler Industrie-Club Informatik Aachen (REGINA), in which more than 90 IT companies of the region have joined forces under the leadership of the RWTH Aachen Computer Science department and the Aachen Chamber of Industry and Commerce.

The following projects have been or are being funded by third parties:

- In the Cluster of Excellence UMIC (Ultra high-speed Mobile Information and Communication), with partners from computer science and from electrical engineering and information technology at RWTH Aachen, we work on the subprojects D4, Energy Awareness of Mobile Information Systems: Mobile data provisioning and data dissemination models; index on the air techniques, and B2, Stream Data Mining for the HealthNet Scenario: Aggregation and mining of multi-dimensional concurrent sensor data streams.
- In the DFG Collaborative Research Centre 686, Model-Based Control of Homogenized Low-Temperature Combustion (Modellbasierte Regelung der homogenisierten Niedertemperatur-Verbrennung), with partners from mechanical engineering at RWTH Aachen and physical chemistry in Bielefeld, we work on subproject A6, Anytime methods for predictive control using dynamically adaptive models: process analysis by matching experimental and simulated process data, and adaptive model building for model predictive control.
- In the DFG Priority Programme 1335, Scalable Visual Analytics, we work in the project SteerSCiVA: Visual Analytics methods to steer the subspace clustering process, with partners at the University of Konstanz, on new interactive methods for subspace clustering and multi-view clustering.
- In the BMWi THESEUS cooperation MachInNet: Machining Intelligence Network, with CIM Aachen and EXAPT GmbH, we developed search and retrieval algorithms for tool data in CAD and NC databases.
- The DFG individual project Schnelle EMD-Suche (fast EMD search) dealt with index support for the Earth Mover's Distance (EMD) for fast content-based search in multimedia databases, as well as new techniques for approximation, indexing, and dimensionality reduction for the EMD.
- In the DFG individual project SQFD-Based Multimedia Retrieval, we develop new techniques for approximation, indexing, and dimensionality reduction for the Signature Quadratic Form Distance.
- BSI BioKeyS: Pilot-DB Template Protection, with the Federal Office for Information Security (BSI), Fraunhofer IGD, Hochschule Darmstadt, and LMU München. Subproject on fast search and retrieval algorithms for encrypted fingerprints: approximations and similarity search for biometric data encrypted with fuzzy vault technologies.
- B-IT Research School of the Bonn-Aachen International Institute for Information Technology: doctoral scholarship for the project Subspace Cluster Analysis in Large Graph Databases (Subspace-Clusteranalyse in großen Graphdatenbanken).
- In the EU coordinated action NiSIS (Nature-inspired Smart Information Systems), with ELITE Aachen and M.I.T. GmbH, we worked on the subproject Nature Inspired Methods for Local Pattern Detection (NiLOP): new models and algorithms for efficient and effective subspace clustering.
- Project House IMP of the Faculty of Business Administration at RWTH Aachen: projects 4C-NANONETS: From Clusters and Cooperation to Creativity and Commercialization in Nano Science and Technology Networks (methods for analyzing publication networks) and UP-BEAT: User-friendly Program for Better Effectiveness of Advertising Tools (methods for analyzing Internet advertising channels), with the chairs of economics, technology management, marketing, operations research, and sociology of technology.
- Umbrella Cooperation on Managing and Analyzing Location History of Individuals and Crowds for Urban Planning (2012), with the Technion in Haifa, Israel: data mining for trajectories in spatial databases.

With several institutions we maintain an international exchange of researchers, for example with the National Institute of Informatics (NII) in Tokyo, Japan (Prof. Michael Houle), Simon Fraser University in Vancouver, Canada (Prof. Martin Ester), the University of Waikato, New Zealand (Prof. Bernhard Pfahringer, Dr. Albert Bifet), Charles University in Prague, Czech Republic (Prof. Tomas Skopal), and the University of Trento, Italy (Prof. Themis Palpanas). International scientific communication is also maintained through the publication of research results, regular membership in the program committees of several conferences and workshops, (guest) editorships (VLDB Journal, GeoInformatica, Machine Learning), and reviews for journals. At several conferences the chair has taken on key roles, such as General Co-Chair of BTW 2007 in Aachen (with Prof. Matthias Jarke), PC Co-Chair of SSTD 2009 in Aalborg, Denmark (with Prof. Nikos Mamoulis), Area Chair for Knowledge Discovery, Clustering, Data Mining at ACM SIGMOD 2013, and General Chair of LWA 2014 in Aachen, as well as the organization of several workshops, including NiLOP at NiSIS 2007 in Palma, Mallorca; MultiClust at ECML PKDD 2011 in Athens, Greece; and MultiClust at SDM 2012 in Anaheim, California, USA.

3 Research Topics

Our research activities target data mining and database technologies for the analysis and exploration of multimedia and spatio-temporal data in applications from engineering, the natural sciences, the life sciences, economics, and the social sciences. Both the large volume of data and the complexity of the individual objects pose distinct challenges for data analysis and data management, which we address by developing new effective and efficient concepts. Our concrete work falls into three categories: data mining and data analysis, content-based similarity search, and fast query processing. These are described in more detail in the following sections.
3.1 Analysis of High-Dimensional Data, Data Mining

In the area of data mining we develop new techniques for data analysis and knowledge extraction in large structured databases.

Subspace clustering, multi-view clustering. In high-dimensional data, frequent patterns are often found only over a few relevant attributes, which are obscured by other, irrelevant attributes. Since an exhaustive search over all subspaces and projections is impractical because of its exponential complexity, we develop new efficient and effective methods for data mining in high-dimensional data.

Evaluation of subspace clustering methods, OpenSubspace framework. So far there are neither standardized evaluation procedures nor established benchmarks for the task of subspace clustering. The OpenSubspace framework provides various implementations and evaluation measures.

Stream data mining, sensor data mining, anytime data mining. In many applications data are not static but have to be analyzed dynamically while they are being captured. Depending on the speed and regularity of data generation, a different amount of time is available for each object. Anytime algorithms optimize the use of the available resources in order to achieve high-quality analysis results. We have extended the MOA (Massive Online Analysis) framework for stream data mining with clustering techniques.

Detection and ranking of outliers in high-dimensional data. In the classical case, outlier detection can be understood as the complementary task to clustering: all objects that do not belong to any cluster are considered outliers. However, if the data contain different concepts whose object sets are not disjoint, such as the clusters football fans, choir singers, and computer scientists, new concepts for identifying outliers are needed.
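The classical view of outliers as the complement of a clustering can be written down in a few lines. The following is a minimal sketch under stated assumptions, not a method from the chair's work: the class `ComplementOutliers`, the method `outliers`, and the integer object ids are hypothetical, and the clusters are simply given as sets of ids.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; the names are hypothetical and not taken from the article. */
final class ComplementOutliers {

    /**
     * Classical view: every object id in [0, objectCount) that belongs to
     * no cluster is reported as an outlier.
     */
    static Set<Integer> outliers(int objectCount, List<Set<Integer>> clusters) {
        Set<Integer> remaining = new TreeSet<>();
        for (int id = 0; id < objectCount; id++) {
            remaining.add(id);
        }
        for (Set<Integer> cluster : clusters) {
            remaining.removeAll(cluster); // clustered objects are not outliers
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Two overlapping clusters: object 1 belongs to both.
        Set<Integer> footballFans = new HashSet<Integer>(Arrays.asList(0, 1));
        Set<Integer> choirSingers = new HashSet<Integer>(Arrays.asList(1, 2));
        List<Set<Integer>> clusters = Arrays.asList(footballFans, choirSingers);
        System.out.println(outliers(5, clusters)); // prints [3, 4]
    }
}
```

The sketch also hints at the limitation: with overlapping clusters the complement still only reports objects outside every cluster, so an object that is unusual within one concept but belongs to another is never flagged.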
Combined graph and subspace clustering. Besides objects represented as vectors, the analysis of structures in graph databases plays a major role. In particular, networks with high-dimensional node or edge labels pose new, interesting challenges for data analysis.

Privacy-preserving data mining. In many applications the protection of personal data is an important task. Besides k-anonymity and l-diversity, we mainly study concepts of differential privacy in the context of spatio-temporal data analysis.

3.2 Content-Based Similarity Search

In this research area we work on the exploration of large multimedia databases and on multimedia information retrieval.

Adaptive similarity models. Adaptable distance functions such as quadratic forms or the Earth Mover's Distance (EMD) are applied to histogram-based similarity search ("fixed binning"). For the more flexible representation of objects by signatures ("individual binning"), the EMD is a classic similarity measure, whereas the more efficient quadratic forms were previously considered inapplicable. Our new Signature Quadratic Form Distance (SQFD) overcomes these modeling problems and opens up new possibilities for fast query processing.

Similarity search for complex objects. Complex objects such as vector graphics (CAD objects, tools), machining processes (NC programs), point sets (2D fingerprints, 3D molecules), multivariate time series (oceanographic data), or configuration networks occur in many applications with differing characteristics and data types. Generic approaches such as EMD, SQFD, DTW, and graph edit distance are suitable in principle but require appropriate adaptation.

Adaptation of new forms of interaction. Examples are relevance feedback for EMD and SQFD and interactive query specification by inverse multidimensional scaling ("MDS browser").

3.3 Fast Query Processing

For data mining and similarity search in large spatio-temporal databases we develop efficient indexing and query techniques.

Parallel processing of data analysis algorithms. The MapReduce programming model offers interesting opportunities for parallelizing data mining algorithms on very large, complex databases. We are currently investigating clustering methods based on MapReduce and PACT.

Indexing for high-dimensional data and time series. OF-Tree and TS-Tree avoid overlaps in order to counteract the curse of dimensionality. A combination of concepts from classical B*-trees and from R-trees, which are well established for spatial applications, improves query processing in similarity search for time series, multimedia objects, and high-dimensional objects.

Graphics data server for view-dependent visualization, using CFD post-processing as an example. In a cooperation with the Virtual Reality Center Aachen (VRCA), both a view-dependent similarity measure and efficient processing algorithms were developed.

Indexing of interval data in relational database systems. The Relational Interval Tree (RI-Tree) is an example of the relational indexing of complex objects. For overlap queries on interval data, it achieves high practical efficiency in addition to formally optimal complexity.

4 Teaching

Our courses for master's students comprise the following lectures:

- Data Mining Algorithms 1 introduces the process of knowledge discovery in databases and covers clustering, classification, frequent pattern mining, and generalization as well as data warehousing, visualization, and index structures.
- Data Mining Algorithms 2 covers advanced challenges and solution concepts for the analysis of complex object sets such as high-dimensional data, data streams, attributed graphs, and networks.
- Content-Based Multimedia Search covers similarity models and efficient algorithms for content-based search in large collections of complex objects such as multimedia objects, geometric shapes, and time series.

In the data mining lab course we deal with implementation issues of data mining, using the tools WEKA, KNIME, MOA, Hadoop MapReduce, and PACT. Seminars address current developments in data mining and multimedia exploration.

For bachelor's students, the chair takes part in the rotation of the lecture Algorithmen und Datenstrukturen (Algorithms and Data Structures) for major and minor subjects. The lecture Data Mining Algorithms 1 is offered as a compulsory elective. Related seminars, proseminars, and software lab courses on index structures and elementary data mining tasks complete the program. In recent years the chair has regularly contributed to the lecture series Medizinische Bildverarbeitung (Medical Image Processing), Bionik (Bionics), and Handling Big Data.

5 Outlook

For the future we plan to extend our research directions, data mining and content-based similarity search, to further complex objects and applications. We see worthwhile challenges here, ranging from modeling questions and the development of effective and efficient algorithms to new evaluation methods.

Datenbank Spektrum (2013) 13:67–69
COMMUNITY
News
Published online: 25 January 2013. Springer-Verlag Berlin Heidelberg

1 Autumn Workshop of the IR Special Interest Group (Herbstworkshop der Fachgruppe IR)

The ubiquity of search systems has led to the application of information retrieval technology in many new contexts (e.g. mobile and international) and for new object types (products, patents, music, microblogs). To develop appropriate products, basic knowledge of information retrieval needs to be revisited and innovative approaches need to be applied, for example by allowing for more user interaction or by taking the user's situational context and the overall task into account. The quality of information retrieval needs to be evaluated for each context. Large evaluation initiatives respond to these challenges and develop new benchmarks.

The workshop Information Retrieval 2013 of the Special Interest Group on Information Retrieval within the German Gesellschaft für Informatik (GI) provides a forum for scientific discussion and the exchange of ideas. The workshop takes place in the context of the LWA (Learning, Knowledge and Adaptivity) workshop week (October 7–9, 2013) at the University of Bamberg in Germany. It continues a successful series of conferences and workshops of the Special Interest Group on Information Retrieval.

The workshop addresses researchers and practitioners from industry and universities. Doctoral and master's students in particular are encouraged to participate and to discuss their ideas with world-renowned experts. An Industry Session will stimulate the exchange between information retrieval professionals and academics. The workshop is expected to include presentations in both German and English.

Program Chairs
- Dr. Sascha Kriewel, University Duisburg-Essen, Germany
- Dr. Claus-Peter Klas, Fernuniversität in Hagen, Germany

Topics
Submissions should address current issues in information retrieval. They include (but are not limited to):
- Development and optimization of retrieval systems
- Information retrieval theory
- Retrieval with structured and multimedia documents
- Evaluation and evaluation research
- Text mining and information extraction
- Cross-lingual and cross-cultural IR
- Digital libraries
- User interfaces and user behavior, HCIR
- Interactive IR
- Machine learning in information retrieval
- Information retrieval and knowledge management
- Information retrieval and the semantic web
- Databases and information retrieval
- Social search
- Task-based IR
- Web information retrieval (including blogs and microblogs)
- Clustering
- Patent retrieval
- Plagiarism detection
- Enterprise search
- Expert search
- Innovative concepts in IR teaching

We especially invite descriptions of running projects.

Types of Submissions
- Full papers (6 to 8 pages)
- Short papers (4 pages): position papers or work in progress
- Posters and demonstrations (2 pages): posters and presentations of systems or prototypes

Submissions are welcome in English and German. They have to follow the conference format and should be submitted as PDF files to EasyChair ( conferences/?conf=wir2013). All submissions will be reviewed by at least two independent reviewers.

Important Dates
- Submissions: July 1, 2013
- Notification: July 29, 2013
- Camera-ready contributions: August 19, 2013
- Workshop: October 7–9, 2013

Further information: LWA

It would be great to see you in Bamberg!

2 Product News

Uta Störl

2.1 Microsoft: In-Memory Support for SQL Server

In November, Microsoft announced improved in-memory support for SQL Server on its official blog, under the code name Hekaton. Like other vendors, Microsoft relies here on column-oriented storage and improved compression algorithms. The in-memory support is to be published with the next release of Microsoft SQL Server.
Source: Microsoft

2.2 Oracle MySQL: Release Candidate 5.6 Available

Oracle has presented release candidate 5.6 of MySQL. The presentation made clear that Oracle will in future rely almost exclusively on InnoDB as the storage engine. Accordingly, a number of features that were previously available only under MyISAM (including full-text search) have been added to InnoDB. In addition, there are a number of performance improvements for InnoDB and substantial improvements in the optimizer, among other things through better use of statistics. A detailed description of the new features can be found here: articles/mysql-5.6-rc.html
Source: Oracle

2.3 PostgreSQL 9.2 Released

In addition to performance improvements and lower power consumption, the newly released version 9.2 of the PostgreSQL database management system offers support for JavaScript Object Notation (JSON). JSON is of considerable importance in the web environment and is, for example, the typical storage format of document-oriented NoSQL database management systems.
Source: PostgreSQL

2.4 Oracle: Big Data Appliance and Oracle NoSQL Database 2.0

Oracle has presented its Big Data Appliance X3-2. The Big Data Appliance is equipped with 8-core Xeon CPUs (E5-2600), the current Cloudera distribution including Apache Hadoop, the Cloudera Manager with a plug-in for the Oracle Enterprise Manager for Big Data Appliances, and the Oracle NoSQL Database 2.0. The Oracle NoSQL Database is a key-value DBMS and has meanwhile been released in version 2.0.
Source: Oracle

2.5 Cassandra 1.2 Released

The Apache project has released version 1.2 of the NoSQL database management system Cassandra. Among other things, the query language CQL has been updated. Cassandra is one of the few NoSQL DBMS that offer an SQL-like query language (see also Datenbank-Spektrum 12/2, Produkt News). In addition, transaction support has been extended so that transactions can be rolled back, which cannot be taken for granted among NoSQL DBMS.
Source: Apache Software Foundation

2.6 Windows Azure: Support for the NoSQL Database Riak

In addition to the existing support for MongoDB (a document-oriented NoSQL database management system) and the MapReduce framework Apache Hadoop, Microsoft's cloud platform Windows Azure now also supports a NoSQL key-value database management system: Riak, from the company Basho.
Source: Microsoft

2.7 Amazon: Free Use of Relational Databases

Amazon Web Services offers a free usage tier for relational databases for one year. In addition to Amazon's own database systems DynamoDB and SimpleDB, MySQL, Oracle, and Microsoft SQL Server have recently been supported as well. Each database system can be used for up to 750 hours per month, subject to certain size and throughput limits.
Source: Amazon Web Services