The Booklet of Abstracts for the September GMDS Workshop on Supporting Translational and Personalized Medicine with SOA, Grid, and Cloud

organized by Bernhard Balkenhol, Anna Falkenhain, and Andreas Dress

Table of Contents

Bernhard Balkenhol: Improving Medical Care with Modern IT-Technology, P. 2
Martin Steinegger, Milot Mirdita, and Burkhard Rost: Cloud Architecture for Full In Silico Mutagenesis, P. 3
Stephan Schaller, Michael Block, Thomas Eissing: The REACTION Platform: Improving Long-term Management of Diabetes - Personalized Diabetes Therapy and Automatic Blood Glucose Control, P. 8
Klaus Maisinger: Analysis and interpretation of next-generation sequencing data in the cloud, P. 10
Philipp Daumke: Cloud services for the secondary use of healthcare data in industry and research, P. 11
The abstract in German, P. 13
Titus Kühne: The need for a cloud-based IT infrastructure to efficiently collect, manage, store, share, and evaluate medical data, P. 15
Jochen Dress: Clinical Studies, Good Clinical Practice, SOA, Grid, and Cloud, P. 16
Harald Binder: Judging data sources and personalized prediction rules for clinical endpoints, P. 18
Andreas Gagidis: Patent Protection of Software and Diagnostic Methods in Personalized Medicine, P. 19
Andreas Dress: The Challenge of Analysing Genome and Proteome Data, P. 20
Bernhard Balkenhol, Andreas Dress, Anna Falkenhain: Translationale und personalisierte Medizin - Ein Einsatzfeld für SOA, Grid und Cloud, P. 21
Improving Medical Care with Modern IT-Technology
Bernhard Balkenhol, CEO, infinity³ GmbH, Bielefeld, Germany

Abstract: It is a well-known and much-deplored fact that, in spite of great efforts, current progress in the medical and health sciences does not easily find its way straight to the bedside in our hospitals [UR12]. While tools for generating better and more complete information regarding, e.g., the individual genetic and epigenetic makeup of patients are becoming increasingly available, the sheer amount of data generated cannot easily be interpreted and taken into account by medical practitioners [SW12]. And while "Hospital Information Systems" (HIS) can manage the information flow and storage in routine services, they are not yet designed to support current medical research by connecting routine administrative hospital data and routinely integrating it with the daily work of medical professionals at the bedside. In consequence, tools need to be developed that provide medical practitioners with the means to

- specifically search for even the latest medical insights whenever needed, and to
- discuss the implications of those insights for their individual patients with medical experts, while taking into account all of their patients' individual data as well as everything that can be learned from the various medical and bio-databases in a given context.

In my lecture, I will demonstrate how one can achieve all of this while, simultaneously, opening up new avenues for medical research and clinical studies by taking advantage of specifically designed service-oriented, cloud-based, and grid-based IT architectures [SOA01, EN08, JO08, LNK12].

References
[SOA01] SOA Know-How, Bitkom Servicegesellschaft mbH (http://www.soa-know-how.de).
[EN08] Engels, Hess, Humm, Juwig, Lohmann, Richter, Voß, Willkomm: Quasar Enterprise: Anwendungslandschaften serviceorientiert gestalten. Heidelberg, 2008.
[JO08] Josuttis: SOA in der Praxis: System-Design für verteilte Geschäftsprozesse. Heidelberg, 2008.
[UR12] Gerald Urban et al.: Biomarker. DGBMT-Innovationsreport 2012 zum Thema Personalisierte Medizintechnik, 2012.
[SW12] Thomas Wittenberg and Cord Schlötelburg: Theranostik im OP: Closed-Loop-Systeme. DGBMT-Innovationsreport 2012 zum Thema Personalisierte Medizintechnik, 2012.
[LNK12] Linked Data - Connect Distributed Data across the Web, http://linkeddata.org/

A German version of an extended abstract of this contribution is appended at the end of this booklet.
Cloud Architecture for Full In Silico Mutagenesis
Martin Steinegger, Milot Mirdita, Burkhard Rost
Department of Bioinformatics and Computational Biology, TUM, Boltzmannstr. 3, Garching, Germany

Abstract: Driven by the rise of next-generation sequencing (NGS), the costs of sequencing genomes are dropping rapidly; Life Technologies has even promised the $1,000 genome. With falling costs, a new kind of medical diagnosis based on the genome is going to find mainstream adoption. This is, however, difficult given the rising demands on computational power and storage: new types of diagnostic methods produce enormous amounts of computed data. Cloud-based computing approaches offer an affordable solution to this problem. We built an efficient cloud-based SNP pipeline that can accomplish a full in silico mutagenesis. This pipeline is useful for several reasons. Firstly, the study of all possible human mutations will provide the background against which we can assess the effect of the mutations that are actually observed between people. This will be crucial both for the advancement toward individual medicine and for the understanding of human diversity and variation. Secondly, we need to make look-up answers available for all variants that will be observed and implicated in possible phenotypes. The only way to generate those look-ups is by pre-computing all possible changes. Our first run against the human proteome generated functional predictions for over 350 million mutations; calculations at this scale were performed using massively parallel cloud computing, beating local cluster installations on both price and run time.

1 Introduction

1.1 SNPs

SNPs (Single Nucleotide Polymorphisms) are variations at a single position of the genome. They can affect harmless traits like eye color, or cause grave diseases like Alzheimer's disease. Sometimes there is a one-to-one correlation between a SNP and a disease; an example is sickle cell anemia. Vernon Ingram published his research in 1959 [Ing59], showing that a single change in the amino acid composition of a peptide, from a glutamic acid to a valine, causes the sickle shape of the affected blood cells. In many other cases, a complicated network of SNPs can be identified as the source of a changed phenotype.

M. Steinegger and M. Mirdita contributed equally to this work.
1.2 SNAP

SNAP (Screening for Non-Acceptable Polymorphisms) [BR07] is an in silico method to predict whether a non-synonymous SNP causes a change in protein function. SNAP was developed by Yana Bromberg in the Rost Lab at Columbia University and was improved by Maximilian Hecht in 2011; his version is called SNAP2. SNAP2 [Hec11] performs comparably to, or better than, SNAP and SIFT [NH03]. The SNAP prediction is based on a neural network. SNAP considers different features such as the predicted secondary structure, predictions of solvent accessibility, protein family information, biochemical information, PSIC profiles [SER+99], SWISS-PROT residue annotation [BB92], predictions of SIFT [NH03], and predicted residue flexibility. SNAP is able to predict 80% of the non-neutral substitutions at 77% accuracy and 76% of the neutral substitutions at 80% accuracy.

1.3 Functional Hotspots

Protein contact areas are highly conserved and are called functional hotspots [BT98]. A method to uncover functional hotspots in vitro is alanine scanning [Wel91]. The high costs of this method make a systematic application to the whole proteome of an organism, or even of groups of organisms, difficult or impossible. Our pipeline provides the data for an alternative in silico approach based on SNAP2 [BR08].

1.4 Project focus

The focus of our project was to develop a cloud-based SNAP2 pipeline that can perform a full in silico mutagenesis efficiently. With the old pipeline, it was not possible to perform a full in silico mutagenesis of an entire human proteome (the collection of all proteins that are expressed) within reasonable time and budget constraints.

2 Parallel computing architecture

One solution for parallel computing is a batch system (queue system). Such systems provide a queue that contains jobs to be executed on different nodes; batch systems are the traditional cluster architecture. Jobs are submitted to the queue, which manages their prioritization. Two scenarios arise with this architecture: The first scenario occurs when no free slots are available at peak times. In this case, the queue and the processing time for a single request increase.
The second scenario occurs when many free slots are available because there are not enough requests to utilize the cluster; this should be avoided. These two cases illustrate the problem of optimally utilizing a cluster. Another recurring problem with a traditional cluster is that the storage system becomes the bottleneck, because of I/O limits or storage space limitations.

The problems mentioned above can be solved with a cloud architecture. We created a cloud image containing the needed tools, configuration, and databases. The utilization of the cloud cluster is controlled by a master node that can spawn and destroy instances of our image. The function of the master is to manage the cloud cluster and to receive jobs and forward them to the cluster. Compared to the traditional cluster, a cloud architecture solves the scaling problems as follows: In the first scenario, no free slots are available; the master can then create a new instance, so that the cluster scales up. This takes only a few minutes. In the second scenario, many free slots are available; the master can kill instances to scale the cluster down. Using the cloud approach, resource utilization is better than with a traditional cluster. Every result is saved in the cloud storage system (AWS S3). This is a key/value-based storage system that can be accessed over a RESTful API; its main advantage is the scalability of the storage. StarCluster [Ril10] is a tool developed on top of the EC2 API to rapidly build and manage clusters in the cloud. Using StarCluster makes it possible to create clusters quickly and easily; it takes only minutes to set up a cluster with a running Grid Engine.
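To make the master node's scaling decision concrete, the following is a minimal sketch of the kind of control step it could run. The thresholds, the image ID, and the helper signature are hypothetical illustrations, not the pipeline's published code; only the boto3 EC2 calls (run_instances, terminate_instances) are real AWS API methods.

```python
# Minimal sketch of the master node's scaling decision (hypothetical
# thresholds and image ID; the boto3 EC2 calls themselves are real AWS APIs).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
IMAGE_ID = "ami-0123456789abcdef0"   # hypothetical ID of the prepared cloud image

def rebalance(backlog: int, worker_ids: list) -> None:
    """One iteration of the master loop.

    backlog    -- jobs currently waiting in the queue
    worker_ids -- EC2 instance IDs of the running worker instances
    """
    if backlog > 10 * max(len(worker_ids), 1):
        # Scenario 1: no free slots -- spawn another instance of the image.
        resp = ec2.run_instances(ImageId=IMAGE_ID, MinCount=1, MaxCount=1,
                                 InstanceType="cc2.8xlarge")
        worker_ids.append(resp["Instances"][0]["InstanceId"])
    elif backlog == 0 and worker_ids:
        # Scenario 2: many free slots -- terminate an instance to scale down.
        ec2.terminate_instances(InstanceIds=[worker_ids.pop()])
```

In the real pipeline the backlog would come from the Grid Engine queue that StarCluster sets up (e.g., via qstat); the thresholds here are purely illustrative.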
2.1 Tool parallelization

To calculate more than 300 million mutations, our algorithms must be as fast as possible to achieve a runtime that can handle this amount of data. The tool should also run on a wide range of computers without the need for additional hardware. Therefore, the decision was made to accelerate our algorithm with SSE2 (Streaming SIMD Extensions 2) instructions, because this technology is available on most current computers.

2.2 Spot Market

Some cloud providers, like Amazon, have an option called spot pricing: a dynamic pricing scheme for infrastructure (e.g., cost per CPU). At each moment in time, the provider sets a price for each instance type on which users can bid; these instances are called spot instances. A user can specify the maximum amount that they are willing to pay for a spot instance. If the current spot price is below or equal to the amount specified by the user, the instance is assigned to the user. Otherwise, the user's instance is terminated, or the user does not get the instance at all, until the spot price falls to or below the user's bid. From the cloud provider's perspective, spot instances are a mechanism that allows them to sell un- or underutilized capacity at a discount, while retaining the right to reclaim their resources quickly if necessary.

The main problem with the spot market is this right to reclaim: because of it, the architecture of tools has to be resilient to failures at all times. The spot market is cheaper than a local cluster, but it is not guaranteed that spot resources are available at all times. Spot market resources can be used for large-scale parallel calculations, because when prices are low users can run many parallel instances. The fact that spot prices can increase at any time, and that Amazon can reclaim the instances, can lead to longer calculation times. With spot market prices, we were able to obtain four times as many CPU hours; but it is important to realize that the calculation time might rise as well because of price fluctuations.

3 Results

The first run, with the human reference proteome from NCBI, was done on the first of April. The database contains 29,036 sequences, the sum of all residues is 16,618,608, and the average sequence length is 572 residues. SNAP2 has to calculate 315,753,552 mutations. For this calculation, we used 20 Cluster Compute Eight Extra Large instances, i.e., 1,210 GB of memory, 1,760 EC2 Compute Units, and 320 real cores in total. At the spot market price of $0.54 per instance, the cluster cost $10.84 per hour. The run lasted 24 hours, resulting in a total price of $260.16. Without the spot market, the price would have been four times higher: $43.36 per hour, or $1,040.64 in total. Without the performance optimizations, the costs would have been around nine times higher still: $260.16 × 9 = $2,341.44 using the spot market, and $10,202.40 without it. The cloud run produced 180 GB of results. The resulting Gene Mutation Database (GeMuDB) is available at gemudb.com.
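The cost figures above follow directly from the cluster size, the spot price, and the run time; a few lines of Python reproduce them. Note that the quoted $0.54 spot price appears to be rounded from $0.542 per instance-hour, which is what yields the stated $10.84 per cluster-hour; the 4x on-demand factor and the 9x optimization factor are taken from the text.

```python
# Reproduces the cost arithmetic of the Results section from the figures
# given in the text (spot price taken as $0.542 per instance-hour, which
# the text rounds to $0.54).
instances = 20        # Cluster Compute Eight Extra Large instances
spot_price = 0.542    # $ per instance-hour on the spot market
hours = 24            # duration of the run

cluster_per_hour = instances * spot_price   # -> $10.84 per hour
total_spot = cluster_per_hour * hours       # -> $260.16 for the whole run
total_on_demand = 4 * total_spot            # -> $1,040.64 without the spot market
total_unoptimized = 9 * total_spot          # -> $2,341.44 without the SSE2 speed-up

print(f"${cluster_per_hour:.2f}/h, spot run: ${total_spot:.2f}, "
      f"on-demand: ${total_on_demand:.2f}, unoptimized: ${total_unoptimized:.2f}")
```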
4 Conclusion

At the beginning of this project, the price for one human full mutagenesis was over $10,000, even at spot market prices. Our new, improved pipeline solved the problem for under $300. This decrease was achieved by removing the bottleneck in our toolset and by using spot resources. The first run in the cloud was also a success: cloud computing was the right source of computing power, because the spot market price is very attractive and the cluster was highly utilized.

Cloud computing is one possible solution to address life science challenges. There is a noticeable trend among life science companies to use the cloud for their calculations and to share the results with their customers. The research community is more reserved, but interest in cloud computing is increasing, as can also be seen in the publication rates of papers about cloud computing. Prices, too, are constantly decreasing. A simple cost analysis revealed that our cloud run was actually cheaper than the run on the local clusters would have been.

Funding

This project was supported by an Amazon research grant [AWS11].

References

[AWS11] Amazon Web Services (AWS): AWS in Education research grant award.
[BB92] Amos Bairoch and Brigitte Boeckmann: The SWISS-PROT protein sequence data bank. Nucleic Acids Research, 1992.
[BR07] Yana Bromberg and Burkhard Rost: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Research, 2007.
[BR08] Yana Bromberg and Burkhard Rost: Comprehensive in silico mutagenesis highlights functionally important residues in proteins. Bioinformatics, 2008.
[BT98] A. A. Bogan and K. S. Thorn: Anatomy of hot spots in protein interfaces. Journal of Molecular Biology, 1998.
[Hec11] Maximilian Hecht: Improve predictions of functional effect of non-synonymous SNPs. 2011.
[Ing59] V. M. Ingram: Abnormal human haemoglobins. III. The chemical difference between normal and sickle cell haemoglobins. Biochimica et Biophysica Acta, 1959.
[NH03] P. C. Ng and S. Henikoff: SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research, 2003.
The REACTION Platform: Improving Long-term Management of Diabetes - Personalized Diabetes Therapy and Automatic Blood Glucose Control
Stephan Schaller, Michael Block, Thomas Eissing
Computational Systems Biology, Bayer Technology Services GmbH, Leverkusen, Germany

Abstract: Diabetes represents a major healthcare burden. It can cause many complications if the disease itself (i.e., blood glucose) and the associated risk factors (e.g., blood pressure and hyperlipidemia) are not adequately controlled. The REACTION project focuses on improving the long-term management of diabetes through the integration and development of wearable multi-parametric sensors (especially continuous glucose monitoring (CGM) sensors) that connect to an intelligent service platform for doctors, carers, patients, and scientists [2, 3]. The REACTION platform features an interoperable peer-to-peer communication platform based on a service-oriented architecture: all functionalities, including devices, are represented as services, and applications consist of a series of services orchestrated to perform a desired workflow. Various clinical applications can be executed for the monitoring of vital signs, context awareness, feedback to the point of care, integrative risk assessment [4], event and alarm handling, as well as integration with clinical and organizational workflows and external Health Information Systems [5, 6]. The aim is to assist healthcare professionals in hospital wards in improving the glycemic control of admitted patients with diabetes type 1 and type 2 using CGM and therapy feedback, and to support pro-active management to reduce the risk of developing long-term complications.

As a core component towards this goal, Bayer Technology Services developed a control algorithm, combining a computational kernel with closed-loop control concepts, to automate the delivery of optimal insulin doses. The detailed mechanistic modeling approach using physiology-based pharmacokinetic/pharmacodynamic (PBPK/PD) model kernels makes it possible to integrate detailed and specific knowledge of the physiological condition of the individual patient with diabetes. In general, the modeling platform developed by Bayer Technology Services facilitates PK/PD predictions at the level of predefined virtual populations as well as individuals, and aims to support the personalization of pharmacotherapy by means of individual dosing decisions [7, 8]. The necessary information is exchanged through a wireless Body Area Network to any available network infrastructure in the patient's surroundings. Other body and room sensors provide contextualization. Data are transmitted in a secure way to healthcare professionals, medical knowledge systems, and legacy Health Information Systems, and results are fed back to the point of care (Figure 1).

Figure 1: The REACTION platform concept
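To illustrate the closed-loop pattern the platform builds on, here is a deliberately simplified sketch: a CGM reading enters, a controller wrapped around a model kernel proposes an insulin dose, and the dose recommendation is returned to the point of care. All names, units, and the toy proportional rule are hypothetical; the actual REACTION/Bayer controller uses PBPK/PD model kernels and is far more sophisticated. This is illustrative only and of course not suitable for clinical use.

```python
# Toy sketch of a closed-loop glucose control service (illustrative only,
# NOT a medical algorithm; all names, units, and gains are hypothetical).
from dataclasses import dataclass

@dataclass
class PatientModelKernel:
    """Stand-in for an individualized PBPK/PD model kernel."""
    insulin_sensitivity: float  # assumed mg/dL drop per unit of insulin

    def dose_for(self, glucose_mg_dl: float, target_mg_dl: float) -> float:
        # Simple proportional rule standing in for model-based optimization.
        excess = max(glucose_mg_dl - target_mg_dl, 0.0)
        return excess / self.insulin_sensitivity

def control_step(kernel: PatientModelKernel, cgm_reading: float) -> float:
    """One pass of the loop: CGM reading in, recommended insulin dose out."""
    return kernel.dose_for(cgm_reading, target_mg_dl=110.0)

# Example: an individualized kernel responding to one CGM measurement.
kernel = PatientModelKernel(insulin_sensitivity=40.0)
print(f"recommended dose: {control_step(kernel, cgm_reading=190.0):.2f} IU")
```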
References:
1. Kovatchev, B.: Closed loop control for type 1 diabetes. BMJ, 2011.
2. REACTION: Remote Accessibility to Diabetes Management and Therapy in Operational healthcare Networks. 2012.
3. Spanakis, E. and F. Chiarugi: Diabetes Management: Devices, ICT Technologies and Future Perspectives. In: Wireless Mobile Communication and Healthcare, K. Nikita et al. (eds.), Springer Berlin Heidelberg, 2012.
4. Koumakis, L., et al.: Risk Assessment Models for Diabetes Complications: A Survey of Available Online Tools. In: Wireless Mobile Communication and Healthcare, K. Nikita et al. (eds.), Springer Berlin Heidelberg, 2012.
5. Holl, B., et al.: Design of a mobile, safety-critical in-patient glucose management system. Stud Health Technol Inform.
6. Ahlsén, M., et al.: Service-Oriented Middleware Architecture for Mobile Personal Health Monitoring. In: Wireless Mobile Communication and Healthcare, K. Nikita et al. (eds.), Springer Berlin Heidelberg, 2012.
7. Eissing, T., et al.: A computational systems biology software platform for multiscale modeling and simulation: integrating whole-body physiology, disease biology, and molecular reaction networks. Front Physiol, 2011.
8. Strougo, A., et al.: First dose in children: physiological insights into pharmacokinetic scaling approaches and their implications in paediatric drug development. J Pharmacokinet Pharmacodyn, 2012.
Analysis and interpretation of next-generation sequencing data in the cloud
Klaus Maisinger
Illumina UK, Chesterford Research Park, Little Chesterford CB10 1XL, United Kingdom

Abstract: The use and analysis of high-throughput or next-generation DNA sequencing data faces various challenges in translational applications, some of which are discussed here. BaseSpace [1] is Illumina's cloud-based computing platform. BaseSpace addresses some of the analysis challenges, such as the aggregation of samples from multiple sources and data management. BaseSpace is built on top of Amazon's web services (AWS) and will offer a web-based application programming interface (API) that allows the integration of a rich set of bioinformatics applications aiding in the biological interpretation of the data. Advances in data compression [4, 5] and in sequence alignment reduce the dependency on cluster or grid computing for completing the analysis of large sequencing data sets. Illumina's approaches to the biological interpretation of genetic data and to clinical reporting are introduced, as well as tools to facilitate the exploration and visualisation of personal genetic data and variants.

References:
1. https://basespace.illumina.com
4. Fritz, M. H.-Y., Leinonen, R., Cochrane, G., Birney, E.: Efficient storage of high throughput sequencing data using reference-based compression. Genome Research, 2011, 21(5).
5. Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V., Varghese, G.: Compressing Genomic Sequence Fragments Using SlimGene. Journal of Computational Biology, 2011, 18(3).
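The reference-based compression of [4] stores, for each aligned read, only its mapping position and its differences from a reference sequence rather than the full bases. The following is a minimal sketch of that idea with a toy encoding; it is not the actual CRAM or SlimGene format.

```python
# Toy illustration of reference-based read compression: store each aligned
# read as (position, length, substitutions) instead of its full sequence.
# This mimics the idea behind [4]; it is not the actual CRAM/SlimGene format.

def encode(read: str, pos: int, reference: str):
    """Encode a read aligned at `pos` as its differences from the reference."""
    subs = [(i, base) for i, base in enumerate(read)
            if reference[pos + i] != base]
    return (pos, len(read), subs)

def decode(record, reference: str) -> str:
    """Reconstruct the original read from the reference and the diff record."""
    pos, length, subs = record
    bases = list(reference[pos:pos + length])
    for i, base in subs:
        bases[i] = base
    return "".join(bases)

reference = "ACGTACGTACGTACGT"
read = "ACGAACG"                      # aligned at position 0, one mismatch
record = encode(read, 0, reference)
assert decode(record, reference) == read
print(record)                         # (0, 7, [(3, 'A')])
```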
Cloud services for the secondary use of healthcare data in industry and research
Philipp Daumke, Averbis GmbH, Freiburg

Abstract: Health service providers face the major challenge of improving the quality of healthcare and increasing patient safety while reducing the cost of healthcare services. The provision of routine healthcare data is a promising prerequisite: aggregated patient data can contribute to the identification of disease mechanisms; recruitment times for patients in clinical studies may be reduced; the monitoring of drug safety is improved through continuous monitoring; and plausibility and quality checks of medical treatment can be conducted efficiently and inexpensively.

Cloud4health provides highly scalable and cost-effective solutions for the secondary use of healthcare data. These data include both structured data (e.g., diagnoses, procedures, and laboratory data) and data in unstructured or semi-structured format (e.g., discharge summaries, pathology and radiology reports, medications). Relevant information is extracted from the unstructured data using text analysis technologies and standardized terminologies and ontologies, converting it into a standardized format and data structure (a minimal sketch of this step follows the application list below). This makes it possible to answer research questions on patient populations across different institutions and to execute complex analyses on these data.

Text analysis on large unstructured data sets poses special demands on computing capacity, and the cloud paradigm is a promising approach to dealing with these requirements. Our product solutions are designed as flexible and adaptable software applications, which can be run in public or private clouds, depending on the needs of the end users. Specific data protection requirements are addressed by technical and organizational principles that shall increase the confidence of users in cloud technologies and serve as a model for future projects in the area of patient-related data processing.

Cloud4health enables a variety of applications for public institutions and industry, especially for public and private hospitals, health insurers, and business enterprises in the fields of medical technology, biotechnology, and the pharmaceutical industry. The applications include:

- Retrospective studies: Cloud4health provides retrospective data extraction related to diseases or treatments and thus, looking back over several years, a fund of information for clinical studies or special registers.
- Plausibility checks: quality controllers have the opportunity to validate treatment and prescription by evaluating patient-related data.
- Pharmacovigilance: retrospective studies of patient data may indicate the effectiveness of newly approved drugs in additional application areas and may give hints of unwanted side effects.
- Patient recruitment: Cloud4health supports patient recruitment for clinical trials by matching inclusion and exclusion criteria against routine clinical data, so that the speed and productivity of the development and approval of new drugs, active substances, and medical devices can be increased significantly.
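As a toy illustration of the extraction step described above, the following sketch maps free-text findings onto a standardized terminology via a dictionary lookup. The mini-terminology and its codes are invented for the example; real pipelines such as Averbis's use far richer NLP and full ontologies (e.g., ICD or SNOMED CT).

```python
# Toy sketch of terminology-based information extraction from clinical free
# text; the mini-terminology and its codes are invented for this example.
import re

TERMINOLOGY = {                # surface form -> standardized code (hypothetical)
    "diabetes mellitus": "ICD-10 E11",
    "hypertension": "ICD-10 I10",
    "myocardial infarction": "ICD-10 I21",
}

def extract(text: str):
    """Return standardized records for every terminology hit in the text."""
    hits = []
    for term, code in TERMINOLOGY.items():
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append({"term": term, "code": code, "offset": match.start()})
    return hits

summary = "Patient with known Diabetes mellitus and untreated hypertension."
for record in extract(summary):
    print(record)
# {'term': 'diabetes mellitus', 'code': 'ICD-10 E11', 'offset': 19}
# {'term': 'hypertension', 'code': 'ICD-10 I10', 'offset': 51}
```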
The lecture will discuss the current status of the project, and preliminary results will be presented. The focus is on architecture, data privacy, and first prototypes.

References:
- Secondary uses of Electronic Health Record (EHR) data in Life Sciences. Deloitte (http://www.deloitte.com/view/en_us/us/industries/lifesciences/dc2b066f0f001210vgnvcm100000ba42f00aRCRD.htm)
- Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF: Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. Natural language processing and its future in medicine, IMIA Yearbook of Medical Informatics 2008; 47 (Suppl 1).
- PriceWaterhouseCoopers: Transforming healthcare through secondary use of health data (http://www.pwc.com/us/en/healthcare/publications/secondary-health-data.jhtml)
- Kim D, Labkoff S, Holliday SH: Opportunities for Electronic Health Record Data to Support Business Functions in the Pharmaceutical Industry: A Case Study from Pfizer, Inc. JAMIA 2008, Vol. 15, No. 5.
- Electronic Health Record (EHR) Data: Secondary uses of EHR data supporting post-launch activities. Deloitte (http://www.deloitte.com/view/en_us/us/industries/lifesciences/3ecb73a4d9a07210vgnvcm100000ba42f00arcrd.htm)

The abstract in German (English translation):
Health service providers face the major challenge of improving the quality of treatment in healthcare through innovations in research and development, increasing patient safety, and at the same time reducing the costs of healthcare services. The provision and statistical evaluation of routine clinical data for medical research is a promising prerequisite for this. Aggregated patient data can contribute to the identification of disease mechanisms. Recruitment times for patients in clinical studies are reduced, and the monitoring of drug safety is improved through continuous monitoring. Plausibility checks of medical practice can be carried out efficiently and inexpensively.

Cloud4health provides a highly scalable and cost-effective solution for the secondary use of routine clinical data. In cloud4health, semantic technologies are integrated into product solutions with which raw clinical data can be made available for secondary purposes. The raw data comprise both structured primary data (e.g., diagnoses, procedures, and laboratory data) and data in unstructured or semi-structured form (e.g., physicians' letters, pathology and radiology reports, medications). With the help of text analysis technologies and standardized terminologies and ontologies, relevant information is extracted from the unstructured data, converted into a standardized data format, and stored in structured form. This makes it possible to generate queries over entire patient populations across institutions and to perform analyses on these data.

Text analysis technologies applied to large, partly unstructured data volumes place special demands on computing capacity. The cloud paradigm is a very promising approach to meeting these demands. The product solutions are being developed as flexible and adaptable software applications that can, as needed, be run in public clouds or installed as a private cloud at the end users' sites. The specific data protection requirements are addressed through technical and organizational protection principles that will serve as a model for subsequent projects in the area of cloud-based processing of patient-related data and strengthen users' confidence in cloud technologies.

Cloud4health enables a variety of applications for public institutions and (medium-sized) industry, in particular for public and private hospitals, health insurers, and business enterprises in the fields of medical technology, biotechnology, and the pharmaceutical industry. The applications include:

- Retrospective studies: the HealthCloud enables retrospective data extraction on diseases or treatments and thus, looking back over several years, a fund of information for clinical studies or special registers.
- Plausibility checks: health insurers, associations of statutory health insurance physicians, and joint auditing bodies are given the opportunity to validate the indication-appropriate prescription of drugs by evaluating patient-related data.
- Pharmacovigilance: retrospective investigations of patient data can give hints of the effectiveness of newly approved drugs in additional application areas as well as of previously unrecognized unwanted side effects.
- Patient recruitment: cloud4health enables data-driven patient recruitment by matching the inclusion and exclusion criteria of clinical studies against routine clinical data, whereby the speed and productivity of the development and approval of new drugs, active substances, and medical devices can be increased significantly.
- Epidemiology and health services research: cloud4health makes it possible to generate queries over patient populations in hospital or departmental data holdings and to perform complex analyses on clinical data, such as cost-benefit analyses of therapies with drugs, vaccines, and physical measures (cures, diets), as well as of new diagnostic methods.

The lecture will discuss the current status of the project and present first project results. The focus is on architecture, data privacy, and first prototypes.
The need for a cloud-based IT infrastructure to efficiently collect, manage, store, share, and evaluate medical data
Titus Kühne, Deutsches Herzzentrum und Charité, Berlin

Advanced web technologies and remote cloud-based software tools offer the research community new opportunities for improving efficiency and quality in multi-site clinical trials and preclinical research. Such applications enable the collection, (quantitative) analysis, and sharing of numerical and imaging data in a standardized and validated manner. At the same time, any infrastructure supporting state-of-the-art research must support standards and allow for interoperability [2, 3]. There is also an ever-increasing demand to incorporate medical imaging, owing to the detailed information on anatomy, function, and pathology it provides: image-based information provides surrogate endpoints that, compared to solely clinical endpoints, can detect more subtle changes in pathologies related to specific treatments [4, 5]. In turn, surrogate endpoints improve the efficiency of the discovery process and save costs by reducing study duration and group size. However, research involving medical imaging data acquired at multiple sites is, at the moment, still very challenging, costly, and labour-intensive and, if conducted in a non-standardized fashion, also very error-prone [1, 7]. Dedicated information technology (IT) supporting this research is still lacking. There is an urgent need for a standardized IT infrastructure to efficiently collect, manage, store, and share imaging and numerical data between different research institutions. Ubiquitous access to the stored data as well as to advanced, ideally cloud-based analysis tools is of paramount importance for extracting accurate and relevant information from these imaging data and for combining it with other types of medical data.

References:
1. Sarikouch S. et al.: Nutzen telemedizinischer Netzwerke für die kardiovaskuläre Forschung: MR-Bildgebung angeborener Herzfehler als Beispiel. Der Kardiologe, 2010.
2. Ohmann C. et al.: Future developments of medical informatics from the viewpoint of networked clinical research. Interoperability and integration. Methods Inf Med, 48(1), 45-54, 2009.
3. Kuchinke W. et al.: Heterogeneity prevails: the state of clinical trial data management in Europe - results of a survey of ECRIN centres. Trials, 11(1), 79-89, 2010.
4. Ashton E.A.: Quantitative imaging in clinical trials. Applied Clinical Trials, Oct. 1, 2006.
5. Miller C.G.: Medical imaging and electronic data capture in clinical trials: the future paradigm. IPI, Spring Issue, 2009.
6. Vernooij M.W. et al.: Incidental findings on brain MRI in the general population. New England Journal of Medicine, Vol. 357, No. 18, 2007.
7. Hanß S. et al.: Integration of Decentralized Clinical Data in a Data Warehouse. Methods of Information in Medicine, 2009.
Clinical Studies, Good Clinical Practice, SOA, Grid, and Cloud
Jochen Dress, Zentrum für Klinische Studien, Köln

Clinical studies aim to establish the safety and the effectiveness of a new drug or a new therapy. Thus, clinical studies can be viewed as scientific experiments performed on humans. In consequence, such studies should always address relevant medical problems, and they need to follow the pertinent ethical guidelines. The amount and the quality of the data to be collected should provably suffice to find a statistically valid answer to the questions under consideration. The methodological and scientific requirements that clinical studies have to meet are detailed in the ICH Guidelines, Topic E6: Good Clinical Practice (GCP) [1], and the law requires that these guidelines be followed strictly [2].

To guarantee the quality of the collected data and to prevent or, at least, detect and identify as early as possible any problems that corrupt the data, many provisions should be applied. This starts with a good Study Protocol. In such a protocol, one has to define the primary question to be addressed, the safety parameters to watch out for, the actual data that are to be collected, when and in which way this will be accomplished, and how the data are to be analyzed statistically. In case digital data sources are used (e.g., electronic health records or digital lab system data), this has to be mentioned, and the tools that provide or generate such data have to be listed.

Based on the study protocol, Case Report Forms (CRFs) have to be designed and supplied for systematic and controlled data acquisition. Increasingly, this is done using web-based services [3]. The data gathered in the course of a study should be checked continuously, with regard to content as well as various formal aspects, in centralized as well as decentralized manners: e.g., the data collected by the sites should be systematically compared with the source data at the sites (Source Data Verification), and as part of the statistical analysis the collected data are checked again.
The appropriateness of the computerized systems used in this context needs to be demonstrated explicitly, relying on validation procedures according to the current state of the art in science and technology [4]. The same holds for the computerized systems that provide the source data. The sponsor [5] of a clinical study has to ascertain that the systems to be used meet his requirements with respect to applicability, availability, data protection, and data quality. The resulting challenges, and how to face them successfully by clever usage of SOA, grid, and cloud technology, will be discussed in the lecture.

References:
[1] International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH).
[2] Verordnung über die Anwendung der Guten Klinischen Praxis bei der Durchführung von klinischen Prüfungen mit Arzneimitteln zur Anwendung am Menschen (GCP-Verordnung, GCP-V) vom 9. August 2004.
[3] Still, in many instances and for good reasons, paper documents are being preferred. This will be addressed more closely in the lecture.
[4] The ICH-GCP Guidelines state: "When using electronic trial data handling and/or remote electronic trial data systems, the sponsor should: a) Ensure and document that the electronic data processing system(s) conforms to the sponsor's established requirements for completeness, accuracy, reliability, and consistent intended performance (i.e. validation). b) Maintain SOPs for using these systems. c) Ensure that the systems are designed to permit data changes in such a way that the data changes are documented and that there is no deletion of entered data (i.e. maintain an audit trail, data trail, edit trail). d) Maintain a security system that prevents unauthorized access to the data. e) Maintain a list of the individuals who are authorized to make data changes (see 4.1.5 and 4.9.3). f) Maintain adequate backup of the data. g) Safeguard the blinding, if any (e.g. maintain the blinding during data entry and processing). If data are transformed during processing, it should always be possible to compare the original data and observations with the processed data."
[5] The sponsor of a clinical study is legally responsible for the correct execution of the study, the safety of the patients, and the validity of the results. He is not necessarily the provider of the required capital.
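Requirement (c) above, an audit trail with no deletion of entered data, maps naturally onto an append-only data structure. The following is a minimal sketch of that idea; the class and field names are hypothetical and do not describe any specific electronic data capture product.

```python
# Minimal sketch of an append-only audit trail for CRF data (hypothetical
# design): values are never deleted or overwritten; every change is a new,
# attributed, timestamped entry, so the full history remains inspectable.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Change:
    field_name: str
    value: str
    user: str
    timestamp: datetime

@dataclass
class AuditedCRFField:
    field_name: str
    history: list = field(default_factory=list)  # grows only, never shrinks

    def record(self, value: str, user: str) -> None:
        """Append a new value; earlier entries remain for the audit trail."""
        self.history.append(Change(self.field_name, value, user,
                                   datetime.now(timezone.utc)))

    def current(self) -> str:
        return self.history[-1].value

systolic = AuditedCRFField("systolic_bp")
systolic.record("142", user="site_nurse_01")
systolic.record("124", user="site_nurse_01")    # correction; original kept
print(systolic.current(), len(systolic.history))  # -> 124 2
```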
Judging data sources and personalized prediction rules for clinical endpoints
Harald Binder
Institut für Medizinische Biometrie, Epidemiologie und Informatik, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Mainz, Germany

When gathering information on patients, from simple characteristics to high-dimensional molecular measurements, the aim will often be to personalize the prediction of the risk of future events, such as relapse in cancer patients, or death. In this context, optimal prediction performance is wanted, as well as an interpretable prediction rule that potentially combines different data sources. Prediction performance might then be useful for critically judging the added value of individual sources, potentially avoiding costly measurement or retrieval. As examples, we consider two approaches, a classical statistical model and a machine learning approach, that combine gene expression measurements and other information to predict survival for diffuse large B-cell lymphoma patients. Potential pitfalls of overoptimistic judgment of prediction performance are indicated, and techniques for avoiding them are illustrated. Stable selection of individual patient characteristics for prediction rules is considered as a second criterion, which is not automatically obtained together with good prediction performance. Both evaluating prediction performance and evaluating stability result in considerable computational demand. We illustrate that compute clusters and cloud solutions are well suited to these tasks, due to the straightforward parallelization of the algorithms (see the sketch following the references). Thus, cloud solutions are seen to enable the comprehensive evaluation and selection of different data sources for personalized prediction of clinical outcomes.

References
1. Binder H, Porzelius C, Schumacher M: An overview of techniques for linking high-dimensional molecular data to time-to-event endpoints by risk prediction models. Biometrical J, 2011; 53.
2. Sauerbrei W, Boulesteix A-L, Binder H: Stability investigations of multivariable regression models derived from low- and high-dimensional data. J Biopharm Stat, 2011; 21.
3. Porzelius C, Schumacher M, Binder H: The benefit of data-based model complexity selection via prediction error curves in time-to-event data. Computational Statistics, 2011; 26.
4. Binder H, Schumacher M: Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics, 2008; 9.
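The parallelization referred to above is of the embarrassingly parallel kind: each resampling replicate (e.g., a bootstrap fit plus its out-of-sample evaluation) is independent, so replicates can be farmed out to cluster or cloud nodes. A minimal single-machine sketch with Python's standard library follows; the "fitted rule" is a toy stand-in for an actual risk prediction model.

```python
# Sketch of embarrassingly parallel resampling-based evaluation of a
# prediction rule; fit_and_score stands in for fitting a real risk model
# and computing its prediction error on the out-of-sample data.
import random
from concurrent.futures import ProcessPoolExecutor
from statistics import mean

random.seed(0)  # fixed seed so worker processes re-create the same toy data
DATA = [(random.gauss(0, 1), random.random() < 0.5) for _ in range(500)]

def fit_and_score(seed: int) -> float:
    """One resampling replicate: draw a bootstrap sample, 'fit', and score."""
    rng = random.Random(seed)
    sample = [rng.choice(DATA) for _ in DATA]      # bootstrap sample
    threshold = mean(x for x, _ in sample)         # toy 'fitted' rule
    out_of_sample = [d for d in DATA if d not in sample]
    errors = [(x > threshold) != y for x, y in out_of_sample]
    return mean(errors) if errors else 0.0

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:            # replicates run in parallel
        scores = list(pool.map(fit_and_score, range(100)))
    print(f"estimated prediction error: {mean(scores):.3f}")
```

On a cluster or in the cloud, pool.map would simply be replaced by whatever job distribution the platform offers; nothing in the per-replicate logic changes.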
Patent Protection of Software and Diagnostic Methods in the Field of Personalized Medicine
Andreas Gagidis, dompatent, Köln

Many people erroneously think that software in the field of medicine, and diagnostic methods, cannot be patented. In fact, it is possible to obtain patent protection for software and diagnostic methods if certain requirements are fulfilled. The presentation will explain which aspects in the field of personalized medicine can be protected by patents and which requirements have to be considered in order to increase the chance of obtaining patent protection. Practical advice will be given on how to define patent claims so as to increase the chance of obtaining a patent.
The Challenge of Analysing Genome and Proteome Data
Andreas Dress
CAS-MPG Partner Institute for Computational Biology, The Shanghai Institutes for Biological Sciences

Abstract: It is well known that the sheer amount of data generated by NGS and other omics technologies easily outpaces our capacity to properly exploit all these data. In my lecture, I will discuss how modern tools for connecting distributed data across the web (cf. [Lnk12]), for comparative sequence analysis (cf. [Sem03, Dre11]), and for pattern classification (cf. [Apo10]) may contribute to facing this problem of extracting "information" from huge data sets. In this context, the work presented in [Wan05, Alm10, Deu11] seems to be particularly relevant.

References:
[Sem03] Semple C, Steel M: Phylogenetics. Oxford University Press, 2003.
[Wan05] Wang X, Gorlitsky R, Almeida JS: From XML to RDF: How Semantic Web Technologies Will Change the Design of 'Omic Standards. Nature Biotechnology, 2005 Sep; 23(9).
[Alm10] Almeida JS, Deus HF, Maass W: S3DB core: a framework for RDF generation and management in bioinformatics infrastructures. BMC Bioinformatics, 2010 Jul 20; 11(1):387.
[Apo10] Apostolico A, Denas O, Dress A: Efficient tools for comparative substring analysis. J Biotechnol, 2010 Sep 1; 149(3).
[Deu11] Deus HF, Correa MC, Stanislaus R, Miragaia M, Maass W, de Lencastre H, Fox R, Almeida JS: S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC Bioinformatics, 2011; 12:285.
[Dre11] Dress A, Huber K, Koolen J, Moulton V, Spillner A: An Introduction to Phylogenetic Combinatorics. Cambridge University Press, 2011.
[Lnk12] Linked Data - Connect Distributed Data across the Web, http://linkeddata.org/
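As a small illustration of the linked-data direction of [Lnk12, Wan05], the sketch below expresses two omics facts as RDF triples using the rdflib library; the namespace and identifiers are invented for the example.

```python
# Tiny linked-data sketch with rdflib: two omics facts as RDF triples.
# The namespace and identifiers are invented for this example.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/bio/")   # hypothetical vocabulary
g = Graph()
g.bind("ex", EX)

g.add((EX.TP53, RDF.type, EX.Gene))
g.add((EX.TP53, EX.encodesProtein, EX.P53))
g.add((EX.P53, EX.hasFunction, Literal("tumor suppression")))

# Once data sit in a shared graph model, they can be queried across sources:
for gene, _, protein in g.triples((None, EX.encodesProtein, None)):
    print(f"{gene} encodes {protein}")

print(g.serialize(format="turtle"))
```

In the spirit of [Wan05], triples like these could then be published and joined with triples exposed by other institutes, which is exactly the kind of web-scale data connection the lecture refers to.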