Brainware as a Factor for Energy-Efficient HPC
Christian Bischof, Dieter an Mey, Christian Terboven
christian.bischof@tu-darmstadt.de - HRZ, TU Darmstadt
{anmey, terboven}@rz.rwth-aachen.de - RZ, RWTH Aachen
20.09.2012, ZKI AK SC, Universität Düsseldorf
Rechen- und Kommunikationszentrum (RZ)
Motivation
Definition of Green HPC from insidehpc.com: "Design and management techniques that contribute to the responsible, effective use of energy in the operation of high performance computing centers and equipment."
But: the current situation hardly allows for an economical optimization of the total budget.
Different budgets for:
- Staff
- Hardware (mainly through applications every X years), maintenance, power
- Building (mainly through applications, once per decade?!)
Users in general don't pay for compute resources.
Agenda
- Cost of Brainware versus Hardware
- Tuning Opportunities
- Success Stories
- Summary
Cost of Brainware versus Hardware
Understanding the Total Cost of Ownership
Assumptions:
- 2 Mio € hardware investment per year
- 5 years lifetime with 4 years maintenance through the vendor
- Power: 850 kW, PUE = 1.5, 0.14 € per kWh => ~1.5 Mio € per year
- ISV software provided by users
- Commercial batch system
- Free Linux distribution

Costs per year:
  Building (5 Mio € / 25 y)           200,000 €     3.72%
  HPC software                         50,000 €     0.93%
  ISV software                              0 €     0.00%
  Batch system                        100,000 €     1.86%
  Linux                                     0 €     0.00%
  Power                             1,500,000 €    27.93%
  Office space                              0 €     0.00%
  Staff (12 FTE)                      720,000 €    13.41%
  Hardware maintenance                800,000 €    14.90%
  Investment compute servers        2,000,000 €    37.24%
  Sum of costs                      5,370,000 €   100.00%
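The power line item can be checked with a one-line calculation from the stated assumptions (850 kW average draw, PUE of 1.5, 0.14 € per kWh); round-the-clock operation over a full year is assumed here:

```latex
% Annual electricity cost under the slide's assumptions (8760 h of operation per year):
850\ \mathrm{kW} \times 1.5\ (\mathrm{PUE}) \times 8760\ \mathrm{h/a} \times 0.14\ \text{€/kWh}
  \approx 1{,}563{,}660\ \text{€/a} \approx 1.5\ \text{Mio.\ €/a}
```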
Does it pay off to hire more HPC experts?
Start tuning the top user projects first:
- 15 projects account for 50% of the load
- 64 projects account for 80% of the load
Assumptions:
- It takes 2 months to tune one project
- One analyst can handle 5 projects per year
- A project profits for 2 years
- As a consequence, one HPC expert can take care of 10 projects at a time
- One FTE costs 60,000 € per year
- Tuning can improve the code by 5, 10 or 20 percent
[Chart: accumulated usage of top accounts (excl. JARA-HPC)]
Does it pay off? Yes!
For example, break-even point: 7.5 HPC analysts improve the top 75 projects by 10% (TCO is 5.3 Mio €/y).
10 projects handled by one FTE (60,000 €/y).
[Chart: ROI [€] versus number of tuned projects, with savings curves for 5%, 10% and 20% improvement]
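The reasoning behind the break-even chart can be reproduced with a small model. The sketch below, in C, uses the figures from the previous slides (TCO of about 5.37 Mio €/y, 60,000 € per FTE, 10 projects per FTE) and an assumed cumulative load-share curve fitted to the two stated data points (50% for the top 15 projects, 80% for the top 64); the curve, the loop ranges and all names are illustrative and not taken from the talk.

```c
/* Back-of-the-envelope ROI model for hiring HPC analysts.
 * Compile with: cc -O2 roi.c -lm
 */
#include <stdio.h>
#include <math.h>

#define TCO_PER_YEAR     5370000.0  /* total cost of ownership, EUR/year            */
#define FTE_COST           60000.0  /* one analyst, EUR/year                        */
#define PROJECTS_PER_FTE       10   /* 5 projects/year, each benefit lasts 2 years  */

/* Assumed cumulative share of machine load used by the top n projects:
 * a power law fitted to the two values stated on the previous slide
 * (about 50% for the top 15 and 80% for the top 64 projects). */
static double load_share(int n)
{
    double s = 0.208 * pow((double)n, 0.324);
    return s < 1.0 ? s : 1.0;
}

int main(void)
{
    for (int n = 10; n <= 160; n += 10) {
        double analysts = (double)n / PROJECTS_PER_FTE;
        double cost     = analysts * FTE_COST;
        for (int imp = 5; imp <= 20; imp *= 2) {      /* 5%, 10%, 20% improvement */
            double savings = (imp / 100.0) * load_share(n) * TCO_PER_YEAR;
            printf("top %3d projects, %2d%% improvement: ROI = %10.0f EUR/year\n",
                   n, imp, savings - cost);
        }
    }
    return 0;
}
```

With these assumptions the 10% curve stays positive up to roughly 75 tuned projects (7.5 analysts) and turns negative beyond that, which matches the break-even point quoted above.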
Tuning Opportunities
Opportunities for Tuning w/o Code Access
Sanity check:
- Use hardware counters to measure performance and to check for performance anomalies
- I/O behavior
- System call statistics
Hardware:
- Choose the optimal hardware platform
- File system, I/O parameters
Parameterization (see the placement sanity check sketched after this list):
- Choose the optimal number of threads / MPI processes
- Thread / process placement (NUMA)
- Mapping the MPI topology to the hardware topology
- MPI parameterization (buffers, protocols)
- Optimal libraries (MKL, ...)
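As a concrete example of the placement checks above, here is a minimal sketch (assuming Linux and an OpenMP compiler; sched_getcpu() is a glibc extension) that an analyst can run next to the real application to verify where threads actually land before adjusting the affinity settings (for example KMP_AFFINITY for the Intel runtime or GOMP_CPU_AFFINITY for GCC):

```c
/* Placement sanity check: each OpenMP thread reports the logical CPU it
 * currently runs on. Compile with: gcc -fopenmp check_placement.c      */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu() */
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("thread %2d of %2d runs on CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```

Running it with different OMP_NUM_THREADS and affinity settings quickly shows whether threads are spread across sockets or piled onto one NUMA node.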
Opportunities for Tuning w/ Code Changes
Cache tuning:
- Padding, blocking, loop-based optimization techniques
- Inlining/outlining, helping the compiler to perform optimizations
MPI optimization:
- Avoid global synchronization
- Hide / reduce communication overhead, non-blocking communication
- Coalesce communications
OpenMP optimization:
- Extend parallel regions
- Check for false sharing
- NUMA optimization: first touch, migration (see the sketch after this list)
In vogue: add OpenMP to an MPI code to improve scalability.
Of course: choosing the optimal algorithm is crucial, to be handled by or with the domain expert.
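One of the items above, the NUMA first-touch optimization, is small enough to show in full. The following is a minimal sketch (array names and sizes are illustrative, not taken from any code mentioned in this talk): memory pages end up on the NUMA node of the thread that first writes them, so the initialization loop is parallelized with the same static schedule as the compute loop.

```c
/* NUMA first-touch sketch: initialize data with the same thread/schedule
 * that later computes on it, so pages land in the right NUMA node.     */
#include <stdlib.h>

#define N (64 * 1024 * 1024)

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* Parallel first touch: a serial initialization loop would place all
     * pages on the master thread's NUMA node instead.                   */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
    }

    /* Compute loop with the same static schedule: each thread works
     * mostly on pages located in its own NUMA node's memory.            */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + a[i];

    free(a);
    free(b);
    return 0;
}
```

Skipping the parallel initialization typically costs a large fraction of the achievable memory bandwidth on a two-socket node, which is exactly the kind of gain an analyst can recover without touching the numerics.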
Opportunities for Tuning w/o Code Changes
Development environment:
- Choose the optimal compiler
- Choose optimal compiler options
- Autoparallelization
- Compiler profile / feedback
Adapt the dataset:
- Partitioning / blocking
- Load balancing
This list is not intended to be exhaustive, but rather to illustrate that the skill set of an HPC tuning expert is very different from that of an application scientist who develops a program; both skill sets are needed.
Interdisciplinary collaboration: HPC expert, domain expert, numerical expert, ... (SimLabs)
Teaching via Courses and Workshops
Selected courses and workshops in Aachen since 2011:
  03/2011     Parallel Programming in CES (en), 1 week, 75 part.
  05/2011     Visual Studio 2010 + Windows HPC Server workshop, 25 selected part.
  10/2011     AIXcelerate Tuning Workshop (en) (with Intel), 25 selected part.
  12/2011     Parallel Programming with MATLAB, 35 part.
  03/2012     Parallel Programming in CES (en), 1 week, 55 part.
  08-09/2012  Parallel Programming Summer Courses (en): MPI, OpenMP, Tools, ...
  10/2012     Planned: Tuning for bigsmp HPC Workshop (en)
  10/2012     Planned: OpenACC Workshop (en)
  11/2012     Planned: Technical Cloud Computing with Microsoft Azure (en)
Success Stories
HPC Consulting may save real money: Combustion of Biofuels
Primary breakup for diesel sprays: hybrid CDPLit
- Adding a single OpenMP-parallelized kernel improves efficiency by approx. 10%
- Turns into a cost reduction equivalent to one FTE/yr
- Human effort: ~7 weeks
[Chart: runtime for the small test data set versus number of nodes (1 to 64), for 8 PPN / 1 TPP, 4 PPN / 1 TPP and 4 PPN / 2 TPP]
Cluster of Excellence Tailor-Made Fuels from Biomass, Inst. f. Combustion Technology, RWTH Aachen University
Green IT = Using Resources More Efficiently
Code / impact / partner:
- FLOWer (RWTH Laboratory of Mechanics): 1.8x higher efficiency of the hybrid Navier-Stokes solver simulating the landing of a space launch vehicle, by adding autoparallelization to MPI and carefully assigning threads to processes to adjust load imbalances.
- Matlab (RWTH Institute of Physical Chemistry): 150x speed-up of the numerical solution of the diffusion equation by extracting the compute-intense kernel, transforming it to Fortran code, plus careful cache tuning.
- Genehunter (Inst. f. Medical Biometry, Computer Science and Epidemiology (IMBIE) Bonn): 14x speed-up through cache optimization plus scalable MPI parallelization of linkage analysis to identify genes which may cause diseases.
- Dynmatt (RWTH Institute of Steel and Light Alloy Building): 33x speed-up through I/O optimization by implementing appropriate buffering and reducing metadata operations.
- FIRE (RWTH Chair of Computer Science 6): ~100x speed-up of image recognition software on a large SMP by nested parallelization, which saves a lot of I/O.
- NestedCP (Virtual Reality Center Aachen): 10-50x speed-up for critical point extraction in flow simulation output data through nested parallelization with OpenMP, even with highly imbalanced work chunks.
- TFS (RWTH Aerodynamic Institute, Parallel Software Products): 20x speed-up for the simulation of human nasal flow for computer-aided surgery through nested parallelization with OpenMP.
Note: higher sophistication of parallelization leads to higher scalability, but does not save resources.
HECToR: Distributed CSE Success Stories
  Code            Domain                          Effect                              Effort    Saving
  CASTEP          Key Materials Science           4x speed and 4x scalability         8 PMs     320k-480k (p.a.)
  NEMO            Oceanography                    Speed and I/O performance           6 PMs     95k (p.a.)
  CASINO          Quantum Monte-Carlo             4x performance and 4x scalability   12 PMs    760k (p.a.)
  CP2K            Materials Science               12% speed and scalability           12 PMs    1500k (in total)
  GLOMAP/TOMCAT   Atmospheric Chemistry           15% performance                     ?
  CITCOM          Geodynamic Thermal Convection   30% performance                     ?         significant
  EBL             Fluid Turbulence                4x scalability                      12 PMs
  ChemShell       Catalytic Chemistry             8x performance                      9 PMs
  Fluidity-ICOM   Ocean Modelling                 Scalability                         ?
  DL_POLY_3       Molecular Dynamics              20x performance                     6 PMs
  CARP            Heart Modelling                 20x performance                     8 PMs
(PMs = person-months)
HPC Consulting may save real money: Hydro-Dynamics with XNS
XNS (M. Behr, CATS, RWTH): simulation of the hydro-dynamic forces of the Ohio dam
- Parallelized with MPI
- Additional OpenMP parallelization: 9 parallel regions
- Human effort: ~6 weeks
- Scales very well for the larger case
- 20-40% improvement
[Chart: improvement of execution time in percent, best-effort MPI-only versus best-effort hybrid, over 1 to 64 compute nodes; higher is better. Nehalem EP cluster (2 processor chips, 4 cores each) with InfiniBand QDR]
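The XNS work follows the pattern from the tuning slide above: keep the existing MPI decomposition and add OpenMP inside each rank. A minimal, generic skeleton of such a hybrid setup is sketched below; the toy reduction loop and all names are illustrative and not taken from XNS.

```c
/* Minimal hybrid MPI + OpenMP skeleton: MPI ranks across nodes,
 * OpenMP worksharing inside each rank.                           */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* MPI_THREAD_FUNNELED: only the master thread of each rank calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n_local = 1000000;               /* elements owned by this rank */
    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP threads share the work of one MPI rank (one "PPN x TPP"
     * combination in the charts below).                                */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < n_local; i++)
        local_sum += 1.0 / (double)(rank * n_local + i + 1);

    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%f\n",
               size, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

The PPN/TPP combinations in the following charts correspond to how many such ranks are started per node and how many OpenMP threads each rank uses.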
XNS: How much efficiency do we want to sacrifice? Execution time
[Chart: execution time [sec] versus number of compute nodes (1 to 64) for various combinations of PPN (processes per node) and TPP (threads per process); lower is better]
XNS: How much efficiency do we want to sacrifice? Efficiency
[Chart: parallel efficiency [%] versus number of compute nodes (1 to 64) for the same PPN/TPP combinations; higher is better]
XNS: How much efficiency do we want to sacrifice? Improvement versus efficiency
[Chart: parallelization efficiency (best effort) and relative improvement of the hybrid version (best effort) versus number of compute nodes (1 to 64); the relative improvement ranges from about -16% to +59%]
Summary
Summary
- Higher investment in brainware pays off, if only for tuning.
- A university can save real money by investing in brainware rather than in electricity for inefficiently used hardware.
- HPC performance analysts are a rare species:
  - They need knowledge of hardware, tools, programming languages, compiler technologies and paradigms, (algorithms), OS effects.
  - It takes some time to hire someone and get him/her up to speed.
  - Team work: more different brains create more synergy.
- And now there are GPUs; they have much more headroom for tuning.
The End, and an Invitation
German Heterogeneous Computing Group (GHCG)
- Independent interest group around high-performance computing with accelerators in the German-speaking region
- Goal: intensify the technical and scientific exchange on projects, hardware and algorithms
User group meeting:
- Date: October 1-2, 2012
- Location: Braunschweig (Haus der Wissenschaft)
- Topics (among others): news in hardware and software; CFD on heterogeneous architectures
Register and participate (free of charge)! www.ghc-group.org (registration & further information). Everyone is welcome!