Lessons learned in Big Data projects with Hadoop. Dominik Benz, Inovex GmbH, 2014/09/25, Java User Group Hessen

Transcript


2 Motivation: Big is beautiful! Class A extends Mapper; ROI, $$, ...; apt-get install. "Big Data is like teenage sex: everyone talks about it, only few really do it, and those who are doing it are doing it wrong."

3 Who is who: 150 employees, ~12 Big Data engineers (LoB Business Intelligence); Karlsruhe, Köln, Pforzheim, München; Hadoop projects since 2009 (1&1, Prosieben/Sat1, ...)

4 Navigation: application scenario; setup / deployment; raw data: acquisition / storage; analysis / integration; processing / workflows

5 Use case: multi-channel analytics. Diagram elements: behavioral data (clicks, visits, views, ...), enrichment (product data, AGOF, social media, ...), aggregation, KPI

6 Architecture: Prosieben/Sat1. Diagram layers: Reporting, Analyses, DWH, SQL Access, Data Hub, DWH Stage, Ingestion. Sources: Conviva, User Serv., AGOF, Nielsen, Pro Dia

7 Architecture: 1&1

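8 Projects: sizing (1&1, Prosieben/Sat1, Otto)
Source systems: not preserved in the transcript
Events per day: 1&1 ~1 billion impressions; Prosieben/Sat1 ~1 million unique users, ~5 million video views; Otto 26 million
Data volume per day (raw): 1&1 1 TB; Prosieben/Sat1 100 GB; Otto 185 GB
Data nodes: 1&1 24 / 8 slots; Prosieben/Sat1 8 / 20 slots; Otto 20 / 24 cores
HDFS capacity (gross): 1&1 500 TB; Prosieben/Sat1 250 TB; Otto 220 TB
MR version: 1&1 MR1; Prosieben/Sat1 MR1; Otto YARN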

9 Hadoop ecosystem. Diagram areas: Applications and Analytics; Batch Processing & Storage; Server Systems Management; Transport & Speed

10 Navigation: multi-channel analytics, hybrid DWHs; setup / deployment; raw data: acquisition / storage; analysis / integration; processing / workflows

11 Hadoop: which Hadoop? Use a distribution! The decision often depends on the customer (integration, administration, ...). Cloudera/CDH: Impala, Cloudera Navigator, ...; Hortonworks: Hive, Windows integration, ...; MapR: security, ... Good experience with CDH3-5. Start with open source!

12 Deployment: tools / configuration. Deployment and configuration management: CDH: Cloudera Manager; Hortonworks: Ambari; MapR: MapR Control System; vendor-independent: Rex, Puppet. Use an identical configuration for the dev and prod clusters! If possible, also for the local development VM. Currently being evaluated: Vagrant

13 Setup: lessons learned. Use a distribution! Start with open source! Consider the environment and expertise at the customer!

14 Navigation: multi-channel analytics, hybrid DWHs; setup: distribution, OSS, customer; raw data: acquisition / storage; analysis / integration; processing / workflows

15 Raw data: import into HDFS. Plain / Hive / ETL tool: LOAD DATA LOCAL INPATH (flat files only); hadoop fs -copyFromLocal; output steps, e.g. from Pentaho (HDFS write). Flume: sources (HTTP, Netcat, Spooling Directory, ...), channels (Memory, File, JDBC, ...), sinks (HDFS, HBase, Thrift, ...); data preparation via interceptors. Memory Channel: not fault-tolerant; File Channel: rather slow (throughput ~5 MB/s)

16 Raw data: storage formats. Plain (CSV, ...); Avro (JSON schema in the file, binary content); Parquet (column-oriented, various compression codecs)
Size in MB by format: Plain 481; RCFile 428; Parquet 86; Parquet + Snappy and Parquet + gzip (values missing in the transcript)

17 Raw data: schema management. Growing number of tables, dependencies, releases: the schema needs managing. Automated schema deployment with parametrized DDLs: 0001-init-db.hql, 0002-clicklog.hql, 0003-products.hql, 0004-users.hql, 0005-clicklog-alter.hql. Diagram: Schemadeploy; current state / changelog (Hive Metastore DB); apply / rollback (Hive CLI / Beeline)
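
The apply step of such a numbered-script scheme can be sketched in plain Java. This is a minimal illustration and not the Schemadeploy tool from the slide: the class `SchemaDeploySketch`, its methods, and the `${db}` placeholder syntax are assumptions made for the example, and the changelog is modeled as a simple set of script names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not the tool from the talk. */
final class SchemaDeploySketch {

    /**
     * Returns the scripts that are not yet recorded in the changelog,
     * sorted by file name so that the numeric prefix defines the order.
     */
    static List<String> pending(List<String> available, Set<String> applied) {
        List<String> result = new ArrayList<>();
        for (String script : available) {
            if (!applied.contains(script)) {
                result.add(script);
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Replaces ${name} placeholders (assumed syntax) in a parametrized DDL. */
    static String render(String ddl, Map<String, String> params) {
        String result = ddl;
        for (Map.Entry<String, String> param : params.entrySet()) {
            result = result.replace("${" + param.getKey() + "}", param.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> scripts = Arrays.asList(
                "0001-init-db.hql", "0002-clicklog.hql", "0003-products.hql");
        Set<String> changelog = new HashSet<>(Arrays.asList("0001-init-db.hql"));
        System.out.println(pending(scripts, changelog));
        System.out.println(render("CREATE DATABASE IF NOT EXISTS ${db}",
                Collections.singletonMap("db", "dev")));
    }
}
```

A real deployment would additionally execute each rendered script through Hive CLI or Beeline and record it in the changelog; a rollback would need a matching undo script per step.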

18 Security: who may do what? Standard control mechanisms: HDFS file permissions, ACLs (HDFS2), Kerberos, grants (Hive 0.14). Distribution-dependent and inconsistent: CDH: Sentry/Rhino (Hive/Impala per table); Hortonworks: Knox, Argus (gateway); MapR: volumes / namespaces. DIY security: one cluster per tenant (automated deployment / setup); one (Hive) Metastore per tenant

19 Raw data: lessons learned. Dirty data early! Parquet! Schema management! DIY security!

20 Navigation: multi-channel analytics, hybrid DWHs; setup: distribution, OSS, customer; raw data: early, Parquet, manage schema, DIY security; analysis / integration; processing / workflows

21 Processing: workflow management. Job control / orchestration. Oozie: workflow: DAG of actions (Hive, MR, ...); coordinators: time- or data-driven workflow triggers; bundles: set of coordinators. ETL tool / cron: the control flow is defined e.g. in a Kettle job (Pentaho); Hive queries run via the JDBC interface; the ETL jobs are triggered by cron. Critical here: log management (Oozie > Hive > MR), Oozie templating!

22 Development: test-driven! Diagram labels: poll; build artifacts; deploy.sh; admin server; starts jobs; DEV cluster; based on data; starts tests; DB setup from scratch, generate test data

23 Development: test-driven!

24 Development: test-driven! Seamless integration into Hadoop (dev) environments; fixtures for Pig, Hive, Oozie, HDFS; wrappers around the respective Java APIs; lightweight (standalone server); natural-language test syntax:
script Hadoop
upload viewlog.csv to hdfs /testdata/
hadoop job from jar viewlog.jar [...]
check number of output files 3

25 TDD: complete round trip. Diagram labels: business side; scenario (purchase, ...); Firefox + Selenium IDE; Selenium scripts; replay scripts; Xebium + Browsermob Proxy; tag requests; tagged log data; data processing; starts processing; compares (intermediate / final) results; KPIs

26 Processing: choosing your weapons. Transformation / aggregation via plain MR, plain Hive; Hive + UDFs; ETL tools (Morphlines, Pentaho, ...); Spark. Compared by required coding skills and flexibility: MR jobs, Spark, Hive UDFs, Morphlines, graphical ETL (depends on tool). No monoculture necessary (YARN!)

27 Processing: Hive UDFs
public class MyUDF extends GenericUDF {
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // check signature and arguments
    }
    public Object evaluate(DeferredObject[] args) throws HiveException {
        // row-wise processing of the data
    }
}

28 Processing: Hive UDFs, continued. Besides UDFs there are also UDTFs and UDAFs. Instead of several (more complex) UDAFs: pre-aggregate the GROUP BY elements into a map / struct and pass it to standard UD(T)Fs:
SELECT parse_session(agg.session)
FROM (SELECT to_map(s.timestamp, array(s.useragent, s.status, ...)) AS session
      FROM logtable s
      GROUP BY s.session_id) agg
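
The pre-aggregation idea can be mimicked in plain Java without Hive. In this sketch, `toSessionMaps` plays the role of GROUP BY session_id combined with to_map, and `parseSession` stands in for the parse_session UDF. What parse_session really computes is not stated on the slide; the session duration used here is an assumption, as are all class and method names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SessionPreAggregationSketch {

    /** One log row; the fields mirror the columns named in the query. */
    static final class LogRow {
        final String sessionId;
        final long timestamp;
        final String userAgent;
        final int status;

        LogRow(String sessionId, long timestamp, String userAgent, int status) {
            this.sessionId = sessionId;
            this.timestamp = timestamp;
            this.userAgent = userAgent;
            this.status = status;
        }
    }

    /** Groups rows by session id into one timestamp-ordered map per session. */
    static Map<String, TreeMap<Long, String[]>> toSessionMaps(List<LogRow> rows) {
        Map<String, TreeMap<Long, String[]>> sessions = new LinkedHashMap<>();
        for (LogRow row : rows) {
            sessions.computeIfAbsent(row.sessionId, id -> new TreeMap<>())
                    .put(row.timestamp, new String[] {row.userAgent, String.valueOf(row.status)});
        }
        return sessions;
    }

    /** Assumed semantics: duration between first and last event of a session. */
    static long parseSession(TreeMap<Long, String[]> session) {
        return session.isEmpty() ? 0L : session.lastKey() - session.firstKey();
    }

    public static void main(String[] args) {
        List<LogRow> rows = Arrays.asList(
                new LogRow("a", 100L, "firefox", 200),
                new LogRow("b", 5L, "chrome", 404),
                new LogRow("a", 160L, "firefox", 200));
        for (Map.Entry<String, TreeMap<Long, String[]>> e : toSessionMaps(rows).entrySet()) {
            System.out.println(e.getKey() + " -> " + parseSession(e.getValue()));
        }
    }
}
```

The point of the pattern is that one generic aggregation builds the per-session structure, and any number of simple row-level functions can then be applied to it.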

29 Processing: Morphlines. Java-centric ETL solution from Cloudera, part of the kite-sdk. Transformations are defined in JSON (HOCON). Data model: Record (key, [value1, value2, ...]). Processing happens in commands: process(record). Command library for standard tasks (parsing, date handling, ...), easy to extend. Can (also) be called directly from an MR job

30 Processing: lessons learned. Oozie templates! TDD works! Hive UDF combination! Morphlines!

31 Navigation: multi-channel analytics, hybrid DWHs; setup: distribution, OSS, customer; raw data: early, Parquet, manage schema, DIY security; analysis / integration; processing: Oozie templates, TDD, combine UDFs, Morphlines

32 Analysis: ad hoc querying. SQL-on-Hadoop: Hive (Stinger(.next)), Impala, Shark, Presto, Tajo, Phoenix (SQL-on-HBase), Drill, Redshift, ... Various, partly contradictory benchmarks exist. Benchmark with your own data!
Comparison (execution; JDBC/ODBC; complex types; SQL; stability):
Hive: MR/Tez jobs; +; +; HiveQL; +
Impala: MPP; +; -; HiveQL; +/-
Presto: MPP; JDBC only; +; SQL92; ?
Drill: MPP; +; ANSI SQL; ? (one cell missing in the transcript)

33 Analysis: ad hoc querying. [Bar chart: benchmark results for Hive, Shark, Impala, and Presto; legend: Cloudera, Inovex, Amplab; data labels: 50.91, 39.43, 34.31, 30.96]

34 Analysis: data protection / anonymization. Protecting sensitive data vs. efficient development (test data). Also applies to exports from the data hub. Pseudo-anonymization with Morphlines:
{ readCSV { separator : ";" columns : [isodate,isotime,ip,username,..] } }
{ anonymize { fields : [ip,username] } }

35 Analysis: data protection / anonymization
private static final class Anonymize extends AbstractCommand {
    private final List<String> fieldNames;

    public Anonymize(...) { // constructor arguments are elided on the slide
        this.fieldNames = getConfigs().getStringList(config, "fields");
    }

    protected boolean doProcess(Record record) {
        for (String field : record.getFields().keySet()) {
            // perform anonymization on configured fields
            // of record object
        }
        // pass record to next element in chain
        return super.doProcess(record);
    }
}
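
The anonymization step itself is left as a comment on the slide. One common way to pseudonymize is to replace each configured field with a salted hash, sketched below in plain Java without the Morphlines API. The choice of SHA-256, the salt parameter, and the names `PseudonymizeSketch`, `pseudonymize`, and `sha256Hex` are assumptions for illustration, not the method used in the talk.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PseudonymizeSketch {

    /** Returns the hex-encoded SHA-256 hash of salt + value. */
    static String sha256Hex(String salt, String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest((salt + value).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Returns a copy of the record in which the configured fields are hashed. */
    static Map<String, String> pseudonymize(
            Map<String, String> record, List<String> fields, String salt) {
        Map<String, String> result = new LinkedHashMap<>(record);
        for (String field : fields) {
            String value = result.get(field);
            if (value != null) {
                result.put(field, sha256Hex(salt, value));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("isodate", "2014-09-25");
        record.put("ip", "10.0.0.1");
        record.put("username", "alice");
        System.out.println(pseudonymize(record, Arrays.asList("ip", "username"), "secret-salt"));
    }
}
```

Because equal inputs map to equal hashes, joins and distinct counts on the pseudonymized fields still work, which keeps the data usable as test data. Low-entropy values such as IP addresses remain guessable if the salt leaks, so the salt has to be protected.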

36 Analysis: integration of BI tools. BI tools (Business Objects, Tableau, Microstrategy, ...) usually offer JDBC/ODBC interfaces. Idea: use them to build reports directly on Hadoop data. For the combination of Business Objects and Hive, the central points are: partition pruning (LEFT OUTER JOINs!); setting query optimization parameters via initial SQL (e.g. hive.auto.convert.join=true). The options for influencing the generated query syntax are limited!

37 Analysis: lessons learned. Benchmark with your own data! Anonymized test data! Integration: Hive query syntax!

38 Thank you! Multi-channel analytics, hybrid DWHs; setup: distribution, OSS, customer; raw data: early, Parquet, manage schema, DIY security; analysis: own benchmark, anonymize, query syntax!; processing: Oozie templates, TDD, combine UDFs, Morphlines

39 BASICS


41 HDFS architecture. [Diagram: a client node holds blocks blk 1 to blk 4 and asks the name node "Where do I store block 1?"; the name node answers "data nodes 03, 05, 08"; blk 1 is written to data nodes 03, 05, and 08, each of which replies "Done!". Data nodes 01 to 12 are spread over rack 1, rack 2, and rack 3.]
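
The answer "03, 05, 08" on this slide matches the default HDFS replica placement for three replicas: one replica on a first node, a second on a node in a different rack, and a third on another node in that same remote rack. The sketch below illustrates only this rule. It assumes four consecutive node numbers per rack (read off the slide layout) and picks nodes deterministically, whereas a real name node chooses randomly and also considers load and free space. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockPlacementSketch {

    /** Assumption from the slide layout: nodes 01-04, 05-08, 09-12 form one rack each. */
    static final int NODES_PER_RACK = 4;

    /** Returns the 1-based rack number of a 1-based data node number. */
    static int rackOf(int node) {
        return (node - 1) / NODES_PER_RACK + 1;
    }

    /** Chooses three replica locations, starting with the given first node. */
    static List<Integer> chooseReplicas(int firstNode, int totalNodes) {
        List<Integer> replicas = new ArrayList<>();
        replicas.add(firstNode);
        int remoteRack = -1;
        // Second replica: the first node found on a different rack.
        for (int node = 1; node <= totalNodes && replicas.size() < 2; node++) {
            if (rackOf(node) != rackOf(firstNode)) {
                replicas.add(node);
                remoteRack = rackOf(node);
            }
        }
        // Third replica: another node on the same remote rack.
        for (int node = 1; node <= totalNodes && replicas.size() < 3; node++) {
            if (rackOf(node) == remoteRack && !replicas.contains(node)) {
                replicas.add(node);
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        System.out.println(chooseReplicas(3, 12));
    }
}
```

Spreading replicas over two racks lets a block survive the loss of a whole rack, while keeping two replicas in one rack limits cross-rack write traffic.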

42 Map/Reduce principle. [Diagram: the input on the data nodes is processed by several map tasks, followed by a shuffle and several reduce tasks]
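
The three stages on this slide can be illustrated with an in-memory sketch that needs no Hadoop installation. It counts how often each value of one column occurs in semicolon-separated records. The class `MapReducePrincipleSketch` and its method are hypothetical, and everything runs in a single JVM, so the sketch shows only the data flow and not the distribution across data nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReducePrincipleSketch {

    /** Counts the values of one column by running map, shuffle, and reduce in memory. */
    static Map<String, Integer> run(List<String> input, int keyColumn) {
        // Map phase: every record yields one (key, 1) pair.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String record : input) {
            String[] fields = record.split(";");
            mapped.add(new AbstractMap.SimpleEntry<>(fields[keyColumn], 1));
        }
        // Shuffle phase: group all values by key, sorted by key.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce phase: sum the values of each key.
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            reduced.put(group.getKey(), sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("1;click", "2;view", "3;click");
        System.out.println(run(records, 1));
    }
}
```

In a real cluster the map calls run in parallel next to the data, and the shuffle moves each key's values over the network to the reducer responsible for that key.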

43 Map/Reduce: Java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper
public class WebtrekkEventMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // key contains entire record
        String[] fields = key.toString().split(";");
        // extract relevant information
        String eventName = fields[12];
        // emit output key and count
        context.write(new Text(eventName), new IntWritable(1));
    }
}

// Reducer
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable partialCount : values) {
            sum += partialCount.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

44 Map/Reduce: Pentaho