Data Warehousing Data Quality



Data Warehousing - Data Quality
Spring 2014
Dr. Andreas Geppert, Credit Suisse
geppert@acm.org

Outline of the Course
- Introduction
- DWH Architecture
- DWH Design and multi-dimensional data models
- Extract, Transform, Load (ETL)
- Metadata
- Data Quality
- Analytic Applications and Business Intelligence
- Implementation and Performance

Outline
1. Data Quality
2. Data Quality Management
3. Data Profiling
4. Data Cleansing
5. Data Quality in the ETL Process
6. Appendix: Data Cleansing & Merge/Purge

Data Quality
- Data have quality when they meet the requirements of their intended use (Olson)
- In order to meet these requirements, data need to be accurate, up-to-date, relevant, complete, understood, and trustworthy (Olson)
- The level of data quality thus depends on the intended use as well as on the data themselves
- The same data can have different levels of quality in different usage contexts

Real-world data quality problems
- Estimates suggest that in some companies, the cost incurred by lack of data quality amounts to 15-20% of the operating result (Olson)
- According to a TDWI study in 2001, the cost of insufficient data quality in US companies amounts to $600 bn (http://www.tdwi.org/display.aspx?id=6045)
- Gartner analysts predict that up to 50% of DWH and CRM projects will not meet their requirements or fail entirely because of data quality problems
- Inmon assumes that without data quality assurance, up to 20% of the data in data warehouses are wrong, redundant, or otherwise useless

Real-world data quality problems (2)
- USPS estimates that in 2001 they paid $1.8 bn because of undeliverable mail
- "A major... telecommunication company... had more than 33 different definitions for 'customer churn'" (Howson)
- Eziba sent tens of thousands of catalogues to addresses that had previously been classified as "unlikely respondents". The company had to suspend operations temporarily (NYT 2005-01-24)
- Various studies have found that:
  - Consumers lose $1-2.5 bn because of scanner errors at supermarket checkouts
  - Scanners determine a wrong price for 6-8% of the products in certain stores (NYT 2006-01-28)

Real-world data quality problems (3)
- An employee of a Japanese brokerage unit entered a sell order for 610,000 shares at 1 Yen, instead of one share at 610,000 Yen. The resulting loss amounted to 27 bn Yen ($224 mn)
- After the presidential election in 2004, it was found that in New Jersey...
  - 4,755 persons had voted who had been declared dead before;
  - 4,397 persons who had been registered in more than one county had voted twice;
  - 6,572 persons had voted in New Jersey and in one of five other states. (NYT 2005-09-16)

Real-world data quality problems (4)
«... Alan Greenspan, former chairman of the Federal Reserve, today... explained to a U.S. House committee what he thought went wrong. And insufficient data was one of the causes he pointed to. Greenspan has long praised computer technology as a tool that can be used to limit risks in financial markets. For instance, in 2005, he credited improved computing power and risk-scoring models with making it possible for lenders to extend credit to subprime mortgage borrowers. But at a hearing held today by the House Committee on Oversight and Government Reform, Greenspan acknowledged that the data fed into financial systems was often a case of garbage in, garbage out. Business decisions by financial services firms were based on "the best insights of mathematicians and finance experts, supported by major advances in computer and communications technology," Greenspan told the committee. "The whole intellectual edifice, however, collapsed in the summer of last year because the data inputted into the risk management models generally covered only the past two decades, a period of euphoria." He added that if the risk models had also been built to include "historic periods of stress, capital requirements would have been much higher and the financial world would be in far better shape today, in my judgment." And that wasn't the only bad data being crunched by IT systems. Christopher Cox, chairman of the Securities and Exchange Commission, told the committee that credit rating agencies gave AAA ratings to mortgage-backed securities that didn't deserve them. "These ratings not only gave false comfort to investors, but also skewed the computer risk models and regulatory capital computations," Cox said in written testimony.»
Computerworld, October 23, 2008 (http://www.computerworld.com/action/article.do?command=viewarticlebasic&articleid=9117961)

Quality Criteria - Accuracy
- DWH data correspond to the source data (and the real world)
- Only wanted, intended deviations; deviations are documented
- Violations...
  - ... because of input errors
    - typos: Joe Cool -> Joe Coool
    - misunderstandings: Joe Cool -> Joe Kuhl
  - ... because of wrong input or insufficient maintenance in the source (because the data are not used there)
  - ... because of deliberate wrong input
    - [ Joe Cool, non-smoking ]

Quality Criteria - Semantics
- The meaning of data is known and documented
- Unique and uniform data semantics exists; differing semantics are documented

Quality Criteria - Completeness
- The DWH contains the complete data set required for analysis
- Individual data elements / records are complete
- Typical reason for violations: data incomplete in the source
  - Not relevant for OLTP, thus not stored in the source
  - [ name: Joe Cool, ssn: 999-99-9999 ]

Quality Criteria - Consistency
- Data do not contain contradictions
- Violations through uncontrolled, redundant data storage and unmanaged updates
- Example: birthday = 1992-12-19, age = 20

Quality Criteria - Uniqueness
- The same data elements exist only once in the DWH
- Violations through redundant input in one or multiple sources
- Example: Joe Cool, Joe S. Cool, Snoopy

Quality Criteria - Timeliness
- Data are updated «in time», depending on user requirements
- Violations because of...
  - ... inadequate (business) processes
  - ... missing capture
  - ... technical issues
    - missing data delivery from sources
    - processing issues in the ETL process
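To make the consistency criterion concrete, here is a minimal sketch of a rule check for the birthday/age example above; the record layout and the reference date are assumptions chosen for illustration, not part of the slides:

```python
from datetime import date

def age_consistent(birthday: date, age: int, as_of: date) -> bool:
    """Consistency rule: the stored age must match the age derived
    from the stored birthday as of the reference date."""
    derived = as_of.year - birthday.year - (
        (as_of.month, as_of.day) < (birthday.month, birthday.day))
    return derived == age

# The slide's example record is inconsistent as of early 2014:
print(age_consistent(date(1992, 12, 19), 20, date(2014, 3, 1)))
# False: the derived age is 21, not 20
```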

Outline
1. Data Quality
2. Data Quality Management
3. Data Profiling
4. Data Cleansing
5. Data Quality in the ETL Process
6. Appendix: Data Cleansing & Merge/Purge

Information Quality Management
- Data quality improvement is not a one-time activity, but an ongoing process
  - Data Quality Assurance Program (Olson)
  - Total Quality Data Management
- Corresponding concepts, processes, tools, and organizational setup
  - One might even have to align the entire organization, its culture, and its management, to data quality
  - Buy-in of management and involvement of business (semantics!)
- Quality improvement should be an integral part of all new projects which create new data sets or migrate, replicate, or integrate data
- improve, prevent, monitor

Information Quality Management
- Establish an information quality environment
- Assess data definition and information architecture quality
- Assess IQ
- Assess the cost of lacking IQ
- Re-design and cleansing of data
- Improve process quality

Data Quality Assurance Cycle
[Diagram: a cycle comprising Analysis / Data Profiling, DQ rule definition, DQ monitoring, DQ analysis, and DQ remediation, supported by online and offline data cleansing]

Outline
1. Data Quality
2. Data Quality Management
3. Data Profiling
4. Data Cleansing
5. Data Quality in the ETL Process
6. Appendix: Data Cleansing & Merge/Purge

Data Profiling
- Usage of analytic techniques to discover valid structures and content and to assess the quality of data sets
- Takes existing metadata and instance data as input
- Outputs accurate, more complete metadata and further information about the meaning and quality of the data

Data Profiling: Input
- Metadata, depending on the type of data source (database, file, ...)
- Data definitions:
  - Table definitions, column definitions, incl. data types (-> DB schema)
  - Type definitions (-> program source code)
  - Type definitions (-> Cobol copybooks, PL/1 includes)
- Structure information:
  - Primary keys, foreign keys, constraints (-> DB schema)
  - Semantic/implicit relationships (-> trigger and stored procedure source code)
  - Application constraints when inserting, modifying, deleting data (-> application source code)

Data Profiling: Outline
- Column analysis
  - Invalid values
  - Inaccurate data
- Structure analysis
- Simple data analysis
- Complex data analysis
- Value analysis
  - Invalid combinations of values
  - Unreasonable values
  - Not detectable through analysis techniques

Analysis Techniques
- Discovery
  - Discovers previously unknown facts (see Data Mining)
- Rule and assertion checking
  - Assertions are input for the analysis
  - The analysis checks their validity and identifies data elements that violate the assertions
- Visual inspection
  - Usually against pre-processed data sets
  - Frequency distributions (histograms), groupings, comparison of values/aggregates across data sources, threshold values
  - Example: "Mickey Mouse" is the most frequent user name

Data Profiling: Column Analysis
- Investigates values in individual columns, independently from values in other columns
- Metadata specify properties of admissible values:
  - Name
  - (Business) meaning
  - Domain name
  - Data type
  - Character set
  - Length restrictions (min, max)
  - Acceptable values (discrete list; value set; character/digit patterns)
  - Null constraints
  - Uniqueness constraints
  - Consecutivity constraints
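As a concrete illustration of column analysis, the following is a minimal sketch; the column representation and the particular checks (null percentage, length bounds, a digit pattern) are assumptions made for the example:

```python
import re

def profile_column(values, min_len=None, max_len=None, pattern=None):
    """Profile one column: null share, distinct count, and values
    violating a length restriction or character/digit pattern."""
    non_null = [v for v in values if v is not None]
    null_pct = 100 * (len(values) - len(non_null)) / len(values)
    violations = [
        v for v in non_null
        if (min_len is not None and len(v) < min_len)
        or (max_len is not None and len(v) > max_len)
        or (pattern is not None and not re.fullmatch(pattern, v))
    ]
    return {"null_pct": null_pct,
            "distinct": len(set(non_null)),
            "violations": violations}

# Example: an SSN-like column checked against a digit pattern
ssns = ["123-45-6789", "999-99-9999", None, "12-345-678"]
print(profile_column(ssns, pattern=r"\d{3}-\d{2}-\d{4}"))
# {'null_pct': 25.0, 'distinct': 3, 'violations': ['12-345-678']}
```

Note that the default value 999-99-9999 passes the pattern check; detecting it is a matter of value analysis (frequency distributions), not of column rules alone.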

Data Profiling: Column Analysis
[Diagram: documented properties (from external documentation) and discovered properties (from discovery) feed data validation, which yields accurate properties and invalid values; their analysis yields facts about invalid data and problems regarding bad practices]

Data Profiling: Structure Analysis
- Looks for table structures (between columns) and object structures (between tables)
- Uses these relationships to assess data quality
- Determines functional dependencies as a basis for subsequent steps
- Identifies primary and foreign key constraints, redundant and derived columns, synonyms
- Relies on an understanding of the semantics to decide, in case of rule violations, whether the data are inaccurate or the rules invalid
- Problem when using samples: false positives
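One building block of structure analysis is testing whether a functional dependency (and hence a key candidate) actually holds in the data. A minimal sketch, under the assumption that tables are represented as lists of dicts (the helper name is invented for illustration):

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds:
    rows agreeing on the lhs columns must agree on the rhs column."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # same determinant, different dependent value
        seen[key] = row[rhs]
    return True

rows = [
    {"ssn": "123-45-6789", "name": "Joe Cool", "zip": "8000"},
    {"ssn": "123-45-6789", "name": "Joe Fool", "zip": "8000"},
]
print(fd_holds(rows, ["ssn"], "name"))  # False: ssn does not determine name here
print(fd_holds(rows, ["ssn"], "zip"))   # True
```

Whether the False result means dirty data (a typo in one name) or an invalid rule is exactly the semantic judgment the slide mentions.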

Data Profiling: Structure Analysis
[Diagram: documented structures (from external documentation) and discovered structures (from discovery) feed validation against the data, which yields accurate structure definitions and structure violations; their analysis yields facts about inaccurate structures]

Data Profiling: Basic Data Analysis
- Basic rule: restriction of the values of multiple attributes of one business object
- Soft vs. hard rules
  - Example: hire_date > birth_date vs. hire_date > birth_date + 18 years
- Formulation of rules and validation of the data against the rules
- Often, an individual data value cannot be identified as wrong on its own, only in combination with other values
  - Example: marital status = "single", maiden name = "Meier"
- Rules typically cover:
  - Date and time attributes, time durations
  - Classifying attributes
  - Process status attributes ("workflow")
  - Derived attributes
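A minimal sketch of validating the hire-date rules from the slide; the record layout and the hard/soft classification of the two rules are assumptions for the example:

```python
from datetime import date

def check_hire_rules(rec):
    """Validate one employee record against a hard and a soft basic rule."""
    issues = []
    if rec["hire_date"] <= rec["birth_date"]:
        issues.append("hard: hire_date must be after birth_date")
    # 18th birthday (leap-day birthdays ignored for brevity)
    adult = rec["birth_date"].replace(year=rec["birth_date"].year + 18)
    if rec["hire_date"] < adult:
        issues.append("soft: hired before age 18, needs review")
    return issues

rec = {"birth_date": date(1992, 12, 19), "hire_date": date(2008, 6, 1)}
print(check_hire_rules(rec))  # ['soft: hired before age 18, needs review']
```

A hard-rule violation marks the record as definitely wrong; a soft-rule violation only flags it for manual review.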

Data Profiling: Basic Data Analysis
[Diagram: documented data rules (from external sources) and discovered rules feed validation against the data, which yields accurate data rules and data rule violations; their analysis yields facts about inaccurate data]

Data Profiling: Complex Data Analysis
- Complex rule: restriction of the values of multiple business objects or of the relationships between multiple business objects
- Otherwise very similar to basic data analysis
- Examples:
  - A book can be lent to at most one user per point in time (time and space properties)
  - Aggregations (completeness)

Data Profiling: Value Analysis
- Targets inaccurate data that are not detectable through clear-cut ("black/white") rules
- All individual attribute values can be valid or possible
- Value rule: calculation of one or more values over a data set
  - Cardinalities
  - Frequency distribution
  - Number of distinct values
  - Extreme values
  - Percentage of null values
- Visual inspection to identify obvious outliers

Data Profiling: Value Analysis
[Diagram: possible queries (from external sources) and tests feed the analysis of the data, which yields meaningful queries and unexpected results; their analysis yields facts about inaccurate data]
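A small sketch of the value rules listed above, computed with Python's standard library; spotting the outliers in the result is still left to visual inspection, as the slide says:

```python
from collections import Counter

def value_profile(values):
    """Compute the aggregates used in value analysis over one column."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "cardinality": len(values),
        "distinct": len(freq),
        "null_pct": 100 * (len(values) - len(non_null)) / len(values),
        "min": min(non_null),
        "max": max(non_null),
        "most_frequent": freq.most_common(3),
    }

names = ["Mickey Mouse", "Joe Cool", "Mickey Mouse", None, "Mickey Mouse"]
print(value_profile(names))
# "Mickey Mouse" dominating the frequency distribution would stand out
# immediately when inspecting user names, even though each single value is valid.
```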

Outline
1. Data Quality
2. Data Quality Management
3. Data Profiling
4. Data Cleansing
5. Data Quality in the ETL Process
6. Appendix: Data Cleansing & Merge/Purge

Data Cleansing
[Diagram: four data sources feed the staging area via extraction; transformation & data cleansing take place there; loading then moves the data into the DWH]

Data Cleansing
- Problem detection: identification of bad data ("dirty data")
- Resolution of problems, data correction
- Location of data cleansing:
  - Dirty data should be cleansed in the sources (why?)
  - Data modification in the DWH only in exceptional cases (which?)
- Automated vs. manual data cleansing
  - Automation is desirable, especially for large data sets
  - In many cases, manual cleansing / human supervision will be necessary
- Offline vs. online (see appendix)

Outline
1. Data Quality
2. Data Quality Management
3. Data Profiling
4. Data Cleansing
5. Data Quality in the ETL Process
6. Appendix: Data Cleansing & Merge/Purge

Data Quality in the ETL Process
[Diagram: the data quality assurance cycle again - Analysis / Data Profiling, DQ rule definition, DQ monitoring, DQ analysis, and DQ remediation, plus online and offline data cleansing]

DQ Monitoring
- Data quality must be determined for newly loaded data
- (Selected) rules are checked at runtime as part of the ETL process
  - Checking all rules for the entire data set will not be feasible in many cases
- Possible reactions in case of a DQ rule violation (see the sketch below):
  - Log the DQ issue (continue with ETL processing)
  - Notify the data owner
  - Drop the dirty data
  - Cleanse the dirty data
  - Stop the ETL process
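A minimal sketch of such a runtime check with configurable reactions; the rule representation and the reaction names are assumptions for illustration, and the notify/cleanse reactions, which would hook into alerting and cleansing code, are omitted:

```python
def monitor(records, rule, reaction="log"):
    """Apply one DQ rule to newly loaded records and react on violations."""
    clean = []
    for rec in records:
        if rule(rec):
            clean.append(rec)
        elif reaction == "log":
            print(f"DQ issue: {rec}")  # log and continue with ETL processing
            clean.append(rec)
        elif reaction == "drop":
            pass  # silently drop the dirty record
        elif reaction == "stop":
            raise RuntimeError(f"ETL stopped: DQ violation in {rec}")
    return clean

# Rule: the SSN must be present and must not be the default value
rule = lambda rec: rec.get("ssn") not in (None, "999-99-9999")
batch = [{"name": "Joe Cool", "ssn": "999-99-9999"},
         {"name": "Jane Doe", "ssn": "123-45-6789"}]
print(monitor(batch, rule, reaction="drop"))  # only the Jane Doe record survives
```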

DQ Analysis
- Data quality issues detected as part of monitoring must be analyzed further
- A "DQ Dashboard" provides an overview of data quality and data quality issues
  - Which rules have been checked for which data sets?
  - User(-group)-defined quality thresholds
  - A traffic-light system provides information about the usability of the data
  - Drill-down support allows users to analyze data quality issues in more detail

DQ Remediation
- Information about DQ issues is sent to the organization responsible for rectifying quality issues (DQ Issue Management)
  - Part of Data Governance
- Detailed analysis of DQ issues
- Correction of dirty data
- Process improvements and application changes in order to improve quality in a sustainable way

Summary
- DQ issues in the sources are a frequent problem in many data warehouses
- DQ issues are aggravated by integration
- DQ issues can compromise the usefulness of analysis results
- Data quality must be approached as a comprehensive and continuous process
- Quality rules are a form of metadata and must often first be determined and validated
- Data profiling helps to complete and improve metadata and to assess the initial status of data quality
- DQ monitoring checks data quality at runtime and provides a range of possible reactions to DQ issues

Outline
1. Data Quality
2. Data Quality Management
3. Data Profiling
4. Data Cleansing
5. Data Quality in the ETL Process
6. Appendix: Data Cleansing & Merge/Purge

Data Cleansing Technique: Merge/Purge
- Uniqueness problem: find duplicates
- Various algorithms for duplicate detection have been proposed
- In general, a duplicate detection technique consists of a similarity measure and an algorithm; the algorithm determines which entities are compared
- Naive approach: compare each entity with every other one - O(n^2)
- Depending on the approach, 99% of these comparisons can be avoided
- Merge/Purge:
  - consolidate duplicates into a single record (merge)
  - delete the superfluous duplicates (purge)

Merge/Purge
[Diagram: three data sources deliver the records "Joe Cool, 999-99-9999, 4300 The Woods Drive", "Joe Fool, 123-45-6789, 4300 The Woods Dr", and "Joe Cool, 123-45-6789, Grossackerstrasse 32" into the staging area; the consolidated record "Joe Cool, 123-45-6789, 4300 The Woods Drive" is loaded into the DWH]

Merge/Purge
- Determine and generate keys
- Determine a distance function (metric)
- Sort the data set (based on the chosen key)
  - Idea: duplicates end up close to each other in the sorted list
- Determine duplicates:
  - slide a window of size w over the list
  - compare the w records within the window using the chosen metric
  - handle the duplicates found
- The choice of the key is essential

Merge/Purge: Key Construction
- Key construction based on attribute values (or parts of them)
  - Concatenation of the parts into a character string
- Goal: duplicates have neighboring keys
- Key construction based on domain knowledge, geared towards typical errors
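A minimal sketch of the sorted-neighborhood procedure just described (key, sort, sliding window); the toy key and distance functions are placeholders for the real choices discussed on the following slides:

```python
def sorted_neighborhood(records, key, dist, w=3, threshold=3):
    """Merge/Purge candidate generation: sort by key, then compare each
    record only with its w-1 predecessors in the sorted list instead of
    with all other records (avoids the O(n^2) naive approach)."""
    srt = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(srt):
        for other in srt[max(0, i - w + 1):i]:
            if dist(rec, other) < threshold:
                pairs.append((other, rec))  # duplicate candidates
    return pairs

records = ["Joe Cool", "Jane Doe", "Joe Coool"]
key = lambda r: r.split()[-1]  # toy key: sort by last name
# toy metric: mismatched positions plus length difference
dist = lambda a, b: sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
print(sorted_neighborhood(records, key, dist))
# [('Joe Cool', 'Joe Coool')]
```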

Key Construction: Example
- Data: persons, incl. SSN and address details
- Language: English
- Observations:
  - last names are misspelled more often
  - first names are common, hence fewer errors
  - errors occur more often in vowels

Key Construction: Example (2)
Key:
- first three consonants of the last name
- first three letters of the first name
- house number
- all consonants of the street
- first three digits of the SSN

First | Last    | Address           | SSN      | Key
Bill  | Jansen  | 123 First Street  | 45678987 | JNSBIL123FRST456
Bill  | Jansen  | 123 First Street  | 45678987 | JNSBIL123FRST456
Bill  | Janssen | 123 First Street  | 45678987 | JNSBIL123FRST456
Bill  | Jensen  | 123 Forest Street | 45654321 | JNSBIL123FRST456
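A sketch of this key construction; one assumption is made so that the sample rows reproduce the keys in the table, namely that only the street name proper (not the "Street" suffix) contributes its consonants:

```python
VOWELS = set("aeiou")

def consonants(s: str) -> str:
    return "".join(c for c in s if c.isalpha() and c.lower() not in VOWELS)

def build_key(first, last, address, ssn):
    """Merge/Purge key as on the slide. Assumption for this sketch:
    'street' is the first word after the house number, so the
    'Street' suffix does not enter the key."""
    house, street = address.split()[0], address.split()[1]
    return (consonants(last)[:3] + first[:3] + house
            + consonants(street) + ssn[:3]).upper()

print(build_key("Bill", "Jansen", "123 First Street", "45678987"))
# JNSBIL123FRST456
print(build_key("Bill", "Jensen", "123 Forest Street", "45654321"))
# JNSBIL123FRST456 -- dropping the error-prone vowels makes all four
# variants neighbors in the sorted list
```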

Merge/Purge: Duplicate Determination
- Goal: compare all records within the window
  - The window determines the candidates for the duplicate check
- Determine the distance or similarity function d
- Determine a threshold s
- Compare the distance d between records t1 and t2
- Handling of tuples with d(t1, t2) < s:
  - automatically, rule-based
  - manually

Merge/Purge: Distance Measures
- Edit distance
  - How many changes to t1 are needed to obtain t2?
  - d("Joe Cool", "Joe Coool") < d("Joe Cool", "Joe Kuhl")
- Keyboard distance
  - Compares letters according to the keyboard layout
  - d("Joe Cool", "Joe Fool") < d("Joe Cool", "Joe Pool")
- Phonetic distance
  - Based on pronunciation, sound
  - d("Joe Cool", "Joe Kuhl") < d("Joe Cool", "Joe Pool")
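Of the three measures, edit distance is the easiest to state precisely; a standard Levenshtein implementation (two-row dynamic programming) as a sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("Joe Cool", "Joe Coool"))  # 1 (one inserted 'o')
print(edit_distance("Joe Cool", "Joe Kuhl"))   # 3 (three substitutions)
```

Keyboard and phonetic distances would rank the pairs differently, as the inequalities on the slide show, which is why the measure must fit the expected error type.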

Merge/Purge: Duplicate Determination (2)
- Manually or by means of decision rules
- Use of attribute comparisons: exact or similar
- Example: given t1, t2 with d(t1, t2) < s
  t1.firstname = t2.firstname
  AND d(t1.lastname, t2.lastname) < s2   -- similar last names
  AND t1.address = t2.address
  => t1 and t2 are duplicates

Merge/Purge: Duplicate Determination (3)
- Quality of the procedure:
  - number of correctly detected duplicates (IR: recall)
  - number of falsely detected duplicates (IR: precision)
- Depends on the window size and the key
  - the first key component is significant
  - transpositions require large windows
- Multiple passes with different keys

Duplicate Detection: Merging of Duplicates
- Once duplicates have been determined, the inconsistencies must be resolved
- Determination of the "correct" properties (essentially attributes, but in principle also relationships)
- "Merge" is usually a manual process
- Heuristics (see the sketch below):
  - prefer "real" values over null values
  - prefer concrete values over default values
  - prefer values from sources with better data quality (reference systems)
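A sketch of these merge heuristics; the source-quality ranking and the set of known default values are assumptions made for illustration:

```python
DEFAULTS = {"999-99-9999", "unknown"}  # assumed placeholder values

def merge(records):
    """Merge duplicate records attribute by attribute: prefer real values
    over nulls, concrete values over defaults, and better sources over
    worse. Each record is (source_rank, attributes); lower rank is better."""
    merged = {}
    for attr in {a for _, rec in records for a in rec}:
        candidates = [(rank, rec[attr]) for rank, rec in records
                      if rec.get(attr) is not None and rec[attr] not in DEFAULTS]
        if candidates:
            merged[attr] = min(candidates)[1]  # best-ranked source wins
    return merged

dups = [(2, {"name": "Joe Cool", "ssn": "999-99-9999"}),
        (1, {"name": "Joe Fool", "ssn": "123-45-6789"}),
        (3, {"name": "Joe Cool", "ssn": None, "city": "Zurich"})]
print(merge(dups))  # key order may vary:
# {'name': 'Joe Fool', 'ssn': '123-45-6789', 'city': 'Zurich'}
```

The example also shows the limits of automation: the best-ranked source wins the name attribute even though two of three sources say "Joe Cool", which is why the merge usually remains a manual, heuristically supported process.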