Lecture Algorithmische BioInformatik (Algorithmic Bioinformatics, 19710), WS 2013/2014, Week 3 - Monday

1 Lecture Algorithmische BioInformatik (19710), WS 2013/2014, Week 3 - Monday. Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin

2 Lecture topics
Part 1: Background Basics (4): 1. The Nucleic Acid World; 2. Protein Structure; 3. Dealing with Databases
Part 2: Sequence Alignments (3): 4. Producing and Analyzing Sequence Alignments; 5. Pairwise Sequence Alignment and Database Searching; 6. Patterns, Profiles, and Multiple Alignments
Part 3: Evolutionary Processes (3): 7. Recovering Evolutionary History; 8. Building Phylogenetic Trees
Part 4: Genome Characteristics (4): 9. Revealing Genome Features; 10. Gene Detection and Genome Annotation
Part 5: Secondary Structures (4): 11. Obtaining Secondary Structure from Sequence; 12. Predicting Secondary Structures
Part 6: Tertiary Structures (4): 13. Modeling Protein Structure; 14. Analyzing Structure-Function Relationships
Part 7: Cells and Organisms (8): 15. Proteome and Gene Expression Analysis; 16. Clustering Methods and Statistics; 17. Systems Biology

4 Why sequence comparison? Biological background: many genes and proteins are members of families with similar biochemical function and/or common evolutionary origin. Sequence comparison is used to discover functional, structural, or evolutionary relationships (e.g. similar sequence => similar function, structure, ...), to identify conserved patterns, and to learn something about an unknown protein: finding proteins with similar domains => inferring similar function.

5 Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Shaded boxes represent host species: avian (green), swine (red), and human (blue). GJD Smith et al. Nature 459, (2009) doi: /nature

6 Homology: similarity that results from descent from a common ancestral gene. Identifying and analyzing homologies is a central task of phylogenetics. An alignment is a hypothesis of positional homology between bases or amino acids.

7 The tree of life?

8 The tree of life!

9 Mutation rates are not constant

10 Similarity, homolog, ortholog, paralog. [Figure: an early globin gene undergoes gene duplication into an α-chain gene and a β-chain gene; after speciation there are mouse α, human α, cattle α and cattle β, human β, mouse β. The α genes are orthologs of each other, as are the β genes; cattle α and cattle β are paralogs; all of them are homologs.] Orthologs: diverged after speciation; usually have similar functions. Paralogs: diverged after gene duplication; divergence in function as well. Therefore orthologous genes must be considered when transferring annotations to other species.

12 Many ways to align

13 First insights into sequence alignment: many alignments are conceivable; two sequences can always be aligned; sequence alignments must be scored ("score"); often more than one solution has the same score.

14 Three basic possibilities: match, mismatch, gap. Example: THESEGA-PS aligned with THIS-GAAP-

15 Visual sequence comparison: dot plot

16 Visual sequence comparison: filtered dot plot (4 bp window, 75% identity cutoff)
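
The filtering idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `filtered_dotplot` and `render` are made up here, and the example sequences are borrowed from the similarity slides. A dot is kept at (i, j) only if the two windows starting there agree in at least 75% of their 4 positions.

```python
def filtered_dotplot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return the set of (i, j) window start positions that pass the filter."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches / window >= min_identity:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, base in enumerate(seq_a):
        row = "".join(
            "*" if (i, j) in dots else "." for j in range(len(seq_b))
        )
        lines.append(base + " " + row)
    return "\n".join(lines)


dots = filtered_dotplot("GAACAAT", "GAATAAT")
print(render("GAACAAT", "GAATAAT", dots))
```

A larger window or a stricter cutoff removes more of the random off-diagonal dots, at the price of also hiding short or weakly conserved similarities.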

17 Dot plots of evolutionary sequence rearrangements

18 Similarity: GAACAAT vs. GAACAAT: 7/7 or 100%. GAACAAT vs. GAATAAT (one mismatch): 6/7 or 86%.

19 Mismatches: GAACAAT vs. GAATAAT: 6/7 or 86%. GAACAAT vs. GAAGAAT: 6/7 or 86%.
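
The percentages on these slides are plain identity counts over an ungapped alignment. A minimal sketch (the helper name `percent_identity` is an assumption made for this example):

```python
def percent_identity(a, b):
    """Return (matches, length, percent) for two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches, len(a), 100.0 * matches / len(a)


matches, length, percent = percent_identity("GAACAAT", "GAATAAT")
print(f"{matches}/{length} = {percent:.1f}%")
```

For one mismatch in seven positions this gives 6/7, which is about 85.7%.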

20 Simple DNA scoring system. Sequence 1: actaccagttcatttgatacttctcaaa. Sequence 2: taccattaccgtgttaactgaaaggacttaaagact. Scoring matrix over A, G, C, T: match = 1, mismatch = 0. Score = 5

21 Terminal mismatches: GAACAATttttt aligned with aaaccgaataat: 6/7 or 86%

22 Indels: GAAgCAAT vs. GAA*CAAT: 7/7 or 100%

23 Indels: GAAgCAAT vs. GAA*CAAT; GAAggggCAAT vs. GAA****CAAT

24 Assessing similarity: GAACAAT vs. GAACAAT: 7/7 or 100%. GAACAAT vs. GAACAAT (aligned at a different offset): 1/7 or 14%. Which alignment is better? How should an alignment be scored?

25 Similarity scoring, general method: terminal mismatches (0), match (1), mismatch penalty (-3), gap penalty (-1), gap extension penalty (-1). Standard for DNA alignments.

26 DNA scoring: GGGGGGAGAA against GGGGGAAAAAGGGGG (two mismatches, marked *): 8(1) + 2(-3) = 2. GGGGGGAGAA--GGG against GGGGGAAAAAGGGGG (two mismatches, marked *): 11(1) + 2(-3) + 1(-1) + 1(-1) = 3.
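
The two sums on this slide can be reproduced with a small scoring function. This is a sketch, not a reference implementation: the name `score_alignment` is made up, unaligned overhangs are written as terminal gap columns and scored 0, and the first position of a gap is charged the gap penalty while each further position is charged the extension penalty.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-3,
                    gap_open=-1, gap_extend=-1, gap_char="-"):
    """Score two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(row_a, row_b))
    # Terminal overhangs score 0: drop leading and trailing gap columns.
    while columns and gap_char in columns[0]:
        columns.pop(0)
    while columns and gap_char in columns[-1]:
        columns.pop()
    score = 0
    previous_gap = None  # which row the previous column's gap was in
    for x, y in columns:
        if x == gap_char or y == gap_char:
            current_gap = "a" if x == gap_char else "b"
            score += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += match if x == y else mismatch
            previous_gap = None
    return score


print(score_alignment("GGGGGGAGAA-----", "GGGGGAAAAAGGGGG"))
print(score_alignment("GGGGGGAGAA--GGG", "GGGGGAAAAAGGGGG"))
```

The first call scores only the ten overlapping columns (8 matches, 2 mismatches); the second adds a two-position internal gap and three further matches.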

27 Problem with low gap penalties: GATCGCTACGCTCAGC vs. A.C.C..C..T. With enough gaps, perfect identity can always be achieved.

29 PI3 kinases

30 PI3 kinases from the Pfam database

31 Global vs. local alignments. Global alignment (Needleman-Wunsch): align one sequence with another from beginning to end. Local alignment (Smith-Waterman): find a subsequence, as long as possible, that is as similar as possible in both sequences.
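
The difference between the two approaches shows up in a single dynamic-programming table. The sketch below is an assumption-laden illustration: the function name `alignment_score`, the linear gap penalty, and the values match = 1, mismatch = -1, gap = -1 are chosen for brevity and are not taken from the slides. It returns only the optimal score, not the alignment itself.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Optimal score: Needleman-Wunsch (global) or Smith-Waterman (local)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalized in the first row and column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
            if local:
                # Local: a cell never drops below 0, so an alignment
                # can start anywhere; the best cell is the answer.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


print(alignment_score("TTTTGAACAAT", "GAACAATCCCC", local=True))
print(alignment_score("TTTTGAACAAT", "GAACAATCCCC", local=False))
```

The local score picks out the shared GAACAAT core, while the global score also has to pay for the unrelated flanks.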

32 Global vs. local alignments

33 Protein alignment can be guided by tertiary structure information. DjlA protein (E. coli), DjlA protein (human). Gaps in an alignment should lie predominantly in loops, not in secondary structure elements; only in this way can one ultimately judge whether a sequence alignment is likely to be correct.

34 Multiple sequence alignments. Multiple alignment: simultaneous comparison of several sequences. Applications: function prediction and search for conserved motifs; database search (e.g. PSI-BLAST: Position-Specific Iterated BLAST; a profile is built first); phylogeny; sequence assembly (sequencing projects). Advantages: similar regions (motifs) can be brought out; dissimilarities can be used to reconstruct phylogenetic relationships.

35 Multiple sequence alignments: global vs. local

36 Multiple sequence alignments. Multiple alignment: simultaneous comparison of several sequences

37 Information from an MSA of the thioredoxin family: regions with many insertions and deletions probably correspond to loops at the surface. A position with a conserved Gly or Pro suggests a turn in the chain.

38 Multiple sequence alignments

39 Sequence logos

40 Multiple sequence alignments. Optimal approaches (dynamic programming): time and memory requirements grow beyond all bounds. For sequence length L and n sequences: memory $O(L^n)$, time $O(2^n L^n)$. The problem is NP-complete. For L = 300 this means (cells in the matrix; memory at 4 bytes/cell; time at $10^6$ operations/s):
2 sequences: $300^2 = 9 \times 10^4$ cells, memory in kB, 0.36 s
3 sequences: $300^3 = 2.7 \times 10^7$ cells, memory in MB, 216 s
4 sequences: $300^4 = 8.1 \times 10^9$ cells, memory in GB, 36 h
5 sequences: $300^5 = 2.4 \times 10^{12}$ cells, memory in GB, 900 d
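
The growth rates on this slide can be checked directly from the two formulas. A minimal sketch, assuming exactly L^n cells, 4 bytes per cell, and 2^n * L^n operations at 10^6 operations per second (the function name `msa_dp_cost` is made up for this example):

```python
def msa_dp_cost(n_sequences, length=300, bytes_per_cell=4,
                ops_per_second=10**6):
    """Return (cells, memory in bytes, time in seconds) for exact MSA."""
    cells = length ** n_sequences
    memory_bytes = cells * bytes_per_cell
    seconds = (2 ** n_sequences) * cells / ops_per_second
    return cells, memory_bytes, seconds


for n in range(2, 6):
    cells, memory_bytes, seconds = msa_dp_cost(n)
    print(f"n={n}: {cells:.2e} cells, {memory_bytes:.2e} bytes, "
          f"{seconds:.3g} s")
```

Each additional sequence multiplies the number of cells by L and the running time by 2L, which is why exact dynamic programming is limited to a handful of sequences.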

41 Multiple sequence alignments: optimization is possible, e.g. by divide and conquer

42 Similarity scoring of proteins. Weak alignments. Chemical similarity: L vs. I, K vs. R. Evolutionary similarity. How do proteins evolve? How do we recognize similarity?

43 Scoring systems for proteins: amino acids have different biochemical and physical properties that influence the probability of an evolutionary exchange (mutation). [Figure: Venn diagram of the amino acids by property; set labels: aliphatic, very small, small, hydrophobic, aromatic, charged, positive, polar; cysteine appears twice, as C (S+S) and C (SH).]

44 Scoring systems for proteins. Scoring matrices describe: the number of mutations from AA1 to AA2, chemical similarity, observed mutation frequency, and the probability of occurrence of an AA.

45 Substitution matrix. Two main classes of matrices: PAM (Dayhoff) and BLOSUM (Henikoff).

46 PAM (Point Accepted Mutation, often expanded as Percent Accepted Mutations) matrix. Matrix family: PAM 80, PAM 120, PAM 250. The number gives the evolutionary distance between two sequences that was used to compute the matrix; higher numbers correspond to larger distances.

47 PAM matrix: the PAM-1 matrix was computed from the number of observed substitutions. It assumes an average change of 1% of all amino acid positions. All further PAM matrices can be derived from it by extrapolation. PAM250 = 250 mutations per 100 residues.
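
The extrapolation step amounts to applying the PAM-1 transition probabilities repeatedly, i.e. raising the PAM-1 probability matrix to the n-th power. The sketch below uses a hypothetical 3-letter toy alphabet with made-up probabilities (`TOY_PAM1`), not Dayhoff's 20 x 20 data; the published PAM scoring matrices are log-odds scores derived from such probability matrices.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM1-like matrix: each residue stays unchanged with
# probability 0.99, so on average 1% of positions change per step.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Because the same position can mutate more than once, 250 mutations per 100 residues do not mean that every residue has changed; the diagonal of the extrapolated matrix stays well above zero.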

48 PAM matrix: derived from global alignments of protein families. Proteins within a family are at least 85% identical (Dayhoff et al., 1978). A phylogenetic tree and ancestral sequences are constructed for each protein family, and the number of substitutions is counted for each amino acid pair.

49 PAM-250 matrix [Figure: PAM-250 substitution matrix; rows and columns labeled A R N D C Q E G H I L K M F P S T W Y V B Z.]

50 PAM-250 matrix [Figure: matrix with rows and columns grouped by residue class: aromatic, hydrophobic, basic, acid, acid-amide, other hydrophilic.]

51 PAM limitations: based on only one (original) dataset; contains only proteins with few differences (85% similarity); based mainly on small globular proteins; the matrix is unbalanced.

52 BLOSUM matrices: the different BLOSUMn matrices were computed independently of each other (based on the BLOCKS database). BLOSUMn is based on clusters of BLOCKS sequences that are at least n% similar. BLOSUM62 therefore represents more closely related sequences than BLOSUM45. Higher numbers mean smaller evolutionary distance.

53 BLOSUM (Blocks Substitution Matrix): derived from domain alignments of distantly related proteins (Henikoff & Henikoff, 1992). The occurrences of each amino acid pair are counted per column of a block alignment; the counts from all blocks are then used to compute the BLOSUM matrices. Example column A A C E C: A-C = 4, A-E = 2, C-E = 2, A-A = 1, C-C = 1.
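
The pair counts for the example column A A C E C can be reproduced by enumerating all unordered pairs of residues in the column. A minimal sketch (the function name `count_pairs` is an assumption; the later steps that turn counts into log-odds scores are not shown):

```python
from collections import Counter
from itertools import combinations


def count_pairs(column):
    """Count unordered residue pairs within one alignment column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        # Sort so that (C, A) and (A, C) land in the same bucket.
        counts[tuple(sorted((x, y)))] += 1
    return counts


counts = count_pairs("AACEC")
for pair, number in sorted(counts.items()):
    print("-".join(pair), "=", number)
```

A column with 5 residues contributes 10 pairs in total, which matches 4 + 2 + 2 + 1 + 1.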

54 BLOSUM-50 matrix

55 BLOSUM-62 matrix: serine (S), threonine (T)

56 BLOSUM-62 matrix: lysine (K), arginine (R)

57 PAM vs. BLOSUM, approximate equivalents in order of increasingly distant sequences: PAM100 ≈ BLOSUM90, PAM120 ≈ BLOSUM80, PAM160 ≈ BLOSUM60, PAM200 ≈ BLOSUM52, PAM250 ≈ BLOSUM45. PAM: PAM120 normal, PAM60 close relationship, PAM250 distant relationship. BLOSUM: BLOSUM62 normal, BLOSUM80 close, BLOSUM45 distant.

59 Types of similarity-based methods

60 Alignment programs. Local alignment (Smith-Waterman): BLAST (simplified Smith-Waterman), FASTA (simplified Smith-Waterman), BESTFIT (GCG program). Global alignment (Needleman-Wunsch): GAP.

61 BLAST

62 BLAST (Basic Local Alignment Search Tool): ~50x faster than simple local alignment. Make an efficient table of all k-letter words of the database (e.g. k=3). Find the locations of the words of the query sequence in the database. Link them together and do a local alignment.

63 BLAST: making a table of k-tuples from a database of sequences, e.g. for k=3. Sequence 1, THISSEQUENCE: THI, HIS, ISS, SSE, SEQ, EQU, QUE, UEN, ENC, NCE. Sequence 2, THISISASEQUENCE: THI, HIS, ISI, SIS, ISA, SAS, ASE, SEQ, EQU, QUE, UEN, ENC, NCE. Resulting table (word: sequences): ASE: 2; ENC: 1,2; EQU: 1,2; HIS: 1,2; ISA: 2; ISI: 2; ISS: 1; NCE: 1,2; QUE: 1,2; SAS: 2; SEQ: 1,2; SIS: 2; SSE: 1; THI: 1,2; UEN: 1,2.
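
The word table can be built with a dictionary that maps each k-tuple to the set of sequences containing it. This is a sketch of the idea only; the name `build_word_index` is an assumption, and real BLAST implementations use more compact lookup structures and also record positions.

```python
from collections import defaultdict


def build_word_index(sequences, k=3):
    """Map every k-letter word to the numbers of the sequences it occurs in."""
    index = defaultdict(set)
    for number, sequence in enumerate(sequences, start=1):
        for start in range(len(sequence) - k + 1):
            index[sequence[start:start + k]].add(number)
    return index


index = build_word_index(["THISSEQUENCE", "THISISASEQUENCE"])
for word in sorted(index):
    print(word, sorted(index[word]))
```

A sequence of length L contributes L - k + 1 overlapping words, so the 12-letter sequence yields 10 words and the 15-letter sequence yields 13.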

64 BLAST: adding neighboring words. For every k-tuple word, BLAST also considers its neighboring variants, e.g. 3-letter words with 1-letter variations. THISSEQUENCE: THI -> AHI, BHI, CHI, ..., ZHI, TAI, TBI, TCI, ..., TZI, THA, THB, ..., THZ; HIS -> AIS, BIS, CIS, ..., ZIS, ...
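
Generating the 1-letter variants of a word is a small enumeration. The sketch below follows the slide in using the letters A to Z as the alphabet; for protein searches one would assume the 20 amino acid letters instead. The function name `one_letter_variants` is made up for this example.

```python
import string


def one_letter_variants(word, alphabet=string.ascii_uppercase):
    """Return all words that differ from `word` in exactly one position."""
    variants = set()
    for position, original in enumerate(word):
        for letter in alphabet:
            if letter != original:
                variants.add(word[:position] + letter + word[position + 1:])
    return variants


neighbors = one_letter_variants("THI")
print(len(neighbors), sorted(neighbors)[:5])
```

With a 26-letter alphabet, each of the 3 positions has 25 alternatives, so a 3-letter word has 75 such neighbors.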

65 BLAST: throwing away low-scoring words. All the words are first scored against the query sequence using a scoring table; low-scoring words (< threshold) are removed. THISSEQUENCE: THI -> AHI, BHI, CHI, ..., ZHI, TAI, TBI, TCI, ..., TZI, THA, THB, ..., THZ; HIS -> AIS, BIS, CIS, ..., ZIS, ...
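
The filtering step can be sketched as scoring each candidate word against the query word and keeping only those at or above a threshold. Everything concrete here is an assumption for illustration: `toy_score` is a hypothetical stand-in for a real substitution matrix such as BLOSUM62, and the threshold of 11 is arbitrary.

```python
def toy_score(x, y):
    """Hypothetical scoring table: identical 5, chemically similar 1, else -2."""
    similar = {frozenset("IL"), frozenset("KR"), frozenset("ST")}
    if x == y:
        return 5
    return 1 if frozenset((x, y)) in similar else -2


def word_score(candidate, query_word, score=toy_score):
    """Sum the per-position scores of two equally long words."""
    return sum(score(x, y) for x, y in zip(candidate, query_word))


def keep_high_scoring(candidates, query_word, threshold):
    """Keep the candidate words whose score reaches the threshold."""
    return sorted(
        w for w in candidates if word_score(w, query_word) >= threshold
    )


candidates = ["THI", "THL", "SHI", "AHI", "TAI", "THZ"]
kept = keep_high_scoring(candidates, "THI", threshold=11)
print(kept)
```

With this toy table, conservative substitutions such as I to L or T to S survive the cutoff, while arbitrary replacements are discarded, which is the effect the threshold is meant to have.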

66 BLAST searching: the table maintains a mapping of k-tuples to database sequences and is prepared for efficient searching (tree). For a query sequence, first quickly look up all possible word hits in the table. Sometimes low-complexity regions are excluded from the query sequence. Then try to extend the alignment from individual hits (also considering gaps). How this is done depends on the particular version of BLAST.

67 BLAST scoring: the score (S) is the sum of the scores for all k-tuple hits; it depends on the scoring matrix. We are usually interested in the statistical significance of the alignment: the E-value. If $E = 10^{-6}$ for a hit with score S, about $10^{-6}$ alignments scoring at least S are expected by chance in a search of this size, i.e. roughly a one-in-a-million chance that a random sequence would score as high. The E-value is a reliability measure for the score S.

68 BLAST statistical significance: assessing the likelihood that a match occurs by chance. Karlin-Altschul statistic: $E = k\,m\,n\,e^{-\lambda S}$, where m = size of the query sequence, n = size of the database, k = search-space scaling parameter, $\lambda$ (lambda) = score scaling parameter, S = BLAST HSP score. Low E -> good match.
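
The Karlin-Altschul formula is easy to evaluate once k and lambda are known. In the sketch below the function name `e_value` is made up, and the parameter values are illustrative assumptions only; in practice k and lambda depend on the scoring matrix and gap penalties and are reported by the search program.

```python
import math

# Illustrative values only (assumed for this example).
ILLUSTRATIVE_K = 0.041
ILLUSTRATIVE_LAMBDA = 0.267


def e_value(score, query_length, database_length,
            k=ILLUSTRATIVE_K, lam=ILLUSTRATIVE_LAMBDA):
    """Expected number of chance hits scoring at least `score`."""
    return k * query_length * database_length * math.exp(-lam * score)


for s in (30, 60, 90):
    print(s, e_value(s, query_length=300, database_length=10**7))
```

The formula makes two things visible: E falls exponentially with the score S, and it grows linearly with both the query length m and the database size n, so the same score is less significant in a larger database.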

69 BLAST statistical significance, rule of thumb for a good match: nucleotide match: E < 1e-6, identity > 70%; protein match: E < 1e-3, identity > 25%.

70 Basic BLAST family: BLASTN: DNA query against DNA database. BLASTP: protein query against protein database. TBLASTN: protein query against translated DNA database. BLASTX: translated DNA query against protein database. TBLASTX: translated DNA query against translated DNA database.

71 BLAST input: program, database, options (see more), sequence (FASTA, gi, or accession number)

72 BLAST options: algorithm and output options; number of descriptions and alignments returned; probability cutoff; strand; alignment parameters; scoring matrix (PAM30, PAM70, BLOSUM45, BLOSUM62, BLOSUM80); filter for low-complexity regions (PPPPP -> XXXXX)

73 Extended BLAST family: Gapped BLAST (default); PSI-BLAST (position-specific iterated BLAST, self-generated scoring matrix); PHI-BLAST (motif plus BLAST); BLAST2 client (align two sequences); megablast (genomic sequence); rpsblast (search for domains)

75 More information online at medicalbioinformaticsgroup.de/teaching. Thank you! Tim Conrad, AG Medical Bioinformatics. Further questions?