CS580 Bioinformatics Fall 2001

Sept 10.

1. Introduction  SIMON

·         What is Bioinformatics ?  How does it relate to Computational Biology?

     Why study it?

·         Overview of the course

 

Computational Problems to be solved:

1.    Homology, alignments and phylogeny

2.    Mapping

3.    Gene Finding

4.    Determinants of protein structure

5.    Identification of genetic structural/metabolic lesions from gene expression data

 

Strategies

      Computational Molecular Biology vs BioInformatics

      Strings vs Large Data Arrays

 

 

 

 Review of Probability  LAZAR

One event- Simple probability

   * Densities and distributions

          Uniform, Exponential family (including  Exponential, Gaussian,

          Dirichlet/Beta)

   * The Shannon entropy, Minimal description length

 

More than one event- Joint probability

   * Conditional  and Unconditional (independent) Probability

   * The Markov assumption

          Finite Markov chains, Transition probabilities, Simplifying

          properties of the Markov assumption

   *  Bayes theorem

·         Missing parameters- Expectation Maximization

 

2. OPTIONAL  Biology MiniCourse Part 1.  MOSAVI

 

 

 

OPTIONAL: Mon Sept 17 6:30-7:30pm Biology MiniCourse Part 2 MOSAVI

 

STRINGS

 

Sept 17. Begin 7:30 pm

3.   BIOLOGY REVISIT Review of key problems in biology requiring solutions based on algorithms on strings  SIMON

·         The genetic code

·         The informational structure of prokaryotes and eukaryotes

        o Introns,Exons,ORFs

        o Starts, stops, splices, CpG regions

 

 4. Algorithms in this course SIMON

·         The computational complexities/considerations of algorithms particularly those applied to Computational Molecular Biology

·         Nondeterministic Polynomial time solutions of problems

·         The Satisfiability Problem –NP Complete

·         Cook’s Theorem: Polynomial time mapping to SAT

·         NP Hard Problems: the TSP, the partition problem, the shortest

           common superstring

        o Mapping the double-digest problem to an NP complete problem

·         Review NP Hard techniques  -Solution vs. Heuristic vs.

          Approximation

             + Branch and Bound

             + Minimum Spanning trees

             + Genetic algorithms

             + Simulated Annealing

 

-Special Cases-Dynamic Programming

 

GENOMICS

 

Sept 24.

5. Algorithms on Strings  SIMON

 

·         Pattern recognition as applied to strings – matching vs alignment 

-Finite Automaton models of string searching

-Hashing functions, prime numbers and hashing schemes for

searching

Regular Expressions and RE processing engines

-Suffix Trees (very brief intro)

 

 

BIOLOGICAL APPLICATION OF STRING ALGORITHMS:

ALIGNMENT-SIMILARITY AND HOMOLOGY SEARCHING

 

6. Sequence Alignment SIMON

   * Rationale for aligning

   * Dynamic Programming

          Needleman Wuncsh and Smith Waterman algorithms –dynamic

          programming in action

   * Sequence alignment

        o Global alignment

        o Local alignment and gaps

 

  Multiple sequence alignment

   * Sum of pairs scoring scheme

   * Heuristics- Star alignment

 

Oct 1

7. BIOLOGY REVISIT  SCHILLER

Cloning

   * Molecular Cloning-uses

   * Practicalities of cloning DNA and cDNA

        o Libraries

        o Amplifying DNA- the polymerase chain reaction

        o Electronic Cloning

   * Subcloning  MOSAVI

        o Vectors

        o Enzymes

        o Purification

        o DNA sequencing

 

 8. Gene Expression

        * Promotors, Regulation

  * Mutations

        * Low Complexity/Repetitive Sequences

 

Oct 8

9. Homology  SIMON

   *

   * Introduction to substitution matrices

 

 Annotated sequences

   * Why annotate and how?

   * Survey of  ‘the big’ annotated databases

   * Searching the big databases-Introduction to BLAST and FastA

 

10.BLAST  SIMON

   * What BLAST does

   * How it works

   * Substitution matrices- continued

        o Point Accepted Mutations

        o Blocks and Prosite

·         Extreme Value Distribution and BLAST statistics

·          

Oct 15

11. FASTA-how it works and why use it SIMON

 

12 Pattern Discovery SIMON

      Suffix Trees: Some thoughts about a new searching tool 

      MEME

 

Oct 22

 

 13. PHYLOGENY  SIMON

   * About trees

   * Character Based Phylogenies-Parsimony, Fitch's algorithm

     Distance Based Phylogenies- Unweighted Pair Group Method with

     Arithmetic Mean

 

MAPPING GENOMES – SEQUENCING DNA

 

14. Genetic Maps

   * Physical Maps SIMON

        o Restriction mapping:  an NP complete algorithm

        o Hybridization mapping.

 

Oct 29

15.  Genetic Maps continued -MOSAVI

   * DNA Sequencing

   * Fragment Assembly-Shotgun sequencing

   * The Human Genome Project

 

DETECTING GENES IN SEQUENCES OF DNA

 

16. Gene finding    SIMON

   * Prokaryotes: open reading frames

   * ORF finding-problems in Eukaryotes

        o automata (HMMs) vs dynamic programming

    * Markov chains for sequence recognition, intro to Hidden Markov Model

        o HMMs for splice and gene finding

 

Nov 5

17. The Hidden Markov Model for motif recognition SIMON

   * Finite state nondeterministic automata

        o The Viterbi algorithm

        o EM-the Baum Welch algorithm

 

PROTEINS

 

18 Machine learning – overview KJELL

o        Representation of Data

o        Decision Tree as an example

Detection vs discovery

        o Supervised vs unsupervised machine learning

 

Nov 12

17. BIOLOGY REVISIT Protein structure MOSAVI

   * Review of primary, secondary, tertiary and quaternary structure

   * Sequence vis à vis structure

        o Homology

        o Analogy

   * Protein folding –overview MOSAVI,SIMON

·         The importance of a motif in structure prediction

·         Algorithms for motif recognition-Hidden Markov models, joint

           probability models, decision trees

·         Schemes based on description length/ algorithmic discovery

 

18. Classification of Protein Folds   KING

   * types of protein folds and protein domains

   * protein fold databases (SCOP, FSSP, CATH, PDB)

   * searching structural databases for 3D homologs (DALI)

 

Nov 19

19. Principle Component Analysis  JIN

 

BIOINFORMATICS AND LARGE DATA ARRAYS OF BIOLOGICAL INFORMATION

 

20. Analysis of Gene Expression - Gene Arrays   ROWE

·         BIOLOGY REVISIT cDNA, ESTs and the human genome

·         commercial chips(Affymetrix et al),  home made chips


·          

Nov 28

21. Clustering and Pattern Discovery SIMON

      Agglomeration vs Division

·         Canonical Discriminant Analysis

·         Other Pattern Analysis schemes

o        K-means (k-medioids)

 

22. Pattern Discovery in clusters  LAZAR

·         Neural Nets

·         Self-organizing maps

·         Genetic programming for pattern discovery SIMON


Dec 3

23. Bioinformatics and the classification of disease SIMON

·         Pattern Recognition: the leukaemia model

 

 24. Bioinformatics and drug discovery -MOSAVI

·         Drug Discovery: Docking Models

 

Dec 10 Oral Presentation of Final Projects

 

   

OPTIONAL BIOLOGY MINICOURSE

Mon Sept 10, 8:30-9:30pm

Part 1: Proteins

·         Chemical Bonds

·         Proteins

          Amino acids: structure and properties.   Protein primary

          structure, secondary structure, secondary superstructure,

          folding,  tertiary structure,  quaternary structure

 

Mon Sept 17, 6:30-7:30 pm

Part 2: Classical and molecular theories of inheritance

·         Chromosomes and classical inheritance

·         Mitosis and Meiosis

·         DNA and RNA structure

 

The Central Dogma of Molecular Genetics

·         Replication

·         Transcription - TATA Box

·         Ribosomes, Splicing

·         Translation

·         The genetic code

 

 

 

 

 

Required Texts:

Primer on Molecular Genetics (Download form Dept of Energy)

Introduction to Computational Molecular Biology,  J.Setubal and J.Meidanis, PWS Publishing Company, ISBN: 0534952623

Data Mining      Practical Machine Learning Tools and Techniques with JAVA Implementation. IH Witten and E Frank, Morgan Kaufmann Publishers, ISBN 1-55860-552-5