Introduction With the growing volume of biological sequence data, genome analysis needs a system able to store and retrieve tens of gigabytes of data, a mature database management system, and good visualization tools
Introduction-PEDANT Differences among existing genome analysis programs - protein-oriented vs. DNA-oriented analysis
- interactive work vs. command-line operation
- bioinformatics methods applied
- user interface
- convenience features: project management and data editors
- fidelity of the results produced
Benchmarks may vary depending on the chosen balance between sensitivity and selectivity of the analyses
PEDANT (Protein Extraction, Description, and ANalysis Tool) became available in mid-1997 (using FASTA for similarity searches) - a workhorse for general bioinformatics research
- a common framework for a number of genome analysis projects
- a complete database of automatically annotated genomes
- a tool for routine analysis of large amounts of genomic contigs and ESTs
System Architecture Overview - database module: storing, modifying and accessing data
- processing module: bioinformatics computations
- user interface: web based communication
System Architecture-Cont. Data access - primary tables: store raw data (e.g. DNA and protein sequences) and raw program results (e.g. BLAST output)
- secondary tables: store parsed program results
- simplified schema
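The split between primary and secondary tables can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the MySQL backend; the table names, column names, and the tab-separated "raw output" format are hypothetical and chosen only for illustration, not taken from the PEDANT schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with one primary and one secondary table."""
    conn = sqlite3.connect(":memory:")
    # Primary table: raw, unparsed program output per sequence (hypothetical layout).
    conn.execute("CREATE TABLE blast_raw (seq_id TEXT PRIMARY KEY, output TEXT)")
    # Secondary table: one row per parsed hit, ready for querying.
    conn.execute("CREATE TABLE blast_hits (seq_id TEXT, hit_id TEXT, evalue REAL)")
    return conn

def parse_hits(raw_output):
    """Parse a made-up 'hit_id<TAB>evalue' format, one hit per line."""
    hits = []
    for line in raw_output.splitlines():
        if not line.strip():
            continue
        hit_id, evalue = line.split("\t")
        hits.append((hit_id, float(evalue)))
    return hits

def store_result(conn, seq_id, raw_output):
    """Store the raw output, then fill the secondary table from its parsed form."""
    conn.execute("INSERT INTO blast_raw VALUES (?, ?)", (seq_id, raw_output))
    conn.executemany(
        "INSERT INTO blast_hits VALUES (?, ?, ?)",
        [(seq_id, hit_id, evalue) for hit_id, evalue in parse_hits(raw_output)],
    )

def significant_hits(conn, seq_id, max_evalue=1e-5):
    """Query the secondary table for hits at or below an E-value cutoff."""
    rows = conn.execute(
        "SELECT hit_id FROM blast_hits WHERE seq_id = ? AND evalue <= ? ORDER BY evalue",
        (seq_id, max_evalue),
    )
    return [row[0] for row in rows]

conn = build_demo_db()
store_result(conn, "orf1", "P12345\t1e-30\nQ99999\t0.5\n")
print(significant_hits(conn, "orf1"))
```

Keeping the raw output next to the parsed rows means the secondary table can be rebuilt with a different parser or cutoff without rerunning the search.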
Operation in command line mode - applying bioinformatics methods to sequences
- parsing data tables
- querying the resulting databases
Web interface - No static HTML pages required
- DNA and protein viewers access the SQL tables directly
Implementation and system requirements - Perl 5, plus C++ for the graphical viewers
Performance
Schema
Bioinformatics Method Overview of the PEDANT processing pipeline - identification of coding regions and analysis of various genetic elements
- homology search
- detection of protein motifs, prediction of secondary structure and other protein features, and sensitive fold recognition
- automatic assignment of proteins to predefined functional categories
Prediction of genes and other genetic elements - Table 1
- choose one of 15 genetic codes
- http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
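Why the choice of genetic code matters for gene prediction can be shown with a small translation sketch. It covers only two of the NCBI translation tables (table 1, the standard code, and table 4, in which TGA encodes tryptophan instead of a stop); the function names are illustrative and this is not the PEDANT implementation.

```python
BASES = "TCAG"
# Amino acids for the 64 codons in TTT, TTC, TTA, TTG, TCT, ... order (NCBI compact notation).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"    # table 1
MYCOPLASMA = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"  # table 4

def codon_table(amino_acids):
    """Map each of the 64 codons to its amino acid ('*' marks a stop)."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))

GENETIC_CODES = {1: codon_table(STANDARD), 4: codon_table(MYCOPLASMA)}

def translate(dna, code=1):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    table = GENETIC_CODES[code]
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(table.get(dna[i:i + 3], "X"))
    return "".join(protein)

print(translate("ATGTGATTT", code=1))  # M*F : TGA is read as a stop
print(translate("ATGTGATTT", code=4))  # MWF : TGA is read as tryptophan
```

With the wrong table, an open reading frame in such a genome would appear to end at every TGA, so predicted genes would be truncated.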
Functional and structural categories - similarity search: PSI-BLAST (Position-Specific Iterated BLAST)
- special datasets: MIPS, COG, PROSITE, PFAM and BLOCKS
- significant matches to PIR entries: annotations, keywords, enzyme classification and superfamily information
- for proteins with a significant relationship to PDB entries, secondary structure information: STRIDE (upper case), PREDATOR (lower case)
- low-complexity regions, membrane regions, coiled coils and signal peptides
- comparison against SCOP with IMPALA
Table 1
Bioinformatics Method-Cont. Yeast biological role categories - first system of biological role categories: E. coli
- MIPS: advanced hierarchical functional catalogue (yeast)
- Multidimensionality: protein:gene is many-to-many (M:M)
- automated assignment to MIPS categories is a first approximation, to be refined by manual annotation
- Distribution of ORFs
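A hierarchical catalogue with many-to-many assignments can be sketched as follows. The dotted category codes (such as "01.05.01"), the ORF names, and the reading of multidimensionality as "one ORF may sit in several categories" are assumptions made for illustration; only the hierarchical nature of the catalogue comes from the slides.

```python
from collections import defaultdict

def ancestors(code):
    """Return the code and all of its parent categories, most specific first."""
    parts = code.split(".")
    return [".".join(parts[:n]) for n in range(len(parts), 0, -1)]

def orfs_per_main_category(assignments):
    """Count distinct ORFs under each top-level category.

    `assignments` maps an ORF name to a list of category codes, so a single
    ORF can contribute to several main categories (many-to-many).
    """
    members = defaultdict(set)
    for orf, codes in assignments.items():
        for code in codes:
            members[code.split(".")[0]].add(orf)
    return {main: len(orfs) for main, orfs in sorted(members.items())}

# Hypothetical assignments, for illustration only.
assignments = {
    "orfA": ["01.05.01", "30.01"],
    "orfB": ["01.03"],
    "orfC": ["30.01", "30.05"],
}
print(ancestors("01.05.01"))                # ['01.05.01', '01.05', '01']
print(orfs_per_main_category(assignments))  # {'01': 2, '30': 2}
```

Because an ORF can appear under more than one main category, the per-category counts in a distribution of ORFs need not sum to the number of ORFs.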
Visualization - an integrated, hypertext-linked protein report with calculated parameters and sequences, serving as a reference for further manual annotation
- Protein report page
Distribution of ORFs
Protein report page
Bioinformatics Method-Cont.2 Automatic versus manual annotation - Problem of error propagation
- erroneous annotation caused by human error and spurious similarity hits
- mitigation with filtering algorithms and domain structure analysis?
- quality improvement through manual review by human experts!
- Manual annotation
- Catalogue independent
- Flexibility: first place a protein in a higher-level category, then move it to finer categories in a later step
- 528 categories: 20 main categories and 6 levels
- confidence levels: “reject”, “low”, “medium”, “high” and default is “auto”
Data release management - data from a new release can be intelligently merged with the existing data pool
- transfer of manual annotation between subsequent data releases
- “manual” field: “yes” or “no”, initially “no” by default
- example: a PFAM domain identified in an ORF of a new release is “manual: no” and “conf: auto”
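The transfer of manual annotation between releases can be sketched as a merge that lets curated entries take precedence over fresh automatic ones. The record layout, the key of (ORF, feature), and the policy of dropping entries absent from the new release are assumptions for illustration; only the “manual” and “conf” fields and their defaults come from the slides.

```python
def merge_release(existing, new_release):
    """Merge automatic annotations of a new release into the existing pool.

    `existing` maps (orf_id, feature) -> {'value', 'manual', 'conf'}.
    `new_release` maps (orf_id, feature) -> {'value'}.
    Assumption: entries missing from the new release are dropped.
    """
    merged = {}
    for key, new_rec in new_release.items():
        old_rec = existing.get(key)
        if old_rec is not None and old_rec["manual"] == "yes":
            # A curator already reviewed this entry: keep it unchanged.
            merged[key] = dict(old_rec)
        else:
            # Automatic result: defaults are manual "no" and conf "auto".
            merged[key] = {"value": new_rec["value"], "manual": "no", "conf": "auto"}
    return merged

# Hypothetical records, for illustration only.
existing = {
    ("orf1", "PFAM"): {"value": "PF00001", "manual": "yes", "conf": "high"},
    ("orf2", "PFAM"): {"value": "PF00002", "manual": "no", "conf": "auto"},
}
new_release = {
    ("orf1", "PFAM"): {"value": "PF00009"},
    ("orf2", "PFAM"): {"value": "PF00003"},
    ("orf3", "PFAM"): {"value": "PF00004"},
}
merged = merge_release(existing, new_release)
for key in sorted(merged):
    print(key, merged[key])
```

In this sketch the manually confirmed entry for orf1 survives the update, while orf2 and orf3 receive the new automatic values with “manual: no” and “conf: auto”.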
Manual annotation transfer
The PEDANT Genome Database Annotation of publicly available completely sequenced and unfinished genomes - Genomes annotated by MIPS
- Completely sequenced and published genomic sequences
- Unfinished and/or unpublished genomic sequences
- gene prediction by ORPHEUS, allowing large overlaps between ORFs
PEDANT as a structural genomics resource-0.3M proteins - class-based approach, cost-saving
- (i) non-redundant protein sequence database
- (ii) PSI-BLAST search with SCOP sequences against (i), saving the resulting profiles
- (iii) construction of a SCOP profile library using IMPALA
- (iv) IMPALA search with each genomic sequence against the SCOP library
- same procedure for nr PDB sequence database
- performance of IMPALA
Cross-genome comparison - treat each genome as an individual contig: create cross-genome datasets without any modification
- 44 genomes
Performance of IMPALA
Applications - 3744 predicted protein-coding genes
- roughly 30% are known proteins or strongly similar to known proteins
- multicellular organisms have a higher proportion of all-alpha and a lower proportion of mixed alpha/beta structural domains than unicellular species
Assembled human transcripts - human UniGene data subjected to PEDANT analysis, covering over 75000 contigs
- the resulting MySQL database is close to 8 GB
- acceptable query times show the suitability of PEDANT for large-scale EST sequencing projects
Analysis of the GroEL substrates - GroEL: a common E. coli chaperonin
- structural motif common to 52 substrates relying on GroEL for folding in vivo: two or more alpha/beta domains involving buried beta-sheets with large hydrophobic surfaces, which aggregate easily
Classification of predicted genes
Summary and Outlook PEDANT is a useful tool for genome annotation and bioinformatics research
It supports automated and manual assignment of gene products to functional and structural categories, with extensive hyperlinked protein reports and advanced viewers
Outlook - better decision rules need to be employed
- manual annotation of predicted genetic elements (e.g. LTRs)
- supporting Oracle RDBMS
- automatic gene prediction pipeline for higher eukaryotes
- interactive capabilities