BIOLOGICAL DATABASES

Introduction to using databases


1  Database structure and organization
Databases are collections of data stored on a computer, with several layers of abstraction built on top of them. Each layer makes the data easier to organize and access by separating the request for specific data from the mechanism that retrieves them.

Different databases organize data in different ways. The most common approach is the relational one, used by RDBMSs (Relational Database Management Systems). The best-known systems in use today are predominantly relational (e.g. Oracle, Sybase).
Another popular approach is the object-oriented one (OODBMS), in which the entire content of the database is handled as an object of a specific class, and the class defines the rules for manipulating the data it contains. There are also simplistic database packages that are really para-databases, that is, more or less sophisticated file management systems.
All database systems provide interfaces (APIs, Application Programming Interfaces) for accessing the data and, where needed, modifying them. Data are manipulated through a query language that supports essentially four main operations: retrieving, storing, updating, and deleting data.

Flat-file databases. The simplest type of database is the flat-file database, made up of ASCII text files in a standard format that a program scans to find information. The format usually consists of a set of fields, each holding a specific category of information, either delimited by special characters or of fixed, preassigned length. The main advantage of flat-file databases is that they are simple to manage. This is offset by their inability to handle concurrent access and by the lack of data indexing, which means that a query can only be answered by scanning the file sequentially.
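
A minimal sketch of a flat-file lookup follows. It assumes a hypothetical pipe-delimited layout with three fields (accession, organism, description); the records and field names are invented for illustration and do not come from any real database. Because there is no index, every query reads the file from the first line to the last.

```python
import io

# Hypothetical flat file: one record per line, fields separated by "|".
FLAT_FILE = """\
X00001|Homo sapiens|example record one
X00002|Mus musculus|example record two
X00003|Homo sapiens|example record three
"""

# Assumed field order for this invented format.
FIELDS = ("accession", "organism", "description")


def scan(handle, field, value):
    """Read every line in order and return the records whose field equals value."""
    hits = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        record = dict(zip(FIELDS, line.split("|")))
        if record[field] == value:
            hits.append(record)
    return hits


human = scan(io.StringIO(FLAT_FILE), "organism", "Homo sapiens")
print([r["accession"] for r in human])
```

The cost of each query grows with the size of the file, which is why larger collections are usually indexed or moved into a database management system.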

Relational databases. The SQL language (Structured Query Language) was designed to manipulate databases built on the relational model (Codd, IBM, 1970). The user sees a relational database as a set of tables, where a table is an unordered set of rows. Each row has a fixed number of fields (the columns of the table), and each field can store a predefined type of data (such as numbers or strings). Related information can be stored in the same place or in separate places linked to the main one. This process of rationalizing the tables (data normalization) avoids duplicated data and so reduces redundancy.
Data can be: numeric, character (strings of letters and digits), date (date, or date plus time), binary (images, audio, ...), or NULL (no value).
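
As an illustration of these ideas, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The two tables, their columns, and the accession values are invented for the example. The organism name is stored once and referenced by key (a small case of normalization), and the four basic operations (store, update, delete, retrieve) are shown in turn.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Two related tables: the organism name is stored once and referenced by id.
conn.executescript("""
CREATE TABLE organism (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE sequence (
    accession   TEXT PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(id),
    length      INTEGER,
    note        TEXT
);
""")

# Store (None is written as NULL)
conn.execute("INSERT INTO organism (id, name) VALUES (1, 'Homo sapiens')")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?, ?)",
    [("X00001", 1, 1200, None), ("X00002", 1, 850, "partial")],
)

# Update
conn.execute("UPDATE sequence SET length = 900 WHERE accession = 'X00002'")

# Delete
conn.execute("DELETE FROM sequence WHERE accession = 'X00001'")

# Retrieve, joining the two tables through the shared key
rows = conn.execute(
    "SELECT s.accession, o.name, s.length "
    "FROM sequence AS s JOIN organism AS o ON o.id = s.organism_id"
).fetchall()
print(rows)
```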

 
2  Composite databases and information retrieval

SRS - Sequence Retrieval System
SRS was developed to make it possible to query several databases hosted at the same site, even when they do not share a common format. It is a network browser for molecular biology databases, developed within the European Molecular Biology network. SRS can index any flat-file database with respect to any other. The resulting indices can be searched quickly, and the user can retrieve entries from all the interconnected sources. The system is available and can be adapted to the characteristics of each set of databases.
Typically, SRS makes it possible to link data on nucleic acids, ESTs, protein sequences, sequence patterns, structures, and bibliographic records, without requiring the user to know the structure of the data or the languages used.
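
The general idea of indexing a flat file can be sketched as follows. This is not how SRS itself is implemented; it is a toy example under stated assumptions: entries start with an ID line and end with "//", and the XR line code used for cross-references is hypothetical. The index maps each entry ID to its byte offset, so an entry can be fetched directly instead of scanning the whole file.

```python
import io

# Invented flat file: entries start with an "ID" line and end with "//".
# The "XR" cross-reference code is hypothetical.
DATA = (
    b"ID   ENTRY_A\nXR   ENTRY_B\n//\n"
    b"ID   ENTRY_B\nXR   ENTRY_C\n//\n"
    b"ID   ENTRY_C\n//\n"
)


def build_index(handle):
    """Map each entry ID to the byte offset at which the entry starts."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"ID"):
            index[line.split()[1].decode()] = offset
    return index


def fetch(handle, index, entry_id):
    """Jump straight to an entry and read its lines up to the terminator."""
    handle.seek(index[entry_id])
    lines = []
    for line in handle:
        if line.startswith(b"//"):
            break
        lines.append(line.decode().rstrip("\n"))
    return lines


handle = io.BytesIO(DATA)
index = build_index(handle)
entry = fetch(handle, index, "ENTRY_B")
# Follow the cross-references of the fetched entry.
linked = [line.split()[1] for line in entry if line.startswith("XR")]
print(index, entry, linked)
```

Once such indices exist for several files, following a cross-reference from one entry to another becomes a dictionary lookup plus a seek.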

SRS Documentation
EBI SRS 6
CNR Bari SRS


ENTREZ
The NCBI (National Center for Biotechnology Information) plays a key role in maintaining databases of biologically relevant information and in distributing analysis and biocomputing tools. The NCBI develops new information technologies to support the study of genetic and molecular processes of biomedical importance.
Outcomes of this research include methods for computer-based information processing and systems that make it easier for users to access databases and software. Since 1992, the NCBI has maintained GenBank, the NIH DNA sequence database, which exchanges data with EMBL and DDBJ.
ENTREZ was developed to give access to molecular biology data and bibliographic citations. Perhaps a little less flexible than SRS, it nevertheless makes the most of the concept of "neighbouring", making it possible to link different objects from different databases whether or not they are directly "cross-referenced".
Typically, ENTREZ gives access to databases of nucleotide sequences, protein sequences, chromosome and genome maps, 3D structures, and bibliography (PubMed).
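
Entrez can also be queried from a program. The sketch below assumes NCBI's E-utilities web interface, which is not described in these notes: it only builds the URL of an ESearch request for a given database and search term, and deliberately makes no network call. The helper name esearch_url and the example search term are illustrative.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption of this sketch).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build the URL of an ESearch request against one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


url = esearch_url("pubmed", "drosophila genome", retmax=5)
print(url)
```

Fetching that URL (for example with urllib.request) would return the identifiers of the matching records. Check the current NCBI usage guidelines before sending automated requests.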

NCBI Entrez
Entrez HELP
PubMed Overview
PubMed Tutorial


3  Primary databases

Nucleotide and protein sequence databases

GenBank
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research 2000 Jan 1;28(1):15-8). There are approximately 13,543,000,000 bases in 12,814,000 sequence records as of August 2001 (see GenBank growth statistics). As an example, you may view the record for a Saccharomyces cerevisiae gene. The complete release notes for the current version of GenBank are available. A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.
example of GenBank entry

Nucleotide Sequence Database
The Nucleotide database is a composite database which contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite international collaboration of sequence databases. EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK; DDBJ is the DNA Data Bank of Japan in Mishima, Japan.
Sequence data is also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM. Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (US PTO), and via the collaborating international databases from other international patent offices.

SWISS-PROT
SWISS-PROT was founded in 1986 in Geneva, moved to EMBL's UK outstation in 1994, and then in 1998 to the Swiss Institute of Bioinformatics (SIB). It is now maintained as a collaboration.
The SWISS-PROT database consists of sequence entries. It contains high-quality annotation, is non-redundant and cross-referenced to many other databases. SWISS-PROT is accompanied by TrEMBL, a computer-annotated supplement to SWISS-PROT.
TrEMBL (Translated EMBL) was created in 1996 and contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database which are not yet integrated into SWISS-PROT. It uses the SWISS-PROT format. TrEMBL is split into two main sections: SP-TrEMBL and REM-TrEMBL.
SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries which should eventually be incorporated into SWISS-PROT. It can be considered a preliminary section of SWISS-PROT, as all SP-TrEMBL entries have been assigned SWISS-PROT accession numbers. REM-TrEMBL (REMaining TrEMBL) contains the entries that are not intended for inclusion in SWISS-PROT (immunoglobulins, T-cell receptors, synthetic or patented sequences). REM-TrEMBL entries have no accession numbers.
There is a weekly update to TrEMBL called TrEMBLnew. TrEMBLnew is produced from nucleotide sequences deposited in the EMBL nucleotide sequence database. At each TrEMBL release the annotation of TrEMBLnew entries is upgraded, redundant entries are merged, and the remainder are then added to TrEMBL.
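
The translation step that TrEMBL depends on can be illustrated with a short sketch. It assumes the standard genetic code and a made-up coding sequence. It is not the pipeline used to build TrEMBL, which also has to locate the CDS features in each EMBL entry and handle alternative genetic codes.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Invented coding sequence: ATG GCC AAA TGG TAA
print(translate("ATGGCCAAATGGTAA"))
```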

The structure of a SWISS-PROT entry reflects the fact that the database contains protein sequences with high-level annotation about the structure, function, and post-translational modifications of proteins.
Each line of the entry is flagged by a two-letter code, which helps to present the content in a structured way. The entry begins with an ID line and ends with a // terminator. ID codes can change between database releases, so an accession number is provided as a unique identifier of the entry that remains static between releases.
example of SWISS-PROT entry
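
The line-code layout described above lends itself to a very simple parser. The sketch below is only an illustration: the two entries are invented and heavily simplified (real ID lines carry more fields, and real entries include many more line types and the sequence itself). It splits the text on the // terminator, groups values by their two-letter code, and uses the accession number as the stable key.

```python
from collections import defaultdict

# Invented, simplified entries in the two-letter line-code layout.
SAMPLE = """\
ID   EXAMPLE_HUMAN
AC   X99999;
DE   Hypothetical example protein.
OS   Homo sapiens (Human).
//
ID   EXAMPLE_MOUSE
AC   X99998;
DE   Hypothetical example protein.
OS   Mus musculus (Mouse).
//
"""


def parse_entries(text):
    """Yield one dict per entry, mapping each line code to its list of values."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            # Terminator: emit the entry collected so far and start a new one.
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        code, _, value = line.partition("   ")
        entry[code].append(value.strip())


entries = list(parse_entries(SAMPLE))
# Key the entries on the accession number rather than on the ID code.
by_accession = {e["AC"][0].rstrip(";"): e["ID"][0] for e in entries}
print(by_accession)
```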


LEVELS OF PROTEIN SEQUENCE AND STRUCTURAL ORGANISATION

        primary           sequence          --->  primary database
           |                  |
        secondary           motif           --->  secondary database
           |                 / \
        tertiary        domain   module     --->  secondary database


NCBI Protein Database
The Protein database is a composite database which contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL, and DDBJ, as well as protein sequences submitted to PIR, SWISS-PROT, PRF, and the Protein Data Bank (PDB) (sequences from solved structures).

Protein structure databases

NRL_3D
NRL_3D is a sequence-structure database derived from the three-dimensional structures of proteins deposited with the Brookhaven National Laboratory's Protein Data Bank.
sample entry
The Web version derived from NRL_3D has hot links among its own entries and to the following databases:
PDB - The Protein Data Bank (3D structures); EC-Enzyme - The EC Enzyme Classification Database; Refbase - A Protein Sequence Citation Database, two of which have links among themselves and to other databases as well.

Structure Database (three-dimensional macromolecular structure)
The Structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. The NCBI 3D structure viewer, Cn3D, provides interactive visualization of molecular structures from Entrez.
 

4  Secondary databases

In addition to numerous primary and composite resources, there are many secondary databases, containing the fruits of analyses of the sequences in the primary sources.

UniGene
UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
In addition to sequences of well-characterized genes, hundreds of thousands of novel expressed sequence tag (EST) sequences have been included. Consequently, the collection may be of use to the community as a resource for gene discovery. UniGene has also been used by experimentalists to select reagents for gene mapping projects and large-scale expression analysis.
However, it should be noted that the procedures for automated sequence clustering are still under development and the results may change from time to time as improvements are made. It should also be noted that no attempt has been made to produce contigs or consensus sequences. There are several reasons why the sequences of a set may not actually form a single contig. For example, all of the splicing variants for a gene are put into the same set. Moreover, EST-containing sets often contain 5' and 3' reads from the same cDNA clone, but these sequences do not always overlap. Currently, sequences from the following animals have been processed: human, rat, mouse, cow, zebrafish, and clawed frog. The plants processed are wheat, rice, barley, maize, and cress. These species were chosen because they have the greatest amounts of EST data available and represent a variety of species.
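
A toy example helps to show why a cluster need not form a single contig. The sketch below is not the UniGene procedure, which these notes do not specify. It simply groups sequence identifiers that are connected by any chain of links (single-linkage grouping with a union-find structure). The identifiers and links are invented: one link stands for a sequence overlap, the other for two reads that come from the same cDNA clone but do not overlap.

```python
def cluster(sequences, links):
    """Group sequence IDs so that any two linked IDs end up in the same cluster."""
    parent = {s: s for s in sequences}

    def find(x):
        # Walk up to the representative of x, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for s in sequences:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


sequences = ["mRNA_1", "EST_5p", "EST_3p", "EST_other"]
links = [
    ("mRNA_1", "EST_5p"),  # the two sequences overlap
    ("EST_5p", "EST_3p"),  # same cDNA clone, but the two reads do not overlap
]
clusters = cluster(sequences, links)
print(clusters)
```

Here EST_3p joins the cluster only through its clone-mate, so the three members share a gene without necessarily assembling into one contiguous sequence.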

GEO
In order to support the public use and dissemination of gene expression data, NCBI has launched the Gene Expression Omnibus. GEO is an effort to build a gene expression data repository and online resource for the retrieval of gene expression data from any organism or artificial source. Many types of gene expression data, from platforms such as spotted microarrays (microarray), high-density oligonucleotide arrays (HDA), hybridization filters (filter), and serial analysis of gene expression (SAGE), will be accepted, accessioned, and archived as public data sets. A series of precomputed definitions and descriptions of the data, as well as online tools for the interactive retrieval and analysis of this expression data, will follow shortly thereafter.

NCBI dbSNP (Database of Single Nucleotide Polymorphisms)
(As of Sep 13, 2001, dbSNP has submissions for 3,053,511 SNPs; human: 3,052,574)
SNP stands for "single nucleotide polymorphism". SNPs are the most common genetic variations and occur once every 100 to 300 bases. A key aspect of research in genetics is the association of sequence variation with heritable phenotypes. It is expected that SNPs will accelerate the identification of disease genes by allowing researchers to look for associations between a disease and specific differences (SNPs) in a population. This differs from the more typical approach of pedigree analysis, which tracks transmission of a disease through a family. It is much easier to obtain DNA samples from a random set of individuals in a population than it is to obtain them from every member of a family over several generations. Once discovered, these polymorphisms can be used by additional laboratories, using the sequence information around the polymorphism and the specific experimental conditions.
The database has been designed to accept several classes of genetic variation: (1) SNPs; (2) microsatellite repeats; (3) small insertion/deletion polymorphisms.
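
At its simplest, a candidate SNP is a position at which two aligned sequences carry different bases. The sketch below shows only that comparison. The sequences are invented, both are assumed to be already aligned and of equal length, and a real polymorphism would still need to be confirmed across a population rather than in a single pair.

```python
def find_snps(reference, sample):
    """Return (position, reference base, sample base) for each single-base difference.

    Positions are 1-based; both sequences must be aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample), start=1)
        if r != s
    ]


# Invented aligned sequences differing at positions 5 and 10.
snps = find_snps("ACGTACGTAC", "ACGTTCGTAA")
print(snps)
```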

5  Genomic sequence databases

NCBI Entrez GENOMES
The Genomes database provides views for a variety of genomes, complete chromosomes, contiged sequence maps, and integrated genetic and physical maps.
The whole genomes of over 800 organisms can be found in Entrez Genomes. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses and organelles.

Drosophila melanogaster genome
The assembled and annotated genome sequence of the euchromatic arms of the five Drosophila melanogaster (fruit fly) chromosomes is now available in GenBank. The sequence, determined in a collaboration between Celera and the Berkeley Drosophila Genome Project, is described in the March 24, 2000 issue of Science. The ~137 Mb of sequence, most of which is found on chromosomes 1 (also known as X), 2, and 3, contains ~13,500 annotated genes. ~2470 of these genes correlate with a known gene described in FlyBase. FlyBase provides sequence for an additional ~500 genes that are not annotated on the Celera/BDGP sequence.

From early observations of the banding patterns of its polytene chromosomes to current work on mRNA and protein gradients in the developing embryo, Drosophila melanogaster has been studied in biology labs for over eighty years. Many of the genes that define the spatial pattern of cell types and body parts have now been identified, along with the regulatory pathways in which they operate. As a number of these genes have counterparts in higher eukaryotes, the study of the Drosophila developmental program provides insight into human development as well. Drosophila is the second multicellular organism to be sequenced, after the nematode Caenorhabditis elegans.

FlyBase
FlyBase is a database of genetic and molecular data for Drosophila. FlyBase includes data on all species from the family Drosophilidae; the primary species represented is Drosophila melanogaster.

GadFly: Genome Annotation Database of Drosophila



Page by Stefania Bortoluzzi, last update October 5, 2001