BIOLOGICAL DATABASES

Sequence analysis

1 Using the BCM sequence utilities for "simple" operations on sequences

FASTA format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description
line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that
all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP

Blank lines are not allowed in the middle of FASTA input.
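
The format rules above can be sketched as a small parser. This is a minimal illustration, not the code used by any BCM or NCBI tool; the function name `parse_fasta` and the toy records are hypothetical, and this sketch silently skips blank lines instead of rejecting them.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif line and description is not None:
            # Lower-case letters are accepted and mapped to upper case.
            chunks.append(line.upper())
    if description is not None:
        yield description, "".join(chunks)


toy_input = """>seq1 toy nucleotide record
acgt
ACGT
>seq2
GG-N
"""
records = list(parse_fasta(toy_input))
print(records)
```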

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these
exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to
represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below).
Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by
appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). The nucleic
acid codes supported are:

     A --> adenosine           M --> A C (amino)
     C --> cytidine            S --> G C (strong)
     G --> guanine             W --> A T (weak)
     T --> thymidine           B --> G T C (not A)
     U --> uridine             D --> G A T (not C)
     R --> G A (purine)        H --> A C T (not G)
     Y --> T C (pyrimidine)    V --> G C A (not T)
     K --> G T (keto)          N --> A G C T (any)
                               -  gap of indeterminate length

For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes
are:

     A  alanine                         P  proline
     B  aspartate or asparagine         Q  glutamine
     C  cysteine                        R  arginine
     D  aspartate                       S  serine
     E  glutamate                       T  threonine
     F  phenylalanine                   U  selenocysteine
     G  glycine                         V  valine
     H  histidine                       W  tryptophan
     I  isoleucine                      Y  tyrosine
     K  lysine                          Z  glutamate or glutamine
     L  leucine                         X  any
     M  methionine                      *  translation stop
     N  asparagine                      -  gap of indeterminate length
 

ReadSeq - converts nucleic acid/protein sequences to FASTA format (BCM)


Reverse Complement - reverse complements a nucleic acid sequence (BCM)
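
What a reverse-complement operation does can be sketched in a few lines. This is an illustration only, not the BCM implementation; the name `reverse_complement` is hypothetical, and the handling of the ambiguity codes follows the nucleic acid code table given earlier (for example, R, a purine, pairs with Y, a pyrimidine).

```python
# Complement of each nucleic acid code, including the IUPAC ambiguity codes.
COMPLEMENT = str.maketrans(
    "ACGTURYKMSWBDHVN-",
    "TGCAAYRMKSWVHDBN-",
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (U is read as T)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```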



TRANSLATION

The STANDARD Genetic Code

First                    Second Letter                     Third
Letter      U           C           A           G          Letter
  U      UUU Phe     UCU Ser     UAU Tyr     UGU Cys         U
         UUC Phe     UCC Ser     UAC Tyr     UGC Cys         C
         UUA Leu     UCA Ser     UAA Stop    UGA Stop        A
         UUG Leu     UCG Ser     UAG Stop    UGG Trp         G
  C      CUU Leu     CCU Pro     CAU His     CGU Arg         U
         CUC Leu     CCC Pro     CAC His     CGC Arg         C
         CUA Leu     CCA Pro     CAA Gln     CGA Arg         A
         CUG Leu     CCG Pro     CAG Gln     CGG Arg         G
  A      AUU Ile     ACU Thr     AAU Asn     AGU Ser         U
         AUC Ile     ACC Thr     AAC Asn     AGC Ser         C
         AUA Ile     ACA Thr     AAA Lys     AGA Arg         A
         AUG Met     ACG Thr     AAG Lys     AGG Arg         G
  G      GUU Val     GCU Ala     GAU Asp     GGU Gly         U
         GUC Val     GCC Ala     GAC Asp     GGC Gly         C
         GUA Val     GCA Ala     GAA Glu     GGA Gly         A
         GUG Val     GCG Ala     GAG Glu     GGG Gly         G

6 Frame Translation - translates a nucleic acid sequence in 6 frames (BCM)
Reads nucleic acid sequences in various formats and performs a six-frame translation of each sequence into protein sequences.
Input data files may contain multiple sequences. The results are returned in Pearson/FASTA format, unless a different format is
specified on the full options page. Each nucleic acid sequence entered returns six protein sequences (one for each reading frame).
Some formats truncate the sequence title, but the output is always in the order: frame +3, frame +2, frame +1, frame -1, frame -2,
and frame -3.
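
A six-frame translation with the standard genetic code can be sketched as follows. This is not the BCM program: the function names are hypothetical, the codon table is written with T instead of U, unknown codons are translated as X, and frame -1 is assumed to start at the first base of the reverse complement (conventions differ between tools).

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    forward = dna.upper().replace("U", "T")
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```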



MASKING

Low complexity regions (LCRs) are regions of biased composition, including homopolymeric runs, short-period repeats, and other overrepresentations of one or a few residues. Masking (filtering) is the removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence analysis (similarity searches, primer design, ...).
RepeatMasker is a program that screens DNA sequences for interspersed repeats known to exist in mammalian genomes as well as for low complexity DNA
sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query
sequence in which all the annotated repeats have been masked (replaced by Ns). On average, over 40% of a human genomic DNA sequence is masked by the program.
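
The idea of masking can be illustrated on the simplest kind of low complexity region, the homopolymeric run. This sketch is not how RepeatMasker works (it relies on libraries of known repeats); the function name and the `min_run` threshold are hypothetical choices made only for the example.

```python
import re


def mask_homopolymers(seq, min_run=6, mask_char="N"):
    """Replace every run of at least min_run identical bases with Ns."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq.upper())


print(mask_homopolymers("ACGT" + "A" * 8 + "CGT"))
```

The masked sequence keeps its original length, so coordinates of the unmasked regions are unchanged.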

RepeatMasker - identifies and masks repeats in DNA sequences (UW)


2 Similarity search

Similarity search. Homology and analogy, orthology and paralogy

Similarity between two sequences can indicate that they are evolutionarily related.
Two sequences can be defined as "homologous" if they arose from the same ancestral sequence, by gene duplication or by speciation. Homology is different from similarity: homology is a qualitative property (it is due to descent from a common ancestor), while similarity is a quantitative property (the similarity between two sequences can be measured after sequence alignment). Hence the expression "percentage of homology" is meaningless!

Two characters (sequences) can be similar without sharing a common ancestor (analogous), because of chance alone or convergent evolution.

Two homologous sequences are orthologous if they arose by a speciation event, and paralogous if they arose by a duplication event (see Figure).
Orthologous   >>>>>>>  Homologous sequences in different species that arose from a common ancestral gene during speciation.
Paralogous  >>>>>>>  Homologous sequences within a single species that arose by gene duplication.
 
 

Genes A1/A2 and B1/B2 are orthologous. Genes A1/B1 and A2/B2 are paralogous.

3 Pairwise sequence alignment

The alignment of two sequences is the result of a process that establishes a correspondence between the residues of the two sequences so as to maximize the similarity between them or, equivalently, to minimize the number of differences and the number of edit operations needed to convert one sequence into the other.
We want to develop an algorithm (a recipe) for determining the similarity between two sequences. One can align sequences manually, inserting gaps in order to increase the score of the alignment (the number of identical residues aligned).
 

seq1  AACCGTTGACTTTGACC
seq2  ACCGTAGACTAATTAACC

AACCGTTGACT..TTGACC
| ||||.||||  ||.|||
A.CCGTAGACTAATTAACC

This is feasible only for very short sequences!
In addition, several alignments can exist that give the same number of matches.
 

AACCGAAGGACTTTAATC
AAGGCCTAACCCCTTTGTCC
AA..CCGAAGGACTTTAATC
||  |..||...||||...|
AAGGCTAAACCCCTTTGTCC

AACCGAAGGACT      TTAATC
|     |||.||      ||..||
A     AGGCCTAACCCCTTTGTC

The quality of an alignment can be measured in terms of the number of gaps introduced and the number of mismatches remaining (edit distance). Different metrics exist, corresponding to the different distance measures used to compute and score alignments.
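
Counting matches, mismatches and gaps in a finished alignment is straightforward. The sketch below applies it to the first hand-made alignment shown earlier (where "." marks a gap); the function name `alignment_stats` is hypothetical.

```python
def alignment_stats(top, bottom, gap_chars=".-"):
    """Count (matches, mismatches, gap positions) in a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a in gap_chars or b in gap_chars:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


stats = alignment_stats("AACCGTTGACT..TTGACC", "A.CCGTAGACTAATTAACC")
print(stats)
```

Here the mismatches plus the gap positions give the number of single-residue edit operations implied by this particular alignment.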

The dotplot matrix is a basic method useful to compare two sequences.


  A|X        X        X
  T|   X           X
  G|            X
  T|   X           X          A T C A C T G T A
  C|      X                   | | | |     | | |
  A|X        X        X       A T C A - - G T A
  C|      X
  T|   X           X
  A|X        X        X
   +-------------------
    A  T  C  A  G  T  A
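
A dotplot of this kind can be computed directly from the two sequences. This is a minimal sketch with hypothetical function names; it marks every pair of identical residues and uses no window or threshold, unlike most real dotplot programs.

```python
def dotplot(seq_x, seq_y):
    """Return a matrix (list of rows) that is True where residues match."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Draw the dotplot as text, seq_y down the side and seq_x along the bottom."""
    lines = []
    for b, row in zip(seq_y, dotplot(seq_x, seq_y)):
        lines.append(b + "|" + "".join("X" if hit else " " for hit in row))
    lines.append(" +" + "-" * len(seq_x))
    lines.append("  " + seq_x)
    return "\n".join(lines)


print(render("ATCAGTA", "ATCACTGTA"))
```

Runs of X along a diagonal correspond to stretches that can be aligned without gaps; a shift from one diagonal to another corresponds to a gap.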

Identity and similarity
Thus far, we have considered only identity between residues. Essentially, we have used a simplified similarity matrix (the unitary scoring matrix), which scores 1 for a match between identical residues and 0 for a mismatch.

Unitary scoring matrix for nucleotides

     A C G T 
   --------- 
 A | 1 0 0 0 
 C | 0 1 0 0 
 G | 0 0 1 0 
 T | 0 0 0 1 

Matrices are needed that weight matches between non identical residues according to biologically meaningful criteria. For example, one can assign different similarity scores to transversions and to transitions.
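
As an illustration of such a weighting, the sketch below builds a nucleotide matrix that scores transitions (purine to purine, or pyrimidine to pyrimidine) higher than transversions. The numeric values 1.0, 0.5 and 0.0 are arbitrary assumptions chosen for the example, not values taken from any published matrix.

```python
PURINES = {"A", "G"}


def nucleotide_score(a, b, match=1.0, transition=0.5, transversion=0.0):
    """Score a pair of bases, distinguishing transitions from transversions."""
    if a == b:
        return match
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion


matrix = {a: {b: nucleotide_score(a, b) for b in "ACGT"} for a in "ACGT"}

print("    A    C    G    T")
for a in "ACGT":
    print(a, " ".join(f"{matrix[a][b]:4.1f}" for b in "ACGT"))
```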
Furthermore, similarity between different amino acids can be scored according to their chemical and physical properties (e.g. glutamic acid is more similar to aspartic acid than to phenylalanine).
Another way is to use the observed frequencies of specific amino acid substitutions to build a similarity matrix. The most popular series of substitution matrices are the Dayhoff Mutation Data (MD) matrices (or PAM matrices) and the BLOSUM matrices.
PAM matrix
(Dayhoff et al. 1978)
The MD score is based on the concept of the Point Accepted Mutation (PAM).
This matrix was compiled by analysing the observed substitutions in a dataset of several groups of homologous proteins; in particular, 1572 substitutions in 71 different groups of closely related proteins (with at least 85% identity) were considered.
This dataset was used because the high similarity of the protein sequences made the alignment trivial and removed the need to correct for multiple hits (successive substitutions at the same site, like A->G->A or A->G->N).
The analysis of these alignments showed that different amino acid substitutions are observed with different frequencies: these are biased toward those substitutions that do not seriously disrupt protein function or, in other words, that are "accepted" by selection (Point Accepted Mutation, PAM).
The observed frequency of each substitution (say A to N) can be used to estimate the probability of the substitution A to N in an alignment of homologous proteins. The probabilities of the different substitutions are stored in the PAM matrix.
After collecting the mutation frequencies corresponding to 1 PAM (the evolutionary distance at which, on average, 1 point mutation has been accepted per 100 residues), and extrapolating the data to an evolutionary distance of 250 PAM, Dayhoff and coworkers produced the PAM 250 matrix.
PAM matrices are most useful for very similar proteins. To obtain good results when analysing distant sequences, other matrices were introduced.

BLOSUM matrices
Henikoff and Henikoff (1992) derived a set of substitution matrices from more than 2000 blocks of multiply aligned sequence segments corresponding to conserved regions of evolutionarily related clusters of sequences (in the BLOCKS database), in order to represent distant relationships more explicitly.
In order to reduce the contribution of amino acid pairs from very closely related proteins, groups of highly similar sequences are treated as a single sequence and the average contribution of each residue position is calculated.
Different BLOSUM matrices result from different clustering thresholds (BLOSUM 62 for 62% identity, BLOSUM 80 for 80% identity, ...).

Because biologically meaningful relationships often result in alignments with weak statistical significance, the choice of the correct similarity matrix is crucial for detecting them.

Scoring Systems: Substitution Scoring Matrix
 

Local or global alignment

Global alignment considers similarity over the full extent of two sequences, whereas local alignment focuses on regions of similarity in parts of the sequences considered. Because sequences are not uniformly similar, it is not useful to attempt a global alignment of sequences that have only local similarity.

Global alignment
LTGARDWEDIPLWTDWDIEQESDFKTRAFGTANCHK
 ||.  | |  |  .|     .|  ||  || | ||
   TGIPLWTDWDLEQESDNSCNTDHYTREWGTMNAHKAG

 Local alignment

       LTGARDWEDIPLWTDWDIEQESDFKTRAFGTANCHK
                ||||||||.|||| 
              TGIPLWTDWDLEQESDNSCNTDHYTREWGTMNAHK

The Needleman & Wunsch algorithm for global alignment

This method is similar to the dotplot, but interpreted computationally. The maximum match between two sequences is defined as the largest number of elements of one sequence that can be matched with those of the other, while allowing for all possible deletions.
A penalty is introduced to provide a barrier against arbitrary gap insertion. The sequences are compared in a 2D matrix. N&W proposed that the maximum-match pathway can be obtained computationally by applying a simple algorithm.
Cells representing identities are scored 1, mismatches are scored 0.
The cells are then summed successively: to the content of each cell is added the maximum score along any path leading to that cell. When this process is completed, the maximum-match pathway can be traced.
 

Dotplot matrix (x = identical residues):

  E V D Q K I S K W D
E x 0 0 0 0 0 0 0 0 0
V 0 x 0 0 0 0 0 0 0 0
K 0 0 0 0 x 0 0 x 0 0
K 0 0 0 0 x 0 0 x 0 0
I 0 0 0 0 0 x 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 x 0 0 x 0 0
W 0 0 0 0 0 0 0 0 x 0
D 0 0 x 0 0 0 0 0 0 x

Matrix with similarity scores (1 = identity, 0 = mismatch):

  E V D Q K I S K W D
E 1 0 0 0 0 0 0 0 0 0
V 0 1 0 0 0 0 0 0 0 0
K 0 0 0 0 1 0 0 1 0 0
K 0 0 0 0 1 0 0 1 0 0
I 0 0 0 0 0 1 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 1 0 0 1 0 0
W 0 0 0 0 0 0 0 0 1 0
D 0 0 1 0 0 0 0 0 0 1

Successive summation of cells begins at the last cell of the matrix (bottom right) and proceeds towards the upper left. To each cell (i,j) one adds the maximum value found among the cells of row i+1 lying to its right and the cells of column j+1 lying below it, that is, among the cells through which a pathway starting at (i,j) can continue.
The score in the upper left cell represents the total score of the alignment, without considering gap penalties.
 

  E V D Q K I S K W D
E 7 5 5 5 4 3 3 2 1 0
V 5 6 5 5 4 3 3 2 1 0
K 5 5 5 5 5 3 3 3 1 0
K 4 4 4 4 5 3 3 3 1 0
I 3 3 3 3 3 4 3 2 1 0
T 3 3 3 3 3 3 3 2 1 0
R 3 3 3 3 3 3 3 2 1 0
P 3 3 3 3 3 3 3 2 1 0
K 2 2 2 2 3 2 2 3 1 0
W 1 1 1 1 1 1 1 1 2 0
D 0 0 1 0 0 0 0 0 0 1

One of the optimal alignments is:
 

E V D Q K I S - - K W D
| |     | |       | | |
E V K - K I T R P K W D
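
The successive summation described above can be sketched in a few lines. This sketch follows the simple scheme used in this section (identity = 1, mismatch = 0, no gap penalty), not a general implementation with a substitution matrix; the function name `nw_sum_matrix` is hypothetical. For the two example sequences it gives 7 in the upper left cell, the number of identities in the alignment shown above.

```python
def nw_sum_matrix(row_seq, col_seq):
    """Fill the summed matrix from the bottom right to the upper left."""
    n, m = len(row_seq), len(col_seq)
    M = [[0] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = 0
            if i + 1 < n and j + 1 < m:
                # Cells of the next row to the right of j, and cells of the
                # next column below i: where a pathway can continue.
                in_next_row = max(M[i + 1][j + 1:])
                in_next_col = max(M[k][j + 1] for k in range(i + 1, n))
                best = max(in_next_row, in_next_col)
            identity = 1 if row_seq[i] == col_seq[j] else 0
            M[i][j] = identity + best
    return M


M = nw_sum_matrix("EVKKITRPKWD", "EVDQKISKWD")
for residue, row in zip("EVKKITRPKWD", M):
    print(residue, " ".join(str(value) for value in row))
```

An optimal pathway is then read off by starting from the maximum in the first row or column and moving, at each step, to the largest admissible cell in the next row or column.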

The Smith & Waterman algorithm for local alignment

The aim of finding the best local alignment between relatively long sequences is to identify the segment of the first sequence that shows the maximal similarity with a segment of the second sequence.
Each cell in the matrix defines the end point of a potential alignment, whose similarity is represented by the value stored in the cell.
One begins by filling the edge elements with 0.0, because these cells represent alignments of length zero. The next step is to populate the other cells by evaluating different functions and choosing the maximum of these values, or zero if a negative score results. The functions are: (1) a similarity score (e.g. 1.0 for a match, -0.33 for a mismatch); (2) a gap penalty W = a + b(k-1), where a is the gap opening penalty and b(k-1) is the penalty due to extending the gap to length k.
The highest score in the matrix marks the end of the highest-scoring local alignment between the two sequences, and is the starting point for tracing that alignment back.
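
The procedure can be sketched as follows, using the example scores quoted above (1.0 for a match, -0.33 for a mismatch) and the gap penalty W = a + b(k-1). The values chosen for a and b, the function name and the test sequences are assumptions made for the illustration; the sketch tries every gap length explicitly, which is simple but slow for long sequences.

```python
def smith_waterman(s, t, match=1.0, mismatch=-0.33, gap_open=1.0, gap_ext=0.33):
    """Return (score, aligned_s, aligned_t) for the best local alignment."""
    def gap(k):
        return gap_open + gap_ext * (k - 1)  # W = a + b(k-1)

    n, m = len(s), len(t)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]      # edges stay at 0.0
    back = [[None] * (m + 1) for _ in range(n + 1)]  # previous cell, if any
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score, move = 0.0, None
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            if diag > score:
                score, move = diag, (i - 1, j - 1)
            for k in range(1, i + 1):                # gap of length k in t
                if H[i - k][j] - gap(k) > score:
                    score, move = H[i - k][j] - gap(k), (i - k, j)
            for k in range(1, j + 1):                # gap of length k in s
                if H[i][j - k] - gap(k) > score:
                    score, move = H[i][j - k] - gap(k), (i, j - k)
            H[i][j], back[i][j] = score, move
            if score > best:
                best, best_pos = score, (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while back[i][j] is not None:
        pi, pj = back[i][j]
        if pi < i and pj < j:
            top.append(s[pi]); bottom.append(t[pj])
        elif pj == j:
            top.append(s[pi:i]); bottom.append("-" * (i - pi))
        else:
            top.append("-" * (j - pj)); bottom.append(t[pj:j])
        i, j = pi, pj
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTACGTTT", "GGACGTGG"))
```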

4 Comparing a sequence with a database: introduction to FastA and BLAST

The comparison of a sequence with a database can be viewed as an extension of pairwise alignment. FastA and BLAST are essentially local similarity search methods that concentrate on finding short identical matches, which may contribute to a total match, using implementations designed for execution speed.

FastA

(Lipman and Pearson, 1985)
FastA was the first widely used algorithm for database similarity searching.
The algorithm first tries to identify short words (k-tuples) common to both sequences under comparison (k-tuple size of 1 or 2 for amino acids, up to 5 for nucleotides).
Each k-tuple lies on a diagonal of the matrix. The diagonals containing the highest density of k-tuples are selected (because they represent the best regions of local similarity). In the following steps, the score of each region is recalculated, allowing for small gaps or substitutions, using the appropriate similarity matrix (e.g. PAM-250 for proteins). The highest scoring diagonal is the "primary similarity region". Then the program checks whether some of the previously selected similarity regions can be joined with the primary region to construct a single alignment. An optimization phase completes the process.
Sequences in the database are ordered according to decreasing similarity scores with the probe sequence.
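
The first step, finding shared k-tuples and the diagonals on which they lie, can be sketched as below. This covers only the word lookup and diagonal counting, not the rescoring, joining and optimization steps of FastA; the function name and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count shared k-tuples per diagonal (subject position - query position)."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return diagonals


hits = diagonal_hits("GATTACA", "CCGATTACACC", ktup=2)
print(hits.most_common())
```

The diagonal with the most hits points to the offset at which the two sequences share their best ungapped region.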

BLAST

(Altschul et al. 1990)
The BLAST algorithm is a heuristic search method that seeks words of length W (default = 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high scoring pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for number of descriptions and/or alignments to report.
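
The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not the BLAST implementation: seeds here are exact word matches rather than neighbourhood words scoring at least T, scoring is a plain match/mismatch scheme instead of a substitution matrix, and extension stops when the running score falls more than a hypothetical `x_drop` below the best score seen. All names and parameter values are assumptions for the example.

```python
def extend(q, s, i, j, w, match, mismatch, x_drop):
    """Extend the seed q[i:i+w] == s[j:j+w] in both directions, ungapped."""
    def walk(qi, sj, step):
        best = run = best_len = n = 0
        while 0 <= qi < len(q) and 0 <= sj < len(s):
            run += match if q[qi] == s[sj] else mismatch
            n += 1
            if run > best:
                best, best_len = run, n
            elif best - run > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    score = w * match + left + right
    return score, i - left_len, j - left_len, left_len + w + right_len


def find_hsps(query, subject, w=3, match=1, mismatch=-1, x_drop=2):
    """Return (score, query_start, subject_start, length) tuples, best first."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hsps = set()
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            hsps.add(extend(query, subject, i, j, w, match, mismatch, x_drop))
    return sorted(hsps, key=lambda hsp: -hsp[0])


for hsp in find_hsps("CCCGATTACAGGG", "TTTGATTACATTT"):
    print(hsp)
```

Several seeds on the same diagonal extend to the same segment pair, which is why the results are collected in a set before ranking.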

The quality of each pairwise alignment is represented as a score and the scores are ranked. Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein). A unitary matrix is used for DNA pairs because each position can be given a score of +1 if it matches and a score of zero if it does not. Substitution matrices are used for amino acid alignments. These are matrices in which each possible residue substitution is given a score reflecting the probability that it is related to the corresponding residue in the query. The alignment score will be the sum of the scores for each position. Various scoring systems (e.g. PAM, BLOSUM and PSSM) for quantifying the relationships between residues have been used.
The significance of each alignment is computed as a P value or an E value. Each alignment must be viewed with a critical human eye before being accepted as meaningful. For example, high scoring pairs whose similarity is based on repeated amino acid stretches (e.g. polyglutamine) are unlikely to reflect meaningful similarity between the query and the match.

General rules about using BLAST for finding homologous sequences:

1) Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.

2) The requirement for a common folded structure in homologous proteins usually causes these proteins to be similar over the entire length of the gene product (or domain). Therefore, most sequences that share statistically significant similarity throughout their entire lengths are homologous.

3) Matches that are more than 50% identical in a 20-40 amino acid region occur frequently by chance.

4) Distantly related homologs may lack significant similarity. Two or more homologous sequences may have very few absolutely conserved residues.

5) If homology has been inferred due to significant similarity scores between two proteins, A and B, that align over their entire lengths and between protein B and a third protein, C, then proteins A and C must also be homologous, even if they share no significant similarity.

6) Low complexity regions, transmembrane regions and coiled-coil regions frequently display significant similarity in the absence of homology. Low complexity regions can be filtered out using the default parameters of BLAST. Transmembrane and coiled-coil regions should be identified and masked (by eliminating these regions from the query) by the user.

BLAST (NCBI)
 

AVAILABLE AT NCBI
Nucleotide BLAST

Standard nucleotide-nucleotide BLAST [blastn]

MEGABLAST

Search for short nearly exact matches

Genomic BLAST pages

Human Genome

Microbial Genomes

Arabidopsis thaliana

Other eukaryotes

Specialized BLAST pages

VecScreen - BLAST-based detection of vector contamination

IgBLAST - Analysis of immunoglobulin sequences in GenBank

OLD Finished and Unfinished Microbial Genomes

Protein BLAST

Standard protein-protein BLAST [blastp]

PSI- and PHI-BLAST

Search for short nearly exact matches

Translated BLAST Searches

Nucleotide query - Protein db [blastx]

Protein query - Translated db [tblastn]

Nucleotide query - Translated db [tblastx]

Pairwise BLAST

BLAST 2 Sequences

Search for conserved domains

Search the Conserved Domain Database using RPS-BLAST

Search by domain architecture [DART]

5 Multiple Sequence Alignment

A multiple sequence alignment is an alignment of three or more sequences, with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column. Clustal W is one of the most widely used multiple sequence alignment programs.
The goal of multiple sequence alignment is to generate a concise, information-rich summary of sequence data in order to inform decisions on whether sequences belong to a gene family.
Sometimes, alignments are used to express dissimilarity (e.g. for designing primers specific to one gene belonging to a family).
Alignments can be considered as models for hypothesis testing. Two types of multiple sequence alignments can be made: sequence-based and structure-based.
A multiple sequence alignment is a 2D table in which the rows represent individual sequences and the columns the residue positions. Sequences are arranged in the alignment in such a way that (1) the relative positioning of residues within one sequence is preserved and (2) similar residues of different sequences are arranged in vertical columns.
Due to the introduction of gaps, the absolute position of a residue (in the unaligned sequence) can be different from its position in the alignment.
 

        1    2    3    4    5    6    7    8    9    10
seqA    Y    D    G    G    A    V    -    E    A    L
seqB    Y    D    G    G    -    -    -    E    A    L
seqC    F    E    G    G    I    L    V    E    A    L
seqD    F    D    -    G    E    A    L    Q    A    V
seqE    Y    E    G    G    A    V    V    Q    A    L
cons.   Y    D    G    G   A/I  V/L   V   E/q   A    L
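
The difference between alignment columns and absolute residue positions can be made explicit with a small mapping. This is a sketch with a hypothetical function name; positions are numbered from 1, and gap columns map to None.

```python
def column_to_residue(aligned_row, gap_chars="-"):
    """For each alignment column, give the residue number in the unaligned sequence."""
    mapping, residue = [], 0
    for symbol in aligned_row:
        if symbol in gap_chars:
            mapping.append(None)
        else:
            residue += 1
            mapping.append(residue)
    return mapping


print(column_to_residue("FD-GEALQAV"))  # a row with one gap, like seqD
```

For a row such as seqD, the G in alignment column 4 is residue 3 of the unaligned sequence.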

The alignment table can be summarised in a single line, a pseudo-sequence (consensus), normally placed at the end of the alignment.
 

        1    2    3    4    5    6    7    8    9    10
seqA    Y    D    G    G    A    V    -    E    A    L
seqB    .    .    .    .    -    -    -    .    .    .
seqC    F    E    .    .    I    L    V    .    .    .
seqD    F    .    -    .    E    A    L    Q    .    V
seqE    .    E    .    .    .    .    V    Q    .    .
cons.   Y    D    G    G   A/I  V/L   V   E/q   A    L

Methods for multiple sequence alignment can be simultaneous or progressive, depending on whether they align all sequences of a given set at once or take a progressive approach, aligning pairs of sequences or building sequence clusters.
Clustal is the best-known program for multiple alignment using a progressive method (Feng and Doolittle, 1987). The method aligns sequences in pairs following the branching order of a family tree. The tree can be constructed by the program itself, using a similarity matrix derived from the results of all the possible pairwise alignments of the input sequences.
Basically, the multiple sequence alignment process consists of three steps:
1) The results of the M pairwise alignments of the input sequences, in all possible combinations (for N sequences, M = N(N-1)/2), are used to calculate a distance matrix. For each pair of sequences, the distance D is calculated as the sum of the total gap penalty and the negative scores due to mismatches between the sequences.
2) A reference tree is reconstructed from the distance matrix with the Neighbor-Joining method.
3) Following the branching pattern of the reference tree, the input sequences are aligned in a progressive process. The progressive process follows the rule that gaps established early are conserved.

BCM ClustalW 1.8
 



Page by Stefania Bortoluzzi, last update October 18, 2001