Help for B10K Blast



1 Programs available for the BLAST search

The NCBI BLAST family of programs includes:
blastp: compares an amino acid query sequence against a protein sequence database
blastn: compares a nucleotide query sequence against a nucleotide sequence database
blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database
tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
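
The list above can be summarized as a lookup from query and database type to program. The following Python sketch is only an illustration; the names PROGRAMS and pick_programs are hypothetical and are not part of any BLAST distribution.

```python
# (query type, database type, what gets translated) for each program,
# transcribed from the list above.
PROGRAMS = {
    "blastp":  ("protein",    "protein",    "nothing"),
    "blastn":  ("nucleotide", "nucleotide", "nothing"),
    "blastx":  ("nucleotide", "protein",    "query"),
    "tblastn": ("protein",    "nucleotide", "database"),
    "tblastx": ("nucleotide", "nucleotide", "query and database"),
}


def pick_programs(query_type, db_type):
    """Return the programs that accept this query/database combination."""
    return sorted(
        name
        for name, (query, database, _translated) in PROGRAMS.items()
        if query == query_type and database == db_type
    )
```

For example, a nucleotide query against a nucleotide database can be searched directly (blastn) or through six-frame translations (tblastx).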

2 FASTA format description

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

                        >gi|532319|pir|TVFV2E|TVFV2E envelope protein
                        ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
                        QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
                        HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
                        MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
                        TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
                        APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
                        LAAVEAQQQMLKLTIWGVK
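
As a rough illustration of the format just described, the following Python sketch splits FASTA text into description and sequence pairs. The function name parse_fasta is hypothetical. The sketch strips leading whitespace before looking for ">", which is more lenient than the "first column" rule, and that leniency is an assumption of this example rather than documented server behavior.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)  # sequence data may span several lines
    if description is not None:
        yield description, "".join(chunks)
```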
                        

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). The nucleic acid codes supported are:

       
                                A --> adenosine           M --> A C (amino)
                                C --> cytidine            S --> G C (strong)
                                G --> guanine             W --> A T (weak)
                                T --> thymidine           B --> G T C
                                U --> uridine             D --> G A T
                                R --> G A (purine)        H --> A C T
                                Y --> T C (pyrimidine)    V --> G C A
                                K --> G T (keto)          N --> A G C T (any)
                                                          -  gap of indeterminate length
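
The ambiguity codes in this table can be written as a small lookup. The Python sketch below is illustrative only; IUPAC_NUCLEOTIDE and code_matches are hypothetical names, and the gap symbol is deliberately left out of the table.

```python
# Each code maps to the bases it can stand for, as listed in the table above.
IUPAC_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def code_matches(code, base):
    """Return True if the (possibly ambiguous) code can stand for the base."""
    return base.upper() in IUPAC_NUCLEOTIDE[code.upper()]
```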
                        

For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:

                                A  alanine                         P  proline
                                B  aspartate or asparagine         Q  glutamine
                                C  cysteine                        R  arginine
                                D  aspartate                       S  serine
                                E  glutamate                       T  threonine
                                F  phenylalanine                   U  selenocysteine
                                G  glycine                         V  valine
                                H  histidine                       W  tryptophan
                                I  isoleucine                      Y  tyrosine
                                K  lysine                          Z  glutamate or glutamine
                                L  leucine                         X  any
                                M  methionine                      *  translation stop
                                N  asparagine                      -  gap of indeterminate length
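
The clean-up rules described earlier in this section (lower case mapped to upper case, digits removed before submission) can be sketched for a protein query as follows. The names PROTEIN_CODES and normalize_protein_query are hypothetical. This example removes digits rather than replacing them with X, which is only one of the two options mentioned above.

```python
# Letters from the amino acid table above, plus "*" and "-".
PROTEIN_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def normalize_protein_query(seq):
    """Upper-case the query, drop whitespace and digits, then validate it."""
    cleaned = "".join(
        ch for ch in seq.upper() if not ch.isdigit() and not ch.isspace()
    )
    unknown = sorted(set(cleaned) - PROTEIN_CODES)
    if unknown:
        raise ValueError(f"unsupported residue codes: {unknown}")
    return cleaned
```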
                        

3 Low complexity filtering

The server filters your query sequence for low compositional complexity regions by default. Low complexity regions commonly give spuriously high scores that reflect compositional bias rather than significant position-by-position alignment. Filtering can eliminate these potentially confounding matches (e.g., hits against proline-rich regions or poly-A tails) from the blast reports, leaving regions whose blast statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. Other programs use SEG.

Low complexity sequence found by a filter program is substituted using the letter "N" in nucleotide sequence (e.g., "NNNNNNNNNNNNN") and the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Users may turn off filtering by using the "Filter" option on the "Advanced options for the BLAST server" page.
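
The substitution step itself is simple to picture. The Python sketch below assumes the low-complexity intervals have already been found by some filter and only shows the replacement with "N" or "X"; it does not implement DUST or SEG, and the name mask_regions is hypothetical.

```python
def mask_regions(seq, regions, is_protein):
    """Replace each half-open (start, end) interval with the masking letter."""
    mask_char = "X" if is_protein else "N"
    out = list(seq)
    for start, end in regions:
        # Clamp to the sequence so the result keeps its original length.
        start, end = max(start, 0), min(end, len(out))
        if start < end:
            out[start:end] = mask_char * (end - start)
    return "".join(out)
```

With a poly-A run at positions 4 to 11, mask_regions("ACGTAAAAAAAAGT", [(4, 12)], False) gives "ACGTNNNNNNNNGT".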

Reference for the DUST program:

Tatusov, R. L. and D. J. Lipman, in preparation.
Hancock, J. M. and J. S. Armstrong (1994). SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput Appl Biosci 10:67-70.

Reference for the SEG program:

Wootton, J. C. and S. Federhen (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers & Chemistry 17:149-163.
Wootton, J. C. and S. Federhen (1996). Analysis of compositionally biased regions in sequence databases. Methods in Enzymology 266: 554-571.

Reference for the role of filtering in search strategies:

Altschul, S. F., M. S. Boguski, W. Gish, J. C. Wootton (1994). Issues in searching molecular sequence databases. Nat Genet 6: 119-129.

4 Out-Of-Frame BLAST notation

When a protein is aligned to a nucleotide sequence, there are six possible kinds of match at any point. In an OOF alignment, the upper sequence is DNAP (the 3-frame translated DNA) and the lower sequence is the protein. At any position, the next protein residue may be aligned to the DNAP in one of six ways:
0: 3 nucleotides missing - gap (TBO notation "-")

                            OOF alignment with DNAP:

                                  DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGG-GVLCV
                                  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   |  |
                                  D  G  T  K  F  A  T  G  G  Q  G  Q  D  S  G K V  V

                            TBO:

                                  DGTKFATGGQGQDSG-VV
                                  DGTKFATGGQGQDSG VV
                                  DGTKFATGGQGQDSGKVV
                        

1: 2 nucleotides missing - "frameshift -2" (TBO notation "\\")

                        OOF alignment with DNAP:

                              DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGGGVLCV
                              |  |  |  |  |  |  |  |  |  |  |  |  |  |  |/  |  |
                              D  G  T  K  F  A  T  G  G  Q  G  Q  D  S  GK  V  V

                        TBO:

                              DGTKFATGGQGQDSG\\GVV
                              DGTKFATGGQGQDSG   VV
                              DGTKFATGGQGQDSG  KVV
                        

2: 1 nucleotide missing - "frameshift -1" (TBO notation "\")

                        OOF alignment with DNAP:

                              DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGERGV
                              |  |  |  |  |  |  |  |  |  |  |  |  |  | /  |  |  
                              D  G  T  K  F  A  T  G  G  Q  G  Q  D  S G  K  V  
                        TBO:

                              DGTKFATGGQGQDS\GEV
                              DGTKFATGGQGQDS G V
                              DGTKFATGGQGQDS GKV  
                        

3: Complete match

                        OOF alignment with DNAP:

                              DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGEKRGV
                              |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
                              D  G  T  K  F  A  T  G  G  Q  G  Q  D  S  G  K  V 

                        TBO:

                              DGTKFATGGQGQDSGKV
                              DGTKFATGGQGQDSGKV 
                              DGTKFATGGQGQDSGKV 
                        

4: 1 nucleotide insertion - "frameshift +1" (TBO notation "/")

                        OOF alignment with DNAP:

                              DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLWGGVEKRGV
                              |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   \
                              D  G  T  K  F  A  T  G  G  Q  G  Q  D  S  G   K  V

                        TBO:

                              DGTKFATGGQGQDSG/KV
                              DGTKFATGGQGQDSG KV
                              DGTKFATGGQGQDSG KV
                        

5: 2 nucleotide insertion - "frameshift +2" (TBO notation "//")

                        OOF alignment with DNAP:

                              DTRGGDTPQKSVFSRAQNTLWGERGDTQKRGGAQRGDIFSLFLWGGEKRGV
                              |  |  |  |  |  |  |  |  |  |  |  |  |  |    \  |  |
                              D  G  T  K  F  A  T  G  G  Q  G  Q  D  S    G  K  V

                        TBO:

                              DGTKFATGGQGQDS//GKV
                              DGTKFATGGQGQDS  GKV
                              DGTKFATGGQGQDS  GKV
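
Read together, the six cases describe how far the alignment departs from a full three-nucleotide codon at each step. The Python sketch below records that as a lookup and picks the symbols out of the first line of a TBO block. TBO_SHIFT and tbo_shifts are hypothetical names, and the tokenizer assumes the symbols appear literally, as in the examples above.

```python
import re

# Nucleotide offset relative to a full codon, per TBO symbol:
# "-" is a gap (3 missing), two backslashes 2 missing, one backslash
# 1 missing, "/" 1 inserted, "//" 2 inserted.
TBO_SHIFT = {"-": -3, "\\\\": -2, "\\": -1, "/": 1, "//": 2}

# Longer symbols are listed first so "//" is not read as two "/".
_TBO_SYMBOL = re.compile(r"//|/|\\\\|\\|-")


def tbo_shifts(line):
    """Return the nucleotide offsets for the TBO symbols found in a line."""
    return [TBO_SHIFT[m.group()] for m in _TBO_SYMBOL.finditer(line)]
```

A line with no symbols, such as the complete match in case 3, yields an empty list.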
                        

4 BLAST Search main parameters

DESCRIPTIONS:

Restricts the number of short descriptions of matching sequences reported to the number specified; default limit is 100 descriptions. See also EXPECT.

ALIGNMENTS:

Restricts database sequences to the number specified for which high-scoring segment pairs (HSPs) are reported; the default limit is 100. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see EXPECT below), only the matches ascribed the greatest statistical significance are reported.

EXPECT:

The statistical significance threshold for reporting matches against database sequences; the default value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Fractional values are acceptable.
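
The interaction between EXPECT and the reporting limits can be sketched in Python as follows. This is a simplified illustration, not the server's code: hits are assumed to be (identifier, E-value) tuples, and report_hits is a hypothetical name.

```python
def report_hits(hits, expect=10.0, descriptions=100):
    """Keep hits whose E-value does not exceed EXPECT, most significant first."""
    kept = [hit for hit in hits if hit[1] <= expect]
    kept.sort(key=lambda hit: hit[1])  # lower E-value = more significant
    return kept[:descriptions]         # cap at the DESCRIPTIONS limit
```

Lowering expect drops more chance matches; lowering descriptions only shortens the list after the threshold has been applied.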

INCLUSION THRESHOLD:

The statistical significance threshold for including a sequence in the model used by PSI-BLAST on the next iteration.

ORGANISM NAME:

Enter the organism name in the form "Genus species" (e.g., "Homo sapiens"). A number of popular organism names are listed on a pull-down menu.

TAXONOMIC CLASSIFICATION:

Enter any taxonomic group from the NCBI taxonomy (e.g. "Mammalia").

                                Some popular groups are:

                                Archaea
                                Bacteria
                                Eukaryota
                                Embryophyta (higher plants)
                                Fungi
                                Metazoa (multicellular animals)
                                Vertebrata
                                Mammalia
                                Rodentia
                                Primates
                        

Explore the taxonomy database at NCBI

FILTER:
FILTER (Low-complexity)

Mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.

It is not unusual for nothing at all to be masked by SEG, when applied to sequences in SWISS-PROT, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.

FILTER (Human repeats)

This option masks human repeats (LINEs and SINEs) and is especially useful for human sequences that may contain these repeats. This option is still experimental and under development, so it may change in the near future.

FILTER (Mask for lookup table only)

This option masks only for purposes of constructing the lookup table used by BLAST. The BLAST extensions are performed without masking. This option is still experimental and may change in the near future.

NCBI-gi:

Causes NCBI gi identifiers to be shown in the output, in addition to the accession and/or locus name.

Query Genetic Code:

Genetic code to be used in blastx translation of the query.

Graphical Overview:

An overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, which divides the range of scores into five groups. Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.

Color schema description:

  • Color schema 1:
    • Masked regions in lower case
    • Everything else in upper case
  • Color schema 2:
    • Masked regions in lower case, gray letters
    • Unaligned regions in italic
    • Everything else in upper case
  • Color schema 3:
    • No middle line.
    • Masked regions in lower case, gray letters unless identity
    • Everything else in upper case
    • Unaligned regions in italic
    • Identity shown in red color
    • Similarity shown in blue color
    • Mismatches shown in black color
  • Color schema 4:
    • No middle line.
    • Masked regions in lower case, gray letters
    • Everything else in upper case
    • Unaligned regions in italic
    • Identity shown in blue color
    • Similarity shown in brown color
    • Mismatches shown in red color
  • Color schema 5:
    • No middle line.
    • Masked regions in lower case, gray letters
    • Everything else in upper case
    • Unaligned regions in italic
    • Identity shown in red color
    • Similarity shown in blue color
    • Mismatches shown in black color
  • Color schema 6:
    • No middle line.
    • Masked regions in lower case, gray letters unless identity
    • Everything else in upper case
    • Unaligned regions in italic
    • Identity shown in red bold color
    • Similarity shown in blue color
    • Mismatches shown in gray color

Matrix:

A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2]. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically.

Short alignments need to be relatively strong (i.e., have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead.

For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:

     
                            Query length     Substitution matrix     Gap costs
                            ------------     -------------------     ---------
                            <35              PAM-30                  ( 9,1)
                            35-50            PAM-70                  (10,1)
                            50-85            BLOSUM-80               (10,1)
                            >85              BLOSUM-62               (11,1)
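
The table can be expressed as a small lookup. The Python sketch below is illustrative and recommended_scoring is a hypothetical name. The printed ranges share the boundary value 50, so assigning a query of exactly 50 residues to the PAM-70 row is an assumption of this example.

```python
def recommended_scoring(query_length):
    """Return (matrix name, (gap existence, gap extension)) for a protein query."""
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:   # boundary value placed in the lower row (assumption)
        return "PAM-70", (10, 1)
    if query_length <= 85:   # the table's last row starts strictly above 85
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)
```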
                        
Gap Costs:

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
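
As a worked illustration of the affine gap formula, the Python sketch below computes the score of a single gap and the raw score of an alignment. The function names and the representation of an alignment as two lists (pair scores and gap lengths) are assumptions made for this example.

```python
def gap_score(k, a, b):
    """Score of a gap of k residues with existence cost a and extension cost b."""
    return -(a + b * k)


def raw_score(pair_scores, gap_lengths, a, b):
    """Sum of the aligned-pair scores plus the (negative) scores of all gaps."""
    return sum(pair_scores) + sum(gap_score(k, a, b) for k in gap_lengths)
```

With a = 11 and b = 1, a gap of length 1 scores -12 and a gap of length 3 scores -14, so lengthening an existing gap is much cheaper than opening a new one.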

Lambda Ratio:

To convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed [7-9]. For determining S', the more important of these parameters is lambda. The "lambda ratio" quoted here is the ratio of the lambda for the given scoring system to that for one using the same substitution scores, but with infinite gap costs [8]. This ratio indicates what proportion of information in an ungapped alignment must be sacrificed in the hope of improving its score through extension using gaps. We have found empirically that the most effective gap costs tend to be those with lambda ratios in the range 0.8 to 0.9.
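
The conversion and the ratio can be written out directly. In the Python sketch below the function names are hypothetical, and the lambda and K values must be supplied by the caller for the scoring system in use; none are hard-coded here.

```python
import math


def bit_score(raw, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2, expressed in bits."""
    return (lam * raw - math.log(k)) / math.log(2)


def lambda_ratio(lam_gapped, lam_ungapped):
    """Ratio of lambda with the given gap costs to lambda with infinite gap costs."""
    return lam_gapped / lam_ungapped
```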

                    [1] Altschul, S.F. (1991) "Amino acid substitution matrices from an information
                        theoretic perspective." J. Mol. Biol. 219:555-565.
                    [2] States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of
                        nucleic acid database searches using application-specific scoring matrices."
                        Methods 3:66-70.
                    [3] Altschul, S.F. (1993) "A protein alignment scoring system sensitive at all
                        evolutionary distances." J. Mol. Evol. 36:290-300.
                    [4] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from
                        protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919.
                    [5] Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) "A model of evolutionary
                        change in proteins." In "Atlas of Protein Sequence and Structure, vol. 5,
                        suppl. 3," M.O. Dayhoff (ed.), pp. 345-352, Natl. Biomed. Res. Found.,
                        Washington, DC.
                    [6] Schwartz, R.M. & Dayhoff, M.O. (1978) "Matrices for detecting distant
                        relationships." In "Atlas of Protein Sequence and Structure, vol. 5,
                        suppl. 3," M.O. Dayhoff (ed.), pp. 353-358, Natl. Biomed. Res. Found.,
                        Washington, DC.
                    [7] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical
                        significance of molecular sequence features by using general scoring
                        schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.
                    [8] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth.
                        Enzymol. 266:460-480.
                    [9] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller,
                        W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of
                        protein database search programs." Nucleic Acids Res. 25:3389-3402.
                

5 BLAST Program Advanced Options

BLASTN Program Advanced Options:
 
                              -G  Cost to open a gap [Integer]
                                default = 5
                              -E  Cost to extend a gap [Integer]
                                default = 2
                              -q  Penalty for a mismatch in the blast portion of run [Integer]
                                default = -3
                              -r  Reward for a match in the blast portion of run [Integer]
                                default = 1
                              -e  Expectation value (E) [Real]
                                default = 10.0
                              -W  Word size, default is 11 for blastn, 3 for other programs.
                              -v  Number of one-line descriptions (V) [Integer]
                                default = 100
                              -b  Number of alignments to show (B) [Integer]
                                default = 100
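
For readers who script their searches, the BLASTN defaults listed above can be kept in one place and overridden selectively. The Python sketch below only assembles a list of option strings; it does not run any program, and the names BLASTN_DEFAULTS and blastn_option_args are hypothetical. How this server passes options to the underlying program is not described in this help page.

```python
# Defaults transcribed from the BLASTN option list above.
BLASTN_DEFAULTS = {
    "-G": 5, "-E": 2, "-q": -3, "-r": 1,
    "-e": 10.0, "-W": 11, "-v": 100, "-b": 100,
}


def blastn_option_args(overrides=None):
    """Return a flat list such as ['-G', '5', '-E', '2', ...]."""
    options = dict(BLASTN_DEFAULTS)
    for flag, value in (overrides or {}).items():
        if flag not in options:
            raise KeyError(f"unknown option: {flag}")
        options[flag] = value
    args = []
    for flag, value in options.items():
        args += [flag, str(value)]
    return args
```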
                        
BLASTP, BLASTX, and TBLASTN Program Advanced Options:
  
                          -G  Cost to open a gap [Integer]
                            default = 11
                          -E  Cost to extend a gap [Integer]
                            default = 1
                          -e  Expectation value (E) [Real]
                            default = 10.0
                          -W  Word size, default is 11 for blastn, 3 for other programs.
                          -v  Number of one-line descriptions (V) [Integer]
                            default = 100
                          -b  Number of alignments to show (B) [Integer]
                            default = 100


                          Limited values for gap existence and extension are supported for these three programs.  
                          Some supported and suggested values are:

                          Existence Extension

                             10             1
                             10             2
                             11             1
                              8             2
                              9             2
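
A script can check a gap-cost pair against the values listed above before submitting a search. The Python sketch below is illustrative; the names are hypothetical, and because the list is described as "some" supported values, a pair missing from it is not necessarily rejected by the server.

```python
# (existence, extension) pairs transcribed from the table above.
SUGGESTED_GAP_COSTS = {(10, 1), (10, 2), (11, 1), (8, 2), (9, 2)}


def is_suggested_gap_cost(existence, extension):
    """Return True if the pair appears in the list of suggested values."""
    return (existence, extension) in SUGGESTED_GAP_COSTS
```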