Protein BLAST against user's own protein sequence set

Description

This tool runs a protein BLAST search using the given protein sequence set as the target database. The query sequence file can contain up to 100 000 sequences. The size of the target database is not limited.

Parameters

  • BLAST program to use: The BLAST algorithm to run. blastp-short is recommended for queries shorter than 30 residues.
  • Expectation threshold for saving hits: The E-value threshold sets the statistical significance required for reporting matches against database sequences. The default value of 10 means that 10 such matches are expected to be found merely by chance. Lower thresholds are more stringent, leading to fewer chance matches being reported.
  • Word size: The length of the seed that initiates an alignment. BLAST works by finding word matches between the query and database sequences. One may think of this process as finding hot spots that BLAST can then use to initiate extensions that might eventually lead to full alignments. For BLASTP searches, non-exact word matches are taken into account based on the similarity between words. Because the amount of similarity can be varied, one normally uses only the word sizes 2 and 3 for these searches.
  • Maximum number of hits to collect per sequence: This parameter limits the number of hit sequences reported for one query sequence. By default, up to 100 hits are reported. If you wish to collect all the hits, and not just the best ones, you should in many cases increase this value significantly.
  • Output format type: The BLAST results can be presented in many different formats. The classical BLAST report is not optimal for large query sets or for cases where the results will be analyzed with other tools. In addition to the text-based BLAST reports, the results can be presented as a table or as an XML file. You can also produce a FASTA-formatted sequence file containing the matching hit sequence regions, or a list of hit sequence names.
  • Filter low complexity regions: Use the SEG program to filter low-complexity regions in the query sequence.
  • Entrez query to limit search: You can use Entrez query syntax to search a subset of the selected BLAST database. This can be helpful for limiting searches to certain molecule types or sequence lengths, or for excluding organisms.
  • Location on the query sequence: Location of the search region in the query sequence, for example: 23-66.
  • Matrix: The weight matrix assigns a score for aligning each pair of residues, and so determines the overall alignment score. Experimentation has shown that the BLOSUM62 matrix is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM45 matrix may prove superior. For proteins shorter than 85 residues, the BLOSUM80 matrix may provide better hits.
  • Gap opening penalty: Cost to open a gap. Integer value from 6 to 25. The default value of this parameter depends on the selected scoring matrix. Note that if you set this value, you must also define the gap extension penalty.
  • Gap extension penalty: Cost to extend a gap. Integer value from 1 to 3. The default value of this parameter depends on the selected scoring matrix. Note that if you set this value, you must also define the gap opening penalty.
  • Output a log file: Collect a log file for the BLAST run.
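
For readers who want to reproduce a similar search outside this tool, the parameters above correspond roughly to options of the NCBI BLAST+ blastp command. The following is a minimal sketch, not the tool's actual implementation: the function name build_blastp_command and its defaults are hypothetical, and it only builds an argument list without running anything. It assumes tabular output (format 6), because BLAST+ documents -max_target_seqs as not applicable to output formats 0 to 4.

```python
from typing import List, Optional


def build_blastp_command(
    query: str,
    db: str,
    out: str,
    task: str = "blastp",            # or "blastp-short" for short queries
    evalue: float = 10.0,            # expectation threshold
    word_size: Optional[int] = None, # None lets BLAST pick its default
    max_target_seqs: int = 100,      # hits collected per query sequence
    outfmt: int = 6,                 # 6 = tabular (assumed default here)
    seg: bool = False,               # low-complexity filtering with SEG
    query_loc: Optional[str] = None, # e.g. "23-66"
    matrix: str = "BLOSUM62",
    gapopen: Optional[int] = None,
    gapextend: Optional[int] = None,
) -> List[str]:
    """Return an argument list for the BLAST+ blastp program (hypothetical helper)."""
    # Following the parameter notes above, the two gap penalties
    # are only accepted as a pair.
    if (gapopen is None) != (gapextend is None):
        raise ValueError("set gapopen and gapextend together")

    cmd = [
        "blastp",
        "-task", task,
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_target_seqs),
        "-outfmt", str(outfmt),
        "-seg", "yes" if seg else "no",
        "-matrix", matrix,
    ]
    # Optional settings are appended only when the caller provides them.
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]
    if query_loc is not None:
        cmd += ["-query_loc", query_loc]
    if gapopen is not None:
        cmd += ["-gapopen", str(gapopen), "-gapextend", str(gapextend)]
    return cmd
```

With BLAST+ installed, the returned list could be passed to subprocess.run(cmd, check=True). The file names and database name given to the function are placeholders chosen by the caller.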

Details

This tool requires two protein sequence sets as input data. One dataset, the database sequences, is indexed and used as the target database, while the other is used as the query sequence set.
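
The two-input workflow described here can be sketched with the standard BLAST+ programs: makeblastdb indexes the database sequences, and blastp then searches the query set against that index. This is only an illustration under stated assumptions. The function name plan_search and the output names target_db and hits.txt are hypothetical, and the commands the tool actually runs may differ.

```python
import os
from typing import List


def plan_search(database_fasta: str, query_fasta: str, workdir: str) -> List[List[str]]:
    """Return the two commands of an index-then-search run (hypothetical helper)."""
    # Name of the indexed database; makeblastdb writes its index files
    # using this path as a prefix.
    db_name = os.path.join(workdir, "target_db")

    # Step 1: index the database sequences as a protein database.
    index_cmd = [
        "makeblastdb",
        "-in", database_fasta,
        "-dbtype", "prot",
        "-out", db_name,
    ]

    # Step 2: search the query sequences against the indexed database.
    search_cmd = [
        "blastp",
        "-query", query_fasta,
        "-db", db_name,
        "-out", os.path.join(workdir, "hits.txt"),
    ]
    return [index_cmd, search_cmd]
```

Each returned command could be executed in order with subprocess.run(cmd, check=True), provided BLAST+ is installed. The search step has to reference the same database name that the indexing step wrote.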