Path: funic!news.funet.fi!sunic!mcsun!uunet!bionet!GENBANK.BIO.NET!kristoff
From: kristoff@GENBANK.BIO.NET (Dave Kristofferson)
Newsgroups: bionet.molbio.genbank
Subject: Re: obtaining Genbank
Message-ID:
Date: 31 Mar 91 19:08:21 GMT
Sender: kristoff@genbank.bio.net
Lines: 518

P.S. - In regards to FASTA searching of GenBank I'll post our e-mail
server instructions for your reading enjoyment.

----------------------------------------------------------------------

FASTA Server Help

GenBank now offers the FASTA program for nucleic acid sequence and
protein similarity searching of sequence databases. You can access the
GenBank FASTA Server through a number of different networks, including
Internet, BITNET, EARN, NETNORTH and JANET.

The FASTA program allows you to send a specially formatted mail
message containing the nucleic acid or protein query sequence to the
FASTA Server at GenBank. A FASTA sequence similarity search is then
performed against the specified database using the FASTA program
developed by William Pearson and David Lipman as described in their
paper:

    Pearson, W.R. and Lipman, D.J. 1988. Improved Tools for
    Biological Sequence Comparison. Proc. Natl. Acad. Sci.,
    85: 2444-2448.

If you use FASTA as a research tool, we ask that this reference be
cited in your paper.

The results of the FASTA search will be returned to your local mail
file as soon as they are processed and can be saved in a separate disk
file.

The following databases are currently available for FASTA searches:

Designator                 Database
----------                 --------
GenBank/all                Latest GenBank quarterly release PLUS
                           sequences added since last release.
GenBank/new                GenBank sequences added since last release.
GenBank/primate            GenBank subdivisions
GenBank/rodent
GenBank/other_mammalian
GenBank/other_vertebrate
GenBank/invertebrate
GenBank/plant
GenBank/organelle
GenBank/bacterial
GenBank/structural_rna
GenBank/viral
GenBank/phage
GenBank/synthetic
GenBank/unannotated
GenPept/all                Translated protein reading frames from the
                           latest GenBank release. Note that GenPept
                           contains translations only of reading
                           frames that are explicitly mentioned in the
                           GenBank sequence entry annotations!
GenPept/new                Translated protein reading frames from
                           GenBank daily updates (translated from
                           GenBank/new).
EMBL/all                   Latest EMBL Data Library release PLUS
                           sequences added since last release.
EMBL/new                   EMBL sequences added since last release.
EMBL/bacteriophage         EMBL subdivisions
EMBL/fungi
EMBL/invertebrate
EMBL/organelle
EMBL/other_mammalian
EMBL/other_vertebrate
EMBL/plant
EMBL/primate
EMBL/prokaryote
EMBL/rodent
EMBL/synthetic
EMBL/unannotated
EMBL/viral
SWISS-PROT/all             All of the SWISS-PROT protein database.

GenBank and EMBL are nucleic acid sequence databases and SWISS-PROT is
a protein sequence database. GenPept is produced by GenBank and
consists of translations of open reading frames as documented in the
sequence entry annotations.

Accessing the FASTA program

To access the program, send an electronic mail message containing the
formatted query sequence (as described below) to the following
Internet address:

    SEARCH@GENBANK.BIO.NET

If you are not on Internet, you may need to change the format of the
address. Consult your systems manager to determine the correct
address.

Obtaining Help

If you would like to receive instructions on using the FASTA program,
send a mail message to the address above containing the word "Help" on
a single line of the mail message. Leave the Subject line in the mail
header blank. The help text will be updated when new information is
available for FASTA searches (such as new databases on-line).
For additional help on using FASTA, contact GenBank at (415) 962-7307
or send an electronic mail message to the address:

    CONSULTANT@GENBANK.BIO.NET

Formatting a Query

Queries consist of a mail message with search parameters identifying
the database to be searched, values related to the search and the
query sequence to be used in the search. The mail message has two
mandatory lines, three optional lines and a line identifying the query
sequence as described below. These lines are typed into the body of
the mail message in the order shown below:

Search
Parameter    Mandatory   Explanation

DATALIB      Yes         This line specifies the database to be
                         searched (as described in the beginning of
                         this text) for the query sequence and must
                         be included in the message.

KTUP         No          This line identifies the Ktup value which
                         specifies the sensitivity of the search.
                         Values range between 3 and 6 for nucleic
                         acid searches and between 1 and 2 for
                         protein searches. Lower values specify more
                         sensitive searches but require more time to
                         complete. For DNA sequences longer than 200
                         base pairs, use a Ktup value of 4 or
                         greater; lower values are unnecessary and
                         take longer to complete. Protein searches
                         will benefit from having a Ktup value of 1
                         if you expect significant matches with
                         evolutionary amino acid replacements but
                         few exact amino acid matches. The default
                         value is 4 for nucleic acids and 1 for
                         proteins.

SCORES       No          This line specifies the number of
                         best-ranked sequences to be listed in the
                         results. The default value is 100.

ALIGNMENTS   No          This line identifies the maximum number of
                         best-ranked sequences to be aligned in the
                         results. The default value is 20.

BEGIN        Yes         This line must be included in the message.
                         No other information is typed on it. The
                         remainder of the message contains the query
                         sequence in either Pearson FASTA format or
                         in IntelliGenetics format.

Preparing Files for Similarity Searches

Only one sequence query is allowed per mail query.
The query sequence that you would like searched in the database must
be contained in its own file. Your sequence file must be in either
Pearson format or IntelliGenetics format. GenBank database file format
is not currently accepted; however, it is possible to use an editor to
change the file to Pearson format as described below.

Note: all lines must be less than 80 characters in length; longer
lines will be truncated.

Pearson Format

Pearson is the preferred format to use for query sequences. The format
includes a mandatory comment line beginning with a greater-than sign
">" followed by the name of the sequence, a space, and an optional
note about the sequence. The sequence data begin on the next line
without the greater-than sign. For example:

>AGREP4 Monkey SV40-like genomic segment promoting transcription.
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg
tttccatttactttggattatacgtcattataaa

IntelliGenetics Format

If your sequence was derived using one of the IntelliGenetics
programs, it can be used for a FASTA search. Comment lines are
optional and begin with a semicolon ";". The name of the sequence and
the sequence data appear on separate lines without a semicolon. At the
end of the sequence data a number must follow to indicate if the
sequence is linear (1) or circular (2). For example:

;Monkey SV40-like genomic segment promoting transcription.
AGMREP4
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg
tttccatttactttggattatacgtcattataaa1

GenBank Flat-File Format

GenBank database file format is NOT accepted for query searches. The
files contain annotation data and residue numbers that cannot be
recognized by FASTA. For example:

(annotation data)
        1 ccccttcaaa tctattacaa ggtgagcgtc tcgccaaggc aatgaaatcg caatatgatg
       61 taaccttgcg ctttggatta gacggactgt taaacggcaa

These files can be used only if they are changed to follow Pearson
format.
The files must be stripped of annotation data and the numbers in the
sequence; the mandatory comment line (starting with ">") must then be
added.

Sending the Query Sequence

Use your local mail program to send GenBank your query sequence. Most
mail programs allow you to import a file into the mail message. You
can import your sequence file into the mail message on the line after
"Begin". Please follow the format in the following example of a FASTA
request PRECISELY, but note that the program is case-insensitive, i.e.
either upper or lower case letters may be used.

This is an example of a mail message sent for a FASTA search. Note
that the first four lines are a mail header that is automatically
created when you address a mail message. Nothing need be entered for
the Subject. Each line of information must be less than 80 characters
in length. Longer lines will be truncated.

~From: drbob@someaddress.somewhere.edu Tue Jun 14 21:36:38 1988
~Date: 14 Jun 1988 2129:02-PDT
To: SEARCH@GENBANK.BIO.NET
~Subject:

The text that you enter into the body of the message begins with
DATALIB (do not add blank lines in the message):

DATALIB GenBank/other_mammalian
KTUP 4
SCORES 100
ALIGNMENTS 20
BEGIN
>BOVPRL GenBank entry BOVPRL from gbmam file.907 nucleotides.
tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
caccaccatggacagcaaa

The sequence is then sent to the FASTA Server at GenBank. Once your
message is received, it is placed in a batch queue and processed in
the order it is received.

Two queues called the fast and slow queues process FASTA requests. The
slow queue handles nucleic acid searches of "genbank/all" and
"embl/all." All other requests are placed in the fast queue. Searches
submitted to the fast queue require less CPU time and are completed
more quickly than those sent to the slow queue.
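
The request layout and queue rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the GenBank service: the names build_request and queue_for are hypothetical, the 60-column sequence wrap is an arbitrary choice that stays under the 80-character limit, and the queue rule assumes the query itself is a nucleic acid sequence.

```python
# Hypothetical helpers; names and the 60-column wrap are assumptions.
SLOW_QUEUE_LIBS = {"genbank/all", "embl/all"}  # handled by the slow queue
MAX_LINE = 79   # every line must be shorter than 80 characters
WRAP = 60       # sequence residues per line (arbitrary, below MAX_LINE)


def build_request(datalib, name, sequence, note="",
                  ktup=None, scores=None, alignments=None):
    """Return a message body in the documented order, with no blank lines."""
    lines = ["DATALIB " + datalib]
    # Optional parameters are written only when the caller supplies them,
    # so the server defaults apply otherwise.
    if ktup is not None:
        lines.append("KTUP " + str(ktup))
    if scores is not None:
        lines.append("SCORES " + str(scores))
    if alignments is not None:
        lines.append("ALIGNMENTS " + str(alignments))
    lines.append("BEGIN")
    # Pearson comment line: ">" + name, a space, and an optional note.
    header = (">" + name + " " + note).rstrip()
    if len(header) > MAX_LINE:
        raise ValueError("comment line would be truncated by the server")
    lines.append(header)
    residues = "".join(sequence.split())  # drop spaces and line breaks
    for start in range(0, len(residues), WRAP):
        lines.append(residues[start:start + WRAP])
    return "\n".join(lines) + "\n"


def queue_for(datalib):
    """Name the queue a nucleic acid search of this database would use."""
    return "slow" if datalib.lower() in SLOW_QUEUE_LIBS else "fast"
```

Calling build_request("GenBank/other_mammalian", "BOVPRL", seq, ktup=4) produces a body that starts with the DATALIB line and can be pasted into a mail message addressed to the server; queue_for only reflects the rule quoted above and says nothing about current queue load.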
If you would like to know the status of the queues being processed,
you can send a mail message to the FASTA Server address
(SEARCH@GENBANK.BIO.NET) containing the word "QUEUE" on a single line
of the mail message (Leave the Subject field blank). The fast queue is
labeled with the letter "d"; the slow queue is labeled with "e".

You cannot have more than one search waiting in the slow queue at any
one time. If you send an additional search to the slow queue before
your first request has been processed, the initial search will be
cancelled. At MOST you can have one executing search and one waiting
job in the slow queue at the same time.

Multiple jobs are currently permitted in the fast queue but please
limit your zeal since others also use the service. For example,
submitting ten jobs simultaneously to the fast queue would definitely
be in bad taste. We would prefer it if, after submitting 2 - 4 jobs to
the fast queue, you wait until your results are received before
submitting additional runs.

Handling the Results of a FASTA Search

When the results are returned, use your local mail program to retrieve
them. You can transfer the results of a FASTA search to a separate
disk file to free up space in your mail directory. Consult the
documentation for your local mail program for the commands to transfer
and read mail.

If you wish to obtain sequences of interest, use the e-mail retrieval
server mentioned below or the IRX searching system available through
the GenBank On-line Service. Contact GenBank for details
(415-962-7364).

Interpreting the Results of a FASTA Search

The mail message returned after the FASTA search will contain the
sequence name and length, the database searched, and the scoring
matrix used. When searching all of GenBank, each subdivision of
GenBank will also be displayed.

To achieve a rapid yet sensitive search, the FASTA program uses a
hierarchy of steps to determine scores for the sequences searched in
the database.
There are cutoff points in each of the scoring steps so that only
high scoring sequences are used in subsequent searching steps.

Three scores are tallied and reported: INITN, INIT1, and OPT. Each of
these scores is assigned to a sequence based on its rank at a specific
point in the similarity searching process. In comparing the query
sequence to a sequence in the database, the following steps are taken
to determine the three scores:

1. First, the ktup value is used to establish a matrix for comparing
   sequences. A value of 4 for a nucleic acid means that each group of
   4 consecutive residues of the query sequence and the database
   sequence will be compared. The sequences are compared on two
   perpendicular axes and a diagonal line is created when ktup matches
   with residues of the two sequences occur.

2. By joining match regions along the same diagonal that are not
   separated by excessive mismatches, initial regions of high
   similarity are identified. The 10 best diagonal regions of high
   similarity are used for further analysis.

3. An INIT1 score is then assigned to each region of high similarity.

4. Next, FASTA attempts to join regions on the diagonal and assign
   them an INITN score. The INITN score is determined by adding each
   of the INIT1 scores of the two regions to join and subtracting a
   constant value of 20 as a joining penalty. If the combined value of
   the region is less than the INIT1 score of either region, the
   regions are not joined. In this case, the INITN score will be equal
   to the INIT1 score of each region. Only the sequences that have an
   INITN score above a set cutoff point are kept for possible
   alignment.

5. Sequences with the highest INITN scores are then used for a
   Needleman-Wunsch/Smith-Waterman alignment to determine their OPT
   score. The OPT score is used to evaluate the alignments produced by
   FASTA.

A histogram of the score distributions for both the INITN and INIT1
scores will be displayed in the results.
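
Returning briefly to scoring steps 1 and 4 above, the following Python sketch may help make them concrete. It is not the FASTA program's code: the function names are hypothetical, the diagonal is taken to be the query offset minus the database offset (an assumed convention), and real scoring also applies a scoring matrix and cutoffs that are left out here.

```python
from collections import Counter

JOIN_PENALTY = 20  # joining penalty quoted in step 4


def ktup_diagonals(query, subject, ktup):
    """Count exact ktup-word matches on each diagonal (step 1).

    A diagonal is identified by (query offset - subject offset); words
    that match at a constant offset fall on the same diagonal.
    """
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


def join_regions(init1_a, init1_b):
    """Apply the step 4 joining rule to two regions.

    Returns the joined score, or None when the combined value falls
    below the INIT1 score of either region and the regions stay apart.
    """
    combined = init1_a + init1_b - JOIN_PENALTY
    if combined < init1_a or combined < init1_b:
        return None
    return combined
```

For instance, two regions scoring 84 and 58 would join at 84 + 58 - 20 = 122, while regions scoring 84 and 15 would not join, because 79 is below 84.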
In the histogram, the score value is given in the left column and the
number of sequences that were in that interval is displayed in the two
columns to the right. In the following example, there were 377
sequences with INITN scores that were greater than 12 but less than or
equal to 16. In the graphic histogram, "+"'s and "-"'s are used to
distinguish the bars for INITN and INIT1 scores, respectively, if the
number of scores differs.

Example:

       initn  init1
<   4     16     16:========
    8      0      0:
   12      1      1:=
   16    377    377:==================================================
   20   1272   1272:==================================================
   24   2224   2224:==================================================
   28   2717   2717:==================================================
   32   3147   3147:==================================================
   36   2921   2921:==================================================
   40   2064   2064:==================================================
   44   1243   1243:==================================================
   48    568    568:==================================================
   52    269    269:==================================================
   56    105    105:==================================================
   60     43     43:======================
   64     21     22:===========
   68      7      7:====
   72      3      3:==
   76     18     19:---------+
   80     11     11:======
   84     16     17:--------+
   88      8      8:====
   92      0      0:
   96      1      1:=
  100      0      0:
  104      1      0:+
  108      0      0:
  112      0      0:
  116      1      0:+
  120      0      0:
  124      1      0:+
  128      0      0:
  132      0      0:
  136      1      1:=
  140      0      0:
  144      0      0:
  148      0      0:
  152      0      0:
  156      0      0:
  160      0      0:
> 160      0      0:

KEY:  +  initn scores
      -  init1 scores
      =  no. of initn scores same as no. of init1 scores

The statistics of the search will be given after the histogram
including the total number of residues in the database searched, the
number of sequences searched, the average INITN and INIT1 scores with
their respective standard deviations, the number of scores that were
above the cutoff value, the value for ktup, and the value for fact.
searched 19156002 residues in 17047 sequences
mean initn score: 31.8 (s.d.= 8.44)
mean init1 score: 31.8 (s.d.= 8.44)
161 scores better than 55 saved, ktup: 4, fact: 4

The name and scores for the top 100 best-ranking sequences, as
determined by their INITN score, will be presented in the results. In
addition, the optimized alignments for the top 20 ranking sequences
are given as shown below. (Please note that the default values are 100
and 20 but may be more or less depending on the parameters and query
sequence submitted.) Only the region that was considered significant
by the program will be displayed.

The best scores are:                                   initn init1  opt
>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi   134   134  140
>M13TG117 - Phage M13tg117 cloning vector in 5' end of   122    84   87
>SYNPUC92B - Plasmid PUC9-2, a modified pUC9 vector wi   114    74   80
>M13TG115 - Phage M13tg115 cloning vector in the 5' en   103    63   63
>MUSP53MR - Mouse p53 cellular tumor antigen mRNA, com    96    96   96
   .
   .
   .

Alignments:

>SYNPUC81A - Plasmid PUC8-1, a modified pUC8 vector wi
initn= 134 init1= 134 opt= 140
80.0% identity in 65 nt overlap

               10        20        30        40        50        60
M13MP5 ATGACCATGATTACGAATTCCGGAATTCCGGAATT-CCGGAATTCCG--GAATTCC--CC
       X:::::::::::::::::: :::::::::: :::: v^::   ::::  ::  : :  ::
SYNPUC ATGACCATGATTACGAATTGCGGAATTCCGCAATTCCCGGGGATCCGTCGACCTGCAGCC
               10        20        30        40        50        60

               70        80
M13MP5 AAGCTTGGGAATTCCGGAATT
       :::::
SYNPUC AAGCTGCAAGCTTGCAGCTTG
               70        80
   .
   .
   .

Library scan: 0:05:20 total CPU time: 0:05:23

After all the alignments are printed out, the CPU time used for the
library (database) scan and the total CPU time will be displayed.

The following table shows the symbols used in the alignment and their
representation.

__________________________________________________________________

Symbol     Representation

:          an exact match
.          an ambiguous match or a match with a conservatively
           replaced amino acid
-          a gap in the sequence
X          boundaries of the initial region that are associated
           with the INIT1 score
^ and v    boundaries shifted during the final optimization step
           which replace "X"
__________________________________________________________________

Interpreting the Scores

The OPT score is derived from the alignment and is generally the best
score to evaluate the alignments produced by FASTA. Please note that
the program prints the scores in the order given by the INITN scores
and not the OPT scores. In general, sequences with high INITN scores
usually have high OPT scores but this is not always true. Also, the
OPT scores are determined for only some of the database sequences,
therefore the mean and standard deviation are not calculated. These
statistics can only be calculated for INITN and INIT1 scores.

For more information on interpreting the scores produced by a FASTA
search, consult Pearson and Lipman's paper presented at the beginning
of the help text.

Calculating Time Usage

The processing time for a FASTA search depends on: the size of the
queued sequence, the database selected, the ktup value, the number of
requests in the batch queue and the load on the GenBank computer.

Retrieving DataBank Entries found with FASTA

Database entries can be retrieved by either locus name or accession
number. To use the GenBank Retrieval System, send an electronic
message to RETRIEVE@GENBANK.BIO.NET containing as text (leave the
~Subject: line blank) either an accession number, or an entry name,
but not both. The message text should contain exactly one word.

The data banks are searched in the order: GenBank New Data, GenBank
current release, EMBL New Data, EMBL current release, GenPept New
Data, GenPept current release, and Swiss-Prot until a match is found.
If an entry exists in both GenBank and EMBL with the same accession
number (the usual case), a query on the accession number will return
the GenBank version of the entry. If the EMBL-format version is
required, it can be retrieved from the file server at
NETSERV@EMBL.BITNET (for instructions send a message containing the
line HELP to that address).

To retrieve GenPept entries, use the LOCUS name of the corresponding
GenBank entry followed by a _1, or _n where n represents the nth
coding region in that GenBank entry. For example, ASNTUBBA_1 is the
GenPept LOCUS name for the translation from GenBank entry ASNTUBBA.

An electronic version of the sequence data submission form used by the
sequence data banks is also available through the RETRIEVE server. To
receive a copy, send a message containing the word DATASUB as the only
line. Instructions for completing and submitting the form are
included.

Obtaining FASTA

The FASTA program (and other related programs) can be purchased for
VAX/VMS, SUN/Unix, IBM-PC and Macintosh computers. To obtain the
program for one of these systems, contact Dr. William Pearson at:

    Department of Biochemistry
    Box 440 Jordan Hall
    University of Virginia
    Charlottesville, VA 22908

or send electronic mail to:

    wrp@Virginia.BITNET

You can also obtain the programs by anonymous FTP from
uvaarpa.virginia.edu and accessing the file,

    public_access/fasta.shar

End of FASTA Server Help