Skip redundant sequences

Description

Remove redundant sequences from an input set

This tool runs the EMBOSS command skipredundant, which removes redundancy from an input file of sequences, either at a single threshold of sequence similarity (e.g. 40%) or within a threshold range of sequence similarity (e.g. 40% - 70%). Redundancy is calculated for each pair of sequences in turn, so the number of comparisons grows quadratically with the number of sequences and this tool is not suitable for large data sets.
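
To illustrate the idea of pairwise filtering at a single threshold, here is a minimal Python sketch. It is not the skipredundant implementation: the function names percent_identity and remove_redundant are hypothetical, the similarity measure is a crude ungapped position-by-position comparison assumed for brevity (not an alignment-based score), and the greedy "longest first" strategy is a simplification chosen for this example.

```python
from typing import Dict


def percent_identity(a: str, b: str) -> float:
    """Crude stand-in for alignment-based similarity (assumption for this sketch).

    Compares the two sequences position by position without gaps and
    divides the number of matches by the length of the longer sequence.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def remove_redundant(seqs: Dict[str, str], threshold: float = 40.0) -> Dict[str, str]:
    """Greedy filter: keep a sequence only if its similarity to every
    already-kept sequence is at or below the threshold.

    Sequences are visited from longest to shortest, so the longer member
    of a redundant pair is the one that survives. Every candidate may be
    compared with every kept sequence, hence the quadratic worst case.
    """
    kept: Dict[str, str] = {}
    ordered = sorted(seqs.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in ordered:
        if all(percent_identity(seq, other) <= threshold for other in kept.values()):
            kept[name] = seq
    return kept


# Toy data: "a" and "b" differ at one position, "c" is unrelated.
example = {
    "a": "ACGTACGTAC",
    "b": "ACGTACGTAA",
    "c": "TTTTGGGGCC",
}
non_redundant = remove_redundant(example, threshold=40.0)
print(sorted(non_redundant))  # prints ['a', 'c']
```

In this toy set, "a" and "b" are 90% identical under the assumed measure, so only one of them is retained, while "c" stays because it is at most 30% identical to the kept sequence.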

Redundancy in a collection of sequences occurs when two or more similar sequences are present. The inclusion of very similar sequences in certain analyses will introduce undesirable bias. For example, a family may possess 100 sequences in the sequence database, but 90 of these might be essentially the same sequence, e.g. very close relatives or mutations of a single sequence. Although 100 sequences are known, the family contains only 11 sequences that are essentially unique. For many applications it is desirable or even essential to remove redundant sequences from a set in order to produce a smaller set that is representative of the whole.

The input sequences can be in any of the EMBOSS-compatible sequence alignment formats. Note that this tool reads just one input file, which should contain all the sequences to be analyzed.

For more details, please check the manual page of the skipredundant command.

Output

Output is a FASTA-formatted sequence file.
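
As a quick check of the result, the output file can be read with a few lines of Python to count how many sequences were retained. The sketch below is only an illustration: parse_fasta is a hypothetical helper, and the sample lines stand in for the contents of an output file.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a dict of {identifier: sequence}.

    The identifier is the first whitespace-delimited word after '>'.
    Sequence lines belonging to one record are concatenated.
    """
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Sample lines standing in for an output file; with a real file one could
# pass an open file handle instead, since it is also an iterable of lines.
sample = [">seq1 example description", "ACGT", "AC", "", ">seq2", "TTTT"]
retained = parse_fasta(sample)
print(len(retained))  # prints 2
```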