National Human Genome Research Institute, National Institutes of Health

Soop


NAME

soop - system for optimized overgo picking


SYNOPSIS

soop [options] <PipMaker/BLAST filename>

soop [options] -p <primer filename> <FASTA filename>


DESCRIPTION

Soop designs ``overgo'' DNA hybridization probes either from a sequence comparison or from a plain FASTA file. The basic idea behind designing probes from sequence comparisons is that orthologous sequences that are highly conserved between two species are likely to be conserved in a third species as well. Soop runs in one of these two modes depending on the type of file passed to it as the last parameter on the command line.

FASTA
If a FASTA-formatted file is passed to soop, the sequences are converted to overgos, optimizing for GC content. The only relevant options are: -f -p -i -n -l -o -t -w

NCBI BLAST or PipMaker Verbose
If sequence comparison output is passed to soop, it will try to design and pick overgos based on spacing between probes and sequence homology, for use in comparative species probe design. All options are relevant.


OPTIONS

filename
Either PipMaker verbose or standard BLAST 2.0 output (requires NHGRI::blastall.pm), to pick overgos from species comparison data, or FASTA, to just make overgos. This must be the last argument when calling soop.

-a filename
``All overgos'' file. Contains the overgo from every pip that meets the GC content requirements, optimizing first for sequence similarity and second for GC content. This file is not written by default.

-f filename
File containing the selected overgos only. In BLAST or PipMaker modes these are chosen by maximizing percent identity and spacing, then GC content. In FASTA mode the file simply contains all the overgos that could be made, maximizing for GC content. Defaults to ``overgos''.

-g number
The ideal spacing between probes in kilobases. Soop tries to make the probes from ungapped alignments this far apart. Defaults to 30.

-h
Print out this message (requires perldoc in your path) and exit. (Overrides all other options)

-i decimal
Proportion minimum identity for probes to be made. In other words if i is .80 no overgo will be made with less than 80% similarity between the two sequences. Defaults to .80.

-l length
Length in bases of the overgo to design. If l is 36 then each overgo will be made of two primers whose final length after extension will be 36 base pairs. Defaults to 36.

-n text
Prefix for names of primers in .stj (primers) file. If this option is present the primers will be named <prefix>##.1 and <prefix>##.2 for each primer with the ## incrementing by one from 00. If a primer file is written and this option is not specified the primer names will be the first ``words'' on the defline that match the character class [A-Za-z0-9_-.].

-o length
Overlap in bases between the two primers. Defaults to 8.

-p filename
Name of the file to write primers to (in ``Send to Jackie'' format). When making overgos from a FASTA formatted file this is the primary output filename. Not written by default.

-Q
Use the Query sequence for probe spacing. Otherwise uses the reference sequence to determine probe spacing. If you're blasting your reference sequence against a library of other sequences then use this option.

-r
Make probes/overgos from the ``reference'' or ``subject'' sequence. This does not affect which coordinates are used for setting probe spacing. By default the probes are made from the Query sequence.

-s offset
Start at this offset (in bases). This option isn't used very much; it describes an offset at which to start making probes. In other words, if you wanted to make some probes but skip the first 100kb of reference sequence you would set this to 100000. Defaults to 0.

-S filename
Write summary information about the probes chosen to this file. Not written by default.

-t decimal
Target proportion GC for the overgo as a whole. Defaults to .5

-v
Print the version number and exit. (Overrides all other options except -h)

-w decimal
Wiggle room for the GC proportion. In other words, the amount the GC content can vary from the ideal GC content; if -t was set to .5 and -w was set to .06 then the GC content of the overgo would have to be in the range of 44% to 56% GC. Defaults to .06.
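
The -t and -w options together define an acceptance window for GC content. A minimal Python sketch of that check follows. The function names are hypothetical, and the sliding-window helper is only one plausible way to ``optimize for GC content''; it is not soop's actual selection logic, which this page does not describe.

```python
def gc_fraction(seq):
    """Return the proportion of G and C bases in seq (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_in_range(seq, target=0.5, wiggle=0.06):
    """True if GC content is within target +/- wiggle (defaults mirror -t and -w)."""
    # Small tolerance so boundary values such as 0.44 are not lost to rounding.
    return abs(gc_fraction(seq) - target) <= wiggle + 1e-9


def best_window(seq, length=36, target=0.5):
    """Return (offset, window) for the window whose GC content is closest to target.

    Illustrative assumption: a plain left-to-right scan; the first best window wins.
    """
    if len(seq) < length:
        raise ValueError("sequence shorter than overgo length")
    offsets = range(len(seq) - length + 1)
    best = min(offsets, key=lambda i: abs(gc_fraction(seq[i:i + length]) - target))
    return best, seq[best:best + length]


if __name__ == "__main__":
    print(gc_in_range("ACGT" * 9))            # made-up 36-mer with 50% GC
    print(gc_in_range("AT" * 14 + "GC" * 4))  # made-up 36-mer with about 22% GC
```

With the defaults, a 36-base overgo passes the check when it contains 16 to 20 G or C bases.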


OUTPUT FILE FORMATS

overgo file
The overgo files (those written by the -f and -a options) are written in FASTA format, with the defline in the following format:

 ><defline> ### pip:[ref:(<start>-<end>) query:(<start>-<end>) \
    id:<percent_identity>] og:[id:<id>% offset:<start> diff:<diff>]
<defline>
The defline of the query sequence if in a comparison mode (BLAST or PipMaker). The defline of the fasta entry if in FASTA mode.

pip:[ref:(<start>-<end>) query:(<start>-<end>) id:<%id>%]
The characteristics of the pip that this overgo was made from. Start and ending coordinates in the reference/subject sequence and start and end coordinates in the query sequence. If in FASTA mode the reference coordinates (ref:(#-#)) will be 0 to the length of the fasta entry. Either ``ref'' or ``query'' will be capitalized to indicate which sequence the overgo was made from.

og:[id:<%id> offset:<start>
The characteristics of the overgo itself. <%id> is the percent identity of the overgo alone. <start> is the number of bases from the first base of the ungapped alignment or contig to the start of the overgo.

diff:<diff>
This is a sequence comparison between the sequence of the overgo and the other sequence. From this you can retrieve either the overgo sequence or the sequence it was compared to. The format is the overgo sequence with a > after any base that differs, followed by the corresponding base in the other sequence; _ indicates a gap on either side. For example, the following aligned sequences:

    ACTG TCAGA
    || |-:|-||
    ACGGTCC GA

would be represented as ``ACT>GG_>TT>CCA>_GA''. The format can be converted to the sequence of the overgo with the Perl substitution s/\>[ACTG_]|_//g or to the sequence it was compared to with the substitution s/[ACTG_]\>|_//g.
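
The diff encoding described above can be sketched in Python. This is an illustration based only on the description on this page, not soop's own code. The function names encode_diff and decode_diff are hypothetical, and the sketch assumes that each alignment column is written either as a single base (no change) or as X>Y (overgo base, then the other sequence's base), with _ standing for a gap.

```python
import re

# One alignment column: a base or gap, optionally followed by ">" and the
# corresponding base or gap in the other sequence.
COLUMN = re.compile(r"([ACGT_])(?:>([ACGT_]))?")
WHOLE = re.compile(r"(?:[ACGT_](?:>[ACGT_])?)*")


def encode_diff(overgo_aln, other_aln):
    """Encode two equal-length aligned sequences ('-' marks a gap) as a diff string."""
    if len(overgo_aln) != len(other_aln):
        raise ValueError("aligned sequences must have the same length")
    columns = []
    for a, b in zip(overgo_aln.upper(), other_aln.upper()):
        a = "_" if a == "-" else a
        b = "_" if b == "-" else b
        columns.append(a if a == b else f"{a}>{b}")
    return "".join(columns)


def decode_diff(diff):
    """Return (overgo_sequence, other_sequence) with gaps removed."""
    if not WHOLE.fullmatch(diff):
        raise ValueError("not a valid diff string")
    overgo, other = [], []
    for match in COLUMN.finditer(diff):
        a, b = match.group(1), match.group(2)
        overgo.append(a)
        other.append(a if b is None else b)  # unchanged column: same base
    strip = lambda bases: "".join(bases).replace("_", "")
    return strip(overgo), strip(other)


if __name__ == "__main__":
    # The aligned sequences from the example above, with '-' for the gaps.
    diff = encode_diff("ACTG-TCAGA", "ACGGTCC-GA")
    print(diff)
    print(decode_diff(diff))
```

Tokenizing column by column, rather than deleting characters with a single substitution, keeps the two gap cases (a gap in the overgo and a gap in the other sequence) symmetric.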

primer file
The format of the primer file is simply the name of the primer, a <tab>, then the sequence of that primer. Primer names are the overgo name with a .1 or .2 appended depending on which strand the oligo is from. For example:
    4href03.1   GCTGGGTGTGCCTGGGGATCAT
    4href03.2   GTTTTAACAATAAAATGATCCC
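
The example primers above are consistent with the defaults -l 36 and -o 8: each primer is 22 bases long, and the last 8 bases of the .1 primer match the start of the reverse complement of the .2 primer. A rough Python sketch of that relationship follows. The function names are hypothetical, and the primer length formula (l + o) / 2 is inferred from this example rather than taken from soop's source.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_overgo(overgo, overlap=8):
    """Split an overgo into two primers from opposite strands.

    Assumes both primers have the same length, (len(overgo) + overlap) / 2.
    """
    total = len(overgo) + overlap
    if total % 2:
        raise ValueError("overgo length plus overlap must be even")
    primer_len = total // 2
    primer1 = overgo[:primer_len]            # top strand, 5' end of the overgo
    primer2 = revcomp(overgo[-primer_len:])  # bottom strand, 3' end of the overgo
    return primer1, primer2


def assemble_overgo(primer1, primer2, overlap=8):
    """Rebuild the full overgo from its two primers, checking the overlap."""
    bottom = revcomp(primer2)
    if primer1[-overlap:] != bottom[:overlap]:
        raise ValueError("primers do not overlap as expected")
    return primer1 + bottom[overlap:]


if __name__ == "__main__":
    # Primer sequences copied from the primer file example above.
    p1 = "GCTGGGTGTGCCTGGGGATCAT"
    p2 = "GTTTTAACAATAAAATGATCCC"
    overgo = assemble_overgo(p1, p2)
    print(len(overgo), overgo)
    print(split_overgo(overgo) == (p1, p2))
```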


USAGE

This is a brief summary of how we use soop.

  1. The first thing to do is to put together a ``finished'' ungapped reference sequence that will be used as a basis for the comparison and location estimation for other species. This should be your best estimate of the true finished sequence for the reference organism (human in our case).

  2. Get comparison sequence from another species. We have used finished mouse sequence from BACs that align to our reference sequence, and we now also use mouse whole genome shotgun traces downloaded from Ensembl's web site (http://www.ensembl.org) that align to golden path coordinates similar to ours.

  3. Mask the repeats from both sequences. We use RepeatMasker (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker) to mask both the reference and the other species sequences.

  4. Run an NCBI BLAST (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/) or PipMaker (http://bio.cse.psu.edu/pipmaker/) comparison with a database of the reference sequence and a query of the other species. If you run PipMaker you'll need the Verbose text output (the one the web site says you don't want or need).

  5. Finally we get to running soop. You'll need to run soop on either the BLAST2 standard text output or the PipMaker verbose file. If you just want the default parameters, soop can be run simply as: soop text_file

  6. You must then re-blast the overgos file that soop produces against appropriate databases to ensure that it hasn't designed low complexity or repetitive probes. You should make sure that all high value hits make sense as coming from one position in a given genome. If you have to drop probes at this point you can use the -a option to find nearby replacements.


FILES

Uses Bio::SearchIO to parse blast reports. Available at http://www.bioperl.org


VERSION

soop $Revision: 2.15 $ $Date: 2004/01/28 19:26:25 $ UTC


BUGS

Files with long stretches of NNNNNN's that span more than one line cause the PipMaker text file parser (parseVerbose) to barf and give incorrect data (subject becomes query).

Will overwrite input file in FASTA mode without asking permission. This is especially insidious when the input filename is ``overgos''. (note to self: should be fixed in getParams())

Should have offset (-s) or ``start at'' option for FASTA mode.


AUTHOR

Written by Arjun Prasad (aprasad@nhgri.nih.gov)

Some ideas taken from overgomaker by John D. McPherson (overgo GC content optimization)


SEE ALSO

BioPerl (Bio::SearchIO used to parse blast output)
http://www.bioperl.org

RepeatMasker
http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker

PipMaker
http://bio.cse.psu.edu/pipmaker/

NCBI Blast 2
http://www.ncbi.nlm.nih.gov/BLAST/

Stand alone blast (recommended): ftp://ftp.ncbi.nlm.nih.gov/blast/executables/

Ensembl
http://www.ensembl.org


COPYRIGHT

PUBLIC DOMAIN NOTICE

This software/database is ``United States Government Work'' under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This software/database is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the National Human Genome Research Institute (NHGRI) and the U.S. Government does not and cannot warrant the performance or results that may be obtained by using this software or data. NHGRI and the U.S. Government disclaims all warranties as to performance, merchantability or fitness for any particular purpose.

In any work or product derived from this material, proper attribution of the authors as the source of the software or data should be made, using: Thomas et al., 2002 ``Parallel Construction of Orthologous Sequence-Ready Clone Contig Maps in Multiple Species'' Genome Res. 12:1277-85 as the citation.


Comments, suggestions and problems to bioinformatics@nhgri.nih.gov

