Soop
soop - system for optimized overgo picking
soop [options] <PipMaker/BLAST filename>
soop [options] -p <primer filename> <FASTA filename>
Soop designs ``overgo'' DNA hybridization probes either from a sequence
comparison or from a plain FASTA file. The basic idea behind designing
probes from sequence comparisons is that orthologous sequences that are
highly conserved between two species are likely to also be conserved in a
third species. Soop runs in one of two modes, depending on the type of
file passed to it as the last parameter on the command line.
- FASTA
-
If a FASTA formatted file is passed to soop then the sequences are
converted to overgos optimizing for GC content. The only relevant options
are: -f -p -i -n -l -o -t -w
- NCBI BLAST or PipMaker Verbose
-
If a sequence comparison output is sent to soop, it will try to design
and pick overgos based on spacing between probes and sequence homology
for use in comparative species probe design. All options are
relevant.
- filename
-
Either PipMaker verbose or standard BLAST 2.0 output (requires
NHGRI::blastall.pm) to pick overgos from species comparison data, or FASTA
to just make overgos. This must be the last argument when calling soop.
- -a filename
-
``All overgos'' file. Contains the overgo from every pip that meets the GC
content requirements, optimizing first for sequence similarity and second
for GC content. This file is not written by default.
- -f filename
-
File containing only the selected overgos. In BLAST or PipMaker modes these
are chosen by maximizing percent identity and spacing, then GC content. In
FASTA mode the file simply contains all the overgos that could be made,
maximizing for GC content. Defaults to ``overgos''.
- -g number
-
The ideal spacing between probes in kilobases. Soop tries to make the probes
from ungapped alignments this far apart. Defaults to 30.
- -h
-
Print out this message (requires
perldoc in your path) and exit.
(Overrides all other options)
- -i decimal
-
Minimum proportion identity for probes to be made. In other words, if -i
is .80, no overgo will be made with less than 80% identity between the two
sequences. Defaults to .80.
- -l length
-
Length in bases of the overgo to design. If -l is 36 then the overgos will
be made of two primers whose final length after extension will be 36 base
pairs. Defaults to 36.
- -n text
-
Prefix for names of primers in .stj (primers) file. If this option is
present the primers will be named <prefix>##.1 and <prefix>##.2 for each
primer with the ## incrementing by one from 00. If a primer file is written
and this option is not specified the primer names will be the first ``words''
on the defline that match the character class [A-Za-z0-9_-.].
- -o length
-
Overlap in bases between the two primers. Defaults to 8
- -p filename
-
Name of file to write primers in (In ``Send to Jackie'' format). When making
overgos from a FASTA formatted file this is the primary output filename.
Not written by default.
- -Q
-
Use the Query sequence for probe spacing. Otherwise uses the reference
sequence to determine probe spacing. If you're blasting your reference
sequence against a library of other sequences then use this option.
- -r
-
Make probes/overgos from the ``reference'' or ``subject'' sequence. This does not
affect which coordinates are used for setting probe spacing. By default the
probes are made from the Query sequence.
- -s offset
-
Start at this offset (in bases). This option isn't used very much; it
describes the offset at which to start making probes. In other words, if
you wanted to make some probes but skip the first 100kb of reference
sequence you would set this to 100000. Defaults to 0.
- -S filename
-
Write summary information about the probes chosen to this file. Not written
by default.
- -t decimal
-
Target proportion GC for the overgo as a whole. Defaults to .5
- -v
-
Print the version number and exit. (Overrides all other options except -h)
- -w decimal
-
Wiggle room for the GC proportion. In other words, the amount the GC content
can vary from the ideal GC content; if -t was set to .5 and -w was set
to .06 then the GC content of the overgo would have to be in the range of
44% to 56% GC. Defaults to .06.
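As an illustration of how -t and -w combine, the following Python sketch
checks whether a candidate sequence falls inside the allowed GC window. The
function names are hypothetical and not part of soop, and treating the
window limits as inclusive is an assumption.

```python
def gc_fraction(seq):
    """Return the proportion of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_in_window(seq, target=0.5, wiggle=0.06):
    """True if the GC proportion lies within target +/- wiggle.

    The defaults mirror the documented -t and -w defaults. The small
    epsilon only guards against floating-point rounding at the edges.
    """
    return abs(gc_fraction(seq) - target) <= wiggle + 1e-9


if __name__ == "__main__":
    balanced = "GC" * 9 + "AT" * 9    # 18 of 36 bases are G or C
    at_rich = "GC" * 5 + "AT" * 13    # 10 of 36 bases are G or C
    print(gc_in_window(balanced))
    print(gc_in_window(at_rich))
```

With the defaults, a 36-base overgo passes when it has between 16 and 20
G or C bases.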
- overgo file
-
The overgo files (those written by the -f and -a options) are written
in FASTA format, with the defline in the following format:
><defline> ### pip:[ref:(<start>-<end>) query:(<start>-<end>) \
id:<percent_identity>] og:[id:<id>% offset:<start> diff:<diff>]
- <defline>
-
The defline of the query sequence if in a comparison mode (BLAST
or PipMaker). The defline of the fasta entry if in FASTA mode.
- pip:[ref:(<start>-<end>) query:(<start>-<end>) id:<%id>%]
-
The characteristics of the pip that this overgo was made from.
Start and ending coordinates in the reference/subject sequence
and start and end coordinates in the query sequence. If in
FASTA mode the reference coordinates (ref:(#-#)) will be 0 to
the length of the fasta entry. Either ``ref'' or ``query'' will
be capitalized to indicate which sequence the overgo was made
from.
- og:[id:<%id> offset:<start>
-
The characteristics of the overgo itself. <%id> is the percent
identity of the overgo alone. <start> is the number of bases
from the first base of the ungapped alignment or contig that
the overgo starts.
- diff:<diff>
-
This is a sequence comparison between the sequence of the overgo
and the other sequence. From this you can retrieve either the
overgo sequence or the sequence it was compared to. The format
is the overgo sequence with a > after any base that changes,
followed by what it changes to in the other sequence; _ indicates
a gap on either side. For example, the following aligned sequences:
-
ACTG TCAGA
|| |-:|-||
ACGGTCC GA
-
would be represented ``ACT>GG_>TCA>_GA''. The format can be converted
to the sequence of the overgo with the regular expression s/\>[ACTG]|_//g
or to the sequence it was compared to with the regular expression
s/[ACTG]\>|_//g.
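A minimal Python sketch of decoding the diff field is given here. The
function names are hypothetical. Applied literally to the example string,
the two substitutions quoted above leave a stray > wherever the character
next to it is a gap marker, so this sketch also lets _ appear on either side
of >. That extension is an assumption about the format, not something the
soop source is known to do. The sketch likewise assumes only upper-case
A, C, G, T and N appear in the field.

```python
import re


def overgo_sequence(diff):
    """Recover the overgo sequence from a diff string."""
    # Drop each ">" together with the character the overgo base changes
    # to, then drop the remaining gap markers.
    return re.sub(r">[ACGTN_]", "", diff).replace("_", "")


def compared_sequence(diff):
    """Recover the sequence the overgo was compared to."""
    # Drop each changed overgo character together with its ">", then
    # drop the remaining gap markers.
    return re.sub(r"[ACGTN_]>", "", diff).replace("_", "")


if __name__ == "__main__":
    example = "ACT>GG_>TCA>_GA"
    print(overgo_sequence(example))
    print(compared_sequence(example))
```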
- primer file
-
Format of the primer file is simply the name of the primer <tab> then
the sequence of that primer. Primer names are the overgo name with a
.1 or .2 appended depending on which strand the oligo is from. For example:
-
4href03.1 GCTGGGTGTGCCTGGGGATCAT
4href03.2 GTTTTAACAATAAAATGATCCC
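The relationship between -l, -o and the two primers can be sketched in
Python as follows. This is an inference from the example primers above
rather than a description of soop's code: it assumes each primer is
(length + overlap) / 2 bases long, that the .1 primer is the 5' end of the
overgo, and that the .2 primer is the reverse complement of its 3' end. The
function names and the reconstructed 36-base overgo are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_overgo(overgo, overlap=8):
    """Split an overgo into two primers that overlap by `overlap` bases."""
    length = len(overgo)
    if (length + overlap) % 2:
        raise ValueError("length + overlap must be even")
    primer_len = (length + overlap) // 2
    forward = overgo[:primer_len]
    # The second primer anneals to the 3' end, so it is read from the
    # opposite strand.
    reverse = reverse_complement(overgo[length - primer_len:])
    return forward, reverse


def primer_file_lines(name, overgo, overlap=8):
    """Format both primers as name<tab>sequence lines."""
    forward, reverse = split_overgo(overgo, overlap)
    return [f"{name}.1\t{forward}", f"{name}.2\t{reverse}"]


if __name__ == "__main__":
    p1 = "GCTGGGTGTGCCTGGGGATCAT"
    p2 = "GTTTTAACAATAAAATGATCCC"
    # Rebuild a 36-base overgo from the two example primers.
    overgo = p1 + reverse_complement(p2)[8:]
    for line in primer_file_lines("4href03", overgo):
        print(line)
```

Under these assumptions the default 36-base overgo with an 8-base overlap
yields two 22-base primers, which matches the lengths in the example.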
This is a brief summary of how we use soop.
The first thing to do is to put together a ``finished'' ungapped reference
sequence that will be used as a basis for the comparison and location
estimation for other species. This should be your best estimate of the
true finished sequence for the reference organism (human in our case).
Get comparison sequence from another species. We have used finished
mouse sequence from BACs that align to our reference sequence, and we
are now also using mouse whole genome shotgun traces downloaded from
Ensembl's web site (http://www.ensembl.org) that align to golden path
coordinates similar to ours.
Mask the repeats from both sequences. We use RepeatMasker
(http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker) to mask
both the reference and the other species sequences.
Run an NCBI blast (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/) or
PipMaker (http://bio.cse.psu.edu/pipmaker/) with a database of the
reference sequence and a query of the other species. If you run PipMaker
you'll need the Verbose text output (the one the web site says you don't
want or need).
Finally we get to running soop. You'll need to run soop using either the
BLAST2 standard text output or PipMaker verbose file. If you just want
the default parameters soop can be run as just soop text_file.
You must then re-blast the overgos file that soop produces against
appropriate databases to ensure that it hasn't designed low complexity or
repetitive probes. You should make sure that all high value hits make
sense as coming from one position in a given genome. If you have to drop
probes at this point you can use the -a option to find nearby
replacements.
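One way to script the soop step is sketched below in Python. It only
assembles an argument list from options documented above; the helper name
and the file names are hypothetical. Because soop is reported to overwrite
its input file in FASTA mode when an output name collides with it, the
sketch refuses to build such a command.

```python
import os


def build_soop_command(input_file, overgo_file="overgos", primer_file=None,
                       all_overgos_file=None, summary_file=None,
                       extra_options=()):
    """Return an argv list for soop, with the input file as last argument."""
    outputs = [overgo_file, primer_file, all_overgos_file, summary_file]
    target = os.path.abspath(input_file)
    clashes = [o for o in outputs if o and os.path.abspath(o) == target]
    if clashes:
        raise ValueError(f"output {clashes[0]!r} would overwrite the input")

    argv = ["soop", "-f", overgo_file]
    if primer_file:
        argv += ["-p", primer_file]
    if all_overgos_file:
        argv += ["-a", all_overgos_file]
    if summary_file:
        argv += ["-S", summary_file]
    argv += list(extra_options)
    argv.append(input_file)  # the input file must come last
    return argv


if __name__ == "__main__":
    print(build_soop_command("human_vs_mouse.blast",
                             all_overgos_file="all_overgos.fa",
                             summary_file="summary.txt"))
```

Writing the -a file up front means nearby replacement probes are already
available if some selected overgos are dropped after re-blasting. The
returned list can be handed to subprocess.run once soop is on the path.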
Uses Bio::SearchIO to parse blast reports. Available at http://www.bioperl.org
soop $Revision: 2.15 $ $Date: 2004/01/28 19:26:25 $ UTC
Files with long stretches of NNNNNN's that span more than one line cause
the PipMaker text file parser (parseVerbose) to barf and give incorrect
data (subject becomes query).
Will overwrite input file in FASTA mode without asking permission. This
is especially insidious when the input filename is ``overgos''. (note to self:
should be fixed in getParams())
Should have offset (-s) or ``start at'' option for FASTA mode.
Written by Arjun Prasad (aprasad@nhgri.nih.gov)
Some ideas taken from overgomaker by John D. McPherson
(overgo GC content optimization)
- BioPerl (Bio::SearchIO used to parse blast output)
-
http://www.bioperl.org
- RepeatMasker
-
http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker
- PipMaker
-
http://bio.cse.psu.edu/pipmaker/
- NCBI Blast 2
-
http://www.ncbi.nlm.nih.gov/BLAST/
-
Stand alone blast (recommended):
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/
- Ensembl
-
http://www.ensembl.org
PUBLIC DOMAIN NOTICE
This software/database is ``United States Government Work'' under the
terms of the United States Copyright Act. It was written as part of
the authors' official duties for the United States Government and thus
cannot be copyrighted. This software/database is freely available to
the public for use without a copyright notice. Restrictions cannot
be placed on its present or future use.
Although all reasonable efforts have been taken to ensure the accuracy
and reliability of the software and data, the National Human Genome
Research Institute (NHGRI) and the U.S. Government does not and cannot
warrant the performance or results that may be obtained by using this
software or data. NHGRI and the U.S. Government disclaims all
warranties as to performance, merchantability or fitness for any
particular purpose.
In any work or product derived from this material, proper
attribution of the authors as the source of the software
or data should be made, using: Thomas et.al., 2002 ``Parallel
Construction of Orthologous Sequence-Ready Clone Contig Maps in
Multiple Species'' Genome Res. 12:1277-85 as the citation.
Comments, suggestions and problems to
bioinformatics@nhgri.nih.gov