cosmoss.org documentation

Available databases

ppp_nr The nr (non-redundant) dataset is based on the ppp dataset (see below), but each cluster is represented only by the longest sequence it contains. This is similar to the NCBI's unique unigene approach. In theory, the nr set still represents all genes present in the ppp dataset.
ppp These are all the sequences that remain after clustering and assembly. They consist of contigs as well as singlets. Singlets can be part of clusters or independent. In addition, we included so-called problem sequences, which were flagged as possible chimeras (cloning artefacts). Please note that each cluster can contain more than one sequence! Ideally, each cluster contains one gene and its transcripts (with splice variants); however, multiple sequences in one cluster can also be due to close paralogs and/or cloning or sequencing artefacts.
ppp_orf Nucleic acid database. This dataset is based on the ppp_nr dataset. It contains the predicted ORFs that are longer than 150 nucleotides. ORF prediction was carried out using ESTScan and FrameD with species-specific models.
ppp_fil This database is a cleaned-up version of the raw EST input data, i.e. the raw data after filtering. During filtering, sequences that significantly match the E. coli genome or Physcomitrella rRNA, mitochondrial or plastid genes are removed. All sequence stretches that match vector sequence are excised. Sequences that contain fewer than 150 meaningful bases are then removed. Before the sequences are clustered, low-complexity, A-tail and repetitive regions are masked so that they do not disturb clustering and assembly. In addition, known plant UTRs from UTR-DB and repeats from Repbase are also masked.
ppp_raw The unfiltered EST data.
ppp_icm The "iterative contig members". For very large clusters, contigs are built iteratively. The information on the original member sequences is not present in the visualization of the contig, but can be recovered by a BLAST search against this dataset.
ppp_seeds The full-length, annotated CDS that have been used for seed clustering.
ppp_orfpep Peptide sequence database. This dataset is based on the ppp_nr dataset. It contains the predicted ORFs that are longer than 50 aa. ORF prediction was carried out using ESTScan and FrameD with species-specific models.
pp_fosmids High-quality, full-length sequences of genomic clones, produced by JGI as a quality control for the WGS data.
pp_traces The whole genome shotgun (WGS) reads or traces, produced by JGI. For historical reasons, the data are split into two databases. To search them simultaneously, use the advanced option -d database
Ceratodon The equivalent of the Physcomitrella ppp_nr data, produced for the available Ceratodon purpureus ESTs.
Tortula The equivalent of the Physcomitrella ppp_nr data, produced for the available Tortula ruralis ESTs.
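
The relationship between ppp and ppp_nr (one representative per cluster, namely the longest sequence) can be illustrated with a minimal sketch. The input layout of (cluster id, name, sequence) tuples, the function name and the tie-breaking rule are assumptions made for this example; they are not the file format or the code used to build the databases.

```python
def pick_representatives(sequences):
    """Keep the longest sequence of each cluster.

    `sequences` is an iterable of (cluster_id, name, sequence) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    best = {}
    for cluster_id, name, seq in sequences:
        current = best.get(cluster_id)
        # On ties the first sequence seen is kept (an assumption).
        if current is None or len(seq) > len(current[1]):
            best[cluster_id] = (name, seq)
    return best


# Toy data: two contigs in cluster 1001, one contig in cluster 1002.
example = [
    ("1001", "PPP_1001_C1", "ATGGCTGCTAAGGTT"),
    ("1001", "PPP_1001_C2", "ATGGCTGCT"),
    ("1002", "PPP_1002_C1", "ATGCCC"),
]
representatives = pick_representatives(example)
```

With the toy data, cluster 1001 is represented by PPP_1001_C1 because it is the longer of the two contigs.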

Nomenclature

A sequence of the form PPP_1001_C1 is a contig from a cluster.
In this case, contig 1 (C1) from cluster number 1001.
If the sequence name contains an additional sd (as in PPP_sd_112_C2), it is a seed cluster sequence. Seed clusters are produced prior to the main clustering and assembly steps. As seeds, we take all publicly available Physcomitrella CDS (coding sequences). Because of this, you will find previously characterised genes in seed clusters.
Singlets generally use the GenBank accession number they are derived from as their name.
Singlets that are part of clusters (clustered singlets) additionally contain the name of the cluster after the accession number.
Problem sequences are easily recognised by the PR- prefix in front of the sequence name. Such sequences have a high probability of being chimeras (cloning artefacts).
For a more detailed description of the information contained in the sequence headers, please read this PDF.
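
The naming rules above can be sketched as a small parser. This is only an illustration: the regular expression is derived from the two example names PPP_1001_C1 and PPP_sd_112_C2, the function name is hypothetical, and the exact layout of clustered singlet names is not specified here, so every name that is not a contig is simply reported as a singlet.

```python
import re

# Contig names as described above: PPP_<cluster>_C<contig>, with an
# optional "sd" marker for seed clusters (e.g. PPP_sd_112_C2).
CONTIG_RE = re.compile(r"^PPP_(?P<seed>sd_)?(?P<cluster>\d+)_C(?P<contig>\d+)$")


def classify_name(name):
    """Classify a sequence name according to the nomenclature rules."""
    # Problem sequences carry a PR- prefix in front of the name.
    problem = name.startswith("PR-")
    core = name[3:] if problem else name
    match = CONTIG_RE.match(core)
    if match:
        return {
            "kind": "contig",
            "seed": match.group("seed") is not None,
            "cluster": int(match.group("cluster")),
            "contig": int(match.group("contig")),
            "problem": problem,
        }
    # Assumption: everything else is a singlet named after its accession.
    return {"kind": "singlet", "name": core, "problem": problem}
```

For example, classify_name("PPP_sd_112_C2") reports contig 2 of seed cluster 112, while a made-up name such as "PR-AB123456" is reported as a problem singlet.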

Some definitions concerning EST clustering

Some explanations:
contig: a contig is a consensus sequence built from at least two sequences with local sequence similarity
singlet: a singlet is a sequence that did not find a matching partner during the initial pairwise comparisons between the input sequences (clustering) or during assembly of a specific cluster. The latter type of singlet is referred to as a clustered singlet.
cluster: a cluster is a pool of sequences that were found to have local similarities. During assembly, the pool of sequences is reduced to a single contig (ideally) or several contigs and/or singlets.
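
To make the terms concrete, the following sketch groups sequences into clusters from a list of pairwise similarities and shows how a sequence without any partner ends up alone. It uses plain single-linkage grouping (union-find) as a stand-in; the actual clustering and assembly procedure is not described in this section, and the function name and input names are hypothetical.

```python
def cluster_by_similarity(names, similar_pairs):
    """Group names into clusters; any chain of similar pairs joins a cluster."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values(), key=lambda members: members[0])


# est4 has no matching partner, so it remains alone.
clusters = cluster_by_similarity(
    ["est1", "est2", "est3", "est4"],
    [("est1", "est2"), ("est2", "est3")],
)
unmatched = [members[0] for members in clusters if len(members) == 1]
```

Here est1, est2 and est3 form one cluster even though est1 and est3 were never compared directly, and est4 corresponds to an independent singlet.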

 

The Sequence Retrieval User Manual

 

BLAST interface - alphabetical list of terms

alignments
the maximum number of detailed alignments to be shown (default: 50)
database
the database you want to search against
(see above, "Available databases" for details)
the databases listed above the horizontal line contain nucleic acid sequences, those below it peptide sequences
 
e-mail
activate this button to receive your results via e-mail (plain text recommended)
e-value threshold
the e-value threshold (cutoff), format 0.01 or 10e-2 accepted
(default: 10e-4 for peptide comparisons and 10e-2 for BLASTN)
gap extension
the gap extension penalty (default: 2 for BLASTN, otherwise 1 [for BLOSUM62])
gap open
the gap opening penalty (default: 5 for BLASTN, otherwise 11 [for BLOSUM62])
html
activate this button to retrieve your results in html format (with hyperlinks)
list size
the maximum number of hits that will be listed (default: 50)
matrix
the substitution matrix (default: BLOSUM62)
molecule
the molecule type of your input (query) sequence: nucleic acid or peptide
nofilter
activate to turn OFF low complexity filtering (default: filtering ON)
other
additional BLAST command line parameters (experts only)
output
determines whether you would like plain text or html output (default: html) and whether the result is displayed on a webpage (default) or sent by e-mail
query name
the name of your query sequence (an identifier for your search)
sequence
your sequence in plain text or in FastA format
 
sequence from file
instead of pasting your sequence into the form you can upload it from disk (FastA format accepted)
subsequence
only the given range of your sequence will be used
text
activate this button to receive your results as plain text
type
specify the kind of blast search you want to execute (the default is automatically determined by input sequence type vs. database type!)
ungapped
activate if you don't want to allow gaps (default: allow gaps)
wordsize
the word size k (default: 11 for BLASTN, otherwise 3)
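
The defaults listed above depend on the kind of search, which in turn depends on the query and database types. A minimal sketch of that dependency follows. The mapping from (molecule, database type) to a BLAST program uses the standard BLAST program names and is an assumption about how the form chooses its default; the dictionary keys and the function name are hypothetical. The e-values are kept as the strings given in the glossary.

```python
# Defaults taken from the glossary entries above.
NUCLEOTIDE_DEFAULTS = {
    "gap_open": 5, "gap_extension": 2, "wordsize": 11, "evalue": "10e-2",
}
PEPTIDE_DEFAULTS = {
    "gap_open": 11, "gap_extension": 1, "wordsize": 3, "evalue": "10e-4",
    "matrix": "BLOSUM62",
}

# Assumed mapping: (query molecule, database type) -> BLAST program.
PROGRAMS = {
    ("nucleic acid", "nucleic acid"): "blastn",
    ("nucleic acid", "peptide"): "blastx",
    ("peptide", "nucleic acid"): "tblastn",
    ("peptide", "peptide"): "blastp",
}


def default_search(molecule, database_type):
    """Return the search type and its documented default parameters."""
    program = PROGRAMS[(molecule, database_type)]
    base = NUCLEOTIDE_DEFAULTS if program == "blastn" else PEPTIDE_DEFAULTS
    params = dict(base)
    params.update({
        "program": program,
        "alignments": 50,   # maximum number of detailed alignments
        "list_size": 50,    # maximum number of listed hits
        "filter": True,     # low complexity filtering is ON by default
        "gapped": True,     # gaps are allowed by default
    })
    return params
```

For instance, a peptide query against ppp_nr (a nucleic acid database) would, under these assumptions, be run as tblastn with gap open 11, gap extension 1 and word size 3.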


