PlanTAPDB Documentation

PlanTAPDB Retrieval documentation
PlanTAP Family Viewer documentation
PlanTAP Family Members Viewer documentation
PlanTAP Family Cluster Viewer documentation

PlanTAPDB Retrieval documentation

We have determined a comparative set of transcription associated protein (TAP) families, focused on, but not limited to, land plants. Besides transcription factors (TF), which have been extensively studied in seed plants, PlanTAPDB includes transcriptional regulators (TR) and putative TAPs (PT) and covers a broad taxonomic range including algae and a moss. For a detailed description of the TAP classifications, see the category page.

Gene clusters were built by using PSI-BLAST searches and subsequent filtering and clustering steps followed by a comprehensive manual annotation procedure. Phylogenies were created in an automated way for all families using a combination of distance and maximum likelihood methods.

The TAP families are listed below, grouped according to their functional category, and link to the corresponding family entries.

Sandra Richardt, Daniel Lang, Ralf Reski, Wolfgang Frank and Stefan A. Rensing (2006): PlanTAPDB - A phylogeny-based comprehensive resource of plant transcription associated proteins (submitted)

accession_number
category
main_family
search
sub_family

accession_number Each PlanTAP entry has its distinct accession number, a 5-character string composed of a leading two-letter category code and a trailing unique three-digit number. Pattern: (TF|TR|PT)([0-9]{3})
category Each PlanTAP family entry belongs to one of the following categories:
  1. DNA-binding transcription factors (TF), which directly activate or repress transcription of target genes upon binding to the promoter or upstream enhancer / silencer elements
  2. Transcriptional regulators (TR), comprising general transcription initiation factors (interacting with RNA polymerase II and/or core promoter elements and recruiting components of the basal transcription machinery), co-activators / -repressors (binding to and influencing the activity of TFs) and chromatin remodelling factors (affecting the accessibility of DNA through histone modifications and DNA methylation)
  3. Putative TAP (PT) with unknown function and/or domains that are possibly associated with transcriptional regulation
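The accession number pattern and the three categories can be checked programmatically. The following is a minimal Python sketch, not part of PlanTAPDB itself; the function name parse_accession and the example accession "TF001" are hypothetical and serve only as illustration.

```python
import re

# Pattern quoted in the documentation: category code followed by three digits.
ACCESSION_RE = re.compile(r"(TF|TR|PT)([0-9]{3})")

# Category names as listed in the documentation.
CATEGORIES = {
    "TF": "DNA-binding transcription factor",
    "TR": "Transcriptional regulator",
    "PT": "Putative TAP",
}


def parse_accession(accession):
    """Return (code, category name, number), or None if the string is malformed."""
    match = ACCESSION_RE.fullmatch(accession)
    if match is None:
        return None
    code, digits = match.groups()
    return code, CATEGORIES[code], int(digits)


# "TF001" is a made-up accession used only to show the call.
print(parse_accession("TF001"))
print(parse_accession("XX123"))
```

Using fullmatch rather than search rejects strings that merely contain a valid accession, such as one with extra trailing digits.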
main_family Each PlanTAP entry was annotated to belong to an existing or new family of TAPs.
search Using the search function, PlanTAPDB can be queried by searching the following database fields:
  • family accession number: Each PlanTAP entry has its distinct accession number, a 5-character string composed of a leading two-letter category code and a trailing unique three-digit number. Query type possible: Lookup only
  • family_id: Unique integer describing a distinct PlanTAP family. Query type possible: Lookup only
  • categories: Search the category field of PlanTAPDB. Query type possible: Lookup only
  • families: Search the main_family and sub_family fields of PlanTAPDB. Query type possible: Full support of the POSIX regular expression syntax. Example query: To get both "TFIIB" and "TFIIH", use the term "TFII[BH]". Another common wildcard is ".*", which matches any sequence of characters. For example, "^H.*" will find all families beginning (the "^") with "H".
  • family member names: Retrieve PlanTAP entries by the accession number of a member sequence. The accession numbers of the member sequences are the identifiers of their originating databases, e.g. UniProt, GenPept, TAIR, Cosmoss ... Query type possible: Lookup only
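The two example queries for the families field can be tried offline. This is a minimal Python sketch under two assumptions: the list of family names is hypothetical, and the database is assumed to match the expression anywhere in the field. Python's re module is not a POSIX implementation, but these two simple patterns behave the same way in both syntaxes.

```python
import re

# Hypothetical family names, used only to illustrate the two example queries.
families = ["TFIIB", "TFIID", "TFIIH", "HMG", "Homeobox", "MADS"]


def search_families(pattern, names):
    """Return the names in which the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


print(search_families("TFII[BH]", families))  # ['TFIIB', 'TFIIH']
print(search_families("^H.*", families))      # ['HMG', 'Homeobox']
```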
sub_family Some of the PlanTAP families can be further divided into subfamilies.

PlanTAP Family Viewer documentation

Filters
accession_number
category
citation(s)
consensus domain(s)
domain structure
homology reduction
in family clusters
in_tree?
is a?
last modified
main_family
member species
number of clusters
number of members
number of members in trees
number of non-redundant members
number of queries
redundancy removal
redundant?
sub_family
taxonomic profile
user_contributed_trees

Filters The family member list can be filtered using the criteria described in this section. In addition, the information displayed for each family member can be switched between full textual output and a graphical domain structure by using the "graphical view of the family members' domain structures" checkbox. Selection of multiple filters is additive: if you modify several filters, all of them are applied to the family member list after you click "Filter member list". Use the "Reset" button to return the filters to their defaults and avoid unexpected results after modifying filter parameters. Note that the "Reset" button resets only the selected filter parameters, not the family members displayed below. To display the full list of family members after a reset, click "Filter member list" again.
accession_number Each PlanTAP entry has its distinct accession number, a 5-character string composed of a leading two-letter category code and a trailing unique three-digit number. Pattern: (TF|TR|PT)([0-9]{3})
category Each PlanTAP family entry belongs to one of the following categories:
  1. DNA-binding transcription factors (TF), which directly activate or repress transcription of target genes upon binding to the promoter or upstream enhancer / silencer elements
  2. Transcriptional regulators (TR), comprising general transcription initiation factors (interacting with RNA polymerase II and/or core promoter elements and recruiting components of the basal transcription machinery), co-activators / -repressors (binding to and influencing the activity of TFs) and chromatin remodelling factors (affecting the accessibility of DNA through histone modifications and DNA methylation)
  3. Putative TAP (PT) with unknown function and/or domains that are possibly associated with transcriptional regulation
citation(s) Related literature references describing the PlanTAP entry. Follow the hyperlink to view the corresponding PubMed entry.
consensus domain(s) In the manual annotation process, matching InterPro domains were condensed to a set of consensus domains common to the majority of members. The entries are directly hyperlinked to the corresponding InterPro entry. If your browser supports mouse-over information, use this to display additional information.
domain structure Display a graphical view of the members' domain structures instead of the default full textual view. The images are scaled in relation to the longest displayed member sequence. CAUTION: depending on the number of members to be displayed, this may take a while!
homology reduction Phylogenetic inference of large clusters is computationally costly, and the interpretation of results from huge trees is difficult. A total of 102 clusters had more than 150 members; these were condensed via stepwise homology reduction until the threshold of 150 members was reached. Homology reduction was implemented in the same program as redundancy removal, but follows a different strategy. Beginning with 1 substitution per 100 aa and heuristically increasing this distance threshold, the distance matrix is iteratively scanned for sequence pairs with the respective distance, regardless of their species. The iteration stops when the number of remaining representative cluster members reaches a given limit (150 sequences).
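The stepwise procedure can be illustrated with a short Python sketch. It is not the original Perl implementation: the fixed step size of 0.01, the rule that the first sequence of a pair is kept, and the names homology_reduction, kept and represented_by are assumptions made for this example. Distances are given in substitutions per site, so 0.01 corresponds to 1 substitution per 100 aa.

```python
from itertools import combinations


def homology_reduction(names, dist, limit=150, start=0.01, step=0.01):
    """Drop one sequence of each close pair until at most `limit` remain.

    names: list of sequence identifiers
    dist:  dict mapping frozenset({a, b}) to a distance in substitutions per site
    """
    kept = list(names)
    represented_by = {}  # removed sequence -> sequence that now represents it
    threshold = start
    max_dist = max(dist.values(), default=0.0)
    while len(kept) > limit and threshold <= max_dist + step:
        # Scan all pairs, regardless of species, at the current threshold.
        for a, b in combinations(list(kept), 2):
            if len(kept) <= limit:
                break
            if a in represented_by or b in represented_by:
                continue  # one partner was already removed in this scan
            if dist[frozenset((a, b))] <= threshold:
                kept.remove(b)
                represented_by[b] = a
        threshold += step  # assumption: fixed increment instead of a heuristic
    return kept, represented_by
```

The returned mapping corresponds to the idea of a representative: every removed sequence points to the retained sequence that stands in for it in the trees.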
in family clusters Display only entries from specific PlanTAP family clusters.
in_tree? Filter the member list using the "in tree" property described under "in tree" in the family members viewer documentation section.
is a? Filter the member list using the "is a" property described under "is a" in the family members viewer documentation section.
last modified Timestamp of the last modification of the entry.
main_family Each PlanTAP entry was annotated to belong to an existing or new family of TAPs.
member species Filter the member list to show only entries having exactly the same taxonomy string (NCBI taxonomy full lineage represented by a single NCBI taxon ID). Each binary species name stands for a full lineage taxonomy string, e.g. Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; commelinids; Poales; Poaceae; BEP clade; Ehrhartoideae; Oryzeae; Oryza; Oryza sativa (japonica cultivar-group). Due to the nature of sequence submission in, e.g., GenBank, there may be another entry with the same binary name but a slightly divergent lineage. These will be two different entries in the filter list. Since multiple selections are possible, this should not be a problem.
number of clusters Total number of clusters describing the PlanTAP family (= number of trees for the family). Multiple clusters depict the particular TAP family either from a different taxonomic perspective (e.g. restricted to the plant lineage vs. covering all kingdoms), or comprise different subfamilies. Because large TAP gene families are substantially divergent beyond their conserved domains, it appears more reasonable to deduce phylogenies from subgroups in order to utilize as much homologous sequence information as possible.
number of members Total number of family member sequences.
number of members in trees Total number of member sequences after homology reduction.
number of non-redundant members Total number of family member sequences after the redundancy removal. See redundancy removal for more details.
number of queries Total number of query sequences in the PlanTAP family.
redundancy removal While it greatly improves taxon sampling, the strategy of using both a large multi-species database like UniProt and the individual full-genome protein predictions results in the detection of identical protein sequences in these overlapping databases. In addition, the same locus is often represented by more than one protein sequence due to divergent predicted gene models, splice variants, and sequencing and annotation errors. To cope with this problem, redundant copies of genes were eliminated prior to all functional analyses using an identity cutoff of 99% for sequences of the same species. For the removal of redundant sequences, a multiple sequence alignment was performed using MAFFT FFT-NS-2 and pairwise distances were calculated using the EMBOSS distmat program. The resulting matrix was scanned for sequence pairs from the same species with a distance of at most 1 substitution per 100 aa. For each pair, one representative was selected based on the originating database (UniProt sequences were preferred), sequence length and lexical sort order of the accession number. The procedure was implemented in Perl using several BioPerl modules, including a modified version of the Bio::Tools::Run::Alignment::MAFFT module. For the parsing of the distmat distance matrices, an object-oriented BioPerl module (Bio::Matrix::IO::distmat) was written.
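The selection of a representative for each redundant pair can be sketched in Python as follows. This is an illustration, not the original Perl code. Several details are assumptions: the cutoff is treated as inclusive (0.01 substitutions per site), longer sequences are preferred over shorter ones, the lexically smaller accession wins a tie, and the record fields and accessions are made up.

```python
from itertools import combinations


def sort_key(seq):
    """Ranking for representatives: UniProt first, then longer, then lexical order."""
    return (seq["database"] != "UniProt", -seq["length"], seq["accession"])


def remove_redundancy(seqs, dist, cutoff=0.01):
    """Collapse near-identical sequence pairs that come from the same species.

    seqs: list of dicts with keys accession, database, species, length
    dist: dict mapping frozenset({acc_a, acc_b}) to substitutions per site
    """
    removed = {}  # redundant accession -> accession of its representative
    # After sorting, the first element of every pair is the preferred one.
    for a, b in combinations(sorted(seqs, key=sort_key), 2):
        if a["accession"] in removed or b["accession"] in removed:
            continue
        same_species = a["species"] == b["species"]
        if same_species and dist[frozenset((a["accession"], b["accession"]))] <= cutoff:
            removed[b["accession"]] = a["accession"]
    kept = [s["accession"] for s in seqs if s["accession"] not in removed]
    return kept, removed
```

Identical sequences from different species are deliberately left untouched, which is the main difference from the homology reduction step.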
redundant? Filter the member list using the "redundant" property described under "redundant" in the family members viewer documentation section.
sub_family Some of the PlanTAP families can be further divided into subfamilies.
taxonomic profile To visualize the distribution of TAP family members across all taxonomic lineages, a taxonomic profile was created and is presented as a heat map. Initial tests using taxonomic resolution fixed at the kingdom or order level, respectively, were not able to resolve the expected phylogeny of the contributing taxa using columnwise clustering (data not shown). Therefore, those taxonomic groups which contributed significantly to the overall distribution were selected as columns; the remainder of the Eubacteria, protists, plants and animals was gathered into respective "other" columns. Thus, a non-redundant representation of the taxonomic distribution was created which is able to resolve the expected phylogeny using columnwise clustering. To overcome the sampling bias presented by fully sequenced genomes, the columns were normalized. Subsequent clustering yielded the significantly correlated groups. The filter "taxonomic profile" makes it possible to select all member entries belonging to an individual taxonomic group.
user_contributed_trees You can extend PlanTAPDB. If you want to contribute a manually curated or extended phylogeny of a PlanTAP family, just send us an NHX-formatted tree with support values and species annotation together with a short text describing the method used.

PlanTAP Family Members Viewer documentation

The Family Members Viewer allows you to view, filter and retrieve the members of a given PlanTAP family.
Sequence Retrieval
description
domains
in #clusters
in tree
is a
length
member_name
redundant
repr. species
representative
species

Sequence Retrieval The PlanTAPDB interfaces allow sequence retrieval in three ways:
  1. Hyperlink: contents of the fields member_name and representative of the Family Members Viewer are hyperlinked to the Cosmoss Retrieval System to fetch individual entries one at a time.
  2. Batch retrieval by checkbox: The Family Members Viewer allows retrieval of multiple member entries at a time by selecting them using the leading checkbox and hitting the "retrieve selected members" button. The check-status of all displayed members can be modified by using the checkbox in the table header of the member list in the upper left corner.
  3. Batch retrieval by node: The ATV Tree Viewer was modified to provide an additional option "get PlanTAP sequences", that can be used to retrieve all sequences belonging to a given node in the phylogenetic tree simply by clicking on it. Of course, the original option that links to a UniProt entry if possible was also preserved.
description The member sequence's description line, i.e. the textual annotation provided by the originating database.
domains Matching InterPro domains in order of occurrence along the sequence. If your browser supports mouse-over information, use this to display additional information, e.g. description, E-value, start and stop of the match.
in #clusters In how many clusters belonging to this family did the member sequence occur? If you follow the hyperlink, an additional window appears that displays the PlanTAP family cluster(s) the respective entry is part of and provides hyperlinks to these clusters' ClusterView pages.
in tree Is the member sequence part of any of the family MSAs and trees, or was it removed in the homology reduction? This is also a filter property.
is a Was the member sequence a hit, a query or both in the initial PSI-BLAST? This is also a filter property.
length Length of the member's amino acid sequence.
member_name The unique accession number of a sequence, which can be a member of multiple PlanTAP families. The accession numbers of the member sequences are the identifiers of their originating databases, e.g. UniProt, GenPept, TAIR, Cosmoss ... By following the hyperlink you can retrieve the individual sequence via the Cosmoss Sequence Retrieval System.
redundant Was the member sequence tagged as redundant in the homology reduction in any of the member clusters? Sequences marked as redundant were excluded from the taxonomic profiling of the PlanTAP families. This is also a filter property.
repr. species Scientific name of the organism the representative member sequence is derived from. For small clusters, only redundant sequences of the same organism are considered, whereas this is not the case for huge clusters where iterative homology reduction was performed. SYNTAX: Genus species (subspecies or variety...). The last two words of the corresponding NCBI Taxonomy full lineage string. Follow the hyperlink to access the corresponding NCBI taxonomy entry.
representative Fellow member sequence which represents a sequence in at least one of the family MSAs and trees. By following the hyperlink you can retrieve the individual sequence via the Cosmoss Sequence Retrieval System.
species The scientific name of the organism the member sequence is derived from. SYNTAX: Genus species (subspecies or variety...). The last two words of the corresponding NCBI Taxonomy full lineage string. Follow the hyperlink to access the corresponding NCBI taxonomy entry.

PlanTAP Family Cluster Viewer documentation

Sequence Retrieval
algorithm
avg_ident
comment
description
f_quantile_ident
fiala_stemminess
from_step
homology filtering
last_cutoff
longest_internal_branch_length
max_ident
max_iteration
median_ident
members
min_ident
ml
multiple alignment
nleafs
nr_distances
nr_members
number_of_internals
number_of_nodes
number_of_terminals
phylogenetic trees
queries
redundancy
remaining
removed
resolution
sd_ident
t_quantile_ident
total_paths
tree_height
tree_length

Sequence Retrieval The PlanTAPDB interfaces allow sequence retrieval in three ways:
  1. Hyperlink: contents of the fields member_name and representative of the Family Members Viewer are hyperlinked to the Cosmoss Retrieval System to fetch individual entries one at a time.
  2. Batch retrieval by checkbox: The Family Members Viewer allows retrieval of multiple member entries at a time by selecting them using the leading checkbox and hitting the "retrieve selected members" button. The check-status of all displayed members can be modified by using the checkbox in the table header of the member list in the upper left corner.
  3. Batch retrieval by node: The ATV Tree Viewer was modified to provide an additional option "get PlanTAP sequences", that can be used to retrieve all sequences belonging to a given node in the phylogenetic tree simply by clicking on it. Of course, the original option that links to a UniProt entry if possible was also preserved.
algorithm Multiple sequence alignment algorithm used for this cluster.
avg_ident The average %identity of all pairwise distances of the cluster_members.
comment Comments from the manual annotation phase
description The manual annotation inferred for the cluster.
f_quantile_ident The first quartile of the summary statistics of all pairwise distances of the cluster_members.
fiala_stemminess Tree measure as described in Fiala, K.L. and R.R. Sokal, 1985. Factors determining the accuracy of cladogram estimation: evaluation using computer simulation. Evolution, 39: 609-622
from_step Our pipeline filters PSI-BLAST hits according to a six-step filtering scheme. "from_step" tells you which filter step was applied when the sequences resulting in this cluster were initially filtered. Sequences passing a specific filter step have to fulfill at least the alignment length and fraction identical criteria of the step:

step  length_aln  frac_identical
1     50          0.25
2     60          0.30
3     80          0.35
4     100         0.45
5     150         0.45
6     300         0.45
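The step criteria can be expressed as a small lookup. The Python sketch below is illustrative only; it assumes that both thresholds are inclusive ("at least") and that a hit is tested against each step independently. The function name passed_steps is hypothetical.

```python
# (step, minimum alignment length, minimum fraction identical), as tabulated above.
FILTER_STEPS = [
    (1, 50, 0.25),
    (2, 60, 0.30),
    (3, 80, 0.35),
    (4, 100, 0.45),
    (5, 150, 0.45),
    (6, 300, 0.45),
]


def passed_steps(aln_length, frac_identical):
    """Return the list of filter steps whose two criteria a PSI-BLAST hit meets."""
    return [
        step
        for step, min_length, min_identical in FILTER_STEPS
        if aln_length >= min_length and frac_identical >= min_identical
    ]


print(passed_steps(120, 0.5))   # [1, 2, 3, 4]
print(passed_steps(70, 0.28))   # [1]
```

Because both thresholds rise from step to step, a hit that passes a given step also passes every earlier one.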

homology filtering While it greatly improves taxon sampling, the strategy of using both a large multi-species database like UniProt and the individual full-genome protein predictions results in the detection of identical protein sequences in these overlapping databases. In addition, the same locus is often represented by more than one protein sequence due to divergent predicted gene models, splice variants, and sequencing and annotation errors. To cope with this problem, redundant copies of genes were eliminated prior to all functional analyses using an identity cutoff of 99% for sequences of the same species. Phylogenetic inference of large clusters is computationally costly, and the interpretation of results from huge trees is difficult. Clusters with more than 150 members were condensed via stepwise homology reduction until the threshold of 150 members was reached. To allow this process to be examined in detail for every cluster, we offer the distance matrix as a text file, a graphic of distribution plots of the cluster_member distances and the initial MAFFT FFT-NS-2 alignment of the cluster used for the pairwise distances.
last_cutoff The last %identity threshold applied in the homology reduction of the cluster_members.
longest_internal_branch_length Length of the longest internal branch used to (midpoint-)root the phylogenetic tree of the cluster.
max_ident The maximal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline.
max_iteration The maximal PSI-BLAST iteration the cluster_members are from.
median_ident The median %identity of all pairwise distances of the cluster_members.
members Total number of cluster_members before redundancy removal and homology reduction.
min_ident The minimal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline.
ml The maximum likelihood of the consensus tree topology calculated with TREE-PUZZLE
multiple alignment Due to errors introduced by the alignment algorithm, a certain fraction of columns in a multiple sequence alignment (MSA) generates noise which disturbs correct inference of phylogenetic relationships. Such positions are usually removed manually in the course of a phylogenetic analysis. While current approaches to automated phylogenies mostly rely on unprocessed ClustalW alignments, we placed more emphasis on alignment quality to increase the reliability of the resulting phylogenies. Thus, we used a measure that describes evolutionarily informative sites. We implemented a best-of-two approach: in the first step, two alignments were calculated using different state-of-the-art algorithms and then filtered using the sum-of-pairs score. In the second step, the alignment with the maximal number of remaining columns was chosen. In version 1.0, on average, the alignments consisted of 65% gaps and were reduced to 28% of the original alignment length by applying this procedure. In 71% of the cases the MAFFT G-INSI alignment was selected to represent the cluster, whereas ProbCons or Muscle were chosen for 29% of the clusters. The "best" alignment can be downloaded and viewed with the Jalview alignment editor applet. To make the MSA column filtering process comprehensible for each cluster, we also provide an overview graphic.
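The best-of-two selection can be illustrated with a simplified Python sketch. The actual sum-of-pairs scoring and cutoff used by the pipeline are not specified here, so the example substitutes an identity-based column score (fraction of identical residue pairs, gaps never matching) and an arbitrary threshold of 0.5; both are assumptions, as are all function names.

```python
from itertools import combinations


def column_sp_score(column):
    """Fraction of residue pairs in a column that are identical (gaps never match)."""
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return matches / len(pairs)


def filter_columns(alignment, min_score=0.5):
    """Keep only the columns whose score reaches min_score; return the filtered rows."""
    columns = [col for col in zip(*alignment) if column_sp_score(col) >= min_score]
    if not columns:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*columns)]


def best_of_two(aln_a, aln_b, min_score=0.5):
    """Filter both (non-empty) alignments and return the one with more columns left."""
    filtered_a = filter_columns(aln_a, min_score)
    filtered_b = filter_columns(aln_b, min_score)
    return filtered_a if len(filtered_a[0]) >= len(filtered_b[0]) else filtered_b
```

Each alignment is represented as a list of equal-length strings, one per sequence, with "-" as the gap character.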
nleafs Number of leaves or taxa in this cluster's phylogenetic tree.
nr_distances Total number of pairwise distances.
nr_members The number of members after redundancy removal
number_of_internals Number of internal nodes of the phylogenetic tree.
number_of_nodes Total number of nodes (internal + leaves) of the tree.
number_of_terminals Number of nodes without children.
phylogenetic trees Many approaches to phylogenomics rely solely on a distance approach using Neighbor-Joining (NJ) (Saitou and Nei, 1987). However, NJ is known to be susceptible to noisy data, provides no confidence measures and makes it hard to compute reliable distances for strongly divergent sequences. Probabilistic approaches, like maximum likelihood (ML) and Bayesian methods, are known to overcome most of these problems, but both are very time consuming and thus usually not applied in large-scale phylogenomics approaches. We followed a combined approach by calculating ML consensus branch lengths using gamma distributed rates from bootstrapped NJ topologies. The ML consensus topology of the phylogenetic tree can be downloaded in NHX and explored using the ATV Tree Viewer applet.
queries Total number of queries present in the cluster
redundancy The amount of shared (redundant) history on the total tree. Formula:
1 / ( treelength - height / ( ntax * height - height ) )
remaining The number of members after homology reduction
removed Number of sequences removed in the homology reduction
resolution The total number of internal nodes over the total number of internal nodes on a fully bifurcating tree of the same size.
sd_ident The standard deviation of the %identities of all pairwise distances of the cluster_members.
t_quantile_ident The third quartile of the summary statistics of all pairwise distances of the cluster_members.
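The identity statistics documented in this section (nr_distances, min_ident, f_quantile_ident, median_ident, avg_ident, t_quantile_ident, max_ident, sd_ident) can be reproduced for a list of pairwise %identities with Python's statistics module. This sketch makes several assumptions: f_quantile_ident and t_quantile_ident are read as the first and third quartiles, the quartiles use linear interpolation (method="inclusive"), and sd_ident is the sample standard deviation. The database values may have been computed with different conventions.

```python
import statistics


def identity_summary(identities):
    """Summarize a list of pairwise %identity values (needs at least two values)."""
    q1, _median, q3 = statistics.quantiles(identities, n=4, method="inclusive")
    return {
        "nr_distances": len(identities),
        "min_ident": min(identities),
        "f_quantile_ident": q1,
        "median_ident": statistics.median(identities),
        "avg_ident": statistics.mean(identities),
        "t_quantile_ident": q3,
        "max_ident": max(identities),
        "sd_ident": statistics.stdev(identities),
    }


# Made-up identities for illustration.
print(identity_summary([40, 50, 60, 70, 80]))
```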
total_paths The sum of all root-to-tip path lengths of the phylogenetic tree of this cluster.
tree_height The average of all root-to-tip path lengths. For ultrametric trees (supporting the molecular clock hypothesis) this value is the height of the tree; for additive trees, where the root-to-tip path lengths differ, the result should consequently be interpreted as an average rather than as a single height.
tree_length The sum of all branch lengths of the phylogenetic tree.
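Several of the tree measures in this section follow directly from their definitions. The Python sketch below computes them for a small rooted tree; it is an illustration and not the code used to populate the database. It assumes that the root counts as an internal node, that the root has no branch of its own, that the tree has at least two leaves, and that a fully bifurcating rooted tree with n leaves has n - 1 internal nodes. The Node class is hypothetical.

```python
class Node:
    """Minimal rooted tree node; `length` is the branch leading to this node."""

    def __init__(self, name=None, length=0.0, children=()):
        self.name = name
        self.length = length
        self.children = list(children)


def tree_stats(root):
    """Compute node counts and length-based measures for a rooted tree."""
    terminals = internals = 0
    tree_length = 0.0
    path_lengths = []  # one root-to-tip path length per leaf
    stack = [(root, 0.0)]
    while stack:
        node, depth = stack.pop()
        if node is not root:  # the root has no incoming branch
            tree_length += node.length
            depth += node.length
        if node.children:
            internals += 1
            stack.extend((child, depth) for child in node.children)
        else:
            terminals += 1
            path_lengths.append(depth)
    total_paths = sum(path_lengths)
    return {
        "number_of_terminals": terminals,
        "number_of_internals": internals,
        "number_of_nodes": terminals + internals,
        "tree_length": tree_length,
        "total_paths": total_paths,
        "tree_height": total_paths / terminals,
        "resolution": internals / (terminals - 1),
    }


# The tree ((A:1,B:1):1,C:2) as a worked example.
example = Node(children=[
    Node(length=1.0, children=[Node("A", 1.0), Node("B", 1.0)]),
    Node("C", 2.0),
])
print(tree_stats(example))
```

For this ultrametric example every root-to-tip path has length 2, so tree_height equals the height of the tree, and the fully bifurcating topology gives a resolution of 1.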