PlanTAP Family Cluster Viewer documentation

Sequence Retrieval
homology filtering
multiple alignment
phylogenetic trees


Sequence Retrieval The PlanTAPDB interfaces allow sequence retrieval in three ways:
  1. Hyperlink: contents of the fields member_name and representative of the Family Members Viewer are hyperlinked to the Cosmoss Retrieval System to fetch individual entries one at a time.
  2. Batch retrieval by checkbox: The Family Members Viewer allows retrieval of multiple member entries at a time by selecting them with the leading checkbox and clicking the "retrieve selected members" button. The checkbox in the table header, in the upper left corner of the member list, changes the check status of all displayed members.
  3. Batch retrieval by node: The ATV Tree Viewer was modified to provide an additional option, "get PlanTAP sequences", which retrieves all sequences belonging to a given node in the phylogenetic tree when that node is clicked. The original option that links to a UniProt entry, where possible, was preserved.
algorithm Multiple sequence alignment algorithm used for this cluster.
avg_ident The average %identity of all pairwise distances of the cluster_members.
comment Comments from the manual annotation phase
description The manual annotation inferred for the cluster.
f_quantile_ident The first quartile of the summary statistics of all pairwise distances of the cluster_members.
fiala_stemminess Tree measure as described in Fiala, K.L. and R.R. Sokal, 1985. Factors determining the accuracy of cladogram estimation: evaluation using computer simulation. Evolution, 39: 609-622
from_step Our pipeline filters PSI-BLAST hits according to a six-step filtering scheme. "from_step" tells you which filter step was applied when the sequences resulting in this cluster were initially filtered. Sequences passing a specific filter step have to fulfill at least the alignment length and fraction identical criteria of that step:

step length_aln frac_identical
1 50 0.25
2 60 0.3
3 80 0.35
4 100 0.45
5 150 0.45
6 300 0.45
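
The table can be read as a simple lookup. The following is a minimal sketch, not the pipeline's actual code: the thresholds are copied from the table, while the function names `passes_step` and `strictest_step_passed` are hypothetical and "at least" is taken to mean an inclusive comparison.

```python
# Thresholds copied from the table above:
# step -> (minimum alignment length, minimum fraction identical).
FILTER_STEPS = {
    1: (50, 0.25),
    2: (60, 0.30),
    3: (80, 0.35),
    4: (100, 0.45),
    5: (150, 0.45),
    6: (300, 0.45),
}


def passes_step(step, length_aln, frac_identical):
    """Return True if a hit meets both criteria of the given filter step.

    Hypothetical helper; comparisons are inclusive ("at least").
    """
    min_length, min_identity = FILTER_STEPS[step]
    return length_aln >= min_length and frac_identical >= min_identity


def strictest_step_passed(length_aln, frac_identical):
    """Return the highest step number whose criteria the hit meets, or None."""
    passed = [s for s in FILTER_STEPS if passes_step(s, length_aln, frac_identical)]
    return max(passed) if passed else None
```

For example, a hit with an alignment length of 120 and a fraction identical of 0.5 meets the criteria of steps 1 to 4 but not those of step 5.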

homology filtering Using both a large multi-species database such as UniProt and the individual full-genome protein predictions greatly improves taxon sampling, but it also results in the detection of identical protein sequences in these overlapping databases. In addition, the same locus is often represented by more than one protein sequence due to divergent predicted gene models, splice variants, and sequencing and annotation errors. To cope with this problem, redundant copies of genes were eliminated prior to all functional analyses using an identity cutoff of 99% for sequences of the same species. Phylogenetic inference for large clusters is computationally costly, and results from huge trees are difficult to interpret. Clusters with more than 150 members were therefore condensed via stepwise homology reduction until the threshold of 150 members was reached. To allow this process to be examined in detail for every cluster, we offer the distance matrix as a text file, a graphic with distribution plots of the cluster_member distances, and the initial MAFFT fftns2 alignment of the cluster used for the pairwise distances.
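
One way to picture stepwise homology reduction is the greedy sketch below. It is an illustration under assumptions, not the pipeline's implementation: the text above only states that clusters were condensed stepwise until 150 members were reached, so the function name `reduce_cluster`, the starting cutoff, the step size of one percentage point, and the greedy keep-first rule are all hypothetical. The returned cutoff corresponds in spirit to the last_cutoff field, and the length of the returned list to remaining.

```python
def reduce_cluster(members, identity, max_members=150, start_cutoff=99, step=1):
    """Greedy stepwise reduction (hypothetical sketch).

    members:  list of sequence identifiers
    identity: dict mapping frozenset({a, b}) to the % identity of a and b
    Returns (remaining_members, last_cutoff). last_cutoff is None when the
    cluster was already small enough and no reduction was applied.
    """
    remaining = list(members)
    cutoff = start_cutoff
    last_cutoff = None
    while len(remaining) > max_members and cutoff > 0:
        last_cutoff = cutoff
        kept = []
        for candidate in remaining:
            # Keep a candidate only if its identity to every sequence kept
            # so far is below the current cutoff.
            if all(identity[frozenset((candidate, k))] < cutoff for k in kept):
                kept.append(candidate)
        remaining = kept
        cutoff -= step  # assumed: lower the threshold one step at a time
    return remaining, last_cutoff
```

Because a whole cutoff step is applied at once, this sketch can end with fewer than `max_members` sequences; the number of removed sequences is simply `len(members) - len(remaining)`.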
last_cutoff The last %identity threshold applied in the homology reduction of the cluster_members.
longest_internal_branch_length Length of the longest internal branch used to (midpoint-)root the phylogenetic tree of the cluster.
max_ident The maximal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline.
max_iteration The maximal PSI-BLAST iteration the cluster_members are from.
median_ident The median %identity of all pairwise distances of the cluster_members.
members Total number of cluster_members before redundancy removal and homology reduction.
min_ident The minimal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline.
ml The maximum likelihood of the consensus tree topology calculated with TREE-PUZZLE
multiple alignment Due to errors introduced by the alignment algorithm, a certain fraction of the columns in a multiple sequence alignment (MSA) contributes noise that disturbs the correct inference of phylogenetic relationships. Such positions are usually removed manually in the course of a phylogenetic analysis. While current approaches to automated phylogenies mostly rely on unprocessed ClustalW alignments, we placed more emphasis on alignment quality to increase the reliability of the resulting phylogenies. Thus, we used a measure that describes evolutionarily informative sites. We implemented a best-of-two approach: in the first step, two alignments were calculated using different state-of-the-art algorithms and then filtered using the sum-of-pairs score; in the second step, the alignment with the maximal number of remaining columns was chosen. In version 1.0, the alignments consisted of 65% gaps on average and were reduced to 28% of the original alignment length by this procedure. In 71% of the cases the MAFFT G-INS-i alignment was selected to represent the cluster, whereas ProbCons or MUSCLE were chosen for 29% of the clusters. The "best" alignment can be downloaded and viewed with the Jalview alignment editor applet. To make the MSA column filtering process traceable for each cluster, we also provide an overview graphic.
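
The best-of-two selection can be sketched as follows. This is an illustration only: the text above does not specify how the sum-of-pairs score is computed or which threshold is used, so the identity-based column score, the `min_score` parameter, and all function names here are assumptions. Alignments are represented as non-empty lists of equal-length strings with "-" as the gap character.

```python
from itertools import combinations


def column_sp_score(column):
    """Normalised sum-of-pairs score of one column (assumed scoring scheme):
    the fraction of sequence pairs showing the same residue; pairs that
    involve a gap score 0."""
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b and a != "-")
    return matches / len(pairs)


def filter_columns(alignment, min_score=0.5):
    """Keep only the columns whose score reaches min_score (assumed threshold)."""
    columns = list(zip(*alignment))
    kept = [col for col in columns if column_sp_score(col) >= min_score]
    if not kept:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*kept)]


def best_of_two(alignment_a, alignment_b, min_score=0.5):
    """Filter both alignments and return the one with more remaining columns.
    Ties go to the first alignment."""
    filtered_a = filter_columns(alignment_a, min_score)
    filtered_b = filter_columns(alignment_b, min_score)
    if len(filtered_a[0]) >= len(filtered_b[0]):
        return filtered_a
    return filtered_b
```

The selection rule itself (keep the filtered alignment with the most remaining columns) follows the description above; only the column filter is a stand-in.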
nleafs Number of leaves (taxa) in this cluster's phylogenetic tree.
nr_distances Total number of pairwise distances.
nr_members The number of members after redundancy removal
number_of_internals Number of internal nodes of the phylogenetic tree.
number_of_nodes Total number of nodes (internal + leaves) of the tree.
number_of_terminals Number of nodes without children.
phylogenetic trees Many approaches to phylogenomics rely solely on a distance approach using Neighbor-Joining (NJ) (Saitou and Nei, 1987). However, NJ is known to be susceptible to noisy data and provides no confidence measures, and reliable distances are hard to compute for strongly divergent sequences. Probabilistic approaches such as maximum likelihood (ML) and Bayesian methods are known to overcome most of these problems, but both are very time-consuming and thus usually not applied in large-scale phylogenomics. We followed a combined approach by calculating ML consensus branch lengths using gamma-distributed rates from bootstrapped NJ topologies. The ML consensus topology of the phylogenetic tree can be downloaded in NHX format and explored using the ATV Tree Viewer applet.
queries Total number of queries present in the cluster
redundancy The amount of shared (redundant) history on the total tree. Formula:
1 / ( treelength - height / ( ntax * height - height ) )
remaining The number of members after homology reduction
removed Number of sequences removed in the homology reduction
resolution The total number of internal nodes over the total number of internal nodes on a fully bifurcating tree of the same size.
sd_ident The standard deviation of the %identities of all pairwise distances of the cluster_members.
t_quantile_ident The third quartile of the summary statistics of all pairwise distances of the cluster_members.
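
The identity summary fields (nr_distances, avg_ident, median_ident, sd_ident, f_quantile_ident, t_quantile_ident) can be reproduced from a list of pairwise %identity values roughly as shown below. This is a sketch under assumptions: the function name `identity_summary` is hypothetical, and the quartile interpolation method and the use of the sample (rather than population) standard deviation are not stated in this documentation.

```python
import statistics


def identity_summary(identities):
    """Summarise a list of pairwise %identity values (at least two values).

    Assumptions: quartiles use the "inclusive" interpolation method and
    sd_ident is the sample standard deviation.
    """
    q1, median, q3 = statistics.quantiles(identities, n=4, method="inclusive")
    return {
        "nr_distances": len(identities),
        "avg_ident": statistics.mean(identities),
        "median_ident": median,
        "sd_ident": statistics.stdev(identities),
        "f_quantile_ident": q1,
        "t_quantile_ident": q3,
    }
```

min_ident and max_ident are left out on purpose, since they are described as being observed during the redundancy removal and homology reduction phase rather than derived from the final distance list.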
total_paths The sum of all root-to-tip path lengths of the phylogenetic tree of this cluster.
tree_height The average of all root-to-tip path lengths. For ultrametric trees (consistent with the molecular clock hypothesis) this value is the height of the tree; for additive trees it should be interpreted as a mean root-to-tip distance instead.
tree_length The sum of all branch lengths of the phylogenetic tree.
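
Several of the tree fields above (number_of_terminals, number_of_internals, number_of_nodes, tree_length, total_paths, tree_height, resolution) follow directly from a rooted tree with branch lengths. The sketch below illustrates the definitions only; the `Node` class and `tree_metrics` function are hypothetical, the root's own branch length is ignored, and resolution assumes that a fully bifurcating rooted tree with n tips has n - 1 internal nodes (root included).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    length: float = 0.0                            # branch leading to this node
    children: list = field(default_factory=list)   # empty list means terminal


def tree_metrics(root):
    """Compute basic tree statistics by walking the tree from the root."""
    tip_paths = []      # one root-to-tip path length per terminal
    n_internal = 0
    tree_length = 0.0
    stack = [(root, 0.0)]
    while stack:
        node, depth = stack.pop()
        if node.children:
            n_internal += 1
            for child in node.children:
                tree_length += child.length
                stack.append((child, depth + child.length))
        else:
            tip_paths.append(depth)
    n_tips = len(tip_paths)
    return {
        "number_of_terminals": n_tips,
        "number_of_internals": n_internal,
        "number_of_nodes": n_tips + n_internal,
        "tree_length": tree_length,
        "total_paths": sum(tip_paths),
        # Average root-to-tip path length, as in the tree_height entry.
        "tree_height": sum(tip_paths) / n_tips,
        # Assumed denominator: n_tips - 1 internal nodes when fully bifurcating.
        "resolution": n_internal / (n_tips - 1) if n_tips > 1 else None,
    }
```

For the tree ((A:1,B:1):1,C:2), every root-to-tip path has length 2, so tree_height is 2, total_paths is 6, and tree_length is 5; a three-tip polytomy (A:1,B:1,C:1) has a resolution of 0.5.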
