Sequence Retrieval 
The PlanTAPDB interfaces allow sequence retrieval in three ways:


algorithm  Multiple sequence alignment algorithm used for this cluster.  
avg_ident  The average %identity of all pairwise distances of the cluster_members.  
comment  Comments from the manual annotation phase  
description  The manual annotation inferred for the cluster.  
f_quantile_ident  The first quantile of the summary statistics of all pairwise distances of the cluster_members.  
fiala_stemminess  Tree measure as described in Fiala, K.L. and R.R. Sokal, 1985. Factors determining the accuracy of cladogram estimation: evaluation using computer simulation. Evolution, 39: 609-622  
from_step 
Our pipeline filters PSI-BLAST hits according to a six-step filtering scheme. "from_step" tells you which filter step was applied when the sequences resulting in this cluster were initially filtered. Sequences passing a specific filter step have to fulfill at least the alignment length and fraction identical criteria of that step:


homology filtering  Using both a large multi-species database like UniProt and the individual full-genome protein predictions greatly improves taxon sampling, but it also results in the detection of identical protein sequences in these overlapping databases. In addition, the same locus is often represented by more than one protein sequence due to divergent predicted gene models, splice variants, and sequencing and annotation errors. To cope with this problem, redundant copies of genes were eliminated prior to all functional analyses using an identity cutoff of 99% for sequences of the same species. Phylogenetic inference of large clusters is computationally costly, and huge trees are difficult to interpret. Clusters with more than 150 members were therefore condensed via stepwise homology reduction until the threshold of 150 members was reached. To allow this process to be examined in detail for every cluster, we offer the distance matrix as a text file, a graphic of distribution plots of the cluster_member distances, and the initial MAFFT FFT-NS-2 alignment of the cluster used for the pairwise distances.  
last_cutoff  The last %identity threshold applied in the homology reduction of the cluster_members  
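The stepwise reduction described above could look roughly like the following Python sketch. It is not the pipeline's code: the function name `reduce_cluster`, the starting cutoff, the step size of one percentage point, and the greedy choice of which member of a pair to drop are all assumptions made for illustration. Only the member threshold (150) comes from the text. The separate 99% same-species redundancy removal is not covered here.

```python
from itertools import combinations


def reduce_cluster(members, identity, max_members=150,
                   start_cutoff=99.0, step=1.0):
    """Greedy stepwise homology reduction (illustrative assumption only).

    members  : list of sequence identifiers
    identity : dict mapping frozenset({a, b}) -> % identity of that pair
    Returns (kept_members, last_cutoff); last_cutoff is None when the
    cluster was already small enough and no cutoff had to be applied.
    """
    kept = list(members)
    cutoff = start_cutoff
    last_cutoff = None
    # Lower the identity cutoff step by step until the cluster is small enough.
    while len(kept) > max_members and cutoff > 0:
        last_cutoff = cutoff
        for a, b in combinations(list(kept), 2):
            if len(kept) <= max_members:
                break
            # Drop the second member of any surviving pair at or above the cutoff.
            if a in kept and b in kept and identity[frozenset((a, b))] >= cutoff:
                kept.remove(b)
        cutoff -= step
    return kept, last_cutoff
```

In this sketch the returned cutoff plays the role of the last_cutoff field, and the length of the returned list corresponds to remaining.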
longest_internal_branch_length  Length of the longest internal branch used to (midpoint)root the phylogenetic tree of the cluster.  
max_ident  The maximal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline.  
max_iteration  The maximal PSI-BLAST iteration the cluster_members are from.  
median_ident  The median %identity of all pairwise distances of the cluster_members.  
members  Total number of cluster_members before redundancy removal and homology reduction.  
min_ident  The minimal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline.  
ml  The maximum likelihood of the consensus tree topology calculated with TREE-PUZZLE  
multiple alignment  Due to errors introduced by the alignment algorithm, a certain fraction of columns in a multiple sequence alignment (MSA) generates noise that disturbs correct inference of phylogenetic relationships. Such positions are usually removed manually in the course of a phylogenetic analysis. While current approaches to automated phylogenies mostly rely on unprocessed ClustalW alignments, we placed more emphasis on alignment quality to increase the reliability of the resulting phylogenies. Thus, we used a measure that describes evolutionarily informative sites. We implemented a best-of-two approach: in the first step, two alignments were calculated using different state-of-the-art algorithms and then filtered using the sum-of-pairs score. In the second step, the alignment with the maximal number of remaining columns was chosen. In version 1.0, the alignments consisted on average of 65% gaps and were reduced to 28% of the original alignment length by applying this procedure. In 71% of the cases the MAFFT G-INS-i alignment was selected to represent the cluster, whereas ProbCons or Muscle were chosen for 29% of the clusters. The "best" alignment can be downloaded and viewed with the Jalview alignment editor applet. To make the MSA column filtering process traceable for each cluster, we also provide an overview graphic.  
nleafs  Number of leaves or taxa in this cluster's phylogenetic tree.  
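As a rough illustration of the best-of-two idea, the Python sketch below filters the columns of two candidate alignments and keeps the one with more surviving columns. The column score used here (fraction of identical, non-gap residue pairs) is only a simple stand-in for a sum-of-pairs score, and the threshold `min_score` and all function names are hypothetical. The scoring scheme and cutoff actually used by the pipeline are not stated in the text.

```python
from itertools import combinations


def column_sp_score(column):
    """Fraction of residue pairs in a column that are identical.

    Pairs involving a gap character ('-') count as mismatches.
    """
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b and a != "-")
    return matches / len(pairs)


def filter_columns(alignment, min_score=0.5):
    """Keep only the columns whose score reaches min_score.

    alignment is a non-empty list of equally long, gapped sequences.
    """
    columns = [col for col in zip(*alignment)
               if column_sp_score(col) >= min_score]
    if not columns:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*columns)]


def best_of_two(aln_a, aln_b, min_score=0.5):
    """Filter both alignments and return the one with more columns left."""
    filtered_a = filter_columns(aln_a, min_score)
    filtered_b = filter_columns(aln_b, min_score)
    if len(filtered_a[0]) >= len(filtered_b[0]):
        return filtered_a
    return filtered_b
```

Ties are resolved in favour of the first alignment here, which is an arbitrary choice for the sketch.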
nr_distances  Total number of pairwise distances.  
nr_members  The number of members after redundancy removal  
number_of_internals  Number of internal nodes of the phylogenetic tree.  
number_of_nodes  Total number of nodes (internal + leaves) of the tree.  
number_of_terminals  Number of nodes without children.  
phylogenetic trees  Many approaches to phylogenomics rely solely on a distance approach using Neighbor-Joining (NJ) (Saitou and Nei, 1987). However, NJ is known to be susceptible to noisy data, provides no confidence measures, and makes it hard to compute reliable distances for strongly divergent sequences. Probabilistic approaches, like maximum likelihood (ML) and Bayesian methods, are known to overcome most of these problems, but both are very time-consuming and thus usually not applied in large-scale phylogenomics approaches. We followed a combined approach by calculating ML consensus branch lengths using gamma-distributed rates from bootstrapped NJ topologies. The ML consensus topology of the phylogenetic tree can be downloaded in NHX format and explored using the ATV Tree Viewer applet.  
queries  Total number of queries present in the cluster  
redundancy 
The amount of shared (redundant) history on the total tree. Formula: 1 / ( treelength - height / ( ntax * height - height ) ) 

remaining  The number of members after homology reduction  
removed  Number of sequences removed in the homology reduction  
resolution  The total number of internal nodes over the total number of internal nodes on a fully bifurcating tree of the same size.  
sd_ident  The standard deviation of the %identities of all pairwise distances of the cluster_members.  
t_quantile_ident  The third quantile of the summary statistics of all pairwise distances of the cluster_members.  
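The *_ident fields summarise the pairwise %identities of a cluster. The following Python sketch shows one way such a summary could be computed from a plain list of values. It assumes that "first quantile" and "third quantile" mean the first and third quartiles, that quartiles are interpolated with the "inclusive" method, and that sd_ident is the sample standard deviation. None of these choices is confirmed by the text, and the function name is hypothetical.

```python
import statistics


def identity_summary(identities):
    """Summary statistics for a list of pairwise %identity values.

    Needs at least two values (required by stdev and quantiles).
    """
    q1, _median, q3 = statistics.quantiles(identities, n=4, method="inclusive")
    return {
        "nr_distances": len(identities),
        "min_ident": min(identities),
        "f_quantile_ident": q1,
        "median_ident": statistics.median(identities),
        "avg_ident": statistics.mean(identities),
        "t_quantile_ident": q3,
        "max_ident": max(identities),
        "sd_ident": statistics.stdev(identities),  # sample SD (assumption)
    }
```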
total_paths  The sum of all root-to-tip path lengths of the phylogenetic tree of this cluster.  
tree_height  For ultrametric trees (supporting the molecular clock hypothesis) this value is the height of the tree. It is computed by averaging over all root-to-tip path lengths, so for additive trees the result should be interpreted differently.  
tree_length  The sum of all branch lengths of the phylogenetic tree. 
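
Several of the tree measures listed above (tree_length, total_paths, tree_height, resolution and the node counts) can be illustrated on a toy tree. The Python sketch below is a minimal example under stated assumptions, not the code used to fill the database: the tree representation (a tuple of branch length and child list), the function names, counting the root as an internal node, and taking n - 1 as the number of internal nodes of a fully bifurcating rooted tree with n leaves are all choices made here for illustration.

```python
def tip_paths(node, dist=0.0):
    """Return the root-to-tip path length for every leaf below node.

    A node is a tuple (branch_length, children); leaves have no children.
    """
    length, children = node
    dist += length
    if not children:
        return [dist]
    paths = []
    for child in children:
        paths.extend(tip_paths(child, dist))
    return paths


def tree_stats(root):
    """Compute simple summary measures for a rooted tree."""
    tree_length, internals, terminals = 0.0, 0, 0
    stack = [root]
    while stack:
        length, children = stack.pop()
        tree_length += length          # sum of all branch lengths
        if children:
            internals += 1             # the root is counted as internal
            stack.extend(children)
        else:
            terminals += 1
    paths = tip_paths(root)
    return {
        "tree_length": tree_length,
        "number_of_terminals": terminals,
        "number_of_internals": internals,
        "number_of_nodes": internals + terminals,
        "total_paths": sum(paths),
        # Average root-to-tip path length, as described for tree_height.
        "tree_height": sum(paths) / len(paths),
        # Assumes a rooted tree with at least two leaves.
        "resolution": internals / (terminals - 1),
    }


# Toy tree ((A:1,B:1):1,C:2); the root itself has branch length 0.
toy = (0.0, [(1.0, [(1.0, []), (1.0, [])]), (2.0, [])])
stats = tree_stats(toy)
```

For this ultrametric toy tree every root-to-tip path has the same length, so the averaged tree_height equals the actual height. On an additive tree the same average would only be a mean path length.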