In silico prediction of UTR repeats using clustered EST data

Stefan A. Rensing*, Daniel Lang and Ralf Reski

University of Freiburg, Plant Biotechnology, Sonnenstr. 5, D-79104 Freiburg, Germany
* stefan.rensing@biologie.uni-freiburg.de, phone +49 761 203-6974, fax -6990

Abstract

Clustering of EST data is a method for the non-redundant representation of an organism's transcriptome. When large amounts of EST data are clustered, some large clusters (>500 sequences) are usually created. These can lead to iterative contig builds, consumption of a lot of computing time and improbable exon alignments. In addition, such clusters sometimes contain transcripts of more than one gene, which is undesirable. Large clusters arise due to: (1) large numbers of identical ESTs / high transcript levels; (2) large gene families with highly similar members; (3) false clustering due to a) unremoved vector or rRNA sequences, b) undetected cloning artifacts or c) repetitive elements in UTRs.
During pre-processing (filtering and masking) of the raw sequence data, contaminations such as vector or linker sequences as well as bacterial genes are removed (clipping). In the same process, it is essential to mask repetitive elements in order to avoid incorrect clustering caused by these sequence stretches. Determining UTR repeats for use in masking is therefore a way to avoid false clustering.
When dealing with organisms whose repetitive elements are unknown, it is crucial to extract those sequences from the data prior to clustering. We developed three in silico approaches to detect UTR repeats using clustered EST data. All three approaches yielded several putative repeats (17 in total), the majority of which could be shown to be repetitive in the genome. Using the predicted repeats enabled us to save computing time while increasing the quality of the clustered data.
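As an illustration of the masking step described above, the following minimal Python sketch reads repeat sequences from FastA-formatted text and masks their occurrences in an EST with N before clustering. It is not the procedure used in this work: the function names read_fasta and mask_repeats are hypothetical, and exact string matching is assumed purely for simplicity, whereas a real pipeline would more likely rely on alignment-based masking that tolerates mismatches.

```python
def read_fasta(text):
    """Parse FastA-formatted text into a dict {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the identifier.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def mask_repeats(seq, repeats, mask_char="N"):
    """Replace every exact occurrence of each repeat in seq with mask_char.

    Assumption: exact matching only; overlapping occurrences are all masked.
    """
    seq = seq.upper()
    masked = list(seq)
    for rep in repeats:
        rep = rep.upper()
        if not rep:
            continue
        start = seq.find(rep)
        while start != -1:
            masked[start:start + len(rep)] = mask_char * len(rep)
            start = seq.find(rep, start + 1)
    return "".join(masked)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    repeats = read_fasta(">rep1 toy repeat\nATATAT\n")
    print(mask_repeats("GGCATATATATCC", repeats.values()))
```

Masked positions would then be ignored by the clustering step, so that two ESTs sharing only a UTR repeat are not joined into one cluster.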

Rensing S.A., Lang D. and Reski R. (2003): In silico prediction of UTR repeats using clustered EST data. In: Proceedings of the German Conference on Bioinformatics 2003, Mewes H.-W., Heun V., Frishman D., Kramer S. (eds.), pp 117-122, Belleville Verlag Michael Farin, Munich, Germany


 

 

Currently available: 17 predicted Physcomitrella repeats as a FastA file