Blast2GO (B2G) is a tool designed to enable Gene Ontology (GO) based
data mining on sequence data for which no GO annotation is yet available.
This is done by associating sequences with a putative function using
sequence homology criteria and by providing tools for statistical and visual
analysis of this information.
The aim of this evaluation is to identify optimal parameters for correct
annotation and to evaluate the overall performance of the
methodology.
The strategy we followed in our evaluation was to use a sequence data
set from a model organism for which a true functional annotation is available.
These sequences were processed using Blast2GO against a GenBank
nr database depleted of sequences of the test species. The
comparison between the inferred and the original annotation allowed us
to evaluate the performance of the tool as a function of the annotation
parameters.
Our results show that Blast2GO has good annotation accuracy, typical of
automatic annotation methods, and, more importantly, that the tool is
successful in extracting relevant functional features of these sequences
from this annotation.
In this evaluation we used the sequences represented in the AMT
microarray originally designed in Dr Amtmann’s laboratory. This
oligonucleotide microarray represents 1090 Arabidopsis transporter genes
and has been used to study the transporter transcriptome in roots under
different salt stress conditions (Maathuis et al., 2003). GO annotations are available
for these sequences as well as a specific functional classification made by
the authors (see supplementary material). This data set is ideal for the
purpose of evaluating Blast2GO because it represents a typical scenario
where Blast2GO is likely to be used for sequence annotation and as a
function-based data mining tool.
Firstly, a filtered NCBI nr database was generated from which all Arabidopsis sequences were removed (nr-ATH).
The AMT set was then analysed with Blast2GO.
Blast was run against nr-ATH using WWWBlast and the default Blast parameters of the application.
Sequences were mapped.
Annotation was done for different values of the annotation parameters following a factorial design of 3 factors with the following levels:
GO weight: 0, 5 and 10
Annotation cut-off: 0, 30, 35, 40, 45, 50, 55 and 60
EC weights: B2G default, all set to 1
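The factorial design above can be enumerated programmatically. The following is a minimal sketch, not part of Blast2GO itself: the constant names and dictionary keys are illustrative assumptions, and only the factor levels are taken from the text.

```python
from itertools import product

# Factor levels as listed in the text. The names below are illustrative
# assumptions and are not Blast2GO option names.
GO_WEIGHTS = [0, 5, 10]
ANNOTATION_CUTOFFS = [0, 30, 35, 40, 45, 50, 55, 60]
EC_WEIGHT_SCHEMES = ["B2G default", "all to 1"]


def parameter_grid():
    """Return every combination of the three factors as a list of dicts."""
    return [
        {"go_weight": gw, "annotation_cutoff": co, "ec_weights": ec}
        for gw, co, ec in product(GO_WEIGHTS, ANNOTATION_CUTOFFS, EC_WEIGHT_SCHEMES)
    ]


grid = parameter_grid()
print(len(grid))  # 3 * 8 * 2 = 48 annotation runs
```

A full factorial design of this size therefore corresponds to 48 annotation runs, one per parameter combination.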
Annotation results were compared to the True GO annotation of these sequences obtained from the TAIR site (www.tair.org). Each B2G annotated GO term was scored as one of the following (note that Identical, General and Specific are all annotations in the same branch as the True annotation):
Identical: if the B2G annotation is present among the True annotations of the sequence
General: if the B2G annotated GO term is a parent term of one of the True annotations of the sequence
Specific: if the B2G annotated GO term is a child term of one of the True annotations of the sequence
Other branch: if no True annotation term lies in the same DAG branch as the B2G annotated GO term under consideration.
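The four scoring categories above can be sketched in code as follows. This is an illustration under stated assumptions rather than the procedure actually used: "parent" and "child" are interpreted as any ancestor or descendant in the DAG, the categories are tested in the order Identical, General, Specific, and the small term graph (TOY_PARENTS) is a hypothetical toy example, not the real Gene Ontology.

```python
from collections import Counter

# Hypothetical toy DAG (child -> set of parents); NOT the real Gene Ontology.
TOY_PARENTS = {
    "cation transport": {"ion transport"},
    "ion transport": {"transport"},
    "multidrug transport": {"transport"},
    "transport": set(),
    "signaling": set(),
}


def ancestors(term, parents):
    """Return all terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def score_annotation(predicted, true_terms, parents):
    """Score one predicted term against the set of True terms of a sequence."""
    if predicted in true_terms:
        return "Identical"
    # Assumption: "parent term" is read as any ancestor of a True term.
    if any(predicted in ancestors(t, parents) for t in true_terms):
        return "General"
    # Assumption: "child term" is read as any descendant of a True term.
    predicted_ancestors = ancestors(predicted, parents)
    if any(t in predicted_ancestors for t in true_terms):
        return "Specific"
    return "Other branch"


def summarize(scores):
    """Fractions of Identical and same-branch scores among all scored terms."""
    counts = Counter(scores)
    total = sum(counts.values())
    same_branch = counts["Identical"] + counts["General"] + counts["Specific"]
    return {
        "identical": counts["Identical"] / total if total else 0.0,
        "same_branch": same_branch / total if total else 0.0,
    }


true_terms = {"ion transport"}
predictions = ["ion transport", "transport", "cation transport", "signaling"]
scores = [score_annotation(p, true_terms, TOY_PARENTS) for p in predictions]
summary = summarize(scores)
```

In this toy case the four predictions are scored Identical, General, Specific and Other branch respectively, so three of the four fall in the same branch as the True annotation.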
Combined graphs were generated for the whole data set for each of the three main branches of the Gene Ontology, and the highlighted nodes were compared to the functional annotation provided by the authors. These graphs were generated with a Seq Filter value of 50 (i.e. only nodes with more than 50 associated sequences are shown) to control the size of the graph.
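The effect of the Seq Filter can be illustrated with a short sketch. It assumes that the number of sequences associated with each node has already been computed; the function name and the example counts are hypothetical and do not come from the AMT data.

```python
def apply_seq_filter(node_sequence_counts, seq_filter=50):
    """Keep only nodes with strictly more than `seq_filter` associated sequences."""
    return {
        node: count
        for node, count in node_sequence_counts.items()
        if count > seq_filter
    }


# Hypothetical counts, for illustration only.
example_counts = {
    "transport": 400,
    "ion transport": 120,
    "multidrug transport": 50,
    "signaling": 12,
}
visible_nodes = apply_seq_filter(example_counts)
```

Note that a node with exactly 50 sequences is not shown under this reading, since the text specifies "more than 50".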
Finally, we took all the significant gene lists provided in the Westernhuis paper and computed, for each list, Fisher's Exact Tests to evaluate functional category enrichment using the Functional Classification provided by the authors (see supplementary material of this publication and .txt). We selected a gene list for which there were significantly enriched categories at a multiple testing corrected p-value of 0.01, performed a B2G Enrichment Analysis for this list using the B2G annotation, and compared the results.
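For reference, a category enrichment test of this kind can be sketched as a one-sided Fisher's Exact Test, which is equivalent to the upper tail of the hypergeometric distribution. The code below is a minimal illustration and not the implementation used in this evaluation. The text does not state which multiple testing correction was applied, so the Bonferroni adjustment shown here is an assumption; the function and argument names are also illustrative.

```python
from math import comb


def fisher_enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's Exact Test p-value for over-representation.

    k: genes in the gene list that belong to the category
    n: size of the gene list
    K: genes in the reference set that belong to the category
    N: size of the reference set (the gene list is drawn from it)
    """
    upper = min(n, K)
    # Probability of observing k or more category members in the list.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def bonferroni(pvalues):
    """Bonferroni adjustment (an assumed choice of correction method)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Toy example: 10 reference genes, 4 in the category, a list of 3 genes
# that all belong to the category.
p = fisher_enrichment_pvalue(k=3, n=3, K=4, N=10)  # 4 / 120
adjusted = bonferroni([p, 0.5])
```

A category would then be called enriched when its adjusted p-value falls below the chosen threshold, 0.01 in the analysis described above.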
The results of the evaluation of the annotation procedure and its
parameters are given in Table 1 and are summarized in Figures 1 and
2. As expected, raising the Annotation cut-off resulted in an increase in
the quality of the annotation but decreased the number of annotated
genes. For the tested GO weights we observed an increase in
positive annotations when the value was increased, indicating that
abstraction can be an adequate route to valid GO annotation.
Setting all EC weights to 1 (no EC weight) resulted in an increase in
positive annotations. However, we noticed that EC weights=default
resulted in much lower annotation coverage, suggesting that the
lower performance of this option may be more the result of failing to
annotate than of annotating in another branch.
In general, good annotation results (up to 65% identical annotation
and 70% annotation in the same branch) were obtained for some
values of the annotation parameters, which is similar to the
performance reported for other automatic annotation systems (e.g.
Martin et al., 2004; Khan et al., 2003). In addition, Blast2GO offers a
graphical environment for functional annotation.
Evaluation of annotation in other biological systems (Saccharomyces,
Plasmodium…) shows similar behaviour of the annotation
parameters, although absolute values may vary slightly.
From the results of the annotation evaluation we took suitable values
of the annotation parameters for performing the functional evaluation
(annot.cutoff=50, GOweight=10).
Comparison of the results of the Combined Graph with the Functional
Annotation provided by the authors (Funcional_analysis_AMT.xls,
worksheet 1) showed that the B2G visualization tool is successful in
showing the most relevant biological aspects of this data set. The
terms Transport (BP) and Transport activity (MF) clearly appear as
the most heavily colored ones in their graphs (Figs. 3 and 4). Other terms
are also highlighted in the corresponding graphs: cation transport, ion
transport and multidrug transport for the Biological Process category
(Fig. 3); ATPase activity coupled to the transmembrane movement of
substances, ATP binding and antiporter activity for the Molecular
Function category (Fig. 4); and integral to membrane and intracellular
membrane-bound organelle for the Cellular Component category (Fig. 5).
For the second aspect of the Functional Genomics evaluation the
AMT_specific_Na&Ca&K data subset was used. This data set shows
a significant enrichment of the category aquaporin when a Fisher's
Exact Test is performed using the Functional classification provided
by the authors (Funcional_analysis_AMT.xls, worksheets 2 and 3).
B2G Enrichment Analysis of this data set successfully detected a
significant enrichment for the same functional category (Fig. 6).
This example illustrates the validity of Blast2GO as a research tool in
Functional Genomics studies. Its ideal application is for the functional
analysis of non-annotated sequence data in non-model organisms.
Annotation accuracy using default parameters reached 65-70%. These are
typical values for automatic annotation methods. In addition,
Blast2GO offers a versatile and user-friendly graphical environment for
functional annotation, combining functionality that has so far been available
only in separate implementations.
Our results show that Blast2GO is a valuable tool for gathering functional
information on otherwise uncharacterized sequences. Our approach can be
useful in guiding the interpretation of experimental results in genomics
approaches such as gene expression studies, EST projects, etc.
Khan,S., Situ,G., Decker,K. and Schmidt,C.J. (2003) GoFigure: Automated Gene Ontology™ annotation. Bioinformatics 19, 2484-2485.
Maathuis,F.J.M., Filatov,V., Herzyk,P., Krijger,G.C., Axelsen,K.B., Chen,S., Green,B.J., Li,Y., Madagan,K.L., Sánchez-Fernández,R., Forde,B.G., Palmgren,M.G., Rea,P.A., Williams,L.E., Sanders,D. and Amtmann,A. (2003) Transcriptome analysis of root transporters reveals participation of multiple gene families in the response to cation stress. The Plant Journal 35, 675-692.
Martin,D., Berriman,M. and Barton,G. (2004) GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 5, 178.