Well, here's my take.
Too bad there is no open access to this article.
I see that they only had statistics for the top 50 annotations, which implies a fairly manual process - you are going to have to read a bunch of articles or abstracts to decide directly whether the software is right.
In this case, the authors are using a weighted latent semantic indexing method, which would have started with a bunch of articles for each gene. So they would only need to look over the articles they put into their training.
That might not answer your question, though. If you wanted to search through the literature to confirm annotations generated by a different method, you could create a semi-automated process, perhaps like this:
Do a rough search from NCBI in an automated fashion (I use biopython) and then read every article for validation of the Gene Ontology result.
It's possible you could search for the gene and the GO term together to get a more focused result, then load the hits into a database or just create a set of files with the data. I would build a little website that would let me scan through the abstracts with links to the complete articles (PubMed supplies DOI numbers) and just check 'yea' or 'nay' on whether confirmation happened.
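Here is a rough sketch of the search-and-save step, assuming Biopython's `Entrez` module. The function names, the `[Title/Abstract]` field restriction, the output directory and the `retmax` default are my own choices for illustration, not anything the paper prescribes.

```python
from pathlib import Path


def build_query(gene, go_term):
    """Build a PubMed query that requires both the gene symbol and the GO term name."""
    return f'"{gene}"[Title/Abstract] AND "{go_term}"[Title/Abstract]'


def fetch_abstracts(gene, go_term, email, out_dir="abstracts", retmax=20):
    """Search PubMed and write one plain-text abstract file per PMID (network calls)."""
    # Imported here so build_query can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address on every request
    handle = Entrez.esearch(db="pubmed", term=build_query(gene, go_term), retmax=retmax)
    pmids = Entrez.read(handle)["IdList"]
    handle.close()

    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for pmid in pmids:
        # rettype="abstract" with retmode="text" returns a readable citation plus abstract
        handle = Entrez.efetch(db="pubmed", id=pmid, rettype="abstract", retmode="text")
        (out / f"{pmid}.txt").write_text(handle.read(), encoding="utf-8")
        handle.close()
    return list(pmids)
```

Calling something like `fetch_abstracts("TP53", "apoptotic process", "you@example.org")` would leave a folder of text files to page through. GO term names rarely appear word for word in abstracts, so a looser query may be needed in practice.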
There are other potential options for finding articles to search. The GenBank record for the gene often has a few references in it. One might use PubMed's citation feature, where you start at these confirmed articles that describe the gene in question and use PubMed to show articles that have cited that first article. I'm not sure if this is accessible by API, but you can build an article graph by crawling the bibliographic references.
For anything that was neither directly contradicted nor confirmed you would have to go back to Google and look until you found something. Clearly there are genes for which you will find no reference at all, and you will have to leave these out of the confirmation.
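On the API question: as far as I can tell, NCBI's ELink utility exposes a "cited in" link (linkname `pubmed_pubmed_citedin`), which Biopython wraps as `Entrez.elink`. I believe it only covers citations from articles deposited in PubMed Central, so treat the result as incomplete. The sketch below keeps the lookup separate from a small breadth-first crawl; `citing_pmids`, `crawl` and `max_depth` are names I made up for illustration.

```python
from collections import deque


def citing_pmids(pmid, email):
    """Return PMIDs that PubMed lists as citing `pmid` (network call)."""
    from Bio import Entrez  # lazy import: only needed for the live lookup

    Entrez.email = email
    handle = Entrez.elink(dbfrom="pubmed", id=pmid, linkname="pubmed_pubmed_citedin")
    record = Entrez.read(handle)
    handle.close()
    linksets = record[0]["LinkSetDb"]  # empty list when nothing cites the article
    return [link["Id"] for link in linksets[0]["Link"]] if linksets else []


def crawl(seeds, get_citing, max_depth=1):
    """Breadth-first crawl; returns {pmid: [citing pmids]} for nodes within max_depth hops."""
    graph = {}
    queue = deque((pmid, 0) for pmid in seeds)
    while queue:
        pmid, depth = queue.popleft()
        if pmid in graph or depth > max_depth:
            continue  # already expanded, or too far from the seed articles
        graph[pmid] = list(get_citing(pmid))
        queue.extend((child, depth + 1) for child in graph[pmid])
    return graph
```

With real data you would pass something like `lambda p: citing_pmids(p, "you@example.org")` as `get_citing`, starting from the references in the GenBank record, and keep `max_depth` small because the graph grows quickly.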
I'm not sure if they checked for any biases here. If something like interpro2go, which generates GO mappings from protein domains, was used to generate the annotations, you would want to take care that you weren't just rediscovering the references which helped generate the annotations in the first place.
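One simple guard against that circularity, assuming you can collect the PMIDs that sit behind the original annotation (that list, and the function name, are hypothetical here), is to drop them from the candidate set before you start reading:

```python
def independent_evidence(candidate_pmids, source_pmids):
    """Keep only candidates that were not already used to generate the annotation."""
    used = set(source_pmids)
    # Preserve the search order of the candidates while filtering.
    return [pmid for pmid in candidate_pmids if pmid not in used]
```

Anything that survives the filter at least counts as evidence the annotation pipeline did not already see.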