Introduction of Bioinformatics - Notes-36
Course: Introduction to bioinformatics (BINF100)
University: ACTS Computer College
Selection of papers is already a useful result, even if a human curator must read them. The next
step would be automatic extraction of the information from the paper. This remains a challenge and a
focus of current research. CASP-like community evaluations track progress.
The most basic task in computer analysis of an article is to identify the names that appear: names
of genes, proteins, metabolites, drugs, and diseases (or more generally, phenotypes). Name
identification depends heavily on dictionaries, but natural language processing contributes semantic
information helpful in both recognizing names themselves and recognizing modifiers of names.
The next level is to identify associations and interactions. Examples include attempts to correlate
genes or proteins with diseases, or, more generally, to assign function to genes or proteins. To extract
interactions, the minimal pattern must include two names and one interaction, the interaction being
specified by a word or a phrase. We have already seen examples of the combination:
There are many other protein–protein interactions, such as:
More complex combinations are very important: a correlation between a set of interacting proteins
and two or more apparently unrelated diseases can show a hidden relationship in the mechanism
underlying the diseases.
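The minimal pattern described above (two names plus one interaction word) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `NAMES` and `INTERACTION_WORDS` sets are hypothetical stand-ins for curated dictionaries, and `extract_triples` is a name invented here, not part of any published tool.

```python
import re

# Hypothetical mini-dictionaries (assumptions for illustration only);
# a real system would load curated lexicons of names and interaction terms.
NAMES = {"PTEN", "AKT1", "TP53", "MDM2"}
INTERACTION_WORDS = {"binds", "inhibits", "activates", "phosphorylates"}


def extract_triples(sentence):
    """Return (name, interaction, name) triples found in one sentence.

    For every interaction word, pair the nearest known name to its left
    with the nearest known name to its right.
    """
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    triples = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in INTERACTION_WORDS:
            continue
        left = [t for t in tokens[:i] if t in NAMES]
        right = [t for t in tokens[i + 1:] if t in NAMES]
        if left and right:
            triples.append((left[-1], tok.lower(), right[0]))
    return triples
```

Calling `extract_triples("MDM2 binds TP53 in vitro.")` yields one triple, while a sentence that contains only one known name yields none. A pattern this simple ignores negation, passive voice, and multi-word interaction phrases, which is part of why the extraction problem remains hard.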
Identification of references to individual genes and proteins
A basic task is to identify in a body of text the names of the relevant objects, such as genes and
proteins. The difficulty is the wide range and ambiguity of names, and the use of common words as
parts of gene names. The problem of identifying the species from which a gene arises is very
difficult, as many genes have equivalent names in different mammalian species. It is very important
to recognize species differences in searching for correlations between genes and drug activities.
Tamoxifen, used widely against breast cancer, was originally developed as a birth-control pill. It is a
fine contraceptive for rats but promotes ovulation in women.
Chang, Schütze, and Altman developed a program called GAPSCORE that identifies gene and
protein names within submitted text [10]. One might think that simply creating a dictionary and looking
for its entries would suffice. Dictionaries are of course at the core of any identification procedure.
But many gene names have other meanings. For instance, ‘ring’ (which stands for ‘really interesting
new gene’) can also appear in articles in the biomedical literature in the context of chemical structure
(‘histidine ring’) or histology (‘signet-ring cell’). Even the common colloquial sense of the word
ring, as an item of jewellery, appears in the scientific literature in connection with metal-elicited
contact dermatitis. Also, a dictionary should include a thesaurus, specifying, for example, that PTEN
and MMAC1 are synonyms. (PTEN stands for phosphatase and tensin homolog and MMAC1 stands
for mutated in multiple advanced cancers 1.)
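To make the two points above concrete (synonyms need a thesaurus, and ambiguous words need context), here is a minimal sketch of a dictionary lookup. It is not GAPSCORE's method. The `SYNONYMS` table, the `NON_GENE_CONTEXT` word lists, and the function name `find_gene_mentions` are all assumptions made up for this example.

```python
import re

# Hypothetical thesaurus: every synonym maps to one canonical symbol.
SYNONYMS = {"PTEN": "PTEN", "MMAC1": "PTEN", "RING": "RING"}

# Hypothetical context words that suggest a non-gene sense of an
# ambiguous term (chemical structure, histology, jewellery).
NON_GENE_CONTEXT = {"RING": {"histidine", "signet", "cell", "jewellery"}}


def find_gene_mentions(text):
    """Return (surface form, canonical symbol) pairs for likely gene mentions."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    mentions = []
    for i, tok in enumerate(tokens):
        canonical = SYNONYMS.get(tok.upper())
        if canonical is None:
            continue  # not in the dictionary at all
        # Inspect a window of two tokens on each side for disqualifying context.
        window = {t.lower() for t in tokens[max(0, i - 2): i + 3]}
        if window & NON_GENE_CONTEXT.get(canonical, set()):
            continue  # probably the everyday or chemical sense
        mentions.append((tok, canonical))
    return mentions
```

With this table, `MMAC1` in running text is reported under the canonical symbol `PTEN`, while the `ring` in ‘histidine ring’ or ‘signet-ring cell’ is skipped. A fixed word list is a crude disambiguator: any sense of ‘ring’ not anticipated in the list would still be reported as a gene, which is one motivation for statistical scoring.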
GAPSCORE scores terms according to a statistical model based on:
• dictionary lookup: a table of known gene names;
• appearance: many gene names have the form NAT1; other gene or protein names end with -in.
Many enzyme names end with -ase;
• variations: the title of a recent paper included the phrase ‘conformational changes of apo- and