Introduction of Bioinformatics - Notes-36

Course

Introduction to bioinformatics (BINF100)

Academic year: 2022/2023
Canossa College San Pablo City

Selection of papers is already a useful result, even if a human curator must read them. The next step would be automatic extraction of the information from the paper. This is a challenge and focus of current research. CASP-like evaluations track progress.

The most basic task in computer analysis of an article is to identify the names that appear: names of genes, proteins, metabolites, drugs, and diseases (or more generally, phenotypes). Name identification depends heavily on dictionaries, but natural language processing contributes semantic information helpful in both recognizing names themselves and recognizing modifiers of names.

The next level is to identify associations and interactions. Examples include attempts to correlate genes or proteins with diseases, or, more generally, to assign function to genes or proteins. To extract interactions, the minimal pattern must include two names + one interaction, the interaction being specified by a word or a phrase. We have already seen examples of the combination:

There are many other protein–protein interactions, such as:

More complex combinations are very important: a correlation between a set of interacting proteins and two or more apparently unrelated diseases can show a hidden relationship in the mechanism underlying the diseases.
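The minimal pattern described above (two names plus one interaction word) can be illustrated with a small sketch. This is only a toy under stated assumptions: the name dictionary, the list of interaction words, and the function name `extract_interactions` are all invented for illustration, and a real system would rely on curated dictionaries and syntactic analysis rather than token matching.

```python
import re

# Hypothetical mini-dictionaries, for illustration only.
NAMES = {"TGF-beta1", "NF-kappaB", "Akt", "PTEN"}
INTERACTION_WORDS = {"activates", "inhibits", "binds", "phosphorylates"}

def extract_interactions(sentence):
    """Return (name, interaction, name) triples found in a sentence.

    For each interaction word, pair the nearest dictionary name on its
    left with the nearest dictionary name on its right.
    """
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    triples = []
    for i, token in enumerate(tokens):
        if token not in INTERACTION_WORDS:
            continue
        left = [t for t in tokens[:i] if t in NAMES]
        right = [t for t in tokens[i + 1:] if t in NAMES]
        if left and right:
            triples.append((left[-1], token, right[0]))
    return triples

# Toy sentence written for this example, not taken from the literature.
print(extract_interactions("TGF-beta1 activates NF-kappaB through Akt."))
```

A sentence with names but no interaction word yields no triple, which is the sense in which the pattern is minimal.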

Identification of references to individual genes and proteins

A basic task is to identify in a body of text the names of the relevant objects, such as genes and proteins. The difficulty is the wide range and ambiguity of names, and the use of common words as parts of gene names. The problem of identifying the species from which a gene arises is very difficult, as many genes have equivalent names in different mammalian species. It is very important to recognize species differences in searching for correlations between genes and drug activities. Tamoxifen, used widely against breast cancer, was originally developed as a birth-control pill. It is a fine contraceptive for rats but promotes ovulation in women.

Chang, Schütze, and Altman developed a program called GAPSCORE that identifies gene and protein names within submitted text.¹⁰ One might think that simply creating a dictionary and looking for its entries would suffice. Dictionaries are of course at the core of any identification procedure. But many gene names have other meanings. For instance, ‘ring’ (which stands for ‘really interesting new gene’) can also appear in articles in the biomedical literature in the context of chemical structure (‘histidine ring’) or histology (‘signet-ring cell’). Even the common colloquial sense of the word ring, as an item of jewellery, appears in the scientific literature in connection with metal-elicited contact dermatitis. Also, a dictionary should include a thesaurus, specifying, for example, that PTEN and MMAC1 are synonyms. (PTEN stands for phosphatase and tensin homolog and MMAC1 stands for mutated in multiple advanced cancers 1.)

GAPSCORE scores terms according to a statistical model based on:

  • dictionary lookup: a table of known gene names;

  • appearance: many gene names have the form NAT1; other gene or protein names end with -in. Many enzyme names end with -ase;

  • variations: the title of a recent paper included the phrase ‘conformational changes of apo- and holocalmodulin’; the prefixes apo- and holo- are used only for proteins;

  • syntax/context: the name of a protein or gene must be a noun. It is likely to be associated with certain other words, such as ‘expression’, ‘mutated’, or even ‘gene’ itself. To utilize such word combinations as effectively as possible requires syntactic analysis;

  • word morphology: the derivation and formation of terms. For example, any short term that begins cdk... is likely to be a cyclin-dependent protein kinase.
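The five kinds of evidence listed above can be combined into a single score per term. The sketch below is not GAPSCORE's statistical model, which is not described here in enough detail to reproduce. It is a crude stand-in that assumes one binary feature per evidence type and a hand-picked weight for each; the dictionary contents, the context-word list, the weights, and the function names are all hypothetical.

```python
import re

# Hypothetical dictionary, context words and weights, for illustration only.
KNOWN_GENES = {"NAT1", "PTEN", "MMAC1"}
CONTEXT_WORDS = {"expression", "mutated", "gene"}
WEIGHTS = {"dictionary": 3.0, "appearance": 1.5, "variation": 1.0,
           "context": 1.0, "morphology": 1.5}

def features(term, neighbours=()):
    """Binary features loosely mirroring the five evidence types above."""
    lower = term.lower()
    return {
        # dictionary lookup: the term is a known gene name
        "dictionary": term.upper() in KNOWN_GENES,
        # appearance: NAT1-like symbol, or an -in / -ase ending
        "appearance": bool(re.fullmatch(r"[A-Z]{2,5}\d+", term))
                      or lower.endswith(("in", "ase")),
        # variations: apo- / holo- prefixes (a crude prefix test)
        "variation": lower.startswith(("apo", "holo")),
        # context: a nearby word such as 'expression' or 'gene'
        "context": any(w.lower() in CONTEXT_WORDS for w in neighbours),
        # morphology: a short term beginning cdk...
        "morphology": len(term) <= 5 and lower.startswith("cdk"),
    }

def score(term, neighbours=()):
    """Sum the weights of the features that are present."""
    present = features(term, neighbours)
    return sum(WEIGHTS[name] for name, hit in present.items() if hit)

for term in ["NAT1", "cdk2", "holocalmodulin", "activation"]:
    print(term, score(term))
```

Even this toy shows why several weak signals are combined: a prefix test alone would also fire on an ordinary word such as ‘apoptosis’, so no single feature is reliable on its own.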

Submitting to GAPSCORE only the title of a paper,¹¹ ‘Neuroprotection by transforming growth factor-β1 involves activation of nuclear factor-κB through phosphatidylinositol-3-OH kinase/Akt and mitogen-activated protein kinase-extracellular-signal regulated kinase1,2 signaling pathways’, returned the following:

    Gene or protein name                 Quality (score)
1   Mitogen-activated protein kinase     Excellent (1)
2   Phosphatidylinositol-3-OH kinase     Excellent (1)
3   Transforming growth factor-beta1     Excellent (1)
4   Nuclear factor-kappaB                Good (0)
5   Activation                           Poor (0)
6   Neuroprotection                      Poor (0)

Note that the Greek letter β is spelt out in full.

See Weblem 3.

Identification of interactions

R. Hoffmann and A. Valencia developed a system for data mining PubMed by natural language processing to identify genes, proteins, and their interactions. Their results are available in a database named iHOP,¹² or Information Hyperlinked Over Proteins (ihop-net/UniPub/iHOP/). The basic item of iHOP data is a sentence from an abstract of an article appearing in PubMed. Appearances of any gene name, or synonym, in two different sentences provide a link. Currently the system contains 12 000 000 sentences, referring to 80 000 genes, from 1500 organisms. An example of iHOP and its navigation facilities appears in Figure 3.
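The linking rule described for iHOP (two sentences are linked when the same gene name, or a synonym, appears in both) can be sketched as an inverted index. This is an illustration of the idea only, not iHOP's implementation: the synonym table, the function names, and the three sentences are invented for the example, and tokenization is deliberately naive.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical synonym table mapping each name to one canonical symbol.
SYNONYMS = {"PTEN": "PTEN", "MMAC1": "PTEN", "NAT1": "NAT1"}

def index_sentences(sentences):
    """Map each canonical gene symbol to the ids of sentences mentioning it."""
    index = defaultdict(set)
    for sid, text in enumerate(sentences):
        for word in text.replace(",", " ").replace(".", " ").split():
            gene = SYNONYMS.get(word.upper())
            if gene:
                index[gene].add(sid)
    return index

def sentence_links(index):
    """Return sorted (sentence_a, sentence_b, gene) for sentences sharing a gene."""
    links = set()
    for gene, sids in index.items():
        for a, b in combinations(sorted(sids), 2):
            links.add((a, b, gene))
    return sorted(links)

# Toy sentences written for this example.
sentences = [
    "PTEN is a tumour suppressor gene.",
    "MMAC1 is mutated in multiple advanced cancers.",
    "NAT1 expression varies between tissues.",
]
print(sentence_links(index_sentences(sentences)))
```

The first two sentences are linked only because the synonym table maps MMAC1 to PTEN, which is why the thesaurus matters as much as the dictionary itself.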

Figure 3 Proteins associated with xeroderma pigmentosum and Cockayne syndrome, and their interactions. Arc at lower left: proteins associated with xeroderma pigmentosum. Arc at lower right: proteins associated with Cockayne syndrome. Arc at top: proteins associated with both. Lines indicate interaction pairs. Note that there is only one direct interaction between a protein associated with xeroderma pigmentosum only and another associated with Cockayne syndrome only.

From Sam, L., Liu, Y., Li, J., Friedman, C., and Lussier, Y. (2007). Discovery of protein interaction networks shared by diseases. Pacific Symposium on Biocomputing, 12, 76–87.

At the time of this work, the close connection between xeroderma pigmentosum and Cockayne syndrome, both effects of repair dysfunction, was already known. What was and still is not well understood is what, beyond the known functional defects, produces the differences in phenotype associated with the two diseases. In this respect, the mutations that produce the combined symptoms (the XP/CS complex) may be the ones that provide the clues.

Hypothesis generation

The literature implicitly contains many unsuspected relationships. D. Swanson read papers that connected magnesium and epilepsy, and papers that connected epilepsy and migraine headaches. Taken together, these suggested to him that there should be a relationship between magnesium and migraine. Subsequent research confirmed such a link. Swanson had other successes, including the suggestion that fish oil would benefit patients with Raynaud's syndrome (a disorder affecting blood vessels of the extremities). Subsequent research confirmed this suggestion as well.

Automation of Swanson's approach is an obvious goal; implementation of effective methods is not so easy. P. Srinivasan and B. Libbus developed software to apply Swanson's approach. They searched for applications of turmeric, a spice from the rhizomes of the plant Curcuma longa, containing the active compound curcumin.¹³ In Asia, turmeric is in common use in cooking. Its medicinal properties are also well known. It is an analgesic and an antiseptic, used for treatment of burns, stomach ulcers, skin diseases, and the common cold.
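Swanson's reasoning (magnesium is linked to epilepsy, epilepsy is linked to migraine, so magnesium may be linked to migraine) amounts to looking for two-step paths in a graph of published associations. The sketch below shows that idea under simple assumptions: associations arrive as plain term pairs, and the function names `build_graph` and `hypotheses` are invented here. It is not the software of Srinivasan and Libbus, and it does no ranking or filtering of candidates.

```python
from collections import defaultdict

def build_graph(pairs):
    """Undirected graph of associations reported in the literature."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def hypotheses(graph, start):
    """Terms reachable from `start` only through an intermediate term.

    Returns {candidate: set of bridging terms}. Direct neighbours of
    `start` are excluded because those links are already published.
    """
    found = defaultdict(set)
    for bridge in graph[start]:
        for candidate in graph[bridge]:
            if candidate != start and candidate not in graph[start]:
                found[candidate].add(bridge)
    return dict(found)

# The two literature links mentioned in the text above.
graph = build_graph([("magnesium", "epilepsy"), ("epilepsy", "migraine")])
print(hypotheses(graph, "magnesium"))
```

On a realistic literature graph the number of two-step candidates is very large, which is one reason effective automation is harder than the idea suggests.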

Box 3 Xeroderma pigmentosum and Cockayne syndrome: two diseases of DNA repair

  • Xeroderma pigmentosum is a genetic disorder involving a defect in the ability to repair damage caused by ultraviolet light. This leads most obviously to great sensitivity to sunlight, including tendency, upon even short exposure, to sunburn, blisters, and freckles. More devastating is the predisposition to development of malignant tumours, presumably arising from unrepaired damage to tumour-suppressor genes.
  • Cockayne syndrome shares with xeroderma pigmentosum a sensitivity to sunlight, but involves other symptoms including abnormal growth and development leading to short stature, retinal and other neurological degeneration, and premature aging. Risk of skin cancer is normal, not elevated as in xeroderma pigmentosum.
  • A small number of cases of the xeroderma pigmentosum/Cockayne complex (XP/CS) syndrome are known. Patients show symptoms of both diseases.

Disease                  Genes in which mutations appear include
Xeroderma pigmentosum    XPA, XPB (ERCC3), XPC, XPD (ERCC2), XPE (DDB2), XPF (ERCC4), XPG (RAD2, ERCC5), XPV (POLH)
Cockayne syndrome        ERCC6 (CSB), ERCC8 (CSA)
XP/CS complex            XPB (ERCC3), XPD (ERCC2), XPG (ERCC5)
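The gene lists in Box 3 can be compared directly with set operations. The sketch below transcribes the first-listed symbol of each entry in the table (an arbitrary normalization chosen here so that names are comparable) and reports which pairs of diseases share mutated genes. The function name `shared_genes` is invented for this example, and the comparison looks only at gene lists, not at the protein interactions shown in Figure 3.

```python
from itertools import combinations

# Gene sets transcribed from the table in Box 3, using the first-listed
# symbol of each entry (a normalization assumed for this example).
DISEASE_GENES = {
    "Xeroderma pigmentosum": {"XPA", "XPB", "XPC", "XPD",
                              "XPE", "XPF", "XPG", "XPV"},
    "Cockayne syndrome": {"ERCC6", "ERCC8"},
    "XP/CS complex": {"XPB", "XPD", "XPG"},
}

def shared_genes(genes_by_disease):
    """Return {(disease_a, disease_b): common genes} for overlapping pairs."""
    shared = {}
    for a, b in combinations(sorted(genes_by_disease), 2):
        common = genes_by_disease[a] & genes_by_disease[b]
        if common:
            shared[(a, b)] = common
    return shared

for pair, genes in shared_genes(DISEASE_GENES).items():
    print(pair, sorted(genes))
```

With these lists the only overlap is between the XP/CS complex and xeroderma pigmentosum (XPB, XPD, XPG); the Cockayne syndrome genes in the table do not appear in either of the other rows.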
