Introduction of Bioinformatics - Notes-36

Course

Introduction to bioinformatics (BINF100)

University

ACTS Computer College

Academic year: 2022/2023

Selection of papers is already a useful result, even if a human curator must read them. The next step would be automatic extraction of the information from the paper. This is a challenge and focus of current research. CASP-like evaluations track progress.

The most basic task in computer analysis of an article is to identify the names that appear: names of genes, proteins, metabolites, drugs, and diseases (or more generally, phenotypes). Name identification depends heavily on dictionaries, but natural language processing contributes semantic information helpful in both recognizing names themselves and recognizing modifiers of names.

The next level is to identify associations and interactions. Examples include attempts to correlate genes or proteins with diseases, or, more generally, to assign function to genes or proteins. To extract interactions, the minimal pattern must include two names + one interaction, the interaction being specified by a word or a phrase. We have already seen examples of the combination:

There are many other protein–protein interactions, such as:

More complex combinations are very important: a correlation between a set of interacting proteins and two or more apparently unrelated diseases can show a hidden relationship in the mechanism underlying the diseases.
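The minimal pattern described above (two names + one interaction word) can be sketched in a few lines of Python. This is only an illustration: the name dictionary, the list of interaction words, and the function `extract_interactions` are assumptions made for the example, and a real system would rely on curated dictionaries and syntactic analysis rather than simple token matching.

```python
import re

# Toy dictionaries, assumed for illustration only.
NAMES = {"p53", "MDM2", "PTEN", "Akt"}
INTERACTION_WORDS = {"binds", "inhibits", "activates", "phosphorylates"}


def extract_interactions(sentence):
    """Return (name, interaction, name) triples found in one sentence."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
    triples = []
    for i, token in enumerate(tokens):
        if token not in INTERACTION_WORDS:
            continue
        # Names that occur before and after the interaction word.
        left = [t for t in tokens[:i] if t in NAMES]
        right = [t for t in tokens[i + 1:] if t in NAMES]
        if left and right:
            # Pair the nearest name on each side of the interaction word.
            triples.append((left[-1], token, right[0]))
    return triples


print(extract_interactions("MDM2 binds p53 in the nucleus."))
```

A sentence that contains only one name, or no interaction word, yields no triple, which is the sense in which two names plus one interaction is the minimum.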

Identification of references to individual genes and proteins

A basic task is to identify in a body of text the names of the relevant objects, such as genes and proteins. The difficulty is the wide range and ambiguity of names, and the use of common words as parts of gene names. The problem of identifying the species from which a gene arises is very difficult, as many genes have equivalent names in different mammalian species. It is very important to recognize species differences in searching for correlations between genes and drug activities. Tamoxifen, used widely against breast cancer, was originally developed as a birth-control pill. It is a fine contraceptive for rats but promotes ovulation in women.

Chang, Schütze, and Altman developed a program called GAPSCORE that identifies gene and protein names within submitted text.10 One might think that simply creating a dictionary and looking for its entries would suffice. Dictionaries are of course at the core of any identification procedure. But many gene names have other meanings. For instance, ‘ring’ (which stands for ‘really interesting new gene’) can also appear in articles in the biomedical literature in the context of chemical structure (‘histidine ring’) or histology (‘signet-ring cell’). Even the common colloquial sense of the word ring, as an item of jewellery, appears in the scientific literature in connection with metal-elicited contact dermatitis. Also, a dictionary should include a thesaurus, specifying, for example, that PTEN and MMAC1 are synonyms. (PTEN stands for phosphatase and tensin homolog and MMAC1 stands for mutated in multiple advanced cancers 1.)

GAPSCORE scores terms according to a statistical model based on:

• dictionary lookup: a table of known gene names;
• appearance: many gene names have the form NAT1; other gene or protein names end with -in. Many enzyme names end with -ase;
• variations: the title of a recent paper included the phrase ‘conformational changes of apo- and holocalmodulin’; the prefixes apo- and holo- are used only for proteins;
• syntax/context: the name of a protein or gene must be a noun. It is likely to be associated with certain other words, such as ‘expression’, ‘mutated’, or even ‘gene’ itself. To utilize such word combinations as effectively as possible requires syntactic analysis;
• word morphology: the derivation and formation of terms. For example, any short term that begins cdk... is likely to be a cyclin-dependent protein kinase.
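To make the kinds of evidence in this list concrete, here is a minimal Python sketch that adds up hand-written feature scores for a candidate term. It is not the GAPSCORE model, which is statistical; the weights, the toy dictionary and thesaurus, and the names `score_term` and `canonical_name` are all assumptions chosen for illustration.

```python
import re

# Toy resources, assumed for illustration only.
KNOWN_NAMES = {"PTEN", "MMAC1", "NAT1"}      # dictionary lookup
SYNONYMS = {"MMAC1": "PTEN"}                 # thesaurus: synonym -> preferred name
CONTEXT_WORDS = {"expression", "mutated", "gene"}


def canonical_name(term):
    """Map a synonym to its preferred name (identity if none is recorded)."""
    return SYNONYMS.get(term.upper(), term)


def score_term(term, context_words=()):
    """Sum arbitrary, hand-picked weights for each kind of evidence."""
    lower = term.lower()
    score = 0.0
    if term.upper() in KNOWN_NAMES:                       # dictionary lookup
        score += 2.0
    if re.fullmatch(r"[A-Z]{2,5}[0-9]{1,2}", term):       # appearance: NAT1-like form
        score += 1.0
    if len(lower) > 4 and lower.endswith(("in", "ase")):  # appearance: -in / -ase
        score += 0.5
    if lower.startswith(("apo", "holo")) and len(lower) > 6:  # variations
        score += 0.5
    if lower.startswith("cdk") and len(lower) <= 5:       # word morphology
        score += 1.0
    if CONTEXT_WORDS & {w.lower() for w in context_words}:  # context
        score += 0.5
    return score


for candidate in ["NAT1", "holocalmodulin", "cdk2", "activation"]:
    print(candidate, score_term(candidate))
```

Simple suffix and prefix rules misfire on ordinary words (for example, ‘within’ ends in -in), which is one reason a statistical model that weighs several kinds of evidence together is preferable to any single rule.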

Submitting to GAPSCORE only the title of a paper,11 ‘Neuroprotection by transforming growth factor-β1 involves activation of nuclear factor-κB through phosphatidylinositol-3-OH kinase/Akt and mitogen-activated protein kinase-extracellular-signal regulated kinase1,2 signaling pathways’, returned the following:

     Gene or protein name                  Quality (score)
1    Mitogen-activated protein kinase      Excellent (1)
2    Phosphatidylinositol-3-OH kinase      Excellent (1)
3    Transforming growth factor-beta1      Excellent (1)
4    Nuclear factor-kappaB                 Good (0)
5    Activation                            Poor (0)
6    Neuroprotection                       Poor (0)

Note that the Greek letters β and κ are spelt out in full.

See Weblem 3.

Identification of interactions

R. Hoffmann and A. Valencia developed a system for data mining PubMed by natural language processing to identify genes, proteins, and their interactions. Their results are available in a database named iHOP,12 or Information Hyperlinked Over Proteins (ihop-net/UniPub/iHOP/). The basic item of iHOP data is a sentence from an abstract of an article appearing in PubMed. Appearances of any gene name, or synonym, in two different sentences provide a link. Currently the system contains 12 000 000 sentences, referring to 80 000 genes, from 1500 organisms. An example of iHOP and its navigation facilities appears in Figure 3.
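The linking rule used by iHOP, in which two sentences are connected when the same gene name or a synonym appears in both, can be sketched as an inverted index. This is a toy illustration and not the iHOP implementation: the sentences are made-up examples rather than PubMed text, and the thesaurus and the function `build_links` are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Toy thesaurus, assumed for illustration: synonym -> preferred name.
SYNONYMS = {"MMAC1": "PTEN"}


def build_links(sentences, gene_names):
    """Link every pair of sentences that mention the same gene (or a synonym)."""
    index = defaultdict(set)  # preferred gene name -> ids of sentences mentioning it
    for sentence_id, text in enumerate(sentences):
        words = {w.strip(".,;()") for w in text.split()}
        for name in gene_names:
            if name in words:
                index[SYNONYMS.get(name, name)].add(sentence_id)
    links = set()
    for gene, sentence_ids in index.items():
        for a, b in combinations(sorted(sentence_ids), 2):
            links.add((a, b, gene))
    return links


example_sentences = [
    "PTEN is mutated in many tumours.",
    "Loss of MMAC1 activates Akt.",
    "Akt promotes cell survival.",
]
print(sorted(build_links(example_sentences, {"PTEN", "MMAC1", "Akt"})))
```

Sentences 0 and 1 are linked only because the thesaurus maps MMAC1 to PTEN, which shows why synonym handling matters for this kind of navigation.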

Figure 3 Proteins associated with xeroderma pigmentosum and Cockayne syndrome, and their interactions. Arc at lower left: proteins associated with xeroderma pigmentosum. Arc at lower right: proteins associated with Cockayne syndrome. Arc at top: proteins associated with both. Lines indicate interaction pairs. Note that there is only one direct interaction between a protein associated with xeroderma pigmentosum only and another associated with Cockayne syndrome only.

From Sam, L., Liu, Y., Li, J., Friedman, C., and Lussier, Y. (2007). Discovery of protein interaction networks shared by diseases. Pacific Symposium on Biocomputing, 12, 76–87.

At the time of this work, the close connection between xeroderma pigmentosum and Cockayne syndrome, both effects of repair dysfunction, was already known. What was and still is not well understood is what, beyond the known functional defects, produces the differences in phenotype associated with the two diseases. In this respect, the mutations that produce the combined symptoms (the XP/CS complex) may be the ones that provide the clues.
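The analysis behind a figure of this kind reduces to set operations on two disease-associated protein sets plus a list of interaction pairs. The sketch below uses placeholder protein names (P1 to P4) and invented pairs purely to show the computation; they are not data from the study, and the function `shared_and_bridging` is an assumption made for the example.

```python
def shared_and_bridging(assoc_a, assoc_b, interactions):
    """Return proteins associated with both diseases, and interaction pairs
    that join a protein specific to one disease to a protein specific to the other."""
    shared = assoc_a & assoc_b
    only_a = assoc_a - assoc_b
    only_b = assoc_b - assoc_a
    bridges = {
        (p, q)
        for p, q in interactions
        if (p in only_a and q in only_b) or (p in only_b and q in only_a)
    }
    return shared, bridges


# Placeholder data, assumed for illustration only.
disease_a = {"P1", "P2", "P3"}
disease_b = {"P3", "P4"}
pairs = {("P1", "P4"), ("P1", "P2"), ("P2", "P3")}

shared, bridges = shared_and_bridging(disease_a, disease_b, pairs)
print("associated with both:", shared)
print("direct cross-disease interactions:", bridges)
```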

Hypothesis generation

The literature implicitly contains many unsuspected relationships. D. Swanson read papers that connected magnesium and epilepsy, and papers that connected epilepsy and migraine headaches. Taken together, these suggested to him that there should be a relationship between magnesium and migraine. Subsequent research confirmed such a link. Swanson had other successes, including the suggestion that fish oil would benefit patients with Raynaud's syndrome (a disorder affecting blood vessels of the extremities). Subsequent research confirmed this suggestion as well.

Automation of Swanson's approach is an obvious goal; implementation of effective methods is not so easy. P. Srinivasan and B. Libbus developed software to apply Swanson's approach. They searched for applications of turmeric, a spice from the rhizomes of the plant Curcuma longa, containing the active compound curcumin.13 In Asia, turmeric is in common use in cooking. Its medicinal properties are also well known. It is an analgesic and an antiseptic, used for treatment of burns, stomach ulcers, skin diseases, and the common cold.
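Swanson's reasoning can be phrased as a search for indirect links: if the literature connects A with B, and B with C, but never A with C directly, then A and C form a candidate hypothesis. The following Python sketch applies that idea to the magnesium example from the text. It is a bare illustration and not the software of Srinivasan and Libbus; the function `indirect_links` and its input format are assumptions.

```python
from collections import defaultdict


def indirect_links(pairs):
    """Given term pairs connected in the literature, propose pairs (a, c)
    that are not directly connected but share an intermediate term b."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    known = {frozenset(pair) for pair in pairs}
    hypotheses = defaultdict(set)  # (a, c) -> set of intermediate terms
    for b, terms in neighbours.items():
        for a in terms:
            for c in terms:
                # a < c avoids reporting each pair twice and excludes a == c.
                if a < c and frozenset((a, c)) not in known:
                    hypotheses[(a, c)].add(b)
    return dict(hypotheses)


literature = [("magnesium", "epilepsy"), ("epilepsy", "migraine")]
print(indirect_links(literature))
```

In practice almost any two terms are connected through some intermediate, so the hard part is ranking candidates, which is consistent with the remark that effective implementation is not so easy.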

Box 3 Xeroderma pigmentosum and Cockayne syndrome: two diseases of DNA repair

Xeroderma pigmentosum is a genetic disorder involving a defect in the ability to repair damage caused by ultraviolet light. This leads most obviously to great sensitivity to sunlight, including a tendency, upon even short exposure, to sunburn, blisters, and freckles. More devastating is the predisposition to development of malignant tumours, presumably arising from unrepaired damage to tumour-suppressor genes.
Cockayne syndrome shares with xeroderma pigmentosum a sensitivity to sunlight, but involves other symptoms including abnormal growth and development leading to short stature, retinal and other neurological degeneration, and premature aging. Risk of skin cancer is normal, not elevated as in xeroderma pigmentosum.
A small number of cases of the xeroderma pigmentosum/Cockayne complex (XP/CS) syndrome are known. Patients show symptoms of both diseases.

Disease                  Genes in which mutations appear
Xeroderma pigmentosum    XPA, XPB (ERCC3), XPC, XPD (ERCC2), XPE (DDB2), XPF (ERCC4), XPG (RAD2, ERCC5), XPV (POLH)
Cockayne syndrome        ERCC6 (CSB), ERCC8 (CSA)
XP/CS complex            XPB (ERCC3), XPD (ERCC2), XPG (ERCC5)
