
Critical Appraisal of Epidemiological Studies and Clinical Trials (4th edn)
Mark Elwood
https://doi.org/10.1093/med/9780199682898.001.0001
Published: 2017 Online ISBN: 9780191763045 Print ISBN: 9780199682898
CHAPTER

https://doi.org/10.1093/med/9780199682898.003.0006 Pages 103–130
Published: February 2017

Abstract

Keywords: Bias, error, non-differential, differential, misclassification, blind assessment, recall bias, kappa,
sensitivity, specificity, predictive value, validity
Subject: Public Health, Epidemiology
Collection: Oxford Medicine Online

6 Error and bias in observations 

Mark Elwood

This chapter distinguishes error and bias, and non-differential and differential misclassification. Non-differential misclassification almost always biases results toward the null, while differential misclassification can affect results in any direction. Methods to minimise observation bias include single, double and triple blind assessment. The chapter discusses recall and other biases, with methods of assessment and avoidance, and practical issues in reducing error and bias. In part two, it shows how to measure and adjust for observational error and bias, including kappa and adjusting for non-differential misclassification, and similar adjustments using continuous exposure measures. Effects with more than two categories of outcome or exposure, and of the misclassification of confounders, are discussed. In assessing the accuracy of information, sensitivity, specificity, and predictive value are defined, and the calculation of the effects of misclassification using sensitivity and specificity is shown.

Mathematics may be compared to a mill of exquisite workmanship, which grinds you stuff of any degree of fineness; but, nevertheless, what you get out depends on what you put in; and as the grandest mill in the world will not extract wheat-flour from peascod, so pages of formulae will not get a definite result out of loose data.
—T.H. Huxley: Geological Reform; 1869


Introduction

This chapter falls into two parts. First we will discuss general principles of identifying and minimizing error (inaccuracies) in observations, and in the second part we will look at how to measure and adjust for such errors.

Part 1. Sources of error and of bias

In the previous chapters we have seen that in cohort studies and intervention trials, the subjects are defined by their exposure or intervention, and then the outcome is assessed. In case–control studies, the subjects are selected by their outcome, and their exposure is assessed. In both types of study, other relevant factors will also be assessed. The way in which the subjects are chosen sets the study design, influences the hypothesis being tested, and determines the external validity. For example, a study of treatment of rheumatoid arthritis may be relevant only to patients of a certain age who have a particular form of rheumatoid arthritis.

In the next stage of assessing scientific work, either our own or that of others, we accept what has been done in terms of the subjects included in the study and the design used. We can then ask this central question: do the results support a causal relationship between the exposure and the outcome, within the confines of the particular study? If any association is shown within the study, it must be due to one (or more) of four mechanisms: observation bias, confounding, chance, or causation. In this chapter, we shall deal with observation bias and observation error. Observation bias is most relevant to the measurement of the dependent variable in the study, that is, the outcome in studies of a cohort design and the exposure in studies of a case–control design.

The central issue is the relationship between the true value of the factor being assessed (outcome or exposure) and the value of the variable that represents that factor in the study. In the study of treatment for rheumatoid arthritis, a relevant outcome would be an improvement in the function of the affected joints. How this improvement can best be assessed is a major question in the study design; possibilities include X-ray appearances, physiological measures such as hand grip, and questionnaire assessments of degree of functional impairment. Expert knowledge is obviously required, and attention must be paid to the acceptability, reproducibility, and relevance of the measures considered.

The variable measured in a study is often considerably removed from the biological factor or event that is defined in the causal hypothesis. Consider a case–control study assessing whether high vitamin C consumption is protective against heart disease. The causal hypothesis relates the occurrence of heart disease to the intake of vitamin C over a long time period many years before the clinical diagnosis. The variable used to represent this factor in the retrospective study may be the responses to a questionnaire on diet at a defined period in the recent past, converted through a formula into an estimate of vitamin C consumption at that time. The variable appearing in the results as 'exposure' is considerably different from the biological 'exposure' in the hypothesis. A good example is given by studies assessing the association between gastric cancer and infection with Helicobacter pylori, a bacterium that can survive within the stomach.

Error or bias: non-differential and differential misclassification

Figure 6. Sources of error and of bias in the observed value of a variable compared with the true value of the factor it represents.

However, in comparative studies, the important issue is whether the errors, whether random or systematic, differ between the groups being compared. In studies comparing groups of people defined by exposures and outcomes, error in assessment is important if it results in misclassification. Consider a case–control study of heart disease, using a questionnaire to assess previous vitamin C intake, defining exposure as consumption less than some defined amount. Some subjects who are truly exposed (and would be reported as exposed if an accurate measure of vitamin C intake were used) will instead be classed as non-exposed, on the basis of the less accurate assessment; and some truly non-exposed subjects will be reported as exposed. The critical issue is whether this misclassification error occurs to the same extent in the case and in the control groups, giving non-differential misclassification, or whether it happens more (or less) in the case group, giving differential misclassification. These two situations have very different consequences, with differential errors being more difficult to deal with. Non-differential and differential misclassification can be regarded as error and bias, respectively.

Thus, the key principle for any comparative study is that the methods used will be applied in the same manner and with the same care to all the subjects in the study, irrespective of the group to which they belong. If this is done, then we will accept that a degree of error exists, but may be able to conclude that there is little possibility of systematic differences between the groups being compared, so that the error is non-differential.

Consider a study of breast cancer cases and controls where we want to assess weight. We may have data from simply asking the women their weight in an interview; we may also have the measured weight from the medical record for the cases. Although the medical record weight may be more accurate, the interview response is more useful, as it has been assessed in the same way for cases and for controls. Using the interview weight for controls and the medical record weight for cases is likely to give differential misclassification.

Non-differential misclassification and its effects

Misclassification, even if non-differential, is of course important. The greater the error, the more 'noise' there is in the system, and therefore the more difficult it is to detect a true difference between the groups being compared. In the extreme situation, if the measurement used is so inaccurate that its value has no relationship to the true value of the factor being assessed, we will not detect any differences between groups of subjects even if large differences exist. Thus, if physiotherapy is actually beneficial in improving joint function, a reasonably accurate method of assessing joint function will show this improvement, while a very inaccurate method will show no difference between treated and untreated groups of patients.

The effect of non-differential error is to make the observed association closer to the null value than is the true situation. Thus if a study shows a strong association, this association cannot be produced by error in the measurements used; on the other hand, if a study shows no association or a weak association, error in the observations may be disguising a much stronger association. This applies to dichotomous outcomes or exposures (only two categories); with several categories the effects can be more complex.

Studies of disease causation often produce only weak associations, with small relative risks. This can happen if the exposure variable measured is only an inaccurate estimate of the true biological factor concerned. Where the factor assessed is a closer estimate of the true biological agent, relative risks will be higher. The inhalation of certain types of wood dust is a cause of cancers in the nose and nasal sinuses. If we compare employees in an industry which uses wood with the general working population, we find a moderately increased relative risk, perhaps of 2–3; if we compare workers employed on dusty processes which use wood with the general working population, we find a much higher relative risk, perhaps 10 or more; while if we assess workers who personally have had exposure over many years to particular types of wood, we find a relative risk of 100 or more. Thus in a study in France, a relative risk of 303 was found for workers with over 35 years' exposure to hardwoods, while softwoods showed no increased risk [ 2 ]. The closer we come to the biological causal factor, the higher the relative risk will become.
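The pull toward the null can be seen by working through the arithmetic. The following sketch (in Python, with entirely hypothetical prevalences, sensitivity, and specificity; none of these figures come from the text) applies the same imperfect exposure measure to cases and controls and compares the true and observed odds ratios:

    def classified_exposed(p_exposed, sensitivity, specificity):
        # Proportion classified as exposed when the true proportion is
        # p_exposed and the measure has the given sensitivity and
        # specificity, applied identically to every subject (non-differential).
        return p_exposed * sensitivity + (1 - p_exposed) * (1 - specificity)

    def odds_ratio(p1, p0):
        # Odds ratio comparing exposure proportions in cases (p1) and controls (p0).
        return (p1 / (1 - p1)) / (p0 / (1 - p0))

    p_cases, p_controls = 0.30, 0.15   # assumed true exposure prevalences
    sens, spec = 0.80, 0.90            # same in both groups: non-differential

    print(odds_ratio(p_cases, p_controls))              # true OR, about 2.43
    print(odds_ratio(classified_exposed(p_cases, sens, spec),
                     classified_exposed(p_controls, sens, spec)))
    # observed OR, about 1.74

Because the same error rates apply to both groups, the observed odds ratio is closer to the null than the true one; making the measure less accurate moves it closer still.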

Differential misclassification: observation bias

Differential misclassification is misclassification that differs in its size or direction between one of the groups under study and the others. This can be referred to as observation bias, in contrast to (non-differential) error. It is a much more serious problem, as observation bias can influence the results of a study in any direction. Most importantly, it can produce an association when there is no true difference between the groups being compared.
So if a study shows an association, observation bias is a possible explanation of the association seen; and it is the first issue we should consider, and perhaps rule out, before going on to consider other explanations for the results, including causation.

Family history of congenital abnormalities

In a study of family history, the mothers of babies with central nervous system malformations were asked whether there had been similarly affected babies in their own, or their partner's, family (Table 6). The results showed that these had occurred more in the maternal than in the paternal relatives. Such observations have led to hypotheses of complex sex-linked inheritance patterns. However, when control mothers (mothers of normal babies) were asked the same questions, they also reported more malformations in their families than in the fathers' families, which must be due to biased reporting. Indeed, the relative risks were the same [ 4 ]. The bias is produced by mothers generally knowing more about births in their own family than in their partner's family.

Table 6 Recall bias in a genetic study. Table A compares the reported frequency of central nervous system malformations (CNSM) in cousins of an index series of 547 cases of these defects, and shows higher frequencies of CNS defects in maternal compared with paternal relatives (relative risk ≈ 2.4). However, Table B shows the reported frequencies of CNSM in cousins of control births which did not have CNS defects, and shows a similar maternal–paternal difference (relative risk ≈ 2.4) [ 4 ].

(A) Cousins of index subjects
  Mother's siblings' children: 2327 total, 26 with CNSM (1.1% affected)
  Father's siblings' children: 2627 total, 12 with CNSM (0.5% affected)
(B) Cousins of control subjects
  Mother's siblings' children: 1231 total, 9 with CNSM (0.7% affected)
  Father's siblings' children: 1333 total, 4 with CNSM (0.3% affected)

Source: Carter CO et al., 'A family study of major central nervous system malformations in South Wales', Journal of Medical Genetics, Volume 5, Issue 2, pp. 81–106, © 1968 [ 4 ].

These examples show that if the information is biased, the associations seen may be strong. There is no point in applying statistical tests to biased data; the fact that the associations are statistically significant gives no protection against observation bias.

Outcomes in a clinical trial

Methods to minimize bias

Definitions of exposure and outcome variables

Issues of outcome measurement in a randomized trial comparing antibiotics and placebo in the treatment of otitis media in children in Pittsburgh, Pennsylvania, USA, resulted in a major conflict referred to as the 'Cantekin affair'. The first report of this trial, published in The New England Journal of Medicine, showed a doubling of the frequency of clinical resolution of the disease with antibiotic treatment, based on an outcome determined mainly by clinical examination [ 5 ]. However, other investigators in the trial thought that the ear examinations were open to observation bias, and submitted another analysis concluding that no benefit was seen, based on tympanometric measurements [ 6 ]. They concluded that the clinical examination results for each ear were not done independently, and that they were not compatible with more objective measurements. In fact, the difference was more quantitative than qualitative: the clinical examination showed a larger effect, while the results based on tympanometry also showed a benefit, but smaller, and not statistically significant in the analysis published. Both groups of investigators agreed that no benefit was shown if a further end point, hearing tests, was used. There were allegations that some of the clinical investigators had been influenced by drug company funding. This conflict led to several investigations, which criticized both main parties, and raised issues about the peer-review and publication processes [ 7 , 8 ].

The most important sources of bias are variation in the subject's response to the method of assessment, and variation in the observer's response (Figure 6). The main principle in avoiding bias is to ensure that the same methods are used, under the same circumstances, by the same observers, for all subjects involved in the study, and to employ double- or single-blind techniques as far as possible.

The outcome or exposure measures used must not only be relevant to the hypothesis, but be chosen to be objective, reproducible, and robust; that is, likely to be little influenced by variations in the method of testing. We must guard against mistakes in both directions, however; while an outcome which is extremely difficult to measure and open to highly subjective interpretation may be of little value, there is also the danger of choosing an outcome simply because it can be measured easily, even if it is not directly relevant to the hypothesis under test, or may even result in a distortion or change in that hypothesis. For example, we may want to know if a health education programme results in subjects changing their diet (behaviour), but as this is very difficult to measure, we may choose to use something much simpler, such as the subjects' responses to factual questions about diet (knowledge), or their opinions about changing it (attitudes). The hypothesis under test in our study has now changed. The study may still be worthwhile, but we must not assume, unless we have good evidence, that an improvement in knowledge or attitudes will result in a change in behaviour. This has been referred to as a 'substitution game' [ 9 ]; we may want to know if a new drug for asthma produces major improvements such as fewer hospital admissions or even fewer deaths, but may only measure easy and short-term outcomes like improvements in respiratory capacity.

Triple-blind studies

Observational cohort studies: bias in outcome assessment

Example: biased outcome assessment

The term 'triple-blind' has been used where the analysis of the trial is carried out by investigators who have information on which study group each subject is in, but do not know which treatment was allocated to each group; this is often easy to achieve and desirable. It is particularly important in interim analyses, which may modify the conduct of the study and even terminate it early. Thus in a cluster-randomized trial comparing vitamin A supplementation with a placebo for women in Ghana, with mortality as the outcome, neither the women participants, the fieldworkers running the study, the doctors confirming and classifying the deaths, nor the statisticians and others on the monitoring committee were aware of the treatment allocated to any individual [ 13 ]. To add to the confusion, some authors separate blindness of the investigators who deal with entry to the study and treatment from that of the investigators who determine outcome; then, adding patient-blindness and data analyst-blindness produces a 'quadruple-blind' study; an example is a trial of prevention of renal dysfunction after cardiac surgery [ 14 ]. Less seriously, some have referred to higher levels of blinding as studies where, even after analysis, no one knows what the results mean.

In cohort studies, the subjects are selected in terms of their exposure, and the bias question applies mainly to the outcome data. In observational cohort studies, the subjects are usually aware of their exposure, and the outcome may be assessed by the subjects' response to questionnaires or by routine clinical records. In such a situation, single- or double-blind outcome assessment may be impossible. It may be useful to compare the exposed and comparison groups in terms of the frequency with which routine examinations are done or outcomes are reported, who reports them, and the completeness and consistency of the observations. Comparability in these process measures will support comparability in the results. Another useful ploy is to look for specificity of the result, by showing that outcomes that are irrelevant to the causal hypothesis are similar in the groups being compared. A critical issue is whether the exposure would influence the outcome assessed. In a retrospective cohort study comparing professional musicians with other workers in a large insurance system, the musicians were found to have a higher risk of hearing loss based on routine records; but this may be biased, as musicians may be more likely than other workers to complain about or be assessed for hearing loss, perhaps for compensation reasons [ 15 , 16 ]. A study where samples of musicians and others were tested objectively in the same way would be more convincing. Outcomes where legal issues may be involved may be difficult, such as studies of post-traumatic stress disorder.

Further examples

Case–control studies; recall bias in exposure assessment

Example of recall bias

Two cohort studies of oral contraceptive use were described in Chapter 5 [ 17 , 18 ]. In both these studies, the outcome data were based on routine medical and clinical records. These could have been biased directly by knowledge of the method of oral contraceptive use, if certain conditions were looked for more carefully in women using a certain type of contraceptive method, and also indirectly, in that women using oral contraceptives might have visited a general practitioner or clinic more, or less, frequently than other women. For example, in the Royal College of General Practitioners' study, 18 per cent of all diagnoses were recorded on the prescription date of the oral contraceptive, suggesting that some complaints might have come to the general practitioner's notice only because the patient had to visit for the prescription [ 17 ]. Observation bias will be less likely if the assessment methods are more objective. In the Family Planning Association study, the association between oral contraceptive use and venous embolism was stronger where the evidence for the diagnosis was more objective [ 19 ], making bias less likely as an explanation. In this study, the morbidity information used was restricted to hospital referrals, partly in order to avoid problems of observation bias.

In case–control studies, the main bias issue applies to the documentation of past exposure. In most case–control studies, information is obtained by interviewing cases and controls. The central issue is recall bias (or response bias): a differential response to questions between cases, who have been diagnosed with disease, and controls, who have not. A good study design will ensure use of a well-designed standardized interview, a consistent approach by well-trained interviewers, and a supportive and non-judgemental atmosphere for the interview. However, a difference in the ability or willingness to report past events is likely, even if unconscious on the part of the subject. Where the study concerns sensitive issues, this recall bias may be more marked.

For example, a meta-analysis (by methods to be described in Chapter 9) has brought together data on 83 000 women with breast cancer from 53 studies in 16 countries, relating breast cancer to a previous spontaneous or induced abortion [ 20 ] (Table 6). These included cohort studies, and case–control studies in which record linkage methods used information on abortion that had been recorded before the occurrence of the breast cancer: these studies may have random error in the data on abortions, but differential bias can be excluded. There were also case–control studies using retrospective interviews, where the data on abortions were collected from the women after the diagnosis of breast cancer in the cases; these studies are open to bias as well as random error. For spontaneous abortion, both types of study showed no association. For induced abortion, no increased risk was seen in cohort studies or in the case–control studies with the information on abortions recorded before breast cancer occurred. But the case–control studies using retrospective interviews showed a modest but statistically significant association with induced abortion which, given that no association was seen in the other studies, is due to recall bias. The women who had been diagnosed with breast cancer must have reported induced abortions more readily than the control women [ 20 ].

Assessing bias

Example

Practical issues in reducing bias and error

Some data may be helpful in judging whether subject or observer bias may be a problem: the length of time taken for examinations or interviews, the interviewer's assessment of the cooperation of the subject and the degree of difficulty experienced with some of the key questions, and asking the subjects at the end of the interview whether they are aware of any relationship between their condition and some of the factors asked about. Similarly, the examiners or interviewers can be asked to record whether they became aware of the case or control status of the subject before or during the assessment. Such recordings give the possibility of analysing subsets of data for subjects who were or were not aware of the key hypothesis, and those in whom the observer did or did not know their status. Questions for which cases and controls would be expected to give similar answers may be useful. If the study relies on a small number of interviewers, the results for each interviewer should be examined. In a large survey of women in the United States carried out by four different interviewers, no interviewer variation was seen for questions requiring recall of specific events, but the responses to questions involving subjective and personal information, or requiring further probing from the interviewer, varied between interviewers. As a result, results on the impact of support networks on psychological symptoms varied depending on which interviewer's data were used [ 22 ].

In the case–control study of breast cancer in New Zealand described in Chapter 5 [ 23 ], the information was collected by a standardized telephone interview, after an initial approach by letter. The interviewer did not know whether the interviewee was a cancer patient or a comparison subject, which provides some protection against interviewer bias. However, the subjects themselves were well aware of whether they had been treated for breast cancer or not. A standardized, non-emotive, and systematic interview technique is the best protection against bias. The bias could also be overcome if information on the exposure of interest, in this case oral contraceptive use, were obtained from other independent sources, such as medical records (as applied in the study of abortions in Table 6). The investigators did assess general practitioners' records for women who reported recent use of prescribed contraceptives, and concluded that there was 'close agreement' between this information and that given by the women themselves. However, doing this in practice is often difficult, and it is unlikely that such sources will give comparable information on all the relevant confounding factors.

The design of methods of investigation that minimize error and bias is a large subject in its own right; we will only summarize some of the main approaches. Important issues include the definition of the items to be recorded, the choice of methods of measurement, the standardization of procedures, and quality control of all aspects of data gathering and processing (Table 6). It is essential in any research study to define precisely the factor being assessed, even when it appears simple; consider the definitional issues involved in items such as tumour stage, cardiac failure, pain relief, social class, diastolic blood pressure, high-fat diet, or cellular atypia.

Table 6 Bias and error. An outline scheme to assist in the consideration of issues of observation bias and error. The questions should be considered for the whole study, and specifically with regard to the comparability of the relevant groups: exposed and unexposed in cohort and intervention studies; affected and unaffected in case–control studies.

What is the definition of the factor being assessed?
  Is it the same for each group?
  Is it appropriate to the hypothesis?
What is the method of assessment?
  Instrument used
  Observer making the assessment
  Circumstances of use
  Subjects' circumstances
  Subjects' knowledge and cooperation
  Are the methods of assessment similar for each group?
  Are the subjects, or the observers, aware of the grouping of the subjects when the assessment is made?
  How accurate and reliable is the method of assessment?
When is the observation made?
  In calendar time
  In relation to the hypothesis
  Is it the same for each group?
How are the data handled?
  Recording, coding
  Computation
  Are the methods the same for each group?

The 'instrument' used to assess the factor must then be chosen: this general term includes any means of assessment, such as a clinical examination, laboratory test, questionnaire, or review of medical records. The way in which the instrument is to be applied must be standardized: by whom, when, how, and under what circumstances. As an example of a difficult item to measure, the prevalence of stress disorder in 641 Australian Vietnam veterans was assessed. The lifetime prevalence of combat-related post-traumatic stress disorder assessed by a standardized interview format was 12 per cent, but when assessed by an interview which gave the interviewer the opportunity to interact more with the subject, it was 21 per cent. Moreover, with the latter instrument, the prevalence found when the interviewer was a trained counsellor was up to twice as high as with non-counsellor interviewers, and for the counsellors, was considerably greater with a female interviewer than a male [ 24 ].

Quality control procedures should be used to monitor the information collected throughout the study, and to produce data that will attest to its quality. These processes of definition, standardization, and quality control are relevant not only to the collection of information, but also to recording, coding, and computer entry. Quality control should include systematic checks for gross errors, such as variables which are irreconcilable (such as males with menstrual problems), contradictory (such as non-matching age and date of birth), or outside an expected range; systematic checks for inconsistencies, such as addresses or diagnoses from different sources; and rechecking of all or a sample of examination, interview, coding, and data entry procedures.
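Checks of this kind are easy to automate. As a minimal sketch in Python with pandas (the column names, example records, and the permitted age range here are all hypothetical):

    import pandas as pd

    records = pd.DataFrame({
        "sex": ["M", "F", "F", "M"],
        "menstrual_problems": [False, True, False, True],
        "age": [34, 29, 41, 130],
    })

    # Irreconcilable combinations, e.g. males recorded with menstrual problems:
    print(records[(records["sex"] == "M") & records["menstrual_problems"]])

    # Values outside an expected range:
    print(records[~records["age"].between(0, 110)])

In practice the same pattern extends to contradictory fields (age against date of birth) and to cross-checking addresses or diagnoses from different sources.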

Table 6 Use of a re-interview technique to assess consistency of data. For questions related to the aetiology of melanoma, subjects were re-interviewed after a period of 1–3 years, and the data on consistency were used to adjust the risk estimates in a case–control study [ 26 ].

Question: 'Have you ever had freckles?'

                          Second survey
                     Yes          No           Total
First survey  Yes    255 (0.39)   25 (0.04)    280 (0.43)
              No      28 (0.04)  338 (0.52)    366 (0.57)
Total                283 (0.44)  363 (0.56)    646
(Proportions are of all 646 subjects.)

Source: data from Westerdahl J et al., 'Reproducibility of a self-administered questionnaire for assessment of melanoma risk', International Journal of Epidemiology, Volume 25, pp. 245–251, © Oxford University Press [ 26 ].

Observed agreement = (255 + 338)/646 = 0.92
Expected agreement by chance = (0.43 × 0.44) + (0.57 × 0.56) = 0.51
Kappa, κ = (observed agreement − expected agreement)/(1 − expected agreement) = (0.92 − 0.51)/(1 − 0.51) = 0.83

However, even if there were no relationship between an individual's responses to the two surveys, substantial agreement would be expected by chance alone: this amount can be calculated as shown. The logic is that as 0.43 of all subjects gave a 'yes' response on the first survey, and 0.44 on the second, if the two responses were unrelated, the expected proportion giving 'yes' responses on both occasions would be 0.43 × 0.44 = 0.19. Similarly, the expected proportion responding 'no' on both surveys is 0.57 × 0.56 = 0.32. The total expected agreement is the sum of the expected agreement for each category: here it is 0.51. If we take the excess of agreement over expected agreement by chance (0.92 − 0.51), and divide it by the potential excess, which is (1 − 0.51), we obtain a statistic known as kappa, κ:

κ = (a − e)/(1 − e)

where a is the proportion of subjects giving consistent responses, and e is the proportion with consistent responses expected by chance alone. Kappa has a range from +1 (complete agreement), through 0 (agreement equal to that expected by chance), to negative values (agreement less than that expected by chance). Here kappa is 0.83, which indicates good consistency.
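The calculation is easily reproduced. A minimal sketch in Python (the function name is ours; the counts are those of the freckles table above):

    def kappa_2x2(yes_yes, yes_no, no_yes, no_no):
        # Cohen's kappa for agreement between two yes/no surveys.
        total = yes_yes + yes_no + no_yes + no_no
        observed = (yes_yes + no_no) / total
        p_yes_1 = (yes_yes + yes_no) / total   # 'yes' proportion, first survey
        p_yes_2 = (yes_yes + no_yes) / total   # 'yes' proportion, second survey
        expected = p_yes_1 * p_yes_2 + (1 - p_yes_1) * (1 - p_yes_2)
        return (observed - expected) / (1 - expected)

    print(round(kappa_2x2(255, 25, 28, 338), 2))   # 0.83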

Interpretation of kappa

Example

The kappa calculation assumes that the true results are the same for the two surveys. We need to consider if the true result might vary over time or because of other changes. A major limitation is that the interpretation of kappa is subjective. In general, kappa values above 0.8 are regarded as showing very good agreement, 0.6–0.8 good agreement, 0.4–0.6 moderate agreement, and values under 0.4 fair to poor agreement [ 27 ]. The consistency of some common clinical assessments has been shown to be quite low. The kappa value varies with the prevalence of the condition, being difficult to interpret where the prevalence is very low or very high [ 28 ]. If the prevalence is very low, kappa may be low even if the consistency is high; an example is shown later (see Table 6).

Table 6 Assessment of accuracy of data by comparison with a fully accurate method: the validity of antenatal screening for neural tube defects by a measurement of α-fetoprotein in maternal serum at 16–22 weeks' gestation, compared with the presence of a neural tube defect assessed after delivery. The test is only the first step in the screening process; the ultimate result was that terminations were carried out on 16 affected and two unaffected pregnancies. For open neural tube defects, the sensitivity was 17/18 (94 per cent) [ 33 ].

                              True result
                     Affected   Unaffected   Total
Screening result
  Abnormal              17         245        262
  Normal                 5        6176       6181
Total                   22        6421       6443

Sensitivity = proportion of affected subjects giving a positive test = 17/22 = 77%
Specificity = proportion of unaffected subjects giving a negative test = 6176/6421 = 96%
Predictive value positive = proportion of subjects with positive tests who are affected = 17/262 = 6%

Source: data from Wald N et al., 'Antenatal screening in Oxford for fetal neural tube defects', British Journal of Obstetrics and Gynaecology, Volume 86, Issue 2, pp. 91–100, © 1979.

Consistency results can be useful to show the best of several methods of assessment. For example, to assess routine skin screening in Australia, subjects were asked if they had had an examination of their skin by a doctor in the last 3 years and, in another question, in the last 12 months. The responses were checked against the doctors' records. The kappa value comparing the questionnaire to the medical records showed very good agreement for the 3-yearly question, but much weaker agreement for the question on 12 months. This difference was attributed to telescoping; that is, patients tend to remember events as being more recent than they were: many subjects reported an examination within the last 12 months when in fact it had been more than 12 months in the past. This comparison showed that an analysis based on reported 3-yearly screening would be better for further studies [ 29 ].
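These three measures follow directly from the 2 × 2 table; a short sketch in Python (the function name is ours; the counts are those of the screening table above):

    def screening_measures(true_pos, false_pos, false_neg, true_neg):
        # Sensitivity, specificity, and positive predictive value
        # from a test-versus-truth 2 x 2 table.
        sensitivity = true_pos / (true_pos + false_neg)
        specificity = true_neg / (true_neg + false_pos)
        ppv = true_pos / (true_pos + false_pos)
        return sensitivity, specificity, ppv

    sens, spec, ppv = screening_measures(17, 245, 5, 6176)
    print(f"sensitivity {sens:.0%}, specificity {spec:.0%}, PPV {ppv:.1%}")
    # sensitivity 77%, specificity 96%, PPV 6.5%

Note how a test with high sensitivity and specificity can still have a low positive predictive value when the condition is rare: only 17 of the 262 positive screens were truly affected.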

Adjustment for non-differential error in continuous exposure measures

The study shown in Table 6 was related to a case–control study showing a modestly raised odds ratio between having had freckles and developing melanoma. Using the kappa value of 0.83, an estimate of the true odds ratio is given by (observed odds ratio + 0.83 − 1)/0.83, a somewhat stronger association than that observed.

Information on the reproducibility or validity of an exposure measurement is often expressed as the correlation between measurements over the range of the variable. For a continuous measure, such as caloric intake, number of cigarettes smoked per day, or level of blood pressure, the extent of non-differential misclassification can be expressed as the validity coefficient, v, the correlation between the observed measure of the exposure and its true value [ 31 ]. The relationship between the observed odds ratio OR_O related to a unit change in the measured variable and the true odds ratio OR_T is given by the square of this validity coefficient:

OR_O = OR_T^(v²)

or

OR_O = exp(v² ln OR_T)

and

OR_T = exp(ln OR_O / v²)

Suppose a study shows a given odds ratio with a unit increase in obesity, as measured in a field survey. If the correlation between that measurement of obesity and the true value is known, assessed by comparing the field measurement to an ideal measurement on an adequate sample of subjects, the true odds ratio is exp(ln OR_O / v²), which will be further from the null than the observed value.

Often the validity coefficient will be unknown, as the true value of the quantity may be difficult or impossible to measure (e.g. exposures in the past). Often all that is available is information from repeated measurements. The correlation between two measures of an exposure is the reliability coefficient, r. Under ideal conditions this is equal to the square of the validity coefficient, so r can be substituted for v² in the equations given earlier. However, in practice such a measure should be regarded as the upper limit, or most optimistic, estimate of validity. For example, in a study assessing a relationship of adult disease to alcohol consumption in the teenage years, there is no method of measuring the true value of this variable; but repeated measures using the same questionnaire will give a reliability coefficient, and the revised estimate of the true odds ratio is then exp(ln OR_O / r).
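As a worked illustration (the input figures here are hypothetical, since the garbled study-specific values could not be recovered; the formulas are those given above, with the kappa-based version taken from the freckles discussion):

    import math

    def true_or_from_validity(or_observed, v):
        # Adjust an observed odds ratio (per unit of a continuous exposure)
        # for non-differential error: OR_T = exp(ln OR_O / v**2),
        # where v is the validity coefficient.
        return math.exp(math.log(or_observed) / v**2)

    def true_or_from_kappa(or_observed, kappa):
        # Analogous adjustment for a binary exposure using kappa:
        # OR_T = (OR_O + kappa - 1) / kappa.
        return (or_observed + kappa - 1) / kappa

    print(round(true_or_from_validity(1.5, 0.7), 2))   # 2.29: attenuation undone
    print(round(true_or_from_kappa(1.4, 0.83), 2))     # 1.48

Substituting a reliability coefficient r for v² gives the most optimistic version of the correction, as noted above.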

Further effects of non-differential misclassification

Effects if there are more than two categories

If the exposure is in more than two categories, the effects of non-differential misclassification may be more complex. If there are categories of unexposed, moderately exposed, and highly exposed, and the true situation is that the risk increases across that gradient, the effect of misclassification will be to reduce the observed association in the highly exposed group (as it will contain more individuals who are not actually highly exposed), but it could either increase or decrease the observed risk in the moderately exposed category. The trend in risk over the ordered exposure categories will then be affected, and the estimated trend could be either decreased (towards the null) or increased.

Associated errors

A further situation in which non-differential error can produce bias away from the null is where the errors in the ascertainment of the exposure and of the outcome are not independent. For example, in a survey using a few interviewers and subjective data, interviewer differences in assessing exposure and outcome may produce related effects. Such situations need special caution, but the effects will be specific to the situation.

Misclassification of confounders

The ability to adjust for the effects of confounding (see Chapter 7) will depend on the accuracy of measuring the confounder; non-differential misclassification of a confounding variable will reduce the degree to which the confounding can be controlled.

Effects on the numbers of subjects needed in a study

One situation in which the calculation of observed values from assumed true values is useful is in the estimation of the size of a projected study. As will be discussed in Chapter 8, this estimation depends on the odds ratio that is assumed to apply. As the observed odds ratio will be closer to the null than the true ratio because of misclassification, it is prudent to take this into account by using the projected observed odds ratio when calculating sample sizes.

Assessment of the accuracy of information: sensitivity, specificity, and predictive value

Now we will go beyond just considering consistency. Consider the situation where we have a method of assessment that can be regarded as definitive, often termed a 'gold standard'; the accuracy of any other method can then be assessed against it. While this is relevant to the assessment of bias, it is the central issue in the assessment of the accuracy of diagnostic tests and screening tests. The results of a screening or diagnostic test can be compared against the final diagnosis achieved after full investigation. A measure of overall consistency is not so useful here, as the consequences of a positive and of a negative result will be very different. The terminology used here is easiest to describe in terms of screening and diagnosis.

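Where the sensitivity and specificity of an exposure measure are known from such a gold-standard comparison, they can also be used to back-calculate true exposure prevalences, and hence a corrected odds ratio, from observed data. A minimal sketch in Python (function names and all figures are hypothetical, chosen to mirror the earlier non-differential example):

    def correct_prevalence(p_observed, sensitivity, specificity):
        # Back-calculate the true exposure prevalence from the observed one,
        # inverting p_obs = p*sens + (1 - p)*(1 - spec).
        return (p_observed + specificity - 1) / (sensitivity + specificity - 1)

    def odds_ratio(p1, p0):
        return (p1 / (1 - p1)) / (p0 / (1 - p0))

    # Observed exposure proportions in cases and controls, measured with
    # an instrument of sensitivity 0.80 and specificity 0.90:
    p_cases = correct_prevalence(0.31, 0.80, 0.90)      # -> 0.30
    p_controls = correct_prevalence(0.205, 0.80, 0.90)  # -> 0.15
    print(round(odds_ratio(0.31, 0.205), 2))            # observed OR, about 1.74
    print(round(odds_ratio(p_cases, p_controls), 2))    # corrected OR, about 2.43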
