Taking Critical Appraisal to Extremes

Article Type

Changed

Mon, 01/14/2019 - 11:13

Display Headline

Author(s)

In January 2000 an article in The Lancet drew attention when it questioned the supporting evidence for screening mammography.¹ Danish investigators Peter Gøtzsche and Ole Olsen presented a series of apparent flaws in the 8 randomized trials of mammography, ultimately concluding that screening is unjustified. Their cogent arguments and the press coverage they received left many physicians wondering whether they should continue to order mammograms. The story led the CBS Evening News² and was featured in the Washington Post,³ Time,⁴ and Reuters.⁵

A Patient-Oriented Evidence that Matters (POEM) review in the April 2000 issue of The Journal of Family Practice⁶ that addressed The Lancet study lent apparent support to these concerns. The POEM related the arguments in The Lancet article without challenging them and concluded that “mammography screening has never been shown to help women to live longer.” The authors of this POEM suggested that the only reasons for screening to continue are “politics, patients’ preconceptions, and the fear of litigation.” Unlike most POEMs, this one included no critical appraisal of the methods or assumptions of the reviewed study. This lack of comment, combined with the authors’ negative remarks about mammography, may have convinced family physicians that the criticisms of Gøtzsche and Olsen were beyond dispute.

However, controversy does surround their arguments, as the many letters to the editor published in The Lancet attest.⁷ For example, The Lancet critique made much of inconsistent sample sizes and baseline dissimilarities between screened and unscreened women. The authors asserted that such age and socioeconomic differences were “incompatible with adequate randomization.” That premise is contestable. It is normal and predictable that a proportion of population variables will differ between groups for statistical reasons, no matter how perfect the randomization. Also, the observed age difference (1 to 6 months) would not explain the 21% reduction in mortality observed in the trials.⁸

For Gøtzsche and Olsen the discrepant age patterns and sample sizes were less a cause of the results than a warning sign that randomization had been subverted (because of failure to conceal allocation). Since mortality in the screened and unscreened groups differed by only a relatively small number of deaths, they reasoned that very little bias would be necessary to tip the scales in favor of mammography.

Several arguments weaken their case, however. First, they offered no evidence that subversion or unconcealed allocation actually occurred. They equated inexplicit documentation of procedures (and dissimilar group characteristics) with improper randomization. Second, even if unconcealed allocation occurred, it does not in itself thwart randomization. Investigators who know to which group a patient will be assigned can still follow the rules and make the correct assignment. Anecdotal reports of subversion (by deciphering assignment sequences to divert or target patients for allocation) do not offer denominator data to assess how often this occurs.⁹ It would have had to occur in every trial that favored mammography to uphold the authors’ allegations. Third, even if the trials were subverted there is no indication that case mix differed enough to skew outcomes. Age differences were minor; the authors speculated that sizable imbalances in unmeasured factors could have altered results, but they gave no evidence. They cited reports that poorly concealed allocation is associated with a 37% to 41% exaggeration in odds ratios,^10,11 but these reports concerned other trials and made arguable assumptions. Finally, their confirmatory finding—that only the 6 “flawed” trials reported a benefit for mammography and that the 2 acceptable trials showed no effect—was based on recalculated relative risk rates. The original trial data show no such pattern.⁸

This is not to suggest that weaknesses in the mammography trials do not merit scrutiny. Others have also voiced criticisms.¹² But the alarm raised by Gøtzsche and Olsen goes further, compelling us to rethink the purpose of critical appraisal and the extremes at which it might cause more harm than good.

Excessive critical appraisal

We seek perfection in evidence to safeguard patients. Prematurely adopting (or abandoning) interventions through uncritical acceptance of findings risks overlooking potential harms or more effective alternatives. But critical appraisal can do harm if valid evidence is rejected. Deciding whether to accept evidence counterbalances the risks of acceptance against the risks of rejection, which are inversely related. At one extreme of the spectrum, where data are accepted on face value (no appraisal), the risk of a type I error (accepting evidence of efficacy when the intervention does not work or causes harm) is high, and that of a type II error (discarding evidence when the intervention actually works) is low. At the other extreme (excessive scrutiny) the risk of a type II error is great; such errors harm patients because knowledge is rejected that can save (or improve) lives. Obviously, patients are best served somewhere in the middle, striking an optimal balance between the risks of type I and type II errors.

Enthusiasts for critical appraisal sometimes forget this, assuming that more is better. Gøtzsche and Olsen, fearing that the medical community had committed a type I error (promoting an ineffective or harmful screening test), set a high standard for judging validity (eg, dismissing trials with any baseline differences). But when data from 8 trials demonstrate a large effect size (21% reduction in mortality) with narrow confidence intervals (13%-29%),⁸ the risk of a type II error overwhelms that of a type I error. The implications of the error are stark for a disease that claims 40,000 lives per year.¹³ Rejecting evidence under these conditions is more likely to cause death than accepting it.

The pivotal question in rejecting evidence should not be whether there is a design flaw but something more precise: the probability that the observed outcomes are due to factors other than the intervention under consideration (eg, chance, confounding). This can be just as likely in the absence of design flaws (eg, a perfectly conducted uncontrolled case series) as when a study is performed poorly, and it can be low even when studies are imperfect. It is a mistake to reject evidence reflexively because of a design flaw without studying these probabilities.

For example, what worried Gøtzsche and Olsen about flawed randomization was that factors other than mammography might account for lower mortality. But how likely was that (compared with the probability that early detection was efficacious)? Suppose a trial with imbalanced randomization carries a 50% probability of producing spurious results due to chance or confounding. The probability that the same phenomenon would occur in 8 trials (conducted independently in 4 countries in different decades, using different technologies, views, and screening intervals¹⁴) would be in the neighborhood of (0.50)⁸ or 0.39%. The probability would be higher if one believed that subversion is systematic among researchers, but without data such speculation is an exercise in cynicism rather than science.

Suppose that the criticisms of Gøtzsche and Olsen justify the termination of mammography. Are we applying the same standard for other screening and clinical practices, or do we move the goal posts? Physicians screen for prostate cancer without one well-designed controlled trial showing that it lowers mortality.¹⁵ We say this is not evidence-based,¹⁶ but what kind of study would change our minds? The conventional answer is a randomized controlled trial showing a reduction in prostate cancer mortality is required,¹⁷ but The Lancet article finds that even 8 such trials are unconvincing. Indeed, flawless trials lose influence if the end points or setting lack generalizability.¹⁸ If the consequence of such high standards is that 30 years of trials involving a total of 482,000 women (and untold cost) cannot establish efficacy, the prospects for making the rest of medicine evidence-based are slim indeed.

The search for perfect data also has epistemologic flaws. No study can provide the absolute certainty that extremism in critical appraisal seeks. The willingness to reject studies based on improbable theories of confounding or research misbehavior may have less to do with good science than with discomfort with uncertainty, that is, unease with any possibility that the inferences of the investigators are wrong. The wait for better evidence is in vain, however, because science can only guess about reality. Even in a flawless trial a P value of .05 means that claims of significance will be wrong 5% of the time. Good studies have better odds in predicting reality, but they do not define it. It is legitimate to reject poorly designed studies because the probability of being wrong is too high. But to reject studies because there is any probability of being wrong is to wait futilely for a class of evidence that does not exist.

Inadequate critical appraisal

The POEM about The Lancet article highlights the other extreme of critical appraisal, accepting studies at face value. The review mentioned none of the limitations in The Lancet analysis, thus giving readers little reason to doubt the conclusion that mammography lacks scientific support and potentially convincing them to stop screening. Physicians should decide whether this is the right choice only after having heard all the issues. That this study reported a null effect and used meta-analysis does not lesson the need for critical appraisal. Like removing a counterweight from a scale, the omission of critical appraisal unduly elevates study findings (positive or negative), thus fomenting overreaction by not putting the information in context.

Several new resources, POEMs among them, have become available to alert physicians to important evidence. Some features (eg, the “Abstracts” section in the Journal of the American Medical Association) simply reprint abstracts. Others associated with the evidence-based medicine (EBM) movement offer critical appraisals. In family practice, these include POEMs and Evidence-Based Practice.¹⁹ In other specialties, they include the American College of Physicians’ ACP Journal Club and the EBM journals from the BMJ Publishing Group (eg, Evidence-Based Medicine, Evidence-Based Nursing). These efforts try to approach critical appraisal systematically. ACP Journal Club and the BMJ journals apply a quality filter (excluding studies failing certain criteria²⁰) and append a commentary that mentions design limitations. POEMs go further, devoting a section to study design and validity and giving the authors explicit criteria for assessing quality.²¹

But a closer look reveals inconsistencies in how these criteria are applied. The heterogeneity is apparent in any random set of POEMs: Some authors list strengths and weaknesses by name with no elaboration, some give more details, some address only one criterion (eg, allocation concealment), while others state simply that the study was well designed. Some, like The Lancet review, say nothing about quality, reporting only the design and results. Eight (17%) of the 48 POEMs published in the first 6 months of 2000 included no critical appraisal (or a vague remark).

Similar omissions plague ACP Journal Club and the BMJ journals. Although the reviews describe concealment of allocation and blinding, and the commentary sections sometimes address design flaws at length, the degree to which this occurs, if at all, is variable. An ACP Journal Club review²² of the United Kingdom Prospective Diabetes Study, the landmark trial of intensive glycemic control, mentioned no concerns about its external validity (for contrast see the report by the American Academy of Family Physicians and American Diabetes Association²³). Calling such synopses critical appraisals obfuscates the meaning of the term.

Some say that even brief remarks are critical appraisals. But good appraisals consistently and objectively rate studies using uniform criteria.²⁴ POEMs and the EBM journals do not currently meet this standard; more systematic procedures are needed to ensure that narratives routinely discuss core elements of internal and external validity and that manuscripts lacking these elements are returned. The current conditions for preparing reviews make this difficult, however. With their modest budgets³ journals rely on hundreds of volunteer contributors, each with a different writing style and level of expertise. Onerous procedures for critical appraisal might discourage participation. Because journal space is limited, narratives are kept short (700 words for POEMs,²⁵ 425 words for the BMJ journals and ACP Journal Club²⁰) to review more studies. Each issue of JFP contains 8 POEMs, and Evidence-Based Medicine has 24 reviews. Longer appraisals would reduce the number and currency of reviews and would be less concise for busy physicians.

The disadvantage of short reviews and rapid turnover is a greater risk of inaccuracies and imbalance. Careful analysis of a study requires more time to do research and more space to explain the results than journals currently provide. Authors have only weeks to prepare manuscripts, which leaves little time to verify that descriptions are accurate and give proportionate emphasis to the issues that matter most. For many studies a few hundred words provide inadequate room to fully explain the design, results, and limitations. The risk of mistakes is heightened with less time, analysis, expert review, and space.

This tradeoff between quantity and quality begs the question of what is more important to readers and to patient care: the number of studies that physicians know about or the accuracy with which they are described. POEMs, which are designed to change practice,²⁶ can do harm if physicians acting on inaccurate or incomplete information make choices that compromise outcomes. The Lancet review is a shot across the bow. The 40,000 annual deaths from breast cancer¹³ remind us that for certain topics a mistaken inference can cost thousands of lives. If inconsistencies in appraisal make this happen often enough, efforts to synopsize evidence can do more harm than good. Also, it is incongruent for programs espousing EBM, discourages a discipline that accepts evidence on face value, to report studies with little or no discussion of validity.

Setting policy in critical appraisals

It is also antithetical for EBM to support evidence-based practice guidelines²⁷ and to publish clinical advice that is not derived from these methods. Critical appraisals that conclude by suggesting how physicians should modify patient care cross the line from science to policy. In the first half of 2000 73% of the POEMs advised physicians (with varied explicitness) to use tests (6), drugs (14), or other treatments (5) and to withhold others (10). Such advice is common fare in the medical literature, but EBM ascribes to a higher standard. Because imprudent practice policies can do harm or compromise effectiveness, EBM holds that guidelines should be drafted with care using evidence-based methods.²⁸ This entails reviewing not one study but all relevant evidence, with systematic grading of studies and explicit linkage between the recommendations and the quality of the data.^27-30 This process typically involves an expert panel and months or years of deliberations.

In contrast, practice recommendations in EBM journals and POEMs reflect what individual authors think of a study. They lack the time, funding, and journal space for a systematic literature review. Thus, the authors and their readers cannot be sure that the conclusions reflect the evidence as a whole without undue influence from the reviewed study. Rules of evidence and grades for recommendations are rarely provided. Unlike guideline panels, authors seldom vet their recommendations with experts, societies, and agencies, which often uncover flawed inferences. The thinking process behind recommendations is necessarily telescoped. While the United States Preventive Services Task Force spent 2 years deciding whether pregnant women should be screened for bacterial vaginosis, a POEM³¹ produced its recommendations within weeks.

To some such pronouncements are not guidelines, only the bottom line of a review. But in policy terms it matters little whether physicians prescribe a drug because of a guideline or because of the advice they read in ACP Journal Club or a POEM. The outcome for the patient is the same. Reviews need a bottom line, but summarizing the results of a study (eg, drug A worked better than drug B) differs from advising physicians what to do (eg, prescribe drug A). The latter is a statement of policy rather than science and should be based on broader considerations than one study.²⁸

EBM faults guidelines that omit evidence-based methods, such as those issued by advocacy groups that reflect personal opinions and selective use of studies more than systematic reviews.^32,33 Yet the recommendations in EBM journals and POEMs differ little in appearance: They provide little documentation of how conclusions were reached, feature select evidence (the study under review), rely on authors’ opinions, and provide few details on rationale. EBM journals should extract themselves from this inconsistency by sharpening the distinction between summarizing evidence and setting policy and eschewing the latter unless it emanates from evidence-based methods.

How this is handled in POEMs will reflect on family medicine. The prominence the specialty has given POEMs (promotion in JFP, family practice literature,^34-35 the Internet,^36-37 and newsletters¹⁹) signals the way family physicians think studies should be reviewed. It is important to get this right. If POEMs are meant to be critical appraisals and 17% contain no critique, calling them critical appraisals casts doubts on the specialty’s understanding of the term and perpetuates confusion about definitions. Conversely, by instituting greater scrutiny—defining the core criteria that must be discussed to qualify a study commentary as a critical appraisal and systematizing their use in POEMs—the specialty would set a new standard for EBM. If POEMs are not meant to be critical appraisals, it is important to clarify the distinction in terms, especially for family physicians who have grown accustomed to POEMs, know little about alternatives, and have come to believe that POEMs, critical appraisals, and EBM are essentially the same.

Conclusions

Advocates of EBM should be systematic in their application of critical appraisal. Critical appraisals do not deserve the name if they accept studies on face value. The criteria for determining which studies are rated good or bad should be explicit and consistent. But the scrutiny of evidence should not be taken to extremes, to the point that studies are rejected for being imperfect when there is little likelihood that the findings are wrong. By making the perfect the enemy of the good, excesses in critical appraisal do injustice to the goal of helping patients and imply existence of a level of certainty that science cannot provide.

References

1. Gøtzsche PC, Olsen O. Is screening for breast cancer with mammography justifiable? Lancet 2000;355:129-33.

2. Health watch: mammography controversy. CBS Evening News January 7, 2000. Vanderbilt University Television News Archive, available at tvnews.vanderbilt.edu.

3. Mammography assessed. Washington Post, January 7, 2000, A14.

4. Reaves J. Here’s why your oncologist is angry. Time January 13, 2000. Available at www.time.com/time/daily/0,2960,37449,00.html.

5. Mammography screening deemed unjustifiable. Reuters Medical News January 7, 2000. Available at www.medscape.com/reuters/prof/test/2000/o1/01.07/pbo1070c.html.

6. Wilkerson BF, Schooff M. Screening mammography may not be effective at any age. J Fam Pract 2000;49:302-371.

7. Screening mammography re-evaluated. Lancet 2000;355:747-52.

8. Kerlikowske K, Grady D, Rubin SM, Sandrock C, Ernster VL. Efficacy of screening mammography: a meta-analysis. JAMA 1995;273:149-54.

9. Schulz KF. Subverting randomization in controlled trials. JAMA 1995;274:1456-58.

10. Moher D, Pham B, Jones A, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses?. Lancet 1998;352:609-13.

11. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12.

12. Berry DA. Benefits and risks of screening mammography for women in their forties: a statistical appraisal. J Natl Cancer Inst 1998;90:1431-39.

13. American Cancer Society. Cancers facts & figures 2000. Atlanta, Ga: American Cancer Society; 2000.

14. Fletcher SW, Black W, Harris R, Rimer BK, Shapiro S. Report of the International Workshop on Screening for Breast Cancer. J Natl Cancer Inst 1993;85:1644-56.

15. Collins MM, Stafford RS, Barry MJ. Age-specific patterns of prostate-specific antigen testing among primary care physician visits. J Fam Pract 2000;49:169-72.

16. Lefevre ML. Prostate cancer screening: more harm than good? Am Fam Phys 1998;58:432-38..

17. Woolf SH, Rothemich SF. Screening for prostate cancer: the role of science, policy, and opinion in determining what is best for patients. Ann Rev Med 1999;50:207-21.

18. Bucher HC, Guyatt GH, Cook DJ, Holbrook A, McAlister FA. Users’ guides to the medical literature: XIX. Applying clinical trial results. A. How to use an article measuring the effect of an intervention on surrogate end points. Evidence-Based Medicine Working Group. JAMA 1999;282:771-78.

19. Advertiement Evidence-Based Practice Montvale, NJ: Quadrant HealthCom Inc.; 2000.

20. Purpose and procedure. Evidence-Based Medicine. Available at www.bmjpg.com/data/ebmpp.htm.

21. Assessing validity and relevance. Available at www.infopoems.com/EBP_Validity.htm.

22. Gerstein HC. Commentary on “Intensive blood glucose control reduced type 2 diabetes mellitus-related end points.” ACP J Club 1999; 2-3. Comment on: UK Prospective Diabetes Study Group. Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications. Lancet 1998;352:837-53.

23. Woolf SH, Davidson MB, Greenfield S, et al. Controlling blood glucose levels in patients with type 2 diabetes mellitus: an evidence-based policy statement by the American Academy of Family Physicians and American Diabetes Association. J Fam Pract 2000;49:453-60.

24. Cook DJ, Mulrow CD, Haynes RB. Systematic reviews: synthesis of best evidence for clinical decisions. Ann Intern Med 1997;126:376-80.

25. Instructions for POEMs authors. Available at www.infopoems.com/authors.htm.

26. Slawson DC, Shaughnessy AF. Becoming an information master: using POEMs to change practice with confidence. J Fam Pract 2000;49:63-67.

27. Hayward RS, Wilson MC, Tunis SR, Bass EB, Guyatt G. Users’ guides to the medical literature. VII. How to use clinical practice guidelines. A. Are the recommendations valid? The Evidence-Based Medicine Working Group. JAMA 1995;274:570-74.

28. Woolf SH, George JN. Evidence-based medicine: interpreting studies and setting policy. Hem Oncol Clin N Amer 2000;14:761-84.

29. Shekelle PG, Woolf SH, Eccles M, Grimshaw J. Developing guidelines. BMJ 1999;318:593-96.

30. Woolf SH. Practice guidelines: what the family physician should know. Am Fam Phys 1995;51:1455-63.

31. Lazar PA. Does oral metronidazole prevent preterm delivery in normal-risk pregnant women with asymptomatic bacterial vaginosis (BV)? J Fam Pract 2000;49:495-96.

32. Cook D, Giacomini M. The trials and tribulations of clinical practice guidelines. JAMA 1999;281:1950-51.

33. Grilli R. Practice guidelines developed by specialty societies: the need for a critical appraisal. Lancet 2000;355:103-06.

34. Geyman JP. POEMs as a paradigm shift in teaching, learning, and clinical practice. J Fam Pract 1999;48:343-44.

35. Dickinson WP, Stange KC, Ebell MH, Ewigman BG, Green LA. Involving all family physicians and family medicine faculty members in the use and generation of new knowledge. Fam Med 2000;32:480-90.

38. JFP online. Available at www.jfponline.com.

39. POEMs for primary care. Available at www.infopoems.com.

Author and Disclosure Information

Steven H. Woolf, MD, MPH
Fairfax, Virginia

All correspondence should be addressed to Steven H. Woolf, MD, MPH, Department of Family Practice, Virginia Commonwealth University, 3712 Charles Stewart Drive, Fairfax, VA 22033. E-mail: shwoolf@aol.com.

Issue

The Journal of Family Practice - 49(12)

Publications

The Journal of Family Practice

MDedge Family Medicine

Page Number

1081-1085

Sections

Commentary

Author(s)

Steven H. Woolf, MD, MPH