Evaluating articles about treatment in the medical literature



By Dimitri A. Christakis, MD, MPH

Does the latest study of a therapy warrant changing the care you provide just because it was written by experts and published in an authoritative journal? You can be the judge—if you know what questions to ask about the research and how to find the answers.

Remaining current on developments in the science and practice of medicine is an explicit component of a physician's professionalism; in fact, it's a mandate delivered in the form of continuing medical education (CME) requirements. But educational updates that employ traditional CME formats may not be sufficient to significantly improve health outcomes among a provider's patients, research shows.1 And although many providers believe that they do apply the latest evidence in the care of their patients, ample data suggest that they do not: Many years may pass before an effective therapy is widely accepted and used,2 and ineffective therapies continue to be used despite convincing evidence that they should not.3

The movement known as "evidence-based medicine," or EBM, began in the early 1990s in part as a step toward maximizing health outcomes at the level of the provider.4 The meaning of the term has evolved; today, it is defined as "the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients."5 (For more definitions of terms used in this article, see the glossary.) One way for a physician to practice EBM is to survey the medical literature on a given topic of concern, read the pertinent studies found there, and apply the findings to the care of patients. This article describes a comprehensive way of evaluating any single published study that addresses a treatment for a disease or condition, roughly following the approach laid out in the "User's guide to the medical literature" that has been published in the Journal of the American Medical Association.6,7

A word of caution before beginning: When perusing or reading the medical literature, always keep in mind that a single study may present only one piece of a complex clinical puzzle, and that different studies of the same therapy or intervention may reach conflicting conclusions. That is one benefit of using a so-called systematic review of a clinical topic (see "All about reviews, summaries, meta-analyses, and guidelines") as your source of information: Such a review should, if properly undertaken, be comprehensive and make note of study-to-study discrepancies.

Give a study "the third degree"

There are a number of questions—and questions within questions—for you to pose when you evaluate the merits of a study and determine whether to apply the investigators' findings to your patients.

Are results valid?

Was assignment of subjects randomized?

A trial in which the assignment of patients (I'll generally call them "subjects") was not randomized has been shown to be more likely to yield a positive result, and has therefore been deemed less reliable than a randomized trial. But you should not be satisfied just by the authors' statement in the "Methods" section of a research article that subjects were randomized. Close scrutiny of studies that reported having randomized subjects has revealed that, in some, the investigators believed that assignment was random (reasoning that, because they did not consciously assign subjects to any particular group, assignment must have been random) when in fact it was not.

The "Methods" section of an article should state how randomization was achieved (by computer, coin flip, etc.). Ideally, a randomized, controlled trial will comprise two (or more) groups that are comparable in size and other characteristics. If the number of subjects randomized is small, however, the sizes of the groups can fall out of balance.

To maintain balance in group size, researchers often employ so-called block randomization. Here, blocks of random assignment (theoretically of any size, but typically comprising four to eight subjects) are created. Within each block, an equal number of assignments are made to each of the two groups (treatment and control) in entirely random sequencing. In a block-of-four randomization, for example, two treatment and two control assignments are made. As patients are enrolled in the trial, they are assigned to the next group in the sequence. This ensures that the two groups can never be more than two patients out of balance. Block randomization has two advantages: It ensures that the groups are balanced in size, which is particularly important for smaller studies, and it allows the investigators to conduct interim analysis of results. Groups of comparable size maximize statistical power and allow researchers to halt the larger trial if clear benefit or harm is shown to be associated with treatment.
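The block-of-four scheme described above can be sketched in a few lines of Python. This is a hypothetical illustration (the function name and block size are invented for the example), not code from any actual trial:

```python
import random

def block_randomize(n_subjects, block_size=4):
    """Assign subjects to 'treatment' or 'control' in random blocks.

    Within each block, exactly half the assignments go to each arm, so
    the two arms can never differ by more than block_size / 2 subjects
    at any point during enrollment.
    """
    assignments = []
    while len(assignments) < n_subjects:
        # Build one balanced block, then shuffle its internal order.
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        random.shuffle(block)
        assignments.extend(block)
    return assignments[:n_subjects]

# Example: 10 subjects randomized in blocks of four
arms = block_randomize(10)
```

After any completed block the arms are exactly equal in size; mid-block, the imbalance is at most two subjects, which is what makes interim analyses feasible.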

Is follow-up complete?

Losing patients to follow-up can be expected, particularly if the follow-up period is long. But such losses can be problematic, especially if data are missing on a large percentage of the subject population. Losses raise important questions: Did subjects who dropped out fare better, or worse, than those who remained? If a substantial number of control patients left the study, was it because the "treatment" they were given didn't work? If patients in the intervention group left, could the intervention have unacceptable side effects?

In general, losing more than 10% to 15% of patients to follow-up brings into question the validity of findings. A conservative estimate of a treatment's effect can be made by making the assumption that all lost intervention subjects did not get better and that all lost control subjects did.
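The conservative estimate described above amounts to simple arithmetic. Here is a sketch with entirely hypothetical trial numbers (the function name and all counts are invented for illustration):

```python
def conservative_success_rates(n_randomized_tx, improved_tx, lost_tx,
                               n_randomized_ctrl, improved_ctrl, lost_ctrl):
    """Worst-case bounds on a treatment effect when subjects are lost.

    Assumes every lost intervention subject failed to improve and every
    lost control subject improved -- the conservative scenario the text
    describes.
    """
    tx_rate = improved_tx / n_randomized_tx                       # lost tx subjects count as failures
    ctrl_rate = (improved_ctrl + lost_ctrl) / n_randomized_ctrl   # lost controls count as successes
    return tx_rate, ctrl_rate

# Hypothetical trial, 100 subjects per arm: 70 of 90 followed treatment
# subjects improved (10 lost); 50 of 85 followed controls improved (15 lost).
tx, ctrl = conservative_success_rates(100, 70, 10, 100, 50, 15)
# Observed rates are about 78% vs. 59%; under the worst-case assumption
# the gap shrinks to 70% vs. 65%, a far less impressive difference.
```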

Were subjects analyzed in the group to which they were randomized?

Problems can arise when investigators fail to include in their analysis all subjects who were randomized in the study. Consider the following hypothetical example:

A researcher is interested in determining whether providing free membership in a health club to adolescents can reduce their risk of obesity. She identifies 200 at-risk teens and randomly gives a free membership in a health club to 100 of them and nothing to the other 100. When it is time to analyze the data, she finds that only 50 participants in the intervention arm ever went to the club. She decides, therefore, to exclude the 50 who never went to the club, and to include in her analysis only those who went to the club and to compare them to the 100 subjects in the control arm.

The researcher's decision about how to perform the analysis of the data may appear reasonable, but it is seriously flawed. The adolescents who actually went to the health club are, very likely, different from those who did not. It is possible, even likely, that they are more motivated, responsible, and, probably, even more fit at baseline than those who did not use their membership. All these differences would skew the results in favor of a finding that free membership in a health club works to reduce the risk of obesity among adolescents. In fact, adherent patients do better than nonadherent patients in trials—even when they have been given placebo! The epidemiologic shorthand for ensuring that all patients were analyzed in the group to which they were randomized is called intention to treat. Authors of a published study may report that such an analysis was performed; you can always double-check their assertion by looking at the analytic tables in the article and comparing the stated number of subjects there to the number who were enrolled.
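The bias can be seen with simple arithmetic. The obesity counts below are hypothetical numbers invented to illustrate the health-club example, deliberately chosen so that the membership truly has no effect:

```python
# Hypothetical outcomes: suppose the membership does nothing, but the
# 50 teens who attended were fitter at baseline and so less likely to
# become obese anyway.
obese_attenders = 5        # of the 50 intervention subjects who used the club
obese_nonattenders = 15    # of the 50 who never went
obese_controls = 20        # of the 100 controls

per_protocol_rate = obese_attenders / 50                  # attenders only
itt_rate = (obese_attenders + obese_nonattenders) / 100   # intention to treat
control_rate = obese_controls / 100

# The per-protocol analysis (10% vs. 20%) suggests the membership halves
# obesity risk; the intention-to-treat analysis (20% vs. 20%) correctly
# shows no effect at all.
```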

Were subjects, health-care workers, and study personnel blind to the treatment?

Blinding is an important step that ensures that the findings of a study are accurate. Unblinded studies are more likely to be positive and more likely to show a greater treatment effect.8 This is not surprising; many outcomes can be subjective, and knowing treatment status can affect one's perception of effect or success. Suppose, for example, that a group of researchers are studying pain associated with circumcision. The pain of the procedure is being evaluated by trained observers who watch the infants' reactions. If the evaluators know whether a given infant is being medicated for pain, that knowledge might affect their assessment of the child's crying (after all, even a medicated child typically cries during circumcision).

In general, the softer, or more subjective, the outcome, the more important blinding is. Blinding may matter little, for example, when death is the outcome being measured. The blinding process should be described in the same level of detail as the randomization process so that readers can judge how effective it was.

Were the groups similar at the start of the trial?

Typically, the authors of a research article compare the characteristics of the two arms of the study in one of the article's first tables. As the number of subjects included in the trial grows, the two groups should look more and more like each other. If differences in characteristics do exist between the two groups, those differences can be adjusted for in the analyses; the real problem is not measured differences but unmeasured or even unmeasurable ones. This is the real strength of a randomized trial: Researchers (and, later, readers) have a priori reason to believe that the two groups of subjects are not different in any way.

Consider the example of a researcher who is studying a parental intervention to alleviate colic in infants. He can compare the age, parity, gender, race, socioeconomic status, and other characteristics of parents in the treatment and control groups, but he will find it more difficult to compare how patient, or calm, or excitable they are across groups. How can he overcome this difficulty of characterization? Easily, if the trial has been rigorously designed: He can take comfort that all characteristics that might affect the success or strength of the intervention in a randomized trial are likely to be balanced between the two groups of parents.

Aside from the intervention, were the groups treated and evaluated equally?

Once subjects are randomized, they should be treated identically (except for the treatment being studied) so as not to confound findings. For example:

A researcher is studying the effectiveness of acupuncture for low back pain in adolescents. She randomizes patients (recruited because they suffer low back pain) to be treated by an experienced acupuncturist or to be controls and get sham therapy from a pediatric intern. Because the hospital's institutional review board has determined that it would be unethical not to treat pain, patients in both groups are allowed to take rescue medication (acetaminophen with codeine, as needed). To the researcher's surprise, she finds that patients in the control arm have less pain at follow-up than those in the treatment arm have.

Why might this be the case? It seems highly probable that sham therapy would not work at all, and might even have made the controls feel worse, as interns (toiling in good faith) randomly stuck them with needles at various places on their bodies. So the control subjects may, in fact, have taken more rescue medication, which is known to work, than those in the intervention group. In this study, it may have been better for the researcher to have measured rescue medication usage, not pain, as the outcome, or indicator of success.

What are the results?

Is the analysis in error?

Most articles about clinical trials include some statistical tests to assess whether the associations or differences between the study groups reported in the "Findings" can be explained by chance alone. Authors (and readers) may determine that such differences or associations are not chance (that is, they are significant) or that they are due to chance (they are not significant). Either conclusion may be wrong, however, and either of two types of error can therefore lurk in the findings. Statisticians (who aren't especially imaginative) have named these errors Type I and Type II (Table 1).


Table 1: Snapshot of type I and type II errors

What the study reports                             | A real difference exists | No real difference exists
"A significant difference exists" (positive study) | True positive finding    | Type I error
"No significant difference exists" (negative study)| Type II error            | True negative finding


A type I error is akin to a false-positive result in diagnostic testing: Namely, a reported significant finding is, in fact, the result of chance alone. The P value derived from statistical testing of the findings describes the probability with which such an error may have occurred. The conventional standard is that a type I error should occur no more often than 5% of the time (expressed as P <.05). In other words, we tolerate a 5% chance that a given treatment offers no real benefit even though such a benefit has been reported. A type I error is only relevant in a study in which investigators found a difference between subject groups.
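The 5% convention can be demonstrated by simulation. The sketch below is hypothetical (it uses a textbook normal-approximation z test, not any particular study's method): it generates many trials in which no true difference exists between arms and counts how often a "significant" one appears anyway.

```python
import math
import random

def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p value from a normal-approximation z test."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)              # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs(p1 - p2) / se
    # 2 * P(Z > |z|) via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(0)
trials = 2000
false_positives = 0
for _ in range(trials):
    # Both arms share a true event rate of 30%, so by construction any
    # "significant" difference is a type I error.
    x1 = sum(random.random() < 0.30 for _ in range(200))
    x2 = sum(random.random() < 0.30 for _ in range(200))
    if two_proportion_p(x1, 200, x2, 200) < 0.05:
        false_positives += 1

rate = false_positives / trials  # close to the nominal 5%
```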

A type II error is akin to a false-negative result in diagnostic testing: A reported statistically insignificant finding is, in fact, a real one. The probability of a type II error often goes unreported. Typically, researchers consider the possibility of a type II error when determining the sample size they will need to conduct a study. This consideration is reflected in the so-called power of a study, a measure of how likely the researchers are to detect a difference of a particular magnitude, given a certain number of patients. Type II errors are usually not as big a concern for researchers as type I errors are, so the typical power of studies is 80%. What does this mean? In a negative study of 80% power, for example, there is a 20% chance that, in fact, the differences between groups are real even though they are statistically insignificant (that is, the probability of a type II error is 20%).

Be careful not to conclude that a therapy does not work if a study has low statistical power, particularly if the differences reported appear large or clinically significant. Proving ineffectiveness is very different from failing to prove effectiveness. If a study had 95% power (very rarely the case because that would require an enormous number of subjects), then a negative study could conclude that a therapy does not work with the same degree of certainty that a positive study concludes that the same therapy does work! In most cases, a negative study allows the researchers and readers only to conclude that evidence of benefit is lacking. Type II errors are relevant only for studies that have failed to find a difference between groups.
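Power and sample size are linked by a standard normal-approximation formula. The sketch below, with hypothetical cure rates of 60% and 50%, shows how power climbs as enrollment grows. It is a rough illustration, not a substitute for formal sample-size software:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_two_proportions(p1, p2, n_per_arm, z_alpha=1.96):
    """Approximate power to detect a difference between two proportions.

    Normal approximation with a two-sided alpha of .05 (z = 1.96).
    """
    se = math.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z = abs(p1 - p2) / se
    return phi(z - z_alpha)

# Hypothetical true cure rates of 60% vs. 50%: with 100 subjects per arm
# power is only around 30%; with 400 per arm it reaches roughly the
# conventional 80%.
p100 = power_two_proportions(0.60, 0.50, 100)
p400 = power_two_proportions(0.60, 0.50, 400)
```

A "negative" trial of 100 per arm in this scenario would most likely miss a real 10-point difference, which is exactly why low power forbids concluding that the therapy does not work.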

How large was the treatment effect?

Often, researchers compare two or more groups and report on the differences they observe—differences that can be expressed in relative or absolute terms. In simplest terms, the risk difference (RD) involves subtraction; the relative difference is a ratio. Although both expressions are valid and accurate representations of the difference between two groups, the implications of the way the difference is interpreted (by the researchers or readers) can be striking.

Consider the risk of hospitalization for respiratory syncytial virus infection among premature infants treated with palivizumab. A study found that 10.6% of patients treated with placebo were hospitalized, compared with 4.8% treated with palivizumab.9 This difference can be reported as a 5.8% risk difference (subtracting 4.8% from 10.6%) or as a 55% relative reduction (difference) in the risk of hospitalization.

The latter expression sounds much more dramatic; in fact, it does reflect the reduction in risk that an individual patient might expect. But knowing the risk difference allows you to calculate something else that is meaningful: By determining its inverse (1/RD), you arrive at what is known as the number needed to treat, or NNT—the number of patients that you would need to treat with a given therapy for one patient to benefit.10 In the example I offered, the calculation is 1/5.8%—an NNT of 17.24. In other words, for every 18 patients treated, one avoids hospitalization as a result of palivizumab. Knowing NNT can also be useful for the parents of a patient because it tells them, in effect, how likely it is that their child will benefit from, in this example, palivizumab. Table 2 demonstrates just how divergent RD and the relative reduction in risk can be in a single setting, and the corresponding effect on NNT.
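The palivizumab arithmetic above can be written out directly:

```python
# Hospitalization risks reported in the palivizumab study cited in the text
placebo_risk = 0.106
treated_risk = 0.048

risk_difference = placebo_risk - treated_risk        # 0.058, i.e., a 5.8% RD
relative_reduction = risk_difference / placebo_risk  # ~0.55, i.e., a 55% relative reduction
nnt = 1 / risk_difference                            # ~17.2: treat about 18 infants
                                                     # to prevent one hospitalization
```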


Table 2: How relative and absolute differences compare: efficacy of treatment vs. placebo

Treatment cure rate | Placebo "cure" rate | Relative difference | Absolute difference | Number needed to treat (NNT)


The most common problem in the reporting of differences between study groups is investigators' failure to state whether the measure of difference is relative or absolute. Simply stating that a treatment "reduces the risk of an untoward outcome by 50%" can be misinterpreted as a risk difference when it is, in fact, a relative difference, making it impossible to know NNT with certainty. That, of course, can have implications for your patients. With a little sleuthing, it is often possible to determine the absolute difference by looking for the raw data in the article.

Do the researchers make a distinction between odds ratios and risk ratios?

Chance can be measured in terms of probability or as odds. Physicians think in probabilities—the ratio of an occurrence of an event to all opportunities for that event to occur. Gamblers are well-versed in odds—the ratio of occurrences to non-occurrences. When examining infrequent events, the distinction between the two measures is largely irrelevant. Consider the suits in a deck of cards. The probability of drawing a club is 1/4 (13 clubs among 52 cards total). The odds of drawing a club are 1/3 (13 clubs to 39 non-clubs). The difference between the two measures is relatively large. What about drawing the ace of clubs? The probability is 1/52; the odds, 1/51—now, a trivial difference.
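The card-deck arithmetic can be written out directly:

```python
def odds_from_probability(p):
    """Convert a probability to odds (occurrences : non-occurrences)."""
    return p / (1 - p)

# Drawing any club: probability 13/52 (0.25), odds 13/39 (~0.33).
# The two measures differ noticeably for this common event.
odds_club = odds_from_probability(13 / 52)

# Drawing the ace of clubs: probability 1/52 (~0.0192), odds 1/51 (~0.0196).
# For this rare event the difference is trivial.
odds_ace = odds_from_probability(1 / 52)
```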

It is common to see probability reported in the medical literature (because, after all, that is how clinicians tend to think); reporting of odds is rare. It is common, however, to find odds ratios reported because they are the output of a commonly employed, multivariable analytic technique known as logistic regression. Because odds ratios and risk ratios become indistinguishable for rare events, researchers may be in the habit of reporting odds ratios as a percent increase (or decrease) in a given outcome associated with a specific intervention. Doing so may paint a very inaccurate picture of findings.

Consider a study of a particular intervention to promote immunization in a pediatric practice. The study reveals that 90% of children in the intervention group receive their first measles-mumps-rubella vaccination by 15 months of age compared to 80% of children in the control group (Table 3). In this case, given how common it is for children in the control arm to be vaccinated, the odds ratio is twice the risk ratio! It would be misleading, therefore, to report that the treatment group is more than twice as likely to be immunized; in fact, they are only 12% more likely to be immunized. In general, readers of an article on treatment should worry about treating odds ratios and risk ratios as equivalent when the baseline probabilities exceed roughly 15%.
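The immunization example works out as follows; the rare-outcome comparison at the end uses hypothetical rates of 2% and 1% to show the convergence the text describes:

```python
def risk_ratio(p_treated, p_control):
    """Ratio of probabilities (how clinicians tend to think)."""
    return p_treated / p_control

def odds_ratio(p_treated, p_control):
    """Ratio of odds (the native output of logistic regression)."""
    odds_t = p_treated / (1 - p_treated)
    odds_c = p_control / (1 - p_control)
    return odds_t / odds_c

# The immunization example from the text: 90% vaccinated with the
# intervention vs. 80% without.
rr = risk_ratio(0.90, 0.80)   # 1.125: only about 12% more likely
or_ = odds_ratio(0.90, 0.80)  # 2.25: twice the risk ratio

# For a hypothetical rare outcome (2% vs. 1%), the two measures converge.
rr_rare = risk_ratio(0.02, 0.01)   # 2.0
or_rare = odds_ratio(0.02, 0.01)   # ~2.02
```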


Table 3: Comparing probability, odds, risk ratio, and odds ratio for MMR vaccination

                           | Intervention group | Control group
Probability of vaccination | 0.9                | 0.8
Odds of vaccination        | 9.0 (0.9/0.1)      | 4.0 (0.8/0.2)

Risk ratio: 1.12 (0.9/0.8)
Odds ratio: 2.25 (9.0/4.0)


How precise is the treatment effect?

Statistical precision is measured in terms of the confidence interval. This interval is associated with a percentage (typically, 95%). Simply stated (in a way that may offend statistics purists) the 95% confidence interval describes the range of values in which the reader can be 95% confident that the true value, so to speak, resides. Although the "best" or "most probable" value is the "point estimate" that the researchers report, the data used for their analysis can lead us to be 95% confident that the "real" value lies in this interval. Wide confidence intervals mean less precision. As a reader, ask yourself: Would I be as impressed with the investigators' findings if the truth were at one or the other end of the interval?
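A common (though not the only) way to compute such an interval for a risk difference is the Wald normal approximation. The sketch below uses hypothetical group sizes, not data from any particular trial:

```python
import math

def risk_difference_ci(x1, n1, x2, n2, z=1.96):
    """95% Wald confidence interval for a difference in proportions."""
    p1, p2 = x1 / n1, x2 / n2
    rd = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd - z * se, rd + z * se

# Hypothetical small trial: 20/100 events vs. 10/100 events.
# The point estimate is a 10-point risk difference, but the interval is
# wide (roughly 0 to 20 points), so the finding is imprecise.
lo, hi = risk_difference_ci(20, 100, 10, 100)

# The same rates in a trial ten times larger yield a much narrower interval.
lo_big, hi_big = risk_difference_ci(200, 1000, 100, 1000)
```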

Will the results help me care for my patients?

Studies are performed on select patient populations. Sometimes, criteria used to select patients make them unlike children in your practice. Consider that gum containing xylitol was found effective at preventing otitis media when chewed five times a day.11 In the context of a clinical trial, this frequency of intervention may be feasible; subjects are often selected for inclusion in such trials based on their ability to comply with a medication regimen and, often, a study nurse, or other means, is available to remind or assist subjects. In your practice, neither of these facilitating factors may be at work.

This situation highlights a fundamental difference between efficacy and effectiveness. Investigation into the efficacy of a drug (or treatment or intervention) asks the question: "Can it work?" (in other words, under ideal circumstances, if taken as directed, and so forth). A look at effectiveness asks the question: "Does it work?" (in other words, in the real world, with real patients). For your patients, you need to be the judge about whether an efficacious therapy will be effective.

Were all clinically important outcomes considered?

The researchers who undertook the study determined which endpoints they would assess. You need to judge whether those outcomes are meaningful to patients. Not all significant differences have clinical relevance. For example, a difference in O2 saturation between 95% and 93% that favors a particular treatment in patients with bronchiolitis may not make a difference to you or your patients. Typically, a child doesn't care what her O2 saturation is. The pertinent questions for her parents are: Does the child feel better? Did she leave the hospital sooner? Was she able to play more?

Does a treatment have side effects or deleterious consequences even though researchers judged it effective?

Researchers typically design studies to show benefit. They tend to pay less attention to the possibility of harm. Effective treatments are therefore often used until a serious side effect is identified (recall the association of felbamate with aplastic anemia, and rotavirus vaccine with intussusception). Trials of these treatments may not have contained enough subjects to identify certain, sometimes rare, side effects, but you should always look closely to discern even a trend toward differences in the profile and frequency of side effects between treatment and control groups.

Last, be wary of new therapies even when investigators haven't reported a difference in the occurrence of side effects between study groups! In general, it's prudent to be neither the first nor the last to put a new therapy into practice.




1. Davis DA, Thompson MA, Oxman AD, et al: Changing physician performance: A systematic review of the effect of continuing education strategies. JAMA 1995;274:700

2. Antman EM, Lau J, Kupelnick B, et al: A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. JAMA 1992;268:240

3. Nyquist AC, Gonzales R, Steiner JF, et al: Antibiotic prescribing for children with colds, upper respiratory tract infections, and bronchitis. JAMA 1998;279:875

4. Evidence-based medicine. A new approach to teaching the practice of medicine. Evidence-Based Medicine Working Group. JAMA 1992;268:2420

5. Sackett DL, Rosenberg WMC, Gray JAM, et al: Evidence-based medicine: What it is and what it isn't. BMJ 1996;312:71

6. Guyatt GH, Sackett DL, Cook DJ: Users' guides to the medical literature. II. How to use an article about therapy or prevention. B. JAMA 1994;271:59

7. Guyatt GH, Sackett DL, Cook DJ: Users' guides to the medical literature. II. How to use an article about therapy or prevention. A. JAMA. 1993;270:2598

8. Chalmers TC, Celano P, Sacks HS, et al: Bias in treatment assignment in controlled clinical trials. N Engl J Med 1983;309:1358

9. Anonymous: Palivizumab, a humanized respiratory syncytial virus monoclonal antibody, reduces hospitalization from respiratory syncytial virus infection in high-risk infants. The IMpact-RSV Study Group. Pediatrics 1998; 102:531

10. Cook RJ, Sackett DL: The number needed to treat: A clinically useful measure of treatment effect. BMJ 1995;310:452

11. Uhari M, Kontiokari T, Niemela M: A novel use of xylitol sugar in preventing acute otitis media [comment]. Pediatrics 1998;102:879


Glossary

95% Confidence interval
The range of values within which one can be 95% confident that the true result lies.

Block randomization
A procedure by which subjects are randomized in blocks such that, at the completion of each block, equal numbers will be present in each arm.

Clinical practice guideline
A systematically developed statement that facilitates clinicians' and patients' decisions about appropriate health care for specific circumstances.

Effectiveness
Answers the question of whether a given therapy works under real-world circumstances.

Efficacy
Answers the question of whether a given therapy can work under optimal circumstances.

Evidence-based medicine (EBM)
The conscientious, explicit, and judicious use of current best evidence to make a decision about the care of individual patients.

Intention to treat
The analytic approach in which all patients are evaluated according to their group assignment—regardless of whether they comply with, or complete, the treatment.

Meta-analysis
The statistical combination of many different studies to arrive at a single summary estimate of an effect.

Number needed to treat
Or NNT. The number of patients who need to be treated with a given therapy for one additional patient to benefit. Expressed as 1 over the risk difference (RD).

Odds
The ratio of the occurrence of an event to the non-occurrence of that event. For example: The odds of drawing the ace of spades from a deck of playing cards are 1/51.

All about reviews, summaries, meta-analyses, and guidelines

Summarizing the best available evidence about a treatment begins with a rigorous and systematic review of the topic. Such systematic reviews are often published in pediatric journals and are available at several Web sites; they are intended to locate and assess all relevant studies on a given topic, and require a search strategy that is sensitive enough to ensure proper identification and retrieval of all relevant published trials.

A systematic review can be as simple as reporting the results of the single published study of a given therapy, but when studies are numerous and, at times, in conflict, alternative means of summarizing existing data are needed. These alternatives can take many forms, but typically employ, as part of the summary process, a formal evidence-based summary or meta-analysis.

An evidence-based summary presents the state of the evidence without quantitatively combining results. Meta-analysis, on the other hand, combines data from different studies to increase the statistical power of the findings and to arrive at a summary estimate. Recently, meta-analyses have come under scrutiny and criticism because some authors have identified inconsistencies between their summarized results and subsequent large, randomized, controlled trials.1,2 Such discrepancies may arise in part because meta-analyses often combine studies with heterogeneous patient populations and inclusion criteria. The discrepancies reported, however, have been of moderate size and unclear clinical significance, affecting the magnitude or statistical significance of the effect but not the direction of the findings. Proponents of meta-analysis are developing statistical methods to better evaluate the heterogeneity of studies3; in the meantime, it is unclear which findings should serve as a "gold standard": those of the meta-analysis or those of the subsequent randomized, controlled trial.4,5

Both evidence summaries and meta-analyses are often used to devise clinical practice guidelines,6,7 which the Institute of Medicine defines as "systematically developed statements to assist practitioners' and patients' decisions about appropriate health care for specific circumstances."8 Traditionally, practice guidelines have relied on two building blocks: consensus and expert opinion. Although most practice guidelines in use today take an evidence-based approach, you should carefully scrutinize the methods that have been employed to devise them.9-11 Regrettably, some guidelines cannot, in their present form, be viewed as helpful to many practicing pediatricians,12 many of whom find them too "cookbook"-like in their recommendations and prefer summaries of evidence to outright clinical rules.


1. LeLorier J, Gregoire G, Benhaddad A, et al: Discrepancies between meta-analyses and subsequent large randomized, controlled trials. N Engl J Med 1997;337: 536

2. Cappelleri JC, Ioannidis JP, Schmid CH, et al: Large trials vs meta-analysis of smaller trials: How do their results compare? JAMA 1996;276:1332

3. DerSimonian R, Levine RJ: Resolving discrepancies between a meta-analysis and a subsequent large controlled trial. JAMA 1999;282:664

4. Ioannidis JP, Cappelleri JC, Lau J: Meta-analyses and large randomized, controlled trials. N Engl J Med 1998;338:59; discussion 61

5. Saint S, Veenstra DL, Sullivan SD: The use of meta-analysis in cost-effectiveness analysis. Pharmacoeconomics 1999;15:1

6. Bergman DA: Evidence-based guidelines and critical pathways for quality improvement. Pediatrics 1999;103:225

7. Cook DJ, Greengold NL, Ellrodt AG, et al: The relation between systematic reviews and practice guidelines. Ann Intern Med 1997;127:210

8. Field MJ, Lohr KN: Clinical Practice Guidelines. Directions for a new agency. Washington, D.C., National Academy Press, 1990

9. Wilson MC, Hayward RS, Tunis SR, et al: Users' guides to the medical literature. VIII. How to use clinical practice guidelines. B. JAMA 1995;274:1630

10. Hayward RS, Wilson MC, Tunis SR, et al: Users' guides to the medical literature. VIII. How to use clinical practice guidelines. A. JAMA 1995;274:570

11. Shaneyfelt TM, Mayo-Smith MF, Rothwangl J: Are guidelines following guidelines? The methodological quality of clinical practice guidelines in the peer- reviewed medical literature. JAMA 1999;281:1900

12. Christakis DA, Rivara FP: Pediatricians' awareness of and attitudes about four clinical practice guidelines. Pediatrics 1998;101:825

DR. CHRISTAKIS is an associate professor in the department of pediatrics and director of the Child Health Institute at the University of Washington, Seattle. Dr. Christakis has nothing to disclose in regard to affiliation with, or financial interests in, any organization that may have an interest in any part of this article.


Dimitri Christakis. Evaluating articles about treatment in the medical literature. Contemporary Pediatrics May 2003;20:79.
