How to Read a Paper: The Intensivist’s Guide to Medical Statistics

Introduction

Reading a clinical trial can seem intimidating, but it is a learnable skill, not a passive activity. It is an active process of critical appraisal, much like being a detective examining evidence. The ability to dissect a paper, evaluate its methodology, and judge the validity of its conclusions is a core skill for any evidence-based practitioner. This guide aims to demystify the key statistical terms, study designs, data visualization methods, and tests you will encounter, providing simple explanations and practical examples relevant to critical care.

1. Types of Clinical Studies: The Hierarchy of Evidence

Not all research is created equal. The type of study design determines the strength of its conclusions.

  • Randomized Controlled Trial (RCT):
    • What it is: The gold standard of clinical research. A population of patients is randomly assigned to two or more groups: an intervention group (receiving the new treatment) and a control group (receiving a placebo or the current standard of care). The groups are then followed over time to compare outcomes.
    • Clinical Significance: Randomization is the most powerful tool to reduce bias. It ensures that, on average, both known and unknown confounding factors are distributed evenly between the groups. This means any difference in outcome is more likely to be due to the intervention itself. The PROSEVA and ARMA trials are classic examples of RCTs that changed clinical practice.
  • Cohort Study:
    • What it is: An observational study where a group of individuals (the cohort) is identified and followed forward in time. Some individuals will be exposed to a risk factor (e.g., smoking, a certain medication), and others will not. The study then compares the incidence of an outcome (e.g., lung cancer, recovery from illness) between the exposed and unexposed groups.
    • Clinical Significance: Cohort studies are excellent for identifying risk factors and determining prognosis. Because they are not randomized, they are more susceptible to confounding variables. For example, a cohort study might follow ICU survivors to see if early mobilization (exposure) is associated with better long-term outcomes.
  • Case-Control Study:
    • What it is: Another observational study that works in reverse. It starts by identifying a group of patients who already have an outcome (the “cases”) and a comparable group who do not have the outcome (the “controls”). It then looks backward in time to determine if there was a difference in exposure to a potential risk factor between the two groups.
    • Clinical Significance: These studies are particularly useful for investigating rare diseases or outcomes, where a cohort study would be impractical. For instance, to investigate risk factors for a rare ICU complication, one could compare the past exposures of patients who developed the complication (cases) with those who did not (controls).
  • Systematic Review & Meta-Analysis:
    • What it is: A systematic review is a comprehensive and structured review of all available evidence on a specific topic. A meta-analysis goes one step further by using statistical methods to combine the results from multiple individual studies into a single, more powerful summary estimate.
    • Clinical Significance: A well-conducted meta-analysis sits at the top of the evidence hierarchy. By pooling data, it can provide a more precise estimate of a treatment’s effect than any single study alone and can sometimes detect effects that individual, smaller studies might have missed.

2. Hypothesis Testing and the P-value

At the heart of every clinical trial is a research question, which is translated into a statistical hypothesis.

  • The Null Hypothesis (H₀): This is the default assumption that there is no difference between the treatment groups. For example, the null hypothesis would state that a new drug has no effect on mortality compared to a placebo. The goal of the trial is to gather enough evidence to reject this null hypothesis.
  • The Alternative Hypothesis (H₁): This is the opposite of the null hypothesis. It states that there is a difference between the groups (e.g., the new drug does reduce mortality).

What is a P-value?

  • Definition: The P-value is the probability of observing the study’s results (or results even more extreme) if the null hypothesis were actually true.
  • Clinical Interpretation: A small P-value (typically < 0.05) suggests that the observed result is unlikely to be a fluke of random chance. It does not tell you the size or clinical importance of the effect, only that an effect is likely to exist.
  • Statistical Significance: By convention, a P-value of less than 0.05 is considered “statistically significant.” This means there is less than a 5% probability of seeing a difference this large, or larger, if no real difference existed.

Example: In the PROSEVA trial, the P-value for 28-day mortality was <0.001.

  • Interpretation: This means that if prone positioning had no real effect on mortality, the probability of observing such a large difference between the groups by pure chance would be less than 0.1%. This gives us strong confidence to reject the “no difference” idea and conclude that prone positioning has a real effect.
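The arithmetic behind a P-value like this can be reproduced with a chi-square test on a 2×2 table. In the Python sketch below, the group sizes are illustrative assumptions; only the mortality rates quoted in this guide (16.0% vs. 32.8%) are taken from the trial.

```python
# Recomputing a P-value from a 2x2 table of outcomes with a chi-square test.
# Group sizes (237 and 229) are illustrative assumptions; the event rates
# 38/237 (16.0%) and 75/229 (32.8%) match the percentages quoted above.
from scipy.stats import chi2_contingency

#               deaths  survivors
prone_group  = [38, 199]   # 16.0% mortality
supine_group = [75, 154]   # 32.8% mortality

chi2, p, dof, expected = chi2_contingency([prone_group, supine_group])
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.6f}")
```

The resulting P-value falls well below 0.001, matching the strength of evidence described in the interpretation above.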

3. Confidence Intervals (CI)

While a P-value gives a simple “yes/no” on statistical significance, a confidence interval provides a more nuanced picture of the magnitude and precision of the effect.

  • Definition: A confidence interval is a range of values within which we are reasonably confident the “true” value of a population parameter lies. A 95% CI is the most common.
  • Clinical Interpretation: A 95% CI provides a plausible range for the true treatment effect. A narrow CI implies a precise estimate, while a very wide CI suggests more uncertainty and a less reliable result.
  • The “No Effect” Line: The most important part of interpreting a CI is to see if it crosses the “line of no effect.”
    • For ratios (Relative Risk, Odds Ratio, Hazard Ratio), the line of no effect is 1.0.
    • For absolute differences, the line of no effect is 0.
    • If the CI crosses this line, the result is not statistically significant.

Example: The APROCCHSS trial reported a hazard ratio for 90-day mortality of 0.82 with a 95% CI of 0.69 to 0.98.

  • Interpretation: We are 95% confident that the true effect of the steroids is a relative reduction in the hazard of death of somewhere between 2% (a hazard ratio of 0.98) and 31% (a hazard ratio of 0.69). Because the entire range is below 1.0, the result is statistically significant.
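A confidence interval for a ratio is conventionally computed on the log scale and then back-transformed. The sketch below shows this for a relative risk; the counts are purely illustrative and are not taken from APROCCHSS.

```python
# Standard log-scale 95% CI for a relative risk (illustrative counts only).
import math

a, n1 = 38, 237    # events / total in the intervention group
c, n2 = 75, 229    # events / total in the control group

rr = (a / n1) / (c / n2)
se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)   # SE of ln(RR)
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Because the entire interval sits below the line of no effect (1.0), this illustrative result would be statistically significant.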

4. Common Statistical Tests

The choice of statistical test depends on the type of data being analyzed (e.g., categorical percentages vs. continuous measurements).

  • Chi-Square Test:
    • Use Case: Used to compare proportions or frequencies of categorical data between two or more groups. For example, comparing the percentage of patients who survived (category: alive/dead) in a treatment group versus a placebo group.
  • Student’s t-test:
    • Use Case: Used to compare the means of a continuous variable (e.g., blood pressure, heart rate, days on ventilator) between two groups. It assumes the data is normally distributed.
  • ANOVA (Analysis of Variance):
    • Use Case: An extension of the t-test used to compare the means of a continuous variable between more than two groups. For example, comparing mean blood pressure across a placebo group, a low-dose drug group, and a high-dose drug group.
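All three tests are available in scipy. The sketch below runs each on small made-up datasets, purely to show which test matches which data type.

```python
# Minimal demo of the three tests on made-up data (all values illustrative).
from scipy.stats import chi2_contingency, ttest_ind, f_oneway

# Chi-square: categorical outcome (alive, dead) in two groups
chi2, p_chi, _, _ = chi2_contingency([[80, 20], [65, 35]])

# t-test: a continuous variable (e.g. ventilator days) in two groups;
# assumes roughly normally distributed data
p_t = ttest_ind([5, 7, 6, 8, 5, 9], [9, 11, 10, 12, 8, 13]).pvalue

# ANOVA: a continuous variable (e.g. mean arterial pressure) across
# more than two groups (placebo, low dose, high dose)
p_anova = f_oneway([70, 72, 68, 75], [80, 78, 82, 79], [90, 88, 92, 87]).pvalue

print(f"chi-square p={p_chi:.4f}, t-test p={p_t:.4f}, ANOVA p={p_anova:.6f}")
```

In each case the test returns a P-value, which is then interpreted exactly as described in section 2.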

5. Measures of Clinical Impact

These metrics translate the results into a more clinically meaningful context.

  • Absolute Risk Reduction (ARR): The simple difference in the event rates between the control and intervention groups. This is often considered the most clinically relevant measure of effect.
    • ARR = Risk in Control Group – Risk in Intervention Group
  • Relative Risk Reduction (RRR): The proportional reduction in risk in the intervention group compared to the control group. While often a larger, more impressive number, it can be misleading if the baseline risk is very low.
    • RRR = ARR / Risk in Control Group or 1 – RR
  • Number Needed to Treat (NNT): The number of patients you would need to treat with the intervention to prevent one additional adverse outcome. It is the inverse of the ARR.
    • NNT = 1 / ARR

Example (using PROSEVA data):

  • Risk in control (supine) group = 32.8%
  • Risk in intervention (prone) group = 16.0%
  • ARR: 32.8% – 16.0% = 16.8%. (Clinically, this means for every 100 patients treated, proning prevents about 17 deaths).
  • NNT: 1 / 0.168 ≈ 5.95, conventionally rounded up to 6. (Clinically, this means you need to treat 6 patients with prone positioning to prevent one death).
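The worked example above translates directly into a few lines of Python, using the event rates quoted in this guide:

```python
# ARR, RRR, and NNT from the event rates quoted in the example above.
import math

risk_control = 0.328        # mortality in the supine (control) group
risk_intervention = 0.160   # mortality in the prone (intervention) group

arr = risk_control - risk_intervention   # absolute risk reduction
rrr = arr / risk_control                 # relative risk reduction
nnt = math.ceil(1 / arr)                 # number needed to treat (round up)

print(f"ARR = {arr:.1%}, RRR = {rrr:.1%}, NNT = {nnt}")
```

Note how the RRR (about 51%) sounds far more dramatic than the ARR (16.8%), which is exactly why the ARR and NNT are the more clinically honest numbers to quote.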

6. Bias and Confounding

  • Bias: A systematic error in a study’s design or conduct that leads to an incorrect estimate of the treatment effect.
    • Selection Bias: Occurs when the groups being compared are different from the start (e.g., sicker patients are preferentially assigned to one group). RCTs are the best way to prevent this.
    • Information Bias: Occurs when data is collected differently or inaccurately between groups. Blinding (where patients and/or investigators don’t know the group assignment) helps prevent this.
  • Confounding: Occurs when a third factor is associated with both the exposure and the outcome, distorting the apparent relationship. For example, if a study finds that coffee drinking is associated with heart disease, smoking could be a confounder, as smokers are more likely to drink coffee and also have a higher risk of heart disease.

7. Visualizing Data: Common Charts and Graphs

  • Kaplan-Meier Curve:
    • What it is: A graph that shows the probability of survival over time. The y-axis represents the estimated probability of survival, and the x-axis represents time. Each time a patient has an event (e.g., death), the curve drops down.
    • Example Graph: [Figure: Kaplan-Meier curves for the Intervention and Control groups; y-axis: survival probability from 1.0 to 0.0, x-axis: time]

  • How to Interpret: You will typically see two curves, one for the intervention group and one for the control group. The further apart the curves are, the larger the difference in survival between the groups. If the curve for the treatment group is consistently above the curve for the control group, it indicates better survival in the treatment group.
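The curve itself is built from a simple product: at each event time, multiply the running survival probability by the fraction of at-risk patients who survived that time point. A minimal sketch (an illustrative implementation, not a library):

```python
# A minimal Kaplan-Meier estimator (illustrative sketch).
# times: follow-up time for each patient; events: 1 = died, 0 = censored
# (lost to follow-up or still alive at the end of the study).
def kaplan_meier(times, events):
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []                 # (time, survival probability) at each event
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(e for tt, e in data if tt == t)
        removed = sum(1 for tt, _ in data if tt == t)  # deaths + censorings
        if deaths:
            surv *= 1 - deaths / n_at_risk   # the curve steps down here
            curve.append((t, surv))
        n_at_risk -= removed
        i += removed
    return curve

# Five hypothetical patients: deaths at t = 2, 3, 5; censored at t = 3 and 7
curve = kaplan_meier([2, 3, 3, 5, 7], [1, 1, 0, 1, 0])
print(curve)
```

Censored patients are simply removed from the at-risk pool without dropping the curve, which is how Kaplan-Meier handles incomplete follow-up.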

  • Forest Plot:
    • What it is: The standard way to display the results of a meta-analysis. Each study included in the analysis is represented by a horizontal line (its confidence interval) and a square (its point estimate). The size of the square is proportional to the weight of that study in the overall analysis. At the bottom, a diamond represents the pooled result of all the studies combined.
    • Example Graph: [Figure: Forest plot with the vertical line of no effect at 1.0]
      Study A (2018): Odds Ratio 0.90 (95% CI 0.70–1.10)
      Study B (2020): Odds Ratio 0.80 (95% CI 0.65–0.95)
      Study C (2022): Odds Ratio 0.85 (95% CI 0.70–1.05)
      Total: Odds Ratio 0.84 (95% CI 0.75–0.93)
      (Left of the line favours the intervention; right favours the control.)

  • How to Interpret: Look at the diamond at the bottom. The width of the diamond is the confidence interval for the overall effect. In this example, since the diamond does not cross the “line of no effect” (the vertical line at 1.0), the pooled result is statistically significant, favoring the intervention.
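The pooled diamond is typically produced by inverse-variance weighting on the log scale: precise studies (narrow CIs) get more weight. The sketch below shows a fixed-effect pooling of the three illustrative odds ratios from the example figure; because the figure is schematic, the pooled CI here is close to, but not identical with, the one shown there.

```python
# Fixed-effect (inverse-variance) meta-analysis of illustrative odds ratios.
# SEs are back-calculated from each study's 95% CI.
import math

studies = [              # (OR, CI low, CI high)
    (0.90, 0.70, 1.10),  # Study A (2018)
    (0.80, 0.65, 0.95),  # Study B (2020)
    (0.85, 0.70, 1.05),  # Study C (2022)
]

weights, weighted_logs = [], []
for odds_ratio, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE of ln(OR)
    w = 1 / se**2                                    # inverse-variance weight
    weights.append(w)
    weighted_logs.append(w * math.log(odds_ratio))

pooled_log = sum(weighted_logs) / sum(weights)
pooled_se = 1 / math.sqrt(sum(weights))
pooled_or = math.exp(pooled_log)
ci_lo = math.exp(pooled_log - 1.96 * pooled_se)
ci_hi = math.exp(pooled_log + 1.96 * pooled_se)
print(f"Pooled OR = {pooled_or:.2f} (95% CI {ci_lo:.2f} to {ci_hi:.2f})")
```

Note how the pooled interval is narrower than any single study's, and lies entirely below 1.0: pooling turned one borderline and two ambiguous results into a statistically significant overall effect.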

  • Box-and-Whisker Plot:
    • What it is: A standardized way of displaying the distribution of continuous data. The “box” represents the interquartile range (the middle 50% of the data), with a line inside the box for the median. The “whiskers” extend out to show the range of the data, often excluding outliers, which may be plotted as individual points.
    • Example Graph: [Figure: Box-and-whisker plots comparing the Control Group and the Intervention Group]

  • How to Interpret: These plots are excellent for comparing the distribution of a variable (like length of stay or blood pressure) between two or more groups at a glance. In this example, you can quickly see that the median (the line inside the box) for the Intervention Group is lower than the Control Group, and its overall distribution is shifted downwards.
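Every component of a box plot can be computed with the Python standard library. The values below are illustrative (imagine ICU length of stay in days):

```python
# Computing box-plot components: median, IQR (the box), range (the whiskers).
# All values are illustrative.
import statistics

control      = [6, 8, 9, 11, 12, 14, 20]
intervention = [3, 5, 7, 9, 10, 12, 15]

for name, data in [("control", control), ("intervention", intervention)]:
    q1, median, q3 = statistics.quantiles(data, n=4)
    print(f"{name}: median={median}, box (IQR)={q1}-{q3}, "
          f"whiskers (range)={min(data)}-{max(data)}")
```

Here the intervention group's median and quartiles are all lower, which is exactly the downward shift a box plot makes visible at a glance.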

How to Read a Paper: A Step-by-Step Guide

Think of reading a paper like investigating a case. Most clinical trials follow a standard structure called IMRaD: Introduction, Methods, Results, and Discussion. To be a good detective, you just need to know which questions to ask at each stage.

1. First Pass: The Abstract and Title (The Case File Cover)

Before diving deep, quickly scan the cover.

  • Is the title clear and unbiased? A good title tells you what the study is about without sensational language.
  • Does the abstract provide a coherent summary? The abstract is the “CliffsNotes” version of the paper. Read it first. Does the final conclusion make sense based on the results mentioned? This is your first check to see if the story holds together.

2. Introduction: Why Was This Study Done? (The Backstory)

This section sets the scene.

  • Is the clinical question well-defined and relevant to your practice? The authors should clearly explain, “Here is a problem we see in our patients.” This is the knowledge gap.
  • Is the hypothesis clear? They should then state, “We think this new treatment might be a solution.” This is their hypothesis, the idea they set out to test.

3. Methods: The Heart of the Paper (The Recipe and the Rules)

This is the most important section for judging if the “experiment” was fair.

  • Study Design: Was it an RCT, a cohort study, or something else? Is this “recipe” the right one for answering the question? (For testing a new treatment, an RCT is usually the best recipe).
  • Population (PICO – Patients): Who were the patients? Look at the inclusion and exclusion criteria. Were they too specific (e.g., only including young, healthy patients, making the results hard to apply to sicker, older patients)? Or were they too vague?
  • Intervention and Control (PICO – Intervention/Comparison): What exactly did the researchers do? Was the new treatment described clearly? What did the comparison group get? Was it a placebo (a sugar pill) or the current best treatment? The comparison must be fair.
  • Outcome (PICO – Outcome): What were they measuring to see if the treatment worked? Was it something that matters to patients, like survival (a patient-centered outcome), or just a change in a lab test?
  • Randomization and Blinding: If it’s an RCT, how did they assign patients to groups? Good randomization is like a coin toss—it ensures the groups are similar by chance. Blinding is like a blind taste test—if patients and doctors don’t know who got the new treatment, their expectations can’t influence the results.
  • Sample Size: Did they include enough patients? Think of it like a political poll. You wouldn’t trust a poll of only 10 people. The authors should explain how they calculated that their study was big enough to be reliable (a “power calculation”).
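A common form of this power calculation, for comparing two proportions, can be sketched in a few lines. The event rates below are illustrative; the formula shown is the standard normal-approximation sample size for a two-sided alpha of 0.05 and 80% power.

```python
# Sketch of a sample-size (power) calculation for comparing two proportions.
# Event rates are illustrative; alpha = 0.05 (two-sided), power = 80%.
import math
from statistics import NormalDist

p1, p2 = 0.328, 0.160            # expected event rates in the two groups
alpha, power = 0.05, 0.80

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96
z_beta = NormalDist().inv_cdf(power)            # ≈ 0.84

n_per_group = math.ceil(
    (z_alpha + z_beta) ** 2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2) ** 2
)
print(f"About {n_per_group} patients needed per group")
```

Notice how a large expected difference between the groups keeps the required sample size modest; detecting a small difference would demand far more patients.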

4. Results: What Did the Study Find? (The Evidence)

  • Are the groups comparable at baseline? Look at “Table 1.” This is the “tale of the tape” before a boxing match. It shows the characteristics of the patients in each group. Thanks to randomization, they should look very similar.
  • Are the results presented clearly? The authors should present the findings for the main outcomes they promised to measure in the Methods section.
  • How big was the effect? Don’t just look at the P-value. Look at the absolute risk reduction. This tells you the real-world impact of the treatment.

5. Discussion: What Does It All Mean? (The Verdict)

  • Do the authors’ conclusions match the results? Or are they getting carried away and overstating their findings?
  • Limitations: Do the authors admit the study’s weaknesses? Good scientists are honest about what their study can’t tell you. This builds trust.
  • Context: How do these findings fit with what we already know? Does it support or contradict other studies?

Practical Application: Breaking Down Two Landmark Trials

Case Study 1: The High-Quality RCT (PROSEVA, 2013)

Let’s apply our framework to the PROSEVA trial, which established prone positioning for severe ARDS.

  • Introduction: Clearly explains that while turning patients onto their stomach seemed like a good idea for the lungs, we weren’t sure if it actually saved lives.
  • Methods:
    • Design: A well-designed, multicenter RCT (done in many hospitals, which makes the results more reliable).
    • Population: Very specific criteria (only patients with severe ARDS), targeting those most likely to benefit. This is a huge strength.
    • Intervention: A very specific “recipe” (prone for at least 16 hours straight) compared to standard care.
    • Outcome: 28-day mortality—a very clear and important outcome for patients.
    • Critique: The main weakness is that it was unblinded (you can’t pretend to flip a patient over), but this was unavoidable. Also, the hospitals in the study were very good at proning, so the results might be better than what a less experienced hospital could achieve (generalizability).
  • Results:
    • The two groups were very similar at the start.
    • The results were striking: mortality was cut in half in the prone group (16.0% vs. 32.8%).
  • Discussion: The authors’ conclusion—that this specific proning recipe saves lives in this specific group of patients—is directly and strongly supported by their results.
  • Verdict: A landmark, practice-changing trial. It’s a great example of a study with a clear question, a strong method, and a powerful, clinically meaningful result.

Case Study 2: The Influential but Flawed Trial (Rivers et al. EGDT, 2001)

Now let’s look at the original Early Goal-Directed Therapy trial for septic shock—a paper that shaped sepsis care for over a decade but had limitations that are important for learning.

  • Introduction: Brilliantly identified a critical problem: patients with septic shock weren’t getting aggressive enough treatment early on.
  • Methods:
    • Design: A single-center RCT (only done in one hospital).
    • Population: Patients with severe sepsis or septic shock in the emergency department.
    • Intervention: A complex “multi-vitamin” protocol (EGDT) involving many different treatments at once.
    • Outcome: In-hospital mortality.
    • Critique (The “Doubtful” Points):
      1. Single-Center: A result from one hospital might not apply everywhere. It’s like saying a recipe works in one specific kitchen—will it work in yours?
      2. Lack of Blinding: Doctors knew who was getting the special protocol. This might have caused them to pay more attention to those patients, potentially improving their outcomes for reasons other than the protocol itself (performance bias).
      3. Unusual Control Group: The patients who received standard care died at a very high rate (46.5%). This was higher than many other hospitals at the time, which might have made the new treatment look much better by comparison.
      4. Bundled Intervention: Because EGDT was a “multi-vitamin” of treatments, it’s impossible to know if all the components were necessary, or if just one or two (like getting fluids faster) were responsible for the benefit.
  • Results: Showed a massive reduction in mortality (30.5% vs. 46.5%).
  • Discussion: The conclusion was supported by the data in this one study. However, a critical reader would wonder if the benefit was real or exaggerated due to the flaws in the methods.
  • Verdict: A hugely important, hypothesis-generating trial that rightfully changed the culture of sepsis care. However, its significant limitations meant that its findings needed to be confirmed. When larger, multicenter trials (ProCESS, ARISE, and ProMISe) were done later, they couldn’t replicate the huge benefit, leading to a simplification of sepsis guidelines. This makes the original Rivers trial a perfect example of how a flawed but influential paper can still be a positive force for change, while also teaching us to be critical readers.

This guide has summarized and explained the core medical statistics needed to read a medical paper. If you find any information incorrect or missing, or would like to add or update something, get in touch with us at academics[at]esbicm.org.
