Effect Size and Clinical Significance

Effect size is how the MCAT tests whether you can separate statistical significance from clinical importance. The dominant misconception is treating p < 0.05 as proof that an effect matters. A study with 50,000 participants can push a Cohen's d of 0.05 to p < 0.0001, yet an effect that small is clinically meaningless. Effect size measures the magnitude of a difference or association and does not grow just because more people were studied, and the MCAT deliberately pairs impressive p-values with tiny effect sizes to see whether you notice the distinction. Statistical significance tells you how unlikely the result would be if there were no effect at all; effect size tells you whether the effect is big enough to matter.

The two effect size measures that appear on the MCAT are Cohen's d and the correlation coefficient r. Cohen's d is the standardized mean difference between two groups: you divide the raw difference in means by the pooled standard deviation to get a unitless number you can compare across studies. By convention, d ≈ 0.2 is small, d ≈ 0.5 is medium, and d ≈ 0.8 is large. For r, the benchmarks are lower: 0.1 small, 0.3 medium, 0.5 large. A passage might give you these benchmarks implicitly or explicitly, and they are your starting point for judging whether a finding is 'clinically significant', meaning large enough to change what a clinician would actually do.
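For reference, the standardized mean difference described above can be written out as follows. This is the usual pooled-standard-deviation form; the symbols (group means, group standard deviations s, group sizes n) are notation chosen here for illustration, not something a passage is guaranteed to use.

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```

Notice that the group sizes enter only as weights for averaging the two variances, so recruiting more participants does not systematically push d up or down.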

What makes this tricky is that a small p-value feels like importance. If p = 0.0001, it seems as though something big happened. But with tens of thousands of participants, even a Cohen's d of 0.05 clears p < 0.0001, and the MCAT loves this exact scenario. You'll see a passage present an impressive p-value, and the right answer will require you to notice that the effect size is tiny or that the absolute difference between groups has no real-world consequence. Train yourself to always ask: statistically significant, yes, but is the effect actually large?
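To see the large-sample scenario in numbers, here is a minimal Python sketch. It assumes two equal-sized groups and uses the large-sample normal approximation z = d × sqrt(n per group / 2); the function name two_sided_p_from_d is made up for this illustration and is not something you need for the exam.

```python
import math

def two_sided_p_from_d(d, n_per_group):
    """Approximate two-sided p-value for a two-group comparison.

    Assumes equal group sizes and the large-sample normal
    approximation z = d * sqrt(n_per_group / 2).
    """
    z = abs(d) * math.sqrt(n_per_group / 2)
    # erfc(z / sqrt(2)) is the two-sided tail area of the standard normal.
    return math.erfc(z / math.sqrt(2))

# Same tiny effect (d = 0.05) in increasingly large studies.
for n in (100, 1_000, 25_000):
    print(f"n per group = {n:>6}: p = {two_sided_p_from_d(0.05, n):.2g}")
```

The effect size passed in never changes; only the sample size does, and that alone moves the p-value from unremarkable to far below 0.0001.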

Common misconceptions

Common mistake

Wrong: A smaller p-value means the effect is larger or more clinically important.

Right: The p-value is the probability of seeing data at least this extreme if the null hypothesis were true, not a measure of the effect's magnitude; a tiny p-value can accompany a trivially small effect size.

The p-value tells you only how probable a result at least as extreme as yours would be if the null hypothesis were true. It measures evidence against the null, not how big the effect is. A p-value of 0.0001 and a p-value of 0.04 can both accompany a Cohen's d of 0.2 (a small effect); the difference comes from sample size. Always check the effect size separately from the p-value to assess importance.
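You can turn the d = 0.2 example around and ask how many participants it takes for the same small effect to land at each p-value. The sketch below assumes two equal groups and a normal approximation; n_per_group_for_p is a hypothetical helper written for this illustration.

```python
import math
from statistics import NormalDist

def n_per_group_for_p(d, p_two_sided):
    """Smallest equal group size at which an observed effect of size d
    reaches the given two-sided p-value (normal approximation)."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)  # z-score matching that p
    return math.ceil(2 * (z / d) ** 2)

# Same effect size, two very different p-values.
print("d = 0.2, p = 0.04   ->", n_per_group_for_p(0.2, 0.04), "per group")
print("d = 0.2, p = 0.0001 ->", n_per_group_for_p(0.2, 0.0001), "per group")
```

Under these assumptions the second study needs several times as many participants as the first, and that is the only thing separating the two p-values.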

Common mistake

Wrong: A statistically significant result from a very large study must reflect a meaningful real-world difference.

Right: Large sample sizes give high power to detect even trivially small differences as statistically significant, so effect size must be evaluated separately from the p-value.

Statistical power increases with sample size, which means large studies can reliably detect effects that are real but tiny. A drug that lowers blood pressure by 1 mmHg on average might produce p < 0.001 in a trial of 100,000 patients, but a 1 mmHg reduction is too small to change how a patient is treated. The p-value gives strong evidence that the effect is not zero; the effect size tells you whether it's worth caring about.

Common mistake

Wrong: Cohen's d changes as sample size increases, just like the p-value does.

Right: Cohen's d is a standardized measure of the difference between two means divided by the pooled standard deviation, and is independent of sample size.

Cohen's d is computed as the difference between two group means divided by the pooled standard deviation — neither of those quantities is directly determined by how many people you measured. Sample size affects the precision of your estimate of d and the p-value of your test, but the d value itself reflects the actual separation between groups relative to their variability, not how large your sample was. This is exactly what makes effect sizes useful for comparing findings across studies with different sample sizes.
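A quick simulation makes the same point. This is only a sketch: it assumes normally distributed scores with a true standardized difference of 0.5, and the names cohens_d and simulate_d are invented for the example.

```python
import random
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate_d(n_per_group, true_d=0.5):
    """Draw two normal samples whose true standardized difference is true_d."""
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    return cohens_d(treated, control)

# The estimate gets steadier as n grows, but it does not grow with n.
for n in (50, 500, 5_000):
    print(f"n per group = {n:>5}: d = {simulate_d(n):.2f}")
```

The small-sample estimate can wander noticeably from 0.5, while the large-sample estimate settles close to it. More data buys precision, not a bigger d.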

Common mistake

Gap: Unaware of standard benchmarks for interpreting Cohen's d and r as small, medium, or large effects

By convention, Cohen's d ≈ 0.2 is small, ≈ 0.5 is medium, and ≈ 0.8 is large; the correlation coefficient r uses its own, lower benchmarks (0.1, 0.3, 0.5).

Jacob Cohen established standard benchmarks so researchers could interpret effect sizes consistently: for Cohen's d, values near 0.2 are considered small, near 0.5 medium, and near 0.8 large. For the correlation coefficient r, the thresholds are lower: 0.1 small, 0.3 medium, 0.5 large. Knowing these lets you look at a reported effect size in a passage and immediately judge whether the finding has practical weight, rather than just noting that p < 0.05.
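If it helps to drill the benchmarks, here is a small Python sketch that labels a reported value by the benchmark it sits closest to. Two things here are assumptions made for this illustration rather than part of Cohen's convention: the nearest-benchmark rule itself, and the "negligible" label for values less than half the small benchmark. The helper name nearest_benchmark is hypothetical.

```python
# Conventional benchmarks quoted above, in (small, medium, large) order.
BENCHMARKS = {"d": (0.2, 0.5, 0.8), "r": (0.1, 0.3, 0.5)}
LABELS = ("small", "medium", "large")

def nearest_benchmark(value, measure):
    """Label |value| by the closest benchmark for measure "d" or "r".

    Assumption for this sketch: anything under half the 'small'
    benchmark is called "negligible".
    """
    size = abs(value)  # the sign only gives direction, not magnitude
    cutoffs = BENCHMARKS[measure]
    if size < cutoffs[0] / 2:
        return "negligible"
    distances = [abs(size - c) for c in cutoffs]
    return LABELS[distances.index(min(distances))]

for value, measure in ((0.05, "d"), (0.6, "d"), (0.28, "r")):
    print(f"{measure} = {value}: {nearest_benchmark(value, measure)}")
```

Treat the labels as rough guides. A value like r = 0.28 sits between two benchmarks, and a passage may expect you to describe it as "small to medium" rather than force it into one category.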

What the exam tests

Know that effect size (Cohen's d, r) measures the magnitude of a difference or association and is calculated independently of sample size — it doesn't go up just because you recruited more participants.
Understand why statistical significance and clinical significance are different things: a very large sample size gives high statistical power to detect even trivially small differences as 'significant,' so a low p-value alone doesn't tell you the finding matters in practice.
In a passage that presents both p-values and effect sizes (or absolute differences), be able to identify whether a statistically significant result is also clinically meaningful — or whether it's a case of a large study finding a real but trivial effect.
Apply standard benchmarks to interpret a reported Cohen's d or r value: for Cohen's d, small ≈ 0.2, medium ≈ 0.5, large ≈ 0.8; for r, small ≈ 0.1, medium ≈ 0.3, large ≈ 0.5.

Can you avoid these mistakes?

A pharmaceutical company runs a trial with 80,000 participants comparing a new antidepressant to placebo. The result is p < 0.0001, and the reported Cohen's d is 0.15. What should you conclude about the clinical significance of this finding, and why?

Two studies examine the same intervention. Study A (n = 50) finds p = 0.08 and Cohen's d = 0.6. Study B (n = 10,000) finds p < 0.001 and Cohen's d = 0.15. Which study provides stronger evidence that the intervention has a meaningful effect, and what key concept explains your answer?

A passage reports a correlation of r = 0.28 between hours of sleep and exam performance, with p = 0.03. Is this a small, medium, or large effect? Is the finding statistically significant? Are those two judgments the same thing?

True or false: If you double the sample size of a study, the Cohen's d for the same underlying population difference will approximately double as well. Explain your reasoning.
