Spaced repetition works — what 130+ years of research shows
Spaced repetition is the best-supported study technique in cognitive science, with meta-analytic effect sizes ranging from d = 0.46 to g = 1.15. It beats massed practice (cramming) for long-term retention in almost every domain tested. But it has real limits: the effect shrinks for complex tasks, disappears on same-day tests, and only holds if the spacing interval matches your actual retention goal. Here's what 50+ studies, meta-analyses, and RCTs actually say.
Does spaced repetition actually work? What the meta-analyses show
Cepeda, Pashler, Vul, Wixted, and Rohrer (2006) pulled together 839 assessments from 317 experiments across 184 articles — the largest quantitative review of distributed practice on verbal recall (Psychological Bulletin, 132(3), 354–380). Spacing beat cramming across the board. The catch: there's no single "best" gap. The optimal inter-study interval (ISI) depends on how long you actually need to remember the material.
Donovan and Radosevich (1999), across 63 studies and 112 effect sizes, found d = 0.46 favoring spaced over massed practice (Journal of Applied Psychology, 84(5), 795–805). The effect was stronger for simple motor tasks and weaker for complex cognitive ones — worth remembering if you're studying something like organic chemistry mechanisms.
More recent work has only added weight:
- Latimier, Peyre, and Ramus (2021): g = 0.74 across 29 studies (Educational Psychology Review, 33(3))
- Adesope, Trevisan, and Sundararajan (2017): g = 0.61 across 188 experiments (Review of Educational Research, 87(3))
- Rowland (2014): g = 0.50, rising to g = 0.73 when feedback is added (Psychological Bulletin, 140(6))
Dunlosky et al. (2013) reviewed ten study techniques and rated distributed practice and practice testing as the only two with "high utility" — effective across ages, subjects, and educational settings. Highlighting, rereading, and summarization were rated low utility (Psychological Science in the Public Interest, 14(1)). Mawson and Kang (2025) confirmed this holds in real classrooms: d = 0.54 (95% CI [0.31, 0.77]) across 22 reports and more than 3,000 students (Behavioral Sciences, 15(6), 771).
| Meta-analysis | Studies/effects | Comparison | Effect size |
|---|---|---|---|
| Cepeda et al. (2006) | 184 articles, 317 experiments | Distributed vs. massed verbal recall | Large; varies by ISI/RI |
| Donovan & Radosevich (1999) | 63 studies, 112 ES | Spaced vs. massed (all domains) | d = 0.46 |
| Rowland (2014) | 159 ES | Testing vs. restudy | g = 0.50 (g = 0.73 with feedback) |
| Adesope et al. (2017) | 188 experiments, 272 ES | Practice testing overall | g = 0.61 |
| Latimier et al. (2021) | 29 studies, 39 ES | Spaced vs. massed retrieval | g = 0.74 |
| Kim & Webb (2022) | 48 experiments, 98 ES | Spaced vs. massed in L2 learning | g = 1.15 (delayed) |
| Mawson & Kang (2025) | 22 reports, 31 ES, N > 3,000 | Distributed vs. massed in classrooms | d = 0.54 |
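If the d and g values above are unfamiliar, this minimal sketch shows how a standardized mean difference is usually computed from two groups' scores. The scores are invented for illustration and come from none of the studies cited here, and the function names `cohens_d` and `hedges_g` are hypothetical helpers, not part of any library.

```python
import math
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled

def hedges_g(treatment, control):
    """Cohen's d multiplied by the usual small-sample correction factor."""
    n1, n2 = len(treatment), len(control)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d(treatment, control) * correction

# Hypothetical test scores (percent correct), invented for this example
spaced = [78, 85, 69, 90, 74, 81]
massed = [70, 72, 65, 80, 68, 75]

d = cohens_d(spaced, massed)
g = hedges_g(spaced, massed)
print(f"d = {d:.2f}, g = {g:.2f}")
```

Because the correction factor is always below 1, g comes out slightly smaller than d, and the two converge as sample sizes grow. That is why the d and g figures in the table can be read on roughly the same scale.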
Language learning: where spaced repetition hits hardest
Vocabulary acquisition is where spaced repetition produces its biggest effects, and the numbers here are striking.
Kim and Webb (2022) analyzed 98 effect sizes from 48 experiments with 3,411 learners and found g = 1.15 on delayed tests — that's a massive effect by any standard (Language Learning, 72(1)).
The Bahrick family study (1993) is worth knowing about if you do any language learning. Four family members learned 300 foreign language word pairs at intervals of 14, 28, or 56 days. The result: 13 sessions at 56-day spacing produced the same retention as 26 sessions at 14-day spacing (Psychological Science, 4(5)). In other words, wider spacing halved the number of sessions needed for the same retention. A related study by Bahrick and Phelps (1987) found that 30-day spacing led to 15% recall after 8 years versus 8% for 1-day spacing (Journal of Experimental Psychology: Learning, Memory, and Cognition, 13(2)).
Karpicke and Roediger (2008) showed retrieval practice matters just as much as spacing: 80% recall with continued testing versus ~35% without, at a 1-week delay. Restudying without testing added nothing (Science, 319(5865)). Karpicke and Bauernschmidt (2011) showed that simply increasing absolute spacing, regardless of whether intervals were expanding, uniform, or contracting, produced a 200% improvement in long-term retention (Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(5)).
On the practical side: Chukharev-Hudilainen and Klepikova (2016) ran the first double-blind RCT in computer-assisted language learning and found EFL students spending an average of 3 minutes per day on spaced repetition tripled their long-term vocabulary retention (CALICO Journal, 33(3)). Nakata and Webb (2016) found long spacing was more than twice as effective as short spacing on delayed posttests for English–Japanese vocabulary (Studies in Second Language Acquisition, 38(3)).
Medical school research: the most rigorous applied evidence
Medical education has produced the most controlled applied research on spaced repetition flashcards. B. Price Kerfoot's RCT at Harvard Medical School (2007) with 95 third-year students showed that email-based spaced repetition significantly improved end-of-year urology scores — Cohen's d = 1.01 for students reviewed 6–8 months after their rotation, d = 0.73 for those 9–11 months out (Medical Education, 41(1)). A follow-up adaptive trial (Kerfoot, 2010) showed 38% greater learning efficiency without sacrificing outcomes (The Journal of Urology, 183(2)).
Anki's impact on board exams is well-documented:
- Deng, Gluckstein, and Larsen (2015): Each additional 1,700 unique Anki cards seen = +1 point on USMLE Step 1, even after controlling for MCAT scores and preclinical grades (Perspectives on Medical Education, 4(6))
- Gilbert et al. (2023): Anki users scored 6.2%–10.7% higher on standardized exams than students using traditional methods (Medical Science Educator, 33(4))
- Durrani et al. (2024): d = 0.8 for the Anki group in a quasi-experimental study with 115 pediatrics students (BMC Medical Education, 24(1))
Shail et al. (2024) reviewed 56 studies across health professions education: 43 of 63 experiments (68%) showed significant benefits of distributed and/or retrieval practice, and only 1 showed a negative effect (Advances in Health Sciences Education, 29).
Does spaced repetition work for math and science?
The evidence for spaced repetition in math and science is growing but patchy.
Rohrer and Taylor (2006) found that distributing math practice across two sessions separated by a week nearly doubled performance on a 4-week test for permutation problems. Tripling the amount of massed practice had zero effect (Applied Cognitive Psychology, 20). Voice and Stirton (2020) found physics students using a spaced repetition web app scored 70% versus 61% (d = 0.47) compared to non-users (New Directions in the Teaching of Physical Sciences, 15(1)).
But math results are noticeably less consistent. Barzagar Nazari and Ebersbach (2019) found strong evidence for spacing benefits in seventh-grade math at 6 weeks — but nothing at 2 weeks (Trends in Neuroscience and Education, 17). Krauspe et al. (2025) found no spacing effect at all for children practicing long multiplication after 8 weeks, with Bayesian analyses confirming the null (Learning and Instruction, 97). Hopkins et al. (2024) found spaced retrieval practice improved calculus performance but not physics — in the same university, same structure (International Journal of STEM Education, 11).
The pattern suggests that the more complex and procedural the task, the less reliably spaced repetition helps. It's not a universal fix.
How to actually time your reviews: the 10–20% rule
When you review matters as much as whether you space at all.
Cepeda, Vul, Rohrer, Wixted, and Pashler (2008) mapped optimal retention across 1,350+ participants and found that the best study gap is roughly 10–20% of your target retention interval (Psychological Science, 19(11)). If you need to remember something for a week, leave about a day between study sessions. If you need it for a year, space your reviews 3–5 weeks apart, since the same study found the ideal ratio drifts lower as the retention interval grows. Too little spacing and you're basically cramming. Too much and the material decays before you review it.
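As a rough illustration, the rule of thumb is just a multiplication. The sketch below takes the 10–20% figure at face value; the function name `suggested_gap_days` and its default fractions are assumptions made for this example, not values taken from the paper's fitted model.

```python
def suggested_gap_days(retention_days, low=0.10, high=0.20):
    """Return (min_gap, max_gap) in days as fractions of the retention interval.

    The 0.10 and 0.20 defaults encode the article's rule of thumb only.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return retention_days * low, retention_days * high

for label, days in [("1 week", 7), ("1 month", 30), ("3 months", 90)]:
    shortest, longest = suggested_gap_days(days)
    print(f"{label}: review after roughly {shortest:.1f} to {longest:.1f} days")
```

Treat the output as a starting range rather than a prescription. The fractions are exposed as parameters so they can be lowered for very long retention goals.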
One assumption worth challenging: expanding spacing schedules — where intervals gradually increase — are not meaningfully better than uniform spacing. Latimier et al. (2021) found a non-significant g = 0.034 difference between the two approaches. Many flashcard apps are built on the assumption that growing intervals are essential. The data says otherwise. What matters is the absolute amount of spacing, not the pattern.
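To make "same absolute spacing, different pattern" concrete, here is a small sketch that builds three hypothetical schedules whose gaps all add up to the same total. The gap values are invented for illustration and are not the intervals used in any cited study.

```python
from itertools import accumulate

def review_days(gaps):
    """Convert gaps (in days) into cumulative review days after first study."""
    return list(accumulate(gaps))

# Three hypothetical schedules with identical total spacing (24 days)
schedules = {
    "expanding": [4, 8, 12],
    "uniform": [8, 8, 8],
    "contracting": [12, 8, 4],
}

for name, gaps in schedules.items():
    print(f"{name}: review on days {review_days(gaps)} (total spacing {sum(gaps)})")
```

All three end on the same day with the same total spacing, and only the order of the gaps differs. That order is the variable the comparison above suggests matters least.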
On scheduling algorithms:
- Pimsleur (1967): The first graduated interval recall system (5s → 25s → 2min → ... → 2yr), Modern Language Journal (51(2))
- Leitner (1972): The classic physical box system for flashcards
- SM-2 (Woźniak, 1990): Achieved 89.3% retention memorizing 10,255 items at 41 minutes/day — still underlies Anki today
- FSRS (Ye, Su, and Cao, 2022): Trained on 220 million memory behavior logs, achieves 20–30% fewer reviews than SM-2 for equivalent retention; integrated into Anki since v23.10 (KDD '22)
- MEMORIZE (Tabibian et al., 2019): Derived from stochastic optimal control, outperformed both Leitner and SuperMemo using Duolingo data from millions of learners (PNAS, 116(10))
The direction is clear: adaptive, personalized algorithms built on actual user data beat fixed heuristics.
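For a sense of what a fixed heuristic looks like in practice, here is a simplified sketch in the spirit of SM-2 as it is commonly described: a 0–5 recall rating, first intervals of 1 and 6 days, and an ease factor floored at 1.3. The `Card` class and `sm2_review` function are names made up for this example; Anki's scheduler and other real implementations differ in details, so read this as an illustration rather than a reference implementation.

```python
from dataclasses import dataclass

@dataclass
class Card:
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # gap until the next review
    ease: float = 2.5       # ease factor; 2.5 is the usual starting value

def sm2_review(card: Card, quality: int) -> Card:
    """Apply one SM-2 style update; quality is a 0-5 self-rating of recall."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Failed recall: restart the sequence and leave the ease unchanged
        return Card(repetitions=0, interval_days=1, ease=card.ease)
    repetitions = card.repetitions + 1
    if repetitions == 1:
        interval = 1
    elif repetitions == 2:
        interval = 6
    else:
        # Later intervals grow by the ease factor held before this review
        interval = round(card.interval_days * card.ease)
    # Ease rises after easy recalls, falls after hard ones, never below 1.3
    ease = card.ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return Card(repetitions, interval, max(1.3, ease))

card = Card()
for rating in [5, 5, 5, 2]:
    card = sm2_review(card, rating)
    print(rating, card)
```

Every constant here is fixed in advance and applies to every learner and every card. Adaptive approaches such as FSRS and MEMORIZE instead fit their parameters to review data.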
When does spaced repetition fail?
1. Task complexity. Donovan and Radosevich (1999) found a significant negative correlation (r = −0.25) between task complexity and effect size. Language learning: g = 1.15. Math meta-analyses: g ≈ 0.28–0.46 at best, and null results appear. For simple paired-associate learning the effect is massive; for complex problem-solving, it shrinks or disappears.
2. Immediate tests. Rawson and Kintsch (2005) showed massed practice outperforms spaced practice on same-day tests (Journal of Educational Psychology, 97(1)). Roediger and Karpicke (2006) found the same crossover for retrieval practice: repeated restudying won at 5 minutes, but testing produced ~50% better recall at 1 week (Psychological Science, 17(3)). If your exam is tomorrow, cramming might actually be the right call.
3. Over-spacing. Verkoeijen, Rikers, and Özsoy (2008) showed a 3.5-week gap for text learning performed no better than massing when tested just 2 days later. A 4-day gap outperformed massing (Applied Cognitive Psychology, 22(5)). The inverted-U relationship from Cepeda et al. (2008) is real — intervals need to be calibrated to what you're actually trying to retain.
4. Methodological noise in some older studies. Delaney, Verkoeijen, and Spirgel (2010) identified several "impostor" effects in classic spacing research — rehearsal borrowing, strategy changes mid-session, recency effects, and item skipping — that inflate apparent spacing effects in some studies (Psychology of Learning and Motivation, Vol. 53). The core finding holds, but the literature has more noise than typical coverage admits.
5. Compliance. Dempster (1988) called the spacing effect "a case study in the failure to apply the results of psychological research" (American Psychologist, 43, 627–634). That critique still holds in places. Barzagar Nazari and Ebersbach (2018) found significantly fewer students completed distributed exercises compared to massed ones in self-regulated learning (Frontiers in Psychology, 9). Seibert Hanson and Brown (2020) described Anki as an "effective but bitter pill" — it works, but students didn't enjoy it and compliance was low. Any app that ignores friction is ignoring a major variable.
Bottom line
Spaced repetition is one of the most replicated findings in psychology. The effect sizes — d = 0.46 to g = 1.15 depending on domain — are large enough to matter in real life. Combined with active recall (retrieval practice), effects get bigger still. Long-term retention across months and years is where it pays off most: the Bahrick studies show meaningful effects persisting close to a decade.
What it's not: a universal solution. It works best for factual and vocabulary learning. It's moderately useful in applied STEM. It's inconsistent for complex conceptual understanding. The optimal gap scales with your retention goal (roughly 10–20% of the target interval), and expanding intervals don't outperform fixed ones despite being in almost every commercial SRS tool.
The science on the core phenomenon is settled. The real open questions are: how do you personalize scheduling to individual learners, how do you build compliance into the product, and how do you translate decades of lab findings into consistent real-world use.
Written by
Founder & developer of Memor More. I build iOS and Mac apps and write about the science of memory and learning. @Jerelii on X
