Scoping Review | TIMG 5104 Directed Studies

Theories & Mechanisms for AI-Powered ESL Speaking System Design

A systematic scoping review mapping the evidence-based foundations for designing AI-powered speaking practice systems for adult English as a Second Language learners.

Joe (Beiqiao) Hu Researcher & Designer
December 1, 2025
RESEARCH FRAMEWORK
📚
6 Learning Theories
ACT-R, Usage-Based, SCT, Interactionist...
⚙️
10 Mechanisms
Practice scheduling, feedback, scaffolding...
🎯
4 Design Applications
Scheduler, Feedback, BOPPPS, Tracker
KEY EVIDENCE
g=0.86
Explicit Feedback
+20%
Articulation Rate
d=1.40
Vocabulary Retention
4×
vs. CALL Baseline
17
Empirical Studies
6
Learning Theories
10
Mechanisms
4
Design Applications

This scoping review systematically maps the theories and mechanisms that support the design of AI-powered speaking practice systems for adult English as a Second Language (ESL) learners. Following PRISMA-ScR guidelines, the review synthesized 17 empirical studies published between 2015 and 2025, selected through a mechanism-focused purposive sampling strategy employing an AI-human hybrid triangulation protocol.

Key Outputs: The review identifies 6 foundational learning theories, isolates 10 instructional mechanisms, and proposes 4 evidence-based design applications with concrete KPIs, together forming a theory-driven blueprint for business-ready AI-ESL speaking systems.

The synthesis identifies six foundational learning theories—including Skill Acquisition Theory, the Noticing Hypothesis, and Transfer-Appropriate Processing—that justify specific design choices. Key findings indicate that explicit, multi-modal feedback (ASR for pronunciation, pending prompts for grammar, LLM dialogue for discourse) significantly outperforms single-mode correction.

Furthermore, embedding AI tools within structured pedagogical frameworks (e.g., BOPPPS) amplifies their effectiveness by fostering metacognition. The report proposes a "Theory-Mechanism-Design" (TMD) logic for feature validation to guide the development of business-ready AI-ESL solutions.

Purpose, Methods & Key Outputs

Purpose & Scope

Synthesize 17 empirical studies (2015-2025) to map theories and mechanisms driving AI-powered speaking practice systems. Focus on why interventions work and how to translate them into product-ready features.

Methods

PRISMA-ScR guided purposive sampling across 6 databases (Scopus, Web of Science, EBSCO, JSTOR, ERIC, PsycInfo). AI-human hybrid screening processed 2,877 records, retaining 17 mechanism-rich studies.

Design Target

An AI-enabled speaking coach for real-world, impromptu fluency—where learners rehearse spontaneous speech, receive multi-layered feedback, and build long-term autonomy.

Top 5 Instructional Mechanisms

  • 1
    Phased Practice
    Start blocked, then interleave tasks as accuracy stabilizes
  • 2
    Adaptive Spacing
    Tune inter-session intervals based on performance
  • 3
    Explicit ASR Feedback
    Immediate, segmental feedback with replay/compare
  • 4
    Pending/Elicited Feedback
    Delay corrections to prompt self-repair
  • 5
    Structured Orchestration
    BOPPPS: clear objectives, guided practice, reflection

Top 5 Design Principles

  • 1
    Architect the Practice Curve
    Progression engine: blocked → interleaved → spontaneous
  • 2
    Prioritize Metacognition
    Prompts, hints, and reflection make learners problem-solvers
  • 3
    Visualize Invisible Gains
    Dashboards for construction growth and signature phrases
  • 4
    Lower Anxiety Before Load
    Safe practice space with confidence nudges
  • 5
    Hybridize Feedback
    ASR (micro) + Grammar prompts (meso) + LLM (macro)

Introduction: The ESL Challenge

This review synthesizes theories and mechanisms to guide the design of AI-powered systems that support adult ESL learners in developing speaking proficiency.

Lack of Practice

Insufficient time for meaningful, interactive speaking practice in typical classroom settings. Limited in-class time and large class sizes make individualized practice impossible.

Mingyan et al., 2025

Speaking Anxiety

Fear of negative evaluation and linguistic insecurity leads to reluctance to speak and hinders skill development. Foreign language anxiety significantly impacts willingness to communicate.

Zheng et al., 2025

Inadequate Feedback

Teachers struggle to provide consistent, individualized, and immediate feedback required for effective learning due to time constraints.

Ngo et al., 2024; Sun, 2023

Research Questions

Structured using the Population-Concept-Context (PCC) framework

Population
Adult ESL Learners
Concept
Learning Theories
Context
AI-Powered Practice
RQ1
What learning theories (derived from empirical studies) support the design of AI-powered speaking practice systems for adult ESL learners?
RQ2
What instructional mechanisms (evidenced in empirical research) effectively improve speaking outcomes when implemented in AI-powered environments?

Methods: PRISMA-ScR Approach

Scoping review with purposive, mechanism-rich sampling guided by PRISMA-ScR logic across six databases.

Scopus
Web of Science
EBSCO
JSTOR
ERIC
PsycInfo
Identification: 2,877 records identified from databases (57 distinct searches)
Screening: 1,426 records after deduplication; 1,298 excluded as not relevant to scope
Eligibility: 128 records at title/abstract screening; 43 full texts reviewed; 26 excluded for insufficient mechanism detail
Included: 17 studies in final review (47% published 2024-2025)
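The counts in the flow above can be cross-checked with simple arithmetic. The following is a minimal sketch: the variable names are chosen here for illustration, and the two derived values (duplicates removed, records excluded at title/abstract screening) are not stated in the review but follow from the reported numbers.

```python
# Reported PRISMA-ScR flow counts from the review (variable names are illustrative).
identified = 2877
after_dedup = 1426
not_relevant = 1298
title_abstract_screened = 128
full_text_reviewed = 43
insufficient_mechanism_detail = 26
included = 17

# Derived values: not stated in the source, computed from the reported counts.
duplicates_removed = identified - after_dedup
excluded_at_title_abstract = title_abstract_screened - full_text_reviewed

# Consistency checks between adjacent stages of the flow.
assert after_dedup - not_relevant == title_abstract_screened
assert full_text_reviewed - insufficient_mechanism_detail == included

print(f"Duplicates removed: {duplicates_removed}")
print(f"Excluded at title/abstract screening: {excluded_at_title_abstract}")
print(f"Share of deduplicated records included: {included / after_dedup:.1%}")
```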

Three-Layer AI-Human Triangulation Protocol

The AI-assisted review process replaced a second human screener with multi-model LLM consensus plus targeted human adjudication.

LAYER 1
Multi-Model Screening
GPT-5, Grok-4, Gemini 2.5 Pro
LAYER 2
Cross-Validation
Model consensus comparison
LAYER 3
Human Adjudication
Final verification & synthesis
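One way the three layers could fit together is sketched below. The review does not specify the consensus rule, so this example assumes that unanimous model votes are accepted and any disagreement is routed to the human reviewer. The function name, the decision labels, and the model keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ScreeningDecision:
    record_id: str
    decision: str            # "include", "exclude", or "needs_human"
    votes: Dict[str, bool]   # Layer 1 output: model name -> include?


def triage(record_id: str, votes: Dict[str, bool]) -> ScreeningDecision:
    """Layer 2: compare model votes. Assumed rule: unanimous votes pass, splits escalate."""
    distinct = set(votes.values())
    if distinct == {True}:
        decision = "include"
    elif distinct == {False}:
        decision = "exclude"
    else:
        # Layer 3: disagreement (or no votes at all) goes to human adjudication.
        decision = "needs_human"
    return ScreeningDecision(record_id, decision, votes)


# Usage with made-up votes for a made-up record.
example = triage("rec-001", {"model_a": True, "model_b": True, "model_c": False})
print(example.decision)
```

A stricter or looser rule (for example, majority voting) would change only the comparison inside `triage`; the human layer stays the final check either way.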

Evidence Summary: 17 Studies

Primary evidence from included studies spanning practice scheduling, ASR feedback, LLM dialogue, and structured frameworks.

INSIGHT 1

Explicit Feedback Wins

Explicit cues or elicited self-repairs outperformed generic transcripts or immediate supply of answers.

Ngo et al., 2024; Sun, 2023; Zargaran, 2025
INSIGHT 2

Timing Matters

Dense scheduling accelerates procedural speech; interleaving supports transfer. Some metrics like silent pauses may not follow the same pattern.

Li & DeKeyser, 2019; Suzuki, 2021; Zhang et al., 2023
INSIGHT 3

Social Mediation Amplifies

Peer collaboration or LLM partners added motivational and noticing advantages beyond solo AI loops.

Evers & Chen, 2021; Zheng et al., 2025
INSIGHT 4

Structure Raises Quality

Lesson frameworks (BOPPPS, interleaved routines) improved discourse management, confidence, and adherence.

Lai, 2025; Yang et al., 2025

Geographic Distribution of Studies

10 of the 17 studies (about 59%) focused on East Asian university learners, highlighting a key limitation in generalizability.

East Asia (China, Japan, Taiwan): 10
Middle East (Saudi Arabia, Iran): 3
North America (USA): 2
Europe (Spain): 1
Meta-analysis (mixed regions): 1

Theoretical Foundations: 6 Learning Theories

"Theory" refers to a principled explanation of how knowledge is acquired, retained, and transferred in second-language speaking.

HIGH PRIORITY
MEDIUM PRIORITY
EMERGING
THEORY 1 | HIGH PRIORITY

Skill Acquisition Theory

Learning progresses through three stages: declarative (understanding rules) → procedural (practicing) → automatic (without conscious thought).

DESIGN IMPLICATION

Schedule high-volume, scaffolded practice with clear progression from accuracy to automaticity.

Li & DeKeyser, 2019; Suzuki, 2017, 2021; Suzuki et al., 2022
THEORY 2 | HIGH PRIORITY

The Noticing Hypothesis

Learners must consciously notice a linguistic feature for it to become part of their acquisition process. Feedback draws attention to gaps.

DESIGN IMPLICATION

Surface precise pronunciation, grammar, and discourse gaps with prompts that elicit self-repair before showing models.

Tejedor-Garcia et al., 2020; Ngo et al., 2024; Zargaran, 2025
THEORY 3 | HIGH PRIORITY

Transfer-Appropriate Processing

Practice is most effective when it closely resembles the cognitive processes required by the target task.

DESIGN IMPLICATION

Simulate performance contexts (timed, impromptu, varied topics) and mix task types once stability is reached.

Suzuki, 2021; Zhang et al., 2023; Yang et al., 2025
THEORY 4 | MEDIUM PRIORITY

Usage-Based / Constructionist

Language is acquired by entrenching "constructions" (form-meaning pairings) through repeated use. Frequency strengthens memory traces.

DESIGN IMPLICATION

Track and recycle signature phrases and sentence frames with user-facing dashboards.

Suzuki, Eguchi & de Jong, 2022
THEORY 5 | MEDIUM PRIORITY

Sociocultural / Affective

Learning occurs within the Zone of Proximal Development with a "More Knowledgeable Other." Safe interaction lowers anxiety and raises willingness to communicate.

DESIGN IMPLICATION

Non-judgmental AI partners, clear goals, and supportive nudges before raising difficulty.

Zheng et al., 2025; Lai, 2025; Sun, 2023; Evers & Chen, 2021
THEORY 6 | EMERGING

Retrieval Practice & Desirable Difficulties

Effortful, spaced retrieval stabilizes memory. Optimal Inter-Session Interval depends on Retention Interval and task complexity.

DESIGN IMPLICATION

Manipulate practice schedules (spacing, interleaving) to create desirable difficulties, optimizing retention.

Li & DeKeyser, 2019; Suzuki, 2017; Suzuki & Hanzawa, 2022

Instructional Mechanisms: 10 Evidence-Based Strategies

These mechanisms represent the HOW—the specific instructional strategies and technologies that operationalize the theories.

CATEGORY 1

Practice & Scheduling

1 Adaptive Spacing

Tune ISI to retention goals; shorter ISI favors procedural performance.

2 Task Repetition Schedules

Blocked (AAA) for fluency, interleaved (ABC) for transfer.

3 Construction Reuse

Recycle constructions across prompts with gradual variation.
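Adaptive spacing (mechanism 1 above) could be sketched as a rule that widens or narrows the inter-session interval (ISI) based on the learner's latest accuracy. The 1-4 day bounds follow the range this review reports for procedural skills. The accuracy thresholds and the multipliers are assumptions made for illustration, not values from the included studies.

```python
def next_isi_days(current_isi: float, accuracy: float,
                  min_isi: float = 1.0, max_isi: float = 4.0,
                  expand_at: float = 0.85, shrink_at: float = 0.60) -> float:
    """Return the next inter-session interval in days.

    expand_at / shrink_at and the 1.5x / 0.5x factors are hypothetical tuning values.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if accuracy >= expand_at:
        proposed = current_isi * 1.5   # performing well: space sessions further apart
    elif accuracy < shrink_at:
        proposed = current_isi * 0.5   # struggling: bring the next session closer
    else:
        proposed = current_isi         # in between: keep the current interval
    # Clamp to the allowed ISI range.
    return max(min_isi, min(max_isi, proposed))


print(next_isi_days(2.0, 0.90))
print(next_isi_days(2.0, 0.50))
```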

CATEGORY 2

Feedback Mechanisms

4 Explicit ASR Feedback

Immediate phoneme/stress highlights with clear exemplars. g=0.86 for explicit.

5 Pending/Elicited Feedback

Withhold answers to prompt self-repair, deepening processing.
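Pending/elicited feedback can be illustrated as an escalation ladder: the system first signals that something is wrong, then points to where, and only then supplies a model answer. The sketch below is one possible reading of that idea. The number of self-repair attempts, the message wording, and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    level: str     # "prompt", "hint", or "model"
    message: str


def elicited_feedback(failed_attempts: int, error_location: str, model_answer: str,
                      max_self_repair_attempts: int = 2) -> Feedback:
    """Withhold the answer until the learner has had a chance to self-repair."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts == 1:
        # First failure: signal an error without locating it.
        return Feedback("prompt", "Something in that sentence needs fixing. Can you try again?")
    if failed_attempts <= max_self_repair_attempts:
        # Later failures within the budget: point to where the error is.
        return Feedback("hint", f"Look again at: {error_location}")
    # Self-repair budget used up: show a model answer.
    return Feedback("model", f"One correct version: {model_answer}")


# Usage with a made-up learner error.
for attempt in (1, 2, 3):
    print(elicited_feedback(attempt, "the verb after 'she'", "She goes to work by bus.").level)
```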

CATEGORY 3

Scaffolding & Support

6 Partner Scaffolding

ASR+Peer or LLM partner

7 Structured Orchestration

BOPPPS framework

8 Mobile Micro-Practice

Short, frequent sessions

9 Graduated Difficulty

Mastery gates + fallbacks

10 Confidence Nudges

Self-assessments & reflection
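The "mastery gates + fallbacks" idea behind graduated difficulty (mechanism 9 above) might look like the following sketch. The gate and fallback thresholds, the five-level scale, and the function name are hypothetical; the review does not prescribe specific values.

```python
from typing import Sequence


def next_level(level: int, recent_scores: Sequence[float],
               gate: float = 0.8, fallback: float = 0.5,
               min_level: int = 1, max_level: int = 5) -> int:
    """Move up when the mastery gate is passed, fall back when scores drop too low."""
    if not recent_scores:
        return level                      # no evidence yet: stay put
    mean_score = sum(recent_scores) / len(recent_scores)
    if mean_score >= gate:
        return min(level + 1, max_level)  # mastery gate passed
    if mean_score < fallback:
        return max(level - 1, min_level)  # fallback to an easier level
    return level


print(next_level(2, [0.9, 0.85]))
print(next_level(2, [0.3, 0.4]))
```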

Practice Schedule Progression: Blocked → Interleaved → Spaced

PHASE 1
Blocked Practice
AAA / BBB / CCC
+15-20% articulation rate
Reduced mid-clause pauses
Limited transfer
PHASE 2
Interleaved Practice
ABC / ABC / ABC
Enhanced transfer to novel tasks
Flexible retrieval patterns
Higher cognitive effort
PHASE 3
Spaced Repetition
Optimized ISI based on RI
Optimal long-term retention
1-4 day ISI for procedural
Prevents forgetting & interference
Evidence: Suzuki (2021); Li & DeKeyser (2019); Zhang et al. (2023); Suzuki & Hanzawa (2022)
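The difference between the blocked (AAA / BBB / CCC) and interleaved (ABC / ABC / ABC) orderings above, and a possible rule for moving between phases, can be sketched as follows. The phase-selection thresholds are assumptions made for illustration; the review only states that interleaving should begin once accuracy stabilizes.

```python
from typing import List


def blocked(tasks: List[str], reps: int) -> List[str]:
    """Repeat each task in a row before moving on: AAA BBB CCC."""
    return [task for task in tasks for _ in range(reps)]


def interleaved(tasks: List[str], reps: int) -> List[str]:
    """Cycle through all tasks on each round: ABC ABC ABC."""
    return [task for _ in range(reps) for task in tasks]


def choose_phase(accuracy: float, stable_sessions: int,
                 stable_accuracy: float = 0.8, min_stable_sessions: int = 2) -> str:
    """Pick a practice phase. Both thresholds are hypothetical tuning values."""
    if accuracy < stable_accuracy:
        return "blocked"
    if stable_sessions < min_stable_sessions:
        return "interleaved"
    return "spaced"


print("".join(blocked(["A", "B", "C"], 3)))
print("".join(interleaved(["A", "B", "C"], 3)))
```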

Theory-Mechanism Relationships

Mapping theoretical foundations to instructional mechanisms and design outcomes

Theory → Mechanism Connection Matrix

| Mechanism | ACT-R | Usage-Based | SCT | Interactionist | Self-Efficacy | Desirable Diff. |
|---|---|---|---|---|---|---|
| Blocked→Interleaved→Spaced | ●●● | ○ | ○ | ○ | ○ | ●●● |
| High-Variability Input | ○ | ●●● | ○ | ○ | ○ | ●● |
| Construction Recycling | ○ | ●●● | ○ | ○ | ○ | ○ |
| Explicit ASR Feedback | ●● | ○ | ○ | ●●● | ○ | ○ |
| Pending/Elicited Feedback | ○ | ○ | ○ | ●● | ○ | ●●● |
| Partner Scaffolding | ○ | ○ | ●●● | ●● | ●● | ○ |
| Structured Orchestration | ○ | ○ | ●●● | ○ | ○ | ○ |
| Mobile Micro-Practice | ●●● | ○ | ○ | ○ | ○ | ●● |
| Graduated Difficulty | ●● | ○ | ●●● | ○ | ●● | ○ |
| Confidence Nudges | ○ | ○ | ○ | ○ | ●●● | ○ |

Legend: Primary (●●●), Secondary (●●), Not Connected (○)
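For feature validation, the matrix above can be held as a small lookup structure. The sketch below transcribes the connected cells (primary = 3, secondary = 2) and leaves unconnected pairs out. The dictionary layout and the helper function are illustrative choices, not part of the review.

```python
from typing import Dict, List

PRIMARY, SECONDARY = 3, 2  # number of dots in the matrix; missing pairs are "not connected"

# Transcribed from the connection matrix; only connected cells are stored.
MATRIX: Dict[str, Dict[str, int]] = {
    "Blocked→Interleaved→Spaced": {"ACT-R": PRIMARY, "Desirable Diff.": PRIMARY},
    "High-Variability Input": {"Usage-Based": PRIMARY, "Desirable Diff.": SECONDARY},
    "Construction Recycling": {"Usage-Based": PRIMARY},
    "Explicit ASR Feedback": {"ACT-R": SECONDARY, "Interactionist": PRIMARY},
    "Pending/Elicited Feedback": {"Interactionist": SECONDARY, "Desirable Diff.": PRIMARY},
    "Partner Scaffolding": {"SCT": PRIMARY, "Interactionist": SECONDARY,
                            "Self-Efficacy": SECONDARY},
    "Structured Orchestration": {"SCT": PRIMARY},
    "Mobile Micro-Practice": {"ACT-R": PRIMARY, "Desirable Diff.": SECONDARY},
    "Graduated Difficulty": {"ACT-R": SECONDARY, "SCT": PRIMARY, "Self-Efficacy": SECONDARY},
    "Confidence Nudges": {"Self-Efficacy": PRIMARY},
}


def mechanisms_for(theory: str, min_strength: int = PRIMARY) -> List[str]:
    """List mechanisms linked to a theory at or above the given strength."""
    return [name for name, links in MATRIX.items()
            if links.get(theory, 0) >= min_strength]


print(mechanisms_for("SCT"))
print(mechanisms_for("ACT-R", min_strength=SECONDARY))
```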

Theory → Mechanism → Design → Outcome Flow

LAYER 1
Theories
ACT-R / Skill Acquisition
Usage-Based Theory
Sociocultural Theory
Interactionist Hypothesis
Self-Efficacy Theory
Desirable Difficulties
LAYER 2
Mechanisms
Practice Scheduling
Input Variability
Feedback Types
Scaffolding Support
Difficulty Progression
LAYER 3
Design Features
ASR Engine Integration
LLM Partner Agent
Tri-Modal Feedback
BOPPPS Structure
Progress Dashboard
LAYER 4
Outcomes
Pronunciation Accuracy
Speech Fluency
Speaking Confidence
Long-term Retention

High-Synergy Integration

ACT-R + Desirable Difficulties

Blocked→Interleaved→Spaced practice sequence optimizes both proceduralization and long-term retention. Mobile micro-practice enables distributed practice.

Complementary Alignment

SCT + Interactionist

Partner scaffolding combines ZPD-appropriate support with negotiated meaning. LLM agents can provide both scaffolding and corrective feedback.

Emerging Connection

Self-Efficacy + Graduated Difficulty

Mastery experiences build confidence when combined with appropriate challenge levels. Confidence nudges reinforce perceived capability.

Key Findings

Four major findings synthesized from 17 empirical studies

01
FINDING ONE

Practice Scheduling Matters

The Blocked → Interleaved → Spaced progression optimizes both initial skill acquisition and long-term retention of speaking skills.

Evidence
g = 0.52-0.86 fluency gains; 15-20% articulation rate improvement
02
FINDING TWO

Explicit Feedback Dominates

Explicit ASR feedback with phoneme-level highlights produces stronger effect sizes than implicit recasts.

Evidence
g = 0.86 explicit vs. g = 0.55 implicit; 4× larger than general CALL
03
FINDING THREE

Scaffolding Enables Autonomy

AI partners operating within learner ZPD improve both linguistic outcomes and speaking confidence.

Evidence
r = 0.36-0.50 confidence-performance correlation; ASR+Peer outperforms solo
04
FINDING FOUR

Construction Grammar Works

High-variability input and construction recycling build robust phonological and syntactic categories for transfer.

Evidence
Enhanced generalization to novel contexts; d = 1.40+ vocabulary retention
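The findings above report effect sizes as Hedges' g and Cohen's d without defining them. As a reading aid, the sketch below computes both using the standard textbook formulas: d is the mean difference divided by the pooled standard deviation, and g applies a small-sample correction to d. The input numbers in the usage line are made up for illustration and do not come from any included study.

```python
import math


def cohens_d(mean_t: float, mean_c: float, sd_t: float, sd_c: float,
             n_t: int, n_c: int) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    df = n_t + n_c - 2
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    return (mean_t - mean_c) / pooled_sd


def hedges_g(mean_t: float, mean_c: float, sd_t: float, sd_c: float,
             n_t: int, n_c: int) -> float:
    """Cohen's d multiplied by the usual small-sample correction 1 - 3 / (4 * df - 1)."""
    df = n_t + n_c - 2
    correction = 1 - 3 / (4 * df - 1)
    return cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c) * correction


# Made-up example: treatment mean 75, control mean 70, both SD 5, 20 learners per group.
print(round(hedges_g(75, 70, 5, 5, 20, 20), 3))
```

With small groups, g comes out slightly smaller than d, which is why meta-analyses of classroom-sized studies tend to report g.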

Design Applications

Four evidence-based recommendations for AI-powered ESL speaking system design

01

Adaptive Practice Scheduler

Implement blocked → interleaved → spaced progression based on learner mastery metrics.

TARGET KPIs
+15-20% articulation rate
g = 0.52-0.86 fluency
02

Tri-Modal Feedback Engine

Combine ASR scores, waveform visualizations, and LLM-generated corrective prompts.

TARGET KPIs
g = 0.86 explicit feedback
4× CALL effect size
03

BOPPPS Micro-Lesson Engine

Structure each session with Bridge, Objective, Pre-test, Participatory, Post-test, Summary phases.

TARGET KPIs
Structured scaffolding
ZPD-aligned progression
04

Construction Tracker

Log constructions used across prompts and visualize frequency/accuracy over time.

TARGET KPIs
d = 1.40+ retention
Transfer to novel contexts

Tri-Modal Feedback Loop

Learner: Speaking attempt
ASR Score: Phoneme accuracy, stress patterns, fluency metrics
Wave Visual: Pitch contours, timing display, gap analysis
LLM Corrective Prompt: "Try emphasizing the second syllable: re-SEARCH"
↻ Iterative Refinement Cycle
Figure 3: Tri-Modal Feedback integrates numeric scores, visual representations, and natural language guidance
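The three layers in the figure could be bundled into a single feedback object, as in the sketch below. In a real system the score would come from an ASR engine and the prompt from an LLM. Here both are stand-ins: the score is passed in directly and the prompt is built from a template. All class names, field names, and the 0.8 accuracy threshold are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AsrResult:
    phoneme_accuracy: float     # 0.0-1.0
    stress_correct: bool
    articulation_rate: float    # syllables per second


@dataclass
class TriModalFeedback:
    score: AsrResult            # numeric layer
    pitch_contour: List[float]  # visual layer: data a UI would plot
    prompt: str                 # natural-language layer


def build_feedback(asr: AsrResult, pitch_contour: List[float],
                   word: str, stressed_syllable: str) -> TriModalFeedback:
    """Combine the three layers; the template prompt stands in for an LLM call."""
    if not asr.stress_correct:
        prompt = f"Try emphasizing the {stressed_syllable} syllable of '{word}'."
    elif asr.phoneme_accuracy < 0.8:   # assumed threshold
        prompt = f"Listen to the model of '{word}', compare the sounds, then try again."
    else:
        prompt = f"'{word}' sounded clear. Try it in a full sentence."
    return TriModalFeedback(asr, pitch_contour, prompt)


# Usage with made-up values.
feedback = build_feedback(AsrResult(0.9, False, 3.8), [110.0, 125.0, 118.0],
                          "research", "second")
print(feedback.prompt)
```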

BOPPPS Micro-Lesson Framework

B (Bridge): Hook attention with relevant context
O (Objective): Clear learning goals stated
P (Pre-test): Assess prior knowledge
P (Participatory): Active practice with feedback
P (Post-test): Verify learning gains
S (Summary): Consolidate & preview next
Figure 4: Each micro-lesson follows the BOPPPS structure for scaffolded instruction
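A micro-lesson engine would need to step through these six phases in a fixed order. The following is a minimal sketch of that ordering as an enumeration with a helper that advances to the next phase. The class and function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    # Members are declared in lesson order; values are the phase descriptions.
    BRIDGE = "Hook attention with relevant context"
    OBJECTIVE = "State clear learning goals"
    PRE_TEST = "Assess prior knowledge"
    PARTICIPATORY = "Active practice with feedback"
    POST_TEST = "Verify learning gains"
    SUMMARY = "Consolidate and preview next lesson"


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the summary."""
    order = list(Phase)                 # Enum iteration preserves declaration order
    index = order.index(current)
    return order[index + 1] if index + 1 < len(order) else None


# Walk through one full micro-lesson.
phase: Optional[Phase] = Phase.BRIDGE
while phase is not None:
    print(f"{phase.name}: {phase.value}")
    phase = next_phase(phase)
```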

Construction Tracking Dashboard

CONSTRUCTION FREQUENCY
"I would like to..." 47×
"Could you please..." 32×
"What do you think..." 18×
ACCURACY TREND
Week 1 to Week 7 (trend chart)
MASTERY STATUS
Requests ✓ | Opinions ✓ | Questions ~ | Conditionals | Subjunctive
● Mastered   ● Progressing   ● Learning   ○ Not Started
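The data behind a dashboard like this could be as simple as a per-construction log of correct and incorrect uses. The sketch below shows one way to derive frequency, accuracy, and a mastery label from such a log. The status names follow the legend above. The thresholds and the class design are assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, List


class ConstructionTracker:
    """Logs each use of a construction and whether it was produced correctly."""

    def __init__(self) -> None:
        self._outcomes: DefaultDict[str, List[bool]] = defaultdict(list)

    def log(self, construction: str, correct: bool) -> None:
        self._outcomes[construction].append(correct)

    def frequency(self, construction: str) -> int:
        return len(self._outcomes.get(construction, []))

    def accuracy(self, construction: str) -> float:
        outcomes = self._outcomes.get(construction, [])
        return sum(outcomes) / len(outcomes) if outcomes else 0.0

    def status(self, construction: str, mastered_at: float = 0.9,
               progressing_at: float = 0.6, min_uses: int = 5) -> str:
        """Map frequency and accuracy to a legend label (thresholds are hypothetical)."""
        uses = self.frequency(construction)
        if uses == 0:
            return "Not Started"
        accuracy = self.accuracy(construction)
        if uses >= min_uses and accuracy >= mastered_at:
            return "Mastered"
        if accuracy >= progressing_at:
            return "Progressing"
        return "Learning"


# Usage with made-up practice data.
tracker = ConstructionTracker()
for _ in range(5):
    tracker.log("I would like to...", correct=True)
tracker.log("Could you please...", correct=False)
print(tracker.status("I would like to..."), tracker.status("Could you please..."))
```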

Discussion

Implications, limitations, and future directions

Evidence Summary

17
Empirical Studies
6
Foundational Theories
10
Instructional Mechanisms
4
Design Applications

Limitations

Geographic Bias

Majority of studies from East Asia (Japan, China, Taiwan). Findings may not generalize to all L1 backgrounds.

Duration Constraints

Most interventions 4-12 weeks. Long-term retention data beyond 1 year remains limited.

Technology Gap

No empirical studies yet with GPT-4 class LLMs for speaking practice. Theoretical extrapolation required.

Future Research Directions

PRIORITY 1

LLM Partner Empirics

RCTs comparing human tutors vs. LLM conversation partners for speaking outcomes.

PRIORITY 2

Adaptive Algorithms

Optimal ISI scheduling algorithms for speaking skill proceduralization.

PRIORITY 3

Cross-Cultural Validation

Replication studies across diverse L1 backgrounds and cultural contexts.

⚠️ Ethical Considerations

Data Privacy: Voice recordings require informed consent and secure storage.
AI Transparency: Learners should know when interacting with AI vs. humans.
Bias Mitigation: ASR systems may disadvantage certain accents; continuous monitoring needed.
Dependency Risk: Systems should build learner autonomy, not reliance.

Conclusion

This scoping review synthesizes six foundational theories and ten instructional mechanisms into an actionable framework for AI-powered ESL speaking system design. The evidence supports a Blocked → Interleaved → Spaced practice progression, explicit tri-modal feedback, and scaffolded partner interactions as high-priority design features.

Key Takeaway

Theory-grounded design transforms AI-ESL systems from simple drill tools into sophisticated learning environments that proceduralize speaking skills, build robust phonological categories, and foster learner confidence.

As LLM capabilities advance, the theoretical foundations and empirical mechanisms identified here provide a principled roadmap for developing next-generation speaking practice systems that genuinely support second language acquisition.

Key References

Selected citations from the 17 empirical studies reviewed

Anderson, J. R. (2015). Cognitive Psychology and Its Implications (8th ed.). Worth Publishers.

Bandura, A. (1997). Self-Efficacy: The Exercise of Control. W.H. Freeman.

Bjork, R. A., & Bjork, E. L. (2020). Desirable difficulties in theory and practice. Journal of Applied Research in Memory and Cognition, 9(4), 475-479.

Golonka, E. M., et al. (2014). Technologies for foreign language learning: A review. Computer Assisted Language Learning, 27(1), 70-105.

Li, S., & DeKeyser, R. M. (2019). Implicit and explicit instruction. In J. W. Schwieter & A. Benati (Eds.), The Cambridge Handbook of Language Learning. Cambridge University Press.

Suzuki, Y. (2021). Optimizing fluency training for speaking skills transfer. Studies in Second Language Acquisition, 43(5), 1037-1061.

Tomasello, M. (2003). Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press.

Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.

For the complete reference list, please see the full PDF document.