
DSC Capstone · ML Theory + NLP

TeleSpeak

Modeling Early Language Acquisition with Reinforcement Learning for Large Language Models

Banff Jiang · Colin Wang · Longhao Lin · Alex Warstadt

One-Screen Pitch

TeleSpeak tests whether the telegraphic speech stage of child language acquisition, in which children produce content-word utterances like "want cookie" instead of the full "I want a cookie", can be learned through a structured training pipeline. We pretrain GPT-2 on the BabyLM corpus to establish baseline language competence, supervised fine-tune (SFT) it on TF-IDF keyword summaries to encourage content-word generation, and then apply direct preference optimization (DPO) with a speaker-listener framework and a decaying length term that gradually relaxes brevity pressure in favor of more fluent output. We evaluate whether this pipeline replicates the developmental word-learning trajectory observed in children, using Age of Acquisition (AoA) scores.

Why It Matters

The project connects BabyLM-style low-resource learning with a concrete cognitive hypothesis: that telegraphic speech reflects a learnable compression strategy, not just a capacity limitation. By explicitly modeling the acquisition trajectory through curriculum-style training stages and evaluating against developmental norms, we offer a reproducible framework for testing cognitive hypotheses about language emergence in neural models.

Core Question Can a pipeline using corpus pretraining, TF-IDF-supervised keyword extraction, and DPO replicate the telegraphic speech acquisition trajectory observed in early child language development?
Setup GPT-2 speaker + RoBERTa listener + Direct Preference Optimization (DPO)
Status Proof-of-concept pipeline implemented, results pending full hyperparameter evaluation

Introduction

Modern language models typically require orders of magnitude more data than human learners, and they do not naturally pass through the developmental stages observed in children. BabyLM asks whether useful language behavior can emerge under strict token budgets, but does not explain why early child speech is short, content-heavy, and systematically focused on nouns and verbs over function words. TeleSpeak treats this as a structured learning problem and tests it with a three-stage pipeline built on GPT-2.

GPT-2 is first pretrained on the BabyLM 100M token corpus to establish general language competence — a technical prerequisite with no direct child acquisition analog. It is then fine-tuned on Simple Wikipedia passages using TF-IDF keyword extracts as training targets. Because TF-IDF penalizes high-frequency function words, the resulting targets consist almost entirely of content words, directly operationalizing the telegraphic speech stage where children produce utterances like "want cookie" rather than "I want a cookie." Finally, Direct Preference Optimization with a speaker-listener framework refines the model's outputs: a RoBERTa-large listener scores candidate summaries by semantic similarity to the source, while a decaying length term maintains brevity pressure early in training. We evaluate the pipeline using Age of Acquisition scores across all three checkpoints, comparing the model's word-learning order against developmental norms from child language research.

Target User and Stakeholder

The immediate audience is researchers and mentors interested in cognitively motivated language modeling, efficient learning, and reinforcement-learning-based sequence training. The broader stakeholder is anyone evaluating whether developmental language phenomena can be reproduced with transparent training objectives instead of hand-built rules.

Scope

We focus on a narrow, testable claim: whether a speaker optimized for semantic fidelity plus time-varying brevity pressure will move toward shorter outputs. This project is about emergent production behavior, not full child-language acquisition.

Out of Scope

We are not claiming that the current model captures syntax development, multimodal grounding, caregiver interaction, or developmental timelines. Those would require additional supervision, richer environments, and stronger behavioral evaluation than this capstone currently includes.

Methods

Our pipeline has three stages: unsupervised pretraining, supervised fine-tuning (SFT), and preference optimization (DPO). The same GPT-2 policy model is carried through all stages, while a RoBERTa-based listener is used in DPO to construct chosen/rejected pairs.

Data

We use two data subsets across stages. Stage 0 pretraining uses the full BabyLM `train_100M` corpus (mixed sources). Stage 1 SFT and Stage 2 DPO use a 40,000-passage subset from Simple English Wikipedia for domain-consistent summarization. Source data directory: BabyLM OSF.

{"id": int, "source": str, "passage": str}
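As an illustrative sketch, records in this schema can be loaded and validated with a few lines of Python. The path handling and filtering policy here are hypothetical, not our actual loader:

```python
import json

def load_passages(path):
    """Load newline-delimited JSON records of the form
    {"id": int, "source": str, "passage": str}."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            rec = json.loads(line)
            # Keep only records matching the expected schema.
            if isinstance(rec.get("id"), int) and isinstance(rec.get("passage"), str):
                records.append(rec)
    return records
```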

Policy Model (Speaker)

GPT-2 is initialized with random weights and trained from scratch to track the full acquisition trajectory. The same model instance is used in pretraining, SFT, and DPO.

Listener Model

RoBERTa-large is used with BERTScore during DPO only. Given two candidate summaries, the listener scores each against the source passage and labels chosen vs rejected.
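The listener's pairwise judgment can be sketched as BERTScore-style greedy matching over token embeddings. This is a minimal NumPy illustration, not our actual listener: the real pipeline scores with RoBERTa-large embeddings, and the inputs below are assumed to be L2-normalized token-embedding matrices produced upstream:

```python
import numpy as np

def greedy_f1(cand_emb, ref_emb):
    """BERTScore-style greedy matching over token embeddings.
    cand_emb, ref_emb: (n_tokens, dim) arrays, assumed L2-normalized."""
    sim = cand_emb @ ref_emb.T          # pairwise cosine similarities
    recall = sim.max(axis=0).mean()     # each reference token's best match
    precision = sim.max(axis=1).mean()  # each candidate token's best match
    return 2 * precision * recall / (precision + recall + 1e-8)

def label_pair(src_emb, emb_a, emb_b):
    """Score two candidate summaries against the source passage and
    return (index_chosen, index_rejected)."""
    scores = [greedy_f1(emb_a, src_emb), greedy_f1(emb_b, src_emb)]
    return (0, 1) if scores[0] >= scores[1] else (1, 0)
```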

Preference Generation

For each prompt, the speaker samples two candidates with different seeds, top-p sampling, and temperature. We apply 2-gram Jaccard filtering: if similarity exceeds `max_pair_similarity`, one candidate is regenerated (up to `max_resample_tries`); unresolved near-duplicates are discarded.
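The 2-gram Jaccard filter can be sketched as follows; the 0.8 default for `max_pair_similarity` is illustrative, and tokenization here is plain whitespace splitting rather than the model tokenizer:

```python
def ngram_jaccard(a_tokens, b_tokens, n=2):
    """Jaccard similarity over n-gram sets of two token lists."""
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ga, gb = grams(a_tokens), grams(b_tokens)
    if not ga and not gb:
        return 1.0  # two empty candidates are trivially identical
    return len(ga & gb) / max(len(ga | gb), 1)

def too_similar(cand_a, cand_b, max_pair_similarity=0.8):
    """Flag a candidate pair for regeneration when their 2-gram
    overlap exceeds the threshold (illustrative default of 0.8)."""
    return ngram_jaccard(cand_a.split(), cand_b.split()) > max_pair_similarity
```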

What We Built

Data preprocessing, candidate generation, listener scoring, DPO training code, and experiment tracking are all assembled in this repo as a single end-to-end training workflow.

What We Reused

We rely on BabyLM-style data framing, GPT-2 and RoBERTa checkpoints, BERTScore-style semantic comparison, and standard experiment logging tooling rather than building those foundations from scratch.


Figure 1. (Pipeline model): Diagram illustrating the training pipeline used to operationalize telegraphic speech in language models.

SFT Stage

SFT trains the model with standard cross-entropy on keyword summaries extracted from each passage using TF-IDF. This stage teaches the model to produce compact keyword-style outputs before preference optimization.

score(t) = (1 + log tf(t)) * idf(t)

Prompt format:
"Keywords summary. Text: <passage_text> Output:<keywords>"

In our pipeline, SFT bridges unsupervised language pretraining and DPO by providing structured summarization behavior that DPO can refine.
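The scoring rule above can be sketched in pure Python. The `doc_freq` map, `n_docs`, the smoothed idf variant, and the `top_k` cutoff are illustrative stand-ins for our actual TF-IDF configuration:

```python
import math
from collections import Counter

def tfidf_keywords(passage_tokens, doc_freq, n_docs, top_k=8):
    """Rank terms by (1 + log tf) * idf and return the top_k keywords.
    doc_freq maps term -> number of corpus documents containing it."""
    tf = Counter(passage_tokens)

    def score(t):
        # Smoothed idf: high-frequency function words get low (even
        # negative) scores, so keyword targets skew toward content words.
        idf = math.log(n_docs / (1 + doc_freq.get(t, 0)))
        return (1 + math.log(tf[t])) * idf

    return sorted(tf, key=score, reverse=True)[:top_k]
```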

DPO Stage

Direct Preference Optimization (DPO) trains on pairs of candidate summaries labeled as chosen/rejected by the listener. The model is updated to increase probability on chosen outputs and decrease probability on rejected outputs, with an added decaying brevity pressure.

SFT and DPO work in conjunction: SFT gives the model a stable keyword-summary format, and DPO then reshapes that behavior using semantic preference signals and length control.

DPO Objective

In Stage 2, DPO combines preference learning and a decaying length term:

logits = beta * pref_logits + alpha(t) * len_adv
L = -log(sigmoid(logits))

pref_logits = (log pi_theta(y_c|x) - log pi_theta(y_r|x))
            - (log pi_ref(y_c|x)   - log pi_ref(y_r|x))

len_adv = (len_r - len_c) / (len_r + len_c + epsilon)
alpha(t) = alpha_0 * exp(-k * t)

Here, `beta` controls preference strength, `len_adv` rewards concise chosen outputs, and `alpha(t)` decays over training so semantic preference carries more weight later.
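A minimal scalar sketch of this objective follows. The hyperparameter defaults are illustrative, and sequence-level log-probabilities are taken as given rather than computed from the model:

```python
import math

def dpo_length_loss(logp_c, logp_r, ref_logp_c, ref_logp_r,
                    len_c, len_r, step, beta=0.1,
                    alpha_0=1.0, k=0.01, epsilon=1e-8):
    """DPO loss with a decaying length-advantage term, mirroring the
    Stage 2 objective (hyperparameter defaults are illustrative)."""
    pref_logits = (logp_c - logp_r) - (ref_logp_c - ref_logp_r)
    len_adv = (len_r - len_c) / (len_r + len_c + epsilon)
    alpha = alpha_0 * math.exp(-k * step)
    logits = beta * pref_logits + alpha * len_adv
    # -log sigmoid(logits), written as softplus(-logits) for stability
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits
```

A loss below log 2 means the combined signal already prefers the chosen output; the decaying `alpha(t)` lets the length advantage dominate early and fade later.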

Dataset and Validity Notes

We separate data by stage: full mixed `train_100M` for pretraining, then a 40,000-passage Simple English Wikipedia subset for SFT and DPO. This controls domain during fine-tuning while preserving broad language exposure in pretraining.

  • Stage split: Stage 0 pretraining on full mixed corpus; Stage 1/2 on SimpleWiki subset.
  • Split protocol: train/validation split with fixed seed (0.1 validation in current runs) so comparisons are reproducible.
  • Unit of supervision: preference pairs between two candidate summaries of the same passage.
  • Validity concern: semantic overlap metrics can overestimate quality when two weak candidates share surface vocabulary.
  • Leakage control: evaluate on held-out passages and keep fixed prompt sets separate from training pair construction.

Evaluation Plan (Explicit Protocol)

Our main evaluation is Age of Acquisition (AoA), following the BabyLM evaluation pipeline. We use AoA because it directly tests our hypothesis about word-learning trajectory, rather than only final-task fluency.

  • Ground-truth vocabulary norms: MacArthur-Bates CDI / Wordbank early-childhood word list.
  • Core signal: word surprisal across checkpoints; lower surprisal indicates higher familiarity.
  • AoA estimate: fit sigmoid learning curves per word and compare model-derived AoA against human AoA rankings.
  • Primary reported metrics: mean Δsurprisal per word and mean final surprisal.
  • Baseline comparison: BabyLM official pretrained GPT-100M checkpoint vs our DPO model under the same AoA evaluation procedure.
  • Ablations (ongoing): remove listener signal, remove/flatten length decay, and vary pair-filter thresholds.

AoA Methodology

We operationalize lexical acquisition with surprisal, defined as the negative log probability of a target word in context. Intuitively, words the model has learned receive lower surprisal over training.

Surprisal(w) = -log2 P(w | context)

Using CDI vocabulary targets, we track surprisal over checkpoints, fit sigmoid-like learning curves, and derive model AoA estimates for comparison with human AoA norms.
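As an illustration of the surprisal-to-AoA step: our actual pipeline fits sigmoid learning curves, and the threshold-crossing rule below is a simplified stand-in for that fit:

```python
import math

def surprisal(p):
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(p)

def estimate_aoa(token_counts, surprisals, threshold=None):
    """Crude AoA estimate: the (log-interpolated) token count at which a
    word's surprisal first drops below the midpoint of its own range.
    A sigmoid fit, as in the actual pipeline, would replace this."""
    if threshold is None:
        threshold = (max(surprisals) + min(surprisals)) / 2
    for (c0, s0), (c1, s1) in zip(zip(token_counts, surprisals),
                                  zip(token_counts[1:], surprisals[1:])):
        if s0 >= threshold > s1:
            frac = (s0 - threshold) / (s0 - s1)  # linear interpolation
            return math.exp(math.log(c0) + frac * (math.log(c1) - math.log(c0)))
    return None  # never crossed: word not acquired in this run
```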

Results

Current quantitative results are from AoA-style surprisal analysis (baseline vs DPO model). Additional sweep-level AoA comparisons are in progress.

Finding 1: ΔSurprisal Shift

Mean Δsurprisal per word is much lower for the DPO model (0.271) than for the baseline (2.067): per-word surprisal changes far less over training, indicating a different word-learning profile under fine-tuning.

Finding 2: Final Surprisal Tradeoff

Mean final surprisal is higher for the DPO model (11.92) than baseline (8.93), so improvements in trajectory-style metrics must be interpreted with this cost in mind.

Finding 3: Qualitative Stability Still Matters

AoA metrics are informative, but generation quality issues (collapse/noisy outputs in some runs) remain a key limitation for interpreting child-like behavior claims.

Charts

AoA visualization for the target word lick: GPT-2 baseline vs DPO model across token-count scale.


Figure 2. (GPT-2 baseline): For the CDI word lick, mean surprisal decreases gradually as log word count increases, showing a steady learning trajectory. The baseline also shows a broader overall reduction in surprisal (from about 11.2 to 10.0).


Figure 3. (DPO model): Mean surprisal also trends downward, but with sharper local changes rather than a steady decline. Its overall reduction is narrower (from about 11.64 to 11.54), suggesting a different acquisition pattern for this probe word.

Evaluation Table (Current Metrics)

Metric                   | GPT-2 Baseline | DPO Model
Mean Δsurprisal / word   | 2.067          | 0.271
Mean final surprisal     | 8.93           | 11.92

AoA-oriented metrics from the latest baseline-vs-DPO comparison run.

Example Output Format

Input passage

A held-out passage from the mixed `train_100M` domains.

Verbose candidate

Placeholder candidate preserving most content with longer phrasing.

Concise candidate

Placeholder candidate using shorter, denser wording.

How to Read the Results

We treat AoA claims as meaningful only when all conditions below hold:

  • Model-derived AoA aligns better with human AoA rankings than the baseline.
  • Δsurprisal improvements are not offset by severe degradation in final surprisal or generation quality.
  • The pattern survives ablations removing listener signal or decay scheduling.

Limitations and Failure Modes

  • Shorter outputs may still come from generic compression instead of robust child-like content selection.
  • Failure modes are unstable across runs: earlier punctuation and "kh/k" collapse shifted to function-word and punctuation-heavy drift (for example "the/of/in"), still with low semantic fidelity.
  • Current sweep filtering is too permissive (for example 12,800 kept vs 45 skipped at step 1,200), suggesting weak preference discrimination and noisy pair acceptance.
  • BERTScore-style signals do not fully capture factual accuracy, grammaticality, or readability when outputs become noisy.
  • Mixed-domain training data introduces domain shift; behavior on conversational-style sources may differ from encyclopedia-like passages.
  • The current setup does not model turn-taking, grounding, or developmental progression over time.

Iteration and Next Steps

  • Iteration record: early attempts showed punctuation-token collapse; later runs shifted to lexical/function-word drift, which motivated tighter diagnostics and sweep retuning.
  • Early phase complete: establish the speaker-listener preference pipeline and complete a full epoch run.
  • Immediate next step: retune pair filtering (`score_gap_min`, similarity threshold, resampling) so keep/skip behavior is meaningfully selective instead of near-all-keep.
  • Run a targeted sweep over decoding and regularization (`temperature`, `top_p`, Kullback-Leibler (KL) penalty, and length pressure) to reduce function-word drift and punctuation-heavy outputs.
  • Add explicit diagnostics per run: keep/skip ratio trajectory, top-token entropy/collapse indicators, and fixed-prompt readability checks.
  • Next reporting step: replace placeholders with final plots, ablations, and side-by-side examples from best sweep settings.
  • Longer-term extension: test richer datasets and more behaviorally grounded evaluation.

Implications

  • If sweep results hold, this provides a practical recipe for concise, meaning-preserving generation under constrained-data settings.
  • The work can inform model tuning strategies where brevity and semantic fidelity must be jointly optimized.

Deployment Considerations

  • Do not deploy current checkpoints for user-facing summarization until collapse diagnostics and held-out surprisal comparisons stabilize.
  • Production candidates should include guardrails for malformed output, plus monitoring of token-collapse indicators over time.

Run Path

pip install -r requirements.txt
python speaker_listener_rl/training/dpo_with_listener_wandb.py

These commands show the intended workflow. Final deployment artifacts and experiment outputs will be linked here after the evaluation sweep completes.

Conclusion

Tentative conclusion: the objective is trainable and can produce occasional compressed keyword-style outputs, but this run does not yet show stable telegraphic quality. The next decision point is sweep analysis: we need better tradeoffs between brevity and semantic stability before making a stronger acquisition-style claim. If that tradeoff improves, the impact would be a practical recipe for concise, meaning-preserving generation under low-resource constraints.

Acknowledgements

We thank our mentor Dr. Alex Warstadt for his guidance and feedback throughout this project.

This work used computing resources available through the National Research Platform (NRP) at the University of California, San Diego.

References

  1. BabyLM Challenge (shared task datasets and constraints): https://osf.io/ryjfm/.
  2. Rafailov et al. (2023). Direct Preference Optimization.
  3. Zhang et al. (2020). BERTScore: Evaluating text generation with contextual embeddings.
  4. Radford et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2).
  5. Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  6. Wolf et al. (2020). Transformers: State-of-the-Art Natural Language Processing (Hugging Face).
  7. Biewald (2020). Weights & Biases experiment tracking.
  8. Padovani et al. (2025). Reinforcement-learning approaches in BabyLM submissions.
  9. Stopler et al. (2025). Text-game style RL framing for language learning.
  10. Lazaridou et al. (2016). Multi-agent communication and reference games.