Quick overview
This guide walks you through a methodical debugging workflow for an NLP parser (constituency or dependency). Treat the parser like any other program: reproduce the problem, isolate the faulty component, add tests and instrumentation, and iterate until fixed. I've included common causes, practical checks, visualization and logging tricks, and a short example workflow.
1. Reproduce the failure reliably
- Capture one or more minimal example sentences that show the bad parse.
- Record the exact pipeline: tokenizer, normalizer, tagger, parser model and version, model weights, config, random seed, input encoding (UTF-8), and OS/environment.
- Make a tiny script that runs only the pipeline on the example(s) so you can reproduce it deterministically.
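If your pipeline is spaCy-based, a minimal sketch of such a script could look like this (spacy.util.fix_random_seed is spaCy's built-in seeding helper; the model name is just an example):
# repro.py -- minimal deterministic reproduction (sketch, assuming a spaCy pipeline)
import random
import numpy
import spacy

random.seed(0)                      # pin every RNG you can reach
numpy.random.seed(0)
spacy.util.fix_random_seed(0)

nlp = spacy.load('en_core_web_sm')  # record model name and version with the bug report
print("spaCy", spacy.__version__, "| model", nlp.meta["name"], nlp.meta["version"])

doc = nlp("Ally McBeal debugged the parser's output.")  # the failing example
for t in doc:
    print(t.i, repr(t.text), t.tag_, t.dep_, t.head.i)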
2. Isolate components
- Run the tokenizer alone. Is the tokenization what the parser expects? (Token boundaries change parse trees.)
- Run POS tagger / morphological analysis alone. Wrong tags often cascade to bad parses.
- If using a pipeline (tokenizer→tagger→parser), feed the parser gold tokens/tags to see if the parser or upstream step is at fault.
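In spaCy v3 you can bypass the tokenizer by constructing a Doc from your own token list and passing it to the pipeline directly (nlp() accepts a pre-built Doc since v3). A sketch, with hypothetical gold tokens; note that spaCy v3's parser consumes tok2vec features rather than predicted tags, so gold-tag feeding is most informative for parsers that actually read tags:
# Sketch: run the pipeline on gold tokens, skipping the tokenizer (spaCy v3)
import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

gold_tokens = ["Ally", "McBeal", "debugged", "the", "parser", "'s", "output", "."]
doc = Doc(nlp.vocab, words=gold_tokens)  # spaces default to a space between tokens
doc = nlp(doc)                           # since v3, nlp() accepts a pre-built Doc

for t in doc:
    print(t.i, t.text, t.tag_, t.dep_, "->", t.head.text)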
3. Check preprocessing and data conventions
- Tokenization mismatches: model trained on PTB tokenization vs. your tokenizer (e.g., "'s", hyphens, contractions).
- Case sensitivity: was the model trained on lowercased text while you pass mixed-case input?
- Punctuation handling and sentence segmentation: ensure sentence boundaries match training assumptions.
- Encoding issues: hidden Unicode characters, smart quotes, or zero-width spaces.
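A stdlib-only sketch for flushing these out before text reaches the parser:
# Sketch: audit input for hidden or non-ASCII characters (stdlib only)
import unicodedata

def audit_text(s):
    for i, ch in enumerate(s):
        if ord(ch) > 127:
            print(f"pos {i}: {ch!r} U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")

audit_text("Don\u2019t parse\u200b this")
# pos 3: '’' U+2019 RIGHT SINGLE QUOTATION MARK
# pos 11: '\u200b' U+200B ZERO WIDTH SPACE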
4. Validate training / gold data alignment
- Check that your training/evaluation treebank conventions (labels, head rules) match your code. UD vs CTB vs PTB differences are common.
- Inspect a few gold parses where the model fails — sometimes gold annotations are inconsistent or noisy.
- If you retrained a model, verify data shuffling, train/val split leakage, and correct label mapping.
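To catch label-set mismatches concretely, count the dependency labels on each side and diff them; a stdlib-only sketch assuming CoNLL-U files (the file names are placeholders):
# Sketch: compare DEPREL inventories across CoNLL-U files
from collections import Counter

def deprel_counts(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 10 and cols[0].isdigit():  # skip comments and MWT ranges
                counts[cols[7]] += 1                   # column 8 is DEPREL
    return counts

train, dev = deprel_counts("train.conllu"), deprel_counts("dev.conllu")
print("labels only in dev:", set(dev) - set(train))
print("labels only in train:", set(train) - set(dev))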
5. Add small, targeted unit tests and minimal pairs
- Write unit tests for the tokenizer, POS tagger, and parser that cover the failing cases and nearby edge cases.
- Create minimal pairs (two sentences that differ by one token or punctuation) to see exactly what change flips the output.
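A sketch of a minimal-pair check with spaCy; the classic PP-attachment pair below is illustrative:
# Sketch: diff the parses of a minimal pair (same length, one token differs)
import spacy

nlp = spacy.load('en_core_web_sm')
a = nlp("She saw the man with the telescope.")
b = nlp("She saw the man with the beard.")

for ta, tb in zip(a, b):
    if (ta.dep_, ta.head.i) != (tb.dep_, tb.head.i):
        print(f"{ta.text!r}: {ta.dep_} <- {ta.head.text}   vs   "
              f"{tb.text!r}: {tb.dep_} <- {tb.head.text}")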
6. Instrumentation & logging
- Dump intermediate representations: tokens, token indices, embeddings (or sentinel values), per-token POS scores, and, for transition parsers, action sequences with their probabilities.
- Log softmax/probability distributions for key decisions (attachment choices, bracket choices). Low confidence often signals ambiguous input or missing features.
- Compare model outputs to gold at multiple levels (tokenization → tags → arcs) and compute per-token/per-decision errors.
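A small sketch of that level-by-level comparison on aligned token lists; the values are illustrative, and heads are token indices with -1 for the root:
# Sketch: locate tag errors, attachment errors, and cascades
def compare_levels(pred_tags, gold_tags, pred_heads, gold_heads):
    tag_err = {i for i, (p, g) in enumerate(zip(pred_tags, gold_tags)) if p != g}
    head_err = {i for i, (p, g) in enumerate(zip(pred_heads, gold_heads)) if p != g}
    print("tag errors:", sorted(tag_err))
    print("attachment errors:", sorted(head_err))
    print("likely cascades:", sorted(tag_err & head_err))  # wrong tag AND wrong head

compare_levels(pred_tags=["NNP", "VBD", "DT", "NNS"],
               gold_tags=["NNP", "VBD", "DT", "NN"],
               pred_heads=[1, -1, 3, 2],
               gold_heads=[1, -1, 3, 1])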
7. Visualize parses
- Use tree visualizers (NLTK, displaCy for spaCy, brat for annotations, or web-based tree viewers) to inspect differences quickly; a minimal displaCy call follows this list.
- For dependency parsers, color edges by confidence score to find shaky attachments.
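displaCy's render/serve calls are the quickest way to eyeball a spaCy parse (confidence coloring is not built in, so treat the previous bullet as a custom-tooling idea); a minimal sketch:
# Sketch: render a dependency parse with displaCy
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Ally McBeal debugged the parser's output.")
html = displacy.render(doc, style="dep")  # returns HTML markup for notebooks/files
# displacy.serve(doc, style="dep")        # or host it at http://localhost:5000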
8. Common causes and concrete fixes
- Tokenization mismatch: Fix by using the same tokenizer that was used during training, or re-tokenize the training data consistently.
- Tagging errors: Improve tagger, or feed gold tags to measure parser-only performance; add features that help (e.g., morphological features).
- Label/hierarchy mismatch: Map labels between treebanks (e.g., UD vs custom labels) or retrain with proper label set.
- Training data noise: Clean or filter noisy treebank examples; remove or correct inconsistent annotations.
- Overfitting/underfitting: Check learning curves; change regularization, model size, or data quantity.
- Non-determinism: Fix by seeding RNGs, disabling multithreading for debugging runs.
- Bug in decoding algorithm: For CKY/Earley/transition-parser implementations, add invariant checks (scores non-negative where expected, probabilities normalized) and unit tests on toy grammars where the expected parse is known (see the sketch after this list).
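A self-contained sketch of such a toy-grammar test: a CKY recognizer over a three-rule CNF grammar, with assertions that pin the expected behavior (the grammar and cky() are illustrative, not a library API):
# Sketch: unit-test a CKY recognizer on a toy CNF grammar
from itertools import product

BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}   # S -> NP VP, VP -> V NP
LEXICAL = {"she": {"NP"}, "fish": {"NP"}, "eats": {"V"}}

def cky(words):
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICAL.get(w, ()))     # fill the diagonal
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # try every split point
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((left, right), set())
    return "S" in chart[0][n]

assert cky("she eats fish".split())      # the expected parse exists
assert not cky("fish she eats".split())  # and no spurious S appears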
9. Evaluation & metrics
- Use appropriate metrics: UAS/LAS for dependency parsing, bracket F1 for constituency parsing.
- Break down errors by sentence length, head-dependent distance, POS of head/dependent, and presence of punctuation to find systematic weaknesses; a scoring sketch follows.
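A sketch of scoring plus one such breakdown; gold/pred are illustrative parallel lists of (head index, label) per token, with -1 for the root:
# Sketch: UAS/LAS plus an error breakdown by gold arc length
from collections import Counter

def score(gold, pred):
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    errs_by_dist = Counter(abs(g[0] - i)
                           for i, (g, p) in enumerate(zip(gold, pred))
                           if g[0] != p[0])
    return uas, las, errs_by_dist

gold = [(1, "nsubj"), (-1, "root"), (3, "det"), (1, "obj")]
pred = [(1, "nsubj"), (-1, "root"), (3, "det"), (2, "obj")]
print(score(gold, pred))  # (0.75, 0.75, Counter({2: 1}))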
10. Regression testing & CI
- Add the failing examples as regression tests so they cannot reappear silently.
- Integrate small evaluation on a tiny dev set into CI to catch major regressions quickly.
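A pytest sketch of such a regression test; the expected head and label are placeholders to be replaced with your verified gold parse:
# test_regressions.py -- sketch: pin the once-failing example (pytest + spaCy)
import pytest
import spacy

@pytest.fixture(scope="session")
def nlp():
    return spacy.load('en_core_web_sm')

def test_possessive_attachment(nlp):
    doc = nlp("Ally McBeal debugged the parser's output.")
    tok = {t.text: t for t in doc}
    assert tok["'s"].head.text == "parser"  # placeholder gold head
    assert tok["'s"].dep_ == "case"         # placeholder gold label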
Practical example: debugging a spaCy dependency parse error
# Minimal reproduction script (Python + spaCy)
import spacy

nlp = spacy.load('en_core_web_sm')
sent = "Ally McBeal debugged the parser's output."  # failing example

# 1) Inspect tokens: repr'd text, character offset, POS, fine-grained tag, dependency, head
doc = nlp(sent)
for t in doc:
    print(repr(t.text), t.idx, t.pos_, t.tag_, t.dep_, t.head.text)

# 2) Visualize or print arcs and scores if the model exposes them
Steps you might take next:
- Check tokenization: does spaCy split "parser's" as you expect? If not, customize tokenizer rules or normalize input (e.g., expand contractions) to match training.
- Check POS tags: print t.pos_ and t.tag_. If they are wrong, try using a different tagger or retrain with corrected data.
- Feed gold tokens/tags into your parser code (if possible) to see whether parser alone still gives wrong attachments.
- Instrument parser to show attachment probabilities for the dependent token. If probabilities are low across choices, the model lacks discriminating features.
- If the parser always makes the same class of mistake (e.g., right attachment for prepositional phrases), augment training data with targeted examples or add syntax-aware features.
Checklist to run through quickly
- Reproduce deterministically — create minimal script.
- Verify tokenizer and sentence splitter.
- Verify POS/morph tags and feed gold if needed.
- Compare model assumptions vs data conventions (labels, encoding, case).
- Visualize parse & log decision confidences.
- Add focused unit tests and regression checks.
- Consider retraining or augmenting data if the error is model capability-related.
Tools and helpers
- spaCy (displaCy), NLTK, benepar, AllenNLP, Stanford CoreNLP for visualization and alternative parses.
- Annotation tools: brat, Prodigy (for collecting gold corrections).
- Use small synthetic grammars to unit-test algorithms (e.g., toy CFG for CKY testing).
When to ask for help — what to include
If you post a question (on Slack, GitHub, Stack Overflow, or to a teammate), include:
- a minimal reproducible script and the exact input sentence(s)
- pipeline spec: tokenizer, tagger, parser, model versions
- expected parse (gold) and the parse the system produced
- any logs showing probabilities or intermediate outputs
Final tip — think like Ally McBeal
Be curious and methodical: collect evidence, interrogate each module, and don’t assume the model is at fault — often the pipeline or training-data mismatch is the real culprit. Keep failing examples as tests so future changes don’t break the fix.
If you want, paste a failing sentence and the parse your system produced and I will walk through a focused debug session (including example commands and what to log at each stage).