pdf-to-md
WSL CLI utility that converts Globe and Mail newspaper PDFs (and any generic PDF) to clean Obsidian-ready markdown, with AI-powered article extraction and frontmatter generation.
Package: /home/ta/utils/ai/pdf2md/ · Command: pdf-to-md
What Was Built
- pymupdf4llm for PDF conversion (3–6 sec/file, no ML models)
- Gemini (free, default) + Claude (paid fallback) for AI extraction via extract.py
- config.yaml as SSOT for models, prompts, output paths, 35-tag vocabulary
- Rich CLI: status spinner, ✓/✗ indicators, rounded Panel output

Why not Marker? It was tried first but proved impractical in WSL without a GPU: 65% RAM usage and 30+ minutes on a 165 KB PDF. pymupdf4llm achieves the same result in 3–6 seconds with no ML dependencies.
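
The Gemini-first, Claude-second arrangement listed above can be illustrated with a small ordered-fallback loop. This is only a sketch: the source does not say whether the tool switches providers automatically or only through the --provider flag, so automatic fallback is an assumption here. `QuotaExceeded`, `extract_with_fallback`, and the two fake provider functions are hypothetical names, not part of extract.py.

```python
from typing import Callable, Sequence


class QuotaExceeded(Exception):
    """Hypothetical error a provider wrapper raises when its quota is used up."""


def extract_with_fallback(
    text: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, result) of the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(text)
        except QuotaExceeded as exc:
            # Remember why this provider was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-ins for real API calls, used only to exercise the control flow.
def fake_gemini(text: str) -> str:
    raise QuotaExceeded("daily quota reached")


def fake_claude(text: str) -> str:
    return f"extracted {len(text)} chars"


provider, result = extract_with_fallback(
    "raw markdown", [("gemini", fake_gemini), ("claude", fake_claude)]
)
print(provider, result)
```

Keeping the providers as an ordered list means the free option is always tried first and the paid one is reached only when the first raises a quota error.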
Features
| Feature | Flag |
|---|---|
| Single article extraction | --source globe |
| Multi-article (all qualifying) | --source globe --multi |
| Multi-part PDFs (merged) | file1.pdf file2.pdf --source globe |
| Generic PDF | --source generic (default) |
| Raw debug output | --no-extract |
| Override title hint | --title "article headline" |
| Gemini quota fallback | --model gemini-2.5-flash-lite |
| Claude provider | --provider claude |
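
As a rough illustration of how the flags in the table could fit together, here is a minimal argparse sketch. It is not the tool's actual parser: the positional name `pdfs`, the help strings, and the `gemini` choice for --provider are assumptions, and the real CLI may be built on a different library.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pdf-to-md")
    # One or more PDFs; several files are treated as parts of one document.
    parser.add_argument("pdfs", nargs="+", help="PDF file(s) to convert")
    parser.add_argument("--source", choices=["globe", "generic"], default="generic")
    parser.add_argument("--multi", action="store_true", help="extract all qualifying articles")
    parser.add_argument("--no-extract", action="store_true", help="write raw debug output")
    parser.add_argument("--title", help="headline hint for the extractor")
    parser.add_argument("--model", help="model override, e.g. gemini-2.5-flash-lite")
    # "gemini" as a selectable value is assumed; the table only shows "claude".
    parser.add_argument("--provider", choices=["gemini", "claude"], default="gemini")
    return parser


def parse(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # --multi only makes sense for the Globe source.
    if args.multi and args.source != "globe":
        parser.error("--multi requires --source globe")
    return args


args = parse(["file1.pdf", "file2.pdf", "--source", "globe", "--multi"])
print(args.pdfs, args.source, args.multi, args.provider)
```

Validating the --multi and --source combination at parse time surfaces the Globe-only restriction before any PDF is read.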
Output Format
Every file starts with Obsidian YAML frontmatter:

    ---
    title: "Article headline"
    author: "Author, credentials"
    published: 2026-03-25  # extracted from Globe page header **WEEKDAY, MONTH DD, YYYY**
    created: 2026-04-04
    source: "The Globe and Mail"
    tags:
      - clippings
      - globe-and-mail
      - mortgages
      - behavioral-finance
    ---

Globe articles always get clippings + globe-and-mail. Tag vocabulary (35 tags) in config.yaml → defaults.tags.
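
A frontmatter block of this shape could be assembled roughly as follows. This is a sketch using only the standard library, not the tool's own code: `build_frontmatter` and `GLOBE_BASE_TAGS` are hypothetical names, and the real implementation may use a YAML library instead of string formatting.

```python
from datetime import date

# Tags every Globe article receives, per the text above.
GLOBE_BASE_TAGS = ["clippings", "globe-and-mail"]


def build_frontmatter(title, author, published, created, source, topic_tags):
    """Return an Obsidian-style YAML frontmatter block as a string."""
    tags = list(GLOBE_BASE_TAGS)
    for tag in topic_tags:
        if tag not in tags:  # keep order, drop duplicates
            tags.append(tag)

    def quoted(value: str) -> str:
        # Escape backslashes and double quotes so the YAML string stays valid.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = [
        "---",
        f"title: {quoted(title)}",
        f"author: {quoted(author)}",
        f"published: {published.isoformat()}",
        f"created: {created.isoformat()}",
        f"source: {quoted(source)}",
        "tags:",
        *[f"  - {tag}" for tag in tags],
        "---",
    ]
    return "\n".join(lines) + "\n"


frontmatter = build_frontmatter(
    title="Article headline",
    author="Author, credentials",
    published=date(2026, 3, 25),
    created=date(2026, 4, 4),
    source="The Globe and Mail",
    topic_tags=["mortgages", "behavioral-finance", "clippings"],
)
print(frontmatter)
```

Quoting the free-text fields protects titles that contain colons or quotes, while dates are left unquoted so Obsidian can treat them as date properties.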
Gotchas
- onnxruntime GPU warning on every run: cosmetic, harmless, cannot be suppressed (it is written directly to the OS-level stderr, fd 2)
- Gemini daily quota: use a dedicated GEMINI_API_KEY in pdf2md/.env (not shared with other AI utilities)
- --multi is Globe-only: requires multi_extract_system_prompt in the source config
- Globe published date: appears in the page header as **WEEKDAY, MONTH DD, YYYY**; the prompt extracts it reliably
- .env encoding: write the file from WSL only; Windows copy-paste embeds Unicode characters in comments and garbles the file
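
For the .env encoding gotcha, a quick check for stray non-ASCII bytes can catch a garbled file before a run. The helper below is a hypothetical sketch, not something shipped with the package, and the sample file contents are invented for the demonstration.

```python
import tempfile
from pathlib import Path


def find_non_ascii(path: Path) -> list[tuple[int, str]]:
    """Return (line_number, text) for every line that contains non-ASCII bytes."""
    flagged = []
    raw = path.read_bytes()
    for number, line in enumerate(raw.splitlines(), start=1):
        if any(byte > 127 for byte in line):
            flagged.append((number, line.decode("utf-8", errors="replace")))
    return flagged


# Demonstration on a throwaway file: the comment contains an en dash (U+2013),
# the kind of character a Windows copy-paste can introduce.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text(
        "# Gemini key \u2013 pasted from Windows\nGEMINI_API_KEY=placeholder\n",
        encoding="utf-8",
    )
    problems = find_non_ascii(env_file)

for number, text in problems:
    print(f"line {number}: {text}")
```

Reading the file as bytes avoids depending on the platform's default text encoding, which is the very thing that differs between WSL and Windows.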
Key Files
| Item | WSL Path | Windows Path |
|---|---|---|
| Package | /home/ta/utils/ai/pdf2md/ | \\wsl$\Ubuntu-24.04\home\ta\utils\ai\pdf2md\ |
| Config | /home/ta/utils/ai/pdf2md/config.yaml | same via UNC |
| API keys | /home/ta/utils/ai/pdf2md/.env | same via UNC |
| Output (Globe) | /mnt/d/FSS/KB/Business/Clippings/Globe/ | D:\FSS\KB\Business\Clippings\Globe\ |
| Output (Generic) | /mnt/d/FSS/KB/Business/Clippings/PDFs/ | D:\FSS\KB\Business\Clippings\PDFs\ |
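
The WSL-to-Windows mapping in the table follows two patterns: /mnt/<drive>/... becomes a drive-letter path, and anything else is reached through the \\wsl$ UNC share. The sketch below illustrates that mapping; `wsl_to_windows` is a hypothetical helper, and inside WSL the built-in `wslpath -w` command does the same job.

```python
DISTRO = "Ubuntu-24.04"  # distro name taken from the UNC paths in the table


def wsl_to_windows(wsl_path: str) -> str:
    """Translate an absolute WSL path to its Windows form (trailing slash dropped)."""
    parts = [part for part in wsl_path.split("/") if part]
    # /mnt/<single letter>/... is a mounted Windows drive.
    if len(parts) >= 2 and parts[0] == "mnt" and len(parts[1]) == 1:
        return parts[1].upper() + ":\\" + "\\".join(parts[2:])
    # Everything else lives inside the distro and is exposed via the UNC share.
    return "\\\\wsl$\\" + DISTRO + "\\" + "\\".join(parts)


print(wsl_to_windows("/mnt/d/FSS/KB/Business/Clippings/Globe/"))
print(wsl_to_windows("/home/ta/utils/ai/pdf2md/config.yaml"))
```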