
WSL CLI utility that converts Globe and Mail newspaper PDFs (and any generic PDF) to clean Obsidian-ready markdown, with AI-powered article extraction and frontmatter generation.

Package: /home/ta/utils/ai/pdf2md/ · Command: pdf-to-md

  • pymupdf4llm for PDF conversion (3–6 sec/file, no ML models)
  • Gemini (free, default) + Claude (paid fallback) for AI extraction via extract.py
  • config.yaml as the single source of truth (SSOT) for models, prompts, output paths, and the 35-tag vocabulary
  • Rich CLI: status spinner, ✓/✗ indicators, rounded Panel output
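
The Gemini-first, Claude-second arrangement can be pictured as an ordered provider chain. The sketch below is only an illustration: `QuotaExceeded`, `extract_with_fallback`, and the provider callables are hypothetical names, and the source does not say whether extract.py switches providers automatically or leaves that to the `--model` and `--provider` flags.

```python
from typing import Callable, Sequence


class QuotaExceeded(Exception):
    """Hypothetical error raised by a provider wrapper when its quota is exhausted."""


def extract_with_fallback(
    text: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, extract_fn) in order and return (name, result) from the first success."""
    errors = []
    for name, extract_fn in providers:
        try:
            return name, extract_fn(text)
        except QuotaExceeded as exc:
            # Remember why this provider was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Passing the free provider first and the paid one last keeps paid calls to the cases where the free quota has actually run out.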

Why not Marker? It was tried first but proved impractical in WSL without a GPU: 65% RAM usage and 30+ minutes on a 165KB PDF. pymupdf4llm achieves the same result in 3–6 seconds with no ML dependencies.
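
For orientation, a minimal sketch of the conversion step, assuming it boils down to one `pymupdf4llm.to_markdown` call per file. The helper name `pdfs_to_markdown`, the injectable `to_markdown` parameter, and joining multi-part PDFs with a blank line are assumptions for illustration, not the package's actual code.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def pdfs_to_markdown(
    pdf_paths: Iterable[str],
    to_markdown: Optional[Callable[[str], str]] = None,
) -> str:
    """Convert each PDF to markdown and join the parts in the order given."""
    if to_markdown is None:
        # Imported lazily so the helper can be exercised without the dependency installed.
        import pymupdf4llm

        to_markdown = pymupdf4llm.to_markdown
    parts = [to_markdown(str(Path(p))) for p in pdf_paths]
    # Assumption: multi-part PDFs are merged by simple concatenation.
    return "\n\n".join(part.strip() for part in parts if part.strip())
```

Calling `pdfs_to_markdown(["file1.pdf", "file2.pdf"])` would yield one markdown string covering both parts, ready for the extraction step.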

Feature | Flag
--- | ---
Single article extraction | --source globe
Multi-article (all qualifying) | --source globe --multi
Multi-part PDFs (merged) | file1.pdf file2.pdf --source globe
Generic PDF | --source generic (default)
Raw debug output | --no-extract
Override title hint | --title "article headline"
Gemini quota fallback | --model gemini-2.5-flash-lite
Claude provider | --provider claude
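
To make the flag combinations concrete, here is a hypothetical `argparse` sketch of a command line with the same surface. It is not the real parser behind pdf-to-md; `gemini` as an explicit `--provider` choice and the help strings are assumptions, and the `--multi` check mirrors the Globe-only restriction noted further down.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pdf-to-md")
    parser.add_argument("pdfs", nargs="+", help="one or more PDFs; several are merged")
    parser.add_argument("--source", choices=["globe", "generic"], default="generic")
    parser.add_argument("--multi", action="store_true", help="extract all qualifying articles")
    parser.add_argument("--no-extract", action="store_true", help="raw debug output")
    parser.add_argument("--title", help="headline hint for the extractor")
    parser.add_argument("--model", help="model override, e.g. gemini-2.5-flash-lite")
    # Assumption: Gemini is the default provider name accepted here.
    parser.add_argument("--provider", choices=["gemini", "claude"], default="gemini")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.multi and args.source != "globe":
        # parser.error prints a usage message and exits with status 2.
        parser.error("--multi requires --source globe")
    return args
```

Rejecting `--multi` without `--source globe` at parse time surfaces the restriction before any PDF is converted.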

Every file starts with Obsidian YAML frontmatter:

---
title: "Article headline"
author: "Author, credentials"
published: 2026-03-25 # extracted from Globe page header **WEEKDAY, MONTH DD, YYYY**
created: 2026-04-04
source: "The Globe and Mail"
tags:
- clippings
- globe-and-mail
- mortgages
- behavioral-finance
---

Globe articles always get the clippings and globe-and-mail tags. The 35-tag vocabulary lives in config.yaml under defaults.tags.
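
The tagging rule can be sketched as a small builder that always emits the two fixed Globe tags and keeps only topic tags found in the vocabulary. `build_frontmatter` and its parameters are hypothetical, and the vocabulary passed in stands in for the 35 tags in config.yaml, which are not listed here.

```python
import json
from datetime import date
from typing import Iterable

REQUIRED_GLOBE_TAGS = ["clippings", "globe-and-mail"]


def build_frontmatter(
    title: str,
    author: str,
    published: date,
    created: date,
    topic_tags: Iterable[str],
    vocabulary: Iterable[str],
) -> str:
    """Render Obsidian frontmatter in the field order shown above."""
    allowed = set(vocabulary)
    tags = list(REQUIRED_GLOBE_TAGS)
    for tag in topic_tags:
        # Drop tags outside the vocabulary and avoid duplicates.
        if tag in allowed and tag not in tags:
            tags.append(tag)
    lines = [
        "---",
        # json.dumps yields a double-quoted string that is also valid YAML.
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"author: {json.dumps(author, ensure_ascii=False)}",
        f"published: {published.isoformat()}",
        f"created: {created.isoformat()}",
        'source: "The Globe and Mail"',
        "tags:",
        *[f"- {tag}" for tag in tags],
        "---",
    ]
    return "\n".join(lines) + "\n"
```

Filtering against the vocabulary keeps a model-suggested tag from introducing spellings that Obsidian would treat as new tags.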

  • onnxruntime GPU warning on every run: cosmetic and harmless; it cannot be suppressed because it is written directly to OS stderr (fd 2)
  • Gemini daily quota: use a dedicated GEMINI_API_KEY in pdf2md/.env (not shared with other AI utilities)
  • --multi is Globe-only: it requires multi_extract_system_prompt in the source config
  • Globe published date: appears in the page header as **WEEKDAY, MONTH DD, YYYY**; the prompt extracts it reliably
  • .env encoding: write the file from WSL only; copy-pasting from Windows embeds Unicode characters in comments and garbles the file
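
To illustrate the header format only: the tool relies on the AI prompt to pull out the published date, not on a regex. The sketch below is a swapped-in, deterministic alternative showing how a `**WEEKDAY, MONTH DD, YYYY**` header maps to the ISO date used in frontmatter. `parse_header_date` is a hypothetical helper, and English month names are assumed.

```python
import re
from datetime import date, datetime
from typing import Optional

# Matches e.g. **WEDNESDAY, MARCH 25, 2026** in the converted markdown.
HEADER_DATE = re.compile(
    r"\*\*([A-Z]+DAY), ([A-Z]+) (\d{1,2}), (\d{4})\*\*",
    re.IGNORECASE,
)


def parse_header_date(markdown: str) -> Optional[date]:
    """Return the first header date found, or None if absent or invalid."""
    match = HEADER_DATE.search(markdown)
    if match is None:
        return None
    _weekday, month, day, year = match.groups()
    try:
        # %B uses the current locale's month names (English assumed).
        return datetime.strptime(f"{month.title()} {day} {year}", "%B %d %Y").date()
    except ValueError:
        return None
```

A check like this could serve as a cross-check on the model's answer, though the source does not describe one.
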

Item | WSL Path | Windows Path
--- | --- | ---
Package | /home/ta/utils/ai/pdf2md/ | \\wsl$\Ubuntu-24.04\home\ta\utils\ai\pdf2md\
Config | /home/ta/utils/ai/pdf2md/config.yaml | same via UNC
API keys | /home/ta/utils/ai/pdf2md/.env | same via UNC
Output (Globe) | /mnt/d/FSS/KB/Business/Clippings/Globe/ | D:\FSS\KB\Business\Clippings\Globe\
Output (Generic) | /mnt/d/FSS/KB/Business/Clippings/PDFs/ | D:\FSS\KB\Business\Clippings\PDFs\
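
The two path columns follow a mechanical mapping: /mnt/<drive>/ paths become drive-letter paths, and everything else is reached through the \\wsl$ UNC share. A minimal sketch of that mapping follows; `wsl_to_windows` is a hypothetical helper, and the distro name is taken from the UNC path above.

```python
from pathlib import PurePosixPath

WSL_DISTRO = "Ubuntu-24.04"  # distro name as it appears in the UNC path above


def wsl_to_windows(wsl_path: str, distro: str = WSL_DISTRO) -> str:
    """Map an absolute WSL path to its Windows equivalent (no trailing backslash)."""
    if not wsl_path.startswith("/"):
        raise ValueError("expected an absolute WSL path")
    parts = PurePosixPath(wsl_path).parts  # ('/', 'mnt', 'd', 'FSS', ...)
    if len(parts) >= 3 and parts[1] == "mnt" and len(parts[2]) == 1:
        # Mounted Windows drive: /mnt/d/... -> D:\...
        return parts[2].upper() + ":\\" + "\\".join(parts[3:])
    # Linux-side file: expose it through the \\wsl$ share.
    return "\\\\wsl$\\" + distro + "\\" + "\\".join(parts[1:])
```

For example, `wsl_to_windows("/home/ta/utils/ai/pdf2md/.env")` gives the UNC form that the table abbreviates as "same via UNC".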