Changelog
1.2.0 (2026-06-11)
Features
- extraction guards — size limit at entry, process isolation, public config API (1f24d66)
- extraction guards — size limit at entry, process isolation, public config API (1.2.0) (ecce207)
Bug Fixes
- replace assert isinstance() with explicit runtime type checks in registry.py (#206) (2ee0ee8)
- replace assert isinstance() with explicit runtime type guards in registry.py (13e27aa)
- replace assert isinstance() with explicit runtime type guards in registry.py (#206) (c1d2e85)
1.2.0 (2026-06-11)
Features
- enforce max_file_size_mb at the extract_file/extract_text entry points — oversized files raise ExtractionError instantly before backend dispatch, instead of running into the extraction timeout when callers bypass discover_files
- add config_from_overrides() public API: build an ImportConfig from a partial overrides dict deep-merged onto the bundled defaults (the supported path for library consumers such as m365-extract)
- add extraction.isolation config ("thread" | "process", default "thread"): process mode runs each extraction in a separate spawned process that is killed on timeout — true cancellation and memory isolation for long-running daemons
- include the source file size in ExtractionTimeoutError messages alongside label, timeout, and path
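The deep-merge behaviour described for config_from_overrides() can be illustrated with a minimal sketch. The `deep_merge` helper, the `DEFAULTS` dict, its nesting, and the `timeout_seconds` key are assumptions made for illustration; only the "thread" default for extraction.isolation comes from the entries above, and this is not the library's implementation.

```python
from copy import deepcopy
from typing import Any


def deep_merge(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict with overrides recursively merged onto defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and type mismatches: the override wins outright.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults; the real bundled defaults may be shaped differently.
DEFAULTS = {
    "extraction": {"isolation": "thread", "timeout_seconds": 120},
    "max_file_size_mb": 100,
}

# A partial override only names the keys it wants to change.
config = deep_merge(DEFAULTS, {"extraction": {"isolation": "process"}})
print(config)
```

Sibling keys that the override does not mention keep their default values, and the defaults dict itself is left unmodified.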
Bug Fixes
- process isolation enforces the deadline on every parent-side wait: receiving the payload is bounded by a watchdog that kills a stalled child, and the worker is reaped (bounded join, then kill) on success, error, and interrupt paths — a child kept alive by a leftover non-daemon thread can no longer block the caller indefinitely
- worker exceptions that cannot survive the pickle round-trip are reported as ExtractionError with the original message, instead of escaping as a raw TypeError that aborted CLI batch runs; payloads that still fail to unpickle parent-side are wrapped defensively
- the extraction worker process is no longer daemonic, so backends may spawn their own worker processes (e.g. docling's torch DataLoader)
- stat() failures in the entry-point size guard (file vanished between discovery and extraction) raise ExtractionError, keeping the ObsidianImportError contract so CLI batch runs print FAIL and continue
- the "process died without a result" error now explains the if __name__ == "__main__": guard required for script consumers under spawn
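A rough sketch of an entry-point size guard in the spirit of the entries above. The exception classes here are local stand-ins (ExtractionError is assumed to subclass ObsidianImportError), and `check_size` is a hypothetical helper, not the package's function.

```python
import tempfile
from pathlib import Path


class ObsidianImportError(Exception):
    """Local stand-in for the package's base error."""


class ExtractionError(ObsidianImportError):
    """Local stand-in for the package's extraction error."""


def check_size(path: Path, max_file_size_mb: float) -> int:
    """Return the file size in bytes, or raise ExtractionError."""
    try:
        size = path.stat().st_size
    except OSError as exc:
        # The file may have vanished between discovery and extraction.
        raise ExtractionError(f"cannot stat {path}: {exc}") from exc
    limit = int(max_file_size_mb * 1024 * 1024)
    if size > limit:
        raise ExtractionError(f"{path} is {size} bytes, over the {limit}-byte limit")
    return size


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"x" * 2048)
    print(check_size(sample, max_file_size_mb=1))  # within the limit
    for bad_call in (
        lambda: check_size(sample, max_file_size_mb=0.001),  # over the limit
        lambda: check_size(Path(tmp) / "missing.bin", 1),  # file does not exist
    ):
        try:
            bad_call()
        except ExtractionError as exc:
            print("FAIL", exc)
```

Because both failure modes surface as the same error family, a batch loop can catch one exception type, report the file, and move on.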
1.1.2 (2026-05-20)
Bug Fixes
- call _cleanup_temp_source in copy_media_files to remove temp dirs (6682e90)
- clean up temp dirs created by save_media_to_temp after media copy (#176) (026e76c)
- sanitize PDF form field values against markdown injection (#171) (9b22822)
- serialize Image.MAX_IMAGE_PIXELS mutation with threading.Lock (#194) (3b389c6)
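To illustrate why a process-wide setting such as Image.MAX_IMAGE_PIXELS needs serialized mutation, here is a small sketch. It uses a SimpleNamespace stand-in instead of Pillow so it runs anywhere; the `pixel_limit` context manager and the lock name are hypothetical, not the project's code.

```python
import threading
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for the PIL.Image module so the sketch runs without Pillow installed.
Image = SimpleNamespace(MAX_IMAGE_PIXELS=1_000_000)

_pixel_limit_lock = threading.Lock()


@contextmanager
def pixel_limit(max_pixels):
    """Temporarily override the process-wide limit while holding a lock."""
    with _pixel_limit_lock:
        previous = Image.MAX_IMAGE_PIXELS
        Image.MAX_IMAGE_PIXELS = max_pixels
        try:
            yield
        finally:
            # Restore before the lock is released so the next thread
            # starts from the original value.
            Image.MAX_IMAGE_PIXELS = previous


observed = []


def worker(limit):
    with pixel_limit(limit):
        # While the lock is held, the global matches this thread's own limit.
        observed.append(Image.MAX_IMAGE_PIXELS == limit)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(1, 9)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(all(observed), Image.MAX_IMAGE_PIXELS)
```

Without the lock, two threads could interleave their set and restore steps and leave the global at the wrong value for a third.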
Documentation
- add cleanup side-effect to copy_media_files docstring and CHANGELOG entry (152709b)
1.1.1 (2026-05-11)
Bug Fixes
1.1.0 (2026-04-28)
Features
- add .html as a first-class backend config key (4602f1b)
- add .html as a first-class backend config key (e64b82f)
Bug Fixes
- add decompression bomb guard with configurable image_max_pixels (#113) (bb840be)
- bump Pillow lower bound to >=12.2 for CVE-2026-40192 (#127) (35f9167)
- bump pypdf lower bound to 6.10.2 for DoS CVE fixes (e49878e), closes #126
- correct misleading native extensions list and add .htm dispatch test (afa2521)
- scope try/except per XObject iteration in _extract_page_images (#120) (2d8a820)
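The per-iteration try/except scoping mentioned for _extract_page_images can be sketched generically as follows. The `extract_all` and `decode` functions and the chosen exception types are hypothetical; the real function's body is not shown in this changelog.

```python
def extract_all(items, extract_one):
    """Extract every item, recording failures instead of aborting the loop."""
    results, failures = [], []
    for index, item in enumerate(items):
        try:
            results.append(extract_one(item))
        except (ValueError, OSError) as exc:
            # The try block wraps one iteration only, so the loop continues.
            failures.append((index, str(exc)))
    return results, failures


def decode(raw: bytes) -> str:
    """Toy extractor that rejects empty streams."""
    if not raw:
        raise ValueError("empty image stream")
    return raw.decode("ascii")


results, failures = extract_all([b"a", b"", b"c"], decode)
print(results, failures)
```

With the try block around the whole loop instead, the first bad item would discard every item after it.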
[Unreleased]
Bug Fixes
- clean up temp dirs created by save_media_to_temp after media copy (#176)
Security
- serialize Image.MAX_IMAGE_PIXELS mutation with threading.Lock for thread safety (#194)
- track PYSEC-2025-217 in transformers (transitive via docling): X-CLIP checkpoint deserialization RCE, no fix available yet (#201)
- bump pip floor to >=26.1 for CVE-2026-6357 (#183)
- bump pytest floor to >=9.0.3 for CVE-2025-71176 (#184)
- pin cryptography >=46.0.7 for CVE-2026-39892 (#181)
- drop direct twisted dep to remove CVE-2026-42304 exposure (#185)
- track CVE-2026-3219 in pip (build-only dep, no fix available yet) (#182)
- bump pypdf >=6.10.2 to address multiple High-severity DoS CVEs (CVE-2026-40260, GHSA-jj6c-8h6c-hppx, GHSA-4pxv-j86v-mhcw, GHSA-7gw9-cf7v-778f, GHSA-x284-j5p8-9c5p) (#126)
1.0.4 (2026-04-13)
Bug Fixes
- add extract_images parameter to config_for_backend (#85) (fbc9def)
- bump pypdf >=6.9.2 to address CVE-2026-33699 infinite loop DoS (#94) (67e2286)
- correct exception type in native_pdf.py and add missing test (cdcd0d3)
- decompose _extract_page_images to reduce cyclomatic complexity (#102) (a4547bd)
- extract shared attempt_save_image helper to eliminate DRY violation (#99) (964354d)
- replace hand-rolled YAML escaping with PyYAML serializer (#75) (37535e0)
- resolve merge conflict — reapply exception narrowing to refactored code (f8d8165)
- validate image bytes size and format before Pillow processing (#74) (ba5be45)
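One possible shape for validating image bytes before handing them to an image library is sketched below. The signature table, the size cap, and the `sniff_image` name are all assumptions for illustration; the project's actual checks may differ.

```python
# Magic-byte prefixes for a few common formats; this list is illustrative only.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "gif": b"GIF8",
}

MAX_IMAGE_BYTES = 50 * 1024 * 1024  # hypothetical cap


def sniff_image(data: bytes, max_bytes: int = MAX_IMAGE_BYTES):
    """Return the detected format name, or None if the bytes should be skipped."""
    if not data or len(data) > max_bytes:
        # Empty or oversized payloads are rejected before any decoding.
        return None
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


print(sniff_image(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sniff_image(b"not an image"))
```

Both checks are cheap and run on raw bytes, so malformed or oversized input never reaches the decoder.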
1.0.3 (2026-03-27)
Bug Fixes
- address security and code quality issues (#14, #37, #59, #63, #64) (4845e67)
- address security and code quality issues (#14, #37, #59, #63, #64) (9c9fb18)
- bump pypdf >=6.9.1 to address CVE-2026-33123 DoS vulnerability (#55) (b9e2a8c)
- strengthen type annotations and simplify docling availability check (00f951b)
- strengthen type annotations and simplify docling check (86dfa39)
Documentation
- sync documentation with codebase (f797c3d)
1.0.2 (2026-03-20)
Bug Fixes
1.0.1 (2026-03-17)
Bug Fixes
- add upper-bound version pins for markitdown and docling optional deps (ef63111), closes #30
- add upper-bound version pins for markitdown and docling optional… (4471a29)
- bump Pillow to >=12.1,<13 to address CVE-2026-25990 (#29) (ee2cc8e)
- move stdlib xml.etree.ElementTree import to TYPE_CHECKING block (#17) (829cea3)
- regenerate pixi.lock in release-please PR (c4f6c76)
- regenerate pixi.lock in release-please PR (1ea1ccb)
1.0.0 (2026-03-12)
- feat: embedded media extraction for PDF, DOCX, PPTX (per-document media folders with wikilinks)
- feat: config_for_backend() convenience API for quick single-backend configuration
- feat: MediaConfig for image extraction settings (format, max dimension, enable/disable)
- deps: added Pillow>=10.0,<12
- BREAKING: ImportConfig requires media: MediaConfig field
- BREAKING: backend extract() returns ExtractionResult (with .markdown and .media_files) instead of str
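For callers affected by the extract() return-type change, the migration might look roughly like this. The dataclass below is a stand-in that carries only the two attributes named above; the element type of media_files and the toy `extract` function are assumptions.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExtractionResult:
    """Stand-in with the two attributes named in the changelog entry."""

    markdown: str
    media_files: list[Path] = field(default_factory=list)  # element type is a guess


def extract(path: Path) -> ExtractionResult:
    """Toy backend; a real backend would parse the document at path."""
    return ExtractionResult(markdown=f"# {path.stem}\n", media_files=[])


# Before 1.0.0 callers used the return value as a string directly.
# From 1.0.0 on they read .markdown and handle .media_files separately.
result = extract(Path("report.pdf"))
text = result.markdown
print(text, len(result.media_files))
```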
0.2.0 (2026-03-10)
- Native backends for CSV, JSON, YAML, and image files
- Image embedding: generates Obsidian ![[filename]] wikilinks and copies source images to vault
- Pass-through mode: copy files as-is without extraction (configurable by extension, glob, regex)
- Per-extension backend configuration (backends.csv, backends.json, backends.yaml, backends.image)
- OutputConflictError exception for destination file conflicts
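A minimal sketch of how pass-through matching by extension, glob, and regex could work. The `is_passthrough` function and its parameter names are hypothetical and do not reflect the project's actual configuration keys.

```python
import re
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_passthrough(path, extensions=(), globs=(), regexes=()):
    """Return True if the path matches any extension, glob, or regex rule."""
    p = PurePosixPath(path)
    # Extension rules compare case-insensitively against the final suffix.
    if p.suffix.lower() in {e.lower() for e in extensions}:
        return True
    # Glob rules are applied to the file name only.
    if any(fnmatch(p.name, pattern) for pattern in globs):
        return True
    # Regex rules are searched against the whole relative path.
    return any(re.search(pattern, str(p)) for pattern in regexes)


print(is_passthrough("notes/a.MD", extensions=[".md"]))
print(is_passthrough("archive/2024/a.bin", regexes=[r"^archive/"]))
print(is_passthrough("a.pdf", extensions=[".md"], globs=["draft-*"]))
```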
0.1.0 (2026-03-09)
Initial release.
- Native backends: PDF (pdfplumber+pypdf), DOCX (defusedxml), PPTX (python-pptx), XLSX (openpyxl)
- Optional backends: markitdown (fallback), docling (high-quality)
- Config-driven backend selection per file type
- Glob-based file discovery with exclude patterns
- Obsidian-flavored markdown output with YAML frontmatter
- Click CLI: convert, discover, batch, doctor
- YAML configuration with deep-merge defaults
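Glob-based discovery with exclude patterns can be sketched with the standard library alone. The `discover` function, its parameters, and the pattern semantics here (fnmatch against the POSIX-style relative path) are assumptions, not the project's discover_files behaviour.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def discover(root: Path, include: str = "**/*", exclude: tuple[str, ...] = ()) -> list[Path]:
    """Return files under root matching include, minus any exclude pattern."""
    found = []
    for path in sorted(root.glob(include)):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Exclude patterns are matched against the path relative to root.
        if any(fnmatch(rel, pattern) for pattern in exclude):
            continue
        found.append(path)
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "docs" / "a.pdf").write_bytes(b"")
    (root / "b.pdf").write_bytes(b"")
    hits = discover(root, "**/*.pdf", exclude=("docs/*",))
    names = [p.relative_to(root).as_posix() for p in hits]
    print(names)
```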