Changelog

1.2.0 (2026-06-11)

Features

  • extraction guards: size limit at entry, process isolation, public config API (1f24d66, ecce207)

Bug Fixes

  • replace assert isinstance() with explicit runtime type checks in registry.py (#206) (2ee0ee8, 13e27aa, c1d2e85)

1.2.0 (2026-06-11)

Features

  • enforce max_file_size_mb at the extract_file/extract_text entry points — oversized files raise ExtractionError instantly before backend dispatch, instead of running into the extraction timeout when callers bypass discover_files
  • add config_from_overrides() public API: build an ImportConfig from a partial overrides dict deep-merged onto the bundled defaults (the supported path for library consumers such as m365-extract)
  • add extraction.isolation config ("thread" | "process", default "thread"): process mode runs each extraction in a separate spawned process that is killed on timeout — true cancellation and memory isolation for long-running daemons
  • include the source file size in ExtractionTimeoutError messages alongside label, timeout, and path
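
The deep-merge behaviour described for config_from_overrides() can be pictured with a minimal sketch. The helper name deep_merge, the timeout_seconds key, and all default values below are assumptions for illustration; they are not the library's implementation or its bundled defaults.

```python
from copy import deepcopy
from typing import Any, Mapping


def deep_merge(defaults: Mapping[str, Any], overrides: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict with overrides layered onto defaults, recursing into nested dicts."""
    merged = deepcopy(dict(defaults))
    for key, value in overrides.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults; the real bundled defaults may differ.
DEFAULTS = {
    "max_file_size_mb": 100,
    "extraction": {"isolation": "thread", "timeout_seconds": 120},
}

# A partial override only needs to name the keys it changes.
config = deep_merge(DEFAULTS, {"extraction": {"isolation": "process"}})
```

Untouched sibling keys keep their default values, and the defaults themselves are not mutated, which is the property a partial-overrides API relies on.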

Bug Fixes

  • process isolation enforces the deadline on every parent-side wait: receiving the payload is bounded by a watchdog that kills a stalled child, and the worker is reaped (bounded join, then kill) on success, error, and interrupt paths — a child kept alive by a leftover non-daemon thread can no longer block the caller indefinitely
  • worker exceptions that cannot survive the pickle round-trip are reported as ExtractionError with the original message, instead of escaping as a raw TypeError that aborted CLI batch runs; payloads that still fail to unpickle parent-side are wrapped defensively
  • the extraction worker process is no longer daemonic, so backends may spawn their own worker processes (e.g. docling's torch DataLoader)
  • stat() failures in the entry-point size guard (file vanished between discovery and extraction) raise ExtractionError, keeping the ObsidianImportError contract so CLI batch runs print FAIL and continue
  • the "process died without a result" error now explains the if __name__ == "__main__": guard required for script consumers under spawn
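
The pickle round-trip fix can be illustrated with a small sketch. It assumes a worker that checks, before sending an exception to the parent, whether the exception survives pickling; the ExtractionError class here is a stand-in, and portable_exception and UnpicklableError are hypothetical names, not part of the library.

```python
import pickle


class ExtractionError(Exception):
    """Stand-in for the library's ExtractionError."""


class UnpicklableError(Exception):
    """Hypothetical backend error whose two-argument __init__ breaks unpickling."""

    def __init__(self, path, reason):
        # Only one argument reaches Exception.args, so unpickling later
        # calls UnpicklableError("<message>") and fails with TypeError.
        super().__init__(f"{path}: {reason}")
        self.path = path
        self.reason = reason


def portable_exception(exc: BaseException) -> BaseException:
    """Return exc if it survives a pickle round-trip, else a plain ExtractionError."""
    try:
        pickle.loads(pickle.dumps(exc))
    except Exception:
        # Keep the original message; the type-name prefix is this sketch's choice.
        return ExtractionError(f"{type(exc).__name__}: {exc}")
    return exc


wrapped = portable_exception(UnpicklableError("a.pdf", "boom"))
untouched = portable_exception(ValueError("bad input"))
```

Checking the round-trip on the worker side means the parent only ever receives exceptions it can rebuild, so a batch run can report the failure and move on.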

1.1.2 (2026-05-20)

Bug Fixes

  • call _cleanup_temp_source in copy_media_files to remove temp dirs (6682e90)
  • clean up temp dirs created by save_media_to_temp after media copy (#176) (026e76c)
  • sanitize PDF form field values against markdown injection (#171) (9b22822)
  • serialize Image.MAX_IMAGE_PIXELS mutation with threading.Lock (#194) (3b389c6)
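
As a rough sketch of what serializing a module-level setting with threading.Lock looks like: the context manager below holds a lock while the value is changed and restores the previous value on exit. The Image object is a stand-in namespace so the example runs without Pillow, and pixel_limit is a hypothetical name; the library's actual code may be structured differently.

```python
import threading
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for PIL.Image so the sketch runs without Pillow installed.
Image = SimpleNamespace(MAX_IMAGE_PIXELS=89_478_485)

_PIXEL_LIMIT_LOCK = threading.Lock()


@contextmanager
def pixel_limit(limit):
    """Temporarily set Image.MAX_IMAGE_PIXELS; only one thread at a time may do so."""
    with _PIXEL_LIMIT_LOCK:
        previous = Image.MAX_IMAGE_PIXELS
        Image.MAX_IMAGE_PIXELS = limit
        try:
            yield
        finally:
            # Restore even if image processing raised.
            Image.MAX_IMAGE_PIXELS = previous


with pixel_limit(1_000):
    inside = Image.MAX_IMAGE_PIXELS
after = Image.MAX_IMAGE_PIXELS
```

The lock in this sketch is not reentrant, so nesting pixel_limit() calls on one thread would deadlock; a threading.RLock would be the alternative if nesting were needed.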

Documentation

  • add cleanup side-effect to copy_media_files docstring and CHANGELOG entry (152709b)

1.1.1 (2026-05-11)

Bug Fixes

  • scope Image.MAX_IMAGE_PIXELS mutation to _process_image_bytes lifetime (#163) (037974c)

1.1.0 (2026-04-28)

Features

  • add .html as a first-class backend config key (4602f1b, e64b82f)

Bug Fixes

  • add decompression bomb guard with configurable image_max_pixels (#113) (bb840be)
  • bump Pillow lower bound to >=12.2 for CVE-2026-40192 (#127) (35f9167)
  • bump pypdf lower bound to 6.10.2 for DoS CVE fixes (e49878e), closes #126
  • correct misleading native extensions list and add .htm dispatch test (afa2521)
  • scope try/except per XObject iteration in _extract_page_images (#120) (2d8a820)
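
Scoping try/except per iteration means one bad embedded object no longer discards the rest of the page. A minimal sketch of the pattern, with a hypothetical extract_all helper and a caller-supplied decode function standing in for the real image handling (the exception types the library actually catches are not stated here):

```python
from typing import Callable, Iterable


def extract_all(
    xobjects: Iterable[bytes], decode: Callable[[bytes], str]
) -> tuple[list[str], list[str]]:
    """Decode each object independently, collecting failures instead of aborting."""
    images: list[str] = []
    errors: list[str] = []
    for index, raw in enumerate(xobjects):
        try:
            # The try block wraps a single iteration, not the whole loop.
            images.append(decode(raw))
        except ValueError as exc:
            errors.append(f"xobject {index}: {exc}")
    return images, errors


def fake_decode(raw: bytes) -> str:
    """Hypothetical decoder that rejects one specific payload."""
    if raw == b"bad":
        raise ValueError("cannot decode")
    return raw.decode("ascii")


images, errors = extract_all([b"one", b"bad", b"three"], fake_decode)
```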

[Unreleased]

Bug Fixes

  • clean up temp dirs created by save_media_to_temp after media copy (#176)

Security

  • serialize Image.MAX_IMAGE_PIXELS mutation with threading.Lock for thread safety (#194)
  • track PYSEC-2025-217 in transformers (transitive via docling): X-CLIP checkpoint deserialization RCE, no fix available yet (#201)
  • bump pip floor to >=26.1 for CVE-2026-6357 (#183)
  • bump pytest floor to >=9.0.3 for CVE-2025-71176 (#184)
  • pin cryptography >=46.0.7 for CVE-2026-39892 (#181)
  • drop direct twisted dep to remove CVE-2026-42304 exposure (#185)
  • track CVE-2026-3219 in pip (build-only dep, no fix available yet) (#182)
  • bump pypdf >=6.10.2 to address multiple High-severity DoS CVEs (CVE-2026-40260, GHSA-jj6c-8h6c-hppx, GHSA-4pxv-j86v-mhcw, GHSA-7gw9-cf7v-778f, GHSA-x284-j5p8-9c5p) (#126)

1.0.4 (2026-04-13)

Bug Fixes

  • add extract_images parameter to config_for_backend (#85) (fbc9def)
  • bump pypdf >=6.9.2 to address CVE-2026-33699 infinite loop DoS (#94) (67e2286)
  • correct exception type in native_pdf.py and add missing test (cdcd0d3)
  • decompose _extract_page_images to reduce cyclomatic complexity (#102) (a4547bd)
  • extract shared attempt_save_image helper to eliminate DRY violation (#99) (964354d)
  • replace hand-rolled YAML escaping with PyYAML serializer (#75) (37535e0)
  • resolve merge conflict — reapply exception narrowing to refactored code (f8d8165)
  • validate image bytes size and format before Pillow processing (#74) (ba5be45)
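
Validating image bytes before handing them to an imaging library usually comes down to a size check plus a look at the leading magic bytes. The sketch below assumes a hypothetical sniff_image_format helper and an arbitrary 50 MB ceiling; the formats and limits the library actually enforces may differ.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"
MAX_IMAGE_BYTES = 50 * 1024 * 1024  # hypothetical ceiling, not the library's value


def sniff_image_format(data: bytes, max_bytes: int = MAX_IMAGE_BYTES) -> str:
    """Return 'png' or 'jpeg' for acceptable payloads, else raise ValueError."""
    if not data:
        raise ValueError("empty image payload")
    if len(data) > max_bytes:
        raise ValueError(f"image payload is {len(data)} bytes, limit is {max_bytes}")
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    raise ValueError("unrecognised image format")


png_kind = sniff_image_format(PNG_MAGIC + b"\x00" * 16)
jpeg_kind = sniff_image_format(JPEG_MAGIC + b"\xe0" + b"\x00" * 16)
```

Rejecting payloads this early keeps malformed or oversized data away from the decoder entirely.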

1.0.3 (2026-03-27)

Bug Fixes

  • address security and code quality issues (#14, #37, #59, #63, #64) (4845e67, 9c9fb18)
  • bump pypdf >=6.9.1 to address CVE-2026-33123 DoS vulnerability (#55) (b9e2a8c)
  • strengthen type annotations and simplify docling availability check (00f951b, 86dfa39)

Documentation

  • sync documentation with codebase (f797c3d)

1.0.2 (2026-03-20)

Bug Fixes

  • eliminate hidden mutation side-effect in _match_image_ref (#41) (e6dc690)

1.0.1 (2026-03-17)

Bug Fixes

  • add upper-bound version pins for markitdown and docling optional deps (ef63111, 4471a29), closes #30
  • bump Pillow to >=12.1,<13 to address CVE-2026-25990 (#29) (ee2cc8e)
  • move stdlib xml.etree.ElementTree import to TYPE_CHECKING block (#17) (829cea3)
  • regenerate pixi.lock in release-please PR (c4f6c76, 1ea1ccb)

1.0.0 (2026-03-12)

  • feat: embedded media extraction for PDF, DOCX, PPTX (per-document media folders with wikilinks)
  • feat: config_for_backend() convenience API for quick single-backend configuration
  • feat: MediaConfig for image extraction settings (format, max dimension, enable/disable)
  • deps: added Pillow>=10.0,<12
  • BREAKING: ImportConfig requires media: MediaConfig field
  • BREAKING: backend extract() returns ExtractionResult (with .markdown and .media_files) instead of str
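
For callers affected by the extract() return-type change, the adjustment is roughly the one sketched below. The ExtractionResult dataclass is a stand-in with only the two attributes named above; the list[Path] type for media_files, the extract function body, and the file names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExtractionResult:
    """Stand-in carrying the two attributes the changelog names."""

    markdown: str
    media_files: list[Path] = field(default_factory=list)  # element type is assumed


def extract(path: Path) -> ExtractionResult:
    """Hypothetical backend; a real one would parse the document."""
    return ExtractionResult(
        markdown=f"# {path.stem}\n",
        media_files=[Path("media") / f"{path.stem}-1.png"],
    )


result = extract(Path("report.pdf"))
text = result.markdown      # code that used the returned str now reads .markdown
media = result.media_files  # extracted media is reported alongside the text
```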

0.2.0 (2026-03-10)

  • Native backends for CSV, JSON, YAML, and image files
  • Image embedding: generates Obsidian ![[filename]] wikilinks and copies source images to vault
  • Pass-through mode: copy files as-is without extraction (configurable by extension, glob, regex)
  • Per-extension backend configuration (backends.csv, backends.json, backends.yaml, backends.image)
  • OutputConflictError exception for destination file conflicts
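
Pass-through selection by extension, glob, or regex can be sketched as a single predicate. The function name is_passthrough, its parameters, and the matching rules (globs against the file name, regexes against the whole path) are assumptions for illustration, not the library's configuration schema.

```python
import fnmatch
import re
from pathlib import PurePath
from typing import Iterable


def is_passthrough(
    path: str,
    extensions: Iterable[str] = (),
    globs: Iterable[str] = (),
    regexes: Iterable[str] = (),
) -> bool:
    """Return True if path matches any extension, glob, or regex rule."""
    p = PurePath(path)
    # Extensions are compared case-insensitively, including the leading dot.
    if p.suffix.lower() in {ext.lower() for ext in extensions}:
        return True
    # Globs are matched against the file name only.
    if any(fnmatch.fnmatch(p.name, pattern) for pattern in globs):
        return True
    # Regexes are searched against the full path string.
    return any(re.search(pattern, path) for pattern in regexes)


by_ext = is_passthrough("notes/a.MD", extensions=[".md"])
by_glob = is_passthrough("x/draft-1.txt", globs=["draft-*"])
by_regex = is_passthrough("archive/old.pdf", regexes=[r"^archive/"])
no_match = is_passthrough(
    "docs/report.pdf", extensions=[".md"], globs=["draft-*"], regexes=[r"^archive/"]
)
```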

0.1.0 (2026-03-09)

Initial release.

  • Native backends: PDF (pdfplumber+pypdf), DOCX (defusedxml), PPTX (python-pptx), XLSX (openpyxl)
  • Optional backends: markitdown (fallback), docling (high-quality)
  • Config-driven backend selection per file type
  • Glob-based file discovery with exclude patterns
  • Obsidian-flavored markdown output with YAML frontmatter
  • Click CLI: convert, discover, batch, doctor
  • YAML configuration with deep-merge defaults
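
Glob-based discovery with exclude patterns can be approximated with pathlib and fnmatch. The discover function below is a hypothetical sketch, not the library's discover_files; in particular, matching excludes against the POSIX-style path relative to the root is an assumption.

```python
import fnmatch
import tempfile
from pathlib import Path


def discover(root: Path, pattern: str = "**/*", exclude: tuple[str, ...] = ()) -> list[Path]:
    """Return files under root matching pattern, minus any excluded by a glob."""
    found = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        relative = path.relative_to(root).as_posix()
        # fnmatch's "*" also matches "/", so "drafts/*" covers nested files too.
        if any(fnmatch.fnmatch(relative, rule) for rule in exclude):
            continue
        found.append(path)
    return found


# Small demonstration against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "drafts").mkdir()
    for name in ("a.pdf", "drafts/b.pdf", "c.txt"):
        (root / name).write_text("x")
    names = [
        p.relative_to(root).as_posix()
        for p in discover(root, "**/*.pdf", exclude=("drafts/*",))
    ]
```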