Skip to content

Configuration

obsidian-import uses a YAML configuration file for all settings. Pass it to the CLI with --config or load it programmatically with load_config().

Full Configuration Reference

input:
  directories:
    - path: /path/to/documents
      extensions: [".pdf", ".docx", ".pptx", ".xlsx"]
      exclude: ["*.tmp", "~$*"]

output:
  directory: ./extracted
  frontmatter: true
  metadata_fields:
    - title
    - source
    - original_path
    - file_type
    - extracted_at
    - page_count

backends:
  pdf: native        # pdfplumber + pypdf
  docx: native       # defusedxml
  pptx: native       # python-pptx
  xlsx: native       # openpyxl
  csv: native        # built-in csv module
  json: native       # built-in json module
  yaml: native       # pyyaml
  image: native      # Pillow
  default: native    # fallback for unlisted extensions

extraction:
  timeout_seconds: 120
  max_file_size_mb: 100
  xlsx_max_rows_per_sheet: 500

media:
  extract_images: true
  image_format: png
  image_max_dimension: 0   # 0 = no limit

passthrough:
  extensions: []
  paths: []
  patterns: []

Sections

input

Defines where to find files for batch extraction.

Field Type Description
directories list List of directory entries to scan
directories[].path string Absolute or relative path to scan
directories[].extensions list[string] File extensions to include (e.g., [".pdf", ".docx"])
directories[].exclude list[string] Glob patterns to exclude (e.g., ["*.tmp", "~$*"])

output

Controls the format and location of extracted markdown files.

Field Type Description
directory string Output directory for batch extraction
frontmatter bool Include YAML frontmatter in output
metadata_fields list[string] Which metadata fields to include in frontmatter

backends

Maps file extensions to extraction backends. See Backends for details.

Field Type Description
pdf string Backend for .pdf files
docx string Backend for .docx files
pptx string Backend for .pptx files
xlsx string Backend for .xlsx files
csv string Backend for .csv files
json string Backend for .json files
yaml string Backend for .yaml / .yml files
image string Backend for image files (.png, .jpg, .gif, etc.)
default string Fallback backend for unlisted extensions

Valid values: native, markitdown, docling.

extraction

Controls extraction behavior.

Field Type Description
timeout_seconds int Maximum seconds per file extraction
max_file_size_mb int Maximum file size in megabytes
xlsx_max_rows_per_sheet int Maximum rows to extract per Excel sheet

media

Controls embedded image extraction from documents.

Field Type Description
extract_images bool Extract embedded images from PDF, DOCX, and PPTX files
image_format string Output format for extracted images (png, jpg)
image_max_dimension int Maximum width or height in pixels; 0 means no limit

When extract_images is enabled, images are saved to a per-document folder (<output_dir>/<doc_stem>/) and linked in the markdown as Obsidian wikilinks (![[doc_stem/page1_img0.png]]).

passthrough

Files matching any passthrough rule are copied as-is to the output directory without extraction. Rules are evaluated with OR logic — a file matching any single rule is passed through.

Field Type Description
extensions list[string] File extensions to copy without extraction (e.g., [".md", ".canvas"])
paths list[string] Glob patterns matched against the full file path
patterns list[string] Regular expression patterns matched against the full file path
passthrough:
  extensions: [".md", ".canvas"]
  paths: ["notes/raw/**"]
  patterns: [".*\\.generated\\..*"]

Loading Configuration

CLI

obsidian-import batch --config config.yaml

Python

from pathlib import Path
from obsidian_import.config import load_config

config = load_config(Path("config.yaml"))