Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   Copyright 2026 Zurc Bolzano, Maker of Content, 
                                Lord of the Render Queue, 
                                He Who Forgot to Sleep.

   The running joke is that I am still making content, just now it
   is legal, consenual and primarily focused on primary prevention.

DocWeave

Provenance matters in documents. It matters in people too.

DocWeave is a modular document pipeline for turning PDFs into trustworthy, AI-ready artifacts. It combines native PDF text, OCR, layout analysis, image extraction, chunk planning, and provenance tracking into a staged workflow that can be reused across books, papers, manuals, scans, and other document collections.

DocWeave targets a gap that existing open-source parsers do not prioritize well: local-first, modest-VRAM document processing that preserves both AI-friendly Markdown structure and OCR traceability artifacts such as hOCR.

Why DocWeave?

Built with the belief that what survives deserves to be carried carefully.

  • preserve source fidelity whenever native text is better than OCR
  • use OCR when native text or layout is missing or weak
  • retain structure, layout, images, and provenance for AI workflows
  • keep outputs compact enough for local LLM workflows
  • keep every major step modular and testable

In short: DocWeave is meant to weave together native text, OCR, layout, and metadata rather than forcing a single extraction strategy onto every page.

Current Status

  • The Python package already provides the reusable planning core: discovery, canonical naming, chunking, batching, manifests, and a stage-based runner.
  • The full OCR and AI-deliverable workflow is still partly in a local legacy bridge while the rest of the pipeline is migrated subsystem by subsystem.
  • Currently undergoing a full re-write - please be patient :) encuoragement is always welcome.

Quick Start

Install the core package:

python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"

Run the current planning stage:

docweave plan \
  --input-dir ./input_pdfs \
  --canonical-dir ./work/canonical \
  --manifests-dir ./work/manifests \
  --chunk-output-dir ./work/chunks \
  --page-chunk-size 150 \
  --max-chunks-per-batch 8 \
  --print-summary

This produces:

  • documents_manifest.csv
  • chunks_manifest.csv
  • batches_manifest.csv

Benchmark Positioning

DocWeave is not trying to prove that it is the highest-ceiling parser overall. The benchmark posture is narrower:

Is this pipeline adequate for a local-first user with limited budget and modest hardware who needs compact AI-ready Markdown plus OCR traceability?

That methodology, the fairness rules, and the current comparison framing versus Docling, Marker, and MinerU are documented in docs/benchmarking.md.

Showcase Results

On the public synthetic weaving showcase, DocWeave’s reference bundle currently clears all 8/8 structure-sensitive cases while staying compact and explicitly AI-oriented. On this host, the best local-feasible comparison runs from the other open-source pipelines did not meet the same adequacy bar for this use case.

Variable DocWeave Reference Docling Best Local Run Marker Best Local Run MinerU Best Local Run
Best mode on this host reference bundle GPU GPU CPU
Cases passed 8/8 4/8 1/8 3/8
Adequate for target local-first use case? yes no no no
Main shortfall n/a breaks reading order, weak code fidelity, no provenance JSON weak structure retention and much larger bundle weak structure retention and much larger bundle

What is interesting here is not that DocWeave “wins every benchmark.” It is that, under a deliberately modest-hardware, local-first benchmark profile, DocWeave is currently the only pipeline in this comparison that produces the compact Markdown-plus-provenance shape the project is targeting. The full methodology, fairness rules, and per-mode results are in docs/benchmarking.md.

Acknowledgments

DocWeave relies heavily on PaddlePaddle for OCR and document-processing capabilities. Respect to the PaddlePaddle authors for building and open-sourcing the framework this work stands on.

PaddlePaddle is distributed under the Apache License 2.0.

DocWeave was shaped in conversation with Codex, OpenAI’s coding agent, during design and implementation.

Notes

  • The published package name in pyproject.toml is docweave.
  • The internal Python import package remains document_pipeline for migration stability.
  • The public repo is intentionally engine-first and excludes some local application-specific orchestration.

DocWeave Project

This directory holds the longer-form project documentation that used to live in the root README.

Start here if you want a map of the repo:

If you are landing here from GitHub:

  1. Read the repo README for the short project summary.
  2. Read overview.md for the project scope and migration status.
  3. Read setup.md before attempting GPU OCR or Windows/WSL configuration.

Architecture and Modules

Public Modules and Responsibilities

Core package

Python worker and benchmark scripts

The original PowerShell orchestration layer is intentionally not part of the public engine-first source drop because it still contains application-specific workflow assumptions. The Python package and Python worker scripts are the published base going forward.

Repository Structure

.
├── pyproject.toml
├── README.md
├── docs/
│   ├── README.md
│   ├── architecture.md
│   ├── benchmarking.md
│   ├── cli.md
│   ├── overview.md
│   ├── setup.md
│   └── testing.md
├── bin/
│   ├── benchmark_local_compare.py
│   ├── benchmark_paddle_direct.py
│   ├── benchmark_weaving_showcase.py
│   ├── paddle_direct_extract.py
│   ├── paddle_ocr_pdf.py
│   ├── build_ai_chunk_artifacts.py
│   └── run_paddleocr_images.py
├── fixtures/
│   ├── showcase/
│   └── smoke/
├── src/
│   └── document_pipeline/
│       ├── cli.py
│       ├── config.py
│       ├── context.py
│       ├── discovery.py
│       ├── chunking.py
│       ├── manifests.py
│       ├── models.py
│       ├── naming.py
│       ├── pdf_utils.py
│       ├── planning.py
│       └── stages/
│           ├── base.py
│           └── plan.py
└── tests/
    ├── test_cli.py
    ├── test_local_compare_benchmark.py
    ├── test_manifests.py
    ├── test_naming.py
    ├── test_paddle_direct_compare.py
    ├── test_planning.py
    ├── test_synthetic_smoke_fixture.py
    ├── test_synthetic_weaving_showcase.py
    └── test_weaving_showcase_benchmark.py

Key Functions

The current Python package is small enough that the main reusable entry points are worth calling out explicitly:

  • discover_documents(...)
    • scan an input directory, canonicalize names, compute hashes, and create SourceDocument records
  • build_chunks(...)
    • split documents into deterministic page-range chunks
  • assign_batches(...)
    • group chunks into stable batch assignments
  • build_plan(...)
    • compose discovery, chunking, and batching into a PipelinePlan
  • write_plan_manifests(...)
    • write the current planning outputs to CSV
  • run_stages(...)
    • execute a sequence of reusable pipeline stages against a shared context
  • document_pipeline.cli.main(...)
    • CLI entrypoint for the current planning stage

Benchmarking

Short Comparison

DocWeave is not the only open-source project in this space. The distinction is in what each project optimizes for.

Project Best known for Typical outputs Positioning relative to DocWeave
DocWeave Native-text-first routing, hybrid document reasoning, compact AI bundles, explicit provenance Markdown, provenance JSON, extracted images Aims to preserve the best available text source per region or page, keep outputs small and AI-readable, and retain machine-readable provenance for what was filtered, merged, or externalized.
Docling Broad document-format coverage and strong general document understanding Markdown, HTML, lossless JSON, DocTags More mature and broader in format support. Closer to a full document platform; less specifically centered on the native-text-first plus hybrid-weaving philosophy DocWeave is pursuing.
Marker Fast PDF and image conversion to Markdown or JSON with strong practical defaults Markdown, JSON, HTML, chunks Very strong PDF-to-Markdown baseline. More conversion-focused; less explicit about provenance-rich hybrid source selection.
MinerU LLM-ready parsing with reading-order recovery, layout handling, OCR, formula and table extraction Markdown, JSON, intermediate multimodal formats Probably the closest public peer on “AI-ready outputs.” Strong on layout recovery and cleanup; DocWeave is more explicitly aiming at traceable native-vs-OCR weaving and compact Markdown plus provenance bundles.

This table is intentionally short. It is a positioning summary, not a claim that DocWeave already exceeds the maturity or feature breadth of those projects today.

Benchmark Snapshot

The repo includes a synthetic public showcase fixture and a direct-Paddle benchmark specifically so comparisons can be made without quietly changing the OCR profile to something easier than what DocWeave actually uses.

The benchmark is not trying to prove that DocWeave is the highest-ceiling document parser overall. It is trying to answer a narrower question:

Is this pipeline adequate for a local-first user with limited budget and modest hardware who needs compact AI-ready Markdown plus OCR traceability?

Comparison setup for the direct Paddle baseline:

  • render at 400 DPI
  • use the same preprocessing profile DocWeave currently uses:
    • paddle-book
  • use the same explicit OCR pair DocWeave uses:
    • detector: PP-OCRv5_server_det
    • recognizer: en_PP-OCRv5_mobile_rec
  • use the same document layout detector:
    • PP-DocLayout_plus-L

Fairness rules for the direct Paddle baseline:

  • keep the render setting at 400 DPI, because that matches the current DocWeave application profile
  • use the same explicit OCR pair DocWeave uses:
    • detector: PP-OCRv5_server_det
    • recognizer: en_PP-OCRv5_mobile_rec
  • use the same document layout detector:
    • PP-DocLayout_plus-L
  • do not enable the broader layout_parsing stack by default for the direct baseline, because that changes the workload class and undercuts the comparison

Local result from the tuned direct-Paddle comparison on March 25, 2026 using fixtures/showcase/docweave_weaving_showcase.pdf:

Benchmark Profile Result
Direct Paddle CPU 400 DPI, PP-OCRv5_server_det, en_PP-OCRv5_mobile_rec, PP-DocLayout_plus-L failed after 58.59s with a Paddle ResourceExhaustedError while trying to allocate about 51.7 GB on CPU
Direct Paddle GPU same tuned profile completed in 36.14s, produced a 175,633 byte artifact bundle

What that means:

  • the direct vendor stack itself already benefits materially from a modest local GPU
  • the CPU path, at the same 400 DPI and model profile, is not currently a practical baseline on this laptop
  • the direct Paddle GPU output preserved little of the higher-level showcase structure on its own, which is exactly the gap DocWeave is meant to fill

Benchmark Methodology

DocWeave’s benchmark methodology is intentionally narrow and explicit.

Target use case:

  • a local-first user
  • limited budget
  • modest hardware such as a laptop GPU
  • needs compact Markdown that is usable by LLMs
  • also values OCR traceability artifacts such as hOCR

What the benchmark is measuring:

  • whether a pipeline completes on the local machine
  • whether it produces compact, AI-usable Markdown
  • whether key structure survives:
    • reading order
    • header/footer filtering
    • figure and caption separation
    • externalized scan or figure artifacts
  • whether the output stays small enough to be practical for local AI workflows

What the benchmark is not claiming:

  • that DocWeave is the only document parser in existence
  • that DocWeave is the strongest parser in every environment
  • that server-class or high-VRAM profiles are irrelevant

Fairness rules:

  • run one tool at a time, never simultaneously
  • use explicit documented OCR and layout profiles where the tool exposes them
  • keep DocWeave and direct-Paddle baselines at 400 DPI
  • benchmark the local-feasible profile for each tool, not its largest server-oriented configuration
  • warm model caches separately so timed runs do not silently include first-download costs
  • record host power mode, GPU memory, and RAM usage during the benchmark
  • require a complete runnable comparison env for each live tool:
    • Docling fair-profile runs require easyocr
    • MinerU fair-profile runs require mineru[pipeline], not a bare mineru install

Adequacy thresholds for the local-first use case:

  • the run must complete successfully on the host in that mode
  • the run must produce Markdown
  • the main Markdown artifact must not inline images as base64
  • the run must pass at least 6/8 showcase structure cases
  • the run must pass all core cases:
    • two-column reading order
    • header/footer filtering
    • figure externalization with caption
    • image-only scan externalization
  • the total artifact bundle must stay within 1.5x the source PDF size

How to read the final comparison table:

  • Adequate for target local-first use case? means adequate on this host, in this mode, under the explicit benchmark profile
  • a no verdict does not mean a tool is bad overall
  • it means that, under the constraints above, the tool was not adequate for this specific local-first LLM-oriented workflow

Benchmark Entry Points

Direct Paddle:

PYTHONPATH=src python bin/benchmark_paddle_direct.py \
  --output-dir ./tmp_paddle_direct_compare_tuned

Multi-tool local comparison:

PYTHONPATH=src python bin/benchmark_local_compare.py \
  --output-dir ./tmp_local_compare

CLI Reference

Quick Start

The current Python CLI provides the planning stage.

Example:

docweave plan \
  --input-dir ./input_pdfs \
  --canonical-dir ./work/canonical \
  --manifests-dir ./work/manifests \
  --chunk-output-dir ./work/chunks \
  --page-chunk-size 150 \
  --max-chunks-per-batch 8 \
  --print-summary

This produces planning manifests:

  • documents_manifest.csv
  • chunks_manifest.csv
  • batches_manifest.csv

CLI

Current command set:

  • docweave plan
    • discovers PDFs
    • canonicalizes names
    • computes page chunks
    • assigns batches
    • writes manifests

Run help:

docweave plan --help

Command-Line Parameters

This section documents the current CLI surface area in the repo. The Python package is still in staged migration, so some of the most capable entrypoints still live in the Python worker scripts.

docweave plan

Command:

docweave plan [options]

Parameters:

Parameter Default What it does
--input-dir required Directory to scan for source PDFs.
--canonical-dir required Directory where canonicalized filenames are planned.
--manifests-dir required Directory where planning CSV manifests are written.
--chunk-output-dir required Directory where planned chunk artifacts are expected to live.
--page-chunk-size 150 Number of pages per output chunk.
--max-chunks-per-batch 8 Maximum chunk count grouped into one batch.
--batch-prefix batch Prefix used for batch identifiers such as batch_01.
--semantic-separator __ Separator between semantic name fields such as title, author, and year.
--token-separator _ Separator used inside normalized tokens.
--keep-case off Preserve source casing instead of lowercasing canonical names.
--print-summary off Print a JSON summary of discovered documents, chunks, and batches.

bin/paddle_ocr_pdf.py

Command:

python bin/paddle_ocr_pdf.py input.pdf output.pdf [options]

Parameters:

Parameter Default What it does
input_pdf required Source PDF to OCR.
output_pdf required Output searchable PDF path.
--lang en PaddleOCR language code.
--render-dpi 400 Render DPI used when rasterizing PDF pages.
--preprocess paddle-book Preprocessing profile for rendered pages. Valid values: none, paddle-book.
--max-side 3600 Maximum image side length after preprocessing.
--sidecar-dir empty Directory where hOCR, text, and layout sidecars are written.
--keep-rendered-images off Preserve rendered page images on disk for inspection.
--det-model-name PP-OCRv5_server_det Paddle detection model name or model directory reference.
--rec-model-name en_PP-OCRv5_mobile_rec Paddle recognition model name or model directory reference.
--device cpu Paddle device, for example cpu or gpu:0.
--page-batch-size 4 Number of pages processed per in-memory Paddle batch.
--use-layout-detection off Emit layout-detection results for each page.
--layout-det-model-name PP-DocLayout_plus-L Layout-detection model name.
--use-layout-analysis off Emit layout-analysis results for each page.
--layout-analysis-pipeline layout_parsing PaddleX layout-analysis pipeline name.

bin/paddle_direct_extract.py

Command:

python bin/paddle_direct_extract.py input.pdf output_dir [options]

Parameters:

Parameter Default What it does
input_pdf required Source PDF to process with direct Paddle APIs.
output_dir required Output directory for the direct baseline artifact bundle.
--lang en PaddleOCR language code.
--render-dpi 400 Render DPI used for page rasterization.
--preprocess paddle-book Preprocessing profile for rendered pages. Valid values: none, paddle-book.
--max-side 3600 Maximum image side length after preprocessing.
--det-model-name PP-OCRv5_server_det Paddle detection model reference.
--rec-model-name en_PP-OCRv5_mobile_rec Paddle recognition model reference.
--device cpu Paddle runtime device, for example cpu or gpu:0.
--page-batch-size 4 Number of pages processed per OCR batch.
--use-layout-detection off Emit layout-detection JSON sidecars.
--layout-det-model-name PP-DocLayout_plus-L Layout-detection model reference.
--use-layout-analysis off Emit PaddleX layout-analysis JSON sidecars.
--layout-analysis-pipeline layout_parsing PaddleX layout-analysis pipeline name.
--keep-rendered-images off Preserve rendered page PNGs in the output bundle.

bin/benchmark_paddle_direct.py

Command:

python bin/benchmark_paddle_direct.py [options]

Parameters:

Parameter Default What it does
--input-pdf fixtures/showcase/docweave_weaving_showcase.pdf Public showcase PDF used for the direct Paddle benchmark.
--output-dir fixtures/showcase/paddle_direct_compare Directory where the benchmark report, JSON, and live artifacts are written.
--paddle-python .venvs/paddlegpu-build/bin/python Python executable from the Paddle runtime environment.
--paddle-env-dir .venvs/paddlegpu-build Environment directory whose install size is recorded.
--extract-script bin/paddle_direct_extract.py Direct-Paddle extraction script invoked by the benchmark.
--modes cpu,gpu Comma-separated execution modes to run.
--skip-live-runs off Emit only the benchmark profile and report scaffold without invoking Paddle.
--profile docweave-tuned Benchmark profile. docweave-tuned matches the DocWeave comparison setup: 400 DPI, paddle-book, PP-OCRv5_server_det, en_PP-OCRv5_mobile_rec, and PP-DocLayout_plus-L. paddle-full-layout enables the broader PaddleX layout-analysis stack.
--lang en PaddleOCR language code.
--render-dpi 400 Render DPI used for both CPU and GPU runs. The fair baseline keeps this at 400 DPI because that is the current DocWeave application setting.
--preprocess paddle-book Rendered-page preprocessing profile. Valid values: none, paddle-book. The fair direct-Paddle baseline uses paddle-book to match DocWeave.
--max-side 3600 Maximum rendered image side length.
--det-model-name PP-OCRv5_server_det Paddle detection model reference.
--rec-model-name en_PP-OCRv5_mobile_rec Paddle recognition model reference.
--page-batch-size 4 Pages processed per OCR batch.
--use-layout-detection profile-dependent Force-enable layout detection.
--no-layout-detection off Force-disable layout detection.
--layout-det-model-name PP-DocLayout_plus-L Layout-detection model reference.
--use-layout-analysis profile-dependent Force-enable the broader PaddleX layout-analysis pipeline.
--no-layout-analysis off Force-disable the broader PaddleX layout-analysis pipeline.
--layout-analysis-pipeline layout_parsing PaddleX layout-analysis pipeline name.
--keep-rendered-images off Preserve rendered page images in the output bundle.
--timeout-seconds 1800 Per-run timeout for CPU or GPU execution.
--sample-interval, --gpu-sample-interval 5.0 Sampling interval in seconds for GPU and RAM monitoring. The benchmark records peak and average usage for both.

bin/benchmark_local_compare.py

Command:

python bin/benchmark_local_compare.py [options]

Parameters:

Parameter Default What it does
--input-pdf fixtures/showcase/docweave_weaving_showcase.pdf Public showcase PDF used for the local multi-tool comparison.
--expected-markdown fixtures/showcase/expected/content.md Expected DocWeave markdown artifact used as the quality reference bundle.
--expected-provenance fixtures/showcase/expected/provenance.json Expected DocWeave provenance JSON used as the reference bundle.
--expected-images-dir fixtures/showcase/expected/images Expected extracted-image directory used as the reference bundle.
--output-dir fixtures/showcase/local_compare Directory where the comparison report, JSON, warmup outputs, and live tool artifacts are written.
--tools docling,marker,mineru Comma-separated live tools to benchmark. Supported values: docling, marker, mineru, docweave-live.
--modes cpu,gpu Comma-separated execution modes to benchmark.
--skip-live-tools off Emit only the reference report scaffold without running external tools.
--timeout-seconds 1800 Per-run timeout for each live tool invocation.
--sample-interval, --gpu-sample-interval 5.0 Sampling interval in seconds for GPU and RAM monitoring. The harness runs serially and records peak and average usage.
--warmup-live-tools on Warm model caches on the smaller public smoke PDF before the timed showcase run.
--no-warmup-live-tools off Disable the warmup pass and benchmark from a colder start.
--warmup-input-pdf fixtures/smoke/docweave_pairwise_public_fixture.pdf Smaller public PDF used only for cache/model warmup.
--docling-bin .venvs/docling/bin/docling Path to the Docling CLI binary.
--marker-bin .venvs/marker/bin/marker_single Path to the Marker CLI binary.
--mineru-bin .venvs/mineru/bin/mineru Path to the MinerU CLI binary. For a fair local comparison, this env should be installed as mineru[pipeline] so the pipeline backend actually has torch, doclayout_yolo, and the OCR stack available.
--docling-env-dir .venvs/docling Environment directory whose install size is recorded for Docling.
--marker-env-dir .venvs/marker Environment directory whose install size is recorded for Marker.
--marker-highres-image-dpi 400 Marker OCR DPI. This is held at 400 to stay fair to the DocWeave local profile.
--marker-lowres-image-dpi 96 Marker layout DPI.
--mineru-env-dir .venvs/mineru Environment directory whose install size is recorded for MinerU.
--docweave-command empty Optional shell command template for benchmarking a live DocWeave run. Available placeholders: {input_pdf}, {output_dir}, {mode}, {device}.
--docweave-env-dir empty Optional environment directory whose install size is recorded for a live DocWeave run.
--mineru-gpu-vram-mb 5500 VRAM cap passed to MinerU in GPU mode to keep the comparison in the modest-hardware class.

bin/build_ai_chunk_artifacts.py

Command:

python bin/build_ai_chunk_artifacts.py [options]

Parameters:

Parameter Default What it does
--canonical-pdf required Canonical source PDF used to render page and media crops.
--page-manifest-json required Per-page manifest JSON describing sidecars and provenance for the chunk.
--output-dir required Output directory for the AI bundle.
--book-id required Logical document identifier recorded in provenance.
--page-start required First page number in the chunk, 1-based.
--page-end required Last page number in the chunk, 1-based.
--markdown-name content.md Markdown filename to write in the output bundle.
--provenance-name provenance.json Provenance JSON filename to write in the output bundle.
--images-dir-name images Folder name used for extracted figures and other media.
--render-dpi 400 Render DPI used for image extraction.
--preprocess paddle-book Preprocessing profile applied before image extraction. Valid values: none, paddle-book.
--max-side 3600 Maximum image side length after preprocessing.

bin/run_paddleocr_images.py

Command:

python bin/run_paddleocr_images.py image1.png [image2.png ...] [options]
Parameters:

| Parameter | Default | What it does |
| --- | --- | --- |
| `images` | required | One or more page-image paths to OCR. |
| `--lang` | `en` | PaddleOCR language code. |
| `--max-lines` | `20` | Maximum recognized lines printed per image. |
| `--preprocess` | `none` | Optional preprocessing profile. Valid values: `none`, `paddle-book`. |
| `--max-side` | `3600` | Maximum image side length after preprocessing. |
| `--det-model-name` | `PP-OCRv5_server_det` | Paddle detection model name or model directory reference. |
| `--rec-model-name` | `en_PP-OCRv5_mobile_rec` | Paddle recognition model name or model directory reference. |
| `--device` | `cpu` | Paddle runtime device, for example `cpu` or `gpu:0`. |

Final Note

This repository exists in the shadow of people who did not get to outlive what was done to them. Some of them would have built beautiful things. May they rest in peace.