Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   Copyright 2026 Zurc Bolzano, Maker of Content, 
                                Lord of the Render Queue, 
                                He Who Forgot to Sleep.

   The running joke is that I am still making content, just now it
   is legal, consenual and primarily focused on primary prevention.

DocWeave

Provenance matters in documents. It matters in people too.

DocWeave is a modular document pipeline for turning PDFs into trustworthy, AI-ready artifacts. It combines native PDF text, OCR, layout analysis, image extraction, chunk planning, and provenance tracking into a staged workflow that can be reused across books, papers, manuals, scans, and other document collections.

DocWeave targets a gap that existing open-source parsers do not prioritize well: local-first, modest-VRAM document processing that preserves both AI-friendly Markdown structure and OCR traceability artifacts such as hOCR.

Why DocWeave?

Built with the belief that what survives deserves to be carried carefully.

preserve source fidelity whenever native text is better than OCR
use OCR when native text or layout is missing or weak
retain structure, layout, images, and provenance for AI workflows
keep outputs compact enough for local LLM workflows
keep every major step modular and testable

In short: DocWeave is meant to weave together native text, OCR, layout, and metadata rather than forcing a single extraction strategy onto every page.

Current Status

The Python package already provides the reusable planning core: discovery, canonical naming, chunking, batching, manifests, and a stage-based runner.
The full OCR and AI-deliverable workflow is still partly in a local legacy bridge while the rest of the pipeline is migrated subsystem by subsystem.
Currently undergoing a full re-write - please be patient :) encuoragement is always welcome.

Quick Start

Install the core package:

python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"

Run the current planning stage:

docweave plan \
  --input-dir ./input_pdfs \
  --canonical-dir ./work/canonical \
  --manifests-dir ./work/manifests \
  --chunk-output-dir ./work/chunks \
  --page-chunk-size 150 \
  --max-chunks-per-batch 8 \
  --print-summary

This produces:

documents_manifest.csv
chunks_manifest.csv
batches_manifest.csv

Documentation

Benchmark Positioning

DocWeave is not trying to prove that it is the highest-ceiling parser overall. The benchmark posture is narrower:

Is this pipeline adequate for a local-first user with limited budget and modest hardware who needs compact AI-ready Markdown plus OCR traceability?

That methodology, the fairness rules, and the current comparison framing versus Docling, Marker, and MinerU are documented in docs/benchmarking.md.

Showcase Results

On the public synthetic weaving showcase, DocWeave’s reference bundle currently clears all 8/8 structure-sensitive cases while staying compact and explicitly AI-oriented. On this host, the best local-feasible comparison runs from the other open-source pipelines did not meet the same adequacy bar for this use case.

Variable	DocWeave Reference	Docling Best Local Run	Marker Best Local Run	MinerU Best Local Run
Best mode on this host	reference bundle	GPU	GPU	CPU
Cases passed	8/8	4/8	1/8	3/8
Adequate for target local-first use case?	yes	no	no	no
Main shortfall	n/a	breaks reading order, weak code fidelity, no provenance JSON	weak structure retention and much larger bundle	weak structure retention and much larger bundle

What is interesting here is not that DocWeave “wins every benchmark.” It is that, under a deliberately modest-hardware, local-first benchmark profile, DocWeave is currently the only pipeline in this comparison that produces the compact Markdown-plus-provenance shape the project is targeting. The full methodology, fairness rules, and per-mode results are in docs/benchmarking.md.

Acknowledgments

DocWeave relies heavily on PaddlePaddle for OCR and document-processing capabilities. Respect to the PaddlePaddle authors for building and open-sourcing the framework this work stands on.

PaddlePaddle is distributed under the Apache License 2.0.

DocWeave was shaped in conversation with Codex, OpenAI’s coding agent, during design and implementation.

Notes

The published package name in pyproject.toml is docweave.
The internal Python import package remains document_pipeline for migration stability.
The public repo is intentionally engine-first and excludes some local application-specific orchestration.

DocWeave Project

This directory holds the longer-form project documentation that used to live in the root README.

Start here if you want a map of the repo:

Project Overview
- goals, current status, features, design principles, roadmap, output philosophy
Setup and Installation
- dependencies, installation, Windows and WSL guidance, Paddle GPU build notes
Architecture and Modules
- public modules, repository structure, reusable entry points
CLI Reference
- command forms and parameters for the current Python and benchmark entrypoints
Benchmarking
- comparison framing, benchmark snapshot, fairness rules, adequacy thresholds
Testing and Public Fixtures
- test commands, synthetic smoke fixture, weaving showcase fixture

If you are landing here from GitHub:

Read the repo README for the short project summary.
Read overview.md for the project scope and migration status.
Read setup.md before attempting GPU OCR or Windows/WSL configuration.

Architecture and Modules

Public Modules and Responsibilities

Core package

src/document_pipeline/config.py
- pipeline paths, naming policy, planning options
src/document_pipeline/models.py
- typed data models for documents, chunks, batches, and plans
src/document_pipeline/naming.py
- canonical filename and identifier logic
src/document_pipeline/pdf_utils.py
- PDF page count and hashing helpers
src/document_pipeline/discovery.py
- discover source PDFs and build SourceDocument records
src/document_pipeline/chunking.py
- build page chunks and batch assignments
src/document_pipeline/manifests.py
- write planning manifests to CSV
src/document_pipeline/planning.py
- compose discovery, chunking, and batching into a plan
src/document_pipeline/context.py
- shared runtime context between stages
src/document_pipeline/stages/base.py
- stage protocol and stage runner
src/document_pipeline/stages/plan.py
- build-plan and write-manifest stages
src/document_pipeline/cli.py
- CLI entrypoint

Python worker and benchmark scripts

bin/paddle_ocr_pdf.py
- PaddleOCR PDF worker, GPU path, sidecars, layout hooks
bin/build_ai_chunk_artifacts.py
- build AI-ready markdown, provenance JSON, and images from canonical sidecars
bin/run_paddleocr_images.py
- image-level PaddleOCR runner for focused testing
bin/paddle_direct_extract.py
- direct Paddle baseline extractor that bypasses DocWeave orchestration
bin/benchmark_paddle_direct.py
- direct Paddle CPU/GPU benchmark runner using the public showcase fixture
bin/benchmark_local_compare.py
- broader local comparison harness for Docling, Marker, MinerU, and optional live DocWeave runs
bin/benchmark_weaving_showcase.py
- baseline extractor comparison for the synthetic weaving showcase

The original PowerShell orchestration layer is intentionally not part of the public engine-first source drop because it still contains application-specific workflow assumptions. The Python package and Python worker scripts are the published base going forward.

Repository Structure

.
├── pyproject.toml
├── README.md
├── docs/
│   ├── README.md
│   ├── architecture.md
│   ├── benchmarking.md
│   ├── cli.md
│   ├── overview.md
│   ├── setup.md
│   └── testing.md
├── bin/
│   ├── benchmark_local_compare.py
│   ├── benchmark_paddle_direct.py
│   ├── benchmark_weaving_showcase.py
│   ├── paddle_direct_extract.py
│   ├── paddle_ocr_pdf.py
│   ├── build_ai_chunk_artifacts.py
│   └── run_paddleocr_images.py
├── fixtures/
│   ├── showcase/
│   └── smoke/
├── src/
│   └── document_pipeline/
│       ├── cli.py
│       ├── config.py
│       ├── context.py
│       ├── discovery.py
│       ├── chunking.py
│       ├── manifests.py
│       ├── models.py
│       ├── naming.py
│       ├── pdf_utils.py
│       ├── planning.py
│       └── stages/
│           ├── base.py
│           └── plan.py
└── tests/
    ├── test_cli.py
    ├── test_local_compare_benchmark.py
    ├── test_manifests.py
    ├── test_naming.py
    ├── test_paddle_direct_compare.py
    ├── test_planning.py
    ├── test_synthetic_smoke_fixture.py
    ├── test_synthetic_weaving_showcase.py
    └── test_weaving_showcase_benchmark.py

Key Functions

The current Python package is small enough that the main reusable entry points are worth calling out explicitly:

discover_documents(...)
- scan an input directory, canonicalize names, compute hashes, and create SourceDocument records
build_chunks(...)
- split documents into deterministic page-range chunks
assign_batches(...)
- group chunks into stable batch assignments
build_plan(...)
- compose discovery, chunking, and batching into a PipelinePlan
write_plan_manifests(...)
- write the current planning outputs to CSV
run_stages(...)
- execute a sequence of reusable pipeline stages against a shared context
document_pipeline.cli.main(...)
- CLI entrypoint for the current planning stage

Benchmarking

Short Comparison

DocWeave is not the only open-source project in this space. The distinction is in what each project optimizes for.

Project	Best known for	Typical outputs	Positioning relative to DocWeave
DocWeave	Native-text-first routing, hybrid document reasoning, compact AI bundles, explicit provenance	Markdown, provenance JSON, extracted images	Aims to preserve the best available text source per region or page, keep outputs small and AI-readable, and retain machine-readable provenance for what was filtered, merged, or externalized.
Docling	Broad document-format coverage and strong general document understanding	Markdown, HTML, lossless JSON, DocTags	More mature and broader in format support. Closer to a full document platform; less specifically centered on the native-text-first plus hybrid-weaving philosophy DocWeave is pursuing.
Marker	Fast PDF and image conversion to Markdown or JSON with strong practical defaults	Markdown, JSON, HTML, chunks	Very strong PDF-to-Markdown baseline. More conversion-focused; less explicit about provenance-rich hybrid source selection.
MinerU	LLM-ready parsing with reading-order recovery, layout handling, OCR, formula and table extraction	Markdown, JSON, intermediate multimodal formats	Probably the closest public peer on “AI-ready outputs.” Strong on layout recovery and cleanup; DocWeave is more explicitly aiming at traceable native-vs-OCR weaving and compact Markdown plus provenance bundles.

This table is intentionally short. It is a positioning summary, not a claim that DocWeave already exceeds the maturity or feature breadth of those projects today.

Benchmark Snapshot

The repo includes a synthetic public showcase fixture and a direct-Paddle benchmark specifically so comparisons can be made without quietly changing the OCR profile to something easier than what DocWeave actually uses.

The benchmark is not trying to prove that DocWeave is the highest-ceiling document parser overall. It is trying to answer a narrower question:

Is this pipeline adequate for a local-first user with limited budget and modest hardware who needs compact AI-ready Markdown plus OCR traceability?

Comparison setup for the direct Paddle baseline:

render at 400 DPI
use the same preprocessing profile DocWeave currently uses:
- paddle-book
use the same explicit OCR pair DocWeave uses:
- detector: PP-OCRv5_server_det
- recognizer: en_PP-OCRv5_mobile_rec
use the same document layout detector:
- PP-DocLayout_plus-L

Fairness rules for the direct Paddle baseline:

keep the render setting at 400 DPI, because that matches the current DocWeave application profile
use the same explicit OCR pair DocWeave uses:
- detector: PP-OCRv5_server_det
- recognizer: en_PP-OCRv5_mobile_rec
use the same document layout detector:
- PP-DocLayout_plus-L
do not enable the broader layout_parsing stack by default for the direct baseline, because that changes the workload class and undercuts the comparison

Local result from the tuned direct-Paddle comparison on March 25, 2026 using fixtures/showcase/docweave_weaving_showcase.pdf:

Benchmark	Profile	Result
Direct Paddle `CPU`	`400 DPI`, `PP-OCRv5_server_det`, `en_PP-OCRv5_mobile_rec`, `PP-DocLayout_plus-L`	failed after `58.59s` with a Paddle `ResourceExhaustedError` while trying to allocate about `51.7 GB` on CPU
Direct Paddle `GPU`	same tuned profile	completed in `36.14s`, produced a `175,633` byte artifact bundle

What that means:

the direct vendor stack itself already benefits materially from a modest local GPU
the CPU path, at the same 400 DPI and model profile, is not currently a practical baseline on this laptop
the direct Paddle GPU output preserved little of the higher-level showcase structure on its own, which is exactly the gap DocWeave is meant to fill

Benchmark Methodology

DocWeave’s benchmark methodology is intentionally narrow and explicit.

Target use case:

a local-first user
limited budget
modest hardware such as a laptop GPU
needs compact Markdown that is usable by LLMs
also values OCR traceability artifacts such as hOCR

What the benchmark is measuring:

whether a pipeline completes on the local machine
whether it produces compact, AI-usable Markdown
whether key structure survives:
- reading order
- header/footer filtering
- figure and caption separation
- externalized scan or figure artifacts
whether the output stays small enough to be practical for local AI workflows

What the benchmark is not claiming:

that DocWeave is the only document parser in existence
that DocWeave is the strongest parser in every environment
that server-class or high-VRAM profiles are irrelevant

Fairness rules:

run one tool at a time, never simultaneously
use explicit documented OCR and layout profiles where the tool exposes them
keep DocWeave and direct-Paddle baselines at 400 DPI
benchmark the local-feasible profile for each tool, not its largest server-oriented configuration
warm model caches separately so timed runs do not silently include first-download costs
record host power mode, GPU memory, and RAM usage during the benchmark
require a complete runnable comparison env for each live tool:
- Docling fair-profile runs require easyocr
- MinerU fair-profile runs require mineru[pipeline], not a bare mineru install

Adequacy thresholds for the local-first use case:

the run must complete successfully on the host in that mode
the run must produce Markdown
the main Markdown artifact must not inline images as base64
the run must pass at least 6/8 showcase structure cases
the run must pass all core cases:
- two-column reading order
- header/footer filtering
- figure externalization with caption
- image-only scan externalization
the total artifact bundle must stay within 1.5x the source PDF size

How to read the final comparison table:

Adequate for target local-first use case? means adequate on this host, in this mode, under the explicit benchmark profile
a no verdict does not mean a tool is bad overall
it means that, under the constraints above, the tool was not adequate for this specific local-first LLM-oriented workflow

Benchmark Entry Points

Direct Paddle:

PYTHONPATH=src python bin/benchmark_paddle_direct.py \
  --output-dir ./tmp_paddle_direct_compare_tuned

Multi-tool local comparison:

PYTHONPATH=src python bin/benchmark_local_compare.py \
  --output-dir ./tmp_local_compare

CLI Reference

Quick Start

The current Python CLI provides the planning stage.

Example:

docweave plan \
  --input-dir ./input_pdfs \
  --canonical-dir ./work/canonical \
  --manifests-dir ./work/manifests \
  --chunk-output-dir ./work/chunks \
  --page-chunk-size 150 \
  --max-chunks-per-batch 8 \
  --print-summary

This produces planning manifests:

documents_manifest.csv
chunks_manifest.csv
batches_manifest.csv

CLI

Current command set:

docweave plan
- discovers PDFs
- canonicalizes names
- computes page chunks
- assigns batches
- writes manifests

Run help:

docweave plan --help

Command-Line Parameters

This section documents the current CLI surface area in the repo. The Python package is still in staged migration, so some of the most capable entrypoints still live in the Python worker scripts.

`docweave plan`

Command:

docweave plan [options]

Parameters:

Parameter	Default	What it does
`--input-dir`	required	Directory to scan for source PDFs.
`--canonical-dir`	required	Directory where canonicalized filenames are planned.
`--manifests-dir`	required	Directory where planning CSV manifests are written.
`--chunk-output-dir`	required	Directory where planned chunk artifacts are expected to live.
`--page-chunk-size`	`150`	Number of pages per output chunk.
`--max-chunks-per-batch`	`8`	Maximum chunk count grouped into one batch.
`--batch-prefix`	`batch`	Prefix used for batch identifiers such as `batch_01`.
`--semantic-separator`	`__`	Separator between semantic name fields such as title, author, and year.
`--token-separator`	`_`	Separator used inside normalized tokens.
`--keep-case`	off	Preserve source casing instead of lowercasing canonical names.
`--print-summary`	off	Print a JSON summary of discovered documents, chunks, and batches.

`bin/paddle_ocr_pdf.py`

Command:

python bin/paddle_ocr_pdf.py input.pdf output.pdf [options]

Parameters:

Parameter	Default	What it does
`input_pdf`	required	Source PDF to OCR.
`output_pdf`	required	Output searchable PDF path.
`--lang`	`en`	PaddleOCR language code.
`--render-dpi`	`400`	Render DPI used when rasterizing PDF pages.
`--preprocess`	`paddle-book`	Preprocessing profile for rendered pages. Valid values: `none`, `paddle-book`.
`--max-side`	`3600`	Maximum image side length after preprocessing.
`--sidecar-dir`	empty	Directory where hOCR, text, and layout sidecars are written.
`--keep-rendered-images`	off	Preserve rendered page images on disk for inspection.
`--det-model-name`	`PP-OCRv5_server_det`	Paddle detection model name or model directory reference.
`--rec-model-name`	`en_PP-OCRv5_mobile_rec`	Paddle recognition model name or model directory reference.
`--device`	`cpu`	Paddle device, for example `cpu` or `gpu:0`.
`--page-batch-size`	`4`	Number of pages processed per in-memory Paddle batch.
`--use-layout-detection`	off	Emit layout-detection results for each page.
`--layout-det-model-name`	`PP-DocLayout_plus-L`	Layout-detection model name.
`--use-layout-analysis`	off	Emit layout-analysis results for each page.
`--layout-analysis-pipeline`	`layout_parsing`	PaddleX layout-analysis pipeline name.

`bin/paddle_direct_extract.py`

Command:

python bin/paddle_direct_extract.py input.pdf output_dir [options]

Parameters:

Parameter	Default	What it does
`input_pdf`	required	Source PDF to process with direct Paddle APIs.
`output_dir`	required	Output directory for the direct baseline artifact bundle.
`--lang`	`en`	PaddleOCR language code.
`--render-dpi`	`400`	Render DPI used for page rasterization.
`--preprocess`	`paddle-book`	Preprocessing profile for rendered pages. Valid values: `none`, `paddle-book`.
`--max-side`	`3600`	Maximum image side length after preprocessing.
`--det-model-name`	`PP-OCRv5_server_det`	Paddle detection model reference.
`--rec-model-name`	`en_PP-OCRv5_mobile_rec`	Paddle recognition model reference.
`--device`	`cpu`	Paddle runtime device, for example `cpu` or `gpu:0`.
`--page-batch-size`	`4`	Number of pages processed per OCR batch.
`--use-layout-detection`	off	Emit layout-detection JSON sidecars.
`--layout-det-model-name`	`PP-DocLayout_plus-L`	Layout-detection model reference.
`--use-layout-analysis`	off	Emit PaddleX layout-analysis JSON sidecars.
`--layout-analysis-pipeline`	`layout_parsing`	PaddleX layout-analysis pipeline name.
`--keep-rendered-images`	off	Preserve rendered page PNGs in the output bundle.

`bin/benchmark_paddle_direct.py`

Command:

python bin/benchmark_paddle_direct.py [options]

Parameters:

Parameter	Default	What it does
`--input-pdf`	`fixtures/showcase/docweave_weaving_showcase.pdf`	Public showcase PDF used for the direct Paddle benchmark.
`--output-dir`	`fixtures/showcase/paddle_direct_compare`	Directory where the benchmark report, JSON, and live artifacts are written.
`--paddle-python`	`.venvs/paddlegpu-build/bin/python`	Python executable from the Paddle runtime environment.
`--paddle-env-dir`	`.venvs/paddlegpu-build`	Environment directory whose install size is recorded.
`--extract-script`	`bin/paddle_direct_extract.py`	Direct-Paddle extraction script invoked by the benchmark.
`--modes`	`cpu,gpu`	Comma-separated execution modes to run.
`--skip-live-runs`	off	Emit only the benchmark profile and report scaffold without invoking Paddle.
`--profile`	`docweave-tuned`	Benchmark profile. `docweave-tuned` matches the DocWeave comparison setup: `400 DPI`, `paddle-book`, `PP-OCRv5_server_det`, `en_PP-OCRv5_mobile_rec`, and `PP-DocLayout_plus-L`. `paddle-full-layout` enables the broader PaddleX layout-analysis stack.
`--lang`	`en`	PaddleOCR language code.
`--render-dpi`	`400`	Render DPI used for both CPU and GPU runs. The fair baseline keeps this at `400 DPI` because that is the current DocWeave application setting.
`--preprocess`	`paddle-book`	Rendered-page preprocessing profile. Valid values: `none`, `paddle-book`. The fair direct-Paddle baseline uses `paddle-book` to match DocWeave.
`--max-side`	`3600`	Maximum rendered image side length.
`--det-model-name`	`PP-OCRv5_server_det`	Paddle detection model reference.
`--rec-model-name`	`en_PP-OCRv5_mobile_rec`	Paddle recognition model reference.
`--page-batch-size`	`4`	Pages processed per OCR batch.
`--use-layout-detection`	profile-dependent	Force-enable layout detection.
`--no-layout-detection`	off	Force-disable layout detection.
`--layout-det-model-name`	`PP-DocLayout_plus-L`	Layout-detection model reference.
`--use-layout-analysis`	profile-dependent	Force-enable the broader PaddleX layout-analysis pipeline.
`--no-layout-analysis`	off	Force-disable the broader PaddleX layout-analysis pipeline.
`--layout-analysis-pipeline`	`layout_parsing`	PaddleX layout-analysis pipeline name.
`--keep-rendered-images`	off	Preserve rendered page images in the output bundle.
`--timeout-seconds`	`1800`	Per-run timeout for CPU or GPU execution.
`--sample-interval`, `--gpu-sample-interval`	`5.0`	Sampling interval in seconds for GPU and RAM monitoring. The benchmark records peak and average usage for both.

`bin/benchmark_local_compare.py`

Command:

python bin/benchmark_local_compare.py [options]

Parameters:

Parameter	Default	What it does
`--input-pdf`	`fixtures/showcase/docweave_weaving_showcase.pdf`	Public showcase PDF used for the local multi-tool comparison.
`--expected-markdown`	`fixtures/showcase/expected/content.md`	Expected DocWeave markdown artifact used as the quality reference bundle.
`--expected-provenance`	`fixtures/showcase/expected/provenance.json`	Expected DocWeave provenance JSON used as the reference bundle.
`--expected-images-dir`	`fixtures/showcase/expected/images`	Expected extracted-image directory used as the reference bundle.
`--output-dir`	`fixtures/showcase/local_compare`	Directory where the comparison report, JSON, warmup outputs, and live tool artifacts are written.
`--tools`	`docling,marker,mineru`	Comma-separated live tools to benchmark. Supported values: `docling`, `marker`, `mineru`, `docweave-live`.
`--modes`	`cpu,gpu`	Comma-separated execution modes to benchmark.
`--skip-live-tools`	off	Emit only the reference report scaffold without running external tools.
`--timeout-seconds`	`1800`	Per-run timeout for each live tool invocation.
`--sample-interval`, `--gpu-sample-interval`	`5.0`	Sampling interval in seconds for GPU and RAM monitoring. The harness runs serially and records peak and average usage.
`--warmup-live-tools`	on	Warm model caches on the smaller public smoke PDF before the timed showcase run.
`--no-warmup-live-tools`	off	Disable the warmup pass and benchmark from a colder start.
`--warmup-input-pdf`	`fixtures/smoke/docweave_pairwise_public_fixture.pdf`	Smaller public PDF used only for cache/model warmup.
`--docling-bin`	`.venvs/docling/bin/docling`	Path to the Docling CLI binary.
`--marker-bin`	`.venvs/marker/bin/marker_single`	Path to the Marker CLI binary.
`--mineru-bin`	`.venvs/mineru/bin/mineru`	Path to the MinerU CLI binary. For a fair local comparison, this env should be installed as `mineru[pipeline]` so the pipeline backend actually has `torch`, `doclayout_yolo`, and the OCR stack available.
`--docling-env-dir`	`.venvs/docling`	Environment directory whose install size is recorded for Docling.
`--marker-env-dir`	`.venvs/marker`	Environment directory whose install size is recorded for Marker.
`--marker-highres-image-dpi`	`400`	Marker OCR DPI. This is held at `400` to stay fair to the DocWeave local profile.
`--marker-lowres-image-dpi`	`96`	Marker layout DPI.
`--mineru-env-dir`	`.venvs/mineru`	Environment directory whose install size is recorded for MinerU.
`--docweave-command`	empty	Optional shell command template for benchmarking a live DocWeave run. Available placeholders: `{input_pdf}`, `{output_dir}`, `{mode}`, `{device}`.
`--docweave-env-dir`	empty	Optional environment directory whose install size is recorded for a live DocWeave run.
`--mineru-gpu-vram-mb`	`5500`	VRAM cap passed to MinerU in GPU mode to keep the comparison in the modest-hardware class.

`bin/build_ai_chunk_artifacts.py`

Command:

python bin/build_ai_chunk_artifacts.py [options]

Parameters:

Parameter	Default	What it does
`--canonical-pdf`	required	Canonical source PDF used to render page and media crops.
`--page-manifest-json`	required	Per-page manifest JSON describing sidecars and provenance for the chunk.
`--output-dir`	required	Output directory for the AI bundle.
`--book-id`	required	Logical document identifier recorded in provenance.
`--page-start`	required	First page number in the chunk, 1-based.
`--page-end`	required	Last page number in the chunk, 1-based.
`--markdown-name`	`content.md`	Markdown filename to write in the output bundle.
`--provenance-name`	`provenance.json`	Provenance JSON filename to write in the output bundle.
`--images-dir-name`	`images`	Folder name used for extracted figures and other media.
`--render-dpi`	`400`	Render DPI used for image extraction.
`--preprocess`	`paddle-book`	Preprocessing profile applied before image extraction. Valid values: `none`, `paddle-book`.
`--max-side`	`3600`	Maximum image side length after preprocessing.

`bin/run_paddleocr_images.py`

Command:

python bin/run_paddleocr_images.py image1.png [image2.png ...] [options]

Parameters:

| Parameter | Default | What it does |
| --- | --- | --- |
| `images` | required | One or more page-image paths to OCR. |
| `--lang` | `en` | PaddleOCR language code. |
| `--max-lines` | `20` | Maximum recognized lines printed per image. |
| `--preprocess` | `none` | Optional preprocessing profile. Valid values: `none`, `paddle-book`. |
| `--max-side` | `3600` | Maximum image side length after preprocessing. |
| `--det-model-name` | `PP-OCRv5_server_det` | Paddle detection model name or model directory reference. |
| `--rec-model-name` | `en_PP-OCRv5_mobile_rec` | Paddle recognition model name or model directory reference. |
| `--device` | `cpu` | Paddle runtime device, for example `cpu` or `gpu:0`. |

Final Note

This repository exists in the shadow of people who did not get to outlive what was done to them. Some of them would have built beautiful things. May they rest in peace.