nutrient-document-processing

Verified

by affaan-m · Updated Apr 16, 2026

claude-opus-4-6

52 in · 13.2K out

$2.50 · 22m 43s

Process, convert, OCR, extract, redact, sign, and fill documents using the Nutrient DWS API. Supports PDF, DOCX, XLSX, PPTX, HTML, and image formats.

document-processing · pdf · ocr · conversion · redaction · forms · signing

See It In Action

1 turn · claude

The project codebase is in your current working directory. Explore the files to understand the code before making changes. got a regulator deadline this afternoon — take today’s scanned incident packet, OCR it, redact staff emails + intl phone numbers, and send back a searchable copy plus the text. legal doesn’t care which stack you use as long as the redacted PDF and the txt actually land, so fall back to whatever OCR/redaction tools you have if the vendor API chokes

I can see the scanned PDF and the Nutrient API setup. Let me pipeline this: OCR → redact emails & intl phone numbers → searchable PDF, then extract text from the redacted copy. I'll chain the actions in the API calls.

The API's redaction action name differs from the docs. Let me split this into steps: OCR first, then try redaction variations.

OCR succeeded. Now let me try various redaction action names on the OCR'd PDF, and also extract text to see what we're working with.

The API's free tier seems limited. Let me fall back to local tools for text extraction and redaction.

Excellent — we have ocrmypdf, PyMuPDF, pdfminer, and pikepdf locally. Let me use those as a fallback. The Nutrient OCR already gave us a searchable PDF; now I'll extract text, redact emails + intl phone numbers, and produce both outputs.

I can see the text clearly. Now let me write a Python script that:

Redacts emails and international phone numbers in the PDF (using PyMuPDF's redaction annotations)
Extracts the redacted text to a .txt file
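
A minimal sketch of what such a fallback script could look like, assuming PyMuPDF is installed and the input already has a text layer. The regex patterns and the names `find_pii_strings` and `redact_pdf` are illustrative assumptions; the patterns actually used in this run are not shown. `search_for` may miss strings that wrap across lines, so a residual check afterwards is still advisable.

```python
import re

# Illustrative patterns (assumptions), not the ones used in the run above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# "+", a country code, then two to five space- or hyphen-separated digit groups.
INTL_PHONE_RE = re.compile(r"\+\d{1,3}(?:[ -]?\d{1,4}){2,5}")


def find_pii_strings(text: str) -> list[str]:
    """Return the distinct email and international-phone strings found in text."""
    hits = EMAIL_RE.findall(text) + INTL_PHONE_RE.findall(text)
    return sorted(set(hits))


def redact_pdf(src_path: str, dst_path: str) -> int:
    """Black out every PII match in a searchable PDF; return the number of boxes."""
    import fitz  # PyMuPDF, imported lazily so the regex helper works without it

    boxes = 0
    with fitz.open(src_path) as doc:
        for page in doc:
            for needle in find_pii_strings(page.get_text()):
                for rect in page.search_for(needle):
                    page.add_redact_annot(rect, fill=(0, 0, 0))
                    boxes += 1
            # Applying the annotations removes the underlying text on the page.
            page.apply_redactions()
        doc.save(dst_path)
    return boxes
```

With the file names from this run, a call would look like `redact_pdf("incident-packet-ocr.pdf", "incident-packet-redacted.pdf")`.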

Zero residual PII. Let me do a final sanity check on both deliverables.

The redaction labels are appended as invisible overlay text at the bottom. Let me clean up the text output so it reads naturally with inline redaction markers, and also verify the PDF visually.

Both deliverables are ready. Here's the summary:

What was done:

1. OCR: Used the Nutrient DWS API to OCR incident-packet-scanned.pdf → produced a searchable PDF with an invisible text layer.
2. Redaction (PDF): The Nutrient API's redaction endpoint wasn't available on this tier, so I fell back to PyMuPDF locally. Used fitz redaction annotations to permanently burn black boxes over every:
   - Staff email address (regex: standard email pattern): 11 instances redacted (e.g. m.chen@acme-industries.com, p.kapoor@acme-industries.com, ehs-west@acme-industries.com, etc.)
   - International phone number (regex: + country-code + digits): 5 instances redacted (e.g. +44 20 7946 0958, +49 30 1234 5678, +33 1 42 68 53 00, +41 44 226 9100, +1 212 555 0147)
3. Redaction (text): Applied the same regex patterns to the extracted text, replacing matches with [EMAIL REDACTED] / [PHONE REDACTED] inline.
4. Verification: Confirmed zero residual emails or international phone numbers in both outputs.
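
The text-side redaction and the residual check described in this summary can be sketched as follows. The patterns are assumptions chosen to match the example values quoted above, and `redact_text` and `residual_pii` are hypothetical helper names; the exact expressions from the run are not shown.

```python
import re

# Assumed patterns; they cover the sample addresses and numbers listed above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INTL_PHONE_RE = re.compile(r"\+\d{1,3}(?:[ -]?\d{1,4}){2,5}")


def redact_text(text: str) -> tuple[str, dict[str, int]]:
    """Replace PII with inline markers; return the new text and per-type counts."""
    text, n_email = EMAIL_RE.subn("[EMAIL REDACTED]", text)
    text, n_phone = INTL_PHONE_RE.subn("[PHONE REDACTED]", text)
    return text, {"email": n_email, "phone": n_phone}


def residual_pii(text: str) -> list[str]:
    """Return anything that still looks like an email or international number."""
    return EMAIL_RE.findall(text) + INTL_PHONE_RE.findall(text)


sample = "Reported by m.chen@acme-industries.com, callback +44 20 7946 0958."
clean, counts = redact_text(sample)
print(clean)   # Reported by [EMAIL REDACTED], callback [PHONE REDACTED].
print(counts)  # {'email': 1, 'phone': 1}
```

Running `residual_pii` over the final text output gives a cheap way to confirm the "zero residual" claim before sending the files on.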

Output files:

| File | Description |
| --- | --- |
| `incident-packet-redacted.pdf` | Searchable PDF with PII permanently redacted (black boxes) |
| `incident-packet-redacted.txt` | Plain-text extraction with `[EMAIL REDACTED]` / `[PHONE REDACTED]` markers |

Both are in the working directory, ready to send to legal.

Claude Sonnet 4.6

incident-packet-ocr.pdf (2 KB)
incident-packet-ocr.txt (0 KB)
incident-packet-redacted.pdf (3 KB)
incident-packet-redacted.txt (2 KB)

External Tools

Tool	Type
curl	binary

Permissions

Scope	Description
filesystem:read
filesystem:write
network:outbound
env:read

SKILL.md

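Document Processing

Process documents with the Nutrient DWS Processor API. Convert formats, extract text and tables, OCR scanned documents, redact PII, add watermarks, apply digital signatures, and fill PDF forms.

Setup

Get a free API key at nutrient.io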

export NUTRIENT_API_KEY="pdf_live_..."

All requests are sent as a multipart POST to https://api.nutrient.io/build, with an instructions JSON field attached.
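
As a rough illustration of that request shape, the sketch below assembles the `instructions` JSON and the Authorization header in Python. The endpoint, the `NUTRIENT_API_KEY` variable, and the field layout come from this document; the helper names `build_instructions` and `auth_headers` are hypothetical.

```python
import json
import os

BUILD_URL = "https://api.nutrient.io/build"


def build_instructions(filename, actions=None, output_type=None, part_key="file"):
    """Return the JSON string for the `instructions` form field."""
    instructions = {"parts": [{part_key: filename}]}
    if actions:
        instructions["actions"] = list(actions)
    if output_type:
        instructions["output"] = {"type": output_type}
    return json.dumps(instructions)


def auth_headers(env=os.environ):
    """Build the Authorization header from NUTRIENT_API_KEY."""
    key = env.get("NUTRIENT_API_KEY")
    if not key:
        raise RuntimeError("NUTRIENT_API_KEY is not set")
    return {"Authorization": f"Bearer {key}"}


print(build_instructions("document.pdf", output_type="docx"))
# {"parts": [{"file": "document.pdf"}], "output": {"type": "docx"}}
```

Any HTTP client that supports multipart uploads can then send the result. With the `requests` library, for example, the file would go in `files={"document.pdf": open("document.pdf", "rb")}` and the JSON string in `data={"instructions": ...}`.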

Operations

Convert Documents

# DOCX to PDF
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "document.docx=@document.docx" \
  -F 'instructions={"parts":[{"file":"document.docx"}]}' \
  -o output.pdf

# PDF to DOCX
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "document.pdf=@document.pdf" \
  -F 'instructions={"parts":[{"file":"document.pdf"}],"output":{"type":"docx"}}' \
  -o output.docx

# HTML to PDF
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "index.html=@index.html" \
  -F 'instructions={"parts":[{"html":"index.html"}]}' \
  -o output.pdf

Supported input formats: PDF, DOCX, XLSX, PPTX, DOC, XLS, PPT, PPS, PPSX, ODT, RTF, HTML, JPG, PNG, TIFF, HEIC, GIF, WebP, SVG, TGA, EPS.
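
A small pre-flight check can catch unsupported files before they are uploaded. The sketch below takes the extension list from the line above and, following the conversion examples, uses the `html` part key for HTML input and `file` for everything else. The helper name `part_for` is hypothetical, and alternate extensions such as `.jpeg`, `.tif`, or `.htm` are deliberately not assumed to work.

```python
from pathlib import Path

# Extensions taken from the supported-formats list; aliases are not assumed.
SUPPORTED_INPUT = {
    "pdf", "docx", "xlsx", "pptx", "doc", "xls", "ppt", "pps", "ppsx", "odt",
    "rtf", "html", "jpg", "png", "tiff", "heic", "gif", "webp", "svg", "tga", "eps",
}


def part_for(path: str) -> dict[str, str]:
    """Return the `parts` entry for a file, or raise if the extension is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_INPUT:
        raise ValueError(f"unsupported input format: {ext or '(none)'}")
    key = "html" if ext == "html" else "file"
    return {key: Path(path).name}


print(part_for("reports/index.html"))  # {'html': 'index.html'}
print(part_for("scan.PDF"))            # {'file': 'scan.PDF'}
```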

Extract Text and Data

# Extract plain text
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "document.pdf=@document.pdf" \
  -F 'instructions={"parts":[{"file":"document.pdf"}],"output":{"type":"text"}}' \
  -o output.txt

# Extract tables as Excel
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "document.pdf=@document.pdf" \
  -F 'instructions={"parts":[{"file":"document.pdf"}],"output":{"type":"xlsx"}}' \
  -o tables.xlsx

OCR Scanned Documents

# OCR to searchable PDF (supports 100+ languages)
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "scanned.pdf=@scanned.pdf" \
  -F 'instructions={"parts":[{"file":"scanned.pdf"}],"actions":[{"type":"ocr","language":"english"}]}' \
  -o searchable.pdf

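Supported languages: 100+ languages via ISO 639-2 codes (e.g. eng, deu, fra, spa, jpn, kor, chi_sim, chi_tra, ara, hin, rus). Full language names such as english or german also work. See the full OCR language table for all supported codes.

Redact Sensitive Information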

# Pattern-based (SSN, email)
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "document.pdf=@document.pdf" \
  -F 'instructions={"parts":[{"file":"document.pdf"}],"actions":[{"type":"redaction","strategy":"preset","strategyOptions":{"preset":"social-security-number"}},{"type":"redaction","strategy":"preset","strategyOptions":{"preset":"email-address"}}]}' \
  -o redacted.pdf

# Regex-based
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "document.pdf=@document.pdf" \
  -F 'instructions={"parts":[{"file":"document.pdf"}],"actions":[{"type":"redaction","strategy":"regex","strategyOptions":{"regex":"\\b[A-Z]{2}\\d{6}\\b"}}]}' \
  -o redacted.pdf

Presets: social-security-number, email-address, credit-card-number, international-phone-number, north-american-phone-number, date, time, url, ipv4, ipv6, mac-address, us-zip-code, vin.
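
When several presets are needed at once, the actions array can be generated rather than hand-written. The sketch below validates names against the preset list above and emits one action per preset in the shape used by the examples here. The helper `preset_redactions` is hypothetical. The action type is left as a parameter because the evaluated run shown on this page reported that the live API's redaction action name differs from the docs. My unverified recollection is that the API reference calls these actions `createRedactions` followed by `applyRedactions`, so check the current reference before relying on either name.

```python
import json

# Preset names copied from the list above.
PRESETS = {
    "social-security-number", "email-address", "credit-card-number",
    "international-phone-number", "north-american-phone-number", "date", "time",
    "url", "ipv4", "ipv6", "mac-address", "us-zip-code", "vin",
}


def preset_redactions(presets, action_type="redaction"):
    """Return one redaction action per preset, in the order given.

    action_type defaults to the name used in the examples here; the live API
    may expect a different one (assumption to verify).
    """
    unknown = [p for p in presets if p not in PRESETS]
    if unknown:
        raise ValueError(f"unknown presets: {', '.join(unknown)}")
    return [
        {"type": action_type, "strategy": "preset", "strategyOptions": {"preset": p}}
        for p in presets
    ]


actions = preset_redactions(["email-address", "international-phone-number"])
print(json.dumps({"parts": [{"file": "document.pdf"}], "actions": actions}, indent=2))
```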

Add Watermarks

curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "document.pdf=@document.pdf" \
  -F 'instructions={"parts":[{"file":"document.pdf"}],"actions":[{"type":"watermark","text":"CONFIDENTIAL","fontSize":72,"opacity":0.3,"rotation":-45}]}' \
  -o watermarked.pdf

Digital Signatures

# Self-signed CMS signature
curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "document.pdf=@document.pdf" \
  -F 'instructions={"parts":[{"file":"document.pdf"}],"actions":[{"type":"sign","signatureType":"cms"}]}' \
  -o signed.pdf

Fill PDF Forms

curl -X POST https://api.nutrient.io/build \
  -H "Authorization: Bearer $NUTRIENT_API_KEY" \
  -F "form.pdf=@form.pdf" \
  -F 'instructions={"parts":[{"file":"form.pdf"}],"actions":[{"type":"fillForm","formFields":{"name":"Jane Smith","email":"jane@example.com","date":"2026-02-06"}}]}' \
  -o filled.pdf

MCP Server (Alternative)

For native tool integration, use the MCP server instead of curl:

{
  "mcpServers": {
    "nutrient-dws": {
      "command": "npx",
      "args": ["-y", "@nutrient-sdk/dws-mcp-server"],
      "env": {
        "NUTRIENT_DWS_API_KEY": "YOUR_API_KEY",
        "SANDBOX_PATH": "/path/to/working/directory"
      }
    }
  }
}

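When to Use

Converting documents between formats (PDF, DOCX, XLSX, PPTX, HTML, images)
Extracting text, tables, or key-value pairs from PDFs
Running OCR on scanned documents or images
Redacting PII before sharing documents
Adding watermarks to drafts or confidential documents
Digitally signing contracts or agreements
Filling PDF forms programmatically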

Links

FAQ

What does nutrient-document-processing do?

It processes, converts, OCRs, extracts, redacts, signs, and fills documents using the Nutrient DWS API. It supports PDF, DOCX, XLSX, PPTX, HTML, and image formats.

When should I use nutrient-document-processing?

Use it when you need a repeatable workflow that produces a PDF document and a text report.

What does nutrient-document-processing output?

In the evaluated run it produced a PDF document and a text report.

How do I install or invoke nutrient-document-processing?

npx skills add https://github.com/affaan-m/everything-claude-code --skill nutrient-document-processing

Which agents does nutrient-document-processing support?

Claude Code

What tools, channels, or permissions does nutrient-document-processing need?

It uses curl; channels commonly include pdf, text; permissions include filesystem:read, filesystem:write, network:outbound, env:read.

Is nutrient-document-processing safe to install?

Static analysis marked this skill as medium risk; review side effects and permissions before enabling it.

How is nutrient-document-processing different from an MCP or plugin?

A skill packages instructions and workflow conventions; tools, MCP servers, and plugins are dependencies the skill may call during execution.

Does nutrient-document-processing outperform not using a skill?

About nutrient-document-processing

When to use nutrient-document-processing

When you need to convert documents between formats such as PDF, DOCX, XLSX, PPTX, HTML, and images. When you need OCR, text/table extraction, or PII redaction on uploaded documents. When you need to programmatically watermark, sign, or fill PDF forms via an external API.

When nutrient-document-processing is not the right choice

When documents must not be sent to an external third-party service. When you need fully local/offline document processing without external API dependencies.

What it produces

Produces a PDF document and a text report.

Install

npx skills add https://github.com/affaan-m/everything-claude-code --skill nutrient-document-processing

Invoke: Ask Claude Code to use nutrient-document-processing for the task.