nutrient-document-processing
Verified by affaan-m · Updated Apr 16, 2026
Process, convert, OCR, extract, redact, sign, and fill documents using the Nutrient DWS API. Supports PDF, DOCX, XLSX, PPTX, HTML, and image formats.
See It In Action
I can see the scanned PDF and the Nutrient API setup. Let me pipeline this: OCR → redact emails & intl phone numbers → searchable PDF, then extract text from the redacted copy. I'll chain the actions in the API calls.
The API's redaction action name differs from the docs. Let me split this into steps: OCR first, then try redaction variations.
OCR succeeded. Now let me try various redaction action names on the OCR'd PDF, and also extract text to see what we're working with.
The API's free tier seems limited. Let me fall back to local tools for text extraction and redaction.
Excellent — we have ocrmypdf, PyMuPDF, pdfminer, and pikepdf locally. Let me use those as a fallback. The Nutrient OCR already gave us a searchable PDF; now I'll extract text, redact emails + intl phone numbers, and produce both outputs.
I can see the text clearly. Now let me write a Python script that:
- Redacts emails and international phone numbers in the PDF (using PyMuPDF's redaction annotations)
- Extracts the redacted text to a .txt file
Zero residual PII. Let me do a final sanity check on both deliverables.
The redaction labels are appended as invisible overlay text at the bottom. Let me clean up the text output so it reads naturally with inline redaction markers, and also verify the PDF visually.
Both deliverables are ready. Here's the summary:
What was done:
- OCR: Used the Nutrient DWS API to OCR incident-packet-scanned.pdf → produced a searchable PDF with an invisible text layer.
- Redaction (PDF): The Nutrient API's redaction endpoint wasn't available on this tier, so I fell back to PyMuPDF locally. Used fitz redaction annotations to permanently burn black boxes over every:
  - Staff email address (regex: standard email pattern): 11 instances redacted (e.g. m.chen@acme-industries.com, p.kapoor@acme-industries.com, ehs-west@acme-industries.com)
  - International phone number (regex: +country-code + digits): 5 instances redacted (e.g. +44 20 7946 0958, +49 30 1234 5678, +33 1 42 68 53 00, +41 44 226 9100, +1 212 555 0147)
- Redaction (text): Applied the same regex patterns to the extracted text, replacing matches with [EMAIL REDACTED] / [PHONE REDACTED] inline.
- Verification: Confirmed zero residual emails or international phone numbers in both outputs.
Output files:
| File | Description |
|---|---|
| incident-packet-redacted.pdf | Searchable PDF with PII permanently redacted (black boxes) |
| incident-packet-redacted.txt | Plain-text extraction with [EMAIL REDACTED] / [PHONE REDACTED] markers |
Both are in the working directory, ready to send to legal.
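The text-side redaction described in the summary can be sketched with sed. This is a minimal illustration and not the script used in the run: the function name redact_text and both regular expressions are assumptions chosen to match the examples listed above (a standard email shape, and a leading + followed by a country code and digit groups).

```shell
# Hypothetical helper: replace emails and +country-code phone numbers in plain text.
# Both patterns are illustrative assumptions; tune them to the data before relying on them.
redact_text() {
  sed -E \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/[EMAIL REDACTED]/g' \
    -e 's/\+[0-9]{1,3}([ .-]?[0-9]{1,4}){2,5}/[PHONE REDACTED]/g'
}

# Reads text on stdin and writes the redacted text to stdout.
printf '%s\n' 'Contact m.chen@acme-industries.com or +44 20 7946 0958.' | redact_text
# prints: Contact [EMAIL REDACTED] or [PHONE REDACTED].
```

Running the same patterns through grep -E over the output is one way to repeat the "zero residual" check; a regex pass only covers the text layer, so the PDF itself still needs a real redaction step such as the PyMuPDF one used in the run.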
External Tools
| Tool | Type |
|---|---|
| curl | binary |
Permissions
| Scope | Description |
|---|---|
| filesystem:read | |
| filesystem:write | |
| network:outbound | |
| env:read | |
SKILL.md
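Document Processing
Process documents with the Nutrient DWS Processor API. Convert formats, extract text and tables, OCR scanned documents, redact PII, add watermarks, digitally sign, and fill PDF forms.
Setup
Get a free API key at nutrient.io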
export NUTRIENT_API_KEY="pdf_live_..."
All requests are sent as a multipart POST to https://api.nutrient.io/build and carry an instructions JSON field.
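Since every operation shares this request shape, the curl call can be wrapped once. The sketch below is an assumption about how one might do that, not part of the skill: the function name nutrient_build and its argument order are hypothetical, while the URL, header, and field layout follow the examples in this document.

```shell
# Hypothetical wrapper around the Build endpoint (the name nutrient_build is an assumption).
# Usage: nutrient_build <input-file> <instructions-json> <output-file>
# The multipart field name is the input file's basename, so the "file" value
# inside the instructions JSON must use that same name.
nutrient_build() {
  input=$1
  instructions=$2
  output=$3
  name=$(basename "$input")

  # Fail early instead of sending an unauthenticated or empty request.
  [ -n "${NUTRIENT_API_KEY:-}" ] || { echo "NUTRIENT_API_KEY is not set" >&2; return 1; }
  [ -f "$input" ] || { echo "input file not found: $input" >&2; return 1; }

  # --fail makes curl exit non-zero on HTTP errors; note that the error body
  # is then not written to the output file.
  curl --fail --silent --show-error -X POST https://api.nutrient.io/build \
    -H "Authorization: Bearer $NUTRIENT_API_KEY" \
    -F "$name=@$input" \
    -F "instructions=$instructions" \
    -o "$output"
}

# Example (commented out because it performs a real API call):
# nutrient_build document.docx '{"parts":[{"file":"document.docx"}]}' output.pdf
```

The two guard checks run before any network traffic, which keeps a missing key or a mistyped path from surfacing as an opaque HTTP error.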
Operations
Convert Documents
# DOCX to PDF
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "document.docx=@document.docx" \
-F 'instructions={"parts":[{"file":"document.docx"}]}' \
-o output.pdf
# PDF to DOCX
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "document.pdf=@document.pdf" \
-F 'instructions={"parts":[{"file":"document.pdf"}],"output":{"type":"docx"}}' \
-o output.docx
# HTML to PDF
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "index.html=@index.html" \
-F 'instructions={"parts":[{"html":"index.html"}]}' \
-o output.pdf
Supported input formats: PDF, DOCX, XLSX, PPTX, DOC, XLS, PPT, PPS, PPSX, ODT, RTF, HTML, JPG, PNG, TIFF, HEIC, GIF, WebP, SVG, TGA, EPS.
Extract Text and Data
# Extract plain text
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "document.pdf=@document.pdf" \
-F 'instructions={"parts":[{"file":"document.pdf"}],"output":{"type":"text"}}' \
-o output.txt
# Extract tables as Excel
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "document.pdf=@document.pdf" \
-F 'instructions={"parts":[{"file":"document.pdf"}],"output":{"type":"xlsx"}}' \
-o tables.xlsx
OCR Scanned Documents
# OCR to searchable PDF (supports 100+ languages)
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "scanned.pdf=@scanned.pdf" \
-F 'instructions={"parts":[{"file":"scanned.pdf"}],"actions":[{"type":"ocr","language":"english"}]}' \
-o searchable.pdf
Supported languages: 100+ languages via ISO 639-2 codes (e.g., eng, deu, fra, spa, jpn, kor, chi_sim, chi_tra, ara, hin, rus). Full language names such as english or german also work. See the complete OCR language table for all supported codes.
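When the OCR language varies per document, the instructions JSON can be generated instead of hand-edited. The helper below is a hypothetical sketch (the name ocr_instructions is an assumption); it only builds the same JSON shape as the OCR example above and performs no request and no validation of the language code.

```shell
# Hypothetical helper: print OCR instructions for one uploaded file.
# $1: multipart field name of the uploaded file
# $2: OCR language, as a code or full name (defaults to english)
# Assumes neither value contains characters that need JSON escaping.
ocr_instructions() {
  printf '{"parts":[{"file":"%s"}],"actions":[{"type":"ocr","language":"%s"}]}' \
    "$1" "${2:-english}"
}

ocr_instructions scanned.pdf deu
# prints: {"parts":[{"file":"scanned.pdf"}],"actions":[{"type":"ocr","language":"deu"}]}
```

The output can be passed to curl as -F "instructions=$(ocr_instructions scanned.pdf deu)" in place of the literal JSON string.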
Redact Sensitive Information
# Pattern-based (SSN, email)
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "document.pdf=@document.pdf" \
-F 'instructions={"parts":[{"file":"document.pdf"}],"actions":[{"type":"redaction","strategy":"preset","strategyOptions":{"preset":"social-security-number"}},{"type":"redaction","strategy":"preset","strategyOptions":{"preset":"email-address"}}]}' \
-o redacted.pdf
# Regex-based
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "document.pdf=@document.pdf" \
-F 'instructions={"parts":[{"file":"document.pdf"}],"actions":[{"type":"redaction","strategy":"regex","strategyOptions":{"regex":"\\b[A-Z]{2}\\d{6}\\b"}}]}' \
-o redacted.pdf
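The doubled backslashes in the regex example are JSON escaping: inside single quotes the shell passes \\b through untouched, and the JSON parser turns it back into \b. A small sketch of doing that escaping automatically follows; both function names are hypothetical, and json_escape handles only backslashes and double quotes, not control characters.

```shell
# Hypothetical helper: escape backslashes and double quotes for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build regex-redaction instructions in the same shape
# as the example above. $1: uploaded file name, $2: unescaped regex.
regex_redaction_instructions() {
  printf '{"parts":[{"file":"%s"}],"actions":[{"type":"redaction","strategy":"regex","strategyOptions":{"regex":"%s"}}]}' \
    "$1" "$(json_escape "$2")"
}

regex_redaction_instructions document.pdf '\b[A-Z]{2}\d{6}\b'
# prints: {"parts":[{"file":"document.pdf"}],"actions":[{"type":"redaction","strategy":"regex","strategyOptions":{"regex":"\\b[A-Z]{2}\\d{6}\\b"}}]}
```

Writing the pattern once in its natural form and escaping it programmatically avoids the common mistake of dropping a backslash when editing the JSON by hand.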
Presets: social-security-number, email-address, credit-card-number, international-phone-number, north-american-phone-number, date, time, url, ipv4, ipv6, mac-address, us-zip-code, vin.
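Several presets can be chained in one request, as the SSN and email example does. The sketch below builds that chain from a list of preset names and rejects names that are not in the list above. The function and variable names are hypothetical, and the action type "redaction" is copied from this document's examples; the run shown earlier on this page reported that the API's redaction action name differed, so check it against the current API reference before use.

```shell
# Preset names copied from the list in this document.
PRESETS="social-security-number email-address credit-card-number international-phone-number north-american-phone-number date time url ipv4 ipv6 mac-address us-zip-code vin"

# Hypothetical helper: print instructions with one redaction action per preset.
# $1: uploaded file name; remaining arguments: preset names.
redaction_instructions() {
  file=$1
  shift
  actions=""
  for preset in "$@"; do
    # Reject anything that is not a known preset name.
    case " $PRESETS " in
      *" $preset "*) ;;
      *) echo "unknown preset: $preset" >&2; return 1 ;;
    esac
    # Action shape taken from the document's examples (an assumption about the API).
    action=$(printf '{"type":"redaction","strategy":"preset","strategyOptions":{"preset":"%s"}}' "$preset")
    actions="${actions:+$actions,}$action"
  done
  printf '{"parts":[{"file":"%s"}],"actions":[%s]}' "$file" "$actions"
}

redaction_instructions document.pdf email-address international-phone-number
```

Validating names locally turns a typo such as "phone-number" into an immediate error instead of a failed or silently incomplete redaction request.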
Add Watermarks
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "document.pdf=@document.pdf" \
-F 'instructions={"parts":[{"file":"document.pdf"}],"actions":[{"type":"watermark","text":"CONFIDENTIAL","fontSize":72,"opacity":0.3,"rotation":-45}]}' \
-o watermarked.pdf
Digital Signatures
# Self-signed CMS signature
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "document.pdf=@document.pdf" \
-F 'instructions={"parts":[{"file":"document.pdf"}],"actions":[{"type":"sign","signatureType":"cms"}]}' \
-o signed.pdf
Fill PDF Forms
curl -X POST https://api.nutrient.io/build \
-H "Authorization: Bearer $NUTRIENT_API_KEY" \
-F "form.pdf=@form.pdf" \
-F 'instructions={"parts":[{"file":"form.pdf"}],"actions":[{"type":"fillForm","formFields":{"name":"Jane Smith","email":"jane@example.com","date":"2026-02-06"}}]}' \
-o filled.pdf
MCP Server (Alternative)
For native tool integration, use the MCP server instead of curl:
{
"mcpServers": {
"nutrient-dws": {
"command": "npx",
"args": ["-y", "@nutrient-sdk/dws-mcp-server"],
"env": {
"NUTRIENT_DWS_API_KEY": "YOUR_API_KEY",
"SANDBOX_PATH": "/path/to/working/directory"
}
}
}
}
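When to Use
- Converting documents between formats (PDF, DOCX, XLSX, PPTX, HTML, images)
- Extracting text, tables, or key-value pairs from PDFs
- Running OCR on scanned documents or images
- Redacting PII before sharing documents
- Adding watermarks to drafts or confidential documents
- Digitally signing contracts or agreements
- Filling PDF forms programmatically
Links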
FAQ
What does nutrient-document-processing do?
It processes, converts, OCRs, extracts, redacts, signs, and fills documents using the Nutrient DWS API. It supports PDF, DOCX, XLSX, PPTX, HTML, and image formats.
When should I use nutrient-document-processing?
Use it when you need a repeatable workflow that produces a PDF document and a text report.
What does nutrient-document-processing output?
In the evaluated run it produced a PDF document and a text report.
How do I install or invoke nutrient-document-processing?
npx skills add https://github.com/affaan-m/everything-claude-code --skill nutrient-document-processing
Which agents does nutrient-document-processing support?
Claude Code
What tools, channels, or permissions does nutrient-document-processing need?
It uses curl; channels commonly include pdf, text; permissions include filesystem:read, filesystem:write, network:outbound, env:read.
Is nutrient-document-processing safe to install?
Static analysis marked this skill as medium risk; review side effects and permissions before enabling it.
How is nutrient-document-processing different from an MCP or plugin?
A skill packages instructions and workflow conventions; tools, MCP servers, and plugins are dependencies the skill may call during execution.
Does nutrient-document-processing outperform not using a skill?
About nutrient-document-processing
When to use nutrient-document-processing
When you need to convert documents between formats such as PDF, DOCX, XLSX, PPTX, HTML, and images. When you need OCR, text/table extraction, or PII redaction on uploaded documents. When you need to programmatically watermark, sign, or fill PDF forms via an external API.
When nutrient-document-processing is not the right choice
When documents must not be sent to an external third-party service. When you need fully local/offline document processing without external API dependencies.
What it produces
Produces a PDF document and a text report.
Install
npx skills add https://github.com/affaan-m/everything-claude-code --skill nutrient-document-processing
Invoke: Ask Claude Code to use nutrient-document-processing for the task.