data-scraper-agent

Verified

by affaan-m · Updated Apr 16, 2026

opus

31 in · 28.4K out

$1.10 · 7m 27s

Build a fully automated AI-powered data collection agent for any public source — job boards, prices, news, GitHub, sports, anything. Scrapes on a schedule, enriches data with a free LLM (Gemini Flash), stores results in Notion/Sheets/Supabase, and learns from user feedback. Runs 100% free on GitHub Actions. Use when the user wants to monitor, collect, or track any public data automatically.

web-scraping · automation · python · github-actions · llm-enrichment · monitoring

See It In Action

2 turns·claude

The project codebase is in your current working directory. Explore the files to understand the code before making changes. need the 3 competitor pricing pages tracked into our Google Sheet with daily updates + a quick Gemini summary of plan changes, finance needs it by tomorrow

I'll use the data scraper agent skill to set this up for you. Let me first explore what's in the working directory, then kick off the skill.

I need a few details before I build this out.

I've got a few questions to nail down the specifics before building. Take a look and let me know your answers -- I'll have the full agent ready to go quickly after that.

Use the three pricing pages already in the workspace: Notion, monday.com, and ClickUp. Put them in the existing finance Google Sheet, refresh daily, and have Gemini give a short diff of any plan, price, or feature changes vs the previous snapshot.

Let me explore the workspace first.

Now create the package files in parallel:

Let me do a final sanity check on the complete file tree:

Everything is in place. Here's what was built:

What's ready

File structure:

├── scraper/main.py           orchestrator: scrape → Gemini → Sheets → save snapshot
├── ai/client.py              Gemini Flash client (4-model fallback chain)
├── storage/sheets_sync.py    Google Sheets writer (3 tabs)
├── setup.py                  one-time tab creation
├── requirements.txt
├── .github/workflows/scraper.yml   daily at 07:02 UTC
├── config.yaml               updated: Notion, monday.com, ClickUp
└── data/previous_prices.json updated: Notion baseline preserved, monday/ClickUp seeded empty

How it works per run:

Playwright loads each JS-rendered pricing page → extracts visible text
Single Gemini Flash call per competitor that simultaneously extracts structured plans, diffs vs previous snapshot, and writes a 2-3 sentence finance-team summary
Writes to three Sheet tabs: Raw Pricing Data (overwritten), Changes Log (append), AI Summaries (append)
Saves data/previous_prices.json → committed back to the repo via Actions

To ship today:

pip install -r requirements.txt
python -m playwright install chromium --with-deps
cp .env.example .env   # fill in GEMINI_API_KEY, GOOGLE_SERVICE_ACCOUNT_JSON, SPREADSHEET_ID
python setup.py        # creates the 3 tabs
python -m scraper.main # test run

Then push to GitHub and add 3 repo secrets: GEMINI_API_KEY, SPREADSHEET_ID, GOOGLE_SERVICE_ACCOUNT_JSON_CONTENT (paste the full service account JSON). The workflow fires daily at 07:02 UTC from there.

Claude Sonnet 4.6


client.py (6 KB) · competitors.md.diff (2 KB) · config.yaml.diff (1 KB) · data_previous_prices.json.diff (4 KB) · requirements.txt (0 KB) · main.py (5 KB) · setup.py (1 KB) · sheets_sync.py (6 KB)

External Tools

Tool	Type
github	mcp
python3	binary
pip	binary
git	binary

Permissions

Scope	Description
filesystem:read
filesystem:write
network:outbound
process:spawn
env:read

SKILL.md

Data Scraper Agent

Build a production-ready, AI-powered data collection agent for any public data source. Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.

Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase

When to Activate

User wants to scrape or monitor any public website or API
User says "build a bot that checks...", "monitor X for me", "collect data from..."
User wants to track jobs, prices, news, repos, sports scores, events, listings
User asks how to automate data collection without paying for hosting
User wants an agent that gets smarter over time based on their decisions

Core Concepts

The Three Layers

Every data scraper agent has three layers:

COLLECT  →  ENRICH        →  STORE
   │           │               │
Scraper     AI (LLM)        Database
runs on     scores,         Notion /
schedule    summarises      Sheets /
            & classifies    Supabase

Free Stack

Layer	Tool	Why
Scraping	`requests` + `BeautifulSoup`	No cost, covers 80% of public sites
JS-rendered sites	`playwright` (free)	When HTML scraping fails
AI enrichment	Gemini Flash via REST API	500 req/day, 1M tokens/day — free
Storage	Notion API	Free tier, great UI for review
Schedule	GitHub Actions cron	Free for public repos
Learning	JSON feedback file in repo	Zero infra, persists in git

AI Model Fallback Chain

Build agents to auto-fallback across Gemini models on quota exhaustion:

gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)

Batch API Calls for Efficiency

Never call the LLM once per item. Always batch:

# BAD: 33 API calls for 33 items
for item in items:
    result = call_ai(item)  # 33 calls → hits rate limit

# GOOD: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
    results = call_ai(batch)  # 7 calls → stays within free tier
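
The `chunks` helper used above is not defined anywhere in this skill. A minimal sketch of what it could look like is shown here; the name and signature are assumptions taken from the call site, not part of any library.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


batches = list(chunks(list(range(33)), size=5))
print(len(batches))      # 7 batches: six of 5 items and one of 3
print(len(batches[-1]))  # 3
```

The last batch is shorter than the rest, so any prompt that states the item count should use `len(batch)` rather than the configured batch size.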

Workflow

Step 1: Understand the Goal

Ask the user:

What to collect: "What data source? URL / API / RSS / public endpoint?"
What to extract: "What fields matter? Title, price, URL, date, score?"
How to store: "Where should results go? Notion, Google Sheets, Supabase, or local file?"
How to enrich: "Do you want AI to score, summarise, classify, or match each item?"
Frequency: "How often should it run? Every hour, daily, weekly?"

Common examples to prompt:

Job boards → score relevance to resume
Product prices → alert on drops
GitHub repos → summarise new releases
News feeds → classify by topic + sentiment
Sports results → extract stats to tracker
Events calendar → filter by interest

Step 2: Design the Agent Architecture

Generate this directory structure for the user:

my-agent/
├── config.yaml              # User customises this (keywords, filters, preferences)
├── profile/
│   └── context.md           # User context the AI uses (resume, interests, criteria)
├── scraper/
│   ├── __init__.py
│   ├── main.py              # Orchestrator: scrape → enrich → store
│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
│   └── sources/
│       ├── __init__.py
│       └── source_name.py   # One file per data source
├── ai/
│   ├── __init__.py
│   ├── client.py            # Gemini REST client with model fallback
│   ├── pipeline.py          # Batch AI analysis
│   ├── jd_fetcher.py        # Fetch full content from URLs (optional)
│   └── memory.py            # Learn from user feedback
├── storage/
│   ├── __init__.py
│   └── notion_sync.py       # Or sheets_sync.py / supabase_sync.py
├── data/
│   └── feedback.json        # User decision history (auto-updated)
├── .env.example
├── setup.py                 # One-time DB/schema creation
├── enrich_existing.py       # Backfill AI scores on old rows
├── requirements.txt
└── .github/
    └── workflows/
        └── scraper.yml      # GitHub Actions schedule
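
One way to generate that tree is a small scaffolding script. The sketch below is an assumption about how this could be done, not something the skill ships: `SKELETON` and `scaffold` are hypothetical names, and every file is created empty as a placeholder.

```python
from pathlib import Path

# Relative paths copied from the tree above; contents are left empty on purpose.
SKELETON = [
    "config.yaml",
    "profile/context.md",
    "scraper/__init__.py",
    "scraper/main.py",
    "scraper/filters.py",
    "scraper/sources/__init__.py",
    "scraper/sources/source_name.py",
    "ai/__init__.py",
    "ai/client.py",
    "ai/pipeline.py",
    "ai/jd_fetcher.py",
    "ai/memory.py",
    "storage/__init__.py",
    "storage/notion_sync.py",
    "data/feedback.json",
    ".env.example",
    "setup.py",
    "enrich_existing.py",
    "requirements.txt",
    ".github/workflows/scraper.yml",
]


def scaffold(root: Path) -> list[Path]:
    """Create any missing files under `root`; return the paths that were created."""
    created = []
    for rel in SKELETON:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.touch()
            created.append(path)
    return created
```

Calling `scaffold(Path("my-agent"))` a second time returns an empty list, so it should be safe to re-run without overwriting files that have since been filled in.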

Step 3: Build the Scraper Source

Template for any data source:

# scraper/sources/my_source.py
"""
[Source Name] — scrapes [what] from [where].
Method: [REST API / HTML scraping / RSS feed]
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
from scraper.filters import is_relevant

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
}


def fetch() -> list[dict]:
    """
    Returns a list of items with consistent schema.
    Each item must have at minimum: name, url, date_found.
    """
    results = []

    # ---- REST API source ----
    resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)
    if resp.status_code == 200:
        for item in resp.json().get("results", []):
            if not is_relevant(item.get("title", "")):
                continue
            results.append(_normalise(item))

    return results


def _normalise(raw: dict) -> dict:
    """Convert raw API/HTML data to the standard schema."""
    return {
        "name": raw.get("title", ""),
        "url": raw.get("link", ""),
        "source": "MySource",
        "date_found": datetime.now(timezone.utc).date().isoformat(),
        # add domain-specific fields here
    }
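
The template imports `is_relevant` from `scraper/filters.py`, which is not shown in this skill. A minimal sketch follows, assuming it applies the `required_keywords` and `blocked_keywords` semantics described in `config.yaml`. The `required` and `blocked` parameters are hypothetical; a real version would probably load them from the config file instead of taking them as arguments.

```python
# scraper/filters.py (hypothetical sketch)
from typing import Iterable, Optional


def is_relevant(
    text: str,
    required: Optional[Iterable[str]] = None,
    blocked: Optional[Iterable[str]] = None,
) -> bool:
    """Cheap keyword pre-filter that runs before any AI call."""
    haystack = text.lower()
    # Any blocked keyword rejects the item outright.
    if any(word.lower() in haystack for word in (blocked or [])):
        return False
    required = list(required or [])
    # An empty required list means "accept everything that is not blocked".
    if not required:
        return True
    # Otherwise the item must contain at least one required keyword.
    return any(word.lower() in haystack for word in required)


print(is_relevant("Senior Python Engineer", required=["python"], blocked=["intern"]))  # True
print(is_relevant("Python Intern", required=["python"], blocked=["intern"]))           # False
```

Matching is plain case-insensitive substring search, so "intern" would also block "international"; word-boundary matching may be worth adding if that matters for the source.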

HTML scraping pattern:

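from urllib.parse import urljoin

soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select("[class*='listing']"):
    heading = card.select_one("h2, h3")
    anchor = card.select_one("a[href]")
    if heading is None or anchor is None:
        continue  # skip cards that lack a title or a link
    title = heading.get_text(strip=True)
    # urljoin leaves absolute URLs alone and resolves relative ones
    link = urljoin("https://example.com", anchor["href"])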

RSS feed pattern:

import xml.etree.ElementTree as ET
root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
    title = item.findtext("title", "")
    link = item.findtext("link", "")

Step 4: Build the Gemini AI Client

# ai/client.py
import os, json, time, requests

_last_call = 0.0

MODEL_FALLBACK = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:
    """Call Gemini with auto-fallback on 429. Returns parsed JSON or {}."""
    global _last_call

    api_key = os.environ.get("GEMINI_API_KEY", "")
    if not api_key:
        return {}

    elapsed = time.time() - _last_call
    if elapsed < rate_limit:
        time.sleep(rate_limit - elapsed)

    # Try the requested model first, then the rest of the fallback chain.
    models = ([model] + [m for m in MODEL_FALLBACK if m != model]) if model else MODEL_FALLBACK
    _last_call = time.time()

    for m in models:
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{m}:generateContent?key={api_key}"
        payload = {
            "contents": [{"parts": [{"text": prompt}]}],
            "generationConfig": {
                "responseMimeType": "application/json",
                "temperature": 0.3,
                "maxOutputTokens": 2048,
            },
        }
        try:
            resp = requests.post(url, json=payload, timeout=30)
            if resp.status_code == 200:
                return _parse(resp)
            if resp.status_code in (429, 404):
                time.sleep(1)
                continue
            return {}
        except requests.RequestException:
            return {}

    return {}


def _parse(resp) -> dict:
    try:
        text = (
            resp.json()
            .get("candidates", [{}])[0]
            .get("content", {})
            .get("parts", [{}])[0]
            .get("text", "")
            .strip()
        )
        if text.startswith("```"):
            text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
        return json.loads(text)
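    # ValueError covers json.JSONDecodeError; IndexError covers an empty
    # "candidates" or "parts" list; the rest cover unexpected response shapes.
    except (ValueError, KeyError, IndexError, AttributeError, TypeError):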
        return {}

Step 5: Build the AI Pipeline (Batch)

# ai/pipeline.py
import json
import yaml
from pathlib import Path
from ai.client import generate

def analyse_batch(items: list[dict], context: str = "", preference_prompt: str = "") -> list[dict]:
    """Analyse items in batches. Returns items enriched with AI fields."""
    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
    model = config.get("ai", {}).get("model", "gemini-2.5-flash")
    rate_limit = config.get("ai", {}).get("rate_limit_seconds", 7.0)
    min_score = config.get("ai", {}).get("min_score", 0)
    batch_size = config.get("ai", {}).get("batch_size", 5)

    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    print(f"  [AI] {len(items)} items → {len(batches)} API calls")

    enriched = []
    for i, batch in enumerate(batches):
        print(f"  [AI] Batch {i + 1}/{len(batches)}...")
        prompt = _build_prompt(batch, context, preference_prompt, config)
        result = generate(prompt, model=model, rate_limit=rate_limit)

        analyses = result.get("analyses", [])
        for j, item in enumerate(batch):
            ai = analyses[j] if j < len(analyses) else {}
            if ai:
                score = max(0, min(100, int(ai.get("score", 0))))
                if min_score and score < min_score:
                    continue
                enriched.append({**item, "ai_score": score, "ai_summary": ai.get("summary", ""), "ai_notes": ai.get("notes", "")})
            else:
                enriched.append(item)

    return enriched


def _build_prompt(batch, context, preference_prompt, config):
    priorities = config.get("priorities", [])
    items_text = "\n\n".join(
        f"Item {i+1}: {json.dumps({k: v for k, v in item.items() if not k.startswith('_')})}"
        for i, item in enumerate(batch)
    )

    return f"""Analyse these {len(batch)} items and return a JSON object.

# Items
{items_text}

# User Context
{context[:800] if context else "Not provided"}

# User Priorities
{chr(10).join(f"- {p}" for p in priorities)}

{preference_prompt}

# Instructions
Return: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": "<why this matches or doesn't>"}} for each item in order]}}
Be concise. Score 90+=excellent match, 70-89=good, 50-69=ok, <50=weak."""

Step 6: Build the Feedback Learning System

# ai/memory.py
"""Learn from user decisions to improve future scoring."""
import json
from pathlib import Path

FEEDBACK_PATH = Path(__file__).parent.parent / "data" / "feedback.json"


def load_feedback() -> dict:
    if FEEDBACK_PATH.exists():
        try:
            return json.loads(FEEDBACK_PATH.read_text())
        except (json.JSONDecodeError, OSError):
            pass
    return {"positive": [], "negative": []}


def save_feedback(fb: dict):
    FEEDBACK_PATH.parent.mkdir(parents=True, exist_ok=True)
    FEEDBACK_PATH.write_text(json.dumps(fb, indent=2))


def build_preference_prompt(feedback: dict, max_examples: int = 15) -> str:
    """Convert feedback history into a prompt bias section."""
    lines = []
    if feedback.get("positive"):
        lines.append("# Items the user LIKED (positive signal):")
        for e in feedback["positive"][-max_examples:]:
            lines.append(f"- {e}")
    if feedback.get("negative"):
        lines.append("\n# Items the user SKIPPED/REJECTED (negative signal):")
        for e in feedback["negative"][-max_examples:]:
            lines.append(f"- {e}")
    if lines:
        lines.append("\nUse these patterns to bias scoring on new items.")
    return "\n".join(lines)

Integration with your storage layer: after each run, query your DB for items with positive/negative status and call save_feedback() with the extracted patterns.
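
A minimal sketch of that integration step is shown below. It assumes reviewed rows have already been fetched from storage as dicts with `name` and `status` keys; `collect_feedback` and that row shape are hypothetical, and the status lists mirror the `feedback` section of `config.yaml`.

```python
def collect_feedback(rows, positive_statuses, negative_statuses, existing=None):
    """Merge reviewed rows into a {"positive": [...], "negative": [...]} dict."""
    existing = existing or {}
    feedback = {
        "positive": list(existing.get("positive", [])),
        "negative": list(existing.get("negative", [])),
    }
    for row in rows:
        name, status = row.get("name", ""), row.get("status", "")
        if not name:
            continue
        if status in positive_statuses:
            bucket = "positive"
        elif status in negative_statuses:
            bucket = "negative"
        else:
            continue  # still "New", or a status the config does not list
        if name not in feedback[bucket]:
            feedback[bucket].append(name)  # keep first-seen order, no duplicates
    return feedback


rows = [
    {"name": "Backend role at Acme", "status": "Applied"},
    {"name": "Unpaid internship", "status": "Skip"},
    {"name": "Unreviewed item", "status": "New"},
]
fb = collect_feedback(
    rows,
    positive_statuses=["Saved", "Applied", "Interested"],
    negative_statuses=["Skip", "Rejected", "Not relevant"],
)
print(fb)
```

The resulting dict has the same shape that `load_feedback()` returns, so it can be passed to `save_feedback()` as-is.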

Step 7: Build Storage (Notion example)

# storage/notion_sync.py
import os
from notion_client import Client
from notion_client.errors import APIResponseError

_client = None

def get_client():
    global _client
    if _client is None:
        _client = Client(auth=os.environ["NOTION_TOKEN"])
    return _client

def get_existing_urls(db_id: str) -> set[str]:
    """Fetch all URLs already stored — used for deduplication."""
    client, seen, cursor = get_client(), set(), None
    while True:
        # Only pass start_cursor once the API has returned one.
        kwargs = {"start_cursor": cursor} if cursor else {}
        resp = client.databases.query(database_id=db_id, page_size=100, **kwargs)
        for page in resp["results"]:
            url = page["properties"].get("URL", {}).get("url", "")
            if url: seen.add(url)
        if not resp["has_more"]: break
        cursor = resp["next_cursor"]
    return seen

def push_item(db_id: str, item: dict) -> bool:
    """Push one item to Notion. Returns True on success."""
    props = {
        "Name": {"title": [{"text": {"content": item.get("name", "")[:100]}}]},
        "URL": {"url": item.get("url")},
        "Source": {"select": {"name": item.get("source", "Unknown")}},
        "Date Found": {"date": {"start": item.get("date_found")}},
        "Status": {"select": {"name": "New"}},
    }
    # AI fields
    if item.get("ai_score") is not None:
        props["AI Score"] = {"number": item["ai_score"]}
    if item.get("ai_summary"):
        props["Summary"] = {"rich_text": [{"text": {"content": item["ai_summary"][:2000]}}]}
    if item.get("ai_notes"):
        props["Notes"] = {"rich_text": [{"text": {"content": item["ai_notes"][:2000]}}]}

    try:
        get_client().pages.create(parent={"database_id": db_id}, properties=props)
        return True
    except APIResponseError as e:
        print(f"[notion] Push failed: {e}")
        return False

def sync(db_id: str, items: list[dict]) -> tuple[int, int]:
    existing = get_existing_urls(db_id)
    added = skipped = 0
    for item in items:
        if item.get("url") in existing:
            skipped += 1; continue
        if push_item(db_id, item):
            added += 1; existing.add(item["url"])
        else:
            skipped += 1
    return added, skipped
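
`config.yaml` lists `sqlite` as a provider option, but no storage module for it appears in this skill. The sketch below shows one way such a module might look using only the standard library. The file name, table schema, and the choice to pass a connection instead of a database ID are all assumptions.

```python
# storage/sqlite_sync.py (hypothetical sketch)
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    url        TEXT PRIMARY KEY,
    name       TEXT,
    source     TEXT,
    date_found TEXT,
    ai_score   INTEGER,
    ai_summary TEXT
)
"""


def sync(conn: sqlite3.Connection, items: list[dict]) -> tuple[int, int]:
    """Insert new items keyed by URL. Returns (added, skipped)."""
    conn.execute(SCHEMA)
    added = skipped = 0
    for item in items:
        url = item.get("url")
        if not url:
            skipped += 1
            continue
        # The PRIMARY KEY on url makes the database do the deduplication.
        cur = conn.execute(
            "INSERT OR IGNORE INTO items VALUES (?, ?, ?, ?, ?, ?)",
            (
                url,
                item.get("name", ""),
                item.get("source", ""),
                item.get("date_found"),
                item.get("ai_score"),
                item.get("ai_summary", ""),
            ),
        )
        if cur.rowcount == 1:
            added += 1
        else:
            skipped += 1
    conn.commit()
    return added, skipped
```

Because it returns the same `(added, skipped)` tuple as the Notion version, the orchestrator should only need a different first argument, for example `sqlite3.connect("data/items.db")`.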

Step 8: Orchestrate in main.py

# scraper/main.py
import os, sys, yaml
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

from scraper.sources import my_source          # add your sources

# NOTE: This example uses Notion. If storage.provider is "sheets" or "supabase",
# replace this import with storage.sheets_sync or storage.supabase_sync and update
# the env var and sync() call accordingly.
from storage.notion_sync import sync

SOURCES = [
    ("My Source", my_source.fetch),
]

def ai_enabled():
    return bool(os.environ.get("GEMINI_API_KEY"))

def main():
    config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
    provider = config.get("storage", {}).get("provider", "notion")

    # Resolve the storage target identifier from env based on provider
    if provider == "notion":
        db_id = os.environ.get("NOTION_DATABASE_ID")
        if not db_id:
            print("ERROR: NOTION_DATABASE_ID not set"); sys.exit(1)
    else:
        # Extend here for sheets (SHEET_ID) or supabase (SUPABASE_TABLE) etc.
        print(f"ERROR: provider '{provider}' not yet wired in main.py"); sys.exit(1)

    all_items = []

    for name, fetch_fn in SOURCES:
        try:
            items = fetch_fn()
            print(f"[{name}] {len(items)} items")
            all_items.extend(items)
        except Exception as e:
            print(f"[{name}] FAILED: {e}")

    # Deduplicate by URL
    seen, deduped = set(), []
    for item in all_items:
        if (url := item.get("url", "")) and url not in seen:
            seen.add(url); deduped.append(item)

    print(f"Unique items: {len(deduped)}")

    if ai_enabled() and deduped:
        from ai.memory import load_feedback, build_preference_prompt
        from ai.pipeline import analyse_batch

        # load_feedback() reads data/feedback.json written by your feedback sync script.
        # To keep it current, implement a separate feedback_sync.py that queries your
        # storage provider for items with positive/negative statuses and calls save_feedback().
        feedback = load_feedback()
        preference = build_preference_prompt(feedback)
        context_path = Path(__file__).parent.parent / "profile" / "context.md"
        context = context_path.read_text() if context_path.exists() else ""
        deduped = analyse_batch(deduped, context=context, preference_prompt=preference)
    elif not deduped:
        print("[AI] Skipped: no items to analyse")
    else:
        print("[AI] Skipped: GEMINI_API_KEY not set")

    added, skipped = sync(db_id, deduped)
    print(f"Done — {added} new, {skipped} existing")

if __name__ == "__main__":
    main()

Step 9: GitHub Actions Workflow

# .github/workflows/scraper.yml
name: Data Scraper Agent

on:
  schedule:
    - cron: "0 */3 * * *"  # every 3 hours — adjust to your needs
  workflow_dispatch:        # allow manual trigger

permissions:
  contents: write   # required for the feedback-history commit step

jobs:
  scrape:
    runs-on: ubuntu-latest
    timeout-minutes: 20

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - run: pip install -r requirements.txt

      # Uncomment if Playwright is enabled in requirements.txt
      # - name: Install Playwright browsers
      #   run: python -m playwright install chromium --with-deps

      - name: Run agent
        env:
          NOTION_TOKEN: ${{ secrets.NOTION_TOKEN }}
          NOTION_DATABASE_ID: ${{ secrets.NOTION_DATABASE_ID }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: python -m scraper.main

      - name: Commit feedback history
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/feedback.json || true
          git diff --cached --quiet || git commit -m "chore: update feedback history"
          git push

Step 10: config.yaml Template

# Customise this file — no code changes needed

# What to collect (pre-filter before AI)
filters:
  required_keywords: []      # item must contain at least one
  blocked_keywords: []       # item must not contain any

# Your priorities — AI uses these for scoring
priorities:
  - "example priority 1"
  - "example priority 2"

# Storage
storage:
  provider: "notion"         # notion | sheets | supabase | sqlite

# Feedback learning
feedback:
  positive_statuses: ["Saved", "Applied", "Interested"]
  negative_statuses: ["Skip", "Rejected", "Not relevant"]

# AI settings
ai:
  enabled: true
  model: "gemini-2.5-flash"
  min_score: 0               # filter out items below this score
  rate_limit_seconds: 7      # seconds between API calls
  batch_size: 5              # items per API call

Common Scraping Patterns

Pattern 1: REST API (easiest)

resp = requests.get(url, params={"q": query}, headers=HEADERS, timeout=15)
items = resp.json().get("results", [])

Pattern 2: HTML Scraping

soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select(".listing-card"):
    title = card.select_one("h2").get_text(strip=True)
    href = card.select_one("a")["href"]

Pattern 3: RSS Feed

import xml.etree.ElementTree as ET
root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
    title = item.findtext("title", "")
    link = item.findtext("link", "")
    pub_date = item.findtext("pubDate", "")

Pattern 4: Paginated API

page = 1
while True:
    resp = requests.get(url, params={"page": page, "limit": 50}, timeout=15)
    data = resp.json()
    items = data.get("results", [])
    if not items:
        break
    for item in items:
        results.append(_normalise(item))
    if not data.get("has_more"):
        break
    page += 1
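
The loop above has no upper bound, so a source that always reports `has_more` would keep it running until the job times out. The sketch below wraps the same logic in a generator with a page cap. `paginate`, `fetch_page`, and `max_pages` are hypothetical names, and `fake_fetch` stands in for a real HTTP call.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int], dict], max_pages: int = 100) -> Iterator[dict]:
    """Yield items page by page until the source runs dry or the cap is hit."""
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        items = data.get("results", [])
        if not items:
            return
        yield from items
        if not data.get("has_more"):
            return


# Stand-in for a real HTTP call: three pages of two items each.
def fake_fetch(page: int) -> dict:
    start = (page - 1) * 2
    return {"results": [{"id": start}, {"id": start + 1}], "has_more": page < 3}


print([item["id"] for item in paginate(fake_fetch)])  # [0, 1, 2, 3, 4, 5]
```

In a real source, `fetch_page` would wrap the `requests.get(...)` call and return `resp.json()`.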

Pattern 5: JS-Rendered Pages (Playwright)

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector(".listing")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")

Anti-Patterns to Avoid

Anti-pattern	Problem	Fix
One LLM call per item	Hits rate limits instantly	Batch 5 items per call
Hardcoded keywords in code	Not reusable	Move all config to `config.yaml`
Scraping without rate limit	IP ban	Add `time.sleep(1)` between requests
Storing secrets in code	Security risk	Always use `.env` + GitHub Secrets
No deduplication	Duplicate rows pile up	Always check URL before pushing
Ignoring `robots.txt`	Legal/ethical risk	Respect crawl rules; use public APIs when available
JS-rendered sites with `requests`	Empty response	Use Playwright or look for the underlying API
`maxOutputTokens` too low	Truncated JSON, parse error	Use 2048+ for batch responses
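
Two of the fixes above, respecting `robots.txt` and pausing between requests, can be combined with the standard library alone. The following is a sketch under stated assumptions: `build_robots`, `polite_urls`, and the `research-bot` user-agent token are hypothetical, and the rules are parsed from a string here so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-bot"  # token matched against robots.txt User-agent lines


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_urls(urls, robots: RobotFileParser, delay: float = 1.0):
    """Yield only the URLs robots.txt allows, pausing between each one."""
    first = True
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue
        if not first:
            time.sleep(delay)
        first = False
        yield url


robots = build_robots("User-agent: *\nDisallow: /private/\n")
allowed = list(polite_urls(
    ["https://example.com/pricing", "https://example.com/private/admin"],
    robots,
    delay=0.0,  # no pause in this demo; keep the 1-second default for real runs
))
print(allowed)  # ['https://example.com/pricing']
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file directly instead of parsing a string.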

Free Tier Limits Reference

Service	Free Limit	Typical Usage
Gemini Flash Lite	30 RPM, 1500 RPD	~56 req/day at 3-hr intervals
Gemini 2.0 Flash	15 RPM, 1500 RPD	Good fallback
Gemini 2.5 Flash	10 RPM, 500 RPD	Use sparingly
GitHub Actions	Unlimited (public repos)	~20 min/day
Notion API	Unlimited	~200 writes/day
Supabase	500MB DB, 2GB transfer	Fine for most agents
Google Sheets API	300 req/min	Works for small agents
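
The "~56 req/day" figure in the first row appears to follow from the numbers used elsewhere in this skill: 33 items per run, a batch size of 5, and a run every 3 hours. The arithmetic, as a sketch that can be adapted to other volumes:

```python
import math

items_per_run = 33      # example figure from the batching section
batch_size = 5          # default batch_size in config.yaml
run_interval_hours = 3  # cron "0 */3 * * *"

calls_per_run = math.ceil(items_per_run / batch_size)
runs_per_day = 24 // run_interval_hours
calls_per_day = calls_per_run * runs_per_day

print(calls_per_run, runs_per_day, calls_per_day)  # 7 8 56
```

Comparing `calls_per_day` with the RPD column shows how much headroom a given model leaves before the fallback chain is needed.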

Requirements Template

requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
python-dotenv==1.0.1
pyyaml==6.0.2
notion-client==2.2.1   # if using Notion
# playwright==1.40.0   # uncomment for JS-rendered sites

Quality Checklist

Before marking the agent complete:

Real-World Examples

"Build me an agent that monitors Hacker News for AI startup funding news"
"Scrape product prices from 3 e-commerce sites and alert when they drop"
"Track new GitHub repos tagged with 'llm' or 'agents' — summarise each one"
"Collect Chief of Staff job listings from LinkedIn and Cutshort into Notion"
"Monitor a subreddit for posts mentioning my company — classify sentiment"
"Scrape new academic papers from arXiv on a topic I care about daily"
"Track sports fixture results and keep a running table in Google Sheets"
"Build a real estate listing watcher — alert on new properties under ₹1 Cr"

Reference Implementation

A complete working agent built with this exact architecture would scrape 4+ sources, batch Gemini calls, learn from Applied/Rejected decisions stored in Notion, and run 100% free on GitHub Actions. Follow Steps 1–10 above to build your own.

FAQ

What does data-scraper-agent do?

When should I use data-scraper-agent?

Use it when you need a repeatable workflow that produces source code, a code diff, and a text report.

What does data-scraper-agent output?

In the evaluated run it produced source code, a code diff, and a text report.

How do I install or invoke data-scraper-agent?

npx skills add https://github.com/affaan-m/everything-claude-code --skill data-scraper-agent

Which agents does data-scraper-agent support?

Claude Code

What tools, channels, or permissions does data-scraper-agent need?

It uses github, python3, pip, git; channels commonly include code, diff, text; permissions include filesystem:read, filesystem:write, network:outbound, process:spawn, env:read.

Is data-scraper-agent safe to install?

Static analysis marked this skill as medium risk; review side effects and permissions before enabling it.

How is data-scraper-agent different from an MCP or plugin?

A skill packages instructions and workflow conventions; tools, MCP servers, and plugins are dependencies the skill may call during execution.

Does data-scraper-agent outperform not using a skill?

About data-scraper-agent

When to use data-scraper-agent

When you need a scheduled scraper for public websites, feeds, or APIs. When you want scraped results scored, summarized, or classified with an LLM. When you want a low-cost monitoring workflow that runs via GitHub Actions.

When data-scraper-agent is not the right choice

When you need access to private or authenticated data sources not covered by the skill. When you only want a one-off manual scrape without building a reusable automation project.

What it produces

Produces source code, a code diff, and a text report.

Install

npx skills add https://github.com/affaan-m/everything-claude-code --skill data-scraper-agent

Invoke: Ask Claude Code to use data-scraper-agent for the task.