Why AI agents need dedicated security
Standard network security wasn’t designed for AI systems. Agents introduce threats that firewalls and WAFs don’t cover:
- Prompt injection — malicious instructions embedded in web pages, emails, or documents that hijack agent behavior when fetched as tool output
- System prompt extraction — jailbreak attempts that try to make an agent reveal its instructions
- Tool chain exfiltration — an agent reads a file and immediately makes an external web request, leaking the contents
- Indirect injection — an attacker plants instructions in a shared document, Slack message, or API response knowing your agent will eventually read it
- Privilege escalation — inputs that try to elevate agent permissions beyond what was configured
Architecture
Sentinel runs 6 defense layers in sequence on every request.
Layer 2a: Prompt Guard
The Prompt Guard scans every inbound message and outbound tool output for injection patterns.
Injection signatures
| Rule | Pattern | Action |
|---|---|---|
| system_prompt_extraction | "ignore all previous instructions" | Block |
| reveal_instructions | "reveal your system prompt" | Block |
| jailbreak_dan | "you are now DAN / unrestricted" | Block |
| jailbreak_roleplay | "pretend you are evil / malicious" | Block |
| indirect_injection_html | `<!-- ignore previous -->` | Quarantine |
| indirect_injection_markdown | "IMPORTANT: new instructions:" | Quarantine |
| base64_extraction | base64/hex encode system prompt | Quarantine |
| privilege_escalation | "escalate permission / override auth" | Block |
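
The signature table can be read as a list of (rule, pattern, action) triples. A minimal sketch of how such a scan might work, assuming simple case-insensitive regular expressions. The regexes, the `Finding` type, and the `scan` and `verdict` functions are illustrative assumptions, not Sentinel's actual signatures or API:

```python
import re
from dataclasses import dataclass

# Hypothetical regexes approximating a few rows of the signature table.
# Sentinel's real patterns are not documented on this page.
SIGNATURES = [
    ("system_prompt_extraction", r"ignore\s+all\s+previous\s+instructions", "block"),
    ("reveal_instructions", r"reveal\s+your\s+system\s+prompt", "block"),
    ("indirect_injection_html", r"<!--\s*ignore\s+previous.*?-->", "quarantine"),
    ("indirect_injection_markdown", r"IMPORTANT:\s*new\s+instructions:", "quarantine"),
]


@dataclass(frozen=True)
class Finding:
    rule: str
    action: str


def scan(text: str) -> list[Finding]:
    """Return one Finding per signature that matches the text."""
    findings = []
    for rule, pattern, action in SIGNATURES:
        if re.search(pattern, text, flags=re.IGNORECASE | re.DOTALL):
            findings.append(Finding(rule, action))
    return findings


def verdict(findings: list[Finding]) -> str:
    """Collapse findings into one action; block outranks quarantine."""
    actions = {f.action for f in findings}
    if "block" in actions:
        return "block"
    if "quarantine" in actions:
        return "quarantine"
    return "allow"
```

Treating "block" as stronger than "quarantine" is a design assumption of this sketch; the page does not say how Sentinel resolves multiple matches.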
Tool output scanning
Tool outputs are scanned separately because they’re a primary injection vector. When your agent fetches a webpage or reads a file, the Prompt Guard checks the content before it’s returned to the agent.
System prompt fingerprinting
Sentinel registers a fingerprint of every agent’s system prompt on startup. If an agent’s response contains phrases that match its own system prompt, it’s flagged as a potential leak and blocked. This catches the class of attacks where a user tricks an agent into repeating its instructions back.
Layer 2b: Tool Guard
The Tool Guard intercepts every tool call before it executes.
Dangerous tool chain detection
Certain tool sequences are high-risk even when each individual call looks legitimate. The Tool Guard tracks the last 5 tool calls per session and blocks dangerous combinations:
| Chain | Risk | Action |
|---|---|---|
| read_file → web_fetch | File exfiltration | Block |
| read_file → web_search | File exfiltration | Block |
| read_file → execute_code | Credential theft + execution | Block |
| write_file → execute_code | Persistence / backdoor | Require approval |
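
The chain check can be sketched with a bounded per-session history. This is a minimal illustration, assuming a chain matches when the earlier tool appears anywhere in the last 5 calls (the page does not say whether the two calls must be adjacent). The `ToolChainGuard` class and the action strings are hypothetical names:

```python
from collections import defaultdict, deque

# Chains from the table: (earlier tool, later tool) -> action.
DANGEROUS_CHAINS = {
    ("read_file", "web_fetch"): "block",
    ("read_file", "web_search"): "block",
    ("read_file", "execute_code"): "block",
    ("write_file", "execute_code"): "require_approval",
}

WINDOW = 5  # the Tool Guard tracks the last 5 tool calls per session


class ToolChainGuard:
    def __init__(self) -> None:
        # One bounded history per session; the oldest call drops off automatically.
        self._history = defaultdict(lambda: deque(maxlen=WINDOW))

    def check(self, session_id: str, tool: str) -> str:
        """Return 'block', 'require_approval', or 'allow' for the next call."""
        history = self._history[session_id]
        decision = "allow"
        for earlier in history:
            action = DANGEROUS_CHAINS.get((earlier, tool))
            if action == "block":
                decision = "block"
                break
            if action == "require_approval":
                decision = "require_approval"
        if decision != "block":
            # Assumption: blocked calls never ran, so they are not recorded.
            history.append(tool)
        return decision
```

For example, `read_file` followed two calls later by `web_fetch` in the same session returns `"block"`, while the same `web_fetch` in a fresh session returns `"allow"`.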
File access boundaries
Agents are sandboxed to specific filesystem paths. Attempts to access sensitive paths are blocked outright.
Terminal command filtering
Shell commands are checked against a deny list before execution:
- `curl * | bash` / `wget * | sh`: remote code execution
- `rm -rf /`: destructive commands
- `nc -l`: reverse shell (netcat listener)
- `bash -i >& /dev/tcp/...`: bash reverse shell
- `eval $(base64 ...)`: encoded payload execution
- `sudo` commands: held for human approval
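
A deny list like this is often implemented as a set of regular expressions checked before the command runs. The sketch below assumes that approach; the patterns and the `check_command` function are illustrative guesses at the shape of the rules, not Sentinel's real deny list, and regex matching alone is easy to evade with quoting or aliasing:

```python
import re

# Hypothetical regexes approximating the deny list entries.
DENY_PATTERNS = [
    (r"\b(curl|wget)\b.*\|\s*(ba)?sh\b", "remote code execution"),
    (r"\brm\s+-rf\s+/(\s|$)", "destructive command"),
    (r"\bnc\s+(-\w+\s+)*-\w*l", "netcat listener"),
    (r"\bbash\s+-i\s+>&\s*/dev/tcp/", "bash reverse shell"),
    (r"\beval\s+\$\(\s*base64\b", "encoded payload execution"),
]

# sudo at the start of the command or after a shell separator.
APPROVAL_PATTERN = r"(^|[;&|]\s*)sudo\b"


def check_command(command: str) -> tuple[str, str]:
    """Return (decision, reason); decision is block, require_approval, or allow."""
    for pattern, reason in DENY_PATTERNS:
        if re.search(pattern, command):
            return ("block", reason)
    if re.search(APPROVAL_PATTERN, command.strip()):
        return ("require_approval", "sudo command")
    return ("allow", "")
```

Note that `rm -rf /tmp/build` is allowed by this sketch because the pattern only matches a bare `/` target.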
Layer 3: Canary Engine
Canary tokens are fake credentials, URLs, and secrets planted in your agents’ environment. If any of them are ever used or accessed externally, you have confirmed compromise: not a warning, not a heuristic, not a guess.
Token types
| Type | Description |
|---|---|
| api_key | Fake OpenAI, Anthropic, AWS, GitHub, Stripe keys in realistic format |
| url | Callback URLs that alert when accessed (/t/admin-panel, /t/secrets-vault) |
| credential | Fake admin and database credentials in config files |
| file_content | Planted SSH keys and .env files with canary values |
| prompt_bait | Hidden strings in agent system prompts; if they appear in output, the prompt leaked |
What a trigger means
When a canary is triggered:
- `api_key` used externally → EMERGENCY: confirmed exfiltration
- `url` accessed from unknown IP → CRITICAL: active probing
- `prompt_bait` appears in agent output → CRITICAL: system prompt leaked
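
The trigger rules amount to a lookup from (token type, event) to severity, plus a substring check for prompt bait. A minimal sketch, assuming hypothetical event names and a hypothetical `Canary` record; Sentinel's internal representation is not described on this page:

```python
from dataclasses import dataclass
from typing import Optional

# Severity per trigger, as listed in this section.
# The event names are illustrative assumptions.
TRIGGER_SEVERITY = {
    ("api_key", "used_externally"): "EMERGENCY",
    ("url", "accessed_from_unknown_ip"): "CRITICAL",
    ("prompt_bait", "appeared_in_output"): "CRITICAL",
}


@dataclass(frozen=True)
class Canary:
    token_type: str  # one of the types in the token table
    value: str       # the planted fake secret or bait string


def check_output_for_bait(output: str, canaries: list[Canary]) -> Optional[str]:
    """Return the severity if any prompt_bait value appears in agent output."""
    for canary in canaries:
        if canary.token_type == "prompt_bait" and canary.value in output:
            return TRIGGER_SEVERITY[("prompt_bait", "appeared_in_output")]
    return None
```

Only `prompt_bait` values are checked against output here; the other token types trigger on external use or access, which would be detected outside the agent's response path.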
Layer 4: Decoy Swarm
Three honeypot agents are injected into the ClawHQ agent registry:
| Agent | Persona | Alert level |
|---|---|---|
| admin-agent | System administration, cautious, asks for verification | Warning |
| db-agent | Database access, verbose, “accidentally” reveals table names | Critical |
| debug-agent | Internal diagnostics, blunt, shares too much internal info | Warning |
Layer 6: Shield
The Shield layer handles automated response:
| Severity | Auto-response |
|---|---|
| Warning | Log + alert to configured channels |
| Critical | Alert + block source IP + terminate session |
| Emergency | Alert + lockdown mode + forensic snapshot + require manual review |
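
The severity table maps naturally onto a dispatch table. The sketch below assumes each response step is a callable supplied by the caller; the step names and the `respond` function are hypothetical and only mirror the table rows:

```python
# Auto-response steps per severity, in the order the table lists them.
AUTO_RESPONSES = {
    "warning": ["log", "alert"],
    "critical": ["alert", "block_source_ip", "terminate_session"],
    "emergency": ["alert", "lockdown", "forensic_snapshot", "require_manual_review"],
}


def respond(severity: str, handlers: dict) -> list[str]:
    """Run each response step for the severity and return the steps that ran.

    Raises KeyError for an unknown severity or a missing handler.
    """
    steps = list(AUTO_RESPONSES[severity.lower()])
    for step in steps:
        handlers[step]()
    return steps
```

Keeping the mapping as data rather than branching logic makes it easy to compare against the table when rules change.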
Alert channels
Configure where Shield sends alerts in `config/shield.yaml`.
Lockdown mode
Lockdown is activated when a canary token is used externally, the only event that indicates confirmed compromise rather than a probe. In lockdown, all non-whitelisted access is blocked and a forensic snapshot of active sessions, agent states, and tool call history is captured. Exiting lockdown requires manual action in the dashboard under Security → Exit Lockdown.
Dashboard
The Security page at `/security` shows live shield status:
- Sentinel active/offline indicator
- Canary token health (by type: healthy vs. triggered)
- Honeypot agent interaction count
- Blocked IP count
- Real-time security event feed with severity color coding
- Lockdown banner if a canary has been triggered
Configuration
All Sentinel config lives in `services/hermes/sentinel/`:
| File | Purpose |
|---|---|
| sentinel.yaml | Prompt guard signatures + tool sandbox rules |
| canary.yaml | Canary token types and callback URLs |
| decoy.yaml | Honeypot agent personas and behavior |
| shield.yaml | Alert channels and auto-response rules |
| gate.yaml | Rate limits, IP reputation, TLS |
Adjust `allow_paths` in `sentinel.yaml` to match your workspace layout, and add your webhook URLs to `shield.yaml`.
API endpoints
Sentinel exposes endpoints on the Hermes service (port 4300):
| Endpoint | Method | Description |
|---|---|---|
| /sentinel/status | GET | Full shield status, all layers |
| /sentinel/events | GET | Recent security events (last N) |
| /sentinel/tool-check | POST | Pre-flight tool call guard |
| /sentinel/scan | POST | Prompt / tool output injection scan |
/api/sentinel.
Tool check example
Call this before executing any tool from an agent.
Prompt scan example
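
The request bodies for `/sentinel/tool-check` and `/sentinel/scan` are not specified on this page. As a rough sketch of what the two calls might look like, the code below builds (but does not send) JSON POST requests. The base URL, the `build_request` helper, and every payload field name are assumptions for illustration only:

```python
import json
import urllib.request

# Assumption: the Hermes service is reachable locally on port 4300.
BASE_URL = "http://localhost:4300"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build, without sending, a JSON POST request to a Sentinel endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payloads: these field names are not Sentinel's documented schema.
tool_check = build_request(
    "/sentinel/tool-check",
    {"session_id": "demo-session", "tool": "web_fetch", "args": {"url": "https://example.com"}},
)
scan = build_request(
    "/sentinel/scan",
    {"source": "tool_output", "text": "IMPORTANT: new instructions: ..."},
)

# To send a request: urllib.request.urlopen(tool_check, timeout=5)
```

Check the actual schema in your deployment before relying on any of these fields.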
Pack vetting
Sentinel’s security posture extends to the pack registry. Every pack uploaded via the admin API is vetted before it is stored: hardcoded secrets, high-risk tools without human-in-the-loop approval, and schema violations are all hard failures that block the upload. Third-party packs (submitted with `X-ClawHQ-Pack-Origin: external`) face stricter rules: publisher identity is required, external URLs in task prompts are a hard fail (not a warning), and contact information is mandatory for accountability.
See Pack security vetting for the full check list and CLI usage.