[{"content":" The pitch in one sentence: endpoint risk and threat hunting with Fleet just got a lot easier with the MCP. Ask a question in English. Get a real osquery scan across every host you own. See the SQL. See the assumptions. Decide what to do next.\nWhat it is A Model Context Protocol server for Fleet — exposes Fleet\u0026rsquo;s API as typed tools any AI agent can call Where it runs Anywhere with stdio or SSE — Claude Desktop, Claude Code, Cursor, Slack bots, custom agents What it gives you Live osquery, policy compliance, CVE impact, fleet inventory — spoken to in plain English What it doesn\u0026rsquo;t do Hide its work, run destructive ops on its own authority, or pretend to be a vulnerability scanner Repo github.com/karmine05/fleet-mcp The thirty-second pitch BEFORE REST API + jq $ curl -H \"Authorization: Bearer $FLEET\" \\ \"$URL/api/v1/fleet/queries\" | \\ jq '.queries[] | select(.platform == \"linux\")' # pick a query, get its id... $ curl -X POST -d @body.json \\ \"$URL/api/v1/fleet/queries/run\" # poll for results, parse JSON, # cross-reference host IDs to # labels, build the report yourself # 15 minutes later: answer AFTER Plain English you how many linux hosts haven't rebooted in 30 days? fleet-mcp scanning 7 linux hosts across 3 teams... 2 hosts: uptime \u0026gt; 500d, both servers. 3 hosts \u0026lt; 30d. 2 offline. SQL used: SELECT * FROM uptime; (shown ↓) ✓ 30 seconds. SQL visible. Receipts attached. Same osquery. Same Fleet API. Different surface. That\u0026rsquo;s the entire idea. Same Fleet. Same osquery. Same authoritative data. The interface changed from plumbing to language, and the time-to-answer collapsed by an order of magnitude. The osquery, the Fleet RBAC, the policies — none of that goes away. The 15 minutes of curl-jq-pagination glue does.\nWhy MCP exists Fleet already had an excellent REST API. osquery already had a beautiful SQL surface. So why build another thing in front of them?\nBecause the gap that actually costs you time isn\u0026rsquo;t between the question and the data. The data is right there. The gap is between the question and the right query against the right hosts presented in a form a human can act on in five minutes.\nA question \"are we exposed?\" ↑ the gap The right query SQL · targets · schema Live results per host, per team A decision action or no-op The bridge that used to be hand-built API glue. That\u0026rsquo;s what fleet-mcp is. The reason that gap is expensive is that crossing it well requires knowing:\nWhich osquery tables exist on which platforms (the chrome_extensions table behaves differently on macOS vs Linux; kernel_modules only exists on Linux). Which Fleet labels and teams a question should be scoped to. How to validate the target set before firing a fleet-wide query that gets rate-limited or returns garbage. How to format results so the conclusion is obvious, not buried in twelve columns of host JSON. A human security engineer who\u0026rsquo;s been doing this for years can do all of that in their head. Anyone newer to the platform — or any AI agent without context — can\u0026rsquo;t. fleet-mcp encodes that knowledge as typed tools, so the agent doing the work has the same situational awareness an experienced operator would.\nWhat fleet-mcp actually is A small Go server. Two transports (stdio and SSE). 
One job: turn Fleet\u0026rsquo;s REST surface into a catalog of typed tools that obey the Model Context Protocol, so any MCP-compatible AI client — Claude Desktop, Claude Code, Cursor, or a custom Slack bot — can call them natively without re-implementing Fleet\u0026rsquo;s API for the nth time.\nAI CLIENT Claude Desktop Cursor / Claude Code Slack bot Custom agent whichever surface your team lives in MCP fleet-mcp Tool catalog get_endpoints · get_host get_policies · get_labels get_vulnerability_impact prepare_live_query run_live_query get_osquery_schema get_vetted_queries get_aggregate_platforms … stdio · SSE Go binary, MIT REST FLEET + OSQUERY Fleet API RBAC · audit · scheduling osquery agents on every enrolled host Linux workstation macOS workstation Windows workstation IT servers long-uptime cohort Testing \u0026amp; QA VMs, lab hosts What the agent gets from the tool catalog isn\u0026rsquo;t access to a generic HTTP client — it\u0026rsquo;s a set of purpose-built primitives with names that map to questions an operator would ask. get_vulnerability_impact(cve_id). get_policy_compliance(policy_id). prepare_live_query → run_live_query (the prepare step exists specifically to validate target sets and schema before a destructive-looking SQL hits production).\nThe full inventory at the time of writing:\nTool What it does get_endpoints List enrolled hosts get_host Full host detail — labels, team, platform get_queries List saved Fleet queries get_policies List policies with pass/fail counts get_labels List labels get_aggregate_platforms Host count broken down by OS get_total_system_count Active enrolled count get_policy_compliance Compliance stats for a policy get_vulnerability_impact Systems impacted by a CVE prepare_live_query Validate targets + fetch osquery schema run_live_query Execute live osquery SQL create_saved_query Persist a new query get_osquery_schema Schema for a given platform get_vetted_queries CIS-8.1 compliance query library Two patterns to notice. First, the prepare → run split for live queries is not bureaucracy — it\u0026rsquo;s the safety rail that keeps an agent from firing a malformed SQL against 10,000 hosts because it hallucinated a table name. Second, get_vetted_queries ships a curated library so the agent has good defaults instead of inventing osquery from first principles every time.\nThree things this changes about endpoint risk and threat hunting The abstractions above only matter if they translate into work you couldn\u0026rsquo;t easily do before. Three real examples — sanitized — from running this in production.\n1. Pre-CVE response, in minutes Public exploit drops. No CVE assigned. Vendor advisories not out yet. Your vulnerability scanner returns empty because there\u0026rsquo;s nothing to match.\nDrop the intel blurb into Slack. Tag the bot. The bot translates the artifacts in the writeup (kernel modules, sockets, sysctls, distro families) into an osquery scan, runs prepare_live_query to validate targets, then run_live_query against every Linux host across every team — and returns a per-host artifact report with named risks.\noperator 12:27 PM @Fleet — public Linux PrivEsc PoC just dropped, no CVE yet. Affects Ubuntu / RHEL / Fedora. Two variants chain page-cache writes. Run a targeted scan across all teams, surface the at-risk hosts and the kernel module state. Artifacts only. Fleet 12:27 PM · used 3 tools prepare_live_query get_osquery_schema run_live_query Scope: 7 Linux hosts across 3 teams. 3 online, 4 offline (queued for re-run). 
Per-host artifacts (online) host-a · workstation · ubuntu 24.04 · kernel 6.18.7 modules of interest: xfrm_user ✓ xfrm_algo ✓ rxrpc not loaded risk: distro in scope; absence from lsmod ≠ mitigation (auto-load on socket open)\nhost-b · IT servers · ubuntu 24.04 · kernel 6.8.0 · uptime 503d risk: vulnerable kernel range, hasn\u0026rsquo;t rebooted in ~1.4y — schedule maintenance\nhost-c · IT servers · fedora 41 · kernel 6.11.4 · uptime 504d risk: explicitly named distro, same reboot concern\nNext actions (say the word and I\u0026rsquo;ll draft a Fleet policy PR): 1. Policy: fail any host with uptime \u0026gt; 30d 2. kernel.unprivileged_userns_clone=0 3. modprobe blacklist rxrpc / xfrm on workstations 4. re-scan 4 offline hosts on return The artifacts that matter — kernel version, loaded modules, uptime, distro family — are the artifacts the agent surfaces. The host names are placeholders. The risks are named. The next actions are concrete. No CVE was harmed in the making of this answer.\n2. CVE blast radius, scoped to a team A different shape: a CVE does exist (or four — Chrome zero-days are like that), and the right question is \u0026ldquo;how many of my systems are exposed and which team are they on, because the answer determines who I message.\u0026rdquo;\nThe classic version of this is a JIRA ticket, a curl loop, a spreadsheet, and 90 minutes. With the MCP it\u0026rsquo;s three sentences:\nhow many systems have the latest Chrome vulnerabilities — and how many on the Workstations team? Fleet tools: web_search → get_vulnerability_impact ×4 → get_endpoints(fleet=Workstations) → run_live_query Latest Chrome zero-days across the fleet (157 hosts total): 12 CVE-XXXX-A (WebGPU UAF) · CISA KEV deadline today\n10 on the Workstations team — running Chrome \u0026lt; 146.0.7680.178\nPlus 1 host with a stale Chrome-for-Testing build; 1 host on Chrome Canary. 54 hosts offline at scan — actual impacted count may be higher. Re-running on roll-in.\nSQL used: SELECT name, bundle_short_version FROM apps WHERE name LIKE \u0026#39;%Chrome%\u0026#39;; Three properties that matter here:\nIt scopes to a team. That\u0026rsquo;s a get_endpoints(fleet=Workstations) call under the hood, not a SQL filter the operator had to write. It chains four CVE lookups in one breath. Each get_vulnerability_impact is cheap, so the agent runs them in parallel and merges. A human doing this by hand would short-circuit and only check one. It surfaces the offline cohort honestly. \u0026ldquo;54 hosts offline at scan\u0026rdquo; is a real caveat, not a footnote you have to hunt for. The answer is bounded, and the bound is shown. 3. Knowing what not to do The most underrated property of a tool catalog is what\u0026rsquo;s not in it. The MCP doesn\u0026rsquo;t expose read_keychain_secret. It can\u0026rsquo;t. macOS keychain values are encrypted at rest, and Fleet can read metadata via osquery but not secrets.\nWhen asked \u0026ldquo;what\u0026rsquo;s in my keychain?\u0026rdquo; the right answer is the one the agent actually gives:\nwhat's in the keychain on user-a's MacBook? Fleet I can't access keychain values. Secrets are encrypted by macOS; even with admin osquery, passwords and private keys aren't readable. I can show metadata — labels, types, paths, creation dates — if that's useful. Want me to run that scoped to user-a's host? This is the boring, correct behavior, and it\u0026rsquo;s the one you want. An MCP server that pretended to do more than its underlying API allows would be worse than no MCP server at all. 
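For the curious, the metadata-only follow-up is an ordinary osquery read. A minimal sketch, assuming osquery\u0026rsquo;s macOS keychain_items table and hedging on the exact column set (check the schema for your osquery version):\nSELECT label, type, path, created, modified FROM keychain_items LIMIT 20; Labels, types, paths, and timestamps come back; secret material never does, because the table simply doesn\u0026rsquo;t expose it.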
The discipline is in the tool boundary, not in the prompt.\nWhat this is not A manifesto without a list of what it isn\u0026rsquo;t is just marketing.\nfleet-mcp is not a vulnerability scanner. It\u0026rsquo;s a translation layer for endpoint questions. The authoritative data still lives in osquery and Fleet. When the CVE pipeline has a row, the MCP can pull it via get_vulnerability_impact. When the pipeline doesn\u0026rsquo;t have a row yet — Dirty Frag, the Mini Shai-Hulud worm, the npm supply-chain compromise of the week — the MCP runs the artifact query the operator described and tells you what the hosts actually look like. Catalog tools answer \u0026ldquo;what CVEs apply?\u0026rdquo; Artifact tools answer \u0026ldquo;what do these hosts actually look like right now?\u0026rdquo; The second one is what threat hunting needs.\nfleet-mcp is not an autonomous incident responder. The architecture is deliberate: the agent can propose a Fleet policy, a script, a query — but the human stays in the loop for anything that mutates state. run_live_query runs read-only osquery. There is no delete_host tool. There is no run_arbitrary_shell. If you want to wire the same MCP into a workflow that does run scripts, that\u0026rsquo;s downstream — and you should keep the approval gate.\nfleet-mcp doesn\u0026rsquo;t hide its SQL. Every example above ships with the underlying osquery shown. This is non-negotiable. If you can\u0026rsquo;t review the query, you can\u0026rsquo;t trust the answer, and the moment trust breaks the tool stops being useful. The transparency isn\u0026rsquo;t decorative — it\u0026rsquo;s the contract.\nfleet-mcp is not a substitute for knowing your stack. The agent will happily run a query that asks kernel_modules to do work on a macOS host, and Fleet will return nothing, and the operator has to know enough to recognize that. The tools encode structure; they don\u0026rsquo;t replace literacy.\nHow to try it Half a page. From a fresh clone:\ngit clone https://github.com/karmine05/fleet-mcp.git cd fleet-mcp cp .env.example .env # edit .env with your Fleet base URL + API token go build -o fleet-mcp . ./fleet-mcp # SSE on :8080/sse for Cursor / Claude Code # or ./fleet-mcp -transport stdio # for Claude Desktop For Claude Desktop, drop this into claude_desktop_config.json:\n{ \u0026#34;mcpServers\u0026#34;: { \u0026#34;fleet\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;/path/to/fleet-mcp\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-transport\u0026#34;, \u0026#34;stdio\u0026#34;], \u0026#34;env\u0026#34;: { \u0026#34;FLEET_BASE_URL\u0026#34;: \u0026#34;https://your-fleet.example.com\u0026#34;, \u0026#34;FLEET_API_KEY\u0026#34;: \u0026#34;YOUR_FLEET_API_KEY\u0026#34; } } } } Restart Claude. The Fleet tools show up in context. Ask it something hard. Watch the SQL.\nThe framing that holds up Two questions sit at the heart of every endpoint security workflow:\nWhich hosts are exposed right now? What did we miss?\nFor a long time both were answered the same way: ship a vulnerability scanner, hope the catalog is current, page through a dashboard, write a spreadsheet. The catalog is never quite current and the spreadsheet is always slightly stale. 
The answers were technically correct and operationally inert.\nThe other path — and the one fleet-mcp commits to — is to keep the authoritative data (osquery, Fleet, RBAC) exactly where it is, expose it as a typed tool surface, and let the language model be the thing that translates a tired security engineer\u0026rsquo;s 11pm question into the right scan against the right hosts presented in the right form.\nThe data was already there. The plumbing is what changed. Endpoint risk and threat hunting with Fleet just got a lot easier with the MCP.\nLinks Repo: github.com/karmine05/fleet-mcp Model Context Protocol: modelcontextprotocol.io Fleet: fleetdm.com Demo (1-hour walkthrough): youtube.com/watch?v=8K77litllPk License: MIT.\n","permalink":"https://karmine05.github.io/dirtyfrag-blog/posts/fleet-mcp-manifesto/","summary":"Endpoint risk and threat hunting with Fleet just got a lot easier with the MCP. fleet-mcp is a Model Context Protocol server that turns Fleet\u0026rsquo;s API into a typed tool catalog any AI agent can call. This is the manifesto — why it exists, what it does, what it deliberately won\u0026rsquo;t do, and what it gives you that a REST API never could.","title":"Endpoint Risk and Threat Hunting, in Plain English: A Fleet MCP Manifesto"},{"content":" CVE-2026-45321 / GHSA-g7cv-rxg3-hmpx. Active since May 11, 2026. 42 TanStack packages (84 versions) directly compromised, plus the broader Mini Shai-Hulud campaign affecting 175 packages across 17 namespaces. Daemonizes silently on npm install. Harvests GitHub Actions OIDC, AWS, Vault, and Kubernetes credentials. Propagates autonomously. If you run JavaScript anywhere near a developer machine: stop and read this.\nCVE CVE-2026-45321 / GHSA-g7cv-rxg3-hmpx Campaign Mini Shai-Hulud Status Active (May 11, 2026) TanStack scope 42 packages · 84 versions · 12M+ weekly downloads Broader campaign 175 packages · 406 versions · 17 namespaces Primary targets Developer machines, CI/CD runners, cloud workloads What happened An orphaned commit in the TanStack/router repository was used to hijack the repository\u0026rsquo;s CI workflow OIDC token. With that token, the attacker bypassed 2FA and npm publishing protections, pushing a malicious payload — router_init.js (2.3 MB) — into affected package versions as a post-install hook.\nThe infection chain once a developer runs npm install:\nrouter_init.js executes via postInstall hook Daemonizes: detaches from the terminal, nothing looks wrong, install appears to complete normally Harvests credentials in order: GitHub Actions OIDC tokens, AWS via IMDSv2/Secrets Manager/SSM across all regions, HashiCorp Vault, Kubernetes service account tokens Propagates: uses the stolen OIDC token to republish a new malicious version to npm under the legitimate maintainer identity Persists: writes hooks to .claude/ and .vscode/ directories Exfiltrates: over Session\u0026rsquo;s decentralized P2P network (filev2.getsession[.]org) Commits to compromised repositories via GitHub GraphQL API, spoofing claude@users.noreply.github.com as the author The campaign name — Mini Shai-Hulud — is a Dune reference. A small sandworm that eats everything in its path.\nWhy this is different from most supply chain attacks It daemonizes. Most malicious post-install scripts do their damage synchronously and leave a trace in terminal output. This one forks, detaches, and returns control to the terminal immediately. The install looks clean.\nSigstore attestations are worthless here. 
The malware generates valid provenance attestations because it publishes through a legitimate maintainer\u0026rsquo;s OIDC token. The package has a valid signature. Verifying signatures tells you nothing.\nIt propagates via OIDC, not stolen passwords. If any CI runner in a compromised developer\u0026rsquo;s environment runs with a GitHub Actions OIDC token that has npm publish permissions, the worm can republish under that identity. Two-factor authentication doesn\u0026rsquo;t protect against this — the token is already issued.\nThe C2 is P2P. Exfiltration goes over Session\u0026rsquo;s decentralized network. There\u0026rsquo;s no single IP to block, no traditional C2 domain that resolves to a known bad actor. DNS blocking filev2.getsession[.]org is necessary but note that Session is a real messaging application — you may have legitimate traffic to this domain.\nIt targets .claude/ directories. The worm specifically writes persistence hooks to ~/.claude/ — the configuration directory used by Claude Code. It also targets .vscode/. This is a deliberate choice: developer tooling is where credentials live, and developer machines have access to production environments.\nCampaign markers have been reported by Socket.dev but are not yet in the official GHSA. Malicious package.json files reportedly contain a unique PBKDF2 salt (svksjrhjkcejg) and the string IfYouRevokeThisTokenItWillWipeTheComputerOfTheOwner. Useful as supplementary indicators for deep detection; treat as unconfirmed pending official advisory update.\nDetection approach with Fleet We ran this across 30 hosts in two passes:\nFleet live queries (osquery SQL) — fast fleet-wide sweep for known-bad package versions, persistence files, active processes, and C2 connections. Results in seconds. Deep scan scripts deployed via Fleet run-script — comprehensive per-host filesystem scan that the SQL queries cannot cover. Results in ~30 seconds per host. Both layers are necessary. Here\u0026rsquo;s why.\nCritical caveat: what Fleet\u0026rsquo;s npm table misses Fleet\u0026rsquo;s npm_packages osquery table queries globally installed packages only — the paths osquery\u0026rsquo;s walker discovers by default: ~/.npm-global, /usr/local/lib/node_modules, /opt/homebrew/lib/node_modules. It does not scan project-local node_modules/ directories.\nIn practice, almost all developer npm installs are local — npm install without -g drops packages into a node_modules/ folder inside the project directory. A developer with 20 active projects could have a compromised @tanstack/react-router@1.169.8 installed in ~/code/my-app/node_modules/ and Fleet\u0026rsquo;s npm_packages query returns zero rows for it.\nThe SQL queries are the right first pass: fast, fleet-wide, catches global/NVM installs and anyone who ran npm install -g with a bad version. But to find per-project exposure, you need the scripts.\nThe deep scan scripts fix this by running find ... -maxdepth 10 -type d -name node_modules recursively across every user\u0026rsquo;s home directory, then checking each discovered node_modules/ tree against the full compromised version list. This is the only reliable way to catch local installs.\nShort version: SQL queries = global exposure check. Scripts = the complete picture.\nThe detection tooling Fleet SQL queries Three SQL files for Linux, macOS, and Windows. 
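To give a feel for the shape, a minimal sketch of the global-exposure leg for a single package (abridged version list; the shipped files carry the complete 406-version set, and column names should be verified against your osquery schema):\nSELECT name, version, path FROM npm_packages WHERE name = \u0026#39;@tanstack/react-router\u0026#39; AND version IN (\u0026#39;1.169.5\u0026#39;, \u0026#39;1.169.8\u0026#39;); As the caveat above explains, this sees the global npm tree only.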
Each query is a UNION ALL across multiple indicator classes and returns a tagged result set:\n-- Each result row is tagged by severity class: -- EXPOSURE_global_pkg_version — compromised version in global npm tree -- CRITICAL_persistence_* — systemd service / LaunchAgent / payload file present -- CRITICAL_active_payload_* — malware process currently running -- HIGH_persistence_editor_hook — .claude/ or .vscode/ hooks present The coverage:\n42 TanStack packages (84 versions) directly compromised 133 additional packages (322 versions) from the broader Mini Shai-Hulud campaign — these share the same payload SHA256s, C2 domains, and campaign markers, so the same IoC set detects them @tanstack/setup is the forged package — never legitimate at any version, flag any installation immediately Payload file paths across all common install locations including NVM Active process check (router_init.js, router_runtime.js, tanstack_runner.js in cmdline) Systemd user service (gh-token-monitor.service) and macOS LaunchAgent (com.user.gh-token-monitor.plist) Editor hooks in .claude/ and .vscode/ Deep scan scripts (7-phase) mini_shai_hulud_scan_fleet_deep.sh (Linux/macOS) and mini_shai_hulud_scan_windows.ps1 (Windows).\nEach script runs through seven phases in sequence, with a 300-second timeout:\nPhase What it checks 1 — System persistence systemd user services, LaunchAgents 2 — Payload files router_init.js and tanstack_runner.js by SHA256, across all home directories and global npm paths 3 — Editor hooks .claude/router_runtime.js, .claude/setup.mjs, .vscode/setup.mjs 4 — Local npm packages All node_modules/ directories recursively — this is the layer the SQL queries miss 5 — Git dead-drop commits claude@users.noreply.github.com spoofed author, voicproducoes (reported compromised maintainer account per Socket.dev, not in GHSA), specific malicious commit hash 6 — Campaign markers PBKDF2 salt svksjrhjkcejg, campaign string in package.json, malicious commit reference 7 — Workflow injection .github/workflows/*.yml scanning for toJSON(secrets), C2 domains, __DAEMONIZED, router_init Exit codes are explicit:\nCode Verdict Meaning 0 🟢 CLEAN No indicators found 1 🟡 EXPOSED Compromised package version installed, no execution evidence 2 🟠 HIGH Editor-hook persistence found 3 🔴 CRITICAL Payload file or system persistence — likely compromised What the scans found Linux/macOS — 24 hosts (100% responded):\nAll 24 Linux hosts returned exit 0 — CLEAN. Sample output from one host:\n[*] Phase 6 — campaign markers in package.json Scanning package.json files in /root... [OK] no campaign markers found [*] Phase 7 — injected GitHub workflows [OK] no malicious workflows found ═══════════════════════════════════════════════════════════════ SUMMARY (host=automater duration=31s) ═══════════════════════════════════════════════════════════════ CRITICAL findings: 0 HIGH findings: 0 EXPOSURE findings: 0 VERDICT: 🟢 CLEAN — no indicators found Windows — 5 hosts targeted (80% responded, 1 pending):\n4 Windows hosts ran and returned output. DC01, DC02, WIN10-1, and WRK-AI all completed. 
One host pending at time of screenshot.\nIoC quick reference Primary indicators — check these first Indicator Value / Pattern Malware file router_init.js (SHA256: ab4fcadaec...601266c) Runner file tanstack_runner.js (SHA256: 2ec78d5...e27fc96) Forged package @tanstack/setup — any version is malicious Active process node process with router_init.js in cmdline C2 egress (confirmed) filev2.getsession[.]org C2 egress (reported) api.masscan.cloud, litter.catbox.moe — per Socket.dev, not in official GHSA Spoofed author claude@users.noreply.github.com Persistence locations Platform Path Linux ~/.config/systemd/user/gh-token-monitor.service Linux ~/.local/bin/gh-token-monitor.sh macOS ~/Library/LaunchAgents/com.user.gh-token-monitor.plist All ~/.claude/router_runtime.js, ~/.claude/setup.mjs All ~/.vscode/setup.mjs Key compromised package versions (most impactful) Package Bad versions @tanstack/react-router 1.169.5, 1.169.8 @tanstack/router-core 1.169.5, 1.169.8 @tanstack/react-start 1.167.68, 1.167.71 @tanstack/router-plugin 1.167.38, 1.167.41 @mistralai/mistralai 2.2.2, 2.2.3, 2.2.4 @opensearch-project/opensearch 3.5.3, 3.6.2, 3.7.0, 3.8.0 42 TanStack packages directly compromised; the SQL queries also cover the 175-package broader Mini Shai-Hulud campaign since IoCs are shared.\nImmediate actions Priority 1 — do this now (\u0026lt; 15 min) Block DNS egress to filev2.getsession[.]org and api.masscan.cloud at your DNS resolver and firewall. Do this before anything else — cuts exfiltration. Deploy the SQL queries as Fleet live queries against all endpoints. Any host returning CRITICAL (exit code 3, or active process): isolate immediately, then rotate credentials in this order — npm tokens first, GitHub PATs, then AWS/Vault/K8s. Priority 2 — within 1 hour Deploy the deep scan scripts via Fleet run-script. The SQL queries check global packages; the scripts check every node_modules/ directory on disk. You need both. Audit git history in any repository for commits from claude@users.noreply.github.com (confirmed spoofed author) or voicproducoes (reported compromised account, per Socket.dev). Check CI/CD workflow files for toJSON(secrets), getsession.org, or router_init. Priority 3 — within 24 hours Proactive credential rotation on any machine where an affected package was installed in the last 7 days, even if the scan is clean. The malware may have already exfiltrated and cleaned up. Audit npm publish logs for unexpected publishes from your organization\u0026rsquo;s packages. Pin GitHub Actions references to commit SHAs. Key lessons Sigstore provenance doesn\u0026rsquo;t protect you from OIDC token theft. The attacker had a legitimately-issued OIDC token; the signature was valid. Provenance attestations are meaningful for supply chain attribution, not for detecting credential compromise.\nPost-install hooks are still the attack surface. postInstall scripts execute with the same privileges as the user running npm install. On developer machines and CI runners, that\u0026rsquo;s often too much.\nDeveloper tooling directories are high-value targets. The choice to persist in .claude/ specifically shows that attackers track which tools developers use. Directory-based persistence in editor/AI tooling survives across project changes and reboots.\nScan local node_modules/, not just global. Fleet\u0026rsquo;s npm_packages table is a fast first pass but covers global installs only. 
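(One partial workaround, stated as an assumption to verify rather than a recommendation: some osquery builds accept a directory constraint on npm_packages, e.g. SELECT name, version FROM npm_packages WHERE directory = \u0026#39;/home/alice/code/my-app\u0026#39;; with a hypothetical path; but that only helps when you already know which project directory to ask about, which is exactly what you don\u0026rsquo;t know at fleet scale.)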
Any environment where developers do per-project installs requires a filesystem-level scan to get the full picture.\nRotate first, investigate second. If any affected package version was installed in the last 7 days, assume credentials were harvested and rotate them before spending time on forensics. The window between install and exfiltration is measured in seconds, not minutes.\nResources Resource Link Socket.dev blog https://socket.dev/blog/tanstack-npm-packages-compromised-mini-shai-hulud-supply-chain-attack TanStack postmortem https://tanstack.com/blog/npm-supply-chain-compromise-postmortem Mini Shai-Hulud campaign page https://socket.dev/supply-chain-attacks/mini-shai-hulud Linux/macOS Fleet SQL /code/tanstack-linux-queries.sql macOS Fleet SQL /code/tanstack-macos-queries.sql Windows Fleet SQL /code/tanstack-windows-queries.sql Linux/macOS deep scan script /code/mini_shai_hulud_scan_fleet_deep.sh Windows deep scan script /code/mini_shai_hulud_scan_windows.ps1 ","permalink":"https://karmine05.github.io/dirtyfrag-blog/posts/mini-shai-hulud-tanstack-supply-chain/","summary":"An active npm supply chain worm targeting developer credentials dropped on May 11, 2026. 42 TanStack packages (84 versions) directly compromised. The broader Mini Shai-Hulud campaign affects 175 packages across 17 namespaces. This is the detection approach we ran across 30 hosts using Fleet — and the critical caveat about what Fleet\u0026rsquo;s built-in npm table misses.","title":"Mini Shai-Hulud: Detecting a Live npm Supply Chain Worm with Fleet"},{"content":" The point of this writeup: vulnerability management isn\u0026rsquo;t CVE management. When a public exploit lands before NVD has caught up, traditional vuln scanners return empty and incident response stalls waiting for a row in a database. Fleet\u0026rsquo;s primitives — live osquery, run-script, policies — let you investigate, scope, mitigate, and verify based on the technical artifacts of the threat (loaded modules, running processes, sysctls, file paths) instead of the catalog representation of it. This is a worked example.\nThreat Dirty Frag — Linux kernel privilege escalation, public PoC, no CVE assigned Time from intel landing to scoped mitigation ~25 minutes Hosts in scope 7 Linux across 3 teams (Workstations, IT Servers, Testing \u0026amp; QA) Outcome Mitigation deployed to non-Docker hosts, alternative hardening on Docker Swarm hosts, reboot queue established for long-uptime servers Why this is a problem the catalog can\u0026rsquo;t solve Most vulnerability response is gated on the CVE pipeline:\nPoC public → CVE reserved → CVE published → vendor advisory → NVD entry → scanner signature → you find out Each arrow can be hours or weeks. During that window, scanners that key off CVE IDs and vendor advisories are blind. The exploit is real, the artifacts of vulnerability are present on hosts, but no catalog yet knows about it. Fleet doesn\u0026rsquo;t have to wait — you can query the artifacts directly.\nI\u0026rsquo;ve hit this pattern before with the Axios npm supply chain compromise (no CVE for the malicious version at first) and BlueHammer (CVE assigned, but standard correlation returned 204 because Microsoft Defender\u0026rsquo;s out-of-band update channel doesn\u0026rsquo;t surface the patched version through normal version metadata). Detection has to come from the artifacts.\nThe workflow Telegram TL;DR → Slack @Fleet bot → fleet-mcp scoping → per-host artifact report ↓ in-place mitigation viable? 
↓ ↓ yes edge case ↓ ↓ run mitigation diagnose userspace pin ↓ ↓ ↓ blast radius analysis ↓ ↓ ↓ alternative mitigation ↓ ↓ └──→ verification policy ←──┘ ↓ reboot queue tracker Each box below is something Fleet (plus a Slack bot wired to fleet-mcp) actually does. None of it requires a CVE.\nStep 1 — Intel ingestion (Telegram) A small Telegram bot subscribed to threat-intel feeds (SecurityAffairs, oss-security, GitHub Advisories, vendor PSIRTs) auto-summarizes new posts to a TL;DR. The Dirty Frag summary that triggered this response:\nWhy this routing matters. A summary in Telegram is the smallest possible nudge. It has no authority — it\u0026rsquo;s just a heads-up. The decision to escalate to a full Fleet investigation is human. The automation comes after that decision, not before. This avoids the failure mode where a noisy intel feed triggers fleet-wide queries on every CVE-7 PHP plugin issue.\nOperationally awkward properties of this particular intel:\nNo CVE → CVE correlation lookups return empty Vendor advisories not yet published → no patches to schedule Two attack variants (xfrm-ESP and RxRPC) → mitigation choice is non-obvious Page cache write → traditional integrity monitoring (file hash on disk) won\u0026rsquo;t catch it because the on-disk file is unchanged Step 2 — Scoping via Slack + fleet-mcp I drop the TL;DR into a thread and tag @Fleet:\nBehind the bot is fleet-mcp — a Model Context Protocol server exposing Fleet\u0026rsquo;s API as tools. The bot synthesizes the intel into an osquery scan covering distro family/version, kernel version, kernel module state for the implicated modules (esp4, esp6, rxrpc, af_rxrpc, xfrm_user, xfrm_algo, algif_aead), and uptime. The scan SQL:\n-- 01-scope-scan.sql SELECT os.platform, os.name AS distro_name, os.version AS distro_version, os.codename AS distro_codename, k.version AS kernel_version, k.arch, CAST(u.total_seconds / 86400 AS INTEGER) AS uptime_days, -- Per-module presence flags. Note: absence here is NOT a mitigation — -- these modules auto-load on demand when an unprivileged process -- opens the relevant socket family. COALESCE((SELECT 1 FROM kernel_modules WHERE name = \u0026#39;esp4\u0026#39;), 0) AS mod_esp4, COALESCE((SELECT 1 FROM kernel_modules WHERE name = \u0026#39;esp6\u0026#39;), 0) AS mod_esp6, COALESCE((SELECT 1 FROM kernel_modules WHERE name = \u0026#39;rxrpc\u0026#39;), 0) AS mod_rxrpc, COALESCE((SELECT 1 FROM kernel_modules WHERE name = \u0026#39;af_rxrpc\u0026#39;), 0) AS mod_af_rxrpc, COALESCE((SELECT 1 FROM kernel_modules WHERE name = \u0026#39;xfrm_user\u0026#39;), 0) AS mod_xfrm_user, COALESCE((SELECT 1 FROM kernel_modules WHERE name = \u0026#39;xfrm_algo\u0026#39;), 0) AS mod_xfrm_algo, COALESCE((SELECT 1 FROM kernel_modules WHERE name = \u0026#39;algif_aead\u0026#39;), 0) AS mod_algif_aead FROM os_version os CROSS JOIN kernel_info k CROSS JOIN uptime u; This isn\u0026rsquo;t a CVE check. It\u0026rsquo;s an artifact check. It would have worked the day the exploit dropped. Download: 01-scope-scan.sql.\nStep 3 — Findings \u0026amp; risk assessment The bot returns a structured per-host report:\nFollowed by a top-line observations block:\nMy reading of the report:\nAll 6 named-distro hosts are in the vulnerable population per the intel (Ubuntu + Fedora). No CVE has been assigned, so vulnerability-feed lookups will not flag these — manual tracking is required. 
Two long-uptime servers (503 d and 504 d uptime) are the highest-priority remediation targets — they need a reboot once a patched kernel ships, which means a maintenance window should be scheduled now, in parallel with the mitigation rollout. None of the implicated modules are currently loaded on the responding hosts, but esp4 / esp6 / rxrpc are auto-loaded on demand when an unprivileged process opens the relevant socket family — so absence from lsmod is not a mitigation. A real mitigation requires blocking module load (modprobe.d) or restricting user-namespace creation (kernel.unprivileged_userns_clone=0). One workstation is running a non-stock kernel; confirm with the owner that they\u0026rsquo;re tracking the upstream Ubuntu kernel security advisory cadence — a vendor-customized kernel may lag on patches. Workstation hosts are end-user systems where unprivileged local access is by design — these are the realistic exploitation targets. openSUSE Leap was not named in the intel; treat as lower priority pending vendor confirmation. The \u0026ldquo;absence from lsmod is not mitigation\u0026rdquo; point is the kind of detail that gets lost when responders only look at vendor advisories. A future advisory will likely say \u0026ldquo;affects kernels with CONFIG_INET_ESP=y\u0026rdquo;. Most distros ship that as a module, not built-in. So lsmod will say \u0026ldquo;no esp4 here, we\u0026rsquo;re fine\u0026rdquo; — but the moment any unprivileged process calls socket(AF_INET, SOCK_RAW, IPPROTO_ESP), the module loads and the host is exposed. The kernel build configuration is the actual exposure indicator, not the running module list. Artifact-based queries have to encode that nuance.\nStep 4 — Mitigation design The exploit requires a target module to be loaded (or loadable). The minimal mitigation has three parts:\nBlock future load attempts. Drop a file in /etc/modprobe.d/ that maps each implicated module to /bin/false. This is stronger than the blacklist directive — blacklist only stops alias-based auto-load, while install ... /bin/false blocks explicit modprobe too. Unload any in-flight copies. rmmod the modules if they\u0026rsquo;re currently resident. This will fail when something is using them; that\u0026rsquo;s a separate problem (Step 6). Drop page caches. Both attack chains write to the page cache. Flushing cached pages forces a re-read from disk on next access, which clears any in-memory file modification an attacker may have already staged. Full script: dirtyfrag-mitigation.sh. 
Key choices:\n#!/bin/bash # dirtyfrag-mitigation.sh set -u CONF_FILE=\u0026#34;/etc/modprobe.d/dirtyfrag.conf\u0026#34; MODULES=(esp4 esp6 rxrpc) EXIT=0 if [ \u0026#34;$(id -u)\u0026#34; -ne 0 ]; then echo \u0026#34;ERROR: must run as root\u0026#34; \u0026gt;\u0026amp;2 exit 1 fi # Strong blacklist — blocks alias resolution AND explicit modprobe cat \u0026gt; \u0026#34;$CONF_FILE\u0026#34; \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; install esp4 /bin/false install esp6 /bin/false install rxrpc /bin/false EOF chmod 0644 \u0026#34;$CONF_FILE\u0026#34; echo \u0026#34;WROTE: $CONF_FILE\u0026#34; for mod in \u0026#34;${MODULES[@]}\u0026#34;; do if lsmod | awk \u0026#39;{print $1}\u0026#39; | grep -qx \u0026#34;$mod\u0026#34;; then if rmmod \u0026#34;$mod\u0026#34; 2\u0026gt;/dev/null; then echo \u0026#34;UNLOADED: $mod\u0026#34; else echo \u0026#34;WARN: $mod loaded but could not be unloaded (likely in use)\u0026#34; EXIT=2 fi else echo \u0026#34;NOT-LOADED: $mod\u0026#34; fi done echo 3 \u0026gt; /proc/sys/vm/drop_caches 2\u0026gt;/dev/null \u0026amp;\u0026amp; echo \u0026#34;CACHES: dropped\u0026#34; # Exit 0 = clean; 2 = blacklist written but module still resident (reboot needed) exit \u0026#34;$EXIT\u0026#34; Exit-code semantics are deliberate so Fleet\u0026rsquo;s run-script results page tells you something useful:\nExit Meaning 0 Blacklist written, no target modules resident → fully mitigated 1 Hard failure (not root, couldn\u0026rsquo;t write conf) → host needs follow-up 2 Blacklist written, but a target module is in-use → host needs reboot to clear Exit 2 is intentionally non-zero so it surfaces in the Fleet UI as something distinct from clean success. This is a tradeoff — you get a \u0026ldquo;script execution error\u0026rdquo; badge on hosts that aren\u0026rsquo;t actually broken, but you also get an at-a-glance reboot queue.\nStep 5 — Deploy via Fleet run-script Target selection, scoped to Linux:\nBefore running the script, validate the scope query against the picked host to confirm the artifact baseline:\nTrigger the script from the host details page (Actions → Run script):\nOr via fleetctl:\nfleetctl run-script \\ --script-path ./dirtyfrag-mitigation.sh \\ --host linux-host-01 The script enters the run queue:\nAnd shows up in the host activity log:\nFor most hosts (workstations, the offline hosts once they come back online), this completes the mitigation cycle. The verification policy in Step 9 confirms it. But.\nStep 6 — The snag Running the same script against a Docker Swarm manager returned exit 2:\nWROTE: /etc/modprobe.d/dirtyfrag.conf WARN: esp4 loaded but could not be unloaded (likely in use) NOT-LOADED: esp6 NOT-LOADED: rxrpc CACHES: dropped ----- verification ----- [loaded target modules] esp4 script execution error: exit status 2 Per the script\u0026rsquo;s exit-code contract this means the conf file landed but esp4 is pinned in the running kernel. The modprobe blacklist only takes effect at next boot. If the host reboots without first identifying what\u0026rsquo;s pinning esp4, two things happen:\nThe blacklist activates and blocks esp4 from loading. Whatever was using esp4 either fails or silently degrades. This is the part vendor advisories cannot tell you. The advisory will say \u0026ldquo;blacklist these modules\u0026rdquo;. 
It cannot know that on your hosts, these modules have legitimate consumers.\nStep 7 — Diagnose the userspace pin Two queries find the holder.\nQuery A — module state (02-module-state.sql):\nSELECT name, size, used_by, status FROM kernel_modules WHERE name IN (\u0026#39;esp4\u0026#39;,\u0026#39;esp6\u0026#39;,\u0026#39;rxrpc\u0026#39;,\u0026#39;xfrm_user\u0026#39;,\u0026#39;xfrm_algo\u0026#39;); Result:\nReading: esp4 and xfrm_user are loaded with no kernel-side dependents (used_by = -) but their refcount is non-zero. That\u0026rsquo;s the userspace-pin signature — something is using the xfrm netlink interface directly.\nQuery B — userspace consumers (03-userspace-consumers.sql):\nSELECT p.pid, p.name AS process, p.path, p.cmdline FROM processes p WHERE p.name IN (\u0026#39;charon\u0026#39;,\u0026#39;pluto\u0026#39;,\u0026#39;starter\u0026#39;,\u0026#39;ipsec\u0026#39;,\u0026#39;iked\u0026#39;,\u0026#39;racoon\u0026#39;,\u0026#39;swanctl\u0026#39;, \u0026#39;dockerd\u0026#39;,\u0026#39;containerd\u0026#39;,\u0026#39;docker-proxy\u0026#39;) OR p.cmdline LIKE \u0026#39;%strongswan%\u0026#39; OR p.cmdline LIKE \u0026#39;%libreswan%\u0026#39; OR p.path LIKE \u0026#39;%/dockerd%\u0026#39; OR p.path LIKE \u0026#39;%/containerd%\u0026#39;; Result:\nNo charon, no pluto, no traditional IPsec stack. Docker is the consumer. Docker Swarm encrypted overlay networks (docker network create --opt encrypted) program xfrm state directly via netlink — no userland daemon involved. That programming pulls in xfrm_user and esp4.\nStep 8 — Blast radius \u0026amp; alternative mitigation Consequence of going forward with the blacklist on this host:\nOn reboot, esp4 cannot load. Docker Swarm encrypted overlay networking on this manager fails. Depending on Docker version, this is either silent fall-back to unencrypted (worse than expected) or hard failure to attach overlay networks. Any container relying on encrypted overlay traffic is impacted. The right move is to not deploy the modprobe blacklist on Docker Swarm hosts and instead apply a second-tier mitigation that blunts the attack without breaking Docker:\n# /etc/sysctl.d/99-dirtyfrag-userns.conf kernel.unprivileged_userns_clone = 0 This stops unprivileged processes from creating user namespaces, which is the prerequisite for the xfrm-ESP variant. The RxRPC variant is unaffected by this knob — for that, Docker hosts have to wait for a kernel patch. This is an honest tradeoff, documented as such, not a clean win.\nFull script: dirtyfrag-userns-mitigation.sh. 
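Verifying that the knob actually took effect is another ordinary osquery read; a minimal sketch against osquery\u0026rsquo;s system_controls table (on Linux the values come from /proc/sys, and an empty result on RHEL/Fedora simply means the Debian-style knob doesn\u0026rsquo;t exist there):\nSELECT name, current_value FROM system_controls WHERE name IN (\u0026#39;kernel.unprivileged_userns_clone\u0026#39;, \u0026#39;user.max_user_namespaces\u0026#39;); The script itself does more than write a single sysctl: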
Detects distro family and sets the right sysctl (kernel.unprivileged_userns_clone on Debian/Ubuntu, user.max_user_namespaces on RHEL/Fedora).\nFor the Docker host specifically:\n# Roll back the modprobe blacklist ssh docker-host \u0026#39;sudo rm -f /etc/modprobe.d/dirtyfrag.conf\u0026#39; # Apply the userns mitigation instead fleetctl run-script \\ --script-path ./dirtyfrag-userns-mitigation.sh \\ --host docker-host Step 9 — Verification policies After the mitigation lands, a file query confirms the conf file is in place:\nThree Fleet policies track the rollout:\nPolicy Pass condition dirtyfrag-blacklist-deployed /etc/modprobe.d/dirtyfrag.conf exists and is non-empty dirtyfrag-userns-hardened (Docker hosts) kernel.unprivileged_userns_clone = 0 at runtime dirtyfrag-fully-mitigated Blacklist deployed AND no target module resident The reboot queue is just the failing set of dirtyfrag-fully-mitigated minus the failing set of dirtyfrag-blacklist-deployed — hosts that have the blacklist but still have a module loaded.\nFull GitOps YAML: dirtyfrag-policies.yml.\nWhat this gives you that a CVE pipeline doesn\u0026rsquo;t Three concrete things:\nLatency. Time from intel landing to scoped fleet-wide visibility was minutes, gated only on the on-call\u0026rsquo;s decision to escalate. No waiting on NIST, no waiting on a vendor PSIRT, no waiting on a scanner vendor to ship a signature. Specificity. The investigation surfaced something a generic advisory could not: the Docker Swarm blast radius. The general \u0026ldquo;blacklist these modules\u0026rdquo; guidance from a future advisory would have caused a Docker Swarm outage on this host. Artifact-based investigation caught it before the reboot. Honest gaps. The userns mitigation doesn\u0026rsquo;t cover the RxRPC variant. The reboot queue is real and tracked. The offline hosts remain unverified until they come back online. None of this is hidden by a green \u0026ldquo;patched\u0026rdquo; badge — Fleet shows exactly which hosts are in which state. The framing that helps:\nA vulnerability scanner asks: which CVEs apply to this host? An artifact query asks: what does this host actually look like right now?\nThe first is bounded by the catalog. The second is bounded only by what osquery can see — which on Linux is most of what matters. For pre-CVE threats, only the second one works.\nReusing this pattern The shape of this response is generalizable. For any pre-CVE Linux kernel threat:\nTranslate the intel into a list of artifacts (modules, sysctls, files, processes, distro versions). Write a scope query that returns those artifacts per host. Write a mitigation script that touches only the artifacts the threat depends on. Write a verification policy that confirms the mitigation landed. Run the script. If anything pushes back (exit 2, errors), diagnose the userspace context before forcing the mitigation. Track residual state (reboot queues, offline hosts, exception cohorts) as named policies. The Slack-bot front-end is convenience, not the substance. The substance is osquery + scripts + policies. Those three primitives, applied artifact-first, cover the gap that CVE-based tooling can\u0026rsquo;t.\nCaveats The mitigation script\u0026rsquo;s drop_caches step is best-effort. It does not retroactively undo a successful exploit — if the host was already compromised before the script ran, dropping caches forces a re-read of legitimate on-disk content but does not remediate any persistence the attacker may have established outside the page cache. 
Treat as harm-reduction, not detection. The userns mitigation is partial. It only blocks the xfrm-ESP variant. The RxRPC variant works on hosts that have rxrpc.ko built (default on Ubuntu kernels) regardless of namespace policy. Docker hosts running Ubuntu remain partially exposed until a patched kernel ships. Long-uptime hosts won\u0026rsquo;t pick up future kernel patches without reboot. The mitigation script does not address this; the reboot queue policy does. Schedule maintenance windows in parallel with the mitigation rollout, not after. Offline hosts are unverified. Fleet returns no live data for offline hosts. The findings table treats them as in-scope-by-distro, not as confirmed-vulnerable or confirmed-mitigated. A second pass is required when they come online. Downloads File What it does dirtyfrag-mitigation.sh Modprobe blacklist + rmmod + drop_caches dirtyfrag-userns-mitigation.sh Sysctl-based alternative for Docker / IPsec hosts 01-scope-scan.sql Initial fleet-wide artifact scan 02-module-state.sql Post-mitigation module diagnostic 03-userspace-consumers.sql Find what\u0026rsquo;s pinning a stuck module 04-blacklist-deployed.sql Verification SQL backing the policy dirtyfrag-policies.yml Three Fleet GitOps policies for tracking All artifacts are MIT-licensed. Source for this post is on GitHub.\n","permalink":"https://karmine05.github.io/dirtyfrag-blog/posts/pre-cve-response-with-fleet/","summary":"Vulnerability management isn\u0026rsquo;t CVE management. When a public exploit lands before NVD has caught up, traditional vuln scanners return empty and incident response stalls waiting for a row in a database. This is a worked example of using Fleet\u0026rsquo;s primitives — live osquery, run-script, policies — to investigate, scope, mitigate, and verify based on the artifacts of the threat instead of its catalog representation.","title":"Pre-CVE Threat Response: A Dirty Frag Walkthrough with Fleet"},{"content":"Security operations engineer working on endpoint management at scale. These notes are about how I actually run security ops day to day — the tools, the patterns, and the things that didn\u0026rsquo;t work.\nMost of what I write touches Fleet, osquery, Linux internals, and the gap between \u0026ldquo;vendor said this fixes it\u0026rdquo; and \u0026ldquo;the host actually fixed it.\u0026rdquo;\nOpinions are my own.\n","permalink":"https://karmine05.github.io/dirtyfrag-blog/about/","summary":"about","title":"about"}]