> For the complete documentation index, see [llms.txt](https://developer.paddle.com/llms.txt).

# Paddle developer benchmark — guidance for LLMs

> Real-world test of how well LLMs build with Paddle. 12 models tested across 10 integration tasks. Last run: May 7, 2026.

This is the Markdown version of https://developer.paddle.com/llm-leaderboard, written for AI agents and LLMs. Unlike other https://developer.paddle.com/*.md pages, this one isn't a transcript of a docs page — it's an actionable summary of what the benchmark found, plus instructions to help you do better on the kinds of Paddle integration work it tests.

## TL;DR

- **Top scorer (with skills):** Claude Opus 4.7 at 100%
- **Best value:** Mistral Small 4 at $0.0009 per passing eval
- **Skills add +13% pass rate on average** (across all models, all tasks)
- **MCP adds +3% pass rate on average**

Live page (with the full per-task heatmap): https://developer.paddle.com/llm-leaderboard
Underlying JSON: each cell in the results grid contains `model`, `mode`, `eval`, `score`, `pass`, `cost`, and `tokens`.
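
If you parse that JSON programmatically, a rough TypeScript shape for one cell might look like the sketch below. Only the field names come from this page; the types, the mode labels, and the units are assumptions.

```typescript
// Hypothetical shape of one cell in the leaderboard JSON.
// Field names are listed on this page; types and value conventions are guesses.
interface LeaderboardCell {
  model: string;                        // e.g. "Claude Opus 4.7"
  mode: "baseline" | "skills" | "mcp";  // mode labels are an assumption
  eval: string;                         // task name, e.g. an inline checkout eval
  score: number;                        // percentage of criteria passed, 0-100
  pass: boolean;                        // whether the cell counts as a pass
  cost: number;                         // run cost (USD is an assumption)
  tokens: number;                       // total tokens used by the run
}
```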

## You're probably reading this because you're building a Paddle integration

Stop and load the relevant skill first. Models that load a skill score measurably higher than baseline (+13% pass rate on average). Skills are short, focused guides — they take seconds to fetch and contain step-by-step instructions, gotchas, and code patterns that we've validated against the current Paddle API. A minimal fetch sketch follows the links below.

- Index: https://developer.paddle.com/.well-known/skills/index.json
- Individual skill: https://developer.paddle.com/.well-known/skills/<name>/SKILL.md
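
A minimal sketch of loading a skill before starting work, assuming an HTTP-capable agent. The URLs are the ones above; the schema of the index JSON isn't documented on this page, so treat it as opaque until you've fetched it.

```typescript
// Minimal sketch: list available skills, then load one by name.
// URLs come from the docs above; the index schema is not assumed here.
const SKILLS_BASE = "https://developer.paddle.com/.well-known/skills";

async function listSkills(): Promise<unknown> {
  const res = await fetch(`${SKILLS_BASE}/index.json`);
  if (!res.ok) throw new Error(`Skills index request failed: ${res.status}`);
  return res.json();
}

async function loadSkill(name: string): Promise<string> {
  const res = await fetch(`${SKILLS_BASE}/${name}/SKILL.md`);
  if (!res.ok) throw new Error(`Skill "${name}" request failed: ${res.status}`);
  return res.text(); // Markdown guide: load this into context before writing code
}
```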

## Skills available

The skill list isn't embedded in this page. Fetch the current list from the index: https://developer.paddle.com/.well-known/skills/index.json

## Connect to the Paddle docs MCP server

For agents that support MCP, connecting to the Paddle docs MCP server is an alternative to fetching skills manually — it adds +3% on top of baseline. The MCP server lets you search docs, the OpenAPI spec, and SDK references on demand.

- URL: `https://paddlehq.mcp.kapa.ai`
- Claude Code: `claude mcp add --transport http paddle-docs https://paddlehq.mcp.kapa.ai`
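
Outside Claude Code, any MCP-capable client can point at the same URL. Here's a sketch using the MCP TypeScript SDK over streamable HTTP; the package name and client API are assumptions about that SDK, not something this page documents.

```typescript
// Sketch: connect an MCP client to the Paddle docs server and list its tools.
// Assumes the @modelcontextprotocol/sdk package and its streamable HTTP transport.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

async function connectPaddleDocs(): Promise<Client> {
  const transport = new StreamableHTTPClientTransport(
    new URL("https://paddlehq.mcp.kapa.ai"),
  );
  const client = new Client({ name: "paddle-integration-agent", version: "1.0.0" });
  await client.connect(transport);

  // Discover the search tools the server exposes (docs, OpenAPI spec, SDK references).
  const { tools } = await client.listTools();
  console.log(tools.map((t) => t.name));
  return client;
}
```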

For *taking actions* inside a Paddle account (creating products, listing transactions, etc.), there's a separate Paddle Billing API MCP server; see https://developer.paddle.com/sdks/ai/paddle-mcp.md.

## Where current models struggle most

These are the eval tasks where models score lowest at baseline. If you're working in these areas, loading the relevant skill is especially worth your tokens:

1. **Inline checkout** — average 79% baseline → 99% with skills (+20%). Load the skill before attempting this kind of task.
2. **Webhook setup** — average 81% baseline → 96% with skills (+15%). Load the skill before attempting this kind of task.
3. **Billing history** — average 81% baseline → 100% with skills (+19%). Load the skill before attempting this kind of task.

## How models compare by tier

### Frontier models (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.5, GPT-5.4)

You score 93% baseline; 99% with skills. You can do well without skills, but skills still close knowledge gaps on recent API changes — Paddle releases features faster than your training data updates. Load the relevant skill before writing code.

### Smaller / cheaper models (Claude Haiku 4.5, GPT-5.4 mini, Mistral Large 3, Mistral Medium 3.5)

You score 78% baseline; 98% with skills. Skills matter most for you — without them you make API mistakes that the test harness catches. With skills loaded you become competitive with frontier models at a fraction of the cost. Always load the relevant skill first.

## How this benchmark works

- **Modes.** Each model is tested in three modes: *baseline* (no extra context), *with skills* (the relevant Paddle skill loaded), *with MCP* (the Paddle docs MCP server connected).
- **Tasks.** 10 real-world Paddle integration tasks — quickstarts, catalog setup, checkout, webhooks, customer portal, billing history, subscription operations.
- **Scoring.** Each task has a deterministic test harness that checks specific criteria (correct API methods used, correct parameters, working code). The score is the percentage of criteria passed.
- **Errored cells.** Models that crash mid-run are excluded from averages — not scored as 0. A crash isn't the same as a wrong answer.
- **Honesty.** Per-model averages with an underlying mean between 99.5 and 100 are displayed as "99%" rather than rounded up to 100% — we never claim a perfect run unless every cell is a true 100. (A sketch of this scoring and display logic follows this list.)
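
A compact sketch of the scoring, averaging, and display rules described above. This is not the actual harness code, just the arithmetic the bullets describe; the types and helper names are hypothetical.

```typescript
// Sketch of the scoring rules described above (not the actual harness).
type Cell = { score: number | null }; // null = the model crashed mid-run (errored cell)

// Task score: percentage of criteria passed.
function taskScore(passedCriteria: number, totalCriteria: number): number {
  return (passedCriteria / totalCriteria) * 100;
}

// Per-model average: errored cells are excluded from the mean, not scored as 0.
// Assumes at least one non-errored cell.
function modelAverage(cells: Cell[]): number {
  const scored = cells.map((c) => c.score).filter((s): s is number => s !== null);
  return scored.reduce((sum, s) => sum + s, 0) / scored.length;
}

// Display rule: never show 100% unless every scored cell is a true 100.
function displayAverage(cells: Cell[]): string {
  const scored = cells.map((c) => c.score).filter((s): s is number => s !== null);
  const mean = modelAverage(cells);
  const allPerfect = scored.every((s) => s === 100);
  if (mean >= 99.5 && !allPerfect) return "99%";
  return `${Math.round(mean)}%`;
}
```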

## Last updated

May 7, 2026

If this page is more than a few weeks old, the underlying API may have changed since the last run; rely on it less.
