LLM benchmark
Understand which LLM you should choose. We evaluate LLMs based on their ability to generate working Paddle integration code from simple, real-world prompts. Every model is evaluated on the same set of prompts. The score is the percentage of criteria that pass.
| # | Model | Overall | Billing history | Catalog setup | Inline checkout | Customer portal | Pricing page | Quick start | Cancel subs | Sync subs | Update subs | Webhook setup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 97% | 94 | 95 | 93 | 100 | 100 | 100 | 100 | 100 | 100 | 83 |
| 2 | GPT-5.5 | 95% | 100 | 100 | 93 | 100 | 95 | 92 | 100 | 86 | 100 | 83 |
| 3 | Gemini 3.1 Pro | 94% | 94 | 95 | 87 | 100 | 95 | 92 | 100 | 93 | 100 | 83 |
| 4 | Claude Sonnet 4.6 | 92% | 82 | 79 | 93 | 100 | 95 | 92 | 100 | 100 | 93 | 83 |
| 5 | Gemini 3 Flash | 92% | 79 | 95 | 87 | 92 | 95 | 92 | 100 | 93 | 100 | 83 |
| 6 | GPT-5.4 | 91% | 85 | 89 | 87 | 100 | 81 | 92 | 100 | 93 | 100 | 83 |
| 7 | Gemini 3.1 Flash-Lite | 89% | 82 | 100 | 87 | 75 | 90 | 92 | 100 | 100 | 80 | 83 |
| 8 | Mistral Large 3 | 83% | 76 | 95 | 60 | 67 | 76 | 100 | 77 | 93 | 100 | 83 |
| 9 | GPT-5.4 mini | 83% | 88 | 68 | 73 | 75 | 90 | 100 | 92 | 86 | 87 | 67 |
| 10 | Mistral Small 4 | 72% | 65 | 58 | 53 | 75 | 71 | 85 | 77 | 86 | 67 | 83 |
| 11 | Claude Haiku 4.5 | 72% | 65 | 74 | 73 | 78 | 67 | 100 | 62 | 64 | 60 | 75 |
| 12 | Mistral Medium 3.5 | 69% | 65 | 53 | 60 | 75 | 76 | 77 | 62 | 93 | 60 | 75 |
Score bands: ≥95, 85–94, 70–84, <70. Last updated May 7, 2026.
This page helps you understand and compare how well different large language models write code for Paddle, so you can choose the one that best fits your needs.
Each model is run against the same Paddle integration scenarios across three modes:
- a baseline prompt
- a prompt with agent skills
- a prompt with access to the Paddle docs MCP server
We compare each model's response against a working Paddle integration. A model passes a scenario when every deterministic check (substring, regex) passes and every LLM-as-judge rubric scores favorably.
Each rubric targets one antipattern at a time, so a model that ships working code but mishandles webhook idempotency is scored differently from one that handles idempotency but mismatches a price ID. Scores are the percentage of criteria the model passed.
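As an illustration only (the benchmark's actual harness is not published on this page), a deterministic check might look something like the sketch below; the check names and patterns here are invented.

```ts
// Illustrative only: hypothetical deterministic checks run over the generated code.
type Check = { name: string; pass: (code: string) => boolean };

const checks: Check[] = [
  {
    // The webhook handler should read the dedicated webhook secret, not the API key.
    name: "reads PADDLE_NOTIFICATION_WEBHOOK_SECRET",
    pass: (code) => code.includes("PADDLE_NOTIFICATION_WEBHOOK_SECRET"),
  },
  {
    // Prices should be lowest-unit strings like "1000", never decimals like "10.00".
    name: "no decimal amounts",
    pass: (code) => !/amount:\s*["']\d+\.\d{2}["']/.test(code),
  },
];

export function runDeterministicChecks(generatedCode: string) {
  const results = checks.map((c) => ({ name: c.name, passed: c.pass(generatedCode) }));
  return { allPassed: results.every((r) => r.passed), results };
}
```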
What each task tests
- Quick start: Bootstrap a Next.js Paddle integration. Graded on the client/server boundary (`use client`, no server keys in the browser), env-based environment selection (never defaulting to production), price ID correctness, and checkout button state integrity (sketch below).
- Catalog setup: A one-shot Node script that seeds a three-tier product and price catalog. Graded on creating products before prices, using the lowest-unit amount for money (`"1000"` cents, not `"10.00"`), validating the `billingCycle` shape, sourcing the API key from env vars, and defaulting to sandbox (sketch below).
- Pricing page: A localized three-tier pricing page with monthly/yearly toggle. Graded on a single batched `PricePreview` fetch (all six tier+frequency IDs together), rendering Paddle's formatted total unmodified, handling unknown countries correctly, keeping price IDs consistent between preview and checkout, and showing a loading affordance.
- Inline checkout: An inline checkout with quantity selector and live price summary. Graded on the `frameTarget` div matching, throttled/debounced price updates, a single Paddle init per mount, event-driven UI (price from the event payload, not client-side math), and environment safety.
- Webhook setup: A route handler that verifies Paddle signatures and routes typed events. Graded on verifying before processing side effects, passing the raw body unchanged, reading the secret from `PADDLE_NOTIFICATION_WEBHOOK_SECRET` (not the API key), returning an error (not 200) on signature failure, typing events with TypeScript enums, and being idempotent across Paddle retries (sketch below).
- Sync subscriptions: Mirror Paddle subscription state into Supabase via webhooks. Graded on UPSERT-based idempotency (keyed on the Paddle subscription ID, safe under out-of-order delivery), treating `subscription_status` as the terminal source of truth, granting access to `trialing` customers, discriminating event types, and keeping the Supabase service-role key server-side (sketch below).
- Customer portal: Mint a Paddle customer portal session URL from a Server Action. Graded on authenticating before the SDK call, resolving `customer_id` from the auth session (never from input), returning only the overview URL (no metadata leak), handling missing-customer cases, and rejecting any input from the client.
- Cancel subscription: Cancel a subscription via a Server Action. Graded on auth, an ownership check (the signed-in customer owns the subscription), `effectiveFrom: "next_billing_period"` as the safe default (the customer keeps access until period end rather than losing it immediately), `revalidatePath` after success, and a slim return shape rather than the full SDK object (sketch below).
- Update subscription: Upgrade a subscription with proration and item replacement. Graded on auth, an ownership check, `prorationBillingMode: "prorated_immediately"` (the only correct value for upgrades), replace-not-append item semantics, `revalidatePath` after success, and a slim return shape.
- Billing history: A paginated billing history view filtered to the signed-in customer. Graded on auth, a mandatory server-resolved `customer_id` filter, single-page pagination (one page plus `hasMore`, no eager-load loops), amount formatting (lowest units to `Intl.NumberFormat`), zero-decimal currency handling (JPY/KRW/CLP not divided by 100), excluding `draft`/`ready` statuses, and a slim DTO return (sketch below).
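To make the rubrics above concrete, the sketches below show the kind of patterns the graders reward. They are minimal illustrations written for this page, not the benchmark's reference solutions; env var names, helper functions, and table shapes are assumptions, and SDK method names should be checked against the version you install. For the quick start, the core ideas are a clean client/server boundary, env-driven environment selection that never falls back to production, and a checkout button that stays inert until Paddle.js is ready (using `@paddle/paddle-js`):

```tsx
"use client";

import { useEffect, useState } from "react";
import { initializePaddle, type Paddle } from "@paddle/paddle-js";

export function CheckoutButton({ priceId }: { priceId: string }) {
  const [paddle, setPaddle] = useState<Paddle | undefined>();

  useEffect(() => {
    // Client-side token only; server API keys never reach the browser.
    initializePaddle({
      environment:
        process.env.NEXT_PUBLIC_PADDLE_ENV === "production"
          ? "production"
          : "sandbox", // default to sandbox, never to production
      token: process.env.NEXT_PUBLIC_PADDLE_CLIENT_TOKEN!,
    }).then(setPaddle);
  }, []);

  return (
    <button
      disabled={!paddle} // button state reflects whether Paddle.js has loaded
      onClick={() => paddle?.Checkout.open({ items: [{ priceId, quantity: 1 }] })}
    >
      Subscribe
    </button>
  );
}
```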
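For catalog setup, the graders want products created before their prices, lowest-unit string amounts, the API key read from the environment, and sandbox as the default. A sketch assuming the `@paddle/paddle-node-sdk` client:

```ts
import { Paddle, Environment } from "@paddle/paddle-node-sdk";

// API key comes from the environment; the script defaults to sandbox.
const paddle = new Paddle(process.env.PADDLE_API_KEY!, {
  environment:
    process.env.PADDLE_ENV === "production"
      ? Environment.production
      : Environment.sandbox,
});

async function seedTier(name: string, monthlyAmount: string) {
  // Create the product first; the price references its ID.
  const product = await paddle.products.create({
    name,
    taxCategory: "standard",
  });

  // Money is the lowest unit as a string: "1000" means $10.00, never "10.00".
  await paddle.prices.create({
    productId: product.id,
    description: `${name} monthly`,
    unitPrice: { amount: monthlyAmount, currencyCode: "USD" },
    billingCycle: { interval: "month", frequency: 1 },
  });
}

await seedTier("Starter", "1000");
await seedTier("Pro", "2500");
await seedTier("Enterprise", "9900");
```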
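For webhook setup, verification runs on the untouched raw body before any side effects, using the webhook secret rather than the API key, and a bad signature must not be acknowledged with a 200. A Next.js route handler sketch using the Node SDK's `webhooks.unmarshal` and `EventName` exports (their exact signatures vary slightly between SDK versions):

```ts
import { NextResponse } from "next/server";
import { Paddle, EventName } from "@paddle/paddle-node-sdk";

const paddle = new Paddle(process.env.PADDLE_API_KEY!);

export async function POST(req: Request) {
  const signature = req.headers.get("paddle-signature") ?? "";
  const rawBody = await req.text(); // pass the raw body through unchanged

  try {
    // Verify first; the secret is the webhook secret, not the API key.
    const event = await paddle.webhooks.unmarshal(
      rawBody,
      process.env.PADDLE_NOTIFICATION_WEBHOOK_SECRET!,
      signature
    );

    // Route on the typed event name; handlers must stay idempotent because Paddle retries.
    switch (event.eventType) {
      case EventName.SubscriptionCreated:
      case EventName.SubscriptionUpdated:
        // ...upsert subscription state keyed on the Paddle subscription ID...
        break;
    }
    return NextResponse.json({ received: true });
  } catch {
    // Never return 200 for a failed signature check.
    return NextResponse.json({ error: "invalid signature" }, { status: 400 });
  }
}
```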
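For sync subscriptions, the central pattern is an UPSERT keyed on the Paddle subscription ID so webhook retries are harmless, with the Supabase service-role key used only on the server. The table and column names below are assumptions:

```ts
import { createClient } from "@supabase/supabase-js";

// Service-role key: server-side only, never shipped to the browser.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Minimal, illustrative shape of the fields mirrored from the webhook payload.
type SubscriptionEvent = {
  id: string; // Paddle subscription ID
  customerId: string;
  status: string; // "trialing" should still grant access
  occurredAt: string; // event timestamp, useful as an out-of-order guard
};

export async function syncSubscription(sub: SubscriptionEvent) {
  // UPSERT keyed on the subscription ID: replaying the same event is a no-op.
  // For strict out-of-order safety, compare occurredAt against the stored row first.
  const { error } = await supabase.from("subscriptions").upsert(
    {
      paddle_subscription_id: sub.id,
      paddle_customer_id: sub.customerId,
      status: sub.status,
      last_event_at: sub.occurredAt,
    },
    { onConflict: "paddle_subscription_id" }
  );
  if (error) throw error;
}
```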
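For cancel subscription, the graded pattern is: authenticate, check ownership, cancel at the next billing period rather than immediately, revalidate the page, and return a slim shape. `getCustomerIdFromSession` below is a hypothetical auth helper, and the subscription method names should be checked against your `@paddle/paddle-node-sdk` version:

```ts
"use server";

import { revalidatePath } from "next/cache";
import { Paddle } from "@paddle/paddle-node-sdk";
import { getCustomerIdFromSession } from "@/lib/auth"; // hypothetical helper

const paddle = new Paddle(process.env.PADDLE_API_KEY!);

export async function cancelSubscription(subscriptionId: string) {
  // Auth first: resolve the Paddle customer from the signed-in session, never from input.
  const customerId = await getCustomerIdFromSession();

  // Ownership check: the signed-in customer must own the subscription being cancelled.
  const subscription = await paddle.subscriptions.get(subscriptionId);
  if (subscription.customerId !== customerId) {
    return { ok: false as const, error: "not_found" };
  }

  // Safe default: access continues until the end of the current billing period.
  const updated = await paddle.subscriptions.cancel(subscriptionId, {
    effectiveFrom: "next_billing_period",
  });

  revalidatePath("/account/billing");

  // Slim return shape instead of the full SDK object.
  return { ok: true as const, status: updated.status };
}
```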
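For billing history, the subtle rubric is money formatting: Paddle totals arrive as lowest-unit strings, but zero-decimal currencies such as JPY, KRW, and CLP are already whole units and must not be divided by 100. A small formatter sketch (the zero-decimal list here is abbreviated, not exhaustive):

```ts
// Abbreviated set of currencies with no minor unit; extend with the rest Paddle supports.
const ZERO_DECIMAL_CURRENCIES = new Set(["JPY", "KRW", "CLP"]);

export function formatTotal(lowestUnitAmount: string, currencyCode: string): string {
  const raw = Number(lowestUnitAmount);
  const value = ZERO_DECIMAL_CURRENCIES.has(currencyCode) ? raw : raw / 100;

  return new Intl.NumberFormat("en-US", {
    style: "currency",
    currency: currencyCode,
  }).format(value);
}

formatTotal("1999", "USD"); // "$19.99"
formatTotal("1999", "JPY"); // "¥1,999"
```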
How to read these results if you're choosing an LLM
- Skills and MCP shine in agent loops. We grade a single one-shot generation, so the leaderboard understates their real value. In an agent loop, a model that ships broken code can ask the docs MCP to troubleshoot and converge on something that works. These columns capture how good a model is on the first attempt, not the final answer.
- Smaller models can underperform with MCP. Effective MCP use depends on the model knowing what to ask. If a smaller model dips below its baseline in MCP mode, that's a tool-use ceiling rather than a weakness of the docs MCP itself.
- Things move quickly in the AI space. These results are a snapshot, but new models are released frequently. If you're making a long-term architectural decision, treat the ranking as directional rather than definitive.
- Tied scores are genuinely tied. When several models reach the same score on a scenario or overall, this benchmark can't usefully distinguish them. At that point, cost, latency, your existing tooling, and model update stability should drive your decision.
Shoutout to the Clerk.com team for the inspiration for this page.