LLM benchmark
Understand which LLM you should choose. We evaluate LLMs based on their ability to generate working Paddle integration code from simple, real-world prompts. Every model is evaluated on the same set of prompts. The score is the percentage of criteria that pass.
| # | Model | Overall | Billing history | Catalog setup | Inline checkout | Customer portal | Pricing page | Quick start | Cancel subs | Sync subs | Update subs | Webhook setup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 97% | 94 | 95 | 93 | 100 | 100 | 100 | 100 | 100 | 100 | 83 |
| 2 | GPT-5.5 | 95% | 100 | 100 | 93 | 100 | 95 | 92 | 100 | 86 | 100 | 83 |
| 3 | Gemini 3.1 Pro | 94% | 94 | 95 | 87 | 100 | 95 | 92 | 100 | 93 | 100 | 83 |
| 4 | Claude Sonnet 4.6 | 92% | 82 | 79 | 93 | 100 | 95 | 92 | 100 | 100 | 93 | 83 |
| 5 | Gemini 3 Flash | 92% | 79 | 95 | 87 | 92 | 95 | 92 | 100 | 93 | 100 | 83 |
| 6 | GPT-5.4 | 91% | 85 | 89 | 87 | 100 | 81 | 92 | 100 | 93 | 100 | 83 |
| 7 | Gemini 3.1 Flash-Lite | 89% | 82 | 100 | 87 | 75 | 90 | 92 | 100 | 100 | 80 | 83 |
| 8 | Mistral Large 3 | 83% | 76 | 95 | 60 | 67 | 76 | 100 | 77 | 93 | 100 | 83 |
| 9 | GPT-5.4 mini | 83% | 88 | 68 | 73 | 75 | 90 | 100 | 92 | 86 | 87 | 67 |
| 10 | Mistral Small 4 | 72% | 65 | 58 | 53 | 75 | 71 | 85 | 77 | 86 | 67 | 83 |
| 11 | Claude Haiku 4.5 | 72% | 65 | 74 | 73 | 78 | 67 | 100 | 62 | 64 | 60 | 75 |
| 12 | Mistral Medium 3.5 | 69% | 65 | 53 | 60 | 75 | 76 | 77 | 62 | 93 | 60 | 75 |
Score bands: ≥95, 85–94, 70–84, <70. Last updated May 7, 2026.
This page helps you understand and compare how well different large language models write code for Paddle, so you can choose the one that best fits your needs.
Each model is run against the same Paddle integration scenarios across three modes:
- a baseline prompt
- a prompt with agent skills
- a prompt with access to the Paddle docs MCP server
We compare each model's response against a working Paddle integration. A model passes a scenario when every deterministic check (substring, regex) passes and every LLM-as-judge rubric scores favorably.
Each rubric targets one antipattern at a time, so a model that ships working code but mishandles webhook idempotency is scored differently from one that handles idempotency but mismatches a price ID. Scores are the percentage of criteria the model passed.
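As an illustration only (the benchmark's actual harness is not published on this page), a deterministic check might look something like the sketch below; the check names and patterns here are invented.

```ts
// Illustrative only: hypothetical deterministic checks run over the generated code.
type Check = { name: string; pass: (code: string) => boolean };

const checks: Check[] = [
  {
    // The webhook handler should read the dedicated webhook secret, not the API key.
    name: "reads PADDLE_NOTIFICATION_WEBHOOK_SECRET",
    pass: (code) => code.includes("PADDLE_NOTIFICATION_WEBHOOK_SECRET"),
  },
  {
    // Prices should be lowest-unit strings like "1000", never decimals like "10.00".
    name: "no decimal amounts",
    pass: (code) => !/amount:\s*["']\d+\.\d{2}["']/.test(code),
  },
];

export function runDeterministicChecks(generatedCode: string) {
  const results = checks.map((c) => ({ name: c.name, passed: c.pass(generatedCode) }));
  return { allPassed: results.every((r) => r.passed), results };
}
```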
What each task tests
- Quick start: Bootstrap a Next.js Paddle integration. Graded on the client/server boundary (`use client`, no server keys in the browser), env-based environment selection (never defaulting to production), price ID correctness, and checkout button state integrity (sketch below).
- Catalog setup: A one-shot Node script that seeds a three-tier product and price catalog. Graded on creating products before prices, using the lowest-unit amount for money (`"1000"` cents, not `"10.00"`), validating the `billingCycle` shape, sourcing the API key from env vars, and defaulting to sandbox (sketch below).
- Pricing page: A localized three-tier pricing page with monthly/yearly toggle. Graded on a single batched `PricePreview` fetch (all six tier+frequency IDs together), rendering Paddle's formatted total unmodified, handling unknown countries correctly, keeping price IDs consistent between preview and checkout, and showing a loading affordance.
- Inline checkout: An inline checkout with quantity selector and live price summary. Graded on the `frameTarget` div matching, throttled/debounced price updates, a single Paddle init per mount, event-driven UI (price from the event payload, not client-side math), and environment safety.
- Webhook setup: A route handler that verifies Paddle signatures and routes typed events. Graded on verifying before processing side effects, passing the raw body unchanged, reading the secret from `PADDLE_NOTIFICATION_WEBHOOK_SECRET` (not the API key), returning an error (not 200) on signature failure, typing events with TypeScript enums, and being idempotent across Paddle retries (sketch below).
- Sync subscriptions: Mirror Paddle subscription state into Supabase via webhooks. Graded on UPSERT-based idempotency (keyed on the Paddle subscription ID, safe under out-of-order delivery), treating `subscription_status` as the terminal source of truth, granting access to `trialing` customers, discriminating event types, and keeping the Supabase service-role key server-side (sketch below).
- Customer portal: Mint a Paddle customer portal session URL from a Server Action. Graded on authenticating before the SDK call, resolving `customer_id` from the auth session (never from input), returning only the overview URL (no metadata leak), handling missing-customer cases, and rejecting any input from the client.
- Cancel subscription: Cancel a subscription via a Server Action. Graded on auth, an ownership check (the signed-in customer owns the subscription), `effectiveFrom: "next_billing_period"` as the safe default (the customer keeps access until period end rather than losing it immediately), `revalidatePath` after success, and a slim return shape rather than the full SDK object (sketch below).
- Update subscription: Upgrade a subscription with proration and item replacement. Graded on auth, an ownership check, `prorationBillingMode: "prorated_immediately"` (the only correct value for upgrades), replace-not-append item semantics, `revalidatePath` after success, and a slim return shape.
- Billing history: A paginated billing history view filtered to the signed-in customer. Graded on auth, a mandatory server-resolved `customer_id` filter, single-page pagination (one page plus `hasMore`, no eager-load loops), amount formatting (lowest units to `Intl.NumberFormat`), zero-decimal currency handling (JPY/KRW/CLP not divided by 100), excluding `draft`/`ready` statuses, and a slim DTO return (sketch below).
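To make the rubrics above concrete, the sketches below show the kind of patterns the graders reward. They are minimal illustrations written for this page, not the benchmark's reference solutions; env var names, helper functions, and table shapes are assumptions, and SDK method names should be checked against the version you install. For the quick start, the core ideas are a clean client/server boundary, env-driven environment selection that never falls back to production, and a checkout button that stays inert until Paddle.js is ready (using `@paddle/paddle-js`):

```tsx
"use client";

import { useEffect, useState } from "react";
import { initializePaddle, type Paddle } from "@paddle/paddle-js";

export function CheckoutButton({ priceId }: { priceId: string }) {
  const [paddle, setPaddle] = useState<Paddle | undefined>();

  useEffect(() => {
    // Client-side token only; server API keys never reach the browser.
    initializePaddle({
      environment:
        process.env.NEXT_PUBLIC_PADDLE_ENV === "production"
          ? "production"
          : "sandbox", // default to sandbox, never to production
      token: process.env.NEXT_PUBLIC_PADDLE_CLIENT_TOKEN!,
    }).then(setPaddle);
  }, []);

  return (
    <button
      disabled={!paddle} // button state reflects whether Paddle.js has loaded
      onClick={() => paddle?.Checkout.open({ items: [{ priceId, quantity: 1 }] })}
    >
      Subscribe
    </button>
  );
}
```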
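For catalog setup, the graders want products created before their prices, lowest-unit string amounts, the API key read from the environment, and sandbox as the default. A sketch assuming the `@paddle/paddle-node-sdk` client:

```ts
import { Paddle, Environment } from "@paddle/paddle-node-sdk";

// API key comes from the environment; the script defaults to sandbox.
const paddle = new Paddle(process.env.PADDLE_API_KEY!, {
  environment:
    process.env.PADDLE_ENV === "production"
      ? Environment.production
      : Environment.sandbox,
});

async function seedTier(name: string, monthlyAmount: string) {
  // Create the product first; the price references its ID.
  const product = await paddle.products.create({
    name,
    taxCategory: "standard",
  });

  // Money is the lowest unit as a string: "1000" means $10.00, never "10.00".
  await paddle.prices.create({
    productId: product.id,
    description: `${name} monthly`,
    unitPrice: { amount: monthlyAmount, currencyCode: "USD" },
    billingCycle: { interval: "month", frequency: 1 },
  });
}

await seedTier("Starter", "1000");
await seedTier("Pro", "2500");
await seedTier("Enterprise", "9900");
```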
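For webhook setup, verification runs on the untouched raw body before any side effects, using the webhook secret rather than the API key, and a bad signature must not be acknowledged with a 200. A Next.js route handler sketch using the Node SDK's `webhooks.unmarshal` and `EventName` exports (their exact signatures vary slightly between SDK versions):

```ts
import { NextResponse } from "next/server";
import { Paddle, EventName } from "@paddle/paddle-node-sdk";

const paddle = new Paddle(process.env.PADDLE_API_KEY!);

export async function POST(req: Request) {
  const signature = req.headers.get("paddle-signature") ?? "";
  const rawBody = await req.text(); // pass the raw body through unchanged

  try {
    // Verify first; the secret is the webhook secret, not the API key.
    const event = await paddle.webhooks.unmarshal(
      rawBody,
      process.env.PADDLE_NOTIFICATION_WEBHOOK_SECRET!,
      signature
    );

    // Route on the typed event name; handlers must stay idempotent because Paddle retries.
    switch (event.eventType) {
      case EventName.SubscriptionCreated:
      case EventName.SubscriptionUpdated:
        // ...upsert subscription state keyed on the Paddle subscription ID...
        break;
    }
    return NextResponse.json({ received: true });
  } catch {
    // Never return 200 for a failed signature check.
    return NextResponse.json({ error: "invalid signature" }, { status: 400 });
  }
}
```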
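For sync subscriptions, the central pattern is an UPSERT keyed on the Paddle subscription ID so webhook retries are harmless, with the Supabase service-role key used only on the server. The table and column names below are assumptions:

```ts
import { createClient } from "@supabase/supabase-js";

// Service-role key: server-side only, never shipped to the browser.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Minimal, illustrative shape of the fields mirrored from the webhook payload.
type SubscriptionEvent = {
  id: string; // Paddle subscription ID
  customerId: string;
  status: string; // "trialing" should still grant access
  occurredAt: string; // event timestamp, useful as an out-of-order guard
};

export async function syncSubscription(sub: SubscriptionEvent) {
  // UPSERT keyed on the subscription ID: replaying the same event is a no-op.
  // For strict out-of-order safety, compare occurredAt against the stored row first.
  const { error } = await supabase.from("subscriptions").upsert(
    {
      paddle_subscription_id: sub.id,
      paddle_customer_id: sub.customerId,
      status: sub.status,
      last_event_at: sub.occurredAt,
    },
    { onConflict: "paddle_subscription_id" }
  );
  if (error) throw error;
}
```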
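For cancel subscription, the graded pattern is: authenticate, check ownership, cancel at the next billing period rather than immediately, revalidate the page, and return a slim shape. `getCustomerIdFromSession` below is a hypothetical auth helper, and the subscription method names should be checked against your `@paddle/paddle-node-sdk` version:

```ts
"use server";

import { revalidatePath } from "next/cache";
import { Paddle } from "@paddle/paddle-node-sdk";
import { getCustomerIdFromSession } from "@/lib/auth"; // hypothetical helper

const paddle = new Paddle(process.env.PADDLE_API_KEY!);

export async function cancelSubscription(subscriptionId: string) {
  // Auth first: resolve the Paddle customer from the signed-in session, never from input.
  const customerId = await getCustomerIdFromSession();

  // Ownership check: the signed-in customer must own the subscription being cancelled.
  const subscription = await paddle.subscriptions.get(subscriptionId);
  if (subscription.customerId !== customerId) {
    return { ok: false as const, error: "not_found" };
  }

  // Safe default: access continues until the end of the current billing period.
  const updated = await paddle.subscriptions.cancel(subscriptionId, {
    effectiveFrom: "next_billing_period",
  });

  revalidatePath("/account/billing");

  // Slim return shape instead of the full SDK object.
  return { ok: true as const, status: updated.status };
}
```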
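For billing history, the subtle rubric is money formatting: Paddle totals arrive as lowest-unit strings, but zero-decimal currencies such as JPY, KRW, and CLP are already whole units and must not be divided by 100. A small formatter sketch (the zero-decimal list here is abbreviated, not exhaustive):

```ts
// Abbreviated set of currencies with no minor unit; extend with the rest Paddle supports.
const ZERO_DECIMAL_CURRENCIES = new Set(["JPY", "KRW", "CLP"]);

export function formatTotal(lowestUnitAmount: string, currencyCode: string): string {
  const raw = Number(lowestUnitAmount);
  const value = ZERO_DECIMAL_CURRENCIES.has(currencyCode) ? raw : raw / 100;

  return new Intl.NumberFormat("en-US", {
    style: "currency",
    currency: currencyCode,
  }).format(value);
}

formatTotal("1999", "USD"); // "$19.99"
formatTotal("1999", "JPY"); // "¥1,999"
```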
How to read these results if you're choosing an LLM
- Skills and MCP shine in agent loops. We grade a single one-shot generation, so the leaderboard understates their real value. In an agent loop, a model that ships broken code can ask the docs MCP to troubleshoot and converge on something that works. These columns capture how good a model is on the first attempt, not the final answer.
- Smaller models can underperform with MCP. Effective MCP use depends on the model knowing what to ask. If a smaller model dips below its baseline in MCP mode, that's a tool-use ceiling rather than a weakness of the docs MCP itself.
- Things move quickly in the AI space. These results are a snapshot, but new models are released frequently. If you're making a long-term architectural decision, treat the ranking as directional rather than definitive.
- Tied scores are genuinely tied. When several models reach the same score on a scenario or overall, this benchmark can't usefully distinguish them. At that point, cost, latency, your existing tooling, and model update stability should drive your decision.
Shoutout to the Clerk.com team for the inspiration for this page.