
LLM benchmark

Understand which LLM you should choose. We evaluate LLMs based on their ability to generate working Paddle integration code from simple, real-world prompts. Every model is evaluated on the same set of prompts. The score is the percentage of criteria that pass.

| # | Model | Overall | Billing history | Catalog setup | Inline checkout | Customer portal | Pricing page | Quick start | Cancel subs | Sync subs | Update subs | Webhook setup |
|---|-------|---------|-----------------|---------------|-----------------|-----------------|--------------|-------------|-------------|-----------|-------------|---------------|
| 1 | Claude Opus 4.7 | 97% | 94 | 95 | 93 | 100 | 100 | 100 | 100 | 100 | 100 | 83 |
| 2 | GPT-5.5 | 95% | 100 | 100 | 93 | 100 | 95 | 92 | 100 | 86 | 100 | 83 |
| 3 | Gemini 3.1 Pro | 94% | 94 | 95 | 87 | 100 | 95 | 92 | 100 | 93 | 100 | 83 |
| 4 | Claude Sonnet 4.6 | 92% | 82 | 79 | 93 | 100 | 95 | 92 | 100 | 100 | 93 | 83 |
| 5 | Gemini 3 Flash | 92% | 79 | 95 | 87 | 92 | 95 | 92 | 100 | 93 | 100 | 83 |
| 6 | GPT-5.4 | 91% | 85 | 89 | 87 | 100 | 81 | 92 | 100 | 93 | 100 | 83 |
| 7 | Gemini 3.1 Flash-Lite | 89% | 82 | 100 | 87 | 75 | 90 | 92 | 100 | 100 | 80 | 83 |
| 8 | Mistral Large 3 | 83% | 76 | 95 | 60 | 67 | 76 | 100 | 77 | 93 | 100 | 83 |
| 9 | GPT-5.4 mini | 83% | 88 | 68 | 73 | 75 | 90 | 100 | 92 | 86 | 87 | 67 |
| 10 | Mistral Small 4 | 72% | 65 | 58 | 53 | 75 | 71 | 85 | 77 | 86 | 67 | 83 |
| 11 | Claude Haiku 4.5 | 72% | 65 | 74 | 73 | 78 | 67 | 100 | 62 | 64 | 60 | 75 |
| 12 | Mistral Medium 3.5 | 69% | 65 | 53 | 60 | 75 | 76 | 77 | 62 | 93 | 60 | 75 |

Score bands: ≥95 · 85–94 · 70–84 · <70

Last updated May 7, 2026

This page helps you understand and compare how well different large language models write code for Paddle, so you can choose the one that best fits your needs.

Each model is run against the same Paddle integration scenarios across three modes: baseline, with Skills, and with the Paddle docs MCP.

We compare each model's response against a working Paddle integration. A model passes a scenario when every deterministic check (substring, regex) passes and every LLM-as-judge rubric scores favorably.

Each rubric targets one antipattern at a time, so a model that ships working code but mishandles webhook idempotency is scored differently from one that handles idempotency but mismatches a price ID. Scores are the percentage of criteria the model passed.

What each task tests

Quick start

Bootstrap a Next.js Paddle integration. Graded on the client/server boundary (`use client`, no server keys in the browser), env-based environment selection (never defaulting to production), price ID correctness, and checkout button state integrity.
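
The environment-selection criterion comes down to one small decision. A minimal sketch, assuming an env var named `NEXT_PUBLIC_PADDLE_ENV` (the variable name is illustrative, not something this benchmark mandates):

```typescript
// Resolve the Paddle environment from configuration, failing closed.
// An unset or misspelled value should be a loud error, never a silent
// fallback to production (live money).
function resolvePaddleEnv(value: string | undefined): "sandbox" | "production" {
  if (value === "sandbox" || value === "production") return value;
  throw new Error(`Expected "sandbox" or "production", got: ${String(value)}`);
}

// Usage sketch: resolvePaddleEnv(process.env.NEXT_PUBLIC_PADDLE_ENV)
```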

Catalog setup

A one-shot Node script that seeds a three-tier product and price catalog. Graded on creating products before prices, using the lowest-unit amount for money (`"1000"` cents, not `"10.00"`), validating the `billingCycle` shape, sourcing the API key from env vars, and defaulting to sandbox.
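
The money-formatting criterion boils down to one conversion. A sketch (the helper name is mine; the `billingCycle` shape is the interval-plus-frequency pair the task validates):

```typescript
// Paddle price amounts are strings in the currency's lowest unit:
// $10.00 is "1000", not "10.00". Math.round guards against float drift
// (19.99 * 100 is 1998.9999... in IEEE 754).
function toLowestUnit(amount: number, minorUnits = 2): string {
  return String(Math.round(amount * 10 ** minorUnits));
}

// The billingCycle shape: an interval plus a frequency.
const monthly = { interval: "month", frequency: 1 } as const;

console.log(toLowestUnit(10));    // "1000"
console.log(toLowestUnit(19.99)); // "1999"
```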

Pricing page

A localized three-tier pricing page with monthly/yearly toggle. Graded on a single batched `PricePreview` fetch (all six tier+frequency IDs together), rendering Paddle's formatted total unmodified, handling unknown countries correctly, keeping price IDs consistent between preview and checkout, and showing a loading affordance.
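
The single-fetch criterion is mostly about how the request is assembled. A sketch with an illustrative `Tier` type (not Paddle's API shape):

```typescript
type Tier = { name: string; monthlyPriceId: string; yearlyPriceId: string };

// Collect every tier x frequency combination into ONE items array, so the
// page issues a single price-preview request instead of six. Reusing the
// same IDs at checkout keeps preview and checkout consistent.
function buildPreviewItems(tiers: Tier[]): { priceId: string; quantity: number }[] {
  return tiers.flatMap((tier) => [
    { priceId: tier.monthlyPriceId, quantity: 1 },
    { priceId: tier.yearlyPriceId, quantity: 1 },
  ]);
}
```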

Inline checkout

An inline checkout with quantity selector and live price summary. Graded on the `frameTarget` div matching, throttled/debounced price updates, a single Paddle init per mount, event-driven UI (price from the event payload, not client-side math), and environment safety.
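
The throttling criterion can be sketched with a plain trailing-edge debounce (a generic helper, nothing Paddle-specific): rapid quantity clicks collapse into a single price update.

```typescript
// Trailing-edge debounce: of any burst of calls within `waitMs`,
// only the last one fires.
function debounce<A extends unknown[]>(fn: (...args: A) => void, waitMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Usage sketch: wrap whatever triggers the price re-preview, e.g.
// const onQuantityChange = debounce((qty: number) => updatePrices(qty), 250);
```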

Webhook setup

A route handler that verifies Paddle signatures and routes typed events. Graded on verifying before processing side effects, passing the raw body unchanged, reading the secret from `PADDLE_NOTIFICATION_WEBHOOK_SECRET` (not the API key), returning an error (not 200) on signature failure, typing events with TypeScript enums, and being idempotent across Paddle retries.
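
As a sketch of the verify-first shape: Paddle's `Paddle-Signature` header carries a timestamp and an HMAC (`ts=...;h1=...`), and the signed payload is the timestamp joined to the raw body with a colon. The helper below assumes that format; in a real integration the official SDK's verification helper is the safer choice.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify BEFORE any side effects, using the raw request body unchanged:
// re-serialized JSON will not produce the same HMAC. The secret is the
// webhook secret, not the API key.
function verifyPaddleSignature(rawBody: string, header: string, secret: string): boolean {
  const fields = new Map(header.split(";").map((part) => part.split("=", 2) as [string, string]));
  const ts = fields.get("ts");
  const h1 = fields.get("h1");
  if (!ts || !h1) return false;
  const expected = createHmac("sha256", secret).update(`${ts}:${rawBody}`).digest("hex");
  if (expected.length !== h1.length) return false; // timingSafeEqual throws on length mismatch
  return timingSafeEqual(Buffer.from(expected), Buffer.from(h1));
}
```

On failure the route should respond with an error status so Paddle retries, which is exactly why the handler also needs to be idempotent.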

Sync subscriptions

Mirror Paddle subscription state into Supabase via webhooks. Graded on UPSERT-based idempotency (keyed on the Paddle subscription ID, safe under out-of-order delivery), treating `subscription_status` as the terminal source of truth, granting access to trialing customers, discriminating event types, and keeping the Supabase service-role key server-side.
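
The idempotency criterion is easiest to see with the database swapped out for a `Map`. A sketch (field names are illustrative, not Paddle's exact payload shape):

```typescript
type SubRow = { subscriptionId: string; status: string; occurredAt: string };

// UPSERT keyed on the Paddle subscription ID. Comparing event timestamps
// makes this safe under both retries (the same event replayed) and
// out-of-order delivery (an older event arriving after a newer one).
function upsertSubscription(table: Map<string, SubRow>, incoming: SubRow): void {
  const existing = table.get(incoming.subscriptionId);
  if (existing && existing.occurredAt >= incoming.occurredAt) return; // stale or replay: no-op
  table.set(incoming.subscriptionId, incoming);
}

// The status field is the source of truth, and trialing customers get access.
const hasAccess = (status: string): boolean => status === "active" || status === "trialing";
```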

Customer portal

Mint a Paddle customer portal session URL from a Server Action. Graded on authenticating before the SDK call, resolving `customer_id` from the auth session (never from input), returning only the overview URL (no metadata leak), handling missing-customer cases, and rejecting any input from the client.
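
A sketch of that shape with the auth and SDK calls stubbed out as parameters. All names here are illustrative, and the response shape (a portal session exposing an overview URL) is an assumption about the SDK, not a verified signature:

```typescript
type Session = { paddleCustomerId: string | null };
type PortalSession = { urls: { general: { overview: string } } };

// The customer ID comes from the server-side auth session, never from
// client input, and only the overview URL leaves the server.
async function createPortalUrl(
  getSession: () => Promise<Session | null>,
  mintPortalSession: (customerId: string) => Promise<PortalSession>,
): Promise<{ url: string } | { error: "unauthenticated" | "no_customer" }> {
  const session = await getSession();                             // authenticate first
  if (!session) return { error: "unauthenticated" };
  if (!session.paddleCustomerId) return { error: "no_customer" }; // missing-customer case
  const portal = await mintPortalSession(session.paddleCustomerId);
  return { url: portal.urls.general.overview };                   // slim return: no metadata leak
}
```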

Cancel subscription

Cancel a subscription via a Server Action. Graded on auth, ownership check (the signed-in customer owns the subscription), `effectiveFrom: "next_billing_period"` as the safe default (the customer keeps access until period end, not losing it immediately), `revalidatePath` after success, and a slim return shape rather than the full SDK object.
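
A sketch of the request default and the slim return shape. Field names are modeled loosely on Paddle's subscription object and should be treated as assumptions:

```typescript
// "next_billing_period" is the safe default: the customer keeps access
// until the period ends instead of being cut off immediately.
function buildCancelRequest(
  effectiveFrom: "next_billing_period" | "immediately" = "next_billing_period",
) {
  return { effectiveFrom };
}

type SdkSubscription = {
  status: string;
  scheduledChange: { action: string; effectiveAt: string } | null;
  // ...plus many more SDK fields the client never needs
};

// Return a slim DTO, not the full SDK object.
function toSlimCancelResult(sub: SdkSubscription) {
  return { status: sub.status, cancelsAt: sub.scheduledChange?.effectiveAt ?? null };
}
```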

Update subscription

Upgrade a subscription with proration and item replacement. Graded on auth, ownership check, `prorationBillingMode: "prorated_immediately"` (the only correct value for upgrades), replace-not-append item semantics, `revalidatePath` after success, and a slim return shape.
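
The replace-not-append point is worth a sketch, since it is the classic failure mode here: the items array describes the subscription's full desired state, so listing only the target price replaces the old one.

```typescript
// Build the update request for an upgrade. The items array is the FULL
// desired state: include only the target price, never the current item
// appended alongside it (which would double-bill the customer).
function buildUpgradeRequest(targetPriceId: string, quantity = 1) {
  return {
    prorationBillingMode: "prorated_immediately" as const, // the correct mode for upgrades
    items: [{ priceId: targetPriceId, quantity }],
  };
}
```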

Billing history

A paginated billing history view filtered to the signed-in customer. Graded on auth, a mandatory server-resolved `customer_id` filter, single-page pagination (one page plus `hasMore`, no eager-load loops), amount formatting (lowest units to `Intl.NumberFormat`), zero-decimal currency handling (JPY/KRW/CLP not divided by 100), excluding draft/ready statuses, and a slim DTO return.
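
The zero-decimal criterion has a neat trick: `Intl.NumberFormat` already knows each currency's minor-unit count, so the divisor can be derived instead of hardcoded. A sketch:

```typescript
// Format a lowest-unit amount string for display. Deriving the fraction
// digits from Intl.NumberFormat means zero-decimal currencies such as
// JPY, KRW, and CLP are never wrongly divided by 100.
function formatAmount(lowestUnit: string, currency: string, locale = "en-US"): string {
  const fmt = new Intl.NumberFormat(locale, { style: "currency", currency });
  const digits = fmt.resolvedOptions().maximumFractionDigits ?? 2;
  return fmt.format(Number(lowestUnit) / 10 ** digits);
}
```

For example, `"1000"` renders as $10.00 in USD but as ¥1,000 in JPY.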

How to read these results if you're choosing an LLM

- **Skills and MCP shine in agent loops.** We grade a single one-shot generation, so the leaderboard understates their real value. In an agent loop, a model that ships broken code can ask the docs MCP to troubleshoot and converge on something that works. These scores capture how good a model is on the first attempt, not the final answer.
- **Smaller models can underperform with MCP.** Effective MCP use depends on the model knowing what to ask. If a smaller model dips below its baseline in MCP mode, that's a tool-use ceiling rather than a weakness of the docs MCP itself.
- **Things move quickly in the AI space.** These results are a snapshot, and new models are released frequently. If you're making a long-term architectural decision, treat the ranking as directional rather than definitive.
- **Tied scores are genuinely tied.** When several models reach the same score on a scenario or overall, this benchmark can't usefully distinguish them. At that point, cost, latency, your existing tooling, and model update stability should drive your decision.

Shoutout to the Clerk.com team for the inspiration for this page.