# A/B Testing
A/B testing lets you run multiple system prompt variants simultaneously and measure which one performs best based on real user feedback and CSAT scores. Iterate on your bot's personality, tone, and instructions with data-driven confidence.
## How It Works
1. Create two or more prompt variants. Each variant is a complete system prompt that is used instead of your default.
2. Assign weights to control the traffic split (e.g., 50/50 or 70/30). Weights determine what percentage of new conversations see each variant.
3. Each new conversation is randomly assigned a variant based on the weights. Once assigned, the conversation stays on that variant for its entire lifetime.
4. User feedback is attributed to the variant. Thumbs up/down on individual messages and CSAT ratings at the session level are all tracked per variant.
5. Compare performance metrics to pick the winner. Review impressions, feedback scores, and CSAT averages side by side.
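The weighted, sticky assignment described in steps 2 and 3 can be sketched in Python (a minimal illustration; the platform handles this server-side, and the function and field names here are hypothetical):

```python
import random

def assign_variant(conversation_id: str, variants: list[dict],
                   assignments: dict[str, str]) -> str:
    """Pick a variant for a new conversation by weight; reuse the
    stored choice for an existing conversation (sticky assignment)."""
    if conversation_id in assignments:
        # Already assigned: the conversation stays on its variant.
        return assignments[conversation_id]
    active = [v for v in variants if v["active"]]
    names = [v["name"] for v in active]
    weights = [v["weight"] for v in active]
    # Weighted random draw among active variants only.
    chosen = random.choices(names, weights=weights, k=1)[0]
    assignments[conversation_id] = chosen
    return chosen
```

With a 50/50 weight split, roughly half of new conversations land on each active variant, while inactive variants receive no traffic at all.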
## Creating Variants
Navigate to your bot settings to create and manage A/B test variants. You can also manage variants programmatically via the API.
Each variant requires:
- **Name** — A descriptive label (e.g., "Variant A - Formal Tone")
- **System Prompt** — The full system prompt for this variant
- **Weight (0–100)** — Relative traffic allocation
- **Active toggle** — Enable or disable the variant without deleting it
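Mapped to a request body for the variants endpoint, those fields might look like this (a sketch; the exact JSON field names are assumptions rather than a confirmed API schema, with `isActive` taken from the toggle described below):

```python
import json

# Hypothetical request body for creating a variant; field names
# mirror the settings above but are not a confirmed schema.
variant_payload = {
    "name": "Variant A - Formal Tone",
    "systemPrompt": "You are a professional customer service agent. "
                    "Always use formal language and complete sentences.",
    "weight": 50,       # relative traffic allocation, 0-100
    "isActive": True,   # disable later without deleting the variant
}

body = json.dumps(variant_payload)
```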
`POST /api/c/bots/{botId}/variants`

### Example Setup
**Variant A - Formal Tone**

```text
System Prompt: "You are a professional customer service agent.
Always use formal language and complete sentences.
Address the customer respectfully."
Weight: 50
Active: true
```

**Variant B - Casual Tone**

```text
System Prompt: "You are a friendly, approachable assistant.
Keep your responses casual and conversational.
Use simple language and feel free to be personable."
Weight: 50
Active: true
```

## Metrics Tracked Per Variant
| Metric | Description |
|---|---|
| Impressions | How many conversations used this variant |
| Thumbs Up | Number of positive message-level feedback ratings |
| Thumbs Down | Number of negative message-level feedback ratings |
| CSAT Total | Sum of all session-level satisfaction scores |
| CSAT Count | Number of sessions that submitted a CSAT rating |
## Best Practices
### Change one thing at a time
Change only one aspect between variants (tone, response length, a specific instruction) so you can isolate what actually makes a difference. If you change several things at once, you won't know which change drove the improvement.
### Run tests long enough
Aim for at least 100 conversations per variant before drawing conclusions. Smaller sample sizes can produce misleading results.
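To judge whether an observed difference in thumbs-up rates is more than noise, a standard two-proportion z-test can help (a sketch using only the standard library; the 100-conversation guideline above is a rule of thumb, not a substitute for such a check):

```python
import math

def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> float:
    """z-statistic for the difference between two proportions under a
    pooled null hypothesis; |z| > 1.96 corresponds to p < 0.05."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

For example, 80/100 thumbs-up on one variant versus 60/100 on the other gives a z-statistic above 1.96, so that gap would be unlikely to arise by chance alone.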
### Start even, then shift
Begin with a 50/50 split to gather data quickly. Once a winner emerges, shift weight toward the better-performing variant (e.g., 80/20) before fully committing.
### Deactivate losing variants
Set `isActive` to `false` on underperforming variants to stop sending traffic to them. Their historical data is preserved for future reference.