# A/B Testing
A/B testing lets you run multiple system prompt variants simultaneously and measure which one performs best based on real user feedback and CSAT scores. Iterate on your bot's personality, tone, and instructions with data-driven confidence.
## How It Works
1. Create two or more prompt variants. Each variant is a complete system prompt that is used instead of your default.
2. Assign weights to control the traffic split (e.g., 50/50 or 70/30). Weights determine what percentage of new conversations see each variant.
3. Each new conversation is randomly assigned a variant based on the weights. Once assigned, the conversation stays on that variant for its entire lifetime.
4. User feedback is attributed to the variant. Thumbs up/down on individual messages and CSAT ratings at the session level are all tracked per variant.
5. Compare performance metrics to pick the winner. Review impressions, feedback scores, and CSAT averages side by side.
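The weighted, sticky assignment described in steps 2 and 3 can be sketched in Python (a minimal illustration; the platform handles this server-side, and the function and field names here are hypothetical):

```python
import random

def assign_variant(conversation_id: str, variants: list[dict],
                   assignments: dict[str, str]) -> str:
    """Pick a variant for a new conversation by weight; reuse the
    stored choice for an existing conversation (sticky assignment)."""
    if conversation_id in assignments:
        # Already assigned: the conversation stays on its variant.
        return assignments[conversation_id]
    active = [v for v in variants if v["active"]]
    names = [v["name"] for v in active]
    weights = [v["weight"] for v in active]
    # Weighted random draw among active variants only.
    chosen = random.choices(names, weights=weights, k=1)[0]
    assignments[conversation_id] = chosen
    return chosen
```

With a 50/50 weight split, roughly half of new conversations land on each active variant, while inactive variants receive no traffic at all.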
## Creating Variants
Navigate to your bot settings to create and manage A/B test variants. You can also manage variants programmatically via the API.
Each variant requires:
- **Name** — A descriptive label (e.g., "Variant A - Formal Tone")
- **System Prompt** — The full system prompt for this variant
- **Weight (0–100)** — Relative traffic allocation
- **Active toggle** — Enable or disable the variant without deleting it
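Mapped to a request body for the variants endpoint, those fields might look like this (a sketch; the exact JSON field names are assumptions rather than a confirmed API schema, with `isActive` taken from the toggle described below):

```python
import json

# Hypothetical request body for creating a variant; field names
# mirror the settings above but are not a confirmed schema.
variant_payload = {
    "name": "Variant A - Formal Tone",
    "systemPrompt": "You are a professional customer service agent. "
                    "Always use formal language and complete sentences.",
    "weight": 50,       # relative traffic allocation, 0-100
    "isActive": True,   # disable later without deleting the variant
}

body = json.dumps(variant_payload)
```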
`POST /api/c/bots/{botId}/variants`

### Example Setup
**Variant A - Formal Tone**

```text
System Prompt: "You are a professional customer service agent.
Always use formal language and complete sentences.
Address the customer respectfully."
Weight: 50
Active: true
```

**Variant B - Casual Tone**

```text
System Prompt: "You are a friendly, approachable assistant.
Keep your responses casual and conversational.
Use simple language and feel free to be personable."
Weight: 50
Active: true
```

## Metrics Tracked Per Variant
| Metric | Description |
|---|---|
| Impressions | How many conversations used this variant |
| Thumbs Up | Number of positive message-level feedback ratings |
| Thumbs Down | Number of negative message-level feedback ratings |
| CSAT Total | Sum of all session-level satisfaction scores |
| CSAT Count | Number of sessions that submitted a CSAT rating |
## Best Practices
### Change one thing at a time
Change only one aspect between variants (tone, response length, a specific instruction) so you can isolate what actually makes a difference. If you change several things at once, you won't know which change drove the improvement.
### Run tests long enough
Aim for at least 100 conversations per variant before drawing conclusions. Smaller sample sizes can produce misleading results.
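To judge whether an observed difference in thumbs-up rates is more than noise, a standard two-proportion z-test can help (a sketch using only the standard library; the 100-conversation guideline above is a rule of thumb, not a substitute for such a check):

```python
import math

def two_proportion_z(successes_a: int, n_a: int,
                     successes_b: int, n_b: int) -> float:
    """z-statistic for the difference between two proportions under a
    pooled null hypothesis; |z| > 1.96 corresponds to p < 0.05."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

For example, 80/100 thumbs-up on one variant versus 60/100 on the other gives a z-statistic above 1.96, so that gap would be unlikely to arise by chance alone.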
### Start even, then shift
Begin with a 50/50 split to gather data quickly. Once a winner emerges, shift weight toward the better-performing variant (e.g., 80/20) before fully committing.
### Deactivate losing variants
Set `isActive` to `false` on underperforming variants to stop sending traffic to them. Their historical data is preserved for future reference.