Overview

Evals (evaluations) are structured tests that validate whether Gradial’s AI agent (“Grady”) can successfully complete specific tasks for your organization. They serve as both acceptance criteria during onboarding and ongoing quality assurance as systems evolve. Think of Evals as your automated QA team—they run the same tests consistently, catching issues before they impact your content operations.

Why Evals Matter

Gradial operates in a complex environment where multiple factors can affect output quality:
| Factor | What Can Change | Potential Impact |
| --- | --- | --- |
| Your CMS environment | Component updates, template changes, new patterns | Agent may not recognize or correctly use updated components |
| Your content requirements | New use cases, brand guidelines, compliance rules | Agent needs to learn new constraints |
| Gradial platform | Feature releases, agent improvements, bug fixes | Behavior may change even when changes are intended as improvements |
| Underlying AI models | Model updates from AI providers | Subtle changes in reasoning, output format, or capabilities |
Evals provide a safety net across all of these dimensions, ensuring that what works today continues to work tomorrow.

Setting Up Your Eval Environment

Before building Evals, you need a dedicated area in your CMS where tests can run without affecting production content or being affected by other changes.

Create an Isolated Content Tree

During onboarding, work with your team to establish a dedicated section of your content tree specifically for Evals. This can live in a lower environment (such as stage or QA) or in production—what matters is that it’s clearly separated from content that serves real users.
/content/yoursite/
├── en/                    ← Production content
│   ├── products/
│   ├── articles/
│   └── ...
└── eval-content/          ← Dedicated Eval environment
    ├── simple-updates/
    ├── complex-updates/
    ├── new-page-tests/
    └── migration-tests/
Whether you choose a lower environment or production depends on your needs. Lower environments are safer but may drift from production. Production Eval areas ensure you’re testing against the real system but require stricter governance.

Requirements for Eval Content

Your Eval content area must meet these criteria:
| Requirement | Why It Matters |
| --- | --- |
| Agent-only modifications | Only Grady should modify this content; human edits would invalidate test baselines |
| Protected from overwrites | Content syncs, deployments, or bulk operations should never touch this area |
| Isolated from production | Changes here must never impact live customer experiences |
| Persistent and stable | Content should remain available across releases and environment refreshes |

What to Include

Populate your Eval content area with:
  • Baseline pages — Known-good pages in each template type that serve as starting points for tests
  • Reference pages — Examples of expected outputs for comparison
  • Test scenarios — Pages set up for specific Eval cases (e.g., a page with a hero that needs updating)
  • Pattern examples — Instances of each Design System pattern for validation

Governance

Establish clear ownership and rules for the Eval content area:
  • Document the location — Ensure everyone knows this content is off-limits
  • Restrict access — Limit who can manually modify Eval content
  • Exclude from deployments — Configure your deployment process to skip this area
  • Include in backups — Ensure Eval content is preserved and recoverable
Important: Treat your Eval content as critical infrastructure. If someone accidentally modifies or deletes it, your Evals will produce unreliable results until the content is restored.

Evals in Customer Onboarding

During onboarding, Evals serve as the validation framework for “Grady’s Road to Graduation”—the process of proving the agent can reliably handle your specific use cases.

The Graduation Process

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  1. Define      │     │  2. Build       │     │  3. Run         │
│  Use Cases      │ ──▶ │  Evals          │ ──▶ │  Evals          │
└─────────────────┘     └─────────────────┘     └─────────────────┘

┌─────────────────┐     ┌─────────────────┐            │
│  5. Graduate    │     │  4. Iterate     │            │
│  Use Case       │ ◀── │  & Refine       │ ◀──────────┘
└─────────────────┘     └─────────────────┘

Step 1: Define Use Cases

Work with your Gradial team to identify the specific tasks you need Grady to perform. Common use cases include:
  • Simple content updates — Text changes, image swaps, link updates
  • Medium content updates — Adding/removing components, restructuring sections
  • Complex content updates — Multi-component changes, conditional content
  • New page creation — From reference pages, templates, or source documents
  • Batch migrations — Large-scale content transitions
Each use case should have clear inputs, expected outputs, and success criteria.

Step 2: Build Evals

For each use case, create Evals that test whether Grady can complete the task correctly. An Eval consists of:
| Component | Description | Example |
| --- | --- | --- |
| Input | The task prompt or instruction | “Update the hero headline to ‘New Product Launch’” |
| Context | Reference pages, patterns, or source material | Target page URL, Design System patterns |
| Expected Output | What success looks like | Hero component contains the exact headline text |
| Validation Criteria | How to measure success | Text match, component structure, no regressions |
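The four components above can be captured in a simple data structure. A minimal sketch in Python—the `EvalCase` type, its field names, and the example values are illustrative, not Gradial’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One Eval: a task for the agent plus the criteria for judging it.

    Illustrative shape only -- not Gradial's actual API.
    """
    name: str
    input: str        # the task prompt or instruction
    context: dict     # e.g. target page path, Design System patterns
    expected_output: str
    validation: list = field(default_factory=list)  # checks to run

# A hypothetical simple-update Eval:
hero_update = EvalCase(
    name="simple-update/hero-headline",
    input="Update the hero headline to 'New Product Launch'",
    context={"page": "/content/yoursite/eval-content/simple-updates/hero"},
    expected_output="New Product Launch",
    validation=["exact_text_match", "component_structure_unchanged"],
)
```

Keeping each Eval this self-describing makes it easy to review the full suite as documentation of what the agent is expected to do.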

Step 3: Run Evals

Execute the Evals against your actual AEM environment. Each run produces:
  • Pass/Fail status for each Eval
  • Detailed results showing what the agent did
  • Comparison data between expected and actual output
  • Error logs if something went wrong
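Those four outputs can be summarized per run. A hypothetical sketch—the `EvalResult` shape and `summarize` helper are assumptions for illustration, not Gradial’s reporting format:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Outcome of one Eval in a run (illustrative shape)."""
    eval_name: str
    passed: bool
    expected: str
    actual: str
    errors: list

def summarize(results):
    """Collapse a run into pass/fail counts plus the failures to inspect."""
    failures = [r for r in results if not r.passed]
    return {
        "total": len(results),
        "passed": len(results) - len(failures),
        "failed": len(failures),
        "failures": [(r.eval_name, r.errors) for r in failures],
    }

run = [
    EvalResult("hero-headline", True, "New Product Launch", "New Product Launch", []),
    EvalResult("add-component", False, "2 cards", "1 card", ["card count mismatch"]),
]
print(summarize(run)["failed"])  # → 1
```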

Step 4: Iterate and Refine

When Evals fail, work with your Gradial team to identify the cause:
  • Agent issue — Improve prompting, patterns, or agent configuration
  • Eval issue — Criteria too strict, edge case not accounted for
  • Environment issue — Component or template needs adjustment
Refine and re-run until Evals pass consistently.

Step 5: Graduate the Use Case

A use case “graduates” when:
  • ✅ Evals pass consistently (not just once)
  • ✅ Results meet your quality standards
  • ✅ Your team accepts the output as production-ready
  • ✅ Edge cases have been identified and handled
Graduated use cases move into ongoing monitoring.
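The “not just once” criterion can be made concrete by requiring a streak of consecutive passing runs. A sketch under assumed rules—the threshold of five runs is an illustrative choice, not a Gradial requirement:

```python
def ready_to_graduate(run_history, required_streak=5):
    """A use case graduates only after passing several consecutive runs.

    run_history is a list of booleans, oldest first; required_streak is
    an illustrative threshold.
    """
    if len(run_history) < required_streak:
        return False
    return all(run_history[-required_streak:])

# An early failure is fine if the most recent runs all pass:
assert ready_to_graduate([False, True, True, True, True, True]) is True
# A recent failure blocks graduation:
assert ready_to_graduate([True, True, False, True, True]) is False
```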

Customer Acceptance

Evals provide objective criteria for accepting Gradial use cases. Rather than subjective assessments of “does this look right?”, Evals give you:

Clear Success Metrics

  • Pass rate — What percentage of Evals succeed
  • Consistency — Whether results stay stable across multiple runs
  • Quality score — How well outputs match expectations
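Pass rate and consistency are both simple computations over run results. A minimal sketch, assuming each run is recorded as a mapping from Eval name to pass/fail (the data shape is hypothetical):

```python
def pass_rate(results):
    """Fraction of Evals that passed in a single run."""
    return sum(results) / len(results)

def consistency(runs):
    """Fraction of Evals whose outcome was identical across every run.

    runs is a list of runs; each run maps eval name -> passed?
    """
    names = runs[0].keys()
    stable = [n for n in names if len({run[n] for run in runs}) == 1]
    return len(stable) / len(names)

runs = [
    {"hero": True, "cards": True,  "migrate": False},
    {"hero": True, "cards": False, "migrate": False},
]
print(round(pass_rate(list(runs[0].values())), 2))  # → 0.67
print(round(consistency(runs), 2))                  # → 0.67
```

Note that “migrate” counts toward consistency even though it fails in both runs: a stable failure is diagnosable, while a flaky Eval like “cards” is the harder problem.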

Documented Expectations

Evals serve as living documentation of what you expect from each use case. This creates alignment between your team and Gradial on exactly what “working” means.

Ongoing Evals: Continuous Quality Assurance

Once use cases graduate, Evals shift from validation to monitoring. They run regularly to catch regressions before they impact your operations.

When Evals Run

Evals can be triggered by:
| Trigger | Why It Matters |
| --- | --- |
| Scheduled intervals | Catch gradual drift or intermittent issues |
| Your releases | Verify Grady still works after CMS updates, new components, or template changes |
| Gradial releases | Confirm platform updates don’t break your use cases |
| Model updates | Detect changes in AI behavior after underlying model changes |

What Evals Catch

Your Environment Changes

When you update components, add new templates, or modify page structures, Evals verify that Grady can still work with your updated environment. This prevents surprises when your development changes reach production.

Gradial Platform Changes

As Gradial releases new features and improvements, Evals ensure these changes don’t negatively impact your existing use cases. Even well-intentioned improvements can have unintended side effects.

AI Model Changes

The AI models powering Grady are periodically updated by their providers. These updates can subtly change how the agent reasons about tasks, formats outputs, or handles edge cases. Evals detect these changes before they affect your content.

Regression Detection

When an Eval that previously passed starts failing, this signals a regression. The Eval results help pinpoint:
  • What changed — Which specific behavior is different
  • When it changed — Correlation with releases or updates
  • Impact scope — How many use cases are affected
This enables fast diagnosis and resolution.
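Detecting “what changed” amounts to diffing the latest run against a known-good baseline. A hypothetical sketch, again assuming each run maps Eval name to pass/fail:

```python
def find_regressions(baseline, current):
    """Evals that passed in the baseline run but fail now.

    Both arguments map eval name -> passed? (illustrative shape).
    """
    return sorted(
        name for name, ok in baseline.items()
        if ok and not current.get(name, False)
    )

baseline = {"hero": True, "cards": True, "migrate": True}
after_release = {"hero": True, "cards": False, "migrate": True}
print(find_regressions(baseline, after_release))  # → ['cards']
```

Tagging each run with the release or model version it ran against turns this diff into the “when it changed” correlation as well.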

Eval Types

Different types of Evals serve different purposes:

Functional Evals

Test whether the agent can complete specific tasks correctly.
  • “Can Grady update a headline?”
  • “Can Grady add a new component to a page?”
  • “Can Grady migrate content from a source URL?”

Quality Evals

Test whether outputs meet quality standards beyond basic functionality.
  • “Does the output follow brand guidelines?”
  • “Is the content structure optimal?”
  • “Are accessibility requirements met?”

Regression Evals

Test that previously working functionality still works.
  • “Does use case X still pass after the latest release?”
  • “Are all graduated use cases still functional?”

Edge Case Evals

Test handling of unusual or boundary conditions.
  • “What happens with very long content?”
  • “How does Grady handle missing images?”
  • “What if a referenced component doesn’t exist?”

Best Practices

Building Effective Evals

  • Be specific — Vague success criteria lead to inconsistent results
  • Test one thing — Each Eval should validate a single behavior
  • Use realistic inputs — Test with content similar to production
  • Include edge cases — Don’t just test the happy path
  • Document expectations — Make it clear what success looks like

Managing Evals Over Time

  • Review regularly — Ensure Evals still reflect current requirements
  • Add new Evals — As you add use cases or discover edge cases
  • Retire obsolete Evals — Remove tests for deprecated functionality
  • Version control — Track changes to Evals over time

Responding to Failures

  • Don’t ignore flaky Evals — Intermittent failures often indicate real issues
  • Investigate promptly — The longer a regression persists, the harder it is to diagnose
  • Communicate broadly — Share Eval status with stakeholders
  • Fix forward — Address root causes, not just symptoms

Eval Lifecycle Summary

┌─────────────────────────────────────────────────────────────────────┐
│                         ONBOARDING PHASE                            │
├─────────────────────────────────────────────────────────────────────┤
│  Define Use Case → Build Evals → Run & Iterate → Graduate          │
└─────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────┐
│                         PRODUCTION PHASE                            │
├─────────────────────────────────────────────────────────────────────┤
│  Scheduled Runs ──┐                                                 │
│  Your Releases ───┼──▶ Run Evals ──▶ Pass? ──▶ Continue            │
│  Gradial Releases─┤                    │                            │
│  Model Updates ───┘                    ▼                            │
│                                      Fail? ──▶ Investigate & Fix    │
└─────────────────────────────────────────────────────────────────────┘