Overview

Evals (evaluations) are structured tests that validate whether Gradial’s AI agent (“Grady”) can successfully complete specific tasks for your organization. They serve as both acceptance criteria during onboarding and ongoing quality assurance as systems evolve. Think of Evals as your automated QA team—they run the same tests consistently, catching issues before they impact your content operations.

Why Evals Matter

Gradial operates in a complex environment where multiple factors can affect output quality:
| Factor | What Can Change | Potential Impact |
| --- | --- | --- |
| Your CMS environment | Component updates, template changes, new patterns | Agent may not recognize or correctly use updated components |
| Your content requirements | New use cases, brand guidelines, compliance rules | Agent needs to learn new constraints |
| Gradial platform | Feature releases, agent improvements, bug fixes | Behavior may change even when changes are intended as improvements |
| Underlying AI models | Model updates from AI providers | Subtle changes in reasoning, output format, or capabilities |
Evals provide a safety net across all of these dimensions, ensuring that what works today continues to work tomorrow.

Setting Up Your Eval Environment

Before building Evals, you need a dedicated area in your CMS where tests can run without affecting production content or being affected by other changes.

Create an Isolated Content Tree

During onboarding, work with your team to establish a dedicated section of your content tree specifically for Evals. This can live in a lower environment (such as stage or QA) or in production—what matters is that it’s clearly separated from content that serves real users.
/content/yoursite/
├── en/                    ← Production content
│   ├── products/
│   ├── articles/
│   └── ...
└── eval-content/          ← Dedicated Eval environment
    ├── simple-updates/
    ├── complex-updates/
    ├── new-page-tests/
    └── migration-tests/
Whether you choose a lower environment or production depends on your needs. Lower environments are safer but may drift from production. Production Eval areas ensure you’re testing against the real system but require stricter governance.

Requirements for Eval Content

Your Eval content area must meet these criteria:
| Requirement | Why It Matters |
| --- | --- |
| Agent-only modifications | Only Grady should modify this content; human edits would invalidate test baselines |
| Protected from overwrites | Content syncs, deployments, or bulk operations should never touch this area |
| Isolated from production | Changes here must never impact live customer experiences |
| Persistent and stable | Content should remain available across releases and environment refreshes |

What to Include

Populate your Eval content area with:
  • Baseline pages — Known-good pages in each template type that serve as starting points for tests
  • Reference pages — Examples of expected outputs for comparison
  • Test scenarios — Pages set up for specific Eval cases (e.g., a page with a hero that needs updating)
  • Pattern examples — Instances of each Design System pattern for validation

Governance

Establish clear ownership and rules for the Eval content area:
  • Document the location — Ensure everyone knows this content is off-limits
  • Restrict access — Limit who can manually modify Eval content
  • Exclude from deployments — Configure your deployment process to skip this area
  • Include in backups — Ensure Eval content is preserved and recoverable
Important: Treat your Eval content as critical infrastructure. If someone accidentally modifies or deletes it, your Evals will produce unreliable results until the content is restored.

Evals in Customer Onboarding

During onboarding, Evals serve as the validation framework for “Grady’s Road to Graduation”—the process of proving the agent can reliably handle your specific use cases.

The Graduation Process

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  1. Define      │     │  2. Build       │     │  3. Run         │
│  Use Cases      │ ──▶ │  Evals          │ ──▶ │  Evals          │
└─────────────────┘     └─────────────────┘     └─────────────────┘

┌─────────────────┐     ┌─────────────────┐            │
│  5. Graduate    │     │  4. Iterate     │            │
│  Use Case       │ ◀── │  & Refine       │ ◀──────────┘
└─────────────────┘     └─────────────────┘

Step 1: Define Use Cases

Work with your Gradial team to identify the specific tasks you need Grady to perform. Common use cases include:
  • Simple content updates — Text changes, image swaps, link updates
  • Medium content updates — Adding/removing components, restructuring sections
  • Complex content updates — Multi-component changes, conditional content
  • New page creation — From reference pages, templates, or source documents
  • Batch migrations — Large-scale content transitions
Each use case should have clear inputs, expected outputs, and success criteria.

Step 2: Build Evals

For each use case, create Evals that test whether Grady can complete the task correctly. An Eval consists of:
| Component | Description | Example |
| --- | --- | --- |
| Input | The task prompt or instruction | “Update the hero headline to ‘New Product Launch’” |
| Context | Reference pages, patterns, or source material | Target page URL, Design System patterns |
| Expected Output | What success looks like | Hero component contains the exact headline text |
| Validation Criteria | How to measure success | Text match, component structure, no regressions |
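The four components above can be captured in a simple data structure. A minimal sketch in Python—the `EvalCase` type, its field names, and the example values are illustrative, not Gradial’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One Eval: a task for the agent plus the criteria for judging it.

    Illustrative shape only -- not Gradial's actual API.
    """
    name: str
    input: str        # the task prompt or instruction
    context: dict     # e.g. target page path, Design System patterns
    expected_output: str
    validation: list = field(default_factory=list)  # checks to run

# A hypothetical simple-update Eval:
hero_update = EvalCase(
    name="simple-update/hero-headline",
    input="Update the hero headline to 'New Product Launch'",
    context={"page": "/content/yoursite/eval-content/simple-updates/hero"},
    expected_output="New Product Launch",
    validation=["exact_text_match", "component_structure_unchanged"],
)
```

Keeping each Eval this self-describing makes it easy to review the full suite as documentation of what the agent is expected to do.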

Step 3: Run Evals

Execute the Evals against your actual AEM environment. Each run produces:
  • Pass/Fail status for each Eval
  • Detailed results showing what the agent did
  • Comparison data between expected and actual output
  • Error logs if something went wrong
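Those four outputs can be summarized per run. A hypothetical sketch—the `EvalResult` shape and `summarize` helper are assumptions for illustration, not Gradial’s reporting format:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Outcome of one Eval in a run (illustrative shape)."""
    eval_name: str
    passed: bool
    expected: str
    actual: str
    errors: list

def summarize(results):
    """Collapse a run into pass/fail counts plus the failures to inspect."""
    failures = [r for r in results if not r.passed]
    return {
        "total": len(results),
        "passed": len(results) - len(failures),
        "failed": len(failures),
        "failures": [(r.eval_name, r.errors) for r in failures],
    }

run = [
    EvalResult("hero-headline", True, "New Product Launch", "New Product Launch", []),
    EvalResult("add-component", False, "2 cards", "1 card", ["card count mismatch"]),
]
print(summarize(run)["failed"])  # → 1
```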

Step 4: Iterate and Refine

When Evals fail, work with your Gradial team to identify the cause:
  • Agent issue — Improve prompting, patterns, or agent configuration
  • Eval issue — Criteria too strict, edge case not accounted for
  • Environment issue — Component or template needs adjustment
Refine and re-run until Evals pass consistently.

Step 5: Graduate the Use Case

A use case “graduates” when:
  • ✅ Evals pass consistently (not just once)
  • ✅ Results meet your quality standards
  • ✅ Your team accepts the output as production-ready
  • ✅ Edge cases have been identified and handled
Graduated use cases move into ongoing monitoring.
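The “not just once” criterion can be made concrete by requiring a streak of consecutive passing runs. A sketch under assumed rules—the threshold of five runs is an illustrative choice, not a Gradial requirement:

```python
def ready_to_graduate(run_history, required_streak=5):
    """A use case graduates only after passing several consecutive runs.

    run_history is a list of booleans, oldest first; required_streak is
    an illustrative threshold.
    """
    if len(run_history) < required_streak:
        return False
    return all(run_history[-required_streak:])

# An early failure is fine if the most recent runs all pass:
assert ready_to_graduate([False, True, True, True, True, True]) is True
# A recent failure blocks graduation:
assert ready_to_graduate([True, True, False, True, True]) is False
```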

Customer Acceptance

Evals provide objective criteria for accepting Gradial use cases. Rather than subjective assessments of “does this look right?”, Evals give you:

Clear Success Metrics

  • Pass rate — What percentage of Evals succeed
  • Consistency — Whether results stay stable across multiple runs
  • Quality score — How well outputs match expectations
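Pass rate and consistency are both simple computations over run results. A minimal sketch, assuming each run is recorded as a mapping from Eval name to pass/fail (the data shape is hypothetical):

```python
def pass_rate(results):
    """Fraction of Evals that passed in a single run."""
    return sum(results) / len(results)

def consistency(runs):
    """Fraction of Evals whose outcome was identical across every run.

    runs is a list of runs; each run maps eval name -> passed?
    """
    names = runs[0].keys()
    stable = [n for n in names if len({run[n] for run in runs}) == 1]
    return len(stable) / len(names)

runs = [
    {"hero": True, "cards": True,  "migrate": False},
    {"hero": True, "cards": False, "migrate": False},
]
print(round(pass_rate(list(runs[0].values())), 2))  # → 0.67
print(round(consistency(runs), 2))                  # → 0.67
```

Note that “migrate” counts toward consistency even though it fails in both runs: a stable failure is diagnosable, while a flaky Eval like “cards” is the harder problem.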

Documented Expectations

Evals serve as living documentation of what you expect from each use case. This creates alignment between your team and Gradial on exactly what “working” means.

Ongoing Evals: Continuous Quality Assurance

Once use cases graduate, Evals shift from validation to monitoring. They run regularly to catch regressions before they impact your operations.

When Evals Run

Evals can be triggered by:
| Trigger | Why It Matters |
| --- | --- |
| Scheduled intervals | Catch gradual drift or intermittent issues |
| Your releases | Verify Grady still works after CMS updates, new components, or template changes |
| Gradial releases | Confirm platform updates don’t break your use cases |
| Model updates | Detect changes in AI behavior after underlying model changes |

What Evals Catch

Your Environment Changes

When you update components, add new templates, or modify page structures, Evals verify that Grady can still work with your updated environment. This prevents surprises when your development changes reach production.

Gradial Platform Changes

As Gradial releases new features and improvements, Evals ensure these changes don’t negatively impact your existing use cases. Even well-intentioned improvements can have unintended side effects.

AI Model Changes

The AI models powering Grady are periodically updated by their providers. These updates can subtly change how the agent reasons about tasks, formats outputs, or handles edge cases. Evals detect these changes before they affect your content.

Regression Detection

When an Eval that previously passed starts failing, this signals a regression. The Eval results help pinpoint:
  • What changed — Which specific behavior is different
  • When it changed — Correlation with releases or updates
  • Impact scope — How many use cases are affected
This enables fast diagnosis and resolution.
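Detecting “what changed” amounts to diffing the latest run against a known-good baseline. A hypothetical sketch, again assuming each run maps Eval name to pass/fail:

```python
def find_regressions(baseline, current):
    """Evals that passed in the baseline run but fail now.

    Both arguments map eval name -> passed? (illustrative shape).
    """
    return sorted(
        name for name, ok in baseline.items()
        if ok and not current.get(name, False)
    )

baseline = {"hero": True, "cards": True, "migrate": True}
after_release = {"hero": True, "cards": False, "migrate": True}
print(find_regressions(baseline, after_release))  # → ['cards']
```

Tagging each run with the release or model version it ran against turns this diff into the “when it changed” correlation as well.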

Eval Types

Different types of Evals serve different purposes:

Functional Evals

Test whether the agent can complete specific tasks correctly.
  • “Can Grady update a headline?”
  • “Can Grady add a new component to a page?”
  • “Can Grady migrate content from a source URL?”

Quality Evals

Test whether outputs meet quality standards beyond basic functionality.
  • “Does the output follow brand guidelines?”
  • “Is the content structure optimal?”
  • “Are accessibility requirements met?”

Regression Evals

Test that previously working functionality still works.
  • “Does use case X still pass after the latest release?”
  • “Are all graduated use cases still functional?”

Edge Case Evals

Test handling of unusual or boundary conditions.
  • “What happens with very long content?”
  • “How does Grady handle missing images?”
  • “What if a referenced component doesn’t exist?”

Best Practices

Building Effective Evals

  • Be specific — Vague success criteria lead to inconsistent results
  • Test one thing — Each Eval should validate a single behavior
  • Use realistic inputs — Test with content similar to production
  • Include edge cases — Don’t just test the happy path
  • Document expectations — Make it clear what success looks like

Managing Evals Over Time

  • Review regularly — Ensure Evals still reflect current requirements
  • Add new Evals — As you add use cases or discover edge cases
  • Retire obsolete Evals — Remove tests for deprecated functionality
  • Version control — Track changes to Evals over time

Responding to Failures

  • Don’t ignore flaky Evals — Intermittent failures often indicate real issues
  • Investigate promptly — The longer a regression persists, the harder it is to diagnose
  • Communicate broadly — Share Eval status with stakeholders
  • Fix forward — Address root causes, not just symptoms

Eval Lifecycle Summary

┌─────────────────────────────────────────────────────────────────────┐
│                         ONBOARDING PHASE                            │
├─────────────────────────────────────────────────────────────────────┤
│  Define Use Case → Build Evals → Run & Iterate → Graduate          │
└─────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────┐
│                         PRODUCTION PHASE                            │
├─────────────────────────────────────────────────────────────────────┤
│  Scheduled Runs ──┐                                                 │
│  Your Releases ───┼──▶ Run Evals ──▶ Pass? ──▶ Continue            │
│  Gradial Releases─┤                    │                            │
│  Model Updates ───┘                    ▼                            │
│                                      Fail? ──▶ Investigate & Fix    │
└─────────────────────────────────────────────────────────────────────┘