Overview
Evals (evaluations) are structured tests that validate whether Gradial’s AI agent (“Grady”) can successfully complete specific tasks for your organization. They serve as both acceptance criteria during onboarding and ongoing quality assurance as systems evolve. Think of Evals as your automated QA team—they run the same tests consistently, catching issues before they impact your content operations.
Why Evals Matter
Gradial operates in a complex environment where multiple factors can affect output quality:

| Factor | What Can Change | Potential Impact |
|---|---|---|
| Your CMS environment | Component updates, template changes, new patterns | Agent may not recognize or correctly use updated components |
| Your content requirements | New use cases, brand guidelines, compliance rules | Agent needs to learn new constraints |
| Gradial platform | Feature releases, agent improvements, bug fixes | Behavior may change even if improvements are intended |
| Underlying AI models | Model updates from AI providers | Subtle changes in reasoning, output format, or capabilities |
Setting Up Your Eval Environment
Before building Evals, you need a dedicated area in your CMS where tests can run without affecting production content or being affected by other changes.
Create an Isolated Content Tree
During onboarding, work with your team to establish a dedicated section of your content tree specifically for Evals. This can live in a lower environment (stage, QA) or in production—what matters is that it’s clearly separated from content that serves real users.
Requirements for Eval Content
Your Eval content area must meet these criteria:

| Requirement | Why It Matters |
|---|---|
| Agent-only modifications | Only Grady should modify this content—human edits would invalidate test baselines |
| Protected from overwrites | Content syncs, deployments, or bulk operations should never touch this area |
| Isolated from production | Changes here must never impact live customer experiences |
| Persistent and stable | Content should remain available across releases and environment refreshes |
What to Include
Populate your Eval content area with:
- Baseline pages — Known-good pages in each template type that serve as starting points for tests
- Reference pages — Examples of expected outputs for comparison
- Test scenarios — Pages set up for specific Eval cases (e.g., a page with a hero that needs updating)
- Pattern examples — Instances of each Design System pattern for validation
Governance
Establish clear ownership and rules for the Eval content area:
- Document the location — Ensure everyone knows this content is off-limits
- Restrict access — Limit who can manually modify Eval content
- Exclude from deployments — Configure your deployment process to skip this area
- Include in backups — Ensure Eval content is preserved and recoverable
Important: Treat your Eval content as critical infrastructure. If someone accidentally modifies or deletes it, your Evals will produce unreliable results until the content is restored.
Evals in Customer Onboarding
During onboarding, Evals serve as the validation framework for “Grady’s Road to Graduation”—the process of proving the agent can reliably handle your specific use cases.
The Graduation Process
Step 1: Define Use Cases
Work with your Gradial team to identify the specific tasks you need Grady to perform. Common use cases include:
- Simple content updates — Text changes, image swaps, link updates
- Medium content updates — Adding/removing components, restructuring sections
- Complex content updates — Multi-component changes, conditional content
- New page creation — From reference pages, templates, or source documents
- Batch migrations — Large-scale content transitions
Step 2: Build Evals
For each use case, create Evals that test whether Grady can complete the task correctly. An Eval consists of:

| Component | Description | Example |
|---|---|---|
| Input | The task prompt or instruction | “Update the hero headline to ‘New Product Launch’” |
| Context | Reference pages, patterns, or source material | Target page URL, Design System patterns |
| Expected Output | What success looks like | Hero component contains exact headline text |
| Validation Criteria | How to measure success | Text match, component structure, no regressions |
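The four components above can be captured in a small data structure. This is a minimal sketch in Python, not Gradial’s actual Eval schema—every name here (`Eval`, `validators`, the example page path) is a hypothetical illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Eval:
    """One Eval: a task prompt plus the criteria for judging the result.

    Field names are illustrative only; Gradial's real schema may differ.
    """
    name: str
    input: str               # the task prompt given to the agent
    context: dict            # e.g. target page URL, Design System patterns
    expected_output: str     # what success looks like
    validators: list = field(default_factory=list)  # callables: output -> bool


# Example: the hero-headline task from the table above
hero_update = Eval(
    name="update-hero-headline",
    input="Update the hero headline to 'New Product Launch'",
    context={"page": "/evals/baseline/landing-page"},  # hypothetical path
    expected_output="New Product Launch",
    validators=[lambda output: "New Product Launch" in output],
)
```

Keeping validators as plain callables makes the validation criteria explicit and repeatable rather than a matter of eyeballing the page.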
Step 3: Run Evals
Execute the Evals against your actual AEM environment. Each run produces:
- Pass/Fail status for each Eval
- Detailed results showing what the agent did
- Comparison data between expected and actual output
- Error logs if something went wrong
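A run harness that produces those four outputs might look like the following sketch. `eval_spec` and `agent_run` are stand-ins—`agent_run` represents whatever actually invokes Grady against the CMS, which is not shown here:

```python
def run_eval(eval_spec, agent_run):
    """Run one Eval and collect the outputs listed above.

    eval_spec: dict with "name", "input", "context", "expected", and
    "validators" (callables applied to the agent's output).
    agent_run: callable(input, context) -> output; a hypothetical stand-in
    for the real agent invocation.
    """
    result = {"eval": eval_spec["name"], "passed": False,
              "actual": None, "errors": []}
    try:
        actual = agent_run(eval_spec["input"], eval_spec["context"])
        result["actual"] = actual                         # detailed results
        result["passed"] = all(v(actual) for v in eval_spec["validators"])
        if not result["passed"]:                          # comparison data
            result["errors"].append(
                f"expected {eval_spec['expected']!r}, got {actual!r}")
    except Exception as exc:                              # error logs
        result["errors"].append(str(exc))
    return result
```

Note that an agent crash is recorded as a failure with an error log rather than aborting the whole run, so one broken Eval never hides the results of the others.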
Step 4: Iterate and Refine
When Evals fail, work with your Gradial team to identify the cause:
- Agent issue — Improve prompting, patterns, or agent configuration
- Eval issue — Criteria too strict, edge case not accounted for
- Environment issue — Component or template needs adjustment
Step 5: Graduate the Use Case
A use case “graduates” when:
- ✅ Evals pass consistently (not just once)
- ✅ Results meet your quality standards
- ✅ Your team accepts the output as production-ready
- ✅ Edge cases have been identified and handled
Customer Acceptance
Evals provide objective criteria for accepting Gradial use cases. Rather than subjective assessments of “does this look right?”, Evals give you:
Clear Success Metrics
- Pass rate — What percentage of Evals succeed
- Consistency — Do results stay stable across multiple runs
- Quality score — How well outputs match expectations
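The first two metrics fall directly out of the run history. A minimal sketch, assuming each Eval’s history is just a list of pass/fail booleans (the function name and data shape are illustrative, not Gradial’s reporting API):

```python
def eval_metrics(runs):
    """Summarise repeated Eval runs into pass rate and consistency.

    runs: dict mapping an Eval name to its time-ordered list of
    pass/fail booleans, e.g. {"update-hero": [True, True, False]}.
    """
    total = sum(len(history) for history in runs.values())
    passed = sum(sum(history) for history in runs.values())
    pass_rate = passed / total if total else 0.0
    # An Eval is "consistent" if every run had the same outcome
    consistent = sum(len(set(history)) == 1 for history in runs.values())
    consistency = consistent / len(runs) if runs else 0.0
    return {"pass_rate": pass_rate, "consistency": consistency}
```

An Eval that passes half the time can still drag the pass rate up while scoring zero on consistency—which is exactly why the graduation criteria below require passing “consistently (not just once).”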
Documented Expectations
Evals serve as living documentation of what you expect from each use case. This creates alignment between your team and Gradial on exactly what “working” means.
Ongoing Evals: Continuous Quality Assurance
Once use cases graduate, Evals shift from validation to monitoring. They run regularly to catch regressions before they impact your operations.
When Evals Run
Evals can be triggered by:

| Trigger | Why It Matters |
|---|---|
| Scheduled intervals | Catch gradual drift or intermittent issues |
| Your releases | Verify Grady still works after CMS updates, new components, or template changes |
| Gradial releases | Confirm platform updates don’t break your use cases |
| Model updates | Detect changes in AI behavior after underlying model changes |
What Evals Catch
Your Environment Changes
When you update components, add new templates, or modify page structures, Evals verify that Grady can still work with your updated environment. This prevents surprises when your development changes reach production.
Gradial Platform Changes
As Gradial releases new features and improvements, Evals ensure these changes don’t negatively impact your existing use cases. Even well-intentioned improvements can have unintended side effects.
AI Model Changes
The AI models powering Grady are periodically updated by their providers. These updates can subtly change how the agent reasons about tasks, formats outputs, or handles edge cases. Evals detect these changes before they affect your content.
Regression Detection
When an Eval that previously passed starts failing, this signals a regression. The Eval results help pinpoint:
- What changed — Which specific behavior is different
- When it changed — Correlation with releases or updates
- Impact scope — How many use cases are affected
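Pinpointing “when it changed” is a simple scan over the run history once each run is labeled with a date or release tag. A sketch under that assumption (the data shape is hypothetical):

```python
def first_regression(history):
    """Find when a previously passing Eval started failing.

    history: time-ordered list of (run_label, passed) pairs, where
    run_label might be a date or a release tag.
    Returns the label of the first failure after a pass, or None if the
    Eval never regressed (including Evals that never passed at all).
    """
    seen_pass = False
    for label, passed in history:
        if passed:
            seen_pass = True
        elif seen_pass:
            return label   # correlate this label with releases or model updates
    return None
```

Comparing the returned label against your release calendar and Gradial’s changelog narrows the cause to one of the trigger categories in the table above.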
Eval Types
Different types of Evals serve different purposes:
Functional Evals
Test whether the agent can complete specific tasks correctly.
- “Can Grady update a headline?”
- “Can Grady add a new component to a page?”
- “Can Grady migrate content from a source URL?”
Quality Evals
Test whether outputs meet quality standards beyond basic functionality.
- “Does the output follow brand guidelines?”
- “Is the content structure optimal?”
- “Are accessibility requirements met?”
Regression Evals
Test that previously working functionality still works.
- “Does use case X still pass after the latest release?”
- “Are all graduated use cases still functional?”
Edge Case Evals
Test handling of unusual or boundary conditions.
- “What happens with very long content?”
- “How does Grady handle missing images?”
- “What if a referenced component doesn’t exist?”
Best Practices
Building Effective Evals
- Be specific — Vague success criteria lead to inconsistent results
- Test one thing — Each Eval should validate a single behavior
- Use realistic inputs — Test with content similar to production
- Include edge cases — Don’t just test the happy path
- Document expectations — Make it clear what success looks like
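“Be specific” is easiest to see side by side. Both checks below are hypothetical validators against rendered page HTML; the first is too vague to give consistent results, the second pins down exactly one behavior:

```python
import re


def vague_check(page_html):
    # Vague: "the page has a hero" -- passes on almost anything,
    # so failures and successes tell you very little
    return "hero" in page_html.lower()


def specific_check(page_html):
    # Specific: exactly one <h1>, containing exactly the expected headline.
    # A second <h1> or altered text fails, so a failure points at one behavior.
    headlines = re.findall(r"<h1[^>]*>(.*?)</h1>", page_html)
    return headlines == ["New Product Launch"]
```

The specific check is also self-documenting: reading it tells you precisely what “success” means for this Eval, which supports the “document expectations” practice above.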
Managing Evals Over Time
- Review regularly — Ensure Evals still reflect current requirements
- Add new Evals — As you add use cases or discover edge cases
- Retire obsolete Evals — Remove tests for deprecated functionality
- Version control — Track changes to Evals over time
Responding to Failures
- Don’t ignore flaky Evals — Intermittent failures often indicate real issues
- Investigate promptly — The longer a regression persists, the harder it is to diagnose
- Communicate broadly — Share Eval status with stakeholders
- Fix forward — Address root causes, not just symptoms