A Typical Eval Workflow: Systematically Improving Your Prompts

Vuk Dukic
Founder, AI/ML Engineer
April 23, 2026


Prompt engineering is both an art and a science. While crafting the perfect prompt often feels like trial and error, a structured evaluation workflow transforms this process into a systematic, measurable practice. Whether you're building a customer support chatbot, a content generation tool, or a research assistant, understanding how to objectively measure and improve your prompts is essential for production-quality AI applications.

This guide walks through the five core steps of a typical prompt evaluation workflow, showing you how to move from guesswork to data-driven prompt optimization.


Why Prompt Evaluation Matters

Before diving into the workflow, it's worth understanding why evaluation is critical:

  • Objectivity: Replace subjective judgment ("this feels better") with quantifiable metrics
  • Reproducibility: Track changes over time and understand what actually improves performance
  • Confidence: Deploy prompts to production knowing they've been tested against representative scenarios
  • Iteration speed: Quickly test variations and identify what works without manual review of every response

Now let's break down the five-step workflow.


Step 1: Draft a Prompt

Every evaluation starts with a baseline prompt—your initial attempt at solving the problem. This doesn't need to be perfect; it's simply your starting point for measurement and improvement.

Example baseline prompt:

prompt = f"""
Please answer the user's question:

{question}
"""

This simple template takes a user question and asks Claude to answer it. While basic, it gives us a foundation to measure against. The key is to start somewhere, even if you know the prompt will need refinement.

Best practices for your initial prompt:

  • Keep it simple and clear
  • Include the core instruction you want the model to follow
  • Use template variables (like {question}) for dynamic content
  • Document your intent—what are you trying to achieve?
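
One lightweight way to follow these practices is to wrap the template in a small helper that documents its intent alongside the text. The function below is a minimal sketch of that idea; the name build_prompt is just an example, not part of any library.

def build_prompt(question: str) -> str:
    """Baseline prompt: answer the user's question directly.

    Intent: establish a simple, measurable starting point before adding
    formatting rules, examples, or other refinements.
    """
    return f"""
Please answer the user's question:

{question}
"""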

Step 2: Create an Eval Dataset

Your evaluation dataset is a collection of representative inputs that mirror real-world usage. Think of it as a test suite for your prompt—it should cover the range of scenarios your application will encounter.

Example dataset:

eval_dataset = [
    "What's 2+2?",
    "How do I make oatmeal?",
    "How far away is the Moon?"
]

This toy example includes three diverse questions: a simple math problem, a how-to request, and a factual query. In production, your dataset might contain:

  • Tens of examples for quick iteration during development
  • Hundreds of examples for comprehensive testing before deployment
  • Thousands of examples for large-scale applications with diverse use cases

How to build your dataset:

  1. Manual curation: Collect real user queries from logs, support tickets, or user research
  2. Synthetic generation: Use Claude to generate representative examples based on your use case
  3. Edge case inclusion: Add challenging, ambiguous, or adversarial examples to test robustness
  4. Diversity: Ensure coverage across different question types, complexity levels, and domains

Example of using Claude to generate eval data:

dataset_generation_prompt = """
Generate 20 diverse customer support questions for a SaaS product 
that offers project management tools. Include:
- Technical troubleshooting questions
- Feature requests
- Billing inquiries
- How-to questions
- Account management issues

Format as a JSON array of strings.
"""

Step 3: Feed Through Claude

With your prompt template and dataset ready, the next step is to generate responses. For each item in your dataset, merge it with your prompt template and send it to Claude.

Example execution:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

responses = []

for question in eval_dataset:
    # Merge question into prompt template
    full_prompt = f"""
Please answer the user's question:

{question}
"""

    # Send to Claude
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": full_prompt}]
    )

    responses.append({
        "question": question,
        "answer": response.content[0].text
    })

Sample outputs:

| Question | Claude's Response |
|----------|-------------------|
| What's 2+2? | 2 + 2 = 4 |
| How do I make oatmeal? | To make oatmeal: 1) Bring water or milk to a boil, 2) Add oats, 3) Reduce heat and simmer for 5 minutes, stirring occasionally, 4) Add toppings like fruit, nuts, or honey. |
| How far away is the Moon? | The Moon is approximately 238,855 miles (384,400 kilometers) from Earth on average. This distance varies slightly due to the Moon's elliptical orbit. |

At this stage, you have a complete set of question-answer pairs ready for evaluation.


Step 4: Feed Through a Grader

The grader is the heart of your evaluation workflow. It objectively scores each of Claude's responses, typically on a scale from 1 to 10, based on criteria relevant to your use case.

Common grading approaches:

1. LLM-as-Judge

Use Claude (or another LLM) to evaluate the responses:

grading_prompt = f"""
You are an expert evaluator. Score the following answer on a scale of 1-10 based on:
- Accuracy
- Completeness
- Clarity
- Helpfulness

Question: {question}
Answer: {answer}

Provide only a numeric score from 1-10.
"""

2. Rule-based grading

For structured outputs, check for specific criteria:

def grade_response(question, answer):
    score = 10
    if len(answer) < 20:
        score -= 3  # Too brief
    if "I don't know" in answer:
        score -= 5  # Unhelpful
    if question.lower() in answer.lower():
        score += 1  # Good context awareness
    return max(1, min(10, score))

3. Human evaluation

For high-stakes applications, have human reviewers score responses.

4. Ground truth comparison

If you have known correct answers, measure similarity or exact match:

def grade_with_ground_truth(answer, ground_truth):
    if answer.strip() == ground_truth.strip():
        return 10
    # Use semantic similarity for partial credit
    similarity = calculate_similarity(answer, ground_truth)
    return int(similarity * 10)
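
The calculate_similarity helper above is left abstract. One simple stand-in is Python's built-in difflib ratio, shown below, though an embedding-based similarity would give more meaningful partial credit.

from difflib import SequenceMatcher

def calculate_similarity(answer: str, ground_truth: str) -> float:
    """Rough string similarity in [0, 1]; swap in embedding cosine similarity
    for a more semantic comparison."""
    return SequenceMatcher(None, answer.lower(), ground_truth.lower()).ratio()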

Example grading results:

| Question | Answer | Score | Reasoning |
|----------|--------|-------|-----------|
| What's 2+2? | 2 + 2 = 4 | 10 | Perfect accuracy |
| How do I make oatmeal? | To make oatmeal: 1) Bring water... | 4 | Lacks detail on ratios, cooking times |
| How far away is the Moon? | The Moon is approximately 238,855 miles... | 9 | Accurate and informative |

Average score: (10 + 4 + 9) ÷ 3 ≈ 7.67

This baseline score gives you an objective measurement to improve upon.


Step 5: Change Prompt and Repeat

Now comes the iterative improvement cycle. Based on your grading results, modify your prompt and run the entire evaluation again to see if performance improves.

Example iteration:

Looking at our scores, the oatmeal question performed poorly (score: 4). The response lacked detail. Let's modify the prompt to encourage more comprehensive answers:

improved_prompt = f"""
Please answer the user's question:

{question}

Answer the question with ample detail, including specific steps, 
measurements, or context where relevant.
"""

Re-run the evaluation:

After running the improved prompt through the same dataset and grading process, you might see:

| Question | Original Score | New Score | Change |
|----------|----------------|-----------|--------|
| What's 2+2? | 10 | 10 | No change |
| How do I make oatmeal? | 4 | 8 | +4 improvement |
| How far away is the Moon? | 9 | 9 | No change |

New average: (10 + 8 + 9) ÷ 3 = 9.0

The improved prompt increased our average score from 7.67 to 9.0—a measurable improvement driven by adding detail guidance.

Common prompt improvements to test:

  • Add examples (few-shot prompting)
  • Specify output format (JSON, markdown, bullet points)
  • Include constraints (length limits, tone requirements)
  • Add role context ("You are an expert chef...")
  • Clarify edge cases ("If you don't know, say so clearly")
  • Chain-of-thought ("Think step-by-step before answering")
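
A convenient way to test several of these variations against the same dataset is to keep each candidate prompt as a named template. The dictionary below is purely illustrative; the variant names and wording are placeholders, not recommendations, and each template uses {question} as a placeholder to be filled in at eval time.

# Illustrative prompt variants; {question} is filled in with str.format() during the eval run
prompt_variants = {
    "baseline": "Please answer the user's question:\n\n{question}",
    "detailed": (
        "Please answer the user's question:\n\n{question}\n\n"
        "Answer the question with ample detail, including specific steps, "
        "measurements, or context where relevant."
    ),
    "expert_role": (
        "You are a careful, knowledgeable assistant. Think step-by-step "
        "before answering, and say so clearly if you don't know.\n\n"
        "Question: {question}"
    ),
}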

Prompt Scoring: The Key Benefit

The power of this workflow lies in objective, numerical comparison:

Compare versions side-by-side

Baseline prompt:      7.67 average score
+ Detail instruction: 9.0 average score  ← Clear winner
+ Examples added:     8.2 average score

Track improvements over time

Week 1: 7.67
Week 2: 9.0 (+17% improvement)
Week 3: 9.3 (+3% improvement)

Make confident deployment decisions

  • Deploy the version with the highest score
  • Set minimum score thresholds for production (e.g., "only deploy if score > 8.5")
  • A/B test prompts in production with real traffic

Remove guesswork

Instead of asking "Does this feel better?", you can definitively say "This version scores 1.33 points higher on our eval set."
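
Putting the pieces together, a small harness can score each variant on the same dataset and only flag a candidate for deployment when it clears your threshold. This is a sketch that assumes the client, eval_dataset, grade_with_llm helper, and prompt_variants dictionary from the earlier snippets.

def evaluate_prompt(template: str) -> float:
    """Run one prompt template over the eval dataset and return its average grade."""
    scores = []
    for question in eval_dataset:
        full_prompt = template.format(question=question)
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": full_prompt}]
        )
        scores.append(grade_with_llm(question, response.content[0].text))
    return sum(scores) / len(scores)

results = {name: evaluate_prompt(template) for name, template in prompt_variants.items()}

for name, avg in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {avg:.2f}")

# Only promote the best variant if it clears the production threshold
best_name, best_score = max(results.items(), key=lambda item: item[1])
if best_score > 8.5:
    print(f"Deploy candidate: {best_name} ({best_score:.2f})")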


Scaling Your Eval Workflow

Start Small

  • 5-10 examples: Quick sanity checks during development
  • Manual grading: Review responses yourself initially
  • Simple prompts: Test one variable at a time

Scale Up

  • 50-100 examples: Comprehensive testing before production
  • Automated grading: Use LLM-as-judge or rule-based scoring
  • Multiple graders: Combine different evaluation criteria

Production-Grade

  • 1000+ examples: Enterprise applications with diverse use cases
  • Continuous evaluation: Run evals on every prompt change
  • Multi-dimensional scoring: Accuracy, safety, tone, compliance, cost
  • Human-in-the-loop: Sample-based human review for quality assurance
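
As one illustration of the multi-dimensional scoring mentioned above, you can keep a separate grader per dimension and combine them with weights that reflect your priorities. The dimensions and weights below are placeholders, not a recommended rubric.

def combined_score(dimension_scores: dict) -> float:
    """Weighted average across scoring dimensions; the weights are illustrative."""
    weights = {"accuracy": 0.4, "safety": 0.3, "tone": 0.2, "cost": 0.1}
    return sum(weights[dim] * dimension_scores[dim] for dim in weights)

print(combined_score({"accuracy": 9, "safety": 10, "tone": 8, "cost": 7}))  # 8.9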

Tools and Frameworks

While you can build this workflow from scratch with basic Python, several tools can accelerate the process:

Open source:

  • PromptFoo: Evaluation framework with built-in graders and reporting
  • OpenAI Evals: Evaluation framework from OpenAI
  • Anthropic's Prompt Engineering Tools: Official utilities and examples

Paid platforms:

  • LangSmith: Tracing and evaluation for LangChain applications
  • Braintrust: Evaluation and observability platform
  • HumanLoop: Prompt management and evaluation
  • Weights & Biases: Experiment tracking with prompt eval support

DIY approach: A simple Python script with Claude API calls and CSV logging can be surprisingly effective for many use cases.
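
For example, a DIY harness can append each run's averages to a CSV file so score changes stay visible across prompt versions. The file name and columns below are arbitrary choices, and the snippet assumes the results dictionary from the comparison sketch above.

import csv
from datetime import datetime, timezone

def log_eval_run(path: str, results: dict) -> None:
    """Append one row per prompt variant: timestamp, variant name, average score."""
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for name, avg_score in results.items():
            writer.writerow([timestamp, name, round(avg_score, 2)])

log_eval_run("eval_history.csv", results)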


Best Practices

1. Version control your prompts

Treat prompts like code—use git to track changes and link commits to eval scores.

2. Document your grading criteria

Be explicit about what makes a "10" vs. a "7" so scoring is consistent.

3. Test on realistic data

Synthetic examples are useful, but real user queries reveal edge cases.

4. Monitor for regression

When improving one area, ensure you're not degrading performance elsewhere.

5. Iterate quickly

Run small experiments frequently rather than big changes infrequently.

6. Combine quantitative and qualitative

Numbers guide you, but reading actual responses reveals why scores changed.


Conclusion

A systematic evaluation workflow transforms prompt engineering from an art into a science. By following these five steps—draft, dataset, generate, grade, iterate—you can:

  • Measure prompt performance objectively
  • Compare variations with confidence
  • Deploy prompts knowing they've been tested
  • Continuously improve based on data

Start small with a handful of examples and simple grading, then scale up as your application matures. The key is establishing the habit of measurement early, so every prompt change is informed by evidence rather than intuition.

Ready to start evaluating? Pick a prompt you're currently using, create 5-10 test cases, and run your first eval. You might be surprised what the numbers reveal.


Want to dive deeper? Check out Anthropic's prompt engineering guide and evaluation best practices for advanced techniques and real-world examples.
