Found 6 bookmarks
Parloa's Bayesian Framework to A/B Test AI Agents
Learn about our hierarchical Bayesian model for A/B testing AI agents. It combines deterministic binary metrics and LLM-judge scores into a single framework that accounts for variation across different groups.
·parloa.com·
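The post itself isn't excerpted here, but as a rough, assumption-laden illustration of the Bayesian A/B-testing idea the blurb describes, the sketch below compares two agent variants using independent Beta-Binomial posteriors per group. It is a deliberate simplification: per the blurb, Parloa's model is hierarchical (pooling information across groups) and also folds in LLM-judge scores, neither of which this toy does, and all the numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: (successes, trials) per group for two agent variants.
# Illustrative numbers only, not from the Parloa post.
groups = {
    "billing":  {"A": (42, 60), "B": (50, 60)},
    "shipping": {"A": (30, 55), "B": (33, 55)},
}

def posterior_samples(successes, trials, n=20_000, a=1.0, b=1.0):
    """Beta(a, b) prior + Binomial likelihood -> Beta posterior samples."""
    return rng.beta(a + successes, b + trials - successes, size=n)

for name, arms in groups.items():
    pa = posterior_samples(*arms["A"])
    pb = posterior_samples(*arms["B"])
    # Monte Carlo estimate of P(variant B has a higher success rate).
    print(f"{name}: P(B beats A) = {(pb > pa).mean():.3f}")
```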
Building resilient prompts using an evaluation flywheel
This cookbook provides a practical guide on how to use the OpenAI Platform to easily build resilience into your prompts. A resilient prompt...
·cookbook.openai.com·
The "think" tool: Enabling Claude to stop and think \ Anthropic
The "think" tool: Enabling Claude to stop and think \ Anthropic
A blog post for developers, describing a new method for complex tool-use situations
The primary evaluation metric used in τ-bench is pass^k, which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential. (A minimal sketch of both estimators follows this entry.)
·anthropic.com·
The "think" tool: Enabling Claude to stop and think \ Anthropic
Aligning LLM-as-a-Judge with Human Preferences
A deep dive into self-improving evaluators in LangSmith, motivated by the rise of LLM-as-a-Judge evaluators and by research on few-shot learning and on aligning judges with human preferences.
·blog.langchain.dev·
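As a loose sketch of the few-shot alignment idea behind self-improving evaluators (this is not LangSmith's API; every name below is hypothetical), a judge can replay human-corrected verdicts as few-shot examples in its grading prompt:

```python
from dataclasses import dataclass, field

@dataclass
class FewShotJudge:
    rubric: str
    corrections: list[str] = field(default_factory=list)  # human-reviewed examples

    def build_prompt(self, question: str, answer: str) -> str:
        # Replay the most recent human corrections as few-shot examples.
        shots = "\n\n".join(self.corrections[-5:])
        return (
            f"You are grading answers.\nRubric: {self.rubric}\n\n"
            f"Past graded examples:\n{shots}\n\n"
            f"Question: {question}\nAnswer: {answer}\nGrade (PASS/FAIL) and why:"
        )

    def record_correction(self, question, answer, verdict, reason):
        # A human overrode the judge; keep the corrected example for reuse.
        self.corrections.append(
            f"Question: {question}\nAnswer: {answer}\nGrade: {verdict}\nWhy: {reason}"
        )

judge = FewShotJudge(rubric="Answer must state the refund policy correctly.")
judge.record_correction("Can I return after 30 days?", "Yes, anytime.", "FAIL",
                        "Policy allows returns only within 30 days.")
print(judge.build_prompt("Can I return after 30 days?", "No, returns close at 30 days."))
```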