Found 6 bookmarks
Parloa's Bayesian Framework to A/B Test AI Agents
Learn about our hierarchical Bayesian model for A/B testing AI agents. It combines deterministic binary metrics and LLM-judge scores into a single framework that accounts for variation across different groups.
·parloa.com·
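The post itself isn't excerpted here, but as a rough, assumption-laden illustration of the Bayesian A/B-testing idea the blurb describes, the sketch below compares two agent variants using independent Beta-Binomial posteriors per group. It is a deliberate simplification: per the blurb, Parloa's model is hierarchical (pooling information across groups) and also folds in LLM-judge scores, neither of which this toy does, and all the numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: (successes, trials) per group for two agent variants.
# Illustrative numbers only, not from the Parloa post.
groups = {
    "billing":  {"A": (42, 60), "B": (50, 60)},
    "shipping": {"A": (30, 55), "B": (33, 55)},
}

def posterior_samples(successes, trials, n=20_000, a=1.0, b=1.0):
    """Beta(a, b) prior + Binomial likelihood -> Beta posterior samples."""
    return rng.beta(a + successes, b + trials - successes, size=n)

for name, arms in groups.items():
    pa = posterior_samples(*arms["A"])
    pb = posterior_samples(*arms["B"])
    # Monte Carlo estimate of P(variant B has a higher success rate).
    print(f"{name}: P(B beats A) = {(pb > pa).mean():.3f}")
```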
Building resilient prompts using an evaluation flywheel
This cookbook provides a practical guide on how to use the OpenAI Platform to easily build resilience into your prompts. A resilient prompt...
·cookbook.openai.com·
The "think" tool: Enabling Claude to stop and think \ Anthropic
The "think" tool: Enabling Claude to stop and think \ Anthropic
A blog post for developers, describing a new method for complex tool-use situations
The primary evaluation metric used in τ-bench is pass^k, which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential. (A minimal sketch of both estimators follows this entry.)
·anthropic.com·
The "think" tool: Enabling Claude to stop and think \ Anthropic
Aligning LLM-as-a-Judge with Human Preferences
A deep dive into self-improving evaluators in LangSmith, motivated by the rise of LLM-as-a-Judge evaluators and by research on few-shot learning and on aligning judges with human preferences.
·blog.langchain.dev·
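As a loose sketch of the few-shot alignment idea behind self-improving evaluators (this is not LangSmith's API; every name below is hypothetical), a judge can replay human-corrected verdicts as few-shot examples in its grading prompt:

```python
from dataclasses import dataclass, field

@dataclass
class FewShotJudge:
    rubric: str
    corrections: list[str] = field(default_factory=list)  # human-reviewed examples

    def build_prompt(self, question: str, answer: str) -> str:
        # Replay the most recent human corrections as few-shot examples.
        shots = "\n\n".join(self.corrections[-5:])
        return (
            f"You are grading answers.\nRubric: {self.rubric}\n\n"
            f"Past graded examples:\n{shots}\n\n"
            f"Question: {question}\nAnswer: {answer}\nGrade (PASS/FAIL) and why:"
        )

    def record_correction(self, question, answer, verdict, reason):
        # A human overrode the judge; keep the corrected example for reuse.
        self.corrections.append(
            f"Question: {question}\nAnswer: {answer}\nGrade: {verdict}\nWhy: {reason}"
        )

judge = FewShotJudge(rubric="Answer must state the refund policy correctly.")
judge.record_correction("Can I return after 30 days?", "Yes, anytime.", "FAIL",
                        "Policy allows returns only within 30 days.")
print(judge.build_prompt("Can I return after 30 days?", "No, returns close at 30 days."))
```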