Evaluation

23 bookmarks

Newest

LLM-as-a-Judge Simply Explained: The Complete Guide to Run LLM Evals at Scale - Confident AI

In this article, I'll debunk what LLM judges are and go through why they are the best for LLM evaluation.

Concept

·confident-ai.com·Apr 24, 2026

LLM-as-a-Judge Simply Explained: The Complete Guide to Run LLM Evals at Scale - Confident AI

Trace learning for self-improving agents

Concept

·x.com·Apr 2, 2026

Trace learning for self-improving agents

Agent Quality

Ebook

·up.raindrop.io·Dec 18, 2025

Agent Quality

Evals Flashcards – Hamel’s Blog - Hamel Husain

Notes on applied AI engineering, machine learning, and data science.

Concept

·hamel.dev·Dec 3, 2025

Evals Flashcards – Hamel’s Blog - Hamel Husain

Parloa's Bayesian Framework to A/B Test AI Agents

Learn about our hierarchical Bayesian model for A/B testing AI agents. It combines deterministic binary metrics and LLM-judge scores into a single framework that accounts for variation across different groups

Design Pattern

·parloa.com·Dec 3, 2025

Parloa's Bayesian Framework to A/B Test AI Agents

How to Correctly Report LLM-as-a-Judge Evaluations

Paper

·arxiv.org·Nov 27, 2025

How to Correctly Report LLM-as-a-Judge Evaluations

Evaluation-Driven Development of LLM Agents: A Process Model and Reference Architecture

Evaluation-Driven Development of LLM Agents

Unlike deterministic systems, an LLM agent’s output is often probabilistic, meaning multiple responses may be valid within a given scenario.

Paper #eval-driven-development

·arxiv.org·Oct 6, 2025

Evaluation-Driven Development of LLM Agents: A Process Model and Reference Architecture

Eval Driven System Design - From Prototype to Production

This cookbook provides a practical, end-to-end guide on how to effectively use evals as the core process in creating a production-grade a...

Design Pattern #eval-driven-development

·cookbook.openai.com·Oct 6, 2025

Eval Driven System Design - From Prototype to Production

An LLM-as-Judge Won't Save The Product—Fixing Your Process Will

Applying the scientific method, building via eval-driven development, and monitoring AI output.

Building product evals is simply the scientific method in disguise. That’s the secret sauce. It’s a cycle of inquiry, experimentation, and analysis.

Concept

·eugeneyan.com·Oct 6, 2025

An LLM-as-Judge Won't Save The Product—Fixing Your Process Will

Building resilient prompts using an evaluation flywheel

This cookbook provides a practical guide on how to use the OpenAI Platform to easily build resilience into your prompts. A resilient prom...

Design Pattern

·cookbook.openai.com·Oct 6, 2025

Building resilient prompts using an evaluation flywheel

Turbocharging Customer Support Chatbot Development with LLM-Based Automated Evaluation

Key Contributors: Lily Sierra, Nour Alkhatib, Steven Gross, Jacquelene Obeid, Kyle Swint, Monta Shen, Gary Song, Riddhima Sejpal, Jatin…

Example

·tech.instacart.com·Jul 15, 2025

Turbocharging Customer Support Chatbot Development with LLM-Based Automated Evaluation

Evaluating Long-Context Question & Answer Systems

Evaluation metrics, how to build eval datasets, eval methodology, and a review of several benchmarks.

Deep Dive

·eugeneyan.com·Jun 25, 2025

Evaluating Long-Context Question & Answer Systems

The "think" tool: Enabling Claude to stop and think \ Anthropic

A blog post for developers, describing a new method for complex tool-use situations

The primary evaluation metric used in τ-bench is pass^k, which measures the probability that all k independent task trials are successful for a given task, averaged across all tasks. Unlike the pass@k metric that is common for other LLM evaluations (which measures if at least one of k trials succeeds), pass^k evaluates consistency and reliability—critical qualities for customer service applications where consistent adherence to policies is essential.

Design Pattern

·anthropic.com·Apr 6, 2025

The "think" tool: Enabling Claude to stop and think \ Anthropic

Evaluating Quality in Large Language Models: A Comprehensive Approach using the legal industry as a…

Evaluating the quality of outputs from Large Language Models (LLMs) is an intricate task due to the open-ended nature of many LLM tasks…

Concept

·medium.com·Dec 16, 2024

Evaluating Quality in Large Language Models: A Comprehensive Approach using the legal industry as a…

Check grounding with RAG | Vertex AI Agent Builder | Google Cloud

Check grounding with RAG

Tool

·cloud.google.com·Nov 28, 2024

Check grounding with RAG | Vertex AI Agent Builder | Google Cloud

Creating a LLM-as-a-Judge That Drives Business Results –

A step-by-step guide with my learnings from 30+ AI implementations.

Tutorial

·hamel.dev·Nov 28, 2024

Creating a LLM-as-a-Judge That Drives Business Results –

SCIPE - Systematic Chain Improvement and Problem Evaluation

Related to LangChain

Design Pattern

·blog.langchain.dev·Nov 7, 2024

SCIPE - Systematic Chain Improvement and Problem Evaluation

Task-Specific LLM Evals that Do & Don't Work

Evals for classification, summarization, translation, copyright regurgitation, and toxicity.

Concept #summary

·eugeneyan.com·Oct 4, 2024

Task-Specific LLM Evals that Do & Don't Work

Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)

Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.

Concept #summary

·eugeneyan.com·Sep 9, 2024

Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)

Aligning LLM-as-a-Judge with Human Preferences

Deep dive into self-improving evaluators in LangSmith, motivated by the rise of LLM-as-a-Judge evaluators plus research on few-shot learning and aligning human preferences.

Design Pattern

·blog.langchain.dev·Jun 27, 2024

Aligning LLM-as-a-Judge with Human Preferences

LlamaIndex: RAG Evaluation Showdown with GPT-4 vs. Open-Source Prometheus Model — LlamaIndex, Data Framework for LLM Applications

LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models (LLMs).

Example

·llamaindex.ai·May 22, 2024

LlamaIndex: RAG Evaluation Showdown with GPT-4 vs. Open-Source Prometheus Model — LlamaIndex, Data Framework for LLM Applications

Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation - Hugging Face Open-Source AI Cookbook

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Tutorial

·huggingface.co·May 22, 2024

Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation - Hugging Face Open-Source AI Cookbook

RAG Evaluation - Hugging Face Open-Source AI Cookbook

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Tutorial

·huggingface.co·May 22, 2024

RAG Evaluation - Hugging Face Open-Source AI Cookbook