LLM-as-a-Judge Simply Explained: The Complete Guide to Run LLM Evals at Scale (confident-ai.com, Apr 24, 2026). In this article, I'll explain what LLM judges are and why they are the best-suited approach for LLM evaluation.
Task-Specific LLM Evals that Do & Don't Work (eugeneyan.com, Oct 4, 2024). Evals for classification, summarization, translation, copyright regurgitation, and toxicity.
Evals Flashcards – Hamel's Blog, Hamel Husain (hamel.dev, Dec 3, 2025). Notes on applied AI engineering, machine learning, and data science.
An LLM-as-Judge Won't Save The Product—Fixing Your Process Will (eugeneyan.com, Oct 6, 2025). Applying the scientific method, building via eval-driven development, and monitoring AI output. Building product evals is simply the scientific method in disguise. That's the secret sauce. It's a cycle of inquiry, experimentation, and analysis.
Evaluating Quality in Large Language Models: A Comprehensive Approach using the legal industry as a… (medium.com, Dec 16, 2024). Evaluating the quality of outputs from Large Language Models (LLMs) is an intricate task due to the open-ended nature of many LLM tasks…
Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge) (eugeneyan.com, Sep 9, 2024). Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.
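The common thread across these links is the LLM-as-a-judge pattern: using one model to grade another model's output against a rubric. Below is a minimal, illustrative sketch of that pattern; the judge prompt, the 1–5 scale, and the `call_llm` callable are assumptions for illustration, not code taken from any of the articles above.

```python
# Minimal sketch of the LLM-as-a-judge pattern described in the links above.
# `call_llm` is a placeholder for whatever completion function you use
# (OpenAI, Anthropic, a local model); everything here is illustrative.
import json
from typing import Callable

JUDGE_PROMPT = """You are grading an assistant's answer.

Question: {question}
Answer: {answer}

Rate the answer's correctness and helpfulness from 1 (poor) to 5 (excellent).
Respond with JSON only: {{"score": <int>, "reason": "<one sentence>"}}"""


def judge(question: str, answer: str, call_llm: Callable[[str], str]) -> dict:
    """Ask a judge model to score an answer; returns {"score": int, "reason": str}."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    result = json.loads(raw)  # assumes the judge follows the JSON-only instruction
    result["score"] = int(result["score"])
    return result


# Usage with a stubbed judge; swap in a real completion call in practice.
if __name__ == "__main__":
    fake_llm = lambda prompt: '{"score": 4, "reason": "Accurate but missing one caveat."}'
    print(judge("What is LLM-as-a-judge?",
                "Using an LLM to grade another LLM's output against a rubric.",
                fake_llm))
```

In practice, the articles above stress that the judge prompt, score scale, and alignment with human labels matter far more than the scaffolding shown here.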