Report

21 bookmarks
Slidecrafting
How to create slides with Quarto.
·slidecrafting-book.com·
TICKing All the Boxes: Generated Checklists Improve LLM Evaluation...
Given the widespread adoption and usage of Large Language Models (LLMs), it is crucial to have flexible and interpretable evaluations of their instruction-following ability. Preference judgments between model outputs have become the de facto evaluation standard, despite distilling complex, multi-faceted preferences into a single ranking. Furthermore, as human annotation is slow and costly, LLMs are increasingly used to make these judgments, at the expense of reliability and interpretability. In this work, we propose TICK (Targeted Instruct-evaluation with ChecKlists), a fully automated, interpretable evaluation protocol that structures evaluations with LLM-generated, instruction-specific checklists. We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists that decompose the instruction into a series of YES/NO questions. Each question asks whether a candidate response meets a specific requirement of the instruction. We demonstrate that using TICK leads to a significant increase (46.4% $\to$ 52.2%) in the frequency of exact agreements between LLM judgements and human preferences, as compared to having an LLM directly score an output. We then show that STICK (Self-TICK) can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection. STICK self-refinement on LiveBench reasoning tasks leads to an absolute gain of $+$7.8%, whilst Best-of-N selection with STICK attains $+$6.3% absolute improvement on the real-world instruction dataset, WildBench. In light of this, structured, multi-faceted self-improvement is shown to be a promising way to further advance LLM capabilities. Finally, by providing LLM-generated checklists to human evaluators tasked with directly scoring LLM responses to WildBench instructions, we notably increase inter-annotator agreement (0.194 $\to$ 0.256).
·arxiv.org·
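The checklist protocol described in the abstract can be sketched in a few lines: decompose an instruction into YES/NO questions, apply each to a candidate response, and use the pass rate as the score. This is an illustrative toy, not the paper's code; the checks are simple string predicates standing in for LLM-generated judgments, and the instruction and checklist are hypothetical.

```python
# Toy sketch of checklist-style evaluation in the spirit of TICK:
# each YES/NO question becomes a predicate over the response,
# and the score is the fraction of checks that pass.

def evaluate_with_checklist(response, checklist):
    """Apply each YES/NO check to the response; return (score, per-check results)."""
    results = {question: check(response) for question, check in checklist.items()}
    score = sum(results.values()) / len(results)
    return score, results

# Hypothetical checklist for the instruction:
# "Summarize the paper in under 50 words and mention the benchmark used."
checklist = {
    "Is the summary under 50 words?": lambda r: len(r.split()) < 50,
    "Does it mention the benchmark?": lambda r: "WildBench" in r,
}

score, results = evaluate_with_checklist(
    "TICK structures evaluation with generated checklists and is tested on WildBench.",
    checklist,
)
print(score)  # fraction of checks passed
```

In the paper the predicates are themselves LLM calls, which is what makes the protocol interpretable: each failed check points at a specific unmet requirement rather than a single opaque rank.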
tjmlabs/ColiVara: ColiVara is a suite of services that allows you to store, search, and retrieve documents based on their visual embeddings. ColiVara has state-of-the-art retrieval performance on both text and visual documents, using vision models instead of chunking and text processing. No OCR, no text extraction, no broken tables, no missing images.
·github.com·
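The core idea behind visual-embedding retrieval can be shown in miniature: each page is stored as a vector produced by a vision model, and queries are matched by similarity rather than over OCR'd text. This sketch is not ColiVara's API; the page names and embeddings are made up, and a real system would obtain the vectors from a vision model rather than hard-coding them.

```python
import math

# Illustrative sketch of retrieval over visual document embeddings:
# rank stored page vectors by cosine similarity to a query vector.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, pages, top_k=2):
    """Return the names of the top_k pages most similar to the query embedding."""
    ranked = sorted(pages.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical precomputed page embeddings (stand-ins for vision-model outputs).
pages = {
    "invoice_p1": [0.9, 0.1, 0.0],
    "chart_p2": [0.1, 0.9, 0.2],
    "table_p3": [0.0, 0.2, 0.9],
}

print(retrieve([0.8, 0.2, 0.1], pages, top_k=1))  # → ['invoice_p1']
```

Because the embedding is computed from the rendered page image, tables, figures, and layout survive intact; there is simply no text-extraction step to break them.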
Best Vision Language Models for Document Data Extraction
Compare performance, cost, and accuracy of leading Vision Language Models including GPT-4V, Claude 3.5, and open-source alternatives. Real-world testing on document processing tasks.
·nanonets.com·
GitHub - bytedance/pasa: PaSa -- an advanced paper search agent powered by large language models. It can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries.
·github.com·
Reducto Document Ingestion API
Reducto is an API that provides high quality data ingestion for large language models (LLMs). It works with any vector database or embedding system. It can parse PDFs, Excel, PowerPoint, and more.
·reducto.ai·
DeepSeekV3, Gemini, Mixtral and many others are all Mixture of Experts (MoEs).
But what exactly are MoEs? 🤔 A Mixture of Experts (MoE) is a machine learning framework that resembles a team of specialists, each adept at handling different aspects of a complex task. It's like… — Akshay 🚀 (@akshay_pachaar)
·x.com·
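The "team of specialists" analogy in the thread maps directly onto top-k gated routing: a gate scores every expert for a token, only the k highest-scoring experts run, and their outputs are mixed by the softmax of the gate scores. The sketch below is a toy, not any named model's code; the experts are scalar functions standing in for feed-forward blocks, and the gate scores are hard-coded rather than learned.

```python
import math

# Toy sketch of Mixture-of-Experts top-k routing.

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, experts, gate_scores, k=2):
    """Route `token` to the top-k experts by gate score and mix their outputs."""
    # Select the k highest-scoring experts; the rest do no work at all,
    # which is why MoEs can be large without proportional compute cost.
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    # Weighted sum of only the selected experts' outputs.
    return sum(w * experts[i](token) for w, i in zip(weights, top))

# Hypothetical experts: simple scalar functions standing in for FFN blocks.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
out = moe_forward(3.0, experts, gate_scores=[0.1, 2.0, 0.5, 1.5], k=2)
print(out)
```

With k=2 here, experts 1 and 3 are chosen (scores 2.0 and 1.5), so only two of the four experts ever execute for this token.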
What can VLMs bring to RAG beyond an input modality change?
For the “R”, our DSE drops the document-processing step and improves relevance modeling by preserving content integrity. Now for the “G”, we propose VISA, aiming to take a step towards more verifiable and intuitive V-RAG.… — Xueguang Ma (@xueguang_ma)
·x.com·
Predictions for the Future of RAG - jxnl.co
Explore the future of RAG in report generation, enhancing decision-making and resource allocation for businesses.
·jxnl.co·