LLM Evaluation Resources

A curated list of frameworks, benchmarks, and tools for evaluating large language models and AI agents.

Last updated: 06/04/2026

Evaluation Frameworks

General-purpose frameworks for building and running LLM evaluations.
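
Most frameworks in this category boil down to the same loop: run a model over a dataset, score each output, and aggregate. A minimal sketch of that loop, assuming a hypothetical `model_fn` callable and labelled examples (the names are illustrative, not tied to any specific framework):

```python
# Minimal evaluation loop: run a model over labelled examples and aggregate a score.
# `model_fn` is a hypothetical callable (prompt -> completion); swap in any client you use.

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 if the normalised prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_eval(model_fn, examples):
    """Return per-example scores and their mean for a list of {'prompt', 'reference'} dicts."""
    scores = [exact_match(model_fn(ex["prompt"]), ex["reference"]) for ex in examples]
    return scores, sum(scores) / len(scores)

if __name__ == "__main__":
    examples = [
        {"prompt": "Capital of France?", "reference": "Paris"},
        {"prompt": "2 + 2 =", "reference": "4"},
    ]
    # Stub model for demonstration; replace with a real API call.
    scores, mean = run_eval(lambda p: "Paris" if "France" in p else "4", examples)
    print(f"accuracy = {mean:.2f}")
```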

Benchmarks

Standardised benchmark suites and leaderboards for comparing model performance.
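
Benchmark suites typically report per-task scores plus a macro average across tasks, which is what most leaderboards rank on. A rough sketch of that aggregation, using made-up task names and results rather than any real benchmark's data:

```python
# Macro-averaging per-task accuracies, as benchmark leaderboards commonly report.
# The task names and scores below are placeholders, not real results.

from statistics import mean

task_scores = {
    "reading_comprehension": [1, 0, 1, 1],   # per-example correctness (0/1)
    "arithmetic":            [1, 1, 0],
    "commonsense":           [0, 1, 1, 1, 0],
}

per_task = {task: mean(scores) for task, scores in task_scores.items()}
macro_average = mean(per_task.values())  # each task weighted equally, regardless of size

for task, acc in per_task.items():
    print(f"{task:24s} {acc:.3f}")
print(f"{'macro average':24s} {macro_average:.3f}")
```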

Observability & Monitoring Platforms

End-to-end platforms combining evaluation with tracing, logging, and monitoring.
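
The core primitive these platforms add on top of evaluation is the trace: each model call is recorded with its inputs, outputs, latency, and metadata so it can be scored or inspected later. A minimal sketch of that idea using a plain decorator; the record layout is an assumption, not any particular platform's schema:

```python
# Minimal tracing wrapper: record inputs, outputs, and latency for each model call.
# The trace record layout here is illustrative, not a specific platform's schema.

import functools
import time

TRACES = []  # a real platform ships these to a backend rather than keeping them in memory

def traced(fn):
    @functools.wraps(fn)
    def wrapper(prompt: str, **kwargs):
        start = time.perf_counter()
        output = fn(prompt, **kwargs)
        TRACES.append({
            "name": fn.__name__,
            "prompt": prompt,
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    return wrapper

@traced
def fake_model(prompt: str) -> str:
    return "stub completion"  # replace with a real API call

fake_model("Summarise this document.")
print(TRACES[0]["latency_s"])
```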

Security & Adversarial Testing

Tools focused on red-teaming, safety, and robustness testing.
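
At its simplest, red-teaming sweeps a bank of adversarial prompts through the model and flags completions that comply rather than refuse. The refusal check in the sketch below is a deliberately naive keyword heuristic used only to show the shape of the loop; the tools in this category use classifiers or graded rubrics instead.

```python
# Naive red-team sweep: flag completions that do not look like refusals.
# The keyword-based refusal check is a crude stand-in for a real safety classifier.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def looks_like_refusal(completion: str) -> bool:
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def red_team(model_fn, adversarial_prompts):
    """Return the prompts whose completions were not refused."""
    failures = []
    for prompt in adversarial_prompts:
        completion = model_fn(prompt)
        if not looks_like_refusal(completion):
            failures.append({"prompt": prompt, "completion": completion})
    return failures

# Stub model that refuses everything; replace with a real client to test for regressions.
failures = red_team(lambda p: "I can't help with that.", ["<adversarial prompt here>"])
print(f"{len(failures)} prompts were not refused")
```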

Agentic Performance

Benchmarks and tools for evaluating LLMs acting as autonomous agents.
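
Agent benchmarks usually report task success rate: the agent gets an environment, a step budget, and a goal check, and the metric is the fraction of episodes it completes. A stripped-down sketch of that loop; the `agent_step` and `goal_reached` hooks are assumptions for illustration, not a real harness API:

```python
# Task success rate for an agent under a step budget.
# `agent_step` and `goal_reached` are hypothetical hooks, not a real harness API.

def run_episode(agent_step, goal_reached, initial_state, max_steps: int = 10) -> bool:
    """Run one episode; return True if the goal is reached within the step budget."""
    state = initial_state
    for _ in range(max_steps):
        state = agent_step(state)
        if goal_reached(state):
            return True
    return False

def success_rate(agent_step, goal_reached, tasks, max_steps: int = 10) -> float:
    results = [run_episode(agent_step, goal_reached, t, max_steps) for t in tasks]
    return sum(results) / len(results)

# Toy example: the "agent" increments a counter, the goal is to reach 3.
rate = success_rate(lambda s: s + 1, lambda s: s >= 3, tasks=[0, 0, 2], max_steps=3)
print(f"success rate = {rate:.2f}")
```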

Tool-Calling Evaluation

Tools for evaluating function calling and tool-use capabilities.
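
Tool-calling evals typically compare the function call a model emits (name plus arguments, usually as JSON) against a gold call. A minimal sketch of that comparison; the call format shown is generic JSON, not any specific vendor's schema:

```python
# Compare a predicted function call against a gold call: the name must match exactly
# and the arguments must match as key/value pairs. The JSON layout here is generic.

import json

def score_tool_call(predicted_json: str, expected: dict) -> dict:
    try:
        predicted = json.loads(predicted_json)
    except json.JSONDecodeError:
        return {"valid_json": False, "name_match": False, "args_match": False}
    return {
        "valid_json": True,
        "name_match": predicted.get("name") == expected["name"],
        "args_match": predicted.get("arguments") == expected["arguments"],
    }

expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
predicted = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
print(score_tool_call(predicted, expected))
```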

RAG Evaluation

Tools specifically designed for evaluating retrieval-augmented generation pipelines.
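
RAG evals usually split into retrieval metrics (did the right documents come back?) and generation metrics (is the answer grounded in what came back?). A small sketch of recall@k for the retrieval half; the document IDs are placeholders:

```python
# Recall@k for the retrieval half of a RAG pipeline: the fraction of relevant
# documents that appear in the top-k retrieved results. IDs are placeholders.

def recall_at_k(retrieved_ids, relevant_ids, k: int) -> float:
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(top_k & relevant) / len(relevant)

retrieved = ["doc_7", "doc_2", "doc_9", "doc_4"]   # ranked retriever output
relevant = ["doc_2", "doc_4"]                      # gold labels for this query
print(f"recall@3 = {recall_at_k(retrieved, relevant, k=3):.2f}")  # doc_2 found, doc_4 missed -> 0.50
```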

Domain-Specific Evaluation

Evaluation tools targeting specific domains or verticals.

Utilities & Experimentation

Tools for model comparison, grid search, and inference optimization.
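
A common pattern in this category is a small grid search: sweep a few configurations (model, temperature, prompt template), run the same eval on each, and compare. A sketch of the sweep structure, with a stubbed scoring function standing in for a real evaluation run:

```python
# Grid search over a few generation settings, scoring each configuration with the
# same eval. `run_eval_for_config` is a stub standing in for a real evaluation run.

from itertools import product

models = ["model-a", "model-b"]           # placeholder identifiers
temperatures = [0.0, 0.7]

def run_eval_for_config(model: str, temperature: float) -> float:
    # Replace with: build a client for `model`, generate with `temperature`, score outputs.
    return 0.5 + (0.1 if temperature == 0.0 else 0.0)

results = [
    {"model": m, "temperature": t, "score": run_eval_for_config(m, t)}
    for m, t in product(models, temperatures)
]
best = max(results, key=lambda r: r["score"])
print(f"best config: {best['model']} @ temperature={best['temperature']} (score {best['score']:.2f})")
```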

Other Resource Lists

Related curated lists and collections covering LLM evaluation and adjacent topics.