LLM Evaluation Resources

A curated list of frameworks, benchmarks, and tools for evaluating large language models and AI agents.

Type Badges

Library — Python/JS library or SDK — use programmatically
CLI — Command-line tool
Server — Self-hosted web application with UI
Desktop — Desktop GUI application

Evaluation Frameworks

General-purpose frameworks for building and running LLM evaluations.

deepeval — Library · The LLM Evaluation Framework
confident-ai/deepeval View on GitHub
openevals — Library · Ready-made evaluators for your LLM apps
langchain-ai/openevals View on GitHub
evalite — CLI Desktop · Evaluate your LLM-powered apps with TypeScript (local dev UI)
mattpocock/evalite View on GitHub
lighteval — Library CLI · All-in-one toolkit for evaluating LLMs
huggingface/lighteval View on GitHub
evalscope — Library CLI · Framework for efficient LLM/VLM evaluation
modelscope/evalscope View on GitHub
giskard-oss — Library Server · Open-source evaluation and testing library for LLM agents
Giskard-AI/giskard-oss View on GitHub
Eval — Library · High-performance LLM evaluation with parallel API calls
ai-twinkle/Eval View on GitHub
simple-llm-eval — Library · Simple LLM evaluation using LLM-as-a-judge
cyberark/simple-llm-eval View on GitHub
evidently — Library Server · Evaluate, test, and monitor ML and LLM-powered systems
evidentlyai/evidently View on GitHub
ai-eval — Library · Prompt evaluation and optimization system for LLM applications
stellarlinkco/ai-eval View on GitHub
neuro-judge — Library · LLM-as-a-Judge framework with multi-model, multi-criteria, and cost tracking
furqan1pk/neuro-judge View on GitHub
OpenJudge — Library · Unified framework for holistic evaluation and quality rewards
agentscope-ai/OpenJudge View on GitHub
One-Eval — Library · Automated LLM evaluation system via agents
OpenDCAI/One-Eval View on GitHub
GAGE — Library CLI · Unified evaluation engine for LLMs, MLLMs, audio, and diffusion models
HiThink-Research/GAGE View on GitHub
lm-evaluation-harness — Library CLI · A framework for few-shot evaluation of language models
EleutherAI/lm-evaluation-harness View on GitHub

Benchmarks

Standardised benchmark suites and leaderboards for comparing model performance.

opencompass — Library CLI · LLM evaluation platform supporting 100+ datasets
open-compass/opencompass View on GitHub
VLMEvalKit — Library CLI · Evaluation toolkit for large multi-modality models
open-compass/VLMEvalKit View on GitHub
olmes — Library CLI · Reproducible, flexible LLM evaluations
allenai/olmes View on GitHub
bench — CLI · A tool for evaluating LLMs
arthur-ai/bench View on GitHub
genai-bench — CLI · Token-level performance evaluation of LLM serving systems
sgl-project/genai-bench View on GitHub
sparse-frontier — Library · Evaluation framework for training-free sparse attention in LLMs
PiotrNawrot/sparse-frontier View on GitHub

Observability & Monitoring Platforms

End-to-end platforms combining evaluation with tracing, logging, and monitoring.

langfuse — Server · Open-source LLM engineering platform with observability
langfuse/langfuse View on GitHub
opik — Server · Debug, evaluate, and monitor your LLM applications
comet-ml/opik View on GitHub
trulens — Library Server · Evaluation and tracking for LLM experiments and AI agents
truera/trulens View on GitHub
langwatch — Server · Platform for LLM evaluations and AI agent testing
langwatch/langwatch View on GitHub
openlit — Server · OpenTelemetry-native LLM observability, monitoring, and evaluations
openlit/openlit View on GitHub
mlflow — Library Server · Open-source platform for agents, LLMs, and ML with debugging and monitoring
mlflow/mlflow View on GitHub
agenta — Server · Open-source LLMOps platform: playground, management, evaluation, and observability
Agenta-AI/agenta View on GitHub

Security & Adversarial Testing

Tools focused on red-teaming, safety, and robustness testing.

rogue — Library · Stress-test your AI agents before attackers do
qualifire-dev/rogue View on GitHub
moonshot — Library Server · Evaluating and red-teaming LLM applications with benchmarking
aiverify-foundation/moonshot View on GitHub
agent-security-sandbox — Library · Benchmark for indirect prompt injection defenses in LLM agents
X-PG13/agent-security-sandbox View on GitHub
AgentDefense-Bench — Library · Security benchmark for infrastructure-layer defenses in MCP-based agent systems
arunsanna/AgentDefense-Bench View on GitHub

Agentic Performance

Benchmarks and tools for evaluating LLMs acting as autonomous agents.

AgentBench — Library CLI · Comprehensive benchmark to evaluate LLMs as agents
THUDM/AgentBench View on GitHub
claw-eval — Library · Human-verified evaluation harness for evaluating LLMs as agents
claw-eval/claw-eval View on GitHub
strands-agents/evals — Library · Comprehensive evaluation framework for AI agents and LLM applications
strands-agents/evals View on GitHub
agentdojo — Library · Dynamic environment to evaluate attacks and defenses for LLM agents
ethz-spylab/agentdojo View on GitHub
AgentCPM — Library · End-to-end infrastructure for training and evaluating LLM agents
OpenBMB/AgentCPM View on GitHub
MemoryAgentBench — Library · Benchmark for evaluating memory in LLM agents via multi-turn interactions
HUST-AI-HYZ/MemoryAgentBench View on GitHub
HaluMem — Library · Operation-level hallucination evaluation benchmark for agent memory systems
MemTensor/HaluMem View on GitHub
HarnessLab — Library · Benchmark for evaluating LLM agent harness components (context, retry, memory)
polskiTran/HarnessLab View on GitHub
ResearchHarness — CLI · Trusted-local harness for research agents with real tool use and evaluation
black-yt/ResearchHarness View on GitHub
iris-eval/mcp-server — Server · Agent eval standard for MCP — quality scoring, safety, and cost budgets
iris-eval/mcp-server View on GitHub

Tool-Calling Evaluation

Tools for evaluating function calling and tool-use capabilities.

gorilla — Library CLI · Training and evaluating LLMs for function calls
ShishirPatil/gorilla View on GitHub

RAG Evaluation

Tools specifically designed for evaluating retrieval-augmented generation pipelines.

ragas — Library · Supercharge your LLM application evaluations (RAG-focused)
vibrantlabsai/ragas View on GitHub
AutoRAG — Library Server · RAG AutoML tool for finding the optimal RAG pipeline
Marker-Inc-Korea/AutoRAG View on GitHub

Domain-Specific Evaluation

Evaluation tools targeting specific domains or verticals.

med-lm-envs — Library · Automated LLM evaluation suite for medical tasks
MedARC-AI/med-lm-envs View on GitHub
MedEvalKit — Library · A unified medical evaluation framework
alibaba-damo-academy/MedEvalKit View on GitHub

Utilities & Experimentation

Tools for model comparison, grid search, and inference optimization.

ollama-grid-search — Desktop · Desktop app for evaluating and comparing multiple LLM models via grid search
dezoito/ollama-grid-search View on GitHub
speculators — Library · Library for speculative decoding algorithms for LLM inference
vllm-project/speculators View on GitHub

Other Resource Lists

llm-benchmark — Curated list of LLM evaluation frameworks and benchmarks
terryyz/llm-benchmark View on GitHub