LLM Evaluation Resources
A curated list of frameworks, benchmarks, and tools for evaluating large language models and AI agents.
Type Badges
Library— Python/JS library or SDK — use programmaticallyCLI— Command-line toolServer— Self-hosted web application with UIDesktop— Desktop GUI application
Evaluation Frameworks
General-purpose frameworks for building and running LLM evaluations.
deepeval —
confident-ai/deepeval View on GitHubLibrary· The LLM Evaluation Frameworkopenevals —
langchain-ai/openevals View on GitHubLibrary· Ready-made evaluators for your LLM appsevalite —
mattpocock/evalite View on GitHubCLIDesktop· Evaluate your LLM-powered apps with TypeScript (local dev UI)lighteval —
huggingface/lighteval View on GitHubLibraryCLI· All-in-one toolkit for evaluating LLMsevalscope —
modelscope/evalscope View on GitHubLibraryCLI· Framework for efficient LLM/VLM evaluationgiskard-oss —
Giskard-AI/giskard-oss View on GitHubLibraryServer· Open-source evaluation and testing library for LLM agentsEval —
ai-twinkle/Eval View on GitHubLibrary· High-performance LLM evaluation with parallel API callssimple-llm-eval —
cyberark/simple-llm-eval View on GitHubLibrary· Simple LLM evaluation using LLM-as-a-judgeevidently —
evidentlyai/evidently View on GitHubLibraryServer· Evaluate, test, and monitor ML and LLM-powered systemsai-eval —
stellarlinkco/ai-eval View on GitHubLibrary· Prompt evaluation and optimization system for LLM applicationsneuro-judge —
furqan1pk/neuro-judge View on GitHubLibrary· LLM-as-a-Judge framework with multi-model, multi-criteria, and cost trackingOpenJudge —
agentscope-ai/OpenJudge View on GitHubLibrary· Unified framework for holistic evaluation and quality rewardsOne-Eval —
OpenDCAI/One-Eval View on GitHubLibrary· Automated LLM evaluation system via agentsGAGE —
HiThink-Research/GAGE View on GitHubLibraryCLI· Unified evaluation engine for LLMs, MLLMs, audio, and diffusion modelslm-evaluation-harness —
EleutherAI/lm-evaluation-harness View on GitHubLibraryCLI· A framework for few-shot evaluation of language models
Benchmarks
Standardised benchmark suites and leaderboards for comparing model performance.
opencompass —
open-compass/opencompass View on GitHubLibraryCLI· LLM evaluation platform supporting 100+ datasetsVLMEvalKit —
open-compass/VLMEvalKit View on GitHubLibraryCLI· Evaluation toolkit for large multi-modality modelsolmes —
allenai/olmes View on GitHubLibraryCLI· Reproducible, flexible LLM evaluationsbench —
arthur-ai/bench View on GitHubCLI· A tool for evaluating LLMsgenai-bench —
sgl-project/genai-bench View on GitHubCLI· Token-level performance evaluation of LLM serving systemssparse-frontier —
PiotrNawrot/sparse-frontier View on GitHubLibrary· Evaluation framework for training-free sparse attention in LLMs
Observability & Monitoring Platforms
End-to-end platforms combining evaluation with tracing, logging, and monitoring.
langfuse —
langfuse/langfuse View on GitHubServer· Open-source LLM engineering platform with observabilityopik —
comet-ml/opik View on GitHubServer· Debug, evaluate, and monitor your LLM applicationstrulens —
truera/trulens View on GitHubLibraryServer· Evaluation and tracking for LLM experiments and AI agentslangwatch —
langwatch/langwatch View on GitHubServer· Platform for LLM evaluations and AI agent testingopenlit —
openlit/openlit View on GitHubServer· OpenTelemetry-native LLM observability, monitoring, and evaluationsmlflow —
mlflow/mlflow View on GitHubLibraryServer· Open-source platform for agents, LLMs, and ML with debugging and monitoringagenta —
Agenta-AI/agenta View on GitHubServer· Open-source LLMOps platform: playground, management, evaluation, and observability
Security & Adversarial Testing
Tools focused on red-teaming, safety, and robustness testing.
rogue —
qualifire-dev/rogue View on GitHubLibrary· Stress-test your AI agents before attackers domoonshot —
aiverify-foundation/moonshot View on GitHubLibraryServer· Evaluating and red-teaming LLM applications with benchmarkingagent-security-sandbox —
X-PG13/agent-security-sandbox View on GitHubLibrary· Benchmark for indirect prompt injection defenses in LLM agentsAgentDefense-Bench —
arunsanna/AgentDefense-Bench View on GitHubLibrary· Security benchmark for infrastructure-layer defenses in MCP-based agent systems
Agentic Performance
Benchmarks and tools for evaluating LLMs acting as autonomous agents.
AgentBench —
THUDM/AgentBench View on GitHubLibraryCLI· Comprehensive benchmark to evaluate LLMs as agentsclaw-eval —
claw-eval/claw-eval View on GitHubLibrary· Human-verified evaluation harness for evaluating LLMs as agentsstrands-agents/evals —
strands-agents/evals View on GitHubLibrary· Comprehensive evaluation framework for AI agents and LLM applicationsagentdojo —
ethz-spylab/agentdojo View on GitHubLibrary· Dynamic environment to evaluate attacks and defenses for LLM agentsAgentCPM —
OpenBMB/AgentCPM View on GitHubLibrary· End-to-end infrastructure for training and evaluating LLM agentsMemoryAgentBench —
HUST-AI-HYZ/MemoryAgentBench View on GitHubLibrary· Benchmark for evaluating memory in LLM agents via multi-turn interactionsHaluMem —
MemTensor/HaluMem View on GitHubLibrary· Operation-level hallucination evaluation benchmark for agent memory systemsHarnessLab —
polskiTran/HarnessLab View on GitHubLibrary· Benchmark for evaluating LLM agent harness components (context, retry, memory)ResearchHarness —
black-yt/ResearchHarness View on GitHubCLI· Trusted-local harness for research agents with real tool use and evaluationiris-eval/mcp-server —
iris-eval/mcp-server View on GitHubServer· Agent eval standard for MCP — quality scoring, safety, and cost budgets
Tool-Calling Evaluation
Tools for evaluating function calling and tool-use capabilities.
gorilla —
ShishirPatil/gorilla View on GitHubLibraryCLI· Training and evaluating LLMs for function calls
RAG Evaluation
Tools specifically designed for evaluating retrieval-augmented generation pipelines.
ragas —
vibrantlabsai/ragas View on GitHubLibrary· Supercharge your LLM application evaluations (RAG-focused)AutoRAG —
Marker-Inc-Korea/AutoRAG View on GitHubLibraryServer· RAG AutoML tool for finding the optimal RAG pipeline
Domain-Specific Evaluation
Evaluation tools targeting specific domains or verticals.
med-lm-envs —
MedARC-AI/med-lm-envs View on GitHubLibrary· Automated LLM evaluation suite for medical tasksMedEvalKit —
alibaba-damo-academy/MedEvalKit View on GitHubLibrary· A unified medical evaluation framework
Utilities & Experimentation
Tools for model comparison, grid search, and inference optimization.
ollama-grid-search —
dezoito/ollama-grid-search View on GitHubDesktop· Desktop app for evaluating and comparing multiple LLM models via grid searchspeculators —
vllm-project/speculators View on GitHubLibrary· Library for speculative decoding algorithms for LLM inference
Other Resource Lists
llm-benchmark — Curated list of LLM evaluation frameworks and benchmarks
terryyz/llm-benchmark View on GitHub