AgentBench

Active · ★ 3.5k · Apache-2.0 · Updated 2026-02-08

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

AgentBench is a comprehensive benchmark for evaluating Large Language Models (LLMs) as agents across diverse environments, now featuring a function-calling version integrated with AgentRL. It provides a containerized setup for various tasks like OS interaction, database operations, and web shopping, enabling robust and reproducible agent evaluation.

#LLM Evaluation · #Agent Benchmarking · #Function Calling · #Docker · #Multi-task Learning
$ Install
$ pip install -r requirements.txt
01

Features

01 Comprehensive LLM-as-Agent evaluation across diverse environments.
02 Function-calling integration for advanced agent interaction.
03 Fully containerized deployment using Docker Compose for reproducibility.
04 Multi-task and multi-turn interaction for realistic agent assessment.
05 Extensible framework for adding new evaluation tasks.
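
To make the multi-turn, function-calling evaluation idea concrete, here is a minimal Python sketch. It is not AgentBench's API: `ToolCall`, `CounterEnv`, `run_episode`, `scripted_agent`, and `success_rate` are hypothetical names invented for illustration, and the toy counter task stands in for a real environment such as OS interaction or database operations.

```python
from dataclasses import dataclass
from typing import Callable

# Every name below is a hypothetical illustration, not part of AgentBench.


@dataclass
class ToolCall:
    """One function call emitted by the agent in a single turn."""
    name: str
    arguments: dict


@dataclass
class CounterEnv:
    """Toy task: reach `target` by calling `add`, then call `submit`."""
    target: int
    value: int = 0
    done: bool = False
    success: bool = False

    def step(self, call: ToolCall) -> str:
        """Apply one tool call and return a text observation."""
        if call.name == "add":
            self.value += int(call.arguments["amount"])
            return f"value={self.value}"
        if call.name == "submit":
            self.done = True
            self.success = self.value == self.target
            return "submitted"
        return f"unknown tool: {call.name}"


# An agent maps the observation history to its next tool call.
Agent = Callable[[list[str]], ToolCall]


def run_episode(env: CounterEnv, agent: Agent, max_turns: int = 10) -> bool:
    """Run one multi-turn episode; stop on submit or when turns run out."""
    history = [f"target={env.target}"]
    for _ in range(max_turns):
        call = agent(history)
        history.append(env.step(call))
        if env.done:
            break
    return env.success


def scripted_agent(history: list[str]) -> ToolCall:
    """Stand-in for an LLM: add the missing amount, then submit."""
    target = int(history[0].split("=")[1])
    current = 0
    for observation in history[1:]:
        if observation.startswith("value="):
            current = int(observation.split("=")[1])
    if current == target:
        return ToolCall("submit", {})
    return ToolCall("add", {"amount": target - current})


def success_rate(targets: list[int], agent: Agent) -> float:
    """Fraction of episodes the agent solves across several task instances."""
    results = [run_episode(CounterEnv(target=t), agent) for t in targets]
    return sum(results) / len(results)
```

In a real run, the scripted agent would be replaced by a model that returns function calls, and each environment would run in its own container. The loop structure (observe, call a tool, record the observation, score at the end) is the part this sketch is meant to show.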
02

Compatibility

Docker · Native · Verified via docs
Python · Native · Verified via docs
OpenAI API · Supported · Verified via docs
Large Language Models · Supported · Verified via docs
03

Quick start

1 Install the Python dependencies from the root of a cloned AgentBench checkout:
$ pip install -r requirements.txt
04

Use cases

↳ Systematically benchmark the performance of various LLM-based agents.
↳ Develop and refine advanced LLM agent architectures and strategies.
↳ Conduct academic research on the capabilities and limitations of agentic AI.
05

Alternatives

GitHub MCP Server · ★ 30.3k
GitHub's official MCP Server. Allows AI agents to interact directly with your GitHub repositories (read files, search code, issues).

genai-toolbox · ★ 15.4k
MCP Toolbox for Databases is an open source MCP server for databases.

chinese-llm-benchmark · ★ 6.1k
ReLE Evaluation: a continuously updated capability benchmark for Chinese AI large models. It currently covers 335 models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.5, Wenxin ERNIE-X1.1, ERNIE-5.0-Thinking, qwen3-max, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as kimi-k2, ernie4.5, minimax-M2, deepseek-v3.2, qwen3-2507, llama4, Zhipu GLM-4.6, gemma3, and mistral. Besides a leaderboard, it provides a library of more than 2 million model defects for the community to study, analyze, and use to improve large models.

FinnewsHunter · ★ 1.4k
FinnewsHunter: Multi-agent financial intelligence platform powered by AgenticX. Real-time news analysis, sentiment fusion, and alpha factor mining.

xLAM · ★ 621
xLAM: A Family of Large Action Models to Empower AI Agent Systems

QuantDinger · ★ 6.9k
AI-driven, local-first quantitative trading platform for research, backtesting and live execution. Python-native, privacy-first, open source.

minima · ★ 1.0k
On-premises conversational RAG with configurable containers

presenton · ★ 7.5k
Open-Source AI Presentation Generator and API (Gamma, Beautiful AI, Decktopus Alternative)

Stats
GitHub Stars · ★ 3.5k
Last commit · 3mo ago
Status · Active
License · Apache-2.0
Category · Observability
Trend (30d) · +0.1k (↑ 4.3%)