Tools Categories Trending New Compare

AgentBench

AgentBench

Active·★ 3.5k·Apache-2.0·Updated 2026-02-08

★ Trending★ Essential

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

AgentBench is a comprehensive benchmark for evaluating Large Language Models (LLMs) as agents across diverse environments, now featuring a function-calling version integrated with AgentRL. It provides a containerized setup for various tasks like OS interaction, database operations, and web shopping, enabling robust and reproducible agent evaluation.

#LLM Evaluation#Agent Benchmarking#Function Calling#Docker#Multi-task Learning

$ Install

$ pip install -r requirements.txt

↗ Visit site ★ GitHub

01

Features

01Comprehensive LLM-as-Agent Evaluation across diverse environments.

02Function Calling integration for advanced agent interaction.

03Fully containerized deployment using Docker Compose for reproducibility.

04Multi-task and multi-turn interaction for realistic agent assessment.

05Extensible framework for adding new evaluation tasks.

02

Compatibility

Docker

Native

Verified via docs

Python

Native

Verified via docs

OpenAI API

Supported

Verified via docs

Large Language Models

Supported

Verified via docs

03

Quick start

1

$ pip install -r requirements.txt

04

Use cases

↳Systematically benchmark the performance of various LLM-based agents.

↳Develop and refine advanced LLM agent architectures and strategies.

↳Conduct academic research on the capabilities and limitations of agentic AI.

05

Alternatives

GitHub MCP Server★ 30.3k

GitHub's official MCP Server. Allows AI agents to interact directly with your GitHub repositories (read files, search code, issues).

genai-toolbox★ 15.4k

MCP Toolbox for Databases is an open source MCP server for databases.

chinese-llm-benchmark★ 6.1k

ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括335个大模型，覆盖chatgpt、gpt-5.2、o4-mini、谷歌gemini-3-pro、Claude-4.5、文心ERNIE-X1.1、ERNIE-5.0-Thinking、qwen3-max、百川、讯飞星火、商汤senseChat等商用模型，以及kimi-k2、ernie4.5、minimax-M2、deepseek-v3.2、qwen3-2507、llama4、智谱GLM-4.6、gemma3、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。

FinnewsHunter★ 1.4k

FinnewsHunter: Multi-agent financial intelligence platform powered by AgenticX. Real-time news analysis, sentiment fusion, and alpha factor mining.

xLAM: A Family of Large Action Models to Empower AI Agent Systems

QuantDinger★ 6.9k

AI-driven, local-first quantitative trading platform for research, backtesting and live execution. Python-native, privacy-first, open source.

On-premises conversational RAG with configurable containers

presenton★ 7.5k

Open-Source AI Presentation Generator and API (Gamma, Beautiful AI, Decktopus Alternative)

See all alternatives →

Related searches

AgentBench Alternatives Best Observability Tools 2026 Open Source Observability AgentBench Tutorial AgentBench Vs Competitors LLM Evaluation Agent Benchmarking Function Calling

Comments

Log in to leave a comment

No comments yet. Be the first!

On this page

01Features 02Compatibility 03Quick start 04Use cases 05Alternatives

Stats

GitHub Stars★ 3.5k

Last commit3mo ago

StatusActive

LicenseApache-2.0

CategoryObservability

Trend (30d)

+0.1k↑ 4.3%

Links

Documentation↗Discussion↗Issues↗Releases↗

Deploy on DigitalOcean — Get $200 Free Credit

© 2026 AgentIndex.app|Built by a 10-year iOS Developer.

QYS GitHub Buy me a coffee ☕

Browse by Category

Code Assistant Workflow Automation RAG / Knowledge Base Multi-Agent Browser Automation LLM Infra Dev Tooling Observability

Not affiliated with Anthropic, OpenAI or Microsoft.