Observability & Evaluation

15 open-source agents tracked · Ranked by trust score · Updated 2026-06-04 18:04 UTC

HVTracker independently evaluates 15 open-source observability & evaluation agents using daily signals from GitHub, package registries, and security databases. Each agent is scored on activity, adoption, transparency, safety, and identity. The top-ranked observability & evaluation agent is MLflow, with a trust score of 90.7/100 (Grade A). Other leading projects include Weights & Biases Weave and Langfuse.

Agents: 15 · Avg Trust: 64.3 · Total Stars: 172.5k · Grade A: 2

#	Agent	Grade	Status	Description	Trust	Stars	Language
1	MLflow	A	Listed	The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, eva…	90.7	26.3k	Python
2	Weights & Biases Weave	A	Listed	Weave is a toolkit for developing AI-powered applications, built by Weights & Biases.	83.6	1.1k	Python
3	Langfuse	B	Listed	🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Inte…	77.4	28.5k	TypeScript
4	Promptfoo	B	Listed	Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C…	76.7	21.9k	TypeScript
5	Arize Phoenix	B	Listed	AI Observability & Evaluation	76.2	10.0k	Python
6	Evidently	B	Listed	Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or d…	72.9	7.6k	Jupyter Notebook
7	Giskard	B	Listed	🐢 Open-Source Evaluation & Testing library for LLM Agents	71.0	5.4k	Python
8	LangWatch	B	Listed	The platform for LLM evaluations and AI agent testing	69.1	3.3k	TypeScript
9	DeepEval	B	Listed	The LLM Evaluation Framework	67.5	15.9k	Python
10	Helicone	C	Listed	🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓	58.4	5.8k	TypeScript
11	Ragas	C	Listed	Supercharge Your LLM Application Evaluations 🚀	56.2	14.2k	Python
12	Opik	D	Listed	Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, autom…	46.7	19.4k	Python
13	TruLens	D	Listed	Evaluation and Tracking for LLM Experiments and AI Agents	42.1	3.4k	Python
14	Agenta	D	Listed	The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one	40.6	4.2k	TypeScript
15	AgentOps	D	Listed	Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frame…	35.1	5.6k	Python
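
The summary figures at the top of the page (Agents, Avg Trust, Total Stars, Grade A) can be cross-checked against the table. The following is a minimal sketch, not HVTracker's own code: the `ROWS` list and the `summarize` helper are hypothetical names, and the values are copied by hand from the Trust, Stars, and grade columns above.

```python
from collections import Counter

# (name, grade, trust score, stars in thousands), copied from the table above.
ROWS = [
    ("MLflow", "A", 90.7, 26.3),
    ("Weights & Biases Weave", "A", 83.6, 1.1),
    ("Langfuse", "B", 77.4, 28.5),
    ("Promptfoo", "B", 76.7, 21.9),
    ("Arize Phoenix", "B", 76.2, 10.0),
    ("Evidently", "B", 72.9, 7.6),
    ("Giskard", "B", 71.0, 5.4),
    ("LangWatch", "B", 69.1, 3.3),
    ("DeepEval", "B", 67.5, 15.9),
    ("Helicone", "C", 58.4, 5.8),
    ("Ragas", "C", 56.2, 14.2),
    ("Opik", "D", 46.7, 19.4),
    ("TruLens", "D", 42.1, 3.4),
    ("Agenta", "D", 40.6, 4.2),
    ("AgentOps", "D", 35.1, 5.6),
]


def summarize(rows):
    """Return agent count, mean trust, total stars (thousands), and grade counts."""
    trust = [row[2] for row in rows]
    stars = [row[3] for row in rows]
    return {
        "agents": len(rows),
        "avg_trust": round(sum(trust) / len(trust), 1),
        "total_stars_k": round(sum(stars), 1),
        "grades": dict(Counter(row[1] for row in rows)),
    }


summary = summarize(ROWS)
print(summary)
```

Because each star count in the table is already rounded to the nearest hundred, the recomputed total can differ slightly from the 172.5k shown on the page.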