🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.
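The validate-and-retry loop described above can be sketched roughly as follows. This is a minimal illustration, not Merbench's actual implementation: `generate` and `validate` are hypothetical callables standing in for the LLM call and the MCP validation tool, and the toy validator only checks for a diagram-type keyword rather than parsing Mermaid. The default cap of five attempts mirrors the limit the benchmark states.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures (assumptions, not Merbench's real API):
#   generate(task, feedback) -> candidate Mermaid source
#   validate(source)         -> (is_valid, error_message)
Generate = Callable[[str, Optional[str]], str]
Validate = Callable[[str], Tuple[bool, str]]


def solve_with_feedback(task: str, generate: Generate, validate: Validate,
                        max_attempts: int = 5) -> Tuple[bool, int]:
    """Return (succeeded, attempts_used) for one test case."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        ok, error = validate(candidate)
        if ok:
            return True, attempt
        feedback = error  # pass the validator's error into the next attempt
    return False, max_attempts


# Toy stand-ins so the sketch runs without an LLM or an MCP server.
def toy_validate(source: str) -> Tuple[bool, str]:
    if source.lstrip().startswith(("graph", "flowchart", "sequenceDiagram")):
        return True, ""
    return False, "missing diagram type declaration"


def toy_generate(task: str, feedback: Optional[str]) -> str:
    body = "A --> B"
    # Only add the diagram header once the validator has complained.
    return body if feedback is None else "flowchart TD\n    " + body
```

With the toy stand-ins, `solve_with_feedback("draw a flow", toy_generate, toy_validate)` fails on the first attempt and succeeds on the second once the error message is fed back.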

Each model is tested across three difficulty levels, with up to five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both an understanding of Mermaid syntax and effective tool usage.
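As a rough illustration of the metric, the sketch below treats each complete run as a single pass/fail outcome and reports the share that passed. That scoring model is an assumption, and `success_rate` and the boolean representation are hypothetical names chosen for the example. Under this reading, 6 passing runs out of 15 would be reported as 40.0%.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of runs that ended with a valid diagram."""
    if not outcomes:
        raise ValueError("need at least one run")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes for 15 runs: 6 passed, 9 failed.
runs = [True] * 6 + [False] * 9
print(f"{success_rate(runs):.1f}%")
```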

Evaluation Summary

Total Evaluation Runs: 180

Models Evaluated: 12

Test Cases

Providers Tested: Amazon, Google

Source Code

Data updated: Jul 6, 2025

Model Leaderboard

Rank	Model	Success Rate	Avg Cost/Run	Avg Duration	Avg Tokens	Runs	Provider
1	gemini-2.5-pro-preview-06-05	40.0%	$0.0455	46.89s	8,693.733	15	Google
2	gemini-2.5-pro-preview-05-06	33.3%	$0.3224	77.49s	46,132.333	15	Google
3	gemini-2.5-pro-preview-03-25	26.7%	$0.2849	100.73s	37,934.067	15	Google
4	gemini-2.5-flash	20.0%	$0.0123	12.85s	12,838.467	15	Google
5	gemini-2.5-flash-lite-preview-06-17	6.7%	$0.0007	4.42s	4,198.2	15	Google
6	gemini-2.5-flash-preview-04-17	6.7%	$0.0539	27.50s	20,486.067	15	Google
7	gemini-2.5-flash-preview-05-20	6.7%	$0.0164	11.22s	7,726.4	15	Google
8	bedrock:us.amazon.nova-premier-v1:0	6.7%	$0.0565	78.33s	15,556.267	15	Amazon
9	bedrock:us.amazon.nova-pro-v1:0	0.0%	$0.0000	50.70s	0	15	Amazon
10	bedrock:us.amazon.nova-micro-v1:0	0.0%	$0.0001	17.39s	1,744.2	15	Amazon
11	bedrock:us.amazon.nova-lite-v1:0	0.0%	$0.0002	25.54s	1,926.667	15	Amazon
12	gemini-2.0-flash	0.0%	$0.0003	7.10s	1,581.533	15	Google
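One way to read cost against success rate in the table is to divide the average cost per run by the success rate, giving an approximate spend per successful run. This derived figure is not a metric Merbench reports; `cost_per_success` is a hypothetical helper, and the rows are a subset copied from the leaderboard above.

```python
# (model, success rate in %, average cost per run in USD), copied from the leaderboard
ROWS = [
    ("gemini-2.5-pro-preview-06-05", 40.0, 0.0455),
    ("gemini-2.5-pro-preview-05-06", 33.3, 0.3224),
    ("gemini-2.5-flash", 20.0, 0.0123),
    ("gemini-2.5-flash-lite-preview-06-17", 6.7, 0.0007),
    ("bedrock:us.amazon.nova-pro-v1:0", 0.0, 0.0000),
]


def cost_per_success(rate_pct, cost_usd):
    """Approximate spend per successful run; None when nothing succeeded."""
    if rate_pct <= 0:
        return None
    return cost_usd / (rate_pct / 100.0)


for model, rate, cost in ROWS:
    value = cost_per_success(rate, cost)
    label = "n/a" if value is None else f"${value:.4f}"
    print(f"{model}: {label}")
```

Models with a 0.0% success rate are reported as `n/a` rather than dividing by zero. With only 15 runs per model, these ratios are coarse and best treated as indicative.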

Performance vs Efficiency Trade-offs


Performance by Difficulty Level


Token Usage Breakdown


Failure Analysis by Reason


Last updated: July 6, 2025 at 03:03 PM UTC
