Andrew Ginns

🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.

Each model is tested across three difficulty levels, with a limit of five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both an understanding of Mermaid syntax and effective tool use.
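The attempt budget and the success-rate metric described above can be sketched roughly as follows. This is a minimal illustration, not Merbench's actual harness: `run_case`, `toy_validate`, `toy_generate`, and `success_rate` are hypothetical names, the toy validator only checks for a diagram-type keyword and stands in for the real MCP validation server, and the exact way Merbench aggregates runs is assumed to be a plain mean.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 5  # per-test-case attempt limit stated in the description


def run_case(generate: Callable[[Optional[str]], str],
             validate: Callable[[str], Optional[str]],
             max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Return True if a valid diagram is produced within the attempt budget.

    `generate` receives the previous error message (None on the first try).
    `validate` returns None for valid code, or an error message otherwise.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = generate(feedback)
        feedback = validate(code)
        if feedback is None:
            return True
    return False


def success_rate(outcomes: list[bool]) -> float:
    """Percentage of successful runs (assumed aggregation: a plain mean)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


def toy_validate(code: str) -> Optional[str]:
    """Hypothetical stand-in for the validator: checks only the first keyword."""
    if code.startswith(("graph ", "flowchart ")):
        return None
    return "Parse error: unknown diagram type"


def toy_generate(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for the model: fixes its typo once it sees an error."""
    return "grph TD; A-->B" if feedback is None else "graph TD; A-->B"


if __name__ == "__main__":
    print(run_case(toy_generate, toy_validate))
    print(round(success_rate([True] * 6 + [False] * 9), 1))
```

Under this assumed aggregation, 6 successes out of 15 runs gives 40.0%, which is consistent with the leaderboard percentages all being multiples of 1/15.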

Evaluation Summary

Total Evaluation Runs: 180
Models Evaluated: 12
Test Cases: 3

Providers Tested

Amazon, Google
Data updated: Jul 6, 2025

Model Leaderboard

Rank  Model                                Success Rate  Avg Cost/Run  Avg Duration  Avg Tokens  Runs  Provider
1     gemini-2.5-pro-preview-06-05         40.0%         $0.0455       46.89s        8,693.733   15    Google
2     gemini-2.5-pro-preview-05-06         33.3%         $0.3224       77.49s        46,132.333  15    Google
3     gemini-2.5-pro-preview-03-25         26.7%         $0.2849       100.73s       37,934.067  15    Google
4     gemini-2.5-flash                     20.0%         $0.0123       12.85s        12,838.467  15    Google
5     gemini-2.5-flash-lite-preview-06-17  6.7%          $0.0007       4.42s         4,198.2     15    Google
6     gemini-2.5-flash-preview-04-17       6.7%          $0.0539       27.50s        20,486.067  15    Google
7     gemini-2.5-flash-preview-05-20       6.7%          $0.0164       11.22s        7,726.4     15    Google
8     bedrock:us.amazon.nova-premier-v1:0  6.7%          $0.0565       78.33s        15,556.267  15    Amazon
9     bedrock:us.amazon.nova-pro-v1:0      0.0%          $0.0000       50.70s        0           15    Amazon
10    bedrock:us.amazon.nova-micro-v1:0    0.0%          $0.0001       17.39s        1,744.2     15    Amazon
11    bedrock:us.amazon.nova-lite-v1:0     0.0%          $0.0002       25.54s        1,926.667   15    Amazon
12    gemini-2.0-flash                     0.0%          $0.0003       7.10s         1,581.533   15    Google
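
One way to read the leaderboard's cost and success columns together is an expected cost per successful run: average cost per run divided by the success rate. This metric is not part of Merbench itself; it is an assumption introduced here for illustration, and `cost_per_success` is a hypothetical helper. The sketch below uses a handful of rows copied from the leaderboard.

```python
from typing import Optional

# (model, success rate in %, average cost per run in USD), copied from the leaderboard
ROWS = [
    ("gemini-2.5-pro-preview-06-05", 40.0, 0.0455),
    ("gemini-2.5-pro-preview-05-06", 33.3, 0.3224),
    ("gemini-2.5-flash", 20.0, 0.0123),
    ("gemini-2.5-flash-lite-preview-06-17", 6.7, 0.0007),
    ("bedrock:us.amazon.nova-pro-v1:0", 0.0, 0.0000),
]


def cost_per_success(avg_cost: float, success_pct: float) -> Optional[float]:
    """Average cost divided by success probability; None if nothing succeeded."""
    if success_pct <= 0:
        return None
    return avg_cost / (success_pct / 100.0)


# Keep only models with at least one success, cheapest per success first.
ranked = sorted(
    ((name, cost_per_success(cost, pct)) for name, pct, cost in ROWS
     if cost_per_success(cost, pct) is not None),
    key=lambda item: item[1],
)

if __name__ == "__main__":
    for name, value in ranked:
        print(f"{name}: ${value:.4f} per successful run")
```

With only 15 runs per model, a 6.7% success rate corresponds to a single success, so figures derived for the lower-ranked models are noisy and should be read with caution.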

Charts (interactive, not reproduced here): Performance vs Efficiency Trade-offs; Performance by Difficulty Level; Token Usage Breakdown; Failure Analysis by Reason.

Last updated: July 6, 2025 at 03:03 PM UTC