What is Harvey LAB?

Harvey LAB (Legal Agent Benchmark) is an open-source benchmark released by Harvey in May 2026 that evaluates AI agents on 1,200+ legal tasks across 24 practice areas, using 75,000+ attorney-written rubric criteria. It tests long-horizon legal work (analysis, synthesis, and deliverable creation) rather than short tasks like document comparison. It is the first practice-area-specific quality scorecard for legal AI agents.

Which AI performs best on Harvey LAB?

As of May 2026, Claude Opus 4.7 leads the Harvey LAB leaderboard at 7.1% under the strict all-pass standard, followed by Claude Sonnet 4.6 at 5.4%, Claude Opus 4.6 at 4.2%, GPT-5.5 at 2.1%, and Gemini 3.5 Flash at 0.8%. Importantly, no single model leads every practice area — performance varies significantly across transactional, advisory, regulatory, and litigation tasks.

Does Harvey LAB mean AI is not ready for legal work?

No — it means AI legal work requires human review and that quality varies by task type and practice area. The sub-10% all-pass rate reflects a strict standard where every rubric criterion must be met, not overall uselessness. Many tasks at partial pass rates are still deployable under a human-review pattern, which is appropriate for most law firms today. LAB helps identify which specific tasks and practice areas are AI-ready at what level of oversight.

How can a small law firm use Harvey LAB?

Small law firms can use LAB to: identify which AI models perform best for their practice area, understand which task types are suitable for AI-assisted workflows versus full human drafting, and build an internal evaluation framework by running their own high-volume tasks through AI and grading against LAB criteria. The benchmark is open-source and free to access.

How is Harvey LAB different from other legal AI benchmarks?

Prior legal AI benchmarks typically focused on short, isolated tasks such as contract comparison or clause extraction, and often used crowd-sourced or AI-generated evaluation criteria. LAB tests long-horizon tasks that mirror how legal work is actually assigned and reviewed at law firms — client matters with full materials and deliverable requirements — graded by attorney-written rubrics. It is also practice-area specific across 24 areas, where prior benchmarks provided only aggregate scores.

Harvey LAB: AI Quality Scores for Law Firms

Published: June 2026 | By: The Crossing Report

Here's the conversation that's been happening in law firm conference rooms for three years: a partner suggests trying AI on contract review. Someone asks which AI. Nobody has a good answer. The vendor demo looked great. The actual work product was inconsistent. The firm has no way to compare.

On May 6, 2026, that conversation got a different ending.

Harvey released the Legal Agent Benchmark (LAB), the first open-source, practice-area-specific quality scorecard for legal AI agents. For the first time, a small law firm doing M&A work, employment litigation, or regulatory compliance can look at structured quality data for its own practice area before committing to a tool.

The headline finding is sobering: frontier AI models complete less than 10% of real legal tasks end-to-end under LAB's strict evaluation standard. But the strategic implication is the opposite of discouraging. Here's why.

What Harvey LAB Is (and Why It's Different From Prior Legal AI Benchmarks)

Before LAB, legal AI benchmarks mostly measured speed or accuracy on short, isolated tasks: compare these two contracts, extract this clause, find a relevant statute. Those tasks matter. But they're not how legal work actually gets done.

A client matter isn't a single question. It's a file — with context, history, and required deliverables. A real-world legal assignment asks: analyze this term sheet, identify the three points the opposing party is most likely to push back on, and draft a negotiation memo.

LAB tests that kind of work.

What LAB covers:

1,200+ long-horizon legal agent tasks
24 practice areas spanning transactional, advisory, regulatory, and litigation work
75,000+ rubric criteria written by attorneys — not crowd-sourced, not AI-generated
Open-source: any firm, researcher, or vendor can run their AI against it

The evaluation standard is strict: the "all-pass" score requires that an AI agent meet every rubric criterion on a task. Not most of them — every one. This is a quality floor, not an average.
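To make the all-pass rule concrete, here is a minimal Python sketch. It assumes each task's rubric results are recorded as a list of booleans, one per criterion. That representation and the function names are illustrative assumptions, not LAB's actual data format or tooling.

```python
from typing import List


def all_pass(criteria_met: List[bool]) -> bool:
    """A task passes only if every rubric criterion is met."""
    return len(criteria_met) > 0 and all(criteria_met)


def partial_credit(criteria_met: List[bool]) -> float:
    """Fraction of criteria met, shown only for contrast with all-pass."""
    return sum(criteria_met) / len(criteria_met) if criteria_met else 0.0


def all_pass_rate(tasks: List[List[bool]]) -> float:
    """Share of tasks on which every criterion was met."""
    return sum(all_pass(t) for t in tasks) / len(tasks) if tasks else 0.0


# Hypothetical results for three tasks (made-up values, not LAB data).
tasks = [
    [True, True, True, True],    # meets every criterion: passes
    [True, True, True, False],   # 3 of 4 met: fails all-pass
    [True, False, True, True],   # 3 of 4 met: fails all-pass
]

print(all_pass_rate(tasks))                # one task in three passes
print([partial_credit(t) for t in tasks])  # [1.0, 0.75, 0.75]
```

In this made-up example, two tasks that meet three of four criteria count the same as a task that meets none. That is the sense in which all-pass measures a floor rather than an average.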

That strictness is what makes the benchmark useful. A vendor demo optimized to look good on partial credit tells you very little about how AI will perform under supervision on a real matter. LAB measures the floor.

The Thomson Reuters 2026 AI in Professional Services Report found that 82% of professional services firms are flying blind on AI ROI. For the legal work those firms do, there is now a domain-specific measurement framework. LAB is the first tool designed for that purpose.

The Key Number: Frontier AI Completes Less Than 10% of Legal Tasks End-to-End

Under LAB's strict all-pass standard, here is where the leading AI models currently sit:

Harvey LAB Leaderboard (May 2026)

Claude Opus 4.7: 7.1%
Claude Sonnet 4.6: 5.4%
Claude Opus 4.6: 4.2%
GPT-5.5: 2.1%
Gemini 3.5 Flash: 0.8%

The first reaction most firm owners have to these numbers is: "That sounds terrible." That reaction is wrong.

Consider what 7.1% actually means. Harvey tested 1,200+ tasks. At 7.1%, Claude Opus 4.7 is completing roughly 85 of those tasks end-to-end, meeting every attorney-written rubric criterion without any human assistance. Two years ago that number was zero.
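The arithmetic behind "roughly 85" can be checked in a few lines of Python. This treats the task count as exactly 1,200, which is an assumption: the benchmark reports "1,200+", so the real count is somewhat higher.

```python
TASK_COUNT = 1200        # assumed lower bound; the source says "1,200+"
ALL_PASS_RATE = 0.071    # Claude Opus 4.7, per the leaderboard above

passed = round(TASK_COUNT * ALL_PASS_RATE)
needs_review = TASK_COUNT - passed

print(passed)        # 85
print(needs_review)  # 1115
```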

More importantly, the 92.9% of tasks that fall short of the all-pass standard aren't automatic failures. They're tasks that require human review before the work product goes out, which is exactly what responsible AI deployment in a law firm looks like today.

The right framing: LAB tells you where your AI can already operate under light supervision, and where it still needs a partner-level eye on every output. That's not a verdict on AI readiness. It's a deployment map.

One more point that matters more than the top-line score: no single model leads every practice area. The model that performs best on regulatory compliance may not be the strongest in M&A due diligence. If you choose an AI tool based on headline leaderboard position, you may be choosing the wrong tool for the work your firm actually does.
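A small Python sketch shows how the headline ranking and a practice-area ranking can disagree. The model names and scores below are invented placeholders for illustration only. They are not LAB results, and the dictionary layout is just one assumed way a firm might record the numbers it looks up.

```python
from typing import Dict

# Placeholder all-pass rates by model and practice area (NOT real LAB data).
scores: Dict[str, Dict[str, float]] = {
    "model_a": {"regulatory": 0.09, "m_and_a": 0.03},
    "model_b": {"regulatory": 0.04, "m_and_a": 0.06},
}


def overall_leader(scores: Dict[str, Dict[str, float]]) -> str:
    """Model with the highest unweighted mean across practice areas."""
    return max(scores, key=lambda m: sum(scores[m].values()) / len(scores[m]))


def leader_for(scores: Dict[str, Dict[str, float]], area: str) -> str:
    """Model with the highest score in one practice area."""
    return max(scores, key=lambda m: scores[m][area])


print(overall_leader(scores))         # model_a
print(leader_for(scores, "m_and_a"))  # model_b
```

With these placeholder numbers, the overall leader is not the best choice for an M&A boutique, which is the situation the paragraph above warns about.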

How to Read LAB Results for Your Practice Area

LAB covers 24 practice areas. For a small firm, that level of specificity is the point.

A 15-attorney employment law firm and a 10-attorney M&A boutique are not the same buyer. They have different high-volume tasks, different risk tolerances for AI errors, and different client expectations. A single "best legal AI" recommendation is useless for both of them.

LAB lets you look at your actual practice area — not "legal AI" in the abstract.

Two deployment patterns to understand:

Review pattern: AI drafts or analyzes; you review and approve before anything goes out. Appropriate for tasks where AI performance falls below roughly 20% on the all-pass standard. This covers most legal AI deployment in 2026.

Delegation pattern: AI completes a task autonomously, with you spot-checking a sample. Only appropriate for tasks where AI performs at a threshold high enough to trust unreviewed output — a threshold very few tasks have yet cleared under LAB scoring.

For a small firm, every AI-assisted matter should start in the review pattern. The value isn't removing the attorney — it's removing the time the attorney spends on drafting, research synthesis, and document organization, while retaining the judgment call at the end.
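The two patterns can be written down as a simple decision rule. The sketch below is hypothetical: the text above gives only a rough 20% figure for the review pattern and no number for delegation, so `delegation_threshold` is a parameter the firm has to choose, and the 0.90 used in the example is purely a placeholder.

```python
def deployment_pattern(all_pass_rate: float, delegation_threshold: float) -> str:
    """Return "review" or "delegation" for a task type.

    all_pass_rate: share of tasks of this type meeting every criterion (0 to 1).
    delegation_threshold: firm-chosen bar for trusting spot-checked output
        (an assumption; the source does not specify a value).
    """
    if not 0.0 <= all_pass_rate <= 1.0:
        raise ValueError("all_pass_rate must be between 0 and 1")
    if all_pass_rate >= delegation_threshold:
        return "delegation"
    # Everything else, including the sub-20% range, stays under full review.
    return "review"


print(deployment_pattern(0.071, delegation_threshold=0.90))  # review
print(deployment_pattern(0.95, delegation_threshold=0.90))   # delegation
```

Defaulting to "review" for anything under the bar matches the advice that every AI-assisted matter should start in the review pattern.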

LAB tells you where AI can carry the most weight in that workflow, and where you still need to carry more of it yourself.

What a Small Law Firm Actually Does With This

LAB is open-source and free to access. Here is a four-step process to put it to work.

Step 1: Identify your 2–3 highest-volume task categories. Not the most complex work — the most repeated. Contract review, client intake, research memos, deposition prep, regulatory filings. These are where AI assistance creates compounding return.

Step 2: Check LAB results for those task types in your practice area. The benchmark covers 24 practice areas and breaks down results by task category. Look at which AI models perform best on the specific task types you run most often, not at the overall leaderboard.

Step 3: Choose AI tools based on your practice area, not the headline winner. The overall leaderboard is a starting point. Your decision should be based on performance data for your actual work.

Step 4: Run your own internal mini-benchmark. Take 5 real tasks from your files: client matters that are already closed, so you know what the finished work product should look like. Run them through the AI tools you're considering. Grade the output against your own standard. This is the ROI measurement step that the 82% of firms flying blind on AI ROI are skipping. LAB gives you the rubric structure to do it properly.

This four-step process doesn't require a legal engineer or a technology committee. A managing partner can complete it in a Saturday morning. The output is a deployment decision grounded in your firm's actual work, not vendor marketing.
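The grading in Step 4 can be kept in a short script rather than a spreadsheet. The following is a sketch under stated assumptions: the `GradedTask` structure, the tool label, and the sample grades are all hypothetical, and the criteria would come from your own review standard rather than from LAB.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GradedTask:
    tool: str                  # hypothetical tool label
    task: str                  # short name for the closed matter's task
    criteria: Dict[str, bool]  # criterion -> met?, as graded by an attorney


def summarize(graded: List[GradedTask]) -> Dict[str, Dict[str, float]]:
    """Per tool: all-pass rate and mean share of criteria met.

    Assumes every task has at least one criterion.
    """
    by_tool: Dict[str, List[GradedTask]] = {}
    for g in graded:
        by_tool.setdefault(g.tool, []).append(g)
    summary: Dict[str, Dict[str, float]] = {}
    for tool, items in by_tool.items():
        all_pass = sum(all(i.criteria.values()) for i in items) / len(items)
        mean_met = sum(
            sum(i.criteria.values()) / len(i.criteria) for i in items
        ) / len(items)
        summary[tool] = {"all_pass_rate": all_pass, "mean_criteria_met": mean_met}
    return summary


# Made-up grades for two tasks on one tool; a real run would use five.
graded = [
    GradedTask("tool_x", "research memo",
               {"cites authority": True, "answers question": True}),
    GradedTask("tool_x", "contract review",
               {"flags indemnity": True, "flags term": False}),
]

print(summarize(graded))
```

Tracking both numbers is a deliberate choice: the all-pass rate shows how often output could go out with light review, while the mean share of criteria met shows how much rework the remaining tasks need.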

The Honest Limitation (and Why It Still Matters)

LAB is calibrated for large law firm work product standards. The rubric criteria were written by attorneys evaluating quality at the Am Law 100–200 level.

For a 10-person firm, some of those standards may be more stringent than what you actually need. A task that fails LAB's all-pass standard on a narrow formatting or citation issue may still produce output that's entirely useful for your practice.

LAB also uses strict all-pass grading — every criterion must be met. In real-world deployment, partial credit matters. An AI that gets 90% of a task right still saves you the 90%, even if the last 10% needs rework.

These limitations don't undercut LAB's value. They put it in context. The alternative to LAB is choosing AI based on vendor case studies, conference demos, and the opinion of whoever attended the most recent legal tech conference. That approach gives you no measured failure rate at all.

LAB is the first independently validated signal of where legal AI quality actually sits, across real legal work, evaluated by attorneys, broken down by practice area. For a small firm trying to make a defensible AI deployment decision in 2026, it's the closest thing to a quality standard the industry has produced.

Use it. Then validate against your own file.

The Bottom Line

Harvey LAB doesn't say AI is ready to replace your associates. It says AI is ready to take on a specific slice of the work your firm does — and it tells you, by practice area, what that slice is.

The firms that will be ahead in 2028 are not the ones that waited for AI to hit 80% on the all-pass standard. They're the ones that figured out which 7% to delegate now, built the review infrastructure to catch the other 93%, and iterated from there.

LAB is the starting map. The territory is your firm.

Want the complete practice-area evaluation framework — which task categories across legal, accounting, and consulting have cleared the AI quality bar, and how to run a 5-task internal benchmark before committing to any AI tool? That guide is in this week's premium Crossing Report.

