The problem with AI in legal translation
In 2026, courts across the United States, France, and the United Kingdom are confronting a wave of AI-generated errors in legal filings. A global database maintained by legal analytics providers now tracks more than 1,353 documented cases of AI hallucinations in court documents worldwide. In some instances, the consequences have included monetary sanctions exceeding $100,000, license suspensions, and referrals for bar discipline.
These incidents involve AI-generated citations. But a quieter, equally consequential problem has been building in a different part of legal practice: the translation of legal documents.
When a single AI model translates a contract clause, it does so by generating the statistically most probable rendering of the source text. It does not know what the clause means legally. It does not know what is at stake if a term is mistranslated. And according to industry data synthesized from Intento and WMT24, individual top-tier large language models hallucinate or fabricate content between 10% and 18% of the time during translation tasks.
For legal content, that range is not acceptable. A mistranslated indemnification clause, a missing carve-out, or a term that changes meaning in the target jurisdiction can render an agreement unenforceable. The French National Bar Association codified exactly this risk in March 2026: lawyers who use AI content without proper verification face disciplinary proceedings.
Yet legal professionals are increasingly using AI for translation, often under time and budget pressure. The question is not whether to use AI, but whether the AI workflow being used has been built to catch what a single model will miss.
What a ‘real use case’ translation looks like
The example that follows describes how our team translated a 4,200-word commercial services agreement from English into Spanish for a cross-border procurement engagement. The document included governing law clauses, a limitation of liability section, indemnification terms, and a dispute resolution mechanism.
The language pair was English to Latin American Spanish. The target jurisdiction was Mexico. The stakes were straightforward: the translated agreement was to be executed by the client-side counterparty, whose primary language was Spanish, and any ambiguity in the limitation of liability provisions would be subject to Mexican contract law interpretation.
This is a document type where legal translation fails visibly. The Intento State of Translation Automation 2025 report specifically flagged legal language as one of the content categories where both human and AI translators produce the highest rate of meaning-altering errors, including cases where data protection concepts disappear entirely through omission.
Our goal was to run the document through a process that could surface disagreement between translation outputs before a human reviewer ever saw it, concentrating review effort on the points of genuine ambiguity rather than having a reviewer check every sentence.
Step-by-step: how the translation was done
The following is the workflow we ran. It is reproducible by any legal team handling cross-border document translation.
Step 1: Document segmentation and pre-processing. Before any model touched the text, the document was reviewed for clause boundaries. Legal agreements contain nested conditional clauses, defined terms that must be translated consistently throughout the document, and cross-references that depend on how other sections were rendered. Splitting the document into segments without respecting these structural relationships produces terminology drift: the same defined term translated differently in two sections of the same agreement.
Step 2: Running the document through multiple AI models simultaneously. The full document was submitted to an AI translation platform that processes text through 22 independent models at once. Each model produces its own output independently. This is the same principle applied in legal expert panels and inter-rater reliability frameworks in research: when you need to identify where uncertainty exists, you need multiple independent judgments, not one confident one.
Step 3: Identifying divergence points. The platform flagged every clause where the 22 model outputs disagreed significantly. In this document, 14 clause segments produced meaningful divergence. These included the governing law clause, two indemnification provisions, and one limitation of liability carve-out. The divergence was not in fluency; all model outputs read naturally in Spanish. It was in legal meaning. In three cases, the models disagreed on how to render a term that has no direct equivalent in Mexican legal usage and requires a jurisdictional adaptation.
Step 4: Concentrated human review. Instead of reviewing 4,200 words of AI output, the human reviewer focused on the 14 flagged segments. This is the structural advantage of a multi-model approach for legal translation: it does not attempt to eliminate human judgment. It directs human judgment to the places where it is most needed.
Step 5: Final verification and sign-off. The reviewed document was then formally signed off by a legal translator with domain knowledge in Mexican commercial law. The certification covered the human-reviewed segments. The final deliverable was the document the human reviewer had approved, not the raw AI output.
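The terminology drift described in Step 1 can be checked mechanically once a draft exists. The following is a minimal sketch, not the tooling we used: the function name, the candidate-renderings glossary, and the Spanish sample sentences are all hypothetical and chosen only for illustration. It assumes you can list the plausible target renderings of each defined term in advance.

```python
from collections import defaultdict


def find_term_drift(segments, candidates):
    """Return defined terms that were rendered more than one way.

    segments:   list of (source_text, target_text) pairs, one per clause.
    candidates: dict mapping a source defined term to a list of known
                target renderings (a hypothetical glossary).
    """
    renderings = defaultdict(set)
    for source, target in segments:
        for term, options in candidates.items():
            if term.lower() not in source.lower():
                continue  # this clause does not use the defined term
            hits = {o for o in options if o.lower() in target.lower()}
            # Record a placeholder when none of the known renderings appears.
            renderings[term] |= hits or {"<no known rendering>"}
    # A term has drifted if more than one rendering was observed.
    return {t: sorted(r) for t, r in renderings.items() if len(r) > 1}


# Illustrative data only; not taken from the agreement described above.
sample_segments = [
    ("The Service Provider shall deliver the Services.",
     "El Prestador de Servicios entregará los Servicios."),
    ("The Service Provider shall indemnify the Client.",
     "El Proveedor de Servicios indemnizará al Cliente."),
]
sample_candidates = {
    "Service Provider": ["Prestador de Servicios", "Proveedor de Servicios"],
}

drift = find_term_drift(sample_segments, sample_candidates)
print(drift)
```

A check like this only detects inconsistency. Deciding which rendering is correct for the target jurisdiction remains a question for the human reviewer.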
The output: what the verification layer caught
The three cases of jurisdictional term divergence included the phrase ‘reasonable efforts’, which most individual AI models rendered as a direct translation. In Mexican commercial law, that standard is interpreted differently from its English-law counterpart. Two models produced an output that defaulted to the English-law meaning. Had those outputs gone forward without review, the clause would have imposed a materially different obligation on the Mexican counterparty than the parties intended.
This divergence was surfaced automatically, before human review began, because the platform’s model outputs disagreed on how to handle the term. The human reviewer then had a specific flag to investigate, rather than relying on a general read of a 4,200-word document.
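To make the idea of automatic divergence flagging concrete, here is a rough sketch of one way to score disagreement between model outputs for a segment. It is an assumption-laden illustration, not a description of how the platform works internally: it uses surface string similarity from Python's standard `difflib` as a stand-in for agreement, the threshold of 0.8 is arbitrary, and the sample renderings are hypothetical. String similarity is only a proxy, since two outputs can look alike and still differ in legal meaning.

```python
from difflib import SequenceMatcher
from itertools import combinations


def agreement(outputs):
    """Mean pairwise string similarity (0.0 to 1.0) across model outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output trivially agrees with itself
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def flag_divergent(segments, threshold=0.8):
    """Return ids of segments whose outputs agree less than `threshold`.

    segments: dict mapping a segment id to a list of model outputs.
    The threshold is an arbitrary illustrative value.
    """
    return [seg_id for seg_id, outputs in segments.items()
            if agreement(outputs) < threshold]


# Hypothetical outputs from three models for two segments.
sample_outputs = {
    "governing_law": ["Ley aplicable"] * 3,
    "efforts_clause": [
        "hará esfuerzos razonables",
        "empleará sus mejores esfuerzos",
        "actuará con la diligencia debida",
    ],
}

flagged = flag_divergent(sample_outputs)
print(flagged)
```

Only the flagged segment ids would then go to the reviewer's queue, which is the concentration of effort described in Step 4.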
The AI translator used for this workflow was MachineTranslation.com, which compares the outputs of 22 AI models and selects the translation that most of them agree on. According to MachineTranslation.com’s internal benchmarks, this approach reduces critical translation errors to under 2%, compared to a 10% to 18% critical error rate associated with single-model AI translation.
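Selecting the translation that most models agree on can be pictured as a simple vote. The sketch below is our own simplification and should not be read as the platform's actual selection logic, which is not public in this level of detail: it assumes that outputs count as the same vote when they match exactly after lowercasing and whitespace normalization, and the function names and sample strings are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants vote together."""
    return " ".join(text.lower().split())


def select_consensus(outputs):
    """Return (winning output, share of models that produced it)."""
    counts = Counter(normalize(o) for o in outputs)
    winner, votes = counts.most_common(1)[0]
    # Return the first original output that matches the winning form.
    original = next(o for o in outputs if normalize(o) == winner)
    return original, votes / len(outputs)


# Hypothetical outputs from three models for one short segment.
sample = ["Ley aplicable", "ley  aplicable", "Derecho aplicable"]
chosen, share = select_consensus(sample)
print(chosen, round(share, 2))
```

A low winning share is itself a useful signal: it marks a segment where the vote was close and a human should look before the majority rendering is accepted.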
The point here is architectural. The error in the ‘reasonable efforts’ clause was not a fluency error. It was a jurisdictional meaning error. No spell-check, no grammar review, and no standard quality assurance pass would have caught it. The only thing that surfaced it was the fact that 22 independent models disagreed on how to render it.
What this means for legal practice
Legal professionals handling multilingual documents are not going to stop using AI. The efficiency gains are real, and the volume pressure is not going away. But the current default of running a document through one model and checking the output manually is not a verification process. It is a single-point-of-failure process with a human reviewer at the end.
The workflow described here is not more expensive or slower than that default. It is structurally different. It moves the human reviewer from checking everything to checking what has been flagged. For a 4,200-word agreement, the difference between reviewing 4,200 words and reviewing 14 flagged segments is the difference between a day of work and two hours of work, with higher confidence in the result.
For legal teams that need to go further, the human verification step described in Step 5, where a domain-qualified translator formally signs off on the reviewed output, is the equivalent of a certified translation for documents that will be submitted to courts or regulatory authorities. The key point is that this certification now covers a document that has already had its divergence points identified and resolved, rather than a document that one reviewer has checked once.
Practical guidance for legal professionals handling AI translation: Do not treat a single AI output as a first draft. Treat it as one vote. The question is not ‘does this read correctly?’ but ‘do multiple independent models agree on what this means?’ If they do not, that is where your human reviewer needs to spend time.
Technology in legal proceedings continues to evolve rapidly. The same principle that applies to AI-generated citations in court filings applies to AI-generated translations of documents executed under foreign law: responsibility for the output does not transfer to the tool. What multi-model verification provides is a process that concentrates accountability on the points of genuine risk, rather than distributing it across the length of a document.
Conclusion
AI translation is not going to produce perfect legal documents. That is not its function. Its function is to produce a highly reliable draft that can be efficiently reviewed and certified by a qualified human.
The workflow described here (multi-model output, automated divergence flagging, concentrated human review, and formal sign-off) produces a legal translation with a defensible process behind it. The flagged divergence points are documented. The human review is targeted. The final deliverable is one that a qualified professional has reviewed and approved.
In a practice area where the question ‘how was this translation produced?’ can arise in litigation, arbitration, or regulatory review, having a documented, structured answer to that question is not a secondary consideration. It is part of the work.
