Campbell Brown’s Forum AI Builds Expert-Backed Judges to Fix Model Accuracy on High-Stakes Topics
Forum AI uses expert-crafted benchmarks and AI judges to improve the accuracy of foundation models on geopolitics, finance, hiring and other high-stakes topics.
Campbell Brown has launched Forum AI to confront accuracy failures in large foundation models and to create expert-driven evaluations for complex subjects. Forum AI was founded to design benchmarks and train automated “AI judges” that align with leading human experts on topics where nuance matters. The effort aims to move evaluation beyond simple metrics and checkbox audits toward domain-specific, defensible assessments.
Forum AI founded to audit high-stakes model outputs
Forum AI was created 17 months ago in New York with the explicit goal of assessing model performance on “high-stakes topics.” High-stakes areas include geopolitics, mental health, finance and hiring, all domains where errors can lead to significant harm or legal exposure.
The company recruits domain experts to architect benchmarks and then trains machine evaluators to replicate expert judgments at scale. That approach is intended to provide repeatable, measurable assessments that mirror human reasoning in complex cases.
Expert panels and AI judges designed for consensus
Brown has assembled a roster of prominent figures to help shape the benchmarks, including historians, policymakers and cybersecurity officials. The list of contributors reportedly includes Niall Ferguson, Fareed Zakaria, a former secretary of state and other senior practitioners who anchor Forum AI’s standards.
Forum AI’s stated objective is to achieve roughly 90 percent consensus between AI judges and the human expert panels. The company says it has reached that threshold on some geopolitical evaluations, suggesting that automated adjudication can approximate expert views when benchmarks are well defined.
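To make the 90 percent figure concrete, here is a minimal sketch of how a consensus rate between an automated judge and an expert panel could be computed. Forum AI has not published its method, so everything here is an assumption for illustration: the function names, the verdict labels, the majority-vote rule and the choice to skip tied panels are all hypothetical.

```python
from collections import Counter

# Hypothetical sketch: the labels, the majority rule and the tie handling
# are assumptions for illustration, not Forum AI's published methodology.

def panel_majority(votes):
    """Return the most common expert verdict, or None if there is no clear majority."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # the top two verdicts are tied
    return counts[0][0]

def consensus_rate(judge_labels, panel_votes):
    """Fraction of scored items where the AI judge matches the panel majority.

    Items on which the panel has no clear majority are left out of the score.
    """
    matched = 0
    scored = 0
    for judge, votes in zip(judge_labels, panel_votes):
        majority = panel_majority(votes)
        if majority is None:
            continue
        scored += 1
        if judge == majority:
            matched += 1
    return matched / scored if scored else 0.0

# Invented example data: one judge verdict and one list of expert votes per item.
judge_labels = ["balanced", "biased", "balanced", "balanced"]
panel_votes = [
    ["balanced", "balanced", "biased"],
    ["biased", "biased", "biased"],
    ["biased", "biased", "balanced"],
    ["balanced", "biased"],  # tie, so this item is not scored
]

rate = consensus_rate(judge_labels, panel_votes)
meets_target = rate >= 0.90  # the roughly 90 percent objective described above
```

A simple match rate like this says nothing about how much the experts disagree among themselves, which is one reason a real evaluation would likely report panel agreement alongside it.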
Independent audits uncover bias and contextual gaps
Early audits by Forum AI highlighted a range of failures in leading foundation models, from geopolitical sourcing errors to ideological skew. Tests reportedly showed models drawing on sources ill-suited to the topic at hand and exhibiting a consistent left-leaning political tendency in aggregated outputs.
Beyond overt bias, Forum AI’s evaluations flagged subtler problems such as missing context, omitted perspectives and straw-manning of complex arguments. Those findings underscore that measurable accuracy is more than factual correctness; it also requires balanced framing and awareness of nuance.
Lessons from social platforms inform the approach
Brown cites her experience at a major social media company as a cautionary lesson about optimizing for engagement over accuracy. Efforts to build fact-checking and content remediation systems at scale often failed or were dismantled, she has said, leaving a landscape where platform incentives did not prioritize truth.
Forum AI’s approach takes a different path: it assumes that rigorous evaluation and expert-aligned metrics can nudge platforms and model builders toward outputs that better reflect reality. The company argues this shift is possible if product and compliance priorities change.
Enterprise demand seen as the most viable market
Forum AI is positioning its services toward enterprises that face regulatory and liability risks in decisions supported by AI. Companies using models for credit, lending, insurance or hiring have tangible incentives to prefer accuracy over engagement-driven outputs.
Brown has argued that while consumer chatbot experiences remain messy, enterprise customers will push vendors to adopt more rigorous, evidence-based evaluation because mistakes carry financial and legal consequences. Converting compliance interest into steady revenue, however, remains a practical challenge for the startup.
Regulatory audits fall short, experts warn
Forum AI’s work also draws attention to the limits of current audit regimes. Early experience with municipal and state-level AI hiring audits in the US showed that many vendors and products passed checks despite substantive issues, according to public reports and compliance reviews.
Brown contends that standard checkbox audits and generic benchmarks are insufficient, and that meaningful evaluation requires domain specialists to identify edge cases and contextual risks. That kind of evaluation is more time-consuming and expensive but can uncover problems that simpler tests miss.
Forum AI’s funding and business prospects have advanced alongside its evaluations. The company raised seed capital last fall from investors including a notable early-stage fund, and it has sought enterprise clients interested in bespoke compliance and benchmarking. Its revenue model depends on convincing organizations that deeper, domain-specific audits are worth the cost compared with lighter-touch certifications.
Forum AI’s work highlights a larger tension in the development of foundation models: whether the industry will prioritize truthful, context-aware outputs or default to product metrics that favor engagement. The company’s expert-driven benchmarks represent one attempt to institutionalize accuracy as an engineering and commercial priority.
The coming months will test whether expert-aligned evaluation can scale and influence major model builders and enterprise buyers. If Forum AI’s approach gains traction, it could reshape how models are judged and deployed in areas where mistakes carry real-world consequences.