Capability is not the bottleneck to scaling AI agents. Trust is.

We are building the science of model behavior. We test frontier models under pressure: real tasks, tool access, ambiguity. We turn this data into failure predictions, mitigations, and deployment guidance for the labs that build models and the enterprises that deploy them.

Our work

Applied evaluations: we work with teams running agents in production to predict what breaks when models change. If that's you, get in touch!

SysAdmin benchmark: measuring propensities that could lead to Loss of Control in an open-world Linux sandbox. 7 frontier models, 2,800 tasks, 2x2 factorial design. Submitted to NeurIPS 2026. Draft here.


Early benchmark: testing power seeking in models tasked with simple system admin work. Presented at the Alignment Workshop at NeurIPS 2025. Poster here.


Evaluations of the most downloaded open-source models on the propensities that polled HuggingFace users were interested in: instruction following and hallucinations.


Analysis of power-seeking behaviour among agents in Moltbook, and the effect of humans masquerading as agents.


We publish our research here: https://propensitylabs.substack.com/


Frequently Asked Questions

What are model propensities?

Propensities are what a model is inclined to do when given the opportunity. Capabilities are what it is able to do. Two models can pass the same capability benchmarks and behave completely differently under pressure: one games the test, another quietly expands its own permissions.

Why measure propensities and not just capabilities?

Capability benchmarks tell you what a model can do, not what it will do in your environment. A model can ace the benchmark and still delete your test files to make a task pass. The field measures capability well. Almost nobody measures behavior systematically, across labs, release after release. That is the gap we're fixing.

What behaviors do you test for?

We initially focused on the behaviors that cause incidents: privilege escalation, test gaming, scope creep beyond the task, and resistance to being stopped or redirected. We're expanding to cover everything that could cause failures in production.

Is this safety research or a commercial product?

Both, by design. The propensities behind production failures and the propensities behind Loss of Control are the same behaviors at different stakes. We publish the research openly and work with labs and enterprises on evaluation and deployment.

Why a Public Benefit Corporation?

So that we stay independent and can evaluate models from every lab without bias toward any of them. We're committed to reducing risks from advanced AI systems for the benefit of humanity.

Can I try out your evals or collaborate?

Of course! Please reach out via the form below or email us at info@propensitylabs.ai.

Founders

Mana Azarm
Co-founder and Chief Scientist
Assistant Professor @ USF (prev-UOttawa)
Led Data Infra @ DoorDash

Rahul Nambiar
Co-founder and CEO
Tech Lead @ Meta across 90+ teams
Built massive infra @ AWS

Contact Us

Running agents in production or want to collaborate on research?
