The bitter lesson of context metadata

Stop adding business logic to your datasets

Beginners don't think about metadata at all - they just dump raw data and move on. The mid-level crowd obsesses over it, adding elaborate contextual metadata to every dataset. The experienced devs come full circle: they use minimal metadata too.

Less is more, and let me tell you why.

Why more metadata limits your agents' potential

Walk into most financial firms today and you'll find data teams frantically enriching datasets with contextual metadata. They think they're being helpful.

A news headlines dataset gets annotated: "Market-moving financial news optimized for sentiment analysis. Pre-filtered for sentiment-bearing language. Link to ticker symbols for trading signals. Cross-reference with earnings calendar for event-driven analysis."

Earnings call transcripts get tagged: "Management commentary for fundamental analysis. Extract forward guidance and compare to analyst estimates. Score management tone for confidence signals. Primary source for qualitative assessment."
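To make that concrete, here's roughly what the news dataset's catalog entry looks like once business logic creeps in. A hypothetical sketch - the field names are made up, but you've seen catalogs like this:

```python
# Hypothetical catalog entry for the news dataset, in the
# over-annotated style. Every "business" field bakes a use case
# into the data itself.
news_headlines = {
    "name": "news_headlines",
    "schema": {"timestamp": "timestamp", "ticker": "string", "headline": "string"},
    # Business logic smuggled in as metadata:
    "purpose": "Market-moving financial news optimized for sentiment analysis",
    "preprocessing": "Pre-filtered for sentiment-bearing language",
    "recommended_joins": ["ticker -> market_data", "date -> earnings_calendar"],
    "intended_use": "event-driven trading signals",
}
```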

This looks sophisticated, but it’s actually not a good idea.

The core mistake: you're baking business logic into data.

First, you're overfitting to known use cases. That news dataset "optimized for sentiment analysis"? You've just steered agents away from using that data for regulatory monitoring, compliance tracking, thematic research, and portfolio rebalancing workflows. The same data could serve a dozen purposes, but your metadata locks the agent onto a single, predetermined one.

A risk team wants to scan headlines for regulatory announcements. A compliance officer needs disclosure triggers. A thematic analyst is tracking industry narratives. But your metadata keeps screaming "this is sentiment data for trading". You're constraining the value of this data for all of them.

Second, your metadata will rot. Business logic changes. Market structure evolves. New workflows appear. That elaborate context metadata you wrote? It's no longer valid, and it steers people away from legitimate uses because the documentation says the data is "for" something else.

Static annotations can't keep up with dynamic needs. Every time you write "this dataset enables alpha generation through sentiment extraction", you're limiting the true potential of this dataset.

Third, and this is the big one, you're lobotomizing your agents. You hired intelligent AI systems to reason about data, then you pre-decided how they should think about it. You've stolen the reasoning step and replaced it with instructions.

It's like hiring a chess grandmaster and handing them a playbook of pre-approved moves. The whole point of having a grandmaster is that they can see combinations you can't.

But you've constrained them to patterns you already know.

When you tell an agent "this news data is for sentiment analysis, join it with market data on tickers, use it to predict price movements", you've eliminated the agent's ability to discover that the same data could reveal regulatory risk, identify sector rotation, or flag portfolio rebalancing triggers.

Every line of business context you add is a constraint you're imposing on future intelligence.

And here's what really matters in finance: alpha doesn't live in individual datasets. It emerges at intersections, when you combine datasets in non-obvious ways. A news headline alone tells you something happened. The insight comes when you overlay market data, options flow, portfolio holdings, sector performance, and historical patterns.

But if you've pre-decided how each dataset should be used, you've pre-decided which intersections are "valid". You've turned your data infrastructure into a map with pre-drawn routes, when the whole point of intelligent agents is that they can discover new paths.

The sophisticated answer: Do almost nothing

So what should you actually do?

Keep datasets brutally minimal.

Structural metadata only:

  • Schema (what columns exist)

  • Data types (string, integer, timestamp)

  • Lineage (where this came from)

  • Update frequency (how fresh it is)

  • Provenance (who owns it)

That's it. No business definitions. No use case documentation. No "this data is for X analysis" guidance. No elaborate context about what the data "means" or "enables".

Just clean, structural information that describes the data technically without telling anyone how to think about it.
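For contrast, here's the same news dataset described the minimal way - a sketch with illustrative keys, not a standard, covering exactly the structural fields above:

```python
# The same dataset, described structurally. No purpose, no intended
# workflow, no recommended joins - just what the data is.
news_headlines = {
    "name": "news_headlines",
    "schema": {"timestamp": "timestamp", "ticker": "string", "headline": "string"},
    "lineage": "vendor_feed.raw_news",  # where this came from
    "update_frequency": "streaming",    # how fresh it is
    "owner": "market-data-team",        # provenance
}
```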

Let the data be dumb. Let the agent be smart.

Here's what this looks like in practice. Take those earnings call transcripts. Instead of pre-annotating them with business logic, you provide minimal metadata: timestamp, company identifier, speaker labels, transcript text. Clean structure, zero interpretation.
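In practice a transcript record might be nothing more than this (a sketch; the field names are assumptions):

```python
from dataclasses import dataclass

# A transcript record: structure only, zero interpretation.
@dataclass
class TranscriptSegment:
    timestamp: str   # when the remark was made
    company_id: str  # company identifier, e.g. a ticker
    speaker: str     # speaker label, e.g. "CEO" or "Analyst 3"
    text: str        # raw transcript text
```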

Now an advanced agent needs to build a fundamental analysis model. It doesn't read your instructions. It reasons:

"I'm validating management guidance. Transcripts contain forward-looking statements that need verification. I'll extract revenue and margin guidance from management remarks, pull analyst consensus from estimation databases, grab actual results from next quarter's financials, and check insider trading activity around the call date. If management's confidence doesn't align with their personal trading behavior, that's a signal. If guidance consistently diverges from results, that's another signal".

Nobody told it to combine transcripts + estimates + actuals + insider trades. It understood the analytical goal and reasoned about which data combinations would answer it.
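In code, the plan the agent composed might look something like the sketch below. The dataset shapes and column names are hypothetical - what matters is that the joins were chosen at runtime from the goal, not read from metadata:

```python
import pandas as pd

def validate_guidance(transcripts: pd.DataFrame,
                      estimates: pd.DataFrame,
                      actuals: pd.DataFrame,
                      insider_trades: pd.DataFrame) -> pd.DataFrame:
    """Cross-check management guidance against consensus, realized
    results, and insider behavior. Nothing in the datasets' metadata
    suggested these joins; the plan comes from the goal."""
    # Assume guidance figures were already extracted from remarks.
    merged = (transcripts
              .merge(estimates, on=["company_id", "quarter"])
              .merge(actuals, on=["company_id", "quarter"]))
    # Signal 1: guidance that diverges from realized results.
    merged["guidance_gap"] = merged["revenue_guided"] - merged["revenue_actual"]
    # Signal 2: confident guidance alongside insider selling near the call.
    sellers = insider_trades.loc[insider_trades["side"] == "sell", "company_id"]
    merged["insider_selling"] = merged["company_id"].isin(sellers)
    return merged
```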

That's the difference. Pre-written metadata tells agents how data has been used. Agent intelligence figures out how it can be used.

And this is why minimal metadata wins: maximum flexibility equals maximum value.

The same earnings transcript can serve fundamental analysis, sentiment modeling, thematic research, competitive intelligence, and risk assessment. All without conflicting metadata pulling in different directions. Each workflow gets what it needs because the agent reasons about combinations dynamically, not because someone documented every possible use case upfront.

Context isn't pre-written in documentation. It's computed at runtime based on goals.

Agents adapt to novel situations, discover unexpected data combinations, and avoid the maintenance burden of outdated annotations. Most importantly: they find alpha at intersections you didn't anticipate, because you haven't told them which intersections are "allowed".

Less is more

If you're enriching datasets with elaborate contextual metadata, stop. You're building at the wrong layer.

The practical shift:

Stop: Adding business logic to datasets. Documenting use cases in metadata. Pre-deciding how data should be combined. Trying to make datasets "self-explanatory" with context.

Start: Keeping datasets structurally clean. Building agent intelligence that reasons about combinations. Trusting AI to figure out context based on goals.

At OpenBB, we've designed around this philosophy. Users select which widget metadata gets exposed to agents - they control what context is available rather than inheriting someone else's assumptions. More importantly, users build their own agents and integrate them into their workspace, defining context based on what actually matters for their specific workflows. The agents aren't reading generic dataset descriptions. They're reasoning within user-defined contexts about which data to combine and why.

This is the shift: from data teams trying to predict every use case, to agents defining context dynamically based on actual work.

Both extremes of the IQ curve arrive at minimal metadata. One by accident: they never thought about it. The other by design: they understand that intelligence lives in agents, not annotations.

The difference between them? Knowing that context is a property of use, not data. That flexibility beats comprehensiveness. That the smartest datasets are the most generic.

The smartest thing you can do with your data is almost nothing.

Stop over-engineering.

Start trusting your AI.

Less is more.
