Week 2: Building the AI knowledge layer at bigspark

I’m late with this one. My son turned 18, and we spent the week in Leipzig watching Crystal Palace win a European title. Some things matter more than shipping articles on time. Normal service resumes.

The work didn’t stop while I was away. The knowledge layer kept growing, and that’s what this week is about.

The problem

Without grounded knowledge, an AI agent is just expensive autocomplete.

You can have a sophisticated agent architecture — planning loops, tool use, multi-model orchestration — and it’ll still confidently make things up if it doesn’t have access to authoritative, structured information.

Every enterprise I talk to has the same gap. They’ve got AI tools. They’ve got enthusiastic teams. What they don’t have is a way to get the right knowledge into the system reliably, maintainably, and with an audit trail.

That’s the problem we spent the week on.

Two tiers of knowledge

We’ve built two different knowledge systems, each for a different job.

Tier 1: The Wiki (general-purpose memory)

Based on the Karpathy LLM Wiki pattern, this is organisational memory that builds itself. Feed it URLs, files, images, conversations, and it ingests, structures, and interlinks everything into a searchable knowledge base.

It’s the team’s shared memory. Unstructured, flexible, growing constantly. Anyone can feed it. The bot lives in Slack and Telegram, backed by a 27-tool MCP server. Ask it a question, it searches its own memory. Give it something new, it files it away.

This is where the messy stuff lives: meeting notes, research, quick references. The things that don’t fit a schema but need to be findable.

As it grows, we’ll likely move it to a proper vector database. For now, it’s git-backed and fast enough.

Tier 2: GAAARS Knowledge Bases (strongly typed, purpose-built)

GAAARS stands for Git As An Asset Repository Service. The idea: treat a knowledge base the way you’d treat code. Version it. Test it. Deploy it. Review changes in PRs. Run CI against it.

Each specialist KB is a git repository containing:

Content — markdown files mirroring the source structure (FCA handbook sections, Shelter housing advice pages, PRA rulebook chapters)
Embeddings — pre-computed vectors stored as JSONL, with the model pinned
Metadata — content hashes, source URLs, and timestamps for change detection
Ontology — the domain’s structure and relationships
Scrapers and indexers — Python tooling that keeps the content fresh
Tests and CI — automated validation on every push
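
To make the metadata entry concrete, here is a minimal sketch of hash-based change detection. The `PageMeta` fields and function names are our illustration for this article, not the actual GAAARS schema, and the normalisation rule (ignore trailing whitespace) is an assumption.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PageMeta:
    """Hypothetical metadata record for one content file."""
    path: str
    source_url: str
    content_hash: str
    fetched_at: str  # ISO 8601 timestamp


def content_hash(text: str) -> str:
    """SHA-256 of the text with trailing whitespace stripped from each line."""
    normalised = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def changed_paths(stored: dict[str, PageMeta], scraped: dict[str, str]) -> list[str]:
    """Return paths that are new, or whose hash differs from the stored one."""
    changed = []
    for path, text in scraped.items():
        meta = stored.get(path)
        if meta is None or meta.content_hash != content_hash(text):
            changed.append(path)
    return sorted(changed)
```

Only the paths this returns would need re-chunking and re-embedding, which keeps a refresh cheap when most of the source hasn't moved.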

No vector database. No managed service. No monthly bill that scales with your data. Just files in git.
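
What "just files" can look like at query time: a rough sketch of brute-force cosine search over a JSONL embeddings file, using only the standard library. The record fields (`id`, `model`, `vector`) are assumed for illustration, and a real indexer would likely use NumPy rather than pure Python loops.

```python
import json
import math


def load_embeddings(jsonl_text: str, expected_model: str) -> list[dict]:
    """Parse one JSON record per line and refuse vectors from a different model."""
    records = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record["model"] != expected_model:
            raise ValueError(
                f"{record['id']}: embedded with {record['model']}, expected {expected_model}"
            )
        records.append(record)
    return records


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector: list[float], records: list[dict], k: int = 3) -> list[tuple[float, str]]:
    """Score every record against the query and return the k best (score, id) pairs."""
    scored = [(cosine(query_vector, r["vector"]), r["id"]) for r in records]
    scored.sort(reverse=True)
    return scored[:k]
```

The model check is the reason the model is pinned: vectors from two different embedding models aren't comparable, so mixing them should fail loudly rather than return plausible nonsense.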

Why git, not a database

The cost saving is real. We’re running a dozen knowledge bases with no infrastructure cost beyond GitHub. But that’s not the main reason.

Semantic versioning. Every KB has releases. You know exactly which version of the FCA handbook your agent is reasoning against. When regulation changes, you can diff it.

Git does the hard work. Branching, merging, blame, history — all free. Want to know when a piece of guidance changed? git log. Want to test a KB update before it goes live? Branch it.

Speed. Spinning up a new domain KB takes hours, not weeks. Scrape the source, chunk it, embed it, commit it. CI validates it, we tag a release, and the MCP server picks it up.
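
The "chunk it" step is the least glamorous of these. A minimal sketch, assuming chunks are built by greedily packing blank-line-separated paragraphs up to a character budget; the function name and the 800-character default are illustrative, not what our indexers necessarily use.

```python
def chunk_markdown(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact matters more for regulatory text than hitting an exact size, since a rule split mid-sentence tends to embed badly.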

Auditability. In regulated industries, “where did the agent get that information?” isn’t optional. With GAAARS the answer is always: this file, this commit, this version.
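
"This file, this commit, this version" is small enough to model as a record an agent attaches to every answer. This is a sketch: the `Citation` class and its fields are hypothetical, not a published GAAARS type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical provenance record for one retrieved passage."""
    kb: str       # knowledge base name
    version: str  # release tag the agent was pinned to
    commit: str   # full commit hash behind that release
    path: str     # file within the repository

    def render(self) -> str:
        """Human-readable form: kb@version (short commit) path."""
        return f"{self.kb}@{self.version} ({self.commit[:7]}) {self.path}"
```

Because the record is frozen and built from values git already holds, anyone with the repository can check out the cited commit and read exactly what the agent read.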

Composability. Each KB and tool is independent but exposes the same MCP interface. Our regulatory expert agent composes several at once: the FCA Handbook, FCA data, the PRA Rulebook, and MCP servers that expose live data such as the FCA Register.
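
Because every source answers the same kind of call, composing them is little more than a loop. A hedged sketch, with plain Python callables standing in for MCP tool calls; `search_all` and the `SearchFn` alias are invented for illustration.

```python
from typing import Callable

# Stand-in for an MCP search tool: takes a query, returns matching passages.
SearchFn = Callable[[str], list[str]]


def search_all(kbs: dict[str, SearchFn], query: str) -> list[tuple[str, str]]:
    """Query every knowledge base and tag each hit with the KB it came from."""
    hits: list[tuple[str, str]] = []
    for name, search in kbs.items():
        for hit in search(query):
            hits.append((name, hit))
    return hits
```

Tagging each hit with its source keeps provenance intact after the results are merged, so the agent can still say which rulebook a passage came from.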

What we shipped

A dozen GAAARS knowledge bases, each with its own repo, CI pipeline, and MCP server:

Regulatory: FCA Handbook, FCA data, PRA Rulebook

Legal: UK Legislation

Vulnerability and safeguarding: Citizens Advice, Shelter, Refuge, Women’s Aid, Karma Nirvana, Men’s Advice Line, Surviving Economic Abuse

Internal: Big Ideas (a store for our own concepts and design thinking); bigspark Components (a knowledge base of our internal repos); bigspark Products and Services (our own products and services)

Alongside these sit MCP tools that query live or external sources rather than git-tracked content: Companies House, the FCA Register and Warning List, web and OSINT search, plus UK sanctions and SRA/CLC legal-services screening. Different job, same interface.

It’s a working knowledge layer covering financial regulation, legislation, and vulnerability indicators — versioned, tested, and deployable.

The honest bit

GAAARS won’t fit everything. The wiki will outgrow flat files. Some domains need real-time indexing, or similarity search at a scale where JSONL embeddings in git don’t cut it.

We’ve already hit this. The sanctions screening and legal services tools use a local SQLite database, not git-tracked markdown, because the content didn’t suit flat files. The MCP interface is the same either way — the agent doesn’t know or care whether the backend is git or a database — so migrating a KB is a backend change, not a rewrite.
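
In code terms, "same interface, different backend" looks roughly like this. It's a sketch under assumptions: the `KnowledgeBackend` protocol, both classes, and the `entries` table layout are invented for illustration, and naive substring matching stands in for whatever retrieval each real backend does.

```python
import sqlite3
from typing import Protocol


class KnowledgeBackend(Protocol):
    """The only thing the agent-facing layer depends on."""
    def search(self, query: str) -> list[str]: ...


class MarkdownBackend:
    """Git-tracked flat files, represented here as a path -> text mapping."""
    def __init__(self, files: dict[str, str]) -> None:
        self._files = files

    def search(self, query: str) -> list[str]:
        needle = query.lower()
        return sorted(p for p, text in self._files.items() if needle in text.lower())


class SqliteBackend:
    """Local SQLite database with a hypothetical entries(name, body) table."""
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def search(self, query: str) -> list[str]:
        rows = self._conn.execute(
            "SELECT name FROM entries WHERE lower(body) LIKE ? ORDER BY name",
            (f"%{query.lower()}%",),
        )
        return [name for (name,) in rows]


def answer_sources(backend: KnowledgeBackend, query: str) -> list[str]:
    """Caller code is identical whichever backend is plugged in."""
    return backend.search(query)
```

Swapping a KB from markdown to SQLite then means changing which class gets constructed, while `answer_sources` and everything above it stays untouched.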

When a knowledge base earns the complexity of a vector or index database, we’ll move it. Until then, we won’t pay that tax.

Start simple. Earn complexity. Same principle as the bricklaying: don’t over-engineer the first course.

What’s next

The knowledge layer is down. The agents can now reason against real, authoritative, versioned information that we own outright. Next week: what owning your own data layer means for all the software you currently rent.

Article 3 in a series documenting bigspark’s AI-native transformation. Article 1: Staff Augmentation Is Dead. Article 2: Measure Twice, Build Once.

Week 2: Knowledge is Power.
