Documentation
build-kg is a skill for coding agents that turns any topic into a structured knowledge graph stored in Apache AGE (PostgreSQL). No vendor lock-in. No hosting fees. Just your own database.
Give build-kg a topic and your coding agent autonomously generates an ontology, discovers authoritative sources, crawls documentation, chunks and loads documents, parses every fragment with an LLM, and produces a queryable graph you can explore with standard Cypher queries.
Key Capabilities
- Auto-generated ontology — Your coding agent analyzes your topic and creates a tailored graph schema (node types, edge types, properties, JSON schema) automatically. No manual schema design required.
- Self-hosted — everything runs on your machine. The graph lives in your own PostgreSQL database with Apache AGE. No cloud vendor, no platform fees.
- Any topic — from Kubernetes networking to machine learning algorithms to React architecture patterns. The ontology is auto-generated for your topic.
Installation
Prerequisites
- Python 3.10+
- Docker — for PostgreSQL + Apache AGE
- Anthropic API key (or OpenAI API key) — for LLM-based text parsing
Install
Configure your environment
.env.example has database credentials that match the Docker container.
You only need to add your ANTHROPIC_API_KEY (or OPENAI_API_KEY if using OpenAI).
Activate the skill
The /build-kg skill is included in the repository. It activates automatically based on your coding agent:
| Agent | Skill file | How it works |
|---|---|---|
| Claude Code | .claude/skills/build-kg/SKILL.md | Auto-detected. Type /build-kg <topic> to use. |
| Amazon Kiro | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| Qoder | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| Antigravity | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| OpenAI Codex | AGENTS.md | Auto-detected. Ask the agent to “build a knowledge graph about <topic>”. |
| GitHub Copilot | .github/copilot-instructions.md | Auto-detected. Ask the agent to “build a knowledge graph about <topic>”. |
| Cursor | .cursor/rules/build-kg.mdc | Auto-detected. Ask the agent to “build a knowledge graph about <topic>”. |
| Windsurf | .windsurf/rules/build-kg.md | Auto-detected. Ask the agent to “build a knowledge graph about <topic>”. |
Quick Start
Use the /build-kg skill in your coding agent (Claude Code, OpenAI Codex, GitHub Copilot, Cursor, Windsurf, Amazon Kiro, Qoder, or Antigravity) to build a knowledge graph:
This runs all phases autonomously — ontology generation, source discovery, crawling, chunking, loading, parsing, and validation. Expected output: a queryable knowledge graph with Component, Concept, and Configuration nodes connected by USES, CONFIGURES, and DEPENDS_ON edges.
Pipeline
build-kg transforms any topic into a knowledge graph through an 8-phase pipeline. Each phase produces an artifact that feeds the next.
Phase 0: Initialize
Create a working directory and choose a graph name. Naming conventions:
kg_<topic> (e.g., kg_k8s_net for Kubernetes networking)
The skill handles directory creation automatically.
Phase 0.5: Ontology Generation
The key differentiator of build-kg. Your coding agent analyzes the subject and auto-generates an ontology — node types, edge types, properties, and a JSON schema — saved as ontology.yaml in the working directory. The generated ontology includes:
- Node types with labels, descriptions, and properties (3–7 types recommended)
- Edge types with source/target node labels and descriptions
- Root node — the primary node type that maps 1:1 to source fragments
- JSON schema — the expected LLM output format
Phase 1: Source Discovery
A 5-round methodology identifies authoritative sources for the topic. Each round progressively deepens coverage:
| Round | Name | Method | Purpose |
|---|---|---|---|
| 1 | Landscape Mapping | 8–15 parallel web searches | Identify authoritative sources |
| 2 | Deep Source Discovery | Fetch main pages | Find documents, standards, specs |
| 3 | Coverage Verification | Cross-reference checklist | Identify gaps in topic coverage |
| 4 | Gap Filling | Targeted searches | Fill gaps; aim for 90%+ coverage |
| 5 | Secondary Sources | Supporting material | Add context and references |
The output is a crawl manifest (crawl_manifest.json) listing every URL with
metadata and priority tiers.
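As a rough illustration, the discovery output can be assembled like this (make_manifest is a hypothetical helper, not the project's actual code; field names follow the manifest format documented on this page):

```python
import json
from datetime import datetime, timezone

def make_manifest(topic, graph_name, sources):
    # The loader matches crawl output by source_name, so names must be unique.
    names = [s["source_name"] for s in sources]
    if len(names) != len(set(names)):
        raise ValueError("source_name values must be unique")
    return {
        "topic": topic,
        "graph_name": graph_name,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "sources": sources,
        "defaults": {},
    }

manifest = make_manifest(
    "Kubernetes networking",
    "kg_k8s_net",
    [{
        "source_name": "k8s_networking_docs",
        "url": "https://kubernetes.io/docs/concepts/services-networking/",
        "title": "Kubernetes Networking Concepts",
        "priority": "P1",
    }],
)
print(json.dumps(manifest, indent=2))
```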
Phase 2: Crawl
Breadth-first web crawling using Crawl4AI with headless Chromium. Supports JavaScript-rendered pages. Pages are saved as markdown. Priority tiers control depth, page limits, and delay:
| Tier | Description | Depth | Max Pages | Delay |
|---|---|---|---|---|
| P1 | Primary authoritative | 3 | 100 | 1500 ms |
| P2 | Secondary documentation | 2 | 50 | 1500 ms |
| P3 | Supporting guides | 1 | 25 | 2000 ms |
| P4 | Reference material | 1 | 15 | 2000 ms |
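Combining the tier defaults above with the per-source depth, max_pages, and delay overrides from the manifest suggests a resolution rule like this sketch (TIER_DEFAULTS and crawl_settings are illustrative names, not the actual crawler code):

```python
# Tier defaults taken from the table above; delay is in milliseconds.
TIER_DEFAULTS = {
    "P1": {"depth": 3, "max_pages": 100, "delay": 1500},
    "P2": {"depth": 2, "max_pages": 50, "delay": 1500},
    "P3": {"depth": 1, "max_pages": 25, "delay": 2000},
    "P4": {"depth": 1, "max_pages": 15, "delay": 2000},
}

def crawl_settings(source):
    # Start from the tier defaults, then apply any per-source overrides.
    settings = dict(TIER_DEFAULTS[source.get("priority", "P4")])
    for key in ("depth", "max_pages", "delay"):
        if key in source:
            settings[key] = source[key]
    return settings

settings = crawl_settings({"priority": "P1", "max_pages": 80})
```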
Phase 3: Chunk
Documents are split into semantically coherent fragments using the Unstructured library. Two strategies are available:
- by_title (recommended) — respects headings and section boundaries
- basic — fills chunks uniformly to the character limit
Each chunk is saved as JSON with metadata: source path, index, SHA-256 fingerprint, and position.
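The SHA-256 fingerprint makes reloads idempotent: an unchanged chunk hashes to the same value. A minimal sketch of the per-chunk record (chunk_record is an illustrative helper):

```python
import hashlib

def chunk_record(text, source_path, index):
    # Fingerprint the raw text so unchanged chunks can be deduplicated on reload.
    return {
        "source_path": source_path,
        "index": index,
        "fingerprint": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "excerpt": text,
    }

rec = chunk_record("Every Pod gets its own IP address.", "k8s_networking_docs/page1.md", 0)
```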
Phase 4: Load
Chunks are loaded into PostgreSQL tables (source_document and source_fragment)
with metadata from the crawl manifest. Adjacent chunks are linked via context_before
and context_after fields.
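The adjacency linking can be sketched as follows; the 200-character window matches the schema description, while link_contexts itself is an illustrative helper, not the actual loader code:

```python
def link_contexts(chunks, width=200):
    rows = []
    for i, text in enumerate(chunks):
        rows.append({
            "excerpt": text,
            # Last 200 chars of the preceding chunk, first 200 of the following one.
            "context_before": chunks[i - 1][-width:] if i > 0 else None,
            "context_after": chunks[i + 1][:width] if i + 1 < len(chunks) else None,
        })
    return rows

rows = link_contexts(["alpha " * 50, "bravo " * 50, "charlie " * 50])
```

The boundary chunks keep None for the missing side, matching the nullable columns in the schema below.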
Database Schema
source_document — one row per crawled source document (webpage):
| Column | Type | Nullable | Description |
|---|---|---|---|
| doc_id | UUID (PK) | no | Auto-generated primary key |
| jurisdiction | TEXT | yes | Jurisdiction code (null for generic topics) |
| authority | TEXT | yes | Source organization name (null for generic topics) |
| publisher | TEXT | yes | Publishing organization |
| doc_type | TEXT | yes | Document type (null for generic topics) |
| title | TEXT | no | Document title |
| url | TEXT | no | Source URL |
| filepath | TEXT (UNIQUE) | no | Local file path, used for upsert deduplication |
| metadata | JSONB | yes | Arbitrary key-value metadata |
| retrieved_at | TIMESTAMPTZ | no | When the document was crawled |
source_fragment — one row per chunk of a source document:
| Column | Type | Nullable | Description |
|---|---|---|---|
| fragment_id | UUID (PK) | no | Auto-generated primary key |
| doc_id | UUID (FK) | no | Reference to parent source_document |
| excerpt | TEXT | no | The actual text content of the chunk |
| context_before | TEXT | yes | Last 200 characters of the preceding chunk |
| context_after | TEXT | yes | First 200 characters of the following chunk |
| source_url | TEXT | yes | URL of the original source |
| jurisdiction | TEXT | yes | Inherited from source_document |
| authority | TEXT | yes | Inherited from source_document |
| metadata | JSONB | yes | Arbitrary key-value metadata |
Phase 5: Parse
Each fragment is sent to the configured LLM (a Claude Haiku model by default, or GPT-4o-mini when using OpenAI) with a structured prompt derived from the ontology. The LLM returns entities and relationships in JSON, which are loaded into Apache AGE as graph vertices and edges. The skill supports both synchronous parsing (real-time, standard pricing) and batch parsing (50% cheaper, results in 1–24 hours).
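Given the _from_index/_to_index convention from the generated JSON schema, resolving LLM output into vertices and edges can be sketched as (to_graph is an illustrative helper):

```python
def to_graph(parsed):
    entities = parsed["entities"]
    # Vertices keep every non-underscore field as a property.
    vertices = [(e["_label"], {k: v for k, v in e.items() if not k.startswith("_")})
                for e in entities]
    # Relationships reference entities by position in the entities array.
    edges = [(entities[r["_from_index"]]["name"], r["_label"], entities[r["_to_index"]]["name"])
             for r in parsed["relationships"]]
    return vertices, edges

vertices, edges = to_graph({
    "entities": [
        {"_label": "Component", "name": "kube-proxy"},
        {"_label": "Concept", "name": "iptables"},
    ],
    "relationships": [{"_label": "USES", "_from_index": 0, "_to_index": 1}],
})
```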
Phase 6: Validate
Cypher queries count nodes by label from the ontology and produce a report card of the completed graph.
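A validation pass can be as simple as one count query per ontology label. A sketch (count_query is an illustrative name; in practice the labels come from ontology.yaml):

```python
def count_query(graph_name, label):
    # Wrap a Cypher count in the SQL shim Apache AGE expects.
    return (f"SELECT * FROM cypher('{graph_name}', "
            f"$$ MATCH (n:{label}) RETURN count(n) $$) AS (cnt agtype);")

queries = [count_query("kg_k8s_net", label)
           for label in ("Component", "Concept", "Configuration")]
```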
Ontology System
The ontology defines what your knowledge graph looks like — the types of nodes, their properties, and how they connect via edges.
Auto-Generated Ontology
For generic topics (using the default profile or no profile), your coding agent auto-generates
an ontology tailored to the subject. Example for "kubernetes networking":
nodes:
- label: Component
description: "A Kubernetes networking component"
properties:
name: string
type: string
description: string
layer: string
- label: Concept
description: "A networking concept or protocol"
properties:
name: string
description: string
category: string
- label: Configuration
description: "A configuration option or setting"
properties:
name: string
description: string
default_value: string
edges:
- label: USES
source: Component
target: Concept
description: "Component uses this concept"
- label: CONFIGURES
source: Configuration
target: Component
description: "Configuration applies to component"
- label: DEPENDS_ON
source: Component
target: Component
description: "Component depends on another"
root_node: Component
json_schema: |
{
"entities": [
{"_label": "Component|Concept|Configuration", "name": "...", ...}
],
"relationships": [
{"_label": "USES|CONFIGURES|DEPENDS_ON", "_from_index": 0, "_to_index": 1}
]
}
OntologyConfig Data Model
| Model | Fields | Description |
|---|---|---|
| NodeDef | label, description, properties | A node type in the graph. Properties map name to type (string, number, boolean, list). |
| EdgeDef | label, source, target, description | An edge type connecting two node types. |
| OntologyConfig | nodes, edges, root_node, json_schema | Complete ontology definition. root_node is the primary type that maps 1:1 to source fragments. |
Pydantic Models
from typing import Dict, List, Optional

from pydantic import BaseModel

class NodeDef(BaseModel):
    label: str                       # Vertex label (e.g., "Component", "Concept")
    description: str = ""            # Used in LLM prompt for guidance
    properties: Dict[str, str] = {}  # name -> type (string, number, boolean, list)

class EdgeDef(BaseModel):
    label: str             # Edge label (e.g., "USES", "DERIVED_FROM")
    source: str            # Source node label
    target: str            # Target node label
    description: str = ""  # Used in LLM prompt for guidance

class OntologyConfig(BaseModel):
    description: str = ""
    nodes: List[NodeDef] = []
    edges: List[EdgeDef] = []
    root_node: str = ""                # Primary node that maps 1:1 to source fragments
    json_schema: Optional[str] = None  # Expected LLM output JSON format
You can edit the generated ontology.yaml file to adjust node types, properties, and edge definitions.
Domain Profiles
Domain profiles are YAML files that parameterize the entire pipeline: ontology, LLM prompt, ID extraction patterns, and source discovery templates.
Built-in Profiles
| Profile | Ontology | Domain |
|---|---|---|
| default | None (auto-generated) | Any topic |
Selecting a Profile
Set the DOMAIN environment variable in your .env file before running the skill:
Profile Inheritance
Profiles can inherit from a base using extends: default. Child fields override
parent fields via deep merge. This avoids repeating universal configuration in every profile.
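The deep-merge rule can be sketched as follows (an illustrative implementation, assuming nested dicts merge recursively while scalars and lists are replaced wholesale):

```python
def deep_merge(base, child):
    merged = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested dicts
        else:
            merged[key] = value  # child scalars and lists replace parent values
    return merged

base = {"parsing": {"system_message": "generic", "rate_limit": 1.0}}
child = {"extends": "default", "parsing": {"system_message": "domain-specific"}}
profile = deep_merge(base, child)
```

Note how the child overrides system_message but inherits rate_limit untouched, which is what lets profiles stay short.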
When loading a profile by name:
- If the name is a file path (ends in .yaml or .yml), it is loaded directly
- Otherwise, the loader looks for {name}.yaml in src/build_kg/domains/
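That lookup rule translates directly into a small resolver sketch (resolve_profile is an illustrative name):

```python
from pathlib import Path

def resolve_profile(name, profiles_dir="src/build_kg/domains"):
    # Explicit file paths load directly; bare names map into the profiles directory.
    if name.endswith((".yaml", ".yml")):
        return Path(name)
    return Path(profiles_dir) / f"{name}.yaml"

path = resolve_profile("default")
```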
Creating Custom Profiles
A profile controls three pipeline aspects: Ontology (node/edge types), Parsing (system message and extraction rules), and Discovery (search templates, sub-domains).
Profile YAML Structure
name: "My Domain"
description: "Custom domain profile"
version: "1.0"
extends: default
ontology:
root_node: MyEntity
nodes:
- label: MyEntity
description: "Primary entity type"
properties: {name: string, type: string}
edges:
- label: RELATES_TO
source: MyEntity
target: MyEntity
json_schema: |
{"entities": [...], "relationships": [...]}
parsing:
system_message: "You are an expert in..."
requirement_types: [...]
id_patterns:
patterns: {}
discovery:
search_templates:
- "<topic> official documentation"
sub_domains:
- name: "Sub-area 1"
description: "..."
Configuration
All configuration is via .env file or environment variables.
| Variable | Default | Description |
|---|---|---|
| DB_HOST | localhost | PostgreSQL host |
| DB_PORT | 5432 | PostgreSQL port |
| DB_NAME | buildkg | Database name |
| DB_USER | buildkg | Database user |
| DB_PASSWORD | — | Database password (required) |
| LLM_PROVIDER | anthropic | LLM provider: anthropic or openai |
| ANTHROPIC_API_KEY | — | Anthropic API key (required if using Anthropic) |
| ANTHROPIC_MODEL | claude-haiku-4-5-20251001 | Anthropic model for parsing |
| OPENAI_API_KEY | — | OpenAI API key (required if using OpenAI) |
| OPENAI_MODEL | gpt-4o-mini | OpenAI model for parsing (when using OpenAI) |
| AGE_GRAPH_NAME | knowledge_graph | Apache AGE graph name |
| BATCH_SIZE | 10 | Fragments per processing batch |
| MAX_WORKERS | 3 | Concurrent worker threads |
| DOMAIN | food-safety | Domain profile name or file path |
| RATE_LIMIT_DELAY | 1.0 | Seconds between API calls (sync parser) |
.env File Format
DB_PASSWORD and an LLM API key (ANTHROPIC_API_KEY or OPENAI_API_KEY) are required. All other variables
have sensible defaults. Run make verify to check your configuration.
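A minimal settings reader consistent with the defaults table above might look like this (setting and DEFAULTS are illustrative names, not the project's actual code):

```python
import os

DEFAULTS = {
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "buildkg",
    "DB_USER": "buildkg",
    "LLM_PROVIDER": "anthropic",
    "AGE_GRAPH_NAME": "knowledge_graph",
}

def setting(name):
    # Environment wins; fall back to the documented default; fail if truly required.
    value = os.environ.get(name, DEFAULTS.get(name))
    if value is None:
        raise RuntimeError(f"{name} is required (set it in .env)")
    return value

provider = setting("LLM_PROVIDER")
```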
Manifest Format
The crawl manifest (crawl_manifest.json) is the central configuration file for a build-kg pipeline run.
It lists every source to crawl, provides metadata for each source, defines defaults, and tracks coverage.
JSON Schema
{
"topic": "string",
"graph_name": "string",
"created_at": "ISO 8601 string",
"sources": [ ... ],
"defaults": { ... },
"coverage": { ... },
"metadata": { ... }
}
Top-Level Fields
| Field | Type | Required | Description |
|---|---|---|---|
| topic | string | yes | Human-readable topic description |
| graph_name | string | yes | Apache AGE graph name (reg_<country>_<domain> or kg_<topic>) |
| created_at | string | yes | ISO 8601 timestamp |
| sources | array | yes | List of source objects |
| defaults | object | no | Default metadata for unmatched chunks |
| coverage | object | no | Tracks sub-domain coverage |
| metadata | object | no | Arbitrary key-value metadata |
Source Object Fields
| Field | Type | Required | Description |
|---|---|---|---|
| source_name | string | yes | Short identifier used as crawl output directory name. Must be unique. Used by loader for source matching. |
| url | string | yes | Starting URL for the crawl |
| title | string | yes | Descriptive title |
| authority | string | no | Issuing organization (nullable for generic topics) |
| jurisdiction | string | no | Jurisdiction code (nullable for generic topics) |
| doc_type | string | no | Document type (nullable for generic topics) |
| priority | string | no | Priority tier: P1, P2, P3, or P4 |
| sub_domains | array | no | List of sub-domain strings this source covers |
| depth | integer | no | Crawl depth override |
| max_pages | integer | no | Maximum pages override |
| delay | integer | no | Crawl delay in ms override |
Example: Generic Manifest
{
"topic": "Kubernetes networking",
"graph_name": "kg_k8s_net",
"created_at": "2026-02-20T10:00:00Z",
"sources": [
{
"source_name": "k8s_networking_docs",
"url": "https://kubernetes.io/docs/concepts/services-networking/",
"title": "Kubernetes Networking Concepts",
"priority": "P1",
"depth": 3,
"max_pages": 80,
"delay": 1500
},
{
"source_name": "cilium_docs",
"url": "https://docs.cilium.io/en/stable/",
"title": "Cilium CNI Documentation",
"priority": "P2",
"depth": 2,
"max_pages": 50,
"delay": 1500
}
],
"defaults": {}
}
Examples
Kubernetes Networking
Sample queries:
-- Count nodes by type
SELECT * FROM cypher('kg_k8s_net', $$
MATCH (n:Component) RETURN count(n)
$$) AS (cnt agtype);
-- Find relationships
SELECT * FROM cypher('kg_k8s_net', $$
MATCH (c:Component)-[:USES]->(concept:Concept)
RETURN c.name, concept.name, concept.category
LIMIT 10
$$) AS (component agtype, concept agtype, category agtype);
-- Traverse the graph (multi-hop)
SELECT * FROM cypher('kg_k8s_net', $$
MATCH (a:Component)-[:DEPENDS_ON]->(b:Component)-[:USES]->(c:Concept)
RETURN a.name, b.name, c.name
LIMIT 10
$$) AS (comp_a agtype, comp_b agtype, concept agtype);
Machine Learning Algorithms
Sample queries:
-- Find all algorithms of a specific type
SELECT * FROM cypher('kg_ml_opt', $$
MATCH (a:Algorithm)
WHERE a.type = 'gradient-based'
RETURN a.name, a.description
LIMIT 20
$$) AS (name agtype, description agtype);
-- Find algorithms used in applications
SELECT * FROM cypher('kg_ml_opt', $$
MATCH (a:Algorithm)-[:APPLIED_TO]->(app:Application)
RETURN a.name, app.name
LIMIT 10
$$) AS (algorithm agtype, application agtype);
Cost
The only cost is LLM API calls during the parse phase. Everything else — crawling, chunking, loading, database, queries — runs locally for free.
| Fragments | Sync Parser | Batch Parser (50% off) |
|---|---|---|
| 100 | ~$0.03 | ~$0.015 |
| 1,000 | ~$0.30 | ~$0.15 |
| 5,000 | ~$1.50 | ~$0.75 |
| 10,000 | ~$3.00 | ~$1.50 |
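The table implies roughly $0.0003 per fragment with the sync parser. A back-of-envelope estimator (illustrative only; actual cost depends on fragment size and model pricing):

```python
SYNC_COST_PER_FRAGMENT = 0.0003  # ~ $0.03 per 100 fragments, per the table above

def estimate_cost(fragments, batch=False):
    cost = fragments * SYNC_COST_PER_FRAGMENT
    if batch:
        cost *= 0.5  # batch API pricing is roughly half of sync
    return round(cost, 4)

sync_cost = estimate_cost(5000)
batch_cost = estimate_cost(5000, batch=True)
```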
What's Free
- Crawling — uses open-source Crawl4AI with local Chromium
- Chunking — uses open-source Unstructured library locally
- Loading — direct PostgreSQL inserts
- Database — Docker container on your machine
- Queries — standard Cypher via Apache AGE
Troubleshooting
Quick Reference
| Error | Phase | Solution |
|---|---|---|
| Connection refused on port 5432 | Any DB operation | Start the database container |
| relation "source_fragment" does not exist | Load / Parse | Run the init SQL or recreate the container |
| AGE extension not found | Setup / Parse | Run make db-init |
| LLM API error 401 | Parse | Check ANTHROPIC_API_KEY (or OPENAI_API_KEY) in .env |
| Chromium not found | Crawl | Run crawl4ai-setup |
| No fragments found | Parse | Verify the database load succeeded |
| Batch still in "validating" | Batch Parse | Wait; batches take 1–24 hours |
| Ontology file not found | Parse / Setup | Check --ontology path is correct |
| Empty entities in LLM output | Parse (ontology) | Check text quality and ontology fit |
| PDF support not installed | Chunk | Run make setup |
| DB_PASSWORD is required | Any | Run cp .env.example .env |
Connection refused on port 5432
Symptom:
Fix:
If the container is running but you still cannot connect, check if another PostgreSQL instance is using port 5432 (change DB_PORT in .env), or if you are running inside a VM or container (use the correct host IP instead of localhost).
relation "source_fragment" does not exist
Symptom:
Fix:
AGE extension not found
Symptom:
Fix: Make sure you are using the Apache AGE Docker image, then run:
LLM API error 401
Symptom:
Fix:
- Check your .env file: grep ANTHROPIC_API_KEY .env
- Verify the key format: ANTHROPIC_API_KEY=sk-ant-...
- Run make verify to test all connections
- Check your account for billing issues at console.anthropic.com
Chromium not found
Symptom:
Fix:
If crawl4ai-setup fails, check disk space (~400MB required) and install missing libraries on Linux: sudo apt install libnss3 libatk-bridge2.0-0 libdrm2 libxcomposite1 libxdamage1 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 libasound2
No fragments found
Symptom:
Fix:
- Check the fragment count: SELECT COUNT(*) FROM source_fragment;
- If zero, re-run the skill or check that the load phase completed successfully
- If you filtered results, verify the data exists in the database
Batch still in "validating"
This is normal. Batches typically complete within 1–24 hours, and the skill monitors batch progress automatically. You can also monitor manually with:
If the batch stays in "validating" for more than an hour, check the JSONL file for formatting errors and your provider account for rate-limit or quota issues.
DB_PASSWORD is required
Fix:
Empty entities in LLM output (ontology mode)
The parser reports many fragments as "skipped" with 0 entities extracted. This happens when the text is too short/generic, the ontology node types do not match the content, or chunks are too small.
Fix:
- Check that your ontology node types are appropriate for the source content
- Try increasing --max-chars during chunking for larger, more informative fragments
- Review the auto-generated ontology and adjust definitions if needed
Getting More Help
If your issue is not listed here:
- Run make verify to check overall system health
- Check Docker container logs: docker compose -f db/docker-compose.yml logs
- Open an issue at github.com/agtm1199/build-kg/issues with the exact error message, the command you ran, your Python version, and your OS
FAQ
To use a custom domain, set the DOMAIN environment variable to point to your profile. You can also create a full domain profile with custom parsing instructions and discovery templates.
To change the model, set the LLM_PROVIDER and ANTHROPIC_MODEL environment variables in your .env file.
OpenAI models (e.g., gpt-4o-mini) are also supported by setting LLM_PROVIDER=openai.
The graph lives in plain PostgreSQL, so you can use pg_dump, Cypher queries, or any PostgreSQL client to export data. You can also query the graph directly from any application that connects to PostgreSQL.
To query the graph, call the cypher() function via any PostgreSQL client.
Example: SELECT * FROM cypher('kg_k8s_net', $$ MATCH (n) RETURN n LIMIT 10 $$) as (n agtype);
You can use psql, DBeaver, pgAdmin, or connect programmatically from any language with a PostgreSQL driver.