Documentation
build-kg is a skill for coding agents that turns any topic into a structured knowledge graph stored in Apache AGE (PostgreSQL). No vendor lock-in. No hosting fees. Just your own database.
Give build-kg a topic and your coding agent autonomously generates an ontology, discovers authoritative sources, crawls documentation, chunks and loads documents, parses every fragment with an LLM, and produces a queryable graph you can explore with standard Cypher queries.
Key Capabilities
- Auto-generated ontology — Your coding agent analyzes your topic and creates a tailored graph schema (node types, edge types, properties, JSON schema) automatically. No manual schema design required.
- Self-hosted — everything runs on your machine. The graph lives in your own PostgreSQL database with Apache AGE. No cloud vendor, no platform fees.
- Any topic — from Kubernetes networking to machine learning algorithms to React architecture patterns. The ontology is auto-generated for your topic.
Installation
Prerequisites
- Python 3.10+
- Docker — for PostgreSQL + Apache AGE
- Anthropic API key (or OpenAI API key) — for LLM-based text parsing
Install
Configure your environment
.env.example has database credentials that match the Docker container.
You only need to add your ANTHROPIC_API_KEY (or OPENAI_API_KEY if using OpenAI).
Activate the skill
The /build-kg skill is included in the repository. It activates automatically based on your coding agent:
| Agent | Skill file | How it works |
|---|---|---|
| Claude Code | .claude/skills/build-kg/SKILL.md | Auto-detected. Type /build-kg <topic> to use. |
| Amazon Kiro | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| Qoder | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| Antigravity | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| OpenAI Codex | AGENTS.md | Auto-detected. Ask the agent to “build a knowledge graph about <topic>”. |
| GitHub Copilot | .github/copilot-instructions.md | Auto-detected. Ask the agent to “build a knowledge graph about <topic>”. |
| Cursor | .cursor/rules/build-kg.mdc | Auto-detected. Ask the agent to “build a knowledge graph about <topic>”. |
| Windsurf | .windsurf/rules/build-kg.md | Auto-detected. Ask the agent to “build a knowledge graph about <topic>”. |
Quick Start
Use the /build-kg skill in your coding agent (Claude Code, OpenAI Codex, GitHub Copilot, Cursor, Windsurf, Amazon Kiro, Qoder, or Antigravity) to build a knowledge graph:
This runs all phases autonomously — ontology generation, source discovery, crawling, chunking, loading, parsing, and validation. Expected output: a queryable knowledge graph with Component, Concept, and Configuration nodes connected by USES, CONFIGURES, and DEPENDS_ON edges.
Pipeline
build-kg transforms any topic into a knowledge graph through an 8-phase pipeline. Each phase produces an artifact that feeds the next.
Phase 0: Initialize
Create a working directory and choose a graph name. Naming conventions:
kg_<topic> (e.g., kg_k8s_net for Kubernetes networking)
The skill handles directory creation automatically.
Phase 0.5: Ontology Generation
The key differentiator of build-kg. Your coding agent analyzes the subject and auto-generates an ontology — node types, edge types, properties, and a JSON schema — saved as ontology.yaml in the working directory. The generated ontology includes:
- Node types with labels, descriptions, and properties (3–7 types recommended)
- Edge types with source/target node labels and descriptions
- Root node — the primary node type that maps 1:1 to source fragments
- JSON schema — the expected LLM output format
Phase 1: Source Discovery
A 5-round methodology identifies authoritative sources for the topic. Each round progressively deepens coverage:
| Round | Name | Method | Purpose |
|---|---|---|---|
| 1 | Landscape Mapping | 8–15 parallel web searches | Identify authoritative sources |
| 2 | Deep Source Discovery | Fetch main pages | Find documents, standards, specs |
| 3 | Coverage Verification | Cross-reference checklist | Identify gaps in topic coverage |
| 4 | Gap Filling | Targeted searches | Fill gaps; aim for 90%+ coverage |
| 5 | Secondary Sources | Supporting material | Add context and references |
The output is a crawl manifest (crawl_manifest.json) listing every URL with
metadata and priority tiers.
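As a rough illustration, the discovery output can be assembled like this (make_manifest is a hypothetical helper, not the project's actual code; field names follow the manifest format documented on this page):

```python
import json
from datetime import datetime, timezone

def make_manifest(topic, graph_name, sources):
    # The loader matches crawl output by source_name, so names must be unique.
    names = [s["source_name"] for s in sources]
    if len(names) != len(set(names)):
        raise ValueError("source_name values must be unique")
    return {
        "topic": topic,
        "graph_name": graph_name,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "sources": sources,
        "defaults": {},
    }

manifest = make_manifest(
    "Kubernetes networking",
    "kg_k8s_net",
    [{
        "source_name": "k8s_networking_docs",
        "url": "https://kubernetes.io/docs/concepts/services-networking/",
        "title": "Kubernetes Networking Concepts",
        "priority": "P1",
    }],
)
print(json.dumps(manifest, indent=2))
```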
Phase 2: Crawl
Breadth-first web crawling using Crawl4AI with headless Chromium. Supports JavaScript-rendered pages. Pages are saved as markdown. Priority tiers control depth, page limits, and delay:
| Tier | Description | Depth | Max Pages | Delay |
|---|---|---|---|---|
| P1 | Primary authoritative | 3 | 100 | 1500 ms |
| P2 | Secondary documentation | 2 | 50 | 1500 ms |
| P3 | Supporting guides | 1 | 25 | 2000 ms |
| P4 | Reference material | 1 | 15 | 2000 ms |
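Combining the tier defaults above with the per-source depth, max_pages, and delay overrides from the manifest suggests a resolution rule like this sketch (TIER_DEFAULTS and crawl_settings are illustrative names, not the actual crawler code):

```python
# Tier defaults taken from the table above; delay is in milliseconds.
TIER_DEFAULTS = {
    "P1": {"depth": 3, "max_pages": 100, "delay": 1500},
    "P2": {"depth": 2, "max_pages": 50, "delay": 1500},
    "P3": {"depth": 1, "max_pages": 25, "delay": 2000},
    "P4": {"depth": 1, "max_pages": 15, "delay": 2000},
}

def crawl_settings(source):
    # Start from the tier defaults, then apply any per-source overrides.
    settings = dict(TIER_DEFAULTS[source.get("priority", "P4")])
    for key in ("depth", "max_pages", "delay"):
        if key in source:
            settings[key] = source[key]
    return settings

settings = crawl_settings({"priority": "P1", "max_pages": 80})
```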
Phase 3: Chunk
Documents are split into semantically coherent fragments using the Unstructured library. Two strategies are available:
- by_title (recommended) — respects headings and section boundaries
- basic — fills chunks uniformly to the character limit
Each chunk is saved as JSON with metadata: source path, index, SHA-256 fingerprint, and position.
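The SHA-256 fingerprint makes reloads idempotent: an unchanged chunk hashes to the same value. A minimal sketch of the per-chunk record (chunk_record is an illustrative helper):

```python
import hashlib

def chunk_record(text, source_path, index):
    # Fingerprint the raw text so unchanged chunks can be deduplicated on reload.
    return {
        "source_path": source_path,
        "index": index,
        "fingerprint": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "excerpt": text,
    }

rec = chunk_record("Every Pod gets its own IP address.", "k8s_networking_docs/page1.md", 0)
```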
Phase 4: Load
Chunks are loaded into PostgreSQL tables (source_document and source_fragment)
with metadata from the crawl manifest. Adjacent chunks are linked via context_before
and context_after fields.
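The adjacency linking can be sketched as follows; the 200-character window matches the schema description, while link_contexts itself is an illustrative helper, not the actual loader code:

```python
def link_contexts(chunks, width=200):
    rows = []
    for i, text in enumerate(chunks):
        rows.append({
            "excerpt": text,
            # Last 200 chars of the preceding chunk, first 200 of the following one.
            "context_before": chunks[i - 1][-width:] if i > 0 else None,
            "context_after": chunks[i + 1][:width] if i + 1 < len(chunks) else None,
        })
    return rows

rows = link_contexts(["alpha " * 50, "bravo " * 50, "charlie " * 50])
```

The boundary chunks keep None for the missing side, matching the nullable columns in the schema below.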
Database Schema
source_document — one row per crawled source document (webpage):
| Column | Type | Nullable | Description |
|---|---|---|---|
| doc_id | UUID (PK) | no | Auto-generated primary key |
| jurisdiction | TEXT | yes | Jurisdiction code (null for generic topics) |
| authority | TEXT | yes | Source organization name (null for generic topics) |
| publisher | TEXT | yes | Publishing organization |
| doc_type | TEXT | yes | Document type (null for generic topics) |
| title | TEXT | no | Document title |
| url | TEXT | no | Source URL |
| filepath | TEXT (UNIQUE) | no | Local file path, used for upsert deduplication |
| metadata | JSONB | yes | Arbitrary key-value metadata |
| retrieved_at | TIMESTAMPTZ | no | When the document was crawled |
source_fragment — one row per chunk of a source document:
| Column | Type | Nullable | Description |
|---|---|---|---|
| fragment_id | UUID (PK) | no | Auto-generated primary key |
| doc_id | UUID (FK) | no | Reference to parent source_document |
| excerpt | TEXT | no | The actual text content of the chunk |
| context_before | TEXT | yes | Last 200 characters of the preceding chunk |
| context_after | TEXT | yes | First 200 characters of the following chunk |
| source_url | TEXT | yes | URL of the original source |
| jurisdiction | TEXT | yes | Inherited from source_document |
| authority | TEXT | yes | Inherited from source_document |
| metadata | JSONB | yes | Arbitrary key-value metadata |
Phase 5: Parse
Each fragment is sent to the configured LLM (a Claude Haiku model by default, or GPT-4o-mini when using OpenAI) with a structured prompt derived from the ontology. The LLM returns entities and relationships in JSON, which are loaded into Apache AGE as graph vertices and edges. The skill supports both synchronous parsing (real-time, standard pricing) and batch parsing (50% cheaper, results in 1–24 hours).
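Given the _from_index/_to_index convention from the generated JSON schema, resolving LLM output into vertices and edges can be sketched as (to_graph is an illustrative helper):

```python
def to_graph(parsed):
    entities = parsed["entities"]
    # Vertices keep every non-underscore field as a property.
    vertices = [(e["_label"], {k: v for k, v in e.items() if not k.startswith("_")})
                for e in entities]
    # Relationships reference entities by position in the entities array.
    edges = [(entities[r["_from_index"]]["name"], r["_label"], entities[r["_to_index"]]["name"])
             for r in parsed["relationships"]]
    return vertices, edges

vertices, edges = to_graph({
    "entities": [
        {"_label": "Component", "name": "kube-proxy"},
        {"_label": "Concept", "name": "iptables"},
    ],
    "relationships": [{"_label": "USES", "_from_index": 0, "_to_index": 1}],
})
```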
Phase 6: Validate
Cypher queries count nodes by label from the ontology and produce a report card of the completed graph.
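A validation pass can be as simple as one count query per ontology label. A sketch (count_query is an illustrative name; in practice the labels come from ontology.yaml):

```python
def count_query(graph_name, label):
    # Wrap a Cypher count in the SQL shim Apache AGE expects.
    return (f"SELECT * FROM cypher('{graph_name}', "
            f"$$ MATCH (n:{label}) RETURN count(n) $$) AS (cnt agtype);")

queries = [count_query("kg_k8s_net", label)
           for label in ("Component", "Concept", "Configuration")]
```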
Ontology System
The ontology defines what your knowledge graph looks like — the types of nodes, their properties, and how they connect via edges.
Auto-Generated Ontology
For generic topics (using the default profile or no profile), your coding agent auto-generates
an ontology tailored to the subject. Example for "kubernetes networking":
nodes:
- label: Component
description: "A Kubernetes networking component"
properties:
name: string
type: string
description: string
layer: string
- label: Concept
description: "A networking concept or protocol"
properties:
name: string
description: string
category: string
- label: Configuration
description: "A configuration option or setting"
properties:
name: string
description: string
default_value: string
edges:
- label: USES
source: Component
target: Concept
description: "Component uses this concept"
- label: CONFIGURES
source: Configuration
target: Component
description: "Configuration applies to component"
- label: DEPENDS_ON
source: Component
target: Component
description: "Component depends on another"
root_node: Component
json_schema: |
{
"entities": [
{"_label": "Component|Concept|Configuration", "name": "...", ...}
],
"relationships": [
{"_label": "USES|CONFIGURES|DEPENDS_ON", "_from_index": 0, "_to_index": 1}
]
}
OntologyConfig Data Model
| Model | Fields | Description |
|---|---|---|
| NodeDef | label, description, properties | A node type in the graph. Properties map name to type (string, number, boolean, list). |
| EdgeDef | label, source, target, description | An edge type connecting two node types. |
| OntologyConfig | nodes, edges, root_node, json_schema | Complete ontology definition. root_node is the primary type that maps 1:1 to source fragments. |
Pydantic Models
from typing import Dict, List, Optional

from pydantic import BaseModel

class NodeDef(BaseModel):
    label: str                       # Vertex label (e.g., "Component", "Concept")
    description: str = ""            # Used in LLM prompt for guidance
    properties: Dict[str, str] = {}  # name -> type (string, number, boolean, list)

class EdgeDef(BaseModel):
    label: str             # Edge label (e.g., "USES", "DERIVED_FROM")
    source: str            # Source node label
    target: str            # Target node label
    description: str = ""  # Used in LLM prompt for guidance

class OntologyConfig(BaseModel):
    description: str = ""
    nodes: List[NodeDef] = []
    edges: List[EdgeDef] = []
    root_node: str = ""                # Primary node that maps 1:1 to source fragments
    json_schema: Optional[str] = None  # Expected LLM output JSON format
You can edit the generated ontology.yaml file to adjust node types, properties, and edge definitions.
Domain Profiles
Domain profiles are YAML files that parameterize the entire pipeline: ontology, LLM prompt, ID extraction patterns, and source discovery templates.
Built-in Profiles
| Profile | Ontology | Domain |
|---|---|---|
| default | None (auto-generated) | Any topic |
Selecting a Profile
Set the DOMAIN environment variable in your .env file before running the skill:
Profile Inheritance
Profiles can inherit from a base using extends: default. Child fields override
parent fields via deep merge. This avoids repeating universal configuration in every profile.
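The deep-merge rule can be sketched as follows (an illustrative implementation, assuming nested dicts merge recursively while scalars and lists are replaced wholesale):

```python
def deep_merge(base, child):
    merged = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested dicts
        else:
            merged[key] = value  # child scalars and lists replace parent values
    return merged

base = {"parsing": {"system_message": "generic", "rate_limit": 1.0}}
child = {"extends": "default", "parsing": {"system_message": "domain-specific"}}
profile = deep_merge(base, child)
```

Note how the child overrides system_message but inherits rate_limit untouched, which is what lets profiles stay short.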
When loading a profile by name:
- If the name is a file path (ends in .yaml or .yml), it is loaded directly
- Otherwise, the loader looks for {name}.yaml in src/build_kg/domains/
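That lookup rule translates directly into a small resolver sketch (resolve_profile is an illustrative name):

```python
from pathlib import Path

def resolve_profile(name, profiles_dir="src/build_kg/domains"):
    # Explicit file paths load directly; bare names map into the profiles directory.
    if name.endswith((".yaml", ".yml")):
        return Path(name)
    return Path(profiles_dir) / f"{name}.yaml"

path = resolve_profile("default")
```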
Creating Custom Profiles
A profile controls three pipeline aspects: Ontology (node/edge types), Parsing (system message and extraction rules), and Discovery (search templates, sub-domains).
Profile YAML Structure
name: "My Domain"
description: "Custom domain profile"
version: "1.0"
extends: default
ontology:
root_node: MyEntity
nodes:
- label: MyEntity
description: "Primary entity type"
properties: {name: string, type: string}
edges:
- label: RELATES_TO
source: MyEntity
target: MyEntity
json_schema: |
{"entities": [...], "relationships": [...]}
parsing:
system_message: "You are an expert in..."
requirement_types: [...]
id_patterns:
patterns: {}
discovery:
search_templates:
- "<topic> official documentation"
sub_domains:
- name: "Sub-area 1"
description: "..."
Configuration
All configuration is via .env file or environment variables.
| Variable | Default | Description |
|---|---|---|
| DB_HOST | localhost | PostgreSQL host |
| DB_PORT | 5432 | PostgreSQL port |
| DB_NAME | buildkg | Database name |
| DB_USER | buildkg | Database user |
| DB_PASSWORD | — | Database password (required) |
| LLM_PROVIDER | anthropic | LLM provider: anthropic or openai |
| ANTHROPIC_API_KEY | — | Anthropic API key (required if using Anthropic) |
| ANTHROPIC_MODEL | claude-haiku-4-5-20251001 | Anthropic model for parsing |
| OPENAI_API_KEY | — | OpenAI API key (required if using OpenAI) |
| OPENAI_MODEL | gpt-4o-mini | OpenAI model for parsing (when using OpenAI) |
| AGE_GRAPH_NAME | knowledge_graph | Apache AGE graph name |
| BATCH_SIZE | 10 | Fragments per processing batch |
| MAX_WORKERS | 3 | Concurrent worker threads |
| DOMAIN | food-safety | Domain profile name or file path |
| RATE_LIMIT_DELAY | 1.0 | Seconds between API calls (sync parser) |
.env File Format
DB_PASSWORD and an LLM API key (ANTHROPIC_API_KEY or OPENAI_API_KEY) are required. All other variables
have sensible defaults. Run make verify to check your configuration.
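A minimal settings reader consistent with the defaults table above might look like this (setting and DEFAULTS are illustrative names, not the project's actual code):

```python
import os

DEFAULTS = {
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "buildkg",
    "DB_USER": "buildkg",
    "LLM_PROVIDER": "anthropic",
    "AGE_GRAPH_NAME": "knowledge_graph",
}

def setting(name):
    # Environment wins; fall back to the documented default; fail if truly required.
    value = os.environ.get(name, DEFAULTS.get(name))
    if value is None:
        raise RuntimeError(f"{name} is required (set it in .env)")
    return value

provider = setting("LLM_PROVIDER")
```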
Manifest Format
The crawl manifest (crawl_manifest.json) is the central configuration file for a build-kg pipeline run.
It lists every source to crawl, provides metadata for each source, defines defaults, and tracks coverage.
JSON Schema
{
"topic": "string",
"graph_name": "string",
"created_at": "ISO 8601 string",
"sources": [ ... ],
"defaults": { ... },
"coverage": { ... },
"metadata": { ... }
}
Top-Level Fields
| Field | Type | Required | Description |
|---|---|---|---|
| topic | string | yes | Human-readable topic description |
| graph_name | string | yes | Apache AGE graph name (reg_<country>_<domain> or kg_<topic>) |
| created_at | string | yes | ISO 8601 timestamp |
| sources | array | yes | List of source objects |
| defaults | object | no | Default metadata for unmatched chunks |
| coverage | object | no | Tracks sub-domain coverage |
| metadata | object | no | Arbitrary key-value metadata |
Source Object Fields
| Field | Type | Required | Description |
|---|---|---|---|
| source_name | string | yes | Short identifier used as crawl output directory name. Must be unique. Used by loader for source matching. |
| url | string | yes | Starting URL for the crawl |
| title | string | yes | Descriptive title |
| authority | string | no | Issuing organization (nullable for generic topics) |
| jurisdiction | string | no | Jurisdiction code (nullable for generic topics) |
| doc_type | string | no | Document type (nullable for generic topics) |
| priority | string | no | Priority tier: P1, P2, P3, or P4 |
| sub_domains | array | no | List of sub-domain strings this source covers |
| depth | integer | no | Crawl depth override |
| max_pages | integer | no | Maximum pages override |
| delay | integer | no | Crawl delay in ms override |
Example: Generic Manifest
{
"topic": "Kubernetes networking",
"graph_name": "kg_k8s_net",
"created_at": "2026-02-20T10:00:00Z",
"sources": [
{
"source_name": "k8s_networking_docs",
"url": "https://kubernetes.io/docs/concepts/services-networking/",
"title": "Kubernetes Networking Concepts",
"priority": "P1",
"depth": 3,
"max_pages": 80,
"delay": 1500
},
{
"source_name": "cilium_docs",
"url": "https://docs.cilium.io/en/stable/",
"title": "Cilium CNI Documentation",
"priority": "P2",
"depth": 2,
"max_pages": 50,
"delay": 1500
}
],
"defaults": {}
}
Examples
Kubernetes Networking
Sample queries:
-- Count nodes by type
SELECT * FROM cypher('kg_k8s_net', $$
MATCH (n:Component) RETURN count(n)
$$) AS (cnt agtype);
-- Find relationships
SELECT * FROM cypher('kg_k8s_net', $$
MATCH (c:Component)-[:USES]->(concept:Concept)
RETURN c.name, concept.name, concept.category
LIMIT 10
$$) AS (component agtype, concept agtype, category agtype);
-- Traverse the graph (multi-hop)
SELECT * FROM cypher('kg_k8s_net', $$
MATCH (a:Component)-[:DEPENDS_ON]->(b:Component)-[:USES]->(c:Concept)
RETURN a.name, b.name, c.name
LIMIT 10
$$) AS (comp_a agtype, comp_b agtype, concept agtype);
Machine Learning Algorithms
Sample queries:
-- Find all algorithms of a specific type
SELECT * FROM cypher('kg_ml_opt', $$
MATCH (a:Algorithm)
WHERE a.type = 'gradient-based'
RETURN a.name, a.description
LIMIT 20
$$) AS (name agtype, description agtype);
-- Find algorithms used in applications
SELECT * FROM cypher('kg_ml_opt', $$
MATCH (a:Algorithm)-[:APPLIED_TO]->(app:Application)
RETURN a.name, app.name
LIMIT 10
$$) AS (algorithm agtype, application agtype);
Cost
The only cost is LLM API calls during the parse phase. Everything else — crawling, chunking, loading, database, queries — runs locally for free.
| Fragments | Sync Parser | Batch Parser (50% off) |
|---|---|---|
| 100 | ~$0.03 | ~$0.015 |
| 1,000 | ~$0.30 | ~$0.15 |
| 5,000 | ~$1.50 | ~$0.75 |
| 10,000 | ~$3.00 | ~$1.50 |
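The table implies roughly $0.0003 per fragment with the sync parser. A back-of-envelope estimator (illustrative only; actual cost depends on fragment size and model pricing):

```python
SYNC_COST_PER_FRAGMENT = 0.0003  # ~ $0.03 per 100 fragments, per the table above

def estimate_cost(fragments, batch=False):
    cost = fragments * SYNC_COST_PER_FRAGMENT
    if batch:
        cost *= 0.5  # batch API pricing is roughly half of sync
    return round(cost, 4)

sync_cost = estimate_cost(5000)
batch_cost = estimate_cost(5000, batch=True)
```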
What's Free
- Crawling — uses open-source Crawl4AI with local Chromium
- Chunking — uses open-source Unstructured library locally
- Loading — direct PostgreSQL inserts
- Database — Docker container on your machine
- Queries — standard Cypher via Apache AGE
Troubleshooting
Quick Reference
| Error | Phase | Solution |
|---|---|---|
| Connection refused on port 5432 | Any DB operation | Start the database container |
| relation "source_fragment" does not exist | Load / Parse | Run the init SQL or recreate the container |
| AGE extension not found | Setup / Parse | Run make db-init |
| LLM API error 401 | Parse | Check ANTHROPIC_API_KEY (or OPENAI_API_KEY) in .env |
| Chromium not found | Crawl | Run crawl4ai-setup |
| No fragments found | Parse | Verify the database load succeeded |
| Batch still in "validating" | Batch Parse | Wait; batches take 1–24 hours |
| Ontology file not found | Parse / Setup | Check --ontology path is correct |
| Empty entities in LLM output | Parse (ontology) | Check text quality and ontology fit |
| PDF support not installed | Chunk | Run make setup |
| DB_PASSWORD is required | Any | Run cp .env.example .env |
Connection refused on port 5432
Symptom:
Fix:
If the container is running but you still cannot connect, check if another PostgreSQL instance is using port 5432 (change DB_PORT in .env), or if you are running inside a VM or container (use the correct host IP instead of localhost).
relation "source_fragment" does not exist
Symptom:
Fix:
AGE extension not found
Symptom:
Fix: Make sure you are using the Apache AGE Docker image, then run:
LLM API error 401
Symptom:
Fix:
- Check your .env file: grep ANTHROPIC_API_KEY .env
- Verify the key format: ANTHROPIC_API_KEY=sk-ant-...
- Run make verify to test all connections
- Check your account for billing issues at console.anthropic.com
Chromium not found
Symptom:
Fix:
If crawl4ai-setup fails, check disk space (~400MB required) and install missing libraries on Linux: sudo apt install libnss3 libatk-bridge2.0-0 libdrm2 libxcomposite1 libxdamage1 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 libasound2
No fragments found
Symptom:
Fix:
- Check the fragment count: SELECT COUNT(*) FROM source_fragment;
- If zero, re-run the skill or check that the load phase completed successfully
- If you filtered results, verify the data exists in the database
Batch still in "validating"
This is normal. Batches typically complete within 1–24 hours, and the skill monitors batch progress automatically. You can also monitor manually with:
If the batch stays in "validating" for more than an hour, check the JSONL file for formatting errors and your provider account for rate-limit or quota issues.
DB_PASSWORD is required
Fix:
Empty entities in LLM output (ontology mode)
The parser reports many fragments as "skipped" with 0 entities extracted. This happens when the text is too short/generic, the ontology node types do not match the content, or chunks are too small.
Fix:
- Check that your ontology node types are appropriate for the source content
- Try increasing --max-chars during chunking for larger, more informative fragments
- Review the auto-generated ontology and adjust definitions if needed
Getting More Help
If your issue is not listed here:
- Run make verify to check overall system health
- Check Docker container logs: docker compose -f db/docker-compose.yml logs
- Open an issue at github.com/agtm1199/build-kg/issues with the exact error message, the command you ran, your Python version, and your OS
FAQ
To use a custom domain, set the DOMAIN environment variable to point to your profile. You can also create a full domain profile with custom parsing instructions and discovery templates.
To change the model, set the LLM_PROVIDER and ANTHROPIC_MODEL environment variables in your .env file.
OpenAI models (e.g., gpt-4o-mini) are also supported by setting LLM_PROVIDER=openai.
The graph lives in plain PostgreSQL, so you can use pg_dump, Cypher queries, or any PostgreSQL client to export data. You can also query the graph directly from any application that connects to PostgreSQL.
To query the graph, call the cypher() function via any PostgreSQL client.
Example: SELECT * FROM cypher('kg_k8s_net', $$ MATCH (n) RETURN n LIMIT 10 $$) as (n agtype);
You can use psql, DBeaver, pgAdmin, or connect programmatically from any language with a PostgreSQL driver.