Documentation

build-kg is a skill for coding agents that turns any topic into a structured knowledge graph stored in Apache AGE (PostgreSQL). No vendor lock-in. No hosting fees. Just your own database.

Give build-kg a topic and your coding agent autonomously generates an ontology, discovers authoritative sources, crawls documentation, chunks and loads documents, parses every fragment with an LLM, and produces a queryable graph you can explore with standard Cypher queries.

Key Capabilities

Installation

Prerequisites

Install

Terminal
$ git clone https://github.com/agtm1199/build-kg.git
$ cd build-kg
# Full setup: creates venv, installs dependencies, starts DB, initializes graph
$ make setup

Configure your environment

Terminal
$ cp .env.example .env
# Edit .env and set your API key:
# ANTHROPIC_API_KEY=sk-ant-...
# Or alternatively: OPENAI_API_KEY=sk-...
# Verify everything works
$ make verify
Tip The default .env.example has database credentials that match the Docker container. You only need to add your ANTHROPIC_API_KEY (or OPENAI_API_KEY if using OpenAI).

Activate the skill

The /build-kg skill is included in the repository. It activates automatically based on your coding agent:

| Agent | Skill file | How it works |
| --- | --- | --- |
| Claude Code | .claude/skills/build-kg/SKILL.md | Auto-detected. Type /build-kg <topic> to use. |
| Amazon Kiro | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| Qoder | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| Antigravity | .claude/skills/build-kg/SKILL.md | Auto-detected (Agent Skills standard). Type /build-kg <topic>. |
| OpenAI Codex | AGENTS.md | Auto-detected. Ask the agent to "build a knowledge graph about <topic>". |
| GitHub Copilot | .github/copilot-instructions.md | Auto-detected. Ask the agent to "build a knowledge graph about <topic>". |
| Cursor | .cursor/rules/build-kg.mdc | Auto-detected. Ask the agent to "build a knowledge graph about <topic>". |
| Windsurf | .windsurf/rules/build-kg.md | Auto-detected. Ask the agent to "build a knowledge graph about <topic>". |
Note No extra installation step is needed. The skill files ship with the repo — cloning is all it takes.

Quick Start

Use the /build-kg skill in your coding agent (Claude Code, OpenAI Codex, GitHub Copilot, Cursor, Windsurf, Amazon Kiro, Qoder, or Antigravity) to build a knowledge graph:

Terminal
$ /build-kg kubernetes networking

This runs all phases autonomously — ontology generation, source discovery, crawling, chunking, loading, parsing, and validation. Expected output: a queryable knowledge graph with Component, Concept, and Configuration nodes connected by USES, CONFIGURES, and DEPENDS_ON edges.

Pipeline

build-kg transforms any topic into a knowledge graph through an 8-phase pipeline. Each phase produces an artifact that feeds the next.

Phase 0: Init (graph name, dirs)
Phase 0.5: Ontology (auto-gen or profile)
Phase 1: Discover (WebSearch)
Phase 2: Crawl (Crawl4AI)
Phase 3: Chunk (Unstructured)
Phase 4: Load (PostgreSQL)
Phase 5: Parse (Claude Haiku 4.5)
Phase 6: Validate (Cypher queries)

Phase 0: Initialize

Create a working directory and choose a graph name. Graph names follow the convention kg_<topic> for generic topics (e.g., kg_k8s_net) or reg_<country>_<domain> for regulatory domains. The skill handles directory creation automatically.

Phase 0.5: Ontology Generation

This phase is build-kg's key differentiator. For generic topics, your coding agent analyzes your subject and auto-generates an ontology — node types, edge types, properties, and a JSON schema. The generated ontology typically includes 3–7 node types and is saved to ontology.yaml in the working directory.

Phase 1: Source Discovery

A 5-round methodology identifies authoritative sources for the topic. Each round progressively deepens coverage:

| Round | Name | Method | Purpose |
| --- | --- | --- | --- |
| 1 | Landscape Mapping | 8–15 parallel web searches | Identify authoritative sources |
| 2 | Deep Source Discovery | Fetch main pages | Find documents, standards, specs |
| 3 | Coverage Verification | Cross-reference checklist | Identify gaps in topic coverage |
| 4 | Gap Filling | Targeted searches | Fill gaps; aim for 90%+ coverage |
| 5 | Secondary Sources | Supporting material | Add context and references |

The output is a crawl manifest (crawl_manifest.json) listing every URL with metadata and priority tiers.

Phase 2: Crawl

Breadth-first web crawling using Crawl4AI with headless Chromium. Supports JavaScript-rendered pages. Pages are saved as markdown. Priority tiers control depth, page limits, and delay:

| Tier | Description | Depth | Max Pages | Delay |
| --- | --- | --- | --- | --- |
| P1 | Primary authoritative | 3 | 100 | 1500 ms |
| P2 | Secondary documentation | 2 | 50 | 1500 ms |
| P3 | Supporting guides | 1 | 25 | 2000 ms |
| P4 | Reference material | 1 | 15 | 2000 ms |
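Per-source fields in the crawl manifest (see Manifest Format) can override these tier defaults. A minimal Python sketch of how that resolution might work; `TIER_DEFAULTS` and `crawl_settings` are illustrative names, not build-kg's actual API:

```python
# Default crawl settings per priority tier (values from the table above).
TIER_DEFAULTS = {
    "P1": {"depth": 3, "max_pages": 100, "delay": 1500},
    "P2": {"depth": 2, "max_pages": 50, "delay": 1500},
    "P3": {"depth": 1, "max_pages": 25, "delay": 2000},
    "P4": {"depth": 1, "max_pages": 15, "delay": 2000},
}

def crawl_settings(source: dict) -> dict:
    """Start from the tier defaults, then apply any per-source overrides."""
    settings = dict(TIER_DEFAULTS.get(source.get("priority", "P4"), TIER_DEFAULTS["P4"]))
    for key in ("depth", "max_pages", "delay"):
        if key in source:
            settings[key] = source[key]
    return settings
```

A source with `"priority": "P2"` and `"max_pages": 80` would crawl at depth 2 but allow 80 pages.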

Phase 3: Chunk

Documents are split into semantically coherent fragments using the Unstructured library. Two strategies are available:

Each chunk is saved as JSON with metadata: source path, index, SHA-256 fingerprint, and position.
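The per-chunk record described above can be sketched as follows; the field names and JSON layout here are illustrative, not the exact on-disk format:

```python
import hashlib
import json

def chunk_record(text: str, source_path: str, index: int) -> str:
    """Serialize one chunk with the metadata fields described above."""
    record = {
        "excerpt": text,
        "source_path": source_path,
        "index": index,
        # SHA-256 fingerprint of the chunk text, usable for deduplication on re-runs
        "fingerprint": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record)
```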

Phase 4: Load

Chunks are loaded into PostgreSQL tables (source_document and source_fragment) with metadata from the crawl manifest. Adjacent chunks are linked via context_before and context_after fields.
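The adjacent-chunk linking amounts to a sliding window over the chunk list. A sketch with a hypothetical `link_contexts` helper (not the loader's real code), using the 200-character windows described in the schema below:

```python
def link_contexts(chunks: list[str], window: int = 200) -> list[dict]:
    """Attach context_before/context_after snippets from the neighboring chunks."""
    rows = []
    for i, text in enumerate(chunks):
        rows.append({
            "excerpt": text,
            # Last `window` chars of the preceding chunk (None for the first chunk)
            "context_before": chunks[i - 1][-window:] if i > 0 else None,
            # First `window` chars of the following chunk (None for the last chunk)
            "context_after": chunks[i + 1][:window] if i + 1 < len(chunks) else None,
        })
    return rows
```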

Database Schema

source_document — one row per crawled source document (webpage):

| Column | Type | Nullable | Description |
| --- | --- | --- | --- |
| doc_id | UUID (PK) | no | Auto-generated primary key |
| jurisdiction | TEXT | yes | Jurisdiction code (null for generic topics) |
| authority | TEXT | yes | Source organization name (null for generic topics) |
| publisher | TEXT | yes | Publishing organization |
| doc_type | TEXT | yes | Document type (null for generic topics) |
| title | TEXT | no | Document title |
| url | TEXT | no | Source URL |
| filepath | TEXT (UNIQUE) | no | Local file path, used for upsert deduplication |
| metadata | JSONB | yes | Arbitrary key-value metadata |
| retrieved_at | TIMESTAMPTZ | no | When the document was crawled |

source_fragment — one row per chunk of a source document:

| Column | Type | Nullable | Description |
| --- | --- | --- | --- |
| fragment_id | UUID (PK) | no | Auto-generated primary key |
| doc_id | UUID (FK) | no | Reference to parent source_document |
| excerpt | TEXT | no | The actual text content of the chunk |
| context_before | TEXT | yes | Last 200 characters of the preceding chunk |
| context_after | TEXT | yes | First 200 characters of the following chunk |
| source_url | TEXT | yes | URL of the original source |
| jurisdiction | TEXT | yes | Inherited from source_document |
| authority | TEXT | yes | Inherited from source_document |
| metadata | JSONB | yes | Arbitrary key-value metadata |

Phase 5: Parse

Each fragment is sent to Claude Haiku 4.5 (or GPT-4o-mini) with a structured prompt derived from the ontology. The LLM returns entities and relationships in JSON, which are loaded into Apache AGE as graph vertices and edges. The skill supports both synchronous parsing (real-time, standard pricing) and batch parsing (50% cheaper, results in 1–24 hours).
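The index-based relationship format from the ontology's json_schema can be resolved into vertices and edges roughly like this; a simplified sketch, since the real loader writes into Apache AGE rather than returning tuples:

```python
import json

def to_graph(llm_output: str):
    """Turn the LLM's JSON (entities plus index-based relationships) into tuples."""
    data = json.loads(llm_output)
    # Entity properties are everything except the internal _label marker
    vertices = [
        (e["_label"], {k: v for k, v in e.items() if not k.startswith("_")})
        for e in data["entities"]
    ]
    # Relationships reference entities by their position in the entities array
    edges = [
        (r["_label"], r["_from_index"], r["_to_index"])
        for r in data["relationships"]
    ]
    return vertices, edges
```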

Phase 6: Validate

Cypher queries count nodes by label from the ontology and produce a report card of the completed graph.

Ontology System

The ontology defines what your knowledge graph looks like — the types of nodes, their properties, and how they connect via edges.

Auto-Generated Ontology

For generic topics (using the default profile or no profile), your coding agent auto-generates an ontology tailored to the subject. Example for "kubernetes networking":

nodes:
  - label: Component
    description: "A Kubernetes networking component"
    properties:
      name: string
      type: string
      description: string
      layer: string
  - label: Concept
    description: "A networking concept or protocol"
    properties:
      name: string
      description: string
      category: string
  - label: Configuration
    description: "A configuration option or setting"
    properties:
      name: string
      description: string
      default_value: string
edges:
  - label: USES
    source: Component
    target: Concept
    description: "Component uses this concept"
  - label: CONFIGURES
    source: Configuration
    target: Component
    description: "Configuration applies to component"
  - label: DEPENDS_ON
    source: Component
    target: Component
    description: "Component depends on another"
root_node: Component
json_schema: |
  {
    "entities": [
      {"_label": "Component|Concept|Configuration", "name": "...", ...}
    ],
    "relationships": [
      {"_label": "USES|CONFIGURES|DEPENDS_ON", "_from_index": 0, "_to_index": 1}
    ]
  }

OntologyConfig Data Model

| Model | Fields | Description |
| --- | --- | --- |
| NodeDef | label, description, properties | A node type in the graph. Properties map name to type (string, number, boolean, list). |
| EdgeDef | label, source, target, description | An edge type connecting two node types. |
| OntologyConfig | nodes, edges, root_node, json_schema | Complete ontology definition. root_node is the primary type that maps 1:1 to source fragments. |

Pydantic Models

class NodeDef(BaseModel):
    label: str              # Vertex label (e.g., "Component", "Concept")
    description: str = ""   # Used in LLM prompt for guidance
    properties: Dict[str, str] = {}  # name -> type (string, number, boolean, list)

class EdgeDef(BaseModel):
    label: str              # Edge label (e.g., "USES", "DERIVED_FROM")
    source: str             # Source node label
    target: str             # Target node label
    description: str = ""   # Used in LLM prompt for guidance

class OntologyConfig(BaseModel):
    description: str = ""
    nodes: List[NodeDef] = []
    edges: List[EdgeDef] = []
    root_node: str = ""     # Primary node that maps 1:1 to source fragments
    json_schema: Optional[str] = None  # Expected LLM output JSON format
Tip The ontology is the most important part of your knowledge graph. When auto-generated, review it carefully before proceeding. You can edit the ontology.yaml file to adjust node types, properties, and edge definitions.

Domain Profiles

Domain profiles are YAML files that parameterize the entire pipeline: ontology, LLM prompt, ID extraction patterns, and source discovery templates.

Built-in Profiles

| Profile | Ontology | Domain |
| --- | --- | --- |
| default | None (auto-generated) | Any topic |

Selecting a Profile

Set the DOMAIN environment variable in your .env file before running the skill:

.env
# In .env:
DOMAIN=financial-aml
# Or use a custom profile file:
DOMAIN=/path/to/my-domain.yaml

Profile Inheritance

Profiles can inherit from a base using extends: default. Child fields override parent fields via deep merge. This avoids repeating universal configuration in every profile.
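A deep merge of this kind can be sketched as follows; illustrative only, since build-kg's actual merge may differ in edge cases such as list handling:

```python
def deep_merge(parent: dict, child: dict) -> dict:
    """Merge child over parent; nested dicts merge recursively, other values override."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

A child profile overriding only `parsing.system_message` would keep every other inherited parsing field intact.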

When loading a profile by name:

  1. If the name is a file path (ends in .yaml/.yml), load directly
  2. Otherwise, look for {name}.yaml in src/build_kg/domains/
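That lookup amounts to a few lines; `resolve_profile` is a hypothetical helper shown for clarity, not the project's actual function:

```python
from pathlib import Path

def resolve_profile(name: str) -> Path:
    """Resolve a profile name to a YAML path per the two rules above."""
    if name.endswith((".yaml", ".yml")):
        return Path(name)  # rule 1: explicit file path, load directly
    # rule 2: built-in profile looked up by name
    return Path("src/build_kg/domains") / f"{name}.yaml"
```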

Creating Custom Profiles

A profile controls three pipeline aspects: Ontology (node/edge types), Parsing (system message and extraction rules), and Discovery (search templates, sub-domains).

Profile YAML Structure

name: "My Domain"
description: "Custom domain profile"
version: "1.0"
extends: default

ontology:
  root_node: MyEntity
  nodes:
    - label: MyEntity
      description: "Primary entity type"
      properties: {name: string, type: string}
  edges:
    - label: RELATES_TO
      source: MyEntity
      target: MyEntity
  json_schema: |
    {"entities": [...], "relationships": [...]}

parsing:
  system_message: "You are an expert in..."
  requirement_types: [...]

id_patterns:
  patterns: {}

discovery:
  search_templates:
    - "<topic> official documentation"
  sub_domains:
    - name: "Sub-area 1"
      description: "..."

Configuration

All configuration is via .env file or environment variables.

| Variable | Default | Description |
| --- | --- | --- |
| DB_HOST | localhost | PostgreSQL host |
| DB_PORT | 5432 | PostgreSQL port |
| DB_NAME | buildkg | Database name |
| DB_USER | buildkg | Database user |
| DB_PASSWORD | | Database password (required) |
| LLM_PROVIDER | anthropic | LLM provider: anthropic or openai |
| ANTHROPIC_API_KEY | | Anthropic API key (required if using Anthropic) |
| ANTHROPIC_MODEL | claude-haiku-4-5-20251001 | Anthropic model for parsing |
| OPENAI_API_KEY | | OpenAI API key (required if using OpenAI) |
| OPENAI_MODEL | gpt-4o-mini | OpenAI model for parsing (when using OpenAI) |
| AGE_GRAPH_NAME | knowledge_graph | Apache AGE graph name |
| BATCH_SIZE | 10 | Fragments per processing batch |
| MAX_WORKERS | 3 | Concurrent worker threads |
| DOMAIN | default | Domain profile name or file path |
| RATE_LIMIT_DELAY | 1.0 | Seconds between API calls (sync parser) |

.env File Format

.env
# Database
DB_HOST=localhost
DB_PORT=5432
DB_NAME=buildkg
DB_USER=buildkg
DB_PASSWORD=buildkg_dev

# LLM
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...

# Graph
AGE_GRAPH_NAME=kg_k8s_net

# Domain
DOMAIN=default
Important DB_PASSWORD and an LLM API key (ANTHROPIC_API_KEY or OPENAI_API_KEY) are required. All other variables have sensible defaults. Run make verify to check your configuration.

Manifest Format

The crawl manifest (crawl_manifest.json) is the central configuration file for a build-kg pipeline run. It lists every source to crawl, provides metadata for each source, defines defaults, and tracks coverage.

JSON Schema

{
  "topic": "string",
  "graph_name": "string",
  "created_at": "ISO 8601 string",
  "sources": [ ... ],
  "defaults": { ... },
  "coverage": { ... },
  "metadata": { ... }
}

Top-Level Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| topic | string | yes | Human-readable topic description |
| graph_name | string | yes | Apache AGE graph name (reg_<country>_<domain> or kg_<topic>) |
| created_at | string | yes | ISO 8601 timestamp |
| sources | array | yes | List of source objects |
| defaults | object | no | Default metadata for unmatched chunks |
| coverage | object | no | Tracks sub-domain coverage |
| metadata | object | no | Arbitrary key-value metadata |

Source Object Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| source_name | string | yes | Short identifier used as crawl output directory name. Must be unique. Used by loader for source matching. |
| url | string | yes | Starting URL for the crawl |
| title | string | yes | Descriptive title |
| authority | string | no | Issuing organization (nullable for generic topics) |
| jurisdiction | string | no | Jurisdiction code (nullable for generic topics) |
| doc_type | string | no | Document type (nullable for generic topics) |
| priority | string | no | Priority tier: P1, P2, P3, or P4 |
| sub_domains | array | no | List of sub-domain strings this source covers |
| depth | integer | no | Crawl depth override |
| max_pages | integer | no | Maximum pages override |
| delay | integer | no | Crawl delay in ms override |

Example: Generic Manifest

{
  "topic": "Kubernetes networking",
  "graph_name": "kg_k8s_net",
  "created_at": "2026-02-20T10:00:00Z",
  "sources": [
    {
      "source_name": "k8s_networking_docs",
      "url": "https://kubernetes.io/docs/concepts/services-networking/",
      "title": "Kubernetes Networking Concepts",
      "priority": "P1",
      "depth": 3,
      "max_pages": 80,
      "delay": 1500
    },
    {
      "source_name": "cilium_docs",
      "url": "https://docs.cilium.io/en/stable/",
      "title": "Cilium CNI Documentation",
      "priority": "P2",
      "depth": 2,
      "max_pages": 50,
      "delay": 1500
    }
  ],
  "defaults": {}
}
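A manifest like this can be sanity-checked against the required fields above with a small script; `validate_manifest` is a hypothetical helper, not part of build-kg:

```python
def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    # Required top-level fields
    for field in ("topic", "graph_name", "created_at", "sources"):
        if field not in manifest:
            problems.append(f"missing required field: {field}")
    # source_name values must be unique (they become crawl output directories)
    names = [s.get("source_name") for s in manifest.get("sources", [])]
    if len(names) != len(set(names)):
        problems.append("source_name values must be unique")
    # Required per-source fields
    for s in manifest.get("sources", []):
        for field in ("source_name", "url", "title"):
            if field not in s:
                problems.append(f"source missing {field}")
    return problems
```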

Examples

Kubernetes Networking

Terminal
$ /build-kg kubernetes networking
# Your agent generates ontology: Component, Concept, Configuration
# Discovers sources: kubernetes.io, cilium.io, calico docs
# Crawls ~200 pages, chunks, loads, parses
# Result: queryable KG with Component-USES-Concept relationships

Sample queries:

-- Count nodes by type
SELECT * FROM cypher('kg_k8s_net', $$
    MATCH (n:Component) RETURN count(n)
$$) AS (cnt agtype);

-- Find relationships
SELECT * FROM cypher('kg_k8s_net', $$
    MATCH (c:Component)-[:USES]->(concept:Concept)
    RETURN c.name, concept.name, concept.category
    LIMIT 10
$$) AS (component agtype, concept agtype, category agtype);

-- Traverse the graph (multi-hop)
SELECT * FROM cypher('kg_k8s_net', $$
    MATCH (a:Component)-[:DEPENDS_ON]->(b:Component)-[:USES]->(c:Concept)
    RETURN a.name, b.name, c.name
    LIMIT 10
$$) AS (comp_a agtype, comp_b agtype, concept agtype);

Machine Learning Algorithms

Terminal
$ /build-kg machine learning optimization algorithms
# Your agent generates ontology: Algorithm, Technique, Application, Paper
# Discovers: arxiv papers, textbook sites, framework docs
# Result: KG connecting algorithms to techniques and applications

Sample queries:

-- Find all algorithms of a specific type
SELECT * FROM cypher('kg_ml_opt', $$
    MATCH (a:Algorithm)
    WHERE a.type = 'gradient-based'
    RETURN a.name, a.description
    LIMIT 20
$$) AS (name agtype, description agtype);

-- Find algorithms used in applications
SELECT * FROM cypher('kg_ml_opt', $$
    MATCH (a:Algorithm)-[:APPLIED_TO]->(app:Application)
    RETURN a.name, app.name
    LIMIT 10
$$) AS (algorithm agtype, application agtype);

Cost

The only cost is LLM API calls during the parse phase. Everything else — crawling, chunking, loading, database, queries — runs locally for free.

| Fragments | Sync Parser | Batch Parser (50% off) |
| --- | --- | --- |
| 100 | ~$0.03 | ~$0.015 |
| 1,000 | ~$0.30 | ~$0.15 |
| 5,000 | ~$1.50 | ~$0.75 |
| 10,000 | ~$3.00 | ~$1.50 |
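These figures imply roughly $0.0003 per fragment with the sync parser. A tiny estimator assuming that rate; the numbers are approximations read off the table, not guaranteed pricing:

```python
# Approximate sync-parser rate implied by the table: ~$0.30 per 1,000 fragments
COST_PER_FRAGMENT = 0.0003

def estimate_cost(fragments: int, batch: bool = False) -> float:
    """Estimate parse-phase cost in USD; the batch parser is 50% cheaper."""
    cost = fragments * COST_PER_FRAGMENT
    return round(cost / 2 if batch else cost, 4)
```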

What's Free

Crawling, chunking, loading, database storage, and Cypher queries all run locally at no charge; only the parse phase incurs LLM API costs.

Tip The skill automatically handles test runs and cost optimization before committing to a full run.

Troubleshooting

Quick Reference

| Error | Phase | Solution |
| --- | --- | --- |
| Connection refused on port 5432 | Any DB operation | Start the database container |
| relation "source_fragment" does not exist | Load / Parse | Run the init SQL or recreate the container |
| AGE extension not found | Setup / Parse | Run make db-init |
| LLM API error 401 | Parse | Check ANTHROPIC_API_KEY (or OPENAI_API_KEY) in .env |
| Chromium not found | Crawl | Run crawl4ai-setup |
| No fragments found | Parse | Verify the database load succeeded |
| Batch still in "validating" | Batch Parse | Wait; batches take 1–24 hours |
| Ontology file not found | Parse / Setup | Check --ontology path is correct |
| Empty entities in LLM output | Parse (ontology) | Check text quality and ontology fit |
| PDF support not installed | Chunk | Run make setup |
| DB_PASSWORD is required | Any | Run cp .env.example .env |

Connection refused on port 5432

Symptom:

Error
psycopg2.OperationalError: could not connect to server: Connection refused
Is the server running on host "localhost" and accepting TCP/IP connections on port 5432?

Fix:

Terminal
# Start the database
$ docker compose -f db/docker-compose.yml up -d
# Check it is running and healthy
$ docker compose -f db/docker-compose.yml ps
# If the container exists but is stopped
$ docker compose -f db/docker-compose.yml start

If the container is running but you still cannot connect, check if another PostgreSQL instance is using port 5432 (change DB_PORT in .env), or if you are running inside a VM or container (use the correct host IP instead of localhost).

relation "source_fragment" does not exist

Symptom:

Error
psycopg2.errors.UndefinedTable: relation "source_fragment" does not exist

Fix:

Terminal
# Option 1: Reset the database (destroys all data)
$ make db-reset
# Option 2: Run init.sql manually
$ docker exec -i build-kg-db psql -U buildkg -d buildkg < db/init.sql
# Verify tables exist
$ docker exec build-kg-db psql -U buildkg -d buildkg -c "\dt"

AGE extension not found

Symptom:

Error
ERROR: extension "age" is not available

Fix: Make sure you are using the Apache AGE Docker image, then run:

Terminal
$ docker compose -f db/docker-compose.yml up -d
$ make db-init

LLM API error 401

Symptom:

Error
AuthenticationError: Error code: 401 - Incorrect API key provided

Fix:

  1. Check your .env file: grep ANTHROPIC_API_KEY .env
  2. Verify the key format: ANTHROPIC_API_KEY=sk-ant-...
  3. Run make verify to test all connections
  4. Check your account for billing issues at console.anthropic.com

Chromium not found

Symptom:

Error
crawl4ai: Browser not found. Please run crawl4ai-setup first.

Fix:

Terminal
# Make sure the virtual environment is active
$ source venv/bin/activate
# Install Chromium
$ crawl4ai-setup

If crawl4ai-setup fails, check disk space (~400MB required) and install missing libraries on Linux: sudo apt install libnss3 libatk-bridge2.0-0 libdrm2 libxcomposite1 libxdamage1 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 libasound2

No fragments found

Symptom:

Output
Found 0 fragments to process No fragments to process!

Fix:

  1. Check the fragment count by querying SELECT COUNT(*) FROM source_fragment;
  2. If zero, re-run the skill or check that the load phase completed successfully
  3. If you filtered results, verify the data exists in the database.

Batch still in "validating"

This is normal. Batches can take 1–24 hours; the skill monitors batch progress automatically.

If stuck for more than 1 hour, check the JSONL file for formatting errors and your provider account for rate limit or quota issues.

DB_PASSWORD is required

Fix:

Terminal
$ cp .env.example .env
# The default password matches the Docker container:
# DB_PASSWORD=buildkg_dev

Empty entities in LLM output (ontology mode)

The parser reports many fragments as "skipped" with 0 entities extracted. This happens when the text is too short/generic, the ontology node types do not match the content, or chunks are too small.

Fix:

  1. Check that your ontology node types are appropriate for the source content
  2. Try increasing --max-chars during chunking for larger, more informative fragments
  3. Review the auto-generated ontology and adjust definitions if needed

Getting More Help

If your issue is not listed here:

  1. Run make verify to check overall system health
  2. Check Docker container logs: docker compose -f db/docker-compose.yml logs
  3. Open an issue at github.com/agtm1199/build-kg/issues with the exact error message, the command you ran, your Python version, and your OS

FAQ

What topics can I use?
Any topic. The ontology is auto-generated and tailored to your subject (e.g., "kubernetes networking", "machine learning").
How much does it cost?
Only LLM API costs for the parse phase. Everything else is free and local. Roughly $0.30 per 1,000 fragments with the sync parser, or $0.15 per 1,000 fragments with the Batch API (50% cheaper).
Can I use my own ontology?
Yes. Create a YAML file defining your node types, edge types, and JSON schema, then set the DOMAIN environment variable to point to your profile. You can also create a full domain profile with custom parsing instructions and discovery templates.
Do I need a coding agent?
Yes. build-kg natively supports 8 coding agent platforms: Claude Code, OpenAI Codex, GitHub Copilot, Cursor, Windsurf, Amazon Kiro, Qoder, and Antigravity. Each has a dedicated skill file that ships with the repo.
What LLM does it use?
Claude Haiku 4.5 by default (via Anthropic). Configurable via the LLM_PROVIDER and ANTHROPIC_MODEL environment variables in your .env file. OpenAI models (e.g., GPT-4o-mini) are also supported by setting LLM_PROVIDER=openai.
Can I export the graph?
Yes. It is standard PostgreSQL with the Apache AGE extension. Use pg_dump, Cypher queries, or any PostgreSQL client to export data. You can also query the graph directly from any application that connects to PostgreSQL.
What database does it use?
PostgreSQL with the Apache AGE extension. AGE adds Cypher query support to PostgreSQL, allowing you to store and query graph data alongside relational data in the same database. It runs in a Docker container on your machine.
How do I query the graph after building it?
Use standard Cypher queries wrapped in the cypher() function via any PostgreSQL client. Example: SELECT * FROM cypher('kg_k8s_net', $$ MATCH (n) RETURN n LIMIT 10 $$) as (n agtype); You can use psql, DBeaver, pgAdmin, or connect programmatically from any language with a PostgreSQL driver.