Eru – ESKG Extraction

Eru is a production-ready, agnostic pipeline that extracts highly structured Knowledge Graphs (Nodes and Edges) from unstructured text.

Why does it exist?

Extracting knowledge graphs with LLMs alone is unreliable: they hallucinate entities, fail to follow JSON schemas strictly, and consume large numbers of tokens, which makes them slow and expensive. Conversely, traditional NER (Named Entity Recognition) models extract entities reliably but cannot capture complex relationships or implicit intent.

Eru bridges this gap. It forces small, fast LLMs to act only as logical routers between physically extracted entities, so every physical entity in the graph comes from the source text rather than from the model.
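
To make the "logical router" idea concrete, here is a minimal sketch of how relation endpoints can be restricted to already-extracted IDs by building a JSON Schema with `enum` constraints, the kind of schema a structured-generation library can enforce during decoding. The helper names `build_edge_schema` and `conforms` are hypothetical and are not part of Eru; the sketch also covers only physical IDs, not the implicit conceptual nodes Eru additionally allows.

```python
def build_edge_schema(entity_ids, relation_names):
    """Return a JSON Schema for one edge whose endpoints must be known IDs."""
    return {
        "type": "object",
        "properties": {
            "source_id": {"enum": sorted(entity_ids)},
            "target_id": {"enum": sorted(entity_ids)},
            "relation_type": {"enum": sorted(relation_names)},
        },
        "required": ["source_id", "target_id", "relation_type"],
        "additionalProperties": False,
    }


def conforms(edge, schema):
    """Minimal check of an edge dict against the enum-only schema above."""
    props = schema["properties"]
    # Exactly the required keys, and every value drawn from its enum.
    return set(edge) == set(schema["required"]) and all(
        edge[key] in props[key]["enum"] for key in props
    )


# Hypothetical extractor output: IDs mapped to the text they were found in.
extracted = {"ent_0": "Alice", "ent_1": "Macbook"}
schema = build_edge_schema(extracted, ["buys"])

good = {"source_id": "ent_0", "target_id": "ent_1", "relation_type": "buys"}
bad = {"source_id": "ent_0", "target_id": "ent_7", "relation_type": "buys"}

print(conforms(good, schema))  # True: both endpoints are extracted IDs
print(conforms(bad, schema))   # False: "ent_7" was never extracted
```

Because the model can only emit IDs from the `enum` lists, it has no way to reference a physical entity that the extractor did not find.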

The 3-Layer Architecture

Eru processes text through a strict, three-step pipeline:

| Layer | Description |
| --- | --- |
| Extraction | Uses a bi-encoder (GLiNER) to find physical entities in the text (e.g., persons, locations) and automatically deduplicates identical entities. |
| Reasoning | Uses an SLM (small language model) constrained by Outlines. The model is forbidden from inventing physical entities; it can only draw relationships using the IDs produced by the Extraction layer, or generate implicit conceptual nodes. |
| Validation | A logical gate that prunes logically impossible relationships (e.g., a "Car" cannot "Authorize" a "Person") before the graph is saved to your database. |
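
The validation step can be pictured as a type check on each candidate relation. The following is a minimal sketch of that idea, not Eru's actual validator: `TypeRule` and `prune` are hypothetical names, and the only assumption borrowed from the library is that a relation may declare allowed source and target types.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple

Relation = Tuple[str, str, str]  # (source_id, relation_name, target_id)


@dataclass(frozen=True)
class TypeRule:
    """Hypothetical rule: which node types may sit on each end of a relation."""
    allowed_sources: Optional[FrozenSet[str]] = None  # None means "any type"
    allowed_targets: Optional[FrozenSet[str]] = None


def prune(
    relations: List[Relation],
    node_types: Dict[str, str],
    rules: Dict[str, TypeRule],
) -> List[Relation]:
    """Keep only relations whose endpoint types satisfy the matching rule."""
    kept = []
    for source_id, name, target_id in relations:
        rule = rules.get(name)
        source_type = node_types.get(source_id)
        target_type = node_types.get(target_id)
        # Drop relations with an unknown name or an unknown endpoint.
        if rule is None or source_type is None or target_type is None:
            continue
        if rule.allowed_sources is not None and source_type not in rule.allowed_sources:
            continue
        if rule.allowed_targets is not None and target_type not in rule.allowed_targets:
            continue
        kept.append((source_id, name, target_id))
    return kept


node_types = {"ent_0": "PERSON", "ent_1": "CAR"}
rules = {
    "authorizes": TypeRule(frozenset({"PERSON"}), frozenset({"PERSON"})),
    "drives": TypeRule(frozenset({"PERSON"}), frozenset({"CAR"})),
}
candidates = [
    ("ent_1", "authorizes", "ent_0"),  # a CAR authorizing a PERSON: pruned
    ("ent_0", "drives", "ent_1"),      # a PERSON driving a CAR: kept
]
kept = prune(candidates, node_types, rules)
print(kept)  # [('ent_0', 'drives', 'ent_1')]
```

Running the check after generation means that a relation the model proposes with the wrong endpoint types never reaches storage.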

Here is a minimal example of how to use Eru to extract a graph from a simple sentence, allowing the model to deduce the implicit intent behind the action.

import json
from typing import Literal
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import outlines

from eru.engine import EskgEngine
from eru.extractor.gliner import GlinerExtractor
from eru.reasoner.outlines import OutlinesReasoner
from eru.logic.eskg import EskgLogicValidator
from eru.types import RelationDef

# 1. Define your Graph Schema
class Node(BaseModel):
    id: str
    text: str
    type: str

class Edge(BaseModel):
    source_id: str
    target_id: str
    relation_type: Literal["buys", "has_intent"]

class DailyGraph(BaseModel):
    entities: list[Node]
    relations: list[Edge]

def main():
    text = "Alice purchased a Macbook today because she wants to learn coding."

    # Layer 1: Extract explicit physical entities
    extractor = GlinerExtractor(
        labels=["PERSON", "PRODUCT", "TIME"], 
        threshold=0.3,
    )

    # Define relationship rules for the LLM
    rules = [
        RelationDef(
            name="buys",
            description="When a person purchases an item.",
            allowed_sources=["PERSON"],
            allowed_targets=["PRODUCT"],
        ),
        RelationDef(
            name="has_intent",
            description="The implicit reason or goal behind the action.",
        )
    ]

    # Layer 2: Setup the SLM Reasoner
    model_name = "Qwen/Qwen2.5-0.5B-Instruct"
    hf_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
    llm = outlines.from_transformers(hf_model, hf_tokenizer)
    
    reasoner = OutlinesReasoner(
        model=llm, 
        relation_defs=rules,
        open_entity_types=["INTENT"] # Allow LLM to invent concepts like 'learning to code'
    )

    # Layer 3: Setup the Logic Validator
    validator = EskgLogicValidator(
        get_entities=lambda g: g.entities,
        get_relations=lambda g: g.relations
    )

    # Run the Engine
    engine = EskgEngine(
        schema=DailyGraph, 
        extractor=extractor, 
        reasoner=reasoner, 
        validator=validator
    )
    graph = engine.process(text)

    print(json.dumps(graph.model_dump(), indent=2))

if __name__ == "__main__":
    main()

Expected Output

The engine cleanly separates the physically extracted nodes (Alice, Macbook), which are listed under entities, from the inferred conceptual node ("to learn coding"), which appears only as the target of the has_intent relation.

{
  "entities": [
    {"id": "ent_0", "text": "Alice", "type": "PERSON"},
    {"id": "ent_1", "text": "Macbook", "type": "PRODUCT"}
  ],
  "relations": [
    {
      "source_id": "ent_0",
      "target_id": "ent_1",
      "relation_type": "buys"
    },
    {
      "source_id": "ent_0",
      "target_id": "to learn coding",
      "relation_type": "has_intent"
    }
  ]
}
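
As a small usage sketch, the output above can be post-processed with the standard library to tell physical edges apart from edges that point at an inferred concept. It assumes only the JSON shape shown above; the variable names are illustrative.

```python
import json

raw = """
{
  "entities": [
    {"id": "ent_0", "text": "Alice", "type": "PERSON"},
    {"id": "ent_1", "text": "Macbook", "type": "PRODUCT"}
  ],
  "relations": [
    {"source_id": "ent_0", "target_id": "ent_1", "relation_type": "buys"},
    {"source_id": "ent_0", "target_id": "to learn coding", "relation_type": "has_intent"}
  ]
}
"""

graph = json.loads(raw)
entity_ids = {node["id"] for node in graph["entities"]}

# Edges whose endpoints are both extracted entities.
physical_edges = [
    rel for rel in graph["relations"]
    if rel["source_id"] in entity_ids and rel["target_id"] in entity_ids
]

# Targets that are not extracted entities, i.e. concepts inferred by the model.
conceptual_targets = [
    rel["target_id"] for rel in graph["relations"]
    if rel["target_id"] not in entity_ids
]

print(len(physical_edges))   # 1
print(conceptual_targets)    # ['to learn coding']
```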