Data Sources & Routing
The sources block defines how raw files arriving in S3 are detected and routed into Kafka.
```yaml
sources:
  - id: image_source
    match_pattern: "^images/"   # regexes are supported
    topic: raw_images
    schema_path: schemas/avro/image.avsc
```
Definition Fields
| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for this source. |
| topic | String | Kafka topic where parsed data will be sent. |
| schema_path | String | (Optional) Local path to an .avsc Avro schema. |
| match_pattern | String | (Optional) Regex tested against the S3 object key to decide whether this source handles the file. |
| parser | String | See Data Parsers. |
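To make the field contract concrete, here is a minimal sketch of how a source block could be represented on the Rust side after parsing the YAML. Only the documented config keys are taken from this page; the struct names, the use of serde/serde_yaml, and treating parser as optional are assumptions for illustration.

```rust
use serde::Deserialize;

/// Hypothetical mirror of one entry in the sources list.
/// Field names follow the table above; everything else is illustrative.
#[derive(Debug, Deserialize)]
struct SourceConfig {
    id: String,
    topic: String,
    #[serde(default)]
    schema_path: Option<String>,   // optional path to an .avsc schema
    #[serde(default)]
    match_pattern: Option<String>, // optional routing regex
    #[serde(default)]
    parser: Option<String>,        // see Data Parsers
}

#[derive(Debug, Deserialize)]
struct Config {
    sources: Vec<SourceConfig>,
}

fn main() -> Result<(), serde_yaml::Error> {
    // Same example configuration as shown above.
    let yaml = r#"
sources:
  - id: image_source
    match_pattern: "^images/"
    topic: raw_images
    schema_path: schemas/avro/image.avsc
"#;
    let cfg: Config = serde_yaml::from_str(yaml)?;
    println!("loaded source: {}", cfg.sources[0].id);
    Ok(())
}
```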
How Routing Works
When a file arrives in S3 (e.g., s3://my-bucket/finance/january.csv),
the Rust Ingestor must decide which source block to use.
1. Regex Matching (match_pattern)
If you define match_pattern: "^finance/.*\.csv$", the system will test the S3
key against this regex. If it matches, this source is selected.
2. Otherwise
If no match_pattern is defined, the system rejects the incoming file and does not process the raw data.
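As an illustration of this selection step, the sketch below iterates over the configured sources and tests each pattern against the S3 key, assuming the regex crate and the SourceConfig type from the earlier sketch; the function name and iteration order are not taken from the actual Ingestor.

```rust
use regex::Regex;

/// Pick the first source whose match_pattern matches the S3 object key.
/// Returns None when nothing matches, in which case the file is rejected
/// as described in step 2 above.
fn select_source<'a>(sources: &'a [SourceConfig], s3_key: &str) -> Option<&'a SourceConfig> {
    sources.iter().find(|src| {
        src.match_pattern
            .as_deref()
            .and_then(|pattern| Regex::new(pattern).ok())
            .map_or(false, |re| re.is_match(s3_key))
    })
}
```

For the key finance/january.csv, a source configured with match_pattern: "^finance/.*\.csv$" would be selected; if no pattern matches, the file is not processed.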
Required & Reserved Fields
While Avro is flexible, Galadril’s internal logic (especially the Python Vision service) expects specific fields to be present in the schema to properly link data back to its origin.
1. Unique Identifiers
Every record must have a unique ID. The system looks for these fields in order of priority to determine the Kafka partitioning key:
- event_id (common for logs/financials)
- image_id (for satellite/photos)
- document_id (for PDFs/docs)
- article_id (for OSINT/news)
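The priority lookup can be sketched as follows. The field names and their order come from the list above; representing the decoded record as a plain string map, and the helper name partition_key, are assumptions made for this sketch.

```rust
use std::collections::HashMap;

/// Fields checked, in priority order, when choosing the Kafka partition key.
const ID_FIELDS: [&str; 4] = ["event_id", "image_id", "document_id", "article_id"];

/// Return the first ID field present in the record, if any.
/// The real Ingestor's record representation may differ.
fn partition_key(record: &HashMap<String, String>) -> Option<&str> {
    ID_FIELDS
        .iter()
        .find_map(|field| record.get(*field).map(String::as_str))
}
```

Under this priority order, a record carrying image_id but no event_id would be partitioned on image_id.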
2. Traceability Fields
To ensure the Python pipeline can download binary content, the following field
is mandatory for metadata parsers:
- storage_path (String): The full S3 URI.
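Putting the two requirements together, a hypothetical pre-publish check might look like the sketch below, reusing partition_key from the previous example. Whether the Ingestor actually validates the s3:// prefix, or performs this check at all rather than relying on the Avro schema, is an assumption.

```rust
use std::collections::HashMap;

/// Illustrative validation of the reserved fields described in this section.
fn validate_reserved_fields(record: &HashMap<String, String>) -> Result<(), String> {
    // At least one unique identifier must be present (see the priority list above).
    if partition_key(record).is_none() {
        return Err("record is missing a unique ID field".to_string());
    }
    // storage_path is mandatory for metadata parsers so the Python pipeline
    // can download the binary content from S3.
    match record.get("storage_path") {
        Some(uri) if uri.starts_with("s3://") => Ok(()),
        Some(_) => Err("storage_path must be a full S3 URI".to_string()),
        None => Err("storage_path is missing".to_string()),
    }
}
```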