# Models

The system supports several pre-trained models out of the box. These can be
referenced using their respective `_MODEL_NAME` identifiers.
| Model Name | Description | Research Paper |
|---|---|---|
| `bge_m3` | Embedding for semantic and multi-vector search. | arXiv |
| `face_recognition` | Face analysis and identification via InsightFace (Buffalo). | |
| `got_ocr` | Unified visual text processing with GOT-OCR 2.0. | arXiv |
| `grounded_sam` | Zero-shot object detection and segmentation (Grounded-SAM). | arXiv |
| `owlv2` | Open-world object detection and localization. | arXiv |
| `siglip2` | Advanced vision-language model for image-text understanding. | arXiv |
| `timesfm_forecast` | Time-series foundation model for versatile forecasting. | arXiv |
| `whisper` | Multilingual speech-to-text with diarization and embeddings. | arXiv |
## Model Details
### 1. BGE-M3 (`bge_m3`)

Versatile embedding model that supports multi-functionality, multi-linguality, and multi-granularity. It performs simultaneous dense, sparse (lexical), and multi-vector (ColBERT) retrieval.

- Inputs:
  - `text` (Required): The string content to be embedded.
- Outputs:
  - `dense`: A high-dimensional vector for semantic similarity.
  - `sparse`: A dictionary of token weights for lexical/keyword matching.
  - `colbert`: A multi-vector representation for fine-grained late-interaction scoring.
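The three output types can be combined into a single hybrid relevance score. The sketch below assumes the output field names listed above; the function name and the scoring weights are illustrative, not part of the system's API.

```python
import numpy as np

def hybrid_score(query_out, doc_out, w_dense=0.5, w_sparse=0.3, w_colbert=0.2):
    """Combine bge_m3's dense, sparse, and colbert outputs (weights are illustrative)."""
    # Dense: cosine similarity between the two semantic vectors.
    q, d = np.asarray(query_out["dense"]), np.asarray(doc_out["dense"])
    dense = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
    # Sparse: sum of weight products over tokens present in both texts.
    sparse = sum(w * doc_out["sparse"].get(tok, 0.0)
                 for tok, w in query_out["sparse"].items())
    # ColBERT late interaction: max similarity per query vector, averaged.
    qm, dm = np.asarray(query_out["colbert"]), np.asarray(doc_out["colbert"])
    colbert = float((qm @ dm.T).max(axis=1).mean())
    return w_dense * dense + w_sparse * sparse + w_colbert * colbert
```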
### 2. Face Recognition (`face_recognition`)

Performs facial detection, alignment, and embedding extraction.

- Inputs:
  - `action` (Required): The inference action to perform (from `FaceAction`).
  - `image` (Required): BGR image as a NumPy array of shape `(H, W, 3)`.
- Outputs:
  - `faces_count`: Number of detected faces.
  - `faces`: List of objects containing `bbox`, `keypoints`, `confidence`, and the 512-d `embedding`.
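A typical use of the output is matching the most confident detected face against a gallery of known embeddings. This is a minimal sketch assuming the output fields above; the gallery format and 2-d vectors in the example are illustrative (real embeddings are 512-d).

```python
import numpy as np

def best_face_match(result, gallery):
    """Return the gallery name whose embedding is closest (cosine) to the
    most confident face in `result`, or None if no face was detected."""
    if result["faces_count"] == 0:
        return None
    face = max(result["faces"], key=lambda f: f["confidence"])
    emb = np.asarray(face["embedding"], dtype=float)
    emb = emb / np.linalg.norm(emb)
    scores = {name: float(emb @ (np.asarray(v, dtype=float) / np.linalg.norm(v)))
              for name, v in gallery.items()}
    return max(scores, key=scores.get)
```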
### 3. GOT OCR (`got_ocr`)

General OCR Theory (GOT) model for high-quality text extraction.

- Inputs:
  - `image` / `images`: A single ndarray or a list of images.
  - `action`: Specific OCR task (from `GotOcrAction`).
  - `format`: Boolean (default `True`) to maintain formatting.
  - `box`: Optional crop area `[x1, y1, x2, y2]`.
- Outputs:
  - `text`: The extracted string content.
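Assembling the request payload can be sketched as below. The helper name is hypothetical; it only packages and validates the input fields listed above (in particular the `[x1, y1, x2, y2]` box convention).

```python
def build_got_ocr_request(image, action, fmt=True, box=None):
    """Package got_ocr inputs into a payload dict (hypothetical helper).
    `box` is an optional [x1, y1, x2, y2] crop area."""
    payload = {"image": image, "action": action, "format": fmt}
    if box is not None:
        x1, y1, x2, y2 = box
        # A valid crop must have positive width and height.
        if x2 <= x1 or y2 <= y1:
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")
        payload["box"] = [x1, y1, x2, y2]
    return payload
```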
### 4. Grounded SAM (`grounded_sam`)

Combines language understanding with precise image segmentation.

- Inputs:
  - `image` & `text` (Required): The source image and the prompt to segment.
  - `threshold`: Detection confidence (default `0.2`).
  - `return_masks`: Whether to return the binary segmentation masks.
  - Note: Supports tiling for high-resolution images via `use_tiling`.
- Outputs:
  - `total_objects`: Count of all detected instances.
  - `concepts`: Dictionary keyed by label, containing `box`, `score`, and optionally `mask`.
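Since the default `threshold` of `0.2` is permissive, a common post-processing step is re-filtering by score. A sketch, assuming `concepts` maps each label to a list of instance dicts (that exact structure is not pinned down above):

```python
def filter_detections(result, min_score=0.5):
    """Keep only instances at or above min_score from a grounded_sam result.
    Assumes result["concepts"] maps label -> list of {box, score, ...} dicts."""
    kept = {}
    for label, instances in result["concepts"].items():
        strong = [inst for inst in instances if inst["score"] >= min_score]
        if strong:  # Drop labels with no surviving instances.
            kept[label] = strong
    return kept
```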
### 5. OWLv2 (`owlv2`)

Open-vocabulary object detection and localization.

- Inputs:
  - `image` & `text` (Required): The image and the query labels.
  - `threshold`: Detection sensitivity (default `0.1`).
- Outputs:
  - `concepts`: Detected instances per label, each with `box` and `score`.
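Downstream tooling often expects COCO-style `[x, y, width, height]` boxes. A conversion sketch, assuming corner-format `[x1, y1, x2, y2]` boxes and a label-to-instance-list layout for `concepts` (both assumptions, not confirmed above):

```python
def to_coco_boxes(result):
    """Flatten an owlv2 result into COCO-style records.
    Assumes each instance's box is [x1, y1, x2, y2]."""
    records = []
    for label, instances in result["concepts"].items():
        for inst in instances:
            x1, y1, x2, y2 = inst["box"]
            records.append({"label": label,
                            "bbox": [x1, y1, x2 - x1, y2 - y1],
                            "score": inst["score"]})
    return records
```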
### 6. SigLIP 2 (`siglip2`)

State-of-the-art vision-language model for creating shared image-text embeddings.

- Inputs:
  - `action` (Required): `embed_image` or `embed_text`.
  - `image`: Required for image embedding.
  - `text`: Required for text embedding.
- Outputs:
  - `embedding`: Vector representation.
  - `embedding_dim`: Size of the vector.
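Because image and text embeddings share one space, zero-shot classification reduces to cosine ranking. A minimal sketch assuming you have already collected one `embed_image` output and several `embed_text` outputs (the 2-d vectors in the test are illustrative only):

```python
import numpy as np

def rank_labels(image_emb, text_embs):
    """Rank candidate labels by cosine similarity to an image embedding.
    `text_embs` maps label -> embedding vector."""
    img = np.asarray(image_emb, dtype=float)
    img = img / np.linalg.norm(img)
    sims = {}
    for label, vec in text_embs.items():
        v = np.asarray(vec, dtype=float)
        sims[label] = float(img @ (v / np.linalg.norm(v)))
    # Highest-similarity label first.
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
```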
### 7. TimesFM Forecast (`timesfm_forecast`)

Foundation model for time-series forecasting with external covariates.

- Inputs:
  - `history` (Required): List of numerical historical values.
  - `horizon`: Number of future steps to predict (default `24`).
  - `dynamic_numerical_covariates`: Optional dict of external regressors.
- Outputs:
  - `point_forecast`: The predicted mean values.
  - `quantiles`: Deciles (0.1 to 0.9) for uncertainty estimation.
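The decile outputs can be turned into a per-step prediction band. A sketch, assuming `quantiles` is keyed by quantile level with each value aligned step-for-step with `point_forecast` (the key type is an assumption):

```python
def prediction_interval(result, lower=0.1, upper=0.9):
    """Zip a timesfm_forecast result into (low, point, high) tuples per step.
    Assumes result["quantiles"][level] is a list parallel to point_forecast."""
    lo = result["quantiles"][lower]
    hi = result["quantiles"][upper]
    return list(zip(lo, result["point_forecast"], hi))
```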
### 8. Whisper (`whisper`)

Multilingual speech-to-text with advanced speaker diarization.

- Inputs:
  - `audio` (Required): Dict containing `waveform` (ndarray) and `sample_rate`.
  - `task`: `transcribe` or `translate` (default `transcribe`).
  - `language`: Optional ISO language code.
  - `enable_diarization`: Boolean to trigger speaker clustering and embeddings.
- Outputs:
  - `text`: Full transcript string.
  - `chunks`: Segments with `timestamp`, `text`, `speaker`, and `speaker_embedding`.
  - `language`: Detected or used language.
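With diarization enabled, the per-chunk `speaker` field lets you regroup the transcript by speaker. A sketch over the output fields above (the `"unknown"` fallback is an assumption for chunks lacking a speaker label):

```python
from collections import defaultdict

def transcript_by_speaker(result):
    """Join diarized whisper chunks into one text block per speaker,
    preserving chunk order within each speaker."""
    grouped = defaultdict(list)
    for chunk in result["chunks"]:
        grouped[chunk.get("speaker", "unknown")].append(chunk["text"])
    return {speaker: " ".join(parts) for speaker, parts in grouped.items()}
```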