Models

The system supports several pre-trained models out of the box. These can be referenced using their respective _MODEL_NAME identifiers.

| Model Name | Description | Research Paper |
| --- | --- | --- |
| bge_m3 | Embedding for semantic and multi-vector search. | Arxiv |
| face_recognition | Face analysis and identification via InsightFace (Buffalo). | |
| got_ocr | Unified visual text processing with GOT-OCR 2.0. | Arxiv |
| grounded_sam | Zero-shot object detection and segmentation (Grounded-SAM). | Arxiv |
| owlv2 | Open-world object detection and localization. | Arxiv |
| siglip2 | Advanced vision-language model for image-text understanding. | Arxiv |
| timesfm_forecast | Time-series foundation model for versatile forecasting. | Arxiv |
| whisper | Multilingual speech-to-text with diarization and embeddings. | Arxiv |

Model Details

1. BGE-M3 (bge_m3)

Versatile embedding model that supports multi-functionality, multi-linguality, and multi-granularity. It performs simultaneous dense, sparse (lexical), and multi-vector (ColBERT) retrieval.

  • Inputs:
    • text (Required): The string content to be embedded.
  • Outputs:
    • dense: A high-dimensional vector for semantic similarity.
    • sparse: A dictionary of token weights for lexical/keyword matching.
    • colbert: A multi-vector representation for fine-grained late interaction scoring.
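
A minimal sketch of calling this model follows. The run_model(name, **inputs) entry point is a hypothetical stand-in for whatever client call your deployment actually exposes, and it is assumed to return the documented outputs as a plain dict.

```python
import numpy as np

# Hypothetical entry point; substitute your deployment's actual client call.
from my_inference_client import run_model

result = run_model("bge_m3", text="What is late-interaction retrieval?")

dense = np.asarray(result["dense"])      # single semantic vector
sparse = result["sparse"]                # {token: weight} map for lexical matching
colbert = np.asarray(result["colbert"])  # (num_tokens, dim) matrix for late interaction

# Dense retrieval reduces to cosine similarity between embeddings.
other = np.asarray(run_model("bge_m3", text="How does ColBERT scoring work?")["dense"])
cosine = dense @ other / (np.linalg.norm(dense) * np.linalg.norm(other))
print(f"dense cosine similarity: {cosine:.3f}")
```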

2. Face Recognition (face_recognition)

Performs facial detection, alignment, and embedding extraction.

  • Inputs:
    • action (Required): The inference action to perform (from FaceAction).
    • image (Required): BGR image as numpy array (H, W, 3).
  • Outputs:
    • faces_count: Number of detected faces.
    • faces: List of objects containing bbox, keypoints, confidence, and the 512-d embedding.
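
A sketch using the same hypothetical run_model helper as above. The concrete FaceAction members are not enumerated here, so the "detect" action value is illustrative, and the [x1, y1, x2, y2] bbox layout is an assumption.

```python
import cv2  # assumption: OpenCV is used to produce the BGR (H, W, 3) array
from my_inference_client import run_model  # hypothetical entry point

image = cv2.imread("group_photo.jpg")  # cv2.imread returns BGR, matching the input spec

# "detect" is a placeholder for an actual FaceAction member.
result = run_model("face_recognition", action="detect", image=image)

print(f"found {result['faces_count']} face(s)")
for face in result["faces"]:
    x1, y1, x2, y2 = face["bbox"]  # bbox layout assumed to be [x1, y1, x2, y2]
    print(f"bbox=({x1}, {y1}, {x2}, {y2}), confidence={face['confidence']:.2f}")
    embedding = face["embedding"]  # 512-d identity vector for matching/clustering
```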

3. GOT OCR (got_ocr)

General OCR Theory (GOT-OCR 2.0) model for high-quality text extraction.

  • Inputs:
    • image / images: Single ndarray or list of images.
    • action: Specific OCR task (from GotOcrAction).
    • format: Boolean (default True) to maintain formatting.
    • box: Optional crop area [x1, y1, x2, y2].
  • Outputs:
    • text: The extracted string content.
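
A sketch with the same hypothetical run_model helper; the "ocr" action value stands in for whichever GotOcrAction member your deployment exposes.

```python
import cv2
from my_inference_client import run_model  # hypothetical entry point

page = cv2.imread("invoice.png")

# "ocr" is a placeholder GotOcrAction value; box crops OCR to a region of interest.
result = run_model("got_ocr", image=page, action="ocr",
                   format=True, box=[100, 50, 800, 400])
print(result["text"])
```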

4. Grounded SAM (grounded_sam)

Combines language understanding with precise image segmentation.

  • Inputs:
    • image & text (Required): The source image and the prompt to segment.
    • threshold: Detection confidence (default 0.2).
    • return_masks: Whether to return the binary segmentation masks.
    • use_tiling: Optional flag enabling tiled inference on high-resolution images.
  • Outputs:
    • total_objects: Count of all detected instances.
    • concepts: Dictionary mapped by label containing box, score, and optionally mask.
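
A sketch with the same hypothetical run_model helper. The exact shape of concepts is not pinned down above, so the code assumes each label maps to a list of instances; the dot-separated prompt format is likewise illustrative.

```python
import cv2
from my_inference_client import run_model  # hypothetical entry point

image = cv2.imread("street.jpg")
result = run_model("grounded_sam", image=image, text="car. pedestrian.",
                   threshold=0.2, return_masks=True)

print(f"{result['total_objects']} object(s) detected")
for label, instances in result["concepts"].items():
    for inst in instances:  # assumption: each label maps to a list of instances
        print(label, inst["box"], f"score={inst['score']:.2f}")
        mask = inst.get("mask")  # binary segmentation mask, present when return_masks=True
```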

5. OWLv2 (owlv2)

Open-vocabulary object detection and localization.

  • Inputs:
    • image & text (Required): Image and query labels.
    • threshold: Detection sensitivity (default 0.1).
  • Outputs:
    • concepts: Detailed instances per label with box and score.
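
Another sketch with the hypothetical run_model helper; passing the query labels as a list is an assumption about how text is accepted.

```python
import cv2
from my_inference_client import run_model  # hypothetical entry point

image = cv2.imread("kitchen.jpg")
result = run_model("owlv2", image=image, text=["mug", "kettle"], threshold=0.1)

for label, instances in result["concepts"].items():
    for inst in instances:  # same per-label instance-list assumption as above
        print(label, inst["box"], f"score={inst['score']:.2f}")
```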

6. SigLIP 2 (siglip2)

State-of-the-art vision-language model for creating shared embeddings.

  • Inputs:
    • action (Required): embed_image or embed_text.
    • image: Required for image embedding.
    • text: Required for text embedding.
  • Outputs:
    • embedding: Vector representation.
    • embedding_dim: Size of the vector.
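
A sketch of embedding both modalities with the hypothetical run_model helper and comparing them by cosine similarity in the shared space.

```python
import cv2
import numpy as np
from my_inference_client import run_model  # hypothetical entry point

image = cv2.imread("photo.jpg")
img_out = run_model("siglip2", action="embed_image", image=image)
txt_out = run_model("siglip2", action="embed_text", text="a cat sleeping on a sofa")

a = np.asarray(img_out["embedding"])
b = np.asarray(txt_out["embedding"])
assert img_out["embedding_dim"] == txt_out["embedding_dim"]

# Cosine similarity in the shared image-text embedding space.
score = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"image-text similarity: {score:.3f}")
```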

7. TimesFM Forecast (timesfm_forecast)

Foundation model for time-series forecasting with external covariates.

  • Inputs:
    • history (Required): List of numerical historical values.
    • horizon: Number of future steps to predict (default 24).
    • dynamic_numerical_covariates: Optional dict for external regressors.
  • Outputs:
    • point_forecast: The predicted mean values.
    • quantiles: Deciles (0.1 to 0.9) for uncertainty estimation.
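
A sketch with the hypothetical run_model helper; the series values are illustrative, and the exact key layout of quantiles is an assumption.

```python
from my_inference_client import run_model  # hypothetical entry point

history = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0,
           148.0, 148.0, 136.0, 119.0, 104.0, 118.0]  # illustrative series

result = run_model("timesfm_forecast", history=history, horizon=6)

print(result["point_forecast"])   # six predicted mean values
quantiles = result["quantiles"]   # deciles 0.1..0.9; exact key layout may differ
```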

8. Whisper (whisper)

Multilingual speech-to-text with advanced speaker diarization.

  • Inputs:
    • audio (Required): Dict containing waveform (ndarray) and sample_rate.
    • task: transcribe or translate (default transcribe).
    • language: Optional ISO language code.
    • enable_diarization: Boolean to trigger speaker clustering and embeddings.
  • Outputs:
    • text: Full transcript string.
    • chunks: Segments with timestamp, text, speaker, and speaker_embedding.
    • language: Detected or used language.
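
A final sketch with the hypothetical run_model helper; the silent one-second waveform just keeps the example self-contained (load real audio, e.g. with soundfile, in practice).

```python
import numpy as np
from my_inference_client import run_model  # hypothetical entry point

waveform = np.zeros(16_000, dtype=np.float32)  # one second of silence at 16 kHz
audio = {"waveform": waveform, "sample_rate": 16_000}

result = run_model("whisper", audio=audio, task="transcribe",
                   enable_diarization=True)

print(result["language"], result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["speaker"], chunk["text"])
```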