Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU—making them ideal for edge deployment and web applications. Many LFM models are available as pre-exported ONNX packages on Hugging Face. For models not yet available, use the LiquidONNX tool to export any LFM to ONNX.

Pre-exported Models

Pre-exported ONNX models are available from LiquidAI and the onnx-community. Check the Model Library for a complete list of available formats.

Quantization Options

Each ONNX export includes multiple precision levels:
  • Q4: recommended for most deployments; supports WebGPU, CPU, and GPU
  • FP16: higher quality; works on WebGPU and GPU
  • Q8: quality/size balance; server-only (CPU/GPU)
  • FP32: full-precision baseline
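
To see which precision variants a given repository actually ships, you can list its files before downloading. This is a small sketch using huggingface_hub (installed in the next section); the onnx/ folder layout matches the pre-exported LiquidAI packages, but other repositories may organize files differently.

from huggingface_hub import list_repo_files

# List the ONNX precision variants shipped in a pre-exported repo
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
onnx_files = [
    f for f in list_repo_files(model_id)
    if f.startswith("onnx/") and f.endswith(".onnx")
]
print("\n".join(onnx_files))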

Python Inference

Installation

pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2
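
As a quick sanity check (not part of the original setup steps), you can verify which execution providers your onnxruntime build exposes; with onnxruntime-gpu installed, CUDAExecutionProvider should appear in the list.

import onnxruntime as ort

# Show the execution providers this build can use, e.g.
# ['CUDAExecutionProvider', 'CPUExecutionProvider'] with onnxruntime-gpu
print(ort.get_available_providers())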

Basic Usage

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
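# LFM2 graphs expose both attention KV caches and conv-state caches
# (hence the present_conv -> past_conv mapping in the update step below);
# dynamic sequence dims start at 0 so the first pass runs with empty caches.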
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache: map each present.* output back to its past.* input
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))
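
The loop above decodes greedily with argmax. If you want more varied output, you can swap in temperature / top-k sampling; the helper below is a generic sketch (not part of the LiquidAI example) that operates on the same outputs[0] logits.

# Optional: temperature / top-k sampling instead of greedy argmax
def sample_next_token(logits, temperature=0.7, top_k=50):
    logits = logits.astype(np.float64) / temperature
    top_indices = np.argpartition(logits, -top_k)[-top_k:]  # ids of the top_k logits
    probs = np.exp(logits[top_indices] - logits[top_indices].max())
    probs /= probs.sum()
    return int(np.random.choice(top_indices, p=probs))

# Inside the loop, replace the argmax line with:
# next_token = sample_next_token(outputs[0][0, -1])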

WebGPU Inference

ONNX models run in browsers via Transformers.js with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

Setup

  1. Install Transformers.js:
npm install @huggingface/transformers
  2. Enable WebGPU in your browser:
    • Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
    • Verify: Check chrome://gpu for WebGPU status

Usage

import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4",  // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));

WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.

LiquidONNX Export Tool

LiquidONNX is the official tool for exporting LFM models to ONNX. Use it to export models not yet available as pre-built packages, or to customize export settings.

Installation

git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu

Supported Models

Family                         Quantization Formats
LFM2.5, LFM2 (text)            fp32, fp16, q4, q8
LFM2.5-VL, LFM2-VL (vision)    fp32, fp16, q4, q8
LFM2-MoE                       fp32, fp16, q4, q4f16
LFM2.5-Audio                   fp32, fp16, q4, q8

Export Commands

Text models:
# Export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Export specific precisions
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision fp16 q4

Vision-language models:
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# Alternative vision format for specific runtimes
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --vision-format conv2d

MoE models:
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision

Audio models:
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision

Export Options

Flag               Description
--precision        Output formats: fp16, q4, q8, or omit args for all
--output-dir       Output base directory (default: current directory)
--skip-export      Skip FP32 export, only run quantization on existing export
--block-size       Block size for quantization (default: 32)
--q4-asymmetric    Use asymmetric Q4 (default is symmetric for WebGPU)
--split-data       Split external data into chunks in GB (default: 2.0)

Inference with LiquidONNX

LiquidONNX includes inference commands for testing exported models:
# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
    --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
    --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
    --prompt "Hello, how are you?" --output speech.wav --precision q4

For complete documentation and advanced options, see the LiquidONNX GitHub repository.