Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU—making them ideal for edge deployment and web applications. Many LFM models are available as pre-exported ONNX packages on Hugging Face. For models not yet available, use the LiquidONNX tool to export any LFM to ONNX.

Pre-exported Models

Pre-exported ONNX models are available from LiquidAI and the onnx-community. Check the Model Library for a complete list of available formats.

Quantization Options

Each ONNX export includes multiple precision levels:
  • Q4: recommended for most deployments; supports WebGPU, CPU, and GPU
  • FP16: higher quality; works on WebGPU and GPU
  • Q8: quality/size balance; server-only (CPU/GPU)
  • FP32: full-precision baseline
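
To see which precision variants a given repository actually ships, you can list its files before downloading. This is a small sketch using huggingface_hub (installed in the next section); the onnx/ folder layout matches the pre-exported LiquidAI packages, but other repositories may organize files differently.

from huggingface_hub import list_repo_files

# List the ONNX precision variants shipped in a pre-exported repo
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
onnx_files = [
    f for f in list_repo_files(model_id)
    if f.startswith("onnx/") and f.endswith(".onnx")
]
print("\n".join(onnx_files))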

Python Inference

Installation

pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2
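
As a quick sanity check (not part of the original setup steps), you can verify which execution providers your onnxruntime build exposes; with onnxruntime-gpu installed, CUDAExecutionProvider should appear in the list.

import onnxruntime as ort

# Show the execution providers this build can use, e.g.
# ['CUDAExecutionProvider', 'CPUExecutionProvider'] with onnxruntime-gpu
print(ort.get_available_providers())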

Basic Usage

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
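# LFM2 graphs expose both attention KV caches and conv-state caches
# (hence the present_conv -> past_conv mapping in the update step below);
# dynamic sequence dims start at 0 so the first pass runs with empty caches.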
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache: map each present.* output back to its past.* input
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))
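
The loop above decodes greedily with argmax. If you want more varied output, you can swap in temperature / top-k sampling; the helper below is a generic sketch (not part of the LiquidAI example) that operates on the same outputs[0] logits.

# Optional: temperature / top-k sampling instead of greedy argmax
def sample_next_token(logits, temperature=0.7, top_k=50):
    logits = logits.astype(np.float64) / temperature
    top_indices = np.argpartition(logits, -top_k)[-top_k:]  # ids of the top_k logits
    probs = np.exp(logits[top_indices] - logits[top_indices].max())
    probs /= probs.sum()
    return int(np.random.choice(top_indices, p=probs))

# Inside the loop, replace the argmax line with:
# next_token = sample_next_token(outputs[0][0, -1])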

WebGPU Inference

ONNX models run in browsers via Transformers.js with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

Setup

  1. Install Transformers.js:
npm install @huggingface/transformers
  2. Enable WebGPU in your browser:
    • Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
    • Verify: Check chrome://gpu for WebGPU status

Usage

import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4",  // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));

WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.

LiquidONNX Export Tool

LiquidONNX is the official tool for exporting LFM models to ONNX. Use it to export models not yet available as pre-built packages, or to customize export settings.

Installation

git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu

Supported Models

Family                         Quantization Formats
LFM2.5, LFM2 (text)            fp32, fp16, q4, q8
LFM2.5-VL, LFM2-VL (vision)    fp32, fp16, q4, q8
LFM2-MoE                       fp32, fp16, q4, q4f16
LFM2.5-Audio                   fp32, fp16, q4, q8

Export Commands

Text models:
# Export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Export specific precisions
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision fp16 q4

Vision-language models:
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# Alternative vision format for specific runtimes
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --vision-format conv2d

MoE models:
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision

Audio models:
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision

Export Options

Flag               Description
--precision        Output formats: fp16, q4, q8, or omit args for all
--output-dir       Output base directory (default: current directory)
--skip-export      Skip FP32 export, only run quantization on existing export
--block-size       Block size for quantization (default: 32)
--q4-asymmetric    Use asymmetric Q4 (default is symmetric for WebGPU)
--split-data       Split external data into chunks in GB (default: 2.0)

Inference with LiquidONNX

LiquidONNX includes inference commands for testing exported models:
# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
    --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
    --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
    --prompt "Hello, how are you?" --output speech.wav --precision q4

For complete documentation and advanced options, see the LiquidONNX GitHub repository.