Kokoro Local TTS: Offline Custom Voices

Introduction

Interest in voice applications has surged with the rise of real-time, bidirectional voice APIs from major AI providers. The trade-off is data exposure when sending text to external services. That’s why a reliable local text-to-speech option matters.

I’ve been working with Kokoro 82M, a compact text-to-speech model that runs locally, including on machines without a GPU. It produces strong speech quality, is easy to set up, and supports custom voices through embeddings. This article explains what it is, how it works, and how to run it on your system.

What is Kokoro Local TTS + Custom Voices?

Kokoro 82M is a small neural TTS model released on Hugging Face and GitHub. It ships as open weights with voice embeddings (voice packs) you can load, blend, and reuse. You can try it in a public demo, run it in notebooks, or deploy it locally using ONNX for speed.

It ranks at the top of the Hugging Face TTS Arena among models with accessible weights. Even though it arrived without a formal announcement, it has gained attention because it performs well, installs quickly, and supports multiple voices and languages.

Table Overview: Kokoro Local TTS + Custom Voices

Model name: Kokoro 82M
Size: ~82M parameters
Availability: Hugging Face and GitHub (open weights)
Inference: CPU or GPU; ONNX runtime recommended for local speed
Training data: Under 100 hours of audio
Architecture base: Inspired by StyleTTS2 (paper and repo available)
Voice system: Model + per-voice embeddings (“voice packs”)
Languages/accents: English (US and UK), plus options for French, Japanese, Korean, and Chinese
Phoneme support: Trained on phonemes; handles US and UK pronunciations
Leaderboard: Top-ranked on Hugging Face TTS Arena among accessible-weight models
Custom voices: Blend embeddings; save and reuse new voice tensors
Community tools: ONNX package, FastAPI server (OpenAI-compatible speech endpoint), Rust inference
Runs without GPU: Yes; real-time or faster on modern CPUs (ONNX)
Intended use: Local TTS for apps, agents, scripting, and batch generation

Key Features of Kokoro

  • Local-first: Process text to speech entirely on your own machine.
  • Compact model: ~82M parameters with strong audio quality from a small footprint.
  • Voice packs: Per-voice embeddings control speaker identity; you can blend or create new voices.
  • Multilingual voices: English (US/UK), plus options for French, Japanese, Korean, and Chinese.
  • Phoneme-based: Accepts and respects phonetic inputs for more controlled pronunciation.
  • Flexible runtime: Run via PyTorch, or use ONNX for a faster local pipeline.
  • Community ecosystem: ONNX package, an OpenAI-compatible FastAPI server, and Rust inference tooling.

Try Kokoro in the Browser

A public demo lets you select voices, accents, and supported languages. It includes multiple American and British English voices, plus selections for French, Japanese, Korean, and Chinese.

If you’re evaluating voice tone and pronunciation, the demo is a quick way to assess the model before installing it locally.

Model Details and Training

Kokoro 82M was trained on less than 100 hours of audio. The architecture is based on StyleTTS2, which has a public repository and paper. The team has discussed plans to train the next version on more data to improve quality further.

Voices are packaged as embeddings stored in voice packs. Each embedding defines a speaker identity the model can render. The maintainers have offered to create a voice pack for a specific voice if you contribute data to a future training run.

Community Tools and Integrations

Several projects make it easy to adopt Kokoro in real applications:

  • Kokoro ONNX: A packaged ONNX runtime version for fast local inference.
  • FastAPI TTS server: Emulates an OpenAI-compatible speech endpoint. You can switch from a cloud TTS call to a local server with minimal changes.
  • Rust inference: A Rust-based runtime aimed at production use and speed.

These tools cover everything from quick demos to production APIs.

Getting Started in a Notebook

You can follow publicly available example code in a notebook environment to test the core features and build your own scripts.

Core Components

  • Model: The core TTS model converts text (or phonemes) and a voice embedding into audio.
  • Voice embeddings: Each voice has its own embedding tensor. Loading different embeddings changes the speaker identity without altering the model.

Embeddings are named by accent and gender; for example, American voices use an “a” prefix (af_ for female, am_ for male) and British voices a “b” prefix (bf_, bm_). More voice packs are planned.

Generate Speech

Basic generation involves:

  1. Load the model.
  2. Load a voice embedding from a voice pack.
  3. Provide text (or phonemes) and optional parameters (speed, punctuation handling).
  4. Run inference to produce audio.

Because the model is trained on phonemes, you can pass phonetic inputs to refine pronunciation. It can handle both US and UK phonemes.
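For illustration, here is a minimal notebook-style sketch of those four steps. It assumes the example layout from the model repository (a build_model helper, a generate function, and a voices/ folder of .pt embeddings); exact file and function names may differ between releases, so check the model card.

```python
import torch

# Assumed example layout from the model repository; adjust names to your checkout.
from models import build_model   # builds the Kokoro TTS model from a checkpoint
from kokoro import generate      # text (or phonemes) + voice embedding -> audio

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the model weights.
model = build_model("kokoro-v0_19.pth", device)

# 2. Load a voice embedding from the voice pack (here, an American female voice).
voice_name = "af_sarah"
voicepack = torch.load(f"voices/{voice_name}.pt", weights_only=True).to(device)

# 3.-4. Provide text and run inference; lang is derived from the voice prefix
# ('a' = American English, 'b' = British English).
text = "Kokoro runs entirely on this machine."
audio, phonemes = generate(model, text, voicepack, lang=voice_name[0])
```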

Save Audio

After generation, save the output in a format your application expects:

  • Convert the audio buffer to WAV or WebM.
  • Write to disk.
  • Download or pass to downstream tools.

This is straightforward in notebooks and helps you build a reusable workflow.
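Assuming the audio array from the sketch above and the soundfile package, writing a WAV master is one call (Kokoro outputs 24 kHz audio):

```python
import soundfile as sf

# Write the generated samples to a lossless WAV master at Kokoro's 24 kHz rate.
sf.write("output.wav", audio, samplerate=24000)
```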

Custom Voices by Blending Embeddings

Kokoro’s voice identity lives in embeddings. That means you can create new voices by blending existing embeddings and then save the result as a reusable voice.

Understanding the Embedding

A typical voice embedding tensor in Kokoro has shape 511 × 1 × 256. New voices you create should have the same shape. At inference time, you load this tensor the same way you would load any built-in voice.

Blending Methods

There are multiple ways to blend two or more voice embeddings:

  • Simple average

    • Add two embeddings and divide by 2.
    • Quick to try, but often skews toward one of the originals.
  • Weighted average

    • Choose weights for each embedding, for example 70% voice A and 30% voice B.
    • Gives predictable control over how much each source influences the result.
  • Linear interpolation

    • Interpolate from embedding A to embedding B across a parameter t in [0,1].
    • You may notice the voice stays close to A for a while and then changes rapidly near B.
  • Spherical interpolation

    • Interpolate on a hypersphere rather than linearly.
    • Widely used for blending latents in image generation; often yields smoother transitions and finer control.

Practical steps (a code sketch follows this list):

  1. Load two or more embeddings.
  2. Normalize if your method requires it.
  3. Combine them using your chosen technique (average, weighted, linear, or spherical).
  4. Validate the shape (511 × 1 × 256).
  5. Save the blended tensor as a new voice embedding.
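Here is a minimal sketch of a weighted average and a spherical interpolation (slerp) over two voice tensors, assuming they are loaded as PyTorch tensors of shape 511 × 1 × 256; the helper and file names are illustrative, not part of Kokoro itself.

```python
import torch

def weighted_blend(a: torch.Tensor, b: torch.Tensor, weight_a: float = 0.7) -> torch.Tensor:
    """Weighted average: weight_a of voice A, the remainder of voice B."""
    return weight_a * a + (1.0 - weight_a) * b

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two embeddings, flattened then reshaped back."""
    a_flat, b_flat = a.flatten(), b.flatten()
    a_unit = a_flat / a_flat.norm()
    b_unit = b_flat / b_flat.norm()
    omega = torch.arccos(torch.clamp(torch.dot(a_unit, b_unit), -1.0, 1.0))
    if omega.abs() < 1e-6:              # nearly identical voices: fall back to lerp
        return (1 - t) * a + t * b
    out = (torch.sin((1 - t) * omega) * a_flat + torch.sin(t * omega) * b_flat) / torch.sin(omega)
    return out.reshape(a.shape)

# Illustrative file names; use whichever voices your pack provides.
voice_a = torch.load("voices/af_sarah.pt", weights_only=True)
voice_b = torch.load("voices/bf_emma.pt", weights_only=True)

new_voice = slerp(voice_a, voice_b, t=0.4)
assert new_voice.shape == (511, 1, 256)  # validate the expected shape
```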

Save and Reuse Your Voice

Once you have a new embedding (a short sketch follows this list):

  • Store it in your voice pack format (JSON plus binary weights or a single file your runtime expects).
  • Load it exactly as you would load a built-in voice.
  • Keep a record of blending parameters so you can reproduce or tweak the voice later.
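A small sketch of saving and reloading the blended tensor with PyTorch, plus a sidecar note recording how it was made (file names are illustrative):

```python
import json
import torch

# Save the blended embedding like any other voice pack entry.
torch.save(new_voice, "voices/custom_sarah_emma.pt")

# Record the recipe so the voice can be reproduced or tweaked later.
with open("voices/custom_sarah_emma.json", "w") as f:
    json.dump({"sources": ["af_sarah", "bf_emma"], "method": "slerp", "t": 0.4}, f, indent=2)

# Reuse it exactly like a built-in voice.
voicepack = torch.load("voices/custom_sarah_emma.pt", weights_only=True)
```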

Optional: Train a Voice-to-Embedding Mapper

If you plan to generate many custom voices from samples, one approach is to train a model that maps a reference voice to a compatible embedding. This would not require retraining the Kokoro TTS model; it would only estimate new embeddings. Building that pipeline is outside the scope of this guide, but it’s a viable path for large-scale voice creation.

Run Locally with ONNX for Speed

While you can run the PyTorch version in notebooks, the ONNX runtime is a good choice for local performance. It’s fast on modern CPUs and pairs well with a simple CLI or API for batch and interactive use.

Prerequisites

  • Python environment manager (such as uv).
  • ONNX runtime and the Kokoro ONNX package.
  • The ONNX model file and the voice pack file (commonly a JSON plus embedding data).

On macOS, uv can be installed via Homebrew. On Windows or Linux, follow the installation steps for your platform.

Setup Steps

  1. Create and activate a virtual environment with uv.
  2. Install the Kokoro ONNX package via pip.
  3. Place the ONNX model file in the expected folder (check the package docs).
  4. Place the voice pack (embeddings) file in the expected folder (often named voices.json or similar).
  5. Review included examples in the package repository to confirm paths and options.

Once these files are in place, the scripts can find the model and voice pack automatically, or you can supply explicit paths.

Generate Audio with ONNX

  • Use the “hello” or sample generation script from the package.
  • Provide:
    • Text to synthesize.
    • Voice name (from the voice pack).
    • Optional settings like speed or phoneme mode.
  • Run the script (for example, with uv run).
  • Confirm that an audio file (such as WAV) appears in the output directory.

This setup lends itself to wrapping the command in a simple function call from your app. You can wire it to a local API, shell script, or workflow tool.
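As a concrete example, a minimal script built on the kokoro-onnx package might look like the following; the model and voice pack file names and the voice id reflect the package’s published examples and may differ in newer releases.

```python
import soundfile as sf
from kokoro_onnx import Kokoro

# Paths to the ONNX model and voice pack placed during setup.
kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")

# Generate speech for one sentence with a named voice at normal speed.
samples, sample_rate = kokoro.create(
    "Hello from a fully local text to speech pipeline.",
    voice="af_sarah",
    speed=1.0,
    lang="en-us",
)

sf.write("hello.wav", samples, sample_rate)
print(f"Wrote hello.wav at {sample_rate} Hz")
```

Run it from the environment you created, for example with uv run hello.py, and confirm the WAV file appears in the output directory.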

Tips for Local Use

  • Batch processing: Queue multiple texts and generate audio in a single session for better throughput.
  • Custom voices: Store your blended embeddings alongside the official pack to switch voices easily.
  • Phonemes: For tricky words or names, feed phonemes to get consistent pronunciation across runs.
  • Voice selection: Keep a short list of preferred voices and a reference playlist to standardize tone.

Build an OpenAI-Compatible Local Endpoint

A community FastAPI project exposes a local TTS endpoint that follows the OpenAI-compatible speech API shape. If your toolchain already integrates with that API, you can repoint it to the local server to keep requests on-device.

Setup outline:

  1. Install the FastAPI project and its dependencies.
  2. Configure the path to the ONNX model and voice pack.
  3. Start the server.
  4. Update your application’s TTS URL to the local endpoint.

This approach is helpful for teams that want to swap out a remote TTS provider for a local service without rewriting their client code.
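For example, with the official openai Python client you would only change the base URL; the port, model name, and voice below are assumptions, so use whatever the server you installed actually exposes.

```python
from openai import OpenAI

# Point the client at the local FastAPI server instead of the hosted API.
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed-locally")

response = client.audio.speech.create(
    model="kokoro",        # model name the local server advertises (assumption)
    voice="af_sarah",      # a voice from your local voice pack
    input="This request never leaves the machine.",
)

# Save the returned audio bytes to disk.
response.write_to_file("speech.mp3")
```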

Rust Inference Option

A Rust-based inference project is available for those building production systems in Rust. It targets speed and deployability. If you’re integrating TTS into a Rust application or service, this is worth reviewing.

Phonemes and Pronunciation Control

Because Kokoro is trained on phonemes, you can control pronunciation more precisely than with plain text alone:

  • Feed phonemes in the scheme the model was trained on (Kokoro’s pipeline uses espeak-style IPA) to force the exact sounds.
  • Switch between US and UK phoneme sets depending on the accent you want.
  • Mix phonemes and text when only a few words need adjustment.

This is especially helpful for names, technical terms, and brand words.

Saving and Managing Outputs

Keep your audio outputs organized:

  • File formats: WAV for lossless master files; WebM or MP3 for smaller distribution files.
  • Metadata: Include voice name, speed, phoneme mode, and text hash in the filename or a sidecar JSON (see the sketch after this list).
  • Versioning: Store outputs and voice embeddings in a versioned directory to reproduce results later.
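A small sketch of the metadata idea, writing a sidecar JSON next to each WAV (the field names are just a suggestion):

```python
import hashlib
import json

def write_sidecar(wav_path: str, text: str, voice: str, speed: float) -> None:
    """Record the settings used to render wav_path so it can be reproduced later."""
    meta = {
        "voice": voice,
        "speed": speed,
        "phoneme_mode": False,
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    with open(wav_path + ".json", "w") as f:
        json.dump(meta, f, indent=2)

write_sidecar("output.wav", "Hello from Kokoro.", voice="af_sarah", speed=1.0)
```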

Scaling Up: Workflows and Automation

As your use grows:

  • Script common tasks (voice selection, phoneme prep, batch generation).
  • Add a simple CLI with flags for text, voice, speed, and output path (a sketch follows below).
  • Use a queue or job runner for long batches.
  • Cache or memoize frequent phrases if you render them often.

These small steps make Kokoro a reliable part of your daily toolset.
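The CLI mentioned above can stay very small; here is a sketch built on the kokoro-onnx call shown earlier (argument names and default paths are illustrative):

```python
import argparse

import soundfile as sf
from kokoro_onnx import Kokoro

def main() -> None:
    parser = argparse.ArgumentParser(description="Local Kokoro TTS")
    parser.add_argument("text", help="Text to synthesize")
    parser.add_argument("--voice", default="af_sarah", help="Voice name from the voice pack")
    parser.add_argument("--speed", type=float, default=1.0, help="Speech rate multiplier")
    parser.add_argument("--out", default="out.wav", help="Output WAV path")
    args = parser.parse_args()

    kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")
    samples, sample_rate = kokoro.create(args.text, voice=args.voice, speed=args.speed, lang="en-us")
    sf.write(args.out, samples, sample_rate)

if __name__ == "__main__":
    main()
```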

Privacy and Local Agents

With Kokoro running on-device, you can build a complete local voice loop:

  • Speech-to-text (ASR) for input.
  • Local reasoning or routing.
  • Kokoro for text-to-speech output.

This allows private, interactive agents with no recurring API fees and no external data transfer for speech generation.

Troubleshooting Checklist

If audio won’t generate or sounds incorrect:

  • Paths: Confirm the ONNX model and voice pack locations.
  • Voice key: Ensure the voice name matches an entry in the voice pack.
  • Sample rate: Check audio player and output format compatibility.
  • Phonemes: If text sounds off, try phoneme inputs or a different accent pack.
  • CPU load: For real-time use, close other heavy processes or try a smaller batch size.

Maintenance and Upgrades

Kokoro is evolving. To stay current:

  • Update to new voice packs as they are released.
  • Test new ONNX builds for speed and quality.
  • Consider the next trained version when it becomes available, as more data may yield better results.

Keep your custom voice embeddings backed up. They should remain compatible as long as the embedding interface and shape remain stable.

Step-by-Step Quickstart (Notebook to Local)

  1. Notebook trial

    • Load the model and a voice embedding.
    • Generate a short sample.
    • Save to WAV and confirm playback.
  2. Voice control

    • Switch between an American and a British voice.
    • Try phonemes for a few terms to verify pronunciation control.
  3. Custom voice

    • Blend two embeddings via a weighted or spherical interpolation.
    • Save the new embedding and regenerate a short sample.
  4. Local ONNX

    • Install the Kokoro ONNX package and set up uv.
    • Copy the ONNX model and voice pack files.
    • Run the sample script with your preferred voice and text.
    • Confirm output speed and audio quality.
  5. Optional API

    • Start the FastAPI server for an OpenAI-compatible endpoint.
    • Point your application to the local URL.

Practical Voice Management

  • Naming: Use consistent voice IDs (e.g., en-US-female-01, en-UK-male-02).
  • Presets: Keep a JSON with default speed, punctuation, and phoneme settings per voice.
  • Reference clips: Maintain a short set of lines to compare voice output across updates.
  • Documentation: Record how each custom voice was created (weights and interpolation method).

Security and Compliance

Local TTS means:

  • Your input text remains on-device.
  • You control output handling and retention.
  • You can conform to stricter data policies without third-party transmission.

If you store outputs, ensure your retention policy and access controls meet your organization’s requirements.

Performance Notes

  • ONNX on modern CPUs often runs faster than real time for typical sentence lengths.
  • Longer passages benefit from batch or chunked synthesis (see the sketch below).
  • Phoneme inputs can sometimes reduce retries and re-renders by improving clarity on the first pass.
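A sketch of chunked synthesis for long passages, splitting on sentences and concatenating the sample buffers (the sentence splitting here is deliberately naive; file names and voice follow the earlier examples):

```python
import numpy as np
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")

long_text = "First sentence. Second sentence. Third sentence."
chunks = [s.strip() + "." for s in long_text.split(".") if s.strip()]

pieces = []
sample_rate = 24000
for chunk in chunks:
    # Render each chunk separately, then stitch the audio back together.
    samples, sample_rate = kokoro.create(chunk, voice="af_sarah", speed=1.0, lang="en-us")
    pieces.append(samples)

sf.write("long_output.wav", np.concatenate(pieces), sample_rate)
```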

Summary

Kokoro 82M is a compact local text-to-speech model that produces high-quality voice output without sending data to external services. Voices are defined by embeddings, which makes customization straightforward. You can run it in a notebook, switch to ONNX for fast local inference, and even expose it as a local API that mirrors a familiar endpoint. With phoneme support, multiple languages and accents, and a growing tool ecosystem, it is well-suited for applications that need private, controllable, and efficient text-to-speech on your own hardware.