OpenAI lets developers build real-time voice apps - at a substantial premium

Jakub Porzycki/NurPhoto via Getty Images

OpenAI's annual developer day took place Wednesday in San Francisco, with a raft of product and feature announcements. The event's centerpiece was the company's introduction of its real-time application programming interface (API).

The feature for developers makes it possible to send and receive spoken-language inputs and outputs during inference operations, or making predictions with a production large language model (LLM). It is hoped this type of interaction can enable a more fluid, real-time conversation between a person and a language model.

Also: OpenAI's Altman sees 'superintelligence' just around the corner - but he's short on details

This capability also comes at a hefty premium. OpenAI currently prices the GPT-4o large language model, which is the model that forms the basis for the real-time API, at $2.50 per million tokens of input text, and $10 per million output tokens.

The real-time input and output cost is at least twice that rate, based on both text and audio tokens, since GPT-4o needs both kinds of input and output. Input and output tokens for GPT-4o when using the real-time API cost $5 and $20, respectively, per million tokens.

A busy schedule at the developer day.

OpenAI

For voice tokens, the cost is a whopping $100 per million audio input tokens and $200 per million audio output tokens.

Also: How to use ChatGPT to optimize your resume

OpenAI notes that with standard statistics for voice conversations, the pricing of audio tokens "equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output."

OpenAI's pricing sheet for real-time API function calls in GPT-4o large language model inference.

OpenAI

OpenAI gives examples of how real-time voice can be used in generative AI, including an automated health coach giving a person advice, and a language tutor that can engage in conversations with a student to practice a new language.

During the developer conference, OpenAI offered a way to reduce the total cost to developers, with prompt caching, which is re-using tokens on inputs that have been previously submitted to the model. That approach cuts the price of GPT-4o input text tokens in half.

Also: OpenAI's budget GPT-4o mini model is now cheaper to fine-tune, too

Also introduced Wednesday was LLM "distillation", which lets developers use the data from larger models to train smaller models.

A developer captures the input and output of one of OpenAI's more capable language models, such as GPT-4o, using the technique known as "stored completions". Those stored completions then become the training data to "fine tune" a smaller model, such as GPT-4o mini.

OpenAI bills the distillation service as a way to eliminate a lot of iterative work required by developers to train smaller models from larger models.

"Until now, distillation has been a multi-step, error-prone process," says the company's blog on the matter, "which required developers to manually orchestrate multiple operations across disconnected tools, from generating datasets to fine-tuning models and measuring performance improvements."

Also: Businesses can reach decision dominance using AI. Here's how

Distillation comes in addition to OpenAI's existing fine-tuning service, the difference being that you can use the larger model's input-output pairs as the fine-tuning data. To the fine-tuning service, the company Wednesday added image fine tuning. A developer submits a data set of images, just as they would with text, to make an existing model, such as GPT-4o, more specific to a task or a domain of knowledge.

An example in practice is work by food delivery service Grab. The company uses real-world images of street signs to have GPT-4o perform mapping of the company's delivery routes. "Grab was able to improve lane count accuracy by 20% and speed limit sign localization by 13% over a base GPT-4o model, enabling them to better automate their mapping operations from a previously manual process," states OpenAI.

Pricing is based on chopping up each image a developer submits into tokens, which are then priced at $3.75 per million input tokens and $15 per million output tokens, the same as standard fine-tuning. For training image models, the cost is $25 per million tokens.

Original link