Stable Audio Online

Stable Audio is a latent diffusion model architecture tailored for audio, conditioned on text metadata as well as audio file duration and start time.

Create beautiful music using Stable Audio online for free.

Making your dreams come true

Generate amazing AI music from text using Stable Audio.

Easy to use.
An easy-to-use interface for creating music with the recently released 907M-parameter U-Net based on the model used in Moûsai.

High-quality music.
It can create high-quality music of anything you can imagine. Just type in a text prompt and hit Generate.
Fast generation.
The flagship Stable Audio model is able to render 95 seconds of stereo audio at a 44.1 kHz sample rate in less than one second on an NVIDIA A100 GPU.
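To put that speed in perspective, here is a quick back-of-the-envelope calculation (illustrative only) of how many raw audio samples 95 seconds of stereo 44.1 kHz audio contains:

```python
# Illustrative arithmetic: sample count for 95 s of stereo 44.1 kHz audio.
seconds = 95
sample_rate = 44_100   # samples per second, per channel
channels = 2           # stereo

total_samples = seconds * sample_rate * channels
print(total_samples)  # 8379000
```

So rendering in under a second means producing over eight million samples on an A100.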


We care about your privacy.

We don't collect or use ANY personal information, nor do we store your text or music.
No limitations on what you can enter.

Stable Audio

Just enter your prompt and click the generate button.

No code required to generate your music!

Frequently asked questions

If you can’t find what you’re looking for, email our support team and someone will get back to you.

How do I use Stable Audio?
Create custom-length music just by describing it. Powered by the latest audio diffusion models.
What is the copyright status of music generated with Stable Audio?
The area of AI-generated music and copyright is complex and varies from jurisdiction to jurisdiction.
Which model are you using?
We are using the Stable Audio model, which is a latent diffusion model architecture for audio conditioned on text metadata as well as audio file duration and start time, allowing for control over the content and length of the generated audio.
Where can I access the Stable Audio Online website?
What are Diffusion Models?
Diffusion models are a class of generative machine learning models that create new data by learning to reverse a gradual noising process: starting from pure noise, the model removes noise step by step until a coherent sample emerges.
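The forward ("noising") half of that process can be sketched in a few lines. This is a generic toy illustration, not Stable Audio's actual code; the linear beta schedule and the 1-D sine "signal" are assumptions chosen for clarity:

```python
import numpy as np

def make_alpha_bar(num_steps: int) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a simple linear beta schedule."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0): scaled signal plus scaled Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
x0 = np.sin(np.linspace(0, 2 * np.pi, 256))  # toy 1-D stand-in for audio

early = forward_diffuse(x0, 10, alpha_bar, rng)   # mostly signal
late = forward_diffuse(x0, 999, alpha_bar, rng)   # mostly noise
```

Training teaches a network to undo these steps; generation then runs the learned denoiser from pure noise back to a clean sample.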
What is the architecture of the Stable Audio model?
The diffusion model for Stable Audio is a 907M parameter U-Net based on the model used in Moûsai. It uses a combination of residual layers, self-attention layers, and cross-attention layers to denoise the input conditioned on text and timing embeddings. Memory-efficient implementations of attention were added to the U-Net to allow the model to scale more efficiently to longer sequence lengths.
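The self- and cross-attention layers mentioned above are both built from the same scaled dot-product attention primitive. The following is a minimal single-head sketch (not Stable Audio's implementation; shapes and data are toy assumptions) showing the difference: self-attention attends within the audio-latent sequence, while cross-attention lets those latents attend to the text-conditioning sequence:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
latents = rng.standard_normal((16, 8))  # toy audio-latent sequence (16 steps, dim 8)
text = rng.standard_normal((4, 8))      # toy text-conditioning sequence (4 tokens)

self_out = attention(latents, latents, latents)  # self-attention: Q, K, V from latents
cross_out = attention(latents, text, text)       # cross-attention: K, V from text
print(self_out.shape, cross_out.shape)  # (16, 8) (16, 8)
```

Memory-efficient attention variants compute the same result without materializing the full score matrix, which is what lets the U-Net scale to long audio sequences.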
© 2023 Stable Audio Online. All rights reserved.