Last year, developers Seth Forsgren and Hayk Martiros announced $4 million in funding for their music generation app called Riffusion. This application can create music using an innovative approach that has attracted widespread attention and has even been featured in the scientific literature on the topic.
Riffusion uses the “Stable Diffusion” deep learning model to generate music from text prompts, transforming words into spectrograms that are then converted into sound. This has led millions of users to try the app, which has been cited in research by major tech companies such as Meta, Google, and ByteDance.
Forsgren and Martiros state: “Our goal is to make music creation accessible to everyone through AI technology, allowing anyone to produce original tracks and express themselves musically in a simple and intuitive way.”
But how does Generative AI applied to Riffusion’s music work? And more generally, how do Generative AIs create songs, sounds, and melodies?
Riffusion: a new way to create music?
Originally launched as a hobby project about two years ago, Riffusion quickly gained global attention thanks to its unusual approach: it creates music by generating images of audio (spectrograms) rather than by producing audio waveforms directly. The curiosity and interest generated by Riffusion attracted not only users but also investors.
Recently, Forsgren and Martiros announced they closed a $4 million seed funding round led by Greycroft, with participation from South Park Commons and Sky9. This funding represents a significant step toward the commercialization of Riffusion, which is now supported and endorsed, among others, by the well-known music duo The Chainsmokers.
Alongside the funding, Riffusion launched a new, free, and improved version of the app. This version lets users type a text description and a musical style to generate custom, shareable “riffs.” “The new Riffusion lets anyone create original music through short, shareable audio clips,” Forsgren explained in an interview with TechCrunch. “From aspiring musicians to anyone who just wants to say ‘good morning’ in a creative way, riffs are a new form of expression and communication that drastically lower the barriers to music creation.”
This development not only represents a significant technological advancement but also highlights the growing interest and investment in generative AI applied to music. With continued financial backing and technological evolution, Riffusion could redefine how we conceive and create music, making music production accessible to an ever-wider audience.
Between images and sounds: an innovative approach
So how exactly does Riffusion work? This innovative music generation application is based on Generative AI, using a Deep Learning model called “Stable Diffusion.” Released in 2022, Stable Diffusion is designed to generate high‑quality images from textual descriptions, and its code was made public, allowing other researchers to modify and adapt it for various purposes.
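To give a feel for how Stable Diffusion is normally used, here is a minimal sketch of text-to-image generation with the Hugging Face diffusers library. The checkpoint name, prompt, and settings are illustrative assumptions and are not part of Riffusion itself.

```python
# Minimal sketch: generating an image from a text prompt with Stable Diffusion
# via the Hugging Face "diffusers" library. Checkpoint and prompt are examples only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # one publicly released Stable Diffusion checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                  # a GPU is strongly recommended for diffusion models

# The pipeline turns a text description into an image.
image = pipe("a watercolor painting of a saxophone").images[0]
image.save("saxophone.png")
```

The key point for Riffusion is that nothing in this pipeline cares what the image depicts: if the model is fine-tuned on spectrogram images, the same text-to-image machinery produces spectrograms instead of pictures.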
Stable Diffusion was trained on a massive dataset of roughly five billion image‑caption pairs, which is what enables it to understand text prompts and generate detailed images from them. Seth Forsgren and Hayk Martiros saw the potential of Stable Diffusion and decided to adapt it to create spectrogram images: graphical representations of a sound’s intensity over time and frequency.
A spectrogram is like a “photograph” of sound. The horizontal axis (x‑axis) represents time, the vertical axis (y‑axis) represents frequency, and the color of each pixel indicates the sound amplitude at that frequency and moment in time. Audio can be converted into a spectrogram using a mathematical technique called the Short‑Time Fourier Transform. The process can also be run in reverse (the phase information discarded by a magnitude spectrogram has to be estimated), which is how Forsgren and Martiros turn the spectrograms generated by Stable Diffusion back into audio.
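As a rough illustration of this round trip, here is a minimal sketch using the librosa library. The file name and the STFT parameters are assumptions for the example, not Riffusion’s actual settings.

```python
# Minimal sketch: audio -> spectrogram -> audio with librosa.
# File name and STFT parameters are illustrative, not Riffusion's exact configuration.
import numpy as np
import librosa
import soundfile as sf

audio, sr = librosa.load("input.wav", sr=44100)

# Short-Time Fourier Transform: one axis is time, the other frequency;
# the magnitude of each complex value is the amplitude at that (time, frequency) bin.
stft = librosa.stft(audio, n_fft=2048, hop_length=512)
spectrogram = np.abs(stft)          # magnitude spectrogram, the "photograph" of the sound

# Going back: with only the magnitude, the phase must be estimated,
# for example with the Griffin-Lim algorithm.
reconstructed = librosa.griffinlim(spectrogram, n_fft=2048, hop_length=512)
sf.write("reconstructed.wav", reconstructed, sr)
```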
Riffusion thus enables high‑quality audio generation from spectrograms. Thanks to this innovation, Riffusion can take a user‑provided text description, such as “a jazz melody with piano and saxophone,” and turn it into a spectrogram. That spectrogram is then converted into an actual musical track.
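Putting the pieces together, a heavily simplified sketch of that pipeline might look like the following. The checkpoint name, the pixel-to-amplitude mapping, and the Griffin-Lim settings are assumptions made for illustration; the real project ships its own, more careful spectrogram-to-audio conversion.

```python
# Simplified sketch of the Riffusion idea: text prompt -> spectrogram image -> audio.
# Checkpoint name, pixel-to-amplitude mapping, and parameters are illustrative assumptions.
import numpy as np
import torch
import librosa
import soundfile as sf
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",   # assumed: a Stable Diffusion checkpoint fine-tuned on spectrograms
    torch_dtype=torch.float16,
).to("cuda")

# 1. Generate a spectrogram image from a text prompt.
image = pipe("a jazz melody with piano and saxophone").images[0]

# 2. Interpret the image as a magnitude spectrogram
#    (grayscale intensity ~ amplitude; a crude stand-in for the real mapping).
pixels = np.array(image.convert("L"), dtype=np.float32) / 255.0
magnitude = np.flipud(pixels)         # put low frequencies at the bottom row

# 3. Estimate the missing phase with Griffin-Lim and write out a short audio clip.
audio = librosa.griffinlim(magnitude, n_iter=64, hop_length=512)
sf.write("riff.wav", audio, 44100)
```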
The result is a short audio clip, called a “riff,” which can be shared publicly or with friends. This approach makes music creation accessible to everyone, regardless of musical skill level, allowing anyone to express themselves creatively through music.
Conclusion
Riffusion is a fascinating example of how generative AI can revolutionize music creation, making it accessible to anyone. The innovative use of spectrograms generated by Stable Diffusion to create music demonstrates the immense potential of this technology.
At Social Thingum, with our experience in developing AI algorithms based on Natural Language Processing and image analysis, we recognize the importance and innovation of these technological developments. To stay updated on the latest in Innovation and Artificial Intelligence and to understand how algorithms are shaping today’s world, continue following us on LinkedIn.
Learn More
https://www.ilsoftware.it/riffusion-creare-tracce-audio-con-l-intelligenza-artificiale_25375/

