In this post, I will present a project I have been working on: a Python script that automatically generates music videos using artificial intelligence techniques and audio processing. Throughout the article, I will break down the code into sections to explain how it works and how each component contributes to the creation of these videos.
The code was run in Google Colaboratory and can be accessed at the following link:
Install Libraries
In the first code segment, the necessary libraries and tools are installed to run the script:
- imagemagick: this tool is used to process and manipulate images in different formats. The ImageMagick policy file (`/etc/ImageMagick-6/policy.xml`) is modified to allow reading and writing of images.
- Several PyTorch libraries are installed with specific versions:
  - `torch`: the main PyTorch package, a Python library for deep learning used in this project to implement artificial intelligence algorithms.
  - `torchvision`: contains utilities for working with images and applying transformations to them, as well as pretrained computer vision models.
  - `torchaudio`: provides tools for working with audio data and applying audio transformations.
  - `torchtext`: used to process and manipulate text data.
  - `torchdata`: facilitates the loading and preprocessing of datasets.
- The `xformers` library is installed from a specific wheel. This library contains implementations of optimized and efficient transformers for natural language processing and computer vision tasks.
- Additional libraries are installed:
  - `diffusers`: facilitates the implementation of stable diffusion models, which are used in this project to generate images.
  - `transformers`: contains implementations of transformer-based natural language processing models, such as BERT and GPT.
  - `ftfy`: a library for cleaning and normalizing text, fixing common encoding and formatting errors.
  - `pydub`: allows manipulation of audio files in different formats.
  - `accelerate`: provides an API for running deep learning models on multiple devices and hardware accelerators.
- The `matchering` library is installed, which is used to process and master audio files, adjusting the sound and quality of the automatically generated music files.
- Finally, `pytube` is installed, a library that allows downloading YouTube videos in Python.
This code segment is responsible for preparing the working environment, ensuring that all necessary dependencies are installed before starting to work with the main script.
!apt install imagemagick -qq > /dev/null
!sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml
!pip install -q torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 torchtext==0.14.1 torchdata==0.5.1 --extra-index-url https://download.pytorch.org/whl/cu116 -U
!pip install -q https://github.com/camenduru/stable-diffusion-webui-colab/releases/download/0.0.17/xformers-0.0.17+b6be33a.d20230315-cp39-cp39-linux_x86_64.whl
!pip install -q -U diffusers transformers ftfy pydub accelerate
!pip install -q matchering
!pip install -q pytube
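The `sed` command above replaces every `none` rights value in the ImageMagick policy with `read,write`, so moviepy can read and write images through ImageMagick. As a minimal illustration of the same substitution in Python, applied to a hypothetical sample policy line (not the full Colab file):

```python
import re

# A sample line resembling an entry in /etc/ImageMagick-6/policy.xml
# (hypothetical excerpt, for illustration only)
policy_line = '<policy domain="path" rights="none" pattern="@*" />'

# The same substitution the sed one-liner performs: none -> read,write
patched = re.sub(r"none", "read,write", policy_line)
print(patched)
```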
After installation, restart the runtime: Runtime > Restart runtime.
Import Libraries
The necessary libraries are imported for processing images, audio, and using deep learning algorithms. numpy and PIL (Python Imaging Library) are used to work with images and numerical arrays, while pydub, scipy.io.wavfile, and torchaudio are used to manipulate and process audio files. matchering is used for audio mastering. Finally, torch and diffusers provide the tools to implement artificial intelligence models, such as stable diffusion in this case.
import numpy as np
from PIL import Image
import pydub
from scipy.io import wavfile
import io
import typing as T
from pydub import AudioSegment
from IPython.display import Audio, display
import matchering as mg
import torch
from diffusers import StableDiffusionPipeline
import torchaudio
The next step is responsible for loading the model to automatically generate music:
- A variable `model_id2` is defined containing the path to the pretrained model “riffusion/riffusion-model-v1”. Riffusion is an application based on Stable Diffusion for generating music in real time. The model is specifically designed to work with spectrograms, which allows generating music of different styles and genres.
- A `pipe2` object is created using the `StableDiffusionPipeline` class. The `from_pretrained` function loads the pretrained model specified by `model_id2` and sets the data type to `torch.float16`. Using the half-precision data type `float16` reduces memory consumption and improves the performance of the music generation process.
- Finally, the `pipe2` object is moved to the GPU using the `to("cuda")` method. This allows the model to run on the GPU, taking advantage of its parallel processing capability and accelerating music generation tasks.
model_id2 = "riffusion/riffusion-model-v1"
pipe2 = StableDiffusionPipeline.from_pretrained(model_id2, torch_dtype=torch.float16)
pipe2 = pipe2.to("cuda")
The following functions work together to generate music from a pretrained Riffusion model, converting spectrograms into audio files and vice versa.
- `get_music(audiofilename, prompt, musicAI_indx, duration=5)`: generates music from the pretrained Riffusion model. It takes the audio filename, the text input (prompt), the model index, and the desired duration of the music. It creates a spectrogram and then converts it into an audio file in WAV and MP3 format, returning the spectrogram and the name of the MP3 file.
- `wav_bytes_from_spectrogram_image(image: Image.Image) -> T.Tuple[io.BytesIO, float]`: takes a spectrogram image as input and returns an `io.BytesIO` object containing the reconstructed WAV audio data and the duration of the audio in seconds. The reconstruction uses the inverse Mel transform and the Griffin-Lim algorithm.
- `spectrogram_from_image(image: Image.Image, max_volume: float = 50, power_for_image: float = 0.25) -> np.ndarray`: takes a spectrogram image as input and returns a spectrogram magnitude array. It performs a series of operations on the image, such as inverting it, rescaling to the maximum volume, and applying an inverse power curve.
- `waveform_from_spectrogram(...) -> np.ndarray`: takes a spectrogram magnitude array as input and returns a reconstructed waveform, using the inverse Mel transform and the Griffin-Lim algorithm to approximate the phase.
- `mp3_bytes_from_wav_bytes(wav_bytes: io.BytesIO) -> io.BytesIO`: takes an `io.BytesIO` object containing WAV data as input and returns an `io.BytesIO` object containing the converted MP3 data, using the `pydub` library for the conversion.
#@title functions
def get_music(audiofilename, prompt, musicAI_indx, duration=5):
    mp3file_name = f"{audiofilename}.mp3"
    wavfile_name = f"{audiofilename}.wav"
    if musicAI_indx == 0:
        if duration == 5:
            width_duration = 512
        else:
            width_duration = 512 + ((int(duration) - 5) * 128)
        spec = pipe2(prompt, height=512, width=width_duration).images[0]
        print(spec)
        wav = wav_bytes_from_spectrogram_image(spec)
        with open(wavfile_name, "wb") as f:
            f.write(wav[0].getbuffer())
        # Convert to mp3, for the video merging function
        # (renamed from "wavfile" to avoid shadowing scipy.io.wavfile)
        wav_audio = AudioSegment.from_wav(wavfile_name)
        wav_audio.export(mp3file_name, format="mp3")
    return spec, mp3file_name
def wav_bytes_from_spectrogram_image(image: Image.Image) -> T.Tuple[io.BytesIO, float]:
    """
    Reconstruct a WAV audio clip from a spectrogram image. Also returns the duration in seconds.
    """
    max_volume = 50
    power_for_image = 0.25
    Sxx = spectrogram_from_image(image, max_volume=max_volume, power_for_image=power_for_image)

    sample_rate = 44100  # [Hz]
    clip_duration_ms = 5000  # [ms]

    bins_per_image = 512
    n_mels = 512

    # FFT parameters
    window_duration_ms = 100  # [ms]
    padded_duration_ms = 400  # [ms]
    step_size_ms = 10  # [ms]

    # Derived parameters
    num_samples = int(image.width / float(bins_per_image) * clip_duration_ms) * sample_rate
    n_fft = int(padded_duration_ms / 1000.0 * sample_rate)
    hop_length = int(step_size_ms / 1000.0 * sample_rate)
    win_length = int(window_duration_ms / 1000.0 * sample_rate)

    samples = waveform_from_spectrogram(
        Sxx=Sxx,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        num_samples=num_samples,
        sample_rate=sample_rate,
        mel_scale=True,
        n_mels=n_mels,
        max_mel_iters=200,
        num_griffin_lim_iters=32,
    )

    wav_bytes = io.BytesIO()
    wavfile.write(wav_bytes, sample_rate, samples.astype(np.int16))
    wav_bytes.seek(0)

    duration_s = float(len(samples)) / sample_rate

    return wav_bytes, duration_s
def spectrogram_from_image(
    image: Image.Image, max_volume: float = 50, power_for_image: float = 0.25
) -> np.ndarray:
    """
    Compute a spectrogram magnitude array from a spectrogram image.
    TODO(hayk): Add image_from_spectrogram and call this out as the reverse.
    """
    # Convert to a numpy array of floats
    data = np.array(image).astype(np.float32)

    # Flip Y and take a single channel
    data = data[::-1, :, 0]

    # Invert
    data = 255 - data

    # Rescale to max volume
    data = data * max_volume / 255

    # Reverse the power curve
    data = np.power(data, 1 / power_for_image)

    return data
def waveform_from_spectrogram(
    Sxx: np.ndarray,
    n_fft: int,
    hop_length: int,
    win_length: int,
    num_samples: int,
    sample_rate: int,
    mel_scale: bool = True,
    n_mels: int = 512,
    max_mel_iters: int = 200,
    num_griffin_lim_iters: int = 32,
    device: str = "cuda:0",
) -> np.ndarray:
    """
    Reconstruct a waveform from a spectrogram.
    This is an approximate inverse of spectrogram_from_waveform, using the Griffin-Lim algorithm
    to approximate the phase.
    """
    Sxx_torch = torch.from_numpy(Sxx).to(device)

    if mel_scale:
        mel_inv_scaler = torchaudio.transforms.InverseMelScale(
            n_mels=n_mels,
            sample_rate=sample_rate,
            f_min=0,
            f_max=10000,
            n_stft=n_fft // 2 + 1,
            norm=None,
            mel_scale="htk",
            max_iter=max_mel_iters,
        ).to(device)
        Sxx_torch = mel_inv_scaler(Sxx_torch)

    griffin_lim = torchaudio.transforms.GriffinLim(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        power=1.0,
        n_iter=num_griffin_lim_iters,
    ).to(device)

    waveform = griffin_lim(Sxx_torch).cpu().numpy()

    return waveform
def mp3_bytes_from_wav_bytes(wav_bytes: io.BytesIO) -> io.BytesIO:
    mp3_bytes = io.BytesIO()
    sound = pydub.AudioSegment.from_wav(wav_bytes)
    sound.export(mp3_bytes, format="mp3")
    mp3_bytes.seek(0)
    return mp3_bytes
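The docstring of `spectrogram_from_image` leaves the reverse mapping as a TODO. The pixel-to-amplitude conversion is just an invertible power curve, so the round trip can be checked numerically. A minimal sketch of that arithmetic, using the same default constants as above; `pixel_to_amplitude` and `amplitude_to_pixel` are illustrative helpers, not part of the original script:

```python
import math

MAX_VOLUME = 50.0       # matches max_volume in spectrogram_from_image
POWER_FOR_IMAGE = 0.25  # matches power_for_image

def pixel_to_amplitude(pixel: float) -> float:
    """Invert, rescale to max volume, reverse the power curve
    (the per-pixel math of spectrogram_from_image)."""
    data = 255.0 - pixel
    data = data * MAX_VOLUME / 255.0
    return math.pow(data, 1.0 / POWER_FOR_IMAGE)

def amplitude_to_pixel(amplitude: float) -> float:
    """Hypothetical inverse: apply the power curve, rescale, and re-invert."""
    data = math.pow(amplitude, POWER_FOR_IMAGE)
    data = data * 255.0 / MAX_VOLUME
    return 255.0 - data

# Round trip: every pixel value should map back to itself
for pixel in (0.0, 100.0, 200.0, 255.0):
    assert abs(amplitude_to_pixel(pixel_to_amplitude(pixel)) - pixel) < 1e-6
```

Note that white pixels (255) map to zero amplitude, which is why spectrogram images come out mostly dark where the signal is loud.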
Functions for Creating Audio from a List of Words
The following functions work together to generate music from a list of words, apply audio effects such as crossfade and fades, and improve audio quality through automatic mastering.
- `list_get_music(list_variables, texto_prompt, dur=5)`: takes a list of strings, a prompt, and the music duration. For each string in the list, it generates music using the `get_music` function with the specified prompt and duration. The generated music files are saved with names based on their index in the list.
- `list_display_audio(list_variables)`: takes a list of strings and plays the audio files previously generated by `list_get_music`. It uses `IPython.display.Audio` to play the audio files in MP3 format.
- `n_times(number_segments, n_times=3)`: takes the number of audio segments and the desired number of repetitions. It multiplies the repetitions of each audio segment and saves the results in WAV files.
- `n_times_segments(number_segments, n_times=3)`: takes the number of audio segments and the desired number of repetitions. It multiplies the repetitions of each audio segment and returns a list of `AudioSegment` objects with the repeated segments.
- `audio_segments_concat(songs_list, crossfade=2000, fade_in=2000, fade_out=3000, out_filename="full_song.wav")`: takes a list of `AudioSegment` objects, crossfade, fade-in, and fade-out values, and an output filename. It concatenates the audio segments with a crossfade, applies a fade in at the beginning and a fade out at the end, and saves the result in a WAV file.
- `crear_fades(seg_inicial=0, seg_final=6, fade_in=3000, fade_out=3000)`: takes the indices of the initial and final segments and the fade-in and fade-out values. It applies a fade in to the initial segment and a fade out to the final segment, and saves the results in WAV files.
- `mejoras_audio(number_segments=7, master_reference='audio.mp4')`: takes the number of audio segments and a mastering reference file. It converts each audio segment from mono to stereo, applies automatic mastering using the reference file, saves the results in WAV files, and removes the temporary files created during the process.
import os

def list_get_music(list_variables: list,
                   texto_prompt: str,
                   dur: int = 5) -> None:
    """Take a list of strings and a prompt and generate sounds with riffusion."""
    for number, str_var in enumerate(list_variables):
        prompt = f"{str_var} {texto_prompt}"
        get_music(f"{number}", prompt=prompt, musicAI_indx=0, duration=dur)

def list_display_audio(list_variables: list) -> None:
    """Display audio of sounds generated by riffusion."""
    for number, _ in enumerate(list_variables):
        display(Audio(f'{number}.mp3'))

def n_times(number_segments: int,
            n_times: int = 3) -> None:
    """Multiply the repetitions of the audio files and then save them."""
    for number in range(number_segments):
        song = AudioSegment.from_wav(f'{number}.wav')
        song = song * n_times
        song.export(f'{number}_times.wav', format='wav')

def n_times_segments(number_segments: int,
                     n_times: int = 3) -> list:
    """Multiply the repetitions of the segments and then concatenate them in a list."""
    songs_list = []
    for number in range(number_segments):
        song = AudioSegment.from_wav(f'{number}.wav')
        song = song * n_times
        songs_list.append(song)
    return songs_list

def audio_segments_concat(songs_list: list,
                          crossfade: int = 2000,
                          fade_in: int = 2000,
                          fade_out: int = 3000,
                          out_filename: str = "full_song.wav"):
    full_song = songs_list[0]
    for song_segment in songs_list[1:]:
        full_song = full_song.append(song_segment, crossfade=crossfade)
    # fade in and fade out
    full_song = full_song.fade_in(fade_in).fade_out(fade_out)
    # store song
    full_song.export(out_filename, format='wav')
    print(f"File {out_filename} stored")

def crear_fades(seg_inicial: int = 0,
                seg_final: int = 6,
                fade_in: int = 3000,
                fade_out: int = 3000):
    inicial = AudioSegment.from_wav(f'{seg_inicial}_times.wav')
    inicial = inicial.fade_in(fade_in)
    inicial.export(f'{seg_inicial}_times.wav', format='wav')
    final = AudioSegment.from_wav(f'{seg_final}_times.wav')
    final = final.fade_out(fade_out)
    final.export(f'{seg_final}_times.wav', format='wav')

def mejoras_audio(number_segments: int = 7,
                  master_reference: str = 'audio.mp4'):
    """Convert audio from mono to stereo and then apply automatic mastering."""
    for number in range(number_segments):
        song = AudioSegment.from_wav(f'{number}_times.wav')
        # convert to stereo
        song = song.set_channels(2)
        song.export(f'{number}_stereo.wav', format='wav')
        mg.process(
            # The track you want to master
            target=f"{number}_stereo.wav",
            # Some "wet" reference track
            reference=master_reference,
            # Where and how to save your results
            results=[mg.pcm16(f"{number}_master.wav")],
        )
        os.remove(f"{number}_stereo.wav")
        os.remove(f"{number}_times.wav")
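In `audio_segments_concat`, each `append(..., crossfade=...)` overlaps consecutive segments, so the final track length is the sum of the segment lengths minus one crossfade per join. A small sketch of that arithmetic, with durations in milliseconds (`concat_duration_ms` is an illustrative helper, not part of the script):

```python
def concat_duration_ms(segment_ms: list, crossfade_ms: int = 2000) -> int:
    """Expected length of n segments joined with a fixed crossfade:
    the sum of the lengths minus one crossfade per join."""
    return sum(segment_ms) - crossfade_ms * (len(segment_ms) - 1)

# Seven 30-second segments, as produced in this post (10 s clips repeated 3 times)
segments = [30_000] * 7
print(concat_duration_ms(segments))  # 198000 ms, i.e. 3 min 18 s
```

This is worth keeping in mind when planning the full-song length: with the default 2-second crossfade, each join trims two seconds off the naive total.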
Audio for the Same Word
Generating audio for the same word
# List of generator variables
word = "learning" #@param {type:"string"}
prompt = "lofi programming with python" #@param {type:"string"}
palabra = (f"{word} "*7).split()
Generate audio files with a variable list and a fixed prompt, with a set duration of 10 seconds, which is the limit allowed by the free version of Google Colab.
list_get_music(palabra,prompt,10)
Display the generated audio files
list_display_audio(palabra)
Increase the duration of the generated audio files by multiplying by 3 to get 30 seconds
n_times(len(palabra), n_times=3)
Apply fade in and fade out to the generated audio files to create smooth transitions between segments.
crear_fades(0,6,3000,3000)
Download a reference audio file for automatic mastering
from pytube import YouTube as YT

video = YT('https://www.youtube.com/watch?v=NMKsn5puEsQ', use_oauth=True, allow_oauth_cache=False)
video.streams.get_by_itag(140).download()
Automatically master the audio files using the Python library Matchering
mejoras_audio(len(palabra), 'Bookoo Bread Co Instrumental - Scallops Hotel.mp4')
Create Images for the Video
This code block is necessary to create the images that will go in the video:
- It defines a variable `img_model_id` containing the path to the pretrained model “runwayml/stable-diffusion-v1-5”. This model is designed to work with images and is based on Stable Diffusion, which allows generating high-quality images with a wide variety of styles and themes.
- It creates an `img_pipe` object using the `StableDiffusionPipeline` class. The `from_pretrained` function loads the pretrained model specified by `img_model_id` and sets the data type to `torch.float16`. Using the half-precision data type `float16` reduces memory consumption and improves the performance of the image generation process.
- Next, the `img_pipe` object is moved to the GPU using the `to("cuda")` method. This allows the model to run on the GPU, taking advantage of its parallel processing capability and accelerating image generation tasks.
- Finally, memory-efficient attention is enabled for the transformers within the model using the `enable_xformers_memory_efficient_attention()` method. This optimization reduces GPU memory consumption during inference, which is useful for generating high-resolution images or running the model on devices with limited resources.
img_model_id = "runwayml/stable-diffusion-v1-5"
img_pipe = StableDiffusionPipeline.from_pretrained(img_model_id, torch_dtype=torch.float16, revision="fp16")
img_pipe = img_pipe.to("cuda")
img_pipe.enable_xformers_memory_efficient_attention()
This code block defines a function called create_images that takes as arguments a list of strings (list_variables) and a text prompt (text_prompt). The function aims to generate images from the elements in the variable list using the previously loaded stable diffusion model.
The function performs the following actions:
- It iterates over the variable list and its indices using `enumerate(list_variables)`.
- For each variable in the list, it creates a complete prompt by concatenating the variable and the provided `text_prompt`. The complete prompt also includes additional phrases that guide the model toward high-quality images, such as “artstation hall of fame gallery” and “editors choice”.
- It uses the previously loaded `img_pipe` object to generate an image from the complete prompt. The call returns an object containing a list of images, and the first image is selected with `images[0]`.
- It saves the generated image to a PNG file using the variable's index in the list as the filename (for example, “0.png”, “1.png”, etc.).
In summary, this function creates images using the stable diffusion model from a list of variables and a text prompt, and saves the generated images to PNG files.
def create_images(list_variables: list,
                  text_prompt: str) -> None:
    for number, str_var in enumerate(list_variables):
        prompt = f"{str_var} {text_prompt}"
        image = img_pipe(prompt + ", artstation hall of fame gallery, editors choice, #1 digital painting of all time, most beautiful image ever created, emotionally evocative, greatest art ever made, lifetime achievement magnum opus masterpiece, the most amazing breathtaking image with the deepest message ever painted, a thing of beauty beyond imagination or words").images[0]
        image.save(f"{number}.png")
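The prompt that `create_images` sends to the pipeline is simply the list element, the user prompt, and a fixed quality-boosting suffix concatenated together. A minimal sketch of that string assembly (the suffix here is shortened, and `build_prompt` is an illustrative helper, not part of the script):

```python
def build_prompt(str_var: str,
                 text_prompt: str,
                 suffix: str = ", artstation hall of fame gallery, editors choice") -> str:
    """Mirror the concatenation create_images performs before calling img_pipe."""
    return f"{str_var} {text_prompt}{suffix}"

print(build_prompt("learning", "lofi programming with python"))
# -> learning lofi programming with python, artstation hall of fame gallery, editors choice
```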
Create images for the video
create_images(palabra*2, prompt)
Display the generated images
from IPython.display import Image as display_image

imagenes2 = [f"{x}.png" for x in range(7)]
for img in imagenes2:
    display(display_image(filename=img))
Create the Final Video
This code creates a 1900x1080 video from a series of images and audio segments, applying transition and resizing effects, and adding a black background and text to the final video.
It creates a base black image of 1900x1080 pixels and saves it as ‘black.jpg’. This image will be used as the background in the final video so that it has high resolution.
It imports the necessary libraries for video and audio processing, such as `moviepy.editor`, `pathlib.Path`, and `moviepy.video.fx.resize`.
The `creacion_video_segmentos(numero_segmentos)` function takes the `numero_segmentos` argument indicating the number of audio and image segments to use. The function creates a list of video clips, where each clip consists of an image and its corresponding audio segment. The list of video clips is returned at the end of the function.
The `Final_concatenar_crossfade(lista_videos, custom_padding=1, nombre='lofi')` function takes a list of video clips, a custom padding value, and a name for the final video. It creates the final video through the following actions:
a. It applies a slide transition to each video clip in the list and concatenates the clips with negative padding, which creates a crossfade effect between the clips.
b. It resizes the resulting video to 1080x1080 pixels and places the previously created black background behind this video.
c. It adds text with the video name at a specific position.
d. It combines the video and text into a single video object and sets the duration of the final video.
e. It writes the final video to an MP4 file with the provided name, using 24 frames per second and the ‘aac’ audio codec.
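Because `concatenate(slided_clips, padding=-1)` shifts each subsequent clip one second earlier, clip i starts one second before the previous clip ends, and the total duration loses one second per join. A sketch of that timeline arithmetic (durations in seconds; `clip_start_times` is an illustrative helper, not part of the script):

```python
def clip_start_times(durations: list, padding: float = -1.0) -> list:
    """Start time of each clip when concatenated with a (negative) padding,
    mirroring the timeline moviepy builds for concatenate(..., padding=...)."""
    starts, t = [], 0.0
    for d in durations:
        starts.append(t)
        t += d + padding  # negative padding pulls the next clip earlier
    return starts

durations = [30.0] * 7          # seven 30-second image+audio clips
starts = clip_start_times(durations)
total = starts[-1] + durations[-1]
print(starts)   # [0.0, 29.0, 58.0, 87.0, 116.0, 145.0, 174.0]
print(total)    # 204.0 seconds
```

The one-second overlaps are exactly where the slide-out transition of one clip plays over the start of the next.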
# create base black image
from PIL import Image
Image.new('RGB', (1900, 1080), color=(0, 0, 0)).save('black.jpg')

from moviepy.editor import *
from pathlib import Path
from moviepy.video.fx.resize import resize
def creacion_video_segmentos(numero_segmentos: int) -> list:
    video_clips = []
    # pair each image with its mastered audio segment
    for number in range(numero_segmentos):
        clip_audio = AudioFileClip(f"{number}_master.wav").set_start(0)
        clip_video = ImageClip(f"{number}.png", duration=clip_audio.duration)
        clip_video = clip_video.set_audio(clip_audio)
        video_clips.append(clip_video)
    return video_clips

def Final_concatenar_crossfade(lista_videos: list,
                               custom_padding: int = 1,
                               nombre: str = 'lofi'):
    slided_clips = [CompositeVideoClip([clip.fx(transfx.slide_out, custom_padding, 'bottom')])
                    for clip in lista_videos]
    video_slides = concatenate(slided_clips, padding=-1)
    video_slides = resize(video_slides, width=1080, height=1080)
    black_image = ImageClip("black.jpg")
    final = CompositeVideoClip([black_image, video_slides.set_position("center")])
    final = final.set_duration(video_slides.duration)
    # create a TextClip object with the song name
    text_clip = TextClip(nombre,
                         font="Amiri-regular",
                         fontsize=40,
                         color='grey40')
    # set the position of the text clip
    text_clip = (text_clip
                 .set_position((100, 900))
                 .set_start(0))
    # FINAL ASSEMBLY (compose the text over the video once)
    final_video_audio = CompositeVideoClip([final, text_clip]).set_duration(video_slides.duration)
    final_video_audio.write_videofile(f"{nombre}.mp4",
                                      fps=24,
                                      threads=2,
                                      audio_codec='aac',
                                      audio=True)
Code to create each of the video segments, pairing every image with its mastered audio file
lista_videos = creacion_video_segmentos(7)
Final creation of the video by concatenating the previously generated segments
# Final Video Creation
Final_concatenar_crossfade(lista_videos, 1, word)
In this post, we have explored how to use stable diffusion models to generate music and images from a list of variables and a text prompt. We then used the MoviePy library to combine these images with corresponding audio segments and create a video with smooth transitions and resizing effects.
This process illustrates how it is possible to combine artificial intelligence techniques and video processing to create unique and personalized audiovisual products from individual elements such as images and audio segments.
Next Album
- These generated videos only have static images; the next album will feature videos with dynamic images.
- The Riffusion audio still has quality limitations; the next album will use a new audio generation tool such as https://github.com/facebookresearch/audiocraft
- MoviePy is a tool for automating the creation of videos.
