In recent years, generative AI models, including large language models, have made great progress; the release of ChatGPT in particular let everyone see the appeal of large language models. Whether in computer vision or in NLP, large models that generate all kinds of images and videos from text descriptions, perform machine translation, generate text, and more have developed at an unexpected pace. But when it comes to audio, progress always seems to lag a little behind. Is it possible to use artificial intelligence to synthesize different kinds of music or sound effects?
Meta's open-source AudioCraft framework consists of three models: MusicGen, AudioGen, and EnCodec.
MusicGen: trained on Meta-owned and specifically licensed music, it generates music from text entered by the user.
AudioGen: trained on public sounds, it generates audio sound effects from the user's text input.
EnCodec: a decoder that produces higher-quality music with fewer artifacts, working much like audio compression technology. EnCodec is a lossy neural codec trained specifically to compress any type of audio and reconstruct the original signal with high fidelity, as the sketch below illustrates.
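To make the compress-and-reconstruct idea concrete, here is a minimal round-trip sketch using the EncodecModel published in Hugging Face's transformers library. The facebook/encodec_24khz checkpoint and the silent dummy input are illustrative assumptions, not code from the article itself:

import numpy as np
from transformers import AutoProcessor, EncodecModel

# Load the 24 kHz EnCodec checkpoint and its matching processor.
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# One second of (silent) dummy mono audio at the model's sampling rate.
raw_audio = np.zeros(processor.sampling_rate, dtype=np.float32)
inputs = processor(raw_audio=raw_audio, sampling_rate=processor.sampling_rate, return_tensors="pt")

# Compress the waveform into discrete codes, then reconstruct it from those codes.
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]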
The AudioCraft family of models can produce high-quality audio with long-term consistency, and it can be interacted with easily through a UI. AudioCraft simplifies the overall design of audio generation models, and we can use the open-source code directly to generate music.
%cd /content
!git clone https://github.com/facebookresearch/audiocraft
%cd /content/audiocraft
!pip install -r requirements.txt
!python -m demos.musicgen_app --share
Running the code above launches a visual UI; we only need to enter the corresponding text in the input box, and we can use the model to generate music.
To make it easier for developers to use audiocraft, the model has been open-sourced, and we can use the open-source code directly for music synthesis.
python3 -m pip install -U git+https://github.com/facebookresearch/audiocraft

from audiocraft.models import musicgen
from audiocraft.utils.notebook import display_audio
import torch

model = musicgen.MusicGen.get_pretrained('medium', device='cuda')
model.set_generation_params(duration=8)
First we need to install audiocraft with pip and import the musicgen module from audiocraft.models.
Here we use musicgen.MusicGen.get_pretrained to load the pre-trained model. When this function runs, it automatically checks whether the model already exists in the project folder and downloads it automatically if it does not.
Downloading state_dict.bin: 100% 3.68G/3.68G [03:42<00:00, 19.4MB/s]
Downloading (…ve/main/spiece.model: 100% 792k/792k [00:00<00:00, 10.5MB/s]
Downloading (…lve/main/config.json: 100% 1.21k/1.21k [00:00<00:00, 45.8kB/s]
Downloading model.safetensors: 100% 892M/892M [00:10<00:00, 46.2MB/s]
Downloading (…ssion_state_dict.bin: 100% 236M/236M [03:45<00:00, 1.05MB/s]

res = model.generate([
    'crazy EDM, heavy bang',
    'classic reggae track with an electronic guitar solo',
    'lofi slow bpm electro chill with organic samples',
    'rock with saturated guitars, a heavy bass line and crazy drum break and fills.',
    'earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves',
], progress=True)
display_audio(res, 32000)
Once the model has finished downloading, we can use the model.generate function. You can pass in several text prompts at a time, and the model will automatically generate one audio clip per prompt; finally, we can play back or save the generated files.
Text prompt: pop dance track with catchy melodies, tropical percussions, and upbeat rhythms, perfect for the beach
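For a single prompt like this one, the flow is the same, and audiocraft's audio_write helper can save the result straight to disk. In the sketch below, the output name beach_track is just an illustrative choice:

from audiocraft.data.audio import audio_write

# Generate one clip from the single text prompt above.
res = model.generate(['pop dance track with catchy melodies, tropical percussions, '
                      'and upbeat rhythms, perfect for the beach'], progress=True)

# Save the clip as a wav file; loudness normalization avoids clipping.
audio_write('beach_track', res[0].cpu(), model.sample_rate, strategy="loudness")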
Of course, this model has also been published in Hugging Face's transformers library, so we can run it directly with transformers as well.
pip install git+https://github.com/huggingface/transformers.git

from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)
audio_values = model.generate(**inputs, max_new_tokens=256)
Here we don't need to install audiocraft; instead we install the transformers library and import the relevant MusicGen classes from it. We then load the model files, enter the text describing the music we want to generate, and finally call the model.generate function to produce the music file.
from IPython.display import Audio

sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].numpy(), rate=sampling_rate)

import scipy

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
After the music file has been generated, we can use the functions above to play it back or store it, which makes later processing convenient. Of course, the code above is the MusicGen implementation of music generation; the implementations for AudioGen and EnCodec can be found in the GitHub source code.
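As a starting point, AudioGen's interface in audiocraft mirrors MusicGen's. The sketch below follows the pattern from the repository's documentation; the facebook/audiogen-medium checkpoint name and the example prompts are assumptions to verify against the source:

from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# Load the pre-trained AudioGen model and set the clip length in seconds.
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects from text descriptions.
wav = model.generate(['dog barking', 'sirens of an emergency vehicle'])

# Save each clip as a wav file with loudness normalization.
for idx, one_wav in enumerate(wav):
    audio_write(f'audiogen_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")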