Bridging the gap between training on synthetic data and real data in TinyML.
Devices and components
Arduino GIGA display assembly
NVIDIA Jetson Nano Developer Kit
Software and tools
Edge Impulse Studio
Project description
Speech
Music
Noise
Generating text prompts using ChatGPT
I am creating a dataset of synthetic audio data for an audio classification task, which generates audio from text prompts. The first class in this dataset is background noise. Here is a list of examples for this class:
A hammer is hitting a wooden surface,
A noise of nature,
The sound of waves crashing on the shore,
A thunderstorm in the distance,
Traffic noise on a busy street,
The hum of an air conditioning unit,
Birds chirping in the morning,
The sound of a train passing
Could you please provide other 100 examples?
Woman delivering a motivational speech
Man conducting a job interview
Group discussion in a business meeting
Woman giving a lecture in a classroom
Man reading a poem aloud
Casual conversation at a park
Woman participating in a radio talk show
Man narrating a story
Group discussion at a social gathering
Woman giving instructions in a workshop
Man reciting a monologue
Casual conversation at a coffee shop
Woman conducting a podcast interview
Man providing commentary for a sports event
Group discussion at a book club meeting
Woman leading a team meeting
Man practicing a speech in front of a mirror
Casual conversation at a birthday party
Woman participating in a panel discussion
Man delivering a eulogy at a funeral
Group discussion at a community forum
A classical piano composition with a melancholic tone.
Upbeat jazz ensemble featuring saxophone, trumpet, and drums.
Electronic dance music (EDM) track with a pulsating beat.
Acoustic guitar solo playing a lively flamenco piece.
Ambient instrumental with soothing synthesizers and gentle percussion.
Rock ballad with powerful vocals and electric guitar solos.
Traditional Indian sitar and tabla duet.
African drum ensemble with rhythmic patterns and tribal chants.
Bossa nova jazz with smooth guitar and soft percussion.
Celtic folk song featuring violin, flute, and bodhran.
Reggae track with a laid-back groove and Caribbean influences.
Funky bassline-driven groove with brass instruments.
Minimalist piano piece reminiscent of Erik Satie.
Heavy metal guitar riff with intense drumming.
Up-tempo bluegrass with banjo, fiddle, and mandolin.
Synthwave track inspired by 80s electronic music.
Flamboyant Broadway musical number with vocals and orchestration.
Ambient chillout music with ethereal synthesizers and gentle beats.
Japanese traditional koto and shakuhachi composition.
Hip-hop beat with catchy samples and rap vocals.
Spanish flamenco fusion with a modern twist.
Progressive rock epic with intricate instrumental sections.
Caribbean steel drum ensemble playing a festive tune.
Traditional Irish pub song with accordion and tin whistle.
Upbeat ska with a lively horn section.
Raindrops tapping on a tin roof
Leaves rustling in the wind
Distant howling of wolves
Crackling of a fireplace
Cicadas buzzing on a summer night
Water flowing in a gentle stream
Creaking of a wooden ship at sea
Rustling of grass in a meadow
Purring of a cat
Sizzling of food on a barbecue
Chirping crickets in the evening
Drip-dropping of water in a cave
Buzzing bees in a garden
Distant rumble of a waterfall
Footsteps on a gravel path
Hooting of an owl in the night
Cracking of ice in a frozen lake
Rustling of pages in a book
Bubble popping in a pond
Hooves trotting on a dirt road
Bubbling of a hot spring
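The generation scripts further down read one prompt per line from a text file, so the lists returned by ChatGPT need a little cleanup first. The following is a minimal sketch of that step, not part of the original project: the helper names `clean_prompts` and `save_prompts` are hypothetical, and it assumes the reply was pasted as a numbered or bulleted list with stray trailing commas or periods.

```python
import re

def clean_prompts(raw_text):
    """Turn a pasted ChatGPT list into unique prompts, one per entry."""
    seen = set()
    prompts = []
    for line in raw_text.splitlines():
        # Drop list markers such as "12. ", "3) ", "- " or "* ".
        text = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line)
        # Drop trailing commas or periods left over from the list.
        text = text.strip().rstrip(",.").strip()
        key = text.lower()
        if text and key not in seen:  # skip blanks and case-insensitive repeats
            seen.add(key)
            prompts.append(text)
    return prompts

def save_prompts(prompts, path="descriptions.txt"):
    """Write one prompt per line, the layout the generation script reads."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(prompts) + "\n")

if __name__ == "__main__":
    raw = """1. Raindrops tapping on a tin roof
2. Leaves rustling in the wind,
- Purring of a cat.
3. leaves rustling in the wind
"""
    for prompt in clean_prompts(raw):
        print(prompt)
```

Calling `save_prompts(clean_prompts(raw))` would then produce a `descriptions.txt` in the working directory, one file per class.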
Generating datasets for audio classification using AudioCraft
MusicGen: Generates music based on text prompts.
AudioGen: Creates sound effects from simple descriptions.
EnCodec: Compresses and tokenizes audio data for efficient processing.
AI Performance: 275 TOPS
GPU: 2048-core NVIDIA Ampere architecture GPU with 64 Tensor cores
Processor: Arm® Cortex®-A78AE v8.2 64-bit 12-core processor
Memory: 64 GB LPDDR5 256-bit 204.8 GB/s
Storage: 64 GB eMMC 5.1
Power: 15 W - 60 W (with housing accessory)
Namespace(disable=[''], output='/tmp/autotag', packages=['audiocraft'], prefer=['local', 'registry', 'build'], quiet=False, user='dustynv', verbose=False)
-- L4T_VERSION=35.2.1 JETPACK_VERSION=5.1 CUDA_VERSION=11.8.89
-- Finding compatible container image for ['audiocraft']
dustynv/audiocraft:r35.4.1
+ docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /mnt/storage/Projects/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb dustynv/audiocraft:r35.4.1
allow 10 sec for JupyterLab to start @ http://192.168.0.104:8888 (password nvidia)
JupterLab logging location: /var/log/jupyter.log (inside the container)
sudo apt-get update
sudo apt-get install ffmpeg
import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write
import os
import torch
import gc

def purge():
    # Release cached GPU memory between chunks
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    gc.collect()

# Load descriptions from a text file (one prompt per line)
descriptions_file = 'descriptions.txt'  # Update with your file name
with open(descriptions_file, 'r') as file:
    descriptions = file.read().splitlines()

# Initialize the AudioGen model
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=2)  # generate 2 seconds.

# Create an output folder if it doesn't exist
output_folder = 'output'
os.makedirs(output_folder, exist_ok=True)

# Generate and save audio samples in chunks of 50 descriptions
chunk_size = 50
num_chunks = (len(descriptions) + chunk_size - 1) // chunk_size
for chunk_idx in range(num_chunks):
    start_idx = chunk_idx * chunk_size
    end_idx = (chunk_idx + 1) * chunk_size
    current_descriptions = descriptions[start_idx:end_idx]

    # Generate audio samples based on descriptions
    wav = model.generate(current_descriptions, progress=True)

    # Save generated audio samples to the output folder
    for idx, one_wav in enumerate(wav):
        # Save each sample as a WAV file (audio_write appends the .wav extension)
        output_file_path = os.path.join(output_folder, f'{start_idx + idx}')
        audio_write(output_file_path, one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

    # Purge memory after processing each chunk
    purge()

print(f'Audio samples saved in the "{output_folder}" folder.')
facebook/musicgen-small: A 300M model, for generating text to music only.
facebook/musicgen-medium: A 1.5B model, for generating text to music only.
facebook/musicgen-melody: A 1.5B model, for text to music and text+melody to music.
facebook/musicgen-large: A 3.3B model, for generating text to music only.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import os
import torch
import gc

def purge():
    # Release cached GPU memory between chunks
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    gc.collect()

# Load descriptions from a text file (one prompt per line)
descriptions_file = 'descriptions.txt'  # Update with your file name
with open(descriptions_file, 'r') as file:
    descriptions = file.read().splitlines()

# Initialize the MusicGen model
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=2)  # generate 2 seconds of music.
#wav = model.generate_unconditional(4)

# Create an output folder if it doesn't exist
output_folder = 'output'
os.makedirs(output_folder, exist_ok=True)

# Generate and save audio samples in chunks of 50 descriptions
chunk_size = 50
num_chunks = (len(descriptions) + chunk_size - 1) // chunk_size
for chunk_idx in range(num_chunks):
    start_idx = chunk_idx * chunk_size
    end_idx = (chunk_idx + 1) * chunk_size
    current_descriptions = descriptions[start_idx:end_idx]

    # Generate audio samples based on descriptions
    wav = model.generate(current_descriptions, progress=True)

    # Save generated audio samples to the output folder
    for idx, one_wav in enumerate(wav):
        # Save each sample as a WAV file (audio_write appends the .wav extension)
        output_file_path = os.path.join(output_folder, f'{start_idx + idx}')
        audio_write(output_file_path, one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

    # Purge memory after processing each chunk
    purge()

print(f'Audio samples saved in the "{output_folder}" folder.')
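Both scripts write files named 0.wav, 1.wav, and so on into the same `output` folder, so running them back to back overwrites the earlier class. One way around this, sketched below, is to copy each run into a dataset folder with the class name as a prefix. The helper `label_samples` and the folder layout are hypothetical. The `<label>.<name>.wav` pattern assumes the Edge Impulse uploader infers the label from the part of the file name before the first period, which is worth checking against the current Edge Impulse documentation.

```python
import shutil
import tempfile
from pathlib import Path

def label_samples(src_dir, dst_dir, label):
    """Copy WAV files from src_dir to dst_dir as '<label>.<original name>'."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for wav in sorted(src.glob("*.wav")):  # non-WAV files are ignored
        target = dst / f"{label}.{wav.name}"
        shutil.copy2(wav, target)
        copied.append(target.name)
    return copied

if __name__ == "__main__":
    # Demo on throwaway folders so no real data is touched.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "output"
        src.mkdir()
        for name in ("0.wav", "1.wav"):
            (src / name).write_bytes(b"")  # empty stand-ins for generated clips
        print(label_samples(src, Path(tmp) / "dataset", "noise"))
```

In practice this would be called once per class right after each generation run, for example `label_samples("output", "dataset", "music")`, before the next run reuses the `output` folder.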
Model training using the Edge Impulse platform
Speech class
Background noise class
Window size = 2000 ms
Window increase = 300 ms
Click Save Impulse.
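To see what these two settings mean for the generated clips, the arithmetic can be sketched as follows. This is a generic sliding-window count, not Edge Impulse's own code, and it assumes each clip is exactly as long as the `duration=2` passed to the generator. Real files may differ by a few milliseconds.

```python
def count_windows(sample_ms, window_ms=2000, increase_ms=300):
    """Number of full windows a sliding window cuts from one sample."""
    if sample_ms < window_ms:
        return 0  # too short for a single full window
    return (sample_ms - window_ms) // increase_ms + 1

if __name__ == "__main__":
    for length in (2000, 2300, 5000):
        print(length, "ms ->", count_windows(length), "window(s)")
```

Under that assumption a 2000 ms clip yields exactly one window, so the 300 ms window increase only comes into play for samples longer than the window, such as longer real-world recordings added later.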
Connect the Arduino GIGA R1 WiFi device to your computer using a USB cable.
/* Edge Impulse ingestion SDK
* Copyright (c) 2022 EdgeImpulse Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
// If your target is limited in memory remove this macro to save 10K RAM
#define EIDSP_QUANTIZE_FILTERBANK 0
/*
** NOTE: If you run into TFLite arena allocation issue.
**
** This may be due to dynamic memory fragmentation.
** Try defining "-DEI_CLASSIFIER_ALLOCATION_STATIC" in boards.local.txt (create
** if it doesn't exist) and copy this file to
** `<ARDUINO_CORE_INSTALL_PATH>/arduino/hardware/<mbed_core>/<core_version>/`.
**
** See
** (https://support.arduino.cc/hc/en-us/articles/360012076960-Where-are-the-installed-cores-located-)
** to find where Arduino installs cores on your machine.
**
** If the problem persists then there's not enough memory for this model and application.
*/
/* Includes ---------------------------------------------------------------- */
#include <PDM.h>
#include <Audio_classification_v1_inferencing.h>
#include "Arduino_GigaDisplay_GFX.h"
#define screen_size_x 480
#define screen_size_y 800
GigaDisplay_GFX display; // create the object
#define BLACK 0x0000
/** Audio buffers, pointers and selectors */
typedef struct {
    int16_t *buffer;
    uint8_t buf_ready;
    uint32_t buf_count;
    uint32_t n_samples;
} inference_t;
static inference_t inference;
static signed short sampleBuffer[2048];
static bool debug_nn = false; // Set this to true to see e.g. features generated from the raw signal
static const int frequency = 20000;
/**
* @brief Arduino setup function
*/
void setup()
{
    // put your setup code here, to run once:
    Serial.begin(115200);
    // comment out the below line to cancel the wait for USB connection (needed for native USB)
    while (!Serial);
    Serial.println("Edge Impulse Inferencing Demo");

    display.begin();
    display.fillScreen(BLACK);
    display.setTextSize(2);
    display.setRotation(1);
    delay(1000);

    // summary of inferencing settings (from model_metadata.h)
    ei_printf("Inferencing settings:\n");
    ei_printf("\tInterval: %.2f ms.\n", (float)EI_CLASSIFIER_INTERVAL_MS);
    ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);
    ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);
    ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0]));

    if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {
        ei_printf("ERR: Could not allocate audio buffer (size %d), this could be due to the window length of your model\r\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT);
        return;
    }
}
/**
* @brief Arduino main function. Runs the inferencing loop.
*/
void loop()
{
    ei_printf("Starting inferencing in 2 seconds...\n");
    delay(2000);
    ei_printf("Recording...\n");

    bool m = microphone_inference_record();
    if (!m) {
        ei_printf("ERR: Failed to record audio...\n");
        return;
    }
    ei_printf("Recording done\n");

    signal_t signal;
    signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
    signal.get_data = &microphone_audio_signal_get_data;
    ei_impulse_result_t result = { 0 };

    EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);
    if (r != EI_IMPULSE_OK) {
        ei_printf("ERR: Failed to run classifier (%d)\n", r);
        return;
    }

    // print the predictions
    ei_printf("Predictions ");
    ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",
        result.timing.dsp, result.timing.classification, result.timing.anomaly);
    ei_printf(": \n");
    for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        ei_printf(" %s: %.5f\n", result.classification[ix].label, result.classification[ix].value);
    }

    // pick the class with the highest score and show it on the display
    const char *resultLabel = getMax(result.classification);
    display.fillScreen(BLACK);
    Serial.println(resultLabel);
    //display.println(resultLabel);
    int artLength = 9; // Change this value based on the length of your ASCII art
    int centerX = (screen_size_x - (artLength * 10)) / 2; // 10 is the approximate width of each character in the font

    if (strcmp(resultLabel, "speech") == 0) {
        display.fillScreen(BLACK);
        // Display ASCII art for speech
        display.setCursor(100, 150);
        display.println("\n");
        // Print each line of the ASCII art
        display.setCursor(100, 160);
        display.println(" ##### ###### ####### ####### ##### # # ");
        display.setCursor(100, 180);
        display.println("# # # # # # # # # # ");
        display.setCursor(100, 200);
        display.println("# # # # # # # # ");
        display.setCursor(100, 220);
        display.println(" ##### ###### ##### ##### # ####### ");
        display.setCursor(100, 240);
        display.println(" # # # # # # # ");
        display.setCursor(100, 260);
        display.println("# # # # # # # # # ");
        display.setCursor(100, 280);
        display.println(" ##### # ####### ####### ##### # # \n");
    } else if (strcmp(resultLabel, "music") == 0) {
        display.fillScreen(BLACK);
        // Display ASCII art for music
        display.setCursor(200, 150);
        display.println("\n");
        // Print each line of the ASCII art
        display.setCursor(200, 160);
        display.println("# # # # ##### ### ##### ");
        display.setCursor(200, 180);
        display.println("## ## # # # # # # # ");
        display.setCursor(200, 200);
        display.println("# # # # # # # # # ");
        display.setCursor(200, 220);
        display.println("# # # # # ##### # # ");
        display.setCursor(200, 240);
        display.println("# # # # # # # ");
        display.setCursor(200, 260);
        display.println("# # # # # # # # # ");
        display.setCursor(200, 280);
        display.println("# # ##### ##### ### ##### \n");
    } else if (strcmp(resultLabel, "noise") == 0) {
        display.fillScreen(BLACK);
        // Display ASCII art for noise
        display.setCursor(200, 150);
        display.println("\n");
        // Print each line of the ASCII art
        display.setCursor(200, 160);
        display.println("# # ####### ### ##### ####### ");
        display.setCursor(200, 180);
        display.println("## # # # # # # # ");
        display.setCursor(200, 200);
        display.println("# # # # # # # # ");
        display.setCursor(200, 220);
        display.println("# # # # # # ##### ##### ");
        display.setCursor(200, 240);
        display.println("# # # # # # # # ");
        display.setCursor(200, 260);
        display.println("# ## # # # # # # ");
        display.setCursor(200, 280);
        display.println("# # ####### ### ##### ####### \n");
    }

#if EI_CLASSIFIER_HAS_ANOMALY == 1
    ei_printf(" anomaly score: %.3f\n", result.anomaly);
#endif
}
const char* getMax(ei_impulse_result_classification_t *classifications)
{
    // return the label of the class with the highest score
    uint8_t maxLabelIndex = 0;
    for (size_t i = 1; i < EI_CLASSIFIER_LABEL_COUNT; i++) {
        if (classifications[i].value > classifications[maxLabelIndex].value) {
            maxLabelIndex = i;
        }
    }
    return classifications[maxLabelIndex].label;
}
/**
* @brief PDM buffer full callback
* Get data and call audio thread callback
*/
static void pdm_data_ready_inference_callback(void)
{
    int bytesAvailable = PDM.available();

    // read into the sample buffer
    int bytesRead = PDM.read((char *)&sampleBuffer[0], bytesAvailable);

    if (inference.buf_ready == 0) {
        for (int i = 0; i < bytesRead >> 1; i++) {
            inference.buffer[inference.buf_count++] = sampleBuffer[i];

            if (inference.buf_count >= inference.n_samples) {
                inference.buf_count = 0;
                inference.buf_ready = 1;
                break;
            }
        }
    }
}
/**
* @brief Init inferencing struct and setup/start PDM
*
* @param[in] n_samples The n samples
*
* @return True if the buffer was allocated and PDM started, false otherwise
*/
static bool microphone_inference_start(uint32_t n_samples)
{
    inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));

    if (inference.buffer == NULL) {
        return false;
    }

    inference.buf_count = 0;
    inference.n_samples = n_samples;
    inference.buf_ready = 0;

    // configure the data receive callback
    PDM.onReceive(&pdm_data_ready_inference_callback);

    PDM.setBufferSize(512);

    // initialize PDM with:
    // - one channel (mono mode)
    // - the sample rate set by `frequency` above (20000 Hz)
    if (!PDM.begin(1, frequency)) {
        ei_printf("Failed to start PDM!");
        microphone_inference_end();
        return false;
    }

    // set the gain, defaults to 20
    PDM.setGain(5);

    return true;
}
/**
* @brief Wait on new data
*
* @return True when finished
*/
static bool microphone_inference_record(void)
{
    inference.buf_ready = 0;
    inference.buf_count = 0;

    // block until the PDM callback has filled the buffer
    while (inference.buf_ready == 0) {
        delay(10);
    }

    return true;
}
/**
* Get raw audio signal data
*/
static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr)
{
    numpy::int16_to_float(&inference.buffer[offset], out_ptr, length);

    return 0;
}
/**
* @brief Stop PDM and release buffers
*/
static void microphone_inference_end(void)
{
    PDM.end();
    free(inference.buffer);
}
#if !defined(EI_CLASSIFIER_SENSOR) || EI_CLASSIFIER_SENSOR != EI_CLASSIFIER_SENSOR_MICROPHONE
#error "Invalid model for current sensor."
#endif
02:38:59.938 -> Recording...
02:39:01.551 -> Recording done
02:39:02.130 -> Predictions (DSP: 26 ms., Classification: 530 ms., Anomaly: 0 ms.):
02:39:02.130 -> music: 0.10147
02:39:02.130 -> noise: 0.63983
02:39:02.130 -> speech: 0.25870
02:39:02.130 -> noise
02:39:02.161 -> Starting inferencing in 2 seconds...
02:39:04.147 -> Recording...
02:39:05.759 -> Recording done
02:39:06.323 -> Predictions (DSP: 26 ms., Classification: 530 ms., Anomaly: 0 ms.):
02:39:06.323 -> music: 0.05066
02:39:06.323 -> noise: 0.84657
02:39:06.323 -> speech: 0.10276
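A log like this can also be checked offline. The sketch below is a hypothetical helper, not part of the original project: it parses one inference cycle in the Arduino Serial Monitor's timestamped format shown above, and returns the recording time together with the class scores.

```python
import re
from datetime import datetime

LOG = """02:38:59.938 -> Recording...
02:39:01.551 -> Recording done
02:39:02.130 -> Predictions (DSP: 26 ms., Classification: 530 ms., Anomaly: 0 ms.):
02:39:02.130 -> music: 0.10147
02:39:02.130 -> noise: 0.63983
02:39:02.130 -> speech: 0.25870
"""

LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) -> (.*)$")
SCORE = re.compile(r"^(\w+): (\d+\.\d+)$")

def parse_log(text):
    """Return (recording_seconds, {label: score}) for one inference cycle."""
    start = end = None
    scores = {}
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # not a timestamped monitor line
        stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        msg = m.group(2).strip()
        if msg == "Recording...":
            start = stamp
        elif msg == "Recording done":
            end = stamp
        else:
            s = SCORE.match(msg)  # lines such as "noise: 0.63983"
            if s:
                scores[s.group(1)] = float(s.group(2))
    seconds = (end - start).total_seconds() if start and end else None
    return seconds, scores

if __name__ == "__main__":
    seconds, scores = parse_log(LOG)
    print("recording time (s):", seconds)
    print("top class:", max(scores, key=scores.get))
```

For the first cycle in the log the recording takes about 1.6 s rather than 2 s. That would be consistent with a 32,000-sample buffer (2000 ms at 16 kHz, an assumption about the exported model) being filled at the sketch's `frequency` of 20000 Hz, so the two sample rates may be worth aligning.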
Note: Content and images are from: https://projecthub.arduino.cc/, with some modifications.