TinyML: Audio classification using synthetic data

Bridging the gap between training on synthetic data and real data in TinyML.

Devices and components

Arduino GIGA display assembly

NVIDIA Jetson Nano Developer Kit

Software and tools

Edge Impulse Studio

Project description

Speech

Music

Noise

Generating text prompts using ChatGPT

I am creating a dataset of synthetic audio data for an audio classification task, which generates audio from text prompts. The first class in this dataset is background noise. Here is a list of examples for this class:

A hammer is hitting a wooden surface,

A noise of nature,

The sound of waves crashing on the shore,

A thunderstorm in the distance,

Traffic noise on a busy street,

The hum of an air conditioning unit,

Birds chirping in the morning,

The sound of a train passing

Could you please provide other 100 examples?

Woman delivering a motivational speech

Man conducting a job interview

Group discussion in a business meeting

Woman giving a lecture in a classroom

Man reading a poem aloud

Casual conversation at a park

Woman participating in a radio talk show

Man narrating a story

Group discussion at a social gathering

Woman giving instructions in a workshop

Man reciting a monologue

Casual conversation at a coffee shop

Woman conducting a podcast interview

Man providing commentary for a sports event

Group discussion at a book club meeting

Woman leading a team meeting

Man practicing a speech in front of a mirror

Casual conversation at a birthday party

Woman participating in a panel discussion

Man delivering a eulogy at a funeral

Group discussion at a community forum

A classical piano composition with a melancholic tone.

Upbeat jazz ensemble featuring saxophone, trumpet, and drums.

Electronic dance music (EDM) track with a pulsating beat.

Acoustic guitar solo playing a lively flamenco piece.

Ambient instrumental with soothing synthesizers and gentle percussion.

Rock ballad with powerful vocals and electric guitar solos.

Traditional Indian sitar and tabla duet.

African drum ensemble with rhythmic patterns and tribal chants.

Bossa nova jazz with smooth guitar and soft percussion.

Celtic folk song featuring violin, flute, and bodhran.

Reggae track with a laid-back groove and Caribbean influences.

Funky bassline-driven groove with brass instruments.

Minimalist piano piece reminiscent of Erik Satie.

Heavy metal guitar riff with intense drumming.

Up-tempo bluegrass with banjo, fiddle, and mandolin.

Synthwave track inspired by 80s electronic music.

Flamboyant Broadway musical number with vocals and orchestration.

Ambient chillout music with ethereal synthesizers and gentle beats.

Japanese traditional koto and shakuhachi composition.

Hip-hop beat with catchy samples and rap vocals.

Spanish flamenco fusion with a modern twist.

Progressive rock epic with intricate instrumental sections.

Caribbean steel drum ensemble playing a festive tune.

Traditional Irish pub song with accordion and tin whistle.

Upbeat ska with a lively horn section.

Raindrops tapping on a tin roof

Leaves rustling in the wind

Distant howling of wolves

Crackling of a fireplace

Cicadas buzzing on a summer night

Water flowing in a gentle stream

Creaking of a wooden ship at sea

Rustling of grass in a meadow

Purring of a cat

Sizzling of food on a barbecue

Chirping crickets in the evening

Drip-dropping of water in a cave

Buzzing bees in a garden

Distant rumble of a waterfall

Footsteps on a gravel path

Hooting of an owl in the night

Cracking of ice in a frozen lake

Rustling of pages in a book

Bubble popping in a pond

Hooves trotting on a dirt road

Bubbling of a hot spring
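The prompt lists above still have to be turned into the one-prompt-per-line text file that the generation scripts read. A minimal sketch of that step follows; the helper names and the file name `descriptions_noise.txt` are hypothetical, and the cleaning rules (dropping trailing punctuation, blank lines and case-insensitive duplicates) are assumptions about what is useful here, not part of the original workflow.

```python
from pathlib import Path


def clean_prompts(raw_lines):
    """Strip whitespace and trailing punctuation; drop blanks and duplicates."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        prompt = line.strip().rstrip(".,;")
        key = prompt.lower()  # compare case-insensitively
        if prompt and key not in seen:
            seen.add(key)
            cleaned.append(prompt)
    return cleaned


def write_descriptions(prompts, path):
    """Write one prompt per line, the layout read by file.read().splitlines()."""
    Path(path).write_text("\n".join(prompts) + "\n", encoding="utf-8")


# A few prompts copied from the lists above, plus a duplicate and a blank line
noise_prompts = clean_prompts([
    "A hammer is hitting a wooden surface,",
    "Raindrops tapping on a tin roof",
    "raindrops tapping on a tin roof",
    "",
])
write_descriptions(noise_prompts, "descriptions_noise.txt")  # hypothetical file name
print(noise_prompts)
```

Keeping one file per class makes it easy to run the generation script once per class and to keep the resulting clips apart.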

Generating datasets for audio classification using AudioCraft

MusicGen: Generates music based on text prompts.

AudioGen: Creates sound effects from simple descriptions.

EnCodec: Compresses and tokenizes audio data for efficient processing.

AI Performance: 275 TOPS

GPU: 2048-core NVIDIA Ampere architecture GPU with 64 Tensor cores

Processor: Arm® Cortex®-A78AE v8.2 64-bit 12-core processor

Memory: 64 GB LPDDR5 256-bit 204.8 GB/s

Storage: 64 GB eMMC 5.1

Power: 15 W - 60 W (with housing accessory)

Namespace(disable=[''], output='/tmp/autotag', packages=['audiocraft'], prefer=['local', 'registry', 'build'], quiet=False, user='dustynv', verbose=False)

-- L4T_VERSION=35.2.1 JETPACK_VERSION=5.1 CUDA_VERSION=11.8.89

-- Finding compatible container image for ['audiocraft']

dustynv/audiocraft:r35.4.1

+ docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /mnt/storage/Projects/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb dustynv/audiocraft:r35.4.1

allow 10 sec for JupyterLab to start @ http://192.168.0.104:8888 (password nvidia)

JupterLab logging location: /var/log/jupyter.log (inside the container)

sudo apt-get update

sudo apt-get install ffmpeg

import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write
import os
import torch
import gc

def purge():
    # Release cached GPU memory and run the garbage collector
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    gc.collect()

# Load descriptions from a text file (one prompt per line)
descriptions_file = 'descriptions.txt'  # Update with your file name
with open(descriptions_file, 'r') as file:
    descriptions = file.read().splitlines()

# Initialize the AudioGen model
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=2)  # generate 2 seconds.

# Create an output folder if it doesn't exist
output_folder = 'output'
os.makedirs(output_folder, exist_ok=True)

# Generate and save audio samples in chunks of 50 descriptions
chunk_size = 50
num_chunks = (len(descriptions) + chunk_size - 1) // chunk_size

for chunk_idx in range(num_chunks):
    start_idx = chunk_idx * chunk_size
    end_idx = (chunk_idx + 1) * chunk_size
    current_descriptions = descriptions[start_idx:end_idx]

    # Generate audio samples based on descriptions
    wav = model.generate(current_descriptions, progress=True)

    # Save generated audio samples to the output folder
    for idx, one_wav in enumerate(wav):
        # Save each sample as a WAV file in the output folder
        output_file_path = os.path.join(output_folder, f'{start_idx + idx}')
        audio_write(output_file_path, one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

    # Purge memory after processing each chunk
    purge()

print(f'Audio samples saved in the "{output_folder}" folder.')
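The script above processes the prompts in chunks of 50, using a ceiling division to count the chunks. The sketch below isolates that arithmetic so it can be checked without a GPU; `chunk_bounds` is a hypothetical helper, not part of AudioCraft. Unlike the script, it clamps the end index explicitly; the script gets the same result because Python slicing ignores an end index past the end of the list.

```python
def chunk_bounds(n_items, chunk_size):
    """Return (start, end) index pairs that cover n_items in chunks of chunk_size."""
    num_chunks = (n_items + chunk_size - 1) // chunk_size  # ceiling division
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_items))
        for i in range(num_chunks)
    ]


print(chunk_bounds(120, 50))  # [(0, 50), (50, 100), (100, 120)]
```

A smaller chunk size lowers peak GPU memory use at the cost of more calls to `model.generate`.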

facebook/musicgen-small: A 300M model, for generating text to music only.

facebook/musicgen-medium: A 1.5B model, for generating text to music only.

facebook/musicgen-melody: A 1.5B model, for text to music and text+melody to music.

facebook/musicgen-large: A 3.3B model, for generating text to music only.

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import os
import torch
import gc

def purge():
    # Release cached GPU memory and run the garbage collector
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    gc.collect()

# Load descriptions from a text file (one prompt per line)
descriptions_file = 'descriptions.txt'  # Update with your file name
with open(descriptions_file, 'r') as file:
    descriptions = file.read().splitlines()

# Initialize the MusicGen model
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=2)  # generate 2 seconds of music.
#wav = model.generate_unconditional(4)

# Create an output folder if it doesn't exist
output_folder = 'output'
os.makedirs(output_folder, exist_ok=True)

# Generate and save audio samples in chunks of 50 descriptions
chunk_size = 50
num_chunks = (len(descriptions) + chunk_size - 1) // chunk_size

for chunk_idx in range(num_chunks):
    start_idx = chunk_idx * chunk_size
    end_idx = (chunk_idx + 1) * chunk_size
    current_descriptions = descriptions[start_idx:end_idx]

    # Generate audio samples based on descriptions
    wav = model.generate(current_descriptions, progress=True)

    # Save generated audio samples to the output folder
    for idx, one_wav in enumerate(wav):
        # Save each sample as a WAV file in the output folder
        output_file_path = os.path.join(output_folder, f'{start_idx + idx}')
        audio_write(output_file_path, one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

    # Purge memory after processing each chunk
    purge()

print(f'Audio samples saved in the "{output_folder}" folder.')
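Both generation scripts write numbered files (0.wav, 1.wav, ...) into the same `output` folder, so a second run would overwrite the first. One way to keep the classes apart is to copy each batch into a dataset folder with the class name as a filename prefix. The sketch below assumes that layout; `label_clips` and the `dataset` folder are hypothetical, and the `label.index.wav` naming is chosen on the assumption that the upload tool can infer the label from the part of the filename before the first period.

```python
import shutil
import tempfile
from pathlib import Path


def label_clips(src_dir, dst_dir, label):
    """Copy every WAV file in src_dir to dst_dir as '<label>.<original stem>.wav'."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for wav in sorted(src.glob("*.wav"), key=lambda p: p.stem):
        target = dst / f"{label}.{wav.stem}.wav"
        shutil.copy2(wav, target)
        copied.append(target.name)
    return copied


# Demo on empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output"
    out.mkdir()
    for name in ("0.wav", "1.wav"):
        (out / name).write_bytes(b"")
    demo_names = label_clips(out, Path(tmp) / "dataset", "music")

print(demo_names)  # ['music.0.wav', 'music.1.wav']
```

Running it once per class (for example with the labels speech, music and noise used in the Arduino sketch) leaves one folder of uniquely named clips ready for upload.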

Model training using the Edge Impulse platform

Speech class

Background noise class

Window size = 2000 ms

Window increase = 300 ms

Click Save Impulse.
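The window size and window increase determine how many training windows each clip yields. As a rough sketch, the count of full windows can be computed as below; `windows_per_clip` is a hypothetical helper, and it ignores any zero-padding of short clips that the training platform may apply.

```python
def windows_per_clip(clip_ms, window_ms=2000, increase_ms=300):
    """Count the full windows of window_ms that fit in a clip, stepping by increase_ms."""
    if clip_ms < window_ms:
        return 0  # no full window fits (padding is not modelled here)
    return (clip_ms - window_ms) // increase_ms + 1


print(windows_per_clip(2000))  # 1
print(windows_per_clip(5000))  # 11
```

With the 2-second clips generated earlier and a 2000 ms window, each clip contributes exactly one window, so the window increase only matters for longer recordings.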

Connect the Arduino GIGA R1 WiFi device to your computer using a USB cable.

/* Edge Impulse ingestion SDK

* Copyright (c) 2022 EdgeImpulse Inc.

*

* Licensed under the Apache License, Version 2.0 (the "License");

* you may not use this file except in compliance with the License.

* You may obtain a copy of the License at

* http://www.apache.org/licenses/LICENSE-2.0

*

* Unless required by applicable law or agreed to in writing, software

* distributed under the License is distributed on an "AS IS" BASIS,

* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

* See the License for the specific language governing permissions and

* limitations under the License.

*

*/

// If your target is limited in memory remove this macro to save 10K RAM

#define EIDSP_QUANTIZE_FILTERBANK 0

/*

** NOTE: If you run into TFLite arena allocation issue.

**

** This may be due to dynamic memory fragmentation.

** Try defining "-DEI_CLASSIFIER_ALLOCATION_STATIC" in boards.local.txt (create

** if it doesn't exist) and copy this file to

** `<ARDUINO_CORE_INSTALL_PATH>/arduino/hardware/<mbed_core>/<core_version>/`.

**

** See

** (https://support.arduino.cc/hc/en-us/articles/360012076960-Where-are-the-installed-cores-located-)

** to find where Arduino installs cores on your machine.

**

** If the problem persists then there's not enough memory for this model and application.

*/

/* Includes ---------------------------------------------------------------- */

#include <PDM.h>

#include <Audio_classification_v1_inferencing.h>

#include "Arduino_GigaDisplay_GFX.h"

#define screen_size_x 480

#define screen_size_y 800

GigaDisplay_GFX display; // create the object

#define BLACK 0x0000

/** Audio buffers, pointers and selectors */

typedef struct {

int16_t *buffer;

uint8_t buf_ready;

uint32_t buf_count;

uint32_t n_samples;

} inference_t;

static inference_t inference;

static signed short sampleBuffer[2048];

static bool debug_nn = false; // Set this to true to see e.g. features generated from the raw signal

static const int frequency = 20000;

/**

* @brief Arduino setup function

*/

void setup()

{

// put your setup code here, to run once:

Serial.begin(115200);

// comment out the below line to cancel the wait for USB connection (needed for native USB)

while (!Serial);

Serial.println("Edge Impulse Inferencing Demo");

display.begin();

display.fillScreen(BLACK);

display.setTextSize(2);

display.setRotation(1);

delay(1000);

// summary of inferencing settings (from model_metadata.h)

ei_printf("Inferencing settings:\n");

ei_printf("\tInterval: %.2f ms.\n", (float)EI_CLASSIFIER_INTERVAL_MS);

ei_printf("\tFrame size: %d\n", EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE);

ei_printf("\tSample length: %d ms.\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT / 16);

ei_printf("\tNo. of classes: %d\n", sizeof(ei_classifier_inferencing_categories) / sizeof(ei_classifier_inferencing_categories[0]));

if (microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT) == false) {

ei_printf("ERR: Could not allocate audio buffer (size %d), this could be due to the window length of your model\r\n", EI_CLASSIFIER_RAW_SAMPLE_COUNT);

return;

}

}

/**

* @brief Arduino main function. Runs the inferencing loop.

*/

void loop()

{

ei_printf("Starting inferencing in 2 seconds...\n");

delay(2000);

ei_printf("Recording...\n");

bool m = microphone_inference_record();

if (!m) {

ei_printf("ERR: Failed to record audio...\n");

return;

}

ei_printf("Recording done\n");

signal_t signal;

signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;

signal.get_data = &microphone_audio_signal_get_data;

ei_impulse_result_t result = { 0 };

EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);

if (r != EI_IMPULSE_OK) {

ei_printf("ERR: Failed to run classifier (%d)\n", r);

return;

}

// print the predictions

ei_printf("Predictions ");

ei_printf("(DSP: %d ms., Classification: %d ms., Anomaly: %d ms.)",

result.timing.dsp, result.timing.classification, result.timing.anomaly);

ei_printf(": \n");

for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {

ei_printf(" %s: %.5f\n", result.classification[ix].label, result.classification[ix].value);

}

const char *resultLabel = getMax(result.classification);

display.fillScreen(BLACK);

Serial.println(resultLabel);

//display.println(resultLabel);

int artLength = 9; // Change this value based on the length of your ASCII art

int centerX = (screen_size_x - (artLength * 10)) / 2; // 10 is the approximate width of each character in the font

if (strcmp(resultLabel, "speech") == 0) {

display.fillScreen(BLACK);

// Display ASCII art for speech

display.setCursor(100, 150);

display.println("\n");

// Print each line of the ASCII art

display.setCursor(100, 160);

display.println(" ##### ###### ####### ####### ##### # # ");

display.setCursor(100, 180);

display.println("# # # # # # # # # # ");

display.setCursor(100, 200);

display.println("# # # # # # # # ");

display.setCursor(100, 220);

display.println(" ##### ###### ##### ##### # ####### ");

display.setCursor(100, 240);

display.println(" # # # # # # # ");

display.setCursor(100, 260);

display.println("# # # # # # # # # ");

display.setCursor(100, 280);

display.println(" ##### # ####### ####### ##### # # \n");

} else if (strcmp(resultLabel, "music") == 0) {

display.fillScreen(BLACK);

// Display ASCII art for music

display.setCursor(200, 150);

display.println("\n");

// Print each line of the ASCII art

display.setCursor(200, 160);

display.println("# # # # ##### ### ##### ");

display.setCursor(200, 180);

display.println("## ## # # # # # # # ");

display.setCursor(200, 200);

display.println("# # # # # # # # # ");

display.setCursor(200, 220);

display.println("# # # # # ##### # # ");

display.setCursor(200, 240);

display.println("# # # # # # # ");

display.setCursor(200, 260);

display.println("# # # # # # # # # ");

display.setCursor(200, 280);

display.println("# # ##### ##### ### ##### \n");

} else if (strcmp(resultLabel, "noise") == 0) {

display.fillScreen(BLACK);

// Display ASCII art for noise

display.setCursor(200, 150);

display.println("\n");

// Print each line of the ASCII art

display.setCursor(200, 160);

display.println("# # ####### ### ##### ####### ");

display.setCursor(200, 180);

display.println("## # # # # # # # ");

display.setCursor(200, 200);

display.println("# # # # # # # # ");

display.setCursor(200, 220);

display.println("# # # # # # ##### ##### ");

display.setCursor(200, 240);

display.println("# # # # # # # # ");

display.setCursor(200, 260);

display.println("# ## # # # # # # ");

display.setCursor(200, 280);

display.println("# # ####### ### ##### ####### \n");

}

#if EI_CLASSIFIER_HAS_ANOMALY == 1

ei_printf(" anomaly score: %.3f\n", result.anomaly);

#endif

}

const char* getMax(ei_impulse_result_classification_t *classifications)

{

uint8_t maxLabelIndex = 0;

for (size_t i = 1; i < EI_CLASSIFIER_LABEL_COUNT; i++) {

if (classifications[i].value > classifications[maxLabelIndex].value) {

maxLabelIndex = i;

}

}

return classifications[maxLabelIndex].label;

}

/**

* @brief PDM buffer full callback

* Get data and call audio thread callback

*/

static void pdm_data_ready_inference_callback(void)

{

int bytesAvailable = PDM.available();

// read into the sample buffer

int bytesRead = PDM.read((char *)&sampleBuffer[0], bytesAvailable);

if (inference.buf_ready == 0) {

for(int i = 0; i < bytesRead>>1; i++) {

inference.buffer[inference.buf_count++] = sampleBuffer[i];

if(inference.buf_count >= inference.n_samples) {

inference.buf_count = 0;

inference.buf_ready = 1;

break;

}

}

}

}

/**

* @brief Init inferencing struct and setup/start PDM

*

* @param[in] n_samples The n samples

*

* @return False if the audio buffer could not be allocated or PDM failed to start, true otherwise

*/

static bool microphone_inference_start(uint32_t n_samples)

{

inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));

if(inference.buffer == NULL) {

return false;

}

inference.buf_count = 0;

inference.n_samples = n_samples;

inference.buf_ready = 0;

// configure the data receive callback

PDM.onReceive(&pdm_data_ready_inference_callback);

PDM.setBufferSize(512);

// initialize PDM with:

// - one channel (mono mode)

// - the sample rate set in `frequency` (20 kHz in this sketch)

if (!PDM.begin(1, frequency)) {

ei_printf("Failed to start PDM!");

microphone_inference_end();

return false;

}

// set the gain, defaults to 20

PDM.setGain(5);

return true;

}

/**

* @brief Wait on new data

*

* @return True when finished

*/

static bool microphone_inference_record(void)

{

inference.buf_ready = 0;

inference.buf_count = 0;

while(inference.buf_ready == 0) {

delay(10);

}

return true;

}

/**

* Get raw audio signal data

*/

static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr)

{

numpy::int16_to_float(&inference.buffer[offset], out_ptr, length);

return 0;

}

/**

* @brief Stop PDM and release buffers

*/

static void microphone_inference_end(void)

{

PDM.end();

free(inference.buffer);

}

#if !defined(EI_CLASSIFIER_SENSOR) || EI_CLASSIFIER_SENSOR != EI_CLASSIFIER_SENSOR_MICROPHONE

#error "Invalid model for current sensor."

#endif
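The sketch fills a buffer of EI_CLASSIFIER_RAW_SAMPLE_COUNT samples, but starts the PDM microphone at the rate in `frequency` (20000), while the printed sample length divides the sample count by 16, which suggests a 16 kHz model. The arithmetic below estimates how long one recording takes under that assumption; the 16 kHz model rate and the resulting 32000-sample buffer are inferred, not stated in the article.

```python
def capture_seconds(n_samples, pdm_rate_hz):
    """Time needed to fill a buffer of n_samples at the given microphone sample rate."""
    return n_samples / pdm_rate_hz


window_ms = 2000       # window size from the impulse settings
model_rate_hz = 16000  # assumption: the model was trained on 16 kHz audio
n_samples = window_ms * model_rate_hz // 1000  # 32000 samples per window

print(capture_seconds(n_samples, 20000))  # 1.6
print(capture_seconds(n_samples, 16000))  # 2.0
```

The 1.6 s figure is close to the gap between "Recording..." and "Recording done" in the serial log of this article (about 1.61 s). If the assumption holds, the captured audio covers less time than the training window, which may be worth checking when real-world accuracy is lower than expected.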

02:38:59.938 -> Recording...

02:39:01.551 -> Recording done

02:39:02.130 -> Predictions (DSP: 26 ms., Classification: 530 ms., Anomaly: 0 ms.):

02:39:02.130 -> music: 0.10147

02:39:02.130 -> noise: 0.63983

02:39:02.130 -> speech: 0.25870

02:39:02.130 -> noise

02:39:02.161 -> Starting inferencing in 2 seconds...

02:39:04.147 -> Recording...

02:39:05.759 -> Recording done

02:39:06.323 -> Predictions (DSP: 26 ms., Classification: 530 ms., Anomaly: 0 ms.):

02:39:06.323 -> music: 0.05066

02:39:06.323 -> noise: 0.84657

02:39:06.323 -> speech: 0.10276
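A serial log like the one above can be saved and summarized offline. The sketch below parses the per-class scores and picks the highest one for each inference, mirroring what `getMax` does on the device. The regular expression assumes the timestamped "-> label: score" layout shown in the log, and the function names are hypothetical.

```python
import re

# Matches lines such as "02:39:02.130 -> music: 0.10147"
SCORE_RE = re.compile(r"->\s*(\w+):\s*([0-9.]+)\s*$")


def parse_predictions(log_lines):
    """Group class scores into one dict per 'Predictions' block."""
    results, current = [], {}
    for line in log_lines:
        if "Predictions" in line:
            if current:
                results.append(current)
            current = {}
            continue
        match = SCORE_RE.search(line)
        if match:
            current[match.group(1)] = float(match.group(2))
    if current:
        results.append(current)
    return results


def top_label(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


log = """02:38:59.938 -> Recording...
02:39:01.551 -> Recording done
02:39:02.130 -> Predictions (DSP: 26 ms., Classification: 530 ms., Anomaly: 0 ms.):
02:39:02.130 -> music: 0.10147
02:39:02.130 -> noise: 0.63983
02:39:02.130 -> speech: 0.25870
02:39:02.130 -> noise
02:39:02.161 -> Starting inferencing in 2 seconds...
02:39:06.323 -> Predictions (DSP: 26 ms., Classification: 530 ms., Anomaly: 0 ms.):
02:39:06.323 -> music: 0.05066
02:39:06.323 -> noise: 0.84657
02:39:06.323 -> speech: 0.10276""".splitlines()

predictions = parse_predictions(log)
print([top_label(p) for p in predictions])  # ['noise', 'noise']
```

Collecting these labels over a longer session, together with notes on what was actually playing, gives a quick estimate of how well the synthetic-data model transfers to real audio.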





Note: Content and images are from: https://projecthub.arduino.cc/, with some modifications.
