Why this?

The closest thing I could find on the internet is transcrybe, which does the same translation and captioning, but it has two prominent problems:

It's not open source, so it's not free (free as in freedom)
It's not free (free as in beer)

Getting started

I researched ways to capture the audio playing on the device and found BlackHole. I installed it using brew:

brew install blackhole-2ch

Then I created a multi-output device to forward audio to BlackHole as well as to my computer's own output.

Then I started working in Python, but soon realized Python was not the right fit for sub-second latency, so I switched to Rust for development.

Rust development journey

Python has sounddevice, so I needed to find an equivalent library for handling audio devices in Rust, and I found cpal (https://docs.rs/cpal/).

It was capable of capturing the audio, so I started developing in Rust.

I will explain step by step what I have done.

    let host = cpal::default_host();

    let device = host
        .input_devices()?
        .find(|device| device.to_string().contains("BlackHole"))
        .ok_or("Blackhole not found")?;
    println!("Using input device: {}", &device);
    let config = device.default_input_config()?;

Here I get the default host and then look through its input devices for the one whose name contains "BlackHole". This is the basic setup needed to get a capture device.

Later on I found out that the device config is this:

SupportedStreamConfig { channels: 2, sample_rate: 48000, buffer_size: Range { min: 15, max: 4096 }, sample_format: F32 }

So BlackHole 2ch has 2 channels, a sampling rate of 48000 Hz, and the F32 (float32) sample format.
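To make those numbers concrete, here is a small sketch that works out how many stereo frames a buffer holds and how much time it covers. The helper names frames_in and duration_ms are my own and not part of cpal, and the sketch assumes the samples are interleaved (left, right, left, right, ...).

```rust
/// Number of frames in an interleaved buffer
/// (one frame = one sample for every channel).
fn frames_in(samples: usize, channels: usize) -> usize {
    samples / channels
}

/// Duration in milliseconds covered by `frames` frames at `sample_rate` Hz.
fn duration_ms(frames: usize, sample_rate: u32) -> f64 {
    frames as f64 * 1000.0 / sample_rate as f64
}
```

For example, a buffer of 960 f32 values holds 480 stereo frames, which is 10 ms of audio at 48000 Hz.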

Then we create a StreamConfig (used in the next step) and a channel to transfer audio data between threads:

    let streamconfig: cpal::StreamConfig = config.into();
    let (tx, rx) = channel();

Since BlackHole exposes the system's sound output as an input device, we use the build_input_stream method like this:

    let stream = device.build_input_stream(
        streamconfig,
        move |data: &[f32], _: &cpal::InputCallbackInfo| {
            let _ = tx.send(data.to_vec());
        },
        err_fn,
        None,
    )?;

As per the docs of build_input_stream, it needs a stream config, a data callback function, an error function, and a timeout.

The callback function receives the data and some timing information. Since we are only interested in the data, we ignore the timing information using _.

Also, the data slice is only borrowed for the duration of the callback and will be overwritten later, so as soon as data arrives in the callback we copy it into a vector and send it through the channel.
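The copy-and-send pattern can be tried without any audio hardware. This is a minimal std-only sketch: simulate_callbacks and collect_all are names I made up, and a plain thread stands in for the audio callback.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Stand-in for the audio callback: it only borrows `data`, so we copy it
/// into an owned Vec before sending it to the other thread.
fn simulate_callbacks(buffers: Vec<Vec<f32>>) -> Receiver<Vec<f32>> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        for buf in &buffers {
            let data: &[f32] = buf; // what the callback would be handed
            let _ = tx.send(data.to_vec()); // owned copy crosses the thread boundary
        }
        // `tx` is dropped here, which ends the receiver's loop.
    });
    rx
}

/// Collects everything the receiver yields until the sender is dropped.
fn collect_all(rx: Receiver<Vec<f32>>) -> Vec<f32> {
    let mut out = Vec::new();
    while let Ok(chunk) = rx.recv() {
        out.extend(chunk);
    }
    out
}
```

The channel keeps the chunks in the order they were sent, so the consumer sees the audio in the same order the callback produced it.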

Then we start the stream using:

    stream.play()?;

As of now, I have done this:

    thread::spawn(move || {
        let mut audio_buffer: Vec<f32> = Vec::new();
        while let Ok(raw_audio_data) = rx.recv() {
            println!("Data received: {:?}", raw_audio_data);
            // Downsampling the data (usually arrives at 48 kHz, 16 kHz needed)
            // 48000 -> 16000 Hz
        }
    });

    loop {
        thread::sleep(std::time::Duration::from_secs(1));
    }

This spawns a thread and prints the data received. I will update it later, but until then, here is what it does.

We consume the data using rx.recv() and print it verbosely.

Now your question will be: why the infinite loop?

It's because if main returns, the program exits and all of its threads are killed. The infinite loop keeps the program, and with it the worker thread, alive.
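Another way to keep the program alive is to hold on to the JoinHandle that thread::spawn returns and call join on it. Below is a minimal std-only sketch of that idea. run_and_join is a name I made up for illustration, and the chunks are fed from the main thread instead of a cpal callback.

```rust
use std::sync::mpsc::channel;
use std::thread;

/// Spawns a consumer thread, feeds it `chunks`, and waits for it with `join`
/// instead of sleeping in a loop. Returns how many samples the worker saw.
fn run_and_join(chunks: Vec<Vec<f32>>) -> usize {
    let (tx, rx) = channel::<Vec<f32>>();

    let worker = thread::spawn(move || {
        let mut total = 0;
        while let Ok(chunk) = rx.recv() {
            total += chunk.len();
        }
        total // handed back to whoever calls `join`
    });

    for chunk in chunks {
        tx.send(chunk).expect("worker hung up early");
    }
    drop(tx); // closing the channel lets the worker's loop finish

    // Blocks the current thread until the worker is done.
    worker.join().expect("worker panicked")
}
```

In the real program the sender lives inside the stream callback, so the channel never closes and join would simply block for as long as the stream exists. The stream variable also has to stay in scope, otherwise capture stops.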

Integrating whisper in it

While searching for whisper, I found whisper-rs, the Rust binding for whisper.cpp, which we will use as a dependency in the Rust project.

The problem with whisper is that it's very picky about the audio it is given.

It requires

16000 Hz sampling rate
Mono audio

But BlackHole provides

48000 Hz sampling rate
Stereo audio

So we will do this:

Average each left/right pair to get mono audio, then average every three mono samples into one to get 16000 Hz, since 16000 * 3 = 48000.

In Rust:

    let mut mono_16khz = Vec::with_capacity(raw_audio_data.len() / 6);
    for chunk in raw_audio_data.chunks_exact(6) {
        // Downmix each of the 3 stereo frames in this group to mono
        let mono1 = (chunk[0] + chunk[1]) / 2.0;
        let mono2 = (chunk[2] + chunk[3]) / 2.0;
        let mono3 = (chunk[4] + chunk[5]) / 2.0;

        // Average the 3 mono samples to get 1 downsampled sample
        let final_sample = (mono1 + mono2 + mono3) / 3.0;

        // One output sample per 3 input frames: 48000 Hz -> 16000 Hz
        mono_16khz.push(final_sample);
    }
    audio_buffer.extend(mono_16khz);

We create the mono_16khz vector with capacity raw_audio_data.len() / 6: a factor of 3 for the downsampling above and a factor of 2 for stereo to mono.

Then we extend audio_buffer with the result.
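One thing to watch: chunks_exact(6) silently skips any leftover samples when a buffer's length is not a multiple of 6. Here is a sketch of one way to carry those leftovers over to the next buffer. Downsampler is a hypothetical helper, not code from the repository, and it assumes the same interleaved stereo 48000 Hz input as above.

```rust
/// Hypothetical helper: downmixes interleaved stereo to mono and averages
/// every 3 frames, keeping leftover samples for the next call so nothing
/// is dropped between callbacks.
struct Downsampler {
    pending: Vec<f32>, // samples that have not filled a whole group of 6 yet
}

impl Downsampler {
    fn new() -> Self {
        Downsampler { pending: Vec::new() }
    }

    /// Feeds one buffer (stereo, 48 kHz) and returns mono 16 kHz samples.
    fn process(&mut self, input: &[f32]) -> Vec<f32> {
        self.pending.extend_from_slice(input);
        let usable = self.pending.len() - self.pending.len() % 6;
        let out: Vec<f32> = self.pending[..usable]
            .chunks_exact(6)
            // Mean of all 6 values = mean of the 3 per-frame mono averages
            .map(|group| group.iter().sum::<f32>() / 6.0)
            .collect();
        self.pending.drain(..usable); // keep only the incomplete tail
        out
    }
}
```

Averaging is only a rough low-pass filter, so treat this as a quick approach rather than a high-quality resampler.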

Since we keep extending the vector, we must also drain it somewhere so that it does not consume too much memory while running. The logic for bounding memory can be written as:

    let mut audio_buffer: Vec<f32> = Vec::new();
    let sample_rate = 16000;
    let process_interval = sample_rate * 3; // Process every 3 seconds of audio
    let max_window_size = sample_rate * 30; // Keep max 30 seconds of context
    let mut last_processed_len = 0;

    // ---- Some code snippet storing samples in audio_buffer ----
    if audio_buffer.len() - last_processed_len >= process_interval {
        if audio_buffer.len() > max_window_size {
            let drain_amount = audio_buffer.len() - max_window_size;
            audio_buffer.drain(0..drain_amount);
        }
        last_processed_len = audio_buffer.len();
    }
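The same logic can be wrapped in a small type so it is easy to test with tiny numbers. This is only a sketch: RollingBuffer and its field names are my own, not code from the repository.

```rust
/// Hypothetical wrapper around the buffering logic above.
struct RollingBuffer {
    samples: Vec<f32>,
    process_interval: usize, // new samples needed before processing again
    max_window: usize,       // upper bound on retained samples
    last_processed_len: usize,
}

impl RollingBuffer {
    fn new(process_interval: usize, max_window: usize) -> Self {
        RollingBuffer {
            samples: Vec::new(),
            process_interval,
            max_window,
            last_processed_len: 0,
        }
    }

    /// Appends samples. Returns the current window once enough new audio
    /// has arrived, after trimming the oldest samples beyond `max_window`.
    fn push(&mut self, new: &[f32]) -> Option<&[f32]> {
        self.samples.extend_from_slice(new);
        if self.samples.len() - self.last_processed_len < self.process_interval {
            return None; // not enough new audio yet
        }
        if self.samples.len() > self.max_window {
            let excess = self.samples.len() - self.max_window;
            self.samples.drain(..excess);
        }
        self.last_processed_len = self.samples.len();
        Some(self.samples.as_slice())
    }
}
```

With RollingBuffer::new(16000 * 3, 16000 * 30) this would hand out a window every 3 seconds of audio and never keep more than 30 seconds.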

Later I implemented whisper-rs for translation from any language, following this example. Then I thought I should polish things, so I used the rubato library for resampling the data to 16 kHz.

Then I needed the ggml model binary for speech-to-text recognition, so I downloaded the multilingual ggml-small model from here.

Then I ran into a problem: whisper.cpp prints very verbose logs.

This can be solved by installing the logging hooks from whisper-rs:

whisper_rs::install_logging_hooks();

You can view the whole script in my repository.

GitHub URL: https://github.com/lsnnt/ntranscrybe

