Made a utility to timestamp audio automatically

SoundReader

Takes in an audio file and spits out timestamps of burps.

:warning: Notice: support is provided for the computer-literate:

  • If you have questions, such as “How do I install python?”, “Which ONNX runtime should I install?”, or “Why isn’t ffmpeg launching?”, kindly write them down on a piece of paper, throw it in a garbage can, and find the answer online; they are well documented.
  • If you have questions regarding errors or exceptions, please write them down here with your system info and exactly what your command was, I’ll gladly look into it!
  • If you write down “It’s not working,” you won’t be seeing a response (at least from me).
  • Don’t ask for scan requests in this thread; post them in Female Media > Female Requests or Male Media > Male Requests.

If the barrier to entry is too high, consider using videoscan.net, which is a ready-made, robust system and has done almost all of the work for you.

The goal is to get feedback and develop an automated way to tune the model using data outside of AudioSet.

Requirements

Python and pip

Ensure you have Python (3.10+) and pip installed and executable from your shell.

Dependencies

onnx: Used to verify the model files for correctness

onnxruntime: Used as a platform to run inference.
Notice: Install the correct onnxruntime package for your platform using this table: ONNX Runtime | Home. Ensure Python is selected as the API.

tqdm: Used to display loading bar and time estimations

colored: Used to color video titles for better discernibility

argparse: Used for command-line argument parsing (part of the Python standard library since 3.2, so it usually needs no separate install)

ffmpeg: Used for transcoding. Note that ffmpeg is not a pip package; install it separately and make sure it is on your PATH.

Install the required dependencies using pip:

pip install argparse colored numpy onnx onnxruntime tqdm

Downloads

Program: https://tctbff.duckdns.org/programs/sound_reader/sound_reader.py
Model: https://tctbff.duckdns.org/programs/sound_reader/bdetectionmodel_05_01_23.onnx

Usage

python sound_reader.py --model bdetectionmodel_05_01_23.onnx audio.opus

Will print out the timestamps with a confidence >= 90. This can be adjusted by tweaking --threshold
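To make the threshold's effect concrete, here is an illustrative sketch (not the actual sound_reader internals) of how a confidence cutoff turns raw model detections into reported timestamps. The (time, confidence) pairs and the 0-1 confidence scale are made-up example values:

```python
def filter_detections(detections, threshold=0.90):
    """Keep only (time, confidence) pairs whose confidence meets the threshold."""
    return [(t, conf) for t, conf in detections if conf >= threshold]

# Hypothetical detections: seconds into the file, model confidence
detections = [(12.4, 0.97), (33.1, 0.62), (58.9, 0.91)]

print(filter_detections(detections))        # default cutoff drops the 0.62 hit
print(filter_detections(detections, 0.60))  # a looser cutoff keeps all three
```

Lowering --threshold trades more false positives for fewer missed events, and vice versa.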

More tunables can be found with: python sound_reader.py --help

How can you help?

Send suggestions, feedback, and code!
Share false positives or false negatives with me (in the form of 2 second audio segments- ffmpeg is your friend here) along with a label of what the audio segment is supposed to be (burping, coughing, talking, etc.)
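For cutting those 2-second segments, here is a hedged sketch of driving ffmpeg from Python; the input filename, timestamp, and output name are placeholders, and ffmpeg must be installed and on your PATH for the commented-out run to work:

```python
import subprocess  # only needed for the optional run at the bottom

def clip_command(src, start_seconds, out, duration=2.0):
    """Build an ffmpeg command that copies `duration` seconds starting at `start_seconds`."""
    return [
        "ffmpeg",
        "-ss", str(start_seconds),  # seek to the timestamp
        "-i", src,
        "-t", str(duration),        # keep 2 seconds
        "-c", "copy",               # copy the stream, no re-encode
        out,
    ]

cmd = clip_command("audio.opus", 83.5, "false_positive.opus")
print(" ".join(cmd))
# Uncomment to actually run it:
# subprocess.run(cmd, check=True)
```

Name the output after what the segment actually is (burping, coughing, talking, etc.) so the label travels with the clip.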

Closing

If you think there are some not-so-obvious details I missed out on, post them here!
Also, do feel free to share changes or suggestions you’ve made here.

Colab Notebook

If you would like to try it out without installing anything, I’ve made a colab notebook with some testing code: Google Colab

31 Likes

Fair enough, give me a few mins. I’ll do it on windows, since I don’t have a mac and I’m assuming this is usual stuff for those on Linux.

https://www.youtube.com/watch?v=1KhUnNPUwfE

9 Likes

You’ll have to remove them manually (either by using rm, del, or your file manager). What I usually do is download all the videos from a channel at once and scan them using a wildcard, e.g.: python sound_reader.py --model bdetectionmodel_05_01_23.onnx *.opus

2 Likes

Yes, but it depends on your computer. If you’re running a low-spec computer like a laptop without a GPU (I was running my demo on an old laptop), it will usually be slow and the difference is large. If you have a GPU, downloading will usually take more time than actually getting the timestamps.

2 Likes

Yes, this is that command. It will pass every file that ends with .opus into sound_reader. Note that it depends on which shell you’re using: zsh (the default on macOS) and bash (the usual default on Linux) support wildcards like *. On Windows, you can try PowerShell and see if it works there, but to my knowledge Command Prompt doesn’t have this feature.
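If your shell won’t expand the wildcard, Python can do it for you. A minimal sketch; the model filename is taken from the thread, everything else is illustrative:

```python
import fnmatch
import glob
import subprocess

def expand(pattern, names):
    """Return the names matching a shell-style wildcard pattern, sorted."""
    return sorted(n for n in names if fnmatch.fnmatch(n, pattern))

# Pure-logic demonstration with made-up filenames:
print(expand("*.opus", ["ep1.opus", "ep2.opus", "notes.txt"]))

# For real files in the current directory, glob does the same job:
# files = sorted(glob.glob("*.opus"))
# subprocess.run(["python", "sound_reader.py",
#                 "--model", "bdetectionmodel_05_01_23.onnx", *files], check=True)
```

This sidesteps Command Prompt entirely, since the expansion happens inside Python rather than in the shell.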

I’ll need more details than that, but what it sounds like is: you tried using the wildcard trick but either your shell doesn’t support it or your wildcard didn’t match anything.

could this theoretically work for farts also, or just burps?

Yes, AudioSet has a class for fart sounds and the model was trained on it as well. You can specify the --focus_idx 60 flag (60 is the class id for farts) to make sound_reader print timestamps for farts.

3 Likes

thank you!

1 Like

It uses ffmpeg behind the scenes to read the audio, so anything ffmpeg supports, this supports. And you can bet ffmpeg supports some of the most obscure formats out there.

1 Like

So in the tutorial I used DirectML, since that’s supposed to be the ‘just works’ ML API on Windows. It appears that it is not. Since you’re using an Nvidia GPU, uninstall onnxruntime-directml with pip and then install onnxruntime-gpu instead. This will use TensorRT or CUDA depending on your drivers (it picks TensorRT when possible, since that uses the tensor cores on your GPU). Let me know if it works after that, since the error looks like a DirectML issue.

Edit: Also, make sure you delete the existing model and redownload it. The model I serve is a base, unoptimized version. Once you perform inference using sound_reader, onnxruntime will optimize the model and save those optimizations. The changes include device-specific instructions and will not work on any other device.
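The fallback order described above (TensorRT when possible, then CUDA, then DirectML, then CPU) can be sketched as a small preference check. The provider strings are the real ONNX Runtime identifiers, but the selection logic here is illustrative, not sound_reader’s actual code:

```python
# Most-preferred first, matching the thread's advice.
PREFERENCE = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]

def pick_provider(available):
    """Return the most preferred execution provider that is actually available."""
    for p in PREFERENCE:
        if p in available:
            return p
    return "CPUExecutionProvider"  # CPU is always a safe fallback

# With onnxruntime installed, you would pass the real list:
# import onnxruntime as ort
# provider = pick_provider(ort.get_available_providers())
print(pick_provider(["CPUExecutionProvider"]))
```

Running `onnxruntime.get_available_providers()` yourself is a quick way to confirm which package (onnxruntime, onnxruntime-gpu, or onnxruntime-directml) actually got installed.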

1 Like

I’m using a 1070 and it worked with the default install. could I see a performance gain doing the gpu install you listed?

You probably won’t see a huge gain, but CUDA/cuBLAS will generally perform better than DirectML.

Whoops, forgot to mention that you will need CUDA for that. You can download it here: CUDA Toolkit Archive | NVIDIA Developer. Get CUDA 11.6 and select the correct info for your system (onnxruntime doesn’t support CUDA 12 at the moment), then get cuDNN 8.5.0 for CUDA 11.x from cuDNN Archive | NVIDIA Developer.

1 Like

That’s really weird. If you’re certain that you removed onnxruntime-directml, can you run pip install --force-reinstall onnxruntime-gpu?

Idk if this is a common problem or just me not being familiar with coding, but even after clearing a cell or refreshing in the Google Colab, the sound reader rescans old stuff.

Awesome tool, thanks for posting it. Got it working quickly, no hassle on my end. I’m ripping the audio out of an mp4 or ts file, converting it to opus, and feeding it into the tool. Works perfectly, amazing tool.

edit: sorry, overlooked the false-positive formatting.

So I’m not sure if the program was detecting just the croaky sound at the end, or that in addition to the loud emphasis of the words beforehand; kind of a coincidence. The first one was 2 timestamps and the second was 3.

I think you might’ve forgotten to remove the old files before rerunning. Make sure to do that, since it will scan all files (including the old ones).

@joemanxdjoe You actually don’t need to do any file ripping on your own (unless the file is not compatible with ffmpeg), sound_reader will do that automatically

2 Likes

Ay nice, that’s handy. Time including extracting and converting the audio was 166 seconds; just feeding the video file straight into it is 120 seconds, and a lot simpler code-wise. (Forgot to mention this also includes a few other processing steps afterward; your model is extremely quick.)

1 Like