Made a utility to timestamp audio automatically


Takes in an audio file and spits out timestamps of burps.

:warning: Notice: support is provided only for the computer literate:

  • If you have questions such as “How do I install Python?”, “Which ONNX runtime should I install?”, or “Why isn’t ffmpeg launching?”, kindly write them down on a piece of paper, throw it in a garbage can, and find the answer online; they are well documented.
  • If you have questions regarding errors or exceptions, please post them here with your system info and the exact command you ran; I’ll gladly look into it!
  • If you write down “It’s not working,” you won’t be seeing a response (at least from me).
  • Don’t ask for scan requests in this thread; put them in Female Requests/Male Requests.

If the barrier to entry is too high, consider using, which is a ready-made, robust system that has done almost all of the work for you:

The goal is to get feedback and develop an automated way to tune the model using data outside of AudioSet.


Python and pip

Ensure you have Python (3.10+) and pip installed and executable from your shell.


onnx: Used to verify the model files for correctness

onnxruntime: Used as the platform to perform inference.
Notice: Install the correct onnxruntime for your platform using this table: ONNX Runtime | Home. Ensure Python is selected as the API.
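If you’re unsure which runtime ended up installed, onnxruntime can tell you which execution providers it actually has. The `pick_provider` helper below is my own sketch (not part of the tool) of the usual preference order; the provider names are the real onnxruntime identifiers:

```python
def pick_provider(available):
    """Prefer a GPU execution provider when one is available,
    falling back to plain CPU."""
    for gpu in ("TensorrtExecutionProvider", "CUDAExecutionProvider",
                "DmlExecutionProvider"):
        if gpu in available:
            return gpu
    return "CPUExecutionProvider"

# With onnxruntime installed, feed it the real list:
#   import onnxruntime as ort
#   print(pick_provider(ort.get_available_providers()))
print(pick_provider(["DmlExecutionProvider", "CPUExecutionProvider"]))
```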

tqdm: Used to display a progress bar and time estimates

colored: Used to color video titles for better discernibility

argparse: Used for command argument parsing

ffmpeg: Used for transcoding

Install the required dependencies using pip:

pip install argparse colored numpy onnx onnxruntime tqdm
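If you want to confirm the install actually worked before running anything, a quick stdlib-only check (my own snippet, not part of the tool) can report which of the packages above are importable:

```python
import importlib.util

REQUIRED = ["argparse", "colored", "numpy", "onnx", "onnxruntime", "tqdm"]

def missing_packages(names):
    """Return whichever of `names` cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

print(missing_packages(REQUIRED) or "All dependencies found.")
```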




python --model bdetectionmodel_05_01_23.onnx audio.opus

Will print out the timestamps with a confidence >= 90. This can be adjusted by tweaking --threshold.

More tunables can be found with: python --help

How can you help?

Send suggestions, feedback, and code!
Share false positives or false negatives with me (in the form of 2 second audio segments- ffmpeg is your friend here) along with a label of what the audio segment is supposed to be (burping, coughing, talking, etc.)
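For cutting those 2-second segments, here is a sketch of the ffmpeg invocation I have in mind (filenames are illustrative; `-c copy` avoids re-encoding):

```python
def clip_cmd(src, start, out, duration=2):
    """Build an ffmpeg command that cuts a `duration`-second clip
    starting at `start` (seconds or HH:MM:SS) without re-encoding."""
    return ["ffmpeg", "-ss", str(start), "-t", str(duration),
            "-i", src, "-c", "copy", out]

# To actually run it (requires ffmpeg on your PATH):
#   import subprocess
#   subprocess.run(clip_cmd("stream.opus", "01:23:45", "false_positive.opus"),
#                  check=True)
print(" ".join(clip_cmd("stream.opus", "01:23:45", "false_positive.opus")))
```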


If you think there are some not-so-obvious details I missed out on, post them here!
Also, do feel free to share changes or suggestions you’ve made here.

Colab Notebook

If you would like to try it out without installing anything, I’ve made a colab notebook with some testing code: Google Colab


thanks for the work, can you do a video showcasing how it works? you don’t really have to explain anything, just use the tool


Fair enough, give me a few mins. I’ll do it on Windows, since I don’t have a Mac and I’m assuming this is usual stuff for those on Linux.


it worked, thanks, but does the archive yt-dlp creates stay on your HD? if yes, how do I delete already-scanned videos?

You’ll have to remove them manually (using rm, del, or your file manager). What I usually do is download all the videos from a channel at once and scan them using a wildcard, i.e.: python --model bdetectionmodel_05_01_23.onnx *.opus


ok, is it normal that Twitch VODs take more time to process?

Yes, but it depends on your computer. If you’re running a low-spec machine like a laptop without a GPU (I was running my demo on an old laptop), it will usually be slow, and the difference is large. If you have a GPU, downloading will usually take more time than actually getting the timestamps.


what do you mean? is there a command to scan all videos from a channel?

Yes, this is that command. It will pass every file that ends in .opus into sound_reader. Note it depends on which shell you’re using. Zsh (the default on macOS) and bash (the usual default on Linux) support wildcards like *. On Windows, you can try PowerShell and see if it works there, but to my knowledge Command Prompt doesn’t have this simple feature.
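If your shell won’t expand wildcards (Command Prompt), Python can do the expansion itself. A sketch, assuming the script is named sound_reader.py (the actual filename isn’t shown in this thread):

```python
import glob
import sys

def scan_command(pattern, model, script="sound_reader.py"):
    """Expand `pattern` in Python (cmd.exe won't do it for you) and
    build the scan command. Returns None when nothing matches, which
    is the case behind an unexpanded wildcard reaching ffmpeg."""
    files = sorted(glob.glob(pattern))
    if not files:
        return None
    return [sys.executable, script, "--model", model, *files]

cmd = scan_command("*.opus", "bdetectionmodel_05_01_23.onnx")
print(cmd if cmd else "No .opus files matched")
# To launch it: subprocess.run(cmd, check=True)
```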

it works on a single video, but on a channel I get this: RuntimeError: Failed to load audio: ffmpeg version Copyright (c) 2000-2023 the FFmpeg developers


I’ll need more details than that, but what it sounds like is: you tried using the wildcard trick but either your shell doesn’t support it or your wildcard didn’t match anything.


nevermind, got it

could this theoretically work for farts also, or just burps?

Yes, AudioSet has a class for fart sounds, and the model was trained on it as well. You can pass the --focus_idx 60 flag (60 is the class ID for farts) to make sound_detector print timestamps for farts.
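If you want to find the index for some other sound, AudioSet publishes its class list as class_labels_indices.csv (columns: index, mid, display_name). A small lookup sketch; the sample row below is illustrative (mid is a placeholder, index 60 taken from the reply above):

```python
import csv

def find_class_indices(rows, keyword):
    """Search AudioSet's class_labels_indices.csv rows for classes
    whose display name contains `keyword` (case-insensitive)."""
    return [(int(r["index"]), r["display_name"])
            for r in csv.DictReader(rows)
            if keyword.lower() in r["display_name"].lower()]

# Against the real file:
#   with open("class_labels_indices.csv") as f:
#       print(find_class_indices(f, "fart"))
sample = ["index,mid,display_name", "60,/m/xxxxx,Fart"]
print(find_class_indices(sample, "fart"))  # → [(60, 'Fart')]
```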


thank you!


Kudos for the robust tool. Would this work on downloaded streams in MP4/MKV format? I have a bunch of old MFC streams from someone that I’d like to scan.

It uses ffmpeg behind the scenes to read the audio, so anything ffmpeg supports, this supports. And you can bet ffmpeg supports some of the most obscure formats out there.
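For the curious, the standard pattern for “read any container through ffmpeg” looks roughly like this. This is my own sketch, not the tool’s actual code, and the 16 kHz sample rate is an assumption:

```python
import subprocess

def ffmpeg_decode_cmd(path, sample_rate=16000):
    """ffmpeg command that decodes any supported container (.mp4,
    .mkv, .opus, ...) to mono float32 PCM on stdout."""
    return ["ffmpeg", "-i", path, "-f", "f32le",
            "-ac", "1", "-ar", str(sample_rate), "-"]

def decode_audio(path, sample_rate=16000):
    """Return raw little-endian float32 samples, raising on failure
    (compare the 'Failed to load audio' error quoted earlier)."""
    proc = subprocess.run(ffmpeg_decode_cmd(path, sample_rate),
                          capture_output=True)
    if proc.returncode != 0:
        raise RuntimeError("Failed to load audio: "
                           + proc.stderr.decode(errors="replace"))
    return proc.stdout
```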


The setup process was seamless, but scanning a file in either .mp4 or .opus format brought up this error:

\AppData\Roaming\Python\Python310\site-packages\onnxruntime\capi\", line 217, in run
return, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(896)\onnxruntime_pybind11_state.pyd!00007FFA80ACFE91: (caller: 00007FFA80AD08BF) Exception(2) tid(5028) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.

This happens around 20% of the way into scanning the file.

System info:
OS - Windows 10 Pro 64bit
CPU - i9-9900kf
GPU - RTX 3070ti

So in the tutorial I used DirectML, since that’s supposed to be the “just works” ML API on Windows. It appears that it is not. Since you’re using an Nvidia GPU, uninstall onnxruntime-directml with pip and then install onnxruntime-gpu instead. This will use TensorRT or CUDA depending on your drivers (it picks TensorRT when possible, since that uses the tensor cores on your GPU). Let me know if it works after that, since the error looks like a DirectML issue.

Edit: Also, make sure you delete the existing model and redownload it. The model I serve to you is a base, unoptimized version. Once you perform inference with sound_reader, ONNX Runtime optimizes the model and saves the result. Those changes include device-specific instructions and will not work on any other device.


I’m using a 1070 and it worked with the default install. Could I see a performance gain from doing the GPU install you listed?