Work in progress. If you come back later there'll probably be more. - SHH, 10/24/21.


There are many tutorials on audio classification; these usually involve rendering the audio as (mel-)spectrograms and doing image classification on those. There are not nearly as many tutorials on audio processing or generation, though the list is growing. Lately I've become somewhat proficient with fastai and would like to port some audio processing examples over to it. There are a few choices of tasks and datasets -- there's been great work on source separation, for example.

My Choice: Reproduce Micro-TCN

Since I've been interested in audio effects, I'll choose the task of reproducing Christian Steinmetz and Josh Reiss's Micro-TCN work on learning to profile audio compressors. Their code uses PyTorch Lightning instead of fastai, but we should be able to do the bare minimum integration with fastai by following Zach Mueller's prescription, sketched just below. The experience gained from doing this can hopefully serve when adapting other audio tasks & models to work with fastai.
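To make that concrete, here is a minimal sketch of the bare-minimum integration, following the pattern in fastai's "migrating from pure PyTorch" tutorial: wrap ordinary PyTorch DataLoaders in fastai's DataLoaders class and hand them to a Learner. The random tensors and the single Conv1d layer are just stand-ins for the real dataset and the micro-TCN model, not the code we'll actually use.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from fastai.data.core import DataLoaders
from fastai.learner import Learner

# Stand-in data: 32 one-second mono "clips" as (input, target) pairs
x, y = torch.randn(32, 1, 44100), torch.randn(32, 1, 44100)
train_dl = DataLoader(TensorDataset(x[:24], y[:24]), batch_size=8, shuffle=True)
valid_dl = DataLoader(TensorDataset(x[24:], y[24:]), batch_size=8)

dls = DataLoaders(train_dl, valid_dl)               # fastai wrapper around plain torch loaders
model = nn.Conv1d(1, 1, kernel_size=9, padding=4)   # stand-in for the micro-TCN
learn = Learner(dls, model, loss_func=nn.MSELoss())
learn.fit(1, lr=1e-3)                               # full fastai training loop, callbacks and all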

Other things we could try (in later notebooks)

We could just grab any old audio data and learn some kind of inverse effect such as denoising: add noise to the audio files, then train the network to remove it -- a rough sketch of that idea follows. But what other audio datasets are available?
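Here's that denoising sketch; the filename and noise level are hypothetical placeholders, just to illustrate how (noisy, clean) training pairs could be built:

import torch
import torchaudio

clean, sr = torchaudio.load('some_clean_file.wav')  # hypothetical clean audio file
noise_amp = 0.05                                    # arbitrary noise level
noisy = (clean + noise_amp * torch.randn_like(clean)).clamp(-1.0, 1.0)
# `noisy` becomes the network's input; `clean` is the training target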

Installs and imports

# Next line only executes on Colab. Colab users: Please enable GPU in Edit > Notebook settings
! [ -e /content ] && pip install -Uqq pip fastai git+https://github.com/drscotthawley/fastproaudio.git

# Additional installs for this tutorial
%pip install -q fastai_minima torchsummary pyzenodo3 wandb

# Install micro-tcn and auraloss packages (from source, will take a little while)
%pip install -q wheel --ignore-requires-python git+https://github.com/csteinmetz1/micro-tcn.git  git+https://github.com/csteinmetz1/auraloss

# After this cell finishes, restart the kernel and continue below
Note: you may need to restart the kernel to use updated packages.
  WARNING: Missing build requirements in pyproject.toml for git+https://github.com/csteinmetz1/auraloss.
  WARNING: The project does not specify a build backend, and pip cannot fall back to setuptools without 'wheel'.
Note: you may need to restart the kernel to use updated packages.
from fastai.vision.all import *
from fastai.text.all import *
from fastai.callback.fp16 import *
import wandb
from fastai.callback.wandb import *
import torch
import torchaudio
import torchaudio.functional as F   # NB: this shadows the F (torch.nn.functional) from fastai's star imports
import torchaudio.transforms as T
from IPython.display import Audio 
import matplotlib.pyplot as plt
import torchsummary
from fastproaudio.core import *
from glob import glob
import json

use_fastaudio = False
if use_fastaudio:
    from fastaudio.core.all import *
    from fastaudio.augment.all import *
    from fastaudio.ci import skip_if_ci

Download and Inspect the Data

The "SignalTrain LA2A Reduced" dataset is something I made Friday night. It's only the first 10 seconds of each of the 20-minute audio files making up the full SignalTrain LA2A dataset, which consists of lots of audio files run through an LA2A audio compressor at different knob settings. At 200 MB, the Reduced version is enough to train the model some and see that it's working for the purposes of this demo, though you'd probably want more data to make a high-quality model. (If you'd rather train using the full 20 GB dataset, use URLs.SIGNALTRAIN_LA2A_1_1 below, but everything will take longer!)

path = get_audio_data(URLs.SIGNALTRAIN_LA2A_REDUCED); path
Path('/home/shawley/.fastai/data/SignalTrain_LA2A_Reduced')
fnames_in = sorted(glob(str(path)+'/*/input*'))     # dry input recordings
fnames_targ = sorted(glob(str(path)+'/*/*targ*'))   # compressed outputs (training targets)
ind = -1   # pick one spot in the list of files
fnames_in[ind], fnames_targ[ind]
('/home/shawley/.fastai/data/SignalTrain_LA2A_Reduced/Val/input_260_.wav',
 '/home/shawley/.fastai/data/SignalTrain_LA2A_Reduced/Val/target_260_LA2A_2c__1__85.wav')
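Note that each target filename encodes the compressor settings after the double underscores; judging purely from the naming pattern above, these look like the comp/limit switch position and the peak-reduction knob value. A small parser based on that assumption:

from pathlib import Path   # already provided by fastai's star imports

def parse_knobs(fname):
    "e.g. 'target_260_LA2A_2c__1__85.wav' -> (1.0, 85.0)"
    parts = Path(fname).stem.split('__')   # ['target_260_LA2A_2c', '1', '85']
    return float(parts[1]), float(parts[2])

parse_knobs(fnames_targ[ind])   # -> (1.0, 85.0)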

Input audio

import warnings
warnings.filterwarnings("ignore", category=UserWarning)   # silence UserWarnings (mostly annoying matplotlib ones)

waveform, sample_rate = torchaudio.load(fnames_in[ind])
show_audio(waveform, sample_rate)
Shape: (1, 441000), Dtype: torch.float32, Duration: 10.0 s
Max:  0.225,  Min: -0.218, Mean:  0.000, Std Dev:  0.038
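As a quick sanity check on those numbers: 441,000 samples at a 44,100 Hz sample rate is indeed exactly 10 seconds.

waveform.shape[-1] / sample_rate   # 441000 / 44100 = 10.0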