Deep Learning-Based Multi-Channel Speech Enhancement for Low Signal-to-Noise Ratio Scenario

by Yifei Wei

Speech enhancement for drone audition is highly challenging due to strong rotor noise. In practice, the signal-to-noise ratio (SNR) can drop below -15 dB. Such extremely low SNR conditions render conventional enhancement algorithms ineffective, as most of the energy in the mixture originates from the drone's ego-noise. Despite recent advances, most deep learning-based speech enhancement methods for drone audition focus only on single-channel input. Even among the few methods designed for multi-channel audio, the enhanced output is often reduced to a single channel, discarding spatial information that is essential for downstream processing. To address these limitations, this project introduces Spatial-U-Net, a multi-channel, end-to-end deep learning framework that suppresses drone noise directly in the time domain and produces multi-channel outputs. Experiments are conducted on real drone-recorded audio with SNR levels ranging from -40 dB to -10 dB. The results show that Spatial-U-Net consistently outperforms several representative speech enhancement methods on three standard metrics: SNR improvement, short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ).
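As a rough illustration of the evaluation protocol (not the authors' code), the sketch below computes the three reported metrics for a single enhanced utterance. It assumes the third-party pystoi and pesq packages and 16 kHz signals; the waveforms here are placeholders standing in for the real recordings.

```python
# Minimal metric-computation sketch, assuming pystoi, pesq, and 16 kHz audio.
import numpy as np
from pystoi import stoi   # short-time objective intelligibility (STOI)
from pesq import pesq     # ITU-T P.862 perceptual evaluation of speech quality

def snr_db(clean, other):
    """SNR of `other` measured against the reference `clean`, in dB."""
    noise = other - clean
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

fs = 16000                                      # assumed sampling rate
clean = np.random.randn(fs * 4)                 # placeholder reference speech
noisy = clean + 10.0 * np.random.randn(fs * 4)  # placeholder low-SNR mixture
enhanced = 0.1 * noisy                          # placeholder enhanced output

# SNR improvement: output SNR minus input SNR.
snr_improvement = snr_db(clean, enhanced) - snr_db(clean, noisy)
stoi_score = stoi(clean, enhanced, fs, extended=False)  # in [0, 1]
pesq_score = pesq(fs, clean, enhanced, 'wb')             # wide-band mode

print(snr_improvement, stoi_score, pesq_score)
```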

During this event, we will demonstrate the effectiveness of the proposed method on microphone recordings captured under extremely low SNR conditions.