Extending Speech Separation and Evaluation Measures for Meeting Transcription / Thilo von Neumann ; First Reviewer: Prof. Dr. Reinhold Häb-Umbach, Second Reviewer: Priv. Doz. Dr. rer. nat. Ralf Schlüter. Paderborn, 2026
Contents
- Abstract
- Zusammenfassung (German abstract)
- Acknowledgements
- Introduction
- Meeting Transcription Overview
- The Meeting Scenario
- Signal Model
- The Meeting Transcription Task
- The Assignment Problem
- Datasets
- Neural-Network-based Speech Separation
- Clustering-based Diarization
- Evaluation Metrics
- Errors in Meeting Transcription
- Utterance-wise Evaluation
- Signal-to-Distortion Ratio (SDR)
- Attenuation Ratio
- Diarization Error Rate
- Utterance-wise Word Error Rate
- Summary
- Word Error Rates for Meeting Scenarios
- Time-constrained Word Error Rate
- Time-constrained Levenshtein Distance
- Pseudo-word-level Timestamps
- Choosing the Collar Value
- Handling Temporally Overlapping Words
- Relation to Other Metrics
- Meeting-level Word Error Rates
- Speaker-attributed Evaluation With (t)cpWER
- Speaker-agnostic Evaluation With ORC-WER
- Speaker-agnostic Evaluation With MIMO-WER
- Speaker-agnostic System Analysis With DI-cpWER
- Other WER Definitions for Meeting Scenarios
- Efficient Computation of Meeting-level WERs
- System Analysis With Error Visualization
- Analysis
- Datasets and Systems
- Comparison of ORC-WER and MIMO-WER
- Impact of Timestamp Errors on the WERs
- Impact of Segmentation Errors on the WERs
- Impact of Speaker Attribution Errors on the WERs
- Accuracy of the Greedy Algorithm for ORC-WER and DI-cpWER
- Execution Time
- Case Studies
- Summary
- Robust Loss Functions for Meeting Separation
- Conventional Losses for Meeting Separation
- SA-SDR: An Elegant Stabilization of the SDR
- Scale-invariance
- Experiments
- Utterance-level Separation
- Anechoic Meeting-level Separation With Stitching
- Reverberant Meeting-level Separation on LibriCSS
- Summary
- Continuous Speech Separation With Graph-PIT
- Graph-PIT: Idea
- Efficient Decomposition of the Loss Function
- Efficient Computation of the Optimal Permutation for uPIT
- Efficient Computation of the Optimal Assignment for Graph-PIT
- Decomposition of Component-wise Additive Functions
- Decomposition of Quadratic Loss Functions
- Decomposition of Binary Cross Entropy
- Graph-PIT as a Vertex-Weighted Graph Coloring Problem
- Graph Theory
- Vertex-Weighted (Vertex) Graph Coloring
- A Dynamic Programming Algorithm for Solving the Vertex-Weighted Graph Coloring Problem
- Experiments
- Application to Anechoic Meetings
- Application to Artificially Reverberated Meetings
- Algorithm Execution Time
- Discussion
- Summary
- Separation-first Diarization-last Pipeline for Meeting Transcription
- Cross-stream Energy-based Voice Activity Detection (VAD)
- Transcription-supported Sub-segmentation
- Speaker Embedding Clustering
- Application to the LibriCSS Dataset
- Model Architecture and Training
- Speaker-agnostic Evaluation Without Diarization
- Speaker-attributed Evaluation With Diarization
- Literature Comparison
- Summary
- Conclusions
- Appendix
- Neural Network Architectures and Model Configurations
- Common Configuration
- Dual-Path RNN for Anechoic Speech Separation
- BLSTM for Reverberant Speech Separation
- TF-GridNet for Reverberant Speech Separation
- Proof That Every Overlap Graph is Chordal
- Additional Decompositions for Graph-PIT
- Symbols and Notation
- List of Figures
- List of Tables
- Acronyms
- Bibliography
