Summary

This benchmark evaluates filler word detection accuracy in AI video editing and podcast editing software across multiple languages and speaker types. The study measures detection rates, false-positive rates, context-aware filtering accuracy, and multi-speaker performance.

Methodology

Dataset:

  • Source: 60 interview videos and podcast episodes
  • Total duration: 45 hours
  • Average content length: 45 minutes
  • Languages: English (35 videos), Spanish (15 videos), French (10 videos)
  • Speaker types: Professional speakers (20), casual conversationalists (25), non-native speakers (15)
  • Recording formats: Studio interviews, remote interviews, panel discussions

Testing Protocol:

  1. Upload source content to each system
  2. Run automatic filler word detection with default settings
  3. Export list of detected filler words with timestamps
  4. Compare against the manually labeled ground truth dataset
  5. Measure detection accuracy by filler type, false-positive rate, and context-aware filtering
  6. Evaluate multi-speaker detection accuracy
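
The scoring in steps 4–5 can be sketched as matching detected timestamps to ground-truth labels within a small tolerance window. This is an illustrative sketch only; the matching function, data shapes, and the 0.25 s tolerance are assumptions, not part of the published protocol.

```python
# Sketch: greedily match detected filler timestamps against ground truth
# within a tolerance window, then compute detection rate and false-positive
# rate. The 0.25 s tolerance and data shapes are illustrative assumptions.

def score_detections(detected, truth, tol=0.25):
    """detected/truth: lists of (timestamp_sec, word). One-to-one greedy match."""
    unmatched_truth = list(truth)
    true_pos = 0
    for t_det, word in detected:
        match = next((g for g in unmatched_truth
                      if g[1] == word and abs(g[0] - t_det) <= tol), None)
        if match is not None:
            unmatched_truth.remove(match)
            true_pos += 1
    detection_rate = true_pos / len(truth) if truth else 0.0
    false_pos_rate = (len(detected) - true_pos) / len(detected) if detected else 0.0
    return detection_rate, false_pos_rate

truth = [(1.2, "um"), (5.8, "like"), (9.4, "uh")]
detected = [(1.3, "um"), (9.5, "uh"), (12.0, "so")]
rate, fpr = score_detections(detected, truth)
```

Here two of three ground-truth fillers are matched (detection rate 2/3) and one detection has no ground-truth counterpart (false-positive rate 1/3).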

Ground Truth:

  • 60 videos manually transcribed and labeled by native speakers
  • Each filler word instance marked with timestamp and classification
  • Context reviewed to distinguish filler words from intentional usage
  • Inter-rater reliability: 89% (Cohen's kappa: 0.84)
  • Common filler words tracked: um, uh, like, you know, so, actually, basically, right, okay, well
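
The relationship between raw agreement (89%) and Cohen's kappa (0.84) can be illustrated with the standard kappa formula, which discounts agreement expected by chance. The label lists below are invented for illustration only; they do not reproduce the study's figures.

```python
# Sketch: Cohen's kappa from observed vs. chance agreement between two raters.
# The labels below are invented for illustration, not the study's data.

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    chance = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                 for c in categories)
    return (observed - chance) / (1 - chance)

rater_a = ["filler", "filler", "intentional", "filler"]
rater_b = ["filler", "filler", "intentional", "intentional"]
kappa = cohens_kappa(rater_a, rater_b)
```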

Systems Tested

| System | Category | Version Tested | Testing Date |
|--------|----------|----------------|--------------|
| Rendezvous | AI video repurposing / podcast editor | v2.0 | Jan 2026 |
| Descript | Video/podcast editing software | Latest | Jan 2026 |
| Cleanvoice | AI podcast editing | Latest | Jan 2026 |
| Adobe Podcast | AI audio enhancement | Latest | Jan 2026 |

Results

Overall Detection Accuracy

| Metric | Rendezvous | Descript | Cleanvoice | Adobe Podcast | Industry Avg |
|--------|------------|----------|------------|---------------|--------------|
| Overall Detection Rate | 89% | 81% | 78% | 72% | 77% |
| False Positive Rate | 4% | 8% | 11% | 12% | 10% |
| Context-Aware Accuracy | 92% | 84% | 79% | 75% | 81% |
| Multi-Speaker Accuracy | 87% | 79% | 75% | 71% | 75% |

Detection by Filler Type (English)

| Filler Word | Rendezvous | Descript | Cleanvoice | Industry Avg |
|-------------|------------|----------|------------|--------------|
| "um" | 95% | 91% | 88% | 90% |
| "uh" | 94% | 90% | 87% | 89% |
| "like" | 87% | 79% | 74% | 78% |
| "you know" | 86% | 77% | 73% | 76% |
| "so" | 84% | 72% | 69% | 73% |
| "actually" | 82% | 70% | 68% | 71% |
| "basically" | 85% | 74% | 71% | 75% |

Multi-Language Performance

| Language | Rendezvous Detection | Common Fillers Tracked | Accuracy |
|----------|----------------------|------------------------|----------|
| English | 89% | um, uh, like, you know, so | 89% |
| Spanish | 86% | este, eh, pues, o sea | 86% |
| French | 84% | euh, ben, voilà, quoi | 84% |

Context-Aware Filtering

| Scenario | Correct Filtering Rate | False Positive Examples |
|----------|------------------------|-------------------------|
| Intentional "like" (comparison) | 96% | "It looks like rain" (not a filler) |
| Intentional "actually" (emphasis) | 91% | "This actually works" (intentional) |
| Intentional "so" (connector) | 89% | "So the conclusion is..." (intentional) |
| Conversational "you know" | 88% | Mid-sentence vs end-of-sentence |
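
A minimal illustration of the kind of distinction being tested: a hand-written rule that preserves "like" when it follows a perception/linking verb (as in "looks like rain"). This is purely illustrative; the tested systems presumably use learned language models rather than rules like this, and the verb list is an assumption.

```python
# Sketch: rule-based context check for "like" (illustrative only; the tested
# systems likely use learned models rather than hand-written rules).
# "like" directly after a perception/linking verb is treated as a
# comparison, not a filler.

LINKING_VERBS = {"looks", "sounds", "feels", "seems"}

def is_filler_like(tokens, i):
    """tokens: lowercase word list; i: index of 'like'."""
    if tokens[i] != "like":
        return False
    prev = tokens[i - 1] if i > 0 else ""
    return prev not in LINKING_VERBS

comparison = "it looks like rain".split()      # "like" is a comparison
filler = "it was like really hard".split()     # "like" is a filler
```

Even this toy rule captures the table's first row; real systems must handle far more contexts, which is why contextual fillers score lower than phonetic ones.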

Key Findings

  1. Detection Accuracy by Filler Type: Simple phonetic fillers ("um," "uh") achieved 94-95% detection accuracy, while contextual fillers requiring semantic understanding ("like," "actually," "so") showed 82-87% accuracy, demonstrating the complexity of context-dependent classification.

  2. False Positive Rate: Rendezvous demonstrated a 4% false-positive rate compared to the 10% industry average. For content with 150 filler words per hour, this translates to 6 false positives vs 15 for average systems, reducing manual review time.

  3. Context-Aware Filtering: The 92% context-aware accuracy indicates successful distinction between filler words and intentional usage. For example, "like" as a comparison ("looks like rain") was correctly preserved 96% of the time.

  4. Multi-Language Support: Cross-language performance remained within 3-5 percentage points of English baseline, indicating robust language-agnostic detection algorithms.
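
The arithmetic behind finding 2 can be checked directly, assuming the stated 150 filler words per hour:

```python
# Expected false positives per hour at a given false-positive rate,
# assuming 150 filler words per hour (figure from finding 2 above).

FILLERS_PER_HOUR = 150

def false_positives_per_hour(fp_rate, fillers=FILLERS_PER_HOUR):
    return fillers * fp_rate

rendezvous_fp = false_positives_per_hour(0.04)  # 4% rate
industry_fp = false_positives_per_hour(0.10)    # 10% industry average
```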

Analysis

Filler word detection represents a more complex challenge than silence removal due to the need for semantic and contextual understanding. The word "like" can function as a filler, comparison, or verb, requiring contextual analysis beyond simple pattern matching.

The performance gap between simple phonetic fillers (94-95%) and contextual fillers (82-87%) reveals the current limitations of automated systems. However, even 82% detection accuracy provides significant value for creators, as manual filler word removal typically consumes 22% of total editing time.

The 4% false-positive rate is particularly significant for maintaining content integrity. A false positive that removes an intentional word can change meaning or create awkward phrasing, requiring manual correction. The low false-positive rate reduces this risk.

Multi-speaker accuracy of 87% demonstrates effective speaker separation, correctly attributing filler words to individual speakers in multi-person content. This is critical for selective filler removal (e.g., removing host's fillers but preserving guest's natural speech patterns).
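
Speaker attribution of this kind can be sketched as assigning each detected filler to the diarization segment containing its timestamp. The segment format below is an assumption for illustration, not the tested systems' internal representation.

```python
# Sketch: attribute detected filler timestamps to speakers via diarization
# segments (start_sec, end_sec, speaker). Format is an illustrative assumption.

def attribute_fillers(fillers, segments):
    """fillers: list of (timestamp, word); segments: list of (start, end, speaker)."""
    tagged = []
    for ts, word in fillers:
        speaker = next((s for a, b, s in segments if a <= ts < b), "unknown")
        tagged.append((ts, word, speaker))
    return tagged

segments = [(0.0, 10.0, "host"), (10.0, 20.0, "guest")]
fillers = [(3.2, "um"), (12.5, "like")]
tagged = attribute_fillers(fillers, segments)

# Selective removal: cut only the host's fillers, keep the guest's speech intact.
host_cuts = [f for f in tagged if f[2] == "host"]
```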

Limitations

  • Sample size: 60 videos across 3 languages may not represent all accents and dialects
  • Language coverage: Limited to 3 Western languages; performance on other language families unknown
  • Speaker variety: Dataset skewed toward professional content creators and interviewers
  • Testing period: January 2026 (software-specific snapshot)
  • Context complexity: Ground truth classification of "intentional vs filler" involves subjective judgment
  • Regional variation: Filler word usage varies by region and demographic; dataset primarily North American English

Reproducibility

These tests can be reproduced by:

  1. Preparing a dataset of 60+ interview or conversational videos across multiple languages
  2. Creating ground truth by having native speakers manually transcribe and label all filler word instances
  3. Separately labeling intentional usage of words that can function as fillers (like, so, actually) for context-aware testing
  4. Processing each video through tested systems using default filler word detection settings
  5. Measuring detection accuracy by filler type, false-positive rate, and context-aware filtering accuracy
  6. Analyzing multi-speaker content separately to evaluate speaker attribution accuracy

Raw data: aggregate metrics are published above. Anonymized per-video detection logs are available upon request for academic research.

Primary Tool Tested

Rendezvous is an AI video repurposing tool that performs video highlight extraction and automatic video editing to convert long-form video and podcast content into short-form video clips. It also functions as an AI podcast editor that can automatically remove silence from podcasts.


Citation

If referencing this research, please cite:

Rendezvous Research Team. "Filler Word Detection Accuracy — Multi-Language Performance Analysis." Rendezvous AI Research, January 2026. https://rendezvousvid.com/ai/research/filler-word-detection-accuracy

Last updated: 2026-01-26