
Deepfake Techniques Explained: Why Some Work in Real-Time and Others Don't

Niv Amitay, AI Cyber Generation Lead
November 1, 2024

As we've worked on various synthetic media detection challenges, we've found that understanding the technical differences between deepfake types is essential for developing effective detection strategies. Each deepfake variety employs distinct techniques, which not only affect their visual and auditory qualities but also determine whether they can function in real-time applications like video conferencing.

In this post, we'll share insights from our research on the major categories of deepfakes and explore how their technical requirements influence their suitability for different contexts.

Visual Deepfakes: A Taxonomy of Techniques

Through our work analyzing synthetic media, we've identified several primary categories of visual deepfakes, each with distinct characteristics:

1. Puppet Master

This technique transfers an actor's facial expressions, eye movements, and head positions to a target person. The result makes it appear as if the target is performing the same actions as the source actor.

In our analysis, puppet master approaches are particularly effective when the goal is to animate a target with specific expressions or movements from a performance. The source actor essentially "drives" the target's face like a puppeteer controlling a puppet.
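As a rough illustration of the driving side of this pipeline, the sketch below uses MediaPipe's Face Mesh to pull per-frame facial landmarks from a source actor's video. A separate reenactment model, not shown and assumed here, would then consume those landmarks to animate the target; the function name and overall structure are illustrative rather than any specific tool's API.

```python
import cv2
import mediapipe as mp

def extract_driving_landmarks(video_path):
    """Yield per-frame facial landmarks from the source (driving) actor.

    A downstream reenactment model (hypothetical here) would map these
    landmarks onto the target person's face, frame by frame.
    """
    cap = cv2.VideoCapture(video_path)
    face_mesh = mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False,   # track across frames for smoother motion
        max_num_faces=1,
        refine_landmarks=True,     # adds iris points, useful for eye movement
    )
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads BGR
            results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_face_landmarks:
                # 478 (x, y, z) points capturing expression, eye, and head-pose cues
                yield results.multi_face_landmarks[0].landmark
    finally:
        face_mesh.close()
        cap.release()
```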

2. Lip Sync

Lip sync manipulation focuses specifically on altering mouth movements to match different audio. This targeted approach is often used when only the speech content needs to be modified while maintaining the rest of the original video intact.

We've found that lip sync deepfakes are not well-suited for real-time applications: they require significant computational power, which makes them too slow for seamless live use. Although only the speech content changes, moving the lips convincingly means generating and adjusting a larger area of the face than the mouth alone in order to maintain realism. The cost grows further with advanced algorithms such as diffusion models or face generators, which process high-fidelity detail frame by frame. As a result, these techniques demand extensive GPU resources and long rendering times, making them impractical for real-time use.
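To make the real-time constraint concrete, here is a back-of-the-envelope latency check. All the numbers below (denoising steps, per-step GPU time) are assumptions chosen for illustration, not measurements of any particular model.

```python
# Illustrative latency budget for a diffusion-based lip-sync model.
# The step count and per-step cost are assumed values, not benchmarks.
FRAME_RATE = 25                       # typical video-call frame rate
frame_budget_ms = 1000 / FRAME_RATE   # 40 ms available per frame

denoising_steps = 20                  # assumed diffusion steps per frame
ms_per_step = 15                      # assumed GPU time per step for the mouth region
per_frame_ms = denoising_steps * ms_per_step   # 300 ms per frame

print(f"Budget per frame: {frame_budget_ms:.0f} ms")
print(f"Estimated cost per frame: {per_frame_ms} ms")
print(f"Real-time capable: {per_frame_ms <= frame_budget_ms}")   # False
```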

3. Head Swap

This more comprehensive technique replaces an entire head in a video while maintaining realistic movements and expressions. Head swapping typically requires more processing resources than face swapping as it must handle a larger portion of the frame, including hair and sometimes neck areas.

In our testing, head swap approaches generally produce more consistent results when the target has distinctive hair or a head shape that would make face-only swapping more noticeable. Head swapping does not generate the entire head from scratch; it maps one head onto another, which keeps the model fast and makes it well-suited for real-time scenarios.

[Figure: head swapping example showing the source image, the target image, and the resulting head swap]
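Because head swapping composites an existing head onto the target frame rather than synthesizing one, the core operation can be sketched as an align-and-blend step. The example below uses OpenCV's seamless cloning as a stand-in for the blending stage; the alignment (computing the paste position and mask) is assumed to come from a separate head-tracking step that is not shown.

```python
import cv2

def composite_head(source_head, target_frame, head_mask, center):
    """Paste an aligned source head onto the target frame.

    source_head: BGR crop of the (already warped/aligned) source head
    head_mask:   8-bit mask of the head region within source_head
    center:      (x, y) paste position in the target frame, assumed to come
                 from a separate head-tracking step (not shown)
    """
    # Poisson blending hides the seam between the pasted head and the frame
    return cv2.seamlessClone(source_head, target_frame, head_mask,
                             center, cv2.NORMAL_CLONE)
```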

4. Face Swap

Perhaps the most widely known deepfake technique, face swapping replaces one person's face with another's while preserving the original expressions. The face is typically mapped from the forehead to the chin, creating a blend between the target's face and the original video.

Our research suggests that face swapping strikes a balance between computational requirements and visual quality, making modified versions usable in some real-time applications, though often with quality compromises. 

Face swaps are well-suited for real-time applications because they modify only the facial region while preserving the original head shape, hair, and background. This reduces the computational load compared to head swapping, enabling faster processing and smoother real-time performance at frame rates of up to 30 fps.
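Here is a minimal sketch of what a live face-swap loop looks like, assuming a hypothetical `swap_face` callable that takes a BGR frame and returns it with only the facial region replaced. The loop tracks whether each frame stays inside the roughly 33 ms budget that a 30 fps target implies.

```python
import time
import cv2

TARGET_FPS = 30
FRAME_BUDGET_S = 1.0 / TARGET_FPS   # ~33 ms per frame at 30 fps

def run_live_swap(swap_face, camera_index=0):
    """Run a hypothetical face-swap model on a live camera feed."""
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            start = time.perf_counter()
            ok, frame = cap.read()
            if not ok:
                break
            output = swap_face(frame)   # only the facial region is regenerated
            cv2.imshow("face swap", output)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
            elapsed = time.perf_counter() - start
            if elapsed > FRAME_BUDGET_S:
                print(f"Dropped below 30 fps: frame took {elapsed * 1000:.0f} ms")
    finally:
        cap.release()
        cv2.destroyAllWindows()
```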

5. Talking Head Generation

This technique animates a still image to create the appearance of natural speech and movement. Rather than swapping or transferring faces, talking head approaches generate new video frames based on a single reference image.

We've found that talking head models are particularly useful when source video isn't available - only a photograph of the person to be animated is needed.
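The interface for this family of models is notably small: a single reference photograph plus driving audio in, a sequence of frames out. The sketch below is purely hypothetical scaffolding to show that shape; `TalkingHeadModel` does not refer to any real library.

```python
import numpy as np

class TalkingHeadModel:
    """Hypothetical interface for a talking-head generator."""

    def animate(self, reference_image: np.ndarray, driving_audio: np.ndarray,
                fps: int = 25) -> list[np.ndarray]:
        """Generate video frames of the person in reference_image speaking
        the driving_audio. No source video of the person is required."""
        raise NotImplementedError("stand-in for a real generative model")
```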

6. Avatar Creation

Avatar systems generate fully synthetic digital representations that can mimic speech and expressions. Unlike the other techniques that modify existing footage, avatars create entirely new content based on learned characteristics.

In our work, we've seen that avatar systems represent a distinct category that increasingly blurs the line between deepfakes and general AI-generated content.

Audio Deepfakes: Manipulating Voice and Speech

The audio domain has its own set of deepfake techniques that complement or operate independently from visual manipulations:

7. Voice Conversion

This technique modifies a source speaker's voice to sound like a target person while preserving the original speech content, rhythm, and intonation. The speaker's identity changes, but how they speak remains largely intact.
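Conceptually, voice conversion splits speech into "what and how it is said" versus "who is saying it" and reassembles them. The sketch below is a schematic of that pipeline; the content encoder, speaker embedding, and decoder are all hypothetical callables, and only the audio I/O uses a real library (soundfile).

```python
import soundfile as sf  # real library for audio I/O; the model components below are hypothetical

def convert_voice(source_wav_path, target_speaker_embedding,
                  content_encoder, decoder, output_path="converted.wav"):
    """Change the speaker identity of a recording while keeping its content and prosody."""
    audio, sample_rate = sf.read(source_wav_path)
    # Content features: phonetic content, rhythm, and intonation of the source speaker
    content = content_encoder(audio, sample_rate)
    # Re-synthesize the same content in the target speaker's voice
    converted = decoder(content, target_speaker_embedding)
    sf.write(output_path, converted, sample_rate)
    return output_path
```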

8. Text-to-Speech (TTS)

TTS systems generate synthetic speech from written text. Advanced TTS can produce remarkably natural-sounding voices that mimic specific individuals, creating speech that was never actually recorded.
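As a trivially small example of the text-in, speech-out interface, the snippet below uses pyttsx3, a generic offline TTS engine. It does not mimic a specific person's voice; cloning a particular individual requires a neural TTS model trained or conditioned on recordings of that voice.

```python
import pyttsx3  # generic offline TTS; illustrates the interface, not voice mimicry

engine = pyttsx3.init()
engine.setProperty("rate", 160)   # speaking rate in words per minute
engine.say("This sentence was never actually recorded by anyone.")
engine.runAndWait()
```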

9. Voice Cloning

Voice cloning replicates a person's vocal characteristics in an AI model, enabling the generation of new speech in that voice. The distinction from voice conversion is that voice cloning can generate entirely new content, while conversion transforms existing recordings.

Real-Time vs. Offline Applications: Technical Considerations

Through our research and development work, we've observed significant differences in how these technologies perform in real-time versus offline scenarios.

Offline Deepfake Generation

For pre-recorded content like social media videos, the creation process typically involves multiple stages:

  1. Audio preparation: This often begins with TTS or voice cloning to create a synthetic voice track based on manipulated text.

  2. Refinement iterations: Multiple passes to adjust prosody, intonation, and naturalness for the specific context.

  3. Visual selection: Choosing appropriate source footage of the target person.

  4. Visual manipulation: Applying techniques like lip-sync deepfakes to match the mouth movements to the synthesized audio.

  5. Post-processing: Final adjustments to enhance realism and hide artifacts.

This multi-stage approach can produce highly convincing results but requires significant processing time, computational resources, and often human intervention to achieve optimal quality.
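The offline pipeline above is essentially sequential batch processing, which is why quality can be high and latency is irrelevant. A schematic orchestration of those stages might look like the sketch below; every callable passed in is hypothetical and stands in for an entire model or a manual editing step.

```python
def produce_offline_deepfake(script_text, target_voice, source_footage, stages):
    """Schematic of the multi-stage offline pipeline.

    `stages` is a dict of hypothetical callables, one per pipeline stage;
    each stands in for a full model or a manual editing step.
    """
    # 1. Audio preparation: synthesize speech for the manipulated script
    audio = stages["tts"](script_text, target_voice)

    # 2. Refinement iterations: multiple passes over prosody and naturalness
    for _ in range(3):
        audio = stages["refine_prosody"](audio)

    # 3. Visual selection: pick suitable footage of the target person
    clip = stages["select_clip"](source_footage)

    # 4. Visual manipulation: match mouth movements to the synthesized audio
    video = stages["lip_sync"](clip, audio)

    # 5. Post-processing: hide artifacts and polish the final render
    return stages["post_process"](video)
```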

Real-Time Deepfake Challenges

For applications like live video calls or streaming, the requirements change dramatically:

  1. Latency constraints: Processing must occur with minimal delay, typically under 100ms to maintain conversational flow.

  2. Resource limitations: Computations must be efficient enough to run on standard hardware, often without specialized GPUs.

  3. Single-pass processing: There's no opportunity for multiple refinement iterations.

  4. Synchronized manipulation: Audio and visual elements must be modified simultaneously and remain perfectly aligned.

Based on our testing, real-time deepfake applications typically employ a streamlined pipeline:

  1. Voice conversion that preserves the original prosody while changing speaker identity.

  2. Simultaneous face swapping synchronized with the modified audio.

These approaches sacrifice some quality for speed, creating a fundamental trade-off between realism and real-time functionality.
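Put together, the streamlined real-time pipeline boils down to processing one audio chunk and one video frame per iteration, in lockstep, inside the latency budget. The loop below is a schematic of that idea; `convert_voice_chunk` and `swap_face` are hypothetical callables, and the budget figure echoes the constraint mentioned above.

```python
import time

LATENCY_BUDGET_S = 0.100   # ~100 ms end-to-end to keep conversation natural

def realtime_pipeline(audio_chunks, video_frames, convert_voice_chunk, swap_face):
    """Process paired audio/video in lockstep; both callables are hypothetical models."""
    for chunk, frame in zip(audio_chunks, video_frames):
        start = time.perf_counter()

        out_audio = convert_voice_chunk(chunk)   # identity changes, prosody preserved
        out_frame = swap_face(frame)             # single pass, no refinement iterations

        elapsed = time.perf_counter() - start
        if elapsed > LATENCY_BUDGET_S:
            print(f"Over budget: {elapsed * 1000:.0f} ms for this audio/video pair")

        yield out_audio, out_frame               # emitted together to stay aligned
```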

Detection Implications

The technical distinctions between offline and real-time deepfakes have important implications for detection approaches:

  • Offline deepfakes often contain subtle artifacts from multiple processing stages, but creators have time to minimize these tells. Detection systems need to look for inconsistencies across the entire content.

  • Real-time deepfakes typically contain more prominent artifacts due to computational constraints, but these occur consistently throughout the media. Detection can focus on these systematic patterns.

In our development of detection tools, we've found that understanding these differences helps us design more effective approaches tailored to specific threat models and use cases.
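One way this distinction shows up in practice: if each frame is scored with some artifact detector, offline deepfakes tend to produce uneven scores (artifacts cluster where the creator missed something), while real-time ones produce steadier, systematic scores. The sketch below only aggregates per-frame scores; `artifact_score` itself is a hypothetical detector.

```python
import statistics

def summarize_artifact_scores(frames, artifact_score):
    """Aggregate per-frame scores from a hypothetical artifact detector.

    A high mean with low variance suggests systematic, pipeline-wide artifacts
    (typical of real-time generation); a spiky, high-variance profile suggests
    localized slips in an otherwise polished offline edit.
    """
    scores = [artifact_score(frame) for frame in frames]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "max": max(scores),
    }
```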

Ongoing Evolution

The line between offline and real-time capabilities continues to blur as techniques improve and computational resources advance. What required extensive offline processing yesterday may be possible in real-time tomorrow.

Our team continues to study these evolving techniques, learning from both research publications and real-world samples to better understand the technical foundations of different deepfake types. This understanding informs not only our detection approaches but also helps organizations prepare for how these technologies might be deployed in various contexts.

This post is part of our ongoing exploration of synthetic media technologies and detection approaches. We believe that understanding the technical foundations of these systems helps build more effective and responsible approaches to media authentication.