Introduction
With the proliferation of deepfake technology, the ability to discern real from manipulated content has become a critical concern. Deepfake videos, often indistinguishable from authentic footage, pose serious threats to privacy, security, and the integrity of information.
In response to this challenge, cutting-edge research in computer vision and machine learning has yielded innovative solutions for detecting deepfakes. Among these, the fusion of convolutional neural networks (ConvNets) and attention-based models has emerged as a promising approach.
In this project, we present an in-depth exploration of deepfake detection using ConvNets with attention, specifically focusing on the CoAtNet architecture. CoAtNet, a novel family of image recognition models, seamlessly integrates ConvNets and attention mechanisms, offering a powerful tool for analyzing facial images extracted from videos.
Data processing
We used only 3 of the 50 data chunks of the DFDC (DeepFake Detection Challenge) dataset for our project, which amounts to roughly 30 GB of video.
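For context, here is a minimal sketch of how the labels for these chunks can be collected. It assumes the public DFDC layout, in which each chunk folder (e.g. dfdc_train_part_00) ships a metadata.json mapping each video file to a REAL/FAKE label; the directory paths are illustrative, not our exact setup.

import json
from pathlib import Path

# Illustrative paths: three DFDC chunks, e.g. dfdc_train_part_00 .. dfdc_train_part_02.
CHUNK_DIRS = [Path(f"data/dfdc_train_part_{i:02d}") for i in range(3)]

samples = []
for chunk in CHUNK_DIRS:
    # Each chunk folder contains a metadata.json mapping video name -> {"label": "REAL"/"FAKE", ...}
    with open(chunk / "metadata.json") as f:
        metadata = json.load(f)
    for video_name, info in metadata.items():
        samples.append((chunk / video_name, 1 if info["label"] == "FAKE" else 0))

print(f"Collected {len(samples)} labelled videos from {len(CHUNK_DIRS)} chunks")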
Overview of preprocessing steps:
Face Extraction Using BlazeFace:
BlazeFace is a lightweight and efficient face-detection model. It is utilized to extract faces from each frame of the videos in the DFDC dataset. This step ensures that only the relevant facial regions are considered for analysis. Here are some examples of face extractions from a video sample.
Real video:
DeepFake video:
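Below is a minimal sketch of this face-extraction step, using OpenCV to decode frames. detect_faces is a hypothetical stand-in for the BlazeFace detector (its exact API depends on the implementation used), assumed to return pixel bounding boxes (x1, y1, x2, y2); the frame-sampling rate and crop margin are illustrative choices.

import cv2

def extract_faces(video_path, every_n_frames=10, margin=0.2):
    """Sample frames from a video and return 224x224 face crops."""
    faces = []
    cap = cv2.VideoCapture(str(video_path))
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            h, w = frame.shape[:2]
            # detect_faces is a hypothetical stand-in for the BlazeFace detector,
            # assumed to return pixel boxes (x1, y1, x2, y2) for each detected face.
            for (x1, y1, x2, y2) in detect_faces(frame):
                dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)  # keep some context around the face
                crop = frame[max(0, y1 - dy):min(h, y2 + dy),
                             max(0, x1 - dx):min(w, x2 + dx)]
                faces.append(cv2.resize(crop, (224, 224)))
        frame_idx += 1
    cap.release()
    return faces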
Normalization of Pixel Values:
After face extraction, the pixel values of the extracted facial images are normalized. Normalization first scales pixel values to a common range (0 to 1) and then standardizes them per channel with fixed mean and standard deviation values, which improves the convergence and stability of the training process.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
These mean and standard deviation values were provided with the DFDC Kaggle competition (they are the standard ImageNet normalization statistics); we use them to normalize the pixel values.
The figure above shows an example of normalized pixel values: we first normalized the image and then applied the inverse normalization to visualize the result.
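A sketch of this normalization (and the inverse transform used only for visualization) with torchvision transforms; these are standard torchvision APIs, though the exact code in our pipeline may differ slightly.

import torch
from torchvision import transforms

MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

# Convert a face crop (HWC uint8) to a CHW float tensor and standardize it.
normalize = transforms.Compose([
    transforms.ToTensor(),            # scales pixels to [0, 1]
    transforms.Normalize(MEAN, STD),  # then standardizes per channel
])

# Inverse of the normalization above, used only to visualize normalized images.
inv_normalize = transforms.Normalize(
    mean=[-m / s for m, s in zip(MEAN, STD)],
    std=[1 / s for s in STD],
)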
Augmentation Techniques with Albumentations:
Augmentation techniques from the Albumentations library are applied to increase the diversity and robustness of the training dataset. Albumentations introduces variations in the training data by applying transformations such as rotation, flipping, scaling, and color adjustments to the facial images. The transforms we use are listed below, followed by a sketch of the full pipeline.
#Basic Geometric Transformations
RandomRotate90: Rotates the image by 90, 180, or 270 degrees (applied with probability p=0.2).
Transpose: Flips rows and columns (potentially useful for text or certain object orientations, with p=0.2).
HorizontalFlip: Mirrors the image horizontally (p=0.5).
VerticalFlip: Mirrors the image vertically (p=0.5).
#Random Effects:
OneOf([GaussNoise()], p=0.2): Adds random Gaussian noise to the image with a 0.2 probability (within a OneOf group, only one of the listed transforms is applied).
#Combined Transformations:
ShiftScaleRotate: Applies a combination of random shift, scale, and rotation in a single step (p=0.2).
#Pixel-Level Adjustments:
OneOf([CLAHE(clip_limit=2), Sharpen(), Emboss(), RandomBrightnessContrast()], p=0.2): Within this group, one of these transformations is applied with a 0.2 probability:
CLAHE: Contrast Limited Adaptive Histogram Equalization (improves local contrast).
Sharpen: Enhances image edges.
Emboss: Creates a raised or sunken effect.
RandomBrightnessContrast: Randomly adjusts brightness and contrast.
#Color Adjustments:
HueSaturationValue: Randomly modifies the image’s hue (color), saturation (intensity), and value (brightness) with a 0.2 probability.
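Putting the list above together, here is a sketch of the augmentation pipeline in Albumentations (assuming a recent Albumentations release; transform names and defaults can vary slightly between versions).

import albumentations as A

train_transforms = A.Compose([
    # Basic geometric transformations
    A.RandomRotate90(p=0.2),
    A.Transpose(p=0.2),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    # Random effects
    A.OneOf([A.GaussNoise()], p=0.2),
    # Combined shift / scale / rotate
    A.ShiftScaleRotate(p=0.2),
    # Pixel-level adjustments (one of the four is picked)
    A.OneOf([
        A.CLAHE(clip_limit=2),
        A.Sharpen(),
        A.Emboss(),
        A.RandomBrightnessContrast(),
    ], p=0.2),
    # Color adjustments
    A.HueSaturationValue(p=0.2),
])

# Usage: augmented_face = train_transforms(image=face_rgb)["image"]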
Temporal Consistency vs. Face Extraction and Classification
- Temporal Consistency refers to maintaining coherence across sequential frames in video analysis, often achieved through models integrating time-based architectures like LSTM or GRU to capture temporal dependencies. However, recent advancements demonstrate that face extraction and classification alone can yield effective results without explicitly modeling temporal relationships.
- By focusing solely on face extraction and classification, without considering temporal consistency, the model can efficiently detect deepfake content while simplifying the architecture and reducing computational complexity.
- We focused solely on detecting image and video manipulations and ignored audio, whereas many of the current best models detect audio manipulations as well.
- Several current state-of-the-art detectors leverage efficient vision transformers, such as the Cross Efficient Vision Transformer (CEViT), which combines the efficiency of vision transformers with cross-attention-based feature fusion for improved performance across various computer vision tasks.
CoAtNet Architecture
CoAtNet is a new family of image recognition models that combines the strengths of convolutional neural networks (ConvNets) and attention-based models (such as Transformers). CoAtNet models are designed for efficient image classification, making them well suited for processing large volumes of facial images extracted from videos.
- The CoAtNet architecture comprises five stages (S0, S1, S2, S3, S4), each tailored to specific characteristics of the data and task at hand. Beginning with a simple 2-layer convolutional stem in S0, the subsequent stages employ a combination of MBConv blocks with squeeze-excitation (SE) and Transformer blocks.
- To balance generalization and capacity, convolution stages are placed before Transformer stages, leveraging convolution's proficiency at processing the local patterns common in early stages. This leads to four variants: C-C-C-C, C-C-C-T, C-C-T-T, and C-T-T-T, with increasing numbers of Transformer stages. Through their experiments, the CoAtNet authors determined that the C-C-T-T configuration yields the best balance between generalization ability and model capacity; this is the layout we use, as sketched below.
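For illustration, assuming the CoAtNet implementation used later in this project exposes the stage layout through a block_types argument (as some public PyTorch ports do; this argument is an assumption, not confirmed project code), the C-C-T-T variant could be selected like this:

# 'C' = MBConv (convolution) stage, 'T' = Transformer (relative-attention) stage.
model_cctt = CoAtNet(image_size=(224, 224), in_channels=3,
                     num_blocks=[2, 2, 3, 5, 2],        # depths of stages S0..S4
                     channels=[64, 96, 192, 384, 768],  # channel widths of stages S0..S4
                     num_classes=2,
                     block_types=['C', 'C', 'T', 'T'])  # the C-C-T-T variant for stages S1..S4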
Reference: CoAtNet: Marrying Convolution and Attention for All Data Sizes (Dai et al., arXiv:2106.04803). From the abstract:
"Transformers have attracted increasing interests in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present CoAtNets (pronounced "coat" nets), a family of hybrid models built from two key insights: (1) depthwise Convolution and self-Attention can be naturally unified via simple relative attention; (2) vertically stacking convolution layers and attention layers in a principled way is surprisingly effective in improving generalization, capacity and efficiency. Experiments show that our CoAtNets achieve state-of-the-art performance under different resource constraints across various datasets: Without extra data, CoAtNet achieves 86.0% ImageNet top-1 accuracy; When pre-trained with 13M images from ImageNet-21K, our CoAtNet achieves 88.56% top-1 accuracy, matching ViT-huge pre-trained with 300M images from JFT-300M while using 23x less data; Notably, when we further scale up CoAtNet with JFT-3B, it achieves 90.88% top-1 accuracy on ImageNet, establishing a new state-of-the-art result."
Project Architecture
Our approach in this project is to use CoAtNet-0. The CoAtNet authors proposed five reference architectures (CoAtNet-0 to CoAtNet-4); CoAtNet-0 is the smallest of them, and we chose it to keep our detector small and compact. Here is a brief explanation of our model layers:
from torchsummary import summary  # assumed: the layer listing below matches torchsummary's output format
model = CoAtNet(image_size=(224, 224), in_channels=3, num_blocks=[2, 2, 3, 5, 2],
                channels=[64, 96, 192, 384, 768], num_classes=2)  # stages S0..S4; two classes: real vs. fake
summary(model, (3, 224, 224))
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 112, 112] 1,728
BatchNorm2d-2 [-1, 64, 112, 112] 128
GELU-3 [-1, 64, 112, 112] 0
Conv2d-4 [-1, 64, 112, 112] 36,864
BatchNorm2d-5 [-1, 64, 112, 112] 128
GELU-6 [-1, 64, 112, 112] 0
MaxPool2d-7 [-1, 64, 56, 56] 0
Conv2d-8 [-1, 96, 56, 56] 6,144
BatchNorm2d-9 [-1, 64, 112, 112] 128
Conv2d-10 [-1, 256, 56, 56] 16,384
BatchNorm2d-11 [-1, 256, 56, 56] 512
GELU-12 [-1, 256, 56, 56] 0
Conv2d-13 [-1, 256, 56, 56] 2,304
BatchNorm2d-14 [-1, 256, 56, 56] 512
GELU-15 [-1, 256, 56, 56] 0
AdaptiveAvgPool2d-16 [-1, 256, 1, 1] 0
Linear-17 [-1, 16] 4,096
GELU-18 [-1, 16] 0
Linear-19 [-1, 256] 4,096
Sigmoid-20 [-1, 256] 0
SE-21 [-1, 256, 56, 56] 0
Conv2d-22 [-1, 96, 56, 56] 24,576
BatchNorm2d-23 [-1, 96, 56, 56] 192
PreNorm-24 [-1, 96, 56, 56] 0
MBConv-25 [-1, 96, 56, 56] 0
BatchNorm2d-26 [-1, 96, 56, 56] 192
Conv2d-27 [-1, 384, 56, 56] 36,864
BatchNorm2d-28 [-1, 384, 56, 56] 768
GELU-29 [-1, 384, 56, 56] 0
Conv2d-30 [-1, 384, 56, 56] 3,456
BatchNorm2d-31 [-1, 384, 56, 56] 768
GELU-32 [-1, 384, 56, 56] 0
AdaptiveAvgPool2d-33 [-1, 384, 1, 1] 0
Linear-34 [-1, 24] 9,216
GELU-35 [-1, 24] 0
Linear-36 [-1, 384] 9,216
Sigmoid-37 [-1, 384] 0
SE-38 [-1, 384, 56, 56] 0
Conv2d-39 [-1, 96, 56, 56] 36,864
BatchNorm2d-40 [-1, 96, 56, 56] 192
PreNorm-41 [-1, 96, 56, 56] 0
MBConv-42 [-1, 96, 56, 56] 0
MaxPool2d-43 [-1, 96, 28, 28] 0
Conv2d-44 [-1, 192, 28, 28] 18,432
BatchNorm2d-45 [-1, 96, 56, 56] 192
Conv2d-46 [-1, 384, 28, 28] 36,864
BatchNorm2d-47 [-1, 384, 28, 28] 768
GELU-48 [-1, 384, 28, 28] 0
Conv2d-49 [-1, 384, 28, 28] 3,456
BatchNorm2d-50 [-1, 384, 28, 28] 768
GELU-51 [-1, 384, 28, 28] 0
AdaptiveAvgPool2d-52 [-1, 384, 1, 1] 0
Linear-53 [-1, 24] 9,216
GELU-54 [-1, 24] 0
Linear-55 [-1, 384] 9,216
Sigmoid-56 [-1, 384] 0
SE-57 [-1, 384, 28, 28] 0
Conv2d-58 [-1, 192, 28, 28] 73,728
BatchNorm2d-59 [-1, 192, 28, 28] 384
PreNorm-60 [-1, 192, 28, 28] 0
MBConv-61 [-1, 192, 28, 28] 0
BatchNorm2d-62 [-1, 192, 28, 28] 384
Conv2d-63 [-1, 768, 28, 28] 147,456
BatchNorm2d-64 [-1, 768, 28, 28] 1,536
GELU-65 [-1, 768, 28, 28] 0
Conv2d-66 [-1, 768, 28, 28] 6,912
BatchNorm2d-67 [-1, 768, 28, 28] 1,536
GELU-68 [-1, 768, 28, 28] 0
AdaptiveAvgPool2d-69 [-1, 768, 1, 1] 0
Linear-70 [-1, 48] 36,864
GELU-71 [-1, 48] 0
Linear-72 [-1, 768] 36,864
Sigmoid-73 [-1, 768] 0
SE-74 [-1, 768, 28, 28] 0
Conv2d-75 [-1, 192, 28, 28] 147,456
BatchNorm2d-76 [-1, 192, 28, 28] 384
PreNorm-77 [-1, 192, 28, 28] 0
MBConv-78 [-1, 192, 28, 28] 0
BatchNorm2d-79 [-1, 192, 28, 28] 384
Conv2d-80 [-1, 768, 28, 28] 147,456
BatchNorm2d-81 [-1, 768, 28, 28] 1,536
GELU-82 [-1, 768, 28, 28] 0
Conv2d-83 [-1, 768, 28, 28] 6,912
BatchNorm2d-84 [-1, 768, 28, 28] 1,536
GELU-85 [-1, 768, 28, 28] 0
AdaptiveAvgPool2d-86 [-1, 768, 1, 1] 0
Linear-87 [-1, 48] 36,864
GELU-88 [-1, 48] 0
Linear-89 [-1, 768] 36,864
Sigmoid-90 [-1, 768] 0
SE-91 [-1, 768, 28, 28] 0
Conv2d-92 [-1, 192, 28, 28] 147,456
BatchNorm2d-93 [-1, 192, 28, 28] 384
PreNorm-94 [-1, 192, 28, 28] 0
MBConv-95 [-1, 192, 28, 28] 0
MaxPool2d-96 [-1, 192, 14, 14] 0
Conv2d-97 [-1, 384, 14, 14] 73,728
MaxPool2d-98 [-1, 192, 14, 14] 0
Rearrange-99 [-1, 196, 192] 0
LayerNorm-100 [-1, 196, 192] 384
Linear-101 [-1, 196, 768] 147,456
Softmax-102 [-1, 8, 196, 196] 0
Linear-103 [-1, 196, 384] 98,688
Dropout-104 [-1, 196, 384] 0
Attention-105 [-1, 196, 384] 0
PreNorm-106 [-1, 196, 384] 0
Rearrange-107 [-1, 384, 14, 14] 0
Rearrange-108 [-1, 196, 384] 0
LayerNorm-109 [-1, 196, 384] 768
Linear-110 [-1, 196, 768] 295,680
GELU-111 [-1, 196, 768] 0
Dropout-112 [-1, 196, 768] 0
Linear-113 [-1, 196, 384] 295,296
Dropout-114 [-1, 196, 384] 0
FeedForward-115 [-1, 196, 384] 0
PreNorm-116 [-1, 196, 384] 0
Rearrange-117 [-1, 384, 14, 14] 0
Transformer-118 [-1, 384, 14, 14] 0
Rearrange-119 [-1, 196, 384] 0
LayerNorm-120 [-1, 196, 384] 768
… (remainder of the model summary truncated)
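As a quick sanity check (illustrative, not part of the training code), a single 224x224 face crop run through the model should produce two logits, one per class:

import torch

model.eval()
with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)   # one RGB face crop
    logits = model(dummy)                 # expected shape: (1, 2)
    probs = torch.softmax(logits, dim=1)  # per-class probabilities (ordering depends on label encoding)
print(probs)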
Author: Nikhil Reddy