- Pooja Korah and Christy Varghese
- July 30, 2024
IR Face Detection and Recognition using YOLOv8 and FaceNet
Introduction
Face recognition technology is an advanced system that uses intricate algorithms and machine learning methods to recognize or authenticate people based on their distinct facial characteristics. This technology works by taking pictures or video frames of faces, identifying characteristics like the shape of the nose, the distance between the eyes, and the jawline, and then transferring those attributes into a digital representation called a face print. To ensure precise identifications, these face prints are then compared to a database of recognized faces.
Illumination, pose, expression changes, and facial disguises can significantly decrease recognition accuracy. Infrared (IR) imaging is a good solution because changes in visible light have a much smaller impact when illumination is controlled in another part of the spectrum. Face recognition has emerged as a leading physiological biometric for identity authentication in numerous applications, including military, banking, marketing, healthcare, public security, video surveillance, access control for laptops and mobile phones, and law enforcement. One of the major challenges of IR face recognition is handling the distinct thermal patterns that faces exhibit in infrared images. An efficient infrared face recognition method suited to real-time use on edge devices is therefore needed.
IR Face Recognition
A face recognition system mainly involves two tasks: face detection and face verification. Face detection is a core computer vision task that locates human faces in images or video streams. It is the first stage of the system and a precursor to applications such as facial recognition, emotion analysis, and identity verification. Various object detection techniques are available for this purpose; here, the You Only Look Once (YOLO) approach, renowned for its object detection capabilities, is used for face detection.
Face recognition involves the extraction of discriminative facial features from images or videos, followed by comparison with a database of known faces to establish identity. Deep learning, particularly Convolutional Neural Networks (CNNs), has emerged as a powerful tool for face recognition. These networks learn to extract intricate patterns and representations from facial images, capturing both spatial and semantic information essential for distinguishing between individuals. Through extensive training on large-scale datasets, deep learning models can generalize across diverse facial appearances, expressions, poses, and lighting conditions, making them suitable for real-world deployment. Pre-trained models such as ArcFace or FaceNet, trained on large image datasets, can be used for face verification.
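The comparison step described above can be illustrated with a minimal sketch. It assumes that each face has already been turned into an embedding vector by a model such as FaceNet; the function names, the toy three-dimensional vectors, and the threshold of 1.0 are hypothetical and chosen only for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(probe, gallery, threshold=1.0):
    """Return (name, distance) for the closest gallery entry.

    The name is None when even the closest entry is farther away than
    the threshold, i.e. the face is treated as unknown.
    The threshold value is an assumption and must be tuned on real data.
    """
    best_name, best_dist = None, float("inf")
    for name, embedding in gallery.items():
        dist = euclidean(probe, embedding)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > threshold:
        return None, best_dist
    return best_name, best_dist

# Toy gallery with hypothetical 3-D embeddings (real embeddings are much longer).
gallery = {"alice": [0.0, 1.0, 0.0], "bob": [1.0, 0.0, 0.0]}
match = identify([0.1, 0.9, 0.0], gallery)    # close to "alice"
unknown = identify([0.0, 0.0, 5.0], gallery)  # far from every entry
```

In a real system the gallery would hold one or more embeddings per enrolled person, and the threshold would be chosen on a validation set to balance false accepts against false rejects.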
YOLOv8
YOLOv8 builds on earlier YOLO models. It is a convolutional neural network with two primary components: the backbone, which extracts features from the image, and the head, which predicts bounding boxes and class probabilities. The head is itself convolutional and anchor-free, and it operates on features fused at several scales in the style of a feature pyramid network. This multi-scale design allows the model to identify objects of varying sizes and scales.
FaceNet
Face recognition systems often rely on architectures like FaceNet to identify individuals based on facial features. FaceNet, a pioneering deep learning model, introduced an approach to face embedding in which each face is represented as a compact vector (128-dimensional in the original paper) in a continuous Euclidean embedding space, so that the distance between two vectors reflects how similar the faces are. The architecture of FaceNet comprises several key components designed to extract discriminative features from facial images and generate compact embeddings suitable for comparison.
One distinguishing feature of FaceNet is its use of triplet loss during training, a specialized loss function that encourages the model to map similar faces closer together in the embedding space while pushing dissimilar faces apart. By optimizing the embeddings to minimize the distance between embeddings of the same individual and maximize the distance between embeddings of different individuals, FaceNet learns to generate embeddings that are highly discriminative and invariant to variations in pose, expression, and lighting.
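The triplet loss can be made concrete with a small worked example. For an anchor a, a positive p (same person), and a negative n (different person), the loss for one triplet is max(||a - p||^2 - ||a - n||^2 + margin, 0). The sketch below uses plain Python lists and toy two-dimensional vectors; the margin of 0.2 follows the FaceNet paper, while the vectors themselves are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for a single (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin; otherwise it is positive.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)

# Hypothetical unit-length embeddings.
anchor = [1.0, 0.0]
positive = [0.8, 0.6]   # same identity, close to the anchor
negative = [0.0, 1.0]   # different identity, far from the anchor

easy = triplet_loss(anchor, positive, negative)   # already well separated
hard = triplet_loss(anchor, negative, positive)   # roles swapped: violates the margin
```

Only triplets with a non-zero loss contribute a gradient, which is why the paper pays attention to selecting triplets that still violate the margin.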
Challenges
- Building a face recognition system requires loading separate models for both face detection and recognition.
- This can impact performance and decrease the overall FPS (Frames Per Second) of the system.
- To address this challenge and create a more compact model, we can reduce the number of parameters in the YOLO v8n model.
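The effect of running two models per frame can be sketched with simple arithmetic. The function below assumes the two stages run one after the other and that the embedding model runs once per detected face; the timings are hypothetical placeholders, not measurements from this work.

```python
def pipeline_fps(detect_ms, embed_ms, faces_per_frame=1):
    """Frames per second of a sequential detect-then-embed pipeline.

    detect_ms: time for one detector pass over a frame (milliseconds)
    embed_ms: time to embed one face crop (milliseconds)
    faces_per_frame: number of faces embedded in each frame
    """
    frame_ms = detect_ms + faces_per_frame * embed_ms
    return 1000.0 / frame_ms

# Hypothetical timings, for illustration only.
baseline = pipeline_fps(detect_ms=30.0, embed_ms=20.0)  # 50 ms per frame
lighter = pipeline_fps(detect_ms=10.0, embed_ms=20.0)   # 30 ms per frame
```

Because the per-frame times add up, shrinking the detector raises the frame rate of the whole system even when the embedding model is left unchanged.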
Modified YOLOv8
Deeper neural networks encounter challenges during training due to the vanishing gradient problem. This phenomenon occurs as gradients become increasingly smaller with each layer, hindering the learning process. While techniques like skip connections and batch normalization mitigate this issue to some extent, deeper networks don’t always lead to better accuracy. On the other hand, wider neural networks can capture more detailed information, but extremely wide and shallow networks struggle to understand complex patterns.
Thus, the depth (number of layers) and width (number of channels) of the network can be scaled with the depth_multiple and width_multiple options in the YOLOv8 YAML configuration files, adjusting the model’s size to fit varying hardware limitations or performance demands. A value less than 1 reduces the depth or width. The max_channels value acts as an upper limit on the number of channels to prevent the network from becoming too wide.
By using multipliers to control the number of layers and channels, the model can be tailored to specific needs. These multipliers allow customization of the model’s size and computational requirements without altering the underlying architecture. This approach enables fine-tuning to meet various performance goals, making the model adaptable for different scenarios.
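How the multipliers translate into layer and channel counts can be sketched as follows. The rounding rules are an assumption modelled on the Ultralytics model parser (repeat counts rounded and kept at 1 or more, channel counts capped and rounded up to a multiple of 8) and may differ between library versions. The function names and the example layer list are hypothetical.

```python
import math

def scale_depth(repeats, depth_multiple):
    """Scale the repeat count of a block; single-repeat blocks are left as-is."""
    if repeats <= 1:
        return repeats
    return max(round(repeats * depth_multiple), 1)

def scale_width(channels, width_multiple, max_channels, divisor=8):
    """Cap the channel count, scale it, and round up to a multiple of divisor."""
    capped = min(channels, max_channels)
    return int(math.ceil(capped * width_multiple / divisor) * divisor)

# Hypothetical (repeats, channels) pairs as they might appear in a YAML file.
layers = [(1, 64), (3, 128), (6, 256), (3, 1024)]

def scaled(depth_multiple, width_multiple, max_channels):
    """Apply both multipliers to every layer in the example list."""
    return [
        (scale_depth(r, depth_multiple), scale_width(c, width_multiple, max_channels))
        for r, c in layers
    ]

nano_like = scaled(0.33, 0.25, 1024)    # multipliers in the range of the "n" model
smaller = scaled(0.33, 0.125, 1024)     # halving the width multiplier again
```

Halving the width multiplier roughly halves the channels of each layer, and since a convolution's parameter count grows with the product of input and output channels, the parameter count falls faster than the width does.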
Performance Evaluation
Both YOLOv8n and the modified YOLOv8n were trained on a face detection dataset available on Kaggle. The comparison of model size is given below:
| Model | Model Size |
| --- | --- |
| YOLOv8n | 6.2 MB |
| Modified YOLOv8n | 1.2 MB |
Table 1: Comparison of model size
Reducing the number of parameters in YOLOv8n shrinks the model from 6.2 MB to 1.2 MB (Table 1), roughly a fivefold reduction, which in turn substantially increases FPS (Frames Per Second). Comparing a face recognition system built on the original YOLOv8n and FaceNet with one built on the modified, reduced-parameter YOLOv8n and FaceNet showed a 7x increase in FPS.
Conclusion
IR face recognition is crucial for security applications, video surveillance, and nighttime facial recognition tasks. However, loading separate models for face detection (like YOLOv8) and verification (like Facenet) significantly reduces FPS, limiting its use on edge devices. To address this challenge, a method has been proposed to reduce the model size of YOLOv8. This modified, compact YOLOv8 model improves the overall FPS of the face recognition system, making it ideal for real-time applications on resource-constrained devices.
References
- https://github.com/ultralytics/ultralytics
- https://github.com/timesler/facenet-pytorch
- https://www.kaggle.com/datasets/fareselmenshawii/face-detection-dataset
- Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “Facenet: A unified embedding for face recognition and clustering.” Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.