I have read quite a bit of the literature on face detection, and it took me a while before the juice of understanding started to flow, because the fundamentals were so difficult to grasp at the beginning.
To say the least, face detection is not straightforward: a lot of details have to be mastered, after which everything comes easily. So I thought I would share my understanding of the subject to help newcomers come up to speed quickly. I have tried my best to keep the presentation simple, with illustrative diagrams.
The article is quite lengthy, as it is intended to be a comprehensive introduction to the subject, and it requires patience to follow the logic through.
Let’s start this presentation by
looking at some important definitions. So what is face detection?
Introduction
Face detection is the process of detecting face(s) in a digital image. A digital image is a computer file that represents a picture and has properties such as resolution, image size, and colour (or grayscale). Colour is often defined as a measure of the intensities of red, green, and blue.
According to Burgin et al. (2011), face detection is one of the most thoroughly explored research areas in computer vision. Face detection should eventually help improve human-computer interaction.
Face detection has been used extensively in biometric applications. Compared with other biometrics (Yi et al., 2013), the face is superior because of its non-intrusive nature; its counterparts (iris, fingerprint, finger geometry, hand geometry, hand veins, palm, retina, and voice) all require a voluntary action by the person. Face detection can be done covertly, from a distance, and passively, without the explicit participation of the person.
Face detection involves the detection of one or more faces in an image, in contrast to face (facial) recognition systems, which have the ability to recognize or identify people by their facial characteristics directly from playback video, a live feed, or still images.
Face recognition is different from face detection; face recognition comprises four sub-steps: face detection, which finds the faces in an image; normalization, which enhances the quality of the facial image; feature extraction, which extracts salient facial features useful for distinguishing faces; and face matching, which matches the face against one or more enrolled faces in a database. Face detection is thus a prerequisite first step in face recognition, which is mainly used for verification and identification of a person.
The process of face detection begins with the image file, which is stored in a computer's memory, processed, and displayed on a computer screen. Face detection is a subset of object detection. Object detection systems are generalized systems that can detect any object the classifier has been trained to identify: a face, a human form, a cat, a shoe, a banana, an apple, anything that is an object. Related areas include gait analysis, the study of human walking patterns.
Applications
Face detection applications are currently helping blind people 'perceive' photos through mobile apps, and are helping disabled people improve their daily lives, for instance as a hands-free alternative input. Face detection also features in facial tracking for safety applications, such as operating heavy or nuclear equipment, and in detecting sleepiness or lack of attention while driving or using hazardous machines.
Further, it is used in video applications, chewing analysis, facial motion analysis, and so on. The goal of a face detection system is to determine whether or not there are any faces in the image field and, if present, to return the location and extent of each face (Markus, 2012).
Face detection is often affected by pose, rotation, poor illumination of the image or face, facial expressions, occlusion, the distance of the face in the image, and so on. In practical scenarios, for instance a video surveillance feed, the captured faces are often non-ideal or unconstrained: poorly lit, with poor visibility because of weather conditions, or at distances that make the faces small in the image, all of which are potential problems for the system.
For instance, there are two possible scenarios in which this technology may be used: constrained (or controlled) and unconstrained (or uncontrolled) environments.
A constrained environment is a scenario in which the faces in the image are ensured to be frontal, i.e. facing the camera (though the head may sometimes be rotated, as in the figure above), so that the algorithm can detect the face(s) more easily. Conversely, facial images in an unconstrained environment, that is, faces looking away from the camera, may be more difficult to detect because of the angular pose or rotation of the faces.
Figure: Unconstrained scenario, with detected faces of people at MM-Airport-2, Ikeja, Lagos (image source: The News).
In 2001, Paul Viola and Michael Jones (often just "Viola-Jones") invented a framework for detecting objects, which was refined for face detection. The framework later became the de facto industry standard for implementing object and face detection systems.
Further, according to Bober and Paschalakis (2003), several methodologies exist for implementing face detection systems:
(I) Appearance-based: this approach treats face detection as a recognition problem and in general does not utilize human knowledge of facial appearance, relying instead on machine learning and statistical methods to ascertain such characteristics.
(II) Feature-based: this approach involves detecting human facial properties such as the eyes, the nose, or parts or combinations thereof, and usually relies on low-level operators and heuristic rules about the structure of the human face.
(III) Colour-based: this technique may be described as a subset of the feature-based method, where the feature in question is the colour of human skin.
In their novel object detection framework, Paul Viola and Michael Jones discovered a way to utilize image filters (known as Haar rectangle features) to probe images for the presence or absence of faces. The filters were implemented to locate facial features in an image. These rectangle filters, or features, were based on the work of the Hungarian mathematician Alfréd Haar.
Figure: Haar rectangle features.
Viola-Jones used four types of rectangle Haar features in their framework: the first is a light block next to a dark one, horizontally; the second is the same but vertically; the third is a light block sandwiched between two dark blocks (or vice versa); and the fourth comprises four rectangles, as shown.
Face Detection Architecture
A typical face detection architecture is shown below; this architecture is basic and can include more modules depending on the complexity of the design.
Figure: Face detection architecture.
The modules in a face detection system include an image store, an image cropper, an image scaler, integral image generation, and a classifier.
Image Store
The image to be tested is stored in RAM and is first converted to a grayscale image, defined as a set of intensity values from 0 to 255. Next, we sum the intensity values in various rectangular blocks. Using these sums, we can detect darker blocks adjacent to lighter blocks (TechRadar, 2010).
This module stores the image data arriving from the input, which could be a hard drive, an SD card, or a camera sensor directly, in RAM. Some architectures may include a frame grabber module that handles the image frames from the different input sources.
This module transfers the image data to
the classifier module based on the scale information from the image scaler
module (Cho et al., 2009).
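As a small software illustration of this step, the sketch below loads an image with OpenCV and converts it to grayscale, giving a single channel of intensity values from 0 to 255; the file name is a placeholder, and this is a software stand-in for what the hardware module above does.

```python
import cv2

# Read the test image (placeholder file name) and convert it to a
# grayscale array of intensity values from 0 to 255.
image = cv2.imread("test_image.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(gray.shape, gray.min(), gray.max())
```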
The Image Cropper
The image cropper module is next in the face detection system. The module uses a sub-window that holds part of the image in the image store. The sub-window scans the image from left to right (one row at a time), shifting one pixel to the right at each step, and then advances to the next row of the image. This means that if the scan (test) image has a width x height of 320 x 240, it has 320 pixel columns and 240 pixel rows. In Viola-Jones' experiments, a sub-window with a width and height of 24 x 24 was used.
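A minimal Python sketch of this scanning pattern follows; the function name is an illustrative assumption, while the 24 x 24 window and one-pixel step come from the description above.

```python
def sub_windows(width, height, win=24, step=1):
    # Yield the top-left corner of every win x win sub-window, scanning
    # left to right one pixel at a time, then down to the next row.
    for y in range(0, height - win + 1, step):
        for x in range(0, width - win + 1, step):
            yield x, y

# A 320 x 240 image yields (320 - 24 + 1) * (240 - 24 + 1) = 64,449
# sub-window positions at this single scale.
print(sum(1 for _ in sub_windows(320, 240)))
```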
There are three main scaling algorithms used in the face detection routine: nearest-neighbour, bilinear, and poly-phase.
The Image Scaler
Most of the time a photo contains several faces of various sizes. To allow the face detection algorithm to locate those faces easily, different sub-window sizes are used. As mentioned in the last section, Viola-Jones used a sub-window of 24 x 24 (width x height), which means that faces smaller than the width and height of the sub-window may not be detected. To handle this problem, several sub-window sizes are used; for instance, you could use a 10 x 10 sub-window in one case and a 50 x 50 sub-window in another. The module provides the means to resize the sub-window so as to increase the chances of detecting all the faces in an image irrespective of their sizes. The images are scaled down or up by the image scaler module based on a scale factor (Mirzaei, 2011).
The image scaler module generates and transfers the address of the RAM location containing the frame image in the image store module, requesting image data according to a scale factor. The image store module then transfers the pixel data to the classifier module based on the RAM address requested by the image scaler module (Mirzaei, 2011).
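A common software counterpart to this module is an image pyramid: rather than growing the sub-window, the image is repeatedly shrunk by a scale factor while the 24 x 24 sub-window stays fixed. Below is a minimal sketch of that idea, assuming an illustrative scale factor of 1.25 and a minimum size equal to the sub-window.

```python
import cv2

def image_pyramid(gray, scale_factor=1.25, min_size=24):
    # Yield progressively smaller copies of the image; scanning each copy
    # with a fixed 24 x 24 sub-window is equivalent to scanning the
    # original image with progressively larger sub-windows.
    while min(gray.shape[:2]) >= min_size:
        yield gray
        new_w = int(gray.shape[1] / scale_factor)
        new_h = int(gray.shape[0] / scale_factor)
        gray = cv2.resize(gray, (new_w, new_h))
```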
Integral Image
The rectangle features can be at an arbitrary scale or at any position in the sub-window image (during the training process, explained in the next section). The feature value is calculated as the sum of the pixel intensities in the light rectangle(s) minus the sum of the pixels in the dark rectangle(s). The value of the feature is then used in a classifier to determine whether that feature is present in the original image.
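As a quick illustration, here is a minimal Python/NumPy sketch that computes the value of a horizontal two-rectangle feature directly from the pixel intensities; the function name and the light-on-the-left convention are illustrative assumptions, not part of the Viola-Jones code.

```python
import numpy as np

def two_rect_feature(patch, x, y, w, h):
    # Sum of the light (left) block minus the sum of the dark (right)
    # block, for a horizontal two-rectangle feature at (x, y).
    light = patch[y:y + h, x:x + w].astype(np.int64).sum()
    dark = patch[y:y + h, x + w:x + 2 * w].astype(np.int64).sum()
    return light - dark

# Toy 24 x 24 grayscale patch of random intensities (0-255).
patch = np.random.randint(0, 256, (24, 24), dtype=np.uint8)
print(two_rect_feature(patch, 0, 0, 12, 24))
```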
Computing these sums directly is slow, but the whole process can be sped up using a technique known as the integral image. The integral image at a given point is simply the sum of the pixels of the rectangle from the upper-left corner of the image to that point, and it can be calculated for every point in the original image in a single pass across the image. Using the integral image values at the four corners of a rectangle, the sum of the pixels within it can be computed quickly (Cho et al., 2009).
Figure: Integral image computation.
According to Viola-Jones, the sum of
the pixels within rectangle D can be
computed with four array references. The value of the integral image at
location 1 is the sum of the pixels
in rectangle A. The value at location
2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The sum within D can be
computed as 4 + 1 – (2 + 3).
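Here is a minimal NumPy sketch of the integral image and the four-corner lookup just described; the padding row and column of zeros is an implementation convenience, not part of the original description.

```python
import numpy as np

def integral_image(gray):
    # ii[y, x] holds the sum of all pixels above and to the left of
    # (y, x); a leading row and column of zeros keeps the corner
    # lookups below uniform for rectangles touching the image border.
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = gray.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum within the rectangle at (x, y) of size w x h using four
    # array references: 4 + 1 - (2 + 3) in the figure above.
    return ii[y + h, x + w] + ii[y, x] - ii[y, x + w] - ii[y + h, x]

gray = np.arange(16).reshape(4, 4)
ii = integral_image(gray)
assert rect_sum(ii, 1, 1, 2, 2) == gray[1:3, 1:3].sum()
```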
This integral image technique is far more efficient than analyzing the image pixel by pixel, since it allows the summed intensity values for a set of features to be calculated much faster (TechRadar, 2010). Features provide a coarse, low-resolution view of the image and are good at detecting edges between light and dark, bars, and other simple structures. In practice this module is actually implemented within the classifier module.
Classifier
This module performs the classification for face detection using Haar feature data; the Haar feature classification is performed on the integral image.
Building and Training a Classifier
Next, we need to build and train a classifier by submitting to the face detection algorithm 24 x 24 images with faces and 24 x 24 images without faces, so that the system is tuned and trained to detect faces and reject non-faces. This collection of faces and non-faces is called a training set. Because of the large size of the training set, training the classifier is computationally expensive. The training experiment for Viola-Jones utilized 4,000 images with faces and 10,000 images with non-faces, and took several weeks to conclude. Today there are faster ways of building a face detection classifier, which I will explain in future articles using OpenCV.
Training here involves creating a learning system; the one used by Viola-Jones is an Artificial Intelligence (AI) based technique called AdaBoost.
AdaBoost is a predictive algorithm used for classification and regression, and is a variant of boosting, an ensemble learning method that can be used for classification.
AdaBoost (Adaptive Boosting, proposed by Freund and Schapire) involves creating a strong learner by iteratively adding many weak learners. A weak learner here can be thought of as a stage comprising rectangle Haar features. A weak learner has an error probability of less than 1/2, which makes it just better than random guessing, whereas a strong learner has an arbitrarily small error probability (Alpaydin, 2010). A weak learner is combined with many other weak learners to produce a strong learner. Each weak learner comprises rectangle features, each assigned a weight; when the weighted Haar feature responses are added up and compared against a threshold, the weak learner tests true or false for an image, and these weights define the accuracy of the classifier. Think of the weak learners as many classifiers, or stages. The classifier module typically contains many stages (or cascaded classifiers) for the classification of faces and non-faces.
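To make the idea concrete, here is a heavily simplified AdaBoost training loop over threshold-on-one-feature "stumps" as the weak learners. It is an illustrative sketch of the algorithm, not the Viola-Jones implementation; the function name, the brute-force stump search, and the number of rounds are all assumptions.

```python
import numpy as np

def adaboost(feature_values, labels, rounds=10):
    # feature_values: (n_samples, n_features) array of Haar feature values.
    # labels: +1 for faces, -1 for non-faces.
    n, m = feature_values.shape
    w = np.full(n, 1.0 / n)   # sample weights, initially uniform
    strong = []               # list of (feature, threshold, polarity, alpha)
    for _ in range(rounds):
        best = None
        # Pick the single threshold-on-one-feature stump with the
        # lowest weighted error on the current sample weights.
        for j in range(m):
            for t in np.unique(feature_values[:, j]):
                for p in (1, -1):
                    pred = np.where(p * feature_values[:, j] < p * t, 1, -1)
                    err = w[pred != labels].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, p, pred)
        err, j, t, p, pred = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * labels * pred)  # boost weights of mistakes
        w /= w.sum()
        strong.append((j, t, p, alpha))
    return strong
```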
Because there are four rectangle Haar feature types, there are some 45,000 different ways to place one of the four types of feature onto a 24 x 24 image (TechRadar, 2010). For example, for the first type of feature, you could have rectangles one pixel wide by two pixels deep, all the way up to 1 x 24, then 2 x 2 up to 2 x 24, and so on. These different-sized features are placed at various positions across the image to test every possible feature, at every possible size, at every possible position.
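A short sketch of this enumeration is below; it counts every (size, position) placement of each base feature shape in a 24 x 24 window. The exact total depends on which shapes and scalings are counted, which is why different sources quote different figures (TechRadar's 45,000 here, larger counts elsewhere); the base shapes chosen below are an assumption.

```python
def count_placements(base_w, base_h, window=24):
    # Count every position and size at which a feature whose smallest
    # form is base_w x base_h pixels fits in a window x window image.
    total = 0
    for w in range(base_w, window + 1, base_w):      # scaled widths
        for h in range(base_h, window + 1, base_h):  # scaled heights
            total += (window - w + 1) * (window - h + 1)
    return total

# Two-rectangle (2 x 1, 1 x 2), three-rectangle (3 x 1, 1 x 3), and
# four-rectangle (2 x 2) base shapes.
print(count_placements(2, 1) + count_placements(1, 2)
      + count_placements(3, 1) + count_placements(1, 3)
      + count_placements(2, 2))
```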
Further, in building a classifier, a threshold is required to determine whether a feature is considered detected or not. Using this, you would then apply every one of the 45,000 possible features to your training set. You would discover that certain features have a higher success rate than others in accepting faces and discarding non-faces; this selection process is what is actually called training.
Viola-Jones discovered in their experiments that two features, when combined and tuned by AdaBoost into a single classifier, were spot-on in passing image candidates containing faces 100% of the time, but also passed non-faces about 40% of the time (false positives). Their final cascaded classifier had 32 stages and used a total of 4,297 useful features out of the 45,000.
According to Viola and Jones (2004), the first and second rectangle features were selected by AdaBoost. The two features, shown in the top row of the figure, are overlaid on a typical training face in the bottom row, as shown below.
The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose (Viola and Jones, 2004).
The classifier, now primed or calibrated, is said to be trained on certain settings (feature thresholds) and can be employed to probe candidate sub-windows, accepting them when faces are present and rejecting them when faces are absent.
After training, the next step is to test a candidate image: each sub-window of the original image is tested against the first classifier stage. If it passes that classifier, it is tested against the second; if it passes that one, it is then tested against the third, and so on. If it fails at any stage of the testing, the sub-window is rejected as a possible face. If it passes through all the classifiers, the sub-window is classified as a face (TechRadar, 2010).
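This cascade logic reduces to a short early-exit loop; a minimal sketch, assuming each stage is represented as a callable that scores a sub-window from the integral image:

```python
def classify_sub_window(stages, ii, x, y):
    # The sub-window must pass every stage to be accepted as a face; a
    # failure at any stage rejects it immediately, so most non-face
    # sub-windows exit after only a stage or two.
    for stage_passes in stages:
        if not stage_passes(ii, x, y):
            return False
    return True
```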
The locations of the detected face(s) are stored, and a coloured square or other shape is then drawn to indicate the detected face or faces in the image.
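For completeness, here is how the whole pipeline looks with OpenCV's pretrained frontal-face Haar cascade (shipped with the opencv-python package); the input file name is a placeholder and the detection parameters are common illustrative defaults.

```python
import cv2

# Load OpenCV's bundled pretrained frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("test_image.jpg")          # placeholder file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a coloured rectangle around each detected face.
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detected.jpg", image)
```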
Research Approaches
Face detection is often categorized under computer vision. Let's recap: face detection is the process of detecting faces in a live video feed or in a still image. Video requires a real-time algorithm that tracks the face or faces irrespective of the changing positions of the people in the video. A still image is easier because the faces in it are static. Researchers often focus on one or both, i.e., static or real-time face detection systems.
Other scenarios in face detection involve detecting faces in constrained or unconstrained environments, as mentioned earlier.
Furthermore, there are software and hardware approaches. The hardware approach accelerates (speeds up) the detection process using special hardware such as FPGAs (Field Programmable Gate Arrays) or GPUs (Graphics Processing Units), while the software approach improves the algorithm for better system throughput without special hardware.
Wrapping Up
In subsequent articles, I will explore in depth face detection and recognition system designs for constrained and unconstrained environments, and different acceleration techniques for implementation on a PC, a mobile phone (Android), and an FPGA device.
About the Author: Ogor Anumbor is a mechanical engineer and a scientist-in-training, and has been studying face detection/recognition systems since 2011. He is also interested in FPGAs, GPUs, and image processing systems (object tracking, gait analysis, voice recognition systems, smart-home systems, embedded systems), and has ample field experience with microprocessors.
References
Alpaydin, E. (2010) Introduction to Machine Learning, 2nd edition, The MIT Press, Cambridge, Massachusetts.
Bober, M., Paschalakis, S. (2003) A Low-Cost FPGA System for High Speed Face Detection and Tracking, in Proceedings of the 2nd IEEE International Conference on FPGA, Tokyo, Japan, December 15-17, pp. 214-221.
Burgin, W., Pantofaru, C., Smart, W. D. (2011) Using Depth Information to Improve Face Detection, March 6-9, 2011.
Cho, J., Mirzaei, S., Oberg, J., Kastner, R. (2009) FPGA-Based Face Detection System Using Haar Classifiers, February 22-24, 2009, ACM 978-1-60558-410-2/09/02.
Freund, Y., Schapire, R. E. (1996) Experiments with a New Boosting Algorithm, in Proceedings of the Thirteenth International Conference on Machine Learning, ed. L. Saitta, pp. 148-156, San Mateo, CA: Morgan Kaufmann.
Markus, N. (2012) Overview of Algorithms for Face Detection and Tracking.
Mirzaei, A. M. (2011) Acceleration of Face Detection Algorithm on an FPGA, MSc thesis, MCV-VIBOT, Imperial College London, UK.
TechRadar (2010) How Face Detection Works, July 28, 2010, PC Plus issue 296. Date accessed: October 6, 2014.
The News (2015) Easter: Nigerian Passengers Scramble for Air Tickets, April 3, 2015, http://thenewsnigeria.com.ng/2015/04/easter-nigerian-passengers-scramble-for-air-tickets/ Date accessed: December 27, 2016.
Viola, P., Jones, M. J. (2004) Robust Real-Time Face Detection, International Journal of Computer Vision, 57(2), pp. 137-154, Kluwer Academic Publishers.
Yi, D., Lei, Z., Li, S. Z. (2013) Towards Pose Robust Face Recognition, CVPR 2013.