
Wednesday, 28 December 2016

Understanding Face Detection System; How It Works

 Author: Anumbor Ogor
Image source: the News (face detection performed on image)

I have read a good deal of literature on face detection, and it took me quite a while before the juice of understanding started to flow, because the fundamentals were so difficult to grasp at the beginning.

To say the least, face detection is not straightforward: it requires a lot of details to be mastered, after which everything comes easily. So I thought I would share my understanding of the subject to help newcomers get up to speed faster. I have tried my best to keep the presentation simple, with illustrative diagrams.

The article is quite lengthy. It is intended as a comprehensive introduction to the subject, and it requires some patience to follow the logic through.

Let’s start this presentation by looking at some important definitions. So what is face detection?


Face detection is the process of detecting one or more faces in a digital image. A digital image is a computer file that stores a picture and has properties such as resolution, image size, and whether it is in colour or not. Colour is often defined as a measure of the intensity of red, green, and blue.

According to Burgin et al. (2011), face detection is one of the most thoroughly explored research areas in computer vision. In the future, face detection should help improve human-computer interaction.

Face detection has been used extensively in biometric applications. Compared with other biometrics such as iris, fingerprints, finger geometry, hand geometry, hand veins, palm, retina, and voice, the face biometric is superior because of its non-intrusive nature (Yi et al., 2013): all of the others require voluntary action by a person, whereas face detection can be done covertly from a distance, passively, and without the explicit participation of the person(s) involved.

Face detection involves finding one or more faces in an image. A face (or facial) recognition system goes further: it recognizes or identifies people by their facial characteristics, directly from playback video, a live feed, or still images.

Face recognition is therefore different from face detection. Face recognition requires four sub-steps: face detection, which finds faces in an image; normalization, which enhances the quality of the facial image; face feature extraction, which extracts salient facial features that are useful for distinguishing faces; and face matching, which matches the face against one or many enrolled faces in a database. Face detection is thus a prerequisite first step in face recognition, which is mainly used for verification and identification of a person.

The process of face detection begins with an image file, which is stored in a computer's memory, processed, and displayed on a screen. Face detection is a subset of object detection. Object detection systems are generalized systems that can detect any object their classifier has been trained to identify: a face, a human form, a cat, a shoe, a banana, an apple, or just about anything else. Other related areas include gait analysis, which involves the study of human walking patterns.


Face detection applications are currently helping blind people ‘perceive’ photos through mobile apps. Face detection is also helping disabled people improve their daily lives, for example through hands-free alternatives in other applications. It also features in facial tracking for safety applications, such as operating heavy or nuclear equipment, and in detecting sleepiness or lack of attention while driving or using hazardous machines.

Further, it is used in video applications, chewing analysis, facial motion analysis, etc. The goal of a face detection system is to determine whether or not there are any faces in the image and, if present, to return the location and extent of each face (Nenad, 2012).

Face detection is often affected by pose, rotation, poor illumination of the image or face, facial expressions, occlusion, the distance of the face in the image, etc. In practical scenarios, for instance, a video surveillance feed mostly captures faces that are non-ideal (unconstrained) and poorly lit. Poor visibility caused by weather conditions and the distance of the faces from the camera are further potential problems for the system.

There are two possible scenarios in which this technology may be used: constrained (or controlled) and unconstrained (or uncontrolled) environments.

Constrained scenario: frontal face; detected face of the author

A constrained environment is a scenario in which the faces in the image are ensured to be frontal, i.e. facing the camera (sometimes the head may be slightly rotated, as in the figure above), so that the algorithm can detect the face(s) more easily. In an unconstrained environment, by contrast, faces may be looking away from the camera and are more difficult to detect because of their angular pose or rotation.

Unconstrained scenario: detected faces of people in MM-Airport-2, Ikeja, Lagos. Image source: the News

In 2001, Paul Viola and Michael Jones (the pair is often referred to simply as Viola-Jones) introduced a framework for detecting objects, which was refined for use in face detection. The framework later became the de facto industry standard for implementing object or face detection systems.

Further, according to Bober and Pasehalakis (2003), numerous methodologies exist for implementing face detection systems:

(I) Appearance-based: This approach treats face detection as a recognition problem. In general it does not utilize human knowledge of facial appearance, but relies on machine learning and statistical methods to ascertain such characteristics.

(II) Feature-based: This approach involves detecting human facial properties such as the eyes, the nose, or parts or combinations thereof. It usually relies on low-level operators and heuristic rules about the structure of the human face.

(III) Colour-based: This technique may be described as a subset of the feature-based method, where the feature in question is the colour of the human skin.

In their novel object detection framework, Paul Viola and Michael Jones found a way to use image filters (known as Haar rectangle features) to probe images for the presence or absence of faces. The filters are used to locate facial features in an image. These rectangle filters, or features, are based on the work of the Hungarian-Jewish mathematician Alfred Haar.

Haar rectangle features

Viola-Jones used several different types of rectangle Haar feature in their framework; within each feature, the rectangular blocks are of the same size. The first is a light block next to a dark one, horizontally. The second is the same but vertically. The third is a light block sandwiched between two dark blocks (or the reverse). The fourth consists of four rectangles, as shown.

Face Detection Architecture

A typical face detection architecture is shown below. This architecture is basic and can include more modules depending on the complexity of the design.

Face detection architecture
The modules in a face detection system include an image store, an image cropper, an image scaler, an integral image generator, and a classifier.

Image store

The image to be tested is stored in RAM and is first converted to a grayscale image, which is defined as a set of intensity values from 0 to 255. Next, we sum the intensity values in various rectangular blocks. By comparing these sums we can detect darker blocks adjacent to lighter blocks (TechRadar, 2010).
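To make the grayscale step and the block sums concrete, here is a minimal sketch in plain Python. The function names and the tiny test image are my own, and the luma weights (0.299, 0.587, 0.114) are just one common choice for grayscale conversion; a real system may use a different conversion.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 intensities."""
    # Assumed weights: the widely used BT.601 luma coefficients.
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


def block_sum(gray, top, left, height, width):
    """Sum the intensities inside one rectangular block."""
    return sum(
        gray[y][x]
        for y in range(top, top + height)
        for x in range(left, left + width)
    )


# A tiny 2-row by 4-column image: dark left half, light right half.
rgb = [
    [(10, 10, 10), (10, 10, 10), (200, 200, 200), (200, 200, 200)],
    [(10, 10, 10), (10, 10, 10), (200, 200, 200), (200, 200, 200)],
]
gray = to_grayscale(rgb)
dark = block_sum(gray, 0, 0, 2, 2)   # left 2 x 2 block
light = block_sum(gray, 0, 2, 2, 2)  # right 2 x 2 block
print(dark, light)  # the light block has the much larger sum
```

Comparing the two sums is enough to tell that a dark block sits next to a light one, without looking at any individual pixel.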

This module stores in RAM the image data arriving from the input, which could be a hard drive, an SD card, or camera sensors directly. Some architectures may include a frame grabber module that handles all image frames from the different input sources.

This module transfers the image data to the classifier module based on the scale information from the image scaler module (Cho et al., 2009).

The Image Cropper

The image cropper module is next in the face detection system. The module uses a sub-window to hold part of the image in the image store. The sub-window scans the image from left to right along a row, shifting one pixel to the right at a time, and then advances to the next row of the image. If the test image has a width x height of 320 x 240, it has a total of 320 pixel columns and 240 pixel rows. In their experiments, Viola-Jones used a sub-window with a width and height of 24 x 24.
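The scan described above can be sketched as a small generator. This is only an illustration: the function name and the `step` parameter are my own, and a real cropper would hand over pixel data instead of coordinates.

```python
def sub_window_positions(image_width, image_height, window=24, step=1):
    """Yield the (left, top) corner of every sub-window position, row by row."""
    for top in range(0, image_height - window + 1, step):
        for left in range(0, image_width - window + 1, step):
            yield left, top


positions = list(sub_window_positions(320, 240))
print(len(positions))                # 64449, i.e. (320 - 24 + 1) * (240 - 24 + 1)
print(positions[0], positions[-1])   # (0, 0) (296, 216)
```

Even for a small 320 x 240 image there are tens of thousands of sub-windows to test at a single scale, which is why the later stages need to be fast.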

Three scaling algorithms are mainly used in face detection routines: nearest-neighbour, bilinear, and poly-phase.

The image scaler

Most of the time a photo contains several faces of various sizes. To allow the face detection algorithm to locate those faces, different sub-window sizes are used. As mentioned in the last section, Viola-Jones used a sub-window of 24 x 24 (width x height), which means that faces smaller than the sub-window may not be detected. To handle this problem, several sub-window sizes are used; for instance, you could use a 10 x 10 sub-window, or in another case a 50 x 50 sub-window. The image scaler module provides the means to resize the sub-window so as to increase the chances of detecting all faces in an image irrespective of their sizes. The images are scaled down or up by the image scaler module according to a scale factor (Mirzaei, 2011).

The image scaler module generates the address of the RAM location containing a frame image and transfers it to the image store module to request image data according to a scale factor. The image store module then transfers pixel data to the classifier module based on the RAM address requested by the image scaler module (Mirzaei, 2011).
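As a rough illustration of scaling by a scale factor, the sketch below shrinks the image step by step and keeps the 24 x 24 sub-window fixed, which is one way to get the same effect as growing the sub-window. The function names and the default scale factor of 1.25 are assumptions for this example, and nearest-neighbour is used only because it is the simplest of the scaling algorithms.

```python
def resize_nearest(gray, new_width, new_height):
    """Resize a grayscale image (a list of rows) by nearest-neighbour sampling."""
    old_height, old_width = len(gray), len(gray[0])
    return [
        [
            gray[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]


def pyramid_sizes(width, height, scale_factor=1.25, window=24):
    """List the image sizes visited while shrinking by scale_factor each step."""
    sizes = []
    while width >= window and height >= window:
        sizes.append((width, height))
        width, height = int(width / scale_factor), int(height / scale_factor)
    return sizes


small = resize_nearest(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]], 2, 2
)
print(small)  # [[0, 2], [8, 10]]

sizes = pyramid_sizes(320, 240)
print(sizes[:3])  # [(320, 240), (256, 192), (204, 153)]
```

Each size in the list would be scanned with the same 24 x 24 sub-window, so a face that is large in the original image eventually fits the sub-window at one of the smaller scales.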

Integral Image

The rectangle features can be at an arbitrary scale and at any position in the sub-window image (during the training process, which is explained in a later section). The feature value is calculated as the sum of the pixel intensities in the light rectangle(s) minus the sum of the pixel intensities in the dark rectangle(s). The value of the feature is then used in a classifier to determine whether that feature is present in the original image.

The whole process can be sped up using a technique known as the integral image. The integral image value at a given point is the sum of the pixel intensities in the rectangle extending from the upper left corner of the image to that point, and it can be calculated for every point in the original image in a single pass across the image. Using the integral image values at the four corners of any rectangle, the sum of the pixels inside it can be computed quickly (Cho et al., 2009).
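The feature value calculation can be sketched directly from that definition. The example below handles only the first feature type (a light block next to a dark block, horizontally), sums pixels the slow, direct way, and uses function names of my own choosing.

```python
def rect_sum(gray, top, left, height, width):
    """Directly sum the intensities inside one rectangle."""
    return sum(
        gray[y][x]
        for y in range(top, top + height)
        for x in range(left, left + width)
    )


def two_rect_horizontal(gray, top, left, height, width):
    """Light block on the left, dark block on the right, each height x width."""
    light = rect_sum(gray, top, left, height, width)
    dark = rect_sum(gray, top, left + width, height, width)
    return light - dark


edge = [[200, 200, 10, 10],
        [200, 200, 10, 10]]
flat = [[90, 90, 90, 90],
        [90, 90, 90, 90]]
edge_value = two_rect_horizontal(edge, 0, 0, 2, 2)
flat_value = two_rect_horizontal(flat, 0, 0, 2, 2)
print(edge_value, flat_value)  # 760 0
```

A large value means the light-next-to-dark pattern is strongly present, while a value near zero means the region is more or less uniform.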

Integral image computation
According to Viola-Jones, the sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The sum within D can be computed as 4 + 1 – (2 + 3).
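The four-reference trick can be sketched as follows. This version pads the integral image with an extra row and column of zeros so that rectangles touching the image border need no special case; that padding, and the function names, are choices made for this example.

```python
def integral_image(gray):
    """Return ii, padded with a zero row and column, so that ii[y][x]
    is the sum of gray over all rows above y and all columns left of x."""
    height, width = len(gray), len(gray[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_total = 0
        for x in range(width):
            row_total += gray[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_total
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of rectangle D from four lookups: 4 + 1 - (2 + 3)."""
    bottom, right = top + height, left + width
    point_1 = ii[top][left]      # A
    point_2 = ii[top][right]     # A + B
    point_3 = ii[bottom][left]   # A + C
    point_4 = ii[bottom][right]  # A + B + C + D
    return point_4 + point_1 - (point_2 + point_3)


gray = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
ii = integral_image(gray)
total = rect_sum(ii, 0, 0, 3, 3)
corner = rect_sum(ii, 1, 1, 2, 2)
print(total, corner)  # 45 28
```

The integral image is built once in a single pass, after which the sum of any rectangle, whatever its size, costs the same four lookups.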

This integral image technique is far more efficient than analysing the image pixel by pixel, since it makes it faster to calculate the summed intensity values for a set of features (TechRadar, 2010). Features provide a coarse, low-resolution view of the image and are good at detecting edges between light and dark, bars, and other simple structures. In practice, the integral image module is implemented within the classifier module.


The classifier module performs the classification for face detection using Haar feature data. Face detection is performed by Haar feature classification using an integral image.

Building and Training a Classifier

Next, we need to build and train a classifier by submitting to the face detection algorithm 24 x 24 images of faces and 24 x 24 images of non-faces, so that the system is tuned and trained to detect faces and reject non-faces. The list of faces and non-faces is called a training set. Because the training set is large, training the classifier is computationally expensive. The training experiment for Viola-Jones utilized about 4,000 images with faces and 10,000 images with non-faces, and took them several weeks to complete. Today there are faster ways of building a face detection classifier, which I will explain in future articles using OpenCV.

Training here involves creating a learning system. Viola-Jones implemented their classifier with an Artificial Intelligence (AI) based technique called Adaboost.

Adaboost is a predictive algorithm used for classification and regression. It is a variant of boosting, an ensemble learning approach.

Adaboost (Adaptive Boosting, proposed by Freund and Schapire) creates a strong learner by iteratively adding many weak learners. A weak learner has an error probability of less than 1/2, which makes it better than random guessing on a two-class problem, while a strong learner has an arbitrarily small error probability (Alpaydin, 2010). Here, each weak learner is built from a rectangle Haar feature and a threshold, and Adaboost assigns each one a weight. When the weighted votes of the weak learners are added up, the total decides whether an image tests true as a face, which defines the accuracy of the classifier. The boosted classifiers are then arranged in stages: the classifier module typically contains many stages (or cascaded classifiers) for classifying faces and non-faces.
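To show how weak learners add up to a strong one, here is a minimal sketch of discrete Adaboost in plain Python. It assumes each weak learner is a simple threshold test on one feature value, with labels +1 for face and -1 for non-face. The function names and the six-sample toy data are invented for illustration; this is not the training code Viola-Jones used.

```python
import math


def stump_predict(value, threshold, polarity):
    """Return +1 (face) or -1 (non-face) from a single feature value."""
    return 1 if polarity * value >= polarity * threshold else -1


def train_stump(values, labels, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(values)):
        for polarity in (1, -1):
            error = sum(
                w for v, y, w in zip(values, labels, weights)
                if stump_predict(v, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(feature_columns, labels, rounds):
    """feature_columns[j][i] is the value of feature j on training sample i."""
    n = len(labels)
    weights = [1.0 / n] * n
    strong = []  # entries are (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        candidates = []
        for j, column in enumerate(feature_columns):
            error, threshold, polarity = train_stump(column, labels, weights)
            candidates.append((error, j, threshold, polarity))
        error, j, threshold, polarity = min(candidates)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        strong.append((alpha, j, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose it.
        for i in range(n):
            predicted = stump_predict(feature_columns[j][i], threshold, polarity)
            weights[i] *= math.exp(-alpha * labels[i] * predicted)
        total = sum(weights)
        weights = [w / total for w in weights]
    return strong


def strong_predict(strong, sample):
    """Weighted vote of all weak learners on one sample (a list of features)."""
    score = sum(
        alpha * stump_predict(sample[j], threshold, polarity)
        for alpha, j, threshold, polarity in strong
    )
    return 1 if score >= 0 else -1


# Toy data: one feature, and no single threshold separates the two classes.
values = [1, 2, 3, 4, 5, 6]
labels = [1, 1, -1, -1, 1, 1]
strong = adaboost([values], labels, rounds=3)
predictions = [strong_predict(strong, [v]) for v in values]
print(predictions == labels)  # True
```

On this toy data the best single threshold still misclassifies two of the six samples, yet three weighted weak learners together classify all six correctly. That is the sense in which boosting turns weak learners into a strong one.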

Cascaded classifiers showing five stages

With four types of rectangle Haar feature, there are some 45,000 different ways to place one of them onto a 24 x 24 image (TechRadar, 2010). For example, for the first type of feature, you could have rectangles one pixel wide by two pixels deep, all the way up to 1 x 24, then 2 x 2 to 2 x 24, and so on. These different-sized features are placed in various positions across the image to test every possible feature, of every possible size, at every possible position.

Further, in building a classifier, a threshold is required to determine whether a feature is considered detected or not. Using this, you would then apply every one of the 45,000 possible features to your training set. You would discover that certain features have a higher success rate than others in accepting faces or discarding non-faces. Finding these features and their thresholds is what training actually means.

Viola-Jones discovered in their experiment that two features, when combined and tuned by Adaboost into a single classifier, accepted image candidates with faces 100% of the time, but with a false positive rate of 40% (a false positive is a non-face wrongly accepted as a face; a missed detection, by contrast, is a false negative). Their final cascade of classifiers had 32 stages and used a total of 4,297 useful features out of the 45,000.

Viola and Jones (2004) illustrate the first and second rectangle features selected by Adaboost. The two features are shown in the top row and overlaid on a typical training face in the bottom row, as shown below.


 Two rectangle features on 24x24 sub window images of the author
The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose (Viola and Jones, 2004).

The classifier, which has now been primed or calibrated with certain settings (feature thresholds), is said to be trained. It can be used to probe face candidates, accepting sub-windows that contain faces and rejecting those that do not.

After training, the next step is to test an image candidate. Each sub-window of the original image is tested against the first classifier stage. If it passes that classifier, it's tested against the second. If it passes that one, it's then tested against the third, and so on. If it fails at any stage of the testing, the sub-window is rejected as a possible face. If it passes through all the classifiers, then the sub-window is classified as a face (TechRadar, 2010).

The location of each detected face is stored, and a coloured square or other shape is then drawn to indicate the detected face or faces in the image.
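The stage-by-stage testing can be sketched in a few lines. Here each stage is a hypothetical stand-in that passes a sub-window if one feature value reaches a threshold (a real stage would be a boosted classifier over many features), and a log records which stages actually ran.

```python
def make_stage(feature_index, threshold, log):
    """Build a hypothetical stage: pass if one feature value reaches a threshold."""
    def stage(features):
        log.append(feature_index)  # record that this stage was evaluated
        return features[feature_index] >= threshold
    return stage


def run_cascade(stages, window_features):
    """Return True only if the sub-window passes every stage, in order."""
    for stage in stages:
        if not stage(window_features):
            return False  # rejected early; later stages are never evaluated
    return True


log = []
stages = [make_stage(0, 50, log), make_stage(1, 30, log), make_stage(2, 10, log)]

face_like = [80, 45, 20]   # invented feature values that pass every stage
background = [5, 45, 20]   # fails the very first stage

is_face = run_cascade(stages, face_like)
stages_used_for_face = len(log)
log.clear()
is_background_face = run_cascade(stages, background)
stages_used_for_background = len(log)
print(is_face, stages_used_for_face)                      # True 3
print(is_background_face, stages_used_for_background)     # False 1
```

Most sub-windows in a typical image are background, so rejecting them after one or two cheap stages is what keeps the overall detector fast.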

Research Approaches

Face detection is often categorized under computer vision systems. Let’s recap: face detection is the process of detecting faces in a live video feed or in a still image. A video requires real-time tracking of the face or faces, irrespective of the changing positions of the people in the video. A still image is easier to handle because the faces in it are static. Researchers often focus on one or both, i.e. static or real-time face detection systems.

Other scenarios in face detection involve detecting faces in constrained or unconstrained environments, as mentioned earlier.

Furthermore, there are software and hardware approaches. The hardware approach accelerates (speeds up) the detection process using special hardware such as FPGAs (Field Programmable Gate Arrays) or GPUs (Graphics Processing Units). The software approach improves the algorithm for better system throughput without using special hardware.

Wrapping Up

In subsequent articles, I will explore in depth face detection and recognition system designs for constrained and unconstrained environments, and different acceleration techniques for implementation on a PC, a mobile phone (Android), and an FPGA device.

About the Author: Ogor Anumbor is a mechanical engineer and a scientist-in-training, and has been studying face detection/recognition systems since 2011. He is also interested in FPGAs, GPUs, and image processing systems (object tracking, gait analysis), as well as voice recognition systems, smart-home systems, and embedded systems, and has ample field experience with microprocessors.
Contact: eocote2002@yahoo.com, twitter.com/ogoranumbor, website: facebook.com/herculestechnology.


References

Alpaydin, E. (2010) Introduction to Machine Learning, 2nd edition, The MIT Press, Cambridge, Massachusetts, London, England.

Bober, M., Pasehalakis, S. (2003) A Low-Cost FPGA System for High Speed Face Detection and Tracking, December 2003, 2nd IEEE International Conference on FPGA, In Proceed., Tokyo, Japan, December 15-17, pp. 214 – 221.

Burgin, W., Pantofaru, C., Smart, D., W. (2011) Using Depth Information to Improve Face Detection, March 6-9, 2011.

Cho, J., Mirzaei, S., Oberg, J., Kastner, R. (2009) FPGA-Based Face Detection System Using Haar Classifiers, February 22-24, 2009, ACM 978-1-60558-410-2/09/02.

Freund, Y., and Schapire, R.E. (1996) Experiments with a New Boosting Algorithm. In Thirteenth International Conference on Machine Learning, ed. L. Saitta, 148–156. San Mateo, CA: Morgan Kaufmann.

Mirzaei, A., M. (2011) Acceleration of Face Detection Algorithm on an FPGA, June 2011, Msc. Thesis, MCV-VIBOT. Imperial College, London, UK.

Nenad Markus (2012) Overview of Algorithms for Face Detection and Tracking.

TechRadar (2010) How Face Detection Works, July 28, 2010, PC Plus issue 296. date accessed: October 6, 2014, 2.27PM.

The News (2015) Easter: Nigerian Passengers Scramble for Air Tickets, April 3, 2015, 3:19PM http://thenewsnigeria.com.ng/2015/04/easter-nigerian-passengers-scramble-for-air-tickets/ Date accessed: December 27, 2016, 3:00 AM

Viola, P., Jones, M.J. (2004) Robust Real-Time Face Detection, International Journal of Computer Vision 57 (2), 137-154, 2004, Kluwer Academic Publishers.

Yi, D., Lei, Z., Li, S. Z. (2013) Towards Pose Robust Face Recognition, CVPR2013.

