Computer graphics has always been an interest of mine. Seeing the latest CGI film or AAA video game showing off the current peak of graphics technology and creativity can really put one in awe of what we as humans have achieved. We have generally solved the problem of converting a physical model of a scene into a lifelike sequence of images, complete with complex geometry and realistic shading.
The reverse problem, however, is far from solved. Even with the most advanced hardware and algorithms, it remains hard to reliably extract useful information from a sequence of images using a computer. Problems which we as humans perceive as trivial cannot always be solved satisfactorily by modern systems. Here is a short list of areas under active research:
- Background and foreground separation
- Face detection and recognition
- Object tracking
- Texture classification
- Environment mapping
These are problems that we face daily, yet we solve them effortlessly in a split second. So why are computers not able to deal with these problems in a satisfactory manner?
There are a few answers to this question. For one, projecting a 3D scene onto a 2D image discards one dimension, so recovering the scene from the image is a problem with an infinite number of solutions. Another reason for the poor performance of computer vision systems compared to our own is the digitization process, where a large amount of data is lost due to sampling. While these answers explain why we should not be too concerned about the rise of the machines just yet, they still do not explain why the human brain is able to extract volumes of useful information from the same sequence of digital images.
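To make the "infinite number of solutions" point concrete, here is a minimal sketch assuming an idealised pinhole camera at the origin looking along the z axis. The function name `project` and the focal length of 1.0 are my own choices for illustration, not part of any particular library.

```python
def project(point, focal_length=1.0):
    """Project a 3D point (x, y, z) onto the image plane of an ideal pinhole camera."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


# Three different 3D points that lie on the same ray through the camera centre.
base = (1.0, 2.0, 4.0)
points_on_ray = [tuple(scale * c for c in base) for scale in (1.0, 2.0, 5.0)]

# All of them land on the same 2D image coordinate, so depth cannot be recovered
# from that single coordinate alone.
image_points = [project(p) for p in points_on_ray]
```

Every entry of `image_points` is the same 2D coordinate, and any other positive scale factor would give that coordinate too. A single image therefore cannot tell these points apart without extra information.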
The real reason why computer vision is still in its infancy is that we simply do not know how to effectively model the tasks associated with vision. The modelling problem can be traced back to another problem, which is that we simply do not know how human vision works. It is true that the physical side is already well documented and can be emulated by a normal camera. The mental side, however, is still very much a mystery.
A Simple Vision Test
In order to create efficient algorithms to address the vision problem, it might be beneficial to try and figure out how the brain solves a simple vision problem, namely object detection. Consider the following image of a mouse:
Now try to find the mouse in the following images.
The first two images should not have presented much of a problem, but the third image should have been slightly more challenging. What did you look for in the above images when searching for the mouse? The answer will likely differ from person to person, but the simplest approach for the first two images would be to look for areas in the image which have the same colour tone as that of the original mouse. This approach will obviously not work for the third image, in which the image consists solely of black lines and artefacts against a white background. A more appropriate approach would be to search for a shape similar to that of the mouse.
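The colour approach can be sketched in a few lines. This is only an illustration under my own assumptions: the image is a small grid of RGB tuples, the mouse is summarised by a single reference colour, and the helper names (`mean_colour`, `colour_distance`, `find_by_colour`) are hypothetical.

```python
def mean_colour(pixels):
    """Average RGB colour of a list of (r, g, b) pixels."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def colour_distance(a, b):
    """Euclidean distance between two RGB colours."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def find_by_colour(image, target, window=2):
    """Slide a window over the image and return ((row, col), distance) of the
    window whose mean colour is closest to the target colour."""
    best_pos, best_dist = None, None
    rows, cols = len(image), len(image[0])
    for r in range(rows - window + 1):
        for c in range(cols - window + 1):
            patch = [image[r + dr][c + dc]
                     for dr in range(window) for dc in range(window)]
            dist = colour_distance(mean_colour(patch), target)
            if best_dist is None or dist < best_dist:
                best_pos, best_dist = (r, c), dist
    return best_pos, best_dist


GREEN, GREY = (30, 160, 40), (128, 128, 128)

# A 4x4 "lawn" with a 2x2 grey "mouse" whose top-left corner is at row 1, column 2.
scene = [[GREEN] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (2, 3):
        scene[r][c] = GREY

position, distance = find_by_colour(scene, GREY)
```

Here `position` comes out as `(1, 2)` with a distance of `0.0`. In a pure black-and-white line drawing, no window has a mean colour that singles out the mouse, which is exactly why the third image needs a different feature.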
The above test demonstrates two characteristics of human vision. Firstly, we look for similarities between previous experiences and the current situation to extract important information from what we perceive. Secondly, we are able to use more than one rule of comparison to extract information from the world. This highlights one of the important considerations for any computer vision algorithm: what are the important features to extract from a known model and look for in an unknown environment? In the case of the first image, the important feature to extract from the model of the mouse would be the colour, while for the third image the shape of the mouse would be more appropriate.
Let's consider the shape of the mouse. One question we have to ask ourselves is how we can effectively model the shape of the mouse in such a way that we will always be able to identify similar mice if present. Consider the next two images:
In the first image a large portion of the mouse is blended into the background. From this image it can be seen that the mouse cannot be found using the shape of the mouse as a whole. Instead, we can find the mouse by looking for certain key parts of the mouse, such as the nose, ears, tail and legs. The more parts we can find, the more certain we are that we found a similar mouse. In the second image we can also identify various mice by their individual parts, even though large portions are occluded by objects.
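The idea that more detected parts means more certainty can be written down as a very simple score. This is a toy sketch of my own, not a real part-based detector: it assumes some other component has already reported which parts it found, and the names `MOUSE_PARTS` and `part_confidence` are hypothetical.

```python
MOUSE_PARTS = ("nose", "ears", "tail", "legs")


def part_confidence(detected, expected=MOUSE_PARTS):
    """Fraction of the expected parts that appear in the detected set."""
    found = [part for part in expected if part in detected]
    return len(found) / len(expected)


# A mouse that is half hidden behind an object: only two parts are visible.
occluded = part_confidence({"nose", "tail"})                 # 0.5
# A fully visible mouse: every expected part is found.
visible = part_confidence({"nose", "ears", "tail", "legs"})  # 1.0
```

A practical system would also weigh each part by how distinctive it is and check that the parts sit in a plausible arrangement, but even this crude fraction captures why an occluded mouse can still be recognised.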
This brings us to another important characteristic of human vision: we can identify objects by their parts. We can also identify the position of smaller parts by first detecting the larger object. Detection by parts is one of the relatively new methods currently being researched within the field of computer vision and has shown good results.
The one characteristic of human vision that both of these tests show is our ability to adapt our problem solving process to the given environment. This highlights one of the biggest problems in computer vision: many algorithms use only one method to solve a specific vision problem. While this might be satisfactory for constrained conditions, it simply cannot solve the general vision problem. Unfortunately, it is not practical to simply bundle various algorithms into one super algorithm, as this would require enormous amounts of processing power while not necessarily providing better results. Still, the key to solving the vision problem might lie in a smarter combination of various systems.
While modern computer systems might be impressive when compared to the humble abacus and might be one of the few entities which can win a chess match by brute force, they are still nothing when compared to the amazing power of the human brain. We need to look into how we as humans perceive and extract information from the world using our visual senses in order to create competent vision systems in the future.