If you are a beginner in image and video processing, you may often hear the term "real time processing". In this post, we will explain the term and list some typical concerns related to it.
Real time image processing is tied to the frame rate of the source. The current standard capture rate is typically 30 frames per second, and real time processing requires processing each frame as soon as it is captured. Broadly speaking, if the capture rate is 30 FPS, then 30 frames need to be processed every second, which leaves about 33 milliseconds per frame (1000 ms / 30 frames ≈ 33 ms/frame). The same calculation gives the required processing time per frame for any frame rate.
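The budget calculation above can be written as a tiny helper (the function name is just illustrative):

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available to process each frame if we must
    keep up with a source capturing at `fps` frames per second."""
    return 1000.0 / fps

# Budgets for common capture rates:
for fps in (24, 30, 60, 120):
    print(f"{fps:>4} FPS -> {frame_budget_ms(fps):6.2f} ms/frame")
```

At 30 FPS this gives the ~33 ms/frame figure from above; at 60 FPS the budget is roughly halved, which is why higher capture rates demand simpler (or better-accelerated) per-frame algorithms.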
In image and video processing, the source of our signal is a camera. So what real time image processing really means is: produce output at the same rate as the input arrives. In other words, the algorithm must run at the rate of the source (e.g. a camera) supplying the images, so it can process frames at the camera's frame rate.
Just out of curiosity, let’s see how human vision works:
The first thing to understand is that we perceive different aspects of vision differently: detecting motion is not the same as detecting light. Different parts of the eye also perform differently; the center of vision is good at different tasks than the periphery. Finally, there are natural, physical limits to what we can perceive. It takes time for the light that passes through your cornea to become information on which your brain can act, and our brains can only process that information at a certain speed.
Another important concept: the whole of what we perceive is greater than what any one element of our visual system can achieve. This point is fundamental to understanding our perception of vision.
The temporal sensitivity and resolution of human vision vary depending on the type and characteristics of the visual stimulus, and they differ between individuals. The human visual system can process 10 to 12 images per second and perceive them individually, while higher rates are perceived as motion. Modulated light (such as a computer display) is perceived as stable by the majority of participants in studies when the rate is higher than 50 Hz to 90 Hz. This perception of modulated light as steady is known as the flicker fusion threshold. However, when the modulated light is non-uniform and contains an image, the flicker fusion threshold can be much higher, in the hundreds of hertz.

Regarding image recognition, people have been found to recognize a specific image in an unbroken series of different images, each of which lasts as little as 13 milliseconds. Persistence of vision sometimes causes very short single-millisecond visual stimuli to have a perceived duration of between 100 ms and 400 ms. Multiple stimuli that are very short are sometimes perceived as a single stimulus; for example, a 10 ms green flash of light immediately followed by a 10 ms red flash is perceived as a single yellow flash.
The real-time aspect is critical in many real-world devices or products such as mobile phones, digital still/video/cell-phone cameras, portable media players, personal digital assistants, high-definition television, video surveillance systems, industrial visual inspection systems, medical imaging devices, vision-assisted intelligent robots, spectral imaging systems, and many other embedded image or video processing systems.
With the increasing capabilities of imaging systems, such as cameras capturing 16 or more megapixels per frame, it is extremely difficult to achieve real time performance for many applications.
Which applications need real time performance and which do not:
Among the numerous applications of image and video processing, some need real time processing and some do not. That is why we distinguish between online (real time) and offline processing.
Offline processing works on an already recorded video sequence or image. Digital video stabilization, video enhancement, video colorization, or any other such application can work with already prepared video. These applications can be found in marketing, industry, medical imaging, the film industry, or in ordinary consumer scenarios, such as a user who wants to stabilize and enhance a video from their phone library.
Offline processing allows the use of more complex and computationally demanding algorithms, and therefore usually gives better results than real time processing. That is why offline processing tools are widely used in academic research and in various benchmark challenges.
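To make the offline/online contrast concrete, here is a minimal sketch in plain Python (the function names and the moving-average smoothing are illustrative, not a specific published method). An offline filter smoothing a camera-motion trajectory can average over both past and future frames, while a real time (causal) filter may only use frames it has already seen:

```python
def offline_smooth(trajectory, window=2):
    """Two-sided moving average: offline processing can look at
    future samples, since the whole recording is available."""
    n = len(trajectory)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        out.append(sum(trajectory[lo:hi]) / (hi - lo))
    return out

def online_smooth(trajectory, window=2):
    """Causal moving average: a real time filter only sees the past,
    so it lags behind and smooths less symmetrically."""
    out = []
    for i in range(len(trajectory)):
        lo = max(0, i - window)
        out.append(sum(trajectory[lo:i + 1]) / (i + 1 - lo))
    return out
```

The offline version reacts to a motion spike before it happens (from the viewer's perspective), which is one reason offline stabilization typically looks better than its real time counterpart.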
Many deep learning tools can also be used for offline processing, even on a CPU.
On the other hand, some applications demand real time processing. Traffic monitoring, target tracking in military systems, surveillance and monitoring, and real time video games, for example, all require immediate feedback and a processed image from the sensor.
Algorithms that work in real time do not have the luxury of high complexity, since the processing time available for each frame is determined by the source frame rate and resolution. Modern hardware offers ever better processing speeds, but limitations remain, depending on the specific application.
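A simple way to sanity-check whether a per-frame step fits the real time budget is to time it against 1/FPS. The sketch below uses wall-clock timing and a made-up "frame", so it is a rough check, not a hard real-time guarantee:

```python
import time

def meets_realtime_budget(process, frame, fps: float, runs: int = 10) -> bool:
    """Time `process(frame)` over several runs and check the average
    fits within the per-frame budget of a source running at `fps`."""
    budget_s = 1.0 / fps
    start = time.perf_counter()
    for _ in range(runs):
        process(frame)
    elapsed = (time.perf_counter() - start) / runs
    return elapsed <= budget_s

# Example: a trivial "filter" that sums pixel values of a fake frame.
fake_frame = [0] * 10_000
print(meets_realtime_budget(sum, fake_frame, fps=30))
```

In practice you would also account for capture and display latency, memory transfers (e.g. to a GPU), and worst-case rather than average timing, since a single slow frame already means a dropped frame.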
Systems with multiple complex applications working in parallel:
Sometimes an application demands multiple complex algorithms working in parallel. In that case, not only the complexity of each algorithm matters, but also which algorithm is processed first and how that order affects the desired performance of the application. A good example is video enhancement and digital video stabilization working in parallel.
Video stabilization and video dehazing algorithms in the same video processing pipeline can affect each other's results. This interesting topic is described in the paper [Dehazing Algorithms Influence on Video Stabilization Performance] given in the references at the end of the post. When there is no severe haze, noise, or low contrast in the scene, it is better to run the video stabilization algorithm before the video dehazing algorithm. On the other hand, when the feature level in the scene is low, due to severe haze or low contrast, the stabilization algorithm cannot perform well, since it cannot estimate global motion accurately. For the sake of better stabilization performance, the proposed pipeline therefore runs the video dehazing algorithm before video stabilization.
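The ordering decision above can be sketched as a small dispatcher. The feature-level score and its threshold here are hypothetical placeholders (the referenced paper does not prescribe these exact values); the point is only the conditional ordering:

```python
def order_pipeline(feature_level: float, threshold: float = 0.3):
    """Choose the processing order for stabilization and dehazing.

    `feature_level` is an assumed 0..1 score of how many reliable
    features (edges, corners) the scene offers; `threshold` is an
    illustrative cutoff, not a value from the referenced paper.
    """
    if feature_level >= threshold:
        # Enough features: global motion can be estimated reliably,
        # so stabilize first, then dehaze.
        return ["stabilize", "dehaze"]
    # Severe haze / low contrast: dehaze first to recover features,
    # then stabilize on the restored frames.
    return ["dehaze", "stabilize"]

print(order_pipeline(0.8))  # ['stabilize', 'dehaze']
print(order_pipeline(0.1))  # ['dehaze', 'stabilize']
```

In a real system the feature level would itself come from the video (e.g. a contrast measure or a feature-detector response count), adding its own cost to the per-frame budget.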
Finally, we will mention some of the available platforms for real time image processing:
- FPGA – very well suited to complex parallel operations; an example application is described in the paper [High-performance electronic image stabilization for shift and rotation correction] given in the references.
- Nvidia Jetson TX1, TX2, Xavier – “Get real-time Artificial Intelligence (AI) performance where you need it most with the high-performance, low-power NVIDIA Jetson AGX systems. Processing of complex data can now be done on-board edge devices. This means you can count on fast, accurate inference in everything from robots and drones to enterprise collaboration devices and intelligent cameras. Bringing AI to the edge unlocks huge potential for devices in network-constrained environments.” – from the Nvidia site, given in the references.