Automated video denoising
Pixop Denoiser is our solution to enhancing the perceived visual quality of noisy video and is the ideal preprocessing step before applying our Pixop Deep Restoration filter. This solution can reduce noise in digital video in an automated fashion, as opposed to going through the time-consuming task of hand-tuning multiple parameters and/or noise profiles using off-the-shelf video editing packages and plugins. In a nutshell, this AI filter can reduce:
- Gaussian video noise
- Minor scanline jittering
- Aliasing that results from e.g. (de-)interlacing
- Compression artifacts
Our deep convolutional neural network (CNN) architecture uses a combination of spatial and temporal filtering, learning how to spatially denoise frames and then optimally combine the effects of motion, brightness variations, and temporal imperfections to generate the denoised output. As few assumptions are made, the model is designed for real-world scenarios where conditions can change rapidly, producing a different noise distribution for every shot.
During the learning phase, the CNN is presented with tens of thousands of image pairs of artificially degraded and perfect image patches. These degradations have been carefully engineered to resemble the type of artifacts commonly found in noisy raw and lossy compressed digital SD video.
We performed extensive validation on the trained model using several different video sources to ensure that the output is consistently attractive to the end-user.
How it works
Video is processed frame-by-frame using three video frames (previous, current and next) as input. An enhanced frame is produced via inference using our pre-trained neural network model as shown in the diagram below:
This type of multi-frame approach is common among denoising algorithms as it allows better noise reduction performance to be achieved for regions in a frame with little or no motion.
A comparison to other denoisers
We conducted a test on October 20, 2020, of Pixop Denoiser's performance relative to a couple of other algorithms on the 15 seconds pedestrian_area sequence which is part of Derf's Test Media Collection at Xiph.org. Initially, the source video was downscaled and cropped from 1080p HD to 720x576 pixels SD via FFmpeg in order to reduce noise in the original recording and produce a "noise-free" ground truth baseline. From the ground truth SD, we then created a noisy version synthetically using FFmpeg's built-in noise video filter with parameters "c0s=6:c1s=4:c2s=4:allf=t" for adding a fair amount of temporal Gaussian noise to the input.
We then ran three denoisers on the noisy version:
- 3D denoiser: video filter built into FFmpeg 4.3-2 using default parameters - output encoded via lossless FFV1
- Neat Video 5: plugin (version 5.3.0) for Final Cut Pro X (version 10.4.8) using factory settings and automatic noise profiling - output encoded via Apple ProRes 4444 XQ
- Pixop Denoiser: production model available in our web app - output encoded via H.264 @ 36.7 Mbps
- 3D denoiser: 41.2 dB / 0.968
- Neat Video 5: 39.2 dB / 0.975
- Pixop Denoiser: 45.3 dB / 0.986
In this test, Pixop's Denoiser is clearly the most accurate performer of the three, both in terms of PSNR and SSIM (higher numbers are better).
Here are three original frames (70, 180 and 260) from the video sequence:
For each of these frames, here are zoomed-in 75x150 pixels cropouts of the original ("ground truth") and noisy frame, along with the result of applying the 3D Denoiser, Neat Video 5 and Pixop Denoiser respectively.
Frame 70 cropouts
Frame 180 cropouts
Frame 260 cropouts
With the Pixop Denoiser, you get the desirable mix of both texture retention, sharpness and noise reduction as demonstrated in these cropouts.
What type of video material is Pixop Denoiser best suited for?
Ideal for digital video degraded by processes that imply naturally occurring sources of noise are added to the signal, such as video acquisition and/or digitization.
- The maximum output resolution is currently UHD 4K (3920x2160 pixels).
- The current model is not designed to produce a slick image on an extremely noisy input from e.g. night shots. In the future, we will probably introduce the option of an additional model for this purpose.
Expected runtime performance (frames per second (FPS) is stated, as well as the time to process 1 hour of PAL video in parentheses):
- SD - 30 FPS (50 minutes)
- HD - 5 FPS (5 hours)
- 4K - 1.1 FPS (23 hours)
Note that these numbers assume H.264 encoding being used. Very CPU intensive codecs like H.265 and/or extremely high bitrates will yield less performance.