Video denoising is a crucial step in the video editing workflow, typically performed before color grading. Through interviews with professional video editors, we learned that they want not only clean, natural denoising results but also an efficient denoising process that keeps their workflow uninterrupted. At the same time, editors need flexible control to meet different artistic goals.
Specifically, they often balance preserving a moderate amount of noise (to avoid an overly smooth image) against removing it completely, since the best choice varies from scene to scene.
Efficient and high-quality denoising technology can significantly enhance the realism of videos while greatly reducing rendering time and improving overall computational efficiency. Therefore, it holds important practical value and research significance in the field of video processing.
The previous version of Denoiser offered only a limited set of models, so it could not adapt well to different types of footage, which constrained its performance in complex and diverse scenarios. It also faced significant processing-speed bottlenecks, making real-time use and efficient batch processing difficult, especially for high-resolution or long-duration videos. These limitations restricted the wider adoption of Denoiser technology and affected user satisfaction. To overcome these bottlenecks, UniFab has introduced the all-new Denoiser-Fast model.
Denoising quality and performance are determined mainly by the network architecture, the computation pipeline, and how efficiently the implementation uses the hardware. To address the shortcomings of previous versions, Denoiser-Fast introduces technical innovations in each of these areas, described below.

Depthwise Separable Convolution Implementation Principle
Denoiser-Fast uses a modular design, dividing denoising into sub-modules for feature extraction, fusion, and noise suppression. This pipeline reduces computation load and boosts parallel processing. Its core employs depthwise separable convolution, splitting standard convolution into depthwise and pointwise (1x1) convolutions.
Depthwise Convolution
Performs a K×K convolution independently on each input channel, keeping the number of channels unchanged. Its computational cost is:

$$K \times K \times C_{in} \times H \times W$$

where $C_{in}$ is the number of input channels and $H \times W$ is the spatial size of the output feature map.
Pointwise Convolution
Uses a 1×1 convolution to fuse information across channels, with computational cost:

$$C_{in} \times C_{out} \times H \times W$$
Compared to the standard convolution cost of $K \times K \times C_{in} \times C_{out} \times H \times W$, the combined cost is reduced by a factor of:

$$\frac{K \times K \times C_{in} \times H \times W + C_{in} \times C_{out} \times H \times W}{K \times K \times C_{in} \times C_{out} \times H \times W} = \frac{1}{C_{out}} + \frac{1}{K^2}$$
This design reduces model parameters by over 50% and greatly decreases multiply-accumulate (MAC) operations, significantly saving computational resources.
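To make the savings concrete, here is a minimal PyTorch sketch of a depthwise separable block (illustrative only, not UniFab's actual code); the channel sizes and kernel size are arbitrary:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A K x K depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # Depthwise: one K x K filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2,
                                   groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution fuses information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard convolution of the same shape.
standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 128)
print(sum(p.numel() for p in standard.parameters()))   # 3*3*64*128 = 73728
print(sum(p.numel() for p in separable.parameters()))  # 576 + 8192 = 8768
```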
Furthermore, the depthwise separable convolution in Denoiser-Fast carefully preserves inter-channel information interaction to avoid the performance degradation that oversimplification can cause. A lightweight channel attention mechanism, the Squeeze-and-Excitation (SE) module, is introduced; it assigns a weight coefficient $s_c$ to each channel:

$$s_c = \sigma\big(W_2 \, \delta(W_1 z_c)\big)$$

Here, $z_c$ is the global channel descriptor obtained by global average pooling; $W_1$ and $W_2$ are the weights of fully connected layers; and $\delta$ and $\sigma$ are the ReLU and Sigmoid activation functions, respectively. The weight $s_c$ scales the features of the corresponding channel, dynamically adjusting channel importance so that spatial and channel information are captured in balance, achieving both efficiency and denoising quality.
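A minimal sketch of such an SE block in PyTorch (the reduction ratio of 16 is a common default, not a published UniFab setting):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: s_c = sigmoid(W2 * relu(W1 * z_c))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # z_c: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * s  # reweight each channel by its learned importance
```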
The Denoiser-Fast architecture implements a Recursive Residual Module, which recursively applies a single convolutional layer T times, enhancing the network's expressive power without increasing the number of parameters. The recursive structure is defined as:

$$h^{(t)} = \mathcal{F}\big(h^{(t-1)}; \theta\big), \quad t = 1, \dots, T$$

where $\mathcal{F}$ is the convolutional residual function, $\theta$ are its shared parameters, and the initial input is $h^{(0)} = x$.
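A minimal sketch of a weight-shared recursive residual block, assuming $\mathcal{F}$ takes the residual form $h + \mathrm{conv}(h)$ (the exact placement of the residual is not specified in the source):

```python
import torch
import torch.nn as nn

class RecursiveResidualBlock(nn.Module):
    """Applies one shared conv layer T times: h_t = h_{t-1} + conv(h_{t-1})."""
    def __init__(self, channels, T=4):
        super().__init__()
        self.T = T
        # A single weight-shared layer: effective depth grows with T,
        # the parameter count does not.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h = x  # h^(0) = x
        for _ in range(self.T):
            h = h + self.conv(h)  # residual update with shared parameters theta
        return h
```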
Online feature reuse passes the low-level shallow features $f_0$ through skip connections to deeper layers, represented as:

$$y = D\big(E(x) + \alpha f_0\big)$$

where $E$ and $D$ are the encoder and decoder functions, respectively, and $\alpha$ is a learnable fusion weight. This strategy mitigates gradient vanishing and accelerates information flow.
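A toy encoder/decoder sketch of this feature-reuse pattern; the layer shapes and the exact fusion point are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FeatureReuseNet(nn.Module):
    """Shallow features f0 are re-injected before decoding: y = D(E(x) + a * f0)."""
    def __init__(self, ch=32):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 3, padding=1)     # produces f0
        self.encoder = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Conv2d(ch, 3, 3, padding=1)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable fusion weight

    def forward(self, x):
        f0 = self.stem(x)                              # low-level shallow features
        deep = self.encoder(f0)                        # E(...)
        return self.decoder(deep + self.alpha * f0)    # skip-connection fusion
```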
At the computation level, Denoiser-Fast uses tensor tiling to split large intermediate feature matrices into multiple smaller 2D blocks, enabling parallel processing in the GPU multithreaded environment. This approach reduces memory usage per thread and synchronization overhead, significantly improving hardware resource utilization.
Denoiser-Fast transforms convolution into matrix multiplication via the im2col method:

$$Y = W \cdot X$$

where $X$ is the input matrix after the im2col transformation and $W$ is the weight matrix.
To handle large matrices, the model applies tensor tiling, partitioning the feature matrix into smaller sub-blocks of size $M \times N$:

$$X = \begin{bmatrix} X_{11} & \cdots & X_{1q} \\ \vdots & \ddots & \vdots \\ X_{p1} & \cdots & X_{pq} \end{bmatrix}, \qquad X_{ij} \in \mathbb{R}^{M \times N}$$
The resulting block matrix multiplications run in parallel across GPU threads, reducing memory and synchronization overhead while boosting throughput. Using libraries such as NVIDIA's CUTLASS and cuBLASLt, Denoiser-Fast performs the im2col-based matrix multiplication with optimal memory and compute efficiency, adaptively selecting tiling sizes and thread mappings for the hardware at hand, which significantly speeds up inference.
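The sketch below demonstrates only the im2col reformulation in PyTorch (where the transform is called `unfold`); the tiling and thread mapping themselves happen inside libraries such as CUTLASS and are not reproduced here:

```python
import torch
import torch.nn.functional as F

# Convolution expressed as matrix multiplication via im2col.
x = torch.randn(1, 64, 56, 56)       # input: N, C, H, W
w = torch.randn(128, 64, 3, 3)       # weights: out_ch, in_ch, K, K

cols = F.unfold(x, kernel_size=3, padding=1)   # (1, 64*3*3, 56*56) = im2col(X)
w_mat = w.view(128, -1)                        # (128, 64*3*3) = W
y = w_mat @ cols                               # the matmul Y = W X
y = y.view(1, 128, 56, 56)                     # fold back to feature-map shape

# Matches the direct convolution up to floating-point error.
ref = F.conv2d(x, w, padding=1)
print(torch.allclose(y, ref, atol=1e-3))       # True
```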
To enhance the model's ability to fit complex nonlinear relationships, Denoiser-Fast innovatively integrates a hybrid adaptive module combining parameterized ReLU (PReLU) and Swish activation functions. This module dynamically adjusts the activation function shape through learnable parameters, achieving adaptive nonlinear mapping while balancing network expressiveness and numerical stability.
The PReLU function is expressed as:

$$\mathrm{PReLU}(x) = \begin{cases} x, & x \ge 0 \\ a\,x, & x < 0 \end{cases}$$

where $a$ is a learnable parameter that improves representation in the negative input range.
The Swish function is defined as:

$$\mathrm{Swish}(x) = x \cdot \sigma(\beta x)$$

where $\sigma$ is the Sigmoid function and $\beta$ is a tunable parameter that enables nonlinear adaptation and smooth gradient flow, promoting detail preservation and edge enhancement.
This design effectively separates noise from signal and performs especially well on blurry edges and fine textures, aiding detail restoration and edge retention while avoiding common issues such as over-smoothing and artifacts.
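A sketch of one plausible hybrid activation; the source does not specify how PReLU and Swish are combined, so the learnable blend below is an assumption:

```python
import torch
import torch.nn as nn

class HybridActivation(nn.Module):
    """Learnable blend of PReLU and Swish; the blend form is an assumption."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.25))     # PReLU negative slope a
        self.beta = nn.Parameter(torch.tensor(1.0))   # Swish temperature beta
        self.mix = nn.Parameter(torch.tensor(0.5))    # blend coefficient (assumed)

    def forward(self, x):
        prelu = torch.where(x >= 0, x, self.a * x)    # PReLU(x)
        swish = x * torch.sigmoid(self.beta * x)      # Swish(x) = x * sigmoid(beta x)
        lam = torch.sigmoid(self.mix)                 # keep blend weight in (0, 1)
        return lam * prelu + (1 - lam) * swish
```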
Dynamic Block Partitioning Mechanism
Denoiser-Fast uses a content-aware dynamic block partitioning strategy that adjusts block sizes based on frame size, noise density, and motion complexity. This creates variable, overlapping blocks that optimize resource use, avoiding resource waste from static partitioning and enhancing asynchronous processing efficiency and accuracy.
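As a rough illustration, a content-aware partitioner might pick tile sizes like this; the thresholds and the noise estimator below are placeholders, not UniFab's actual policy:

```python
import torch
import torch.nn.functional as F

def choose_block_size(frame: torch.Tensor, flow_mag: float, base: int = 256) -> int:
    """Illustrative heuristic: smaller blocks for noisy or fast-moving content.

    frame: (1, C, H, W) tensor; flow_mag: precomputed motion magnitude.
    All thresholds are placeholder values for demonstration.
    """
    # Rough noise estimate: std of the high-frequency residual.
    noise = (frame - F.avg_pool2d(frame, 3, 1, 1)).std().item()
    size = base
    if noise > 0.05 or flow_mag > 2.0:   # dense noise or complex motion
        size = base // 2                 # finer blocks, more overlap
    return max(size, 64)                 # clamp to a minimum tile size
```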
Asynchronous Pipeline Framework Design
Denoiser-Fast uses a four-stage pipeline—block preparation, feature extraction, temporal alignment, and fusion reconstruction—running concurrently on multiple CUDA streams. This overlapping execution avoids sync bottlenecks, with blocks processed in parallel via queues for continuous, non-blocking data flow and high throughput.
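A minimal two-stream sketch of this kind of overlap using PyTorch CUDA streams; a real four-stage pipeline would add dedicated stages for alignment and fusion:

```python
import torch

# Two CUDA streams overlap host-to-device copies with kernel execution.
assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()

frames = [torch.randn(1, 3, 1080, 1920, pin_memory=True) for _ in range(8)]
results = []

for cpu_frame in frames:
    with torch.cuda.stream(copy_stream):
        gpu_frame = cpu_frame.to("cuda", non_blocking=True)  # async H2D copy
    compute_stream.wait_stream(copy_stream)  # sync only against the copy stream
    with torch.cuda.stream(compute_stream):
        results.append(gpu_frame * 2.0)      # stand-in for the denoise kernel

torch.cuda.synchronize()  # drain both streams before reading results
```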
Cross-block State Transmission Based on Gated Recurrent Units (GRU)
To reduce spatial fragmentation from block processing, Denoiser-Fast uses lightweight GRU modules to pass hidden states between blocks. This enables cross-region feature fusion with context, preventing artifacts and boundary discontinuities.
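A sketch of cross-block state passing with a GRU cell, assuming each block is summarized by a pooled feature vector and blocks are visited in scan order (both assumptions):

```python
import torch
import torch.nn as nn

class CrossBlockGRU(nn.Module):
    """Carries a hidden state across blocks so each tile sees its neighbors' context."""
    def __init__(self, feat_dim=64, hidden=64):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden)

    def forward(self, block_features):           # list of (B, feat_dim) descriptors
        h = torch.zeros(block_features[0].shape[0], self.cell.hidden_size,
                        device=block_features[0].device)
        out = []
        for f in block_features:                  # blocks visited in scan order
            h = self.cell(f, h)                   # hidden state passed block to block
            out.append(h)
        return out
```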
Boundary Smoothing and Overlapping Computation Strategy
For block edges, Denoiser-Fast applies a learnable weighted overlapping fusion scheme that averages outputs in overlapping areas, effectively suppressing edge artifacts. Weights are automatically optimized during training, with dynamically adjustable values at different positions to ensure seamless stitching.
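For intuition, here is a fixed linear cross-fade over the overlap region; in Denoiser-Fast the weights are learned and position-dependent, so the ramp below is a stand-in:

```python
import torch

def blend_overlap(left: torch.Tensor, right: torch.Tensor, overlap: int) -> torch.Tensor:
    """Stitch two horizontally adjacent block outputs (N, C, H, W) by
    cross-fading their shared columns; a linear ramp stands in for the
    learnable weights described above."""
    w = torch.linspace(1.0, 0.0, overlap, device=left.device)  # fade-out weights
    w = w.view(1, 1, 1, overlap)                               # broadcast over N, C, H
    fused = left[..., -overlap:] * w + right[..., :overlap] * (1 - w)
    return torch.cat([left[..., :-overlap], fused, right[..., overlap:]], dim=-1)
```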
Deep Utilization of Hardware Prefetch Mechanism
Denoiser-Fast optimizes NVIDIA GPU prefetching by adjusting memory access stride and block sizes to ensure contiguous data layout in VRAM, enhancing cache line usage and preloading data to reduce memory latency.
Memory Access Pattern Reorganization
To solve access jumps caused by depthwise convolution, Denoiser-Fast reorganizes data storage formats by mixing NHWC and NCHW layouts. Using an access frequency analyzer, it dynamically switches layouts to reduce non-coalesced accesses, improving memory efficiency and minimizing SIMD instruction stalls.
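In PyTorch terms, switching a layer onto the NHWC fast path looks like this; the adaptive per-layer switching described above is not public, so this only demonstrates the mechanism:

```python
import torch
import torch.nn as nn

# channels_last stores tensors in NHWC order, which lets cuDNN kernels make
# coalesced reads for depthwise convolutions on NVIDIA GPUs.
conv = nn.Conv2d(64, 64, 3, padding=1, groups=64).cuda()   # depthwise conv
conv = conv.to(memory_format=torch.channels_last)

x = torch.randn(8, 64, 270, 480, device="cuda")
x_nhwc = x.to(memory_format=torch.channels_last)  # same data, NHWC strides

y = conv(x_nhwc)  # runs on the NHWC fast path
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```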
Multi-level Cache Management and Intelligent Replacement
Denoiser-Fast uses multi-level cache controllers with adaptive replacement algorithms that prioritize hot data, reduce write-backs, and maximize cache hit rates, easing VRAM bandwidth bottlenecks through real-time analysis of access patterns.
Pipeline-friendly Data Flow Design
Aligned with asynchronous pipeline processing, Denoiser-Fast designs a circular buffer structure, reusing cache in different compute stages via ring buffers. This avoids frequent VRAM allocation and release, reduces memory fragmentation and loading stalls, enabling high-speed data flow and concurrent access.
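A minimal ring-buffer sketch: a fixed pool of preallocated GPU tensors is handed out cyclically, so steady-state processing allocates no new VRAM (slot count and shapes are illustrative):

```python
import torch

class RingBuffer:
    """Preallocates N same-shaped GPU tensors and hands them out cyclically."""
    def __init__(self, n_slots, shape, device="cuda"):
        self.slots = [torch.empty(shape, device=device) for _ in range(n_slots)]
        self.i = 0

    def next(self) -> torch.Tensor:
        buf = self.slots[self.i]
        self.i = (self.i + 1) % len(self.slots)  # wrap around: circular reuse
        return buf

buffers = RingBuffer(4, (1, 64, 540, 960))
for step in range(100):
    out = buffers.next()   # reused slot, no allocation after warm-up
    out.normal_()          # stand-in for writing one pipeline stage's output
```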
Denoiser-Fast features an efficient Command Queue Manager (CQM) that monitors GPU SM utilization and task queue depth in real time. It employs dynamic load balancing to allocate tasks intelligently, maximizing compute unit utilization and minimizing idle time. Scheduling adapts based on feedback to enhance overall throughput.
Leveraging CUDA multi-stream concurrency, Denoiser-Fast breaks computation into multiple asynchronous task streams covering data transfer, kernel execution, and memory operations, forming a fine-grained pipeline whose total time is approximately:

$$T_{\text{total}} \approx \max\left(\sum_{i=1}^{n} \text{Compute}_i,\; \sum_{i=1}^{n} \text{Transfer}_i\right)$$

where $n$ is the number of active streams, and $\text{Compute}_i$ and $\text{Transfer}_i$ are the computation and data-transfer times of stream $i$; maximizing their overlap improves efficiency.
Additionally, the system supports hot-plug flexible scheduling: when GPU units become temporarily unavailable or system resources dynamically change, the command queue automatically redistributes tasks to avoid pipeline stalls and compute interruptions.
Denoiser-Fast adopts advanced Automatic Mixed Precision (AMP) technology with a custom precision control module dynamically switching computation data formats between FP16 and FP32.
Dynamic Loss Scaling adjusts the loss as:

$$L_{\text{scaled}} = s \cdot L$$

where $L$ is the current loss and $s$ is an adaptive scaling factor that prevents the instability caused by FP16's limited numeric range during training.
The system also supports layer-wise precision allocation, using second-order gradient information estimation (Hessian approximation) to determine sensitivity of layers to precision. Sensitive layers retain FP32 computation, while less sensitive ones use FP16, balancing performance and accuracy optimally.
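A minimal AMP training step in PyTorch showing both autocast and the dynamic loss scaling $s \cdot L$; this illustrates the standard mechanism, not UniFab's custom precision controller or its Hessian-based layer allocation:

```python
import torch
import torch.nn as nn

# autocast picks FP16 where safe; GradScaler applies dynamic loss scaling.
model = nn.Conv2d(3, 3, 3, padding=1).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # adapts s when overflows occur

x = torch.randn(4, 3, 256, 256, device="cuda")
target = torch.randn_like(x)

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.mse_loss(model(x), target)  # FP16 forward pass

scaler.scale(loss).backward()   # backward on the scaled loss s * L
scaler.step(opt)                # unscales gradients, skips step on overflow
scaler.update()                 # grows or shrinks s adaptively
opt.zero_grad()
```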
Denoiser-Fast employs hardware-aware kernel fusion to combine multiple lightweight operations into single CUDA kernels, reducing kernel launch overhead, improving data reuse, and increasing compute density.
It fully utilizes NVIDIA GPU Tensor Cores via the WMMA (Warp Matrix Multiply-Accumulate) interface to support mixed-precision matrix multiplication:

$$D = A \times B + C$$

where $A$ and $B$ are FP16 matrices and the accumulation into $C$ and $D$ is performed at higher precision, enabling hardware-level acceleration for convolution and fully connected operations. Compared to traditional FP32 methods, inference speed can increase by up to 8x.
Using stream-parallel prefetching algorithms, Denoiser-Fast proactively loads FP16 data blocks into L1 shared cache, hiding VRAM access latency.
Combined with cache-aware scheduling, it orders high-memory-load tasks to reduce cache replacement frequency and minimize cache thrashing. CUDA streams enable overlapping compute and data transfer, maximizing bandwidth utilization.
In multiple performance tests, the new Denoiser version demonstrates significant advantages. It achieves approximately 50% faster processing speed compared to the previous version, with noticeable noise suppression and complete detail preservation. While maintaining high speed, the new Denoiser-Fast model also ensures high-quality output, delivering powerful real-time processing capability.
Test material: 30 s clip, 1920×1080, 24 fps (720 frames)

| Model | Processing Time | Processing Speed |
| --- | --- | --- |
| UniFab Standard Model | 1 min 10 s | 10.3 fps |
| UniFab Fast Model | 48 s | 15 fps |
Through algorithm optimization and architectural upgrades, we have significantly overcome the performance bottlenecks and limitations of the previous version. In the future, we will continue to integrate cutting-edge AI technologies to advance intelligent denoising algorithms, explore better denoising models, and achieve more efficient denoising solutions, helping the video processing field embrace broader development prospects.
👉 Community Link: 🎉New Model | UniFab Denoiser Fast Model in Detail - UniFab AI Community
You are welcome to share topics or models that interest you on our forum. We regularly publish technical reviews and version updates to drive continuous improvement, and we carefully consider your testing and evaluation feedback.
Next article topic: UniFab Video Upscaler AI
Preview of past articles: