New Model | UniFab Denoiser Fast Model in Detail

Video denoising is a crucial step in the video editing workflow, typically performed before color grading. Through interviews with multiple professional video editors, we learned that they not only expect clean, natural-looking denoising results but also value an efficient denoising process that keeps their workflow uninterrupted. At the same time, editors need flexible control to meet different artistic goals.

Specifically, they often strive to strike a balance between preserving a moderate amount of noise (to avoid overly smooth images) and completely removing noise, as the best approach varies depending on the scene.

Efficient and high-quality denoising technology can significantly enhance the realism of videos while greatly reducing rendering time and improving overall computational efficiency. Therefore, it holds important practical value and research significance in the field of video processing.

Background and Challenges

In the previous version of the Denoiser, the limited selection of models meant it could not adapt well to different types of video footage, which constrained its performance in complex and diverse scenarios. The older version also faced significant bottlenecks in processing speed, making it difficult to meet the demands of real-time or efficient batch processing.

This was especially challenging when dealing with high-resolution or long-duration videos. These limitations restricted the widespread application of Denoiser technology and affected user satisfaction. To overcome these bottlenecks, UniFab has introduced the all-new Denoiser-Fast model.

Technical Analysis of Video Denoiser

The denoising effect and performance are mainly influenced by the following factors:

Technical Details of Algorithm Complexity

Depthwise Separable Convolution Implementation Principle

  • The computational complexity of a traditional convolution is O(K² ⋅ C_in ⋅ C_out ⋅ H ⋅ W), where K is the kernel size, C_in and C_out are the numbers of input and output channels, and H, W are the spatial dimensions. Depthwise separable convolution decomposes this into:
    • Depthwise Convolution: performs convolution independently on each input channel, reducing the complexity to O(K² ⋅ C_in ⋅ H ⋅ W);
    • Pointwise Convolution (1×1 convolution): fuses channel information, with complexity O(C_in ⋅ C_out ⋅ H ⋅ W).
      This design significantly reduces the number of parameters and the amount of computation while still capturing both spatial and channel information.
  • Recursive Residual Unit Design Details
    The recursive residual structure reuses sub-module parameters through recursive calls, simulating multiple nonlinear layers with fewer parameters. This reduces network complexity and improves model generalization.
  • Channel Attention Mechanism
    Using a Squeeze-and-Excitation (SE) module to weight channels, the process includes:
    • Global average pooling to obtain channel descriptors;
    • Passing through two fully connected layers to generate channel weights;
    • Applying weights to each channel to achieve dynamic channel enhancement, selectively strengthening information and effectively improving feature representation capability.

Specific Technical Details of Data Processing Workflow

  • Zero-copy Technology Implementation
    Zero-copy lets the GPU access CPU memory directly by sharing mappings between user space and kernel space, avoiding redundant data copies. This is typically achieved through Direct Memory Access (DMA) and OS-supported memory-mapping APIs (e.g., mmap), reducing CPU involvement and greatly lowering latency (a related sketch follows this list).
  • Asynchronous Pipeline and Task Scheduling
    Multiple tasks run concurrently using CUDA streams, with tasks distributed across streams and synchronized via events (cudaEvent) to manage data dependencies. Asynchronous scheduling allows simultaneous data loading, computation, and output, achieving a non-blocking high-throughput pipeline.
  • Block Processing and State Maintenance
    Images are divided into overlapping patches to ensure boundary consistency, while contextual states are passed to subsequent patches using Gated Recurrent Units (GRU). This avoids edge artifacts and maintains continuity in processing.
  • Cache Locality Optimization
    Memory alignment and prefetch instructions are used to ensure coherent memory access patterns, improving L1 and L2 cache hit rates and significantly reducing access latency.
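
As a point of reference for the zero-copy item above, the sketch below uses PyTorch (an assumption; UniFab's internal implementation is not published) to show the closely related mechanism of page-locked (pinned) host buffers with non-blocking host-to-device copies, which lets DMA transfers overlap with GPU work:

```python
import torch

# Illustrative only: pinned (page-locked) host memory lets the GPU's DMA engine
# transfer data without an extra staging copy, and non_blocking=True allows the
# copy to overlap with work already queued on the GPU.
device = torch.device("cuda")

frame_cpu = torch.randn(1, 3, 1080, 1920).pin_memory()   # pinned host buffer
frame_gpu = frame_cpu.to(device, non_blocking=True)      # asynchronous H2D copy

# Kernels queued on the same CUDA stream are ordered after the copy, so the
# data is guaranteed to be present when this pooling kernel reads it.
smoothed = torch.nn.functional.avg_pool2d(frame_gpu, kernel_size=3, stride=1, padding=1)
torch.cuda.synchronize()  # wait for the queued copy and kernel to finish
```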

Specific Optimization Techniques at the Hardware Execution Level

  • Mixed Precision Training and Inference
    NVIDIA Tensor Cores enable FP16 inference; during training, dynamic loss scaling is applied to prevent numerical underflow or overflow, ensuring stable model training. Quantization strategies support INT8 inference, with calibration data used to determine quantization parameters.
  • Tensor Partitioning Scheduling and Thread Block Mapping
    Large tensors are sliced into smaller chunks and distributed across different GPU Streaming Multiprocessors (SMs). Threads cooperate in warps to mask memory latency and improve throughput.
  • High-speed Memory Access and Bandwidth Optimization
    Using HBM2/3 memory combined with software prefetchers and hardware cache coherence protocols reduces memory bottlenecks. Coalesced memory access patterns minimize bus contention and maximize bandwidth utilization.
  • Heterogeneous Computing and Multi-device Collaboration
    A unified scheduling layer dynamically assigns tasks to CPUs, GPUs, and possibly FPGAs. PCIe passthrough and NVLink high-speed interconnect enable efficient data transfer and synchronization across devices, ensuring smooth overall computation pipelines.

UniFab New Model - Denoiser-Fast

To address the functional shortcomings of previous versions, UniFab Denoiser-Fast incorporates several key technical innovations:

  • Lightweight Architecture: Utilizes depthwise separable convolutions and modular design to significantly reduce model parameters and computation, enabling efficient inference.
  • Recursive Residuals and Feature Reuse: Enhances feature modeling through recursive residual structures, with multi-layer skip connections for fast feature fusion and gradient propagation.
  • Matrix Multiplication and Tensor Partitioning Optimization: Dynamically partitions large feature matrices and leverages efficient matrix multiplication libraries for parallel computation, improving hardware utilization and inference speed.
  • Adaptive Activation Functions: Integrates parameterized activation modules to enhance non-linear representation, improving detail restoration and noise separation.
  • Asynchronous Block Pipeline: Divides images into blocks processed asynchronously to reduce latency while ensuring result continuity and stability.
  • Memory Access Optimization: Enhances data memory contiguity and cache hit rates to reduce bandwidth bottlenecks and boost overall processing efficiency.
  • Hardware Collaborative Scheduling and Mixed Precision: Combines CUDA stream concurrency management with mixed precision computing to fully leverage GPU power, achieving high throughput and low latency.

In-depth Analysis of the Denoiser-Fast Model’s Underlying Principles

Lightweight Architecture Design and Modular Construction

Denoiser-Fast uses a modular design, dividing denoising into sub-modules for feature extraction, fusion, and noise suppression. This pipeline reduces computation load and boosts parallel processing. Its core employs depthwise separable convolution, splitting standard convolution into depthwise and pointwise (1x1) convolutions.

Depthwise Convolution
Performs K×K convolution independently on each input channel, keeping the number of output channels unchanged. Its computational cost is:

O(K² ⋅ C_in ⋅ H ⋅ W)

Pointwise Convolution
Uses 1×1 convolution to fuse channel information, with computational cost:

O(C_in ⋅ C_out ⋅ H ⋅ W)

Compared to the standard convolution computational cost:

O(K² ⋅ C_in ⋅ C_out ⋅ H ⋅ W)

This design reduces model parameters by over 50% and greatly decreases multiply-accumulate (MAC) operations, significantly saving computational resources.
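
To make the decomposition concrete, here is a minimal sketch in PyTorch (an assumption for illustration, not UniFab's actual code). The channel counts and kernel size are illustrative; the printed parameter counts show the roughly 8x reduction for this configuration:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A K×K depthwise convolution followed by a 1×1 pointwise convolution."""
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        # Depthwise: groups=c_in gives one K×K filter per input channel,
        # cost ~ O(K² · C_in · H · W).
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        # Pointwise: a 1×1 convolution fuses channel information,
        # cost ~ O(C_in · C_out · H · W).
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for illustrative sizes (C_in = C_out = 64, K = 3):
standard = nn.Conv2d(64, 64, 3, padding=1, bias=False)
separable = DepthwiseSeparableConv(64, 64, 3)
print(sum(p.numel() for p in standard.parameters()))   # 36864
print(sum(p.numel() for p in separable.parameters()))  # 576 + 4096 = 4672
```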

Furthermore, the depthwise separable convolution in Denoiser-Fast carefully considers inter-channel information interaction to avoid the performance degradation that oversimplification can cause. A lightweight channel attention mechanism, such as the Squeeze-and-Excitation (SE) module, is introduced. It assigns a weight coefficient s_c to each channel:

s_c = σ(W_2 ⋅ δ(W_1 ⋅ z_c))

Here, z_c is the global channel descriptor obtained by global average pooling; W_1 and W_2 are the weights of the two fully connected layers; δ and σ are the ReLU and Sigmoid activation functions respectively. The weight s_c scales the features of the corresponding channel, dynamically adjusting channel importance so that spatial and channel information are captured in a balanced way, achieving both efficiency and denoising performance.
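
The SE computation described above can be sketched as follows in PyTorch (an illustrative assumption; the reduction ratio of 16 is a common default, not a published UniFab value):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: s_c = sigmoid(W_2 · relu(W_1 · z_c))."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # squeeze: global average pooling -> z_c
        self.fc = nn.Sequential(                  # excitation: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        z = self.pool(x).view(n, c)               # channel descriptors z_c
        s = self.fc(z).view(n, c, 1, 1)           # channel weights s_c in (0, 1)
        return x * s                              # reweight each channel dynamically
```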

Recursive Residual Mechanism and Online Feature Reuse

The Denoiser-Fast architecture implements a Recursive Residual Module, which recursively applies a single convolutional layer T times, enhancing the network's expressive power without increasing the number of parameters. The recursive structure is defined as:

h^(t) = F(h^(t−1); θ) + h^(t−1),  t = 1, …, T

where F is the convolutional residual function, θ denotes the shared parameters, and the initial input is h^(0) = x.

Online feature reuse uses skip connections to pass shallow, low-level features f_0 to deeper layers, represented as:

y = D(E(x)) + α ⋅ f_0

where E and D are the encoder and decoder functions respectively, and α is a learnable fusion weight. This strategy mitigates gradient vanishing and accelerates information flow.
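
A minimal PyTorch sketch of these two ideas is shown below (an illustration under stated assumptions, not UniFab's implementation: the recursion depth T = 3, the exact residual form, and the placement of the learnable fusion weight α are all assumed):

```python
import torch
import torch.nn as nn

class RecursiveResidualBlock(nn.Module):
    """One shared convolutional layer applied T times with a residual add:
    h^(t) = F(h^(t-1); θ) + h^(t-1), with h^(0) = x (parameters reused each step)."""
    def __init__(self, channels: int, depth: int = 3):
        super().__init__()
        self.shared = nn.Sequential(                  # F(·; θ), same weights every recursion
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.PReLU(),
        )
        self.depth = depth
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight α

    def forward(self, x):
        f0 = x                                        # shallow feature reused via skip connection
        h = x
        for _ in range(self.depth):                   # T recursive applications, no extra parameters
            h = self.shared(h) + h
        return h + self.alpha * f0                    # fuse shallow features: ... + α · f_0
```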

Matrix Multiplication Optimization and Tensor Tiling Technique


At the computation level, Denoiser-Fast uses tensor tiling to split large intermediate feature matrices into multiple smaller 2D blocks, enabling parallel processing in the GPU multithreaded environment. This approach reduces memory usage per thread and synchronization overhead, significantly improving hardware resource utilization.

Denoiser-Fast transforms convolution operations into matrix multiplication via the im2col method:

Y = W ⋅ X

where X is the input matrix after im2col transformation and W is the weight matrix.

To handle large matrices, the model applies tensor tiling by dividing feature matrices into multiple smaller blocks of size M×N:

Y = [Y_ij],  Y_ij = Σ_k W_ik ⋅ X_kj  (each tile of size M × N)

Matrix multiplications run in parallel across GPU threads, reducing memory and synchronization overhead while boosting throughput. Using libraries such as NVIDIA CUTLASS and cuBLASLt, Denoiser-Fast adaptively selects tiling sizes and thread mappings for the target hardware, significantly speeding up inference.
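
The im2col-plus-GEMM equivalence can be verified with a short PyTorch snippet (illustrative only; the tensor sizes are arbitrary, and the tiled scheduling itself happens inside the GEMM library rather than in this code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 32, 32)                  # N, C_in, H, W (illustrative sizes)
w = torch.randn(32, 16, 3, 3)                   # C_out, C_in, K, K

cols = F.unfold(x, kernel_size=3, padding=1)    # im2col: (1, C_in·K², H·W) = X
w_mat = w.view(32, -1)                          # (C_out, C_in·K²)          = W
y_mat = w_mat @ cols                            # Y = W · X, a single large GEMM
y = y_mat.view(1, 32, 32, 32)                   # fold back to N, C_out, H, W

y_ref = F.conv2d(x, w, padding=1)               # direct convolution for comparison
print(torch.allclose(y, y_ref, atol=1e-4))      # True: both paths give the same result
```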

Adaptive Activation Functions and Nonlinear Mapping Adjustment


To enhance the model's ability to fit complex nonlinear relationships, Denoiser-Fast innovatively integrates a hybrid adaptive module combining parameterized ReLU (PReLU) and Swish activation functions. This module dynamically adjusts the activation function shape through learnable parameters, achieving adaptive nonlinear mapping while balancing network expressiveness and numerical stability.

The PReLU function is expressed as:

f(x) = max(0, x) + a ⋅ min(0, x)

where a is a learnable parameter that improves representation in the negative input range.

The Swish function is defined as:

f(x) = x ⋅ σ(β ⋅ x)

where β is a learnable parameter that adapts the nonlinearity and keeps gradients smooth, which helps preserve detail and enhance edges.

This design effectively separates noise from signal, performing exceptionally well in blurry edges and subtle texture areas. It facilitates detail restoration and edge retention while avoiding common issues such as over-smoothing and artifacts.
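
One plausible way to combine the two activations is sketched below in PyTorch (an assumption for illustration; the exact hybrid used in Denoiser-Fast, including how the blend weight is parameterized, is not published):

```python
import torch
import torch.nn as nn

class AdaptiveActivation(nn.Module):
    """Learnable blend of PReLU and Swish: the PReLU slope a, the Swish β,
    and the mixing weight are all trained together with the network."""
    def __init__(self, channels: int):
        super().__init__()
        self.prelu = nn.PReLU(num_parameters=channels)   # learnable negative slope a, per channel
        self.beta = nn.Parameter(torch.ones(1))          # learnable Swish parameter β
        self.mix = nn.Parameter(torch.tensor(0.0))       # learnable blend logit

    def forward(self, x):
        swish = x * torch.sigmoid(self.beta * x)         # f(x) = x · σ(β · x)
        m = torch.sigmoid(self.mix)                      # keep the blend weight in (0, 1)
        return m * self.prelu(x) + (1.0 - m) * swish
```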

Data Flow and Computing Process Optimization


Asynchronous Block Pipeline Scheduling

  • Dynamic Block Partitioning Mechanism

    Denoiser-Fast uses a content-aware dynamic block partitioning strategy that adjusts block sizes based on frame size, noise density, and motion complexity. This creates variable, overlapping blocks that optimize resource use, avoiding resource waste from static partitioning and enhancing asynchronous processing efficiency and accuracy.

  • Asynchronous Pipeline Framework Design

    Denoiser-Fast uses a four-stage pipeline—block preparation, feature extraction, temporal alignment, and fusion reconstruction—running concurrently on multiple CUDA streams. This overlapping execution avoids sync bottlenecks, with blocks processed in parallel via queues for continuous, non-blocking data flow and high throughput.

  • Cross-block State Transmission Based on Gated Recurrent Units (GRU)

    To reduce spatial fragmentation from block processing, Denoiser-Fast uses lightweight GRU modules to pass hidden states between blocks. This enables cross-region feature fusion with context, preventing artifacts and boundary discontinuities.

  • Boundary Smoothing and Overlapping Computation Strategy

    For block edges, Denoiser-Fast applies a learnable weighted overlapping fusion scheme that averages outputs in overlapping areas, effectively suppressing edge artifacts. Weights are automatically optimized during training, with dynamically adjustable values at different positions to ensure seamless stitching (see the sketch after this list).
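
The sketch below illustrates overlapping tiles with weighted blending in PyTorch (an assumption for illustration: it uses a fixed raised-cosine window instead of learned weights, a hypothetical `model` callable, and assumes the frame is at least one tile in each dimension):

```python
import torch

def denoise_in_overlapping_blocks(frame, model, block=256, overlap=32):
    """Run `model` on overlapping tiles of `frame` (N, C, H, W) and blend the
    overlaps with a smooth weight window so tile seams do not show."""
    _, _, h, w = frame.shape
    stride = block - overlap
    out = torch.zeros_like(frame)
    weight = torch.zeros_like(frame)

    # 2D raised-cosine window: large in the tile centre, tapering toward the edges.
    win1d = torch.hann_window(block, periodic=False, device=frame.device)
    win = torch.clamp(win1d[:, None] * win1d[None, :], min=1e-3)

    for top in range(0, max(h - overlap, 1), stride):
        for left in range(0, max(w - overlap, 1), stride):
            b, r = min(top + block, h), min(left + block, w)
            t, l = b - block, r - block        # shift the last tile back inside the frame
            tile = frame[:, :, t:b, l:r]
            out[:, :, t:b, l:r] += model(tile) * win
            weight[:, :, t:b, l:r] += win

    return out / weight                        # normalized weighted average in the overlaps
```

In the full pipeline described above, the tiles would additionally be dispatched asynchronously across CUDA streams and carry GRU state between neighbours; the loop here only shows the boundary-blending part.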

Memory Access Continuity Optimization

  • Deep Utilization of Hardware Prefetch Mechanism

    Denoiser-Fast optimizes NVIDIA GPU prefetching by adjusting memory access stride and block sizes to ensure contiguous data layout in VRAM, enhancing cache line usage and preloading data to reduce memory latency.

  • Memory Access Pattern Reorganization

    To solve access jumps caused by depthwise convolution, Denoiser-Fast reorganizes data storage formats by mixing NHWC and NCHW layouts. Using an access frequency analyzer, it dynamically switches layouts to reduce non-coalesced accesses, improving memory efficiency and minimizing SIMD instruction stalls (see the sketch after this list).

  • Multi-level Cache Management and Intelligent Replacement

    Uses multi-level cache controllers with adaptive replacement algorithms to prioritize hot data, reduce write-back, and maximize cache hit rates, effectively easing VRAM bandwidth bottlenecks through real-time access pattern analysis.

  • Pipeline-friendly Data Flow Design

    Aligned with asynchronous pipeline processing, Denoiser-Fast designs a circular buffer structure, reusing cache in different compute stages via ring buffers. This avoids frequent VRAM allocation and release, reduces memory fragmentation and loading stalls, enabling high-speed data flow and concurrent access.
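
As a small illustration of the NHWC idea mentioned above, PyTorch exposes the same layout as "channels_last" (this is an analogy under assumed sizes, not UniFab's code):

```python
import torch
import torch.nn as nn

# NCHW is PyTorch's default layout; NHWC ("channels_last") usually coalesces
# better for convolutions on recent NVIDIA GPUs, especially in FP16.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
).cuda().to(memory_format=torch.channels_last)

x = torch.randn(1, 3, 1080, 1920, device="cuda").to(memory_format=torch.channels_last)
with torch.no_grad():
    y = model(x)                    # cuDNN selects NHWC kernels for this layout
print(y.is_contiguous(memory_format=torch.channels_last))   # True
```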

Hardware Scheduling and Mixed Precision Strategy

Efficient Command Queue Manager and Dynamic Task Scheduling

Denoiser-Fast features an efficient Command Queue Manager (CQM) that monitors GPU SM utilization and task queue depth in real time. It employs dynamic load balancing to allocate tasks intelligently, maximizing compute unit utilization and minimizing idle time. Scheduling adapts based on feedback to enhance overall throughput.

Leveraging CUDA multi-stream concurrency, Denoiser-Fast breaks down computation into multiple asynchronous task streams covering data transfer, kernel execution, and memory operations, forming a fine-grained pipeline:

T_total ≈ max( Σ_i Compute_i , Σ_i Transfer_i ),  i = 1, …, n

where n is the number of active streams, and Compute_i and Transfer_i are the computation and data-transfer times for stream i; maximizing their overlap improves overall efficiency.
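
A minimal PyTorch illustration of this multi-stream overlap is given below (an assumption for illustration: two streams, a trivial pooling kernel standing in for the denoising work, and synthetic frames):

```python
import torch
import torch.nn.functional as F

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()      # stream dedicated to host-to-device transfers
compute_stream = torch.cuda.Stream()   # stream dedicated to kernels

frames = [torch.randn(1, 3, 540, 960).pin_memory() for _ in range(8)]
results = []

for cpu_frame in frames:
    with torch.cuda.stream(copy_stream):                  # Transfer_i
        gpu_frame = cpu_frame.to(device, non_blocking=True)
    compute_stream.wait_stream(copy_stream)               # event-based dependency between streams
    gpu_frame.record_stream(compute_stream)               # keep the buffer alive for the consumer
    with torch.cuda.stream(compute_stream):               # Compute_i, overlaps the next transfer
        results.append(F.avg_pool2d(gpu_frame, kernel_size=3, stride=1, padding=1))

torch.cuda.synchronize()               # drain both streams before reading the results
```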

Additionally, the system supports hot-plug flexible scheduling: when GPU units become temporarily unavailable or system resources dynamically change, the command queue automatically redistributes tasks to avoid pipeline stalls and compute interruptions.

Fine-Grained Mixed Precision Computing and Dynamic Format Switching

Denoiser-Fast adopts advanced Automatic Mixed Precision (AMP) technology with a custom precision control module dynamically switching computation data formats between FP16 and FP32.

  • Inference uses FP16, halving bandwidth and significantly boosting compute speed—achieving over 2x acceleration and memory savings.
  • Training uses FP32 for key gradient calculations to ensure stability and prevent numerical overflow or underflow.

Dynamic Loss Scaling adjusts loss values as:

L_scaled = s ⋅ L

where L is the current loss, and s is an adaptive scaling factor preventing instability caused by FP16’s limited numeric range during training.

The system also supports layer-wise precision allocation, using second-order gradient information estimation (Hessian approximation) to determine sensitivity of layers to precision. Sensitive layers retain FP32 computation, while less sensitive ones use FP16, balancing performance and accuracy optimally.
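
For readers who want to see the mechanics, the standard automatic-mixed-precision pattern with dynamic loss scaling looks like this in PyTorch (a generic sketch of the technique, not UniFab's training code; the tiny model, data, and learning rate are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(3, 3, 3, padding=1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # maintains the adaptive scale factor s

noisy = torch.randn(4, 3, 128, 128, device="cuda")
clean = torch.randn(4, 3, 128, 128, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # FP16/FP32 chosen per operation
        loss = F.mse_loss(model(noisy), clean)    # L
    scaler.scale(loss).backward()                 # backward pass on s · L
    scaler.step(optimizer)                        # unscales gradients; skips the step on overflow
    scaler.update()                               # grows or shrinks s based on overflow history
```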

Hardware-aware Kernel Fusion and Tensor Core Acceleration

Denoiser-Fast employs hardware-aware kernel fusion to combine multiple lightweight operations into single CUDA kernels, reducing kernel launch overhead, improving data reuse, and increasing compute density.

It fully utilizes NVIDIA GPU Tensor Cores via WMMA (Warp Matrix Multiply Accumulate) interfaces to support mixed-precision matrix multiplication:

D = A ⋅ B + C,  with A and B in FP16 and the accumulation in C, D kept in FP32,

enabling hardware-level acceleration for convolution and fully connected operations. Compared to traditional FP32 methods, inference speed can increase by up to 8x.
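
As a small hedged illustration: in PyTorch, half-precision matrix multiplies are dispatched to Tensor Cores on supporting GPUs (Volta and newer) when the shapes allow it, so an FP16 GEMM can be timed directly (the sizes below are arbitrary, and the exact speed-up depends on the GPU):

```python
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
c = a @ b                                  # D = A · B with FP16 inputs on Tensor Cores
end.record()
torch.cuda.synchronize()
print(f"FP16 GEMM took {start.elapsed_time(end):.2f} ms")
```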

Memory Bandwidth and Latency Optimization

Using stream-parallel prefetching algorithms, Denoiser-Fast proactively loads FP16 data blocks into L1 shared cache, hiding VRAM access latency.

Combined with cache-aware scheduling, it orders high-memory-load tasks to reduce cache replacement frequency and minimize cache thrashing. CUDA streams enable overlapping compute and data transfer, maximizing bandwidth utilization.

UniFab Denoiser-Fast Model Performance Showcase

In multiple performance tests, the new Denoiser version demonstrates significant advantages. It achieves approximately 50% faster processing speed compared to the previous version, with noticeable noise suppression and complete detail preservation. While maintaining high speed, the new Denoiser-Fast model also ensures high-quality output, delivering powerful real-time processing capability.

Test Material:
Duration: 30s
Resolution: 1920×1080
Frame Rate: 24fps
UniFab Standard Model: Processing Time 1min 10s, Processing Speed 10.3 fps
UniFab Fast Model: Processing Time 48s, Processing Speed 15 fps

Summary and Outlook

Through algorithm optimization and architectural upgrades, we have significantly overcome the performance bottlenecks and limitations of the previous version. In the future, we will continue to integrate cutting-edge AI technologies to advance intelligent denoising algorithms, explore better denoising models, and achieve more efficient denoising solutions, helping the video processing field embrace broader development prospects.

👉 Community Link: 🎉New Model | UniFab Denoiser Fast Model in Detail - UniFab AI Community

You are welcome to share topics or models that interest you on our forum. We regularly publish technical reviews and version updates, and we carefully consider your testing and evaluation feedback to drive continuous improvement.

Next article topic: UniFab Video Upscaler AI

Ethan
I am the product manager of UniFab. From a product perspective, I will present authentic software data and performance comparisons to help users better understand UniFab and stay updated with our latest developments.