With the advancement of digitization in film archives and historical footage, the demand for automatic colorization of black-and-white videos has gradually increased. Traditional methods for colorizing black-and-white videos often rely on manual color grading and complex post-processing workflows, which are not only time-consuming and labor-intensive but also struggle to ensure natural color restoration and consistency.
UniFab Colorizer AI leverages deep learning algorithms, trained on large-scale color video datasets, to automatically predict and generate colors for black-and-white videos. This tool efficiently performs color conversion on video frames, significantly reducing processing time while maintaining color continuity and realism.
This article will introduce the working principles and technical advantages of UniFab Colorizer AI, providing an in-depth overview of its iterative development to help you gain a comprehensive understanding of its capabilities.
Knowledge Related to Video Colorization
Input Data Representation and Color Space Selection
Black-and-white images, represented as two-dimensional pixel matrices, have luminance values for each pixel typically ranging from 0 to 255, directly reflecting the distribution of light intensity in the grayscale image. Color images, on the other hand, are represented through three color channels—Red (R), Green (G), and Blue (B)—forming the RGB color space. The task of the network is to map grayscale images into the RGB space to achieve color restoration.
However, compared to the RGB color space, the Lab color space demonstrates significant advantages in image colorization tasks. The Lab color space decomposes an image into one luminance channel (L) and two chrominance channels (a and b), where:
The L channel represents luminance (ranging from 0 to 100), accurately reflecting the light intensity information of black-and-white images;
The a and b channels correspond to the color components green–red and blue–yellow, respectively, with values approximately ranging from -128 to +127.
This decomposition aligns with human vision, which is more sensitive to luminance than to color. By preserving the L channel and predicting only the chrominance channels, the mapping complexity is greatly reduced.
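As a small illustration of this split, the sketch below converts an RGB frame to Lab and separates the luminance and chrominance channels. It assumes scikit-image purely for convenience, and the function name split_lab is a placeholder for illustration, not part of any particular tool's API.

```python
# Minimal sketch of the Lab split described above (scikit-image assumed for the
# color-space conversion).
import numpy as np
from skimage import color

def split_lab(rgb_frame: np.ndarray):
    """rgb_frame: (H, W, 3) array with values in [0, 1]."""
    lab = color.rgb2lab(rgb_frame)   # L in [0, 100]; a, b roughly in [-128, 127]
    L = lab[..., :1]                 # luminance: kept as-is, fed to the network
    ab = lab[..., 1:]                # chrominance: what the network must predict
    return L, ab
```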
Network Architecture Design Strategies
Based on the above analysis of color spaces, a typical image colorization network architecture includes:
Input layer: receives the preprocessed grayscale image (L channel).
Feature extraction module: employs multiple convolutional layers; the deep network captures multi-level features such as textures, edges, and spatial structures within the image.
Chrominance prediction module: focuses on predicting the values of the a and b chrominance channels, reducing prediction complexity and improving color restoration accuracy.
Fusion and reconstruction module: combines the predicted a and b channels with the input L channel to restore a complete Lab image, which is then converted to an RGB color image to achieve the final colorization effect.
This design not only leverages the human visual system's sensitivity to luminance but also ensures a good balance between model capacity and training difficulty.
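At inference time, the fusion-and-reconstruction step simply runs the split in reverse. A minimal sketch, again assuming scikit-image for the conversion and a placeholder function name:

```python
# Sketch of fusion and reconstruction: recombine the original L channel with the
# predicted a/b channels, then convert Lab back to RGB.
import numpy as np
from skimage import color

def reconstruct_rgb(L: np.ndarray, ab_pred: np.ndarray) -> np.ndarray:
    """L: (H, W, 1) luminance of the grayscale input;
    ab_pred: (H, W, 2) chrominance predicted by the network."""
    lab = np.concatenate([L, ab_pred], axis=-1)
    return color.lab2rgb(lab)        # colorized RGB frame with values in [0, 1]
```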
Network Optimization and Evaluation Metrics
The network is optimized by training on a large amount of labeled data, gradually improving prediction accuracy. To objectively evaluate the performance of colorization algorithms, various metrics have been introduced in academia, including:
Fréchet Inception Distance (FID)
FID measures the distribution difference between generated images and real images in a high-dimensional feature space. It is defined from two sets of Inception network features (assumed to follow Gaussian distributions) with means and covariances (μ_r, Σ_r) and (μ_g, Σ_g), respectively:
FID = ∥μ_r − μ_g∥₂² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2})
Here, ∥⋅∥₂ denotes the Euclidean distance, and Tr represents the trace of a matrix. A lower FID indicates that the generated colors are statistically closer to real color images.
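A minimal numerical sketch of this formula, assuming the Inception features for both image sets have already been extracted (feature extraction itself is omitted):

```python
# Sketch of FID computed from two feature matrices of shape (N, D).
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):     # discard tiny imaginary parts from numerical noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```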
Colorfulness Score (CF)
The Colorfulness Score reflects the richness of colors in an image. It is based on two opponent-color components derived from the RGB image: rg = R − G and yb = 0.5 × (R + G) − B. Their means (μ_rg, μ_yb) and standard deviations (σ_rg, σ_yb) are combined into the overall score:
CF = σ_rgyb + 0.3 × μ_rgyb, where σ_rgyb = √(σ_rg² + σ_yb²) and μ_rgyb = √(μ_rg² + μ_yb²)
The larger this value, the more vivid and colorful the image.
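The same definition written as a short sketch (the function name colorfulness is introduced here purely for illustration):

```python
# Sketch of the Colorfulness Score built from the rg / yb opponent components above.
import numpy as np

def colorfulness(rgb: np.ndarray) -> float:
    """rgb: (H, W, 3) array of float channel values."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = R - G
    yb = 0.5 * (R + G) - B
    sigma = np.hypot(rg.std(), yb.std())   # combined standard deviation
    mu = np.hypot(rg.mean(), yb.mean())    # combined mean
    return float(sigma + 0.3 * mu)
```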
User Study
A subjective evaluation based on user preferences, where real users provide feedback on the color quality of generated images through surveys or ratings. Despite its subjectivity, this metric offers intuitive and valuable insights for model improvement.
How UniFab Colorizer AI Works
Network Architecture Design
The unifab model's network architecture is based on the classic UNet framework, integrated with a ResNet encoder to fully leverage the advantages of both, enhancing the effectiveness and efficiency of video colorization.
1.1 Overview of UNet Architecture
UNet is a classic image-to-image translation architecture, particularly well-suited for pixel-level tasks such as detail restoration and color mapping in video AI colorization. It effectively processes spatial information across consecutive frames while preserving image structure integrity.
Encoder-Decoder Structure: The encoder progressively downsamples the input image to extract semantic features at multiple scales; the decoder progressively upsamples to restore the original image size, outputting the colorized image.
Skip Connections: Features from corresponding encoder layers are directly passed to their matching decoder layers, effectively fusing shallow detailed features with deep semantic features, avoiding bottlenecks and improving detail restoration.
1.2 Choice of ResNet Encoder
The unifab model employs ResNet (Residual Network) as the backbone feature extractor in the encoder part of UNet, for the following reasons and advantages:
Deep Feature Learning Capability: ResNet introduces residual connections (shortcut connections) to solve the gradient vanishing problem in deep networks, enabling the stacking of more layers and capturing richer hierarchical features.
Use of Pretrained Models: Leveraging ResNet pretrained on large-scale image classification datasets (e.g., ImageNet), unifab can transfer general visual features to the colorization task, accelerating training convergence and improving model generalization.
Multi-Scale Structure: Outputs from different layers of ResNet contain image features at various receptive fields, which are suitable for fusion in subsequent decoder and skip connection layers, ensuring precise detail restoration.
1.3 Decoder Module Design
The decoder restores features through transposed convolution (deconvolution) or upsampling combined with convolution:
Progressive Upsampling: Gradually restores the spatial resolution from abstract feature maps back to the original input size.
Skip Connection Fusion: At each corresponding layer, the decoder receives features from the encoder, fusing high-resolution shallow features with deep semantic features, resulting in output images with accurate structural contours and rich details.
Multiple Convolutional Blocks: The decoder usually consists of several convolutional blocks (Conv + BatchNorm + ReLU), enhancing nonlinear expressive capacity and improving fitting for complex colors and textures.
This module is critical for capturing texture edges in the input grayscale image and their corresponding colors, effectively preventing colorization results from becoming overly blurred or distorted.
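To tie sections 1.1 through 1.3 together, here is a minimal PyTorch sketch of a UNet-style colorizer with a ResNet-34 encoder. It is an illustrative approximation under assumed layer counts and channel widths, not UniFab's actual implementation.

```python
# Illustrative UNet-with-ResNet-encoder colorizer (PyTorch); not UniFab's actual code.
# Assumes input height/width are multiples of 32.
import torch
import torch.nn as nn
from torchvision import models

class ConvBlock(nn.Module):
    """Conv + BatchNorm + ReLU, as described in 1.3."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class ResUNetColorizer(nn.Module):
    """Predicts 2 chrominance channels (a, b) from a 1-channel L input."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
        # The pretrained first conv expects 3 channels, so a fresh 1-channel stem
        # is used; everything downstream reuses the pretrained weights (see 1.2).
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False),
            resnet.bn1, resnet.relu,
        )
        self.pool = resnet.maxpool
        self.enc1, self.enc2 = resnet.layer1, resnet.layer2   # 64, 128 channels
        self.enc3, self.enc4 = resnet.layer3, resnet.layer4   # 256, 512 channels
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec3 = ConvBlock(512 + 256, 256)   # skip connection from enc3
        self.dec2 = ConvBlock(256 + 128, 128)   # skip connection from enc2
        self.dec1 = ConvBlock(128 + 64, 64)     # skip connection from enc1
        self.dec0 = ConvBlock(64, 32)
        self.head = nn.Conv2d(32, 2, 1)         # output: a and b channels

    def forward(self, L):
        s = self.stem(L)                 # 1/2 resolution
        e1 = self.enc1(self.pool(s))     # 1/4
        e2 = self.enc2(e1)               # 1/8
        e3 = self.enc3(e2)               # 1/16
        e4 = self.enc4(e3)               # 1/32
        d3 = self.dec3(torch.cat([self.up(e4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        d0 = self.dec0(self.up(self.up(d1)))    # back to the input resolution
        return self.head(d0)
```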
Color Space and Loss Functions
2.1 Color Space Selection
The unifab model is designed for video colorization tasks based on the Lab color space, leveraging the advantages of this color space to enhance colorization performance:
Separation of Luminance and Chrominance: The Lab space separates an image into a luminance channel (L) and two chrominance channels (a and b). The L channel reflects the grayscale information, while the a and b channels correspond to color components. Unifab retains the input image’s L channel as a luminance reference and only predicts the a and b chrominance channels through the network. This effectively reduces the dimensionality of the learning space, lowers model complexity, and ensures the luminance information is fully accurate without needing regeneration.
Perceptual Uniformity: The Lab color space is designed to align with human visual perception of color changes, meaning that channel value changes correspond approximately linearly with perceived color differences. This perceptual uniformity contrasts with the nonlinear characteristics of the RGB space, facilitating stable learning of subtle color variations by the model and improving the naturalness and detail of color reproduction.
Wide and Balanced Color Distribution: The a and b channels in Lab space cover color axes from green to red and blue to yellow, allowing description of a broad and balanced range of colors. This provides favorable conditions for diverse and rich color representations.
2.2 Loss Function Design
The unifab model employs a multi-loss function optimization strategy, jointly considering pixel accuracy, structural semantics, and visual realism. The specific components include:
Reconstruction Loss: Commonly uses L1 or L2 norms to measure the pixel-wise difference between the predicted chrominance channels and the ground truth colors. L1 loss tends to produce sharper image details and reduce color blurring; L2 loss is more sensitive to larger errors and focuses more on overall color correction. This loss ensures fundamental color accuracy and serves as the training foundation.
Perceptual Loss: Extracts features from intermediate layers of a pretrained visual network (e.g., VGG) and calculates the difference between the predicted and ground truth images in a high-level semantic feature space. Perceptual loss goes beyond pixels to focus on image structure, texture, and semantic consistency, helping the model restore more natural and coherent color textures and improving visual quality.
Adversarial Loss: In some versions, a Generative Adversarial Network (GAN) framework is introduced, where a discriminator distinguishes between generated and real images, forcing the generator (unifab) to output samples with more realistic and diverse colors. Adversarial training enhances the vividness and detail richness of image colors, effectively eliminating monotonous and dull color transitions.
Regularization Terms: To ensure color continuity and prevent excessive color bleeding or abnormal jumps, unifab typically includes smoothness regularization or color constraints. These reduce abrupt color boundary changes and improve the natural transition effect of overall colors.
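A hedged sketch of how such a composite objective could be assembled is shown below; the loss weights, the VGG16 layer cut-off, and the function names are illustrative assumptions rather than UniFab's actual settings.

```python
# Illustrative composite colorization loss (PyTorch); weights and layers are assumed.
import torch
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(torch.nn.Module):
    """Feature-space distance measured with a frozen, pretrained VGG16.
    (ImageNet input normalization is omitted here for brevity.)"""
    def __init__(self, layer_index: int = 16):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:layer_index]
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg.eval()

    def forward(self, pred_rgb, target_rgb):
        return F.l1_loss(self.vgg(pred_rgb), self.vgg(target_rgb))

def smoothness(ab):
    """Penalize abrupt chrominance changes between neighboring pixels."""
    dx = (ab[..., :, 1:] - ab[..., :, :-1]).abs().mean()
    dy = (ab[..., 1:, :] - ab[..., :-1, :]).abs().mean()
    return dx + dy

def total_loss(ab_pred, ab_gt, pred_rgb, target_rgb, d_fake_logits, perceptual,
               w_rec=1.0, w_perc=0.1, w_adv=0.01, w_smooth=0.05):
    rec = F.l1_loss(ab_pred, ab_gt)                           # reconstruction (L1)
    perc = perceptual(pred_rgb, target_rgb)                   # perceptual (VGG features)
    adv = F.binary_cross_entropy_with_logits(                 # generator-side GAN loss
        d_fake_logits, torch.ones_like(d_fake_logits))
    return w_rec * rec + w_perc * perc + w_adv * adv + w_smooth * smoothness(ab_pred)
```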
Training Process
The training process of the unifab model is carefully designed to enhance the model’s color restoration capability, generalization performance, and stability in video colorization. It mainly includes the following key components:
3.1 Data Preparation
Large-scale Color Image Dataset: Training relies on a rich collection of color images, which are first converted into grayscale images along with their corresponding color channels (e.g., the a and b channels in Lab space), forming input-output pairs. This approach ensures the model learns the mapping from grayscale information to color restoration.
Diverse Data Sources Covering Wide Scenarios: The dataset includes various scenes such as natural landscapes, portraits, urban street views, and more, ensuring that the model can handle diverse image types and complex textures.
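As an illustration of how such input-output pairs can be produced, the sketch below builds (L, ab) tensors from a folder of color images; the directory pattern, image size, and normalization constants are placeholders, not UniFab's actual pipeline.

```python
# Minimal sketch of a dataset that builds (L, ab) training pairs from color images.
import glob
import torch
from torch.utils.data import Dataset
from skimage import io, color, transform

class ColorizationDataset(Dataset):
    def __init__(self, image_dir: str, size: int = 256):
        self.paths = sorted(glob.glob(f"{image_dir}/*.jpg"))   # placeholder pattern
        self.size = size

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        rgb = io.imread(self.paths[idx]) / 255.0               # assumes 8-bit RGB input
        rgb = transform.resize(rgb, (self.size, self.size), anti_aliasing=True)
        lab = color.rgb2lab(rgb)
        # Scaling to roughly [0, 1] and [-1, 1] is an assumed convention.
        L = torch.from_numpy(lab[..., :1]).permute(2, 0, 1).float() / 100.0   # input
        ab = torch.from_numpy(lab[..., 1:]).permute(2, 0, 1).float() / 128.0  # target
        return L, ab
```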
3.2 Self-supervised Learning Strategy
Grayscale-to-Color Restoration Task: Unifab employs a self-supervised learning paradigm that requires no additional color annotations, relying solely on restoring color images from grayscale inputs. This significantly reduces dependence on manual labeling resources and facilitates acquisition of large-scale training data.
Input Grayscale Images Contain Only Luminance Information: The model learns to predict chrominance channels, focusing the training objective and reducing problem dimensionality, which helps faster convergence and better results.
3.3 Progressive Training
Initial Stage: The model is primarily trained using reconstruction loss to ensure it quickly grasps the basic mapping rules between luminance and color, achieving stable color restoration.
Mid-to-Late Stage: Perceptual loss is introduced to encourage the model to optimize color and texture consistency at high semantic levels, improving visual quality. Combined with adversarial training, the model learns to produce outputs with greater realism and color diversity, avoiding monotone or distorted colors.
Dynamic Adjustment of Learning Rate and Loss Weights: Learning rates and loss function weights are adjusted according to different training stages to balance convergence speed and result quality.
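One way to express this staged schedule in code; the stage boundaries and weight values below are purely illustrative.

```python
# Illustrative loss-weight schedule for progressive training; numbers are assumed.
def loss_weights(epoch: int, total_epochs: int) -> dict:
    progress = epoch / total_epochs
    if progress < 0.3:        # initial stage: reconstruction loss only
        return {"rec": 1.0, "perc": 0.0, "adv": 0.0}
    elif progress < 0.7:      # mid stage: add perceptual loss
        return {"rec": 1.0, "perc": 0.1, "adv": 0.0}
    else:                     # late stage: add adversarial loss
        return {"rec": 1.0, "perc": 0.1, "adv": 0.01}
```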
3.4 Data Augmentation
Geometric Transformations: Includes random rotation, flipping, scale cropping, etc., enhancing the model’s ability to adapt to transformation invariance.
Color and Lighting Augmentation: Adjustments to brightness, contrast, and saturation simulate diverse lighting environments, improving model robustness.
Noise Addition: Adding random noise or blur increases tolerance to image quality variations.
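A compact torchvision-transforms sketch of this kind of augmentation pipeline, applied to the color image before the Lab split; all parameter values are assumptions, not UniFab's actual settings.

```python
# Illustrative augmentation pipeline; parameter values are assumed.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                                   # geometric
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(256, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # lighting
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),              # quality noise
])
```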
Technical Advantages and Development Iteration of unifab
Technical Advantages of unifab Compared to General Technologies
Multi-style Parallel Generation Architecture: unifab adopts a multi-branch style embedding mechanism that integrates multiple style vectors at the encoder or decoder stages, enabling the network to simultaneously predict color mappings for multiple styles. This design not only effectively improves computational resource utilization but also avoids the complexity of training multiple separate models, achieving one-click output of diverse styles and significantly surpassing traditional single-style generation models.
Adaptive Multi-scene Feature Fusion: unifab leverages a multi-task learning strategy by training dedicated feature extraction submodules for different image scenes. Combined with self-attention mechanisms, it dynamically balances features from various scenes, enhancing the ability to capture complex textures and structures, and effectively preventing unnatural colorization caused by feature conflicts across scenes.
Enhanced Temporal Consistency Mechanism: The video colorization version introduces temporal consistency regularization (Temporal Consistency Loss), combined with optical flow-based motion compensation techniques. This strictly constrains inter-frame color variations when processing dynamic scenes. Paired with recurrent neural networks (RNN) or temporal convolutional networks (TCN) to capture temporal context, it significantly reduces flicker and color drifting, achieving smoother and more natural video color transitions.
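The core of the temporal-consistency idea can be sketched as follows: warp the previous frame's chrominance into the current frame using optical flow and penalize the difference. Flow estimation and occlusion handling are omitted, and the function names are illustrative rather than UniFab's implementation.

```python
# Illustrative temporal consistency loss: compare the current frame's colors with
# the previous frame's colors warped by optical flow (flow estimation omitted).
import torch
import torch.nn.functional as F

def warp(prev_ab: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """prev_ab: (B, 2, H, W) chrominance at t-1; flow: (B, 2, H, W) pixel offsets."""
    B, _, H, W = prev_ab.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_ab.device)   # (2, H, W), x then y
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    x = 2.0 * (base[0] + flow[:, 0]) / (W - 1) - 1.0
    y = 2.0 * (base[1] + flow[:, 1]) / (H - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)                               # (B, H, W, 2)
    return F.grid_sample(prev_ab, grid, align_corners=True)

def temporal_consistency_loss(curr_ab, prev_ab, flow):
    return F.l1_loss(curr_ab, warp(prev_ab, flow))
```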
Technical Development and Iteration Trajectory of unifab
Version 1 (V1): Basic Single-scene Color Restoration
Early unifab focused on single-scene video colorization, mainly targeting relatively uniform video clips (e.g., single-person portraits or fixed environments). This version employed a UNet architecture with a multi-style generation mechanism (style embedding module) to support multi-style outputs, allowing users to freely select between styles. However, V1's color prediction was based entirely on a single reference frame and lacked the capability to recognize and handle multi-scene changes within a video. It struggled with videos containing scene transitions or complex backgrounds, often producing distorted or inconsistent colorization results.
Version 2 (V2): Multi-scene Intelligent Segmentation and Collaborative Colorization
The core upgrade in V2 is multi-scene support through integrating an intelligent scene segmentation module combined with a multi-path feature network to dynamically detect and partition video content by scenes. Key technical aspects include:
Multi-scene Video Segmentation: The model analyzes visual differences between adjacent video frames, statistically aggregating frame variations over a temporal window to automatically identify scene transition points (a minimal detection sketch follows this list). Based on this mechanism, the video is intelligently divided into multiple continuous, relatively stable scene segments. This helps the semantic segmentation network focus on feature extraction and color mapping for the current scene, effectively avoiding color confusion and abrupt color jumps during scene transitions.
Collaborative Multi-branch Colorization Network: To handle different scene features, V2 designs a multi-branch decoder where each branch adapts to the color distribution of specific scene categories. Using attention mechanisms across spatial and channel dimensions, it dynamically fuses color features from various scenes, enhancing overall naturalness and detail presentation.
Intelligent Reference Frame Selection and Fusion: By constructing a multi-reference-frame feature pool and a temporal similarity evaluation model, the system automatically selects reference frames that best match the style and content of the current frame for joint inference. This multi-reference-frame context fusion significantly improves color accuracy and temporal continuity.
Enhanced Temporal Consistency: In the recursive updating of video colors, optical flow-based motion compensation strategies combined with temporal convolutional networks (TCN) ensure smooth and natural color changes across frames, substantially reducing flicker and color drift.
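A minimal sketch of the scene-cut detection idea referenced in the first item above: aggregate frame-to-frame differences over a temporal window and flag statistical outliers as transition points. The window size and threshold factor are assumptions for illustration.

```python
# Illustrative scene-cut detection from frame-difference statistics over a window.
import numpy as np

def detect_scene_cuts(frames: list, window: int = 5, k: float = 3.0):
    """frames: list of grayscale frames (H, W) as float arrays.
    Returns indices where a scene transition is likely."""
    diffs = np.array([np.mean(np.abs(frames[i + 1] - frames[i]))
                      for i in range(len(frames) - 1)])
    cuts = []
    for i in range(len(diffs)):
        lo, hi = max(0, i - window), min(len(diffs), i + window + 1)
        local = np.concatenate([diffs[lo:i], diffs[i + 1:hi]])
        # Flag frame i+1 if its difference is far above the local average.
        if local.size and diffs[i] > local.mean() + k * (local.std() + 1e-6):
            cuts.append(i + 1)
    return cuts
```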
Future Plans and Continuous Improvements
Global Color Anchoring: Address major tone shifts to achieve consistent global color control across frames.
Multi-Character Identity Locking: Strengthen tracking and recognition of multiple characters to ensure stable and distinct colors for different roles.
Intelligent Multi-Scene Adaptation: Optimize segmentation and color mapping of multi-scene videos to enhance natural color transitions during scene switches.
Fusion of Local and Global Color Constraints: Integrate local detail and global style constraints to reduce color drifting and abrupt changes.
Enhancement of Color Detail Representation: Improve the model’s ability to capture fine textures and subtle tones to boost restoration accuracy.
Acceleration of Inference and Model Lightweighting: Optimize model structure and inference efficiency to support real-time applications on a wider range of devices.
Our goal is to develop more efficient and user-friendly black-and-white colorization models to help users easily and effectively restore black-and-white videos. We strive to lead the industry in inference speed, restoration quality, and detail handling capability.
You are welcome to join our forum to participate in discussions and stay updated. 👉 Community Link:
If you have topics of interest or models you want to follow, please leave a message on the forum. We will seriously consider your testing and evaluation suggestions and regularly publish professional technical reviews and upgrade reports.
Thank you for your attention and support of UniFab!
I am the product manager of UniFab. From a product perspective, I will present authentic software data and performance comparisons to help users better understand UniFab and stay updated with our latest developments.