QualiVision: Multi-Modal Video Quality Assessment with Quality-Aware Fusion and Discriminative Learning Strategies

Ritik Bompilwar, Saurabh Koshatwar

ICCV 2025 · Visual Quality Assessment (VQualA) Workshop

Abstract

The proliferation of AI-generated video content across media production, entertainment, and social platforms has created new challenges in automatic quality assessment. While traditional metrics focus primarily on technical artifacts, AI-generated content requires comprehensive evaluation across temporal consistency, image fidelity, aesthetic appeal, and text-video alignment to ensure semantic coherence between input prompts and visual outputs. We present QualiVision, a multi-modal video quality assessment framework designed to address these requirements. Working with the VQualA 2025 challenge dataset of 4,000 annotated AI-generated videos, we develop two complementary architectures that integrate video understanding with textual prompt analysis. Our first approach enhances DOVER++ with a quality-aware cross-modal fusion mechanism that dynamically adapts attention based on textual content. The second employs V-JEPA2 with strategic layer freezing and discriminative learning rates for efficient adaptation of large-scale transformers. Both models utilize a hybrid loss function combining smooth L1, ranking, and scale-aware components to better align with human perceptual judgments. On the challenge test set, our V-JEPA2 architecture achieved a final score of 0.57 and DOVER++ obtained 0.39, demonstrating the effectiveness of cross-modal understanding for AI-generated video quality assessment.

Methodology

QualiVision evaluates AI-generated videos across temporal consistency, image fidelity, aesthetic appeal, text-video alignment, and overall quality. The paper proposes two complementary multimodal architectures. The first extends DOVER++ with quality-aware cross-modal fusion, where text embeddings guide attention over video quality dimensions. The second adapts V-JEPA2 using strategic layer freezing and discriminative learning rates, allowing a large video transformer to be fine-tuned efficiently for quality assessment. Both approaches use a hybrid loss combining smooth L1 loss, ranking loss, and scale-aware loss to better align model predictions with human perceptual scores.
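
The hybrid loss described above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the summary does not give the exact form of the ranking or scale-aware terms, so the pairwise hinge formulation, the mean-and-spread matching term, the function names, and the weights `w_l1`, `w_rank`, and `w_scale` are all assumptions chosen for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def pairwise_ranking(pred, target, margin=0.0):
    """Mean hinge penalty over pairs, positive when predicted order contradicts target order."""
    penalties = []
    for i, j in combinations(range(len(pred)), 2):
        if target[i] == target[j]:
            continue  # tied targets carry no ordering information
        sign = 1.0 if target[i] > target[j] else -1.0
        penalties.append(max(0.0, margin - sign * (pred[i] - pred[j])))
    return mean(penalties) if penalties else 0.0


def scale_aware(pred, target):
    """Assumed form: penalize mismatch in batch mean and spread of the scores."""
    return abs(mean(pred) - mean(target)) + abs(pstdev(pred) - pstdev(target))


def hybrid_loss(pred, target, w_l1=1.0, w_rank=0.5, w_scale=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (
        w_l1 * smooth_l1(pred, target)
        + w_rank * pairwise_ranking(pred, target)
        + w_scale * scale_aware(pred, target)
    )
```

In this sketch the smooth L1 term keeps individual predictions close to the human scores, the ranking term penalizes pairs of videos ordered differently from the annotations, and the scale term discourages predictions that collapse toward a narrow range.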

Results

Method	Validation Score	Test Score
DOVER++	0.52	0.39
V-JEPA2	0.63	0.57

V-JEPA2 reaches 0.57 on the official VQualA challenge test set, with DOVER++ at 0.39. Ablations in the paper attribute the largest gains to cross-modal fusion (DOVER++) and strategic layer freezing (V-JEPA2).
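
The layer freezing and discriminative learning rates used for V-JEPA2 can be sketched as a plan of per-layer learning rates. This is a hypothetical illustration only: the number of layers, the number of frozen layers, the base rates, the decay factor, and the group names below are assumptions, not values reported in the paper.

```python
def build_param_groups(num_layers, num_frozen, head_lr=1e-3, top_layer_lr=1e-4, decay=0.8):
    """Plan learning rates for a partially frozen encoder plus a regression head.

    Layers 0..num_frozen-1 are treated as frozen and omitted from the plan.
    Remaining encoder layers get a rate that shrinks with distance from the
    top of the network; the freshly initialized head gets the largest rate.
    """
    if not 0 <= num_frozen <= num_layers:
        raise ValueError("num_frozen must be between 0 and num_layers")
    groups = []
    for layer in range(num_frozen, num_layers):
        depth_from_top = num_layers - 1 - layer
        groups.append(
            {"name": f"encoder.layer{layer}", "lr": top_layer_lr * decay ** depth_from_top}
        )
    groups.append({"name": "regression_head", "lr": head_lr})
    return groups


# Example: a 4-layer encoder with the first 2 layers frozen.
for group in build_param_groups(num_layers=4, num_frozen=2):
    print(group["name"], group["lr"])
```

In a PyTorch training loop, each entry would additionally carry a `params` list so that the plan can be passed to an optimizer as per-parameter groups, and the frozen layers would have `requires_grad` set to `False`.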

Citation

@InProceedings{Bompilwar_2025_ICCV,
  author    = {Bompilwar, Ritik and Koshatwar, Saurabh},
  title     = {QualiVision: Multi-Modal Video Quality Assessment with Quality-Aware Fusion and Discriminative Learning Strategies},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {3469-3478}
}