LoViF: Multimodal Product Rating Prediction

Ritik Bompilwar, Sahil Faisal

2025 · CVPR 2025 Efficient VLM Workshop Challenge

Overview

LoViF is a lightweight multimodal fusion model that predicts product quality ratings from images and text metadata. The pipeline follows a dual-encoder → fusion → regression design. A SigLIP2 ViT-B/16 backbone encodes images and tokenised product text into 768-d embeddings; the backbone is frozen except for the last two vision blocks, which keeps the trainable parameter count low. The image and text embeddings are projected into a shared space, combined with learnable query tokens, and passed through Transformer fusion blocks for deep cross-modal interaction. A gated pooling mechanism and a 3-layer MLP head then regress a bounded rating in [1, 5].
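The last two stages of the pipeline can be illustrated with a small framework-free sketch. It is not the authors' implementation. The scalar gate, the sigmoid scaling used to bound the output, and the names `gated_pool`, `bounded_rating`, and `head_features` are all assumptions made for illustration; the actual model may compute its gate per dimension and bound the rating differently.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def gated_pool(img, txt, gate_logit):
    """Blend image and text vectors with a gate in (0, 1).

    Assumption: a single scalar gate; a learned model would normally
    predict the gate from the inputs themselves.
    """
    g = sigmoid(gate_logit)
    return [g * i + (1.0 - g) * t for i, t in zip(img, txt)]


def head_features(img, txt, gate_logit):
    """Pooled vector with the raw image-text cosine similarity appended."""
    return gated_pool(img, txt, gate_logit) + [cosine_similarity(img, txt)]


def bounded_rating(head_output, low=1.0, high=5.0):
    """Squash an unbounded head output into [low, high].

    Assumption: sigmoid scaling; the source only states that the
    rating is bounded in [1, 5].
    """
    return low + (high - low) * sigmoid(head_output)
```

With this scaling, a head output of 0 maps to the midpoint rating 3.0, and very large positive or negative outputs saturate towards 5 and 1 respectively.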

Highlights

Frozen SigLIP2 encoder with selective fine-tuning of the last two vision blocks for parameter efficiency.
Shared-space fusion with learned type embeddings and query tokens enabling cross-modal self-attention.
Gated routing between image and text representations, plus a raw cosine-similarity scalar fed into the head.
Composite loss combining MSE, 1−PLCC, and margin ranking, directly optimising the competition metric.
~383M total parameters (7.5M trainable) · 20.7 GFLOPs.
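The composite loss named in the highlights can be sketched in plain Python to make the three terms concrete. This is a minimal illustration, not the authors' code. The equal term weights, the margin of 0.1, the all-pairs ranking scheme, and the function names are assumptions, since the source does not give them.

```python
import math


def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def plcc(pred, target):
    """Pearson linear correlation coefficient."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / (math.sqrt(var_p * var_t) + 1e-8)


def margin_ranking(pred, target, margin=0.1):
    """Mean hinge penalty over all pairs whose targets differ.

    A pair is penalised when the predictions are not ordered like
    the targets by at least `margin` (assumed value).
    """
    losses = []
    n = len(pred)
    for i in range(n):
        for j in range(i + 1, n):
            if target[i] == target[j]:
                continue  # tied targets carry no ordering signal
            sign = 1.0 if target[i] > target[j] else -1.0
            losses.append(max(0.0, margin - sign * (pred[i] - pred[j])))
    return sum(losses) / len(losses) if losses else 0.0


def composite_loss(pred, target, w_mse=1.0, w_plcc=1.0, w_rank=1.0, margin=0.1):
    """Weighted sum of MSE, 1 - PLCC, and pairwise margin ranking.

    The weights are hypothetical; the source does not state them.
    """
    return (
        w_mse * mse(pred, target)
        + w_plcc * (1.0 - plcc(pred, target))
        + w_rank * margin_ranking(pred, target, margin)
    )
```

The three terms pull in complementary directions: MSE keeps predictions on the rating scale, 1−PLCC rewards linear agreement regardless of offset, and the ranking term penalises pairs predicted in the wrong order.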

References

Trained on the Amazon Products 2023 dataset (Hou et al., 2024); validation and test splits come from the CVPR 2025 Efficient VLM Workshop Challenge dataset. The SigLIP2 backbone builds on SigLIP (Zhai et al., 2023).