LoViF: Multimodal Product Rating Prediction
Overview
LoViF is a lightweight multimodal fusion model that predicts product quality ratings from images and text metadata. The pipeline follows a dual-encoder → fusion → regression design. A SigLIP2 ViT-B/16 backbone encodes images and tokenised product text into 768-d embeddings; the backbone is frozen except for the last two vision blocks, which keeps the trainable parameter count low. The image and text embeddings are projected into a shared space, joined by learnable query tokens, and passed through Transformer fusion blocks for deep cross-modal interaction. A gated pooling mechanism and a 3-layer MLP head then regress a bounded rating in [1, 5].
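The fusion and regression stages described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the released implementation: the class name `FusionRegressor`, the shared width of 256, the four query tokens, the two fusion layers, the exact form of the gate, and the sigmoid-based bounding to [1, 5] are all assumptions. It takes one pooled 768-d embedding per modality as input and does not load the SigLIP2 backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionRegressor(nn.Module):
    """Hypothetical fusion + regression head over pooled 768-d embeddings."""

    def __init__(self, embed_dim=768, dim=256, num_queries=4, num_layers=2, num_heads=4):
        super().__init__()
        # Project each modality into the shared space (width is an assumption).
        self.img_proj = nn.Linear(embed_dim, dim)
        self.txt_proj = nn.Linear(embed_dim, dim)
        # Learned type embeddings: row 0 = image, row 1 = text, row 2 = query.
        self.type_embed = nn.Parameter(torch.zeros(3, dim))
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Scalar gate that routes between the fused image and text tokens.
        self.gate = nn.Linear(2 * dim, 1)
        # 3-layer MLP head; the extra input is the raw cosine-similarity scalar.
        self.head = nn.Sequential(
            nn.Linear(2 * dim + 1, dim), nn.GELU(),
            nn.Linear(dim, dim // 2), nn.GELU(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, img_emb, txt_emb):
        batch = img_emb.size(0)
        # Cosine similarity of the raw encoder embeddings, shape (B, 1).
        cos = F.cosine_similarity(img_emb, txt_emb, dim=-1).unsqueeze(-1)
        img = self.img_proj(img_emb) + self.type_embed[0]
        txt = self.txt_proj(txt_emb) + self.type_embed[1]
        qry = (self.queries + self.type_embed[2]).unsqueeze(0).expand(batch, -1, -1)
        # Token sequence [image, text, queries] of shape (B, 2 + Q, dim).
        tokens = torch.cat([img.unsqueeze(1), txt.unsqueeze(1), qry], dim=1)
        fused = self.fusion(tokens)  # self-attention across all tokens
        img_f, txt_f = fused[:, 0], fused[:, 1]
        qry_f = fused[:, 2:].mean(dim=1)
        # Gated pooling: convex combination of the two modality tokens.
        g = torch.sigmoid(self.gate(torch.cat([img_f, txt_f], dim=-1)))
        routed = g * img_f + (1.0 - g) * txt_f
        logit = self.head(torch.cat([routed, qry_f, cos], dim=-1)).squeeze(-1)
        # Squash to the bounded rating range [1, 5].
        return 1.0 + 4.0 * torch.sigmoid(logit)


# Usage with random stand-ins for the encoder outputs.
model = FusionRegressor()
ratings = model(torch.randn(2, 768), torch.randn(2, 768))
print(ratings.shape)  # torch.Size([2])
```

Mapping the head output through `1 + 4 * sigmoid(x)` is one simple way to guarantee predictions stay inside the rating range; a clamp or a scaled tanh would serve the same purpose.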
Highlights
- Frozen SigLIP2 encoder with selective fine-tuning of the last two vision blocks for parameter efficiency.
- Shared-space fusion with learned type embeddings and query tokens enabling cross-modal self-attention.
- Gated routing between image and text representations, plus a raw cosine-similarity scalar fed into the head.
- Composite loss combining MSE, 1−PLCC, and a margin ranking term, so that training directly optimises the competition metric.
- ~383M total parameters (7.5M trainable) · 20.7 GFLOPs.
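The training-side ideas in the highlights above (selective unfreezing and the composite loss) can be sketched as below. This is an illustrative sketch under stated assumptions, not the project's training code: the helper names, the equal loss weights, the margin of 0.1, and the all-pairs formulation of the ranking term are hypothetical, and `ToyBackbone` is a stand-in for the real encoder so the example runs without downloading weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def freeze_except_last_blocks(backbone, blocks, n_unfrozen=2):
    """Freeze every backbone parameter, then re-enable the last n blocks."""
    for p in backbone.parameters():
        p.requires_grad = False
    if n_unfrozen > 0:
        for block in list(blocks)[-n_unfrozen:]:
            for p in block.parameters():
                p.requires_grad = True


def count_parameters(model):
    """Return (total, trainable) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


def plcc(pred, target, eps=1e-8):
    """Pearson linear correlation coefficient between two 1-D tensors."""
    p = pred - pred.mean()
    t = target - target.mean()
    return (p * t).sum() / (p.norm() * t.norm() + eps)


def pairwise_ranking_loss(pred, target, margin=0.1):
    """Margin ranking loss over all pairs whose targets differ."""
    diff_pred = pred.unsqueeze(1) - pred.unsqueeze(0)  # [i, j] = pred_i - pred_j
    sign = torch.sign(target.unsqueeze(1) - target.unsqueeze(0))
    mask = sign != 0
    if not mask.any():
        return pred.sum() * 0.0  # no comparable pairs; keep the graph intact
    return torch.relu(margin - sign * diff_pred)[mask].mean()


def composite_loss(pred, target, w_mse=1.0, w_plcc=1.0, w_rank=1.0, margin=0.1):
    """Weighted sum of MSE, 1 - PLCC, and the margin ranking term."""
    mse = F.mse_loss(pred, target)
    corr = 1.0 - plcc(pred, target)
    rank = pairwise_ranking_loss(pred, target, margin)
    return w_mse * mse + w_plcc * corr + w_rank * rank


class ToyBackbone(nn.Module):
    """Stand-in encoder with four blocks, used only for this demo."""

    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])


backbone = ToyBackbone()
freeze_except_last_blocks(backbone, backbone.blocks, n_unfrozen=2)
print(count_parameters(backbone))  # (288, 144)

target = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
pred = torch.tensor([1.5, 2.5, 2.0, 4.5, 4.0], requires_grad=True)
loss = composite_loss(pred, target)
loss.backward()  # gradients flow through all three terms
```

The 1 − PLCC and ranking terms need several samples per batch to be meaningful, so a sketch like this assumes batches with more than one distinct target rating.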
References
Trained on the Amazon Products 2023 dataset (Hou et al., 2024); the validation and test splits come from the CVPR Workshop Efficiency VLM dataset. The vision encoder builds on SigLIP (Zhai et al., 2023).