LoViF: Multimodal Product Rating Prediction

Ritik Bompilwar, Sahil Faisal

2025 · CVPR 2025 Efficient VLM Workshop Challenge

Overview

LoViF is a lightweight multimodal fusion model that predicts product quality ratings from images and text metadata. The pipeline follows a dual-encoder → fusion → regression design. A largely frozen SigLIP2 ViT-B/16 backbone encodes both images and tokenised product text into 768-d embeddings; only the last two vision blocks are unfrozen, which keeps the trainable parameter count low. Image and text embeddings are projected into a shared space with learnable query tokens and passed through Transformer fusion blocks for deep cross-modal interaction. A gated pooling mechanism and a 3-layer MLP head then regress a bounded rating in [1, 5].
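To make the last two stages more concrete, here is a minimal pure-Python sketch of gated pooling followed by a bounded output. It is an illustration under stated assumptions, not the LoViF implementation: the gate is assumed to be a softmax over per-token scalar scores, the bound is assumed to come from a scaled sigmoid, and the names `gated_pool`, `bounded_rating`, and `gate_weights` are hypothetical. The Transformer fusion blocks and the 3-layer MLP head are omitted; a single scalar stands in for the head's output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_pool(tokens, gate_weights):
    """Pool a sequence of token vectors into one vector.

    Assumption: each token gets a scalar gate score (dot product with a
    hypothetical learned vector `gate_weights`); the scores are
    softmax-normalised and used as weights in a weighted average.
    """
    scores = [sum(t * w for t, w in zip(tok, gate_weights)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def bounded_rating(logit, low=1.0, high=5.0):
    """Squash an unbounded scalar into [low, high] via a scaled sigmoid.

    Assumption: the source only states that the rating is bounded in
    [1, 5]; a scaled sigmoid is one common way to achieve that.
    """
    return low + (high - low) * sigmoid(logit)


# Toy usage: two 2-d "fused" tokens and a zero gate vector (uniform weights).
tokens = [[1.0, 2.0], [3.0, 4.0]]
pooled = gated_pool(tokens, gate_weights=[0.0, 0.0])
rating = bounded_rating(0.0)  # a zero logit maps to the midpoint of the range
```

With a zero gate vector every token receives the same weight, so the pooled vector reduces to the plain mean of the tokens; a learned gate would instead let the model emphasise the more informative tokens before the regression head.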

Highlights

References

The model is trained on the Amazon Products 2023 dataset (Hou et al., 2024); the validation and test splits come from the CVPR Workshop Efficiency VLM dataset. The vision encoder builds on SigLIP (Zhai et al., 2023).