Video-LLaMA is a multi-modal language model for video understanding, built on the MiniGPT-4 codebase and trained in two stages: pre-training followed by instruction fine-tuning.
The Video-LLaMA project aims to empower large language models with video understanding capability. It supports several language decoders and provides video-language-aligned models fine-tuned on machine-translated video chat instructions; both pre-trained and fine-tuned checkpoints can be downloaded from the provided links. The repository includes scripts for running demos and for training on the WebVid-2.5M video caption dataset and the LLaVA-CC3M image caption dataset. The project acknowledges several other projects that contributed to the development of Video-LLaMA.
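The two-stage scheme described above can be sketched in miniature: in pre-training, the language decoder is kept frozen and only the video-language adapter is updated. The snippet below is a toy illustration of that freezing pattern, assuming PyTorch; the module names (`video_qformer`, `llm`) are hypothetical stand-ins, not the actual Video-LLaMA classes.

```python
import torch.nn as nn

# Toy stand-in for the Video-LLaMA architecture (names are illustrative only).
class ToyVideoLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_qformer = nn.Linear(8, 8)  # stands in for the video adapter
        self.llm = nn.Linear(8, 8)            # stands in for the LLaMA decoder

    def forward(self, x):
        return self.llm(self.video_qformer(x))

model = ToyVideoLLM()

# Stage 1 (pre-training on video captions): freeze the language decoder so
# gradient updates reach only the video-language adapter.
for p in model.llm.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

In stage 2 (instruction fine-tuning), the same mechanism would selectively unfreeze whichever components the training recipe updates.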