Enhancing Vision-language Understanding with Advanced Large Language Models
MiniGPT-4 aligns a frozen visual encoder with a frozen large language model, Vicuna, using a single projection layer. The resulting model shows many capabilities similar to those exhibited by GPT-4, such as generating detailed image descriptions and creating websites from hand-written drafts. Pretraining on raw image-text pairs alone produced unnatural, incoherent language outputs, including repetition and fragmented sentences. MiniGPT-4 consists of a vision encoder (a pretrained ViT and Q-Former), a single linear projection layer, and an advanced large language model, Vicuna. Only the linear layer needs to be trained to align the visual features with Vicuna.
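
The data flow described above can be illustrated with a minimal sketch in plain Python. It is not the MiniGPT-4 implementation: `frozen_vision_encoder` is a hypothetical stand-in for the frozen ViT and Q-Former that returns random vectors, the Vicuna model is left out and represented only by placeholder text embeddings, and the dimensions (`VISION_DIM`, `LLM_DIM`, token counts) are toy values chosen for illustration. The only point it makes is that the projection layer is the sole component with trainable parameters, and that its output is what gets fed to the language model alongside the text.

```python
import random


class LinearProjection:
    """The only trainable component in this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Weight matrix of shape (out_dim, in_dim), small random init.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, tokens):
        # tokens: list of vectors, each of length in_dim.
        # Returns a list of vectors, each of length out_dim.
        return [
            [sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(self.weight, self.bias)]
            for tok in tokens
        ]

    def num_parameters(self):
        return sum(len(row) for row in self.weight) + len(self.bias)


def frozen_vision_encoder(image, num_tokens, dim):
    """Hypothetical stand-in for the frozen ViT + Q-Former.

    Returns `num_tokens` feature vectors of size `dim`. The values are
    random (seeded by the image) because no real encoder is loaded here.
    """
    rng = random.Random(sum(image))
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def build_llm_input(image, text_embeddings, projection, num_tokens, vision_dim):
    """Project visual tokens into the LLM embedding space and prepend them."""
    visual_tokens = frozen_vision_encoder(image, num_tokens, vision_dim)
    projected = projection(visual_tokens)
    return projected + text_embeddings


# Toy sizes, assumed for illustration only.
VISION_DIM, LLM_DIM, NUM_VISUAL_TOKENS = 8, 16, 4

projection = LinearProjection(VISION_DIM, LLM_DIM)
image = [0.1, 0.5, 0.9]                                  # placeholder "image"
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]    # placeholder prompt
llm_input = build_llm_input(
    image, text_embeddings, projection, NUM_VISUAL_TOKENS, VISION_DIM
)

print(len(llm_input), len(llm_input[0]), projection.num_parameters())
```

In a real setup the encoder and the language model would be loaded from pretrained checkpoints with their weights held fixed, and gradients would flow only into the projection's weight and bias.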