Today, I have some interesting news to share with you. DeepMind’s Flamingo model, which has gained attention in recent times, now has an open-source counterpart called OpenFlamingo. This framework enables the training and evaluation of large multimodal models (LMMs). If you’re interested in exploring it further, feel free to visit their GitHub repository and demo.
The team responsible for OpenFlamingo has put in substantial effort for their first release. They have developed a Python framework to train Flamingo-style LMMs, created a large-scale multimodal dataset, established an in-context learning evaluation benchmark for vision-language tasks, and introduced an initial version of their OpenFlamingo-9B model based on LLaMA.
It is worth noting that there has been significant progress in open-source LMMs recently, with the release of models like BLIP-2 and FROMAGe. These advancements showcase the potential of multimodal systems, and OpenFlamingo is another promising contribution to this field.
The primary goal of OpenFlamingo is to develop a multimodal system that can handle a wide range of vision-language tasks, aiming to match the capabilities of GPT-4. The importance of open-source models for fostering collaboration, accelerating progress, and broadening access to state-of-the-art LMMs cannot be overstated. This release serves as an important step towards achieving that goal.
Although the current OpenFlamingo-9B model is not fully optimized, it demonstrates the project’s potential. By working together and incorporating feedback from the community, it is possible to train even better LMMs. The community’s involvement in providing feedback and contributing to the repository is highly encouraged.
In terms of technical details, OpenFlamingo follows the implementation of the original Flamingo model quite closely. It adopts the same architecture but relies on open-source datasets for training, as Flamingo’s training data is not publicly available. The released OpenFlamingo-9B checkpoint is trained on 5M samples from their new Multimodal-C4 dataset and 10M samples from LAION-2B.
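Flamingo-style models train on interleaved sequences of images and text, with special tokens marking where each image sits in the stream. As a rough illustration, here is how such a sample might be serialized using the `<image>` and `<|endofchunk|>` markers described in the OpenFlamingo repository (the helper function and exact layout are assumptions for illustration, not the project's code):

```python
# Sketch: serialize an interleaved image-text sample into a single string.
# Each chunk pairs an optional image placeholder with its text; the vision
# encoder later fills in features at each <image> position.

def build_interleaved_sequence(chunks):
    """chunks: list of (has_image: bool, text: str).
    Returns one training string with image placeholder tokens."""
    parts = []
    for has_image, text in chunks:
        if has_image:
            parts.append("<image>")
        parts.append(text + "<|endofchunk|>")
    return "".join(parts)

sample = build_interleaved_sequence([
    (True, "A flamingo standing in shallow water."),
    (True, "Two flamingos in flight."),
])
print(sample)
# <image>A flamingo standing in shallow water.<|endofchunk|><image>Two flamingos in flight.<|endofchunk|>
```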
Let’s take a closer look at the Multimodal-C4 dataset. This dataset is a multimodal expansion of the text-only C4 dataset used to train T5 models. The team behind OpenFlamingo retrieves the original webpages from Common Crawl and collects downloadable images. They then clean the data through deduplication and content filtering to eliminate NSFW and unrelated images, and discard any image in which a face is detected. Finally, images and sentences within each document are interleaved using bipartite matching. Multimodal-C4 consists of approximately 75M documents, containing around 400M images and 38B tokens. More details about the dataset will be released soon.
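The interleaving step amounts to a bipartite matching problem: assign each image in a document to the sentence it fits best, maximizing total similarity under a one-to-one constraint. A toy brute-force version is sketched below; the real pipeline would use CLIP image-text similarities (the scores here are made up) and a proper assignment algorithm rather than enumeration:

```python
from itertools import permutations

# Toy stand-in for the image-sentence matching step: pick the one-to-one
# assignment of images to sentences that maximizes total similarity.
# Brute force is fine for a handful of images per document.

def best_assignment(sim):
    """sim[i][j] = similarity of image i to sentence j.
    Returns a tuple: matched sentence index for each image."""
    n_images, n_sents = len(sim), len(sim[0])
    best, best_score = None, float("-inf")
    for perm in permutations(range(n_sents), n_images):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return best

sim = [
    [0.9, 0.2, 0.1],  # image 0 clearly belongs with sentence 0
    [0.3, 0.1, 0.8],  # image 1 clearly belongs with sentence 2
]
print(best_assignment(sim))  # (0, 2)
```

For realistic document sizes, a polynomial-time method such as the Hungarian algorithm would replace the permutation search.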
To assess the performance of OpenFlamingo, the team evaluates it on a diverse set of downstream tasks. Their ultimate goal is to build an open-source version of Flamingo’s benchmark and standardize vision-language task evaluation. At present, they support visual question-answering (VQAv2, OK-VQA), captioning (COCO, Flickr30k), and image classification (ImageNet) tasks. They plan to add many more evaluation sets in the future, probing model reasoning, biases, and more. The benchmark can be accessed via the OpenFlamingo repository.
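In-context evaluation works by prepending a handful of solved examples (shots) to the query, so performance can be measured as a function of shot count. A minimal sketch of how a few-shot VQA prompt might be assembled is below; the exact template is an assumption in the spirit of Flamingo-style prompting, not the benchmark's actual code:

```python
# Sketch: assemble a few-shot VQA prompt. Each in-context shot pairs an
# image placeholder with a question-answer demonstration; the final query
# ends at "Answer:" so the model completes it.

def build_vqa_prompt(shots, question):
    """shots: list of (question, answer) demonstrations, one image each.
    question: the query to be answered."""
    prompt = ""
    for q, a in shots:
        prompt += f"<image>Question: {q} Answer: {a}<|endofchunk|>"
    prompt += f"<image>Question: {question} Answer:"
    return prompt

prompt = build_vqa_prompt(
    [("What animal is this?", "a flamingo"),
     ("What color is the bird?", "pink")],
    "Where is the bird standing?",
)
print(prompt)
```

Adding more shots simply extends the prefix, which is how the zero-shot to many-shot curves reported for these benchmarks are produced.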
As part of the release, the team is also providing a checkpoint for the under-development OpenFlamingo-9B, an LMM built on top of LLaMA 7B and CLIP ViT-L/14. Though the model is still a work in progress, it already offers value to the community: the checkpoint’s performance on COCO and VQAv2 is promising, and it improves as the number of in-context shots increases. One caveat is that the reported numbers for OpenFlamingo-9B are on validation data, while DeepMind Flamingo-9B’s numbers are on test data, so the comparison is not exact.
Safety and ethical considerations are also addressed by the OpenFlamingo team. As the model is built on top of frozen LLaMA and CLIP models, it may inherit the harms of the parent models. The team acknowledges the potential for harmful use of these models but emphasizes the importance of studying the harms of large multimodal models to develop better ways of mitigating them in the future.
It is crucial to remember that OpenFlamingo-9B is a research artifact and not a finished product. It can produce unintended, inappropriate, offensive, and/or inaccurate results. The team advocates for caution and thorough evaluations before using the models in any real applications.
"prompt": "OpenFlamingo - Open-Source In-Context Learning for Vision-Language Models, deepleaps.com, best quality, 4k, 8k, ultra highres, raw photo in hdr, sharp focus, intricate texture, skin imperfections, photograph of",
"negative_prompt": "worst quality, low quality, normal quality, child, painting, drawing, sketch, cartoon, anime, render, 3d, blurry, deformed, disfigured, morbid, mutated, bad anatomy, bad art",
"original_prompt": "OpenFlamingo - Open-Source In-Context Learning for Vision-Language Models, deepleaps.com, best quality, 4k, 8k, ultra highres, raw photo in hdr, sharp focus, intricate texture, skin imperfections, photograph of",