Try following the 'image in words' example
ImageInWords (IIW) is a human-in-the-loop annotation framework, together with the data it produces, for generating hyper-detailed text descriptions of images. It is particularly suited to recognition tasks in large language model (LLM) assistants and to more complex scenarios that rely on AI recognition and description capabilities, such as those built on GPT-4o. It supports English only. IIW descriptions have demonstrated high quality and naturalness in various tests.
A human-in-the-loop annotation framework ensures that each image description is highly detailed and accurate, avoiding the short, irrelevant descriptions common in existing datasets.
A vision-language model fine-tuned on IIW data shows a notable improvement in description accuracy and coherence, with performance improved by 31% compared to previous work.
The framework reduces fictional content in descriptions through rigorous verification techniques, ensuring that descriptions truly reflect the details of the image without adding non-existent details.
Descriptions generated by the framework are not only detailed and easy to read but also understandable by a broad audience, ensuring comprehensiveness by capturing all relevant aspects of the visual content.
Models trained on IIW data show significantly stronger vision-language reasoning, which lets them understand and interpret visual content better and generate more accurate and meaningful descriptions.
The IIW framework has excelled in multiple practical applications, including improving accessibility for visually impaired users, enhancing image search, and enabling more accurate content review, which shows its potential across different fields.
We have released the following as open source: enriched versions of the IIW-Benchmark Eval dataset, human-written IIW descriptions (image-level and object-level annotations), comparisons with previous work (DCI, DOCCI), and machine-generated LocNar and XM3600 datasets. The statistics below reflect the richness of the data (e.g., significant increases in length and richness for each part of speech).
The datasets are released under the CC-BY-4.0 license and can be found on GitHub and downloaded from Hugging Face in 'jsonl' format.
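Since the data ships as JSON Lines (one JSON object per line), a downloaded file can be inspected with the Python standard library alone. The following is a minimal sketch, not the project's own tooling: the function names, the file name, and the field name used in the usage note are hypothetical, so check the downloaded files for the actual schema.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)


def mean_word_count(records: List[Dict], text_field: str) -> float:
    """Average whitespace-separated word count of `text_field`.

    `text_field` is whatever key holds the description text in the file
    you downloaded; records without that key are ignored.
    """
    lengths = [len(str(r[text_field]).split()) for r in records if text_field in r]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

For example, `records = list(read_jsonl("iiw_sample.jsonl"))` followed by `sorted(records[0].keys())` lists the fields that are actually present, and `mean_word_count(records, "description")` gives a rough length statistic once you know which field holds the text. Both the file name and `"description"` are placeholders here.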
For more information about IIW, see the project web page, data downloads, visualizations, and more.
@misc{garg2024imageinwords,
  title={ImageInWords: Unlocking Hyper-Detailed Image Descriptions},
  author={Roopal Garg and Andrea Burns and Burcu Karagol Ayan and Yonatan Bitton and Ceslee Montgomery and Yasumasa Onoe and Andrew Bunner and Ranjay Krishna and Jason Baldridge and Radu Soricut},
  year={2024},
  eprint={2405.02793},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}