Imagine trying to photograph each of the approximately 11,000 tree species in North America. Even then, you’d have only a tiny fraction of the millions of photos in nature image datasets. These large collections, featuring everything from butterflies to humpback whales, are invaluable for ecologists. They provide insights into unique behaviors, rare conditions, migration patterns, and how organisms respond to pollution and climate change.
Despite their comprehensiveness, these nature image datasets aren’t as user-friendly as they could be. Searching through them to find images that support your research hypothesis is a time-consuming task. An automated research assistant, like artificial intelligence systems known as multimodal vision language models (VLMs), could be more helpful. These models are trained on both text and images, allowing them to identify subtle details, such as specific tree species in a photo’s background.
So, how effective are VLMs in helping researchers retrieve images? A team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), University College London, iNaturalist, and others developed a test to assess this. The task for each VLM was to locate and organize the most relevant images from the team’s “INQUIRE” dataset, which includes 5 million wildlife photos and 250 search prompts from ecologists and biodiversity experts.
Looking for that special frog
The evaluations revealed that larger, more advanced VLMs, trained on extensive data, could sometimes provide the desired results. They performed well on basic visual queries, like identifying debris on a reef, but struggled with more complex queries requiring expert knowledge, such as identifying specific biological conditions or behaviors. For instance, while VLMs could easily find jellyfish on a beach, they had trouble with technical prompts like “axanthism in a green frog,” a condition that limits a frog’s ability to produce yellow skin pigment.
The study suggests these models need more domain-specific training data to handle complex queries. MIT PhD student Edward Vendrow, a CSAIL affiliate and co-leader of the dataset work, believes that with more informative data, VLMs could become excellent research assistants. “We aim to build retrieval systems that help scientists find the exact images needed for biodiversity monitoring and climate change analysis,” says Vendrow. “While multimodal models don’t yet grasp complex scientific language, INQUIRE will be a crucial benchmark for tracking improvements in understanding scientific terminology and aiding researchers in finding precise images.”
The experiments showed that larger models were more effective for both simple and complex searches due to their extensive training data. Initially, the team used the INQUIRE dataset to see if VLMs could narrow down a pool of 5 million images to the top 100 most relevant results. For straightforward queries like “a reef with manmade structures and debris,” larger models like “SigLIP” succeeded, while smaller CLIP models struggled. According to Vendrow, larger VLMs are just beginning to be useful for tougher queries.
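To make that first stage concrete, the sketch below shows embedding-based retrieval of the kind CLIP and SigLIP perform: the text query and every photo are mapped into a shared embedding space, and images are ranked by cosine similarity to the query. This is a minimal illustration using the open_clip library; the checkpoint name and the assumption of precomputed image embeddings are ours, not details from the paper.

```python
import torch
import open_clip

# Load a SigLIP model through open_clip; this checkpoint name is an
# illustrative assumption, not necessarily the one evaluated in the paper.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained="webli"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP")
model.eval()

def embed_text(query: str) -> torch.Tensor:
    """Encode a natural-language query into a normalized embedding."""
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer([query]))
    return text_feat / text_feat.norm(dim=-1, keepdim=True)

def top_k_images(query: str, image_feats: torch.Tensor, k: int = 100):
    """Rank precomputed, normalized image embeddings by cosine similarity.

    image_feats is an [N, D] tensor built once offline by running
    model.encode_image(preprocess(photo)) over the whole collection.
    Returns the top-k similarity scores and their image indices.
    """
    scores = (image_feats @ embed_text(query).T).squeeze(-1)
    return torch.topk(scores, k)
```

In practice, the 5-million-image collection would be embedded once and stored in an approximate-nearest-neighbor index, so each new query needs only a single text encoding and a fast similarity lookup.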
Vendrow and his colleagues also assessed how well multimodal models could re-rank those 100 results, organizing images by relevance to a search. In these tests, even large models trained on curated data, like GPT-4o, struggled, achieving a precision score of only 59.6 percent, the highest among the models.
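A re-ranking stage like the one tested here can be approximated by asking a multimodal model to judge each of the 100 candidates against the query. The sketch below uses the OpenAI Python client with GPT-4o; the prompt wording, the 0–10 scoring scale, and the response parsing are illustrative assumptions, not the benchmark’s protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def relevance_score(query: str, image_path: str) -> float:
    """Ask a multimodal model how well one image matches the query.

    The prompt and 0-10 scale are hypothetical choices for illustration.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"On a scale of 0 to 10, how well does this image "
                          f"match the query: '{query}'? Reply with one number.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except (TypeError, ValueError):
        return 0.0

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Sort the first-stage candidates by the model's relevance score."""
    return sorted(candidates, key=lambda path: relevance_score(query, path),
                  reverse=True)
```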
The researchers presented their findings at the Conference on Neural Information Processing Systems (NeurIPS) earlier this month.
Inquiring for INQUIRE
The INQUIRE dataset includes search queries based on discussions with experts like ecologists, biologists, and oceanographers about the types of images they seek, such as animals’ unique conditions and behaviors. Annotators spent 180 hours searching the iNaturalist dataset with these prompts, combing through roughly 200,000 candidate results and labeling around 33,000 images that matched the prompts.
For example, annotators used queries like “a hermit crab using plastic waste as its shell” and “a California condor tagged with a green ‘26’” to find subsets of the larger dataset depicting these specific, rare events.
The researchers then used the same queries to test VLMs’ ability to retrieve iNaturalist images. The annotators’ labels showed when models struggled to understand scientific keywords, as results sometimes included irrelevant images. For instance, results for “redwood trees with fire scars” sometimes included unmarked trees.
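Scoring a model’s ranked results against those expert labels can be as simple as checking what fraction of the top-ranked images annotators marked as true matches. The helper below computes precision-at-k with hypothetical image IDs; the benchmark’s actual metrics (such as average precision over the full ranking) may differ.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 50) -> float:
    """Fraction of the top-k retrieved images that annotators labeled relevant."""
    hits = sum(1 for image_id in ranked_ids[:k] if image_id in relevant_ids)
    return hits / k

# Hypothetical example: a query's ranked results checked against expert labels.
ranked = ["img_042", "img_117", "img_009", "img_256"]
relevant = {"img_117", "img_256"}
print(precision_at_k(ranked, relevant, k=4))  # 0.5
```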
“This is careful data curation, focusing on real scientific inquiries in ecology and environmental science,” says Sara Beery, MIT CSAIL principal investigator and co-senior author. “It expands our understanding of VLMs’ capabilities in impactful scientific settings and highlights research gaps, particularly for complex queries, technical terminology, and subtle differences crucial for our collaborators.”
“Our results suggest some vision models can help wildlife scientists retrieve images, but many tasks remain too challenging for even the best models,” says Vendrow. “While INQUIRE focuses on ecology and biodiversity, its diverse queries mean successful VLMs could excel in other observation-intensive fields.”
Inquiring minds want to see
The researchers are collaborating with iNaturalist to create a query system to help scientists and others find desired images. Their demo allows users to filter searches by species, enabling quicker discovery of relevant results, like different cat eye colors. Vendrow and co-author Omiros Pantazis, who recently completed his PhD at University College London, aim to improve the re-ranking system by enhancing current models for better results.
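A species filter like the one in the demo can sit on top of embedding retrieval by shrinking the candidate pool before similarity ranking. The sketch below assumes precomputed, normalized image embeddings and a per-image species label drawn from iNaturalist metadata; the variable names and logic are hypothetical, not the demo’s implementation.

```python
import torch

def filtered_search(query_feat: torch.Tensor,
                    image_feats: torch.Tensor,
                    species_labels: list[str],
                    species: str,
                    k: int = 20):
    """Rank only the images whose metadata matches the chosen species.

    query_feat: [1, D] normalized text embedding of the search query.
    image_feats: [N, D] normalized image embeddings for the collection.
    species_labels: species_labels[i] is the species recorded for image i.
    """
    keep = [i for i, label in enumerate(species_labels) if label == species]
    scores = (image_feats[keep] @ query_feat.T).squeeze(-1)
    top = torch.topk(scores, min(k, len(keep)))
    return [keep[i] for i in top.indices.tolist()], top.values
```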
University of Pittsburgh Associate Professor Justin Kitzes praises INQUIRE’s ability to uncover secondary data. “Biodiversity datasets are becoming too large for individual scientists to review,” says Kitzes, who wasn’t involved in the research. “This paper highlights a challenging, unsolved problem: effective data search beyond ‘who is here’ to explore characteristics, behavior, and species interactions. Efficiently uncovering these complex phenomena in biodiversity image data is crucial for fundamental science and real-world impacts in ecology and conservation.”
Vendrow, Pantazis, and Beery co-authored the paper with iNaturalist software engineer Alexander Shepard, University College London professors Gabriel Brostow and Kate Jones, University of Edinburgh associate professor and co-senior author Oisin Mac Aodha, and University of Massachusetts at Amherst Assistant Professor Grant Van Horn, who served as co-senior author. Their work was partly supported by the Generative AI Laboratory at the University of Edinburgh, the U.S. National Science Foundation/Natural Sciences and Engineering Research Council of Canada Global Center on AI and Biodiversity Change, a Royal Society Research Grant, and the Biome Health Project funded by the World Wildlife Fund United Kingdom.