Research
Published 5 December 2024
Advancing adaptive AI agents, empowering 3D scene creation, and innovating LLM training for a smarter, safer future
Next week, AI researchers from around the globe will convene for the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), scheduled for December 10-15 in Vancouver. Two significant papers by Google DeepMind researchers will receive the Test of Time awards for their impactful contributions to the field. Ilya Sutskever will discuss Sequence to Sequence Learning with Neural Networks, co-authored with Google DeepMind’s VP of Drastic Research, Oriol Vinyals, and Distinguished Scientist Quoc V. Le. Additionally, Ian Goodfellow and David Warde-Farley from Google DeepMind will present on Generative Adversarial Nets.
We’ll also show how our foundational research translates into real-world applications, with live demos of Gemma Scope, AI for music generation, weather forecasting, and more. Google DeepMind teams will present over 100 new papers on topics including AI agents, generative media, and innovative learning approaches.
Building adaptive, smart, and safe AI Agents
AI agents based on large language models (LLMs) are proving effective in executing digital tasks through natural language commands. However, their success hinges on precise interactions with complex user interfaces, which requires extensive training data. With AndroidControl, we provide the most diverse control dataset to date, featuring over 15,000 human-collected demonstrations across more than 800 apps. AI agents trained with this dataset exhibited notable performance improvements, which we hope will advance research into more general AI agents.
To enable AI agents to generalize across tasks, they must learn from each experience. We introduce a method for in-context abstraction learning, which helps agents identify key task patterns and relationships from imperfect demonstrations and natural language feedback, thereby enhancing their performance and adaptability.
AI becomes more useful as it gets better at fulfilling users’ goals, but keeping such systems aligned is crucial. We propose a theoretical method to measure an AI system’s goal-directedness, and we demonstrate how a model’s perception of its user can affect its safety filters. These insights highlight the importance of robust safeguards to prevent unintended or unsafe behaviors, ensuring AI agents’ actions remain aligned with intended safe uses.
Advancing 3D scene creation and simulation
The demand for high-quality 3D content is rising in industries like gaming and visual effects, but creating lifelike 3D scenes remains costly and time-consuming. Our latest work introduces innovative 3D generation, simulation, and control methods to streamline content creation for faster, more flexible workflows.
Producing high-quality, realistic 3D assets and scenes often requires capturing and modeling thousands of 2D photos. We present CAT3D, a system that can create 3D content in as little as a minute from any number of images, even a single image or a text prompt. CAT3D uses a multi-view diffusion model to generate additional consistent 2D images from many viewpoints, then feeds those generated views into traditional 3D reconstruction techniques. This approach surpasses previous methods in both speed and quality.
Simulating scenes with numerous rigid objects, such as a cluttered tabletop or tumbling Lego bricks, remains computationally demanding. To tackle this challenge, we introduce SDF-Sim, a technique that represents object shapes in a scalable way, accelerating collision detection and enabling efficient simulation of large, complex scenes.
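The appeal of a signed distance representation is easy to illustrate: a shape becomes a function that returns negative values inside the object and positive values outside, so a collision check reduces to a cheap function evaluation. A minimal sketch for a sphere (the shapes and function names here are illustrative; SDF-Sim learns compact distance functions for complex meshes):

```python
import math

def sphere_sdf(center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    def sdf(point):
        return math.dist(point, center) - radius
    return sdf

def colliding(sdf, points, margin=0.0):
    """Collision testing reduces to evaluating the distance function at query points."""
    return any(sdf(p) < margin for p in points)

ball = sphere_sdf(center=(0.0, 0.0, 0.0), radius=1.0)
# ball((2.0, 0.0, 0.0)) == 1.0: the point sits one unit outside the surface
# colliding(ball, [(0.5, 0.0, 0.0)]) is True: that point lies inside the sphere
```

Because each shape is just a function, evaluating distances for many objects scales well, which is the property SDF-Sim exploits for large cluttered scenes.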
AI image generators based on diffusion models struggle to control the 3D position and orientation of multiple objects. Our solution, Neural Assets, introduces object-specific representations that capture both appearance and 3D pose, learned through training on dynamic video data. Neural Assets allows users to move, rotate, or swap objects across scenes, making it a valuable tool for animation, gaming, and virtual reality.
Improving how LLMs learn and respond
We are enhancing the way LLMs train, learn, and respond to users, focusing on improving performance and efficiency.
With larger context windows, LLMs can now learn from thousands of examples at once, a technique known as many-shot in-context learning (ICL). This improves model performance on tasks like math, translation, and reasoning, but it typically requires high-quality, human-generated examples. To make the approach more cost-effective, we explore ways to adapt many-shot ICL that reduce the need for manually curated data. Separately, with training data now abundant, the main constraint on teams building language models is compute, so we address a critical question: given a fixed compute budget, how do you choose the right model size to achieve the best results?
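Mechanically, many-shot ICL amounts to packing a large number of worked examples into a single prompt. A minimal sketch, assuming a simple Q/A template (the format is illustrative, not a template from the paper):

```python
def many_shot_prompt(examples, query):
    """Pack many worked examples into one prompt so the model can learn in context."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

# With a long-context model, `examples` could hold thousands of entries.
examples = [("2 + 2", "4"), ("7 - 3", "4"), ("5 * 6", "30")]
prompt = many_shot_prompt(examples, "9 + 1")
```

Reducing the cost of this setup then comes down to where the example answers come from, for instance generating them with a model rather than curating them by hand.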
Another innovative approach, Time-Reversed Language Models (TRLM), explores pretraining and finetuning an LLM to work in reverse. When given traditional LLM responses as input, a TRLM generates queries that might have produced those responses. When paired with a traditional LLM, this method not only helps ensure responses follow user instructions better but also improves the generation of citations for summarized text and enhances safety filters against harmful content.
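One way a time-reversed scorer can be used is to rerank candidate responses by how well they map back to the original query. A toy sketch, with a word-overlap stand-in where a real reversed LM would go (`toy_reverse_model` is purely illustrative, not how TRLM scores text):

```python
def backward_score(query, response, reverse_model):
    """Score a response by how plausibly a reversed model maps it back to the query."""
    return reverse_model(response, query)

def toy_reverse_model(response, query):
    """Toy stand-in for a time-reversed LM: fraction of query words the response recovers."""
    norm = lambda s: {w.strip(".,?!").lower() for w in s.split()}
    q = norm(query)
    return len(q & norm(response)) / max(len(q), 1)

candidates = ["The capital of France is Paris.", "I like turtles."]
query = "What is the capital of France?"
best = max(candidates, key=lambda r: backward_score(query, r, toy_reverse_model))
# best is the first candidate: it recovers most of the query's words
```

The same backward-scoring idea is what lets a reversed model flag responses that drift from the instruction, attribute summaries to sources, or catch unsafe outputs.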
Curating high-quality data is essential for training large AI models, but manual curation is challenging at scale. To address this, our Joint Example Selection (JEST) algorithm optimizes training by identifying the most learnable data within larger batches, enabling up to 13× fewer training rounds and 10× less computation, outperforming state-of-the-art multimodal pretraining baselines.
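The selection step can be sketched as scoring each example by its “learnability”, roughly its loss under the learner minus its loss under a trained reference model, and keeping the top scorers. (JEST scores sub-batches jointly rather than example by example; this per-example version, and all names in it, are a simplification for illustration.)

```python
def select_learnable(batch, learner_loss, reference_loss, k):
    """Keep the k examples the learner finds hard but a trained reference finds easy."""
    scored = sorted(batch, key=lambda x: learner_loss(x) - reference_loss(x), reverse=True)
    return scored[:k]

# Illustrative per-example losses (in practice these come from model forward passes).
learner = {"a": 2.0, "b": 0.5, "c": 3.0, "d": 1.0}
reference = {"a": 0.2, "b": 0.4, "c": 2.9, "d": 0.1}
picked = select_learnable(["a", "b", "c", "d"], learner.get, reference.get, k=2)
# picked == ["a", "d"]: learnability scores 1.8 and 0.9 beat 0.1 for "b" and "c"
```

Training only on the selected sub-batch is what yields the reported savings in training rounds and compute.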
Planning tasks are another challenge for AI, especially in stochastic environments where randomness or uncertainty influences outcomes. Researchers use various inference types for planning, but there is no consistent approach. We demonstrate that planning itself can be seen as a distinct type of probabilistic inference and propose a framework for ranking different inference techniques based on their planning effectiveness.
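A minimal illustration of planning as probabilistic inference is Monte Carlo estimation: treat each action’s outcome as a random variable, sample it, and infer which action has the best expected return. (The toy environment and function names below are illustrative; the paper’s framework covers and ranks a broader family of inference techniques.)

```python
import random

def plan_by_sampling(actions, simulate, n_samples=1000, seed=0):
    """Treat planning under uncertainty as inference: estimate each action's
    expected return from samples, then pick the action with the best estimate."""
    rng = random.Random(seed)
    def expected(action):
        return sum(simulate(action, rng) for _ in range(n_samples)) / n_samples
    return max(actions, key=expected)

# Toy stochastic environment: "safe" always pays 1; "risky" pays 3 a quarter of the time.
def simulate(action, rng):
    if action == "safe":
        return 1.0
    return 3.0 if rng.random() < 0.25 else 0.0

best = plan_by_sampling(["safe", "risky"], simulate)
# best == "safe": estimated expected returns are 1.0 versus roughly 0.75
```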
Bringing together the global AI community
We are proud to be a Diamond Sponsor of the conference and support Women in Machine Learning, LatinX in AI, and Black in AI in building communities worldwide working in AI, machine learning, and data science.
If you’re attending NeurIPS this year, visit the Google DeepMind and Google Research booths to explore cutting-edge research in demos, workshops, and more throughout the conference.