Researchers at the University of Michigan have released 3D-GRAND, a new densely annotated 3D-text dataset. The work was presented on June 15 at the Computer Vision and Pattern Recognition (CVPR) conference in Nashville, Tennessee, and published simultaneously on the arXiv preprint server. By tightly linking language with 3D space, the dataset is expected to advance embodied AI such as household robots.

In benchmark tests against previous 3D datasets, models trained on 3D-GRAND were clearly superior: grounding accuracy reached 38%, a 7.7% improvement over the previous best model, while the hallucination rate dropped dramatically from the prior state-of-the-art 48% to just 6.67%.
Today's common household robot vacuums perform only a narrow set of tasks. The 3D-GRAND dataset lays the groundwork for the next generation of domestic robots, ones we could casually instruct: "Pick up the book next to the lamp on the nightstand and bring it to me." Following such a command requires the robot to first understand the spatial meaning of language. Joyce Chai, Professor of Computer Science and Engineering at the University of Michigan, pointed out that large multimodal language models are mostly trained on 2D image-text data, yet humans live in a three-dimensional world. To interact with humans, robots must understand spatial terms and perspectives, interpret object orientations, and use language accordingly.
However, 3D data is scarce, and text-grounded 3D data is even harder to find: a word like "sofa" must be linked to the sofa's actual 3D coordinates. Like all large language models (LLMs), 3D-LLMs perform best when trained on massive datasets, but building large datasets from camera captures is slow and expensive, because human annotators must manually identify objects, specify their spatial relationships, and link each word to the corresponding object.
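To make the grounding requirement concrete, here is a minimal sketch of what a text-grounded 3D annotation might look like. The structure, field names, and coordinates are hypothetical illustrations, not the actual 3D-GRAND schema: each noun phrase in a description is linked to an object ID, and each object carries a 3D bounding box.

```python
# Hypothetical grounded annotation: phrases in the text map to object IDs,
# and each object has a 3D bounding box (center and size, in metres).
annotation = {
    "description": "A grey sofa sits against the wall, next to a wooden table.",
    "objects": {
        "obj_12": {"label": "sofa",  "bbox_center": [1.2, 0.4, 2.1], "bbox_size": [2.0, 0.9, 0.8]},
        "obj_07": {"label": "table", "bbox_center": [2.8, 0.3, 2.0], "bbox_size": [1.1, 0.6, 0.7]},
    },
    # Phrase-level grounding: character spans in the description -> object IDs.
    "groundings": [
        {"phrase": "grey sofa",    "span": [2, 11],  "object_id": "obj_12"},
        {"phrase": "wooden table", "span": [45, 57], "object_id": "obj_07"},
    ],
}

def grounded_objects(ann):
    """Return the set of object IDs that the text actually refers to."""
    return {g["object_id"] for g in ann["groundings"]}

print(sorted(grounded_objects(annotation)))  # -> ['obj_07', 'obj_12']
```

With annotations in this shape, every word-to-object link a human annotator would otherwise draw by hand is already explicit in the data.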
To address this, the research team adopted a new approach: using generative AI to create synthetic rooms and automatically annotate them with 3D structures. The resulting 3D-GRAND dataset contains 40,087 home scenes and 6.2 million finely detailed descriptions. Jianing "Jed" Yang, a PhD student in Computer Science and Engineering at the University of Michigan, said that synthetic data labels are free and much easier to manage.
After generating the synthetic 3D scenes, the pipeline first uses a vision model to describe each object's colour, shape, and material; a text-only model then generates scene descriptions, while a scene graph keeps noun phrases linked to specific 3D objects. A final quality-control step applies a hallucination filter to ensure that every object mentioned in the text corresponds to an object in the 3D scene. Human evaluators spot-checked 10,200 room-annotation pairs and found a synthetic annotation error rate of roughly 5% to 8%, comparable to professional human annotation. Yang noted that LLM-based annotation cut cost and time by an order of magnitude; the 6.2 million annotations were produced in just two days.
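The hallucination filter described above can be sketched as a simple membership check (the function and data shapes below are hypothetical; the article does not detail the actual implementation): an annotation passes only if every object its text refers to exists in the scene graph.

```python
def hallucination_filter(annotation, scene_object_ids):
    """Accept an annotation only if every grounded phrase refers to an
    object actually present in the 3D scene; otherwise reject it.

    annotation: dict with a "groundings" list of {"phrase", "object_id"}.
    scene_object_ids: set of object IDs present in the scene graph.
    """
    return all(g["object_id"] in scene_object_ids
               for g in annotation["groundings"])

scene_ids = {"obj_07", "obj_12", "obj_31"}
good = {"groundings": [{"phrase": "sofa",  "object_id": "obj_12"}]}
bad  = {"groundings": [{"phrase": "piano", "object_id": "obj_99"}]}  # no piano in scene

print(hallucination_filter(good, scene_ids))  # True  -> keep
print(hallucination_filter(bad, scene_ids))   # False -> discard
```

Because the scene is synthetic, the ground-truth object list is known exactly, which is what makes this kind of automatic filtering possible at all.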
To test the new dataset, the team trained models on 3D-GRAND and compared them against three baseline models (3D-LLM, LEO, and 3D-VISTA). The established benchmark ScanRefer evaluated grounding accuracy, while the newly introduced benchmark 3D-POPE assessed object hallucination. Results showed that models trained on 3D-GRAND significantly outperformed the competition.
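ScanRefer-style grounding accuracy is conventionally scored as the fraction of referred objects whose predicted 3D box overlaps the ground-truth box above an IoU threshold (commonly 0.25 or 0.5). A minimal sketch of that metric, assuming axis-aligned boxes given as (min_corner, max_corner) pairs; this is an illustration of the standard convention, not the benchmark's exact code:

```python
def iou_3d(box_a, box_b):
    """Intersection-over-union of two axis-aligned 3D boxes.
    Each box is a pair of (x, y, z) corners: (min_corner, max_corner)."""
    (ax0, ay0, az0), (ax1, ay1, az1) = box_a
    (bx0, by0, bz0), (bx1, by1, bz1) = box_b
    ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0.0, min(ay1, by1) - max(ay0, by0))
    iz = max(0.0, min(az1, bz1) - max(az0, bz0))
    inter = ix * iy * iz
    vol_a = (ax1 - ax0) * (ay1 - ay0) * (az1 - az0)
    vol_b = (bx1 - bx0) * (by1 - by0) * (bz1 - bz0)
    return inter / (vol_a + vol_b - inter) if inter > 0 else 0.0

def grounding_accuracy(preds, gts, thresh=0.25):
    """Fraction of predicted boxes that hit the ground truth at the threshold."""
    hits = sum(iou_3d(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

gt   = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
pred = ((0.5, 0.0, 0.0), (1.5, 1.0, 1.0))   # half-overlapping guess
print(round(iou_3d(pred, gt), 3))            # 0.333 -> a hit at IoU@0.25
print(grounding_accuracy([pred], [gt]))      # 1.0
```

Under this convention, the reported 38% means that for 38% of referring expressions, the model's box overlapped the correct object's box above the threshold.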
Joyce Chai said she looks forward to seeing 3D-GRAND help robots better understand space, adopt different viewpoints, and improve communication and collaboration with humans. The next step will be real-robot testing.