Today, we released DataGemma, the first open models designed to connect large language models (LLMs) with the extensive, real-world data in Google’s Data Commons. These models represent our early research into potential pathways for improving the accuracy of LLMs when they are queried for numerical and statistical information.
You can read more about the work we’ve done at the following locations:
- Google Keyword Blog – DataGemma: Using real-world data to address AI hallucinations
- Google Research Blog – Grounding AI in reality with a little help from Data Commons
- DataGemma Paper – Knowing When to Ask – Bridging Large Language Models and Data Commons
You can also download the DataGemma models from Hugging Face or Kaggle Models. To get started quickly, try our quick start notebooks for both the RIG (Retrieval Interleaved Generation) and RAG (Retrieval Augmented Generation) approaches. These notebooks provide a hands-on introduction to using DataGemma and exploring its capabilities. If you do access the models, please make sure to read the disclaimer below.
Bo Xu on behalf of the Data Commons team
Disclaimer: You’re accessing a very early version of DataGemma. It is meant for trusted tester use (primarily for academic and research use) and not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.
Your feedback and evaluations are critical to refining DataGemma’s performance and will directly contribute to its training process. Known limitations are detailed in the reviewer guide, which we encourage you to consult for a comprehensive understanding of DataGemma’s current capabilities.