DataGemma – Grounding LLMs with real-world data from Data Commons

Today, we released DataGemma, the first open models designed to connect LLMs with the extensive, real-world data drawn from Google’s Data Commons. These models represent our early research on potential pathways to improve the accuracy of large language models (LLMs) when queried for numerical and statistical information.

You can read more about the work we’ve done at the following locations:

You can also download the DataGemma models from Hugging Face or Kaggle Models. To get started quickly, try our quick start notebooks for both the RIG (Retrieval Interleaved Generation) and RAG (Retrieval Augmented Generation) approaches; they provide a hands-on introduction to using DataGemma and exploring its capabilities. If you do access the models, please make sure to read the disclaimer below.
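As a quick illustration of what the notebooks cover, here is a minimal sketch that loads a DataGemma model with the Hugging Face transformers library and asks it a statistical question. The model ID and prompt are illustrative assumptions, not a definitive recipe; consult the quick start notebooks and the model cards for the exact model names and recommended usage.

```python
# Minimal sketch: loading a DataGemma model via Hugging Face transformers.
# The model ID below is an assumption for illustration; check the official
# Hugging Face or Kaggle listings for the exact names. Note that a 27B
# model requires substantial GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/datagemma-rig-27b-it"  # assumed ID; verify before use

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # spread layers across available devices
    torch_dtype=torch.bfloat16, # half-precision weights to reduce memory
)

# An example statistical query of the kind DataGemma is designed for.
prompt = "What is the population of California?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

In the RIG setup, the model's generated statistics are interleaved with calls to Data Commons for verification, and in the RAG setup relevant Data Commons tables are retrieved before generation; the notebooks walk through both pipelines end to end.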

Bo Xu on behalf of the Data Commons team

Disclaimer: You’re accessing a very early version of DataGemma. It is meant for trusted tester use (primarily academic and research use) and is not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.

Your feedback and evaluations are critical to refining DataGemma’s performance and will directly contribute to its training process. Known limitations are detailed in the reviewer guide, which we encourage you to consult for a comprehensive understanding of DataGemma’s current capabilities.
