We are excited to announce a new set of open-source tools that simplify the initial steps of importing data in the Statistical Data and Metadata eXchange (SDMX) format into Data Commons. These tools address the critical first stage of the data ingestion pipeline, making it easier for our growing community to work with SDMX data and contribute to Data Commons.
What is SDMX?
SDMX is a global standard for exchanging statistical data and metadata among organizations. It provides a common language and standardized data structure for critical indicators such as unemployment rates, GDP, and population figures, ensuring global interoperability among statistical bodies. SDMX is widely used by national statistical offices, central banks, and international organizations like the OECD, the World Bank, and the United Nations.
Starting today, our Data Commons tools support SDMX 2.1, the most widely adopted version of the standard.
Key features
Our new SDMX import tools, which can be used by both contributors to base Data Commons and owners of Custom Data Commons instances, offer several capabilities designed to streamline the first stage of data integration:
- Dual usage modes: The tools function both as a command-line download tool and as a Python library (SdmxClient) for programmatic integration into data pipelines.
- Automatic format conversion: The tools automatically download from an SDMX feed to local storage and convert the data to a standardized CSV format, regardless of whether the source provides data in XML, JSON, or other formats. This simplifies downstream processing by providing a clean, tabular starting point. It’s important to note that after this conversion, the process of mapping the data concepts and columns to the Data Commons schema remains a manual step, similar to other CSV ingestions.
- Rich metadata mapping: SDMX’s rich metadata provides the necessary context to facilitate the manual mapping of concepts to Data Commons. For data providers, enriching your metadata with clear definitions is the most effective way to make your data more discoverable and easier to integrate in the future.
- Simplified auto-refresh setup: Datasets can be configured for automatic periodic updates without writing custom download scripts for each dataset, making it easy to keep data current as sources update.
All tools are available in our GitHub repository, with detailed usage instructions in the README.
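To make the format-conversion step concrete, here is a minimal sketch of what flattening an SDMX-ML 2.1 generic data message into CSV can look like. It uses only the Python standard library and is not the implementation or API of the Data Commons tools: the helper names (`sdmx_generic_to_rows`, `rows_to_csv`), the output column names, and the sample message with its values are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespaces used by SDMX-ML 2.1 generic data messages.
NS = {
    "message": "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message",
    "generic": "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic",
}

# Made-up sample message for illustration; the values are not real statistics.
SAMPLE_XML = """<message:GenericData
    xmlns:message="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message"
    xmlns:generic="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic">
  <message:DataSet>
    <generic:Series>
      <generic:SeriesKey>
        <generic:Value id="REF_AREA" value="FRA"/>
        <generic:Value id="FREQ" value="Q"/>
      </generic:SeriesKey>
      <generic:Obs>
        <generic:ObsDimension value="2023-Q1"/>
        <generic:ObsValue value="100.0"/>
      </generic:Obs>
      <generic:Obs>
        <generic:ObsDimension value="2023-Q2"/>
        <generic:ObsValue value="101.5"/>
      </generic:Obs>
    </generic:Series>
    <generic:Series>
      <generic:SeriesKey>
        <generic:Value id="REF_AREA" value="DEU"/>
        <generic:Value id="FREQ" value="Q"/>
      </generic:SeriesKey>
      <generic:Obs>
        <generic:ObsDimension value="2023-Q1"/>
        <generic:ObsValue value="200.0"/>
      </generic:Obs>
    </generic:Series>
  </message:DataSet>
</message:GenericData>"""


def sdmx_generic_to_rows(xml_text):
    """Flatten each observation into one dict: series dimensions + period + value.

    Hypothetical helper. Assumes every Obs has an ObsDimension and an ObsValue.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for series in root.iterfind(".//generic:Series", NS):
        # Dimensions shared by every observation in this series.
        key = {
            value.get("id"): value.get("value")
            for value in series.iterfind("generic:SeriesKey/generic:Value", NS)
        }
        for obs in series.iterfind("generic:Obs", NS):
            row = dict(key)
            # Column names TIME_PERIOD and OBS_VALUE are chosen for this sketch.
            row["TIME_PERIOD"] = obs.find("generic:ObsDimension", NS).get("value")
            row["OBS_VALUE"] = obs.find("generic:ObsValue", NS).get("value")
            rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the flattened rows as CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


ROWS = sdmx_generic_to_rows(SAMPLE_XML)
CSV_TEXT = rows_to_csv(ROWS)
print(CSV_TEXT)
```

Each observation becomes one row, with the series dimensions repeated on every row. A tabular layout like this is the kind of starting point from which columns would then be mapped manually to the Data Commons schema.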
How to get started and contribute
We envision three primary ways for the community to engage with these new tools:
| User group | How you can help |
| --- | --- |
| Custom Data Commons owners | Leverage our Python client to download and standardize SDMX data for your own custom ETL processes and private Data Commons instances. |
| Open source contributors | The tools are open source. Help us improve them by adding features, fixing bugs, and enhancing their robustness. |
| SDMX data providers | Focus on providing high-quality, comprehensive metadata in your SDMX files. This is the single most impactful way to ensure your data can be more easily mapped and utilized. |
As an example, we used these tools to import the OECD’s quarterly GDP data, which can be explored in Data Commons.
What’s next?
To address the current manual mapping effort and further automate the pipeline, our roadmap focuses on the following enhancements, with releases planned across the coming quarters:
- SDMX 3.0 support: We will add support for version 3.0 of the standard as it gains wider adoption.
- Auto-schematization: We plan to leverage SDMX’s rich metadata to automatically generate schema mappings, reducing the manual effort required to integrate new datasets.
- Enhanced auto-refresh: Future improvements will enable checking SDMX metadata for data availability and triggering updates only when new data is released.
We look forward to seeing the new and interesting datasets that the community will bring to Data Commons!