🥐 Croissant RAI Metadata Editor

Upload a Croissant JSON file to check which Responsible AI metadata fields are present. Fill in the missing fields and export an updated file ready for NeurIPS submission. Read the NeurIPS dataset submission requirements.

A detailed description of the required RAI metadata, with concrete examples, can be found in the guidelines for RAI attributes.

Upload a Croissant file to get started.
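
As a rough illustration of the presence check described above, the following sketch lists which RAI fields are absent from a Croissant JSON document. The property names in `RAI_FIELDS` are assumptions based on the Croissant RAI vocabulary and W3C PROV, and the function name `missing_rai_fields` is hypothetical. Confirm the exact field list against the RAI guidelines before relying on it.

```python
import json

# Assumed property names; verify them against the RAI guidelines.
RAI_FIELDS = [
    "rai:dataLimitations",
    "rai:dataBiases",
    "rai:personalSensitiveInformation",
    "rai:dataUseCases",
    "rai:dataSocialImpact",
    "rai:hasSyntheticData",
    "prov:wasDerivedFrom",
    "prov:wasGeneratedBy",
]


def missing_rai_fields(metadata: dict) -> list[str]:
    """Return the assumed RAI fields that are absent or empty in the metadata."""
    missing = []
    for field in RAI_FIELDS:
        value = metadata.get(field)
        # False is a legitimate value for a boolean flag, so only None and
        # empty strings, lists, or dicts count as missing.
        if value is None or value in ("", [], {}):
            missing.append(field)
    return missing


# Hypothetical, heavily trimmed Croissant document used only for illustration.
example = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "toy-dataset",
  "rai:dataLimitations": "Collected in two countries only.",
  "rai:hasSyntheticData": false
}
""")

print(missing_rai_fields(example))
```

Here the two populated fields are skipped, including the flag that is explicitly set to `false`, and the remaining six are reported as missing.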

Every dataset reflects choices about scope, collection method, and context that constrain where it can be safely applied. This section asks you to make those constraints explicit and visible to downstream users.

Document the specific boundaries of your dataset: the tasks it was not designed for, the conditions under which it is expected to fail, and any use cases where deployment should be strictly avoided. Think about geographic coverage, the modalities captured, the population represented, and the conditions of data collection. For example, a medical imaging benchmark evaluated only on European clinical data should explicitly state it cannot be generalized to other healthcare systems; a UAV localization dataset collected in Germany and the United States must flag that its performance will degrade in regions, scene types, or conditions outside those covered.

Pay particular attention to use cases where misapplication carries real harm. Benchmarks designed for model evaluation are routinely misused for training; datasets covering safety-critical domains like clinical imaging are sometimes taken as proxies for deployment readiness. If your dataset is not suitable for a specific application, say so explicitly — a sentence stating "this dataset is strictly not recommended for clinical deployment or commercial applications" protects both your work and downstream users.

Bias in a dataset does not mean the data is wrong — it means the data reflects a particular slice of the world. This section asks you to identify which slice, and what that implies for downstream applications, such as models trained using this data.

Document any known or suspected systematic skew in your data: who is over- or under-represented, which languages or geographic regions dominate, whether annotation was performed by a homogeneous group, and how any of these factors might affect model behaviour in practice. Be specific. Rather than stating "the dataset may contain demographic bias," describe the actual distribution — for instance, that annotators were predominantly from the United States, United Kingdom, and Canada, with a majority holding higher education qualifications, and that this skew is reflected in quality judgements. Similarly, if a dataset draws from a single institutional source, document how that source's practices and norms may have shaped what is — and is not — present in the data.

Where the bias is structural — such as a naturally long-tailed distribution of rare conditions, or a deliberate overrepresentation of one collection environment — explain whether the imbalance was intentional and what effect it is expected to have on downstream use. Synthetic datasets require particular care here. Generators — including large language models — inherit and can amplify the biases of their training data. Document the known limitations of the generator, any demographic or cultural skews in its outputs, and whether the synthetic distribution accurately represents the target population.

This section requires you to take stock of any personal and sensitive information present in your dataset — whether it was collected intentionally or incidentally. For each sensitive category present, state it explicitly. The relevant categories include: gender, age, geographic location, language, culture, socio-economic status, health and medical data, and political or religious beliefs. If the dataset contains absolutely none of these, state that explicitly too.

Beyond listing what is present, document the legal and ethical basis for its inclusion. For datasets containing medical data, confirm whether ethics review was conducted and what the outcome was. For datasets derived from user-generated content, acknowledge that personal disclosures may be embedded in open-ended entries even when no direct identifiers were collected. If privacy-preserving measures were applied — de-identification, downscaling of images, suppression of absolute geolocation, or similar — describe those measures and what residual risk, if any, remains after their application.

A dataset is only as reliable as the tasks it was actually designed and validated to support. This section asks you to define those tasks precisely, and to be honest about the boundaries of what has been tested. Describe what real-world concept or capability your dataset is intended to measure, and provide evidence that it does so reliably. Evidence might include: published benchmark results across multiple models, expert validation of ground-truth annotations, human baselines, or independent auditing. For each validated task, state the evaluation metrics used and the conditions under which validity has been established.

Equally important is what you have not validated. Datasets are frequently applied to tasks for which no validity evidence exists — a localization benchmark repurposed for 3D reconstruction, or a diversity measurement corpus repurposed for safety red-teaming. List these out-of-scope use cases explicitly, even when they might seem like natural extensions of your dataset's core purpose. If validity has only been established for specific subpopulations, geographic contexts, or input conditions, make those boundaries clear. For synthetic datasets, also report resemblance metrics (how closely the synthetic data reflects the real-world distribution it models) and utility metrics (evidence that models trained or evaluated on it perform comparably to those using real data). Without both, fitness for purpose remains unverified.
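
To make the two metric families for synthetic data concrete, here is a minimal sketch. It assumes total variation distance over a categorical attribute as one possible resemblance metric, and a simple ratio of downstream scores as one possible utility metric. Neither choice is prescribed by the guidance above, and the sample labels and scores are made up.

```python
from collections import Counter


def total_variation_distance(real: list[str], synthetic: list[str]) -> float:
    """Distance between two categorical samples (0 = identical, 1 = disjoint)."""
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synthetic))
        for c in categories
    )


def utility_ratio(score_with_synthetic: float, score_with_real: float) -> float:
    """Ratio of downstream scores, both measured on the same real test set."""
    return score_with_synthetic / score_with_real


# Made-up label samples and scores, for illustration only.
real_labels = ["cat", "cat", "dog", "bird"]
synthetic_labels = ["cat", "dog", "dog", "dog"]

print(total_variation_distance(real_labels, synthetic_labels))
print(utility_ratio(0.72, 0.80))
```

Whichever metrics you choose, report them together with the attribute, test set, and model they were computed on, so that readers can judge how far the evidence extends.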

Releasing a dataset is a consequential act. This section asks you to think through those consequences — the benefits you intend, the risks you foresee, and the measures you have taken to mitigate harm. Start with the positive case: what does this dataset enable that was not possible before, and who stands to benefit? Then consider the risks. These include misuse (applying the dataset in ways it was not designed for), over-reliance (treating benchmark performance as a proxy for deployment readiness), and distributional harm (building on data that systematically excludes or misrepresents specific communities). For datasets touching safety-critical domains, consider what the human cost of a failure would be, and whether your release could accelerate premature deployment.

Finally, describe what you have done to mitigate these risks. Concrete mitigations include: restricting use via a non-commercial or share-alike license, designing the dataset as evaluation-only to prevent training misuse, applying access controls or gated release, publishing a full limitations document, or requiring a usage agreement. The goal is not to eliminate all risk — that is rarely possible — but to demonstrate that you have thought carefully about the implications of your work and taken reasonable steps to protect against foreseeable harms.

Switch to 'yes' if your dataset includes any synthetically generated content, whether it is fully synthetic or a mixture of real data augmented through generative processes. Setting this flag is the entry point, not the end point: if it is set to 'yes', you are strongly encouraged to document the synthetic generation process in detail. This means identifying the seed data used, naming the specific generator (including the model or algorithm, and version where applicable), and describing how the synthetic output was validated. Leaving the flag set without complementary documentation significantly limits the usefulness of this information for downstream users.
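
The point about complementary documentation can be sketched as a small consistency check. It assumes the flag is stored as `rai:hasSyntheticData`, that seed data is recorded under `prov:wasDerivedFrom`, and that the generation process is recorded under `prov:wasGeneratedBy`. These property names and the helper name are assumptions, not confirmed details of the tool.

```python
def synthetic_documentation_gaps(metadata: dict) -> list[str]:
    """List documentation that is missing when the synthetic-data flag is set."""
    if not metadata.get("rai:hasSyntheticData"):
        # Flag absent or false: nothing further is expected by this check.
        return []
    gaps = []
    if not metadata.get("prov:wasDerivedFrom"):
        gaps.append("seed data (prov:wasDerivedFrom)")
    if not metadata.get("prov:wasGeneratedBy"):
        gaps.append("generation process (prov:wasGeneratedBy)")
    return gaps


# Hypothetical metadata: the flag is set but nothing else is documented.
flag_only = {"rai:hasSyntheticData": True}
print(synthetic_documentation_gaps(flag_only))
```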

If your dataset builds on existing data, this section documents its lineage: the "parent" datasets from which it inherits both content and constraints. For each source dataset, provide at least a stable URL. Optionally, you can also provide its name, the publishing organisation, and the license under which it is made available. This is not merely a citation formality: license compatibility is a legal requirement, and inherited biases or restrictions travel with the data. A dataset derived from a CC BY-NC-SA source, for instance, carries non-commercial restrictions into any derivative work. If your dataset incorporates multiple sources with different licenses, document each one and explain how you have ensured compatibility.

For datasets generated through data augmentation, synthetic generation, or model-assisted construction, identify the seed data used — even if the final dataset bears little surface resemblance to the original. Users need to understand what the data was ultimately derived from.
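
A minimal sketch of how source-dataset entries might be assembled follows. The W3C PROV term `prov:wasDerivedFrom` is real, but the per-entry keys (`url`, `name`, `publisher`, `license`), the helper name, and the example URLs are hypothetical placeholders. The layout the tool actually exports may differ.

```python
from typing import Optional


def source_dataset(url: str, name: Optional[str] = None,
                   publisher: Optional[str] = None,
                   license: Optional[str] = None) -> dict:
    """Build one source-dataset entry; only the stable URL is mandatory."""
    if not url:
        raise ValueError("each source dataset needs at least a stable URL")
    entry = {"url": url}
    optional = {"name": name, "publisher": publisher, "license": license}
    # Keep only the optional fields that were actually provided.
    entry.update({key: value for key, value in optional.items() if value})
    return entry


sources = [
    source_dataset(
        "https://example.org/datasets/parent-a",  # placeholder URL
        name="Parent A",
        publisher="Example Lab",
        license="CC BY-NC-SA 4.0",
    ),
    source_dataset("https://example.org/datasets/parent-b", license="CC BY 4.0"),
]

# More than one distinct license means compatibility has to be explained.
licenses = sorted({s["license"] for s in sources if "license" in s})
metadata_fragment = {"prov:wasDerivedFrom": sources}
print(licenses)
```

Collecting the distinct licenses in one place makes it easier to notice when a restrictive parent license, such as a non-commercial one, constrains the derived dataset.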

Form fields: Source dataset 1 through Source dataset 5.

This section documents the "who" and the "what" of your dataset's creation — the full chain of decisions and actions that produced it. For each major activity — data collection, filtering, preprocessing, annotation, quality review — provide a description of what was done and an account of who or what performed the work. Agents may be human (research teams, crowdworkers, domain experts) or synthetic (language models used for classification or pre-annotation, generation algorithms, automated pipelines). Both require documentation.

For human agents, describe the team's institutional affiliations, language background, and geographic location. If crowdsourcing was used, also describe the platform, the prescreening criteria, and the compensation arrangements. For datasets involving synthetic generation, name the generator specifically (including model name and version for LLMs, and the prompting strategy applied), and report the resemblance and utility metrics used to validate the output. Where human and synthetic agents collaborated (for instance, a language model performing initial labelling that human annotators then reviewed), describe that interaction clearly, since the quality and bias of the final annotations depend on both.
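
The activity and agent structure described above could be represented along these lines. The sketch borrows real W3C PROV terms (`prov:Activity`, `prov:wasAssociatedWith`, `prov:Person`, `prov:Organization`, `prov:SoftwareAgent`), but the exact layout, the helper names, and the example agents (including the model name "example-llm v0") are hypothetical and may not match what the tool writes.

```python
AGENT_TYPES = {"prov:Person", "prov:Organization", "prov:SoftwareAgent"}


def make_agent(name: str, agent_type: str, description: str) -> dict:
    """Describe one human or synthetic agent."""
    if agent_type not in AGENT_TYPES:
        raise ValueError(f"unknown agent type: {agent_type}")
    return {"@type": agent_type, "name": name, "description": description}


def make_activity(name: str, description: str, agents: list[dict]) -> dict:
    """Describe one activity and the agents that carried it out."""
    if not agents:
        raise ValueError("every activity needs at least one documented agent")
    return {
        "@type": "prov:Activity",
        "name": name,
        "description": description,
        "prov:wasAssociatedWith": agents,
    }


# Hypothetical human-plus-model annotation workflow.
annotation = make_activity(
    "annotation",
    "A language model proposed initial labels; human annotators reviewed them.",
    [
        make_agent("example-llm v0", "prov:SoftwareAgent",
                   "Hypothetical model; record version and prompting strategy."),
        make_agent("Review team", "prov:Person",
                   "Hypothetical crowdworkers; record platform, prescreening, pay."),
    ],
)
metadata_fragment = {"prov:wasGeneratedBy": [annotation]}
print(len(annotation["prov:wasAssociatedWith"]))
```

Recording both agents under the same activity keeps the collaboration visible, which matters because the final labels reflect the model's proposals as well as the reviewers' corrections.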

Form fields: Activity 1 through Activity 5, each with a Type and with Agent 1 through Agent 4 (each agent also has a Type).

Learn more about 🥐Croissant

⚠️ This tool may be in use by many people at the same time, which can trigger rate limiting by the platform hosting your data. The app will then retry and may get stuck in a very long loop. If it takes too long to run, we recommend one of the following options:
  • 🔁 Click the button with the three dots (⋯) above and select "Duplicate this Space" to run this app in your own Hugging Face Space.
  • 💻 Click the button with the three dots (⋯) above and select "Run Locally" and then "Clone (git)" to get instructions for running the RAI tool locally. You can also use the Docker option (you don't need the tokens).
  • 🥐 Open the Croissant JSON file in an editor and enter the fields yourself using the RAI guidelines.
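
For the last option, a short script can stand in for manual editing. The sketch below loads a Croissant JSON file, sets a few fields, and writes an updated copy. The function name `add_rai_fields` and the property names are assumptions to be checked against the RAI guidelines, and the demonstration runs on a throwaway file rather than a real dataset.

```python
import json
import tempfile
from pathlib import Path


def add_rai_fields(in_path: str, out_path: str, fields: dict) -> dict:
    """Load a Croissant JSON file, set the given fields, and write a new file."""
    metadata = json.loads(Path(in_path).read_text(encoding="utf-8"))
    metadata.update(fields)  # existing keys with the same name are overwritten
    Path(out_path).write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return metadata


# Demonstration on a temporary file; replace the paths with your own file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "croissant.json"
    dst = Path(tmp) / "croissant_rai.json"
    src.write_text(
        json.dumps({"@type": "sc:Dataset", "name": "toy-dataset"}),
        encoding="utf-8",
    )
    add_rai_fields(str(src), str(dst), {
        "rai:dataLimitations": "Not validated outside the collection region.",
        "rai:hasSyntheticData": False,
    })
    reloaded = json.loads(dst.read_text(encoding="utf-8"))

print(sorted(reloaded))
```

Writing to a separate output file leaves the original untouched, so the two versions can be compared before submission.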