Bias in a dataset does not mean the data is wrong — it means the data reflects a particular slice of the world. This section asks you to identify which slice, and what that implies for downstream applications, such as models trained using this data.
Document any known or suspected systematic skew in your data: who is over- or under-represented, which languages or geographic regions dominate, whether annotation was performed by a homogeneous group, and how any of these factors might affect model behaviour in practice. Be specific. Rather than stating "the dataset may contain demographic bias," describe the actual distribution — for instance, that annotators were predominantly from the United States, United Kingdom, and Canada, with a majority holding higher education qualifications, and that this skew is reflected in quality judgements. Similarly, if a dataset draws from a single institutional source, document how that source's practices and norms may have shaped what is — and is not — present in the data.
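One way to move from a general statement to the actual distribution is to tabulate whatever metadata you hold. The sketch below is a minimal illustration, not a prescribed method: the helper `distribution_summary` and the `annotator_countries` list are hypothetical, and the values are invented for the example.

```python
from collections import Counter


def distribution_summary(values):
    """Return (category, count, share) rows, most common category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, n, n / total) for category, n in counts.most_common()]


# Hypothetical annotator metadata, invented for illustration only.
annotator_countries = ["US"] * 6 + ["UK"] * 3 + ["CA"] * 1

for category, count, share in distribution_summary(annotator_countries):
    print(f"{category}: {count} annotators ({share:.0%})")
```

Reporting the counts alongside the percentages lets readers judge how much weight a small group can bear.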
Where the bias is structural, such as a naturally long-tailed distribution of rare conditions or a deliberate overrepresentation of one collection environment, explain whether the imbalance was intentional and what effect it is expected to have on downstream use.
Synthetic datasets require particular care here. Generators, including large language models, inherit and can amplify the biases of their training data. Document the known limitations of the generator, any demographic or cultural skews in its outputs, and whether the synthetic distribution accurately represents the target population.
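If you have reference shares for the target population, one simple way to quantify how far a synthetic distribution departs from it is the total variation distance: half the sum of absolute differences between category shares, ranging from 0 (identical) to 1 (no overlap). The sketch below assumes you already have both sets of shares. The category names and numbers are hypothetical, and other measures may suit your data better.

```python
def total_variation_distance(p, q):
    """Half the sum of absolute differences between two share dictionaries.

    Categories missing from one dictionary are treated as having share 0.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical language shares, invented for illustration only.
target_shares = {"en": 0.50, "es": 0.30, "hi": 0.20}
synthetic_shares = {"en": 0.80, "es": 0.15, "hi": 0.05}

tvd = total_variation_distance(synthetic_shares, target_shares)
print(f"Total variation distance: {tvd:.2f}")

# Per-category gaps show which groups the generator over- or under-produces.
for category in sorted(target_shares):
    gap = synthetic_shares.get(category, 0.0) - target_shares[category]
    print(f"{category}: {gap:+.2f}")
```

A single summary number is useful for tracking change between dataset versions, but the per-category gaps are usually what belongs in the documentation, since they name the groups affected.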