
DataLad: Decentralized distribution and management of scientific datasets


In this lesson, Yaroslav O. Halchenko describes how DataLad allows you to track and manage both your data and your analysis code, thereby facilitating reliable, reproducible, and shareable research.

Topics covered in this lesson

DataLad aims to adapt the model of open-source software (OSS) distributions to address the technical limitations of today's data sharing and to provide all components of a "data distribution" and a "data management platform". The key concepts are: 1) Leverage, but do not replace, independent existing and future data hosting solutions to form a federated platform for data sharing. 2) Employ software for data tracking and deployment logistics that is specialized for large data (git-annex) and built atop Git, the most capable distributed version control system (dVCS) available today, to enable efficient data access at any level of granularity, from single files to entire collections of datasets. DataLad provides access to data available from various sources (e.g. lab or consortium websites such as INDI, as well as data-sharing portals) through a single interface. It enables students and scientists to operate on data using familiar concepts, such as files and directories, while transparently managing data access and authorization with the underlying hosting providers.
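
To make the two key concepts more concrete, the following is a toy model in plain Python of the general idea: a dataset keeps only a lightweight index of file paths and content keys, and the content itself is fetched on demand from whichever hosting source holds it. This is an illustrative sketch only, not DataLad's or git-annex's actual implementation or API. The `Source` and `Dataset` classes, the source names, the file paths, and the file contents are all hypothetical.

```python
import hashlib


class Source:
    """A hypothetical hosting provider that stores content under a hash key."""

    def __init__(self, name):
        self.name = name
        self._blobs = {}

    def put(self, data):
        # The key is derived from the content, so it identifies it uniquely.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def fetch(self, key):
        # Returns None when this source does not hold the content.
        return self._blobs.get(key)


class Dataset:
    """Toy dataset: paths map to content keys; content is fetched on demand."""

    def __init__(self, sources):
        self.sources = list(sources)
        self.index = {}  # path -> content key (the small, versionable part)
        self.local = {}  # content key -> bytes currently present locally

    def track(self, path, key):
        self.index[path] = key

    def get(self, path):
        key = self.index[path]
        if key not in self.local:
            # Ask each registered source in turn and verify what comes back.
            for source in self.sources:
                data = source.fetch(key)
                if data is not None and hashlib.sha256(data).hexdigest() == key:
                    self.local[key] = data
                    break
            else:
                raise FileNotFoundError(f"no source holds content for {path}")
        return self.local[key]

    def drop(self, path):
        # Free local space; the path stays in the index and can be re-fetched.
        self.local.pop(self.index[path], None)


# Two independent (hypothetical) hosting solutions, each holding one file.
lab_site = Source("lab-website")
portal = Source("sharing-portal")
key_a = lab_site.put(b"subject,score\n01,0.93\n")
key_b = portal.put(b"raw imaging bytes")

# One dataset federates both sources behind ordinary-looking file paths.
ds = Dataset([lab_site, portal])
ds.track("derivatives/scores.csv", key_a)
ds.track("sub-01/anat.nii", key_b)

scores = ds.get("derivatives/scores.csv")
anat = ds.get("sub-01/anat.nii")
```

The user of `ds` works only with paths and never needs to know which source served a given file. That mirrors, in miniature, the single-interface idea described above: hosting stays with the independent providers, while the dataset tracks what exists and where it can be obtained.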