“Data versioning is like flossing. Everyone agrees it’s a good thing to do, but few do it.” ~ Chip Huyen, Designing Machine Learning Systems
Unlike code versioning, data versioning is much harder to implement in data science and machine learning projects.
Here is why:
➡️ Data is often much larger than code.
➡️ There is no single definition of what counts as a difference between two data versions, or of how to resolve merge conflicts.
➡️ Regulations on data protection and privacy make keeping historical data difficult.
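To make the second point concrete, here is a minimal sketch in Python. The two fingerprint functions and the toy rows are made up for illustration, not taken from any versioning tool. One definition treats any change in the stored bytes as a new version, the other ignores row order, and the two disagree on the same pair of datasets.

```python
import hashlib

def byte_fingerprint(rows):
    """Hash the rows exactly as stored, so row order matters."""
    joined = "\n".join(",".join(row) for row in rows)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def content_fingerprint(rows):
    """Hash the rows as an unordered collection, so row order is ignored."""
    return byte_fingerprint(sorted(rows))

# Hypothetical toy data: the same records, stored in a different order.
v1 = [("1", "alice"), ("2", "bob")]
v2 = [("2", "bob"), ("1", "alice")]

bytes_differ = byte_fingerprint(v1) != byte_fingerprint(v2)
content_differs = content_fingerprint(v1) != content_fingerprint(v2)
print(bytes_differ, content_differs)  # True False
```

Is v2 a new version? By the first definition yes, by the second no. A team has to pick one before it can even talk about diffs and merges.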
Do you floss… erm, version your data often?