Base Structure for Data Science Projects

Last updated: 7/Jan/2024

Data science projects are often characterized by their complexity and exploratory nature, which can make code and data difficult to organize and maintain. A well-defined project structure helps data scientists improve the quality of their work, adopt industry software development standards, collaborate more easily, and build solutions that scale.

These projects typically start as Jupyter notebooks that, after a process of iteration and refinement, are converted into Python scripts. However, Jupyter notebooks are an exploration tool and are not well suited to software development with good engineering practices. It is therefore important for data scientists to have a project structure that lets them iterate quickly during experimentation and then build complete pipelines with Python scripts.

To avoid decision fatigue and having to create a data science project structure from scratch, it is recommended to use a project template that contains the software tools and folder structure needed to develop data science projects efficiently, reproducibly, and with good software development practices.

Objective definition

A data science project aims to solve a problem based on data analysis (Descriptive Analysis), understanding relationships in the data (Inferential Analysis) and creating a prediction model (which could be a Machine Learning model!).

This prediction model at a production level is a software product and as such should be managed as a software project.

How to use this template (Project Template)

If you want to regularly update your project with changes from the original template, it is recommended to use Cruft. These changes can be library version updates, new folders, or new tools added to the project.

Install Cruft with UV:

```shell
uv tool install cruft
```

Create a project based on the template:

```shell
cruft create https://github.com/JoseRZapata/data-science-project-template
```

When you need to update the project with changes from the template, run the following in the root of the project, where the .cruft.json file is located:

```shell
cruft update
```
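For reference, Cruft tracks the template a project came from in the .cruft.json file it generates; a sketch of its shape (field values here are illustrative, not taken from the actual template):

```json
{
  "template": "https://github.com/JoseRZapata/data-science-project-template",
  "commit": "<hash of the template commit the project was generated from>",
  "checkout": null,
  "context": {
    "cookiecutter": {
      "project_name": "my-ds-project"
    }
  }
}
```

`cruft update` compares the recorded commit with the template's latest commit and applies the difference to your project.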

Cookiecutter

Install Cookiecutter with UV:

```shell
uv tool install cookiecutter
```

Create a project based on the template:

```shell
cookiecutter gh:JoseRZapata/data-science-project-template
```

Reference projects

Comparative table of reference project templates:

| Link | Pros | Cons | Last updated |
| --- | --- | --- | --- |
| Data Science Template | Well-focused elements for data science projects with software development elements | Missing GitHub Actions to automate CI/CD tasks and some folders | 12/Jul/2023 |
| cookiecutter-modern-datascience | Elements focused on data science projects with software development elements | No package manager with Poetry to avoid dependency issues, nor automation of some processes | 10/Aug/2022 |
| cookiecutter-hypermodern-python | Modern elements for developing Python projects | Needs to expand the folder structure for data science projects | 3/Jun/2022 |
| cookiecutter-cruft-poetry-tox-pre-commit-ci-cd | Modern tool configuration for developing Python projects | Not specific to data science projects; needs to expand the folder structure | 22/Aug/2023 |
| cookiecutter-data-science | Widely used template with elements focused on data science projects | No Python software development configuration nor package manager with Poetry; last update was several years ago | 20/Mar/2021 |

Features of the base project template for data science

Based on my experience and other references, some of the features that a base project for data science should have are:

  • Dependency management with UV: UV is a modern dependency manager for Python that simplifies installing and managing dependencies, and also manages virtual environments and Python installations. Previously I used Poetry, but UV is faster and replaces both Poetry and pyenv.
  • Pre-commit hooks with pre-commit: pre-commit is a tool that allows developers to automatically run scripts before making a commit. This can help ensure that the code meets the project’s quality standards.
  • Code quality:
    • Ruff: a very fast linter and formatter that finds and fixes Python errors; it also replaces Black, formatting code and helping ensure a consistent style.
    • Mypy is a static type checker for Python that helps find errors before the code runs.
  • Unit testing and test coverage with PyTest and Codecov:
    • PyTest is a unit testing library for Python.
    • Codecov is a tool that provides test coverage reports on the code.
  • CI/CD with GitHub Actions: GitHub Actions is a workflow automation platform that allows developers to automate tasks such as building, testing, and deploying code.
  • Documentation: Generate static project documentation.
  • Project template: libraries to create projects from templates with predefined folders and configuration files.
    • Cruft allows creating a project from a cookiecutter and also updating the project with changes from the original template.
    • Cookiecutter allows creating projects from templates without updates.
  • Configuration scripts: Scripts to configure the project and development environment when starting the project and for repetitive commands, for example Makefile.
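To make the pre-commit item above concrete, a minimal .pre-commit-config.yaml for such a project might look like this (the hook revisions are illustrative placeholders; pin them to current releases):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0          # illustrative pin
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4          # illustrative pin
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # format
```

Run `pre-commit install` once in the repository so the hooks execute on every commit.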
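A sketch of a GitHub Actions workflow for the CI tasks above (action versions, Python version, and commands are illustrative; adapt them to the project):

```yaml
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Illustrative install step; the real project may use a setup action for UV
      - run: pip install uv && uv sync
      - run: uv run pytest --cov
```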
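The configuration-scripts idea can be sketched as a Makefile with targets for the repetitive commands (target names and commands are illustrative; recipe lines must be indented with tabs):

```makefile
.PHONY: install lint format test

install:        ## install dependencies and pre-commit hooks
	uv sync
	uv run pre-commit install

lint:           ## run static checks
	uv run ruff check src tests
	uv run mypy src

format:         ## auto-format the code
	uv run ruff format src tests

test:           ## run the test suite with coverage
	uv run pytest --cov
```

With this in place, `make install` bootstraps a fresh clone and `make test` is the same command locally and in CI.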
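As an illustration of the kind of error Mypy catches before the code ever runs (a hypothetical function, not part of the template):

```python
def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


print(mean([1.0, 2.0, 3.0]))  # OK: 2.0

# A call like mean("abc") would only fail at runtime with a TypeError,
# but `mypy` flags it statically: argument of type "str" is not
# compatible with "list[float]".
```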
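A minimal PyTest example, with a hypothetical metric function and the tests that exercise it (PyTest auto-discovers functions named `test_*`):

```python
# src/metrics.py -- hypothetical module
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# tests/test_metrics.py -- run with `pytest`
def test_accuracy_perfect():
    assert accuracy([1, 0, 1], [1, 0, 1]) == 1.0


def test_accuracy_half():
    assert accuracy([1, 0, 1, 0], [1, 0, 0, 1]) == 0.5
```

Running `pytest --cov` (via the pytest-cov plugin) produces the coverage data that Codecov turns into reports.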

What the base data science project template does NOT include:

  • pypi: PyPI will not be used to publish packages, since the project's deliverables are machine learning models and data transformations, not installable libraries.
  • nox, tox: Tests across different Python versions will not be needed; development will use only one Python version.

Initial base

To achieve this I will create a base project template for data science with the following elements:

| Feature | Library | Status |
| --- | --- | --- |
| Programming language | Python version 3.9 or higher | |
| Library manager | UV, Poetry, or pipenv | |
| Virtual environment manager | UV, Poetry, or pipenv | |
| Testing | Pytest | |
| Hook manager | pre-commit | |
| Auto-update pre-commit | GitHub Actions | |
| Linter | Ruff | |
| Code formatting | Ruff and Prettier | |
| Static type checking | Mypy | |
| Import ordering | Ruff | |
| Code coverage | coverage.py | |
| Test coverage report | Codecov | |
| Dependency updates | Dependabot | |
| Security and auditing | Bandit and Safety | |
| Scaffolding manager | Cruft, Cookiecutter, or Copier | |
| Project configuration | OmegaConf or Hydra | |
| Python syntax upgrade | Ruff | |
| Documentation | MkDocs, Sphinx, or pdoc | |
| Base libraries | Pandas, NumPy, scikit-learn, Jupyter | |
| Configuration scripts | Makefile or just | |

Currently Ruff replaces flake8, black, pylint, pyupgrade, and isort; it lints and formats code and helps ensure a consistent style. Ruff is very fast because it is written in Rust.
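Since one tool covers linting, formatting, import sorting, and syntax upgrades, its configuration collapses into a few lines of pyproject.toml (an illustrative sketch, not the template's actual settings):

```toml
[tool.ruff]
line-length = 88
target-version = "py39"

[tool.ruff.lint]
# E/F ~ flake8, I ~ isort, UP ~ pyupgrade; `ruff format` replaces black
select = ["E", "F", "I", "UP"]
```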

Additional MLOps elements

Additionally, to achieve a base data science project structure that enables MLOps implementation, the following elements will be included:

Libraries for a Machine Learning project

| Feature | Library | Status |
| --- | --- | --- |
| Data manager | DVC | |
| Data validation | Great Expectations or Pandera | |
| Experiment manager | MLflow, Weights & Biases, or Neptune | |
| Model manager | MLflow | |
| Pipeline manager | Kedro or ZenML | |
| Train/test validation | Deepchecks | |
| Data integrity validation | Deepchecks | |
| Model validation | Deepchecks | |
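Libraries aside, the data-validation idea can be sketched in plain Python: declare a schema of expected columns and checks, then fail fast when incoming data violates it. Tools like Pandera and Great Expectations formalize exactly this pattern (the schema and data below are hypothetical):

```python
# Hand-rolled sketch of schema validation; Pandera / Great Expectations
# provide the production-grade version of this idea.
SCHEMA = {
    "age":    lambda v: isinstance(v, int) and 0 <= v <= 120,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def validate(rows: list[dict]) -> list[str]:
    """Return human-readable violations (an empty list means the data is valid)."""
    errors = []
    for i, row in enumerate(rows):
        for column, check in SCHEMA.items():
            if column not in row:
                errors.append(f"row {i}: missing column '{column}'")
            elif not check(row[column]):
                errors.append(f"row {i}: invalid value {row[column]!r} in '{column}'")
    return errors


good = [{"age": 34, "income": 52_000.0}]
bad = [{"age": -5, "income": 52_000.0}, {"age": 40}]
print(validate(good))  # []
print(validate(bad))   # two violations: invalid age, missing income
```

Running this check at the start of a pipeline turns silent data drift into an explicit, logged failure.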

References

Cookiecutter Templates

References for cookiecutter project templates for data science and software projects:

Cookiecutter Data Science

Cookiecutter with UV

Cookiecutter with Poetry

Environments
