Last updated: 7/Jan/2024
Data science projects are often complex and exploratory, which makes code and data hard to organize and maintain. A well-defined project structure helps data scientists improve the quality of their work, apply industry software development standards, collaborate more easily, and scale their projects.
These projects usually start with Jupyter notebooks, which are converted into Python scripts after a process of iteration and refinement. However, Jupyter notebooks are an exploration tool and are not well suited to software development that follows good engineering practices. Data scientists therefore need a project structure that lets them iterate quickly during experimentation and then build complete processes (pipelines) as Python scripts.
To avoid decision fatigue and having to create a data science project structure from scratch, it is recommended to use a project template that contains the software tools and folder structure needed to develop data science projects efficiently, reproducibly, and with good software development practices.
Objective definition:
A data science project aims to solve a problem based on data analysis (Descriptive Analysis), understanding relationships in the data (Inferential Analysis) and creating a prediction model (which could be a Machine Learning model!).
At production level, this prediction model is a software product and should be managed as a software project.
How to use this template (Project Template)
If you want to regularly update your project with changes from the original template, it is recommended to use Cruft.
These changes can include updated library versions, new folders, or new tools for the project.
Cruft (Recommended)
- Install Cruft with UV:
uv tool install cruft
- Create a project based on the template:
cruft create https://github.com/JoseRZapata/data-science-project-template
- When it is necessary to update the project with changes from the template:
In the root of the project, where the .cruft.json file is located, run:
cruft update
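To see which template version a project currently tracks before running an update, the .cruft.json file can be inspected. The following is a minimal sketch, assuming the file stores the template URL under a `template` key and the tracked commit under a `commit` key. The helper name `read_cruft_state` and the placeholder commit hash are hypothetical and only serve the demo.

```python
import json
import tempfile
from pathlib import Path


def read_cruft_state(project_root: Path) -> dict:
    """Return the template URL and commit recorded in .cruft.json."""
    cruft_file = project_root / ".cruft.json"
    state = json.loads(cruft_file.read_text(encoding="utf-8"))
    # Assumption: Cruft records these two keys; .get() avoids a KeyError if not.
    return {"template": state.get("template"), "commit": state.get("commit")}


# Demo with a hypothetical .cruft.json written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    example = {
        "template": "https://github.com/JoseRZapata/data-science-project-template",
        "commit": "0" * 40,  # placeholder, not a real commit hash
        "context": {"cookiecutter": {"project_name": "demo"}},
    }
    (root / ".cruft.json").write_text(json.dumps(example), encoding="utf-8")
    demo_state = read_cruft_state(root)
    print(demo_state["template"], demo_state["commit"])
```

In a real project, the helper would be called with the project root instead of a temporary directory.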
Cookiecutter
Install Cookiecutter with UV:
uv tool install cookiecutter
Create a project based on the template:
cookiecutter gh:JoseRZapata/data-science-project-template
Reference projects
Comparative table of reference project templates:
| Link | Pros | Cons | Last updated |
|---|---|---|---|
| Data Science Template | Well-focused elements for data science projects with software development elements | Missing GitHub Actions to automate CI/CD tasks and some folders | 12/Jul/2023 |
| cookiecutter-modern-datascience | Elements focused on data science projects with software development elements | Does not use Poetry for package management to avoid dependency issues, and does not automate some processes | 10/Aug/2022 |
| cookiecutter-hypermodern-python | Modern elements for developing Python projects | Needs to expand the folder structure for data science projects. | 3/Jun/2022 |
| cookiecutter-cruft-poetry-tox-pre-commit-ci-cd | Modern tool configuration for developing Python projects | Not specific for data science projects, needs to expand the folder structure | 22/Aug/2023 |
| cookiecutter-data-science | Widely used template with elements focused on data science projects | No Python software development configuration and no package management with Poetry; last updated several years ago | 20/Mar/2021 |
Features of the base project template for data science
Based on my experience and other references, some of the features that a base project for data science should have are:
- Dependency management with UV: UV is a modern dependency manager for Python that simplifies installing and managing dependencies, and also manages virtual environments and Python installations. Previously I used Poetry, but UV is faster and replaces both Poetry and pyenv.
- Pre-commit hooks with pre-commit: pre-commit is a tool that allows developers to automatically run scripts before making a commit. This can help ensure that the code meets the project’s quality standards.
- Code quality:
  - Ruff finds and fixes Python usage errors and also formats code for a consistent style, so it now replaces Black.
  - Mypy is a static type checker for Python that helps find errors before the code runs.
- Unit testing and test coverage with PyTest and Codecov:
  - PyTest is a unit testing library for Python.
  - Codecov is a tool that provides test coverage reports on the code.
- CI/CD with GitHub Actions: GitHub Actions is a workflow automation platform that allows developers to automate tasks such as building, testing, and deploying code.
- Documentation: Generate static project documentation.
- Project template: Libraries are needed to create projects from templates with folders and configuration files.
  - Cruft allows creating a project from a cookiecutter and also updating the project with changes from the original template.
  - Cookiecutter allows creating projects from templates without updates.
- Configuration scripts: Scripts to configure the project and development environment when starting the project and for repetitive commands, for example Makefile.
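To make the static typing and unit testing items above concrete, here is a minimal sketch of a type-annotated helper together with PyTest-style tests. The function `split_sizes` is a hypothetical example and not part of the template. The tests use plain `assert` statements so the block runs with the standard library alone; in a real test suite, `pytest.raises` would be the more idiomatic way to check for the exception.

```python
def split_sizes(n_rows: int, test_fraction: float = 0.2) -> tuple[int, int]:
    """Return (train_size, test_size) for a dataset with n_rows rows."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    test_size = round(n_rows * test_fraction)
    return n_rows - test_size, test_size


# A static type checker such as Mypy would reject split_sizes("100"),
# because "100" is a str and the signature requires an int.


def test_split_sizes_default_fraction() -> None:
    # PyTest collects functions whose names start with "test_".
    assert split_sizes(100) == (80, 20)


def test_split_sizes_rejects_bad_fraction() -> None:
    try:
        split_sizes(10, test_fraction=1.5)
    except ValueError:
        return
    raise AssertionError("expected ValueError for test_fraction > 1")
```

With hooks and CI configured, checks like these can run automatically on each commit and push instead of depending on someone remembering to run them.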
What the base data science project template does NOT include:
- pypi: PyPI will not be used to publish packages, since machine learning models and data transformations will be created.
- nox, tox: Tests across different Python versions will not be needed; development will use only one Python version.
Initial base
To achieve this I will create a base project template for data science with the following elements:
| Feature | Library | Status |
|---|---|---|
| Programming language | Python version 3.9 or higher | ✅ |
| Library manager | UV or | ✅ |
| Virtual environment manager | UV or | ✅ |
| Testing | Pytest | ✅ |
| Hook manager | pre-commit | ✅ |
| Auto update pre-commit | github actions | ✅ |
| Linter | Ruff | ✅ |
| Code formatting | Ruff and | ✅ |
| Static type checking | Mypy | ✅ |
| Import ordering | Ruff | ✅ |
| Code coverage | coverage.py | ✅ |
| Test coverage report | codecov | ✅ |
| Dependency updates | Dependabot | ✅ |
| Security and auditing | Bandit and | ✅ |
| Scaffolding manager | cruft or Cookiecutter | ✅ |
| Project configuration | OmegaConf or hydra | ✅ |
| Python syntax upgrade | Ruff | ✅ |
| Documentation | MkDocs or | ✅ |
| Base libraries | Pandas, NumPy, scikit-learn, Jupyter | ✅ |
| Configuration scripts | Makefile or | ✅ |
Ruff currently replaces flake8, Black, pylint, pyupgrade, and isort, which lint and format code and help ensure a consistent style. Ruff is very fast and is written in Rust.
Additional MLOps elements
Additionally, to achieve a base data science project structure that enables MLOps implementation, the following elements will be included:
Libraries for a Machine Learning project
| Feature | Library | Status |
|---|---|---|
| Data manager | DVC | ❓ |
| Data validation | Great Expectations or pandera | ❓ |
| Experiment manager | MLflow, Weights & Biases or Neptune | ❓ |
| Model manager | MLflow | ❓ |
| Pipeline manager | Kedro, ZenML | ❓ |
| Train/test validation | Deepchecks | ❓ |
| Data integrity validation | Deepchecks | ❓ |
| Model validation | Deepchecks | ❓ |
References
- Build a Reproducible and Maintainable Data Science Project - Online Book
- Hypermodern Python Tooling - Book
- Any Machine Learning Project is a Software Project First
- Hypermodernizing python legacy code
- https://kedro.org/
- I move from pipenv to poetry in 2023 - Am I right ?
- Hypermodern Python Cookiecutter
- Pipx: Safely Install Packages Globally
Cookiecutter Templates
References for cookiecutter project templates for data science and software projects:
Cookiecutter Data Science
- https://drivendata.github.io/cookiecutter-data-science/
- https://github.com/crmne/cookiecutter-modern-datascience
- https://github.com/khuyentran1401/data-science-template
- https://github.com/TeoZosa/cookiecutter-cruft-poetry-tox-pre-commit-ci-cd
- https://khuyentran1401.github.io/reproducible-data-science/structure_project/introduction.html
- https://github.com/aws-samples/python-data-science-template
Cookiecutter with UV
Cookiecutter with Poetry
- https://cjolowicz.github.io/posts/hypermodern-python-01-setup/
- Hypermodern Python Cookiecutter
- https://github.com/fpgmaas/cookiecutter-poetry
- https://github.com/PythonBiellaGroup/Bear
