Base Structure for Data Science Projects

Last updated: 7/Jan/2024

Data science projects are often characterized by their complexity and exploratory nature, which can make code and data difficult to organize and maintain. A well-defined project structure helps data scientists improve the quality of their work, adopt industry software development standards, collaborate more easily, and build solutions that scale.

These projects typically start as Jupyter notebooks that, after a process of iteration and refinement, are converted into Python scripts. However, Jupyter notebooks are an exploration tool and are not well suited to software development with good engineering practices. It is therefore important for data scientists to have a project structure that lets them iterate quickly during experimentation and then build complete pipelines with Python scripts.

To avoid decision fatigue and having to create a data science project structure from scratch, it is recommended to use a project template that contains the software tools and folder structure needed to develop data science projects efficiently, reproducibly, and with good software development practices.

Objective definition

A data science project aims to solve a problem based on data analysis (Descriptive Analysis), understanding relationships in the data (Inferential Analysis) and creating a prediction model (which could be a Machine Learning model!).

This prediction model at a production level is a software product and as such should be managed as a software project.

How to use this template (Project Template)

If you want to regularly update your project with changes from the original template, it is recommended to use Cruft. These changes can be library version updates, new folders, or new tools added to the project.

Install Cruft with UV:

```shell
uv tool install cruft
```

Create a project based on the template:

```shell
cruft create https://github.com/JoseRZapata/data-science-project-template
```

When you need to update the project with changes from the template, run the following in the root of the project, where the .cruft.json file is located:

```shell
cruft update
```
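For reference, Cruft tracks the template a project came from in the .cruft.json file it generates; a sketch of its shape (field values here are illustrative, not taken from the actual template):

```json
{
  "template": "https://github.com/JoseRZapata/data-science-project-template",
  "commit": "<hash of the template commit the project was generated from>",
  "checkout": null,
  "context": {
    "cookiecutter": {
      "project_name": "my-ds-project"
    }
  }
}
```

`cruft update` compares the recorded commit with the template's latest commit and applies the difference to your project.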

Cookiecutter

Install Cookiecutter with UV:

```shell
uv tool install cookiecutter
```

Create a project based on the template:

```shell
cookiecutter gh:JoseRZapata/data-science-project-template
```

Reference projects

Comparative table of reference project templates:

| Link | Pros | Cons | Last updated |
| --- | --- | --- | --- |
| Data Science Template | Well-focused elements for data science projects with software development elements | Missing GitHub Actions to automate CI/CD tasks and some folders | 12/Jul/2023 |
| cookiecutter-modern-datascience | Elements focused on data science projects with software development elements | No package manager with Poetry to avoid dependency issues, nor automation of some processes | 10/Aug/2022 |
| cookiecutter-hypermodern-python | Modern elements for developing Python projects | Needs to expand the folder structure for data science projects | 3/Jun/2022 |
| cookiecutter-cruft-poetry-tox-pre-commit-ci-cd | Modern tool configuration for developing Python projects | Not specific to data science projects; needs to expand the folder structure | 22/Aug/2023 |
| cookiecutter-data-science | Widely used template with elements focused on data science projects | No Python software development configuration nor package manager with Poetry; last update was several years ago | 20/Mar/2021 |

Features of the base project template for data science

Based on my experience and other references, some of the features that a base project for data science should have are:

  • Dependency management with UV: UV is a modern dependency manager for Python that simplifies installing and managing dependencies, and also manages virtual environments and Python installations. Previously I used Poetry, but UV is faster and replaces both Poetry and pyenv.
  • Pre-commit hooks with pre-commit: pre-commit is a tool that allows developers to automatically run scripts before making a commit. This can help ensure that the code meets the project’s quality standards.
  • Code quality:
    • Ruff: a very fast linter and formatter that finds and fixes Python errors; it also replaces Black, formatting code and helping ensure a consistent style.
    • Mypy is a static type checker for Python that helps find errors before the code runs.
  • Unit testing and test coverage with PyTest and Codecov:
    • PyTest is a unit testing library for Python.
    • Codecov is a tool that provides test coverage reports on the code.
  • CI/CD with GitHub Actions: GitHub Actions is a workflow automation platform that allows developers to automate tasks such as building, testing, and deploying code.
  • Documentation: Generate static project documentation.
  • Project template: libraries to create projects from templates with predefined folders and configuration files.
    • Cruft allows creating a project from a cookiecutter and also updating the project with changes from the original template.
    • Cookiecutter allows creating projects from templates without updates.
  • Configuration scripts: Scripts to configure the project and development environment when starting the project and for repetitive commands, for example Makefile.
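To make the pre-commit item above concrete, a minimal .pre-commit-config.yaml for such a project might look like this (the hook revisions are illustrative placeholders; pin them to current releases):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0          # illustrative pin
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4          # illustrative pin
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # format
```

Run `pre-commit install` once in the repository so the hooks execute on every commit.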
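A sketch of a GitHub Actions workflow for the CI tasks above (action versions, Python version, and commands are illustrative; adapt them to the project):

```yaml
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Illustrative install step; the real project may use a setup action for UV
      - run: pip install uv && uv sync
      - run: uv run pytest --cov
```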
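The configuration-scripts idea can be sketched as a Makefile with targets for the repetitive commands (target names and commands are illustrative; recipe lines must be indented with tabs):

```makefile
.PHONY: install lint format test

install:        ## install dependencies and pre-commit hooks
	uv sync
	uv run pre-commit install

lint:           ## run static checks
	uv run ruff check src tests
	uv run mypy src

format:         ## auto-format the code
	uv run ruff format src tests

test:           ## run the test suite with coverage
	uv run pytest --cov
```

With this in place, `make install` bootstraps a fresh clone and `make test` is the same command locally and in CI.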
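As an illustration of the kind of error Mypy catches before the code ever runs (a hypothetical function, not part of the template):

```python
def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


print(mean([1.0, 2.0, 3.0]))  # OK: 2.0

# A call like mean("abc") would only fail at runtime with a TypeError,
# but `mypy` flags it statically: argument of type "str" is not
# compatible with "list[float]".
```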
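A minimal PyTest example, with a hypothetical metric function and the tests that exercise it (PyTest auto-discovers functions named `test_*`):

```python
# src/metrics.py -- hypothetical module
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# tests/test_metrics.py -- run with `pytest`
def test_accuracy_perfect():
    assert accuracy([1, 0, 1], [1, 0, 1]) == 1.0


def test_accuracy_half():
    assert accuracy([1, 0, 1, 0], [1, 0, 0, 1]) == 0.5
```

Running `pytest --cov` (via the pytest-cov plugin) produces the coverage data that Codecov turns into reports.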

What the base data science project template does NOT include:

  • pypi: PyPI will not be used to publish packages, since the project's deliverables are machine learning models and data transformations, not installable libraries.
  • nox, tox: Tests across different Python versions will not be needed; development will use only one Python version.

Initial base

To achieve this I will create a base project template for data science with the following elements:

| Feature | Library | Status |
| --- | --- | --- |
| Programming language | Python version 3.9 or higher | |
| Library manager | UV, Poetry, or pipenv | |
| Virtual environment manager | UV, Poetry, or pipenv | |
| Testing | Pytest | |
| Hook manager | pre-commit | |
| Auto-update pre-commit | GitHub Actions | |
| Linter | Ruff | |
| Code formatting | Ruff and Prettier | |
| Static type checking | Mypy | |
| Import ordering | Ruff | |
| Code coverage | coverage.py | |
| Test coverage report | Codecov | |
| Dependency updates | Dependabot | |
| Security and auditing | Bandit and Safety | |
| Scaffolding manager | Cruft, Cookiecutter, or Copier | |
| Project configuration | OmegaConf or Hydra | |
| Python syntax upgrade | Ruff | |
| Documentation | MkDocs, Sphinx, or pdoc | |
| Base libraries | Pandas, NumPy, scikit-learn, Jupyter | |
| Configuration scripts | Makefile or just | |

Currently Ruff replaces flake8, black, pylint, pyupgrade, and isort; it lints and formats code and helps ensure a consistent style. Ruff is very fast because it is written in Rust.
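Since one tool covers linting, formatting, import sorting, and syntax upgrades, its configuration collapses into a few lines of pyproject.toml (an illustrative sketch, not the template's actual settings):

```toml
[tool.ruff]
line-length = 88
target-version = "py39"

[tool.ruff.lint]
# E/F ~ flake8, I ~ isort, UP ~ pyupgrade; `ruff format` replaces black
select = ["E", "F", "I", "UP"]
```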

Additional MLOps elements

Additionally, to achieve a base data science project structure that enables MLOps implementation, the following elements will be included:

Libraries for a Machine Learning project

| Feature | Library | Status |
| --- | --- | --- |
| Data manager | DVC | |
| Data validation | Great Expectations or Pandera | |
| Experiment manager | MLflow, Weights & Biases, or Neptune | |
| Model manager | MLflow | |
| Pipeline manager | Kedro or ZenML | |
| Train/test validation | Deepchecks | |
| Data integrity validation | Deepchecks | |
| Model validation | Deepchecks | |
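Libraries aside, the data-validation idea can be sketched in plain Python: declare a schema of expected columns and checks, then fail fast when incoming data violates it. Tools like Pandera and Great Expectations formalize exactly this pattern (the schema and data below are hypothetical):

```python
# Hand-rolled sketch of schema validation; Pandera / Great Expectations
# provide the production-grade version of this idea.
SCHEMA = {
    "age":    lambda v: isinstance(v, int) and 0 <= v <= 120,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def validate(rows: list[dict]) -> list[str]:
    """Return human-readable violations (an empty list means the data is valid)."""
    errors = []
    for i, row in enumerate(rows):
        for column, check in SCHEMA.items():
            if column not in row:
                errors.append(f"row {i}: missing column '{column}'")
            elif not check(row[column]):
                errors.append(f"row {i}: invalid value {row[column]!r} in '{column}'")
    return errors


good = [{"age": 34, "income": 52_000.0}]
bad = [{"age": -5, "income": 52_000.0}, {"age": 40}]
print(validate(good))  # []
print(validate(bad))   # two violations: invalid age, missing income
```

Running this check at the start of a pipeline turns silent data drift into an explicit, logged failure.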

References

Cookiecutter Templates

References for cookiecutter project templates for data science and software projects:

Cookiecutter Data Science

Cookiecutter with UV

Cookiecutter with Poetry

Environments
