Data science project template
A modern template for data science projects with all the necessary tools for experimentation, development, testing, and deployment, from notebooks to production.
Documentation: https://joserzapata.github.io/data-science-project-template/
Source Code: https://github.com/JoseRZapata/data-science-project-template
Features
- Dependency management with UV
- Virtual environment management with UV
- Linting with pre-commit and Ruff
- Continuous integration with GitHub Actions
- Documentation with mkdocs and mkdocstrings using the mkdocs-material theme
- Automated dependency updates with Dependabot
- Code formatting with Ruff
- Import sorting with Ruff using the isort rules
- Testing with pytest
- Code coverage with Coverage.py
- Coverage reporting with Codecov
- Static type-checking with mypy
- Security audit with Ruff using the bandit rules
- Manage project labels with GitHub Labeler
Creating a New Project
Recommendations
It is highly recommended to use managers for Python versions, dependencies, and virtual environments.
This project uses UV, an extremely fast tool to replace pip, pip-tools, pipx, Poetry, pyenv, twine, virtualenv, and more.
Check how to set up your environment: https://joserzapata.github.io/data-science-project-template/local_setup/
Via Cruft (recommended)
# Install cruft in an isolated environment using uv
uv tool install cruft
# Or install with pip
pip install --user cruft # Install `cruft` on your path for easy access
Then, inside the project folder, initialize git and the uv environment using Make (see the `init_git` and `init_env` targets in the Makefile section).
Via Cookiecutter
uv tool install cookiecutter # Install cookiecutter in an isolated environment
# Or install with pip
pip install --user cookiecutter # Install `cookiecutter` on your path for easy access
Note: Cookiecutter uses gh: as short-hand for https://github.com/
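Linking an Existing Project
If the project was originally installed via Cookiecutter, you must first use Cruft to link the project with the original template (the `cruft link` command).
Then, or if the project was already created with Cruft, update it from the template (the `cruft update` command).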
Project structure
Folder structure for data science projects:
.
├── .code_quality
│   ├── mypy.ini  # mypy configuration
│   └── ruff.toml  # ruff configuration
├── .github  # github configuration
│   ├── actions
│   │   └── python-poetry-env
│   │       └── action.yml  # github action to setup python environment
│   ├── dependabot.md  # github action to update dependencies
│   ├── pull_request_template.md  # template for pull requests
│   └── workflows  # github actions workflows
│       ├── ci.yml  # run continuous integration (tests, pre-commit, etc.)
│       ├── dependency_review.yml  # review dependencies
│       ├── docs.yml  # build documentation (mkdocs)
│       └── pre-commit_autoupdate.yml  # update pre-commit hooks
├── .vscode  # vscode configuration
│   ├── extensions.json  # list of recommended extensions
│   ├── launch.json  # vscode launch configuration
│   └── settings.json  # vscode settings
├── conf  # folder configuration files
│   └── config.yaml  # main configuration file
├── data
│   ├── 01_raw  # raw immutable data
│   ├── 02_intermediate  # typed data
│   ├── 03_primary  # domain model data
│   ├── 04_feature  # model features
│   ├── 05_model_input  # often called 'master tables'
│   ├── 06_models  # serialized models
│   ├── 07_model_output  # data generated by model runs
│   ├── 08_reporting  # reports, results, etc
│   └── README.md  # description of the data structure
├── docs  # documentation for your project
│   └── index.md  # documentation homepage
├── models  # store final models
├── notebooks
│   ├── 1-data  # data extraction and cleaning
│   ├── 2-exploration  # exploratory data analysis (EDA)
│   ├── 3-analysis  # statistical analysis, hypothesis testing
│   ├── 4-feat_eng  # feature engineering (creation, selection, and transformation)
│   ├── 5-models  # model training, evaluation and hyperparameter tuning
│   ├── 6-interpretation  # model interpretation
│   ├── 7-deploy  # model packaging, deployment strategies
│   ├── 8-reports  # story telling, summaries and analysis conclusions
│   ├── notebook_template.ipynb  # template for notebooks
│   └── README.md  # information about the notebooks
├── src  # source code for use in this project
│   ├── README.md  # description of src structure
│   ├── tmp_mock.py  # example python file
│   ├── data  # data extraction, validation, processing, transformation
│   ├── model  # model training, evaluation, validation, export
│   ├── inference  # model prediction, serving, monitoring
│   └── pipelines  # orchestration of pipelines
│       ├── feature_pipeline  # transforms raw data into features and labels
│       ├── training_pipeline  # transforms features and labels into a model
│       └── inference_pipeline  # takes features and a trained model for predictions
├── tests  # test code for your project
│   ├── test_mock.py  # example test file
│   ├── data  # tests for data module
│   ├── model  # tests for model module
│   ├── inference  # tests for inference module
│   └── pipelines  # tests for pipelines module
├── .editorconfig  # editor configuration
├── .gitignore  # files to ignore in git
├── .pre-commit-config.yaml  # configuration for pre-commit hooks
├── codecov.yml  # configuration for codecov
├── Makefile  # useful commands to setup environment, run tests, etc.
├── mkdocs.yml  # configuration for mkdocs documentation
├── pyproject.toml  # dependencies and configuration project file
├── uv.lock  # locked dependencies
└── README.md  # description of your project
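The numbered `data` layers in the tree suggest that each dataset moves from one folder to the next as it is processed. A minimal sketch of how code in `src` or a notebook might build paths into those layers is shown here. The helper `layer_path` and the file names `sales.csv` and `sales.parquet` are hypothetical and are not part of the template; only the layer names come from the tree.

```python
from pathlib import Path

# Layer names copied from the `data` folder in the project tree.
DATA_LAYERS = (
    "01_raw",
    "02_intermediate",
    "03_primary",
    "04_feature",
    "05_model_input",
    "06_models",
    "07_model_output",
    "08_reporting",
)


def layer_path(project_root: Path, layer: str, filename: str) -> Path:
    """Return the path of `filename` inside the given data layer.

    Hypothetical helper, not part of the template. Unknown layer names
    raise ValueError so that typos fail early.
    """
    if layer not in DATA_LAYERS:
        raise ValueError(f"Unknown data layer: {layer!r}")
    return project_root / "data" / layer / filename


# Example: a raw file and the typed version derived from it (made-up names).
root = Path(".")
raw_file = layer_path(root, "01_raw", "sales.csv")
typed_file = layer_path(root, "02_intermediate", "sales.parquet")
```

Validating the layer name keeps every read and write inside the agreed folder layout, which is the main point of the numbered structure.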
Features and Tools
Project Standardization and Automation
Developer Workflow Automation
- Python packaging, dependency management and environment management with UV (uv is a replacement for poetry)
- Project workflow orchestration with Make as an interface shim
  - Self-documenting Makefile; just type `make` on the command line to display auto-generated documentation on available targets
- Automated Cookiecutter template synchronization with Cruft
- Code quality tooling automation and management with pre-commit
- Continuous integration and deployment with GitHub Actions
- Project configuration files with Hydra
Conditionally Rendered Python Package or Project Boilerplate
- Optional: Jupyter support
Maintainability
Type Checking and Data Validation
- Static type-checking with Mypy
Testing/Coverage
- Testing with Pytest
- Code coverage with Coverage.py
- Coverage reporting with Codecov
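As an illustration of the testing setup listed above, here is a minimal sketch of a typed function and a pytest-style test for it. The function `clean_column_name` is hypothetical and does not come from the template's `src/tmp_mock.py` or `tests/test_mock.py`, whose contents are not shown here. In a real project the function would live under `src` and the test under `tests`.

```python
def clean_column_name(name: str) -> str:
    """Hypothetical helper: trim, lowercase, and join words with underscores."""
    return "_".join(name.strip().lower().split())


def test_clean_column_name() -> None:
    # pytest collects functions whose names start with `test_`.
    assert clean_column_name("  Total Sales ") == "total_sales"
    assert clean_column_name("price") == "price"
```

The type annotations give mypy something to check, and plain `assert` statements are all pytest needs to report a failure.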
Linting
Code quality
- Ruff: an extremely fast (10x-100x faster) Python linter and code formatter, written in Rust.
- ShellCheck
- Unsanitary commits:
  - Secrets with detect-secrets
  - Large files with check-added-large-files
  - Files that contain merge conflict strings with check-merge-conflict
Code formatting
- Ruff: an extremely fast (10x-100x faster) Python linter and code formatter, written in Rust.
- General file formatting
CI/CD
Automatic Dependency updates
- Dependency updates with Dependabot, and automated Dependabot PR merging with the Dependabot Auto Merge GitHub Action.
- This is a replacement for pip-audit. In your local environment, if you want to check for vulnerabilities in your dependencies, you can use pip-audit.
Dependency Review in PR
- Dependency Review with dependency-review-action. This action scans your pull requests for dependency changes and raises an error if any vulnerabilities or invalid licenses are being introduced.
Pre-commit automatic updates
- Automatic updates with the GitHub Actions workflow `.github/workflows/pre-commit_autoupdate.yml`
Security
Static Application Security Testing (SAST)
Accessibility
Automation tool (Makefile)
Makefile to automate the setup of your environment, the installation of dependencies, the execution of tests, etc.
In a terminal, type `make` to see the available commands:
Target Description
------------------- ----------------------------------------------------
check Run code quality tools with pre-commit hooks.
docs_test Test if documentation can be built without warnings or errors
docs_view Build and serve the documentation
init_env Install dependencies with uv and activate env
init_git Initialize git repository
install_data_libs Install pandas, scikit-learn, Jupyter, seaborn
pre-commit_update Update pre-commit hooks
test Test the code with pytest and coverage
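A listing like the one above is typically generated from comments in the Makefile itself. The sketch below shows one common way this can work, written in Python for illustration. It assumes each documented rule is written as `target: ## description`; that convention, the function names, and the sample fragment are assumptions, not the template's actual Makefile.

```python
import re

# Assumed convention: "target: ## description" on the rule line.
HELP_PATTERN = re.compile(r"^([A-Za-z0-9_.-]+):.*?##\s*(.+)$", re.MULTILINE)


def parse_make_help(makefile_text: str) -> list[tuple[str, str]]:
    """Return (target, description) pairs for every documented target."""
    return [
        (name, desc.strip())
        for name, desc in HELP_PATTERN.findall(makefile_text)
    ]


def format_make_help(targets: list[tuple[str, str]]) -> str:
    """Render the pairs as a two-column listing."""
    width = max((len(name) for name, _ in targets), default=0) + 2
    return "\n".join(f"{name:<{width}}{desc}" for name, desc in targets)


# Illustrative Makefile fragment; the recipes are placeholders.
SAMPLE = """\
.PHONY: check test
check: ## Run code quality tools with pre-commit hooks.
\t@echo "placeholder recipe"
test: ## Test the code with pytest and coverage
\t@echo "placeholder recipe"
"""

print(format_make_help(parse_make_help(SAMPLE)))
```

Keeping the description on the rule line means the help text cannot drift away from the target it documents.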
Project Documentation
- Documentation building with MkDocs
  - Powered by mkdocs-material
  - Rich automatic documentation from type annotations and docstrings (NumPy, Google, etc.) with mkdocstrings
Templates
Good practices
References
- https://drivendata.github.io/cookiecutter-data-science/
- https://github.com/crmne/cookiecutter-modern-datascience
- https://github.com/fpgmaas/cookiecutter-poetry
- https://github.com/khuyentran1401/data-science-template
- https://github.com/woltapp/wolt-python-package-cookiecutter
- https://khuyentran1401.github.io/reproducible-data-science/structure_project/introduction.html
- https://github.com/TeoZosa/cookiecutter-cruft-poetry-tox-pre-commit-ci-cd
- https://github.com/cjolowicz/cookiecutter-hypermodern-python
- https://github.com/gotofritz/cookiecutter-gotofritz-poetry
- https://github.com/kedro-org/kedro-starters