Data science project template
A modern template for data science projects with all the necessary tools for experimentation, development, testing, and deployment. From notebooks to production.
Documentation: https://joserzapata.github.io/data-science-project-template/
Source Code: https://github.com/JoseRZapata/data-science-project-template
Creating a New Project
Recommendations
It is highly recommended to use a Python version manager such as Pyenv. This project uses Poetry >= 1.8 to manage dependencies and the virtual environment.
Note: Poetry >= 1.8 should always be installed in a dedicated virtual environment to isolate it from the rest of your system. I recommend using UV to install Poetry in an isolated environment.
Check how to set up your environment: <https://joserzapata.github.io/data-science-project-template/local_setup/>
### Via [Cruft] - **recommended**
```bash title="install cruft"
pip install --user cruft # Install `cruft` on your path for easy access
# Or install with UV
uv tool install cruft # Install cruft in an isolated environment
```
### Via Cookiecutter
```bash title="install cookiecutter"
pip install --user cookiecutter # Install `cookiecutter` on your path for easy access
# Or install with UV
uv tool install cookiecutter # Install cookiecutter in an isolated environment
```
Note: Cookiecutter uses `gh:` as short-hand for `https://github.com/`
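Linking an Existing Project
If the project was originally created via Cookiecutter, you must first use Cruft to link the project with the original template (the `cruft link` command).
Then, or if the project was created with Cruft in the first place, bring it up to date with the template (the `cruft update` command).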
Project structure
Folder structure for data science projects:
.
├── .code_quality
│   ├── mypy.ini                        # mypy configuration
│   └── ruff.toml                       # ruff configuration
├── .github                             # github configuration
│   ├── actions
│   │   └── python-poetry-env
│   │       └── action.yml              # github action to setup python environment
│   ├── dependabot.md                   # github action to update dependencies
│   ├── pull_request_template.md        # template for pull requests
│   └── workflows                       # github actions workflows
│       ├── ci.yml                      # run continuous integration (tests, pre-commit, etc.)
│       ├── dependency_review.yml       # review dependencies
│       ├── docs.yml                    # build documentation (mkdocs)
│       └── pre-commit_autoupdate.yml   # update pre-commit hooks
├── .vscode                             # vscode configuration
│   ├── extensions.json                 # list of recommended extensions
│   ├── launch.json                     # vscode launch configuration
│   └── settings.json                   # vscode settings
├── conf                                # folder configuration files
│   └── config.yaml                     # main configuration file
├── data
│   ├── 01_raw                          # raw immutable data
│   ├── 02_intermediate                 # typed data
│   ├── 03_primary                      # domain model data
│   ├── 04_feature                      # model features
│   ├── 05_model_input                  # often called 'master tables'
│   ├── 06_models                       # serialized models
│   ├── 07_model_output                 # data generated by model runs
│   ├── 08_reporting                    # reports, results, etc
│   └── README.md                       # description of the data structure
├── docs                                # documentation for your project
│   └── index.md                        # documentation homepage
├── models                              # store final models
├── notebooks
│   ├── 1-data                          # data extraction and cleaning
│   ├── 2-exploration                   # exploratory data analysis (EDA)
│   ├── 3-analysis                      # Statistical analysis, hypothesis testing.
│   ├── 4-feat_eng                      # feature engineering (creation, selection, and transformation.)
│   ├── 5-models                        # model training, experimentation, and hyperparameter tuning.
│   ├── 6-evaluation                    # evaluation metrics, performance assessment
│   ├── 7-deploy                        # model packaging, deployment strategies.
│   ├── 8-reports                       # story telling, summaries and analysis conclusions.
│   ├── notebook_template.ipynb         # template for notebooks
│   └── README.md                       # information about the notebooks
├── src                                 # source code for use in this project
│   ├── libs                            # custom python scripts
│   │   ├── data_etl                    # data extraction, transformation, and loading
│   │   ├── data_validation             # data validation
│   │   ├── feat_cleaning               # feature engineering data cleaning
│   │   ├── feat_encoding               # feature engineering encoding
│   │   ├── feat_imputation             # feature engineering imputation
│   │   ├── feat_new_features           # feature engineering new features
│   │   ├── feat_pipelines              # feature engineering pipelines
│   │   ├── feat_preprocess_strings     # feature engineering pre process strings
│   │   ├── feat_scaling                # feature engineering scaling data
│   │   ├── feat_selection              # feature engineering feature selection
│   │   ├── feat_strings                # feature engineering strings
│   │   ├── metrics                     # evaluation metrics
│   │   ├── model                       # model training and prediction
│   │   ├── model_evaluation            # model evaluation
│   │   ├── model_selection             # model selection
│   │   ├── model_validation            # model validation
│   │   └── reports                     # reports
│   └── pipelines
│       ├── data_etl                    # data extraction, transformation, and loading
│       ├── feature_engineering         # prepare data for modeling
│       ├── model_evaluation            # evaluate model performance
│       ├── model_prediction            # model predictions
│       └── model_train                 # train models
├── tests                               # test code for your project
│   └── test_mock.py                    # example test file
├── .editorconfig                       # editor configuration
├── .gitignore                          # files to ignore in git
├── .pre-commit-config.yaml             # configuration for pre-commit hooks
├── codecov.yml                         # configuration for codecov
├── Makefile                            # useful commands to setup environment, run tests, etc.
├── mkdocs.yml                          # configuration for mkdocs documentation
├── poetry.toml                         # poetry virtual environment configuration
├── pyproject.toml                      # dependencies for poetry
└── README.md                           # description of your project
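The numbered data layers listed in the tree can also be created programmatically, for example when bootstrapping a fresh clone where empty folders are not tracked by git. The following is a minimal sketch using only the standard library; the helper name `create_data_layers` is hypothetical and not part of the template.

```python
import tempfile
from pathlib import Path

# Layer names copied from the project tree above.
DATA_LAYERS = [
    "01_raw",
    "02_intermediate",
    "03_primary",
    "04_feature",
    "05_model_input",
    "06_models",
    "07_model_output",
    "08_reporting",
]


def create_data_layers(project_root: Path) -> list[Path]:
    """Create data/<layer> folders under project_root and return their paths."""
    created = []
    for layer in DATA_LAYERS:
        layer_dir = project_root / "data" / layer
        layer_dir.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
        created.append(layer_dir)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory so no real project is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_data_layers(Path(tmp)):
            print(path.relative_to(tmp))
```

The numeric prefixes keep the folders sorted in the order data flows through the project, from raw inputs to reporting outputs.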
Features and Tools
Project Standardization and Automation
Developer Workflow Automation
- Python packaging, dependency management, and environment management with Poetry
- Project workflow orchestration with Make as an interface shim
  - Self-documenting Makefile; just type `make` on the command line to display auto-generated documentation on available targets
- Automated Cookiecutter template synchronization with Cruft
- Code quality tooling automation and management with pre-commit
- Continuous integration and deployment with GitHub Actions
- Project configuration files with Hydra
Conditionally Rendered Python Package or Project Boilerplate
- Optional: Jupyter support
Maintainability
Type Checking and Data Validation
- Static type-checking with Mypy
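As a rough illustration of what static type-checking provides, the sketch below annotates a small helper. The function `train_test_sizes` and its parameters are hypothetical and not part of the template. With annotations like these, Mypy can report a call such as `train_test_sizes("100", 0.2)` as an incompatible argument type before the code ever runs.

```python
def train_test_sizes(n_rows: int, test_fraction: float) -> tuple[int, int]:
    """Split a row count into (train_size, test_size)."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = round(n_rows * test_fraction)
    return n_rows - n_test, n_test


print(train_test_sizes(100, 0.2))  # (80, 20)
```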
Testing/Coverage
- Testing with Pytest
- Code coverage with Coverage.py
- Coverage reporting with Codecov
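A minimal sketch of what a Pytest test file could look like. The function `clip_outliers` and the test names are hypothetical and serve only as illustration; with its default settings, Pytest collects functions whose names start with `test_` from files named `test_*.py`.

```python
def clip_outliers(values: list[float], lower: float, upper: float) -> list[float]:
    """Clamp every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def test_clip_outliers_limits_extremes() -> None:
    # Values outside the interval are pulled back to its bounds.
    assert clip_outliers([-5.0, 0.5, 9.0], lower=0.0, upper=1.0) == [0.0, 0.5, 1.0]


def test_clip_outliers_keeps_empty_input() -> None:
    # An empty list stays empty.
    assert clip_outliers([], lower=0.0, upper=1.0) == []
```

Assuming the file is saved under `tests/`, running `pytest` from the project root would pick up both tests.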
Linting
Code quality
- Ruff: an extremely fast Python linter and code formatter, written in Rust (10-100x faster than existing linters and formatters).
- ShellCheck
- Unsanitary commits:
  - Secrets with `detect-secrets`
  - Large files with `check-added-large-files`
  - Files that contain merge conflict strings with `check-merge-conflict`
Code formatting
- Ruff: an extremely fast Python linter and code formatter, written in Rust (10-100x faster than existing linters and formatters).
- General file formatting
CI/CD
Automatic Dependency updates
- Dependency updates with Dependabot, and automated Dependabot PR merging with the Dependabot Auto Merge GitHub Action
- This is a replacement for pip-audit. In your local environment, if you want to check for vulnerabilities in your dependencies, you can use [pip-audit].
Dependency Review in PR
- Dependency review with dependency-review-action. This action scans your pull requests for dependency changes and raises an error if any vulnerabilities or invalid licenses are being introduced.
Pre-commit automatic updates
- Automatic updates with the GitHub Actions workflow `.github/workflows/pre-commit_autoupdate.yml`
Security
Static Application Security Testing (SAST)
Accessibility
Automation tool (Makefile)
The Makefile automates environment setup, dependency installation, test execution, and similar tasks.
In a terminal, type `make` to see the available commands:
Target               Description
-------------------  ----------------------------------------------------
check                Run code quality tools with pre-commit hooks.
docs_test            Test if documentation can be built without warnings or errors
docs_view            Build and serve the documentation
init_env             Install dependencies with poetry and activate env
init_git             Initialize git repository
install_data_libs    Install pandas, scikit-learn, Jupyter, seaborn
install_mlops_libs   Install dvc, mlflow
pre-commit_update    Update pre-commit hooks
test                 Test the code with pytest and coverage
Project Documentation
- Documentation building with MkDocs
- Powered by mkdocs-material
- Rich automatic documentation from type annotations and docstrings (NumPy, Google, etc.) with mkdocstrings
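To illustrate the kind of source that mkdocstrings can turn into reference pages, here is a sketch of a function with type annotations and a Google-style docstring. The function `min_max_scale` is a hypothetical example, not code from the template.

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly to the range [0, 1].

    Args:
        values: Numbers to rescale. Must not be empty.

    Returns:
        A new list where the smallest input maps to 0.0 and the largest
        to 1.0. If all inputs are equal, every output is 0.0.

    Raises:
        ValueError: If ``values`` is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if high == low:
        # Avoid dividing by zero when the input has no spread.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```

The annotations supply the parameter and return types, while the `Args`, `Returns`, and `Raises` sections supply the descriptions, so the signature does not have to be documented twice.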
Templates
References
- https://drivendata.github.io/cookiecutter-data-science/
- https://github.com/crmne/cookiecutter-modern-datascience
- https://github.com/fpgmaas/cookiecutter-poetry
- https://github.com/khuyentran1401/data-science-template
- https://github.com/woltapp/wolt-python-package-cookiecutter
- https://khuyentran1401.github.io/reproducible-data-science/structure_project/introduction.html
- https://github.com/TeoZosa/cookiecutter-cruft-poetry-tox-pre-commit-ci-cd
- https://github.com/cjolowicz/cookiecutter-hypermodern-python
- https://github.com/gotofritz/cookiecutter-gotofritz-poetry
- https://github.com/kedro-org/kedro-starters