
Data science project template



A modern template for data science projects with all the necessary tools for experimentation, development, testing, and deployment, from notebooks to production.

✨📚✨ Documentation: https://joserzapata.github.io/data-science-project-template/

Source Code: https://github.com/JoseRZapata/data-science-project-template


๐Ÿ“ Creating a New Project

๐Ÿ‘ Recommendations

It is highly recommended to use a python version manager like Pyenv and this project is set to use Poetry >= 1.8 to manage the dependencies and the environment.

Note: Poetry >= 1.8 should always be installed in a dedicated virtual environment to isolate it from the rest of your system. why?

๐ŸŒŸ Check how to setup your environment: https://joserzapata.github.io/data-science-project-template/local_setup/

Install Cruft:
pip install --user cruft # Install `cruft` on your path for easy access
Create the project:
cruft create https://github.com/JoseRZapata/data-science-project-template

๐Ÿช Via Cookiecutter

install cookiecutter
pip install --user cookiecutter # Install `cookiecutter` on your path for easy access
create project
cookiecutter gh:JoseRZapata/data-science-project-template

Note: Cookiecutter uses gh: as short-hand for https://github.com/
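
To make the shorthand concrete, here is a minimal sketch of how a `gh:` prefix could be expanded into a full URL. The names `ABBREVIATIONS` and `expand_template` are hypothetical and only illustrate the idea; this is not Cookiecutter's own implementation.

```python
# Illustrative only: a tiny expander for the `gh:` shorthand described above.
# ABBREVIATIONS and expand_template are hypothetical names, not Cookiecutter's API.
ABBREVIATIONS = {"gh:": "https://github.com/"}


def expand_template(template: str) -> str:
    """Return the full URL for a template reference that uses a known prefix."""
    for prefix, base_url in ABBREVIATIONS.items():
        if template.startswith(prefix):
            return base_url + template[len(prefix):]
    return template  # already a full URL or a local path, leave unchanged


print(expand_template("gh:JoseRZapata/data-science-project-template"))
# https://github.com/JoseRZapata/data-science-project-template
```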

🔗 Linking an Existing Project

If the project was originally installed via Cookiecutter, you must first use Cruft to link the project with the original template:

cruft link https://github.com/JoseRZapata/data-science-project-template

Then, or if the project was created with Cruft in the first place, pull in the latest template changes:

cruft update
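
The link-then-update decision can be sketched in a few lines of Python. This sketch assumes a project counts as already linked when it contains a `.cruft.json` file, which is where Cruft records the template it was generated from; `sync_commands` and `TEMPLATE_URL` are hypothetical names used only for illustration.

```python
from pathlib import Path

TEMPLATE_URL = "https://github.com/JoseRZapata/data-science-project-template"


def sync_commands(project_dir: Path) -> list[list[str]]:
    """Return the Cruft commands needed to sync a project with the template.

    Assumption: a project is already linked when `.cruft.json` exists in it.
    """
    commands: list[list[str]] = []
    if not (project_dir / ".cruft.json").exists():
        # Not linked yet (for example, created with Cookiecutter): link first.
        commands.append(["cruft", "link", TEMPLATE_URL])
    commands.append(["cruft", "update"])
    return commands


for command in sync_commands(Path.cwd()):
    print(" ".join(command))
```

Each returned command could be passed to `subprocess.run(command, check=True)` to execute it; the sketch only prints them so nothing in the project is modified.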

๐Ÿ—ƒ๏ธ Project structure

Folder structure for data science projects why?

Data structure

.
├── .code_quality
│   ├── mypy.ini                        # mypy configuration
│   └── ruff.toml                       # ruff configuration
├── .github                             # github configuration
│   ├── actions
│   │   └── python-poetry-env
│   │       └── action.yml              # github action to setup python environment
│   ├── dependabot.md                   # github action to update dependencies
│   ├── pull_request_template.md        # template for pull requests
│   └── workflows                       # github actions workflows
│       ├── ci.yml                      # run continuous integration (tests, pre-commit, etc.)
│       ├── dependency_review.yml       # review dependencies
│       ├── docs.yml                    # build documentation (mkdocs)
│       └── pre-commit_autoupdate.yml   # update pre-commit hooks
├── .vscode                             # vscode configuration
│   ├── extensions.json                 # list of recommended extensions
│   ├── launch.json                     # vscode launch configuration
│   └── settings.json                   # vscode settings
├── conf                                # folder configuration files
│   └── config.yaml                     # main configuration file
├── data
│   ├── 01_raw                          # raw immutable data
│   ├── 02_intermediate                 # typed data
│   ├── 03_primary                      # domain model data
│   ├── 04_feature                      # model features
│   ├── 05_model_input                  # often called 'master tables'
│   ├── 06_models                       # serialized models
│   ├── 07_model_output                 # data generated by model runs
│   ├── 08_reporting                    # reports, results, etc
│   └── README.md                       # description of the data structure
├── docs                                # documentation for your project
│   └── index.md                        # documentation homepage
├── models                              # store final models
├── notebooks
│   ├── 1-data                          # data extraction and cleaning
│   ├── 2-exploration                   # exploratory data analysis (EDA)
│   ├── 3-analysis                      # Statistical analysis, hypothesis testing.
│   ├── 4-feat_eng                      # feature engineering (creation, selection, and transformation.)
│   ├── 5-models                        # model training, experimentation, and hyperparameter tuning.
│   ├── 6-evaluation                    # evaluation metrics, performance assessment
│   ├── 7-deploy                        # model packaging, deployment strategies.
│   ├── 8-reports                       # story telling, summaries and analysis conclusions.
│   ├── notebook_template.ipynb         # template for notebooks
│   └── README.md                       # information about the notebooks
├── src                                 # source code for use in this project
│   ├── libs                            # custom python scripts
│   │   ├── data_etl                    # data extraction, transformation, and loading
│   │   ├── data_validation             # data validation
│   │   ├── feat_cleaning               # feature engineering data cleaning
│   │   ├── feat_encoding               # feature engineering encoding
│   │   ├── feat_imputation             # feature engineering imputation
│   │   ├── feat_new_features           # feature engineering new features
│   │   ├── feat_pipelines              # feature engineering pipelines
│   │   ├── feat_preprocess_strings     # feature engineering pre process strings
│   │   ├── feat_scaling                # feature engineering scaling data
│   │   ├── feat_selection              # feature engineering feature selection
│   │   ├── feat_strings                # feature engineering strings
│   │   ├── metrics                     # evaluation metrics
│   │   ├── model                       # model training and prediction
│   │   ├── model_evaluation            # model evaluation
│   │   ├── model_selection             # model selection
│   │   ├── model_validation            # model validation
│   │   └── reports                     # reports
│   └── pipelines
│       ├── data_etl                    # data extraction, transformation, and loading
│       ├── feature_engineering         # prepare data for modeling
│       ├── model_evaluation            # evaluate model performance
│       ├── model_prediction            # model predictions
│       └── model_train                 # train models
├── tests                               # test code for your project
│   └── test_mock.py                    # example test file
├── .editorconfig                       # editor configuration
├── .gitignore                          # files to ignore in git
├── .pre-commit-config.yaml             # configuration for pre-commit hooks
├── codecov.yml                         # configuration for codecov
├── Makefile                            # useful commands to setup environment, run tests, etc.
├── mkdocs.yml                          # configuration for mkdocs documentation
├── poetry.toml                         # poetry virtual environment configuration
├── pyproject.toml                      # dependencies for poetry
└── README.md                           # description of your project
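
As an illustration of how the numbered `data` layers in the tree above might be used from code, the following sketch checks for and creates the layer folders with the standard library. The helper names `missing_layers` and `create_layers` are hypothetical; the template itself already ships these folders, so this is only a convenience sketch for recreating them elsewhere.

```python
from pathlib import Path

# Layer names copied from the `data` folder in the tree above.
DATA_LAYERS = [
    "01_raw",
    "02_intermediate",
    "03_primary",
    "04_feature",
    "05_model_input",
    "06_models",
    "07_model_output",
    "08_reporting",
]


def missing_layers(data_dir: Path) -> list[str]:
    """Return the layer folders that do not exist under `data_dir`."""
    return [name for name in DATA_LAYERS if not (data_dir / name).is_dir()]


def create_layers(data_dir: Path) -> None:
    """Create any missing layer folders; existing ones are left untouched."""
    for name in missing_layers(data_dir):
        (data_dir / name).mkdir(parents=True)


print(missing_layers(Path("data")))  # lists the layers absent from ./data
```

Keeping the layer names in one list means notebooks and pipelines can refer to `DATA_LAYERS` rather than hard-coding folder names in several places.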

✨ Features and Tools

🚀 Project Standardization and Automation

🔨 Developer Workflow Automation

  • Python packaging, dependency management and environment management with Poetry
  • Project workflow orchestration with Make as an interface shim
    • Self-documenting Makefile; type make on the command line to display auto-generated documentation on the available targets
  • Automated Cookiecutter template synchronization with Cruft
  • Code quality tooling automation and management with pre-commit
  • Continuous integration and deployment with GitHub Actions
  • Project configuration files with Hydra

🌱 Conditionally Rendered Python Package or Project Boilerplate

🔧 Maintainability

🏷️ Type Checking and Data Validation

  • Static type-checking with Mypy
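
As a small illustration of what static type checking buys you, the sketch below annotates a helper that a checker such as Mypy can verify without running the code. The function `split_sizes` is a hypothetical example, not part of the template.

```python
def split_sizes(n_rows: int, test_fraction: float) -> tuple[int, int]:
    """Return (train_size, test_size) for a dataset with `n_rows` rows."""
    test_size = round(n_rows * test_fraction)
    return n_rows - test_size, test_size


train_size, test_size = split_sizes(1000, 0.2)
print(train_size, test_size)  # 800 200

# A static checker such as Mypy would be expected to flag the call below,
# because a str is passed where the annotation asks for a float:
# split_sizes(1000, "0.2")
```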

✅ 🧪 Testing/Coverage

🚨 Linting

Code quality
🎨 Code formatting

👷 CI/CD

Automatic Dependency updates
Dependency Review in PR
  • Dependency Review with dependency-review-action. This action scans your pull requests for dependency changes and raises an error if any vulnerabilities or invalid licenses are being introduced.
Pre-commit automatic updates
  • Automatic updates with GitHub Actions workflow .github/workflows/pre-commit_autoupdate.yml

🔒 Security

Static Application Security Testing (SAST)

⌨️ Accessibility

🔨 Automation tool (Makefile)

The Makefile automates setting up your environment, installing dependencies, running tests, and similar tasks. Type make in a terminal to see the available commands:

Target                Description
-------------------   ----------------------------------------------------
check                 Run code quality tools with pre-commit hooks.
docs_test             Test if documentation can be built without warnings or errors
docs_view             Build and serve the documentation
init_env              Install dependencies with poetry and activate env
init_git              Initialize git repository
install_data_libs     Install pandas, scikit-learn, Jupyter, seaborn
install_mlops_libs    Install dvc, mlflow
pre-commit_update     Update pre-commit hooks
test                  Test the code with pytest and coverage
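
A target listing like the one above is typically generated from comments in the Makefile itself. The sketch below shows one way this could work, assuming the common convention of annotating each target with a trailing `## description`; the sample Makefile text, its recipe commands, and the `list_targets` helper are hypothetical and may not match the template's actual Makefile.

```python
import re

# Hypothetical Makefile excerpt; the `## description` convention is an assumption.
SAMPLE_MAKEFILE = """\
check: ## Run code quality tools with pre-commit hooks.
\tpoetry run pre-commit run -a
test: ## Test the code with pytest and coverage
\tpoetry run pytest
"""

# Match rule lines of the form `target: [prerequisites] ## description`.
TARGET_RE = re.compile(r"^([A-Za-z0-9_-]+):.*?##\s*(.+)$", re.MULTILINE)


def list_targets(makefile_text: str) -> list[tuple[str, str]]:
    """Extract (target, description) pairs from annotated rule lines, sorted by name."""
    return sorted(TARGET_RE.findall(makefile_text))


for target, description in list_targets(SAMPLE_MAKEFILE):
    print(f"{target:<22}{description}")
```

Recipe lines start with a tab, so they never match the pattern; only annotated rule lines end up in the listing.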

๐Ÿ“ Project Documentation

๐Ÿ—ƒ๏ธ Templates


References