Demo data science template

This demo data science project was created using @JoseRZapata's data science project template, which has all the necessary tools for experimentation, development, testing, and deployment of data science projects, from notebooks to production.

[!WARNING] 🚧 Work in progress 🚧. This is a demo project, for educational purposes only.

🗃️ Project structure

.
├── codecov.yml                         # configuration for codecov
├── .code_quality
│   ├── bandit.yaml                     # bandit configuration
│   ├── mypy.ini                        # mypy configuration
│   └── ruff.toml                       # ruff configuration
├── data
│   ├── 01_raw                          # raw immutable data
│   ├── 02_intermediate                 # typed data
│   ├── 03_primary                      # domain model data
│   ├── 04_feature                      # model features
│   ├── 05_model_input                  # often called 'master tables'
│   ├── 06_models                       # serialized models
│   ├── 07_model_output                 # data generated by model runs
│   ├── 08_reporting                    # reports, results, etc.
│   └── README.md                       # description of the data structure
├── docs                                # documentation for your project
├── .editorconfig                       # editor configuration
├── .github                             # github configuration
│   ├── actions
│   │   └── python-poetry-env
│   │       └── action.yml              # github action to set up the python environment
│   ├── dependabot.md                   # dependabot configuration to update dependencies
│   ├── pull_request_template.md        # template for pull requests
│   └── workflows
│       ├── docs.yml                    # github action to build documentation (mkdocs)
│       ├── pre-commit_autoupdate.yml   # github action to update pre-commit hooks
│       └── test.yml                    # github action to run tests
├── .gitignore                          # files to ignore in git
├── Makefile                            # useful commands to set up the environment
├── models                              # store final models
├── notebooks
│   ├── 1-data                          # notebooks for data extraction and cleaning
│   ├── 2-exploration                   # notebooks for data exploration
│   ├── 3-analysis                      # notebooks for data analysis
│   ├── 4-feat_eng                      # notebooks for feature engineering
│   ├── 5-models                        # notebooks for model training
│   ├── 6-evaluation                    # notebooks for model evaluation
│   ├── 7-deploy                        # notebooks for model deployment
│   ├── notebook_template.ipynb         # template for notebooks
│   └── README.md                       # information about the notebooks
├── .pre-commit-config.yaml             # configuration for pre-commit hooks
├── pyproject.toml                      # dependencies for poetry
├── README.md                           # description of your project
├── src                                 # source code for use in this project
├── tests                               # test code for your project
└── .vscode                             # vscode configuration
    ├── extensions.json                 # list of recommended extensions
    └── settings.json                   # vscode settings
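
The numbered `data/` layers encode a one-way flow: each step reads from the previous layer and writes to the next, and `01_raw` stays immutable. A minimal sketch of that convention in Python (the `DATA_DIR` constant, file names, and parquet output are hypothetical choices, not prescribed by the template):

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")  # hypothetical; point this at the project's data folder


def typed_from_raw(raw_name: str, out_name: str) -> pd.DataFrame:
    """Read an immutable raw file and write its typed version to 02_intermediate."""
    df = pd.read_csv(DATA_DIR / "01_raw" / raw_name)
    df = df.convert_dtypes()  # the "typed data" step: infer proper dtypes
    out_path = DATA_DIR / "02_intermediate" / out_name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_path)  # parquet keeps the dtypes; 01_raw is never rewritten
    return df
```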

Data Science Code structure

Orchestrated experiment

flowchart TD
  subgraph input [ETL]
    %%nodes
    A1[(Data web)]
    B[Process_etl]
    BB1{{Data integrity}}
    BB2{{Data Validation}}
    Dcheck[(Data Checked)]

    %%links
    A1 ==>B
    B ==> BB1 ==> BB2 ==> Dcheck
  end

  subgraph split [Train / Test data split]
    %%nodes
    C[Split - Train /Test]
    C1[(Train)]
    C2[(Test)]
    CC{{Train / Test Validation}}

    %%links
    Dcheck ==> C
    C --> |data test|C2
    C --> |data train|C1
    C2 & C1 --> CC
  end

  subgraph train_feature [Train Feature Engineering]
    %%nodes
    D[<b>Pre - process Train</b> <br> Not needed in test <br> Ex: Remove outliers, duplicates, drops]


    subgraph feature [Feature Engineering pipeline <br> for use in train and test]
      style feature fill:grey,stroke:#333,stroke-width:2px
      %%nodes
      E[<b>Initial Processing</b> <br> Ex: Casting, New columns, Replace empty values with NaN]
      F{Split <br> Data Type}
      G1[Transformation <br> Numeric <br> <s>No Drops</s>]
      G2[Transformation <br> Categorical <br> <s>No Drops</s>]
      G3[Transformation <br> Boolean <br> <s>No Drops</s>]
      G4[Transformation <br> Dates <br> <s>No Drops</s>]
      G5[Transformation <br> Strings <br> <s>No Drops</s>]
      H[<b>Final Processing</b> <br> Final Pipeline <br> ColumnTransformer <br> and last transforms]
      TRfit[Train Transformer]
      TRdb[(Transformer <br> Pipeline)]
      %%links
      E -.-> F
      F -.->|Numeric|G1
      F -.->|Categorical|G2
      F -.->|Bool|G3
      F -.->|Dates|G4
      F -.->|Strings|G5
      G1 & G2 & G3 & G4 & G5 -.-> H

      H -.-> |pipeline object|TRfit
      TRfit -.-> |pipeline object|TRdb
    end

    %%nodes
    I[<b>Post - Processing Train</b> <br> Ex: Data balance - SMOTE, drop duplicates <br> Not needed in test]

    %%links
    C1 --->D
    D --> |X - data train <br> pre-processed|E
    D --> |X - data train <br> pre-processed|TRfit
    TRfit --> |X - data train <br> transformed|I
    D --> |Y - data train <br> pre-processed|I

  end

  subgraph mod[Modeling]

    %%nodes
    J[Modeling]
    Modeldb[(Train Model <br> candidate)]

    %%links
    I ---> |X - data train <br> post-processed|J
    I --> |Y - data train <br> post-processed|J
    J -.-> |Model Object| Modeldb

  end

  subgraph pred [Prediction]
    %%nodes
    TRtest[Transformation <br> X - Data test]
    Pred_test[Prediction test]
    Pred_train[Prediction train]
    Pred_db[(Predictions)]

    %%links
    C2 --> |X - data test|TRtest
    TRdb -.->TRtest
    TRtest --> |X - data test <br> transformed|Pred_test
    C2 --> |Y - data test|Pred_test
    I --> |X - data train <br> post-processed|Pred_train
    I --> |Y - data train <br> post-processed|Pred_train

    Modeldb -.-> |model|Pred_train --> Pred_db
    Modeldb -.-> |model|Pred_test --> Pred_db
  end

  subgraph eval [Evaluation]
    %%nodes
    Modelcheck{{Model validation}}
    M[Eval]
    N[(Score)]

    %%links

    I  --> |X data train <br> post-processed|Modelcheck
    I  --> |Y data train <br> post-processed|Modelcheck
    TRtest --> |X - data test <br> transformed|Modelcheck
    C2 --> |Y - data test|Modelcheck
    Modeldb -....-> |model|Modelcheck
    Pred_db --> M
    M -.->N
  end

  %%links


  Modelcheck -..->  pass{Pass ?}
  pass -.-> |no|no((Alert!))
  pass -.-> |yes|si(Execute modeling <br> with all data):::Passclass


  %% Define link styles
  linkStyle default stroke:blue

  linkStyle 8,10,12,33,35,42,45,46 stroke:orange
  linkStyle 29,31,38,44 stroke:deepskyblue
  linkStyle 36,46 stroke:gold

  %% Styling the title subgraph
  classDef Title stroke-width:0, color:#f66,  font-weight:bold, font-size: 24px;

  class input,train_feature,feature,pred,mod,eval Title


  %% Define node styles
  classDef Objclass fill:#329cc1;
  classDef Checkclass fill:#EC5800;
  classDef Alertclass fill:#FF0000;
  classDef Passclass fill:#00CC88;

  %% Assigning styles to nodes
  class C1,C2,Dcheck,TRdb,Modeldb,Pred_db,N Objclass;
  class BB1,BB2,CC,Modelcheck Checkclass;
  class no Alertclass;
  class si Passclass;
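
The `Data integrity` and `Data Validation` gates in the ETL subgraph are fail-fast checks that run before the split, so nothing downstream trains on bad data. A minimal pandas-only sketch (the `target` column and the specific checks are hypothetical; dedicated tools such as pandera or Great Expectations also fit here):

```python
import pandas as pd


def check_data_integrity(df: pd.DataFrame) -> pd.DataFrame:
    """Gate between ETL and the train/test split: raise instead of passing bad data on."""
    if df.empty:
        raise ValueError("dataset is empty")
    if not df.columns.is_unique:
        raise ValueError("duplicated column names")
    if df["target"].isna().any():  # hypothetical target column
        raise ValueError("missing values in the target")
    return df
```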
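The grey feature-engineering subgraph maps naturally onto a scikit-learn `ColumnTransformer`: one non-dropping branch per data type, combined in a final pipeline that is fit on the pre-processed train split only and then reused to transform test. A minimal sketch under that assumption (the column names and toy data are hypothetical, not part of the template):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy splits; in the project these come from the split stage.
X_train = pd.DataFrame({"age": [22.0, 38.0, None], "fare": [7.25, 71.3, 8.05],
                        "sex": ["male", "female", "female"], "embarked": ["S", "C", None]})
X_test = pd.DataFrame({"age": [35.0], "fare": [8.05], "sex": ["male"], "embarked": ["Q"]})

# One branch per data type; neither drops rows, so train/test alignment is preserved.
numeric_branch = Pipeline([("impute", SimpleImputer(strategy="median")),
                           ("scale", StandardScaler())])
categorical_branch = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                               ("encode", OneHotEncoder(handle_unknown="ignore"))])

# Final processing: a single ColumnTransformer combines all branches.
transformer = ColumnTransformer([("numeric", numeric_branch, ["age", "fare"]),
                                 ("categorical", categorical_branch, ["sex", "embarked"])])

# Fit on train only; the fitted object is the "Transformer Pipeline" artifact
# that later transforms the test split (and new data at prediction time).
X_train_t = transformer.fit_transform(X_train)
X_test_t = transformer.transform(X_test)
```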

Deployment

flowchart TD
  orch_exp[Orchestrated Experiment] -.-> Modelcheck
  Modelcheck{{Model validation}}:::Checkclass -.-> |yes| input
  Modelcheck -.-> |no|stop((Alert! <br> Stop)):::Alertclass
  subgraph input [ETL]
   Dcheck[(Data Checked)]:::Objclass
  end
  Dcheck ==>  D[<b>Pre - processing</b>]

  subgraph train_feature [Train Feature Engineering]
    %%nodes
    D[<b>Pre - processing Train</b> <br> Not needed in test <br> Ex: Remove outliers, duplicates, drops]


    subgraph feature [Feature Engineering pipeline <br> for use in train and test]
      style feature fill:grey,stroke:#333,stroke-width:2px
      %%nodes
      E[<b>Initial Processing</b> <br> Ex: Casting, New columns, Replace empty values with NaN]
      F{Split <br> Data Type}
      G1[Transformation <br> Numeric <br> <s>No Drops</s>]
      G2[Transformation <br> Categorical <br> <s>No Drops</s>]
      G3[Transformation <br> Boolean <br> <s>No Drops</s>]
      G4[Transformation <br> Dates <br> <s>No Drops</s>]
      G5[Transformation <br> Strings <br> <s>No Drops</s>]
      H[<b>Final Processing</b> <br> Final Pipeline <br> ColumnTransformer <br> and final transforms]
      TRfit[Train Transformer]
      %%links
      E -.-> F
      F -.->|Numeric|G1
      F -.->|Categorical|G2
      F -.->|Bool|G3
      F -.->|Dates|G4
      F -.->|Strings|G5
      G1 & G2 & G3 & G4 & G5 -.-> H

      H -.-> |pipeline object|TRfit
    end

    %%nodes
    I[<b>Post - Processing Train</b> <br> Ex: Data balance - SMOTE, drop duplicates <br> Not needed in test]

    %%links

    D --> |X - data train <br> pre-processed|E
    D --> |X - data train <br> pre-processed|TRfit
    TRfit --> |X - data train <br> transformed|I
    D --> |Y - data train <br> pre-processed|I

  end

  subgraph mod[Modeling]
    J[Train]
  end

  subgraph artefacto[Output artifact]
    TRfit -.-> |pipeline object|TRdb[(Transformer <br> Pipeline)]:::Objclass
    J -.-> |model object| Modeldb[(Train Model <br> Final)]:::Objclass
  end

  I --> |data post-processed|mod
  J -.->N[(Performance <br> Score)]:::Objclass

  N -.-> Scorecheck{{Performance validation <br> Current vs previous scores}}:::Checkclass
  Scorecheck -.->  pass{Pass ?}
  pass -.-> |no|no((Alert!)):::Alertclass
  pass -.-> |yes|si(Send Artifact to Deploy):::Passclass
  si -.-> artefacto

  linkStyle 19 stroke:deepskyblue

  classDef Objclass fill:#329cc1;
  classDef Checkclass fill:#EC5800;
  classDef Alertclass fill:#FF0000;
  classDef Passclass fill:#00CC88;
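
In the deployment flow, the performance check compares the retrained model's score against previous runs and only then releases the output artifacts (transformer pipeline and final model). A minimal sketch of that gate, assuming joblib for serialization; the file names, `tolerance` parameter, and higher-is-better score convention are hypothetical choices:

```python
from pathlib import Path

import joblib


def validate_and_export(model, transformer, new_score: float,
                        previous_score: float, tolerance: float = 0.01) -> None:
    """Gate the retrained artifacts on the score comparison before deployment."""
    if new_score < previous_score - tolerance:
        # the "Alert!" branch: the retrained model underperforms the previous one
        raise RuntimeError(f"score dropped: {new_score:.3f} < {previous_score:.3f}")
    # the "Output artifact" subgraph: persist both deployment artifacts
    out_dir = Path("models")  # matches the models/ folder in the project tree
    out_dir.mkdir(exist_ok=True)
    joblib.dump(transformer, out_dir / "transformer_pipeline.joblib")
    joblib.dump(model, out_dir / "final_model.joblib")
```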

Credits

This project was generated from @JoseRZapata's data science project template.