Last updated: 2/Feb/2025
This checklist is based on Appendix B (Machine Learning Project Checklist) of the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition, by Aurélien Géron, with some steps of my own added from other books and courses I have taken.
I really enjoyed this book; for me it is the most complete practical book on machine learning with Python that I have read. It has an excellent structure, very well-explained Python code, and many tips and suggestions for carrying out a machine learning project.
The book is accompanied by a repository with Jupyter Notebooks: https://github.com/ageron/handson-ml2
This checklist can serve as a step-by-step guide for Machine Learning projects.
1. Frame the problem and look at the big picture
- Define the objective in business terms.
- How will your solution be used?
- What are the current solutions (if any)?
- How should this problem be framed (supervised / unsupervised, online / offline, etc.)?
- How should the performance or effectiveness of the solution be measured?
- Is the performance measure aligned with the business objective?
- What would be the minimum performance or effectiveness needed to achieve the business objective?
- What are comparable problems? Can experience or tools already created be reused?
- Is human expertise on the problem available?
- How can the problem be solved manually?
- List the assumptions made so far.
- Verify the assumptions if possible.
2. Get the data
- List the data you need and how much you need.
- Find and document where the data can be obtained.
- Check how much storage space the data will occupy.
- Check legal limitations and obtain authorization to access the data if necessary.
- Obtain data access authorizations.
- Reserve sufficient storage space for the project.
- Get the data.
- Convert the data to a format that can be easily manipulated (without changing the data itself).
- Ensure that sensitive information is deleted or protected (e.g., anonymize the data).
- Verify the size and type of data (time series, data sample, geo-positioning, etc.).
- Set aside a test dataset and never look at it (to avoid data snooping).
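A minimal sketch of this last step, using scikit-learn's `train_test_split` on synthetic data (in a real project you would load your own dataset):

```python
# Sketch: carving out a held-out test set once, at the start of the project.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                       # 1000 rows, 5 features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

# A fixed random_state makes the split reproducible, so the test set
# stays the same every time the script runs; stratify keeps class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

The test set (`X_test`, `y_test`) is then stored away and only touched once, in step 6, to estimate the generalization error.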
3. Explore the data to gain insights (EDA: Exploratory Data Analysis)
Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
Create a Jupyter Notebook to keep a record of the data exploration.
Study each attribute and its characteristics (Univariate analysis):
- Name
- Data type (categorical, int / float, bounded / unbounded, text, structured, etc.)
- Percentage (%) of missing values.
- Noise and type of noise (stochastic, outliers, rounding errors, etc.)
- Are they possibly useful for the project?
- Distribution type (Gaussian, uniform, logarithmic, etc.)
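The univariate checks above can be collected in a small pandas summary; the DataFrame and column names here are made up for illustration:

```python
# Sketch: per-attribute summary (dtype, % missing, cardinality) plus
# distribution stats for the numeric columns.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 47, 51],
    "income": [30000, 45000, 52000, np.nan, 61000],
    "city": ["A", "B", "A", "C", "B"],
})

summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_pct": df.isna().mean() * 100,  # percentage of missing values
    "n_unique": df.nunique(),               # cardinality per attribute
})
print(summary)
print(df.describe())  # count, mean, std, min/max, quartiles for numerics
```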
For supervised learning projects, identify the target attribute(s).
Visualize the data.
Study correlations between attributes (Bivariate analysis).
Study how to solve the problem manually.
Identify transformations that might be applicable.
Identify additional data that could be useful.
Document what you have learned.
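For the bivariate-analysis step, a correlation matrix is a common starting point; this sketch uses synthetic columns whose relationships are known by construction:

```python
# Sketch: bivariate analysis via a Pearson correlation matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    "x": x,
    "x_plus_noise": x + 0.1 * rng.normal(size=500),  # strongly correlated with x
    "independent": rng.normal(size=500),             # roughly uncorrelated
})

corr = df.corr(numeric_only=True)
print(corr.round(2))
```

Keep in mind that `corr()` only captures linear relationships; scatter plots (e.g., `pd.plotting.scatter_matrix`) help spot nonlinear ones.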
Data Exploration Libraries
- ydata-profiling (formerly Pandas Profiling) https://docs.profiling.ydata.ai/latest/
- DataPrep https://dataprep.ai/
- Mito https://www.trymito.io/
- D-Tale https://github.com/man-group/dtale
- SweetViz https://github.com/fbdesignpro/sweetviz
- AutoViz https://github.com/AutoViML/AutoViz
- Bitrook Data cleaning https://www.bitrook.com/
- dabl Simple auto-EDA with plots https://dabl.github.io/
- Klib https://klib.readthedocs.io/
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#data-visualization
4. Prepare the data
The goal is to better expose the underlying data patterns to the Machine Learning algorithms.
Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you perform, for five reasons:
- So you can easily prepare the data the next time you obtain a new dataset
- So you can apply these transformations in future projects
- To clean and prepare the test dataset
- To clean and prepare new data instances once your solution is live (production)
- So it is easy to test different data preparation approaches as hyperparameters
Data cleaning:
- Remove duplicate data records (reduce the number of data points)
- Fix or remove outliers (optional).
- Depending on the problem, outliers may instead be set aside as their own dataset (e.g., for anomaly detection).
- Fill in missing values (e.g., with zero, mean, median …) or drop the rows (or columns).
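A minimal sketch of two of these cleaning steps, dropping duplicate rows and imputing missing values with the median; the column names are invented for illustration:

```python
# Sketch: drop duplicate records, then impute missing values with the median.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "rooms": [3, 3, 5, np.nan, 2],
    "price": [100, 100, 250, 180, 90],
})

df = df.drop_duplicates()  # rows 0 and 1 are identical duplicates

# Median imputation is robust to outliers; fitting the imputer (rather than
# calling fillna) lets you reuse the learned median on new data later.
imputer = SimpleImputer(strategy="median")
df[["rooms"]] = imputer.fit_transform(df[["rooms"]])
print(df)
```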
Feature Selection (optional):
- Drop features that do not provide useful information for the project.
- Remove duplicate records (after dropping features, some records may become identical)
Feature Engineering, where appropriate:
- Discretize continuous features.
- Decompose features into parts (e.g., categorical, date/time, etc.).
- Add promising feature transformations, for example:
- log(x)
- sqrt(x)
- x^2
- etc.
- Apply functions to the data to add new features.
Feature Scaling:
- standardize
- normalize
Feature Engineering Libraries
- Feature-engine https://feature-engine.trainindata.com/en/latest/
- featuretools https://featuretools.alteryx.com/en/stable/
5. Select a model
Notes:
- If you have a large amount of data, you may want to sample the data to have smaller training sets, so you can train several different models in a reasonable time (keep in mind that this penalizes complex models such as large neural networks or Random Forest).
- Once again, try to automate these steps as much as possible.
- Use experiment tracking tools for traceability of models and their performance (e.g., MLflow).
Train many quick models using standard parameters from different categories (e.g., linear, Naive Bayes, SVM, Random Forest, neural networks, etc.).
Measure and compare their performance.
- For each model, use N-fold cross validation and compute the mean and standard deviation of the performance measure across the N folds.
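These two steps can be sketched as a loop over quick baseline models with default parameters, comparing them by cross-validated mean ± std (the dataset is synthetic):

```python
# Sketch: train several quick models and compare them via 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} \u00b1 {scores.std():.3f}")
```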
Analyze the most significant variables for each algorithm.
Analyze the types of errors the models make.
- What data would a human have used to avoid these errors?
Quickly perform feature selection and feature engineering.
Perform one or two more quick iterations of the above five steps.
Shortlist the top three to five most promising models, preferring models that make different types of errors (diversity of errors).
Machine Learning Model Libraries
- scikit-learn https://scikit-learn.org/stable/
- catboost https://catboost.ai/
- xgboost https://xgboost.readthedocs.io/en/latest/
- lightgbm https://lightgbm.readthedocs.io/en/latest/
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#machine-learning-frameworks
AutoML (Auto Machine Learning) Libraries
- AutoGluon https://auto.gluon.ai/
- Pycaret https://pycaret.org/
- TabPFN https://github.com/PriorLabs/TabPFN
- MLJar https://mljar.com/automl/
- FLAML https://microsoft.github.io/FLAML/
- Auto_ViML https://github.com/AutoViML/Auto_ViML
- H2O https://github.com/h2oai/h2o-3
- Auto-Keras https://autokeras.com/
- TPOT http://epistasislab.github.io/tpot/
- Auto-Sklearn https://automl.github.io/auto-sklearn/master/
Simple Model Selection Libraries (Not for Production)
- dabl Simple auto-EDA https://dabl.github.io/
- Lazypredict https://lazypredict.readthedocs.io/
- Atom https://github.com/tvdboom/ATOM
- Poniard https://github.com/rxavier/poniard
Deep Learning Libraries
- keras https://keras.io/
- pytorch https://pytorch.org/
- pytorch-lightning https://lightning.ai/docs/pytorch/stable/
- tensorflow https://www.tensorflow.org/
6. Fine-tune the models
Notes:
- You should use as much data as possible for this step, especially as you move toward the final stages of fine-tuning the model.
- As always, automate what you can.
Fine-tune the hyperparameters using cross validation.
- Treat data transformation choices as hyperparameters, especially when you are not sure about them (for example, should you replace missing values with zero or with the mean value? Or simply drop the rows?).
- Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very slow, you may prefer a Bayesian optimization approach (for example, using Gaussian processes, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams [1]).
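A sketch of random search with scikit-learn's `RandomizedSearchCV`; the parameter ranges are illustrative assumptions, not recommendations:

```python
# Sketch: random search over Random Forest hyperparameters with 3-fold CV.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions (rather than fixed grids) let random search sample freely.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,       # number of random combinations to try
    cv=3,
    random_state=0,  # reproducible sampling of combinations
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```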
Try ensemble methods. Combining your best models will often perform better than running them individually (better performance when there is diversity of errors among the models).
Once you are confident in your final model, measure its performance on the test set (the one set aside at the beginning) to estimate the generalization error.
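A sketch of the ensembling idea followed by the single final test-set evaluation, combining diverse model families in a soft-voting classifier (synthetic data again):

```python
# Sketch: a soft-voting ensemble of diverse models, evaluated once on the
# held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

ensemble = VotingClassifier(
    estimators=[
        ("logistic", LogisticRegression(max_iter=1000)),
        ("naive_bayes", GaussianNB()),
        ("random_forest", RandomForestClassifier(random_state=7)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_train, y_train)
print("test accuracy:", round(ensemble.score(X_test, y_test), 3))
```

Ensembling pays off most when the base models make different kinds of errors, which is why the shortlist in step 5 favors error diversity.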
Model Hyperparameter Tuning Libraries
- Ray-Tune https://docs.ray.io/en/latest/tune/index.html
- optuna https://optuna.readthedocs.io/en/stable/
- hyperopt http://hyperopt.github.io/hyperopt/
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#hyperparameter-optimization--automl
7. Model Interpretability
Interpret the model obtained and identify its errors
- Which features are most important?
- How much does each feature contribute to the prediction?
- What are the consequences of bad predictions?
- What type of errors does the model make?
- How can the errors be monitored?
- What causes the errors?
- Outliers?
- Imbalanced class?
- Data entry errors?
- etc.
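For the feature-importance questions above, permutation importance is a simple model-agnostic option (an alternative to library-specific explainers such as SHAP); this sketch uses a synthetic dataset where only some features are informative by construction:

```python
# Sketch: model-agnostic feature importance via permutation importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Only 3 of the 8 features carry signal by construction.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=3
)

model = RandomForestClassifier(random_state=3).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=10, random_state=3)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```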
Model Interpretability Libraries
- Shap https://github.com/slundberg/shap
- interpret https://github.com/interpretml/interpret
- explainerdashboard https://explainerdashboard.readthedocs.io/
- PiML https://github.com/SelfExplainML/PiML-Toolbox
- Yellowbrick https://www.scikit-yb.org/en/latest/
- Alibi Explain https://docs.seldon.io/projects/alibi/en/stable/
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#model-interpretability
8. Present the solution
Document what you have done.
Create a good presentation.
- Make sure to highlight the big picture of the project or problem first.
Explain why the solution found achieves the desired objective.
Don’t forget to present interesting points noticed along the way.
- Describe what worked and what did not.
- List the assumptions and limitations of the system.
Make sure key findings are communicated through compelling visualizations or easy-to-remember statements (e.g., “median income is the number one predictor of housing prices”).
9. Deploy, monitor and maintain the system
- Prepare the solution for production (connect production data inputs, write unit tests, etc.).
- Write monitoring code to check the real-time performance of the system at regular intervals and trigger alerts when it drops or fails.
- Be careful about slow degradation: models tend to “rot” as data evolves, gradually losing validity over time.
- Performance measurement may require human supervision (e.g., through a crowdsourcing service).
- Monitor the quality of input data (e.g., a malfunctioning sensor sending random values, or another team’s data output becoming stale). This is particularly important for online learning systems.
- Retrain your models regularly with fresh data (automate as much as possible); this is called Continuous Training and Continuous Deployment (CT/CD).
- The area of process automation is called MLOps.
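As a minimal sketch of input-data monitoring, one common technique is a two-sample Kolmogorov-Smirnov test comparing a live feature's distribution against the training distribution; the threshold and the simulated shift here are illustrative assumptions:

```python
# Sketch: a minimal data-drift check with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean has shifted

stat, p_value = ks_2samp(training_feature, live_feature)
drift_detected = p_value < 0.01  # alert threshold (illustrative)
print(f"KS statistic={stat:.3f}, p={p_value:.2e}, drift={drift_detected}")
```

In production, libraries such as Evidently or Deepchecks (listed below) run checks like this per feature and over time windows, so you rarely need to hand-roll them.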
MLOps, Monitoring and Testing Libraries
Experiment tracking Libraries
- MLflow https://mlflow.org/
- Weights & Biases https://github.com/wandb/wandb
- Dvc Experiment Management https://dvc.org/doc/user-guide/experiment-management
- Metaflow https://github.com/Netflix/metaflow
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#workflow--experiment-tracking
Data and Model Monitoring and Testing Libraries
- GreatExpectations Data quality https://greatexpectations.io/
- Deepchecks Test Suites for ML Models & Data https://deepchecks.com/
- evidentlyAI model monitoring https://evidentlyai.com/
MLOps and Orchestration Libraries
- Kedro Modular, reproducible and maintainable data science code https://kedro.org/
- ZenML MLOps framework for creating production-ready ML pipelines https://zenml.io/
- Dvc Version Control System for Machine Learning Projects https://dvc.org/
- MLflow Platform for the machine learning lifecycle https://mlflow.org/
- KubeFlow Machine Learning Toolkit for Kubernetes https://www.kubeflow.org/
- MetaFlow Build and manage real-life data science projects https://metaflow.org/
- MLRun Machine-learning applications to production https://www.mlrun.org/
Code Testing Libraries
- Pytest Unit tests https://docs.pytest.org/
- Coverage Unit test coverage https://coverage.readthedocs.io/
References
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
- https://github.com/ageron/handson-ml2
- Interpretable Machine Learning
- https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
- https://medium.com/@mcintyreshiv/how-to-master-python-for-machine-learning-from-scratch-a-step-by-step-tutorial-8c6569895cb0
- tools for data science https://twitter.com/thedataprof/status/1540747792313774081
- https://medium.com/pykes-technical-notes/testing-machine-learning-systems-unit-tests-38696264ee04
- https://dagshub.com/blog/how-to-compare-ml-experiment-tracking-tools-to-fit-your-data-science-workflow/
- https://machinelearningmastery.com/framework-for-data-preparation-for-machine-learning/
- Python libraries for ML projects https://github.com/ml-tooling/best-of-ml-python
[1] “Practical Bayesian Optimization of Machine Learning Algorithms,” J. Snoek, H. Larochelle, R. Adams (2012)
