Last updated: 2/Feb/2025
This checklist is based on Appendix B (Machine Learning Project Checklist) of the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition, by Aurélien Géron, with some steps of my own added from other books and courses I have taken.
I really enjoyed this book; for me it is the most complete practical book on machine learning with Python that I have read. It has an excellent structure, very well-explained Python code, and many tips and suggestions for carrying out a machine learning project.
The book is accompanied by a repository with Jupyter Notebooks: https://github.com/ageron/handson-ml2
This checklist can serve as a step-by-step guide for Machine Learning projects.
1. Frame the problem and look at the big picture
- Define the objective in business terms.
- How will your solution be used?
- What are the current solutions (if any)?
- How should this problem be framed (supervised / unsupervised, online / offline, etc.)?
- How should the performance or effectiveness of the solution be measured?
- Is the performance measure aligned with the business objective?
- What would be the minimum performance or effectiveness needed to achieve the business objective?
- What are comparable problems? Can experience or tools already created be reused?
- Is human expertise on the problem available?
- How can the problem be solved manually?
- List the assumptions made so far.
- Verify the assumptions if possible.
2. Get the data
- List the data you need and how much you need.
- Find and document where the data can be obtained.
- Check how much storage space the data will occupy.
- Check legal limitations and obtain authorization to access the data if necessary.
- Obtain data access authorizations.
- Reserve sufficient storage space for the project.
- Get the data.
- Convert the data to a format that can be easily manipulated (without changing the data itself).
- Ensure that sensitive information is deleted or protected (e.g., anonymize the data).
- Verify the size and type of data (time series, data sample, geo-positioning, etc.).
- Set aside a test dataset and never look at it (to avoid data snooping).
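A minimal sketch of this last step, using scikit-learn's `train_test_split` on synthetic data (in a real project you would load your own dataset):

```python
# Sketch: carving out a held-out test set once, at the start of the project.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                       # 1000 rows, 5 features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

# A fixed random_state makes the split reproducible, so the test set
# stays the same every time the script runs; stratify keeps class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

The test set (`X_test`, `y_test`) is then stored away and only touched once, in step 6, to estimate the generalization error.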
3. Explore the data to gain insights (EDA: Exploratory Data Analysis)
Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
Create a Jupyter Notebook to keep a record of the data exploration.
Study each attribute and its characteristics (Univariate analysis):
- Name
- Data type (categorical, int / float, bounded / unbounded, text, structured, etc.)
- Percentage (%) of missing values.
- Noise and type of noise (stochastic, outliers, rounding errors, etc.)
- Are they possibly useful for the project?
- Distribution type (Gaussian, uniform, logarithmic, etc.)
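The univariate checks above can be collected in a small pandas summary; the DataFrame and column names here are made up for illustration:

```python
# Sketch: per-attribute summary (dtype, % missing, cardinality) plus
# distribution stats for the numeric columns.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 47, 51],
    "income": [30000, 45000, 52000, np.nan, 61000],
    "city": ["A", "B", "A", "C", "B"],
})

summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_pct": df.isna().mean() * 100,  # percentage of missing values
    "n_unique": df.nunique(),               # cardinality per attribute
})
print(summary)
print(df.describe())  # count, mean, std, min/max, quartiles for numerics
```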
For supervised learning projects, identify the target attribute(s).
Visualize the data.
Study correlations between attributes (Bivariate analysis).
Study how to solve the problem manually.
Identify transformations that might be applicable.
Identify additional data that could be useful.
Document what you have learned.
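For the bivariate-analysis step, a correlation matrix is a common starting point; this sketch uses synthetic columns whose relationships are known by construction:

```python
# Sketch: bivariate analysis via a Pearson correlation matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    "x": x,
    "x_plus_noise": x + 0.1 * rng.normal(size=500),  # strongly correlated with x
    "independent": rng.normal(size=500),             # roughly uncorrelated
})

corr = df.corr(numeric_only=True)
print(corr.round(2))
```

Keep in mind that `corr()` only captures linear relationships; scatter plots (e.g., `pd.plotting.scatter_matrix`) help spot nonlinear ones.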
Data Exploration Libraries
- ydata-profiling (formerly Pandas Profiling) https://docs.profiling.ydata.ai/latest/
- DataPrep https://dataprep.ai/
- Mito https://www.trymito.io/
- D-Tale https://github.com/man-group/dtale
- SweetViz https://github.com/fbdesignpro/sweetviz
- AutoViz https://github.com/AutoViML/AutoViz
- Bitrook Data cleaning https://www.bitrook.com/
- dabl Simple auto-EDA with plots https://dabl.github.io/
- Klib https://klib.readthedocs.io/
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#data-visualization
4. Prepare the data
The goal is to better expose the underlying data patterns to the Machine Learning algorithms.
Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you perform, for five reasons:
- So you can easily prepare the data the next time you obtain a new dataset
- So you can apply these transformations in future projects
- To clean and prepare the test dataset
- To clean and prepare new data instances once your solution is live (production)
- So it is easy to test different data preparation approaches as hyperparameters
Data cleaning:
- Remove duplicate data records (reduce the number of data points)
- Fix or remove outliers (optional).
- Depending on the problem, outliers may instead be set aside as their own dataset (e.g., for anomaly detection).
- Fill in missing values (e.g., with zero, mean, median …) or drop the rows (or columns).
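A minimal sketch of two of these cleaning steps, dropping duplicate rows and imputing missing values with the median; the column names are invented for illustration:

```python
# Sketch: drop duplicate records, then impute missing values with the median.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "rooms": [3, 3, 5, np.nan, 2],
    "price": [100, 100, 250, 180, 90],
})

df = df.drop_duplicates()  # rows 0 and 1 are identical duplicates

# Median imputation is robust to outliers; fitting the imputer (rather than
# calling fillna) lets you reuse the learned median on new data later.
imputer = SimpleImputer(strategy="median")
df[["rooms"]] = imputer.fit_transform(df[["rooms"]])
print(df)
```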
Feature Selection (optional):
- Drop features that do not provide useful information for the project.
- Remove duplicate records (after dropping features, some records may become identical)
Feature Engineering, where appropriate:
- Discretize continuous features.
- Decompose features into parts (e.g., categorical, date/time, etc.).
- Add promising feature transformations, for example:
- log(x)
- sqrt(x)
- x^2
- etc.
- Apply functions to the data to add new features.
Feature Scaling:
- standardize
- normalize
Feature Engineering Libraries
- Feature-engine https://feature-engine.trainindata.com/en/latest/
- featuretools https://featuretools.alteryx.com/en/stable/
5. Select a model
Notes:
- If you have a large amount of data, you may want to sample the data to have smaller training sets, so you can train several different models in a reasonable time (keep in mind that this penalizes complex models such as large neural networks or Random Forest).
- Once again, try to automate these steps as much as possible.
- Use experiment tracking tools for traceability of models and their performance (e.g., MLflow).
Train many quick models using standard parameters from different categories (e.g., linear, Naive Bayes, SVM, Random Forest, neural networks, etc.).
Measure and compare their performance.
- For each model, use N-fold cross validation and compute the mean and standard deviation of the performance measure across the N folds.
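These two steps can be sketched as a loop over quick baseline models with default parameters, comparing them by cross-validated mean ± std (the dataset is synthetic):

```python
# Sketch: train several quick models and compare them via 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} \u00b1 {scores.std():.3f}")
```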
Analyze the most significant variables for each algorithm.
Analyze the types of errors the models make.
- What data would a human have used to avoid these errors?
Quickly perform feature selection and feature engineering.
Perform one or two more quick iterations of the above five steps.
Shortlist the top three to five most promising models, preferring models that make different types of errors (diversity of errors).
Machine Learning Model Libraries
- scikit-learn https://scikit-learn.org/stable/
- catboost https://catboost.ai/
- xgboost https://xgboost.readthedocs.io/en/latest/
- lightgbm https://lightgbm.readthedocs.io/en/latest/
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#machine-learning-frameworks
AutoML (Auto Machine Learning) Libraries
- AutoGluon https://auto.gluon.ai/
- Pycaret https://pycaret.org/
- TabPFN https://github.com/PriorLabs/TabPFN
- MLJar https://mljar.com/automl/
- FLAML https://microsoft.github.io/FLAML/
- Auto_ViML https://github.com/AutoViML/Auto_ViML
- H2O https://github.com/h2oai/h2o-3
- Auto-Keras https://autokeras.com/
- TPOT http://epistasislab.github.io/tpot/
- Auto-Sklearn https://automl.github.io/auto-sklearn/master/
Simple Model Selection Libraries (Not for Production)
- dabl Simple auto-EDA https://dabl.github.io/
- Lazypredict https://lazypredict.readthedocs.io/
- Atom https://github.com/tvdboom/ATOM
- Poniard https://github.com/rxavier/poniard
Deep Learning Libraries
- keras https://keras.io/
- pytorch https://pytorch.org/
- pytorch-lightning https://lightning.ai/docs/pytorch/stable/
- tensorflow https://www.tensorflow.org/
6. Fine-tune the models
Notes:
- You should use as much data as possible for this step, especially as you move toward the final stages of fine-tuning the model.
- As always, automate what you can.
Fine-tune the hyperparameters using cross validation.
- Treat data transformation choices as hyperparameters, especially when you are not sure about them (for example, should you replace missing values with zero or with the mean value? Or simply drop the rows?).
- Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very slow, you may prefer a Bayesian optimization approach (for example, using Gaussian processes, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams [1]).
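A sketch of random search with scikit-learn's `RandomizedSearchCV`; the parameter ranges are illustrative assumptions, not recommendations:

```python
# Sketch: random search over Random Forest hyperparameters with 3-fold CV.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions (rather than fixed grids) let random search sample freely.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,       # number of random combinations to try
    cv=3,
    random_state=0,  # reproducible sampling of combinations
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```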
Try ensemble methods. Combining your best models will often perform better than running them individually (better performance when there is diversity of errors among the models).
Once you are confident in your final model, measure its performance on the test set (the one set aside at the beginning) to estimate the generalization error.
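A sketch of the ensembling idea followed by the single final test-set evaluation, combining diverse model families in a soft-voting classifier (synthetic data again):

```python
# Sketch: a soft-voting ensemble of diverse models, evaluated once on the
# held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

ensemble = VotingClassifier(
    estimators=[
        ("logistic", LogisticRegression(max_iter=1000)),
        ("naive_bayes", GaussianNB()),
        ("random_forest", RandomForestClassifier(random_state=7)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_train, y_train)
print("test accuracy:", round(ensemble.score(X_test, y_test), 3))
```

Ensembling pays off most when the base models make different kinds of errors, which is why the shortlist in step 5 favors error diversity.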
Model Hyperparameter Tuning Libraries
- Ray-Tune https://docs.ray.io/en/latest/tune/index.html
- optuna https://optuna.readthedocs.io/en/stable/
- hyperopt http://hyperopt.github.io/hyperopt/
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#hyperparameter-optimization--automl
7. Model Interpretability
Interpret the model obtained and identify its errors
- Which features are most important?
- How much does each feature contribute to the prediction?
- What are the consequences of bad predictions?
- What type of errors does the model make?
- How can the errors be monitored?
- What causes the errors?
- Outliers?
- Imbalanced class?
- Data entry errors?
- etc.
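For the feature-importance questions above, permutation importance is a simple model-agnostic option (an alternative to library-specific explainers such as SHAP); this sketch uses a synthetic dataset where only some features are informative by construction:

```python
# Sketch: model-agnostic feature importance via permutation importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Only 3 of the 8 features carry signal by construction.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=3
)

model = RandomForestClassifier(random_state=3).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=10, random_state=3)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```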
Model Interpretability Libraries
- Shap https://github.com/slundberg/shap
- interpret https://github.com/interpretml/interpret
- explainerdashboard https://explainerdashboard.readthedocs.io/
- PiML https://github.com/SelfExplainML/PiML-Toolbox
- Yellowbrick https://www.scikit-yb.org/en/latest/
- Alibi Explain https://docs.seldon.io/projects/alibi/en/stable/
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#model-interpretability
8. Present the solution
Document what you have done.
Create a good presentation.
- Make sure to highlight the big picture of the project or problem first.
Explain why the solution found achieves the desired objective.
Don’t forget to present interesting points noticed along the way.
- Describe what worked and what did not.
- List the assumptions and limitations of the system.
Make sure key findings are communicated through compelling visualizations or easy-to-remember statements (e.g., “median income is the number one predictor of housing prices”).
9. Deploy, monitor and maintain the system
- Prepare the solution for production (connect production data inputs, write unit tests, etc.).
- Write monitoring code to check the real-time performance of the system at regular intervals and trigger alerts when it drops or fails.
- Be careful about slow degradation: models tend to “rot” as data evolves, gradually losing validity over time.
- Performance measurement may require human supervision (e.g., through a crowdsourcing service).
- Monitor the quality of input data (e.g., a malfunctioning sensor sending random values, or another team’s data output becoming stale). This is particularly important for online learning systems.
- Retrain your models regularly with fresh data (automate as much as possible); this is called Continuous Training and Continuous Deployment (CT/CD).
- The area of process automation is called MLOps.
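As a minimal sketch of input-data monitoring, one common technique is a two-sample Kolmogorov-Smirnov test comparing a live feature's distribution against the training distribution; the threshold and the simulated shift here are illustrative assumptions:

```python
# Sketch: a minimal data-drift check with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean has shifted

stat, p_value = ks_2samp(training_feature, live_feature)
drift_detected = p_value < 0.01  # alert threshold (illustrative)
print(f"KS statistic={stat:.3f}, p={p_value:.2e}, drift={drift_detected}")
```

In production, libraries such as Evidently or Deepchecks (listed below) run checks like this per feature and over time windows, so you rarely need to hand-roll them.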
MLOps, Monitoring and Testing Libraries
Experiment tracking Libraries
- MLflow https://mlflow.org/
- Weights & Biases https://github.com/wandb/wandb
- Dvc Experiment Management https://dvc.org/doc/user-guide/experiment-management
- Metaflow https://github.com/Netflix/metaflow
- https://github.com/ml-tooling/best-of-ml-python?tab=readme-ov-file#workflow--experiment-tracking
Data and Model Monitoring and Testing Libraries
- GreatExpectations Data quality https://greatexpectations.io/
- Deepchecks Test Suites for ML Models & Data https://deepchecks.com/
- evidentlyAI model monitoring https://evidentlyai.com/
MLOps and Orchestration Libraries
- Kedro Modular, reproducible and maintainable data science code https://kedro.org/
- ZenML MLOps framework for creating production-ready ML pipelines https://zenml.io/
- Dvc Version Control System for Machine Learning Projects https://dvc.org/
- MLflow Platform for the machine learning lifecycle https://mlflow.org/
- KubeFlow Machine Learning Toolkit for Kubernetes https://www.kubeflow.org/
- MetaFlow Build and manage real-life data science projects https://metaflow.org/
- MLRun Machine-learning applications to production https://www.mlrun.org/
Code Testing Libraries
- Pytest Unit tests https://docs.pytest.org/
- Coverage Unit test coverage https://coverage.readthedocs.io/
References
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
- https://github.com/ageron/handson-ml2
- Interpretable Machine Learning
- https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
- https://medium.com/@mcintyreshiv/how-to-master-python-for-machine-learning-from-scratch-a-step-by-step-tutorial-8c6569895cb0
- tools for data science https://twitter.com/thedataprof/status/1540747792313774081
- https://medium.com/pykes-technical-notes/testing-machine-learning-systems-unit-tests-38696264ee04
- https://dagshub.com/blog/how-to-compare-ml-experiment-tracking-tools-to-fit-your-data-science-workflow/
- https://machinelearningmastery.com/framework-for-data-preparation-for-machine-learning/
- Python libraries for ML projects https://github.com/ml-tooling/best-of-ml-python
[1] “Practical Bayesian Optimization of Machine Learning Algorithms,” J. Snoek, H. Larochelle, R. Adams (2012)
