Saving and loading models is essential for deploying machine learning solutions and avoiding retraining. Scikit-learn supports model persistence using the joblib and pickle libraries, which serialize and deserialize Python objects.
Key Characteristics
- Enables reuse of trained models
- Reduces computational overhead
- Supports reproducibility by reusing the exact fitted parameters
- Compatible with most Scikit-learn objects
Basic Rules
- Use joblib for Scikit-learn models (better with large NumPy arrays)
- Use pickle for general Python object serialization
- Save preprocessing steps along with the model
- Validate reloaded models before use
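The last rule can be made concrete with a small check. The sketch below assumes a hypothetical helper named validate_reloaded (not part of Scikit-learn or joblib) that compares predictions from the original and the reloaded model on the same data; a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def validate_reloaded(original, reloaded, X_check):
    """Hypothetical helper: True if both models predict the same labels."""
    return bool(np.array_equal(original.predict(X_check), reloaded.predict(X_check)))


X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

is_valid = validate_reloaded(model, reloaded, X)
print("Reloaded model matches original:", is_valid)
```

In a real project the check would more likely run on a held-out validation set than on the training data used here.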
Syntax Table
SL NO | Technique | Syntax Example | Description |
---|---|---|---|
1 | Save with joblib | joblib.dump(model, 'model.pkl') | Saves model to file |
2 | Load with joblib | model = joblib.load('model.pkl') | Loads model from file |
3 | Save with pickle | pickle.dump(model, open('file.pkl', 'wb')) | Saves using pickle |
4 | Load with pickle | model = pickle.load(open('file.pkl', 'rb')) | Loads using pickle |
5 | Save pipeline | joblib.dump(pipe, 'pipeline.pkl') | Saves preprocessing and model pipeline |
Syntax Explanation
1. Saving a Model with joblib
What is it?
Serializes a trained model and saves it to disk using joblib, which is optimized for objects containing large NumPy arrays.
Syntax:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import joblib
# Load the data and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Train the model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Serialize the fitted model to disk
joblib.dump(model, 'rf_model.pkl')
Explanation:
- Trains a model and saves it using joblib
- Creates a file rf_model.pkl containing the model
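As a rough illustration of what joblib.dump offers beyond the basic call, the sketch below uses its compress parameter and its return value (the list of file names written). The file names and the compression level 3 are arbitrary choices for this example, not recommendations from the text above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "rf_model.pkl")
    packed_path = os.path.join(tmp_dir, "rf_model_compressed.pkl")

    # joblib.dump returns the list of file names it wrote
    written = joblib.dump(model, raw_path)
    # compress=3 is an arbitrary middle-of-the-road level for this sketch
    joblib.dump(model, packed_path, compress=3)

    raw_size = os.path.getsize(raw_path)
    packed_size = os.path.getsize(packed_path)
    reloaded = joblib.load(packed_path)

print("Files written:", written)
print("Uncompressed bytes:", raw_size, "Compressed bytes:", packed_size)
```

Compression trades a smaller file for slower saving and loading, so whether it is worth enabling depends on model size and how often the file is read.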
2. Loading a Model with joblib
What is it?
Deserializes a model file created with joblib and loads it back into memory.
Syntax:
import joblib
# Reload the model saved earlier; X_test comes from the earlier train/test split
model = joblib.load('rf_model.pkl')
y_pred = model.predict(X_test)
Explanation:
- Reloads the saved model
- Predicts with no need to retrain
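One caveat when reloading: a model saved with one Scikit-learn version is not guaranteed to load or behave identically under another. A possible safeguard, sketched below, is to store the library version next to the model and compare it on load. The dictionary layout and its key names are assumptions made for this example, not a joblib or Scikit-learn convention.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Hypothetical bundle layout: the key names are this sketch's own choice
bundle = {"model": model, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_bundle.pkl")
    joblib.dump(bundle, path)
    loaded = joblib.load(path)

# Refuse to use the model if it was saved under a different version
if loaded["sklearn_version"] != sklearn.__version__:
    raise RuntimeError(
        "Model was saved with scikit-learn "
        + loaded["sklearn_version"]
        + " but the installed version is "
        + sklearn.__version__
    )

y_pred = loaded["model"].predict(X)
print("Predictions made with a version-checked model:", y_pred[:5])
```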
3. Saving a Model with pickle
What is it?
Serializes a trained model using Python's built-in pickle module for general-purpose object saving.
Syntax:
import pickle
# Open the file in binary write mode; the with block closes it automatically
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
Explanation:
- Uses Python’s built-in pickle module
- Works for general Python objects including models
4. Loading a Model with pickle
What is it?
Deserializes a file saved using pickle and restores the model object.
Syntax:
import pickle
# Open the file in binary read mode and restore the model
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
Explanation:
- Reads binary file and loads the original model object
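The save and load steps for pickle can be combined into one self-contained round trip. The sketch below is only an illustration: the DecisionTreeClassifier, the temporary file location, and the choice of pickle.HIGHEST_PROTOCOL are assumptions of this example rather than anything prescribed above.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Write in binary mode; the with block closes the file even on error
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_type = type(restored) is type(model)
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Same type:", same_type, "Same predictions:", same_predictions)
```

Unpickling can execute arbitrary code, so pickle.load (and joblib.load, which builds on pickle) should only be used on files from a trusted source.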
5. Saving a Pipeline
What is it?
Saves an entire Scikit-learn Pipeline, including both preprocessing steps and the final estimator.
Syntax:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib
# X_train and y_train come from the earlier train/test split
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression())
])
pipe.fit(X_train, y_train)
joblib.dump(pipe, 'pipeline.pkl')
Explanation:
- Saves both preprocessing and model steps
- Useful for production deployments
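To see why saving the whole pipeline matters, the sketch below reloads a pipeline and checks that calling predict on raw, unscaled inputs gives the same result as manually applying the stored scaler and then the stored classifier. The step names, the max_iter value, and the temporary path are choices made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
]).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    joblib.dump(pipe, path)
    loaded_pipe = joblib.load(path)

# The reloaded pipeline scales raw inputs itself before predicting
pipeline_pred = loaded_pipe.predict(X_test)

# Equivalent manual route, using the fitted steps stored in the pipeline
scaler = loaded_pipe.named_steps["scaler"]
clf = loaded_pipe.named_steps["lr"]
manual_pred = clf.predict(scaler.transform(X_test))

print("Pipeline and manual predictions agree:", np.array_equal(pipeline_pred, manual_pred))
```

Had only the classifier been saved, the scaler's fitted mean and variance would be lost and new data could not be transformed consistently.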
Real-Life Project: Save and Reload KNN Pipeline
Project Name
Reusable KNN Pipeline
Project Overview
Train a KNN model with preprocessing and persist it for reuse.
Project Goal
Save, reload, and reuse a Scikit-learn pipeline with minimal reconfiguration.
Code for This Project
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import joblib
# Prepare data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
pipe.fit(X_train, y_train)
# Save pipeline
joblib.dump(pipe, 'knn_pipeline.pkl')
# Load pipeline
loaded_pipe = joblib.load('knn_pipeline.pkl')
print("Loaded Pipeline Accuracy:", loaded_pipe.score(X_test, y_test))
Expected Output
- Model accuracy from reloaded pipeline
- Identical output to original model
Common Mistakes to Avoid
- Saving only the model without preprocessing steps
- Forgetting to test the reloaded model
- Using pickle with large NumPy arrays (prefer joblib)