Saving and Loading Scikit-learn Models

Persisting a trained model lets you deploy it and reuse it without retraining. Scikit-learn estimators can be persisted with joblib or with Python's built-in pickle module, both of which serialize Python objects to a file and deserialize them back into memory.

Key Characteristics

  • Enables reuse of trained models
  • Reduces computational overhead
  • Supports reproducibility, provided the model is reloaded with the same Scikit-learn version it was saved with
  • Compatible with most Scikit-learn objects
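Because a persisted model is tied to the library version that produced it, it can help to store that version next to the model. The sketch below is one possible way to do this; the dictionary layout and the key names "model" and "sklearn_version" are assumptions made for illustration, not a Scikit-learn convention.

```python
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical bundle layout: these key names are illustrative only
bundle = {"model": model, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = str(Path(tmp_dir) / "model_bundle.joblib")
    joblib.dump(bundle, bundle_path)
    restored = joblib.load(bundle_path)

# Compare the version recorded at save time with the current environment
version_matches = restored["sklearn_version"] == sklearn.__version__
if not version_matches:
    print("Warning: model was saved with scikit-learn", restored["sklearn_version"])
```

A check like this does not make a mismatched model safe to use; it only makes the mismatch visible before predictions are served.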

Basic Rules

  • Use joblib for Scikit-learn models (it handles large NumPy arrays more efficiently)
  • Use pickle for general Python object serialization
  • Save preprocessing steps along with the model
  • Validate reloaded models before use
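The last rule, validating a reloaded model, can be sketched as a simple comparison of predictions before and after the round trip. This is a minimal example that assumes a held-out test set is available; the temporary directory and the file name rf_model.joblib are chosen here only for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Round trip: write the fitted model to disk, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = str(Path(tmp_dir) / "rf_model.joblib")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly
predictions_match = np.array_equal(model.predict(X_test), reloaded.predict(X_test))
print("Predictions match:", predictions_match)
```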

Syntax Table

SL NO | Technique        | Syntax Example                              | Description
1     | Save with joblib | joblib.dump(model, 'model.pkl')             | Saves model to file
2     | Load with joblib | model = joblib.load('model.pkl')            | Loads model from file
3     | Save with pickle | pickle.dump(model, open('file.pkl', 'wb'))  | Saves using pickle
4     | Load with pickle | model = pickle.load(open('file.pkl', 'rb')) | Loads using pickle
5     | Save pipeline    | joblib.dump(pipe, 'pipeline.pkl')           | Saves preprocessing and model pipeline

Syntax Explanation

1. Saving a Model with joblib

What is it?
Serializes a trained model and saves it to disk using joblib, which is optimized for objects containing large NumPy arrays.

Syntax:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import joblib

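# Load the data and hold out a test set (fixed seed for a repeatable split)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train the model, then serialize it to disk
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
joblib.dump(model, 'rf_model.pkl')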

Explanation:

  • Trains a model and saves it using joblib
  • Creates a file rf_model.pkl containing the model
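joblib.dump also accepts a compress argument, which trades some save and load time for a smaller file. The sketch below compares an uncompressed and a compressed dump of the same model; the compression level 3 and the file names are arbitrary choices for illustration, and the size difference will vary with the model.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "rf_raw.joblib")
    compressed_path = os.path.join(tmp_dir, "rf_compressed.joblib")

    # Default dump: no compression
    joblib.dump(model, raw_path)
    # Same model with zlib compression at level 3 (levels range from 0 to 9)
    joblib.dump(model, compressed_path, compress=3)

    raw_size = os.path.getsize(raw_path)
    compressed_size = os.path.getsize(compressed_path)

    # joblib.load detects the compression automatically
    reloaded = joblib.load(compressed_path)

print("Uncompressed bytes:", raw_size)
print("Compressed bytes:  ", compressed_size)
```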

2. Loading a Model with joblib

What is it?
Deserializes a model file created with joblib and loads it back into memory.

Syntax:

import joblib
# X_test comes from the train/test split in section 1
model = joblib.load('rf_model.pkl')
y_pred = model.predict(X_test)

Explanation:

  • Reloads the saved model
  • Predicts with no need to retrain

3. Saving a Model with pickle

What is it?
Serializes a trained model using Python’s built-in pickle module for general-purpose object saving.

Syntax:

import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

Explanation:

  • Uses Python’s built-in pickle module
  • Works for general Python objects including models

4. Loading a Model with pickle

What is it?
Deserializes a file saved using pickle and restores the model object.

Syntax:

import pickle
# Open in binary read mode ('rb'), matching the 'wb' mode used when saving
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

Explanation:

  • Reads the binary file and restores the original model object
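The pickle save and load steps above go through a file, but the same round trip can be done in memory with pickle.dumps and pickle.loads, which is convenient when the bytes are destined for a database or a cache rather than disk. This is a minimal sketch; the DecisionTreeClassifier is only a stand-in for any fitted estimator.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Serialize to a bytes object instead of a file
payload = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize from bytes back into a usable estimator
restored = pickle.loads(payload)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Payload size in bytes:", len(payload))
print("Same predictions:", same_predictions)
```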

5. Saving a Pipeline

What is it?
Saves an entire Scikit-learn Pipeline including both preprocessing steps and the final estimator.

Syntax:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression())
])
# X_train and y_train come from the train/test split in section 1
pipe.fit(X_train, y_train)
joblib.dump(pipe, 'pipeline.pkl')

Explanation:

  • Saves both preprocessing and model steps
  • Useful for production deployments
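To see that the preprocessing state really travels with the pipeline, the sketch below reloads a saved pipeline and compares the scaler's fitted mean_ attribute with the original. It assumes the same step names as above ('scaler' and 'lr'); the temporary file name is illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = str(Path(tmp_dir) / "pipeline.joblib")
    joblib.dump(pipe, pipeline_path)
    loaded_pipe = joblib.load(pipeline_path)

# The scaler's learned statistics are restored along with the classifier
scaler_restored = np.allclose(
    pipe.named_steps["scaler"].mean_,
    loaded_pipe.named_steps["scaler"].mean_,
)

# Raw, unscaled features go in; the loaded pipeline scales them internally
y_pred = loaded_pipe.predict(X_test)
print("Scaler state restored:", scaler_restored)
```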

Real-Life Project: Save and Reload KNN Pipeline

Project Name

Reusable KNN Pipeline

Project Overview

Train a KNN model with preprocessing and persist it for reuse.

Project Goal

Save, reload, and reuse a Scikit-learn pipeline with minimal reconfiguration.

Code for This Project

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import joblib

# Prepare data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
pipe.fit(X_train, y_train)

# Save pipeline
joblib.dump(pipe, 'knn_pipeline.pkl')

# Load pipeline
loaded_pipe = joblib.load('knn_pipeline.pkl')
print("Loaded Pipeline Accuracy:", loaded_pipe.score(X_test, y_test))

Expected Output

  • Prints the test-set accuracy of the reloaded pipeline
  • The reloaded pipeline gives the same predictions and accuracy as the original pipeline

Common Mistakes to Avoid

  • ❌ Saving only the model without preprocessing steps
  • ❌ Forgetting to test the reloaded model
  • ❌ Using pickle with large NumPy arrays (prefer joblib)
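The first mistake in the list can be illustrated with a small experiment: a KNN model trained on scaled features is saved on its own, then fed raw features after reloading, and compared against a saved pipeline that carries its scaler. This is a sketch under the assumption that the caller forgets to re-apply scaling; the variable and file names are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Mistake: the scaler is fitted separately and never saved
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)

# Safer: scaler and model live in one pipeline object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
]).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    knn_path = str(Path(tmp_dir) / "knn_only.joblib")
    pipe_path = str(Path(tmp_dir) / "knn_pipeline.joblib")
    joblib.dump(knn, knn_path)
    joblib.dump(pipe, pipe_path)
    knn_only = joblib.load(knn_path)
    full_pipe = joblib.load(pipe_path)

# The bare model receives unscaled inputs because the scaler was lost
knn_only_accuracy = knn_only.score(X_test, y_test)
pipeline_accuracy = full_pipe.score(X_test, y_test)
print("Model saved without scaler:", knn_only_accuracy)
print("Full pipeline:             ", pipeline_accuracy)
```

The bare model still loads and predicts without raising an error, which is what makes this mistake easy to miss.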
