Bagging vs Boosting in Scikit-learn

Bagging (Bootstrap Aggregating) and Boosting are two ensemble methods with distinct approaches to improving model performance. While both combine multiple models, Bagging builds them in parallel to reduce variance, whereas Boosting builds them sequentially to reduce bias.

Key Differences

Feature                      Bagging                             Boosting
Model Training               Parallel                            Sequential
Focus                        Reduce Variance                     Reduce Bias
Model Independence           Independent Learners                Dependent Learners
Performance on Overfitting   Helps avoid overfitting             May overfit if not tuned
Example Algorithms           Random Forest, BaggingClassifier    AdaBoost, GradientBoostingClassifier

Syntax Comparison

Bagging

What is it?
A parallel ensemble method that trains each base learner independently on a random bootstrap sample of the training data.

Syntax:

from sklearn.ensemble import BaggingClassifier
model = BaggingClassifier(n_estimators=10)

Explanation:

  • Reduces variance by averaging predictions from diverse models.
  • Suitable for high-variance base learners such as deep decision trees (see the sketch below).
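
The sketch below makes the bootstrap sampling explicit. It is a minimal sketch, not a recipe: the make_classification data, the decision-tree base learner, and the parameter values are illustrative choices. Note that recent scikit-learn releases pass the base learner as estimator (older releases used base_estimator).

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy data
X, y = make_classification(n_samples=500, random_state=0)

# Each tree is trained on a bootstrap sample of 80% of the rows;
# "estimator" is the parameter name in scikit-learn >= 1.2.
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=25,
    max_samples=0.8,
    oob_score=True,
    random_state=0,
)
model.fit(X, y)

# With oob_score=True, each tree is scored on the rows left out of
# its bootstrap sample, giving a quick check without a held-out set.
print("Out-of-bag accuracy:", round(model.oob_score_, 3))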

Boosting

What is it?
A sequential ensemble method in which each new model focuses on correcting the mistakes made by the previous ones.

Syntax:

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)

Explanation:

  • Reduces bias by iteratively improving weak learners (see the sketch below).
  • Effective on structured/tabular datasets.
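
One way to observe the sequential error correction is staged_predict, which replays the ensemble's predictions after each boosting iteration. A minimal sketch on an illustrative make_classification dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative toy data
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each added tree,
# so test accuracy can be tracked as later trees correct earlier errors.
for i, y_pred in enumerate(model.staged_predict(X_test), start=1):
    if i % 25 == 0:
        print(f"After {i} trees: accuracy = {accuracy_score(y_test, y_pred):.3f}")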

Real-Life Use Case

Dataset

A binary classification task on tabular data, analogous to customer churn prediction. The code below uses scikit-learn's built-in breast cancer dataset as a convenient stand-in.

Code Example

from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging (random_state fixed for reproducibility)
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)

# Boosting
boosting = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
boosting.fit(X_train, y_train)
boost_pred = boosting.predict(X_test)

# Results
print("Bagging Accuracy:", accuracy_score(y_test, bag_pred))
print("Boosting Accuracy:", accuracy_score(y_test, boost_pred))

Expected Output

  • Two accuracy scores, one for Bagging and one for Boosting, for side-by-side comparison.
  • Boosting often edges out Bagging on clean, structured datasets, but a single split is noisy; the cross-validation sketch below gives a steadier comparison.
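
Because a single train/test split is noisy, a steadier comparison averages accuracy over several folds. A minimal sketch reusing the same estimator settings with 5-fold cross-validation:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Mean and spread of accuracy across 5 folds for each ensemble
for name, model in [
    ("Bagging", BaggingClassifier(n_estimators=50, random_state=42)),
    ("Boosting", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")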

Common Mistakes

  • ❌ Not tuning learning_rate or n_estimators in Boosting (see the tuning sketch below).
  • ❌ Using Boosting on small or noisy datasets, where it tends to overfit the noise.
  • ❌ Assuming Bagging always improves weak learners; it mainly helps high-variance, low-bias models.
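
To address the first mistake, learning_rate and n_estimators are best searched jointly, since smaller learning rates usually need more estimators. A minimal tuning sketch; the grid values are illustrative, not recommendations:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Small illustrative grid over the two most influential parameters
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.3],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))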

When to Use What?

Scenario                            Preferred Method
High variance, low bias             Bagging
High bias, complex data patterns    Boosting
Small dataset with noise            Bagging
Structured/tabular large dataset    Boosting
