Random Forest Classifier in Scikit-learn

Random Forest is an ensemble learning algorithm that builds multiple decision trees and aggregates their predictions, by majority vote for classification and by averaging for regression, to improve accuracy and control overfitting. It is effective for both classification and regression tasks; in Scikit-learn, the classifier is implemented as RandomForestClassifier.

Key Characteristics of Random Forest

  • Ensemble of Decision Trees: Combines the output of several decision trees.
  • Reduces Overfitting: Averages multiple models to improve generalization.
  • Robust to Noisy Data: Averaging over many trees makes the ensemble more tolerant of noise and outliers than a single tree (note that in most scikit-learn versions, missing values still need to be imputed before fitting).
  • Feature Importance: Provides insights into which features matter most.
  • Parallelizable: Trees are independent, so they can be built in parallel (n_jobs=-1) to improve speed, as the sketch below shows.
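
A minimal sketch of these characteristics on a synthetic dataset (make_classification and all parameter values here are illustrative, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real classification problem
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# n_jobs=-1 builds the trees in parallel across all available CPU cores
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X, y)

print(len(model.estimators_))      # the ensemble: 100 individual decision trees
print(model.feature_importances_)  # relative importance of each feature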

Basic Rules for Using Random Forest

  • Set n_estimators to define the number of trees.
  • Use max_depth, min_samples_split to control overfitting.
  • Feature scaling is not required, but categorical features must be encoded numerically (e.g., with one-hot encoding).
  • Use cross-validation to tune hyperparameters (see the sketch after this list).
  • More trees generally improve performance, up to a point.
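
As an example of the tuning rule above, here is a minimal cross-validation sketch with GridSearchCV; the grid values are illustrative starting points, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=500, random_state=42)

# Illustrative grid over the overfitting controls mentioned above
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)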

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Import Classifier | from sklearn.ensemble import RandomForestClassifier | Imports the Random Forest Classifier
2 | Instantiate Model | model = RandomForestClassifier(n_estimators=100) | Initializes a classifier with 100 trees
3 | Fit Model | model.fit(X_train, y_train) | Trains the ensemble model
4 | Predict Labels | model.predict(X_test) | Predicts class labels
5 | Feature Importances | model.feature_importances_ | Shows the importance of each feature

Syntax Explanation

1. Import and Instantiate

  • What is it? Load and initialize the random forest model.
  • Syntax:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
  • Explanation:
    • n_estimators: Number of trees in the forest.
    • random_state: Ensures reproducibility.
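
Beyond these two arguments, a few other constructor parameters appear throughout this section; the values below are illustrative examples, not tuned settings:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=10,          # limit tree depth to control overfitting
    min_samples_split=5,   # require 5 samples before splitting a node
    n_jobs=-1,             # build trees in parallel
    random_state=42,       # reproducibility
)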

2. Fit the Model

  • What is it? Train the model on the dataset.
  • Syntax:
model.fit(X_train, y_train)
  • Explanation:
    • Each tree is trained on a bootstrap sample of the rows, and each split considers a random subset of the features.
    • For classification, the final prediction is the majority vote across all trees.
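
A minimal, self-contained fit, with make_classification standing in for real training data (all values illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set
X_train, y_train = make_classification(n_samples=200, n_features=4,
                                       random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)  # each tree sees its own bootstrap sample of the rows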

3. Predict Labels

  • What is it? Predict class for test instances.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Combines the outputs of all trees to make the final decision.
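
Besides hard labels, the forest exposes per-class vote fractions via predict_proba; the sketch below assumes model has been fitted and X_test holds held-out rows with the same features as the training data:

# Hard labels: the majority vote across all trees
y_pred = model.predict(X_test)

# Per-class vote fractions; each row sums to 1 across the classes
y_proba = model.predict_proba(X_test)
print(y_proba[:3])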

4. Feature Importance

  • What is it? Shows which features were most useful.
  • Syntax:
importances = model.feature_importances_
  • Explanation:
    • Higher values indicate more influential features; the impurity-based importances sum to 1.
    • Useful for feature selection.
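
A common way to inspect the importances is to pair each value with its feature name; the sketch below uses pandas for convenience and assumes feature_names holds the training column names:

import pandas as pd

# Pair each importance with its (assumed) feature name and sort descending
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head())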

Real-Life Project: Fraud Detection with Random Forest

Project Name

Detecting Credit Card Fraud with Random Forest

Project Overview

Use a random forest classifier to detect fraudulent credit card transactions using a dataset with anonymized features.

Project Goal

  • Build a robust fraud classifier
  • Evaluate metrics like precision and recall
  • Interpret feature importance

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load dataset
data = pd.read_csv('creditcard.csv')
X = data.drop('Class', axis=1)
y = data['Class']

# Split data (stratify preserves the rare fraud-class ratio in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate (accuracy alone is misleading on imbalanced data; check precision/recall)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Feature importance
importances = model.feature_importances_
print("Top Feature Importances:\n", sorted(zip(importances, X.columns), reverse=True)[:5])

Expected Output

  • Accurate fraud detection classifier
  • Confusion matrix and precision/recall report
  • Feature ranking list

Common Mistakes to Avoid

  • ❌ Using too few trees → underfitting
  • ❌ Not tuning hyperparameters (e.g., max_depth, min_samples_split)
  • ❌ Ignoring imbalanced classes (consider class_weight='balanced'; see the sketch after this list)
  • ❌ Overfitting on small datasets
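
For the imbalanced-classes point above, a minimal sketch of the class_weight remedy (the other parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely proportional to their frequencies,
# upweighting the rare fraud class during training
model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
)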

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
