Random Forest Classifier in Scikit-learn

Random Forest is an ensemble learning algorithm that builds multiple decision trees and aggregates their predictions, by majority vote for classification and by averaging for regression, to improve accuracy and control overfitting. It is effective for both classification and regression tasks; in Scikit-learn, the classifier is implemented as RandomForestClassifier.

Key Characteristics of Random Forest

  • Ensemble of Decision Trees: Combines the output of several decision trees.
  • Reduces Overfitting: Averages multiple models to improve generalization.
  • Robust to Noisy Data: Averaging over many trees makes the ensemble more tolerant of noise and outliers than a single tree (note that in most scikit-learn versions, missing values still need to be imputed before fitting).
  • Feature Importance: Provides insights into which features matter most.
  • Parallelizable: Trees are independent, so they can be built in parallel (n_jobs=-1) to improve speed, as the sketch below shows.
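
A minimal sketch of these characteristics on a synthetic dataset (make_classification and all parameter values here are illustrative, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real classification problem
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# n_jobs=-1 builds the trees in parallel across all available CPU cores
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X, y)

print(len(model.estimators_))      # the ensemble: 100 individual decision trees
print(model.feature_importances_)  # relative importance of each feature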

Basic Rules for Using Random Forest

  • Set n_estimators to define the number of trees.
  • Use max_depth, min_samples_split to control overfitting.
  • Feature scaling is not required, but categorical features must be encoded numerically (e.g., with one-hot encoding).
  • Use cross-validation to tune hyperparameters (see the sketch after this list).
  • More trees generally improve performance, up to a point.
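
As an example of the tuning rule above, here is a minimal cross-validation sketch with GridSearchCV; the grid values are illustrative starting points, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=500, random_state=42)

# Illustrative grid over the overfitting controls mentioned above
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)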

Syntax Table

SL NO | Function | Syntax Example | Description
1 | Import Classifier | from sklearn.ensemble import RandomForestClassifier | Imports the Random Forest Classifier
2 | Instantiate Model | model = RandomForestClassifier(n_estimators=100) | Initializes a classifier with 100 trees
3 | Fit Model | model.fit(X_train, y_train) | Trains the ensemble model
4 | Predict Labels | model.predict(X_test) | Predicts class labels
5 | Feature Importances | model.feature_importances_ | Shows the importance of each feature

Syntax Explanation

1. Import and Instantiate

  • What is it? Load and initialize the random forest model.
  • Syntax:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
  • Explanation:
    • n_estimators: Number of trees in the forest.
    • random_state: Ensures reproducibility.
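
Beyond these two arguments, a few other constructor parameters appear throughout this section; the values below are illustrative examples, not tuned settings:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=10,          # limit tree depth to control overfitting
    min_samples_split=5,   # require 5 samples before splitting a node
    n_jobs=-1,             # build trees in parallel
    random_state=42,       # reproducibility
)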

2. Fit the Model

  • What is it? Train the model on the dataset.
  • Syntax:
model.fit(X_train, y_train)
  • Explanation:
    • Each tree is trained on a bootstrap sample of the rows, and each split considers a random subset of the features.
    • For classification, the final prediction is the majority vote across all trees.
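
A minimal, self-contained fit, with make_classification standing in for real training data (all values illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set
X_train, y_train = make_classification(n_samples=200, n_features=4,
                                       random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)  # each tree sees its own bootstrap sample of the rows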

3. Predict Labels

  • What is it? Predict class for test instances.
  • Syntax:
y_pred = model.predict(X_test)
  • Explanation:
    • Combines the outputs of all trees to make the final decision.
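
Besides hard labels, the forest exposes per-class vote fractions via predict_proba; the sketch below assumes model has been fitted and X_test holds held-out rows with the same features as the training data:

# Hard labels: the majority vote across all trees
y_pred = model.predict(X_test)

# Per-class vote fractions; each row sums to 1 across the classes
y_proba = model.predict_proba(X_test)
print(y_proba[:3])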

4. Feature Importance

  • What is it? Shows which features were most useful.
  • Syntax:
importances = model.feature_importances_
  • Explanation:
    • Higher values indicate more influential features; the impurity-based importances sum to 1.
    • Useful for feature selection.
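
A common way to inspect the importances is to pair each value with its feature name; the sketch below uses pandas for convenience and assumes feature_names holds the training column names:

import pandas as pd

# Pair each importance with its (assumed) feature name and sort descending
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head())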

Real-Life Project: Fraud Detection with Random Forest

Project Name

Detecting Credit Card Fraud with Random Forest

Project Overview

Use a random forest classifier to detect fraudulent credit card transactions using a dataset with anonymized features.

Project Goal

  • Build a robust fraud classifier
  • Evaluate metrics like precision and recall
  • Interpret feature importance

Code for This Project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load dataset
data = pd.read_csv('creditcard.csv')
X = data.drop('Class', axis=1)
y = data['Class']

# Split data (stratify preserves the rare fraud-class ratio in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate (accuracy alone is misleading on imbalanced data; check precision/recall)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Feature importance
importances = model.feature_importances_
print("Top Feature Importances:\n", sorted(zip(importances, X.columns), reverse=True)[:5])

Expected Output

  • Accurate fraud detection classifier
  • Confusion matrix and precision/recall report
  • Feature ranking list

Common Mistakes to Avoid

  • ❌ Using too few trees → underfitting
  • ❌ Not tuning hyperparameters (e.g., max_depth, min_samples_split)
  • ❌ Ignoring imbalanced classes (consider class_weight='balanced'; see the sketch after this list)
  • ❌ Overfitting on small datasets
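
For the imbalanced-classes point above, a minimal sketch of the class_weight remedy (the other parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely proportional to their frequencies,
# upweighting the rare fraud class during training
model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
)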

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
