Random Forest is an ensemble learning algorithm that builds many decision trees and combines their predictions to improve accuracy and reduce overfitting. It works well for both classification and regression tasks. In Scikit-learn, it is implemented as RandomForestClassifier for classification and RandomForestRegressor for regression.
Key Characteristics of Random Forest
- Ensemble of Decision Trees: Combines the output of several decision trees.
- Reduces Overfitting: Averages multiple models to improve generalization.
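- Handles Noisy Data: Averaging many trees makes it more robust to noise than a single tree. Missing values usually still need imputation, since native NaN support depends on the Scikit-learn version.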
- Feature Importance: Provides insights into which features matter most.
- Parallelizable: Trees can be built in parallel to improve speed.
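As a small illustration of the last point, the sketch below builds a forest with n_jobs=-1, which asks Scikit-learn to use all available CPU cores. The synthetic dataset from make_classification and the tree count are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# n_jobs=-1 requests all available cores for building the trees.
model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
model.fit(X, y)

print("Trees built:", len(model.estimators_))
```

The speed-up depends on the machine and dataset size; on very small datasets the parallel overhead can outweigh the gain.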
Basic Rules for Using Random Forest
- Set n_estimators to define the number of trees.
- Use max_depth and min_samples_split to control overfitting.
- Scaling is not required, but categorical data must be encoded.
- Use cross-validation to tune hyperparameters.
- More trees generally improve performance, up to a point.
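The rules above can be sketched in one pipeline: a categorical column is one-hot encoded, the numeric column is passed through unscaled, and cross-validation searches over n_estimators, max_depth, and min_samples_split. The toy DataFrame, its column names (amount, channel), and the parameter grid are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric and one categorical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.normal(100, 30, size=200),
    "channel": rng.choice(["web", "store", "phone"], size=200),
})
y = (df["amount"] > 100).astype(int)  # synthetic target, not real labels

# Encode the categorical column; leave the numeric column unscaled.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(random_state=42)),
])

# Illustrative grid; real projects would pick values suited to their data.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [3, None],
    "forest__min_samples_split": [2, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(df, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

Putting the encoder inside the pipeline means it is refit on each training fold, which avoids leaking information from the validation fold.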
Syntax Table
| SL NO | Function | Syntax Example | Description |
|---|---|---|---|
| 1 | Import Classifier | from sklearn.ensemble import RandomForestClassifier | Imports the Random Forest Classifier |
| 2 | Instantiate Model | model = RandomForestClassifier(n_estimators=100) | Initializes classifier with 100 trees |
| 3 | Fit Model | model.fit(X_train, y_train) | Trains the ensemble model |
| 4 | Predict Labels | model.predict(X_test) | Predicts class labels |
| 5 | Feature Importances | model.feature_importances_ | Shows importance of each feature |
Syntax Explanation
1. Import and Instantiate
- What is it? Load and initialize the random forest model.
- Syntax:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
- Explanation:
- n_estimators: Number of trees in the forest.
- random_state: Ensures reproducibility.
2. Fit the Model
- What is it? Train the model on the dataset.
- Syntax:
model.fit(X_train, y_train)
- Explanation:
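- Each tree is trained on a bootstrap sample of the training data (rows drawn with replacement) and considers a random subset of features at each split.
- The final prediction combines all trees: Scikit-learn averages the trees' predicted class probabilities and picks the most likely class, which usually agrees with a majority vote.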
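Because each tree sees only a resampled copy of the training rows, the rows a tree never saw can serve as a built-in validation set. The sketch below, on synthetic data (an assumption for illustration), turns this on with oob_score=True and reads the resulting out-of-bag accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,       # each tree trains on rows drawn with replacement
    max_features="sqrt",  # each split considers a random subset of features
    oob_score=True,       # score each row using only trees that did not see it
    random_state=42,
)
model.fit(X, y)

print("Trees built:", len(model.estimators_))
print("Out-of-bag accuracy:", model.oob_score_)
```

The out-of-bag score is a rough generalization estimate that needs no separate hold-out set; it does not replace a proper test split for final evaluation.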
3. Predict Labels
- What is it? Predict class for test instances.
- Syntax:
y_pred = model.predict(X_test)
- Explanation:
- Combines the outputs of all trees to make the final decision.
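To make the combination step concrete, the sketch below averages the class probabilities of the individual trees by hand and checks that the result matches the forest's own predict_proba. The synthetic data and the small forest size are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=42)
model.fit(X, y)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X) for tree in model.estimators_])

# Average over trees, then pick the most probable class per sample.
mean_proba = tree_probas.mean(axis=0)
manual_pred = model.classes_[mean_proba.argmax(axis=1)]

print("Averaged probabilities match the forest:",
      np.allclose(mean_proba, model.predict_proba(X)))
```

This averaging of probabilities is sometimes called soft voting; it can differ from a strict one-tree-one-vote count when individual trees are uncertain.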
4. Feature Importance
- What is it? Shows which features were most useful.
- Syntax:
importances = model.feature_importances_
- Explanation:
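- Higher values mean more important features; the scores are impurity-based and sum to 1.
- Useful for feature selection, though impurity-based scores can favor features with many distinct values.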
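A ranked view is often easier to read than the raw array. The sketch below sorts feature_importances_ with pandas and, as a cross-check, computes permutation importance on held-out data using sklearn.inspection.permutation_importance. The synthetic data and the feature names f0 to f5 are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, computed from the training data.
impurity_rank = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

# Permutation importances, computed on held-out data.
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
perm_rank = pd.Series(
    perm.importances_mean, index=feature_names
).sort_values(ascending=False)

print("Impurity-based ranking:\n", impurity_rank)
print("Permutation ranking:\n", perm_rank)
```

When the two rankings disagree strongly, the permutation result is usually the safer guide for feature selection because it is measured on data the model has not seen.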
Real-Life Project: Fraud Detection with Random Forest
Project Name
Detecting Credit Card Fraud with Random Forest
Project Overview
Use a random forest classifier to detect fraudulent credit card transactions using a dataset with anonymized features.
Project Goal
- Build a robust fraud classifier
- Evaluate metrics like precision and recall
- Interpret feature importance
Code for This Project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Load dataset
data = pd.read_csv('creditcard.csv')
X = data.drop('Class', axis=1)
y = data['Class']
# Split data (stratify=y keeps the fraud ratio the same in train and test sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate (accuracy alone is misleading on imbalanced data; check precision and recall too)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Feature importance
importances = model.feature_importances_
print("Top Feature Importances:\n", sorted(zip(importances, X.columns), reverse=True)[:5])
Expected Output
- Accuracy score for the trained fraud classifier
- Confusion matrix and precision/recall report
- Feature ranking list
Common Mistakes to Avoid
- ❌ Using too few trees → unstable, high-variance predictions
- ❌ Not tuning hyperparameters (e.g., max_depth, min_samples_split)
- ❌ Ignoring imbalanced classes (consider class_weight='balanced')
- ❌ Overfitting on small datasets
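The imbalanced-classes point can be sketched as follows: a stratified split preserves the rare-class ratio, and class_weight='balanced' reweights classes inversely to their frequency. The synthetic dataset with roughly 3% positives stands in for real fraud data and is an assumption, so the printed scores say nothing about the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 3% positives (an assumption for illustration).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)

# stratify=y keeps the positive ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# class_weight="balanced" weights classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print("Positive ratio (train, test):", y_train.mean(), y_test.mean())
print("Precision:", precision, "Recall:", recall)
```

Whether class weighting helps depends on the data; comparing precision and recall with and without it under cross-validation is a reasonable way to decide.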
Further Reading Recommendation
📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning
by Sarful Hassan
🔗 Available on Amazon
Also explore:
- 🔗 Scikit-learn Random Forest Docs: https://scikit-learn.org/stable/modules/ensemble.html#random-forests
- 🔗 Tree Ensembles Explained (Google Developers)
- 🔗 Kaggle Notebooks on Fraud Detection
