The Breast Cancer Wisconsin (Diagnostic) dataset is a widely used benchmark for binary classification. It contains features computed from digitized images of fine needle aspirates of breast masses and is used to classify tumors as malignant or benign. Scikit-learn ships this dataset and exposes it through load_breast_cancer().
Key Characteristics
- Binary classification task
- Target: 0 (malignant), 1 (benign)
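- Features: 30 numeric measurements (mean, standard error, and worst value of radius, texture, perimeter, area, etc.)
- Clean dataset with no missing values, but moderately imbalanced: 212 malignant and 357 benign samples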
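The characteristics listed above can be checked directly on the loaded data. The following is a minimal sketch; the variable names (data, class_counts) are illustrative choices, not part of the original text.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the full Bunch object so the metadata is available too
data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)            # (569, 30): 569 samples, 30 features
print(data.target_names)  # ['malignant' 'benign'], so 0 = malignant, 1 = benign

# Count how many samples fall into each class
class_counts = {
    str(name): int(count)
    for name, count in zip(data.target_names, np.bincount(y))
}
print(class_counts)       # {'malignant': 212, 'benign': 357}
```

The counts show roughly a 37/63 split between the two classes, which is why per-class metrics are worth reading alongside accuracy.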
Basic Rules
- Standardize features before model training
- Use accuracy, precision, recall, and F1 for evaluation
- Try multiple classifiers (e.g., LogisticRegression, KNN, RandomForest)
- Use stratify=y in the train-test split to preserve class proportions
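One way to follow the "try multiple classifiers" rule is to score each candidate with the same stratified cross-validation folds. This is a sketch under assumptions: the hyperparameters (max_iter=1000, n_neighbors=5, n_estimators=200), the macro-averaged F1 metric, and the 5-fold setup are illustrative choices, not recommendations from the original text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive models get a StandardScaler inside a pipeline, so the
# scaler is re-fit on the training part of every fold.
candidates = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Identical stratified folds for every model keep the comparison fair
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X, y, cv=cv, scoring="f1_macro")
    scores[name] = fold_scores.mean()
    print(f"{name}: macro F1 = {scores[name]:.3f}")
```

Tree ensembles such as RandomForest do not depend on feature scale, which is why that candidate is left unscaled here.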
Syntax Table
SL NO | Step | Syntax Example | Description |
---|---|---|---|
1 | Load dataset | load_breast_cancer(return_X_y=True) | Loads features and target labels |
2 | Train/test split | train_test_split(X, y, stratify=y, test_size=0.3) | Preserves class proportions in both splits |
3 | Standard scaling | StandardScaler().fit_transform(X_train) | Standardizes features to zero mean and unit variance |
4 | Train classifier | LogisticRegression().fit(X_train, y_train) | Trains a classification model |
5 | Evaluate model | classification_report(y_test, y_pred) | Shows precision, recall, F1, accuracy |
Syntax Explanation
1. Load Dataset
What is it?
Loads the Breast Cancer Wisconsin dataset from Scikit-learn.
Syntax:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
Explanation:
- X contains feature measurements
- y contains 0 or 1 indicating cancer class
2. Train/Test Split
What is it?
Splits the dataset while maintaining class proportions.
Syntax:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
Explanation:
- Ensures fair representation of both classes in train and test sets
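To see what stratification does in practice, the benign fraction can be compared across the full data and both splits. This is a small sketch; the names benign_full, benign_train, and benign_test are hypothetical helpers introduced for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# Because benign is encoded as 1, the mean of y is the benign fraction
benign_full = y.mean()
benign_train = y_train.mean()
benign_test = y_test.mean()

print(f"full:  {benign_full:.3f}")
print(f"train: {benign_train:.3f}")
print(f"test:  {benign_test:.3f}")
```

With stratify=y the three fractions should agree to within rounding; without it, the test fraction can drift by several percentage points on a dataset of this size.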
3. Standard Scaling
What is it?
Applies standard scaling (mean=0, std=1) to features.
Syntax:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Explanation:
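- Improves convergence and performance of scale-sensitive models such as logistic regression and KNN
- The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training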
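The effect of the scaler can be verified numerically. The sketch below keeps the scaled arrays under separate names (X_train_scaled, X_test_scaled, chosen here for illustration) so they can be compared with the raw data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Training columns end up with mean 0 and standard deviation 1
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print(np.abs(train_means).max())
print(train_stds.min(), train_stds.max())

# Test columns are close to, but not exactly, mean 0: they were
# transformed with statistics estimated on the training set.
print(np.abs(X_test_scaled.mean(axis=0)).max())
```

A small nonzero mean on the test set is expected and is not a sign of a bug.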
4. Train Classifier
What is it?
Trains a classification model like Logistic Regression.
Syntax:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
Explanation:
- Learns the decision boundary separating benign vs malignant
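Beyond hard labels, a fitted LogisticRegression also exposes class probabilities, which helps when reading borderline cases. This is a minimal sketch; taking the first five test rows is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)

# Column order of predict_proba follows model.classes_
print(model.classes_)  # [0 1]: column 0 = malignant, column 1 = benign

proba = model.predict_proba(X_test[:5])
for p_malignant, p_benign in proba:
    print(f"P(malignant) = {p_malignant:.3f}, P(benign) = {p_benign:.3f}")
```

Each row sums to 1, and model.predict returns the class with the larger probability.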
5. Evaluate Model
What is it?
Generates classification performance metrics.
Syntax:
from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Explanation:
- Outputs accuracy, precision, recall, and F1-score
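Since malignant is encoded as 0, it helps to label the report and to look at the confusion matrix with malignant as the class of interest. The sketch below assumes that missed malignant cases are the costlier error; missed_malignant and malignant_recall are names introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Named rows make the report easier to read than bare 0/1 labels
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))

# Rows are true classes, columns are predictions, in the order [0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
missed_malignant = cm[0, 1]  # true malignant predicted as benign
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

print(cm)
print(f"missed malignant cases: {missed_malignant}")
print(f"malignant recall: {malignant_recall:.3f}")
```

Passing pos_label=0 matters here: by default recall_score treats label 1 (benign) as the positive class.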
Real-Life Project: Tumor Classification
Project Name
Breast Cancer Detection System
Project Overview
Train a model to detect breast cancer from cell nuclei features.
Project Goal
Develop a classifier that predicts whether a tumor is malignant or benign.
Code for This Project
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Load data
X, y = load_breast_cancer(return_X_y=True)
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict & Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Expected Output
- Classification metrics: accuracy, precision, recall, F1
- Test accuracy typically above 95% for common classifiers on this dataset, though the exact figure varies with the split and the model
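A single train/test split gives one accuracy number; cross-validation shows how much that number moves between splits. The following is a sketch under assumptions: the 5-fold setup and the custom malignant-recall scorer are illustrative additions, not part of the original project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same steps as the project, wrapped in a pipeline so scaling is
# re-fit inside each fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

scoring = {
    "accuracy": "accuracy",
    # Recall computed with malignant (label 0) as the positive class
    "malignant_recall": make_scorer(recall_score, pos_label=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)

mean_accuracy = results["test_accuracy"].mean()
mean_malignant_recall = results["test_malignant_recall"].mean()
print(f"accuracy: {mean_accuracy:.3f} +/- {results['test_accuracy'].std():.3f}")
print(f"malignant recall: {mean_malignant_recall:.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two models is meaningful.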
Common Mistakes to Avoid
- ❌ Not scaling features before training
- ❌ Ignoring recall and F1 in favor of accuracy
- ❌ Not stratifying data during split