Real-World Dataset: Breast Cancer Detection with Scikit-learn

The Breast Cancer Wisconsin (Diagnostic) dataset is a widely used benchmark for binary classification. It contains 569 samples with 30 numeric features computed from digitized images of fine needle aspirates (FNA) of breast masses, and each sample is labeled malignant or benign. Scikit-learn ships the dataset and exposes it via load_breast_cancer().

Key Characteristics

  • Binary classification task
  • Target: 0 (malignant), 1 (benign)
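  • Features: 30 numeric measurements (mean, standard error, and worst value of radius, texture, perimeter, area, etc.)
  • Clean (no missing values) but mildly imbalanced: 212 malignant vs. 357 benign samples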
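The characteristics listed above can be checked directly on the loaded data. The following is a minimal sketch; the variable names (class_counts, share) are illustrative and not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the full Bunch object so the label names are available too
data = load_breast_cancer()
X, y = data.data, data.target

n_samples, n_features = X.shape
class_counts = np.bincount(y)  # index 0 = malignant, index 1 = benign

print(f"samples={n_samples}, features={n_features}")
for label, name in enumerate(data.target_names):
    share = class_counts[label] / n_samples
    print(f"{label} ({name}): {class_counts[label]} samples, {share:.1%}")
```

Because the two classes are not the same size, accuracy alone can look better than the model really is, which is why the later steps stratify the split and report per-class metrics.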

Basic Rules

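  • Standardize features before training scale-sensitive models (e.g., LogisticRegression, KNN), and fit the scaler on the training set only
  • Report precision, recall, and F1 alongside accuracy
  • Try multiple classifiers (e.g., LogisticRegression, KNN, RandomForest)
  • Use stratify=y in train_test_split to preserve class proportions in both sets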
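One way to follow these rules together is to wrap the scaler and each classifier in a pipeline and compare them with stratified cross-validation. This is a sketch under assumptions that are not in the original text: the choice of 5 folds, macro-averaged F1 as the score, max_iter=1000, and the dictionary names candidates and results are all illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale-sensitive models get a StandardScaler inside the pipeline, so the
# scaler is re-fit on the training part of every fold (no leakage).
candidates = {
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    # Tree ensembles do not depend on feature scale, so no scaler here.
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stratified folds keep the malignant/benign ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="f1_macro")
    results[name] = scores.mean()
    print(f"{name}: mean macro F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Cross-validation gives a steadier comparison than a single train/test split, at the cost of training each model several times.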

Syntax Table

No. | Step | Syntax Example | Description
1 | Load dataset | load_breast_cancer(return_X_y=True) | Loads features and target labels
2 | Train/test split | train_test_split(X, y, stratify=y, test_size=0.3) | Splits the data while preserving class proportions
3 | Standard scaling | StandardScaler().fit_transform(X_train) | Standardizes features to zero mean and unit variance
4 | Train classifier | LogisticRegression().fit(X_train, y_train) | Trains a classification model
5 | Evaluate model | classification_report(y_test, y_pred) | Shows per-class precision, recall, and F1, plus accuracy

Syntax Explanation

1. Load Dataset

What is it?
Loads the Breast Cancer Wisconsin dataset from Scikit-learn.

Syntax:

from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

Explanation:

  • X is a NumPy array of shape (569, 30) containing the feature measurements
  • y contains one label per sample: 0 (malignant) or 1 (benign)
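When the feature names are needed, the same function can be called without return_X_y=True, in which case it returns a Bunch object that carries metadata alongside the arrays. A brief sketch (feature_names and label_names are illustrative variable names):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Without return_X_y, the loader returns a Bunch with data and metadata
data = load_breast_cancer()
X, y = load_breast_cancer(return_X_y=True)

# The tuple form returns the same arrays that the Bunch exposes
same_features = np.array_equal(X, data.data)
same_labels = np.array_equal(y, data.target)

feature_names = [str(name) for name in data.feature_names]
label_names = {i: str(name) for i, name in enumerate(data.target_names)}

print(same_features, same_labels)
print(feature_names[:5])
print(label_names)
```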

2. Train/Test Split

What is it?
Splits the dataset while maintaining class proportions.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

Explanation:

  • stratify=y keeps the malignant/benign ratio roughly the same in the training and test sets
  • random_state=42 makes the split reproducible
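The effect of stratification can be confirmed by comparing the share of benign samples before and after the split. This is a small check, not part of the original workflow; benign_share is a hypothetical helper defined here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)


def benign_share(labels):
    """Fraction of samples labeled 1 (benign)."""
    return float(np.mean(labels == 1))


print(f"full set : {benign_share(y):.3f}")
print(f"train set: {benign_share(y_train):.3f} ({len(y_train)} samples)")
print(f"test set : {benign_share(y_test):.3f} ({len(y_test)} samples)")
```

The three shares should agree to within about one sample's worth of rounding.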

3. Standard Scaling

What is it?
Applies standard scaling (mean=0, std=1) to features.

Syntax:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Explanation:

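  • Improves convergence and performance of scale-sensitive models such as logistic regression and KNN
  • The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training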
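A quick way to see what the scaler does is to inspect the column statistics afterwards. The sketch below uses separate variable names (X_train_scaled, X_test_scaled) purely so the unscaled arrays stay available; that naming is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("max |train mean|:", np.abs(train_means).max())
print("train std range :", train_stds.min(), train_stds.max())
print("max |test mean| :", np.abs(test_means).max())
```

The training columns come out with mean 0 and standard deviation 1 (up to floating-point error). The test columns are only close to that, which is expected because they were transformed with the training set's statistics.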

4. Train Classifier

What is it?
Trains a classification model like Logistic Regression.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

Explanation:

  • Learns the decision boundary separating benign vs malignant
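Beyond hard predictions, a fitted logistic regression also exposes class probabilities and one coefficient per feature. The following sketch assumes the same split and scaling as above; looking at only the first three test rows is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)

# Columns of predict_proba follow the order of model.classes_:
# column 0 = P(malignant), column 1 = P(benign)
proba = model.predict_proba(X_test[:3])
print("classes:", model.classes_)
print("probabilities:\n", np.round(proba, 3))

# One weight per feature; positive weights push toward class 1 (benign)
print("coefficient shape:", model.coef_.shape)
```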

5. Evaluate Model

What is it?
Generates classification performance metrics.

Syntax:

from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Explanation:

  • Outputs precision, recall, F1-score, and support for each class, plus overall accuracy
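Since malignant is encoded as 0 in this dataset, while scikit-learn's single-number metrics treat label 1 as the positive class by default, it can help to compute malignant recall explicitly. This is a sketch rather than part of the original workflow; passing pos_label=0 is the key assumption, and the names cm and malignant_recall are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true labels, columns are predicted labels, both ordered [0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
missed_malignant = cm[0, 1]  # truly malignant, predicted benign

# pos_label=0 makes "malignant" the positive class for recall
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

print(cm)
print(f"missed malignant cases: {missed_malignant}")
print(f"malignant recall: {malignant_recall:.3f}")
```

The count in cm[0, 1] is the number to watch in a screening setting, because each of those cases is a cancer the model labeled as benign.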

Real-Life Project: Tumor Classification

Project Name

Breast Cancer Detection System

Project Overview

Train a model to detect breast cancer from cell nuclei features.

Project Goal

Develop a classifier that predicts whether a tumor is malignant or benign.

Code for This Project

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict & Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Expected Output

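  • A text report with precision, recall, F1, and support for each class, plus overall accuracy
  • Accuracy is typically above 95% for logistic regression on scaled features; exact figures vary with the split and the library version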

Common Mistakes to Avoid

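  • ❌ Not scaling features before training scale-sensitive models, or fitting the scaler on the full dataset instead of the training set
  • ❌ Relying on accuracy alone; recall for the malignant class (label 0) matters most, because a false negative is a missed cancer
  • ❌ Not stratifying data during the split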

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan
