Real-World Dataset: Wine Classification in Scikit-learn

The Wine dataset is a classic multiclass classification dataset that ships with Scikit-learn. It contains the results of a chemical analysis of 178 wines grown in the same region of Italy but derived from three different cultivars. The goal is to predict the cultivar (the class) from 13 numeric features such as alcohol content, ash, and flavanoids.

Key Characteristics

  • Multiclass classification problem (3 classes)
  • Target: Wine class labels (0, 1, 2)
  • Features: Alcohol, Malic acid, Ash, Flavanoids, etc.
  • Small, clean dataset: 178 samples, all features numeric, no missing values
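
The characteristics above can be checked directly. The following is a minimal sketch that loads the data and prints its shape, the number of samples per class, and the count of missing values; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine

# Load features and labels as NumPy arrays
X, y = load_wine(return_X_y=True)

# Number of samples and features
print("Samples, features:", X.shape)

# How many samples belong to each of the three classes
class_counts = np.bincount(y)
print("Samples per class:", class_counts)

# Count missing values across the whole feature matrix
missing = int(np.isnan(X).sum())
print("Missing values:", missing)
```

The classes are not perfectly equal in size, which is one reason a stratified split is recommended later in this section.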

Basic Rules

  • Standardize features before training
  • Use accuracy and confusion matrix for evaluation
  • Try different classifiers (Logistic Regression, KNN, SVM)
  • Use stratify=y to maintain class proportions
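
One way to follow these rules together is to wrap the scaler and each classifier in a pipeline and compare test accuracy. This is a sketch, not a tuned benchmark: the three models use default hyperparameters apart from max_iter, and the names candidates and scores are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# Classifiers to compare (default settings, chosen for illustration)
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

scores = {}
for name, clf in candidates.items():
    # The pipeline fits the scaler on the training data only
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

Because the test set is small, differences of one or two percentage points between models correspond to a single sample and should not be over-interpreted.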

Syntax Table

SL NO | Step             | Syntax Example                                     | Description
1     | Load dataset     | load_wine(return_X_y=True)                         | Loads wine features and class labels
2     | Train/test split | train_test_split(X, y, stratify=y, test_size=0.3)  | Ensures balanced class split
3     | Standard scaling | StandardScaler().fit_transform(X_train)            | Scales features
4     | Train classifier | LogisticRegression().fit(X_train, y_train)         | Trains a classification model
5     | Evaluate model   | confusion_matrix(y_test, y_pred)                   | Shows prediction correctness per class

Syntax Explanation

1. Load Dataset

What is it?
Loads the Wine dataset from Scikit-learn.

Syntax:

from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)

Explanation:

  • X contains 13 chemical features of wine samples
  • y contains the class labels (0, 1, 2)
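
When the feature and class names are needed as well, load_wine can be called without return_X_y=True, in which case it returns a Bunch object. A short sketch, with wine as an illustrative variable name:

```python
from sklearn.datasets import load_wine

# Without return_X_y=True, load_wine returns a Bunch with metadata
wine = load_wine()

# Names of the 13 chemical features (columns of wine.data)
print(wine.feature_names)

# Names of the three classes (values 0, 1, 2 in wine.target)
print(list(wine.target_names))

# Same arrays that return_X_y=True would give as X and y
print(wine.data.shape, wine.target.shape)
```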

2. Train/Test Split

What is it?
Divides the dataset into training and testing subsets.

Syntax:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

Explanation:

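  • stratify=y maintains class proportions in the train and test sets
  • test_size=0.3 holds out 30% of the samples for testing
  • random_state=42 makes the split reproducible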
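
To confirm that stratification worked, the class shares in each subset can be compared with those of the full dataset. The helper class_shares below is a hypothetical name introduced only for this sketch.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

def class_shares(labels):
    """Return the fraction of samples that falls in each class."""
    return np.bincount(labels) / len(labels)

print("Train/test sizes:", len(y_train), len(y_test))
print("Full set :", np.round(class_shares(y), 3))
print("Train set:", np.round(class_shares(y_train), 3))
print("Test set :", np.round(class_shares(y_test), 3))
```

The three rows should agree to within a couple of percentage points; exact equality is not possible because class counts must be whole numbers.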

3. Standard Scaling

What is it?
Standardizes the input features so that each one has zero mean and unit variance.

Syntax:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Explanation:

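  • Prevents features with larger scales from dominating the model
  • The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training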
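
StandardScaler computes z = (x - mean) / std for each feature, using the mean and standard deviation learned during fit. The sketch below checks this on the Wine data; it keeps the scaled arrays in separate variables (X_train_scaled, X_test_scaled, illustrative names) so the raw values remain available for comparison.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

# Raw features differ in spread by orders of magnitude
raw_std = X_train.std(axis=0)
print("Raw std, min and max:", raw_std.min(), raw_std.max())

# After scaling, every training feature has mean ~0 and std ~1
print("Scaled train mean:", np.round(X_train_scaled.mean(axis=0), 6))
print("Scaled train std :", np.round(X_train_scaled.std(axis=0), 6))

# The same transformation written out by hand for the test set
manual = (X_test - scaler.mean_) / scaler.scale_
print("Matches manual formula:", np.allclose(manual, X_test_scaled))
```

The scaled test set will not have a mean of exactly zero, because it is transformed with the training statistics rather than its own.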

4. Train Classifier

What is it?
Fits a logistic regression classifier on the wine data.

Syntax:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Explanation:

  • Learns decision boundaries for each wine class
  • Logistic Regression supports multiclass classification natively, so no extra wrapper is needed for the three classes
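
The fitted model can be inspected to see how it represents the three classes. This sketch prints the learned classes, the shape of the coefficient matrix (one row per class, one column per feature), and the predicted class probabilities for a few test samples.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Class labels the model learned
print("Classes:", model.classes_)

# One weight vector per class over the 13 features
print("Coefficient matrix shape:", model.coef_.shape)

# Per-class probabilities; each row sums to 1
proba = model.predict_proba(X_test[:3])
print(np.round(proba, 3))
```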

5. Evaluate Model

What is it?
Assesses the model performance with a confusion matrix.

Syntax:

from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Explanation:

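  • Shows how many instances were correctly or incorrectly classified
  • Rows are the true classes and columns are the predicted classes, so correct predictions lie on the diagonal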
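
Accuracy can be read straight off the confusion matrix: it is the sum of the diagonal divided by the total number of test samples. The sketch below verifies this and also prints per-class precision and recall with classification_report, which the original steps do not use but which is a common companion to the matrix.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows: true class, columns: predicted class
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Correct predictions sit on the diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()
print("Accuracy from matrix:", accuracy_from_cm)
print("accuracy_score      :", accuracy_score(y_test, y_pred))

# Precision, recall, and F1 for each of the three classes
print(classification_report(y_test, y_pred))
```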

Real-Life Project: Wine Type Prediction

Project Name

Wine Type Classifier

Project Overview

Classify wines into one of three types using their chemical properties.

Project Goal

Develop a model that accurately identifies the wine class based on input features.

Code for This Project

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Load data
X, y = load_wine(return_X_y=True)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict & Evaluate
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Expected Output

  • Confusion matrix and accuracy score
  • High classification accuracy (typically >95%)
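
A single 30% test split contains only 54 samples, so the reported accuracy can shift with the random seed. As a complementary check, the sketch below estimates accuracy with 5-fold stratified cross-validation. The fold count and the seed are assumptions made for illustration, not part of the project above.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling lives inside the pipeline, so it is refitted on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(3))
print("Mean accuracy  :", cv_scores.mean().round(3))
```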

Common Mistakes to Avoid

  • ❌ Not scaling features before model training
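  • ❌ Omitting stratify=y, which can leave the train and test sets with different class proportions
  • ❌ Assuming a binary setup: models and metrics must handle all three classes (for example, precision_score needs an explicit average= argument)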
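
The first mistake is easiest to see with a distance-based model. The sketch below trains KNN (swapped in here for illustration, since it is more sensitive to feature scale than logistic regression) once on raw features and once inside a scaling pipeline, using the same split as the project.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# KNN on raw features: large-valued features dominate the distances
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
unscaled_acc = knn_raw.score(X_test, y_test)

# Same model with standardization applied first
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_scaled.fit(X_train, y_train)
scaled_acc = knn_scaled.score(X_test, y_test)

print(f"KNN without scaling: {unscaled_acc:.3f}")
print(f"KNN with scaling   : {scaled_acc:.3f}")
```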

Further Reading Recommendation

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan