Feature Selection Techniques in Scikit-learn

Feature selection is the process of choosing the most relevant features from a dataset to improve model performance, reduce overfitting, and enhance interpretability. Scikit-learn provides a variety of feature selection methods, ranging from simple statistical tests to model-based approaches.

Key Characteristics

  • Reduces Overfitting by eliminating irrelevant or redundant features.
  • Improves Accuracy by focusing on the most informative features.
  • Speeds Up Training by lowering data dimensionality.
  • Enhances Interpretability for models like linear regression.

Basic Rules

  • Apply feature selection after preprocessing, and fit it on training data only to avoid leakage.
  • Use score functions suited to the task: classification and regression require different tests.
  • Evaluate selected features using cross-validation (see the pipeline sketch after this list).
  • Avoid removing correlated features blindly; consult domain knowledge first.
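
As referenced above, here is a minimal sketch of cross-validating with selection inside a Pipeline, so each fold re-fits the selector and no information leaks from the held-out data (the synthetic data from make_classification is an assumption for illustration):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Selection runs inside each CV training fold, so the held-out fold
# never influences which features are kept
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())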

Syntax Table

SL NO | Function                 | Syntax Example                           | Description
1     | Variance Threshold       | VarianceThreshold(threshold=0.1)         | Removes features with low variance
2     | Univariate Selection     | SelectKBest(score_func=f_classif, k=10)  | Selects the best k features using a statistical test
3     | Recursive Feature Elim.  | RFE(estimator, n_features_to_select=5)   | Recursively eliminates less important features
4     | Model-Based Selection    | SelectFromModel(estimator)               | Selects based on feature importances from a model
5     | Embedded Methods (Lasso) | LassoCV().fit(X, y).coef_                | Regularization selects features implicitly

Syntax Explanation

1. Variance Threshold

What is it? A simple baseline method that removes all features with variance below a specified threshold.

Syntax:

from sklearn.feature_selection import VarianceThreshold

# Drop every feature whose variance falls below 0.1
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)

Explanation:

  • Eliminates features with near-constant values.
  • Useful as a first pass for boolean features: a feature equal to 1 in a fraction p of samples has variance p(1 - p).
  • The default threshold is 0, which removes only features that take the same value in every sample.
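
A minimal runnable sketch is shown below; the toy array is an assumption, chosen so that one column is near-constant:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: column 0 is nearly constant, column 1 varies
X = np.array([[0, 1], [0, 3], [0, 5], [1, 7]])

selector = VarianceThreshold(threshold=0.2)
X_selected = selector.fit_transform(X)
print(X_selected.shape)     # (4, 1): the near-constant column was dropped
print(selector.variances_)  # per-feature variances computed during fit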

2. Univariate Selection

What is it? Selects the best features based on univariate statistical tests between each feature and the target.

Syntax:

from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 10 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

Explanation:

  • Requires the target y, so it applies only to supervised problems.
  • Use f_classif for classification and f_regression for regression; chi2 and mutual_info_classif are also available.
  • Keeps the top k features with the highest test scores.
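
As a self-contained illustration, the built-in iris dataset (an assumed stand-in for real data) can be reduced to its two highest-scoring features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the class labels and keep the top 2
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (150, 2)
print(selector.scores_)        # ANOVA F-score per original feature
print(selector.get_support())  # boolean mask of the kept features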

3. Recursive Feature Elimination (RFE)

What is it? Recursively removes least important features based on weights assigned by a base estimator.

Syntax:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Repeatedly fit the model and drop the weakest features until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

Explanation:

  • Fits the model, ranks the features, and repeatedly removes the least important ones.
  • Works with any estimator that exposes coef_ or feature_importances_; RFECV can choose the number of features via cross-validation.
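
A runnable sketch, using the built-in breast-cancer dataset as an assumed stand-in for real data, shows the ranking RFE produces:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so coefficients are comparable

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

print(X_selected.shape)  # (569, 5)
print(rfe.ranking_)      # 1 marks kept features; larger ranks were eliminated earlier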

4. Model-Based Selection

What is it? Uses a machine learning model’s feature importance scores to retain only the most informative features.

Syntax:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Keep features whose importance exceeds the default threshold (the mean importance)
model = SelectFromModel(RandomForestClassifier())
X_selected = model.fit_transform(X, y)

Explanation:

  • Relies on the estimator's coef_ or feature_importances_ attribute.
  • Flexible: works with any estimator exposing these, and the threshold parameter controls the cutoff (the mean importance by default for most estimators).
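
The following self-contained sketch (synthetic data from make_classification, an assumption for illustration) shows the importance-based cutoff in action:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)

# Features whose importance exceeds the mean importance are retained
model = SelectFromModel(RandomForestClassifier(random_state=0))
X_selected = model.fit_transform(X, y)

print(X_selected.shape)
print(model.get_support())  # boolean mask over the original 15 features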

5. Embedded Methods (Lasso)

What is it? Integrates feature selection within model training via regularization.

Syntax:

from sklearn.linear_model import LassoCV

# Cross-validated L1 regularization; zeroed coefficients mark dropped features
model = LassoCV().fit(X, y)
important_features = model.coef_ != 0

Explanation:

  • Shrinks less important coefficients exactly to zero via the L1 penalty.
  • Selects features automatically while fitting the model.
  • Effective when the number of features is large, especially if only a few are truly relevant; standardize features first, since the L1 penalty is scale-sensitive.
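
A minimal sketch (regression data generated with make_regression, an assumption for illustration) confirms that mostly the informative features keep nonzero coefficients:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

model = LassoCV(cv=5).fit(X, y)
important = model.coef_ != 0
print(np.sum(important))       # number of nonzero coefficients
print(np.where(important)[0])  # indices of the selected features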

Real-Life Project: Customer Churn Feature Selection

Objective

Select the most relevant features that influence customer churn.

Code Example

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Load dataset
data = pd.read_csv('churn_data.csv')
X = data.drop('Churn', axis=1)
y = data['Churn']

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Feature selection
selector = SelectKBest(score_func=f_classif, k=8)
X_selected = selector.fit_transform(X_scaled, y)
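
To see which columns survived, the selector's boolean mask can be mapped back to the original DataFrame columns (continuing the code above):

# Map the selection mask back to the original column names
selected_columns = X.columns[selector.get_support()]
print(selected_columns.tolist())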

Expected Output

  • A reduced dataset with only the top features.
  • Improved training efficiency and possibly better model accuracy.

Common Mistakes

  • ❌ Not scaling data before selection when the method is scale-sensitive (e.g., Lasso).
  • ❌ Using a classification score function such as f_classif on a regression target, or vice versa.
  • ❌ Eliminating correlated features based on correlation alone, without domain knowledge.

Further Reading

📘 Hands-On Python and Scikit-Learn: A Practical Guide to Machine Learning by Sarful Hassan

🔗 Available on Amazon