    5 Useful Python Scripts for Effective Feature Selection

By Bala Priya C | March 30, 2026



    Image by Author

     


    # Introduction

     
    As a machine learning practitioner, you know that feature selection is important yet time-consuming work. You need to identify which features actually contribute to model performance, remove redundant variables, detect multicollinearity, filter out noisy features, and find the optimal feature subset. For each selection method, you test different thresholds, compare results, and track what works.

    This becomes more challenging as your feature space grows. With hundreds of engineered features, you will need systematic approaches to evaluate feature importance, remove redundancy, and select the best subset.

    This article covers five Python scripts designed to automate the most effective feature selection techniques.

    You can find the scripts on GitHub.

     

    # 1. Filtering Constant Features with Variance Thresholds

     

    // The Pain Point

    Features with low or zero variance provide little to no information for prediction. A feature that is constant or nearly constant across all samples cannot help distinguish between different target classes. Manually identifying these features means calculating variance for each column, setting appropriate thresholds, and handling edge cases like binary features or features with different scales.

     

    // What the Script Does

    Identifies and removes low-variance features based on configurable thresholds. Handles both continuous and binary features appropriately, normalizes variance calculations for fair comparison across different scales, and provides detailed reports showing which features were removed and why.

     

    // How It Works

    The script calculates variance for each feature, applying different strategies based on feature type.

    • For continuous features, it computes standard variance and can optionally normalize by the feature’s range to make thresholds comparable.
    • For binary features, it calculates the proportion of the minority class, since variance in binary features relates to class imbalance.

    Features falling below the threshold are flagged for removal. The script maintains a mapping of removed features and their variance scores for transparency.
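The logic above can be sketched as follows. This is a minimal illustration, not the linked script itself: the function name, thresholds, and the range-squared normalization are assumptions for demonstration.

```python
import pandas as pd

def variance_filter(df: pd.DataFrame, threshold: float = 0.01,
                    binary_threshold: float = 0.02) -> list[str]:
    """Return names of columns flagged for removal by variance screening."""
    to_drop = []
    for col in df.columns:
        series = df[col]
        n_unique = series.nunique(dropna=True)
        if n_unique <= 1:
            # a constant feature carries no information
            to_drop.append(col)
        elif n_unique == 2:
            # for binary features, screen on minority-class proportion
            minority = series.value_counts(normalize=True).min()
            if minority < binary_threshold:
                to_drop.append(col)
        else:
            # normalize variance by the squared range so the threshold
            # is comparable across features with different scales
            rng = series.max() - series.min()
            normalized_var = series.var() / (rng ** 2)
            if normalized_var < threshold:
                to_drop.append(col)
    return to_drop
```

For a library-backed alternative, scikit-learn’s `VarianceThreshold` implements the unnormalized continuous case directly.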

    ⏩ Get the variance threshold-based feature selector script

     

    # 2. Eliminating Redundant Features Through Correlation Analysis

     

    // The Pain Point

    Highly correlated features are redundant and can cause multicollinearity issues in linear models. When two features have high correlation, keeping both adds dimensionality without adding information. But with hundreds of features, identifying all correlated pairs, deciding which to keep, and ensuring you maintain features most correlated with the target requires systematic analysis.

     

    // What the Script Does

    Identifies highly correlated feature pairs using Pearson correlation for numerical features and Cramér’s V for categorical features. For each correlated pair, automatically selects which feature to keep based on correlation with the target variable. Removes redundant features while maximizing predictive power. Generates correlation heatmaps and detailed reports of removed features.

     

    // How It Works

    The script computes the correlation matrix for all features. For each pair exceeding the correlation threshold, it compares both features’ correlation with the target variable. The feature with lower target correlation is marked for removal. This process continues iteratively to handle chains of correlated features. The script handles missing values, mixed data types, and provides visualizations showing correlation clusters and the selection decision for each pair.

    ⏩ Get the correlation-based feature selector script

     

    # 3. Identifying Significant Features Using Statistical Tests

     

    // The Pain Point

    Not all features have a statistically significant relationship with the target variable. Features that show no meaningful association with the target add noise and often increase overfitting risk. Testing each feature requires choosing appropriate statistical tests, computing p-values, correcting for multiple testing, and interpreting results correctly.

     

    // What the Script Does

    The script automatically selects and applies the appropriate statistical test based on the types of the feature and target variable. It uses an analysis of variance (ANOVA) F-test for numerical features paired with a classification target, a chi-square test for categorical features, mutual information scoring to capture non-linear relationships, and a regression F-test when the target is continuous. It then applies either Bonferroni or False Discovery Rate (FDR) correction to account for multiple testing, and returns all features ranked by statistical significance, along with their p-values and test statistics.

     

    // How It Works

    The script first determines the feature type and target type, then routes each feature to the correct test. For classification tasks with numerical features, ANOVA tests whether the feature’s mean differs significantly across target classes. For categorical features, a chi-square test checks for statistical independence between the feature and the target. Mutual information scores are computed alongside these to surface any non-linear relationships that standard tests might miss. When the target is continuous, a regression F-test is used instead.

    Once all tests are run, p-values are adjusted using either Bonferroni correction — where each p-value is multiplied by the total number of features — or a false discovery rate method for a less conservative correction. Features with adjusted p-values below the default significance threshold of 0.05 are flagged as statistically significant and prioritized for inclusion.
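For the numerical-feature, classification-target branch, the test-and-correct step can be sketched like this. The function name and output columns are assumptions; only the ANOVA route with Bonferroni correction is shown.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

def significant_features(X: pd.DataFrame, y: pd.Series,
                         alpha: float = 0.05) -> pd.DataFrame:
    """ANOVA F-test per numeric feature, Bonferroni-adjusted p-values."""
    f_stats, p_values = f_classif(X.values, y.values)
    # Bonferroni: multiply each p-value by the number of tests, cap at 1
    adjusted = np.minimum(p_values * len(X.columns), 1.0)
    return (pd.DataFrame({
        "feature": X.columns,
        "f_stat": f_stats,
        "p_adjusted": adjusted,
        "significant": adjusted < alpha,
    }).sort_values("p_adjusted").reset_index(drop=True))
```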

    ⏩ Get the statistical test-based feature selector script

    If you are interested in a more rigorous statistical approach to feature selection, I suggest you improve this script further as outlined below.

     

    // What You Can Also Explore and Improve

    Use non-parametric alternatives where assumptions break down. ANOVA assumes approximate normality and equal variances across groups. For heavily skewed or non-normal features, swapping to a Kruskal-Wallis test is a more robust choice that makes no distributional assumptions.

    Handle sparse categorical features carefully. Chi-square requires that expected cell frequencies are at least 5. When this condition is not met — which is common with high-cardinality or infrequent categories — Fisher’s exact test is a safer and more accurate alternative.

    Treat mutual information scores separately from p-values. Since mutual information scores are not p-values, they do not fit naturally into the Bonferroni or FDR correction framework. A cleaner approach is to rank features by mutual information score independently and use it as a complementary signal rather than merging it into the same significance pipeline.

    Prefer False Discovery Rate correction in high-dimensional settings. Bonferroni is conservative by design, which is appropriate when false positives are very costly, but it can discard genuinely useful features when you have many of them. Benjamini-Hochberg FDR correction offers more statistical power in wide datasets and is generally preferred in machine learning feature selection workflows.

    Include effect size alongside p-values. Statistical significance alone does not tell you how practically meaningful a feature is. Pairing p-values with effect size measures gives a more complete picture of which features are worth keeping.

    Add a permutation-based significance test. For complex or mixed-type datasets, permutation testing offers a model-agnostic way to assess significance without relying on any distributional assumptions. It works by shuffling the target variable repeatedly and checking how often a feature scores as well by chance alone.
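A permutation test of that kind can be sketched in a few lines. The scoring choice here (absolute Pearson correlation) and the add-one smoothing are assumptions; any feature score works as long as the same score is applied to the shuffled targets.

```python
import numpy as np

def permutation_p_value(feature: np.ndarray, target: np.ndarray,
                        n_permutations: int = 1000, seed: int = 0) -> float:
    """How often does a shuffled target score as well as the real one?"""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(feature, target)[0, 1])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(target)
        if abs(np.corrcoef(feature, shuffled)[0, 1]) >= observed:
            hits += 1
    # add-one smoothing keeps the estimate away from exactly zero
    return (hits + 1) / (n_permutations + 1)
```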

     

    # 4. Ranking Features with Model-Based Importance Scores

     

    // The Pain Point

    Model-based feature importance provides direct insight into which features contribute to prediction accuracy, but different models give different importance scores. Running multiple models, extracting importance scores, and combining results into a coherent ranking is complex.

     

    // What the Script Does

    Trains multiple model types and extracts feature importance from each. Normalizes importance scores across models for fair comparison. Computes ensemble importance by averaging or ranking across models. Provides permutation importance as a model-agnostic alternative. Returns ranked features with importance scores from each model and recommended feature subsets.

     

    // How It Works

    The script trains each model type on the full feature set and extracts native importance scores such as tree-based importance for forests and coefficients for linear models. For permutation importance, it randomly shuffles each feature and measures the decrease in model performance. Importance scores are normalized to sum to 1 within each model.

    The ensemble score is computed as the mean rank or mean normalized importance across all models. Features are sorted by ensemble importance, and the top N features or those exceeding an importance threshold are selected.
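A two-model sketch of the normalize-then-average idea might look like this; the model choices and hyperparameters are illustrative assumptions, and in practice linear coefficients compare fairly only when features share a scale.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def ensemble_importance(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Average normalized importance scores across two model families."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    linear = LogisticRegression(max_iter=1000).fit(X, y)
    tree_imp = forest.feature_importances_        # native tree-based importance
    lin_imp = np.abs(linear.coef_).ravel()        # absolute coefficients
    # normalize each model's scores to sum to 1, then average
    scores = (tree_imp / tree_imp.sum() + lin_imp / lin_imp.sum()) / 2
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)
```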

    ⏩ Get the model-based selector script

     

    # 5. Optimizing Feature Subsets Through Recursive Elimination

     

    // The Pain Point

    The optimal feature subset is not always the top N most important features individually; feature interactions matter, too. A feature might seem weak alone but be valuable when combined with others. Recursive feature elimination tests feature subsets by iteratively removing the weakest features and retraining models. But this requires running hundreds of model training iterations and tracking performance across different subset sizes.

     

    // What the Script Does

    Systematically removes features in an iterative process, retraining models and evaluating performance at each step. Starts with all features and removes the least important feature in each iteration. Tracks model performance across all subset sizes. Identifies the optimal feature subset that maximizes performance or achieves target performance with minimum features. Supports cross-validation for robust performance estimates.

     

    // How It Works

    The script begins with the complete feature set and trains a model. It ranks features by importance and removes the lowest-ranked feature. This process repeats, training a new model with the reduced feature set in each iteration. Performance metrics like accuracy, F1, and AUC are recorded for each subset size.

    The script applies cross-validation to get stable performance estimates at each step. The final output includes performance curves showing how metrics change with feature count, along with the optimal feature subset: either the point of peak performance or the elbow where adding more features yields diminishing returns.
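The loop described above is what scikit-learn’s `RFECV` implements, so a minimal cross-validated version can be sketched in a few lines; the synthetic dataset and estimator choice here are assumptions for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# synthetic data: 10 features, of which 4 are informative and 2 redundant
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# recursively drop the weakest feature, scoring each subset size with 5-fold CV
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                 scoring="accuracy")
selector.fit(X, y)

print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
```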

    ⏩ Get the recursive feature elimination script

     

    # Wrapping Up

     
    These five scripts address the core challenges of feature selection that determine model performance and training efficiency. Here’s a quick overview:
     

    Script | Description
    Variance Threshold Selector | Removes uninformative constant or near-constant features.
    Correlation-Based Selector | Eliminates redundant features while preserving predictive power.
    Statistical Test Selector | Identifies features with significant relationships to the target.
    Model-Based Selector | Ranks features using ensemble importance from multiple models.
    Recursive Feature Elimination | Finds optimal feature subsets through iterative testing.

     
    Each script can be used independently for specific selection tasks or combined into a complete pipeline. Happy feature selection!
     
     

    Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


