End-to-End ML Automation | Production Ready

mlforgex

mlforgex is an automated machine learning Python package for training models, evaluating them, and making predictions from CSV data. It covers classification, regression, and NLP tasks, and handles data preprocessing, model selection, hyperparameter tuning, and artifact generation for you.

Accelerate your data science workflow with end-to-end automation for every stage of machine learning.

Classification, Regression & Sentiment Analysis

Detects the problem type automatically

Auto Preprocessing

Handles missing values, encoding, scaling

Model Selection

Automatically picks best performing model

Dashboard & Artifacts

Interactive dashboard for model analysis

Python 3.8+

MIT License

Active Development

Quick Start

Get started with mlforgex in minutes. Choose between CLI or Python API.

Command Line Interface

Train and predict directly from your terminal

# Train a model
mlforge-train \
  --data_path path/to/data.csv \
  --dependent_feature TargetColumn \
  --rmse_prob 0.3 \
  --f1_prob 0.7 \
  --n_jobs -1 \
  --n_iter 100 \
  --cv 3 \
  --artifacts_dir artifacts \
  --dashboard_title "My Model"  # Title for the dashboard
  # add --fast to speed up the run
  # add --nlp to enable NLP mode

# Make predictions
mlforge-predict \
  --model_path artifacts/model.pkl \
  --preprocessor_path artifacts/preprocessor.pkl \
  --input_data path/to/new_data.csv \
  --encoder_path artifacts/encoder.pkl
  # add --no-predicted_data to disable saving predicted data
  # add --nlp to enable NLP mode

Add the --fast flag for quicker training without hyperparameter tuning.

Installation

Install mlforgex from PyPI with a single command.

Install from PyPI

$ pip install mlforgex

Latest Version

The package will automatically install all required dependencies.

Requirements

Minimum tested environment:

Python >= 3.8

Key Dependencies:

pandas

numpy

scikit-learn

matplotlib

seaborn

xgboost

imbalanced-learn

tqdm

scipy

requests

See the full list in requirements.txt
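To check an environment against the list above before installing, a short script like the following can help. It is a sketch using only the standard library. The import names are the usual ones for these packages (scikit-learn imports as sklearn, imbalanced-learn as imblearn), and the helper names are made up for this example.

```python
import sys
from importlib.util import find_spec

# Import names for the key dependencies listed above. Two differ from
# their PyPI names: scikit-learn -> sklearn, imbalanced-learn -> imblearn.
KEY_MODULES = [
    "pandas", "numpy", "sklearn", "matplotlib", "seaborn",
    "xgboost", "imblearn", "tqdm", "scipy", "requests",
]


def python_ok(minimum=(3, 8)):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def missing_modules(modules=KEY_MODULES):
    """Return the names from `modules` that cannot be found for import."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing modules:", missing_modules() or "none")
```

Installing mlforgex with pip should pull in anything reported as missing.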

Key Features

mlforgex provides comprehensive automation for your machine learning workflow, from data preprocessing to model deployment.

Automatic Data Preprocessing

Missing value handling, outlier & duplicate removal, encoding, scaling, and multicollinearity handling.

Numeric columns: mean/median imputation
Categorical columns: mode or constant label
One-Hot vs Ordinal encoding based on cardinality
StandardScaler by default
VIF-based multicollinearity handling
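The imputation and encoding rules above can be illustrated with a small standard-library sketch. It is not mlforgex's implementation: the function names and the `max_onehot_levels` threshold are assumptions, and so is the direction of the rule (one-hot for low-cardinality columns, ordinal otherwise), which is the common convention.

```python
from collections import Counter
from statistics import median


def impute_numeric(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def choose_encoding(values, max_onehot_levels=10):
    """Pick an encoding from the number of distinct categories.

    max_onehot_levels is an illustrative threshold, not mlforgex's value.
    """
    return "one-hot" if len(set(values)) <= max_onehot_levels else "ordinal"


ages = impute_numeric([22, None, 35, 41])
cities = impute_categorical(["Pune", None, "Pune", "Delhi"])
print(ages)                     # [22, 35, 35, 41]
print(cities)                   # ['Pune', 'Pune', 'Pune', 'Delhi']
print(choose_encoding(cities))  # one-hot
```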

Automatic Problem Detection

Classification vs regression detection, binary vs multiclass detection.

Regression: numeric target with many unique values
Classification: categorical or few unique values
Binary vs multiclass detection
Metric selection based on problem type
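A heuristic of this kind might look like the sketch below. The `max_classes` cut-off and both function names are assumptions made for illustration; the thresholds mlforgex actually uses are not documented here.

```python
def detect_problem_type(target, max_classes=20):
    """Guess the task from the target column's values.

    max_classes is an illustrative cut-off, not the value mlforgex uses.
    """
    distinct = set(target)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    # Numeric target with many unique values -> regression
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    # Otherwise classification, split by the number of classes
    if len(distinct) == 2:
        return "binary classification"
    return "multiclass classification"


def primary_metric(problem_type):
    """Pick the headline metric for the detected problem type."""
    return "RMSE" if problem_type == "regression" else "F1"


prices = [100.5 + i for i in range(50)]
print(detect_problem_type(prices))                # regression
print(detect_problem_type(["yes", "no", "yes"]))  # binary classification
print(detect_problem_type([0, 1, 2, 1, 0]))       # multiclass classification
print(primary_metric("regression"))               # RMSE
```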

Imbalanced Data Handling

SMOTE oversampling, under-sampling, auto detection and application.

Automatic imbalance detection
SMOTE (Synthetic Minority Oversampling)
Random under-sampling options
Applied only to training folds (no data leakage)
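The "training folds only" point is the important one: resampling before the split would let altered class proportions leak into validation. The sketch below shows the idea with random under-sampling in plain Python. The `min_ratio` threshold and the function names are assumptions for this example, and SMOTE itself (available in imbalanced-learn) is not reproduced here.

```python
import random
from collections import Counter


def is_imbalanced(labels, min_ratio=0.5):
    """True when the rarest class has fewer than min_ratio times the rows of the largest class."""
    counts = Counter(labels).values()
    return min(counts) / max(counts) < min_ratio


def undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    keep = min(len(group) for group in by_class.values())
    sampled = [
        (row, label)
        for label, group in by_class.items()
        for row in rng.sample(group, keep)
    ]
    rng.shuffle(sampled)
    return [row for row, _ in sampled], [label for _, label in sampled]


# Toy split: only the training part is resampled
train_rows, train_labels = list(range(10)), [0] * 8 + [1] * 2
valid_rows, valid_labels = [10, 11, 12], [0, 0, 1]

if is_imbalanced(train_labels):
    train_rows, train_labels = undersample(train_rows, train_labels)

print(Counter(train_labels))  # two rows per class after under-sampling
print(valid_labels)           # validation labels are untouched
```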

Model Training & Evaluation

Trains candidate models and selects the best using task-appropriate metrics and cross-validation.

Multiple candidate models per task type
Cross-validation for performance estimation
Composite scoring for model selection
Hyperparameter tuning with RandomizedSearchCV
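The exact composite formula is not given here. As a rough illustration of how a ranking weight such as `f1_prob` could steer model selection, the sketch below blends F1 with accuracy. Using accuracy as the second metric, the function names, and the metric values are all assumptions made up for the example.

```python
def composite_score(f1, accuracy, f1_prob=0.7):
    """Blend two metrics; f1_prob is the share of the score given to F1."""
    return f1_prob * f1 + (1 - f1_prob) * accuracy


def pick_best(results, f1_prob=0.7):
    """results maps model name -> (f1, accuracy); return the top-scoring name."""
    return max(results, key=lambda name: composite_score(*results[name], f1_prob))


# Made-up cross-validated metrics for two hypothetical candidates
results = {
    "model_a": (0.70, 0.90),  # lower F1, higher accuracy
    "model_b": (0.78, 0.80),  # higher F1, lower accuracy
}

print(pick_best(results, f1_prob=0.7))  # model_b
print(pick_best(results, f1_prob=0.1))  # model_a
```

Raising the weight favours the candidate with the better F1, which matches the documented meaning of --f1_prob ("higher means F1 is prioritized").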

Artifact Saving

Trained model, preprocessing pipeline, encoder, metrics, plots, and feature importances saved to disk.

Serialized model (.pkl)
Preprocessing pipeline
Word2Vec model (NLP mode)
Label encoder (classification)
Dashboard with metrics & visualizations

Dashboard & Visualizations

Interactive HTML dashboard with key metrics and plots for model analysis.

Correlation heatmap
Confusion matrix & ROC curves
Precision-Recall curves
Learning curves (train vs validation)
Feature importance bar charts
Residual plots for regression
Word cloud
Prediction error distribution

CLI Reference

Complete command-line interface reference with all flags and options explained.

mlforge-train

Train a machine learning model with automatic preprocessing and model selection.

mlforge-train \
  --data_path <path> \
  --dependent_feature <column> \
  [--rmse_prob <float>] \
  [--f1_prob <float>] \
  [--n_jobs <int>] \
  [--n_iter <int>] \
  [--cv <int>] \
  [--artifacts_dir <path>] \
  [--artifacts_name <name>] \
  [--fast] \
  [--nlp] \
  [--dashboard_title <title>]

Flag	Required	Type	Default	Description
`--data_path`	Yes	str	—	CSV file path to the dataset. Must include header row and the target column.
`--dependent_feature`	Yes	str	—	Name of the target column to predict.
`--rmse_prob`	No	float	0.3	Ranking weight for regression models (higher means RMSE is prioritized).
`--f1_prob`	No	float	0.7	Ranking weight for classification models (higher means F1 is prioritized).
`--n_jobs`	No	int	-1	Number of CPU cores used for parallelism (-1 uses all available cores).
`--n_iter`	No	int	100	Number of parameter settings sampled when RandomizedSearchCV is used.
`--cv`	No	int	3	Number of cross-validation folds.
`--artifacts_dir`	No	str	None	Directory where artifacts, metrics, and plots will be saved.
`--artifacts_name`	No	str	artifacts	Name of the artifacts directory.
`--fast`	No	flag	False	Enable fast mode. Skips hyperparameter tuning and uses strong defaults for models.
`--nlp`	No	flag	False	Enable NLP mode. When provided, the trainer runs the text pipeline: uses an existing text column (or combines object cols), performs tokenization, stopword removal (keeps negations), lemmatization, vectorizes text (Word2Vec), enforces label encoding for classification, and saves NLP artifacts (word2vec/preprocessor).
`--dashboard_title`	No	str	mlforgex Dashboard	The title displayed in the dashboard header.
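To make the defaults in the table concrete, here is an argparse sketch that mirrors the documented flags. It is not the parser shipped with mlforgex, and `build_train_parser` is a hypothetical name. It shows how omitted flags fall back to the listed defaults.

```python
import argparse


def build_train_parser():
    """Build a parser with the flags and defaults from the table above."""
    p = argparse.ArgumentParser(prog="mlforge-train")
    p.add_argument("--data_path", required=True)
    p.add_argument("--dependent_feature", required=True)
    p.add_argument("--rmse_prob", type=float, default=0.3)
    p.add_argument("--f1_prob", type=float, default=0.7)
    p.add_argument("--n_jobs", type=int, default=-1)
    p.add_argument("--n_iter", type=int, default=100)
    p.add_argument("--cv", type=int, default=3)
    p.add_argument("--artifacts_dir", default=None)
    p.add_argument("--artifacts_name", default="artifacts")
    p.add_argument("--fast", action="store_true")  # flag: False unless given
    p.add_argument("--nlp", action="store_true")   # flag: False unless given
    p.add_argument("--dashboard_title", default="mlforgex Dashboard")
    return p


# Parse an explicit argument list instead of sys.argv so the example is repeatable
args = build_train_parser().parse_args(
    ["--data_path", "data.csv", "--dependent_feature", "target", "--fast"]
)
print(args.cv, args.fast, args.nlp, args.dashboard_title)
# 3 True False mlforgex Dashboard
```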

mlforge-predict

Make predictions using a trained model on new data.

mlforge-predict \
  --model_path <model.pkl> \
  --preprocessor_path <preprocessor.pkl> \
  --input_data <input.csv> \
  [--encoder_path <encoder.pkl>] \
  [--no-predicted_data] \
  [--nlp]

Flag	Type	Required	Default	Description
`--model_path`	str	Yes	—	Path to the trained model pickle.
`--preprocessor_path`	str	Yes	—	Path to the preprocessing pipeline pickle.
`--input_data`	str	Yes	—	CSV file with rows to predict (same feature columns as the training data, without the target column).
`--encoder_path`	str	No	—	Path to the encoder pickle (classification only).
`--predicted_data`	flag	No	True	Saves the input data with an added prediction column. Use --no-predicted_data to disable.
`--nlp`	flag	No	False	Enable NLP (text) mode for prediction. When provided, the predictor combines object/text columns (or uses an existing text column), applies the same text preprocessing used at training, loads the text preprocessor / Word2Vec model from --preprocessor_path, vectorizes inputs by averaging word vectors, and decodes labels with --encoder_path if supplied.
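The text steps named for NLP mode (tokenization, stopword removal that keeps negations, averaging word vectors) can be sketched in plain Python. The stopword list, the two-dimensional vectors, and the function names below are stand-ins invented for the example. Lemmatization is left out, and a real run would load vectors from the trained Word2Vec model instead.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one
STOPWORDS = {"the", "a", "an", "is", "was", "this", "it", "and", "not", "no", "never"}
NEGATIONS = {"not", "no", "never"}  # kept even though they are stopwords


def clean_tokens(text):
    """Lowercase, tokenize, and drop stopwords except negations."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS or t in NEGATIONS]


def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros when none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 2-dimensional word vectors standing in for a trained model
vectors = {"not": [0.0, 1.0], "good": [1.0, 0.0], "movie": [0.5, 0.5]}

tokens = clean_tokens("This movie is not good")
print(tokens)                                  # ['movie', 'not', 'good']
print(average_vector(tokens, vectors, dim=2))  # [0.5, 0.5]
```

Keeping "not" in the token list matters for sentiment: without it, "not good" and "good" would map to the same averaged vector.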

Python API

Use mlforgex directly in your Python applications with simple function calls.

train_model()

Train a machine learning model with automatic preprocessing and model selection.

from mlforgex import train_model

# Train a model with all options
train_model(
    data_path="path/to/your/data.csv",
    dependent_feature="target_column",
    rmse_prob=0.3,               # Weight for RMSE in regression
    f1_prob=0.7,                 # Weight for F1 in classification
    n_jobs=-1,                   # Use all CPU cores
    n_iter=100,                  # Hyperparameter search iterations
    cv=5,                        # Cross-validation folds
    artifacts_dir="models",      # Where to save artifacts
    fast=False,                  # Full training with tuning
    nlp=True,                    # Enable NLP mode
    dashboard_title="My Model",  # Title for the dashboard
)

# Fast training (no hyperparameter tuning)
train_model(
    data_path="data.csv",
    dependent_feature="target",
    fast=True  # Skip tuning for faster results
)

Parameters:

data_path (str): Path to your CSV dataset

dependent_feature (str): Name of target column

fast (bool): Skip hyperparameter tuning

predict()

Make predictions on new data using your trained model.

import pandas as pd
from mlforgex import predict

# Make predictions on new data
predictions = predict(
    model_path="artifacts/model.pkl",
    preprocessor_path="artifacts/preprocessor.pkl",
    input_data_path="new_data.csv",
    encoder_path="artifacts/encoder.pkl"  # For classification
)

# View predictions
print("First 10 predictions:")
print(predictions[:10])

# Save predictions to file. predict() may return a DataFrame or a NumPy array,
# so wrap the result in a DataFrame before writing.
pd.DataFrame(predictions).to_csv("predictions.csv", index=False)

Returns:

pandas.DataFrame or numpy.ndarray containing predictions for each input row.

Examples

Real-world examples showing how to use mlforgex for different machine learning tasks.

Housing Price Prediction (Regression)

Train a model to predict house prices using the classic housing dataset.

Regression | CLI

# Train on housing data
mlforge-train \
  --data_path housing.csv \
  --dependent_feature SalePrice \
  --cv 5 \
  --n_iter 50 \
  --artifacts_dir housing_artifacts \
  --dashboard_title "Housing Price Prediction"

# Make predictions
mlforge-predict \
  --model_path housing_artifacts/model.pkl \
  --preprocessor_path housing_artifacts/preprocessor.pkl \
  --input_data new_houses.csv

Customer Churn Classification

Predict customer churn using classification with imbalanced data handling.

Classification | Python

from mlforgex import train_model, predict

# Train classification model
train_model(
    data_path="customer_data.csv",
    dependent_feature="churn",
    f1_prob=0.8,  # Prioritize F1 score
    n_iter=200,
    artifacts_dir="churn_model",
    dashboard_title="Churn Prediction"
)

# Predict on new customers
predictions = predict(
    model_path="churn_model/model.pkl",
    preprocessor_path="churn_model/preprocessor.pkl",
    input_data_path="new_customers.csv",
    encoder_path="churn_model/encoder.pkl"
)

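# Assumes the predictions are numeric 0/1 labels (1 = churn); adapt the count if labels are strings
print(f"Predicted churners: {sum(predictions)}")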

Fast Prototyping

Quick model training for rapid experimentation and prototyping.

Fast Mode | CLI

# Fast training without hyperparameter tuning
mlforge-train \
  --data_path experiment_data.csv \
  --dependent_feature target \
  --fast \
  --artifacts_dir quick_model \
  --dashboard_title "Quick Experiment"

# Results available in seconds, not minutes!

Multi-class Image Classification

Classify images into multiple categories using extracted features.

Multi-class | Python

from mlforgex import train_model, predict

# Train on image features
train_model(
    data_path="image_features.csv",
    dependent_feature="category", 
    rmse_prob=0.2,
    f1_prob=0.8,  # Focus on classification metrics
    cv=3,
    n_jobs=-1,
    artifacts_dir="image_classifier",
    dashboard_title="Image Classification Task"
)

# Classify new images
results = predict(
    model_path="image_classifier/model.pkl",
    preprocessor_path="image_classifier/preprocessor.pkl", 
    input_data_path="new_image_features.csv",
    encoder_path="image_classifier/encoder.pkl"
)

Artifacts & Outputs

mlforgex automatically saves all artifacts needed for reproducible machine learning workflows.

File Structure

After training, your artifacts directory contains:

artifacts/
├─ model.pkl          # Serialized best model
├─ preprocessor.pkl   # Fitted preprocessing pipeline
├─ word2vec.model     # Word2Vec model (NLP mode only)
├─ encoder.pkl        # Label encoder (classification only)
├─ metrics.txt        # Text file with train/test metrics
└─ Dashboard.html     # Interactive model analysis dashboard
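Before calling the predictor it can be useful to confirm that a training run left the expected files behind. The helper below is a sketch based only on the file names in the tree above. `missing_artifacts` is a hypothetical function, not part of mlforgex, and it assumes the encoder and Word2Vec files appear only for classification and NLP runs respectively.

```python
from pathlib import Path

# Files the tree above lists for every run
ALWAYS = ["model.pkl", "preprocessor.pkl", "metrics.txt", "Dashboard.html"]


def missing_artifacts(artifacts_dir, classification=False, nlp=False):
    """List expected files that are absent from artifacts_dir."""
    expected = list(ALWAYS)
    if classification:
        expected.append("encoder.pkl")     # label encoder (classification)
    if nlp:
        expected.append("word2vec.model")  # Word2Vec model (NLP mode)
    root = Path(artifacts_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Prints an empty list when every expected file is present
    print(missing_artifacts("artifacts", classification=True))
```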

Model Files

Serialized model and preprocessing pipeline for predictions

Metrics

Detailed performance metrics and configuration

Visualizations

Comprehensive plots for model analysis

Performance Metrics | Feature Information | Model Configuration | Training Parameters

Generated Visualizations

mlforgex automatically generates these plots for model analysis:

Correlation Heatmap

Feature correlation analysis

Confusion Matrix

Classification performance

ROC/PR Curves

Binary classification metrics

Feature Importance

Model interpretability

Learning Curves

Training vs validation

Residual Plots

Regression analysis

Prediction Error Distribution

Model performance

How It Works

mlforgex follows a comprehensive 8-step pipeline to automate your entire machine learning workflow, from raw data to production-ready models.

Load & Validate Data

Reads CSV, checks for target column, basic schema validation.

Problem Detection

Automatically infers whether the task is regression or classification.

Preprocessing

Missing value imputation, encoding, scaling, duplicate/outlier removal.

Imbalance Handling

If classification and imbalance detected, apply resampling on training folds.

Model Training

Train a curated set of models appropriate for the detected task.

Hyperparameter Tuning

Use randomized search to tune hyperparameters (skipped in fast mode).

Model Selection

Rank models by composite score and pick the best performing one.

Dashboard & Artifacts

Store model, pipeline, metrics, plots, and run config for reproducibility.
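The tuning step, and the way fast mode bypasses it, can be sketched without any ML library. The code below is a simplified stand-in for randomized search, not scikit-learn's RandomizedSearchCV. The toy objective takes the place of a cross-validated score, and the parameter names, default values, and function names are invented for the example.

```python
import random


def random_search(objective, param_space, n_iter=100, seed=0):
    """Try n_iter random parameter settings and return the best (score, params)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in param_space.items()}
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_objective(params):
    """Stand-in for a cross-validated score; highest at depth=5, lr=0.1."""
    return -abs(params["depth"] - 5) - abs(params["lr"] - 0.1)


SPACE = {"depth": [1, 3, 5, 7, 9], "lr": [0.01, 0.1, 1.0]}
DEFAULTS = {"depth": 3, "lr": 0.1}  # hypothetical defaults used in fast mode


def tune(fast=False, n_iter=100):
    """Fast mode scores the fixed defaults; otherwise run the random search."""
    if fast:
        return toy_objective(DEFAULTS), DEFAULTS
    return random_search(toy_objective, SPACE, n_iter=n_iter)


print(tune(fast=True))   # one evaluation of the defaults
print(tune(fast=False))  # best setting found among 100 random draws
```

The trade-off matches the --fast description: one evaluation per model instead of n_iter evaluations, at the cost of possibly missing a better setting.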

End-to-End Automation

Raw CSV Data → Auto Preprocessing → Model Training → Best Model → Production Ready

From raw data to production-ready model in one command

Testing

mlforgex includes comprehensive tests to ensure reliability and correctness.

Run Tests

Execute the test suite to validate functionality:

$ pytest test/

pytest | Unit Tests | Integration Tests

Test Coverage

Key areas covered by the test suite:

Preprocessing pipeline idempotence

Correct problem detection behavior

Model training produces expected keys in metrics.txt

Predict pipeline loads and transforms inputs without error

Artifact saving and loading functionality

Cross-validation and scoring mechanisms
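As an illustration of the first item, an idempotence test checks that applying a preprocessing step twice gives the same result as applying it once. The step below is a toy function written for this example, not code from the mlforgex test suite.

```python
def clean_column_names(names):
    """Toy preprocessing step: trim, lowercase, and replace spaces with underscores."""
    return [name.strip().lower().replace(" ", "_") for name in names]


def test_clean_column_names_is_idempotent():
    raw = ["  Sale Price", "Lot Area ", "year_built"]
    once = clean_column_names(raw)
    twice = clean_column_names(once)
    # Running the step again must not change already-clean output
    assert once == twice


if __name__ == "__main__":
    # pytest collects test_* functions; this call lets the file run as a script too
    test_clean_column_names_is_idempotent()
    print("idempotence check passed")
```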

Quality Assurance

Our comprehensive test suite ensures that mlforgex works reliably across different datasets, problem types, and configurations. Run tests before contributing or after making changes.

License & Author

Open source software with MIT License - free for commercial and personal use.

MIT License

Free and open source software

✅ Commercial Use Allowed
✅ Modification Allowed
✅ Distribution Allowed

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files, to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software.

Author Information

Created and maintained by:

Priyanshu Mathur

Machine Learning Enthusiast & Developer

mathurpriyanshu2006@gmail.com

Portfolio Website

PyPI Package

Ready to Get Started?

Install mlforgex now and transform your machine learning workflow with intelligent automation.