mlforgex

Documentation

End-to-End ML Automation | Production Ready

An end-to-end machine learning automation package for Python that lets you train, evaluate, and make predictions with minimal effort. It handles data preprocessing, model selection, hyperparameter tuning, and artifact generation automatically.

Classification, Regression & Sentiment Analysis

Detects the problem type automatically

Auto Preprocessing

Handles missing values, encoding, scaling

Model Selection

Automatically picks the best-performing model

Dashboard & Artifacts

Interactive dashboard for model analysis

Python 3.8+
MIT License
Active Development

Quick Start

Get started with mlforgex in minutes. Choose between the CLI and the Python API.

Command Line Interface

Train and predict directly from your terminal

# Train a model
mlforge-train \
  --data_path path/to/data.csv \
  --dependent_feature TargetColumn \
  --rmse_prob 0.3 \
  --f1_prob 0.7 \
  --n_jobs -1 \
  --n_iter 100 \
  --cv 3 \
  --artifacts_dir artifacts \
  --dashboard_title "My Model"  # Title for the dashboard
  # add --fast to speed up the run
  # add --nlp to enable NLP mode

# Make predictions
mlforge-predict \
  --model_path artifacts/model.pkl \
  --preprocessor_path artifacts/preprocessor.pkl \
  --input_data path/to/new_data.csv \
  --encoder_path artifacts/encoder.pkl
  # add --no-predicted_data to disable saving predicted data
  # add --nlp to enable NLP mode

Tip: add the --fast flag to mlforge-train for quicker training without hyperparameter tuning.

Installation

Install mlforgex from PyPI with a single command.

Install from PyPI
$ pip install mlforgex

The package will automatically install all required dependencies.

Requirements

Minimum tested environment:

Python >= 3.8

Key Dependencies:

pandas
numpy
scikit-learn
matplotlib
seaborn
xgboost
imbalanced-learn
tqdm
scipy
requests

See the full list in requirements.txt
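
As a quick sanity check before installing or after a failed install, the requirements above can be verified from Python. This is only a sketch using the standard library; the mapping from PyPI names to import names is written out by hand (scikit-learn imports as sklearn and imbalanced-learn as imblearn), and the helper name missing_dependencies is made up for this example.

```python
import sys
from importlib.util import find_spec

# PyPI package name -> import name (two of them differ).
IMPORT_NAMES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "xgboost": "xgboost",
    "imbalanced-learn": "imblearn",
    "tqdm": "tqdm",
    "scipy": "scipy",
    "requests": "requests",
}

def missing_dependencies():
    """Return the PyPI names of key dependencies that cannot be imported."""
    return [pkg for pkg, module in IMPORT_NAMES.items() if find_spec(module) is None]

python_ok = sys.version_info >= (3, 8)
print("Python >= 3.8:", python_ok)
print("Missing packages:", missing_dependencies())
```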

Key Features

mlforgex automates the machine learning workflow, from data preprocessing through model selection to saved, reusable artifacts.

Automatic Data Preprocessing

Missing value handling, outlier & duplicate removal, encoding, scaling, and multicollinearity handling.

  • Numeric columns: mean/median imputation
  • Categorical columns: mode or constant label
  • One-Hot vs Ordinal encoding based on cardinality
  • StandardScaler by default
  • VIF-based multicollinearity handling
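
To make the imputation and encoding rules above concrete, here is a minimal pure-Python sketch. It is not mlforgex's code: the function names and the cardinality threshold of 10 are assumptions chosen for illustration, and scaling and VIF handling are left out.

```python
from statistics import mean, median, mode

def impute_numeric(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def impute_categorical(values, constant=None):
    """Replace None entries with the most frequent label, or a constant if given."""
    observed = [v for v in values if v is not None]
    fill = constant if constant is not None else mode(observed)
    return [fill if v is None else v for v in values]

def choose_encoding(values, max_one_hot=10):
    """One-hot for low-cardinality columns, ordinal otherwise (threshold is an assumption)."""
    return "one-hot" if len(set(values)) <= max_one_hot else "ordinal"

ages = impute_numeric([20, None, 40, 30])
cities = impute_categorical(["Pune", None, "Delhi", "Pune"])
print(ages)                     # missing age filled with the median
print(cities)                   # missing city filled with the most frequent label
print(choose_encoding(cities))  # few distinct values, so one-hot
```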

Automatic Problem Detection

Classification vs regression detection, binary vs multiclass detection.

  • Regression: numeric target with many unique values
  • Classification: categorical or few unique values
  • Binary vs multiclass detection
  • Metric selection based on problem type
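
The detection rules above can be sketched as a small heuristic. The exact rule and threshold mlforgex uses are not stated here, so the max_classes cutoff of 20 and the function name detect_problem_type are assumptions for illustration only.

```python
def detect_problem_type(target, max_classes=20):
    """Guess the task from the target column.

    Numeric targets with many distinct values are treated as regression;
    everything else is classification, split into binary and multiclass.
    """
    values = [v for v in target if v is not None]
    unique = set(values)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric and len(unique) > max_classes:
        return "regression"
    return "binary" if len(unique) == 2 else "multiclass"

print(detect_problem_type([i * 0.5 for i in range(100)]))  # many numeric values
print(detect_problem_type(["yes", "no", "yes"]))           # two labels
print(detect_problem_type([0, 1, 2, 1, 0]))                # three labels
```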

Imbalanced Data Handling

SMOTE oversampling, under-sampling, auto detection and application.

  • Automatic imbalance detection
  • SMOTE (Synthetic Minority Oversampling)
  • Random under-sampling options
  • Applied only to training folds (no data leakage)
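
The following sketch illustrates imbalance detection and random under-sampling applied to a training split only. It is a pure-Python illustration, not the package's implementation: the min_ratio threshold and both function names are assumptions, and SMOTE itself (available in imbalanced-learn) is not shown.

```python
import random
from collections import Counter

def is_imbalanced(labels, min_ratio=0.5):
    """Flag the data as imbalanced when the rarest class is under min_ratio of the largest."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values()) < min_ratio

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    keep_per_class = min(counts.values())
    keep = []
    for cls in counts:
        positions = [i for i, y in enumerate(labels) if y == cls]
        keep.extend(rng.sample(positions, keep_per_class))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Resample the training split only; a validation split would keep its natural class mix.
train_rows = list(range(10))
train_labels = [0] * 8 + [1] * 2
if is_imbalanced(train_labels):
    train_rows, train_labels = random_undersample(train_rows, train_labels)
print(Counter(train_labels))
```

Resampling inside each training fold, rather than before splitting, is what keeps synthetic or duplicated rows out of the validation data.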

Model Training & Evaluation

Trains candidate models and selects the best using task-appropriate metrics and cross-validation.

  • Multiple candidate models per task type
  • Cross-validation for performance estimation
  • Composite scoring for model selection
  • Hyperparameter tuning with RandomizedSearchCV
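
How the composite score is computed is not spelled out in this documentation. As a rough illustration only, the sketch below assumes a weighted blend of F1 and accuracy, with the F1 weight playing the role of f1_prob; the second metric, the function names, and the scores are all made up for the example.

```python
def composite_score(f1, accuracy, f1_weight=0.7):
    """Weighted blend of two metrics; using accuracy as the second metric is an assumption."""
    return f1_weight * f1 + (1 - f1_weight) * accuracy

def pick_best(cv_results, f1_weight=0.7):
    """Return the candidate whose (f1, accuracy) pair has the highest composite score."""
    return max(
        cv_results,
        key=lambda name: composite_score(*cv_results[name], f1_weight),
    )

# Made-up cross-validated (f1, accuracy) pairs for two candidates.
cv_results = {
    "logistic_regression": (0.70, 0.90),
    "random_forest": (0.78, 0.84),
}

print(pick_best(cv_results, f1_weight=0.7))  # F1 dominates the ranking
print(pick_best(cv_results, f1_weight=0.0))  # accuracy alone decides
```

Raising the weight shifts the ranking toward the candidate with the stronger F1, which matches the documented intent of f1_prob.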

Artifact Saving

Trained model, preprocessing pipeline, encoder, metrics, plots, and feature importances saved to disk.

  • Serialized model (.pkl)
  • Preprocessing pipeline
  • Word2Vec model (NLP mode)
  • Label encoder (classification)
  • Dashboard with metrics & visualizations

Dashboard & Visualizations

Interactive HTML dashboard with key metrics and plots for model analysis.

  • Correlation heatmap
  • Confusion matrix & ROC curves
  • Precision-Recall curves
  • Learning curves (train vs validation)
  • Feature importance bar charts
  • Residual plots for regression
  • WordCloud
  • Prediction Error Distribution

CLI Reference

Complete command-line interface reference with all flags and options explained.

mlforge-train

Train a machine learning model with automatic preprocessing and model selection.

mlforge-train \
  --data_path <path> \
  --dependent_feature <column> \
  [--rmse_prob <float>] \
  [--f1_prob <float>] \
  [--n_jobs <int>] \
  [--n_iter <int>] \
  [--cv <int>] \
  [--artifacts_dir <path>] \
  [--artifacts_name <name>] \
  [--fast] \
  [--nlp] \
  [--dashboard_title <title>]

--data_path (required, str)

CSV file path to the dataset. Must include header row and the target column.

Default: —

--dependent_feature (required, str)

Name of the target column to predict.

Default: —

--rmse_prob (float)

Ranking weight for regression models (higher means RMSE is prioritized).

Default: 0.3

--f1_prob (float)

Ranking weight for classification models (higher means F1 is prioritized).

Default: 0.7

--n_jobs (int)

Number of CPU cores used for parallelism (-1 uses all available cores).

Default: -1

--n_iter (int)

Number of parameter settings sampled when RandomizedSearchCV is used.

Default: 100

--cv (int)

Number of cross-validation folds.

Default: 3

--artifacts_dir (str)

Directory where artifacts, metrics, and plots will be saved.

Default: None

--artifacts_name (str)

Name of the artifacts directory.

Default: artifacts

--fast (flag)

Enable fast mode. Skips hyperparameter tuning and uses strong defaults for models.

Default: False

--nlp (flag)

Enable NLP mode. When provided, the trainer runs the text pipeline: it uses an existing text column (or combines object columns), performs tokenization, stopword removal (keeping negations), and lemmatization, vectorizes the text with Word2Vec, enforces label encoding for classification, and saves NLP artifacts (word2vec/preprocessor).

Default: False

--dashboard_title (str)

The title displayed in the dashboard header.

Default: mlforgex Dashboard
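
The text cleaning described under --nlp (tokenization plus stopword removal that keeps negations) might look roughly like the sketch below. It is illustrative only: the stopword list is a tiny hand-written sample, the preprocess function is hypothetical, and lemmatization and Word2Vec training are left out.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "was", "this", "it", "and", "not", "no", "never"}
# Negations are kept even though they appear in the stopword list,
# because dropping them flips the meaning of a sentence.
NEGATIONS = {"not", "no", "never"}

def preprocess(text):
    """Lowercase, tokenize on letters/apostrophes, and drop non-negation stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS or t in NEGATIONS]

print(preprocess("This movie was not good"))
print(preprocess("It is a great film"))
```

Keeping "not" matters for sentiment tasks: without it, "not good" and "good" would reduce to the same tokens.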

mlforge-predict

Make predictions using a trained model on new data.

mlforge-predict \
  --model_path <model.pkl> \
  --preprocessor_path <preprocessor.pkl> \
  --input_data <input.csv> \
  [--encoder_path <encoder.pkl>] \
  [--no-predicted_data] \
  [--nlp]

--model_path (required, str)

Path to the trained model pickle.

Default: —

--preprocessor_path (required, str)

Path to the preprocessing pipeline pickle.

Default: —

--input_data (required, str)

CSV file with rows to predict (same feature columns except target).

Default: —

--encoder_path (optional, str)

Path to the encoder pickle (classification only).

Default: —

--predicted_data (optional, flag)

Saves the input data with prediction column. Use --no-predicted_data to disable.

Default: True

--nlp (optional, flag)

Enable NLP/text mode for prediction. When provided, the predictor combines object/text columns (or uses an existing text column), applies the same text preprocessing used at training, loads the text preprocessor / Word2Vec model from --preprocessor_path, vectorizes inputs by averaging word vectors, and decodes labels with --encoder_path if supplied.

Default: False
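
Averaging word vectors, as mentioned for NLP prediction, turns a variable-length token list into one fixed-length feature vector. The sketch below uses a hand-written toy embedding table in place of a trained Word2Vec model; the function name, the zero-vector fallback for unknown text, and the two-dimensional vectors are assumptions for illustration.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy 2-dimensional embeddings standing in for a trained Word2Vec model.
toy_embeddings = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}

vec = average_vector(["good", "movie", "unseen"], toy_embeddings, dim=2)
print(vec)  # "unseen" is skipped; the other two vectors are averaged
```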

Python API

Use mlforgex directly in your Python applications with simple function calls.

train_model()

Train a machine learning model with automatic preprocessing and model selection.

from mlforgex import train_model

# Train a model with all options
train_model(
    data_path="path/to/your/data.csv",
    dependent_feature="target_column",
    rmse_prob=0.3,          # Weight for RMSE in regression
    f1_prob=0.7,            # Weight for F1 in classification
    n_jobs=-1,              # Use all CPU cores
    n_iter=100,             # Hyperparameter search iterations
    cv=5,                   # Cross-validation folds
    artifacts_dir="models", # Where to save artifacts
    fast=False,             # Full training with tuning
    nlp=True,               # Enable NLP mode
    dashboard_title="My Model"  # Title for the dashboard
)

# Fast training (no hyperparameter tuning)
train_model(
    data_path="data.csv",
    dependent_feature="target",
    fast=True  # Skip tuning for faster results
)

Parameters:

data_path (str): Path to your CSV dataset
dependent_feature (str): Name of target column
fast (bool): Skip hyperparameter tuning

predict()

Make predictions on new data using your trained model.

import pandas as pd
from mlforgex import predict

# Make predictions on new data
predictions = predict(
    model_path="artifacts/model.pkl",
    preprocessor_path="artifacts/preprocessor.pkl",
    input_data_path="new_data.csv",
    encoder_path="artifacts/encoder.pkl"  # For classification
)

# View predictions
print("First 10 predictions:")
print(predictions[:10])

# Save predictions to file; wrapping in a DataFrame also covers the NumPy array case
pd.DataFrame(predictions).to_csv("predictions.csv", index=False)

Returns:

pandas.DataFrame or numpy.ndarray containing predictions for each input row.

Examples

Real-world examples showing how to use mlforgex for different machine learning tasks.

Housing Price Prediction (Regression)

Train a model to predict house prices using the classic housing dataset.

Regression (CLI)
# Train on housing data
mlforge-train \
  --data_path housing.csv \
  --dependent_feature SalePrice \
  --cv 5 \
  --n_iter 50 \
  --artifacts_dir housing_artifacts \
  --dashboard_title "Housing Price Prediction"

# Make predictions
mlforge-predict \
  --model_path housing_artifacts/model.pkl \
  --preprocessor_path housing_artifacts/preprocessor.pkl \
  --input_data new_houses.csv

Customer Churn Classification

Predict customer churn using classification with imbalanced data handling.

Classification (Python)
from mlforgex import train_model, predict

# Train classification model
train_model(
    data_path="customer_data.csv",
    dependent_feature="churn",
    f1_prob=0.8,  # Prioritize F1 score
    n_iter=200,
    artifacts_dir="churn_model",
    dashboard_title="Churn Prediction"
)

# Predict on new customers
predictions = predict(
    model_path="churn_model/model.pkl",
    preprocessor_path="churn_model/preprocessor.pkl",
    input_data_path="new_customers.csv",
    encoder_path="churn_model/encoder.pkl"
)

# Assumes predictions is a 1-D sequence of numeric 0/1 labels
print(f"Predicted churners: {sum(predictions)}")

Fast Prototyping

Quick model training for rapid experimentation and prototyping.

Fast Mode (CLI)
# Fast training without hyperparameter tuning
mlforge-train \
  --data_path experiment_data.csv \
  --dependent_feature target \
  --fast \
  --artifacts_dir quick_model \
  --dashboard_title "Quick Experiment"

# Fast mode skips hyperparameter tuning, so results arrive much sooner than a full run

Multi-class Image Classification

Classify images into multiple categories using extracted features.

Multi-class (Python)
from mlforgex import train_model, predict

# Train on image features
train_model(
    data_path="image_features.csv",
    dependent_feature="category", 
    rmse_prob=0.2,
    f1_prob=0.8,  # Focus on classification metrics
    cv=3,
    n_jobs=-1,
    artifacts_dir="image_classifier",
    dashboard_title="Image Classification Task"
)

# Classify new images
results = predict(
    model_path="image_classifier/model.pkl",
    preprocessor_path="image_classifier/preprocessor.pkl", 
    input_data_path="new_image_features.csv",
    encoder_path="image_classifier/encoder.pkl"
)

Artifacts & Outputs

mlforgex automatically saves all artifacts needed for reproducible machine learning workflows.

File Structure

After training, your artifacts directory contains:

artifacts/
├─ model.pkl                 # Serialized best model
├─ preprocessor.pkl          # Fitted preprocessing pipeline
├─ word2vec.model            # Word2Vec model (NLP mode)
├─ encoder.pkl               # Label encoder (classification)
├─ metrics.txt               # Text file with train/test metrics
└─ Dashboard.html            # Interactive model analysis dashboard
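
Before calling the predictor, it can help to confirm that a training run left the expected files behind. The helper below is a small sketch built on the file names listed above; missing_artifacts is a hypothetical name, and treating encoder.pkl as classification-only and word2vec.model as NLP-only follows the comments in the tree.

```python
import tempfile
from pathlib import Path

def missing_artifacts(artifacts_dir, classification=True, nlp=False):
    """Return the expected artifact file names that are absent from artifacts_dir."""
    expected = ["model.pkl", "preprocessor.pkl", "metrics.txt", "Dashboard.html"]
    if classification:
        expected.append("encoder.pkl")
    if nlp:
        expected.append("word2vec.model")
    root = Path(artifacts_dir)
    return [name for name in expected if not (root / name).is_file()]

# Demo against a throwaway directory that contains only model.pkl.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.pkl").write_bytes(b"")
    missing = missing_artifacts(tmp, classification=False)
print(missing)
```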

Model Files

Serialized model and preprocessing pipeline for predictions

Metrics

Detailed performance metrics and configuration

Visualizations

Comprehensive plots for model analysis

Performance Metrics | Feature Information | Model Configuration | Training Parameters

Generated Visualizations

mlforgex automatically generates these plots for model analysis:

Correlation Heatmap

Feature correlation analysis

Confusion Matrix

Classification performance

ROC/PR Curves

Binary classification metrics

Feature Importance

Model interpretability

Learning Curves

Training vs validation

Residual Plots

Regression analysis

Prediction Error Distribution

Model performance

How It Works

mlforgex follows a comprehensive 8-step pipeline to automate your entire machine learning workflow, from raw data to production-ready models.

1. Load & Validate Data

Reads the CSV, checks for the target column, and performs basic schema validation.

2. Problem Detection

Automatically infers whether the task is regression or classification.

3. Preprocessing

Missing value imputation, encoding, scaling, duplicate/outlier removal.

4. Imbalance Handling

If the task is classification and imbalance is detected, applies resampling to the training folds only.

5. Model Training

Trains a curated set of models appropriate for the detected task.

6. Hyperparameter Tuning

Uses randomized search to tune hyperparameters (skipped in fast mode).

7. Model Selection

Ranks models by composite score and picks the best-performing one.

8. Dashboard & Artifacts

Stores the model, pipeline, metrics, plots, and run config for reproducibility.
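
The tuning step can be pictured as a plain random search: sample a fixed number of parameter settings, score each, and keep the best, or skip the search entirely in fast mode. The sketch below is a pure-Python stand-in, not RandomizedSearchCV and not mlforgex's code; the toy scoring function, the parameter names, and the defaults are assumptions made up for the example.

```python
import random

def random_search(param_space, score_fn, defaults, n_iter=100, fast=False, seed=0):
    """Sample n_iter random settings and keep the best; fast mode skips the search."""
    if fast:
        return defaults, score_fn(defaults)
    rng = random.Random(seed)
    best_params, best_score = defaults, score_fn(defaults)
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective standing in for a cross-validated score (higher is better).
def toy_score(params):
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)

space = {"max_depth": [2, 3, 5, 8], "learning_rate": [0.01, 0.1, 0.3]}
defaults = {"max_depth": 3, "learning_rate": 0.3}

tuned, tuned_score = random_search(space, toy_score, defaults, n_iter=100)
quick, quick_score = random_search(space, toy_score, defaults, fast=True)
print("tuned:", tuned, tuned_score)
print("fast: ", quick, quick_score)
```

Because the search starts from the defaults, the tuned score can never be worse than the fast-mode score; the trade-off is purely the extra evaluations.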

End-to-End Automation

Raw CSV Data -> Auto Preprocessing -> Model Training -> Best Model -> Production Ready

From raw data to production-ready model in one command

Testing

mlforgex includes comprehensive tests to ensure reliability and correctness.

Run Tests

Execute the test suite to validate functionality:

$ pytest test/

pytest | Unit Tests | Integration Tests

Test Coverage

Key areas covered by the test suite:

Preprocessing pipeline idempotence
Correct problem detection behavior
Model training produces expected keys in metrics.txt
Predict pipeline loads and transforms inputs without error
Artifact saving and loading functionality
Cross-validation and scoring mechanisms
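
As an illustration of what an idempotence check looks like, the pytest-style sketch below applies a transformation twice and asserts that the second pass changes nothing. The normalize_columns function is a hypothetical stand-in for a preprocessing step, not a function from mlforgex or its test suite.

```python
def normalize_columns(names):
    """Hypothetical preprocessing step: trim, lowercase, and snake_case column names."""
    return [n.strip().lower().replace(" ", "_") for n in names]

def test_preprocessing_is_idempotent():
    raw = [" Sale Price", "Lot Area ", "year_built"]
    once = normalize_columns(raw)
    twice = normalize_columns(once)
    # Running the step again on its own output must be a no-op.
    assert once == twice

# pytest would discover the test by name; calling it directly also works.
test_preprocessing_is_idempotent()
```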

Quality Assurance

Our comprehensive test suite ensures that mlforgex works reliably across different datasets, problem types, and configurations. Run tests before contributing or after making changes.

License & Author

Open source software with MIT License - free for commercial and personal use.

MIT License

Free and open source software

✅ Commercial Use Allowed
✅ Modification Allowed
✅ Distribution Allowed

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files, to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software.

Author Information

Created and maintained by:

Priyanshu Mathur

Machine Learning Engineer & Open Source Developer

Ready to Get Started?

Install mlforgex now and transform your machine learning workflow with intelligent automation.