mlforgex
An end-to-end machine learning automation package for Python that allows you to train, evaluate, and make predictions with minimal effort — handling data preprocessing, model selection, hyperparameter tuning, and artifact generation automatically.
Classification, Regression & Sentiment Analysis
Detects the problem type automatically
Auto Preprocessing
Handles missing values, encoding, scaling
Model Selection
Automatically picks best performing model
Dashboard & Artifacts
Interactive dashboard for model analysis
Quick Start
Get started with mlforgex in minutes. Choose between CLI or Python API.
Train and predict directly from your terminal
# Train a model
mlforge-train \
  --data_path path/to/data.csv \
  --dependent_feature TargetColumn \
  --rmse_prob 0.3 \
  --f1_prob 0.7 \
  --n_jobs -1 \
  --n_iter 100 \
  --cv 3 \
  --artifacts_dir artifacts \
  --dashboard_title "My Model"  # Title for the dashboard
# add --fast to speed up the run
# add --nlp to enable NLP mode

# Make predictions
mlforge-predict \
  --model_path artifacts/model.pkl \
  --preprocessor_path artifacts/preprocessor.pkl \
  --input_data path/to/new_data.csv \
  --encoder_path artifacts/encoder.pkl
# add --no-predicted_data to disable saving predicted data
# add --nlp to enable NLP mode

Installation
Install mlforgex from PyPI with a single command.
$ pip install mlforgex
The package will automatically install all required dependencies.
Minimum tested environment:
Key Dependencies:
pandas, numpy, scikit-learn, matplotlib, seaborn, xgboost, imbalanced-learn, tqdm, scipy, requests
See the full list in requirements.txt
Key Features
mlforgex provides comprehensive automation for your machine learning workflow, from data preprocessing to model deployment.
Automatic Data Preprocessing
Missing value handling, outlier & duplicate removal, encoding, scaling, and multicollinearity handling.
- Numeric columns: mean/median imputation
- Categorical columns: mode or constant label
- One-Hot vs Ordinal encoding based on cardinality
- StandardScaler by default
- VIF-based multicollinearity handling
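As a rough illustration of the rules listed above (not mlforgex's actual code), the sketch below imputes a numeric column with its median, imputes a categorical column with its mode or a constant label, and picks an encoding from the column's cardinality. The helper names and the cardinality threshold of 10 are assumptions made for this example.

```python
from collections import Counter
from statistics import median


def impute_numeric(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values, constant=None):
    """Replace None with a constant label, or with the mode if no constant is given."""
    observed = [v for v in values if v is not None]
    fill = constant if constant is not None else Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def choose_encoding(values, max_onehot_cardinality=10):
    """One-hot for low-cardinality columns, ordinal otherwise (threshold is hypothetical)."""
    return "one-hot" if len(set(values)) <= max_onehot_cardinality else "ordinal"


ages = impute_numeric([31, None, 45, 27])
cities = impute_categorical(["Pune", None, "Delhi", "Pune"])
print(ages)                     # [31, 31, 45, 27]
print(cities)                   # ['Pune', 'Pune', 'Delhi', 'Pune']
print(choose_encoding(cities))  # one-hot
```

Scaling and VIF-based filtering are left out to keep the sketch short.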
Automatic Problem Detection
Classification vs regression detection, binary vs multiclass detection.
- Regression: numeric target with many unique values
- Classification: categorical or few unique values
- Binary vs multiclass detection
- Metric selection based on problem type
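The detection rules above could be approximated with a heuristic like the following. This is a sketch only: the function names and the `max_classes` cut-off of 20 are hypothetical, and mlforgex may use different criteria.

```python
def detect_problem_type(target, max_classes=20):
    """Guess the task from the target values (heuristic; threshold is hypothetical)."""
    unique = set(target)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in unique
    )
    # Numeric target with many distinct values: treat as regression.
    if all_numeric and len(unique) > max_classes:
        return "regression"
    # Otherwise it is classification; two classes means binary.
    return "binary" if len(unique) == 2 else "multiclass"


def default_metric(problem_type):
    """Map the detected task to a primary metric (RMSE for regression, F1 otherwise)."""
    return "rmse" if problem_type == "regression" else "f1"


prices = [100.0 + 1.5 * i for i in range(200)]
print(detect_problem_type(prices))              # regression
print(detect_problem_type(["yes", "no"] * 5))   # binary
print(detect_problem_type(["a", "b", "c"]))     # multiclass
```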
Imbalanced Data Handling
SMOTE oversampling, under-sampling, auto detection and application.
- Automatic imbalance detection
- SMOTE (Synthetic Minority Oversampling)
- Random under-sampling options
- Applied only to training folds (no data leakage)
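To make the "training folds only" point concrete, here is a minimal sketch that detects imbalance from class counts and resamples after the split, so the held-out rows keep their natural class distribution. It uses plain random under-sampling rather than SMOTE, and the `min_ratio` threshold and function names are assumptions for this example.

```python
import random
from collections import Counter


def is_imbalanced(labels, min_ratio=0.5):
    """Flag imbalance when minority/majority count ratio is below min_ratio (hypothetical)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values()) < min_ratio


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rows = list(range(100))
labels = [1 if i % 10 == 0 else 0 for i in rows]  # 10 positives, 90 negatives

# Split first, then resample the training part only.
train_rows, test_rows = rows[:80], rows[80:]
train_labels, test_labels = labels[:80], labels[80:]

if is_imbalanced(train_labels):
    train_rows, train_labels = random_undersample(train_rows, train_labels)

print(Counter(train_labels))  # balanced: 8 rows per class
print(Counter(test_labels))   # untouched: 18 negatives, 2 positives
```

Resampling before the split would let synthetic or duplicated rows influence the evaluation data, which is the leakage the bullet above warns about.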
Model Training & Evaluation
Trains candidate models and selects the best using task-appropriate metrics and cross-validation.
- Multiple candidate models per task type
- Cross-validation for performance estimation
- Composite scoring for model selection
- Hyperparameter tuning with RandomizedSearchCV
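The idea behind randomized search can be sketched without any library: draw `n_iter` random settings from a parameter space, score each one, and keep the best. scikit-learn's `RandomizedSearchCV` does this with a cross-validated score; the toy objective and parameter names below are stand-ins chosen for the example, not values used by mlforgex.

```python
import random


def randomized_search(param_space, score_fn, n_iter=100, seed=0):
    """Sample n_iter settings from param_space and return the best (params, score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_cv_score(params):
    """Toy objective standing in for a cross-validated score; peaks at depth 5, rate 0.1."""
    return -abs(params["max_depth"] - 5) - 10 * abs(params["learning_rate"] - 0.1)


space = {"max_depth": [2, 3, 5, 8, 12], "learning_rate": [0.01, 0.05, 0.1, 0.3]}
best_params, best_score = randomized_search(space, toy_cv_score, n_iter=100)
print(best_params, best_score)
```

A larger `n_iter` explores more of the space at the cost of run time, which is the trade-off the `--n_iter` and `--fast` options expose.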
Artifact Saving
Trained model, preprocessing pipeline, encoder, metrics, plots, and feature importances saved to disk.
- Serialized model (.pkl)
- Preprocessing pipeline
- Word2Vec model (NLP mode)
- Label encoder (classification)
- Dashboard with metrics & visualizations
Dashboard & Visualizations
Interactive HTML dashboard with key metrics and plots for model analysis.
- Correlation heatmap
- Confusion matrix & ROC curves
- Precision-Recall curves
- Learning curves (train vs validation)
- Feature importance bar charts
- Residual plots for regression
- WordCloud
- Prediction Error Distribution
CLI Reference
Complete command-line interface reference with all flags and options explained.
Train a machine learning model with automatic preprocessing and model selection.
mlforge-train \
  --data_path <path> \
  --dependent_feature <column> \
  [--rmse_prob <float>] \
  [--f1_prob <float>] \
  [--n_jobs <int>] \
  [--n_iter <int>] \
  [--cv <int>] \
  [--artifacts_dir <path>] \
  [--artifacts_name <name>] \
  [--fast] \
  [--nlp] \
  [--dashboard_title <title>]
| Flag | Required | Type | Default | Description |
|---|---|---|---|---|
| --data_path | Yes | str | - | CSV file path to the dataset. Must include a header row and the target column. |
| --dependent_feature | Yes | str | - | Name of the target column to predict. |
| --rmse_prob | No | float | 0.3 | Ranking weight for regression models (higher means RMSE is prioritized). |
| --f1_prob | No | float | 0.7 | Ranking weight for classification models (higher means F1 is prioritized). |
| --n_jobs | No | int | -1 | Number of CPU cores used for parallelism (-1 uses all available cores). |
| --n_iter | No | int | 100 | Number of parameter settings sampled when RandomizedSearchCV is used. |
| --cv | No | int | 3 | Number of cross-validation folds. |
| --artifacts_dir | No | str | None | Directory where artifacts, metrics, and plots will be saved. |
| --artifacts_name | No | str | artifacts | Name of the artifacts directory. |
| --fast | No | flag | False | Enable fast mode. Skips hyperparameter tuning and uses strong defaults for models. |
| --nlp | No | flag | False | Enable NLP mode. When provided, the trainer runs the text pipeline: uses an existing text column (or combines object columns), performs tokenization, stopword removal (keeps negations), and lemmatization, vectorizes text (Word2Vec), enforces label encoding for classification, and saves NLP artifacts (word2vec/preprocessor). |
| --dashboard_title | No | str | mlforgex Dashboard | The title displayed in the dashboard header. |
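The text-cleaning steps named for `--nlp` (tokenization and stopword removal that keeps negations) can be pictured with a small standard-library sketch. The stopword list here is a tiny illustrative one, lemmatization is omitted, and none of this is taken from mlforgex's source.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "was", "it", "this", "and", "of", "to",
             "not", "no", "never"}
# Negations are kept even though they appear in the stopword list,
# because dropping them flips the sentiment of a sentence.
NEGATIONS = {"not", "no", "never"}


def clean_text(text):
    """Lowercase, tokenize on letters/apostrophes, drop stopwords except negations."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS or t in NEGATIONS]


print(clean_text("This is not a good movie"))   # ['not', 'good', 'movie']
print(clean_text("The plot was never boring"))  # ['plot', 'never', 'boring']
```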
Make predictions using a trained model on new data.
mlforge-predict \
  --model_path <model.pkl> \
  --preprocessor_path <preprocessor.pkl> \
  --input_data <input.csv> \
  [--encoder_path <encoder.pkl>] \
  [--no-predicted_data] \
  [--nlp]
| Flag | Type | Required | Default | Description |
|---|---|---|---|---|
| --model_path | str | Yes | - | Path to the trained model pickle. |
| --preprocessor_path | str | Yes | - | Path to the preprocessing pipeline pickle. |
| --input_data | str | Yes | - | CSV file with rows to predict (same feature columns, without the target). |
| --encoder_path | str | No | - | Path to the encoder pickle (classification only). |
| --predicted_data | flag | No | True | Saves the input data with a prediction column. Use --no-predicted_data to disable. |
| --nlp | flag | No | False | Enable NLP/text mode for prediction. When provided, the predictor combines object/text columns (or uses an existing text column), applies the same text preprocessing used at training, loads the text preprocessor / Word2Vec model from --preprocessor_path, vectorizes inputs (averaged word vectors from the saved preprocessor), and decodes labels with --encoder_path if supplied. |
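Averaging word vectors, as mentioned for `--nlp`, turns a variable-length token list into one fixed-length feature vector. The sketch below shows the arithmetic with a hand-made 2-dimensional embedding table standing in for a trained Word2Vec model; the names are hypothetical and unknown tokens are simply skipped.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 2-dimensional embeddings standing in for a trained Word2Vec model.
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "not": [-1.0, 0.0]}

print(average_vector(["good", "movie"], toy_embeddings, dim=2))  # [0.5, 0.5]
print(average_vector(["unseen"], toy_embeddings, dim=2))         # [0.0, 0.0]
```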
Python API
Use mlforgex directly in your Python applications with simple function calls.
Train a machine learning model with automatic preprocessing and model selection.
from mlforgex import train_model

# Train a model with all options
train_model(
    data_path="path/to/your/data.csv",
    dependent_feature="target_column",
    rmse_prob=0.3,               # Weight for RMSE in regression
    f1_prob=0.7,                 # Weight for F1 in classification
    n_jobs=-1,                   # Use all CPU cores
    n_iter=100,                  # Hyperparameter search iterations
    cv=5,                        # Cross-validation folds
    artifacts_dir="models",      # Where to save artifacts
    fast=False,                  # Full training with tuning
    nlp=True,                    # Enable NLP mode
    dashboard_title="My Model",  # Title for the dashboard
)

# Fast training (no hyperparameter tuning)
train_model(
    data_path="data.csv",
    dependent_feature="target",
    fast=True,  # Skip tuning for faster results
)

Parameters:
- data_path (str): Path to your CSV dataset
- dependent_feature (str): Name of the target column
- fast (bool): Skip hyperparameter tuning

Make predictions on new data using your trained model.
from mlforgex import predict

# Make predictions on new data
predictions = predict(
    model_path="artifacts/model.pkl",
    preprocessor_path="artifacts/preprocessor.pkl",
    input_data_path="new_data.csv",
    encoder_path="artifacts/encoder.pkl",  # For classification
)

# View predictions
print("First 10 predictions:")
print(predictions[:10])

# Save predictions to file (to_csv is available only when a DataFrame is returned)
predictions.to_csv("predictions.csv", index=False)

Returns:
pandas.DataFrame or numpy.ndarray containing predictions for each input row.
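Because the return type may be either a DataFrame or an array, code that saves predictions may want to handle both. The helper below is a hypothetical convenience function, not part of mlforgex: it uses `to_csv` when the object provides it and falls back to the standard `csv` module for 1-D arrays and lists.

```python
import csv
import os
import tempfile


def save_predictions(predictions, path):
    """Write predictions to CSV whether they are DataFrame-like or a plain 1-D sequence."""
    if hasattr(predictions, "to_csv"):
        # pandas objects expose to_csv directly.
        predictions.to_csv(path, index=False)
        return
    # Fallback for arrays and lists: one prediction per row.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["prediction"])
        for value in predictions:
            writer.writerow([value])


# Demo with a plain list standing in for an array of predicted labels.
out_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
save_predictions([0, 1, 1], out_path)
with open(out_path) as handle:
    saved_lines = handle.read().splitlines()
print(saved_lines)  # ['prediction', '0', '1', '1']
```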
Examples
Real-world examples showing how to use mlforgex for different machine learning tasks.
Train a model to predict house prices using the classic housing dataset.
# Train on housing data
mlforge-train \
  --data_path housing.csv \
  --dependent_feature SalePrice \
  --cv 5 \
  --n_iter 50 \
  --artifacts_dir housing_artifacts \
  --dashboard_title "Housing Price Prediction"

# Make predictions
mlforge-predict \
  --model_path housing_artifacts/model.pkl \
  --preprocessor_path housing_artifacts/preprocessor.pkl \
  --input_data new_houses.csv

Predict customer churn using classification with imbalanced data handling.
from mlforgex import train_model, predict

# Train classification model
train_model(
    data_path="customer_data.csv",
    dependent_feature="churn",
    f1_prob=0.8,  # Prioritize F1 score
    n_iter=200,
    artifacts_dir="churn_model",
    dashboard_title="Churn Prediction",
)

# Predict on new customers
predictions = predict(
    model_path="churn_model/model.pkl",
    preprocessor_path="churn_model/preprocessor.pkl",
    input_data_path="new_customers.csv",
    encoder_path="churn_model/encoder.pkl",
)

# Assumes predictions is a 1-D sequence of 0/1 labels
print(f"Predicted churners: {sum(predictions)}")

Quick model training for rapid experimentation and prototyping.
# Fast training without hyperparameter tuning
mlforge-train \
  --data_path experiment_data.csv \
  --dependent_feature target \
  --fast \
  --artifacts_dir quick_model \
  --dashboard_title "Quick Experiment"
# Results available in seconds, not minutes!

Classify images into multiple categories using extracted features.
from mlforgex import train_model, predict

# Train on image features
train_model(
    data_path="image_features.csv",
    dependent_feature="category",
    rmse_prob=0.2,
    f1_prob=0.8,  # Focus on classification metrics
    cv=3,
    n_jobs=-1,
    artifacts_dir="image_classifier",
    dashboard_title="Image Classification Task",
)

# Classify new images
results = predict(
    model_path="image_classifier/model.pkl",
    preprocessor_path="image_classifier/preprocessor.pkl",
    input_data_path="new_image_features.csv",
    encoder_path="image_classifier/encoder.pkl",
)

Artifacts & Outputs
mlforgex automatically saves all artifacts needed for reproducible machine learning workflows.
After training, your artifacts directory contains:
artifacts/
├─ model.pkl # Serialized best model
├─ preprocessor.pkl # Fitted preprocessing pipeline
├─ word2vec.model # word2vec model (NLP)
├─ encoder.pkl # Label encoder (classification)
├─ metrics.txt # Text file with train/test metrics
└─ Dashboard.html # Interactive model analysis dashboard

Model Files
Serialized model and preprocessing pipeline for predictions
Metrics
Detailed performance metrics and configuration
Visualizations
Comprehensive plots for model analysis
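Before calling the predictor, it can help to confirm that a training run left the expected files behind. The sketch below checks a directory against the file names from the listing above; treating `encoder.pkl` and `word2vec.model` as optional is an assumption based on their "(classification)" and "(NLP)" annotations, and the function name is hypothetical.

```python
import tempfile
from pathlib import Path

# File names come from the artifacts listing; the required/optional split is assumed.
REQUIRED = ["model.pkl", "preprocessor.pkl", "metrics.txt", "Dashboard.html"]
OPTIONAL = ["encoder.pkl", "word2vec.model"]  # classification / NLP runs only


def missing_artifacts(artifacts_dir):
    """Return the required artifact files that are absent from artifacts_dir."""
    root = Path(artifacts_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]


# Demo against a throwaway directory with only two of the files present.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.pkl").touch()
    (Path(tmp) / "preprocessor.pkl").touch()
    demo_missing = missing_artifacts(tmp)

print(demo_missing)  # ['metrics.txt', 'Dashboard.html']
```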
mlforgex automatically generates these plots for model analysis:
Correlation Heatmap
Feature correlation analysis
Confusion Matrix
Classification performance
ROC/PR Curves
Binary classification metrics
Feature Importance
Model interpretability
Learning Curves
Training vs validation
Residual Plots
Regression analysis
Prediction Error Distribution
Model performance
How It Works
mlforgex follows a comprehensive 8-step pipeline to automate your entire machine learning workflow, from raw data to production-ready models.
Load & Validate Data
Reads CSV, checks for target column, basic schema validation.
Problem Detection
Automatically infers whether the task is regression or classification.
Preprocessing
Missing value imputation, encoding, scaling, duplicate/outlier removal.
Imbalance Handling
If classification and imbalance detected, apply resampling on training folds.
Model Training
Train a curated set of models appropriate for the detected task.
Hyperparameter Tuning
Use randomized search to tune hyperparameters (skipped in fast mode).
Model Selection
Rank models by composite score and pick the best performing one.
Dashboard & Artifacts
Store model, pipeline, metrics, plots, and run config for reproducibility.
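The model selection step ranks candidates by a composite score, but this page does not give the formula. One plausible reading, shown here purely as an assumption, is a weighted sum in which `f1_prob` weights F1 and the remaining weight goes to a second metric (accuracy in this sketch). The candidate names and scores are made up for illustration.

```python
def composite_score(primary, secondary, weight):
    """Weighted sum of two metrics that are both 'higher is better'."""
    return weight * primary + (1.0 - weight) * secondary


# Hypothetical cross-validated results for three candidate classifiers.
candidates = {
    "logistic_regression": {"f1": 0.78, "accuracy": 0.90},
    "random_forest": {"f1": 0.84, "accuracy": 0.88},
    "xgboost": {"f1": 0.83, "accuracy": 0.91},
}

f1_prob = 0.7  # same default as the --f1_prob flag
ranked = sorted(
    candidates,
    key=lambda name: composite_score(
        candidates[name]["f1"], candidates[name]["accuracy"], f1_prob
    ),
    reverse=True,
)
print(ranked[0])  # xgboost
```

With these numbers the highest-F1 model is not the winner: a weight below 1.0 lets the secondary metric break near-ties, which is the practical effect of tuning `--f1_prob` or `--rmse_prob`.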
End-to-End Automation
From raw data to production-ready model in one command
Testing
mlforgex includes comprehensive tests to ensure reliability and correctness.
Execute the test suite to validate functionality:
$ pytest test/
Key areas covered by the test suite:
Quality Assurance
Our comprehensive test suite ensures that mlforgex works reliably across different datasets, problem types, and configurations. Run tests before contributing or after making changes.
License & Author
Open source software under the MIT License, free for commercial and personal use.
Free and open source software
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files, to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software.
Created and maintained by:
Priyanshu Mathur
Machine Learning Engineer & Open Source Developer
Ready to Get Started?
Install mlforgex now and transform your machine learning workflow with intelligent automation.
