Algorithms Reference Guide

SuperML Java offers a comprehensive collection of machine learning algorithms, each implemented with production-ready features and a scikit-learn compatible API. This guide details each algorithm's capabilities, parameters, and best use cases, with usage examples.

📊 Algorithm Categories Overview

| Category         | Algorithms    | Use Cases                                                   |
|------------------|---------------|-------------------------------------------------------------|
| Linear Models    | 5 algorithms  | Classification, Regression, Feature Selection               |
| Tree-Based       | 3 algorithms  | Non-linear patterns, Feature importance, Ensemble learning  |
| Clustering       | 1 algorithm   | Unsupervised grouping, Customer segmentation                |
| Meta-Classifiers | 1 algorithm   | Multiclass conversion, Algorithm composition                |
| Preprocessing    | 1 transformer | Feature scaling, Data normalization                         |

🔢 Linear Models

LogisticRegression

Purpose: Binary and multiclass classification with probabilistic outputs

Key Features:

  • Automatic multiclass handling (One-vs-Rest and Softmax strategies)
  • L1/L2 regularization support
  • Gradient descent optimization with convergence monitoring
  • Probability prediction capabilities
  • Configurable learning rate and iterations

Parameters:

LogisticRegression lr = new LogisticRegression()
    .setMaxIter(1000)           // Maximum iterations
    .setTol(1e-6)               // Convergence tolerance
    .setLearningRate(0.01)      // Learning rate
    .setRegularization("l2")    // Regularization type
    .setRegularizationStrength(1.0);  // Regularization strength

Best Use Cases:

  • Binary classification problems
  • Multiclass classification with a moderate number of classes
  • When probability estimates are needed
  • Baseline classification model
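
A minimal end-to-end sketch, assuming the data is already split into XTrain/yTrain/XTest arrays (hypothetical variable names). The predictProba method name is an assumption here, by analogy with scikit-learn's predict_proba; check the actual API for probability access.

// XTrain, XTest: double[][]; yTrain: double[] (assumed variable names)
LogisticRegression clf = new LogisticRegression()
    .setMaxIter(1000)
    .setRegularization("l2");
clf.fit(XTrain, yTrain);
double[] labels = clf.predict(XTest);        // hard class labels
double[][] proba = clf.predictProba(XTest);  // per-class probabilities (assumed method name)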

LinearRegression

Purpose: Ordinary least squares regression for continuous target prediction

Key Features:

  • Closed-form solution via the normal equation
  • No hyperparameters to tune
  • Fast training and prediction
  • R² score evaluation

Parameters:

LinearRegression lr = new LinearRegression()
    .setFitIntercept(true);     // Whether to fit intercept term

Best Use Cases:

  • Linear relationships between features and target
  • Baseline regression model
  • When interpretability is important
  • Small to medium datasets
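
With no hyperparameters to tune, usage reduces to fit and predict. A toy sketch:

// Fit y = 2x + 1 exactly from four points, then predict a new one.
double[][] X = {{1}, {2}, {3}, {4}};
double[] y = {3, 5, 7, 9};
LinearRegression model = new LinearRegression().setFitIntercept(true);
model.fit(X, y);
double[] pred = model.predict(new double[][]{{5}});  // expected: 11.0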

Ridge

Purpose: L2 regularized regression to prevent overfitting

Key Features:

  • L2 regularization (weight decay)
  • Closed-form solution with regularization
  • Handles multicollinearity well
  • Cross-validation compatible

Parameters:

Ridge ridge = new Ridge()
    .setAlpha(1.0)              // Regularization strength
    .setFitIntercept(true);     // Whether to fit intercept

Best Use Cases:

  • High-dimensional datasets
  • When features are correlated
  • Preventing overfitting in linear models
  • When all features should be retained
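
A sketch sweeping the regularization strength. Metrics.r2Score is an assumed method name (the Metrics class appears later in this guide, but only accuracy is shown):

// Larger alpha shrinks coefficients harder; compare held-out R² across values.
for (double alpha : new double[]{0.1, 1.0, 10.0}) {
    Ridge model = new Ridge().setAlpha(alpha).setFitIntercept(true);
    model.fit(XTrain, yTrain);
    System.out.printf("alpha=%.1f  R2=%.3f%n",
        alpha, Metrics.r2Score(yTest, model.predict(XTest)));  // r2Score assumed
}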

Lasso

Purpose: L1 regularized regression with automatic feature selection

Key Features:

  • L1 regularization for feature selection
  • Coordinate descent optimization
  • Sparse solutions (some coefficients become zero)
  • Built-in feature selection

Parameters:

Lasso lasso = new Lasso()
    .setAlpha(1.0)              // Regularization strength
    .setMaxIter(1000)           // Maximum iterations
    .setTol(1e-4);              // Convergence tolerance

Best Use Cases:

  • Feature selection in high-dimensional data
  • When interpretability is crucial
  • Sparse data or when many features are irrelevant
  • Automatic model simplification
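
Sparsity is the distinguishing output: some coefficients become exactly zero. A sketch that counts them, assuming a getCoefficients() accessor (an assumed name; coefficient access is not documented above):

Lasso lasso = new Lasso().setAlpha(0.5).setMaxIter(1000);
lasso.fit(XTrain, yTrain);
double[] coef = lasso.getCoefficients();  // assumed accessor
long zeroed = java.util.Arrays.stream(coef).filter(c -> c == 0.0).count();
System.out.println("Features eliminated: " + zeroed + " of " + coef.length);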

SoftmaxRegression

Purpose: Direct multinomial classification with softmax activation

Key Features:

  • Native multiclass support (no meta-learning needed)
  • Softmax activation for probability normalization
  • Cross-entropy loss optimization
  • Gradient descent with momentum

Parameters:

SoftmaxRegression softmax = new SoftmaxRegression()
    .setMaxIter(1000)           // Maximum iterations
    .setLearningRate(0.01)      // Learning rate
    .setTol(1e-6);              // Convergence tolerance

Best Use Cases:

  • Multiclass classification with many classes
  • When class probabilities are needed
  • Text classification
  • Image classification

OneVsRestClassifier

Purpose: Meta-classifier that converts any binary classifier into a multiclass classifier

Key Features:

  • Works with any binary classifier
  • Trains one classifier per class
  • Probability calibration and normalization
  • Parallel training support

Parameters:

OneVsRestClassifier ovr = new OneVsRestClassifier(new LogisticRegression())
    .setNJobs(4);               // Number of parallel jobs

Best Use Cases:

  • Converting binary algorithms to multiclass
  • When you want to use a specific binary algorithm for a multiclass problem
  • Large number of classes
  • When different classes have different characteristics
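
Usage mirrors any other classifier; the one-binary-model-per-class decomposition happens internally:

OneVsRestClassifier ovr = new OneVsRestClassifier(new LogisticRegression())
    .setNJobs(4);                      // train the per-class models in parallel
ovr.fit(XTrain, yTrain);               // fits one binary classifier per class
double[] labels = ovr.predict(XTest);  // label of the highest-scoring binary model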

🌳 Tree-Based Models

DecisionTree

Purpose: Non-linear classification and regression using tree-based decisions

Key Features:

  • CART (Classification and Regression Trees) implementation
  • Multiple splitting criteria: Gini, Entropy, MSE
  • Comprehensive pruning controls
  • Handles both numerical and categorical features
  • Feature importance calculation

Parameters:

DecisionTree dt = new DecisionTree()
    .setCriterion("gini")           // Splitting criterion
    .setMaxDepth(10)                // Maximum tree depth
    .setMinSamplesSplit(2)          // Min samples to split node
    .setMinSamplesLeaf(1)           // Min samples in leaf
    .setMinImpurityDecrease(0.0)    // Min impurity decrease for split
    .setMaxFeatures(-1)             // Max features to consider (-1 = all)
    .setRandomState(42);            // Random seed

Best Use Cases:

  • Non-linear relationships
  • Mixed data types (numerical + categorical)
  • When interpretability is important
  • Feature selection and importance analysis
  • Baseline for ensemble methods
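
A sketch that fits a tree and reads its feature importances; getFeatureImportances() is an assumed accessor name for the documented importance calculation:

DecisionTree dt = new DecisionTree()
    .setCriterion("gini")
    .setMaxDepth(10);
dt.fit(XTrain, yTrain);
double[] importances = dt.getFeatureImportances();  // assumed accessor
for (int i = 0; i < importances.length; i++) {
    System.out.printf("feature %d: %.4f%n", i, importances[i]);
}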

RandomForest

Purpose: Ensemble of decision trees with bootstrap aggregating

Key Features:

  • Bootstrap sampling for each tree
  • Random feature selection at each split
  • Parallel training capabilities
  • Out-of-bag (OOB) error estimation
  • Feature importance aggregation
  • Robust to overfitting

Parameters:

RandomForest rf = new RandomForest()
    .setNEstimators(100)            // Number of trees
    .setMaxDepth(10)                // Maximum depth per tree
    .setCriterion("gini")           // Splitting criterion
    .setMinSamplesSplit(2)          // Min samples to split
    .setMinSamplesLeaf(1)           // Min samples in leaf
    .setMaxFeatures("sqrt")         // Features per split
    .setBootstrap(true)             // Bootstrap sampling
    .setOobScore(true)              // Calculate OOB score
    .setNJobs(-1)                   // Parallel jobs (-1 = all cores)
    .setRandomState(42);            // Random seed

Best Use Cases:

  • General-purpose classification and regression
  • When high accuracy is needed
  • Large datasets with many features
  • When overfitting is a concern
  • Feature importance analysis
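
OOB estimation yields a held-out accuracy estimate without a separate validation set, because each tree is evaluated on the samples its bootstrap draw left out. A sketch, with getOobScore() as an assumed accessor name:

RandomForest rf = new RandomForest()
    .setNEstimators(200)
    .setBootstrap(true)     // required for OOB: each tree sees a bootstrap sample
    .setOobScore(true)
    .setNJobs(-1);
rf.fit(XTrain, yTrain);
System.out.printf("OOB score: %.3f%n", rf.getOobScore());  // assumed accessor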

GradientBoosting

Purpose: Sequential ensemble that builds trees to correct previous errors

Key Features:

  • Sequential learning with gradient descent
  • Early stopping with validation monitoring
  • Subsampling (stochastic gradient boosting)
  • Configurable learning rate and regularization
  • Training and validation score tracking
  • Feature importance calculation

Parameters:

GradientBoosting gb = new GradientBoosting()
    .setNEstimators(100)            // Number of boosting stages
    .setLearningRate(0.1)           // Shrinkage parameter
    .setMaxDepth(3)                 // Maximum depth per tree
    .setSubsample(1.0)              // Fraction of samples per tree
    .setMinSamplesSplit(2)          // Min samples to split
    .setMinSamplesLeaf(1)           // Min samples in leaf
    .setMinImpurityDecrease(0.0)    // Min impurity decrease
    .setValidationFraction(0.1)     // Validation set fraction
    .setNIterNoChange(5)            // Early stopping patience
    .setTol(1e-4)                   // Early stopping tolerance
    .setRandomState(42);            // Random seed

Best Use Cases:

  • High-accuracy classification and regression
  • Competitions and benchmarks
  • When careful tuning can be done
  • Complex non-linear relationships
  • When overfitting can be controlled
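
Early stopping wires together three of the parameters above. A sketch:

// Hold out 10% of the training data internally; stop adding trees once the
// validation score fails to improve by tol for 5 consecutive stages.
GradientBoosting gb = new GradientBoosting()
    .setNEstimators(500)          // upper bound; early stopping may use fewer
    .setLearningRate(0.05)
    .setValidationFraction(0.1)
    .setNIterNoChange(5)
    .setTol(1e-4);
gb.fit(XTrain, yTrain);
double[] pred = gb.predict(XTest);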

🎯 Clustering

KMeans

Purpose: Partitioning clustering for grouping similar data points

Key Features:

  • K-means++ initialization for better convergence
  • Multiple random restarts to avoid local minima
  • Inertia (within-cluster sum of squares) calculation
  • Configurable convergence criteria
  • Cluster center and label prediction

Parameters:

KMeans kmeans = new KMeans()
    .setNClusters(8)                // Number of clusters
    .setInit("k-means++")           // Initialization method
    .setNInit(10)                   // Number of initializations
    .setMaxIter(300)                // Maximum iterations
    .setTol(1e-4)                   // Convergence tolerance
    .setRandomState(42);            // Random seed

Best Use Cases:

  • Customer segmentation
  • Market research
  • Image segmentation
  • Data exploration and visualization
  • Dimensionality reduction preprocessing
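
Clustering is unsupervised, so fit takes only the feature matrix. A sketch; getInertia() is an assumed accessor name for the documented inertia value:

KMeans kmeans = new KMeans()
    .setNClusters(4)
    .setRandomState(42);
kmeans.fit(X);                             // X: double[][], no labels required
double[] assignments = kmeans.predict(X);  // cluster index per sample (return type assumed)
System.out.printf("Inertia: %.2f%n", kmeans.getInertia());  // assumed accessor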

🔧 Preprocessing

StandardScaler

Purpose: Feature standardization to zero mean and unit variance

Key Features:

  • Z-score normalization (mean=0, std=1)
  • Fit/transform pattern consistent with scikit-learn
  • Feature-wise scaling independence
  • Inverse transformation capability
  • Numerical stability

Parameters:

StandardScaler scaler = new StandardScaler()
    .setWithMean(true)              // Center to zero mean
    .setWithStd(true);              // Scale to unit variance

Best Use Cases:

  • Preprocessing for linear models
  • When features have different scales
  • Before clustering algorithms
  • Neural network preprocessing
  • SVM preprocessing
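
The fit/transform split matters in practice: learn the scaling statistics on the training set only, then apply them unchanged to the test set to avoid leakage. A sketch; inverseTransform is an assumed method name for the documented inverse-transformation capability:

StandardScaler scaler = new StandardScaler();
double[][] XTrainScaled = scaler.fitTransform(XTrain);  // learns mean/std, then scales
double[][] XTestScaled  = scaler.transform(XTest);      // reuses training statistics
double[][] XRestored = scaler.inverseTransform(XTrainScaled);  // assumed method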

📈 Performance Characteristics

Training Time Complexity

| Algorithm          | Time Complexity      | Space Complexity | Notes                                     |
|--------------------|----------------------|------------------|-------------------------------------------|
| LogisticRegression | O(n × p × i)         | O(p)             | n = samples, p = features, i = iterations |
| LinearRegression   | O(n × p²)            | O(p²)            | Matrix inversion                          |
| Ridge              | O(n × p²)            | O(p²)            | Matrix inversion with regularization      |
| Lasso              | O(n × p × i)         | O(p)             | Coordinate descent iterations             |
| DecisionTree       | O(n × p × log n)     | O(n)             | Average case for a balanced tree          |
| RandomForest       | O(t × n × p × log n) | O(t × n)         | t = number of trees                       |
| GradientBoosting   | O(b × n × p × log n) | O(b × n)         | b = boosting iterations                   |
| KMeans             | O(n × k × i × p)     | O(n × p)         | k = clusters, i = iterations              |

Prediction Time Complexity

| Algorithm        | Time Complexity | Notes                          |
|------------------|-----------------|--------------------------------|
| Linear Models    | O(p)            | Simple linear combination      |
| DecisionTree     | O(log n)        | Tree traversal depth           |
| RandomForest     | O(t × log n)    | t trees × tree depth           |
| GradientBoosting | O(b × log n)    | b boosting stages × tree depth |
| KMeans           | O(k × p)        | Distance to k centroids        |

🎯 Algorithm Selection Guide

For Classification Problems

| Problem Type       | Recommended Algorithm   | Alternative          |
|--------------------|-------------------------|----------------------|
| Linearly separable | LogisticRegression      | SoftmaxRegression    |
| Non-linear         | RandomForest            | GradientBoosting     |
| High dimensions    | LogisticRegression + L1 | Lasso                |
| Many classes       | SoftmaxRegression       | OneVsRestClassifier  |
| Interpretability   | DecisionTree            | LogisticRegression   |
| High accuracy      | GradientBoosting        | RandomForest         |

For Regression Problems

| Problem Type        | Recommended Algorithm | Alternative              |
|---------------------|-----------------------|--------------------------|
| Linear relationship | LinearRegression      | Ridge                    |
| Feature selection   | Lasso                 | Ridge + manual selection |
| Non-linear          | RandomForest          | GradientBoosting         |
| Multicollinearity   | Ridge                 | LinearRegression         |
| High accuracy       | GradientBoosting      | RandomForest             |
| Interpretability    | DecisionTree          | LinearRegression         |

For Clustering Problems

| Problem Type          | Recommended Algorithm | Notes                            |
|-----------------------|-----------------------|----------------------------------|
| Spherical clusters    | KMeans                | Works best with globular clusters |
| Unknown cluster count | KMeans + Elbow method | Try different k values           |
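
The elbow method referenced above can be sketched with the documented KMeans API plus an assumed getInertia() accessor: fit for a range of k and pick the point where inertia stops dropping sharply.

// Print inertia versus k; the "elbow" in the curve suggests a cluster count.
for (int k = 2; k <= 10; k++) {
    KMeans km = new KMeans().setNClusters(k).setRandomState(42);
    km.fit(X);
    System.out.printf("k=%d  inertia=%.2f%n", k, km.getInertia());  // assumed accessor
}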

🚀 Advanced Features

Ensemble Capabilities

  • RandomForest: Bootstrap aggregating with feature randomization
  • GradientBoosting: Sequential boosting with early stopping
  • OneVsRestClassifier: Meta-learning for multiclass conversion

Regularization Support

  • L1 Regularization: Lasso, LogisticRegression
  • L2 Regularization: Ridge, LogisticRegression
  • Early Stopping: GradientBoosting with validation monitoring

Parallel Processing

  • RandomForest: Multi-threaded tree training
  • OneVsRestClassifier: Parallel binary classifier training
  • GridSearchCV: Parallel hyperparameter optimization

Probability Estimation

  • LogisticRegression: Native probability support
  • SoftmaxRegression: Multinomial probabilities
  • RandomForest: Voting-based probabilities
  • GradientBoosting: Sigmoid-transformed probabilities

📚 Usage Examples

Complete Classification Workflow

// Load and prepare data
var dataset = Datasets.makeClassification(1000, 20, 3);
var split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);

// Preprocessing
StandardScaler scaler = new StandardScaler();
double[][] XTrainScaled = scaler.fitTransform(split.XTrain);
double[][] XTestScaled = scaler.transform(split.XTest);

// Train multiple algorithms
LogisticRegression lr = new LogisticRegression().fit(XTrainScaled, split.yTrain);
RandomForest rf = new RandomForest(100, 10).fit(XTrainScaled, split.yTrain);
GradientBoosting gb = new GradientBoosting(100, 0.1, 6).fit(XTrainScaled, split.yTrain);

// Evaluate and compare
double lrAccuracy = Metrics.accuracy(split.yTest, lr.predict(XTestScaled));
double rfAccuracy = Metrics.accuracy(split.yTest, rf.predict(XTestScaled));
double gbAccuracy = Metrics.accuracy(split.yTest, gb.predict(XTestScaled));

System.out.printf("Logistic Regression: %.3f\n", lrAccuracy);
System.out.printf("Random Forest: %.3f\n", rfAccuracy);
System.out.printf("Gradient Boosting: %.3f\n", gbAccuracy);

Hyperparameter Optimization

// Grid search for Random Forest
Map<String, Object[]> paramGrid = Map.of(
    "n_estimators", new Object[]{50, 100, 200},
    "max_depth", new Object[]{5, 10, 15},
    "min_samples_split", new Object[]{2, 5, 10}
);

GridSearchCV gridSearch = new GridSearchCV(new RandomForest(), paramGrid)
    .setCv(5)
    .setScoring("accuracy")
    .setNJobs(-1);

gridSearch.fit(XTrainScaled, split.yTrain);
RandomForest bestRF = (RandomForest) gridSearch.getBestEstimator();
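
After fitting, the best cross-validated configuration is typically inspectable; getBestParams() and getBestScore() are assumed accessor names alongside the documented getBestEstimator():

System.out.println("Best params: " + gridSearch.getBestParams());          // assumed accessor
System.out.printf("Best CV accuracy: %.3f%n", gridSearch.getBestScore());  // assumed accessor
double gridAccuracy = Metrics.accuracy(split.yTest, bestRF.predict(XTestScaled));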

This reference covers every algorithm currently available in SuperML Java, along with its parameters and recommended use cases. All implementations ship with production-ready features and follow scikit-learn compatible APIs for easy adoption.