Algorithms Reference Guide
SuperML Java provides a comprehensive collection of machine learning algorithms, each implemented with production-ready features and a scikit-learn compatible API. This guide details each algorithm's capabilities, parameters, and best use cases, with usage examples.
📊 Algorithm Categories Overview
Category | Algorithms | Use Cases |
---|---|---|
Linear Models | 6 algorithms | Classification, Regression, Feature Selection |
Tree-Based | 3 algorithms | Non-linear patterns, Feature importance, Ensemble learning |
Clustering | 1 algorithm | Unsupervised grouping, Customer segmentation |
Meta-Classifiers | 1 algorithm | Multiclass conversion, Algorithm composition |
Preprocessing | 1 transformer | Feature scaling, Data normalization |
🔢 Linear Models
LogisticRegression
Purpose: Binary and multiclass classification with probabilistic outputs
Key Features:
- Automatic multiclass handling (One-vs-Rest and Softmax strategies)
- L1/L2 regularization support
- Gradient descent optimization with convergence monitoring
- Probability prediction capabilities (see the sketch below)
- Configurable learning rate and iterations
Parameters:
LogisticRegression lr = new LogisticRegression()
.setMaxIter(1000) // Maximum iterations
.setTol(1e-6) // Convergence tolerance
.setLearningRate(0.01) // Learning rate
.setRegularization("l2") // Regularization type
.setRegularizationStrength(1.0); // Regularization strength
Best Use Cases:
- Binary classification problems
- Multiclass classification with a moderate number of classes
- When probability estimates are needed
- Baseline classification model
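A minimal end-to-end sketch, assuming XTrain/yTrain/XTest arrays are already prepared; predictProba is an assumed method name based on the probability feature listed above, not a confirmed API:
LogisticRegression clf = new LogisticRegression()
    .setMaxIter(1000)
    .setRegularization("l2");
clf.fit(XTrain, yTrain);                 // gradient descent until convergence or maxIter
var labels = clf.predict(XTest);         // hard class labels
var proba = clf.predictProba(XTest);     // per-class probabilities (assumed method name)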
LinearRegression
Purpose: Ordinary least squares regression for continuous target prediction
Key Features:
- Closed-form solution using the normal equation
- No hyperparameters to tune
- Fast training and prediction
- R² score evaluation (see the sketch below)
Parameters:
LinearRegression lr = new LinearRegression()
.setFitIntercept(true); // Whether to fit intercept term
Best Use Cases:
- Linear relationships between features and target
- Baseline regression model
- When interpretability is important
- Small to medium datasets
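A short usage sketch; a score method returning R² is an assumption based on the evaluation feature listed above:
LinearRegression reg = new LinearRegression().setFitIntercept(true);
reg.fit(XTrain, yTrain);                 // closed-form fit via the normal equation
var yPred = reg.predict(XTest);
double r2 = reg.score(XTest, yTest);     // R² on held-out data (assumed method name)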
Ridge
Purpose: L2 regularized regression to prevent overfitting
Key Features:
- L2 regularization (weight decay)
- Closed-form solution with regularization
- Handles multicollinearity well
- Cross-validation compatible
Parameters:
Ridge ridge = new Ridge()
.setAlpha(1.0) // Regularization strength
.setFitIntercept(true); // Whether to fit intercept
Best Use Cases:
- High-dimensional datasets
- When features are correlated
- Preventing overfitting in linear models
- When all features should be retained
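For reference, the closed-form solution extends the ordinary normal equation with the L2 penalty: w = (XᵀX + αI)⁻¹Xᵀy, where α is the value passed to setAlpha. The αI term keeps the matrix invertible even when features are correlated, which is why Ridge handles multicollinearity well; larger α shrinks all coefficients toward zero without eliminating any of them.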
Lasso
Purpose: L1 regularized regression with automatic feature selection
Key Features:
- L1 regularization for feature selection
- Coordinate descent optimization
- Sparse solutions (some coefficients become zero)
- Built-in feature selection
Parameters:
Lasso lasso = new Lasso()
.setAlpha(1.0) // Regularization strength
.setMaxIter(1000) // Maximum iterations
.setTol(1e-4); // Convergence tolerance
Best Use Cases:
- Feature selection in high-dimensional data
- When interpretability is crucial
- Sparse data or when many features are irrelevant
- Automatic model simplification
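A sketch of using Lasso's sparsity for feature selection; getCoefficients is an assumed accessor name:
Lasso lasso = new Lasso().setAlpha(0.5);
lasso.fit(XTrain, yTrain);               // coordinate descent drives some weights to exactly zero
double[] coef = lasso.getCoefficients(); // assumed accessor
for (int j = 0; j < coef.length; j++) {
    if (coef[j] != 0.0) {                // non-zero weights mark the selected features
        System.out.printf("feature %d selected, weight %.4f%n", j, coef[j]);
    }
}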
SoftmaxRegression
Purpose: Direct multinomial classification with softmax activation
Key Features:
- Native multiclass support (no meta-learning needed)
- Softmax activation for probability normalization
- Cross-entropy loss optimization
- Gradient descent with momentum
Parameters:
SoftmaxRegression softmax = new SoftmaxRegression()
.setMaxIter(1000) // Maximum iterations
.setLearningRate(0.01) // Learning rate
.setTol(1e-6); // Convergence tolerance
Best Use Cases:
- Multiclass classification with many classes
- When class probabilities are needed
- Text classification
- Image classification
OneVsRestClassifier
Purpose: Meta-classifier that converts any binary classifier into a multiclass classifier
Key Features:
- Works with any binary classifier
- Trains one classifier per class
- Probability calibration and normalization
- Parallel training support
Parameters:
OneVsRestClassifier ovr = new OneVsRestClassifier(new LogisticRegression())
.setNJobs(4); // Number of parallel jobs
Best Use Cases:
- Converting binary algorithms to multiclass
- When you want to use specific binary algorithms for multiclass
- Large number of classes
- When different classes have different characteristics
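A minimal sketch of the pattern, assuming prepared training arrays: the wrapper fits one copy of the base classifier per class and, at prediction time, returns the class whose binary model scores highest:
OneVsRestClassifier ovr = new OneVsRestClassifier(new LogisticRegression())
    .setNJobs(4);                        // train the per-class binary models in parallel
ovr.fit(XTrain, yTrain);                 // one LogisticRegression per distinct class label
var labels = ovr.predict(XTest);         // highest-scoring class wins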
🌳 Tree-Based Models
DecisionTree
Purpose: Non-linear classification and regression using tree-based decisions
Key Features:
- CART (Classification and Regression Trees) implementation
- Multiple splitting criteria: Gini, Entropy, MSE
- Comprehensive pruning controls
- Handles both numerical and categorical features
- Feature importance calculation
Parameters:
DecisionTree dt = new DecisionTree()
.setCriterion("gini") // Splitting criterion
.setMaxDepth(10) // Maximum tree depth
.setMinSamplesSplit(2) // Min samples to split node
.setMinSamplesLeaf(1) // Min samples in leaf
.setMinImpurityDecrease(0.0) // Min impurity decrease for split
.setMaxFeatures(-1) // Max features to consider (-1 = all)
.setRandomState(42); // Random seed
Best Use Cases:
- Non-linear relationships
- Mixed data types (numerical + categorical)
- When interpretability is important
- Feature selection and importance analysis
- Baseline for ensemble methods
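A sketch of fitting a tree and reading its feature importances; getFeatureImportances is an assumed accessor name:
DecisionTree dt = new DecisionTree()
    .setCriterion("gini")
    .setMaxDepth(10)
    .setRandomState(42);
dt.fit(XTrain, yTrain);
double[] importances = dt.getFeatureImportances();  // assumed accessor
for (int j = 0; j < importances.length; j++) {
    System.out.printf("feature %d: %.3f%n", j, importances[j]);
}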
RandomForest
Purpose: Ensemble of decision trees with bootstrap aggregating
Key Features:
- Bootstrap sampling for each tree
- Random feature selection at each split
- Parallel training capabilities
- Out-of-bag (OOB) error estimation (see the sketch below)
- Feature importance aggregation
- More robust to overfitting than a single tree
Parameters:
RandomForest rf = new RandomForest()
.setNEstimators(100) // Number of trees
.setMaxDepth(10) // Maximum depth per tree
.setCriterion("gini") // Splitting criterion
.setMinSamplesSplit(2) // Min samples to split
.setMinSamplesLeaf(1) // Min samples in leaf
.setMaxFeatures("sqrt") // Features per split
.setBootstrap(true) // Bootstrap sampling
.setOobScore(true) // Calculate OOB score
.setNJobs(-1) // Parallel jobs (-1 = all cores)
.setRandomState(42); // Random seed
Best Use Cases:
- General-purpose classification and regression
- When high accuracy is needed
- Large datasets with many features
- When overfitting is a concern
- Feature importance analysis
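A sketch using the OOB estimate as a free validation score; getOobScore is an assumed accessor name:
RandomForest rf = new RandomForest()
    .setNEstimators(200)
    .setMaxFeatures("sqrt")
    .setOobScore(true)                   // evaluate each tree on its out-of-bag samples
    .setNJobs(-1);
rf.fit(XTrain, yTrain);
double oob = rf.getOobScore();           // assumed accessor; no separate validation split needed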
GradientBoosting
Purpose: Sequential ensemble that builds trees to correct previous errors
Key Features:
- Sequential learning with gradient descent
- Early stopping with validation monitoring
- Subsampling (stochastic gradient boosting)
- Configurable learning rate and regularization
- Training and validation score tracking
- Feature importance calculation
Parameters:
GradientBoosting gb = new GradientBoosting()
.setNEstimators(100) // Number of boosting stages
.setLearningRate(0.1) // Shrinkage parameter
.setMaxDepth(3) // Maximum depth per tree
.setSubsample(1.0) // Fraction of samples per tree
.setMinSamplesSplit(2) // Min samples to split
.setMinSamplesLeaf(1) // Min samples in leaf
.setMinImpurityDecrease(0.0) // Min impurity decrease
.setValidationFraction(0.1) // Validation set fraction
.setNIterNoChange(5) // Early stopping patience
.setTol(1e-4) // Early stopping tolerance
.setRandomState(42); // Random seed
Best Use Cases:
- High-accuracy classification and regression
- Competitions and benchmarks
- When careful tuning can be done
- Complex non-linear relationships
- When overfitting can be controlled
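A sketch of the early-stopping configuration, using only the setters documented above:
GradientBoosting gb = new GradientBoosting()
    .setNEstimators(500)                 // upper bound; early stopping may use fewer stages
    .setLearningRate(0.05)               // smaller shrinkage usually needs more stages
    .setValidationFraction(0.1)          // hold out 10% of training data for monitoring
    .setNIterNoChange(5)                 // stop after 5 stages without improvement
    .setTol(1e-4);
gb.fit(XTrain, yTrain);                  // training halts once the validation score plateaus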
🎯 Clustering
KMeans
Purpose: Partition-based clustering for grouping similar data points
Key Features:
- K-means++ initialization for better convergence
- Multiple random restarts to avoid local minima
- Inertia (within-cluster sum of squares) calculation
- Configurable convergence criteria
- Cluster center and label prediction
Parameters:
KMeans kmeans = new KMeans()
.setNClusters(8) // Number of clusters
.setInit("k-means++") // Initialization method
.setNInit(10) // Number of initializations
.setMaxIter(300) // Maximum iterations
.setTol(1e-4) // Convergence tolerance
.setRandomState(42); // Random seed
Best Use Cases:
- Customer segmentation
- Market research
- Image segmentation
- Data exploration and visualization
- Dimensionality reduction preprocessing
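A basic sketch; getInertia is an assumed accessor for the within-cluster sum of squares mentioned above:
KMeans kmeans = new KMeans()
    .setNClusters(5)
    .setInit("k-means++")
    .setRandomState(42);
kmeans.fit(X);                           // unsupervised: no labels required
var labels = kmeans.predict(X);          // cluster index per sample
double inertia = kmeans.getInertia();    // assumed accessor; lower means tighter clusters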
🔧 Preprocessing
StandardScaler
Purpose: Feature standardization to zero mean and unit variance
Key Features:
- Z-score normalization (mean=0, std=1)
- Fit/transform pattern consistent with scikit-learn
- Feature-wise scaling independence
- Inverse transformation capability (see the sketch below)
- Numerical stability
Parameters:
StandardScaler scaler = new StandardScaler()
.setWithMean(true) // Center to zero mean
.setWithStd(true); // Scale to unit variance
Best Use Cases:
- Preprocessing for linear models
- When features have different scales
- Before clustering algorithms
- Neural network preprocessing
- SVM preprocessing
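A sketch of the fit/transform pattern; the key point is to fit on training data only so no test-set statistics leak into the model. inverseTransform is an assumed method name based on the inverse transformation feature above:
StandardScaler scaler = new StandardScaler();
double[][] XTrainScaled = scaler.fitTransform(XTrain);  // learn mean/std from training data only
double[][] XTestScaled = scaler.transform(XTest);       // reuse the training statistics
double[][] XRestored = scaler.inverseTransform(XTrainScaled);  // assumed method; undoes the scaling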
📈 Performance Characteristics
Training Time Complexity
Algorithm | Time Complexity | Space Complexity | Notes |
---|---|---|---|
LogisticRegression | O(n × p × i) | O(p) | n=samples, p=features, i=iterations |
LinearRegression | O(n × p²) | O(p²) | Matrix inversion |
Ridge | O(n × p²) | O(p²) | Matrix inversion with regularization |
Lasso | O(n × p × i) | O(p) | Coordinate descent iterations |
DecisionTree | O(n × p × log n) | O(n) | Average case for balanced tree |
RandomForest | O(t × n × p × log n) | O(t × n) | t=number of trees |
GradientBoosting | O(b × n × p × log n) | O(b × n) | b=boosting iterations |
KMeans | O(n × k × i × p) | O(n × p) | k=clusters, i=iterations |
Prediction Time Complexity
Algorithm | Time Complexity | Notes |
---|---|---|
Linear Models | O(p) | Simple linear combination |
DecisionTree | O(log n) | Tree traversal depth |
RandomForest | O(t × log n) | t trees × tree depth |
GradientBoosting | O(b × log n) | b boosting stages × tree depth |
KMeans | O(k × p) | Distance to k centroids |
🎯 Algorithm Selection Guide
For Classification Problems
Problem Type | Recommended Algorithm | Alternative |
---|---|---|
Linear separable | LogisticRegression | SoftmaxRegression |
Non-linear | RandomForest | GradientBoosting |
High dimensions | LogisticRegression + L1 | Lasso |
Many classes | SoftmaxRegression | OneVsRestClassifier |
Interpretability | DecisionTree | LogisticRegression |
High accuracy | GradientBoosting | RandomForest |
For Regression Problems
Problem Type | Recommended Algorithm | Alternative |
---|---|---|
Linear relationship | LinearRegression | Ridge |
Feature selection | Lasso | Ridge + manual selection |
Non-linear | RandomForest | GradientBoosting |
Multicollinearity | Ridge | LinearRegression |
High accuracy | GradientBoosting | RandomForest |
Interpretability | DecisionTree | LinearRegression |
For Clustering Problems
Problem Type | Recommended Algorithm | Notes |
---|---|---|
Spherical clusters | KMeans | Assumes roughly spherical, similarly sized clusters |
Unknown cluster count | KMeans + Elbow method | Try different k values (see the sketch below) |
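A sketch of the elbow method from the table above: fit KMeans over a range of k and look for the point where inertia stops dropping sharply. getInertia is an assumed accessor name:
for (int k = 2; k <= 10; k++) {
    KMeans km = new KMeans().setNClusters(k).setRandomState(42);
    km.fit(X);
    System.out.printf("k=%d inertia=%.2f%n", k, km.getInertia());  // assumed accessor
}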
🚀 Advanced Features
Ensemble Capabilities
- RandomForest: Bootstrap aggregating with feature randomization
- GradientBoosting: Sequential boosting with early stopping
- OneVsRestClassifier: Meta-learning for multiclass conversion
Regularization Support
- L1 Regularization: Lasso, LogisticRegression
- L2 Regularization: Ridge, LogisticRegression
- Early Stopping: GradientBoosting with validation monitoring
Parallel Processing
- RandomForest: Multi-threaded tree training
- OneVsRestClassifier: Parallel binary classifier training
- GridSearchCV: Parallel hyperparameter optimization
Probability Estimation
- LogisticRegression: Native probability support
- SoftmaxRegression: Multinomial probabilities
- RandomForest: Voting-based probabilities
- GradientBoosting: Sigmoid-transformed probabilities
📚 Usage Examples
Complete Classification Workflow
// Load and prepare data
var dataset = Datasets.makeClassification(1000, 20, 3); // 1000 samples, 20 features, 3 classes
var split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42); // 20% test split, seed 42
// Preprocessing
StandardScaler scaler = new StandardScaler();
double[][] XTrainScaled = scaler.fitTransform(split.XTrain);
double[][] XTestScaled = scaler.transform(split.XTest);
// Train multiple algorithms
LogisticRegression lr = new LogisticRegression().fit(XTrainScaled, split.yTrain);
RandomForest rf = new RandomForest(100, 10).fit(XTrainScaled, split.yTrain); // 100 trees, max depth 10
GradientBoosting gb = new GradientBoosting(100, 0.1, 6).fit(XTrainScaled, split.yTrain); // 100 stages, learning rate 0.1, max depth 6
// Evaluate and compare
double lrAccuracy = Metrics.accuracy(split.yTest, lr.predict(XTestScaled));
double rfAccuracy = Metrics.accuracy(split.yTest, rf.predict(XTestScaled));
double gbAccuracy = Metrics.accuracy(split.yTest, gb.predict(XTestScaled));
System.out.printf("Logistic Regression: %.3f\n", lrAccuracy);
System.out.printf("Random Forest: %.3f\n", rfAccuracy);
System.out.printf("Gradient Boosting: %.3f\n", gbAccuracy);
Hyperparameter Optimization
// Grid search for Random Forest
Map<String, Object[]> paramGrid = Map.of(
"n_estimators", new Object[]{50, 100, 200},
"max_depth", new Object[]{5, 10, 15},
"min_samples_split", new Object[]{2, 5, 10}
);
GridSearchCV gridSearch = new GridSearchCV(new RandomForest(), paramGrid)
.setCv(5)
.setScoring("accuracy")
.setNJobs(-1);
gridSearch.fit(XTrainScaled, split.yTrain);
RandomForest bestRF = (RandomForest) gridSearch.getBestEstimator();
This reference covers every algorithm available in SuperML Java, along with its capabilities, parameters, and best use cases. Each algorithm is implemented with production-ready features and follows a scikit-learn compatible API for easy adoption.