Distributed Scikit-learn / Joblib
Lithops supports running distributed scikit-learn programs by providing a joblib backend that executes tasks as Lithops functions instead of local processes. This makes it easy to scale existing scikit-learn applications from a single node to a cluster.
To get started, first install Lithops and the joblib dependencies with:
python3 -m pip install lithops[joblib]
Once installed, import the helper with from lithops.util.joblib import register_lithops and call register_lithops(). This registers Lithops as a joblib backend that scikit-learn can use. Then run your original scikit-learn code inside a with joblib.parallel_backend('lithops') block.
Refer to the official joblib and scikit-learn documentation for details on using these libraries.
Examples
JobLib Lithops backend example
import joblib
from joblib import Parallel, delayed
from lithops.util.joblib import register_lithops
from lithops.utils import setup_lithops_logger

# Make 'lithops' available as a joblib backend name
register_lithops()

def my_function(x):
    print(x)

setup_lithops_logger('INFO')

# Every delayed call inside this block is dispatched through the Lithops backend
with joblib.parallel_backend('lithops'):
    Parallel()(delayed(my_function)(i) for i in range(10))
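The example above only prints inside each task. As a minimal sketch of the same pattern that instead collects return values, the helper below returns the list that joblib's Parallel produces, with results in the same order as the inputs. The names square and run_squares and the backend parameter are hypothetical and not part of Lithops; the backend argument is there only so the helper can also be tried with a local joblib backend such as 'threading' before switching to 'lithops'.

```python
import joblib
from joblib import Parallel, delayed


def square(x):
    # Return the value so it is sent back to the caller
    return x * x


def run_squares(backend="lithops", n=10):
    """Compute the squares of 0..n-1 on the given joblib backend (hypothetical helper)."""
    if backend == "lithops":
        # Imported lazily so the helper also works where Lithops is not installed
        from lithops.util.joblib import register_lithops
        register_lithops()
    with joblib.parallel_backend(backend):
        # Parallel returns a list with one result per delayed call, in input order
        return Parallel()(delayed(square)(i) for i in range(n))

# Example call (requires a configured Lithops installation):
# results = run_squares("lithops")
```

Calling run_squares("threading", 5) exercises the same code path locally, which can be a convenient check before running on Lithops.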
SkLearn example with Lithops as backend for JobLib
import numpy as np
import joblib
from lithops.util.joblib import register_lithops
from lithops.utils import setup_lithops_logger
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

digits = load_digits()

# Hyperparameter distributions sampled by the randomized search
param_space = {
    'C': np.logspace(-6, 6, 30),
    'gamma': np.logspace(-8, 8, 30),
    'tol': np.logspace(-4, -1, 30),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=2, n_iter=50, verbose=10)

register_lithops()
setup_lithops_logger('INFO')

# The cross-validation fits are dispatched through the Lithops backend
with joblib.parallel_backend('lithops'):
    search.fit(digits.data, digits.target)
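After fitting, the search object exposes the usual scikit-learn attributes such as best_params_ and best_score_. The sketch below wraps a reduced version of the search in a function that returns the fitted object so those attributes can be inspected. The function name tune_svc and its backend, n_iter, and seed parameters are hypothetical, the parameter space is trimmed to two entries for brevity, and random_state is added here only to make the sampled candidates repeatable.

```python
import joblib
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC


def tune_svc(backend="lithops", n_iter=50, seed=0):
    """Run a randomized SVC search on the given joblib backend (hypothetical helper)."""
    if backend == "lithops":
        # Imported lazily so the helper can also run with a local backend
        from lithops.util.joblib import register_lithops
        register_lithops()

    digits = load_digits()
    # Reduced parameter space, for illustration only
    param_space = {
        'C': np.logspace(-6, 6, 30),
        'gamma': np.logspace(-8, 8, 30),
    }
    search = RandomizedSearchCV(SVC(kernel='rbf'), param_space, cv=2,
                                n_iter=n_iter, random_state=seed)

    # With n_jobs left unset, scikit-learn picks up the active joblib backend
    with joblib.parallel_backend(backend):
        search.fit(digits.data, digits.target)
    return search

# Example call (requires a configured Lithops installation):
# search = tune_svc("lithops")
# print(search.best_params_, search.best_score_)
```

Passing backend="threading" with a small n_iter gives a quick local run of the same search.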