Logistic Regression on a Large Data Set



Often when building models, we will have a large amount of data to work with. When training a model, there are several solvers to choose from; these solvers use different mathematical optimization techniques, some of which scale better to large data sets. In this article, we will see how to choose a solver for a Logistic Regression model.

Example Code

To specify a different solver for our model, we can use the solver parameter. Here we use the sag solver, which handles large data sets well. Note that sag converges fastest when the features are on a similar scale, so we standardize them first.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

iris = datasets.load_iris()
features = iris.data
target = iris.target

# Standardize features; sag converges fastest when features are similarly scaled
features_standardized = StandardScaler().fit_transform(features)

logistic_regression = LogisticRegression(solver="sag")

model = logistic_regression.fit(features_standardized, target)
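Since the article is about large data sets, it can help to see the solvers compared on one. The sketch below uses a synthetic data set generated with make_classification (an illustrative stand-in for real large data, not part of the original example) and times a few solvers side by side.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical "large" data set for illustration only
features, target = make_classification(
    n_samples=20000, n_features=20, random_state=0
)
features = StandardScaler().fit_transform(features)

# Time a few solvers on the same data; results vary by machine and data
for solver in ["liblinear", "lbfgs", "sag"]:
    start = time.perf_counter()
    LogisticRegression(solver=solver, max_iter=1000).fit(features, target)
    print(f"{solver}: {time.perf_counter() - start:.3f}s")
```

On small data the differences are negligible; the gap between solvers grows with the number of samples.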

Bonus Content

Sklearn offers multiple solvers for different data sets. For Logistic Regression it offers ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, and ‘saga’. Here is a summary from the documentation of when to use each.

For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.

For multi-class problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty.

‘liblinear’ and ‘saga’ also handle L1 penalty.

‘saga’ also supports ‘elasticnet’ penalty.
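The penalty rules above can be sketched in code. The example below (a minimal illustration, using the same iris data as the main example) fits saga with L1 and elastic-net penalties, the combinations the documentation says it supports.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)
y = iris.target

# 'saga' supports the L1 penalty...
l1_model = LogisticRegression(
    solver="saga", penalty="l1", max_iter=5000
).fit(X, y)

# ...and the elastic-net penalty (l1_ratio mixes L1 and L2)
enet_model = LogisticRegression(
    solver="saga", penalty="elasticnet", l1_ratio=0.5, max_iter=5000
).fit(X, y)

# By contrast, a solver such as 'lbfgs' raises an error if asked
# for penalty="l1", since it only handles L2 or no penalty.
```

Picking an unsupported solver/penalty pair fails at fit time, so it is worth checking the table above before training.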

To learn more about solvers, you can look up Linear Programming, Nonlinear Programming, Convex Optimization, and Mathematical Optimization to get started. This field is rich and deep.