When building models for your machine learning task, you often want to compare multiple algorithms. The standard way to do this is cross validation, which splits your data into multiple training and test sets, scores each model on every split, and lets you compare the aggregated results. In this article, we will see one way to use cross validation in Sklearn.
In this article, we will do cross validation manually by splitting our data twice, running our algorithms, and comparing the results. Below is an example of testing Logistic Regression and an SVM on the iris data set. We train both models once on the inner training split, then score them on both held-out sets and compare the results.
from sklearn import datasets
## Load the train/test split helper and the scoring function
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
## Load the models to compare
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
## Load iris data set and only use first two features
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
## Create a first split of our data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
## Create a second split from the first training set
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, test_size=0.25)
## Train both models on the second split
svm = SVC(kernel='linear')
svmMod = svm.fit(X_train_2, y_train_2)
lr = LogisticRegression()
lrMod = lr.fit(X_train_2, y_train_2)
## Run predictions on the second test set
svmPred = svmMod.predict(X_test_2)
lrPred = lrMod.predict(X_test_2)
## Print the accuracy scores on the second test set
print(accuracy_score(y_test_2, svmPred))
print(accuracy_score(y_test_2, lrPred))
## Run predictions on the first test set
svmPred = svmMod.predict(X_test)
lrPred = lrMod.predict(X_test)
## Print the accuracy scores on the first test set
print(accuracy_score(y_test, svmPred))
print(accuracy_score(y_test, lrPred))
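For comparison, sklearn also ships a built-in helper, cross_val_score, that automates the split/fit/score loop we wrote by hand above. Below is a minimal sketch, assuming the same models and the same X and y from the listing; the cv=5 fold count is an arbitrary choice for illustration.
## Built-in k-fold cross validation: each call fits and scores the model
## on 5 different train/test splits and returns the 5 accuracy scores
from sklearn.model_selection import cross_val_score
svmScores = cross_val_score(SVC(kernel='linear'), X, y, cv=5)
lrScores = cross_val_score(LogisticRegression(), X, y, cv=5)
## Compare the mean accuracy of each model across all folds
print(svmScores.mean())
print(lrScores.mean())
Because each model is fit and scored five times here, the mean accuracy is a more stable basis for picking a winner than any single split.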