How to Split Train and Test Data with Sklearn

02.03.2021

Intro

When evaluating a model's performance, you want to split your data into a training set and a test set. This lets you train on one subset of your data and measure performance on data the model has not seen. One common problem this helps you catch is overfitting, where the model has fit the training set so precisely that it can't be used on other data. In this article, we will see how to split your data into training and test sets using Sklearn.
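To make the overfitting check concrete, here is a minimal sketch that compares a model's score on the training set with its score on the test set. The dataset (load_diabetes) and the model (DecisionTreeRegressor) are illustrative assumptions, not something this article prescribes, and any estimator with a score method would work the same way.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: the diabetes dataset and a decision tree are illustrative choices
X, y = load_diabetes(return_X_y=True)

# Hold out part of the data; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A tree with no depth limit can memorize the training set
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# For regressors, score() returns the R^2 coefficient
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"train R^2: {train_score:.2f}")
print(f"test R^2:  {test_score:.2f}")
```

A training score that is much higher than the test score is the usual sign that the model has overfit.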

Splitting our data

To split our data using sklearn, we use the train_test_split function from the model_selection module. This function splits our X and y into training and test sets. By default it shuffles the rows and holds out 25% of them for the test set. It does not stratify automatically. To keep the proportions of a categorical target the same in both sets, pass the stratify parameter (for example, stratify=y).

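from sklearn.model_selection import train_test_split
from sklearn import datasets

# Load a built-in regression dataset
diabetes = datasets.load_diabetes()

X = diabetes.data    # feature matrix
y = diabetes.target  # target values

# Default split: shuffled, 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y)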
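
The call above relies on the defaults. As a hedged sketch of the most commonly adjusted options, the example below sets test_size, random_state, and stratify explicitly. It assumes a classification dataset (load_iris, chosen here only for illustration), because stratifying needs a categorical target.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: iris is used only because it has a categorical target
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # hold out 20% of the rows for testing
    random_state=42,  # fix the shuffle so the split is repeatable
    stratify=y,       # keep class proportions equal in both sets
)

print(X_train.shape, X_test.shape)
# Number of test samples per class
print(np.bincount(y_test))
```

Without stratify, the class counts in the test set depend on the shuffle and can drift away from the proportions in the full dataset, which matters most for small or imbalanced data.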