When testing algorithm performance, you want to split your data into a training set and a test set. This allows you to train on a subset of your data and then measure performance on the held-out test set. One common error this helps you catch is overfitting: the model fits the training set too precisely and can't be used on other data. In this article, we will see how to split your data into training and test sets using sklearn.
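To split our data using sklearn, we use the train_test_split function from the model_selection module. This function splits our X and y into training and test sets. By default it shuffles the rows and holds out 25% of them for testing. Stratified sampling is not automatic: to keep the class proportions of a categorical target the same in both sets, pass the labels through the stratify parameter (for example, stratify=y).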
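Here is a minimal sketch of a stratified split. In train_test_split, stratification is requested through the stratify argument, which defaults to None. The iris dataset, the 80/20 test_size, and random_state=42 are illustrative assumptions, chosen only because iris has a categorical target with three equally sized classes.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Illustrative classification data: 150 rows, 3 classes of 50 rows each
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the rows (assumed value)
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep class proportions equal in both sets
)

# Count how many rows of each class landed in each set
train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With stratify=y, each class should make up the same share of the training set and the test set. Without it, the shares can drift, especially for small or imbalanced datasets.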
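from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load a small regression dataset bundled with scikit-learn
diabetes = datasets.load_diabetes()
X = diabetes.data    # feature matrix
y = diabetes.target  # target values

# With no extra arguments, 25% of the rows are held out for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y)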
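Once the data is split, the overfitting check described at the start can be sketched by scoring the same model on both sets. This is only an illustration: the diabetes dataset, the unrestricted DecisionTreeRegressor, and random_state=0 are assumptions, picked because a fully grown tree tends to memorize its training data.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative regression data bundled with scikit-learn
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A tree with no depth limit can fit the training rows almost perfectly
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# score() returns R^2 for regressors; compare seen vs. unseen data
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train R^2: {train_score:.2f}")
print(f"test R^2:  {test_score:.2f}")
```

A training score far above the test score is the usual sign of overfitting. If the two scores are close, the model is more likely to generalize to new data.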