Train and Test Data Split for ML Models

Hitesh Mandav
5 min read · Jul 6, 2021

The first thing you should do when you receive a data set is split it into two parts: a training set and a test set. The most common ratio is 80:20.

This is done so that neither we nor our model see one portion of the data, which is kept aside for testing the trained model. The larger portion is used for training and the smaller one for testing.

What happens when we don't split the dataset?

Then we would have to test the model on the same data set it was trained on. This usually gives high accuracy, but that does not make it a good model: the score tells us nothing about how it handles new data. The model may be overfitted and perform poorly on previously unseen data.

Overfitting is when the model fits the training data a little too closely, capturing its noise along with the underlying pattern. The figure below illustrates overfitting.

The blue line represents a model overfitted to replicate the training data exactly. It will always give the correct label values for the features it was trained on, but when we give it a new, unseen value it may return a value far from the optimal one. We want our model to be like the black line, which is much more generalized: it will show some error on the training values but will also give reliable values for unseen data. Image credit: Wikipedia.
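To make the idea concrete, here is a minimal sketch using NumPy only. The data is synthetic (a noisy straight line) and the names x_train, x_test, and mse are made up for this illustration; it is not the data behind the figure. A degree-9 polynomial plays the role of the overfitted model and a straight line plays the generalized one.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator so the sketch is repeatable

# Synthetic, roughly linear data: y = 2x + 1 plus noise (an assumption for this demo)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1 + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 10)  # points the models never saw
y_test = 2 * x_test + 1 + rng.normal(scale=0.2, size=x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

line_fit = np.polyfit(x_train, y_train, deg=1)    # generalized model
wiggly_fit = np.polyfit(x_train, y_train, deg=9)  # enough freedom to hit every training point

for name, fit in [("line", line_fit), ("degree 9", wiggly_fit)]:
    print(f"{name}: train MSE {mse(fit, x_train, y_train):.4f}, "
          f"test MSE {mse(fit, x_test, y_test):.4f}")
```

The degree-9 fit should show a training error close to zero, since it can pass through all ten training points. Its test error is typically much worse than the line's, though the exact numbers depend on the noise.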

Now that we understand why the data needs to be split, let's see how to do it.

Using Python

Here we are going to use Python to implement a function that does this split for us.

First, we need a data set. We will create a sample data frame using pandas.

import pandas as pd

df = pd.DataFrame({'Temperature': [98.3, 99.8, 97.3, 96.4, 92.5],
                   'Humidity': [0.23, 0.67, 1.7, 0.8, 1.3],
                   'Rained': [0, 0, 1, 1, 1]})
print(df)

This sample data tells us the temperature, the humidity, and whether it rained on a particular day.

Now we can write a function that takes the data and a split ratio as parameters and returns two data sets, one for training and one for testing. To do this we will use NumPy.

import numpy as np

# Manual basic train/test splitting function
def split_train_test(data, test_ratio):
    # Shuffle the row positions so the split does not follow any ordering in the data
    shuffled_indices = np.random.permutation(len(data))
    test_data_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_data_size]
    train_indices = shuffled_indices[test_data_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_data, test_data = split_train_test(df, 0.2)
print(f'Train data set count {len(train_data)}\nTest data set count {len(test_data)}')

In this function, we create an array of shuffled indices covering the length of the data set. Shuffling ensures that the split does not follow a pattern if the data is sorted by a particular feature. Then we compute the size of the test set.

We use these values to slice the shuffled indices into test and train indices and return the corresponding train and test data sets.
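The slicing step can be traced on the five-row sample. The following sketch works on row positions only; n_rows is a name introduced for the illustration, and the shuffled order will differ from run to run.

```python
import numpy as np

n_rows = 5          # the sample data frame has five rows
test_ratio = 0.2

shuffled_indices = np.random.permutation(n_rows)   # some ordering of 0..4, varies per run
test_data_size = int(n_rows * test_ratio)          # int() truncates toward zero

test_indices = shuffled_indices[:test_data_size]   # first chunk goes to the test set
train_indices = shuffled_indices[test_data_size:]  # the rest goes to the training set

print("test rows: ", test_indices)
print("train rows:", train_indices)
```

Every row position ends up in exactly one of the two slices, so the two sets never overlap. Because int() truncates, a ratio that does not divide the row count evenly yields a slightly smaller test set.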

But this approach runs into a problem. If we call this function multiple times, we will get a different train and test set each time, because we are using np.random to generate the shuffled indices.

This leads to the very problem we are trying to eliminate: in the long run, the entire data set will be exposed to the model, and no data will remain that the model has not seen.
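A rough way to see this leakage is to call an unseeded split repeatedly and record which rows ever land in the test set. The sketch below assumes a hypothetical 100-row data frame and repeats the splitting function so that it runs on its own; rows_seen_in_test is a name invented for the demo.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_data_size = int(len(data) * test_ratio)
    return (data.iloc[shuffled_indices[test_data_size:]],
            data.iloc[shuffled_indices[:test_data_size]])

# Hypothetical data set with 100 rows
data = pd.DataFrame({"value": np.arange(100)})

rows_seen_in_test = set()
for _ in range(200):  # stands in for 200 separate runs of a script or notebook
    _, test_data = split_train_test(data, 0.2)
    rows_seen_in_test.update(test_data.index)

print(f"Rows that appeared in a test set at least once: "
      f"{len(rows_seen_in_test)} of {len(data)}")
```

After enough runs, practically every row has been a test row at some point, and so has also been a training row in other runs. A model retrained across those runs has effectively seen all of the data.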

To eliminate this and get the same shuffled indices every time, we can set a seed for np.random. The seed takes an integer value, and np.random will generate the same shuffled indices as long as the seed value is the same.

So we will add a parameter called random_seed and pass it to np.random.seed() inside the split_train_test function.

import numpy as np

# Manual basic train/test splitting function
def split_train_test(data, test_ratio, random_seed):
    # Set the random seed so the same shuffled indices are generated every time
    np.random.seed(random_seed)
    shuffled_indices = np.random.permutation(len(data))
    test_data_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_data_size]
    train_indices = shuffled_indices[test_data_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_data, test_data = split_train_test(df, 0.2, 42)
print(f'Train data set count {len(train_data)}\nTest data set count {len(test_data)}')

This function will now always return the same data sets, even if it is called multiple times, as long as the seed value and the data are the same.
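One thing to be aware of: np.random.seed() resets NumPy's global random state, which also affects any other code that draws random numbers afterwards. A possible alternative, sketched below, is to create a local generator with np.random.default_rng(). This is a variation on the function above rather than the approach used in this article, and the name split_train_test_rng is made up for the example.

```python
import numpy as np
import pandas as pd

def split_train_test_rng(data, test_ratio, random_seed):
    # A local Generator keeps the seeding confined to this function
    rng = np.random.default_rng(random_seed)
    shuffled_indices = rng.permutation(len(data))
    test_data_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_data_size]
    train_indices = shuffled_indices[test_data_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

df = pd.DataFrame({'Temperature': [98.3, 99.8, 97.3, 96.4, 92.5],
                   'Humidity': [0.23, 0.67, 1.7, 0.8, 1.3],
                   'Rained': [0, 0, 1, 1, 1]})

first_train, first_test = split_train_test_rng(df, 0.2, 42)
second_train, second_test = split_train_test_rng(df, 0.2, 42)

# Same seed and same data give the same split
print(first_test.index.equals(second_test.index))
```

The rows selected for a given seed will differ from those chosen by the np.random.seed() version, since the two rely on different random number generators.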

Data splitting is needed for every ML model we create, so to make the task easier scikit-learn provides built-in functions that take care of it for us.

Split data using scikit-learn.

In sklearn.model_selection there is a train_test_split function that we can use to split data into training and testing sets.

Below is the implementation:

from sklearn.model_selection import train_test_split

# Basic train/test split using sklearn
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
print(f'Train data set count {len(train_data)}\nTest data set count {len(test_data)}')

Here we are passing the same values that we used in the function we wrote above.

Sometimes we have a feature whose values we want represented in the same proportion in both the training and the testing data.
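In practice the features and the label are often kept in separate objects. train_test_split accepts several arrays of equal length and splits them with the same shuffled indices. A small sketch on the sample data, where treating 'Rained' as the label is an assumption made for this example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Temperature': [98.3, 99.8, 97.3, 96.4, 92.5],
                   'Humidity': [0.23, 0.67, 1.7, 0.8, 1.3],
                   'Rained': [0, 0, 1, 1, 1]})

X = df[['Temperature', 'Humidity']]  # features
y = df['Rained']                     # label (assumed for this example)

# Both inputs are split with the same row selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(f'Train rows {len(X_train)}, test rows {len(X_test)}')
```

The returned pieces stay aligned, so each row of X_train still corresponds to the label at the same index in y_train.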

For example, I have taken a data set (the Boston housing data set) that has a feature called “CHAS”, which contains two values, 0 and 1.

I have copied the data in a housing data frame and now I can print the value counts for the “CHAS” feature.

print(f'Data set value counts for CHAS column\n{housing["CHAS"].value_counts()}')

Here we have 367 data points with a value of 0 and 27 data points with a value of 1. That is roughly 13.6 zeros for every one (367 / 27 ≈ 13.6).

After analyzing this data, we decide that this feature needs to be split in the same proportion across the training and test data.

We can do this using scikit-learn’s StratifiedShuffleSplit:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# With n_splits=1 the loop body runs once and yields one pair of index arrays
for train_i, test_i in split.split(housing, housing['CHAS']):
    strat_train_data = housing.iloc[train_i]
    strat_test_data = housing.iloc[test_i]

print(f'Train data set count {len(strat_train_data)}\nTest data set count {len(strat_test_data)}')
print(f"Train value count for CHAS column\n{strat_train_data['CHAS'].value_counts()}")
print(f'Test value count for CHAS column\n{strat_test_data["CHAS"].value_counts()}')

Here we can see that the train set has a ratio of ~13.3 and the test set has a ratio of ~14.8 for the 0 and 1 distribution, so the data has been split almost proportionally with respect to the ‘CHAS’ feature.
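The same kind of stratified split can also be requested from train_test_split through its stratify parameter. The sketch below uses a synthetic stand-in with the class counts reported above (367 zeros and 27 ones), not the actual housing data; housing_like and the 'value' column are placeholders invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 367 rows with CHAS = 0 and 27 rows with CHAS = 1
housing_like = pd.DataFrame({
    'CHAS': [0] * 367 + [1] * 27,
    'value': np.arange(394),  # placeholder feature
})

# stratify keeps the CHAS proportions similar in both parts
strat_train, strat_test = train_test_split(
    housing_like, test_size=0.2, random_state=42,
    stratify=housing_like['CHAS'])

for name, part in [('Train', strat_train), ('Test', strat_test)]:
    counts = part['CHAS'].value_counts()
    print(f'{name}: {len(part)} rows, 0-to-1 ratio {counts[0] / counts[1]:.1f}')
```

For a single split this is usually the shorter option. StratifiedShuffleSplit is more useful when several stratified splits are needed, since n_splits can be set higher than 1.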

This is a very short introduction to how you can split training and test data for your ML model.

Thanks for reading. Do leave a comment if you have any input.

ENJOY YOUR CODING!
