Split Time-Series Dataset

Sometimes the easiest part is the hardest part. Usually, splitting your dataset into train, validation and test sets isn’t a complicated task; quite the contrary. But what happens if you are using time-series data, or you want to customize your split? Then you need to adjust it rather than use a regular split. In one of my projects I had to prepare my dataset for training and testing, and along the way I came across various ways to split a dataset. I tried several of them until I found the one that worked best for my case.

I’ll go over the different methods and provide code examples for each one of them.

For the sake of the argument, I’ll use the Iris dataset in my examples.

Then load the iris dataset.
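A minimal sketch of this step, assuming scikit-learn is installed and using its bundled copy of the dataset (the variable name iris is just a choice made here):

```python
from sklearn.datasets import load_iris

# load_iris() returns a dict-like Bunch holding the features, labels and metadata
iris = load_iris()

print(iris.data.shape)    # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```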

Then store the data and target value into separate variables.
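Something along these lines would do it; the names X and y are the usual convention, not something the dataset requires:

```python
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # feature matrix, shape (150, 4)
y = iris.target  # integer class labels, shape (150,)
```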

First, we will use sklearn’s vanilla train_test_split.
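A minimal sketch of that call, with test_size=0.2 and random_state=42 matching the values discussed in this article:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% of the rows go to training and 20% to testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```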

In this simple example, I split my data into 80% training and 20% test. I set the random seed (random_state) to 42, and the outcome is four datasets: features and labels for training, and features and labels for testing. You can also set another important parameter named shuffle. It defaults to True, which means the data is shuffled before splitting. Pay attention to this one, especially if you’re using time-series data: shuffling can mess up your results, because data points from the future end up in the training set (look-ahead bias).
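To see what shuffle=False does, here is a small sketch on a hypothetical time-ordered series (the arrays are made up for illustration; index 0 is the oldest observation):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered data: 10 observations, oldest first
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=False keeps the original order, so the test set is the most recent 20%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(y_train)  # [0 1 2 3 4 5 6 7]
print(y_test)   # [8 9]
```

With the order preserved, every training point comes before every test point.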

This approach splits your data at just one point, which isn’t very helpful when you have time-series data.

Now, say you don’t want to split your dataset at a single fixed point. Say you have time-series data and you want to split it into fixed intervals. For this kind of task, you can use TimeSeriesSplit, which provides train and test indices to split time-series data samples that are observed at fixed time intervals.

In each split, the test indices must be higher than in the previous split, so, as stated before, shuffling is inappropriate in this cross-validator. This is the plain example from sklearn.
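A sketch along the lines of the toy example in the scikit-learn documentation (the small arrays and n_splits=3 are choices made here for readability):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 6 time-ordered samples
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(n_splits=3)

# Each split trains on an expanding window and tests on the block right after it
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
# TRAIN: [0, 1, 2] TEST: [3]
# TRAIN: [0, 1, 2, 3] TEST: [4]
# TRAIN: [0, 1, 2, 3, 4] TEST: [5]
```

Note how the training window grows with each split while the test block always lies after it.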

Another option for using the TimeSeriesSplit method is with GridSearchCV

For those of you who aren’t familiar with GridSearchCV, in a nutshell, it iterates over specified parameter values for an estimator. You provide the parameter values you want to iterate over and the scoring method used to evaluate each candidate on the held-out folds. The result is the parameter combination that scored best, which you can then use in your model.

Here is an example of using GridSearchCV along with TimeSeriesSplit
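A hedged sketch of the combination. It uses synthetic regression data rather than Iris, because Iris is ordered by class and the earliest training windows would contain a single label. The Ridge estimator, the alpha grid and the scoring metric are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Hypothetical time-ordered data: 100 observations, 3 features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Illustrative grid of regularization strengths to try
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# Every candidate is scored on 5 forward-chaining splits
tscv = TimeSeriesSplit(n_splits=5)
search = GridSearchCV(
    Ridge(), param_grid, cv=tscv, scoring="neg_mean_squared_error"
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Passing the splitter through cv is all it takes; GridSearchCV then never trains on observations that come after the ones it validates on.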

Lastly, say you don’t want to use fixed intervals but predefined intervals that are tailored to your needs. In this case, you can use the PredefinedSplit cross-validator.

It provides train/test indices to split data into train/test sets using a predefined scheme that you specify with the test_fold parameter.
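As a small illustration of the test_fold parameter, here is a sketch on eight hypothetical samples where the last two are held out (the array is made up for this example):

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical scheme for 8 time-ordered samples:
# -1 means "always in training", 0 means "belongs to test fold 0"
test_fold = np.array([-1, -1, -1, -1, -1, -1, 0, 0])

ps = PredefinedSplit(test_fold)
print(ps.get_n_splits())  # 1

splits = [(train.tolist(), test.tolist()) for train, test in ps.split()]
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
# TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7]
```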

Again, I’ll use it as the cross-validator in my GridSearchCV. The following picture illustrates it best

source: Kaggle

Note that to build the split this way you need a date column (or another ordering key) in your dataset; PredefinedSplit itself only receives the resulting test_fold array.

Remember that when using a validation set, you need to set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples. Also, the entry test_fold[i] represents the index of the test set that sample i belongs to. It is possible to exclude sample i from any test set by setting test_fold[i] equal to -1.
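Putting this together, here is a hedged sketch that derives test_fold from a date array and hands the split to GridSearchCV. The synthetic data, the cutoff date, the Ridge estimator and the alpha grid are all assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical daily data: 100 consecutive days with 3 features each
rng = np.random.default_rng(0)
dates = np.arange("2021-01-01", "2021-04-11", dtype="datetime64[D]")
X = rng.normal(size=(len(dates), 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=len(dates))

# Samples on or after the (assumed) cutoff form validation fold 0;
# everything earlier gets -1 and is only ever used for training
cutoff = np.datetime64("2021-03-22")
test_fold = np.where(dates >= cutoff, 0, -1)

ps = PredefinedSplit(test_fold)
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=ps,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

print(search.best_params_)
```

Keep in mind that GridSearchCV refits the best estimator on the full dataset by default (refit=True), so the final model also sees the validation rows.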

That’s it for now, I hope this article will prove useful to your endeavors in the future. Thank you so much for reading!
