Split Time-Series Dataset

Tomer Katzav
Jul 15, 2021

Sometimes the easiest part is the hardest part. Usually, splitting your dataset into train, validation, and test sets isn't a complicated task, quite the contrary. But what happens if you are working with time-series data, or you want to customize your split? Then you need to adjust the split rather than use a regular one. In one of my projects I had to prepare my dataset for training and testing, and along the way I came across various ways to split a dataset. I tried several of them until I reached the one that gave me the optimal result.

I’ll go over the different methods and provide code examples for each one of them.

For the sake of illustration, I'll use the Iris dataset in my examples.

from sklearn.datasets import load_iris

Then load the iris dataset.

iris = load_iris()

Then store the data and target value into separate variables.

X, y = iris.data, iris.target
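As a quick sanity check, Iris contains 150 samples with 4 features each:

print(X.shape, y.shape)  # (150, 4) (150,)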

First, we will use sklearn's vanilla train_test_split as follows:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In this simple example, I split my data into 80% training and 20% test. I set the random seed to 42, and the outcome is four arrays: features and labels for training and for testing. Another important parameter is shuffle, which defaults to True and shuffles the data before splitting. Pay special attention to it if you're using time-series data: shuffling can mess up your results, because you don't want data points from the future to leak into past training (look-ahead bias).
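For time-ordered data, a minimal fix is to turn shuffling off so that the test set is the chronological tail of the data (random_state has no effect when shuffle=False):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)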

Either way, this splits your data at a single point, which isn't very helpful when you have time-series data.

Now, say you don't want to split your dataset at one fixed point. Say you have time-series data and you want to split it into fixed intervals. For this kind of task you can use TimeSeriesSplit, which provides train and test indices to split time-series data samples that are observed at fixed time intervals.

In each split, the test indices must be higher than the train indices, and, as stated before, shuffling is not allowed in this cross-validator. This is the plain example from sklearn:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
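Running this loop prints an expanding training window, where each test index comes strictly after its train indices:

TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]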

Another option for using TimeSeriesSplit is to pass it to GridSearchCV.

For those of you who aren't familiar with GridSearchCV: in a nutshell, it iterates over specified parameter values for an estimator. You provide the parameter values you want to search over and the scoring method used to evaluate each candidate on the test folds. The result is the best parameter combination to use in your model.

Here is an example of using GridSearchCV along with TimeSeriesSplit

import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# predefine the variables
n_splits = 5  # number of splits
model = RandomForestRegressor()  # I used random forest as my model
# Add the parameters to grid search on - this is just an example
grid_params = {'n_estimators': [int(x) for x in np.linspace(200, 1000, 3)],
               'max_depth': [int(x) for x in np.linspace(5, 55, 11)],
               'max_features': ['auto', 'sqrt', 'log2'],  # note: 'auto' was removed in newer sklearn versions; use 1.0 instead
               'random_state': [42]
               }
refit = True  # refit an estimator using the best found parameters on the whole dataset
scoring = 'neg_mean_squared_error'  # strategy to evaluate the performance of the cross-validated model on the test set
n_jobs = -1  # number of jobs to run in parallel
tscv = TimeSeriesSplit(n_splits=n_splits)
grid_search = GridSearchCV(estimator=model, param_grid=grid_params, refit=refit,
                           scoring=scoring, cv=tscv, n_jobs=n_jobs).fit(X, y)
print(f'Model: {model} best params are: {grid_search.best_params_}')
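Once the search has finished, the refit model and its cross-validated score are available on the fitted object:

best_model = grid_search.best_estimator_  # refit on the whole dataset because refit=True
print(f'Best CV score (neg MSE): {grid_search.best_score_}')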

Lastly, say you don't want fixed intervals but predefined intervals tailored to your needs. In this case, you can use the PredefinedSplit cross-validator.

It provides train/test indices to split data into train/test sets using a predefined scheme specified by you with the test_fold parameter.
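To get a feel for the test_fold semantics, here is a minimal sketch (the fold values are made up for illustration): entries set to -1 never appear in a test set, while entries set to 0 form test fold 0.

import numpy as np
from sklearn.model_selection import PredefinedSplit

test_fold = np.array([-1, -1, -1, 0, 0])  # last two samples form the only test fold
ps = PredefinedSplit(test_fold=test_fold)
for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)  # TRAIN: [0 1 2] TEST: [3 4]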

Again, I'll use it as the cross-validator in my GridSearchCV. The following picture illustrates it best:

[Figure illustrating a predefined train/validation split — source: Kaggle]
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, PredefinedSplit

model = RandomForestRegressor()
grid_params = {'n_estimators': [int(x) for x in np.linspace(200, 1000, 3)],
               'max_depth': [int(x) for x in np.linspace(5, 55, 11)],
               'max_features': ['auto', 'sqrt', 'log2'],
               'random_state': [42]
               }
refit = True
scoring = 'neg_mean_squared_error'
n_jobs = -1
validation_size = 24  # number of trailing dates to hold out for validation

# X is assumed to be a DataFrame carrying a 'date' column (here restored from the index)
X.reset_index(inplace=True)
X.sort_values('date', inplace=True)
# note: y must be reordered the same way so its rows stay aligned with X
all_dates = pd.to_datetime(X['date'].unique()).sort_values()

# the last validation_size dates form the validation set
val_dates = all_dates[-validation_size:]

n_total_obs = len(X)
n_valid_obs = X['date'].isin(val_dates).sum()

# -1 keeps a sample out of every test set; 0 assigns it to test fold 0 (the validation set)
test_fold_encoding = np.concatenate([-np.ones(n_total_obs - n_valid_obs), np.zeros(n_valid_obs)])

# with a single 0 fold, PredefinedSplit yields exactly one train/validation split
cv = list(PredefinedSplit(test_fold=test_fold_encoding).split())

# remember to drop the 'date' column (and any other non-numeric columns) before fitting
grid_search = GridSearchCV(estimator=model, param_grid=grid_params, refit=refit,
                           scoring=scoring, cv=cv, n_jobs=n_jobs).fit(X, y)
print(f'Model: {model} best params are: {grid_search.best_params_}')
# Credit to: Idan Schatz

Note that you need a date column in your dataset for this example to work.

Remember that when using a validation set, you need to set test_fold_encoding to 0 for all samples that are part of the validation set, and to -1 for all other samples. In general, the entry test_fold_encoding[i] represents the index of the test fold that sample i belongs to, and setting test_fold_encoding[i] to -1 excludes sample i from every test set.

That’s it for now, I hope this article will prove useful to your endeavors in the future. Thank you so much for reading!
