time

getml.data.split.time(population, time_stamp, validation=None, test=None, **kwargs)

Returns a StringColumnView that can be used to divide data into training, testing, validation or other sets.

The arguments are key=value pairs of names (key) and starting points (value). Each starting point defines the left endpoint of its subset. Intervals are left-closed and right-open, i.e. \([value, next value)\), and the last named subset extends to the end of the data. The (unnamed) subset to the left of the first named starting point, i.e. \([0, first value)\), is always considered to be the training set.
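The interval rule can be illustrated in plain Python. This is only a sketch of the left-closed, right-open assignment described above, not getML's implementation: the helper `assign_split` and the numeric starting points are hypothetical, and the label 'train' for the unnamed subset is taken from the example further down.

```python
from bisect import bisect_right


def assign_split(value, **starts):
    """Hypothetical helper: label `value` using left-closed, right-open intervals."""
    # Order the named starting points from earliest to latest.
    ordered = sorted(starts.items(), key=lambda item: item[1])
    # The unnamed subset before the first starting point is the training set.
    names = ["train"] + [name for name, _ in ordered]
    bounds = [start for _, start in ordered]
    # bisect_right counts the starting points that are <= value, so a value
    # equal to a starting point falls into the subset that begins there.
    return names[bisect_right(bounds, value)]


# Illustrative starting points (assumed values, not real time stamps).
labels = [
    assign_split(t, validation=10.0, test=20.0, other=30.0)
    for t in (5.0, 10.0, 19.9, 20.0, 35.0)
]
print(labels)
```

A value equal to a starting point belongs to the subset that begins there, which is what "left-closed" means; the order in which the keyword arguments are passed does not matter in this sketch because they are sorted by value.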

Args:
population (DataFrame or View):

The population table you would like to split.

time_stamp (str):

The name of the time stamp column in the population table you want to use. Ideally, the role of said column would be time_stamp. If you want to split on the rowid, then pass “rowid” to time_stamp.

validation (float, optional):

The start date of the validation set.

test (float, optional):

The start date of the test set.

kwargs (float, optional):

Any other sets you would like to assign. You can name these sets whatever you like (in the example below, the additional set is called ‘other’).

Example:
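import getml

# data_frame is assumed to be an existing getml.DataFrame with a time stamp column "ds".
validation_begin = getml.data.time.datetime(2010, 1, 1)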
test_begin = getml.data.time.datetime(2011, 1, 1)
other_begin = getml.data.time.datetime(2012, 1, 1)

split = getml.data.split.time(
    population=data_frame,
    time_stamp="ds",
    test=test_begin,
    validation=validation_begin,
    other=other_begin
)

# Contains all data before 2010-01-01 (not included)
train_set = data_frame[split == "train"]

# Contains all data from 2010-01-01 (included) to 2011-01-01 (not included)
validation_set = data_frame[split == "validation"]

# Contains all data from 2011-01-01 (included) to 2012-01-01 (not included)
test_set = data_frame[split == "test"]

# Contains all data from 2012-01-01 (included) onward
other_set = data_frame[split == "other"]