time_stamp

getml.data.roles.time_stamp = 'time_stamp'

Marks a column as a time stamp.

This role is used to prevent data leakage. When you join one table onto another, you usually want to make sure that no data from the future is used. Time stamps can be used to restrict your joins accordingly.
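The idea of limiting a join by time stamps can be sketched in plain Python. This is only an illustration of the principle, not the getML API: the tables, the helper visible_rows() and the choice of <= as the cut-off are assumptions made for this example.

```python
# Population table: one row per prediction, each with the time of the prediction.
population = [
    {"id": 1, "time_stamp": 10.0},
    {"id": 1, "time_stamp": 20.0},
]

# Peripheral table: events that are aggregated onto the population table.
peripheral = [
    {"id": 1, "time_stamp": 5.0, "amount": 100.0},
    {"id": 1, "time_stamp": 15.0, "amount": 50.0},
    {"id": 1, "time_stamp": 25.0, "amount": 70.0},
]


def visible_rows(pop_row, rows):
    """Return the rows that match on id and are not from the future (hypothetical helper)."""
    return [
        r
        for r in rows
        if r["id"] == pop_row["id"] and r["time_stamp"] <= pop_row["time_stamp"]
    ]


# Sum of amounts per population row, using only events known at prediction time.
totals = [sum(r["amount"] for r in visible_rows(p, peripheral)) for p in population]
print(totals)  # [100.0, 150.0]
```

Without the time stamp condition, both population rows would see all three events, including the one at time 25.0 that lies in the future of both predictions.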

In addition, the feature learning algorithm can aggregate time stamps or use them in conditions. However, they will not be compared to fixed values unless you explicitly change their units. This means that conditions like the following are not possible by default:

...
WHERE time_stamp > some_fixed_date
...

Instead, time stamps will always be compared to other time stamps:

...
WHERE time_stamp1 - time_stamp2 > some_value
...

This is because conditions that compare time stamps to a fixed date are unlikely to perform well out-of-sample.
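A relative condition of this kind can be sketched in plain Python. The function count_recent(), the event times and the seven-day window are hypothetical and only meant to show why a difference between two time stamps carries over to new data, whereas a fixed cut-off date would not.

```python
DAY = 24 * 60 * 60  # seconds per day


def count_recent(reference_ts, event_ts, window):
    """Count events that lie at most `window` seconds before `reference_ts`."""
    return sum(1 for ts in event_ts if 0 <= reference_ts - ts <= window)


# Event times in seconds, relative to an arbitrary origin.
events = [1 * DAY, 5 * DAY, 9 * DAY, 400 * DAY]

# The same relative rule applies unchanged to an early and a much later reference time.
early = count_recent(10 * DAY, events, window=7 * DAY)
late = count_recent(401 * DAY, events, window=7 * DAY)
print(early, late)  # 2 1
```

A condition such as time_stamp > 10 * DAY would be true for every row observed long after the training period, so it would no longer separate anything.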

When assigning the role time stamp to a column that is currently a StringColumn, you need to specify the format of the strings. You can do so using the time_formats argument of set_role(). You can pass a list of time formats, which are tried in turn to interpret the input strings. The possible format options are:

  • %w - abbreviated weekday (Mon, Tue, …)

  • %W - full weekday (Monday, Tuesday, …)

  • %b - abbreviated month (Jan, Feb, …)

  • %B - full month (January, February, …)

  • %d - zero-padded day of month (01 .. 31)

  • %e - day of month (1 .. 31)

  • %f - space-padded day of month ( 1 .. 31)

  • %m - zero-padded month (01 .. 12)

  • %n - month (1 .. 12)

  • %o - space-padded month ( 1 .. 12)

  • %y - year without century (70)

  • %Y - year with century (1970)

  • %H - hour (00 .. 23)

  • %h - hour (00 .. 12)

  • %a - am/pm

  • %A - AM/PM

  • %M - minute (00 .. 59)

  • %S - second (00 .. 59)

  • %s - seconds and microseconds (equivalent to %S.%F)

  • %i - millisecond (000 .. 999)

  • %c - centisecond (0 .. 9)

  • %F - fractional seconds/microseconds (000000 - 999999)

  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)

  • %Z - time zone differential in RFC format (GMT or +NNNN)

  • %% - percent sign

If none of the formats works, the getML engine will try to interpret the time stamps as numerical values. If this fails, the time stamp will be set to NULL.
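The fallback order described above (formats first, then a numerical interpretation, then NULL) can be sketched in plain Python. This is not the engine's implementation: parse_time_stamp() is a hypothetical helper, it uses the format codes of Python's datetime.strptime() rather than the getML codes listed above, and it assumes that strings without a time zone are in UTC.

```python
from datetime import datetime, timezone


def parse_time_stamp(value, time_formats):
    """Try each format in order, then a numerical interpretation, else None."""
    for fmt in time_formats:
        try:
            parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
            return parsed.timestamp()  # seconds since 1970-01-01T00:00:00 UTC
        except ValueError:
            continue  # this format does not match, try the next one
    try:
        return float(value)  # fall back to a numerical value
    except ValueError:
        return None  # stands in for NULL


# Python codes; '%m|%d|%y' plays the role of '%n|%e|%y' in the getML notation.
formats = ["%Y-%m-%d", "%m|%d|%y"]

print(parse_time_stamp("1971-01-01", formats))  # 31536000.0
print(parse_time_stamp("1|2|71", formats))      # 31622400.0
print(parse_time_stamp("31536000", formats))    # 31536000.0
print(parse_time_stamp("not a date", formats))  # None
```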

>>> data_df = dict(
... date1=[getml.data.time.days(365), getml.data.time.days(366), getml.data.time.days(367)],
... date2=['1971-01-01', '1971-01-02', '1971-01-03'],
... date3=['1|1|71', '1|2|71', '1|3|71'],
... )
>>> df = getml.data.DataFrame.from_dict(data_df, name='dates')
>>> df.set_role(['date1', 'date2', 'date3'], getml.data.roles.time_stamp, time_formats=['%Y-%m-%d', '%n|%e|%y'])
>>> df
| date1                       | date2                       | date3                       |
| time stamp                  | time stamp                  | time stamp                  |
-------------------------------------------------------------------------------------------
| 1971-01-01T00:00:00.000000Z | 1971-01-01T00:00:00.000000Z | 1971-01-01T00:00:00.000000Z |
| 1971-01-02T00:00:00.000000Z | 1971-01-02T00:00:00.000000Z | 1971-01-02T00:00:00.000000Z |
| 1971-01-03T00:00:00.000000Z | 1971-01-03T00:00:00.000000Z | 1971-01-03T00:00:00.000000Z |

Note

getML time stamps are actually floats expressing the number of seconds since the UNIX epoch (1970-01-01T00:00:00 UTC).
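The relation between such a float and its calendar representation can be checked with the Python standard library. The helper to_iso() is hypothetical; it only mimics the display format of the table above and does not call getML.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def to_iso(seconds):
    """Format seconds since the UNIX epoch as an ISO 8601 string in UTC."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.%fZ")


# 365 days after the epoch, since 1970 is not a leap year.
print(to_iso(365 * SECONDS_PER_DAY))  # 1971-01-01T00:00:00.000000Z
```

This is consistent with the first row of the date1 column in the example, which was created from 365 days.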