StringColumn

class getml.data.columns.StringColumn(name='', role='categorical', num=0, df_name='')

Bases: getml.data.columns._Column

Handle for categorical data that is kept in the getML engine.

Parameters
  • name (str, optional) – Name of the categorical column.

  • role (str, optional) – Role that the column plays.

  • num (int, optional) – Number of the column.

  • df_name (str, optional) – Value of the name instance variable of the DataFrame containing this column.

Note

StringColumn objects are immutable, so their content cannot be changed directly. Every operation that alters the underlying data returns a new column, which is purely virtual and has to be added to the DataFrame using its add() method.

This class provides a set of data preparation methods. They are still experimental and therefore not covered in the main documentation, but they are widely tested and used internally. Only their signatures might change significantly in future releases.

Attributes Summary

length

num_categorical_matrices

Methods Summary

alias(alias)

Adds an alias to the column.

as_num()

Transforms a categorical column to a numerical column.

as_ts([time_formats])

Transforms a categorical column to a time stamp.

contains(other)

Returns a boolean column indicating whether a string or column entry is contained in the corresponding entry of the other column.

count([alias])

COUNT aggregation.

count_distinct([alias])

COUNT DISTINCT aggregation.

substr(begin, length)

Returns a substring for every element in the column.

to_numpy([sock])

Transforms the column to a NumPy array.

update(condition, values)

Returns an updated version of this column.

Attributes Documentation

length
num_categorical_matrices = 0

Methods Documentation

alias(alias)

Adds an alias to the column. This is useful for joins.

Parameters

alias (str) – The name of the column as it should appear in the new DataFrame.

as_num()

Transforms a categorical column to a numerical column.
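To make the idea of a string-to-number conversion concrete, here is a minimal plain-Python sketch. It does not call getML; the helper `to_float` is hypothetical, and mapping unparseable entries to NaN is an assumption rather than documented engine behaviour.

```python
import math
from typing import List


def to_float(value: str) -> float:
    """Parse a string as a float; unparseable input becomes NaN (assumed behaviour)."""
    try:
        return float(value)
    except ValueError:
        return math.nan


strings: List[str] = ["1.5", "42", "n/a"]
numbers = [to_float(s) for s in strings]
```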

as_ts(time_formats=['%Y-%m-%dT%H:%M:%s%z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d'])

Transforms a categorical column to a time stamp.

Parameters

time_formats (list of str, optional) – Formats to be used to parse the time stamps.
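As a rough illustration of parsing with a list of candidate formats, the following plain-Python sketch tries each format in turn and keeps the first one that matches. It does not call getML: the helper `parse_with_formats` is hypothetical, trying the formats in list order is an assumption, and the format strings are Python `strptime` directives, which may differ from the directives the getML engine accepts.

```python
from datetime import datetime
from typing import List, Optional

# Python strptime directives; assumed stand-ins for the engine's own format syntax.
FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]


def parse_with_formats(value: str, formats: List[str]) -> Optional[datetime]:
    """Return the first successful parse of `value`, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None


raw = ["2019-03-01 12:30:00", "2019-03-02", "not a date"]
parsed = [parse_with_formats(v, FORMATS) for v in raw]
```

Listing the most specific format first keeps a date-only pattern from being the first candidate tried on strings that also carry a time component.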

contains(other)

Returns a boolean column indicating whether a string or column entry is contained in the corresponding entry of the other column.

count(alias='new_column')

COUNT aggregation.

Parameters

alias (str) – Name for the new column.

count_distinct(alias='new_column')

COUNT DISTINCT aggregation.

Parameters

alias (str) – Name for the new column.
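The difference between the COUNT and COUNT DISTINCT aggregations can be sketched on a plain Python list. This is only an analogy for the semantics, not getML code, and it leaves out how the engine treats missing values.

```python
categories = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# COUNT: number of entries in the column.
count = len(categories)

# COUNT DISTINCT: number of unique entries in the column.
count_distinct = len(set(categories))
```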

substr(begin, length)

Returns a substring for every element in the column.

Parameters
  • begin (int) – First position of the original string.

  • length (int) – Length of the extracted string.
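The element-wise substring operation can be sketched in plain Python as follows. The function below is a hypothetical stand-in, not the getML method, and it assumes that `begin` is zero-based, which the description above does not state.

```python
from typing import List


def substr(values: List[str], begin: int, length: int) -> List[str]:
    """Take `length` characters starting at `begin` from every string."""
    # Assumption: `begin` is zero-based, as in Python slicing.
    return [v[begin:begin + length] for v in values]


codes = ["DE-1234", "FR-5678", "US-9"]
countries = substr(codes, 0, 2)   # leading two characters of each code
numbers = substr(codes, 3, 4)     # up to four characters after the dash
```

Entries shorter than `begin + length` simply yield a shorter substring in this sketch.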

to_numpy(sock=None)

Transforms the column to a NumPy array.

Parameters

sock (optional) – Socket connecting the Python API with the getML engine.

update(condition, values)

Returns an updated version of this column.

All entries for which the corresponding condition is True are updated using the corresponding entry in values.

Parameters
  • condition (Boolean column) – Condition according to which the update is done.

  • values – Values to update with.
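The conditional update described above can be sketched with plain Python lists. This is an analogy for the semantics only, not getML code: the variable names are illustrative, and `values` is assumed to have one entry per row.

```python
column = ["red", "green", "blue", "green"]
condition = [c == "green" for c in column]   # boolean mask, one entry per row
values = ["r", "g", "b", "g"]                # replacement candidates, one per row

# Where the condition is True, take the entry from `values`;
# elsewhere keep the original entry.
updated = [v if cond else c for c, cond, v in zip(column, condition, values)]
```

The original list is left untouched and a new one is produced, which mirrors the immutability of the columns.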