StringColumn
- class getml.data.columns.StringColumn(name='', role='categorical', num=0, df_name='')
Bases: getml.data.columns._Column
Handle for categorical data that is kept in the getML engine.
- Parameters
name (str, optional) – Name of the categorical column.
role (str, optional) – Role that the column plays.
num (int, optional) – Number of the column.
df_name (str, optional) – The name instance variable of the DataFrame containing this column.
Note
All StringColumn instances are immutable, so their content cannot be changed directly. Every operation that alters the underlying data returns a new column, which is purely virtual and has to be added to the DataFrame using its add() method.
This class provides a set of data preparation methods. They are still experimental (and therefore not covered in the main documentation) but are nevertheless widely tested and used internally. Only their signatures might change significantly in future releases.
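The immutability described in the note can be pictured with a plain-Python sketch. The ToyColumn and ToyFrame classes and the upper() operation below are hypothetical stand-ins, not part of getML; they only illustrate the pattern in which an operation returns a new column that has to be added to a container explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToyColumn:
    """Hypothetical immutable column; not a getML class."""
    values: Tuple[str, ...]

    def upper(self) -> "ToyColumn":
        # Hypothetical operation: returns a new column instead of
        # modifying this one in place.
        return ToyColumn(tuple(v.upper() for v in self.values))


@dataclass
class ToyFrame:
    """Hypothetical container that stores columns by name."""
    columns: Dict[str, ToyColumn] = field(default_factory=dict)

    def add(self, col: ToyColumn, name: str) -> None:
        self.columns[name] = col


frame = ToyFrame()
original = ToyColumn(("a", "b"))
frame.add(original, "letters")

derived = original.upper()            # the original column is untouched
frame.add(derived, "letters_upper")   # the result is stored only after add()
```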
Attributes Summary
Methods Summary
alias(alias) – Adds an alias to the column.
as_num() – Transforms a categorical column to a numerical column.
as_ts([time_formats]) – Transforms a categorical column to a time stamp.
contains(other) – Returns a boolean column indicating whether a string or column entry is contained in the corresponding entry of the other column.
count([alias]) – COUNT aggregation.
count_distinct([alias]) – COUNT DISTINCT aggregation.
substr(begin, length) – Return a substring for every element in the column.
to_numpy([sock]) – Transform column to numpy array.
update(condition, values) – Returns an updated version of this column.
Attributes Documentation
- length
- num_categorical_matrices = 0
Methods Documentation
- alias(alias)
Adds an alias to the column. This is useful for joins.
- Parameters
alias (str) – The name of the column as it should appear in the new DataFrame.
- as_num()
Transforms a categorical column to a numerical column.
- as_ts(time_formats=['%Y-%m-%dT%H:%M:%s%z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d'])
Transforms a categorical column to a time stamp.
- Parameters
time_formats (list of str, optional) – Formats to be used to parse the time stamps.
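As a rough illustration of parsing with a list of formats, the following plain-Python sketch tries each format in turn and keeps the first one that matches. The try-in-order behavior is an assumption made for this example, not something this page states about the engine, and the helper parse_with_formats is hypothetical. It uses Python's datetime.strptime, whose format codes are not guaranteed to match those accepted by as_ts, so only the two simpler default formats are used here.

```python
from datetime import datetime
from typing import List, Optional


def parse_with_formats(text: str, time_formats: List[str]) -> Optional[datetime]:
    """Hypothetical helper: return the first successful parse, else None."""
    for fmt in time_formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # This format does not match; try the next one.
            continue
    return None


formats = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]
raw = ["2019-03-01 12:30:00", "2019-03-01", "n/a"]
parsed = [parse_with_formats(s, formats) for s in raw]
```

Entries that match none of the formats come back as None in this sketch; how the engine treats such entries is not covered here.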
- contains(other)
Returns a boolean column indicating whether a string or column entry is contained in the corresponding entry of the other column.
- count(alias='new_column')
COUNT aggregation.
- Parameters
alias (str) – Name for the new column.
- count_distinct(alias='new_column')
COUNT DISTINCT aggregation.
- Parameters
alias (str) – Name for the new column.
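The difference between the two aggregations can be shown on a plain Python list. This is only an analogue of COUNT and COUNT DISTINCT on in-memory data; it does not call getML, and the handling of missing values is not modeled.

```python
# Example entries of a categorical column (illustrative data only).
entries = ["red", "blue", "red", "green", "blue"]

# COUNT: number of entries.
count = len(entries)

# COUNT DISTINCT: number of unique entries.
count_distinct = len(set(entries))
```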
- substr(begin, length)
Return a substring for every element in the column.
- Parameters
begin (int) – First position of the original string.
length (int) – Length of the extracted string.
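The element-wise behavior of begin and length can be sketched in plain Python. The helper below is hypothetical and assumes that begin is a zero-based position; this page does not state the indexing convention, so check it before relying on a particular offset.

```python
from typing import List


def substr(entries: List[str], begin: int, length: int) -> List[str]:
    """Hypothetical analogue: extract `length` characters starting at `begin`.

    Assumes zero-based positions, which is an assumption of this sketch.
    """
    return [s[begin:begin + length] for s in entries]


# Extract the year part from ISO-formatted date strings.
years = substr(["2019-03-01", "2020-11-15"], 0, 4)
```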
- to_numpy(sock=None)
Transform column to numpy array.
- Parameters
sock (optional) – Socket connecting the Python API with the getML engine.
- update(condition, values)
Returns an updated version of this column.
All entries for which the corresponding condition is True are updated using the corresponding entry in values.
- Parameters
condition (Boolean column) – Condition according to which the update is done.
values – Values to update with.
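The row-wise rule above can be sketched on plain Python lists. The function below is a hypothetical analogue, not the getML implementation; accepting a single string for values and broadcasting it to every row is an assumption made for convenience in this example.

```python
from typing import List, Union


def update(entries: List[str],
           condition: List[bool],
           values: Union[str, List[str]]) -> List[str]:
    """Hypothetical analogue: take the new value where the condition is True."""
    if isinstance(values, str):
        # Assumption of this sketch: a single value applies to every row.
        values = [values] * len(entries)
    return [new if cond else old
            for old, cond, new in zip(entries, condition, values)]


entries = ["a", "", "c"]
condition = [e == "" for e in entries]   # mark empty strings
updated = update(entries, condition, "missing")
```

As with the column itself, the input list is left unchanged and the result is a new object.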