VirtualStringColumn

class getml.data.columns.VirtualStringColumn(df_name, operator, operand1, operand2)

Bases: object

Handle to a (lazily evaluated) virtual string column.

Virtual columns do not hold any data themselves. They are evaluated lazily, only when the result is needed.

Examples

import numpy as np

import getml.data as data
import getml.engine as engine
import getml.data.roles as roles

# ----------------

engine.set_project("examples")

# ----------------
# Create a data frame from a JSON string

json_str = """{
    "names": ["patrick", "alex", "phil", "ulrike"],
    "column_01": [2.4, 3.0, 1.2, 1.4],
    "join_key": ["0", "1", "2", "3"],
    "time_stamp": ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"]
}"""

my_df = data.DataFrame(
    "MY DF",
    roles={
        "unused_string": ["names", "join_key", "time_stamp"],
        "unused_float": ["column_01"]}
).read_json(
    json_str
)

# ----------------

col1 = my_df["names"]

# ----------------

# col2 is a virtual column.
# The substring operation is not
# executed yet.
col2 = col1.substr(4, 3)

# This is where the engine executes
# the substring operation.
my_df.add(col2, "short_names", roles.categorical)

# ----------------
# If you do not explicitly set a role,
# the assigned role will be
# roles.unused_string.

# col3 is a virtual column.
# The operation is not
# executed yet.
col3 = "user-" + col1 + "-" + col2

# This is where the operation
# is executed.
my_df["new_names"] = col3
my_df.set_role("new_names", roles.categorical)
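
To make the lazy evaluation in the example above more concrete, here is a minimal pure-Python sketch. It is not getML's implementation: the class LazyStringColumn, the operator names "values" and "concat", and the evaluate() method are hypothetical. They only mirror the operator/operand1/operand2 shape of the constructor signature and leave out df_name and the engine entirely.

```python
# Hypothetical sketch of a lazily evaluated string column.
# It does not use getML; all names here are illustrative.
class LazyStringColumn:
    """Stores an operator and its operands; computes nothing until evaluate()."""

    def __init__(self, operator, operand1, operand2=None):
        self.operator = operator
        self.operand1 = operand1
        self.operand2 = operand2

    def __add__(self, other):
        # column + something: only record the operation.
        return LazyStringColumn("concat", self, other)

    def __radd__(self, other):
        # "literal" + column: only record the operation.
        return LazyStringColumn("concat", other, self)

    def length(self):
        """Number of rows, taken from the first column found in the tree."""
        if self.operator == "values":
            return len(self.operand1)
        for operand in (self.operand1, self.operand2):
            if isinstance(operand, LazyStringColumn):
                return operand.length()
        raise ValueError("expression contains no column")

    @staticmethod
    def _resolve(operand, n_rows):
        # Columns are evaluated recursively; plain strings are broadcast.
        if isinstance(operand, LazyStringColumn):
            return operand.evaluate()
        return [operand] * n_rows

    def evaluate(self):
        """Walk the expression tree and compute the actual strings."""
        if self.operator == "values":
            return list(self.operand1)
        n_rows = self.length()
        left = self._resolve(self.operand1, n_rows)
        right = self._resolve(self.operand2, n_rows)
        if self.operator == "concat":
            return [a + b for a, b in zip(left, right)]
        raise ValueError(f"unknown operator: {self.operator}")


names = LazyStringColumn("values", ["patrick", "alex"])

# Nothing is computed here; expr is just a tree of operations.
expr = "user-" + names + "-x"

# The work happens only when the result is requested.
result = expr.evaluate()
print(result)  # ['user-patrick-x', 'user-alex-x']
```

In this sketch, building expr is cheap no matter how many rows the column has, because the concatenation runs only inside evaluate(). In the getML example above, the step that triggers evaluation is adding the column to the data frame.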

Attributes Summary

length

The length of the column (number of rows in the data frame).

Methods Summary

as_num()

Transforms a categorical column to a numerical column.

as_ts([time_formats])

Transforms a categorical column to a time stamp.

contains(other)

Returns a boolean column indicating whether a string or column entry is contained in the corresponding entry of the other column.

count([alias])

COUNT aggregation.

count_distinct([alias])

COUNT DISTINCT aggregation.

is_null()

Determine whether the value is NULL.

substr(begin, length)

Return a substring for every element in the column.

to_numpy([sock])

Transforms the column to a numpy array.

update(condition, values)

Returns an updated version of this column.

Attributes Documentation

length

The length of the column (number of rows in the data frame).

Methods Documentation

as_num()

Transforms a categorical column to a numerical column.

as_ts(time_formats=None)

Transforms a categorical column to a time stamp.

Parameters

time_formats (str) – Formats to be used to parse the time stamps.

contains(other)

Returns a boolean column indicating whether a string or column entry is contained in the corresponding entry of the other column.

count(alias='new_column')

COUNT aggregation.

Parameters

alias (str) – Name for the new column.

count_distinct(alias='new_column')

COUNT DISTINCT aggregation.

Parameters

alias (str) – Name for the new column.

is_null()

Determine whether the value is NULL.

substr(begin, length)

Return a substring for every element in the column.

Parameters
  • begin (int) – First position of the original string.

  • length (int) – Length of the extracted string.
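
The roles of begin and length can be illustrated with a plain-Python sketch. The helper substr_sketch is hypothetical and runs without getML. It assumes that begin is zero-based, which the text above does not state, so check the indexing base against the engine before relying on it.

```python
def substr_sketch(values, begin, length):
    """Return the slice [begin, begin + length) of every string.

    Assumption: begin is zero-based. None entries are passed through.
    """
    return [
        None if value is None else value[begin:begin + length]
        for value in values
    ]


names = ["patrick", "alex", "phil", "ulrike"]

# Under the zero-based assumption, strings shorter than begin
# yield an empty string.
short_names = substr_sketch(names, 4, 3)
print(short_names)  # ['ick', '', '', 'ke']
```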

to_numpy(sock=None)

Transforms the column to a numpy array.

Parameters

sock (optional) – Socket connecting the Python API with the getML engine.

update(condition, values)

Returns an updated version of this column.

All entries for which the corresponding condition is True are updated using the corresponding entry in values.

Parameters
  • condition (Boolean column) – Condition according to which the update is done.

  • values – Values to update with.
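
The element-wise rule described for update can be sketched in plain Python. The helper update_sketch is hypothetical and works on ordinary lists rather than getML columns. It assumes that condition and values have the same number of rows as the column.

```python
def update_sketch(column, condition, values):
    """Return a new list: values[i] where condition[i] is True, else column[i]."""
    if not (len(column) == len(condition) == len(values)):
        raise ValueError("column, condition and values must have equal length")
    return [
        new if cond else old
        for old, cond, new in zip(column, condition, values)
    ]


names = ["patrick", "alex", "phil", "ulrike"]
is_short = [False, True, True, False]
replacements = ["n/a", "n/a", "n/a", "n/a"]

updated = update_sketch(names, is_short, replacements)
print(updated)  # ['patrick', 'n/a', 'n/a', 'ulrike']
print(names)    # the input list is left unchanged
```

Like the documented method, the sketch returns an updated copy and does not modify the original column.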