VirtualStringColumn

class getml.data.columns.VirtualStringColumn(df_name, operator, operand1, operand2)

Handle to a (lazily evaluated) virtual string column.

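Virtual columns do not hold any data themselves. They describe an operation that is evaluated lazily, only when the result is needed (for example, when the column is added to a data frame).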
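The lazy evaluation described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not getml's implementation: the classes `ToyColumn` and `ToyVirtualColumn`, the helper `_as_values`, the operator names `"substr"` and `"concat"`, and the 0-based `begin` offset are all assumptions made for illustration. The constructor merely mirrors the `operator, operand1, operand2` shape of the signature above.

```python
class ToyColumn:
    """Hypothetical stand-in for a column whose values are already stored."""

    def __init__(self, values):
        self.values = list(values)

    def evaluate(self):
        return list(self.values)

    def substr(self, begin, length):
        # Nothing is computed here; the operation is only recorded.
        return ToyVirtualColumn("substr", self, (begin, length))

    def __add__(self, other):
        return ToyVirtualColumn("concat", self, other)

    def __radd__(self, other):
        # Called for expressions such as "user-" + column.
        return ToyVirtualColumn("concat", other, self)


class ToyVirtualColumn(ToyColumn):
    """Hypothetical lazy column: stores an operator and two operands."""

    def __init__(self, operator, operand1, operand2):
        self.operator = operator
        self.operand1 = operand1
        self.operand2 = operand2

    def evaluate(self):
        # The recorded operation is carried out only when evaluate() is called.
        if self.operator == "substr":
            begin, length = self.operand2
            # Assumption: begin is a 0-based offset.
            return [v[begin:begin + length] for v in self.operand1.evaluate()]
        if self.operator == "concat":
            left = _as_values(self.operand1)
            right = _as_values(self.operand2)
            n = len(left if left is not None else right)
            # Plain strings are repeated so they pair up with every row.
            left = left if left is not None else [self.operand1] * n
            right = right if right is not None else [self.operand2] * n
            return [a + b for a, b in zip(left, right)]
        raise ValueError(f"unknown operator: {self.operator}")


def _as_values(operand):
    """Return the row values of a column operand, or None for a plain string."""
    return operand.evaluate() if isinstance(operand, ToyColumn) else None


names = ToyColumn(["patrick", "alex", "phil", "ulrike"])
short = names.substr(0, 3)              # virtual, not evaluated yet
labels = "user-" + names + "-" + short  # still virtual
result = labels.evaluate()              # the whole expression runs here
```

In this sketch, `labels` is a small expression tree. No string is built until `evaluate()` is called, which is roughly the role that adding the column to a data frame plays in the example below.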

Examples:

import numpy as np

import getml.data as data
import getml.engine as engine
import getml.data.roles as roles

# ----------------

engine.set_project("examples")

# ----------------
# Create a data frame from a JSON string

json_str = """{
    "names": ["patrick", "alex", "phil", "ulrike"],
    "column_01": [2.4, 3.0, 1.2, 1.4],
    "join_key": ["0", "1", "2", "3"],
    "time_stamp": ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"]
}"""

my_df = data.DataFrame(
    "MY DF",
    roles={
        "unused_string": ["names", "join_key", "time_stamp"],
        "unused_float": ["column_01"]}
).read_json(
    json_str
)

# ----------------

col1 = my_df["names"]

# ----------------

# col2 is a virtual column.
# The substring operation is not
# executed yet.
col2 = col1.substr(4, 3)

# This is where the engine executes
# the substring operation.
my_df.add(col2, "short_names", roles.categorical)

# ----------------
# If you do not explicitly set a role,
# the assigned role will be
# roles.unused_string.

# col3 is a virtual column.
# The operation is not
# executed yet.
col3 = "user-" + col1 + "-" + col2

# This is where the operation
# is executed.
my_df["new_names"] = col3
my_df.set_role("new_names", roles.categorical)

Methods

as_num()

Transforms a categorical column to a numerical column.

as_ts([time_formats])

Transforms a categorical column to a time stamp.

contains(other)

Returns a boolean column indicating whether a string or column entry is contained in the corresponding entry of the other column.

count([alias])

COUNT aggregation.

count_distinct([alias])

COUNT DISTINCT aggregation.

is_null()

Determine whether the value is NULL.

substr(begin, length)

Return a substring for every element in the column.

to_numpy([sock])

Transforms the column to a numpy array.

update(condition, values)

Returns an updated version of this column.
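The summary above does not spell out how `condition` and `values` interact. One plausible reading, sketched here with plain Python lists, is that entries are replaced wherever the condition holds and kept otherwise, and that a new column is returned rather than the original being modified. This reading, the free function `update`, and the handling of a single string as `values` are assumptions for illustration, not getml's documented behavior.

```python
def update(column, condition, values):
    """Toy model: replace entries where condition is True, keep the rest."""
    if isinstance(values, str):
        # Assumption: a single string is applied to every row.
        values = [values] * len(column)
    return [new if flag else old
            for old, flag, new in zip(column, condition, values)]


names = ["patrick", "alex", "phil", "ulrike"]
is_short = [len(name) <= 4 for name in names]

# The original list is left untouched; a new list is returned.
updated = update(names, is_short, "short")
```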

Attributes

length

The length of the column (number of rows in the data frame).