getml.data.columns

Handlers for 1-d arrays storing the data of an individual variable.

Like the DataFrame, the columns do not contain any actual data themselves; they are only handlers to objects within the getML engine. Each of these objects stores the data of a single variable in a one-dimensional array of a uniform type.

Columns are immutable and lazily evaluated.

  • Immutable means that there are no in-place operations on columns. Any operation that changes a column returns a new, changed column and leaves the original untouched.

  • Lazy evaluation means that operations are not executed until their results are required. This is reflected in the virtual columns: a virtual column is not computed until its result is actually needed.
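
The two properties above can be illustrated with a toy model in plain Python. This is only a sketch of the general idea and not getML's implementation: `ToyColumn`, `load` and `evaluate` are hypothetical names invented for this illustration, and the values live in Python rather than in the getML engine.

```python
from typing import Callable, List, Union

Operand = Union["ToyColumn", float]


class ToyColumn:
    """Hypothetical stand-in for a column handle (not part of getML)."""

    def __init__(self, compute: Callable[[], List[float]]):
        # Only a recipe for producing the values is stored, never the values.
        self._compute = compute

    def _combine(self, other: Operand, op) -> "ToyColumn":
        def compute() -> List[float]:
            left = self._compute()
            if isinstance(other, ToyColumn):
                right = other._compute()
            else:
                right = [other] * len(left)
            return [op(a, b) for a, b in zip(left, right)]

        # Immutability: a new object is returned, self is left unchanged.
        return ToyColumn(compute)

    def __add__(self, other: Operand) -> "ToyColumn":
        return self._combine(other, lambda a, b: a + b)

    def __rsub__(self, other: float) -> "ToyColumn":
        # Called for `scalar - column`, so the result is other - self.
        return self._combine(other, lambda a, b: b - a)

    def evaluate(self) -> List[float]:
        # Laziness: the recorded operations only run here.
        return self._compute()


evaluations = []  # records every time the underlying data is read


def load() -> List[float]:
    evaluations.append("load")
    return [2.5, 3.0, 1.0, 1.5]


col1 = ToyColumn(load)
col2 = 2.0 - col1                      # nothing is computed yet
reads_before = len(evaluations)        # 0

values2 = col2.evaluate()              # [-0.5, -1.0, 1.0, 0.5]

col3 = col1 + col2                     # a new column; col1 is unchanged
values3 = col3.evaluate()              # [2.0, 2.0, 2.0, 2.0]
```

In this sketch, building `col2` and `col3` only records what should be computed. The data is read when `evaluate()` is called, which plays the role that assigning a virtual column to a data frame plays in the getML example.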

Example:

import numpy as np

import getml.data as data
import getml.engine as engine
import getml.data.roles as roles

# ----------------

engine.set_project("examples")

# ----------------
# Create a data frame from a JSON string

json_str = """{
    "names": ["patrick", "alex", "phil", "ulrike"],
    "column_01": [2.4, 3.0, 1.2, 1.4],
    "join_key": ["0", "1", "2", "3"],
    "time_stamp": ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"]
}"""

my_df = data.DataFrame(
    "MY DF",
    roles={
        "unused_string": ["names", "join_key", "time_stamp"],
        "unused_float": ["column_01"]}
).read_json(
    json_str
)

# ----------------

col1 = my_df["column_01"]

# ----------------

# col2 is a virtual column.
# The operation is not executed yet.
col2 = 2.0 - col1

# This is when '2.0 - col1' is actually
# executed.
my_df["column_02"] = col2
my_df.set_role("column_02", roles.numerical)

# If you want to update column_01,
# you can't do that in-place.
# You need to replace it with a new column.
col1 = col1 + col2
my_df["column_01"] = col1
my_df.set_role("column_01", roles.numerical)

Classes

Aggregation(alias, col, agg_type)

Lazily evaluated aggregation over a column.

FloatColumn([name, role, num, df_name])

Handler for numerical data in the engine.

StringColumn([name, role, num, df_name])

Handle for categorical data that is kept in the getML engine.

VirtualBooleanColumn(df_name, operator, …)

Handle to a (lazily evaluated) virtual boolean column.

VirtualFloatColumn(df_name, operator, …)

Handle to a (lazily evaluated) virtual float column.

VirtualStringColumn(df_name, operator, …)

Handle to a (lazily evaluated) virtual string column.